# Q1: Probabilistic N-Gram Language Model (50 points)

**Objective:**

The objective of this question is to implement and experiment with an N-Gram language model using the Reuters dataset. The task involves building a probabilistic N-Gram model and creating a text generator based on the trained model with customizable parameters.

**Tasks:**


**1. Text Preprocessing (5 points):**
*   Implement the preprocess_text function to perform necessary text preprocessing. You may use NLTK or other relevant libraries for this task. (Already provided, no modification needed)


**2. Build Probabilistic N-Gram Model (15 points):**

*   Implement the build_probabilistic_ngram_model function to construct a probabilistic N-Gram model from the Reuters dataset.


**3. Generate Text with Customizable Parameters (15 points):**

*   Implement the generate_text function to generate text given a seed text and the probabilistic N-Gram model.
*   The function should have parameters for probability_threshold and min_length to customize the generation process.
*   Ensure that the generation stops when either the specified min_length is reached or the probabilities fall below probability_threshold.


**4. Experimentation and Parameter Tuning (5 points):**

*   Use Google Colab to experiment with different values of n_value, probability_threshold, and min_length. Find the optimal parameters that result in coherent and meaningful generated text.
*   Provide a detailed analysis of the impact of changing each parameter on the generated text's quality.
*   Discuss any challenges faced during parameter tuning and propose potential improvements.


**5. Results and Conclusion (10 points):**

*   Summarize your findings and present the optimal parameter values for n_value, probability_threshold, and min_length.
*   Discuss the trade-offs and considerations when selecting these parameters.
*   Conclude with insights gained from the experimentation.
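
For reference, a count-based model of the kind described in task 2 usually stores relative-frequency (maximum-likelihood) estimates. The notation below is introduced here for illustration and is not part of the assignment text: $C(\cdot)$ is the number of times a word sequence occurs in the training corpus, and $w'$ ranges over every word seen after the given context.

$$
P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1}, \dots, w_{i-1}, w_i)}{\sum_{w'} C(w_{i-n+1}, \dots, w_{i-1}, w')}
$$

With n = 3, each probability is the count of a trigram divided by the total count of all trigrams that share its first two words.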

In [24]:
import nltk
from nltk.corpus import reuters
from nltk import ngrams
from nltk.tokenize import word_tokenize
import random
import string
from collections import defaultdict

# Download the Reuters dataset if not already downloaded
nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [62]:
## source : https://www.datacamp.com/tutorial/case-conversion-python
## source2 : https://www.shiksha.com/online-courses/articles/how-to-remove-punctuation-from-python-string/#:~:text=We%20can%20use%20the%20translate,character%20or%20delete%20them%20altogether.
## source3 : https://www.holisticseo.digital/python-seo/nltk/tokenization

# Function to preprocess text
def preprocess_text(text):
    # Fill in: Implement text preprocessing steps like lowercasing, removing punctuation, etc.
    # You may use NLTK or other libraries for this.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text


## source4 : https://www.projectpro.io/recipes/find-ngrams-from-text
# Function to build a probabilistic n-gram model
def build_probabilistic_ngram_model(corpus, n):
    # Initialize n-gram model dictionary
    ngram_model = defaultdict(lambda: defaultdict(lambda: 0))

    for txt in corpus:
        words = word_tokenize(txt)
        padded_words = (['<s>'] * (n - 1)) + words
        document_ngrams = list(ngrams(padded_words, n))

        for ngram in document_ngrams:
            prefix = tuple(ngram[:-1])
            suffix = ngram[-1]
            ngram_model[prefix][suffix] += 1

    # Convert counts to probabilities
    for prefix, following_words in ngram_model.items():
        total_count = sum(following_words.values())
        for word in following_words:
            ngram_model[prefix][word] /= total_count

    return ngram_model

## source5 : https://www.youtube.com/watch?v=pEYfD5aVrRI
# Function to generate text using the probabilistic n-gram model with stop criteria
def generate_text(model, seed_text, n, probability_threshold=0.1, min_length=10):
    # Normalize the seed the same way as the corpus so its words match the model's keys
    generated_text = word_tokenize(preprocess_text(seed_text))
    while len(generated_text) < min_length:
        # The prefix is the last n-1 words, left-padded with <s> for short seeds
        # (it is the empty tuple for a unigram model)
        padded = ['<s>'] * (n - 1) + generated_text
        prefix = tuple(padded[len(padded) - (n - 1):])
        next_word_probs = model.get(prefix, {})
        # Keep only continuations that meet the probability threshold
        filtered_words = {word: prob for word, prob in next_word_probs.items() if prob >= probability_threshold}
        # Stop on an unseen prefix or when every continuation falls below the threshold
        if not filtered_words:
            break
        next_word = random.choices(list(filtered_words.keys()), weights=list(filtered_words.values()))[0]
        generated_text.append(next_word)
    return ' '.join(generated_text)
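
The effect of probability_threshold on a single generation step can be seen in isolation. The sketch below is only an illustration: the distribution is made up, and the helper names `filter_candidates` and `sample_next` are hypothetical and do not appear in the notebook. Note that `random.choices` treats the surviving probabilities as relative weights, so they are effectively renormalized after filtering.

```python
import random

# Hypothetical next-word distribution for one prefix (values are made up).
next_word_probs = {"the": 0.55, "expected": 0.30, "rising": 0.12, "zebra": 0.03}

def filter_candidates(probs, probability_threshold):
    """Keep only the words whose probability is at least the threshold."""
    return {w: p for w, p in probs.items() if p >= probability_threshold}

def sample_next(probs, probability_threshold, rng):
    """Sample one surviving word, or return None when nothing survives."""
    candidates = filter_candidates(probs, probability_threshold)
    if not candidates:
        return None  # the caller should stop generating here
    words = list(candidates)
    weights = list(candidates.values())
    return rng.choices(words, weights=weights)[0]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
loose = filter_candidates(next_word_probs, 0.02)   # all four words survive
strict = filter_candidates(next_word_probs, 0.25)  # only "the" and "expected" survive
nothing = sample_next(next_word_probs, 0.9, rng)   # no candidate survives
```

A low threshold keeps rare continuations in play, while a high one can leave no candidates at all, which is why an empty candidate set has to be handled as a stop condition.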


In [52]:
# Load the Reuters dataset
corpus = [reuters.raw(file_id) for file_id in reuters.fileids()]


# Preprocess the entire corpus
preprocessed_corpus = [preprocess_text(text) for text in corpus]

# Choose an n for the n-gram model
n_value = 3  # You may change this value

# Build the probabilistic n-gram model
probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

In [73]:
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.02, min_length=20)
print(f"Generated Text: {generated_text}")

Generated Text: inflation is the first quarter earnings for the first five months to end january 31 shr three cts vs loss
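
For the experimentation task, a small grid sweep over n_value, probability_threshold, and min_length is one way to compare settings side by side. The following is a minimal sketch under stated assumptions: it uses a three-sentence toy corpus and whitespace tokenization instead of the preprocessed Reuters texts, the functions `build_model` and `generate` are simplified hypothetical stand-ins for the notebook's functions, and the grid values are arbitrary.

```python
import itertools
import random
from collections import Counter, defaultdict

# Tiny stand-in corpus; the notebook itself uses the preprocessed Reuters texts.
toy_corpus = [
    "inflation is expected to rise in the first quarter",
    "inflation is expected to fall in the second quarter",
    "the bank said inflation is a concern in the first quarter",
]

def build_model(corpus, n):
    """Relative-frequency n-gram model: prefix tuple -> {next word: probability}."""
    counts = defaultdict(Counter)
    for text in corpus:
        words = ["<s>"] * (n - 1) + text.split()
        for i in range(len(words) - n + 1):
            prefix = tuple(words[i:i + n - 1])
            counts[prefix][words[i + n - 1]] += 1
    return {
        prefix: {w: c / sum(nexts.values()) for w, c in nexts.items()}
        for prefix, nexts in counts.items()
    }

def generate(model, seed, n, probability_threshold, min_length, rng):
    """Extend the seed until min_length words or until no candidate survives."""
    words = seed.lower().split()
    while len(words) < min_length:
        padded = ["<s>"] * (n - 1) + words
        prefix = tuple(padded[len(padded) - (n - 1):])
        candidates = {w: p for w, p in model.get(prefix, {}).items()
                      if p >= probability_threshold}
        if not candidates:
            break
        words.append(rng.choices(list(candidates), weights=list(candidates.values()))[0])
    return " ".join(words)

# Sweep every combination and record the generated text and its length.
results = []
for n, threshold, min_length in itertools.product([2, 3], [0.1, 0.6], [6, 10]):
    model = build_model(toy_corpus, n)
    rng = random.Random(0)  # same seed for every setting, to keep runs comparable
    text = generate(model, "inflation is", n, threshold, min_length, rng)
    results.append((n, threshold, min_length, len(text.split()), text))

for row in results:
    print(row)
```

Recording the output length next to each setting makes it easy to spot combinations where generation stops early because the threshold removed every candidate.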


# Q2: Sentiment Analysis with Naive Bayes Classifier (50 points)

**Objective:**

You are tasked with implementing a Naive Bayes classifier for sentiment analysis. The provided code is incomplete, and your goal is to complete the missing parts. Additionally, you should train the classifier on a small dataset and analyze its performance.

**Tasks:**

1. **Complete the Code (35 points)**: Fill in the missing parts in the provided Python code for the Naive Bayes classifier. Pay special attention to the `get_features` function.

2. **Train and Test**: Train the Naive Bayes classifier on the training data and test it on a separate test set. Evaluate the accuracy of the classifier.

3. **Analysis (15 points)**: Discuss the results. Identify any misclassifications and try to understand why the classifier may fail in those cases. Provide examples of sentences that were not predicted correctly and explain possible reasons.


In [74]:
import random
import math
import string
from collections import defaultdict

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.corpus import movie_reviews
import nltk

# Download NLTK resources
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [75]:
def get_features(tokens):
    # Remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word.lower() not in stop_words]

    # Perform stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]

    return tokens

In [78]:
## source : https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

class NaiveBayesClassifier:
    def __init__(self, classes):
        self.classes = classes
        self.class_probs = defaultdict(float)
        self.feature_probs = defaultdict(lambda: defaultdict(float))

    def train(self, training_data):
        # Implement training here
        # You should use get_features function to extract useful tokens from
        # dataset and use them to train the classifier.
        total_count = defaultdict(int)
        class_counts = defaultdict(int)
        for features, label in training_data:
            for feature in features:
                self.feature_probs[feature][label] += 1
                total_count[label] += 1
            class_counts[label] += 1

        # Class priors: fraction of training examples that belong to each class
        total_samples = sum(class_counts.values())
        for label in self.classes:
            self.class_probs[label] = class_counts[label] / total_samples

        # Feature likelihoods: each feature's count in a class divided by the total feature count of that class
        for feature, label_counts in self.feature_probs.items():
            for label in self.classes:
                self.feature_probs[feature][label] /= total_count[label]


    def classify(self, features):
        # Implement classification here
        class_scores = {label: math.log(self.class_probs[label]) for label in self.classes}

        # Calculate log likelihoods for each feature
        for feature in features:
            if feature in self.feature_probs:
                for label in self.classes:
                    # Floor zero probabilities at a tiny constant so log() is defined (a crude substitute for Laplace smoothing)
                    smoothed_prob = self.feature_probs[feature][label] if self.feature_probs[feature][label] > 0 else 1e-10
                    class_scores[label] += math.log(smoothed_prob)

        # Return the class label with the highest score
        return max(class_scores, key=class_scores.get)
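
Add-one (Laplace) smoothing, combined with applying the same preprocessing at training and classification time, can be sketched as follows. This is an illustrative variant on a made-up four-document dataset, not the code that produced the accuracies reported in this notebook. The class name `SmoothedNaiveBayes` and the helper `normalize` are hypothetical, and `normalize` is a simplified stand-in for `get_features` that avoids the NLTK downloads.

```python
import math
from collections import Counter, defaultdict

def normalize(tokens):
    """Stand-in for get_features: lowercase and drop non-alphabetic tokens."""
    return [t.lower() for t in tokens if t.isalpha()]

class SmoothedNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def train(self, training_data):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()
        for tokens, label in training_data:
            features = normalize(tokens)         # same preprocessing as in classify
            self.word_counts[label].update(features)
            self.doc_counts[label] += 1
            self.vocab.update(features)

    def log_likelihood(self, word, label):
        # (count of word in class + 1) / (total words in class + vocabulary size)
        total = sum(self.word_counts[label].values())
        return math.log((self.word_counts[label][word] + 1) / (total + len(self.vocab)))

    def classify(self, tokens):
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / n_docs)  # log prior
            for word in normalize(tokens):
                if word in self.vocab:           # ignore words never seen in training
                    score += self.log_likelihood(word, label)
            scores[label] = score
        return max(scores, key=scores.get)

toy_train = [
    (["A", "great", "and", "moving", "film", "!"], "pos"),
    (["Great", "acting", ",", "great", "story"], "pos"),
    (["A", "dull", "and", "boring", "film", "."], "neg"),
    (["Boring", "story", ",", "terrible", "acting"], "neg"),
]
clf = SmoothedNaiveBayes()
clf.train(toy_train)
prediction = clf.classify(["great", "film"])
```

Because every word receives a pseudo-count of one in every class, a word that was seen in only one class lowers the other class's score by a bounded amount instead of by the near-infinite penalty a fixed floor such as 1e-10 imposes.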

In [82]:
# Load the movie reviews dataset from NLTK
data = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the dataset for randomness
random.shuffle(data)

# Split the dataset into training and testing sets
split_ratio = 0.8
split_index = int(len(data) * split_ratio)
train_set = data[:split_index]
test_set = data[split_index:]

# Train the Naive Bayes classifier
classes = set(sentiment for _, sentiment in train_set)
classifier = NaiveBayesClassifier(classes)
classifier.train(train_set)

def calculate_accuracy(dataset, dataset_type):
    # Count how many examples in the given dataset the classifier labels correctly
    correct_predictions = 0
    for example in dataset:
        tokens, true_sentiment = example
        features = get_features(tokens)
        predicted_sentiment = classifier.classify(features)
        if predicted_sentiment == true_sentiment:
            correct_predictions += 1

    accuracy = correct_predictions / len(dataset)
    print(f"{dataset_type} Accuracy: {accuracy}")

calculate_accuracy(train_set, 'Train')
calculate_accuracy(test_set, 'Test')

Train Accuracy: 0.963125
Test Accuracy: 0.6825
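
For the analysis task, one way to surface misclassified reviews is to collect them together with their true and predicted labels. The sketch below uses a hypothetical keyword-based classifier (`keyword_classify`) and a made-up test set purely to stay self-contained. In the notebook, `classifier.classify(get_features(tokens))` and `test_set` would take their place.

```python
from collections import Counter

def collect_errors(dataset, classify):
    """Return (tokens, true_label, predicted_label) for every wrong prediction."""
    errors = []
    for tokens, true_label in dataset:
        predicted = classify(tokens)
        if predicted != true_label:
            errors.append((tokens, true_label, predicted))
    return errors

# Hypothetical stand-in classifier: it counts cue words and so ignores negation.
POSITIVE = {"good", "great", "fun"}
NEGATIVE = {"bad", "boring", "dull"}

def keyword_classify(tokens):
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "pos" if score >= 0 else "neg"

toy_test = [
    (["a", "great", "fun", "film"], "pos"),
    (["boring", "and", "dull"], "neg"),
    (["not", "good", "at", "all"], "neg"),   # negation flips the sentiment
    (["never", "boring"], "pos"),            # same problem in the other direction
]

errors = collect_errors(toy_test, keyword_classify)
# Tally errors by (true label, predicted label) to see which direction dominates.
confusion = Counter((true, pred) for _, true, pred in errors)
for tokens, true, pred in errors:
    print(f"true={true} predicted={pred}: {' '.join(tokens)}")
```

Bag-of-words models of this kind score each word independently, so negated phrases are a common source of errors worth checking for in the misclassified reviews.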


# Submission Instructions:


1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help understand the implemented logic.

3. Clearly present the results of your parameter tuning in the notebook.

4. Provide a brief summary of your findings and insights in the conclusion section.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Experiment with various seed texts to showcase the diversity of generated text.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.