# Q1: Probabilistic N-Gram Language Model (50 points)

**Objective:**

The objective of this question is to implement and experiment with an N-Gram language model using the Reuters dataset. The task involves building a probabilistic N-Gram model and creating a text generator based on the trained model with customizable parameters.

**Tasks:**


**1. Text Preprocessing (5 points):**
*   Implement the preprocess_text function to perform necessary text preprocessing. You may use NLTK or other relevant libraries for this task. (Already provided, no modification needed)


**2. Build Probabilistic N-Gram Model (15 points):**

*   Implement the build_probabilistic_ngram_model function to construct a probabilistic N-Gram model from the Reuters dataset.


**3. Generate Text with Customizable Parameters (15 points):**

*   Implement the generate_text function to generate text given a seed text and the probabilistic N-Gram model.
*   The function should have parameters for probability_threshold and min_length to customize the generation process.
*   Ensure that the generation stops when either the specified min_length is reached or the probabilities fall below probability_threshold.


**4. Experimentation and Parameter Tuning (5 points):**

*   Use Google Colab to experiment with different values of n_value, probability_threshold, and min_length. Find the parameters that result in the most coherent and meaningful generated text.
*   Provide a detailed analysis of the impact of changing each parameter on the generated text's quality.
*   Discuss any challenges faced during parameter tuning and propose potential improvements.


**5. Results and Conclusion (10 points):**

*   Summarize your findings and present the optimal parameter values for n_value, probability_threshold, and min_length.
*   Discuss the trade-offs and considerations when selecting these parameters.
*   Conclude with insights gained from the experimentation.

In [4]:
import nltk
from nltk.corpus import reuters
from nltk import ngrams
import random
import string
from collections import defaultdict

# Download the Reuters dataset if not already downloaded
nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

**Explanations of each Function:**


**1. preprocess_text(text):**
*   This function takes a text string as input and performs preprocessing steps on it. It lowercases the text, tokenizes it with the NLTK library's word_tokenize function, and then drops tokens that are single punctuation characters. The function returns a list of tokens.

**2. generate_ngrams(tokens, n):**
*   This function generates n-grams from a list of tokens. It takes the list of tokens and the value of n as input. It pads the tokens with None values on the left and right to handle the boundaries of the n-grams. Then, it iterates over the padded tokens and creates tuples of n consecutive tokens, representing the n-grams. The function returns a list of n-grams.
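As a small illustration of the padding described above, the sketch below applies the same idea to a three-token list. The function name `padded_ngrams` and the sample tokens are illustrative and not part of the assignment code.

```python
def padded_ngrams(tokens, n):
    """Return all n-grams of tokens, padding both ends with n - 1 None values."""
    padded = [None] * (n - 1) + list(tokens) + [None] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


example = padded_ngrams(["oil", "prices", "rose"], 2)
print(example)
# [(None, 'oil'), ('oil', 'prices'), ('prices', 'rose'), ('rose', None)]
```

With this padding scheme, a list of k tokens yields k + n - 1 n-grams, so every token appears once in the last position of an n-gram and the end of the text is marked by a trailing None.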

**3. build_probabilistic_ngram_model(corpus, n):**
*   This function builds a probabilistic n-gram model from a corpus of tokens. It takes the corpus (a list of tokenized texts) and the value of n as input. It creates a defaultdict to store the n-gram model. It iterates over the corpus and generates n-grams using the pre-defined "ngrams" function or "generate_ngrams" function. Then, it updates the n-gram model by storing the context (tuple of n-1 tokens) as a key and the following word as a value. The n-gram model is stored as a dictionary of lists. Finally, it converts the model to probabilities by counting the occurrences of each word and dividing by the total count. The function returns the probabilistic n-gram model.
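The count-and-divide step described above is the maximum-likelihood estimate P(word | context) = count(context, word) / count(context). A minimal sketch on a hand-made list of (context, word) pairs follows; it uses `collections.Counter` rather than the manual counting in the notebook, and the pairs and the name `to_probabilities` are made up for illustration.

```python
from collections import Counter, defaultdict


def to_probabilities(pairs):
    """Map each context to {word: relative frequency} from (context, word) pairs."""
    followers = defaultdict(Counter)
    for context, word in pairs:
        followers[context][word] += 1

    model = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        model[context] = {word: count / total for word, count in counts.items()}
    return model


pairs = [
    (("oil",), "prices"),
    (("oil",), "prices"),
    (("oil",), "output"),
    (("prices",), "rose"),
]
toy_model = to_probabilities(pairs)
print(toy_model[("oil",)])     # 'prices' gets 2/3, 'output' gets 1/3
print(toy_model[("prices",)])  # {'rose': 1.0}
```

The probabilities for each context sum to 1, and a word that never followed a context gets no entry at all, which is why generation has to stop when it reaches an unseen context.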

**4. generate_text(model, seed_text, n, probability_threshold=0.1, min_length=10):**
*   This function generates text based on a probabilistic n-gram model. It takes the model (built using "build_probabilistic_ngram_model"), a seed text, the value of n, a probability threshold, and a minimum length as input. It preprocesses the seed text using the preprocess_text function and initializes the generated text with the seed tokens. It then enters a loop that adds one word per iteration until the generated text contains min_length tokens. In each iteration, it takes the context from the most recent tokens of the generated text and checks if it is in the model; because the model is keyed by tuples of n-1 tokens, the context must be the last n-1 tokens to match. If the context is in the model, it retrieves the probabilities for the next word, keeps only the words whose probability is at least the threshold, and selects the next word at random, weighted by those probabilities, using the random.choices function. The next word is added to the generated text. If the context is not in the model or no words are left after filtering, the loop breaks. Finally, the generated text is converted back to a string and returned.
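The filtering and sampling step can be looked at in isolation on a hand-written distribution. `random.choices` treats its second argument as relative weights, so the surviving probabilities do not need to be renormalized after filtering. The distribution and the helper name `sample_next_word` below are illustrative assumptions.

```python
import random


def sample_next_word(probabilities, probability_threshold, rng=random):
    """Sample a word among those with probability >= threshold, or None if none remain."""
    kept = {w: p for w, p in probabilities.items() if p >= probability_threshold}
    if not kept:
        return None
    words = list(kept)
    weights = list(kept.values())
    return rng.choices(words, weights=weights)[0]


distribution = {"rose": 0.6, "fell": 0.3, "stagnated": 0.1}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_next_word(distribution, 0.2, rng) for _ in range(1000)]
print(set(samples))                              # only 'rose' and 'fell' pass the 0.2 cut
print(sample_next_word(distribution, 0.9, rng))  # None: no word reaches 0.9
```

A higher threshold makes generation more deterministic, but it also makes it more likely that no candidate survives, which ends generation early.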

In [15]:
# Function to preprocess text
def preprocess_text(text):
    # Tokenize the text (lowercasing)
    tokens = nltk.word_tokenize(text.lower())

    # Removing punctuation
    tokens = [token for token in tokens if token not in string.punctuation]

    return tokens

# Function to generate n-grams
def generate_ngrams(tokens, n):
    # Pad the tokens with None values on the left and right
    padded_tokens = [None] * (n - 1) + tokens + [None] * (n - 1)

    # Generate n-grams (the list is not named "ngrams" to avoid shadowing the nltk import)
    result = []
    for i in range(len(padded_tokens) - n + 1):
        ngram = tuple(padded_tokens[i:i + n])
        result.append(ngram)

    return result

# Function to build a probabilistic n-gram model
def build_probabilistic_ngram_model(corpus, n):
    # Create a defaultdict to store the n-gram model
    ngram_model = defaultdict(list)

    # Iterate over the corpus
    for tokens in corpus:
        # Create n-grams
        # grams = list(ngrams(tokens, n, pad_left=True, pad_right=True)) # Pre-defined Function
        grams = generate_ngrams(tokens, n) # Self-defined Function

        # Update the n-gram model
        for gram in grams:
            context = tuple(gram[:-1])
            word = gram[-1]
            ngram_model[context].append(word)

    # Convert the model to probabilities
    for context, words in ngram_model.items():
        word_counts = defaultdict(int)
        total_count = len(words)
        for word in words:
            word_counts[word] += 1
        probabilities = {}
        for word, count in word_counts.items():
            probabilities[word] = count / total_count
        ngram_model[context] = probabilities
    return ngram_model

def generate_text(model, seed_text, n, probability_threshold=0.1, min_length=10):
    # Preprocess the seed text
    seed_tokens = preprocess_text(seed_text)

    # Initialize the generated text with a copy of the seed tokens
    generated_text = list(seed_tokens)

    # Generate the remaining text until min_length tokens are reached
    while len(generated_text) < min_length:
        # The model is keyed by the previous n - 1 tokens, so the context
        # must have that length (the empty tuple for a unigram model)
        context = tuple(generated_text[-(n - 1):]) if n > 1 else ()

        # Check if the context is in the model
        if context in model:
            # Get the probabilities for the next word
            probabilities = model[context]

            # Filter the probabilities based on the threshold
            filtered_probs = {word: prob for word, prob in probabilities.items() if prob >= probability_threshold}

            # Check if there are any words left after filtering
            if filtered_probs:
                # Select the next word based on the probabilities
                next_word = random.choices(list(filtered_probs.keys()), list(filtered_probs.values()))[0]

                # None is the end-of-text padding token: stop instead of appending it
                if next_word is None:
                    break

                # Add the next word to the generated text
                generated_text.append(next_word)
            else:
                # If no words are left, break the loop
                # print("No words are left")
                break
        else:
            # If the context is not in the model, break the loop
            # print("Context is not in the model")
            break

    # Convert the generated text back to a string
    generated_text = ' '.join(generated_text)

    return generated_text

In [16]:
# Load the Reuters dataset
corpus = [reuters.raw(file_id) for file_id in reuters.fileids()]

# Preprocess the entire corpus
preprocessed_corpus = [preprocess_text(text) for text in corpus]

# Choose an n for the n-gram model
n_value = 3  # You may change this value

# Build the probabilistic n-gram model
probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

In [21]:
# Test the text generator with constant value of n
seed_text = "Inflation is"
Result = ' '.join(preprocess_text(seed_text))

# Number of iterations
words_count = 10

for i in range(words_count):
    # The seed holds n_value - 1 tokens, so min_length=n_value adds one word per call
    generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.02, min_length=n_value)
    generated_words = generated_text.split()
    print(f"Generated Text: {generated_text}")
    seed_text = " ".join(generated_words[-2:])
    Result += (" " + " ".join(generated_words[-1:]))

print(f"Result is: {Result}")

Generated Text: inflation is likely
Generated Text: is likely to
Generated Text: likely to fall
Generated Text: to fall to
Generated Text: fall to about
Generated Text: to about three
Generated Text: about three months
Generated Text: three months it
Generated Text: months it wants
Generated Text: it wants to
Result is: inflation is likely to fall to about three months it wants to


In [None]:
# for n_value in range (2, 10):
#   print("n is: ", n_value)
#   # Build the probabilistic n-gram model
#   probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)
#   generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.02, min_length=10)
#   print(f"Generated Text: {generated_text}")
#   seed_text = generated_text

# Q2: Sentiment Analysis with Naive Bayes Classifier (50 points)

**Objective:**

You are tasked with implementing a Naive Bayes classifier for sentiment analysis. The provided code is incomplete, and your goal is to complete the missing parts. Additionally, you should train the classifier on a small dataset and analyze its performance.

**Tasks:**

1. **Complete the Code (35 points)**: Fill in the missing parts in the provided Python code for the Naive Bayes classifier, namely the `train` and `classify` methods. Pay special attention to how the `get_features` function is used to extract features.

2. **Train and Test**: Train the Naive Bayes classifier on the training data and test it on a separate test set. Evaluate the accuracy of the classifier.

3. **Analysis (15 points)**: Discuss the results. Identify any misclassifications and try to understand why the classifier may fail in those cases. Provide examples of sentences that were not predicted correctly and explain possible reasons.
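For reference, a standard formulation of the Naive Bayes decision rule is given below. The add-one (Laplace) smoothing term is an assumption, since the assignment does not prescribe a smoothing scheme. Here $C$ is the set of classes, $w_1, \dots, w_m$ are the features of the document being classified, $V$ is the training vocabulary, and $\mathrm{count}(w, c)$ is the number of times feature $w$ occurs in training documents of class $c$.

```latex
\hat{c} = \arg\max_{c \in C} \left[ \log P(c) + \sum_{i=1}^{m} \log P(w_i \mid c) \right],
\qquad
P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long reviews, and the added 1 keeps a single unseen feature from driving a class probability to zero.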


In [None]:
import random
import math
import string
from collections import defaultdict

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.corpus import movie_reviews
import nltk

# Download NLTK resources
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')

In [None]:
def get_features(tokens):
    # Remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word.lower() not in stop_words]

    # Perform stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]

    return tokens

In [None]:
class NaiveBayesClassifier:
    def __init__(self, classes):
        self.classes = classes
        self.class_probs = defaultdict(float)
        self.feature_probs = defaultdict(lambda: defaultdict(float))

    def train(self, training_data):
        # Implement training here
        # You should use get_features function to extract useful tokens from
        # dataset and use them to train the classifier.
        pass

    def classify(self, features):
        # Implement classification here
        pass
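One possible way to fill in `train` and `classify` is sketched below as a separate, self-contained class, so that it runs without NLTK. It assumes multinomial Naive Bayes with add-one (Laplace) smoothing, which the assignment does not prescribe. The class name `SketchNaiveBayes`, the `feature_fn` parameter (standing in for `get_features`), and the toy reviews are all illustrative.

```python
import math
from collections import Counter, defaultdict


class SketchNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative sketch)."""

    def __init__(self, classes, feature_fn=lambda tokens: tokens):
        self.classes = list(classes)
        self.feature_fn = feature_fn  # hypothetical stand-in for get_features
        self.class_log_priors = {}
        self.feature_counts = defaultdict(Counter)
        self.total_counts = Counter()
        self.vocabulary = set()

    def train(self, training_data):
        # Assumes every class occurs at least once in training_data.
        doc_counts = Counter()
        for tokens, label in training_data:
            doc_counts[label] += 1
            for feature in self.feature_fn(tokens):
                self.feature_counts[label][feature] += 1
                self.total_counts[label] += 1
                self.vocabulary.add(feature)
        n_docs = sum(doc_counts.values())
        for label in self.classes:
            self.class_log_priors[label] = math.log(doc_counts[label] / n_docs)

    def classify(self, features):
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label in self.classes:
            score = self.class_log_priors[label]
            denominator = self.total_counts[label] + vocab_size
            for feature in features:
                if feature not in self.vocabulary:
                    continue  # skip features never seen during training
                count = self.feature_counts[label][feature]
                score += math.log((count + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


toy_train = [
    (["great", "fun", "film"], "pos"),
    (["great", "acting"], "pos"),
    (["boring", "plot"], "neg"),
    (["boring", "dull", "film"], "neg"),
]
clf = SketchNaiveBayes({"pos", "neg"})
clf.train(toy_train)
print(clf.classify(["great", "film"]))  # pos
print(clf.classify(["dull", "plot"]))   # neg
```

As in the notebook, `train` receives raw token lists and extracts features itself, while `classify` expects features that have already been extracted, which matches how `calculate_accuracy` calls it.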

In [None]:
# Load the movie reviews dataset from NLTK
data = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the dataset for randomness
random.shuffle(data)

# Split the dataset into training and testing sets
split_ratio = 0.8
split_index = int(len(data) * split_ratio)
train_set = data[:split_index]
test_set = data[split_index:]

# Train the Naive Bayes classifier
classes = set(sentiment for _, sentiment in train_set)
classifier = NaiveBayesClassifier(classes)
classifier.train(train_set)

def calculate_accuracy(dataset, dataset_type):
    # Count how many examples in the given dataset the classifier labels correctly
    correct_predictions = 0
    for example in dataset:
        tokens, true_sentiment = example
        features = get_features(tokens)
        predicted_sentiment = classifier.classify(features)
        if predicted_sentiment == true_sentiment:
            correct_predictions += 1

    accuracy = correct_predictions / len(dataset)
    print(f"{dataset_type} Accuracy: {accuracy}")

calculate_accuracy(train_set, 'Train')
calculate_accuracy(test_set, 'Test')
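For the analysis task it helps to keep the misclassified examples rather than only counting them. The sketch below shows one way to do that. The function `evaluate`, the toy `keyword_classifier` (a stand-in for `classifier.classify`), and the four test examples are hypothetical and only illustrate the bookkeeping.

```python
def evaluate(dataset, classify_fn):
    """Return (accuracy, misclassified) for a list of (features, true_label) pairs."""
    misclassified = []
    for features, true_label in dataset:
        predicted = classify_fn(features)
        if predicted != true_label:
            misclassified.append((features, true_label, predicted))
    accuracy = 1 - len(misclassified) / len(dataset)
    return accuracy, misclassified


def keyword_classifier(features):
    """Toy stand-in for classifier.classify: 'pos' if the word 'good' is present."""
    return "pos" if "good" in features else "neg"


toy_test = [
    (["good", "film"], "pos"),
    (["not", "good"], "neg"),   # negation: bag-of-words features lose word order
    (["dull", "plot"], "neg"),
    (["brilliant"], "pos"),     # positive word the toy classifier does not know
]
accuracy, errors = evaluate(toy_test, keyword_classifier)
print(accuracy)  # 0.5
for features, true_label, predicted in errors:
    print(features, "true:", true_label, "predicted:", predicted)
```

Inspecting the collected errors one by one often points to recurring causes such as negation, sarcasm, or words that did not appear in the training set.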

# Submission Instructions:


1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help understand the implemented logic.

3. Clearly present the results of your parameter tuning in the notebook.

4. Provide a brief summary of your findings and insights in the conclusion section.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Experiment with various seed texts to showcase the diversity of generated text.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.