# Q1: Probabilistic N-Gram Language Model (50 points)



**Objective:**

The objective of this question is to implement and experiment with an N-Gram language model using the Reuters dataset. The task involves building a probabilistic N-Gram model and creating a text generator based on the trained model with customizable parameters.

**Tasks:**


**1. Text Preprocessing (5 points):**
*   Implement the preprocess_text function to perform necessary text preprocessing. You may use NLTK or other relevant libraries for this task. (Already provided, no modification needed)


**2. Build Probabilistic N-Gram Model (15 points):**

*   Implement the build_probabilistic_ngram_model function to construct a probabilistic N-Gram model from the Reuters dataset.


**3. Generate Text with Customizable Parameters (15 points):**

*   Implement the generate_text function to generate text given a seed text and the probabilistic N-Gram model.
*   The function should have parameters for probability_threshold and min_length to customize the generation process.
*   Ensure that the generation stops when either the specified min_length is reached or the probabilities fall below probability_threshold.


**4. Experimentation and Parameter Tuning (5 points):**

*   Use Google Colab to experiment with different values of n_value, probability_threshold, and min_length. Find the optimal parameters that result in coherent and meaningful generated text.
*   Provide a detailed analysis of the impact of changing each parameter on the generated text's quality.
*   Discuss any challenges faced during parameter tuning and propose potential improvements.


**5. Results and Conclusion (10 points):**

*   Summarize your findings and present the optimal parameter values for n_value, probability_threshold, and min_length.
*   Discuss the trade-offs and considerations when selecting these parameters.
*   Conclude with insights gained from the experimentation.

In [1]:
import nltk
from nltk.corpus import reuters
from nltk import ngrams
import random
import string
from collections import defaultdict

# Download the Reuters dataset if not already downloaded
nltk.download('reuters')
nltk.download('punkt')

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
# Function to preprocess text
def preprocess_text(text):
    # Fill in: Implement text preprocessing steps like lowercasing, removing punctuation, etc.
    # You may use NLTK or other libraries for this.

    text = text.lower() # Lowercasing

    text = text.translate(str.maketrans("", "", string.punctuation)) # Removing punctuation

    return text

# Function to build a probabilistic n-gram model
def build_probabilistic_ngram_model(corpus, n):
    # Fill in: Implement code to build an n-gram model from the given corpus.
    # You may use NLTK's word_tokenize function.

    # Build every order from 1-gram to n-gram so generation can back off
    ngram_models = [] # ngram_models[i-1] holds the i-gram model
    for i in range(1, n+1): # Iterate over the orders 1..n
        ngram_model = defaultdict(list) # Maps an (i-1)-word context to the list of observed next words

        for text in corpus: # Iterate over the corpus
            words = nltk.word_tokenize(text) # Tokenize the preprocessed text into words

            ngrams_list = list(ngrams(words, i)) # Create i-grams for the current order

            if (i == 1):
                for word in words:
                    ngram_model[tuple()].append(word) # Unigrams: every word is stored under the empty context
            else:
                for ngram in ngrams_list:
                    # Use the first i-1 words as the key and the last word as the value
                    key = tuple(ngram[:-1]) # (i-1)-word context
                    value = ngram[-1] # Last word
                    ngram_model[key].append(value) # Duplicates are kept, so list frequency encodes probability

        ngram_models.append(ngram_model)

    return ngram_models

# Function to generate text using the probabilistic n-gram model with stop criteria
def generate_text(model, seed_text, n, probability_threshold=0.1, min_length=10):
    # Generate text from a seed with the highest-order model, backing off to
    # lower-order models when the current context was never seen in training.
    words = seed_text.lower().split() # Start from the lowercased seed words

    while True:
        candidates = None
        # Back off from the (n-1)-word context down to the empty (unigram) context
        for order in range(n, 0, -1):
            context = tuple(words[-(order-1):]) if order > 1 else tuple()
            if len(context) == order - 1 and context in model[order-1]:
                candidates = model[order-1][context]
                break

        if not candidates: # Nothing to sample from (e.g. an empty corpus)
            break

        next_word = random.choice(candidates) # Weighted, because repeated words stay in the list

        # Past min_length, stop once the sampled word's probability is below the threshold
        if len(words) > min_length:
            if candidates.count(next_word) / len(candidates) < probability_threshold:
                break

        words.append(next_word) # Add the next word to the generated text

    return " ".join(words)


In [None]:
corpus = ["Hello, World! I Love Python", "I am Farzan Rahmani.", "Ali is stupid!", "Hello, World! I love python"]
corpus

['Hello, World! I Love Python',
 'I am Farzan Rahmani.',
 'Ali is stupid!',
 'Hello, World! I love python']

Test the preprocess_text function

In [None]:
preprocessed_corpus = [preprocess_text(text) for text in corpus]
preprocessed_corpus

['hello world i love python',
 'i am farzan rahmani',
 'ali is stupid',
 'hello world i love python']

Test the build_probabilistic_ngram_model function

In [None]:
probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, 2)
probabilistic_ngram_model

[defaultdict(list,
             {(): ['hello',
               'world',
               'i',
               'love',
               'python',
               'i',
               'am',
               'farzan',
               'rahmani',
               'ali',
               'is',
               'stupid',
               'hello',
               'world',
               'i',
               'love',
               'python']}),
 defaultdict(list,
             {('hello',): ['world', 'world'],
              ('world',): ['i', 'i'],
              ('i',): ['love', 'am', 'love'],
              ('love',): ['python', 'python'],
              ('am',): ['farzan'],
              ('farzan',): ['rahmani'],
              ('ali',): ['is'],
              ('is',): ['stupid']})]

In [None]:
probabilistic_ngram_model[1] # bigram

defaultdict(list,
            {('hello',): ['world', 'world'],
             ('world',): ['i', 'i'],
             ('i',): ['love', 'am', 'love'],
             ('love',): ['python', 'python'],
             ('am',): ['farzan'],
             ('farzan',): ['rahmani'],
             ('ali',): ['is'],
             ('is',): ['stupid']})

Test how duplicate words are stored

In [None]:
probabilistic_ngram_model[1][tuple(["hello"])]

['world', 'world']
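Because every observed successor is kept in the list, duplicates included, `random.choice` already samples in proportion to the maximum-likelihood estimate count(context, word) / count(context). The sketch below makes those probabilities explicit. The helper name `successor_probabilities` is hypothetical and not part of the assignment code; the entries are copied from the toy bigram model printed above.

```python
from collections import Counter

def successor_probabilities(successors):
    """Turn a list of observed next words (with duplicates) into P(word | context)."""
    counts = Counter(successors)
    total = len(successors)
    return {word: count / total for word, count in counts.items()}

# Toy entries copied from the bigram model printed above
toy_bigram = {
    ('hello',): ['world', 'world'],
    ('i',): ['love', 'am', 'love'],
}

probs_after_i = successor_probabilities(toy_bigram[('i',)])
print(probs_after_i)  # 'love' gets 2/3 of the mass, 'am' gets 1/3
```

The same ratio is what `generate_text` compares against `probability_threshold` when it calls `.count(next_word) / len(...)`.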

In [11]:
# Load the Reuters dataset
corpus = [reuters.raw(file_id) for file_id in reuters.fileids()]

# Preprocess the entire corpus
preprocessed_corpus = [preprocess_text(text) for text in corpus]

# Choose an n for the n-gram model
n_value = 3  # You may change this value

# Build the probabilistic n-gram model
probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

In [None]:
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.02, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation is not good the report was released before the


In [None]:
# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.02, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation is cut these analysts argue prices will not


In [4]:
# Test the text generator
seed_text = "Prices are"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.02, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: prices are expected to keep the


In [14]:
# Test the text generator
seed_text = "effective april"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.02, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: effective april 15 record april 10 to shareholders in progressive to 52 pct in


4. Experimentation and Parameter Tuning (5 points):

*   Use Google Colab to experiment with different values of n_value, probability_threshold, and min_length. Find the optimal parameters that result in coherent and meaningful generated text.
*   Provide a detailed analysis of the impact of changing each parameter on the generated text's quality.
*   Discuss any challenges faced during parameter tuning and propose potential improvements.

5. Results and Conclusion (10 points):

*   Summarize your findings and present the optimal parameter values for n_value, probability_threshold, and min_length.
*   Discuss the trade-offs and considerations when selecting these parameters.
*   Conclude with insights gained from the experimentation.

In [15]:
n_value = 2

probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

seed_text = "Inflation" # or this
# seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.02, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation with the options in january


In [16]:
# Already run above
# n_value = 2
# probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

seed_text = "Shareholders"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.02, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: shareholders and production at 53 billion dlr


In [None]:
n_value = 1

probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

# seed_text = "Inflation" # or this
seed_text = "Inflation is"

generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.02, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation is property team growers march


In [None]:
# Already run above
# n_value = 1  # You may change this value
# probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

seed_text = "Inflation"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.0002, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation vs postings pct report dividend this has in us shr of cts


In [None]:
# Already run above
# n_value = 1  # You may change this value
# probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

seed_text = "Inflation"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.2, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation from to transaction car 1773


In [None]:
n_value = 2

probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

seed_text = "Inflation"
# seed_text = "Inflation is"  # or this

generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.2, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation was effective april 1 may


In [None]:
# Already run above
# n_value = 2
# probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

seed_text = "Inflation"
# seed_text = "Inflation is" # or this

generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.002, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation monetary course of 905 mln note 1986


In [None]:
n_value = 4

probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

seed_text = "Inflation is not"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.2, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation is not as worried as it was 18 months ago


In [None]:
# Already run above
# n_value = 4  # You may change this value
# probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

seed_text = "Inflation is not"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.6, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation is not as good as that of


In [None]:
# Already run above
# n_value = 4  # You may change this value
# probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

seed_text = "Inflation is not"

generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.002, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation is not such a constructive factor as this time last year but the bank has agreed to purchase the santa barbara biltmore hotel in california from marriott corp for undisclosed terms upon completion of the proposed sale resulted from an offer made by ltgroupe bruxelles lambert sa at the same rate as last year the charges carry a maximum 50000 dlr fine on each count gulf said the source the source predicted national amusements controlled by investor sumner redstone would need half a year to american telephone and telegraph cos decision to postpone the implementation of the governments philosophy of keeping expenditure levels flat most analysts said they are giving us the benefit of the united states


In [None]:
n_value = 5

probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

seed_text = "Inflation is not as"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.3, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation is not as bad as what volcker fed chairman has said lately industrial production growth is along the lines of what the fed wants the energy products component of ppi rose 40 pct in february after a


In [None]:
# Already run above
# n_value = 5
# probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

seed_text = "Inflation is not as"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.5, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation is not as bad as what volcker fed chairman has said lately industrial production growth is along the lines of what the fed wants the energy products component of ppi rose 40 pct in february after a


In [None]:
# Already run above
# n_value = 5
# probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

seed_text = "Inflation is not as"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.1, min_length=5) # probability_threshold=0.02 --> did not stop
print(f"Generated Text: {generated_text}")

Generated Text: inflation is not as high as people had feared and the narrowing us trade balance in nominal terms samuel kahan chief financial economist with kleinwort benson government securities said kahan said recent government reports have shown strength in the economy during the first quarter down from 4994 mln a year earlier it said


In [None]:
n_value = 3
probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.1, min_length=5)
print(f"Generated Text: {generated_text}")

Generated Text: inflation is for the country may have


In [None]:
# Already run above
# n_value = 3
# probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.1, min_length=10)
print(f"Generated Text: {generated_text}")

Generated Text: inflation is likely to build a case filed by the six months to


In [None]:
# Already run above
# n_value = 3
# probabilistic_ngram_model = build_probabilistic_ngram_model(preprocessed_corpus, n_value)

# Test the text generator
seed_text = "Inflation is"
generated_text = generate_text(probabilistic_ngram_model, seed_text, n_value, probability_threshold=0.1, min_length=15)
print(f"Generated Text: {generated_text}")

Generated Text: inflation is for shipment in june 1986 oxford would pay differentials for different types of people


<div dir="rtl">
    As the runs above show, there are several hyperparameters, and changing them changes the model's behavior.
    <br/>
    seed_text: different seeds produce different and more varied texts.
    <br/>
    n_value: increasing it makes the model produce more repetitive, more overfitted output, while decreasing it makes the generated words more random and less related to each other.
    <br/>
    probability_threshold: raising this threshold makes the model stricter, so the generated text is shorter and more meaningful; lowering it makes the generated text more random and longer, and the output becomes simpler and less relevant.
    <br/>
    min_length: increasing the minimum length naturally lengthens the generated text, so more words appear in the output.
    <br/>
    Overall conclusion:
    <br/>
    Balancing coherence and diversity of the text is very important. It is a trade-off between generating meaningful text and avoiding overfitting to the training data.
    The optimal parameters depend on the specific use case. Experimentation is the key to finding the right balance. We should regularly evaluate and tune the parameters based on the desired text-generation characteristics.
    By systematically experimenting with and adjusting these parameters, we can tune the N-Gram language model to meet specific requirements, whether the emphasis is on diversity, coherence, or a balance between the two.
</div>

<div dir="rtl">
    A more detailed explanation follows, which I wrote with the help of ChatGPT.
    Note that the values given are based on a manual analysis of the generated texts above;
    we had no precise metric such as accuracy to judge by.
</div>
<div>

a. n_value:

Experimentation: We tried n_value from small values (1 or 2) up to 5 to observe the impact on text generation.

Analysis: Smaller n_value gives more diverse but more random text (with n_value = 1 the words are essentially unrelated), while larger values give more coherent text that increasingly reproduces passages from the training data (with n_value = 5 the outputs for thresholds 0.3 and 0.5 were identical).

Challenges and Improvements: Smaller values generate less meaningful text, and larger values lead to overfitting. We experimented with a range of values to observe the balance between diversity and coherence.
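One way to make the "coherent but repetitive" effect measurable is to count how many contexts are followed by only one distinct word: when that share is high, generation mostly replays training text. The sketch below is an illustration under assumptions. The helper `single_successor_share` is hypothetical and the three-sentence corpus is made up, so the numbers show the trend only and are not Reuters results.

```python
from collections import defaultdict

def single_successor_share(token_lists, n):
    """Share of (n-1)-word contexts that are followed by exactly one distinct word."""
    successors = defaultdict(set)
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            successors[context].add(tokens[i + n - 1])
    if not successors:
        return 0.0
    single = sum(1 for words in successors.values() if len(words) == 1)
    return single / len(successors)

# Made-up corpus for illustration only
toy_corpus = [
    "prices are expected to rise".split(),
    "rates are likely to fall".split(),
    "prices are likely to rise".split(),
]

for n in (2, 3, 4, 5):
    print(n, round(single_successor_share(toy_corpus, n), 2))
```

On this toy corpus every 4-word context has a single continuation at n = 5, which is the situation in which the generator can only copy its training sentences.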

b. probability_threshold:

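Experimentation: We varied probability_threshold from very low values (0.0002) up to 0.6.

Analysis: In this implementation the threshold does not change how words are sampled; it is a stopping criterion that applies once the text is longer than min_length. Lower thresholds let generation continue through unlikely words, producing longer and more random text (with n_value = 4 and a threshold of 0.002 the output ran to more than a hundred words). Higher thresholds stop generation sooner, so the text is shorter and contains fewer unlikely continuations.

Challenges and Improvements: Finding the right balance is crucial. Very low thresholds may produce long, incoherent text or fail to stop at all (a threshold of 0.02 with n_value = 5 did not terminate), while very high thresholds may cut the text off shortly after min_length. We experimented with different values to strike a balance.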
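In the `generate_text` function above, the threshold is only tested once the text holds more than `min_length` words. The sketch below isolates that stop rule. The helper `words_kept` is hypothetical and the per-step probabilities are invented for illustration.

```python
def words_kept(step_probabilities, seed_length, probability_threshold, min_length):
    """Return how many sampled words are appended before the stop rule fires.

    step_probabilities[k] is the conditional probability of the k-th sampled word.
    The threshold is only checked once the text already has more than
    min_length words, mirroring the check in generate_text.
    """
    length = seed_length
    kept = 0
    for p in step_probabilities:
        if length > min_length and p < probability_threshold:
            break
        length += 1
        kept += 1
    return kept

# Invented probabilities for illustration only
probs = [0.5, 0.01, 0.3, 0.05, 0.4, 0.2, 0.001, 0.6]

print(words_kept(probs, seed_length=2, probability_threshold=0.1, min_length=5))
print(words_kept(probs, seed_length=2, probability_threshold=0.0005, min_length=5))
```

Low-probability words sampled before `min_length` is exceeded are always kept, which is why a larger `min_length` weakens the effect of the threshold.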

c. min_length:

Experimentation: We tested min_length values from 5 to 15.

Analysis: Smaller values result in shorter, potentially less meaningful text, while larger values force longer output with more words.

Challenges and Improvements: Striking a balance is essential. Too small a min_length might lead to incomplete or nonsensical sentences, while too large a value forces the model to keep generating even when every sampled continuation is unlikely, because the threshold is not checked until min_length is exceeded. We experimented with different values to find a suitable minimum length.

Results and Conclusion:

a. Optimal Parameter Values:

n_value: 3 or 4 gave the best balance between diversity and coherence in our experiments.

probability_threshold: 0.1, chosen for the desired level of randomness.

min_length: The optimal value depends on the application; a balanced value is around 10.

b. Trade-offs and Considerations:

n_value: Higher values enhance coherence but might lead to overfitting. Lower values increase diversity but may result in less meaningful text.

probability_threshold: Low values let generation run longer and admit unlikely words, which might result in incoherent text. High values stop generation earlier, giving shorter text with fewer unlikely continuations.

min_length: Smaller values might produce incomplete text, while larger values delay the threshold check and force longer output.

c. Insights and Conclusion:

Insights: Balancing coherence and diversity is crucial. It's a trade-off between generating meaningful text and avoiding overfitting to the training data.

Conclusion: The optimal parameters depend on the specific use case. Experimentation is key to finding the right balance. Regularly evaluate and adjust parameters based on the desired text generation characteristics.

By systematically experimenting and adjusting these parameters, you can tailor the N-Gram language model to meet specific requirements, whether emphasizing diversity, coherence, or a balance between the two.
</div>

# Q2: Sentiment Analysis with Naive Bayes Classifier (50 points)

**Objective:**

You are tasked with implementing a Naive Bayes classifier for sentiment analysis. The provided code is incomplete, and your goal is to complete the missing parts. Additionally, you should train the classifier on a small dataset and analyze its performance.

**Tasks:**

1. **Complete the Code (35 points)**: Fill in the missing parts in the provided Python code for the Naive Bayes classifier. Pay special attention to the `extract_features` function.

2. **Train and Test**: Train the Naive Bayes classifier on the training data and test it on a separate test set. Evaluate the accuracy of the classifier.

3. **Analysis (15 points)**: Discuss the results. Identify any misclassifications and try to understand why the classifier may fail in those cases. Provide examples of sentences that were not predicted correctly and explain possible reasons.


In [None]:
import random
import math
import string
from collections import defaultdict

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.corpus import movie_reviews
import nltk

# Download NLTK resources
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
def get_features(tokens):
    # Remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word.lower() not in stop_words]

    # Perform stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]

    return tokens

In [None]:
class NaiveBayesClassifier:
    def __init__(self, classes):
        self.classes = classes
        self.class_probs = defaultdict(float) # P(class=c)
        self.feature_probs = defaultdict(lambda: defaultdict(float)) # feature -> {class: count(w, c) / count(w)} after training

    def train(self, training_data):
        # Implement training here
        # You should use get_features function to extract useful tokens from
        # dataset and use them to train the classifier.

        total_documents = len(training_data) # Total number of documents

        for doc, label in training_data: # Iterate over the training data
            self.class_probs[label] += 1 # Update class probabilities

            features = get_features(doc) # Extract features from the document

            for feature in features: # Update feature probabilities given the class
                self.feature_probs[feature][label] += 1

        for label in self.classes: # Normalize probabilities
            self.class_probs[label] /= total_documents # Calculate class probabilities

        for feature in self.feature_probs: # Normalize the counts of each feature
            total_count = sum(self.feature_probs[feature].values()) # Total count of the feature over all classes
            for label in self.classes:
                self.feature_probs[feature][label] /= total_count # count(w, c) / count(w); Laplace smoothing could be added here

        # pass

    def classify(self, features):
        # Implement classification here

        max_prob = float('-inf') # Initialize maximum probability
        predicted_class = None # Initialize predicted class

        for label in self.classes: # Iterate over the classes
            # Calculate the log probability for each class
            log_prob = math.log1p(self.class_probs[label]) # math.log1p(x) = log(1 + x): finite at x = 0, but not the true log-probability log(x)

            for feature in features:
                # Unseen features fall back to 1e-10
                log_prob += math.log1p(self.feature_probs[feature].get(label, 1e-10)) # log(1 + x) again, so every feature adds a value in [0, log 2]

            # Update predicted class if a higher probability is found
            if log_prob > max_prob:
                max_prob = log_prob
                predicted_class = label

        return predicted_class

        # pass
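The comment in `train` mentions Laplace smoothing. As a point of comparison, here is a minimal sketch of a multinomial Naive Bayes with add-one smoothing and true log-probabilities, `log P(c) + sum of log P(w | c)`. The class name `SmoothedNaiveBayes` and the toy documents are hypothetical, and this is not the classifier that produced the accuracies reported in this notebook.

```python
import math
from collections import Counter, defaultdict

class SmoothedNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()             # documents per class
        self.word_counts = defaultdict(Counter)   # class -> word -> count
        self.vocabulary = set()

    def train(self, labeled_docs):
        for tokens, label in labeled_docs:
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def log_score(self, tokens, label):
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)  # log P(c)
        class_total = sum(self.word_counts[label].values())
        vocab_size = len(self.vocabulary)
        for token in tokens:
            # Add-one smoothing keeps unseen (word, class) pairs above zero
            count = self.word_counts[label][token]
            score += math.log((count + 1) / (class_total + vocab_size))
        return score

    def classify(self, tokens):
        return max(self.class_counts, key=lambda label: self.log_score(tokens, label))

# Toy documents for illustration only
toy_train = [
    (["great", "fun", "great"], "pos"),
    (["good", "fun"], "pos"),
    (["bore", "bad"], "neg"),
    (["bad", "dull", "bore"], "neg"),
]
toy_classifier = SmoothedNaiveBayes()
toy_classifier.train(toy_train)
print(toy_classifier.classify(["great", "fun"]))
print(toy_classifier.classify(["dull", "bad", "unseen"]))
```

Here the per-class denominator is the number of tokens in that class plus the vocabulary size, so the smoothed values for each class sum to one over the vocabulary.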

In [None]:
# try different seeds
# random.seed(222)
# random.seed(400)
# random.seed(512)
random.seed(600) # best seed found

# Load the movie reviews dataset from NLTK
data = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(data)

# Shuffle the dataset for randomness
random.shuffle(data) # use seed to see difference

# Split the dataset into training and testing sets
split_ratio = 0.8
split_index = int(len(data) * split_ratio)
train_set = data[:split_index]
test_set = data[split_index:]

# Train the Naive Bayes classifier
classes = set(sentiment for _, sentiment in train_set)
classifier = NaiveBayesClassifier(classes)
classifier.train(train_set)

def calculate_accuracy(dataset, dataset_type):
    # Count correct predictions on the given dataset
    correct_predictions = 0
    for example in dataset:
        tokens, true_sentiment = example
        features = get_features(tokens)
        predicted_sentiment = classifier.classify(features)
        if predicted_sentiment == true_sentiment:
            correct_predictions += 1

    accuracy = correct_predictions / len(dataset)
    print(f"{dataset_type} Accuracy: {accuracy}")

calculate_accuracy(train_set, 'Train')
calculate_accuracy(test_set, 'Test')

Train Accuracy: 0.955625
Test Accuracy: 0.735
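The classifier fits its training data (0.955625) much better than the held-out reviews (0.735), but accuracy alone does not show which class the errors fall into. A minimal sketch of a confusion count follows. The helpers `confusion_counts` and `recall_per_class` are hypothetical, and the label pairs are made up rather than taken from the notebook's predictions.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) label pairs."""
    return Counter(zip(true_labels, predicted_labels))

def recall_per_class(counts):
    """Recall for each true class: correct predictions / all examples of that class."""
    totals = Counter()
    for (true_label, _), count in counts.items():
        totals[true_label] += count
    return {label: counts[(label, label)] / total for label, total in totals.items()}

# Made-up labels for illustration only
true_labels = ["pos", "pos", "neg", "neg", "neg", "pos"]
predicted = ["pos", "neg", "pos", "neg", "pos", "pos"]

counts = confusion_counts(true_labels, predicted)
print(counts[("neg", "pos")])  # negative reviews predicted as positive
print(recall_per_class(counts))
```

Feeding the real `(true_sentiment, predicted_sentiment)` pairs into such a count would show whether the errors are concentrated in one class.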


3. Analysis (15 points): Discuss the results. Identify any misclassifications and try to understand why the classifier may fail in those cases. Provide examples of sentences that were not predicted correctly and explain possible reasons.
<div dir="rtl">
    We use the cells below to find the examples that were misclassified.
</div>

In [None]:
def find_misclassified(dataset, dataset_type):
    print(f"{dataset_type}:")
    print("len of dataset", len(dataset))
    wrong_predictions = 0
    for example in dataset:
        tokens, true_sentiment = example
        features = get_features(tokens)
        predicted_sentiment = classifier.classify(features)
        if predicted_sentiment != true_sentiment:
            wrong_predictions += 1
            print(tokens) # Raw token list; use " ".join(tokens) for a readable sentence
            print(true_sentiment)
            print(predicted_sentiment)

    print(f"num of wrong predictions: {wrong_predictions}")
    print("================================================")

find_misclassified(train_set[:100], 'Train') # for simplicity we analyze only the first 100 samples of train and test
find_misclassified(test_set[:100], 'Test')

Train:
len of dataset 100
['lucas', 'was', 'wise', 'to', 'start', 'his', 'star', 'wars', 'trilogy', 'with', 'episode', '4', ':', 'episode', '1', 'is', 'a', 'boring', ',', 'empty', 'spectacle', 'that', 'features', 'some', 'nice', 'special', 'effects', '.', 'after', 'the', 'familiar', "'", 'a', 'long', 'time', 'ago', '.', '.', '.', '.', "'", 'opening', ',', 'the', 'film', 'starts', 'with', 'the', 'opening', 'yellow', 'crawl', 'that', 'features', 'in', 'every', 'star', 'wars', 'movie', 'and', 'computer', 'game', '.', 'the', 'plot', 'is', 'that', 'the', 'trade', 'confederation', 'are', 'blocking', 'off', 'supplies', 'to', 'the', 'peaceful', 'planet', 'of', 'naboo', ',', 'ruled', 'by', 'queen', 'amidala', '(', 'portman', ')', 'jedi', 'knights', 'qui', '-', 'gon', '(', 'neeson', ')', 'and', 'obi', 'wan', '(', 'mcgregor', ')', 'are', 'sent', 'to', 'negotiate', 'a', 'deal', 'with', 'the', 'confederation', 'to', 'stop', 'the', 'blockade', '.', 'however', ',', 'this', 'simple', 'blockade', 'is',

In [None]:
def calculate_accuracy(dataset, dataset_type):
    # Count correct predictions and collect the misclassified examples
    correct_predictions = 0
    misclassifications = []

    for example in dataset:
        tokens, true_sentiment = example
        features = get_features(tokens)
        predicted_sentiment = classifier.classify(features)
        if predicted_sentiment == true_sentiment:
            correct_predictions += 1
        else:
            misclassifications.append((tokens, true_sentiment, predicted_sentiment))

    accuracy = correct_predictions / len(dataset)
    print(f"{dataset_type} Accuracy: {accuracy}")

    # Print misclassifications
    print(f"\nMisclassifications in {dataset_type} set:")
    for tokens, true_sentiment, predicted_sentiment in misclassifications:
        print(f"\nTrue Sentiment: {true_sentiment}")
        print(f"Predicted Sentiment: {predicted_sentiment}")
        print(f"Sentence: {' '.join(tokens)}\n")

    print(f"\nNumber of Misclassifications in {dataset_type} set:", len(misclassifications))

In [None]:
# Calculate accuracy on the test set (the train set is evaluated in the next cell)
calculate_accuracy(test_set, 'Test')

Test Accuracy: 0.735

Misclassifications in Test set:

True Sentiment: neg
Predicted Sentiment: pos
Sentence: synopsis : back - up quarterback moxxon becomes starting quarterback midway through his senior year of high school , even though he ' d rather read " slaughterhouse five " than the playbook . evil football coach kilmer throws away moxxon ' s book , though , while the evil team physician injects painkillers into the players . in the meantime , moxxon ' s kid brother forms a cult , and a bubblegum - blond cheerleader smears whip cream on herself to seduce the new star quarterback . comments : since i usually review horror and science fiction films , i feel a little out of my league discussing this teen football movie . ( pun intended . thank you ! ) varsity blues was produced by mtv , and it really shows . several extended scenes allow for a continual soundtrack of mediocre pop songs meant to appeal to the adolescent male audience this crap was intended for . the teenagers all ha

In [None]:
calculate_accuracy(train_set, 'Train')

Train Accuracy: 0.955625

Misclassifications in Train set:

True Sentiment: neg
Predicted Sentiment: pos
Sentence: lucas was wise to start his star wars trilogy with episode 4 : episode 1 is a boring , empty spectacle that features some nice special effects . after the familiar ' a long time ago . . . . ' opening , the film starts with the opening yellow crawl that features in every star wars movie and computer game . the plot is that the trade confederation are blocking off supplies to the peaceful planet of naboo , ruled by queen amidala ( portman ) jedi knights qui - gon ( neeson ) and obi wan ( mcgregor ) are sent to negotiate a deal with the confederation to stop the blockade . however , this simple blockade is not all it seems , and the jedi knights soon have to deal with many more dangers , including facing the evil darth maul ( ray park . ) they also meet the future darth vadar , anakin skywalker ( jake lloyd ) star wars is largely a failure in all the major areas of filmmaking

<div dir="rtl">
    As the results above show, this Naive Bayes model is overfitting: accuracy on the test set is roughly 20 percentage points lower
    than on the train set (0.956 on train versus 0.735 on test). We can reduce this gap by enlarging the training set, because
    the model's ability to generalize improves as the number of training samples grows.
    One likely cause is noise in the dataset, which leads the model to pick up irrelevant patterns that do not generalize.
    Also, as you can see, changing the seed and reshuffling the dataset changes the accuracies. This comes from the simplifying
    assumptions the model makes, such as treating words as independent of their position and of each other's meaning, ignoring
    negation, and being sensitive to word frequency.
    To improve the results we can use feature engineering and handle negation words such as not.
    Enlarging the dataset should also give better results and reduce the overfitting.
    Finally, stronger models such as LSTM and Transformers can achieve better results.
    In summary:
    Naive Bayes is suitable for simple sentiment analysis tasks, but it may struggle with more nuanced language and complex contexts.
    Depending on the application, a Naive Bayes classifier may be sufficient for quick sentiment analysis, but
    more advanced models may be necessary for higher accuracy in complex contexts.
    Understanding the limitations of the Naive Bayes classifier and the reasons for misclassification is very important for refining or selecting a model.
    More advanced alternatives should be considered in cases where context and nuanced language play an important role.
</div>
<div dir="rtl">
    A more detailed explanation follows, which I wrote with the help of ChatGPT:
</div>
<div>

    a. Misclassifications:

    Observation: The classifier may misclassify sentences due to the inherent simplifying assumptions of the Naive Bayes algorithm.

    Possible Reasons:
    1. Neglected Context: Naive Bayes assumes independence between features, which might neglect contextual dependencies in language.
    2. Sensitivity to Word Frequency: High sensitivity to word frequencies may lead to misclassifications when a sentence contains uncommon words or context-specific terms.
    3. Handling Negations: Difficulty in handling negations or nuanced language, as the model treats individual words independently.

    b. Examples of Misclassified Sentences:
    Example 1 (Synthetic Example):
    True Sentiment: Positive
    Predicted Sentiment: Negative
    Sentence: "Although the movie had some flaws, overall, it was enjoyable."
    Explanation: The classifier might struggle with phrases like "some flaws" and may not capture the overall positive sentiment.

    Example 2 (Synthetic Example):
    True Sentiment: Negative
    Predicted Sentiment: Positive
    Sentence: "The acting was terrible, but the plot was intriguing."
    Explanation: The classifier may focus on the positive term "intriguing" and neglect the negative term "terrible."

    Example 3 (Real Example):
    True Sentiment: Positive
    Predicted Sentiment: Negative
    Sentence: "harmless , silly and fun comedy about dim - witted wrestling fans gordie and sean ( david arquette and scott caan ) who idolize current world championship wrestling heavyweight champion jimmy king ( oliver platt ) . when king is screwed out of his title by a corrupt promoter ( joe pantoliano ) , gordie and sean take it upon themselves to find their fallen hero and restore his glory . my biggest fear about ready to rumble was dispatched early on , as the filmmakers are quick to show that wresting is indeed choreographed ( but not fake , mind you ) . the hook of the movie is that gordie and sean are just too stupid to realize that . arquette and caan are suitably over the top with their performances , which is exactly what a movie like this requires , and oliver platt ( one of my favorite actors ) is a riot as the drunken washed - up ex - champion . many have scoffed at the idea that platt should be playing a heavyweight champion wrestler with an unbeaten record , but for me it just added to the " silly factor " of the film , thereby increasing my enjoyment of it . one casting complaint however : rose mcgowan as a sexy dancer ? please . . . if rose mcgowan is sexy then i ' m marilyn manson . given the current state of the actual wcw , if oliver platt were appearing as jimmy king right now on wcw programming , he ' d be the most popular guy they have . on a similar note , the " plot line " of the wresting portions of the film are more entertaining than anything the wcw writers have been able to come up with in the last two years . although one does have to ask . . . why would any wrestling promoter fire the head wrestler of a company who is both unbeaten and extremely popular with the fans ? director brian robbins ( you ' ll remember him as eric from tv ' s " head of the class " ) just knows how to make good dumb movies . this movie fits in nicely with his previous efforts good burger and varsity blues . 
and screenwriter steven brill ( the epic mighty ducks trilogy , late last night ) manages to keep things both sophomoric and clever at the same time , with almost all the jokes of the film getting a laugh out of me . the only exceptions to that were : 1 ) a scene involving a van full of singing nuns and 2 ) any scene involving the old woman wrestling fan . those moments made me cringe and / or groan . as an added bonus though , the audience is treated to outtakes from the film as the final credits roll . [ pg - 13 ]"
    Explanation: The review contains many words that usually signal a negative opinion (e.g. "silly", "stupid", "dumb", "cringe", "groan", "complaint"). Because the classifier weighs each word independently, it may fail to capture the overall positive sentiment.

    Example 4 (Real Example):
    True Sentiment: Negative
    Predicted Sentiment: Positive
    Sentence: " " the red violin " is a cold , sterile feature that leaves you uninvolved and detached . it ' s a movie that seems almost clinical , as it traces the 300 - plus - years history of the legendary musical instrument of the title . opening in the 17th century , the story shows how violin - maker nicolo bussotti created the instrument as a gift for his unborn son . but when tragedy strikes , the violin becomes the personification of its maker ' s grief . from there the violin comes into the hands of an orphaned child prodigy at an austrian monastery . again , tragedy strikes as the child is struck down at the moment of his triumph . we follow the violin through the centuries as it finds a home in england and in mao ' s communist china before being discovered by expert charles morritz ( samuel l . jackson ) , who mounts a painstaking investigation to prove its authenticity . the violin becomes morritz ' s obsession , just as it is for all those who converge on a montreal auction house to bid on it . morritz , however , is the only one who knows the secret of the instrument and can understand and appreciate its creator ' s intention . " the red violin " could have been a touching , inspirational story , as soaring as a beethoven symphony . however director francois girard fails to make any emotional connection with the viewer . here is a story that could have made use of various camera angles and lighting to heighten its impact . girard , for some unknown reason , uses mostly master shots , keeping his camera - and thus us - at a distance . we get no feel for the miracle that is the violin . it ' s resonance , its purity of sound are not emphasized enough to make an impression . nor are any of the performances memorable . it ' s as if girard wanted all his actors to play second fiddle to his violin . " the red violin " promises much , but delivers little . it is dull at times , a bit pretentious and a might murky . 
the movie ' s music soars over its story and performers . and that is its only saving grace. "
    Explanation: The classifier may focus on the positive words in the sentence ' " the red violin " could have been a touching , inspirational story , as soaring as a beethoven symphony .' and neglect negative terms such as "cold , sterile".

    c. Possible Improvements:
    1. Consider Contextual Information: Explore more advanced models (e.g., LSTM, Transformers) that capture contextual information and dependencies between words.
    2. Feature Engineering: Experiment with more sophisticated features, such as n-grams, to capture word combinations and improve model performance.
    3. Handling Negations: Implement strategies to handle negations, possibly by considering word pairs or incorporating sentiment lexicons.

    d. Reflection on Naive Bayes:
    Strengths: Simplicity, efficiency, and effectiveness with large feature spaces.
    Weaknesses: Assumes independence, struggles with context, and may misclassify in the presence of nuanced language.
    Trade-offs: A trade-off between simplicity and capturing complex language structures.

    e. Conclusion:
    Naive Bayes Suitability: Suitable for straightforward sentiment analysis tasks but may struggle with more nuanced language and complex contexts.
    Consideration for Application: Depending on the application, the Naive Bayes classifier might be sufficient for quick sentiment analysis, but
    more advanced models might be necessary for higher accuracy in sophisticated contexts.
    
    Understanding the limitations of the Naive Bayes classifier and the reasons behind misclassifications is crucial for refining the model or
    considering more advanced alternatives in cases where context and nuanced language play a significant role.
</div>
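
The negation-handling and n-gram ideas listed under "Possible Improvements" can be sketched roughly as follows. This is only an illustrative sketch, not the feature extractor used in this notebook: the names `NEGATION_WORDS`, `CLAUSE_END`, `mark_negated_tokens`, and `unigram_bigram_features` are hypothetical, and the word lists are assumptions chosen for the example. The idea is to tag every token that follows a negation word (up to the next clause boundary) with a `_NEG` suffix, so that "good" and "not good" produce different features, and to add bigram features on top of the unigrams.

```python
# Hypothetical word lists, chosen for illustration only.
NEGATION_WORDS = {"not", "no", "never", "cannot", "n't"}
CLAUSE_END = {".", ",", ";", ":", "!", "?", "but"}


def mark_negated_tokens(tokens):
    """Append '_NEG' to tokens that follow a negation word, until a clause boundary."""
    marked = []
    negated = False
    for token in tokens:
        if token in CLAUSE_END:
            negated = False          # a clause boundary ends the negation scope
            marked.append(token)
        elif token in NEGATION_WORDS:
            negated = True           # start marking the tokens that follow
            marked.append(token)
        else:
            marked.append(token + "_NEG" if negated else token)
    return marked


def unigram_bigram_features(tokens):
    """Build a boolean feature dict from negation-marked unigrams and bigrams."""
    marked = mark_negated_tokens(tokens)
    features = {f"contains({tok})": True for tok in marked}
    for first, second in zip(marked, marked[1:]):
        features[f"bigram({first} {second})"] = True
    return features


example = ["the", "plot", "was", "not", "good", ",", "but", "fun"]
print(mark_negated_tokens(example))
```

A feature dictionary of this shape could be passed to the classifier in place of the existing features, at the cost of a much larger and sparser feature space, which can itself worsen overfitting on a small dataset.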

<div dir="rtl">
    A detailed explanation of the causes of overfitting follows:
    
</div>

<div dir="rtl">
   
    Causes of overfitting:

     Overfitting: The classifier may have overfit the training data, meaning it has learned to classify the training examples very well, including noise or irrelevant patterns specific to the training set that do not generalize to new data.

     Limited Data: The training set may be too small or not representative enough of the overall data distribution. As a result, the classifier may not learn enough about the underlying patterns in the data.

     Feature Engineering: The features extracted from the text may not carry enough meaningful information for accurate classification. The current feature representation may be insufficient to separate positive from negative sentiment effectively.

     Hyperparameter Tuning: The classifier's performance can be sensitive to hyperparameters such as the choice of smoothing technique (e.g. Laplace smoothing). Suboptimal hyperparameter values may reduce generalization performance.

     Model Complexity: The Naive Bayes classifier assumes independence between features, which may not hold in practice. If the sentiment classification task involves complex dependencies between words or phrases, a simple Naive Bayes model may not capture these relationships adequately.

     To address these issues, we can try the following approaches:

     Cross-validation: Split the dataset into several folds and train and test the classifier on different folds to get a better estimate of its performance. This helps detect overfitting and assess how well the classifier generalizes.

     Feature selection/extraction: Try different feature representations (such as TF-IDF or word embeddings) or additional text preprocessing techniques to extract more informative features from the text.

     Hyperparameter tuning: Run a grid search or random search to find good values for hyperparameters such as the probability threshold and the smoothing parameters.

     Regularization: Introduce regularization techniques (e.g. Laplace smoothing) to prevent overfitting and improve the classifier's generalization performance.

     Ensemble methods: Combine multiple classifiers (such as Naive Bayes with logistic regression or decision trees) using ensemble methods such as bagging or boosting to improve overall performance.

     By carefully analyzing the misclassified examples and experimenting with different approaches, we can iteratively improve the classifier's performance on both the training and test sets.
</div>
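
The cross-validation idea mentioned above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not code from this notebook: `k_fold_splits`, `cross_validate`, `train_majority`, and `majority_accuracy` are hypothetical helper names, and the majority-class baseline is only a stand-in model so that the example runs without any external library. In the notebook, one could presumably pass a function that wraps `nltk.NaiveBayesClassifier.train` as `train_fn` and one that wraps `nltk.classify.accuracy` as `accuracy_fn`, after converting each `(tokens, sentiment)` pair to a `(features, sentiment)` pair.

```python
import random
from collections import Counter


def k_fold_splits(dataset, k=5, seed=0):
    """Yield (train, test) lists for each of k folds after a seeded shuffle."""
    if k < 2 or k > len(dataset):
        raise ValueError("k must be between 2 and len(dataset)")
    items = list(dataset)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, test


def cross_validate(dataset, train_fn, accuracy_fn, k=5, seed=0):
    """Return (mean accuracy, per-fold accuracies) over k folds."""
    scores = []
    for train, test in k_fold_splits(dataset, k, seed):
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return sum(scores) / len(scores), scores


# Stand-in model for the demo: always predict the most common training label.
def train_majority(train):
    return Counter(label for _, label in train).most_common(1)[0][0]


def majority_accuracy(model, test):
    return sum(label == model for _, label in test) / len(test)


toy_data = [({"id": i}, "pos" if i % 4 else "neg") for i in range(20)]
mean_accuracy, fold_accuracies = cross_validate(
    toy_data, train_majority, majority_accuracy, k=5
)
print(mean_accuracy, fold_accuracies)
```

Comparing the spread of the per-fold accuracies with the single train/test gap gives a rough sense of how much of that gap comes from one particular shuffle rather than from the model itself.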


#Submission Instructions:


1.Submit a Google Colab notebook containing your completed code and experimentation results.

2.Include comments and explanations in your code to help understand the implemented logic.

3.Clearly present the results of your parameter tuning in the notebook.

4.Provide a brief summary of your findings and insights in the conclusion section.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Experiment with various seed texts to showcase the diversity of generated text.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.