<h1> Movie Review Sentiment Classification using N-Gram Language Model</h1>

<h4>Install NLTK and scikit-learn</h4>

In [1]:
pip install nltk scikit-learn









<h4>Import necessary libraries</h4>

In [2]:
import nltk
from nltk.util import ngrams
from collections import Counter
from sklearn.model_selection import train_test_split
import string
import re
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

<h4>01) Import the movie reviews</h4>

In [4]:
# Load Movie_Reviews.txt from the current working directory
import os

file_name = "Movie_Reviews.txt"
file_path = os.path.join(os.getcwd(), file_name)

# Open the file and read its content
with open(file_path, "r", encoding="utf-8") as file:
    movie_reviews = file.read()

<h4>02) Pre-process the text data</h4>

<h5>Split the reviews</h5>

In [5]:
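# Split the content into positive and negative reviews at the "Negative Reviews" header
positive_reviews, negative_reviews = movie_reviews.split("Negative Reviews\n================")
# Keep the text after the "Positive Reviews" header, one review per line;
# [1:] skips the first line that follows each header
positive_reviews = positive_reviews.split("Positive Reviews\n================")[1].strip().split('\n')[1:]
negative_reviews = negative_reviews.strip().split('\n')[1:]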

<h5>Preprocess a single review</h5>

In [6]:
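from nltk.corpus import stopwords

def preprocess_text(review):
    """Tokenize a review, lowercase it, and drop non-alphabetic tokens and stop words."""
    tokens = nltk.word_tokenize(review)
    # Keep alphabetic tokens only, lowercased
    tokens = [word.lower() for word in tokens if word.isalpha()]
    # Remove English stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    return tokens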
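
<p>One thing to check in this step: NLTK's English stop-word list includes negation words such as "not" (this can be confirmed with <code>'not' in stopwords.words('english')</code>), so "not good" and "good" reduce to the same token. The sketch below shows one possible way to keep negations. It does not use NLTK; <code>TOY_STOP_WORDS</code>, <code>NEGATIONS</code>, and <code>filter_tokens</code> are hypothetical names introduced only for illustration.</p>

```python
# Hypothetical miniature stop-word list, standing in for stopwords.words('english')
TOY_STOP_WORDS = {"the", "was", "a", "not", "no", "is", "it"}
# Assumption: these are the words worth keeping for sentiment
NEGATIONS = {"not", "no", "never"}

def filter_tokens(tokens, stop_words, keep=frozenset()):
    """Lowercase, drop non-alphabetic tokens, and drop stop words unless listed in keep."""
    lowered = [t.lower() for t in tokens if t.isalpha()]
    return [t for t in lowered if t not in stop_words or t in keep]

tokens = ["The", "movie", "was", "not", "good", "."]
without_negation = filter_tokens(tokens, TOY_STOP_WORDS)
with_negation = filter_tokens(tokens, TOY_STOP_WORDS, NEGATIONS)

print(without_negation)  # ['movie', 'good']
print(with_negation)     # ['movie', 'not', 'good']
```

<p>With negations kept, a bigram model can see the pair ('not', 'good') instead of treating the review as if it only said "good".</p>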

<h4>03) Choose an appropriate value for N and implement the N-Gram model</h4>

In [7]:
# Define the N for N-Grams (e.g., bigram, trigram)
N = 2  # We can change N as needed

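<p><b>Explanation:</b> The right value of N depends on the size of the dataset and on how much context the task needs. A larger N captures more context, but each N-Gram then occurs less often, so the counts become sparse. Bigrams (N=2) are a useful starting point in this scenario because they capture pairs of words, which provides some context for sentiment analysis while keeping the counts usable.</p>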
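
<p>To make the trade-off concrete, the short sketch below builds unigrams, bigrams, and trigrams from the same tokens. The token list is a hypothetical example, not taken from Movie_Reviews.txt, and the helper uses the same zip-based approach as <code>generate_ngrams</code> in this notebook, declared here so the snippet runs on its own.</p>

```python
def generate_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*[tokens[i:] for i in range(n)]))

# Hypothetical tokens, not taken from Movie_Reviews.txt
tokens = ["great", "acting", "terrible", "plot"]

for n in (1, 2, 3):
    grams = generate_ngrams(tokens, n)
    print(f"N={n}: {len(grams)} n-grams -> {grams}")
```

<p>Unigrams lose the link between "terrible" and "plot", bigrams keep it, and trigrams keep more context but produce fewer and rarer items from the same text.</p>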

<h4>04) Calculate N-Gram probabilities for each N-Gram in the corpus</h4>

In [8]:
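def generate_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    ngrams_list = list(zip(*[tokens[i:] for i in range(n)]))
    return ngrams_list

# Define a function to calculate N-Gram probabilities
def calculate_ngram_probabilities(corpus, n):
    """Map each N-Gram to its relative frequency (count / total N-Grams) in the corpus."""
    ngram_counts = Counter()

    for review in corpus:
        tokens = preprocess_text(review)
        # Named review_ngrams to avoid shadowing nltk.util.ngrams imported above
        review_ngrams = generate_ngrams(tokens, n)
        ngram_counts.update(review_ngrams)

    total_ngrams = sum(ngram_counts.values())
    ngram_probabilities = {ngram: count / total_ngrams for ngram, count in ngram_counts.items()}

    return ngram_probabilities

# Define a function to score the N-Grams of a test review against one class
def calculate_probability(test_ngrams, ngram_probabilities):
    """Multiply the probabilities of the test N-Grams found in ngram_probabilities."""
    probability = 1.0
    for ngram in test_ngrams:
        # N-Grams never seen in training are skipped, so they leave the product unchanged
        if ngram in ngram_probabilities:
            probability *= ngram_probabilities[ngram]
    return probability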
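
<p>Because <code>calculate_probability</code> skips N-Grams that are missing from a class, a class that matches nothing keeps a score of 1.0. The sketch below illustrates this with hypothetical probability tables; <code>product_of_seen</code> is an illustrative stand-in that follows the same rule, and none of the values come from the notebook's data.</p>

```python
def product_of_seen(test_ngrams, probabilities):
    """Same rule as calculate_probability: multiply seen n-grams, skip unseen ones."""
    score = 1.0
    for ngram in test_ngrams:
        if ngram in probabilities:
            score *= probabilities[ngram]
    return score

# Hypothetical probability tables for two classes
class_a = {("great", "acting"): 0.1, ("acting", "superb"): 0.1}
class_b = {("boring", "plot"): 0.5}

test_ngrams = [("great", "acting"), ("acting", "superb")]
score_a = product_of_seen(test_ngrams, class_a)  # both n-grams match: 0.1 * 0.1
score_b = product_of_seen(test_ngrams, class_b)  # nothing matches: stays at 1.0

print(score_a, score_b)
print("winner:", "A" if score_a >= score_b else "B")
```

<p>Class B wins here even though it shares no N-Gram with the test input, so scores produced this way are only comparable when both classes match the same test N-Grams.</p>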

In [13]:
positive_ngram_probabilities = calculate_ngram_probabilities(positive_reviews, N)
negative_ngram_probabilities = calculate_ngram_probabilities(negative_reviews, N)
for ngram, probability in positive_ngram_probabilities.items():
    print(f"N-Gram: {ngram}, Probability: {probability}")
for ngram, probability in negative_ngram_probabilities.items():
    print(f"N-Gram: {ngram}, Probability: {probability}")

N-Gram: ('forrest', 'gump'), Probability: 0.00398406374501992
N-Gram: ('gump', 'absolute'), Probability: 0.00398406374501992
N-Gram: ('absolute', 'masterpiece'), Probability: 0.00398406374501992
N-Gram: ('masterpiece', 'tom'), Probability: 0.00398406374501992
N-Gram: ('tom', 'hanks'), Probability: 0.00398406374501992
N-Gram: ('hanks', 'delivers'), Probability: 0.00398406374501992
N-Gram: ('delivers', 'unforgettable'), Probability: 0.00398406374501992
N-Gram: ('unforgettable', 'performance'), Probability: 0.00398406374501992
N-Gram: ('performance', 'storytelling'), Probability: 0.00398406374501992
N-Gram: ('storytelling', 'heartwarming'), Probability: 0.00398406374501992
N-Gram: ('heartwarming', 'movie'), Probability: 0.00398406374501992
N-Gram: ('movie', 'journey'), Probability: 0.00398406374501992
N-Gram: ('journey', 'life'), Probability: 0.00398406374501992
N-Gram: ('life', 'make'), Probability: 0.00398406374501992
N-Gram: ('make', 'laugh'), Probability: 0.00398406374501992
N-Gram: (

<h4>05) Calculate the N-gram probability for the given test movie review</h4>

In [14]:
# Tokenize the test review and build its N-Grams
test_review = "It's clear that the movie has both its enthusiasts and critics. While it may not be to everyone's taste, it's worth watching with an open mind to form your own opinion."
test_tokens = preprocess_text(test_review)
test_bigrams = generate_ngrams(test_tokens, N)

# Score the test N-Grams against each class
positive_probability = calculate_probability(test_bigrams, positive_ngram_probabilities)
negative_probability = calculate_probability(test_bigrams, negative_ngram_probabilities)
print("Positive Probability: ", positive_probability)
print("Negative Probability: ", negative_probability)

Positive Probability:  1.5872763924382153e-05
Negative Probability:  2.029056083110137e-05


<h4>06) Predict the category of the test movie review</h4>

In [15]:
# Determine the sentiment
sentiment = "positive" if positive_probability >= negative_probability else "negative"
print(f"Predicted sentiment: {sentiment}")

Predicted sentiment: negative


<p><b>Explanation: </b>The model assigns the test movie review to the "Negative" category because the negative score (about 2.03e-05) is higher than the positive score (about 1.59e-05). The margin is narrow, and since N-Grams that were not seen in training are skipped, the result rests only on the test bigrams that also occur in each class's training reviews.</p>
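
<p>A common alternative for this kind of comparison, which this notebook does not use, is to add a small count to every N-Gram (add-alpha smoothing) and compare sums of log probabilities. The following is a minimal sketch under stated assumptions: the toy corpora, the helper names <code>ngram_counts</code> and <code>smoothed_log_probability</code>, and the choice of <code>vocab_size</code> as the number of distinct bigrams seen in either class are all illustrative.</p>

```python
import math
from collections import Counter

def ngram_counts(token_lists, n):
    """Count n-grams over a list of already-tokenized reviews."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(*[tokens[i:] for i in range(n)]))
    return counts

def smoothed_log_probability(test_ngrams, counts, vocab_size, alpha=1.0):
    """Sum of log add-alpha probabilities; unseen n-grams get a small nonzero value."""
    denominator = sum(counts.values()) + alpha * vocab_size
    return sum(math.log((counts[ng] + alpha) / denominator) for ng in test_ngrams)

# Hypothetical toy corpora, already tokenized
positive = [["great", "acting", "great", "story"], ["loved", "great", "acting"]]
negative = [["terrible", "plot", "boring", "story"]]

pos_counts = ngram_counts(positive, 2)
neg_counts = ngram_counts(negative, 2)
# Assumption: vocabulary = distinct bigrams seen in either class
vocab_size = len(set(pos_counts) | set(neg_counts))

test_ngrams = [("great", "acting"), ("acting", "superb")]
pos_score = smoothed_log_probability(test_ngrams, pos_counts, vocab_size)
neg_score = smoothed_log_probability(test_ngrams, neg_counts, vocab_size)

print(pos_score, neg_score)
print("positive" if pos_score >= neg_score else "negative")
```

<p>Every test N-Gram now contributes to both scores, so a class can no longer win by matching fewer N-Grams, and summing logs avoids very small products on longer reviews.</p>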

<h4>07) Explain the concept of perplexity and how it measures the model's performance</h4>

<p>Perplexity is a metric that measures how well a language model predicts a given word sequence. It is the inverse of the geometric mean of the probabilities the model assigns to each word given the preceding words, so lower perplexity indicates better model performance. Perplexity can be used to compare and evaluate alternative language models.</p>

<p>For a test sequence of M words, perplexity can be calculated with the following formula:</p>

$$\mathrm{PPL} = 2^{-\frac{1}{M}\log_2(\text{likelihood})}$$

<p>Here, likelihood is the probability that the model assigns to the test data, that is, the product of the per-word probabilities. The log2 function moves the likelihood to a logarithmic scale, dividing by M averages it per word, and the 2^(-x) step returns the result to the probability scale. For an N-Gram model, perplexity measures how well the model predicts the next word based on the previous N-1 words. A lower perplexity suggests that the model is more accurate in anticipating the next word. It is a standard statistic for evaluating language models.</p>
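
<p>As a small illustration of the definition, the sketch below computes perplexity from a list of per-word probabilities, normalizing the log-likelihood by the number of words. The function name <code>perplexity</code> and the probability values are hypothetical and not taken from the model above.</p>

```python
import math

def perplexity(probabilities):
    """Perplexity of a sequence, given the probability the model assigned to each word."""
    m = len(probabilities)
    log_likelihood = sum(math.log2(p) for p in probabilities)
    return 2 ** (-log_likelihood / m)

# Hypothetical per-word probabilities from two models on the same four-word sequence
uniform = [0.25, 0.25, 0.25, 0.25]  # always unsure between four equally likely words
confident = [0.9, 0.8, 0.9, 0.7]    # usually assigns high probability to the right word

print(perplexity(uniform))    # 4.0
print(perplexity(confident))
```

<p>The uniform model has a perplexity of 4, matching the four equally likely choices it faces at every step, while the more confident model scores lower. Every probability must be greater than zero, which is one reason smoothing matters when evaluating N-Gram models.</p>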