/*  ============================================   
Title: Week 3: Sentiment Analysis and Preprocessing Text
Author: Catie Williams
Date: 28 Sep 2025 
Created By: Sathya Raj Eswaran
Description: Sentiment Analysis and Preprocessing Text
=========================================== */  

In [53]:
# Importing Libraries
import pandas as pd
from textblob import TextBlob
from sklearn.metrics import accuracy_score
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [55]:
# Import the movie review data
# Read the tab-separated file (forward slashes work on Windows, macOS, and Linux)
df = pd.read_csv("data/labeledTrainData.tsv", sep='\t')

# Check if the data is loaded properly by displaying the top 5 rows
print("\nFirst 5 rows of the DataFrame:")
print(df.head())


First 5 rows of the DataFrame:
       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
3  3630_4          0  It must be assumed that those who praised this...
4  9495_8          1  Superbly trashy and wondrously unpretentious 8...


In [56]:
df.head(5)

       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
3  3630_4          0  It must be assumed that those who praised this...
4  9495_8          1  Superbly trashy and wondrously unpretentious 8...


In [58]:
# Count the number of positive and negative reviews
Sent_Count = df['sentiment'].value_counts()
# Print the positive and negative reviews
print("Number of positive and negative reviews:")
print(f"Positive Reviews (1): {Sent_Count[1]}")
print(f"Negative Reviews (0): {Sent_Count[0]}")

Number of positive and negative reviews:
Positive Reviews (1): 12500
Negative Reviews (0): 12500


In [61]:
# Display the first 10 rows of the dataframe to preview 
df.head(10)

        id  sentiment                                             review
0   5814_8          1  With all this stuff going down at the moment w...
1   2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2   7759_3          0  The film starts with a manager (Nicholas Bell)...
3   3630_4          0  It must be assumed that those who praised this...
4   9495_8          1  Superbly trashy and wondrously unpretentious 8...
5   8196_8          1  I dont know why people think this is such a ba...
6   7166_2          0  This movie could have been very good, but come...
7  10633_1          0  I watched this video at a friend's house. I'm ...
8    319_1          0  A friend of mine bought this film for £1, and ...
9  8713_10          1  <br /><br />This movie is full of references. ...


In [63]:
# TextBlob Sentiment Analysis
# Classify a review with TextBlob: 'Positive' if polarity >= 0 (so a neutral score of 0 counts as positive), otherwise 'Negative'.
def get_textblob_sentiment(text):
    analysis = TextBlob(text)
    if analysis.sentiment.polarity >= 0:
        return 'Positive'
    else:
        return 'Negative'

# Applying the sentiment analysis function to the 'review' column
print("Classifying movie reviews using TextBlob")
df['TextBlob_Sentiment'] = df['review'].apply(get_textblob_sentiment)


Classifying movie reviews using TextBlob


In [64]:

# Display Results
print("\nFirst 5 rows with the new TextBlob_Sentiment column:")
print(df[['review', 'sentiment', 'TextBlob_Sentiment']].head())
print("\n--- Summary of TextBlob Sentiment Classification ---")
print(df['TextBlob_Sentiment'].value_counts())


First 5 rows with the new TextBlob_Sentiment column:
                                              review  sentiment  \
0  With all this stuff going down at the moment w...          1   
1  \The Classic War of the Worlds\" by Timothy Hi...          1   
2  The film starts with a manager (Nicholas Bell)...          0   
3  It must be assumed that those who praised this...          0   
4  Superbly trashy and wondrously unpretentious 8...          1   

  TextBlob_Sentiment  
0           Positive  
1           Positive  
2           Negative  
3           Positive  
4           Negative  

--- Summary of TextBlob Sentiment Classification ---
TextBlob_Sentiment
Positive    19017
Negative     5983
Name: count, dtype: int64


In [65]:
# TextBlob Accuracy

# Mapping the original numerical sentiment to labels for comparison
df['true_sentiment_text'] = df['sentiment'].map({1: 'Positive', 0: 'Negative'})

# Calculating the accuracy by comparing the true labels to TextBlob's predictions
accuracy = accuracy_score(df['true_sentiment_text'], df['TextBlob_Sentiment'])


In [66]:
# Display Results
print("\n Accuracy of the Model")
print(f"The accuracy of the TextBlob model is: {accuracy:.2f}")


 Accuracy of the Model
The accuracy of the TextBlob model is: 0.69
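
The earlier summary shows 19,017 positive predictions even though only 12,500 reviews are labeled positive, so a single accuracy figure may hide very different error rates for the two classes. The following is a minimal sketch of a per-class breakdown. `per_class_recall` is a hypothetical helper written for this illustration, and the label lists are toy values, not results from this dataset.

```python
from collections import Counter

def per_class_recall(true_labels, predicted_labels):
    """Return {label: fraction of that label's examples predicted correctly}."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Toy labels (illustrative only, not taken from the review data)
y_true = ['Positive', 'Positive', 'Negative', 'Negative', 'Negative']
y_pred = ['Positive', 'Positive', 'Positive', 'Positive', 'Negative']

recall = per_class_recall(y_true, y_pred)
print(recall)
```

In the notebook, the same helper could be called with `df['true_sentiment_text']` and `df['TextBlob_Sentiment']` to see how much of the error falls on negative reviews.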


Model Performance Analysis: TextBlob Sentiment Classification

We assessed TextBlob's accuracy by comparing its sentiment predictions to the dataset's true labels. Because the dataset is perfectly balanced (12,500 positive and 12,500 negative reviews), random guessing would score about 50%. TextBlob, which relies on a built-in sentiment lexicon and needs no training, reached 69% accuracy. Across 25,000 reviews, a 19-point margin over chance is far too large to be sampling noise, so TextBlob is a useful untrained baseline for movie review sentiment. The prediction counts also reveal a strong positive skew: TextBlob labeled 19,017 reviews positive although only 12,500 are, so at least 6,517 of its positive predictions are wrong. Part of this skew comes from the classification rule, which treats a polarity of exactly 0 as positive.
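
To put a number on how far 69% sits from the 50% chance level, one option is a one-sided z-test for a proportion using the normal approximation. This is a rough sketch and not part of the original analysis: it assumes the reviews behave like independent trials, it uses the rounded accuracy of 0.69 reported above, and the variable names are illustrative.

```python
import math

n_reviews = 25000   # total reviews in the dataset
accuracy = 0.69     # rounded accuracy reported above
chance = 0.50       # expected accuracy of random guessing on balanced classes

# Standard error of a proportion under the null hypothesis (accuracy == chance)
standard_error = math.sqrt(chance * (1 - chance) / n_reviews)
z_score = (accuracy - chance) / standard_error

# One-sided p-value from the normal tail; it underflows to 0.0 for very large z
p_value = 0.5 * math.erfc(z_score / math.sqrt(2))

print(f"z = {z_score:.1f}, p = {p_value:.3g}")
```

A z-score of roughly 60 means the gap from chance cannot plausibly be explained by sampling variation alone.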

In [69]:
# Download the VADER lexicon (nltk is already imported in the first cell)
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\saran\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [70]:
# VADER Sentiment Analysis

# Create the analyzer once; building it inside the function would reload the lexicon for every review.
vader_analyzer = SentimentIntensityAnalyzer()

def get_vader_sentiment(text):
    """
    Classify a text with VADER.
    Returns 'Positive' if the compound score is > 0, otherwise 'Negative'.
    A score of exactly 0 counts as 'Negative', unlike the TextBlob rule (polarity >= 0).
    """
    compound_score = vader_analyzer.polarity_scores(text)['compound']
    if compound_score > 0:
        return 'Positive'
    else:
        return 'Negative'



In [71]:
# Applying the sentiment analysis function to the 'review' column
print("Classifying movie reviews using VADER...")
df['VADER_Sentiment'] = df['review'].apply(get_vader_sentiment)

# Mapping the original numerical sentiment to text labels for comparison
df['true_sentiment_text'] = df['sentiment'].map({1: 'Positive', 0: 'Negative'})



Classifying movie reviews using VADER...


In [72]:
# Calculating the accuracy by comparing the true labels to VADER's predictions
accuracy = accuracy_score(df['true_sentiment_text'], df['VADER_Sentiment'])

# --- Display Results ---
print("\n--- Model Accuracy ---")
print(f"The accuracy of the VADER model is: {accuracy:.2f}")


--- Model Accuracy ---
The accuracy of the VADER model is: 0.69


Conclusion: VADER Sentiment Classification

At 69% accuracy, VADER clearly outperforms the 50% chance baseline, which confirms its value as an untrained, rule-based classifier for movie review sentiment. It matches TextBlob's accuracy after rounding to two decimals, although the two classifiers treat a score of exactly 0 differently (TextBlob labels it positive, VADER negative). The result also points to the limits of a fixed lexicon: VADER cannot fully capture domain-specific jargon, sarcasm, or highly nuanced expressions, and this likely caps its accuracy. VADER is therefore an efficient starting point, but we expect machine-learning models trained on this dataset to perform better by learning those patterns from the labeled reviews.
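
The classification rule used here splits at a compound score of 0. The VADER documentation suggests a three-way convention instead (positive at compound >= 0.05, negative at compound <= -0.05, neutral in between). The sketch below shows how such thresholds could be applied to a compound score. `label_from_compound` and its `neutral_band` parameter are hypothetical names, and the scores passed in are made-up values, not VADER output.

```python
def label_from_compound(compound, neutral_band=0.05):
    """Map a VADER-style compound score in [-1, 1] to a three-way label."""
    if compound >= neutral_band:
        return 'Positive'
    if compound <= -neutral_band:
        return 'Negative'
    return 'Neutral'

# Made-up compound scores for illustration
for score in (0.62, 0.01, -0.40):
    print(score, label_from_compound(score))
```

Because this dataset has only two classes, any review labeled neutral would still have to be assigned to one side before accuracy is computed, and that choice can shift the reported score.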

In [74]:
# Download the stopwords list if not already available
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    print("Downloading NLTK stopwords...")
    nltk.download('stopwords')

In [75]:
# Build the stop-word set and the stemmer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
PORTER = PorterStemmer()

# Define a function for text cleaning and stemming
def clean_and_stem_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove HTML tags (common in this dataset)
    text = re.sub(r'<.*?>', '', text)
    # Remove punctuation and special characters, keeping only letters, digits, and whitespace
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Tokenize on whitespace and remove stop words
    words = text.split()
    cleaned_words = [word for word in words if word not in STOP_WORDS]
    # Apply the Porter stemmer to each remaining word
    stemmed_words = [PORTER.stem(word) for word in cleaned_words]
    # Join the words back into a single string
    return ' '.join(stemmed_words)
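
To see what the two regular expressions do before stop-word removal and stemming, the sketch below applies only the lowercase and `re.sub` steps to short sample strings. `strip_markup_and_punctuation` is a hypothetical helper written for this illustration, and the sample strings are made up.

```python
import re

def strip_markup_and_punctuation(text):
    """Apply only the lowercase and regex steps of clean_and_stem_text."""
    text = text.lower()
    text = re.sub(r'<.*?>', '', text)        # drop HTML tags such as <br />
    text = re.sub(r'[^a-z0-9\s]', '', text)  # keep letters, digits, whitespace
    return text

first = strip_markup_and_punctuation("<br /><br />This movie is GREAT!")
second = strip_markup_and_punctuation("I've seen it.<br />Twice")
print(first)
print(second)
```

The second call shows two side effects worth knowing about. Removing the apostrophe turns "I've" into "ive", which is consistent with "ive" surviving stop-word filtering in the stemmed output, and deleting a tag without inserting a space can fuse the words on either side of it.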

In [76]:
# Applying the cleaning and stemming function
df['stemmed_review'] = df['review'].apply(clean_and_stem_text)

In [77]:
print("Text preprocessing and stemming completed.")
print("\nFirst 5 stemmed reviews:")
print(df['stemmed_review'].head())

Text preprocessing and stemming completed.

First 5 stemmed reviews:
0    stuff go moment mj ive start listen music watc...
1    classic war world timothi hine entertain film ...
2    film start manag nichola bell give welcom inve...
3    must assum prais film greatest film opera ever...
4    superbl trashi wondrous unpretenti 80 exploit ...
Name: stemmed_review, dtype: object


In [78]:
# Vectorization

# Creating a Bag-of-Words matrix
print("\nCreating Bag-of-Words matrix")
count_vectorizer = CountVectorizer()
bag_of_words_matrix = count_vectorizer.fit_transform(df['stemmed_review'])

print(f"Bag-of-Words matrix dimensions: {bag_of_words_matrix.shape}")



Creating Bag-of-Words matrix
Bag-of-Words matrix dimensions: (25000, 112735)


In [79]:
# Create a TF-IDF matrix
print("\nCreating TF-IDF matrix")
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['stemmed_review'])
print(f"TF-IDF matrix dimensions: {tfidf_matrix.shape}")


Creating TF-IDF matrix
TF-IDF matrix dimensions: (25000, 112735)


In [80]:
# Verify that both matrices have one row per review
if bag_of_words_matrix.shape[0] == df.shape[0] and tfidf_matrix.shape[0] == df.shape[0]:
    print("\nVerified: The number of rows in both matrices matches the original DataFrame.")
else:
    print("\nWarning: The number of rows in the matrices does not match the original DataFrame.")


Verified: The number of rows in both matrices matches the original DataFrame.


Conclusion and Path Forward

The data preparation phase is complete and validated. The Bag-of-Words and TF-IDF representations were both built from the stemmed reviews, and each matrix has dimensions (25000, 112735): one row for each of the 25,000 reviews and one column for each of the 112,735 unique stemmed terms in the vocabulary. The row count matches the number of documents in the original DataFrame.
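
The difference between the two representations is easier to see on a tiny corpus. The sketch below fits both vectorizers, with their default settings, on three made-up strings of already-stemmed text. The corpus and the `toy_` variable names are illustrative and not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus of already-stemmed text (illustrative, not from the dataset)
toy_reviews = ["good movi", "bad movi", "good good act"]

toy_count_vectorizer = CountVectorizer()
toy_bow = toy_count_vectorizer.fit_transform(toy_reviews)

toy_tfidf_vectorizer = TfidfVectorizer()
toy_tfidf = toy_tfidf_vectorizer.fit_transform(toy_reviews)

print(toy_count_vectorizer.get_feature_names_out())  # vocabulary, in column order
print(toy_bow.toarray())                              # raw term counts per document
print(toy_tfidf.toarray().round(2))                   # TF-IDF weights, L2-normalized per row
```

Both matrices share the same shape because they are built from the same vocabulary. Only the cell values differ: raw counts in the first, and counts reweighted by how rare each term is across documents in the second.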

Our baseline assessment showed that VADER and TextBlob both reach 69% accuracy, which demonstrates the usefulness of lexicon-based models but also suggests their limits. The BoW and TF-IDF matrices now make it possible to move to supervised machine learning. These numerical features can be fed to classifiers that learn from the labeled reviews which terms signal positive or negative sentiment, and we expect this to push accuracy beyond what the rule-based systems achieve.
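
As a rough illustration of that next step, the sketch below chains a TF-IDF vectorizer with a logistic regression classifier on a handful of made-up stemmed strings. The choice of `LogisticRegression` is an assumption made for this example (the notebook does not name a model), and the toy texts and labels are not from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (illustrative; 1 = positive, 0 = negative)
train_texts = ["great film love", "wonder act great", "bad film wast", "terribl plot bad"]
train_labels = [1, 1, 0, 0]

# TF-IDF features followed by a linear classifier (an assumed choice of model)
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

new_texts = ["great act", "bad plot"]
predictions = model.predict(new_texts)
print(predictions)
```

On the real data, the model should be fit on a training split and scored on held-out reviews (for example with `train_test_split` from `sklearn.model_selection`), so that the reported accuracy reflects text the model has not seen.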