# **Sentiment Analysis on IMDB Movie Reviews**

The project involves preprocessing the text data, training a machine learning model, and evaluating its performance in classifying reviews as either positive or negative.

---
The [dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) is sourced from Kaggle.



## **Loading Dataset**


The dataset is loaded with `pandas`. It is stored in a CSV file located in the resources directory of the cloned repository.

In [2]:
import pandas as pd

# Loading the dataset
data_path = 'movie-review/resources/IMDB Dataset.csv'
data = pd.read_csv(data_path)
print(data.head())

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [3]:
print(data['review'].iloc[1])

A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.


## **Preprocessing**
To prepare the text data for model training, the following steps are performed:

*   HTML Tags Removal
*   URL and Mention Removal
*   Tokenization
*   Stop Words Removal (removing common words)
*   Stemming (to reduce words to their base form)
*   Conversion to Lowercase
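
The regex-based cleaning steps in this list can be sketched with the standard library alone. This is a minimal sketch, not the notebook's exact function: `clean_text` and the sample string are hypothetical, line breaks are replaced with a space (an assumption, so that adjacent words are not joined), and NLTK tokenization is left out.

```python
import re

def clean_text(text):
    """Apply the regex cleaning steps in order and return the cleaned string."""
    text = re.sub(r"<br\s*/?>", " ", text)        # HTML line breaks -> space
    text = re.sub(r"https?\S+|www\S+", "", text)  # URLs
    text = re.sub(r"@\w+|#", "", text)            # @mentions and '#' signs
    text = re.sub(r"[^\w\s]", "", text)           # remaining punctuation
    return " ".join(text.split())                 # collapse repeated whitespace

# Hypothetical sample text, not taken from the dataset.
sample = "Great film!<br /><br />See www.example.com or ask @critic #mustwatch"
print(clean_text(sample))
```

The order matters: URLs and mentions are removed before the punctuation step, because stripping punctuation first would turn `www.example.com` into an ordinary-looking token that the URL pattern no longer matches.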

In [4]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

In [5]:
def tokenizer(text):
    '''
    Removes <br /> tags, URLs, @mentions, '#' signs, and punctuation,
    then tokenizes the cleaned text into words.
    '''
    text = re.sub('<br />', '', text)                                         # HTML line breaks
    text = re.sub(r"https\S+|www\S+|http\S+", '', text, flags=re.MULTILINE)   # URLs
    text = re.sub(r'@\w+|#', '', text)                                        # @mentions and '#' signs
    text = re.sub(r'[^\w\s]', '', text)                                       # punctuation
    text_tokens = word_tokenize(text)
    return text_tokens

In [6]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nukhb\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nukhb\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nukhb\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [7]:
def preprocess_text(text):
    '''
    Preprocesses the input text by tokenizing, removing stop words,
    stemming, and converting to lowercase.
    '''
    tokens = tokenizer(text)                                # Tokenization
    tokens = [word for word in tokens if word.isalnum()]    # Keep alphanumeric tokens only
    stop_words = set(stopwords.words('english'))
    # Stop words removal. The NLTK list is lowercase and this check runs before
    # lowercasing, so capitalized stop words such as "The" are not removed here.
    tokens = [word for word in tokens if word not in stop_words]
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]        # Stemming
    tokens = [word.lower() for word in tokens]              # Lowercasing

    return ' '.join(tokens)

data['processed_review'] = data['review'].apply(preprocess_text)
print(data.head())

                                              review sentiment  \
0  One of the other reviewers has mentioned that ...  positive   
1  A wonderful little production. <br /><br />The...  positive   
2  I thought this was a wonderful way to spend ti...  positive   
3  Basically there's a family where a little boy ...  negative   
4  Petter Mattei's "Love in the Time of Money" is...  positive   

                                    processed_review  
0  one review mention watch 1 oz episod youll hoo...  
1  a wonder littl product the film techniqu unass...  
2  i thought wonder way spend time hot summer wee...  
3  basic there famili littl boy jake think there ...  
4  petter mattei love time money visual stun film...  


In [8]:
print(data['processed_review'].iloc[1])
print(data['sentiment'].iloc[1])

a wonder littl product the film techniqu unassum oldtimebbc fashion give comfort sometim discomfort sens realism entir piec the actor extrem well chosen michael sheen got polari voic pat you truli see seamless edit guid refer william diari entri well worth watch terrificli written perform piec a master product one great master comedi life the realism realli come home littl thing fantasi guard rather use tradit dream techniqu remain solid disappear it play knowledg sens particularli scene concern orton halliwel set particularli flat halliwel mural decor everi surfac terribl well done
positive
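
The processed review above still contains words such as "a", "the", "it", and "you". These were capitalized in the original text, and the stop-word check runs before lowercasing, so they do not match the lowercase stop-word list. A minimal sketch of the effect, using a tiny hand-written stop-word set instead of NLTK's list (`STOP_WORDS` and `remove_stop_words` are hypothetical names for illustration):

```python
# Tiny illustrative stop-word set; the notebook uses NLTK's full English list.
STOP_WORDS = {"a", "the", "is", "it", "i"}

def remove_stop_words(tokens, lowercase_first):
    """Filter stop words, optionally lowercasing tokens before the lookup."""
    if lowercase_first:
        tokens = [t.lower() for t in tokens]
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [t.lower() for t in kept]

tokens = ["A", "wonderful", "little", "production", "The", "filming", "technique"]
print(remove_stop_words(tokens, lowercase_first=False))  # "a" and "the" survive
print(remove_stop_words(tokens, lowercase_first=True))   # "a" and "the" are removed
```

Lowercasing first would likely change the processed text and, slightly, the scores reported below, so the results in this notebook correspond to the original order.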


## **Model Training and Evaluation**

The dataset is split into training and test sets, with 80% of the data used for training and 20% for testing. The text data is vectorized using `TF-IDF` to transform it into a numerical format suitable for model training.

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression

In [10]:
X = data['processed_review']
y = data['sentiment']

# Splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
vectorizer = TfidfVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)
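
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch. It follows the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and the L2 row normalization that scikit-learn documents as the defaults for `TfidfVectorizer`. It is a simplification for illustration only: the `tfidf` helper and the toy documents are hypothetical, and tokens are split on whitespace rather than with the vectorizer's own token pattern.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

# Toy, already-preprocessed documents (not taken from the dataset).
docs = ["wonder film wonder cast", "bad film bad plot"]
for vec in tfidf(docs):
    print({t: round(w, 3) for t, w in vec.items()})
```

In the first document, "film" appears in both documents and so receives the lowest weight, while "wonder" is both repeated and unique to that document and receives the highest.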

First, a Naive Bayes model is trained and its performance is evaluated.

In [12]:
nb_model = MultinomialNB()
nb_model.fit(X_train_vectorized, y_train)

# Predict on test set
y_pred_nb = nb_model.predict(X_test_vectorized)

# Evaluate the model
print("Naive Bayes Model")
print("Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Classification Report:\n", classification_report(y_test, y_pred_nb))

Naive Bayes Model
Accuracy: 0.8637
Classification Report:
               precision    recall  f1-score   support

    negative       0.85      0.88      0.86      4961
    positive       0.88      0.85      0.86      5039

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000
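
The precision, recall, and F1 columns in the report above follow from counts of true positives, false positives, and false negatives for each class. A minimal sketch of those formulas on made-up labels (the `binary_report` helper and the toy data are hypothetical, and this is not scikit-learn's implementation):

```python
def binary_report(y_true, y_pred, positive="positive"):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels for illustration only.
y_true = ["positive", "positive", "positive", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "positive", "negative"]
print(binary_report(y_true, y_pred))
```

Precision answers "of the reviews predicted as this class, how many were correct", while recall answers "of the reviews that truly belong to this class, how many were found".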



Next, the performance of logistic regression is evaluated.

In [13]:
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_vectorized, y_train)

# Predict on test set
y_pred_lr = lr_model.predict(X_test_vectorized)

# Evaluate the model
print("Logistic Regression Model")
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Classification Report:\n", classification_report(y_test, y_pred_lr))

Logistic Regression Model
Accuracy: 0.8904
Classification Report:
               precision    recall  f1-score   support

    negative       0.90      0.87      0.89      4961
    positive       0.88      0.91      0.89      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



## **Predicting Sentiment for Sample Reviews**

Since logistic regression achieved higher accuracy than Naive Bayes (0.8904 vs. 0.8637), I use it to predict the sentiment of sample reviews and compare the predictions with the actual labels to see how well the model performs.

In [14]:
# Select 5 random reviews from the test set
sample_reviews = X_test.sample(5, random_state=42)

# Predict sentiment using Logistic Regression model
predicted_sentiments = lr_model.predict(vectorizer.transform(sample_reviews))

# Actual sentiments for comparison
actual_sentiments = y_test.loc[sample_reviews.index]

for i, review in enumerate(sample_reviews):
    print(f"Review {i+1}:")
    print(f"Text: {review}")
    print(f"Predicted Sentiment: {predicted_sentiments[i]}")
    print(f"Actual Sentiment: {actual_sentiments.iloc[i]}")
    print("-" * 80)

Review 1:
Text: tortuou emot impact degrad whether adult adolesc person valu shown movi belong bad psychodrama anywher thi movi plot evil start end thi way peopl act degrad sex way movi teen kill bad preteen sex bad emot batter bad anim cruelti bad psycholog tortur bad parent neglect bad merit excel color shot contrast red blond green leav bad feel anyon respect life peac bad mistak make watch ugli
Predicted Sentiment: negative
Actual Sentiment: negative
--------------------------------------------------------------------------------
Review 2:
Text: anyon know anyth evolut wouldnt even need see film say fake never disprov also weak argument say univers creat giant hippo disprov although fair seem like peopl believ peopl open email attach peopl dont know give bank detail dude zambia no bone primat found unit state canada there also good reason legitim scientist dont bother studi the argument goe loch ness monster ghost god
Predicted Sentiment: negative
Actual Sentiment: negative
-------
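
The predictions above reuse the fitted `vectorizer` together with `lr_model`. One way to keep the two steps in sync is to wrap them in a scikit-learn pipeline. The following is a minimal sketch on a toy corpus, not the notebook's trained model: `sentiment_pipeline`, the training texts, and the new texts are all hypothetical, and in the notebook any new raw review would first need to pass through `preprocess_text`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, already-preprocessed training texts (hypothetical, not from the dataset).
train_texts = [
    "wonder film great cast",
    "love movi beauti stori",
    "bad plot terribl act",
    "wast time aw movi",
]
train_labels = ["positive", "positive", "negative", "negative"]

# One object that vectorizes and then classifies.
sentiment_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
sentiment_pipeline.fit(train_texts, train_labels)

new_texts = ["great stori wonder cast", "terribl plot wast time"]
predictions = sentiment_pipeline.predict(new_texts)
for text, label in zip(new_texts, predictions):
    print(f"{label}: {text}")
```

With a pipeline, calling `predict` on text applies the same fitted TF-IDF vocabulary automatically, which avoids accidentally refitting the vectorizer on new data.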

Finally, the preprocessed data is saved to a CSV file.

In [15]:
data.to_csv('preprocessed_IMDB_dataset.csv', index=False)