# Task 5: Movie Review Sentiment Analysis

# Problem Statement
**Sentiment Analysis of IMDB Movie Reviews**

Predict the sentiment (positive or negative) of movie reviews in the IMDB dataset using classification or deep learning algorithms.

# Objective:
1. Train a model: Train a classification or deep learning model on the IMDB dataset to predict the sentiment of movie reviews.
2. Evaluate performance: Evaluate the model's performance on the test dataset using metrics such as accuracy, precision, recall, and F1-score.
3. Predict sentiment: Use the trained model to predict the sentiment of new, unseen movie reviews.

# Dataset:
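- IMDB dataset: 50,000 movie reviews. The standard release splits them into 25,000 for training and 25,000 for testing; this notebook instead loads all 50,000 from a single CSV and holds out 20% as the test set.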
- Task: Binary sentiment classification (positive or negative)

In [None]:
# importing libraries
import pandas as pd
import sklearn as skn
# from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords



In [None]:
# Load the dataset
df = pd.read_csv("IMDB_dataset.csv")
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [None]:
# Remove English stopwords from the 'review' column.
# Note: this cell does not strip the <br /> tags; they are still visible in the output below.

import pandas as pd
from nltk.corpus import stopwords
import nltk

# Download the NLTK stopword list (needed once per environment)
nltk.download('stopwords')

# Reload the dataset so this cell can be run on its own
df = pd.read_csv("IMDB_dataset.csv")

# Build the stopword set once; membership tests on a set are fast
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)


df['review'] = df['review'].apply(remove_stopwords)

df


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,review,sentiment
0,One reviewers mentioned watching 1 Oz episode ...,positive
1,wonderful little production. <br /><br />The f...,positive
2,thought wonderful way spend time hot summer we...,positive
3,Basically there's family little boy (Jake) thi...,negative
4,"Petter Mattei's ""Love Time Money"" visually stu...",positive
...,...,...
49995,thought movie right good job. creative origina...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,Catholic taught parochial elementary schools n...,negative
49998,going disagree previous comment side Maltin on...,negative


## Preprocess

In [None]:

# Convert the 'review' column to lowercase
df['review'] = df['review'].str.lower()
# Stopwords were already removed case-insensitively above, so this second pass changes nothing
df['review'] = df['review'].apply(remove_stopwords)

df


Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production. <br /><br />the f...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically there's family little boy (jake) thi...,negative
4,"petter mattei's ""love time money"" visually stu...",positive
...,...,...
49995,thought movie right good job. creative origina...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,catholic taught parochial elementary schools n...,negative
49998,going disagree previous comment side maltin on...,negative


In [None]:
# Data cleaning helper: stopword removal for the 'review' column
# (same definition as the one already applied above)
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Tokenize the 'review' column
from nltk.tokenize import word_tokenize
# Download the 'punkt_tab' data for tokenization
import nltk
nltk.download('punkt_tab') # Download the necessary punkt_tab data

def tokenize_text(text):
    return word_tokenize(text)

df['tokenized_review'] = df['review'].apply(tokenize_text)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [None]:
# Encode the 'sentiment' column: 'positive' -> 1, 'negative' -> 0
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})
df


Unnamed: 0,review,sentiment,tokenized_review
0,one reviewers mentioned watching 1 oz episode ...,1,"[one, reviewers, mentioned, watching, 1, oz, e..."
1,wonderful little production. <br /><br />the f...,1,"[wonderful, little, production, ., <, br, /, >..."
2,thought wonderful way spend time hot summer we...,1,"[thought, wonderful, way, spend, time, hot, su..."
3,basically there's family little boy (jake) thi...,0,"[basically, there, 's, family, little, boy, (,..."
4,"petter mattei's ""love time money"" visually stu...",1,"[petter, mattei, 's, ``, love, time, money, ''..."
...,...,...,...
49995,thought movie right good job. creative origina...,1,"[thought, movie, right, good, job, ., creative..."
49996,"bad plot, bad dialogue, bad acting, idiotic di...",0,"[bad, plot, ,, bad, dialogue, ,, bad, acting, ..."
49997,catholic taught parochial elementary schools n...,0,"[catholic, taught, parochial, elementary, scho..."
49998,going disagree previous comment side maltin on...,0,"[going, disagree, previous, comment, side, mal..."


In [None]:
# Count the occurrences of 0 and 1 in the 'sentiment' column
sentiment_counts = df['sentiment'].value_counts()
sentiment_counts


Unnamed: 0_level_0,count
sentiment,Unnamed: 1_level_1
1,25000
0,25000


## Model Training

In [None]:
# Train Logistic Regression, Naive Bayes, and a linear SVM; report accuracy and F1-score for each

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df['review'], df['sentiment'], test_size=0.2, random_state=42
)

# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train and evaluate Logistic Regression
lr_model = LogisticRegression()
lr_model.fit(X_train_vec, y_train)
lr_pred = lr_model.predict(X_test_vec)
lr_accuracy = accuracy_score(y_test, lr_pred)
lr_f1 = f1_score(y_test, lr_pred)
print(f"Logistic Regression - Accuracy: {lr_accuracy:.4f}, F1-score: {lr_f1:.4f}")

# Train and evaluate Naive Bayes
nb_model = MultinomialNB()
nb_model.fit(X_train_vec, y_train)
nb_pred = nb_model.predict(X_test_vec)
nb_accuracy = accuracy_score(y_test, nb_pred)
nb_f1 = f1_score(y_test, nb_pred)
print(f"Naive Bayes - Accuracy: {nb_accuracy:.4f}, F1-score: {nb_f1:.4f}")


# Train and evaluate SVM
svm_model = LinearSVC()
svm_model.fit(X_train_vec, y_train)
svm_pred = svm_model.predict(X_test_vec)
svm_accuracy = accuracy_score(y_test, svm_pred)
svm_f1 = f1_score(y_test, svm_pred)
print(f"SVM - Accuracy: {svm_accuracy:.4f}, F1-score: {svm_f1:.4f}")


Logistic Regression - Accuracy: 0.8981, F1-score: 0.9003
Naive Bayes - Accuracy: 0.8671, F1-score: 0.8658
SVM - Accuracy: 0.8955, F1-score: 0.8972


In [None]:
# Hyperparameter tuning for the Logistic Regression model

import pandas as pd
import sklearn as skn
from nltk.corpus import stopwords
import nltk
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score
import re

# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('punkt')

# Load the dataset
df = pd.read_csv("IMDB_dataset.csv")

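# Preprocessing: unlike the earlier cells, this also strips HTML tags and punctuation
def preprocess_text(text):
    text = re.sub(r'<.*?>', '', text)            # drop HTML tags such as <br />
    text = text.lower()                          # normalize case
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)   # drop punctuation and other special characters
    text = re.sub(r'\s+', ' ', text).strip()     # collapse runs of whitespace
    return text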

stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply preprocessing
df['review'] = df['review'].apply(preprocess_text)
df['review'] = df['review'].apply(remove_stopwords)
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['review'], df['sentiment'], test_size=0.2, random_state=42
)

# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)


# Hyperparameter Tuning with GridSearchCV
param_grid = {
    'C': [0.1, 1, 10],  # Regularization parameter
    'penalty': ['l1', 'l2'], # Regularization method
    'solver': ['liblinear', 'saga'] # Solver algorithm
}


model = LogisticRegression(max_iter=10000) # Initialize logistic regression model
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='f1') # Initialize gridsearch
grid_search.fit(X_train_vec, y_train)

print("Best hyperparameters:", grid_search.best_params_)
print("Best F1 score:", grid_search.best_score_)

# Evaluate best model on the test set
best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test_vec)
accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions)

print(f"Best Model - Accuracy: {accuracy:.4f}, F1-score: {f1:.4f}")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Best hyperparameters: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
Best F1 score: 0.8969844510254511
Best Model - Accuracy: 0.9003, F1-score: 0.9018
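
The third objective, predicting the sentiment of new reviews, can be sketched as follows. This is a minimal, self-contained illustration rather than the notebook's own code: the tiny corpus and the names `toy_reviews`, `toy_labels`, and `predict_sentiment` are made up for the example. In practice the fitted `vectorizer` and `best_model` from the cell above would take the place of the toy pipeline, and new text would first go through the same preprocessing functions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, used only so the example runs on its own
toy_reviews = [
    "wonderful touching film great",
    "great story wonderful cast",
    "terrible boring plot awful",
    "awful script terrible pacing",
]
toy_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Bundling the vectorizer and the classifier ensures new text is
# transformed with the vocabulary learned during training
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(toy_reviews, toy_labels)

def predict_sentiment(texts):
    """Return 'positive' or 'negative' for each raw review string."""
    label_names = {1: "positive", 0: "negative"}
    return [label_names[int(label)] for label in pipeline.predict(texts)]

new_reviews = ["a wonderful film with a great story", "boring and awful"]
print(predict_sentiment(new_reviews))
```

Words the vectorizer never saw during training (here "with" and "and") are simply ignored at prediction time.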


# Report

## Approach

This project aims to predict the sentiment (positive or negative) of movie reviews from the IMDB dataset.  The approach involved the following steps:

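1. **Data Loading and Preprocessing:** The IMDB dataset was loaded, and the reviews underwent several preprocessing steps:
    * Removing HTML tags such as `<br />`.
    * Converting text to lowercase.
    * Removing special characters (anything other than letters, digits, and whitespace).
    * Removing stopwords (common words like "the", "a", "is" that don't usually contribute to sentiment).
    * Tokenization with NLTK's `word_tokenize`. The tokenized column was not used for training; the models were fit on the review strings, which `TfidfVectorizer` tokenizes itself.
    * Converting sentiment labels ('positive', 'negative') to a numerical representation (1 and 0).

    HTML tags and special characters were removed only in the hyperparameter-tuning run; the three baseline models were trained on text that had been lowercased and stripped of stopwords.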
    
2. **Feature Extraction:** TF-IDF vectorization was used to convert the preprocessed reviews into numerical features that machine learning models can work with. TF-IDF weights a term by how often it occurs in a review and down-weights terms that occur in many reviews, so words that are frequent in one review but rare across the corpus receive the highest weights.

3. **Model Training and Evaluation:** Three different machine learning models were trained and evaluated:
    * Logistic Regression
    * Multinomial Naive Bayes
    * Linear Support Vector Machine (SVM)
    
    Each model was trained on the TF-IDF vectors of the training data (80% of the reviews) and evaluated on the held-out 20% using accuracy and F1-score.


4. **Hyperparameter Tuning**: GridSearchCV was utilized to optimize the hyperparameters of the Logistic Regression model.  A parameter grid was defined with various regularization strengths ('C'), penalty types ('l1', 'l2'), and solver algorithms to find the combination that yields the best F1-score.

## Challenges

* **Data Cleaning:**  The initial dataset contained HTML tags and special characters, requiring careful cleaning to ensure the model receives clean text data.
* **Stopword Removal:** Deciding whether or not to remove stop words is a trade-off.  While removing them can reduce noise, they can sometimes carry subtle sentiment cues.
* **Model Selection:** Choosing the best performing model requires experimentation with various algorithms, along with consideration for interpretability and efficiency.
* **Hyperparameter Tuning:** Grid search is computationally expensive: the grid used here has 3 × 2 × 2 = 12 combinations, and 5-fold cross-validation means 60 model fits plus a final refit. GridSearchCV automates the search but still requires a thoughtful choice of parameter grid and cross-validation settings.
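
The stopword trade-off can be made concrete with a negation example. This is a minimal sketch, not the notebook's code: `stop_words` here is a small hand-written stand-in for NLTK's English list (which is much longer and also contains "not" and "no"), and the `stop_list` parameter is added for illustration.

```python
# Small stand-in for NLTK's English stopword list (assumption: only a few
# entries are shown; the real list is much longer)
stop_words = {"this", "is", "the", "a", "was", "not", "no"}

# Negation words to keep because they can flip the sentiment of a phrase
negations = {"not", "no"}
stop_words_keep_negation = stop_words - negations

def remove_stopwords(text, stop_list):
    """Drop every whitespace-separated word whose lowercase form is in stop_list."""
    return " ".join(w for w in text.split() if w.lower() not in stop_list)

review = "This movie was not good"
print(remove_stopwords(review, stop_words))                # movie good
print(remove_stopwords(review, stop_words_keep_negation))  # movie not good
```

With the full list, a negative review collapses to "movie good"; keeping the negation words preserves the cue, especially if the vectorizer also uses bigrams.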

## Model Performance & Improvements

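The initial models provided a baseline on the held-out test set:

* Logistic Regression: accuracy 0.8981, F1-score 0.9003
* Multinomial Naive Bayes: accuracy 0.8671, F1-score 0.8658
* Linear SVM: accuracy 0.8955, F1-score 0.8972

After hyperparameter tuning, Logistic Regression (C=10, L2 penalty, liblinear solver) reached accuracy 0.9003 and F1-score 0.9018. The gain over the untuned model is small (about 0.2 percentage points in accuracy) and cannot be attributed to tuning alone, because the tuned run also used stricter preprocessing (HTML tags and punctuation removed).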
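
The objectives also list precision and recall, which the notebook does not report. A minimal sketch of how all four metrics could be computed with scikit-learn is shown below; the labels and predictions are made up for illustration and are not results from the models above.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels and predictions for eight reviews (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
# average="binary" scores the positive class (label 1) only
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, "
      f"Recall: {recall:.2f}, F1-score: {f1:.2f}")
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 0.75. On the real data, `y_test` and a model's predictions would replace the toy lists.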


**Further improvements could include:**

* **More Advanced Feature Engineering:**  Experimenting with different vectorization techniques (word embeddings, Doc2Vec), or incorporating sentiment lexicons might lead to performance gains.
* **Deep Learning Models:**  Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs) are well-suited for text data and might outperform traditional machine learning algorithms.
* **Ensemble Methods:** Combining multiple models can often improve prediction accuracy and robustness.
* **More Sophisticated Data Cleaning:** Investigate more aggressive data cleaning techniques (handling misspellings, stemming/lemmatization), or further experimentation with stop-word removal.
* **Larger Dataset:** Using a larger and more diverse dataset might lead to a more robust model.
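
As one example of the ensemble idea, the three model types used in this notebook could be combined by majority vote. This is a sketch under stated assumptions, not a tested improvement: the toy corpus is made up, all models use default hyperparameters, and whether voting helps on the real data would have to be measured.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, used only so the example runs on its own
toy_reviews = [
    "wonderful touching film great",
    "great story wonderful cast",
    "terrible boring plot awful",
    "awful script terrible pacing",
]
toy_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Hard voting: each model casts one vote and the majority label wins.
# LinearSVC has no predict_proba, so soft (probability) voting is not used here.
ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("lr", LogisticRegression()),
            ("nb", MultinomialNB()),
            ("svm", LinearSVC()),
        ],
        voting="hard",
    ),
)
ensemble.fit(toy_reviews, toy_labels)
print(ensemble.predict(["wonderful great film", "terrible awful plot"]))
```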
