# NLP Movie Reviews Sentiment Prediction

This notebook builds a binary sentiment classifier for movie reviews. It preprocesses the review text, trains a logistic regression model on TF-IDF features, and evaluates the model on a held-out test set. The steps are as follows:

1. Load the movie reviews dataset.
2. Preprocess the reviews (lowercasing, tokenization, removal of stopwords and non-alphanumeric tokens, lemmatization).
3. Split the dataset into training and testing sets.
4. Transform the reviews into numerical features using TF-IDF.
5. Train a logistic regression model to predict sentiments.
6. Evaluate the model using accuracy and a classification report.
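
Steps 4 and 5 can also be chained into a single scikit-learn `Pipeline`, which keeps the vectorizer and the classifier together as one object. The sketch below is illustrative only: the four-review corpus and its labels are made up, and the notebook itself keeps these steps in separate cells.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the preprocessed reviews
toy_reviews = [
    "great film loved acting",
    "awful plot boring film",
    "wonderful story great cast",
    "terrible boring waste",
]
toy_labels = [1, 0, 1, 0]

# Vectorizer and classifier are fitted together in one call
sentiment_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression()),
])
sentiment_pipeline.fit(toy_reviews, toy_labels)

# Raw strings go in; the pipeline vectorizes them before predicting
toy_predictions = sentiment_pipeline.predict(["great cast", "boring plot"])
print(toy_predictions)
```

A pipeline like this also makes it harder to fit the vectorizer on test data by accident, since `fit` is only ever called on the training texts.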

## 1. Load the Dataset


In [16]:
import pandas as pd

# Load the movie reviews dataset
movie_reviews = pd.read_csv("movie_reviews.tsv", sep="\t")

# Display the first few rows of the dataset
print(movie_reviews.head())

# Display column names
print("Columns in the dataset:", movie_reviews.columns)


       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
3  3630_4          0  It must be assumed that those who praised this...
4  9495_8          1  Superbly trashy and wondrously unpretentious 8...
Columns in the dataset: Index(['id', 'sentiment', 'review'], dtype='object')
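
Before preprocessing, it can help to confirm that the labels are roughly balanced and that no review text is missing. A minimal sketch on a made-up four-row frame with the same column names as above (the rows are hypothetical, not taken from `movie_reviews.tsv`):

```python
import pandas as pd

# Hypothetical rows that mimic the id / sentiment / review layout
toy_frame = pd.DataFrame({
    "id": ["a_1", "b_2", "c_3", "d_4"],
    "sentiment": [1, 0, 1, 0],
    "review": ["Loved it.", "Dull and slow.", "A classic.", None],
})

label_counts = toy_frame["sentiment"].value_counts()  # number of reviews per class
missing_reviews = toy_frame["review"].isna().sum()    # rows without review text

print(label_counts)
print("Missing reviews:", missing_reviews)
```

The same two calls can be run on `movie_reviews` directly.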


## 2. Preprocess the Reviews

In [22]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk

# Download required NLTK resources (recent NLTK releases may also require 'punkt_tab' for word_tokenize)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Reload the dataset so this cell can run independently of the previous one
file_path = "movie_reviews.tsv"
movie_reviews = pd.read_csv(file_path, sep="\t")

# Verify the column names
print("Columns in the dataset:", movie_reviews.columns)

# Build the stopword set and lemmatizer once so they are reused for every review
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Lowercase, tokenize, drop stopwords and non-alphanumeric tokens, then lemmatize."""
    tokens = word_tokenize(text.lower())
    filtered_tokens = [
        lemmatizer.lemmatize(word)  # Default POS is noun, so verb forms such as "going" stay unchanged
        for word in tokens
        if word.isalnum() and word not in stop_words
    ]
    return " ".join(filtered_tokens)

# Apply preprocessing to the 'review' column
movie_reviews['processed_review'] = movie_reviews['review'].apply(preprocess_text)

# Display processed reviews
print("Processed reviews:")
print(movie_reviews[['review', 'processed_review']].head())


[nltk_data] Downloading package punkt to /home/vscode/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/vscode/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/vscode/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Columns in the dataset: Index(['id', 'sentiment', 'review'], dtype='object')
Processed reviews:
                                              review  \
0  With all this stuff going down at the moment w...   
1  \The Classic War of the Worlds\" by Timothy Hi...   
2  The film starts with a manager (Nicholas Bell)...   
3  It must be assumed that those who praised this...   
4  Superbly trashy and wondrously unpretentious 8...   

                                    processed_review  
0  stuff going moment mj started listening music ...  
1  classic war timothy hines entertaining film ob...  
2  film start manager nicholas bell giving welcom...  
3  must assumed praised film greatest filmed oper...  
4  superbly trashy wondrously unpretentious 80 ex...  
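
Two side effects of this filter are worth knowing for sentiment work. NLTK's English stopword list includes negations such as `not`, so they are removed, and `isalnum()` drops any token that contains punctuation, such as hyphenated words. The sketch below reproduces only the filtering logic in plain Python. The whitespace tokenizer and the three-word stopword set are simplifications chosen for illustration, not what the notebook uses.

```python
# Hypothetical stand-ins: a tiny stopword set and whitespace tokenization
toy_stop_words = {"the", "was", "not"}

def toy_filter(text):
    """Apply the same isalnum/stopword filter as preprocess_text, without NLTK."""
    tokens = text.lower().split()
    return [t for t in tokens if t.isalnum() and t not in toy_stop_words]

kept = toy_filter("The well-made film was not good")
print(kept)  # → ['film', 'good']
```

Here the negated review collapses to the same tokens a positive review would produce, which is one reason some sentiment pipelines keep negation words.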


## 3. Split the Dataset


In [23]:
from sklearn.model_selection import train_test_split

# Split data into features (X) and labels (y)
X = movie_reviews['processed_review']
y = movie_reviews['sentiment']

# Hold out 20% of the reviews for testing; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", len(X_train))
print("Test set size:", len(X_test))

Training set size: 20000
Test set size: 5000
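
The split above is random, so the class proportions in the two subsets can drift slightly from those of the full dataset. If exact proportions matter, `train_test_split` accepts a `stratify` argument. A minimal sketch on synthetic labels (the 100 placeholder items below are made up):

```python
from sklearn.model_selection import train_test_split

# Synthetic data: 100 placeholder texts with a 50/50 label split
toy_texts = [f"review {i}" for i in range(100)]
toy_labels = [i % 2 for i in range(100)]

toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.2,
    random_state=42,
    stratify=toy_labels,  # keep the class ratio the same in both subsets
)

print(len(toy_train), len(toy_test))
print("Positive share in test:", sum(toy_y_test) / len(toy_y_test))
```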


## 4. Transform Reviews Using TF-IDF


In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

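# Initialize the TF-IDF vectorizer, keeping only the 5,000 most frequent terms in the training corpus
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

# Fit on the training data only so no vocabulary or IDF statistics leak from the test set,
# then apply the same fitted transformation to the test data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)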

print("TF-IDF feature matrix shape (training):", X_train_tfidf.shape)
print("TF-IDF feature matrix shape (testing):", X_test_tfidf.shape)

TF-IDF feature matrix shape (training): (20000, 5000)
TF-IDF feature matrix shape (testing): (5000, 5000)
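
The fit/transform distinction is easiest to see on a tiny corpus. In the sketch below (the documents are hypothetical), the vocabulary is learned from the training documents only, so a test word that never appeared in training is silently ignored and both matrices share the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents
toy_train_docs = ["good movie", "bad movie", "great film"]
toy_test_docs = ["good film masterpiece"]  # "masterpiece" never appears in training

toy_vectorizer = TfidfVectorizer()
toy_train_matrix = toy_vectorizer.fit_transform(toy_train_docs)  # learns vocabulary and IDF weights
toy_test_matrix = toy_vectorizer.transform(toy_test_docs)        # reuses them unchanged

print(sorted(toy_vectorizer.vocabulary_))
print(toy_train_matrix.shape, toy_test_matrix.shape)
print("Non-zero test features:", toy_test_matrix.nnz)
```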


## 5. Train the Model


In [25]:
from sklearn.linear_model import LogisticRegression

# Initialize the model with scikit-learn's defaults (L2 regularization, C=1.0)
model = LogisticRegression()

# Train the model
model.fit(X_train_tfidf, y_train)

print("Model training complete.")

Model training complete.


## 6. Evaluate the Model


In [26]:
from sklearn.metrics import classification_report, accuracy_score

# Make predictions on the test set
y_pred = model.predict(X_test_tfidf)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.8826
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.87      0.88      2481
           1       0.87      0.90      0.89      2519

    accuracy                           0.88      5000
   macro avg       0.88      0.88      0.88      5000
weighted avg       0.88      0.88      0.88      5000
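
The precision and recall figures in the report are derived from the counts of correct and incorrect predictions per class. A confusion matrix exposes those counts directly. The sketch below uses five hypothetical labels, not the notebook's predictions, to show how the positive-class numbers are computed.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, not the notebook's predictions
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
toy_cm = confusion_matrix(toy_true, toy_pred)
tn, fp, fn, tp = toy_cm.ravel()

precision_pos = tp / (tp + fp)  # share of predicted positives that are correct
recall_pos = tp / (tp + fn)     # share of actual positives that were found
print(toy_cm)
print(f"precision={precision_pos:.2f} recall={recall_pos:.2f}")
```

Calling `confusion_matrix(y_test, y_pred)` on the real predictions would give the counts behind the report above.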



### Jupyter Notebook Footer Info

In [27]:
import os
import platform
from platform import python_version
from datetime import datetime

print('-----------------------------------')
print(os.name.upper())
print(platform.system(), '|', platform.release())
print('Datetime:', datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
print('Python Version:', python_version())
print('-----------------------------------')

-----------------------------------
POSIX
Linux | 6.5.0-1025-azure
Datetime: 2024-12-15 17:27:43
Python Version: 3.11.10
-----------------------------------
