# **Movie Sentiment Analysis**

## 1. Data Loading

In [15]:
import pandas as pd
import numpy as np

In [16]:
splits = {'train': 'plain_text/train-00000-of-00001.parquet', 'test': 'plain_text/test-00000-of-00001.parquet', 'unsupervised': 'plain_text/unsupervised-00000-of-00001.parquet'}
df = pd.read_parquet("hf://datasets/stanfordnlp/imdb/" + splits["train"])



In [17]:
df.head()

,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0


In [18]:
# Rename columns to more descriptive names
df = df.rename(columns={'text': 'review', 'label': 'sentiment'})
df.head()

,review,sentiment
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0


In [19]:
# Check whether the classes are balanced
df['sentiment'].value_counts()

sentiment,count
0,12500
1,12500


## 2. Data Preprocessing

In [20]:
# Clean up symbols, HTML tags, and punctuation
import re

# Simple text cleaning function
def clean_text(text):
  # Remove any HTML tag
  text = re.sub(r'<.*?>', '', text)
  # Remove non-alphabetical characters
  text = re.sub(r'[^a-zA-Z]', ' ', text)
  # Convert to lowercase
  text = text.lower()
  # Remove extra spaces
  text = re.sub(r'\s+',' ',text).strip()
  return text

# Apply to the whole dataset
df['clean_review'] = df['review'].apply(clean_text)

In [21]:
# Compare the original and cleaned versions of the first review
print("Original: ", df['review'].iloc[0])
print("Cleaned: ", df['clean_review'].iloc[0])

Original:  I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b
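
The cleaning steps can be checked on a short string. The sketch below copies `clean_text` from the cell above and adds a hypothetical variant, `clean_text_spaced` (not part of the notebook), that replaces tags with a space instead of an empty string. The sample string is made up for illustration. It shows one side effect worth knowing: when a tag sits directly between two words, deleting it fuses them into a single token.

```python
import re

def clean_text(text):
    # Copied from the notebook cell above
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def clean_text_spaced(text):
    # Hypothetical variant: swap each tag for a space so neighbors stay separate
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()
    return text

sample = "Great<br />movie, 10/10!"
print(clean_text(sample))         # greatmovie
print(clean_text_spaced(sample))  # great movie
```

In the IMDB reviews most `<br />` tags follow punctuation, which the second substitution already turns into a space, so the effect on the results reported here is probably small. Switching to the variant would also change the learned vocabulary, so the scores below would need to be recomputed.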

To turn the cleaned words into numeric features, we use **TF-IDF**:
- TF (Term Frequency): how often a word appears in a single review. The more often it appears, the higher its weight in that review.
- IDF (Inverse Document Frequency): a factor that shrinks as a word appears in more reviews. A word that shows up in almost every review carries little information, so it is down-weighted.
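
To make the two definitions concrete, here is a minimal pure-Python sketch of TF-IDF on a made-up three-review corpus. It assumes the smoothed formula that scikit-learn documents as its default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function name `tfidf_rows` and the toy reviews are hypothetical, and the sketch splits on whitespace instead of using the library's tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: a term present in every document gets the minimum value 1.0
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

toy_docs = [
    "the movie was great",
    "the movie was boring",
    "the acting was great great",
]
for row in tfidf_rows(toy_docs):
    print({term: round(weight, 3) for term, weight in sorted(row.items())})
```

In the second toy review, `boring` ends up with a larger weight than `the`, because `the` occurs in every review while `boring` occurs in only one.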

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# 1. Separate features (X) and labels (y)
X = df['clean_review']
y = df['sentiment']

# 2. Train/test split: 80/20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Initialize TF-IDF, keeping only the 5,000 most frequent terms
tfidf = TfidfVectorizer(max_features=5000)

# 4. Fit the vocabulary on the training set only, then transform both sets
# The model needs numeric input, so each review becomes a vector of TF-IDF weights
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print(f"Training Matrix Shape: {X_train_tfidf.shape}")

Training Matrix Shape: (20000, 5000)


## 3. Model Training

In [23]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

# 1. Initialize the Linear Support Vector Classification model
model = LinearSVC()

# 2. Train
model.fit(X_train_tfidf, y_train)

# 3. Predict on the held-out test set
predictions = model.predict(X_test_tfidf)

# 4. Evaluate
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 87.84%


In [24]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.88      0.87      0.88      2515
           1       0.87      0.88      0.88      2485

    accuracy                           0.88      5000
   macro avg       0.88      0.88      0.88      5000
weighted avg       0.88      0.88      0.88      5000
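
For readers unfamiliar with the report columns, the sketch below recomputes precision, recall, and F1 by hand for one class. The helper `precision_recall_f1` and the six toy labels are hypothetical and only illustrate the definitions. They are not taken from the notebook's data.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: of everything predicted as this class, how much was right?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything truly in this class, how much was found?
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
p, r, f = precision_recall_f1(toy_true, toy_pred, positive=1)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The `support` column in the report is simply the number of true examples of each class in the test set.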



## 4. Model Testing

In [25]:
def predict_sentiment(text):
  # 1. Clean text using previous function
  clean = clean_text(text)

  # 2. Vectorize the words
  vec = tfidf.transform([clean])

  # 3. Predict
  prediction = model.predict(vec)[0]

  return "Positive" if prediction == 1 else "Negative"

In [27]:
# Try a few tricky reviews: sarcasm, mixed feelings, and a neutral tone
review1 = "I loved how absolutely boring this film was."
review2 = "I hate that I love this."
review3 = "It was okay, not great but not bad either."

print(f"Review: '{review1}' -> {predict_sentiment(review1)}")
print(f"Review: '{review2}' -> {predict_sentiment(review2)}")
print(f"Review: '{review3}' -> {predict_sentiment(review3)}")

Review: 'I loved how absolutely boring this film was.' -> Negative
Review: 'I hate that I love this.' -> Positive
Review: 'It was okay, not great but not bad either.' -> Negative
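
One plausible contributor to these predictions, offered as an assumption rather than a diagnosis: with default settings, `TfidfVectorizer` builds features from single words, so word order and negation scope are lost. The sketch below uses a hypothetical helper, `bag_of_words`, and two made-up sentences to show that opposite statements can produce the same bag of words.

```python
from collections import Counter

def bag_of_words(text):
    # Count each lowercase token, ignoring the order in which tokens appear
    return Counter(text.lower().split())

praise = "the plot was good not bad"
complaint = "the plot was bad not good"

# Both sentences contain exactly the same words, so a unigram model
# receives identical features for them
print(bag_of_words(praise) == bag_of_words(complaint))  # True
```

Passing `ngram_range=(1, 2)` to `TfidfVectorizer` adds word pairs such as `not bad` as features and may help with cases like the third review. Whether it improves this model has not been tested here.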
