# Business Context

**Introduction:**
Sentiment analysis is a natural language processing (NLP) technique for determining the emotional tone of a piece of text. Applied to movie reviews, it classifies each review as positive, negative, or neutral; the IMDB dataset used in this notebook is labeled only positive or negative.

**How it works:**

* **Data Collection:** Gathering a large dataset of movie reviews labeled with their sentiment (positive or negative).
* **Text Preprocessing:** Cleaning the data by removing stop words and punctuation and converting text to lowercase.
* **Feature Extraction:** Identifying key words or phrases that indicate positive or negative sentiment.
* **Model Training:** Using machine learning algorithms to train a model on the preprocessed data to classify new reviews.
* **Sentiment Prediction:** Applying the trained model to new, unlabeled movie reviews to determine their sentiment.
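
The five steps above can be sketched end to end on a toy scale. This is only an illustration using the standard library: the four example reviews, the word-count scoring rule, and the function names are assumptions made for the sketch, not the models trained later in this notebook.

```python
import re
from collections import Counter

# Hypothetical toy data standing in for labeled reviews (1 = positive, 0 = negative).
train = [
    ("A wonderful, moving film!", 1),
    ("Great acting and a wonderful story.", 1),
    ("Boring plot and terrible acting.", 0),
    ("A terrible, boring mess.", 0),
]

def preprocess(text):
    # Preprocessing: lowercase and keep alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())

def train_word_scores(examples):
    # Feature extraction + training: count each word in positive and negative reviews.
    pos, neg = Counter(), Counter()
    for text, label in examples:
        (pos if label == 1 else neg).update(preprocess(text))
    # A word's score is its positive count minus its negative count.
    return {w: pos[w] - neg[w] for w in set(pos) | set(neg)}

def predict(text, scores):
    # Prediction: sum the scores of known words; unknown words contribute 0.
    total = sum(scores.get(w, 0) for w in preprocess(text))
    return 1 if total > 0 else 0

scores = train_word_scores(train)
print(predict("What a wonderful story", scores))  # 1
print(predict("Terrible and boring", scores))     # 0
```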

# Importing Dataset and Libraries

In [None]:
! pip install kaggle
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
! unzip imdb-dataset-of-50k-movie-reviews.zip


cp: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory
Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
 97% 25.0M/25.7M [00:01<00:00, 24.9MB/s]
100% 25.7M/25.7M [00:01<00:00, 19.1MB/s]
Archive:  imdb-dataset-of-50k-movie-reviews.zip
  inflating: IMDB Dataset.csv        


In [2]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import spacy
import re
import string
import nltk

import numpy as np
import pandas as pd

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
df = pd.read_csv('IMDB Dataset.csv')
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


# Exploratory Data Analysis

In [None]:
df.sentiment.value_counts()

| sentiment | count |
|---|---|
| positive | 25000 |
| negative | 25000 |


In [None]:
df.review[10]

'Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines.<br /><br />At first it was very odd and pretty funny but as the movie progressed I didn\'t find the jokes or oddness funny anymore.<br /><br />Its a low budget film (thats never a problem in itself), there were some pretty interesting characters, but eventually I just lost interest.<br /><br />I imagine this film would appeal to a stoner who is currently partaking.<br /><br />For something similar but better try "Brother from another planet"'

# Text Preprocessing

## Cleaning HTML Tags and Punctuation

In [None]:
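# Compile the patterns once instead of on every call.
HTML_TAG = re.compile(r'<.*?>')
# Inside [...], "|" is a literal character, not alternation, so it is listed once.
REMOVE_CHARS = re.compile(r'[?!\'"#|]')
SPACE_CHARS = re.compile(r'[.,()/]')

def clean_html(sentence):
    # Replace every HTML tag (e.g. <br />) with a space.
    return HTML_TAG.sub(' ', sentence)

def clean_punc(sentence):
    # Drop quotes and similar marks, then turn separators into spaces.
    cleaned = REMOVE_CHARS.sub('', sentence)
    return SPACE_CHARS.sub(' ', cleaned)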

In [None]:
df['clean_html'] = df.review.apply(clean_html)
df['clean_html_punc'] = df.clean_html.apply(clean_punc)
df['clean_html_punc'][10]

'Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines   At first it was very odd and pretty funny but as the movie progressed I didnt find the jokes or oddness funny anymore   Its a low budget film  thats never a problem in itself   there were some pretty interesting characters  but eventually I just lost interest   I imagine this film would appeal to a stoner who is currently partaking   For something similar but better try Brother from another planet'
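
One detail of `clean_punc` is easy to miss: inside a regex character class, `|` is an ordinary character rather than an "or" operator, so pipe characters are stripped along with the punctuation. The short sketch below shows the difference on a made-up sample string.

```python
import re

sample = "yes|no? (maybe) a/b"  # hypothetical input

# With "|" written between the characters, the class also matches "|" itself.
with_pipes = re.sub(r'[?|!|\'|"|#]', '', sample)

# Listing each character once gives a class that leaves "|" alone.
without_pipes = re.sub(r'[?!\'"#]', '', sample)

print(with_pipes)     # yesno (maybe) a/b
print(without_pipes)  # yes|no (maybe) a/b
```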

## Removing Stopwords

In [None]:
# Build the stop-word set once; calling stopwords.words() for every token is very slow.
STOP_WORDS = set(stopwords.words('english'))

def remove_stop(text):
    # Tokenize the text into words
    words = word_tokenize(text)

    # Filter out the stop words (the NLTK list is lowercase, so capitalized words are kept)
    filtered_words = [word for word in words if word not in STOP_WORDS]

    # Join the filtered words back into a string
    return ' '.join(filtered_words)

In [None]:
df['no_stop'] = df.clean_html_punc.apply(remove_stop)
df['no_stop'][10]

'Phil Alien one quirky films humour based around oddness everything rather actual punchlines At first odd pretty funny movie progressed I didnt find jokes oddness funny anymore Its low budget film thats never problem pretty interesting characters eventually I lost interest I imagine film would appeal stoner currently partaking For something similar better try Brother another planet'
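
Words such as "At", "Its", and "For" survive in the output above because the comparison is case-sensitive and the NLTK English stop-word list is lowercase. If that is not intended, one option is to compare the lowercased token. The sketch below uses a small hypothetical stop-word set instead of the NLTK list so that it runs on its own.

```python
# Hypothetical mini stop-word list; NLTK's English list is likewise all lowercase.
STOP_WORDS_DEMO = {"at", "for", "its", "the", "is", "a"}

def remove_stop_case_insensitive(text):
    # Compare the lowercased token, but keep the original token in the output.
    kept = [w for w in text.split() if w.lower() not in STOP_WORDS_DEMO]
    return " ".join(kept)

print(remove_stop_case_insensitive("At first the film is odd"))  # first film odd
```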

## Lemmatization

In [None]:
# prepare spacy model for lemmatization
import spacy.cli
spacy.cli.download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
# defining lemmatization function
def lemm(text):
    lemme=[]
    for token in nlp(text):
        lemme.append(token.lemma_)

    return " ".join(lemme)

In [None]:
df['lemma'] = df.no_stop.apply(lemm)
df['lemma'][10]

'Phil Alien one quirky film humour base around oddness everything rather actual punchline at first odd pretty funny movie progress I do not find joke oddness funny anymore its low budget film that s never problem pretty interesting character eventually I lose interest I imagine film would appeal stoner currently partake for something similar well try Brother another planet'

## Missing Values and Duplicated Data

In [None]:
df[['lemma','sentiment']].to_csv('cleaned_data.csv', index=False)

In [3]:
df = pd.read_csv('cleaned_data.csv')

In [8]:
df.dropna(inplace=True)

In [9]:
df.duplicated().sum()

289

In [10]:
df.drop_duplicates(inplace=True)

In [12]:
df.reset_index(drop=True, inplace=True)

# Feature Extraction

## TF-IDF Vectorizer

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
from sklearn.model_selection import train_test_split

In [None]:
df.sentiment = df.sentiment.apply(lambda x: 1 if x == 'positive' else 0)

In [13]:
x_train, x_test, y_train, y_test = train_test_split(df.lemma, df.sentiment, test_size=0.2, random_state=42, shuffle=True,stratify=df.sentiment)

In [14]:
len(max(df.lemma.tolist(), key=len))

9101

In [15]:
vectorizer1 = TfidfVectorizer(max_features=5000 , ngram_range=(1, 3), sublinear_tf=False)

In [16]:
vectorizer1.fit(x_train)

In [17]:
x_train_tfidf = vectorizer1.transform(x_train)
x_test_tfidf = vectorizer1.transform(x_test)

In [75]:
from scipy.sparse import csr_matrix

In [76]:
sparse_matrix = csr_matrix(x_test_tfidf)

# Show the stored values
print(sparse_matrix.data)

[0.07596665 0.09622959 0.05385915 ... 0.12974485 0.1776897  0.1388134 ]


In [82]:
sparse_matrix.data.min()

0.0044730612980563925
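
The stored values all lie between 0 and 1 because each row is L2-normalized after weighting. The sketch below recomputes TF-IDF by hand on a toy corpus, assuming the scikit-learn defaults that `vectorizer1` leaves unchanged (`smooth_idf=True`, `norm='l2'`) together with `sublinear_tf=False`. It uses unigrams only and invented documents, so it illustrates the weighting rather than reproducing the notebook's matrix.

```python
import math
from collections import Counter

docs = ["good film", "bad film", "good good story"]  # hypothetical toy corpus

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for doc in tokenized for term in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in doc_freq}

def tfidf(doc):
    # Raw term count times idf, then scale the row to unit Euclidean length.
    counts = Counter(doc)
    weights = {t: counts[t] * idf[t] for t in counts}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[2])
print(vec)
```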

# Hybrid Approach: Feature Concatenation

## Create Word Embeddings

**Ensure consistency between the size of TF-IDF features and word embeddings**

In [22]:
from gensim.models import Word2Vec


# Train Word2Vec model
sentences = [text.split() for text in df['lemma']]
word2vec = Word2Vec(sentences, min_count=1, vector_size=256)  # Adjust size as needed

# Create document vectors
def create_doc_vectors(reviews):
    doc_vectors = []
    for review in reviews:
        words = review.split()
        word_vecs = []
        for word in words:
            if word in word2vec.wv.key_to_index:
                word_vecs.append(word2vec.wv[word])
        if len(word_vecs) > 0:
            doc_vectors.append(np.mean(word_vecs, axis=0))
        else:
            doc_vectors.append(np.zeros(word2vec.wv.vector_size))  # Zero vector (same size as the embeddings) when no word is known
    return doc_vectors
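
`create_doc_vectors` represents each review by the average of its word vectors and falls back to a zero vector when none of its words are in the vocabulary. The fallback needs the same length as the embeddings, otherwise the rows cannot be stacked into one matrix. A minimal sketch of the same idea, with a hypothetical three-dimensional embedding table in place of the trained Word2Vec model:

```python
# Hypothetical 3-dimensional embeddings standing in for the trained Word2Vec vectors.
EMBEDDINGS = {
    "good": [1.0, 0.0, 2.0],
    "film": [3.0, 2.0, 0.0],
}
DIM = 3

def doc_vector(text):
    # Collect the vectors of the words that are in the vocabulary.
    vecs = [EMBEDDINGS[w] for w in text.split() if w in EMBEDDINGS]
    if not vecs:
        # No known word: return a zero vector of the embedding dimension.
        return [0.0] * DIM
    # Average component by component.
    return [sum(col) / len(vecs) for col in zip(*vecs)]

print(doc_vector("good film unknownword"))  # [2.0, 1.0, 1.0]
print(doc_vector("unknownword"))            # [0.0, 0.0, 0.0]
```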




In [None]:
X = create_doc_vectors(df['lemma'])
y = df['sentiment']

In [28]:
x_train2, x_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True,stratify=df.sentiment)

In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

LR2 = LogisticRegression(max_iter=1000)
LR2.fit(x_train2, y_train2)
y_pred2 = LR2.predict(x_test2)
print(f"Accuracy: {accuracy_score(y_test2, y_pred2)}")

Accuracy: 0.8646057991852384


## Concatenate Features

In [35]:
# Concatenate TF-IDF and word embeddings
X_train_combined = np.hstack((x_train_tfidf.toarray(), x_train2))
X_test_combined = np.hstack((x_test_tfidf.toarray(), x_test2))
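
Calling `.toarray()` turns the sparse TF-IDF matrix into a dense one. With roughly 40,000 training rows and 5,000 TF-IDF columns, that is on the order of 1.6 GB in float64. If memory is tight, one alternative is to keep the combined matrix sparse with `scipy.sparse.hstack`. The sketch below uses small made-up matrices in place of the notebook's features.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical small stand-ins for the TF-IDF matrix and the document vectors.
tfidf_part = csr_matrix(np.array([[0.0, 0.5, 0.0], [0.7, 0.0, 0.0]]))
embedding_part = np.array([[0.1, -0.2], [0.3, 0.4]])

# Convert the dense block to sparse and stack column-wise; the result stays in CSR format.
combined = hstack([tfidf_part, csr_matrix(embedding_part)], format="csr")

print(combined.shape)  # (2, 5)
```

`LogisticRegression` accepts sparse input, so a matrix built this way could be passed to `fit` directly.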


In [40]:
LR3 = LogisticRegression(max_iter=1000)
LR3.fit(X_train_combined, y_train)
y_pred3 = LR3.predict(X_test_combined)
print(f"Accuracy: {accuracy_score(y_test, y_pred3)}")

Accuracy: 0.8919242751018452


In [71]:
LR_combined = LogisticRegression(max_iter=1000, C=2.5, penalty='elasticnet', solver='saga', l1_ratio= 0.5)
LR_combined.fit(X_train_combined, y_train)
y_pred4 = LR_combined.predict(X_test_combined)
print(f"Accuracy: {accuracy_score(y_test, y_pred4) * 100}")

Accuracy: 89.51593577761801




## Scaling Features

In [98]:
# importing minmax scaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(x_train2)
x_train_scaled = scaler.transform(x_train2)
x_test_scaled = scaler.transform(x_test2)


# Concatenate TF-IDF and word embeddings
X_train_combined_sc = np.hstack((x_train_tfidf.toarray(), x_train_scaled))
X_test_combined_sc = np.hstack((x_test_tfidf.toarray(), x_test_scaled))
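
Averaged word vectors contain negative components, while `MultinomialNB`, used in the ensemble later, requires non-negative features; that is presumably why the embeddings are scaled here. The sketch below reimplements min-max scaling by hand on made-up numbers to show two points: the minimum and maximum come from the training rows only, and test rows can therefore land outside [0, 1].

```python
def fit_min_max(rows):
    # Column-wise minimum and maximum of the training rows.
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def transform_min_max(rows, mins, maxs):
    # (value - min) / (max - min) per column; constant columns map to 0.0.
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, mins, maxs)
        ])
    return scaled

train_rows = [[-1.0, 2.0], [1.0, 4.0], [0.0, 3.0]]  # hypothetical embedding values
test_rows = [[2.0, 1.0]]

mins, maxs = fit_min_max(train_rows)
print(transform_min_max(train_rows, mins, maxs))  # [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
print(transform_min_max(test_rows, mins, maxs))   # [[1.5, -0.5]]
```

If out-of-range test values are a concern, clipping them to [0, 1] after scaling is one option.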


# Model Selection

In [85]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# Import models that work well with text data:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier


In [20]:
NB = MultinomialNB()
LR = LogisticRegression()
RF = RandomForestClassifier()
models = [NB, LR, RF]

In [21]:
for model in models:
    model.fit(x_train_tfidf, y_train)
    y_pred = model.predict(x_test_tfidf)
    print(f"Model: {model.__class__.__name__}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Model: MultinomialNB
Accuracy: 0.8560987299305056
Model: LogisticRegression
Accuracy: 0.8907260963335729
Model: RandomForestClassifier
Accuracy: 0.8474718427989456


In [69]:
from xgboost import XGBClassifier

In [70]:
xg = XGBClassifier()
xg.fit(x_train_tfidf, y_train)
y_pred = xg.predict(x_test_tfidf)
print(f"Accuracy: {accuracy_score(y_test, y_pred)*100}")

Accuracy: 85.87347232207046
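
The comparison above reports accuracy only, although `precision_score`, `recall_score`, and `f1_score` are imported as well. As a reminder of what those metrics measure for the positive class, here is a hand-rolled sketch on invented labels; it illustrates the definitions and is not output from the notebook's models.

```python
def classification_metrics(y_true, y_pred):
    # Confusion counts for the positive class (label 1).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true_demo = [1, 1, 1, 0, 0, 0]  # hypothetical labels
y_pred_demo = [1, 1, 0, 1, 0, 0]  # hypothetical predictions
print(classification_metrics(y_true_demo, y_pred_demo))
```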


# Hyperparameter Tuning for Logistic Regression

In [97]:
LR_tf = LogisticRegression(max_iter=500, C=2.5, penalty='elasticnet', solver='saga', l1_ratio= 0.5)
LR_tf.fit(x_train_tfidf, y_train)
y_pred = LR_tf.predict(x_test_tfidf)
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100}")

Accuracy: 89.37215432542536
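
The cell above evaluates a single hand-picked configuration. A more systematic alternative is a grid search that scores each combination on validation data rather than on the test set. The skeleton below shows only the search loop: `toy_score` is a hypothetical stand-in for "fit the model and return validation accuracy", chosen so the example has a known optimum, and says nothing about which values work best for this dataset.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    # Try every combination of parameter values and keep the best-scoring one.
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function; a real one would train and validate a model.
def toy_score(params):
    return -abs(params["C"] - 1.0) - abs(params["l1_ratio"] - 1.0)

grid = {"C": [0.5, 1.0, 2.5, 5.0], "l1_ratio": [0.0, 0.5, 1.0]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```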


# Ensemble of Embedding and TF-IDF Models

In [117]:
from sklearn.ensemble import VotingClassifier


# Create individual models
clf1 = LogisticRegression(random_state=1,max_iter=1000, C=2.5, penalty='elasticnet', solver='saga', l1_ratio= 0.5)
clf2 = XGBClassifier(random_state=1)
clf3 = MultinomialNB()

# Create the ensemble model
voting_clf = VotingClassifier(
    estimators=[('lr', clf1), ('xg', clf2), ('NB', clf3)],
    voting='hard')  # or 'soft' for soft voting

# Fit the model
voting_clf.fit(X_train_combined_sc, y_train)

# Predict on test data
predictions = voting_clf.predict(X_test_combined_sc)

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

Accuracy: 0.8856937455068297
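
With `voting='hard'`, the ensemble predicts the class that most of its models choose for each sample. The sketch below shows that majority rule on made-up predictions from three models; with three voters and two classes, ties cannot occur.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    # For each sample, take the label predicted by the most models.
    voted = []
    for votes in zip(*predictions_per_model):
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted

# Hypothetical per-model predictions for four samples.
lr_pred = [1, 0, 1, 1]
xgb_pred = [1, 1, 0, 1]
nb_pred = [0, 0, 1, 1]
print(hard_vote([lr_pred, xgb_pred, nb_pred]))  # [1, 0, 1, 1]
```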
