<a href="https://colab.research.google.com/github/SanskarJadhav/NaturalLanguageProcessing/blob/main/NLP_IMDBSentiment_6076.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis Performed On IMDB Dataset
By Sanskar Jadhav PRN 21070126076 AIML B1

In [None]:
# importing the required libraries for preprocessing
import pandas as pd
import numpy as np
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
import spacy
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# loading the dataset into a DataFrame
# the CSV is read directly from Dropbox, so there is no need to upload it to Colab manually
imdb_df = pd.read_csv("https://www.dropbox.com/scl/fi/by1d23shj320pz7wcljb6/IMDB-Dataset.csv?rlkey=448p28vifmtrx38bf24b16ypy&dl=1")
imdb_df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
# check for null values
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [None]:
# selecting the reviews column for text preprocessing
reviews = imdb_df['review']
reviews[:5]

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. <br /><br />The...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: review, dtype: object

## Data Preprocessing

In [None]:
# the raw reviews contain HTML line breaks, URLs, digits and punctuation, so they are cleaned first
clean_reviews = []
# defining a regular expression pattern to match URLs
url_pattern = re.compile(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", re.MULTILINE)
for review in reviews:
  # handling html tags before removing symbols
  review = re.sub("\n", " ", review)
  review = re.sub("<br /><br />", " ", review)
  # removing URLs
  review = re.sub(url_pattern, " ", review)
  # removing years (this drops any run of four digits); raw string avoids an invalid-escape warning
  review = re.sub(r"\d{4}", " ", review)
  # removing symbols
  review = re.sub("-", " ", review)
  review = re.sub("'", "", review)
  review = re.sub("[^A-Za-z0-9 ]+"," ", review)
  # removing extra spaces
  review = re.sub(" +", " ", review)
  clean_reviews.append(review)

In [None]:
clean_reviews[:5]

['One of the other reviewers has mentioned that after watching just 1 Oz episode youll be hooked They are right as this is exactly what happened with me The first thing that struck me about Oz was its brutality and unflinching scenes of violence which set in right from the word GO Trust me this is not a show for the faint hearted or timid This show pulls no punches with regards to drugs sex or violence Its is hardcore in the classic use of the word It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary It focuses mainly on Emerald City an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda Em City is home to many Aryans Muslims gangstas Latinos Christians Italians Irish and more so scuffles death stares dodgy dealings and shady agreements are never far away I would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare Forget pretty 
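
The cleaning steps are easier to check on a single string than on 50,000 reviews. The sketch below applies the same substitutions to a made-up sample; the function name `clean_review`, the sample text, and the simplified URL pattern are assumptions for illustration, not part of the notebook.

```python
import re

def clean_review(review):
    # Mirrors the substitutions in the cleaning loop, applied to one string.
    review = re.sub(r"\n", " ", review)
    review = re.sub(r"<br /><br />", " ", review)
    review = re.sub(r"https?://\S+", " ", review)  # simplified URL pattern (assumption)
    review = re.sub(r"\d{4}", " ", review)          # years, or any four digits
    review = re.sub(r"-", " ", review)
    review = re.sub(r"'", "", review)
    review = re.sub(r"[^A-Za-z0-9 ]+", " ", review)
    return re.sub(r" +", " ", review)

sample = "It's a well-made film from 1999.<br /><br />See http://example.com for more!"
cleaned_sample = clean_review(sample)
print(repr(cleaned_sample))
```

Note that the apostrophe is deleted rather than replaced by a space, so "It's" becomes "Its", and that a review ending in punctuation keeps a trailing space.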

In [None]:
# "".join adds no separator: a review that does not end in a space fuses with the next one
full_text = "".join(clean_reviews)
full_text[1500:3600] # to confirm it's one large string

'eir lack of street skills or prison experience Watching Oz you may become comfortable with what is uncomfortable viewing thats if you can get in touch with your darker side A wonderful little production The filming technique is very unassuming very old time BBC fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece The actors are extremely well chosen Michael Sheen not only has got all the polari but he has all the voices down pat too You can truly see the seamless editing guided by the references to Williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece A masterful production about one of the great masters of comedy and his life The realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears It plays on our knowledge and our senses particularly with the scenes concerning Orton and Halliwel

In [None]:
# make all text lowercase
full_text = full_text.lower()
# tokenization
tokens = word_tokenize(full_text)
print(len(tokens))

11536215


In [None]:
stop_words = []
new_stopwords = ["you'll", "i've", "i'll", "who'll", "they'll", "we've", "a",
                 "i", "i'd", "i'm", "my", "his", "her", "our", "their"]
swlist = stopwords.words('english')
swlist.extend(new_stopwords)
for word in swlist:
  word = re.sub("'", "", word)
  stop_words.append(word)
set_stop_words = set(stop_words)

In [None]:
print(set_stop_words)

{'himself', 'did', 'of', 'hasn', 'they', 'here', 'aren', 'im', 'not', 'up', 'm', 'because', 'you', 'can', 'o', 'was', 'about', 'we', 't', 'yourselves', 'shouldnt', 'just', 'that', 'a', 'didn', 'hadnt', 'above', 'is', 'couldn', 'against', 'into', 'wasn', 'ill', 'its', 'hasnt', 'under', 'same', 'before', 'weren', 'when', 'themselves', 'myself', 'whom', 'after', 'these', 'do', 'youre', 'than', 'i', 'ive', 'will', 'hers', 'doesnt', 'ain', 'by', 'll', 'isn', 'if', 'for', 'shouldve', 'needn', 'am', 'wouldnt', 'have', 'don', 'won', 'havent', 'what', 'been', 'itself', 'has', 'down', 'further', 'd', 'mustnt', 'shant', 'shes', 'does', 'so', 'from', 'mightnt', 'as', 'isnt', 'shan', 'off', 'weve', 'mustn', 'his', 'should', 'again', 've', 'until', 'wont', 'youve', 'shouldn', 'haven', 'then', 'very', 'nor', 'them', 'were', 'id', 'now', 'mightn', 'on', 'once', 'ma', 'which', 'at', 'it', 're', 'with', 'wholl', 'but', 'couldnt', 'or', 'how', 'only', 'more', 'over', 'theirs', 'who', 'most', 'those', 'be
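
The apostrophes are stripped from the stop words because the cleaning step already removed them from the reviews, so "don't" appears in the text as "dont". A minimal sketch of the effect, using a made-up sentence and a tiny hypothetical stop-word list:

```python
import re

def strip_apostrophes(text):
    # Same substitution the cleaning loop applies to each review.
    return re.sub("'", "", text)

review = "I don't think you'll like it"
cleaned = strip_apostrophes(review).lower().split(" ")

raw_stop_words = {"i", "don't", "you'll", "it"}
normalised_stop_words = {strip_apostrophes(w) for w in raw_stop_words}

# With the raw list, "dont" and "youll" slip through the filter.
kept_raw = [w for w in cleaned if w not in raw_stop_words]
# With the normalised list, they are removed as intended.
kept_normalised = [w for w in cleaned if w not in normalised_stop_words]

print(kept_raw)
print(kept_normalised)
```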

In [None]:
filtered_tokens = [w for w in tokens if w not in set_stop_words]
print(len(filtered_tokens))

5995506


In [None]:
print(len(set(filtered_tokens)))

109636
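
The three counts printed so far measure different things: the total number of tokens, the number of tokens left after stop-word removal, and the number of distinct tokens (the vocabulary size). A toy illustration with made-up data:

```python
from collections import Counter

# Toy corpus standing in for the tokenized reviews (hypothetical data).
tokens = "the film was good and the cast was good".split()
stop_words = {"the", "was", "and"}

# Stop-word filtering shrinks the total number of tokens...
filtered = [t for t in tokens if t not in stop_words]

# ...while the vocabulary is the set of distinct tokens that remain.
vocabulary = Counter(filtered)

print(len(tokens))      # total tokens before filtering
print(len(filtered))    # total tokens after filtering
print(len(vocabulary))  # distinct tokens after filtering
```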


In [None]:
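# Stop-word removal cut the token count from over 11.5 million to about 6 million,
# and the filtered text contains 109,636 distinct tokens (the vocabulary).
# Now rebuild each review using only those tokens.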
tokens_in_reviews = []
filteredtokenset = set(filtered_tokens)
for i in clean_reviews:
  review = i.lower()
  words = review.split(" ")
  listoftokens = [token for token in words if token in filteredtokenset]
  stringoftokens = " ".join(listoftokens)
  tokens_in_reviews.append(stringoftokens)
tokens_in_reviews[:5]

['one reviewers mentioned watching 1 oz episode hooked right exactly happened first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use word called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home many aryans muslims gangstas latinos christians italians irish scuffles death stares dodgy dealings shady agreements never far away would say main appeal show due fact goes shows dare forget pretty pictures painted mainstream audiences forget charm forget romance oz mess around first episode ever saw struck nasty surreal say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards sold nickel inmates kill order get away well mannered middle class inmates turned prison bitches due lack street sk

In [None]:
# perform lemmatization with SpaCy
lemmatizedreviews = []
for doc in nlp.pipe(tokens_in_reviews, batch_size=500, n_process=2,
                    disable=["parser", "ner"]):
    lemstring = " ".join([tok.lemma_ for tok in doc])
    #print(lemstring) did this to confirm it was running correctly
    lemmatizedreviews.append(lemstring)

## Vectorization

In [None]:
# CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
countvector = CountVectorizer()
countvector.fit(lemmatizedreviews)
#print("Vocabulary: ", countvector.vocabulary_)

In [None]:
# encode document
cv_reviews = countvector.transform(lemmatizedreviews)
# summarize encoded vector
print(cv_reviews.shape)

(50000, 91689)


In [None]:
# TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvector = TfidfVectorizer()
tfidfvector.fit(lemmatizedreviews)
#print("Vocabulary: ", tfidfvector.vocabulary_)
# printing vocabulary of count and tfidf took up >900 pages in the PDF
# thus commented the code and ran again before downloading

In [None]:
tv_reviews = tfidfvector.transform(lemmatizedreviews)
print(tv_reviews.shape)

(50000, 91689)
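
Both matrices have the same shape because the two vectorizers build the same vocabulary by default; only the cell values differ. A small sketch on made-up documents, assuming default scikit-learn settings (TF-IDF rows are L2-normalised by default):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy documents, not taken from the IMDB data.
docs = ["good film good cast", "bad film", "good plot bad ending"]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

# Same tokenization, same vocabulary, hence the same shape.
print(counts.shape, tfidf.shape)

# Count rows hold raw frequencies; TF-IDF rows are reweighted and normalised.
first_count_row = counts.toarray()[0]
first_tfidf_row = np.round(tfidf.toarray()[0], 2)
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

print(first_count_row)
print(first_tfidf_row)
```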


In [None]:
from gensim.models import Word2Vec
# Word2Vec model using CBOW architecture (sg=0)
# each review becomes a list of lemmas, the input format Word2Vec expects
lemwordlist = [review.split(" ") for review in lemmatizedreviews]
model1 = Word2Vec(lemwordlist, sg=0)

In [None]:
print(len(model1.wv))

32550


In [None]:
w2v_vectors = []
for review in lemwordlist:
    vectors = [model1.wv[word] for word in review if word in model1.wv]
    if vectors:
        doc_vector = np.mean(vectors, axis=0)
    else: # no word in the model's vocab: use zeros so rows stay aligned with the labels
        doc_vector = np.zeros(model1.wv.vector_size)
    w2v_vectors.append(doc_vector)

In [None]:
# GoogleNews Word2Vec
import gensim.downloader as api
model2 = api.load("word2vec-google-news-300")  # download the model and return as object ready for use



In [None]:
googlew2v_vectors = []
for review in lemwordlist:
    vectors = [model2[word] for word in review if word in model2]
    if vectors:
        doc_vector = np.mean(vectors, axis=0)
    else: # no word in the model's vocab: use zeros so rows stay aligned with the labels
        doc_vector = np.zeros(model2.vector_size)
    googlew2v_vectors.append(doc_vector)
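
Averaging word vectors gives one fixed-length vector per review, but every review must produce a row, otherwise the feature matrix ends up shorter than the label column. The sketch below shows the idea with a zero-vector fallback; the tiny `embeddings` dictionary and the `document_vector` helper are hypothetical stand-ins for a trained model.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained Word2Vec model.
embeddings = {
    "good": np.array([1.0, 0.0, 1.0]),
    "film": np.array([0.0, 2.0, 1.0]),
}
vector_size = 3

def document_vector(words):
    # Average the vectors of in-vocabulary words; fall back to zeros otherwise.
    vectors = [embeddings[w] for w in words if w in embeddings]
    if vectors:
        return np.mean(vectors, axis=0)
    return np.zeros(vector_size)

docs = [["good", "film"], ["unknownword"], ["film", "xyz"]]
X_docs = np.array([document_vector(d) for d in docs])
print(X_docs.shape)  # one row per document, so rows stay aligned with the labels
print(X_docs)
```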

## Machine Learning Implementation

According to the Kaggle page for this dataset, the expected train:test ratio is 50:50. We also set `shuffle=False` so that the test set (the last 25,000 reviews) is identical for every feature set and every algorithm, which keeps the evaluation and comparison fair.
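
A minimal sketch of what `shuffle=False` does, using two made-up feature matrices for the same ten documents (all names and data here are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: two different feature matrices for the same 10 documents.
labels = np.array(["pos", "neg"] * 5)
features_a = np.arange(10).reshape(10, 1)
features_b = np.arange(10, 20).reshape(10, 1)

_, test_a, _, y_test_a = train_test_split(features_a, labels, test_size=0.5, shuffle=False)
_, test_b, _, y_test_b = train_test_split(features_b, labels, test_size=0.5, shuffle=False)

# Without shuffling, the test set is simply the last half of the rows,
# so both feature sets are evaluated on the same documents.
print(test_a.ravel())
print(test_b.ravel())
```

An unshuffled split is only safe when the file is not ordered by label. Passing `shuffle=True` with the same `random_state` in every call would also give identical splits without relying on the file order.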

### Logistic Regression

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# using cv_reviews
X_train, X_test, y_train, y_test = train_test_split(cv_reviews, imdb_df['sentiment'], test_size=0.5, shuffle=False)
# defining the hyperparameter grid for GridSearchCV
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10],  # regularization parameters
    'max_iter': [300],     # limit max no. of iterations to a balance between time and convergence
}
logreg = LogisticRegression()
grid_search = GridSearchCV(logreg, param_grid, cv=5, scoring='accuracy', verbose=1)
# finding the best parameters
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
# making predictions
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Best Model Accuracy:", accuracy)
# classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)
# confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best Model Accuracy: 0.884
Classification Report:
               precision    recall  f1-score   support

    negative       0.89      0.88      0.88     12474
    positive       0.88      0.89      0.88     12526

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000

Confusion Matrix:
 [[10949  1525]
 [ 1375 11151]]
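
As a cross-check, the headline numbers in the report can be recomputed by hand from the confusion matrix. The sketch assumes scikit-learn's convention of rows as true labels and columns as predicted labels, in the order (negative, positive).

```python
# Confusion matrix copied from the output above.
cm = [[10949, 1525],
      [1375, 11151]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision: of everything predicted negative, how much was truly negative.
precision_negative = cm[0][0] / (cm[0][0] + cm[1][0])
# Recall: of everything truly negative, how much was predicted negative.
recall_negative = cm[0][0] / (cm[0][0] + cm[0][1])

print(round(accuracy, 3))
print(round(precision_negative, 2), round(recall_negative, 2))
```

The row sums (12,474 and 12,526) match the support column of the classification report.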


In [None]:
logresacc = []
logresacc.append(accuracy)

In [None]:
# using tv_reviews
X_train, X_test, y_train, y_test = train_test_split(tv_reviews, imdb_df['sentiment'], test_size=0.5, shuffle=False)
# reusing param_grid from the CountVectorizer run
logreg = LogisticRegression()
grid_search = GridSearchCV(logreg, param_grid, cv=5, scoring='accuracy', verbose=1)
# finding the best parameters
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
# making predictions
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
logresacc.append(accuracy)
print("Best Model Accuracy:", accuracy)
# classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)
# confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best Model Accuracy: 0.88916
Classification Report:
               precision    recall  f1-score   support

    negative       0.90      0.88      0.89     12474
    positive       0.88      0.90      0.89     12526

    accuracy                           0.89     25000
   macro avg       0.89      0.89      0.89     25000
weighted avg       0.89      0.89      0.89     25000

Confusion Matrix:
 [[10985  1489]
 [ 1282 11244]]


In [None]:
# using w2v_vectors
X = np.array(w2v_vectors)
X_train, X_test, y_train, y_test = train_test_split(X, imdb_df['sentiment'], test_size=0.5, shuffle=False)
# reusing param_grid from the CountVectorizer run
logreg = LogisticRegression()
grid_search = GridSearchCV(logreg, param_grid, cv=5, scoring='accuracy', verbose=1)
# finding the best parameters
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
# making predictions
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
logresacc.append(accuracy)
print("Best Model Accuracy:", accuracy)
# classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)
# confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best Model Accuracy: 0.86016
Classification Report:
               precision    recall  f1-score   support

    negative       0.86      0.85      0.86     12474
    positive       0.86      0.87      0.86     12526

    accuracy                           0.86     25000
   macro avg       0.86      0.86      0.86     25000
weighted avg       0.86      0.86      0.86     25000

Confusion Matrix:
 [[10643  1831]
 [ 1665 10861]]


In [None]:
# using googlew2v_vectors
X = np.array(googlew2v_vectors)
X_train, X_test, y_train, y_test = train_test_split(X, imdb_df['sentiment'], test_size=0.5, shuffle=False)
# reusing param_grid from the CountVectorizer run
logreg = LogisticRegression()
grid_search = GridSearchCV(logreg, param_grid, cv=5, scoring='accuracy', verbose=1)
# finding the best parameters
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
# making predictions
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
logresacc.append(accuracy)
print("Best Model Accuracy:", accuracy)
# classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)
# confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best Model Accuracy: 0.85316
Classification Report:
               precision    recall  f1-score   support

    negative       0.86      0.85      0.85     12474
    positive       0.85      0.86      0.85     12526

    accuracy                           0.85     25000
   macro avg       0.85      0.85      0.85     25000
weighted avg       0.85      0.85      0.85     25000

Confusion Matrix:
 [[10570  1904]
 [ 1767 10759]]


In [None]:
print(logresacc)

[0.884, 0.88916, 0.86016, 0.85316]


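Thus, with Logistic Regression the best performer was TfidfVectorizer (accuracy 0.889), followed closely by CountVectorizer (0.884). Both Word2Vec variants trailed, at 0.860 for the self-trained model and 0.853 for GoogleNews.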
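
The four Logistic Regression cells repeat the same split, grid search, and scoring steps. One way to avoid the repetition is a small helper, sketched here on synthetic data; the function name `evaluate_logreg` and the demo data are assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

def evaluate_logreg(X, y, param_grid):
    # Hypothetical helper: same split, grid search, and scoring as the cells above.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, shuffle=False)
    search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    y_pred = search.best_estimator_.predict(X_test)
    return accuracy_score(y_test, y_pred), search.best_params_

# Synthetic data standing in for a feature matrix and its labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
demo_grid = {"C": [0.01, 0.1, 1], "max_iter": [300]}
demo_accuracy, demo_params = evaluate_logreg(X_demo, y_demo, demo_grid)
print(demo_accuracy, demo_params)
```

Returning `best_params_` alongside the accuracy also records which value of `C` the grid search selected, which the printed output does not show.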

### SVC

In [None]:
svcacc = []
from sklearn.svm import SVC
# using cv_reviews
X_train, X_test, y_train, y_test = train_test_split(cv_reviews, imdb_df['sentiment'], test_size=0.5, shuffle=False)
# defining the SVC classifier with fixed parameters (no grid search)
svc_classifier = SVC(C=0.1, kernel='linear')
svc_classifier.fit(X_train, y_train)
# making predictions
y_pred = svc_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
svcacc.append(accuracy)
print("Best Model Accuracy:", accuracy)
# classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)
# confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Best Model Accuracy: 0.87276
Classification Report:
               precision    recall  f1-score   support

    negative       0.87      0.87      0.87     12474
    positive       0.87      0.87      0.87     12526

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000

Confusion Matrix:
 [[10860  1614]
 [ 1567 10959]]


In [None]:
# the previous code took over 20 mins!
# using tv_reviews
X_train, X_test, y_train, y_test = train_test_split(tv_reviews, imdb_df['sentiment'], test_size=0.5, shuffle=False)
# defining the SVC classifier with fixed parameters (no grid search)
svc_classifier = SVC(C=0.1, kernel='linear')
svc_classifier.fit(X_train, y_train)
# making predictions
y_pred = svc_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
svcacc.append(accuracy)
print("Best Model Accuracy:", accuracy)
# classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)
# confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Best Model Accuracy: 0.87188
Classification Report:
               precision    recall  f1-score   support

    negative       0.90      0.83      0.87     12474
    positive       0.85      0.91      0.88     12526

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000

Confusion Matrix:
 [[10402  2072]
 [ 1131 11395]]


In [None]:
# using w2v_vectors
X = np.array(w2v_vectors)
X_train, X_test, y_train, y_test = train_test_split(X, imdb_df['sentiment'], test_size=0.5, shuffle=False)
# defining the SVC classifier with fixed parameters (no grid search)
svc_classifier = SVC(C=0.1, kernel='linear')
svc_classifier.fit(X_train, y_train)
# making predictions
y_pred = svc_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
svcacc.append(accuracy)
print("Best Model Accuracy:", accuracy)
# classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)
# confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Best Model Accuracy: 0.8614
Classification Report:
               precision    recall  f1-score   support

    negative       0.87      0.85      0.86     12474
    positive       0.86      0.87      0.86     12526

    accuracy                           0.86     25000
   macro avg       0.86      0.86      0.86     25000
weighted avg       0.86      0.86      0.86     25000

Confusion Matrix:
 [[10631  1843]
 [ 1622 10904]]


In [None]:
# using googlew2v_vectors
X = np.array(googlew2v_vectors)
X_train, X_test, y_train, y_test = train_test_split(X, imdb_df['sentiment'], test_size=0.5, shuffle=False)
# defining the SVC classifier with fixed parameters (no grid search)
svc_classifier = SVC(C=0.1, kernel='linear')
svc_classifier.fit(X_train, y_train)
# making predictions
y_pred = svc_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
svcacc.append(accuracy)
print("Best Model Accuracy:", accuracy)
# classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)
# confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Best Model Accuracy: 0.83824
Classification Report:
               precision    recall  f1-score   support

    negative       0.84      0.83      0.84     12474
    positive       0.83      0.84      0.84     12526

    accuracy                           0.84     25000
   macro avg       0.84      0.84      0.84     25000
weighted avg       0.84      0.84      0.84     25000

Confusion Matrix:
 [[10379  2095]
 [ 1949 10577]]


In [None]:
print(svcacc)

[0.87276, 0.87188, 0.8614, 0.83824]


Here, SVC performed best with CountVectorizer (accuracy 0.8728), with TfidfVectorizer almost level (0.8719); the Word2Vec features again scored lower (0.8614 and 0.8382).
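
Since `SVC(kernel='linear')` was slow on the sparse matrices, one option worth trying is scikit-learn's `LinearSVC`, which is built on liblinear rather than libsvm and generally scales better with the number of samples. It is not a drop-in equivalent (its default loss and intercept handling differ), so accuracies may shift slightly. A sketch on synthetic data, with all names hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Synthetic data standing in for a feature matrix and its labels.
X_svm, y_svm = make_classification(n_samples=300, n_features=20, random_state=0)

kernel_svc = SVC(C=0.1, kernel="linear").fit(X_svm, y_svm)
linear_svc = LinearSVC(C=0.1, max_iter=5000).fit(X_svm, y_svm)

# Fraction of samples on which the two linear models make the same prediction.
agreement = (kernel_svc.predict(X_svm) == linear_svc.predict(X_svm)).mean()
print(agreement)
```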

### Random Forest

In [None]:
rfacc = []
from sklearn.ensemble import RandomForestClassifier
# using cv_reviews
X_train, X_test, y_train, y_test = train_test_split(cv_reviews, imdb_df['sentiment'], test_size=0.5, shuffle=False)
# defining RF classifier with chosen parameters
# grid search is skipped here: more trees and greater depth tend to improve accuracy,
# but they also increase training time and the risk of overfitting, so moderate values are used
rf_model = RandomForestClassifier(
    n_estimators=100,        # no of trees
    max_depth=50,            # max depth of tree
    min_samples_split=4,     # min samples required for split
    min_samples_leaf=2,      # min samples required for leaf
    random_state=0
)
rf_model.fit(X_train, y_train)
# making predictions
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
rfacc.append(accuracy)
print("Best Model Accuracy:", accuracy)
# classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)
# confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Best Model Accuracy: 0.85076
Classification Report:
               precision    recall  f1-score   support

    negative       0.86      0.84      0.85     12474
    positive       0.84      0.86      0.85     12526

    accuracy                           0.85     25000
   macro avg       0.85      0.85      0.85     25000
weighted avg       0.85      0.85      0.85     25000

Confusion Matrix:
 [[10458  2016]
 [ 1715 10811]]


In [None]:
# using tv_reviews
X_train, X_test, y_train, y_test = train_test_split(tv_reviews, imdb_df['sentiment'], test_size=0.5, shuffle=False)
# defining RF classifier with chosen parameters
rf_model = RandomForestClassifier(
    n_estimators=100,        # no of trees
    max_depth=50,            # max depth of tree
    min_samples_split=4,     # min samples required for split
    min_samples_leaf=2,      # min samples required for leaf
    random_state=0
)
rf_model.fit(X_train, y_train)
# making predictions
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
rfacc.append(accuracy)
print("Best Model Accuracy:", accuracy)
# classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)
# confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Best Model Accuracy: 0.84772
Classification Report:
               precision    recall  f1-score   support

    negative       0.86      0.83      0.84     12474
    positive       0.84      0.87      0.85     12526

    accuracy                           0.85     25000
   macro avg       0.85      0.85      0.85     25000
weighted avg       0.85      0.85      0.85     25000

Confusion Matrix:
 [[10336  2138]
 [ 1669 10857]]


In [None]:
# using w2v_vectors
X = np.array(w2v_vectors)
X_train, X_test, y_train, y_test = train_test_split(X, imdb_df['sentiment'], test_size=0.5, shuffle=False)
# defining RF classifier with chosen parameters
rf_model = RandomForestClassifier(
    n_estimators=100,        # no of trees
    max_depth=50,            # max depth of tree
    min_samples_split=4,     # min samples required for split
    min_samples_leaf=2,      # min samples required for leaf
    random_state=0
)
rf_model.fit(X_train, y_train)
# making predictions
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
rfacc.append(accuracy)
print("Best Model Accuracy:", accuracy)
# classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)
# confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Best Model Accuracy: 0.8314
Classification Report:
               precision    recall  f1-score   support

    negative       0.84      0.81      0.83     12474
    positive       0.82      0.85      0.83     12526

    accuracy                           0.83     25000
   macro avg       0.83      0.83      0.83     25000
weighted avg       0.83      0.83      0.83     25000

Confusion Matrix:
 [[10143  2331]
 [ 1884 10642]]


In [None]:
# using googlew2v_vectors
X = np.array(googlew2v_vectors)
X_train, X_test, y_train, y_test = train_test_split(X, imdb_df['sentiment'], test_size=0.5, shuffle=False)
# defining RF classifier with chosen parameters
rf_model = RandomForestClassifier(
    n_estimators=100,        # no of trees
    max_depth=50,            # max depth of tree
    min_samples_split=4,     # min samples required for split
    min_samples_leaf=2,      # min samples required for leaf
    random_state=0
)
rf_model.fit(X_train, y_train)
# making predictions
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
rfacc.append(accuracy)
print("Best Model Accuracy:", accuracy)
# classification report
cr = classification_report(y_test, y_pred)
print("Classification Report:\n", cr)
# confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Best Model Accuracy: 0.80952
Classification Report:
               precision    recall  f1-score   support

    negative       0.82      0.79      0.81     12474
    positive       0.80      0.83      0.81     12526

    accuracy                           0.81     25000
   macro avg       0.81      0.81      0.81     25000
weighted avg       0.81      0.81      0.81     25000

Confusion Matrix:
 [[ 9852  2622]
 [ 2140 10386]]


In [None]:
print(rfacc)

[0.85076, 0.84772, 0.8314, 0.80952]


Thus, with the Random Forest classifier, CountVectorizer provides the highest accuracy (0.851), followed closely by TfidfVectorizer (0.848).

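In the end, Word2Vec reduces model training time by shrinking the input from 91,689 sparse columns to low-dimensional dense vectors (300 dimensions for the GoogleNews model), but it gave lower accuracy than both CountVectorizer and TfidfVectorizer with all three classifiers.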
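
The twelve results can be collected into one comparison. The sketch below copies the accuracies from the three printed lists and picks the best feature set for each classifier; the display labels for the feature sets are my own.

```python
# Test accuracies copied from the three printed lists, in the order
# CountVectorizer, TfidfVectorizer, self-trained Word2Vec, GoogleNews Word2Vec.
features = ["CountVectorizer", "TfidfVectorizer", "Word2Vec (CBOW)", "GoogleNews Word2Vec"]
results = {
    "Logistic Regression": [0.884, 0.88916, 0.86016, 0.85316],
    "SVC": [0.87276, 0.87188, 0.8614, 0.83824],
    "Random Forest": [0.85076, 0.84772, 0.8314, 0.80952],
}

best = {}
for classifier, accuracies in results.items():
    top = max(range(len(features)), key=lambda i: accuracies[i])
    best[classifier] = features[top]
    row = "  ".join(f"{a:.4f}" for a in accuracies)
    print(f"{classifier:<20}{row}  best: {features[top]}")
```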