For this notebook, I used the IMDBmovies dataset (this modified version contains only 15,000 rows).

At the end of the notebook, you can find 2 trained models that predict sentiment based on the movie reviews, along with insights into their respective performances measured by accuracy and confusion matrices.

**This comprehensive sentiment analysis pipeline serves as a solid foundation for exploring text classification using various feature extraction techniques and machine learning models.**


**Work in progress: new techniques are yet to be added to this notebook.**


- **Workflow**: The code follows a clear pipeline for sentiment analysis, starting from loading and preprocessing the dataset, through feature extraction, to model training and evaluation.
- **Feature Extraction Methods**:
  - Bag of Words (BoW)
  - TF-IDF
  - Word2Vec embeddings

- **Model Training**: It uses multiple classifiers (Naive Bayes and Random Forest) to predict sentiments based on the extracted features.

- **Evaluation**: Accuracy scores and confusion matrices are generated to assess the performance of each model.

In [None]:
import numpy as np
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:


import csv

# Try to read the file, falling back to more lenient parsing on failure
try:
    temp_df = pd.read_csv('IMDBmovies.csv')
except pd.errors.ParserError as e:
    print(f"Error reading CSV: {e}")
    # If there's an error, try specifying different parameters to handle potential issues:
    try:
        # on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
        temp_df = pd.read_csv('IMDBmovies.csv', on_bad_lines='skip')  # Skip bad lines
        print("Successfully read CSV by skipping bad lines.")
    except pd.errors.ParserError as e:
        print(f"Error reading CSV even after skipping bad lines: {e}")
        try:
            # QUOTE_NONE is defined in the csv module, not in pandas
            temp_df = pd.read_csv('IMDBmovies.csv', quoting=csv.QUOTE_NONE, escapechar='\\')  # Handle special characters
            print("Successfully read CSV by handling special characters.")
        except pd.errors.ParserError as e:
            print(f"Error reading CSV even after handling special characters: {e}")

# Proceed if the file was read successfully
if 'temp_df' in locals():
    df = temp_df.iloc[:15000]    # select the first 15,000 rows for analysis and store them in a new DataFrame df
    df.head()

In [None]:

# temp_df = pd.read_csv('IMDBmovies.csv')
# df = temp_df.iloc[:15000]    #selects the first 15,000 rows for analysis and stores them in a new DataFrame df
# df.head()

In [None]:
df.size

30000

In [None]:
df['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [None]:
df.duplicated().sum()

39

In [None]:
# Reassign instead of using inplace=True on a slice, which triggers SettingWithCopyWarning
df = df.drop_duplicates().copy()


In [None]:
def remove_tags(raw_text):
  if isinstance(raw_text, str):  # Check if the input is a string
    cleaned_text = re.sub(re.compile('<.*?>'), '', raw_text)
    return cleaned_text
  return raw_text   # Return the original value if it's not a string

A function remove_tags is defined to clean HTML tags from the text. It checks if the input is a string and applies a regular expression to remove any tags.
The function is applied to the review column of the DataFrame.
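
As a quick check of what the regular expression does, here is a minimal standalone sketch. The names `strip_tags` and `TAG_PATTERN` are illustrative and not part of the notebook, and the sample string is a shortened excerpt of the review shown earlier.

```python
import re

TAG_PATTERN = re.compile(r"<.*?>")  # non-greedy: matches one tag at a time

def strip_tags(text):
    """Remove anything that looks like an HTML tag; pass non-strings through."""
    if isinstance(text, str):
        return TAG_PATTERN.sub("", text)
    return text

sample = "A wonderful little production. <br /><br />The filming technique"
print(strip_tags(sample))  # A wonderful little production. The filming technique
print(strip_tags(None))    # None
```

The `?` matters here: a greedy `<.*>` would delete everything between the first `<` and the last `>` in a review, including the text between two tags.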

In [None]:
sw_list = stopwords.words('english')

In [None]:
df.loc[:, 'review'] = df['review'].apply(remove_tags)

In [None]:
# Remove stopwords (case-sensitive match against NLTK's lowercase list)
df.loc[:, 'review'] = df['review'].apply(
    lambda x: " ".join(w for w in x.split() if w not in sw_list) if isinstance(x, str) else x
)

# Convert to lowercase
df.loc[:, 'review'] = df['review'].apply(lambda x: x.lower() if isinstance(x, str) else x)

# Display the modified DataFrame
print(df)

                                                  review sentiment
0      one reviewers mentioned watching 1 oz episode ...  positive
1      a wonderful little production. the filming tec...  positive
2      i thought wonderful way spend time hot summer ...  positive
3      basically there's family little boy (jake) thi...  negative
4      petter mattei's "love time money" visually stu...  positive
...                                                  ...       ...
14995  bobcat goldthwait commended attempting somethi...  negative
14996  and since days "clarissa explains it all" i've...  positive
14997  a traveling couple (horton hamilton)stumble on...  negative
14998  this film deeply disappointing. not wenders di...  negative
14999  the revelation lana turner's dancing ability. ...  positive

[14961 rows x 2 columns]


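Stopwords (common words such as "and" and "the") are removed from the reviews using NLTK's English stopword list.
The cleaned reviews are then converted to lowercase for uniformity. Because the stopword match is case-sensitive and runs before lowercasing, capitalized stopwords such as "A", "I", and "The" survive, which is why some rows in the output above still begin with "a", "i", or "the".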
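
One possible variant, sketched here under assumptions, lowercases the text before filtering so that stopwords are caught regardless of capitalization. The `STOPWORDS` set below is a small hypothetical stand-in for `stopwords.words('english')`, and `clean_review` is an illustrative name. This is not the order used to produce the results in this notebook.

```python
# Hypothetical mini stopword list; the notebook uses nltk's stopwords.words('english')
STOPWORDS = {"a", "the", "is", "and", "i"}

def clean_review(text, stopwords=STOPWORDS):
    """Lowercase first, then drop stopwords, so 'The' and 'the' are treated alike."""
    tokens = text.lower().split()
    return " ".join(tok for tok in tokens if tok not in stopwords)

cleaned = clean_review("The plot is thin and I left early")
print(cleaned)  # plot thin left early
```

Switching to this order would change the vocabulary, so the accuracy figures reported below would need to be recomputed.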

In [None]:
X = df.iloc[:,0:1]
y = df['sentiment']

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(y)

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)
X_train.shape

(11968, 1)

In [None]:
# Applying BoW
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

X_train_bow.shape

(11968, 57305)

The CountVectorizer is used to convert the text reviews into a Bag of Words representation, creating a matrix of token counts for both training and testing data.
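
To make the token-count matrix concrete, here is a small sketch on two made-up reviews (not taken from the dataset). The names `toy_reviews`, `toy_cv`, and `toy_counts` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_reviews = ["good movie good plot", "bad movie"]  # made-up examples

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_reviews).toarray()

# Recover the column order from the fitted vocabulary (term -> column index)
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)       # ['bad', 'good', 'movie', 'plot']
print(toy_counts)  # [[0 2 1 1]
                   #  [1 0 1 0]]
```

Each row is one review and each column one vocabulary term, which is why the real training matrix above has one column per distinct token (57,305).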

# Training with GaussianNB

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

gnb.fit(X_train_bow,y_train)

In [None]:
y_pred = gnb.predict(X_test_bow)

from sklearn.metrics import accuracy_score,confusion_matrix
accuracy_score(y_test,y_pred)

0.6625459405278984

In [None]:
confusion_matrix(y_test,y_pred)

array([[1144,  359],
       [ 651,  839]])

A Gaussian Naive Bayes classifier is trained on the BoW features, and predictions are made on the test set.
The accuracy of the model and the confusion matrix are computed to evaluate performance.
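
The confusion matrix carries more information than the accuracy alone. The sketch below recomputes accuracy from the matrix printed above and derives precision and recall for the positive class. It assumes the usual scikit-learn layout (rows are true labels, columns are predictions) and that `LabelEncoder` mapped "negative" to 0 and "positive" to 1, which follows from its alphabetical ordering.

```python
import numpy as np

# Confusion matrix reported above; rows are true labels, columns are predictions.
# LabelEncoder sorts labels alphabetically, so 0 = negative and 1 = positive.
cm = np.array([[1144, 359],
               [651, 839]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of reviews predicted positive, share that are truly positive
recall = tp / (tp + fn)      # of truly positive reviews, share that were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# accuracy=0.6625 precision=0.7003 recall=0.5631
```

Read this way, the main weakness is recall: 651 positive reviews are labelled negative.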

# Training with Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.8509856331440027

A Random Forest classifier is also trained using the BoW features and evaluated for accuracy.

In [None]:
cv = CountVectorizer(max_features=3000)

X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.8342799866354828

This cell converts the reviews into a Bag of Words representation restricted to the 3,000 most frequent words, trains a Random Forest model on it, and evaluates its accuracy in predicting sentiment labels. Accuracy drops only slightly, from about 0.851 with the full vocabulary to about 0.834.

In [None]:
cv = CountVectorizer(ngram_range=(1,2),max_features=5000)

X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.8362846642165052

Feature extraction is extended to include bigrams (sequences of two adjacent words) alongside single words by setting ngram_range=(1,2), with the vocabulary capped at 5,000 features, and training and evaluation are repeated.
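
A small sketch of what ngram_range changes, using two made-up reviews (`toy_reviews`, `unigram_cv`, and `bigram_cv` are illustrative names):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_reviews = ["not good", "good"]  # made-up examples

unigram_cv = CountVectorizer().fit(toy_reviews)
bigram_cv = CountVectorizer(ngram_range=(1, 2)).fit(toy_reviews)

print(sorted(unigram_cv.vocabulary_))  # ['good', 'not']
print(sorted(bigram_cv.vocabulary_))   # ['good', 'not', 'not good']
```

With unigrams only, both reviews share the feature "good". The bigram "not good" gives the classifier a feature that distinguishes the negated phrase.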

# **Using TF-IDF**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
X_test_tfidf = tfidf.transform(X_test['review'])

TfidfVectorizer converts the reviews into TF-IDF features, which weight each term's count by how rare the term is across the corpus. A Random Forest classifier is then trained on these features, and its accuracy is computed.
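
For reference, the weighting can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings (smoothed IDF, L2 row normalisation, raw term counts) and compares a manual computation with the library output on three made-up reviews.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_reviews = ["good movie", "bad movie", "good good plot"]  # made-up examples

counts = CountVectorizer().fit_transform(toy_reviews).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF used by scikit-learn's defaults: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalise each row

sklearn_tfidf = TfidfVectorizer().fit_transform(toy_reviews).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))  # True
```

Terms that appear in many reviews, such as "movie" here, get a smaller IDF than terms that appear in only one.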

In [None]:
rf = RandomForestClassifier()

rf.fit(X_train_tfidf,y_train)
y_pred = rf.predict(X_test_tfidf)

accuracy_score(y_test,y_pred)

0.8416304710992315

# **Using Word2Vec**

In [None]:
import gensim
import nltk
nltk.download('punkt')

from nltk import sent_tokenize
from gensim.utils import simple_preprocess
story = []
for doc in df['review']:
    raw_sent = sent_tokenize(doc)
    for sent in raw_sent:
        story.append(simple_preprocess(sent))

model = gensim.models.Word2Vec(
    window=10,
    min_count=2
)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Sentences are tokenized from the reviews and processed into a list of words for Word2Vec training.
A Word2Vec model is initialized, built, and trained on the tokenized sentences to learn word embeddings.

In [None]:
model.build_vocab(story)
model.train(story, total_examples=model.corpus_count, epochs=model.epochs)

(9197228, 9810805)

In [None]:
len(model.wv.index_to_key)

38121

In [None]:
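def document_vector(doc):
    # remove out-of-vocabulary words (key_to_index is a dict, so each lookup is O(1))
    doc = [word for word in doc.split() if word in model.wv.key_to_index]
    if not doc:
        # no known words: fall back to a zero vector instead of failing
        return np.zeros(model.vector_size)
    return np.mean(model.wv[doc], axis=0)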

document_vector(df['review'].values[0])

array([-1.21867247e-01,  1.98360085e-01,  1.77877396e-02,  6.53934723e-04,
        6.30961433e-02, -6.46396875e-01,  2.84476131e-01,  6.99922562e-01,
       -1.11608893e-01, -3.72724861e-01, -9.32564884e-02, -6.02043509e-01,
       -3.69267911e-02,  4.23178464e-01, -1.76801980e-02, -4.76853460e-01,
       -1.49038836e-01, -2.25483909e-01,  7.32048079e-02, -4.60291415e-01,
        5.19118786e-01,  1.52935162e-01,  3.02278817e-01,  2.18892068e-01,
        3.49237695e-02, -7.79215395e-02, -1.92852050e-01, -1.55371487e-01,
       -3.10850084e-01, -6.02239519e-02,  4.27446514e-01,  1.38595894e-01,
        1.02748372e-01, -2.26638302e-01,  4.92330864e-02,  6.06161714e-01,
        5.96871413e-02, -3.50647002e-01, -1.66628689e-01, -7.09155738e-01,
        2.39120603e-01, -3.90726298e-01,  6.71600997e-02,  1.44873247e-01,
        1.25350431e-01, -2.33766902e-02, -3.54483813e-01, -2.11934805e-01,
       -2.87121460e-02,  1.99263036e-01,  2.06211612e-01, -2.17697620e-01,
       -1.22451507e-01,  

A function is defined to calculate document vectors by averaging the word vectors of the in-vocabulary words in a document.
The function is then applied to all reviews to create a NumPy array of document vectors.
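
The averaging step can be illustrated without a trained model. In this sketch, `toy_vectors` is a hypothetical dictionary of 3-dimensional embeddings standing in for `model.wv`, and `average_vector` is an illustrative name. The zero-vector fallback for documents with no known words is an assumption of the sketch.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained Word2Vec model
toy_vectors = {
    "great": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 2.0, 0.0]),
    "boring": np.array([-1.0, 0.0, -2.0]),
}

def average_vector(doc, vectors, dim=3):
    """Mean of the vectors of known words; zeros if no word is known."""
    known = [vectors[w] for w in doc.split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# 'xyzzy' is out of vocabulary and ignored; the result is the mean of 'great' and 'movie'
vec = average_vector("great movie xyzzy", toy_vectors)
print(vec)
```

Averaging discards word order, so "not good" and "good not" map to the same vector. This may be one reason the Word2Vec features do not outperform the count-based ones here.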

In [None]:
from tqdm import tqdm
X = []
for doc in tqdm(df['review'].values):
    X.append(document_vector(doc))

100%|██████████| 14961/14961 [12:46<00:00, 19.51it/s]


In [None]:
X = np.array(X)
X[0]

array([-1.21867247e-01,  1.98360085e-01,  1.77877396e-02,  6.53934723e-04,
        6.30961433e-02, -6.46396875e-01,  2.84476131e-01,  6.99922562e-01,
       -1.11608893e-01, -3.72724861e-01, -9.32564884e-02, -6.02043509e-01,
       -3.69267911e-02,  4.23178464e-01, -1.76801980e-02, -4.76853460e-01,
       -1.49038836e-01, -2.25483909e-01,  7.32048079e-02, -4.60291415e-01,
        5.19118786e-01,  1.52935162e-01,  3.02278817e-01,  2.18892068e-01,
        3.49237695e-02, -7.79215395e-02, -1.92852050e-01, -1.55371487e-01,
       -3.10850084e-01, -6.02239519e-02,  4.27446514e-01,  1.38595894e-01,
        1.02748372e-01, -2.26638302e-01,  4.92330864e-02,  6.06161714e-01,
        5.96871413e-02, -3.50647002e-01, -1.66628689e-01, -7.09155738e-01,
        2.39120603e-01, -3.90726298e-01,  6.71600997e-02,  1.44873247e-01,
        1.25350431e-01, -2.33766902e-02, -3.54483813e-01, -2.11934805e-01,
       -2.87121460e-02,  1.99263036e-01,  2.06211612e-01, -2.17697620e-01,
       -1.22451507e-01,  

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

y = encoder.fit_transform(df['sentiment'])
y

array([1, 1, 1, ..., 0, 0, 1])

Finally, a Random Forest classifier is trained on the document vectors, and its accuracy is evaluated on the test set.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
accuracy_score(y_test,y_pred)

0.7901770798529903