## Dataset Overview: Movie Review Sentiment Analysis

The dataset contains 10,662 short movie reviews (5,331 negative and 5,331 positive, loaded from `rt-polarity.neg` and `rt-polarity.pos`), each paired with a sentiment label that marks the review as positive or negative.

### Key Features:

* **Text:** The movie review text containing opinions, critiques, and evaluations of the movie.
* **Label:** The sentiment label of the review (1 = Positive, 0 = Negative).

## Import

In [21]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import Dataset
import torch.nn.functional as F
import numpy as np
import gensim.downloader as api
import gensim
from gensim.models import Word2Vec

from nltk.tokenize import word_tokenize
import nltk

In [2]:
fn = 'data/rt-polarity.neg'

# errors='ignore' skips the few bytes in the file that are not valid UTF-8
with open(fn, 'r', encoding='utf-8', errors='ignore') as f:
    texts_neg = f.read().splitlines()

print('len of texts_neg = {:,}'.format(len(texts_neg)))
for review in texts_neg[:5]:
    print('\n', review)

len of texts_neg = 5,331

 simplistic , silly and tedious . 

 it's so laddish and juvenile , only teenage boys could possibly find it funny . 

 exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . 

 [garbus] discards the potential for pathological study , exhuming instead , the skewed melodrama of the circumstantial situation . 

 a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification . 


In [3]:
fn = 'data/rt-polarity.pos'

# Same decoding settings as for the negative reviews
with open(fn, 'r', encoding='utf-8', errors='ignore') as f:
    texts_pos = f.read().splitlines()

print('len of texts_pos = {:,}'.format(len(texts_pos)))
for review in texts_pos[:5]:
    print('\n', review)

len of texts_pos = 5,331

 the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . 

 the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . 

 effective but too-tepid biopic

 if you sometimes like to go to the movies to have fun , wasabi is a good place to start . 

 emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . 


#### Assign labels (0 for negative, 1 for positive) and combine the datasets into a single dataset

In [4]:
df_neg = pd.DataFrame({'text': texts_neg, 'label': 0})
df_pos = pd.DataFrame({'text': texts_pos, 'label': 1})

df = pd.concat([df_neg, df_pos], ignore_index=True)

In [5]:
# Shuffle the rows so the two classes are interleaved; random_state keeps the order reproducible
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

In [6]:
df.head(10)

| | text | label |
|---|---|---|
| 0 | this film seems thirsty for reflection , itsel... | 1 |
| 1 | the movie's thesis -- elegant technology for t... | 1 |
| 2 | tries too hard to be funny in a way that's too... | 0 |
| 3 | disturbingly superficial in its approach to th... | 0 |
| 4 | an ugly , pointless , stupid movie . | 0 |
| 5 | neither a rousing success nor a blinding embar... | 0 |
| 6 | ice age posits a heretofore unfathomable quest... | 0 |
| 7 | the hours , a delicately crafted film , is an ... | 1 |
| 8 | the tenderness of the piece is still intact . | 1 |
| 9 | occasionally , in the course of reviewing art-... | 1 |


In [31]:
df['label'].value_counts()

label
1    5331
0    5331
Name: count, dtype: int64

### Split the dataset into training and testing sets.

In [None]:
# 80/20 split; no random_state is set, so the split (and every score below) varies between runs
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2)
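
Because the split above sets neither `random_state` nor `stratify`, the class counts in the test set drift slightly from 50/50 and the scores are not reproducible. The following is a minimal sketch of a seeded, stratified variant on toy data. The seed value `42` and the toy reviews are assumptions for illustration, not the notebook's settings, and switching to it would change the scores reported later.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for df['text'] and df['label'] (hypothetical data)
texts = [f"review {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

# stratify keeps the class ratio identical in both parts;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te), sum(y_te))
```

With stratification, the four held-out toy reviews contain two of each class.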

## Text Classification Methods

For text classification, we will use the following methods:

### Text Vectorization:

* **TF-IDF**: Computes the importance of words based on term frequency and document frequency.
* **CountVectorizer**: Creates a word count matrix.
* **Word2Vec**: Generates vector representations of words based on context.
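
To make the difference between the first two representations concrete, the sketch below builds a count matrix and TF-IDF weights by hand for a three-review toy corpus. It assumes plain whitespace tokenization, plus the smoothed IDF and L2 row normalization that scikit-learn's `TfidfVectorizer` applies by default. The corpus and the helper name `tfidf_row` are illustrative only.

```python
import math
from collections import Counter

docs = ["the good movie", "the bad movie", "good good fun"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Count representation: one row per document, one column per vocabulary word
counts = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# Document frequency: in how many documents each word occurs
df_counts = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed inverse document frequency: rarer words get larger weights
idf = {w: math.log((1 + n_docs) / (1 + df_counts[w])) + 1 for w in vocab}

def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit L2 length."""
    weighted = [c * idf[w] for c, w in zip(row, vocab)]
    norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
    return [x / norm for x in weighted]

tfidf = [tfidf_row(row) for row in counts]

print(vocab)
print(counts)
print([[round(x, 2) for x in row] for row in tfidf])
```

In the count matrix every occurrence weighs the same, whereas TF-IDF gives "bad" (one document) a larger weight than "the" (two documents).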

### Classification Model:

* **Logistic Regression**: A linear model for binary classification tasks.

The performance metrics for each method (precision, recall, F1-score) will be evaluated on the test set.
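
As a reminder of what these three metrics measure, here is a small sketch that computes them for one class directly from the true and predicted labels. The function name `precision_recall_f1` and the toy labels are hypothetical; the notebook itself relies on `classification_report`.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall and F1 from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0  # how many predicted positives are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # how many true positives are found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean of the two
    return precision, recall, f1

toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]

print(precision_recall_f1(toy_true, toy_pred, positive=1))
print(precision_recall_f1(toy_true, toy_pred, positive=0))
```

`classification_report` prints one such row per class, followed by their macro (unweighted) and support-weighted averages.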


### TF-IDF

In [15]:
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

#### Train a model on features extracted by the TF-IDF vectorizer

In [17]:
lr = LogisticRegression(max_iter=2000)
lr.fit(X_train_tfidf, y_train)
y_pred_lr = lr.predict(X_test_tfidf)

print("TF-IDF")
print(classification_report(y_test, y_pred_lr))

TF-IDF
              precision    recall  f1-score   support

           0       0.77      0.74      0.75      1098
           1       0.73      0.77      0.75      1035

    accuracy                           0.75      2133
   macro avg       0.75      0.75      0.75      2133
weighted avg       0.75      0.75      0.75      2133
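
The fit-on-train, transform-on-test pattern used above can also be bundled into a single estimator. The sketch below shows one possible way with scikit-learn's `make_pipeline`; the four toy reviews and the variable names are assumptions for illustration, and this is not how the notebook itself is organized.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical miniature training set (1 = positive, 0 = negative)
toy_texts = [
    "a wonderful , moving film",
    "dull and lifeless",
    "wonderful performances",
    "a dull , tedious mess",
]
toy_labels = [1, 0, 1, 0]

# The vectorizer is fitted only on the data passed to fit(),
# so test text can never leak into the vocabulary or IDF weights
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=2000))
pipe.fit(toy_texts, toy_labels)

toy_preds = pipe.predict(["wonderful film", "tedious and dull"])
print(toy_preds)
```

A pipeline also plugs directly into utilities such as `cross_val_score`, which would give a steadier estimate than a single split.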



### CountVectorizer

In [28]:
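# Unigrams and bigrams; drop terms found in more than 90% of reviews or in fewer than 2,
# then keep the 5,000 most frequent remaining terms
count_vectorizer = CountVectorizer(ngram_range=(1, 2), max_df=0.9, min_df=2, max_features=5000)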
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

#### Train model

In [29]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_count, y_train)
y_pred_count = clf.predict(X_test_count)

print("CountVectorizer")
print(classification_report(y_test, y_pred_count))

CountVectorizer
              precision    recall  f1-score   support

           0       0.76      0.74      0.75      1098
           1       0.73      0.75      0.74      1035

    accuracy                           0.74      2133
   macro avg       0.74      0.74      0.74      2133
weighted avg       0.74      0.74      0.74      2133



### Word Embeddings with Word2Vec

In [None]:
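# word_tokenize needs the Punkt tokenizer data; recent NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')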

def tokenize_text(text):
    return word_tokenize(text.lower())

# Prepare sentences for Word2Vec (it needs a list of tokenized sentences)
X_train_tokenized = [tokenize_text(text) for text in X_train]

# Train Word2Vec model
w2v_model = Word2Vec(X_train_tokenized,
                    vector_size=100,     # Dimensionality of word vectors
                    window=5,            # Context window size
                    min_count=5,         # Ignore words with frequency below this
                    workers=4)           # Number of threads


In [33]:
print(f"Vocabulary size: {len(w2v_model.wv.key_to_index)}")

Vocabulary size: 3816


In [24]:
# Function to create document vectors by averaging word vectors
def document_vector(doc, model):
    # Remove out-of-vocabulary words
    doc = [word for word in doc if word in model.wv]
    if len(doc) == 0:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[word] for word in doc], axis=0)

# Create document vectors for train and test sets
X_train_w2v = np.array([document_vector(doc, w2v_model) for doc in X_train_tokenized])
X_test_tokenized = [tokenize_text(text) for text in X_test]
X_test_w2v = np.array([document_vector(doc, w2v_model) for doc in X_test_tokenized])
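
To illustrate what the averaging step does with unknown words, here is a small self-contained sketch. It replaces the trained model with a hypothetical dictionary of 2-dimensional vectors (`toy_vectors`), so the numbers are made up; only the averaging logic mirrors `document_vector`.

```python
import numpy as np

# Hypothetical 2-dimensional "word vectors" standing in for a trained model
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}

def average_vector(tokens, vectors, size=2):
    """Mean of the vectors of known tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

# The out-of-vocabulary token is skipped before averaging
doc_a = average_vector(["good", "movie", "schwarzenegger"], toy_vectors)
# A review with no known tokens collapses to the zero vector
doc_b = average_vector(["schwarzenegger"], toy_vectors)

print(doc_a, doc_b)
```

With `min_count=5`, rare words are dropped from the vocabulary, so they contribute nothing to the review's vector.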

#### Train a logistic regression model on Word2Vec features

In [26]:
w2v_clf = LogisticRegression(max_iter=1000).fit(X_train_w2v, y_train)
w2v_predictions = w2v_clf.predict(X_test_w2v)
w2v_scores = w2v_clf.predict_proba(X_test_w2v)[:, 1]

print("Word2Vec")
print(classification_report(y_test, w2v_predictions))

Word2Vec
              precision    recall  f1-score   support

           0       0.60      0.54      0.57      1098
           1       0.56      0.62      0.59      1035

    accuracy                           0.58      2133
   macro avg       0.58      0.58      0.58      2133
weighted avg       0.58      0.58      0.58      2133



## Model Comparison: Sentiment Analysis

### Methods Used:

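1. **TF-IDF (Term Frequency-Inverse Document Frequency):**
   - **Precision:** 0.77 (class 0), 0.73 (class 1)
   - **Recall:** 0.74 (class 0), 0.77 (class 1)
   - **F1-Score:** 0.75 (class 0), 0.75 (class 1)
   - **Accuracy:** 0.75
   - **Macro Average:** 0.75 (Precision, Recall, F1-Score)
   - **Weighted Average:** 0.75 (Precision, Recall, F1-Score)

2. **CountVectorizer:**
   - **Precision:** 0.76 (class 0), 0.73 (class 1)
   - **Recall:** 0.74 (class 0), 0.75 (class 1)
   - **F1-Score:** 0.75 (class 0), 0.74 (class 1)
   - **Accuracy:** 0.74
   - **Macro Average:** 0.74 (Precision, Recall, F1-Score)
   - **Weighted Average:** 0.74 (Precision, Recall, F1-Score)

3. **Word2Vec:**
   - **Precision:** 0.60 (class 0), 0.56 (class 1)
   - **Recall:** 0.54 (class 0), 0.62 (class 1)
   - **F1-Score:** 0.57 (class 0), 0.59 (class 1)
   - **Accuracy:** 0.58
   - **Macro Average:** 0.58 (Precision, Recall, F1-Score)
   - **Weighted Average:** 0.58 (Precision, Recall, F1-Score)

### Conclusion:
- **TF-IDF and CountVectorizer** showed comparable performance, with **TF-IDF** marginally ahead (accuracy 0.75 vs. 0.74). A one-point gap on a single unseeded split of 2,133 test reviews is too small to rank the two, and the vectorizers were not configured alike (default unigrams for TF-IDF; unigrams and bigrams capped at 5,000 features for CountVectorizer).
- **Word2Vec** performed clearly worse than both TF-IDF and CountVectorizer on every metric, reaching only 0.58 accuracy on a nearly balanced test set where chance is about 0.50. A plausible explanation, not tested here, is that the embeddings were trained from scratch on only about 8,500 short reviews (vocabulary of 3,816 words) and that averaging word vectors discards word order.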
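
The macro and weighted averages in the reports can be checked by hand from the per-class rows. The sketch below does this for the Word2Vec F1-scores, using the rounded values and supports printed in the report. The helper names are hypothetical, and because `classification_report` averages the unrounded scores, agreement is only expected to two decimals.

```python
def macro_average(values):
    """Unweighted mean: every class counts equally."""
    return sum(values) / len(values)

def weighted_average(values, supports):
    """Mean weighted by the number of test examples per class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)

# Word2Vec per-class F1 and support, copied from the report (class 0, class 1)
f1_per_class = [0.57, 0.59]
support = [1098, 1035]

macro_f1 = macro_average(f1_per_class)
weighted_f1 = weighted_average(f1_per_class, support)

print(round(macro_f1, 2), round(weighted_f1, 2))
```

Both values round to 0.58, matching the report. With class supports this close, the macro and weighted averages are practically interchangeable.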