### 3. Sentiment Classifier

The [scikit-learn](https://scikit-learn.org/stable/index.html) machine learning package will be used throughout this notebook.

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Load the movie review data

In [2]:
# Build the path portably; the literal "..\S" contains an invalid escape sequence
df = pd.read_csv(os.path.join("..", "Sentiments.csv"))
df.head()

Unnamed: 0,text,label
0,Jack Webb is riveting as a Marine Corp drill i...,pos
1,This film seems to be completely pointless. Th...,neg
2,It's not just that the movie is lame. It's mor...,neg
3,This is a very strange product from Hollywood....,neg
4,If you like horror or action watch this film A...,pos


Sample 30% of the dataset to shorten training time.

In [3]:
df = df.sample(frac=0.3, random_state=1).reset_index(drop=True)

Split the data into training, validation and test sets.

In [4]:
# convert pandas series to lists
Xr = df["text"].tolist()
Yr = df["label"].tolist()

# compute the train, val, test splits
train_frac, val_frac, test_frac = 0.7, 0.1, 0.2
train_end = int(train_frac*len(Xr))
val_end = int((train_frac + val_frac)*len(Xr))

# store the train val test splits
X_train = Xr[0:train_end]
Y_train = Yr[0:train_end]
X_val = Xr[train_end:val_end]
Y_val = Yr[train_end:val_end]
X_test = Xr[val_end:]
Y_test = Yr[val_end:]
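
As a quick sanity check of the slicing above, the sketch below applies the same 70/10/20 boundaries to a toy list. The helper name `split_by_fraction` is hypothetical, and it uses `round` where the notebook uses `int`: in floating point `0.7 + 0.1` evaluates to slightly less than 0.8, so truncation can move the validation boundary down by one item.

```python
def split_by_fraction(items, train_frac=0.7, val_frac=0.1):
    """Slice a list into contiguous train, validation and test parts."""
    n = len(items)
    # round() instead of int(): 0.7 + 0.1 is just under 0.8 in floating point,
    # so truncating could shift the validation boundary down by one
    train_end = round(train_frac * n)
    val_end = round((train_frac + val_frac) * n)
    return items[:train_end], items[train_end:val_end], items[val_end:]


toy = list(range(20))  # stand-in for the list of documents
train, val, test = split_by_fraction(toy)
print(len(train), len(val), len(test))
```

The three parts are contiguous and non-overlapping, so concatenating them reproduces the original list. This is also why `Xr[:val_end]` later yields training plus validation data.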

Fit model

In [5]:
def fit_model(Xtr, Ytr, C):
    """Tokenizes the documents, computes term-frequency (TF) vectors, and trains a logistic regression model.
    
    Args:
    - Xtr: A list of training documents provided as text
    - Ytr: A list of training class labels
    - C: The inverse regularization strength (smaller values regularize more strongly)
    
    Returns:
    - the trained model
    - the fitted CountVectorizer
    """
    
    count_vectorizer = CountVectorizer()
    X_train_tf = count_vectorizer.fit_transform(Xtr)
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train_tf, Ytr)
    
    return model, count_vectorizer
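
One possible alternative to returning the model and vectorizer as a pair is scikit-learn's `make_pipeline`, which bundles both steps into one object. The sketch below is not the notebook's method; the toy corpus and the name `clf` are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration only
docs = ["good great film", "great fun", "bad awful film", "awful boring"]
labels = ["pos", "pos", "neg", "neg"]

# The pipeline fits the vectorizer and the classifier together, so the same
# vocabulary is reused automatically at prediction time
clf = make_pipeline(CountVectorizer(), LogisticRegression(C=1.0, max_iter=1000))
clf.fit(docs, labels)

# make_pipeline names each step after its lowercased class name
vocab_size = len(clf.named_steps["countvectorizer"].vocabulary_)
prediction = clf.predict(["great fun"])[0]
print(vocab_size, prediction)
```

Keeping the two steps in one object removes the risk of transforming test data with a vectorizer that was fitted on a different training set.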

Test

In [6]:
def test_model(Xtst, Ytst, model, count_vectoriser):
    """Evaluates a trained classifier on a test or validation set.
    
    Args:
    - Xtst: A list of test or validation documents
    - Ytst: A list of test or validation class labels
    - model: A trained classifier
    - count_vectoriser: The CountVectorizer fitted on the training data
    
    Returns:
    - the accuracy score
    """
    X_test_tf = count_vectoriser.transform(Xtst)
    Y_pred = model.predict(X_test_tf)
    score = accuracy_score(Ytst, Y_pred)
    return score
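
For reference, `accuracy_score` is the fraction of predictions that match the true labels. The small sketch below, using made-up labels, computes it by hand and compares it with the library call.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only
y_true = ["pos", "neg", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]

# Fraction of positions where the prediction agrees with the label
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = accuracy_score(y_true, y_pred)
print(manual, score)
```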

Hyper-parameter tuning

In [7]:
K = [-5,-4,-3,-2,-1,0,1,2,3,4,5]
C_values = [3**k for k in K]
print(C_values)

best_accuracy = 0
best_C = None

# Hyperparameter tuning loop
for C in C_values:
    # Train the model with the current C value
    model, count_vec = fit_model(X_train, Y_train, C)
    
    # Evaluate the model on the validation set
    accuracy = test_model(X_val, Y_val, model, count_vec)
    
    # Update best accuracy and C if needed
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_C = C

print("Best C:", best_C)
print("Best validation accuracy:", best_accuracy)

[0.00411522633744856, 0.012345679012345678, 0.037037037037037035, 0.1111111111111111, 0.3333333333333333, 1, 3, 9, 27, 81, 243]
Best C: 0.1111111111111111
Best validation accuracy: 0.8498727735368957
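
The candidate values form a geometric grid of powers of 3, and the selected value corresponds to `3**-2`. In scikit-learn, `C` is the inverse of the regularization strength, so the small end of the grid regularizes most strongly. As a sketch, the same grid can be built with `np.logspace`; the names `C_loop` and `C_grid` are illustrative.

```python
import numpy as np

K = range(-5, 6)
C_loop = [3**k for k in K]

# Same geometric grid: 11 points from 3**-5 to 3**5
C_grid = np.logspace(-5, 5, num=11, base=3.0)
print(np.allclose(C_loop, C_grid))
```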


Train your classifier using both the training and validation data with the best value of `C`

In [8]:
# Fit the model on the concatenated training and validation sets,
# then evaluate it once on the held-out test set
X_train_1 = Xr[:val_end]
Y_train_1 = Yr[:val_end]
model, count_vec = fit_model(X_train_1, Y_train_1, best_C)
accuracy = test_model(X_test, Y_test, model, count_vec)
accuracy

0.8702290076335878

Inspect the coefficients of your logistic regression classifier

In [9]:
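# Feature indices sorted from most negative to most positive coefficient
ls = np.argsort(model.coef_[0])
# Invert the vocabulary to map each feature index back to its word
items = {v:k for k,v in count_vec.vocabulary_.items()}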
items

{19490: 'oh',
 7167: 'dear',
 28036: 'this',
 12748: 'has',
 28266: 'to',
 2592: 'be',
 19551: 'one',
 19446: 'of',
 27932: 'the',
 30960: 'worst',
 10525: 'films',
 12796: 'have',
 9692: 'ever',
 24564: 'seen',
 14790: 'it',
 29104: 'unbelievably',
 23010: 'repetitive',
 9698: 'every',
 24251: 'scene',
 24563: 'seems',
 6042: 'consist',
 20487: 'people',
 2733: 'being',
 12394: 'gunned',
 8448: 'down',
 23854: 'running',
 23754: 'round',
 24399: 'screaming',
 19655: 'or',
 15438: 'kicked',
 14020: 'in',
 10048: 'face',
 30577: 'which',
 22137: 'quickly',
 2652: 'becomes',
 29894: 'very',
 8696: 'dull',
 30969: 'wouldn',
 17908: 'mind',
 13794: 'if',
 5607: 'combat',
 30331: 'was',
 9684: 'even',
 1495: 'any',
 11934: 'good',
 4087: 'but',
 14777: 'isn',
 16930: 'main',
 4795: 'character',
 20701: 'phillips',
 22035: 'pushes',
 29782: 'various',
 11959: 'goons',
 19853: 'over',
 30820: 'with',
 23418: 'ridiculous',
 8853: 'ease',
 1306: 'and',
 19074: 'no',
 17352: 'matter',
 13529: 'h
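
The sorted indices and the index-to-word mapping can be combined to list the words with the strongest weights. The following is a minimal sketch on an invented toy corpus, not the notebook's data; the names `toy_vec`, `toy_model` and `index_to_word` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus, invented for illustration only
docs = ["good great film", "great fun", "bad awful film", "awful boring"]
labels = ["pos", "pos", "neg", "neg"]

toy_vec = CountVectorizer()
toy_model = LogisticRegression(C=1.0, max_iter=1000)
toy_model.fit(toy_vec.fit_transform(docs), labels)

# Map feature index -> word by inverting the vocabulary
index_to_word = {i: w for w, i in toy_vec.vocabulary_.items()}

# For a binary problem coef_ has shape (1, n_features); positive weights
# push the prediction towards toy_model.classes_[1]
order = np.argsort(toy_model.coef_[0])
most_negative = [index_to_word[i] for i in order[:2]]
most_positive = [index_to_word[i] for i in order[::-1][:2]]
print(toy_model.classes_[1], most_positive, most_negative)
```

Checking `classes_` first matters, because the sign of a coefficient is only meaningful relative to which label scikit-learn treats as the positive class.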

### Cosine Distance - Sparse & Dense Vectors

- **Cosine Distance**
    - sim(A,B) = (A⋅B) / (∥A∥∥B∥)
    - dist(A,B)=1−sim(A,B)
    - A⋅B is the dot product of the two vectors
    - ∥A∥ and ∥B∥ are the magnitudes (or norms) of vectors A and B, respectively.

- **Sparse Vectors**:
  - Store only the non-zero elements, for example in the coordinate list (COO) format, which records each value together with its row and column indices.
  - Example: `(0, 0) 2` and `(0, 3) 5` correspond to the vector \[ 2, 0, 0, 5 \].
  - Common in text processing (e.g., TF-IDF vectors) and recommendation systems.
  - Other formats include CSR, CSC, and more.

- **Dense Vectors**:
  - Store every element explicitly, including zeros, typically in arrays or lists; most elements are usually non-zero.
  - Example: \[ 1.2, 3.4, -0.5, 4.1, 5.6 \].
  - Used in physics, graphics, and machine learning.

- **Cosine Distance**:
  - Derived from cosine similarity: \(d = 1 - s\).
  - Ranges between 0 and 2; 0 indicates perfect alignment, and 2 indicates diametric opposition.
  - Used in machine learning, text processing, and recommendation systems.
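
To illustrate the range described above, the sketch below evaluates the distance for parallel, perpendicular and opposite vectors. The helper `cos_dist` is a hypothetical name and handles dense 1-D arrays only.

```python
import numpy as np


def cos_dist(a, b):
    """Cosine distance 1 - (a.b)/(|a||b|) for dense 1-D vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


a = np.array([1.0, 2.0])
same_direction = cos_dist(a, 3 * a)              # parallel: close to 0
orthogonal = cos_dist(a, np.array([-2.0, 1.0]))  # perpendicular: close to 1
opposite = cos_dist(a, -a)                       # anti-parallel: close to 2
print(same_direction, orthogonal, opposite)
```

Scaling a vector does not change the result, since only the direction enters the formula.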

In [10]:
from scipy.sparse import csr_matrix, issparse
from sklearn.metrics.pairwise import cosine_distances

In [11]:
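def cosine_distance(v1, v2):
    """Cosine distance between two vectors; each may be a dense array or a sparse matrix."""
    # Convert dense inputs to CSR so that both arguments are 2-D sparse rows
    if not issparse(v1):
        v1 = csr_matrix(v1)
    if not issparse(v2):
        v2 = csr_matrix(v2)
    return cosine_distances(v1, v2)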

In [12]:
v1 = np.array([2, 3, 4, 5])
v2 = np.array([4, 5, 6, 7])

v1_sparse = csr_matrix([2, 0, 0, 5])
v2_sparse = csr_matrix([0, 5, 6, 7])

print(v1_sparse)

  (0, 0)	2
  (0, 3)	5
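
The printed `(row, column) value` entries are the same triples a COO matrix is built from. As a small sketch, the vector can be reconstructed from them with `scipy.sparse.coo_matrix`; the array names are illustrative.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Non-zero values with their row and column indices
data = np.array([2, 5])
rows = np.array([0, 0])
cols = np.array([0, 3])

# shape must be given explicitly, otherwise trailing zero columns are unknown
v = coo_matrix((data, (rows, cols)), shape=(1, 4))
dense = v.toarray()
print(v.nnz, dense)
```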


In [13]:
print(cosine_distance(v1, v2_sparse))
print(cosine_distance(v1_sparse, v2))

[[0.0398513]]
[[0.28864861]]


In [14]:
x = v2_sparse.reshape(1, -1)
x

<1x4 sparse matrix of type '<class 'numpy.intc'>'
	with 3 stored elements in Compressed Sparse Row format>

In [15]:
print(cosine_distance(v1_sparse, v2_sparse))

[[0.38031255]]


In [16]:
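# Manual check of the sparse result, using dense copies of the same vectors
x = np.array([2, 0, 0, 5])
y = np.array([0, 5, 6, 7])

dot = x.dot(y)                # A⋅B
mag_x = np.linalg.norm(x)     # ∥A∥
mag_y = np.linalg.norm(y)     # ∥B∥
sim = dot/(mag_x*mag_y)       # cosine similarity
res = 1-sim                   # cosine distance
res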

0.38031254717805285
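
The same value can be checked by hand with A = (2, 0, 0, 5) and B = (0, 5, 6, 7). The worked arithmetic below is rounded to four decimal places:

```latex
\begin{aligned}
A \cdot B &= 2\cdot 0 + 0\cdot 5 + 0\cdot 6 + 5\cdot 7 = 35 \\
\lVert A \rVert &= \sqrt{2^2 + 5^2} = \sqrt{29}, \qquad
\lVert B \rVert = \sqrt{5^2 + 6^2 + 7^2} = \sqrt{110} \\
\mathrm{sim}(A,B) &= \frac{35}{\sqrt{29}\,\sqrt{110}} = \frac{35}{\sqrt{3190}} \approx 0.6197 \\
\mathrm{dist}(A,B) &= 1 - \mathrm{sim}(A,B) \approx 0.3803
\end{aligned}
```

This agrees with both the manual NumPy computation and the result from `cosine_distance` on the sparse inputs.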