#### 1. [20 pts] In this assignment, we will update our pipeline to extract keywords that specifically help to differentiate between reviews labeled as sentiment 0 and reviews labeled as sentiment 1. First, remove HTML specific keywords, apply your favorite way of tokenizing and use Tf-Idf features to classify reviews using an SVM classifier. Report the 10-fold CV performance. (Hint: Aim 90% plus performance)


In [1]:
# Import necessary libraries
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

In [2]:
stop_words = set(stopwords.words('english'))

# Add custom words to the set
custom_words = ['br','the', 'of','to','in','that','and','he','is','it']
for word in custom_words:
    stop_words.add(word)

In [3]:
def ie_preprocess(document):
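    # Remove HTML line breaks (replaced by a space so adjacent words do not fuse) and punctuation
    document = re.sub('<br />', ' ', document)
    document = re.sub(r'[^\w\s]', '', document)
    # Sentence parse
    sentences = nltk.sent_tokenize(document)
    
    # Word parse and remove stopwords, using the stop_words set defined above
    # (including the custom words) instead of rebuilding the default list here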
    
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [[word for word in sent if word.lower() not in stop_words and len(word) > 2] for sent in sentences]
    
    return sentences

In [4]:
path = './movie_data.csv'

df = pd.read_csv(path, encoding="utf-8")
# Preview the tokenized reviews; the result is only displayed, not stored back into df
df['review'].apply(ie_preprocess)

0        [[1974, teenager, Martha, Moxley, Maggie, Grac...
1        [[really, like, Kris, Kristofferson, usual, ea...
2        [[SPOILER, read, think, watching, movie, altho...
3        [[people, seen, wonderful, movie, sure, thet, ...
4        [[recently, bought, DVD, forgetting, much, hat...
                               ...                        
49995    [[lets, start, best, building, although, hard,...
49996    [[British, heritage, film, industry, control, ...
49997    [[dont, even, know, begin, one, family, worst,...
49998    [[Richard, Tyler, little, boy, scared, everyth...
49999    [[waited, long, watch, movie, Also, like, Bruc...
Name: review, Length: 50000, dtype: object
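
The preprocessing output above is displayed but not stored, and it is a nested list of tokens rather than a string, so a text vectorizer fitted on `df['review']` would still see the raw HTML and stop words. Below is a minimal sketch of one way to produce cleaned strings instead. The helper name `clean_review` and the small stop-word set are assumptions made for illustration; they are not part of the notebook, which uses NLTK's English list plus the custom words.

```python
import re

# Illustrative stop words only (hypothetical subset, not the NLTK list).
EXAMPLE_STOP_WORDS = {"the", "of", "to", "in", "that", "and", "he", "is", "it", "br", "a", "this"}

def clean_review(text, stop_words=EXAMPLE_STOP_WORDS):
    """Return a cleaned, space-joined string suitable for a text vectorizer."""
    # Replace any HTML tag with a space so neighbouring words do not fuse
    text = re.sub(r"<[^>]+>", " ", text)
    # Lowercase and keep word-like tokens
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Drop stop words and very short tokens
    kept = [t for t in tokens if t not in stop_words and len(t) > 2]
    return " ".join(kept)

example = "This movie is great.<br /><br />The acting is superb!"
print(clean_review(example))  # prints: movie great acting superb
```

In the notebook, something like `df['clean_review'] = df['review'].apply(clean_review)` would store the cleaned text, and that column could then be passed to the vectorizer in place of `df['review']`.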

In [12]:
# Create a TfidfVectorizer for the text data
tfidf_vectorizer = TfidfVectorizer()

# Prepare the features and labels
X = tfidf_vectorizer.fit_transform(df['review'])
y = df['sentiment']

# Initialize the linear SVM classifier
# (SVM training was too slow on my computer, so the faster LinearSVC is used)
log_classifier = LinearSVC()
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(log_classifier, X, y, cv=cv, scoring='accuracy')

# Report the 10-fold cross-validation performance
mean_accuracy = np.mean(scores)
print("Mean Accuracy:", mean_accuracy)

TypeError: cross_val_score() got an unexpected keyword argument 'dual'
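
The `TypeError` above suggests that `dual` was at some point passed to `cross_val_score`. `dual` is a constructor argument of `LinearSVC`, so it belongs on the estimator. The sketch below shows that placement and also wraps the vectorizer and classifier in a `Pipeline`, so the Tf-Idf statistics are re-fitted on the training part of each fold. The tiny corpus is made up and only keeps the example runnable; `dual=False` is one possible setting, not necessarily the one used for the reported run.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus (assumption for illustration, not the movie review data)
texts = [
    "great excellent film", "wonderful great acting",
    "excellent wonderful story", "great story and acting",
    "awful boring film", "terrible awful acting",
    "boring terrible story", "awful story and acting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),      # fitted inside each CV fold
    ("svm", LinearSVC(dual=False)),    # `dual` is set on the estimator
])

# Two folds only because the toy corpus is tiny; the assignment asks for ten
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
print("Mean accuracy:", scores.mean())
```

With the real data, the same pipeline could be passed to `cross_val_score` together with the review column, `df['sentiment']`, and `n_splits=10`.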

#### 2. [20 pts] Rank the first 10 keywords that indicate the difference between the classes 0 and 1 (i.e., 10 words for sentiment 0, and 10 words for sentiment 1).
#### (Hint: Use the classifier coef_ field, consult the scikit-learn API if necessary)

In [6]:
# Extract the coefficients and feature names
log_classifier.fit(X,y)

coefficients = log_classifier.coef_[0]
feature_names = tfidf_vectorizer.get_feature_names_out()
feature_coefficients = list(zip(feature_names, coefficients))

# Sort the feature coefficients
sorted_feature_coefficients = sorted(feature_coefficients, key=lambda x: x[1])

# Print the top and bottom coefficients
def print_top_and_bottom_coefs(sorted_feature_coefficients, num_top=10):
    print("Keywords for Best:")
    for i, (feature, coefficient) in enumerate(sorted_feature_coefficients[-num_top:][::-1]):
        print(f"#{i+1}, {feature}: {coefficient}")

    print("\nKeywords for Worst:")
    for i, (feature, coefficient) in enumerate(sorted_feature_coefficients[:num_top]):
        print(f"#{i+1}, {feature}: {coefficient}")

print_top_and_bottom_coefs(sorted_feature_coefficients)




Keywords for Best:
#1, excellent: 4.026698778286953
#2, great: 3.6040472361043303
#3, perfect: 3.4247399655035737
#4, refreshing: 3.020334847658237
#5, today: 2.999946166693743
#6, wonderful: 2.8764745245509435
#7, enjoyable: 2.816545075386197
#8, amazing: 2.7022374965444023
#9, hilarious: 2.6728862919206637
#10, brilliant: 2.6627572805013484

Keywords for Worst:
#1, worst: -6.30897808007753
#2, waste: -5.084048478808138
#3, awful: -4.721001754762347
#4, boring: -4.159115585293118
#5, disappointment: -3.870624803293975
#6, fails: -3.6488816302643583
#7, terrible: -3.599614638171832
#8, poor: -3.544712368363313
#9, disappointing: -3.460399802424631
#10, horrible: -3.3207839064586593


#### 3. [20 pts] Using the results in (2.), list two pairs of words that can be sentimental antonyms.

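"Terrible" and "brilliant" can be sentimental antonyms, and so can "best" and "worst". Note that "best" does not appear in the top 10 list for sentiment 1 in (2.); a pair drawn entirely from the lists is "excellent" and "awful".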

#### 4. [20 pts] Cluster the reviews into two groups.

In [7]:
from sklearn.cluster import KMeans

print(f'N data points= {X.shape[0]}, M features= {X.shape[1]}')


N data points= 50000, M features= 101895


#### Report the size of the clusters.

In [8]:
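from collections import Counter
Clusters = KMeans(n_clusters=2, random_state=20, n_init=10).fit(X)
y_ground = Clusters.predict(X)
# Report the cluster sizes; print them so they show even though this is not the last expression
print(Counter(Clusters.labels_))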

#### Using the cluster IDs as the ground truth, classify and report the 10-fold CV classification performance.

In [9]:
scores = cross_val_score(LinearSVC(), X, y_ground, cv=cv, scoring='accuracy')
np.mean(scores)



0.9837999999999999

#### Considering the results in this problem and your results, do you support using the method of clustering for sentiments when a ground truth is not available?

Considering the results in this problem, I do support using clustering for sentiment when a ground truth is not available. While the cluster labels are not an entirely accurate ground truth, they are a good indicator of what the correct answer is. The same kind of model that predicted the actual labels at about 90% accuracy predicted the cluster labels at about 98% (0.9838). This indicates some benefit for labeling unseen data, as the actual ground truth was not far off from the simulated ground truth.
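
A high CV accuracy on the cluster IDs shows that the two clusters are easy to separate, but it does not by itself show that they line up with sentiment. One way to check that directly is to compare the cluster IDs with the given labels. The sketch below does this for two clusters; the function name `cluster_label_agreement` and the example labels are made up for illustration.

```python
from collections import Counter

def cluster_label_agreement(true_labels, cluster_ids):
    """Best-case agreement between binary labels and two cluster IDs.

    Cluster IDs are arbitrary, so both mappings (as-is and flipped) are tried
    and the better one is reported.
    """
    pairs = Counter(zip(true_labels, cluster_ids))
    n = len(true_labels)
    same = pairs[(0, 0)] + pairs[(1, 1)]
    flipped = pairs[(0, 1)] + pairs[(1, 0)]
    return max(same, flipped) / n

# Made-up labels for illustration only
sentiment = [0, 0, 0, 1, 1, 1, 1, 0]
clusters = [1, 1, 0, 0, 0, 1, 0, 1]
print(cluster_label_agreement(sentiment, clusters))  # prints: 0.75
```

In the notebook this could be called as `cluster_label_agreement(y.tolist(), y_ground.tolist())`. A value close to 0.5 would mean the clusters carry little information about sentiment, while a value close to 1.0 would support using them as stand-in labels.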

#### 5. [20 pts] Compare the top 10 keywords as generated in (2.) and then comment on these new keywords. Now have a look at the results, notice that clustering and the given sentiment classes are completely different. Do you have any suggestions about automatic labeling of reviews? Perhaps one way could be assigning class labels according to some offline positive and negative keywords. Outline an approach.

In [10]:
# Extract the coefficients and feature names
cluster_fitted=LinearSVC().fit(X,y_ground)

coefficients = cluster_fitted.coef_[0]
feature_names = tfidf_vectorizer.get_feature_names_out()
feature_coefficients = list(zip(feature_names, coefficients))

# Sort the feature coefficients
sorted_feature_coefficients = sorted(feature_coefficients, key=lambda x: x[1])

# Print the top and bottom coefficients, reusing print_top_and_bottom_coefs
# defined in the cell for question 2
print_top_and_bottom_coefs(sorted_feature_coefficients)



Keywords for Best:
#1, br: 31.047357646055904
#2, the: 16.18957161950414
#3, of: 6.017721950187207
#4, to: 5.3504105453229664
#5, in: 4.137673089842074
#6, that: 4.075721682992671
#7, and: 3.8328570608025294
#8, he: 3.1985988501207205
#9, is: 3.078283518956929
#10, it: 2.52955987862024

Keywords for Worst:
#1, weren: -1.0234126221661974
#2, oliver: -0.969345766766179
#3, werewolf: -0.8914495685983678
#4, propaganda: -0.8562043090716025
#5, freedom: -0.8399016940342013
#6, concentrates: -0.8326535524452353
#7, truck: -0.8259538979617969
#8, boxing: -0.8238037737376392
#9, 70: -0.8148036872395056
#10, bone: -0.8120330324767976
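
The cluster keywords above are mostly function words plus the leftover `br` tag, which suggests the clusters reflect something other than sentiment. For the keyword-based labeling that question 5 mentions, the following is a minimal sketch under stated assumptions. The word sets reuse the keywords from (2.) purely for illustration ("today" is left out because it is not a sentiment word); a truly offline lexicon would come from an independent source. The function name `label_by_keywords` and the `min_margin` parameter are hypothetical.

```python
import re

# Illustrative lexicons taken from the keyword lists in (2.)
POSITIVE = {"excellent", "great", "perfect", "refreshing", "wonderful",
            "enjoyable", "amazing", "hilarious", "brilliant"}
NEGATIVE = {"worst", "waste", "awful", "boring", "disappointment",
            "fails", "terrible", "poor", "disappointing", "horrible"}

def label_by_keywords(text, min_margin=1):
    """Return 1 (positive), 0 (negative), or None when the evidence is too weak."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    # Abstain when the counts differ by less than the required margin
    if abs(pos - neg) < min_margin:
        return None
    return 1 if pos > neg else 0

print(label_by_keywords("An excellent, wonderful film"))    # prints: 1
print(label_by_keywords("The worst waste of time"))         # prints: 0
print(label_by_keywords("A great idea but a boring film"))  # prints: None
```

Reviews labeled this way could be used to train a classifier, and the reviews where the function abstains could then be labeled by that classifier. Raising `min_margin` trades coverage for more reliable automatic labels.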
