#### 1. [20 pts] In this assignment, we will update our pipeline to extract keywords that specifically help to differentiate between reviews labeled as sentiment 0 and reviews labeled as sentiment 1. First, remove HTML-specific keywords, apply your favorite way of tokenizing, and use Tf-Idf features to classify reviews using an SVM classifier. Report the 10-fold CV performance. (Hint: aim for 90%+ performance.)


In [1]:
# Import necessary libraries
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

In [2]:
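def ie_preprocess(document):
    """Strip HTML line breaks and punctuation, then tokenize and drop stopwords."""
    # Replace <br /> with a space so the words on either side are not glued together
    document = re.sub('<br />', ' ', document)
    document = re.sub(r'[^\w\s]', '', document)

    # Sentence parse (punctuation is already removed, so most reviews come back as one sentence)
    sentences = nltk.sent_tokenize(document)

    # Word parse and remove stopwords
    stop_words = set(stopwords.words('english'))
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [[word for word in sent if word.lower() not in stop_words] for sent in sentences]

    return sentences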

In [3]:
path = './movie_data.csv'

df = pd.read_csv(path, encoding="utf-8")
# Preview the tokenized reviews; the result is only displayed, not stored in df
df['review'].apply(ie_preprocess)

0        [[1974, teenager, Martha, Moxley, Maggie, Grac...
1        [[OK, really, like, Kris, Kristofferson, usual...
2        [[SPOILER, read, think, watching, movie, altho...
3        [[hi, people, seen, wonderful, movie, im, sure...
4        [[recently, bought, DVD, forgetting, much, hat...
                               ...                        
49995    [[OK, lets, start, best, building, although, h...
49996    [[British, heritage, film, industry, control, ...
49997    [[dont, even, know, begin, one, family, worst,...
49998    [[Richard, Tyler, little, boy, scared, everyth...
49999    [[waited, long, watch, movie, Also, like, Bruc...
Name: review, Length: 50000, dtype: object

In [4]:
# Create a TfidfVectorizer for the text data
tfidf_vectorizer = TfidfVectorizer()

# Prepare the features and labels (the vectorizer is fit on the raw review text)
X = tfidf_vectorizer.fit_transform(df['review'])
y = df['sentiment']

# An SVM was too slow on this machine, so Logistic Regression is used instead
# (the variable keeps the name svm_classifier)
svm_classifier = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(svm_classifier, X, y, cv=cv, scoring='accuracy')

# Report the 10-fold cross-validation performance
mean_accuracy = np.mean(scores)
print("Mean Accuracy:", mean_accuracy)

Mean Accuracy: 0.8985200000000001
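
The question asks for an SVM, and the cell above falls back to Logistic Regression for speed. One option that is usually fast on sparse Tf-Idf features is scikit-learn's `LinearSVC`. The sketch below is only an illustration on a tiny hypothetical dataset (`toy_reviews` and `toy_labels` are made-up stand-ins for `df['review']` and `df['sentiment']`), and it uses 4 folds because the toy data is too small for 10. It has not been run on the full review set, so no accuracy is claimed for it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for df['review'] and df['sentiment']
toy_reviews = [
    "A great film with an excellent cast",
    "Wonderful story and brilliant acting",
    "I loved it, an amazing and perfect movie",
    "One of the best movies, really enjoyed it",
    "The worst film, a complete waste of time",
    "Boring plot and awful dialogue",
    "Terrible acting and a dull script",
    "Bad movie with poor pacing, nothing works",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline refits Tf-Idf on each training fold
svm_pipeline = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))

# 4 folds only because the toy data has 4 samples per class
toy_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
toy_scores = cross_val_score(svm_pipeline, toy_reviews, toy_labels,
                             cv=toy_cv, scoring='accuracy')
print("Mean accuracy on toy data:", toy_scores.mean())
```

On the real data, the same pipeline could be passed `df['review']`, `df['sentiment']`, and the 10-fold `cv` object defined earlier.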


#### 2. [20 pts] Rank the first 10 keywords that indicate the difference between the classes 0 and 1 (i.e., 10 words for sentiment 0, and 10 words for sentiment 1).
#### (Hint: Use the classifier coef_ field, consult the scikit-learn API if necessary)

In [34]:
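# cross_val_score only fits clones of the estimator, so fit on the full data before reading coef_
svm_classifier.fit(X, y)

# Extract the coefficients and feature names
coefficients = svm_classifier.coef_[0]
feature_names = tfidf_vectorizer.get_feature_names_out()
feature_coefficients = list(zip(feature_names, coefficients))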

# Sort the feature coefficients
sorted_feature_coefficients = sorted(feature_coefficients, key=lambda x: x[1])

# Print the top and bottom coefficients
def print_top_and_bottom_coefs(sorted_feature_coefficients, num_top=10):
    print("Keywords for Best:")
    for i, (feature, coefficient) in enumerate(sorted_feature_coefficients[-num_top:][::-1]):
        print(f"#{i+1}, {feature}: {coefficient}")

    print("\nKeywords for Worst:")
    for i, (feature, coefficient) in enumerate(sorted_feature_coefficients[:num_top]):
        print(f"#{i+1}, {feature}: {coefficient}")

print_top_and_bottom_coefs(sorted_feature_coefficients)


Keywords for Best:
#1, great: 9.127518653723115
#2, excellent: 8.090241311712196
#3, best: 6.113411362054497
#4, perfect: 6.062811038307453
#5, wonderful: 5.6953862208981025
#6, amazing: 5.4603758236816295
#7, loved: 5.029862392170517
#8, today: 4.959431171316908
#9, brilliant: 4.843816644909755
#10, enjoyed: 4.7478434792514985

Keywords for Worst:
#1, worst: -11.908592211252717
#2, bad: -9.37669364710265
#3, awful: -8.70671904307931
#4, waste: -8.507849207199447
#5, boring: -8.002396674762371
#6, poor: -7.050906745114649
#7, terrible: -7.019335497621038
#8, nothing: -6.2169962595996315
#9, worse: -5.674746285129455
#10, dull: -5.64338965296613


#### 3. [20 pts] Using the results in (2.), list two pairs of words that can be sentimental antonyms.

Two pairs of sentimental antonyms from the lists above are "best" / "worst" and "brilliant" / "terrible".

#### 4. [20 pts] Cluster the reviews into two groups.

In [42]:
from sklearn.cluster import KMeans
print(f'N data points= {X.shape[0]}, M features= {X.shape[1]}')

Clusters = KMeans(n_clusters=2, random_state=20, n_init=10).fit_predict(np.array(X.todense()))


N data points= 50000, M features= 101895


MemoryError: Unable to allocate 38.0 GiB for an array with shape (50000, 101895) and data type float64
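
The error comes from `X.todense()`: a dense float64 array of shape (50000, 101895) needs about 38 GiB. `KMeans` accepts SciPy sparse matrices directly, so the dense conversion can be skipped. The following is a minimal sketch on a hypothetical toy corpus (`toy_docs` is made up for illustration); on the real data the sparse `X` from the earlier cell would be passed instead. `MiniBatchKMeans` is another scikit-learn option if full `KMeans` is still too slow.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for df['review']
toy_docs = [
    "great movie excellent acting",
    "excellent movie great story",
    "great story excellent acting",
    "worst movie boring plot",
    "boring movie awful plot",
    "awful plot worst acting",
]

# fit_transform returns a SciPy sparse matrix; keep it sparse
X_toy = TfidfVectorizer().fit_transform(toy_docs)

# Same KMeans settings as the cell above, but without todense()
kmeans = KMeans(n_clusters=2, random_state=20, n_init=10)
toy_clusters = kmeans.fit_predict(X_toy)

# Size of each cluster
cluster_sizes = np.bincount(toy_clusters, minlength=2)
print("Cluster sizes:", cluster_sizes)
```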

#### Report the size of the clusters.

#### Using the cluster IDs as the ground truth, classify and report the 10-fold CV classification performance.
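
One way to set this step up is to treat the cluster IDs as labels and reuse the cross-validation recipe from question 1. The sketch below assumes a hypothetical toy corpus (`toy_docs`, two made-up topics with disjoint vocabularies) and uses 3 folds because the toy data is small; with the real data, `X` and a 10-fold `StratifiedKFold` would take their place. No performance figure for the real data is implied.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical toy corpus with two clearly separated topics
toy_docs = [
    "rocket launch orbit planet",
    "planet orbit rocket mission",
    "orbit mission launch rocket",
    "mission planet launch orbit",
    "rocket planet mission launch",
    "launch orbit planet rocket mission",
    "recipe kitchen oven flour",
    "flour oven recipe butter",
    "oven butter kitchen recipe",
    "butter flour kitchen oven",
    "recipe flour butter kitchen",
    "kitchen oven flour recipe butter",
]

X_toy = TfidfVectorizer().fit_transform(toy_docs)

# Cluster IDs play the role of the ground-truth labels
cluster_ids = KMeans(n_clusters=2, random_state=20, n_init=10).fit_predict(X_toy)
print("Cluster sizes:", np.bincount(cluster_ids, minlength=2))

# Classify against the cluster IDs (3 folds only because the toy data is tiny)
clf = LogisticRegression(max_iter=1000)
toy_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cluster_scores = cross_val_score(clf, X_toy, cluster_ids, cv=toy_cv, scoring='accuracy')
print("Mean accuracy against cluster IDs:", cluster_scores.mean())
```

A high score here only shows that the clusters are easy to reproduce from the features; it says nothing about whether the clusters line up with sentiment.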

#### Considering the results in this problem and your results, do you support using the method of clustering for sentiments when a ground truth is not available?

#### 5. [20 pts] Compare the top 10 keywords, generated as in (2.), and then comment on these new keywords. Now have a look at the results and notice that the clustering and the given sentiment classes are completely different. Do you have any suggestions about automatic labeling of reviews? Perhaps one way could be assigning class labels according to some offline positive and negative keywords. Outline an approach.
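
The keyword-based labeling idea could be sketched roughly as follows. This is an assumption-laden illustration, not a tested labeling scheme: the two word sets are borrowed from the top coefficients in (2.) purely as placeholders for an offline lexicon, and `label_by_keywords` is a hypothetical helper. A review gets label 1 if it contains more positive than negative keywords, 0 in the opposite case, and `None` when the counts tie, so that ambiguous reviews can be left unlabeled.

```python
import re

# Placeholder lexicons (assumption): in practice these would come from an offline source
POSITIVE_WORDS = {"great", "excellent", "best", "perfect", "wonderful",
                  "amazing", "loved", "brilliant", "enjoyed"}
NEGATIVE_WORDS = {"worst", "bad", "awful", "waste", "boring",
                  "poor", "terrible", "worse", "dull"}

def label_by_keywords(text):
    """Return 1 (positive), 0 (negative), or None when the keyword counts tie."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    if pos > neg:
        return 1
    if neg > pos:
        return 0
    return None  # no keywords, or an even split: leave unlabeled

# Hypothetical example reviews
examples = [
    "A great cast and an excellent script.",
    "Boring plot, awful acting, a waste of time.",
    "I watched it on a plane.",
]
auto_labels = [label_by_keywords(text) for text in examples]
print(auto_labels)  # [1, 0, None]
```

Reviews labeled this way could then serve as training data for a classifier, which would in turn label the reviews that the keyword rule left as `None`.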