## INFS 770 - Assignment 4

**Note**: Created using Anaconda Python 3.7.3 (64-bit)

---

### Pre-task setup

In [1]:
# Imports
%matplotlib inline
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gensim
from gensim.models import LdaModel, LsiModel
from pprint import pprint

# scikit-learn packages
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn import metrics, model_selection
from sklearn.cluster import KMeans
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split

# nltk packages
import nltk
from nltk import word_tokenize 
from nltk.corpus import stopwords
from nltk import FreqDist

### Q1: Read data into dataframe

In [2]:
# Q1.1: Load data
data_file = "amazon_review_texts.csv"
df = pd.read_csv(data_file,
                 sep=",",
                 header='infer',
                 )

# Get a listing of our categories
categories = list(df.columns)
#pprint(categories)

In [3]:
# Q1.2: Make sure we've got data in our dataframe
print(df.head())

          pid helpful  score  \
0  B000GAYQL8     0/0      5   
1  B000IBNPDA     0/0      5   
2  B000J2HA16     0/0      5   
3  B000BDIQPM     0/0      5   
4  B000GZTH9E     0/3      4   

                                                text category  
0  GREAT WATCH AND GREAT LOOK. BIG FACE AND 4 DIF...    watch  
1  Bought this as a Christmas gift, my boyfriend ...    watch  
2  I love this watch! Its sporty, without looking...    watch  
3  Works great,looks nice,dont have to worry abou...    watch  
4  I need to change the watch wrist and I havent ...    watch  


In [4]:
# Q1.3: Show the distribution counts of the variable "score"
df['score'].value_counts()

5    2070
4     773
1     595
3     303
2     259
Name: score, dtype: int64

In [5]:
# Q1.4: Show the distribution counts of the variable "category"
df['category'].value_counts()

automotive     1000
electronics    1000
watch          1000
software       1000
Name: category, dtype: int64

### Q2: Tokenize the reviews

In [6]:
# Q2.1: Tokenize, performing the following:
##      - lowercase
##      - remove stopwords
##      - perform stemming

# get a set of stopwords
stopwords = set(nltk.corpus.stopwords.words("english"))

def before_token(documents):
    # convert words to lower case
    lower = map(str.lower, documents)
    # remove punctuation (keep only word tokens of 2+ characters)
    punctuationless = list(map(lambda x: " ".join(re.findall(r'\b\w\w+\b', x)), lower))
    # remove numbers
    return list(map(lambda x: re.sub(r'\b[0-9]+\b', '', x), punctuationless))

# initialize a stemmer
stemmer = nltk.stem.PorterStemmer()

# initialize a container of token frequencies
fdist = nltk.FreqDist()


# define a function that preprocesses a single document and returns a list of tokens
def preprocess(doc):
    tokens = []
    for token in doc.split():
        if token not in stopwords:
            tokens.append(stemmer.stem(token))
    return tokens
            
# preprocess all documents
processed = list(map(preprocess, before_token(df['text'])))


#print(processed[0])

In [7]:
# Q2.2: Calculate word frequency distribution
#type(processed)

# calculate the token frequency
# FreqDist takes a list of tokens and returns a dict-like object mapping each unique token to its frequency
fdist = nltk.FreqDist([token for doc in processed for token in doc])

print("Unique tokens: %d" % fdist.B())
print("Total tokens: %d" % fdist.N())
print("Tokens occurred only once: %d" % len(fdist.hapaxes()))

Unique tokens: 10973
Total tokens: 193927
Tokens occurred only once: 4701
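For readers without NLTK at hand, the three statistics above can be approximated with `collections.Counter`. This is a minimal sketch on a made-up token list (`toy_docs` is an assumption, not taken from the reviews): the number of distinct tokens plays the role of `B()`, the total count the role of `N()`, and the tokens seen exactly once the role of `hapaxes()`.

```python
from collections import Counter

# Hypothetical, already-preprocessed documents (lists of stemmed tokens).
toy_docs = [
    ["watch", "great", "look"],
    ["watch", "band", "great"],
    ["batteri", "charg", "watch"],
]

# Flatten the nested lists and count every token.
counts = Counter(token for doc in toy_docs for token in doc)

unique_tokens = len(counts)                                # analogous to fdist.B()
total_tokens = sum(counts.values())                        # analogous to fdist.N()
hapaxes = sorted(t for t, c in counts.items() if c == 1)   # analogous to fdist.hapaxes()

print("Unique tokens: %d" % unique_tokens)
print("Total tokens: %d" % total_tokens)
print("Tokens occurred only once: %d" % len(hapaxes))
```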


In [8]:
# Q2.3: Print top 10 most frequent words

fdist.tabulate(10)

  watch     use     one    work    time    like product   great     get   would 
   2553    2476    1795    1605    1420    1375    1336    1318    1309    1217 


**Q: Which of the top 10 words might not be useful in text clustering and classification, and why?**

A: These words from the top 10 may not be useful in text clustering:
* use
* one
* work
* like
* product
* great
* get 
* would

These words could appear in reviews for almost any product category, so I don't think they'll help in classifying our reviews into categories.  Some of them are also treated as stop words and will be removed in later steps.

The two words "watch" and "time", however, are much more distinctive.  "watch" is literally the name of one of our categories, and "time" is much more likely to appear in reviews for the watch category, so I think both will be very useful in classifying our review data.
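One way to make this argument concrete is to check how a token's documents are spread across categories. The sketch below uses a handful of invented reviews (`toy_reviews` and the helper `category_share` are hypothetical names, not part of the assignment code). A share near 1.0 suggests the token is concentrated in one category, while a share near 0.25 suggests it is spread evenly across four.

```python
from collections import Counter, defaultdict

# Hypothetical (category, tokens) pairs standing in for the processed reviews.
toy_reviews = [
    ("watch", ["watch", "time", "great"]),
    ("watch", ["watch", "band", "use"]),
    ("software", ["program", "use", "great"]),
    ("electronics", ["batteri", "use", "work"]),
    ("automotive", ["instal", "use", "work"]),
]

# For each token, count the number of reviews per category that contain it.
doc_freq = defaultdict(Counter)
for category, tokens in toy_reviews:
    for token in set(tokens):
        doc_freq[token][category] += 1

def category_share(token):
    """Return the largest fraction of a token's documents that fall in one category."""
    per_category = doc_freq[token]
    return max(per_category.values()) / sum(per_category.values())

for token in ("watch", "use"):
    print(token, dict(doc_freq[token]), round(category_share(token), 2))
```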

### Q3: Vectorize

In [9]:
# Q3.1: Reconstruct the documents
processed_doc = list(map(" ".join, processed))

In [10]:
# Q3.2: Vectorize all of the documents, use norm=l2


vectorizer = TfidfVectorizer(norm="l2",
                             max_df=0.8, # remove frequent words (>80%)
                             stop_words='english' # remove English stopwords stored in Scikit-Learn
                            )

X = vectorizer.fit_transform(processed_doc)

In [11]:
# Q3.3: Print the number of features extracted by the TFIDF vectorizer
print("n_samples: %d, n_features: %d" % X.shape)

n_samples: 4000, n_features: 10833
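To show what the vectorizer computes for each document, here is a hand-rolled sketch of TF-IDF weighting followed by L2 normalization. It assumes scikit-learn's default smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and runs on three invented documents; `toy_docs` and `tfidf_l2` are hypothetical names, and stop word removal and `max_df` filtering are left out.

```python
import math
from collections import Counter

# Hypothetical preprocessed documents (space-joined stems, as in processed_doc).
toy_docs = ["watch band watch", "batteri charg", "watch batteri"]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_l2(tokens):
    """TF-IDF weights for one document, scaled to unit Euclidean length."""
    tf = Counter(tokens)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_l2(tokens) for tokens in tokenized]
for doc, vec in zip(toy_docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

With `norm="l2"` every document vector has length 1, so long and short reviews become comparable when K-means measures distances between them.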


### Q4: K-means clustering

In [12]:
# Q4.1: Categorize the documents into 4 clusters
km = KMeans(n_clusters=4, max_iter=100, random_state=54321)

# Fit using our data
km.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=54321, tol=0.0001, verbose=0)

In [13]:
# Q4.2: Print the top 10 representative words for each cluster
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(4):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])

Cluster 0:
 use
 product
 work
 great
 good
 instal
 program
 like
 softwar
 time
Cluster 1:
 batteri
 charg
 charger
 power
 adapt
 appl
 camera
 canon
 work
 origin
Cluster 2:
 watch
 look
 band
 time
 great
 wear
 love
 like
 nice
 price
Cluster 3:
 bed
 air
 inflat
 comfort
 pump
 sleep
 mattress
 deflat
 airb
 easi


**Q: How well do you think these words describe the 4 product categories?**

A: From the output of Q1.4, we know that our 4 categories are:
1. software
2. electronics
3. automotive
4. watch  

For our K-means clusters, 3 of the 4 clusters match a category very well:

* Cluster 0 maps to software
* Cluster 1 maps to electronics
* Cluster 2 maps to watch

However, Cluster 3 does not seem to map well to the remaining category, automotive.  Its top words describe an inflatable mattress, not anything automotive.  I think that's a major problem in our K-means clustering results, and it could send us down the wrong path in analyzing that 4th category if we relied on the current clustering results.
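The mapping above was judged by eye from the top words. A contingency table of clusters against true categories would give a more direct check. The sketch below uses eight invented labels (`true_categories` and `cluster_labels` are hypothetical, not the notebook's results) and also computes purity, the fraction of reviews that belong to their cluster's majority category.

```python
from collections import Counter, defaultdict

# Hypothetical true categories and K-means labels for eight reviews.
true_categories = ["watch", "watch", "software", "software",
                   "electronics", "electronics", "automotive", "automotive"]
cluster_labels = [2, 2, 0, 0, 1, 0, 0, 3]

# Contingency table: cluster -> Counter of true categories.
table = defaultdict(Counter)
for category, cluster in zip(true_categories, cluster_labels):
    table[cluster][category] += 1

# Purity: fraction of reviews that belong to their cluster's majority category.
majority_total = sum(max(counts.values()) for counts in table.values())
purity = majority_total / len(true_categories)

for cluster in sorted(table):
    print("Cluster %d: %s" % (cluster, dict(table[cluster])))
print("Purity: %.3f" % purity)
```

On the real data, something like `pd.crosstab(df['category'], km.labels_)` should produce the same kind of table.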

### Q5: Build a topic model

In [18]:
# Q5.1: Build a topic model using Latent Dirichlet Allocation (LDA)
# Set number of topics to 4

# Convert the vectorized data to a gensim corpus object
corpus_vect = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
id2word = dict((v,k) for k,v in vectorizer.vocabulary_.items())
#print(id2word)

# Build the lda model
lda = LdaModel(corpus_vect, 
               num_topics=4,
               #random_state=54321,
               id2word=id2word, 
               passes=10)

In [19]:
# Q5.2: Print the topics
pprint(lda.print_topics())

[(0,
  '0.006*"use" + 0.004*"product" + 0.004*"work" + 0.004*"softwar" + '
  '0.004*"program" + 0.003*"time" + 0.003*"great" + 0.003*"like" + '
  '0.003*"game" + 0.003*"good"'),
 (1,
  '0.010*"watch" + 0.006*"great" + 0.005*"price" + 0.004*"batteri" + '
  '0.004*"charger" + 0.004*"work" + 0.004*"love" + 0.004*"product" + '
  '0.004*"bought" + 0.004*"amazon"'),
 (2,
  '0.002*"dent" + 0.001*"snow" + 0.001*"la" + 0.001*"el" + 0.001*"cap" + '
  '0.001*"tester" + 0.001*"en" + 0.001*"cat" + 0.001*"es" + 0.001*"lite"'),
 (3,
  '0.008*"watch" + 0.005*"use" + 0.005*"great" + 0.004*"work" + 0.004*"air" + '
  '0.004*"bed" + 0.004*"good" + 0.004*"like" + 0.004*"batteri" + 0.004*"fit"')]


**Q: Examine the representative words for each topic. Please create a text box and discuss how well these words describe the 4 product categories, and also tell me which unsupervised method (clustering vs LDA) you think is more effective in identifying the categories in this example and why?**

A: The LDA model seems to have performed worse than clustering.  The first two topics map well to our categories:

* Topic 0 maps to Software 
* Topic 1 maps to Watch  

Topic 2 is less coherent.  It might map to Automotive, as it contains the words "dent" and "snow".  

Topic 3 looks like it maps to the Watch category again, duplicating Topic 1.

I think K-means clustering produced more coherent groups overall, even though it surfaced a group that isn't one of our categories (air mattress).  Three of its four clusters clearly identified review categories, so I would pick it over LDA.

### Q6: Cross-validation

In [20]:
# Q6.1: Vectorize all of the documents again, use norm=l2 & drop any words appearing only once


vectorizer = TfidfVectorizer(norm="l2",
                             max_df=0.8, # remove frequent words (>80%)
                             stop_words='english', # remove English stopwords stored in Scikit-Learn
                             min_df=1 # NOTE: 1 is the default and keeps every term; min_df=2 would drop words appearing in just 1 document
                            )

X = vectorizer.fit_transform(processed_doc)

In [21]:
# Print the number of features extracted by the TFIDF vectorizer
print("n_samples: %d, n_features: %d" % X.shape)

n_samples: 4000, n_features: 10833


In [22]:
# Q6.2: Perform 5-fold cross-validation using SGD classifier
# Use norm=l2
# Drop any words appearing in only 1 document

f1_overall_avg = []

skf = model_selection.StratifiedKFold(n_splits=5)
fold = 0
for train_index, test_index in skf.split(np.array(processed_doc), df["score"]):
    fold += 1
    print("Fold %d" % fold)
    # partition
    train_x, test_x = np.array(processed_doc)[train_index], np.array(processed_doc)[test_index]
    train_y, test_y = df["score"][train_index], df["score"][test_index]
    # vectorize
    vectorizer = TfidfVectorizer(norm="l2",
                                 max_df=0.8, # remove frequent words (>80%)
                                 stop_words='english', # remove English stopwords stored in Scikit-Learn
                                 min_df=1 # NOTE: 1 is the default and keeps every term; min_df=2 would drop words appearing in just 1 document
                                )
    X = vectorizer.fit_transform(train_x)
    print("Number of features: %d" % len(vectorizer.vocabulary_))
    X_test = vectorizer.transform(test_x)
    # train model
    clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
    clf.fit(X, train_y)
    # predict
    pred = clf.predict(X_test)
    # classification results
    for line in metrics.classification_report(test_y, pred).split("\n"):
        print(line)
    # Save average across fold
    f1_overall_avg.append(metrics.f1_score(test_y, pred, labels=None, average='weighted'))

Fold 1
Number of features: 9726
              precision    recall  f1-score   support

           1       0.75      0.39      0.52       119
           2       0.27      0.06      0.10        52
           3       0.10      0.03      0.05        61
           4       0.26      0.14      0.18       155
           5       0.62      0.93      0.74       414

   micro avg       0.57      0.57      0.57       801
   macro avg       0.40      0.31      0.32       801
weighted avg       0.50      0.57      0.50       801

Fold 2
Number of features: 9078
              precision    recall  f1-score   support

           1       0.46      0.63      0.53       119
           2       0.13      0.06      0.08        52
           3       0.17      0.11      0.14        61
           4       0.29      0.23      0.25       155
           5       0.68      0.74      0.71       414

   micro avg       0.53      0.53      0.53       801
   macro avg       0.34      0.35      0.34       801
weighted avg 

In [23]:
# Print overall average f1 score across 5 folds
f1_score = sum(f1_overall_avg) / float(len(f1_overall_avg))

print("Average f1 score across all 5 folds:\n {:.2}".format(f1_score))

Average f1 score across all 5 folds:
 0.52
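The `weighted` average used for the per-fold score weights each class's F1 by its support. As a worked check, the sketch below recomputes the weighted and macro averages for Fold 1 from the rounded per-class values printed in the report above, so small rounding differences from scikit-learn's exact values are possible.

```python
# Per-class (F1, support) pairs copied from the Fold 1 report above.
fold1 = {
    1: (0.52, 119),
    2: (0.10, 52),
    3: (0.05, 61),
    4: (0.18, 155),
    5: (0.74, 414),
}

total_support = sum(support for _, support in fold1.values())

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in fold1.values()) / total_support

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1 for f1, _ in fold1.values()) / len(fold1)

print("weighted F1: %.2f" % weighted_f1)
print("macro F1:    %.2f" % macro_f1)
```

Because the 5-star class makes up more than half of each fold, the weighted score is dominated by how well that one class is predicted, which is why it sits well above the macro average.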


### Q7: Satisfaction

In [24]:
# Q7.1: Create a new variable "satisfaction", based off score data

# Make an independent copy of the original dataframe
df_sat = df.copy()

# Add satisfaction column, code based off score values
df_sat['satisfaction'] = np.where(df_sat['score']<4, 0, 1)

# Ensure we have a new satisfaction column
df_sat.head(20)

|   | pid | helpful | score | text | category | satisfaction |
|---|---|---|---|---|---|---|
| 0 | B000GAYQL8 | 0/0 | 5 | GREAT WATCH AND GREAT LOOK. BIG FACE AND 4 DIF... | watch | 1 |
| 1 | B000IBNPDA | 0/0 | 5 | Bought this as a Christmas gift, my boyfriend ... | watch | 1 |
| 2 | B000J2HA16 | 0/0 | 5 | I love this watch! Its sporty, without looking... | watch | 1 |
| 3 | B000BDIQPM | 0/0 | 5 | Works great,looks nice,dont have to worry abou... | watch | 1 |
| 4 | B000GZTH9E | 0/3 | 4 | I need to change the watch wrist and I havent ... | watch | 1 |
| 5 | B000GB0G7A | 0/0 | 4 | There is not much to this gloriously inexpensi... | watch | 1 |
| 6 | B0000643Q6 | 2-Feb | 4 | Have always loved Movado designs and when I lo... | watch | 1 |
| 7 | B000GAYQJK | 0/0 | 5 | this watch is cool you can switch the settings... | watch | 1 |
| 8 | B0002XV266 | Apr-42 | 1 | Materialism involves the importance one attach... | watch | 0 |
| 9 | B000E4ARN2 | 2-Jan | 5 | I bought this watch just a short time ago and ... | watch | 1 |
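Before modelling, it helps to know how balanced the new `satisfaction` variable is. The sketch below applies the same threshold as the `np.where` call (scores below 4 map to 0) to the score counts printed in Q1.3, without touching the dataframe.

```python
# Score counts copied from the Q1.3 output.
score_counts = {5: 2070, 4: 773, 1: 595, 3: 303, 2: 259}

# Same rule as the np.where call: scores below 4 map to 0, the rest to 1.
satisfaction_counts = {0: 0, 1: 0}
for score, count in score_counts.items():
    satisfaction_counts[0 if score < 4 else 1] += count

total = sum(satisfaction_counts.values())
for label, count in sorted(satisfaction_counts.items()):
    print("satisfaction=%d: %d reviews (%.1f%%)" % (label, count, 100.0 * count / total))
```

Roughly 71% of reviews fall in the satisfied class, so a classifier that always predicts 1 would already be right about 71% of the time. That is a useful baseline when reading the accuracy figures that follow.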


In [25]:
# Q7.2: Perform a 5-fold cross validation to predict satisfaction using an SGD classifier

f1_overall_avg = []

skf = model_selection.StratifiedKFold(n_splits=5)
fold = 0
for train_index, test_index in skf.split(np.array(processed_doc), df_sat["satisfaction"]):
    fold += 1
    print("Fold %d" % fold)
    # partition
    train_x, test_x = np.array(processed_doc)[train_index], np.array(processed_doc)[test_index]
    train_y, test_y = df_sat["satisfaction"][train_index], df_sat["satisfaction"][test_index]
    # vectorize
    vectorizer = TfidfVectorizer(norm="l2",
                                 max_df=0.8, # remove frequent words (>80%)
                                 stop_words='english', # remove English stopwords stored in Scikit-Learn
                                 min_df=1 # NOTE: 1 is the default and keeps every term; min_df=2 would drop words appearing in just 1 document
                                )
    X = vectorizer.fit_transform(train_x)
    print("Number of features: %d" % len(vectorizer.vocabulary_))
    X_test = vectorizer.transform(test_x)
    # train model
    clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
    clf.fit(X, train_y)
    # predict
    pred = clf.predict(X_test)
    # classification results
    for line in metrics.classification_report(test_y, pred).split("\n"):
        print(line)
    # Save average across fold
    f1_overall_avg.append(metrics.f1_score(test_y, pred, labels=None, average='weighted'))

Fold 1
Number of features: 9722
              precision    recall  f1-score   support

           0       0.85      0.15      0.26       232
           1       0.74      0.99      0.85       569

   micro avg       0.75      0.75      0.75       801
   macro avg       0.80      0.57      0.55       801
weighted avg       0.77      0.75      0.68       801

Fold 2
Number of features: 9103
              precision    recall  f1-score   support

           0       0.65      0.69      0.67       232
           1       0.87      0.85      0.86       569

   micro avg       0.80      0.80      0.80       801
   macro avg       0.76      0.77      0.77       801
weighted avg       0.81      0.80      0.80       801

Fold 3
Number of features: 9390
              precision    recall  f1-score   support

           0       0.69      0.69      0.69       231
           1       0.88      0.88      0.88       569

   micro avg       0.82      0.82      0.82       800
   macro avg       0.78      0.7

In [26]:
# Print overall average f1 score across 5 folds
f1_score = sum(f1_overall_avg) / float(len(f1_overall_avg))

print("Average f1 score across all 5 folds:\n {:.2}".format(f1_score))

Average f1 score across all 5 folds:
 0.78


### Q8: Vectorize again

In [27]:
# Q8.1: Use the opinion lexicon described in the lecture as the vocabulary

# read the lexicon
lexicon = dict()

# read negative words
with open("opinion-lexicon-English/negative-words.txt", "r") as in_file:
    for line in in_file.readlines():
        if not line.startswith(";") and line != "\n":
            lexicon[line.strip()] = -1

# read positive words
with open("opinion-lexicon-English/positive-words.txt", "r") as in_file:
    for line in in_file.readlines():
        if not line.startswith(";") and line != "\n":
            lexicon[line.strip()] = 1

# print the first 6 entries (the loop stops once i exceeds 4)
for i, (k, v) in enumerate(lexicon.items()):
    print(k, v)
    if i > 4: break

2-faced -1
2-faces -1
abnormal -1
abolish -1
abominable -1
abominably -1


In [28]:
# Perform a 5-fold cross validation to predict satisfaction using an SGD classifier

vocab = lexicon.keys()

f1_overall_avg = []

skf = model_selection.StratifiedKFold(n_splits=5)
fold = 0
for train_index, test_index in skf.split(np.array(processed_doc), df_sat["satisfaction"]):
    fold += 1
    print("Fold %d" % fold)
    # partition
    train_x, test_x = np.array(processed_doc)[train_index], np.array(processed_doc)[test_index]
    train_y, test_y = df_sat["satisfaction"][train_index], df_sat["satisfaction"][test_index]
    # vectorize
    vectorizer = TfidfVectorizer(norm="l2",
                                 max_df=0.8, # ignored when a fixed vocabulary is supplied
                                 stop_words='english', # remove English stopwords stored in Scikit-Learn
                                 min_df=1, # ignored when a fixed vocabulary is supplied
                                 vocabulary=vocab # use our opinion lexicon
                                )
    X = vectorizer.fit_transform(train_x)
    print("Number of features: %d" % len(vectorizer.vocabulary_))
    X_test = vectorizer.transform(test_x)
    # train model
    clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
    clf.fit(X, train_y)
    # predict
    pred = clf.predict(X_test)
    # classification results
    for line in metrics.classification_report(test_y, pred).split("\n"):
        print(line)
    # Save average across fold
    f1_overall_avg.append(metrics.f1_score(test_y, pred, labels=None, average='weighted'))

Fold 1
Number of features: 6786
              precision    recall  f1-score   support

           0       0.65      0.21      0.31       232
           1       0.75      0.95      0.84       569

   micro avg       0.74      0.74      0.74       801
   macro avg       0.70      0.58      0.58       801
weighted avg       0.72      0.74      0.69       801

Fold 2
Number of features: 6786
              precision    recall  f1-score   support

           0       0.60      0.53      0.56       232
           1       0.82      0.85      0.84       569

   micro avg       0.76      0.76      0.76       801
   macro avg       0.71      0.69      0.70       801
weighted avg       0.75      0.76      0.76       801

Fold 3
Number of features: 6786
              precision    recall  f1-score   support

           0       0.67      0.45      0.54       231
           1       0.80      0.91      0.85       569

   micro avg       0.78      0.78      0.78       800
   macro avg       0.74      0.6

In [29]:
# Print overall average f1 score across 5 folds
f1_score = sum(f1_overall_avg) / float(len(f1_overall_avg))

print("Average f1 score across all 5 folds:\n {:.2}".format(f1_score))

Average f1 score across all 5 folds:
 0.73


**Q: Tell me if the average F1 score has increased? If so, why?**

A: No, the average F1 score has not increased from Q7.  Instead, it went down by about 0.05 (from 0.78 to 0.73).
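One possible contributor, offered as an assumption that this notebook does not verify, is that the reviews were stemmed in Q2 while the lexicon holds unstemmed words, so some sentiment-bearing stems may never match a lexicon entry. The sketch below shows how such a mismatch could be measured. The stems are of the kind seen in the cluster output, and `toy_lexicon` is an invented subset, not the real lexicon.

```python
# Stems of the kind produced by the Porter stemmer in Q2 (see the cluster output).
stemmed_vocab = {"great", "love", "nice", "good", "comfort", "easi"}

# Hypothetical subset of unstemmed lexicon entries, for illustration only.
toy_lexicon = {"great", "love", "nice", "good", "comfortable", "easy"}

matched = stemmed_vocab & toy_lexicon
unmatched = stemmed_vocab - toy_lexicon

print("matched:  ", sorted(matched))
print("unmatched:", sorted(unmatched))
print("coverage: %.2f" % (len(matched) / len(stemmed_vocab)))
```

Running the same set arithmetic on the real vocabulary and `lexicon.keys()` would show how many lexicon features can ever be non-zero for the stemmed reviews.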

### Q9: PCA

In [30]:
# PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

vectorizer = TfidfVectorizer(norm="l2",
                             max_df=0.8, # remove frequent words (>80%)
                             stop_words='english', # remove English stopwords stored in Scikit-Learn
                             min_df=1 # NOTE: 1 is the default and keeps every term
                            )

# PCA and StandardScaler (with centering) need a dense array
X = vectorizer.fit_transform(processed_doc).toarray()
X = StandardScaler().fit_transform(X)
print(len(X[0]))

# Fit a full PCA to inspect how much variance each component explains
pca = PCA(svd_solver='randomized', whiten=True).fit(X)
print(pca.explained_variance_ratio_)

# Count the leading components needed to explain at least 90% of the variance
sumofvariance = 0.0
n_components = 0
for item in pca.explained_variance_ratio_:
    sumofvariance += item
    n_components += 1
    if sumofvariance >= 0.9:
        break

10833
[1.38860801e-02 3.83311952e-03 3.20978202e-03 ... 8.53055910e-39
 5.47672946e-40 2.31403188e-40]


**Q: Please create a textbox and use your own words to briefly describe PCA**

A: Principal Component Analysis (PCA) is a method that uses the relationships between the existing variables in your data to derive new variables.  From those new variables, you can select the most important ones, meaning those that best explain the variance in your data.  Those most important new variables are called the principal components, and since you're usually left with fewer principal components than the original number of variables, PCA can help reduce the dimensionality of your data to make it simpler to work with.
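The description above can be sketched in a few lines of NumPy. This is an illustrative outline on random toy data, not scikit-learn's implementation. `X_toy` and `k` are made-up names, and whitening and the randomized solver used in the notebook are left out.

```python
import numpy as np

# Hypothetical data: 6 samples, 3 features, two of which are strongly correlated.
rng = np.random.RandomState(0)
base = rng.normal(size=(6, 1))
X_toy = np.hstack([base,
                   2.0 * base + 0.1 * rng.normal(size=(6, 1)),
                   rng.normal(size=(6, 1))])

# 1. Center each feature so the components describe variance around the mean.
X_centered = X_toy - X_toy.mean(axis=0)

# 2. The right singular vectors of the centered data are the principal axes.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Squared singular values are proportional to the variance along each axis.
explained_variance_ratio = S ** 2 / np.sum(S ** 2)

# 4. Project onto the first k axes to get a lower-dimensional representation.
k = 2
X_reduced = X_centered @ Vt[:k].T

print(explained_variance_ratio)
print(X_reduced.shape)
```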

In [31]:
# Print the number of components:

print("Number of components: {}".format(n_components))

Number of components: 2103
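The component count comes from accumulating the explained-variance ratios until they reach 90%. The same rule can be written as a small reusable function. The name `components_for_variance` and the toy ratios are assumptions for illustration.

```python
from itertools import accumulate

def components_for_variance(ratios, threshold=0.9):
    """Return the smallest number of leading components whose ratios sum to >= threshold."""
    for n_components, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return n_components
    return len(ratios)  # threshold never reached; keep everything

# Hypothetical explained-variance ratios, sorted from largest to smallest.
toy_ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(components_for_variance(toy_ratios))        # 4
print(components_for_variance(toy_ratios, 0.75))  # 2
```

In the notebook the equivalent call would take `pca.explained_variance_ratio_` as the input.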


In [34]:
# Evaluate an SGD classifier on the PCA features to predict satisfaction.
# NOTE: this draws 5 independent stratified 80/20 splits rather than a true 5-fold cross validation,
# and the PCA is fit on all of the data before splitting.
pca = PCA(n_components=n_components, svd_solver='randomized',whiten=True).fit(X)
X_train_pca = pca.transform(X)

f1_overall_avg = []

for fold in range(1, 6):
    print("Fold %d" % fold)
    # partition
    train_x, test_x, train_y, test_y = train_test_split(X_train_pca, 
                                                        df_sat["satisfaction"], 
                                                        test_size=0.2, 
                                                        stratify=df_sat["satisfaction"])
    # train model
    clf = SGDClassifier()
    clf.fit(train_x, train_y)
    # predict
    pred_y = clf.predict(test_x)
    # classification results
    for line in metrics.classification_report(test_y, pred_y).split("\n"):
        print(line)
    # Save average across fold
    f1_overall_avg.append(metrics.f1_score(test_y, pred_y, labels=None, average='weighted'))


Fold 1
              precision    recall  f1-score   support

           0       0.62      0.45      0.53       231
           1       0.80      0.89      0.84       569

   micro avg       0.76      0.76      0.76       800
   macro avg       0.71      0.67      0.68       800
weighted avg       0.75      0.76      0.75       800

Fold 2
              precision    recall  f1-score   support

           0       0.57      0.37      0.45       231
           1       0.78      0.89      0.83       569

   micro avg       0.74      0.74      0.74       800
   macro avg       0.68      0.63      0.64       800
weighted avg       0.72      0.74      0.72       800

Fold 3
              precision    recall  f1-score   support

           0       0.62      0.42      0.50       231
           1       0.79      0.89      0.84       569

   micro avg       0.76      0.76      0.76       800
   macro avg       0.70      0.66      0.67       800
weighted avg       0.74      0.76      0.74       800

In [35]:
# Print overall average f1 score across 5 folds
f1_score = sum(f1_overall_avg) / float(len(f1_overall_avg))

print("Average f1 score across all 5 folds:\n {:.2}".format(f1_score))

Average f1 score across all 5 folds:
 0.73
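The PCA cell draws five independent stratified 80/20 splits with `train_test_split`, so a review can land in a test set more than once or not at all. In k-fold cross-validation each review is tested exactly once. The sketch below is a simplified stand-in for that behaviour: the helper name and the round-robin scheme are assumptions, not scikit-learn's `StratifiedKFold` implementation.

```python
import random
from collections import defaultdict

def stratified_kfold_indices(labels, n_splits=5, seed=0):
    """Assign each sample to exactly one test fold, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

# Hypothetical labels: 10 unsatisfied (0) and 20 satisfied (1) reviews.
labels = [0] * 10 + [1] * 20
folds = stratified_kfold_indices(labels)

tested = sorted(i for fold in folds for i in fold)
print("every sample tested exactly once:", tested == list(range(len(labels))))
for fold in folds:
    print(len(fold), sum(labels[i] for i in fold))
```

For a like-for-like comparison with Q7 and Q8, the PCA features could be evaluated with the same `skf.split(...)` loop used there, ideally fitting the scaler and PCA on each training fold only.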


**Q: Tell me if the average F1 score has increased? If so, why?**

A: No, the average F1 score has not increased from Q7 (0.78), and it remains the same as in Q8 (0.73).