## SENTIMENT ANALYSIS Based on Supervised Learning

**<font color=green>INSTRUCTIONS:</font>** <br> <br>
    **<font color=green>1. Look for EXERCISES and QUESTIONS in this script. </font>** <br> <br>
    **<font color=green>2. Each student INDIVIDUALLY uploads this script with their answers embedded to Canvas.</font>** <br>

### Objectives

1. Learn how to perform lexicon-based (unsupervised machine learning) sentiment analysis.
2. Fine-tune a lexicon-based sentiment analyzer.
3. Train your own sentiment classifier.

### Session Prep

In [53]:
#packages needed

#ignore warnings about future changes in functions as they take too much space
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

import numpy as np 
import pandas as pd

#text normalization function
%run ./Text_Normalization_Function.ipynb

#ignore warnings about future changes in functions as they take too much space
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

Original:   <p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>
Processed:  ['<', 'p', '>', 'The', 'circus', 'dog', 'in', 'a', 'plissé', 'skirt', 'jumped', 'over', 'Python', 'who', 'was', "n't", 'that', 'large', ',', 'just', '3', 'feet', 'long.', '<', '/p', '>']
Original:   <p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>
Processed:  <p>The circus dog in a plissé skirt jumped over Python who was not that large, just 3 feet long.</p>
Original:   <p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>
Processed:  [('<', 'a'), ('p', 'n'), ('>', 'v'), ('the', None), ('circus', 'n'), ('dog', 'n'), ('in', None), ('a', None), ('plissé', 'n'), ('skirt', 'n'), ('jumped', 'v'), ('over', None), ('python', 'n'), ('who', None), ('was', 'v'), ("n't", 'r'), ('that', None), ('large', 'a'), (',', None), ('just', 'r'), ('3', None), ('feet', 'n'), ('long.', 'a'), 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\LiGoudan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\LiGoudan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\LiGoudan\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\LiGoudan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Dataset

Amazon Musical Instruments reviews (`Musical_instruments_reviews.csv`)

### Data Preprocessing

In [54]:
# Import the data and keep only the review text and the star rating
data = pd.read_csv('Musical_instruments_reviews.csv')
data = data.loc[:, ['reviewText', 'overall']]

# Find samples with a missing review (a missing reviewText is read as a float NaN)
float_ind = list()
str_ind = list()
for ind in data.index:
    if type(data.loc[ind, 'reviewText']) == float:
        float_ind.append(ind)
    elif type(data.loc[ind, 'reviewText']) == str:
        str_ind.append(ind)
    else:
        # Report any unexpected type together with its row label
        print(str(ind) + ': ' + str(type(data.loc[ind, 'reviewText'])))

# Delete the samples with a missing review
data.drop(index=float_ind, inplace=True)

print("Dimensions for data:", data.shape)
print("First 5 rows in dataset: \n", data.head(),"\n")

Dimensions for data: (10254, 2)
First 5 rows in dataset: 
                                           reviewText  overall
0  Not much to write about here, but it does exac...        5
1  The product does exactly as it should and is q...        5
2  The primary job of this device is to block the...        5
3  Nice windscreen protects my MXL mic and preven...        5
4  This pop filter is great. It looks and perform...        5 
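
The loop above removes rows whose `reviewText` is missing. A minimal sketch of the same filtering with `DataFrame.dropna`, run on a small made-up frame (the `toy` data is hypothetical and only stands in for the real file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for Musical_instruments_reviews.csv
toy = pd.DataFrame({
    "reviewText": ["Great strings", np.nan, "Cable broke quickly"],
    "overall": [5, 4, 1],
})

# Drop every row whose review text is missing; the original index labels are kept
toy_clean = toy.dropna(subset=["reviewText"])

print(toy_clean.shape)
print(list(toy_clean.index))
```

Unlike the loop, this version does not report unexpected value types, so it fits best once the column is known to hold only strings and missing values.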



### Topic Modelling

In [55]:
from sklearn import metrics
#from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import  CountVectorizer #bag-of-words vectorizer 
from sklearn.decomposition import LatentDirichletAllocation #package for LDA

# Plotting tools

from pprint import pprint
import pyLDAvis
import pyLDAvis.sklearn

import matplotlib.pyplot as plt
%matplotlib inline

#define text normalization function
%run ./Text_Normalization_Function.ipynb #defining text normalization function


In [56]:
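def display_topics(model, feature_names, no_top_words):
    # Print the no_top_words highest-weighted words of every topic
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        # argsort is ascending, so the reversed slice yields the largest weights first
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

def get_topic_words(vectorizer, lda_model, n_words):
    # Return, for every topic, the list of its n_words highest-weighted words
    # Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
    keywords = np.array(vectorizer.get_feature_names())
    topic_words = []
    for topic_weights in lda_model.components_:
        top_word_locs = (-topic_weights).argsort()[:n_words]
        topic_words.append(keywords.take(top_word_locs).tolist())
    return topic_words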
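
The two helpers above pick the highest-weighted words with two different `argsort` idioms. A small illustration with made-up weights (the `words` and `weights` arrays are hypothetical) suggests they select the same words when there are no ties:

```python
import numpy as np

# Hypothetical weights of five vocabulary words within one topic
words = np.array(["amp", "cable", "guitar", "pedal", "strap"])
weights = np.array([0.10, 0.05, 0.40, 0.30, 0.15])
n = 3

# Idiom used in display_topics: sort ascending, then walk backwards over the last n positions
top_a = weights.argsort()[:-n - 1:-1]

# Idiom used in get_topic_words: sort the negated weights, then take the first n positions
top_b = (-weights).argsort()[:n]

print(words[top_a].tolist())
print(words[top_b].tolist())
```

When several words share exactly the same weight, the two idioms may list them in a different order.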

In [57]:
data_topic = data['reviewText']
news_corpus = list(data_topic)
news_corpus[:3]

["Not much to write about here, but it does exactly what it's supposed to. filters out the pop sounds. now my recordings are much more crisp. it is one of the lowest prices pop filters on amazon so might as well buy it, they honestly work the same despite their pricing,",
 "The product does exactly as it should and is quite affordable.I did not realized it was double screened until it arrived, so it was even better than I had expected.As an added bonus, one of the screens carries a small hint of the smell of an old grape candy I used to buy, so for reminiscent's sake, I cannot stop putting the pop filter next to my nose and smelling it after recording. :DIf you needed a pop filter, this will work just as well as the expensive ones, and it may even come with a pleasing aroma like mine did!Buy this product! :]",
 'The primary job of this device is to block the breath that would otherwise produce a popping sound, while allowing your voice to pass through with no noticeable reduction of vo

In [65]:
#normalize data
normalized_corpus_news = normalize_corpus(news_corpus)

#define a Bag-of-Words vectorizer
bow_vectorizer_news = CountVectorizer(max_features=500)

#vectorize data
bow_news_corpus = bow_vectorizer_news.fit_transform(normalized_corpus_news)
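
To see what `max_features` does, here is a minimal sketch on a hypothetical three-document corpus (the `docs` list is made up and assumed to be already normalized). Only the most frequent terms across the corpus keep a column in the bag-of-words matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-normalized mini corpus
docs = [
    "guitar string sound great",
    "guitar strap fit well",
    "pedal sound great tone",
]

# Keep only the 3 most frequent terms across the whole corpus
vectorizer = CountVectorizer(max_features=3)
bow = vectorizer.fit_transform(docs)

# vocabulary_ maps each kept term to its column index
print(sorted(vectorizer.vocabulary_))
# One row per document, one column per kept term
print(bow.toarray())
```

Terms that fall outside the top `max_features` are dropped entirely, so a document made up only of rare terms ends up as an all-zero row.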

In [59]:
# 2 topic model
lda_news_2 = LatentDirichletAllocation(n_components=2, max_iter=100,
                                     doc_topic_prior = 0.25,
                                     topic_word_prior = 0.25).fit(bow_news_corpus)
#Perplexity score:
print("Perplexity of 2 topic: ", lda_news_2.perplexity(bow_news_corpus))

Perplexity of 2 topic:  489.0481654962962


In [76]:
# 3 topic model
lda_news_3 = LatentDirichletAllocation(n_components=3, max_iter=100,
                                     doc_topic_prior = 0.25,
                                     topic_word_prior = 0.25).fit(bow_news_corpus)
#Perplexity score:
print("Perplexity of 3 topic: ", lda_news_3.perplexity(bow_news_corpus))

Perplexity of 3 topic:  292.5259229430621


In [70]:
# 4 topic model
lda_news_4 = LatentDirichletAllocation(n_components=4, max_iter=100,
                                     doc_topic_prior = 0.25,
                                     topic_word_prior = 0.25).fit(bow_news_corpus)
#Perplexity score:
print("Perplexity of 4 topic: ", lda_news_4.perplexity(bow_news_corpus))

Perplexity of 4 topic:  294.3938967784993


In [66]:
# 5 topic model
lda_news_5 = LatentDirichletAllocation(n_components=5, max_iter=100,
                                     doc_topic_prior = 0.25,
                                     topic_word_prior = 0.25).fit(bow_news_corpus)
#Perplexity score:
print("Perplexity of 5 topic: ", lda_news_5.perplexity(bow_news_corpus))

Perplexity of 5 topic:  289.28868435154976
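
The four cells above differ only in `n_components`. A hedged sketch of the same comparison written as a loop follows; the toy corpus, the shorter `max_iter`, the fixed `random_state`, and the helper name `perplexity_by_topics` are assumptions added for illustration. The cells above set no seed, so their perplexity values can change from run to run.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def perplexity_by_topics(bow, topic_counts, seed=0):
    """Fit one LDA model per topic count and return {count: perplexity}."""
    scores = {}
    for k in topic_counts:
        lda = LatentDirichletAllocation(
            n_components=k,
            max_iter=20,
            doc_topic_prior=0.25,
            topic_word_prior=0.25,
            random_state=seed,  # fixed seed so repeated runs give the same numbers
        ).fit(bow)
        scores[k] = lda.perplexity(bow)
    return scores

# Hypothetical mini corpus, only to make the sketch runnable
docs = [
    "guitar string pick play sound",
    "guitar stand strap fit well",
    "pedal amp cable tone sound",
    "mic stand work well",
    "string tuner guitar play",
    "pedal cable amp work great",
]
bow = CountVectorizer().fit_transform(docs)

scores = perplexity_by_topics(bow, [2, 3, 4])
for k, score in scores.items():
    print(f"Perplexity of {k} topics: {score:.2f}")
```

Perplexity here is measured on the same matrix the models were fitted on; scoring a held-out split is a common way to get a less optimistic estimate.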


In [77]:
# Use the 3-topic model for the rest of the analysis
lda_news = lda_news_3

In [78]:
no_top_words_news = 10
display_topics(lda_news, bow_vectorizer_news.get_feature_names(), no_top_words_news)

Topic 0:
guitar use stand strap work well good mic great fit
Topic 1:
string guitar use pick play sound like tuner good great
Topic 2:
pedal sound use amp cable good like great tone work


In [80]:
word_weights = lda_news.components_ / lda_news.components_.sum(axis=1)[:, np.newaxis]
word_weights_df = pd.DataFrame(word_weights.T, 
                               index = bow_vectorizer_news.get_feature_names(), 
                               columns = ["Topic_" + str(i) for i in range(3)])

word_weights_df.sort_values(by='Topic_0',ascending=False).head(10)

| | Topic_0 | Topic_1 | Topic_2 |
|---|---|---|---|
| guitar | 0.032856 | 0.036893 | 0.005226 |
| use | 0.02183 | 0.026576 | 0.025514 |
| stand | 0.021564 | 3e-06 | 2e-06 |
| strap | 0.020636 | 3e-06 | 2e-06 |
| work | 0.019741 | 0.009106 | 0.010546 |
| well | 0.019315 | 0.012435 | 0.010278 |
| good | 0.018847 | 0.016301 | 0.016189 |
| mic | 0.016478 | 3e-06 | 3e-06 |
| great | 0.015085 | 0.015007 | 0.014163 |
| fit | 0.013083 | 1.4e-05 | 2e-06 |
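
The `word_weights` line divides every row of `components_` by its own sum, so each topic's weights add up to 1. A small worked example with a made-up matrix (the `components` array is hypothetical):

```python
import numpy as np

# Hypothetical components_ matrix: 2 topics (rows) x 3 words (columns)
components = np.array([[2.0, 1.0, 1.0],
                       [1.0, 3.0, 6.0]])

# Divide each row by its own sum; [:, np.newaxis] turns the row sums
# into a column vector so that broadcasting works row by row
word_weights = components / components.sum(axis=1)[:, np.newaxis]

print(word_weights)
print(word_weights.sum(axis=1))
```

After this step the weights are comparable across topics, which is why the table can be sorted by a single topic column.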


In [81]:
#Visualize topic modeling result
pyLDAvis.enable_notebook()

#run the visualization [mds selects the method used to project the "distance" between topics into 2-D; 'tsne' uses t-SNE]
pyLDAvis.sklearn.prepare(lda_news, bow_news_corpus, bow_vectorizer_news, mds='tsne')