## Machine Learning Models for Sentiment Analysis

Note: the data is split 90% for training and 10% for testing.

### Analysis results for existing libs

* Training Dataset: /100k-courseras-course-reviews-dataset/reviews_train.csv
  * totalCnt: 96316, #1 = 2,122, #2 = 1,922, #3 = 4,427, #4 = 16,078, #5 = 71,767
  * vaderErr = 35,345, textBlobErr = 57,552, stanfordNlpErr = 75,420

| Library          | total_acc | #5_acc | #4_acc | #3_acc | #2_acc | #1_acc |
|------------------|:---------:|-------:|-------:|-------:|-------:|-------:|
| VADER            |   63.3%   |  76.7% |  26.9% |  14.7% |  21.9% |  24.5% |
| TextBlob         |   40.2%   |  37.6% |  65.6% |  15.3% |  22.3% |   5.5% |
| Stanford CoreNLP |   21.7%   |  13.0% |  57.9% |  18.0% |  66.5% |   7.7% |
| ~Pattern~        |   40.2%   |  37.6% |  65.6% |  15.6% |  22.8% |   5.5% |
| LingPipe         |   76.5%   |        |        |        |        |        |

* Testing Dataset:
  * totalCnt: 10,702, vaderErr = 4,280, textBlobErr = 6,525, stanfordNlpErr = 8,321
  
Notes:
- Computing the Stanford CoreNLP scores on the training dataset took more than 4 hours.
- TextBlob and Pattern give almost identical results (40.2% total accuracy each), so they probably share the same underlying algorithm.
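
The accuracy figures in the table follow from the error counts as `accuracy = 1 - errors / total`. The sketch below recomputes the training-set totals from the counts listed above. As a point of comparison that is not part of the original analysis, it also adds a majority-class baseline (always predicting 5 stars); the variable names are illustrative.

```python
# Counts copied from the training-set summary above.
total = 96316
label_counts = {1: 2122, 2: 1922, 3: 4427, 4: 16078, 5: 71767}
errors = {"Vader": 35345, "TextBlob": 57552, "StanfordCoreNLP": 75420}


def accuracy(err, n):
    """Fraction of correctly labelled reviews."""
    return 1 - err / n


accuracies = {name: accuracy(err, total) for name, err in errors.items()}

# Baseline added for comparison: always predict the most common label.
majority_baseline = max(label_counts.values()) / total

for name, acc in accuracies.items():
    print(f"{name}: {acc:.1%}")
print(f"majority baseline: {majority_baseline:.1%}")
```

Because about three quarters of the reviews are 5-star, the baseline comes out at roughly 74.5%, above the three libraries scored here. Total accuracy should therefore be read together with the per-class columns.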

### Combination Model with Random Forest Classifier

* RFC with `n_estimators=30` on training dataset with 3 libs
  * acc on training dataset: 91.0%
  * acc on testing dataset: 67.3%
* RFC with `n_estimators=30` on training dataset with 4 libs (with the LingPipe feature from Anthony!!)
  * acc on training dataset: 94.2%
  * acc on testing dataset: 85.1%
 
TODO/Improvements:
1. Pre-process the reviews
2. Add more features
3. Tune the RFC hyperparameters
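
For the pre-processing item, one possible starting point is a small text normalizer. This is only a sketch: `preprocess_review` is a hypothetical helper, and the choice of steps (unescape HTML entities, strip tags, lowercase, drop punctuation, collapse whitespace) is an assumption rather than something the notebook prescribes.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")            # crude HTML tag matcher
_NON_WORD_RE = re.compile(r"[^a-z0-9'\s]")  # keep letters, digits, apostrophes, whitespace
_SPACE_RE = re.compile(r"\s+")


def preprocess_review(text):
    """Normalize one raw review string (hypothetical helper)."""
    text = html.unescape(text)          # "&amp;" becomes "&"
    text = _TAG_RE.sub(" ", text)       # drop markup such as <br>
    text = text.lower()
    text = _NON_WORD_RE.sub(" ", text)  # punctuation becomes a space
    return _SPACE_RE.sub(" ", text).strip()


print(preprocess_review("Great   course!<br>Loved it &amp; learned a lot."))
```

VADER uses capitalization and punctuation as intensity cues, so this kind of cleaning is probably better suited to bag-of-words features than to the text passed to VADER.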


In [4]:
import csv

def divideDataSet(file, totalCount, percent):
    """Write the rows after the first `percent` share of `file` to ./test.csv."""
    divider = totalCount * percent
    count = 0
    with open(file, newline='', encoding='utf-8') as inputfile:
        reader = csv.DictReader(inputfile)
        # The output file holds the held-out test rows.
        with open('./test.csv', 'w', newline='', encoding='utf-8') as testfile:
            fieldnames = ['Id', 'Review', 'Label']
            testWriter = csv.DictWriter(testfile, fieldnames=fieldnames)
            testWriter.writeheader()
            for row in reader:
                count += 1
                # Skip the training share; everything after it is test data.
                if count > divider:
                    testWriter.writerow({'Id': row['Id'], 'Review': row['Review'], 'Label': row['Label']})
                
           
divideDataSet('../100k-courseras-course-reviews-dataset/reviews.csv', 107018, 0.9)
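
`divideDataSet` takes the test set from the tail of the file, so the split depends on file order. With labels as imbalanced as these, a shuffled split that preserves label proportions is one alternative worth considering. The sketch below is an assumption, not the method used in this notebook: `stratified_split`, its parameters and the toy rows are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="Label", test_fraction=0.1, seed=42):
    """Return (train, test) lists with roughly equal label proportions."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        n_test = round(len(label_rows) * test_fraction)
        test.extend(label_rows[:n_test])
        train.extend(label_rows[n_test:])
    return train, test


# Toy rows standing in for csv.DictReader output: 50 one-star, 150 five-star.
rows = [{"Id": i, "Review": "sample text", "Label": "1" if i % 4 == 0 else "5"}
        for i in range(200)]
train, test = stratified_split(rows)
print(len(train), len(test))
```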

In [29]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from pycorenlp import StanfordCoreNLP

vaderAnalyser = SentimentIntensityAnalyzer()
stanfordNlp = StanfordCoreNLP('http://localhost:9000')

def convertScore(score):
    label = 0
    if score > 0.525:
        label = 5
    elif score <= 0.525 and score >= 0.05:
        label = 4
    elif score < 0.05 and score > -0.05:
        label = 3
    elif score <= -0.05 and score >= -0.525:
        label = 2
    else:
        label = 1
        
    return label
    
    
def computeScore(file):
    totalCnt = 0
    vaderErr = 0
    textBlobErr = 0
    stanfordNlpErr = 0
    with open(file, newline='', encoding='utf-8') as inputfile:
        reader = csv.DictReader(inputfile)
        with open('./reviews_test_predict_v0.3.csv', 'w', newline='', encoding='utf-8') as trainfile:
            fieldnames = ['Id', 'Review', 'Label', 'VaderScore', 'VaderLabel', 'TextBlobScore', 'TextBlobLabel', 'StanfordnlpScore', 'StanfordnlpLabel']
            trainWriter = csv.DictWriter(trainfile, fieldnames = fieldnames)  
            trainWriter.writeheader()
            for row in reader:
                totalCnt += 1
#                 if totalCnt >= 10:
#                     continue
                # compute stanford CoreNLP score (only the first sentence of the review is used)
                output = stanfordNlp.annotate(row['Review'], properties={
                            'annotators': 'sentiment',
                            'outputFormat': 'json'
                        })
                stanfordScore = output['sentences'][0]['sentimentValue']
                stanfordLabel = int(stanfordScore) + 1
                stanfordSentiment = output['sentences'][0]['sentiment']
#                 if int(stanfordScore) < 1:
#                     print('Label=', row['Label'], ', stanfordScore=', stanfordScore, ', stanfordSentiment=', stanfordSentiment, ', row=', row)
                
                # compute vader score
                vaderScore = vaderAnalyser.polarity_scores(row['Review'])['compound']
                vaderLabel = convertScore(vaderScore)
                # compute textblob score
                testimonial = TextBlob(row['Review'])
                textBlobScore = testimonial.sentiment.polarity
                textBlobLabel = convertScore(textBlobScore)
                
                labelErr = False
                if vaderLabel != int(row['Label']):
                    vaderErr += 1
                    labelErr = True
                if textBlobLabel != int(row['Label']):
                    textBlobErr += 1
                    labelErr = True
                if stanfordLabel != int(row['Label']):
                    stanfordNlpErr += 1
                    labelErr = True
#                 if labelErr:
#                     print('Label=', row['Label'], ', vaderScore=', vaderScore, ', vaderLabel=', vaderLabel, ', textBlobScore=', textBlobScore, ', textBlobLabel=', textBlobLabel, ', row=', row)
                trainWriter.writerow({'Id': row['Id'], 'Review': row['Review'], 'Label':row['Label'], 
                                      'VaderScore': vaderScore, 'VaderLabel': vaderLabel, 
                                      'TextBlobScore': textBlobScore, 'TextBlobLabel': textBlobLabel,
                                      'StanfordnlpScore': stanfordScore, 'StanfordnlpLabel': stanfordLabel
                                     })
      
    # training set: totalCnt= 96316, vaderErr= 35345 (36.7%), textBlobErr= 57552 (59.8%), stanfordNlpErr= 75420 (78.3%)
    # testing set:  totalCnt= 10702, vaderErr= 4280, textBlobErr= 6525, stanfordNlpErr= 8321
    
    print('totalCnt=', totalCnt, ', vaderErr=', vaderErr, ', textBlobErr=', textBlobErr, ', stanfordNlpErr=', stanfordNlpErr)
            
computeScore('../100k-courseras-course-reviews-dataset/reviews_test.csv')
            

totalCnt= 10702 , vaderErr= 4280 , textBlobErr= 6525 , stanfordNlpErr= 8321
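
`computeScore` reads only `output['sentences'][0]`, so a multi-sentence review is scored by its first sentence alone, which may contribute to the low Stanford CoreNLP accuracy. One possible alternative, not what the cell above does, is to average `sentimentValue` over all sentences. The sketch assumes the JSON shape used above (a `sentences` list whose items carry a `sentimentValue` string from 0 to 4). `mean_stanford_label` is a hypothetical helper and the sample response is hand-written.

```python
def mean_stanford_label(output):
    """Average per-sentence sentimentValue (0-4) and map it to a 1-5 label."""
    values = [int(s["sentimentValue"]) for s in output.get("sentences", [])]
    if not values:
        return 3  # assumption: treat an empty annotation as neutral
    mean_value = sum(values) / len(values)
    return int(mean_value + 0.5) + 1  # round half up, then shift 0-4 to 1-5


# Hand-written stand-in for a CoreNLP response with three sentences.
sample = {"sentences": [
    {"sentimentValue": "1", "sentiment": "Negative"},
    {"sentimentValue": "3", "sentiment": "Positive"},
    {"sentimentValue": "4", "sentiment": "Verypositive"},
]}
print(mean_stanford_label(sample))
```

For this sample the first-sentence rule would give label 2, while the average gives label 4.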


In [53]:
### Pattern

import pandas as pd
from pattern.en import sentiment

data_train = pd.read_csv('./reviews_train_predict_v0.2.csv')

data_train['PatternScore'] = data_train['Review'].apply(lambda x: sentiment(x)[0])
data_train['PatternLabel'] = data_train['PatternScore'].apply(lambda x: convertScore(x))

data_train.to_csv('./reviews_train_predict_v0.3.csv', encoding='utf-8')

data_test = pd.read_csv('./reviews_test_predict_v0.2.csv')

data_test['PatternScore'] = data_test['Review'].apply(lambda x: sentiment(x)[0])
data_test['PatternLabel'] = data_test['PatternScore'].apply(lambda x: convertScore(x))

data_test.to_csv('./reviews_test_predict_v0.3.csv', encoding='utf-8')


In [52]:
### Compute accuracy

data = pd.read_csv('./reviews_train_predict_v0.3.csv')

totalCnt = [ data[data['Label'] == i].shape[0] for i in range(1,6)]
print(totalCnt)

# Pattern
patternCnt = [ data[(data['Label'] == i) & (data['PatternLabel'] == i)].shape[0] for i in range(1,6)]
print(patternCnt)

patternAcc = [patternCnt[i]/totalCnt[i] for i in range(0,5)]
print(patternAcc)

print('pattern total acc:', sum(patternCnt)/sum(totalCnt))

# TextBlob
textBlobCnt = [ data[(data['Label'] == i) & (data['TextBlobLabel'] == i)].shape[0] for i in range(1,6)]
print(textBlobCnt)

textBlobAcc = [textBlobCnt[i]/totalCnt[i] for i in range(0,5)]
print(textBlobAcc)

print('TextBlob total acc:', sum(textBlobCnt)/sum(totalCnt))

[2122, 1922, 4427, 16078, 71767]
[117, 439, 692, 10540, 26954]
[0.05513666352497644, 0.22840790842872008, 0.15631353060763498, 0.6555541734046523, 0.37557651845555756]
pattern total acc: 0.4022384650525354
[117, 429, 678, 10542, 26998]
[0.05513666352497644, 0.22320499479708636, 0.15315111813869436, 0.6556785669859435, 0.37618961361071246]
TextBlob total acc: 0.4024668798538145
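
The list comprehensions above filter the frame once per label. An equivalent, more compact formulation groups a boolean "correct" column by the true label. The sketch below uses a tiny made-up frame with the same column names; `per_class_accuracy` is a hypothetical helper.

```python
import pandas as pd


def per_class_accuracy(df, pred_col, label_col="Label"):
    """Accuracy within each true label, plus the overall accuracy."""
    correct = df[pred_col] == df[label_col]
    return correct.groupby(df[label_col]).mean(), correct.mean()


# Tiny made-up frame with the same column names as the CSV above.
toy = pd.DataFrame({
    "Label":         [5, 5, 5, 4, 4, 1],
    "TextBlobLabel": [5, 4, 5, 4, 3, 3],
})
by_label, overall = per_class_accuracy(toy, "TextBlobLabel")
print(by_label)
print("overall:", overall)
```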


In [72]:
### Naive Bayes classifier - extract features from reviews of training data set
import timeit
import nltk
from nltk.corpus import stopwords
from sklearn.naive_bayes import MultinomialNB

stopwords_set = set(stopwords.words("english"))


def clean_review(review):
    # Lowercase, keep tokens of 3+ characters, then drop English stop words.
    words_filtered = [e.lower() for e in review.split() if len(e) >= 3]
    words_without_stopwords = [word for word in words_filtered if word not in stopwords_set]
    return words_without_stopwords

def clean_reviews(train):
    reviews = []
    for index, row in train.iterrows():
        words_without_stopwords = clean_review(row['Review'])
        reviews.append(words_without_stopwords)
        
    return reviews

# Extracting word features
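def get_words_in_reviews(reviews):
    all_words = []  # avoid shadowing the built-in all()
    for review in reviews:
        all_words.extend(review)
    return all_words

def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    features = wordlist.keys()
    # Caveat: in NLTK 3, FreqDist.keys() is not guaranteed to be sorted by frequency,
    # so list(features)[0] may not be the most frequent word; wordlist.most_common(1) is.
    print('the top 1 frequent word ',list(features)[0], ' appears ', wordlist[list(features)[0]], 'times')
    # 78299 distinct words out of 1285873 words in total
    return features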

data_train = pd.read_csv('./reviews_train_predict_v0.3.csv')
reviews = clean_reviews(data_train)
print('For total #reviews=', len(reviews))
w_features = get_word_features(get_words_in_reviews(reviews))
print('Got total top frequent #words=', len(w_features))

def extract_features(document):
    document_words = set(document)
#     features = {}
    digit_features = []
    for word in w_features:
#         features['contains(%s)' % word] = (word in document_words)
        digit_feature = 1 if word in document_words else 0
        digit_features.append(digit_feature)
    return digit_features

# features = extract_features(reviews[0])
# print('reviews[0]: ', reviews[0])
# print('features: ', features)

For total #reviews= 96316
the top 1 frequent word  good  appears  17000 times
Got total top frequent #words= 78299


In [75]:
import pickle

start = timeit.default_timer()

# Dense 0/1 feature lists: one row per review, one column per vocabulary word.
X = [extract_features(review) for review in reviews]
y = data_train['Label']
# print(X.shape, y.shape)
mnb_model = MultinomialNB()
mnb_model.fit(X, y)
# Persist the fitted Naive Bayes model.
with open('MultinomialNB_train_v0.3.model', 'wb') as model_file:
    pickle.dump(mnb_model, model_file)

print(mnb_model.predict(X[0:1]))

stop = timeit.default_timer()
print('Training model took(s): ', stop - start)

MemoryError: 
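
The `MemoryError` is not surprising: `X` is a dense Python list of 96,316 rows by 78,299 columns, roughly 7.5 billion entries. A common way around this is to keep the bag-of-words matrix sparse. The sketch below uses scikit-learn's `CountVectorizer` with `binary=True` as a stand-in for `extract_features`. Its tokenization and stop-word list differ from `clean_review`, and the four reviews and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = [
    "great course and a great teacher",
    "very good material, well explained",
    "boring lectures and poor audio",
    "poor content, waste of time",
]
labels = [5, 5, 1, 1]

# binary=True records word presence (0/1), as extract_features does,
# but the result is a scipy sparse matrix that stores only non-zeros.
vectorizer = CountVectorizer(binary=True, stop_words="english")
X_sparse = vectorizer.fit_transform(reviews)

model = MultinomialNB()
model.fit(X_sparse, labels)

prediction = model.predict(vectorizer.transform(["great teacher"]))
print(X_sparse.shape, prediction)
```

`MultinomialNB` accepts sparse input directly, so the full vocabulary can be used without materializing the dense matrix.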

In [None]:
### Naive Bayes classifier - evaluate the model

scores=mnb_model.score(X, y)
print(scores)


In [96]:
### extract LingPipe feature

# lingpipe_data = pd.read_csv('./lingPipe.csv', encoding='ISO-8859-1')
# lingpipe_data.to_csv('./lingPipe_v2.csv', encoding='utf-8')


lingpipe_data=pd.read_csv('./lingPipe_v2.csv', encoding='utf-8')
others_data = pd.read_csv('./reviews_test_predict_v0.2.csv', encoding='utf-8')


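# Note: join(on=['Id']) matches others_data['Id'] against the *index* of lingpipe_data,
# so this relies on the row index of lingPipe_v2.csv being equal to Id.
combine_data = others_data.join(lingpipe_data, on=['Id'], how='left', rsuffix='_lingpipe')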
print(combine_data[0:4])
combine_data = combine_data[['Id', 'Review', 'Label', 'VaderScore', 'VaderLabel', 'TextBlobScore', 'TextBlobLabel', 'StanfordnlpScore', 'StanfordnlpLabel', 'LingPipeLabel']]
combine_data.to_csv('reviews_test_predict_v0.4.csv', encoding='utf-8')


      Id                                             Review  Label  \
0  96316  This course can be a bit dry at times. But sti...      4   
1  96317                                  awesome course...      4   
2  96318                     Good course with great teacher      5   
3  96319  Great intro to bootstrap. I feel like more boo...      4   

   VaderScore  VaderLabel  TextBlobScore  TextBlobLabel  StanfordnlpScore  \
0      0.0000           3       0.066667              4                 1   
1      0.6249           5       1.000000              5                 2   
2      0.7906           5       0.750000              5                 3   
3      0.7650           5       0.650000              5                 3   

   StanfordnlpLabel  Unnamed: 0  Id_lingpipe  \
0                 2       96316        96316   
1                 3       96317        96317   
2                 4       96318        96318   
3                 4       96319        96319   
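
`DataFrame.join(..., on='Id')` matches the caller's `Id` column against the index of the other frame, not against its `Id` column. The printed rows line up, which suggests that in `lingPipe_v2.csv` the row position happens to equal `Id`. A merge on the column itself does not depend on that coincidence. The sketch below uses made-up frames with deliberately shuffled LingPipe rows.

```python
import pandas as pd

others = pd.DataFrame({"Id": [96316, 96317, 96318],
                       "Label": [4, 4, 5]})
# LingPipe rows in a different order, with a default 0..n-1 index.
lingpipe = pd.DataFrame({"Id": [96318, 96316, 96317],
                         "LingPipeLabel": [5, 3, 4]})

# merge matches on the Id column itself, regardless of row order or index.
merged = others.merge(lingpipe[["Id", "LingPipeLabel"]], on="Id", how="left")
print(merged)
```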

                      


### Combination Model with Random Forest Classifier

* RFC with `n_estimators=30` on training dataset with 3 libs
  * acc on training dataset: 91.0%
  * acc on testing dataset: 67.3%
* RFC with `n_estimators=30` on training dataset with 4 libs
  * acc on training dataset: 91.1%
  * acc on testing dataset: 66.6%
 
Improvements:
1. more libs as classifiers

In [95]:
import pickle
import timeit

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

start = timeit.default_timer()
data = pd.read_csv('./reviews_train_predict_v0.4.csv')

X = data[['VaderScore', 'VaderLabel', 'TextBlobScore', 'TextBlobLabel', 'StanfordnlpScore', 'StanfordnlpLabel', 'LingPipeLabel']].values
y = data['Label'].values

n_estimators = 30
rfcModel = RandomForestClassifier(n_estimators=n_estimators)
rfcModel.fit(X, y)

with open('rfc_train_v0.4.model', 'wb') as model_file:
    pickle.dump(rfcModel, model_file)
# loaded_model = pickle.load(open(filename, 'rb'))
scores = rfcModel.score(X, y)
print(scores) 
stop = timeit.default_timer()
print('Training model took(s): ', stop - start)

0.9423148801860542
Training model took(s):  4.9431297999981325


In [97]:
test = pd.read_csv('./reviews_test_predict_v0.4.csv')
X_test = test[['VaderScore', 'VaderLabel', 'TextBlobScore', 'TextBlobLabel', 'StanfordnlpScore', 'StanfordnlpLabel', 'LingPipeLabel']].values
y_test = test['Label'].values
scores_test = rfcModel.score(X_test, y_test)
print(scores_test) 

0.8518968417118296
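
Training accuracy (94.2%) is well above test accuracy (85.1%), which hints at some overfitting. One way to explore the forest's hyperparameters is a cross-validated grid search. The sketch below runs on synthetic data from `make_classification` rather than the review features, and the grid values are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 7 score/label feature columns.
X, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                           n_classes=3, random_state=0)

param_grid = {
    "n_estimators": [10, 30],
    "max_depth": [3, None],       # shallower trees tend to overfit less
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, `X` and `y` would be the arrays built from `reviews_train_predict_v0.4.csv`, and the held-out test file would stay untouched until the final evaluation.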


In [None]:
### Demo 

with open('rfc_train_v0.4.model', 'rb') as model_file:
    loaded_model = pickle.load(model_file)
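
The demo cell stops after loading the model. A possible next step is to call `predict` on a feature row laid out in the same column order used for training. The sketch below is self-contained: it trains a toy forest on made-up rows and round-trips it through pickle in memory instead of reading `rfc_train_v0.4.model`, so the numbers have no bearing on the real model.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier

FEATURES = ['VaderScore', 'VaderLabel', 'TextBlobScore', 'TextBlobLabel',
            'StanfordnlpScore', 'StanfordnlpLabel', 'LingPipeLabel']

# Toy training rows (made up) in the FEATURES column order.
X_toy = [
    [0.79, 5, 0.75, 5, 3, 4, 5],
    [0.62, 5, 1.00, 5, 2, 3, 4],
    [-0.60, 1, -0.70, 1, 0, 1, 1],
    [-0.30, 2, -0.20, 2, 1, 2, 2],
]
y_toy = [5, 4, 1, 2]
toy_model = RandomForestClassifier(n_estimators=30, random_state=0).fit(X_toy, y_toy)

# Round-trip through pickle, in memory instead of a .model file.
restored = pickle.loads(pickle.dumps(toy_model))

new_row = [[0.79, 5, 0.75, 5, 3, 4, 5]]  # one review's feature vector
print(restored.predict(new_row))
```

Pickled scikit-learn models are best loaded with the same library version that wrote them.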