## Classifying website feedback as positive or negative

In [2]:
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
amazon = pd.read_csv("amazon_cells_labelled.txt",delimiter="\t",header=None)
amazon.columns = ['feedback','score']
amazon.head(5)
amazon.shape
# score 1 (positive), score 0 (negative)

(1000, 2)

In [4]:
# import negative sentiment words data to create list
# source citation:
#    Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews."
#       Proceedings of the ACM SIGKDD International Conference on Knowledge
#       Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle,
#       Washington, USA.
neg_words = pd.read_csv("negative-words.txt",delimiter='\t',encoding="ISO-8859-1",skiprows=34,header=None)
neg_words.columns = ["Negative Words"]

In [5]:
# change score to boolean values (looking for instances where negative messages return True)
amazon['score'] = (amazon['score'] == 0)
amazon.head(3)

                                            feedback  score
0  So there is no way for me to plug it in here i...   True
1                        Good case, Excellent value.  False
2                             Great for the jawbone.  False


In [6]:
# create list of keywords: the negative-word lexicon plus "no", "never" and "not"
keywords = list(neg_words.values.flatten())
keywords.append("no")
keywords.append("never")
keywords.append("not")
len(keywords)

4786

In [7]:
import re
from string import punctuation

# strip punctuation from a feedback message
def strip_punctuation(message):
    return ''.join(ch for ch in message if ch not in punctuation)

# add one boolean column per keyword, True where the cleaned feedback contains it
def neg_message_check(df, col_name, alist):
    # lowercase each message and remove punctuation before matching
    df["modified_feedback"] = [strip_punctuation(m.lower()) for m in df[col_name]]

    for key in alist:
        # escape regex metacharacters so the keyword is matched literally;
        # this is a substring match, so "no" also matches inside "know"
        df[str(key)] = df.modified_feedback.str.contains(re.escape(key), case=False)

In [8]:
neg_message_check(amazon,"feedback",keywords)
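
The keyword columns come from a regex substring search on the cleaned message. The sketch below uses only the standard library and a made-up sentence to show what that implies for a short keyword such as "no". The whole-word variant using `\b` is not part of this notebook; it is included only as a hypothetical point of comparison.

```python
import re
from string import punctuation

def strip_punctuation(message):
    # Same cleaning idea as the notebook: drop every punctuation character.
    return ''.join(ch for ch in message if ch not in punctuation)

# Made-up example sentence (an assumption, not taken from the dataset).
message = strip_punctuation("I know this phone; nothing to complain about!".lower())

keyword = "no"

# Substring search, mirroring str.contains on an escaped keyword.
substring_hit = re.search(re.escape(keyword), message, flags=re.IGNORECASE) is not None

# Hypothetical alternative: require the keyword to stand alone as a word.
whole_word_hit = re.search(r"\b" + re.escape(keyword) + r"\b", message,
                           flags=re.IGNORECASE) is not None

print(substring_hit)   # True: "no" occurs inside "know" and "nothing"
print(whole_word_hit)  # False: "no" never appears as a separate word
```

With substring matching, a positive sentence can still switch on the "no" feature, which may account for some of the classifier's errors.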

In [9]:
data = amazon[keywords]
target = amazon["score"]

In [10]:
from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB()

bnb.fit(data,target)

y_pred = bnb.predict(data)

print("Number of mislabeled points out of total {} points: {}".format(data.shape[0],(target != y_pred).sum()))

Number of mislabeled points out of total 1000 points: 218


## Evaluating Classifier

In [11]:
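# checking accuracy: share of all messages labelled correctly vs. incorrectly
# (computed from the predictions instead of hard-coding the counts)
success = (target == y_pred).mean()
fails = (target != y_pred).mean()
print("Classifier successfully identified negative message: {}\n".format(success))
print("Classifier failed to identify negative message: {}".format(fails))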

Classifier successfully identified negative message: 0.782

Classifier failed to identify negative message: 0.218


In [12]:
# confusion matrix (false negative, false positives, sensitivity, specificity)
from sklearn.metrics import confusion_matrix
confusion_matrix(target,y_pred)

array([[471,  29],
       [189, 311]])

False Negatives (Type II error): 189 of the 218 errors are negative messages that the classifier failed to flag (misses).

False Positives (Type I error): 29 messages were flagged as negative but were actually positive (false alarms).

Sensitivity: 311 out of 500 (0.622), the share of actual negative messages that were correctly flagged. Note that the "positive" class here is a negative message (score is True).

Specificity: 471 out of 500 (0.942), the share of positive messages that were correctly left unflagged.
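
As a check on the arithmetic, the rates can be derived directly from the four cells of the confusion matrix. This is a minimal sketch that copies the matrix from the output above and assumes scikit-learn's layout (rows are true labels, columns are predicted labels, ordered `[False, True]`); the variable names are illustrative.

```python
# Confusion matrix copied from the notebook output.
# Row 0: messages that are not negative; row 1: messages that are negative.
cm = [[471, 29],
      [189, 311]]

(tn, fp), (fn, tp) = cm

sensitivity = tp / (tp + fn)   # negative messages correctly flagged
specificity = tn / (tn + fp)   # positive messages correctly left unflagged
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"sensitivity={sensitivity:.3f}")  # 0.622
print(f"specificity={specificity:.3f}")  # 0.942
print(f"accuracy={accuracy:.3f}")        # 0.782
```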

### Testing for Overfitting

In [13]:
# holdout grouping
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(data,target,test_size=0.2,random_state=20)
print("With 20% Holdout: " + str(bnb.fit(X_train,y_train).score(X_test,y_test)))
print('Testing on Sample: ' + str(bnb.fit(data,target).score(data,target)))

With 20% Holdout: 0.77
Testing on Sample: 0.782


A single 20% holdout gives an accuracy of 0.77 against 0.782 on the full sample, so this one split shows little indication of overfitting.

In [14]:
# cross validation (creating several holdout groups)
from sklearn.model_selection import cross_val_score
cross_val_score(bnb,data,target,cv=10)

array([0.66, 0.77, 0.78, 0.71, 0.79, 0.66, 0.69, 0.75, 0.78, 0.73])

The cross-validation accuracy scores fluctuate across folds, from 0.66 to 0.79 (mean 0.732, against 0.782 in-sample). This could point to some overfitting, although it is not clear how much fold-to-fold variation is acceptable before cross-validation should be read as a sign of overfitting.
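
One rough way to put numbers on the fluctuation is to summarize the fold scores and compare their mean with the in-sample accuracy. The sketch below copies both from the outputs above and uses only the standard library. It is a descriptive summary under those assumptions, not a formal test for overfitting, and it sets no threshold for an acceptable spread.

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output.
cv_scores = [0.66, 0.77, 0.78, 0.71, 0.79, 0.66, 0.69, 0.75, 0.78, 0.73]

# Accuracy when fitting and scoring on the full sample (from the notebook).
in_sample = 0.782

cv_mean = mean(cv_scores)
cv_std = stdev(cv_scores)      # sample standard deviation across folds
gap = in_sample - cv_mean      # how much in-sample accuracy exceeds the fold average

print(f"mean={cv_mean:.3f} std={cv_std:.3f} "
      f"min={min(cv_scores):.2f} max={max(cv_scores):.2f} gap={gap:.3f}")
```

A gap of about 0.05 between in-sample and average fold accuracy suggests the in-sample figure is somewhat optimistic, which fits the cautious reading above.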