## Sentiment Analysis using Bernoulli NB
In this notebook a series of sentiment analysis models is created and packaged into a pickle file. The models use a Bernoulli naive Bayes classifier, as it is very efficient and generally performs well on text classification. The classes were assigned manually and relate to themes within the input comments, e.g. Capacity: a comment in which a respondent states that there is not enough capacity on the service. Each class has its own model, so the entire dataset is iterated over once for each class.

As the number of positive cases for each class is quite small in comparison to the size of the entire dataset, SMOTE resampling has been used to make the models more robust. SMOTE resamples the training data so that the positive cases make up a specified proportion of the negative cases. This means that the model is less likely to assign every input to a single class (positive or negative), which would result in a very high accuracy but low recall and precision.
Once the models have been created they are packaged into a pickle file, which can be loaded and run on new data.
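
The resampling idea described above can be illustrated with a minimal pure-Python sketch. It is not the `imblearn` implementation used in this notebook: the function name `smote_like_oversample`, the `ratio` and `k` parameters and the toy data are all assumptions made for illustration. Each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours, until the minority class reaches roughly the requested proportion of the majority class.

```python
import random


def smote_like_oversample(minority, n_majority, ratio=1.0, k=2, seed=0):
    """Return synthetic minority points so that
    (len(minority) + len(synthetic)) / n_majority is roughly `ratio`.

    Hypothetical helper for illustration only; not imblearn's SMOTE.
    """
    rng = random.Random(seed)
    target = int(round(ratio * n_majority))
    n_new = max(0, target - len(minority))
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base` (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Interpolate a random distance along the segment base -> neighbour
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(base, neighbour)))
    return synthetic


# Toy data: 3 positive samples against 40 negatives, target ratio 0.25
minority = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
synthetic = smote_like_oversample(minority, n_majority=40, ratio=0.25)
```

With these toy numbers the target is 10 positive samples, so 7 synthetic points are generated, and all of them lie between the original positive samples.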

The resulting model will be used as the first stage in processing public consultation responses. The model will highlight themes mentioned in the comments; however, it is expected that a user will go over each of the results, correct the themes and add further comments as necessary. Because of this second stage, it is not necessary for the model to be fully optimised, as this would be quite time consuming. Therefore a base model will be made in which each classification has a Recall score of at least 50%.

In [1]:
import pandas as pd
import numpy as np
import time
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score,precision_score,recall_score
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction import text 

In [2]:
## Read in Data
Data=pd.read_csv('Sentiment_Analysis.csv')

In [3]:
## Display head to see what the data looks like
Data.head()

| Unnamed: 0 | Response (-R) | Theme (-T) | First Class | Routing | Ticketing (Pricing) | Ticketing (Method) | Ticketing (Process) | Ticketing (Office) | Ticketing (Barrier) | Ticketing (Machine) | ... | Reward Schemes | Accessibility | Environment | Electrification | Rolling Stock | Community | General | HS2 | Renationalisation | Themes Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | - | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | information on websites - to give current up-t... | Website | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | website information and at the time of looking... | Website / Station or Train Communication | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 3 | message to my phone - when on the move website... | Website / Mobile | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | - | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
## Dropping this column as it is just an aggregation of the other features.
Data=Data.drop('Themes Total',axis=1)

In [5]:
## Dropping results where no classification is provided to reduce the size of the data set.
Data.dropna(subset=['Theme (-T)'],inplace=True)

In [6]:
Data.head()

| Unnamed: 0 | Response (-R) | Theme (-T) | First Class | Routing | Ticketing (Pricing) | Ticketing (Method) | Ticketing (Process) | Ticketing (Office) | Ticketing (Barrier) | Ticketing (Machine) | ... | Staffing (Policing) | Reward Schemes | Accessibility | Environment | Electrification | Rolling Stock | Community | General | HS2 | Renationalisation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | information on websites - to give current up-t... | Website | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | website information and at the time of looking... | Website / Station or Train Communication | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | message to my phone - when on the move website... | Website / Mobile | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | All train journeys have at least one transfer,... | Reliability - Connected Timetable for Delays/T... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | Long distance capacity is key to attracting pa... | Yes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [7]:
## Creating the y variable: a large matrix in which each column represents one classification. One model 
## will be used to determine each classification.
y=Data.drop(['Response (-R)','Theme (-T)'],axis=1)

In [8]:
## Preparing the X features. Initially the string format needs to be standardised for the models and pickling 
## to work.
## Converting to string type
Data['Response (-R)']=Data['Response (-R)'].astype('str')
## Removing all non utf-8 characters
X_Base=Data['Response (-R)'].map(lambda x: x.decode('utf-8','ignore').encode("utf-8"))

In [9]:
## Loading a list of additional stop words that will be used in conjunction with the 'english' set in the 
## CountVectorizer function
Station_List=pd.read_csv('Station_List.csv',header=None)

In [10]:
## Creating a variable that contains all of the 'english' stop words from Sklearn and those from the file above.
stop_words = text.ENGLISH_STOP_WORDS.union(list(Station_List[0]))

In [11]:
## Creating and applying the CountVectorizer. The fitted vectoriser will need to be pickled along with the models 
## for use with new data.
CV=CountVectorizer(stop_words=stop_words,ngram_range=(1,3),max_features=20000)
X_Model=CV.fit(X_Base)
X=X_Model.transform(X_Base)
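
To make the feature construction more concrete, the sketch below builds the word n-grams (one to three words) that `ngram_range=(1,3)` asks for and turns them into presence/absence features, which is how `BernoulliNB` views the data (its default `binarize=0.0` treats any non-zero count as 1). This is a simplified stand-in, not scikit-learn's tokeniser: the helper names, the stop words and the vocabulary are hypothetical.

```python
import re


def word_ngrams(comment, stop_words=frozenset(), ngram_range=(1, 3)):
    """Lower-case, tokenise, drop stop words, then join runs of 1..n tokens."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", comment.lower())
              if t not in stop_words]
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def binary_features(comment, vocabulary, stop_words=frozenset()):
    """1 if the n-gram occurs in the comment, else 0 (presence, not count)."""
    present = set(word_ngrams(comment, stop_words))
    return [1 if term in present else 0 for term in vocabulary]


stops = frozenset(["the", "on", "is"])        # hypothetical stop words
grams = word_ngrams("The wifi on the train is slow", stops)
# grams: ['wifi', 'train', 'slow', 'wifi train', 'train slow',
#         'wifi train slow']

vocab = ["wifi", "wifi train", "train slow", "ticket price"]  # hypothetical
features = binary_features("The wifi on the train is slow", vocab, stops)
# features: [1, 1, 1, 0]
```

Because stop words are removed before the n-grams are formed, "wifi train" becomes a feature even though the two words are not adjacent in the original comment.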

In [12]:
#CV2=CountVectorizer(stop_words='english',ngram_range=(1,3),max_features=75000)
#X_Model2=CV2.fit(X_Base)
#X2=X_Model2.transform(X_Base)

In [13]:
## Creating a list of all the classes from the y matrix.
Class_List=y.columns

In [14]:
## I have run a number of other models to find a better ratio (than 'auto') for the SMOTE function (this will be 
## discussed later). However not all of these models are contained within this notebook, so the values are saved into 
## the dictionary below.
Sm_Ratio={'Accessibility': 'auto',
         'App': 0.1,
         'Capacity (Carriage)': 'auto',
         'Capacity (Ratio)': 'auto',
         'Capacity (Seating)': 'auto',
         'Communication (Advertising)': 'auto',
         'Communication (General)': 'auto',
         'Communication (Station)': 'auto',
         'Community': 'auto',
         'Electrification': 'auto',
         'Email': 0.1,
         'Environment': 'auto',
         'First Class': 'auto',
         'Frequency': 'auto',
         'General': 'auto',
         'HS2 ': 'auto',
         'Integration (Franchise)': 'auto',
         'Integration (Transport)': 'auto',
         'Mobile': 0.1,
         'On-Board Facilities (Catering)': 'auto',
         'On-Board Facilities (Cycle)': 'auto',
         'On-Board Facilities (General)': 'auto',
         'On-Board Facilities (Luggage)': 'auto',
         'On-Board Facilities (Toilets)': 'auto',
         'On-Board Facilities (WIFI)': 'auto',
         'Reliability': 'auto',
         'Renationalisation': 'auto',
         'Reward Schemes': 'auto',
         'Rolling Stock': 'auto',
         'Routing': 'auto',
         'Social Media': 0.1,
         'Staffing (General)': 'auto',
         'Staffing (Policing)': 'auto',
         'Staffing (Training)': 'auto',
         'Staffing (Visibility)': 'auto',
         'Station Facilities (Access)': 'auto',
         'Station Facilities (Catering)': 0.1,
         'Station Facilities (Cycle)': 'auto',
         'Station Facilities (General)': 'auto',
         'Station Facilities (Lifts)': 'auto',
         'Station Facilities (Parking)': 'auto',
         'Station Facilities (Passenger Management)': 'auto',
         'Station Facilities (Retail)': 'auto',
         'Station Facilities (Toilets)': 0.25,
         'Station Facilities (WIFI)': 0.25,
         'Station Facilities (Waiting Area)': 'auto',
         'Station or Train Communication': 0.1,
         'Ticketing (Barrier)': 'auto',
         'Ticketing (Machine)': 'auto',
         'Ticketing (Method)': 'auto',
         'Ticketing (Office)': 'auto',
         'Ticketing (Pricing)': 'auto',
         'Ticketing (Process)': 'auto',
         'Ticketing (Schemes)': 'auto',
         'Timetables': 0.1,
         'Website': 0.25}

## Precision vs Recall vs Accuracy

### Accuracy
As discussed in the introduction, the number of positive classifications in the data set (for each class) is a very small proportion of the dataset as a whole. Therefore a metric such as accuracy will not give a good view of whether the model is performing well or not. The base case (applying negative/0 to all inputs) would result in an accuracy of above 95% in most cases, therefore it is difficult to build the model around this metric.
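
A small worked example shows why. The numbers below are hypothetical (1000 comments, 30 of them positive for a given class) and the metric functions are written out by hand rather than taken from `sklearn.metrics`, so that the arithmetic is visible.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))


def recall(y_true, y_pred):
    """Fraction of true positives that were found (0.0 if there are none)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / float(tp + fn) if tp + fn else 0.0


def precision(y_true, y_pred):
    """Fraction of positive predictions that were right (0.0 if there are none)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / float(tp + fp) if tp + fp else 0.0


# Hypothetical class with 3% positives
y_true = [1] * 30 + [0] * 970
# Base case: predict negative for every comment
y_all_negative = [0] * 1000

baseline_accuracy = accuracy(y_true, y_all_negative)   # 0.97
baseline_recall = recall(y_true, y_all_negative)       # 0.0
baseline_precision = precision(y_true, y_all_negative) # 0.0
```

The base case scores 97% accuracy while finding none of the positive comments, which is why accuracy is not used as the driving metric here.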

### Precision vs Recall
As this model will be used as the first step in a process, and a user will be evaluating each result, it is better for the model to 'overpredict' than 'underpredict', as it is easier for a user to remove wrong selections than to add the correct ones. Based on this principle the driving metric will be Recall; however, I will ensure that Precision is not so low that the model is just churning out all positive results.


In [15]:
## Creating the lists in which the results will be stored: test Recall, test Precision and cross validated Recall.
Test_Results=[]
Test_Results2=[]
Results=[]

## Creating a dictionary where each model will be saved.
Dict={}


## This loop iterates through each of the classes in the y matrix, and trains and tests the models, saving 
## the resulting model in the dictionary above, and recording the results in the lists defined above.
for j in Class_List:
    
    ## Defining the SMOTE parameters
    smote=SMOTE(ratio=Sm_Ratio[j],kind='regular')
    
    ## Creating the training and test sets
    X_train,X_test,y_train,y_test=train_test_split(X,y[j].astype('bool'),test_size=0.3)
    
    ## Unfortunately SMOTE requires a dense array input so the X is transformed from sparse to dense, this is 
    ## by far the most time consuming part of the process. Note that the full data set (X, y[j]) is resampled 
    ## here rather than X_train, so the model also sees the test rows and the test scores below are optimistic.
    smox,smoy=smote.fit_sample(X.toarray(),y[j])
    
    ##Fitting the model and producing the result metrics
    BNB=BernoulliNB()
    Model=BNB.fit(smox,smoy)    
    y_pred=Model.predict(X_test)    
    Test_Score=recall_score(y_test,y_pred)
    Precision_Score=precision_score(y_test,y_pred)
    Test_Results.append(Test_Score)
    Test_Results2.append(Precision_Score) 
    Dict[j]=Model
    
    ## Producing the cross validated result metrics.
    Cross_Score=cross_val_score(Model,smox,smoy,cv=5,scoring='recall')
    Results.append(Cross_Score)
    
    ## Printing the results of each iteration
    print "Class : ",j
    print "Training Score (Accuracy): ", Model.score(X_train,y_train)
    print "Test Recall Score : ",Test_Score
    print "Cross Validated Score (Recall) : ",Cross_Score
    print "Test Precision Score : ",Precision_Score
    print time.ctime()
    print "----------------"
    
print "Average of all Cross Validated Scores",np.average(Results)
print "Average of all Test Scores 1",np.average(Test_Results)

Class :  First Class
Training Score (Accuracy):  0.998391420912
Test Recall Score :  0.875
Cross Validated Score (Recall) :  [ 0.95217118  0.95717884  0.96410579  0.97670025  0.95403023]
Test Precision Score :  0.933333333333
Wed Feb 01 08:58:22 2017
----------------
Class :  Routing
Training Score (Accuracy):  0.948168007149
Test Recall Score :  0.617391304348
Cross Validated Score (Recall) :  [ 0.51158173  0.56121774  0.55790867  0.54533422  0.56821192]
Test Precision Score :  0.581967213115
Wed Feb 01 09:02:03 2017
----------------
Class :  Ticketing (Pricing)
Training Score (Accuracy):  0.960321715818
Test Recall Score :  0.705202312139
Cross Validated Score (Recall) :  [ 0.71932203  0.72067797  0.70440678  0.71777476  0.71302578]
Test Precision Score :  0.777070063694
Wed Feb 01 09:05:46 2017
----------------
Class :  Ticketing (Method)
Training Score (Accuracy):  0.984271671135
Test Recall Score :  0.986666666667
Cross Validated Score (Recall) :  [ 0.9658725   0.96651642  0.97424

## Initial Results
We can see from the above printout that, as expected, the accuracy results are very high for each model. The Recall results are also acceptable for most of the models; however, there are certain cases where this is coupled with a very low Precision score. For the cases with either a low Recall score or a very low Precision score, the models will be rerun with a modified SMOTE ratio. Changing the ratio alters the class balance that the model is fitted on, in the hope of producing a more accurate model.
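
The selection rule described here can be sketched as a small helper. The thresholds (Recall below 0.5, Precision below 0.2) are the ones applied later in this notebook; the function name `classes_to_rerun` and the example scores are hypothetical and are not results from the models above.

```python
def classes_to_rerun(recall_by_class, precision_by_class,
                     min_recall=0.5, min_precision=0.2):
    """Return the classes whose Recall or Precision falls below the threshold."""
    flagged = []
    for name in sorted(recall_by_class):
        low_recall = recall_by_class[name] < min_recall
        low_precision = precision_by_class[name] < min_precision
        if low_recall or low_precision:
            flagged.append(name)
    return flagged


# Hypothetical scores for three made-up classes
example_recall = {"Class A": 0.88, "Class B": 0.35, "Class C": 0.91}
example_precision = {"Class A": 0.93, "Class B": 0.60, "Class C": 0.12}

to_rerun = classes_to_rerun(example_recall, example_precision)
# to_rerun: ['Class B', 'Class C']
```

Class B is flagged for low Recall and Class C for low Precision, while Class A passes both checks and keeps its first model.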

In [16]:
## Creating a list that contains all of the models, the classes, and the count-vectoriser model.
List_Of_Everything=[Class_List,X_Model,Dict]

In [17]:
## Saving all of the model information to a pickle format file
with open('models.pickle', 'wb') as handle:
    pickle.dump(List_Of_Everything, handle, protocol=pickle.HIGHEST_PROTOCOL)
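
As a rough sketch of how the saved file might be used on new comments, the helpers below unpack the `[Class_List, X_Model, Dict]` list and collect the themes predicted for each comment. The names `load_bundle` and `predict_themes` are hypothetical, and the toy vectoriser and models in the usage part are stand-ins so the sketch runs without the trained models. It is assumed that new comments receive the same string cleaning as `X_Base` before being passed in.

```python
import pickle


def load_bundle(path="models.pickle"):
    """Unpack the [Class_List, X_Model, Dict] list saved above."""
    with open(path, "rb") as handle:
        class_list, vectoriser, models = pickle.load(handle)
    return class_list, vectoriser, models


def predict_themes(class_list, vectoriser, models, comments):
    """Return, for each comment, the list of themes predicted positive."""
    features = vectoriser.transform(comments)
    themes = [[] for _ in comments]
    for name in class_list:
        for i, flag in enumerate(models[name].predict(features)):
            if flag:
                themes[i].append(name)
    return themes


# Toy stand-ins (hypothetical) exposing the same transform/predict methods
class KeywordVectoriser(object):
    def transform(self, comments):
        return [c.lower() for c in comments]


class KeywordModel(object):
    def __init__(self, keyword):
        self.keyword = keyword

    def predict(self, features):
        return [self.keyword in text for text in features]


toy_models = {"Website": KeywordModel("website"),
              "Mobile": KeywordModel("phone")}
themes = predict_themes(["Website", "Mobile"], KeywordVectoriser(), toy_models,
                        ["Website information was out of date",
                         "Send a message to my phone"])
# themes: [['Website'], ['Mobile']]
```

With the real file, the three values returned by `load_bundle()` would be passed to `predict_themes` in place of the toy objects.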

The work below was used to establish the Sm_Ratio values.

In [42]:
## Creating lists of the classes whose Recall score is less than 0.5 or whose Precision score is less than 0.2. 
## I will run these classes again, but this time varying the SMOTE proportion parameters.
Low_Class_List=Class_List[np.array(Test_Results)<0.5]
Low_Class_List_Precision=Class_List[np.array(Test_Results2)<0.2]

In [43]:
Low_Scores=np.array(Test_Results)[np.array(Test_Results)<0.5]

In [44]:
Low_Scores

array([ 0.35      ,  0.40277778,  0.47252747,  0.36315789,  0.23076923,
        0.15384615,  0.49019608,  0.43806647])

In [45]:
## Display the list of the classifications with low recall scores.
Low_Class_List

Index([u'Communication (Advertising)', u'App', u'Mobile',
       u'Station or Train Communication', u'Environment', u'Electrification',
       u'Rolling Stock', u'General'],
      dtype='object')

In [46]:
## Display the list of the classifications with low precision scores.
Low_Class_List_Precision

Index([u'Staffing (General)'], dtype='object')

In [47]:
## Combining the two lists (set union of the two Index objects)
Combined_Low_Classes=Low_Class_List.union(Low_Class_List_Precision)


In [48]:
Combined_Low_Classes

Index([u'App', u'Communication (Advertising)', u'Electrification',
       u'Environment', u'General', u'Mobile', u'Rolling Stock',
       u'Staffing (General)', u'Station or Train Communication'],
      dtype='object')

In [49]:
## Rerunning the loop above on the classes with low scores using a SMOTE ratio of 0.25.

Low_Class_Test_Results=[]
Low_Class_Test_Results2=[]


Low_Class_Dict={}

for j in Combined_Low_Classes:
    
    smote=SMOTE(ratio=0.25,kind='regular')
    
    X_train,X_test,y_train,y_test=train_test_split(X,y[j].astype('bool'),test_size=0.3)

    smox,smoy=smote.fit_sample(X.toarray(),y[j])
     
    BNB=BernoulliNB()
    Model=BNB.fit(smox,smoy)    
    y_pred=Model.predict(X_test)    
    Test_Score=recall_score(y_test,y_pred)
    Precision_Score=precision_score(y_test,y_pred)
    Low_Class_Test_Results.append(Test_Score)
    Low_Class_Test_Results2.append(Precision_Score) 
    Low_Class_Dict[j]=Model
    
    Cross_Score=cross_val_score(Model,smox,smoy,cv=5,scoring='recall')
    
    print "Class : ",j
    print "Training Score (Accuracy): ", Model.score(X_train,y_train)
    print "Test Recall Score : ",Test_Score
    print "Cross Validated Score (Recall) : ",Cross_Score
    print "Test Precision Score : ",Precision_Score
    print time.ctime()
    print "----------------"
    
print "Average of all Test Recall Scores",np.average(Low_Class_Test_Results)
print "Average of all Test Precision Scores",np.average(Low_Class_Test_Results2)


Class :  App
Training Score (Accuracy):  0.909204647006
Test Recall Score :  0.985507246377
Cross Validated Score (Recall) :  [ 0.96907216  0.97680412  0.9871134   0.98195876  0.95607235]
Test Precision Score :  0.237762237762
Wed Feb 01 13:15:35 2017
----------------
Class :  Communication (Advertising)
Training Score (Accuracy):  0.991063449508
Test Recall Score :  0.5
Cross Validated Score (Recall) :  [ 0.38131313  0.44949495  0.33585859  0.50757576  0.44191919]
Test Precision Score :  0.636363636364
Wed Feb 01 13:15:58 2017
----------------
Class :  Electrification
Training Score (Accuracy):  0.992672028597
Test Recall Score :  0.1875
Cross Validated Score (Recall) :  [ 0.17632242  0.17632242  0.16120907  0.12846348  0.19647355]
Test Precision Score :  0.3
Wed Feb 01 13:16:19 2017
----------------
Class :  Environment
Training Score (Accuracy):  0.993565683646
Test Recall Score :  0.176470588235
Cross Validated Score (Recall) :  [ 0.15365239  0.13350126  0.16876574  0.19899244  0.1

## Results pt 2.
We can see that most of the model results have improved with the modified ratio, with the exception of five classes, which are displayed below. As there are so few positive cases for these classes it will be hard to improve the models sufficiently without overfitting, therefore I will leave these results as they are. I will now update the original dictionary containing the models and the resulting pickle file.

In [52]:
Low_Class_List2=Combined_Low_Classes[np.array(Low_Class_Test_Results)<0.5]
Low_Class_List_Precision2=Combined_Low_Classes[np.array(Low_Class_Test_Results2)<0.2]

In [53]:
Low_Class_List2

Index([u'Electrification', u'Environment', u'General', u'Rolling Stock'], dtype='object')

In [54]:
## The Mobile category has not improved, therefore I will keep the original model.
Low_Class_List_Precision2

Index([u'Mobile'], dtype='object')

In [56]:
## Iterating through the classes that were rerun. The model from the second run replaces the original for every 
## class except 'Mobile', which keeps its model from the first run.
for x in Combined_Low_Classes:
    if x=='Mobile':
        pass
    else:
        Dict[x]=Low_Class_Dict[x]
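
A more general form of this replacement step would compare the scores from both runs instead of naming the exception by hand. The sketch below is one possible rule, not the one used in this notebook: a second-run model replaces the original only if its Recall is higher and its Precision is at least 0.2. The function name `merge_models`, the placeholder models and the scores are all hypothetical.

```python
def merge_models(first_models, second_models,
                 first_recall, second_recall, second_precision,
                 min_precision=0.2):
    """Keep the first-run model unless the second run is clearly better."""
    merged = dict(first_models)
    for name, model in second_models.items():
        better_recall = second_recall[name] > first_recall[name]
        usable_precision = second_precision[name] >= min_precision
        if better_recall and usable_precision:
            merged[name] = model
    return merged


# Hypothetical placeholders standing in for fitted models and their scores
first = {"Class A": "a_run1", "Class B": "b_run1", "Class C": "c_run1"}
second = {"Class B": "b_run2", "Class C": "c_run2"}

final = merge_models(
    first, second,
    first_recall={"Class A": 0.90, "Class B": 0.40, "Class C": 0.45},
    second_recall={"Class B": 0.75, "Class C": 0.95},
    second_precision={"Class B": 0.50, "Class C": 0.10},
)
# final: {'Class A': 'a_run1', 'Class B': 'b_run2', 'Class C': 'c_run1'}
```

Class C keeps its first-run model because the gain in Recall comes with a Precision below the threshold.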

In [57]:
## Recreate a final list containing all the models
List_Of_Everything2=[Class_List,X_Model,Dict]

In [58]:
## Create a pickle file of all the models.
with open('models.pickle', 'wb') as handle:
    pickle.dump(List_Of_Everything2, handle, protocol=pickle.HIGHEST_PROTOCOL)