# Sentiment Analysis Part 1
Predicting the review score of Amazon food reviews from the Summary, Text and other features

<b>Name</b>: Bharathvaj Devarajan<br>
<b>Student No</b>: 16212388<br>
<b>Module</b>: CA684 - Machine Learning<br>
<b>Aim</b>: The first part of this project combines the numerical and text features in a Pipeline that predicts the review score.
I have also tuned the hyperparameters with GridSearchCV to find the best settings for modelling.

In [36]:
# Pandas, numpy and other helpful libraries
import pandas as pd
import matplotlib.pyplot as plt
import string
import sqlite3
import numpy as np
# Scikit-learn libraries for preprocessing, modelling, evaluation, count vectorization and the train/test split
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import roc_curve, auc,mean_squared_error
from sklearn.preprocessing import FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion,Pipeline
# Imputer is not used below and was removed from newer scikit-learn releases
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import SGDRegressor, ElasticNet,Ridge,Lasso,LinearRegression
from sklearn.model_selection import GridSearchCV 
from sklearn.ensemble import RandomForestRegressor
# Natural Language Text Processing Libraries for NLP functions
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk import word_tokenize
from nltk.corpus import stopwords
import nltk
%matplotlib inline

The first step is to establish a database connection and read the data into a pandas DataFrame

In [12]:
con = sqlite3.connect('/home/bharath/Downloads/db/amazon_db.sqlite')

The Score ranges from 1 to 5, and a score of 3 represents neutral sentiment, so those reviews are omitted from this analysis

In [13]:
amazon_df = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3
""", con)

In [14]:
amazon_df = amazon_df.dropna()
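The filtering step can be tried without the real database. The sketch below is only an illustration: it builds a small in-memory SQLite table (the rows and the `con_demo` / `demo_df` names are made up for the example) and runs the same `Score != 3` query against it.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory stand-in for the Reviews table (not the real data)
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Summary TEXT)")
con_demo.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(1, 5, "Great"), (2, 3, "Okay"), (3, 1, "Bad"), (4, 4, "Nice")],
)
con_demo.commit()

# Same filter as the notebook: neutral (Score == 3) rows never reach pandas
demo_df = pd.read_sql_query("SELECT * FROM Reviews WHERE Score != 3", con_demo)
remaining_scores = sorted(demo_df["Score"].tolist())
print(remaining_scores)
con_demo.close()
```

Filtering in SQL rather than in pandas keeps the neutral rows from being loaded into memory at all.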

In [15]:
amazon_df.head(1)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...


Helper functions to extract the text and numeric features using FunctionTransformer

In [16]:
get_text = FunctionTransformer(lambda x: x[['Summary','Text']], validate=False)
get_numeric = FunctionTransformer(lambda x: x[['HelpfulnessNumerator','HelpfulnessDenominator']],
                                       validate=False)
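A minimal sketch of how these helpers behave, using a made-up four-column frame. The `toy` data and the `select_text` / `select_numeric` functions are assumptions for illustration; named functions are used instead of lambdas only because they can be pickled, which may matter if the transformer is saved or run in parallel.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy frame with the same column names as the review data
toy = pd.DataFrame({
    "Summary": ["Good", "Bad"],
    "Text": ["Tasty treats", "Stale chips"],
    "HelpfulnessNumerator": [1, 0],
    "HelpfulnessDenominator": [2, 3],
})

def select_text(df):
    # Keep only the two free-text columns
    return df[["Summary", "Text"]]

def select_numeric(df):
    # Keep only the two helpfulness counters
    return df[["HelpfulnessNumerator", "HelpfulnessDenominator"]]

get_text_demo = FunctionTransformer(select_text, validate=False)
get_numeric_demo = FunctionTransformer(select_numeric, validate=False)

text_part = get_text_demo.fit_transform(toy)
numeric_part = get_numeric_demo.fit_transform(toy)
print(list(text_part.columns), numeric_part.shape)
```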

These classes apply the column selection inside the pipeline: ItemSelector returns a single column by key, and SelectTransform applies an arbitrary function to its input

In [17]:
class ItemSelector(BaseEstimator, TransformerMixin):
    
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

class NoFitMixin:
    def fit(self, X, y=None):
        return self

class SelectTransform(TransformerMixin, NoFitMixin, BaseEstimator):
    def __init__(self, func, copy=True):
        self.func = func
        self.copy = copy

    def transform(self, X):
        X_ = X if not self.copy else X.copy()
        return self.func(X_)
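To see what a column selector does in front of a vectorizer, here is a small sketch under stated assumptions: `ColumnPicker` is a hypothetical stand-in that mirrors ItemSelector, and the two-row frame is invented for the example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

class ColumnPicker(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for ItemSelector: return one column by name."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        # Nothing to learn; selection is stateless
        return self

    def transform(self, X):
        return X[self.key]

toy = pd.DataFrame({"Summary": ["good food", "bad food"], "votes": [1, 0]})

summary_counts = Pipeline([
    ("selector", ColumnPicker(key="Summary")),
    ("count_vectorizer", CountVectorizer()),
])
matrix = summary_counts.fit_transform(toy)

# Columns of the count matrix follow the alphabetically sorted vocabulary
vocab = sorted(summary_counts.named_steps["count_vectorizer"].vocabulary_)
print(vocab)
print(matrix.toarray())
```

The selector hands the vectorizer a one-dimensional sequence of strings, which is what CountVectorizer expects; passing the whole DataFrame would not work.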
        

<b>Data Preprocessing</b> 

In [18]:
#Drop duplicate score-text values
amazon_df = amazon_df.drop_duplicates(subset = ['Summary', 'Text', 'Score'])

#Create text - remove line breaks
amazon_df['Summary'] = amazon_df['Summary'].str.replace('<br />', ' ')
amazon_df['Text'] = amazon_df['Text'].str.replace('<br />', ' ')

#Replace nan with ""
amazon_df['Summary'] = amazon_df['Summary'].fillna(value = "")
amazon_df['Text'] = amazon_df['Text'].fillna(value = "")

#Extract target variable
score = amazon_df['Score']

#Remove score from amazon_df
amazon_df = amazon_df.drop('Score', axis = 'columns')

#Count the number of words
amazon_df['summary_count'] = amazon_df['Summary'].str.split().apply(len)
amazon_df['text_count'] = amazon_df['Text'].str.split().apply(len)
amazon_df = amazon_df.assign(all_words_count = amazon_df['summary_count'] + amazon_df['text_count'])
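The same preprocessing steps can be checked on a tiny, invented frame. The rows below are assumptions made for the example; `regex=False` is passed so that `<br />` is treated as a literal string regardless of the pandas version.

```python
import pandas as pd

# Hypothetical rows: the first and third are exact duplicates
toy = pd.DataFrame({
    "Summary": ["Good dog food", "Bad", "Good dog food"],
    "Text": ["My dog<br />loves it", "Stale", "My dog<br />loves it"],
    "Score": [5, 1, 5],
})

# Drop rows that repeat the same summary, text and score
toy = toy.drop_duplicates(subset=["Summary", "Text", "Score"])

# Replace HTML line breaks with a space so the words on either side stay separate
toy["Text"] = toy["Text"].str.replace("<br />", " ", regex=False)

# Word counts per column and in total
toy["summary_count"] = toy["Summary"].str.split().apply(len)
toy["text_count"] = toy["Text"].str.split().apply(len)
toy = toy.assign(all_words_count=toy["summary_count"] + toy["text_count"])
print(toy[["summary_count", "text_count", "all_words_count"]])
```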

<b>Creating the X_train X_test split and Pipeline</b><br> 
The pipeline module of scikit-learn allows you to chain transformers and estimators together in such a way that you can use them as a single unit. This comes in very handy when you need to jump through a few hoops of data extraction, transformation, normalization, and finally train your model (or use it to generate predictions).

In this analysis, the following sequence of steps will be carried out by our Pipeline:<br>
1. <b>Feature Union</b>, in which the text and numerical features are combined<br> 
2. <b>Scaling</b>, using MaxAbsScaler, which divides each feature by its maximum absolute value so that all inputs lie in [-1, 1]<br>
3. <b>Modelling</b>, where SGDRegressor is just a placeholder; Grid Search will swap in a variety of models<br>
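The three steps can be sketched end to end on a toy frame. Everything below is an illustration under assumptions: the four rows, the `pick_summary` / `pick_numeric` helpers and the single `text_count` column are made up, and the numeric block is returned as a sparse matrix so that it stacks cleanly with the sparse count matrix.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler

# Hypothetical toy data: one text column, one numeric column
toy_X = pd.DataFrame({
    "Summary": ["great taste", "awful taste", "great value", "awful smell"],
    "text_count": [12, 40, 8, 25],
})
toy_y = [5, 1, 4, 2]

def pick_summary(df):
    return df["Summary"]

def pick_numeric(df):
    return sparse.csr_matrix(df[["text_count"]].to_numpy(dtype=float))

toy_pipeline = Pipeline([
    # Step 1: build text and numeric features side by side
    ("union", FeatureUnion([
        ("summary", Pipeline([
            ("selector", FunctionTransformer(pick_summary)),
            ("count_vectorizer", CountVectorizer()),
        ])),
        ("numerical", FunctionTransformer(pick_numeric)),
    ])),
    # Step 2: scale every column by its maximum absolute value
    ("scaler", MaxAbsScaler()),
    # Step 3: placeholder regressor
    ("model", SGDRegressor(random_state=44)),
])
toy_pipeline.fit(toy_X, toy_y)
toy_preds = toy_pipeline.predict(toy_X)

# Inspect the intermediate feature matrix: 5 vocabulary words + 1 numeric column
union_out = toy_pipeline.named_steps["union"].transform(toy_X)
scaled = toy_pipeline.named_steps["scaler"].transform(union_out)
print(union_out.shape, abs(scaled).max())
```

MaxAbsScaler is a natural fit here because it does not centre the data, so the sparse count matrix stays sparse.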

In [21]:
# Stratified split keeps the proportion of each score value the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(amazon_df, score, test_size= 0.2,
                                                        stratify = score, random_state = 44)

# Pipeline
pipeline = Pipeline([
    ('union', FeatureUnion([
        
        ('summary', Pipeline([
            ('selector', ItemSelector(key = 'Summary')),
            ('count_vectorizer', CountVectorizer(min_df = 0.0005, max_df = 1.0)),
        ])),
        
        ('text', Pipeline([
            ('selector', ItemSelector(key = 'Text')),
            ('count_vectorizer', CountVectorizer(min_df = 0.001, max_df = 1.0)),
        ])),
        
        ('numerical', Pipeline([
            ('select', SelectTransform(lambda X: X.select_dtypes(exclude=['object']))),
        ])),
    ])),
    
    ('scaler',  MaxAbsScaler()),
    
    ('model', SGDRegressor())
])
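The effect of `stratify` in the split above can be checked on synthetic labels. This is a sketch with invented data: 100 made-up scores with a skewed distribution, split 80/20 in the same way.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical, deliberately skewed score distribution
toy_scores = pd.Series([5] * 60 + [4] * 20 + [1] * 10 + [2] * 10)
toy_features = pd.DataFrame({"row": range(len(toy_scores))})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_scores, test_size=0.2,
    stratify=toy_scores, random_state=44,
)

# Share of each score value in the two subsets
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

Without `stratify`, a rare score value could end up over- or under-represented in the test set purely by chance.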


Initializing the parameter grid for GridSearchCV.
We pass a set of candidate values, chosen by intuition, to GridSearchCV. It fits the pipeline once for every parameter combination and cross-validation fold, then returns the model and parameters with the best mean validation score.

I have commented out most of the model parameters because of limited compute power: the full grid takes hours to train and my computer usually hangs after a while.

In [31]:
param_grid = [
    {'model' : [SGDRegressor(loss = 'squared_loss', n_iter = 3, random_state = 44,
                             verbose = 3)],
        #'model__l1_ratio' : [0, 0.5, 1],
        #'model__alpha' : [0.0001, 0.1, 1, 10],
        #'union__summary__count_vectorizer__ngram_range': [(1, 1), (1, 2)],
        #'union__text__count_vectorizer__ngram_range': [(1, 1), (1, 2)]
    },
    {'model' : [RandomForestRegressor(n_estimators = 2, random_state = 44, verbose = 3, n_jobs = -1)],
        #'model__max_depth' : [3],
        #'scaler' : [None],
        #'union__summary__count_vectorizer__ngram_range': [(1, 1), (1, 2)],
        #'union__text__count_vectorizer__ngram_range': [(1, 1), (1, 2)]
    },
    {'model' : [LinearRegression(n_jobs = -1)],
        #'union__summary__count_vectorizer__ngram_range': [(1, 1), (1, 2)],
        #'union__text__count_vectorizer__ngram_range': [(1, 1), (1, 2)]
    },
    {'model' : [Ridge(random_state = 44)],
        #'model__alpha' : [0.1, 1, 10],
        #'union__summary__count_vectorizer__ngram_range': [(1, 1), (1, 2)],
        #'union__text__count_vectorizer__ngram_range': [(1, 1), (1, 2)]
    },
    {'model' : [Lasso(random_state = 44)],
        #'model__alpha' : [0.1, 1, 10],
        #'union__summary__count_vectorizer__ngram_range': [(1, 1), (1, 2)],
        #'union__text__count_vectorizer__ngram_range': [(1, 1), (1, 2)]
    },
    {'model' : [ElasticNet(random_state = 44)],
        #'model__l1_ratio' : [0.1, 0.5, 0.7, 0.9],          
        #'model__alpha' : [0.1, 1, 10],
        #'union__summary__count_vectorizer__ngram_range': [(1, 1), (1, 2)],
        #'union__text__count_vectorizer__ngram_range': [(1, 1), (1, 2)]
    }
]

Creating the grid search with GridSearchCV: we pass the pipeline, the parameter grid, the number of cross-validation folds and the scoring method

In [32]:
grid = GridSearchCV(pipeline, param_grid,cv=5,scoring = 'neg_mean_squared_error')
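The idea of a list of grids, each of which replaces the `model` step, can be sketched on synthetic data. The data, the `demo_*` names and the two candidate models are assumptions chosen so the example runs in a fraction of a second.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical regression data with a known linear relationship
rng = np.random.RandomState(44)
X_demo = rng.rand(60, 3)
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.randn(60)

demo_pipeline = Pipeline([("scaler", MaxAbsScaler()), ("model", Ridge())])

# Each dict is its own grid: the 'model' key swaps the final step,
# and 'model__alpha' is only searched for the estimator that has it
demo_grid = [
    {"model": [LinearRegression()]},
    {"model": [Ridge()], "model__alpha": [0.1, 1, 10]},
]

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5,
                           scoring="neg_mean_squared_error")
demo_search.fit(X_demo, y_demo)

n_candidates = len(demo_search.cv_results_["params"])
print(n_candidates, demo_search.best_params_)
```

Here one candidate comes from the first dict and three from the second, so four pipelines are evaluated, each on five folds.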

Training the model

In [33]:
grid.fit(X_train, y_train)

-- Epoch 1
Norm: 7.58, NNZs: 4597, Bias: 0.056325, T: 233703, Avg. loss: 0.699441
Total training time: 0.20 seconds.
-- Epoch 2
Norm: 9.57, NNZs: 4597, Bias: 0.063065, T: 467406, Avg. loss: 0.632610
Total training time: 0.37 seconds.
-- Epoch 3
Norm: 10.88, NNZs: 4597, Bias: 0.068787, T: 701109, Avg. loss: 0.595685
Total training time: 0.54 seconds.
-- Epoch 1
Norm: 7.70, NNZs: 4610, Bias: 0.056438, T: 233703, Avg. loss: 0.692351
Total training time: 0.18 seconds.
-- Epoch 2
Norm: 9.62, NNZs: 4610, Bias: 0.062845, T: 467406, Avg. loss: 0.625480
Total training time: 0.36 seconds.
-- Epoch 3
Norm: 10.86, NNZs: 4610, Bias: 0.068666, T: 701109, Avg. loss: 0.589448
Total training time: 0.48 seconds.
-- Epoch 1
Norm: 7.67, NNZs: 4579, Bias: 0.056347, T: 233703, Avg. loss: 0.696580
Total training time: 0.11 seconds.
-- Epoch 2
Norm: 9.66, NNZs: 4579, Bias: 0.062755, T: 467406, Avg. loss: 0.628572
Total training time: 0.22 seconds.
-- Epoch 3
Norm: 10.96, NNZs: 4579, Bias: 0.068761, T: 701109,

[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed: 26.3min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed: 26.3min finished
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.1s finished
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.2s finished


building tree 2 of 2building tree 1 of 2



[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed: 28.0min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed: 28.0min finished
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.1s finished
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.3s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.5s finished


building tree 2 of 2
building tree 1 of 2


[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed: 27.0min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed: 27.0min finished
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.1s finished
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.2s finished


building tree 1 of 2building tree 2 of 2



[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed: 24.4min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed: 24.4min finished
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.1s finished
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.2s finished


building tree 2 of 2building tree 1 of 2



[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed: 26.2min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed: 26.2min finished
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    1.9s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    2.7s finished
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:   16.8s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:   17.6s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('union', FeatureUnion(n_jobs=1,
       transformer_list=[('summary', Pipeline(steps=[('selector', ItemSelector(key='Summary')), ('count_vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
 ..., penalty='l2', power_t=0.25,
       random_state=None, shuffle=True, verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'model': [SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', n_iter=3, penalty='l2', power_t=0.25,
       random_state=44, shuffle=True, verbose=3, warm_start=False)]}, {'mod...False, precompute=False,
      random_state=44, selection='cyclic', tol=0.0001, warm_start=False)]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
     

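Now that the search has finished, we can inspect the best score and the parameter settings of the best model. The scores are negative because the scoring method is 'neg_mean_squared_error', so values closer to zero are better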

In [35]:
scores = grid.grid_scores_
print("The best score is %s" % grid.best_score_)
print(scores)
print("The best model parameters are: %s" % grid.best_params_)

The best score is -0.751477963212
[mean: -1.01724, std: 0.01187, params: {'model': SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', n_iter=3, penalty='l2', power_t=0.25,
       random_state=44, shuffle=True, verbose=3, warm_start=False)}, mean: -1.23880, std: 0.01421, params: {'model': RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=2, n_jobs=-1, oob_score=False, random_state=44,
           verbose=3, warm_start=False)}, mean: -0.75148, std: 0.00687, params: {'model': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False)}, mean: -0.76456, std: 0.00978, params: {'model': Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
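The `grid_scores_` attribute used above was removed in scikit-learn 0.20; newer releases expose the same information through `cv_results_`. The sketch below shows one way to read it, on synthetic data with hypothetical names (`search`, `results`), and is not the output of this notebook's run.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical regression data
rng = np.random.RandomState(44)
X_demo = rng.rand(50, 2)
y_demo = 3.0 * X_demo[:, 0] - X_demo[:, 1] + 0.05 * rng.randn(50)

search = GridSearchCV(
    Pipeline([("model", LinearRegression())]),
    [{"model": [LinearRegression()]}, {"model": [Lasso(random_state=44)]}],
    cv=5, scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# One row per candidate; the mean and std are taken over the CV folds
results = pd.DataFrame(search.cv_results_)[
    ["param_model", "mean_test_score", "std_test_score", "rank_test_score"]
]
# Flip the sign to get a plain MSE that is easier to read
results["mean_mse"] = -results["mean_test_score"]
print(results.sort_values("rank_test_score"))
```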




<b>Model Evaluation</b><br>
Mean Squared Error (MSE) on the held-out test set is used to measure how well the model predicts the score; lower is better

In [37]:
X_test_preds = grid.predict(X_test)

score = mean_squared_error(y_test, X_test_preds)
print(score)

0.743957953959


In [38]:
# Standard Deviation is calculated to Check the Model fit
std = float(np.std(y_test))

In [39]:
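# Compare RMSE (same units as the score) with the spread of the true scores:
# a model that always predicts the mean score has an RMSE equal to std
rmse = np.sqrt(score)
if rmse < std:
    print("Model fits well, Score:"+str(score))
else:
    print("Poor Model Score: "+str(score))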

Model fits well, Score:0.743957953959
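One way to put a test MSE in context is to compare it with a baseline that always predicts the mean score: the MSE of that baseline equals the variance of the true scores. The numbers below are invented for illustration and are not this notebook's predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true scores and model predictions
y_true = np.array([5, 5, 4, 1, 5, 2, 4, 5], dtype=float)
y_pred = np.array([4.6, 4.9, 4.2, 1.8, 4.4, 2.5, 3.7, 4.8])

# Baseline: predict the mean score for every review
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_mse = mean_squared_error(y_true, baseline_pred)
model_mse = mean_squared_error(y_true, y_pred)

# Fraction of the baseline error removed by the model (this is R^2)
improvement = 1 - model_mse / baseline_mse
print(baseline_mse, model_mse, improvement)
```

A model is only useful if its MSE is clearly below the variance of the target; an MSE close to the variance means it does little better than guessing the average.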
