# End-to-End Sentiment Analysis
### Using neural models to predict sentiment on Amazon reviews
This end-to-end project covers:
1) Data collection
2) Exploratory data analysis
3) Data processing and preparation
4) Building the models
5) Selecting a model using cross-validation
6) Tuning hyperparameters

#### Data Collection
Load libraries used for data collection

In [1]:
import numpy as np
import pandas as pd

We will be using the Amazon reviews dataset (`amazon_alexa.tsv`) from Kaggle.com.

In [2]:
# Load the Amazon reviews dataset
amazon_df = pd.read_csv('amazon_alexa.tsv', sep='\t')

#### Exploratory Data Analysis
Check the data for null values, value distributions, datatypes, and shapes.

In [3]:
print(f'Amazon dataframe columns: {amazon_df.columns.values}')
print(f'Data info: {amazon_df.info()}')
print(f'Amazon dataframe shape: {amazon_df.shape}')
print(amazon_df.head(5))

Amazon dataframe columns: ['rating' 'date' 'variation' 'verified_reviews' 'feedback']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rating            3150 non-null   int64 
 1   date              3150 non-null   object
 2   variation         3150 non-null   object
 3   verified_reviews  3150 non-null   object
 4   feedback          3150 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 123.2+ KB
Data info: None
Amazon dataframe shape: (3150, 5)
   rating       date         variation  \
0       5  31-Jul-18  Charcoal Fabric    
1       5  31-Jul-18  Charcoal Fabric    
2       4  31-Jul-18    Walnut Finish    
3       5  31-Jul-18  Charcoal Fabric    
4       5  31-Jul-18  Charcoal Fabric    

                                    verified_reviews  feedback  
0                                      Love my Echo!         1  
1

In [4]:
# Check for null values
print(amazon_df.isnull().sum())

# Check the distribution in the feedback column
label_dist = amazon_df['feedback'].value_counts()/len(amazon_df)
print('Label distribution:')
print(label_dist)

rating              0
date                0
variation           0
verified_reviews    0
feedback            0
dtype: int64
Label distribution:
1    0.918413
0    0.081587
Name: feedback, dtype: float64


There are no null values. However, the data is extremely imbalanced if we use feedback as our labels. We will likely have to disregard the results, since it would be difficult to do better than the Zero Rule algorithm, which always predicts the most common class.

That is, our model has to score better than a baseline that already reaches about 91.8% accuracy simply because of the imbalanced distribution.
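
As a minimal sketch of that baseline, the snippet below computes the Zero Rule accuracy and, for contrast, the expected accuracy of guessing at random in proportion to the class frequencies. The label counts (2893 positive, 257 negative) are an assumption reconstructed from the proportions printed above and the 3150 rows; the function names are illustrative only.

```python
from collections import Counter

def zero_rule_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def proportional_guess_accuracy(labels):
    """Expected accuracy of guessing each label at random, weighted by its frequency."""
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())

# Assumed counts, reconstructed from 0.918413 and 0.081587 of 3150 reviews
labels = [1] * 2893 + [0] * 257

baseline = zero_rule_accuracy(labels)
random_guess = proportional_guess_accuracy(labels)
print(f'Zero Rule accuracy: {baseline:.4f}')
print(f'Proportional random guess accuracy: {random_guess:.4f}')
```

The Zero Rule figure is the one a classifier has to beat here; the random-guess figure is lower because it sometimes predicts the rare class.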

One way we can work around this is to use the ratings and decide how many stars dictate a positive rating. Let's take a look at the distribution of the ratings.

In [5]:
rating_dist = amazon_df['rating'].value_counts()/len(amazon_df)
print('Rating distribution:')
print(rating_dist)

Rating distribution:
5    0.725714
4    0.144444
1    0.051111
3    0.048254
2    0.030476
Name: rating, dtype: float64


The distribution shows that in order to create a more balanced dataset, we would have to classify 4 stars and lower as negative. However, that would be counter-intuitive, as many 4-star ratings are also generally positive.
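
To make the trade-off concrete, here is a small sketch that computes the class balance for a few possible star cut-offs. It uses only the rating proportions printed above; `positive_share` is a hypothetical helper name, not part of the notebook.

```python
# Rating proportions copied from the printed rating distribution
rating_share = {5: 0.725714, 4: 0.144444, 3: 0.048254, 2: 0.030476, 1: 0.051111}

def positive_share(min_positive_rating):
    """Share of reviews labelled positive if ratings >= min_positive_rating count as positive."""
    return sum(share for rating, share in rating_share.items()
               if rating >= min_positive_rating)

for threshold in (5, 4, 3):
    pos = positive_share(threshold)
    print(f'ratings >= {threshold} are positive: {pos:.3f} positive, {1 - pos:.3f} negative')
```

Only the five-star cut-off moves the split noticeably away from nine-to-one, and even then the classes remain far from balanced.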

A split at five stars would lower the accuracy of our model and push it towards a classifier between extremely positive and everything else, rather than between positive and negative. For this reason, we should not use this dataset for any real insight into sentiment analysis. There will be another sentiment analysis using Twitter feeds in my repository.

For now, we will complete the model as though the dataset were fine, just for the sake of building an end-to-end project. Catching problems like this early is exactly why data exploration is imperative.

#### Data Preparation
We want to clean up the reviews by removing stopwords (common words like 'the' that don't add much meaning), numbers, and symbols, keeping only words. Then we will tokenize the reviews, mapping each word to an integer, so our neural network can learn word representations via an Embedding layer.

Pipelines help us automate the data processing and machine learning workflow. They also make the program easier to scale and extend.

In [6]:
import nltk
import re

from sklearn.base import TransformerMixin
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from nltk.corpus import stopwords

To use a pipeline for data transformation, we will develop two custom classes: one that removes stopwords, numbers, and symbols, and one that tokenizes the reviews.

Our tokenizer needs a vocabulary size, and we also have to define a fixed sequence length (in words) so that every review can be fed into our neural network.

In [7]:
# Transformer that cleans up the reviews
class ReviewPrep(TransformerMixin):
    def __init__(self, stopwords):
        self.stopwords = stopwords # Set of stopwords to filter out of each review
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X['verified_reviews'] = X['verified_reviews'].apply(lambda x: ' '.join([word for word in x.split() if word not in (self.stopwords)])) # reviews are idx = 3
        X['verified_reviews'] = X['verified_reviews'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x)) # Keep only letters and whitespace
        return X

# Transformer that converts each review to a fixed-length sequence of integers
class TokenReview(TransformerMixin):
    def __init__(self, vocab, max_words):
        self.vocab = vocab
        self.max_words = max_words
        self.tokenizer = Tokenizer(num_words = self.vocab, split=' ', filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
    def fit(self, X, y=None):
        self.tokenizer.fit_on_texts(X['verified_reviews'])
        return self
    def transform(self, X):
        X['verified_reviews'] = self.tokenizer.texts_to_sequences(X['verified_reviews'])
        # Use the instance attribute rather than relying on a global max_words
        X['verified_reviews'] = pad_sequences(X['verified_reviews'], padding='post', maxlen=self.max_words)
        return X
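
To see what these two steps do to a single review, here is a dependency-free sketch of the same idea. It is a simplified stand-in, not the Keras implementation: the tiny `STOPWORDS` set, the sample reviews, and the helpers `clean`, `build_index`, and `encode` are all hypothetical, and only the broad behavior (frequency-ranked integer indices, 0 reserved for padding, zeros appended at the end) is meant to mirror `Tokenizer` and `pad_sequences(padding='post')`.

```python
import re
from collections import Counter

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead
STOPWORDS = {'the', 'a', 'an', 'is', 'my', 'it', 'and'}

def clean(text, stopwords=STOPWORDS):
    """Drop stopwords (case-sensitive match), then keep only letters and whitespace."""
    kept = ' '.join(word for word in text.split() if word not in stopwords)
    return re.sub(r'[^a-zA-Z\s]', '', kept)

def build_index(texts, vocab):
    """Map the most frequent lowercased words to 1..vocab-1; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(vocab - 1), start=1)}

def encode(text, index, max_words):
    """Convert a text to a fixed-length list of integers, zero-padded at the end."""
    seq = [index[word] for word in text.lower().split() if word in index]
    seq = seq[-max_words:]  # If too long, keep the last max_words tokens
    return seq + [0] * (max_words - len(seq))

reviews = ['Love my Echo!', 'The sound is great, love it', 'It stopped working after 2 days']
cleaned = [clean(review) for review in reviews]
index = build_index(cleaned, vocab=10)
encoded = [encode(text, index, max_words=5) for text in cleaned]

for text, seq in zip(cleaned, encoded):
    print(f'{text!r} -> {seq}')
```

Note that the stopword check is case-sensitive, so a capitalized "The" or "It" survives the filter. The same applies to the `ReviewPrep` class above, since the reviews are not lowercased before filtering.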

Now we use our custom transformers to set up a pipeline for the transformations on the review column.

We also build another pipeline to process all of the data.

This kind of modularity is useful: should we ever want a pipeline that performs a different set of transformations on the numerical data columns, we could integrate it into our overarching pipeline.

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

def review_pipeline(stopwords, vocab, max_words):
    """
    Pipeline to prep the reviews and tokenize them

    Arguments:
        stopwords: stopwords to remove with ReviewPrep
        vocab: Vocab size for tokenizer
        max_words: max word length for padding
    Returns:
        text_pipeline: pipeline object of transformations for reviews
    """

    text_pipeline = Pipeline([
        ('cleanup', ReviewPrep(stopwords)),
        ('tokenize', TokenReview(vocab, max_words))
        ])
    return text_pipeline

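def full_pipeline(data, stopwords, vocab, max_words):
    """
    Transformation pipeline for the data

    Arguments:
        data: original data
        stopwords: stopwords to remove with ReviewPrep
        vocab: Vocab size for tokenizer
        max_words: max word length for padding
    Returns:
        prepped_reviews: fully transformed review data
        prepped_labels: review labels
    """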

    
    all_transformers = ColumnTransformer([
        ('prep_rev', review_pipeline(stopwords, vocab, max_words), ['verified_reviews'])
        ])

    prepped_reviews = all_transformers.fit_transform(data)
    prepped_labels = data['feedback'].to_numpy()
    
    return prepped_reviews, prepped_labels
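
The modularity point can be sketched as follows: a second, numeric pipeline is added to the same `ColumnTransformer` next to the text pipeline. This is only an illustration on a made-up four-row dataframe. `review_length` is a hypothetical stand-in for the real text pipeline, and scaling the `rating` column is an assumed example of a numeric transformation, not something the notebook does.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def review_length(frame):
    """Stand-in for the text pipeline: replace each review with its word count."""
    return frame['verified_reviews'].str.split().str.len().to_frame()

# Made-up toy data with the same column names as the notebook's dataframe
toy_df = pd.DataFrame({
    'rating': [5, 4, 1, 3],
    'verified_reviews': ['Love my Echo!', 'Sounds great',
                         'Stopped working after two days', 'It is okay'],
})

text_pipeline = Pipeline([('length', FunctionTransformer(review_length))])
numeric_pipeline = Pipeline([('scale', StandardScaler())])

# Each pipeline only sees the columns listed for it; the outputs are stacked side by side
all_transformers = ColumnTransformer([
    ('prep_rev', text_pipeline, ['verified_reviews']),
    ('prep_num', numeric_pipeline, ['rating']),
])

features = all_transformers.fit_transform(toy_df)
print(features.shape)
print(features)
```

Each named step can be swapped out or extended without touching the others, which is the main benefit of structuring the preprocessing this way.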

Now that we have the classes and definitions that we need, we will set the variables and call the pipeline to process our data.

In [9]:
# Set stopwords to remove common but not useful words for analysis such as 'the', 'an'
# If the corpus is missing, run nltk.download('stopwords') once first
stop_words = set(stopwords.words('english')) # New name so the imported nltk stopwords module is not overwritten
vocab = 500 # Vocabulary size: the tokenizer keeps only the most frequent words
max_words = 20 # Fixed sequence length; shorter reviews are padded and longer ones truncated
embedding_dim = 5 # Dimensions of vector to represent word in embedding layer

# Call the pipeline on the original dataset
prepped_reviews, prepped_labels = full_pipeline(amazon_df, stop_words, vocab, max_words)

In [10]:
# Quick check to see everything working
print(f'Prepped reviews sequence shape: {prepped_reviews.shape}')
print(f'Prepped labels shape: {prepped_labels.shape}')
print(f'Prepped data: {prepped_reviews}')

Prepped reviews sequence shape: (3150, 20)
Prepped labels shape: (3150,)
Prepped data: [[  2   4   0 ...   0   0   0]
 [203   3   0 ...   0   0   0]
 [231 112 265 ...   0   0   0]
 ...
 [151  46  41 ...  96  50  34]
 [ 93  16 487 ... 124 126 118]
 [ 17   0   0 ...   0   0   0]]


Our training data and our training labels have an equal number of samples, and each word is now an integer that corresponds to an entry in our tokenizer's dictionary.

#### Building the models
We will build only one simple RNN model (using an LSTM layer), since the dataset issue described earlier means we are not looking to draw conclusions from this data.

We will be using a grid search and a cross-validation function from sklearn in the next sections. To do so, we need a function that builds our model so we can wrap it in a KerasClassifier.

In [11]:
import keras
keras.__version__

'2.8.0'

In [12]:
# Build a LSTM model
from keras.models import Model
from keras.layers import Dense, Embedding, LSTM, Dropout, Input, Flatten

def build_LSTM(vocab=200, max_words=20, embedding_dim=8, neurons=20):
    """
    Builds an LSTM model for analysis

    Arguments:
        vocab: vocabulary size (input dimension of the Embedding layer)
        max_words: length of each padded input sequence
        embedding_dim: size of each word vector
        neurons: number of units in the LSTM layer
    Returns:
        model: a compiled LSTM model
    """

    input_layer = Input(shape=(max_words,)) # Trailing comma makes the shape a tuple
    # Embedding maps each integer word index to a dense vector, so words with similar meaning can end up close together
    x = Embedding(vocab, embedding_dim, input_length=max_words)(input_layer)
    x = LSTM(neurons, dropout=0.1)(x) # Single LSTM layer that returns only its final hidden state
    out = Dense(1, activation='sigmoid')(x)

    model = Model(input_layer, out)

    model.compile(
        loss='binary_crossentropy',
        optimizer='adam',
        metrics=['acc']
    )
    
    return model

# Let's check out our model
LSTM_model = build_LSTM(vocab=vocab, max_words=max_words, embedding_dim=embedding_dim)
print(LSTM_model.summary())

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20)]              0         
                                                                 
 embedding (Embedding)       (None, 20, 5)             2500      
                                                                 
 lstm (LSTM)                 (None, 20)                2080      
                                                                 
 dense (Dense)               (None, 1)                 21        
                                                                 
Total params: 4,601
Trainable params: 4,601
Non-trainable params: 0
_________________________________________________________________
None
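
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for these layer types (an LSTM has four gates, each with input weights, recurrent weights, and a bias); the helper names are illustrative only.

```python
def embedding_params(vocab, embedding_dim):
    """One embedding_dim-sized vector per vocabulary entry."""
    return vocab * embedding_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias per unit."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units

vocab, embedding_dim, neurons = 500, 5, 20

counts = {
    'embedding': embedding_params(vocab, embedding_dim),
    'lstm': lstm_params(embedding_dim, neurons),
    'dense': dense_params(neurons, 1),
}
total = sum(counts.values())

print(counts)
print(f'Total params: {total}')
```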


#### Select the model with cross-validation
In order to properly evaluate the model, we have to hold out some samples from the training set to form a test set.

We will then use cross-validation on the training set to get an average of our model's performance.

In [13]:
# Create the training and held-out test sets with stratified sampling so we better represent the data's proportions
from sklearn.model_selection import StratifiedShuffleSplit

# Fix random seed for reproducibility
seed = 33
np.random.seed(seed)

s_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=seed) # Using a random seed here for the Grid Search 

for train_idx, test_idx in s_split.split(prepped_reviews, prepped_labels):
    train_set = prepped_reviews[train_idx]
    train_labels = prepped_labels[train_idx]
    test_set = prepped_reviews[test_idx]
    test_labels = prepped_labels[test_idx]
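
For intuition, stratified sampling can be sketched by hand with NumPy: shuffle the indices of each class separately and take the same fraction of each for the test set. This is a simplified illustration, not scikit-learn's exact algorithm. The label counts are an assumption reconstructed from the proportions printed earlier, and `stratified_split` is a hypothetical helper.

```python
import numpy as np

def stratified_split(labels, test_size=0.2, seed=33):
    """Return (train_idx, test_idx) so that each class keeps roughly the same share in both sets."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        cls_idx = rng.permutation(np.flatnonzero(labels == cls))  # Shuffle within the class
        n_test = int(round(test_size * len(cls_idx)))
        test_idx.extend(cls_idx[:n_test])
        train_idx.extend(cls_idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Assumed counts: 2893 positive and 257 negative reviews out of 3150
labels = np.array([1] * 2893 + [0] * 257)
train_idx, test_idx = stratified_split(labels)

print(f'Train size: {len(train_idx)}, positive share: {labels[train_idx].mean():.4f}')
print(f'Test size: {len(test_idx)}, positive share: {labels[test_idx].mean():.4f}')
```

Both sets end up with close to the same positive share as the full dataset, which is the property a plain random split does not guarantee for a rare class.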

For sklearn's cross-validation to work with a Keras model, we have to use a wrapper called KerasClassifier.

In [14]:
from scikeras.wrappers import KerasClassifier

BATCH = 64
EPOCHS = 1 # Kept at 1 for now to limit runtime on my slow computer

LSTM_wrapped = KerasClassifier(
    model=build_LSTM,
    neurons = 20,
    vocab = vocab,
    max_words = max_words,
    embedding_dim = embedding_dim,
    batch_size = BATCH,
    epochs = EPOCHS,
    random_state = seed,
    optimizer = 'adam',
    verbose = 0
    )

from sklearn.model_selection import cross_val_score

def cv_scores(model, train_set, train_labels):
    """
    Evaluate score by cross-validation

    Arguments:
        model: neural network model
        train_set: training set
        train_labels: training labels
    Returns:
        scores: an array of scores for each run
    """

    scores = cross_val_score(
        model,
        train_set,
        train_labels,
        scoring = 'accuracy',
        cv = 10
        )

    print(f'This is the average score of the cv: {scores.mean()}')
    return scores

Now we pass in the training set and training labels to see how our model did.

In [15]:
LSTM_scores = cv_scores(LSTM_wrapped, train_set, train_labels)
print('CV Score for LSTM:')
print(LSTM_scores)

This is the average score of the cv: 0.9182539682539682
CV Score for LSTM:
[0.91666667 0.91666667 0.91666667 0.91666667 0.91666667 0.91666667
 0.92063492 0.92063492 0.92063492 0.92063492]


The model's average score (0.9183) is no better than the Zero Rule baseline of about 0.9184, and every fold scores almost exactly the same, which suggests the model is simply predicting the majority class. This was expected since our dataset is so heavily imbalanced.
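
A quick arithmetic check supports this reading. The sketch below assumes the training split holds 2520 reviews (80% of 3150), of which 2314 are positive, and that the positives are spread as evenly as possible over the 10 folds. Both numbers are reconstructed from the proportions above rather than read from the notebook. Under those assumptions, a model that predicts "positive" for every review would score as follows:

```python
train_size, n_folds = 2520, 10   # Assumed: 80% of 3150 reviews, split into 10 folds
train_positives = 2314           # Assumed: about 91.8% of the training reviews
fold_size = train_size // n_folds

# Spread the positives as evenly as possible over the folds
base, extra = divmod(train_positives, n_folds)
positives_per_fold = [base] * (n_folds - extra) + [base + 1] * extra

# Accuracy of an always-positive model equals the share of positives in each fold
fold_accuracy = [positives / fold_size for positives in positives_per_fold]
mean_accuracy = sum(fold_accuracy) / n_folds

print([round(acc, 8) for acc in fold_accuracy])
print(f'Mean accuracy: {mean_accuracy}')
```

These values line up with the fold scores and the average reported above, which is consistent with a model that never predicts the negative class.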

#### Hyperparameter Tuning
We will still go ahead and finish building the end-to-end machine learning program for completeness, but we cannot expect any improvement using our flawed dataset.

For hyperparameter tuning, we will use GridSearchCV from sklearn.

In [16]:
# HYPERPARAMETER TUNING
# Now that we've chosen the best model, we'll tune the hyperparameters
from sklearn.model_selection import GridSearchCV

# Define the grid search parameters
param_grid = [{
    'neurons' : [5, 10, 15, 20],
    'vocab': [500, 1000, 2000],
    'embedding_dim' : [5, 8, 10, 12]
    }]

grid = GridSearchCV(
    LSTM_wrapped,
    param_grid=param_grid,
    scoring = 'accuracy', # Score each configuration by accuracy
    n_jobs = -1, # -1 means it'll use all the cores in the computer
    cv = 3 # 3-fold cross-validation (recent scikit-learn versions default to 5)
    )

grid_results = grid.fit(train_set, train_labels)

cv_results = grid_results.cv_results_ # Named cv_results so the cv_scores() function defined earlier is not overwritten
for mean, params in zip(cv_results['mean_test_score'], cv_results['params']):
    print(f'Mean: {mean} with {params}')

print(f'Best score is: {grid_results.best_score_}')
print(f'Best params is: {grid_results.best_params_}')

Mean: 0.9182539682539682 with {'embedding_dim': 5, 'neurons': 5, 'vocab': 500}
Mean: 0.9182539682539682 with {'embedding_dim': 5, 'neurons': 5, 'vocab': 1000}
Mean: 0.9182539682539682 with {'embedding_dim': 5, 'neurons': 5, 'vocab': 2000}
Mean: 0.9182539682539682 with {'embedding_dim': 5, 'neurons': 10, 'vocab': 500}
Mean: 0.9182539682539682 with {'embedding_dim': 5, 'neurons': 10, 'vocab': 1000}
Mean: 0.9182539682539682 with {'embedding_dim': 5, 'neurons': 10, 'vocab': 2000}
Mean: 0.9182539682539682 with {'embedding_dim': 5, 'neurons': 15, 'vocab': 500}
Mean: 0.9182539682539682 with {'embedding_dim': 5, 'neurons': 15, 'vocab': 1000}
Mean: 0.9182539682539682 with {'embedding_dim': 5, 'neurons': 15, 'vocab': 2000}
Mean: 0.9182539682539682 with {'embedding_dim': 5, 'neurons': 20, 'vocab': 500}
Mean: 0.9182539682539682 with {'embedding_dim': 5, 'neurons': 20, 'vocab': 1000}
Mean: 0.9182539682539682 with {'embedding_dim': 5, 'neurons': 20, 'vocab': 2000}
Mean: 0.9182539682539682 with {'emb
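
To get a sense of the size of this search, the grid can be enumerated by hand. The sketch below uses only `itertools` and the values from `param_grid`; sorting the keys alphabetically reproduces the order in which the configurations are printed above.

```python
from itertools import product

param_grid = {
    'neurons': [5, 10, 15, 20],
    'vocab': [500, 1000, 2000],
    'embedding_dim': [5, 8, 10, 12],
}
cv_folds = 3

# Every combination of one value per hyperparameter
keys = sorted(param_grid)
combos = [dict(zip(keys, values))
          for values in product(*(param_grid[key] for key in keys))]

print(f'Configurations: {len(combos)}')
print(f'Model fits during the search: {len(combos) * cv_folds}')
print(f'First configuration: {combos[0]}')
```

Because the number of fits multiplies with every added value, even a small grid gets expensive quickly when each fit trains a neural network.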

#### Evaluating the final model
Now that we've (theoretically) chosen the right model and the optimal hyperparameters, we will use our held-out test set to evaluate its performance.

*Note that our flawed data gave the same estimate for every configuration in the grid search*

In [18]:
# EVALUATE FINAL MODEL WITH HELD OUT TEST SET
# Best configuration
final_model = grid_results.best_estimator_

# Final evaluation
final_pred = final_model.predict(
    test_set
    )

final_acc = np.mean(test_labels == final_pred)

print(f'This is the test accuracy: {final_acc}')

This is the test accuracy: 0.919047619047619
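
Accuracy alone hides how the model treats the rare class. As a sketch, assume the test set holds 579 positive and 51 negative reviews (reconstructed from the class proportions, not read from the notebook) and that the model predicts "positive" for every review, which the identical scores suggest but the notebook does not verify. Per-class recall and balanced accuracy then tell a very different story:

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, cls):
    """Share of the samples belonging to cls that were predicted as cls."""
    predictions_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(p == cls for p in predictions_for_cls) / len(predictions_for_cls)

def balanced_accuracy(y_true, y_pred):
    """Average of the per-class recalls, so each class counts equally."""
    classes = sorted(set(y_true))
    return sum(recall(y_true, y_pred, cls) for cls in classes) / len(classes)

y_true = [1] * 579 + [0] * 51   # Assumed test labels
y_pred = [1] * 630              # Assumed: the model always predicts positive

acc = accuracy(y_true, y_pred)
negative_recall = recall(y_true, y_pred, 0)
bal_acc = balanced_accuracy(y_true, y_pred)

print(f'Accuracy: {acc}')
print(f'Recall on negative reviews: {negative_recall}')
print(f'Balanced accuracy: {bal_acc}')
```

Under these assumptions the accuracy matches the test accuracy reported above, while the balanced accuracy is no better than chance.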
