# Sentiment Analysis

In this series of notebooks we build different machine learning models to detect sentiment (i.e. to decide whether a review is positive or negative) using PyTorch.

Sentiment analysis helps us understand what our customers really think, want and need.

Here are some benefits of sentiment analysis:
- Adjust marketing strategy: How can we know whether we are doing the right things on, e.g., social media? The conversations there tell us what our customers feel and think about our brand.
- Measure the ROI of a marketing campaign: The success of a marketing campaign can be judged by how positive the discussions among customers are.
- Develop a better product: Sentiment analysis complements our market research by revealing our customers' opinions about our products and services, and how we can align their quality and features with our customers' tastes.
- Improve customer service: Sentiment analysis can pick up negative discussions and give us real-time alerts so that we can respond quickly. Used as part of social listening to manage complaints, it helps us avoid leaving our customers feeling ignored and angry.
- Crisis management: Constantly monitoring social media conversations also helps us prevent, or at least mitigate, the damage of an online communication crisis.


This will be done on Amazon reviews, using the ... dataset.
In this first notebook, we'll start very simple to understand the general concepts. Further notebooks will build on this knowledge and achieve better results.

### Introduction

In this notebook, we show how to approach a sentiment analysis problem. We start with preprocessing and exploring the data. Then we extract features from the cleaned text using Bag-of-Words, TF-IDF and word embeddings (Word2Vec). Finally, we build a couple of models that use these feature sets to classify the reviews.

In [3]:
import pandas as pd
import numpy as np
#import datapreprocessing as dp

import matplotlib.pyplot as plt
import seaborn as sns

# Bag-of-Words
from sklearn.feature_extraction.text import CountVectorizer
# Tf-Idf
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn import metrics

import re
import nltk
from nltk import word_tokenize
nltk.download('punkt')
from nltk.corpus import stopwords
stop = stopwords.words('english')
from nltk.stem import PorterStemmer
from textblob import Word
from wordcloud import WordCloud

from gensim.models import Word2Vec

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/dimitriwilhelm/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Data Collection and Review

Let's quickly start and read the file from the dataset in order to perform different tasks on it. We will use the Amazon reviews.

In [4]:
# Reading the data provided via http://jmcauley.ucsd.edu/data/amazon/
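def parse(path):
    # Each line of the file holds one review as a Python dict literal.
    # The with-block makes sure the file is closed after reading.
    with open(path, 'rb') as g:
        for l in g:
            yield eval(l)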

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

In [6]:
df_Apps = getDF("data/Apps_for_Android_5.json")

Let's create a new dataframe with the columns 'reviewText' and 'sentiment'

In [7]:
# New column called 'sentiment'
df_Apps['sentiment'] = [1 if x > 3 else 0 for x in df_Apps['overall']]
# New DataFrame with columns 'reviewText' and 'sentiment'
df_sentiment_data = df_Apps[['reviewText', 'sentiment']]

# Rename our columns
#review_data = df_sentiment_data.reviewText
#labels = df_sentiment_data.sentiment

In [13]:
print(f'Minimum review length: {len(min(df_sentiment_data.reviewText, key=len))}')
print(f'Maximum review length: {len(max(df_sentiment_data.reviewText, key=len))}')

Minimum review length: 0
Maximum review length: 18077


We first discuss different feature extraction methods, starting with some basic techniques. We then preprocess the text data so that better features can be extracted from clean data, which leads to further Natural Language Processing techniques at the end.

1. Feature extraction using text data
2. Data Preprocessing of text data
3. Further text processing

## Feature extraction using text data

In [14]:
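# Function to extract some basic features from our data
def extract_data(data):
    """
    Add simple count-based feature columns for each review: words,
    characters, stop words, numerics, uppercase words and word length.
    """
    data = pd.DataFrame(data)
    # Number of words in each review
    # Intuition: negative reviews tend to contain fewer words than positive ones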
    data['word_count'] = data['reviewText'].apply(lambda x: len(str(x).split(" ")))
    
    # Number of characters in each review
    data['char_count'] = data['reviewText'].str.len() ## this includes spaces too
    
    # Number of stopwords
    # When solving an NLP problem, one of the first steps is usually to remove the stopwords; here we only count them.
    data['numb_stopwords'] = data['reviewText'].apply(
        lambda x: len([x for x in x.split() if x in stop])
    )
    
    # Number of special characters
    
    
    # Number of numerics
    # The number of purely numeric tokens could also be a useful feature
    data['numerics'] = data['reviewText'].apply(
        lambda x: len([x for x in x.split() if x.isdigit()])
    )
    
    # Number of Uppercase words
    # Anger or rage is quite often expressed by writing in UPPERCASE words
    data['uppercase'] = data['reviewText'].apply(
        lambda x: len([x for x in x.split() if x.isupper()])
    )
    
    # Drop empty reviews (total word length of 0) to avoid dividing by zero below
    data['avg_word'] = data['reviewText'].apply(
        lambda x: sum( len(word) for word in (x.split()) )
    )
    idx = data[data.avg_word == 0].index
    data = data.drop(idx)
    
    # Average Word Length
    # We take the sum of the lengths of all the words and divide it by the number of words in the review
    data['avg_word'] = data['reviewText'].apply(
        lambda x: sum( len(word) for word in (x.split())) / (len(x.split()) )
    )

In [15]:
extract_data(df_sentiment_data)

In [16]:
# Let's check the first 5 rows
df_sentiment_data.head()

Unnamed: 0,reviewText,sentiment,word_count,char_count,numb_stopwords,numerics,uppercase,avg_word
0,"Loves the song, so he really couldn't wait to ...",0,41,206,19,1,1,166
1,"Oh, how my little grandson loves this app. He'...",1,47,255,21,0,0,209
2,I found this at a perfect time since my daught...,1,53,288,19,0,1,236
3,My 1 year old goes back to this game over and ...,1,43,186,15,2,0,144
4,There are three different versions of the song...,1,134,746,55,0,0,613


We have now extracted some basic features from the text data. Before diving deeper into text and feature extraction, our next step is to clean the data in order to obtain better features. We achieve this by applying some preprocessing steps to our data.

## Data preprocessing of text data

Data preprocessing and cleaning is an important step before any text mining task. In this step we will remove punctuation and stopwords, and normalize the reviews as much as possible.

All these data preprocessing steps are essential and will help us in reducing our vocabulary clutter so that the features produced in the end are more effective.
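
The cleaning steps described above can be sketched on a single string before applying them to the whole DataFrame. This is only an illustration: the stop-word set below is a tiny hypothetical list (the notebook itself uses NLTK's English list), and the function name `clean_text` is made up for this example.

```python
import re

# Hypothetical mini stop-word list for illustration only;
# the notebook uses nltk.corpus.stopwords instead.
STOP_WORDS = {"the", "is", "a", "this", "and", "it", "i"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase one review, replace non-letters by spaces and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z#]", " ", text)  # punctuation and digits become spaces
    tokens = [w for w in text.split() if w not in stop_words]
    return " ".join(tokens)

cleaned = clean_text("This app is GREAT, and I love it!!! 10/10")
print(cleaned)
```

Only the content words of the review survive, which is what reduces the vocabulary clutter mentioned above.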

In [17]:
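# Function to clean our data
def clean_data(data):
    """
    Clean the 'reviewText' column in place: lowercase, strip non-letters,
    remove stop words and the most frequent word, lemmatize, drop short words.
    """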
    # Make entire text lowercase
    # Transform our review into lower case. This avoids having multiple copies of the same words
    data['reviewText'] = data['reviewText'].apply(
        lambda x: " ".join( x.lower() for x in x.split() )
    )
    
    # Removing Punctuation, Numbers and Special Characters/Symbols
    # They do not add any extra information when treating text data, so removing them helps us reduce the size of the data
    # regex=True is needed because newer pandas versions treat the pattern as a literal string by default
    data['reviewText'] = data['reviewText'].str.replace('[^a-zA-Z#]', ' ', regex=True)
    
    # Removal of Stop Words, i.e. we just removed commonly occurring words in a general sense
    # Stop Words should be removed from the text data. We use for this predefined libraries from nltk
    data['reviewText'] = data['reviewText'].apply(
        lambda x: " ".join( x for x in x.split() if x not in stop)
    )
    
    # Removing commonly occurring words from our text data
    # Let's find the most frequently occurring word in our text data ([:1] keeps only the top one)
    freq = pd.Series(" ".join( data['reviewText'] ).split()).value_counts()[:1]
    # Let's remove it, as its presence will not be of any use in the classification of our text data
    freq = list(freq.index)
    data['reviewText'] = data['reviewText'].apply(
        lambda x: " ".join( x for x in x.split() if x not in freq)
    )
    
    # Remove rare words
    # Let's check the 10 rarely occurring words in our text data
    #freq_1 = pd.Series(" ".join( data['reviewText'] ).split()).value_counts()[:-1]
    # Let's remove these words as their presence will not of any use in classification of our text data
    #freq_1 = list(freq.index)
    #data['reviewText'] = data['reviewText'].apply(
    #    lambda x: " ".join(x for x in x.split() if x not in freq_1)
    #)
    
    # Stemming, i.e. we're removing suffixes, like "ing", "ly", etc. by a simple rule-based approach.
    # For this purpose, we will use PorterStemmer from the NLTK library
    #st = PorterStemmer()
    #data['reviewText'] = data['reviewText'].apply(
    #    lambda x: " ".join([ st.stem(word) for word in x.split() ])
    #)
   
    # Lemmatization
    # Lemmatization is more effective than stemming because it converts the word into its root word,
    # rather than just stripping the suffixes. We usually prefer lemmatization over stemming.
    data['reviewText'] = data['reviewText'].apply(
        lambda x: " ".join([ Word(word).lemmatize() for word in x.split() ])
    )
    
    # Remove short words (length <= 3)
    data['reviewText'] = data['reviewText'].apply(
        lambda x: " ".join([w for w in x.split() if len(w) > 3])
    )

In [18]:
data = pd.DataFrame(df_sentiment_data)

## Clean the data

In [19]:
clean_data(data)

In [20]:
df_sentiment_data = data
df_sentiment_data.head()

Unnamed: 0,reviewText,sentiment,word_count,char_count,numb_stopwords,numerics,uppercase,avg_word
0,love song really wait play little interesting ...,0,41,206,19,1,1,166
1,little grandson love always asking monkey gran...,1,47,255,21,0,0,209
2,found perfect time since daughter favorite son...,1,53,288,19,0,1,236
3,year back simple easy toddler even caught year...,1,43,186,15,2,0,144
4,three different version song keep occupied eve...,1,134,746,55,0,0,613


After every preprocessing step, it is a good practice to check the most frequent words in the data. Therefore we define a function that would plot a bar graph of n most frequent words in the data.
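
One possible way to collect the counts for such a check is sketched below. The helper name `most_frequent_words` and the sample reviews are hypothetical; the sketch only computes the n most frequent words and leaves the plotting to the caller.

```python
from collections import Counter

def most_frequent_words(reviews, n=10):
    """Return the n most common (word, count) pairs over an iterable of cleaned reviews."""
    counts = Counter(word for review in reviews for word in review.split())
    return counts.most_common(n)

# Hypothetical cleaned reviews, for illustration only
sample_reviews = ["love game love", "game crash", "love kindle"]
top_words = most_frequent_words(sample_reviews, n=2)
print(top_words)
```

The resulting pairs can be unpacked with `words, counts = zip(*top_words)` and passed to `plt.bar(words, counts)` to obtain the bar graph.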

We have now completed all the preprocessing steps needed to clean our text data, so we can finally move on to extracting features using NLP techniques.

## Further Text Processing

### N-grams

# Exploring and visualizing the reviews

In this section, we will explore the cleaned reviews.

Before we begin exploration, we must think and ask questions related to the data in hand. A few probable questions are as follows:

- What are the most common words in the entire dataset?
- What are the most common words in the dataset for negative and positive reviews, respectively?
- How many words are there in a review?
- Which trends are associated with my dataset?
- Which trends are associated with either of the sentiments? Are they compatible with the sentiments?


### Understanding the common words in the dataset

Let's see how well the given sentiments are distributed across the review dataset. One way to accomplish this task is by understanding the common words by plotting wordclouds.

At first we define our sets of words.

In [None]:
def total_words(data):
    return ' '.join([word for word in data['reviewText']])

def negative_sentiment_words(data):
    return ' '.join([text for text in data['reviewText'][data['sentiment'] == 0]])

def positive_sentiment_words(data):
    return ' '.join([text for text in data['reviewText'][data['sentiment'] == 1]])

## Let’s visualize all the words of our data using the wordcloud plot.

In [None]:
def plot_wordcloud(data):
    wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(data)
    plt.figure(figsize=(10, 7))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis('off')
    plt.show()

### All words in our dataset

In [None]:
# All words in our dataset
all_words = total_words(data)

# Plotting our whole words using the wordcloud plot
plot_wordcloud(all_words)

We can see that most of the words are positive or neutral, with kindle and fire being the most frequent ones. Next we will plot separate wordclouds for the two classes (negative and positive) in our dataset.

### Plotting all words in our positive sentiment dataset

In [None]:
# All words in our positive sentiment dataset
positive_words = positive_sentiment_words(data)

# Plotting our positive words using the wordcloud plot
plot_wordcloud(positive_words)

We can see that most of the words are positive or neutral, with kindle and fire being the most frequent ones.

### Plotting all words in our negative sentiment dataset

In [None]:
# All words in our negative sentiment dataset
negative_words = negative_sentiment_words(data)

# Plotting our negative words using the wordcloud plot
plot_wordcloud(negative_words)

We can see most of the words are positive or neutral.

## Model Building: Sentiment Analysis

We will start using logistic regression to build the models. It predicts the probability of occurrence of an event by fitting data to a logit function.
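
To make the logit function concrete, here is a minimal sketch of how a logistic regression turns a weighted sum of features into a probability. The weights and the two features (counts of the words "great" and "bad") are invented for illustration; they are not values learned by the model in this notebook.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def positive_probability(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights for two word-count features: "great" and "bad"
weights, bias = [1.5, -2.0], 0.0
p_great = positive_probability([2, 0], weights, bias)  # "great" appears twice
p_bad = positive_probability([0, 2], weights, bias)    # "bad" appears twice
print(p_great, p_bad)
```

A review is classified as positive when this probability is at or above a threshold, 0.5 by default.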

In [None]:
# We define our input data and target
X = df_sentiment_data['reviewText']
y = df_sentiment_data['sentiment']

### Train-Test-Split

In [None]:
# splitting data into training (70%), validation (15%) and test (15%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.3) 
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, random_state=1, test_size=0.5)

In [None]:
print('Train_data set size: {}'.format(X_train.shape[0]))
print('Train_target set size: {}'.format(y_train.shape[0]))
print('Val_data set size: {}'.format(X_val.shape[0]))
print('Val_target set size: {}'.format(y_val.shape[0]))
print('Test_data set size: {}'.format(X_test.shape[0]))
print('Test_target set size: {}'.format(y_test.shape[0]))

### Extracting Features from our cleaned text data

To analyze preprocessed data, it needs to be converted into features. Depending on the usage, text features can be constructed using assorted techniques:

- Bag-of-Words
- TF-IDF 
- Word Embeddings

In this notebook, we will be covering all of them. We will start with Bag-of-Words.

### Building model using Bag-of-Words features

## Bag-of-Words

Bag-of-Words is a method to represent text as numerical features. Consider a corpus C (a collection of texts) of D documents {d_1, d_2, ..., d_D} and N unique tokens extracted from the corpus C. The N tokens (words) form a list, and the bag-of-words matrix M has size D × N. Row i of the matrix M contains the frequency of each token in document d_i.

Bag-of-Words features can be easily created using sklearn’s CountVectorizer class. We will set the parameter max_features = 1000 to select only the top 1000 terms ordered by term frequency across the corpus.
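
Before handing the work to CountVectorizer, the D × N matrix can be built by hand on a toy corpus. The function `bag_of_words` below is a hypothetical helper written for this illustration and uses plain whitespace tokenization, which is simpler than what CountVectorizer does.

```python
def bag_of_words(documents):
    """Build the sorted vocabulary of N tokens and the D x N count matrix M."""
    vocabulary = sorted({token for doc in documents for token in doc.split()})
    index = {token: j for j, token in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in documents]
    for i, doc in enumerate(documents):
        for token in doc.split():
            matrix[i][index[token]] += 1
    return vocabulary, matrix

docs = ["good app", "bad app bad"]
vocab, M = bag_of_words(docs)
print(vocab)  # column order of the matrix
print(M)      # one row per document
```

Here D = 2 and N = 3, so M has two rows and three columns.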

### Features of Bag-of-Words

In [None]:
# initialize a CountVectorizer object: CountVectorizer()
bow_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english', ngram_range=(1,1))

# fit the vectorizer on the train data and transform it into a bag-of-words feature matrix
count_train = bow_vectorizer.fit(X_train)
X_train_bow = bow_vectorizer.transform(X_train)

# transform the validation data
X_val_bow = bow_vectorizer.transform(X_val)

# transform the test data for the final predictions
X_test_bow = bow_vectorizer.transform(X_test)

# Print all features, then every 3rd feature
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
print("Every feature:\n{}".format(bow_vectorizer.get_feature_names_out()))
print("\nEvery 3rd feature:\n{}".format(bow_vectorizer.get_feature_names_out()[::3]))

### Vocabulary and vocabulary ID

In [None]:
print("Vocabulary size: {}".format(len(count_train.vocabulary_)))
print("Vocabulary content:\n {}".format(count_train.vocabulary_))

Now the columns in the above matrix can be used as features to build a classification model.

### Run a logistic Regression

First we build our functions for the logistic regression model

In [None]:
def run_logistic_regression(X_train, y_train):
    log_reg = LogisticRegression()
    log_reg.fit(X_train, y_train)
    return log_reg

def evaluate_model(model, X_train, X_val, y_train, y_val, threshold=0.5):
    cv_loss = np.mean(cross_val_score(model, X_train, y_train, cv=5, scoring='neg_log_loss'))
    print('CV Log_loss score is {}'.format(cv_loss))

    cv_score = np.mean(cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy'))
    print('CV Accuracy score is {}'.format(cv_score))

    y_pred = model.predict(X_val)
    y_pred_prob = model.predict_proba(X_val) # predicting on the validation set
    y_pred_prob_int = y_pred_prob[:,1] >= threshold # 1 if the probability is >= threshold, else 0
    y_pred_prob_int = y_pred_prob_int.astype(int) # np.int was removed in NumPy 1.24

    # calculate the AUC score of the thresholded predictions on the validation set
    auc_score = metrics.roc_auc_score(y_val, y_pred_prob_int)
    print("Validation ROC_AUC score {}\n".format(auc_score))

    target_name = ['Negative', 'Positive']
    print(classification_report(y_val, y_pred, target_names=target_name))
    

    cm = confusion_matrix(y_val, y_pred)
    #cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    cm = pd.DataFrame(cm)

    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title('Confusion_Matrix')
    plt.colorbar()
    tick_marks = np.arange(len(target_name))
    plt.xticks(tick_marks, target_name, rotation=45)
    plt.yticks(tick_marks, target_name)


    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            # plt.text takes (x, y): x is the predicted label (column j), y is the true label (row i)
            plt.text(j, i, cm.iloc[i, j],
                horizontalalignment="center",
                color="black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
    
def predict(model, X_test, threshold=0.5):
    # predicting on the test set
    test_pred = model.predict_proba(X_test)

    # 1 if the predicted probability is >= threshold, else 0
    test_pred_int = test_pred[:,1] >= threshold
    test_pred_int = test_pred_int.astype(int)
    return test_pred_int

In [None]:
# fit the model
model_bow = run_logistic_regression(X_train_bow, y_train)

In [None]:
# evaluate the model
evaluate_model(model_bow, X_train_bow, X_val_bow, y_train, y_val)

We trained the logistic regression model on the Bag-of-Words features. Now we will use this model to predict for the test data.

In [None]:
pred_test_bow = predict(model_bow, X_test_bow)
X_test = pd.DataFrame(X_test)
X_test['pred_bow_sentiment'] = pred_test_bow

In [None]:
# our result 
X_test.head()

Now we will again train a logistic regression model but this time on the TF-IDF features. Let’s see how it performs.

### Building model using TF-IDF features

## TF-IDF

TF-IDF is another frequency-based method, but it differs from the bag-of-words approach in that it takes into account not just the occurrence of a word in a single document (or review) but in the entire corpus.

TF-IDF stands for "Term Frequency, Inverse Document Frequency". It is a way to score the importance of words in a document based on how frequently they appear in that document and across multiple documents.

Intuitively:
- If a word appears frequently in a document, it's important. Give the word a high score.
- If a word appears in many documents, it's not a unique identifier. Give the word a low score.

That's why common words like "the" and "for", which appear in many documents, will be scaled down. Words that appear frequently in a single document will be scaled up.

Let’s have a look at the important terms related to TF-IDF:

- TF = (Number of times term t appears in a document)/(Number of terms in the document)
- IDF = log(N/n), where, N is the number of documents and n is the number of documents a term t has appeared in.
- TF-IDF = TF*IDF
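
The three formulas above can be checked by hand on a toy corpus. The sketch below implements them literally; the function name `tf_idf` is hypothetical. Note that sklearn's TfidfVectorizer uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math

def tf_idf(documents):
    """Return one {term: score} dict per document, with TF = count / len and IDF = log(N / n)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # n: number of documents each term appears in
    doc_freq = {}
    for tokens in tokenized:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    scores = []
    for tokens in tokenized:
        doc_scores = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores

scores = tf_idf(["good app", "bad app"])
print(scores)
```

The word "app" occurs in every document, so its IDF is log(1) = 0 and it contributes nothing, while "good" and "bad" keep a positive score.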

### TF - The Term Frequency

In [None]:
# Initialize a TfidfVectorizer object: TfidfVectorizer()
tfidf_vectorizer= TfidfVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')

# fit and transforms the train data into a tf-idf feature matrix
tfidf_fitted = tfidf_vectorizer.fit(X_train)
X_train_tfidf = tfidf_vectorizer.transform(X_train)

# transform the val data
X_val_tfidf = tfidf_vectorizer.transform(X_val)

# transform the test data; X_test is a DataFrame by now, so we select the text column
X_test_tfidf = tfidf_vectorizer.transform(X_test['reviewText'])

In [None]:
# fit the model
model_tfidf = run_logistic_regression(X_train_tfidf, y_train)

In [None]:
# evaluate the model
evaluate_model(model_tfidf, X_train_tfidf, X_val_tfidf, y_train, y_val)

In [None]:
pred_test_tfidf = predict(model_tfidf, X_test_tfidf)
X_test = pd.DataFrame(X_test)
X_test['pred_tfidf_sentiment'] = pred_test_tfidf

In [None]:
X_test.head()

In [None]:
print(f'The learned corpus vocabulary: \n {tfidf_vectorizer.vocabulary_}')

### IDF: The inverse document frequency

In [None]:
idf = tfidf_vectorizer.idf_
idf_dict = tfidf_fitted.get_feature_names_out()
print(f'The learned IDF weights: \n {(dict(zip(idf_dict, idf)))}')

In [None]:
rr = dict(zip(idf_dict[:10], idf))

In [None]:
def plot_dict_freq(data):
    token_weight = pd.DataFrame.from_dict(data, orient='index').reset_index()
    token_weight.columns=('token','weight')
    token_weight = token_weight.sort_values(by='weight', ascending=False)

    sns.barplot(x='token', y='weight', data=token_weight)            
    #plt.title("Inverse Document Frequency(idf) per token")
    fig=plt.gcf()
    fig.set_size_inches(18,10)
    plt.show()

In [None]:
plot_dict_freq(rr)

Now we will use more advanced techniques: a Word2Vec model for feature extraction and a neural network (LSTM).

For that we will use word embeddings (Word2Vec and Doc2Vec) to create better features.

## N-grams (sets of consecutive words)

- N = 2
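
As a small illustration of the N = 2 case, the sketch below extracts all bigrams from a tokenized review. The helper `ngrams` is a hypothetical function written for this example.

```python
def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("not a good app".split(), n=2)
print(bigrams)
```

Bigrams such as ("not", "a") and ("a", "good") keep some word order that single words lose. With sklearn, setting `ngram_range=(1, 2)` in CountVectorizer includes both single words and bigrams as features.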

In [None]:
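# Prepare data
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# The cleaned reviews and their sentiment labels
review_data = df_sentiment_data['reviewText']
labels = df_sentiment_data['sentiment']

training_data = []
for i in range(len(review_data[:10])):
    training_data.append(
        (review_data[i].split(), [labels[i]])
    )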
    
# First, we build an index of all tokens in the data.
token_index={}
for sample in review_data[:10]:
    #print(sample.split())
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index)
print(training_data)
#print(token_index)            

### Building an LSTM model for sentiment analysis

Problems with a standard RNN

The simplest RNN model has a major drawback, called the vanishing gradient problem, which prevents it from being accurate. The problem comes from the fact that at each time step during training we use the same weights to calculate y_t. That multiplication is also done during backpropagation. The further we move backwards, the bigger or smaller our error signal becomes. This means that the network has difficulty memorizing words from far away in the sequence and makes predictions based on only the most recent ones.
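
The effect of multiplying by the same weight at every time step can be seen in a deliberately simplified scalar sketch. It ignores activation functions and uses a single hypothetical weight instead of a weight matrix, so it only illustrates the trend, not a real RNN.

```python
def backpropagated_signal(weight, time_steps, signal=1.0):
    """Scale an error signal by the same recurrent weight once per time step."""
    for _ in range(time_steps):
        signal *= weight
    return signal

vanishing = backpropagated_signal(0.5, 50)  # weight below 1: the signal shrinks towards zero
exploding = backpropagated_signal(1.5, 50)  # weight above 1: the signal grows very large
print(vanishing, exploding)
```

After 50 steps the first signal is practically zero, which is why words far back in the sequence barely influence the weight updates.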

PyTorch’s LSTM expects all of its inputs to be 3D tensors. The semantics of the axes of these tensors is important. The three dimensions of this input are:

- Time Steps: One time step is one point of observation in the sample, here one word of a review.
- Samples: One sequence is one sample. A batch consists of one or more samples.
- Features: One feature is one observation at a time step, here one component of a word embedding.

By default, nn.LSTM expects the axes in this order, i.e. (time steps, samples, features); passing batch_first=True puts the samples axis first instead. This means that the LSTM expects a 3D tensor when fitting the model and when making predictions, even if specific dimensions contain a single value, e.g. one sample or one feature.

When defining the LSTM layer, you specify the number of features per time step (input_size) and the size of the hidden state (hidden_size). The number of time steps and the number of samples are not fixed in advance; they are taken from the shape of the input tensor.
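
A short sketch of the tensor shapes involved, assuming PyTorch's default axis order for nn.LSTM, which is (time steps, samples, features). The sizes are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

seq_len, batch_size, n_features, hidden_dim = 5, 1, 6, 4

# input_size is the number of features per time step
lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_dim)

# One batch of input: (time steps, samples, features)
x = torch.randn(seq_len, batch_size, n_features)

# Initial hidden and cell state: (num_layers, samples, hidden_dim)
h0 = torch.zeros(1, batch_size, hidden_dim)
c0 = torch.zeros(1, batch_size, hidden_dim)

out, (hn, cn) = lstm(x, (h0, c0))
out_shape = tuple(out.shape)  # one hidden vector per time step
hn_shape = tuple(hn.shape)    # hidden state after the last time step
print(out_shape, hn_shape)
```

The output keeps the time and sample axes of the input and replaces the feature axis by the hidden dimension.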

We start building our model architecture in the code below.

Input: Our input is a sequence of words (technically, integer wordIDs) of maximum length = max_words
Output: Our Output is a binary sentiment label(0 or 1)
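
Turning a review into a fixed-length sequence of integer word IDs could look like the sketch below. The vocabulary, the padding ID 0 and the helper name `encode_and_pad` are assumptions made for this illustration.

```python
def encode_and_pad(review, word_to_id, max_words, pad_id=0):
    """Map words to integer IDs, truncate to max_words and right-pad with pad_id."""
    ids = [word_to_id[w] for w in review.split() if w in word_to_id][:max_words]
    return ids + [pad_id] * (max_words - len(ids))

# Hypothetical vocabulary; ID 0 is reserved for padding
word_to_id = {"love": 1, "game": 2, "crash": 3}
short_ids = encode_and_pad("love game", word_to_id, max_words=4)
long_ids = encode_and_pad("game crash game crash love", word_to_id, max_words=4)
print(short_ids, long_ids)
```

Short reviews are filled up with the padding ID and long reviews are cut off, so every sequence has exactly max_words entries.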

In [None]:
# Create the model
class LSTMClassifier(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, label_size, sentence_dim, batch_size):
        super(LSTMClassifier, self).__init__()
        self.hidden_dim = hidden_dim
        self.sentence_dim = sentence_dim
        self.batch_size = batch_size
        
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        # The linear layer that maps from hidden state space to label_size
        self.linear = nn.Linear(hidden_dim, label_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # Before we've done anything, we don't have any hidden state.
        # Refer to the PyTorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        # the first is the hidden h
        # the second is the cell  c
        return (torch.zeros(1, self.batch_size, self.hidden_dim),
                torch.zeros(1, self.batch_size, self.hidden_dim))

    def forward(self, sentence):
        embedded = self.word_embeddings(sentence)
        # Shape (seq_len, batch=1, embedding_dim): one review is fed at a time,
        # so batch_size is expected to be 1 here
        x = embedded.view(len(sentence), 1, -1)
        lstm_out, self.hidden = self.lstm(x, self.hidden)
        # Classify from the output of the last time step
        y  = self.linear(lstm_out[-1])
        log_probs = F.log_softmax(y, dim=1)
        return log_probs

In [None]:
def train():
    # These will usually be more like 32 or 64 dimensional.
    # We will keep them small, so we can see how the weights change as we train.
    EMBEDDING_DIM = 6
    HIDDEN_DIM = 6
    INPUT_SIZE = len(token_index)
    OUTPUT_DIM = 2
    EPOCH = 20
    # One review is fed at a time (see forward), so both of these are set to 1
    SENTENCE_DIM = 1
    BATCH_SIZE = 1
    
    # Train the model
    model = LSTMClassifier(EMBEDDING_DIM, HIDDEN_DIM, INPUT_SIZE, OUTPUT_DIM, SENTENCE_DIM, BATCH_SIZE)
    # The model already returns log-probabilities (log_softmax), so we use NLLLoss
    loss_function = nn.NLLLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1e-3)
    
    # See what the scores are before training
    # Note that element i,j of the output is the score for tag j for word i.
    # Here we don't need to train, so the code is wrapped in torch.no_grad()
    #with torch.no_grad():
    #    inputs = prepare_sequence(training_data[1][0], token_index)
    #    sentiment_scores = model(inputs)

    for epoch in range(EPOCH):  # a small number of epochs is enough here, it is toy data
        epoch_loss = 0
        count = 0
        for sentence, sentiment in training_data:
            # Step 1. Remember that Pytorch accumulates gradients.
            # We need to clear them out before each instance
            model.zero_grad()

            # Also, we need to clear out the hidden state of the LSTM,
            # detaching it from its history on the last instance.
            model.hidden = model.init_hidden()

            # Step 2. Get our inputs ready for the network, that is, turn them into
            # Tensors of word indices.
            sentence_in = prepare_sequence(sentence, token_index)

            # The target is the sentiment label of this review
            targets = torch.tensor(sentiment, dtype=torch.long)
            # Step 3. Run our forward pass.
            sentiment_scores = model(sentence_in)

            # Step 4. Compute the loss, gradients, and update the parameters by
            #  calling optimizer.step()
            loss = loss_function(sentiment_scores, targets)
            count += 1
            if count % 500 == 0:
                print(f'epoch: {epoch} iterations: {count} loss {loss.item()}')
            loss.backward()
            optimizer.step()
                   
            epoch_loss += loss.item()
            #print(epoch_loss)
        #print(f'{epoch_loss}')
        print(f'Epoch {epoch}: {epoch_loss / (len(training_data))}')
        
train()
# See what the scores are after training
#with torch.no_grad():
#    inputs = prepare_sequence(training_data[1][0], token_index)
#    sentiment_scores = model(inputs)
    #acc = get_accuracy(sentiment_scores, [1])
#    print(sentiment_scores)

### Word Embeddings

Word embeddings are texts converted into numbers, and there may be different numerical representations of the same text.

Why do we need Word Embeddings?

Many machine learning algorithms and almost all deep learning architectures are incapable of processing strings or plain text in raw form. They require numbers as inputs to perform any sort of job.

A word embedding format generally tries to map a word to a vector using a dictionary.

Example sentence: "I love programming"

A word in this sentence may be "love" or "programming" etc.

A dictionary may be the list of all unique words in the sentence. So, a dictionary may look like - ["I", "love", "programming"]

A vector representation of a word may be a one-hot encoded vector where 1 stands for the position where the word exists and 0 everywhere else. The vector representation of "love" in this format according to the above dictionary is [0,1,0] and of "programming" is [0,0,1].

This is a very simple method to represent a word in the vector form.
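
The one-hot encoding from the example sentence can be reproduced in a few lines. The helper name `one_hot` is hypothetical.

```python
def one_hot(word, dictionary):
    """Return a vector with 1 at the word's position in the dictionary and 0 elsewhere."""
    vector = [0] * len(dictionary)
    vector[dictionary.index(word)] = 1
    return vector

dictionary = ["I", "love", "programming"]
love_vec = one_hot("love", dictionary)
programming_vec = one_hot("programming", dictionary)
print(love_vec, programming_vec)
```

Each vector is as long as the dictionary, so this representation grows with the vocabulary and says nothing about how similar two words are.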

---------------------------------------------------------------------------------------------------------------------

### Word2Vec

The idea behind Word2Vec is pretty simple. If two words have very similar neighbors, i.e. the contexts in which they are used are about the same, then these words are probably quite similar in meaning or at least related. For example, the words shocked, appalled and astonished are usually used in a similar context.

Using this underlying assumption, you can use Word2Vec to surface similar concepts, find unrelated concepts, compute similarity between two words and more!
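
Similarity between two word vectors is commonly measured with the cosine of the angle between them. The sketch below uses invented 3-dimensional vectors purely for illustration; they are not taken from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for illustration only
vectors = {
    "shocked": [0.9, 0.1, 0.3],
    "appalled": [0.8, 0.2, 0.4],
    "banana": [-0.1, 0.9, -0.5],
}
similar = cosine_similarity(vectors["shocked"], vectors["appalled"])
unrelated = cosine_similarity(vectors["shocked"], vectors["banana"])
print(similar, unrelated)
```

With a trained gensim model, the same quantity is available as `model.wv.similarity(w1, w2)`.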

### Training the Word2Vec model

Our parameters in order to train our model:
- size: The size of the dense vector that represents each token or word (this parameter is called vector_size in gensim 4.0 and later). If you have very limited data, then size should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. A value of 100-150 has worked well for me.
- window: The maximum distance between the target word and its neighboring word. If a neighbor's position is further away than the maximum window width to the left or the right, it is not considered as being related to the target word. In theory, a smaller window should give you terms that are more related. If you have lots of data, then the window size should not matter too much, as long as it's a decent-sized window.
- min_count: Minimum frequency count of words. The model ignores words that do not satisfy the min_count. Extremely infrequent words are usually unimportant, so it's best to get rid of those. Unless your dataset is really tiny, this does not really affect the model.
- workers: The number of threads to use behind the scenes.

At first we create a new list of lists of our tokens

In [None]:
# Split the first 20000 cleaned reviews into lists of tokens
review_list = [review.split() for review in df_sentiment_data.reviewText[:20000]]

Let's train our Word2Vec model

In [None]:
# train model
# Note: gensim 4.0 renamed the size parameter to vector_size
model = Word2Vec(review_list, vector_size=50, min_count=1, window=5, workers=5)
model.train(review_list, total_examples=len(review_list), epochs=10)

# summarize the trained model
print(f'Our model is trained: {model}')

# summarize vocabulary
#words = model.wv.vocab

We save our model and load it again to check that it round-trips:

In [None]:
# save model
model.save('model.bin')

# load model
new_model = Word2Vec.load('model.bin')
#print (new_model)

Let's check how well our model has been trained

In [None]:
# find the words most similar to a given word
w1 = 'this'
print(model.wv.most_similar(positive=w1, topn=2))

In [None]:
# First, we build a lookup from every token in the data to its Word2Vec vector.
def token_ix(model, data):
    token_index={}
    for sample in data:
        for word in sample.split():
            if word in model.wv:
                token_index[word] = model.wv[word]
    return token_index

token_index = token_ix(model, df_sentiment_data['reviewText'][:20000])

In [None]:
def pad_seq(data_review, data_sentiment, num_words, token_index):
    """Build a (num_words, n_reviews, embedding_dim) tensor and a target tensor.

    Every review must contain at least num_words tokens.
    """
    embedding_dim = len(next(iter(token_index.values())))
    tt = torch.zeros((len(data_review), num_words, embedding_dim))
    target = torch.zeros(len(data_review), dtype=torch.long)
    # tokens that are missing from token_index are mapped to a zero vector
    zero = np.zeros(embedding_dim, dtype=np.float32)
    # keep only the first num_words tokens of each review
    tokens = data_review.apply(lambda x: x.split()[:num_words])
    for i in range(len(data_review)):
        vectors = np.array([token_index.get(word, zero) for word in tokens.iloc[i]])
        tt[i] = torch.tensor(vectors, dtype=torch.float)
        target[i] = int(data_sentiment.iloc[i])

    # nn.LSTM expects (seq_len, batch, embedding_dim) by default
    tt = tt.transpose(1, 0)
    return tt, target
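
Despite its name, pad_seq relies on every review being long enough rather than padding short ones. As an alternative (not what this notebook does), short reviews could be padded with a placeholder token that maps to a zero vector. The sketch below assumes a hypothetical `<pad>` token, 2-dimensional toy vectors and the helper names `pad_or_truncate` and `to_vectors`.

```python
def pad_or_truncate(tokens, num_words, pad_token='<pad>'):
    """Return exactly `num_words` tokens: cut long reviews, pad short ones."""
    clipped = tokens[:num_words]
    return clipped + [pad_token] * (num_words - len(clipped))

def to_vectors(tokens, token_index, dim):
    """Look up each token; padding and unknown words become zero vectors."""
    return [list(token_index.get(token, [0.0] * dim)) for token in tokens]

# Hypothetical 2-dimensional vectors, invented for illustration only.
toy_index = {'great': [1.0, 0.0], 'phone': [0.0, 1.0]}

tokens = pad_or_truncate('great phone'.split(), num_words=3)
print(tokens)
print(to_vectors(tokens, toy_index, dim=2))
```

With padding, no review has to be dropped for being too short, at the cost of feeding some all-zero time steps to the network.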

Train-Test-Split

In [None]:
# keep only reviews with more than 10 words; X and y must share the same index
long_enough = X.apply(lambda x: len(x.split()) > 10)
X_filter = X[long_enough].reset_index(drop=True)
y_filter = y[long_enough].reset_index(drop=True)

In [None]:
len(X_filter[18])

In [None]:
# take the first 20,000 filtered reviews; they are split into training and validation sets below
X_train = X_filter[:20000]
y_train = y_filter[:20000]

In [None]:
input_data, target_data = pad_seq(X_train, y_train, 5, token_index)

In [None]:
# axis 1 of input_data is the review axis (axis 0 is the word position)
input_data_train = input_data[:, :8000]
input_data_val = input_data[:, 8000:9000]
target_train = target_data[:8000]
target_val = target_data[8000:9000]

In [None]:
# Create the model
class LSTMClassifier(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, label_size, sentence_dim, batch_size):
        super(LSTMClassifier, self).__init__()
        self.hidden_dim = hidden_dim
        self.sentence_dim = sentence_dim
        self.batch_size = batch_size

        # The inputs are already Word2Vec vectors, so no nn.Embedding layer is
        # needed; vocab_size is kept only so that existing calls keep working.
        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        # The linear layer that maps from hidden state space to label_size
        self.linear = nn.Linear(hidden_dim, label_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # Before we've done anything, we don't have any hidden state.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        # the first is the hidden h
        # the second is the cell  c
        return (torch.zeros(1, self.batch_size, self.hidden_dim),
                torch.zeros(1, self.batch_size, self.hidden_dim))

    def forward(self, sentence):
        # sentence has shape (seq_len, batch, embedding_dim); when no initial
        # state is passed, nn.LSTM starts from an all-zero state.
        lstm_out, self.hidden = self.lstm(sentence)
        # classify each review from the LSTM output at the last time step
        y = self.linear(lstm_out[-1])
        log_probs = F.log_softmax(y, dim=1)
        return log_probs
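
To make the tensor shapes in forward concrete, the following sketch pushes a random batch through the same layer types. The sizes mirror the values used in this notebook (5 words per review, 50-dimensional vectors, hidden size 6, 2 labels), and the random input is only a stand-in for real Word2Vec vectors.

```python
import torch
import torch.nn as nn

seq_len, batch, embedding_dim, hidden_dim, n_labels = 5, 2, 50, 6, 2

lstm = nn.LSTM(embedding_dim, hidden_dim)       # expects (seq_len, batch, embedding_dim)
linear = nn.Linear(hidden_dim, n_labels)

x = torch.randn(seq_len, batch, embedding_dim)  # random stand-in for Word2Vec vectors
lstm_out, (h_n, c_n) = lstm(x)                  # initial state defaults to zeros
log_probs = torch.log_softmax(linear(lstm_out[-1]), dim=1)

print(lstm_out.shape)   # one hidden state per time step and review
print(h_n.shape)        # final hidden state per layer
print(log_probs.shape)  # one row of class log-probabilities per review
```

For a single-layer, unidirectional LSTM the last time step of lstm_out equals the final hidden state, which is why indexing with [-1] gives one summary vector per review.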

In [None]:
# Train for Word2Vec
def train():
    # EMBEDDING_DIM must match the size of the Word2Vec vectors (50 here).
    # Hidden sizes are usually more like 32 or 64; we keep HIDDEN_DIM small here.
    EMBEDDING_DIM = 50
    HIDDEN_DIM = 6
    INPUT_SIZE = len(token_index)
    OUTPUT_DIM = 2
    SENTENCE_DIM = 3
    BATCH_SIZE = 2
    EPOCH = 300
    SIZE = 50
    
    mean_val_loss = []
    mean_train_loss = []
    # Train the model
    model = LSTMClassifier(EMBEDDING_DIM, HIDDEN_DIM, INPUT_SIZE, OUTPUT_DIM, SENTENCE_DIM, BATCH_SIZE)
    # the model already returns log-probabilities, so NLLLoss is the matching loss
    loss_function = nn.NLLLoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    
    # See what the scores are before training
    # Note that element i,j of the output is the score for tag j for word i.
    # Here we don't need to train, so the code is wrapped in torch.no_grad()
    #with torch.no_grad():
    #    inputs = prepare_sequence(training_data[1][0], token_index)
    #    sentiment_scores = model(inputs)
    
    for epoch in range(EPOCH):  # again, normally you would NOT do 300 epochs, it is toy data
        #model = torch.load('lstm.pt')
        epoch_loss = 0
        count = 0
        val_loss = []
        train_loss = []
        # the reviews lie along axis 1; len() would return the sequence length
        sample_count = input_data_train.shape[1]
        sample_range = sample_count // BATCH_SIZE
        for sample in range(0,sample_range):
            # Step 1. Remember that Pytorch accumulates gradients.
            # We need to clear them out before each instance
            model.train()
            model.zero_grad()

            # Also, we need to clear out the hidden state of the LSTM,
            # detaching it from its history on the last instance.
            model.hidden = model.init_hidden()

            # Step 2. Get our inputs ready for the network. They are already
            # tensors of Word2Vec vectors, so we only slice out a mini-batch.
            input_batch = input_data_train[:, sample * BATCH_SIZE : (sample + 1) * BATCH_SIZE]
            target_batch = target_train[sample * BATCH_SIZE : (sample + 1) * BATCH_SIZE]
    
            # Step 3. Run our forward pass.
            sentiment_scores = model(input_batch)

            # Step 4. Compute the loss, gradients, and update the parameters by
            #  calling optimizer.step()
            loss = loss_function(sentiment_scores, target_batch)
            #count += 1
            #if count % 500 == 0:
                #print(f'epoch: {epoch} iterations: {count} loss {loss.item()}')
            loss.backward()
            optimizer.step()
                   
            #epoch_loss += loss.item()
            train_loss.append(loss.item())
        #torch.save(model, 'lstm.pt')
        #print(f'{epoch}: {epoch_loss}')
    
        # Every 10th epoch we validate our model
        if epoch % 10 == 0:
            model.eval()
            correct = 0
            total = 0
            with torch.no_grad():
                # the reviews lie along axis 1; len() would return the sequence length
                sample_count = input_data_val.shape[1]
                sample_range = sample_count // BATCH_SIZE
                for sample in range(0, sample_range):
                    input_batch = input_data_val[:, sample * BATCH_SIZE : (sample + 1) * BATCH_SIZE]
                    target_batch = target_val[sample * BATCH_SIZE : (sample + 1) * BATCH_SIZE]
                    sentiment_scores = model(input_batch)
                    loss = loss_function(sentiment_scores, target_batch)
                    val_loss.append(loss.item())
                
                    _, predicted = torch.max(sentiment_scores.data, 1)
                    correct += (predicted == target_batch).sum().item()
                    total += target_batch.size(0)
            
            mean_train_loss.append(np.mean(train_loss))
            mean_val_loss.append(np.mean(val_loss))
        
        
            print('Epoch:[{}/{}], train loss : {:.4f}, val loss : {:.4f}, val acc : {:.2f}%'\
              .format(epoch+1, EPOCH, np.mean(train_loss),\
                               np.mean(val_loss), 100*correct/total))
        
    # Plotting our train and val losses
    fig, ax = plt.subplots(figsize=(15,10))
    ax.plot(mean_train_loss, label='train')
    ax.plot(mean_val_loss, label='val')
    lines, labels = ax.get_legend_handles_labels()
    ax.legend(lines, labels, loc='best')
    plt.ylim(0,0.90)
    plt.show()
    
        
train()
# See what the scores are after training
#with torch.no_grad():
#    inputs = prepare_sequence(training_data[1][0], token_index)
#    sentiment_scores = model(inputs)
    #acc = get_accuracy(sentiment_scores, [1])
#    print(sentiment_scores)
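
The mini-batch slicing and the accuracy bookkeeping in the loop above can be sketched in plain Python. `batch_bounds` and `accuracy` are hypothetical helper names, not part of the notebook, and the log-probabilities are made up; the point is that the number of batches follows from the number of reviews, and that the predicted class is the index of the highest score in each row.

```python
def batch_bounds(n_samples, batch_size):
    """Yield (start, stop) index pairs covering only full mini-batches."""
    for start in range(0, n_samples - batch_size + 1, batch_size):
        yield start, start + batch_size

def accuracy(log_probs, targets):
    """Fraction of rows whose highest-scoring class equals the target."""
    predicted = [row.index(max(row)) for row in log_probs]
    correct = sum(p == t for p, t in zip(predicted, targets))
    return correct / len(targets)

# 7 samples with batch size 2 give three full batches; the last sample is dropped.
print(list(batch_bounds(n_samples=7, batch_size=2)))

# Made-up log-probabilities for three reviews and two classes.
print(accuracy([[-0.1, -2.3], [-1.6, -0.2], [-0.7, -0.7]], [0, 1, 1]))
```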