# Sentiment Classification for Yelp Restaurant Reviews using CNN in PyTorch
- For article [Click Here](https://towardsdatascience.com/sentiment-classification-using-cnn-in-pytorch-fba3c6840430)

1. Generate Word2Vec model and save it plus KeyedVectors (weights)
2. Create input tensor which has the index from Word2Vec model as the representer of each word plus pad token index for empty places

In [1]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [18]:
import pandas as pd

In [3]:
top_data_df = pd.read_csv('yelp_review.csv')
print("Columns in the original dataset:\n")
print(top_data_df.columns)
len(top_data_df)

Columns in the original dataset:

Index(['review_id', 'user_id', 'business_id', 'stars', 'date', 'text',
       'useful', 'funny', 'cool'],
      dtype='object')


5261668

### After the data is available, mapping from stars to sentiment is done and distribution for each sentiment is plotted. 

In [4]:
import matplotlib.pyplot as plt 
plt.style.use('dark_background')

print("Number of rows per star rating:")
print(top_data_df['stars'].value_counts())

# Function to map stars to sentiment
def map_sentiment(stars_received):
    if stars_received <= 2:
        return -1
    elif stars_received == 3:
        return 0
    else:
        return 1
# Mapping stars to sentiment into three categories
top_data_df['sentiment'] = [ map_sentiment(x) for x in top_data_df['stars']]
# Plotting the sentiment distribution
plt.figure()
pd.value_counts(top_data_df['sentiment']).plot.bar(title="Sentiment distribution in df")
plt.xlabel("Sentiment")
plt.ylabel("No. of rows in df")
plt.show()

Number of rows per star rating:
5    2253347
4    1223316
1     731363
3     615481
2     438161
Name: stars, dtype: int64


<Figure size 640x480 with 1 Axes>

### Positive : 1
### Negative: -1
### Neutral: 0

### let's create a smaller subsample of the original dataframe

In [57]:
top_data_df_small = pd.DataFrame.sample(top_data_df, frac = 0.01, random_state=42).reset_index()
top_data_df_small

Unnamed: 0,index,review_id,user_id,business_id,stars,date,text,useful,funny,cool,sentiment
0,4528116,m5jjU8KhAPmDSa5BIopIqw,F9vYcUknd9JY2lxsaEObQQ,T6ihfy4SYiF4PvuE6Y0VPA,3,2015-01-29,Airport Wendy's. You curbed my hunger. That wa...,1,2,2,0
1,3097267,pCURaqs8o9kCOl6fEVcsKA,H5d_nFqzwrREE-YduK2ABg,fPpO5751xJI78__uTU2q7g,5,2008-01-13,I stumbled across this store on my way to Nest...,19,6,13,1
2,2290314,2C8Gr_EX_gVTlJsobcey6w,xycmBfvZtDX9Bao9kwNQCw,sdE4iWulUozJXOxzQ5Bjhw,3,2016-05-22,Pizza was decent. Very disappointed in the del...,0,3,0,0
3,1146971,JWwPv1cIS0YfiQrKtcL9nA,dccateTjyakPfsWd5U0wsQ,K6fYrrTorlpXmqutRcrHzg,3,2010-01-15,My first time: the bartenders were so cute [an...,3,1,1,0
4,3184541,xW3umQlqu00xiu9UgkBDHw,OvpTIjhGpg2y2kklHa47NQ,Jt28TYWanzKrJYYr0Tf1MQ,3,2014-12-11,I was in las vegas staying at the Paris hotel ...,0,0,2,0
...,...,...,...,...,...,...,...,...,...,...,...
52612,5076157,b83swwcJgYYUCtuirXnx-A,YWkIeKGcuRFLwxcTBvn6Yg,JD0Wod1xotR3LckHm-Ql8A,4,2016-11-07,"I come here all the time for a quick lunch, al...",2,0,0,1
52613,1011115,xHwTKVMNrwExZ484CSjUOg,J67R37zomRDYB2_TbC6Lnw,Sg9R6OwNBq5Zf-kjiVBxuw,2,2014-04-24,Atmosphere is great and the patio is big but t...,0,0,0,-1
52614,1677893,nwfHpovi0tXXVuEO6nJbBg,a2MZowCokvZKbFizcVC75g,4foKEzZMx7pL1DWvqLXfcQ,5,2017-01-09,I have been going to Desert Valley Dental for ...,0,0,0,1
52615,5041208,hpIVJEHVxUHSVBNSTyA43g,CiXvlCLs-cksW1PcE4aJhw,aqONNC5onqX6EqHHUO1CJA,5,2015-08-26,Was looking for a good Italian restaurant and ...,0,0,0,1


In [58]:
#top_data_df_small.reset_index(inplace = True, drop = True)
top_data_df_small.drop(columns = 'index', inplace=True)
top_data_df_small

Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool,sentiment
0,m5jjU8KhAPmDSa5BIopIqw,F9vYcUknd9JY2lxsaEObQQ,T6ihfy4SYiF4PvuE6Y0VPA,3,2015-01-29,Airport Wendy's. You curbed my hunger. That wa...,1,2,2,0
1,pCURaqs8o9kCOl6fEVcsKA,H5d_nFqzwrREE-YduK2ABg,fPpO5751xJI78__uTU2q7g,5,2008-01-13,I stumbled across this store on my way to Nest...,19,6,13,1
2,2C8Gr_EX_gVTlJsobcey6w,xycmBfvZtDX9Bao9kwNQCw,sdE4iWulUozJXOxzQ5Bjhw,3,2016-05-22,Pizza was decent. Very disappointed in the del...,0,3,0,0
3,JWwPv1cIS0YfiQrKtcL9nA,dccateTjyakPfsWd5U0wsQ,K6fYrrTorlpXmqutRcrHzg,3,2010-01-15,My first time: the bartenders were so cute [an...,3,1,1,0
4,xW3umQlqu00xiu9UgkBDHw,OvpTIjhGpg2y2kklHa47NQ,Jt28TYWanzKrJYYr0Tf1MQ,3,2014-12-11,I was in las vegas staying at the Paris hotel ...,0,0,2,0
...,...,...,...,...,...,...,...,...,...,...
52612,b83swwcJgYYUCtuirXnx-A,YWkIeKGcuRFLwxcTBvn6Yg,JD0Wod1xotR3LckHm-Ql8A,4,2016-11-07,"I come here all the time for a quick lunch, al...",2,0,0,1
52613,xHwTKVMNrwExZ484CSjUOg,J67R37zomRDYB2_TbC6Lnw,Sg9R6OwNBq5Zf-kjiVBxuw,2,2014-04-24,Atmosphere is great and the patio is big but t...,0,0,0,-1
52614,nwfHpovi0tXXVuEO6nJbBg,a2MZowCokvZKbFizcVC75g,4foKEzZMx7pL1DWvqLXfcQ,5,2017-01-09,I have been going to Desert Valley Dental for ...,0,0,0,1
52615,hpIVJEHVxUHSVBNSTyA43g,CiXvlCLs-cksW1PcE4aJhw,aqONNC5onqX6EqHHUO1CJA,5,2015-08-26,Was looking for a good Italian restaurant and ...,0,0,0,1


### Preprocessing the data
- Using Texthero


In [59]:
import texthero as hero

In [60]:
top_data_df_small['stemmed_tokens'] = hero.clean(top_data_df_small.text)
top_data_df_small['stemmed_tokens'] = hero.tokenize(top_data_df_small.stemmed_tokens)

In [61]:
type(top_data_df_small['stemmed_tokens'][0])

list

In [62]:
top_data_df_small

Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool,sentiment,stemmed_tokens
0,m5jjU8KhAPmDSa5BIopIqw,F9vYcUknd9JY2lxsaEObQQ,T6ihfy4SYiF4PvuE6Y0VPA,3,2015-01-29,Airport Wendy's. You curbed my hunger. That wa...,1,2,2,0,"[airport, wendy, curbed, hunger, needed, fries..."
1,pCURaqs8o9kCOl6fEVcsKA,H5d_nFqzwrREE-YduK2ABg,fPpO5751xJI78__uTU2q7g,5,2008-01-13,I stumbled across this store on my way to Nest...,19,6,13,1,"[stumbled, across, store, way, nest, right, ne..."
2,2C8Gr_EX_gVTlJsobcey6w,xycmBfvZtDX9Bao9kwNQCw,sdE4iWulUozJXOxzQ5Bjhw,3,2016-05-22,Pizza was decent. Very disappointed in the del...,0,3,0,0,"[pizza, decent, disappointed, delivery, told, ..."
3,JWwPv1cIS0YfiQrKtcL9nA,dccateTjyakPfsWd5U0wsQ,K6fYrrTorlpXmqutRcrHzg,3,2010-01-15,My first time: the bartenders were so cute [an...,3,1,1,0,"[first, time, bartenders, cute, happy, second,..."
4,xW3umQlqu00xiu9UgkBDHw,OvpTIjhGpg2y2kklHa47NQ,Jt28TYWanzKrJYYr0Tf1MQ,3,2014-12-11,I was in las vegas staying at the Paris hotel ...,0,0,2,0,"[las, vegas, staying, paris, hotel, sisters, b..."
...,...,...,...,...,...,...,...,...,...,...,...
52612,b83swwcJgYYUCtuirXnx-A,YWkIeKGcuRFLwxcTBvn6Yg,JD0Wod1xotR3LckHm-Ql8A,4,2016-11-07,"I come here all the time for a quick lunch, al...",2,0,0,1,"[come, time, quick, lunch, meat, halal, fan, d..."
52613,xHwTKVMNrwExZ484CSjUOg,J67R37zomRDYB2_TbC6Lnw,Sg9R6OwNBq5Zf-kjiVBxuw,2,2014-04-24,Atmosphere is great and the patio is big but t...,0,0,0,-1,"[atmosphere, great, patio, big, service, terri..."
52614,nwfHpovi0tXXVuEO6nJbBg,a2MZowCokvZKbFizcVC75g,4foKEzZMx7pL1DWvqLXfcQ,5,2017-01-09,I have been going to Desert Valley Dental for ...,0,0,0,1,"[going, desert, valley, dental, years, love, g..."
52615,hpIVJEHVxUHSVBNSTyA43g,CiXvlCLs-cksW1PcE4aJhw,aqONNC5onqX6EqHHUO1CJA,5,2015-08-26,Was looking for a good Italian restaurant and ...,0,0,0,1,"[looking, good, italian, restaurant, definitel..."


### Splitting into Train and Test Sets

In [63]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(top_data_df_small, stratify = top_data_df_small.stars, test_size = 0.3, random_state = 42)

In [74]:
# set new indices for both dataframes and drop the previus indices
train = train.reset_index(drop=True)
test = test.reset_index(drop = True)
len(train), len(test)

(36831, 15786)

In [75]:
len(top_data_df_small)

52617

In [78]:
type(test['stemmed_tokens'][10])

list

In [183]:
top_data_df_small.to_pickle("./top_data_df_small.pkl")
train.to_pickle("./yelp_reviews_train.pkl")
test.to_pickle("./yelp_reviews_test.pkl")


## Start from here

In [19]:
# top_data_df_small = pd.read_csv('top_data_df_small')
top_data_df_small = pd.read_pickle("./top_data_df_small.pkl")
top_data_df_small.head()

Unnamed: 0,review_id,user_id,business_id,stars,date,text,useful,funny,cool,sentiment,stemmed_tokens
0,m5jjU8KhAPmDSa5BIopIqw,F9vYcUknd9JY2lxsaEObQQ,T6ihfy4SYiF4PvuE6Y0VPA,3,2015-01-29,Airport Wendy's. You curbed my hunger. That wa...,1,2,2,0,"[airport, wendy, curbed, hunger, needed, fries..."
1,pCURaqs8o9kCOl6fEVcsKA,H5d_nFqzwrREE-YduK2ABg,fPpO5751xJI78__uTU2q7g,5,2008-01-13,I stumbled across this store on my way to Nest...,19,6,13,1,"[stumbled, across, store, way, nest, right, ne..."
2,2C8Gr_EX_gVTlJsobcey6w,xycmBfvZtDX9Bao9kwNQCw,sdE4iWulUozJXOxzQ5Bjhw,3,2016-05-22,Pizza was decent. Very disappointed in the del...,0,3,0,0,"[pizza, decent, disappointed, delivery, told, ..."
3,JWwPv1cIS0YfiQrKtcL9nA,dccateTjyakPfsWd5U0wsQ,K6fYrrTorlpXmqutRcrHzg,3,2010-01-15,My first time: the bartenders were so cute [an...,3,1,1,0,"[first, time, bartenders, cute, happy, second,..."
4,xW3umQlqu00xiu9UgkBDHw,OvpTIjhGpg2y2kklHa47NQ,Jt28TYWanzKrJYYr0Tf1MQ,3,2014-12-11,I was in las vegas staying at the Paris hotel ...,0,0,2,0,"[las, vegas, staying, paris, hotel, sisters, b..."


In [20]:
len(top_data_df_small)

52617

In [21]:
train = pd.read_pickle("./yelp_reviews_train.pkl")
test = pd.read_pickle("./yelp_reviews_test.pkl")

In [22]:
len(train), len(test)

(36831, 15786)

### Convolutional Neural Network for Text Classification

- These layers are used to find patterns by sliding small kernel window over input. Instead of multiplying the filters on the small regions of the images, it slides through embedding vectors of few words as mentioned by window size. 
- For looking at sequences of word embeddings, the window has to look at multiple word embeddings in a sequence. They will be rectangular with size window_size * embedding_size. For example, in our case if window size is 3 and embedding size is 500, then kernel will be 3*500. This essentially represents n-grams in the model.
- The kernel weights (filter) are multiplied to word embeddings in pairs and summed up to get output values. As the network is being learned, these kernel weights are also being learned

![Conv Filter](https://miro.medium.com/max/626/1*A094Vuq3OiLFVD2ogxUS7Q.gif "chess")

### Input and output channels for Convolutional

 - We are feeding only one feature i.e. word embedding so the first parameter for conv2d is 1 (like grayscale images) and output_channels is total number of features which will be NUM_FILTERS.

### Maxpooing

 - Once we have the feature vector and it has extracted the significant features, it is enough to know that it exists in sentence like some positive phrase “great food” and it does not matter where it appears in the sentence. 
 
- Maxpooling is used to just get that information and discard the rest of it. For example, in the above animation the feature vector we had, after applying maxpooling, the max value will be chosen. In the above case it shows max when very and delicious are in the phrase, which makes sense.

In [23]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import torch
# Use cuda if present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device available for running: ")
print(device)

Device available for running: 
cuda


### Generating input and output tensor

- Input will be Word2Vec vectors trained with embedding size 500. As we want to keep the length of sentences the same, padding token will be used to fill extra remaining words when the size of sentence is less than the highest length sentence in the corpus.
- Let’s train the Word2Vec model by using following function

`!chmod 755 models`

In [24]:
%%capture
!pip install sent2vec
import nltk
nltk.download('punkt')
from scipy import spatial
from sent2vec.vectorizer import Vectorizer
from sent2vec.splitter import Splitter
from nltk.tokenize import word_tokenize

In [None]:
# list of list (list of all reveiews where each review is convetred into tokens) + 'pad' token as a seperate review
words_list = [x for x in top_data_df_small['stemmed_tokens']]
words_list.append('[pad]')
words_list

In [12]:
from gensim.models import Word2Vec
size = 500
window = 3
min_count = 1
workers = 3
sg = 1 # 1 for SkipGram otherwise, CBOW

w2v_model = Word2Vec(words_list, min_count = 1, size = 500, workers = 3, window = 3, sg = 1)
word2vec_file =  'models/' + 'word2vec_' + str(size) + '_PAD.model'
w2v_model.save(word2vec_file)

In [13]:
# total number fo words in w2vmodel
len(w2v_model.wv.vocab)

67996

In [14]:
w2v_model.wv['cake'][:50]

array([-0.31676582,  0.07629906,  0.12840565, -0.26190862,  0.08043399,
        0.07768615, -0.06731065, -0.09520342,  0.2874018 , -0.12392927,
       -0.30881032,  0.03571766, -0.1280366 , -0.17926586, -0.04265871,
       -0.04278198, -0.12346992,  0.0512777 , -0.00735853, -0.20282559,
       -0.0893134 ,  0.11177756, -0.02960602, -0.16376258,  0.07934507,
        0.01524296,  0.11616605,  0.10360915, -0.20276146,  0.18994613,
        0.02354078,  0.2692116 , -0.00881538,  0.06750507,  0.17672347,
        0.21561511, -0.07358584, -0.04221582,  0.06700817, -0.29893652,
       -0.09397522, -0.0261348 , -0.12915015,  0.22960377,  0.05646389,
        0.28817514,  0.20030904,  0.44291428, -0.06394979,  0.0939202 ],
      dtype=float32)

In [15]:
w2v_model.wv.most_similar('pizza',topn=5)

[('pizzas', 0.7784745097160339),
 ('crust', 0.7089414000511169),
 ('pepperoni', 0.6924998164176941),
 ('margherita', 0.6862131357192993),
 ('calzone', 0.6766183376312256)]

In [18]:
# index of word 'pad' in w2v_model
w2v_model.wv.vocab['pad'].index

1053

### Once the model is ready, we can create a function to generate input tensor.
- The below function creates a tensor of length 888 for each review where each word is replaced by the index of that word from Word2Vec model.
- These indices then would be converted to vectora of 500 elements to be given to the Neural Nets

In [25]:
# the next line apply len() to each rows of top_data_df_small.stemmed_tokens and find the max of these values
max_sen_len = top_data_df_small.stemmed_tokens.map(len).max() # 774 
padding_idx = w2v_model.wv.vocab['pad'].index # index of the pad token is 1053

def make_word2vec_vector_cnn(sentence):
    padded_X = [padding_idx for i in range(max_sen_len)] # first create a list of all pad tokens
    i = 0
    for word in sentence:
        if word not in w2v_model.wv.vocab:
            padded_X[i] = 0
            #print(word)
        else:
            padded_X[i] = w2v_model.wv.vocab[word].index
        i += 1
    return torch.tensor(padded_X, dtype=torch.long, device=device).view(1, -1)

In [17]:
print(max_sen_len, padding_idx)

774 1053


In [19]:
# example of tensor representation of a review
# 'I' & 'and' are not in the word2vec vocabulary so they are represented with 0
# 'ate' has the key = 434 in the vicabulary
# 1053 is the index of pad token to fill the empty space, makes all reviews of the same length
make_word2vec_vector_cnn(['I', 'ate', 'cake', 'pizza', 'and', 'hot', 'cafe'])

I
and


tensor([[   0,  434,  411,   76,    0,  123,  601, 1053, 1053, 1053, 1053, 1053,
         1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053,
         1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053,
         1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053,
         1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053,
         1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053,
         1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053,
         1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053,
         1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053,
         1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053,
         1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053,
         1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053, 1053,
         1053, 1053, 1053, 1

In [37]:
x = top_data_df_small['stemmed_tokens'].apply(make_word2vec_vector_cnn)
x.shape

(52617,)

In [38]:
x[0].shape

torch.Size([1, 774])

### Attention!
- Each review is an input to the CNN model
- All inputs should be of the same length so we create a tensor of lengthe equal to the longest review
- Then each is review converted to one of those tensors. For each word in a review, we insert the index of that word generated by Word2Vec vocab and fill the empty places with the index of pad (words which are not in that specific review)

### For creating the output tensor, mapping from label to positive values has to be done. 
- Currently we had -1 for negative, this is not possible in neural network. 
- Three neurons in the output layer will give probabilities for each label so we just need mapping to positive numbers

In [26]:
# Function to get the output tensor
def make_target(label):
    if label == -1: # negative
        return torch.tensor([0], dtype=torch.long, device=device)
    elif label == 0: # neural
        return torch.tensor([1], dtype=torch.long, device=device)
    else: # positive
        return torch.tensor([2], dtype=torch.long, device=device)


In [27]:
from torch.utils.data import Dataset

class make_dataset(Dataset):
    def __init__(self, dataframe):
        if type(dataframe) == str: # when input is the name of a csv file
            df = pd.read_csv(dataframe)
        else: # when a dataframe is directly given
            df = dataframe
            
        #X = df['stemmed_tokens'].values
        #y = df.Class.values
         
        self.X = df['stemmed_tokens'].apply(make_word2vec_vector_cnn)
        #print(X.shape)
        #self.X = torch.tensor(X, dtype = torch.float32) # these are decimals
        
        self.y = df['sentiment'].apply(make_target) # returns 0, 1 or 2 as label 
        #self.y = torch.tensor(y, dtype = torch.float32) # these are 0 or 1 floats
        
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):    
        return self.X[idx], self.y[idx]

In [28]:
train_data = make_dataset(train)
test_data = make_dataset(test)

In [29]:
len(train_data), len(test_data)

(36831, 15786)

In [30]:
train_data[17][0]

tensor([[  251,    95,   601,  3594,    86,  3016,   294,  5008,  2586,    60,
             1,  3682,   601,  3594,   478,   651,   294,   457,     1,   193,
           363,  1367,  1484, 10965,  1991,   866,     4,  1896,   560,  1367,
          4100,    11,  5186,  1009,    27,   349,   141, 12502, 47293,    11,
             0,  1490,  1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,
          1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,
          1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,
          1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,
          1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,
          1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,
          1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,
          1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,  1053,
          1053,  1053,  1053,  1053,  1053,  1053,  

In [31]:
train_data[32][1]

tensor([2], device='cuda:0')

In [32]:
train_loader = torch.utils.data.DataLoader(train_data, batch_size = 32, drop_last=True)
test_loader = torch.utils.data.DataLoader(test_data, batch_size = 32, drop_last=True )

In [33]:
data_iter = iter(train_loader)

predictors, target = next(data_iter)
print(predictors.squeeze(1).shape, target.shape) # each train batch has 32 elements in it

torch.Size([32, 774]) torch.Size([32, 1])


In [34]:
data_iter = iter(train_loader)

predictors, target = next(data_iter)
print(predictors.squeeze(1).shape, target) # each train batch has 2 elements in it

torch.Size([32, 774]) tensor([[2],
        [1],
        [2],
        [0],
        [2],
        [1],
        [2],
        [2],
        [2],
        [1],
        [2],
        [2],
        [1],
        [1],
        [0],
        [0],
        [2],
        [0],
        [0],
        [0],
        [0],
        [1],
        [0],
        [2],
        [2],
        [1],
        [0],
        [1],
        [2],
        [2],
        [2],
        [2]], device='cuda:0')



<img src="images/inputs.jpg" width = 700>


<img src="images/cnn for text.jpg" width = 700>

In [35]:
EMBEDDING_SIZE = 500
NUM_FILTERS = 10
import gensim

class CnnTextClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, window_sizes=(1,2,3,5)):
        super(CnnTextClassifier, self).__init__()
        w2v_model = gensim.models.KeyedVectors.load('models/' + 'word2vec_500_PAD.model') # load the saved model
        weights = w2v_model.wv # get the KeydVectors (keyvectors have keys as index of each word)
        
        # With pretrained embeddings
        tensor_weights = torch.FloatTensor(weights.vectors)
        self.embedding = nn.Embedding.from_pretrained(tensor_weights, padding_idx=w2v_model.wv.vocab['pad'].index)

        # for each window size, 1 conv layer
        self.convs = nn.ModuleList([ 
                               nn.Conv2d(1, NUM_FILTERS, [window_size, EMBEDDING_SIZE], padding=(window_size - 1, 0))
                               for window_size in window_sizes
        ])

        self.fc = nn.Linear(NUM_FILTERS * len(window_sizes), num_classes)

    def forward(self, x):
        #print("model: ",x.shape) = [32, 774]
        # x represents one row of dataframe. It is a vector of 888 elements where each element is an index of a word 
        x = self.embedding(x) # [Batch, Sequence_length, Embedding] = [32, 774, 500])
        #print("x after embedding: ", x.shape)
        # Apply a convolution + max_pool layer for each window size
        x = torch.unsqueeze(x, 1) #  [32, 1, 774, 500]) like a grayscale image
        xs = []
        for conv in self.convs: # we have 4 conv layers with different winodow sizes
            x2 = torch.tanh(conv(x))
            #print("x2 after conv: ", x2.shape) # [32, 10, 774, 1]>>[32, 10, 775, 1]>>[32, 10, 776, 1]>>[32, 10, 778, 1]
            x2 = torch.squeeze(x2, -1)
            #print("x2 after squeeze: ", x2.shape)# [32, 10, 774] >> [32, 10, 775] >> [32, 10, 776] >> [32, 10, 778]
            x2 = F.max_pool1d(x2, x2.size(2)) # keyps only the highets value of each feature vector (detected features for that window size)
            #print("x2 after maxpool: ", x2.shape) # [32, 10, 1]
            xs.append(x2) # combines all these matricies to from one final matrix of all detected features
            #print("xs: ", len(xs)) # 4, a list of 4 matricies eaxch is [32, 10, 1]
        x = torch.cat(xs, 2) # concatanate all matricies in xs on the last dimension(1) to form the final feature matrix
        #print("x before flatten: ", x.shape) # [32, 10, 4])
        # FC, x.size(0) is the batch_size
        x = x.view(x.size(0), -1) # flatten the feature matrix into a vector
        #print("x after flatten: ", x.shape) # [32, 40]
        logits = self.fc(x)

        return logits


In [36]:
### criterion = nn.NLLLoss()
NUM_CLASSES = 3
VOCAB_SIZE = len(w2v_model.wv.vocab)

cnn_model = CnnTextClassifier(vocab_size=VOCAB_SIZE, num_classes=NUM_CLASSES)
cnn_model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(cnn_model.parameters(), lr=0.001)


In [43]:
import torch
def check_accuracy(data_loader, model):
    model.to(device)
    num_correct = 0
    num_samples = 0
    with torch.no_grad():
        for x, y in data_loader:
            x = x.squeeze(1).to(device) # converts froom [32, 1, 774] to [32, 774]
            y = y.squeeze(1).to(device)
            #x = x.reshape(s.shape[0], -1)
            
            scores = model(x)
            print(scores)
            _, predictions = scores.max(1) # _ is the max value, predictions is the max_indx
            # predictions = torch.round(scores) # for BINARY classification
            print(predictions,_, y)
            if predictions == y:
                print(predictions, y)
            num_correct += (predictions == y).sum()
            num_samples += predictions.size(0)
            print(num_correct, num_samples)
                        
    return  round(float(num_correct)/float(num_samples) * 100, 2)

In [44]:
check_accuracy(train_loader, cnn_model)

tensor([[-3.2001, -0.4812,  3.9140],
        [-1.0135,  0.6460,  0.7117],
        [-1.8760, -0.6591,  2.7160],
        [ 1.0975,  0.1144,  0.1381],
        [-2.5012, -0.0941,  3.0032],
        [-0.3712,  0.2576,  1.4193],
        [-2.4190, -0.8888,  3.4356],
        [-2.4882, -0.5533,  3.7098],
        [-1.0818, -1.0533,  2.5664],
        [-0.2050,  0.0483,  0.4834],
        [-2.6715, -0.2223,  3.3766],
        [-1.6949,  0.3704,  1.0109],
        [-1.3840,  0.1668,  1.6035],
        [ 0.8042, -0.0207, -0.1656],
        [ 2.3033, -0.0304, -1.0542],
        [ 0.1378,  0.7062, -0.7309],
        [-1.7849, -0.4068,  2.3639],
        [ 1.4658,  0.1157, -0.4083],
        [ 2.9122,  0.3163, -1.2804],
        [ 2.1535,  0.8506, -1.8336],
        [ 3.1021, -0.5603, -1.0137],
        [-1.0753,  0.2473,  0.9598],
        [ 0.7472,  0.2129, -0.1648],
        [ 0.0480, -0.3185,  0.7240],
        [-2.8538, -0.4893,  3.6820],
        [ 0.2912,  0.0207,  0.3264],
        [ 2.8496, -0.0970, -1.3806],
 

RuntimeError: bool value of Tensor with more than one value is ambiguous

In [246]:
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.enabled = True
torch.cuda.empty_cache()

In [None]:
from tqdm import trange

print("Begin training.")
EPOCHS = 10
for e in trange(1, EPOCHS+1):
    num_correct = 0
    num_samples = 0
    # TRAINING
    train_epoch_loss = 0
    train_epoch_acc = 0
    cnn_model.train()
    for inputs, labels in train_loader:
        #print("inp: ", inputs.shape, labels.shape)
        inputs = inputs.squeeze(1).to(device) # converts [32, 1, 774] to [32, 774]
        labels = labels.squeeze(1).to(device) # converts [32, 1] to [32]
        #print("inp: ", inputs.shape, labels.shape)
        #X_train_batch, y_train_batch = X_train_batch.to(device), y_train_batch.to(device)
        optimizer.zero_grad()
        
        predictions = cnn_model(inputs)
        
        train_loss = criterion(predictions, labels)
        #train_acc = multi_acc(predictions, labels)
        #train_acc = check_accuracy(train_loader, cnn_model)
        _, preds = predictions.max(1) # _ is the max value, predictions is the max_indx
        correct = (preds == labels).float()
        #num_samples += preds.size(0)
        train_acc = correct.sum() / len(correct) # calculate the accutace for each batch in train_iterator
        assert len(correct) ==  32
        #round(float(num_correct)/float(num_samples) * 100, 2)
        
        train_loss.backward()
        optimizer.step()
        
        train_epoch_loss += train_loss.item()
        train_epoch_acc += train_acc.item()
        
    
    # VALIDATION    
    with torch.no_grad():
        
        num_correct = 0
        num_samples = 0
        val_epoch_loss = 0
        val_epoch_acc = 0

        cnn_model.eval()
        for data, targets in test_loader:
         #X_val_batch, y_val_batch = X_val_batch.to(device), y_val_batch.to(device)
            data = data.squeeze(1).to(device) # converts froom [32, 1, 774] to [32, 774]
            targets = targets.squeeze(1).to(device)
            predictions = cnn_model(data)

            val_loss = criterion(predictions, targets)

            _, preds = predictions.max(1)
            corrects = (preds == targets).float()
            test_acc = corrects.sum() / len(corrects) 
            #test_acc = round(float(num_correct)/float(num_samples) * 100, 2)

            val_epoch_loss += val_loss.item()
            val_epoch_acc += test_acc.item()
                              

    print(f"""Epoch {e+0:03}: 
| Train Loss: {train_epoch_loss/len(train_loader):.5f} | Val Loss: {val_epoch_loss/len(test_loader):.5f} 
| Train Acc: {train_epoch_acc/len(train_loader):.5f}   | Val Acc: {val_epoch_acc/len(test_loader):.5f}""") 

### Save and Load the model

In [250]:
torch.save(cnn_model.state_dict(), "cnn_model.pkl")

CnnTextClassifier(
  (embedding): Embedding(67996, 500, padding_idx=1053)
  (convs): ModuleList(
    (0): Conv2d(1, 10, kernel_size=[1, 500], stride=(1, 1))
    (1): Conv2d(1, 10, kernel_size=[2, 500], stride=(1, 1), padding=(1, 0))
    (2): Conv2d(1, 10, kernel_size=[3, 500], stride=(1, 1), padding=(2, 0))
    (3): Conv2d(1, 10, kernel_size=[5, 500], stride=(1, 1), padding=(4, 0))
  )
  (fc): Linear(in_features=40, out_features=3, bias=True)
)


In [10]:
import gensim
w2v_model = gensim.models.KeyedVectors.load('models/' + 'word2vec_500_PAD.model') 
VOCAB_SIZE = len(w2v_model.wv.vocab)
NUM_CLASSES = 3
cnn_model = CnnTextClassifier(VOCAB_SIZE, NUM_CLASSES)

cnn_model.load_state_dict(torch.load('cnn_model.pkl'))
print(cnn_model)

CnnTextClassifier(
  (embedding): Embedding(67996, 500, padding_idx=1053)
  (convs): ModuleList(
    (0): Conv2d(1, 10, kernel_size=[1, 500], stride=(1, 1))
    (1): Conv2d(1, 10, kernel_size=[2, 500], stride=(1, 1), padding=(1, 0))
    (2): Conv2d(1, 10, kernel_size=[3, 500], stride=(1, 1), padding=(2, 0))
    (3): Conv2d(1, 10, kernel_size=[5, 500], stride=(1, 1), padding=(4, 0))
  )
  (fc): Linear(in_features=40, out_features=3, bias=True)
)


### Test a Review

In [553]:
import texthero as hero

def test_review(rev):
    rev = pd.Series(rev)
    rev = hero.clean(rev)
    rev = hero.tokenize(rev)
    rev = rev.to_list()[0]
    rev = make_word2vec_vector_cnn(rev)
    _, pred = cnn_model(rev).max(1)
    print('negative') if pred.item() == 0 else print('neural') if pred.item() == 1 else print('positive')


In [555]:
import random
for x in range(10):
    n = random.randint(1,10000)
    print (top_data_df_small.text[n], test_review(top_data_df_small.text[n]))
    print('----------------------------------------------------------------------------------------------------')


negative
Dice is filthy. Crude. Rude. Sexist.
 Downright nasty. But there is some twinkles of genius in his material, delivery, & stage persona. Grew up during his heyday but was never a fan. I am now. My wife and I laughed a lot. His takes on the nuances of sex were hilarious. His interaction with audience upfront was a pro at work.

Eleanor, who opens for him is a piece of work. Also 10 times cruder than Dice. But very talented.

We paid $37 per tick via travelzoo. With that said, I'd say Dice Clay was the greatest entertainment value I've ever enjoy in Vegas. Not for most.. None
----------------------------------------------------------------------------------------------------
negative
The place is nice BUT. We arrived on Thursday afternoon and fount dirty towels, gum wrappers, and coke bottles laying out side the doors of the suites next to us. They were still there when we left on Sunday. On Friday I bought a beer at the bar in the lobby and was walking into the pool area with it