# The data

## About the data
The goal of this analysis is to learn vector representations of words from raw text. Word2Vec does not need labels, so it does not matter whether the text is labelled. The data set consists of **50000 IMDB movie reviews**, each with a review ID and no sentiment label. We'll use this unlabelled data to train a model, which can then be applied to other data.

Please visit the site to download the data
https://www.kaggle.com/c/word2vec-nlp-tutorial/data

In [0]:
import numpy as np
import pandas as pd

## Import the data

The file was uploaded from the local machine to Colab using the commands below.

In [0]:
from google.colab import files
files.upload()

Saving unlabeledTrainData.tsv to unlabeledTrainData.tsv


In [0]:
df=pd.read_csv("unlabeledTrainData.tsv",delimiter="\t",quoting=3,header=0)

In [0]:
df.head()

Unnamed: 0,id,review
0,"""9999_0""","""Watching Time Chasers, it obvious that it was..."
1,"""45057_0""","""I saw this film about 20 years ago and rememb..."
2,"""15561_0""","""Minor Spoilers<br /><br />In New York, Joan B..."
3,"""7161_0""","""I went to see this film with a great deal of ..."
4,"""43971_0""","""Yes, I agree with everyone on this site this ..."


In [0]:
import re,string

##  Data Cleaning
Going through the reviews, we found punctuation in many of them. Punctuation contributes nothing to our analysis. Worse, punctuation marks attached to words are treated as distinct tokens and distort the meaning of other words. This is why the data needs to be cleaned before we move on to the core analysis.

In [0]:
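def clean_string(text):                                                           # Clean a single raw review
  try:
    # Intended to drop lines that start with a URL; note the pattern only matches when "<>" follows "//"
    text=re.sub(r'^https?:\/\/<>.*[\r\n]*','',text,flags=re.MULTILINE)
    text=re.sub(r"[^A-Za-z]"," ",text)                                            # Replace every non-letter (digits, punctuation) with a space
    words=text.strip().lower().split()                                            # Lowercase and split on whitespace
    return " ".join(words)                                                        # Re-join the words with single spaces
  except TypeError:                                                               # Non-string input such as NaN
    return " "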
  

Above we defined the function **clean_string**. We now apply it to the raw review column and store the cleaned reviews in a new column (**clean_review**).

In [0]:
df['clean_review']=df.review.apply(clean_string)                                  # Finally cleaned format is applied on the reviews


In [0]:
print ("No.of samples \n:",(len(df)))
df.head()

No.of samples 
: 50000


Unnamed: 0,id,review,clean_review
0,"""9999_0""","""Watching Time Chasers, it obvious that it was...",watching time chasers it obvious that it was m...
1,"""45057_0""","""I saw this film about 20 years ago and rememb...",i saw this film about years ago and remember i...
2,"""15561_0""","""Minor Spoilers<br /><br />In New York, Joan B...",minor spoilers br br in new york joan barnard ...
3,"""7161_0""","""I went to see this film with a great deal of ...",i went to see this film with a great deal of e...
4,"""43971_0""","""Yes, I agree with everyone on this site this ...",yes i agree with everyone on this site this mo...


Looking at the data now, the **clean_review** column no longer contains any punctuation.
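
A quick way to confirm this is to search the cleaned column for any character other than a lowercase letter or a space. The sketch below uses a small made-up DataFrame in place of `df`. The helper `strip_non_letters` is a hypothetical name that mirrors only the two core steps of `clean_string`.

```python
import re
import pandas as pd

def strip_non_letters(text):
    # Replace every non-letter with a space, then lowercase and collapse whitespace.
    return " ".join(re.sub(r"[^A-Za-z]", " ", text).lower().split())

# Toy stand-in for the review column (not the real data).
toy = pd.DataFrame({"review": ["Great film!!! 10/10<br />", "Meh... (not for me)"]})
toy["clean_review"] = toy["review"].apply(strip_non_letters)

# True where a cleaned review still contains something other than a-z or a space.
has_leftovers = toy["clean_review"].str.contains(r"[^a-z ]", regex=True)
print(toy["clean_review"].tolist())
print("rows with leftovers:", int(has_leftovers.sum()))
```

Note that HTML tags such as `<br />` are not removed as a unit. Only the angle brackets and slash are stripped, so the token `br` survives, as in the cleaned review at index 2 above.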

# Word2Vec with Gensim (the Word2Vec toolkit)

Gensim is an open source Python library for natural language processing, with a focus on topic modeling. Gensim was developed and is maintained by the Czech natural language processing researcher **Radim Řehůřek** and his company RaRe Technologies.

It is not an everything-including-the-kitchen-sink NLP research library (like NLTK); instead, Gensim is a mature, focused, and efficient suite of NLP tools for topic modeling. Most notably for this tutorial, it supports an implementation of the **Word2Vec word embedding** for learning new word vectors from text.

It also provides tools for loading pre-trained word embeddings in a few formats and for using and querying a loaded embedding.


### Objective

In this tutorial, we dig a little "deeper" into sentiment analysis. Google's Word2Vec is a deep-learning inspired method that focuses on the meaning of words. Word2Vec attempts to understand meaning and **semantic relationships** among words. It works in a way that is similar to deep approaches, such as recurrent neural nets or deep neural nets, but is computationally more efficient. This tutorial focuses on Word2Vec for sentiment analysis.

**Install and import gensim every time you start a new Google Colab session.**

In [0]:
!pip install gensim --quiet                                      

In [0]:
import gensim

**Since we are going to work with words, we need to split each review into word tokens.**

In [0]:
Document=[]
for doc in df['clean_review']:
  Document.append(doc.split(' '))                             

In [0]:
len(Document)

50000

**Let us explore split reviews**

In [0]:
Document[10][6:13]                                                                # Tokens at positions 6 to 12 of the document at index 10

['movie', 'i', 'am', 'not', 'sure', 'whether', 'i']

In [0]:
print(len(Document[10]))                                                          # Length of the document at index 10: it has 524 words
print(Document[10])

524
['after', 'reading', 'the', 'comments', 'for', 'this', 'movie', 'i', 'am', 'not', 'sure', 'whether', 'i', 'should', 'be', 'angry', 'sad', 'or', 'sickened', 'seeing', 'comments', 'typical', 'of', 'people', 'who', 'a', 'know', 'absolutely', 'nothing', 'about', 'the', 'military', 'or', 'b', 'who', 'base', 'everything', 'they', 'think', 'they', 'know', 'on', 'movies', 'like', 'this', 'or', 'on', 'cnn', 'reports', 'about', 'abu', 'gharib', 'makes', 'me', 'wonder', 'about', 'the', 'state', 'of', 'intellectual', 'stimulation', 'in', 'the', 'world', 'br', 'br', 'at', 'the', 'time', 'i', 'type', 'this', 'the', 'number', 'of', 'people', 'in', 'the', 'us', 'military', 'million', 'on', 'active', 'duty', 'with', 'another', 'almost', 'in', 'the', 'guard', 'and', 'reserves', 'for', 'a', 'total', 'of', 'roughly', 'million', 'br', 'br', 'the', 'number', 'of', 'people', 'indicted', 'for', 'abuses', 'at', 'at', 'abu', 'gharib', 'currently', 'less', 'than', 'br', 'br', 'that', 'makes', 'the', 'total',

In [0]:
import logging                                                                    # Enable logging to monitor the progress of Word2Vec training

In [0]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

model=gensim.models.Word2Vec(Document,                                           # List of tokenised reviews
                          min_count=10,                                          # Ignore words that appear fewer than 10 times in the corpus
                          workers=4,                                             # Number of worker threads (faster training on multicore machines)
                          size=50,                                               # Each word is represented by 50 numbers, i.e. the hidden layer has 50 neurons (gensim 4.x renamed this argument to vector_size)
                          window=5)                                              # Up to 5 neighbouring words on either side of a word

2019-06-30 04:49:08,792 : INFO : collecting all words and their counts
2019-06-30 04:49:08,793 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-06-30 04:49:09,279 : INFO : PROGRESS: at sentence #10000, processed 2399440 words, keeping 51654 word types
2019-06-30 04:49:09,749 : INFO : PROGRESS: at sentence #20000, processed 4835846 words, keeping 69077 word types
2019-06-30 04:49:10,225 : INFO : PROGRESS: at sentence #30000, processed 7267977 words, keeping 81515 word types
2019-06-30 04:49:10,707 : INFO : PROGRESS: at sentence #40000, processed 9669772 words, keeping 91685 word types
2019-06-30 04:49:11,172 : INFO : collected 100479 word types from a corpus of 12084660 raw words and 50000 sentences
2019-06-30 04:49:11,173 : INFO : Loading a fresh vocabulary
2019-06-30 04:49:11,640 : INFO : effective_min_count=10 retains 28322 unique words (28% of original 100479, drops 72157)
2019-06-30 04:49:11,641 : INFO : effective_min_count=10 leaves 11910457 word cor

**Note that after training Word2Vec on the clean_review tokens with the arguments above, the vocabulary contains 28322 words.**
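
The log above reports 100479 word types before pruning and 28322 after applying `min_count=10`. The frequency threshold can be illustrated with a plain word count. This is only a sketch of the idea on a made-up corpus, not gensim's own code, and the names `toy_docs` and `kept` are illustrative.

```python
from collections import Counter

# Toy tokenised corpus (made up for illustration).
toy_docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie", "good", "cast"],
    ["good", "movie"],
]
min_count = 3  # the notebook uses 10 on the full corpus

# Count every token across all documents, then keep the frequent ones.
counts = Counter(token for doc in toy_docs for token in doc)
kept = {word for word, n in counts.items() if n >= min_count}

print(counts)
print("kept:", sorted(kept))
```

Words below the threshold get no vector at all, so looking them up in the trained model raises a `KeyError`.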

In [0]:
print(len(model.wv.vocab))                                                        # The vocabulary now contains 28322 unique words

28322
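
The `wv.vocab` attribute used here belongs to the gensim 3.x API; gensim 4.x replaced it with `wv.key_to_index`. A small helper that works with either version might look like the sketch below. The function name `vocab_size` is hypothetical, and the two `SimpleNamespace` objects are stand-ins used only to exercise it without a trained model.

```python
from types import SimpleNamespace

def vocab_size(wv):
    # gensim >= 4.0 exposes `key_to_index`; gensim 3.x exposes `vocab`.
    if hasattr(wv, "key_to_index"):
        return len(wv.key_to_index)
    return len(wv.vocab)

# Stand-ins for the two API generations (not real gensim objects).
old_style = SimpleNamespace(vocab={"good": object(), "movie": object()})
new_style = SimpleNamespace(key_to_index={"good": 0, "movie": 1, "plot": 2})
print(vocab_size(old_style), vocab_size(new_style))
```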


**Let's check the dimension of a vector, i.e. how many numbers represent each word**

In [0]:
print(model.wv.vector_size)                                                       # Each word is a vector of 50 numbers, the size we set when training

50


In [0]:
model.wv.vectors.shape                                                            # Shape of the embedding matrix: (vocabulary size, vector size)

(28322, 50)

### Let's explore some interesting results of word2vec experiment



In [0]:
model.wv.most_similar("beautiful")                                                # 10 words most similar to "beautiful", scored by cosine similarity, which ranges from -1 to 1.
                                                                                  # Values near 1 mean very similar; values near 0 or below mean unrelated or dissimilar.

2019-06-30 04:50:32,822 : INFO : precomputing L2-norms of word weight vectors


[('gorgeous', 0.8662823438644409),
 ('lovely', 0.8383572101593018),
 ('stunning', 0.8253401517868042),
 ('wonderful', 0.7457817196846008),
 ('haunting', 0.7313393354415894),
 ('breathtaking', 0.7230619788169861),
 ('delicious', 0.7071415781974792),
 ('delightful', 0.6918222904205322),
 ('exquisite', 0.6858062148094177),
 ('fabulous', 0.6851967573165894)]
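
The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that measure on made-up vectors (the function name `cosine_similarity` is ours, not a gensim call):

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product of the two vectors divided by the product of their lengths.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))    # same direction
print(cosine_similarity(a, -a))       # opposite direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal
```

Because only the direction of the vectors matters, scaling a vector does not change its similarity to any other vector.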

In [0]:
model.wv.most_similar("princess")                                                  # 10 most similar words with their cosine similarity scores



[('widow', 0.8457926511764526),
 ('prince', 0.8364083170890808),
 ('maid', 0.8170045614242554),
 ('nurse', 0.8026759028434753),
 ('queen', 0.7945679426193237),
 ('dakota', 0.7875654101371765),
 ('alice', 0.7805353403091431),
 ('pianist', 0.7736424207687378),
 ('maria', 0.7730264067649841),
 ('servant', 0.7711195945739746)]

In [0]:
model.wv.doesnt_match("she talked to me in the evening publicly".split())         # publicly does not match in the sentence given



'publicly'
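
The idea behind `doesnt_match` can be sketched as follows: normalise the word vectors, average them, and return the word least similar to that average. The code below is a simplified illustration on made-up 2-dimensional vectors, not gensim's implementation, and `odd_one_out` is a hypothetical name.

```python
import numpy as np

def odd_one_out(words, vectors):
    # Normalise each vector to unit length.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    # Average the unit vectors and normalise the mean as well.
    mean = unit.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    # The word with the lowest cosine similarity to the mean is the outlier.
    sims = unit @ mean
    return words[int(np.argmin(sims))]

toy_words = ["breakfast", "lunch", "dinner", "car"]
toy_vectors = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [1.0, 0.0],
    [-0.2, 1.0],
])
print(odd_one_out(toy_words, toy_vectors))
```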

Below, the word **right** is represented by a dense 50-dimensional vector.

In [0]:
model.wv["right"]                                                                  # The word "right" is represented by a vector of 50 numbers
                                                                                   # These 50 numbers are the learned weights connecting the word to the 50 hidden-layer neurons

array([ 1.084792  ,  0.9277433 ,  1.2515309 ,  1.1081741 , -1.0420096 ,
        1.9343015 ,  1.6928643 ,  1.3139058 , -0.55044943,  1.7915639 ,
        1.3197316 , -0.64483315,  0.45559508,  0.80886555, -2.484303  ,
        0.17833237,  1.3680307 ,  1.3672882 , -2.1542923 , -0.12052315,
       -0.02813105,  0.3288807 ,  3.7106562 ,  0.13608542, -0.5899354 ,
       -0.06722905, -2.050071  , -1.3693739 ,  0.18830606,  1.7286797 ,
       -1.0732532 , -0.8536867 ,  1.1823726 ,  1.9744762 ,  0.42149726,
        0.8830604 , -0.06469347,  2.1468382 , -1.2366889 , -2.5028865 ,
       -2.1869085 ,  0.43791404, -0.16663122, -1.2541647 , -2.5873227 ,
        2.2192307 ,  0.88265616, -1.2270586 , -0.9617601 , -0.36817485],
      dtype=float32)

In [0]:
model.wv['great']

array([-0.6426506 ,  0.05484062, -1.2672698 ,  0.0847162 ,  5.371844  ,
        2.1987514 ,  1.7663705 ,  0.5578455 ,  1.0657201 ,  5.6036015 ,
       -0.23015527, -2.7573566 ,  0.13810502, -0.2886024 , -2.2121024 ,
        0.6800541 ,  1.4409364 ,  1.2620891 , -0.64830357,  1.0953355 ,
        1.7287182 ,  2.8370798 ,  2.4627166 ,  0.42812717,  0.3164176 ,
        2.7381628 , -1.4414704 ,  1.9006734 ,  0.13591126,  1.1135874 ,
       -0.5841767 , -2.1699212 , -0.74955994,  1.3712415 , -1.2692451 ,
        2.9015708 , -0.46379066,  1.2144006 , -1.7756954 , -2.5923414 ,
       -0.12859172, -1.050146  , -2.5589857 , -0.4764793 ,  0.5757201 ,
        2.653173  , -1.0175519 ,  1.3231046 , -0.6623386 , -2.3848255 ],
      dtype=float32)


## Saving the model

In [0]:
model.save("word2vec movie-50")                                                    # We save this model for further use.
                                                                                   # Google also provides pre-trained Word2Vec vectors

2018-08-07 08:46:57,134 : INFO : saving Word2Vec object under word2vec movie-50, separately None
2018-08-07 08:46:57,136 : INFO : not storing attribute vectors_norm
2018-08-07 08:46:57,138 : INFO : not storing attribute cum_table
2018-08-07 08:46:57,356 : INFO : saved word2vec movie-50


# Sentiment Analysis with pre-trained Word2Vec model 

## Overview
In this tutorial we'll do sentiment analysis using our **pre-trained model**, a **Word2Vec** model trained on the unlabelled data that represents each word as a dense vector of **50 numbers**. The unlabelled data has **50000 IMDB movie reviews**, from which we obtained a vocabulary of **28000+** unique words after preprocessing the text and training Word2Vec with a vector length of 50.

### Set the seed

In [0]:
import numpy as np
np.random.seed(42)

### Load data
Data can be downloaded from Kaggle -> https://www.kaggle.com/c/word2vec-nlp-tutorial/data

In [0]:
from google.colab import files
files.upload()

Saving labeledTrainData.tsv to labeledTrainData.tsv


In [0]:
import pandas as pd

df1 = pd.read_csv('labeledTrainData.tsv',  #filepath
                 header=0, delimiter="\t", quoting=3)

print(df1.shape)  

(25000, 3)


## About the data

The labelled data set contains 25000 reviews, each with a label (**sentiment**). The sentiment column has 2 categories [0 and 1].

**0 -- Indicates negative sentiment**, if the rating < 5

**1 -- Indicates positive sentiment**, if the rating >= 7

In [0]:
df1.iloc[10:15,:]                                                                  # Rows 10 to 14 of the dataset, with review id, sentiment and review text

Unnamed: 0,id,sentiment,review
10,"""2486_3""",0,"""What happens when an army of wetbacks, towelh..."
11,"""6811_10""",1,"""Although I generally do not like remakes beli..."
12,"""11744_9""",1,"""\""Mr. Harvey Lights a Candle\"" is anchored by..."
13,"""7369_1""",0,"""I had a feeling that after \""Submerged\"", thi..."
14,"""12081_1""",0,"""note to George Litman, and others: the Myster..."


# Data Preprocessing

**1. Split Data into Training and Test Data**

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df1['review'],
    df1['sentiment'],
    test_size=0.2, 
    random_state=42
)

**2. Build Tokenizer to get Number sequences for Each review**

In [0]:
from tensorflow.keras.preprocessing.text import Tokenizer

#Vocab size
top_words = 10000

t = Tokenizer(num_words=top_words)
t.fit_on_texts(X_train.tolist())

#Get the word index for each of the word in the review
X_train = t.texts_to_sequences(X_train.tolist())
X_test = t.texts_to_sequences(X_test.tolist())
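
To make the two tokenizer steps concrete, the sketch below mimics them in plain Python on made-up texts. It is only an approximation of the Keras `Tokenizer` behaviour (the real class also strips punctuation), and the helper names `build_word_index` and `to_sequences` are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    # Rank words by frequency; the most frequent word gets index 1 (0 is left for padding).
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    # Keep only words whose index is below num_words; unknown or rare words are dropped.
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

toy_texts = ["good movie good cast", "bad movie"]
toy_index = build_word_index(toy_texts)
print(toy_index)
print(to_sequences(["good bad unseen movie"], toy_index, num_words=3))
```

With `num_words=top_words`, this means only indices 1 to `top_words - 1` appear in the sequences, and words that are unseen or too rare are silently skipped.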

**3. Pad sequences to make each review the same length**

We want to bring all the reviews to the same length so that they can be stacked into a single matrix of fixed dimensions.

In [0]:
from tensorflow.keras.preprocessing import sequence

#Each review size
max_review_length = 300

X_train = sequence.pad_sequences(X_train,maxlen=max_review_length,padding='post')
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length, padding='post') 
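
The call above sets `padding='post'` but leaves `truncating` at its default of `'pre'`. Short reviews therefore get zeros appended at the end, while reviews longer than 300 tokens keep their last 300 tokens. The sketch below reproduces that behaviour for a single sequence; `pad_post_truncate_pre` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def pad_post_truncate_pre(seq, maxlen, value=0):
    # Long sequences keep their LAST maxlen items (truncation from the front).
    seq = list(seq)[-maxlen:]
    # Short sequences get `value` appended at the end (post-padding).
    return np.array(seq + [value] * (maxlen - len(seq)))

print(pad_post_truncate_pre([5, 8, 2], maxlen=5))
print(pad_post_truncate_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))
```

If the opening of a review matters more than its ending, passing `truncating='post'` to `pad_sequences` would keep the first 300 tokens instead.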

## Build Embedding Matrix from Pre-Trained Word2Vec model

In [0]:
#Install gensim
!pip install gensim --quiet

#Load pre-trained model
import gensim
word2vec = gensim.models.Word2Vec.load('word2vec movie-50')

#Embedding Length
embedding_vector_length = word2vec.wv.vectors.shape[1]

print('Loaded word2vec model..')
print('Model shape: ', word2vec.wv.vectors.shape)

2018-08-07 10:03:28,370 : INFO : loading Word2Vec object from word2vec movie-50
2018-08-07 10:03:28,543 : INFO : loading wv recursively from word2vec movie-50.wv.* with mmap=None
2018-08-07 10:03:28,544 : INFO : setting ignored attribute vectors_norm to None
2018-08-07 10:03:28,545 : INFO : loading vocabulary recursively from word2vec movie-50.vocabulary.* with mmap=None
2018-08-07 10:03:28,552 : INFO : loading trainables recursively from word2vec movie-50.trainables.* with mmap=None
2018-08-07 10:03:28,554 : INFO : setting ignored attribute cum_table to None
2018-08-07 10:03:28,557 : INFO : loaded word2vec movie-50


Loaded word2vec model..
Model shape:  (28322, 50)


In [0]:
word2vec.wv.vector_size

50

**Build matrix for current data**

In [0]:
#Initialize embedding matrix to all zeros
embedding_matrix = np.zeros((top_words + 1, # Vocabulary size + 1; we add 1 because index 0 is reserved for padding
                             embedding_vector_length))

#Steps for populating embedding matrix

#1. Check each word in the tokenizer vocabulary to see if it exists in the
# pre-trained word2vec model.
#2. If found, update embedding matrix with embeddings for the word 
# from word2vec model

for word, i in sorted(t.word_index.items(),key=lambda x:x[1]):
    if i > top_words:
        break
    if word in word2vec.wv:                  # Membership test works in both gensim 3.x and 4.x
        embedding_vector = word2vec.wv[word]
        embedding_matrix[i] = embedding_vector

In [0]:
#Check embeddings for word 'great'
embedding_matrix[t.word_index['great']]

array([ 0.59144205,  0.94809264,  2.92205071, -2.57998848,  2.06668258,
        0.03379907, -2.07701755, -1.28192663,  2.37326407, -1.6968323 ,
       -1.46692789, -2.43406081, -0.99238962, -2.35702658,  0.37269598,
       -1.23948109,  1.67976511,  1.22183132, -2.27092576, -0.52730691,
        2.21310592,  3.8952992 , -1.38157284, -0.99453694, -0.90861291,
       -1.57382619, -0.62930226,  1.70807695, -1.20810831,  2.12286615,
       -0.50363177, -0.57258892, -0.01908715, -2.85462713,  0.36451188,
        0.2708773 ,  3.52137017,  2.90140653,  2.48585653, -2.98677659,
       -1.01710439,  1.52898908, -0.93782079,  0.80436903, -3.12551713,
        1.43007016,  2.68136525,  1.97543514,  0.14813299,  2.30020237])
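
Words from the tokenizer that are missing from the Word2Vec vocabulary keep an all-zero row in the embedding matrix, so it can be worth counting how many there are. The sketch below shows one way to do that on toy stand-ins; all names prefixed with `toy_` are made up for illustration.

```python
import numpy as np

# Toy stand-ins: a tokenizer-style word index and a dict of pre-trained vectors.
toy_word_index = {"good": 1, "movie": 2, "zzyzx": 3}
toy_pretrained = {"good": np.array([0.1, 0.2]), "movie": np.array([0.3, 0.4])}
toy_top_words = 3

# Same construction as above: row 0 is reserved for padding.
toy_matrix = np.zeros((toy_top_words + 1, 2))
for word, i in toy_word_index.items():
    if i <= toy_top_words and word in toy_pretrained:
        toy_matrix[i] = toy_pretrained[word]

# Rows 1..top_words that are still all zeros had no pre-trained vector.
missing = int((~toy_matrix[1:].any(axis=1)).sum())
print("words without a pre-trained vector:", missing)
```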

## Build the Graph

In [0]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, Dense, Embedding, Flatten

#Build a sequential model
model1 = Sequential()

**Add Embedding layer**

In [0]:
model1.add(Embedding(top_words + 1,
                    embedding_vector_length,
                    input_length=max_review_length,
                    weights=[embedding_matrix],                                    # Pre-trained embedding
                    trainable=False)                                               # We do not want to change embedding
         )

The output of the Embedding layer is 3-dimensional:
- batch_size x max_review_length x embedding_vector_length.

We need to flatten this output before passing it to a Dense layer.
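
The shapes involved can be checked with NumPy alone. The sketch below uses random stand-ins for the embedding matrix and for a batch of padded reviews, and a batch size of 4 chosen only for illustration.

```python
import numpy as np

batch_size, max_len, emb_dim = 4, 300, 50
vocab_rows = 10001  # top_words + 1

# Random stand-ins for the embedding matrix and a batch of padded reviews.
rng = np.random.default_rng(0)
emb = rng.normal(size=(vocab_rows, emb_dim))
batch = rng.integers(0, vocab_rows, size=(batch_size, max_len))

embedded = emb[batch]                          # lookup: one 50-d vector per token
flattened = embedded.reshape(batch_size, -1)   # what Flatten does per sample

print(embedded.shape, flattened.shape)
# Weights plus biases of a Dense(200) layer fed by the flattened vector.
print(flattened.shape[1] * 200 + 200)
```

Each review becomes a vector of 300 x 50 = 15000 values, so a first Dense layer with 200 units already holds about three million trainable parameters.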

In [0]:
#Flatten the embedding layer output and add fully connected layers
model1.add(Flatten())                                                             # Flatten turns each review's 300 x 50 matrix into a single vector of 15000 values
model1.add(Dense(200,activation='relu'))                                          # Dense is a fully connected layer
model1.add(Dense(100,activation='relu'))
model1.add(Dropout(0.5))                                                          # Dropout helps avoid overfitting and makes the model generalize better
model1.add(Dense(60,activation='relu'))
model1.add(Dropout(0.4))
model1.add(Dense(30,activation='relu'))
model1.add(Dropout(0.3))
model1.add(Dense(1,activation='sigmoid'))                                         # We've used sigmoid because output variable is binary

model1.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [0]:
#from keras.utils import to_categorical
#Y_train=to_categorical(y_train,2)
#Y_test=to_categorical(y_test,2)

## Execute the graph

Here we train on the 20000 training reviews and validate on the 5000 held-out reviews for 5 epochs with a batch size of 200, tracking training and validation accuracy.

In [0]:
model1.fit(X_train,y_train,
          epochs=5,
          batch_size=200,          
          validation_data=(X_test, y_test))

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f0962e4ff60>

In [0]:
model1.predict(X_test[10:12])

array([[0.91270083],
       [0.04830996]], dtype=float32)
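
The model outputs probabilities of the positive class. To turn them into 0/1 labels we can apply a threshold. The sketch below uses the two probabilities printed above and assumes a cut-off of 0.5, which is a common choice but not the only possible one.

```python
import numpy as np

# The two probabilities printed above.
probs = np.array([[0.91270083], [0.04830996]], dtype=np.float32)

# Treat 0.5 as the decision boundary (an assumption; other thresholds are possible).
labels = (probs >= 0.5).astype(int).ravel()
print(labels)
```

The matching true labels are in `y_test.iloc[10:12]`, since `X_test` and `y_test` come from the same shuffled split.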

In [0]:
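df1.iloc[10:12,:]                                                                  # Caution: X_test comes from a shuffled split, so these df1 rows are not the reviews scored above; y_test.iloc[10:12] holds the matching labels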

Unnamed: 0,id,sentiment,review
10,"""2486_3""",0,"""What happens when an army of wetbacks, towelh..."
11,"""6811_10""",1,"""Although I generally do not like remakes beli..."
