### Performing NLP on Amazon Book Reviews

The dataset used here is a subset of Amazon reviews from the Books category. The data is stored as a CSV file and can be read with pandas.

In [23]:
import pandas as pd
import numpy as np
import gensim as gs
import os

In [24]:
os.getcwd()

'C:\\Users\\22000370\\Downloads'

In [25]:
os.chdir(r"C:\Users\22000370\Downloads")

In [26]:
# read in data
book_r=pd.read_csv(r"C:\Users\22000370\Downloads\amazondata.csv")


In [27]:
book_r

Unnamed: 0,Helpful Votes (bin),Number of Records,Star Rating (bin),Customer Id,Helpful Votes,Overall Votes,Product Id,Review Body,Review Year,Review Headline,Star Rating
0,0,1,0.0,,4.0,14.0,26009102,You will love this book. It is a hard long re...,03/17/2005 0:00,Best Book Ever,5.0
1,,1,,,,,7491727,This is the UK edition of Dr. Omit's book. Dr....,,researchers from John Hopkins School of Medici...,
2,0,1,0.0,,2.0,2.0,002782683X,This is a fun and entertaining book about lear...,06/25/2012 0:00,Michelle,5.0
3,0,1,0.0,,0.0,0.0,60187271,"Started a big slow, but once into it the autho...",06/09/2013 0:00,Loved the book,5.0
4,0,1,0.0,,14.0,20.0,60392452,Received this book as a Christmas present. I h...,08/05/2003 0:00,Challenges your assumptions,4.0
...,...,...,...,...,...,...,...,...,...,...,...
128840,0,1,0,,4.0,6.0,60529148,John Stossel explains within these pages how h...,05/19/2004 0:00,Heroic,4.0
128841,,1,,,,,60579412,When Bill Clinton said that we were all cold w...,,the record needed to be set straight. Mona Ch...,
128842,,1,,,,,60184973,"During her reign, Queen Mary foiled several pl...",,Queen of Scots -- but then,
128843,0,1,0,,1.0,1.0,7444117,I just don't understand how this was supposed ...,03/26/2014 0:00,So upsetting,2.0


### Renaming the Column

Only the Review Body column is used in the rest of the analysis, so it is the only column we rename. Replacing the space with an underscore lets us access it as an attribute (reviews.Review_Body).

In [28]:
book_r = book_r.rename(columns={'Review Body': 'Review_Body'})

In [29]:
book_r.columns

Index(['Helpful Votes (bin)', 'Number of Records', 'Star Rating (bin)',
       'Customer Id', 'Helpful Votes', 'Overall Votes', 'Product Id',
       'Review_Body', 'Review Year', 'Review Headline', 'Star Rating'],
      dtype='object')

### Data Dictionary

Helpful Votes (bin) - helpful votes grouped into bins

Number of Records - number of records represented by the row (1 in the rows shown above)

Star Rating (bin) - star ratings grouped into bins

Customer Id - ID of the reviewer

Helpful Votes - number of helpful votes the review received

Overall Votes - total number of votes the review received

Product Id - ID of the product

Review_Body - text of the review

Review Headline - headline (short summary) of the review

Review Year - date of the review (the values shown above are full dates, e.g. 03/17/2005 0:00)

Star Rating - star rating given by the reviewer

In [30]:
book_r.shape

(128845, 11)

In [31]:
book_r.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128845 entries, 0 to 128844
Data columns (total 11 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Helpful Votes (bin)  114947 non-null  object 
 1   Number of Records    128843 non-null  object 
 2   Star Rating (bin)    116311 non-null  object 
 3   Customer Id          59980 non-null   float64
 4   Helpful Votes        114943 non-null  float64
 5   Overall Votes        116121 non-null  float64
 6   Product Id           128840 non-null  object 
 7   Review_Body          128834 non-null  object 
 8   Review Year          114933 non-null  object 
 9   Review Headline      128831 non-null  object 
 10  Star Rating          116305 non-null  float64
dtypes: float64(4), object(7)
memory usage: 10.8+ MB


In [32]:
book_r.isnull().sum()

Helpful Votes (bin)    13898
Number of Records          2
Star Rating (bin)      12534
Customer Id            68865
Helpful Votes          13902
Overall Votes          12724
Product Id                 5
Review_Body               11
Review Year            13912
Review Headline           14
Star Rating            12540
dtype: int64

In [33]:
#train.headline_text.fillna("IGNORE TEXT")

### Dropping the NA Values

Rows with a missing Review_Body cannot be tokenized and contribute nothing to Word2Vec training, so we drop them. This removes only 11 of the 128,845 rows.

In [34]:
reviews = book_r.dropna(subset=['Review_Body'])

### Data Preprocessing

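The text needs pre-processing before we move on to the modelling step. gensim's simple_preprocess lowercases each review, splits it into tokens, and discards tokens that are very short or very long (by default, fewer than 2 or more than 15 characters). It does not remove stop words, as the output below shows.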

In [35]:
review_text=reviews.Review_Body.apply(gs.utils.simple_preprocess)

In [36]:
# Check the first 5 rows to verify the pre-processing step

In [37]:
review_text.head()

0    [you, will, love, this, book, it, is, hard, lo...
1    [this, is, the, uk, edition, of, dr, omit, boo...
2    [this, is, fun, and, entertaining, book, about...
3    [started, big, slow, but, once, into, it, the,...
4    [received, this, book, as, christmas, present,...
Name: Review_Body, dtype: object

### The tokenized text is returned as a pandas Series
Each element of the Series is a list containing the tokenized words of one review.
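To make the pre-processing step more concrete, here is a minimal standard-library sketch that approximates what simple_preprocess does with its default settings. The function name approx_simple_preprocess, the regular expression, and the sample sentence are illustrative assumptions, not gensim's actual implementation, which handles a few more cases.

```python
import re

# Runs of word characters that are not digits: a rough stand-in for the
# pattern gensim uses to pick out alphabetic tokens (an assumption).
TOKEN_RE = re.compile(r"[^\W\d]+")


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase `text`, split it into alphabetic tokens, and keep only
    tokens whose length is between min_len and max_len (inclusive)."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


# Made-up sentence modelled on the first review in the dataset
sample = "You will love this book. It is a hard long read!"
tokens = approx_simple_preprocess(sample)
print(tokens)
```

The one-letter word "a" is dropped because it is shorter than min_len, while stop words such as "this" and "is" are kept.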

In [38]:
model = gs.models.Word2Vec(window=10, min_count=2, workers=4)


In [39]:
# window = maximum distance between the target word and a context word (10 means up to 10 words before and 10 words after the target)
# min_count = ignore words whose total frequency in the corpus is below this value (with 2, words that appear only once are dropped)
# workers = number of worker threads used to train the model
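The window parameter can be illustrated by listing which context words fall within range of each target word. This is a simplified sketch: the helper context_pairs is hypothetical, and gensim's real training loop differs in its details (for example, it may sample a smaller effective window for each target).

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs where the context word lies within
    `window` positions of the target word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = ["started", "big", "slow", "but", "once"]
pairs = context_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```

With window=1 each word is paired only with its immediate neighbours. With window=10, as in the model above, every word in this short example would be paired with every other word.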


In [40]:
model.build_vocab(review_text,progress_per=1000)

The build_vocab() step is how the model discovers the set of all possible words/doc-tags and, in the case of words, finds which words occur at least min_count times.
First and foremost, the model needs to know the words present and their frequencies – a working vocabulary – so that it can determine the words that will remain after the min_count floor is applied, and allocate/initialize word-vectors & internal model structures for the relevant words. The word-frequencies will also be used to influence the random sampling of negative-word-examples (for the default negative-sampling mode) and the downsampling of very-frequent words (per the sample parameter).
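The effect of the min_count floor can be sketched with a simple frequency count. The helper build_toy_vocab and the three-document corpus are made up for illustration; gensim's build_vocab() does considerably more, such as allocating the vectors.

```python
from collections import Counter


def build_toy_vocab(corpus, min_count):
    """Count every token in the corpus and keep the words that occur
    at least `min_count` times."""
    counts = Counter(token for doc in corpus for token in doc)
    return {word: n for word, n in counts.items() if n >= min_count}


# Made-up tokenized reviews
corpus = [
    ["loved", "the", "book"],
    ["the", "book", "was", "slow"],
    ["loved", "the", "ending"],
]
vocab = build_toy_vocab(corpus, min_count=2)
print(vocab)
```

Words that appear only once ("was", "slow", "ending") fall below the floor and are left out of the vocabulary.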

Additionally, the model needs to know the rough size of the overall training set in order to gradually decrement the internal alpha learning-rate over the course of each epoch, and give meaningful progress-estimates in logging output.
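The link between corpus size and the learning rate can be sketched as a linear schedule. The sketch assumes a straight-line decay from gensim's default alpha of 0.025 to its default min_alpha of 0.0001. The function names are hypothetical, and gensim's internal bookkeeping is more involved.

```python
def training_progress(examples_seen, total_examples, epochs):
    """Fraction of the whole training run completed so far."""
    return examples_seen / (total_examples * epochs)


def linear_alpha(progress, start_alpha=0.025, end_alpha=0.0001):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training,
    assuming a linear decay from start_alpha down to end_alpha."""
    alpha = start_alpha - (start_alpha - end_alpha) * progress
    return max(end_alpha, alpha)


# Corpus size and epoch count used in this notebook
total_examples, epochs = 128834, 5
for examples_seen in (0, 322085, 644170):
    progress = training_progress(examples_seen, total_examples, epochs)
    print(examples_seen, progress, round(linear_alpha(progress), 6))
```

Without knowing total_examples in advance, the model could not tell how far through training it is, and so could not compute this decay.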

At the end of build_vocab(), all memory/objects needed for the model have been created. Per the needs of the underlying algorithm, all vectors will have been initialized to low-magnitude random vectors to ready the model for training. (It essentially won't use any more memory, internally, through training.)

Also, after build_vocab(), the vocabulary is frozen: any words presented during training (or later inference) that aren't already in the model will be ignored.
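The practical effect of a frozen vocabulary can be sketched as a filter that silently drops unknown tokens. The helper keep_known, the toy vocabulary, and the review are illustrative assumptions, not gensim code.

```python
def keep_known(tokens, vocab):
    """Return only the tokens that are present in the frozen vocabulary."""
    return [t for t in tokens if t in vocab]


# Toy vocabulary fixed before training, and a made-up new review
vocab = {"loved", "the", "book"}
new_review = ["loved", "the", "audiobook"]

known = keep_known(new_review, vocab)
print(known)
print(len(known), "of", len(new_review), "tokens would be used")
```

The unseen word "audiobook" contributes nothing, which is one reason the number of words a model actually trains on can be smaller than the raw word count.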

In [41]:
model.epochs

5

In [42]:
#By default the epochs are set to 5

In [43]:
model.corpus_count

128834

In [44]:
model.train(review_text,total_examples=model.corpus_count,epochs=model.epochs)

(42849560, 56538715)

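### This is the training step, where the model makes 5 passes (epochs) over all 128,834 tokenized reviews

train() returns a pair of word counts summed over all epochs: the effective number of words used for training (after the words the model skips, such as those outside the vocabulary, are dropped) and the raw number of words seen.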

#### Saving the model (so that it can be stored in the cloud if required later)

In [45]:
model.save("./word2vec-Amazon_Book_Reviews.model")

### Similarity Check as a part of Word2Vec

We query the model with the word "interesting" to find the words most similar to it. The model returns the ten closest words, each with its cosine similarity score.

In [46]:
model.wv.most_similar("interesting")

[('intriguing', 0.7801146507263184),
 ('enjoyable', 0.7178838849067688),
 ('entertaining', 0.676686704158783),
 ('exciting', 0.6748363971710205),
 ('fascinating', 0.6471625566482544),
 ('engaging', 0.6395599842071533),
 ('impressive', 0.5981829166412354),
 ('accurate', 0.5847204923629761),
 ('amusing', 0.5754761099815369),
 ('unusual', 0.5723066926002502)]

### We can find similar words for any other word in the vocabulary in the same way.

### Similarity score between huge and big

In [47]:
model.wv.similarity(w1="huge",w2="big")

0.72469556
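The scores returned by similarity() and most_similar() are cosine similarities between word vectors. The sketch below shows the calculation on made-up 3-dimensional vectors. The vectors and the helper toy_most_similar are hypothetical, and the real model uses much longer learned vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors, for illustration only
toy_vectors = {
    "huge": [0.9, 0.1, 0.3],
    "big": [0.8, 0.2, 0.4],
    "tiny": [-0.7, 0.1, 0.2],
}

print(cosine(toy_vectors["huge"], toy_vectors["big"]))
print(toy_most_similar("huge", toy_vectors))
```

A score near 1 means the two vectors point in almost the same direction, a score near 0 means they are unrelated, and a negative score means they point in opposite directions.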

### We can check the similarity of other word pairs in the same way. Scores like these suggest that the model has learned sensible word relationships.