#### Using the Amazon data set, perform the following:

a) Create a bigram TF-IDF DTM (document-term matrix) using nouns and proper nouns
b) Find the top 20 terms
c) Create a cosine similarity matrix using the above DTM

POS (part-of-speech) tagging: the process of labelling each word in a sequence with its part of speech, based on the context in which the word is used in the sentence.

In [1]:
#First, download the NLTK resources needed for tokenizing and POS tagging:

import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [3]:
#Importing the libraries

from nltk import pos_tag,word_tokenize
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

#To remove unwanted warnings

import warnings
warnings.filterwarnings('ignore')

In [4]:
text=word_tokenize("Hello welcome to the world of learning Pos tagging using NLTK")
text

['Hello',
 'welcome',
 'to',
 'the',
 'world',
 'of',
 'learning',
 'Pos',
 'tagging',
 'using',
 'NLTK']

In [5]:
nltk.pos_tag(text)

[('Hello', 'NNP'),
 ('welcome', 'NN'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('world', 'NN'),
 ('of', 'IN'),
 ('learning', 'VBG'),
 ('Pos', 'NNP'),
 ('tagging', 'VBG'),
 ('using', 'VBG'),
 ('NLTK', 'NNP')]

In [25]:
##Finding the nouns in the sentence above (only the tag "NN", singular common noun, is matched)

is_noun = lambda x: x =="NN"

nouns = [y for (y,x) in pos_tag(text) if is_noun(x)]

print(nouns)

['welcome', 'world']
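
The task statement asks for nouns and proper nouns, while the filter above keeps only the tag `NN`, so `Hello`, `Pos` and `NLTK` (tagged `NNP`) are dropped. Below is a minimal sketch of a broader filter, assuming the Penn Treebank noun tags `NN`, `NNS`, `NNP` and `NNPS`. The tagged tokens are copied from the output above so the snippet runs without NLTK; `NOUN_TAGS` and `keep_nouns` are hypothetical names used only for illustration.

```python
# Tagged tokens copied from the pos_tag output shown above.
tagged = [
    ("Hello", "NNP"), ("welcome", "NN"), ("to", "TO"), ("the", "DT"),
    ("world", "NN"), ("of", "IN"), ("learning", "VBG"), ("Pos", "NNP"),
    ("tagging", "VBG"), ("using", "VBG"), ("NLTK", "NNP"),
]

# Assumed set of Penn Treebank noun tags:
# common and proper nouns, singular and plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def keep_nouns(tagged_tokens):
    """Return the words whose POS tag is one of the noun tags."""
    return [word for word, tag in tagged_tokens if tag in NOUN_TAGS]

all_nouns = keep_nouns(tagged)
print(all_nouns)
```

On this sentence the broader filter also keeps `Hello`, `Pos` and `NLTK`. Note that the pipeline below lowercases the reviews before tagging, which may reduce how often the tagger assigns proper-noun tags.
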


In [7]:
#Import the dataset

df=pd.read_excel("Amazon.xlsx")
df.head(2)

Unnamed: 0,id,asins,brand,categories,colors,dateAdded,dateUpdated,dimension,ean,keys,...,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username,sizes,upc,weight
0,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,I initially had trouble deciding between the p...,"Paperwhite voyage, no regrets!",,,Cristina M,,,205 grams
1,AVpe7AsMilAPnD_xQ78G,B00QJDU3KY,Amazon,"Amazon Devices,mazon.co.uk",,2016-03-08T20:21:53Z,2017-07-18T23:52:58Z,169 mm x 117 mm x 9.1 mm,,kindlepaperwhite/b00qjdu3ky,...,5.0,https://www.amazon.com/Kindle-Paperwhite-High-...,Allow me to preface this with a little history...,One Simply Could Not Ask For More,,,Ricky,,,205 grams


In [8]:
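#Separate the reviews.text column from the dataframe df
#(.copy() gives an independent frame, so adding columns later does not raise SettingWithCopyWarning)

df_review=df[["reviews.text"]].copy()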

In [9]:
##Extracting the first review

df_review["reviews.text"][0]

"I initially had trouble deciding between the paperwhite and the voyage because reviews more or less said the same thing: the paperwhite is great, but if you have spending money, go for the voyage.Fortunately, I had friends who owned each, so I ended up buying the paperwhite on this basis: both models now have 300 ppi, so the 80 dollar jump turns out pricey the voyage's page press isn't always sensitive, and if you are fine with a specific setting, you don't need auto light adjustment).It's been a week and I am loving my paperwhite, no regrets! The touch screen is receptive and easy to use, and I keep the light at a specific setting regardless of the time of day. (In any case, it's not hard to change the setting either, as you'll only be changing the light level at a certain time of day, not every now and then while reading).Also glad that I went for the international shipping option with Amazon. Extra expense, but delivery was on time, with tracking, and I didnt need to worry about cu

In [10]:
#Step 1: Convert the text to lower case

df_review['lower_text']=df_review['reviews.text'].str.lower()
df_review.head(2)

Unnamed: 0,reviews.text,lower_text
0,I initially had trouble deciding between the p...,i initially had trouble deciding between the p...
1,Allow me to preface this with a little history...,allow me to preface this with a little history...


In [11]:
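#Step 2: Remove punctuation, digits and special characters from lower_text
#(regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default)

df_review['new_text']=df_review['lower_text'].str.replace("[^a-z' ]" , "", regex=True)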
df_review['new_text'][0]

"i initially had trouble deciding between the paperwhite and the voyage because reviews more or less said the same thing the paperwhite is great but if you have spending money go for the voyagefortunately i had friends who owned each so i ended up buying the paperwhite on this basis both models now have  ppi so the  dollar jump turns out pricey the voyage's page press isn't always sensitive and if you are fine with a specific setting you don't need auto light adjustmentit's been a week and i am loving my paperwhite no regrets the touch screen is receptive and easy to use and i keep the light at a specific setting regardless of the time of day in any case it's not hard to change the setting either as you'll only be changing the light level at a certain time of day not every now and then while readingalso glad that i went for the international shipping option with amazon extra expense but delivery was on time with tracking and i didnt need to worry about customs which i may have if i use

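In the cleaned review above, tokens such as "voyagefortunately" and "readingalso" are fused, because the punctuation between them is deleted rather than replaced by a space. A minimal sketch of one way to avoid this, using only the standard library `re` module; `strip_punct` is a hypothetical helper name, and the sample string is a fragment of the first review.

```python
import re

def strip_punct(text):
    """Lowercase, replace disallowed characters with a space, collapse spaces."""
    text = text.lower()
    # Replace with a space (not an empty string) so neighbouring words stay apart.
    text = re.sub(r"[^a-z' ]", " ", text)
    # Collapse the runs of whitespace created by the replacement.
    return re.sub(r"\s+", " ", text).strip()

sample = "go for the voyage.Fortunately, I had friends"
cleaned = strip_punct(sample)
print(cleaned)
```

With pandas, the same idea would be to pass `" "` instead of `""` as the replacement in `str.replace(..., regex=True)`. This changes the vocabulary, so the DTM shape and scores shown later would differ.
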
In [12]:
# STEP 3 - Removing the stopwords

from nltk.corpus import stopwords

# Create a list of stopwords

stop = stopwords.words('english')


In [14]:
# Write a user-defined function that splits the review text into words, checks each word against the stop list and returns
# only the words that are not present in the stop list

def sw(x):
    x = [y for y in x.split() if y not in stop]
    return " ".join(x)

# Lets apply the UDF sw on the new_text column of the data set

df_review['clean_text'] = df_review['new_text'].apply(sw)

df_review['clean_text'][0]

"initially trouble deciding paperwhite voyage reviews less said thing paperwhite great spending money go voyagefortunately friends owned ended buying paperwhite basis models ppi dollar jump turns pricey voyage's page press always sensitive fine specific setting need auto light adjustmentit's week loving paperwhite regrets touch screen receptive easy use keep light specific setting regardless time day case hard change setting either changing light level certain time day every readingalso glad went international shipping option amazon extra expense delivery time tracking didnt need worry customs may used third party shipping service"

In [15]:
# STEP 4 - Create a user-defined function that POS-tags each word and keeps only the nouns (tag "NN")

def nouns(x):
    
    # Filter condition using a lambda function: keep tokens tagged "NN"
    
    is_noun = lambda tag : tag == "NN"
    
    # Split the text into tokens using word_tokenize()
    
    token = word_tokenize(x)
    
    # Apply the POS tags and keep only the nouns
    
    all_nouns = [word for (word, tag) in pos_tag(token) if is_noun(tag)]
    
    # Before returning the words, join them back into a single
    # space-separated string, as expected by the vectorizer:
    
    return ' '.join(all_nouns)

In [17]:
#Step 5: Apply the UDF nouns to the clean_text column of the dataframe (df_review)

df_review['Final_text'] = df_review['clean_text'].apply(nouns)

df_review.head(2)

Unnamed: 0,reviews.text,lower_text,new_text,clean_text,Final_text
0,I initially had trouble deciding between the p...,i initially had trouble deciding between the p...,i initially had trouble deciding between the p...,initially trouble deciding paperwhite voyage r...,trouble voyage thing paperwhite spending money...
1,Allow me to preface this with a little history...,allow me to preface this with a little history...,allow me to preface this with a little history...,allow preface little history casual reader own...,preface history reader touch series girl serie...


In [18]:
# STEP 6 - Creating the bigram TF-IDF DTM

# STEP 6.1 - Create a TF-IDF vectorizer object (bigrams only; min_df=0.001 keeps terms found in at least 0.1% of the documents)

tfidf_vec_bigram = TfidfVectorizer(min_df=0.001, ngram_range=(2,2))

# STEP 6.2 - Fit the vectorizer on the Final_text column of df_review and create the DTM with fit_transform
# (fit_transform learns the vocabulary and returns the matrix in one call, so a separate fit is not needed)

DTM_bigram = tfidf_vec_bigram.fit_transform(df_review['Final_text'])

DTM_bigram

<1597x5001 sparse matrix of type '<class 'numpy.float64'>'
	with 31109 stored elements in Compressed Sparse Row format>
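
With `ngram_range=(2,2)`, each column of the DTM is a pair of adjacent tokens in `Final_text`. Because `Final_text` contains only the filtered nouns, a pair can join two nouns that were not adjacent in the original review. A minimal pure-Python sketch of the pairing, assuming simple whitespace splitting (the vectorizer's own tokenizer is more involved); `bigrams` is a hypothetical helper, and the sample is the visible start of `Final_text` for the first review.

```python
def bigrams(text):
    """Return the adjacent token pairs of a whitespace-separated string."""
    tokens = text.split()
    # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

final_text = "trouble voyage thing paperwhite spending money"
pairs = bigrams(final_text)
print(pairs)
```

Six tokens give five bigrams here, for example "trouble voyage" and "spending money".
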

In [19]:
# FINAL STEP - We will now convert our DTM into a data frame. This can be done by converting the sparse matrix into an array and then passing it
# to pd.DataFrame (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)

DTM_BIGRAM_DF = pd.DataFrame(DTM_bigram.toarray(), 
                      columns = tfidf_vec_bigram.get_feature_names_out())

DTM_BIGRAM_DF.head()

Unnamed: 0,aa aa,aa energizer,ability display,ability download,ability filter,ability plug,ability screen,ability storage,ability stream,ability try,...,year video,year year,youbattery life,youi ereader,youtube chance,youtube fire,youtube hdx,youtube hear,youtube video,youtube videos
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
# Find the total TF-IDF score of each bigram (sum of its column across all reviews)

word_tfidf_bigram = DTM_BIGRAM_DF.sum().reset_index()

word_tfidf_bigram

# Lets rename the columns 

word_table_bigram = word_tfidf_bigram.rename(columns = {'index' : 'TERMS', 
                                        0 : 'TF_IDF Score'})

word_table_bigram

# Find the top 20 bigrams by total TF-IDF score

word_table_bigram.sort_values(by= 'TF_IDF Score', ascending = False).head(20)

Unnamed: 0,TERMS,TF_IDF Score
1600,fire hd,27.754811
237,apple buds,27.676492
1635,fire tv,26.147451
254,apple tv,16.034167
2233,kindle fire,14.42506
1604,fire hdx,14.248908
344,battery life,12.98445
4538,tv tv,12.889196
4013,star rating,10.845378
4759,voice search,10.367382


Term similarity using cosine similarity: a commonly used approach to finding similar terms.

Cosine similarity is a metric that measures how similar two terms are. Mathematically, it is the cosine of the angle between two terms (represented as vectors) in a multi-dimensional space. Here each bigram is represented by its column of TF-IDF scores across the reviews.
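
As a small worked illustration of the definition, the cosine of the angle between two vectors is their dot product divided by the product of their lengths. The sketch below uses only the standard library; the function name `cosine` and the three toy vectors are made up for illustration and are not taken from the data set. Returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined for a zero vector; this sketch returns 0.0.
        return 0.0
    return dot / (norm_a * norm_b)

# Toy term vectors: one entry per document.
term_a = [0.5, 0.0, 0.5]   # appears in documents 1 and 3
term_b = [1.0, 0.0, 1.0]   # same documents, proportional weights
term_c = [0.0, 2.0, 0.0]   # appears only in document 2

sim_ab = cosine(term_a, term_b)
sim_ac = cosine(term_a, term_c)
print(sim_ab, sim_ac)
```

Two terms whose score vectors are proportional get a similarity of about 1, and terms that never share a document get 0.
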

In [21]:
# Let's turn DTM_BIGRAM_DF into a square term-by-term matrix. This can be done by simply passing the transpose of DTM_BIGRAM_DF to the
# function cosine_similarity() from the sklearn library

# lets import the function

from sklearn.metrics.pairwise import cosine_similarity

# lets create the cosine similarity matrix

sim_mat = cosine_similarity(DTM_BIGRAM_DF.T)

sim_mat

array([[1.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [1.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.02428543,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.02428543, ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

In [22]:
# Lets convert this sim_mat square matrix into a data frame

sim_df = pd.DataFrame(sim_mat, columns = DTM_BIGRAM_DF.columns, 
                     index = DTM_BIGRAM_DF.columns)

sim_df.head()

Unnamed: 0,aa aa,aa energizer,ability display,ability download,ability filter,ability plug,ability screen,ability storage,ability stream,ability try,...,year video,year year,youbattery life,youi ereader,youtube chance,youtube fire,youtube hdx,youtube hear,youtube video,youtube videos
aa aa,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aa energizer,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ability display,0.0,0.0,1.0,0.234198,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.234198,0.0,0.0,0.0,0.024285,0.0,0.024285,0.0
ability download,0.0,0.0,0.234198,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.103696,0.0,0.103696,0.0
ability filter,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [23]:
# Create the user-defined function

def get_similar_words(input_word, sim_mat_df, n_words):
    
    # For a given input term, take its similarity scores and
    # sort them from highest to lowest
    
    val = sim_mat_df[input_word].sort_values(ascending = False)
    
    # Drop the input term itself (its similarity to itself is 1) from the final list
    
    words = val.drop(input_word).head(n_words)
    
    # Return the top n_words terms with their similarity scores
    return words

In [24]:
# Let's find the 20 bigrams most similar to "kindle fire"

get_similar_words("kindle fire",sim_df,20)

fire hd            0.601994
year hd            0.585267
fire kindle        0.550263
hd year            0.550175
hd fire            0.543074
business tablet    0.512349
device review      0.508924
comparison sake    0.508857
ghz quad           0.508857
keyboard nexus     0.508857
year comparison    0.508857
content year       0.508857
goto device        0.508857
quad processor     0.508857
hd keyboard        0.508857
hdx play           0.508857
internet search    0.508857
tablet day         0.508857
play ghz           0.508857
year video         0.508857
Name: kindle fire, dtype: float64

In [None]:
### END ####