In [13]:
import pandas as pd
df = pd.read_csv('scraping.csv')
df.head()

| | Comment | Rating |
|---|---|---|
| 0 | From last 5 years my younger brother was using... | 5 |
| 1 | excellent phone camera is very nice and the st... | 4 |
| 2 | I have been using the earlier versions of iPho... | 4 |
| 3 | IMPORTANT NOTICEIf you buy some apple device o... | 5 |
| 4 | Well, what can I say... iPhone is awesome as e... | 5 |


#### Ngrams

In [14]:
from textblob import TextBlob
TextBlob(df['Comment'][0]).ngrams(2)

[WordList(['From', 'last']),
 WordList(['last', '5']),
 WordList(['5', 'years']),
 WordList(['years', 'my']),
 WordList(['my', 'younger']),
 WordList(['younger', 'brother']),
 WordList(['brother', 'was']),
 WordList(['was', 'using']),
 WordList(['using', 'iphone']),
 WordList(['iphone', '4s']),
 WordList(['4s', 'and']),
 WordList(['and', 'i']),
 WordList(['i', 'bought']),
 WordList(['bought', 'iphone']),
 WordList(['iphone', '7']),
 WordList(['7', 'for']),
 WordList(['for', 'his']),
 WordList(['his', 'birthday']),
 WordList(['birthday', 'gift']),
 WordList(['gift', 'When']),
 WordList(['When', 'i']),
 WordList(['i', 'gave']),
 WordList(['gave', 'gift']),
 WordList(['gift', 'packet']),
 WordList(['packet', 'to']),
 WordList(['to', 'him']),
 WordList(['him', 'he']),
 WordList(['he', 'was']),
 WordList(['was', 'thinking']),
 WordList(['thinking', 'that']),
 WordList(['that', 'my']),
 WordList(['my', 'bro']),
 WordList(['bro', 'bought']),
 WordList(['bought', 'watch/pen/or']),
 WordList(['
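
The bigrams above come from sliding a window of two tokens over the text. A minimal sketch of that idea in plain Python follows; the helper name `ngrams` is hypothetical, and the whitespace split is a simplification (TextBlob's tokenizer also handles punctuation).

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # A sequence of k tokens has k - n + 1 windows (none if k < n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Naive whitespace tokenisation, used here only for illustration.
tokens = "From last 5 years my younger brother".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams[:3])
```

Larger values of `n` capture more context but produce sparser features, so unigrams and bigrams are a common starting point.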

#### Term frequency

In [15]:
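# Count each space-separated token in the second comment (row 1).
# Series.value_counts() replaces the deprecated top-level pd.value_counts().
tf1 = (df['Comment'][1:2]).apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis=0).reset_index()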
tf1.columns = ['words','tf']
tf1.head()

| | words | tf |
|---|---|---|
| 0 | it | 2 |
| 1 | for | 2 |
| 2 | the | 2 |
| 3 | cool | 1 |
| 4 | . | 1 |


#### Inverse Document Frequency

In [16]:
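import numpy as np
# IDF = log(N / number of comments that contain the word), with N = number of comments.
# Note: str.contains treats `word` as a regular expression by default, so "." matches
# every comment (hence idf = 0.0 below); pass regex=False to match the token literally.
for i, word in enumerate(tf1['words']):
    tf1.loc[i, 'idf'] = np.log(df.shape[0] / len(df[df['Comment'].str.contains(word)]))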
tf1.head()

| | words | tf | idf |
|---|---|---|---|
| 0 | it | 2 | 0.227081 |
| 1 | for | 2 | 0.527489 |
| 2 | the | 2 | 0.541937 |
| 3 | cool | 1 | 3.3856 |
| 4 | . | 1 | 0.0 |
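
The IDF of `.` comes out as 0.0 because `Series.str.contains` interprets its pattern as a regular expression by default, and `.` matches any character. The sketch below computes IDF on a small made-up corpus with literal substring matching instead; the `docs` list and the `idf` helper are illustrative assumptions, not part of the notebook's data.

```python
import math

# Hypothetical toy corpus standing in for the comments column.
docs = [
    "great phone. fast delivery",
    "the camera is great",
    "battery could be better",
    "great value for the price",
]

def idf(term, documents):
    """log(N / df), where df counts documents containing `term` as a literal substring."""
    doc_freq = sum(term in doc for doc in documents)
    if doc_freq == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(documents) / doc_freq)

idf_great = idf("great", docs)      # appears in 3 of 4 documents
idf_battery = idf("battery", docs)  # appears in 1 of 4 documents
idf_dot = idf(".", docs)            # a literal "." appears in 1 of 4 documents
```

Rare terms receive a higher weight than common ones. Like the notebook cell, this matches substrings, so a term such as `the` would also be counted inside `other`; matching on whole tokens avoids that.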


#### Term Frequency – Inverse Document Frequency (TF-IDF)

In [18]:
# TF-IDF is the product of the TF and IDF values calculated above.

tf1['tfidf'] = tf1['tf'] * tf1['idf']
tf1.head()

| | words | tf | idf | tfidf |
|---|---|---|---|---|
| 0 | it | 2 | 0.227081 | 0.454162 |
| 1 | for | 2 | 0.527489 | 1.054979 |
| 2 | the | 2 | 0.541937 | 1.083875 |
| 3 | cool | 1 | 3.3856 | 3.3856 |
| 4 | . | 1 | 0.0 | 0.0 |


In [19]:
# We don't have to compute TF and IDF separately and multiply them every time; sklearn's TfidfVectorizer computes TF-IDF directly.
# Its defaults smooth the IDF and L2-normalise each row, so the values differ from the manual calculation above.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word',
 stop_words= 'english',ngram_range=(1,1))
train_vect = tfidf.fit_transform(df['Comment'])

train_vect

<827x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 17616 stored elements in Compressed Sparse Row format>

In [20]:
# TfidfVectorizer can also perform basic pre-processing steps such as lower-casing (lowercase=True) and stop-word removal (stop_words='english'), if we haven't done them earlier.

#### Bag of Words

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1),analyzer = "word")
train_bow = bow.fit_transform(df['Comment'])
train_bow

<827x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 28682 stored elements in Compressed Sparse Row format>
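
To make the shape of that matrix concrete, here is a minimal sketch of a bag-of-words count matrix built with the standard library. The three documents are made up, and the whitespace tokenisation is a simplification: `CountVectorizer`'s default token pattern also drops punctuation and single-character tokens.

```python
from collections import Counter

# Hypothetical toy documents, for illustration only.
docs = ["good phone good price", "bad battery", "good battery"]

# Lower-case and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: sorted unique tokens; position in the list is the column index.
vocab = sorted({tok for toks in tokenized for tok in toks})

# One row per document, one column per vocabulary term, entries are raw counts.
matrix = []
for toks in tokenized:
    counts = Counter(toks)
    matrix.append([counts[term] for term in vocab])

print(vocab)
print(matrix)
```

Most entries of such a matrix are zero for a realistic vocabulary, which is why sklearn returns a sparse matrix rather than a dense array.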

#### Sentiment Analysis

If you recall, our problem was to detect the sentiment of each review. So, before applying any ML/DL models (which can take the TextBlob sentiment score as a separate feature), let’s check the sentiment of the first few comments.

In [22]:
df['Comment'][:5].apply(lambda x: TextBlob(x).sentiment)


0    (-0.014285714285714287, 0.3666666666666667)
1                                  (0.506, 0.75)
2       (0.1619047619047619, 0.6476190476190476)
3       (0.5041666666666667, 0.7305555555555555)
4       (0.4055555555555556, 0.5944444444444444)
Name: Comment, dtype: object

Above, each result is a tuple holding the polarity and subjectivity of a comment. Here, we extract only the polarity, since it indicates the sentiment: values nearer to 1 mean a positive sentiment and values nearer to -1 mean a negative sentiment. The polarity can also serve as a feature for building a machine learning model.

In [24]:
df['sentiment'] = df['Comment'].apply(lambda x: TextBlob(x).sentiment[0] )
df[['Comment','sentiment']].head()

| | Comment | sentiment |
|---|---|---|
| 0 | From last 5 years my younger brother was using... | -0.014286 |
| 1 | excellent phone camera is very nice and the st... | 0.506 |
| 2 | I have been using the earlier versions of iPho... | 0.161905 |
| 3 | IMPORTANT NOTICEIf you buy some apple device o... | 0.504167 |
| 4 | Well, what can I say... iPhone is awesome as e... | 0.405556 |
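
If a categorical label is more convenient than the raw score, the polarity can be bucketed. The sketch below is one possible mapping; the function name `polarity_label` and the 0.1 threshold are arbitrary assumptions for illustration, not values recommended by TextBlob.

```python
def polarity_label(polarity, threshold=0.1):
    """Map a polarity in [-1, 1] to 'negative', 'neutral' or 'positive'."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Polarity values of the first three comments shown earlier.
scores = [-0.014285714285714287, 0.506, 0.1619047619047619]
labels = [polarity_label(s) for s in scores]
print(labels)
```

With a DataFrame, the same function could be applied as `df['sentiment'].apply(polarity_label)`; the threshold would need tuning against the star ratings.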


#### Word Embeddings

A word embedding represents text in the form of vectors. The underlying idea is that words with similar meanings have vectors that lie close together.
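
Closeness between word vectors is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings have many more dimensions, and the `toy` dictionary is an assumption, not output from any model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(u, v) / denom)

# Hypothetical toy embeddings: "good" and "great" point in similar directions.
toy = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "table": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy["good"], toy["great"])
sim_far = cosine_similarity(toy["good"], toy["table"])
print(sim_close > sim_far)
```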

Word2Vec models need a lot of text to train well, so we can either train one on our own data or use pre-trained word vectors, such as those released by Google (Word2Vec) or Stanford (GloVe). Here we use the pre-trained GloVe vectors.

In [30]:
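# Convert the GloVe text file to word2vec text format (this adds the "vocab_size dimensions" header line).
# In gensim 4.x the conversion can be skipped: KeyedVectors.load_word2vec_format(path, binary=False, no_header=True).
from gensim.scripts.glove2word2vec import glove2word2vec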
glove_input_file = r'C:\Users\z030590\OneDrive - Alliance\Desktop\Personal\Practice\WEB_SCRAPING_ML\glove.6B\glove.6B.100d.txt'
word2vec_output_file = r'C:\Users\z030590\OneDrive - Alliance\Desktop\Personal\Practice\WEB_SCRAPING_ML\glove.6B\glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

(400000, 100)

Now we can load the converted word2vec file as a model:

In [31]:
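from gensim.models import KeyedVectors
# Load the converted Stanford GloVe vectors, reusing the output path from the conversion step
# so the file is found regardless of the current working directory.
filename = word2vec_output_file
model = KeyedVectors.load_word2vec_format(filename, binary=False)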

Let’s say our review contains the text ‘Worst Product’. We can easily obtain the vector of each word using the above model (the words are lower-cased because the GloVe 6B vocabulary is lower-case):

In [34]:
model['worst']

array([ 0.10823 ,  0.53741 ,  0.885   , -0.30311 , -0.58348 , -0.19097 ,
       -0.33086 , -0.37934 , -0.24386 ,  0.12135 , -0.094278, -0.38937 ,
        0.29407 , -0.19807 , -0.13636 ,  0.56784 , -0.2646  , -0.38357 ,
       -0.35089 ,  0.86936 ,  0.86949 ,  0.15402 ,  0.28293 ,  0.1179  ,
       -0.1568  , -0.66417 , -0.29127 ,  0.48797 ,  0.10922 , -0.40979 ,
        0.44929 , -0.4953  ,  0.4276  ,  0.37889 , -0.57139 , -0.59073 ,
       -0.34217 , -0.36057 , -0.027326,  0.12269 , -0.65577 ,  0.66994 ,
        0.21313 ,  0.013424,  0.98328 ,  1.1144  ,  1.0071  , -0.57594 ,
       -0.38352 , -1.0815  , -0.14601 , -0.40779 , -0.30278 ,  0.88406 ,
       -0.13341 , -2.0044  ,  0.55541 ,  0.53672 ,  1.4316  ,  0.61509 ,
       -0.84024 ,  0.83837 , -0.90179 , -0.52281 ,  0.69407 ,  0.42772 ,
        0.26326 , -0.46818 , -0.010143, -0.1348  , -0.1372  ,  0.09444 ,
       -0.82502 ,  1.1274  ,  0.23282 ,  0.021967, -0.46332 , -0.020093,
       -0.72729 ,  0.49759 ,  0.82981 ,  0.66017 , 

In [35]:
model['product']

array([ 0.12804  ,  0.34131  ,  0.33106  , -0.026678 , -0.022675 ,
       -1.0228   ,  0.65186  , -0.14204  ,  0.29102  ,  0.56137  ,
       -0.1294   , -0.77794  , -0.014738 , -0.0082412,  0.19769  ,
        0.42299  ,  0.64201  ,  0.89195  ,  0.28199  ,  0.038209 ,
       -0.066105 , -0.39848  , -0.025111 ,  0.45934  , -0.45628  ,
        0.36668  ,  0.56928  , -0.15604  , -0.82312  , -0.46751  ,
        0.35949  ,  0.97564  , -0.047988 , -0.47062  ,  0.65927  ,
        0.66212  ,  0.18403  , -0.052545 , -0.63723  , -0.53374  ,
        0.50934  , -0.55863  ,  0.011983 ,  0.096682 ,  0.053548 ,
        0.29566  , -0.15537  , -0.40615  , -0.58044  , -0.92148  ,
        0.61701  , -0.019925 , -0.19368  ,  0.72811  ,  0.076774 ,
       -1.6533   , -0.6374   , -0.060303 ,  1.9839   ,  0.13529  ,
        0.47406  , -0.1415   , -0.37578  ,  0.15041  ,  0.89496  ,
       -0.073249 ,  0.6373   , -0.33459  ,  0.97642  , -0.41846  ,
        0.26385  ,  0.6476   , -0.057542 ,  0.0052852,  0.3126

We then average the two vectors to represent the string ‘worst product’ as a single 100-dimensional vector:

In [36]:
(model['worst'] + model['product'])/2

array([ 0.11813501,  0.43936002,  0.60802996, -0.164894  , -0.3030775 ,
       -0.60688496,  0.1605    , -0.26069   ,  0.02358   ,  0.34136   ,
       -0.111839  , -0.583655  ,  0.139666  , -0.10315561,  0.030665  ,
        0.49541497,  0.18870498,  0.25419003, -0.03445001,  0.45378453,
        0.4016925 , -0.12223   ,  0.1289095 ,  0.28862   , -0.30654   ,
       -0.14874502,  0.13900502,  0.16596499, -0.35694999, -0.43865   ,
        0.40439   ,  0.24017   ,  0.189806  , -0.045865  ,  0.04394001,
        0.03569499, -0.07907   , -0.20655751, -0.33227798, -0.205525  ,
       -0.07321501,  0.055655  ,  0.1125565 ,  0.055053  ,  0.518414  ,
        0.70503   ,  0.425865  , -0.491045  , -0.48198   , -1.00149   ,
        0.23550001, -0.2138575 , -0.24823001,  0.806085  , -0.028318  ,
       -1.82885   , -0.04099497,  0.23820849,  1.70775   ,  0.37519002,
       -0.18309   ,  0.348435  , -0.638785  , -0.1862    ,  0.794515  ,
        0.17723551,  0.45028   , -0.401385  ,  0.4831385 , -0.27
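
The same averaging can be wrapped in a helper that handles phrases of any length and skips words missing from the vocabulary. This is a minimal sketch with a made-up 2-dimensional lookup table; the name `phrase_vector` and the zero-vector fallback are assumptions. A gensim `KeyedVectors` object also supports `in` and indexing, so it could likely be passed as `vectors` with `dim=100`.

```python
import numpy as np

def phrase_vector(phrase, vectors, dim):
    """Average the vectors of the known words in `phrase`; zeros if none are known."""
    # Lower-case to match a lower-case vocabulary such as GloVe 6B.
    words = [w for w in phrase.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim)
    return np.mean([vectors[w] for w in words], axis=0)

# Hypothetical toy vectors standing in for the pre-trained model.
toy_vectors = {
    "worst": np.array([1.0, -1.0]),
    "product": np.array([0.0, 1.0]),
}

vec = phrase_vector("Worst product ever", toy_vectors, dim=2)  # "ever" is skipped
print(vec)
```

Averaging discards word order, so ‘not good’ and ‘good not’ map to the same vector; it is a simple baseline rather than a full sentence representation.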