<img src="../css/thro.svg" align="right" width="200"> 

# Introduction to AI (PART II) - Natural Language Processing (NLP)

## Lecture 10

Now, let's use our nicely cleaned Wine Review dataset to find similar wine reviews. Each wine review is a list of terms. In order to find similar wine reviews, we therefore need to define a similarity measure on lists of terms. 
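Before turning to TF-IDF, it may help to see what a similarity measure on lists of terms can look like in its simplest form. The sketch below is not part of the lecture code: it assumes raw term counts and cosine similarity, and the helper name `term_list_cosine` and the two example reviews are made up for illustration.

```python
from collections import Counter
import math

def term_list_cosine(terms_a, terms_b):
    """Cosine similarity between two term lists, based on raw term counts."""
    counts_a, counts_b = Counter(terms_a), Counter(terms_b)
    # dot product over the terms that occur in both lists
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty list is treated as similar to nothing
    return dot / (norm_a * norm_b)

# two invented reviews (hypothetical data, not taken from the dataset)
review_a = ['good', 'fruit', 'dry', 'finish', 'good']
review_b = ['good', 'fruit', 'sweet']
print(term_list_cosine(review_a, review_b))
```

Raw counts treat every term as equally informative; the TF-IDF weighting used in Part 1 down-weights terms that appear in many reviews.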

---
## Part 1 - Code

#### Setup

In [None]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import webtext
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pickle

In [None]:
nltk.download('webtext')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# read the preprocessed data
with open('wines_lem.data', 'rb') as filehandle:
    wines_lem = pickle.load(filehandle)

with open('wine_lines.data', 'rb') as filehandle:
    wine_lines = pickle.load(filehandle)

# TF-IDF

In [None]:
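# compute the word counts for each document; the reviews are already lists of terms,
# so the analyzer simply passes each list through unchanged
cv = CountVectorizer(analyzer=lambda x: x)
word_count_vector = cv.fit_transform(wines_lem)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = cv.get_feature_names_out()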
print(word_count_vector.shape)

show = 9
# get count vector for one of the documents
show_doc_vector=word_count_vector[show]

# print the count
df = pd.DataFrame(show_doc_vector.T.todense(), index=feature_names, columns=["count"])
print(wines_lem[show])
print(df.sort_values(by=["count"],ascending=False)[:10])


(1348, 2711)
['holding', 'together', 'drying', 'touch', 'finish', 'good', 'thought', 'would', 'good']
          count
good          2
touch         1
holding       1
together      1
finish        1
drying        1
would         1
thought       1
project       0
progress      0


In [None]:
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)

# print the lowest and highest idf values
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf"])
print(df_idf.sort_values(by=['idf'])[:10])
print(df_idf.sort_values(by=['idf'])[-10:])

             idf
good    2.432567
fruit   2.516759
quite   2.601317
wine    2.718181
top     2.850533
bit     2.929004
touch   3.176681
lovely  3.196484
nose    3.203173
nice    3.244274
                      idf
honeyinfluenced  7.513972
hollower         7.513972
holiday          7.513972
hill             7.513972
highttoned       7.513972
hightonedness    7.513972
highquality      7.513972
highoned         7.513972
hibiscus         7.513972
ł20              7.513972


In [None]:
# note that many of the very frequent words have low idf values, i.e. they appear in many
# reviews
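
The idf values above can be checked by hand. With `smooth_idf=True`, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The following is a minimal sketch of that formula; the helper name `smooth_idf` is introduced here for illustration only.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed idf as computed by scikit-learn when smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1348  # number of reviews, taken from the shape printed earlier

# a term that occurs in exactly one review gets the largest possible idf
print(smooth_idf(n_docs, 1))

# a term that occurs in every review gets the smallest possible idf
print(smooth_idf(n_docs, n_docs))
```

The first value is approximately 7.514, which matches the highest idf values in the table: those terms each occur in a single review.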

In [None]:
# tf-idf scores
tf_idf_vector=tfidf_transformer.transform(word_count_vector)

show = 0
# get tfidf vector for first document
show_doc_vector=tf_idf_vector[show]

# print the scores
df = pd.DataFrame(show_doc_vector.T.todense(), index=feature_names, columns=["tfidf"])
print(wines_lem[show])
print(df.sort_values(by=["tfidf"],ascending=False)[:20])

['lovely', 'delicate', 'fragrant', 'rhone', 'wine', 'polished', 'leather', 'strawberry', 'perhaps', 'bit', 'dilute', 'good', 'drinking']
               tfidf
polished    0.403730
leather     0.374718
rhone       0.355608
dilute      0.341334
delicate    0.329937
strawberry  0.295983
fragrant    0.248677
drinking    0.216542
perhaps     0.211834
lovely      0.181546
bit         0.166354
wine        0.154380
good        0.138159
port        0.000000
portugese   0.000000
porty       0.000000
positively  0.000000
positive    0.000000
pool        0.000000
posse       0.000000


# Compute similar wine reviews

In [None]:
similarities = cosine_similarity(tf_idf_vector)
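
To make clear what `cosine_similarity` returns, here is a small NumPy sketch of the same computation on a dense toy matrix. It is an illustration under simplifying assumptions (dense input, invented values), not the scikit-learn implementation; the function name `cosine_similarity_matrix` is hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of a dense matrix."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as zeros instead of dividing by 0
    unit = X / norms         # scale every row to length 1
    return unit @ unit.T     # dot products of unit vectors are cosines

# toy document vectors (invented values)
toy = np.array([[1.0, 0.0, 1.0],
                [2.0, 0.0, 2.0],
                [0.0, 3.0, 0.0]])
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 3))
```

Rows 0 and 1 point in the same direction and therefore have similarity 1, while row 2 shares no terms with them and has similarity 0. Since `TfidfTransformer` normalises each row to unit length by default (`norm='l2'`), the cosine similarity of TF-IDF vectors is simply their dot product.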

In [None]:
index = 107
df = pd.DataFrame(similarities[index], index=wine_lines, columns=["similarity"])
df['#']=np.arange(0, len(df))
df.sort_values(by=["similarity"],ascending=False)[:20]

Unnamed: 0,similarity,#
"i don't get to try an awful lot of bacchus nowadays, although way back i used to attend the english wine fair and try samples there. i can't say i miss it much. this wine is floral, grapefruity and with hints of sauvignon-like flavours. strangely for a cleanly made 11% white wine, it isn't the sort of stuff you want to drink a whole bottle of - it's not unpleasant, just that the acidity and fruit somehow manage to dull the tastebuds. hopefully the champagne style pinot that was also sent will be more interesting. *",1.0,107
"i don't want to be unfair to this wine (it was a quick replacement for the undrinkable lamaione), but this isn't quite of the standard that previous vintages of this wine used to be - although, possibly, it is just a bit closed. *",0.187396,333
"on the whole, well made stuff, but there does seem to be an odd dull metallic note to the wine. otherwise there is soft, well-balanced fruit. **",0.174306,81
a sample sent by the winery,0.170993,706
"very sweet, some mushroomyness but so sickly - might almost be a desert wine. i really don't want to drink this. i can't even bring myself to give it stars on an ""if you like this sort of thing"" basis. no stars",0.145233,98
method champenois from chardonnay and pinot - idiosynchratic wine with a touch of pink and some high-toned red wine aromas. i'd like to try some more. at least a top **,0.126328,591
"100% old-vine pinot meunier and thus interesting. floral and perhaps floury (or am i being influenced by the name of the grape) fruit, quite forward in the mouth and a bit colourless on the mid-palate with an ordinary finish. real champagne balance (quite dry) makes it pretty drinkable however, although it is hard to find much real complexity. an oddity that is worth trying, particularly if one wants to get a grip on the grape and what it might contribute to wines like moet nv, for instance. scrapes ***",0.125658,887
"a touch oxidised, but i couldn't decide whether this was the winemaking or a duff bottle. so, until i get to try it again not rated",0.117342,376
"red wine made from pinot noir, of course. fairly complex nose, bone dry - seems a bit flat and old on the mid-palate. all the interest is on the nose. rather dull in the mouth. if fizz could be made with these grapes, then this is a bit of a waste except that it gives some insight into the nature of fruit that makes good champagne. bare **",0.116731,433
"i have quite a lot of this wine, and i have been a bit concerned of more recent times whether i needed to rush to drink it up. on this bottle's showing, i need not have too many worries about that. not quite up to the standard of the previous wine (it somehow lacks the breeding), but good nonetheless with fruit and maturity working well together. no rush to drink up. ****",0.115462,799


# Word2Vec

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space. [wikipedia]
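
The idea that words with similar contexts end up close together can be illustrated with a toy example. The three-dimensional vectors below are invented for illustration and do not come from any trained model; `cosine` and `most_similar` are hypothetical helpers that mimic, in a very reduced form, what a nearest-neighbour lookup in an embedding space does.

```python
import numpy as np

# hypothetical 3-dimensional embeddings, invented for this example
toy_vectors = {
    'wine':  np.array([0.9, 0.8, 0.1]),
    'beer':  np.array([0.8, 0.9, 0.2]),
    'grape': np.array([0.7, 0.6, 0.3]),
    'car':   np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, vectors):
    """Rank all other words by cosine similarity to the given word."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

for other, score in most_similar('wine', toy_vectors):
    print(other, round(score, 3))
```

In this toy space 'beer' and 'grape' rank ahead of 'car' because their vectors point in nearly the same direction as the vector for 'wine'.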

In [None]:
import gensim
import gensim.downloader as api

In [None]:
# load a pretrained word embedding model - this one (a GloVe model) has 400,000 words
# with vectors of length 50 and was trained on Wikipedia 2014 plus the Gigaword 5 dataset
# see https://github.com/RaRe-Technologies/gensim-data
# and https://catalog.ldc.upenn.edu/LDC2011T07
model = api.load("glove-wiki-gigaword-50")



In [None]:
model['wine']

array([-0.1145  ,  0.75404 , -1.6432  , -0.61038 ,  0.60352 , -0.56396 ,
       -1.0069  , -0.44103 ,  0.61256 ,  1.1812  ,  0.18128 ,  0.30032 ,
        1.1817  , -0.62548 ,  1.2156  , -0.30738 ,  0.54095 ,  0.53758 ,
       -0.026086, -1.7387  ,  0.46533 , -0.62835 ,  0.50936 ,  1.1192  ,
       -0.74747 , -0.57528 , -0.9203  ,  0.98612 ,  0.29107 ,  0.60208 ,
        1.9703  , -0.27461 , -0.34921 ,  0.44141 ,  0.64402 , -0.32353 ,
       -1.4541  ,  1.1472  ,  0.86875 , -0.074512,  0.85632 ,  0.59341 ,
        0.4655  , -0.0387  ,  0.26463 ,  0.94151 , -0.27335 , -0.085403,
        0.12693 , -0.23861 ], dtype=float32)

In [None]:
model.most_similar("wine")

[('wines', 0.8682538866996765),
 ('tasting', 0.8336498737335205),
 ('coffee', 0.8141363263130188),
 ('beer', 0.7646089792251587),
 ('champagne', 0.7601780891418457),
 ('drink', 0.748607873916626),
 ('taste', 0.7410123348236084),
 ('grape', 0.7345452904701233),
 ('drinks', 0.727954626083374),
 ('beers', 0.7146300673484802)]

In [None]:
# vocabulary size (gensim >= 4 exposes key_to_index; older releases used model.vocab)
print(len(model.key_to_index))

400000


In [None]:
# remove all words not in the pre-trained vocabulary (nested list comprehension);
# "w in model" tests vocabulary membership without relying on model.vocab
wines_vo = [[w for w in wine if w in model] for wine in wines_lem]

In [None]:
# check if there are "empty" wine reviews now, i.e. reviews without any words
sum(1 for wine in wines_vo if len(wine) == 0)

117

In [None]:
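# remove all these empty wine reviews (from both the word lists and the original data);
# plain list filtering avoids building a NumPy array from lists of unequal length
notempty = [len(wine) > 0 for wine in wines_vo]
wines_fwc = [wine for wine, keep in zip(wines_vo, notempty) if keep]
wine_lines_fwc = [line for line, keep in zip(wine_lines, notempty) if keep]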
print(len(wines_fwc))
print(len(wine_lines_fwc))

1231
1231




In [None]:
# compute the document vectors by averaging the word vectors
# (wines_fwc only contains words from the model vocabulary, so no further filter is needed)
rr_wv = [np.mean([model[w] for w in r], axis=0) for r in wines_fwc]
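
The averaging step can be seen in isolation on a toy example. The two-dimensional vectors and the helper `document_vector` below are hypothetical and only serve to show how the mean of the word vectors becomes the document vector, and why reviews without any known word had to be removed beforehand.

```python
import numpy as np

# hypothetical 2-dimensional word vectors, invented for this example
toy_model = {
    'dry':   np.array([1.0, 0.0]),
    'fruit': np.array([0.0, 1.0]),
    'good':  np.array([1.0, 1.0]),
}

def document_vector(terms, vectors):
    """Average the vectors of all known terms; None if no term is known."""
    known = [vectors[t] for t in terms if t in vectors]
    if not known:
        return None  # nothing to average, the review would be "empty"
    return np.mean(known, axis=0)

# 'zzz' is not in the toy vocabulary and is ignored
print(document_vector(['dry', 'fruit', 'good', 'zzz'], toy_model))
print(document_vector(['zzz'], toy_model))
```

Averaging discards word order and gives every word the same weight, so long reviews with many common words tend to end up with very similar document vectors.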

In [None]:
rr_wv

[array([ 0.05163078,  0.20983793, -0.8738954 , -0.1546394 ,  0.36209074,
         0.16506353, -0.22556415, -0.18489629, -0.01367631,  0.6129858 ,
        -0.13323453,  0.15231685,  0.40892935, -0.07278054,  0.16604792,
         0.268915  ,  0.23744695,  0.21520002,  0.19899003, -0.9826338 ,
        -0.09097776,  0.14750537,  0.33440524, -0.12985508,  0.08135768,
        -0.54591244, -0.51147926,  0.9965726 ,  0.6901459 , -0.23485279,
         1.8015813 , -0.00958455, -0.03335731,  0.29535893,  0.330269  ,
         0.01913745, -0.535178  ,  0.61187774,  0.2348362 , -0.37630552,
         0.03307088,  0.16704464,  0.03899822, -0.00318461,  0.2745731 ,
         0.22953516,  0.10814829, -0.19573846, -0.04113436, -0.08179228],
       dtype=float32),
 array([-0.1451496 ,  0.17380565, -0.7525051 ,  0.10118982,  0.5748633 ,
         0.512425  , -0.04410667, -0.20181395, -0.11186501,  0.273455  ,
        -0.2732335 ,  0.03128866,  0.6601534 , -0.24845807,  0.0182195 ,
        -0.174082  ,  0.128

In [None]:
# compute the cosine-similarity matrix
sim_dv = cosine_similarity(rr_wv)

In [None]:
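# find the most similar reviews for review # 100 (the same review as # 107 in the TF-IDF
# part; the indices shifted because the empty reviews were removed)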
index = 100
df = pd.DataFrame(sim_dv[index], index=wine_lines_fwc, columns=["similarity"])
df['#']=np.arange(0, len(df))
df.sort_values(by=["similarity"],ascending=False)[:20]

Unnamed: 0,similarity,#
"i don't get to try an awful lot of bacchus nowadays, although way back i used to attend the english wine fair and try samples there. i can't say i miss it much. this wine is floral, grapefruity and with hints of sauvignon-like flavours. strangely for a cleanly made 11% white wine, it isn't the sort of stuff you want to drink a whole bottle of - it's not unpleasant, just that the acidity and fruit somehow manage to dull the tastebuds. hopefully the champagne style pinot that was also sent will be more interesting. *",1.0,100
honied - interesting green-tinged high-toned fruit. bone-dry and rather nice. time to drink though. bare ***,0.97472,82
"100% old-vine pinot meunier and thus interesting. floral and perhaps floury (or am i being influenced by the name of the grape) fruit, quite forward in the mouth and a bit colourless on the mid-palate with an ordinary finish. real champagne balance (quite dry) makes it pretty drinkable however, although it is hard to find much real complexity. an oddity that is worth trying, particularly if one wants to get a grip on the grape and what it might contribute to wines like moet nv, for instance. scrapes ***",0.974085,800
"the last time i had this, in 2002, it seemed a weedy wine that probably wasn't going to live to its youthful potential. it just had the odd glimpse suggesting it might improve - but i wasn't that convinced. fortunately though, this bottle at least is on lovely form. a dry wine without flashiness, but lovely toned acidity and flavours that are constantly developing in the glass. and to think that i almost decided to drink all my bottles in 2002. an easy ****",0.973052,321
my memory of vintages of this provence wine from some years ago is of the fruit being a bit sweet - this though seems rather good: provencal herbs over cassis fruit with a lot of smoky cigar box. very much a point and it has kept well. a little simple perhaps but i believe this was not a particularly good vintage for the property. good effort - i'd like to try some other vintages. ***,0.968854,1196
"promoted conversation - apart from anything else putting a wax seal on an australian wine is making a statement. a touch mineral, quite complex fruit - pure too and not overstated. personally, i still find an overlying minty sweetness in the finish not to my taste although that might be less noticable with the right food. interesting wine that somebody with more of a taste for australian style might rate a bit more. still ***",0.967369,1147
"from magnum. i guess this was always going to be a come down after the previous wine (i rather naughtily tasted them in this order when others at the table were more conventional in their drinking order). this is reliable stuff that is drinking well with its tarry, coffee and chocolate overtones. good claret that, enjoyable though it is (i am drinking it whilst typing these notes), gets relegated, for now, to the luncheon division. a good ***",0.967261,223
"fairly vinous and with good, sharply focused acidity at the end. a champagne that tastes like a good young fizzy burgundy. this needs time, and could be stunning. ***(**)",0.967221,380
"turkish delight. dry. could be a bit tiring although perhaps gewurz is ever that way. nice balance of rose-scented fruit with a dry palate. good length. another wine i'dlike to try ""properly"". a good ***",0.966289,265
"a bit extracted and a bitter for my palate. not my sort of thing, although there's alot of wine in the glass. perhaps i might warm to it with more time. good **",0.965443,275


In [None]:
# EOF