<img src="../css/thro.svg" align="right" width="200"> 

# Introduction to AI (PART II) - Natural Language Processing (NLP)

## Lecture 10

Now let's use our cleaned Wine Review dataset to find similar wine reviews. Each review is a list of terms, so finding similar reviews requires a similarity measure defined on lists of terms.
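To make the idea of a similarity measure on term lists concrete, here is a minimal sketch that turns each list into raw term counts and compares them with cosine similarity. The three example reviews and the helper name `cosine_similarity_terms` are made up for illustration; the notebook itself uses scikit-learn's `cosine_similarity` on TF-IDF vectors.

```python
from collections import Counter
import math


def cosine_similarity_terms(terms_a, terms_b):
    """Cosine similarity between two term lists, based on raw term counts."""
    counts_a, counts_b = Counter(terms_a), Counter(terms_b)
    # dot product: only terms present in both lists contribute
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty list is treated as similar to nothing
    return dot / (norm_a * norm_b)


# hypothetical, already lemmatized reviews
review_a = ["ripe", "cherry", "oak", "tannin"]
review_b = ["cherry", "oak", "vanilla", "tannin"]
review_c = ["crisp", "citrus", "mineral"]

sim_ab = cosine_similarity_terms(review_a, review_b)  # three of four terms shared
sim_ac = cosine_similarity_terms(review_a, review_c)  # no terms shared
print(sim_ab, sim_ac)
```

Raw counts treat every term as equally important, which is the weakness that the TF-IDF weighting used in the cells further down is meant to address.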

---
## Part 1 - Code

#### Setup

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import webtext
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pickle

In [2]:
nltk.download('webtext')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package webtext to
[nltk_data]     C:\Users\Felix\AppData\Roaming\nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Felix\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Felix\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Felix\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
# read the preprocessed data
with open('wines_lem.data', 'rb') as filehandle:
    wines_lem = pickle.load(filehandle)

with open('wine_lines.data', 'rb') as filehandle:
    wine_lines = pickle.load(filehandle)

FileNotFoundError: [Errno 2] No such file or directory: 'wines_lem.data'

# TF-IDF

In [None]:
# compute the word counts for each document
cv=CountVectorizer(analyzer=lambda x:x)
word_count_vector=cv.fit_transform(wines_lem)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = cv.get_feature_names_out()
print(word_count_vector.shape)

show = 9
# get count vector for one of the documents
show_doc_vector=word_count_vector[show]

# print the count
df = pd.DataFrame(show_doc_vector.T.todense(), index=feature_names, columns=["count"])
print(wines_lem[show])
print(df.sort_values(by=["count"],ascending=False)[:10])


In [None]:
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)

# print the lowest and highest idf values
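# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf"])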
print(df_idf.sort_values(by=['idf'])[:10])
print(df_idf.sort_values(by=['idf'])[-10:])

In [None]:
# note that the words with the lowest idf values are the very frequent ones: they appear
# in many reviews and therefore say little about any single review
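The relationship between document frequency and idf can be checked by hand. The sketch below applies the smoothed formula that scikit-learn documents for `smooth_idf=True`, idf(t) = ln((1 + n) / (1 + df(t))) + 1, to a tiny made-up corpus; the four documents and the helper `smoothed_idf` are assumptions for illustration and are not part of the wine data.

```python
import math

# hypothetical mini corpus of already tokenized reviews
docs = [
    ["wine", "cherry", "oak"],
    ["wine", "citrus"],
    ["wine", "cherry", "tannin"],
    ["wine", "mineral"],
]
n_docs = len(docs)


def smoothed_idf(term):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, with df(t) = number of docs containing t."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + n_docs) / (1 + df)) + 1


vocabulary = sorted({t for d in docs for t in d})
idf = {t: smoothed_idf(t) for t in vocabulary}

# print terms from lowest to highest idf
for term, value in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{term:8s} {value:.3f}")
```

A term that occurs in every document ("wine" here) ends up at the minimum idf of 1, while a term that occurs in a single document gets the highest value.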

In [None]:
# tf-idf scores
tf_idf_vector=tfidf_transformer.transform(word_count_vector)

show = 0
# get tfidf vector for first document
show_doc_vector=tf_idf_vector[show]

#print the scores
df = pd.DataFrame(show_doc_vector.T.todense(), index=feature_names, columns=["tfidf"])
print(wines_lem[show])
print(df.sort_values(by=["tfidf"],ascending=False)[:20])

# Compute similar wine reviews

In [None]:
similarities = cosine_similarity(tf_idf_vector)

In [None]:
index = 107
df = pd.DataFrame(similarities[index], index=wine_lines, columns=["similarity"])
df['#']=np.arange(0, len(df))
df.sort_values(by=["similarity"],ascending=False)[:20]

# Word2Vec

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space. [wikipedia]
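What "located close to one another" means can be illustrated without a trained model. The sketch below uses four hand-made 3-dimensional vectors (hypothetical values, chosen only so that the drink-related words point in a similar direction) and ranks neighbours by cosine similarity. The helper `most_similar` merely mimics the idea behind the gensim method of the same name and is not its implementation.

```python
import math

# hand-made "embeddings": hypothetical values, not taken from any trained model
toy_vectors = {
    "wine":   [0.9, 0.8, 0.1],
    "beer":   [0.8, 0.9, 0.2],
    "grape":  [0.7, 0.5, 0.1],
    "laptop": [0.1, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, topn=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine(toy_vectors[word], vec))
        for other, vec in toy_vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


neighbours = most_similar("wine")
for other, score in neighbours:
    print(f"{other:8s} {score:.3f}")
```

With these toy values the two drink-related words score far higher than "laptop"; a real model learns such directions from co-occurrence statistics in its training corpus.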

In [None]:
import gensim
import gensim.downloader as api

In [None]:
# load a pretrained word embedding model (GloVe vectors, used here like word2vec vectors):
# it has 400,000 words with vectors of length 50 and was trained on Wikipedia 2014 plus
# the Gigaword 5 dataset
# see https://github.com/RaRe-Technologies/gensim-data
# and https://catalog.ldc.upenn.edu/LDC2011T07
model = api.load("glove-wiki-gigaword-50")

In [None]:
model['wine']

In [None]:
model.most_similar("wine")

In [None]:
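# vocabulary size; in gensim >= 4.0 the vocabulary is exposed as key_to_index
# (model.vocab was removed)
print(len(model.key_to_index))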

In [None]:
# remove all words not in the pre-trained vocabulary (nested list comprehension)
# (gensim >= 4.0: membership is tested against model.key_to_index instead of model.vocab)
wines_vo = [[w for w in wine if w in model.key_to_index] for wine in wines_lem]

In [None]:
# check if there are "empty" wine reviews now, i.e. reviews without any words
sum(1 for wine in wines_vo if len(wine) == 0)

In [None]:
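# remove all these empty wine reviews (from both the word lists and the original data)
notempty = [len(wine) > 0 for wine in wines_vo]
# filter with a list comprehension: the reviews differ in length, so np.array(wines_vo)
# would be a ragged array, which recent NumPy versions reject
wines_fwc = [wine for wine, keep in zip(wines_vo, notempty) if keep]
wine_lines_fwc = np.array(wine_lines)[notempty]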
print(len(wines_fwc))
print(len(wine_lines_fwc))

In [None]:
# compute the document vectors by averaging the word vectors
# (gensim >= 4.0: model.key_to_index replaces model.vocab)
rr_wv = [np.mean([model[w] for w in r if w in model.key_to_index], axis=0) for r in wines_fwc]
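The averaging step can be traced on a toy example. The sketch below assumes a two-word vocabulary with made-up 2-dimensional vectors. The helper `document_vector` is hypothetical and returns `None` for a review without any known word, which is the case the notebook handles by dropping empty reviews beforehand.

```python
# hypothetical 2-dimensional word vectors, for illustration only
toy_vectors = {
    "cherry": [1.0, 0.0],
    "oak":    [0.0, 1.0],
}


def document_vector(terms):
    """Average the vectors of all in-vocabulary terms, dimension by dimension."""
    known = [toy_vectors[t] for t in terms if t in toy_vectors]
    if not known:
        return None  # no known word, so no vector can be computed
    # zip(*known) groups the values of each dimension across all word vectors
    return [sum(dim) / len(known) for dim in zip(*known)]


doc_a = ["cherry", "oak", "unknownword"]  # the out-of-vocabulary term is skipped
doc_b = ["oak", "cherry"]                 # same known words, different order

vec_a = document_vector(doc_a)
vec_b = document_vector(doc_b)
print(vec_a, vec_b)
```

Both reviews map to the same vector, which shows the main limitation of this representation: word order and out-of-vocabulary terms leave no trace in the document vector.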

In [None]:
rr_wv

In [None]:
# compute the cosine-similarity matrix
sim_dv = cosine_similarity(rr_wv)

In [None]:
# find the most similar reviews for review # 100
index = 100
df = pd.DataFrame(sim_dv[index], index=wine_lines_fwc, columns=["similarity"])
df['#']=np.arange(0, len(df))
df.sort_values(by=["similarity"],ascending=False)[:20]
