<b>The goal of this project is to train a predictive language model, Word2Vec, to learn word associations from a large corpus and encode their relatedness as vector similarity. The trained model can then be used to detect synonymous words or to suggest additional words for a partial sentence.</b>

The dataset was downloaded from Kaggle and contains 3.6 million random Amazon reviews.

In [156]:
import numpy as np
import pandas as pd
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
%matplotlib inline
import gensim
import logging
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level = logging.ERROR) 

In [2]:
df = pd.read_csv(r'/Users/dannystatland/Drive/MBA/machine_learning/ex2/amazon_reviews.csv')

In [3]:
# The CSV has no header row, so pandas used the first record as the column names.
# Rename the column that holds the review text to 'review'.
df = df.rename(columns={'This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^': 'review'})

In [8]:
# Drop the other two columns, whose names are likewise values from the first record.
df.drop(['2','Stuning even for the non-gamer'], axis=1, inplace=True)
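The rename and drop above are needed because `read_csv` treated the first record as a header, which also costs one review. A minimal sketch of an alternative, assuming the file has three headerless columns: the names `label`, `title` and `review` are hypothetical labels chosen for illustration, and the in-memory rows are made up rather than taken from the dataset.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the headerless CSV (made-up rows, not real data).
raw = io.StringIO(
    '2,"Great CD","Lovely music"\n'
    '1,"Waste of time","Full of typos"\n'
)

# header=None keeps the first record as data; the column names are hypothetical.
toy_df = pd.read_csv(raw, header=None, names=["label", "title", "review"])

# Keep only the review text, as in the cells above.
toy_df = toy_df[["review"]]
print(toy_df.shape)
```

With this approach no row is consumed as a header, so a full-size file would keep all of its records.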

In [157]:
df.head(20)

Unnamed: 0,review
0,I'm reading a lot of reviews saying that this ...
1,This soundtrack is my favorite music of all ti...
2,I truly like this soundtrack and I enjoy video...
3,"If you've played the game, you know how divine..."
4,I am quite sure any of you actually taking the...
5,"This is a self-published book, and if you want..."
6,I loved Whisper of the wicked saints. The stor...
7,I just finished reading Whisper of the Wicked ...
8,This was a easy to read book that made me want...
9,A complete waste of time. Typographical errors...


In [14]:
df.shape

(3599999, 1)

In [42]:
content = [gensim.utils.simple_preprocess(line) for line in df['review']]
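To make the tokenization step concrete, here is a rough approximation of what `simple_preprocess` does with its default settings: lowercase the text, keep alphabetic tokens, and drop tokens shorter than 2 or longer than 15 characters. The function `approx_simple_preprocess` is a hypothetical stand-in written for illustration, not gensim's actual implementation.

```python
import re

# Alphabetic tokens only: word characters that are not digits.
_TOKEN = re.compile(r"(?:(?!\d)\w)+")

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, drop very short or very long ones."""
    tokens = _TOKEN.findall(text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

print(approx_simple_preprocess("This sound track was beautiful! It has 2 CDs ^_^"))
```

One consequence is that every vocabulary key is a single word: a phrase such as "new balance" becomes the two separate tokens `new` and `balance`.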

In [43]:
model = gensim.models.Word2Vec(sentences=content, 
    vector_size=300, 
    window=7, 
    min_count=2, 
    epochs=5,
    sg=0, 
    workers=12) 
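With `sg=0` the model is trained as CBOW: each word is predicted from the words around it, up to `window` positions on either side. The sketch below lists the (context, target) pairs for a fixed window. It is a simplification for illustration only; the helper `cbow_pairs` is hypothetical, and gensim's training loop differs in detail (for example, by default it randomly shrinks the effective window at each position).

```python
def cbow_pairs(tokens, window):
    """Return (context_words, target_word) pairs for a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

for context, target in cbow_pairs(["the", "coffee", "was", "hot"], window=1):
    print(context, "->", target)
```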

In [44]:
print(f"Size of the corpus: {model.corpus_count}. Total number of words: {model.corpus_total_words}")
print(f"Vector size: {model.vector_size}. Dictionary size: {len(model.wv)}")
print(f"vector size: {model.syn1neg.shape} or {model.wv.vectors.shape}")

Size of the corpus: 3599999. Total number of words: 253120184
Vector size: 300. Dictionary size: 347228
vector size: (347228, 300) or (347228, 300)


In [79]:
# Compute the cosine distance (1 - cosine similarity) between pairs of terms
model.wv.distance('coffee', 'tea'), model.wv.distance('excellent', 'great'),  model.wv.distance('container', 'canister')

(0.33706265687942505, 0.3031187057495117, 0.23520660400390625)
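The values returned by `distance` are cosine distances, that is, 1 minus the cosine similarity of the two word vectors, so they range from 0 (same direction) to 2 (opposite direction). A minimal sketch with toy 2-D vectors that are not taken from the trained model:

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity of two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors chosen for illustration only.
print(cosine_distance([1, 0], [1, 0]))   # same direction
print(cosine_distance([1, 0], [0, 1]))   # orthogonal
print(cosine_distance([1, 0], [-1, 0]))  # opposite direction
```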

In [88]:
# Cosine distance from 'coffee' to each word in the list; the distance of
# 'coffee' to itself is 0 up to float32 rounding error
model.wv.distances('coffee', ['tea', 'coffee','latte', 'espresso'])

array([ 3.3706266e-01, -1.1920929e-07,  3.4367758e-01,  1.9195610e-01],
      dtype=float32)

In [87]:
distances = model.wv.distances('coffee') 
[model.wv.index_to_key[key] for key in np.argsort(distances)[:10]]

['coffee',
 'coffe',
 'espresso',
 'cofee',
 'cappuccino',
 'expresso',
 'brew',
 'tea',
 'latte',
 'coffeemaker']
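The nearest-neighbor lookup above amounts to computing the distance from the query word to every vocabulary vector and sorting. A small sketch of the same idea on a toy vocabulary; the words and 2-D vectors here are hypothetical values, not taken from the trained model.

```python
import numpy as np

# Toy vocabulary and vectors (hypothetical values for illustration).
keys = ["coffee", "espresso", "tea", "bulldozer"]
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.6, 0.8], [-1.0, 0.2]])

# Normalize rows so that a dot product equals cosine similarity.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
distances = 1.0 - unit @ unit[keys.index("coffee")]

# Smallest distance first; the query word itself ranks first with distance ~0.
ranked = [keys[i] for i in np.argsort(distances)]
print(ranked)
```

As in the real output, the query word is its own nearest neighbor, so it usually needs to be skipped when listing neighbors.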

In [152]:
# Get all keys that are closer to 'train' than 'plane' is
model.wv.closer_than('train', 'plane')

['bus',
 'trains',
 'wagon',
 'subway',
 'sled',
 'railway',
 'sleigh',
 'buses',
 'brio',
 'bulldozer',
 'racetrack',
 'ethie']

In [101]:
model.wv.doesnt_match(['window', 'door', 'stairs', 'paratrooper'])

'paratrooper'

In [109]:
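# Note: 'new balance' is two tokens after simple_preprocess, so it is not a vocabulary key.
# gensim skips unknown keys with a warning (hidden here by the ERROR log level),
# so the comparison is effectively between 'adidas', 'nike' and 'vans'.
model.wv.doesnt_match(['adidas', 'nike', 'new balance', 'vans'])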

'vans'

In [112]:
# Note: 'space shuttle' is likewise not a single vocabulary key and is skipped.
model.wv.doesnt_match(['plane', 'space shuttle', 'helicopter', 'car'])

'car'
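Roughly speaking, `doesnt_match` averages the unit vectors of the words it knows and returns the word that is least similar to that average. The sketch below follows that idea on toy 2-D embeddings; `odd_one_out` and the vector values are hypothetical, and the function is an illustration rather than gensim's own code. It also shows that a key missing from the vocabulary, such as a multi-word phrase, is simply left out of the comparison.

```python
import numpy as np

def odd_one_out(words, embeddings):
    """Return the known word least similar to the mean of the known unit vectors."""
    known = [w for w in words if w in embeddings]  # unknown keys are skipped
    if not known:
        raise ValueError("none of the words are in the vocabulary")
    unit = np.array([embeddings[w] / np.linalg.norm(embeddings[w]) for w in known])
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    sims = unit @ mean
    return known[int(np.argmin(sims))]

# Toy embeddings (hypothetical values, not from the trained model).
toy = {
    "window": np.array([1.0, 0.1]),
    "door": np.array([0.9, 0.2]),
    "stairs": np.array([0.8, 0.3]),
    "paratrooper": np.array([-0.2, 1.0]),
}
print(odd_one_out(["window", "door", "stairs", "paratrooper"], toy))
print(odd_one_out(["window", "door", "new balance", "paratrooper"], toy))
```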

In [117]:
#Cosine similarity between a vector and a matrix of vectors
model.wv.cosine_similarities(model.wv['nike'], model.wv.vectors).shape

(347228,)

In [154]:
# Vector arithmetic on unit-normalized vectors: travel - plane + car
from sklearn.preprocessing import normalize

v = normalize(model.wv['travel'].reshape(1,-1))  \
    - normalize(model.wv['plane'].reshape(1,-1)) \
    + normalize(model.wv['car'].reshape(1,-1))

[(key, sim) for key, sim in model.wv.most_similar(v) if key not in ['travel', 'plane', 'car']] # remove source words

[('traveling', 0.49178817868232727),
 ('portable', 0.42995724081993103),
 ('travelling', 0.42399126291275024),
 ('rv', 0.412985235452652),
 ('purse', 0.40917715430259705),
 ('motorhome', 0.4021863341331482),
 ('carry', 0.39855170249938965),
 ('transport', 0.3972422182559967)]

In [155]:
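# This query is travel + plane - car, which differs from the manual
# travel - plane + car vector built in the previous cell
model.wv.most_similar(positive=['travel', 'plane'], negative=['car'], topn=25)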

[('traveling', 0.5229520201683044),
 ('traveller', 0.5130165815353394),
 ('travelling', 0.5113769173622131),
 ('traveler', 0.5096741318702698),
 ('travelers', 0.47555598616600037),
 ('flight', 0.47495096921920776),
 ('travels', 0.4616290330886841),
 ('travellers', 0.4573076665401459),
 ('sailing', 0.45491278171539307),
 ('flights', 0.43874984979629517),
 ('machu', 0.43307238817214966),
 ('traveled', 0.4288750886917114),
 ('frommer', 0.42803341150283813),
 ('travelogue', 0.421441912651062),
 ('galapagos', 0.42039623856544495),
 ('airline', 0.41551995277404785),
 ('canoeing', 0.4136318564414978),
 ('concorde', 0.41104909777641296),
 ('backpacking', 0.40802204608917236),
 ('picchu', 0.40571582317352295),
 ('belize', 0.40074464678764343),
 ('planes', 0.40006712079048157),
 ('peru', 0.39813318848609924),
 ('voyage', 0.3971570134162903),
 ('tourism', 0.39436835050582886)]
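The `positive` and `negative` arguments can be read as signed vector arithmetic: unit vectors of the positive words are added, those of the negative words are subtracted, and the vocabulary is ranked by cosine similarity to the result, with the input words left out. The sketch below illustrates this on toy 3-D embeddings; `most_similar_sketch` and all vector values are hypothetical and only approximate what the library does.

```python
import numpy as np

def most_similar_sketch(embeddings, positive, negative=(), topn=3):
    """Rank keys by cosine similarity to mean(+positive, -negative) of unit vectors."""
    unit = {k: v / np.linalg.norm(v) for k, v in embeddings.items()}
    query = np.mean([unit[k] for k in positive] + [-unit[k] for k in negative], axis=0)
    query /= np.linalg.norm(query)
    inputs = set(positive) | set(negative)  # input words are excluded from results
    scored = [(k, float(u @ query)) for k, u in unit.items() if k not in inputs]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:topn]

# Toy embeddings (hypothetical values, not from the trained model).
toy = {
    "travel": np.array([1.0, 1.0, 0.0]),
    "plane": np.array([0.0, 1.0, 1.0]),
    "car": np.array([0.0, 1.0, -1.0]),
    "flight": np.array([1.0, 0.5, 1.0]),
    "motorhome": np.array([1.0, 0.5, -1.0]),
}
print(most_similar_sketch(toy, positive=["travel", "plane"], negative=["car"], topn=1))
print(most_similar_sketch(toy, positive=["travel", "car"], negative=["plane"], topn=1))
```

Swapping which word is subtracted changes the direction of the query, which is consistent with the two travel queries above returning different kinds of words (road travel in one case, air travel in the other).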

In [131]:
model.wv.most_similar(positive = ['nike'],topn=10)

[('speedo', 0.6434626579284668),
 ('adidas', 0.6376157999038696),
 ('asics', 0.6318009495735168),
 ('triax', 0.6263116002082825),
 ('timex', 0.6221684813499451),
 ('reebok', 0.6173878312110901),
 ('rockport', 0.6008673310279846),
 ('suunto', 0.596798300743103),
 ('seiko', 0.5903413891792297),
 ('dockers', 0.5786839127540588)]

In [143]:
model.wv.most_similar(positive = ['flower'],topn=10 )

[('flowers', 0.6981518268585205),
 ('petals', 0.6193898320198059),
 ('blooming', 0.609285295009613),
 ('flowering', 0.6071677207946777),
 ('garden', 0.5877366662025452),
 ('butterfly', 0.5793254971504211),
 ('bouquet', 0.5792271494865417),
 ('rose', 0.5418193936347961),
 ('vase', 0.5413583517074585),
 ('lilies', 0.5412375330924988)]

In [149]:
model.wv.most_similar(positive = ['winter', 'cold'], negative=['rain'],topn=25 )

[('chilly', 0.5713422298431396),
 ('colder', 0.5294924974441528),
 ('humid', 0.5113628506660461),
 ('frigid', 0.5016564130783081),
 ('climates', 0.486576110124588),
 ('coldest', 0.477715402841568),
 ('warmer', 0.4672723412513733),
 ('winters', 0.462340384721756),
 ('muggy', 0.4619383215904236),
 ('warm', 0.45892903208732605),
 ('toasty', 0.42792022228240967),
 ('summer', 0.414188414812088),
 ('sweltering', 0.4141833484172821),
 ('wintry', 0.40693551301956177),
 ('freezer', 0.39670154452323914),
 ('wintery', 0.3942726254463196),
 ('drafty', 0.3858770430088043),
 ('cozy', 0.38215172290802),
 ('heated', 0.38056182861328125),
 ('chilled', 0.3805472254753113),
 ('wintertime', 0.379264771938324),
 ('thawing', 0.37907934188842773),
 ('bloodedly', 0.3786478340625763),
 ('temperate', 0.37447041273117065),
 ('clammy', 0.36930274963378906)]

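Across these examples the model behaves largely as expected: the nearest neighbors of a word are semantically related terms (along with common misspellings such as 'coffe' and 'cofee'), the odd-one-out queries generally pick the least related word, and vector arithmetic returns words from the expected domain.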