
# A Simple Example of word2vec Trained on a Small IMDB Dataset

Download the IMDB dataset

In [1]:
!gdown --id '1vP1lVYFGTLGHjvST3kSH5pxowd_4DcAe' --output IMDB_Dataset.csv

Downloading...
From: https://drive.google.com/uc?id=1vP1lVYFGTLGHjvST3kSH5pxowd_4DcAe
To: /content/IMDB_Dataset.csv
66.2MB [00:00, 181MB/s]


In [2]:
import pandas as pd

## Load Corpus

Read the dataset from the CSV file.

In [3]:
df = pd.read_csv('./IMDB_Dataset.csv')
print (df.shape)

(50000, 2)

In [4]:
# Keep only the first fifth of the rows to shorten preprocessing and training
df = df[:len(df) // 5]
print (df.shape)

(10000, 2)

In [5]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | `A wonderful little production. <br /><br />The...` | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


## Preprocessing the dataset

In [7]:
import spacy
import string

In [8]:
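# Load the small English pipeline by its full package name (the 'en' shortcut
# was removed in spaCy 3); NER and the parser are not needed for lemmatization.
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])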

In [9]:
def cleaning(doc):
    # Lemmatize each token and drop stop words and punctuation
    txt = [token.lemma_ for token in doc if (not token.is_stop) and (token.lemma_ not in string.punctuation)]
    # Keep only documents with more than two remaining tokens;
    # shorter ones fall through and the function returns None
    if len(txt) > 2:
        return ' '.join(txt)

In [10]:
sents = [row.lower() for row in df['review']]

In [11]:
from time import time

In [12]:
start = time()
# n_threads is dropped here: it is deprecated in recent spaCy 2.x and rejected by spaCy 3
txt = [cleaning(doc) for doc in nlp.pipe(sents, batch_size=1000)]
print('Time to pre-process sentences: {} mins'.format(round((time() - start) / 60, 2)))

Time to pre-process sentences: 1.65 mins


In [13]:
print (txt[0])

reviewer mention watch 1 oz episode hook right exactly happen me.<br /><br />the thing strike oz brutality unflinche scene violence set right word trust faint hearted timid pull punch regard drug sex violence hardcore classic use word.<br /><br />it call oz nickname give oswald maximum security state penitentary focus mainly emerald city experimental section prison cell glass front face inward privacy high agenda -PRON- city home .. aryan muslims gangstas latinos christians italians irish .... scuffle death stare dodgy dealing shady agreement far away.<br /><br />i main appeal fact go show dare forget pretty picture paint mainstream audience forget charm forget romance ... oz mess episode see strike nasty surreal ready watch develop taste oz get accustomed high level graphic violence violence injustice crooked guard sell nickel inmate kill order away mannered middle class inmate turn prison bitch lack street skill prison experience watch oz comfortable uncomfortable view .... that touc

In [14]:
# cleaning() returns None for very short documents, so skip those entries
sentences = [row.split() for row in txt if row is not None]

In [15]:
from collections import defaultdict

Count word frequencies and inspect the most frequent words

In [16]:
word_freq = defaultdict(int)
for sent in sentences:
    for i in sent:
        word_freq[i] += 1
print (len(word_freq))


57810
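
The same tally can be written more compactly with `collections.Counter`. The sketch below is an alternative, not the notebook's own code; `toy_sentences` is made-up data standing in for `sentences`.

```python
from collections import Counter
from itertools import chain

# Made-up tokenized reviews standing in for the notebook's `sentences`.
toy_sentences = [
    ["movie", "good", "story"],
    ["movie", "bad", "plot"],
    ["film", "good", "movie"],
]

# Flatten the list of token lists and count every token in one pass.
toy_freq = Counter(chain.from_iterable(toy_sentences))

print(len(toy_freq))            # number of distinct tokens
print(toy_freq.most_common(2))  # the two most frequent tokens with their counts
```

`most_common(k)` returns `(word, count)` pairs, which saves the separate `sorted(...)` call when the counts themselves are of interest.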


In [17]:
print (sorted(word_freq, key=word_freq.get, reverse=True)[:100])

['/><br', 'movie', 'film', 'like', 'good', 'time', 'character', 'watch', 'bad', 'story', 'see', 'think', 'scene', 'great', 'look', '...', 'know', 'people', 'go', 'get', 'come', 'way', 'play', 'love', 'thing', 'br', '/>the', 'find', 'man', 'end', 'life', 'work', 'plot', 'actor', 'little', 'make', 'want', 'year', 'try', 'feel', 'give', 'take', 'director', 'real', 'old', 'lot', 'acting', 'performance', 'woman', 'show', 'funny', 'guy', 'big', 'tell', 'say', 'well', 'actually', 'new', 'leave', 'star', 'young', 'act', 'girl', '/>i', 'role', 'point', 'day', '--', 'start', 'turn', 'pretty', 'cast', 'horror', 'long', 'world', 'comedy', 'minute', 'fact', 'set', 'action', 'kill', 'right', 'line', 'need', 'script', 'happen', 'fan', 'friend', 'original', 'bit', 'music', 'family', 'interesting', 'write', 'series', 'enjoy', 'laugh', 'kid', 'live', 'effect']
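
The most frequent token above, `'/><br'`, along with `'br'`, `'/>the'` and `'/>i'`, is a remnant of the `<br />` tags in the raw reviews. One possible remedy, which the notebook itself does not apply, is to strip these tags before tokenization. The helper name `strip_br_tags` is hypothetical, and the results shown later in this notebook come from the uncleaned text.

```python
import re

# Hypothetical helper (not in the original notebook): matches <br>, <br/> and <br />.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def strip_br_tags(text):
    # Replace each tag with a space so neighbouring words stay separated,
    # then collapse any run of whitespace into a single space.
    return " ".join(BR_TAG.sub(" ", text).split())

sample = "exactly happen me.<br /><br />the thing strike oz"
print(strip_br_tags(sample))
```

Assuming this helper, it could be applied when building `sents`, for example `sents = [strip_br_tags(row.lower()) for row in df['review']]`.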


In [18]:
import multiprocessing

In [19]:
cores = multiprocessing.cpu_count()
print (cores)

2


## Model

Import Word2Vec from gensim and configure the model

In [20]:
from gensim.models import Word2Vec

In [21]:
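# Note: `size` is the gensim 3.x name for the embedding dimension;
# gensim 4.0 renamed it to `vector_size`.
w2v_model = Word2Vec(min_count=10,      # ignore words seen fewer than 10 times
                     window=2,          # max distance between target and context word
                     size=300,          # embedding dimension
                     sample=6e-5,       # threshold for downsampling frequent words
                     alpha=0.03,        # initial learning rate
                     min_alpha=0.0007,  # learning rate reached at the end of training
                     negative=20,       # negative samples drawn per positive example
                     workers=cores-1)   # worker threads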
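
To make `window=2` concrete: for each target word, only tokens at most two positions to its left or right count as context. The helper below, `context_windows`, is a hypothetical illustration of that rule and not gensim's implementation; it deliberately ignores details such as gensim's random window shrinking and the downsampling of frequent words.

```python
def context_windows(tokens, window):
    """Return (target, context_words) pairs using a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` tokens before the target
        right = tokens[i + 1:i + 1 + window]   # up to `window` tokens after the target
        pairs.append((target, left + right))
    return pairs

# Tokens taken from the cleaned review printed earlier.
toy = ["oz", "brutality", "unflinche", "scene", "violence"]
for target, context in context_windows(toy, window=2):
    print(target, "->", context)
```

A small window like this tends to emphasise words that are used interchangeably in the same position, which fits the synonym-style neighbours shown in the examples.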

In [22]:
start = time()

w2v_model.build_vocab(sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - start) / 60, 2)))

Time to build vocab: 0.04 mins
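
During `build_vocab`, words that occur fewer than `min_count` times are dropped from the vocabulary. The sketch below mimics only that filtering rule on made-up data; `kept_vocab` and `toy_sentences` are hypothetical names, and gensim's real vocabulary building does more (for example, preparing the tables used for sampling).

```python
from collections import Counter

def kept_vocab(sentences, min_count):
    """Return the set of words that appear at least `min_count` times."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, n in counts.items() if n >= min_count}

# Made-up corpus: "anthology" and "bad" appear only once.
toy_sentences = [
    ["movie", "good"],
    ["movie", "bad"],
    ["movie", "good", "anthology"],
]
print(sorted(kept_vocab(toy_sentences, min_count=2)))
```

With `min_count=10` on the real corpus, the trained vocabulary will therefore be smaller than the number of distinct tokens counted earlier.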


Train the model

In [23]:
start = time()

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - start) / 60, 2)))

Time to train the model: 2.64 mins
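
While training, gensim lowers the learning rate linearly from `alpha` to `min_alpha`. The sketch below shows that schedule in simplified form, evaluated at a few epoch boundaries; `learning_rate` is a hypothetical helper, and gensim itself updates the rate in finer steps as training progresses.

```python
def learning_rate(progress, alpha=0.03, min_alpha=0.0007):
    """Linearly interpolate from alpha (progress=0.0) to min_alpha (progress=1.0)."""
    return alpha - (alpha - min_alpha) * progress

epochs = 30  # same value as passed to train() above
for epoch in (0, 15, 30):
    # progress is the fraction of training completed so far
    print(epoch, round(learning_rate(epoch / epochs), 5))
```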


## Examples

Find the words most similar to a given query word

In [24]:
w2v_model.wv.most_similar(positive=["film"])

[('movie', 0.5889053344726562),
 ('anthology', 0.5001554489135742),
 ('rushed', 0.4788359999656677),
 ('mislead', 0.4755280017852783),
 ('drawback', 0.4727940857410431),
 ('mishmash', 0.46440500020980835),
 ('undeniably', 0.46284446120262146),
 ('rewarding', 0.4603988528251648),
 ('beginning.<br', 0.45827949047088623),
 ('movies.<br', 0.4580785632133484)]
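
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below computes the same measure by hand on made-up three-dimensional vectors (the notebook's vectors have 300 dimensions); `cosine_similarity` and `toy_vectors` are hypothetical names used only for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors; real embeddings would come from w2v_model.wv.
toy_vectors = {
    "film": [0.9, 0.1, 0.3],
    "movie": [0.8, 0.2, 0.4],
    "kill": [-0.2, 0.9, 0.1],
}

query = "film"
# Score every other word against the query and sort from most to least similar.
ranked = sorted(
    ((word, cosine_similarity(toy_vectors[query], vec))
     for word, vec in toy_vectors.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

A score near 1 means the two vectors point in almost the same direction; a score near 0 means they are close to orthogonal.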

In [25]:
w2v_model.wv.most_similar(positive=["great"])

[('excellent', 0.5910580158233643),
 ('good', 0.5851525664329529),
 ('wonderful', 0.571022629737854),
 ('fine', 0.5555368661880493),
 ('amazing', 0.499137282371521),
 ('ensemble', 0.49768131971359253),
 ('terrific', 0.47103047370910645),
 ('notch', 0.4619970917701721),
 ('incredible', 0.4571254551410675),
 ('uplift', 0.4567105770111084)]

In [32]:
w2v_model.wv.most_similar(positive=["bad"])

[('terrible', 0.6058905124664307),
 ("it's", 0.5963342189788818),
 ('awful', 0.5849742889404297),
 ('good', 0.5650036334991455),
 ('embarrassingly', 0.5620014667510986),
 ('stink', 0.5584094524383545),
 ('atrocious', 0.5563165545463562),
 ('/>bad', 0.5522723197937012),
 ('horrible', 0.5477124452590942),
 ('horrid', 0.5449392199516296)]

In [33]:
w2v_model.wv.most_similar(positive=["interesting"])

[('fascinating', 0.5521016120910645),
 ('plausible', 0.5142442584037781),
 ('potentially', 0.5137043595314026),
 ('intriguing', 0.5026283264160156),
 ('riveting', 0.5024534463882446),
 ('promising', 0.5011600255966187),
 ('rushed', 0.49501514434814453),
 ('plotline', 0.48673373460769653),
 ('exciting', 0.4854947030544281),
 ('cohesive', 0.4826468527317047)]

In [35]:
w2v_model.wv.most_similar(positive=["role"])

[('actress', 0.5281485319137573),
 ('deol', 0.5038964748382568),
 ('purdom', 0.5031288862228394),
 ('neeson', 0.5002285242080688),
 ('actor', 0.495933473110199),
 ('danes', 0.48791512846946716),
 ('leading', 0.48751896619796753),
 ('reprise', 0.4852789640426636),
 ('pamela', 0.4844309389591217),
 ('natasha', 0.4792976677417755)]

In [36]:
w2v_model.wv.most_similar(positive=["story"])

[('plot', 0.5535306930541992),
 ('tale', 0.5447912216186523),
 ('storyline', 0.5088076591491699),
 ('cohesive', 0.5021694898605347),
 ('plotline', 0.48602330684661865),
 ('enchanting', 0.48469215631484985),
 ('intricate', 0.4772065281867981),
 ('developed', 0.47224918007850647),
 ('retelling', 0.4704653024673462),
 ('engross', 0.467024028301239)]