[View in Colaboratory](https://colab.research.google.com/github/Mujadded/Bangla-Digit-Recognition-Kaggle-Numta-Competition/blob/master/word2vec_gensim.ipynb)

**Description**

This sample notebook downloads a TED talk dataset and trains a Word2Vec model on it.

Tags:

*   Gensim simple tutorial
*   Gensim training with 16MB TED talk dataset
*   Basics of Gensim
*   FastText



### Install Dependencies

In [1]:
!pip install lxml

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/eb/59/1db3c9c27049e4f832691c6d642df1f5b64763f73942172c44fee22de397/lxml-4.2.4-cp36-cp36m-manylinux1_x86_64.whl (5.8MB)
Installing collected packages: lxml
Successfully installed lxml-4.2.4


In [2]:
!pip install gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/86/f3/37504f07651330ddfdefa631ca5246974a60d0908216539efda842fd080f/gensim-3.5.0-cp36-cp36m-manylinux1_x86_64.whl (23.5MB)
Collecting smart-open>=1.2.1 (from gensim)
  Downloading https://files.pythonhosted.org/packages/cf/3d/5f3a9a296d0ba8e00e263a8dee76762076b9eb5ddc254ccaa834651c8d65/smart_open-1.6.0.tar.gz
Collecting boto>=2.32 (from smart-open>=1.2.1->gensim)
  Downloading https://files.pythonhosted.org/packages/23/10/c0b78c27298029e4454a472a1919bde20cb182dab1662cec7f2ca1dcc523/boto-2.49.0-py2.py3-none-any.whl (1.4MB)
Collecting bz2file (from smart-open>=1.2.1->gensim)
  Downloading https://files.pythonhosted.org/packages/61/39/122222b5e85cd41c391b68a99ee296584b2a2d1d233e7ee32b4532384f2d/bz2file-0.98.tar.gz
Collecting boto3 (from smart-open>=1.2.1->gensim)
  Downloa

### Import dependencies

In [0]:
import numpy as np
import os
from random import shuffle
import re
import urllib.request
import zipfile
import lxml.etree

### Data Download

This cell downloads the data from the given URL. To use your own data instead, put it in a text file and read it into a string named "input_text"; the rest of the notebook works unchanged.

In [0]:
#download the data
urllib.request.urlretrieve("https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip", filename="ted_en-20160408.zip")
# extract subtitle
with zipfile.ZipFile('ted_en-20160408.zip', 'r') as z:
    doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
input_text = '\n'.join(doc.xpath('//content/text()'))
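If you would rather train on your own corpus, the download step can be swapped for a plain file read. The sketch below is only an illustration: the file name my_corpus.txt and the helper load_text are made up for this example, and the file is first written to a temporary folder so that the snippet runs on its own.

```python
import os
import tempfile


def load_text(path, encoding="utf-8"):
    """Read a whole text file into one string (hypothetical helper)."""
    with open(path, encoding=encoding) as f:
        return f.read()


# Create a tiny stand-in corpus so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = os.path.join(tmp_dir, "my_corpus.txt")  # assumed file name
    with open(corpus_path, "w", encoding="utf-8") as f:
        f.write("First line of my corpus.\nSecond line of my corpus.")

    # This string plays the role of input_text in the rest of the notebook.
    input_text = load_text(corpus_path)

print(input_text[0:200])
```

With a real corpus you would drop the temporary-file part and pass the path of your own file to load_text.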

Print the first 200 characters to see the text the Word2Vec model will be trained on.

In [5]:
input_text[0:200]

"Here are two reasons companies fail: they only do more of the same, or they only do what's new.\nTo me the real, real solution to quality growth is figuring out the balance between two activities: expl"

### Data Preprocessing

In [0]:
# remove parenthesized text
input_text_noparens = re.sub(r'\([^)]*\)', '', input_text)
# store as list of sentences
sentences_strings_ted = []
for line in input_text_noparens.split('\n'):
    # drop a leading label of up to 20 characters followed by a colon (such as a speaker name)
    m = re.match(r'^(?:(?P<precolon>[^:]{,20}):)?(?P<postcolon>.*)$', line)
    # split the rest of the line on periods and keep the non-empty pieces
    sentences_strings_ted.extend(sent for sent in m.groupdict()['postcolon'].split('.') if sent)
# store as list of lists of words: lowercase, then split on anything that is not a letter or digit
sentences_ted = []
for sent_str in sentences_strings_ted:
    tokens = re.sub(r"[^a-z0-9]+", " ", sent_str.lower()).split()
    sentences_ted.append(tokens)
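To see what each preprocessing step does, it can help to run the same regular expressions on a two-line sample. The sketch below wraps the steps in a function named preprocess (a name invented here); the sample text is made up, and unlike the cell above it also skips sentences that end up with no tokens.

```python
import re


def preprocess(text):
    """Turn raw text into a list of token lists, one list per sentence."""
    # Remove parenthesized text.
    text = re.sub(r'\([^)]*\)', '', text)
    sentences = []
    for line in text.split('\n'):
        # Strip a short "label:" prefix of at most 20 characters.
        m = re.match(r'^(?:(?P<precolon>[^:]{0,20}):)?(?P<postcolon>.*)$', line)
        for sent in m.group('postcolon').split('.'):
            # Lowercase and keep only runs of letters and digits.
            tokens = re.sub(r'[^a-z0-9]+', ' ', sent.lower()).split()
            if tokens:  # assumption of this sketch: drop empty sentences
                sentences.append(tokens)
    return sentences


sample = "Speaker One: Welcome to TED. (Applause)\nIt's great, isn't it?"
result = preprocess(sample)
print(result)
```

Note how the apostrophes split "It's" and "isn't" into separate tokens, the same effect that turns "what's" into 'what' and 's' in the TED data.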

Inspecting the data

In [7]:
sentences_ted[0]

['here',
 'are',
 'two',
 'reasons',
 'companies',
 'fail',
 'they',
 'only',
 'do',
 'more',
 'of',
 'the',
 'same',
 'or',
 'they',
 'only',
 'do',
 'what',
 's',
 'new']

In [11]:
sentences_strings_ted[0]

"Here are two reasons companies fail: they only do more of the same, or they only do what's new"

### Train the model and create Word2Vec

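**window=5** : the maximum distance between the current word and the surrounding words that are used as its context

**min_count=5** : ignore every word that occurs fewer than 5 times in the corpus

**workers=4** : number of worker threads used for training (more threads train faster on a multi-core CPU)

**sg** : 1 = skip-gram model, 0 = Continuous Bag of Words (CBOW) model

**size=100** : dimension of our word vectors (gensim 4.0 and later rename this parameter to vector_size)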

In [0]:
from gensim.models import Word2Vec
model_ted = Word2Vec(sentences=sentences_ted, size=100, window=5, min_count=5, workers=4, sg=0)
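The window and min_count parameters are easier to picture on a toy input. The following sketch is not gensim code: filter_rare and context_pairs are hypothetical helpers that mimic, in a simplified way, how rare words are dropped and which (center word, context word) pairs a skip-gram model would look at. gensim's real training loop does more, among other things sampling a smaller effective window at random.

```python
from collections import Counter


def filter_rare(sentences, min_count):
    """Drop every word that occurs fewer than min_count times overall."""
    counts = Counter(word for sent in sentences for word in sent)
    return [[w for w in sent if counts[w] >= min_count] for sent in sentences]


def context_pairs(tokens, window):
    """List (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


toy_sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
filtered = filter_rare(toy_sentences, min_count=2)
print(filtered)

pairs = context_pairs(["here", "are", "two", "reasons"], window=1)
print(pairs)
```

A larger window produces more pairs per word and tends to capture broader, more topical relations, while a small window focuses on immediate neighbors.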

### Playing with our trained model

Finding similar words

In [12]:
model_ted.wv.most_similar("learning")

[('thinking', 0.624519944190979),
 ('designing', 0.6135532855987549),
 ('understanding', 0.5971096754074097),
 ('creativity', 0.5890544652938843),
 ('interaction', 0.5722825527191162),
 ('sharing', 0.5671535134315491),
 ('knowledge', 0.5633713006973267),
 ('programming', 0.5599903464317322),
 ('behavior', 0.5588247776031494),
 ('concerned', 0.5585861206054688)]
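The scores in this list are cosine similarities between word vectors. As a rough illustration of the ranking, the sketch below computes cosine similarity by hand on made-up two-dimensional vectors; the toy_vectors values and both helper functions are assumptions of this example, not part of gensim.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=10):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 2-dimensional embeddings, chosen by hand for the example.
toy_vectors = {
    "learning": [1.0, 0.0],
    "thinking": [0.9, 0.1],
    "teaching": [0.8, 0.3],
    "banana": [0.0, 1.0],
}

ranking = most_similar("learning", toy_vectors, topn=3)
for word, score in ranking:
    print(word, round(score, 3))
```

Because cosine similarity only depends on the angle between vectors, a long and a short vector pointing in the same direction count as identical.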

**Actual vector representation of the given word**

In [13]:
model_ted.wv["superman"]

array([ 0.02932545, -0.03813079, -0.03908348, -0.04440913, -0.00780136,
        0.17753965,  0.12432306, -0.01260174, -0.02116437,  0.01338702,
        0.04903176,  0.0128541 ,  0.10578487, -0.03713503,  0.07001546,
       -0.14151737, -0.09285926,  0.08726685,  0.08953398, -0.09632254,
       -0.22157773,  0.075741  ,  0.16108155, -0.09259493, -0.1076882 ,
       -0.02815372,  0.12990771, -0.03335473,  0.00819364,  0.08313783,
        0.04859168, -0.01032159,  0.05126826,  0.02886479,  0.01692663,
       -0.13471396, -0.08546012,  0.11300616, -0.08114375,  0.18780759,
       -0.14915358, -0.16964687,  0.0853797 ,  0.22036536, -0.0376901 ,
        0.06474366,  0.16765636,  0.10471794, -0.0601727 , -0.09241454,
       -0.26310453, -0.12781328, -0.02925716,  0.04133437,  0.04953956,
       -0.05544368, -0.11810335,  0.04695258,  0.17772011,  0.12116767,
        0.26723117, -0.08432853, -0.14733846, -0.00198483, -0.12609589,
        0.05029595, -0.2562903 ,  0.14254208, -0.13047476,  0.11

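**Word analogies, for example "King" - "Man" + "Woman" = "Queen"**

The cell below runs the same kind of query for "Paris" - "France" + "Italy": the added words go in positive and the subtracted word goes in negative.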

In [16]:
model_ted.wv.most_similar(positive=['paris', 'italy'], negative=['france'], topn=1)

[('tokyo', 0.7420258522033691)]
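Such a query adds the positive vectors, subtracts the negative ones, and then looks for the nearest word that was not part of the query. The sketch below shows that arithmetic on hand-crafted three-dimensional vectors. Everything in it (the analogy helper and the toy_vectors values) is hypothetical, and it skips the length normalization that gensim applies to the vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(positive, negative, vectors, topn=1):
    """Return the words closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Words used in the query are excluded from the answer.
    exclude = set(positive) | set(negative)
    scores = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up axes: [is about France, is about Italy, is a capital city].
toy_vectors = {
    "paris": [1.0, 0.0, 1.0],
    "france": [1.0, 0.0, 0.0],
    "italy": [0.0, 1.0, 0.0],
    "rome": [0.0, 1.0, 1.0],
    "tokyo": [0.0, 0.0, 1.0],
}

answer = analogy(positive=["paris", "italy"], negative=["france"], vectors=toy_vectors)
print(answer)
```

Vectors learned from a small corpus are far noisier than these toy values, which is one reason a model trained on little data can return an unexpected word.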

**Exercise**: 

"Paris" - "France" + "Italy" = ?

### Saving the trained model

In [0]:
model_ted.save('model_ted')

In [19]:
ls

datalab/   model_ted.trainables.vectors_ngrams_lockf.npy  ted_en-20160408.zip
model_ted  model_ted.wv.vectors_ngrams.npy


### Loading the saved model for later use

In [0]:
import gensim
new_model = gensim.models.Word2Vec.load('model_ted')

In [22]:
new_model.wv.most_similar("musk")

[('sum', 0.8453272581100464),
 ('gypsum', 0.7025263905525208),
 ('spends', 0.6015372276306152),
 ('sumness', 0.5867369771003723),
 ('spend', 0.5801560878753662),
 ('consume', 0.5798588395118713),
 ('consumption', 0.5754652619361877),
 ('enjoyment', 0.5706859230995178),
 ('caloric', 0.569891095161438),
 ('gdp', 0.5693891644477844)]

[Tutorial](https://rare-technologies.com/word2vec-tutorial/)

### Pretrained Models

Download Google's pretrained Word2Vec model (1.5GB). Its word vectors have 300 dimensions.

In [17]:
!wget --header="Host: doc-0k-60-docs.googleusercontent.com" --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" --header="Accept-Language: en-US,en;q=0.9,bn;q=0.8" --header="Cookie: AUTH_hsj6bsv3gmr4jud640d5antqf80l6l0u_nonce=4e0ai06tkdrcq" --header="Connection: keep-alive" "https://doc-0k-60-docs.googleusercontent.com/docs/securesc/3ibvjp4ek73d72m48ftm6aqlh1cm5gq2/9pj70o100kqsh714bprtp2s146a5ntff/1534593600000/06848720943842814915/11497237792643639470/0B7XkCwpI5KDYNlNUTTlSS21pQmM?e=download&nonce=4e0ai06tkdrcq&user=11497237792643639470&hash=hc7ja67baf5hea9lr11tvd4ippf9g9vg" -O "GoogleNews-vectors-negative300.bin.gz" -c

--2018-08-18 16:09:18--  https://doc-0k-60-docs.googleusercontent.com/docs/securesc/3ibvjp4ek73d72m48ftm6aqlh1cm5gq2/9pj70o100kqsh714bprtp2s146a5ntff/1534593600000/06848720943842814915/11497237792643639470/0B7XkCwpI5KDYNlNUTTlSS21pQmM?e=download&nonce=4e0ai06tkdrcq&user=11497237792643639470&hash=hc7ja67baf5hea9lr11tvd4ippf9g9vg
Resolving doc-0k-60-docs.googleusercontent.com (doc-0k-60-docs.googleusercontent.com)... 209.85.200.132, 2607:f8b0:4001:c16::84
Connecting to doc-0k-60-docs.googleusercontent.com (doc-0k-60-docs.googleusercontent.com)|209.85.200.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’

GoogleNews-vectors-     [          <=>       ]   1.53G   129MB/s    in 11s     

2018-08-18 16:09:30 (137 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227]



In [21]:
ls

datalab@                               sample_data/
GoogleNews-vectors-negative300.bin.gz  ted_en-20160408.zip


Loading the model takes a while, so don't worry if the next cell seems slow.

In [0]:
import gensim

# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin.gz', binary=True)

In [20]:
model["food"]

array([-0.18164062,  0.16503906, -0.16601562,  0.35742188, -0.09228516,
        0.20117188, -0.0546875 , -0.26171875, -0.17285156, -0.08056641,
        0.14648438, -0.24609375,  0.18652344,  0.10253906, -0.3203125 ,
        0.16699219, -0.0032196 , -0.06640625,  0.06591797, -0.109375  ,
        0.13964844, -0.05029297,  0.25390625,  0.0859375 ,  0.02026367,
        0.05517578, -0.08447266,  0.07324219,  0.15429688, -0.13867188,
       -0.25195312, -0.15136719,  0.07958984,  0.00848389, -0.24902344,
        0.05224609,  0.04394531, -0.19726562, -0.2109375 ,  0.01477051,
       -0.23632812, -0.14355469,  0.17773438,  0.26757812, -0.08789062,
       -0.07910156, -0.16113281,  0.23632812, -0.07177734,  0.08837891,
        0.07177734, -0.11962891, -0.09228516, -0.12060547, -0.00448608,
       -0.21875   , -0.05712891, -0.04418945,  0.07226562, -0.05883789,
       -0.12597656,  0.03125   , -0.24609375,  0.19140625,  0.14941406,
       -0.19335938, -0.1875    , -0.05126953,  0.03369141, -0.21

**How to represent your sentence as an array of vectors:**

For demonstration, we take this sentence:

In [21]:
sentences_ted[0]

['here',
 'are',
 'two',
 'reasons',
 'companies',
 'fail',
 'they',
 'only',
 'do',
 'more',
 'of',
 'the',
 'same',
 'or',
 'they',
 'only',
 'do',
 'what',
 's',
 'new']

In [22]:
# model is already a KeyedVectors object, so its vocabulary is model.vocab (gensim 3.x)
allowed_words = set(model.vocab)

Add the vectors of all allowed words to a list. Keep in mind that Google's pretrained model leaves out some very common words (certain prepositions, for example), so those words are skipped.

In [0]:
vectors = []

for word in sentences_ted[0]:
  if word in allowed_words:
    vectors.append(model[word])

In [24]:
print(vectors)

[array([-2.73437500e-02,  4.49218750e-02,  7.66601562e-02,  1.33789062e-01,
       -8.30078125e-02, -5.63964844e-02,  8.05664062e-02, -1.22070312e-01,
       -1.01074219e-01,  1.04003906e-01,  5.12695312e-02, -8.59375000e-02,
        1.61132812e-02, -7.32421875e-02, -2.03125000e-01,  3.06396484e-02,
        2.73437500e-01,  2.08984375e-01,  5.73730469e-03, -8.10546875e-02,
       -1.45507812e-01,  8.39843750e-02,  1.50390625e-01, -2.10937500e-01,
        2.77099609e-02,  4.34570312e-02, -1.30004883e-02,  1.73950195e-03,
       -2.28271484e-02, -6.00585938e-02, -2.69775391e-02,  2.42919922e-02,
        1.01928711e-02,  2.11181641e-02,  5.37109375e-02, -3.61328125e-02,
        2.63977051e-03, -5.51757812e-02,  3.93066406e-02,  2.02148438e-01,
        5.61523438e-02, -5.66406250e-02,  9.08203125e-02,  5.90820312e-02,
        1.78222656e-02,  1.33056641e-02, -7.66601562e-02,  4.46777344e-02,
        4.34570312e-02,  5.78613281e-02,  8.34960938e-02,  1.22558594e-01,
        8.64257812e-02, 

Creating a NumPy array from the list

In [35]:
import numpy as np

vectors = np.array(vectors)
type(vectors)

numpy.ndarray

Now you can feed this array of vectors to your machine learning model.
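Many models expect one fixed-size input per sentence, while the array built here has one row per known word. One common option (an assumption of this sketch, not something the notebook prescribes) is to average the rows. The example uses made-up 3-dimensional vectors in place of the 300-dimensional Google vectors, and sentence_to_matrix is a hypothetical helper.

```python
import numpy as np


def sentence_to_matrix(tokens, vectors, dim):
    """Stack the vectors of all known tokens into a (n_known, dim) array."""
    kept = [vectors[w] for w in tokens if w in vectors]
    if not kept:
        # No known word: return an empty matrix with the right width.
        return np.zeros((0, dim))
    return np.array(kept)


# Hypothetical embeddings; "are" is deliberately missing from the vocabulary.
toy_vectors = {
    "here": np.array([0.1, 0.2, 0.3]),
    "two": np.array([0.3, 0.0, 0.1]),
    "reasons": np.array([0.2, 0.4, 0.2]),
}

matrix = sentence_to_matrix(["here", "are", "two", "reasons"], toy_vectors, dim=3)
print(matrix.shape)

# Mean pooling: one vector of length dim for the whole sentence.
sentence_vector = matrix.mean(axis=0)
print(sentence_vector)
```

Averaging discards word order, so sequence models usually consume the full matrix instead, padded or truncated to a common length.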