## Training a word2vec model from scratch

-- Prof. Dorien Herremans

We will start by training a word2vec model from scratch using the gensim library. You will need to have gensim installed, along with wget to download our dataset.

Note: these models may take a while to train. Be sure to switch the runtime of Google Colab to use a TPU or GPU hardware accelerator (in the menu at the top).

Let's start by installing some libraries that we will use:

In [0]:
!pip install gensim
!pip install wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp36-none-any.whl size=9681 sha256=fd0c28687f88293541f32f70ab1e1a7293fd19efe13a5890fd592f7353413a82
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


Now we can import these libraries:

In [0]:
# imports needed 
import gensim 
import wget




We will train our model using a very small dataset for demonstrative purposes. Note that for a real data science project you should train on a much larger dataset. 

We will use the complete works of Shakespeare. You can find the file at https://dorienherremans.com/drop/CDS/CNNs/shakespeare.txt

In [20]:
# download the dataset
!wget "https://dorienherremans.com/drop/CDS/CNNs/shakespeare.txt"


--2019-11-14 11:03:20--  https://dorienherremans.com/drop/CDS/CNNs/shakespeare.txt
Resolving dorienherremans.com (dorienherremans.com)... 96.127.180.74
Connecting to dorienherremans.com (dorienherremans.com)|96.127.180.74|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5447743 (5.2M) [text/plain]
Saving to: ‘shakespeare.txt.1’


2019-11-14 11:03:20 (46.5 MB/s) - ‘shakespeare.txt.1’ saved [5447743/5447743]



In [0]:
# Alternative: fetch the dataset from Google Drive (only works inside Google Colab)
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Extract the file id from the shareable link and download the file
link = 'https://drive.google.com/open?id=13SD25ui9HMXvqD9pcw0_qZ8K1Qq1OfxE'  # the shareable link of the file
_, file_id = link.split('=')
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('shakespeare.txt')




NameError: ignored

Let's read the input file and convert each line into a list of words (tokenizing). To do this, we create a function read_input, which is called in the penultimate line below:

In [21]:
def read_input(input_file):
    """Yield one tokenized list of words per line of the input file."""
    print("reading file...")
    with open(input_file, 'r') as f:
        lines = f.readlines()
    for line in lines:
        # do some pre-processing and return a (tokenized) list of words for each line;
        # you can print the output here to understand the preprocessing (tokenizing)
        yield gensim.utils.simple_preprocess(line)

# point to the location on your filesystem
data_file = 'shakespeare.txt'

# each line becomes a list of words, so documents is a list of lists
documents = list(read_input(data_file))
print("Done reading data file")

reading file...
Done reading data file
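
To get a feel for what the tokenizing step does, here is a minimal standard-library sketch that approximates `gensim.utils.simple_preprocess`. It lowercases the text, keeps runs of letters, and drops tokens shorter than 2 or longer than 15 characters (gensim's default limits). The function name `approx_preprocess` is made up for illustration, and the regular expression is a simplification rather than gensim's actual implementation.

```python
import re

def approx_preprocess(line, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (illustrative only)."""
    # lowercase, then extract runs of letters (no digits, punctuation or underscores)
    tokens = re.findall(r"[^\W\d_]+", line.lower())
    # drop tokens that are too short or too long
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(approx_preprocess("O Romeo, Romeo! wherefore art thou Romeo?"))
```

Note that the single-letter word "O" disappears because it is shorter than the minimum token length.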


Now let's train the word2vec model using our documents variable (which is a list of word lists). Note that you can specify a number of hyperparameters below:
* min_count: ignores all words that occur fewer than min_count times
* window: size of the context window, i.e. the maximum distance between a word and the surrounding words that are used as its context
* workers: how many threads to use
* size: number of dimensions of your new word embedding vectors (typically 100-200). Smaller datasets require a smaller number. In gensim 4.0 and later this parameter is called vector_size
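
To make window and min_count more concrete, the following is a simplified sketch of how (word, context word) training pairs could be derived from tokenized sentences. The helper `context_pairs` and the toy sentence are hypothetical. gensim's real implementation differs in its details (for example, it randomly shrinks the window during training), so treat this only as an illustration of the two hyperparameters.

```python
from collections import Counter

def context_pairs(sentences, window=2, min_count=1):
    """Return (target, context) pairs; simplified illustration, not gensim's code."""
    # count how often each word occurs in the whole corpus
    counts = Counter(word for sentence in sentences for word in sentence)
    pairs = []
    for sentence in sentences:
        # min_count: words that are too rare are dropped before windowing
        kept = [word for word in sentence if counts[word] >= min_count]
        for i, target in enumerate(kept):
            # window: only neighbours at most `window` positions away count as context
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((target, kept[j]))
    return pairs

toy = [["the", "king", "wears", "the", "crown"]]
pairs = context_pairs(toy, window=1, min_count=1)
print(len(pairs), pairs[:3])
```

Raising min_count to 2 on this toy sentence leaves only the two occurrences of "the", which shows how aggressively this setting can prune a small corpus.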



In [22]:

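# passing documents to the constructor builds the vocabulary and already trains the model
model = gensim.models.Word2Vec(documents, size=150, window=5, min_count=2, workers=4)
# continue training for another 10 epochs
model.train(documents, total_examples=len(documents), epochs=10)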


(6704015, 8675160)

That's it! Now you've trained the model! 

Now let's explore some properties of our new word space. You can get the words closest (read: most similar) to a given word. Remember, the only text the model has seen is Shakespeare!

In [0]:
w1 = "king"
model.wv.most_similar(positive=w1)



[('prince', 0.6431804895401001),
 ('plantagenets', 0.5619862675666809),
 ('dauphin', 0.5495572686195374),
 ('warwick', 0.5464417338371277),
 ('fifth', 0.5446635484695435),
 ('duke', 0.5242741703987122),
 ('sixth', 0.5195103883743286),
 ('ghost', 0.5169710516929626),
 ('crown', 0.5003525614738464),
 ('emperor', 0.4984578490257263)]

In [0]:
# look up top 6 words similar to 'smile'
w1 = ["smile"]
model.wv.most_similar(positive=w1, topn=6)



[('laugh', 0.7505191564559937),
 ('tremble', 0.6889244914054871),
 ('rail', 0.6800838112831116),
 ('blush', 0.6739107966423035),
 ('push', 0.671941876411438),
 ('spit', 0.6716383099555969)]

In [0]:
# look up top 6 words similar to 'france'
w1 = ["france"]
model.wv.most_similar(positive=w1, topn=6)





[('england', 0.6466440558433533),
 ('orleans', 0.5653001070022583),
 ('princess', 0.5530722737312317),
 ('realm', 0.5522609353065491),
 ('wales', 0.5495685338973999),
 ('egypt', 0.5465847253799438)]

In [0]:
# look up top 6 words similar to 'sword'
w1 = ["sword"]
model.wv.most_similar(positive=w1, topn=6)




[('head', 0.7913775444030762),
 ('knife', 0.720653772354126),
 ('finger', 0.7087414264678955),
 ('throat', 0.701163649559021),
 ('pocket', 0.6948661804199219),
 ('body', 0.6912802457809448)]

In [0]:
# find words related to royalty but not related to 'farmer'
w1 = ["king", "queen", "prince"]
w2 = ["farmer"]
model.wv.most_similar(positive=w1, negative=w2, topn=10)




[('princess', 0.6012568473815918),
 ('warwick', 0.5922337770462036),
 ('duke', 0.5894880890846252),
 ('ghost', 0.5376821160316467),
 ('dauphin', 0.5213593244552612),
 ('comfort', 0.5193885564804077),
 ('emperor', 0.5070196390151978),
 ('moor', 0.5068163275718689),
 ('duchess', 0.5065346956253052),
 ('gods', 0.49567297101020813)]
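
Conceptually, a query with positive and negative words combines the word vectors into a single query vector and then ranks the remaining words by cosine similarity to it. The sketch below illustrates that idea in plain Python. The function `most_similar_sketch` and the three-dimensional `toy_vectors` are invented for illustration; they are not taken from the trained model, and this is not gensim's actual code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def most_similar_sketch(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:   # positive words pull the query towards them
        query = [q + a for q, a in zip(query, unit(vectors[word]))]
    for word in negative:   # negative words push the query away
        query = [q - a for q, a in zip(query, unit(vectors[word]))]
    query = unit(query)
    used = set(positive) | set(negative)
    # cosine similarity of every other word to the query vector
    scores = [(w, sum(q * a for q, a in zip(query, unit(v))))
              for w, v in vectors.items() if w not in used]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]

# made-up 3-dimensional embeddings, for illustration only
toy_vectors = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.80, 0.90, 0.10],
    "prince": [0.85, 0.70, 0.20],
    "farmer": [0.10, 0.20, 0.90],
    "duke":   [0.80, 0.75, 0.15],
    "plough": [0.05, 0.10, 0.95],
}
ranking = most_similar_sketch(toy_vectors, ["king", "queen", "prince"], ["farmer"])
print(ranking)
```

With these toy vectors, "duke" ranks first and "plough" last, because subtracting "farmer" pushes the query away from the farming direction.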

Explore the similarity between two words (measured as the cosine similarity of their vectors). Does it make sense?

In [0]:
# similarity between two similar words
model.wv.similarity(w1="pretty",w2="beautiful")





0.5057323

In [0]:
# similarity between two opposing words
model.wv.similarity(w1="king",w2="farmer")




-0.027320124
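
The similarity scores above are cosine similarities: the cosine of the angle between two word vectors, ranging from -1 (opposite directions) through 0 (unrelated) to 1 (same direction). As a small worked example, here is the computation written out in plain Python. The helper name `cosine_similarity` and the example vectors are chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

A value close to 0, such as the one for "king" and "farmer", therefore means the two vectors are nearly orthogonal rather than opposite.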

Try some other combinations :) 

We can even use it to perform 'smarter' tasks:

In [0]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["cat","dog","france"])




'france'
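
One way to think about the odd-one-out task is this: normalise each word vector to unit length, average them, and pick the word whose vector is least similar to that average. The sketch below follows this idea in plain Python. The function `odd_one_out` and the two-dimensional `toy_vectors` are made up for illustration and do not come from the trained model.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def odd_one_out(vectors, words):
    """Return the word whose unit vector is least aligned with the group mean."""
    normed = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(normed.values())))
    # mean of the unit-length vectors
    mean = [sum(normed[w][i] for w in words) / len(words) for i in range(dim)]
    # the odd one out has the smallest dot product with the mean
    return min(words, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))

# made-up 2-dimensional embeddings, for illustration only
toy_vectors = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "france": [0.1, 0.9]}
result = odd_one_out(toy_vectors, ["cat", "dog", "france"])
print(result)
```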

If you are interested in plotting the words in a multidimensional space, you can get the vector coordinates of each word with `model.wv[word]`.

## Bonus: visualising our model with t-SNE

In [0]:
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
%matplotlib inline

def tsne_plot(model):
    "Creates a t-SNE model and plots it"

    labels = []
    tokens = []

    count = 0
    for word in model.wv.vocab:
        # to speed up the process, let's limit to the first 100 elements
        if count < 100:
            # store the word as label and its embedding vector as data point
            labels.append(word)
            tokens.append(model.wv[word])
            count = count + 1

    # fit the t-SNE model and project the vectors onto 2 dimensions
    # (these settings are one possible choice)
    tsne_model = TSNE(n_components=2, random_state=0)
    new_values = tsne_model.fit_transform(np.array(tokens))

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])

    plt.figure(figsize=(16, 16))
    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

tsne_plot(model)

## References

* https://radimrehurek.com/gensim/models/word2vec.html
* https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568
* https://github.com/kavgan/nlp-text-mining-working-examples/tree/master/word2vec
* https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5