<a href="https://colab.research.google.com/github/ShaunakSen/AI-for-Web-Accessibility/blob/master/Custom_corpus_Python_gensim_Word2Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Python gensim Word2Vec tutorial with TensorFlow and Keras

Source: [gensim Word2Vec tutorial, Adventures in Machine Learning](https://adventuresinmachinelearning.com/gensim-word2vec-tutorial/)

### Word embedding and Word2Vec

Word embedding involves creating better vector representations of words – both in terms of efficiency and maintaining meaning. For instance, a word embedding layer may involve creating a 10,000 x 300 sized matrix, whereby we look up a 300 length vector representation for each of the 10,000 words in our vocabulary.  This new, 300 length vector is obviously a lot more efficient than a 10,000 length one-hot representation.  But we also need to create this 300 length vector in such a way as to preserve some semblance of the meaning of the word.
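
To make the lookup concrete, here is a minimal sketch with a toy 5-word vocabulary and 3-length vectors. The words and numbers are made up for illustration and do not come from a trained model. It shows that multiplying a one-hot vector by the embedding matrix selects the same row as a direct index lookup, which is why the dense lookup is the cheaper way to get the same result.

```python
# Toy embedding lookup: one-hot multiplication vs. direct row indexing.
# The vocabulary and the matrix values are illustrative assumptions.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# A 5 x 3 "embedding matrix": one 3-length row per vocabulary word.
embedding_matrix = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def one_hot_lookup(word):
    """Multiply the one-hot row vector by the embedding matrix."""
    vec = one_hot(word_to_index[word], len(vocab))
    n_dims = len(embedding_matrix[0])
    return [
        sum(vec[r] * embedding_matrix[r][c] for r in range(len(vocab)))
        for c in range(n_dims)
    ]

def direct_lookup(word):
    """Pick the matching row directly, skipping the multiplication."""
    return embedding_matrix[word_to_index[word]]

print(one_hot_lookup("sat"))
print(direct_lookup("sat"))
```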

Word2Vec does this by taking the context of words surrounding the target word.  So, if we have a context window of 2, the context of the target word “sat” in the sentence “the cat sat on the mat” is the list of words [“the”, “cat”, “on”, “the”]. In Word2Vec, the meaning of a word is roughly translatable to context – and it basically works. Target words which share similar common context words often have similar meanings. The way Word2Vec trains the embedding vectors is via a neural network of sorts – the neural network, given a one-hot representation of a target word, tries to predict the most likely context words.
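
The context window from the example sentence can be reproduced in a few lines. This is only a sketch of the windowing idea; the helper name `context_words` is an assumption and is not part of gensim.

```python
def context_words(tokens, target_position, window):
    """Return the words within `window` positions of the target, excluding the target."""
    start = max(0, target_position - window)
    end = min(len(tokens), target_position + window + 1)
    return tokens[start:target_position] + tokens[target_position + 1:end]

tokens = "the cat sat on the mat".split()
print(context_words(tokens, tokens.index("sat"), window=2))
# ['the', 'cat', 'on', 'the']
```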

Here’s a naive way of performing the neural network training using an output softmax layer:

![](https://i2.wp.com/adventuresinmachinelearning.com/wp-content/uploads/2017/07/Word2Vec-softmax.jpg?w=676&ssl=1)

In this network, the 300 node hidden layer weights are trained by trying to predict (via a softmax output layer) genuine, high probability context words.  Once the training is complete, the output softmax layer is discarded and what is of real value is the 10,000 x 300 weight matrix connecting the input to the hidden layer. This is our embedding matrix, and we can look up any member of our 10,000-word vocabulary and get its 300 length vector representation.

It turns out that this softmax way of training the embedding layer is very inefficient, due to the millions of weights that need to be involved in updating and calculating the softmax values. Therefore, a concept called negative sampling is used in the real Word2Vec, which involves training the layer with real context words and a few negative samples which are chosen randomly from outside the context.  For more details on this, see my Word2Vec Keras tutorial.
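
To put a number on the cost: with a 10,000-word vocabulary and 300 hidden units, the softmax output layer alone has 300 x 10,000 = 3,000,000 weights. The sketch below shows the kind of training examples negative sampling uses instead: each target word is paired with its real context words (label 1) and with a few words drawn from outside the context (label 0). It is a simplified illustration under stated assumptions: the function name `training_pairs` is hypothetical and the negatives are drawn uniformly from a tiny vocabulary.

```python
import random

def training_pairs(tokens, window, num_negative, rng):
    """Build (target, other_word, label) triples: label 1 for context words, 0 for negatives."""
    vocab = sorted(set(tokens))
    pairs = []
    for pos, target in enumerate(tokens):
        start = max(0, pos - window)
        end = min(len(tokens), pos + window + 1)
        context = tokens[start:pos] + tokens[pos + 1:end]
        # Positive examples: words that really occur near the target.
        for word in context:
            pairs.append((target, word, 1))
        # Negative examples: words chosen at random from outside the context.
        candidates = [w for w in vocab if w not in context and w != target]
        for word in rng.sample(candidates, min(num_negative, len(candidates))):
            pairs.append((target, word, 0))
    return pairs

rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens = "the cat sat on the mat".split()
for pair in training_pairs(tokens, window=1, num_negative=1, rng=rng)[:6]:
    print(pair)
```

Each triple then only requires updating the weights for the two words involved, rather than the full output layer.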

Now we understand what Word2Vec training of embedding layers involves, let’s talk about the gensim Word2Vec module.



### A gensim Word2Vec tutorial

This section gives a brief introduction to the gensim Word2Vec module.  The gensim library is an open-source Python library that specializes in vector space and topic modeling.  It can be made very fast with the use of Cython, which compiles performance-critical code to C so that it runs at native speed inside the Python environment. This is good for our purposes, as the original Google Word2Vec implementation is written in C, and gensim provides its own optimized port of that code, which will be explained below.

In [0]:
import gensim
from gensim.models import word2vec
import logging

from keras.layers import Input, Embedding, Reshape, Dot
from keras.models import Model

import tensorflow as tf
import numpy as np

import urllib.request
import os
import zipfile
import pickle

from time import time 

from pprint import pprint

In [0]:
def maybe_download(filename, url, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""

  # check if file exists
  if not os.path.exists(path=filename):
    # download: Returns a tuple containing the path to the newly created data file as well as the resulting HTTPMessage object.
    filename, _ = urllib.request.urlretrieve(url=url+filename, filename=filename)
  statinfo = os.stat(path=filename)
  # check file size
  if statinfo.st_size == expected_bytes:
    print('Found and verified', filename)
  else:
    print(statinfo.st_size)
    raise Exception('Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

In [0]:
url = 'http://mattmahoney.net/dc/'
filename = maybe_download('text8.zip', url, 31344016)

print (filename)

Found and verified text8.zip
text8.zip


In [0]:
# extract the file
# use splitext to drop the extension: str.strip('.zip') strips characters, not a suffix
if not os.path.exists(os.path.splitext(filename)[0]):
    zipfile.ZipFile(filename).extractall()

The next step that is required is to create an iterator for gensim to extract its data from.  We can cheat a little bit here and use a supplied iterator that gensim provides for the text8 corpus:



In [0]:
# iterate over sentences from the "text8" corpus
sentences = word2vec.Text8Corpus(fname='./text8')

print (type(sentences))

<class 'gensim.models.word2vec.Text8Corpus'>


In [0]:
for sentence in sentences:
  print (sentence)
  break

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing', 'interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers', 'to', 'related', 'so



The required input to the gensim Word2Vec module is an iterator object, which sequentially supplies sentences from which gensim will train the embedding layer. The line above shows the supplied gensim iterator for the text8 corpus, but below shows another generic form that could be used in its place for a different data set (not actually implemented in the code for this tutorial), where the data set also contains multiple files:



In [0]:
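class MySentences(object):
  """Streams tokenized lines from every file in a directory, one line at a time."""
  def __init__(self, dirname):
    self.dirname = dirname

  def __iter__(self):
    for fname in os.listdir(self.dirname):
      # the context manager closes each file once it has been read
      with open(os.path.join(self.dirname, fname)) as f:
        for line in f:
          yield line.split()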

> This capability of gensim is great, as it means you can set up iterators which cycle through the data without having to load the entire data set into memory.  This is vital, as some text data sets are huge, i.e. tens of GB.
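
Because training makes several passes over the data, the object handed to gensim needs to be restartable: a class with `__iter__` can be looped over repeatedly, whereas a plain generator is used up after one pass. The sketch below demonstrates the difference on a temporary file. The class name `LineSentences` and the file contents are assumptions made for the example.

```python
import os
import tempfile

class LineSentences:
    """Restartable iterable: every call to __iter__ re-opens the files."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            with open(os.path.join(self.dirname, fname), encoding="utf-8") as handle:
                for line in handle:
                    yield line.split()

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w", encoding="utf-8") as handle:
        handle.write("the cat sat\non the mat\n")

    corpus = LineSentences(tmp)
    first_pass = list(corpus)
    second_pass = list(corpus)   # works: iteration restarts from the beginning

    generator = iter(corpus)     # a plain generator, by contrast, is single-use
    consumed = list(generator)
    leftover = list(generator)   # empty: the generator is already exhausted

print(first_pass == second_pass, leftover)
```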

After we’ve set up the iterator object, it is dead simple to train our word vectors:

In [0]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


In [0]:
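# Note: these keyword names match gensim 3.x; gensim 4.0 renamed size -> vector_size and iter -> epochs
model = word2vec.Word2Vec(sentences=sentences, size=300, min_count=10, iter=10, workers=4)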



The first line just lets us see the INFO logging that gensim provides as it trains. The second line executes the training on the provided sentences iterator.  The size argument sets the length of the resulting word vectors; here it is 300, so after training each word in our vocabulary is represented by a 300 length word vector.  The min_count argument specifies the minimum number of times a word has to appear in the corpus before it is included in the vocabulary, which lets us easily eliminate rare words and reduce our vocabulary size.  The iter argument specifies how many times the training code will run through the data set to train the neural network (much like the number of training epochs). The gensim training code will actually run through all the data iter+1 times, as the first pass involves collecting all the unique words, creating dictionaries etc.  Finally, if we are using Cython, the workers argument specifies how many parallel workers we would like to work on the data, which speeds up the training process.

Let’s examine our results and see what else gensim can do.



In [0]:
# get the word vector of "the"

print (model.wv['the'].shape)

(300,)


This returns a 300 length numpy vector – as you can see, each word vector can be retrieved from the model via a dictionary key i.e. a word within our vocabulary.



In [0]:
# get the most common words
print(model.wv.index2word[0], model.wv.index2word[1], model.wv.index2word[2])
print (len(model.wv.index2word))

the of and
47134


The word vectors are also arranged within the wv object with indexes – the lowest index (i.e. 0) represents the most common word, the highest (i.e. the length of the vocabulary minus 1) the least common word.  The above code returns: “the of and”, which is unsurprising, as these are very common words.
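
The frequency ordering described here, together with the min_count filter used during training, can be sketched in plain Python. This illustrates the idea and is not gensim's own code; the function name `build_vocab` and the alphabetical tie-break are assumptions.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Count words, drop the rare ones, and index the rest from most to least frequent."""
    counts = Counter(word for sentence in sentences for word in sentence)
    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    # Most frequent first; ties are broken alphabetically (an assumption of this sketch).
    kept.sort(key=lambda item: (-item[1], item[0]))
    index_to_word = [word for word, _ in kept]
    word_to_index = {word: i for i, word in enumerate(index_to_word)}
    return index_to_word, word_to_index

sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
index_to_word, word_to_index = build_vocab(sentences, min_count=2)
print(index_to_word)          # ['the', 'sat']
print(word_to_index["sat"])   # 1
```

Words that fall below min_count ("cat", "dog" and "end" here) get no index at all, which is why later lookups have to check membership first.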



In [0]:
# get the least common words
vocab_size = len(model.wv.vocab)

print ("Vocab size:", vocab_size)

print (model.wv.index2word[vocab_size-1], model.wv.index2word[vocab_size-2], model.wv.index2word[vocab_size-3])

Vocab size: 47134
kirchenmusik villein meherabad


The discovered vocabulary is found in model.wv.vocab. By taking the length of this dictionary, we can determine the vocabulary size (in this case, it is 47,134 elements long). In this run the code above prints “kirchenmusik villein meherabad”: rare words indeed.



In [0]:
# find the index of the 2nd most common word ("of")

print('Index of "of" is: {}'.format(model.wv.vocab['of'].index))


Index of "of" is: 1


In [0]:
# some similarity fun
print (model.wv.similarity('woman', 'man'), model.wv.similarity('man', 'elephant'))

0.6051611 0.20642439




We can also easily extract similarity measures between word vectors (gensim uses cosine similarity). In this run the code above returns roughly 0.605 and 0.206, which again makes sense given the contexts such words are generally used in. The exact values change from one training run to the next.
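
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and not on magnitude. A minimal sketch on hand-made 2-length vectors (not vectors from the trained model):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # parallel, different length
```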



In [0]:
model.wv['zimbabwe'].shape

(300,)

In [0]:
# The word further away from the mean of all words.  

model.wv.doesnt_match(["england", "india", "zebra", "zimbabwe"])



'zebra'

This fun function determines which word doesn’t match the context of the others – in this case, “zebra” is returned.
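
Following the comment in the cell above (the word furthest from the mean of all the words), the idea can be sketched as: normalize each vector, average them, and return the word least similar to that average. The 2-length vectors below are invented for the example, the function name `odd_one_out` is hypothetical, and gensim's actual implementation may differ in detail.

```python
import math

def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def odd_one_out(vectors):
    """Return the key whose unit vector is least similar to the mean of all unit vectors."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[d] for u in units.values()) / len(units) for d in range(dims)])
    scores = {word: sum(a * b for a, b in zip(u, mean)) for word, u in units.items()}
    return min(scores, key=scores.get)

# Made-up vectors: three point in a similar direction, one does not.
toy_vectors = {
    "england": [0.9, 0.1],
    "india": [0.8, 0.2],
    "zimbabwe": [0.85, 0.15],
    "zebra": [0.1, 0.9],
}
print(odd_one_out(toy_vectors))
```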

We also want to be able to convert our data set from a list of words to a list of integer indexes, based on the vocabulary developed by gensim.  To do so, we can use the following code:


In [0]:
zipfile.ZipFile.read()

TypeError: ignored

In [0]:
# convert the input data into a list of integer indexes aligning with the wv indexes

def read_data(filename):
  """
  Extract the first file enclosed in a zip file as a list of words.
  """
  with zipfile.ZipFile(filename) as f:
    data = f.read(f.namelist()[0]).split()
  return data

str_data = read_data(filename)

str_data = [data.decode("utf-8") for data in str_data]

print (str_data[:10])

In [0]:
def convert_data_to_index(string_data, wv):
  index_data = []

  for word in string_data:
    if word in wv:
      index_data.append(wv.vocab[word].index)
  return index_data

index_data = convert_data_to_index(str_data, model.wv)

print (str_data[:4], index_data[:4])

The first function, read_data simply extracts the zip file data and returns a list of strings in the same order as our original text data set.  The second function loops through each word in the data set, determines if it is in the vocabulary*, and if so, adds the matching integer index to a list.  The code above returns: “[‘anarchism’, ‘originated’, ‘as’, ‘a’] [5237, 3080, 11, 5]”.

* Remember that some words in the data set will be missing from the vocabulary if they are very rare in the corpus.

We can also save and reload our trained word vectors/embeddings by the following simple code:



In [0]:
model.save("mymodel")

model = gensim.models.Word2Vec.load("mymodel")

Finally, I’ll show you how we can extract the embedding weights from the gensim Word2Vec model and store them in a numpy array, ready for use in TensorFlow and Keras.

Each row of the array represents one word, and the columns hold that word’s embedding vector.

In [0]:
model.wv.index2word[12]

In [0]:
embedding_matrix = np.zeros(shape=(len(model.wv.vocab), 300))

# for each word in the vocab

for i in range(len(model.wv.vocab)):
  embedding_vector = model.wv[model.wv.index2word[i]]
  if embedding_vector is not None:
    embedding_matrix[i] = embedding_vector

    

In this case, we first create an appropriately sized numpy zeros array.  Then we loop through each word in the vocabulary, grabbing the word vector associated with that word by using the wv dictionary.  We then add the word vector into our numpy array.

So there we have it – gensim Word2Vec is a great little library that can execute the word embedding process very quickly, and also has a host of other useful functionality.

Now I will show how you can use pre-trained gensim embedding layers in our TensorFlow and Keras models.

### Using gensim Word2Vec embeddings in Keras

We can perform similar steps with a Keras model. In this case, following the example code previously shown in the Keras Word2Vec tutorial, our model takes two single word samples as input and finds the similarity between them.  The top 8 closest words loop is therefore slightly different than the previous example:



In [0]:
valid_size = 16  # Random set of words to evaluate similarity on.
valid_window = 100  # Only pick dev samples in the head of the distribution.

valid_examples = np.random.choice(valid_window, valid_size, replace=False)

valid_examples

array([20, 95,  6, 85, 19, 82, 15, 29, 56, 90, 25, 12, 46, 13, 86, 36])

In [0]:
# input words - in this case we do sample by sample evaluations of the similarity
valid_word = Input((1,), dtype='int32')
other_word = Input((1,), dtype='int32')

# setup the embedding layer

# input_dim is vocab_size
# output dim is 300
embeddings = Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1], weights=[embedding_matrix])




In [0]:
embedded_a = embeddings(valid_word)
embedded_b = embeddings(other_word)

# reshape to 300,1
embedded_a = Reshape(target_shape=(300, 1))(embedded_a)
embedded_b = Reshape(target_shape=(300, 1))(embedded_b)


print (embedded_a.shape)

(?, 300, 1)


In [0]:
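# axes=1 takes the dot product (and the normalization) over the 300-dimensional embedding axis;
# axis 0 is the batch axis, so normalizing over it would not give a cosine similarity
similarity_layer = Dot(axes=1, normalize=True, name='similarity_layer')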
similarity = similarity_layer([embedded_a, embedded_b])

print (similarity.shape)

(?, 1, 1)


In [0]:
# create the Keras model
k_model = Model(inputs=[valid_word, other_word], outputs=similarity)


In [0]:
model.wv.index2word[0]

'the'

In [0]:
print (valid_size, valid_examples)

16 [20 95  6 85 19 82 15 29 56 90 25 12 46 13 86 36]


In [0]:
def get_sim(valid_word_idx, vocab_size):
  """
  Gets similarity scores of valid_word_idx through the k_model
  against all words in the vocab
  Returns an array of all the similarity scores
  """
  # array of 0s of size vocab_size
  sim = np.zeros((vocab_size,))
  # target and context inputs, one word index each
  in_arr1 = np.zeros((1,))
  in_arr2 = np.zeros((1,))
  in_arr1[0] = valid_word_idx

  for i in range(vocab_size):
    in_arr2[0] = i
    # get similarity score; the model output has shape (1, 1, 1)
    out = k_model.predict_on_batch([in_arr1, in_arr2])
    sim[i] = np.squeeze(out)
  return sim

In [0]:
# now run the model and get the closest words to the valid examples

for i in range(valid_size):
  # look up the word string for this index
  # (named valid_word_str so the Keras Input tensor valid_word is not overwritten)
  valid_word_str = model.wv.index2word[valid_examples[i]]
  top_k = 8  # number of nearest neighbors
  # get all similarity scores
  sim = get_sim(valid_examples[i], len(model.wv.vocab))

  # sort desc and get the top k except the first one
  # first one is the same as the original so ignore that
  nearest = (-sim).argsort()[1:top_k+1]

  log_str = 'Nearest to %s:' % valid_word_str

  for k in range(top_k):
    close_word = model.wv.index2word[nearest[k]]
    log_str = '%s %s,' % (log_str, close_word)
  # print once per word, after all neighbors have been appended
  print(log_str)


Nearest to four: five, three, six, one, eight, seven, nine, two,
Nearest to history: timeline, overview, prehistory, annals, historical, beginnings, geography, career,
Nearest to to: letting, might, desired,
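
The ranking step in the loop above can be sketched without the model. Instead of dropping whatever ranks first, this version filters out the query index explicitly, which stays correct even if another word ties with the query. The function name `top_k_neighbours` and the scores are made up for the illustration.

```python
def top_k_neighbours(similarities, query_index, k):
    """Indexes of the k highest scores, skipping the query word itself."""
    ranked = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)
    return [i for i in ranked if i != query_index][:k]

index_to_word = ["four", "five", "history", "three", "zebra"]
scores_for_four = [1.0, 0.9, 0.1, 0.8, 0.05]   # made-up similarity scores
nearest = top_k_neighbours(scores_for_four, query_index=0, k=2)
print([index_to_word[i] for i in nearest])   # ['five', 'three']
```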

## Training word2vec on hyperlinks dataset

### Mount Google Drive

In [0]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [0]:
# read the final corpus

with open('gdrive/My Drive/corpus_final.pickle', 'rb') as f:
  final_corpus = pickle.load(f) # contains list of words for each url

In [0]:
print (len(final_corpus))

print (final_corpus[0])

21058
['hi', 'today', 'want', 'share', 'everyone', 'famous', 'creation', 'walt', 'disney', 'concert', 'hall', 'located', 'los', 'angeles', 'california', 'serves', 'ho', 'Los Angeles Philharmonic', 'orchestra', 'los', 'angeles', 'master', 'chorale', 'creation', 'began', 'lilian', 'disney', 'made', 'gift', 'fifty', 'million', 'dollars', 'order', 'build', 'performance', 'venue', 'gift', 'people', 'los', 'angeles', 'tribute', 'walt', 'disney', 'devotion', 'arts', 'city', 'took', 'sixteen', 'years', 'complete', 'project', 'beginning', 'design', 'launch', 'construction', 'start', 'finally', 'inauguration', 'mass', 'coordination', 'experts', 'laborers', 'architects', 'creation', 'acoustic', 'theatre', 'made', 'possible', 'los', 'angeles', 'philharmonic', 'los', 'angeles', 'philharmonic', 'la', 'phil', 'lap', 'american', 'orchestra', 'based', 'los', 'angeles', 'california', 'regular', 'season', 'concerts', 'october', 'june', 'walt', 'disney', 'concert', 'hall', 'summer', 'season', 'hollywood',

### Train word2vec

In [0]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# window size : 10 as we do not want hyperlink overlapping
model = word2vec.Word2Vec(sentences=final_corpus, size=300, window = 10, min_count = 5, iter=10)

In [0]:
# get the most common words
print(model.wv.index2word[0], model.wv.index2word[1], model.wv.index2word[2])
print (len(model.wv.index2word))

specialtoken first also
38435


In [0]:
print (model.wv['los'].shape)

(300,)


In [0]:
# get the least common words
vocab_size = len(model.wv.vocab)

print ("Vocab size:", vocab_size)

print (model.wv.index2word[vocab_size-1], model.wv.index2word[vocab_size-2], model.wv.index2word[vocab_size-3])

Vocab size: 38435
aperitifs trotters periwinkles


In [0]:
# some similarity fun
print (model.wv.similarity('woman', 'man'), model.wv.similarity('man', 'elephant'))

0.56891394 0.22441162




In [0]:
model.wv.doesnt_match(["england", "india", "zebra", "zimbabwe"])



'zebra'

In [0]:
print (model.wv.most_similar(positive=['happy'], topn=5))
print (model.wv.most_similar(positive=['india'], topn=5))
print (model.wv.most_similar(positive=['mumbai'], topn=5))
print (model.wv.most_similar(positive=['england'], topn=5))
print (model.wv.most_similar(positive=['obama'], topn=5))
print (model.wv.most_similar(positive=['football'], topn=5))
print (model.wv.most_similar(positive=['google'], topn=5))

[('glad', 0.6442642211914062), ('alive', 0.6190614700317383), ('joy', 0.6157110929489136), ('funny', 0.6091225147247314), ('smile', 0.604058027267456)]
[('indian', 0.6682537794113159), ('India', 0.6019155979156494), ('jammu', 0.5636911988258362), ('maharashtra', 0.5604825019836426), ('hindustan', 0.5522967576980591)]
[('kolkata', 0.8223788738250732), ('hyderabad', 0.8160845637321472), ('madras', 0.7137566208839417), ('chennai', 0.7048352956771851), ('pune', 0.7032150030136108)]
[('wales', 0.6627770066261292), ('essex', 0.6146817803382874), ('scotland', 0.5287102460861206), ('yorkshire', 0.5273866057395935), ('london', 0.5206950902938843)]
[('barack', 0.7266415357589722), ('romney', 0.6231708526611328), ('sinclair', 0.6106960773468018), ('President Obama', 0.5970177054405212), ('Barack Obama', 0.5760226249694824)]
[('basketball', 0.7027027606964111), ('uefa', 0.6789053678512573), ('coach', 0.674741804599762), ('hockey', 0.6677508354187012), ('soccer', 0.6626197099685669)]
[('youtube', 0



In [0]:
model.save("./gdrive/My Drive/word2vec_own_corpus")



In [0]:
del model

In [0]:
model = word2vec.Word2Vec.load('./gdrive/My Drive/word2vec_own_corpus')



### WMD Distance test

In [0]:
# read test corpus


with open('gdrive/My Drive/final_corpus_source.pickle', 'rb') as f:
  final_corpus_source = pickle.load(f)

with open('gdrive/My Drive/final_corpus_target.pickle', 'rb') as f:
  final_corpus_target = pickle.load(f)

with open('gdrive/My Drive/final_corpus_link.pickle', 'rb') as f:
  final_corpus_link = pickle.load(f)

In [0]:
print (len(final_corpus_source), len(final_corpus_link), len(final_corpus_target))

502 502 502


In [0]:
print (final_corpus_source[0])

print (final_corpus_link[0])

print (final_corpus_target[1])

['funmi', 'suffering', 'malignant', 'sarcomaaccording', 'findings', 'sarcoma', 'greek', 'sarx', 'σάρκα', 'meaning', 'flesh', 'cancer', 'arises', 'transformed', 'cells', 'mesenchymal', 'origin', 'thus', 'malignant', 'tumors', 'made', 'cancerous', 'bone', 'cartilage', 'fat', 'muscle', 'vascular', 'hematopoietic', 'tissues', 'definition', 'considered', 'sarcomas', 'contrast', 'malignant', 'tumor', 'originating', 'epithelial', 'cells', 'termed', 'carcinoma', 'sarcomas', 'quite', 'rare', 'common', 'malignancies', 'breast', 'colon', 'lung', 'cancer', 'almost', 'always', 'carcinoma']
['cancer']
['malignancy', 'malignancy', 'latin', 'male', 'meaning', 'badly', 'gnus', 'meaning', 'born', 'tendency', 'medical', 'condition', 'become', 'progressively', 'worse', 'malignancy', 'familiar', 'characterization', 'cancer', 'malignant', 'tumor', 'contrasts', 'noncancerous', 'benign', 'tumor', 'malignancy', 'selflimited', 'growth', 'capable', 'invading', 'adjacent', 'tissues', 'may', 'capable', 'spreading'

In [0]:
# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download

download('stopwords') # Download stopwords list.

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
stopwords = stopwords.words('english')


In [0]:
# Normalizing word2vec vectors.
start = time()

model.init_sims(replace=True)  # Normalizes the vectors in the word2vec class.
print ('Cell took %.2f seconds to run.' %(time() - start))


Cell took 0.30 seconds to run.


In [0]:
distance = model.wmdistance(final_corpus_source[0], final_corpus_target[2])

print (distance)

1.0833837131321475


In [0]:
model_distances_self = []
for x in range(len(final_corpus_source)):
  distance = model.wmdistance(final_corpus_source[x], final_corpus_target[x])
  if distance > 2:
    print ("skip")
    continue
  model_distances_self.append(distance)



skip
skip
skip
skip


In [0]:
pprint(' '.join(final_corpus_source[0]))

pprint(' '.join(final_corpus_source[87]))

('funmi suffering malignant sarcomaaccording findings sarcoma greek sarx σάρκα '
 'meaning flesh cancer arises transformed cells mesenchymal origin thus '
 'malignant tumors made cancerous bone cartilage fat muscle vascular '
 'hematopoietic tissues definition considered sarcomas contrast malignant '
 'tumor originating epithelial cells termed carcinoma sarcomas quite rare '
 'common malignancies breast colon lung cancer almost always carcinoma')
('went back watch another science show taught us hydrogen peroxide Potassium '
 'Iodide liquid nitrogen completed journey going pathways science learn '
 'astronomy simple machines many wonderspark learnt water light air')


## Google news based word2vec



In [0]:
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2019-09-03 16:51:47--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.36.78
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.36.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2019-09-03 16:53:43 (13.6 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [0]:
google_model = gensim.models.KeyedVectors.load_word2vec_format(fname='./GoogleNews-vectors-negative300.bin.gz', binary=True)




In [0]:
# Normalizing word2vec vectors.
start = time()

google_model.init_sims(replace=True)  # Normalizes the vectors in the word2vec class.
print ('Cell took %.2f seconds to run.' %(time() - start))


Cell took 21.56 seconds to run.


In [0]:
distance = google_model.wmdistance(final_corpus_source[0], final_corpus_target[2])

print (distance)

1.1420516535773533


In [0]:
for x in range(len(final_corpus_source)):
  distance = google_model.wmdistance(final_corpus_source[0], final_corpus_target[x])
  if (distance < 1.15) and (x!=0) and final_corpus_link[0]!= final_corpus_link[x]:
    print (distance, x)

0.9326307268131726 1
1.1420516535773533 2
1.0905177417224465 3
1.1470603095137142 4
1.1149945186210934 5
1.0730522040712427 7
0.8657670673111251 8
1.057428859419308 9
1.0467125515155244 10
0.8926764956578249 11
1.1489824919497431 295


In [0]:
google_model_distances_self = []
for x in range(len(final_corpus_source)):
  distance = google_model.wmdistance(final_corpus_source[x], final_corpus_target[x])
  if distance > 2:
    print ("skip")
    continue
  google_model_distances_self.append(distance)

skip
skip
skip
skip
skip


In [0]:
print (final_corpus_source[18], final_corpus_target[18])

['The [retired Col.] doth protest too much, methinks'] ['lady', 'doth', 'protest', 'much', 'methinks', 'lady', 'doth', 'protest', 'much', 'methinks', 'line', 'play', 'hamlet', 'william', 'shakespeare', 'spoken', 'queen', 'gertrude', 'response', 'insincere', 'overacting', 'character', 'play', 'within', 'play', 'created', 'prince', 'hamlet', 'prove', 'uncle', 'guilt', 'murder', 'father', 'king', 'denmark', 'phrase', 'used', 'everyday', 'speech', 'indicate', 'doubt', 'concerning', 'someone', 'sincerity', 'common', 'misquotation', 'places', 'methinks', 'first', 'methinks', 'lady', 'doth', 'protest', 'much', 'line']


In [0]:
print (np.mean(model_distances_self), np.mean(google_model_distances_self))

1.0856121041666031 1.1079963889269253


In [0]:
random_list = np.random.randint(low=0, high=len(final_corpus_source), size=20)


In [0]:
google_model_distances_others = []

for x in random_list:
  for y in range(len(final_corpus_source)):
    if x==y:
      continue
    else:
      distance = google_model.wmdistance(final_corpus_source[x], final_corpus_target[y])
      if distance > 2:
        print ("skip")
        continue
      google_model_distances_others.append(distance)

In [0]:
model_distances_others = []

for x in random_list:
  for y in range(len(final_corpus_source)):
    if x==y:
      continue
    else:
      distance = model.wmdistance(final_corpus_source[x], final_corpus_target[y])
      if distance > 2:
        print ("skip")
        continue
      model_distances_others.append(distance)

In [0]:
print (np.mean(model_distances_others), np.mean(google_model_distances_others))

1.2295431544303874 1.2267399344363303


### Evaluating similarity between the source link text and target text/source text



In [0]:
from nltk import ngrams

text = final_corpus_link[21][0]

print (text)

grams = ngrams(sequence=text.split(), n=1)

for gram in grams:
  print(gram)

John Maynard Keynes
('John',)
('Maynard',)
('Keynes',)


In [0]:
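sentence = 'this is a foo bar sentences'

# the sentence has 6 words, so asking for 7-grams yields nothing
n = 7
sevengrams = ngrams(sentence.split(), n)

for grams in sevengrams:
  print (grams)

# ngrams returns an (empty) generator rather than None
print (sevengrams is None)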

False


In [0]:
google_model.wmdistance(['hi', 'there', 'asd', 'adds', 'ssd', 'daffa', 'asd', 'adds', 'ssd', 'daffa', 'asd', 'adds', 'ssd', 'daffa', 'asd', 'adds', 'ssd', 'daffa','asd', 'adds', 'ssd', 'daffa', 'asd', 'adds', 'ssd', 'daffa'], ['hi', 'there'])

1.1845182743819533

In [0]:
google_model_link_to_source_distances = []

for x in range(len(final_corpus_link)):
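  # NOTE: this check always reads entry 2 instead of entry x (probably a typo for final_corpus_link[x]);
  # if entry 2 is a single word, the body never runs and the list stays empty
  if len(final_corpus_link[2][0].split())>1: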
    source_link_words = final_corpus_link[x][0].split()
    distance = google_model.wmdistance(source_link_words, final_corpus_source[x])
    if distance > 2:
      print ("skip")
      continue
    if np.isnan(distance):
      continue
    google_model_link_to_source_distances.append(distance)
    


In [0]:
model_link_to_source_distances = []

for x in range(len(final_corpus_link)):
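  # NOTE: this check always reads entry 2 instead of entry x (probably a typo for final_corpus_link[x]);
  # if entry 2 is a single word, the body never runs and the list stays empty
  if len(final_corpus_link[2][0].split())>1: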
    source_link_words = final_corpus_link[x][0].split()
    distance = model.wmdistance(source_link_words, final_corpus_source[x])
    if distance > 2:
      print ("skip")
      continue
    if np.isnan(distance):
      continue
    model_link_to_source_distances.append(distance)

In [0]:
print (np.mean(google_model_link_to_source_distances), np.mean(model_link_to_source_distances))

nan nan




In [0]:
google_model.wv['Serge']

In [0]:
model.wmdistance(['Serge', 'A.', 'Storms'], ('Serge', 'A.', 'Storms'))



inf
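
An infinite distance between two identical word lists is a strong hint that none of the tokens are in the model's vocabulary: gensim's `wmdistance` discards out-of-vocabulary tokens and returns `inf` when either document ends up empty. The sketch below shows a pre-check that could be run before computing a distance. The helper `in_vocab_tokens` and the tiny `vocab` set are hypothetical stand-ins for the trained model's vocabulary.

```python
def in_vocab_tokens(tokens, vocab):
    """Keep only the tokens that the (hypothetical) vocabulary contains."""
    return [t for t in tokens if t in vocab]


# Assumed toy vocabulary; a real one would come from the trained model
vocab = {'named', 'loose', 'storms'}

link_words = ['Serge', 'A.', 'Storms']
kept = in_vocab_tokens(link_words, vocab)

# The lookup is case-sensitive, so 'Storms' does not match 'storms'
print(kept)  # []

if not kept:
    print('no known words: a word mover distance would be inf')
```

Checking this up front makes it possible to tell "very dissimilar" apart from "nothing to compare".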

In [0]:
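def processList (myList):
  # split every entry on whitespace and collect the words in one flat list,
  # so multi-word entries such as 'Serge A. Storms ' become separate tokens
  final = []
  for words in myList:
    words = words.split()
    for word in words:
      final.append(word)
  return final

myList = ('named', 'Serge A. Storms ', 'loose')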

processList(myList)

['named', 'Serge', 'A.', 'Storms', 'loose']
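
The same flattening can be written as a single nested list comprehension. This is only an equivalent sketch of the helper above; the name `flatten_tokens` is hypothetical and is not used elsewhere in the notebook.

```python
def flatten_tokens(phrases):
    """Split each phrase on whitespace and return all words in one flat list."""
    return [word for phrase in phrases for word in phrase.split()]


print(flatten_tokens(('named', 'Serge A. Storms ', 'loose')))
# ['named', 'Serge', 'A.', 'Storms', 'loose']
```

Because `str.split()` with no arguments ignores leading and trailing whitespace, the trailing space in `'Serge A. Storms '` does not produce an empty token.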

In [0]:
model_link_to_source_distances = []

for x in range(len(final_corpus_link)):
  # only use link texts with more than two words
  if len(final_corpus_link[x][0].split())>2:
    # split the link text into words
    source_link_words = final_corpus_link[x][0].split()
    source_link_words = processList(source_link_words)
    #print ("Source Link......................")
    #print (source_link_words)
    #print ("Source text......................")
    final_corpus_source[x] = processList(final_corpus_source[x])
    #print (final_corpus_source[x])
    n = len(source_link_words)
    source_text_ngrams = ngrams(final_corpus_source[x], n)
    
    min_dist = np.inf
    #print ("N grams......................")

    for source_text_gram in source_text_ngrams:
      # compute distance
      #print (source_text_gram)
      distance = model.wmdistance(source_link_words, list(source_text_gram))
      #print (distance)
      if distance < min_dist:
        min_dist = distance
    #print ("Min dist:", min_dist)
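    # distances above 2 (including inf) are dropped here,
    # whereas the google_model loop clamps them to 2
    if min_dist > 2:
      continue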
    model_link_to_source_distances.append(min_dist)



In [0]:
google_model_link_to_source_distances = []

for x in range(len(final_corpus_link)):
  # only use link texts with more than two words
  if len(final_corpus_link[x][0].split())>2:
    # split the link text into words
    source_link_words = final_corpus_link[x][0].split()
    source_link_words = processList(source_link_words)
    #print ("Source Link......................")
    #print (source_link_words)
    #print ("Source text......................")
    final_corpus_source[x] = processList(final_corpus_source[x])
    #print (final_corpus_source[x])
    n = len(source_link_words)
    source_text_ngrams = ngrams(final_corpus_source[x], n)
    
    min_dist = np.inf
    #print ("N grams......................")

    for source_text_gram in source_text_ngrams:
      # compute distance
      #print (source_text_gram)
      distance = google_model.wmdistance(source_link_words, list(source_text_gram))
      #print (distance)
      if distance < min_dist:
        min_dist = distance
    #print ("Min dist:", min_dist)
    if min_dist > 2:
      min_dist = 2
    google_model_link_to_source_distances.append(min_dist)
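
The two loops above share one idea: slide a window as long as the link text over the source text and keep the smallest distance found, capped at 2. The sketch below factors that into a helper that takes the distance function as an argument. All names (`min_window_distance`, `sliding_windows`, `jaccard_distance`) are hypothetical, and the toy Jaccard distance only stands in for `wmdistance` so that the snippet runs without a trained model.

```python
import math


def sliding_windows(tokens, n):
    """Yield every run of n consecutive tokens as a list."""
    for i in range(len(tokens) - n + 1):
        yield tokens[i:i + n]


def min_window_distance(link_words, source_words, distance_fn, cap=2.0):
    """Smallest distance between link_words and any same-length window
    of source_words, clamped to cap."""
    if not link_words:
        return cap  # nothing to compare
    best = math.inf
    for window in sliding_windows(source_words, len(link_words)):
        best = min(best, distance_fn(link_words, window))
    return min(best, cap)


def jaccard_distance(a, b):
    """Toy stand-in for wmdistance: 1 - |intersection| / |union|."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)


link = ['John', 'Maynard', 'Keynes']
source = ['the', 'economist', 'John', 'Maynard', 'Keynes', 'wrote']

print(min_window_distance(link, source, jaccard_distance))            # 0.0
print(min_window_distance(link, ['too', 'short'], jaccard_distance))  # 2.0
```

With a helper like this, the same call could be made with either model's distance function, and both result lists would treat large distances in the same way.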

In [0]:
google_model_link_to_source_distances

In [0]:
model_link_to_source_distances

In [0]:
random_list

array([123,  48, 126, 122, 367, 142,  51,  10, 374, 191, 475,  69, 446,
       111, 323, 436, 245, 375, 272, 106])

In [0]:
random_list2 = np.random.randint(low=0, high=len(final_corpus_link), size=20)

random_list2

array([463, 366,  40, 297,  19,   3,  89, 207,   1, 348, 491, 253, 303,
        99, 139,  12, 409, 204, 164, 439])

In [0]:
google_model_link_to_source_distances_others = []

for x in random_list:
  for y in random_list2:
    if x==y:
      continue
    # only use link texts with more than two words
    if len(final_corpus_link[x][0].split())>2:
      # split the link text into words
      source_link_words = final_corpus_link[x][0].split()
      source_link_words = processList(source_link_words)
      #print ("Source Link......................")
      #print (source_link_words)
      #print ("Source text......................")
      final_corpus_source[y] = processList(final_corpus_source[y])
      #print (final_corpus_source[x])
      n = len(source_link_words)
      source_text_ngrams = ngrams(final_corpus_source[y], n)
      
      min_dist = np.inf
      #print ("N grams......................")

      for source_text_gram in source_text_ngrams:
        # compute distance
        #print (source_text_gram)
        distance = google_model.wmdistance(source_link_words, list(source_text_gram))
        #print (distance)
        if distance < min_dist:
          min_dist = distance
      #print ("Min dist:", min_dist)
      if min_dist > 2:
        min_dist = 2
      google_model_link_to_source_distances_others.append(min_dist)

In [0]:
model_link_to_source_distances_others = []

for x in random_list:
  for y in random_list2:
    if x==y:
      continue
    # only use link texts with more than two words
    if len(final_corpus_link[x][0].split())>2:
      # split the link text into words
      source_link_words = final_corpus_link[x][0].split()
      source_link_words = processList(source_link_words)
      #print ("Source Link......................")
      #print (source_link_words)
      #print ("Source text......................")
      final_corpus_source[y] = processList(final_corpus_source[y])
      #print (final_corpus_source[x])
      n = len(source_link_words)
      source_text_ngrams = ngrams(final_corpus_source[y], n)
      
      min_dist = np.inf
      #print ("N grams......................")

      for source_text_gram in source_text_ngrams:
        # compute distance
        #print (source_text_gram)
        distance = model.wmdistance(source_link_words, list(source_text_gram))
        #print (distance)
        if distance < min_dist:
          min_dist = distance
      #print ("Min dist:", min_dist)
      if min_dist > 2:
        min_dist = 2
      model_link_to_source_distances_others.append(min_dist)



In [0]:
print (np.median(google_model_link_to_source_distances_others), np.median(model_link_to_source_distances_others))

1.3128739115888088 2.0
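
Because every distance is clamped to 2, a median of exactly 2.0 means that at least half of the custom model's distances reached the cap. A small sketch of how that share could be reported alongside the median follows. The helper `capped_fraction` is hypothetical, and the sample values are made up for illustration; they are not results from this notebook.

```python
from statistics import median


def capped_fraction(distances, cap=2.0):
    """Fraction of distances that are at or above the cap."""
    if not distances:
        return 0.0
    return sum(1 for d in distances if d >= cap) / len(distances)


# Made-up example values, not output from the models above
sample = [0.9, 1.3, 2.0, 2.0, 2.0]

print(median(sample))           # 2.0
print(capped_fraction(sample))  # 0.6
```

Reporting the capped share next to the median shows how much of the comparison is driven by the cap rather than by measured distances.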
