## Word2Vec with Gensim

But why Word2Vec?

Word2Vec captures semantic and syntactic relations between words, which traditional TF-IDF or other frequency-based approaches cannot do. During training, each one-hot encoded word is mapped to a point in a multi-dimensional vector space, and words with similar meanings end up close together.

The neural network used here is shallow: it has a single hidden layer.

Note that Word2Vec needs a large amount of text to learn the relations between words and produce meaningful results.

Word2Vec is based on a sliding-window method: we choose a window size, which determines how many neighbouring words on each side of a target word count as its context.
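As a rough illustration of the window idea, the sketch below lists the (center, context) pairs that a window of a given size produces for one sentence. The helper name `context_pairs` is made up for this example; Gensim builds such pairs internally, and its `Word2Vec` class exposes the window size through the `window` parameter.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
pairs = context_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A larger window yields more pairs per word and tends to capture broader, more topical context; a smaller one stays closer to syntax.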

<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/gensim.png" width="1200">

In [1]:
!python -m spacy download en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4 MB)
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-py3-none-any.whl size=98051302 sha256=ee79a631d3742cb44e1788e6a810558bbb74e3a6a96676399aceadf2b5b4f994
  Stored in directory: /tmp/pip-ephem-wheel-cache-1wdqyz53/wheels/69/c5/b8/4f1c029d89238734311b3269762ab2ee325a42da2ce8edb997
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


## RESTART RUNTIME

### Word vectors are also available in spaCy, but Gensim is more powerful

In [1]:
import spacy
# Load the spaCy model installed above
nlp = spacy.load('en_core_web_md')
# Process a sentence using the model
doc = nlp("This is some text that I am processing with Spacy")

# Get the vector for 'text' (the fourth token, index 3):
doc[3].vector
# Get the mean vector for the entire sentence (useful for sentence classification etc.)
doc.vector

array([-5.36412969e-02,  2.79353321e-01, -1.05259977e-01, -1.76284965e-02,
        1.34550199e-01,  1.92671806e-01,  5.50469756e-03, -2.39132687e-01,
       -4.06342074e-02,  1.78010297e+00, -1.80772960e-01,  1.02661893e-01,
        6.84069991e-02, -5.09319194e-02, -7.65837058e-02, -3.77540514e-02,
        8.24129581e-03,  1.37752008e+00, -1.78934380e-01, -5.76109104e-02,
        1.66338980e-02, -3.62196006e-02, -7.48579949e-02,  4.40651290e-02,
       -2.65241470e-02,  2.41529979e-02,  9.79370065e-03, -1.13990309e-03,
        1.59522101e-01, -1.56648397e-01, -9.12139937e-02,  9.11872908e-02,
        1.07169405e-01, -1.08843103e-01, -7.94988051e-02, -4.74919155e-02,
       -1.60613850e-01, -2.82304995e-02, -1.03425637e-01, -1.14933215e-01,
        1.62531182e-01, -1.01342008e-01,  2.17013666e-03,  3.47881988e-02,
       -6.34927005e-02,  2.44374484e-01, -3.01910043e-02, -1.46046979e-02,
       -1.06488302e-01,  6.26319647e-03, -1.30655810e-01,  7.04905912e-02,
       -4.86716032e-02,  

In [2]:
doc.vector.shape

(300,)

## Gensim
- Gensim is a fairly easy-to-use module that implements both CBOW and Skip-gram.
- We can install it with !pip install gensim in a Jupyter Notebook.
- The alternative is to implement Word2Vec from scratch, which is quite complex.
- Read more about Gensim : https://radimrehurek.com/gensim/index.html
- FYI Gensim was developed and is maintained by the NLP researcher Radim Řehůřek and his company RaRe Technologies.

In [1]:
!wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

--2021-12-07 18:29:25--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.91.200
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.91.200|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2021-12-07 18:29:42 (95.5 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [None]:
# Extract the .gz file using gunzip.
# Google provides a model pre-trained on a Google News corpus, containing 3 million words and 300 dimensions.
!gunzip GoogleNews-vectors-negative300.bin.gz

print('gunzip done!')

In this notebook we use gensim to load the pre-trained model. To do so, we just call the function .load_word2vec_format(filepath); since the file is binary, we must set the binary parameter to True.

In [3]:
from gensim.models import Word2Vec
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(type(model))

print('load model!')

<class 'gensim.models.keyedvectors.Word2VecKeyedVectors'>
load model!


In [4]:
model["man"].shape

(300,)

In [7]:
model["man"]

array([ 0.32617188,  0.13085938,  0.03466797, -0.08300781,  0.08984375,
       -0.04125977, -0.19824219,  0.00689697,  0.14355469,  0.0019455 ,
        0.02880859, -0.25      , -0.08398438, -0.15136719, -0.10205078,
        0.04077148, -0.09765625,  0.05932617,  0.02978516, -0.10058594,
       -0.13085938,  0.001297  ,  0.02612305, -0.27148438,  0.06396484,
       -0.19140625, -0.078125  ,  0.25976562,  0.375     , -0.04541016,
        0.16210938,  0.13671875, -0.06396484, -0.02062988, -0.09667969,
        0.25390625,  0.24804688, -0.12695312,  0.07177734,  0.3203125 ,
        0.03149414, -0.03857422,  0.21191406, -0.00811768,  0.22265625,
       -0.13476562, -0.07617188,  0.01049805, -0.05175781,  0.03808594,
       -0.13378906,  0.125     ,  0.0559082 , -0.18261719,  0.08154297,
       -0.08447266, -0.07763672, -0.04345703,  0.08105469, -0.01092529,
        0.17480469,  0.30664062, -0.04321289, -0.01416016,  0.09082031,
       -0.00927734, -0.03442383, -0.11523438,  0.12451172, -0.02

In [8]:
# Processing a sentence is not as simple as with spaCy: each word must be looked up individually
vectors = [model[x] for x in "This is some text I am processing with Spacy".split(' ')]
vectors

[array([-0.2890625 ,  0.19921875,  0.16015625,  0.02526855, -0.23632812,
         0.10205078,  0.06640625, -0.16503906,  0.12597656,  0.22070312,
         0.05517578, -0.28710938, -0.02148438,  0.05541992,  0.01574707,
         0.29296875,  0.19433594, -0.01531982,  0.03955078, -0.21484375,
         0.00994873,  0.16015625,  0.07958984, -0.05932617,  0.12353516,
        -0.27148438, -0.10205078,  0.078125  , -0.07519531,  0.22363281,
         0.16210938, -0.04614258,  0.12304688,  0.07275391,  0.25      ,
         0.0072937 , -0.38867188,  0.10644531,  0.20996094,  0.06103516,
         0.10107422,  0.16894531, -0.15429688, -0.08251953,  0.06542969,
        -0.12255859, -0.11621094,  0.04248047,  0.08251953,  0.09716797,
        -0.05371094,  0.125     ,  0.15039062, -0.09228516,  0.23925781,
         0.15234375,  0.1796875 , -0.26171875,  0.15429688,  0.09619141,
        -0.30859375, -0.05224609, -0.18652344, -0.24414062, -0.0612793 ,
        -0.12695312,  0.14160156, -0.03295898,  0.0
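To get a single sentence vector comparable to spaCy's `doc.vector`, one option is to average the word vectors ourselves. The sketch below uses a tiny hand-made dictionary in place of the real model so that it runs on its own; `sentence_vector` and `toy_vectors` are hypothetical names. It also skips out-of-vocabulary words, since a direct lookup of a missing word raises a `KeyError`.

```python
def sentence_vector(sentence, vectors):
    """Average the vectors of in-vocabulary tokens; return None if none are known."""
    known = [vectors[tok] for tok in sentence.split() if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    # Component-wise mean over all known word vectors
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]

# Toy stand-in for the pre-trained model (3 dimensions instead of 300)
toy_vectors = {
    "some": [1.0, 0.0, 2.0],
    "text": [3.0, 2.0, 0.0],
}
sent_vec = sentence_vector("some unknownword text", toy_vectors)
print(sent_vec)
```

The same function should also work with the loaded `model`, since it only relies on `in` and `[]` lookups.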

### Cosine Similarity


<img src="https://frenzy86.s3.eu-west-2.amazonaws.com/IFAO/nlp/cosine.png" width="1200">

Mathematically speaking, Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any angle in the interval (0,π] radians. 

It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors oriented at 90° relative to each other have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance (due to the size of the documents), they may still be oriented close together. The smaller the angle, the higher the cosine similarity.
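The definition can be checked with a few lines of plain Python. This is only a sketch to make the formula cos(θ) = (u · v) / (‖u‖ ‖v‖) concrete; the function name `cosine_similarity` and the example vectors are chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction, very different magnitude: similarity close to 1
same_direction = cosine_similarity([1.0, 2.0], [10.0, 20.0])
# Perpendicular vectors: similarity 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# Opposite direction: similarity close to -1
opposite = cosine_similarity([1.0, 2.0], [-2.0, -4.0])
print(same_direction, orthogonal, opposite)
```

The first pair shows the independence from magnitude: scaling a vector by 10 does not change the angle.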

By computing the cosine similarity between the vector representations of two words, we can tell how similar they are.

Note: the scipy function cosine(u, v) computes the cosine distance; we can turn the distance into a similarity by subtracting it from 1.

In [10]:
from scipy.spatial.distance import cosine
cosine(model["man"],model["boy"]) # cosine distance

0.3175129294395447

In [12]:
1-cosine(model["man"],model["boy"])

0.6824870705604553

The method .similarity(word1, word2) is already implemented to compute the similarity of two words directly.

In [13]:
model.similarity("man","boy") # these words are very similar

0.68248713

In [5]:
model.similarity("cat","mouse") # these words are quite different (or at least I hope they are)

0.46566275

We can look up the words most similar to a given keyword using the .most_similar method.

In [7]:
model.most_similar(positive=['shocked'], topn=10)

[('stunned', 0.8812650442123413),
 ('surprised', 0.8090525269508362),
 ('flabbergasted', 0.8001877069473267),
 ('horrified', 0.7986997365951538),
 ('dismayed', 0.7774383425712585),
 ('dumbfounded', 0.7773443460464478),
 ('appalled', 0.7613470554351807),
 ('astonished', 0.757473349571228),
 ('taken_aback', 0.7515914440155029),
 ('astounded', 0.7368210554122925)]

With the same method we can also run more complex queries, such as finding the words most similar to some keywords, passed in the positive parameter, but opposed to other keywords, passed in the negative parameter.

In [8]:
model.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.518113374710083),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411999702454)]
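To make the positive/negative arithmetic concrete, here is a minimal sketch on hand-made 2-D toy vectors. The vectors and the helper names `unit` and `analogy` are invented for illustration and are not Gensim's code. It assumes the usual recipe: normalise each input vector to unit length, add the positive ones, subtract the negative ones, then rank the remaining words by cosine similarity to the result.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for k, x in enumerate(unit(vectors[word])):
            target[k] += sign * x
    target = unit(target)
    scores = [
        (word, sum(a * b for a, b in zip(target, unit(vec))))
        for word, vec in vectors.items()
        if word not in positive and word not in negative  # skip the query words
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy space: first axis ~ "royalty", second axis ~ "gender" (an assumption of this sketch)
toy = {
    "man": [0.1, 1.0], "woman": [0.1, -1.0],
    "king": [1.0, 1.0], "queen": [1.0, -1.0],
    "prince": [0.8, 0.9], "apple": [-1.0, 0.1],
}
result = analogy(toy, positive=["woman", "king"], negative=["man"])
print(result[0][0])
```

Excluding the query words matters: without that filter, one of the inputs often ranks near the top simply because it contributed to the target vector.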

Another useful method is .doesnt_match(words), which takes a list of words as input and returns the one that fits least with the others.

In [13]:
model.doesnt_match("breakfast moon sneaker dinner lunch".split())



'sneaker'
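One plausible way to find the odd word out, sketched here on toy 2-D vectors, is to average the unit-normalised vectors of all the words and return the word least similar to that average. The names `unit`, `odd_one_out` and the toy vectors are assumptions made for this example, not Gensim's actual code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word whose unit vector is least aligned with the group mean."""
    normed = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(normed.values())))
    mean = [sum(vec[k] for vec in normed.values()) / len(normed) for k in range(dim)]
    # The lowest dot product with the mean marks the least related word
    return min(words, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))

# Toy space: meals point one way, the shoe points another
toy = {
    "breakfast": [0.9, 0.1],
    "lunch": [1.0, 0.2],
    "dinner": [0.8, 0.0],
    "sneaker": [0.0, 1.0],
}
odd = odd_one_out(toy, ["breakfast", "sneaker", "dinner", "lunch"])
print(odd)
```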

In [None]:
## Small Exercise

https://randomwordgenerator.com/

For 10 words, find:
* most similar
* model.similarity for pairs
* model.doesnt_match



In [None]:
mercy
pile
plot
owl
riot

In [None]:
model.most_similar(positive=['mercy'], topn=10)

In [None]:
model.most_similar(positive=['pile', 'mercy'])

In [None]:
model.doesnt_match("zzzzzzz kkkkkkkk uuuuuuuu eeeeeeeeee ttttttttttt pppppppppppppp".split())

### Automatically detect common phrases – aka multi-word expressions, word n-gram collocations – from a stream of sentences.



In [None]:
from gensim.models.phrases import Phraser, Phrases

In [None]:
!wget http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz

--2021-03-31 12:45:43--  http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz
Resolving qwone.com (qwone.com)... 173.48.209.137
Connecting to qwone.com (qwone.com)|173.48.209.137|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14666916 (14M) [application/x-gzip]
Saving to: ‘20news-18828.tar.gz’


2021-03-31 12:45:46 (5.03 MB/s) - ‘20news-18828.tar.gz’ saved [14666916/14666916]



In [None]:
!mkdir folder

In [None]:
!tar -xf 20news-18828.tar.gz -C folder/

In [None]:
# Import libraries to build Word2Vec model, and load Newsgroups data
import os
import sys
import re
from gensim.models import Word2Vec
from gensim.models.phrases import Phraser, Phrases
TEXT_DATA_DIR = 'folder/20news-18828/'

In [None]:
# Newsgroups data is split between many files and folders.
# Directory structure: <newsgroup label>/<post ID>

texts = []         # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []        # list of label ids
label_text = []    # list of label texts

# Go through each directory
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            # Newsgroup posts are named as numbers, with no extensions.
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                with open(fpath, encoding='latin-1') as f:
                    t = f.read()
                i = t.find('\n\n')  # the header ends at the first blank line
                if 0 < i:
                    t = t[i:]  # keep only the body of the post
                texts.append(t)
                labels.append(label_id)
                label_text.append(name)

print('Found %s texts.' % len(texts))

Found 18828 texts.


In [None]:
# Cleaning data - remove punctuation from every newsgroup text
sentences = []
# Go through each text in turn
for ii in range(len(texts)):
    sentences = [re.sub(pattern=r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]', 
                        repl='', 
                        string=x
                       ).strip().split(' ') for x in texts[ii].split('\n') 
                      if not x.endswith('writes:')]
    sentences = [x for x in sentences if x != ['']]
    texts[ii] = sentences

In [None]:
print(texts[6])

[['The', 'motto', 'originated', 'in', 'the', 'StarSpangled', 'Banner', '', 'Tell', 'me', 'that', 'this', 'has'], ['something', 'to', 'do', 'with', 'atheists'], ['The', 'motto', 'oncoins', 'originated', 'as', 'a', 'McCarthyite', 'smear', 'which', 'equated', 'atheism'], ['with', 'Communism', 'and', 'called', 'both', 'unamerican'], ['No', 'it', "didn't", '', 'The', 'motto', 'has', 'been', 'on', 'various', 'coins', 'since', 'the', 'Civil', 'War'], ['It', 'was', 'just', 'required', 'to', 'be', 'on', 'all', 'currency', 'in', 'the', "50's"], ['keith']]


In [None]:
# concatenate all sentences from all texts into a single list of sentences
all_sentences = []
for text in texts:
    all_sentences += text

In [None]:
# Phrase Detection
# Give some common terms that can be ignored in phrase detection
# For example, 'state_of_affairs' will be detected because 'of' is provided here:
common_terms = ["of", "with", "without", "and", "or", "the", "a"]
# Create the relevant phrases from the list of sentences
# (gensim 4.x renamed the `common_terms` parameter to `connector_words`):
phrases = Phrases(all_sentences, common_terms=common_terms)
# The Phraser object is used from now on to transform sentences
bigram = Phraser(phrases)
# Applying the Phraser to transform our sentences is simply
all_sentences = list(bigram[all_sentences])

In [None]:
print(all_sentences[5676])

['Question', 'Do_you', 'retract', 'your', 'claim', 'that', 'aa', 'posters', 'have', 'not', 'become']


In [None]:
print(bigram[all_sentences[5676]])

['Question', 'Do_you', 'retract', 'your', 'claim', 'that', 'aa', 'posters', 'have', 'not', 'become']
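To see why a pair such as 'Do_you' gets merged, it helps to look at how a collocation score can be computed. The sketch below assumes the scoring formula from the original Word2Vec paper, which Gensim's default scorer is modelled on: score(a, b) = (count(a b) - min_count) * N / (count(a) * count(b)). Here N is taken to be the number of distinct words, the helper name `bigram_scores` and the toy sentences are invented, and `min_count=1` is chosen only because the toy corpus is tiny.

```python
from collections import Counter

def bigram_scores(sentences, min_count=1):
    """Score each adjacent word pair; higher means a more likely phrase."""
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))  # adjacent pairs only
    vocab_size = len(unigrams)
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }

toy_sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "is", "far"],
    ["the", "city", "is", "big"],
]
scores = bigram_scores(toy_sentences, min_count=1)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Pairs whose score exceeds a threshold are joined with an underscore. In Gensim the cut-off is set through the `threshold` parameter of `Phrases`, and pairs seen fewer than `min_count` times are ignored.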


In [None]:
all_sentences

[['Archivename', 'atheismresources'],
 ['Altatheismarchivename', 'resources'],
 ['Lastmodified', '11', 'December_1992'],
 ['Version', '10'],
 ['Atheist', 'Resources'],
 ['Addresses', 'of', 'Atheist', 'Organizations'],
 ['USA'],
 ['FREEDOM', 'FROM', 'RELIGION', 'FOUNDATION'],
 ['Darwin',
  'fish',
  'bumper',
  'stickers',
  'and',
  'assorted',
  'other',
  'atheist',
  'paraphernalia',
  'are'],
 ['available',
  'from',
  'the',
  'Freedom',
  'From',
  'Religion',
  'Foundation',
  'in',
  'the',
  'US'],
 ['Write', 'to', 'FFRF', 'PO_Box', '750', 'Madison_WI', '53701'],
 ['Telephone', '608', '2568900'],
 ['EVOLUTION', 'DESIGNS'],
 ['Evolution',
  'Designs',
  'sell',
  'the',
  'Darwin',
  'fish',
  "It's",
  'a',
  'fish',
  'symbol',
  'like',
  'the',
  'ones'],
 ['Christians',
  'stick',
  'on',
  'their',
  'cars',
  'but',
  'with',
  'feet',
  'and',
  'the',
  'word',
  'Darwin',
  'written'],
 ['inside',
  'The',
  'deluxe',
  'moulded',
  '3D',
  'plastic',
  'fish',
  'is'