Problem
--
You want to implement word embeddings - Semantic meaning

Solution
--
Word embeddings are prediction based, and they use shallow neural networks to train the model that will lead to learning the weight and using them as a vector representation.

<font color='green'>word2vec</font>
--
**word2vec** is the deep learning Google framework to train word embeddings. It will use all the words of the whole corpus and predict
the nearby words. It will create a vector for all the words present in the
corpus in a way so that the context is captured. It also outperforms any
other methodologies in the space of word similarity and word analogies.

There are mainly 2 types of word2vec Model.

• Skip-Gram

• Continuous Bag of Words (CBOW)

<img src="https://drive.google.com/uc?id=1ZC7kOYkuY2BGRCONWde38usTOCRJqJlR"/>

The above figure shows the architecture of the CBOW and skip-gram
algorithms used to build word embeddings. Let us see how these models
work in detail.

Skip-Gram
--
The skip-gram model is used to predict the probabilities of a word given the context of word or words.

Let us take a small sentence and understand how it actually works.
Each sentence will generate a target word and context, which are the words
nearby. The number of words to be considered around the target variable
is called the window size. The table below shows all the possible target
and context variables for window size 2. Window size needs to be selected
based on data and the resources at your disposal. The larger the window
size, the higher the computing power.

<img src="https://drive.google.com/uc?id=18nKDL_JAX96Zs_ILGMrcdd517GWLwrW2"/>

Since it takes a lot of text and computing power, let us go ahead and take sample data and build a skip-gram model.

As mentioned *in earlier NB's*, import the text corpus and break it into sentences. Perform **some cleaning and preprocessing** like the removal of
punctuation and digits, and split the sentences into words or tokens, etc.


In [1]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.3.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting numpy<2.0,>=1.18.5 (from gensim)
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.0/61.0 kB[0m [31m4.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.6/60.6 kB[0m [31m4.9 MB/s[0m eta [36m0:00:00[0m
Downloading gensim-4.3.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m26.6/26.6 MB[0m [31m77.7 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.0 MB)
[2K   [90m━━━━━━━━━━━

In [1]:
#Example sentences
sentences = [['I', 'love', 'nlp'],
['I', 'will', 'learn', 'nlp', 'in', '2','months'],
['nlp', 'is', 'future'],
[ 'nlp', 'saves', 'time', 'and', 'solves','lot', 'of', 'industry', 'problems'],
['nlp', 'uses', 'machine', 'learning']]

#import library

import gensim
from gensim.models import Word2Vec
from matplotlib import pyplot


In [2]:
length=[]
for i in sentences:
    print(len(i))
    length.append(len(i))

3
7
3
9
4


In [3]:
import numpy as np

In [4]:
np.mean(length)

5.2

In [5]:
np.median(length)

4.0

In [6]:
np.min(length)

3

In [7]:
np.max(length)

9

### training the model
https://radimrehurek.com/gensim/models/word2vec.html

In [None]:
# vocabory = 100
# vector size=2, king = [0.1, 0.2], xy=[0.2,0.3]

In [10]:

skipgram = Word2Vec(sentences, vector_size = 50, window = 3, min_count=1,sg = 0,epochs=5)
# vector_size : int, optional
#     Dimensionality of the word vectors.
# window : int, optional
#     Maximum distance between the current and predicted word within a sentence.
# min_count=1 -> Minimium frequency count of words.
#                The model would ignore words that do not satisfy the min_count.
#                Extremely infrequent words are usually unimportant. (default 5)
# workers -> How many threads to use behind the scenes? (default 3)
# sg -> (default 0 or CBOW) The training algorithm, either CBOW (0)
#                           or skip gram (1).

In [11]:
# distribution
# Word2Vec?

In [12]:
print(skipgram.wv.key_to_index)

{'nlp': 0, 'I': 1, 'future': 2, 'love': 3, 'will': 4, 'learn': 5, 'in': 6, '2': 7, 'months': 8, 'is': 9, 'learning': 10, 'machine': 11, 'time': 12, 'and': 13, 'solves': 14, 'lot': 15, 'of': 16, 'industry': 17, 'problems': 18, 'uses': 19, 'saves': 20}


### access vector for one word

In [13]:

print(len(skipgram.wv['nlp']))  # get numpy vector of a word

# Since our vector size parameter was 50, the model
# gives a vector of size 50 for each word.

50


In [14]:
# "i didn't like the product":50
# [..]+[..],[...][...][....]/5=50

### Similar to a word

In [15]:
skipgram.wv.most_similar('solves', topn=2)

[('machine', 0.5294367074966431), ('nlp', 0.21074169874191284)]

### Saving a model

In [16]:
skipgram.save("skipgram.model")

### Reloading the model

In [17]:
model = Word2Vec.load("skipgram.model")
model.wv.most_similar('learn', topn=2)

[('and', 0.3066261410713196), ('industry', 0.27047690749168396)]

In [18]:
model.wv.most_similar('industry', topn=2)

[('learn', 0.27047690749168396), ('in', 0.2048206925392151)]

In [19]:
model.wv.most_similar('machine', topn=2)

[('solves', 0.5294366478919983), ('of', 0.2258927822113037)]

#### Training an actual corpus

#### TASK


In [20]:
documents = [
    "Human machine interface, for lab abc& computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees?",
    "Graph minors IV Widths of trees and well. quasi ordering",
    "Graph minors A survey",
]



### Activity

In [21]:
# cleaning the texts

# remove common words and tokenize

# remove words that appear only once

### Activity- Solution

In [22]:
from pprint import pprint  # pretty-printer
from collections import defaultdict
import re

#Get the character set
characters=set()
for sent in documents:
    for word in sent.split():
        for char in word:
            characters.add(char.lower())

# cleaning the texts

documents_clean=[]

for sent in documents:
#     print(sent)
    sent=re.sub("&","",sent)
    sent=re.sub(",","",sent)
    sent=re.sub("\?","",sent)
    sent=re.sub("\.","",sent)
#     print(sent)
    documents_clean.append(sent)

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents_clean
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

pprint(texts)


[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]


  sent=re.sub("\?","",sent)
  sent=re.sub("\.","",sent)


In [23]:
skipgram = Word2Vec(texts, vector_size = 50, window = 3, min_count=1,sg = 1,epochs=9)

In [24]:
print(skipgram.wv.key_to_index)

{'system': 0, 'graph': 1, 'trees': 2, 'user': 3, 'minors': 4, 'eps': 5, 'time': 6, 'response': 7, 'survey': 8, 'computer': 9, 'interface': 10, 'human': 11}


In [25]:
vector = skipgram.wv['computer']  # get numpy vector of a word
sims = skipgram.wv.most_similar('computer', topn=10)  # get other similar words
print(sims)
#Try with more number of epochs

[('eps', 0.22442299127578735), ('system', 0.0998455286026001), ('time', 0.089928537607193), ('human', 0.058373644948005676), ('graph', 0.0013571253512054682), ('response', -0.0013637219090014696), ('trees', -0.037274789065122604), ('minors', -0.06371434032917023), ('interface', -0.11219383776187897), ('user', -0.12241575121879578)]


In [26]:
 skipgram.wv['aayush']

KeyError: "Key 'aayush' not present"

**Note** : We get an error saying the word doesn’t exist because this word was not there in our input training data. This is the reason we need to train the algorithm on as much data possible so that we do not miss out on words.


Continuous Bag of Words (CBOW)
--
Now let’s see how to build CBOW model. (Its very similar to SkipGram model)

In [27]:
#import library
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

#Example sentences
sentences = [['I', 'love', 'nlp'],
['I', 'will', 'learn', 'nlp', 'in', '2','months'],
['nlp', 'is', 'future'],
[ 'nlp', 'saves', 'time', 'and', 'solves',
'lot', 'of', 'industry', 'problems'],
['nlp', 'uses', 'machine', 'learning']]

In [28]:
# training the model
cbow = Word2Vec(sentences, vector_size =128, window = 3, min_count=1,sg = 0)
# size=50 -> means size of vector to represent each token or word
# window=1 -> The maximum distance between the target word and its neighboring word.
# min_count=1 -> Minimium frequency count of words.
#                The model would ignore words that do not satisfy the min_count.
#                Extremely infrequent words are usually unimportant.
# workers -> How many threads to use behind the scenes?
# as sg=0 i.e no skipgram , hence default CBOW

# access vector for one word
print(cbow.wv['nlp'])

[-4.1892752e-04  1.8471200e-04  3.9869919e-03  7.0384946e-03
 -7.2679296e-03 -5.5600069e-03  5.0459942e-03  7.0101470e-03
 -3.9183032e-03 -2.9401341e-03  5.7660192e-03 -1.1980245e-03
 -3.5442291e-03  5.1203528e-03 -3.7970003e-03 -1.4187638e-03
  2.2473279e-03  7.7490136e-04 -6.4728241e-03 -7.3818890e-03
  5.7123173e-03  3.9611422e-03  5.2794479e-03  5.9598871e-04
  4.9616331e-03 -2.6604421e-03 -7.3937606e-04  4.5066979e-03
 -5.8762794e-03 -3.0750809e-03 -5.8684237e-03 -7.2659552e-04
  7.4516553e-03 -5.7180990e-03 -1.8232567e-03 -1.5138602e-03
  6.3104974e-03 -4.6335123e-03  3.5283156e-05 -3.7138546e-03
 -7.5027738e-03  3.9119478e-03 -6.8434263e-03 -3.4311134e-03
 -2.7421862e-05 -2.3139175e-04 -5.9853438e-03  7.5115180e-03
  3.8922327e-03  7.2133932e-03 -6.3733729e-03  3.5123425e-03
 -3.2320907e-03  6.4416882e-04  6.6395467e-03 -3.4860754e-03
  3.5292972e-03 -5.3023128e-03 -2.7722567e-03  7.3425844e-03
 -1.2325412e-03  2.5107153e-04 -3.2348670e-03 -6.0021002e-03
 -1.1781314e-03  1.92952

Important Observation
--
To train these models, it requires a huge amount of computing
power. So, let us go ahead and use Google’s pre-trained model, which has
been trained with over 100 billion words.

Download the model from the below path and keep it in your local
storage:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

or **better off from this link** :

https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

Note **if running on Jupyter NB** : The Google Db is soo large that we would get ValueError, like this : ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.


In [30]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

# import gensim package
import gensim

# load the saved model
model = gensim.models.KeyedVectors.load_word2vec_format('~/Downloads/GoogleNews-vectors-negative300.bin.gz', binary=True)
# model = gensim.models.KeyedVectors.load_word2vec_format('https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz', binary=True)

HTTPError: 404 Client Error: Not Found for url: https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

In [31]:
# lets check similarity
print (model.similarity('This', 'is'))

#Lets check one more.
print (model.similarity('post', 'book'))

#print(model.similarity('seed', 'need'))

AttributeError: 'Word2Vec' object has no attribute 'similarity'

“`This`” and “`is`” have a good amount of similarity, but the similarity
between the words “`post`” and “`book`” is poor. For any given set of words, it uses the vectors of both the words and calculates the similarity between them.

In [32]:
# Finding the odd one out.
model.doesnt_match('breakfast cereal dinner lunch'.split())

AttributeError: 'Word2Vec' object has no attribute 'doesnt_match'

Of '`breakfast`’, ‘`cereal`’, ‘`dinner`’ and ‘`lunch`', only **cereal** is the word that is
not anywhere related to the remaining 3 words.

In [33]:
# It is also finding the relations between words.
#model.most_similar(positive=['woman', 'king'] , negative=['man'])  # default value of topn is 10

# try this too :
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=3)

AttributeError: 'Word2Vec' object has no attribute 'most_similar'

<img src="https://drive.google.com/uc?id=11Yu1Gj4Rw5BccL6KXnT_rXqYPyJbEUfZ"/>

![Screen%20Shot%202021-04-10%20at%202.37.02%20AM.png](attachment:Screen%20Shot%202021-04-10%20at%202.37.02%20AM.png)

In [34]:
import gensim.downloader

# Show all available models in gensim-data

pprint(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']


In [35]:
# Download the "glove-twitter-25" embeddings

glove_vectors = gensim.downloader.load('glove-twitter-25')

# Use the downloaded vectors as usual:

glove_vectors.most_similar('twitter')




[('facebook', 0.948005199432373),
 ('tweet', 0.9403423070907593),
 ('fb', 0.9342358708381653),
 ('instagram', 0.9104824066162109),
 ('chat', 0.8964964747428894),
 ('hashtag', 0.8885937333106995),
 ('tweets', 0.8878158330917358),
 ('tl', 0.8778461217880249),
 ('link', 0.8778210878372192),
 ('internet', 0.8753897547721863)]

In [42]:
glove_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=3)

[('meets', 0.8841924071311951),
 ('prince', 0.832163393497467),
 ('queen', 0.8257461190223694)]

Implementing <font color='green'>fastText</font>
--
**fastText** is another deep learning framework developed by Facebook to capture context and meaning.

Problem
--
How to implement fastText in Python.

Solution
--
fastText is the improvised version of word2vec. word2vec basically
considers words to build the representation. But fastText takes each
character while computing the representation of the word.

In [36]:
# Let us see how to build a fastText word embedding.
# Import FastText
from gensim.models import FastText
from sklearn.decomposition import PCA
from matplotlib import pyplot

#Example sentences
sentences = [['I', 'love', 'nlp'],
['I', 'will', 'learn', 'nlp', 'in', '2','months'],
['nlp', 'is', 'future'],
[ 'nlp', 'saves', 'time', 'and', 'solves',
'lot', 'of', 'industry', 'problems'],
['nlp', 'uses', 'machine', 'learning']]

fast = FastText(sentences,window=1, min_count=1, workers=5, min_n=1, max_n=2)
# size=10 -> means size of vector to represent each token or word
# window=1 -> The maximum distance between the target word and its neighboring word.
# min_count=1 -> Minimium frequency count of words.
#                The model would ignore words that do not satisfy the min_count.
#                Extremely infrequent words are usually unimportant.
# workers -> How many threads to use behind the scenes?
# min_n=1, max_n=2  -> When finding similarity or analogies like this :
# "Father" - "Boy" + "Girl" == "Mother"
#print(fast.most_similar(['girl', 'father'], ['boy'], topn=3))
# [('mother', 0.7996115684509277), ('grandfather', 0.7629683613777161),
# ('wife', 0.7478234767913818)]
# we want the model to show min 1 and max 2 analogies


# vector for word nlp
print(fast.wv['nlp'])


[-3.5651671e-03  9.3698065e-04 -2.0309705e-03  1.3372294e-03
  1.0669909e-03 -1.3030644e-03  4.5768594e-04 -3.4850935e-04
  3.0172872e-04 -3.3416669e-04  2.3032913e-03 -2.1352421e-03
 -3.5649028e-03 -2.0423550e-03  1.6025617e-04  3.4330557e-03
  1.6206586e-04 -1.7705902e-03  2.4007181e-04 -2.9109141e-03
  4.3194192e-03 -2.1401787e-04  9.1965008e-04 -1.9673463e-03
  4.2877963e-04  1.9754227e-03 -5.5509841e-04 -7.4600609e-04
 -1.6250368e-04  4.6096946e-04 -4.1272063e-03 -3.7344757e-03
  7.8168191e-04  6.3480536e-04 -2.7865777e-03  1.0397271e-03
  7.2312576e-04  4.0080363e-04 -1.0363614e-03  6.7208544e-04
  5.9388904e-04  1.4350816e-03 -9.1749663e-04  1.1620179e-03
 -4.6320874e-03 -7.4696593e-04 -1.8302952e-03 -1.8878000e-04
  5.4830208e-04 -1.0331636e-03 -9.7611046e-04 -6.6045811e-04
 -1.0406146e-03  4.3693674e-03  3.6964498e-03 -2.3840894e-03
  1.0004241e-03  1.6795534e-04 -4.5526810e-03  8.8466250e-04
  2.2786174e-03 -1.2467529e-04 -2.6847786e-04  2.6513743e-03
 -1.5358579e-03 -4.61347

In [37]:
# Try this
print(fast.wv.most_similar(['machine', 'learning'], ['nlp'], topn=3))

[('learn', 0.352792888879776), ('love', 0.23435966670513153), ('industry', 0.16918529570102692)]



<hr>
<br><br>
<u><b>Further Resources</b></u> :

https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

https://datascience.stackexchange.com/questions/22250/what-is-the-difference-between-a-hashing-vectorizer