# <center> <font size = 24 color = 'steelblue'> <b> Pre-trained word embedding model from gensim

<div class="alert alert-block alert-info">
    
<font size = 4>
    
**This notebook demonstrates how to represent text using pre-trained word embedding models.**

# <a id= 'w0'>
<font size = 4>
    
**Table of contents:**<br>
[1. Installation and import of necessary packages](#w1)<br>
[2. Model implementation](#w2)<br>
[3. Load the embedding model](#w3)<br>

    

<font size =5 color = 'seagreen'>
    
Using a pre-trained word2vec model to find the words most similar to a given word.
    
<b>For this demonstration, the pre-trained `Google News` word2vec vectors are used.

##### <a id = 'w1'>
<font size = 10 color = 'midnightblue'> <b>Installation and import of necessary packages

In [1]:
!pip install scikit-learn
!pip install gensim
!pip install spacy
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 19.5 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


 <font size =5 color = 'seagreen'> <b> Import packages

In [2]:
import os
from gensim.models import Word2Vec, KeyedVectors

# Suppress warning messages to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

import spacy

[top](#w0)

 ##### <a id = 'w2'>
<font size = 10 color = 'midnightblue'> <b>  Model implementation

<font size = 5 color = pwdrblue> <b> Get the word embeddings

In [10]:
# Build the path to the pre-trained vectors, assumed to be in the current working directory
path = os.getcwd()
file_name = 'GoogleNews-vectors-negative300.zip.gz'
pretrained_path = os.path.join(path, file_name)

<font size = 5 color = pwdrblue> <b> Load the model

In [11]:
# binary=True because the file is stored in the word2vec binary format
w2v_model = KeyedVectors.load_word2vec_format(pretrained_path, binary=True)

<font size = 5 color = pwdrblue> <b> Check number of words in vocabulary

In [12]:
print("Number of words in vocabulary: ",len(w2v_model.index_to_key)) #Number of words in the vocabulary.

Number of words in vocabulary:  3000000


In [13]:
print(f"First few words of the vocabulary :\n{ w2v_model.index_to_key[:20]}")

First few words of the vocabulary :
['</s>', 'in', 'for', 'that', 'is', 'on', '##', 'The', 'with', 'said', 'was', 'the', 'at', 'not', 'as', 'it', 'be', 'from', 'by', 'are']


<font size = 5 color = pwdrblue> <b> Query the model for the words most similar to a given word, such as `joyful` or `travel` (lookups are case-sensitive)

In [14]:
w2v_model.most_similar('joyful')

[('joyous', 0.818248987197876),
 ('Besnik_Berisha_Pristina', 0.6848508715629578),
 ('joy', 0.6633967757225037),
 ('joyousness', 0.6440029740333557),
 ('exuberant', 0.6130944490432739),
 ('uplifting', 0.593187153339386),
 ('old_demonstrator_Juliya', 0.592427134513855),
 ('sorrowful', 0.5822992324829102),
 ('cheerful', 0.5811519026756287),
 ('indescribable_joy', 0.581040620803833)]

In [15]:
w2v_model.most_similar('Travel')

[('travel', 0.6577276587486267),
 ('Adventure_Travel', 0.636359453201294),
 ('Travel_Agent', 0.608870804309845),
 ('Aol_Autos', 0.5961860418319702),
 ('Destinations', 0.5935966968536377),
 ('Vacations', 0.5759146213531494),
 ('Vantage_Deluxe', 0.573976993560791),
 ('Travel_Destinations', 0.5731626152992249),
 ('Travel_Agents', 0.5724309086799622),
 ('Escorted_Tours', 0.557145893573761)]

In [17]:
w2v_model.most_similar('travel')

[('traveling', 0.6823130249977112),
 ('Travel', 0.6577276587486267),
 ('travelers', 0.5849088430404663),
 ('trips', 0.5770835280418396),
 ('travels', 0.5704988241195679),
 ('trip', 0.569098174571991),
 ('journeys', 0.5535728335380554),
 ('airfare', 0.5398489832878113),
 ('Travelling', 0.5369202494621277),
 ('Traveling', 0.5305294394493103)]
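
The scores in these results are cosine similarities between the query word's vector and each neighbour's vector. As a rough illustration of that ranking, here is a minimal sketch in plain Python. The three-dimensional `toy_vectors` and the helper names `cosine_similarity` and `rank_by_similarity` are made up for this example; they are not part of gensim or of the Google News model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, for illustration only
# (the real Google News vectors have 300 dimensions).
toy_vectors = {
    "joyful": [0.9, 0.1, 0.3],
    "joyous": [0.8, 0.2, 0.3],
    "airfare": [-0.2, 0.9, 0.1],
}


def rank_by_similarity(query, vectors):
    """Sort every other word by cosine similarity to `query`, highest first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_similarity("joyful", toy_vectors)
print(ranking)
```

In this toy data `joyous` points in nearly the same direction as `joyful`, so it ranks first, while `airfare` points elsewhere and ranks last.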

<div class="alert alert-block alert-success">
<font size = 4>
    
<center><b> `most_similar` raises a `KeyError` when the queried word is not present in the vocabulary.</b>


<font size = 5 color = seagreen> <b> The snippet below handles this error and lets you check similar words for several inputs in a row:

In [19]:
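inp = "y"
while inp.lower() == 'y':
    word = input("Enter a word to get similar words: ")
    try:
        # Look up first, so the header is printed only when the word exists
        similar_words = w2v_model.most_similar(word)
        print(f"Most similar words to '{word}' :\n")
        for t in similar_words:
            print(t)
        print('\n')
    except KeyError:
        # Raised by gensim when the word is not in the vocabulary
        print('Word does not exist in vocabulary!')
    inp = input("Do you want to continue? (Y/N) : ")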


Enter a word to get similar words: negative
Most similar words to 'negative' :

('positive', 0.7586989998817444)
('Negative', 0.6747699975967407)
('postive', 0.607062816619873)
('negatively', 0.5929017663002014)
('Inaccurate_portrayals', 0.544945240020752)
('unfavorable', 0.5436848402023315)
('Positive', 0.5394940376281738)
('positve', 0.5383689999580383)
('negativity', 0.5172534584999084)
('Incurring_debt', 0.5085647702217102)


Do you want to continue? (Y/N) : No
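
For non-interactive use, the same check can be wrapped in a function that tests vocabulary membership before querying. The sketch below assumes a gensim 4.x style object that exposes `key_to_index` and `most_similar`. The names `similar_or_none` and `ToyVectors` are hypothetical, introduced only so the example runs without the model file, and the toy scores are rounded from the `travel` output shown earlier.

```python
def similar_or_none(model, word, topn=5):
    """Return the `topn` most similar words, or None if `word` is out of vocabulary.

    `model` is assumed to expose `key_to_index` and `most_similar`,
    as gensim 4.x KeyedVectors objects do.
    """
    if word not in model.key_to_index:
        return None
    return model.most_similar(word, topn=topn)


class ToyVectors:
    """Hypothetical stand-in for KeyedVectors so the sketch runs without the model file."""

    def __init__(self, neighbours):
        self._neighbours = neighbours
        self.key_to_index = {w: i for i, w in enumerate(neighbours)}

    def most_similar(self, word, topn=10):
        return self._neighbours[word][:topn]


toy_model = ToyVectors({
    "travel": [("traveling", 0.68), ("Travel", 0.66), ("travelers", 0.58)],
})

for query in ["travel", "not_a_real_word"]:
    result = similar_or_none(toy_model, query, topn=2)
    print(query, "->", result if result is not None else "not in vocabulary")
```

With the real model, `similar_or_none(w2v_model, 'travel')` should work the same way, under the assumption that `w2v_model` was loaded as shown earlier.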


<font size = 5 color = pwdrblue> <b>  Get the word vector of any term

In [20]:
w2v_model['beautiful']

array([-0.01831055,  0.05566406, -0.01153564,  0.07275391,  0.15136719,
       -0.06176758,  0.20605469, -0.15332031, -0.05908203,  0.22851562,
       -0.06445312, -0.22851562, -0.09472656, -0.03344727,  0.24707031,
        0.05541992, -0.00921631,  0.1328125 , -0.15429688,  0.08105469,
       -0.07373047,  0.24316406,  0.12353516, -0.09277344,  0.08203125,
        0.06494141,  0.15722656,  0.11279297, -0.0612793 , -0.296875  ,
       -0.13378906,  0.234375  ,  0.09765625,  0.17773438,  0.06689453,
       -0.27539062,  0.06445312, -0.13867188, -0.08886719,  0.171875  ,
        0.07861328, -0.10058594,  0.23925781,  0.03808594,  0.18652344,
       -0.11279297,  0.22558594,  0.10986328, -0.11865234,  0.02026367,
        0.11376953,  0.09570312,  0.29492188,  0.08251953, -0.05444336,
       -0.0090332 , -0.0625    , -0.17578125, -0.08154297,  0.01062012,
       -0.04736328, -0.08544922, -0.19042969, -0.30273438,  0.07617188,
        0.125     , -0.05932617,  0.03833008, -0.03564453,  0.24

<font size = 5 color = pwdrblue> <b>  Get the embeddings for a complete text

<div class="alert alert-block alert-success">
<font size = 4>
    
- A simple way is to sum or average the embeddings of the individual words.
- Here is a small example using another NLP library, spaCy.
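
The averaging approach can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how any particular library implements it: `average_vector` is a hypothetical helper, and `toy_embeddings` holds made-up 2-dimensional vectors. Words without a vector are simply skipped.

```python
def average_vector(tokens, embeddings):
    """Average the vectors of the tokens found in `embeddings`.

    Tokens without a vector are skipped; returns None if none are found.
    """
    found = [embeddings[t] for t in tokens if t in embeddings]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


# Made-up 2-dimensional embeddings, for illustration only.
toy_embeddings = {
    "artificial": [1.0, 0.0],
    "intelligence": [0.0, 1.0],
    "automation": [2.0, 2.0],
}

tokens = "artificial intelligence enhances automation".split()
sentence_vector = average_vector(tokens, toy_embeddings)
print(sentence_vector)  # "enhances" has no vector and is skipped
```

Because gensim `KeyedVectors` objects support both `word in model` and `model[word]`, the same function should also accept the loaded word2vec model in place of the dictionary, although the result would then be a list of NumPy floats.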

[top](#w0)

 ##### <a id = 'w3'>
<font size = 10 color = 'midnightblue'> <b> Load the embedding model

In [21]:
# Load the English spaCy pipeline, which includes word vectors
nlp = spacy.load('en_core_web_md')

In [22]:
# Process the sentence to create a Doc object
mydoc = nlp("Artificial intelligence revolutionizes industries by enhancing automation and decision-making.")

# Get the averaged vector for the entire sentence
print(mydoc.vector)

[-2.88833928e+00 -1.73949754e+00 -1.43964255e+00  1.20811331e+00
  5.13780832e+00 -1.30674643e-02  7.21087217e-01  3.04970646e+00
 -1.84638822e+00 -2.22266102e+00  4.68991232e+00  2.61828923e+00
 -4.52669907e+00  1.26645672e+00 -2.34535038e-01  2.95858502e+00
  2.32375741e+00  3.09238362e+00 -2.10459971e+00 -2.37954974e+00
  1.66396189e+00  1.19737089e+00 -2.69254231e+00 -6.14912510e-01
 -1.08377755e+00 -2.53180766e+00 -8.91990006e-01 -2.05081487e+00
  1.18648148e+00 -1.64304212e-01 -6.91387177e-01 -3.24216664e-01
  9.49795187e-01 -4.35991287e-02 -1.66022158e+00 -1.46512663e+00
  1.34241199e+00  2.41495681e+00 -1.01387584e+00 -7.52867520e-01
  1.03934491e+00 -4.46677536e-01 -1.66344082e+00  1.26216695e-01
 -1.90512753e+00 -7.97698975e-01 -2.97492743e-03 -1.90645266e+00
  1.40009150e-01  3.15768272e-01 -2.23950672e+00  2.10762620e+00
  2.52927512e-01 -4.42820024e+00  2.02550087e-02  1.65891302e+00
 -1.52912760e+00  1.03578997e+00  1.68434843e-01 -1.34021425e+00
  2.11976314e+00  2.33547