# <center> <font size = 24 color = 'steelblue'> <b> Pre-trained word embedding model from gensim

## Overview:
    
This notebook sets up the environment, loads a pre-trained word embedding model, and uses it to generate embeddings that can feed downstream NLP or machine learning tasks.

# <a id= 'w0'>
<font size = 4>
    
**Table of contents:**<br>
[1. Installation and import of necessary packages](#w1)<br>
[2. Model implementation](#w2)<br>
[3. Load the embedding model](#w3)<br>

    

<font size =5 color = 'seagreen'>
    
Using a pre-trained word2vec model to look for most similar words.
    
<b>This demonstration uses the Google News word2vec embeddings (`word2vec-google-news-300`).

##### <a id = 'w1'>
<font size = 10 color = 'midnightblue'> <b>Installation and import of necessary packages

In [1]:
!pip install scikit-learn==1.3.1
!pip install gensim==4.2.0
!pip install spacy==3.5.1
!python -m spacy download en_core_web_md


In [39]:
import spacy.cli
spacy.cli.download("en_core_web_lg")

Collecting en-core-web-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0-py3-none-any.whl (587.7 MB)

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


 <font size =5 color = 'seagreen'> <b> Import packages

In [12]:
import os
from gensim.models import Word2Vec, KeyedVectors

# Suppress all warnings so they do not clutter the notebook output
import warnings
warnings.filterwarnings("ignore")

import spacy



[top](#w0)

 ##### <a id = 'w2'>
<font size = 10 color = 'midnightblue'> <b>  Model implementation

<font size = 5 color = pwdrblue> <b> Get the word embeddings

In [1]:
from gensim.models import KeyedVectors
import gensim.downloader as api

In [2]:
# Load the model
w2v_model = api.load("word2vec-google-news-300")

# Check the model
print(f"Number of words in the model: {len(w2v_model.key_to_index)}")
print(w2v_model['example'])  # Replace 'example' with an actual word in the model

Number of words in the model: 3000000
[ 2.05078125e-01  7.85827637e-04  3.54003906e-02  1.00585938e-01
 -5.44433594e-02  1.53320312e-01  2.55859375e-01 -2.18750000e-01
 -3.31115723e-03  2.09960938e-01 -2.07031250e-01  1.77001953e-02
  4.29687500e-02 -2.01171875e-01 -1.57226562e-01  1.88476562e-01
 -3.73535156e-02  2.36816406e-02 -2.63671875e-01 -1.33789062e-01
  2.23632812e-01  2.05078125e-01 -5.83496094e-02 -3.11279297e-02
  4.92095947e-04  2.36328125e-01  1.16699219e-01  4.24804688e-02
 -1.33789062e-01  1.84570312e-01  5.02929688e-02 -6.00585938e-02
 -6.22558594e-02  7.61718750e-02  1.48437500e-01  6.10351562e-02
  6.39648438e-02 -2.73437500e-01  1.48437500e-01  8.15429688e-02
  1.57226562e-01 -2.63671875e-02 -1.10839844e-01  3.24707031e-02
 -6.93359375e-02 -3.29589844e-02 -1.34765625e-01  4.32128906e-02
 -1.42578125e-01 -2.50000000e-01  9.86328125e-02 -1.10839844e-01
 -6.98242188e-02 -2.46093750e-01  1.65039062e-01 -9.81445312e-02
 -1.71875000e-01 -1.20117188e-01  1.21582031e-01  1.

<font size = 5 color = pwdrblue> <b> Check number of words in vocabulary

In [3]:
print("Number of words in vocabulary: ",len(w2v_model.index_to_key)) #Number of words in the vocabulary.

Number of words in vocabulary:  3000000


In [4]:
print(f"First few words of the vocabulary :\n{ w2v_model.index_to_key[:20]}")

First few words of the vocabulary :
['</s>', 'in', 'for', 'that', 'is', 'on', '##', 'The', 'with', 'said', 'was', 'the', 'at', 'not', 'as', 'it', 'be', 'from', 'by', 'are']


<font size = 5 color = pwdrblue> <b> Examine the model to extract most similar words for a given word like `joyful`, `solid`

## Understanding `most_similar` Method

- **Function Purpose**: `most_similar` is a method of the `KeyedVectors` class in the Gensim library. It returns the top N words (10 by default, set with the `topn` argument) whose embeddings are most similar to that of a given word.

- **Input**: Here, the input is the string `'joyful'`, the word for which we want to find similar words.

- **Output**: The method returns a list of `(word, similarity)` tuples, sorted from most to least similar. For example (illustrative values, not actual model output):
  ```python
  [('happy', 0.85), ('joyous', 0.78), ('cheerful', 0.75), ...]
  ```

## How Similarity is Calculated

- **Cosine Similarity**: The similarity is calculated using cosine similarity, which measures the cosine of the angle between two non-zero vectors in an inner product space. This value ranges from -1 to 1, where:
  - 1 means the vectors point in the same direction (completely similar),
  - 0 means they are orthogonal (not similar), and
  - -1 means they point in opposite directions.

- **Vector Representation**: Each word, including "joyful," is represented as a high-dimensional vector. The model was trained on large corpora of text, which allows it to capture semantic relationships based on word co-occurrences.
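
To make the three cosine similarity cases concrete, here is a minimal sketch in plain Python. It does not use Gensim: `cosine_similarity`, `toy_most_similar`, and the hand-picked 2-d vectors in `toy_vectors` are hypothetical names and values introduced only for illustration, and the ranking function only approximates the idea behind `most_similar`, not its actual implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(vectors, word, topn=10):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    target = vectors[word]  # raises KeyError for an unknown word
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first, then keep only the top N entries
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Hypothetical 2-d "embeddings", chosen by hand for illustration only
toy_vectors = {
    "east": [1.0, 0.0],
    "further_east": [2.0, 0.0],  # same direction as "east"
    "north": [0.0, 3.0],         # orthogonal to "east"
    "west": [-4.0, 0.0],         # opposite direction to "east"
}

for other, score in toy_most_similar(toy_vectors, "east"):
    print(f"{other}: {score:+.1f}")
```

The loop lists `further_east` first (similarity 1), then `north` (0), then `west` (-1). Vector length does not affect the score; only direction does.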



In [5]:
w2v_model.most_similar('joyful')

[('joyous', 0.818248987197876),
 ('Besnik_Berisha_Pristina', 0.6848508715629578),
 ('joy', 0.6633967757225037),
 ('joyousness', 0.6440029740333557),
 ('exuberant', 0.6130944490432739),
 ('uplifting', 0.593187153339386),
 ('old_demonstrator_Juliya', 0.592427134513855),
 ('sorrowful', 0.5822992324829102),
 ('cheerful', 0.5811519026756287),
 ('indescribable_joy', 0.581040620803833)]

In [6]:
w2v_model.most_similar('Travel')

[('travel', 0.6577276587486267),
 ('Adventure_Travel', 0.636359453201294),
 ('Travel_Agent', 0.608870804309845),
 ('Aol_Autos', 0.5961860418319702),
 ('Destinations', 0.5935966968536377),
 ('Vacations', 0.5759146213531494),
 ('Vantage_Deluxe', 0.573976993560791),
 ('Travel_Destinations', 0.5731626152992249),
 ('Travel_Agents', 0.5724309086799622),
 ('Escorted_Tours', 0.557145893573761)]

In [7]:
w2v_model.most_similar('travel')

[('traveling', 0.6823130249977112),
 ('Travel', 0.6577276587486267),
 ('travelers', 0.5849088430404663),
 ('trips', 0.5770835280418396),
 ('travels', 0.5704988241195679),
 ('trip', 0.569098174571991),
 ('journeys', 0.5535728335380554),
 ('airfare', 0.5398489832878113),
 ('Travelling', 0.5369202494621277),
 ('Traveling', 0.5305294394493103)]

<font size = 5 color = seagreen> <b> The snippet below handles the error raised when a word doesn't exist in the vocabulary and lets you check similar words for multiple inputs:
    

- The provided code snippet allows users to input a word and retrieve the most similar words based on the Word2Vec model. The code includes error handling to manage situations where the input word is not present in the model's vocabulary.

### Error Management
- Try-Except Block:
    - The code uses a try block to attempt to retrieve similar words for the input word. If the word is not in the model's vocabulary, Gensim raises a `KeyError`.
    - The except block catches this exception and prints a user-friendly message: "Word does not exist in vocabulary!". The program therefore does not crash, and the user can enter another word.
### Similarity Calculation
- Cosine Similarity:

    - The most_similar method calculates similarity using cosine similarity. This mathematical measure determines how similar two vectors (in this case, word embeddings) are by computing the cosine of the angle between them.
    - If two word vectors point in the same direction, their cosine similarity is close to 1. If they are orthogonal (i.e., unrelated), the similarity is close to 0, and if they point in opposite directions, the similarity is -1.

### Word Vector Representation:

- Each word is represented as a high-dimensional vector. The Word2Vec model is trained on large corpora of text, allowing it to learn contextual relationships between words based on their occurrences in similar contexts.

In [8]:
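inp = "y"
while inp.lower() == 'y':
    word = input("Enter a word to get similar words: ")
    try:
        # Look up the neighbours first so nothing is printed for unknown words
        similar_words = w2v_model.most_similar(word)
    except KeyError:
        # Raised when the word is not in the model's vocabulary
        print('Word does not exist in vocabulary!')
    else:
        print(f"Most similar words to '{word}' :\n")
        for t in similar_words:
            print(t)
        print('\n')
    inp = input("Do you want to continue? (Y/N) : ")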


Enter a word to get similar words:  happy


Most similar words to 'happy' :

('glad', 0.7408890724182129)
('pleased', 0.6632170677185059)
('ecstatic', 0.6626912355422974)
('overjoyed', 0.6599286794662476)
('thrilled', 0.6514049172401428)
('satisfied', 0.6437949538230896)
('proud', 0.636042058467865)
('delighted', 0.627237856388092)
('disappointed', 0.6269949674606323)
('excited', 0.6247665286064148)




Do you want to continue? (Y/N) :  n
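
The same error handling can also be used without the interactive prompt, for example to check a whole list of words at once. The sketch below is one possible way to do that. `similar_words_for` and `ToyModel` are hypothetical names: `ToyModel` is a stand-in with a `most_similar` method so the example runs without downloading the real model, and its two neighbours are rounded from the session above. It assumes, as Gensim 4.x does, that an unknown word raises `KeyError`.

```python
def similar_words_for(model, words, topn=5):
    """Return {word: neighbours or None} without stopping at unknown words."""
    results = {}
    for word in words:
        try:
            results[word] = model.most_similar(word, topn=topn)
        except KeyError:
            # Raised when the word is not in the model's vocabulary
            results[word] = None
    return results


class ToyModel:
    """Hypothetical stand-in so the sketch runs without the real model."""

    _neighbours = {"happy": [("glad", 0.74), ("pleased", 0.66)]}

    def most_similar(self, word, topn=10):
        # Dictionary lookup raises KeyError if the word is unknown
        return self._neighbours[word][:topn]


results = similar_words_for(ToyModel(), ["happy", "notaword"])
for word, neighbours in results.items():
    if neighbours is None:
        print(f"'{word}' does not exist in vocabulary!")
    else:
        print(f"Most similar words to '{word}': {neighbours}")
```

With the real model loaded, passing `w2v_model` instead of `ToyModel()` should work the same way, since the helper only relies on the `most_similar` method.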


<font size = 5 color = pwdrblue> <b>  Get the word vector of any term

## What is a Word Vector?

### Definition:
A word vector is a numerical representation of a word in a high-dimensional space. In the context of Word2Vec, each word from the vocabulary is mapped to a vector of fixed length (e.g., 300 dimensions). These vectors are generated through training on large corpora of text, allowing the model to capture the contextual meaning of words based on their usage in sentences.

### Significance:
Word vectors are crucial in natural language processing (NLP) because they allow machines to understand and manipulate human language in a more meaningful way. Similar words will have vectors that are closer together in the vector space, enabling the model to identify semantic relationships, such as synonyms or related concepts. For example, "beautiful" might be close to words like "pretty," "lovely," or "gorgeous" in the vector space.

### Dimensionality:
The dimensionality of word vectors (e.g., 100, 200, 300 dimensions) determines the richness of the word representations. Higher dimensions may capture more nuanced relationships but also increase computational complexity.

### Example of Usage:
When you execute `w2v_model['beautiful']`, you are accessing the vector associated with the word "beautiful." This vector can be used in various NLP tasks, including:

- **Similarity calculations**: Comparing how similar "beautiful" is to other words.
- **Clustering**: Grouping similar words based on their vectors.
- **Machine learning models**: Using word vectors as input features for models that perform tasks like sentiment analysis, classification, or information retrieval.


In [9]:
w2v_model['beautiful']

array([-0.01831055,  0.05566406, -0.01153564,  0.07275391,  0.15136719,
       -0.06176758,  0.20605469, -0.15332031, -0.05908203,  0.22851562,
       -0.06445312, -0.22851562, -0.09472656, -0.03344727,  0.24707031,
        0.05541992, -0.00921631,  0.1328125 , -0.15429688,  0.08105469,
       -0.07373047,  0.24316406,  0.12353516, -0.09277344,  0.08203125,
        0.06494141,  0.15722656,  0.11279297, -0.0612793 , -0.296875  ,
       -0.13378906,  0.234375  ,  0.09765625,  0.17773438,  0.06689453,
       -0.27539062,  0.06445312, -0.13867188, -0.08886719,  0.171875  ,
        0.07861328, -0.10058594,  0.23925781,  0.03808594,  0.18652344,
       -0.11279297,  0.22558594,  0.10986328, -0.11865234,  0.02026367,
        0.11376953,  0.09570312,  0.29492188,  0.08251953, -0.05444336,
       -0.0090332 , -0.0625    , -0.17578125, -0.08154297,  0.01062012,
       -0.04736328, -0.08544922, -0.19042969, -0.30273438,  0.07617188,
        0.125     , -0.05932617,  0.03833008, -0.03564453,  0.24

<font size = 5 color = pwdrblue> <b>  Get the embeddings for a complete text

<div class="alert alert-block alert-success">
<font size = 4>
    
- A simple way is to just sum or average the embeddings for individual words.
- Let us see a small example using another NLP library, spaCy.
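
Before turning to a library, the averaging idea itself can be sketched in a few lines of plain Python. `average_embedding` and the 3-d vectors in `toy_vectors` are hypothetical, made up for illustration. Skipping tokens that have no vector is a design choice of this sketch, not necessarily what a given library does.

```python
def average_embedding(tokens, vectors):
    """Element-wise mean of the vectors of all tokens found in `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token had a vector
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical 3-d word vectors, chosen by hand for illustration only
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "movie": [3.0, 4.0, 0.0],
}

sentence = "a good movie".split()  # "a" has no vector and is skipped
print(average_embedding(sentence, toy_vectors))  # prints [2.0, 2.0, 1.0]
```

Summing instead of averaging only drops the division by `len(known)`. Averaging keeps the result on a comparable scale for short and long texts.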

[top](#w0)

 ##### <a id = 'w3'>
<font size = 10 color = 'midnightblue'> <b> Load the embedding model

In [13]:
# Load the large English spaCy pipeline, which includes word vectors
nlp = spacy.load('en_core_web_lg')

In [14]:
# Process the text with the pipeline to create a Doc object
mydoc = nlp("Artificial intelligence revolutionizes industries by enhancing automation and decision-making.")

# Get the averaged vector for the entire sentence
print(mydoc.vector)

[-2.7266681e+00 -1.7625976e+00 -1.5062932e+00  1.0659857e+00
  5.1311164e+00  2.9111406e-02  6.2394553e-01  3.1948249e+00
 -1.8934714e+00 -2.1929944e+00  4.5006537e+00  2.4514868e+00
 -4.5428243e+00  1.2644234e+00 -2.3397674e-01  2.9398270e+00
  2.2581408e+00  2.9875667e+00 -1.9645262e+00 -2.4424388e+00
  1.7411895e+00  1.2703067e+00 -2.6838844e+00 -7.6376837e-01
 -1.1830875e+00 -2.5485353e+00 -8.7953758e-01 -2.0849400e+00
  1.1747426e+00 -1.0882088e-01 -6.1871994e-01 -3.4561667e-01
  7.8869581e-01 -2.6599169e-02 -1.5586964e+00 -1.4661268e+00
  1.4246036e+00  2.3581977e+00 -8.0672759e-01 -6.2371916e-01
  9.9842834e-01 -3.4404168e-01 -1.5362601e+00  9.4676696e-02
 -1.9323992e+00 -7.1922398e-01  2.5718352e-02 -1.9375043e+00
  1.4668162e-01  3.8322163e-01 -2.0981150e+00  2.1650372e+00
  3.3772334e-01 -4.2974758e+00  1.0157917e-01  1.6267353e+00
 -1.6143967e+00  1.0875858e+00  6.1909851e-02 -1.1584975e+00
  2.1183548e+00  3.7791324e-01 -6.0935080e-01  3.5485086e-01
  3.9455414e+00  9.86920