# <center> <font size = 24 color = 'steelblue'> <b> Pre-trained word embedding model from gensim

<div class="alert alert-block alert-info">
    
<font size = 4>
    
**This notebook demonstrates how to represent text using pre-trained word embedding models.**

# <a id= 'w0'>
<font size = 4>
    
**Table of contents:**<br>
[1. Installation and import of necessary packages](#w1)<br>
[2. Model implementation](#w2)<br>
[3. Load the embedding model](#w3)<br>

    

<font size =5 color = 'seagreen'>
    
Use a pre-trained word2vec model to find the words most similar to a given word.
    
<b>This demonstration uses the `Google News` word2vec embeddings.

##### <a id = 'w1'>
<font size = 10 color = 'midnightblue'> <b>Installation and import of necessary packages

In [1]:
!pip install scikit-learn
!pip install gensim
!pip install spacy
!python -m spacy download en_core_web_md

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
/usr/local/venvs/jupyter/bin/python: No module named spacy


 <font size =5 color = 'seagreen'> <b> Import packages

In [2]:
import os
from gensim.models import Word2Vec, KeyedVectors

# Suppress all warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

import spacy



[top](#w0)

 ##### <a id = 'w2'>
<font size = 10 color = 'midnightblue'> <b>  Model implementation

<font size = 5 color = pwdrblue> <b> Get the word embeddings

In [3]:
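# Path to a local copy of the vectors in the working directory, if one exists
# (the next cell overrides this with the downloaded file)
path = os.getcwd()
file_name = 'GoogleNews-vectors-negative300.bin.gz'
pretrained_path = os.path.join(path, file_name)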

In [4]:
import os
import kagglehub
from gensim.models import KeyedVectors

download_path = kagglehub.dataset_download("leadbest/googlenewsvectorsnegative300")
print(f"Dataset downloaded to: {download_path}")

model_file = 'GoogleNews-vectors-negative300.bin.gz'
pretrained_path = os.path.join(download_path, model_file)

print(f"Loading model from: {pretrained_path}")




Dataset downloaded to: /voc/work/.cache/kagglehub/datasets/leadbest/googlenewsvectorsnegative300/versions/2
Loading model from: /voc/work/.cache/kagglehub/datasets/leadbest/googlenewsvectorsnegative300/versions/2/GoogleNews-vectors-negative300.bin.gz


<font size = 5 color = pwdrblue> <b> Load the model

In [5]:
w2v_model = KeyedVectors.load_word2vec_format(pretrained_path, binary=True)

print("Model loaded successfully! 🎉")

Model loaded successfully! 🎉


<font size = 5 color = pwdrblue> <b> Check number of words in vocabulary

In [6]:
print("Number of words in vocabulary: ",len(w2v_model.index_to_key)) #Number of words in the vocabulary.

Number of words in vocabulary:  3000000


In [7]:
print(f"First few words of the vocabulary :\n{ w2v_model.index_to_key[:20]}")

First few words of the vocabulary :
['</s>', 'in', 'for', 'that', 'is', 'on', '##', 'The', 'with', 'said', 'was', 'the', 'at', 'not', 'as', 'it', 'be', 'from', 'by', 'are']


<font size = 5 color = pwdrblue> <b> Query the model for the words most similar to a given word, such as `joyful` or `travel`

In [8]:
w2v_model.most_similar('joyful')

[('joyous', 0.818248987197876),
 ('Besnik_Berisha_Pristina', 0.6848508715629578),
 ('joy', 0.6633967757225037),
 ('joyousness', 0.6440029740333557),
 ('exuberant', 0.6130944490432739),
 ('uplifting', 0.593187153339386),
 ('old_demonstrator_Juliya', 0.592427134513855),
 ('sorrowful', 0.5822992324829102),
 ('cheerful', 0.5811519026756287),
 ('indescribable_joy', 0.581040620803833)]

In [9]:
w2v_model.most_similar('Travel')

[('travel', 0.6577276587486267),
 ('Adventure_Travel', 0.636359453201294),
 ('Travel_Agent', 0.608870804309845),
 ('Aol_Autos', 0.5961860418319702),
 ('Destinations', 0.5935966968536377),
 ('Vacations', 0.5759146213531494),
 ('Vantage_Deluxe', 0.573976993560791),
 ('Travel_Destinations', 0.5731626152992249),
 ('Travel_Agents', 0.5724309086799622),
 ('Escorted_Tours', 0.557145893573761)]

In [10]:
w2v_model.most_similar('travel')

[('traveling', 0.6823130249977112),
 ('Travel', 0.6577276587486267),
 ('travelers', 0.5849088430404663),
 ('trips', 0.5770835280418396),
 ('travels', 0.5704988241195679),
 ('trip', 0.569098174571991),
 ('journeys', 0.5535728335380554),
 ('airfare', 0.5398489832878113),
 ('Travelling', 0.5369202494621277),
 ('Traveling', 0.5305294394493103)]
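
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces that ranking on a toy scale, assuming three made-up 3-dimensional vectors; the names `toy_vectors` and `rank_neighbours` are hypothetical and are not part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors invented for illustration; real vectors have 300 dimensions
toy_vectors = {
    "travel": [0.9, 0.1, 0.0],
    "trip": [0.8, 0.3, 0.1],
    "joyful": [0.0, 0.2, 0.9],
}

def rank_neighbours(word, vectors):
    """Sort every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = rank_neighbours("travel", toy_vectors)
print(ranking)
```

With these toy values, `trip` ranks above `joyful` because its vector points in nearly the same direction as the vector for `travel`.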

<div class="alert alert-block alert-success">
<font size = 4>
    
<center><b> `most_similar` raises a `KeyError` when the word is not present in the vocabulary. Lookups are case-sensitive, so `Travel` and `travel` are separate entries.</b>


<font size = 5 color = seagreen> <b> The snippet below handles the error and checks similarity for multiple words:

In [None]:
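inp = "y"
while inp.lower() == 'y':
    word = input("Enter a word to get similar words: ")
    try:
        # Look up first so the header is printed only when the word exists
        similar_words = w2v_model.most_similar(word)
        print(f"Most similar words to '{word}':\n")
        for t in similar_words:
            print(t)
        print('\n')
    except KeyError:
        # gensim raises KeyError for out-of-vocabulary words
        print('Word does not exist in vocabulary!')
    inp = input("Do you want to continue? (Y/N): ")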
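
A non-interactive alternative is to test membership before querying. The sketch below assumes the model exposes gensim's `key_to_index` mapping and `most_similar` method. `lookup_similar` and `TinyModel` are hypothetical names, and `TinyModel` is only a stand-in with placeholder scores so the block runs without the full model. Because keys are case-sensitive (`Travel` and `travel` return different neighbours above), the helper also tries other casings.

```python
class TinyModel:
    """Hypothetical stand-in exposing the two members the helper uses."""

    def __init__(self):
        self.key_to_index = {"travel": 0, "Travel": 1, "trip": 2}

    def most_similar(self, key, topn=10):
        if key not in self.key_to_index:
            raise KeyError(f"Key '{key}' not present in vocabulary")
        # Placeholder scores; a real model returns cosine similarities
        others = [k for k in self.key_to_index if k != key]
        return [(k, 0.0) for k in others][:topn]


def lookup_similar(model, word, topn=5):
    """Return (matched_key, neighbours), or (None, []) if no casing matches."""
    # Try the word as typed, then lower-case, then capitalised
    for candidate in (word, word.lower(), word.capitalize()):
        if candidate in model.key_to_index:
            return candidate, model.most_similar(candidate, topn=topn)
    return None, []


matched, neighbours = lookup_similar(TinyModel(), "TRAVEL")
print(matched, neighbours)
```

With the notebook's model, the call would be `lookup_similar(w2v_model, "TRAVEL")`.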


<font size = 5 color = pwdrblue> <b>  Get the word vector of any term

In [None]:
w2v_model['beautiful']

<font size = 5 color = pwdrblue> <b>  Get the embeddings for a complete text

<div class="alert alert-block alert-success">
<font size = 4>
    
- A simple approach is to sum or average the embeddings of the individual words.
- A small example follows, using another NLP library, spaCy.
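
Before the spaCy example, the averaging step itself can be sketched with NumPy. This is a minimal illustration, assuming a made-up dictionary of 4-dimensional vectors; `toy_embeddings` and `average_embedding` are hypothetical names. Skipping out-of-vocabulary tokens is one possible design choice, not the only one.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings invented for illustration
toy_embeddings = {
    "word": np.array([1.0, 0.0, 2.0, 0.0]),
    "vectors": np.array([0.0, 2.0, 0.0, 2.0]),
    "help": np.array([2.0, 1.0, 1.0, 1.0]),
}

def average_embedding(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "machines" is not in the toy dictionary, so it is skipped
tokens = "word vectors help machines".split()
sentence_vector = average_embedding(tokens, toy_embeddings, dim=4)
print(sentence_vector)
```

The same function would accept the loaded word2vec model in place of the dictionary with `dim=300`, since it supports both `in` checks and key lookup.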

[top](#w0)

 ##### <a id = 'w3'>
<font size = 10 color = 'midnightblue'> <b> Load the embedding model

In [None]:
# Load the English spaCy pipeline, which includes word vectors
nlp = spacy.load('en_core_web_md')

In [None]:
# Process the sentence into a spaCy Doc object
mydoc = nlp("Artificial intelligence revolutionizes industries by enhancing automation and decision-making.")

# Get the averaged vector for the entire sentence
print(mydoc.vector)