# Word Vector Extraction using Word2Vec Model

Word embedding, or word vectorization, is an NLP technique that maps words or phrases from a vocabulary to vectors of real numbers. These vectors can then be used for tasks such as word prediction and measuring word similarity or semantic relatedness.

The process of converting words into numbers is called vectorization.

**Word embeddings help in the following use cases.**

*   Finding similar words
*   Text classification
*   Document clustering/grouping
*   Feature extraction for text classification
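
As a rough illustration of the "similar words" use case, the sketch below ranks words by cosine similarity between their vectors. The three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are hypothetical and invented for this example; real embeddings typically have 100 or more dimensions and are learned from data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    target = vectors[word]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("king", toy_vectors)
print(ranking)  # "queen" ranks above "apple" for these toy vectors
```

Cosine similarity compares direction rather than magnitude, which is why it is a common choice for comparing word vectors of different lengths.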

In [4]:
import nltk
from gensim.models import Word2Vec
import gensim.downloader as api

In [5]:
# Download the Punkt tokenizer data used by nltk.word_tokenize
# (newer NLTK releases may require 'punkt_tab' instead)
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
# Sample sentences
sentences = [
    "NLTK is a leading platform for building Python programs to work with human language data.",
    "It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.",
    "Word vector extraction, also known as word embedding, is a fundamental technique in Natural Language Processing (NLP) that converts words or phrases from text data into numerical vectors.",
    "These vectors capture the semantic meaning and relationships between words in a low-dimensional space. "
]

In [7]:
# Tokenize sentences into words
tokenized_sentences = [nltk.word_tokenize(sentence.lower()) for sentence in sentences]

In [8]:
# Train a Word2Vec model on the text8 corpus
# Note: in practice you would train Word2Vec on a large text corpus or load a pre-trained model

# Download the text8 dataset through gensim's downloader API
dataset_name = 'text8'
dataset = api.load(dataset_name)

# Train Word2Vec model
# You can adjust various parameters like vector_size, window, min_count, etc. based on your dataset and requirements
model = Word2Vec(sentences=dataset, vector_size=100, window=5, min_count=1, workers=4)

# Save the trained model
model.save("word2vec_model")

# Load the trained model
word2vec_model = Word2Vec.load("word2vec_model")
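
To make the `min_count` and `window` parameters more concrete, here is a simplified pure-Python sketch of the two ideas: dropping rare words and pairing each word with its neighbours. The helper names `filter_by_min_count` and `context_pairs` and the toy corpus are hypothetical; gensim's actual training loop is more involved, so treat this as an illustration of the concepts rather than the library's implementation.

```python
from collections import Counter

def filter_by_min_count(tokenized_corpus, min_count):
    """Return the set of words that occur at least `min_count` times."""
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    return {word for word, count in counts.items() if count >= min_count}

def context_pairs(sentence, window):
    """List (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

# Toy corpus, invented for illustration only.
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["a", "bird", "flew"],
]

vocab = filter_by_min_count(toy_corpus, min_count=2)
pairs = context_pairs(["the", "cat", "sat", "down"], window=1)

print(sorted(vocab))
print(pairs)
```

With `min_count=1`, as in the cell above, every word in the corpus is kept; a larger value trades vocabulary coverage for fewer poorly trained vectors.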



In [9]:
# Function to extract word vector representation of a word
def get_word_vector(word):
    try:
        return word2vec_model.wv[word]
    except KeyError:
        return None  # Return None if word is not in the vocabulary
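
Because `get_word_vector` returns `None` for out-of-vocabulary words, any later step has to skip those entries. One common way to turn per-word vectors into a single feature vector for classification or clustering is to average them. The sketch below assumes that approach; `average_vectors` is a hypothetical helper and the two-dimensional inputs are toy values, not output from the trained model.

```python
def average_vectors(vectors):
    """Element-wise mean of the non-None vectors, or None if there are none."""
    known = [v for v in vectors if v is not None]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]

# Toy input: the first entry stands in for an out-of-vocabulary word.
toy_sentence_vectors = [None, [1.0, 2.0], [3.0, 4.0]]
sentence_vector = average_vectors(toy_sentence_vectors)

print(sentence_vector)  # [2.0, 3.0]
```

The same function should also work on the NumPy arrays returned by `word2vec_model.wv[word]`, since it only relies on indexing and `len`.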

In [10]:
# Extract word vectors for each word in the tokenized sentences
word_vectors = [[get_word_vector(word) for word in sentence] for sentence in tokenized_sentences]


# Print word vectors
for i, sentence_vectors in enumerate(word_vectors):
    print(f"\n Sentence {i+1} word vectors:")
    for j, word_vector in enumerate(sentence_vectors):
        if word_vector is not None:
            print(f"\n Word {j+1}: {word_vector}")


 Sentence 1 word vectors:

 Word 2: [ 0.35179085 -2.0243583   1.2972069  -3.346163    1.6218975  -4.547846
 -4.0564604   2.260282   -0.27348074 -3.038299    1.7159584   1.5611367
  0.02286207 -2.349779    0.20546688  0.7640269   2.0843601  -4.709505
 -0.9064398   0.01229943 -1.282401    0.02423194  2.6836243  -1.7344304
  1.775738   -1.9909921   2.3559866   1.4282984   1.0182769  -0.9555671
 -2.2178109  -0.30147624 -0.5342147   1.9437217   0.5528936  -0.5274956
 -1.7918359  -1.2141846   3.0742435  -1.0235943   0.02986413 -0.08664255
 -0.4321286   2.115937   -2.3477829   0.36580774 -3.912159   -1.1497662
 -0.3906269  -0.78947234 -1.4559329  -3.0609112  -0.03316841  1.0516956
  0.19200823 -1.5584183   1.0533408  -1.806278    2.0710247   4.686085
  1.3299055  -0.36231178 -1.6339275   2.5185347   3.250894    1.1790041
 -0.43471768 -3.3426266   1.048345   -0.60942394  0.6067328  -0.9581387
 -1.549519    0.13391894 -0.1833108   2.2801924  -0.45416474  0.31459463
  0.83905834  0.27134332  1.