<img src="https://online.york.ac.uk/wp-content/uploads/2021/10/Image-of-a-human-made-up-of-lit-up-lines-touching-a-graphic-which-reads-NLP.jpg" alt="The role of natural language processing in AI - University of York">

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20240524132821/nlp-working.webp" alt="Natural Language Processing (NLP) - Overview - GeeksforGeeks">

<img src="https://www.samyzaf.com/ML/nlp/word2vec2.png" alt="word2vec">

**Word2Vec** is a technique in natural language processing (NLP) used to represent words as dense vectors of fixed size, where words with similar meanings are close in the vector space. The method was introduced by Tomas Mikolov and his team at Google in 2013. It is based on shallow neural networks that learn word representations by processing large text corpora.

There are two main architectures in Word2Vec:

- **Continuous Bag of Words (CBOW):** Predicts the target word (center word) based on its context (surrounding words).

- **Skip-gram:** The inverse of CBOW, where the model predicts the surrounding words based on the target word.
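
To make the difference between the two architectures concrete, here is a minimal sketch of the training examples each one would see for a single sentence. The helper names `skipgram_pairs` and `cbow_examples` are made up for illustration, and a fixed symmetric window is assumed; real implementations add refinements such as subsampling.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs: Skip-gram predicts each context word from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """Return (context_words, center) examples: CBOW predicts the center from its context."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = ["machine", "learning", "is", "fun"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram produces one example per (center, context) pair, while CBOW produces one example per position. This is one reason CBOW tends to train faster on the same text.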

**Key Points of Word2Vec:**

- **Training Objective:** The model is trained to maximize the likelihood of the context words given a target word (Skip-gram) or vice versa (CBOW).

- **Efficient Learning:** Uses techniques like hierarchical softmax or negative sampling to efficiently train on large corpora.

- **Applications:** It is used in many NLP tasks like word similarity, analogy completion, and machine translation.
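
As a rough illustration of negative sampling, the sketch below computes the loss for one (center, context) pair plus a few sampled "negative" words. The vectors are toy values and the function names are assumptions for this example, not part of any library.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Loss for one (center, context) pair with sampled negatives.

    loss = -log(sigmoid(context . center)) - sum(log(sigmoid(-negative . center)))
    """
    loss = -math.log(sigmoid(dot(context_vec, center_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(neg, center_vec)))
    return loss


# Toy 2-dimensional vectors, made up for illustration only.
center = [1.0, 0.0]
aligned_context = [2.0, 0.0]    # points the same way as the center vector
opposed_context = [-2.0, 0.0]   # points the opposite way
negatives = [[-1.0, 0.5], [-0.5, -1.0]]

print(negative_sampling_loss(center, aligned_context, negatives))
print(negative_sampling_loss(center, opposed_context, negatives))
```

The loss is lower when the true context vector aligns with the center vector and the negatives do not. Each update touches only the true context word and the sampled negatives, not the whole vocabulary, which is what keeps training cheap on large corpora.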

**Importing necessary libraries**

In [1]:
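import nltk  # For NLP tasks like tokenization
from gensim.models import Word2Vec  # To create and train Word2Vec models
from nltk.corpus import stopwords  # To remove common stopwords from text
import re  # For regular expressions to preprocess text

# One-time downloads of the NLTK resources used in the steps below
nltk.download('punkt')  # Tokenizer models (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('stopwords')  # English stopword list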

**Sample paragraph for analysis**

In [4]:
paragraph = """I hold three profound visions for India. Over the past 3000 years, our land has witnessed 
invasions from countless civilizations—Alexander, the Greeks, the Turks, the Mughals, the Portuguese, 
the British, the French, and the Dutch. They came, took over our lands, plundered our resources, 
and sought to dominate our minds. Yet, India has never invaded or oppressed any other nation. 
We have not taken away others’ land, culture, or history, nor forced our way of life upon them. 
Why? Because we cherish and respect freedom. This is my first vision for India: freedom. 
The seeds of this vision were sown in 1857 during the War of Independence. 
This freedom is precious, and it is our duty to safeguard and strengthen it. 
Without freedom, we cannot command dignity or respect.
My second vision is for progress. For over five decades, we have been a developing nation. 
It is time to reimagine ourselves as a developed nation. India stands among the world’s top 
economies in GDP, with notable achievements in many sectors. Our poverty levels are reducing, 
and the world recognizes our advancements. Yet, we often hesitate to see ourselves as capable, 
self-reliant, and confident. This mindset must change if we are to embrace our true potential.
My third vision is for India to stand tall on the global stage. Respect is earned through 
strength—both economic and military—and these must evolve together. Only when we are strong 
can we inspire respect and influence globally.
I have been fortunate to work with extraordinary mentors like Dr. Vikram Sarabhai, Prof. 
Satish Dhawan, and Dr. Brahm Prakash. Their guidance shaped my journey and opened doors to 
remarkable opportunities. As I reflect on my career, I identify four significant milestones 
that have defined my path and contributed to my growth."""

**Step 1: Text preprocessing**
- Key Point: Preprocessing ensures data is clean and standardized for analysis.

In [7]:
text = re.sub(r'\[[0-9]*\]', ' ', paragraph)  # Remove references like [1], [2]
text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
text = text.lower()  # Convert text to lowercase for uniformity
text = re.sub(r'\d', ' ', text)  # Remove digits from text
text = re.sub(r'\s+', ' ', text)  # Normalize spaces again after cleaning
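
The same cleaning steps can be wrapped in a function and tried on a short string to see their effect. This is a small sketch; the name `clean_text` and the sample string are made up for illustration.

```python
import re


def clean_text(raw):
    """Apply the cleaning steps from Step 1, in the same order."""
    text = re.sub(r'\[[0-9]*\]', ' ', raw)   # drop bracketed references such as [1]
    text = re.sub(r'\s+', ' ', text)         # collapse whitespace, including newlines
    text = text.lower()                      # lowercase for uniformity
    text = re.sub(r'\d', ' ', text)          # replace each digit with a space
    text = re.sub(r'\s+', ' ', text)         # collapse the gaps the digits left behind
    return text


sample = "Over the past 3000 years [1]\nIndia has   changed."
print(clean_text(sample))  # over the past years india has changed.
```

Note that removing digits also removes years such as 1857 from the paragraph, so dates will not appear in the vocabulary.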

**Step 2: Sentence tokenization**
- Key Point: Tokenize the paragraph into sentences, then into words for Word2Vec.

In [10]:
sentences = nltk.sent_tokenize(text)  # Split text into sentences
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]  # Tokenize each sentence into words
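
Word2Vec expects its input as a list of sentences, each of which is a list of tokens. The sketch below shows that shape using naive regular-expression splitters as a stand-in for the NLTK tokenizers; `simple_sentences` and `simple_words` are hypothetical helpers and are much cruder than `nltk.sent_tokenize` and `nltk.word_tokenize`.

```python
import re


def simple_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


def simple_words(sentence):
    """Naive tokenizer: runs of lowercase letters/apostrophes, or single punctuation marks."""
    return re.findall(r"[a-z']+|[^\sa-z']", sentence)


text = "why? because we cherish and respect freedom. this is my first vision."
corpus = [simple_words(s) for s in simple_sentences(text)]
print(corpus)
```

Punctuation marks come out as separate tokens here, and the NLTK word tokenizer also emits punctuation as tokens. They stay in the corpus unless they are filtered out later.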

**Step 3: Remove stopwords**
- Key Point: Stopwords are very common words (like 'the', 'is') that carry little meaning on their own, so they are removed before training.

In [13]:
stop_words = set(stopwords.words('english'))  # Build the set once instead of reloading the list for every word
for i in range(len(sentences)):
    sentences[i] = [word for word in sentences[i] if word not in stop_words]
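
The NLTK stopword list does not contain punctuation, so tokens such as '.' and ',' survive the step above. The sketch below also drops punctuation-only tokens. It uses a tiny hand-picked stopword set purely for illustration (NLTK's English list is much longer), and `filter_tokens` is a hypothetical helper.

```python
import string

# A tiny hand-picked stopword set, for illustration only.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "we", "this", "my"}


def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stopwords and tokens made up entirely of punctuation."""
    kept = []
    for tok in tokens:
        if tok in stop_words:
            continue
        if all(ch in string.punctuation for ch in tok):
            continue
        kept.append(tok)
    return kept


sentence = ["this", "is", "my", "first", "vision", "for", "india", ":", "freedom", "."]
print(filter_tokens(sentence))  # ['first', 'vision', 'for', 'india', 'freedom']
```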

**Step 4: Train the Word2Vec model**
- Key Point: Word2Vec creates vector representations of words based on context.

In [16]:
model = Word2Vec(sentences, min_count=1)  # min_count=1 ensures all words are considered
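
The effect of `min_count` can be sketched as a simple frequency filter over the corpus. This is only an approximation of the vocabulary-building step, written with a hypothetical `build_vocab` helper and a toy corpus.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Keep words whose total frequency across the corpus is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


toy = [
    ["freedom", "vision", "india"],
    ["freedom", "respect"],
    ["vision", "freedom"],
]
print(build_vocab(toy, min_count=1))  # every word is kept
print(build_vocab(toy, min_count=2))  # only 'freedom' and 'vision' remain
```

gensim's default is `min_count=5`, which would discard almost every word in a paragraph this short. That is why the tutorial sets `min_count=1`.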

**Extract vocabulary**
- Key Point: Access vocabulary and its vector dimensions.

In [19]:
words = model.wv.index_to_key  # List of words in the vocabulary (gensim 4+; replaces the removed model.wv.vocab)

**Step 5: Finding word vectors**
- Key Point: Vector representations capture semantic relationships between words.

In [22]:
vector = model.wv['war']  # Get the vector for the word 'war'

**Step 6: Finding similar words**
- Key Point: Find contextually similar words based on vector distances.

In [25]:
similar_to_war = model.wv.most_similar('war')
similar_to_freedom = model.wv.most_similar('freedom')
similar_to_vikram = model.wv.most_similar('vikram')
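
`most_similar` ranks words by cosine similarity between their vectors. The sketch below reproduces that ranking in plain Python on made-up 2-dimensional vectors. The values are not output from a trained model, and the local `most_similar` function is a simplified stand-in for the gensim method.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scores = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Made-up vectors for illustration only.
toy_vectors = {
    "war": [1.0, 0.1],
    "independence": [0.9, 0.2],
    "freedom": [0.5, 0.5],
    "progress": [-0.2, 1.0],
}
print(most_similar("war", toy_vectors))
```

Cosine similarity depends only on the angle between two vectors, not their length, so scores range from -1 to 1.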

**Step 7: Compare CBOW and Skip-gram models**
- Key Point: CBOW predicts a target word from context; Skip-gram predicts context from a target word.

In [28]:
# Sample sentences about AI, used as training data
sentences = [
    ["Artificial", "intelligence", "is", "transforming", "the", "world"],
    ["Machine", "learning", "is", "a", "subset", "of", "AI"],
    ["Deep", "learning", "uses", "neural", "networks", "for", "pattern", "recognition"],
    ["Natural", "language", "processing", "enables", "computers", "to", "understand", "human", "language"],
    ["AI", "applications", "include", "image", "recognition", "and", "autonomous", "vehicles"],
    ["Reinforcement", "learning", "helps", "train", "agents", "through", "reward", "systems"],
    ["Data", "is", "the", "key", "fuel", "for", "artificial", "intelligence"],
    ["Supervised", "learning", "requires", "labeled", "data"],
    ["Unsupervised", "learning", "groups", "data", "based", "on", "similarities"],
    ["Ethics", "in", "AI", "is", "important", "to", "ensure", "responsible", "development"],
]

# KeyPoint: The training data consists of tokenized sentences. Each sentence is a list of words, providing the context for training Word2Vec.

# Train the Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# KeyPoint:
# - `vector_size=100`: The dimensionality of the word vectors (100 dimensions).
# - `window=5`: The maximum distance between the target word and context words.
# - `min_count=1`: Words whose total frequency is below 1 are ignored, so every word is kept.
# - `sg=1`: Skip-gram model is used (if `sg=0`, CBOW would be used).

# Get the vector for a word
vector = model.wv['language']  # Fetch the vector representation for the word 'language'
print("Vector for 'language':", vector)

# KeyPoint: Each word is represented as a dense vector in a high-dimensional space. The vector captures the word's semantic meaning.

# Find similar words
similar_words = model.wv.most_similar('language', topn=5)  # Find the top 5 words similar to 'language'
print("Words similar to 'language':", similar_words)

# KeyPoint: `most_similar` computes similarity between word vectors. Words with similar vectors (semantic proximity) are identified.
# - `topn=5`: Limits the output to the top 5 most similar words.

# Overall KeyPoint:
# - Word2Vec learns relationships between words based on their context in the sentences.
# - Skip-gram (sg=1) focuses on predicting context words given a target word, which works well for smaller datasets.
# - This trained model can now be used for tasks like finding word similarity or clustering words based on their vector representations.

Vector for 'language': [-9.58782248e-03  8.95247795e-03  4.16768529e-03  9.23937839e-03
  6.63647708e-03  2.91263685e-03  9.80561227e-03 -4.41262405e-03
 -6.79557770e-03  4.21743933e-03  3.71244014e-03 -5.65997604e-03
  9.70590487e-03 -3.55342729e-03  9.56569146e-03  8.24862858e-04
 -6.35078829e-03 -1.98813085e-03 -7.37798773e-03 -2.97638960e-03
  1.04563322e-03  9.49254353e-03  9.36291739e-03 -6.61006477e-03
  3.47603019e-03  2.29112501e-03 -2.50939350e-03 -9.23522189e-03
  1.02926721e-03 -8.15827399e-03  6.33065309e-03 -5.79973916e-03
  5.53055434e-03  9.82293021e-03 -1.69930747e-04  4.53570904e-03
 -1.81779568e-03  7.36540137e-03  3.93093657e-03 -9.01917554e-03
 -2.38957629e-03  3.64277582e-03 -1.00166544e-04 -1.21789996e-03
 -1.06380926e-03 -1.68953754e-03  6.03163266e-04  4.15880233e-03
 -4.25073365e-03 -3.83816310e-03 -4.38034949e-05  2.52343132e-04
 -1.72673841e-04 -4.79154196e-03  4.32580896e-03 -2.16904492e-03
  2.10142788e-03  6.58018980e-04  5.97620010e-03 -6.85623055e-03
 -

In [30]:
# Sample sentences for training
sentences = [
    ["Artificial", "intelligence", "is", "transforming", "the", "world"],
    ["Machine", "learning", "is", "a", "subset", "of", "AI"],
    ["Deep", "learning", "uses", "neural", "networks", "for", "pattern", "recognition"],
    ["Natural", "language", "processing", "enables", "computers", "to", "understand", "human", "language"],
    ["AI", "applications", "include", "image", "recognition", "and", "autonomous", "vehicles"],
    ["Reinforcement", "learning", "helps", "train", "agents", "through", "reward", "systems"],
    ["Data", "is", "the", "key", "fuel", "for", "artificial", "intelligence"],
    ["Supervised", "learning", "requires", "labeled", "data"],
    ["Unsupervised", "learning", "groups", "data", "based", "on", "similarities"],
    ["Ethics", "in", "AI", "is", "important", "to", "ensure", "responsible", "development"],
]

# KeyPoint: These sentences form the corpus for training Word2Vec models. Each list represents a tokenized sentence.

# CBOW Model
cbow_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)
# KeyPoint:
# - CBOW (Continuous Bag of Words) predicts a target word using its surrounding context.
# - `vector_size=100`: Dimensionality of the word vectors.
# - `window=2`: The context window size (words within 2 positions before and after the target word are considered).
# - `min_count=1`: Ignores words with total frequency below 1, so every word is kept.
# - `sg=0`: Specifies CBOW model (sg=1 would specify Skip-gram).

# Skip-gram Model
skipgram_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)
# KeyPoint:
# - Skip-gram predicts surrounding context words for a given target word.
# - The parameters are the same as CBOW, except `sg=1` for Skip-gram.

# Example: Getting the vector for a word
word = "language"
cbow_vector = cbow_model.wv[word]  # Fetch vector for 'language' using CBOW model
skipgram_vector = skipgram_model.wv[word]  # Fetch vector for 'language' using Skip-gram model

print(f"CBOW Vector for '{word}':", cbow_vector)
print(f"Skip-gram Vector for '{word}':", skipgram_vector)

# KeyPoint: 
# - Each word is represented as a dense vector of `vector_size` dimensions.
# - The vectors are learned based on the context in the training sentences.

# Example: Finding similar words
cbow_similar_words = cbow_model.wv.most_similar(word, topn=5)  # Find top 5 similar words using CBOW model
skipgram_similar_words = skipgram_model.wv.most_similar(word, topn=5)  # Find top 5 similar words using Skip-gram model

print(f"CBOW - Words similar to '{word}':", cbow_similar_words)
print(f"Skip-gram - Words similar to '{word}':", skipgram_similar_words)


CBOW Vector for 'language': [-9.5778247e-03  8.9507308e-03  4.1638282e-03  9.2415949e-03
  6.6409362e-03  2.9237177e-03  9.8145939e-03 -4.4304705e-03
 -6.8065114e-03  4.2280448e-03  3.7324817e-03 -5.6659030e-03
  9.7130537e-03 -3.5615030e-03  9.5540686e-03  8.2726392e-04
 -6.3318592e-03 -1.9830721e-03 -7.3820893e-03 -2.9793170e-03
  1.0386818e-03  9.4887791e-03  9.3572065e-03 -6.6004116e-03
  3.4746295e-03  2.2733128e-03 -2.4985084e-03 -9.2301974e-03
  1.0292478e-03 -8.1671849e-03  6.3251727e-03 -5.8021853e-03
  5.5378280e-03  9.8346984e-03 -1.6494206e-04  4.5241606e-03
 -1.8167844e-03  7.3680729e-03  3.9445432e-03 -9.0172039e-03
 -2.4024302e-03  3.6295475e-03 -9.7236669e-05 -1.2063734e-03
 -1.0603560e-03 -1.6788138e-03  6.0107547e-04  4.1653048e-03
 -4.2543472e-03 -3.8350383e-03 -5.2279724e-05  2.6976145e-04
 -1.7055863e-04 -4.7903457e-03  4.3160366e-03 -2.1736391e-03
  2.1093355e-03  6.6396361e-04  5.9701521e-03 -6.8402458e-03
 -6.8201255e-03 -4.4861357e-03  9.4411829e-03 -1.5910413e

**Key Points:**
 - `most_similar` identifies words with similar vector representations, indicating semantic similarity.
 - The similarity results can differ between CBOW and Skip-gram because the two models optimize different training objectives. On a corpus this small, the rankings also depend heavily on the random initialization.
 - CBOW combines the context words into a single prediction of the target word, which makes it faster to train.
 - Skip-gram works well with smaller amounts of data and represents rare words effectively.

**Overall Key Points:**
 - CBOW trains faster and tends to perform slightly better on frequent words.
 - Skip-gram trains more slowly but is better at learning embeddings for rare words.
 - Both models learn the contextual and semantic relationships between words in a corpus.