<a href="https://colab.research.google.com/github/HamzaBahsir/NLP/blob/main/N_grams_and_Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **N-grams and Word Embeddings**

# **Objectives:**

Here we will do fundamental techniques for representing text in a machine-readable format. These techniques form the foundation for various NLP applications, enabling machines to understand and process human language efficiently.

1. N-gram Models
    * Unigram
    * Bigrams
    * Trigrams
2. Word Embeddings
  * Word2Vec
      * Continuous Bag-of-Words (CBOW)
      * Skip-gram
  * Global Vectors for Word Representation (GloVe)

#**Extra Resources**
[Natural Language Processing with Python](https://www.nltk.org/book/)

#**Libraries Required**

* nltk
* gensim
* os
* platform
* sys
* struct
* numpy
* scipy
* gensim


In [1]:
# Import Libraries
# For N-gram
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from nltk.probability import FreqDist
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## **Text preparation**

In [2]:
# Sample text
text = "Natural Language Processing is an exciting field. NVDEE is provide services in this field."

# Tokenize the text into words
tokens = word_tokenize(text)
print("Tokens:", tokens)

Tokens: ['Natural', 'Language', 'Processing', 'is', 'an', 'exciting', 'field', '.', 'NVDEE', 'is', 'provide', 'services', 'in', 'this', 'field', '.']


## **Unigrams**

In [3]:
# Generate unigrams (1-grams)
unigrams = list(ngrams(tokens, 1))
print("Unigrams:", unigrams)


# Calculate frequency distribution
unigram_freq_dist = FreqDist(unigrams)

# Output the frequency of each unigram
print("\nUnigrams : Frequency")
for unigrams, frequency in unigram_freq_dist.items():
    print(f"{unigrams}: {frequency}")

Unigrams: [('Natural',), ('Language',), ('Processing',), ('is',), ('an',), ('exciting',), ('field',), ('.',), ('NVDEE',), ('is',), ('provide',), ('services',), ('in',), ('this',), ('field',), ('.',)]

Unigrams : Frequency
('Natural',): 1
('Language',): 1
('Processing',): 1
('is',): 2
('an',): 1
('exciting',): 1
('field',): 2
('.',): 2
('NVDEE',): 1
('provide',): 1
('services',): 1
('in',): 1
('this',): 1


In [4]:
# Total tokens
total_tokens = len(tokens) # Here this is equal to sum of frequencies of unigrams, which is why it is used to compute probabilities.

# Convert frequencies to probabilities
unigram_prob_dist = {unigram: count / total_tokens for unigram, count in unigram_freq_dist.items()}

# Print probabilities
print("Unigram : Probability")
for unigram, prob in unigram_prob_dist.items():
    print(f"{unigram}: {prob:.3f}")

Unigram : Probability
('Natural',): 0.062
('Language',): 0.062
('Processing',): 0.062
('is',): 0.125
('an',): 0.062
('exciting',): 0.062
('field',): 0.125
('.',): 0.125
('NVDEE',): 0.062
('provide',): 0.062
('services',): 0.062
('in',): 0.062
('this',): 0.062


In [5]:
def predict_next_word(unigram_prob_dist, top_n=3):
    # Sort words by probability (descending)
    sorted_words = sorted(unigram_prob_dist.items(), key=lambda x: -x[1])
    # Return top N words
    return [word for word, prob in sorted_words[:top_n]]

# Example usage
top_words = predict_next_word(unigram_prob_dist, top_n=3)
print("Top Predictions (Context Ignored):", top_words)

# Prediction Example: Predict after "Natural"
context = ["Natural"]
prediction = predict_next_word(unigram_prob_dist)
print(f"\nAfter '{context[-1]}':", prediction)

Top Predictions (Context Ignored): [('is',), ('field',), ('.',)]

After 'Natural': [('is',), ('field',), ('.',)]


## **Bigrams**

In [6]:
# Generate bigrams (2-grams)
bigrams = list(ngrams(tokens, 2))
print("Bigrams:", bigrams)


# Calculate frequency distribution
bigram_freq_dist = FreqDist(bigrams)

# Output the frequency of each bigram
print("\nBigrams : Frequency")
for bigram, frequency in bigram_freq_dist.items():
    print(f"{bigram}: {frequency}")

Bigrams: [('Natural', 'Language'), ('Language', 'Processing'), ('Processing', 'is'), ('is', 'an'), ('an', 'exciting'), ('exciting', 'field'), ('field', '.'), ('.', 'NVDEE'), ('NVDEE', 'is'), ('is', 'provide'), ('provide', 'services'), ('services', 'in'), ('in', 'this'), ('this', 'field'), ('field', '.')]

Bigrams : Frequency
('Natural', 'Language'): 1
('Language', 'Processing'): 1
('Processing', 'is'): 1
('is', 'an'): 1
('an', 'exciting'): 1
('exciting', 'field'): 1
('field', '.'): 2
('.', 'NVDEE'): 1
('NVDEE', 'is'): 1
('is', 'provide'): 1
('provide', 'services'): 1
('services', 'in'): 1
('in', 'this'): 1
('this', 'field'): 1


## **Trigrams**

In [7]:
# Generate Trigrams (3-grams) on the same sample text (used in Unigrams/Bigrams)
trigrams = list(ngrams(tokens, 3))
print("Trigrams:", trigrams)

# Calculate frequency distribution of Trigrams
trigram_freq_dist = FreqDist(trigrams)


# Output the frequency along with the trigram
print("\nTrigram : Frequency")
for trigram, frequency in trigram_freq_dist.items():
    print(f"{trigram}: {frequency}")


Trigrams: [('Natural', 'Language', 'Processing'), ('Language', 'Processing', 'is'), ('Processing', 'is', 'an'), ('is', 'an', 'exciting'), ('an', 'exciting', 'field'), ('exciting', 'field', '.'), ('field', '.', 'NVDEE'), ('.', 'NVDEE', 'is'), ('NVDEE', 'is', 'provide'), ('is', 'provide', 'services'), ('provide', 'services', 'in'), ('services', 'in', 'this'), ('in', 'this', 'field'), ('this', 'field', '.')]

Trigram : Frequency
('Natural', 'Language', 'Processing'): 1
('Language', 'Processing', 'is'): 1
('Processing', 'is', 'an'): 1
('is', 'an', 'exciting'): 1
('an', 'exciting', 'field'): 1
('exciting', 'field', '.'): 1
('field', '.', 'NVDEE'): 1
('.', 'NVDEE', 'is'): 1
('NVDEE', 'is', 'provide'): 1
('is', 'provide', 'services'): 1
('provide', 'services', 'in'): 1
('services', 'in', 'this'): 1
('in', 'this', 'field'): 1
('this', 'field', '.'): 1


## **Word2Vec**

In [8]:
# Install latest version of gensim
!pip install --upgrade gensim



After the installation, a manual Colab Session Restart is required to fetch updated packages. It can be done by either of the two options given below:


1.   **Runtime -> Restart Session** (or using the shortcut `Ctrl + M + .`)
2.   To avoid Manually Restarting the Session, **Run the next Code Block** to kill and auto restart the session



In [None]:
import os
os.kill(os.getpid(), 9)

After successfully restarting the session, run the following code to ensure proper installation and import. (It will give errors if the session is not restarted!)

In [1]:
# Checks to ensure Gensim is installed correctly along with correct version of dependencies
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import struct; print("Bits", 8 * struct.calcsize("P"))
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec; print("FAST_VERSION", word2vec.FAST_VERSION)
## For downloading data from wiki to analyze
import gensim.downloader as api
# For data preparation
from gensim.models import Word2Vec
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

Linux-6.1.85+-x86_64-with-glibc2.35
Python 3.11.11 (main, Dec  4 2024, 08:55:07) [GCC 11.4.0]
Bits 64
NumPy 1.26.4
SciPy 1.13.1
gensim 4.3.3
FAST_VERSION 0


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Now let's get back to training our Word2Vec models

## **Text preparation**

In [2]:
# Sample sentences as a lists
sentences = [
    "Natural language processing is fascinating",
    "Word embeddings capture semantic relationships",
    "CBOW and SkipGram are popular models",
    "Word2Vec is a technique for learning vector representations",
    "SkipGram tends to perform better on rare words",
    "CBOW is faster and more efficient for common words"
]

# Tokenize sentences
tokenized_sentences = [word_tokenize(sent.lower()) for sent in sentences]
print(tokenized_sentences)

[['natural', 'language', 'processing', 'is', 'fascinating'], ['word', 'embeddings', 'capture', 'semantic', 'relationships'], ['cbow', 'and', 'skipgram', 'are', 'popular', 'models'], ['word2vec', 'is', 'a', 'technique', 'for', 'learning', 'vector', 'representations'], ['skipgram', 'tends', 'to', 'perform', 'better', 'on', 'rare', 'words'], ['cbow', 'is', 'faster', 'and', 'more', 'efficient', 'for', 'common', 'words']]


In [3]:
# Check Number of CPU cores (Colab free: 2x CPUs) to set num of workers
# !cat /proc/cpuinfo
print("CPU Cores:")
!grep -c '^processor' /proc/cpuinfo

CPU Cores:
2


### **Continuous Bag-of-Words (CBOW)**

CBOW predicts a word from its context. It can be used by setting `sg=0` in `gensim.models.Word2Vec()`

In [4]:
# ---------------------
# Option 1: Train a Word2Vec model using CBOW (sg=0)
# ---------------------
cbow_model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,    # Dimensionality of word vectors
    window=5,           # Maximum distance between current and predicted word
    sg=0,               # CBOW architecture (default)
    min_count=1,        # Ignore words with frequency < min_count
    workers=2           # Number of CPU threads
)

print("CBOW Vector for 'language':")
print(cbow_model.wv['language'])

print("\nCBOW Model - Most similar words to 'language':")
print(cbow_model.wv.most_similar("language"))

CBOW Vector for 'language':
[ 6.4154328e-03 -8.9511415e-03 -7.3454725e-03 -1.7511440e-03
  1.6980087e-03 -1.0342204e-03 -5.2042734e-03  6.5792515e-03
  8.7828115e-03 -7.4120974e-03  9.8055657e-03  7.3666479e-03
 -7.4607255e-03 -1.8999577e-03  4.2520380e-03  7.0596053e-03
 -3.6636866e-03 -6.9730817e-03  4.7248350e-03 -9.0386067e-03
 -5.8503030e-03 -1.2834811e-03  5.4809176e-03 -5.6869411e-03
  4.7847857e-03 -4.3494583e-04  2.6672638e-03  6.4039291e-03
  1.4176309e-03  7.7085746e-03 -3.2240152e-04 -8.2637034e-03
  9.1784988e-03 -4.8582028e-03  4.7219060e-03 -3.9027834e-03
 -7.3295189e-03 -6.5126489e-03  4.6773339e-03 -6.5943244e-04
  1.4602697e-03 -8.9282785e-03 -5.1465523e-03 -6.0544228e-03
  8.4127877e-03 -8.6974325e-03  5.0248443e-03 -8.6135149e-04
  1.8937707e-04  8.7997578e-03 -3.5854375e-03 -6.9373380e-03
  7.6357962e-04  7.7428352e-03  9.1208639e-03 -3.6847114e-03
  2.7328073e-03  4.9426113e-03 -5.2920487e-03  6.8525351e-03
 -6.4529707e-03  2.1008432e-03  4.5872582e-03  4.3851123e

### **Skip-gram**

Skip-gram predicts the context from a given word. It can be used by setting `sg=1` in `gensim.models.Word2Vec()`

In [5]:
# ---------------------
# Option 2: Train a Word2Vec model using Skip-gram (sg=1)
# ---------------------
skipgram_model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,    # Dimensionality of word vectors
    window=5,           # Maximum distance between current and predicted word
    sg=1,               # Skip-gram architecture
    min_count=1,        # Ignore words with frequency < min_count
    workers=2           # Number of CPU threads
)

print("Skip-gram Vector for 'language':")
print(skipgram_model.wv['language'])

print("\nSkip-gram Model - Most similar words to 'language':")
print(skipgram_model.wv.most_similar("language"))

Skip-gram Vector for 'language':
[ 6.4154328e-03 -8.9511415e-03 -7.3454725e-03 -1.7511440e-03
  1.6980087e-03 -1.0342204e-03 -5.2042734e-03  6.5792515e-03
  8.7828115e-03 -7.4120974e-03  9.8055657e-03  7.3666479e-03
 -7.4607255e-03 -1.8999577e-03  4.2520380e-03  7.0596053e-03
 -3.6636866e-03 -6.9730817e-03  4.7248350e-03 -9.0386067e-03
 -5.8503030e-03 -1.2834811e-03  5.4809176e-03 -5.6869411e-03
  4.7847857e-03 -4.3494583e-04  2.6672638e-03  6.4039291e-03
  1.4176309e-03  7.7085746e-03 -3.2240152e-04 -8.2637034e-03
  9.1784988e-03 -4.8582028e-03  4.7219060e-03 -3.9027834e-03
 -7.3295189e-03 -6.5126489e-03  4.6773339e-03 -6.5943244e-04
  1.4602697e-03 -8.9282785e-03 -5.1465523e-03 -6.0544228e-03
  8.4127877e-03 -8.6974325e-03  5.0248443e-03 -8.6135149e-04
  1.8937707e-04  8.7997578e-03 -3.5854375e-03 -6.9373380e-03
  7.6357962e-04  7.7428352e-03  9.1208639e-03 -3.6847114e-03
  2.7328073e-03  4.9426113e-03 -5.2920487e-03  6.8525351e-03
 -6.4529707e-03  2.1008432e-03  4.5872582e-03  4.385

#### **Word Embeddings using Word2Vec on 'text8' Corpus**

Use [Gensim Downloader API](https://radimrehurek.com/gensim/downloader.html) to load `text8` corpus

In [6]:
# Loading a preprocessed Wikipedia dataset 'text8'
dataset = api.load("text8")
sentences = list(dataset)


In [7]:
# Check Number of CPU cores
print("CPU Cores:")
!grep -c '^processor' /proc/cpuinfo

CPU Cores:
2


In [8]:
tokenized_sentences = sentences
# Train a Word2Vec model using CBOW
cbow_model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=100,    # Dimensionality of word vectors
    window=5,           # Maximum distance between current and predicted word
    sg=0,               # CBOW architecture (default)
    min_count=1,        # Ignore words with frequency < min_count
    workers=2           # Number of CPU threads
)

In [9]:
# 1. Predict top 3 similar words to 'king', using the trained Word2Vec model
print("CBOW Vector for 'king':")
print(cbow_model.wv['king'])

print("\nCBOW Model - Top 3 similar words to 'king':")
print(cbow_model.wv.most_similar("king", topn=3))

CBOW Vector for 'king':
[-2.1695945   0.5847411   3.2021227   0.2976922   2.3323178   1.5003363
  2.4373105   0.2977524   0.4758191   0.17675078 -3.4745905  -2.9144578
 -0.68318254  5.86263    -2.0450642   0.09799168  1.3985769  -0.35918367
  1.4730158   0.26482123 -0.70288175  2.0615118  -1.7213688   0.08982217
 -0.19036348  0.5005349  -1.4481603   1.0729766   0.696614   -1.2911931
 -1.1185753   0.34417343  1.970582    0.05067883  2.7318215   1.0668592
  0.37994498 -2.0603201  -0.31645995 -2.006795   -1.6716607  -0.8880612
 -1.2657269  -0.88274217 -4.392502   -0.6589444  -1.21468     3.3773243
  3.1391618  -0.736303    3.302079    0.7741737  -1.6376418   2.6328306
 -0.80173177 -1.5870308  -0.8007671   1.2712377   1.4984668  -0.5695333
  1.4345889   0.3421586   0.9933981   0.62779355  0.73089826  0.36383626
 -2.4109068   1.8112689   1.3038373  -1.6203868  -0.10327034  0.12920262
 -0.60925305 -2.1789649   1.7843876  -1.9788699   0.07465937  2.159091
 -2.9001215   3.0305536  -1.2490788  

In [10]:
# 2. Perform Word algebra (king - man + woman = ?), using the trained Word2Vec model
result = cbow_model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3)

# Top results
for word, similarity in result:
    print(f"{word}: {similarity:.4f}")

queen: 0.6931
daughter: 0.6605
empress: 0.6570


## **Global Vectors for Word Representation (GloVe)**


In [11]:
# Load pre-trained GLoVe model (Wikipedia + Gigaword corpus)
glove = api.load("glove-twitter-25")  # 25-dimensional vectors

# Similarity check
print(glove.most_similar("computer", topn=3))

print(glove['computer'])

[('camera', 0.907833456993103), ('cell', 0.891890287399292), ('server', 0.874466598033905)]
[ 0.64005  -0.019514  0.70148  -0.66123   1.1723   -0.58859   0.25917
 -0.81541   1.1708    1.1413   -0.15405  -0.11369  -3.8414   -0.87233
  0.47489   1.1541    0.97678   1.1107   -0.14572  -0.52013  -0.52234
 -0.92349   0.34651   0.061939 -0.57375 ]


### **Word Embeddings using GloVe**

In [12]:
# Load pre-trained GLoVe model (Wikipedia + Gigaword corpus)
dataset = api.load("glove-wiki-gigaword-300")

# 1. Predict top 3 similar words to 'king'
print("Top 3 similar words to king \n", glove.most_similar("king", topn=3))

# 2. Word analogy: king - man + woman = ?
# Perform word analogy: king - man + woman = ?
result = glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print("Word analogy: king - man + woman = ", result[0][0])

# 3. Get Vector for any word and Print it
print("Vector for 'king':")
print(glove['king'])

Top 3 similar words to king 
 [('prince', 0.9337409734725952), ('queen', 0.9202421307563782), ('aka', 0.9176921844482422)]
Word analogy: king - man + woman =  meets
Vector for 'king':
[-0.74501  -0.11992   0.37329   0.36847  -0.4472   -0.2288    0.70118
  0.82872   0.39486  -0.58347   0.41488   0.37074  -3.6906   -0.20101
  0.11472  -0.34661   0.36208   0.095679 -0.01765   0.68498  -0.049013
  0.54049  -0.21005  -0.65397   0.64556 ]
