### Dataset: Alice Dataset
**Link: https://www.gutenberg.org/files/11/11-0.txt**

## N-gram language model:

**Approach**: First context size and embedding dimension (which the domain in vector field) will be initialized.

In [1]:
contextSize = 3
embeddingDimension = 20

**Approach:** A sample text will be processed which will be considered as training data, using this text segment the vectorization scale can be understood.

In [2]:
test_sentence = """India, officially the Republic of India, is a country in South Asia. It is the seventh-largest country by area, the most populous country
in the world, and the most populous democracy."""

**Approach:**  We will split the test in training and testing for better model evaluation and generalization purpose.

In [3]:
ngram_model = [([test_sentence[m - n - 1] for n in range (contextSize)], test_sentence[m])
                for m in range(contextSize, len(test_sentence))]

**Approach:**  First four segment of the model will be printed in chronological order after the training process is complete.

In [4]:
print(ngram_model[:1])
print(ngram_model[:2])
print(ngram_model[:3])
print(ngram_model[:4])

[(['d', 'n', 'I'], 'i')]
[(['d', 'n', 'I'], 'i'), (['i', 'd', 'n'], 'a')]
[(['d', 'n', 'I'], 'i'), (['i', 'd', 'n'], 'a'), (['a', 'i', 'd'], ',')]
[(['d', 'n', 'I'], 'i'), (['i', 'd', 'n'], 'a'), (['a', 'i', 'd'], ','), ([',', 'a', 'i'], ' ')]


In [5]:
vocab = set(test_sentence)
wordIxConversion = {word: m for m, word in enumerate(vocab)}

**Approach:** Necessary libraries will be imported to implement N gram language model.

In [6]:
## Import libraries
import torch
import torch.nn as tnn
import torch.optim as toptim
import torch.nn.functional as tfunction

**Approach:** Using the nn module the N gram modeling class will be implemented.

There will be segment of the total class, one initializing the context and embedded(vectorized) dimension and other indicating the forward propagation through the network.

In [7]:
class nGramModelling(tnn.Module):
    def __init__(self, vocab_dimension, embedding_dimension, context_dimension):
        super(nGramModelling, self).__init__()
        self.embedding = tnn.Embedding(vocab_dimension, embedding_dimension)
        self.linear_first = tnn.Linear(context_dimension * embedding_dimension, 119)
        self.linear_second = tnn.Linear(119, vocab_dimension)
    def forward(self, training_set):
        embedded_form = self.embedding(training_set).view((1, -1))
        forward_out = tfunction.relu(self.linear_first(embedded_form))
        forward_out = self.linear_second(forward_out)
        log_propagation = tfunction.log_softmax(forward_out, dim=1)
        return log_propagation

In [8]:
lossFunction = tnn.NLLLoss()
calculated_loss = []
model_structure = nGramModelling(len(vocab), embeddingDimension, contextSize)
optimization = toptim.SGD(model_structure.parameters(), lr=0.002)

**Approach:** The error generated through forward propagation will be calculated and using gradient decent the more optimized approach will be implemented.

In [9]:
for tempVariable in range(22):
    totalLoss = 0
    for context, target in ngram_model:
        model_structure.zero_grad() ## Gradient initilization to zero
        context_idxs = torch.tensor([wordIxConversion[w] for w in context]) ## Integer index mapping
        log_propagation = model_structure(context_idxs) ## Forward pass
        loss = lossFunction(log_propagation, torch.tensor([wordIxConversion[target]])) ## Calculate loss function
        optimization.step() ## Update gradient
        loss.backward() ## Backward pass
        totalLoss += loss.item() ## Calculate total loss
    calculated_loss.append(totalLoss)

**Approach:** The total loss will be prined.

In [10]:
print(calculated_loss) ## Print the loss value

[627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568, 627.9287450313568]


**Approach:** To test the model and evaluate the generalization approach some vectorized form of letter of printed.

In [11]:
print(model_structure.embedding.weight[wordIxConversion["i"]]) ## Print the embedded form of 'i'

tensor([-0.1026,  0.1578,  0.9792,  0.5220,  0.3054,  1.8454, -0.0544, -0.2808,
         2.1645,  1.6299,  0.9116,  1.1881,  0.5856,  0.3783,  0.2527,  1.3072,
        -0.5567, -0.4364,  0.7162,  0.0098], grad_fn=<SelectBackward0>)


In [12]:
print(model_structure.embedding.weight[wordIxConversion["n"]]) ## Print the embedded form of 'n'

tensor([ 1.0074,  0.9762,  0.8122,  0.8563,  0.1722, -0.7613, -1.7372,  0.3524,
        -0.4371,  1.2262, -0.2154, -1.0835,  1.6382,  0.1090,  0.9307,  0.8007,
        -0.2892,  0.9695, -1.2736,  0.5683], grad_fn=<SelectBackward0>)


In [13]:
print(model_structure.embedding.weight[wordIxConversion["d"]]) ## Print the embedded form of 'd'

tensor([-0.5240, -0.0740, -0.9617, -1.7470, -0.7892, -0.1827, -0.4542,  0.5034,
         0.7301,  0.7181, -1.4437, -0.6801,  0.3338, -0.2508, -0.3352, -0.4485,
         0.6673,  1.2342, -2.1650,  0.1493], grad_fn=<SelectBackward0>)


## CBOW model:

**Approach:** In this modelling method, multiple word input vectors will be trained in a single projection and it will be mapped to a specified vectorized output.

First the context dimension and main text(training) will be initialized.

In [14]:
context_dimension = 2  ## 2 words to left and 2 to right
mainTextSegment = """ West Bengal is a state in eastern India, between the Himalayas and the Bay of Bengal. Its capital, Kolkata (formerly Calcutta), retains
architectural and cultural remnants of its past as an East India Company trading post and capital of the British Raj. The city's colonial landmarks include
the government buildings around B.B.D. Bagh Square, and the iconic Victoria Memorial, dedicated to Britain's queen. """

**Approach:** The text will be considered as array format and the array will be deduplicated. The length which will be used for training purpose will printed.

In [15]:
## deduplication of array
vocab = set(mainTextSegment)
vocab_length = len(vocab)
print(vocab_length)

44


In [16]:
wordIxConversion = {word: m for m, word in enumerate(vocab)}
extracted_word = []

**Approach:**  Some specific text area will be extracted from the total text segment for training purpose and other will be for testing purpose to achieve better generalized approach.

In [17]:
for m in range(contextSize, len(mainTextSegment) - contextSize):
    context = ([mainTextSegment[m + n + 1] for n in range(context_dimension)] + [mainTextSegment[m - m - 1] for n in range(context_dimension)])
    targetSegment = mainTextSegment[m]
    extracted_word.append((context, targetSegment))

**Approach:** The first five extracted segment of the training text will be printed.

In [18]:
print(extracted_word[:1])
print(extracted_word[:2])
print(extracted_word[:3])
print(extracted_word[:4])
print(extracted_word[:5])

[(['t', ' ', ' ', ' '], 's')]
[(['t', ' ', ' ', ' '], 's'), ([' ', 'B', ' ', ' '], 't')]
[(['t', ' ', ' ', ' '], 's'), ([' ', 'B', ' ', ' '], 't'), (['B', 'e', ' ', ' '], ' ')]
[(['t', ' ', ' ', ' '], 's'), ([' ', 'B', ' ', ' '], 't'), (['B', 'e', ' ', ' '], ' '), (['e', 'n', ' ', ' '], 'B')]
[(['t', ' ', ' ', ' '], 's'), ([' ', 'B', ' ', ' '], 't'), (['B', 'e', ' ', ' '], ' '), (['e', 'n', ' ', ' '], 'B'), (['n', 'g', ' ', ' '], 'e')]


**Approach:** The CBOW class will be implemented and the context vectorizer function will be initialized to vectorize the domain.

In [19]:
class CBOWModelling(tnn.Module):
    def __init__(self):
        pass
    def forward(self, inputs):
        pass

In [20]:
## Model creation and training
def contextVectorization(context, word_to_ix):
    vectorized_id = [wordIxConversion[w] for w in context]
    return torch.tensor(vectorized_id, dtype=torch.long)

**Approach:** As the text segment was taken as an array format, so some array index will printed to analyze the vectorized form for that index.

In [21]:
contextVectorization(extracted_word[0][0], wordIxConversion)

tensor([ 7, 12, 12, 12])

In [22]:
contextVectorization(extracted_word[0][1], wordIxConversion)

tensor([28])

In [23]:
contextVectorization(extracted_word[1][0], wordIxConversion)

tensor([12, 25, 12, 12])

In [24]:
contextVectorization(extracted_word[1][1], wordIxConversion)

tensor([7])

## Semantic similarity measurement:

**Approach:**
For word processing and measurement purpose in vectorized format nltk and gensim will be used.

First the libraries will be imported and dataset text file will be processed, then word2vec will be implemented in both the modelling format n gram and CBOW.

In [25]:
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings as warn
import nltk
nltk.download('punkt')

import gensim
from gensim.models import Word2Vec

warn.filterwarnings(action = 'ignore')


## Reads text file dataset
sample_text = open('/content/drive/MyDrive/Dataset/alice.txt')
extracted_text = sample_text.read()

# Reapleces all the escape characters with space segment
updated_text = extracted_text.replace("\n", " ")

data = []

# Travarsing through all senteces of dataset
for m in sent_tokenize(updated_text):
    temp = []

    # word tokenization
    for n in word_tokenize(m):
        temp.append(n.lower())

    data.append(temp)

# Implementation of N Gram model
model_first = gensim.models.Word2Vec(data, min_count = 1, vector_size = 100, window = 5, sg = 1)

# Print results
print("Cosine similarity between 'alice' " + "and 'wonderland' - N Gram : ", model_first.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " + "and 'machines' - N Gram : ", model_first.wv.similarity('alice', 'machines'))

# Implementation of  CBOW model
model_second = gensim.models.Word2Vec(data, min_count = 1, vector_size = 100, window = 5)

# Print results
print("Cosine similarity between 'alice' " + "and 'wonderland' - CBOW : ", model_second.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " + "and 'machines' - CBOW : ", model_second.wv.similarity('alice', 'machines'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Cosine similarity between 'alice' and 'wonderland' - N Gram :  0.73083436
Cosine similarity between 'alice' and 'machines' - N Gram :  0.8429953
Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.97775567
Cosine similarity between 'alice' and 'machines' - CBOW :  0.933941
