--- 

**PyTorch Embedding Module** `nn.Embedding()` by [Moussa JAMOR](https://github.com/JamorMoussa)

Code Repository: [todo]()

---

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

import wikipedia
import re

In [2]:
page_title = "Morocco"

# Fetch the page once and reuse the returned WikipediaPage object.
page_summary = wikipedia.page(page_title)

page_content = page_summary.content

In [3]:
page_content = page_summary.content.lower()

In [4]:
page_content

'morocco ( ), officially the kingdom of morocco, is a country in the maghreb region of north africa. it overlooks the mediterranean sea to the north and the atlantic ocean to the west, and has land borders with algeria to the east, and the disputed territory of western sahara to the south. morocco also claims the spanish exclaves of ceuta, melilla and peñón de vélez de la gomera, and several small spanish-controlled islands off its coast. it spans an area of 446,300 km2 (172,300 sq mi) or 716,550 km2 (276,660 sq mi),, with a population of roughly 37 million. its official and predominant religion is islam, and the official languages are arabic and berber (tamazight);  french and the moroccan dialect of arabic are also widely spoken. moroccan identity and culture is a mix of arab, berber, african and european cultures. its capital is rabat, while its largest city is casablanca.the region constituting morocco has been inhabited since the paleolithic era over 300,000 years ago. the idrisid

In [5]:
max_length = 5000
truncated_content = page_content[:max_length]

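# Keep only the letters a-z; digits, punctuation, and accented letters
# become spaces, then runs of whitespace are collapsed to single spaces.
cleaned_content = " ".join(re.sub(r'[^a-z]', ' ', truncated_content).split())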

print(cleaned_content)

morocco officially the kingdom of morocco is a country in the maghreb region of north africa it overlooks the mediterranean sea to the north and the atlantic ocean to the west and has land borders with algeria to the east and the disputed territory of western sahara to the south morocco also claims the spanish exclaves of ceuta melilla and pe n de v lez de la gomera and several small spanish controlled islands off its coast it spans an area of km sq mi or km sq mi with a population of roughly million its official and predominant religion is islam and the official languages are arabic and berber tamazight french and the moroccan dialect of arabic are also widely spoken moroccan identity and culture is a mix of arab berber african and european cultures its capital is rabat while its largest city is casablanca the region constituting morocco has been inhabited since the paleolithic era over years ago the idrisid dynasty was established by idris i in and was subsequently ruled by a series 

In [6]:
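import nltk
from nltk.corpus import stopwords

# Download the stop-word list on first use (does nothing if already present).
nltk.download("stopwords", quiet=True)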

In [7]:
stop_words_set = set(stopwords.words("english"))

In [8]:
content = " ".join(filter(lambda word: word not in stop_words_set, cleaned_content.split()))

In [9]:
content

'morocco officially kingdom morocco country maghreb region north africa overlooks mediterranean sea north atlantic ocean west land borders algeria east disputed territory western sahara south morocco also claims spanish exclaves ceuta melilla pe n de v lez de la gomera several small spanish controlled islands coast spans area km sq mi km sq mi population roughly million official predominant religion islam official languages arabic berber tamazight french moroccan dialect arabic also widely spoken moroccan identity culture mix arab berber african european cultures capital rabat largest city casablanca region constituting morocco inhabited since paleolithic era years ago idrisid dynasty established idris subsequently ruled series independent dynasties reaching zenith regional power th th centuries almoravid almohad dynasties controlled iberian peninsula maghreb centuries arab migration maghreb since th century shifted demographic scope region th th centuries morocco faced external threat

In [10]:
from textblob import Word

In [11]:
# Lemmatize each word (TextBlob relies on the WordNet corpus for this).
content = " ".join(Word(word).lemmatize() for word in content.split())

In [12]:
# Stem each word, which truncates it to a root form such as 'offici'.
content = " ".join(Word(word).stem() for word in content.split())

In [13]:
content

'morocco offici kingdom morocco countri maghreb region north africa overlook mediterranean sea north atlant ocean west land border algeria east disput territori western sahara south morocco also claim spanish exclav ceuta melilla pe n de v lez de la gomera sever small spanish control island coast span area km sq mi km sq mi popul roughli million offici predomin religion islam offici languag arab berber tamazight french moroccan dialect arab also wide spoken moroccan ident cultur mix arab berber african european cultur capit rabat largest citi casablanca region constitut morocco inhabit sinc paleolith era year ago idrisid dynasti establish idri subsequ rule seri independ dynasti reach zenith region power th th centuri almoravid almohad dynasti control iberian peninsula maghreb centuri arab migrat maghreb sinc th centuri shift demograph scope region th th centuri morocco face extern threat sovereignti portug seiz territori ottoman empir encroach east marinid saadi dynasti otherwis resist

In [20]:
tokens = content.split()

In [22]:
set_words = set(tokens)

In [24]:
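# Map each unique token to an integer index. Set iteration order is
# arbitrary, so the indices can differ from one run to the next.
vocab = {word: i for i, word in enumerate(set_words)}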

In [25]:
vocab

{'scope': 0,
 'valid': 1,
 'sea': 2,
 'polit': 3,
 'spain': 4,
 'forc': 5,
 'coast': 6,
 'god': 7,
 'atlant': 8,
 'south': 9,
 'tripoli': 10,
 'western': 11,
 'affair': 12,
 'guerrilla': 13,
 'european': 14,
 'spanish': 15,
 'court': 16,
 'paleolith': 17,
 'peninsula': 18,
 'citi': 19,
 'break': 20,
 'far': 21,
 'self': 22,
 'mi': 23,
 'idri': 24,
 'v': 25,
 'dominion': 26,
 'reserv': 27,
 'call': 28,
 'rule': 29,
 'war': 30,
 'span': 31,
 'melilla': 32,
 'religion': 33,
 'spoken': 34,
 'era': 35,
 'saadi': 36,
 'global': 37,
 'portug': 38,
 'territori': 39,
 'dialect': 40,
 'ja': 41,
 'area': 42,
 'union': 43,
 'alexandria': 44,
 'protector': 45,
 'vest': 46,
 'popul': 47,
 'islam': 48,
 'next': 49,
 'fifth': 50,
 'middl': 51,
 'disput': 52,
 'chamber': 53,
 'control': 54,
 'especi': 55,
 'religi': 56,
 'design': 57,
 'offici': 58,
 'repres': 59,
 'awsa': 60,
 'neighbour': 61,
 'king': 62,
 'zenith': 63,
 'escap': 64,
 'provinc': 65,
 'independ': 66,
 'fail': 67,
 'deadlock': 68,
 'hi

In [18]:
data = []

In [28]:
for i in range(2, len(tokens) - 2):
    context = [vocab[word] for word in tokens[i - 2:i] + tokens[i + 1:i + 3]]
    target = vocab[tokens[i]]
    data.append((torch.tensor(context), torch.tensor(target)))

In [30]:
data[:6]

[(tensor([105,  58, 105, 106]), tensor(245)),
 (tensor([ 58, 245, 106, 189]), tensor(105)),
 (tensor([245, 105, 189, 255]), tensor(106)),
 (tensor([105, 106, 255,  79]), tensor(189)),
 (tensor([106, 189,  79,  71]), tensor(255)),
 (tensor([189, 255,  71, 233]), tensor(79))]
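
The loop above pairs each token with the two tokens on either side of it. As a minimal sketch of the same windowing, here it is applied to the first six stemmed tokens of the corpus, keeping the strings instead of vocabulary indices so the pairs are readable. The names `toy_tokens`, `window`, and `pairs` are illustrative and not part of the notebook.

```python
# Toy illustration of the CBOW context window (two tokens on each side).
toy_tokens = ["morocco", "offici", "kingdom", "morocco", "countri", "maghreb"]
window = 2

pairs = []
for i in range(window, len(toy_tokens) - window):
    # Two tokens before position i, followed by two tokens after it.
    context = toy_tokens[i - window:i] + toy_tokens[i + 1:i + window + 1]
    target = toy_tokens[i]
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
```

Only positions with a full window on both sides produce a pair, so six tokens yield two training examples.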

In [31]:
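class CBOWModel(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super().__init__()
        # Lookup table: one trainable vector of size embed_size per word.
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        # Projects the combined context vector onto vocabulary logits.
        self.linear = nn.Linear(embed_size, vocab_size)

    def forward(self, context):
        # Sum over the context-word axis (second to last), so the result has
        # size embed_size for a single context or (batch, embed_size) for a batch.
        context_embeds = self.embeddings(context).sum(dim=-2)
        output = self.linear(context_embeds)
        return output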
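
`nn.Embedding` is a lookup table: indexing it with a tensor of word indices returns one row of weights per index. The sketch below checks the resulting shapes on made-up sizes and shows how the context vectors can be combined into a single vector of length `embed_size`. The sizes and index values are assumptions chosen only for this check.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for this shape check.
vocab_size, embed_size = 10, 3
embedding = nn.Embedding(vocab_size, embed_size)

# A single context of four word indices, like one entry of `data`.
context = torch.tensor([1, 4, 2, 7])
vectors = embedding(context)           # one row per context word
print(vectors.shape)                   # torch.Size([4, 3])

# Summing over the context-word axis leaves one vector of size embed_size.
summed = vectors.sum(dim=-2)
print(summed.shape)                    # torch.Size([3])

# The same lookup also accepts a batch of contexts.
batch = torch.tensor([[1, 4, 2, 7], [0, 3, 5, 9]])
batch_summed = embedding(batch).sum(dim=-2)
print(batch_summed.shape)              # torch.Size([2, 3])
```

Using `dim=-2` rather than a fixed positive axis keeps the reduction on the context words whether or not a batch dimension is present.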

In [154]:
# Hyperparameters (defined before the model that uses them)
vocab_size = len(vocab)
embed_size = 4
learning_rate = 0.1
epochs = 200

In [150]:
# Initialize CBOW model
cbow_model = CBOWModel(vocab_size, embed_size)

In [155]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(cbow_model.parameters(), lr=learning_rate)

In [156]:
for epoch in range(epochs):
    total_loss = 0
    for context, target in data:
        optimizer.zero_grad()
        output = cbow_model(context)
        loss = criterion(output.unsqueeze(0), target.unsqueeze(0))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if epoch%10 == 0:
        print(f"Epoch {epoch + 1}, Loss: {total_loss}")

Epoch 1, Loss: 2122.248557866318
Epoch 11, Loss: 2130.435435579122
Epoch 21, Loss: 2119.178083180952
Epoch 31, Loss: 2022.4145251192967
Epoch 41, Loss: 2274.9433048166056
Epoch 51, Loss: 2361.6241586204856
Epoch 61, Loss: 2239.9553883429326
Epoch 71, Loss: 2234.2701226471363
Epoch 81, Loss: 2400.5961013270885


KeyboardInterrupt: 
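
The loop above updates the weights once per (context, target) pair. One possible alternative, sketched here on synthetic data rather than the notebook's corpus, is to stack the pairs into mini-batches with `DataLoader`. The class name `BatchedCBOW`, the random data, the batch size, and the learning rate are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for `data`: 64 random (context, target) pairs.
vocab_size, embed_size = 20, 4
contexts = torch.randint(0, vocab_size, (64, 4))
targets = torch.randint(0, vocab_size, (64,))
loader = DataLoader(TensorDataset(contexts, targets), batch_size=16, shuffle=True)


class BatchedCBOW(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        self.linear = nn.Linear(embed_size, vocab_size)

    def forward(self, context):
        # (batch, 4) -> (batch, 4, embed_size) -> (batch, embed_size)
        return self.linear(self.embeddings(context).sum(dim=1))


model = BatchedCBOW(vocab_size, embed_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.05)

epoch_losses = []
for epoch in range(5):
    total_loss = 0.0
    for batch_context, batch_target in loader:
        optimizer.zero_grad()
        logits = model(batch_context)            # (batch, vocab_size)
        loss = criterion(logits, batch_target)   # averaged over the batch
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    epoch_losses.append(total_loss / len(loader))  # mean loss per batch
    print(f"Epoch {epoch + 1}, mean loss: {epoch_losses[-1]:.4f}")
```

Reporting the mean loss per batch instead of the summed loss makes the numbers comparable across datasets of different sizes.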

In [83]:
# Example usage: get the embedding for a specific word
def embeddings(word):
    word_index = vocab[word]
    embedding = cbow_model.embeddings(torch.tensor([word_index]))
    print(f"Embedding for '{word}': {embedding.detach()}")
    return embedding.detach()

In [93]:
m = embeddings("morocco")

Embedding for 'morocco': tensor([[ 0.8026,  0.1219, -0.3515, -0.0969]])


In [94]:
k = embeddings("countri")

Embedding for 'countri': tensor([[-2.0423, -0.4131, -2.2740, -2.7473]])


In [95]:
torch.matmul(m, k.t())

tensor([[-0.6239]])
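
The raw dot product grows with the length of the vectors, so it is hard to compare across word pairs. A common complement is cosine similarity, which divides the dot product by both norms and always lies between -1 and 1. The sketch below recomputes both quantities from the two printed vectors above, copied in by hand so the block runs without the trained model.

```python
import torch
import torch.nn.functional as F

# Values copied from the printed embeddings for 'morocco' and 'countri'.
m = torch.tensor([[0.8026, 0.1219, -0.3515, -0.0969]])
k = torch.tensor([[-2.0423, -0.4131, -2.2740, -2.7473]])

# Unnormalized dot product, the same quantity as torch.matmul(m, k.t()).
dot = torch.matmul(m, k.t()).item()

# Cosine similarity: dot product divided by the product of the two norms.
cosine = F.cosine_similarity(m, k, dim=1).item()

print(f"dot product: {dot:.4f}")
print(f"cosine similarity: {cosine:.4f}")
```

A negative value means the two vectors point in broadly opposite directions in the four-dimensional embedding space.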

In [78]:
list(filter(lambda word: word.startswith("c"), vocab.keys()))

['coast',
 'court',
 'citi',
 'call',
 'chamber',
 'control',
 'casablanca',
 'centuri',
 'consid',
 'countri',
 'ceasefir',
 'claim',
 'councillor',
 'capit',
 'constitut',
 'caliph',
 'consult',
 'contrast',
 'ceuta',
 'cultur',
 'continu',
 'cede',
 'coloni',
 'commerci']

#### References: 

- **`Word Embedding in NLP`** [https://www.geeksforgeeks.org/word-embeddings-in-nlp/](https://www.geeksforgeeks.org/word-embeddings-in-nlp/)