1.Submit a Google Colab notebook containing your completed code and experimentation results.

2.Include comments and explanations in your code to help understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
SOURCE_DIR = '/content/gdrive/MyDrive/NLP/Q3_data.csv'

In [3]:
import torch
import re
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
import math
from gensim.models import Word2Vec

In [4]:
def delete_hashtag_usernames(text):
  try:
    result = []
    for word in text.split():
      if word[0] not in ['@', '#']:
        result.append(word)
    return ' '.join(result)
  except:
    return ''

def delete_url(text):
  text = re.sub(r'http\S+', '', text)
  return text

def delete_ex(text):
  text = re.sub(r'\u200c', '', text)
  return text

# 0. Data preprocessing

In [6]:
!pip install json-lines
!pip install pyarabic

Collecting json-lines
  Downloading json_lines-0.5.0-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: json-lines
Successfully installed json-lines-0.5.0
Collecting pyarabic
  Downloading PyArabic-0.6.15-py3-none-any.whl (126 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m126.4/126.4 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: pyarabic
Successfully installed pyarabic-0.6.15


In [7]:
import json_lines
import pyarabic.araby as araby

In [8]:
# 1. extract all tweets from file and save them in memory
# 2. remove urls, hashtags and usernames. use the prepared functions
def preprocess(text):
    text = delete_hashtag_usernames(text)
    text = delete_url(text)
    text = delete_ex(text)

    # Remove Punctuations
    punct = ':؛؟!،»«><.,;:"\'!?/'
    text = text.translate(str.maketrans(punct, ' '*len(punct)))

    # Remove آ and ي
    text = text.replace('آ', 'ا')
    text = text.replace('ي', 'ی')
    text = text.replace('ك', 'ک')

    # Remove arabic diacritics
    text = araby.strip_diacritics(text)

    # Remove numbers
    # text = text.translate(str.maketrans('', '', '۱۲۳۴۵۶۷۸۹۰1234567890١٢٣٤٥٦٧٨٩٠'))

    return text


import pandas as pd

df = pd.read_csv(SOURCE_DIR)['Text']
df = df.map(preprocess)
df

0                  بنشین تا شود نقش فال ما نقش هم فردا شدن
1        این گوزو رو کی گردن میگیره   دچار زوال عقل شده...
2                                   برای ایران  برای مهسا 
3                                          مرگ بر دیکتاتور
4                               نذاریم خونشون پایمال شه   
                               ...                        
19995                                     برای ایران بانو 
19996        از بس حاج خانم دراز نشده واسش عقده دراز داره😅
19997    به افتخار از بین رفتن جمهوری اسلامی🙆‍♂️🙆‍♂️🙆‍♂...
19998                                          پنجاه و شیش
19999    در محیط طوفانزای ماهرانه در جنگ است ناخدای است...
Name: Text, Length: 20000, dtype: object

# 1. Functions

## Cosine Similarity

To measure the similarity between two words, you need a way to measure the degree of similarity between two embedding vectors for the two words. Given two vectors $u$ and $v$, cosine similarity is defined as follows:

$$\text{CosineSimilarity(u, v)} = \frac {u \cdot v} {||u||_2 ||v||_2} = cos(\theta) \tag{1}$$

* $u \cdot v$ is the dot product (or inner product) of two vectors
* $||u||_2$ is the norm (or length) of the vector $u$
* $\theta$ is the angle between $u$ and $v$.
* The cosine similarity depends on the angle between $u$ and $v$.
    * If $u$ and $v$ are very similar, their cosine similarity will be close to 1.
    * If they are dissimilar, the cosine similarity will take a smaller value.

<img src="images/cosine_sim.png" style="width:800px;height:250px;">
<caption><center><font color='purple'><b>Figure 1</b>: The cosine of the angle between two vectors is a measure of their similarity.</font></center></caption>

Implement the function `cosine_similarity()` to evaluate the similarity between word vectors.

**Reminder**: The norm of $u$ is defined as $ ||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$

In [9]:
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
    denominator = np.linalg.norm(u) * np.linalg.norm(v)
    numinator = np.dot(u, v)
    return numinator/denominator

## find k nearest neighbors

In [10]:
def find_k_nearest_neighbors(word, embedding_dict, k):
    """
    implement a function to return the nearest words to an specific word based on the given dictionary

    Arguments:
        word           -- a word, string
        embedding_dict -- dictionary that maps words to their corresponding vectors
        k              -- the number of word that should be returned

    Returns:
        a list of size k consisting of the k most similar words to the given word

    Note: use the cosine_similarity function that you have implemented to calculate the similarity between words
    """
    word = preprocess(word)

    similarities = dict()
    word_emb = embedding_dict[word]

    for w in embedding_dict:
        if w != word:
            w_emb = embedding_dict[w]
            if not np.any(w_emb):
                continue
            similarity = cosine_similarity(w_emb, word_emb)
            similarities[w] = similarity

    return sorted(similarities.items(), key=lambda x:x[1], reverse=True)[:k]

# 2. One hot encoding

In [None]:
def find_words(texts:pd.DataFrame):
    words = []
    for text in texts:
        for word in text.split():
            words.append(word)
    words = list(set(words))
    return words

def build_one_hot_vocabulary(words:list):
    encoder = OneHotEncoder(handle_unknown='ignore')
    one_hot = encoder.fit_transform(np.array(words).reshape(-1, 1)).toarray()

    vocabulary = dict(zip(words, one_hot))
    return vocabulary

In [None]:
# 1. find one hot encoding of each word
# 2. find 10 nearest words from "آزادی"

# Extract words and build a vocabulary
words = find_words(df)
vocabulary = build_one_hot_vocabulary(words)
print('One-Hot Encoding Results:')
print(''.join([f'\n\t{v[0]} - {v[1]}'for v in find_k_nearest_neighbors('آزادی', vocabulary, 10)]))

One-Hot Encoding Results:

	حکمت - 0.0
	جسور - 0.0
	پلتفرمی - 0.0
	بزرگمه - 0.0
	(امام - 0.0
	مرج - 0.0
	(خودش - 0.0
	مامانایی - 0.0
	چهارده🖤 - 0.0
	زدنا - 0.0


### Describe advantages and disadvantages of one-hot encoding


---


#### Advantages

*   Binary Representation: One-hot encoding transforms categorical variables into a binary vector representation, which is beneficial for models that require numerical input.
*   Orthogonal Vector Space: The resulting vectors are orthogonal to each other, meaning that each category is independent and equidistant from others, which can be useful for certain types of analyses.
*   Fast Training: Acutally this technique doesn't need any training at all and can be easily calculated by constructing a one-hot vector with size same as the vocabulary size.


---


#### Disadvantages
*   Dimensionality: One-hot encoding can significantly increase the dimensionality of the dataset, as it creates a separate column for each category. This can lead to a sparse matrix and increase the computational cost.
*   Lack of Semantic Information: One-hot vectors do not capture any semantic relationships between words since each word is represented as an independent entity with no shared features. **So using this technique we can not find any related word to any arbitrary word inside out vocabulary.**
*   Multi-Collinearity: The addition of dummy variables for each unique category can lead to multi-collinearity, which can affect the performance of certain machine learning algorithms.

# 3. TF-IDF

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

In [12]:
# Find TF-IDF of all words in tweets
corpus = df.values
X = vectorizer.fit_transform(corpus)
print(len(vectorizer.get_feature_names_out()))
print(X.toarray().shape) # -> first axis is corresponding vectors to each document

22749
(20000, 22749)


In [50]:
index = np.random.randint(len(corpus))
text = corpus[index]
text

'بریم برای ۸۰ میلیون'

In [56]:
similarities = dict()
tweet_emb = X.toarray()[index]

for i, t_emb in enumerate(X.toarray()):
    if i != index:
        similarity = cosine_similarity(t_emb, tweet_emb)
        similarities[i] = similarity

sorted_similarities = sorted(similarities.items(), key=lambda x:x[1], reverse=True)[:10]

print('TF-IDF Results:')
print(''.join([f'\n\ttweet: \n\t\t{corpus[k]}\n\tsimilarity:\t{v}\n'for k,v in sorted_similarities]))

  return numinator/denominator


TF-IDF Results:

	tweet: 
		بریم واسه ۸۰ میلیون
	similarity:	0.8794665120857836

	tweet: 
		تا ۸۰ میلیون
	similarity:	0.7389837580041799

	tweet: 
		بریم برای سی میلیون
	similarity:	0.6614061626750124

	tweet: 
		میریم برای ۸۰ میلیون
	similarity:	0.6483774893610668

	tweet: 
		بریم تا ۸۰ میل
	similarity:	0.5848430828366494

	tweet: 
		بریم
	similarity:	0.5616340562850123

	tweet: 
		با کیا شدیم ۸۰ میلیون نفر
	similarity:	0.46285040884170664

	tweet: 
		میریم واسه ۵۰و ۸۰ میلیون
	similarity:	0.4478971320148744

	tweet: 
		اخ کاش برسونیمش به ۸۰ میلیون 
	similarity:	0.44341167734096393

	tweet: 
		بریم چهار
	similarity:	0.4200767619165326



### Describe advantages and disadvantages of TF-IDF


---


#### Advantages
*    Balancing Frequency and Importance: TF-IDF balances the term frequency (how often a word appears in a document) and its inverse document frequency (how rare a word is across a set of documents) to provide a more meaningful representation of word importance.
*    Improved Model Performance: By assigning higher weights to important words and lower weights to less important words, TF-IDF allows machine learning models to focus on the most relevant features, often resulting in improved model accuracy and performance.
*    Efficient Vectorization: TF-IDF vectorization involves calculating the TF-IDF score for every word in a corpus relative to a document, which can be used to create vectors that represent the text in a format suitable for machine learning algorithms.
*    Useful for Information Retrieval: TF-IDF is commonly used in search engines and information retrieval systems to rank documents based on their relevance to a user's query.
*    Simple and Easy to Understand: The math behind TF-IDF is straightforward, making it easy to understand and use.


---


#### Disadvantages
*    Lack of Semantic Meaning: TF-IDF does not capture the semantic meaning of words. It is primarily concerned with indicating the importance of words in a document, rather than their relationships or context.
*    Rare Words and Long Documents: TF-IDF can struggle with rare words and long documents, as it may not accurately reflect the importance of such words in the context of the entire corpus.
*    Limited for Complex NLP Tasks: While TF-IDF is useful for text classification, sentiment analysis, and keyword extraction, it may not be the best choice for tasks that require capturing the semantic relationship between words, such as word embeddings.
*    Clustering Limitations: TF-IDF can cluster documents that are keyword similar, making it less suitable for identifying documents that discuss the same topic but use different keywords.

# 4. Word2Vec

In [None]:
def tokenize_text(text):
    return [word.strip() for word in text.split()]

In [None]:
def build_w2v_vocabulary(model):
    vocabulary = dict()
    for word in model.wv.index_to_key:
        vocabulary[word] = model.wv[word]
    return vocabulary

In [None]:
# 1. train a word2vec model base on all tweets
# 2. find 10 nearest words from "آزادی"
tokenized_text = df.map(tokenize_text)

for vec_size in [1000, 5000, 10000, 15000, 20000]:
    print(f'With vector_size={vec_size}: ')
    word2vec_model = Word2Vec(sentences=tokenized_text.to_numpy(), seed=31, workers=1, vector_size=vec_size)
    vocabulary = build_w2v_vocabulary(word2vec_model)
    values = find_k_nearest_neighbors('آزادی', vocabulary, 10)
    print(''.join([f'\n\t{v[0]} - {v[1]}'for v in values]))
    print()

With vector_size=1000: 

	ابادی - 0.9951343536376953
	امید - 0.9943196773529053
	خواهرم - 0.9888500571250916
	استقامت - 0.986889660358429
	ایران - 0.9853800535202026
	امینی - 0.9832085371017456
	میهن - 0.9830191731452942
	مهسا - 0.9823393225669861
	برای - 0.9816274046897888
	مهسا🖤 - 0.9810211658477783

With vector_size=5000: 

	امید - 0.99808669090271
	مهسا🖤 - 0.9941138625144958
	ابادی - 0.993466317653656
	مهسا - 0.9934418201446533
	برای - 0.9930067658424377
	خواهرم - 0.9923779964447021
	ایران - 0.9888638854026794
	امینی - 0.9885563254356384
	میهن - 0.9829726815223694
	ازادی🤍 - 0.9819926023483276

With vector_size=10000: 

	امید - 0.9983457922935486
	مهسا - 0.9955490231513977
	برای - 0.9950383305549622
	مهسا🖤 - 0.9930261969566345
	ابادی - 0.9918341636657715
	خواهرم - 0.9915587902069092
	ایرانم🖤 - 0.9894627332687378
	ایران - 0.988168478012085
	امینی - 0.9870553612709045
	ازادی🤍 - 0.9845814108848572

With vector_size=15000: 

	امید - 0.9982794523239136
	مهسا - 0.996525228023529
	برای - 0

### Describe advantages and disadvantages of Word2Vec


---


#### Advantages
*   Semantic Meaning: Word2Vec captures the semantic meaning of words by placing semantically similar words close together in the vector space. As you can see from above results this method result well in task of finding similar words.
*   Efficiency: It can be more efficient than one-hot encoding as it results in lower-dimensional vectors. So the computation of similarity or any other needed computation decreases as we observed above in action.


---


#### Disadvantages
*   OOV Words: Word2Vec cannot handle unknown or out-of-vocabulary (OOV) words. If a word has not been encountered before, the model will not know how to interpret it. However there have been some extensions and updates for this technique such as fastText algorithm.
*   No Sub-Word Information: Word2Vec does not account for shared representations at sub-word levels, which can be a limitation for understanding morphologically similar words or in languages with rich morphology.
*   Scaling and Cross-Lingual Limitations: Scaling Word2Vec to new languages requires creating new embedding matrices, and it does not allow for cross-lingual parameter sharing.

# 5. Contextualized embedding

In [None]:
!pip install -Uq accelerate transformers[sentencepiece] datasets

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m290.1/290.1 kB[0m [31m7.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.8/8.8 MB[0m [31m70.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m510.5/510.5 kB[0m [31m49.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m16.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m27.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m21.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m23.7/23.7 MB[0m [31m71.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m823.6/823.6 kB[0m [31m64.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━

In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments, Trainer
from datasets import Dataset, DatasetDict

model_name = "HooshvareLab/bert-base-parsbert-uncased"
model_path = 'persian-tweets-embeddings'

In [None]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
# Load model and tokenizer
bert_model = AutoModelForMaskedLM.from_pretrained(model_name)
bert_tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at HooshvareLab/bert-base-parsbert-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
dataset = Dataset.from_pandas(pd.DataFrame(df))
train_dataset, eval_dataset = dataset.train_test_split(test_size=0.05).values()
dataset_dict = DatasetDict({"train":train_dataset,"eval":eval_dataset})

In [None]:
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['Text'],
        num_rows: 19000
    })
    eval: Dataset({
        features: ['Text'],
        num_rows: 1000
    })
})

In [None]:
def preprocess_function(examples):
    return bert_tokenizer([" ".join(x) for x in examples["Text"]])

tokenized_datasets = dataset_dict.map(preprocess_function, batched=True, remove_columns='Text')

Map:   0%|          | 0/19000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 19000
    })
    eval: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
})

In [None]:
block_size = 128

In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result

In [None]:
lm_dataset = tokenized_datasets.map(group_texts, batched=True)
lm_dataset

Map:   0%|          | 0/19000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 6141
    })
    eval: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 323
    })
})

In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer=bert_tokenizer, mlm=True, mlm_probability=0.2)

In [None]:
training_args = TrainingArguments(
    output_dir=model_path,          # output directory to where save model checkpoint
    evaluation_strategy="epoch",
    save_strategy='epoch',
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,  # whether to load the best model (in terms of loss) at the end of training
    push_to_hub=True,
)

In [None]:
trainer = Trainer(
    model=bert_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=lm_dataset['train'],
    eval_dataset=lm_dataset['eval'],
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,2.362,2.042575


Epoch,Training Loss,Validation Loss
1,2.362,2.042575
2,1.9956,1.956014
3,1.9262,1.841535


There were missing keys in the checkpoint model loaded: ['cls.predictions.decoder.weight', 'cls.predictions.decoder.bias'].


TrainOutput(global_step=2304, training_loss=2.0717985894944935, metrics={'train_runtime': 897.5416, 'train_samples_per_second': 20.526, 'train_steps_per_second': 2.567, 'total_flos': 1213238601326592.0, 'train_loss': 2.0717985894944935, 'epoch': 3.0})

In [None]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/kamyar-mroadian/persian-tweets-embeddings/commit/4087e25ff1b1da74bb4b890900eeda199ac00aa7', commit_message='End of training', commit_description='', oid='4087e25ff1b1da74bb4b890900eeda199ac00aa7', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
model_name = 'kamyar-mroadian/persian-tweets-embeddings'
tokenizer_name = "HooshvareLab/bert-base-parsbert-uncased"
bert_model = AutoModelForMaskedLM.from_pretrained(model_name, output_hidden_states=True)
bert_tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

In [None]:
from transformers import pipeline
from tqdm import tqdm
bert_pipeline = pipeline('feature-extraction', model=bert_model, tokenizer=bert_tokenizer, device=0)

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
bert_model.to(device)

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(100000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_

In [None]:
from tqdm import tqdm

def build_bert_vocabulary(words:list):
    tokens = [bert_tokenizer.tokenize(word) for word in words]

    # Convert tokens to input IDs
    input_ids = [bert_tokenizer.convert_tokens_to_ids(token) for token in tokens]

    # Move the model to the device
    bert_model.to(device)

    # Generate embeddings for each set of input IDs
    embeddings = []
    i = 0
    for ids in tqdm(input_ids):
        inputs = torch.tensor(ids).unsqueeze(0).to(device)  # Add batch dimension and move to device
        with torch.no_grad():
            try:
                outputs = bert_model(inputs).hidden_states[-1].to('cpu').squeeze()
            except:
                outputs = torch.zeros(768) # Handle words not tokenized well
        if outputs.dim() > 1: # Get the mean embedding for the word
            embedding = outputs.mean(dim=0).numpy()
        else:
            embedding = outputs.numpy()
        embeddings.append(embedding)

    return dict(zip(words, embeddings))

In [None]:
words = find_words(df)
vocabulary = build_bert_vocabulary(words)

100%|██████████| 25051/25051 [05:01<00:00, 82.99it/s]


In [None]:
find_k_nearest_neighbors('آزادی', vocabulary, k=10)

[('ازادى', 0.7537505),
 ('ازادیت', 0.7168002),
 ('ازادیو', 0.6991623),
 ('ازادی۳', 0.6973397),
 ('ازادیم', 0.6834783),
 ('ازادانه', 0.6719267),
 ('ازادیه', 0.6645093),
 ('٬ازادی', 0.6594013),
 ('ازادیان', 0.65564626),
 ('ازادیست', 0.6476268)]

### Describe advantages and disadvantages of Contextualized embedding


---


#### Advantages
*    Contextualized Embeddings: BERT generates word embeddings that consider the context in which a word appears, allowing it to capture subtle differences in meaning and usage that other methods may miss.
*    Finding most similar words: Another good result we can observe is that Contexualized Embeddings give greate results and can find most similar words in structure to the target word.
*    Applicability: BERT's embeddings can be used in a wide range of natural language processing tasks, such as sentiment analysis, text classification, and named entity recognition and it can perfectly result in any of these fields if pretrained well on the target context corpus.


---


#### Disadvantages
*    Time Consuming Training: Because of the fact that we need to use a neural network for this type of embeddings, so we need to train it first, and training a model with such huge number of parameters is extremely time-consuming as we observed above in training model, which took about 15 minutes to complete.
*    Computational Cost: BERT is a large and complex model that requires significant computational resources, making it less suitable for use on low-power devices or in real-time applications. As you can observe, in above examples it took about 5 minutes to create the embedding dictionary after training the model. so it takes alot to get the output of model for each of the words inside the vocabulary.
*    Limited Interpretability: The high-dimensional vectors produced by BERT can be difficult to interpret, posing challenges for explaining the behavior of models that use these embeddings.
