1.Submit a Google Colab notebook containing your completed code and experimentation results.

2.Include comments and explanations in your code to help understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

SOURCE_DIR = '/content/Q3_data.csv'

MessageError: Error: credential propagation was unsuccessful

In [None]:
import torch
import re
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
import math
from gensim.models import Word2Vec

In [None]:
def delete_hashtag_usernames(text):
  try:
    result = []
    for word in text.split():
      if word[0] not in ['@', '#']:
        result.append(word)
    return ' '.join(result)
  except:
    return ''

def delete_url(text):
  text = re.sub(r'http\S+', '', text)
  return text

def delete_ex(text):
  text = re.sub(r'\u200c', '', text)
  return text

# 0. Data preprocessing

In [None]:
!pip install json-lines

Collecting json-lines
  Downloading json_lines-0.5.0-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: json-lines
Successfully installed json-lines-0.5.0


In [None]:
import json_lines

In [None]:
# 1. extract all tweets from file and save them in memory
df = pd.read_csv('Q3_data.csv')
texts = df['Text'].tolist()
# texts[0]

# 2. remove urls, hashtags and usernames. use the prepared functions
for i in range(len(texts)):
    texts[i] = delete_hashtag_usernames(texts[i])
    texts[i] = delete_url(texts[i])
    texts[i] = delete_ex(texts[i])

# texts[0]

# 1. Functions

## Cosine Similarity

To measure the similarity between two words, you need a way to measure the degree of similarity between two embedding vectors for the two words. Given two vectors $u$ and $v$, cosine similarity is defined as follows:

$$\text{CosineSimilarity(u, v)} = \frac {u \cdot v} {||u||_2 ||v||_2} = cos(\theta) \tag{1}$$

* $u \cdot v$ is the dot product (or inner product) of two vectors
* $||u||_2$ is the norm (or length) of the vector $u$
* $\theta$ is the angle between $u$ and $v$.
* The cosine similarity depends on the angle between $u$ and $v$.
    * If $u$ and $v$ are very similar, their cosine similarity will be close to 1.
    * If they are dissimilar, the cosine similarity will take a smaller value.

<img src="images/cosine_sim.png" style="width:800px;height:250px;">
<caption><center><font color='purple'><b>Figure 1</b>: The cosine of the angle between two vectors is a measure of their similarity.</font></center></caption>

Implement the function `cosine_similarity()` to evaluate the similarity between word vectors.

**Reminder**: The norm of $u$ is defined as $ ||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$

In [None]:
import numpy as np

def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
    numerator = np.dot(u, v)
    denominator = np.linalg.norm(u) * np.linalg.norm(v)
    return numerator / denominator


## find k nearest neighbors

In [28]:
def find_k_nearest_neighbors(word, embedding_dict, k):
  """
    implement a function to return the nearest words to an specific word based on the given dictionary

    Arguments:
        word           -- a word, string
        embedding_dict -- dictionary that maps words to their corresponding vectors
        k              -- the number of word that should be returned

    Returns:
        a list of size k consisting of the k most similar words to the given word

    Note: use the cosine_similarity function that you have implemented to calculate the similarity between words
    """
  similarity_dict = {}
  for otherword in embedding_dict:
    if otherword != word:
      similarity = cosine_similarity(embedding_dict[word],embedding_dict[otherword])
      similarity_dict[otherword] = similarity
  sorted_similarity_dict = dict(sorted(similarity_dict.items(), key=lambda item: item[1], reverse=True))
  return dict(list(sorted_similarity_dict.items())[:k])


# 2. One hot encoding

In [None]:
## one hot encoding manually
# embedding_dict = {}
# def one_hot_encode(word,vocab):
#     word_vector = [0] * len(vocab)
#     if word in vocab:
#         idx = vocab.index(word)
#         word_vector[idx] = 1
#         embedding_dict[word] = word_vector
#     return embedding_dict


# vocabulary = list(set(word for text in texts for word in text.split()))
# # one_hot_vectors = [one_hot_encode(word, vocabulary) for word in vocabulary]
# for word in vocabulary:
#     one_hot_encode(word, vocabulary)


# # embedding_dict['تلاشت']
# len(embedding_dict)

In [None]:
# 1. find one hot encoding of each word
words = [word for text in texts for word in text.split()]
vocabulary = list(set(words))
encoder = OneHotEncoder()
word_vector = encoder.fit_transform(np.array(vocabulary).reshape(-1, 1)).toarray()
embedding_dict = {word: word_vector[i] for i, word in enumerate(vocabulary)}
len(embedding_dict)

32115

In [None]:
# 2. find 10 nearest words from "آزادی"

# vocabulary.index('آزادی')
# vocabulary[19379]

one_hot_nearest = find_k_nearest_neighbors('آزادی',embedding_dict,10)
one_hot_nearest


{'خورا': 0.0,
 'ناقص': 0.0,
 'احساسی': 0.0,
 'بوى': 0.0,
 'رفته!': 0.0,
 'نتوانستیم': 0.0,
 'نشیم': 0.0,
 'بفرمایید....': 0.0,
 'برچیدن': 0.0,
 'زور': 0.0}

**Advantage:**<br> 1) simple<br> 2) no implied ordering<br>


**Disadvantage:**<br> 1) huge vectors<br> 2) no embedded meaning<br> 3) large execution <br>


##### Analysis

All of the values are 0 . because we use cosine similarity.
in the numerator of cosine similarity we use dot function and arrays will multiple into each other index-wise.
since we use one hot encoding and each word has only one vector it is completely obvious that dot operation result will be 0. and we can see from the result in the last cell,all of the values of dictionary are 0.
for example
arr1 = [0,0,0,1]
arr2 = [1,0,0,0]
dot result = sum([0,0,0,0]) = 0
denumerator is always 1 * 1 for two words
=> cosine similarity = 0

# 3. TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import random

# 1. Find the TF-IDF of all tweets
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts).toarray()

# 2. Choose one tweet randomly
chosen_tweet_index = random.randint(0, len(texts) - 1)
chosen_tweet = tfidf_matrix[chosen_tweet_index]
# print("Chosen Tweet:", chosen_tweet)

# 3. Find 10 nearest texts from the chosen tweet
cosine_similarities = [cosine_similarity(chosen_tweet, tfidf_vector) for tfidf_vector in tfidf_matrix]
sorted_similarities = sorted(range(len(cosine_similarities)), key=lambda i: cosine_similarities[i], reverse=True)
nearest_tweets = [texts[i] for i in sorted_similarities[1:11]]
nearest_tweets

  return numerator / denominator


['هشتگ یادت نره',
 'منم از دیروز نت نداشتم الان تازه اومده',
 'هشتگ یادت نره دیگه 🦋',
 'این همه لشکر اومده - عنش دیگه در اومده',
 'هشتگ یادت نره لطفا',
 'هشتگ یادت نره خواهرم🌻',
 'یادت نره هشتگ بزن',
 'هشتگ یادت رفت خواهر',
 'هشتگ انگلیسی یادت نره',
 'یادت جاودانه است']

### TF-IDF Manually code
i commented this the above code is better.

In [None]:
## TF-IDF code manually
# import random
# import math

# # Function to calculate TF
# def calculate_tf(tweet, word):
#     words_in_tweet = tweet.split()
#     word_count = words_in_tweet.count(word)
#     return word_count / len(words_in_tweet)

# # Function to calculate IDF
# def calculate_idf(word, tweets):
#     N = len(tweets)
#     word_count = sum(1 for tweet in tweets if word in tweet)
#     return math.log10(N / (word_count + 1))

# # Function to calculate TF-IDF
# def calculate_tfidf(tweet, word, tweets):
#     tf = calculate_tf(tweet, word)
#     idf = calculate_idf(word, tweets)
#     return tf * idf

# # Function to calculate cosine similarity
# def cosine_similarity(tweet1, tweet2):
#     dot_product = sum(a * b for a, b in zip(tweet1, tweet2))
#     magnitude_tweet1 = math.sqrt(sum(a ** 2 for a in tweet1))
#     magnitude_tweet2 = math.sqrt(sum(b ** 2 for b in tweet2))
#     return dot_product / (magnitude_tweet1 * magnitude_tweet2)

# # 1. Find the TF-IDF of all tweets
# def calculate_tfidf_matrix(texts):
#     tfidf_matrix = []
#     for tweet in texts:
#         tfidf_vector = []
#         for word in unique_words:
#             tfidf_vector.append(calculate_tfidf(tweet, word, texts))
#         tfidf_matrix.append(tfidf_vector)
#     return tfidf_matrix

# # Sample tweets
# # 2. Choose one tweet randomly
# chosen_tweet_index = random.randint(0, len(texts) - 1)
# chosen_tweet = texts[chosen_tweet_index]
# print("Chosen Tweet:", chosen_tweet)

# # Get unique words from all tweets
# unique_words = set(word for tweet in texts for word in tweet.split())

# # 1. Calculate TF-IDF matrix
# tfidf_matrix = calculate_tfidf_matrix(texts)

# # Convert chosen tweet to TF-IDF vector
# chosen_tweet_index = texts.index(chosen_tweet)
# chosen_tweet_vector = tfidf_matrix[chosen_tweet_index]

# # 3. Find 10 nearest texts from the chosen tweet
# cosine_similarities = [cosine_similarity(chosen_tweet_vector, tfidf_vector) for tfidf_vector in tfidf_matrix]
# sorted_similarities_indices = sorted(range(len(cosine_similarities)), key=lambda i: cosine_similarities[i], reverse=True)
# nearest_tweets = [texts[i] for i in sorted_similarities_indices[1:11]]
# nearest_tweets


##### Describe advantages and disadvantages of TF-IDF

**Advatages:** <br>
Easy to Calculate<br>Identifies Important Terms<br>Contextual Relevance<br>
Effective for Information Retrieval


**Disadvantages:**<br>
Lack of Semantic Understanding<br>Does Not Account for Polysemy<br>Lack of Context and Word Order<br>No Distinction Between Different Types of Documents<br>
Bias Towards Rare Terms<br>Poor Performance with Short Texts<br>...


# 4. Word2Vec

In [29]:
# 1. train a word2vec model base on all tweets
import nltk
nltk.download('punkt')
tokenized_texts = [nltk.word_tokenize(tweet) for tweet in texts]
model = Word2Vec(tokenized_texts, vector_size=100, window=5, min_count=1, workers=4)

# 2. find 10 nearest words from "آزادی"
# nearest_words = model.wv.most_similar("آزادی", topn=10)
embedding_dict = {word: model.wv[word] for word in model.wv.index_to_key}
nearest_words = find_k_nearest_neighbors("آزادی", embedding_dict, 10)
print("10 Nearest words from 'آزادی':", nearest_words)

## my test
# u = model.wv['آزادی']
# v = model.wv['میهن']
# cosine_similarity(u,v) == 0.980075



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


10 Nearest words from 'آزادی': {'ازادی': 0.99586415, 'زن،': 0.9919636, 'زن': 0.98851025, 'فردای': 0.98805386, 'زندگی،': 0.9877055, 'عدالت': 0.9866961, 'آزادی،': 0.9862849, 'زندگی': 0.98521626, 'وطنم': 0.9848946, 'مرد،': 0.98399013}


##### Describe advantages and disadvantages of Word2Vec

**Advantages:**<br>
this algorithm can find words which are semantically close as we can see in previous cell output.<br>
Word2Vec reduces the dimensionality of word representations. Compared to one-hot encoded vectors, which can be extremely large, Word2Vec provides compact and efficient representations

**Disadvantages:**<br>
it only considers local context but GLoVe consider a global window and contexts.<br>
it is not good for OOV.<br>
it cannot handle polysemy words correctly.<br>
Scaling to new languages requires new embedding matrices.<br>
it cannot recognize same words but different shape like:<br>
ازادی or ،آزادی
<br>in general : Context Window Limitation , Lack of Subword Information: , Out-of-Vocabulary Words , Fixed Embeddings


#### Analysis

this algorithm is very effective for finding nearest words.but as we can see there are some little mistakes as i mentioned earlier in disadvantages part.


---



# 5. Contextualized embedding

In [None]:
!pip install transformers[sentencepiece]



In [None]:
# Load model and tokenizer

from transformers import BertModel, BertTokenizer, BertForMaskedLM, Trainer, TrainingArguments
from transformers import AutoModelForSequenceClassification
import torch

model_name = "HooshvareLab/bert-base-parsbert-uncased"
task_name = "masked-language-modeling"

In [None]:
# Load tokenizer and masked language model (for fine-tuning)
tokenizer = TFBertTokenizer.from_pretrained(model_name)
model = TFBertForMaskedLM.from_pretrained(model_name)

# Function to create masked inputs (consider different masking strategies)
def create_masked_inputs(text, tokenizer, masking_probability=0.15):
  inputs = tokenizer(text, add_special_tokens=True, return_tensors="tf")
  # Randomly mask words with a specific probability
  for i in range(len(inputs["input_ids"][0])):
    if np.random.rand() < masking_probability:
      inputs["input_ids"][0, i] = tokenizer.mask_token_id
  return inputs

# Prepare masked training data (consider batching for large datasets)
masked_inputs = [create_masked_inputs(tweet, tokenizer) for tweet in texts]

# Define optimizer and loss function (adapted for TensorFlow)
optimizer = AdamW(learning_rate=2e-5)  # Adjust learning rate as needed
loss_fn = model.compiled_loss

# Early stopping to prevent overfitting (optional)
early_stopping = EarlyStopping(monitor="val_loss", patience=3)

# Model checkpoint to save the best model (optional)
model_checkpoint = ModelCheckpoint(
    filepath="./best_model.h5", monitor="val_loss", save_best_only=True
)

# Fine-tuning setup for masked language modeling
model.compile(optimizer=optimizer, loss=loss_fn)  # Compile model for training

# Train the model (consider using validation data if available)
model.fit(
    masked_inputs,
    epochs=3,  # Adjust training epochs
    validation_split=0.1,  # Consider using a validation set
    callbacks=[early_stopping, model_checkpoint],
)

# Get contextualized embeddings for all tweets
embeddings = []
for tweet in texts:
  inputs = tokenizer(tweet, add_special_tokens=True, return_tensors="tf")
  with tf.GradientTape() as tape:
    outputs = model(inputs)
    loss = loss_fn(inputs["labels"], outputs.logits)  # Access logits for embeddings
  # Extract last hidden state from the CLS token (consider averaging for all tokens)
  embedding = outputs.pooler_output.numpy()
  embeddings.append(embedding)

# Find nearest words to "آزادی"
query_word = "آزادی"
query_embedding = None
word_embeddings = tokenizer.get_vocab().keys()

# Calculate cosine similarities for all tweets
all_cosine_similarities = []
for embedding in embeddings:
  query_inputs = tokenizer(query_word, add_special_tokens=True, return_tensors="tf")
  with tf.GradientTape() as tape:
    query_outputs = model(query_inputs)
  query_embedding = query_outputs.pooler_output.numpy()
  cosine_similarities = np.dot(embedding, query_embedding) / (
      np.linalg.norm(embedding) * np.linalg.norm(query_embedding)
  )
  all_cosine_similarities.append(cosine_similarities)

# Find top 10 words for each tweet (loop to find top 10 across all)
for tweet, cosine_similarity in zip(texts, all_cosine_similarities):
  # Sort indices of cosine similarities in descending order (most similar first)
  sorted_indices = np.argsort(cosine_similarity)[::-1][:10]  # Top 10 most similar
  top_10_words = [word_embeddings[i] for i in sorted_indices]
  print("10 Nearest Words from 'آزادی':", top_10_words)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


10 Nearest Words from 'آزادی': ['آزادی ایرانم❤️🤞🏼', 'آزادی 🖤', 'ازادی 💚❤🕊', 'بای آزادی', 'رهایی', 'آزادی۲۲', 'آزادی…', 'آزادی قشنگه', 'آزادیی', 'آزادییی']


In [None]:
### pretrain output without fine-tuning

# 1. fine-tune the model base on all tweets
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Get word embeddings
word_embeddings = {}
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    outputs = model(**inputs)
    word_embedding = outputs.last_hidden_state.mean(dim=1).squeeze().detach().numpy()
    word_embeddings[text] = word_embedding

# Find 10 nearest words from "آزادی"
query = "آزادی"
query_embedding = model(**tokenizer(query, return_tensors="pt", padding=True, truncation=True)).last_hidden_state.mean(dim=1).squeeze().detach().numpy()
nearest_words = find_k_nearest_neighbors(query_embedding, word_embeddings, 10)

print("10 Nearest Words from 'آزادی':", nearest_words)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


10 Nearest Words from 'آزادی': ['آزادی ایرانم❤️🤞🏼', 'آزادی 🖤', 'ازادی 💚❤🕊', 'بای آزادی', 'رهایی', 'آزادی۲۲', 'آزادی…', 'آزادی قشنگه', 'آزادیی', 'آزادییی']


##### Describe advantages and disadvantages of Contextualized embedding

**Advantages:** <br>
Contextual embeddings capture context-dependent meanings by considering the surrounding words in a sentence.<br>
 This enables them to represent nuances and polysemy more effectively.<br>
Contextual embeddings can be fine-tuned for specific downstream tasks, leading to improved performance on tasks like sentiment analysis, question answering, and named entity recognition.<br>

Pre-trained contextual embeddings can be transferred to various tasks, reducing the need for extensive task-specific labeled data.<br>
Contextual embeddings tend to have better semantic similarity scores, making them useful for information retrieval and search applications.<br>


**Disadvantages:**<br>
Training and using contextual embeddings, especially large models like BERT, require significant computational resources and memory.<br>
Fine-tuning contextual embeddings demands labeled data, which can be expensive and time-consuming to obtain.<br>
Contextual embeddings are often considered black boxes, making it challenging to understand why they produce specific results.<br>
Contextual embeddings may not perform well out of the box for specialized domains with limited training data.<br>
Contextual embeddings can be sensitive to small input perturbations, affecting their robustness.<br>
