1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

In [4]:
# from google.colab import drive
# drive.mount('/content/gdrive')

# SOURCE_DIR = '/content/Q3_data.csv'

In [5]:
import torch
import re
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
import math
from gensim.models import Word2Vec

In [6]:
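def delete_hashtag_usernames(text):
  # Non-string values (e.g. NaN for a missing tweet) become an empty string
  if not isinstance(text, str):
    return ''
  # Drop tokens that start with '@' (usernames) or '#' (hashtags)
  return ' '.join(word for word in text.split() if word[0] not in ('@', '#'))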

def delete_url(text):
  text = re.sub(r'http\S+', '', text)
  return text

def delete_ex(text):
  # Remove zero-width non-joiner characters (U+200C)
  text = re.sub(r'\u200c', '', text)
  return text

# 0. Data preprocessing

In [7]:
!pip install json-lines



In [8]:
import json_lines

In [9]:
# 1. extract all tweets from file and save them in memory
# 2. remove urls, hashtags and usernames. use the prepared functions

# Read the CSV file
data = pd.read_csv('Q3_data.csv')
# print(data.columns)

# Apply preprocessing functions to the tweet data
data['Text'] = data['Text'].apply(delete_hashtag_usernames)
data['Text'] = data['Text'].apply(delete_url)
data['Text'] = data['Text'].apply(delete_ex)

# Print the preprocessed tweet data
# print(data['Text'])


# 1. Functions

## Cosine Similarity

To measure the similarity between two words, you need a way to measure the degree of similarity between two embedding vectors for the two words. Given two vectors $u$ and $v$, cosine similarity is defined as follows:

$$\text{CosineSimilarity}(u, v) = \frac {u \cdot v} {||u||_2 ||v||_2} = \cos(\theta) \tag{1}$$

* $u \cdot v$ is the dot product (or inner product) of two vectors
* $||u||_2$ is the norm (or length) of the vector $u$
* $\theta$ is the angle between $u$ and $v$.
* The cosine similarity depends on the angle between $u$ and $v$.
    * If $u$ and $v$ are very similar, their cosine similarity will be close to 1.
    * If they are dissimilar, the cosine similarity will take a smaller value, down to -1 for vectors pointing in opposite directions.

<img src="images/cosine_sim.png" style="width:800px;height:250px;">
<caption><center><font color='purple'><b>Figure 1</b>: The cosine of the angle between two vectors is a measure of their similarity.</font></center></caption>

Implement the function `cosine_similarity()` to evaluate the similarity between word vectors.

**Reminder**: The norm of $u$ is defined as $||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$.
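
As a quick numeric check of formula (1), the sketch below evaluates it in plain Python on three hand-picked 2-D vectors. The vectors are chosen for illustration only and are not taken from the dataset, and the helper name `cosine_similarity_plain` is hypothetical.

```python
import math

def cosine_similarity_plain(u, v):
    """Evaluate formula (1) for two equal-length sequences of numbers."""
    dot_product = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_product / (norm_u * norm_v)

same_direction = cosine_similarity_plain([3, 4], [6, 8])   # parallel vectors
orthogonal = cosine_similarity_plain([1, 0], [0, 1])       # 90 degree angle
opposite = cosine_similarity_plain([3, 4], [-3, -4])       # 180 degree angle

print(same_direction, orthogonal, opposite)  # prints 1.0 0.0 -1.0
```

The three cases cover the extremes of the range: identical direction, no overlap, and opposite direction.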

In [10]:
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """

    dot_product = np.dot(u, v)
    norm_u = np.linalg.norm(u)  # L2 norm of u
    norm_v = np.linalg.norm(v)  # L2 norm of v
    cosine_similarity = dot_product / (norm_u * norm_v)

    return cosine_similarity

## find k nearest neighbors

In [11]:
import heapq
def find_k_nearest_neighbors(word, embedding_dict, k):
  """
    Return the words nearest to a specific word, based on the given dictionary.

    Arguments:
        word           -- a word, string
        embedding_dict -- dictionary that maps words to their corresponding vectors
        k              -- the number of words that should be returned

    Returns:
        a list of size k consisting of the k most similar words to the given word

    Note: use the cosine_similarity function that you have implemented to calculate the similarity between words
    """

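  word_vector = embedding_dict[word]
  similarities = []  # min-heap holding the k most similar words seen so far

  for w, v in embedding_dict.items():
      if w != word:
          similarity = cosine_similarity(word_vector, v)
          heapq.heappush(similarities, (similarity, w))
          if len(similarities) > k:
              # Pop the smallest similarity so only the k best candidates remain
              heapq.heappop(similarities)

  # Order the surviving candidates from most to least similar
  nearest_words = [w for _, w in sorted(similarities, reverse=True)]

  return nearest_words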
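
A top-k lookup of this kind can be sanity-checked on a tiny hand-made dictionary before running it on the full vocabulary. The following is a minimal sketch, assuming invented 2-D toy embeddings and the hypothetical helper names `cosine` and `k_nearest`. It uses `heapq.nlargest` from the standard library, which returns the k largest items in descending order.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def k_nearest(word, embedding_dict, k):
    """Return the k words with the highest cosine similarity to `word`."""
    target = embedding_dict[word]
    candidates = (
        (cosine(target, vec), other)
        for other, vec in embedding_dict.items()
        if other != word
    )
    # nlargest keeps only the k best (similarity, word) pairs, best first
    return [other for _, other in heapq.nlargest(k, candidates)]

# Toy 2-D embeddings, invented for this example
toy_embeddings = {
    "king": [1.0, 0.0],
    "queen": [0.9, 0.1],
    "prince": [0.7, 0.3],
    "apple": [0.0, 1.0],
}

nearest = k_nearest("king", toy_embeddings, 2)
print(nearest)  # prints ['queen', 'prince']
```

A check like this makes ordering mistakes visible: the most similar word should come first, and the clearly unrelated word should be excluded.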

# 2. One hot encoding

In [None]:
# 1. find one hot encoding of each word

words = data['Text'].str.split().tolist()
words = [word for sublist in words for word in sublist]
words = np.array(words).reshape(-1, 1)

# Create an instance of the OneHotEncoder
encoder = OneHotEncoder()

# Fit the encoder on the list of words to determine the unique categories
encoder.fit(words)

# Transform the list of words into a one-hot encoded matrix
one_hot_matrix = encoder.transform(words).toarray()

# Print the one-hot encoded matrix
print(one_hot_matrix)





# 2. find the 10 nearest words to "آزادی"

from sklearn.decomposition import PCA
# Perform PCA on the one-hot encoding matrix
pca = PCA(n_components=100)  # Choose the desired number of components
embedding_matrix = pca.fit_transform(one_hot_matrix)

# Create the embedding dictionary
# `words` has shape (n, 1); flatten it so that each key is a hashable string
embedding_dict = {}
for i, word in enumerate(words.ravel()):
    embedding_dict[word] = embedding_matrix[i]
nearest_words = find_k_nearest_neighbors("آزادی", embedding_dict, 10)
print(nearest_words)


[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
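
One property helps when interpreting neighbour lists built from raw one-hot vectors: every pair of distinct words is orthogonal, so their cosine similarity is 0 regardless of meaning. The sketch below shows this on a three-word toy vocabulary, invented for illustration; `one_hot` and `dot` are hypothetical helpers. A PCA projection changes the coordinates but cannot be expected to add semantic information that the one-hot vectors never contained.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def dot(u, v):
    """Dot product; equals cosine similarity here because one-hot vectors have norm 1."""
    return sum(a * b for a, b in zip(u, v))

# Toy vocabulary, invented for this example
vocabulary = ["freedom", "liberty", "banana"]
vectors = {w: one_hot(i, len(vocabulary)) for i, w in enumerate(vocabulary)}

sim_related = dot(vectors["freedom"], vectors["liberty"])   # related meanings
sim_unrelated = dot(vectors["freedom"], vectors["banana"])  # unrelated meanings
sim_self = dot(vectors["freedom"], vectors["freedom"])      # same word

print(sim_related, sim_unrelated, sim_self)  # prints 0.0 0.0 1.0
```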


##### Describe advantages and disadvantages of one-hot encoding

Advantage:


Disadvantage:


# 3. TF-IDF

In [None]:
# 1. find the TF-IDF of all tweets.
# 2. choose one tweet randomly.
# 3. find the 10 nearest tweets to the chosen tweet.
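
The three steps can be sketched from scratch in plain Python to make the computation visible. This is a minimal illustration under stated assumptions, not the expected solution. It uses the unsmoothed weighting tf × log(N / df), four invented toy tweets, and the hypothetical helpers `tfidf_vectors`, `sparse_cosine`, and `nearest_documents`. The query index is fixed for reproducibility; `random.randrange` could pick it instead. On the real data, `sklearn.feature_extraction.text.TfidfVectorizer` is the usual tool; it applies a smoothed IDF by default, so its weights will differ from these.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def sparse_cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

def nearest_documents(query_index, vectors, k):
    """Return the indices of the k documents most similar to the query."""
    scored = [
        (sparse_cosine(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return [i for _, i in sorted(scored, reverse=True)[:k]]

# Toy tweets, invented for this example
toy_tweets = [
    "freedom for all people",
    "people want freedom",
    "the weather is sunny today",
    "sunny weather today",
]

vectors = tfidf_vectors(toy_tweets)
closest = nearest_documents(0, vectors, 1)
print(closest)  # prints [1]
```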

##### Describe advantages and disadvantages of TF-IDF

Advantages:


Disadvantages:


# 4. Word2Vec

In [None]:
# 1. train a word2vec model based on all tweets
# 2. find the 10 nearest words to "آزادی"


##### Describe advantages and disadvantages of Word2Vec

Advantages:


Disadvantages:


# 5. Contextualized embedding

In [None]:
!pip install transformers[sentencepiece]

In [None]:
# Load model and tokenizer

from transformers import BertModel, BertTokenizer

model_name = "HooshvareLab/bert-base-parsbert-uncased"


In [None]:
# 1. fine-tune the model based on all tweets
# 2. find the 10 nearest words to "آزادی"


##### Describe advantages and disadvantages of Contextualized embedding

Advantages:


Disadvantages:
