1. Submit a Google Colab notebook containing your completed code and experimentation results.

2. Include comments and explanations in your code to help readers understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

In [1]:
# from google.colab import drive
# drive.mount('/content/gdrive')

# SOURCE_DIR = '/content/Q3_data.csv'

In [2]:
import torch
import re
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
import math
from gensim.models import Word2Vec

In [3]:
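def delete_hashtag_usernames(text):
  # Drop every whitespace-separated token that starts with '@' or '#'.
  # Non-string input (e.g. NaN read by pandas) is mapped to an empty string.
  if not isinstance(text, str):
    return ''
  return ' '.join(word for word in text.split() if word[0] not in ('@', '#'))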

def delete_url(text):
  text = re.sub(r'http\S+', '', text)
  return text

def delete_ex(text):
  # Remove zero-width non-joiner characters (U+200C)
  text = re.sub(r'\u200c', '', text)
  return text

# 0. Data preprocessing

In [4]:
!pip install json-lines



In [5]:
import json_lines

In [6]:
# 1. extract all tweets from file and save them in memory
# 2. remove urls, hashtags and usernames. use the prepared functions

# Read the CSV file
data = pd.read_csv('Q3_data.csv')
# print(data.columns)

# Apply preprocessing functions to the tweet data
data['Text'] = data['Text'].apply(delete_hashtag_usernames)
data['Text'] = data['Text'].apply(delete_url)
data['Text'] = data['Text'].apply(delete_ex)

# Print the preprocessed tweet data
# print(data['Text'])
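
To see what the three cleaning steps do to a single tweet, the sketch below chains them on a made-up example. The helper name `clean_tweet` and the sample strings are assumptions for illustration only; the regular expressions are the same as in the functions above.

```python
import re

def clean_tweet(text):
    """Chain the three cleaning steps on one string (illustrative helper)."""
    # Drop tokens that start with '@' or '#'
    text = ' '.join(w for w in text.split() if w[0] not in ('@', '#'))
    # Drop URLs
    text = re.sub(r'http\S+', '', text)
    # Drop zero-width non-joiners (U+200C)
    text = re.sub(r'\u200c', '', text)
    return text

sample = "@user hello #tag world https://example.com"
cleaned = clean_tweet(sample)
print(repr(cleaned))  # 'hello world '

# Removing the zero-width non-joiner fuses the two halves of a Persian word
joined = clean_tweet("می\u200cروم")
print(joined == "میروم")
```

Note that URL removal leaves the surrounding whitespace in place, so a cleaned tweet can end with a trailing space.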

# 1. Functions

## Cosine Similarity

To measure the similarity between two words, you need a way to measure the degree of similarity between two embedding vectors for the two words. Given two vectors $u$ and $v$, cosine similarity is defined as follows:

$$\text{CosineSimilarity}(u, v) = \frac {u \cdot v} {||u||_2 ||v||_2} = \cos(\theta) \tag{1}$$

* $u \cdot v$ is the dot product (or inner product) of two vectors
* $||u||_2$ is the norm (or length) of the vector $u$
* $\theta$ is the angle between $u$ and $v$.
* The cosine similarity depends on the angle between $u$ and $v$.
    * If $u$ and $v$ point in nearly the same direction, their cosine similarity will be close to 1.
    * If they are orthogonal, it is 0; if they point in opposite directions, it approaches -1.

<img src="images/cosine_sim.png" style="width:800px;height:250px;">
<caption><center><font color='purple'><b>Figure 1</b>: The cosine of the angle between two vectors is a measure of their similarity.</font></center></caption>

Implement the function `cosine_similarity()` to evaluate the similarity between word vectors.

**Reminder**: The norm of $u$ is defined as $||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$
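
As a quick worked example (the vectors are made up for illustration), take $u = (1, 2, 2)$ and $v = (2, 0, 1)$:

$$u \cdot v = 1 \cdot 2 + 2 \cdot 0 + 2 \cdot 1 = 4, \qquad ||u||_2 = \sqrt{1 + 4 + 4} = 3, \qquad ||v||_2 = \sqrt{4 + 0 + 1} = \sqrt{5}$$

$$\text{CosineSimilarity}(u, v) = \frac{4}{3\sqrt{5}} \approx 0.596$$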

In [7]:
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """

    dot_product = np.dot(u, v)
    norm_u = np.linalg.norm(u)  # L2 norm of u
    norm_v = np.linalg.norm(v)  # L2 norm of v
    cosine_similarity = dot_product / (norm_u * norm_v)

    return cosine_similarity

## find k nearest neighbors

In [8]:
def find_k_nearest_neighbors(word, embedding_dict, k):
  """
    Return the words nearest to a specific word, based on the given dictionary.

    Arguments:
        word           -- a word, string
        embedding_dict -- dictionary that maps words to their corresponding vectors
        k              -- the number of words that should be returned

    Returns:
        a list of size k consisting of the k most similar words to the given word

    Note: use the cosine_similarity function that you have implemented to calculate the similarity between words
    """
  # Ensure the word is in the embedding dictionary
  if word not in embedding_dict:
      return []

  # Get the embedding for the word
  word_embedding = embedding_dict[word]

  # Calculate cosine similarity with all other words
  similarities = {}
  for other_word, other_embedding in embedding_dict.items():
      if other_word != word:
          sim = cosine_similarity(word_embedding, other_embedding)
          similarities[other_word] = sim

  # Sort by similarity
  sorted_similarities = sorted(similarities.items(), key=lambda item: item[1], reverse=True)

  # Extract the top k words
  neighbors = [neighbor for neighbor, _ in sorted_similarities[:k]]

  return neighbors
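
The loop above computes one similarity at a time. As a hedged alternative (not the implementation used in this notebook), the same ranking can be sketched with a single matrix product. The function name `k_nearest_vectorized` and the toy embeddings are assumptions made for this example.

```python
import numpy as np

def k_nearest_vectorized(word, embedding_dict, k):
    """Return the k words whose vectors have the highest cosine similarity to `word`."""
    if word not in embedding_dict:
        return []
    vocab = [w for w in embedding_dict if w != word]
    if not vocab:
        return []
    matrix = np.array([embedding_dict[w] for w in vocab], dtype=float)
    query = np.asarray(embedding_dict[word], dtype=float)
    # Cosine similarity of every row with the query in one shot
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # Stable sort on negated similarities keeps ties in dictionary order
    order = np.argsort(-sims, kind="stable")[:k]
    return [vocab[i] for i in order]

# Made-up 3-dimensional embeddings
toy = {
    "king":  [0.9, 0.1, 0.0],
    "queen": [0.8, 0.2, 0.0],
    "apple": [0.0, 0.1, 0.9],
    "pear":  [0.1, 0.0, 0.9],
}
result = k_nearest_vectorized("king", toy, 2)
print(result)  # ['queen', 'pear']
```

Like the loop version, this sketch assumes that no vector has zero norm.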

# 2. One hot encoding

In [9]:
# 1. find one hot encoding of each word

words = data['Text'].str.split().tolist()
words = [word for sublist in words for word in sublist]
# print(words)

# Reshape the words to be a column vector
words_array = np.array(words).reshape(-1, 1)

# Create the encoder (scikit-learn >= 1.2 uses `sparse_output`; older releases call it `sparse`)
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the words to one-hot encoded vectors
one_hot_encoded = encoder.fit_transform(words_array)

# # Create a DataFrame to view the one-hot encoded words
# one_hot_df = pd.DataFrame(one_hot_encoded, index=words, columns=encoder.get_feature_names_out())

# # Display the one-hot encoded DataFrame
# print(one_hot_df)



In [12]:
# 2. find the 10 nearest words to "آزادی"
embedding_dict = {word: encoding for word, encoding in zip(words, one_hot_encoded)}

word = "آزادی"
k = 10

nearest_words = find_k_nearest_neighbors(word, embedding_dict, k)

print(nearest_words)

['بنشین', 'تا', 'شود', 'نقش', 'فال', 'ما', 'هم', 'فردا', 'شدن', 'این']


In [13]:
# Just testing: print the index and value of every non-zero entry
for test_word in ['آزادی', 'بهاره', 'کامپیوتر']:
  for index, value in enumerate(embedding_dict[test_word]):
    if value != 0.0:
      print(index, value)

1805 1.0
7146 1.0
28946 1.0


##### Describe advantages and disadvantages of one-hot encoding

**Advantages:**

1. **Simplicity:** One-hot encoding is a straightforward and simple method to represent categorical variables. It involves creating a binary vector where each element corresponds to a unique category, making it easy to understand and implement.

2. **Retains categorical information:** One-hot encoding preserves the categorical nature of the variable. Each category is represented by a separate binary variable, allowing models to capture relationships and patterns specific to each category.

3. **Compatibility with machine learning algorithms:** Many machine learning algorithms require numerical inputs. One-hot encoding converts categorical variables into a numeric format that can be readily used by these algorithms.

4. **Avoids ordinality assumption:** One-hot encoding treats all categories as independent and does not impose any ordinal relationship between categories. This is useful when there is no inherent order or hierarchy among the categories.


**Disadvantages:**

1. **Dimensionality:** One-hot encoding expands the dimensionality of the feature space. If a categorical variable has a large number of unique categories, the resulting one-hot encoded representation can lead to a high-dimensional feature space, which may impact computational efficiency and model complexity.

2. **Curse of dimensionality:** The increase in dimensionality due to one-hot encoding can lead to the curse of dimensionality. This refers to the problem where the number of features becomes large relative to the number of observations, which can result in sparse data, increased model complexity, and overfitting.

3. **Redundancy:** One-hot encoding can introduce redundancy in the data representation. The binary variables of one feature always sum to 1, so any one of them is fully determined by the others. This linear dependence can lead to multicollinearity issues in some models, for example linear regression with an intercept.

4. **Handling new categories:** One-hot encoding requires defining the set of categories in advance. If new categories appear during testing or deployment, the one-hot encoding scheme may not handle them properly. This can be particularly problematic in real-world scenarios where new categories may emerge over time.
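
One further limitation matters for the nearest-neighbor experiment above: any two distinct one-hot vectors are orthogonal, so their cosine similarity is exactly 0. Every other word is therefore tied, and a stable sort simply returns words in dictionary order. This is consistent with the neighbor list printed earlier, which matches the opening words of the first tweets. The sketch below uses a made-up three-word vocabulary to show the effect.

```python
import numpy as np

vocab = ["freedom", "liberty", "computer"]   # made-up vocabulary
one_hot = np.eye(len(vocab))                 # row i is the one-hot vector of vocab[i]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Each word is similar only to itself; all other pairs score 0
for i, w in enumerate(vocab):
    print(w, [cosine(one_hot[i], one_hot[j]) for j in range(len(vocab))])
```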


# 3. TF-IDF

In [14]:
Tweets = data['Text']
print(Tweets)

0                  بنشین تا شود نقش فال ما نقش هم فردا شدن
1        این گوزو رو کی گردن میگیره؟؟ دچار زوال عقل شده...
2                                   برای ایران، برای مهسا.
3                                          مرگ بر دیکتاتور
4                               نذاریم خونشون پایمال شه...
                               ...                        
19995                                     برای ایران بانو 
19996        از بس حاج خانم دراز نشده واسش عقده دراز داره😅
19997    به افتخار از بین رفتن جمهوری اسلامی🙆‍♂️🙆‍♂️🙆‍♂...
19998                                          پنجاه و شیش
19999    در محیط طوفانزای ماهرانه در جنگ است ناخدای است...
Name: Text, Length: 20000, dtype: object


In [15]:
# 1. find the TF-IDF of all tweets.

from sklearn.feature_extraction.text import TfidfVectorizer

# Create a document-term matrix
vectorizer = TfidfVectorizer()
document_term_matrix = vectorizer.fit_transform(Tweets)

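# Convert the sparse TF-IDF matrix to a dense array
tfidf_values = document_term_matrix.toarray()

# Re-normalize each row to unit L2 norm (optional: TfidfVectorizer already does this by default).
# A tweet with no remaining tokens gives an all-zero row, so guard against division by zero.
row_norms = np.linalg.norm(tfidf_values, axis=1, keepdims=True)
normalized_tfidf = tfidf_values / np.where(row_norms == 0, 1.0, row_norms)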

# Print the TF-IDF values for the first 10 tweets
for i in range(10):
    print("Tweet:", Tweets[i])
    print("TF-IDF:", normalized_tfidf[i])
    print()


Tweet: بنشین تا شود نقش فال ما نقش هم فردا شدن
TF-IDF: [0. 0. 0. ... 0. 0. 0.]

Tweet: این گوزو رو کی گردن میگیره؟؟ دچار زوال عقل شده از بس پای منبر دستمال کشی کرده.
TF-IDF: [0. 0. 0. ... 0. 0. 0.]

Tweet: برای ایران، برای مهسا.
TF-IDF: [0. 0. 0. ... 0. 0. 0.]

Tweet: مرگ بر دیکتاتور
TF-IDF: [0. 0. 0. ... 0. 0. 0.]

Tweet: نذاریم خونشون پایمال شه...
TF-IDF: [0. 0. 0. ... 0. 0. 0.]

Tweet: مابهت افتخار میکنیم نبات باعث شدی کل دنیا مارو ببینه
TF-IDF: [0. 0. 0. ... 0. 0. 0.]

Tweet: برای انسانای خوشگلمون
TF-IDF: [0. 0. 0. ... 0. 0. 0.]

Tweet: فارغ از هر باوری متحد شویم.
TF-IDF: [0. 0. 0. ... 0. 0. 0.]

Tweet: اینها عجب موجودات پستی هستن🥺🥺🥺الهی بگردم، من خودم باردارم و حتی توتظاهرات مسالمت امیز خارج ایران استرس داشتم ادم ها نا خود اگاه بهم ضربه بزنن،بمیرم برای دل اون زن که چه کشیده...مرگ بر دیکتاتور
TF-IDF: [0. 0. 0. ... 0. 0. 0.]

Tweet: کصخلا چرا ۴ تاوفحشش نمیدن؟
TF-IDF: [0. 0. 0. ... 0. 0. 0.]





In [58]:
# 2. choose one tweet at random.
# 3. find the 10 nearest tweets to the chosen tweet.
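
One possible way to approach these two steps is sketched below. It is not a completed solution for this notebook. It assumes a dense matrix with one row per document and runs on a small made-up array rather than the real TF-IDF values. The helper name `nearest_rows` is hypothetical.

```python
import numpy as np

def nearest_rows(matrix, query_index, k):
    """Indices of the k rows most cosine-similar to row `query_index` (excluding itself)."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0            # all-zero rows get similarity 0 instead of NaN
    unit = matrix / norms[:, None]
    sims = unit @ unit[query_index]
    sims[query_index] = -np.inf        # never return the query itself
    return np.argsort(-sims, kind="stable")[:k]

# Toy stand-in for a TF-IDF matrix: 5 documents, 4 terms (values are made up)
toy_tfidf = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],   # empty document
    [0.5, 0.5, 0.0, 0.0],
])

rng = np.random.default_rng(0)
chosen = int(rng.integers(len(toy_tfidf)))   # pick one document at random
print(chosen, nearest_rows(toy_tfidf, chosen, 3))
```

With the notebook's own variables, the same helper could presumably be called on `tfidf_values` with `k=10`, and the returned indices used to look up the matching tweets in `Tweets`.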

##### Describe advantages and disadvantages of TF-IDF

Advantages:


Disadvantages:


# 4. Word2Vec

In [None]:
# 1. train a word2vec model based on all tweets
# 2. find the 10 nearest words to "آزادی"


##### Describe advantages and disadvantages of Word2Vec

Advantages:


Disadvantages:


# 5. Contextualized embedding

In [None]:
!pip install transformers[sentencepiece]

In [None]:
# Load model and tokenizer

from transformers import BertModel, BertTokenizer

model_name = "HooshvareLab/bert-base-parsbert-uncased"


In [None]:
# 1. fine-tune the model based on all tweets
# 2. find the 10 nearest words to "آزادی"


##### Describe advantages and disadvantages of Contextualized embedding

Advantages:


Disadvantages:
