1.Submit a Google Colab notebook containing your completed code and experimentation results.

2.Include comments and explanations in your code to help understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

In [1]:
# from google.colab import drive
# drive.mount('/content/gdrive')

# SOURCE_DIR = '/content/Q3_data.csv'

In [2]:
import torch
import re
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
import math
from gensim.models import Word2Vec

In [3]:
def delete_hashtag_usernames(text):
  try:
    result = []
    for word in text.split():
      if word[0] not in ['@', '#']:
        result.append(word)
    return ' '.join(result)
  except:
    return ''

def delete_url(text):
  text = re.sub(r'http\S+', '', text)
  return text

def delete_ex(text):
  text = re.sub(r'\u200c', '', text)
  return text

# 0. Data preprocessing

In [4]:
!pip install json-lines

Collecting json-lines
  Downloading json_lines-0.5.0-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: json-lines
Successfully installed json-lines-0.5.0


In [5]:
import json_lines

In [6]:
# 1. extract all tweets from file and save them in memory
# 2. remove urls, hashtags and usernames. use the prepared functions

# Read the CSV file
data = pd.read_csv('Q3_data.csv')
# print(data.columns)

PureText_data = data['PureText']

# Apply preprocessing functions to the tweet data
PureText_data = PureText_data.apply(delete_hashtag_usernames)
PureText_data = PureText_data.apply(delete_url)
PureText_data = PureText_data.apply(delete_ex)

# Print the preprocessed tweet data
# print(data['Text'])

# 1. Functions

## Cosine Similarity

To measure the similarity between two words, you need a way to measure the degree of similarity between two embedding vectors for the two words. Given two vectors $u$ and $v$, cosine similarity is defined as follows:

$$\text{CosineSimilarity(u, v)} = \frac {u \cdot v} {||u||_2 ||v||_2} = cos(\theta) \tag{1}$$

* $u \cdot v$ is the dot product (or inner product) of two vectors
* $||u||_2$ is the norm (or length) of the vector $u$
* $\theta$ is the angle between $u$ and $v$.
* The cosine similarity depends on the angle between $u$ and $v$.
    * If $u$ and $v$ are very similar, their cosine similarity will be close to 1.
    * If they are dissimilar, the cosine similarity will take a smaller value.

<img src="images/cosine_sim.png" style="width:800px;height:250px;">
<caption><center><font color='purple'><b>Figure 1</b>: The cosine of the angle between two vectors is a measure of their similarity.</font></center></caption>

Implement the function `cosine_similarity()` to evaluate the similarity between word vectors.

**Reminder**: The norm of $u$ is defined as $ ||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$

In [7]:
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """

    dot_product = np.dot(u, v)
    norm_u = np.sqrt(np.sum(u**2))
    norm_v = np.sqrt(np.sum(v**2))
    cosine_similarity = dot_product / (norm_u * norm_v)

    return cosine_similarity

## find k nearest neighbors

In [8]:
def find_k_nearest_neighbors(word, embedding_dict, k):
  """
    implement a function to return the nearest words to an specific word based on the given dictionary

    Arguments:
        word           -- a word, string
        embedding_dict -- dictionary that maps words to their corresponding vectors
        k              -- the number of word that should be returned

    Returns:
        a list of size k consisting of the k most similar words to the given word

    Note: use the cosine_similarity function that you have implemented to calculate the similarity between words
    """
  # Ensure the word is in the embedding dictionary
  if word not in embedding_dict:
      return []

  # Get the embedding for the word
  word_embedding = embedding_dict[word]

  # Calculate cosine similarity with all other words
  similarities = {}
  for other_word, other_embedding in embedding_dict.items():
      # print("Other word is: ", other_word)
      # print("other_embedding is: ", other_embedding)
      if other_word != word:
          sim = cosine_similarity(word_embedding, other_embedding)
          similarities[other_word] = sim
          # if sim != 0.0:
            # print("sim is: ", sim)
  # print ("Similarities is: ", similarities)
  # Sort by similarity
  sorted_similarities = sorted(similarities.items(), key=lambda item: item[1], reverse=True)

  # Extract the top k words
  # neighbors = [word for word, _ in sorted_similarities[:k]]
  neighbors = sorted_similarities[:k]

  return neighbors

# 2. One hot encoding

In [9]:
# 1. find one hot encoding of each word

words = PureText_data.str.split().tolist()
words = [word for sublist in words for word in sublist]
# print(words[1003])

# Reshape the words to be a column vector
words_array = np.array(words).reshape(-1, 1)
# print(words_array)

# Create the encoder
encoder = OneHotEncoder(sparse=False)

# Fit and transform the words to one-hot encoded vectors
one_hot_encoded = encoder.fit_transform(words_array)

# # Create a DataFrame to view the one-hot encoded words
# one_hot_df = pd.DataFrame(one_hot_encoded, index=words, columns=encoder.get_feature_names_out())

# # Display the one-hot encoded DataFrame
# print(one_hot_df)



In [10]:
# 2. find 10 nearest words from "آزادی"
embedding_dict = {word: encoding for word, encoding in zip(words, one_hot_encoded)}

word = "آزادی"
k = 10
nearest_words = find_k_nearest_neighbors(word, embedding_dict, k)
print(nearest_words)

word = "کامپیوتر"
nearest_words = find_k_nearest_neighbors(word, embedding_dict, k)
print(nearest_words)

[('بنشین', 0.0), ('تا', 0.0), ('شود', 0.0), ('نقش', 0.0), ('فال', 0.0), ('ما', 0.0), ('هم', 0.0), ('فردا', 0.0), ('شدن', 0.0), ('این', 0.0)]
[('بنشین', 0.0), ('تا', 0.0), ('شود', 0.0), ('نقش', 0.0), ('فال', 0.0), ('ما', 0.0), ('هم', 0.0), ('فردا', 0.0), ('شدن', 0.0), ('این', 0.0)]


Each and every vector in the one hot encoding is orthogonal to each other. So the cosine similarity as well as distance between any two vectors are same. Thus it holds no relationship among them.
That is why the nearest words found are the same. The cosine similarity of each pair of words equals 0.

In [11]:
# Just testing
count = 0
for value in embedding_dict['آزادی']:
  if value != 0.0:
    print(count, value)
  count = count + 1

count = 0
for value in embedding_dict['بهاره']:
  if value != 0.0:
    print(count, value)
  count = count + 1

count = 0
for value in embedding_dict['کامپیوتر']:
  if value != 0.0:
    print(count, value)
  count = count + 1

1805 1.0
7146 1.0
28946 1.0


##### Describe advantages and disadvantages of one-hot encoding

**Advantages:**

1. **Simplicity:** One-hot encoding is a straightforward and simple method to represent categorical variables. It involves creating a binary vector where each element corresponds to a unique category, making it easy to understand and implement.

2. **Retains categorical information:** One-hot encoding preserves the categorical nature of the variable. Each category is represented by a separate binary variable, allowing models to capture relationships and patterns specific to each category.

3. **Compatibility with machine learning algorithms:** Many machine learning algorithms require numerical inputs. One-hot encoding converts categorical variables into a numeric format that can be readily used by these algorithms.

4. **Avoids ordinality assumption:** One-hot encoding treats all categories as independent and does not impose any ordinal relationship between categories. This is useful when there is no inherent order or hierarchy among the categories.


**Disadvantages:**

1. **Dimensionality:** One-hot encoding expands the dimensionality of the feature space. If a categorical variable has a large number of unique categories, the resulting one-hot encoded representation can lead to a high-dimensional feature space, which may impact computational efficiency and model complexity.

2. **Curse of dimensionality:** The increase in dimensionality due to one-hot encoding can lead to the curse of dimensionality. This refers to the problem where the number of features becomes large relative to the number of observations, which can result in sparse data, increased model complexity, and overfitting.

3. **Redundancy:** One-hot encoding can introduce redundancy in the data representation. Since each category is represented by a separate binary variable, there is a perfect correlation between these variables. This redundancy can lead to multicollinearity issues in some models.

4. **Handling new categories:** One-hot encoding requires defining the set of categories in advance. If new categories appear during testing or deployment, the one-hot encoding scheme may not handle them properly. This can be particularly problematic in real-world scenarios where new categories may emerge over time.


# 3. TF-IDF

In [12]:
Tweets = data['Text']
print(Tweets)

0        بنشین تا شود نقش فال ما \nنقش هم‌ فردا شدن\n#م...
1        @Tanasoli_Return @dr_moosavi این گوزو رو کی گر...
2        @ghazaleghaffary برای ایران، برای مهسا.\n#OpIr...
3        @_hidden_ocean مرگ بر دیکتاتور \n#OpIran \n#Ma...
4        نذاریم خونشون پایمال شه.‌‌.‌‌.\n#Mahsa_Amini #...
                               ...                        
19995    برای ایران بانو #Mahsa_Amini      #MahsaAmini ...
19996    @MohammadTehra16 @mimpedram از بس حاج خانم درا...
19997    به افتخار از بین رفتن جمهوری اسلامی🙆‍♂️🙆‍♂️🙆‍♂...
19998    پنجاه و شیش \n\n#مهسا_امینی \n#Mahsa_Amini \n#...
19999    در محیط طوفان‌زای ماهرانه در جنگ است\nناخدای ا...
Name: Text, Length: 20000, dtype: object


In [23]:
# 1. find the TF-IDF of all tweets.
########## import and preprocess (PureText_data)

from sklearn.feature_extraction.text import TfidfVectorizer

# # Create a document-term matrix
# vectorizer = TfidfVectorizer()
# document_term_matrix = vectorizer.fit_transform(Tweets)

# # Calculate the TF-IDF values
# tfidf_values = document_term_matrix.toarray()

# # Normalize the TF-IDF values
# normalized_tfidf = tfidf_values / np.linalg.norm(tfidf_values, axis=1, keepdims=True)

# # Print the TF-IDF values for the first 10 tweets
# for i in range(10):
#     print("Tweet:", Tweets[i])
#     print("TF-IDF:", normalized_tfidf[i])
#     print()

# Create a TfidfVectorizer object and fit it to the preprocessed corpus
vectorizer = TfidfVectorizer()
vectorizer.fit(words)

# Transform the preprocessed corpus into a TF-IDF matrix
tf_idf_matrix = vectorizer.transform(words)

# Get list of feature names that correspond to the columns in the TF-IDF matrix
print("Feature Names:\n", vectorizer.get_feature_names_out())

# Print the resulting matrix
print("TF-IDF Matrix:\n", tf_idf_matrix.toarray())
# for i in tf_idf_matrix.toarray():
#   for j in i:
#     if (j != 0):
#       print (j)

Feature Names:
 ['00' '0020115687' '00971562643674' ... '۹۸' '۹۹' 'ﺍست']
TF-IDF Matrix:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [20]:
# 2. choose one tweets randomly.
import random
chosen_tweet = random.choice(Tweets)
print("Chosen Tweet:", chosen_tweet)

Chosen Tweet: @AnonymousUK2022 #مهسا_امینی 
#Mahsa_Amini #oplran 
ایرانو پس میگیریم


In [21]:
# 3. find 10 nearest tweets from chosen tweet.

# Get the index of the chosen tweet in the tfidf_matrix
chosen_tweet_index = np.where(Tweets == chosen_tweet)[0][0]

# Get the TF-IDF vector for the chosen tweet
chosen_tweet_vector = tf_idf_matrix[chosen_tweet_index]

# Calculate the cosine similarity between the chosen tweet and all other tweets
similarities = []
for i in range(len(Tweets)):
    similarity = cosine_similarity(chosen_tweet_vector.toarray()[0], tf_idf_matrix[i].toarray()[0])
    similarities.append(similarity)

# Sort the similarities and get the indices of the 10 nearest tweets
nearest_indices = np.argsort(similarities)[::-1][1:11]

print("10 Nearest Tweets:")
for index in nearest_indices:
    print("========================")
    print(Tweets[index])

  cosine_similarity = dot_product / (norm_u * norm_v)


10 Nearest Tweets:
برای عمرمون که سر اینترنت و دور زدن فیلتر ها حروم شد..
#MahsaAmini 
#Mahsa_Amini 
#OpIran
@Chandlershelby1 برای
 #Mahsa_Amini 
#OpIran 
#مهسا_امینی
اونایی که دوست ندارم پیج اینستاشون از الگوریتم خارج بشه بجای ناله در این زمینه، فقط کافیه پست و استوری حمایت از مردم بزارن، همین.
#MahsaAmini
#Mahsa_Amini
#مهسا_امینی
#OpIran
اینایی که میگن فلان استان و فلان جا و مرکز کشور خبری نیست اول برن ببینن چقدر بچهامونو زدن کشتن بعد بیان اینو بخورن

تفرقه افکن های رو اعصاب سایبری

#مهسا_امینی 
#Mahsa_Amini
@mamadporii دانشگاها بیدارن
اینو ۲۵ روزه که داریم میبینیم و لمس میکنیم
#مهسا_امینی 
#Mahsa_Amini 
#OpIran
ما همه مهسا هستیم بجنگ تا بجنگیم
#مهسا_امینی
#Mahsa_Amini 
#MahsaAmini
@asemanhn @Cherii98 یک
مهسا_امینی 
دختر ایران🖤🖤
#Mahsa_Amini
#OpIran
@alikarimi_ak8 برای امید به آینده ای روشن برای فرزندان ایران زمین #Mahsa_Amini
@OutFarsi @armin_prm #OpIran 
#Mahsa_Amini 
#مهسا_امینی 
ما همه باهم هستیم
@Nafise1375 @Godofpersiian خیلیا دارن هشتگ اشتباه میزنن، سایبری هم نیستن» لطفن آگاهش

In [22]:
chosen_tweet = random.choice(Tweets)
print("Chosen Tweet:", chosen_tweet)

# Get the index of the chosen tweet in the tfidf_matrix
chosen_tweet_index = np.where(Tweets == chosen_tweet)[0][0]

# Get the TF-IDF vector for the chosen tweet
chosen_tweet_vector = tf_idf_matrix[chosen_tweet_index]

# Calculate the cosine similarity between the chosen tweet and all other tweets
similarities = []
for i in range(len(Tweets)):
    similarity = cosine_similarity(chosen_tweet_vector.toarray()[0], tf_idf_matrix[i].toarray()[0])
    similarities.append(similarity)

# Sort the similarities and get the indices of the 10 nearest tweets
nearest_indices = np.argsort(similarities)[::-1][1:11]

print("10 Nearest Tweets:")
for index in nearest_indices:
    print("========================")
    print(Tweets[index])

Chosen Tweet: @Delaramm1127 کیم تهیونگ 
#Mahsa_Amini
#OpIran
#مهسا_امینی


  cosine_similarity = dot_product / (norm_u * norm_v)


10 Nearest Tweets:
@Nafise1375 @Godofpersiian خیلیا دارن هشتگ اشتباه میزنن، سایبری هم نیستن» لطفن آگاهشون کنین...
دقت کنید #مهسا_امینی درسته یه آندرلاین بیشتر نداره.#اعتصابات_سراری 
هشتگ رو خودتون بنویسین، از انتخابهایی که توییتر بهتون میده استفاده نکنین...
 #MashaAmini 
#OpIran
#Mahsa_Amini
به خاطر همه شهیدایی که این چند روز و تمام این سالها دادیم...
#مهسا_امینی 
#Mahsa_Amini 
#OpIran
@Chandlershelby1 برای
 #Mahsa_Amini 
#OpIran 
#مهسا_امینی
اونایی که دوست ندارم پیج اینستاشون از الگوریتم خارج بشه بجای ناله در این زمینه، فقط کافیه پست و استوری حمایت از مردم بزارن، همین.
#MahsaAmini
#Mahsa_Amini
#مهسا_امینی
#OpIran
اینایی که میگن فلان استان و فلان جا و مرکز کشور خبری نیست اول برن ببینن چقدر بچهامونو زدن کشتن بعد بیان اینو بخورن

تفرقه افکن های رو اعصاب سایبری

#مهسا_امینی 
#Mahsa_Amini
@mamadporii دانشگاها بیدارن
اینو ۲۵ روزه که داریم میبینیم و لمس میکنیم
#مهسا_امینی 
#Mahsa_Amini 
#OpIran
ما همه مهسا هستیم بجنگ تا بجنگیم
#مهسا_امینی
#Mahsa_Amini 
#MahsaAmini
@asemanhn @Cherii98 یک
مهسا

##### Describe advantages and disadvantages of TF-IDF

**Advantages:**
1. **Term Importance:** TF-IDF highlights important terms in a document by assigning higher weights to words that are more frequent in the document and less frequent in the entire corpus. This allows for effective keyword extraction and helps in identifying the most relevant terms within a document.

2. **Document Similarity:** TF-IDF enables the calculation of cosine similarity between documents based on their TF-IDF vector representations. This similarity measure is useful for tasks such as document clustering, information retrieval, and recommendation systems.

3. **Language Independence:** TF-IDF is language-independent, meaning it can be applied to documents in any language. It doesn't rely on language-specific rules or heuristics, making it a versatile technique for text analysis across different languages.

4. **Computational Efficiency:** TF-IDF can be computed efficiently, especially when using sparse matrix representations. This makes it scalable for large corpora and enables fast retrieval of relevant documents based on query terms.

**Disadvantages:**
1. **Term Frequency Bias:** TF-IDF heavily relies on term frequency. Overly frequent terms within a document may dominate the TF-IDF score, potentially overshadowing other important terms. This can be mitigated by using term frequency normalization techniques.
2. **Lack of Semantic Understanding:** TF-IDF does not capture the semantic meaning of words or the relationships between them. It treats each term independently, which may limit its ability to capture the context or nuanced meaning of phrases or multi-word expressions.
3. **Handling Out-of-Vocabulary Words:** TF-IDF is based on a fixed vocabulary derived from the corpus. Out-of-vocabulary words, i.e., words not present in the vocabulary, are typically ignored or treated as noise. This can be a limitation when dealing with specialized or domain-specific terms.
4. **Document Length Bias:** Longer documents tend to have higher term frequencies, which can bias the TF-IDF scores. Longer documents may have higher TF-IDF values simply due to more occurrences of terms, even if the terms are not necessarily more important.



# 4. Word2Vec

In [26]:
# 1. train a word2vec model base on all tweets
# Create a list of tokenized tweets
tokenized_tweets = [tweet.split() for tweet in PureText_data]

# Train the Word2Vec model
model = Word2Vec(sentences=tokenized_tweets, vector_size=100, window=5, min_count=5, workers=4)

# Save the trained model for future use
model.save("tweet_word2vec.model")

In [27]:
# 2. find 10 nearest words from "آزادی"

# Load the trained Word2Vec model
model = Word2Vec.load("tweet_word2vec.model")

# Find the 10 nearest words to "آزادی"
nearest_words = model.wv.most_similar("آزادی", topn=10)

# Print the nearest words
print("10 Nearest Words to 'آزادی':")
for word, similarity in nearest_words:
    print(word)

10 Nearest Words to 'آزادی':
زندگی،
ازادی
زن،
زن
ابادی
زندگی
ایران
آزادی،
امید
،زندگی


In [28]:
# Load the trained Word2Vec model
model = Word2Vec.load("tweet_word2vec.model")

# Find the 10 nearest words to "کامپیوتر"
nearest_words = model.wv.most_similar("کامپیوتر", topn=10)

# Print the nearest words
print("10 Nearest Words to 'کامپیوتر':")
for word, similarity in nearest_words:
    print(word)

10 Nearest Words to 'کامپیوتر':
وقتی
حرف
اینه
اومد
گوشی
واسه
جمع
همین
هاشون
دیروز


##### Describe advantages and disadvantages of Word2Vec

**Advantages:**
1. **Capturing Semantic Relationships:** Word2Vec can capture semantic relationships between words by representing them as dense vectors in a continuous vector space. Similar words tend to have similar vector representations, enabling the model to capture word similarity and analogies.

2. **Dimensionality Reduction:** Word2Vec reduces the dimensionality of word representations. Instead of representing words as one-hot vectors in a high-dimensional space, Word2Vec provides compact and dense vector representations that capture meaningful semantic information.

3. **Contextual Information:** Word2Vec considers the context in which a word appears, allowing it to capture the meaning of words based on their surrounding words. This enables the model to capture syntactic and semantic relationships.

4. **Efficiency:** Word2Vec uses an efficient implementation, such as the skip-gram or continuous bag-of-words (CBOW) models, which make it computationally efficient to train on large-scale datasets. Once trained, the model can quickly provide word embeddings for downstream tasks.

**Disadvantages:**

1. **Lack of Subword Information:** Word2Vec treats words as atomic units and does not capture subword information. Rare or out-of-vocabulary words may not have meaningful embeddings, and the model may struggle with morphologically rich languages or words with multiple meanings.
2. **Limited Context Window:** Word2Vec uses a fixed context window size to capture word relationships. This limits the model's ability to capture long-range dependencies or relationships between words that are further apart.
3. **Domain-Specific Representations:** Word2Vec embeddings are trained on a specific corpus. If the target domain differs significantly from the training corpus, the embeddings may not capture the specific domain's nuances and may require additional fine-tuning or training on domain-specific data.
4. **Polysemy and Homonymy:** Word2Vec treats each word as a single entity, ignoring potential multiple meanings or contexts. This can result in ambiguous representations for polysemous words or different senses of homonymous words.
5. **Lack of Compositionality:** Word2Vec does not inherently capture compositional meaning, where the meaning of a phrase or sentence is derived from the combination of individual word meanings. It treats each word independently, limiting its ability to capture complex linguistic structures.

# 5. Contextualized embedding

In [29]:
!pip install transformers[sentencepiece]



In [30]:
# Load model and tokenizer

from transformers import BertModel, BertTokenizer

model_name = "HooshvareLab/bert-base-parsbert-uncased"


In [31]:
# 1. fine-tune the model base on all tweets
model = BertModel.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/434 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/654M [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/1.22M [00:00<?, ?B/s]

In [32]:
tokenized_data = tokenizer.batch_encode_plus(
    PureText_data,
    padding=True,
    truncation=True,
    max_length=512,  # Adjust as needed
    return_tensors="pt"
)

In [35]:
# Define your classification model on top of the BERT base
class MyClassifier(torch.nn.Module):
    def __init__(self, num_classes):
        super(MyClassifier, self).__init__()
        self.bert = model  # Use the pre-trained BERT model as the base
        self.dropout = torch.nn.Dropout(0.1)
        self.fc = torch.nn.Linear(768, num_classes)  # Adjust the input size (768) and output size (num_classes) as needed

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        pooled_output = self.dropout(pooled_output)
        logits = self.fc(pooled_output)
        return logits

# Instantiate your classification model
num_classes = 2  # Adjust based on the number of classes for your task
classifier = MyClassifier(num_classes)

# Define your optimization and loss functions
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-5)
criterion = torch.nn.CrossEntropyLoss()

num_epochs = 10  # Adjust the number of epochs as needed
batch_size = 32  # Adjust the batch size as needed

# Fine-tuning loop
for epoch in range(num_epochs):
    # Set the model to train mode
    classifier.train()

    # Iterate over the mini-batches of your tokenized data
    for i in range(0, len(tokenized_data["input_ids"]), batch_size):
        batch_input_ids = tokenized_data["input_ids"][i:i + batch_size]
        batch_attention_mask = tokenized_data["attention_mask"][i:i + batch_size]
        batch_labels = labels[i:i + batch_size]  # Replace `labels` with your actual label data

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        logits = classifier(batch_input_ids, batch_attention_mask)

        # Compute loss
        loss = criterion(logits, batch_labels)

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        # Print the loss for monitoring
        if i % print_every == 0:
            print(f"Epoch: {epoch+1}, Batch: {i+1}/{len(tokenized_data['input_ids'])}, Loss: {loss.item()}")

# Save the fine-tuned model
torch.save(classifier.state_dict(), "fine_tuned_model.pt")

NameError: name 'labels' is not defined

In [None]:
# 2. find 10 nearest words from "آزادی"

In [40]:
import torch
import numpy as np
from transformers import BertModel, BertTokenizer

# Load the pre-trained BERT model and tokenizer
model_name = "HooshvareLab/bert-base-parsbert-uncased"
model = BertModel.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

# Set the model to evaluation mode
model.eval()

# Define the input text
input_text = "آزادی"

# Tokenize the input text
tokenized_text = tokenizer.tokenize(input_text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# Convert tokens to tensor
tokens_tensor = torch.tensor([indexed_tokens])

# Generate the contextualized embeddings
with torch.no_grad():
    outputs = model(tokens_tensor)

# Get the embeddings for the input tokens
input_embedding = outputs[0][0]  # Embedding for the first token, which represents the input word

# Calculate cosine similarity between the input word embedding and each tweet
similarities = []
for tweet in PureText_data:
    # Tokenize and convert tweet to tensor
    tweet_tokens = tokenizer.tokenize(tweet)
    tweet_indexed_tokens = tokenizer.convert_tokens_to_ids(tweet_tokens)
    tweet_tensor = torch.tensor([tweet_indexed_tokens])

    # Generate the contextualized embeddings for the tweet
    with torch.no_grad():
        tweet_outputs = model(tweet_tensor)

    # Get the embeddings for the tweet tokens
    tweet_embeddings = tweet_outputs[0][0]  # Embeddings for the tweet tokens

    # Calculate cosine similarity between input word embedding and tweet embeddings
    similarity = np.dot(input_embedding.numpy(), tweet_embeddings.numpy().T) / (
        np.linalg.norm(input_embedding.numpy()) * np.linalg.norm(tweet_embeddings.numpy(), axis=1)
    )
    similarities.append(similarity)

# Sort the tweets based on cosine similarity in descending order
sorted_indices = np.argsort(similarities)[::-1]

# Get the 10 most similar tweets
nearest_tweets = [PureText_data[index] for index in sorted_indices[:10]]

# Print the nearest tweets
for tweet in nearest_tweets:
    print(tweet)

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (20000, 1) + inhomogeneous part.

##### Describe advantages and disadvantages of Contextualized embedding

Advantages:


Disadvantages:
