1.Submit a Google Colab notebook containing your completed code and experimentation results.

2.Include comments and explanations in your code to help understand the implemented logic.

**Additional Notes:**
*   Ensure that the notebook runs successfully in Google Colab.
*   Document any issues encountered during experimentation and how you addressed them.

**Grading:**
*   Each task will be graded out of the specified points.
*   Points will be awarded for correctness, clarity of code, thorough experimentation, and insightful analysis.

In [2]:
# from google.colab import drive
# drive.mount('/content/gdrive')

# SOURCE_DIR = '/content/Q3_data.csv'

In [3]:
import torch
import re
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd
import math
from gensim.models import Word2Vec

In [4]:
def delete_hashtag_usernames(text):
  try:
    result = []
    for word in text.split():
      if word[0] not in ['@', '#']:
        result.append(word)
    return ' '.join(result)
  except:
    return ''

def delete_url(text):
  text = re.sub(r'http\S+', '', text)
  return text

def delete_ex(text):
  text = re.sub(r'\u200c', '', text)
  return text

# 0. Data preprocessing

In [5]:
!pip install json-lines



In [6]:
import json_lines

In [7]:
# 1. extract all tweets from file and save them in memory
# 2. remove urls, hashtags and usernames. use the prepared functions

# Read the CSV file
data = pd.read_csv('Q3_data.csv')
# print(data.columns)

PureText_data = data['PureText']

# Apply preprocessing functions to the tweet data
PureText_data = PureText_data.apply(delete_hashtag_usernames)
PureText_data = PureText_data.apply(delete_url)
PureText_data = PureText_data.apply(delete_ex)

# Print the preprocessed tweet data
# print(data['Text'])

# 1. Functions

## Cosine Similarity

To measure the similarity between two words, you need a way to measure the degree of similarity between two embedding vectors for the two words. Given two vectors $u$ and $v$, cosine similarity is defined as follows:

$$\text{CosineSimilarity(u, v)} = \frac {u \cdot v} {||u||_2 ||v||_2} = cos(\theta) \tag{1}$$

* $u \cdot v$ is the dot product (or inner product) of two vectors
* $||u||_2$ is the norm (or length) of the vector $u$
* $\theta$ is the angle between $u$ and $v$.
* The cosine similarity depends on the angle between $u$ and $v$.
    * If $u$ and $v$ are very similar, their cosine similarity will be close to 1.
    * If they are dissimilar, the cosine similarity will take a smaller value.

<img src="images/cosine_sim.png" style="width:800px;height:250px;">
<caption><center><font color='purple'><b>Figure 1</b>: The cosine of the angle between two vectors is a measure of their similarity.</font></center></caption>

Implement the function `cosine_similarity()` to evaluate the similarity between word vectors.

**Reminder**: The norm of $u$ is defined as $ ||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$

In [8]:
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """

    dot_product = np.dot(u, v)
    norm_u = np.sqrt(np.sum(u**2))
    norm_v = np.sqrt(np.sum(v**2))
    cosine_similarity = dot_product / (norm_u * norm_v)

    return cosine_similarity

## find k nearest neighbors

In [9]:
def find_k_nearest_neighbors(word, embedding_dict, k):
  """
    implement a function to return the nearest words to an specific word based on the given dictionary

    Arguments:
        word           -- a word, string
        embedding_dict -- dictionary that maps words to their corresponding vectors
        k              -- the number of word that should be returned

    Returns:
        a list of size k consisting of the k most similar words to the given word

    Note: use the cosine_similarity function that you have implemented to calculate the similarity between words
    """
  # Ensure the word is in the embedding dictionary
  if word not in embedding_dict:
      return []

  # Get the embedding for the word
  word_embedding = embedding_dict[word]

  # Calculate cosine similarity with all other words
  similarities = {}
  for other_word, other_embedding in embedding_dict.items():
      # print("Other word is: ", other_word)
      # print("other_embedding is: ", other_embedding)
      if other_word != word:
          sim = cosine_similarity(word_embedding, other_embedding)
          similarities[other_word] = sim
          # if sim != 0.0:
            # print("sim is: ", sim)
  # print ("Similarities is: ", similarities)
  # Sort by similarity
  sorted_similarities = sorted(similarities.items(), key=lambda item: item[1], reverse=True)

  # Extract the top k words
  # neighbors = [word for word, _ in sorted_similarities[:k]]
  neighbors = sorted_similarities[:k]

  return neighbors

# 2. One hot encoding

In [10]:
# 1. find one hot encoding of each word

words = PureText_data.str.split().tolist()
words = [word for sublist in words for word in sublist]
# print(words[1003])

# Reshape the words to be a column vector
words_array = np.array(words).reshape(-1, 1)
# print(words_array)

# Create the encoder
encoder = OneHotEncoder(sparse=False)

# Fit and transform the words to one-hot encoded vectors
one_hot_encoded = encoder.fit_transform(words_array)

# # Create a DataFrame to view the one-hot encoded words
# one_hot_df = pd.DataFrame(one_hot_encoded, index=words, columns=encoder.get_feature_names_out())

# # Display the one-hot encoded DataFrame
# print(one_hot_df)



In [11]:
# 2. find 10 nearest words from "آزادی"
embedding_dict = {word: encoding for word, encoding in zip(words, one_hot_encoded)}

word = "آزادی"
k = 10
nearest_words = find_k_nearest_neighbors(word, embedding_dict, k)
print(nearest_words)

word = "کامپیوتر"
nearest_words = find_k_nearest_neighbors(word, embedding_dict, k)
print(nearest_words)

[('بنشین', 0.0), ('تا', 0.0), ('شود', 0.0), ('نقش', 0.0), ('فال', 0.0), ('ما', 0.0), ('هم', 0.0), ('فردا', 0.0), ('شدن', 0.0), ('این', 0.0)]
[('بنشین', 0.0), ('تا', 0.0), ('شود', 0.0), ('نقش', 0.0), ('فال', 0.0), ('ما', 0.0), ('هم', 0.0), ('فردا', 0.0), ('شدن', 0.0), ('این', 0.0)]


Each and every vector in the one hot encoding is orthogonal to each other. So the cosine similarity as well as distance between any two vectors are same. Thus it holds no relationship among them.
That is why the nearest words found are the same. The cosine similarity of each pair of words equals 0.

In [12]:
# Just testing
count = 0
for value in embedding_dict['آزادی']:
  if value != 0.0:
    print(count, value)
  count = count + 1

count = 0
for value in embedding_dict['بهاره']:
  if value != 0.0:
    print(count, value)
  count = count + 1

count = 0
for value in embedding_dict['کامپیوتر']:
  if value != 0.0:
    print(count, value)
  count = count + 1

1805 1.0
7146 1.0
28946 1.0


##### Describe advantages and disadvantages of one-hot encoding

**Advantages:**

1. **Simplicity:** One-hot encoding is a straightforward and simple method to represent categorical variables. It involves creating a binary vector where each element corresponds to a unique category, making it easy to understand and implement.

2. **Retains categorical information:** One-hot encoding preserves the categorical nature of the variable. Each category is represented by a separate binary variable, allowing models to capture relationships and patterns specific to each category.

3. **Compatibility with machine learning algorithms:** Many machine learning algorithms require numerical inputs. One-hot encoding converts categorical variables into a numeric format that can be readily used by these algorithms.

4. **Avoids ordinality assumption:** One-hot encoding treats all categories as independent and does not impose any ordinal relationship between categories. This is useful when there is no inherent order or hierarchy among the categories.


**Disadvantages:**

1. **Dimensionality:** One-hot encoding expands the dimensionality of the feature space. If a categorical variable has a large number of unique categories, the resulting one-hot encoded representation can lead to a high-dimensional feature space, which may impact computational efficiency and model complexity.

2. **Curse of dimensionality:** The increase in dimensionality due to one-hot encoding can lead to the curse of dimensionality. This refers to the problem where the number of features becomes large relative to the number of observations, which can result in sparse data, increased model complexity, and overfitting.

3. **Redundancy:** One-hot encoding can introduce redundancy in the data representation. Since each category is represented by a separate binary variable, there is a perfect correlation between these variables. This redundancy can lead to multicollinearity issues in some models.

4. **Handling new categories:** One-hot encoding requires defining the set of categories in advance. If new categories appear during testing or deployment, the one-hot encoding scheme may not handle them properly. This can be particularly problematic in real-world scenarios where new categories may emerge over time.


# 3. TF-IDF

In [13]:
Tweets = data['Text']
print(Tweets)

0        بنشین تا شود نقش فال ما \nنقش هم‌ فردا شدن\n#م...
1        @Tanasoli_Return @dr_moosavi این گوزو رو کی گر...
2        @ghazaleghaffary برای ایران، برای مهسا.\n#OpIr...
3        @_hidden_ocean مرگ بر دیکتاتور \n#OpIran \n#Ma...
4        نذاریم خونشون پایمال شه.‌‌.‌‌.\n#Mahsa_Amini #...
                               ...                        
19995    برای ایران بانو #Mahsa_Amini      #MahsaAmini ...
19996    @MohammadTehra16 @mimpedram از بس حاج خانم درا...
19997    به افتخار از بین رفتن جمهوری اسلامی🙆‍♂️🙆‍♂️🙆‍♂...
19998    پنجاه و شیش \n\n#مهسا_امینی \n#Mahsa_Amini \n#...
19999    در محیط طوفان‌زای ماهرانه در جنگ است\nناخدای ا...
Name: Text, Length: 20000, dtype: object


In [14]:
# 1. find the TF-IDF of all tweets.
########## import and preprocess (PureText_data)

from sklearn.feature_extraction.text import TfidfVectorizer

# # Create a document-term matrix
# vectorizer = TfidfVectorizer()
# document_term_matrix = vectorizer.fit_transform(Tweets)

# # Calculate the TF-IDF values
# tfidf_values = document_term_matrix.toarray()

# # Normalize the TF-IDF values
# normalized_tfidf = tfidf_values / np.linalg.norm(tfidf_values, axis=1, keepdims=True)

# # Print the TF-IDF values for the first 10 tweets
# for i in range(10):
#     print("Tweet:", Tweets[i])
#     print("TF-IDF:", normalized_tfidf[i])
#     print()

# Create a TfidfVectorizer object and fit it to the preprocessed corpus
vectorizer = TfidfVectorizer()
vectorizer.fit(words)

# Transform the preprocessed corpus into a TF-IDF matrix
tf_idf_matrix = vectorizer.transform(words)

# Get list of feature names that correspond to the columns in the TF-IDF matrix
print("Feature Names:\n", vectorizer.get_feature_names_out())

# Print the resulting matrix
print("TF-IDF Matrix:\n", tf_idf_matrix.toarray())
# for i in tf_idf_matrix.toarray():
#   for j in i:
#     if (j != 0):
#       print (j)

Feature Names:
 ['00' '0020115687' '00971562643674' ... '۹۸' '۹۹' 'ﺍست']
TF-IDF Matrix:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [15]:
# 2. choose one tweets randomly.
import random
chosen_tweet = random.choice(Tweets)
print("Chosen Tweet:", chosen_tweet)

Chosen Tweet: @i6kitt1e مهسا_امینی 
دختر ایران🖤🖤
#Mahsa_Amini
#OpIran


In [16]:
# 3. find 10 nearest tweets from chosen tweet.

# Get the index of the chosen tweet in the tfidf_matrix
chosen_tweet_index = np.where(Tweets == chosen_tweet)[0][0]

# Get the TF-IDF vector for the chosen tweet
chosen_tweet_vector = tf_idf_matrix[chosen_tweet_index]

# Calculate the cosine similarity between the chosen tweet and all other tweets
similarities = []
for i in range(len(Tweets)):
    similarity = cosine_similarity(chosen_tweet_vector.toarray()[0], tf_idf_matrix[i].toarray()[0])
    similarities.append(similarity)

# Sort the similarities and get the indices of the 10 nearest tweets
nearest_indices = np.argsort(similarities)[::-1][1:11]

print("10 Nearest Tweets:")
for index in nearest_indices:
    print("========================")
    print(Tweets[index])

  cosine_similarity = dot_product / (norm_u * norm_v)


10 Nearest Tweets:
برای وقتایی که خواستیم همو بغلدکنیم موقع خدافظی و نمیشد
#مهسا_امینی 
#OpIran 
#MahsaAmini 
#Mahsa_Amini
برای آزادی
برای حق انتخاب 
برای یه نفس راحت.... 
#MahsaAmini
-#مهسا_امینی
#Mahsa_Amini
شما نميتوانيد با زبان مدني و ملايم ايران رو نجات بدهي شما با يك حيوان طرفي كه فقط سلاح بر او غلبه ميكند . #مهسا_امینی  #Mahsa_Amini #اعتصاب_سراسری
@hodalyyy @Ftmp191 برای مهسا.
برای ازادی.
#مهسا_امینی #Mahsa_Amini #OpIran
دسخووووووووششششششششششششش
#مهسا_امینی #OpIran  #Mahsa_Amini
RT besmaili: Köln heute
تظاهرات امروز در کلن 
#IranRevolution #IranProtests2022 #Mahsa_Amini‌ #مهسا_‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌امینی https://t.c… 
 #MahsaAmini #IranRevolution
مرگ بر رییسی
#مهسا_امینی
#Mahsa_Amini
@this_ryouzaki فالو کنید بک میدم 

#MahsaAmini 
#مهسا_امینی 
#Mahsa_Amini 
#OpIran
@fans_raefipour هم مادرتو گاییدم تخم زنا پدر نا معلوم #Mahsa_Amini
@ActualFatemeh برای ازادی ایران
#مهسا_امینی 
#Mahsa_Amini 
#OpIran


In [17]:
chosen_tweet = random.choice(Tweets)
print("Chosen Tweet:", chosen_tweet)

# Get the index of the chosen tweet in the tfidf_matrix
chosen_tweet_index = np.where(Tweets == chosen_tweet)[0][0]

# Get the TF-IDF vector for the chosen tweet
chosen_tweet_vector = tf_idf_matrix[chosen_tweet_index]

# Calculate the cosine similarity between the chosen tweet and all other tweets
similarities = []
for i in range(len(Tweets)):
    similarity = cosine_similarity(chosen_tweet_vector.toarray()[0], tf_idf_matrix[i].toarray()[0])
    similarities.append(similarity)

# Sort the similarities and get the indices of the 10 nearest tweets
nearest_indices = np.argsort(similarities)[::-1][1:11]

print("10 Nearest Tweets:")
for index in nearest_indices:
    print("========================")
    print(Tweets[index])

Chosen Tweet: @_3adr هر جوری شده بزار 
روحیه میده ب مردم 
#Mahsa_Amini 
#مهسا_امینی


  cosine_similarity = dot_product / (norm_u * norm_v)


10 Nearest Tweets:
امروز از یه دختره 15 ساله #تبریز ی وسط تظاهرات پرسیدم، واقعا نمی ترسی ؟؟؟
برگشت گفت: اولماخ وار دونماخ یوخ (مرگ رو هستم ولی برگشت رو نه ...)

#مهسا_امینی 
#OpIran 
#Mahsa_Amini
به خاطر همه شهیدایی که این چند روز و تمام این سالها دادیم...
#مهسا_امینی 
#Mahsa_Amini 
#OpIran
@Chandlershelby1 برای
 #Mahsa_Amini 
#OpIran 
#مهسا_امینی
اونایی که دوست ندارم پیج اینستاشون از الگوریتم خارج بشه بجای ناله در این زمینه، فقط کافیه پست و استوری حمایت از مردم بزارن، همین.
#MahsaAmini
#Mahsa_Amini
#مهسا_امینی
#OpIran
اینایی که میگن فلان استان و فلان جا و مرکز کشور خبری نیست اول برن ببینن چقدر بچهامونو زدن کشتن بعد بیان اینو بخورن

تفرقه افکن های رو اعصاب سایبری

#مهسا_امینی 
#Mahsa_Amini
@mamadporii دانشگاها بیدارن
اینو ۲۵ روزه که داریم میبینیم و لمس میکنیم
#مهسا_امینی 
#Mahsa_Amini 
#OpIran
ما همه مهسا هستیم بجنگ تا بجنگیم
#مهسا_امینی
#Mahsa_Amini 
#MahsaAmini
@asemanhn @Cherii98 یک
مهسا_امینی 
دختر ایران🖤🖤
#Mahsa_Amini
#OpIran
@OutFarsi @armin_prm #OpIran 
#Mahsa_Amini 
#مهسا_امینی

##### Describe advantages and disadvantages of TF-IDF

**Advantages:**
1. **Term Importance:** TF-IDF highlights important terms in a document by assigning higher weights to words that are more frequent in the document and less frequent in the entire corpus. This allows for effective keyword extraction and helps in identifying the most relevant terms within a document.

2. **Document Similarity:** TF-IDF enables the calculation of cosine similarity between documents based on their TF-IDF vector representations. This similarity measure is useful for tasks such as document clustering, information retrieval, and recommendation systems.

3. **Language Independence:** TF-IDF is language-independent, meaning it can be applied to documents in any language. It doesn't rely on language-specific rules or heuristics, making it a versatile technique for text analysis across different languages.

4. **Computational Efficiency:** TF-IDF can be computed efficiently, especially when using sparse matrix representations. This makes it scalable for large corpora and enables fast retrieval of relevant documents based on query terms.

**Disadvantages:**
1. **Term Frequency Bias:** TF-IDF heavily relies on term frequency. Overly frequent terms within a document may dominate the TF-IDF score, potentially overshadowing other important terms. This can be mitigated by using term frequency normalization techniques.
2. **Lack of Semantic Understanding:** TF-IDF does not capture the semantic meaning of words or the relationships between them. It treats each term independently, which may limit its ability to capture the context or nuanced meaning of phrases or multi-word expressions.
3. **Handling Out-of-Vocabulary Words:** TF-IDF is based on a fixed vocabulary derived from the corpus. Out-of-vocabulary words, i.e., words not present in the vocabulary, are typically ignored or treated as noise. This can be a limitation when dealing with specialized or domain-specific terms.
4. **Document Length Bias:** Longer documents tend to have higher term frequencies, which can bias the TF-IDF scores. Longer documents may have higher TF-IDF values simply due to more occurrences of terms, even if the terms are not necessarily more important.



# 4. Word2Vec

In [18]:
# 1. train a word2vec model base on all tweets
# Create a list of tokenized tweets
tokenized_tweets = [tweet.split() for tweet in PureText_data]

# Train the Word2Vec model
model = Word2Vec(sentences=tokenized_tweets, vector_size=100, window=5, min_count=5, workers=4)

# Save the trained model for future use
model.save("tweet_word2vec.model")

In [19]:
# 2. find 10 nearest words from "آزادی"

# Load the trained Word2Vec model
model = Word2Vec.load("tweet_word2vec.model")

# Find the 10 nearest words to "آزادی"
nearest_words = model.wv.most_similar("آزادی", topn=10)

# Print the nearest words
print("10 Nearest Words to 'آزادی':")
for word, similarity in nearest_words:
    print(word)

10 Nearest Words to 'آزادی':
ازادی
زندگی،
زن،
زن
خواه
ایران
ابادی
زندگی
ایرانم
امید


In [20]:
# Load the trained Word2Vec model
model = Word2Vec.load("tweet_word2vec.model")

# Find the 10 nearest words to "کامپیوتر"
nearest_words = model.wv.most_similar("کامپیوتر", topn=10)

# Print the nearest words
print("10 Nearest Words to 'کامپیوتر':")
for word, similarity in nearest_words:
    print(word)

10 Nearest Words to 'کامپیوتر':
زنگ
ویدیو
اینبار
کارت
کوتاه
پشت
پاک
راحت
میتونیم
پسرا


##### Describe advantages and disadvantages of Word2Vec

**Advantages:**
1. **Capturing Semantic Relationships:** Word2Vec can capture semantic relationships between words by representing them as dense vectors in a continuous vector space. Similar words tend to have similar vector representations, enabling the model to capture word similarity and analogies.

2. **Dimensionality Reduction:** Word2Vec reduces the dimensionality of word representations. Instead of representing words as one-hot vectors in a high-dimensional space, Word2Vec provides compact and dense vector representations that capture meaningful semantic information.

3. **Contextual Information:** Word2Vec considers the context in which a word appears, allowing it to capture the meaning of words based on their surrounding words. This enables the model to capture syntactic and semantic relationships.

4. **Efficiency:** Word2Vec uses an efficient implementation, such as the skip-gram or continuous bag-of-words (CBOW) models, which make it computationally efficient to train on large-scale datasets. Once trained, the model can quickly provide word embeddings for downstream tasks.

**Disadvantages:**

1. **Lack of Subword Information:** Word2Vec treats words as atomic units and does not capture subword information. Rare or out-of-vocabulary words may not have meaningful embeddings, and the model may struggle with morphologically rich languages or words with multiple meanings.
2. **Limited Context Window:** Word2Vec uses a fixed context window size to capture word relationships. This limits the model's ability to capture long-range dependencies or relationships between words that are further apart.
3. **Domain-Specific Representations:** Word2Vec embeddings are trained on a specific corpus. If the target domain differs significantly from the training corpus, the embeddings may not capture the specific domain's nuances and may require additional fine-tuning or training on domain-specific data.
4. **Polysemy and Homonymy:** Word2Vec treats each word as a single entity, ignoring potential multiple meanings or contexts. This can result in ambiguous representations for polysemous words or different senses of homonymous words.
5. **Lack of Compositionality:** Word2Vec does not inherently capture compositional meaning, where the meaning of a phrase or sentence is derived from the combination of individual word meanings. It treats each word independently, limiting its ability to capture complex linguistic structures.

# 5. Contextualized embedding

In [21]:
!pip install transformers[sentencepiece]



In [22]:
# Load model and tokenizer

from transformers import BertModel, BertTokenizer

model_name = "HooshvareLab/bert-base-parsbert-uncased"


In [23]:
from torch.utils.data import DataLoader, Dataset
from transformers import BertForSequenceClassification, BertTokenizer, AdamW
from sklearn.model_selection import train_test_split

In [24]:
# Read the CSV file with the sentiment data
data = pd.read_csv('Q3_data.csv')
texts = data['Text']
labels = data['Sentiment']

# Map string labels to integers
label_map = {
    'negative': 0,
    'very negative': 1,
    'positive': 2,
    'no sentiment expressed': 3,
    'very positive': 4,
    'mixed': 5
}

labels = labels.map(label_map)

# Split the data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(texts, labels, test_size=0.2, random_state=42)

In [25]:
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Define a custom dataset for sentiment classification
class CustomDataset(Dataset):
    def __init__(self, df, tokenizer):
        self.tweets = df['Text'].tolist()  # Convert the 'Text' column to a list
        self.labels = df['Sentiment'].map({'negative': 0, 'very negative': 1, 'positive': 2, 'no sentiment expressed': 3, 'very positive': 4, 'mixed': 5}).tolist()  # Convert the 'Sentiment' column to a list
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.tweets)

    def __getitem__(self, idx):
        tweet = self.tweets[idx]
        label = self.labels[idx]
        encoding = self.tokenizer.encode_plus(
            tweet,
            truncation=True,
            padding='max_length',
            max_length=128,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [26]:
# Create instances of the custom dataset for training and validation
# train_dataset = CustomDataset(train_texts, train_labels, tokenizer)
# val_dataset = CustomDataset(val_texts, val_labels, tokenizer)

# Define the BERT model for sentiment classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=6)

# Define the optimizer and learning rate scheduler
optimizer = AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
dataset = CustomDataset(data, tokenizer)
dataloader = DataLoader(dataset, batch_size=16)

In [28]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

cpu


In [29]:
import torch.nn as nn
from tqdm import tqdm

In [30]:
def validate(model, dataloader, device):
  """
  Function to perform validation on the model

  Args:
      model: The sentiment classification model
      dataloader: The dataloader for the validation set
      device: The device (CPU or GPU) to use

  Returns:
      The average validation loss
  """
  model.eval()  # Set the model to evaluation mode
  losses = []
  with torch.no_grad():  # Disable gradient calculation for validation
    for batch in tqdm(dataloader):
      batch = {k: v.to(device) for k, v in batch.items()}
      outputs = model(**batch)
      loss_function = nn.CrossEntropyLoss()
      loss = loss_function(outputs.logits, batch['labels'])
      losses.append(loss.item())
  return sum(losses) / len(losses)  # Calculate average validation loss

In [None]:
model.to(device)

# optimizer = AdamW(model.parameters(), lr=1e-5)

# Training loop with early stopping (optional)
num_epochs = 10  # Set the number of training epochs
patience = 3  # Number of epochs to wait for improvement before stopping (optional)

best_loss = float('inf')
epochs_without_improvement = 0

# for epoch in range(num_epochs):
#   print(f"--- Epoch {epoch+1} ---")

#   # Print data lengths for debugging
#   print(f"Length of training texts: {len(train_texts)}")
#   print(f"Length of training labels: {len(train_labels)}")
model.train()

for batch in tqdm(dataloader):
    batch = {k: v.to(device) for k, v in batch.items()}

    outputs = model(**batch)

    loss_function = nn.CrossEntropyLoss()
    loss = loss_function(outputs.logits, batch['labels'])

    print(f"Training Loss: {loss}")  # Print training loss after each batch (optional)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # # Validation
    # model.eval()  # Set model to evaluation mode again
    # with torch.no_grad():
    #   val_loss = validate(model, val_dataset, device)
    # print(f"Validation Loss: {val_loss}")

    # # Early stopping (optional)
    # if val_loss < best_loss:
    #   best_loss = val_loss
    #   epochs_without_improvement = 0
    # else:
    #   epochs_without_improvement += 1
    # if epochs_without_improvement >= patience:
    #   print("Early stopping triggered")
    #   break

print("Training complete!")

  0%|          | 0/1250 [00:00<?, ?it/s]

Training Loss: 1.739759922027588


  0%|          | 1/1250 [00:24<8:26:49, 24.35s/it]

Training Loss: 1.9425362348556519


  0%|          | 2/1250 [00:46<8:01:10, 23.13s/it]

Training Loss: 1.6889756917953491


  0%|          | 3/1250 [01:00<6:36:25, 19.07s/it]

Training Loss: 1.661017656326294


  0%|          | 4/1250 [01:14<5:53:12, 17.01s/it]

Training Loss: 1.7228882312774658


  0%|          | 5/1250 [01:29<5:34:44, 16.13s/it]

Training Loss: 1.7215032577514648


  0%|          | 6/1250 [01:44<5:25:59, 15.72s/it]

Training Loss: 1.631456732749939


  1%|          | 7/1250 [01:58<5:15:04, 15.21s/it]

Training Loss: 1.5778849124908447


  1%|          | 8/1250 [02:12<5:05:45, 14.77s/it]

Training Loss: 1.7770164012908936


  1%|          | 9/1250 [02:26<4:59:13, 14.47s/it]

Training Loss: 1.622807502746582


  1%|          | 10/1250 [02:39<4:55:11, 14.28s/it]

Training Loss: 1.685881495475769


  1%|          | 11/1250 [02:54<4:55:58, 14.33s/it]

Training Loss: 1.652949333190918


  1%|          | 12/1250 [03:08<4:52:41, 14.19s/it]

Training Loss: 1.5635567903518677


  1%|          | 13/1250 [03:22<4:51:04, 14.12s/it]

Training Loss: 1.491033911705017


  1%|          | 14/1250 [03:36<4:49:32, 14.06s/it]

Training Loss: 1.4514977931976318


  1%|          | 15/1250 [03:50<4:49:55, 14.09s/it]

Training Loss: 1.6338839530944824


  1%|▏         | 16/1250 [04:04<4:48:40, 14.04s/it]

Training Loss: 1.3752135038375854


  1%|▏         | 17/1250 [04:17<4:46:52, 13.96s/it]

Training Loss: 1.4200842380523682


  1%|▏         | 18/1250 [04:31<4:45:46, 13.92s/it]

Training Loss: 1.5077793598175049


  2%|▏         | 19/1250 [04:45<4:47:16, 14.00s/it]

Training Loss: 1.627557396888733


  2%|▏         | 20/1250 [05:00<4:48:02, 14.05s/it]

Training Loss: 1.5712246894836426


  2%|▏         | 21/1250 [05:13<4:46:12, 13.97s/it]

Training Loss: 1.2049825191497803


  2%|▏         | 22/1250 [05:27<4:44:37, 13.91s/it]

Training Loss: 1.5732909440994263


  2%|▏         | 23/1250 [05:41<4:42:58, 13.84s/it]

Training Loss: 1.2533743381500244


  2%|▏         | 24/1250 [05:56<4:52:35, 14.32s/it]

Training Loss: 1.5517573356628418


  2%|▏         | 25/1250 [06:10<4:48:27, 14.13s/it]

Training Loss: 1.7028117179870605


  2%|▏         | 26/1250 [06:24<4:45:52, 14.01s/it]

Training Loss: 1.5169092416763306


  2%|▏         | 27/1250 [06:37<4:43:51, 13.93s/it]

Training Loss: 1.4471746683120728


  2%|▏         | 28/1250 [06:56<5:12:56, 15.37s/it]

Training Loss: 1.3430225849151611


  2%|▏         | 29/1250 [07:10<5:04:15, 14.95s/it]

Training Loss: 1.363461971282959


  2%|▏         | 30/1250 [07:24<4:58:29, 14.68s/it]

Training Loss: 1.47640061378479


  2%|▏         | 31/1250 [07:38<4:51:45, 14.36s/it]

Training Loss: 1.4301021099090576


  3%|▎         | 32/1250 [07:51<4:46:57, 14.14s/it]

Training Loss: 1.5456136465072632


  3%|▎         | 33/1250 [08:05<4:44:19, 14.02s/it]

Training Loss: 1.3992462158203125


  3%|▎         | 34/1250 [08:21<4:57:36, 14.68s/it]

Training Loss: 1.4732033014297485


  3%|▎         | 35/1250 [08:35<4:52:23, 14.44s/it]

Training Loss: 1.4564619064331055


  3%|▎         | 36/1250 [08:49<4:47:44, 14.22s/it]

Training Loss: 1.3881242275238037


  3%|▎         | 37/1250 [09:03<4:44:43, 14.08s/it]

Training Loss: 1.7320538759231567


  3%|▎         | 38/1250 [09:17<4:46:57, 14.21s/it]

Training Loss: 1.520937442779541


  3%|▎         | 39/1250 [09:31<4:44:14, 14.08s/it]

Training Loss: 1.2901990413665771


  3%|▎         | 40/1250 [09:45<4:42:34, 14.01s/it]

Training Loss: 1.4177172183990479


  3%|▎         | 41/1250 [09:59<4:41:36, 13.98s/it]

Training Loss: 1.4929841756820679


  3%|▎         | 42/1250 [10:13<4:41:51, 14.00s/it]

Training Loss: 1.3231526613235474


  3%|▎         | 43/1250 [10:26<4:39:52, 13.91s/it]

Training Loss: 1.322385549545288


  4%|▎         | 44/1250 [10:40<4:39:22, 13.90s/it]

Training Loss: 1.322267770767212


  4%|▎         | 45/1250 [10:54<4:38:10, 13.85s/it]

Training Loss: 1.4564869403839111


  4%|▎         | 46/1250 [11:08<4:38:24, 13.87s/it]

Training Loss: 1.4160504341125488


  4%|▍         | 47/1250 [11:22<4:39:07, 13.92s/it]

Training Loss: 1.4505811929702759


  4%|▍         | 48/1250 [11:36<4:39:36, 13.96s/it]

Training Loss: 1.6229876279830933


  4%|▍         | 49/1250 [11:50<4:38:21, 13.91s/it]

Training Loss: 1.5026118755340576


  4%|▍         | 50/1250 [12:04<4:36:59, 13.85s/it]

Training Loss: 1.7393150329589844


  4%|▍         | 51/1250 [12:18<4:37:41, 13.90s/it]

Training Loss: 1.2649775743484497


  4%|▍         | 52/1250 [12:31<4:36:22, 13.84s/it]

Training Loss: 1.2141255140304565


  4%|▍         | 53/1250 [12:45<4:35:41, 13.82s/it]

Training Loss: 1.5424550771713257


  4%|▍         | 54/1250 [12:59<4:35:38, 13.83s/it]

Training Loss: 1.313936710357666


  4%|▍         | 55/1250 [13:13<4:36:35, 13.89s/it]

Training Loss: 1.3682994842529297


In [None]:
# 1. fine-tune the model base on all tweets


In [None]:
# 2. find 10 nearest words from "آزادی"

##### Describe advantages and disadvantages of Contextualized embedding

Advantages:


Disadvantages:
