# HOMEWORK ASSIGNMENT - TEXT CLASSIFICATION

 **Objective:**

 In this assignment, you will apply everything you've learned to a new, practical
 challenge: text classification. You will build a complete pipeline that:

 1.  Loads a real-world Persian dataset of magazine articles.
 2.  Applies a rule-based method to classify the dataset.
 3.  Uses a powerful, modern embedding model to convert these articles into vectors.
 4.  Visualizes the embedding vectors.
 5.  Trains several classic machine learning models on these embeddings to
     predict the category of each article.
 6.  Evaluates the performance of these models.
 7.  Builds a final inference pipeline to classify new, unseen text.

 This is a common and powerful technique used in industry for tasks like spam
 detection, sentiment analysis, and topic categorization.

 Complete all `#TODO`s in the implementation.


Resources:

 https://huggingface.co/BAAI/bge-m3

 https://huggingface.co/jinaai/jina-embeddings-v4

 https://huggingface.co/datasets/MCINext/digikala-magazine

## Step 0: Setup

In [None]:
!pip install transformers datasets torch umap-learn scikit-learn matplotlib Pillow insightface onnxruntime -q


In [None]:
import torch
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
from datasets import load_dataset
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import TSNE, MDS, Isomap
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from sklearn.decomposition import PCA
import requests
from tqdm.auto import tqdm
import umap
from pathlib import Path
import cv2
from insightface.app import FaceAnalysis
from typing import Tuple, List
import plotly.express as px
import pandas as pd
import os
import pickle

## Step 1: Load and Prepare the Digikala Magazine Dataset

 We will use the 'MCINext/digikala-magazine' dataset from the Hugging Face Hub.
 It contains articles and their corresponding categories.

 https://huggingface.co/datasets/MCINext/digikala-magazine

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from collections import Counter

In [None]:
# Load the dataset
print("Loading dataset...")
magazine_dataset_train = load_dataset('MCINext/digikala-magazine', split='train')
magazine_dataset_valid =  load_dataset('MCINext/digikala-magazine', split='validation')
magazine_dataset_test =  load_dataset('MCINext/digikala-magazine', split='test')
print("Dataset loaded successfully.")

Loading dataset...



Dataset loaded successfully.


 --- HOMEWORK TASK 1: Prepare the Datasets ---

 Your tasks:
 1.  Convert the `train`, `validation`, and `test` splits into pandas DataFrames.
 2.  Create and fit a `LabelEncoder` using ONLY the labels from the training data.
     This is crucial to prevent "data leakage" from the validation/test sets.
 3.  Transform the labels in all three DataFrames (train, validation, test) to
     create new 'label_id' columns.
 4.  Create the final variables for text and labels for all three splits:
     - `X_train_text`, `y_train`
     - `X_valid_text`, `y_valid`
     - `X_test_text`, `y_test`
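
 A minimal sketch of what fitting on the training labels only means in practice, using toy English labels that are assumptions for illustration (they are not from the Digikala dataset). The encoder learns its classes from the training split, and a label it has never seen cannot be transformed:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (hypothetical, for illustration only).
train_labels = ["games", "tech", "health", "tech"]
valid_labels = ["tech", "games"]

encoder = LabelEncoder()
encoder.fit(train_labels)  # classes are learned from the training split only

train_ids = encoder.transform(train_labels)
valid_ids = encoder.transform(valid_labels)  # reuses the training mapping

# A label that never appeared in training cannot be encoded.
try:
    encoder.transform(["books"])
    unseen_label_accepted = True
except ValueError:
    unseen_label_accepted = False

print(list(encoder.classes_), list(valid_ids), unseen_label_accepted)
```

 If a validation or test split contained a category missing from the training split, `transform` would raise a `ValueError`, which is a useful early warning rather than a silent mismatch.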

In [None]:
# 1. Convert Hugging Face datasets to pandas DataFrames.
df_train = magazine_dataset_train.to_pandas()
df_valid = magazine_dataset_valid.to_pandas()
df_test = magazine_dataset_test.to_pandas()

# 2. Initialize and fit the LabelEncoder on the training data labels.
label_encoder = LabelEncoder()
label_encoder.fit(df_train['label'])

# 3. Transform labels for all splits.
df_train['label_id'] = label_encoder.transform(df_train['label'])
df_valid['label_id'] = label_encoder.transform(df_valid['label'])
df_test['label_id'] = label_encoder.transform(df_test['label'])

# 4. Create the final variables.
X_train_text = df_train['content']
y_train = df_train['label_id']

X_valid_text = df_valid['content']
y_valid = df_valid['label_id']

X_test_text = df_test['content']
y_test = df_test['label_id']

In [None]:
# Create mappings for later use.
id_to_label = {i: label for i, label in enumerate(label_encoder.classes_)}
label_to_id = {label: i for i, label in id_to_label.items()}
num_classes = len(id_to_label)

print(f"Dataset prepared.")
print(f"Number of classes: {num_classes}")
print(f"Training set size: {len(X_train_text)}")
print(f"Validation set size: {len(X_valid_text)}")
print(f"Testing set size: {len(X_test_text)}")
print(f"Label to ID: {label_to_id}")
print(f"ID to Label: {id_to_label}")

Dataset prepared.
Number of classes: 7
Training set size: 6896
Validation set size: 767
Testing set size: 852
Label to ID: {'بازی ویدیویی': 0, 'راهنمای خرید': 1, 'سلامت و زیبایی': 2, 'علم و تکنولوژی': 3, 'عمومی': 4, 'هنر و سینما': 5, 'کتاب و ادبیات': 6}
ID to Label: {0: 'بازی ویدیویی', 1: 'راهنمای خرید', 2: 'سلامت و زیبایی', 3: 'علم و تکنولوژی', 4: 'عمومی', 5: 'هنر و سینما', 6: 'کتاب و ادبیات'}


## Step 2: Rule-Based Text Classification

 **Objective:**

 Before the rise of deep learning and embedding models, many NLP tasks were
 handled by rule-based systems. In this section, we will build a simple but
 effective keyword-based classifier to solve the same Digikala Magazine
 problem.

 This will allow you to directly compare the two approaches and understand their
 respective strengths and weaknesses.

 The core idea of this "Bag-of-Words" approach is simple: if a text contains
 enough words from a specific category's keyword list, we classify it as
 belonging to that category.

In [None]:
from collections import Counter
import pprint

 Automatically Generate Keyword Dictionaries from the Dataset

 Instead of manually creating keywords, a more systematic approach is to extract
 the most frequent and relevant words directly from the training data for each category.
 We will now remove common "stopwords" (like 'از', 'به', 'که', etc.) to get more
 meaningful keywords.

In [None]:
# Define the provided list of Persian stopwords for filtering (from hazm)
persian_stopwords = ['آخرین', 'آقای', 'آمد', 'آمده', 'آمده_است', 'آن', 'آنان',
                     'آنجا', 'آنها', 'آنچه', 'آنکه', 'آورد', 'آوری', 'آیا',
                     'ابتدا', 'اثر', 'اجرا', 'اخیر', 'از', 'است', 'اش', 'اغلب',
                     'افراد', 'افرادی', 'افزود', 'البته', 'اما', 'امر', 'امکان',
                     'اند', 'او', 'اول', 'اولین', 'اکنون', 'اگر', 'ایشان', 'این',
                     'اینجا', 'اینکه', 'با', 'بار', 'باره', 'باز', 'باشد', 'باشند',
                     'باعث', 'بالا', 'باید', 'بخش', 'بخشی', 'بدون', 'بر', 'برابر',
                     'براساس', 'برای', 'برخی', 'برداری', 'بروز', 'بزرگ', 'بسیار',
                     'بسیاری', 'بعد', 'بعضی', 'بلکه', 'بنابراین', 'بندی', 'به',
                     'بهتر', 'بهترین', 'بود', 'بودن', 'بودند', 'بوده', 'بوده_است',
                     'بی', 'بیان', 'بیرون', 'بیش', 'بیشتر', 'بیشتری', 'بین', 'تا',
                     'تاکنون', 'تبدیل', 'تحت', 'ترتیب', 'تعداد', 'تعیین', 'تغییر',
                     'تمام', 'تمامی', 'تنها', 'تهیه', 'تو', 'جا', 'جاری', 'جای',
                     'جایی', 'جدی', 'جدید', 'جریان', 'جز', 'جمع', 'جمعی', 'حال',
                     'حالا', 'حالی', 'حتی', 'حد', 'حداقل', 'حدود', 'حل', 'خاص',
                     'خاطرنشان', 'خصوص', 'خطر', 'خواهد_بود', 'خواهد_شد', 'خواهد_کرد',
                     'خوب', 'خوبی', 'خود', 'خودش', 'خویش', 'خیلی', 'داد', 'دادن',
                     'دادند', 'داده', 'داده_است', 'دار', 'دارای', 'دارد', 'دارند',
                     'داریم', 'داشت', 'داشتن', 'داشتند', 'داشته', 'داشته_است',
                     'داشته_باشد', 'داشته_باشند', 'دانست', 'در', 'درباره', 'درون',
                     'دسته', 'دهد', 'دهند', 'دهه', 'دو', 'دوباره', 'دور', 'دوم',
                     'دچار', 'دیگر', 'دیگران', 'دیگری', 'را', 'راه', 'رسید', 'رسیدن',
                     'رشد', 'رفت', 'رو', 'روبه', 'روش', 'روند', 'روی', 'ریزی', 'زاده',
                     'زیاد', 'زیادی', 'زیر', 'زیرا', 'ساز', 'سازی', 'ساله', 'سالهای',
                     'سال\u200cهای', 'سایر', 'سبب', 'سراسر', 'سعی', 'سمت', 'سه', 'سهم',
                     'سوم', 'سوی', 'سپس', 'سی', 'شامل', 'شان', 'شاید', 'شخصی', 'شد',
                     'شدن', 'شدند', 'شده', 'شده_است', 'شده_اند', 'شده_بود', 'شروع',
                     'شش', 'شما', 'شمار', 'شود', 'شوند', 'صرف', 'ضمن', 'طبق', 'طرف',
                     'طور', 'طول', 'طی', 'ع', 'عالی', 'عدم', 'علاوه', 'علت', 'علیه',
                     'عهده', 'عین', 'غیر', 'فرد', 'فردی', 'فقط', 'فوق', 'فکر', 'قابل',
                     'قبل', 'لازم', 'لحاظ', 'لذا', 'ما', 'مانند', 'متاسفانه', 'متر',
                     'متفاوت', 'مثل', 'محسوب', 'مدت', 'مربوط', 'مشخص', 'ممکن', 'من',
                     'مناسب', 'منظور', 'مهم', 'مواجه', 'موجب', 'مورد', 'می', 'میان',
                     'می\u200cآید', 'می\u200cباشد', 'می\u200cتوان', 'می\u200cتواند',
                     'می\u200cتوانند', 'می\u200cدهد', 'می\u200cدهند', 'می\u200cرسد',
                     'می\u200cرود', 'می\u200cشد', 'می\u200cشود', 'می\u200cشوند',
                     'می\u200cکرد', 'می\u200cکردند', 'می\u200cکند', 'می\u200cکنم',
                     'می\u200cکنند', 'می\u200cکنیم', 'می\u200cگوید', 'می\u200cگویند',
                     'می\u200cگیرد', 'می\u200cیابد', 'ناشی', 'نباید', 'نبود', 'نحوه',
                     'نخست', 'نخستین', 'ندارد', 'ندارند', 'نزدیک', 'نسبت', 'نشست',
                     'نظر', 'نظیر', 'نمی\u200cشود', 'نه', 'نوع', 'نوعی', 'نیاز',
                     'نیز', 'نیست', 'نیستند', 'نیمه', 'هایی', 'هر', 'هستند', 'هستیم',
                     'هم', 'همان', 'همه', 'همواره', 'همچنان', 'همچنین', 'همچون',
                     'همیشه', 'همین', 'هنوز', 'هنگام', 'هیچ', 'و', 'وارد', 'وجود',
                     'وقتی', 'ولی', 'وگو', 'وی', 'ویژه', 'پخش', 'پر', 'پس', 'پشت',
                     'پنج', 'پی', 'پیدا', 'پیش', 'چرا', 'چند', 'چنین', 'چه', 'چهار',
                     'چهارم', 'چون', 'چگونه', 'چیز', 'چیزی', 'کافی', 'کامل', 'کاملا',
                     'کدام', 'کرد', 'کردم', 'کردن', 'کردند', 'کرده', 'کرده_است',
                     'کرده_اند', 'کرده_بود', 'کسانی', 'کسی', 'کل', 'کلی', 'کم',
                     'کمی', 'کنار', 'کند', 'کنم', 'کنند', 'کننده', 'کنندگان', 'کنید',
                     'کنیم', 'که', 'کوچک', 'گاه', 'گذاری', 'گردد', 'گرفت', 'گرفته',
                     'گرفته_است', 'گروهی', 'گفت', 'گفته', 'گونه', 'گیرد', 'گیری',
                     'یا', 'یابد', 'یافت', 'یافته', 'یافته_است', 'یعنی', 'یک', 'یکدیگر',
                     'یکی']

In [None]:
print("--- Building a Rule-Based Classifier ---")
# Using a set for faster lookups
persian_stopwords_set = set(persian_stopwords)

def generate_top_keywords(df, category, stopwords, num_keywords=200):
    """
    Extracts the most frequent non-stopword tokens for a given category
    using simple whitespace tokenization.
    """
    print(f"Generating keywords for category: {category}...")
    # Filter texts for the specific category (use the DataFrame passed in, not a global)
    category_texts = df[df['label'] == category]

    # Combine all texts into a single string
    full_text = category_texts['content'].str.cat(sep=" ")

    # Tokenize the text by splitting on whitespace
    tokens = full_text.split()

    # Filter out stopwords (punctuation, digits, and Latin tokens are kept as-is)
    filtered_tokens = [t for t in tokens if t not in stopwords]

    # Count word frequencies
    word_counts = Counter(filtered_tokens)

    # Keep the `num_keywords` most common tokens
    top_keywords = [word for word, count in word_counts.most_common(num_keywords)]
    return top_keywords

# Define the categories we want to build the dictionary for
categories_to_process = [
    'بازی ویدیویی',
    'راهنمای خرید',
    'سلامت و زیبایی',
    'علم و تکنولوژی',
    'هنر و سینما',
    'کتاب و ادبیات'
]

# Generate the keyword dictionary automatically
keyword_dictionary = {}
for category in categories_to_process:
    keyword_dictionary[category] = generate_top_keywords(df_train, category, persian_stopwords_set)

print("Keyword dictionary automatically generated from the dataset (stopwords removed):")
pprint.pprint(keyword_dictionary)

--- Building a Rule-Based Classifier ---
Generating keywords for category: بازی ویدیویی...
Generating keywords for category: راهنمای خرید...
Generating keywords for category: سلامت و زیبایی...
Generating keywords for category: علم و تکنولوژی...
Generating keywords for category: هنر و سینما...
Generating keywords for category: کتاب و ادبیات...
Keyword dictionary automatically generated from the dataset (stopwords removed):
{'بازی ویدیویی': ['بازی',
                  'بازی\u200cهای',
                  'است.',
                  'سال',
                  'قرار',
                  'شرکت',
                  'قسمت',
                  'عرضه',
                  'آن\u200cها',
                  'ایکس\u200cباکس',
                  'خواهد',
                  'کنسول',
                  'ساخت',
                  '۲',
                  'of',
                  'است\u200c',
                  'پلی\u200cاستیشن',
                  'کار',
                  '–',
                  'مجموعه',
                  '

 Now, we'll create the logic to classify texts based on our dictionary and
 then evaluate its performance on the test set (restricted to the categories in
 the dictionary) that we will also use for the ML models.

In [None]:
# Implement and Evaluate the Rule-Based Classifier
def classify_with_keywords(text, dictionary):
    """
    Classifies a text by finding which category's keyword list has the most matches.
    """
    scores = {category: 0 for category in dictionary.keys()}

    for category, keywords in dictionary.items():
        for keyword in keywords:
            if keyword in text:
                scores[category] += 1

    # Find the category with the highest score
    # If all scores are 0, no keywords were found
    if all(score == 0 for score in scores.values()):
        return "Uncategorized"

    # If there's a tie, this will pick one, but in a real system, you might have tie-breaking rules
    best_category = max(scores, key=scores.get)
    return best_category
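
 Note that `keyword in text` is a substring test, while the keywords themselves were extracted as whitespace tokens. A short keyword can therefore match inside longer words. The sketch below contrasts the two matching styles on a toy English dictionary; the function names `substring_scores` and `token_scores` and the toy data are assumptions for illustration, and the results reported in this notebook come from the substring version above.

```python
def substring_scores(text, dictionary):
    """Counts keywords that occur anywhere in the text (same test as `keyword in text`)."""
    return {cat: sum(kw in text for kw in kws) for cat, kws in dictionary.items()}

def token_scores(text, dictionary):
    """Hypothetical variant: counts keywords that appear as whole whitespace tokens."""
    tokens = set(text.split())
    return {cat: sum(kw in tokens for kw in kws) for cat, kws in dictionary.items()}

# Toy dictionary and sentence (illustrative only).
toy_dictionary = {"tech": ["art", "phone"], "cinema": ["film", "actor"]}
toy_text = "the smartphone market is starting to grow"

sub = substring_scores(toy_text, toy_dictionary)
tok = token_scores(toy_text, toy_dictionary)
print("substring:", sub)
print("token:    ", tok)
```

 Here "art" and "phone" both match inside "smartphone" under substring matching but not under token matching. Which behaviour is preferable depends on the language and tokenization; for Persian text with attached suffixes, substring matching may actually help recall, so this is a trade-off to test rather than a fix.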

In [None]:
# --- Evaluate on the Test Set ---
# First, get the label_ids for the categories we are testing
categories_to_test = list(keyword_dictionary.keys())
label_ids_to_test = [label_to_id[cat] for cat in categories_to_test if cat in label_to_id]

# Filter the test set
test_indices_to_use = y_test.isin(label_ids_to_test)
X_test_subset = X_test_text[test_indices_to_use]
y_test_subset = y_test[test_indices_to_use]

print(f"Evaluating on a subset of the test data ({len(X_test_subset)} samples) for our defined categories.")

# Make predictions on the subset
y_pred_rules = [classify_with_keywords(text, keyword_dictionary) for text in X_test_subset]

# We need to convert our predicted string labels back to the integer IDs for the report
y_pred_rules_ids = [label_to_id.get(pred, -1) for pred in y_pred_rules] # Use -1 for "Uncategorized"

print("Rule-Based Classifier Report:")
print(classification_report(y_test_subset, y_pred_rules_ids, labels=label_ids_to_test, target_names=categories_to_test))

Evaluating on a subset of the test data (840 samples) for our defined categories.
Rule-Based Classifier Report:
                precision    recall  f1-score   support

  بازی ویدیویی       0.91      0.87      0.89       197
  راهنمای خرید       0.34      0.92      0.50        13
سلامت و زیبایی       0.86      0.72      0.78       161
علم و تکنولوژی       0.92      0.91      0.92       277
   هنر و سینما       0.92      0.86      0.89       167
 کتاب و ادبیات       0.47      0.96      0.63        25

      accuracy                           0.86       840
     macro avg       0.74      0.87      0.77       840
  weighted avg       0.88      0.86      0.86       840
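
 The gap between the macro and weighted averages in the report above can be reproduced by hand. The sketch below recomputes both precision averages from the rounded per-class values printed in the report, so it is only expected to agree to two decimals:

```python
# (precision, support) pairs copied from the rule-based report above.
rows = {
    "بازی ویدیویی": (0.91, 197),
    "راهنمای خرید": (0.34, 13),
    "سلامت و زیبایی": (0.86, 161),
    "علم و تکنولوژی": (0.92, 277),
    "هنر و سینما": (0.92, 167),
    "کتاب و ادبیات": (0.47, 25),
}

precisions = [p for p, _ in rows.values()]
supports = [s for _, s in rows.values()]

# Macro average: every class counts equally, regardless of size.
macro_precision = sum(precisions) / len(precisions)

# Weighted average: each class is weighted by its number of test samples.
weighted_precision = sum(p * s for p, s in rows.values()) / sum(supports)

print(f"macro precision:    {macro_precision:.2f}")
print(f"weighted precision: {weighted_precision:.2f}")
```

 The two small classes (13 and 25 samples) have low precision, which pulls the macro average down to about 0.74, while the weighted average stays near 0.88 because the large classes dominate it.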



In [None]:
# Build the Rule-Based Inference Pipeline
class RuleBasedClassifier:
    def __init__(self, keyword_dictionary):
        """
        Initializes the classifier with a dictionary of keywords.
        """
        self.dictionary = keyword_dictionary

    def predict(self, text: str) -> str:
        """
        Predicts the category of a single text string based on keyword matching.
        """
        return classify_with_keywords(text, self.dictionary)

In [None]:
# --- Test the Inference Pipeline ---
rule_based_pipeline = RuleBasedClassifier(keyword_dictionary)

# Use the same sentences from the previous homework part for a direct comparison
test_sentence_1 = """
مینگ‌چی کو، تحلیلگر و افشاگر سرشناس محصولات اپل، می‌گوید این شرکت روی مدل جدید آیپد مینی با پردازنده‌ی تقویت‌شده کار می‌کند. احتمالاً آیپد مینی نسل جدید تا پایان ۲۰۲۳ یا نیمه‌ی اول ۲۰۲۴ از راه نمی‌رسد.
آیپد مینی در پایان سال ۲۰۲۱ با طراحی کاملاً جدید به‌روز شد. این تبلت از زمان رونمایی در سال ۲۰۱۲، تغییرات زیادی به خود ندیده بود. این تبلت ۸٫۳ اینچی جایگاهی میان بزرگ‌ترین آیفون (مدل پرو مکس) و آیپد ۱۰٫۹ اینچ دارد.
"""
test_sentence_2 = """
ماکارونی یکی از غذاهای بسیار محبوب در جهان است که به عنوان یک غذای بین المللی در سراسر جهان شناخته شده می باشد. ماکارونی هم مانند غذاهایی مثل لازانیا و پاستا اصالتی ایتالیایی دارد. آشپزهای ایرانی ماکارونی را با روشی درست می کنند که بیشتر باب میل ایرانیان است
 ، زیرا در بیشتر کشورها ماکارونی را در آب جوش می ریزند و بعد از ۱۵ دقیقه با سس کچاپ سرو می کنند، برای مشاهده آموزش کامل و مرحله به مرحله طرز تهیه ماکارونی ایرانی در ادامه با سایت اموزشی چی شی همراه باشید.
"""
test_sentence_3 = """
به گزارش روابط‌عمومی خانه کتاب و ادبیات ایران، به مناسبت هزار و پانصدمین سالگرد میلاد پیامبر اکرم (ص)، بخش ویژه‌ای با محوریت موضوعات مرتبط با معارف نبوی به چهل‌وسومین دوره جایزه کتاب سال جمهوری اسلامی ایران افزوده شد.
"""

prediction_1 = rule_based_pipeline.predict(test_sentence_1)
prediction_2 = rule_based_pipeline.predict(test_sentence_2)
prediction_3 = rule_based_pipeline.predict(test_sentence_3)

print(f"--- Rule-Based Inference Test ---")
print(f"Sentence: '{test_sentence_1}'")
print(f"Predicted Category: '{prediction_1}'")
print("-" * 20)
print(f"Sentence: '{test_sentence_2}'")
print(f"Predicted Category: '{prediction_2}'")
print("-" * 20)
print(f"Sentence: '{test_sentence_3}'")
print(f"Predicted Category: '{prediction_3}'")

--- Rule-Based Inference Test ---
Sentence: '
مینگ‌چی کو، تحلیلگر و افشاگر سرشناس محصولات اپل، می‌گوید این شرکت روی مدل جدید آیپد مینی با پردازنده‌ی تقویت‌شده کار می‌کند. احتمالاً آیپد مینی نسل جدید تا پایان ۲۰۲۳ یا نیمه‌ی اول ۲۰۲۴ از راه نمی‌رسد.
آیپد مینی در پایان سال ۲۰۲۱ با طراحی کاملاً جدید به‌روز شد. این تبلت از زمان رونمایی در سال ۲۰۱۲، تغییرات زیادی به خود ندیده بود. این تبلت ۸٫۳ اینچی جایگاهی میان بزرگ‌ترین آیفون (مدل پرو مکس) و آیپد ۱۰٫۹ اینچ دارد.
'
Predicted Category: 'علم و تکنولوژی'
--------------------
Sentence: '
ماکارونی یکی از غذاهای بسیار محبوب در جهان است که به عنوان یک غذای بین المللی در سراسر جهان شناخته شده می باشد. ماکارونی هم مانند غذاهایی مثل لازانیا و پاستا اصالتی ایتالیایی دارد. آشپزهای ایرانی ماکارونی را با روشی درست می کنند که بیشتر باب میل ایرانیان است
 ، زیرا در بیشتر کشورها ماکارونی را در آب جوش می ریزند و بعد از ۱۵ دقیقه با سس کچاپ سرو می کنند، برای مشاهده آموزش کامل و مرحله به مرحله طرز تهیه ماکارونی ایرانی در ادامه با سایت اموزشی چی شی همراه باشید.
'

## Step 3: Load the Embedding Model

 We will test two different models. BGE-M3 is a powerful multilingual model,
 while Jina is specialized for Persian (Farsi), which is the language of our dataset.

https://huggingface.co/BAAI/bge-m3

https://huggingface.co/jinaai/jina-embeddings-v4

In [None]:
# Choose your model: 'bge' or 'jina'
MODEL_CHOICE = 'bge' # You can switch this to 'jina'

if MODEL_CHOICE == 'bge':
    CLASSIFY_MODEL_NAME = 'BAAI/bge-m3'
elif MODEL_CHOICE == 'jina':
    CLASSIFY_MODEL_NAME = 'jinaai/jina-embeddings-v2-base-fa'
else:
    raise ValueError("Invalid model choice. Choose 'bge' or 'jina'.")

print(f"Selected embedding model: {CLASSIFY_MODEL_NAME}")

Selected embedding model: BAAI/bge-m3


 --- HOMEWORK TASK 2: Load the Tokenizer and Model ---

 Your task:
 1.  Load the tokenizer for the selected `CLASSIFY_MODEL_NAME`.
 2.  Load the pre-trained model for the selected `CLASSIFY_MODEL_NAME`.
 3.  Move the model to the correct device (GPU if available).

In [None]:
# import torch_xla.core.xla_model as xm

# Load the tokenizer from Hugging Face.
classify_tokenizer = AutoTokenizer.from_pretrained(CLASSIFY_MODEL_NAME)

# Load the model from Hugging Face.
classify_model = AutoModel.from_pretrained(CLASSIFY_MODEL_NAME)

# Move the model to the GPU if available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = xm.xla_device()
#TODO: Move the model to device
classify_model.to(device)

print(f"Model '{CLASSIFY_MODEL_NAME}' loaded successfully.")


Model 'BAAI/bge-m3' loaded successfully.


## Step 4: Generate Embeddings for the Dataset

 Now we'll use the loaded model to convert our training, validation, and test
 text into numerical embeddings.

 --- HOMEWORK TASK 3: Generate Text Embeddings ---

 Your task:

 Use the `generate_embeddings` function to create vector representations
     for the training, validation, and test text data (`X_train_text`,
     `X_valid_text`, and `X_test_text`).

In [None]:
def mean_pooling(model_output, attention_mask):
    """Averages token embeddings over the sequence, ignoring padding positions."""
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

def generate_embeddings(texts, model, tokenizer, batch_size=8):
    """Encodes texts in batches and returns a (num_texts, hidden_size) NumPy array."""
    all_embeddings = []
    # Check if input is a pandas Series and convert to list if so
    text_list = texts.tolist() if isinstance(texts, pd.Series) else texts

    for i in tqdm(range(0, len(text_list), batch_size), desc="Generating Embeddings"):
        batch = text_list[i : i+batch_size]

        # Texts longer than 256 tokens are truncated
        encoded_input = tokenizer(batch, padding=True, truncation=True, return_tensors="pt", max_length=256)
        encoded_input = {k: v.to(device) for k, v in encoded_input.items()}

        with torch.no_grad():
            model_output = model(**encoded_input)

        # BGE models often recommend using the [CLS] token's embedding
        if "bge-m3" in model.name_or_path:
            embeddings = model_output.last_hidden_state[:, 0]
        else:  # For other models like Jina or MiniLM, mean pooling is standard.
            embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

        all_embeddings.append(embeddings.cpu().numpy())
    return np.vstack(all_embeddings)
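
 To make the two pooling strategies concrete, here is a minimal NumPy sketch on a tiny hand-made batch. The array values and the helper name `mean_pool` are assumptions for illustration; the notebook itself does this with PyTorch tensors on real model outputs.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Masked mean over the sequence axis; padded positions contribute nothing."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Batch of 2 sequences, 3 token positions, hidden size 2.
token_embeddings = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

cls_vectors = token_embeddings[:, 0]            # first-token ([CLS]) pooling
mean_vectors = mean_pool(token_embeddings, attention_mask)

print("CLS pooling:\n", cls_vectors)
print("Mean pooling:\n", mean_vectors)
```

 The padded position with value 100 does not affect the mean of the first sequence, which is the purpose of multiplying by the attention mask before summing.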

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os

save_dir = "/content/drive/MyDrive/RahnamaCallage"
os.makedirs(save_dir, exist_ok=True)


In [None]:
# Define filenames for cached embeddings, making them unique to the chosen model
train_embedding_file = os.path.join(save_dir, f"X_train_embeddings_{MODEL_CHOICE}.pkl")
valid_embedding_file = os.path.join(save_dir, f"X_valid_embeddings_{MODEL_CHOICE}.pkl")
test_embedding_file = os.path.join(save_dir, f"X_test_embeddings_{MODEL_CHOICE}.pkl")

# --- Process Training Embeddings ---
if os.path.exists(train_embedding_file):
    print(f"Loading cached training embeddings from {train_embedding_file}...")
    with open(train_embedding_file, 'rb') as f:
        X_train_embeddings = pickle.load(f)
else:
    print(f"No cache found. Generating training embeddings...")
    X_train_embeddings = generate_embeddings(X_train_text, classify_model, classify_tokenizer)
    print(f"Saving training embeddings to {train_embedding_file}...")
    with open(train_embedding_file, 'wb') as f:
        pickle.dump(X_train_embeddings, f)



# --- Process Validation Embeddings ---
if os.path.exists(valid_embedding_file):
    print(f"Loading cached validation embeddings from {valid_embedding_file}...")
    with open(valid_embedding_file, 'rb') as f:
        X_valid_embeddings = pickle.load(f)
else:
    print(f"No cache found. Generating validation embeddings...")
    X_valid_embeddings = generate_embeddings(X_valid_text, classify_model, classify_tokenizer)
    print(f"Saving validation embeddings to {valid_embedding_file}...")
    with open(valid_embedding_file, 'wb') as f:
        pickle.dump(X_valid_embeddings, f)



# --- Process Testing Embeddings ---
if os.path.exists(test_embedding_file):
    print(f"Loading cached testing embeddings from {test_embedding_file}...")
    with open(test_embedding_file, 'rb') as f:
        X_test_embeddings = pickle.load(f)
else:
    print(f"No cache found. Generating testing embeddings...")
    X_test_embeddings = generate_embeddings(X_test_text, classify_model, classify_tokenizer)
    print(f"Saving testing embeddings to {test_embedding_file}...")
    with open(test_embedding_file, 'wb') as f:
        pickle.dump(X_test_embeddings, f)

print(f"Training embeddings ready. Shape: {X_train_embeddings.shape}")
print(f"Validation embeddings ready. Shape: {X_valid_embeddings.shape}")
print(f"Testing embeddings ready. Shape: {X_test_embeddings.shape}")
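
 The three load-or-generate blocks above follow the same pattern, so they could be folded into one helper. The sketch below is one possible shape; `load_or_compute` is a hypothetical name, and the demo uses a temporary directory and a stand-in function instead of the real model so that it runs anywhere.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute_fn):
    """Hypothetical helper: return the pickled object at `path`, or compute and cache it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# Stand-in for an expensive embedding call; records how often it actually runs.
calls = []
def fake_embeddings():
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "X_demo_embeddings.pkl")
    first = load_or_compute(cache_path, fake_embeddings)   # computes and saves
    second = load_or_compute(cache_path, fake_embeddings)  # loads from cache

print(f"compute function ran {len(calls)} time(s)")
```

 In this notebook the call would look like `load_or_compute(train_embedding_file, lambda: generate_embeddings(X_train_text, classify_model, classify_tokenizer))`. Keep in mind that pickle files should only be loaded from locations you trust.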

Loading cached training embeddings from /content/drive/MyDrive/RahnamaCallage/X_train_embeddings_bge.pkl...
Loading cached validation embeddings from /content/drive/MyDrive/RahnamaCallage/X_valid_embeddings_bge.pkl...
Loading cached testing embeddings from /content/drive/MyDrive/RahnamaCallage/X_test_embeddings_bge.pkl...
Training embeddings ready. Shape: (6896, 1024)
Validation embeddings ready. Shape: (767, 1024)
Testing embeddings ready. Shape: (852, 1024)


## Step 5: Visualize Embeddings with Dimensionality Reduction

 Before we train classifiers, let's visualize our high-dimensional training embeddings
 to see if the model has created meaningful clusters.

 NOTE: The training set is large. Running t-SNE or UMAP on all data points can be
 very slow. We will first take a smaller, stratified sample for visualization purposes.


In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

# --- Create a smaller sample for visualization ---
n_vis_samples = 200

# Ensure the sample size is not larger than the dataset
if n_vis_samples > len(X_train_embeddings):
    n_vis_samples = len(X_train_embeddings)

splitter = StratifiedShuffleSplit(n_splits=1, train_size=n_vis_samples, random_state=42)
# The train_index will give us the indices for our sample
for train_index, _ in splitter.split(X_train_embeddings, y_train):
    X_vis = X_train_embeddings[train_index]
    y_vis = y_train.iloc[train_index]
    text_vis = X_train_text.iloc[train_index]

print(f"Created a sample of {len(X_vis)} points for visualization.")
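
 A stratified sample keeps the class proportions of the full set, which matters here because some categories are much smaller than others. The sketch below checks this on synthetic labels (an assumption for illustration: 80 items of class 0 and 20 of class 1):

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, imbalanced labels: 80% class 0, 20% class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 4))  # placeholder features; only the labels drive stratification

splitter = StratifiedShuffleSplit(n_splits=1, train_size=10, random_state=42)
sample_idx, _ = next(splitter.split(X, y))

sample_counts = Counter(y[sample_idx].tolist())
print(sample_counts)
```

 The 10-point sample keeps the 80/20 ratio of the full label set. The same check can be run on `y_vis` with `Counter(y_vis)` to confirm that rare categories are still represented in the visualization sample.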

Created a sample of 200 points for visualization.


In [None]:
# --- t-SNE Visualization ---
print("Running t-SNE on the sample...")
tsne_hw = TSNE(n_components=2, perplexity=30, random_state=42, init='random', learning_rate='auto')
X_vis_tsne_2d = tsne_hw.fit_transform(X_vis)

# Create DataFrame for Plotly
df_hw_tsne = pd.DataFrame({
    'tsne_1': X_vis_tsne_2d[:, 0],
    'tsne_2': X_vis_tsne_2d[:, 1],
    'category': [id_to_label[id] for id in y_vis],
    'text': text_vis
})

# Create interactive plot
fig_hw_tsne = px.scatter(
    df_hw_tsne, x='tsne_1', y='tsne_2', color='category',
    hover_data={'text': True, 'category': False},
    title='Interactive t-SNE Visualization of Digikala Magazine Embeddings (Sample)'
)
fig_hw_tsne.show()

Running t-SNE on the sample...


In [None]:
# --- PCA Visualization ---
print("Running PCA on the sample...")
pca_hw = PCA(n_components=2, random_state=42)
X_vis_pca_2d = pca_hw.fit_transform(X_vis)

df_hw_pca = pd.DataFrame({
    'pca_1': X_vis_pca_2d[:, 0],
    'pca_2': X_vis_pca_2d[:, 1],
    'category': [id_to_label[id] for id in y_vis],
    'text': text_vis
})

fig_hw_pca = px.scatter(
    df_hw_pca, x='pca_1', y='pca_2', color='category',
    hover_data={'text': True, 'category': False},
    title='Interactive PCA Visualization of Digikala Magazine Embeddings (Sample)'
)
fig_hw_pca.show()

Running PCA on the sample...


In [None]:
# --- UMAP Visualization ---
print("Running UMAP on the sample...")

umap_hw = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_vis_umap_2d = umap_hw.fit_transform(X_vis)

df_hw_umap = pd.DataFrame({
    'umap_1': X_vis_umap_2d[:, 0],
    'umap_2': X_vis_umap_2d[:, 1],
    'category': [id_to_label[id] for id in y_vis],
    'text': text_vis
})

fig_hw_umap = px.scatter(
    df_hw_umap, x='umap_1', y='umap_2', color='category',
    hover_data={'text': True, 'category': False},
    title='Interactive UMAP Visualization of Digikala Magazine Embeddings (Sample)'
)
fig_hw_umap.show()

Running UMAP on the sample...



n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



In [None]:
# --- MDS Visualization ---
print("Running MDS on the sample...")
mds_hw = MDS(n_components=2, random_state=42, n_jobs=-1)
X_vis_mds_2d = mds_hw.fit_transform(X_vis)

df_hw_mds = pd.DataFrame({
    'mds_1': X_vis_mds_2d[:, 0],
    'mds_2': X_vis_mds_2d[:, 1],
    'category': [id_to_label[id] for id in y_vis],
    'text': text_vis
})

fig_hw_mds = px.scatter(
    df_hw_mds, x='mds_1', y='mds_2', color='category',
    hover_data={'text': True, 'category': False},
    title='Interactive MDS Visualization of Digikala Magazine Embeddings (Sample)'
)
fig_hw_mds.show()

Running MDS on the sample...


In [None]:
# --- Isomap Visualization ---
print("Running Isomap on the sample...")
isomap_hw = Isomap(n_components=2, n_neighbors=10, n_jobs=-1)
X_vis_isomap_2d = isomap_hw.fit_transform(X_vis)

df_hw_isomap = pd.DataFrame({
    'isomap_1': X_vis_isomap_2d[:, 0],
    'isomap_2': X_vis_isomap_2d[:, 1],
    'category': [id_to_label[id] for id in y_vis],
    'text': text_vis
})

fig_hw_isomap = px.scatter(
    df_hw_isomap, x='isomap_1', y='isomap_2', color='category',
    hover_data={'text': True, 'category': False},
    title='Interactive Isomap Visualization of Digikala Magazine Embeddings (Sample)'
)
fig_hw_isomap.show()

Running Isomap on the sample...


## Step 6: Train and Evaluate Machine Learning Classifiers

 Now for the exciting part! We'll use our high-quality embeddings as features
 to train several standard machine learning models. The pre-trained model did
 the hard work of feature extraction; now these simple classifiers just need
 to find the patterns.

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

--- HOMEWORK TASK 4: Train and Evaluate an SVM Classifier ---

 Your task:
 1.  Initialize a Support Vector Machine classifier (`SVC`). A `random_state`
     is good for reproducibility.
 2.  Train the classifier on the training embeddings (`X_train_embeddings`) and
     labels (`y_train`).
 3.  Make predictions on the test embeddings (`X_test_embeddings`).
 4.  Print the classification report to see its performance.

In [None]:
print("--- Training Support Vector Machine (SVM) ---")

svm_classifier = SVC(kernel='linear', random_state=42)

print("Training SVM...")
svm_classifier.fit(X_train_embeddings, y_train)

print("Making predictions...")
y_pred_svm = svm_classifier.predict(X_test_embeddings)

print("SVM Classification Report:")
print(classification_report(y_test, y_pred_svm, target_names=label_encoder.classes_))

--- Training Support Vector Machine (SVM) ---
Training SVM...
Making predictions...
SVM Classification Report:
                precision    recall  f1-score   support

  بازی ویدیویی       0.96      0.94      0.95       197
  راهنمای خرید       0.53      0.62      0.57        13
سلامت و زیبایی       0.85      0.89      0.87       161
علم و تکنولوژی       0.96      0.95      0.96       277
         عمومی       0.09      0.08      0.09        12
   هنر و سینما       0.95      0.93      0.94       167
 کتاب و ادبیات       0.78      0.72      0.75        25

      accuracy                           0.91       852
     macro avg       0.73      0.73      0.73       852
  weighted avg       0.91      0.91      0.91       852



In [None]:
print("--- Training Random Forest ---")
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

print("Training Random Forest...")
rf_classifier.fit(X_train_embeddings, y_train)

print("Making predictions...")
y_pred_rf = rf_classifier.predict(X_test_embeddings)

print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf, target_names=label_encoder.classes_))

--- Training Random Forest ---
Training Random Forest...
Making predictions...
Random Forest Classification Report:
                precision    recall  f1-score   support

  بازی ویدیویی       0.97      0.94      0.96       197
  راهنمای خرید       0.00      0.00      0.00        13
سلامت و زیبایی       0.79      0.92      0.85       161
علم و تکنولوژی       0.91      0.96      0.94       277
         عمومی       0.00      0.00      0.00        12
   هنر و سینما       0.89      0.96      0.92       167
 کتاب و ادبیات       0.00      0.00      0.00        25

      accuracy                           0.89       852
     macro avg       0.51      0.54      0.52       852
  weighted avg       0.84      0.89      0.87       852




Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
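
The warning above appears because the Random Forest never predicts some categories, so their precision is a 0/0 division. A minimal sketch of how the `zero_division` parameter of `classification_report` makes that choice explicit and silences the warning; the toy labels here are made up for illustration and are not the assignment's data:

```python
from sklearn.metrics import classification_report

# Toy labels (illustrative only): class 2 exists in y_true but is never predicted.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1]

# zero_division=0 reports 0.0 for the undefined precision without emitting a warning.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["2"]["precision"], report["2"]["recall"])
```

In the notebook, the same keyword could be passed alongside `target_names=label_encoder.classes_`. The scores do not change; only the warning goes away.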





In [None]:
print("--- Training K-Nearest Neighbors (KNN) ---")
knn_classifier = KNeighborsClassifier(n_neighbors=11)

print("Training KNN...")
knn_classifier.fit(X_train_embeddings, y_train)

print("Making predictions...")
y_pred_knn = knn_classifier.predict(X_test_embeddings)

print("KNN Classification Report:")
print(classification_report(y_test, y_pred_knn, target_names=label_encoder.classes_))

--- Training K-Nearest Neighbors (KNN) ---
Training KNN...
Making predictions...
KNN Classification Report:
                precision    recall  f1-score   support

  بازی ویدیویی       0.96      0.96      0.96       197
  راهنمای خرید       0.56      0.69      0.62        13
سلامت و زیبایی       0.90      0.94      0.92       161
علم و تکنولوژی       0.97      0.96      0.97       277
         عمومی       0.00      0.00      0.00        12
   هنر و سینما       0.91      0.95      0.93       167
 کتاب و ادبیات       0.80      0.64      0.71        25

      accuracy                           0.93       852
     macro avg       0.73      0.74      0.73       852
  weighted avg       0.92      0.93      0.92       852



### Question

**Question 1: Performance Analysis**

    Why do Random Forest and Support Vector Machine (SVM) models perform poorly on the "عمومی" (General) and "کتاب و ادبیات" (Books and Literature) categories?

**Your Answer:**

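1. Severe class imbalance: the test set holds only 12 "عمومی" and 25 "کتاب و ادبیات" articles out of 852, so there is too little data to fit a reliable boundary (SVM) or reliable trees (Random Forest).

2. Overlapping features with other categories, which makes the few available samples even harder to separate.

3. Random Forest's bootstrapping might miss these small classes entirely. In the report above it never predicts "راهنمای خرید" (Buying Guide), "عمومی", or "کتاب و ادبیات", so all three score 0.00.

4. KNN is instance-based rather than model-based, which helps it far more than Random Forest on "کتاب و ادبیات" (F1 0.71 vs. 0.00). It still scores 0.00 on "عمومی", and the SVM is comparable on both classes (F1 0.75 and 0.09).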
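
The imbalance point can be checked directly from the `support` column of the classification reports. A minimal sketch that turns those test-set counts into shares; the numbers are copied from the reports above, and the training split is not shown here, so treating it as similarly skewed is an assumption:

```python
# Test-set support per category, copied from the classification reports above.
support = {
    "بازی ویدیویی": 197,
    "راهنمای خرید": 13,
    "سلامت و زیبایی": 161,
    "علم و تکنولوژی": 277,
    "عمومی": 12,
    "هنر و سینما": 167,
    "کتاب و ادبیات": 25,
}

total = sum(support.values())
shares = {label: count / total for label, count in support.items()}

# Print categories from rarest to most common.
for label, share in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{label}: {support[label]} articles ({share:.1%})")
```

The three smallest categories together account for 50 of the 852 test articles, which is why they barely move overall accuracy even when a model gets all of them wrong.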

**Question 2: Improving Accuracy**

    How can we improve the accuracy specifically for the "عمومی" and "کتاب و ادبیات" categories?


**Your Answer:**

1. Handle class imbalance during training, for example by weighting classes inversely to their frequency (`class_weight='balanced'` in `SVC` and `RandomForestClassifier`).
2. Resample the dataset (training split only):

     Oversample minority classes using techniques like SMOTE or random oversampling.

     Undersample dominant classes to balance the dataset.
3. Combine classifiers:

     Example: use KNN or Naive Bayes for minority classes and SVM/Random Forest for majority classes.

     Use stacking or weighted voting to favor predictions for small categories.
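
Two of the imbalance remedies can be sketched with scikit-learn and NumPy alone. The example below runs on synthetic data from `make_classification` as a stand-in for the embeddings. The helper `random_oversample` is a hypothetical name written for this sketch, not part of any library. No results are claimed for the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def random_oversample(X, y, random_state=0):
    """Duplicate minority-class rows (with replacement) until all classes match the largest."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    index_parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        index_parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(index_parts)
    return X[idx], y[idx]


# Synthetic stand-in for the embeddings: three classes, the last one rare.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_classes=3,
    weights=[0.6, 0.37, 0.03], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Oversample the training split only, so the test split keeps its original distribution.
X_bal, y_bal = random_oversample(X_tr, y_tr, random_state=42)

models = {
    "plain": SVC(kernel="linear", random_state=42).fit(X_tr, y_tr),
    "class_weight": SVC(kernel="linear", class_weight="balanced", random_state=42).fit(X_tr, y_tr),
    "oversampled": SVC(kernel="linear", random_state=42).fit(X_bal, y_bal),
}

# Per-class recall shows whether the rare class (last entry) benefits.
recalls = {name: recall_score(y_te, m.predict(X_te), average=None) for name, m in models.items()}
for name, per_class in recalls.items():
    print(name, np.round(per_class, 2))
```

For the assignment, `X_tr`/`y_tr` would be replaced by `X_train_embeddings`/`y_train`. Whether either remedy helps "عمومی" would need to be measured, since overlap with other categories can limit the gain.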

**Question 3: KNN Performance**

    Why does the K-Nearest Neighbors (KNN) model show improved accuracy for this application compared to Random Forest and SVM?

**Your Answer:**

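KNN reaches the highest overall accuracy (0.93, versus 0.91 for SVM and 0.89 for Random Forest) because it is local, instance-based, and flexible, which helps with:

  Small classes

  Overlapping feature spaces

  Sparse data points

Random Forest is a global model that ignores the three smallest classes entirely (F1 0.00), which lowers its recall and precision. The linear SVM handles the small classes about as well as KNN (macro-average F1 is 0.73 for both), so KNN's edge over the SVM comes mainly from larger classes such as "سلامت و زیبایی" (Health and Beauty; F1 0.92 vs. 0.87).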
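
Because the classes are imbalanced, overall accuracy and macro-averaged F1 can tell different stories when the three models are compared. A minimal sketch with made-up toy labels (not the assignment's data), in which two sets of predictions share the same accuracy but differ on the rare class:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy ground truth (illustrative only): nine majority samples, one minority sample.
y_true = [0] * 9 + [1]

# Model A ignores the minority class; model B finds it at the cost of one false positive.
pred_a = [0] * 10
pred_b = [0] * 8 + [1, 1]

for name, pred in [("A", pred_a), ("B", pred_b)]:
    acc = accuracy_score(y_true, pred)
    macro_f1 = f1_score(y_true, pred, average="macro", zero_division=0)
    print(f"model {name}: accuracy={acc:.2f}, macro F1={macro_f1:.2f}")
```

Both toy models get 9 of 10 samples right, yet only the macro average shows that model A never finds the minority class. This is why the `macro avg` row of the reports is the more informative one for the small categories.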

**Question 4: Comparing Approaches**

    Compare the accuracy and use cases of your embedding-based approach with a rule-based approach. When would you prefer to use one over the other?

**Your Answer:**

Observations:

The embedding-based models generally achieve higher overall accuracy and more balanced F1-scores, especially for categories with semantic overlap or more complex language, such as "سلامت و زیبایی" (Health and Beauty) and "کتاب و ادبیات" (Books and Literature).

Rule-based methods sometimes outperform on recall for very small or niche classes: "کتاب و ادبیات" reaches 0.96 recall with the rule-based approach, probably because the rules match its keywords well.

Rule-based precision can be low (e.g., "راهنمای خرید" (Buying Guide) has 0.34 precision), meaning many false positives.

In practice, a hybrid approach often works best:
1. Use embeddings for main classification.
2. Apply rule-based corrections for rare or critical classes.
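
One way to read the hybrid idea is "trust the embedding classifier unless a rule for a rare class fires". The sketch below illustrates that control flow only. The keyword list, the function names `rule_label` and `hybrid_predict`, and the stand-in `fake_ml_predict` are all hypothetical; the real rules would come from the rule-based step earlier in the assignment.

```python
from typing import Callable, Dict, List, Optional

BOOKS = "کتاب و ادبیات"
GENERAL = "عمومی"

# Hypothetical keyword rules (label -> keywords), for illustration only.
RULES: Dict[str, List[str]] = {
    BOOKS: ["رمان", "نویسنده"],  # "novel", "author"
}


def rule_label(text: str, rules: Dict[str, List[str]]) -> Optional[str]:
    """Return the first label whose keywords appear in the text, else None."""
    for label, keywords in rules.items():
        if any(keyword in text for keyword in keywords):
            return label
    return None


def hybrid_predict(
    text: str,
    ml_predict: Callable[[str], str],
    rules: Dict[str, List[str]] = RULES,
) -> str:
    """Use the ML prediction unless a rare-class rule fires."""
    override = rule_label(text, rules)
    return override if override is not None else ml_predict(text)


def fake_ml_predict(text: str) -> str:
    """Stand-in for the embedding-based classifier; always answers GENERAL."""
    return GENERAL


novel_text = "این رمان تازه منتشر شده است"  # "This novel has just been published"
phone_text = "قیمت گوشی"  # "phone price"
print(hybrid_predict(novel_text, fake_ml_predict))
print(hybrid_predict(phone_text, fake_ml_predict))
```

In the notebook, `ml_predict` could be the `predict` method of the inference class built in Step 7. Plain substring matching is fragile: the keyword "رمان" also occurs inside "درمان" ("treatment"), so practical rules would need token-level matching to avoid the low precision noted above.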

## Step 7: Build an Inference Pipeline

 The final step is to wrap our entire process—from raw text to category prediction—
 into a single, reusable class.

 --- HOMEWORK TASK 6: Complete the Inference Class ---

 Your task:

 Inside the `predict` method, complete the steps to process a single
     sentence and return the predicted category name.


In [None]:
class TextClassifier:
    def __init__(self, embedding_model, embedding_tokenizer, ml_classifier, id_to_label_map):
        self.model = embedding_model
        self.tokenizer = embedding_tokenizer
        self.classifier = ml_classifier
        self.id_to_label = id_to_label_map
        self.device = embedding_model.device

    def predict(self, text: str) -> str:
        # 1. The input is a single string. Wrap it in a list because the
        #    encoder function expects a collection of sentences.
        text_list = [text]

        # 2. Generate the embedding for the text.
        #    `generate_embeddings` works on a pandas Series, so build a temporary one.
        text_series = pd.Series(text_list)
        embedding = generate_embeddings(text_series, self.model, self.tokenizer)

        # 3. Use the trained machine learning model to predict the label ID.
        #    The `.predict()` method of sklearn models returns an array, so we
        #    take the first element.
        predicted_id = self.classifier.predict(embedding)[0]

        # 4. Convert the predicted ID back to its string label using the map.
        predicted_label = self.id_to_label[predicted_id]

        return predicted_label

In [None]:
# Let's test the pipeline with the trained KNN model.
inference_pipeline = TextClassifier(
    embedding_model=classify_model,
    embedding_tokenizer=classify_tokenizer,
    ml_classifier=knn_classifier,
    id_to_label_map=id_to_label
)

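# Test with some example sentences
# Sentence 1: "This is the best smartphone I have ever had"
test_sentence_1 = "این بهترین گوشی هوشمندی است که تا به حال داشته ام"
# Sentence 2: "The chocolate cake recipe was very easy"
test_sentence_2 = "دستور پخت کیک شکلاتی بسیار آسان بود"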

prediction_1 = inference_pipeline.predict(test_sentence_1)
prediction_2 = inference_pipeline.predict(test_sentence_2)


print("--- Inference Test ---")
print(f"Sentence: '{test_sentence_1}'")
print(f"Predicted Category: '{prediction_1}'")
print("-" * 20)
print(f"Sentence: '{test_sentence_2}'")
print(f"Predicted Category: '{prediction_2}'")


--- Inference Test ---
Sentence: 'این بهترین گوشی هوشمندی است که تا به حال داشته ام'
Predicted Category: 'علم و تکنولوژی'
--------------------
Sentence: 'دستور پخت کیک شکلاتی بسیار آسان بود'
Predicted Category: 'سلامت و زیبایی'
