# Emotion Detection in Persian Texts

**Student Name:** <span style="color:cyan">Ardalan Siavashpour</span>

**Student ID:** <span style="color:cyan">99109896</span>

## 1. Project Overview and Goals

In this assignment, you will build a machine learning pipeline to detect emotions in a single-label collection of Persian texts. The dataset is categorized into five emotional classes: **HAPPY, SAD, ANGRY, FEAR, and OTHER**.

### Your Tasks:

*   **Data Cleaning & Feature Engineering:** Preprocess the Persian text data.
*   **Model Selection:** Choose a suitable classical machine learning model.
*   **Pipeline Construction:** Use the provided `EmotionClassifierPipeline` class to encapsulate your workflow.
*   **Model Evaluation:** Use K-Fold/Stratified K-Fold cross-validation to evaluate your model's performance and interpret the results.
*   **Prediction:** Train your final pipeline on the entire training dataset and generate predictions for the unlabeled test set.
*   **Submission:** Save your test predictions to a CSV file named `submission.csv`.

**Grading:** An accuracy above **65%** on the hidden test set earns full marks.

## 2. Setup and Data Loading

First, let's import the necessary libraries and load our data.

In [1]:
!pip install pandas numpy scikit-learn openpyxl



In [None]:
# Basic libraries for data manipulation
import pandas as pd
import numpy as np
import re

# Scikit-learn modules for machine learning
from sklearn.model_selection import KFold

# Set a random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

In [None]:
# --- Load the Datasets ---
# Make sure the files 'Twitter_train.xlsx' and 'Twitter_test.csv' are in the same directory as this notebook.
df_train = pd.read_excel('Twitter_train.xlsx')
df_test = pd.read_csv('Twitter_test.csv')

print("--- Training Data Head ---")
display(df_train.head())

print("\n--- Test Data Head ---")
display(df_test.head())

## 3. Exploratory Data Analysis (EDA) and Preprocessing

### 3.1. Analyze Class Distribution

**TODO:** Analyze and visualize the distribution of emotions in the training set. Is the dataset balanced?

In [None]:
# YOUR CODE HERE to analyze and plot the emotion distribution.
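
One possible way to approach the balance question is to compare absolute counts, relative shares, and the ratio between the largest and smallest class. The sketch below runs on a small made-up frame (`toy_train` is a hypothetical stand-in for `df_train`); it only assumes that the label column is called `emotion`, as in the later cells.

```python
import pandas as pd

# Hypothetical stand-in for df_train; only the 'emotion' column matters here.
toy_train = pd.DataFrame({
    "emotion": ["HAPPY", "SAD", "SAD", "ANGRY", "OTHER", "OTHER", "OTHER", "FEAR"]
})

# Absolute and relative frequency of each class.
class_counts = toy_train["emotion"].value_counts()
class_share = toy_train["emotion"].value_counts(normalize=True).round(3)

# Ratio of the largest to the smallest class; values far above 1 suggest imbalance.
imbalance_ratio = class_counts.max() / class_counts.min()

print(pd.DataFrame({"count": class_counts, "share": class_share}))
print(f"Imbalance ratio (max/min): {imbalance_ratio:.2f}")
```

On the real data, replace `toy_train` with `df_train`. If matplotlib is installed, `class_counts.plot(kind="bar")` gives a quick visual check. A clearly imbalanced result is a reason to prefer stratified folds and to report macro-averaged F1 alongside accuracy.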

### 3.2. Text Cleaning and Preprocessing

**TODO:** Implement a function to clean the Persian text. Consider steps like normalization, removing punctuation and numbers, and handling stop words. Apply this function to create a new `cleaned_text` column in both `df_train` and `df_test`.

In [None]:
def clean_persian_text(text):
    """
    A function to clean and preprocess Persian text.
    
    TODO: Implement your text cleaning logic here.
    """
    
    pass # Replace this with your code

# --- Apply your cleaning function ---
# TODO: Run these lines after you have implemented your function.
df_train['cleaned_text'] = df_train['text'].apply(clean_persian_text)
df_test['cleaned_text'] = df_test['text'].apply(clean_persian_text)

print("Text cleaning complete. Example:")
display(df_train[['text', 'cleaned_text']].head())
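
As a starting point, the cleaning steps named above could look roughly like the following regex-only sketch. It is not the required solution: the character map, the tiny `STOP_WORDS` set, and the decision to strip URLs and @mentions are all illustrative assumptions, and a dedicated Persian NLP library may do a better job of normalization.

```python
import re

# Arabic code points that often appear in Persian text, mapped to Persian forms.
CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
})

# Deliberately tiny, illustrative stop-word set (an assumption, not a full list).
STOP_WORDS = {"و", "در", "به", "از", "را", "با"}


def clean_persian_text(text):
    """Normalize, strip noise, and drop stop words from one Persian string."""
    if not isinstance(text, str):  # e.g. NaN cells read from the spreadsheet
        return ""
    text = text.translate(CHAR_MAP)
    text = re.sub(r"https?://\S+|@\w+", " ", text)    # URLs and @mentions
    text = re.sub(r"[\u064B-\u0652\u0640]", "", text)  # diacritics and tatweel
    text = re.sub(r"[^\w\s\u200c]", " ", text)         # punctuation, emoji, symbols
    text = re.sub(r"[\d_]+", " ", text)                # Latin, Persian, Arabic digits
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(clean_persian_text("\u0643\u062a\u0627\u0628 123!!"))
```

The zero-width non-joiner (`\u200c`) is kept on purpose because it is part of Persian spelling. Whether stop-word removal helps at all is worth testing in cross-validation, since short function words and negations can carry emotional signal.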

## 4. Model Evaluation with Cross-Validation

**TODO:** Use (Stratified) K-Fold to evaluate different models and vectorizers. Your goal is to find the best combination.

In [None]:
# Prepare the data (ensure you have run the cleaning step above)
X = df_train['cleaned_text']
y = df_train['emotion']

# --- Define your components ---
# TODO: Experiment to find the best components for your pipeline.

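# --- Cross-Validation Loop ---
# StratifiedKFold keeps the class proportions roughly equal in every fold.
from sklearn.model_selection import StratifiedKFold

N_SPLITS = 5
skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_SEED)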

fold_accuracies = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"\n--- Fold {fold+1}/{N_SPLITS} ---")
    
    # TODO: Split data into training and validation sets for this fold
    # YOUR CODE HERE

    # TODO: Instantiate your pipeline for this fold
    # YOUR CODE HERE

    # TODO: Fit the pipeline on the training fold
    # YOUR CODE HERE

    # TODO: Make predictions on the validation fold
    # YOUR CODE HERE

    # TODO: Calculate and store appropriate metrics
    # YOUR CODE HERE

# --- Print Average Scores ---
print("\n--- Cross-Validation Summary ---")
print(f"Average Accuracy: {np.mean(fold_accuracies):.4f} (+/- {np.std(fold_accuracies):.4f})")
# YOUR CODE HERE
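
For orientation, a completed cross-validation loop might be shaped like the sketch below. It uses a made-up two-class toy corpus, and the pairing of character n-gram TF-IDF with logistic regression is only one candidate to try, not a recommended answer. The key point it illustrates is that the vectorizer is re-fitted on the training fold only, so no vocabulary statistics leak from the validation fold.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold

# Toy corpus standing in for df_train['cleaned_text'] and df_train['emotion'].
texts = np.array([
    "خیلی خوشحالم", "امروز عالی بود", "چه روز خوبی",
    "بسیار شادم", "عالی و خوب", "خوشحال و شاد",
    "خیلی ناراحتم", "امروز بد بود", "چه روز غمگینی",
    "بسیار غمگینم", "بد و تلخ", "ناراحت و غمگین",
])
labels = np.array(["HAPPY"] * 6 + ["SAD"] * 6)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
fold_accuracies, fold_macro_f1 = [], []

for fold, (train_idx, val_idx) in enumerate(cv.split(texts, labels), start=1):
    X_tr, X_val = texts[train_idx], texts[val_idx]
    y_tr, y_val = labels[train_idx], labels[val_idx]

    # Fresh components per fold; the vectorizer only ever sees the training fold.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    classifier = LogisticRegression(max_iter=1000)

    classifier.fit(vectorizer.fit_transform(X_tr), y_tr)
    y_pred = classifier.predict(vectorizer.transform(X_val))

    fold_accuracies.append(accuracy_score(y_val, y_pred))
    fold_macro_f1.append(f1_score(y_val, y_pred, average="macro", zero_division=0))
    print(f"Fold {fold}: accuracy={fold_accuracies[-1]:.3f}, macro-F1={fold_macro_f1[-1]:.3f}")

print(f"Mean accuracy: {np.mean(fold_accuracies):.3f} (+/- {np.std(fold_accuracies):.3f})")
print(f"Mean macro-F1: {np.mean(fold_macro_f1):.3f}")
```

With pandas Series such as `X` and `y` from the cell above, index with `X.iloc[train_idx]` rather than `X[train_idx]`. Scores on a toy corpus this small mean nothing; the loop structure is what carries over.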

## 5. Building the Machine Learning Pipeline

Here is a custom pipeline class provided for you. Your task is to plug your chosen vectorizer and classifier into this class.

In [None]:
# TODO: Based on your experiments in Section 4, define the final
# pipeline you will use.
class EmotionClassifierPipeline:
    """
    A custom pipeline class to handle text vectorization and classification.
    This class is provided for you to use.
    """
    def __init__(self, vectorizer, classifier):
        """
        Initializes the pipeline with chosen components.
        """
        self.vectorizer = vectorizer
        self.classifier = classifier

    def fit(self, X, y):
        """
        Trains the pipeline on the provided training data.
        """
        X_transformed = self.vectorizer.fit_transform(X)
        self.classifier.fit(X_transformed, y)
        return self

    def predict(self, X):
        """
        Predicts labels for new, unseen data.
        """
        X_transformed = self.vectorizer.transform(X)
        predictions = self.classifier.predict(X_transformed)
        return predictions
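
To show how the class is meant to be used, here is a short sketch that wires in two scikit-learn components. The class body is repeated in condensed form only so the snippet runs on its own. The choice of `TfidfVectorizer` with `LinearSVC`, the parameter values, and the toy sentences are assumptions for illustration, not the expected solution.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC


class EmotionClassifierPipeline:
    """Condensed copy of the provided class, repeated so this snippet is standalone."""

    def __init__(self, vectorizer, classifier):
        self.vectorizer = vectorizer
        self.classifier = classifier

    def fit(self, X, y):
        self.classifier.fit(self.vectorizer.fit_transform(X), y)
        return self

    def predict(self, X):
        return self.classifier.predict(self.vectorizer.transform(X))


# Made-up training examples; the real input would be the cleaned training text.
train_texts = ["خیلی خوشحالم", "امروز عالی بود", "خیلی ناراحتم", "امروز بد بود"]
train_labels = ["HAPPY", "HAPPY", "SAD", "SAD"]

pipeline = EmotionClassifierPipeline(
    vectorizer=TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    classifier=LinearSVC(random_state=42),
)

# fit() returns the pipeline itself, so fit and predict can be chained.
predictions = pipeline.fit(train_texts, train_labels).predict(
    ["روز عالی بود", "خیلی ناراحتم"]
)
print(list(predictions))
```

Any vectorizer exposing `fit_transform`/`transform` and any classifier exposing `fit`/`predict` can be swapped in the same way.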

## 6. Final Model Training and Prediction

Now, you will train your best-performing pipeline on the **entire training dataset** and generate predictions for the test set.

In [None]:
# --- Reload the data to ensure we are working with the original sets ---
print("Reloading data for final training and prediction...")
df_train = pd.read_excel('Twitter_train.xlsx')
df_test = pd.read_csv('Twitter_test.csv')

# --- Apply the SAME cleaning process ---
# TODO: Make sure your 'clean_persian_text' function is defined above.
# Then, apply it here.
df_train['cleaned_text'] = df_train['text'].apply(clean_persian_text)
df_test['cleaned_text'] = df_test['text'].apply(clean_persian_text)


# --- Instantiate and Fit the Final Pipeline ---
print("Training the final pipeline on all training data...")

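# This part uses the provided pipeline class to train your chosen model.
# TODO: Define `final_vectorizer` and `final_classifier` with the best
# components from your cross-validation experiments before running this cell.
final_pipeline = EmotionClassifierPipeline(
    vectorizer=final_vectorizer,
    classifier=final_classifier
)

# Fit the pipeline on the full, cleaned training data
final_pipeline.fit(df_train['cleaned_text'], df_train['emotion'])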

print("Final pipeline training complete.")


# --- Generate Predictions on the Test Set ---
print("Generating predictions for the test set...")

# Predict using the cleaned test data
test_predictions = final_pipeline.predict(df_test['cleaned_text'])

# Add predictions to the test DataFrame
df_test['predicted_emotion'] = test_predictions

print("\n--- Test Data with Predictions (Top 5) ---")
display(df_test.head())

In [None]:
# --- Create the submission DataFrame ---
submission_df = pd.DataFrame({
    'text': df_test['text'],
    'emotion': df_test['predicted_emotion']
})

# --- Save to CSV ---
output_filename = 'submission.csv'
submission_df.to_csv(output_filename, index=False)

print(f"\nSuccessfully saved predictions to '{output_filename}'.")
display(submission_df.head())
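
Before uploading, it can be worth reloading the saved file and checking it. The helper below is a hypothetical sketch (`validate_submission` is not part of the assignment code). It assumes the expected columns are `text` and `emotion`, as written by the cell above, and that labels use the upper-case class names from Section 1.

```python
import io

import pandas as pd

VALID_LABELS = {"HAPPY", "SAD", "ANGRY", "FEAR", "OTHER"}


def validate_submission(df, expected_rows):
    """Return a list of problems found; an empty list means no issue was detected."""
    if list(df.columns) != ["text", "emotion"]:
        return [f"unexpected columns: {list(df.columns)}"]
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df["emotion"].isna().any():
        problems.append("some rows have no prediction")
    unknown = set(df["emotion"].dropna()) - VALID_LABELS
    if unknown:
        problems.append(f"unknown labels: {sorted(unknown)}")
    return problems


# Round-trip a toy frame through CSV in memory instead of touching the disk.
toy_submission = pd.DataFrame({"text": ["متن اول", "متن دوم"], "emotion": ["HAPPY", "SAD"]})
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(validate_submission(reloaded, expected_rows=2))
```

On the real file this would be called as `validate_submission(pd.read_csv('submission.csv'), expected_rows=len(df_test))`.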