<a href="https://colab.research.google.com/github/TheOctoMizer/AAI-510-Project/blob/main/final-project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Business Understanding

Twitter sentiment analysis helps brands, governments, and researchers understand public opinion in real time.

**Business Need:** Build a multilingual sentiment classifier (English, French, Portuguese) that classifies tweets as:
- Positive
- Negative

This will enable:
- Monitoring brand perception
- Tracking political sentiment
- Analyzing feedback across diverse markets

## 2. Data Understanding

The project uses three datasets:
- 🇬🇧 English: 100k+ samples with text and sentiment
- 🇫🇷 French: ~900,000 samples, with no "neutral" class
- 🇵🇹 Portuguese: ~600,000 samples

Challenges:
- Label format inconsistencies (e.g., 0/1/2, strings)
- Extra columns
- Missing/imbalanced classes
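
The label-format inconsistency can be handled by routing every raw value through one normalizer before filtering. The sketch below is illustrative only: the helper name `normalize_label` is hypothetical, and the numeric convention (0 = negative, 1 = positive, 2 = neutral) is assumed from the `label_map` used in the preparation cells.

```python
def normalize_label(raw):
    """Map a raw label (int, float, numeric string, or word) to a canonical string.

    Assumes 0=negative, 1=positive, 2=neutral; returns None for anything else.
    """
    numeric = {"0": "negative", "1": "positive", "2": "neutral"}
    words = {"negative", "positive", "neutral"}
    if raw is None:
        return None
    key = str(raw).strip().lower()
    # Integer labels read from CSV as floats (e.g. 1.0) should match "1"
    if key.endswith(".0"):
        key = key[:-2]
    if key in numeric:
        return numeric[key]
    # Already a word label: keep it only if it is one of the known classes
    return key if key in words else None


examples = [0, "1", " Positive ", 2.0, "unknown", None]
normalized = [normalize_label(x) for x in examples]
print(normalized)
```

Values that map to `None` can then be dropped together with the neutral rows.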

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

## 3. Data Preparation

Steps:
- Clean column formats
- Drop extra columns
- Map labels to 'positive'/'negative'
- Remove neutral samples
- Downsample each language to 65k samples, stratified by label
- Combine into a 195k-sample multilingual dataset
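
Proportional allocation with `int()` truncation can leave the per-language sample a few rows short of 65k. One possible remedy, sketched here under assumptions, is largest-remainder allocation; the helper `allocate_per_class` and the class counts in the example are hypothetical and not taken from the datasets.

```python
def allocate_per_class(class_counts, sample_size):
    """Split sample_size across classes in proportion to class_counts.

    Uses largest-remainder rounding so the allocations sum exactly to sample_size.
    """
    total = sum(class_counts.values())
    if sample_size > total:
        raise ValueError("sample_size exceeds the number of available rows")
    # Exact (fractional) share of the sample that each class should receive
    exact = {label: sample_size * count / total for label, count in class_counts.items()}
    # Start from the truncated shares, as a plain int() allocation would
    alloc = {label: int(share) for label, share in exact.items()}
    leftover = sample_size - sum(alloc.values())
    # Hand the remaining rows to the classes with the largest fractional parts
    by_remainder = sorted(exact, key=lambda label: exact[label] - alloc[label], reverse=True)
    for label in by_remainder[:leftover]:
        alloc[label] += 1
    return alloc


# Hypothetical class counts with a 2:1 imbalance
counts = {"positive": 400_000, "negative": 200_000}
alloc = allocate_per_class(counts, 65_000)
print(alloc, sum(alloc.values()))
```

With truncation alone, the two classes in this example would total 64,999 rows; the remainder step restores the missing row.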

In [2]:
from google.colab import drive
drive.mount('/content/drive')
base_path = '/content/drive/My Drive/AAI-510-Dataset'

Mounted at /content/drive


In [None]:
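def stratified_downsample(df, sample_size):
    """Sample about `sample_size` rows while preserving the label proportions of `df`."""
    # Nothing to downsample if the frame is already small enough; just shuffle it
    if len(df) <= sample_size:
        return df.sample(frac=1, random_state=42).reset_index(drop=True)
    label_dist = df['label'].value_counts(normalize=True).to_dict()
    samples = []
    for label, ratio in label_dist.items():
        # int() truncates, so the total can fall a few rows short of sample_size
        n = int(sample_size * ratio)
        part = df[df['label'] == label].sample(n=n, random_state=42)
        samples.append(part)
    return pd.concat(samples).sample(frac=1, random_state=42).reset_index(drop=True)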

In [None]:
import pandas as pd

# ------------------------
# 🟢 1. Portuguese Dataset
# ------------------------

def process_portuguese(input_path, output_path):
    df = pd.read_csv(input_path, sep=';', quoting=3, encoding='utf-8', on_bad_lines='skip')

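    # Keep only the needed columns and give them the shared names;
    # rename() returns a new frame, which avoids SettingWithCopyWarning below
    df = df[['tweet_text', 'sentiment']].rename(columns={'tweet_text': 'text', 'sentiment': 'label'})
    df['language'] = 'pt'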

    label_map = {
        '0': 'negative', '1': 'positive', '2': 'neutral',
        0: 'negative', 1: 'positive', 2: 'neutral'
    }
    df['label'] = df['label'].map(label_map)
    df = df[df['label'].isin(['positive', 'negative'])]
    df.info()
    # Stratified downsample
    sampled = stratified_downsample(df, 65000)
    sampled.to_csv(output_path, index=False)
    print(f"✅ Portuguese dataset saved: {output_path}")

In [None]:
# --------------------
# 🟢 2. English Dataset
# --------------------

def process_english(input_path, output_path):
    df = pd.read_csv(input_path)
    # Keep only the needed columns and give them the shared names
    df = df[['Text', 'Label']].rename(columns={'Text': 'text', 'Label': 'label'})
    df['language'] = 'en'

    df['label'] = df['label'].astype(str).str.lower().str.strip()
    df = df[df['label'].isin(['positive', 'negative'])]

    sampled = stratified_downsample(df, 65000)
    sampled.to_csv(output_path, index=False)
    print(f"✅ English dataset saved: {output_path}")

In [None]:
# -------------------
# 🟢 3. French Dataset
# -------------------

def process_french(input_path, output_path):
    df = pd.read_csv(input_path)
    # Columns already use the shared names; copy so later assignments do not touch a view
    df = df[['text', 'label']].copy()
    df['language'] = 'fr'

    label_map = {
        '0': 'negative', '1': 'positive', '2': 'neutral',
        0: 'negative', 1: 'positive', 2: 'neutral'
    }
    df['label'] = df['label'].map(label_map)
    df = df[df['label'].isin(['positive', 'negative'])]

    sampled = stratified_downsample(df, 65000)
    sampled.to_csv(output_path, index=False)
    print(f"✅ French dataset saved: {output_path}")

In [None]:
process_portuguese(f"{base_path}/portuguese.csv", f"{base_path}/portuguese_cleaned_65k.csv")
process_english(f"{base_path}/english.csv", f"{base_path}/english_cleaned_65k.csv")
process_french(f"{base_path}/french.csv", f"{base_path}/french_cleaned_65k.csv")

In [None]:
en = pd.read_csv(f"{base_path}/english_cleaned_65k.csv")
pt = pd.read_csv(f"{base_path}/portuguese_cleaned_65k.csv")
fr = pd.read_csv(f"{base_path}/french_cleaned_65k.csv")

df_all = pd.concat([en, pt, fr])
df_all = df_all.sample(frac=1, random_state=42).reset_index(drop=True)
df_all.to_csv(f"{base_path}/multilingual_sentiment_195k.csv", index=False)

print("✅ Combined dataset saved: multilingual_sentiment_195k.csv")

In [None]:
# Encode labels
df_all['label_enc'] = df_all['label'].map({'negative': 0, 'positive': 1})

## 4. Exploratory Data Analysis

In [None]:
sns.countplot(data=df_all, x='label', hue='language')
plt.title("Sentiment Distribution by Language")
plt.show()

print(df_all['label'].value_counts())
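
The count plot shows absolute counts. To compare class balance across languages directly, per-language label shares can be computed with `pd.crosstab`. The sketch below uses a small made-up frame (`toy`) with the same `language` and `label` columns as `df_all`; the values are illustrative and not drawn from the real data.

```python
import pandas as pd

# Toy stand-in for df_all with the same two columns
toy = pd.DataFrame({
    "language": ["en", "en", "en", "en", "fr", "fr", "pt", "pt"],
    "label": ["positive", "negative", "negative", "negative",
              "positive", "negative", "positive", "positive"],
})

# normalize="index" turns each row (language) into proportions that sum to 1
share = pd.crosstab(toy["language"], toy["label"], normalize="index")
print(share.round(2))
```

Replacing `toy` with `df_all` gives the corresponding table for the combined dataset.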

## 5. Modeling

Models to train:
1. XLM-RoBERTa Base
2. MDeBERTa v3 Base
3. DistilBERT Multilingual
4. LSTM

Each model is trained and evaluated on the same train/test split.
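
The notebook lists an LSTM baseline but does not show its code. A minimal sketch of what such a baseline might look like in PyTorch follows. The class name `LSTMClassifier`, the layer sizes, and the assumption that tweets are already mapped to integer token ids are illustrative choices, not the project's actual implementation.

```python
import torch
import torch.nn as nn


class LSTMClassifier(nn.Module):
    """Hypothetical bidirectional LSTM baseline for binary sentiment."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_labels=2, pad_id=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, input_ids):
        embedded = self.embedding(input_ids)            # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(embedded)               # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states.
        # Note: with right-padded input the forward state also sees pad steps;
        # packing the sequences would avoid that and is omitted for brevity.
        features = torch.cat([h_n[-2], h_n[-1]], dim=1)  # (batch, 2 * hidden_dim)
        return self.classifier(features)                # (batch, num_labels)


model = LSTMClassifier(vocab_size=100)
input_ids = torch.randint(1, 100, (4, 12))  # 4 fake tweets, 12 token ids each
logits = model(input_ids)
print(logits.shape)
```

The logits can be trained with `nn.CrossEntropyLoss` against the `label_enc` column, using the same train/test split as the transformer models.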

In [None]:
from huggingface_hub import login
login()

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

def train_transformer_model(model_name, train_ds, test_ds, test_df):
    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # Tokenization function (handles NaNs and ensures text format)
    def tokenize(batch):
        # Ensure all entries are strings
        texts = [str(t) if t is not None else "" for t in batch["text"]]
        return tokenizer(texts, truncation=True, padding="max_length", max_length=128)

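    # Tokenize, then expose the integer labels under the "labels" name Trainer expects;
    # the string "label" column is dropped because it cannot be collated into a tensor
    def prepare(ds):
        ds = ds.map(tokenize, batched=True)
        return ds.remove_columns(["label"]).rename_column("label_enc", "labels")

    train_encoded = prepare(train_ds)
    test_encoded = prepare(test_ds)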

    # Define training arguments
    training_args = TrainingArguments(
        output_dir=f"./results_{model_name.replace('/', '_')}",
        evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
        per_device_train_batch_size=16,
        per_device_eval_batch_size=32,
        num_train_epochs=2,
        save_strategy="no",
        logging_dir='./logs',
        logging_steps=100,
        load_best_model_at_end=False
    )

    # Define metric computation
    def compute_metrics(pred):
        labels = pred.label_ids
        preds = np.argmax(pred.predictions, axis=1)
        return {
            "accuracy": accuracy_score(labels, preds),
            "f1": f1_score(labels, preds)
        }

    # Initialize Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_encoded,
        eval_dataset=test_encoded,
        compute_metrics=compute_metrics
    )

    # Train the model
    trainer.train()

    # Predict on test set
    predictions = trainer.predict(test_encoded)
    y_pred = np.argmax(predictions.predictions, axis=1)
    y_true = test_df['label_enc'].values

    return y_pred, y_true

### XLM-RoBERTa Base Model training

In [None]:
from datasets import Dataset
from sklearn.model_selection import train_test_split

# df_all is the full multilingual dataframe built in Data Preparation
train_df, test_df = train_test_split(df_all, test_size=0.2, stratify=df_all['label_enc'], random_state=42)

# Convert to Hugging Face Datasets
train_ds = Dataset.from_pandas(train_df)
test_ds = Dataset.from_pandas(test_df)

# Train and evaluate
y_pred, y_true = train_transformer_model("xlm-roberta-base", train_ds, test_ds, test_df)

## 6. Evaluation
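
One way to score the `y_true` and `y_pred` arrays returned by `train_transformer_model` is sketched below in plain Python, to make explicit what accuracy and binary F1 measure. The helper `binary_metrics` and the example arrays are hypothetical; in the notebook itself, the `accuracy_score`, `f1_score`, and `classification_report` functions already imported from scikit-learn would be the more idiomatic route.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary task, from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = positive, 0 = negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Running the same function on the subset of test rows for each language would show whether performance differs between English, French, and Portuguese.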

## 7. Final Conclusion

| Model             | Accuracy | F1-Score |
|------------------|----------|----------|
| MDeBERTa v3 Base  | XX.XX%   | XX.XX%   |
| XLM-RoBERTa Base  | XX.XX%   | XX.XX%   |
| DistilBERT Multi  | XX.XX%   | XX.XX%   |
| LSTM              | XX.XX%   | XX.XX%   |

**Insights:**
- MDeBERTa v3 and XLM-RoBERTa gave best multilingual performance.
- DistilBERT is lighter but less accurate.
- LSTM works but lags behind modern transformers.

**Next Steps:**
- Try adding attention to LSTM
- Use more data (with neutral)
- Test on other languages (e.g., Spanish, Hindi)