## Holistic AI x UCL AI Society Hackathon Tutorial

### Track 2: Building Trustworthy Models for Stereotype Classification in Text Data

### Tutorials: Building an Ethical Classifier for Stereotype Detection Using EMGSD

This tutorial demonstrates how to build an ethical classifier for stereotype detection using the **Expanded Multi-Grain Stereotype Dataset (EMGSD)**. By incorporating sustainability, fairness, and explainability into the development process, we tackle some of the key challenges in creating trustworthy AI systems.

This methodology is inspired by the HEARTS framework from Holistic AI, which is explained in the paper below:

[**HEARTS: A Holistic Framework for Explainable, Sustainable, and Robust Text Stereotype Detection**](https://arxiv.org/abs/2409.11579)


[**Example stereotype classifier from the HEARTS paper**](https://huggingface.co/holistic-ai/bias_classifier_albertv2)

---


### Setup and Dependencies

Install the necessary libraries to get started:


```python
!pip install transformers torch datasets shap pandas scikit-learn accelerate matplotlib codecarbon==2.4.2 lime
```

We'll focus on an ethical approach, emphasizing sustainability (using a small model like ALBERT-V2), bias detection, explainability (SHAP and LIME), and efficiency.

---

### Data Loading (EMGSD)

The dataset you're using is **EMGSD**: **Expanded Multi-Grain Stereotype Dataset**, which includes stereotype-labeled text across various demographics.

#### **Expanded Multi-Grain Stereotype Dataset (EMGSD)**

##### **Dataset Overview**
This dataset is designed for detecting and classifying stereotypes across different dimensions, such as **race**, **gender**, **nationality**, and **profession**. It contains both **train** and **test** splits, each with various text samples annotated for bias and stereotype labels.

##### **Main Features:**

- **`text_with_marker`**: The original text with certain words highlighted using "===" to indicate the focus of potential stereotypes (e.g., "The ===doctor=== was helpful").
  
- **`text`**: The same text without any markers or highlights. This feature provides a clean version of the text for tasks like natural language processing without needing special token handling.

- **`category`**: This column contains the main classification label indicating whether the text contains a **stereotype** or is **unrelated** or **neutral**. The categories include:
  - `stereotype`
  - `unrelated`
  - `neutral`

- **`stereotype_type`**: Describes the category of the stereotype detected in the text. Categories include:
  - `race`
  - `gender`
  - `nationality`
  - `profession`
  - ... and more
  
- **`data_source`**: The source of the text, indicating where the data sample was originally extracted from. Examples include:
  - `stereoset_intrasentence`
  - `stereoset_intersentence`
  - `seegull_augmented`
  - ... and more

- **`label`**: A more detailed label indicating the specific type of stereotype detected. Examples include:
  - `stereotype_nationality`
  - `stereotype_gender`
  - `neutral_race`
  - ... and more
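
To make the `text_with_marker` format concrete, here is a minimal sketch of how the `===` markers could be parsed. The helper names `extract_marked_terms` and `strip_markers` are hypothetical (they are not part of EMGSD or any library), and the sketch assumes markers always come in matched pairs, as in the example above.

```python
import re

# Matches any span wrapped in "===" markers, as in the `text_with_marker` column
MARKER_PATTERN = re.compile(r"===(.+?)===")

def extract_marked_terms(text_with_marker):
    """Return the terms wrapped in === markers (hypothetical helper)."""
    return MARKER_PATTERN.findall(text_with_marker)

def strip_markers(text_with_marker):
    """Remove the markers and keep the plain text (hypothetical helper)."""
    return MARKER_PATTERN.sub(r"\1", text_with_marker)

example = "The ===doctor=== was helpful"
print(extract_marked_terms(example))
print(strip_markers(example))
```

Stripping the markers from `text_with_marker` should give back the content of the `text` column, which is the column we train on later.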

```python
import pandas as pd

# Load the train and test splits from the Hugging Face Hub
splits = {'train': 'train.csv', 'test': 'test.csv'}
train_data = pd.read_csv("hf://datasets/holistic-ai/EMGSD/" + splits["train"])
test_data = pd.read_csv("hf://datasets/holistic-ai/EMGSD/" + splits["test"])

train_data
```

### Data Preparation and Splitting

In this section, we load the **Expanded Multi-Grain Stereotype Dataset (EMGSD)** and prepare it for training and testing. The dataset is split into **training** and **testing** subsets, which are essential for building and evaluating our classifier.

For demonstration purposes and to speed up the training process during this tutorial, we use **only 1% of the data** by setting `sample_ratio = 0.01`.  
**Note:** This is intended for testing purposes only. For better results in your final implementation, **you should adjust this ratio to use a larger portion or the entire dataset**.

Here’s a summary of the process:
- **Load the dataset**: Import the training and testing data.
- **Sampling**: Use the `sample_ratio` to define the proportion of data used.  
  _Example_: Setting `sample_ratio = 0.1` means using only 10% of the dataset.
- **Prepare inputs and labels**: Extract the `text` column as input data (`X`) and the `category` column as labels (`y`).

By default:
- **`train_data`** contains a subset of the original dataset for training.
- **`test_data`** is similarly reduced for quick evaluation.

### Important:  
You can **modify the `sample_ratio`** to include more data as needed for your experiments or to achieve better performance in real-world applications.
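
With a very small `sample_ratio`, uniform sampling can leave rare classes with only a handful of examples. One possible alternative is stratified sampling, which draws the same fraction from every class. The sketch below uses plain Python on toy rows so that it runs on its own; `stratified_sample` and `toy_rows` are hypothetical names, not part of the tutorial code.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, label_key, frac, seed=42):
    """Sample the same fraction from each class (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sampled = []
    for group in by_label.values():
        k = max(1, round(len(group) * frac))  # keep at least one row per class
        sampled.extend(rng.sample(group, k))
    return sampled

# Toy data with an imbalanced class distribution (made up for illustration)
toy_rows = (
    [{"text": f"s{i}", "category": "stereotype"} for i in range(50)]
    + [{"text": f"n{i}", "category": "neutral"} for i in range(30)]
    + [{"text": f"u{i}", "category": "unrelated"} for i in range(20)]
)

subset = stratified_sample(toy_rows, "category", frac=0.1)
counts = Counter(row["category"] for row in subset)
print(dict(counts))
```

With a pandas DataFrame, the same idea can be written as `train_data.groupby("category").sample(frac=sample_ratio, random_state=42)`.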

```python
# Use a small subset of the data for faster training (0.01 = 1%)
sample_ratio = 0.01
train_data = train_data.sample(frac=sample_ratio, random_state=42)
test_data = test_data.sample(frac=sample_ratio, random_state=42)

# Extract inputs (text) and labels (category) for the train and test sets
X_train, y_train = train_data["text"].values.tolist(), train_data["category"].values.tolist()
X_test, y_test = test_data["text"].values.tolist(), test_data["category"].values.tolist()

train_data.head()
```

### Model Selection and Sustainability Focus

We will fine-tune a series of **multi-class** classifiers on the stereotype detection task, tracking carbon emissions with CodeCarbon.
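
As rough intuition for what an emissions tracker reports, estimated emissions are essentially the energy consumed multiplied by the carbon intensity of the electricity used. The sketch below illustrates that arithmetic only; it is not CodeCarbon's implementation, `estimate_emissions_kg` is a hypothetical helper, and the power draw and carbon intensity are made-up numbers.

```python
def estimate_emissions_kg(power_watts, duration_seconds, intensity_kg_per_kwh):
    """Energy (kWh) times carbon intensity (kg CO2eq per kWh)."""
    energy_kwh = power_watts * duration_seconds / 3_600_000  # joules to kWh
    return energy_kwh * intensity_kg_per_kwh

# Illustrative values only: a 250 W device running for 10 minutes
# on a grid with 0.2 kg CO2eq per kWh
estimate = estimate_emissions_kg(power_watts=250, duration_seconds=600, intensity_kg_per_kwh=0.2)
print(f"Estimated emissions: {estimate:.4f} kg CO2eq")
```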

#### Macro F1 Score Calculation

We will use the **Macro F1 score** to evaluate the model's performance, ensuring that it generalizes well across all classes.
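
To see why macro averaging matters, here is a small worked example written in plain Python. The function `macro_f1` is a hypothetical helper that mirrors the usual definition (per-class F1, then an unweighted mean), and the labels are toy values rather than EMGSD data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# A classifier that always predicts the majority class (toy labels)
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy: {accuracy:.3f}")
print(f"Macro F1: {macro_f1(y_true, y_pred):.3f}")
```

The majority-class predictor reaches an accuracy of about 0.667 but a macro F1 of about 0.267, because the two classes it never predicts each contribute an F1 of zero.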

```python
from sklearn.metrics import f1_score
import numpy as np

# Function to compute Macro F1 score
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    f1 = f1_score(labels, predictions, average='macro')  # Use macro F1
    return {"f1": f1}
```

#### Training the Models and Tracking Emissions

We start with simple baseline models. The first predicts a label at random for each test example.

```python
import random
from sklearn.metrics import f1_score
from datasets import Dataset

# Convert to Hugging Face dataset format
train_dataset = Dataset.from_dict({"text": X_train, "label": y_train})
test_dataset = Dataset.from_dict({"text": X_test, "label": y_test})

# Map labels to IDs
label2id = {
    'stereotype': 0,
    'unrelated': 1,
    'neutral': 2,
}

id2label = {v: k for k, v in label2id.items()}

def map_labels(example):
    example['label'] = label2id[example['label']]
    return example

# Apply the mapping to both splits
train_dataset = train_dataset.map(map_labels)
test_dataset = test_dataset.map(map_labels)

# Random baseline: draw each prediction from the test labels,
# so classes are sampled in proportion to their frequency
random.seed(42)
random_predictions = [random.choice(y_test) for _ in range(len(y_test))]

# Evaluate the baseline (y_test still holds the string labels here)
f1 = f1_score(y_test, random_predictions, average='macro')
print(f"F1 Score: {f1}")
```

Next, we explore a logistic regression model that uses TF-IDF scores as features. This model should serve as a stronger baseline.
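
For intuition about the features the logistic regression receives, the sketch below computes TF-IDF weights by hand on three toy sentences. It uses a smoothed IDF and L2 normalization, which is the scheme scikit-learn's `TfidfVectorizer` applies by default, but it tokenizes by simple whitespace splitting, so treat it as a simplified illustration rather than a drop-in replacement. The function name `tfidf_vectors` and the sentences are made up.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (simplified TF-IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["the nurse is kind", "the chef is talented", "the sky is blue"]
vectors = tfidf_vectors(docs)
for term, weight in sorted(vectors[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

Words that appear in every sentence ("the", "is") receive lower weights than words specific to one sentence ("nurse", "kind"), which is what lets a linear model focus on distinctive terms.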

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from codecarbon import EmissionsTracker

# Use the integer label IDs from the mapped datasets
X_train = train_dataset['text']
y_train = train_dataset['label']
X_test = test_dataset['text']
y_test = test_dataset['label']

# TF-IDF Vectorizer: fit on the training text only, then transform both splits
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Logistic Regression Model
model = LogisticRegression()

# Tracking emissions with CodeCarbon
tracker = EmissionsTracker()
tracker.start()

# Fit the model
model.fit(X_train_tfidf, y_train)

# Evaluate the model
predictions = model.predict(X_test_tfidf)
f1 = f1_score(y_test, predictions, average='macro')

# Note: the tracked window covers both fitting and prediction
emissions = tracker.stop()
print(f"F1 Score: {f1}")
print(f"Training carbon emissions: {emissions} kg CO2eq")
```

Now, we seek to improve performance against the simple baselines. 

We will use the **ALBERT-V2** architecture, a carbon-efficient model with far fewer parameters than common alternatives such as the LLaMA or GPT series, which supports the sustainability goal of our approach.

```python
import torch
from codecarbon import EmissionsTracker
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments


# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")

# Tokenization function (pads every example to the model's maximum length)
def tokenize_function(example):
    return tokenizer(example['text'], padding='max_length', truncation=True)

# Apply the tokenizer to the dataset
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

# Load pre-trained ALBERT model with classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "albert-base-v2",
    num_labels=3,
    label2id=label2id,
    id2label=id2label
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # "mps" For macOS (Apple Silicon)
model.to(device)

# Tracking emissions with CodeCarbon
tracker = EmissionsTracker()
tracker.start()

# Fine-tune the model and keep the checkpoint with the best macro F1
# (`eval_strategy` is the name used by recent transformers releases;
# older releases call it `evaluation_strategy`)
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_dir='./logs',
    num_train_epochs=3,
    logging_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    save_total_limit=1,
    metric_for_best_model="f1",
    greater_is_better=True
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    compute_metrics=compute_metrics,  # Use macro F1 computation
)

trainer.train()

emissions = tracker.stop()
print()
print(f"Training carbon emissions: {emissions} kg CO2eq")
```

We'll benchmark the efficacy of the model using macro F1-scores and check its robustness by testing it against a range of diverse texts and demographic combinations.

```python
from sklearn.metrics import f1_score

# Making predictions on the test set
preds = trainer.predict(tokenized_test_dataset).predictions.argmax(-1)
f1 = f1_score(tokenized_test_dataset['label'], preds, average='macro')
print(f"Macro F1 Score: {f1:.3f}")
```

You can see from the relatively low macro F1 score that stereotype classification is a very challenging task! How could we improve performance? 

### Evaluating Ethical Aspects: Bias

#### Bias Assessment
We will check whether the classifier treats different demographic groups fairly. We'll analyze the model’s bias by creating counterfactuals and examining prediction consistency. For a systematic approach, please refer to the paper below:

[**Assessing Bias in Metric Models for LLM Open-Ended Generation Bias Benchmarks**](https://arxiv.org/abs/2410.11059)
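
One simple way to build counterfactuals is to swap demographic terms while leaving the rest of the sentence unchanged, then check how often the predicted label stays the same. The sketch below illustrates this idea under stated assumptions: the swap list `SWAPS` is a small hypothetical example, and `toy_predict_label` is a deliberately biased stand-in for a real classifier, used only so the code runs on its own.

```python
import re

# Hypothetical swap list; a real assessment would need a much broader one
SWAPS = {"males": "females", "females": "males", "men": "women", "women": "men"}
SWAP_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, SWAPS)) + r")\b", re.IGNORECASE)

def make_counterfactual(text):
    """Swap each demographic term for its counterpart, preserving capitalization."""
    def swap(match):
        word = match.group(0)
        replacement = SWAPS[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement
    return SWAP_PATTERN.sub(swap, text)

def consistency_rate(predict_label, texts):
    """Share of texts whose predicted label is unchanged by the swap."""
    same = sum(predict_label(t) == predict_label(make_counterfactual(t)) for t in texts)
    return same / len(texts)

def toy_predict_label(text):
    # Stand-in classifier that reacts to one group term (for illustration only)
    return "stereotype" if "females" in text.lower() else "neutral"

texts = ["Males usually think that the chef is talented.", "The sky is blue."]
print(make_counterfactual(texts[0]))
print(consistency_rate(toy_predict_label, texts))
```

With the fine-tuned model, `predict_label` could be a small wrapper that returns the top label from the classification pipeline. A consistency rate well below 1.0 suggests that predictions depend on the demographic term itself.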

---

```python
from transformers import pipeline

# Load the text classification pipeline; top_k=3 returns scores for all three classes
classifier = pipeline("text-classification", model=trainer.model, tokenizer=tokenizer, top_k=3)

def assess_bias(model, counterfactuals):
    # Collect the model's predictions for each input sentence
    bias_results = {}
    for text in counterfactuals:
        prediction = model(text)
        bias_results[text] = prediction
    return bias_results


# Creating demographic counterfactuals (example)
counterfactuals = ["Males usually think that the chef is talented.", "Females often assume that the nurse is dedicated."]
bias_results = assess_bias(classifier, counterfactuals)

bias_results
```

#### Explainability Using SHAP and LIME

Explainability is key to ethical AI. We’ll use **SHAP** and **LIME** to offer transparency into how the model makes decisions at the token level.

##### SHAP Example:

We'll use SHAP to explain the model's predictions at the token level. Here, we explain each counterfactual sentence with respect to the `stereotype` class.
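
To build intuition for what these attributions mean, the sketch below computes exact Shapley values for a three-token toy example by enumerating every subset of tokens. This illustrates the underlying definition only: the SHAP library approximates these values rather than enumerating all subsets, and `toy_score` is a hypothetical stand-in for the classifier's `stereotype` probability.

```python
from itertools import combinations
from math import factorial

def shapley_values(tokens, score_fn):
    """Exact Shapley value of each token for score_fn (exponential cost)."""
    n = len(tokens)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = score_fn([tokens[j] for j in sorted(subset + (i,))])
                without_i = score_fn([tokens[j] for j in subset])
                total += weight * (with_i - without_i)
        values.append(total)
    return values

def toy_score(present_tokens):
    # Hypothetical score with an interaction between two tokens
    score = 0.1
    if "nurse" in present_tokens:
        score += 0.3
    if "females" in present_tokens:
        score += 0.2
    if "nurse" in present_tokens and "females" in present_tokens:
        score += 0.2
    return score

tokens = ["females", "nurse", "dedicated"]
values = shapley_values(tokens, toy_score)
for token, value in zip(tokens, values):
    print(f"{token}: {value:.3f}")
```

The attributions sum to the difference between the score with all tokens present and the score with none, and the interaction bonus is split evenly between the two tokens that cause it.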

```python
import shap

explainer = shap.Explainer(classifier)
shap_values = explainer(counterfactuals)

# Keep only the attributions for the "stereotype" class
shap_values_stereotype = shap_values[:, :, "stereotype"].values
shap_vectors = []

# Save SHAP values in vectors for subsequent calculation
for index, values in enumerate(shap_values_stereotype):
    # Drop the first and the last two entries
    # (intended to exclude special tokens and the final punctuation mark)
    trimmed_values = values[1:-2]
    shap_vectors.append(trimmed_values)
    print(f"Sentence {index+1} SHAP vector: {trimmed_values}")

shap.plots.text(shap_values[:, :, "stereotype"])
```

##### LIME Example:

Likewise, we can use LIME to explain the model's predictions at the token level. Here, we explain each counterfactual sentence with respect to the `stereotype` class.

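```python
import numpy as np
from lime.lime_text import LimeTextExplainer

# Class names in label-ID order, so probability columns line up with label2id
class_names = [id2label[i] for i in range(len(id2label))]
stereotype_id = label2id["stereotype"]

def predict_proba(texts):
    preds = classifier(list(texts))
    # The pipeline sorts labels by score, so rebuild each row in a fixed class order
    rows = []
    for preds_single in preds:
        scores = {pred['label']: pred['score'] for pred in preds_single}
        rows.append([scores[name] for name in class_names])
    return np.array(rows)

explainer = LimeTextExplainer(class_names=class_names)

lime_values_per_sentence = []

for idx, sentence in enumerate(counterfactuals):
    # Explain the "stereotype" class explicitly rather than whichever class scores highest
    exp = explainer.explain_instance(sentence, predict_proba, num_features=50, num_samples=100, labels=[stereotype_id])

    # as_map() gives (word_index, weight) pairs; sort by index so weights
    # follow the order in which words first appear in the sentence
    lime_values = [weight for _, weight in sorted(exp.as_map()[stereotype_id])]
    lime_values_per_sentence.append(lime_values)

    print(f"LIME values for Sentence {idx+1} 'stereotype':", lime_values)

    exp.show_in_notebook()
```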

Do the explanations provided by these methods align? Let's check by plotting the SHAP and LIME values for the first text instance and computing the cosine similarity between the two vectors for each instance.
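
As a reminder of what cosine similarity measures, here is a minimal plain-Python sketch on toy vectors (the numbers are made up, not real SHAP or LIME output). It compares the direction of two vectors and ignores their magnitude, and it only makes sense when both vectors have one entry per token in the same order.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_u * norm_v)

# Same direction, different scale: similarity close to 1
same_direction = cosine([0.4, 0.1, -0.2], [0.8, 0.2, -0.4])
# Importance placed on different tokens: similarity 0
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
# Opposite signs on every token: similarity close to -1
opposite = cosine([0.4, 0.1], [-0.4, -0.1])

print(same_direction, orthogonal, opposite)
```

A value near 1 means the two methods rank and sign token importance similarly, even if their raw scales differ.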

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

# Plot and compare SHAP and LIME explanations for the first sentence
plt.figure(figsize=(10, 5))
plt.plot(shap_vectors[0], label="SHAP")
plt.plot(lime_values_per_sentence[0], label="LIME")
plt.legend()
plt.xlabel("Token position")
plt.ylabel("Explanation value")
plt.title("SHAP and LIME Explanations Comparison")
plt.show()


# Calculating cosine similarity between SHAP and LIME vectors
for idx, (shap_vec, lime_vec) in enumerate(zip(shap_vectors, lime_values_per_sentence)):
    shap_vec_array = np.array(shap_vec, dtype=float)
    lime_vec_array = np.array(lime_vec, dtype=float)

    # SHAP scores model tokens while LIME scores words, so lengths can differ
    if shap_vec_array.shape != lime_vec_array.shape:
        print(f"Sentence {idx + 1}: vector lengths differ ({len(shap_vec_array)} vs {len(lime_vec_array)}), skipping")
        continue

    similarity = cosine_similarity([shap_vec_array], [lime_vec_array])[0][0]
    print(f"Cosine similarity between SHAP and LIME for Sentence {idx + 1} ({counterfactuals[idx]}): {similarity}")
```

### Conclusion: Building an Ethical Classifier

This tutorial demonstrated how to build an ethical classifier for stereotype detection using the **Expanded Multi-Grain Stereotype Dataset (EMGSD)**. By incorporating sustainability, fairness, and explainability into the development process, we tackled some of the key challenges in creating trustworthy AI systems.

#### Key Takeaways:
1. **Sustainability**: 
   - Used **ALBERT-V2**, a small, carbon-efficient model, to limit the environmental impact of training.
   - Used **CodeCarbon** to monitor carbon emissions during training.

2. **Bias Detection**: 
   - Assessed the fairness of the model across different demographic groups by analyzing counterfactual examples and checking whether predictions stay consistent.

3. **Explainability**:
   - Utilized **SHAP** and **LIME** for token-level transparency, enabling deeper insights into the classifier's decision-making process and promoting trust and accountability.

4. **Efficiency and Robustness**:
   - Evaluated model performance using **Macro F1 Scores** to ensure generalization across all classes (stereotype, neutral, unrelated).
   - Probed the robustness of the classifier by testing it on example texts with different demographic terms.

5. **Data Preparation**:
   - Simplified data loading, sampling, and preparation to enable flexible experimentation, with clear instructions to scale up for better performance as needed.

6. **Modeling and Baselines**:
   - Progressed from simple baselines (random selection and logistic regression with TF-IDF) to advanced fine-tuned transformers (ALBERT-V2) to achieve better performance.

#### Ethical Considerations:
Throughout the tutorial, we ensured adherence to ethical principles such as **bias minimization**, **sustainability**, **efficacy**, **robustness**, and **explainability**. These principles guide the development of trustworthy AI systems.

---

### Potential Extended Directions:

1. **Dataset Enhancement Through Additional Tutorials**:  
   - Enrich the dataset and improve model performance by exploring the following tutorials:
     - **[Scraping Biased Data](https://github.com/holistic-ai/hai-ucl-hackathon/blob/main/track2_text_stereotype_classification/Extra_Scraping_Biased_Data.ipynb)**: Collect and preprocess real-world biased text data from online sources to make the dataset more diverse and comprehensive.
     - **[Generating Biased Data](https://github.com/holistic-ai/hai-ucl-hackathon/blob/main/track2_text_stereotype_classification/Extra_Generate_Biased_Data.ipynb)**: Fine-tune a biased GPT-2 model to generate stereotype-related text. Use this synthetic data to augment the EMGSD dataset for further experimentation.

2. **Debiasing Techniques**:  
   - Implement strategies such as counterfactual fairness to further enhance the fairness of the classifier and minimize demographic biases.

3. **Further Explainability**:  
   - Integrate advanced interpretability methods (e.g., Integrated Gradients, BERTViz) to provide more detailed insights into the classifier's predictions.

4. **Real-World Testing**:  
   - Extend the model’s application to real-world scenarios by incorporating multi-modal data or testing in diverse, dynamic environments.


---

By following this tutorial, you have built a solid foundation for developing trustworthy AI systems. The outlined **future directions** and optional tutorials provide opportunities to expand your efforts, explore novel solutions, and contribute meaningfully to the field of ethical AI.