## Evaluating Biases in Large Language Models (LLMs) using WEAT and Demographic Diversity Analysis
### **Word Embedding Association Test (WEAT)**

#### **What are Word Embeddings?**
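Word embeddings (e.g., Word2Vec, GloVe) map each word to a dense vector of real numbers so that words used in similar contexts end up close together. Mathematically, a word is represented as a point in a d-dimensional space, and the similarity between two words is usually measured as the cosine of the angle between their vectors. This geometry is what makes embeddings useful in NLP, and it is also what makes associations learned from the training text, including biased ones, measurable.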
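As a minimal sketch of that representation, the snippet below treats words as plain lists of numbers and compares them with cosine similarity. The 3-dimensional vectors and the helper name `cosine` are illustrative assumptions, not values from a trained model. It uses only the standard library so that it runs on its own.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings: each word is a point in 3-dimensional space.
king = [0.8, 0.6, 0.1]
queen = [0.7, 0.7, 0.1]
apple = [0.1, 0.2, 0.9]

# Words with similar vectors score close to 1; unrelated ones score lower.
print(cosine(king, queen) > cosine(king, apple))  # True
```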

#### **Introduction to WEAT**

Objective:
- Measure the strength and direction of associations between target words and attribute categories in an embedding space.
- Understand the real-world implications of biases encoded in word embeddings.

In [4]:
import numpy as np

#### **Defining Word Sets**
- X and Y are target word sets. In our example, they represent different occupations.
- A and B are attribute word sets, representing gender terms in this case.

In [9]:
# Define the target (X, Y) and attribute (A, B) word sets
X = ['doctor', 'engineer', 'scientist']
Y = ['nurse', 'teacher', 'receptionist']
A = ['man', 'male', 'boy']
B = ['woman', 'female', 'girl']

#### **Embeddings**
This dictionary contains 3-dimensional embeddings (vectors) for various words.

In a real-world scenario, these embeddings would be derived from models like Word2Vec, GloVe, or large language models.

In [10]:
# Toy 3-dimensional word embeddings, written by hand for illustration
word_embeddings = {
    'doctor': np.array([0.1, 0.3, 0.5]),
    'engineer': np.array([0.2, 0.4, 0.2]),
    'scientist': np.array([0.3, 0.1, 0.4]),
    'nurse': np.array([0.5, 0.1, 0.3]),
    'teacher': np.array([0.4, 0.2, 0.1]),
    'receptionist': np.array([0.3, 0.4, 0.3]),
    'man': np.array([0.5, 0.5, 0.5]),
    'male': np.array([0.5, 0.4, 0.5]),
    'boy': np.array([0.5, 0.5, 0.4]),
    'woman': np.array([0.5, 0.2, 0.3]),
    'female': np.array([0.5, 0.3, 0.3]),
    'girl': np.array([0.5, 0.3, 0.4])
}

In [11]:
from sklearn.metrics.pairwise import cosine_similarity

#### **Computing Differential Association**
- The function s computes the differential association of a word w with the sets X and Y.
- For each word in X, we compute its cosine similarity with w and then take the mean of these values to get sim_X.
- Similarly, we compute the average cosine similarity between w and each word in Y to get sim_Y.
- The function returns the difference between sim_X and sim_Y.
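Written as a formula, the differential association described above is as follows. The notation is an assumption chosen to mirror the code, with $\vec{w}$ denoting the embedding of word $w$:

```latex
s(w, X, Y) = \frac{1}{|X|} \sum_{x \in X} \cos(\vec{w}, \vec{x})
           - \frac{1}{|Y|} \sum_{y \in Y} \cos(\vec{w}, \vec{y})
```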

In [12]:
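def s(w, X, Y):
    """Differential association of word w: mean cosine similarity to X minus mean cosine similarity to Y."""
    # cosine_similarity expects 2-D inputs and returns a 1x1 array for a single pair
    w_vec = word_embeddings[w].reshape(1, -1)
    sim_X = np.mean([cosine_similarity(w_vec, word_embeddings[x].reshape(1, -1)) for x in X])
    sim_Y = np.mean([cosine_similarity(w_vec, word_embeddings[y].reshape(1, -1)) for y in Y])
    return sim_X - sim_Y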

#### **Calculating the WEAT Score**
- For each word in set A, we compute its differential association with X and Y and sum these values.
- Similarly, we compute the sum of differential associations for each word in set B.
- The WEAT score is the difference between the two sums.
- A positive WEAT score indicates that the words in A lean toward X (and away from Y) more than the words in B do. Conversely, a negative score indicates that the words in B lean toward X more than the words in A do.
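In the same notation as the function `s`, the score computed in this notebook can be written as follows. The name $\mathrm{WEAT}$ for the statistic is a labelling choice made here:

```latex
\mathrm{WEAT}(X, Y, A, B) = \sum_{a \in A} s(a, X, Y) - \sum_{b \in B} s(b, X, Y)
```

The formulation usually cited for WEAT sums over the target words instead, $\sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B)$. When all four sets have the same size, as they do here, both expressions expand to the same combination of pairwise cosine similarities and give the same value.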

In [13]:
WEAT_score = sum(s(a, X, Y) for a in A) - sum(s(b, X, Y) for b in B)

In [14]:
print(WEAT_score)

0.25109671349724283


The WEAT score we obtained, 0.2511, is a positive value. Here's how to interpret it in the context of the word sets:

Target word sets (Occupations):

X: ['doctor', 'engineer', 'scientist']

Y: ['nurse', 'teacher', 'receptionist']

Attribute word sets (Gender):

A: ['man', 'male', 'boy']

B: ['woman', 'female', 'girl']

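Interpretation:

The WEAT score is a difference of differences: it compares how strongly the words in set A (male-associated terms) lean toward X rather than Y with how strongly the words in set B (female-associated terms) do. The positive score of 0.2511 therefore means that the male terms are relatively more associated with the occupations in set X ('doctor', 'engineer', 'scientist') than the female terms are. It does not mean that the male terms are closer to X than to Y in absolute terms. In this toy example the differential association s(w, X, Y) is negative for every word in both A and B, so all six gender terms sit slightly closer to set Y ('nurse', 'teacher', 'receptionist'); the female terms simply lean toward Y much more strongly than the male terms do.

In simpler terms, the toy embeddings used here encode a gendered pattern: relative to the male terms, the female terms are pulled toward 'nurse', 'teacher', and 'receptionist' and away from 'doctor', 'engineer', and 'scientist'. Because these 3-dimensional vectors were written by hand for illustration, the result says nothing about any real model.

The magnitude also needs care. A score closer to 0 suggests a weaker association, and a score further from 0 (either positive or negative) suggests a stronger one. However, this raw score is a sum of mean differences and is not normalized: it grows with the number of attribute words and has no fixed scale, so 0.2511 cannot be labelled "weak", "moderate", or "strong" on its own. Such labels are usually attached to a normalized effect size, and a permutation test over the word sets is used to judge whether the score could have arisen by chance.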
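To put the raw score on a comparable scale, the sketch below computes a normalized effect size and an exact permutation p-value for the same toy data. It adapts the effect size and permutation test from the WEAT literature to this notebook's convention, in which the words in A and B are the ones being scored. The helper names (`assoc`, `statistic`), the use of the sample standard deviation, and the one-sided test are assumptions of this sketch. It uses only the standard library so it runs independently of the notebook state.

```python
import math
from itertools import combinations
from statistics import mean, stdev

# Same toy embeddings and word sets as in the notebook.
emb = {
    'doctor': (0.1, 0.3, 0.5), 'engineer': (0.2, 0.4, 0.2), 'scientist': (0.3, 0.1, 0.4),
    'nurse': (0.5, 0.1, 0.3), 'teacher': (0.4, 0.2, 0.1), 'receptionist': (0.3, 0.4, 0.3),
    'man': (0.5, 0.5, 0.5), 'male': (0.5, 0.4, 0.5), 'boy': (0.5, 0.5, 0.4),
    'woman': (0.5, 0.2, 0.3), 'female': (0.5, 0.3, 0.3), 'girl': (0.5, 0.3, 0.4),
}
X = ['doctor', 'engineer', 'scientist']
Y = ['nurse', 'teacher', 'receptionist']
A = ['man', 'male', 'boy']
B = ['woman', 'female', 'girl']

def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return dot / (norm_u * norm_v)

def assoc(w):
    """Differential association of w: mean cosine to X minus mean cosine to Y."""
    return (mean(cosine(emb[w], emb[x]) for x in X)
            - mean(cosine(emb[w], emb[y]) for y in Y))

# Score every attribute word once, then reuse the values below.
s_val = {w: assoc(w) for w in A + B}

def statistic(group_a, group_b):
    """Raw score: summed association of group_a minus that of group_b."""
    return sum(s_val[w] for w in group_a) - sum(s_val[w] for w in group_b)

observed = statistic(A, B)

# Effect size: difference of group means in units of the standard
# deviation of the associations over all attribute words.
effect_size = (mean(s_val[a] for a in A) - mean(s_val[b] for b in B)) / stdev(s_val.values())

# Exact one-sided permutation test over every equal-size split of A + B.
words = A + B
splits = list(combinations(words, len(A)))
count = sum(
    statistic(g, [w for w in words if w not in g]) >= observed
    for g in splits
)
p_value = count / len(splits)

print(f"raw score:   {observed:.4f}")
print(f"effect size: {effect_size:.2f}")
print(f"p-value:     {p_value:.2f}")
```

With only six attribute words there are just 20 possible splits, so the smallest p-value this test can return is 0.05. Word sets this small are useful for demonstrating the mechanics but not for drawing conclusions.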

## Demographic Diversity Analysis
### Introduction
Objective: Measure the performance of LLMs across different demographic groups.
- Importance of demographic parity in LLMs.

Steps in Demographic Diversity Analysis:
- Define demographic groups.
- Measure the model's performance for each group.
- Compare results to identify disparities.


Let's imagine we have an LLM that's been trained to answer questions. We will assess its performance across two hypothetical demographic groups: native English speakers and non-native English speakers.

In [15]:
from sklearn.metrics import accuracy_score

In [16]:
# Sample questions and the correct answers
questions = {
    "What's the capital of France?": "Paris",
    "Which gas do plants take in during photosynthesis?": "Carbon dioxide",
    "Who wrote Romeo and Juliet?": "William Shakespeare",
    "In which year did World War II end?": "1945",
    "How many sides does a hexagon have?": "6"
}

# Hypothetical responses from the LLM for native and non-native speakers
native_responses = {
    "What's the capital of France?": "Paris",
    "Which gas do plants take in during photosynthesis?": "Carbon dioxide",
    "Who wrote Romeo and Juliet?": "Shakespeare",
    "In which year did World War II end?": "1945",
    "How many sides does a hexagon have?": "Six"
}

non_native_responses = {
    "What's the capital of France?": "Paris",
    "Which gas do plants take in during photosynthesis?": "Oxygen",
    "Who wrote Romeo and Juliet?": "Shakespeare",
    "In which year did World War II end?": "1944",
    "How many sides does a hexagon have?": "Six"
}

In [17]:
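def evaluate_responses(correct_answers, responses):
    """Return the fraction of responses that match the reference answer exactly."""
    # Exact, case-sensitive string comparison: "Six" does not match "6",
    # and "Shakespeare" does not match "William Shakespeare".
    correct_count = sum(responses[q] == a for q, a in correct_answers.items())
    return correct_count / len(correct_answers)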

native_accuracy = evaluate_responses(questions, native_responses)
non_native_accuracy = evaluate_responses(questions, non_native_responses)

print(f"Accuracy for native English speakers: {native_accuracy:.2f}")
print(f"Accuracy for non-native English speakers: {non_native_accuracy:.2f}")

Accuracy for native English speakers: 0.60
Accuracy for non-native English speakers: 0.20


Alright, let's interpret these results:

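**Accuracy for native English speakers: 0.60**
Under exact string matching, 60% of the responses given to native English speakers matched the reference answers. The two misses, "Shakespeare" and "Six", are correct answers that differ from the reference strings "William Shakespeare" and "6", so the true accuracy for this group is higher than the score suggests.

**Accuracy for non-native English speakers: 0.20**
Only 20% of the responses given to non-native English speakers matched the reference answers. Two of the four misses ("Oxygen" and "1944") are genuinely wrong; the other two are the same formatting mismatches seen in the native group.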

**Interpretation:**
- There is a disparity in the model's measured performance between the two groups: accuracy is 40 percentage points higher for native English speakers (0.60 versus 0.20). With only five questions per group, this gap is illustrative rather than statistically reliable.
- If such a disparity held on a larger sample, it might suggest that the LLM is biased in favor of native English speakers or is not adept at handling the phrasing or grammatical inaccuracies in questions posed by non-native speakers.
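With five questions per group, it is worth checking whether a gap between 0.60 and 0.20 could plausibly arise by chance. The sketch below runs a two-sided Fisher exact test on the counts of exact matches (3 of 5 versus 1 of 5). The function name `fisher_exact_two_sided` and the choice of test are assumptions of this sketch, not part of the original analysis.

```python
from fractions import Fraction
from math import comb

def fisher_exact_two_sided(correct_1, total_1, correct_2, total_2):
    """Two-sided Fisher exact test p-value for two groups' correct counts."""
    total_correct = correct_1 + correct_2
    n = total_1 + total_2

    def prob(k):
        # Hypergeometric probability that group 1 has exactly k correct answers
        return Fraction(comb(total_1, k) * comb(total_2, total_correct - k),
                        comb(n, total_correct))

    lowest = max(0, total_correct - total_2)
    highest = min(total_1, total_correct)
    p_obs = prob(correct_1)
    # Sum the probabilities of all outcomes at least as unlikely as the observed one
    return sum(prob(k) for k in range(lowest, highest + 1) if prob(k) <= p_obs)

# Counts from the notebook: 3 of 5 exact matches (native), 1 of 5 (non-native).
p_value = fisher_exact_two_sided(3, 5, 1, 5)
print(f"{float(p_value):.2f}")  # 0.52
```

A p-value of about 0.52 means a gap this large is unremarkable for samples of this size, so a much larger question set would be needed before drawing conclusions about either group.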

**Implications:**
- If the LLM is being used in applications that cater to a global audience, this bias can be problematic. It's crucial to ensure equitable performance across diverse user groups.
- Further investigation is needed to determine the cause of this disparity. Is it because of the way questions are phrased by non-native speakers? Or is the model inherently biased due to its training data? Answering these questions can guide interventions to improve the model's performance.
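One cause of low scores can be ruled out before looking at the model itself: the accuracies above come from exact string comparison, which marks "Shakespeare" and "Six" as wrong. The sketch below re-scores the same hypothetical responses against a set of accepted answers per question. The alias sets and the helper names `normalize` and `evaluate_with_aliases` are assumptions introduced for illustration.

```python
def normalize(text):
    """Lowercase and strip surrounding whitespace and trailing punctuation."""
    return text.strip().rstrip(".!?").lower()

# Hypothetical alias sets listing every answer counted as correct.
accepted_answers = {
    "What's the capital of France?": {"paris"},
    "Which gas do plants take in during photosynthesis?": {"carbon dioxide", "co2"},
    "Who wrote Romeo and Juliet?": {"william shakespeare", "shakespeare"},
    "In which year did World War II end?": {"1945"},
    "How many sides does a hexagon have?": {"6", "six"},
}

# Same hypothetical responses as in the notebook.
native_responses = {
    "What's the capital of France?": "Paris",
    "Which gas do plants take in during photosynthesis?": "Carbon dioxide",
    "Who wrote Romeo and Juliet?": "Shakespeare",
    "In which year did World War II end?": "1945",
    "How many sides does a hexagon have?": "Six",
}
non_native_responses = {
    "What's the capital of France?": "Paris",
    "Which gas do plants take in during photosynthesis?": "Oxygen",
    "Who wrote Romeo and Juliet?": "Shakespeare",
    "In which year did World War II end?": "1944",
    "How many sides does a hexagon have?": "Six",
}

def evaluate_with_aliases(accepted, responses):
    """Fraction of responses whose normalized form is an accepted answer."""
    hits = sum(normalize(responses[q]) in answers for q, answers in accepted.items())
    return hits / len(accepted)

native_accuracy = evaluate_with_aliases(accepted_answers, native_responses)
non_native_accuracy = evaluate_with_aliases(accepted_answers, non_native_responses)

print(f"Native: {native_accuracy:.2f}")          # Native: 1.00
print(f"Non-native: {non_native_accuracy:.2f}")  # Non-native: 0.60
```

Under this scoring the gap between the groups is still 40 percentage points, but both accuracies rise, which separates formatting mismatches from genuinely wrong answers.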

# Recommendations and Analysis for Improving the LLM

## Recommendations

### 1. Data Augmentation
- **Introduce more diverse training data, especially data representing non-native English speakers.**
  - This will help the model understand a wider range of linguistic nuances and variations, improving its performance across different user groups.

### 2. Feedback Loop
- **Allow users to provide feedback on incorrect answers, and use this feedback to continuously train and improve the model.**
  - Implementing a feedback mechanism will enable the model to learn from its mistakes and adapt to user needs more effectively.

### 3. Bias Mitigation Techniques
- **Apply techniques designed to reduce bias in AI models.**
  - Techniques such as re-weighting, adversarial debiasing, and counterfactual data augmentation can help in minimizing biases in the model.
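Of the techniques listed, counterfactual data augmentation is the simplest to sketch: each training sentence is paired with a copy in which the gendered terms are swapped, so that the model sees both versions equally often. The swap list `PAIRS` and the function name `counterfactual` are hypothetical. A real word list would be far larger and would need to handle ambiguous words such as "her", which can correspond to either "him" or "his".

```python
import re

# Hypothetical, deliberately tiny swap list.
PAIRS = [("he", "she"), ("man", "woman"), ("boy", "girl"), ("male", "female")]

# Build a lookup that works in both directions.
SWAP = {}
for first, second in PAIRS:
    SWAP[first] = second
    SWAP[second] = first

# Match whole words only, so that "the" or "human" are left untouched.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, SWAP)) + r")\b", re.IGNORECASE)

def counterfactual(sentence):
    """Return the sentence with each listed gendered term swapped."""
    def replace(match):
        word = match.group(0)
        swapped = SWAP[word.lower()]
        # Preserve an initial capital letter
        return swapped.capitalize() if word[0].isupper() else swapped
    return PATTERN.sub(replace, sentence)

original = "He is a doctor and the girl is a nurse."
augmented = [original, counterfactual(original)]
print(augmented[1])  # She is a doctor and the boy is a nurse.
```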

### 4. Clear Communication
- **If deploying the model in its current state, communicate its limitations to users.**
  - Transparency about the model’s capabilities and limitations is crucial for setting realistic expectations and maintaining user trust.

### Summary
- **The results indicate a need for further refinement and calibration of the LLM to ensure it serves all user groups equitably.**

## Benefits of Bias Analysis

- **Ensuring fairness and inclusivity in AI systems.**
  - By addressing biases, we can create AI systems that are fair and inclusive for all users.
- **Enhancing trust and acceptance among users.**
  - Reducing biases can increase user trust and acceptance, which supports wider adoption of the technology.
- **Aligning with ethical considerations and societal norms.**
  - Bias analysis aligns AI development with ethical standards and societal expectations, promoting responsible AI use.

## Challenges and Considerations

- **The subjectivity of defining biases.**
  - Different stakeholders may have different perspectives on what constitutes bias, making it challenging to address all concerns.
- **The trade-offs between accuracy and fairness.**
  - Efforts to reduce bias may sometimes lower the overall accuracy of the model, so the two need to be balanced carefully.
- **The importance of continuous monitoring and updating.**
  - Bias mitigation is an ongoing process that requires regular monitoring and updating to ensure the model remains fair and effective over time.

