

## Evaluating Biases in Large Language Models (LLMs) using WEAT

### Word Embedding Association Test (WEAT)
The Word Embedding Association Test (WEAT) is a method for evaluating biases in word embeddings by
measuring the strength and direction of associations between predefined target and attribute word sets.

#### Objective:
- Assess the differential association between target word sets (e.g., animals, professions) and attribute word sets (e.g., gender terms, demographic groups).

#### Key Concepts:
- **Word Embeddings**: Vector representations of words capturing semantic and syntactic relationships (e.g., Word2Vec, GloVe).
- **Cosine Similarity**: A metric to measure the similarity between two vectors by calculating the cosine of the angle between them.
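
As a quick illustration of the cosine similarity described above, here is a minimal NumPy sketch. The helper name `cosine` is introduced only for this example; the `king` and `queen` vectors are the toy values used in the embedding dictionary later in this notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between vectors u and v (hypothetical helper)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors, same values as the notebook's 'king' and 'queen' entries
king = np.array([0.9, 0.7, 0.8])
queen = np.array([0.8, 0.6, 0.7])

print(cosine(king, queen))   # close to 1: the vectors point in almost the same direction
print(cosine(king, -king))   # -1 up to rounding: opposite directions
print(cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0: orthogonal vectors
```

Because the vectors are divided by their lengths, only direction matters: scaling a vector by a positive constant leaves its cosine similarity to any other vector unchanged.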


In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
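# Toy 3-dimensional embeddings for illustration; real embeddings
# (e.g., Word2Vec, GloVe) have far more dimensions.
word_embeddings = {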
    'lion': np.array([0.4, 0.5, 0.3]),
    'tiger': np.array([0.6, 0.5, 0.7]),
    'elephant': np.array([0.3, 0.7, 0.4]),
    'cat': np.array([0.8, 0.6, 0.4]),
    'dog': np.array([0.7, 0.5, 0.3]),
    'parrot': np.array([0.5, 0.3, 0.6]),
    'king': np.array([0.9, 0.7, 0.8]),
    'queen': np.array([0.8, 0.6, 0.7]),
    'prince': np.array([0.85, 0.7, 0.75]),
    'princess': np.array([0.7, 0.6, 0.85]),
    'duke': np.array([0.9, 0.8, 0.7]),
    'duchess': np.array([0.6, 0.8, 0.9])
}

In [None]:
# Target sets (two groups of animals: wild animals and common pets)
X = ['lion', 'tiger', 'elephant']
Y = ['cat', 'dog', 'parrot']

# Attribute sets (male and female royal terms)
A = ['king', 'prince', 'duke']
B = ['queen', 'princess', 'duchess']

In [None]:
def s(w, X, Y):
    """
    Calculate the differential association of word w with sets X and Y.

    - Compute the mean cosine similarity between w and each word in X (sim_X).
    - Compute the mean cosine similarity between w and each word in Y (sim_Y).
    - Return the difference sim_X - sim_Y.

    Args:
        w (str): The word whose association is measured (in this notebook, an attribute word from A or B).
        X (list): The first target word set.
        Y (list): The second target word set.

    Returns:
        float: The differential association score.
    """
    # Reshape once: cosine_similarity expects 2-D arrays of shape (n_samples, n_features)
    w_vec = word_embeddings[w].reshape(1, -1)
    sim_X = np.mean([cosine_similarity(w_vec, word_embeddings[x].reshape(1, -1))[0][0] for x in X])
    sim_Y = np.mean([cosine_similarity(w_vec, word_embeddings[y].reshape(1, -1))[0][0] for y in Y])
    return sim_X - sim_Y

### Differential Association

**Objective**: Differential association quantifies whether a given word `w` is more closely related to the words in set `X` or to the words in set `Y`.

#### How It Is Calculated:

1. **Similarity Measure**:
   - Cosine similarity is used to assess how similar two words are based on their embeddings (vectors). It ranges from -1 (vectors pointing in opposite directions) through 0 (orthogonal vectors) to 1 (vectors pointing in the same direction). We use it to compare the word `w` with each word in the sets.

2. **For a given word `w`**:
   - We compute its similarity with each word in set `X` and each word in set `Y`. These sets represent different categories or groups of words.

3. **Computing `sim_X`**:
   - `sim_X` is the average similarity between the word `w` and all the words in set `X`. This means we compute the similarity between `w` and every word in `X`, and then take the mean of these similarities.
   
4. **Computing `sim_Y`**:
   - Similarly, `sim_Y` is the average similarity between the word `w` and all the words in set `Y`.

5. **Differential Association**:
   - The differential association for word `w` is calculated by subtracting `sim_Y` from `sim_X`. This gives us the difference in associations:
     - **If `sim_X > sim_Y`**: The word `w` is more closely associated with the words in set `X`.
     - **If `sim_Y > sim_X`**: The word `w` is more closely associated with the words in set `Y`.
     - **If `sim_X = sim_Y`**: The word `w` has an equal association with both sets.

#### Mathematical Formula:
```
differential_association(w) = sim_X(w) - sim_Y(w)
```


Where:
- `sim_X(w)` is the average similarity between word `w` and each word in set `X`.
- `sim_Y(w)` is the average similarity between word `w` and each word in set `Y`.
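
Written out in full, with cosine similarity as the similarity measure, the formula above can be expressed as follows. The notation (`\vec{w}` for the embedding of `w`, `|X|` for the number of words in `X`) is chosen for this sketch and matches what the function `s(w, X, Y)` computes.

```latex
s(w, X, Y) \;=\; \frac{1}{|X|} \sum_{x \in X} \cos(\vec{w}, \vec{x})
\;-\; \frac{1}{|Y|} \sum_{y \in Y} \cos(\vec{w}, \vec{y}),
\qquad
\cos(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{\lVert \vec{u} \rVert \, \lVert \vec{v} \rVert}
```

Since each cosine lies between -1 and 1, each mean does too, so `s(w, X, Y)` always falls between -2 and 2.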

### Example:

Let's assume we have the following sets of words:

- `X = ['artist', 'painter', 'sculptor']` (creative professions)
- `Y = ['chef', 'baker', 'cook']` (cooking professions)
- `w = 'man'` (attribute word)

We calculate the similarity between `w = 'man'` and each word in `X`, then take the average to get `sim_X`. Similarly, we compute the similarity between `w = 'man'` and each word in `Y`, and calculate `sim_Y`. Finally, we compute the difference `sim_X - sim_Y` to get the differential association.

#### Interpretation:
- If the result is positive, it means the word `w` (e.g., 'man') is more associated with the words in set `X` (e.g., 'artist', 'painter', 'sculptor').
- If the result is negative, the word `w` is more associated with set `Y` (e.g., 'chef', 'baker', 'cook').
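
The example can be sketched in code as below. The vectors are hypothetical: they were invented for this illustration and are not taken from any trained model, so the resulting number says nothing about real embeddings. The names `toy`, `cosine`, and `differential_association` are likewise introduced only for this sketch.

```python
import numpy as np

# HYPOTHETICAL 3-d vectors, invented for this example only (not real embeddings)
toy = {
    'man':      np.array([0.9, 0.1, 0.2]),
    'artist':   np.array([0.8, 0.2, 0.3]),
    'painter':  np.array([0.7, 0.3, 0.2]),
    'sculptor': np.array([0.9, 0.2, 0.1]),
    'chef':     np.array([0.2, 0.8, 0.3]),
    'baker':    np.array([0.1, 0.9, 0.2]),
    'cook':     np.array([0.3, 0.7, 0.4]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def differential_association(w, X, Y, emb):
    """Mean similarity of w to the words in X minus mean similarity to the words in Y."""
    sim_X = np.mean([cosine(emb[w], emb[x]) for x in X])
    sim_Y = np.mean([cosine(emb[w], emb[y]) for y in Y])
    return float(sim_X - sim_Y)

X_ex = ['artist', 'painter', 'sculptor']  # creative professions
Y_ex = ['chef', 'baker', 'cook']          # cooking professions

score = differential_association('man', X_ex, Y_ex, toy)
closer = 'X' if score > 0 else 'Y' if score < 0 else 'neither set'
print(f"s('man', X, Y) = {score:.3f} -> closer to {closer}")
```

With these invented vectors the score comes out positive, because `'man'` was deliberately placed near the `X` vectors; with real embeddings the sign and size would have to be measured.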


#### Implications:
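- A score far from zero indicates that the embeddings associate `w` more strongly with one set than with the other. This can point to a bias in the embeddings, which may have real-world consequences when they are used in NLP applications.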

In [None]:
# Sum s(a, X, Y) over the attribute words in A, then subtract the same sum over B
WEAT_score = sum(s(a, X, Y) for a in A) - sum(s(b, X, Y) for b in B)

print(f"WEAT score: {WEAT_score}")


WEAT score: -0.06243427547253355
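
For reference, the quantity computed by the cell above can be written as the formula below, using the function `s` defined earlier. Note that WEAT is more commonly stated with the roles reversed, summing `s(x, A, B)` over the target words in `X` and `Y`. As far as we can tell, the two forms give the same number when all four sets have the same size, as they do here, because cosine similarity is symmetric.

```latex
\mathrm{WEAT\_score}(X, Y, A, B) \;=\; \sum_{a \in A} s(a, X, Y) \;-\; \sum_{b \in B} s(b, X, Y)
```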


### Summary:
1. Define your word sets (`X`, `Y`, `A`, `B`).
2. Calculate the differential association for each word in `A` and `B`.
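3. Sum the differential associations within each attribute set (`A` and `B`).
4. Subtract the sum for `B` from the sum for `A` to obtain the WEAT score.
5. Interpret the score: a positive value means the words in `A` lean more toward `X` (relative to `Y`) than the words in `B` do, a negative value means the reverse, and a value near zero indicates little difference between the groups.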
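
The steps above can be sketched end to end as a single function. This is one possible vectorized variant, not the notebook's own implementation: the helper names `unit_rows` and `weat_score` are assumptions made for this sketch, and the embeddings are the toy values copied from the cells above.

```python
import numpy as np

# Toy embeddings copied from the notebook cells above
word_embeddings = {
    'lion': np.array([0.4, 0.5, 0.3]), 'tiger': np.array([0.6, 0.5, 0.7]),
    'elephant': np.array([0.3, 0.7, 0.4]), 'cat': np.array([0.8, 0.6, 0.4]),
    'dog': np.array([0.7, 0.5, 0.3]), 'parrot': np.array([0.5, 0.3, 0.6]),
    'king': np.array([0.9, 0.7, 0.8]), 'queen': np.array([0.8, 0.6, 0.7]),
    'prince': np.array([0.85, 0.7, 0.75]), 'princess': np.array([0.7, 0.6, 0.85]),
    'duke': np.array([0.9, 0.8, 0.7]), 'duchess': np.array([0.6, 0.8, 0.9]),
}

# Step 1: define the word sets
X = ['lion', 'tiger', 'elephant']
Y = ['cat', 'dog', 'parrot']
A = ['king', 'prince', 'duke']
B = ['queen', 'princess', 'duchess']

def unit_rows(words, emb):
    """Stack the embeddings of `words` into a matrix with unit-length rows."""
    M = np.stack([emb[w] for w in words])
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def weat_score(X, Y, A, B, emb):
    """Steps 2-4: sum of s(a, X, Y) over A minus sum of s(b, X, Y) over B."""
    mX, mY = unit_rows(X, emb), unit_rows(Y, emb)

    def s(words):
        # Rows are unit vectors, so W @ mX.T holds all pairwise cosine similarities
        W = unit_rows(words, emb)
        return (W @ mX.T).mean(axis=1) - (W @ mY.T).mean(axis=1)

    return float(s(A).sum() - s(B).sum())

score = weat_score(X, Y, A, B, word_embeddings)
print(f"WEAT score: {score:.4f}")  # WEAT score: -0.0624
```

Swapping `A` and `B` (or `X` and `Y`) flips the sign of the score, which is a useful sanity check when adapting the function to other word sets.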