# Fairness Metrics Comparison and Discussion

#### Comparison of Fairness Across Datasets

We evaluate the fairness of three recommendation models (**SVD++**, **UserKNN**, and **ItemKNN**) on two datasets, **MovieLens** and **Yelp**, using a range of fairness metrics. Both datasets support recommender system research, but they differ in scope, structure, and granularity. Not every metric was computed for every model; missing values appear as `-` in the result tables at the end of this notebook.

**Dataset Differences**:

- **MovieLens** is a structured dataset with explicit ratings, clear user demographics (gender, age, occupation), and consistent rating behavior, making it highly suitable for recommender system evaluation.
- **Yelp** consists of unstructured reviews and business metadata, offering fewer user demographics. Additionally, the domain difference (movies vs. local businesses) may impact recommendation behavior and fairness interpretation.
- For **Yelp**, fairness metrics were computed using only a subset of **50 users** due to **computational constraints**. Ideally, this analysis would include at least **500 users** for more robust results.
- In practice, **e-commerce** platforms may be more comparable to MovieLens in terms of structure and item categorization, whereas Yelp represents a broader and more diverse set of user preferences and item types.
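
The 50-user Yelp subset mentioned above could be drawn reproducibly along the following lines. This is a minimal sketch, not the notebook's actual sampling code: the function name `sample_users`, the `user_id` column, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def sample_users(ratings, n_users, user_col="user_id", seed=0):
    """Keep all ratings of a random subset of users (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    unique_users = ratings[user_col].unique()
    n = min(n_users, len(unique_users))  # never ask for more users than exist
    chosen = rng.choice(unique_users, size=n, replace=False)
    return ratings[ratings[user_col].isin(chosen)]


# Toy data, invented for illustration: 200 users with 3 ratings each.
toy_ratings = pd.DataFrame({
    "user_id": np.repeat(np.arange(200), 3),
    "item_id": np.tile(np.arange(3), 200),
    "rating": 4.0,
})

subset = sample_users(toy_ratings, n_users=50, seed=0)
print(subset["user_id"].nunique(), len(subset))
```

Fixing the seed keeps the subset identical between runs, so metric changes can be attributed to the model rather than to a different draw of users.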



#### Key Metric Observations

| Metric                         | MovieLens Insights                                              | Yelp Insights                                                   |
|-------------------------------|------------------------------------------------------------------|------------------------------------------------------------------|
| **RMSE (Accuracy)**           | SVD++ is most accurate (0.9287), followed by ItemKNN (0.9738) and UserKNN (1.0126). | Same ranking, but higher RMSE overall (SVD++ 1.0175, ItemKNN 1.2950, UserKNN 1.3022). |
| **Gender-Based RMSE**         | RMSE is consistently **higher for female users** across models (e.g., SVD++: 0.9699 vs. 0.9139), suggesting gender-based bias or data imbalance. | Mixed: higher for female users with SVD++ (1.0122 vs. 0.9846), lower with ItemKNN (1.2768 vs. 1.3171), and nearly equal with UserKNN. Users of unknown gender have the highest RMSE for every model. |
| **Counterfactual Difference** | All models show nearly identical sensitivity to gender flips (~0.0798), reflecting potential fairness concerns. | Similar magnitude for UserKNN (0.0756); not computed for the other models. |
| **Consistency**               | Nearly identical across models: SVD++ (0.2580) and UserKNN (0.2573) are slightly above ItemKNN (0.2520). | UserKNN scores 0.297, slightly above the MovieLens values; not computed for the other models. |
| **Statistical Parity**        | Low parity gaps for all models (UserKNN 0.0023, SVD++ 0.0118, ItemKNN 0.0154), indicating fair exposure. | Larger gap for UserKNN (0.0678), though still small; not computed for the other models. |
| **Rawlsian Min Exposure**     | ItemKNN gives the best exposure to the worst-off group (0.6690), ahead of UserKNN (0.5691) and SVD++ (0.5488). | UserKNN scores 0.7407, higher than any MovieLens value, indicating equitable coverage; not computed for the other models. |
| **Local Individual Fairness** | ItemKNN scores high (0.9491), indicating minimal deviation between similar users; not computed for the other models. | Much lower: ItemKNN 0.2615 and SVD++ 0.2267; not computed for UserKNN. |
| **Calibration Error**         | Reported for ItemKNN only (0.7588).                              | Lowest for UserKNN (0.5593), then SVD++ (0.7381); ItemKNN is highest (0.8946), suggesting over- or underconfident predictions. |
| **Disparate Impact Ratio**    | 0.9775 for ItemKNN, close to the ideal value of 1; not computed for the other models. | Close to 1 for all models (SVD++ 1.0078, ItemKNN 1.0194, UserKNN 1.0792). |
| **Demographic Parity**        | Low gap for ItemKNN (0.0154); not computed for the other models. | Low overall: smallest gap for SVD++ (0.0076), largest for UserKNN (0.0549). |
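
The gender gap in accuracy can be read directly off the reported numbers. The sketch below copies the female and male RMSE values from the result tables at the end of this notebook and computes their difference per model; the DataFrame layout and the column name `gap_female_minus_male` are choices made here for illustration.

```python
import pandas as pd

# RMSE by gender, copied from the MovieLens and Yelp result tables in this notebook.
rmse = pd.DataFrame({
    "dataset": ["MovieLens"] * 3 + ["Yelp"] * 3,
    "model": ["ItemKNN", "UserKNN", "SVD++"] * 2,
    "rmse_female": [1.0031, 1.0869, 0.9699, 1.2768, 1.2729, 1.0122],
    "rmse_male": [0.9634, 0.9855, 0.9139, 1.3171, 1.2726, 0.9846],
})

# A positive gap means predictions are less accurate for female users.
rmse["gap_female_minus_male"] = (rmse["rmse_female"] - rmse["rmse_male"]).round(4)

print(rmse.to_string(index=False))
```

On MovieLens the gap is positive for all three models, while on Yelp its sign depends on the model.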



### Discussion

The MovieLens dataset proves to be more suitable for rigorous recommender system evaluation due to its well-structured format, the availability of explicit ratings, and comprehensive demographic features. These elements make it easier to analyze model behavior and interpret fairness outcomes systematically. In contrast, the Yelp dataset, which is based on unstructured review data and contains fewer demographic attributes, introduces additional complexity. The domain-specific behavior associated with Yelp, such as evaluating restaurants or services, leads to greater noise and variability in user interactions, which may obscure fairness patterns.

A recurring theme across models and datasets is the impact of gender on fairness. Female users may have fewer interactions or narrower rating distributions than male users, which would reduce the system's ability to learn robust preferences and result in less accurate recommendations and lower exposure. The effect is clearest in the MovieLens gender-based RMSE, which is higher for female users under every model. Statistical parity and the disparate impact ratio show only small gaps between groups.
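
To make the group-level metrics concrete, the sketch below shows one common way to compute a statistical parity gap and a disparate impact ratio from predicted ratings, using the threshold T = 3.5 that appears in the result tables. The notebook's exact definitions are not shown in this section, so the function names, the "favourable outcome" definition (predicted rating at or above the threshold), and the toy predictions are all assumptions.

```python
import numpy as np


def favourable_rate(predicted, threshold=3.5):
    """Share of predicted ratings at or above the threshold."""
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(predicted >= threshold))


def parity_metrics(pred_group_a, pred_group_b, threshold=3.5):
    """Statistical parity gap and disparate impact ratio between two groups."""
    rate_a = favourable_rate(pred_group_a, threshold)
    rate_b = favourable_rate(pred_group_b, threshold)
    return {
        # Absolute difference in favourable-outcome rates; 0 means parity.
        "statistical_parity_gap": abs(rate_a - rate_b),
        # Ratio of favourable-outcome rates; 1 means parity.
        "disparate_impact_ratio": rate_a / rate_b if rate_b > 0 else float("nan"),
    }


# Toy predictions, invented for illustration only.
female_preds = [4.1, 3.2, 3.9, 2.8, 3.6]  # 3 of 5 at or above 3.5
male_preds = [4.4, 3.7, 3.1, 3.8, 4.0]    # 4 of 5 at or above 3.5

result = parity_metrics(female_preds, male_preds)
print(result)
```

Under these definitions, a parity gap near 0 and a ratio near 1 both indicate that the two groups receive favourable predictions at similar rates.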

The differences in scores between the datasets, especially in metrics like Local Individual Fairness (0.9491 for ItemKNN on MovieLens versus 0.2615 on Yelp), can be attributed to the nature of the domains themselves. MovieLens users engage with a shared catalog of movies that are broadly rated and consumed across user groups. This leads to more consistent behavior and better alignment between similar users. In contrast, Yelp users may vary significantly in the types of businesses they interact with, introducing sparsity and limiting the effectiveness of neighborhood-based models like KNN.
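
Sparsity of this kind can be quantified as the density of the user-item interaction matrix. The sketch below is one way to compute it from a ratings table; the helper name `interaction_density`, the column names, and the toy data are assumptions rather than part of the notebook.

```python
import pandas as pd


def interaction_density(ratings, user_col="user_id", item_col="item_id"):
    """Fraction of all possible user-item pairs that have an interaction."""
    n_users = ratings[user_col].nunique()
    n_items = ratings[item_col].nunique()
    n_pairs = len(ratings.drop_duplicates([user_col, item_col]))
    return n_pairs / (n_users * n_items)


# Toy data, invented for illustration: 3 users, 4 items, 6 observed pairs.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "item_id": ["a", "b", "c", "a", "d", "b"],
})

density = interaction_density(toy)
print(f"density = {density:.2f}, sparsity = {1 - density:.2f}")
```

A lower density means fewer co-rated items between users, which leaves neighborhood-based models with less evidence for finding similar users or items.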

Model design also plays a key role in observed fairness. SVD++, a model-based approach, achieves the lowest RMSE on both datasets. Model-based methods tend to generalize better in sparse conditions and avoid some of the pitfalls seen in neighbor-based methods. SVD++ is not uniformly best on every fairness metric, however: on Yelp, UserKNN has the lowest calibration error. Memory-based models such as UserKNN and ItemKNN are more sensitive to the density and quality of historical data, which can negatively affect both fairness and accuracy in noisier datasets.

One notable limitation in the Yelp evaluation is the sample size. Fairness metrics were calculated using only 50 users due to computational and time constraints. A larger sample, ideally 500 users or more, would be necessary for stronger statistical confidence and a more representative fairness assessment. This smaller scale likely contributes to higher variability in some metrics and limits the ability to generalize results.
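
The effect of sample size on metric stability can be illustrated with a percentile bootstrap over users. The sketch below uses synthetic per-user errors, not the Yelp data, and compares the width of a 95% confidence interval for the mean error with 50 versus 500 users; the function name `bootstrap_ci` and the gamma-distributed errors are assumptions chosen for illustration.

```python
import numpy as np


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample users with replacement, n_boot times.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Synthetic per-user errors, invented for illustration only.
rng = np.random.default_rng(42)
per_user_error = rng.gamma(shape=2.0, scale=0.5, size=500)

ci_50 = bootstrap_ci(per_user_error[:50])
ci_500 = bootstrap_ci(per_user_error)
width_50 = ci_50[1] - ci_50[0]
width_500 = ci_500[1] - ci_500[0]

print(f"50 users:  CI width = {width_50:.3f}")
print(f"500 users: CI width = {width_500:.3f}")
```

Because the standard error of a mean shrinks roughly with the square root of the sample size, moving from 50 to 500 users would be expected to narrow such an interval by about a factor of three.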

### Conclusion

Among the three models, SVD++ offers the best overall balance between predictive accuracy and fairness: it has the lowest RMSE on both datasets and the smallest demographic parity gap on Yelp. ItemKNN shows strong local individual fairness on MovieLens (0.9491), where user-item interactions are richer and more structured. UserKNN reaches a consistency score of 0.297 and a Rawlsian exposure score of 0.7407 on Yelp, which suggests that it treats similar users equitably and covers disadvantaged groups evenly. These two metrics were not computed for the other models on Yelp, so no direct comparison is possible.

Gender-related disparities persist across both datasets, reinforcing the importance of integrating fairness-aware design into recommendation pipelines. While all models exhibit some sensitivity to gender-based differences, the effects are most pronounced in sparse or demographically unbalanced data.

Ultimately, the structure and richness of the dataset play a critical role in how fairness is expressed and measured. MovieLens, with its high-quality data and balanced design, is better suited for fairness research and recommendation system evaluation than Yelp. As fairness in machine learning continues to grow in importance, choosing the right dataset is just as crucial as choosing the right model.

# Results Across Models

In [1]:
import pandas as pd
import numpy as np

In [2]:
# MovieLens fairness metrics table
# Build the DataFrame using np.nan for missing metrics
condensed_transposed = pd.DataFrame({
    'Metric': [
        'RMSE',
        'RMSE (Male)',
        'RMSE (Female)',
        'CFD (Gender)',
        'Consistency',
        'Stat. Parity (Gender, T=3.5)',
        'Rawlsian Exposure (Gender, T=3.5)',
        'Local Individual Fairness',
        'Calibration Error',
        'Disparate Impact Ratio',
        'Demographic Parity'
    ],
    'ItemKNN': [0.9738, 0.9634, 1.0031, 0.0798, 0.2520, 0.0154, 0.6690, 0.9491, 0.7588, 0.9775, 0.0154],
    'UserKNN': [1.0126, 0.9855, 1.0869, 0.0798, 0.2573, 0.0023, 0.5691, np.nan, np.nan, np.nan, np.nan],
    'SVD++':  [0.9287, 0.9139, 0.9699, 0.0799, 0.2580, 0.0118, 0.5488, np.nan, np.nan, np.nan, np.nan]
})

condensed_transposed.set_index('Metric', inplace=True)

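# Format for display: four decimals for numbers, '-' for missing values.
# Column-wise Series.map is used instead of the deprecated DataFrame.applymap.
formatted_table = condensed_transposed.apply(
    lambda col: col.map(lambda x: "-" if pd.isna(x) else f"{x:.4f}")
)

# Display
print("\nFairness Table (MovieLens):")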
display(formatted_table)


# Yelp fairness metrics table
# Define metric names and per-model scores as lists ("-" marks a metric that was not computed)
metrics = [
    "RMSE", "RMSE (Female)", "RMSE (Male)", "RMSE (Unknown)",
    "Calibration Error", "Local Individual Fairness",
    "Disparate Impact Ratio", "Demographic Parity",
    "Counterfactual Difference", "Consistency Score",
    "Statistical Parity", "Rawlsian Min Exposure"
]

svdpp    = [1.0175, 1.0122, 0.9846, 1.0574, 0.7381, 0.2267, 1.0078, 0.0076, "-", "-", "-", "-"]
itemknn  = [1.295,  1.2768, 1.3171, 1.3574, 0.8946, 0.2615, 1.0194, 0.0174, "-", "-", "-", "-"]
userknn  = [1.3022, 1.2729, 1.2726, 1.2959, 0.5593, "-",    1.0792, 0.0549, 0.0756, 0.297, 0.0678, 0.7407]

# Create the DataFrame; missing metrics are already "-" placeholders in the lists above,
# so no NaN replacement is needed
fairness_yelp = pd.DataFrame({
    'SVD++': svdpp,
    'ItemKNN': itemknn,
    'UserKNN': userknn
}, index=metrics)

# Display the table
print("\nFairness Summary Table Across Models (Yelp):")
display(fairness_yelp)





Fairness Table (MovieLens):

| Metric                            | ItemKNN | UserKNN | SVD++  |
|-----------------------------------|---------|---------|--------|
| RMSE                              | 0.9738  | 1.0126  | 0.9287 |
| RMSE (Male)                       | 0.9634  | 0.9855  | 0.9139 |
| RMSE (Female)                     | 1.0031  | 1.0869  | 0.9699 |
| CFD (Gender)                      | 0.0798  | 0.0798  | 0.0799 |
| Consistency                       | 0.2520  | 0.2573  | 0.2580 |
| Stat. Parity (Gender, T=3.5)      | 0.0154  | 0.0023  | 0.0118 |
| Rawlsian Exposure (Gender, T=3.5) | 0.6690  | 0.5691  | 0.5488 |
| Local Individual Fairness         | 0.9491  | -       | -      |
| Calibration Error                 | 0.7588  | -       | -      |
| Disparate Impact Ratio            | 0.9775  | -       | -      |
| Demographic Parity                | 0.0154  | -       | -      |



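Fairness Summary Table Across Models (Yelp):

| Metric                    | SVD++  | ItemKNN | UserKNN |
|---------------------------|--------|---------|---------|
| RMSE                      | 1.0175 | 1.295   | 1.3022  |
| RMSE (Female)             | 1.0122 | 1.2768  | 1.2729  |
| RMSE (Male)               | 0.9846 | 1.3171  | 1.2726  |
| RMSE (Unknown)            | 1.0574 | 1.3574  | 1.2959  |
| Calibration Error         | 0.7381 | 0.8946  | 0.5593  |
| Local Individual Fairness | 0.2267 | 0.2615  | -       |
| Disparate Impact Ratio    | 1.0078 | 1.0194  | 1.0792  |
| Demographic Parity        | 0.0076 | 0.0174  | 0.0549  |
| Counterfactual Difference | -      | -       | 0.0756  |
| Consistency Score         | -      | -       | 0.297   |
| Statistical Parity        | -      | -       | 0.0678  |
| Rawlsian Min Exposure     | -      | -       | 0.7407  |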
