Name:- Rakhi M Rajput

PRN NO. :- 22SC114501042

Class :- B.Tech AI&ML(A div)

Roll No. :- 38

Experiment 2 :- Impact of Data Quality on AI Fairness.
     

## **Title**
Impact of Data Quality on AI Fairness using the **German Credit Dataset**.

---

## **Objective**
The goal of this experiment is to analyze the **impact of sensitive attributes** on fairness in AI models.  
We will:
1. Train a logistic regression model on the German Credit dataset.  
2. Evaluate fairness metrics across groups.  
3. Apply **fairness mitigation** using `fairlearn`.  
4. Compare results before and after mitigation.  


In [1]:
%pip install fairlearn





[notice] A new release of pip is available: 25.0.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [46]:
%pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
Note: you may need to restart the kernel to use updated packages.





## **Step 2: Import Libraries**
We import all the necessary Python libraries for:
- Data handling (`pandas`, `numpy`)
- Model training (`scikit-learn`)
- Fairness evaluation (`fairlearn`)


In [50]:
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from fairlearn.preprocessing import CorrelationRemover
from fairlearn.metrics import MetricFrame, selection_rate, true_positive_rate, false_positive_rate, false_negative_rate


## **Step 3: Load the German Credit Dataset**
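The dataset assigned by the mentor is the Statlog German Credit data, fetched here from the UCI Machine Learning Repository (dataset id 144) with `ucimlrepo`.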


In [51]:
german_credit = fetch_ucirepo(id=144)

In [52]:
X = german_credit.data.features.copy()
y = german_credit.data.targets.copy()

In [66]:
df = pd.concat([X, y], axis=1)

In [67]:
print("First 5 rows of the dataset:")
display(df.head())

First 5 rows of the dataset:


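| | Attribute1 | Attribute2 | Attribute3 | Attribute4 | Attribute5 | Attribute6 | Attribute7 | Attribute8 | Attribute9 | Attribute10 | ... | Attribute12 | Attribute13 | Attribute14 | Attribute15 | Attribute16 | Attribute17 | Attribute18 | Attribute19 | Attribute20 | class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 6 | 4 | 4 | 1169 | 4 | 4 | 4 | 2 | 0 | ... | 0 | 67 | 2 | 1 | 2 | 2 | 1 | 1 | 0 | 1 |
| 1 | 1 | 48 | 2 | 4 | 5951 | 0 | 2 | 2 | 1 | 0 | ... | 0 | 22 | 2 | 1 | 1 | 2 | 1 | 0 | 0 | 2 |
| 2 | 3 | 12 | 4 | 7 | 2096 | 0 | 3 | 2 | 2 | 0 | ... | 0 | 49 | 2 | 1 | 1 | 1 | 2 | 0 | 0 | 1 |
| 3 | 0 | 42 | 2 | 3 | 7882 | 0 | 3 | 2 | 2 | 2 | ... | 1 | 45 | 2 | 2 | 1 | 2 | 2 | 0 | 0 | 1 |
| 4 | 0 | 24 | 3 | 0 | 4870 | 0 | 2 | 3 | 2 | 0 | ... | 3 | 53 | 2 | 2 | 2 | 2 | 2 | 0 | 0 | 2 |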


In [68]:
print("\nDataset shape (rows, columns):", df.shape)


Dataset shape (rows, columns): (1000, 21)


In [69]:
print("\nData types and non-null counts:")
print(df.info())


Data types and non-null counts:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Attribute1   1000 non-null   int32
 1   Attribute2   1000 non-null   int64
 2   Attribute3   1000 non-null   int32
 3   Attribute4   1000 non-null   int32
 4   Attribute5   1000 non-null   int64
 5   Attribute6   1000 non-null   int32
 6   Attribute7   1000 non-null   int32
 7   Attribute8   1000 non-null   int64
 8   Attribute9   1000 non-null   int32
 9   Attribute10  1000 non-null   int32
 10  Attribute11  1000 non-null   int64
 11  Attribute12  1000 non-null   int32
 12  Attribute13  1000 non-null   int64
 13  Attribute14  1000 non-null   int32
 14  Attribute15  1000 non-null   int32
 15  Attribute16  1000 non-null   int64
 16  Attribute17  1000 non-null   int32
 17  Attribute18  1000 non-null   int64
 18  Attribute19  1000 non-null   int32
 19  Attribute20  100

## **Step 4: Explore the Dataset**
- Summary statistics for the numerical columns.  
- Class counts of the target.  
- Missing-value check.  
- A first look at the raw features and target.


In [70]:

print("\nSummary statistics (numerical columns):")
display(df.describe())


Summary statistics (numerical columns):


| | Attribute1 | Attribute2 | Attribute3 | Attribute4 | Attribute5 | Attribute6 | Attribute7 | Attribute8 | Attribute9 | Attribute10 | ... | Attribute12 | Attribute13 | Attribute14 | Attribute15 | Attribute16 | Attribute17 | Attribute18 | Attribute19 | Attribute20 | class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | ... | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 1.577 | 20.903 | 2.545 | 3.277 | 3271.258 | 1.105 | 2.384 | 2.973 | 1.682 | 0.145 | ... | 1.358 | 35.546 | 1.675 | 0.929 | 1.407 | 1.904 | 1.155 | 0.404 | 0.037 | 1.3 |
| std | 1.257638 | 12.058814 | 1.08312 | 2.739302 | 2822.736876 | 1.580023 | 1.208306 | 1.118715 | 0.70808 | 0.477706 | ... | 1.050209 | 11.375469 | 0.705601 | 0.531264 | 0.577654 | 0.653614 | 0.362086 | 0.490943 | 0.188856 | 0.458487 |
| min | 0.0 | 4.0 | 0.0 | 0.0 | 250.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 19.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 0.0 | 12.0 | 2.0 | 1.0 | 1365.5 | 0.0 | 2.0 | 2.0 | 1.0 | 0.0 | ... | 0.0 | 27.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 50% | 1.0 | 18.0 | 2.0 | 3.0 | 2319.5 | 0.0 | 2.0 | 3.0 | 2.0 | 0.0 | ... | 1.0 | 33.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 75% | 3.0 | 24.0 | 4.0 | 4.0 | 3972.25 | 2.0 | 4.0 | 4.0 | 2.0 | 0.0 | ... | 2.0 | 42.0 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 2.0 |
| max | 3.0 | 72.0 | 4.0 | 9.0 | 18424.0 | 4.0 | 4.0 | 4.0 | 3.0 | 2.0 | ... | 3.0 | 75.0 | 2.0 | 2.0 | 4.0 | 3.0 | 2.0 | 1.0 | 1.0 | 2.0 |


In [71]:
print("\nTarget value counts:")
print(df[df.columns[-1]].value_counts())


Target value counts:
class
1    700
2    300
Name: count, dtype: int64


In [72]:
print("\nAny missing values?")
print(df.isnull().sum())


Any missing values?
Attribute1     0
Attribute2     0
Attribute3     0
Attribute4     0
Attribute5     0
Attribute6     0
Attribute7     0
Attribute8     0
Attribute9     0
Attribute10    0
Attribute11    0
Attribute12    0
Attribute13    0
Attribute14    0
Attribute15    0
Attribute16    0
Attribute17    0
Attribute18    0
Attribute19    0
Attribute20    0
class          0
dtype: int64


In [53]:
print("Shape of dataset:", X.shape)
print(X.head())
print(y.head())


Shape of dataset: (1000, 20)
  Attribute1  Attribute2 Attribute3 Attribute4  Attribute5 Attribute6  \
0        A11           6        A34        A43        1169        A65   
1        A12          48        A32        A43        5951        A61   
2        A14          12        A34        A46        2096        A61   
3        A11          42        A32        A42        7882        A61   
4        A11          24        A33        A40        4870        A61   

  Attribute7  Attribute8 Attribute9 Attribute10  Attribute11 Attribute12  \
0        A75           4        A93        A101            4        A121   
1        A73           2        A92        A101            2        A121   
2        A74           2        A93        A101            3        A121   
3        A74           2        A93        A103            4        A122   
4        A73           3        A93        A101            4        A124   

   Attribute13 Attribute14 Attribute15  Attribute16 Attribute17  Attribute1

## **Step 5: Preprocessing**
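- Convert categorical columns into numeric form. The code below uses label encoding (`LabelEncoder`), not one-hot encoding.  
- Target: the intended encoding is `Good` → 0, `Bad` → 1, but the code keeps the original UCI labels (`1` = good, `2` = bad).  
- Sensitive attribute: the intended encoding is `male` → 0, `female` → 1, but the code looks for a `personal_status` column, does not find one among `Attribute1` to `Attribute20`, and falls back to `Attribute1`.  
- Standardize features with `StandardScaler` for logistic regression.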
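
The intended sensitive attribute is sex, encoded as `male` → 0 and `female` → 1. In the UCI version of the data, sex is not a separate column; it is folded into `Attribute9` (personal status and sex) as codes `A91` to `A95`. A minimal sketch of deriving the binary attribute, assuming the code meanings given in the UCI documentation (the helper name `encode_sex` is hypothetical and is not part of this notebook):

```python
# Assumed code meanings for Attribute9 (personal status and sex), per the UCI documentation.
FEMALE_CODES = {"A92", "A95"}
MALE_CODES = {"A91", "A93", "A94"}

def encode_sex(personal_status_code: str) -> int:
    """Map an Attribute9 code to 0 (male) or 1 (female)."""
    if personal_status_code in FEMALE_CODES:
        return 1
    if personal_status_code in MALE_CODES:
        return 0
    raise ValueError(f"Unexpected Attribute9 code: {personal_status_code!r}")

# Attribute9 values of the first five rows shown in the X.head() output.
first_rows = ["A93", "A92", "A93", "A93", "A93"]
encoded = [encode_sex(code) for code in first_rows]
print(encoded)  # [0, 1, 0, 0, 0]
```

For this to work in the notebook, the mapping would have to be applied to the raw `Attribute9` strings, before the label-encoding and scaling cells overwrite them.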


In [54]:
label_encoders = {}
for col in X.columns:
    if X[col].dtype == 'object':
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col])
        label_encoders[col] = le
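
The loop above assigns each category an integer code, which a linear model such as logistic regression reads as an ordered quantity. One-hot encoding avoids that implied ordering by giving each category its own 0/1 column. A small stdlib-only sketch of the difference, using the `Attribute1` values from the `X.head()` output (the helper names are hypothetical; in pandas, `pd.get_dummies` would be the usual route):

```python
def label_encode(values):
    """Integer codes in sorted category order, as LabelEncoder does for strings."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values], categories

def one_hot_encode(values):
    """One 0/1 indicator column per category."""
    categories = sorted(set(values))
    return [[1 if v == cat else 0 for cat in categories] for v in values], categories

attribute1 = ["A11", "A12", "A14", "A11", "A11"]  # first five rows of X

codes, cats = label_encode(attribute1)
print(cats)   # ['A11', 'A12', 'A14']
print(codes)  # [0, 1, 2, 0, 0]

rows, _ = one_hot_encode(attribute1)
print(rows)   # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [1, 0, 0]]
```

On the full column the codes differ from this five-row sample (`A14` becomes 3) because `A13` also occurs in the data.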

In [55]:
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

In [56]:
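# The UCI frame names its columns Attribute1..Attribute20, so "personal_status"
# is not present and the fallback X.columns[0] ("Attribute1") is selected.
sensitive_feature = "personal_status" if "personal_status" in X.columns else X.columns[0]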
print("Sensitive feature chosen:", sensitive_feature)

Sensitive feature chosen: Attribute1


## **Step 6: Train-Test Split**
We split the data into **train (80%)** and **test (20%)** sets (`test_size=0.2`), using stratification on the target to preserve the class balance in both sets.
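
Stratifying on the target keeps the 700/300 class ratio in both partitions. A hand-rolled sketch of the idea with the `test_size=0.2` used in the code below (an illustration only, not scikit-learn's implementation; the function name `stratified_split` is hypothetical), using synthetic labels with the same class counts as the dataset:

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Synthetic target with the dataset's class counts: 700 x class 1, 300 x class 2.
labels = [1] * 700 + [2] * 300
train_idx, test_idx = stratified_split(labels, test_size=0.2)

print(len(train_idx), len(test_idx))         # 800 200
print(Counter(labels[i] for i in test_idx))  # Counter({1: 140, 2: 60})
```

The test partition keeps the 70/30 ratio (140 to 60), which a purely random split would only approximate.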


In [58]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

In [59]:
A_train = X_train[sensitive_feature]
A_test = X_test[sensitive_feature]
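
Because `A_train` and `A_test` are sliced from the standardized frame, the group labels are z-scores rather than category codes. A short sketch of recovering the codes by inverting `z = (x - mean) / std`, using the `Attribute1` mean and sample standard deviation from the `df.describe()` output and the group labels reported in Step 8. It assumes `StandardScaler` divides by the population standard deviation, and the category descriptions are taken from the UCI documentation rather than from this notebook:

```python
import math

n = 1000
mean_attr1 = 1.577            # from df.describe()
sample_std_attr1 = 1.257638   # from df.describe(), computed with ddof=1

# StandardScaler uses the population standard deviation (ddof=0).
population_std = sample_std_attr1 * math.sqrt((n - 1) / n)

def z_to_code(z):
    """Invert z = (x - mean) / std and round to the nearest integer code."""
    return round(z * population_std + mean_attr1)

group_labels = [-1.254566, -0.459026, 0.336513, 1.132053]  # Step 8 output
codes = [z_to_code(z) for z in group_labels]
print(codes)  # [0, 1, 2, 3]

# LabelEncoder sorts categories, so the codes correspond to A11..A14.
# Descriptions as given in the UCI documentation (an assumption here).
meaning = {
    0: "A11: checking account balance < 0 DM",
    1: "A12: 0 <= balance < 200 DM",
    2: "A13: balance >= 200 DM or salary assignment",
    3: "A14: no checking account",
}
for z, code in zip(group_labels, codes):
    print(f"{z:+.6f} -> {meaning[code]}")
```

Slicing the sensitive feature from the unscaled `X` instead would give readable group labels in the fairness table without changing the metrics.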


## **Step 7: Train Baseline Logistic Regression Model**
We train a simple logistic regression model to predict credit risk.


In [73]:
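model = LogisticRegression(max_iter=1000)
# Pass the target as a 1-D Series; fitting on the one-column DataFrame
# triggers scikit-learn's DataConversionWarning (column_or_1d).
model.fit(X_train, y_train["class"])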


In [74]:
y_pred = model.predict(X_test)

acc_before = accuracy_score(y_test, y_pred)
print("Accuracy before mitigation:", acc_before)

Accuracy before mitigation: 0.765
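
To put the 0.765 accuracy in context, it helps to compare it with a majority-class baseline that always predicts class 1. The sketch below works this out from the class counts reported earlier, assuming the stratified 20% test split holds 140 class-1 and 60 class-2 rows:

```python
n_class1, n_class2 = 700, 300   # target value counts
test_fraction = 0.2             # test_size passed to train_test_split

test_class1 = round(n_class1 * test_fraction)   # 140
test_class2 = round(n_class2 * test_fraction)   # 60
n_test = test_class1 + test_class2              # 200

baseline_accuracy = test_class1 / n_test
model_accuracy = 0.765                          # reported above
correct_predictions = round(model_accuracy * n_test)

print(f"Majority-class baseline: {baseline_accuracy:.3f}")              # 0.700
print(f"Model correct predictions: {correct_predictions} of {n_test}")  # 153 of 200
print(f"Gain over baseline: {model_accuracy - baseline_accuracy:.3f}")  # 0.065
```

The model therefore improves on the trivial baseline by about 6.5 percentage points, or 13 additional correct predictions out of 200.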


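## **Step 8: Fairness Evaluation**
We calculate **group fairness metrics** across the groups of the selected sensitive feature (`Attribute1`, not gender; its group labels appear as standardized values because the feature is taken from the scaled data):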
- True Positive Rate (TPR)  
- False Positive Rate (FPR)  
- False Negative Rate (FNR)  
- Selection Rate  
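
Each of these rates comes from the confusion-matrix counts within a group. A stdlib-only sketch with made-up labels (not taken from the dataset), treating class `1` as the positive label as the evaluation cell below does; the function name `group_rates` is hypothetical:

```python
def group_rates(y_true, y_pred, pos_label=1):
    """Compute TPR, FPR, FNR and selection rate for one group."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    tn = sum(t != pos_label and p != pos_label for t, p in pairs)
    return {
        "TPR": tp / (tp + fn),                     # positives correctly predicted positive
        "FPR": fp / (fp + tn),                     # negatives wrongly predicted positive
        "FNR": fn / (tp + fn),                     # positives wrongly predicted negative
        "selection_rate": (tp + fp) / len(pairs),  # share predicted positive
    }

# Toy example using the dataset's label coding: 1 = good credit, 2 = bad credit.
y_true = [1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 1, 1, 2, 1, 2, 2, 2]
rates = group_rates(y_true, y_pred)
print(rates)
# {'TPR': 0.75, 'FPR': 0.25, 'FNR': 0.25, 'selection_rate': 0.5}
```

By construction TPR + FNR = 1 within any group, which gives a quick sanity check on a per-group results table.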


In [75]:
# Convert y_test to a 1D array of integers
y_true = y_test["class"].values

# If the target labels are {1, 2}, set pos_label=1 for TPR, FPR, FNR
metrics_before = MetricFrame(
    metrics={
        'accuracy': accuracy_score,
        'selection_rate': selection_rate,
        'TPR': lambda y_true, y_pred: true_positive_rate(y_true, y_pred, pos_label=1),
        'FPR': lambda y_true, y_pred: false_positive_rate(y_true, y_pred, pos_label=1),
        'FNR': lambda y_true, y_pred: false_negative_rate(y_true, y_pred, pos_label=1)
    },
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=A_test
)

print("\nFairness Metrics Before Mitigation:")
print(metrics_before.by_group)
print("\nOverall Accuracy:", accuracy_score(y_test, y_pred))



Fairness Metrics Before Mitigation:
            accuracy  selection_rate       TPR       FPR       FNR
Attribute1                                                        
-1.254566   0.545455        0.400000  0.440000  0.366667  0.560000
-0.459026   0.763636        0.709091  0.882353  0.428571  0.117647
 0.336513   0.833333        0.916667  0.909091  1.000000  0.090909
 1.132053   0.910256        0.987179  1.000000  0.875000  0.000000

Overall Accuracy: 0.765
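
A compact way to summarize the table is the gap between the best and worst group for each metric. The sketch below computes that gap by hand from the printed per-group values; `MetricFrame.difference()` in `fairlearn` offers a comparable summary on the live object.

```python
# Per-group values copied from the "Fairness Metrics Before Mitigation" output.
columns = ["accuracy", "selection_rate", "TPR", "FPR", "FNR"]
by_group = {
    -1.254566: [0.545455, 0.400000, 0.440000, 0.366667, 0.560000],
    -0.459026: [0.763636, 0.709091, 0.882353, 0.428571, 0.117647],
    0.336513: [0.833333, 0.916667, 0.909091, 1.000000, 0.090909],
    1.132053: [0.910256, 0.987179, 1.000000, 0.875000, 0.000000],
}

gaps = {}
for i, name in enumerate(columns):
    values = [row[i] for row in by_group.values()]
    gaps[name] = max(values) - min(values)

for name, gap in gaps.items():
    print(f"{name:15s} max-min gap = {gap:.6f}")
# accuracy        max-min gap = 0.364801
# selection_rate  max-min gap = 0.587179
# TPR             max-min gap = 0.560000
# FPR             max-min gap = 0.633333
# FNR             max-min gap = 0.560000
```

The selection-rate gap of roughly 0.59 corresponds to the demographic parity difference across these four groups.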


### Conclusion

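The German Credit dataset is complete (1,000 rows, no missing values), which makes it convenient for fairness evaluation experiments. Its 700/300 class split and its mix of categorical and numerical features mean that encoding, scaling, and possibly rebalancing matter. The baseline logistic regression reaches 0.765 accuracy on the test set, but its behaviour differs sharply across the four `Attribute1` groups: accuracy ranges from about 0.55 to 0.91 and selection rate from 0.40 to 0.99. The next steps are to repeat the evaluation with a demographic sensitive attribute in place of the `Attribute1` fallback, apply fairness mitigation with `fairlearn`, and compare results before and after mitigation.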
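
`CorrelationRemover` is imported at the top of the notebook but not yet applied. As a rough illustration of the idea behind that transformer (a single-feature, stdlib-only sketch with made-up numbers, not `fairlearn`'s implementation; the function names are hypothetical), the code below subtracts from a feature its least-squares linear dependence on a sensitive attribute, leaving the two uncorrelated:

```python
import math
from statistics import mean

def remove_linear_correlation(x, s):
    """Return x with its least-squares linear dependence on s removed."""
    s_bar = mean(s)
    s_centered = [v - s_bar for v in s]
    # Least-squares slope of x on the centered sensitive attribute.
    slope = sum(a * b for a, b in zip(x, s_centered)) / sum(b * b for b in s_centered)
    return [a - slope * b for a, b in zip(x, s_centered)]

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    u_bar, v_bar = mean(u), mean(v)
    cov = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    norm = math.sqrt(sum((a - u_bar) ** 2 for a in u) * sum((b - v_bar) ** 2 for b in v))
    return cov / norm

s = [0, 0, 1, 1, 0, 1]               # made-up binary sensitive attribute
x = [1.0, 2.0, 4.0, 5.0, 1.5, 4.5]   # made-up feature correlated with s

filtered = remove_linear_correlation(x, s)
print(f"correlation before: {pearson(x, s):.3f}")              # 0.965
print(f"correlation after:  {abs(pearson(filtered, s)):.3f}")  # 0.000
print(filtered)  # [2.5, 3.5, 2.5, 3.5, 3.0, 3.0]
```

The within-group variation of the feature survives while the difference between group means is removed. The library transformer handles several features and sensitive columns at once; its exact options should be checked against the `fairlearn` documentation before use.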