### Adult Dataset 

This dataset, also known as the "Census Income" dataset, is used to predict whether an individual's annual income exceeds $50K based on census data. Barry Becker extracted it from the 1994 Census database; it contains 48,842 instances.

The dataset is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/2/adult).

#### Key Features:
- **Demographics:** Attributes like age, race, and sex.
- **Education:** Years of education completed, represented as `education-num`.
- **Work-related Attributes:** Hours worked per week (`hours-per-week`) and employment type (`workclass`).
- **Target Label:** Binary label indicating income level (>50K or <=50K).

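#### Preprocessing:

The preprocessed version of the dataset, loaded with `load_preproc_data_adult()`, uses the configuration described below.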

#### Protected Attributes:
- **Sex**
- **Race**

#### Privileged Classes:
- **Sex:** Male (`1.0`)
- **Race:** White (`1.0`)

#### Categorical Features:
- **Age (decade):** Grouped into discrete categories (`10`, `20`, `30`, ..., `>=70`), with one-hot encoding applied.
- **Education Years:** Grouped into discrete categories (`<6`, `6`, `7`, ..., `>12`), with one-hot encoding applied.
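
The grouping and one-hot encoding described above can be sketched in plain Python. This is a minimal illustration, not the code that `load_preproc_data_adult` runs: the helper names `age_decade`, `education_years`, and `one_hot` are hypothetical, and the cut-offs are inferred from the category labels listed above.

```python
def age_decade(age):
    """Map an age in years to its decade label, capping at '>=70'."""
    decade = (int(age) // 10) * 10
    return ">=70" if decade >= 70 else str(decade)


def education_years(years):
    """Map years of education to a label, collapsing both tails."""
    years = int(years)
    if years < 6:
        return "<6"
    if years > 12:
        return ">12"
    return str(years)


def one_hot(value, categories, prefix):
    """Return 0.0/1.0 indicator columns named '<prefix>=<category>'."""
    return {f"{prefix}={c}": float(value == c) for c in categories}


AGE_CATEGORIES = ["10", "20", "30", "40", "50", "60", ">=70"]
EDU_CATEGORIES = ["6", "7", "8", "9", "10", "11", "12", "<6", ">12"]

# Example record: age 25 with 7 years of education.
row = {}
row.update(one_hot(age_decade(25), AGE_CATEGORIES, "Age (decade)"))
row.update(one_hot(education_years(7), EDU_CATEGORIES, "Education Years"))
print(row)
```

Each record ends up with exactly one active indicator per group, here `Age (decade)=20` and `Education Years=7`.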

#### Metadata:
- Contains label mappings:
  - `>50K` as favorable.
  - `<=50K` as unfavorable.
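
As a rough sketch of how these mappings fit together, the snippet below encodes the label map and privileged classes as plain dictionaries. The names `LABEL_MAP`, `PRIVILEGED`, `describe`, and `is_privileged` are hypothetical and chosen for illustration; they are not part of the AIF360 API.

```python
# Illustrative mappings only; these names are assumptions, not library objects.
LABEL_MAP = {1.0: ">50K", 0.0: "<=50K"}
FAVORABLE_LABEL = 1.0
PRIVILEGED = {"sex": 1.0, "race": 1.0}  # Male and White, as listed above


def describe(label):
    """Return the income bracket for a label and whether it is favorable."""
    outcome = "favorable" if label == FAVORABLE_LABEL else "unfavorable"
    return f"{LABEL_MAP[label]} ({outcome})"


def is_privileged(record, attribute):
    """Check whether a record belongs to the privileged class of an attribute."""
    return record[attribute] == PRIVILEGED[attribute]


record = {"sex": 1.0, "race": 0.0}  # hypothetical individual
print(describe(1.0))
print(is_privileged(record, "sex"), is_privileged(record, "race"))
```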

#### Dataset Insights:
- **Demographic Skew:**
  - **Race:** 85.5% of individuals are White, making non-White individuals underrepresented.
  - **Sex:** 66.8% are Male, reflecting a male-dominated dataset.
- **Class Imbalance:**
  - The dataset has an imbalance in income outcomes, with a majority (76.1%) earning <=$50K.
- **Fairness Considerations:**
  - The imbalance and demographic skew could lead to biased models, particularly for underrepresented groups like non-White females. This highlights the need for fairness-aware preprocessing and evaluation.

#### Missing Data:
- There are no missing values in the preprocessed dataset.




In [1]:
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import copy
from IPython.display import display
from aif360.algorithms.preprocessing.optim_preproc_helpers.data_preproc_functions\
        import load_preproc_data_adult
from aif360.datasets import AdultDataset

# Ensure reproducibility
np.random.seed(1)

# Append a path if needed
sys.path.append("../")


pip install 'aif360[AdversarialDebiasing]'
pip install 'aif360[Reductions]'
pip install 'aif360[inFairness]'


#### Original Dataset

In [2]:
dataset = AdultDataset()
df = pd.DataFrame(dataset.features, columns=dataset.feature_names)
df



| | age | education-num | race | sex | capital-gain | capital-loss | hours-per-week | workclass=Federal-gov | workclass=Local-gov | workclass=Private | ... | native-country=Portugal | native-country=Puerto-Rico | native-country=Scotland | native-country=South | native-country=Taiwan | native-country=Thailand | native-country=Trinadad&Tobago | native-country=United-States | native-country=Vietnam | native-country=Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.0 | 7.0 | 0.0 | 1.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 38.0 | 9.0 | 1.0 | 1.0 | 0.0 | 0.0 | 50.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 28.0 | 12.0 | 1.0 | 1.0 | 0.0 | 0.0 | 40.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 44.0 | 10.0 | 0.0 | 1.0 | 7688.0 | 0.0 | 40.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 34.0 | 6.0 | 1.0 | 1.0 | 0.0 | 0.0 | 30.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45217 | 27.0 | 12.0 | 1.0 | 0.0 | 0.0 | 0.0 | 38.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 45218 | 40.0 | 9.0 | 1.0 | 1.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 45219 | 58.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 45220 | 22.0 | 9.0 | 1.0 | 1.0 | 0.0 | 0.0 | 20.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


In [3]:
print("Features of the original dataset")
dataset.feature_names

Features of the original dataset


['age',
 'education-num',
 'race',
 'sex',
 'capital-gain',
 'capital-loss',
 'hours-per-week',
 'workclass=Federal-gov',
 'workclass=Local-gov',
 'workclass=Private',
 'workclass=Self-emp-inc',
 'workclass=Self-emp-not-inc',
 'workclass=State-gov',
 'workclass=Without-pay',
 'education=10th',
 'education=11th',
 'education=12th',
 'education=1st-4th',
 'education=5th-6th',
 'education=7th-8th',
 'education=9th',
 'education=Assoc-acdm',
 'education=Assoc-voc',
 'education=Bachelors',
 'education=Doctorate',
 'education=HS-grad',
 'education=Masters',
 'education=Preschool',
 'education=Prof-school',
 'education=Some-college',
 'marital-status=Divorced',
 'marital-status=Married-AF-spouse',
 'marital-status=Married-civ-spouse',
 'marital-status=Married-spouse-absent',
 'marital-status=Never-married',
 'marital-status=Separated',
 'marital-status=Widowed',
 'occupation=Adm-clerical',
 'occupation=Armed-Forces',
 'occupation=Craft-repair',
 'occupation=Exec-managerial',
 'occupation=Farm

#### Preprocessed Dataset

In [5]:
preproc_data = load_preproc_data_adult()
preproc_df = pd.DataFrame(preproc_data.features, columns=preproc_data.feature_names)
preproc_df

Unnamed: 0,race,sex,Age (decade)=10,Age (decade)=20,Age (decade)=30,Age (decade)=40,Age (decade)=50,Age (decade)=60,Age (decade)=>=70,Education Years=6,Education Years=7,Education Years=8,Education Years=9,Education Years=10,Education Years=11,Education Years=12,Education Years=<6,Education Years=>12
0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
48838,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
48839,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
48840,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [6]:
print("Features of the preprocessed dataset")
preproc_data.feature_names

Features of the preprocessed dataset


['race',
 'sex',
 'Age (decade)=10',
 'Age (decade)=20',
 'Age (decade)=30',
 'Age (decade)=40',
 'Age (decade)=50',
 'Age (decade)=60',
 'Age (decade)=>=70',
 'Education Years=6',
 'Education Years=7',
 'Education Years=8',
 'Education Years=9',
 'Education Years=10',
 'Education Years=11',
 'Education Years=12',
 'Education Years=<6',
 'Education Years=>12']

In [7]:
# Check data types and null values
print(preproc_df.info())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   race                 48842 non-null  float64
 1   sex                  48842 non-null  float64
 2   Age (decade)=10      48842 non-null  float64
 3   Age (decade)=20      48842 non-null  float64
 4   Age (decade)=30      48842 non-null  float64
 5   Age (decade)=40      48842 non-null  float64
 6   Age (decade)=50      48842 non-null  float64
 7   Age (decade)=60      48842 non-null  float64
 8   Age (decade)=>=70    48842 non-null  float64
 9   Education Years=6    48842 non-null  float64
 10  Education Years=7    48842 non-null  float64
 11  Education Years=8    48842 non-null  float64
 12  Education Years=9    48842 non-null  float64
 13  Education Years=10   48842 non-null  float64
 14  Education Years=11   48842 non-null  float64
 15  Education Years=12   48842 non-null 

In [8]:
# Summary statistics for numerical columns
print(preproc_df.describe())



               race           sex  Age (decade)=10  Age (decade)=20  \
count  48842.000000  48842.000000     48842.000000     48842.000000   
mean       0.855043      0.668482         0.051390         0.245793   
std        0.352061      0.470764         0.220795         0.430561   
min        0.000000      0.000000         0.000000         0.000000   
25%        1.000000      0.000000         0.000000         0.000000   
50%        1.000000      1.000000         0.000000         0.000000   
75%        1.000000      1.000000         0.000000         0.000000   
max        1.000000      1.000000         1.000000         1.000000   

       Age (decade)=30  Age (decade)=40  Age (decade)=50  Age (decade)=60  \
count     48842.000000     48842.000000     48842.000000     48842.000000   
mean          0.264711         0.219565         0.135519         0.062528   
std           0.441184         0.413956         0.342280         0.242115   
min           0.000000         0.000000         0.00

- Race (mean = 0.855): Because `race` is encoded as 0.0/1.0, its mean is the share of records encoded as 1.0 (White), so 85.5% of the records represent White individuals.

- Sex (mean = 0.668): Likewise, 66.8% of the records are encoded as 1.0 (Male), indicating a male-dominated dataset.

This skew means a model may overfit to the majority groups: it can learn patterns specific to White males and generalize poorly to other combinations, such as non-White females. It may therefore unintentionally disadvantage minority groups that are underrepresented in the training data.
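
To make the representation argument concrete, the sketch below shows that the mean of a 0/1 column equals the share of records encoded as 1.0, and counts each (race, sex) combination. The ten records are toy values invented for illustration, not rows from the Adult dataset.

```python
from collections import Counter

# Toy (race, sex) pairs, NOT drawn from the Adult dataset.
records = [(1.0, 1.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
           (1.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 0.0)]

n = len(records)

# For a binary column, the mean is the fraction of rows equal to 1.0.
race_mean = sum(race for race, _ in records) / n
sex_mean = sum(sex for _, sex in records) / n

# Share of each intersectional group, e.g. (0.0, 0.0) = non-White female.
group_share = {group: count / n for group, count in Counter(records).items()}

print(f"race mean = {race_mean}, sex mean = {sex_mean}")
for group, share in sorted(group_share.items()):
    print(group, share)
```

Looking at the joint groups rather than each attribute separately shows how small the intersectional minority can be even when each marginal share looks moderate.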

In [15]:
# Flatten the label array and count each outcome (1.0 = >50K, 0.0 = <=50K)
labels = preproc_data.labels.flatten()
label_counts = pd.Series(labels).value_counts()
print(label_counts)

0.0    37155
1.0    11687
Name: count, dtype: int64


This means that:

- 23.9% of individuals have a favorable outcome (annual income > $50K).
- 76.1% of individuals have an unfavorable outcome (annual income <= $50K).
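
The percentages follow directly from the label counts printed above; a short check, using only those two counts:

```python
# Label counts taken from the value_counts() output above.
counts = {0.0: 37155, 1.0: 11687}

total = sum(counts.values())
shares = {label: 100 * count / total for label, count in counts.items()}

for label, pct in shares.items():
    print(f"label {label}: {pct:.1f}%")
```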

Only 23.9% of individuals earn >$50K, so the favorable class is the minority. Income differences in census data can reflect systemic inequalities, such as unequal access to education and job opportunities, and the imbalance may propagate those societal biases if protected attributes like race or sex correlate with income level. Models trained on this data risk reinforcing such biases and disadvantaging underrepresented groups (e.g., women or non-White individuals) in decision-making scenarios like hiring or loan approvals.
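
One way to check whether a protected attribute correlates with the outcome is to compare favorable-outcome rates between the unprivileged and privileged groups. The sketch below computes two common group-fairness measures, statistical parity difference and disparate impact, on toy data; the helper `favorable_rate` is hypothetical and the labels are invented for illustration, not taken from the Adult dataset.

```python
def favorable_rate(labels, groups, group_value, favorable=1.0):
    """Fraction of records in the given group that have the favorable label."""
    selected = [lab for lab, grp in zip(labels, groups) if grp == group_value]
    return sum(lab == favorable for lab in selected) / len(selected)


# Toy data: 1.0 = favorable label; sex 1.0 = privileged, 0.0 = unprivileged.
labels = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
sex = [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

privileged_rate = favorable_rate(labels, sex, 1.0)
unprivileged_rate = favorable_rate(labels, sex, 0.0)

# Statistical parity difference: 0 means equal rates; negative values
# mean the unprivileged group receives the favorable outcome less often.
spd = unprivileged_rate - privileged_rate

# Disparate impact: ratio of the two rates; 1 means equal rates.
di = unprivileged_rate / privileged_rate

print(f"statistical parity difference: {spd:.3f}")
print(f"disparate impact: {di:.3f}")
```

Running the same comparison on the real labels, for `sex` and for `race`, would quantify how strongly each protected attribute is associated with the >50K outcome before any model is trained.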