# 01: Data Preparation

This notebook downloads the datasets used by the FAIR-CARE pipeline, explores the COMPAS data, and measures its baseline bias before any processing.

## Datasets
- COMPAS: Recidivism risk assessment
- Adult Census: Income prediction
- German Credit: Credit risk
- NIJ: Recidivism forecasting (not downloaded or loaded in this notebook)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
sys.path.insert(0, '../src')

sns.set_theme(style="whitegrid")
%matplotlib inline

## Download Datasets

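Run the download script to fetch the COMPAS, Adult Census, and German Credit datasets. The loading cells later in this notebook read the resulting CSV files from `../data/raw/`.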

In [None]:
!python ../scripts/downloaddatasets.py --datasets compas,adult,german
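
Before loading anything, it can help to confirm that the download produced the files the later cells read. The following is a minimal sketch rather than part of the pipeline: the helper `missing_raw_files` and the list `EXPECTED_FILES` are hypothetical names, and the paths are simply the ones passed to `pd.read_csv` in this notebook.

```python
from pathlib import Path

# Relative paths as read by the loading cells in this notebook.
# EXPECTED_FILES and missing_raw_files are illustrative names, not pipeline API.
EXPECTED_FILES = [
    "compas/compas.csv",
    "adult/adult.csv",
    "german/german.csv",
]

def missing_raw_files(raw_dir="../data/raw", expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty under raw_dir."""
    raw_dir = Path(raw_dir)
    missing = []
    for rel in expected:
        path = raw_dir / rel
        # A zero-byte file usually indicates an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(str(path))
    return missing

missing_files = missing_raw_files()
if missing_files:
    print("Missing or empty files:", missing_files)
else:
    print("All expected raw files are present.")
```

If the list is non-empty, rerunning the download cell is likely cheaper than debugging a `FileNotFoundError` further down.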

## Load COMPAS Dataset

In [None]:
compas = pd.read_csv('../data/raw/compas/compas.csv')
print(f"COMPAS shape: {compas.shape}")
compas.head()
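
The analysis cells in this notebook index the `race` and `two_year_recid` columns directly, so a quick schema check right after loading can fail early with a readable message. This is a sketch under that assumption; `require_columns` is a hypothetical helper, and the toy frame stands in for the real `compas` DataFrame.

```python
import pandas as pd

def require_columns(df, required, name="dataframe"):
    """Raise a descriptive error if any required column is absent."""
    absent = [col for col in required if col not in df.columns]
    if absent:
        raise KeyError(f"{name} is missing required columns: {absent}")
    return df

# Toy stand-in with made-up values; in the notebook, pass `compas` instead.
toy_compas = pd.DataFrame({"race": ["X", "Y"], "two_year_recid": [0, 1]})
require_columns(toy_compas, ["race", "two_year_recid"], name="COMPAS")
```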

## Exploratory Data Analysis

In [None]:
# Basic statistics
compas.describe()

In [None]:
# Missing values
missing = compas.isnull().sum()
missing[missing > 0].sort_values(ascending=False)
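
Raw counts are harder to judge without the dataset size, so a percentage column may be a useful complement to the cell above. A minimal sketch, assuming nothing beyond pandas; `missing_summary` is a hypothetical helper and the toy frame contains made-up values, not COMPAS records.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have missing values.
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy frame for illustration only; in the notebook, call missing_summary(compas).
toy_missing = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "race": ["A", "B", None, None],
    "two_year_recid": [0, 1, 1, 0],
})
print(missing_summary(toy_missing))
```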

In [None]:
# Distribution of protected attribute (race)
plt.figure(figsize=(10, 6))
compas['race'].value_counts().plot(kind='bar')
plt.title('Distribution of Race in COMPAS Dataset')
plt.xlabel('Race')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Recidivism rate by race
recid_by_race = compas.groupby('race')['two_year_recid'].mean().sort_values()
plt.figure(figsize=(10, 6))
recid_by_race.plot(kind='barh')
plt.title('Recidivism Rate by Race')
plt.xlabel('Recidivism Rate')
plt.ylabel('Race')
plt.tight_layout()
plt.show()

## Baseline Bias Assessment

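Compute the statistical parity difference (SPD) on the raw labels, before any processing. Here the SPD is the observed two-year recidivism rate for African-American defendants minus the rate for Caucasian defendants, so it describes the base rates in the data rather than the predictions of a model. An absolute SPD above 0.1 is flagged as bias.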

In [None]:
# Statistical Parity Difference: difference in mean outcome between two groups
african_american_recid = compas.loc[compas['race'] == 'African-American', 'two_year_recid'].mean()
caucasian_recid = compas.loc[compas['race'] == 'Caucasian', 'two_year_recid'].mean()
spd = african_american_recid - caucasian_recid

print(f"African-American recidivism rate: {african_american_recid:.3f}")
print(f"Caucasian recidivism rate: {caucasian_recid:.3f}")
print(f"Statistical Parity Difference: {spd:.3f}")
print(f"Bias detected: {'Yes' if abs(spd) > 0.1 else 'No'}")
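
The comparison above covers one pair of groups. The same quantity can be computed for every group against a chosen reference, which may be useful because the dataset contains more than two race categories. This is a sketch under stated assumptions: `statistical_parity_differences` is a hypothetical helper, and the toy frame uses made-up groups and outcomes rather than COMPAS records.

```python
import pandas as pd

def statistical_parity_differences(df, group_col, outcome_col, reference):
    """Mean outcome of each group minus the reference group's mean outcome."""
    rates = df.groupby(group_col)[outcome_col].mean()
    if reference not in rates.index:
        raise ValueError(f"Reference group {reference!r} not found in {group_col!r}")
    # Drop the reference itself, whose difference is zero by construction.
    return (rates - rates[reference]).drop(reference).sort_values()

# Toy data for illustration only (group rates: X = 0.75, Y = 0.25, Z = 0.50).
toy_outcomes = pd.DataFrame({
    "race": ["X", "X", "X", "X", "Y", "Y", "Y", "Y", "Z", "Z"],
    "two_year_recid": [1, 1, 1, 0, 1, 0, 0, 0, 1, 0],
})
spd_by_group = statistical_parity_differences(
    toy_outcomes, "race", "two_year_recid", reference="Y"
)
print(spd_by_group)
```

On the real data, the call would be `statistical_parity_differences(compas, 'race', 'two_year_recid', reference='Caucasian')`. Groups with few rows give noisy rates, so the group sizes from the distribution plot are worth keeping in mind when reading the result.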

## Load Other Datasets

In [None]:
# Adult Census
adult = pd.read_csv('../data/raw/adult/adult.csv')
print(f"Adult shape: {adult.shape}")
adult.head()

In [None]:
# German Credit
german = pd.read_csv('../data/raw/german/german.csv')
print(f"German shape: {german.shape}")
german.head()

## Summary

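In this notebook we:
1. Downloaded the COMPAS, Adult Census, and German Credit datasets
2. Explored COMPAS: summary statistics, missing values, race distribution, and recidivism rate by race
3. Measured baseline bias in COMPAS with the statistical parity difference between African-American and Caucasian defendants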

**Next**: Proceed to notebook 02 for Bronze layer ingestion.