# Ethics in Data Science: Transparency, Accountability, Privacy, and Fairness

This notebook uses an artificial dataset to illustrate key concepts in ethical data science: transparency, accountability, privacy, and fairness. Each section provides code and explanations to help you understand and apply these principles.

In [1]:
# Import Required Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

## 1. Create Artificial Dataset

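We generate a synthetic dataset of 500 applicants with three features (age, gender, income) and a binary target variable (loan approval). Gender is a sensitive attribute, and the target is deliberately constructed to depend on it so that later sections have a bias to detect.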

In [2]:
# Generate artificial dataset
np.random.seed(42)
n_samples = 500
age = np.random.randint(18, 70, n_samples)
gender = np.random.choice(['Male', 'Female'], n_samples)
income = np.random.normal(50000, 15000, n_samples).astype(int)
# Simulate loan approval (target) with deliberate bias:
# only male applicants with income > 40000 and age > 25 are approved
loan_approved = ((income > 40000) & (age > 25) & (gender == 'Male')).astype(int)
data = pd.DataFrame({
    'Age': age,
    'Gender': gender,
    'Income': income,
    'Loan_Approved': loan_approved
})
data.head()

   Age  Gender  Income  Loan_Approved
0   56  Female   24449              0
1   69    Male   49166              1
2   46    Male   55760              1
3   32  Female   49509              0
4   60    Male   18988              0
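Because the target above is built from `gender == 'Male'`, the bias is already present in the labels before any model is trained. The sketch below rebuilds the same synthetic data so it runs on its own and compares approval rates by gender; the names `data_check` and `label_rates` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Rebuild the synthetic data with the same seed and the same call order
# as the generation cell, so this block is self-contained.
np.random.seed(42)
n_samples = 500
age = np.random.randint(18, 70, n_samples)
gender = np.random.choice(['Male', 'Female'], n_samples)
income = np.random.normal(50000, 15000, n_samples).astype(int)
loan_approved = ((income > 40000) & (age > 25) & (gender == 'Male')).astype(int)

data_check = pd.DataFrame({'Gender': gender, 'Loan_Approved': loan_approved})

# Share of approved applicants per gender in the labels themselves.
label_rates = data_check.groupby('Gender')['Loan_Approved'].mean()
print(label_rates)
```

By construction, no female applicant can be approved, so any model trained on these labels has only biased examples to learn from.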


## 2. Explore Dataset for Transparency

Transparency means documenting and sharing information about the data, including feature descriptions and data provenance.

In [3]:
# Inspect dataset for transparency
print('Feature Descriptions:')
print('Age: Age of applicant (18-69)')
print('Gender: Male or Female')
print('Income: Annual income in USD')
print('Loan_Approved: 1 if loan approved, 0 otherwise')
print('\nData Provenance: Artificially generated for educational purposes.')
data.info()
data.describe()

Feature Descriptions:
Age: Age of applicant (18-69)
Gender: Male or Female
Income: Annual income in USD
Loan_Approved: 1 if loan approved, 0 otherwise

Data Provenance: Artificially generated for educational purposes.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Age            500 non-null    int64 
 1   Gender         500 non-null    object
 2   Income         500 non-null    int64 
 3   Loan_Approved  500 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 15.8+ KB


             Age        Income  Loan_Approved
count      500.0         500.0          500.0
mean       44.22     50226.678          0.326
std    15.036082  14828.885765       0.469217
min         18.0        9546.0            0.0
25%         32.0       40619.0            0.0
50%         45.0       49816.5            0.0
75%         57.0       59604.0            1.0
max         69.0       96183.0            1.0
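Printed descriptions are easy to forget when columns change. One possible alternative is to keep the feature descriptions in a small data dictionary and check it against the DataFrame. This is a minimal sketch: the dictionary layout, the `sensitive` flags, and the helper `undocumented_columns` are assumptions made for illustration, and which columns count as sensitive is a judgment call that depends on context.

```python
import pandas as pd

# Hypothetical data dictionary; the keys 'description' and 'sensitive'
# are illustrative, not a standard schema.
data_dictionary = {
    'Age': {'description': 'Age of applicant in years', 'sensitive': True},
    'Gender': {'description': 'Male or Female', 'sensitive': True},
    'Income': {'description': 'Annual income in USD', 'sensitive': False},
    'Loan_Approved': {'description': '1 if loan approved, 0 otherwise',
                      'sensitive': False},
}

def undocumented_columns(df, dictionary):
    """Return the columns of df that have no entry in the dictionary."""
    return [col for col in df.columns if col not in dictionary]

# Two rows copied from the dataset preview, enough to exercise the check.
sample = pd.DataFrame({
    'Age': [56, 69],
    'Gender': ['Female', 'Male'],
    'Income': [24449, 49166],
    'Loan_Approved': [0, 1],
})

missing = undocumented_columns(sample, data_dictionary)
print('Undocumented columns:', missing)

# Render the dictionary as a table that can be shared with the data.
doc_table = pd.DataFrame.from_dict(data_dictionary, orient='index')
print(doc_table)
```

Running the same check after adding a column (for example a `Name` column) would report it as undocumented, which is a prompt to describe it before sharing the data.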


## 3. Demonstrate Accountability with Data Logging

Accountability involves keeping records of data processing steps and model decisions. This helps trace actions and ensure responsible use.

In [4]:
# Simple logging for accountability
import logging
logging.basicConfig(level=logging.INFO)
logging.info('Splitting data into train and test sets')
X = data[['Age', 'Gender', 'Income']]
# Encode gender
X = pd.get_dummies(X, drop_first=True)
y = data['Loan_Approved']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
logging.info('Training logistic regression model')
model = LogisticRegression()
model.fit(X_train, y_train)
logging.info('Model training complete')

INFO:root:Splitting data into train and test sets
INFO:root:Training logistic regression model
INFO:root:Model training complete
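The log messages above record that a step happened, but not which data or parameters were used. A slightly richer audit trail could attach a content fingerprint of the data and the key parameters to each entry. The sketch below is one way to do that with the standard library; the logger name `audit_sketch` and the helpers `dataframe_fingerprint` and `audit_record` are hypothetical names introduced here.

```python
import hashlib
import json
import logging

import pandas as pd

# Timestamps make entries traceable. Note that basicConfig does nothing
# if the root logger was already configured earlier in the session.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s %(message)s',
)
logger = logging.getLogger('audit_sketch')  # hypothetical logger name

def dataframe_fingerprint(df):
    """SHA-256 of the CSV serialization of df, used as a content fingerprint."""
    return hashlib.sha256(df.to_csv(index=False).encode('utf-8')).hexdigest()

def audit_record(step, **details):
    """Log one processing step with its details as a JSON string."""
    record = {'step': step, **details}
    logger.info(json.dumps(record, sort_keys=True))
    return record

# Small stand-in for the feature matrix.
sample = pd.DataFrame({'Age': [56, 69], 'Income': [24449, 49166]})

rec = audit_record(
    'train_test_split',
    data_sha256=dataframe_fingerprint(sample),
    test_size=0.3,
    random_state=42,
)
```

If the data changes, the fingerprint changes, so a later reader of the log can tell whether two runs used the same input.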


## 4. Illustrate Privacy by Data Anonymization

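Protecting privacy starts with removing or masking personally identifiable information (PII), such as names, from the dataset. Dropping direct identifiers is a first step rather than a guarantee: combinations of the remaining attributes (for example age, gender, and income) can still single out an individual.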

In [5]:
# Demonstrate anonymization (if dataset had PII)
# For illustration, add a fake 'Name' column to a copy and anonymize it
names = [f'Person_{i}' for i in range(n_samples)]
data_with_pii = data.copy()  # work on a copy so `data` itself never holds PII
data_with_pii['Name'] = names
print('Before anonymization:')
print(data_with_pii[['Name', 'Age', 'Gender', 'Income']].head())
# Remove the direct identifier
anonymized_data = data_with_pii.drop(columns=['Name'])
print('\nAfter anonymization:')
print(anonymized_data.head())

Before anonymization:
       Name  Age  Gender  Income
0  Person_0   56  Female   24449
1  Person_1   69    Male   49166
2  Person_2   46    Male   55760
3  Person_3   32  Female   49509
4  Person_4   60    Male   18988

After anonymization:
   Age  Gender  Income  Loan_Approved
0   56  Female   24449              0
1   69    Male   49166              1
2   46    Male   55760              1
3   32  Female   49509              0
4   60    Male   18988              0
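Dropping the `Name` column removes the direct identifier but also any way to link records, and it leaves exact ages in place. The sketch below shows two common complements under stated assumptions: replacing names with a salted hash (pseudonymization, not full anonymization) and generalizing age into bands. The salt value, the `pseudonymize` helper, the `Person_ID` and `Age_Band` columns, and the band boundaries are all hypothetical choices for illustration.

```python
import hashlib

import pandas as pd

# Hypothetical salt; a real one should be secret and stored outside the notebook.
SALT = 'replace-with-a-secret-value'

def pseudonymize(value, salt=SALT):
    """Map a value to a short, stable token using a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + str(value)).encode('utf-8')).hexdigest()
    return digest[:12]

# Three rows in the same shape as the notebook's data with a Name column.
people = pd.DataFrame({
    'Name': ['Person_0', 'Person_1', 'Person_2'],
    'Age': [56, 69, 46],
    'Income': [24449, 49166, 55760],
})

masked = people.copy()

# Replace the direct identifier with a token that still allows record linkage.
masked['Person_ID'] = masked['Name'].map(pseudonymize)
masked = masked.drop(columns=['Name'])

# Generalize exact ages into bands; intervals are right-inclusive,
# e.g. (44, 59] is labelled '45-59'.
masked['Age_Band'] = pd.cut(
    masked['Age'],
    bins=[17, 29, 44, 59, 120],
    labels=['18-29', '30-44', '45-59', '60+'],
)
masked = masked.drop(columns=['Age'])

print(masked)
```

Anyone who knows the salt can recompute the tokens from candidate names, so this technique only reduces exposure; it does not make the data anonymous.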


## 5. Analyze Fairness in Model Predictions

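Fairness means ensuring that model predictions do not systematically disadvantage any group. We evaluate two group fairness metrics with respect to gender: demographic parity, which holds when every group receives positive predictions at the same rate, and equal opportunity, which holds when every group has the same true positive rate (the share of actual positives that the model predicts as positive).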

In [6]:
# Predict and evaluate fairness
preds = model.predict(X_test)
results = X_test.copy()
results['Actual'] = y_test.values
results['Predicted'] = preds
results['Gender'] = data.loc[X_test.index, 'Gender'].values

# Demographic parity: compare positive prediction rates by gender
groups = results.groupby('Gender')
for gender, group in groups:
    rate = (group['Predicted'] == 1).mean()
    print(f'Demographic parity (Predicted=1 rate) for {gender}: {rate:.2f}')

# Equal opportunity: compare true positive rates by gender
for gender, group in groups:
    true_positives = ((group['Actual'] == 1) & (group['Predicted'] == 1)).sum()
    actual_positives = (group['Actual'] == 1).sum()
    if actual_positives > 0:
        tpr = true_positives / actual_positives
        print(f'Equal opportunity (TPR) for {gender}: {tpr:.2f}')
    else:
        print(f'No actual positives for {gender}.')

Demographic parity (Predicted=1 rate) for Female: 0.00
Demographic parity (Predicted=1 rate) for Male: 0.64
No actual positives for Female.
Equal opportunity (TPR) for Male: 0.85
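The per-group rates printed above can be condensed into a single gap per metric, which is easier to track over time. The following is a minimal sketch on a small hand-made table rather than the notebook's test set; the helper names `selection_rates` and `true_positive_rates` and the `toy` data are illustrative assumptions.

```python
import pandas as pd

def selection_rates(df, group_col, pred_col):
    """Positive prediction rate per group (used for demographic parity)."""
    return df.groupby(group_col)[pred_col].mean()

def true_positive_rates(df, group_col, actual_col, pred_col):
    """True positive rate per group (used for equal opportunity).

    Groups without any actual positives do not appear in the result,
    so the metric is undefined for them rather than zero.
    """
    positives = df[df[actual_col] == 1]
    return positives.groupby(group_col)[pred_col].mean()

# Hand-made example with four applicants per group.
toy = pd.DataFrame({
    'Gender': ['Female'] * 4 + ['Male'] * 4,
    'Actual': [1, 0, 1, 0, 1, 1, 0, 0],
    'Predicted': [1, 0, 0, 0, 1, 1, 1, 0],
})

sel = selection_rates(toy, 'Gender', 'Predicted')
tpr = true_positive_rates(toy, 'Gender', 'Actual', 'Predicted')

# Gap between the best- and worst-treated group; 0 means parity.
dp_gap = sel.max() - sel.min()
eo_gap = tpr.max() - tpr.min()

print(f'Demographic parity gap: {dp_gap:.2f}')
print(f'Equal opportunity gap: {eo_gap:.2f}')
```

In the notebook's own results there are no actual positives among female applicants, so the true positive rate for that group is undefined and an equal opportunity gap cannot be computed; the demographic parity gap is the informative number there.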
