In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from imblearn.over_sampling import SMOTE


In [2]:
'''
Preprocessing: We first cleaned the 'dti' (debt-to-income) column, which contained various formats of percentage ranges. 
We developed a function to handle these different formats and calculated the average value for each range. 
Then, we dropped irrelevant columns and kept only the selected features (income, dti, loan amount, ltv, property value, and race) in the dataset.

Model Training: We split the dataset into training and testing sets and trained a logistic regression model using the selected features. 
The target variable was 'loan_approval', indicating whether a loan application was approved or not.

Model Evaluation: We calculated the accuracy of the model using the test set.

Fairness Evaluation: We evaluated the model's fairness using demographic parity. 
To do this, we calculated the proportion of positive outcomes (loan approvals) for each group of the sensitive attribute (race) in the test set. 
We then compared these proportions to determine if the model is fair. If the maximum difference in approval rates between the groups is 
less than or equal to a pre-defined fairness threshold (e.g., 0.05), the model is considered fair.

This entire process aimed to identify and address disparities in mortgage lending approval rates by training a machine learning model 
on relevant features and ensuring fairness in its predictions.

'''


"\nPreprocessing: We first cleaned the 'dti' (debt-to-income) column, which contained various formats of percentage ranges. \nWe developed a function to handle these different formats and calculated the average value for each range. \nThen, we dropped irrelevant columns and kept only the selected features (income, dti, loan amount, ltv, property value, and race) in the dataset.\n\nModel Training: We split the dataset into training and testing sets and trained a logistic regression model using the selected features. \nThe target variable was 'loan_approval', indicating whether a loan application was approved or not.\n\nModel Evaluation: We calculated the accuracy of the model using the test set.\n\nFairness Evaluation: We evaluated the model's fairness using demographic parity. \nTo do this, we calculated the proportion of positive outcomes (loan approvals) for each group of the sensitive attribute (race) in the test set. \nWe then compared these proportions to determine if the model is

In [3]:
pd.set_option('display.max_columns', None)
df = pd.read_csv('2021_public_lar.csv',usecols=['state_code','derived_loan_product_type','derived_dwelling_category'
                                                ,'derived_race','applicant_race_1','action_taken',
                                                'loan_purpose','business_or_commercial_purpose',
                                                'loan_amount','combined_loan_to_value_ratio',
                                                'property_value','occupancy_type','income',
                                                'debt_to_income_ratio'])

  df = pd.read_csv('2021_public_lar.csv',usecols=['state_code','derived_loan_product_type','derived_dwelling_category'


In [4]:
df

Unnamed: 0,state_code,derived_loan_product_type,derived_dwelling_category,derived_race,action_taken,loan_purpose,business_or_commercial_purpose,loan_amount,combined_loan_to_value_ratio,property_value,occupancy_type,income,debt_to_income_ratio,applicant_race_1
0,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,2,1,2,425000,51.829,825000,1,159.0,46,5.0
1,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,95000,95.0,105000,1,35.0,39,5.0
2,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,95000,96.999,95000,1,24.0,38,5.0
3,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,4,31,1,125000,,,3,141.0,,5.0
4,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,105000,90.0,115000,1,80.0,<20%,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26124547,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,165000,91.954,175000.0,1,44.0,30%-<36%,5.0
26124548,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,Race Not Available,1,1,2,165000,97.0,165000.0,1,45.0,40,6.0
26124549,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,255000,99.953,265000.0,1,64.0,30%-<36%,5.0
26124550,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,225000,80.0,285000.0,1,77.0,41,5.0


In [5]:
df.applicant_race_1.unique()

array([ 5.,  2.,  3.,  6., 27.,  1.,  7.,  4., 44., nan, 23., 21., 22.,
       26., 24., 41., 25., 42., 43.])

In [6]:
df['debt_to_income_ratio'] = df['debt_to_income_ratio'].astype(str)

In [7]:
def average_dti_range(dti_value):
    if pd.isna(dti_value) or dti_value == 'nan' or dti_value == 'Exempt':
        return np.nan

    range_pattern = r'(\d+)%-<(\d+)%|(\d+)%-(\d+)%'
    less_than_pattern = r'<(\d+)%'
    greater_than_pattern = r'>(\d+)%'
    
    if re.match(range_pattern, dti_value):
        bounds = re.findall(range_pattern, dti_value)[0]
        bounds = [float(b) for b in bounds if b]
        lower_bound, upper_bound = bounds
        return (lower_bound + upper_bound) / 2
    elif re.match(less_than_pattern, dti_value):
        upper_bound = re.findall(less_than_pattern, dti_value)[0][0]
        upper_bound = float(upper_bound)
        return upper_bound / 2
    elif re.match(greater_than_pattern, dti_value):
        lower_bound = re.findall(greater_than_pattern, dti_value)[0][0]
        lower_bound = float(lower_bound)
        return lower_bound * 1.1
    else:
        return float(dti_value.replace('%', ''))

# Apply the function to the DTI column

In [8]:
df_ny = df.dropna()
df_ny

Unnamed: 0,state_code,derived_loan_product_type,derived_dwelling_category,derived_race,action_taken,loan_purpose,business_or_commercial_purpose,loan_amount,combined_loan_to_value_ratio,property_value,occupancy_type,income,debt_to_income_ratio,applicant_race_1
0,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,2,1,2,425000,51.829,825000,1,159.0,46,5.0
1,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,95000,95.0,105000,1,35.0,39,5.0
2,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,95000,96.999,95000,1,24.0,38,5.0
4,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,105000,90.0,115000,1,80.0,<20%,5.0
6,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,2,2,115000,30.556,365000,1,67.0,42,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26124547,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,165000,91.954,175000.0,1,44.0,30%-<36%,5.0
26124548,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,Race Not Available,1,1,2,165000,97.0,165000.0,1,45.0,40,6.0
26124549,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,255000,99.953,265000.0,1,64.0,30%-<36%,5.0
26124550,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,225000,80.0,285000.0,1,77.0,41,5.0


In [9]:
'''
1 - American Indian or Alaska Native
2 - Asian
3 - Black or African American
4 - Native Hawaiian or Other Pacific Islander
5 - White
'''


#df_ny = df[(df["state_code"] == 'NY')]

df_ny = df_ny[df_ny['derived_loan_product_type'] == 'Conventional:First Lien']
df_ny = df_ny.loc[df_ny['loan_purpose'].isin([1])]
df_ny = df_ny.loc[df_ny['business_or_commercial_purpose'].isin([2])]
df_ny = df_ny[df_ny['derived_dwelling_category'] == 'Single Family (1-4 Units):Site-Built']
df_ny = df_ny.loc[df_ny['occupancy_type'].isin([1])]
df_ny = df_ny[df_ny["combined_loan_to_value_ratio"].str.contains("Exempt") == False]
df_ny['combined_loan_to_value_ratio'] = df_ny['combined_loan_to_value_ratio'].astype(str).astype(float)
#df_ny['interest_rate'] = df_ny['interest_rate'].astype(str).astype(float)
df_ny['property_value'] = df_ny['property_value'].astype(str).astype(float)
df_ny = df_ny.loc[df_ny['action_taken'].isin([1,3])]
df_ny = df_ny.loc[df_ny['applicant_race_1'].isin([1,2,3,4,5])]
df_ny['action_taken'] = df['action_taken'].replace({3: 0})
df_ny['debt_to_income_ratio'] = df_ny['debt_to_income_ratio'].apply(average_dti_range)

In [10]:
df_ny

Unnamed: 0,state_code,derived_loan_product_type,derived_dwelling_category,derived_race,action_taken,loan_purpose,business_or_commercial_purpose,loan_amount,combined_loan_to_value_ratio,property_value,occupancy_type,income,debt_to_income_ratio,applicant_race_1
1,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,95000,95.000,105000.0,1,35.0,39.0,5.0
2,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,95000,96.999,95000.0,1,24.0,38.0,5.0
4,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,105000,90.000,115000.0,1,80.0,1.0,5.0
9,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,255000,80.000,315000.0,1,68.0,36.0,5.0
10,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,305000,89.820,335000.0,1,107.0,38.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26124279,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,125000,96.154,135000.0,1,49.0,25.0,5.0
26124280,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,135000,97.000,135000.0,1,50.0,33.0,5.0
26124281,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,135000,97.000,145000.0,1,51.0,33.0,5.0
26124283,NY,Conventional:First Lien,Single Family (1-4 Units):Site-Built,White,1,1,2,155000,97.000,165000.0,1,49.0,41.0,5.0


In [11]:
selected_features = ['income', 'debt_to_income_ratio', 'loan_amount', 'combined_loan_to_value_ratio', 'property_value','action_taken','applicant_race_1']
columns_to_drop = [col for col in df_ny.columns if col not in selected_features]
df_ny = df_ny.drop(columns=columns_to_drop)

In [12]:
df_ny

Unnamed: 0,action_taken,loan_amount,combined_loan_to_value_ratio,property_value,income,debt_to_income_ratio,applicant_race_1
1,1,95000,95.000,105000.0,35.0,39.0,5.0
2,1,95000,96.999,95000.0,24.0,38.0,5.0
4,1,105000,90.000,115000.0,80.0,1.0,5.0
9,1,255000,80.000,315000.0,68.0,36.0,5.0
10,1,305000,89.820,335000.0,107.0,38.0,5.0
...,...,...,...,...,...,...,...
26124279,1,125000,96.154,135000.0,49.0,25.0,5.0
26124280,1,135000,97.000,135000.0,50.0,33.0,5.0
26124281,1,135000,97.000,145000.0,51.0,33.0,5.0
26124283,1,155000,97.000,165000.0,49.0,41.0,5.0


In [13]:
df_ny.to_csv('train_filtered.csv')

In [14]:
df_ny.groupby(by='action_taken').count()

Unnamed: 0_level_0,loan_amount,combined_loan_to_value_ratio,property_value,income,debt_to_income_ratio,applicant_race_1
action_taken,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,39302,39302,39302,39302,38012,39302
1,684546,684546,684546,684546,683827,684546


In [15]:
'''
Model selection: chose logistic regression model for this binary classification problem (loan approval). 

Hyperparameter selection: tune the 'C' parameter, which is the inverse of regularization strength in logistic regression. 
Regularization helps prevent overfitting by penalizing large coefficients. 
Then creates an array of possible 'C' values ranging from 0.0001 to 10000 (in log scale) using np.logspace(-4, 4, 20).

Model instantiation: instantiated a logistic regression model with the 'liblinear' solver, 
which is a good choice for small datasets. The solver is an optimization algorithm that the model uses
 to find the best coefficients during the training phase.

Grid search cross-validation: To find the best 'C' value, use GridSearchCV from the scikit-learn library. 
Grid search cross-validation is a method that searches through the predefined hyperparameter space and 
selects the best hyperparameters based on cross-validated performance. 
In this case, use 5-fold cross-validation (cv=5), which means that the training data was split into 5 equal parts, 
and the model was trained and evaluated 5 times, each time using a different part as the validation set. 

The evaluation metric chosen was the ROC-AUC score (Receiver Operating Characteristic - Area Under the Curve), which is
 a commonly used metric for binary classification problems.

Model training: You fit the GridSearchCV object on the training data using grid_search.fit(X_train, y_train). 
This trains the logistic regression model with each combination of 'C' values using 5-fold cross-validation and records the ROC-AUC scores.
'''


"\nModel selection: chose logistic regression model for this binary classification problem (loan approval). \n\nHyperparameter selection: tune the 'C' parameter, which is the inverse of regularization strength in logistic regression. \nRegularization helps prevent overfitting by penalizing large coefficients. \nThen creates an array of possible 'C' values ranging from 0.0001 to 10000 (in log scale) using np.logspace(-4, 4, 20).\n\nModel instantiation: instantiated a logistic regression model with the 'liblinear' solver, \nwhich is a good choice for small datasets. The solver is an optimization algorithm that the model uses\n to find the best coefficients during the training phase.\n\nGrid search cross-validation: To find the best 'C' value, use GridSearchCV from the scikit-learn library. \nGrid search cross-validation is a method that searches through the predefined hyperparameter space and \nselects the best hyperparameters based on cross-validated performance. \nIn this case, use 5-f

In [16]:
# Normal logistic regression
# Handle missing values
df_ny.dropna(inplace=True)

# Feature scaling
scaler = StandardScaler()
df_ny[['income', 'debt_to_income_ratio', 'loan_amount', 'combined_loan_to_value_ratio', 'property_value','applicant_race_1']] = scaler.fit_transform(df_ny[['income', 'debt_to_income_ratio', 'loan_amount', 'combined_loan_to_value_ratio', 'property_value','applicant_race_1']])

# Split the dataset
X = df_ny.drop(columns=['action_taken'])
y = df_ny['action_taken']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model selection and training
params = {'C': np.logspace(-4, 4, 20)}
log_reg = LogisticRegression(solver='liblinear')
grid_search = GridSearchCV(log_reg, params, cv=5, verbose=1, scoring='roc_auc', n_jobs=15)
grid_search.fit(X_train, y_train)

# Model evaluation
# Model evaluation (continued)
y_pred = grid_search.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("AUC-ROC:", roc_auc_score(y_test, y_pred))



Fitting 5 folds for each of 20 candidates, totalling 100 fits


In [None]:
# Handle missing values
#df_ny.dropna(inplace=True)
# With Smote
'''
# Feature scaling
scaler = StandardScaler()
df_ny[['income', 'debt_to_income_ratio', 'loan_amount', 'combined_loan_to_value_ratio', 'property_value','applicant_race_1']] = scaler.fit_transform(df_ny[['income', 'debt_to_income_ratio', 'loan_amount', 'combined_loan_to_value_ratio', 'property_value','applicant_race_1']])

# Split the dataset
X = df_ny.drop(columns=['action_taken'])
y = df_ny['action_taken']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Model selection and training
params = {'C': np.logspace(-4, 4, 20)}
log_reg = LogisticRegression(solver='liblinear')
grid_search = GridSearchCV(log_reg, params, cv=5, verbose=1, scoring='roc_auc')
grid_search.fit(X_train_resampled, y_train_resampled)

# Model evaluation
y_pred = grid_search.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("AUC-ROC:", roc_auc_score(y_test, y_pred)) '''

'\n# Feature scaling\nscaler = StandardScaler()\ndf_ny[[\'income\', \'debt_to_income_ratio\', \'loan_amount\', \'combined_loan_to_value_ratio\', \'property_value\',\'applicant_race_1\']] = scaler.fit_transform(df_ny[[\'income\', \'debt_to_income_ratio\', \'loan_amount\', \'combined_loan_to_value_ratio\', \'property_value\',\'applicant_race_1\']])\n\n# Split the dataset\nX = df_ny.drop(columns=[\'action_taken\'])\ny = df_ny[\'action_taken\']\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n\n# Apply SMOTE to the training data\nsmote = SMOTE(random_state=42)\nX_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)\n\n# Model selection and training\nparams = {\'C\': np.logspace(-4, 4, 20)}\nlog_reg = LogisticRegression(solver=\'liblinear\')\ngrid_search = GridSearchCV(log_reg, params, cv=5, verbose=1, scoring=\'roc_auc\')\ngrid_search.fit(X_train_resampled, y_train_resampled)\n\n# Model evaluation\ny_pred = grid_search.predict

In [None]:
# Use SMOTE with class weight in logistic regression
from imblearn.over_sampling import SMOTE
from sklearn.metrics import balanced_accuracy_score

# Apply SMOTE to balance the classes
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Train the model with balanced class weights
log_reg_balanced = LogisticRegression(solver='liblinear', class_weight='balanced',max_iter=200)
grid_search_balanced = GridSearchCV(log_reg_balanced, params, cv=5, verbose=1, scoring='roc_auc', n_jobs=15)
grid_search_balanced.fit(X_train_smote, y_train_smote)

# Model evaluation
y_pred_balanced = grid_search_balanced.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred_balanced))
print("Balanced Accuracy:", balanced_accuracy_score(y_test, y_pred_balanced))
print("Classification Report:\n", classification_report(y_test, y_pred_balanced))
print("AUC-ROC:", roc_auc_score(y_test, y_pred_balanced))


Fitting 5 folds for each of 20 candidates, totalling 100 fits
Accuracy: 0.624106450182866
Balanced Accuracy: 0.5936076555876552
Classification Report:
               precision    recall  f1-score   support

           0       0.08      0.56      0.14      7738
           1       0.96      0.63      0.76    136630

    accuracy                           0.62    144368
   macro avg       0.52      0.59      0.45    144368
weighted avg       0.91      0.62      0.73    144368

AUC-ROC: 0.593607655587655


In [None]:
#  SMOTE for oversampling the minority class and Tomek Links for undersampling the majority class. 
# Then trains the logistic regression model with balanced class weights
from imblearn.combine import SMOTETomek

# Apply SMOTE and Tomek Links
smote_tomek = SMOTETomek(random_state=42)
X_train_smote_tomek, y_train_smote_tomek = smote_tomek.fit_resample(X_train, y_train)

# Train the model with balanced class weights
log_reg_balanced = LogisticRegression(solver='liblinear', class_weight='balanced')
grid_search_balanced = GridSearchCV(log_reg_balanced, params, cv=5, verbose=1, scoring='roc_auc', n_jobs=-1)
grid_search_balanced.fit(X_train_smote_tomek, y_train_smote_tomek)

# Model evaluation
y_pred_balanced = grid_search_balanced.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred_balanced))
print("Balanced Accuracy:", balanced_accuracy_score(y_test, y_pred_balanced))
print("Classification Report:\n", classification_report(y_test, y_pred_balanced))
print("AUC-ROC:", roc_auc_score(y_test, y_pred_balanced))


Fitting 5 folds for each of 20 candidates, totalling 100 fits
Accuracy: 0.6248199046880195
Balanced Accuracy: 0.593984585983615
Classification Report:
               precision    recall  f1-score   support

           0       0.08      0.56      0.14      7738
           1       0.96      0.63      0.76    136630

    accuracy                           0.62    144368
   macro avg       0.52      0.59      0.45    144368
weighted avg       0.91      0.62      0.73    144368

AUC-ROC: 0.593984585983615


In [None]:
# Optimizing the hyperparameter

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from imblearn.combine import SMOTETomek

# Apply SMOTE and Tomek Links
smote_tomek = SMOTETomek(random_state=42)
X_train_smote_tomek, y_train_smote_tomek = smote_tomek.fit_resample(X_train, y_train)

# Define the logistic regression model
log_reg = LogisticRegression(solver='liblinear', class_weight='balanced', max_iter=200)

# Specify the hyperparameters and their possible values
params = {
    'C': np.logspace(-4, 4, 20),
    'penalty': ['l1', 'l2']
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(log_reg, params, cv=5, verbose=1, scoring='roc_auc', n_jobs=-1)
grid_search.fit(X_train_smote_tomek, y_train_smote_tomek)

# Print the best hyperparameters
print("Best Hyperparameters:", grid_search.best_params_)

# Model evaluation
y_pred = grid_search.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Balanced Accuracy:", balanced_accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("AUC-ROC:", roc_auc_score(y_test, y_pred))



Fitting 5 folds for each of 40 candidates, totalling 200 fits


KeyboardInterrupt: 

In [None]:
# Ok at this point
# Logistic regression doesn't improve
# Try Random Forest Classifier
# With Smote and Tomek links

In [None]:
X_test


Unnamed: 0,loan_amount,combined_loan_to_value_ratio,property_value,income,debt_to_income_ratio,applicant_race_1
939623,-0.778711,0.161358,-0.391450,-0.114141,0.035752,0.416558
15318906,-0.153699,0.675404,-0.157004,-0.158430,0.593024,0.416558
22408888,0.679651,0.785731,0.168616,0.204184,0.035752,0.416558
16802785,-0.223144,0.364529,-0.157004,-0.053244,0.593024,0.416558
25742251,-0.257867,0.910605,-0.209103,-0.169502,0.513414,0.416558
...,...,...,...,...,...,...
22651692,-0.952325,0.785731,-0.482624,-0.197183,-0.601131,-1.620228
10884176,-0.292590,0.785731,-0.222128,-0.130750,1.229907,0.416558
7745308,0.332422,0.785731,0.038368,-0.047708,0.035752,-2.638621
11892571,15.888277,-0.463015,7.970465,7.127069,-2.511779,0.416558


In [None]:
results = pd.DataFrame({'race': X_test['applicant_race_1'], 'loan_approval': y_test, 'prediction': y_pred})
group_results = results.groupby('race').mean()
print("Demographic Parity Analysis:")
print(group_results)

# Calculate the difference in approval rates between the groups
max_diff = abs(group_results['prediction'].max() - group_results['prediction'].min())
print("Max difference in approval rates between groups:", max_diff)

# Set a fairness threshold (e.g., 0.05)
fairness_threshold = 0.05
is_fair = max_diff <= fairness_threshold
print("Is the model fair?", is_fair)

Demographic Parity Analysis:
           loan_approval  prediction
race                                
-3.657014       0.914474    0.031015
-2.638621       0.948764    0.087774
-1.620228       0.867117    0.097297
-0.601835       0.914286    0.490476
 0.416558       0.952268    0.714672
Max difference in approval rates between groups: 0.6836569827280539
Is the model fair? False


In [None]:
# Threshold adjustment: postprocessing
#need to get the predicted probabilities from the model instead of the binary predictions. 
# Then, find an optimal threshold for each group that satisfies the fairness criteria.

In [None]:
# Get the predicted probabilities
y_pred_proba = grid_search.predict_proba(X_test)[:, 1]
y_pred_proba

array([0.51866916, 0.51680742, 0.52504774, ..., 0.36274325, 0.88718224,
       0.46791841])

In [None]:
print("Length of X_test:", len(X_test))
print("Length of y_test:", len(y_test))
print("Length of y_pred_proba:", len(y_pred_proba))


Length of X_test: 144368
Length of y_test: 144368
Length of y_pred_proba: 144368


In [None]:
# Find the optimal threshold for a given group
def find_optimal_threshold(y_true, y_pred_proba, indices, fairness_threshold):
    thresholds = np.linspace(0, 1, 100)
    best_threshold = 0
    best_balanced_accuracy = 0

    for threshold in thresholds:
        y_pred_adjusted = (y_pred_proba > threshold).astype(int)
        balanced_accuracy = balanced_accuracy_score(y_true, y_pred_adjusted)

        # Check demographic parity
        group_results = pd.DataFrame({'race': X_test.loc[indices, 'applicant_race_1'], 'prediction': y_pred_adjusted}).groupby('race').mean()
        max_diff = abs(group_results['prediction'].max() - group_results['prediction'].min())

        if max_diff <= fairness_threshold and balanced_accuracy > best_balanced_accuracy:
            best_threshold = threshold
            best_balanced_accuracy = balanced_accuracy

    return best_threshold


In [None]:
# Reset the index of the test set
X_test = X_test.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

# Apply the optimal threshold to each group
results = pd.DataFrame({'race': X_test['applicant_race_1'], 'loan_approval': y_test, 'prediction_proba': y_pred_proba})
unique_races = results['race'].unique()

for race in unique_races:
    race_mask = results['race'] == race
    race_indices = results.index[race_mask]
    optimal_threshold = find_optimal_threshold(y_test[race_indices], y_pred_proba[race_indices], race_indices, fairness_threshold)
    results.loc[race_mask, 'prediction'] = (y_pred_proba[race_indices] > optimal_threshold).astype(int)

# Evaluate the model's fairness and performance using the adjusted predictions
y_pred_adjusted = results['prediction'].values
print("Accuracy:", accuracy_score(y_test, y_pred_adjusted))
print("Balanced Accuracy:", balanced_accuracy_score(y_test, y_pred_adjusted))
print("Classification Report:\n", classification_report(y_test, y_pred_adjusted))
print("AUC-ROC:", roc_auc_score(y_test, y_pred_adjusted))

# Demographic parity analysis
group_results = results.groupby('race').mean()
print("Demographic Parity Analysis:")
print(group_results)

max_diff = abs(group_results['prediction'].max() - group_results['prediction'].min())
print("Max difference in approval rates between groups:", max_diff)

is_fair = max_diff <= fairness_threshold
print("The fairness threshold", fairness_threshold )
print("Is the model fair?", is_fair)



Accuracy: 0.8245802393882301
Balanced Accuracy: 0.5953465028577065
Classification Report:
               precision    recall  f1-score   support

           0       0.11      0.34      0.17      7738
           1       0.96      0.85      0.90    136630

    accuracy                           0.82    144368
   macro avg       0.54      0.60      0.54    144368
weighted avg       0.91      0.82      0.86    144368

AUC-ROC: 0.5953465028577065
Demographic Parity Analysis:
           loan_approval  prediction_proba  prediction
race                                                  
-3.657014       0.914474          0.352579    0.929511
-2.638621       0.948764          0.419961    0.899335
-1.620228       0.867117          0.422688    0.888176
-0.601835       0.914286          0.498796    0.828571
 0.416558       0.952268          0.536060    0.832040
Max difference in approval rates between groups: 0.10093984962406011
The fairness threshold 0.05
Is the model fair? False


In [None]:
# Preprocessing
#disentangling sensitive features using PCA. 
#We'll apply PCA to the training data, excluding the sensitive feature (race). 
#Then, we'll train a model using the transformed data.

# Maybe try again but with everysingle columns,

In [None]:
from sklearn.decomposition import PCA

# Separate the sensitive feature from the other features
X_train_no_race = X_train.drop(columns=['applicant_race_1'])
X_test_no_race = X_test.drop(columns=['applicant_race_1'])

# Apply PCA to the training data (excluding the sensitive feature)
pca = PCA(n_components=0.95)  # Retain 95% of the variance
X_train_pca = pca.fit_transform(X_train_no_race)
X_test_pca = pca.transform(X_test_no_race)

# Train a model using the transformed data
log_reg_pca = LogisticRegression(solver='liblinear', class_weight='balanced')
grid_search_pca = GridSearchCV(log_reg_pca, params, cv=5, verbose=1, scoring='roc_auc', n_jobs=-1)
grid_search_pca.fit(X_train_pca, y_train)

# Proceed with the rest of the model evaluation steps
# Model evaluation for the PCA model
y_pred_pca = grid_search_pca.predict(X_test_pca)
print("PCA Model Evaluation:")
print("Accuracy:", accuracy_score(y_test, y_pred_pca))
print("Balanced Accuracy:", balanced_accuracy_score(y_test, y_pred_pca))
print("Classification Report:\n", classification_report(y_test, y_pred_pca))
print("AUC-ROC:", roc_auc_score(y_test, y_pred_pca))


Fitting 5 folds for each of 40 candidates, totalling 200 fits
PCA Model Evaluation:
Accuracy: 0.5672517455391777
Balanced Accuracy: 0.5598519721493718
Classification Report:
               precision    recall  f1-score   support

           0       0.07      0.55      0.12      7738
           1       0.96      0.57      0.71    136630

    accuracy                           0.57    144368
   macro avg       0.51      0.56      0.42    144368
weighted avg       0.91      0.57      0.68    144368

AUC-ROC: 0.5598519721493718


In [None]:
# Get the predicted probabilities for the PCA model
y_pred_proba_pca = grid_search_pca.predict_proba(X_test_pca)[:, 1]

# Apply the optimal threshold to each group
results_pca = pd.DataFrame({'race': X_test['applicant_race_1'], 'loan_approval': y_test, 'prediction_proba': y_pred_proba_pca})
unique_races = results_pca['race'].unique()

for race in unique_races:
    race_mask = results_pca['race'] == race
    race_indices = results_pca.index[race_mask]
    optimal_threshold = find_optimal_threshold(y_test[race_indices], y_pred_proba_pca[race_indices], race_indices, fairness_threshold)
    results_pca.loc[race_mask, 'prediction'] = (y_pred_proba_pca[race_indices] > optimal_threshold).astype(int)

# Evaluate the PCA model's fairness and performance using the adjusted predictions
y_pred_adjusted_pca = results_pca['prediction'].values
print("Accuracy:", accuracy_score(y_test, y_pred_adjusted_pca))
print("Balanced Accuracy:", balanced_accuracy_score(y_test, y_pred_adjusted_pca))
print("Classification Report:\n", classification_report(y_test, y_pred_adjusted_pca))
print("AUC-ROC:", roc_auc_score(y_test, y_pred_adjusted_pca))

# Demographic parity analysis for the PCA model
group_results_pca = results_pca.groupby('race').mean()
print("Demographic Parity Analysis (PCA Model):")
print(group_results_pca)

max_diff_pca = abs(group_results_pca['prediction'].max() - group_results_pca['prediction'].min())
print("Max difference in approval rates between groups (PCA Model):", max_diff_pca)

is_fair_pca = max_diff_pca <= fairness_threshold
print("Is the PCA model fair?", is_fair_pca)



Accuracy: 0.8168915549152167
Balanced Accuracy: 0.5921378306862943
Classification Report:
               precision    recall  f1-score   support

           0       0.11      0.34      0.17      7738
           1       0.96      0.84      0.90    136630

    accuracy                           0.82    144368
   macro avg       0.53      0.59      0.53    144368
weighted avg       0.91      0.82      0.86    144368

AUC-ROC: 0.5921378306862943
Demographic Parity Analysis (PCA Model):
           loan_approval  prediction_proba  prediction
race                                                  
-3.657014       0.914474          0.506011    0.898496
-2.638621       0.948764          0.530423    0.865178
-1.620228       0.867117          0.489734    0.884122
-0.601835       0.914286          0.521129    0.928571
 0.416558       0.952268          0.512564    0.826517
Max difference in approval rates between groups (PCA Model): 0.10205444573863476
Is the PCA model fair? False


In [None]:
# Fairness not possible with logistic regression

In [None]:
'''
'income', 'debt_to_income_ratio', 'loan_amount', 'combined_loan_to_value_ratio', 'property_value','applicant_race_1'
Must be in the same order as the fit data
'''

new_input_data = [
    {
    'loan_amount': 175000,
    'combined_loan_to_value_ratio': 175000.0/220000.0,
    'property_value': 22000.0,
    'income': 0.0,
    'debt_to_income_ratio': 45.0,
    'applicant_race_1': 1.0,
    },
    {
    'loan_amount': 175000,
    'combined_loan_to_value_ratio': 175000.0/220000.0,
    'property_value': 22000.0,
    'income': 95.0,
    'debt_to_income_ratio': 45.0,
    'applicant_race_1': 1.0,
    }
]


In [None]:
input_df = pd.DataFrame(new_input_data)
best_model = grid_search.best_estimator_
prediction = best_model.predict(input_df)
prediction

array([1, 1], dtype=int64)

In [None]:
if prediction[0] == 0:
    print("rejected")
else:
    print("approved")

approved


In [None]:
for idx, prediction in enumerate(prediction):
    result = "Approved" if prediction == 1 else "Rejected"
    print(f"Loan application {idx + 1} status: {result}")

Loan application 1 status: Approved
Loan application 2 status: Approved


In [None]:
'''
Our data is imbalanced
Resampling methods:

Undersampling: Randomly remove some samples from the majority class (accepted) to balance the class distribution. 
However, this may result in loss of information.
Oversampling: Create copies of samples from the minority class (rejected) to balance the class distribution. 
This can be done using techniques such as SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic samples for the minority class.

Cost-sensitive learning: Assign different misclassification costs to the majority and minority classes. 
In scikit-learn, you can use the class_weight parameter to assign class weights when training the logistic regression model. 
You can either set the parameter to 'balanced', which will automatically adjust the weights based on the number of samples for each class,
 or manually provide the weights as a dictionary.

Ensemble methods: Utilize ensemble learning techniques that can handle imbalanced datasets, such as:

Bagging: Use an ensemble of base classifiers, each trained on a random subset of the dataset, and combine their predictions. You can use balanced bagging, where each base classifier is trained on a balanced subset created through resampling.
Boosting: Train a series of weak classifiers iteratively, with each classifier focusing on the errors made by the previous classifier. Some boosting algorithms, like AdaBoost, can be modified to handle class imbalance by adjusting the misclassification costs for each class. '''

"\nOur data is imbalanced\nResampling methods:\n\nUndersampling: Randomly remove some samples from the majority class (accepted) to balance the class distribution. \nHowever, this may result in loss of information.\nOversampling: Create copies of samples from the minority class (rejected) to balance the class distribution. \nThis can be done using techniques such as SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic samples for the minority class.\n\nCost-sensitive learning: Assign different misclassification costs to the majority and minority classes. \nIn scikit-learn, you can use the class_weight parameter to assign class weights when training the logistic regression model. \nYou can either set the parameter to 'balanced', which will automatically adjust the weights based on the number of samples for each class,\n or manually provide the weights as a dictionary.\n\nEnsemble methods: Utilize ensemble learning techniques that can handle imbalanced datasets, 