# Industrial Applications of Artificial Intelligence - Bias in Credit Decisions

### This notebook is part of the third hand-in regarding the secondary sector in the lecture Industrial Applications of AI by Niklas Sabel (Matr. no. 1599748)

In [3]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
import itertools
from skopt import BayesSearchCV
from skopt.space import Real, Integer

## I. Data Exploration

The dataset can be found on Kaggle under the following [URL](https://www.kaggle.com/datasets/jannesklaas/model-trap). The author created the dataset for pedagogical purposes, to teach students how bias can affect the outcomes of models. It reflects the decision of whether a person is offered a loan or not, but it is generated in such a way that a default model is most likely to be biased against women and minorities.
The training set consists of 480,000 entries with 13 features and one target, which are described below. We are also given a test set with 160,000 entries and the same features, which will be kept separate until the end.
* Minority: Binary variable that indicates whether a person belongs to a minority (1) or not (0).
* Sex: Binary variable that indicates whether a person is male (0) or female (1).
* ZIP: Code that indicates the area where a person lives (stored as a string such as `MT04PA`).
* Rent: Binary variable that indicates whether a person currently pays rent (1) or not (0).
* Education: Float indicating the education of a person. Unfortunately, no description is given, which makes it hard to interpret.
* Age: Float indicating the age of a person.
* Income: Float indicating the income of a person. Unfortunately, no currency is given.
* Loan size: Float indicating the loan amount a person requests. Unfortunately, no currency is given.
* Payment timing: Float indicating the payment timing of a person. Unfortunately, no description is given, which makes it hard to interpret and to decide which values count as late.
* Year: Integer that indicates the year when a person requested the loan.
* Job Stability: Float indicating the job stability of a person. Unfortunately, no description is given, which makes it hard to interpret.
* Occupation: Code that indicates the employment type of a person (stored as a string such as `MZ10CD`).
* Default: Target. Binary variable (stored as `True`/`False`) that indicates whether a person gets a loan (`True`) or not (`False`).
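
Since the dataset is generated so that a default model tends to be biased against women and minorities, a natural first check is the share of positive targets per group. The following is only a minimal sketch, assuming the column names `minority`, `sex` and `default` from `train.csv`. The `toy` frame and the `group_rate` helper are hypothetical stand-ins that are not part of the original notebook, and the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for train.csv; the values are invented for illustration only.
toy = pd.DataFrame({
    "minority": [1, 1, 1, 0, 0, 0],
    "sex":      [0, 1, 0, 1, 0, 1],
    "default":  [True, True, False, False, False, True],
})


def group_rate(df, group_col, target_col="default"):
    """Share of positive targets within each value of group_col (hypothetical helper)."""
    # Cast the boolean target to float so that the mean is a plain proportion.
    return df[target_col].astype(float).groupby(df[group_col]).mean()


rate_by_minority = group_rate(toy, "minority")
rate_by_sex = group_rate(toy, "sex")
print(rate_by_minority)
print(rate_by_sex)
```

On the real data the same call would be `group_rate(df, "minority")` once `df` has been loaded. A large gap between the groups would be a first hint of the bias that the dataset was designed to contain.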

In [None]:
# directory that contains train.csv and test.csv
dir_path = '../../src/data/Abgabe_3/'

In [46]:
df = pd.read_csv(os.path.join(dir_path, "train.csv"))
df.head(10)

|  | minority | sex | ZIP | rent | education | age | income | loan_size | payment_timing | year | job_stability | default | occupation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | MT04PA | 1 | 57.23065 | 36.050927 | 205168.022244 | 7600.292199 | 3.302193 | 0 | 3.015554 | True | MZ10CD |
| 1 | 1 | 0 | MT04PA | 1 | 45.891343 | 59.525251 | 187530.409981 | 5534.271289 | 3.843058 | 0 | 5.938132 | True | MZ10CD |
| 2 | 1 | 0 | MT04PA | 1 | 46.775489 | 67.338108 | 196912.00669 | 2009.903438 | 2.059034 | 0 | 2.190777 | True | MZ10CD |
| 3 | 1 | 0 | MT04PA | 1 | 41.784839 | 24.067401 | 132911.650615 | 3112.280893 | 3.936169 | 0 | 1.72586 | True | MZ10CD |
| 4 | 1 | 0 | MT04PA | 1 | 41.744838 | 47.496605 | 161162.551205 | 1372.077093 | 3.70991 | 0 | 0.883104 | True | MZ10CD |
| 5 | 1 | 0 | MT04PA | 1 | 55.330391 | 34.986911 | 196698.100755 | 9807.367705 | 3.778435 | 0 | 3.827745 | True | MZ10CD |
| 6 | 1 | 0 | MT04PA | 0 | 52.697992 | 25.513529 | 170699.801788 | 6978.556887 | 3.847329 | 0 | 1.266106 | True | MZ10CD |
| 7 | 1 | 0 | MT04PA | 0 | 41.172233 | 56.302828 | 165954.763667 | 3593.878976 | 3.599748 | 0 | 0.186606 | True | MZ10CD |
| 8 | 1 | 0 | MT04PA | 0 | 48.732791 | 58.434236 | 198240.216967 | 1514.099813 | 3.637124 | 0 | 1.207918 | True | MZ10CD |
| 9 | 1 | 0 | MT04PA | 1 | 48.849105 | 39.634137 | 179749.601173 | 482.799055 | 3.962002 | 0 | 2.385735 | True | MZ10CD |


In [52]:
# look at distribution
df.describe().transpose().applymap("{:.2f}".format)

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| minority | 480000.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| sex | 480000.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.5 | 1.0 | 1.0 |
| rent | 480000.0 | 0.47 | 0.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| education | 480000.0 | 26.02 | 24.55 | 0.0 | 1.68 | 20.28 | 49.71 | 89.31 |
| age | 480000.0 | 42.99 | 14.43 | 18.0 | 30.47 | 43.0 | 55.47 | 68.0 |
| income | 480000.0 | 96223.63 | 91722.3 | 7.31 | 6181.59 | 70380.19 | 183477.24 | 350173.9 |
| loan_size | 480000.0 | 5005.0 | 2887.15 | 0.04 | 2503.58 | 5008.8 | 7503.32 | 9999.99 |
| payment_timing | 480000.0 | 3.0 | 1.0 | -12.46 | 2.62 | 3.31 | 3.71 | 4.0 |
| year | 480000.0 | 14.5 | 8.66 | 0.0 | 7.0 | 14.5 | 22.0 | 29.0 |
| job_stability | 480000.0 | 45.99 | 45.07 | 0.01 | 1.67 | 31.24 | 89.45 | 149.91 |


In [None]:
df_test = pd.read_csv(os.path.join(dir_path, "test.csv"))
df_test.head(10)

In [47]:
# look at distribution
df_test.describe().transpose().applymap("{:.2f}".format)

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| minority | 480000.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| sex | 480000.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.5 | 1.0 | 1.0 |
| rent | 480000.0 | 0.47 | 0.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| education | 480000.0 | 26.02 | 24.55 | 0.0 | 1.68 | 20.28 | 49.71 | 89.31 |
| age | 480000.0 | 42.99 | 14.43 | 18.0 | 30.47 | 43.0 | 55.47 | 68.0 |
| income | 480000.0 | 96223.63 | 91722.3 | 7.31 | 6181.59 | 70380.19 | 183477.24 | 350173.9 |
| loan_size | 480000.0 | 5005.0 | 2887.15 | 0.04 | 2503.58 | 5008.8 | 7503.32 | 9999.99 |
| payment_timing | 480000.0 | 3.0 | 1.0 | -12.46 | 2.62 | 3.31 | 3.71 | 4.0 |
| year | 480000.0 | 14.5 | 8.66 | 0.0 | 7.0 | 14.5 | 22.0 | 29.0 |
| job_stability | 480000.0 | 45.99 | 45.07 | 0.01 | 1.67 | 31.24 | 89.45 | 149.91 |
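
Because the test set is kept separate until the end, it can be useful to confirm early on that it has the same columns and dtypes as the training set. The following is a minimal sketch of such a check. The `schema_mismatches` helper and the small example frames are hypothetical and not part of the original notebook.

```python
import pandas as pd


def schema_mismatches(train, test):
    """Return {column: problem} for columns that differ between two frames (hypothetical helper)."""
    issues = {}
    for col in train.columns.union(test.columns):
        if col not in test.columns:
            issues[col] = "missing in test"
        elif col not in train.columns:
            issues[col] = "missing in train"
        elif train[col].dtype != test[col].dtype:
            issues[col] = f"{train[col].dtype} vs {test[col].dtype}"
    return issues


# Small made-up frames that only mimic a few of the real columns.
train_toy = pd.DataFrame({"minority": [1, 0], "income": [1.0, 2.0], "default": [True, False]})
test_same = train_toy.copy()
test_diff = pd.DataFrame({"minority": [1.0, 0.0], "income": [1.0, 2.0]})

no_issues = schema_mismatches(train_toy, test_same)
found_issues = schema_mismatches(train_toy, test_diff)
print(no_issues)
print(found_issues)
```

With the real data the call would be `schema_mismatches(df, df_test)`. An empty dictionary means both files share the same columns and dtypes.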
