# PKL Take on the Home Credit Default Risk Kaggle Competition

In [1]:
# built-in imports
import os
# third-party imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix, matthews_corrcoef
# custom imports
from helper_functions import load_data_frame, flag_columns_to_bool


## Data Loading

In [2]:
path_application_train = os.path.join("./", "home-credit-default-risk", "application_train.csv")

df = load_data_frame(path_application_train)
original_size = df.shape
null_count_df = df.isna().sum()  # per-column count of missing values
null_count_df == 0  # works; returns the per-column statistics as boolean values
df = df.dropna()  # drop every row that contains a null; no made-up data approach
no_null_shape = df.shape
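
The effect of the drop-all-nulls approach can be checked on a toy frame. This is a minimal sketch, not part of the original notebook: the column names and values are assumptions chosen for illustration, and only `isna`, `sum`, and `dropna` from pandas are used.

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are illustrative assumptions.
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000.0, np.nan, 250000.0, 180000.0],
    "OWN_CAR_AGE": [np.nan, 5.0, np.nan, 12.0],
    "TARGET": [0, 1, 0, 0],
})

null_counts = toy.isna().sum()      # missing values per column
complete = toy.dropna()             # keep only rows without any null
retained = len(complete) / len(toy) # fraction of rows that survive

print(null_counts[null_counts > 0])
print(f"kept {len(complete)} of {len(toy)} rows ({retained:.0%})")
```

Comparing `original_size` with `no_null_shape` in the same way shows how many rows the real dataset loses to this strategy.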

## Data Preprocessing

1. If I understand the meaning correctly, columns with the string "FLAG" in their name are binary columns denoting the presence or lack thereof of a certain variable. I therefore replace the string "Y"/"N" values and the integer 1/0 values with proper booleans (this may also reduce memory usage, but that is not the main motivation).
2. The rest of the string variables (dtype("O")) need to be encoded before the scikit-learn algorithms can handle them. I need an analogue of OneHotEncoder for categorical data and of OrdinalEncoder for ordinal data. Examples of categorical data are "CODE_GENDER" and "NAME_HOUSING_TYPE". The former should perhaps be removed in a future version of the model, because keeping it may lead to a sexist classifier. On the other hand, some features are clearly ordinal, such as "NAME_EDUCATION_TYPE", where a commonly agreed ordering exists (the difference between the "Primary" and "Masters" levels is much bigger than between "Doctorate" and "Masters"). I should look for ordinal features and encode them accordingly in the future.
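
The two steps above can be sketched on a toy frame. This is a hedged illustration, not the notebook's actual pipeline: the loop is a hypothetical stand-in for `flag_columns_to_bool`, the education labels and their ordering are assumptions taken from the example in the text, and all values are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data; values are made up, labels follow the examples in the text.
toy = pd.DataFrame({
    "FLAG_OWN_CAR": ["Y", "N", "Y"],
    "FLAG_MOBIL": [1, 0, 1],
    "NAME_HOUSING_TYPE": ["House / apartment", "With parents", "House / apartment"],
    "NAME_EDUCATION_TYPE": ["Primary", "Masters", "Doctorate"],
})

# Step 1: turn every FLAG column into booleans
# (hypothetical stand-in for flag_columns_to_bool).
for col in [c for c in toy.columns if "FLAG" in c]:
    if pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = toy[col].astype(bool)  # 1/0 -> True/False
    else:
        toy[col] = toy[col] == "Y"        # "Y"/"N" -> True/False

# Step 2a: ordinal feature, encoded with an explicit (assumed) ordering.
education_order = ["Primary", "Masters", "Doctorate"]
encoder = OrdinalEncoder(categories=[education_order])
toy["NAME_EDUCATION_TYPE"] = encoder.fit_transform(
    toy[["NAME_EDUCATION_TYPE"]]
).ravel()

# Step 2b: nominal feature, one-hot encoded.
encoded = pd.get_dummies(toy, columns=["NAME_HOUSING_TYPE"])
print(encoded.dtypes)
```

Passing `categories` explicitly matters here: without it, `OrdinalEncoder` sorts the labels alphabetically, which would not reflect the intended ordering.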

In [3]:
# convert the FLAG columns to booleans
print(df.shape)
df = flag_columns_to_bool(df)
print(df.shape)
# one-hot encode the remaining string columns
df = pd.get_dummies(df)
print(df.shape)

(8602, 122)
(8602, 122)
(8602, 234)


In [4]:
# separate the label from the features
y = df["TARGET"]
X = df.drop(["TARGET"], axis=1)
# default split: 75 % train, 25 % test
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape)
print(y_train.shape)

(6451, 233)
(6451,)


## Baseline Classifier for later comparison

In [5]:
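# multi-layer perceptron with default hyperparameters
baseline_classifier = MLPClassifier().fit(X_train, y_train)
y_pred = baseline_classifier.predict(X_test)
# rows are true classes, columns are predicted classes
cf_mat = confusion_matrix(y_test, y_pred)
print(cf_mat)
print(matthews_corrcoef(y_test, y_pred))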

[[1408  611]
 [  88   44]]
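
The Matthews correlation coefficient can also be recomputed directly from a binary confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and uses the matrix printed above; `mcc_from_confusion_matrix` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import math

def mcc_from_confusion_matrix(cm):
    """Matthews correlation coefficient for a 2x2 matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when any marginal sum is zero.
    return numerator / denominator if denominator else 0.0

cf_mat = [[1408, 611], [88, 44]]  # copied from the printed output
mcc = mcc_from_confusion_matrix(cf_mat)
print(mcc)
```

A value close to 0 means the predictions are barely better than chance, which is the bar any later model has to clear.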
