# Data transformation for Lion's Den ING Risk Modelling Challenge 2024
## Group: Neuralna Ekipa

This notebook describes how the preprocessing pipeline is built. Some procedures are based on the *analysis.ipynb* notebook; others are new, and their purpose is described in the documentation.

In [1]:
import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_selector
from feature_engine.encoding import WoEEncoder

## Division of variables by feature type

In [2]:
discrete_variables = [
    'ID', 'customer_id', 'Var1', 'Var15', 'Var16', 'Var20', 'Var21', 'Var22',
    'Var23', 'Var29', 'Var4', 'Var5', 'Var9', 'Var24', 'Var30', 'Var6'
]

continuous_variables = [
    'Var7', 'Var8', 'Var10', 
    'Var17', 'Var25', 'Var26', '_r_'
]

binary_variables = [
    'target', 'Application_status', 'Var18', 
    'Var19', 'Var27', 'Var28'
]

categorical_nominal_variables = [
    'Var2', 'Var3', 'Var11', 'Var12', 'Var14'
]


datetime_variables = [
    'application_date', 'Var13'
]

We load the training sample only to check that the manual split above is complete, i.e. that no variable was omitted.

In [3]:
from itertools import chain

train_data = pd.read_csv('https://files.challengerocket.com/files/lions-den-ing-2024/development_sample.csv')
assigned_vars = pd.Index(chain.from_iterable([
    discrete_variables, continuous_variables, binary_variables,
    categorical_nominal_variables, datetime_variables
]))
# columns present in the data but missing from every list above
unassigned = train_data.columns.difference(assigned_vars)
print("Variables not assigned yet:", unassigned if unassigned.shape[0] else "ALL ASSIGNED")

Variables not assigned yet: ALL ASSIGNED


In [4]:
names_xlsx = pd.read_excel('./variables_description.xlsx')
# Dictionary mapping original column names to their descriptive names
names = {f"{names_xlsx['Column'][i]}":f"{names_xlsx['Description'][i]}" for i in range(5, len(names_xlsx))}

def rename_list(lista):
    for idx in range(len(lista)):
        if lista[idx] in names.keys():
            lista[idx] = names[lista[idx]]
    return lista

discrete_variables = rename_list(discrete_variables)
continuous_variables = rename_list(continuous_variables)
binary_variables = rename_list(binary_variables)
categorical_nominal_variables = rename_list(categorical_nominal_variables)
datetime_variables = rename_list(datetime_variables)

## pre-Pipeline preprocessing

This step removes some observations, so it has to run before the pipeline. The code below builds two regular expressions that select the variables for numerical and categorical processing. This automates column selection for each arm of the ColumnTransformer: column names gain prefixes as they pass through the pipeline, so instead of maintaining long lists of changing names we match only the end of each name, which is the original variable name. `added_variables` lists the engineered features that are added to the dataset later; including them in the regex up front is harmless, since each one is just another alternative in the pattern.

In [5]:

def generate_regex():

    added_variables = []
    added_variables.append('durationOfEmployment')
    added_variables.append('installmentPerIncomeOfMainApplicant')
    added_variables.append('installmentPerIncome')
    added_variables.append('incomeOfMainApplicantperChildrenNumber')
    added_variables.append('incomeOfMainApplicantperdependencesNumber')
    added_variables.append('installmentAmountPerIncomeAndGoods')
    added_variables.append('installmentPerBothIncomes')
    added_variables.append('dependentNumberOfChildrenOnRelationshipStatus')
    
    num_regex = "^(.*)("
    nominal_regex = "^(.*)("
    for num_feature in discrete_variables + continuous_variables + added_variables:
        # escape parentheses so they are matched literally (raw strings avoid invalid escape sequences)
        num_feature = num_feature.replace(')', r'\)').replace('(', r'\(')
        num_regex += num_feature + '|'
    num_regex = num_regex[:-1]  # remove the trailing |
    num_regex += ')$'

    # build the regex selector for nominal features
    for cat_feature in categorical_nominal_variables:
        nominal_regex += cat_feature.replace(')', r'\)').replace('(', r'\(') + '|'
    nominal_regex = nominal_regex[:-1]  # remove the trailing |
    nominal_regex += ')$'
    return num_regex, nominal_regex


Unfortunately, the column selectors in this pipeline need the regexes to exist as global variables.

In [6]:
num_regex, nominal_regex = generate_regex() 
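
To illustrate how a suffix-matching pattern of this shape behaves, here is a minimal sketch using only the standard library. The helper `build_suffix_regex` and the prefixed column names are hypothetical, and `re.escape` is swapped in for the manual parenthesis escaping used above; it escapes every regex metacharacter, not only parentheses.

```python
import re

def build_suffix_regex(names):
    """Build a pattern that matches any string ending with one of `names`."""
    # re.escape handles parentheses and any other metacharacters
    alternatives = "|".join(re.escape(name) for name in names)
    # non-capturing group: we only test for a match, we do not extract groups
    return f"^.*(?:{alternatives})$"

# hypothetical original names and their prefixed versions inside a pipeline
pattern = build_suffix_regex(["Installment amount", "Value of the goods (car)"])
columns = [
    "remainder__Installment amount",
    "zero_fill__Value of the goods (car)",
    "remainder__Loan purpose",
]
selected = [col for col in columns if re.search(pattern, col)]
print(selected)
```

Only the first two names are selected, because the pattern is anchored at the end of the string and the prefix is absorbed by `.*`.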

The function below does several things:
1. Renames columns according to the mapping from the official competition file.
2. Sets the index to the `ID` column of the dataset.
3. Removes observations with missing values in the columns passed to the function; these are variables we will not impute later (the reasons are described in our earlier work).
4. Replaces the absurd date 31Dec9999 with a date in the middle of the range of employment dates (01Jun2006).
5. Returns the preprocessed data split into training data and training labels.

In [7]:
def remove_nans(X : pd.DataFrame, columns=['target', 'Spendings estimation']) -> pd.DataFrame:
    """Remove rows that have NaN in any of the columns given in the list.

    Args:
        X (pd.DataFrame): dataframe to process (rows are removed). The raw one loaded from the URL.
        columns (list, optional): Columns of the original df (descriptive names, not VarX)
        whose NaN rows are removed.
        Defaults to ['target', 'Spendings estimation'].

    Returns:
        pd.DataFrame, pd.Series: dataframe with training data, series with labels
    """
    X = X.copy()
    X = X.rename(columns=names)
    X = X.set_index('ID')
    for column in columns:
        X = X[X[column].notna()]
    X.loc[X['Application data: employment date (main applicant)'].apply(lambda x: int(x[-4:])) > 2030, 'Application data: employment date (main applicant)'] = '01Jun2006'
    X.loc[X['application_date'].apply(lambda x: int(x[-12:-8])) > 2030, 'application_date'] = '01Jun2006 0:00:00'
    return X.drop(['target'], axis=1), X['target']
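
The sentinel-date replacement can be seen in isolation in the sketch below. The toy values are made up, and `Series.mask` is used here instead of the `.loc` assignment in the function above; both replace only the rows where the condition holds.

```python
import pandas as pd

# toy employment dates in the DDMonYYYY format; values are made up
dates = pd.Series(["15Mar2001", "31Dec9999", "02Nov2012"])

# the year is the last four characters of each string
years = dates.str[-4:].astype(int)

# replace implausible years with the fixed placeholder date
cleaned = dates.mask(years > 2030, "01Jun2006")
print(cleaned.tolist())
```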


## Fixing encodings

Some variables ('Distribution channel', 'Application_status') come with wrong but easily fixed encodings. Fixing them does not change the row count, so this function is used inside the pipeline.

In [8]:
def fix_encodings(X : pd.DataFrame) -> pd.DataFrame:
    """Hard-coded fixes for the broken encodings in the affected columns.

    Args:
        X (pd.DataFrame): dataframe returned by remove_nans
    Returns:
        pd.DataFrame: dataframe with corrected encodings
    """
    X_copy = X.copy()
    if 'Distribution channel' in X.columns:
        X_copy['Distribution channel'] = X_copy['Distribution channel'].replace("Direct", "1")
        X_copy['Distribution channel'] = X_copy['Distribution channel'].replace("Broker", "2")    
        X_copy['Distribution channel'] = X_copy['Distribution channel'].replace("Online", "3")

    if 'Application_status' in X.columns:
        X_copy['Application_status'] = X_copy['Application_status'].replace("Approved", "1")
        X_copy['Application_status'] = X_copy['Application_status'].replace("Rejected", "0")
        
    return X_copy
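
As a compact alternative, the same kind of recoding can be expressed with a single dictionary passed to `Series.replace`. This is only a sketch on made-up values; `channel_codes` is a hypothetical name that mirrors the replacements made in `fix_encodings`.

```python
import pandas as pd

# hypothetical mapping, mirroring the replacements made in fix_encodings
channel_codes = {"Direct": "1", "Broker": "2", "Online": "3"}

# toy column mixing correct codes with textual labels
channel = pd.Series(["1", "Direct", "3", "Online", "Broker"])

# values not present in the mapping are left unchanged
fixed = channel.replace(channel_codes)
print(fixed.tolist())
```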


## Feature engineering
Before processing the data, we create some features that may significantly help our model. They must be created here so that the pipeline processes them (the regexes built earlier already account for them). Here are some economically meaningful feature proposals:

1. How long had the person been employed before the loan application (application date - employment date)?
2. Ratio of installment amount to income of main applicant
3. Ratio of installment amount to average income
4. Ratio of installment amount to (amount on current account + amount on savings account) -- not possible due to NaNs in amount on current account
5. Income of main applicant / (number of children + 1), where the 1 is the applicant
6. Income of main applicant / (number of dependents + 1), where the 1 is the applicant
7. Installment amount / (average income + value of the goods)
8. Application amount / (value of the goods + amount on current account + amount on savings account) -- not possible due to NaNs
9. Installment amount / (income of main applicant + income of the second applicant)
10. Number of children / 2 if married or in an informal relationship, number of children / 1 otherwise
11. Amount on savings account / amount on current account -- also not possible due to NaNs
12. Bureau score > 0?

The function below prepares all these variables in one step. As it changes neither the dependent variable nor the row count, it is the first step of the pipeline.

In [9]:
def create_new_features(X : pd.DataFrame) -> pd.DataFrame:
    X_new = X.copy()
    # durationOfEmployment
    X_new['durationOfEmployment'] = (pd.to_datetime(X_new['application_date']) - pd.to_datetime(X_new['Application data: employment date (main applicant)'], format="%d%b%Y")).apply(lambda x: x.days)
    
    # installment per average income of main applicant
    X_new['installmentPerIncomeOfMainApplicant'] = X_new['Installment amount'] / X_new['Application data: income of main applicant'].apply(lambda x: 1 if pd.isna(x) or x==0 else x)
    
    # installment amount per average income
    X_new['installmentPerIncome'] = X_new['Installment amount'] / X_new['Average income (Exterval data)']
    
    # income of main applicant / number of children + 1
    X_new['incomeOfMainApplicantperChildrenNumber'] = X_new['Application data: income of main applicant']/(X_new['Application data: number of children of main applicant'] + 1)

    # income of main applicant / number of dependences + 1 (the applicant)
    X_new['incomeOfMainApplicantperdependencesNumber'] = X_new['Application data: income of main applicant']/(X_new['Application data: number of dependences of main applicant'] + 1)
    
    # installment amount / average income + value of the goods
    X_new['installmentAmountPerIncomeAndGoods'] = X_new['Installment amount']/(X_new['Average income (Exterval data)'] + X_new['Value of the goods (car)'].apply(lambda x: 0 if pd.isna(x) else x))
    
    # installment amount / income of main applicant + income of the second applicant
    X_new['installmentPerBothIncomes'] = X_new['Installment amount'] / (X_new['Application data: income of main applicant'].apply(lambda x: 1 if pd.isna(x) or x==0 else x) + X_new['Application data: income of second applicant'].apply(lambda x: 0 if pd.isna(x) else x))
    
    # number of children per different options
    X_new['dependentNumberOfChildrenOnRelationshipStatus'] = X_new['Application data: number of children of main applicant'].apply(lambda x: 0 if pd.isna(x) else x) / X_new['Application data: marital status of main applicant'].apply(lambda x: 2 if x in [1, 2] else 1)
    
    # bureau score > 0? this is done because 1st quartile of this variable is 10, and median is 0 so it is quite unique
    X_new['isPositiveBureauScore'] = (X_new['Credit bureau score (Exterval data)'] > 0).astype('int64')
    
    return X_new
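
Several of the ratios above protect the denominator against missing or zero income by substituting 1. A vectorised sketch of that guard, on made-up numbers, could look as follows; `Series.where` is used here in place of the row-wise `apply`, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# made-up installments and incomes, including a zero and a missing income
installment = pd.Series([100.0, 100.0, 100.0])
income = pd.Series([200.0, 0.0, np.nan])

# keep valid incomes, substitute 1 where income is missing or zero
safe_income = income.where(income.notna() & (income != 0), 1)
ratio = installment / safe_income
print(ratio.tolist())
```

With this guard a missing or zero income yields a ratio equal to the installment itself, which is large relative to typical values, rather than a NaN or an infinity.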

Now we wrap the function in a FunctionTransformer so that it can be plugged into the pipeline:

In [10]:
create_features_transformer = FunctionTransformer(create_new_features)

Below we manually split* the variables that need individual treatment during imputation. The method is described in the first chapter of the documentation.

In [11]:
vars_for_zero_impute = ['Application data: income of second applicant', 'Application data: profession of second applicant', 'Value of the goods (car)']
vars_for_add_category_impute = ['Property ownership for property renovation', 'Clasification of the vehicle (Car, Motorbike)']
vars_for_mode_impute = ['Loan purpose', 'Distribution channel']
vars_for_fill_zeros_but_add_var = ["Amount on current account", "Amount on savings account"]

Besides being filled with 0, some variables require an additional indicator variable (marking, for example, that there was no savings account) to differentiate empty accounts from non-existent ones.

In [12]:
class SimpleImputeAddFeature(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns 

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_copy = X.copy()
        
        for column in self.columns:
            X_copy[column + '_was_missing'] = X_copy[column].isnull().astype(int)
            
            X_copy[column] = X_copy[column].fillna(0)
        
        return X_copy
    
    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            input_features = self.columns
        output_features = np.concatenate([input_features, [f"{col}_was_missing" for col in self.columns]])
        return output_features
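
The core of this imputer is two pandas operations applied in a fixed order: flag the missing values first, then fill them. The sketch below shows this on a toy frame; the column name comes from the list above, the values are made up.

```python
import numpy as np
import pandas as pd

# toy data: one real balance, one missing account, one empty account
df = pd.DataFrame({"Amount on savings account": [500.0, np.nan, 0.0]})

col = "Amount on savings account"
df[col + "_was_missing"] = df[col].isnull().astype(int)  # flag before filling
df[col] = df[col].fillna(0)
print(df)
```

After the transformation the second and third rows both hold 0 in the original column, but only the second is flagged, so an empty account stays distinguishable from a non-existent one.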

## 1st step of pipeline

In this step we create imputers for the different subsets of variables (listed in *). Besides these basic imputations, we use the custom imputer defined above and fix the encoding of Application_status separately. The remaining variables pass through unchanged. Because fix_encodings checks whether each column is present before touching it, it works without error when wrapped in a FunctionTransformer and applied to any subset of columns.

In [13]:
zero_imputer = SimpleImputer(strategy="constant", fill_value=0)
add_category_imputer = SimpleImputer(strategy="constant", fill_value=2)
mode_imputer = SimpleImputer(strategy="most_frequent")

impute_column_transformer = ColumnTransformer([
    ("zero_fill", zero_imputer, vars_for_zero_impute),
    ("add_third_category", add_category_imputer, vars_for_add_category_impute),
    ("mode_impute", make_pipeline(FunctionTransformer(fix_encodings), mode_imputer), vars_for_mode_impute),
    ("fill_zeros_but_add_var", SimpleImputeAddFeature(vars_for_fill_zeros_but_add_var), vars_for_fill_zeros_but_add_var),
    ("application_status_transform", FunctionTransformer(fix_encodings), ['Application_status'])
    ],
    remainder="passthrough"
).set_output(transform='pandas')

## 2nd step of pipeline

At this point the whole DataFrame has object dtype, which does not work well with models, so we cast every column except the datetime variables to a numeric type. We then wrap the function in a FunctionTransformer.

In [14]:
def make_dataframe_numeric_again(X : pd.DataFrame) -> pd.DataFrame:
    X_copy = X.copy()
    for column in X:
        if column.split('__')[1] not in datetime_variables: 
            X_copy[column] = pd.to_numeric(X[column])
    return X_copy

numericTransformer = FunctionTransformer(make_dataframe_numeric_again)
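
The function relies on the column names carrying exactly one transformer prefix at this stage, so the part after the first `__` is the original variable name. A small sketch of that logic on a toy frame follows; the column names and values are hypothetical.

```python
import pandas as pd

# hypothetical one-level prefixed names, as produced by a ColumnTransformer
df = pd.DataFrame({
    "zero_fill__income": ["1000", "2500"],
    "remainder__application_date": ["01Jun2006 0:00:00", "15Mar2011 0:00:00"],
}, dtype=object)
datetime_names = ["application_date"]

for column in df.columns:
    original_name = column.split("__", 1)[1]  # drop the transformer prefix
    if original_name not in datetime_names:
        df[column] = pd.to_numeric(df[column])
print(df.dtypes)
```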


## 3rd step of pipeline

In this step we account for the existence of a challenger model, which may work better with one-hot encoding, whereas for logistic regression we consider WoE encoding of categorical variables the better choice. We therefore create two ColumnTransformers that treat numerical features identically (standard scaling) but differ in how they encode nominal variables.

In [15]:
feature_transform_transformer = ColumnTransformer([
    ("scale", StandardScaler(), make_column_selector(num_regex)),
    ("one_hot_encode", OneHotEncoder(sparse_output=False), make_column_selector(nominal_regex))
],
    remainder="passthrough").set_output(transform="pandas")

In [16]:
feature_transform_transformer_woe = ColumnTransformer([
    ("scale", StandardScaler(), make_column_selector(num_regex)),
    ("woe_encode", WoEEncoder(ignore_format=True), make_column_selector(nominal_regex))
],
    remainder="passthrough").set_output(transform="pandas")
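
As a worked illustration of the idea behind WoE encoding, the sketch below computes weights of evidence by hand for a toy categorical variable. It assumes one common definition, the natural log of the ratio between a category's share of all positive cases and its share of all negative cases. The counts are made up, and the exact formula used by the library should be checked in its documentation.

```python
import math

# made-up counts of (positive, negative) targets per category
counts = {"A": (10, 30), "B": (30, 30)}
total_pos = sum(pos for pos, _ in counts.values())
total_neg = sum(neg for _, neg in counts.values())

# WoE = ln(share of positives in category / share of negatives in category)
woe = {
    category: math.log((pos / total_pos) / (neg / total_neg))
    for category, (pos, neg) in counts.items()
}
print(woe)
```

Under this definition a category with relatively few positives (A) gets a negative weight and one with relatively many (B) gets a positive weight, so each nominal variable becomes a single numeric column.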

## 4th step of pipeline
In this step we remove variables that are unnecessary for model training but were not dropped earlier.


In [17]:
def remove_unnecesary(X : pd.DataFrame) -> pd.DataFrame:
    return X.drop(['remainder__remainder__Application data: employment date (main applicant)',
                   'remainder__remainder__application_date',
                   'remainder__application_status_transform__Application_status',
                   'scale__remainder__customer_id'
                  ], axis=1)

remove_unnecesary_transformer = FunctionTransformer(remove_unnecesary)

## Final pipeline

Below we use make_pipeline to bind all the individual steps together into two pipelines.

In [18]:
# logistic pipeline
full_pipeline_logisitic = make_pipeline(create_features_transformer, impute_column_transformer, numericTransformer, feature_transform_transformer_woe, remove_unnecesary_transformer)
# ml pipeline
full_pipeline_ml = make_pipeline(create_features_transformer, impute_column_transformer, numericTransformer, feature_transform_transformer, remove_unnecesary_transformer)

## Full data routine example
In this appendix we show how the pipelines work on training and test data. We assumed that data quality in hypothetical production scenarios will only be higher, so we did not guard against possible wrong imputations or bad values.

In [19]:
train_data = pd.read_csv('https://files.challengerocket.com/files/lions-den-ing-2024/development_sample.csv')
test_data = pd.read_csv('https://files.challengerocket.com/files/lions-den-ing-2024/testing_sample.csv')

#train data
train_X, train_y = remove_nans(train_data)
num_regex, nominal_regex = generate_regex() 
train_ml_data = full_pipeline_ml.fit_transform(train_X)
train_logistic_data = full_pipeline_logisitic.fit_transform(train_X, train_y)

#test data
test_X, test_y = remove_nans(test_data)
num_regex, nominal_regex = generate_regex() 
test_ml_data = full_pipeline_ml.transform(test_X)
test_logistic_data = full_pipeline_logisitic.transform(test_X)



In [20]:
full_pipeline_logisitic

In [21]:
full_pipeline_ml

How does it work?

In [22]:
# Any nans here?
print(train_ml_data.isna().any().any())
print(train_logistic_data.isna().any().any())
print(test_ml_data.isna().any().any())
print(test_logistic_data.isna().any().any())

False
False
False
False


There are no NaNs in the dataset after transformations.

In [23]:
train_ml_data.describe()

Unnamed: 0,scale__zero_fill__Application data: income of second applicant,scale__zero_fill__Value of the goods (car),scale__fill_zeros_but_add_var__Amount on current account,scale__fill_zeros_but_add_var__Amount on savings account,scale__remainder__Number of applicants,scale__remainder__Application amount,scale__remainder__Credit duration (months),scale__remainder__Payment frequency,scale__remainder__Installment amount,scale__remainder__Application data: income of main applicant,...,one_hot_encode__remainder__Application data: marital status of main applicant_2,one_hot_encode__remainder__Application data: marital status of main applicant_3,one_hot_encode__remainder__Application data: marital status of main applicant_4,remainder__add_third_category__Property ownership for property renovation,"remainder__add_third_category__Clasification of the vehicle (Car, Motorbike)",remainder__fill_zeros_but_add_var__Amount on current account_was_missing,remainder__fill_zeros_but_add_var__Amount on savings account_was_missing,remainder__remainder__Arrear in last 3 months (indicator),remainder__remainder__Arrear in last 12 months (indicator),remainder__remainder__isPositiveBureauScore
count,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,...,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0
mean,6.197832e-18,4.9582660000000005e-17,-1.239566e-17,1.48748e-16,3.098916e-17,1.11561e-16,-1.2589350000000002e-17,1.41388e-17,1.301545e-16,0.0,...,0.161506,0.134029,0.070599,1.697514,1.440768,0.201548,0.398926,0.012512,0.044077,0.459848
std,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,...,0.368002,0.340688,0.256158,0.541587,0.69553,0.401162,0.489684,0.111155,0.205268,0.498392
min,-0.482761,-0.7358431,-0.8024941,-0.6867143,-0.4915409,-1.393707,-0.8725248,-0.4417715,-0.8412642,-1.905771,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.482761,-0.7358431,-0.6198403,-0.6867143,-0.4915409,-0.8208278,-0.5855389,-0.4417715,-0.5336108,-0.766914,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,-0.482761,-0.7358431,-0.2790257,-0.4305702,-0.4915409,-0.2315806,-0.2028912,-0.4417715,-0.3030519,-0.220263,...,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0
75%,-0.482761,0.5716219,0.2665731,0.3743751,-0.4915409,0.6441062,0.1797566,-0.4417715,0.1229326,0.599714,...,0.0,0.0,0.0,2.0,2.0,0.0,1.0,0.0,0.0,1.0
max,7.205681,5.410638,18.99015,13.37752,4.109021,3.762206,8.884994,3.589177,14.45061,6.316774,...,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0
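
The standard deviation of 1.000014 rather than exactly 1 in the scaled columns is consistent with StandardScaler dividing by the population standard deviation (ddof=0) while `describe()` reports the sample standard deviation (ddof=1). A quick check of the arithmetic, using the row count from the table:

```python
import math

n = 36686  # number of rows reported by describe()

# ratio between the sample (ddof=1) and population (ddof=0) standard deviation
ratio = math.sqrt(n / (n - 1))
print(round(ratio, 6))
```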


In [24]:
train_logistic_data.describe()

Unnamed: 0,scale__zero_fill__Application data: income of second applicant,scale__zero_fill__Value of the goods (car),scale__fill_zeros_but_add_var__Amount on current account,scale__fill_zeros_but_add_var__Amount on savings account,scale__remainder__Number of applicants,scale__remainder__Application amount,scale__remainder__Credit duration (months),scale__remainder__Payment frequency,scale__remainder__Installment amount,scale__remainder__Application data: income of main applicant,...,woe_encode__mode_impute__Distribution channel,woe_encode__remainder__Application data: profession of main applicant,woe_encode__remainder__Application data: marital status of main applicant,remainder__add_third_category__Property ownership for property renovation,"remainder__add_third_category__Clasification of the vehicle (Car, Motorbike)",remainder__fill_zeros_but_add_var__Amount on current account_was_missing,remainder__fill_zeros_but_add_var__Amount on savings account_was_missing,remainder__remainder__Arrear in last 3 months (indicator),remainder__remainder__Arrear in last 12 months (indicator),remainder__remainder__isPositiveBureauScore
count,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,...,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0,36686.0
mean,6.197832e-18,4.9582660000000005e-17,-1.239566e-17,1.48748e-16,3.098916e-17,1.11561e-16,-1.2589350000000002e-17,1.41388e-17,1.301545e-16,0.0,...,-0.010177,-0.138633,-0.029828,1.697514,1.440768,0.201548,0.398926,0.012512,0.044077,0.459848
std,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,...,0.142588,0.426042,0.250428,0.541587,0.69553,0.401162,0.489684,0.111155,0.205268,0.498392
min,-0.482761,-0.7358431,-0.8024941,-0.6867143,-0.4915409,-1.393707,-0.8725248,-0.4417715,-0.8412642,-1.905771,...,-0.105673,-1.087192,-0.294188,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.482761,-0.7358431,-0.6198403,-0.6867143,-0.4915409,-0.8208278,-0.5855389,-0.4417715,-0.5336108,-0.766914,...,-0.105673,-0.28987,-0.294188,1.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,-0.482761,-0.7358431,-0.2790257,-0.4305702,-0.4915409,-0.2315806,-0.2028912,-0.4417715,-0.3030519,-0.220263,...,-0.105673,-0.11905,0.067497,2.0,2.0,0.0,0.0,0.0,0.0,0.0
75%,-0.482761,0.5716219,0.2665731,0.3743751,-0.4915409,0.6441062,0.1797566,-0.4417715,0.1229326,0.599714,...,-0.000522,-0.11905,0.067497,2.0,2.0,0.0,1.0,0.0,0.0,1.0
max,7.205681,5.410638,18.99015,13.37752,4.109021,3.762206,8.884994,3.589177,14.45061,6.316774,...,0.318044,2.724311,0.38194,2.0,2.0,1.0,1.0,1.0,1.0,1.0


In [25]:
test_ml_data.describe()

Unnamed: 0,scale__zero_fill__Application data: income of second applicant,scale__zero_fill__Value of the goods (car),scale__fill_zeros_but_add_var__Amount on current account,scale__fill_zeros_but_add_var__Amount on savings account,scale__remainder__Number of applicants,scale__remainder__Application amount,scale__remainder__Credit duration (months),scale__remainder__Payment frequency,scale__remainder__Installment amount,scale__remainder__Application data: income of main applicant,...,one_hot_encode__remainder__Application data: marital status of main applicant_2,one_hot_encode__remainder__Application data: marital status of main applicant_3,one_hot_encode__remainder__Application data: marital status of main applicant_4,remainder__add_third_category__Property ownership for property renovation,"remainder__add_third_category__Clasification of the vehicle (Car, Motorbike)",remainder__fill_zeros_but_add_var__Amount on current account_was_missing,remainder__fill_zeros_but_add_var__Amount on savings account_was_missing,remainder__remainder__Arrear in last 3 months (indicator),remainder__remainder__Arrear in last 12 months (indicator),remainder__remainder__isPositiveBureauScore
count,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,...,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0
mean,0.03823,0.01166,-0.003258,0.025313,0.022296,0.038421,0.037457,-0.001208,0.01544,0.019093,...,0.157115,0.139293,0.069372,1.683027,1.43762,0.197971,0.38799,0.012339,0.043597,0.470524
std,1.064038,1.017301,1.022702,1.039007,1.02427,1.000016,1.038883,0.989439,1.064454,1.013247,...,0.363959,0.346299,0.254121,0.55414,0.695877,0.398525,0.487359,0.110408,0.204226,0.499199
min,-0.482761,-0.735843,-0.802494,-0.686714,-0.491541,-1.352787,-0.872525,-0.441772,-0.838604,-1.905771,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.482761,-0.735843,-0.618103,-0.686714,-0.491541,-0.747172,-0.585539,-0.441772,-0.541104,-0.766914,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,-0.482761,-0.735843,-0.276739,-0.414047,-0.491541,-0.207029,-0.202891,-0.441772,-0.301937,-0.174709,...,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0
75%,-0.482761,0.569295,0.236597,0.38826,-0.491541,0.676842,0.275419,-0.441772,0.114349,0.622491,...,0.0,0.0,0.0,2.0,2.0,0.0,1.0,0.0,0.0,1.0
max,5.737102,6.052739,18.483107,7.744321,4.109021,3.901334,8.789332,3.589177,11.421667,4.881815,...,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0
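
On the test data the scaled columns have means close to, but not exactly, 0 and standard deviations close to 1, because the scaling statistics were estimated on the training data only. The following sketch reproduces the effect on made-up numbers with the standard library.

```python
import statistics

# made-up training and test values of one numerical feature
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [2.0, 4.0, 6.0]

# statistics are estimated on the training values only
mean = statistics.fmean(train)
std = statistics.pstdev(train)  # population std, as StandardScaler uses

train_scaled = [(x - mean) / std for x in train]
test_scaled = [(x - mean) / std for x in test]
print(statistics.fmean(train_scaled), statistics.fmean(test_scaled))
```

The training mean is zero by construction, while the test mean reflects how far the test distribution sits from the training one.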


In [26]:
test_logistic_data.describe()

Unnamed: 0,scale__zero_fill__Application data: income of second applicant,scale__zero_fill__Value of the goods (car),scale__fill_zeros_but_add_var__Amount on current account,scale__fill_zeros_but_add_var__Amount on savings account,scale__remainder__Number of applicants,scale__remainder__Application amount,scale__remainder__Credit duration (months),scale__remainder__Payment frequency,scale__remainder__Installment amount,scale__remainder__Application data: income of main applicant,...,woe_encode__mode_impute__Distribution channel,woe_encode__remainder__Application data: profession of main applicant,woe_encode__remainder__Application data: marital status of main applicant,remainder__add_third_category__Property ownership for property renovation,"remainder__add_third_category__Clasification of the vehicle (Car, Motorbike)",remainder__fill_zeros_but_add_var__Amount on current account_was_missing,remainder__fill_zeros_but_add_var__Amount on savings account_was_missing,remainder__remainder__Arrear in last 3 months (indicator),remainder__remainder__Arrear in last 12 months (indicator),remainder__remainder__isPositiveBureauScore
count,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,...,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0,3647.0
mean,0.03823,0.01166,-0.003258,0.025313,0.022296,0.038421,0.037457,-0.001208,0.01544,0.019093,...,-0.013108,-0.144131,-0.027505,1.683027,1.43762,0.197971,0.38799,0.012339,0.043597,0.470524
std,1.064038,1.017301,1.022702,1.039007,1.02427,1.000016,1.038883,0.989439,1.064454,1.013247,...,0.138813,0.426242,0.251622,0.55414,0.695877,0.398525,0.487359,0.110408,0.204226,0.499199
min,-0.482761,-0.735843,-0.802494,-0.686714,-0.491541,-1.352787,-0.872525,-0.441772,-0.838604,-1.905771,...,-0.105673,-1.087192,-0.294188,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.482761,-0.735843,-0.618103,-0.686714,-0.491541,-0.747172,-0.585539,-0.441772,-0.541104,-0.766914,...,-0.105673,-0.28987,-0.294188,1.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,-0.482761,-0.735843,-0.276739,-0.414047,-0.491541,-0.207029,-0.202891,-0.441772,-0.301937,-0.174709,...,-0.105673,-0.11905,0.067497,2.0,2.0,0.0,0.0,0.0,0.0,0.0
75%,-0.482761,0.569295,0.236597,0.38826,-0.491541,0.676842,0.275419,-0.441772,0.114349,0.622491,...,-0.000522,-0.11905,0.067497,2.0,2.0,0.0,1.0,0.0,0.0,1.0
max,5.737102,6.052739,18.483107,7.744321,4.109021,3.901334,8.789332,3.589177,11.421667,4.881815,...,0.318044,2.724311,0.38194,2.0,2.0,1.0,1.0,1.0,1.0,1.0
