# ING Den - final challenge
## Neuralna Ekipa

In this notebook we do a preliminary analysis of the dataset.

In [2]:
import pandas as pd

train_data = pd.read_csv('./datasets/in_time.csv')


In [3]:
import numpy as np

How much data did we receive?

In [9]:
train_data.shape

(310000, 307)

We received 310k observations with 307 columns each. One column is the binary target variable; the rest are the features listed in Data_dictionary.xlsx. We then built a function that splits the columns into the categories specified there and transforms the data into a usable form, replacing the -9999 placeholder values with NaNs.
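
The placeholder replacement can be tried on a toy frame first. This is only a sketch: the column names `balance` and `active_lines` are made up for illustration, and `DataFrame.mask` is one possible way to do it, not necessarily what the project code uses.

```python
import pandas as pd

SENTINEL = -9999  # placeholder the dataset uses for "no value"

# Toy frame; column names are illustrative, not taken from the real data.
toy = pd.DataFrame({
    "Customer_id": [1, 2, 3],
    "balance": [100.0, SENTINEL, 250.0],
    "active_lines": [SENTINEL, 2, SENTINEL],
})

# mask() replaces every cell where the condition holds with NaN
cleaned = toy.mask(toy == SENTINEL)

# Number of missing values per column after the replacement
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Integer columns that contain the placeholder become float columns after this step, because NaN has no integer representation in the default NumPy dtypes.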

In [5]:
def transform_data_analysis(X : pd.DataFrame):
    X = X.copy()
    X.set_index(['Customer_id'], inplace=True)
    real_variables_columns = pd.read_excel('Data_dictionary.xlsx').iloc[:42, :]
    types = {k:[] for k in real_variables_columns['Type'].unique()}
    # -9999 is the placeholder for a missing value
    X[X == -9999] = pd.NA
    for feature in real_variables_columns.iterrows():
        # column names ending in 'x' are templates that expand to lags 0-12
        if feature[1]['Column name*'] == 'Customer_id': continue
        if(feature[1]['Column name*'][-1] =='x'):
            for lag in range(13):
                types[feature[1]['Type']].append((feature[1]['Column name*'][:-1]+str(lag)).replace(' ', '_'))
        else:
            types[feature[1]['Type']].append(feature[1]['Column name*'].replace(' ', '_'))

    # at this time we include only numeric features
    return X.drop(['Target'], axis=1), X['Target']
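
The name expansion inside the loop above can be isolated into a small helper to make the lag range explicit. This is a sketch: `expand_column_template` is a hypothetical name, and the example dictionary entry is assumed, not copied from Data_dictionary.xlsx.

```python
def expand_column_template(name, n_lags=13):
    """Expand a dictionary entry ending in 'x' into its lagged column names.

    Entries that do not end in 'x' are returned unchanged (spaces become
    underscores in both cases).
    """
    name = name.replace(" ", "_")
    if name.endswith("x"):
        stem = name[:-1]
        # range(13) yields lags 0..12, i.e. 13 columns per template
        return [f"{stem}{lag}" for lag in range(n_lags)]
    return [name]


# Hypothetical dictionary entries, used only to illustrate the behaviour
lagged = expand_column_template("limit in revolving loans Hx")
single = expand_column_template("Birth date")
print(lagged[0], lagged[-1], single)
```

Note that the rule keys only on the last character, so any regular column whose name happens to end in 'x' would also be expanded.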

Splitting the data into exogenous variables (features) and the endogenous variable (target)

In [8]:
from data_preparation.data_preparation import transform_data

X,y = transform_data(train_data)

Are any target values missing?

In [11]:
y.isna().any()

False

What share of values is missing in each variable?

In [36]:
np.mean(X.loc[:, (np.mean(X.isna(), axis=0) > 0).values].isna(), axis=0).sort_values(ascending=False)

External_credit_card_balance             0.999532
External_term_loan_balance               0.999532
External_mortgage_balance                0.999532
Active_credit_card_lines                 0.880958
Active_mortgages                         0.820126
utilized_limit_in_revolving_loans_H6     0.197065
utilized_limit_in_revolving_loans_H2     0.197065
utilized_limit_in_revolving_loans_H3     0.197065
utilized_limit_in_revolving_loans_H4     0.197065
utilized_limit_in_revolving_loans_H5     0.197065
utilized_limit_in_revolving_loans_H8     0.197065
utilized_limit_in_revolving_loans_H7     0.197065
utilized_limit_in_revolving_loans_H0     0.197065
utilized_limit_in_revolving_loans_H9     0.197065
utilized_limit_in_revolving_loans_H10    0.197065
utilized_limit_in_revolving_loans_H11    0.197065
utilized_limit_in_revolving_loans_H1     0.197065
limit_in_revolving_loans_H10             0.197065
limit_in_revolving_loans_H12             0.197065
limit_in_revolving_loans_H11             0.197065
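
The same kind of ranking can be written more compactly with `DataFrame.isna().mean()`. Below is a sketch on a toy frame with made-up columns `a`, `b` and `c`:

```python
import numpy as np
import pandas as pd

# Toy frame: 'a' is 75% missing, 'b' 25% missing, 'c' complete
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# The mean of the boolean isna() mask is the share of missing values per column
missing_share = toy.isna().mean()

# Keep only columns with at least one missing value, largest share first
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```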


We can see that 99.9532% of all external data is missing (credit card balance, term loan balance, mortgage balance), so we will drop these features.

Active_credit_card_lines is missing in 88% of observations, so we create the indicator variable *hasActive_credit_card_lines* and drop the original.

Active_mortgages is missing in 82% of observations; we treat it the same way.

Two variables and their lagged versions, limit_in_revolving_loans and utilized_limit_in_revolving_loans, are missing in almost 20% of observations. We drop them because they would be difficult to impute.

The features to drop are:

In [50]:
features_to_drop = np.mean(X.loc[:, (np.mean(X.isna(), axis=0) > 0).values].isna(), axis=0).index

But first we have to prepare the indicator variables described above:

In [21]:
def transform_data_analysis(X : pd.DataFrame):
    X = X.copy()
    X.set_index(['Customer_id'], inplace=True)
    real_variables_columns = pd.read_excel('Data_dictionary.xlsx').iloc[:42, :]
    types = {k:[] for k in real_variables_columns['Type'].unique()}
    # -9999 is the placeholder for a missing value
    X[X == -9999] = pd.NA
    for feature in real_variables_columns.iterrows():
        # column names ending in 'x' are templates that expand to lags 0-12
        if feature[1]['Column name*'] == 'Customer_id': continue
        if(feature[1]['Column name*'][-1] =='x'):
            for lag in range(13):
                types[feature[1]['Type']].append((feature[1]['Column name*'][:-1]+str(lag)).replace(' ', '_'))
        else:
            types[feature[1]['Type']].append(feature[1]['Column name*'].replace(' ', '_'))
    features_to_drop = X.columns[X.isna().any()]
    types['Created'] = []
    
    # create indicator features that flag whether the original value was present
    types['Created'].append('hasExternal_credit_card_balance')
    types['Created'].append('hasExternal_term_loan_balance')
    types['Created'].append('hasExternal_mortgage_balance')
    types['Created'].append('hasActive_credit_card_lines')
    types['Created'].append('hasActive_mortgages')

    X['hasExternal_credit_card_balance'] = ~pd.isna(X['External_credit_card_balance'])
    X['hasExternal_term_loan_balance'] = ~pd.isna(X['External_term_loan_balance'])
    X['hasExternal_mortgage_balance'] = ~pd.isna(X['External_mortgage_balance'])
    X['hasActive_credit_card_lines'] = ~pd.isna(X['Active_credit_card_lines'])
    X['hasActive_mortgages'] = ~pd.isna(X['Active_mortgages'])

    # drop every feature that contains missing values; the indicators above keep the presence information
    X = X.drop(features_to_drop, axis=1)

    
    return X.drop(['Target'], axis=1), X['Target']
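
The indicator-then-drop pattern used in the function above can be checked on a toy frame. This is a sketch: `add_indicators_and_drop` is a hypothetical helper and `limit_H0` is an illustrative column name.

```python
import numpy as np
import pandas as pd


def add_indicators_and_drop(X, indicator_columns):
    """Add a boolean 'has<col>' flag for the given columns, then drop
    every column that still contains missing values."""
    X = X.copy()
    for col in indicator_columns:
        # True where the original value was present
        X[f"has{col}"] = X[col].notna()
    incomplete = X.columns[X.isna().any()]
    return X.drop(columns=incomplete)


toy = pd.DataFrame({
    "Active_mortgages": [np.nan, 1.0, np.nan],
    "limit_H0": [10.0, np.nan, 30.0],
    "Debit_cards": [1, 2, 0],
})

result = add_indicators_and_drop(toy, ["Active_mortgages"])
print(result.columns.tolist())
```

The indicator columns never contain NaN themselves, so they survive the drop while the incomplete originals are removed.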

Are there any missing values left?

In [24]:
transform_data_analysis(train_data)[0].isna().any().any()

False

(310000, 279)

In [13]:
X_train.columns

Index(['Ref_month', 'Customer_id', 'Birth_date', 'No_dependants',
       'Time_in_address', 'Time_in_current_job', 'Credit_cards', 'Debit_cards',
       'Active_accounts', 'Oldest_account_date',
       ...
       'out_transactions_amt_H9', 'out_transactions_amt_H8',
       'out_transactions_amt_H7', 'out_transactions_amt_H6',
       'out_transactions_amt_H5', 'out_transactions_amt_H4',
       'out_transactions_amt_H3', 'out_transactions_amt_H2',
       'out_transactions_amt_H1', 'out_transactions_amt_H0'],
      dtype='object', length=306)

In [None]:
types = {k:[] for k in real_variables_columns['Type'].unique()}

In [None]:
for feature in real_variables_columns.iterrows():
    # column names ending in 'x' are templates that expand to lags 0-12
    if(feature[1]['Column name*'][-1] =='x'):
        for lag in range(13):
            types[feature[1]['Type']].append((feature[1]['Column name*'][:-1]+str(lag)).replace(' ', '_'))
    else:
        types[feature[1]['Type']].append(feature[1]['Column name*'].replace(' ', '_'))

# print every expected column name that is absent from the data
for type_name in types:
    for subtype in types[type_name]:
        if subtype not in X_train.columns:
            print(subtype)

To summarize, the plot below shows how many variables fall into each type:

In [None]:
import matplotlib.pyplot as plt


df = pd.DataFrame({k:len(x) for k,x in types.items()}, index=['no. of features'])
ax = df.T.plot.bar()
plt.title("Number of features in each category")
plt.grid(True)

for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))

plt.show()

As we can see, there is 1 variable that contains a month, 35 unbounded integer variables, 4 date variables with a month, 39 integer variables bounded to the range 0-330, and 13 binary variables.

Before the next step we will use the Gini coefficient to drop obviously random (non-discriminatory) variables.
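
For a single feature and a binary target, the Gini coefficient is commonly defined as 2 * AUC - 1. The sketch below computes it from ranks (the Mann-Whitney formulation of AUC). The function name `gini_coefficient` is hypothetical, and we do not claim that GiniSelector is implemented this way.

```python
import numpy as np
import pandas as pd


def gini_coefficient(feature, target):
    """Gini = 2 * AUC - 1, with AUC computed from average ranks."""
    feature = pd.Series(np.asarray(feature, dtype=float))
    target = np.asarray(target).astype(bool)
    n_pos = int(target.sum())
    n_neg = int(len(target) - n_pos)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("target must contain both classes")
    # Average ranks handle ties, so a constant feature gets AUC = 0.5
    ranks = feature.rank(method="average").to_numpy()
    auc = (ranks[target].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return 2 * auc - 1


y_toy = [0, 0, 1, 1]
perfect = gini_coefficient([1, 2, 3, 4], y_toy)   # separates the classes
constant = gini_coefficient([5, 5, 5, 5], y_toy)  # carries no information
reversed_ = gini_coefficient([4, 3, 2, 1], y_toy)  # perfectly inverted
print(perfect, constant, reversed_)
```

A feature with a Gini close to zero ranks the two classes no better than chance. Because an inverted feature is just as informative as a direct one, a selector would presumably compare the absolute value against its threshold; that is our assumption here.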

## Feature engineering

In this section we create new variables by hand and then pass all variables through GiniSelector, which removes purely random (non-discriminatory) features.

In [None]:
def create_new_features(X):
    # Implicit assumption: the dataset has the same structure as the in_time.csv and out_of_time.csv files
    print(X['Birth_date'][:-5])

create_new_features(X_train)

In [None]:
from data_preparation.additional_transformers import GiniSelector
from sklearn.compose import make_column_transformer



gs = GiniSelector(0.01)
column_transformer = make_column_transformer(
    (gs, types['Float'] + types['Integer'] + types['Integer (0 or 1)'] + types['Integer (0-330)']),
    remainder="passthrough"
)



data_trimmed = column_transformer.fit_transform(X_train, y_train)



In [None]:
data_after_gini = pd.DataFrame(data_trimmed, columns=[x.split('__')[1] for x in column_transformer.get_feature_names_out()])