# ING Den - final challenge
## Neuralna Ekipa

In this notebook we do a preliminary analysis of the dataset.

In [1]:
import pandas as pd

train_data = pd.read_csv('./datasets/in_time.csv')


How much data did we receive?

In [9]:
train_data.shape

(310000, 307)

In [10]:
X_train, y_train = train_data.drop(['Target'], axis=1), train_data['Target']

We received 310k observations with 307 columns each (306 features plus the `Target` column). According to the data dictionary attached to the task, however, there are in fact only 44 real features; the rest are lagged versions of these. We will try to derive the column names automatically from the Data_dictionary.xlsx file.
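To make the "real feature vs. lagged copy" distinction concrete, here is a minimal sketch that groups column names by their base name. It assumes that lagged columns end in `_H<number>`, as the `out_transactions_amt_H0` ... `out_transactions_amt_H9` names in the output suggest; the helper `group_lagged_columns` is hypothetical and not part of the project code.

```python
import re

# Assumption: a lagged column looks like "<base>_H<lag>", e.g. "out_transactions_amt_H3".
LAG_PATTERN = re.compile(r"^(.+_H)(\d+)$")


def group_lagged_columns(columns):
    """Map each base name to the sorted list of lags found for it.

    Columns without a lag suffix map to an empty list.
    """
    groups = {}
    for name in columns:
        match = LAG_PATTERN.match(name)
        if match:
            groups.setdefault(match.group(1), []).append(int(match.group(2)))
        else:
            groups.setdefault(name, [])
    return {base: sorted(lags) for base, lags in groups.items()}


example_columns = [
    "Ref_month",
    "out_transactions_amt_H1",
    "out_transactions_amt_H0",
    "out_transactions_amt_H12",
]
print(group_lagged_columns(example_columns))
```

Applied to `X_train.columns`, the number of keys in the result would give a rough count of distinct features to compare against the data dictionary.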

In [11]:
real_variables_columns = pd.read_excel('Data_dictionary.xlsx').iloc[:42, :]

In [12]:
X_train.columns

Index(['Ref_month', 'Customer_id', 'Birth_date', 'No_dependants',
       'Time_in_address', 'Time_in_current_job', 'Credit_cards', 'Debit_cards',
       'Active_accounts', 'Oldest_account_date',
       ...
       'out_transactions_amt_H9', 'out_transactions_amt_H8',
       'out_transactions_amt_H7', 'out_transactions_amt_H6',
       'out_transactions_amt_H5', 'out_transactions_amt_H4',
       'out_transactions_amt_H3', 'out_transactions_amt_H2',
       'out_transactions_amt_H1', 'out_transactions_amt_H0'],
      dtype='object', length=306)

In [None]:
types = {k:[] for k in real_variables_columns['Type'].unique()}

In [None]:
for _, feature in real_variables_columns.iterrows():
    name, kind = feature['Column name*'], feature['Type']
    if name.endswith('x'):
        # A trailing 'x' is a placeholder for the lag index: expand it to lags 0-12
        for lag in range(13):
            types[kind].append((name[:-1] + str(lag)).replace(' ', '_'))
    else:
        types[kind].append(name.replace(' ', '_'))

# Sanity check: print every expected column name that is missing from the data
for type_name, names in types.items():
    for name in names:
        if name not in X_train.columns:
            print(name)

To summarize, the plot below shows how many variables fall into each type:

In [None]:
import matplotlib.pyplot as plt


df = pd.DataFrame({k:len(x) for k,x in types.items()}, index=['no. of features'])
ax = df.T.plot.bar()
plt.title("Number of features in each category")
plt.grid(True)

for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))

plt.show()

As we can see, there is 1 variable that contains a month, 35 unbounded integer variables, 4 date variables with a month, 39 integer variables bounded to the range 0-330, and 13 binary variables.

Before the next step, we will use the Gini coefficient to drop obviously random (non-discriminatory) variables.
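As a rough illustration of the idea, the sketch below scores a single feature with the Gini coefficient in its common credit-scoring form, Gini = 2·AUC − 1, and keeps only features whose absolute Gini reaches a threshold. This is an assumption about how such a filter may work, not the project's actual `GiniSelector`; the helpers `auc`, `gini` and `select_by_gini` are hypothetical, and the pairwise AUC is quadratic in the number of rows, so it is only suitable for small examples.

```python
def auc(values, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [v for v, y in zip(values, labels) if y == 1]
    negatives = [v for v, y in zip(values, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def gini(values, labels):
    """Gini coefficient of a single feature: 0 is random, +/-1 is perfect separation."""
    return 2.0 * auc(values, labels) - 1.0


def select_by_gini(features, labels, threshold):
    """Keep the names of features whose absolute Gini is at least `threshold`."""
    return [
        name
        for name, values in features.items()
        if abs(gini(values, labels)) >= threshold
    ]


labels = [0, 0, 1, 1]
features = {
    "strong": [0.1, 0.2, 0.8, 0.9],   # separates the classes perfectly
    "constant": [5, 5, 5, 5],         # carries no information
    "inverse": [0.9, 0.8, 0.2, 0.1],  # perfectly separating, opposite direction
}
print(select_by_gini(features, labels, threshold=0.01))
```

Using the absolute value means a feature that ranks the classes in reverse order is still treated as informative.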

## Feature engineering

In this section we will create new variables by hand and then pass all variables through GiniSelector, which removes purely random (non-discriminatory) features.

In [None]:
def create_new_features(X):
    # Assumes X has the same structure as the in_time.csv and out_of_time.csv files.
    # Placeholder: no features are created yet; this only prints Birth_date without its last five rows.
    print(X['Birth_date'][:-5])

create_new_features(X_train)

In [None]:
from data_preparation.additional_transformers import GiniSelector
from sklearn.compose import make_column_transformer

# Apply GiniSelector to the numeric columns; all other columns pass through unchanged
gs = GiniSelector(0.01)
column_transformer = make_column_transformer(
    (gs, types['Float'] + types['Integer'] + types['Integer (0 or 1)'] + types['Integer (0-330)']),
    remainder="passthrough"
)

data_trimmed = column_transformer.fit_transform(X_train, y_train)

In [None]:
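# Strip the "<transformer>__" prefix; split only once so names containing '__' stay intact
feature_names = [x.split('__', 1)[1] for x in column_transformer.get_feature_names_out()]
data_after_gini = pd.DataFrame(data_trimmed, columns=feature_names)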
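
By default, the names returned by `get_feature_names_out()` on a column transformer are prefixed with the transformer name and a double underscore. The sketch below shows one way to strip that prefix safely; the helper `strip_transformer_prefix` and the example names are illustrative assumptions, not taken from the project.

```python
def strip_transformer_prefix(name: str) -> str:
    """Return the part after the first '__', or the name unchanged if there is none."""
    _prefix, separator, rest = name.partition("__")
    return rest if separator else name


# Illustrative names only; the real prefixes depend on the transformers used.
prefixed = [
    "giniselector__Credit_cards",
    "remainder__Ref_month",
    "remainder__odd__name",
    "no_prefix",
]
print([strip_transformer_prefix(n) for n in prefixed])
# → ['Credit_cards', 'Ref_month', 'odd__name', 'no_prefix']
```

An alternative worth considering is passing `verbose_feature_names_out=False` to `make_column_transformer`, which drops the prefixes as long as the resulting names are unique.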