# Notebook overview

This notebook supplements 01_Training.ipynb. It explores several points related to our data:

1. Memory usage and how we can optimize it
2. Selection of the most relevant features by different techniques:
   - EDA
   - feature importance obtained from different algorithms

# Imports

## Libraries

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gc, warnings, os, time
import importlib

# sklearn
from sklearn.pipeline import Pipeline

# custom classes
import pipelines
import data_preprocessing as process
import transformers

importlib.reload(process)
importlib.reload(transformers)
importlib.reload(pipelines)

from pipelines import PIPELINES

# sklearn
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler, RobustScaler

warnings.simplefilter('ignore', category=FutureWarning)

pd.set_option('display.max_colwidth', None)
pd.options.display.float_format = '{:.4f}'.format

## Data

We load the main Application table and join all the other tables to it, applying the same basic feature engineering as in the training part.

In [62]:
data = process.load_application(dev_mode=False)
data.shape

(307511, 122)

In [63]:
dev_mode = False

X_bureau_features = process.get_bureau_and_balance_features(dev_mode=dev_mode)
X_prev_app_features = process.get_previous_applications_features(dev_mode=dev_mode)
X_pos_cash_balance_features = process.get_pos_cash_balance_features(dev_mode=dev_mode)
X_installment_features = process.get_installments_payments_features(dev_mode=dev_mode)
X_cc_balance_features = process.get_credit_card_balance_features(dev_mode=dev_mode)

fe_pipeline = Pipeline(steps=[
    ('cleaner', transformers.ApplicationCleaner()),
    ('feature_extractor', transformers.ApplicationFeaturesExtractor()),
    ('merge_bureau_and_balance', transformers.ApplicationFeaturesMerger(X_bureau_features)),
    ('merge_previous_applications', transformers.ApplicationFeaturesMerger(X_prev_app_features)),
    ('merge_pos_cash_balance', transformers.ApplicationFeaturesMerger(X_pos_cash_balance_features)),
    ('merge_installments_payments', transformers.ApplicationFeaturesMerger(X_installment_features)),
    ('merge_credit_card_balance', transformers.ApplicationFeaturesMerger(X_cc_balance_features))
])

del X_bureau_features, X_prev_app_features, X_pos_cash_balance_features, X_installment_features, X_cc_balance_features
gc.collect()

data_transformed = fe_pipeline.fit_transform(data)
data_transformed.shape

(307511, 663)

# Memory usage

By default, pandas stores integer values as int64 and float values as float64, which takes a lot of memory. We can instead downcast the columns to smaller types. We should also check for features that hold only 0 or 1 values but are encoded as float, and convert them to uint8.

The limit of this approach is the presence of NaN values: a column that contains NaN cannot be stored in a NumPy integer type, so it stays a float column and cannot be downcast to an integer.
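To make the two statements above concrete, here is a minimal sketch on toy data (the series `flags` and `flags_with_nan` are made up for illustration and are not part of the dataset). It compares the bytes used by the same 0/1 values stored as int64 and as uint8, and shows that a column containing NaN stays in float64 when an integer downcast is requested.

```python
import numpy as np
import pandas as pd

# Toy column of 0/1 flags stored with 64-bit integers.
flags = pd.Series([0, 1, 1, 0] * 1000, dtype="int64")
bytes_int64 = flags.memory_usage(index=False)                  # 8 bytes per value
bytes_uint8 = flags.astype("uint8").memory_usage(index=False)  # 1 byte per value
print(bytes_int64, bytes_uint8)

# A single missing value makes pandas store the column as float64.
flags_with_nan = pd.Series([0, 1, np.nan, 0])
print(flags_with_nan.dtype)

# Requesting an integer downcast leaves the column untouched,
# because NaN has no representation in a NumPy integer dtype.
still_float = pd.to_numeric(flags_with_nan, downcast="integer")
print(still_float.dtype)
```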

In [64]:
data_transformed.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 663 columns):
 #    Column                                                                 Dtype   
---   ------                                                                 -----   
 0    SK_ID_CURR                                                             int64   
 1    TARGET                                                                 int64   
 2    NAME_CONTRACT_TYPE                                                     category
 3    CODE_GENDER                                                            category
 4    FLAG_OWN_CAR                                                           int64   
 5    FLAG_OWN_REALTY                                                        int64   
 6    CNT_CHILDREN                                                           int64   
 7    AMT_INCOME_TOTAL                                                       float64 
 8    AMT_CREDIT            

The dataset takes 1.5 GB in memory, and most of the columns (608 out of 663) are stored as float64. Let's see if we can reduce the dataset size.

In [52]:
def display_memory_by_dtype(dtype):
    selected_dtype = data_transformed.select_dtypes(include = [dtype])
    total_usage = selected_dtype.memory_usage(deep=True).sum()
    print("Total memory usage for {} columns: {:03.4f} MB".format(dtype, (total_usage / 1024) / 1024))

In [65]:
for dtype in ('float64', 'int64', 'category'):
    display_memory_by_dtype(dtype)

Total memory usage for float64 columns: 1426.4427 MB
Total memory usage for int64 columns: 98.5373 MB
Total memory usage for category columns: 3.8260 MB
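As an alternative to calling the function once per dtype, the same breakdown can be obtained in one pass by grouping the per-column memory usage by dtype name. This is only a sketch on a toy frame (`toy` and its columns are hypothetical). Unlike the function above, it excludes the index from the totals.

```python
import pandas as pd

# Toy frame with two int64 columns and one float64 column.
toy = pd.DataFrame({
    "id": pd.Series([1, 2, 3, 4], dtype="int64"),
    "flag": pd.Series([0, 1, 0, 1], dtype="int64"),
    "amount": pd.Series([1.5, 2.5, 3.5, 4.5], dtype="float64"),
})

# Bytes per column (index excluded), summed per dtype name.
usage = toy.memory_usage(deep=True, index=False)
by_dtype = usage.groupby(toy.dtypes.astype(str)).sum()
print(by_dtype)
```

Dividing `by_dtype` by `1024 ** 2` gives the figures in MB, as in the output above.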


## Downcast binary values

Let's get all the features having only 0/1 values and convert them to uint8.

In [66]:
binary_features = data_transformed.columns[data_transformed.isin([0, 1]).all()]
binary_features.size

35

In [15]:
binary_features

Index(['TARGET', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL',
       'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'REG_REGION_NOT_LIVE_REGION',
       'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
       'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY',
       'LIVE_CITY_NOT_WORK_CITY', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3',
       'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
       'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9',
       'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
       'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15',
       'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
       'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'],
      dtype='object')
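The selection above relies on `isin([0, 1]).all()`. The following sketch, on a hypothetical toy frame, illustrates what this test keeps and what it rejects: binary values stored as float are accepted, while a column that contains NaN is not, since NaN is neither 0 nor 1.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "flag_int": [0, 1, 1, 0],              # binary, stored as int64
    "flag_float": [0.0, 1.0, 0.0, 1.0],    # binary, stored as float64
    "flag_nan": [0.0, 1.0, np.nan, 1.0],   # binary apart from one NaN
    "count": [0, 1, 2, 3],                 # not binary
})

# True for a column only if every value is 0 or 1.
is_binary = toy.isin([0, 1]).all()
binary_cols = toy.columns[is_binary]
print(list(binary_cols))
```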

In [16]:
data_transformed[binary_features].info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 35 columns):
 #   Column                       Non-Null Count   Dtype
---  ------                       --------------   -----
 0   TARGET                       307511 non-null  int64
 1   FLAG_OWN_CAR                 307511 non-null  int64
 2   FLAG_OWN_REALTY              307511 non-null  int64
 3   FLAG_MOBIL                   307511 non-null  int64
 4   FLAG_EMP_PHONE               307511 non-null  int64
 5   FLAG_WORK_PHONE              307511 non-null  int64
 6   FLAG_CONT_MOBILE             307511 non-null  int64
 7   FLAG_PHONE                   307511 non-null  int64
 8   FLAG_EMAIL                   307511 non-null  int64
 9   REG_REGION_NOT_LIVE_REGION   307511 non-null  int64
 10  REG_REGION_NOT_WORK_REGION   307511 non-null  int64
 11  LIVE_REGION_NOT_WORK_REGION  307511 non-null  int64
 12  REG_CITY_NOT_LIVE_CITY       307511 non-null  int64
 13  REG_CITY_NOT_WORK_CITY       

In [18]:
data_transformed[binary_features].memory_usage()

Index                              128
TARGET                         2460088
FLAG_OWN_CAR                   2460088
FLAG_OWN_REALTY                2460088
FLAG_MOBIL                     2460088
FLAG_EMP_PHONE                 2460088
FLAG_WORK_PHONE                2460088
FLAG_CONT_MOBILE               2460088
FLAG_PHONE                     2460088
FLAG_EMAIL                     2460088
REG_REGION_NOT_LIVE_REGION     2460088
REG_REGION_NOT_WORK_REGION     2460088
LIVE_REGION_NOT_WORK_REGION    2460088
REG_CITY_NOT_LIVE_CITY         2460088
REG_CITY_NOT_WORK_CITY         2460088
LIVE_CITY_NOT_WORK_CITY        2460088
FLAG_DOCUMENT_2                2460088
FLAG_DOCUMENT_3                2460088
FLAG_DOCUMENT_4                2460088
FLAG_DOCUMENT_5                2460088
FLAG_DOCUMENT_6                2460088
FLAG_DOCUMENT_7                2460088
FLAG_DOCUMENT_8                2460088
FLAG_DOCUMENT_9                2460088
FLAG_DOCUMENT_10               2460088
FLAG_DOCUMENT_11         

The `memory_usage` method returns the memory usage of each column in bytes. Each binary column takes about 2.46 MB (307511 values × 8 bytes).

Let's convert them to uint8.

In [35]:
for feature in binary_features:
    data_transformed[feature] = data_transformed[feature].astype("uint8")

data_transformed[binary_features].memory_usage(deep=True)

Index                             128
TARGET                         307511
FLAG_OWN_CAR                   307511
FLAG_OWN_REALTY                307511
FLAG_MOBIL                     307511
FLAG_EMP_PHONE                 307511
FLAG_WORK_PHONE                307511
FLAG_CONT_MOBILE               307511
FLAG_PHONE                     307511
FLAG_EMAIL                     307511
REG_REGION_NOT_LIVE_REGION     307511
REG_REGION_NOT_WORK_REGION     307511
LIVE_REGION_NOT_WORK_REGION    307511
REG_CITY_NOT_LIVE_CITY         307511
REG_CITY_NOT_WORK_CITY         307511
LIVE_CITY_NOT_WORK_CITY        307511
FLAG_DOCUMENT_2                307511
FLAG_DOCUMENT_3                307511
FLAG_DOCUMENT_4                307511
FLAG_DOCUMENT_5                307511
FLAG_DOCUMENT_6                307511
FLAG_DOCUMENT_7                307511
FLAG_DOCUMENT_8                307511
FLAG_DOCUMENT_9                307511
FLAG_DOCUMENT_10               307511
FLAG_DOCUMENT_11               307511
FLAG_DOCUMEN

In [36]:
data_transformed[binary_features].info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 35 columns):
 #   Column                       Non-Null Count   Dtype
---  ------                       --------------   -----
 0   TARGET                       307511 non-null  uint8
 1   FLAG_OWN_CAR                 307511 non-null  uint8
 2   FLAG_OWN_REALTY              307511 non-null  uint8
 3   FLAG_MOBIL                   307511 non-null  uint8
 4   FLAG_EMP_PHONE               307511 non-null  uint8
 5   FLAG_WORK_PHONE              307511 non-null  uint8
 6   FLAG_CONT_MOBILE             307511 non-null  uint8
 7   FLAG_PHONE                   307511 non-null  uint8
 8   FLAG_EMAIL                   307511 non-null  uint8
 9   REG_REGION_NOT_LIVE_REGION   307511 non-null  uint8
 10  REG_REGION_NOT_WORK_REGION   307511 non-null  uint8
 11  LIVE_REGION_NOT_WORK_REGION  307511 non-null  uint8
 12  REG_CITY_NOT_LIVE_CITY       307511 non-null  uint8
 13  REG_CITY_NOT_WORK_CITY       

The memory was reduced from 82.1 MB to 10.3 MB.
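These two figures can be checked by hand from the row count, the number of binary columns and the size of each type. A short worked calculation, using the same MB convention (1024 × 1024 bytes) as the rest of the notebook:

```python
n_rows = 307511          # rows in data_transformed
n_binary_columns = 35    # size of binary_features
MB = 1024 ** 2

before_mb = n_rows * n_binary_columns * 8 / MB   # int64: 8 bytes per value
after_mb = n_rows * n_binary_columns * 1 / MB    # uint8: 1 byte per value
print(f"{before_mb:.1f} MB -> {after_mb:.1f} MB")  # 82.1 MB -> 10.3 MB
```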

## Downcast integer values

Most of the features are of type int64 or float64. We'll use the pandas `.to_numeric()` function with the `downcast` argument, which automatically chooses the smallest type that can hold the values of a given feature.

In [25]:
int_features = data_transformed.select_dtypes(['int64']).columns
int_features.size

7

In [26]:
data_transformed[int_features].info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 7 columns):
 #   Column                       Non-Null Count   Dtype
---  ------                       --------------   -----
 0   SK_ID_CURR                   307511 non-null  int64
 1   CNT_CHILDREN                 307511 non-null  int64
 2   DAYS_BIRTH                   307511 non-null  int64
 3   DAYS_ID_PUBLISH              307511 non-null  int64
 4   REGION_RATING_CLIENT         307511 non-null  int64
 5   REGION_RATING_CLIENT_W_CITY  307511 non-null  int64
 6   HOUR_APPR_PROCESS_START      307511 non-null  int64
dtypes: int64(7)
memory usage: 16.4 MB


In [45]:
for feature in int_features:
    data_transformed[feature] = pd.to_numeric(data_transformed[feature], downcast='integer')

data_transformed[int_features].memory_usage(deep=True)

Index                              128
SK_ID_CURR                     1230044
CNT_CHILDREN                    307511
DAYS_BIRTH                      615022
DAYS_ID_PUBLISH                 615022
REGION_RATING_CLIENT            307511
REGION_RATING_CLIENT_W_CITY     307511
HOUR_APPR_PROCESS_START         307511
dtype: int64

In [56]:
data_transformed[int_features].info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 7 columns):
 #   Column                       Non-Null Count   Dtype
---  ------                       --------------   -----
 0   SK_ID_CURR                   307511 non-null  int32
 1   CNT_CHILDREN                 307511 non-null  int8 
 2   DAYS_BIRTH                   307511 non-null  int16
 3   DAYS_ID_PUBLISH              307511 non-null  int16
 4   REGION_RATING_CLIENT         307511 non-null  int8 
 5   REGION_RATING_CLIENT_W_CITY  307511 non-null  int8 
 6   HOUR_APPR_PROCESS_START      307511 non-null  int8 
dtypes: int16(2), int32(1), int8(4)
memory usage: 3.5 MB


Downcasting the integer values reduced the memory taken by these columns from 16.4 MB to 3.5 MB.
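To see how `downcast='integer'` picks a type from the range of the values, here is a small sketch on a toy frame (the column names and values are invented for illustration). Each column ends up in the smallest signed integer type that can hold all of its values.

```python
import pandas as pd

toy = pd.DataFrame({
    "small": [0, 5, 19],                 # fits in int8  (-128..127)
    "medium": [-25000, -100, 0],         # fits in int16 (-32768..32767)
    "large": [100002, 250000, 456255],   # too large for int16, fits in int32
}, dtype="int64")

# Same loop as in the notebook, applied to the toy columns.
for col in toy.columns:
    toy[col] = pd.to_numeric(toy[col], downcast="integer")

print(toy.dtypes)
```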

## Downcast float values

In [47]:
float_features = data_transformed.select_dtypes(['float64']).columns
float_features.size

608

In [57]:
data_transformed[float_features].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 608 entries, AMT_INCOME_TOTAL to CC_COUNT
dtypes: float64(608)
memory usage: 1.4 GB


In [58]:
for feature in float_features:
    data_transformed[feature] = pd.to_numeric(data_transformed[feature], downcast='float')

data_transformed[float_features].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 608 entries, AMT_INCOME_TOTAL to CC_COUNT
dtypes: float32(448), float64(160)
memory usage: 900.9 MB
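The notebook does not say why 160 columns remain in float64 after the downcast. Independently of that, it is worth keeping in mind that float32 carries fewer significant digits than float64, so narrowing a float column can change its values. The sketch below uses plain NumPy and a value chosen for illustration (2**24 + 1, the first integer that float32 cannot represent exactly).

```python
import numpy as np

# float32 has a 24-bit significand: 2**24 + 1 is not representable
# and is rounded to the nearest representable value.
value = 16_777_217.0               # 2**24 + 1, exact in float64
as_float32 = np.float32(value)
print(float(as_float32))           # 16777216.0
print(float(as_float32) == value)  # False

# Machine epsilon: gap between 1.0 and the next representable number.
print(np.finfo(np.float32).eps)
print(np.finfo(np.float64).eps)
```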


## Final result

In [59]:
data_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 663 entries, SK_ID_CURR to CC_COUNT
dtypes: category(13), float32(448), float64(160), int16(2), int32(1), int8(4), uint8(35)
memory usage: 918.5 MB


We went from 1.5 GB to 918.5 MB, which is a memory reduction of almost 40%.
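The three steps of this section could be gathered into a single helper. The function below is a sketch under assumptions, not code from the project: the name `downcast_frame` is hypothetical, it works on a copy, and it restricts the 0/1 detection to numeric columns so that category columns are left alone.

```python
import numpy as np
import pandas as pd


def downcast_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns stored in smaller dtypes."""
    out = df.copy()

    # 1. Numeric columns holding only 0/1 (and no NaN) become uint8.
    numeric = out.select_dtypes("number")
    binary_cols = numeric.columns[numeric.isin([0, 1]).all()]
    for col in binary_cols:
        out[col] = out[col].astype("uint8")

    # 2. Remaining int64 columns: smallest signed integer type that fits.
    for col in out.select_dtypes("int64").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")

    # 3. float64 columns: float32 where pandas accepts the downcast.
    for col in out.select_dtypes("float64").columns:
        out[col] = pd.to_numeric(out[col], downcast="float")

    return out


# Toy data standing in for the real table.
toy = pd.DataFrame({
    "flag": [0, 1, 1, 0],
    "days": [-9461, -16765, -19046, -19005],
    "amount": [202500.0, 270000.0, 67500.0, 135000.0],
    "ratio": [0.5, np.nan, 0.25, 0.125],
})

small = downcast_frame(toy)
print(small.dtypes)

before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

Working on a copy keeps the original frame available for comparison. On a table as large as ours, modifying the columns in place, as done in the cells above, avoids holding two copies in memory.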