# Create input dataset

## Rationale:
We need to generate features at the space level (one row per `space_column` value) so that we can feed them into an ML model.

## Methodology:
We preprocess each data source and engineer features from it, then merge all the resulting feature sets into a single dataset for training. The sources are:

1. Bureau data
2. Credit card features
3. Installments features
4. POS cash features
5. Previous application features
6. Main application features
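
As a rough illustration of the pattern the per-source steps appear to follow (each output further down has one row per `sk_id_curr`, with nulls for applicants absent from a source), here is a minimal sketch on toy data. The column `amt_overdue` and the helper `make_toy_bureau_features` are hypothetical; the real feature logic lives in the `preprocess` package and is not shown in this notebook.

```python
import pandas as pd


def make_toy_bureau_features(bureau_df: pd.DataFrame, ids: pd.Series) -> pd.DataFrame:
    """Aggregate a toy bureau table to one row per applicant (hypothetical helper)."""
    agg = (
        bureau_df.groupby("sk_id_curr")["amt_overdue"]
        .agg(["max", "min", "mean"])
        # prefix the feature names with the source, as in the real outputs
        .rename(columns=lambda c: f"bureau__{c}_amt_overdue")
        .reset_index()
    )
    # Left-join onto every applicant id so applicants without bureau
    # records are kept, with NaN in the feature columns.
    return pd.DataFrame({"sk_id_curr": ids}).merge(agg, on="sk_id_curr", how="left")


toy_bureau = pd.DataFrame({"sk_id_curr": [1, 1, 2], "amt_overdue": [0.0, 50.0, 10.0]})
toy_ids = pd.Series([1, 2, 3], name="sk_id_curr")
toy_features = make_toy_bureau_features(toy_bureau, toy_ids)
print(toy_features)
```

Applicant `3` has no bureau rows, so its feature columns are NaN. That would explain non-null counts lower than the row count in the real feature sets.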

## Conclusions:
We end up with a dataset of ``1,149`` columns (the `sk_id_curr` key, the `target`, and the engineered features) that will be used to train the model.

In [15]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd

from functools import reduce

import sys
sys.path.append("../")

# local imports
from utils.functions__utils import label_encoder
from utils.features_lists import string_columns
from preprocess.bureau.make_features import main as make_bureau_features
from preprocess.credit_card_balance.make_features import main as make_credit_card_features
from preprocess.installments.make_features import main as make_installments_features
from preprocess.pos_cash.make_features import main as make_pos_features
from preprocess.previous_application.make_features import main as make_previous_application_features
from preprocess.main_application.make_features import main as make_application_features

from src.learner_params import space_column



In [16]:
%%time
# load the datasets
application_train_df = pd.read_csv("../data/application_train.csv")
application_test_df = pd.read_csv("../data/application_test.csv")

application_df = pd.concat([application_train_df, application_test_df], ignore_index=True)

bureau_balance_df = pd.read_csv("../data/bureau_balance.csv")
bureau_df = pd.read_csv("../data/bureau.csv")
credit_card_balance_df = pd.read_csv("../data/credit_card_balance.csv")
installments_payments_df = pd.read_csv("../data/installments_payments.csv")
pos_cash_balance_df = pd.read_csv("../data/POS_CASH_balance.csv")
previous_application_df = pd.read_csv("../data/previous_application.csv")

CPU times: user 15.5 s, sys: 4.09 s, total: 19.6 s
Wall time: 22.1 s


In [3]:
bureau_features_df = make_bureau_features(bureau_df, bureau_balance_df, application_train_df, application_test_df)

2023-09-22T09:10:33 | INFO | Preprocessing dataframe...
2023-09-22T09:10:53 | INFO | Training a base learner...


Score on test set for fold 1 is :0.612
Score on test set for fold 2 is :0.607
Score on test set for fold 3 is :0.611


2023-09-22T09:12:55 | INFO | Creating features...
2023-09-22T09:22:16 | INFO | Successfully created featureset of length: 356255 in: 11.73 minutes


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Data columns (total 110 columns):
 #    Column                                          Non-Null Count   Dtype  
---   ------                                          --------------   -----  
 0    sk_id_curr                                      356255 non-null  int64  
 1    bureau__total_accounts                          305811 non-null  float64
 2    bureau__active_accounts                         305811 non-null  float64
 3    bureau__closed_accounts                         305811 non-null  float64
 4    bureau__max_amt_overdue                         212971 non-null  float64
 5    bureau__min_amt_overdue                         212971 non-null  float64
 6    bureau__mean_amt_overdue                        212971 non-null  float64
 7    bureau__std_amt_overdue                         136569 non-null  float64
 8    bureau__mean_amt_credit_debt                    297439 non-null  float64
 9    bureau__max_a

In [4]:
credit_card_features_df = make_credit_card_features(credit_card_balance_df, application_train_df, application_test_df)

2023-09-22T09:22:17 | INFO | Preprocessing dataframe...
2023-09-22T09:22:18 | INFO | Training a base learner...


Score on test set for fold 1 is :0.539
Score on test set for fold 2 is :0.554
Score on test set for fold 3 is :0.639


2023-09-22T09:25:41 | INFO | Creating features...
2023-09-22T09:27:15 | INFO | Successfully created featureset of length: 356255 in: 4.98 minutes


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Data columns (total 145 columns):
 #    Column                                                           Non-Null Count   Dtype  
---   ------                                                           --------------   -----  
 0    sk_id_curr                                                       356255 non-null  int64  
 1    credit_card__mean_credit_cards_months                            103558 non-null  float64
 2    credit_card__max_credit_cards_months                             103558 non-null  float64
 3    credit_card__total_credit_cards_amt_balance                      103558 non-null  float64
 4    credit_card__max_credit_cards_amt_balance                        103558 non-null  float64
 5    credit_card__mean_credit_cards_amt_balance                       103558 non-null  float64
 6    credit_card__mean_trend_credit_cards_amt_balance                 103558 non-null  float64
 7    credit_card__max_c

In [5]:
installments_features_df = make_installments_features(installments_payments_df, application_df)

2023-09-22T09:27:15 | INFO | Preprocessing dataframe...
2023-09-22T09:28:09 | INFO | Training a base learner...


Score on test set for fold 1 is :0.604
Score on test set for fold 2 is :0.604
Score on test set for fold 3 is :0.608


2023-09-22T09:40:15 | INFO | Creating features...
2023-09-22T09:42:59 | INFO | Successfully created featureset of length: 356255 in: 15.73 minutes


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Data columns (total 75 columns):
 #   Column                                                 Non-Null Count   Dtype  
---  ------                                                 --------------   -----  
 0   sk_id_curr                                             356255 non-null  int64  
 1   installments__number_installments                      339587 non-null  float64
 2   installments__amt_installments_max_amt                 339587 non-null  float64
 3   installments__amt_installments_min_amt                 339587 non-null  float64
 4   installments__amt_installments_total_amt               339587 non-null  float64
 5   installments__amt_installments_mean_amt                339587 non-null  float64
 6   installments__amt_installments_trend                   339587 non-null  float64
 7   installments__amt_pay_installments_max_amt             339587 non-null  float64
 8   installments__amt_pay_installments

In [6]:
pos_features_df = make_pos_features(pos_cash_balance_df, application_train_df, application_test_df)

2023-09-22T09:42:59 | INFO | Preprocessing dataframe...
2023-09-22T09:42:59 | INFO | Training a base learner...


Score on test set for fold 1 is :0.536
Score on test set for fold 2 is :0.536
Score on test set for fold 3 is :0.527


2023-09-22T09:50:00 | INFO | Creating features...
2023-09-22T09:59:17 | INFO | Successfully created featureset of length: 356255 in: 16.29 minutes


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Data columns (total 58 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   sk_id_curr                                  356255 non-null  int64  
 1   pos_cash__cnt_installments_pos_mean         337252 non-null  float64
 2   pos_cash__cnt_installments_pos_max          337252 non-null  float64
 3   pos_cash__cnt_installments_pos_total        337252 non-null  float64
 4   pos_cash__cnt_installments_pos_trend        337252 non-null  float64
 5   pos_cash__cnt_installments_fut_pos_mean     337252 non-null  float64
 6   pos_cash__cnt_installments_fut_pos_max      337252 non-null  float64
 7   pos_cash__cnt_installments_fut_pos_total    337252 non-null  float64
 8   pos_cash__cnt_installments_fut_pos_trend    337252 non-null  float64
 9   pos_cash__cnt_installments_pos_minus_pos_f  337252 non-null  float64
 

In [17]:
previous_application_features_df = make_previous_application_features(previous_application_df, application_train_df, application_test_df)

2023-09-22T10:25:46 | INFO | Preprocessing dataframe...
2023-09-22T10:25:46 | INFO | Training a base learner...


Score on test set for fold 1 is :0.619
Score on test set for fold 2 is :0.621
Score on test set for fold 3 is :0.622


2023-09-22T10:28:36 | INFO | Creating features...
2023-09-22T10:30:42 | INFO | Successfully created featureset of length: 356255 in: 4.94 minutes


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Data columns (total 108 columns):
 #    Column                                                                 Non-Null Count   Dtype  
---   ------                                                                 --------------   -----  
 0    sk_id_curr                                                             356255 non-null  int64  
 1    previous_application__sum_prev_applications                            338857 non-null  float64
 2    previous_application__mean_amt_prev_applications                       338857 non-null  float64
 3    previous_application__mean_amt_cred_prev_applications                  338857 non-null  float64
 4    previous_application__mean_amt_annuity_prev_applications               338377 non-null  float64
 5    previous_application__sum_amt_downpayment_prev_applications            338857 non-null  float64
 6    previous_application__mean_amt_goodsprice_prev_applications       

In [11]:
application_features_df = make_application_features(application_train_df, application_test_df)

2023-09-22T10:16:13 | INFO | Preprocessing dataframe...
2023-09-22T10:16:14 | INFO | Creating features...
2023-09-22T10:16:14 | INFO | Training a base learner...


Score on test set for fold 1 is :0.758
Score on test set for fold 2 is :0.755
Score on test set for fold 3 is :0.762


2023-09-22T10:17:22 | INFO | Successfully created featureset of length: 356255 in: 1.15 minutes


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Data columns (total 658 columns):
 #    Column                                                                        Non-Null Count   Dtype  
---   ------                                                                        --------------   -----  
 0    sk_id_curr                                                                    356255 non-null  int64  
 1    target                                                                        307511 non-null  float64
 2    main_application__name_contract_type                                          356255 non-null  object 
 3    main_application__code_gender                                                 356251 non-null  object 
 4    main_application__flag_own_car                                                356255 non-null  object 
 5    main_application__flag_own_realty                                             356255 non-null  object 
 6    main_appli

### Merge all the dataframes

In [18]:
ldf = [
    bureau_features_df,
    credit_card_features_df,
    installments_features_df,
    pos_features_df,
    previous_application_features_df,
    application_features_df,
]

final_df = reduce(lambda x, y: pd.merge(x, y, on=space_column, how="inner"), ldf)
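
An inner join silently drops any key that is missing from one of the frames, and duplicate keys would multiply rows. The sketch below, on toy frames, shows one possible way to guard the same `reduce` pattern with the `validate` argument of `pd.merge`. The toy frames and the `key` variable (standing in for `space_column`) are illustrative assumptions, not the notebook's data.

```python
from functools import reduce

import pandas as pd

key = "sk_id_curr"  # stands in for space_column (assumption)
frames = [
    pd.DataFrame({key: [1, 2, 3], "a__x": [0.1, 0.2, 0.3]}),
    pd.DataFrame({key: [1, 2, 3], "b__y": [1.0, None, 3.0]}),
    pd.DataFrame({key: [1, 2, 3], "c__z": [7, 8, 9]}),
]

# validate="one_to_one" makes pandas raise a MergeError if the key is
# duplicated on either side, instead of silently multiplying rows.
merged = reduce(
    lambda left, right: pd.merge(left, right, on=key, how="inner", validate="one_to_one"),
    frames,
)

# With an inner join, a row-count check catches keys dropped along the way.
assert len(merged) == len(frames[0])
print(merged.shape)
```

In this notebook every feature set reports 356255 rows, and so does the merged frame, so no rows appear to have been lost by the inner join.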

In [19]:
final_df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356255 entries, 0 to 356254
Data columns (total 1149 columns):
 #     Column                                                                        Non-Null Count   Dtype  
---    ------                                                                        --------------   -----  
 0     sk_id_curr                                                                    356255 non-null  int64  
 1     bureau__total_accounts                                                        305811 non-null  float64
 2     bureau__active_accounts                                                       305811 non-null  float64
 3     bureau__closed_accounts                                                       305811 non-null  float64
 4     bureau__max_amt_overdue                                                       212971 non-null  float64
 5     bureau__min_amt_overdue                                                       212971 non-null  float64
 6     

In [20]:
# replace +/- infinity with NaN (np.nan, since the np.NaN alias was removed in NumPy 2.0)
final_df = final_df.replace([np.inf, -np.inf], np.nan)
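
To make the effect of this step concrete, here is a small sketch on toy data. It assumes the infinities come from ratio-style features with a zero denominator, which is a guess about this pipeline; the columns `paid` and `due` are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"paid": [10.0, 5.0, -3.0], "due": [2.0, 0.0, 0.0]})
# Float division by zero yields +inf or -inf rather than raising.
toy["ratio"] = toy["paid"] / toy["due"]

# Both infinities can be replaced in a single call by passing a list.
cleaned = toy.replace([np.inf, -np.inf], np.nan)

n_inf_before = int(np.isinf(toy["ratio"]).sum())
n_nan_after = int(cleaned["ratio"].isna().sum())
print(n_inf_before, n_nan_after)
```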

In [21]:
# label-encode the string columns and re-attach them to the remaining columns by index
final_df = pd.merge(
    final_df.drop(string_columns, axis=1),
    final_df[string_columns].apply(label_encoder),
    left_index=True,
    right_index=True,
)
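
The implementation of `label_encoder` lives in `utils.functions__utils` and is not shown here. Because it is passed to `DataFrame.apply`, it presumably takes one column and returns an encoded column. The sketch below is one possible stand-in built on `pd.factorize`, not the author's function; `toy_label_encoder` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def toy_label_encoder(col: pd.Series) -> pd.Series:
    """Map each distinct value to an integer code and keep missing values as NaN."""
    codes, _ = pd.factorize(col)  # missing values receive the sentinel code -1
    return pd.Series(codes, index=col.index, name=col.name).replace(-1, np.nan)


toy = pd.DataFrame({"gender": ["F", "M", None, "F"], "amount": [1.0, 2.0, 3.0, 4.0]})
toy_string_columns = ["gender"]

# Encode the string columns, then put them back next to the untouched columns.
encoded = toy[toy_string_columns].apply(toy_label_encoder)
toy_encoded = pd.concat([toy.drop(columns=toy_string_columns), encoded], axis=1)
print(toy_encoded)
```

Keeping missing values as NaN, rather than as a code of their own, is a design choice in this sketch. The real helper may handle them differently.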

In [23]:
path = "../data/full_dataset_2023-09-19.pkl"
final_df.to_pickle(path)
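
Pickle keeps dtypes and missing values intact, which makes it convenient for handing the dataset to the training step. A quick round-trip check on a toy frame could look like the sketch below; the file name and columns are hypothetical. Pickle files should only be loaded from trusted sources.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"sk_id_curr": [1, 2], "feature": [0.5, None]})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_dataset.pkl")
    toy.to_pickle(toy_path)
    restored = pd.read_pickle(toy_path)

# Raises an AssertionError if values, dtypes, or index differ.
pd.testing.assert_frame_equal(toy, restored)
```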