# <center> **Home Credit Default Risk Assessment**
# <center> **Installments Payments Dataset**

# **Introduction**

In this part of the project, I aggregate the installments_payments table to one row per client (SK_ID_CURR) so that the resulting features can be merged with the main application_train table at a later stage. The new features are expected to carry predictive signal for default risk.

# **Libraries**

In [1]:
import pandas as pd
import numpy as np

from feature_engine.imputation import ArbitraryNumberImputer

import functions
import importlib
importlib.reload(functions)

import warnings

# **Display**

In [2]:
%matplotlib inline

# Show up to 200 rows, 999 columns and 500 characters per cell.
pd.options.display.max_rows = 200
pd.options.display.max_columns = 999
pd.options.display.max_colwidth = 500

# Silence all warnings, FutureWarning included.
warnings.filterwarnings("ignore")

size = 20

# **Data**

## **Load Data**

In [3]:
install = pd.read_csv(
    r"C:\Users\Dell\Documents\AI\Risk\Data\installments_payments.csv",
    index_col=False)

## **Reduce Memory Usage**

Downcasting column datatypes to smaller representations to reduce memory usage.

In [5]:
install = functions.reduce_memory_usage(install)

Memory usage of dataframe is 830.41 MB
Memory usage after optimization is: 311.40 MB
Decreased by 62.5%
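
The `functions.reduce_memory_usage` helper is not shown in this notebook. As a rough illustration of the idea only, the sketch below downcasts numeric columns with `pd.to_numeric`. The function name `downcast_numeric` and the toy values are assumptions made for this example. Note that `pd.to_numeric` never goes below float32, whereas the missing-values report later in this notebook lists a float16 column, so the project's helper evidently downcasts more aggressively than this sketch.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns downcast to smaller dtypes.

    Hypothetical stand-in for functions.reduce_memory_usage, not its source.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Toy data with made-up values, stored in 64-bit dtypes.
toy = pd.DataFrame({
    "SK_ID_PREV": np.array([1054186, 1330831, 2085231], dtype="int64"),
    "AMT_PAYMENT": np.array([6948.5, 1716.5, 25425.0], dtype="float64"),
})

small = downcast_numeric(toy)
before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(f"{before} bytes before, {after} bytes after")
```

Downcasting trades precision for memory, so it is worth checking that the value ranges of the columns still fit the smaller types.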


## **Remove Infinity Values**

Replacing inf and -inf values with NaN so that they are treated as missing values.

In [6]:
install.replace([np.inf, -np.inf], np.nan, inplace=True)

## **Missing Values**

In [7]:
functions.MissingValues(install)

| Column | NumberMissing | PercentageMissing | DataType |
|---|---|---|---|
| DAYS_ENTRY_PAYMENT | 2905 | 0.02 | float16 |
| AMT_PAYMENT | 2905 | 0.02 | float32 |
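
The `functions.MissingValues` helper is defined elsewhere in the project. A report with the same three columns could be produced roughly as follows. The function name `missing_summary` and the toy data are assumptions, and the sketch assumes that PercentageMissing is expressed in percent.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing-value report (hypothetical stand-in for functions.MissingValues)."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "NumberMissing": n_missing,
        "PercentageMissing": (n_missing / len(df) * 100).round(2),
        "DataType": df.dtypes.astype(str),
    })
    # Keep only columns that actually contain missing values.
    summary = summary[summary["NumberMissing"] > 0]
    return summary.sort_values("NumberMissing", ascending=False)


# Toy data with made-up values: one of four payments is missing.
toy = pd.DataFrame({
    "AMT_INSTALMENT": [100.0, 200.0, 300.0, 400.0],
    "AMT_PAYMENT": [100.0, np.nan, 300.0, 400.0],
})

report = missing_summary(toy)
print(report)
```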


## **Imputation**

Missing numerical values are imputed with an arbitrary sentinel number (-99999). This is a deliberate choice intended to avoid introducing new patterns in the data: the sentinel marks an entry as missing instead of estimating a plausible value for it.

In [8]:
ani = ArbitraryNumberImputer(arbitrary_number=-99999)
ani.fit(install)
install = ani.transform(install)
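
One point worth checking: the aggregation in the next section runs on the imputed frame, so the sentinel takes part in the sum, mean and min of AMT_PAYMENT for any client with a missing payment. The toy sketch below illustrates the effect on a group mean. It uses `fillna` as a rough plain-pandas equivalent of the imputer, and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

SENTINEL = -99999  # same arbitrary number as in the cell above

# Toy data: client 1 has one missing payment, client 2 has none.
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2],
    "AMT_PAYMENT": [100.0, np.nan, 50.0, 70.0],
})

# Fill the missing payment with the sentinel (assumed equivalent of the imputer step).
filled = toy.fillna({"AMT_PAYMENT": SENTINEL})

# The sentinel is averaged like any other number ...
mean_with_sentinel = filled.groupby("SK_ID_CURR")["AMT_PAYMENT"].mean()
# ... whereas pandas skips NaN by default when aggregating.
mean_skipping_nan = toy.groupby("SK_ID_CURR")["AMT_PAYMENT"].mean()

print(mean_with_sentinel)
print(mean_skipping_nan)
```

If the sentinel is meant only as a missing-value marker, a possible alternative is to aggregate before imputing, or to add a separate missing-indicator column. Whether that suits this project depends on how the downstream model treats the sentinel.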

## **Aggregation**

Aggregating the installment records per client (SK_ID_CURR) and deriving new features from the aggregates.

In [9]:
# One row per client: aggregate all installment records by SK_ID_CURR.
install = install.groupby('SK_ID_CURR').agg({
    # Note: 'count' tallies installment rows, not distinct previous loans
    # (that would be 'nunique').
    'SK_ID_PREV': 'count',
    'AMT_INSTALMENT': ['sum', 'mean'],
    'AMT_PAYMENT': ['sum', 'mean', 'max', 'min'],
}).reset_index()

# Ratio of total paid to total due, and gap between average paid and average due.
install['SUM_AMT_PAYMENT/SUM_AMT_INSTALMENT'] = install[('AMT_PAYMENT', 'sum')] / install[('AMT_INSTALMENT', 'sum')]
install['MEAN_AMT_PAYMENT-MEAN_AMT_INSTALMENT'] = install[('AMT_PAYMENT', 'mean')] - install[('AMT_INSTALMENT', 'mean')]

# Flatten the two-level column index. Columns with an empty second level
# (e.g. SK_ID_CURR) would otherwise end in a trailing underscore.
install.columns = ['_'.join(col).rstrip('_') if isinstance(col, tuple) else col for col in install.columns]

install = install.rename(columns={
    'SK_ID_PREV_count': 'NUM_PREVIOUS_APPLICATIONS',
    'AMT_INSTALMENT_sum': 'SUM_AMT_INSTALMENT',
    'AMT_INSTALMENT_mean': 'AVG_AMT_INSTALMENT',
    'AMT_PAYMENT_sum': 'SUM_AMT_PAYMENT',
    'AMT_PAYMENT_mean': 'AVG_AMT_PAYMENT',
    'AMT_PAYMENT_max': 'MAX_AMT_PAYMENT',
    'AMT_PAYMENT_min': 'MIN_AMT_PAYMENT',
})
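
To make the resulting features concrete, here is a toy run of the same idea on invented data. It is a sketch rather than the notebook's exact code: it uses pandas named aggregation, which produces flat column names directly, and it covers only a subset of the features.

```python
import pandas as pd

# Toy data: client 1 has two installments on one previous loan, client 2 has one.
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "SK_ID_PREV": [10, 10, 20],
    "AMT_INSTALMENT": [100.0, 100.0, 200.0],
    "AMT_PAYMENT": [100.0, 50.0, 200.0],
})

# Named aggregation: output_name=(input_column, aggregation).
agg = toy.groupby("SK_ID_CURR").agg(
    NUM_PREVIOUS_APPLICATIONS=("SK_ID_PREV", "count"),
    SUM_AMT_INSTALMENT=("AMT_INSTALMENT", "sum"),
    AVG_AMT_INSTALMENT=("AMT_INSTALMENT", "mean"),
    SUM_AMT_PAYMENT=("AMT_PAYMENT", "sum"),
    AVG_AMT_PAYMENT=("AMT_PAYMENT", "mean"),
).reset_index()

# Derived features, named as in the notebook.
agg["SUM_AMT_PAYMENT/SUM_AMT_INSTALMENT"] = agg["SUM_AMT_PAYMENT"] / agg["SUM_AMT_INSTALMENT"]
agg["MEAN_AMT_PAYMENT-MEAN_AMT_INSTALMENT"] = agg["AVG_AMT_PAYMENT"] - agg["AVG_AMT_INSTALMENT"]

# For comparison: distinct previous loans per client.
distinct_loans = toy.groupby("SK_ID_CURR")["SK_ID_PREV"].nunique()

print(agg)
print(distinct_loans)
```

In this toy frame client 1 paid 150 of 200 due, giving a ratio of 0.75 and a mean gap of -25, while client 2 paid in full. The count for client 1 is 2 although both rows belong to the same previous loan, which is why `nunique` may match the name NUM_PREVIOUS_APPLICATIONS more closely.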

# **Save Dataframe as CSV File**

The aggregated dataframe is saved as a CSV file for use in later parts of this project.

In [11]:
install.to_csv(r"C:\Users\Dell\Documents\AI\Risk\Data\Data\install 26.csv", index=False)

# **Summary**

> * **Data Cleaning**: I replaced infinity values with NaN and imputed the missing values with an arbitrary number.
> * **installments_payments Table**: I used domain knowledge gained from research to aggregate the installments_payments table per client and create new features from it.
> * **Merge**: I will merge this new table with the main application_train table in Notebook 10.0.
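
The merge itself happens in a later notebook and is not shown here. As a hedged sketch of what a join on SK_ID_CURR might look like, the example below uses two tiny placeholder frames. The TARGET column, the values, and the choice of a left join are assumptions for illustration, not the project's confirmed approach.

```python
import pandas as pd

# Placeholder stand-ins for application_train and the aggregated installments table.
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
inst = pd.DataFrame({"SK_ID_CURR": [1, 2], "SUM_AMT_PAYMENT": [150.0, 200.0]})

# A left join keeps every application; validate= raises an error if either
# side has duplicate SK_ID_CURR keys, which guards against row multiplication.
merged = app.merge(inst, on="SK_ID_CURR", how="left", validate="one_to_one")

print(merged)
```

Clients without any installment history (client 3 here) end up with NaN in the aggregated columns, so the merged table may need its own missing-value handling.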