- <a href='#1'>1. Problem Statement</a>
- <a href='#2'>2. Reading the data</a>
- <a href='#3'>3. Feature Engineering</a>
    - <a href='#3-1'>3.1 Creating new feature for bureau</a>
    - <a href='#3-2'>3.2 Function to count and normalize values of categorical variables</a>
- <a href='#4'>4. Grouping the data</a>
- <a href='#5'>5. Exploratory Data Analysis</a>
    - <a href='#5-1'>5.1 Analyzing Target Variable</a>
    - <a href='#5-2'>5.2 Visualizing basic info of the applicant</a>
    - <a href='#5-3'>5.3 Client accompanied by?</a>
- <a href='#6'>6. Merging the data</a>
- <a href='#7'>7. Combining Training and Testing data</a>
- <a href='#8'>8. Feature Engineering Continued</a>
    - <a href='#8_1'>8.1 Deleting features</a>
    - <a href='#8_2'>8.2 Handling Missing Values</a>
    - <a href='#8_3'>8.3 Scaling Numerical Features</a>
    - <a href='#8_3'>8.4 Converting into Categorical</a>
- <a href='#9'>9. Modelling</a>

> # <a id='1'>1. Problem Statement</a>

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

**Evaluation** - Area under the ROC curve
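As a rough illustration of the metric: the area under the ROC curve equals the probability that a randomly chosen positive case (here, a client with payment difficulties) receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it by brute force; the function name `roc_auc` and the labels and scores are made up for illustration and are not part of the competition code.

```python
def roc_auc(y_true, y_score):
    """Brute-force AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted scores (made up)
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Because only the ordering of the scores matters, any monotonic rescaling of the predictions leaves the metric unchanged.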

**Data** - the problem has 7 files.

* **application_train/application_test**: the main training and testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR.  
* **bureau**: All of the client's previous credits provided by other financial institutions that were reported to the Credit Bureau (for clients who have a loan in our sample). Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
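The one-to-many relationship described above is the reason the auxiliary tables have to be aggregated to one row per `SK_ID_CURR` before they are merged. A minimal sketch on made-up data (the frames `toy_app` and `toy_bureau` are hypothetical stand-ins, not the real files):

```python
import pandas as pd

# Toy stand-ins for application_train and bureau (values are made up)
toy_app = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})
toy_bureau = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                           'SK_ID_BUREAU': [10, 11, 12]})

# Each application appears once, but a client can appear many times in bureau
app_is_unique = toy_app['SK_ID_CURR'].is_unique
bureau_is_unique = toy_bureau['SK_ID_CURR'].is_unique

# Joining without aggregating first duplicates application rows
naive = toy_app.merge(toy_bureau, on='SK_ID_CURR', how='left')
print(app_is_unique, bureau_is_unique, len(naive))
```

Client 1 ends up with two rows in `naive`, which would distort any model trained on the joined table.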

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# import for plotting 
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

['POS_CASH_balance.csv', 'bureau_balance.csv', 'application_train.csv', 'previous_application.csv', 'installments_payments.csv', 'credit_card_balance.csv', 'sample_submission.csv', 'application_test.csv', 'bureau.csv']


>  # <a id='2'>2. Reading the Data</a>

In [2]:
app_train = pd.read_csv('../input/application_train.csv')
app_test = pd.read_csv('../input/application_test.csv')
bureau = pd.read_csv('../input/bureau.csv')
bureau_balance = pd.read_csv('../input/bureau_balance.csv')
pos_cash_balance = pd.read_csv('../input/POS_CASH_balance.csv')

previous_app = pd.read_csv('../input/previous_application.csv')
installments_payments = pd.read_csv('../input/installments_payments.csv')
credit_card_balance = pd.read_csv('../input/credit_card_balance.csv')

In [None]:
print(app_test.shape)

> # <a id='3'>3. Feature Engineering</a>

## <a id='3-1'>3.1 Creating new feature for bureau</a>

In [3]:
# Groupby the client id (SK_ID_CURR), count the number of previous loans, and rename the column
previous_loan_counts = bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'previous_loan_counts'})
previous_loan_counts.head()

Unnamed: 0,SK_ID_CURR,previous_loan_counts
0,100001,7
1,100002,8
2,100003,4
3,100004,2
4,100005,3


## <a id='3-2'>3.2 Function to count and normalize values of categorical variables </a>

In [4]:
def normalize_categorical(df, group_var, col_name):
    """Compute counts and normalized counts of every category of every
    categorical variable, for each observation of `group_var`.

    Parameters
    ----------
    df : DataFrame
        The DataFrame for which the counts are calculated.

    group_var : string
        The variable by which to group the dataframe. For each unique
        value of this variable, the final dataframe will have one row.

    col_name : string
        Prefix added to the front of column names to keep track of columns.
    """
    # One-hot encode the categorical (object) columns
    categorical = pd.get_dummies(df.select_dtypes('object'))

    # Make sure to put the identifying id on the column
    categorical[group_var] = df[group_var]

    # Group by the group var: the sum of a dummy is a count, the mean is a normalized count
    categorical = categorical.groupby(group_var).agg(['sum', 'mean'])

    # Flatten the (variable, stat) column index in its actual column order,
    # so that each name is guaranteed to line up with its data
    stat_names = {'sum': 'count', 'mean': 'count_norm'}
    categorical.columns = ['%s_%s_%s' % (col_name, var, stat_names[stat])
                           for var, stat in categorical.columns]

    return categorical
    

In [5]:
bureau_counts = normalize_categorical(bureau, group_var = 'SK_ID_CURR', col_name = 'bureau')
bureau_counts.head()

SK_ID_CURR,bureau_CREDIT_ACTIVE_Active_count,bureau_CREDIT_ACTIVE_Active_count_norm,bureau_CREDIT_ACTIVE_Bad debt_count,bureau_CREDIT_ACTIVE_Bad debt_count_norm,bureau_CREDIT_ACTIVE_Closed_count,bureau_CREDIT_ACTIVE_Closed_count_norm,bureau_CREDIT_ACTIVE_Sold_count,bureau_CREDIT_ACTIVE_Sold_count_norm,bureau_CREDIT_CURRENCY_currency 1_count,bureau_CREDIT_CURRENCY_currency 1_count_norm,bureau_CREDIT_CURRENCY_currency 2_count,bureau_CREDIT_CURRENCY_currency 2_count_norm,bureau_CREDIT_CURRENCY_currency 3_count,bureau_CREDIT_CURRENCY_currency 3_count_norm,bureau_CREDIT_CURRENCY_currency 4_count,bureau_CREDIT_CURRENCY_currency 4_count_norm,bureau_CREDIT_TYPE_Another type of loan_count,bureau_CREDIT_TYPE_Another type of loan_count_norm,bureau_CREDIT_TYPE_Car loan_count,bureau_CREDIT_TYPE_Car loan_count_norm,bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count,bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count_norm,bureau_CREDIT_TYPE_Consumer credit_count,bureau_CREDIT_TYPE_Consumer credit_count_norm,bureau_CREDIT_TYPE_Credit card_count,bureau_CREDIT_TYPE_Credit card_count_norm,bureau_CREDIT_TYPE_Interbank credit_count,bureau_CREDIT_TYPE_Interbank credit_count_norm,bureau_CREDIT_TYPE_Loan for business development_count,bureau_CREDIT_TYPE_Loan for business development_count_norm,bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count,bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count_norm,bureau_CREDIT_TYPE_Loan for the purchase of equipment_count,bureau_CREDIT_TYPE_Loan for the purchase of equipment_count_norm,bureau_CREDIT_TYPE_Loan for working capital replenishment_count,bureau_CREDIT_TYPE_Loan for working capital replenishment_count_norm,bureau_CREDIT_TYPE_Microloan_count,bureau_CREDIT_TYPE_Microloan_count_norm,bureau_CREDIT_TYPE_Mobile operator loan_count,bureau_CREDIT_TYPE_Mobile operator loan_count_norm,bureau_CREDIT_TYPE_Mortgage_count,bureau_CREDIT_TYPE_Mortgage_count_norm,bureau_CREDIT_TYPE_Real estate loan_count,bureau_CREDIT_TYPE_Real estate loan_count_norm,bureau_CREDIT_TYPE_Unknown type of loan_count,bureau_CREDIT_TYPE_Unknown type of loan_count_norm
100001,3,0.428571,0,0.0,4,0.571429,0,0.0,7,1.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,7,1.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0
100002,2,0.25,0,0.0,6,0.75,0,0.0,8,1.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,4,0.5,4,0.5,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0
100003,1,0.25,0,0.0,3,0.75,0,0.0,4,1.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,2,0.5,2,0.5,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0
100004,0,0.0,0,0.0,2,1.0,0,0.0,2,1.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,2,1.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0
100005,2,0.666667,0,0.0,1,0.333333,0,0.0,3,1.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,2,0.666667,1,0.333333,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0,0,0.0
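To make the two statistics concrete: after one-hot encoding, the per-client sum of a dummy column is the number of credits in that category, and the per-client mean is that count divided by the client's total number of credits. A small sketch on made-up rows (the frame `toy` is hypothetical; `dtype=int` is passed only to keep the dummies numeric regardless of pandas version):

```python
import pandas as pd

# Client 1 has three previous credits, client 2 has one (made-up data)
toy = pd.DataFrame({'SK_ID_CURR': [1, 1, 1, 2],
                    'CREDIT_ACTIVE': ['Active', 'Closed', 'Closed', 'Active']})

# One-hot encode, then aggregate per client
dummies = pd.get_dummies(toy[['CREDIT_ACTIVE']], dtype=int)
dummies['SK_ID_CURR'] = toy['SK_ID_CURR']
stats = dummies.groupby('SK_ID_CURR').agg(['sum', 'mean'])

# For client 1: two of three credits are closed
closed_count = stats.loc[1, ('CREDIT_ACTIVE_Closed', 'sum')]
closed_share = stats.loc[1, ('CREDIT_ACTIVE_Closed', 'mean')]
print(closed_count, closed_share)
```

This matches the pattern in the output above, where for client 100001 the counts 3 and 4 correspond to the shares 0.428571 and 0.571429 of 7 credits.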


> # <a id='4'>4. Grouping the data</a>

In [6]:
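# Group each table by client (SK_ID_CURR) so that all the files can be merged into one dataset.
# numeric_only=True restricts the averages to numeric columns; recent pandas versions
# raise an error instead of silently dropping the text columns.

data_bureau_agg = bureau.groupby(by='SK_ID_CURR').mean(numeric_only=True)
data_credit_card_balance_agg = credit_card_balance.groupby(by='SK_ID_CURR').mean(numeric_only=True)
data_previous_application_agg = previous_app.groupby(by='SK_ID_CURR').mean(numeric_only=True)
data_installments_payments_agg = installments_payments.groupby(by='SK_ID_CURR').mean(numeric_only=True)
data_POS_CASH_balance_agg = pos_cash_balance.groupby(by='SK_ID_CURR').mean(numeric_only=True)

data_bureau_agg.head()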

SK_ID_CURR,SK_ID_BUREAU,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
100001,5896633.0,-735.0,0.0,82.428571,-825.5,,0.0,207623.571429,85240.928571,0.0,0.0,-93.142857,3545.357143
100002,6153272.125,-874.0,0.0,-349.0,-697.5,1681.029,0.0,108131.945625,49156.2,7997.14125,0.0,-499.875,0.0
100003,5885878.5,-1400.75,0.0,-544.5,-1097.333333,0.0,0.0,254350.125,0.0,202500.0,0.0,-816.0,
100004,6829133.5,-867.0,0.0,-488.5,-532.5,0.0,0.0,94518.9,0.0,0.0,0.0,-532.0,
100005,6735201.0,-190.666667,0.0,439.333333,-123.0,0.0,0.0,219042.0,189469.5,0.0,0.0,-54.333333,1420.5
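The grouping above keeps only the mean of each column. One possible extension, which this notebook does not use, is to compute several statistics at once and flatten the resulting two-level column index into prefixed names, so that the columns stay distinguishable after merging. A hedged sketch on made-up values (the `bureau_` prefix and the choice of statistics are assumptions):

```python
import pandas as pd

# Toy stand-in for bureau (values are made up)
toy_bureau = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                           'AMT_CREDIT_SUM': [100.0, 300.0, 50.0]})

# Several statistics per client, then flatten (variable, stat) into one name
agg = toy_bureau.groupby('SK_ID_CURR').agg(['mean', 'max', 'min'])
agg.columns = ['bureau_%s_%s' % (var, stat) for var, stat in agg.columns]
print(agg)
```

Explicit prefixes of this kind would also avoid relying on join suffixes to tell apart columns that share a name across tables.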


> # <a id='5'>5. Exploratory Data Analysis</a>

## <a id='5-2'>5.2  Visualizing basic info of the applicant </a>

In [None]:
# Plot the shares of gender, car ownership, number of children and realty ownership

plt.figure(1)
plt.subplot(221)
app_train['CODE_GENDER'].value_counts(normalize=True).plot.bar(figsize=(20,10), title= 'Gender')

plt.subplot(222)
app_train['FLAG_OWN_CAR'].value_counts(normalize=True).plot.bar(title= 'Own Car?')

plt.subplot(223)
app_train['CNT_CHILDREN'].value_counts(normalize=True).plot.bar(title= 'Count Children')

plt.subplot(224)
app_train['FLAG_OWN_REALTY'].value_counts(normalize=True).plot.bar(figsize=(24,6), title= 'Has Realty?')



plt.show()

# Inference
1. Most of the applicants were female, and most had no children.
2. Interestingly, most applicants owned real estate, while most did not own a car.
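The bar charts show each flag on its own. Checking how many applicants own real estate and no car at the same time requires the joint distribution of the two flags. A minimal sketch with `pd.crosstab` on made-up values (the frame `toy` is hypothetical; on the real data the same call would take the two `app_train` columns):

```python
import pandas as pd

# Made-up flags for four applicants
toy = pd.DataFrame({'FLAG_OWN_REALTY': ['Y', 'Y', 'Y', 'N'],
                    'FLAG_OWN_CAR':    ['N', 'N', 'Y', 'N']})

# Share of applicants in each (realty, car) combination
joint = pd.crosstab(toy['FLAG_OWN_REALTY'], toy['FLAG_OWN_CAR'], normalize='all')
realty_no_car = joint.loc['Y', 'N']
print(joint)
```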


## <a id='5-3'>5.3 Client accompanied by?</a>

In [None]:
plt.figure(2)

plt.subplot(321)
app_train['NAME_TYPE_SUITE'].value_counts(normalize=True).plot.bar(figsize=(20,20), title= 'Accompaniment')

plt.subplot(322)
app_train["NAME_CONTRACT_TYPE"].value_counts(normalize=True).plot.pie(figsize=(20,20), title='Loan Type')

plt.subplot(323)
app_train["NAME_FAMILY_STATUS"].value_counts(normalize=True).plot.pie(figsize=(20,20), title='Family status of applicants')

plt.subplot(324)
app_train["OCCUPATION_TYPE"].value_counts(normalize=True).plot.bar(figsize=(20,20), title='Occupation')
plt.show()

## <a id='5-4'>5.4 Loan is repaid or not?</a>

In Progress

> # <a id='6'>6. Merging the data</a>

In [7]:
def merge(df):
    df = df.join(data_bureau_agg, how='left', on='SK_ID_CURR', lsuffix='1', rsuffix='2') 
    df = df.join(bureau_counts, on = 'SK_ID_CURR', how = 'left')
    df = df.merge(previous_loan_counts, on = 'SK_ID_CURR', how = 'left')
    df = df.join(data_credit_card_balance_agg, how='left', on='SK_ID_CURR', lsuffix='1', rsuffix='2')    
    df = df.join(data_previous_application_agg, how='left', on='SK_ID_CURR', lsuffix='1', rsuffix='2')   
    df = df.join(data_installments_payments_agg, how='left', on='SK_ID_CURR', lsuffix='1', rsuffix='2') 
    
    return df

train = merge(app_train)
test = merge(app_test)
display(train.head())

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT1,AMT_ANNUITY1,AMT_GOODS_PRICE1,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START1,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,...,AMT_INST_MIN_REGULARITY,AMT_PAYMENT_CURRENT,AMT_PAYMENT_TOTAL_CURRENT,AMT_RECEIVABLE_PRINCIPAL,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,SK_DPD,SK_DPD_DEF,SK_ID_PREV2,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT2,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE2,HOUR_APPR_PROCESS_START2,NFLAG_LAST_APPL_IN_DAY,RATE_DOWN_PAYMENT,RATE_INTEREST_PRIMARY,RATE_INTEREST_PRIVILEGED,DAYS_DECISION,SELLERPLACE_AREA,CNT_PAYMENT,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL,SK_ID_PREV,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,Laborers,1.0,2,2,WEDNESDAY,10,0,0,0,0,0,0,...,,,,,,,,,,,,,,1038818.0,9251.775,179055.0,179055.0,0.0,179055.0,9.0,1.0,0.0,,,-606.0,500.0,24.0,365243.0,-565.0,125.0,-25.0,-17.0,0.0,1038818.0,1.052632,10.0,-295.0,-315.421053,11559.247105,11559.247105
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,State servant,Higher education,Married,House / apartment,0.003541,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,Core staff,2.0,1,1,MONDAY,11,0,0,0,0,0,0,...,,,,,,,,,,,,,,2281150.0,56553.99,435436.5,484191.0,3442.5,435436.5,14.666667,1.0,0.05003,,,-1305.0,533.0,10.0,365243.0,-1274.333333,-1004.333333,-1054.333333,-1047.333333,0.666667,2290070.0,1.04,5.08,-1378.16,-1385.32,64754.586,64754.586
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.010032,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,Laborers,1.0,2,2,MONDAY,9,0,0,0,0,0,0,...,,,,,,,,,,,,,,1564014.0,5357.25,24282.0,20106.0,4860.0,24282.0,5.0,1.0,0.212008,,,-815.0,30.0,4.0,365243.0,-784.0,-694.0,-724.0,-714.0,0.0,1564014.0,1.333333,2.0,-754.0,-761.666667,7096.155,7096.155
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.008019,-19005,-3039,-9833.0,-2437,,1,1,0,1,0,0,Laborers,2.0,2,2,WEDNESDAY,17,0,0,0,0,0,0,...,0.0,,0.0,0.0,0.0,0.0,,0.0,,,0.0,0.0,0.0,1932462.0,23651.175,272203.26,291695.5,34840.17,408304.89,14.666667,1.0,0.163412,,,-272.444444,894.222222,23.0,365243.0,91066.5,91584.0,182477.5,182481.75,0.0,2217428.0,1.125,4.4375,-252.25,-271.625,62947.088438,62947.088438
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.028663,-19932,-3038,-4311.0,-3458,,1,1,0,1,0,0,Core staff,1.0,2,2,THURSDAY,11,0,0,0,0,1,1,...,,,,,,,,,,,,,,2157812.0,12278.805,150530.25,166638.75,3390.75,150530.25,12.333333,1.0,0.159516,,,-1222.833333,409.166667,20.666667,365243.0,-1263.2,-837.2,72136.2,72143.8,0.6,2048985.0,1.166667,7.045455,-1028.606061,-1032.242424,12666.444545,12214.060227


In [8]:
print(train.shape)
print(test.shape)

(307511, 230)
(48744, 229)
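The column names ending in `1` and `2` in the merged frame (for example `AMT_CREDIT1` and `AMT_CREDIT2`) come from the `lsuffix` and `rsuffix` arguments: `DataFrame.join` applies them only to columns that exist on both sides. A small sketch on made-up frames (`left` and `right` are hypothetical):

```python
import pandas as pd

# Left: application-like table; right: aggregated table indexed by SK_ID_CURR
left = pd.DataFrame({'SK_ID_CURR': [1, 2], 'AMT_CREDIT': [500.0, 900.0]})
right = pd.DataFrame({'AMT_CREDIT': [120.0]},
                     index=pd.Index([1], name='SK_ID_CURR'))

# Overlapping column names receive the suffixes; unmatched clients get NaN
joined = left.join(right, on='SK_ID_CURR', how='left', lsuffix='1', rsuffix='2')
print(joined)
```

Here the `1` column holds the value from the application table and the `2` column the per-client average from the joined table. The one-column difference between the two shapes printed above is `TARGET`, which exists only in the training data.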


> # <a id='7'>7. Combining training and testing data</a>

In [9]:
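# Combine train and test so that every transformation is applied to both in the same way
ntrain = train.shape[0]
ntest = test.shape[0]

# Keep the labels aside before TARGET is dropped from the combined frame
y_train = train.TARGET.values

all_data = pd.concat([train, test]).reset_index(drop=True)
all_data.drop(['TARGET'], axis=1, inplace=True)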
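Since the training rows come first in the concatenation, the stored row count is enough to split the combined frame again once feature engineering is done. A minimal sketch of that round trip on made-up frames (all `toy_*` names and `n_train` are hypothetical):

```python
import pandas as pd

# Toy train and test frames (made up); only train carries TARGET
toy_train = pd.DataFrame({'SK_ID_CURR': [1, 2, 3], 'TARGET': [0, 1, 0]})
toy_test = pd.DataFrame({'SK_ID_CURR': [4, 5]})

n_train = toy_train.shape[0]
y = toy_train['TARGET'].values

# Stack train on top of test, then drop the label column
combined = pd.concat([toy_train, toy_test]).reset_index(drop=True)
combined = combined.drop(columns=['TARGET'])

# ... shared feature engineering would go here ...

# Positional split: the first n_train rows are the training data
train_part = combined[:n_train]
test_part = combined[n_train:]
print(train_part.shape, test_part.shape)
```

The split only works as long as no step in between reorders or drops rows.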

> # <a id='8'>8. Feature Engineering Continued</a>

In [10]:
# Convert DAYS_BIRTH to an age in years, and make DAYS_EMPLOYED, DAYS_REGISTRATION,
# DAYS_ID_PUBLISH and DAYS_LAST_PHONE_CHANGE positive numbers.
def correct_birth(df):
    
    df['DAYS_BIRTH'] = round((df['DAYS_BIRTH'] * (-1))/365)
    return df

def convert_abs(df):
    df['DAYS_EMPLOYED'] = abs(df['DAYS_EMPLOYED'])
    df['DAYS_REGISTRATION'] = abs(df['DAYS_REGISTRATION'])
    df['DAYS_ID_PUBLISH'] = abs(df['DAYS_ID_PUBLISH'])
    df['DAYS_LAST_PHONE_CHANGE'] = abs(df['DAYS_LAST_PHONE_CHANGE'])
    return df

# Fill missing values in OWN_CAR_AGE and a few other features with 0.
# OWN_CAR_AGE is most probably missing when the person does not own a car,
# and previous_loan_counts is missing when the client has no rows in bureau.

def missing(df):
    
    features = ['previous_loan_counts','NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAPARTMENTS_AVG','NONLIVINGAREA_MEDI','OWN_CAR_AGE']
    
    for f in features:
        df[f] = df[f].fillna(0 )
    return df

def transform_app(df):
    df = correct_birth(df)
    df = convert_abs(df)
    df = missing(df)
    return df

   

all_data = transform_app(all_data)
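To see what these transformations do to individual values, here is a small sketch using the `DAYS_BIRTH` and `DAYS_EMPLOYED` values of the first two training rows shown earlier. The `OWN_CAR_AGE` values are made up, and the column `AGE_YEARS` is a hypothetical name used here so the raw column stays visible (the notebook itself overwrites `DAYS_BIRTH`).

```python
import pandas as pd

toy = pd.DataFrame({'DAYS_BIRTH': [-9461, -16765],
                    'DAYS_EMPLOYED': [-637, -1188],
                    'OWN_CAR_AGE': [None, 26.0]})

# Days before the application are negative; flip the sign and convert to years
toy['AGE_YEARS'] = (-toy['DAYS_BIRTH'] / 365).round()

# Same idea for the other day counters, kept in days
toy['DAYS_EMPLOYED'] = toy['DAYS_EMPLOYED'].abs()

# A missing car age is treated as "no car" and filled with 0
toy['OWN_CAR_AGE'] = toy['OWN_CAR_AGE'].fillna(0)
print(toy)
```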

    

In [11]:
# Count the phones the client provided (mobile, employer and work phone flags);
# the individual flag columns are deleted in section 8.1
all_data['NO_OF_CLIENT_PHONES'] = all_data['FLAG_MOBIL'] + all_data['FLAG_EMP_PHONE'] + all_data['FLAG_WORK_PHONE']
all_data.head()

Unnamed: 0,AMT_ANNUITY,AMT_ANNUITY1,AMT_ANNUITY2,AMT_APPLICATION,AMT_BALANCE,AMT_CREDIT1,AMT_CREDIT2,AMT_CREDIT_LIMIT_ACTUAL,AMT_CREDIT_MAX_OVERDUE,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,AMT_DOWN_PAYMENT,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_GOODS_PRICE1,AMT_GOODS_PRICE2,AMT_INCOME_TOTAL,AMT_INSTALMENT,AMT_INST_MIN_REGULARITY,AMT_PAYMENT,AMT_PAYMENT_CURRENT,AMT_PAYMENT_TOTAL_CURRENT,AMT_RECEIVABLE_PRINCIPAL,AMT_RECIVABLE,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_YEAR,AMT_TOTAL_RECEIVABLE,APARTMENTS_AVG,APARTMENTS_MEDI,APARTMENTS_MODE,BASEMENTAREA_AVG,BASEMENTAREA_MEDI,...,bureau_CREDIT_CURRENCY_currency 1_count,bureau_CREDIT_CURRENCY_currency 1_count_norm,bureau_CREDIT_CURRENCY_currency 2_count,bureau_CREDIT_CURRENCY_currency 2_count_norm,bureau_CREDIT_CURRENCY_currency 3_count,bureau_CREDIT_CURRENCY_currency 3_count_norm,bureau_CREDIT_CURRENCY_currency 4_count,bureau_CREDIT_CURRENCY_currency 4_count_norm,bureau_CREDIT_TYPE_Another type of loan_count,bureau_CREDIT_TYPE_Another type of loan_count_norm,bureau_CREDIT_TYPE_Car loan_count,bureau_CREDIT_TYPE_Car loan_count_norm,bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count,bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count_norm,bureau_CREDIT_TYPE_Consumer credit_count,bureau_CREDIT_TYPE_Consumer credit_count_norm,bureau_CREDIT_TYPE_Credit card_count,bureau_CREDIT_TYPE_Credit card_count_norm,bureau_CREDIT_TYPE_Interbank credit_count,bureau_CREDIT_TYPE_Interbank credit_count_norm,bureau_CREDIT_TYPE_Loan for business development_count,bureau_CREDIT_TYPE_Loan for business development_count_norm,bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count,bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count_norm,bureau_CREDIT_TYPE_Loan for the purchase of equipment_count,bureau_CREDIT_TYPE_Loan for the purchase of equipment_count_norm,bureau_CREDIT_TYPE_Loan for working capital replenishment_count,bureau_CREDIT_TYPE_Loan for working capital replenishment_count_norm,bureau_CREDIT_TYPE_Microloan_count,bureau_CREDIT_TYPE_Microloan_count_norm,bureau_CREDIT_TYPE_Mobile operator loan_count,bureau_CREDIT_TYPE_Mobile operator loan_count_norm,bureau_CREDIT_TYPE_Mortgage_count,bureau_CREDIT_TYPE_Mortgage_count_norm,bureau_CREDIT_TYPE_Real estate loan_count,bureau_CREDIT_TYPE_Real estate loan_count_norm,bureau_CREDIT_TYPE_Unknown type of loan_count,bureau_CREDIT_TYPE_Unknown type of loan_count_norm,previous_loan_counts,NO_OF_CLIENT_PHONES
0,9251.775,24700.5,0.0,179055.0,,406597.5,179055.0,,1681.029,108131.945625,49156.2,7997.14125,0.0,0.0,,,,,351000.0,179055.0,202500.0,11559.247105,,11559.247105,,,,,0.0,0.0,0.0,0.0,0.0,1.0,,0.0247,0.025,0.0252,0.0369,0.0369,...,8.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.5,4.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,2
1,56553.99,35698.5,,435436.5,,1293502.5,484191.0,,0.0,254350.125,0.0,202500.0,0.0,3442.5,,,,,1129500.0,435436.5,270000.0,64754.586,,64754.586,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,0.0959,0.0968,0.0924,0.0529,0.0529,...,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.5,2.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2
2,5357.25,6750.0,,24282.0,,135000.0,20106.0,,0.0,94518.9,0.0,0.0,0.0,4860.0,,,,,135000.0,24282.0,67500.0,7096.155,,7096.155,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,...,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,3
3,23651.175,29686.5,,272203.26,0.0,312682.5,291695.5,270000.0,,,,,,34840.17,,0.0,,,297000.0,408304.89,135000.0,62947.088438,0.0,62947.088438,,0.0,0.0,0.0,,,,,,,0.0,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,2
4,12278.805,21865.5,,150530.25,,513000.0,166638.75,,0.0,146250.0,0.0,0.0,0.0,3390.75,,,,,513000.0,150530.25,121500.0,12666.444545,,12214.060227,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2


In [12]:
# Flag clients whose permanent (registration) city matches neither their work city nor their contact city
all_data['FLAG_CLIENT_OUTSIDE_CITY'] = np.where((all_data['REG_CITY_NOT_WORK_CITY']==1) & (all_data['REG_CITY_NOT_LIVE_CITY']==1),1,0)
all_data.head()

Unnamed: 0,AMT_ANNUITY,AMT_ANNUITY1,AMT_ANNUITY2,AMT_APPLICATION,AMT_BALANCE,AMT_CREDIT1,AMT_CREDIT2,AMT_CREDIT_LIMIT_ACTUAL,AMT_CREDIT_MAX_OVERDUE,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,AMT_DOWN_PAYMENT,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_GOODS_PRICE1,AMT_GOODS_PRICE2,AMT_INCOME_TOTAL,AMT_INSTALMENT,AMT_INST_MIN_REGULARITY,AMT_PAYMENT,AMT_PAYMENT_CURRENT,AMT_PAYMENT_TOTAL_CURRENT,AMT_RECEIVABLE_PRINCIPAL,AMT_RECIVABLE,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_YEAR,AMT_TOTAL_RECEIVABLE,APARTMENTS_AVG,APARTMENTS_MEDI,APARTMENTS_MODE,BASEMENTAREA_AVG,BASEMENTAREA_MEDI,...,bureau_CREDIT_CURRENCY_currency 1_count_norm,bureau_CREDIT_CURRENCY_currency 2_count,bureau_CREDIT_CURRENCY_currency 2_count_norm,bureau_CREDIT_CURRENCY_currency 3_count,bureau_CREDIT_CURRENCY_currency 3_count_norm,bureau_CREDIT_CURRENCY_currency 4_count,bureau_CREDIT_CURRENCY_currency 4_count_norm,bureau_CREDIT_TYPE_Another type of loan_count,bureau_CREDIT_TYPE_Another type of loan_count_norm,bureau_CREDIT_TYPE_Car loan_count,bureau_CREDIT_TYPE_Car loan_count_norm,bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count,bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count_norm,bureau_CREDIT_TYPE_Consumer credit_count,bureau_CREDIT_TYPE_Consumer credit_count_norm,bureau_CREDIT_TYPE_Credit card_count,bureau_CREDIT_TYPE_Credit card_count_norm,bureau_CREDIT_TYPE_Interbank credit_count,bureau_CREDIT_TYPE_Interbank credit_count_norm,bureau_CREDIT_TYPE_Loan for business development_count,bureau_CREDIT_TYPE_Loan for business development_count_norm,bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count,bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count_norm,bureau_CREDIT_TYPE_Loan for the purchase of equipment_count,bureau_CREDIT_TYPE_Loan for the purchase of equipment_count_norm,bureau_CREDIT_TYPE_Loan for working capital replenishment_count,bureau_CREDIT_TYPE_Loan for working capital replenishment_count_norm,bureau_CREDIT_TYPE_Microloan_count,bureau_CREDIT_TYPE_Microloan_count_norm,bureau_CREDIT_TYPE_Mobile operator loan_count,bureau_CREDIT_TYPE_Mobile operator loan_count_norm,bureau_CREDIT_TYPE_Mortgage_count,bureau_CREDIT_TYPE_Mortgage_count_norm,bureau_CREDIT_TYPE_Real estate loan_count,bureau_CREDIT_TYPE_Real estate loan_count_norm,bureau_CREDIT_TYPE_Unknown type of loan_count,bureau_CREDIT_TYPE_Unknown type of loan_count_norm,previous_loan_counts,NO_OF_CLIENT_PHONES,FLAG_CLIENT_OUTSIDE_CITY
0,9251.775,24700.5,0.0,179055.0,,406597.5,179055.0,,1681.029,108131.945625,49156.2,7997.14125,0.0,0.0,,,,,351000.0,179055.0,202500.0,11559.247105,,11559.247105,,,,,0.0,0.0,0.0,0.0,0.0,1.0,,0.0247,0.025,0.0252,0.0369,0.0369,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.5,4.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,2,0
1,56553.99,35698.5,,435436.5,,1293502.5,484191.0,,0.0,254350.125,0.0,202500.0,0.0,3442.5,,,,,1129500.0,435436.5,270000.0,64754.586,,64754.586,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,0.0959,0.0968,0.0924,0.0529,0.0529,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.5,2.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2,0
2,5357.25,6750.0,,24282.0,,135000.0,20106.0,,0.0,94518.9,0.0,0.0,0.0,4860.0,,,,,135000.0,24282.0,67500.0,7096.155,,7096.155,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,3,0
3,23651.175,29686.5,,272203.26,0.0,312682.5,291695.5,270000.0,,,,,,34840.17,,0.0,,,297000.0,408304.89,135000.0,62947.088438,0.0,62947.088438,,0.0,0.0,0.0,,,,,,,0.0,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,2,0
4,12278.805,21865.5,,150530.25,,513000.0,166638.75,,0.0,146250.0,0.0,0.0,0.0,3390.75,,,,,513000.0,150530.25,121500.0,12666.444545,,12214.060227,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2,0
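Note that the flag is built with a logical AND, so it is 1 only when the permanent address differs from both the work city and the contact city. A sketch of all four combinations on made-up rows; the second column, `FLAG_ANY_CITY_MISMATCH`, is a hypothetical alternative that uses OR and is shown only for contrast:

```python
import numpy as np
import pandas as pd

# All four combinations of the two source flags
toy = pd.DataFrame({'REG_CITY_NOT_WORK_CITY': [0, 1, 0, 1],
                    'REG_CITY_NOT_LIVE_CITY': [0, 0, 1, 1]})

not_work = toy['REG_CITY_NOT_WORK_CITY'] == 1
not_live = toy['REG_CITY_NOT_LIVE_CITY'] == 1

# As in the notebook: both mismatches must hold
toy['FLAG_CLIENT_OUTSIDE_CITY'] = np.where(not_work & not_live, 1, 0)

# Hypothetical variant: at least one mismatch
toy['FLAG_ANY_CITY_MISMATCH'] = np.where(not_work | not_live, 1, 0)
print(toy)
```

The same pattern applies to the region-level flag built from the `REG_REGION_*` columns.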


In [13]:
# Flag clients whose permanent (registration) region matches neither their contact region nor their work region
all_data['FLAG_CLIENT_OUTSIDE_REGION'] = np.where((all_data['REG_REGION_NOT_LIVE_REGION']==1) & (all_data['REG_REGION_NOT_WORK_REGION']==1),1,0)
all_data.head()

Unnamed: 0,AMT_ANNUITY,AMT_ANNUITY1,AMT_ANNUITY2,AMT_APPLICATION,AMT_BALANCE,AMT_CREDIT1,AMT_CREDIT2,AMT_CREDIT_LIMIT_ACTUAL,AMT_CREDIT_MAX_OVERDUE,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,AMT_DOWN_PAYMENT,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_GOODS_PRICE1,AMT_GOODS_PRICE2,AMT_INCOME_TOTAL,AMT_INSTALMENT,AMT_INST_MIN_REGULARITY,AMT_PAYMENT,AMT_PAYMENT_CURRENT,AMT_PAYMENT_TOTAL_CURRENT,AMT_RECEIVABLE_PRINCIPAL,AMT_RECIVABLE,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_YEAR,AMT_TOTAL_RECEIVABLE,APARTMENTS_AVG,APARTMENTS_MEDI,APARTMENTS_MODE,BASEMENTAREA_AVG,BASEMENTAREA_MEDI,...,bureau_CREDIT_CURRENCY_currency 2_count,bureau_CREDIT_CURRENCY_currency 2_count_norm,bureau_CREDIT_CURRENCY_currency 3_count,bureau_CREDIT_CURRENCY_currency 3_count_norm,bureau_CREDIT_CURRENCY_currency 4_count,bureau_CREDIT_CURRENCY_currency 4_count_norm,bureau_CREDIT_TYPE_Another type of loan_count,bureau_CREDIT_TYPE_Another type of loan_count_norm,bureau_CREDIT_TYPE_Car loan_count,bureau_CREDIT_TYPE_Car loan_count_norm,bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count,bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count_norm,bureau_CREDIT_TYPE_Consumer credit_count,bureau_CREDIT_TYPE_Consumer credit_count_norm,bureau_CREDIT_TYPE_Credit card_count,bureau_CREDIT_TYPE_Credit card_count_norm,bureau_CREDIT_TYPE_Interbank credit_count,bureau_CREDIT_TYPE_Interbank credit_count_norm,bureau_CREDIT_TYPE_Loan for business development_count,bureau_CREDIT_TYPE_Loan for business development_count_norm,bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count,bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count_norm,bureau_CREDIT_TYPE_Loan for the purchase of equipment_count,bureau_CREDIT_TYPE_Loan for the purchase of equipment_count_norm,bureau_CREDIT_TYPE_Loan for working capital replenishment_count,bureau_CREDIT_TYPE_Loan for working capital replenishment_count_norm,bureau_CREDIT_TYPE_Microloan_count,bureau_CREDIT_TYPE_Microloan_count_norm,bureau_CREDIT_TYPE_Mobile operator loan_count,bureau_CREDIT_TYPE_Mobile operator loan_count_norm,bureau_CREDIT_TYPE_Mortgage_count,bureau_CREDIT_TYPE_Mortgage_count_norm,bureau_CREDIT_TYPE_Real estate loan_count,bureau_CREDIT_TYPE_Real estate loan_count_norm,bureau_CREDIT_TYPE_Unknown type of loan_count,bureau_CREDIT_TYPE_Unknown type of loan_count_norm,previous_loan_counts,NO_OF_CLIENT_PHONES,FLAG_CLIENT_OUTSIDE_CITY,FLAG_CLIENT_OUTSIDE_REGION
0,9251.775,24700.5,0.0,179055.0,,406597.5,179055.0,,1681.029,108131.945625,49156.2,7997.14125,0.0,0.0,,,,,351000.0,179055.0,202500.0,11559.247105,,11559.247105,,,,,0.0,0.0,0.0,0.0,0.0,1.0,,0.0247,0.025,0.0252,0.0369,0.0369,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.5,4.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,2,0,0
1,56553.99,35698.5,,435436.5,,1293502.5,484191.0,,0.0,254350.125,0.0,202500.0,0.0,3442.5,,,,,1129500.0,435436.5,270000.0,64754.586,,64754.586,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,0.0959,0.0968,0.0924,0.0529,0.0529,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.5,2.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2,0,0
2,5357.25,6750.0,,24282.0,,135000.0,20106.0,,0.0,94518.9,0.0,0.0,0.0,4860.0,,,,,135000.0,24282.0,67500.0,7096.155,,7096.155,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,3,0,0
3,23651.175,29686.5,,272203.26,0.0,312682.5,291695.5,270000.0,,,,,,34840.17,,0.0,,,297000.0,408304.89,135000.0,62947.088438,0.0,62947.088438,,0.0,0.0,0.0,,,,,,,0.0,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,2,0,0
4,12278.805,21865.5,,150530.25,,513000.0,166638.75,,0.0,146250.0,0.0,0.0,0.0,3390.75,,,,,513000.0,150530.25,121500.0,12666.444545,,12214.060227,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2,0,0


## <a id='8_1'>8.1 Deleting features</a>

In [14]:
# Drop flag columns that are not used as features
USELESS_FEATURES = ['FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
                    'REG_CITY_NOT_WORK_CITY', 'REG_CITY_NOT_LIVE_CITY',
                    'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION']

def delete(df):
    # Return a new frame without the listed columns
    return df.drop(USELESS_FEATURES, axis=1)

def transform(df):
    df = delete(df)
    return df

all_data = transform(all_data)
all_data.head()

Unnamed: 0,AMT_ANNUITY,AMT_ANNUITY1,AMT_ANNUITY2,AMT_APPLICATION,AMT_BALANCE,AMT_CREDIT1,AMT_CREDIT2,AMT_CREDIT_LIMIT_ACTUAL,AMT_CREDIT_MAX_OVERDUE,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,AMT_DOWN_PAYMENT,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_GOODS_PRICE1,AMT_GOODS_PRICE2,AMT_INCOME_TOTAL,AMT_INSTALMENT,AMT_INST_MIN_REGULARITY,AMT_PAYMENT,AMT_PAYMENT_CURRENT,AMT_PAYMENT_TOTAL_CURRENT,AMT_RECEIVABLE_PRINCIPAL,AMT_RECIVABLE,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_YEAR,AMT_TOTAL_RECEIVABLE,APARTMENTS_AVG,APARTMENTS_MEDI,APARTMENTS_MODE,BASEMENTAREA_AVG,BASEMENTAREA_MEDI,...,bureau_CREDIT_CURRENCY_currency 2_count,bureau_CREDIT_CURRENCY_currency 2_count_norm,bureau_CREDIT_CURRENCY_currency 3_count,bureau_CREDIT_CURRENCY_currency 3_count_norm,bureau_CREDIT_CURRENCY_currency 4_count,bureau_CREDIT_CURRENCY_currency 4_count_norm,bureau_CREDIT_TYPE_Another type of loan_count,bureau_CREDIT_TYPE_Another type of loan_count_norm,bureau_CREDIT_TYPE_Car loan_count,bureau_CREDIT_TYPE_Car loan_count_norm,bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count,bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count_norm,bureau_CREDIT_TYPE_Consumer credit_count,bureau_CREDIT_TYPE_Consumer credit_count_norm,bureau_CREDIT_TYPE_Credit card_count,bureau_CREDIT_TYPE_Credit card_count_norm,bureau_CREDIT_TYPE_Interbank credit_count,bureau_CREDIT_TYPE_Interbank credit_count_norm,bureau_CREDIT_TYPE_Loan for business development_count,bureau_CREDIT_TYPE_Loan for business development_count_norm,bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count,bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count_norm,bureau_CREDIT_TYPE_Loan for the purchase of equipment_count,bureau_CREDIT_TYPE_Loan for the purchase of equipment_count_norm,bureau_CREDIT_TYPE_Loan for 
working capital replenishment_count,bureau_CREDIT_TYPE_Loan for working capital replenishment_count_norm,bureau_CREDIT_TYPE_Microloan_count,bureau_CREDIT_TYPE_Microloan_count_norm,bureau_CREDIT_TYPE_Mobile operator loan_count,bureau_CREDIT_TYPE_Mobile operator loan_count_norm,bureau_CREDIT_TYPE_Mortgage_count,bureau_CREDIT_TYPE_Mortgage_count_norm,bureau_CREDIT_TYPE_Real estate loan_count,bureau_CREDIT_TYPE_Real estate loan_count_norm,bureau_CREDIT_TYPE_Unknown type of loan_count,bureau_CREDIT_TYPE_Unknown type of loan_count_norm,previous_loan_counts,NO_OF_CLIENT_PHONES,FLAG_CLIENT_OUTSIDE_CITY,FLAG_CLIENT_OUTSIDE_REGION
0,9251.775,24700.5,0.0,179055.0,,406597.5,179055.0,,1681.029,108131.945625,49156.2,7997.14125,0.0,0.0,,,,,351000.0,179055.0,202500.0,11559.247105,,11559.247105,,,,,0.0,0.0,0.0,0.0,0.0,1.0,,0.0247,0.025,0.0252,0.0369,0.0369,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.5,4.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,2,0,0
1,56553.99,35698.5,,435436.5,,1293502.5,484191.0,,0.0,254350.125,0.0,202500.0,0.0,3442.5,,,,,1129500.0,435436.5,270000.0,64754.586,,64754.586,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,0.0959,0.0968,0.0924,0.0529,0.0529,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.5,2.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2,0,0
2,5357.25,6750.0,,24282.0,,135000.0,20106.0,,0.0,94518.9,0.0,0.0,0.0,4860.0,,,,,135000.0,24282.0,67500.0,7096.155,,7096.155,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,3,0,0
3,23651.175,29686.5,,272203.26,0.0,312682.5,291695.5,270000.0,,,,,,34840.17,,0.0,,,297000.0,408304.89,135000.0,62947.088438,0.0,62947.088438,,0.0,0.0,0.0,,,,,,,0.0,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,2,0,0
4,12278.805,21865.5,,150530.25,,513000.0,166638.75,,0.0,146250.0,0.0,0.0,0.0,3390.75,,,,,513000.0,150530.25,121500.0,12666.444545,,12214.060227,,,,,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2,0,0


In [15]:
# delete Ids

def delete_id(df):
    return df.drop(['SK_ID_CURR', 'SK_ID_PREV','SK_ID_BUREAU'], axis = 1)

all_data = delete_id(all_data)

In [None]:
all_data.head()

In [16]:
print(all_data.columns)

Index(['AMT_ANNUITY', 'AMT_ANNUITY1', 'AMT_ANNUITY2', 'AMT_APPLICATION',
       'AMT_BALANCE', 'AMT_CREDIT1', 'AMT_CREDIT2', 'AMT_CREDIT_LIMIT_ACTUAL',
       'AMT_CREDIT_MAX_OVERDUE', 'AMT_CREDIT_SUM',
       ...
       'bureau_CREDIT_TYPE_Mortgage_count',
       'bureau_CREDIT_TYPE_Mortgage_count_norm',
       'bureau_CREDIT_TYPE_Real estate loan_count',
       'bureau_CREDIT_TYPE_Real estate loan_count_norm',
       'bureau_CREDIT_TYPE_Unknown type of loan_count',
       'bureau_CREDIT_TYPE_Unknown type of loan_count_norm',
       'previous_loan_counts', 'NO_OF_CLIENT_PHONES',
       'FLAG_CLIENT_OUTSIDE_CITY', 'FLAG_CLIENT_OUTSIDE_REGION'],
      dtype='object', length=222)


## <a id='8-2'>8.2  Handling Missing Values </a>

In [17]:
# handling missing values

def miss_numerical(df):
    # Columns excluded from median imputation
    skip = ['previous_loan_counts', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_MEDI', 'OWN_CAR_AGE']
    # Work on the frame passed in rather than the global all_data
    numerical_features = df.select_dtypes(exclude=["object"]).columns
    for f in numerical_features:
        if f not in skip:
            # Fill gaps with the column median
            df[f] = df[f].fillna(df[f].median())
    return df

def miss_categorical(df):
    categorical_features = df.select_dtypes(include=["object"]).columns
    for f in categorical_features:
        # Fill gaps with the most frequent category
        df[f] = df[f].fillna(df[f].mode()[0])
    return df

def transform_feature(df):
    df = miss_numerical(df)
    df = miss_categorical(df)
    return df

all_data = transform_feature(all_data)

all_data.head()
        

Unnamed: 0,AMT_ANNUITY,AMT_ANNUITY1,AMT_ANNUITY2,AMT_APPLICATION,AMT_BALANCE,AMT_CREDIT1,AMT_CREDIT2,AMT_CREDIT_LIMIT_ACTUAL,AMT_CREDIT_MAX_OVERDUE,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,AMT_DOWN_PAYMENT,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_GOODS_PRICE1,AMT_GOODS_PRICE2,AMT_INCOME_TOTAL,AMT_INSTALMENT,AMT_INST_MIN_REGULARITY,AMT_PAYMENT,AMT_PAYMENT_CURRENT,AMT_PAYMENT_TOTAL_CURRENT,AMT_RECEIVABLE_PRINCIPAL,AMT_RECIVABLE,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_YEAR,AMT_TOTAL_RECEIVABLE,APARTMENTS_AVG,APARTMENTS_MEDI,APARTMENTS_MODE,BASEMENTAREA_AVG,BASEMENTAREA_MEDI,...,bureau_CREDIT_CURRENCY_currency 2_count,bureau_CREDIT_CURRENCY_currency 2_count_norm,bureau_CREDIT_CURRENCY_currency 3_count,bureau_CREDIT_CURRENCY_currency 3_count_norm,bureau_CREDIT_CURRENCY_currency 4_count,bureau_CREDIT_CURRENCY_currency 4_count_norm,bureau_CREDIT_TYPE_Another type of loan_count,bureau_CREDIT_TYPE_Another type of loan_count_norm,bureau_CREDIT_TYPE_Car loan_count,bureau_CREDIT_TYPE_Car loan_count_norm,bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count,bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count_norm,bureau_CREDIT_TYPE_Consumer credit_count,bureau_CREDIT_TYPE_Consumer credit_count_norm,bureau_CREDIT_TYPE_Credit card_count,bureau_CREDIT_TYPE_Credit card_count_norm,bureau_CREDIT_TYPE_Interbank credit_count,bureau_CREDIT_TYPE_Interbank credit_count_norm,bureau_CREDIT_TYPE_Loan for business development_count,bureau_CREDIT_TYPE_Loan for business development_count_norm,bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count,bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count_norm,bureau_CREDIT_TYPE_Loan for the purchase of equipment_count,bureau_CREDIT_TYPE_Loan for the purchase of equipment_count_norm,bureau_CREDIT_TYPE_Loan for 
working capital replenishment_count,bureau_CREDIT_TYPE_Loan for working capital replenishment_count_norm,bureau_CREDIT_TYPE_Microloan_count,bureau_CREDIT_TYPE_Microloan_count_norm,bureau_CREDIT_TYPE_Mobile operator loan_count,bureau_CREDIT_TYPE_Mobile operator loan_count_norm,bureau_CREDIT_TYPE_Mortgage_count,bureau_CREDIT_TYPE_Mortgage_count_norm,bureau_CREDIT_TYPE_Real estate loan_count,bureau_CREDIT_TYPE_Real estate loan_count_norm,bureau_CREDIT_TYPE_Unknown type of loan_count,bureau_CREDIT_TYPE_Unknown type of loan_count_norm,previous_loan_counts,NO_OF_CLIENT_PHONES,FLAG_CLIENT_OUTSIDE_CITY,FLAG_CLIENT_OUTSIDE_REGION
0,9251.775,24700.5,0.0,179055.0,24997.602995,406597.5,179055.0,149000.0,1681.029,108131.945625,49156.2,7997.14125,0.0,0.0,4500.0,3329.348636,0.0,365.790547,351000.0,179055.0,202500.0,11559.247105,1623.508258,11559.247105,9856.811065,3986.601378,23914.47654,24765.001041,0.0,0.0,0.0,0.0,0.0,1.0,24770.597235,0.0247,0.025,0.0252,0.0369,0.0369,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.5,4.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,2,0,0
1,56553.99,35698.5,6516.0,435436.5,24997.602995,1293502.5,484191.0,149000.0,0.0,254350.125,0.0,202500.0,0.0,3442.5,4500.0,3329.348636,0.0,365.790547,1129500.0,435436.5,270000.0,64754.586,1623.508258,64754.586,9856.811065,3986.601378,23914.47654,24765.001041,0.0,0.0,0.0,0.0,0.0,0.0,24770.597235,0.0959,0.0968,0.0924,0.0529,0.0529,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.5,2.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2,0,0
2,5357.25,6750.0,6516.0,24282.0,24997.602995,135000.0,20106.0,149000.0,0.0,94518.9,0.0,0.0,0.0,4860.0,4500.0,3329.348636,0.0,365.790547,135000.0,24282.0,67500.0,7096.155,1623.508258,7096.155,9856.811065,3986.601378,23914.47654,24765.001041,0.0,0.0,0.0,0.0,0.0,0.0,24770.597235,0.088,0.0874,0.084,0.0765,0.0761,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,3,0,0
3,23651.175,29686.5,6516.0,272203.26,0.0,312682.5,291695.5,270000.0,0.0,197297.25,44760.375,0.0,0.0,34840.17,4500.0,0.0,0.0,365.790547,297000.0,408304.89,135000.0,62947.088438,0.0,62947.088438,9856.811065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.088,0.0874,0.084,0.0765,0.0761,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.75,1.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,0,0
4,12278.805,21865.5,6516.0,150530.25,24997.602995,513000.0,166638.75,149000.0,0.0,146250.0,0.0,0.0,0.0,3390.75,4500.0,3329.348636,0.0,365.790547,513000.0,150530.25,121500.0,12666.444545,1623.508258,12214.060227,9856.811065,3986.601378,23914.47654,24765.001041,0.0,0.0,0.0,0.0,0.0,0.0,24770.597235,0.088,0.0874,0.084,0.0765,0.0761,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2,0,0
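
As a quick sanity check of the imputation rule used above (median for numerical columns, most frequent value for categorical ones), the sketch below applies it to a tiny made-up frame. The column names `AMT_TOY` and `TYPE_TOY` and their values are hypothetical and are not part of the Home Credit data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one gap per column
toy = pd.DataFrame({
    "AMT_TOY": [1.0, np.nan, 3.0, 10.0],
    "TYPE_TOY": pd.Series(["a", "b", None, "b"], dtype=object),
})

# Numerical gap -> median of the observed values
toy["AMT_TOY"] = toy["AMT_TOY"].fillna(toy["AMT_TOY"].median())

# Categorical gap -> most frequent observed value (mode ignores missing entries)
toy["TYPE_TOY"] = toy["TYPE_TOY"].fillna(toy["TYPE_TOY"].mode()[0])

print(toy)
```

Because the median and mode are computed from the observed values only, every imputed row in a column receives the same value, which is why several columns in the output above show one repeated number.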


## <a id='8-3'>8.3 Scaling Numerical Features </a>

In [18]:
# Scaling the data

from sklearn.preprocessing import MinMaxScaler

def encoder(df):
    # Rescale every numerical column to the [0, 1] range
    scaler = MinMaxScaler()
    numerical = df.select_dtypes(exclude=["object"]).columns
    features_transform = df.copy()
    features_transform[numerical] = scaler.fit_transform(df[numerical])
    display(features_transform.head(n=5))
    # Return the scaled copy explicitly instead of relying on df being modified in place
    return features_transform

all_data = encoder(all_data)

Unnamed: 0,AMT_ANNUITY,AMT_ANNUITY1,AMT_ANNUITY2,AMT_APPLICATION,AMT_BALANCE,AMT_CREDIT1,AMT_CREDIT2,AMT_CREDIT_LIMIT_ACTUAL,AMT_CREDIT_MAX_OVERDUE,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,AMT_DOWN_PAYMENT,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_GOODS_PRICE1,AMT_GOODS_PRICE2,AMT_INCOME_TOTAL,AMT_INSTALMENT,AMT_INST_MIN_REGULARITY,AMT_PAYMENT,AMT_PAYMENT_CURRENT,AMT_PAYMENT_TOTAL_CURRENT,AMT_RECEIVABLE_PRINCIPAL,AMT_RECIVABLE,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_YEAR,AMT_TOTAL_RECEIVABLE,APARTMENTS_AVG,APARTMENTS_MEDI,APARTMENTS_MODE,BASEMENTAREA_AVG,BASEMENTAREA_MEDI,...,bureau_CREDIT_CURRENCY_currency 2_count,bureau_CREDIT_CURRENCY_currency 2_count_norm,bureau_CREDIT_CURRENCY_currency 3_count,bureau_CREDIT_CURRENCY_currency 3_count_norm,bureau_CREDIT_CURRENCY_currency 4_count,bureau_CREDIT_CURRENCY_currency 4_count_norm,bureau_CREDIT_TYPE_Another type of loan_count,bureau_CREDIT_TYPE_Another type of loan_count_norm,bureau_CREDIT_TYPE_Car loan_count,bureau_CREDIT_TYPE_Car loan_count_norm,bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count,bureau_CREDIT_TYPE_Cash loan (non-earmarked)_count_norm,bureau_CREDIT_TYPE_Consumer credit_count,bureau_CREDIT_TYPE_Consumer credit_count_norm,bureau_CREDIT_TYPE_Credit card_count,bureau_CREDIT_TYPE_Credit card_count_norm,bureau_CREDIT_TYPE_Interbank credit_count,bureau_CREDIT_TYPE_Interbank credit_count_norm,bureau_CREDIT_TYPE_Loan for business development_count,bureau_CREDIT_TYPE_Loan for business development_count_norm,bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count,bureau_CREDIT_TYPE_Loan for purchase of shares (margin lending)_count_norm,bureau_CREDIT_TYPE_Loan for the purchase of equipment_count,bureau_CREDIT_TYPE_Loan for the purchase of equipment_count_norm,bureau_CREDIT_TYPE_Loan for 
working capital replenishment_count,bureau_CREDIT_TYPE_Loan for working capital replenishment_count_norm,bureau_CREDIT_TYPE_Microloan_count,bureau_CREDIT_TYPE_Microloan_count_norm,bureau_CREDIT_TYPE_Mobile operator loan_count,bureau_CREDIT_TYPE_Mobile operator loan_count_norm,bureau_CREDIT_TYPE_Mortgage_count,bureau_CREDIT_TYPE_Mortgage_count_norm,bureau_CREDIT_TYPE_Real estate loan_count,bureau_CREDIT_TYPE_Real estate loan_count_norm,bureau_CREDIT_TYPE_Unknown type of loan_count,bureau_CREDIT_TYPE_Unknown type of loan_count_norm,previous_loan_counts,NO_OF_CLIENT_PHONES,FLAG_CLIENT_OUTSIDE_CITY,FLAG_CLIENT_OUTSIDE_REGION
0,0.030796,0.090032,0.0,0.044211,0.029978,0.090287,0.044211,0.11037,1.4e-05,0.000546,0.02144,0.02303,0.0,1.111111e-07,0.004975,0.002071,0.0,0.000226,0.077441,0.044211,0.001512,0.004615,0.037744,0.004615,0.006187,0.002504,0.030248,0.030234,0.0,0.0,0.0,0.0,0.0,0.04,0.03024,0.0247,0.025,0.0252,0.0369,0.0369,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.046512,0.5,0.181818,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.068966,0.666667,0.0,0.0
1,0.188246,0.132924,0.000239,0.107515,0.029978,0.311736,0.119553,0.11037,0.0,0.001284,0.02051,0.065332,0.0,0.001700111,0.004975,0.002071,0.0,0.000226,0.271605,0.107515,0.002089,0.025854,0.037744,0.025854,0.006187,0.002504,0.030248,0.030234,0.0,0.0,0.0,0.0,0.0,0.0,0.03024,0.0959,0.0968,0.0924,0.0529,0.0529,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.023256,0.5,0.090909,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034483,0.666667,0.0,0.0
2,0.017832,0.020025,0.000239,0.005996,0.029978,0.022472,0.004964,0.11037,0.0,0.000477,0.02051,0.021291,0.0,0.002400111,0.004975,0.002071,0.0,0.000226,0.023569,0.005996,0.000358,0.002833,0.037744,0.002833,0.006187,0.002504,0.030248,0.030234,0.0,0.0,0.0,0.0,0.0,0.0,0.03024,0.088,0.0874,0.084,0.0765,0.0761,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.023256,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,1.0,0.0,0.0
3,0.078726,0.109477,0.000239,0.067211,0.003145,0.066837,0.072024,0.2,0.0,0.000996,0.021357,0.021291,0.0,0.01720513,0.004975,1.1e-05,0.0,0.000226,0.063973,0.100816,0.000935,0.025133,0.0,0.025133,0.006187,0.0,0.003302,0.003199,0.0,0.0,0.0,0.0,0.0,0.04,0.003199,0.088,0.0874,0.084,0.0765,0.0761,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034884,0.75,0.045455,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.666667,0.0,0.0
4,0.040871,0.078975,0.000239,0.037168,0.029978,0.116854,0.041145,0.11037,0.0,0.000738,0.02051,0.021291,0.0,0.001674555,0.004975,0.002071,0.0,0.000226,0.117845,0.037168,0.000819,0.005057,0.037744,0.004877,0.006187,0.002504,0.030248,0.030234,0.0,0.0,0.0,0.0,0.0,0.0,0.03024,0.088,0.0874,0.084,0.0765,0.0761,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011628,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008621,0.666667,0.0,0.0
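
To make the scaling step concrete, here is a minimal sketch on made-up values showing that `MinMaxScaler` maps each column to `(x - min) / (max - min)`. The numbers are hypothetical. In the output above, `NO_OF_CLIENT_PHONES` values of 2 and 3 become 0.666667 and 1.0, which is consistent with a column ranging from 0 to 3.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values for a single numerical column
values = np.array([[10.0], [20.0], [40.0], [50.0]])

# Scaling with scikit-learn
scaled = MinMaxScaler().fit_transform(values)

# The same transformation written out by hand
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

print(scaled.ravel())
print(manual.ravel())
```

Note that the notebook fits the scaler on the combined training and testing rows. Fitting on the training rows only and reusing the fitted scaler for the test rows is a common alternative, but it is not what is done here.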


## <a id='8-4'>8.4 Converting into categorical features </a>

In [19]:
# Encoding categorical features as numbers

# Create a label encoder object
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in all_data:
    if all_data[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(all_data[col].unique()) <= 2:
            # Fit and transform on the combined train and test data
            all_data[col] = le.fit_transform(all_data[col])

            # Keep track of how many columns were label encoded
            le_count += 1

print('%d columns were label encoded.' % le_count)

4 columns were label encoded.


In [20]:
# dummy variables
all_data = pd.get_dummies(all_data)

display(all_data.shape)

(356255, 342)
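
The two encoding steps above can be illustrated on a small made-up frame: a two-level column is label encoded into a single 0/1 column, and the remaining object column is expanded by `pd.get_dummies` into one indicator column per category. The names `FLAG_TOY` and `TYPE_TOY` and their values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical): one binary and one three-level categorical column
toy = pd.DataFrame({
    "FLAG_TOY": pd.Series(["Y", "N", "Y", "N"], dtype=object),
    "TYPE_TOY": pd.Series(["cash", "card", "car", "cash"], dtype=object),
})

# Two-level column -> single integer column (classes are sorted, so N=0, Y=1)
toy["FLAG_TOY"] = LabelEncoder().fit_transform(toy["FLAG_TOY"])

# Remaining object columns -> one indicator column per category
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded.shape)
```

This is why the column count grows from 222 to 342 after `get_dummies`: every multi-level categorical column is replaced by as many indicator columns as it has categories.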

> # <a id='9'>9. Modelling</a>

In [21]:
### Splitting features
train = all_data[:ntrain]
test = all_data[ntrain:]

print("Training shape", train.shape)
print("Testing shape", test.shape)

Training shape (307511, 342)
Testing shape (48744, 342)
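
The positional split above only works if the training rows come first in `all_data`. The sketch below shows the pattern on toy frames, assuming the combined frame was built by stacking the training rows before the testing rows. The `*_toy` names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the training and testing frames
train_toy = pd.DataFrame({"x": [1, 2, 3]})
test_toy = pd.DataFrame({"x": [4, 5]})

# Remember how many rows belong to the training part
ntrain_toy = train_toy.shape[0]

# Stack training rows first, then testing rows
combined = pd.concat([train_toy, test_toy]).reset_index(drop=True)

# Slice by position to recover the two parts after shared preprocessing
train_back = combined[:ntrain_toy]
test_back = combined[ntrain_toy:]

print(train_back.shape, test_back.shape)
```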


In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(train, y_train, test_size = 0.3, random_state = 200)
print("X Training shape", X_train.shape)
print("X Testing shape", X_test.shape)
print("Y Training shape", Y_train.shape)
print("Y Testing shape", Y_test.shape)



X Training shape (215257, 342)
X Testing shape (92254, 342)
Y Training shape (215257,)
Y Testing shape (92254,)


In [23]:
from sklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_curve, auc, log_loss
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
# GridSearchCV lives in sklearn.model_selection; sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.model_selection import GridSearchCV

logreg = LogisticRegression(random_state=0, class_weight='balanced', C=100)
logreg.fit(X_train, Y_train)
# Probability of the positive class for each held-out row
Y_pred = logreg.predict_proba(X_test)[:,1]

print('Train/Test split results:')
print("ROC",  roc_auc_score(Y_test, Y_pred))





Train/Test split results:
ROC 0.7621209394788154
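
For readers unfamiliar with `predict_proba`, the following sketch on synthetic data shows why the `[:, 1]` slice is needed before computing the ROC AUC. The data comes from `make_classification`, not from the Home Credit files, and `max_iter=1000` is an assumption made here only to help the toy model converge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (roughly 10% positives)
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=200, stratify=y)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_tr, y_tr)

# One column per class, ordered as in model.classes_
proba = model.predict_proba(X_te)
# Column 1 holds the probability of the positive class
positive = proba[:, 1]

auc_score = roc_auc_score(y_te, positive)
print(model.classes_, proba.shape, round(auc_score, 3))
```

ROC AUC depends only on how the scores rank the rows, so it should be computed from probabilities like these rather than from hard 0/1 predictions.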


In [27]:
# predict_proba returns one column per class; keep the probability of the positive class
pred_test = logreg.predict_proba(test)[:, 1]
submission = pd.read_csv('../input/sample_submission.csv')

submission['SK_ID_CURR'] = app_test['SK_ID_CURR']
print(len(app_test['SK_ID_CURR']))
submission['TARGET'] = pred_test
# Write only SK_ID_CURR and TARGET, without the row index
submission[['SK_ID_CURR', 'TARGET']].to_csv('homecreditada.csv', index=False)

48744
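
A minimal sketch of the submission layout, using made-up ids and probabilities, is shown below. It writes to an in-memory buffer instead of a file so the resulting CSV text can be inspected directly. All values are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical applicant ids and two-column class probabilities
ids = pd.Series([100001, 100005, 100013], name="SK_ID_CURR")
proba = np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])

# One row per applicant: id plus probability of the positive class
submission = pd.DataFrame({"SK_ID_CURR": ids, "TARGET": proba[:, 1]})

# index=False keeps the row index out of the CSV
buffer = io.StringIO()
submission.to_csv(buffer, index=False)

print(buffer.getvalue())
```

Without `index=False`, pandas writes an extra unnamed first column holding the row index, the same kind of `Unnamed: 0` column that appears in the table outputs earlier in this notebook.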
