- <a href='#0'>0. Introduction</a>  
- <a href='#1'>1. Get the Data</a>
- <a href='#2'>2. Check the Data</a>
- <a href='#3'>3. Explore the data</a>
    - <a href='#3-1'>3.1 Categorical features</a>
    - <a href='#3-2'>3.2 Numerical features</a>
    - <a href='#3-3'>3.3 Categorical features by label</a>
    - <a href='#3-4'>3.4 Numerical features by label</a>
    - <a href='#3-5'>3.5 Correlation Matrix</a>
- <a href='#4'>4. More will come soon</a>

## <a id='0'>0. Introduction</a>

[Home Credit](http://www.homecredit.net/) is an international non-bank financial institution founded in 1997 in the Czech Republic. The company operates in 14 countries and focuses on lending primarily to people with little or no credit history.

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--**to predict their clients' repayment abilities.**

While Home Credit is currently using various statistical and machine learning methods to make these predictions, **they're challenging Kagglers to help them unlock the full potential of their data**. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
![](http://www.homecredit.net/~/media/Images/H/Home-Credit-Group/image-gallery/full/image-gallery-01-11-2016-b.png)

## <a id='1'>1. Get the data</a>

In [1]:
import gc
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
color = sns.color_palette()
import warnings
warnings.filterwarnings("ignore")

In [11]:
application_train = pd.read_csv('../input/application_train.csv')
application_test = pd.read_csv('../input/application_test.csv')
bureau = pd.read_csv('../input/bureau.csv')
bureau_balance = pd.read_csv('../input/bureau_balance.csv')
POS_CASH_balance = pd.read_csv('../input/POS_CASH_balance.csv')
credit_card_balance = pd.read_csv('../input/credit_card_balance.csv')
previous_application = pd.read_csv('../input/previous_application.csv')
installments_payments = pd.read_csv('../input/installments_payments.csv')

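The competition data also includes `HomeCredit_columns_description.csv`, which contains descriptions for the columns in the various data files. The diagram below shows how the tables relate to one another.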

<img src="https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png" width="800"></img>
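The column description file can also be queried from code instead of being read by eye. The following is a minimal sketch, not part of the original kernel: the file name `HomeCredit_columns_description.csv`, its `Table`, `Row` and `Description` columns, and the `latin-1` encoding are assumptions about the competition data, and `describe_column` is a hypothetical helper. The demo runs on a small hand-made frame so it works without the file.

```python
import pandas as pd

def describe_column(descriptions, column):
    """Return the description rows whose 'Row' field equals `column` (hypothetical helper)."""
    mask = descriptions['Row'] == column
    return descriptions.loc[mask, ['Table', 'Row', 'Description']]

# With the real file (name and encoding are assumptions):
# descriptions = pd.read_csv('../input/HomeCredit_columns_description.csv', encoding='latin-1')

# Stand-in frame; the description texts are placeholders, not the official wording
descriptions = pd.DataFrame({
    'Table': ['application_train', 'application_train', 'bureau'],
    'Row': ['TARGET', 'AMT_CREDIT', 'AMT_CREDIT_SUM'],
    'Description': ['placeholder text A', 'placeholder text B', 'placeholder text C'],
})

demo = describe_column(descriptions, 'AMT_CREDIT')
print(demo)
```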

In [2]:
print('------------main------------')
print('application_train:', application_train.shape[0], "rows and", application_train.shape[1],'columns')
print('application_test:', application_test.shape[0], "rows and", application_test.shape[1],'columns')
print('      ')
print('------------others------------')
print('POS_CASH_balance:', POS_CASH_balance.shape[0], "rows and", POS_CASH_balance.shape[1],'columns')
print('bureau:', bureau.shape[0], "rows and", bureau.shape[1],'columns')
print('bureau_balance:', bureau_balance.shape[0], "rows and", bureau_balance.shape[1],'columns')
print('previous_application:', previous_application.shape[0], "rows and", previous_application.shape[1],'columns')
print('installments_payments:', installments_payments.shape[0], "rows and", installments_payments.shape[1],'columns')
print('credit_card_balance:', credit_card_balance.shape[0], "rows and", credit_card_balance.shape[1],'columns')


## <a id='2'>2. Check the data</a>
### 2.1 application train / test

In [14]:
application_train.head()

In [15]:
application_train.columns.values

In [16]:
def find_missing(data):
    '''return the count and ratio of missing values for each column of data'''
    count_missing = data.isnull().sum().values
    total = data.shape[0]  # number of rows
    ratio_missing = count_missing/total
    return pd.DataFrame(data={'missing_count':count_missing, 'missing_ratio':ratio_missing}, index=data.columns.values)
# first 12 columns, in file order
find_missing(application_train).head(12)

In [18]:
find_missing(application_test).head(12)
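`head(12)` shows the first 12 columns in file order, which are not necessarily the most incomplete ones. A variant that sorts by missing ratio might look like the sketch below. `missing_summary` is a hypothetical name, and the toy frame stands in for `application_train`.

```python
import numpy as np
import pandas as pd

def missing_summary(data):
    """Per-column missing count and ratio, most incomplete columns first."""
    summary = pd.DataFrame({
        'missing_count': data.isnull().sum(),
        'missing_ratio': data.isnull().mean(),
    })
    return summary.sort_values('missing_ratio', ascending=False)

# Toy frame: 'a' has two gaps, 'c' has one, 'b' has none
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [1.0, 2.0, 3.0, 4.0],
    'c': [np.nan, 2.0, 3.0, 4.0],
})

summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(application_train).head(12)` would then list the 12 columns with the largest share of missing values.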

### 2.2 POS_CASH_balance

In [19]:
POS_CASH_balance.head()

In [20]:
POS_CASH_balance.columns.values

In [21]:
find_missing(POS_CASH_balance).head(12)

### 2.3 bureau

In [22]:
bureau.head()

In [23]:
bureau.columns.values

In [24]:
find_missing(bureau).head(12)

### 2.4 bureau_balance

In [25]:
bureau_balance.head()

In [26]:
bureau_balance.columns.values

In [27]:
find_missing(bureau_balance).head(12)

### 2.5 credit_card_balance

In [28]:
credit_card_balance.head()

In [29]:
credit_card_balance.columns.values

In [31]:
find_missing(credit_card_balance).head(12)

### 2.6 previous_application

In [32]:
previous_application.head()

In [33]:
previous_application.columns.values

In [34]:
find_missing(previous_application).head(12)

### 2.7 installments_payments

In [35]:
installments_payments.head()

In [36]:
installments_payments.columns.values

In [37]:
find_missing(installments_payments).head(12)

## <a id='3'>3. Explore the data</a>

### <a id='3-1'>3.1 Categorical features</a>
#### Label

In [85]:
def plot_categorical(data, col, size=[8 ,4], xlabel_angle=0, title=''):
    '''use this for plotting the count of categorical features'''
    plotdata = data[col].value_counts()
    plt.figure(figsize = size)
    sns.barplot(x = plotdata.index, y=plotdata.values)
    plt.title(title)
    if xlabel_angle!=0: 
        plt.xticks(rotation=xlabel_angle)
    plt.show()
plot_categorical(data=application_train, col='TARGET', size=[8 ,4], xlabel_angle=0, title='train set: label')
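The bar chart shows raw counts. To put a number on the class balance, the share of each label can be computed directly. This is a small sketch on a toy series: `class_balance` is a hypothetical helper, and the toy values say nothing about the real default rate.

```python
import pandas as pd

def class_balance(labels):
    """Count and share of each label value."""
    counts = labels.value_counts()
    shares = labels.value_counts(normalize=True)
    # Both series are indexed by label value, so they align when combined
    return pd.DataFrame({'count': counts, 'share': shares})

# Toy labels: nine non-defaults and one default
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name='TARGET')

balance = class_balance(toy_target)
print(balance)
```

On the real data this would be called as `class_balance(application_train['TARGET'])`.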

#### Occupation Type

In [106]:
plot_categorical(data=application_train, col='OCCUPATION_TYPE', size=[12 ,4], xlabel_angle=30, title='Occupation Type')

#### Gender

#### Income Type

In [105]:
plot_categorical(data=application_train, col='NAME_INCOME_TYPE', size=[12 ,4], xlabel_angle=0, title='Income Type')

#### House Type

In [104]:
plot_categorical(data=application_train, col='NAME_HOUSING_TYPE', size=[12 ,4], xlabel_angle=0, title='House Type')

### <a id='3-2'>3.2 Numerical features</a>
#### Credit Amount

In [134]:
def plot_numerical(data, col, size=[8, 4], bins=50):
    '''use this for plotting the distribution of numerical features'''
    plt.figure(figsize=size)
    plt.title("Distribution of %s" % col)
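    # histplot replaces the deprecated distplot (needs seaborn >= 0.11); stat='density' keeps the density scale
    sns.histplot(data[col].dropna(), kde=True, bins=bins, stat='density')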
    plt.show()
plot_numerical(application_train, 'AMT_CREDIT')

#### Annuity Amount

In [135]:
plot_numerical(application_train, 'AMT_ANNUITY')

#### Days employed

In [136]:
plot_numerical(application_train, 'DAYS_EMPLOYED')
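If the histogram is dominated by a single spike far to the right, a placeholder value is a likely cause. `DAYS_EMPLOYED` in this dataset is commonly reported to use 365243 for applicants without an employment record. That value is an assumption here and should be confirmed with `value_counts()` on the real column. The sketch below shows how such a placeholder could be masked before plotting. `mask_sentinel` is a hypothetical helper and the series is toy data.

```python
import numpy as np
import pandas as pd

SENTINEL = 365243  # assumed placeholder value; verify on the real column first

def mask_sentinel(values, sentinel=SENTINEL):
    """Replace the placeholder with NaN so it no longer distorts the distribution."""
    return values.replace(sentinel, np.nan)

# Toy values: negative day counts plus two placeholder entries
toy_days = pd.Series([-637, -1188, 365243, -225, 365243], name='DAYS_EMPLOYED')

cleaned = mask_sentinel(toy_days)
print(cleaned)
```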

### <a id='3-3'>3.3 Categorical features by label</a>
#### Gender

In [156]:
def plot_categorical_bylabel(data, col, size=[12 ,6], xlabel_angle=0, title=''):
    '''use it to compare the distribution between label 1 and label 0'''
    plt.figure(figsize = size)
    l1 = data.loc[data.TARGET==1, col].value_counts()
    l0 = data.loc[data.TARGET==0, col].value_counts()
    plt.subplot(1,2,1)
    sns.barplot(x = l1.index, y=l1.values)
    plt.title('Default: '+title)
    plt.xticks(rotation=xlabel_angle)
    plt.subplot(1,2,2)
    sns.barplot(x = l0.index, y=l0.values)
    plt.title('Non-default: '+title)
    plt.xticks(rotation=xlabel_angle)
    plt.show()
plot_categorical_bylabel(application_train, 'CODE_GENDER', title='Gender')

#### Education Type

In [157]:
plot_categorical_bylabel(application_train, 'NAME_EDUCATION_TYPE', size=[15 ,6], xlabel_angle=15, title='Education Type')
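Side-by-side count plots are hard to compare when one label is much rarer than the other. A complementary view is the default rate within each category, sketched below. `default_rate_by_category` is a hypothetical helper, and the toy frame carries no real signal.

```python
import pandas as pd

def default_rate_by_category(data, col, target='TARGET'):
    """Number of applications and mean of the 0/1 target for each category of `col`."""
    grouped = data.groupby(col)[target].agg(['count', 'mean'])
    grouped.columns = ['applications', 'default_rate']
    return grouped.sort_values('default_rate', ascending=False)

# Toy frame with arbitrary labels
toy = pd.DataFrame({
    'CODE_GENDER': ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'TARGET':      [0,   0,   0,   1,   0,   1,   1,   0],
})

rates = default_rate_by_category(toy, 'CODE_GENDER')
print(rates)
```

Because the target is coded 0/1, its mean within a group is the share of defaults in that group.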

### <a id='3-4'>3.4 Numerical features by label</a>
#### EXT_SOURCE_1

In [158]:
def plot_numerical_bylabel(data, col, size=[12, 6], bins=50):
    '''use this to compare the distribution of numerical features between label 1 and label 0'''
    plt.figure(figsize=size)
    l1 = data.loc[data.TARGET==1, col]
    l0 = data.loc[data.TARGET==0, col]
    plt.subplot(1,2,1)
    # histplot replaces the deprecated distplot (needs seaborn >= 0.11)
    sns.histplot(l1.dropna(), kde=True, bins=bins, stat='density')
    plt.title('Default: Distribution of %s' % col)
    plt.subplot(1,2,2)
    sns.histplot(l0.dropna(), kde=True, bins=bins, stat='density')
    plt.title('Non-default: Distribution of %s' % col)
    plt.show()
plot_numerical_bylabel(application_train, 'EXT_SOURCE_1', bins=50)

#### EXT_SOURCE_2

In [159]:
plot_numerical_bylabel(application_train, 'EXT_SOURCE_2', bins=50)

#### EXT_SOURCE_3

In [160]:
plot_numerical_bylabel(application_train, 'EXT_SOURCE_3', bins=50)
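The two panels use separate axes, so small shifts between the groups are easy to miss. A few summary statistics per label give a quick numeric comparison. This is a sketch on toy data, with `summarize_by_label` as a hypothetical helper.

```python
import numpy as np
import pandas as pd

def summarize_by_label(data, col, target='TARGET'):
    """Non-missing count, mean and median of `col` for each label value."""
    return data.groupby(target)[col].agg(['count', 'mean', 'median'])

# Toy frame; missing values are skipped by the aggregations
toy = pd.DataFrame({
    'EXT_SOURCE_1': [0.2, 0.4, np.nan, 0.6, 0.8, 0.7],
    'TARGET':       [1,   1,   1,      0,   0,   0],
})

stats = summarize_by_label(toy, 'EXT_SOURCE_1')
print(stats)
```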

### <a id='3-5'>3.5 Correlation Matrix</a>

In [168]:
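# restrict to numeric columns; recent pandas versions raise an error if corr() meets string columns
corr_mat = application_train.select_dtypes(include=[np.number]).corr()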
plt.figure(figsize=[15, 15])
sns.heatmap(corr_mat.values, annot=False)
plt.show()
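A heatmap over this many columns is hard to read cell by cell. One way to pull out the part that matters for modelling is to rank features by the absolute value of their correlation with `TARGET`. The sketch below uses synthetic data. `rank_by_target_correlation` is a hypothetical helper, and the column names `strong`, `weak` and `negative` are made up for the demo.

```python
import numpy as np
import pandas as pd

def rank_by_target_correlation(data, target='TARGET'):
    """Pearson correlation of each numeric column with the target, strongest first."""
    numeric = data.select_dtypes(include=[np.number])
    corr = numeric.corr()[target].drop(target)
    # Order by absolute value but keep the sign in the result
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic data: three features with different noise levels, plus a text column
rng = np.random.default_rng(0)
n = 500
label = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    'TARGET': label,
    'strong': label + rng.normal(0, 0.1, size=n),
    'weak': label + rng.normal(0, 5.0, size=n),
    'negative': -label + rng.normal(0, 0.5, size=n),
    'text': ['x'] * n,
})

ranking = rank_by_target_correlation(toy)
print(ranking)
```

Pearson correlation only captures linear association, so a low rank here does not rule out a useful feature.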