### Home Credit Default Risk - EDA using Seaborn and Matplotlib
A simple notebook that explores the categorical features and the credit card history.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import os
import seaborn as sns

from matplotlib import pyplot as plt

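# Use the fully qualified option name; the bare 'max_columns' is ambiguous in recent pandas
pd.set_option('display.max_columns', 150)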

In [None]:
DATA_DIR = "../input/"
df = pd.read_csv(DATA_DIR + "application_train.csv")
df_cc = pd.read_csv(DATA_DIR + "credit_card_balance.csv")

In [None]:
df.head()

### Countplots of TARGET by NAME_CONTRACT_TYPE & CODE_GENDER

Only 4 rows have an unspecified gender, and none of them defaulted.

We can see that:
* More defaults occur in Cash loans
* A higher proportion of male applicants default than female applicants

In [None]:
sns.countplot(x="NAME_CONTRACT_TYPE", hue="TARGET", data=df);

In [None]:
sns.countplot(hue="TARGET", x="CODE_GENDER", data=df);
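Countplots show raw counts, so the ratios are easier to read off as per-group default rates. A minimal sketch, assuming TARGET is coded 0/1; the `toy` frame and its values are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy stand-in for application_train.csv (hypothetical values)
toy = pd.DataFrame({
    "CODE_GENDER": ["M", "M", "M", "M", "F", "F", "F", "F", "F", "F"],
    "TARGET":      [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 TARGET within each group is that group's default rate
default_rate = toy.groupby("CODE_GENDER")["TARGET"].mean()
print(default_rate)
```

On the real data the same pattern would be `df.groupby("CODE_GENDER")["TARGET"].mean()`, or with `"NAME_CONTRACT_TYPE"` as the grouping column.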

### TARGET distribution for different OCCUPATION_TYPE & GENDER
It's interesting to compare the M/F counts and default ratios across occupations.

In [None]:
# factorplot was renamed to catplot in seaborn 0.9 and later removed
sns.catplot(x="CODE_GENDER", hue="TARGET", col="OCCUPATION_TYPE", data=df, kind="count", col_wrap=4, sharey=False);

### Exploring Education of the applicant: NAME_EDUCATION_TYPE

In [None]:
# Very few applicants with Academic degree
df["NAME_EDUCATION_TYPE"].value_counts()

In [None]:
sns.catplot(x="CODE_GENDER", hue="TARGET", col="NAME_EDUCATION_TYPE", data=df, kind="count", col_wrap=4, sharey=False);

### Does the applicant own Realty or a Car?

In [None]:
# True if the applicant owns realty, a car, or both
df["OWN_CAR_OR_REALTY"] = (df["FLAG_OWN_REALTY"] == "Y") | (df["FLAG_OWN_CAR"] == "Y")
sns.catplot(x="OWN_CAR_OR_REALTY", hue="TARGET", col="CODE_GENDER",
    data=df, kind="count", col_wrap=4, sharey=False);

## credit_card_balance.csv

Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample. That is, the table has (number of loans in sample * number of related previous credit cards * number of months with observable history for the previous credit card) rows.

In [None]:
df_cc.head()
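The row structure described above can be checked by counting months per card and cards per applicant. A minimal sketch on a made-up frame; `toy_cc` and its values are hypothetical, while `SK_ID_PREV`, `SK_ID_CURR` and `MONTHS_BALANCE` are columns of `credit_card_balance.csv`:

```python
import pandas as pd

# Toy monthly snapshots (hypothetical values): applicant 1 has two cards, applicant 2 has one
toy_cc = pd.DataFrame({
    "SK_ID_PREV":     [10, 10, 10, 11, 11, 20],
    "SK_ID_CURR":     [1, 1, 1, 1, 1, 2],
    "MONTHS_BALANCE": [-3, -2, -1, -2, -1, -1],
})

# Number of monthly rows recorded for each (applicant, previous card) pair
months_per_card = toy_cc.groupby(["SK_ID_CURR", "SK_ID_PREV"])["MONTHS_BALANCE"].count()

# Number of distinct previous cards per applicant
cards_per_applicant = toy_cc.groupby("SK_ID_CURR")["SK_ID_PREV"].nunique()

print(months_per_card)
print(cards_per_applicant)
```

The per-card month counts sum back to the total number of rows, which matches the row-count description of the table.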

Join the credit card info to the applicants where it is available.

In [None]:
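# Inner join keeps only applicants that also have credit card history
df_with_cc_info = pd.merge(df, df_cc, on="SK_ID_CURR", how="inner")
# Row-level means per TARGET (numeric_only skips the categorical columns)
df_mean = df_with_cc_info.groupby("TARGET").mean(numeric_only=True)
# One row per applicant: mean over that applicant's monthly snapshots
df_with_cc_info = df_with_cc_info.groupby(["SK_ID_CURR", "TARGET"]).mean(numeric_only=True).reset_index()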

### Average draw amount & limit for defaulters vs non-defaulters

In [None]:
df_mean[["AMT_DRAWINGS_CURRENT", "AMT_BALANCE", "AMT_CREDIT_LIMIT_ACTUAL", "AMT_TOTAL_RECEIVABLE"]]

* The mean AMT_CREDIT_LIMIT_ACTUAL is almost equal for both groups.
* The other values are slightly higher for defaulters.
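One caveat: a mean taken over the monthly rows weights applicants with longer histories more heavily than a mean over per-applicant averages. A minimal sketch of the difference on a made-up frame (`toy_cc` and its values are hypothetical):

```python
import pandas as pd

# Toy monthly snapshots: applicant 1 has 3 months of history, applicant 2 has 1
toy_cc = pd.DataFrame({
    "SK_ID_CURR":  [1, 1, 1, 2],
    "TARGET":      [1, 1, 1, 1],
    "AMT_BALANCE": [100.0, 100.0, 100.0, 500.0],
})

# Row-level mean: every monthly snapshot counts once
row_level = toy_cc.groupby("TARGET")["AMT_BALANCE"].mean()

# Applicant-level mean: average each applicant first, then average the applicants
per_applicant = (
    toy_cc.groupby(["SK_ID_CURR", "TARGET"])["AMT_BALANCE"].mean().reset_index()
)
applicant_level = per_applicant.groupby("TARGET")["AMT_BALANCE"].mean()

print(row_level[1], applicant_level[1])
```

Which of the two is appropriate depends on the question being asked; the small gaps between defaulters and non-defaulters may look different under each weighting.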

### Random sample of applicants with credit card history and TARGET=1 (default)

Each panel plots AMT_DRAWINGS_CURRENT against MONTHS_BALANCE for one randomly chosen applicant.

In [None]:
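# Draw 25 distinct defaulting applicants up front instead of re-filtering inside the loop
sample_ids = df_with_cc_info.loc[df_with_cc_info["TARGET"] == 1, "SK_ID_CURR"].sample(25)
plt.figure(figsize=(20, 20))
for i, sk_id in enumerate(sample_ids):
    plt.subplot(5, 5, i + 1)
    # Monthly snapshots of this applicant in chronological order
    df_cc_sample = df_cc[df_cc["SK_ID_CURR"] == sk_id].sort_values("MONTHS_BALANCE")
    plt.plot(df_cc_sample["MONTHS_BALANCE"], df_cc_sample["AMT_DRAWINGS_CURRENT"])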

### Random sample for TARGET=0

In [None]:
# Same plot for 25 distinct non-defaulting applicants
sample_ids = df_with_cc_info.loc[df_with_cc_info["TARGET"] == 0, "SK_ID_CURR"].sample(25)
plt.figure(figsize=(20, 20))
for i, sk_id in enumerate(sample_ids):
    plt.subplot(5, 5, i + 1)
    # Monthly snapshots of this applicant in chronological order
    df_cc_sample = df_cc[df_cc["SK_ID_CURR"] == sk_id].sort_values("MONTHS_BALANCE")
    plt.plot(df_cc_sample["MONTHS_BALANCE"], df_cc_sample["AMT_DRAWINGS_CURRENT"])

### TARGET distribution for applicants whose mean drawn amount is 0

In [None]:
df_with_cc_info[df_with_cc_info["AMT_DRAWINGS_CURRENT"] == 0]["TARGET"].value_counts()

In [None]:
print("Total applicants:", df["SK_ID_CURR"].nunique())
print("Applicants with records in the CC file:", df_cc["SK_ID_CURR"].nunique())
print("Matched applicants:", df_with_cc_info["SK_ID_CURR"].nunique())
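The same overlap can also be broken down with the `indicator` option of `merge`, which labels each ID as present in one table or both. A minimal sketch on made-up IDs; `apps` and `cards` are hypothetical stand-ins for `df` and `df_cc`:

```python
import pandas as pd

# Toy ID tables (hypothetical values)
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
cards = pd.DataFrame({"SK_ID_CURR": [2, 2, 3, 4]})

# Compare unique IDs only, so repeated monthly rows do not inflate the counts
card_ids = cards[["SK_ID_CURR"]].drop_duplicates()

# indicator=True adds a _merge column: left_only, right_only, or both
merged = apps.merge(card_ids, on="SK_ID_CURR", how="outer", indicator=True)
counts = merged["_merge"].value_counts()
print(counts)
```

Here `both` counts applicants with credit card history, `left_only` counts applicants without any, and `right_only` counts IDs that appear only in the credit card table.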