**Objective**

Welcome to this kernel, where we explore a loan dataset and look for insights in it.

First, we import the required Python libraries. The dataset used in this kernel is freely available on Kaggle (free for public use).


In [None]:

import numpy as np  # NumPy for numerical operations
import pandas as pd  # pandas for data handling
import matplotlib.pyplot as plt  # Matplotlib for visualisation
%matplotlib inline
import seaborn as sns  # Seaborn for statistical visualisation

import warnings  # Used to suppress standard warnings
warnings.filterwarnings("ignore")

# Input data files are available in the "../input/" directory.

import os  # Used to list the directory where the dataset is placed
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
mydf = pd.read_csv("../input/Loan payments data.csv")  # Read the dataset CSV
mydf.head(5)  # Show the first 5 records of the dataset

In [None]:
# Generate descriptive statistics of the dataframe (mydf) using the describe function
mydf.describe()
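
By default, `describe()` summarises only the numeric columns. The sketch below shows one way the text columns could be summarised as well. It runs on a small hypothetical frame (`toy`, with made-up rows), not on the real dataset, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for mydf (values are made up)
toy = pd.DataFrame({
    "Principal": [1000, 1000, 800],
    "education": ["college", "High School or Below", "college"],
    "Gender": ["male", "female", "male"],
})

# Excluding numeric columns leaves the text columns, which are summarised
# with count, unique, top (most frequent value) and freq
text_summary = toy.describe(exclude=[np.number])
print(text_summary)
```

On the real data, the same call would be `mydf.describe(exclude=[np.number])`.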

**Dataset Details**
The details below describe each column of the dataset:

1. Loan_id : A unique, system-generated ID assigned to each loan
2. Loan_status : Tells us whether a loan is paid off, in collection (the customer has yet to pay it off), or paid off after collection efforts
3. Principal : Principal loan amount at origination, i.e. the amount of the loan applied for
4. terms : Payoff schedule
5. Effective_date : When the loan originated (started)
6. Due_date : Date by which the loan should be paid off
7. Paidoff_time : Actual time when the loan was paid off; null means it is yet to be paid
8. Pastdue_days : How many days the loan has been past its due date
9. Age : Age of the customer
10. Education : Education level of the customer who applied for the loan
11. Gender : Customer gender (Male/Female)
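
The three date fields above arrive as text when read from CSV, so date arithmetic needs them parsed first. The sketch below is a minimal illustration on a hypothetical frame (`toy`). The lowercase column names `due_date` and `paid_off_time` and the ISO date format are assumptions, and the real file may format its dates differently.

```python
import pandas as pd

# Hypothetical rows; column names and date format are assumptions
toy = pd.DataFrame({
    "effective_date": ["2016-09-08", "2016-09-08", "2016-09-09"],
    "due_date": ["2016-10-07", "2016-09-22", "2016-10-08"],
    "paid_off_time": ["2016-09-14 19:31", None, "2016-10-12 09:00"],
})

# Parse the text columns into datetimes; missing values become NaT
for col in ["effective_date", "due_date", "paid_off_time"]:
    toy[col] = pd.to_datetime(toy[col])

# Days between origination and due date
toy["loan_length_days"] = (toy["due_date"] - toy["effective_date"]).dt.days

# Days between due date and payoff (negative means paid early, NaN means unpaid)
toy["days_late"] = (toy["paid_off_time"].dt.normalize() - toy["due_date"]).dt.days
print(toy)
```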

In [None]:
# Print a concise summary of the dataset using the pandas info function
mydf.info()

Check how many missing values we have in our dataset.

In [None]:
# The output below shows 100 null (NaN) values in paid_off_time and 300 null values in
# past_due_days. This is expected: a loan that is yet to be paid has no paid-off time,
# and a loan paid on or before its due date has no past-due days

mydf.isnull().sum()
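
One way to check that the missing values follow this pattern is to count them per loan status. The sketch below uses a hypothetical four-row frame (`toy`). The status labels and the `paid_off_time` column name are assumptions about how the real data is spelled.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; status labels and column names are assumptions
toy = pd.DataFrame({
    "loan_status": ["PAIDOFF", "PAIDOFF", "COLLECTION", "COLLECTION_PAIDOFF"],
    "paid_off_time": ["2016-09-14", "2016-10-07", None, "2016-10-20"],
    "past_due_days": [np.nan, np.nan, 30.0, 12.0],
})

# Count missing values in each column, grouped by loan status
missing_by_status = (
    toy[["paid_off_time", "past_due_days"]]
    .isnull()
    .groupby(toy["loan_status"])
    .sum()
)
print(missing_by_status)
```

If the pattern holds on the real data, missing `past_due_days` values would appear only for loans paid off on time, and missing `paid_off_time` values only for loans still in collection.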


In [None]:
# Check the dataset shape (rows, columns); the output below shows 500 rows and 11 columns
mydf.shape



**Let's do some Exploratory Data Analysis (EDA)**

Let's create some visualisations to see what the dataset tells us when we interrogate it visually.

#Todo - need to update comments section of below code , work in progress

In [None]:
sns.set(style="whitegrid")  # Use a white chart background with grid lines

In [None]:
# First, let's find out how many loans are in Paid Off, Collection or Collection_PaidOff status
x = sns.countplot(x="loan_status", data=mydf )
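
The count plot shows raw counts. The shares behind it can be read off with `value_counts(normalize=True)`, sketched here on a hypothetical series (`toy_status`) whose labels and proportions are made up for illustration.

```python
import pandas as pd

# Hypothetical status values; the real labels and proportions may differ
toy_status = pd.Series(
    ["PAIDOFF", "PAIDOFF", "PAIDOFF", "COLLECTION", "COLLECTION_PAIDOFF"],
    name="loan_status",
)

# Share of each status as a percentage, rounded to one decimal place
status_share = toy_status.value_counts(normalize=True).mul(100).round(1)
print(status_share)
```

On the real data, the equivalent would be `mydf["loan_status"].value_counts(normalize=True)`.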



In [None]:
y = sns.countplot(x="loan_status", data=mydf , hue='Gender')

In [None]:
x = sns.countplot(x="terms", data=mydf , hue='loan_status', palette='pastel', linewidth=5)

In [None]:
g = sns.catplot(x="loan_status", col="education", col_wrap=4,
                data=mydf[mydf.loan_status.notnull()],
                kind="count", height=12.5, aspect=.6)


In [None]:
ax = sns.barplot(x="Principal", y="age",hue="Gender" ,  data=mydf)
ax.legend(loc="upper right")

In [None]:
sns.set(style="whitegrid")
ax = sns.countplot(x="loan_status", hue="Gender", data=mydf ,palette='pastel' ,edgecolor=sns.color_palette("dark", 3))

In [None]:
# catplot is figure-level and creates its own figure, so no plt.figure() call is needed
g = sns.catplot(x="Principal", hue="loan_status", col="Gender", palette='pastel',
                data=mydf, kind="count",
                height=4, aspect=.7)

In [None]:
sns.pairplot(mydf, hue='Gender')

In [None]:
sns.set(style="whitegrid", palette="pastel", color_codes=True)

# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="Principal", y="terms", hue="Gender",
               split=True, inner="quart",
               data=mydf)
sns.despine(left=False)

In [None]:
g = sns.lmplot(x="age", y="Principal", hue="Gender",
               truncate=True, height=5, data=mydf)

# Use more informative axis labels than are provided by default
g.set_axis_labels("Age", "Principal")

In [None]:
# Plot age against principal, with education as colour and gender as marker size
sns.relplot(x="Principal", y="age", hue="education",size="Gender",
            sizes=(40, 400), alpha=.5, palette="muted",
            height=6, data=mydf)

In [None]:
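# A missing past_due_days value means the loan never went past its due date,
# so the share of missing values is the share of loans paid on time
perc = mydf['past_due_days'].isnull().mean() * 100
print(perc, "% of people paid before time")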

In [None]:
# Pie chart, where the slices will be ordered and plotted counter-clockwise:
labels = 'Before Due Date', 'After Due Date'
sizes = [perc,100-perc]
explode = (0, 0.1)  # only "explode" the 2nd slice 

fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

fig1.suptitle('People who paid Before Due Date or After Due Date', fontsize=16)


plt.show()

In [None]:
sns.boxplot(x='education', y='Principal', data=mydf)
plt.show()

In [None]:
sns.lmplot(x='Principal', y='age', hue = 'Gender', data=mydf, aspect=1.5, fit_reg = False)

plt.show()

In [None]:
sns.lmplot(x='Principal', y='age', hue = 'education', data=mydf, aspect=1.5, fit_reg = False)
plt.show()

In [None]:
fig = plt.figure(figsize=(15,5))
ax = sns.countplot(x="effective_date", hue="loan_status", data=mydf ,palette='pastel' ,edgecolor=sns.color_palette("dark", 3))
ax.set_title('Loan date')
ax.legend(loc='upper right')
# Label each bar with its count; bars with no data have a NaN height and are labelled 0
for t in ax.patches:
    if np.isnan(float(t.get_height())):
        ax.annotate(0, (t.get_x(), 0))
    else:
        ax.annotate(format(int(t.get_height()), ',d'), (t.get_x(), t.get_height()*1.01))
plt.show();
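
The numbers annotated on this chart can also be viewed as a table. The sketch below uses `pd.crosstab` on a hypothetical five-row frame (`toy`). The date strings and status labels are made up, so it only illustrates the shape of the result.

```python
import pandas as pd

# Hypothetical rows; dates and status labels are assumptions
toy = pd.DataFrame({
    "effective_date": ["9/8/2016", "9/8/2016", "9/9/2016", "9/9/2016", "9/9/2016"],
    "loan_status": ["PAIDOFF", "COLLECTION", "PAIDOFF", "PAIDOFF", "COLLECTION_PAIDOFF"],
})

# Rows are dates, columns are statuses, cells are loan counts;
# combinations that never occur are filled with 0
counts = pd.crosstab(toy["effective_date"], toy["loan_status"])
print(counts)
```

The zero cells correspond to the empty bars that the loop above labels with 0.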