### Boring Setups

In [256]:
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

In [257]:
dtrain = pd.read_csv('data/train.csv')
dtest = pd.read_csv('data/test.csv')

### Data sizes

In [258]:
dtrain.shape

(614, 13)

In [259]:
dtest.shape

(367, 12)

### Data at a Glance

In [260]:
dtrain.head()

|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


### Features and their Distributions

##### 1. Feature Names

In [261]:
dtrain.columns.values

array(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome',
       'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area',
       'Loan_Status'], dtype=object)

##### 2. Null Counts

In [291]:
nullCount = [dtrain[cols].isnull().sum() for cols in dtrain.columns.values]
pd.DataFrame({'Features': dtrain.columns.values, 'Nulls': nullCount})

|   | Features | Nulls |
|---|---|---|
| 0 | Loan_ID | 0 |
| 1 | Gender | 0 |
| 2 | Married | 0 |
| 3 | Dependents | 0 |
| 4 | Education | 0 |
| 5 | Self_Employed | 0 |
| 6 | ApplicantIncome | 0 |
| 7 | CoapplicantIncome | 0 |
| 8 | LoanAmount | 0 |
| 9 | Loan_Amount_Term | 0 |
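
The same per-column null count can be obtained without a list comprehension. The sketch below is one possible way to do it; the `toy` frame and its values are made up for illustration and stand in for `dtrain`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for dtrain (illustrative values only).
toy = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female"],
    "LoanAmount": [np.nan, 128.0, np.nan],
    "Education": ["Graduate", "Graduate", "Not Graduate"],
})

# isnull() flags missing cells; sum() counts the flags per column.
null_summary = (
    toy.isnull().sum()
       .rename_axis("Features")
       .reset_index(name="Nulls")
)
print(null_summary)
```

Running this before any cleaning step shows which columns actually need imputation.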


##### 3. Data Types & Type Counts

In [294]:
dtrain.dtypes

Loan_ID               object
Gender                object
Married                int64
Dependents            object
Education             object
Self_Employed          int64
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status            int64
dtype: object

In [264]:
dtrain_type = dtrain.dtypes.reset_index()
dtrain_type.columns = ["Count", "Type"]
dtrain_type.groupby('Type').aggregate('count').reset_index()

|   | Type | Count |
|---|---|---|
| 0 | int64 | 1 |
| 1 | float64 | 4 |
| 2 | object | 8 |
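
A shorter route to the same type tally is to count the values of `dtypes` directly. This is a minimal sketch on a made-up `toy` frame, not the notebook's data.

```python
import pandas as pd

# Toy frame with one integer, one float and two text columns (illustrative only).
toy = pd.DataFrame({
    "a": pd.Series([1, 2], dtype="int64"),
    "b": pd.Series([0.5, 1.5], dtype="float64"),
    "c": ["x", "y"],
    "d": ["u", "v"],
})

# Convert each dtype to its name, then count how often each name occurs.
type_counts = toy.dtypes.astype(str).value_counts()
print(type_counts)
```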


##### 4. Unique values for categorical Features

In [265]:
categorical_features = ['Gender', 'Married', 
                        'Education', 'Self_Employed', 'Property_Area']
for each in categorical_features:
    print(each + str(dtrain[each].unique()).
          rjust(len(max(dtrain.columns.values.tolist(),
                        key=len)) + 30 - len(each)))

Gender                    ['Male' 'Female' nan]
Married                        ['No' 'Yes' nan]
Education           ['Graduate' 'Not Graduate']
Self_Employed                  ['No' 'Yes' nan]
Property_Area     ['Urban' 'Rural' 'Semiurban']


### Data Cleaning

##### 1. Cleaning NaN values

In [266]:
dtrain.Dependents.unique()

array(['0', '1', '2', '3+', nan], dtype=object)

Let's handle the variable **```Dependents```**. Its unique values are 0, 1, 2, 3+ and NaN. We assume that entries with NaN have no dependents, so we fill NaN with 0 for this column.

In [267]:
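# Assign the result back; inplace=True on an attribute access may not update dtrain in newer pandas
dtrain['Dependents'] = dtrain.Dependents.fillna(0)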

In [268]:
dtrain.Self_Employed.unique()

array(['No', 'Yes', nan], dtype=object)

Next column is **```Self_Employed```**, which is **Yes** for self-employed applicants and **No** otherwise. A NaN here may mean the applicant is not employed at all, so we fill NaNs with NE (Not Employed). We cannot afford to drop the entries with NaN because the training set has only 614 data points, and losing any of them could hurt.

In [269]:
dtrain['Self_Employed'] = dtrain.Self_Employed.fillna('NE')

Next is **```LoanAmount```**. This is a very important feature, but it still contains some NaNs. We fill them with the *median* of ```LoanAmount```.

In [270]:
dtrain['LoanAmount'] = dtrain.LoanAmount.fillna(dtrain.LoanAmount.median())

In [271]:
dtrain.Credit_History.unique()

array([  1.,   0.,  nan])

We handle **```Credit_History```** and **```Loan_Amount_Term```** in the same way, filling the NaNs with the column median.

In [272]:
dtrain['Credit_History'] = dtrain.Credit_History.fillna(dtrain.Credit_History.median())
dtrain['Loan_Amount_Term'] = dtrain.Loan_Amount_Term.fillna(dtrain.Loan_Amount_Term.median())
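
The cells above only clean `dtrain`. One possible way to give the test set the same treatment is to compute the medians once on the training data and reuse them. The sketch below assumes this is the intended behaviour; the `train` and `test` frames are made-up stand-ins for `dtrain` and `dtest`.

```python
import numpy as np
import pandas as pd

# Stand-ins for dtrain and dtest (illustrative values only).
train = pd.DataFrame({"LoanAmount": [100.0, np.nan, 140.0, 120.0]})
test = pd.DataFrame({"LoanAmount": [np.nan, 90.0]})

numeric_cols = ["LoanAmount"]

# median() skips NaN by default, so this is the median of the observed values.
medians = train[numeric_cols].median()

# Passing a Series to fillna fills each column with the value under its own label.
train[numeric_cols] = train[numeric_cols].fillna(medians)
test[numeric_cols] = test[numeric_cols].fillna(medians)
```

Reusing the training medians keeps the test set from influencing the imputation values.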

In [273]:
dtrain.Gender.unique()

array(['Male', 'Female', nan], dtype=object)

**```Gender```** would normally take two values, 'Male' and 'Female', but here the column also contains NaN. We treat the NaN entries as a third category, 'Trans', rather than dropping them.

In [274]:
dtrain['Gender'] = dtrain.Gender.fillna('Trans')
dtrain.Gender.unique()

array(['Male', 'Female', 'Trans'], dtype=object)

In [275]:
dtrain.Married.unique()

array(['No', 'Yes', nan], dtype=object)

Similarly, we fill the NaNs in **```Married```** with 'Other'.

In [276]:
dtrain['Married'] = dtrain.Married.fillna('Other')

##### 2. Storing data for EDA

In [277]:
eda_data = dtrain.copy(deep=True)

##### 3. Encoding Categorical Variables

We convert all categorical variables into numeric labels so that machine learning models can process them.

In [289]:
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
number = LabelEncoder()
bi_number = LabelBinarizer()
dtrain['Married'] = number.fit_transform(dtrain.Married.astype('str'))
dtrain['Loan_Status'] = bi_number.fit_transform(dtrain.Loan_Status.astype('str'))
dtrain['Self_Employed'] = number.fit_transform(dtrain.Self_Employed.astype('str'))
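
To see which integer each category receives, it helps to inspect the encoder's `classes_` attribute after fitting. The sketch below uses a made-up list of `Married` values; `LabelEncoder` sorts the classes, so the codes follow alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values, including the 'Other' label used for missing entries.
married = ["No", "Yes", "Other", "Yes"]

encoder = LabelEncoder()
codes = encoder.fit_transform(married)

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes)
```

Because the notebook reuses the same `number` encoder for several columns, `classes_` always reflects the most recently fitted column.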

For *Property_Area*, *Gender* and *Education*, we replace each category with the ratio of its approved loans (Loan_Status of 1) to its total number of loans.

In [279]:
property_counts = dtrain.groupby('Property_Area').Loan_Status.value_counts()
gender_counts = dtrain.groupby('Gender').Loan_Status.value_counts()
edu_count = dtrain.groupby('Education').Loan_Status.value_counts()

In [280]:
# groupby(...).value_counts() lists the groups alphabetically and, within each
# group, the counts in descending order. The positions below rely on that order
# and assume Loan_Status == 1 is the majority class in every group.

### For Property_Area ###
dtrain.loc[dtrain.Property_Area == 'Rural', 'Property_Area'] = (
    property_counts.iloc[0] / (property_counts.iloc[0] + property_counts.iloc[1]))
dtrain.loc[dtrain.Property_Area == 'Semiurban', 'Property_Area'] = (
    property_counts.iloc[2] / (property_counts.iloc[2] + property_counts.iloc[3]))
dtrain.loc[dtrain.Property_Area == 'Urban', 'Property_Area'] = (
    property_counts.iloc[4] / (property_counts.iloc[4] + property_counts.iloc[5]))

### For Gender ###
dtrain.loc[dtrain.Gender == 'Male', 'Gender'] = (
    gender_counts.iloc[2] / (gender_counts.iloc[2] + gender_counts.iloc[3]))
dtrain.loc[dtrain.Gender == 'Female', 'Gender'] = (
    gender_counts.iloc[0] / (gender_counts.iloc[0] + gender_counts.iloc[1]))
dtrain.loc[dtrain.Gender == 'Trans', 'Gender'] = (
    gender_counts.iloc[4] / (gender_counts.iloc[4] + gender_counts.iloc[5]))

### For Education ###
dtrain.loc[dtrain.Education == 'Graduate', 'Education'] = (
    edu_count.iloc[0] / (edu_count.iloc[0] + edu_count.iloc[1]))
dtrain.loc[dtrain.Education == 'Not Graduate', 'Education'] = (
    edu_count.iloc[2] / (edu_count.iloc[2] + edu_count.iloc[3]))
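
Since `Loan_Status` is already 0/1 at this point, the approval ratio of a category is simply the mean of `Loan_Status` within it. The sketch below shows that alternative on a made-up `toy` frame; the column name `Property_Area_rate` is hypothetical, and the approach does not depend on the order in which counts are listed.

```python
import pandas as pd

# Toy data standing in for dtrain (illustrative values only).
toy = pd.DataFrame({
    "Property_Area": ["Urban", "Urban", "Rural", "Rural", "Rural", "Semiurban"],
    "Loan_Status":   [1, 0, 1, 1, 0, 1],
})

# Mean of a 0/1 column per group = approved loans / total loans in that group.
approval_rate = toy.groupby("Property_Area")["Loan_Status"].mean()

# Look up each row's category in the per-group rates.
toy["Property_Area_rate"] = toy["Property_Area"].map(approval_rate)
print(toy)
```

The same `approval_rate` lookup could later be mapped onto the test set, which has no `Loan_Status` column of its own.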

We replace the **```Dependents```** value **3+** with **6**. We assume that the number of dependents in a family rarely exceeds 9 or 10, and the mean of 3 and 9 is 6, so we use 6.

In [285]:
dtrain['Dependents'] = dtrain.Dependents.replace('3+', 6)
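
The whole `Dependents` clean-up (fill NaN, map 3+ to 6, convert to integers) can also be written as one chain. This is a sketch on a made-up Series, using the same choices of 0 and 6 as above.

```python
import numpy as np
import pandas as pd

# Illustrative values covering every case seen in the column.
dependents = pd.Series(["0", "1", "3+", np.nan, "2"])

cleaned = (
    dependents
    .fillna("0")         # missing is assumed to mean no dependents
    .replace("3+", "6")  # open-ended bucket mapped to the chosen value 6
    .astype(int)         # every entry is now a digit string
)
print(cleaned.tolist())
```

Keeping every value a string until the final cast avoids a column that mixes strings and integers.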

In [292]:
dtrain.head()

|   | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | 0.693252 | 0 | 0 | 0.708333 | 1 | 5849 | 0.0 | 128.0 | 360.0 | 1.0 | 0.658416 | 1 |
| 1 | LP001003 | 0.693252 | 2 | 1 | 0.708333 | 1 | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | 0.614525 | 0 |
| 2 | LP001005 | 0.693252 | 2 | 0 | 0.708333 | 2 | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | 0.658416 | 1 |
| 3 | LP001006 | 0.693252 | 2 | 0 | 0.61194 | 1 | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | 0.658416 | 1 |
| 4 | LP001008 | 0.693252 | 0 | 0 | 0.708333 | 1 | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | 0.658416 | 1 |


##### 4. DataType Conversion

In [299]:
dtrain['Married'] = dtrain.Married.astype(int)
dtrain['Gender'] = dtrain.Gender.astype(float)
dtrain['Dependents'] = dtrain.Dependents.astype(int)
# Education now holds approval ratios (e.g. 0.708333); casting to int would truncate them to 0
dtrain['Education'] = dtrain.Education.astype(float)
dtrain['Property_Area'] = dtrain.Property_Area.astype(float)

In [304]:
# Write the cleaned train data to a new CSV; index=False avoids an extra unnamed index column
dtrain.to_csv('data/cleaned_train.csv', index=False)

### Univariate Analysis

##### 1. Distribution of the Loan Amount