Here we will look at preprocessing of the data to be used. It will involve transforming the data to machine understandable format. The data we are considering is a loan default data and is stored in the form of a table. 

In [1]:
import pandas as pd
import numpy as np
print('We will first read the data Sample_data.csv using pandas dataframe. Make sure that it is saved in the home directory.')
df = pd.read_csv('Sample_data.csv')
print('Data stored in dataframe df and first few rows will look like below')
df.head(3)

We will first read the data Sample_data.csv using pandas dataframe. Make sure that it is saved in the home directory.
Data stored in dataframe df and first few rows will look like below


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y


### Analysing the summary of the data

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               601 non-null object
Married              611 non-null object
Dependents           599 non-null object
Education            614 non-null object
Self_Employed        582 non-null object
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null object
Loan_Status          614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.4+ KB


### Converting the object types to categorical type data values

In [3]:
print('We do this operation in order to reduce the memory usage as evident from the summary we will get after performing the conversion.')
columns_of_df          = df.columns
column_index_to_change = [0,1,2,3,4,5,11,12]
for i in range(len(column_index_to_change)):
    indx                 = column_index_to_change[i]
    df[columns_of_df[indx]] = df[columns_of_df[indx]].astype('category')
df.info()

We do this operation in order to reduce the memory usage as evident from the summary we will get after performing the conversion.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null category
Gender               601 non-null category
Married              611 non-null category
Dependents           599 non-null category
Education            614 non-null category
Self_Employed        582 non-null category
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null category
Loan_Status          614 non-null category
dtypes: category(8), float64(4), int64(1)
memory usage: 55.0 KB


In [4]:
print('The below function takes into account the categorical column, finds the maximum occuring value in that column and imputes the null values with it.')
def categorical_missing_value_imputation(df,col_name):
    max_value = df[col_name].mode().iloc[0]
    df[col_name]= df[col_name].fillna(max_value)
    return df


The below function takes into account the categorical column, finds the maximum occuring value in that column and imputes the null values with it.


In [5]:
print('Suppose we want to perform the categorical value imputation for the 4 columns mentioned in the list.')
columns = ['Married','Dependents','Education','Self_Employed']
for i in range(len(columns)):
    df = categorical_missing_value_imputation(df,columns[i])
print('After performing missing value imputation of the categorical values, we have the following summary')
df.info()

Suppose we want to perform the categorical value imputation for the 4 columns mentioned in the list.
After performing missing value imputation of the categorical values, we have the following summary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null category
Gender               601 non-null category
Married              614 non-null category
Dependents           614 non-null category
Education            614 non-null category
Self_Employed        614 non-null category
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null category
Loan_Status          614 non-null category
dtypes: category(8), float64(4), int64(1)
memory usage: 55.0 KB


### Dropping the observations from data having null values which can't be imputed. 
In our example, we have null values in the column Gender and it won't make any sense to impute the maximum recurring gender to the missing values, we hence drop all the observations where the gender is missing.

In [6]:
df.dropna(subset= ['Gender'],inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 601 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              601 non-null category
Gender               601 non-null category
Married              601 non-null category
Dependents           601 non-null category
Education            601 non-null category
Self_Employed        601 non-null category
ApplicantIncome      601 non-null int64
CoapplicantIncome    601 non-null float64
LoanAmount           579 non-null float64
Loan_Amount_Term     587 non-null float64
Credit_History       552 non-null float64
Property_Area        601 non-null category
Loan_Status          601 non-null category
dtypes: category(8), float64(4), int64(1)
memory usage: 59.0 KB


### Converting the dataframe to 2-d array format and separating the label (Y) from the actual set of observation.
Every dataset will have a target variable which is generally used for supervised learning. We find our X matrix from the data and also the Y vector from the data on which we will perform further calculations

In [7]:

subdf = df.iloc[:,1:-1]
X = subdf.values
print('We drop the first column because Loan_Term will not be used in any form of analysis we perform')
y = df.iloc[:,-1].values
print('The X matrix(or the independent column set of the data) is given by')
print(X)
print('\n')
print('The dependent vector is given by ')
print(y)

We drop the first column because Loan_Term will not be used in any form of analysis we perform
The X matrix(or the independent column set of the data) is given by
[['Male' 'No' '0' ... 360.0 1.0 'Urban']
 ['Male' 'Yes' '1' ... 360.0 1.0 'Rural']
 ['Male' 'Yes' '0' ... 360.0 1.0 'Urban']
 ...
 ['Male' 'Yes' '1' ... 360.0 1.0 'Urban']
 ['Male' 'Yes' '2' ... 360.0 1.0 'Urban']
 ['Female' 'No' '0' ... 360.0 0.0 'Semiurban']]


The dependent vector is given by 
[Y, N, Y, Y, Y, ..., Y, Y, Y, Y, N]
Length: 601
Categories (2, object): [N, Y]


### Null value imputation 
We impute the Null values with meaningful values. In case of numerical ones, we may take the mean of the remaining observations and substitute the null values with the mean (or an measure of central tendency). <br>
In case of categorical values, we may prefer to take the mode. <br>
We can drop the observation, in case we are not able to find a suitable treatment for the null values in a particular column of that observation.

In [8]:
from sklearn.preprocessing import Imputer

In [9]:
def Missing_value_imputation(X,col_lb,col_ub,miss_strategy,constant = 10):
    if miss_strategy =='constant':
        imputer = Imputer(missing_values='NaN',strategy=miss_strategy,fill_value = value )
    else:
        imputer = Imputer(missing_values='NaN',strategy=miss_strategy,axis = 0)
    X[:,col_lb:col_ub] = imputer.fit_transform(X[:,col_lb:col_ub])
    return X

In [10]:
print('We create a column dictionary to know what column is occuring at what position.')
def Dictionary_column_names(df):
    keys = df.columns.tolist()
    values = list(range(len(keys)))
    column_dictionary = dict(zip(keys,values))
    return column_dictionary

We create a column dictionary to know what column is occuring at what position.


In [11]:
col_dicts = Dictionary_column_names(subdf)
print('We get the following dictionary of column names:- ',col_dicts)

We get the following dictionary of column names:-  {'Gender': 0, 'Married': 1, 'Dependents': 2, 'Education': 3, 'Self_Employed': 4, 'ApplicantIncome': 5, 'CoapplicantIncome': 6, 'LoanAmount': 7, 'Loan_Amount_Term': 8, 'Credit_History': 9, 'Property_Area': 10}


In [12]:
print('We see the summary again to find which numerical columns to impute')
df.info()

We see the summary again to find which numerical columns to impute
<class 'pandas.core.frame.DataFrame'>
Int64Index: 601 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              601 non-null category
Gender               601 non-null category
Married              601 non-null category
Dependents           601 non-null category
Education            601 non-null category
Self_Employed        601 non-null category
ApplicantIncome      601 non-null int64
CoapplicantIncome    601 non-null float64
LoanAmount           579 non-null float64
Loan_Amount_Term     587 non-null float64
Credit_History       552 non-null float64
Property_Area        601 non-null category
Loan_Status          601 non-null category
dtypes: category(8), float64(4), int64(1)
memory usage: 59.0 KB


In [13]:
print('We can clearly see we can do missing value imputation in the column LoanAmount,Loan_Amount_Term and Credit_History.')
col_lb   = col_dicts['LoanAmount']
col_ub   = col_dicts['Property_Area'] #We need to consider till Loan_Amount_Term and hence we need an upperbound that is exclusive
strategy = 'mean'
X = Missing_value_imputation(X,col_lb,col_ub,strategy)
print('We were able to perform the mean value imputation of missing values in the stated columns and the matrix X will be ')
print(X)

We can clearly see we can do missing value imputation in the column LoanAmount,Loan_Amount_Term and Credit_History.
We were able to perform the mean value imputation of missing values in the stated columns and the matrix X will be 
[['Male' 'No' '0' ... 360.0 1.0 'Urban']
 ['Male' 'Yes' '1' ... 360.0 1.0 'Rural']
 ['Male' 'Yes' '0' ... 360.0 1.0 'Urban']
 ...
 ['Male' 'Yes' '1' ... 360.0 1.0 'Urban']
 ['Male' 'Yes' '2' ... 360.0 1.0 'Urban']
 ['Female' 'No' '0' ... 360.0 0.0 'Semiurban']]


We can also do the missing value imputation at dataframe stage, I did it over here to demonstrate the usage of Imputer module of sklearn package

### Converting Categorical columns to Numerical 
We do this in order to make the strings to machine understandable numerical values

In [14]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [15]:
def label_encoding(column_number,X):
    label_encoder = LabelEncoder()
    X[:,column_number] = label_encoder.fit_transform(X[:,column_number])
    return X

In [16]:
categorical_columns = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area']
for i in range(len(categorical_columns)):
    X = label_encoding(col_dicts[categorical_columns[i]],X)


In [17]:
print('Matrix X is now entirely numerical. We can see that by printing any arbitrary row of X ')
print(X[1])

Matrix X is now entirely numerical. We can see that by printing any arbitrary row of X 
[1 1 1 0 0 4583 1508.0 128.0 360.0 1.0 0]


### One hot encoding the categorical values

In [18]:
def one_hot_encoding(column_number,X):
    onehotencoder = OneHotEncoder(categorical_features = [column_number])
    X = onehotencoder.fit_transform(X).toarray()
    return X

In [19]:
for i in range(len(categorical_columns)):
    X = one_hot_encoding(col_dicts[categorical_columns[i]],X)


In [20]:
print('One hot encoding is performed for all the categorical columns.')

One hot encoding is performed for all the categorical columns.


### Label encoding for categorical values

In [21]:
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

### Splitting the data to train test split

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [23]:
print('The training X matrix is ')
print(X_train)
print('The corresponding Y vector is ')
print(y_train)

The training X matrix is 
[[  0.   0.   0. ... 360.   1.   0.]
 [  0.   0.   0. ... 360.   1.   0.]
 [  0.   0.   0. ... 360.   1.   0.]
 ...
 [  0.   0.   0. ... 360.   1.   0.]
 [  0.   0.   0. ... 360.   1.   0.]
 [  0.   0.   0. ... 360.   0.   2.]]
The corresponding Y vector is 
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 0 1
 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 0 1 1 0 1 0 1 1 1 0 1 0 1 1 1 1 1 0 1
 1 0 0 1 1 1 1 0 1 0 1 0 1 0 1 1 1 1 1 1 0 1 1 0 1 0 1 1 1 1 0 0 1 1 0 0 1
 0 1 1 1 0 1 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 0 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 0 0
 1 1 1 0 1 1 1 1 1 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 0
 1 1 0 0 0 0 0 0 1 1 1 0 0 1 0 1 1 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
 1 0 1 0 0 0 1 1 1 1 1 1 1 0 0 0 0 1 0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 0 1 1 1
 1 0 0 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1
 0 1 1 1 1 1 1 1 1 1 0 1 1 0 0 0 1 0 0 1