## LOAN PREDICTION PROJECT SOLUTION

### PROBLEM STATEMENT:
*Predict whether a housing loan application will be approved.*

### KEY CONTENTS:

**1. DATA LOADING**

**2. EXPLORATORY DATA ANALYSIS**

    2.1. FEATURE and TARGET VARIABLE DETAILS    
    2.2. CHANGE CATEGORICAL FEATURE DATA TO NUMERICAL DATA
    2.3. FILL MISSING FEATURE VALUES
        2.3.1. FILL 'NAN' MARRIED COLUMN VALUES
        2.3.2. FILL 'NAN' GENDER COLUMN VALUES
        2.3.3. FILL 'NAN' DEPENDENTS COLUMN VALUES
        2.3.4. FILL 'NAN' SELF_EMPLOYED COLUMN VALUES
        2.3.5. FILL 'NAN' LOANAMOUNT COLUMN VALUES
        2.3.6. FILL 'NAN' CREDIT_HISTORY COLUMN VALUES
        2.3.7. FILL 'NAN' LOAN_AMOUNT_TERM COLUMN VALUES
 
 **3. USE GETDUMMIES for 'Dependents' column and 'Property_Area' column**
 
 **4. ANALYSE TARGET variable & address Class Imbalance problem**
 
 **5. Model using 'ALL' features in dataset**
 
     5.1. Use KNN Classifier model
     5.2. Use Logistic Regression model
     
 **6. Model using only 1 feature (Credit_History) in dataset**
 
     6.1. Use KNN Classifier model
     6.2. Use Logistic Regression model
     6.3. Use different ALPHA factor & l1 penalty to improve accuracy
     
 **7. Model using 3 features (Credit_History, Married, Education) in dataset**
 
     7.1. Use Logistic Regression model

### 1. DATA LOADING

In [1]:
# Import pandas, seaborn and matplotlib
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

%matplotlib inline

loan = pd.read_csv('train_original.csv')

### 2. EXPLORATORY DATA ANALYSIS

In [2]:
# A look at the first 10 rows
loan.head(10)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
5,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y
6,LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Urban,Y
7,LP001014,Male,Yes,3+,Graduate,No,3036,2504.0,158.0,360.0,0.0,Semiurban,N
8,LP001018,Male,Yes,2,Graduate,No,4006,1526.0,168.0,360.0,1.0,Urban,Y
9,LP001020,Male,Yes,1,Graduate,No,12841,10968.0,349.0,360.0,1.0,Semiurban,N


In [3]:
# Information on data columns
loan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               601 non-null object
Married              611 non-null object
Dependents           599 non-null object
Education            614 non-null object
Self_Employed        582 non-null object
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null object
Loan_Status          614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.4+ KB


In [4]:
# Data Size
loan.shape

(614, 13)

In [5]:
loan_copy1 = loan.copy()

#### 2.1. FEATURE & TARGET DETAILS

_TARGET variable:_
 * LOAN_STATUS
 
_FEATURE DETAILS:_

FEATURE_NAME | DATA TYPE | COUNT OF MISSING VALUES
-------------|-----------|------------------------
GENDER | CATEGORICAL | 13
MARRIED | CATEGORICAL | 3
DEPENDENTS | CATEGORICAL | 15
EDUCATION | CATEGORICAL | 0
SELF_EMPLOYED | CATEGORICAL | 32
APPLICANTINCOME | CONTINUOUS | 0
COAPPLICANTINCOME | CONTINUOUS | 0
LOANAMOUNT | CONTINUOUS | 22
LOAN_AMOUNT_TERM | CONTINUOUS | 14
CREDIT_HISTORY | CATEGORICAL | 50
PROPERTY_AREA | CATEGORICAL | 0
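
The missing-value counts in the table can be derived from the data rather than read off `info()` by hand. The following is a minimal sketch on a made-up toy frame (`toy` and `missing_summary` are illustrative names, not part of the notebook); on the real data the same two calls would be applied to `loan`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `loan`; the values are made up.
toy = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "Married": ["Yes", "No", np.nan, np.nan],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
})

# Count NaN entries per column and keep the dtype alongside for reference.
missing_summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isnull().sum(),
})
print(missing_summary)
```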

#### 2.2. CHANGE CATEGORICAL FEATURE DATA TO NUMERICAL DATA

In [6]:
# Gender column has 13 missing values
print("GENDER VALUES : BEFORE CONVERTING TO BINARY INTEGER FORM:\n",loan_copy1.Gender.value_counts(dropna=False))

# As it is categorical, convert the data to numeric, by preserving the Nan
loan_copy1.loc[:,'Gender']=loan_copy1.Gender.map(lambda x: 1.0 if x == 'Male' else 0.0 if x == 'Female' else x)

#check the transformed values
print("GENDER VALUES : AFTER CONVERTING TO BINARY INTEGER FORM:\n",loan_copy1.Gender.value_counts(dropna=False))


GENDER VALUES : BEFORE CONVERTING TO BINARY INTEGER FORM:
 Male      489
Female    112
NaN        13
Name: Gender, dtype: int64
GENDER VALUES : AFTER CONVERTING TO BINARY INTEGER FORM:
  1.0    489
 0.0    112
NaN      13
Name: Gender, dtype: int64


In [7]:
# Married column has 3 missing values
print("MARRIED VALUES : BEFORE CONVERTING TO BINARY INTEGER FORM:\n",loan_copy1.Married.value_counts(dropna=False))

# As it is categorical, convert the data to numeric, by preserving the Nan
loan_copy1.loc[:,'Married']=loan_copy1.Married.map(lambda x: 1.0 if x == 'Yes' else 0.0 if x == 'No' else x)

#check the transformed values
print("MARRIED VALUES : AFTER CONVERTING TO BINARY INTEGER FORM:\n",loan_copy1.Married.value_counts(dropna=False))

MARRIED VALUES : BEFORE CONVERTING TO BINARY INTEGER FORM:
 Yes    398
No     213
NaN      3
Name: Married, dtype: int64
MARRIED VALUES : AFTER CONVERTING TO BINARY INTEGER FORM:
  1.0    398
 0.0    213
NaN       3
Name: Married, dtype: int64


In [8]:
# Dependent column has 15 missing values
loan_copy1.Dependents.value_counts(dropna=False)

print("DEPENDENTS VALUES : BEFORE CONVERTING TO BINARY INTEGER FORM:\n",loan_copy1.Dependents.value_counts(dropna=False))

# As it is categorical, convert the data to numeric, by preserving the Nan
loan_copy1.loc[:,'Dependents']=loan_copy1.Dependents.map(lambda x: 0.0 if x == '0' else 1.0 if x == '1' else 2.0 if x == '2' else 3.0 if x == '3+' else x)

#check the transformed values
print("DEPENDENTS VALUES : AFTER CONVERTING TO BINARY INTEGER FORM:\n",loan_copy1.Dependents.value_counts(dropna=False))

DEPENDENTS VALUES : BEFORE CONVERTING TO BINARY INTEGER FORM:
 0      345
1      102
2      101
3+      51
NaN     15
Name: Dependents, dtype: int64
DEPENDENTS VALUES : AFTER CONVERTING TO BINARY INTEGER FORM:
  0.0    345
 1.0    102
 2.0    101
 3.0     51
NaN      15
Name: Dependents, dtype: int64


In [9]:
loan_copy1.Self_Employed.value_counts(dropna=False)
# Self_Employed column has 32 missing values
print("Self_Employed VALUES : BEFORE CONVERTING TO BINARY INTEGER FORM:\n",loan_copy1.Self_Employed.value_counts(dropna=False))

# As it is categorical, convert the data to numeric, by preserving the Nan
loan_copy1.loc[:,'Self_Employed']=loan_copy1.Self_Employed.map(lambda x: 1.0 if x == 'Yes' else 0.0 if x == 'No' else x)

#check the transformed values
print("Self_Employed VALUES : AFTER CONVERTING TO BINARY INTEGER FORM:\n",loan_copy1.Self_Employed.value_counts(dropna=False))

Self_Employed VALUES : BEFORE CONVERTING TO BINARY INTEGER FORM:
 No     500
Yes     82
NaN     32
Name: Self_Employed, dtype: int64
Self_Employed VALUES : AFTER CONVERTING TO BINARY INTEGER FORM:
  0.0    500
 1.0     82
NaN      32
Name: Self_Employed, dtype: int64


In [10]:
loan_copy1.Education.value_counts(dropna=False)
# Education column has no missing values
print("Education VALUES : BEFORE CONVERTING TO BINARY INTEGER FORM:\n",loan_copy1.Education.value_counts(dropna=False))

# As it is categorical, convert the data to numeric, by preserving the Nan
loan_copy1.loc[:,'Education']=loan_copy1.Education.map(lambda x: 1.0 if x == 'Graduate' else 0.0 if x == 'Not Graduate' else x)

#check the transformed values
#print("Education VALUES : AFTER CONVERTING TO BINARY INTEGER FORM:\n",loan_copy1.Education.value_counts(dropna=False))

Education VALUES : BEFORE CONVERTING TO BINARY INTEGER FORM:
 Graduate        480
Not Graduate    134
Name: Education, dtype: int64


In [11]:
loan_copy1.Property_Area.value_counts(dropna=False)


Semiurban    233
Urban        202
Rural        179
Name: Property_Area, dtype: int64

In [12]:
# Property_Area column has no missing values
print("Property_Area VALUES : BEFORE CONVERTING TO BINARY INTEGER FORM:\n",loan_copy1.Property_Area.value_counts(dropna=False))

# As it is categorical, convert the data to numeric, by preserving the Nan
loan_copy1.loc[:,'Property_Area']=loan_copy1.Property_Area.map(lambda x: 1.0 if x == 'Urban' else 2.0 if x == 'Semiurban' else 3.0 if x == 'Rural' else x)

#check the transformed values
print("Property_Area VALUES : AFTER CONVERTING TO BINARY INTEGER FORM:\n",loan_copy1.Property_Area.value_counts(dropna=False))

Property_Area VALUES : BEFORE CONVERTING TO BINARY INTEGER FORM:
 Semiurban    233
Urban        202
Rural        179
Name: Property_Area, dtype: int64
Property_Area VALUES : AFTER CONVERTING TO BINARY INTEGER FORM:
 2.0    233
1.0    202
3.0    179
Name: Property_Area, dtype: int64


In [13]:
# Loan_Status column has no missing values
print("Loan_Status VALUES : BEFORE CONVERTING TO BINARY INTEGER FORM:\n",loan_copy1.Loan_Status.value_counts(dropna=False))

# As it is categorical, convert the data to numeric, by preserving the Nan
loan_copy1.loc[:,'Loan_Status']=loan_copy1.Loan_Status.map(lambda x: 1.0 if x == 'Y' else 0.0 if x == 'N' else x)

#check the transformed values
print("Loan_Status VALUES : AFTER CONVERTING TO BINARY INTEGER FORM:\n",loan_copy1.Loan_Status.value_counts(dropna=False))

Loan_Status VALUES : BEFORE CONVERTING TO BINARY INTEGER FORM:
 Y    422
N    192
Name: Loan_Status, dtype: int64
Loan_Status VALUES : AFTER CONVERTING TO BINARY INTEGER FORM:
 1.0    422
0.0    192
Name: Loan_Status, dtype: int64


In [14]:
loan_copy1.Credit_History.value_counts(dropna=False)

# no categoric to numeric transformation required for Credit History

loan_copy1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               601 non-null float64
Married              611 non-null float64
Dependents           599 non-null float64
Education            614 non-null float64
Self_Employed        582 non-null float64
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null float64
Loan_Status          614 non-null float64
dtypes: float64(11), int64(1), object(1)
memory usage: 62.4+ KB


In [15]:
# Find the correlation
loan_copy1.corr()
#loan_cust_demo_df[loan_cust_demo_df.Gender.isnull()]

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
Gender,1.0,0.369612,0.17597,-0.049258,-0.009829,0.053989,0.083946,0.106947,-0.075117,0.016337,0.024556,0.019857
Married,0.369612,1.0,0.343417,-0.014223,0.001909,0.051332,0.07777,0.149519,-0.10381,0.004381,-0.002918,0.08928
Dependents,0.17597,0.343417,1.0,-0.059161,0.057867,0.118679,0.027259,0.163997,-0.100484,-0.050082,-0.006828,0.006781
Education,-0.049258,-0.014223,-0.059161,1.0,0.012333,0.14076,0.06229,0.171133,0.078784,0.081822,-0.065243,0.085884
Self_Employed,-0.009829,0.001909,0.057867,0.012333,1.0,0.140826,-0.011152,0.123931,-0.037069,0.003883,0.031214,-0.002303
ApplicantIncome,0.053989,0.051332,0.118679,0.14076,0.140826,1.0,-0.116605,0.570909,-0.045306,-0.014715,0.0095,-0.00471
CoapplicantIncome,0.083946,0.07777,0.027259,0.06229,-0.011152,-0.116605,1.0,0.188619,-0.059878,-0.002056,-0.010522,-0.059187
LoanAmount,0.106947,0.149519,0.163997,0.171133,0.123931,0.570909,0.188619,1.0,0.039447,-0.008433,0.045792,-0.037318
Loan_Amount_Term,-0.075117,-0.10381,-0.100484,0.078784,-0.037069,-0.045306,-0.059878,0.039447,1.0,0.00147,0.078748,-0.021268
Credit_History,0.016337,0.004381,-0.050082,0.081822,0.003883,-0.014715,-0.002056,-0.008433,0.00147,1.0,0.001969,0.561678
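
Since the correlation matrix is used later to pick predictor columns, it can help to rank features by the strength of their correlation with a single column. A minimal sketch on made-up data follows; `toy` and `ranked` are illustrative names, and on the real data the same expression would start from `loan_copy1.corr()`.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Credit_History":  [1, 1, 0, 1, 0, 1, 0, 1],
    "ApplicantIncome": [5, 3, 4, 6, 5, 2, 4, 3],
    "Loan_Status":     [1, 1, 0, 1, 0, 1, 0, 0],
})

# Take the target's column of the correlation matrix, drop the
# self-correlation, and sort by absolute value (strongest first).
ranked = (
    toy.corr()["Loan_Status"]
    .drop("Loan_Status")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```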


#### 2.3. FILL MISSING FEATURE VALUES

##### 2.3.1. FILL 'NAN' MARRIED COLUMN VALUES

In [16]:
loan_copy1[loan_copy1.Married.isnull()]


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
104,LP001357,1.0,,,1.0,0.0,3816,754.0,160.0,360.0,1.0,1.0,1.0
228,LP001760,1.0,,,1.0,0.0,4758,0.0,158.0,480.0,1.0,2.0,1.0
435,LP002393,0.0,,,1.0,0.0,10047,0.0,,240.0,1.0,2.0,1.0


In [17]:
#-----------------------------------------------------
# FILL MARRIED COL VALUES BASED ON GENDER VALUES
#-----------------------------------------------------

valid_married_df = loan_copy1.loc[~loan_copy1.Married.isnull(),['Married','Gender']]
marr_valid = valid_married_df.loc[~valid_married_df.Gender.isnull(),['Married','Gender']]
print(type(marr_valid))
print(marr_valid.shape)



<class 'pandas.core.frame.DataFrame'>
(598, 2)


In [18]:
from sklearn.neighbors import KNeighborsClassifier


In [19]:
#Predictor matrix: Gender
#Target variable: Married
marr_pred_col = ['Gender']
X_marr_valid = marr_valid[marr_pred_col]
y_marr_valid = marr_valid['Married'].values
print(type(y_marr_valid))
print(type(X_marr_valid))


<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>


In [20]:
# KNN classifier instantiated and fitted to the Predictor matrix and target variable to determine missing Married values
knn_marr = KNeighborsClassifier(n_neighbors=5)
knn_marr.fit(X_marr_valid,y_marr_valid)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [21]:
# Used the 'Married' fitted model to predict nan
X_marr_miss = loan_copy1.loc[loan_copy1.Married.isnull(),['Gender']]

y_marr_impute = knn_marr.predict(X_marr_miss)

In [22]:
# 3 predicted Married values ready for imputation
y_marr_impute

array([1., 1., 0.])

In [23]:
#loan_cust_demo_marr_imp = loan_cust_demo_df.copy()

In [24]:
loan_copy1.loc[loan_copy1.Married.isnull(),'Married'] = y_marr_impute

print(loan_copy1.Married.value_counts(dropna=False))
# loan_copy1 now has its missing Married values imputed

1.0    400
0.0    214
Name: Married, dtype: int64
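
The fit-on-known-rows, predict-on-missing-rows pattern used for `Married` is repeated for the other categorical columns below. A possible way to factor it out is sketched here; `knn_impute_categorical` is a hypothetical helper, not part of the original notebook, and the toy data is made up. Note that with a single binary predictor many rows tie at distance zero, so which five neighbours are picked may depend on the implementation.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier


def knn_impute_categorical(df, target, predictors, n_neighbors=5):
    """Fill NaN in `target` using a KNN classifier fitted on `predictors`.

    Hypothetical helper. Only rows whose predictors are all present
    are used for fitting or prediction; the input frame is not modified.
    """
    out = df.copy()
    predictors_ok = out[predictors].notnull().all(axis=1)
    known = out[target].notnull() & predictors_ok
    missing = out[target].isnull() & predictors_ok
    if missing.any():
        model = KNeighborsClassifier(n_neighbors=n_neighbors)
        model.fit(out.loc[known, predictors], out.loc[known, target])
        out.loc[missing, target] = model.predict(out.loc[missing, predictors])
    return out


# Made-up data: Married follows Gender exactly, last row is missing.
toy = pd.DataFrame({
    "Gender":  [1.0] * 6 + [0.0] * 6 + [1.0],
    "Married": [1.0] * 6 + [0.0] * 6 + [np.nan],
})
filled = knn_impute_categorical(toy, "Married", ["Gender"])
print(filled.Married.value_counts(dropna=False))
```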


In [25]:
loan_copy1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               601 non-null float64
Married              614 non-null float64
Dependents           599 non-null float64
Education            614 non-null float64
Self_Employed        582 non-null float64
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null float64
Loan_Status          614 non-null float64
dtypes: float64(11), int64(1), object(1)
memory usage: 62.4+ KB


##### 2.3.2. FILL 'NAN' GENDER COLUMN VALUES

In [26]:
#-----------------------------------------------------
# FILL GENDER COL VALUES BASED ON MARRIED VALUES
#-----------------------------------------------------
valid_gender_df = loan_copy1.loc[~loan_copy1.Gender.isnull(),['Gender','Married']]
gender_valid = valid_gender_df.loc[~valid_gender_df.Married.isnull(),['Married','Gender']]
print(type(gender_valid))
print(gender_valid.shape)


<class 'pandas.core.frame.DataFrame'>
(601, 2)


In [27]:
#Predictor matrix: Married
#Target variable: Gender
gender_pred_col = ['Married']
X_gen_valid = gender_valid[gender_pred_col]
y_gen_valid = gender_valid['Gender'].values
print(type(y_gen_valid))
print(type(X_gen_valid))


<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>


In [28]:
# KNN classifier instantiated and fitted to the Predictor matrix and target variable to determine missing Gender values
knn_gen = KNeighborsClassifier(n_neighbors=5)
knn_gen.fit(X_gen_valid,y_gen_valid)


# Used the 'Gender' fitted model to predict nan
X_gen_miss = loan_copy1.loc[loan_copy1.Gender.isnull(),['Married']]

y_gen_impute = knn_gen.predict(X_gen_miss)

loan_copy1.loc[loan_copy1.Gender.isnull(),'Gender'] = y_gen_impute

print(loan_copy1.Gender.value_counts(dropna=False))
# loan_copy1 now has its missing Gender values imputed

1.0    499
0.0    115
Name: Gender, dtype: int64


##### 2.3.3. FILL 'NAN' DEPENDENTS COLUMN VALUES

In [29]:
#-----------------------------------------------------
# FILL DEPENDENTS COL VALUES BASED ON GENDER & MARRIED VALUES
#-----------------------------------------------------
valid_dep_df = loan_copy1.loc[~loan_copy1.Dependents.isnull(),['Gender','Married','Dependents']]
# Married and Gender were imputed above, so the predictor columns in valid_dep_df contain no NaN

#Predictor matrix
dep_col = ['Gender','Married']
X_dep_valid = valid_dep_df[dep_col]
y_dep_valid = valid_dep_df['Dependents'].values

knn_dep = KNeighborsClassifier(n_neighbors=5)
knn_dep.fit(X_dep_valid,y_dep_valid)

X_dep_miss = loan_copy1.loc[loan_copy1.Dependents.isnull(),['Gender','Married']]
y_dep_impute = knn_dep.predict(X_dep_miss)

loan_copy1.loc[loan_copy1.Dependents.isnull(),'Dependents'] = y_dep_impute

In [30]:
loan_copy1.Dependents.value_counts(dropna=False)

0.0    360
1.0    102
2.0    101
3.0     51
Name: Dependents, dtype: int64

In [31]:
loan_copy1.corr()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
Gender,1.0,0.384752,0.167982,-0.05152,-0.011786,0.057199,0.089013,0.106223,-0.07731,0.007382,0.030199,0.018361
Married,0.384752,1.0,0.336372,-0.014097,0.001858,0.049052,0.07776,0.149743,-0.096451,0.00447,-0.004415,0.089072
Dependents,0.167982,0.336372,1.0,-0.055752,0.057161,0.118202,0.03043,0.166106,-0.102028,-0.038702,0.000244,0.010118
Education,-0.05152,-0.014097,-0.055752,1.0,0.012333,0.14076,0.06229,0.171133,0.078784,0.081822,-0.065243,0.085884
Self_Employed,-0.011786,0.001858,0.057161,0.012333,1.0,0.140826,-0.011152,0.123931,-0.037069,0.003883,0.031214,-0.002303
ApplicantIncome,0.057199,0.049052,0.118202,0.14076,0.140826,1.0,-0.116605,0.570909,-0.045306,-0.014715,0.0095,-0.00471
CoapplicantIncome,0.089013,0.07776,0.03043,0.06229,-0.011152,-0.116605,1.0,0.188619,-0.059878,-0.002056,-0.010522,-0.059187
LoanAmount,0.106223,0.149743,0.166106,0.171133,0.123931,0.570909,0.188619,1.0,0.039447,-0.008433,0.045792,-0.037318
Loan_Amount_Term,-0.07731,-0.096451,-0.102028,0.078784,-0.037069,-0.045306,-0.059878,0.039447,1.0,0.00147,0.078748,-0.021268
Credit_History,0.007382,0.00447,-0.038702,0.081822,0.003883,-0.014715,-0.002056,-0.008433,0.00147,1.0,0.001969,0.561678


##### 2.3.4. FILL 'NAN' SELF_EMPLOYED COLUMN VALUES

In [32]:
#------------------------------------------------------------------------
# To find Self_Employed missing values, use ApplicantIncome based on its correlation values
#------------------------------------------------------------------------
valid_emp_df = loan_copy1.loc[~loan_copy1.Self_Employed.isnull(),['ApplicantIncome','Self_Employed']]

#predictor matrix
emp_cols = ['ApplicantIncome']
X_emp_valid = valid_emp_df[emp_cols]
y_emp_valid = valid_emp_df['Self_Employed'].values

knn_emp = KNeighborsClassifier(n_neighbors=5)
knn_emp.fit(X_emp_valid,y_emp_valid)

loan_copy1.Self_Employed.value_counts(dropna=False)
    
X_emp_miss = loan_copy1.loc[loan_copy1.Self_Employed.isnull(),['ApplicantIncome']]
print(X_emp_miss.shape)

y_emp_impute = knn_emp.predict(X_emp_miss)
print(len(y_emp_impute))
loan_copy1.loc[loan_copy1.Self_Employed.isnull(),'Self_Employed'] = y_emp_impute

loan_copy1.Self_Employed.value_counts(dropna=False)

(32, 1)
32


0.0    532
1.0     82
Name: Self_Employed, dtype: int64

In [33]:
loan_copy1[loan_copy1.LoanAmount.isnull()]

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,1.0,0.0,0.0,1.0,0.0,5849,0.0,,360.0,1.0,1.0,1.0
35,LP001106,1.0,1.0,0.0,1.0,0.0,2275,2067.0,,360.0,1.0,1.0,1.0
63,LP001213,1.0,1.0,1.0,1.0,0.0,4945,0.0,,360.0,0.0,3.0,0.0
81,LP001266,1.0,1.0,1.0,1.0,1.0,2395,0.0,,360.0,1.0,2.0,1.0
95,LP001326,1.0,0.0,0.0,1.0,0.0,6782,0.0,,360.0,,1.0,0.0
102,LP001350,1.0,1.0,0.0,1.0,0.0,13650,0.0,,360.0,1.0,1.0,1.0
103,LP001356,1.0,1.0,0.0,1.0,0.0,4652,3583.0,,360.0,1.0,2.0,1.0
113,LP001392,0.0,0.0,1.0,1.0,1.0,7451,0.0,,360.0,1.0,2.0,1.0
127,LP001449,1.0,0.0,0.0,1.0,0.0,3865,1640.0,,360.0,1.0,3.0,1.0
202,LP001682,1.0,1.0,3.0,0.0,0.0,3992,0.0,,180.0,1.0,1.0,0.0


##### 2.3.5. FILL 'NAN' LOANAMOUNT COLUMN VALUES

In [34]:
#---------------------------------------------------------------
# Find Loan Amount, based on ApplicantIncome and CoapplicantIncome
#---------------------------------------------------------------
loan_copy1.info()
#--592 valid values

valid_lnamt_df = loan_copy1.loc[~loan_copy1.LoanAmount.isnull(), ['ApplicantIncome','CoapplicantIncome','LoanAmount']]

#predictor matrix
lnamt_cols = ['ApplicantIncome','CoapplicantIncome']
X_lnamt_valid = valid_lnamt_df[lnamt_cols]
y_lnamt_valid = valid_lnamt_df['LoanAmount'].values

knn_ln = KNeighborsClassifier(n_neighbors=5)
knn_ln.fit(X_lnamt_valid,y_lnamt_valid)

X_lnamt_miss = loan_copy1.loc[loan_copy1.LoanAmount.isnull(),['ApplicantIncome','CoapplicantIncome']]
y_lnamt_impute = knn_ln.predict(X_lnamt_miss)

y_lnamt_impute


loan_copy1.loc[loan_copy1.LoanAmount.isnull(),'LoanAmount'] = y_lnamt_impute

# info() prints its report itself and returns None, so no print() is needed
loan_copy1.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               614 non-null float64
Married              614 non-null float64
Dependents           614 non-null float64
Education            614 non-null float64
Self_Employed        614 non-null float64
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           592 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       564 non-null float64
Property_Area        614 non-null float64
Loan_Status          614 non-null float64
dtypes: float64(11), int64(1), object(1)
memory usage: 62.4+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               614 non-null float64
Married              614 non-null float64
Dependents           614 non-null float64
Education       

In [35]:
loan_copy1.corr()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
Gender,1.0,0.384752,0.167982,-0.05152,-0.007875,0.057199,0.089013,0.102611,-0.07731,0.007382,0.030199,0.018361
Married,0.384752,1.0,0.336372,-0.014097,0.005826,0.049052,0.07776,0.142947,-0.096451,0.00447,-0.004415,0.089072
Dependents,0.167982,0.336372,1.0,-0.055752,0.056798,0.118202,0.03043,0.15986,-0.102028,-0.038702,0.000244,0.010118
Education,-0.05152,-0.014097,-0.055752,1.0,0.010383,0.14076,0.06229,0.173213,0.078784,0.081822,-0.065243,0.085884
Self_Employed,-0.007875,0.005826,0.056798,0.010383,1.0,0.12718,-0.0161,0.113345,-0.034361,-0.002362,0.03086,-0.0037
ApplicantIncome,0.057199,0.049052,0.118202,0.14076,0.12718,1.0,-0.116605,0.562831,-0.045306,-0.014715,0.0095,-0.00471
CoapplicantIncome,0.089013,0.07776,0.03043,0.06229,-0.0161,-0.116605,1.0,0.191587,-0.059878,-0.002056,-0.010522,-0.059187
LoanAmount,0.102611,0.142947,0.15986,0.173213,0.113345,0.562831,0.191587,1.0,0.03712,-0.003783,0.054662,-0.030521
Loan_Amount_Term,-0.07731,-0.096451,-0.102028,0.078784,-0.034361,-0.045306,-0.059878,0.03712,1.0,0.00147,0.078748,-0.021268
Credit_History,0.007382,0.00447,-0.038702,0.081822,-0.002362,-0.014715,-0.002056,-0.003783,0.00147,1.0,0.001969,0.561678
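
The `LoanAmount` imputation above uses `KNeighborsClassifier`, which treats every distinct loan amount as its own class label. Because `LoanAmount` is continuous, a regressor may be a better fit. The sketch below is an alternative, not the notebook's method: it uses scikit-learn's `KNeighborsRegressor`, which averages the target over the nearest neighbours, on made-up toy data.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data; the last row has a missing LoanAmount.
toy = pd.DataFrame({
    "ApplicantIncome":   [2000, 2100, 2200, 8000, 8100, 8200, 2050],
    "CoapplicantIncome": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "LoanAmount":        [100.0, 110.0, 120.0, 300.0, 310.0, 320.0, np.nan],
})
predictors = ["ApplicantIncome", "CoapplicantIncome"]
known = toy["LoanAmount"].notnull()

# Fit on rows with a known LoanAmount, then predict the missing ones.
# The prediction is the mean LoanAmount of the 3 nearest rows.
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(toy.loc[known, predictors], toy.loc[known, "LoanAmount"])
toy.loc[~known, "LoanAmount"] = reg.predict(toy.loc[~known, predictors])
print(toy)
```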


##### 2.3.6. FILL 'NAN' CREDIT_HISTORY COLUMN VALUES

In [36]:
#---------------------------------------------------------------
# Find Credit_History, based on Gender, Married, Education, Self_Employed, ApplicantIncome, CoapplicantIncome
#---------------------------------------------------------------
from sklearn.preprocessing import StandardScaler

cred_cols_1 = ['Credit_History','Gender', 'Married', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome']
valid_cred_df = loan_copy1.loc[~loan_copy1.Credit_History.isnull(), cred_cols_1]
print(valid_cred_df.shape)
# predictor matrix
cred_cols  =  ['Gender', 'Married', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome']
X_cred_valid = valid_cred_df[cred_cols]
print(X_cred_valid.shape)

X_cred_valid.info()


y_cred_valid = valid_cred_df['Credit_History'].values
#print(y_cred_valid.shape)


(564, 7)
(564, 6)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 564 entries, 0 to 613
Data columns (total 6 columns):
Gender               564 non-null float64
Married              564 non-null float64
Education            564 non-null float64
Self_Employed        564 non-null float64
ApplicantIncome      564 non-null int64
CoapplicantIncome    564 non-null float64
dtypes: float64(5), int64(1)
memory usage: 30.8 KB


In [37]:

#use StandardScaler 
ss = StandardScaler()
Xss_cred_valid = ss.fit_transform(X_cred_valid)

knn_cred = KNeighborsClassifier(n_neighbors=5)
knn_cred.fit(Xss_cred_valid,y_cred_valid)

X_cred_miss = loan_copy1.loc[loan_copy1.Credit_History.isnull(), cred_cols]
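# Reuse the scaler fitted on the valid rows; refitting it here would put the two sets on different scales
Xss_cred_miss = ss.transform(X_cred_miss)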

y_cred_impute = knn_cred.predict(Xss_cred_miss)

loan_copy1.loc[loan_copy1.Credit_History.isnull(),'Credit_History'] = y_cred_impute
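
When scaling is combined with KNN, the scaler's statistics should come from the rows used for fitting and then be applied unchanged to the rows being predicted. One way to make that hard to get wrong is to wrap both steps in a scikit-learn pipeline. This is a sketch on made-up data (the names `toy` and `pipe` are illustrative), not the notebook's own code.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the last two rows lack Credit_History.
toy = pd.DataFrame({
    "ApplicantIncome": [2000, 2500, 3000, 9000, 9500, 10000, 2400, 9800],
    "Education":       [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0],
    "Credit_History":  [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, np.nan, np.nan],
})
predictors = ["ApplicantIncome", "Education"]
known = toy["Credit_History"].notnull()

# fit() learns the scaling from the known rows only;
# predict() reuses that same scaling on the missing rows.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(toy.loc[known, predictors], toy.loc[known, "Credit_History"])
toy.loc[~known, "Credit_History"] = pipe.predict(toy.loc[~known, predictors])
print(toy)
```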




In [38]:
loan_copy1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               614 non-null float64
Married              614 non-null float64
Dependents           614 non-null float64
Education            614 non-null float64
Self_Employed        614 non-null float64
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           614 non-null float64
Loan_Amount_Term     600 non-null float64
Credit_History       614 non-null float64
Property_Area        614 non-null float64
Loan_Status          614 non-null float64
dtypes: float64(11), int64(1), object(1)
memory usage: 62.4+ KB


##### 2.3.7. FILL 'NAN' LOAN_AMOUNT_TERM COLUMN VALUES

In [39]:
#-----------------------------------------------------------
# Fill NaN Loan_Amount_Term values using LoanAmount, ApplicantIncome, CoapplicantIncome
#-----------------------------------------------------------
valid_term_df = loan_copy1.loc[~loan_copy1.Loan_Amount_Term.isnull(),['Loan_Amount_Term','LoanAmount','ApplicantIncome','CoapplicantIncome']]

term_cols = ['LoanAmount','ApplicantIncome','CoapplicantIncome']
X_term_valid = valid_term_df[term_cols]
y_term_valid = valid_term_df['Loan_Amount_Term'].values

X_term_valid.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 600 entries, 0 to 613
Data columns (total 3 columns):
LoanAmount           600 non-null float64
ApplicantIncome      600 non-null int64
CoapplicantIncome    600 non-null float64
dtypes: float64(2), int64(1)
memory usage: 18.8 KB


In [40]:
knn_term =  KNeighborsClassifier(n_neighbors=5)
knn_term.fit(X_term_valid,y_term_valid)

X_term_miss = loan_copy1.loc[loan_copy1.Loan_Amount_Term.isnull(), term_cols]


y_term_impute = knn_term.predict(X_term_miss)

loan_copy1.loc[loan_copy1.Loan_Amount_Term.isnull(),'Loan_Amount_Term'] = y_term_impute


In [41]:
loan_copy1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
Loan_ID              614 non-null object
Gender               614 non-null float64
Married              614 non-null float64
Dependents           614 non-null float64
Education            614 non-null float64
Self_Employed        614 non-null float64
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           614 non-null float64
Loan_Amount_Term     614 non-null float64
Credit_History       614 non-null float64
Property_Area        614 non-null float64
Loan_Status          614 non-null float64
dtypes: float64(11), int64(1), object(1)
memory usage: 62.4+ KB


### 3. USE GETDUMMIES for 'Dependents' column and 'Property_Area' column

In [42]:
# Dependent column dummies
dep_dummies = pd.get_dummies(loan_copy1.Dependents, prefix='Dep')
dep_dummies.drop(dep_dummies.columns[0], axis=1, inplace=True)

#Property_Area column dummies
prop_dummies = pd.get_dummies(loan_copy1.Property_Area, prefix='Prop')
prop_dummies.drop(prop_dummies.columns[0], axis=1, inplace=True)


loan_copy1_dummies = pd.concat([loan_copy1,dep_dummies,prop_dummies],axis=1)

loan_copy1_dummies.drop(['Dependents','Property_Area'],axis=1,inplace=True)

loan_copy1_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 16 columns):
Loan_ID              614 non-null object
Gender               614 non-null float64
Married              614 non-null float64
Education            614 non-null float64
Self_Employed        614 non-null float64
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           614 non-null float64
Loan_Amount_Term     614 non-null float64
Credit_History       614 non-null float64
Loan_Status          614 non-null float64
Dep_1.0              614 non-null uint8
Dep_2.0              614 non-null uint8
Dep_3.0              614 non-null uint8
Prop_2.0             614 non-null uint8
Prop_3.0             614 non-null uint8
dtypes: float64(9), int64(1), object(1), uint8(5)
memory usage: 55.8+ KB
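
The dummy-encoding cell above builds the dummies per column, drops the first level, concatenates, and then drops the source columns. `pd.get_dummies` can do the same in one call through its `columns`, `prefix` and `drop_first` parameters. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical toy frame with the two numerically coded categoricals.
toy = pd.DataFrame({
    "Dependents":    [0.0, 1.0, 2.0, 3.0],
    "Property_Area": [1.0, 2.0, 3.0, 1.0],
    "LoanAmount":    [128.0, 66.0, 120.0, 141.0],
})

# Encode both columns, drop the first level of each to avoid a
# redundant column, and remove the original columns automatically.
encoded = pd.get_dummies(
    toy,
    columns=["Dependents", "Property_Area"],
    prefix=["Dep", "Prop"],
    drop_first=True,
)
print(list(encoded.columns))
```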


### 4. ANALYSE TARGET variable & address Class Imbalance problem

In [43]:
loan_copy1_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 16 columns):
Loan_ID              614 non-null object
Gender               614 non-null float64
Married              614 non-null float64
Education            614 non-null float64
Self_Employed        614 non-null float64
ApplicantIncome      614 non-null int64
CoapplicantIncome    614 non-null float64
LoanAmount           614 non-null float64
Loan_Amount_Term     614 non-null float64
Credit_History       614 non-null float64
Loan_Status          614 non-null float64
Dep_1.0              614 non-null uint8
Dep_2.0              614 non-null uint8
Dep_3.0              614 non-null uint8
Prop_2.0             614 non-null uint8
Prop_3.0             614 non-null uint8
dtypes: float64(9), int64(1), object(1), uint8(5)
memory usage: 55.8+ KB


In [44]:
loan_copy1_dummies.Loan_Status.value_counts(dropna=False)
# The classes are imbalanced (422 approved vs 192 rejected), so resample with SMOTETomek


1.0    422
0.0    192
Name: Loan_Status, dtype: int64

In [45]:
# Form Target Series
y = loan_copy1_dummies['Loan_Status'].values

# Form Predictor/Feature matrix
final_feature_cols_1 = ['Gender','Married','Education','Self_Employed','ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History','Dep_1.0','Dep_2.0','Dep_3.0','Prop_2.0','Prop_3.0']
X_1 = loan_copy1_dummies[final_feature_cols_1]

print(X_1.shape)

(614, 14)


In [46]:
#------------------------------------------------
# Address class imbalance of target variable,
# using SMOTETomek
#------------------------------------------------

from imblearn.combine import SMOTETomek

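# Newer imbalanced-learn releases use sampling_strategy / fit_resample
# in place of the older ratio / fit_sample
sm = SMOTETomek(random_state=42, sampling_strategy='auto')

X_resampled_1, y_resampled = sm.fit_resample(X_1, y)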

print(X_resampled_1.shape)
print(sum(y_resampled))

(714, 14)
357.0
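
The notebook resamples the full dataset and only then splits it into train and test sets, which means rows influenced by test data can end up in training. A commonly recommended ordering is to split first and rebalance only the training part. The sketch below illustrates that ordering on made-up data; as a stand-in for SMOTETomek it uses plain random oversampling via `sklearn.utils.resample`, which is a different and simpler technique than the one used in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up imbalanced data: 70 positives, 30 negatives.
rng = np.random.RandomState(42)
X = rng.normal(size=(100, 3))
y = np.array([1] * 70 + [0] * 30)

# 1. Split first, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=99
)

# 2. Oversample the minority class in the training set only
#    (random oversampling, used here as a stand-in for SMOTETomek).
minority = y_train == 0
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=42,
)

X_train_bal = np.vstack([X_train[~minority], X_min_up])
y_train_bal = np.concatenate([y_train[~minority], y_min_up])
print(X_train_bal.shape, int(y_train_bal.sum()))
```

The test set keeps its original class distribution, so the reported metrics reflect the imbalance a deployed model would actually see.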


### 5. Model using all features in dataset

#### 5.1. Use KNN Classifier model


In [47]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import metrics
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X_resampled_1, y_resampled, random_state=99)

In [48]:
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train_1, y_train_1)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=9, p=2,
           weights='uniform')
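
Since this KNN model uses Euclidean distance (`p=2`), columns on a large scale such as `ApplicantIncome` can outweigh 0/1 columns such as `Credit_History` when neighbours are chosen. The sketch below illustrates the effect of standardising features inside a pipeline. It is an illustration under stated assumptions only: the data is synthetic, and the `flag` and `income` columns are hypothetical stand-ins rather than the notebook's features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 400

# Hypothetical features: an informative 0/1 flag and a large-scale noise column.
flag = rng.randint(0, 2, size=n)
income = rng.normal(5000, 2000, size=n)
X_demo = np.column_stack([flag, income])
y_demo = flag  # the target depends only on the flag

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=99)

# KNN on raw features: distances are dominated by the income column.
raw_knn = KNeighborsClassifier(n_neighbors=9).fit(X_tr, y_tr)

# KNN on standardised features: both columns contribute on a comparable scale.
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=9)
).fit(X_tr, y_tr)

raw_acc = raw_knn.score(X_te, y_te)
scaled_acc = scaled_knn.score(X_te, y_te)
print("raw accuracy:   ", raw_acc)
print("scaled accuracy:", scaled_acc)
```

Putting the scaler in a pipeline means it is fitted on the training rows only, so no test-set statistics leak into the model.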

In [49]:
y_pred_knn_1 = knn.predict(X_test_1)


print("----------------------------------------------------------------------")
print("--------- MODEL:1.a: KNN Classifier PERFORMANCE ----------------------")
print("----------------------------------------------------------------------")
print("ACCURACY SCORE : ",metrics.accuracy_score(y_test_1, y_pred_knn_1))
print("ROC AUC SCORE : ",metrics.roc_auc_score(y_test_1,y_pred_knn_1))
print("CONFUSION MATRIX : \n",metrics.confusion_matrix(y_test_1,y_pred_knn_1))
print("CLASSIFICATION REPORT : \n",classification_report(y_test_1,y_pred_knn_1))
print("----------------------------------------------------------------------")

----------------------------------------------------------------------
--------- MODEL:1.a: KNN Classifier PERFORMANCE ----------------------
----------------------------------------------------------------------
ACCURACY SCORE :  0.6759776536312849
ROC AUC SCORE :  0.6850012572290671
CONFUSION MATRIX : 
 [[65 17]
 [41 56]]
CLASSIFICATION REPORT : 
              precision    recall  f1-score   support

        0.0       0.61      0.79      0.69        82
        1.0       0.77      0.58      0.66        97

avg / total       0.70      0.68      0.67       179

----------------------------------------------------------------------
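
Because `roc_auc_score` is given hard class predictions here rather than probabilities, the ROC curve has a single operating point and the score reduces to the mean of the two per-class recalls. The arithmetic can be checked directly from the confusion matrix printed above:

```python
# Confusion matrix from the KNN output above:
# rows = true class (0, 1), columns = predicted class (0, 1)
tn, fp = 65, 17
fn, tp = 41, 56

accuracy = (tn + tp) / (tn + fp + fn + tp)
recall_0 = tn / (tn + fp)      # recall of class 0 (specificity)
recall_1 = tp / (tp + fn)      # recall of class 1 (sensitivity)
precision_1 = tp / (tp + fp)   # precision of class 1

# With one operating point, ROC AUC equals the mean of the two recalls.
auc_from_labels = (recall_0 + recall_1) / 2

print(accuracy, auc_from_labels)
print(round(recall_0, 2), round(recall_1, 2), round(precision_1, 2))
```

The two values on the first line reproduce the ACCURACY SCORE and ROC AUC SCORE reported above, and the rounded recalls and precision match the classification report.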


#### 5.2. Use Logistic Regression model

In [50]:
# Fit a logistic regression model and store the class predictions.
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()


logreg.fit(X_train_1, y_train_1)
y_pred_log_1 = logreg.predict(X_test_1)

print("----------------------------------------------------------------------")
print("--------- MODEL:1.b: LOG REG MODEL PERFORMANCE -----------------------")
print("----------------------------------------------------------------------")
print("ACCURACY SCORE : ",metrics.accuracy_score(y_test_1, y_pred_log_1))
print("ROC AUC SCORE : ",metrics.roc_auc_score(y_test_1,y_pred_log_1))
print("CONFUSION MATRIX : \n",metrics.confusion_matrix(y_test_1,y_pred_log_1))
print("CLASSIFICATION REPORT : \n",classification_report(y_test_1,y_pred_log_1))
print("----------------------------------------------------------------------")

----------------------------------------------------------------------
--------- MODEL:1.b: LOG REG MODEL PERFORMANCE -----------------------
----------------------------------------------------------------------
ACCURACY SCORE :  0.6871508379888268
ROC AUC SCORE :  0.6764520995725422
CONFUSION MATRIX : 
 [[45 37]
 [19 78]]
CLASSIFICATION REPORT : 
              precision    recall  f1-score   support

        0.0       0.70      0.55      0.62        82
        1.0       0.68      0.80      0.74        97

avg / total       0.69      0.69      0.68       179

----------------------------------------------------------------------
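
The ROC AUC values in this notebook are computed from `predict` output. A threshold-independent AUC is usually computed from scores such as `predict_proba` instead. The following is a minimal sketch of the difference, assuming synthetic data from `make_classification` rather than the loan data; `X_demo`, `y_demo` and `clf` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the loan dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=99)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: a single point on the ROC curve.
auc_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from the predicted probability of class 1: uses every threshold.
auc_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print("AUC from labels:       ", auc_labels)
print("AUC from probabilities:", auc_scores)
```

In this notebook the equivalent call would presumably be `logreg.predict_proba(X_test_1)[:, 1]` passed as the second argument to `metrics.roc_auc_score`.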


In [51]:
loan_copy1.corr()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
Gender,1.0,0.384752,0.167982,-0.05152,-0.007875,0.057199,0.089013,0.102611,-0.076699,0.00392,0.030199,0.018361
Married,0.384752,1.0,0.336372,-0.014097,0.005826,0.049052,0.07776,0.142947,-0.095364,0.009519,-0.004415,0.089072
Dependents,0.167982,0.336372,1.0,-0.055752,0.056798,0.118202,0.03043,0.15986,-0.103864,-0.04016,0.000244,0.010118
Education,-0.05152,-0.014097,-0.055752,1.0,0.010383,0.14076,0.06229,0.173213,0.073928,0.073658,-0.065243,0.085884
Self_Employed,-0.007875,0.005826,0.056798,0.010383,1.0,0.12718,-0.0161,0.113345,-0.033739,-0.00155,0.03086,-0.0037
ApplicantIncome,0.057199,0.049052,0.118202,0.14076,0.12718,1.0,-0.116605,0.562831,-0.046531,-0.018615,0.0095,-0.00471
CoapplicantIncome,0.089013,0.07776,0.03043,0.06229,-0.0161,-0.116605,1.0,0.191587,-0.059383,0.011134,-0.010522,-0.059187
LoanAmount,0.102611,0.142947,0.15986,0.173213,0.113345,0.562831,0.191587,1.0,0.034832,0.003281,0.054662,-0.030521
Loan_Amount_Term,-0.076699,-0.095364,-0.103864,0.073928,-0.033739,-0.046531,-0.059383,0.034832,1.0,-0.004705,0.07612,-0.022549
Credit_History,0.00392,0.009519,-0.04016,0.073658,-0.00155,-0.018615,0.011134,0.003281,-0.004705,1.0,-0.001963,0.540556


### 6. Model using 1 feature (Credit_History) in dataset

#### 6.1. Use KNN Classifier model

In [52]:
# Use only Credit_History as feature matrix
# Form Target Series
y_2 = loan_copy1_dummies['Loan_Status'].values

# Form Predictor/Feature matrix
final_feature_cols_2 = ['Credit_History']
X_2 = loan_copy1_dummies[final_feature_cols_2]

print(X_2.shape)



(614, 1)


In [53]:
# fit_resample is the current name of fit_sample in imbalanced-learn
X_resampled_2, y_resampled_2 = sm.fit_resample(X_2, y_2)

print(X_resampled_2.shape)
print(sum(y_resampled_2))


(844, 1)
422.0


In [54]:

X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_resampled_2, y_resampled_2, random_state=99)


knn_2 = KNeighborsClassifier(n_neighbors=19)
knn_2.fit(X_train_2, y_train_2)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=19, p=2,
           weights='uniform')

In [55]:

y_pred_knn_2 = knn_2.predict(X_test_2)


print("----------------------------------------------------------------------")
print("--------- MODEL:2.a: KNN Classifier MODEL PERFORMANCE ----------------")
print("----------------------------------------------------------------------")
print("ACCURACY SCORE : ",metrics.accuracy_score(y_test_2, y_pred_knn_2))
print("ROC AUC SCORE : ",metrics.roc_auc_score(y_test_2,y_pred_knn_2))
print("CONFUSION MATRIX : \n",metrics.confusion_matrix(y_test_2,y_pred_knn_2))
print("CLASSIFICATION REPORT :\n",classification_report(y_test_2,y_pred_knn_2))
print("----------------------------------------------------------------------")

----------------------------------------------------------------------
--------- MODEL:2.a: KNN Classifier MODEL PERFORMANCE ----------------
----------------------------------------------------------------------
ACCURACY SCORE :  0.6682464454976303
ROC AUC SCORE :  0.6831981981981982
CONFUSION MATRIX : 
 [[44 67]
 [ 3 97]]
CLASSIFICATION REPORT :
              precision    recall  f1-score   support

        0.0       0.94      0.40      0.56       111
        1.0       0.59      0.97      0.73       100

avg / total       0.77      0.67      0.64       211

----------------------------------------------------------------------


#### 6.2. Use Logistic Regression model

In [56]:
logreg_2 = LogisticRegression()


logreg_2.fit(X_train_2, y_train_2)
y_pred_log_2 = logreg_2.predict(X_test_2)


print("----------------------------------------------------------------------")
print("-----------------MODEL:2.b: LOG REG PERFORMANCE ------------------------")
print("----------------------------------------------------------------------")
print("ACCURACY SCORE : ",metrics.accuracy_score(y_test_2, y_pred_log_2))
print("CONFUSION MATRIX: \n",metrics.confusion_matrix(y_test_2,y_pred_log_2))
print("ROC AUC SCORE : ",metrics.roc_auc_score(y_test_2,y_pred_log_2))
print("CLASSIFICATION REPORT : \n",classification_report(y_test_2,y_pred_log_2))

----------------------------------------------------------------------
-----------------MODEL:2.b: LOG REG PERFORMANCE ------------------------
----------------------------------------------------------------------
ACCURACY SCORE :  0.6682464454976303
CONFUSION MATRIX: 
 [[44 67]
 [ 3 97]]
ROC AUC SCORE :  0.6831981981981982
CLASSIFICATION REPORT : 
              precision    recall  f1-score   support

        0.0       0.94      0.40      0.56       111
        1.0       0.59      0.97      0.73       100

avg / total       0.77      0.67      0.64       211



#### 6.3. Use Logistic Regression with penalty='l1' and various C (inverse regularization strength) values

In [57]:
#----------------------------------------------------------------
# Logistic Regression using penalty='l1'. The values below are passed
# as C, the inverse of regularization strength (smaller C = stronger penalty).
#----------------------------------------------------------------
alpha = [0.0155, 0.1, 1.0, 10, 200, 20, 30]

print("----------------------------------------------------------------------")
print("-----------------MODEL:2.c: LOG REG PERFORMANCE ------------------------")
print("----------------------------------------------------------------------")
for a in alpha:
    # solver='liblinear' supports the l1 penalty (it was the default solver in older scikit-learn)
    logreg_3 = LogisticRegression(penalty='l1', solver='liblinear', C=a)
    logreg_3.fit(X_train_2,y_train_2)
    y_pred_3 = logreg_3.predict(X_test_2)
    roc = metrics.roc_auc_score(y_test_2, y_pred_3)
    print(roc," : ", a)
    print("----------------------------------------------------------------------")

----------------------------------------------------------------------
-----------------MODEL:2.c: LOG REG PERFORMANCE ------------------------
----------------------------------------------------------------------
0.6831981981981982  :  0.0155
----------------------------------------------------------------------
0.6831981981981982  :  0.1
----------------------------------------------------------------------
0.6831981981981982  :  1.0
----------------------------------------------------------------------
0.6831981981981982  :  10
----------------------------------------------------------------------
0.6831981981981982  :  200
----------------------------------------------------------------------
0.6831981981981982  :  20
----------------------------------------------------------------------
0.6831981981981982  :  30
----------------------------------------------------------------------
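
The score above is identical for every C value. With a single input feature and scoring on hard predictions, changing the penalty strength has little room to alter the predicted labels. The effect of an l1 penalty is easier to see with several features, where small C values drive coefficients to exactly zero. The sketch below illustrates this, assuming synthetic data from `make_classification` (not the loan data); `nonzero_by_c` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 10 features, only 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

nonzero_by_c = {}
for c in [0.001, 0.01, 0.1, 1.0, 10.0]:
    # Smaller C means a stronger l1 penalty and typically a sparser model.
    model = LogisticRegression(penalty='l1', solver='liblinear', C=c)
    model.fit(X_demo, y_demo)
    nonzero_by_c[c] = int(np.count_nonzero(model.coef_))
    print("C =", c, "-> non-zero coefficients:", nonzero_by_c[c])
```

Counting non-zero coefficients in this way shows which features survive a given penalty, which is one way to use l1 regularisation for feature selection.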


### 7. Model using 3 features (Credit_History, Married, Education) in dataset

#### 7.1.  Use Logistic Regression

In [58]:
# Use Credit_History, Married and Education as the feature matrix
# Form Target Series
y_4 = loan_copy1_dummies['Loan_Status'].values

# Form Predictor/Feature matrix
final_feature_cols_4 = ['Credit_History','Married','Education']
X_4 = loan_copy1_dummies[final_feature_cols_4]

print(X_4.shape)

# fit_resample is the current name of fit_sample in imbalanced-learn
X_resampled_4, y_resampled_4 = sm.fit_resample(X_4, y_4)

print(X_resampled_4.shape)
print(sum(y_resampled_4))


X_train_4, X_test_4, y_train_4, y_test_4 = train_test_split(X_resampled_4, y_resampled_4, random_state=99)


(614, 3)
(844, 3)
422.0


In [59]:
logreg_4 = LogisticRegression()

logreg_4.fit(X_train_4, y_train_4)
y_pred_log_4 = logreg_4.predict(X_test_4)

print("----------------------------------------------------------------------")
print("-----------------MODEL:3: LOG REG PERFORMANCE ------------------------")
print("----------------------------------------------------------------------")
print("Accuracy score : ",metrics.accuracy_score(y_test_4, y_pred_log_4))
print("Confusion matrix : \n",metrics.confusion_matrix(y_test_4,y_pred_log_4))
print("ROC AUC score : ",metrics.roc_auc_score(y_test_4,y_pred_log_4))
print("Classification Report : \n",classification_report(y_test_4,y_pred_log_4))
print("----------------------------------------------------------------------")


----------------------------------------------------------------------
-----------------MODEL:3: LOG REG PERFORMANCE ------------------------
----------------------------------------------------------------------
Accuracy score :  0.6682464454976303
Confusion matrix : 
 [[44 67]
 [ 3 97]]
ROC AUC score :  0.6831981981981982
Classification Report : 
              precision    recall  f1-score   support

        0.0       0.94      0.40      0.56       111
        1.0       0.59      0.97      0.73       100

avg / total       0.77      0.67      0.64       211

----------------------------------------------------------------------
