##**Classification Project using Loan Status Data**

##**Main Goals of this Program:**
**I) Check and decide the ML Learning Type and sub-type as applicable**

**II) Check and remove the duplicate records, if any**

**III) Check the class balance**

**IV) Check for Missing Values and handle them as required**

**V) Check for the necessity of creating new column(s) and create the columns as required**

**VI) Check the unique Values of each column and observe the following and take actions as required:**
* **1) Wrong Data in the columns, if any** 
* **2) Wrong format of the data in the columns, if any**
* **3) Identify the columns which need to be categorically converted to numeric values by using Nominal method/ Ordinal Method**

**VII) Check the Test accuracy using appropriate algorithm and Holdout Method.**

**VIII) Implement the Scaling as required**

**IX) Write out the transformed Input file for further usage**

**1) Install/ Import the required Python Packages/ Libraries**

In [None]:
#Import required python packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn import preprocessing
%matplotlib inline

In [None]:
pip install category_encoders



**2) Mounting the Google Drive**

In [None]:
# Mount the Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


**3) Read the Data file and check**

In [None]:
# Read the Loan Status data from the .csv file and check the data shape (number of Rows and Columns)
df = pd.read_csv('gdrive/My Drive/SRM-MLP-Internship-2021/Projects/Classification/01-Loan-Status-Prediction/Data-Files/Train_Loan_Status.csv')
print(df.shape)
df.head()

(614, 13)


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


##**I) Check and decide the ML Learning Type and sub-type as applicable**

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [None]:
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

**Observations on the given Dataset:**
* a) Number of Independent Variables: 12 (Identified)
* b) Number of Dependent Variable : 1 (Loan_Status) (Identified)
* c) There is no Missing Value in the Dependent Variable column "Loan_Status"


**Conclusions:**
###**a) Since a labelled Dependent variable is available, the given dataset belongs to the "Supervised Learning" main-type**
###**b) Since the Dependent variable values are categorical in nature, the given dataset is of "Classification" sub-type.**

##**II) Check and remove the duplicate records, if any**

In [None]:
df.shape

(614, 13)

In [None]:
# Returns True for every row that is a duplicate of an earlier row, otherwise False:
print(df.duplicated())

0      False
1      False
2      False
3      False
4      False
       ...  
609    False
610    False
611    False
612    False
613    False
Length: 614, dtype: bool


In [None]:
# Remove all duplicates:
df.drop_duplicates(inplace = True)

In [None]:
df.shape

(614, 13)

###**Conclusion: No Duplicate Records**
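Printing the full boolean Series makes it hard to see at a glance whether any duplicates exist. A minimal sketch of a more direct check is to count the `True` values; the toy frame below is hypothetical and only illustrates the calls.

```python
import pandas as pd

# Toy frame (hypothetical values) in which the third row repeats the first.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "ApplicantIncome": [5849, 4583, 5849],
})

# duplicated() marks rows that repeat an earlier row; summing counts them.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates()

print(n_duplicates)
print(deduped.shape)
```

On the loan data, a count of zero would agree with the unchanged shape before and after `drop_duplicates`.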

##**III) Check the Class balance**

In [None]:
df["Loan_Status"].value_counts()

Y    422
N    192
Name: Loan_Status, dtype: int64

###**Conclusion: It is a Binary Classification with imbalanced Classes**
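To put a number on the imbalance, the class counts can be turned into proportions. The sketch below copies the two counts from the output above; the "majority baseline" is the accuracy a model would get by always predicting the larger class, which is a useful reference point when reading accuracy figures later.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"Y": 422, "N": 192})

# Share of each class in the dataset.
proportions = counts / counts.sum()

# Accuracy obtained by always predicting the majority class.
baseline_accuracy = float(proportions.max())

print(proportions.round(3))
print(round(baseline_accuracy, 3))
```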

##**IV) Check for Missing Values and handle them as required**

**a) Check the Missing Values, if any**

In [None]:
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

**b) Checking the total number of rows having the missing Values**

In [None]:
df[df.isnull().any(axis=1)]

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
11,LP001027,Male,Yes,2,Graduate,,2500,1840.0,109.0,360.0,1.0,Urban,Y
16,LP001034,Male,No,1,Not Graduate,No,3596,0.0,100.0,240.0,,Urban,Y
19,LP001041,Male,Yes,0,Graduate,,2600,3500.0,115.0,,1.0,Urban,Y
23,LP001050,,Yes,2,Not Graduate,No,3365,1917.0,112.0,360.0,0.0,Rural,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...
592,LP002933,,No,3+,Graduate,Yes,9357,0.0,292.0,360.0,1.0,Semiurban,Y
597,LP002943,Male,No,,Graduate,No,2987,0.0,88.0,360.0,0.0,Semiurban,N
600,LP002949,Female,No,3+,Graduate,,416,41667.0,350.0,180.0,,Urban,N
601,LP002950,Male,Yes,0,Not Graduate,,2894,2792.0,155.0,360.0,1.0,Rural,Y


**c) Observations, Decisions and Actions**

**Observations:**
* a) Here, 7 columns have missing values.
* b) 134 of the 614 rows in the dataset contain at least one missing value.
###**So, dropping the rows having missing values would discard too much data and is not an option here.**

**Decision and Actions:**

###**Fill the missing values in each column with the most frequent value (mode) of that column.**
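As a small illustration of how this kind of imputation behaves, the sketch below applies it to a hypothetical two-column frame. `DataFrame.mode()` returns the most frequent value(s) per column, and `.iloc[0]` takes the first row of that result, giving one fill value per column.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry per column.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", None, "Female"],
    "LoanAmount": [120.0, np.nan, 120.0, 66.0],
})

# One most-frequent value per column, as a Series indexed by column name.
most_frequent = toy.mode().iloc[0]

# fillna with a Series fills each column with its own value.
filled = toy.fillna(most_frequent)

print(most_frequent)
print(filled)
```

Note that the mode is used for numeric columns as well as categorical ones here; for skewed numeric columns, the median is a commonly used alternative.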

**d) Imputation of Missing Values using the "fillna" command and checking**

In [None]:
df.fillna(df.mode().iloc[0], inplace=True)

In [None]:
df.isnull().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

##**V) Check for necessity of creating new column(s) and create the columns as required**

###**Decision: As of now, there is no necessity to create new column(s).**

##**VI) Check the unique Values of each column and observe the following and take actions as required:**
* **a) Wrong Data in the columns, if any** 
* **b) Wrong format of the data in the columns, if any**
* **c) Identify the columns which need to be categorically converted to numeric values by using Nominal method/ Ordinal Method**


###**Column-1: Loan_ID**

In [None]:
df['Loan_ID'].value_counts()

LP002974    1
LP001580    1
LP002055    1
LP001865    1
LP001350    1
           ..
LP001574    1
LP001275    1
LP002314    1
LP002602    1
LP001264    1
Name: Loan_ID, Length: 614, dtype: int64

**Observations:**
* a) Every value in this column is unique (it is an identifier), so it will not contribute to the prediction of the Dependent variable

**Decision:**

**We will be dropping this column**

**Action:**

In [None]:
df.drop(['Loan_ID'], axis = 1, inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             614 non-null    object 
 1   Married            614 non-null    object 
 2   Dependents         614 non-null    object 
 3   Education          614 non-null    object 
 4   Self_Employed      614 non-null    object 
 5   ApplicantIncome    614 non-null    int64  
 6   CoapplicantIncome  614 non-null    float64
 7   LoanAmount         614 non-null    float64
 8   Loan_Amount_Term   614 non-null    float64
 9   Credit_History     614 non-null    float64
 10  Property_Area      614 non-null    object 
 11  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(7)
memory usage: 62.4+ KB


###**Column-2: Gender**

In [None]:
df['Gender'].value_counts()

Male      502
Female    112
Name: Gender, dtype: int64

**Observations:**
* a) Data in this column is of "Object" or "String" datatype. Also, the data levels are "Nominal" Type.

**Decision:**

**We will be converting the data in this column into Numerical values using Nominal Type method "pd.get_dummies".**

**Action:**

In [None]:
#encode the data
gender = pd.DataFrame(df['Gender'])
gender_encoded=pd.get_dummies(data= gender, drop_first=True)
gender_encoded

Unnamed: 0,Gender_Male
0,1
1,1
2,1
3,1
4,1
...,...
609,0
610,1
611,1
612,1
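The effect of `drop_first=True` is easier to see on a tiny example. In the sketch below (toy values, not the loan data), the categories are sorted, the first one (`Female`) is dropped, and a single indicator column remains. The `dtype=int` argument is an addition here: newer pandas versions return boolean dummies by default, and passing it keeps the 0/1 output shown above.

```python
import pandas as pd

# Toy frame (hypothetical values) with a two-level nominal column.
toy = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})

# Categories are sorted ("Female", "Male"); drop_first removes "Female",
# so one column is enough to represent both levels.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded)
```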


###**Column-3: Married**

In [None]:
df['Married'].value_counts()

Yes    401
No     213
Name: Married, dtype: int64

**Observations:**
* a) Data in this column is of "Object" or "String" datatype. Also, the data levels are "Nominal" Type.

**Decision:**

**We will be converting the data in this column into Numerical values using Nominal Type method "pd.get_dummies".**

**Action:**

In [None]:
#encode the data
married = pd.DataFrame(df['Married'])
married_encoded=pd.get_dummies(data= married, drop_first=True)
married_encoded

Unnamed: 0,Married_Yes
0,0
1,1
2,1
3,1
4,0
...,...
609,0
610,1
611,1
612,1


###**Column-4: Dependents**

In [None]:
df['Dependents'].value_counts()

0     360
1     102
2     101
3+     51
Name: Dependents, dtype: int64

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             614 non-null    object 
 1   Married            614 non-null    object 
 2   Dependents         614 non-null    object 
 3   Education          614 non-null    object 
 4   Self_Employed      614 non-null    object 
 5   ApplicantIncome    614 non-null    int64  
 6   CoapplicantIncome  614 non-null    float64
 7   LoanAmount         614 non-null    float64
 8   Loan_Amount_Term   614 non-null    float64
 9   Credit_History     614 non-null    float64
 10  Property_Area      614 non-null    object 
 11  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(7)
memory usage: 62.4+ KB


**Observations:**
* a) Data in this column is of "Object" or "String" datatype. But the data elements are in integer format.
* b) We have one set of values "3+" which is in "Wrong Data Format"

**Decision and Actions to be taken:**

* a) Replace the values "3+" with "3"
* b) Convert the data type of the column to "Integer" Type.
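The two steps above can be sketched on a small Series of hypothetical values. The replacement is applied to the column itself, so no other column can be affected.

```python
import pandas as pd

# Toy column (hypothetical values) stored as strings, including the "3+" level.
dependents = pd.Series(["0", "1", "3+", "2", "3+"], name="Dependents")

# Step a) map "3+" to "3"; step b) convert the strings to integers.
cleaned = dependents.replace("3+", "3").astype(int)

print(cleaned.tolist())
print(cleaned.dtype)
```

After this conversion, the value 3 stands for "3 or more" dependents.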


**Action:**

In [None]:
df['Dependents'] = df['Dependents'].replace("3+", "3")  # limit the replacement to the Dependents column
df['Dependents'].value_counts()

0    360
1    102
2    101
3     51
Name: Dependents, dtype: int64

In [None]:
df['Dependents']=df['Dependents'].astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             614 non-null    object 
 1   Married            614 non-null    object 
 2   Dependents         614 non-null    int64  
 3   Education          614 non-null    object 
 4   Self_Employed      614 non-null    object 
 5   ApplicantIncome    614 non-null    int64  
 6   CoapplicantIncome  614 non-null    float64
 7   LoanAmount         614 non-null    float64
 8   Loan_Amount_Term   614 non-null    float64
 9   Credit_History     614 non-null    float64
 10  Property_Area      614 non-null    object 
 11  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(2), object(6)
memory usage: 62.4+ KB


###**Column-5: Education**

In [None]:
df['Education'].value_counts()

Graduate        480
Not Graduate    134
Name: Education, dtype: int64

**Observations:**
* a) Data in this column is of "Object" or "String" datatype. Also, the data levels are "Ordinal" Type.

**Decision:**

**We will be converting the data in this column into Numerical values using the Ordinal Type method "preprocessing.LabelEncoder()".**

**Action:**

In [None]:
le = preprocessing.LabelEncoder()
df['Education'] = le.fit_transform(df.Education.values)
df['Education'].value_counts()

0    480
1    134
Name: Education, dtype: int64
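One detail worth keeping in mind: `LabelEncoder` assigns codes in sorted (alphabetical) order of the labels, which is why "Graduate" became 0 and "Not Graduate" became 1 above. If a specific order is wanted, `OrdinalEncoder` accepts an explicit category list. The sketch below uses toy values, and the order passed to `OrdinalEncoder` is only an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy values (hypothetical sample of the Education column).
education = np.array(["Graduate", "Not Graduate", "Graduate"])

# LabelEncoder: codes follow the sorted order of the labels.
le = LabelEncoder()
codes = le.fit_transform(education)
print(list(le.classes_), codes.tolist())

# OrdinalEncoder: the category order is stated explicitly
# (assumed order for illustration: "Not Graduate" < "Graduate").
oe = OrdinalEncoder(categories=[["Not Graduate", "Graduate"]])
ordered = oe.fit_transform(education.reshape(-1, 1)).ravel()
print(ordered.tolist())
```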

###**Column-6: Self_Employed**

In [None]:
df['Self_Employed'].value_counts()

No     532
Yes     82
Name: Self_Employed, dtype: int64

**Observations:**
* a) Data in this column is of "Object" or "String" datatype. Also, the data levels are "Nominal" Type.

**Decision:**

**We will be converting the data in this column into Numerical values using Nominal Type method "pd.get_dummies".**

**Action:**

In [None]:
#encode the data
self_employed = pd.DataFrame(df['Self_Employed'])
self_employed_encoded=pd.get_dummies(data= self_employed, drop_first=True)
self_employed_encoded

Unnamed: 0,Self_Employed_Yes
0,0
1,0
2,1
3,0
4,0
...,...
609,0
610,0
611,0
612,0


###**Column-7 to 11: ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term and Credit_History**

In [None]:
df.describe()

Unnamed: 0,Dependents,Education,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,614.0,614.0,614.0,614.0,614.0
mean,0.7443,0.218241,5403.459283,1621.245798,145.465798,342.410423,0.855049
std,1.009623,0.413389,6109.041673,2926.248369,84.180967,64.428629,0.352339
min,0.0,0.0,150.0,0.0,9.0,12.0,0.0
25%,0.0,0.0,2877.5,0.0,100.25,360.0,1.0
50%,0.0,0.0,3812.5,1188.5,125.0,360.0,1.0
75%,1.0,0.0,5795.0,2297.25,164.75,360.0,1.0
max,3.0,1.0,81000.0,41667.0,700.0,480.0,1.0


**Observations:**
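* a) Here, all the Integer and Float columns are summarized.
* b) The Standard Deviation, Min and Max values differ widely between columns (for example, ApplicantIncome ranges from 150 to 81000, while Credit_History ranges from 0 to 1).
* c) No obviously wrong data or wrong data format is visible in these statistics.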
* **d) But we need to do Scaling**

###**Column-12: Property_Area**

In [None]:
df['Property_Area'].value_counts()

Semiurban    233
Urban        202
Rural        179
Name: Property_Area, dtype: int64

**Observations:**
* a) Data in this column is of "Object" or "String" datatype. Also, the data levels are "Ordinal" Type.

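**Decision:**

**We will be converting the data in this column into Numerical values using the Ordinal Type method "preprocessing.LabelEncoder()", which assigns the codes in alphabetical order of the labels (Rural = 0, Semiurban = 1, Urban = 2).**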

**Action:**

In [None]:
le = preprocessing.LabelEncoder()
df['Property_Area'] = le.fit_transform(df.Property_Area.values)
df['Property_Area'].value_counts()

1    233
2    202
0    179
Name: Property_Area, dtype: int64

###**Column-13: Loan_Status**

In [None]:
df['Loan_Status'].value_counts()

Y    422
N    192
Name: Loan_Status, dtype: int64

**Observations:**
* a) Data in this column is of "Object" or "String" datatype and holds the two class labels ("Y" and "N") of the Dependent Variable.

**Decision:**

**We will be converting the data in this column into Numerical values using "preprocessing.LabelEncoder()", which is intended for encoding target labels.**

**Action:**

In [None]:
le = preprocessing.LabelEncoder()
df['Loan_Status'] = le.fit_transform(df.Loan_Status.values)
df['Loan_Status'].value_counts()

1    422
0    192
Name: Loan_Status, dtype: int64

**Drop the columns that were categorically converted and add their respective converted Numeric columns**

In [None]:
df.drop(['Gender', 'Married','Self_Employed',], axis = 1, inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 614 entries, 0 to 613
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Dependents         614 non-null    int64  
 1   Education          614 non-null    int64  
 2   ApplicantIncome    614 non-null    int64  
 3   CoapplicantIncome  614 non-null    float64
 4   LoanAmount         614 non-null    float64
 5   Loan_Amount_Term   614 non-null    float64
 6   Credit_History     614 non-null    float64
 7   Property_Area      614 non-null    int64  
 8   Loan_Status        614 non-null    int64  
dtypes: float64(4), int64(5)
memory usage: 48.0 KB


In [None]:
df = pd.concat([df,gender_encoded, married_encoded,self_employed_encoded], axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Dependents         614 non-null    int64  
 1   Education          614 non-null    int64  
 2   ApplicantIncome    614 non-null    int64  
 3   CoapplicantIncome  614 non-null    float64
 4   LoanAmount         614 non-null    float64
 5   Loan_Amount_Term   614 non-null    float64
 6   Credit_History     614 non-null    float64
 7   Property_Area      614 non-null    int64  
 8   Loan_Status        614 non-null    int64  
 9   Gender_Male        614 non-null    uint8  
 10  Married_Yes        614 non-null    uint8  
 11  Self_Employed_Yes  614 non-null    uint8  
dtypes: float64(4), int64(5), uint8(3)
memory usage: 49.8 KB


In [None]:
df.corr()

Unnamed: 0,Dependents,Education,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,Gender_Male,Married_Yes,Self_Employed_Yes
Dependents,1.0,0.055752,0.118202,0.03043,0.163017,-0.103864,-0.04016,-0.000244,0.010118,0.172914,0.334216,0.056798
Education,0.055752,1.0,-0.14076,-0.06229,-0.169436,-0.073928,-0.073658,-0.065243,-0.085884,0.045364,0.012304,-0.010383
ApplicantIncome,0.118202,-0.14076,1.0,-0.116605,0.564698,-0.046531,-0.018615,-0.0095,-0.00471,0.058809,0.051708,0.12718
CoapplicantIncome,0.03043,-0.06229,-0.116605,1.0,0.189723,-0.059383,0.011134,0.010522,-0.059187,0.082912,0.075948,-0.0161
LoanAmount,0.163017,-0.169436,0.564698,0.189723,1.0,0.037152,-0.00025,-0.047414,-0.031808,0.106404,0.146212,0.114971
Loan_Amount_Term,-0.103864,-0.073928,-0.046531,-0.059383,0.037152,1.0,-0.004705,-0.07612,-0.022549,-0.07403,-0.100912,-0.033739
Credit_History,-0.04016,-0.073658,-0.018615,0.011134,-0.00025,-0.004705,1.0,0.001963,0.540556,0.00917,0.010938,-0.00155
Property_Area,-0.000244,-0.065243,-0.0095,0.010522,-0.047414,-0.07612,0.001963,1.0,0.032112,-0.025752,0.004257,-0.03086
Loan_Status,0.010118,-0.085884,-0.00471,-0.059187,-0.031808,-0.022549,0.540556,0.032112,1.0,0.017987,0.091478,-0.0037
Gender_Male,0.172914,0.045364,0.058809,0.082912,0.106404,-0.07403,0.00917,-0.025752,0.017987,1.0,0.364569,-0.000525


##**VII) Check the Test accuracy using appropriate algorithm and Holdout Method.**

##**Step-5: Slice X and y Values**

In [None]:
X = df.drop(['Loan_Status'], axis = 1)
Y = df['Loan_Status']
X.head()

Unnamed: 0,Dependents,Education,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Gender_Male,Married_Yes,Self_Employed_Yes
0,0,0,5849,0.0,120.0,360.0,1.0,2,1,0,0
1,1,0,4583,1508.0,128.0,360.0,1.0,0,1,1,0
2,0,0,3000,0.0,66.0,360.0,1.0,2,1,1,1
3,0,1,2583,2358.0,120.0,360.0,1.0,2,1,1,0
4,0,0,6000,0.0,141.0,360.0,1.0,2,1,0,0


In [None]:
Y.head()

0    1
1    0
2    1
3    1
4    1
Name: Loan_Status, dtype: int64

##**Step-6: Execute Train-Test-Split Command and Verify**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 66)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(491, 11)
(491,)
(123, 11)
(123,)
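Because the classes are imbalanced, a plain random split can leave the test set with a somewhat different class mix than the full data. One common option, sketched below as an assumption rather than as part of this notebook's method, is the `stratify` argument of `train_test_split`. The labels are synthetic, built only to have the same 422/192 counts as `Loan_Status`, and the feature column is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as Loan_Status (422 ones, 192 zeros).
y_demo = np.array([1] * 422 + [0] * 192)
# Placeholder feature column; the values carry no meaning.
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions nearly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=66, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(round(y_demo.mean(), 3), round(y_te.mean(), 3))
```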


##**Step-7: Learn the Data and Predict the dependent Variable values for the "X_test" data using the "LogisticRegression()" algorithm**

In [None]:
from sklearn.linear_model import LogisticRegression
#create an instance and fit the model 
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
y_pred = logmodel.predict(X_test)
y_pred

array([1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

##**Step-8: Calculate the Accuracy of the Model**

In [None]:
accuracy_lr = logmodel.score(X_test, y_test)
print("Accuracy of Logistic Regression on test set:",accuracy_lr)

Accuracy of Logistic Regression on test set: 0.8211382113821138


##**Step-9: Display the Confusion Matrix and Classification Report of the Model**

In [None]:
from sklearn.metrics import classification_report, confusion_matrix  
print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))  

[[15 20]
 [ 2 86]]
              precision    recall  f1-score   support

           0       0.88      0.43      0.58        35
           1       0.81      0.98      0.89        88

    accuracy                           0.82       123
   macro avg       0.85      0.70      0.73       123
weighted avg       0.83      0.82      0.80       123
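The figures in the report can be reproduced by hand from the confusion matrix, which helps when reading it. In scikit-learn's layout, rows are the true classes and columns are the predicted classes. The sketch below copies the four counts from the output above.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1).
cm = np.array([[15, 20],
               [2, 86]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()      # correct predictions / all predictions
precision_1 = tp / (tp + fp)         # of the rows predicted 1, share truly 1
recall_1 = tp / (tp + fn)            # of the rows truly 1, share predicted 1
recall_0 = tn / (tn + fp)            # of the rows truly 0, share predicted 0

print(round(accuracy, 2), round(precision_1, 2),
      round(recall_1, 2), round(recall_0, 2))
```

The low recall for class 0 shows that most rejected loans in the test set are predicted as approved, which the overall accuracy alone does not reveal.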



##**VIII) Implement the Scaling as required**

###**Use Normalization**

In [None]:
columnNames = ['Dependents', 'Education', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Gender_Male', 'Married_Yes', 'Self_Employed_Yes']

In [None]:
min_max_scaler_object = preprocessing.MinMaxScaler()
X_train1 = min_max_scaler_object.fit_transform(X_train)
X_train1 = pd.DataFrame(X_train1 , columns = columnNames)
X_train1.head()

Unnamed: 0,Dependents,Education,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Gender_Male,Married_Yes,Self_Employed_Yes
0,0.0,1.0,0.021732,0.05676,0.160637,0.72973,1.0,1.0,0.0,0.0,0.0
1,0.333333,0.0,0.05483,0.036192,0.172214,0.72973,1.0,0.0,1.0,1.0,0.0
2,0.666667,0.0,0.045875,0.0,0.125904,0.72973,1.0,0.5,1.0,1.0,0.0
3,0.0,0.0,0.041435,0.040008,0.151954,0.72973,1.0,0.5,1.0,1.0,0.0
4,0.0,0.0,0.038268,0.0,0.10275,0.72973,1.0,1.0,0.0,0.0,0.0


In [None]:
min_max_scaler_object = preprocessing.MinMaxScaler()
X_test1 = min_max_scaler_object.fit_transform(X_test)
X_test1 = pd.DataFrame(X_test1 , columns = columnNames)
X_test1.head()

Unnamed: 0,Dependents,Education,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Gender_Male,Married_Yes,Self_Employed_Yes
0,1.0,0.0,0.189248,0.0,0.17193,0.358974,1.0,1.0,1.0,1.0,0.0
1,0.0,0.0,0.078489,0.0,0.070175,0.74359,1.0,1.0,1.0,0.0,0.0
2,0.0,0.0,0.057554,1.0,0.12807,0.74359,1.0,0.5,1.0,0.0,0.0
3,0.0,0.0,0.109905,0.0,0.14386,0.74359,1.0,0.0,0.0,1.0,0.0
4,0.0,1.0,0.451255,0.0,0.166667,0.74359,1.0,1.0,0.0,0.0,1.0


In [None]:
from sklearn.linear_model import LogisticRegression
#create an instance and fit the model 
logmodel1 = LogisticRegression()
logmodel1.fit(X_train1, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
#predictions
predictions1 = logmodel1.predict(X_test1)

In [None]:
print(confusion_matrix(y_test, predictions1))
print(classification_report(y_test,predictions1))

[[15 20]
 [ 1 87]]
              precision    recall  f1-score   support

           0       0.94      0.43      0.59        35
           1       0.81      0.99      0.89        88

    accuracy                           0.83       123
   macro avg       0.88      0.71      0.74       123
weighted avg       0.85      0.83      0.81       123



###**Use Standardization**

In [None]:
from sklearn.preprocessing import StandardScaler
std_scaler_object = preprocessing.StandardScaler()
X_train2 = std_scaler_object.fit_transform(X_train)
X_train2 = pd.DataFrame(X_train2 , columns = columnNames)
X_train2.head()

Unnamed: 0,Dependents,Education,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Gender_Male,Married_Yes,Self_Employed_Yes
0,-0.736299,1.952876,-0.569716,0.25983,-0.317516,0.28626,0.417901,1.235897,-2.139987,-1.349755,-0.397516
1,0.265149,-0.512065,-0.130289,-0.03307,-0.222873,0.28626,0.417901,-1.303122,0.467293,0.740875,-0.397516
2,1.266598,-0.512065,-0.249177,-0.548464,-0.601444,0.28626,0.417901,-0.033612,0.467293,0.740875,-0.397516
3,-0.736299,-0.512065,-0.308129,0.021272,-0.388498,0.28626,0.417901,-0.033612,0.467293,0.740875,-0.397516
4,-0.736299,-0.512065,-0.350167,-0.548464,-0.79073,0.28626,0.417901,1.235897,-2.139987,-1.349755,-0.397516


In [None]:
from sklearn.preprocessing import StandardScaler
std_scaler_object = preprocessing.StandardScaler()
X_test2 = std_scaler_object.fit_transform(X_test)
X_test2 = pd.DataFrame(X_test2 , columns = columnNames)
X_test2.head()

Unnamed: 0,Dependents,Education,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Gender_Male,Married_Yes,Self_Employed_Yes
0,2.117998,-0.592999,0.361809,-0.578804,-0.145783,-2.423362,0.386695,1.175045,0.492366,0.681385,-0.372678
1,-0.744791,-0.592999,-0.353648,-0.578804,-0.851356,0.225396,0.386695,1.175045,0.492366,-1.467599,-0.372678
2,-0.744791,-0.592999,-0.488882,6.282905,-0.449909,0.225396,0.386695,-0.103986,0.492366,-1.467599,-0.372678
3,-0.744791,-0.592999,-0.150716,-0.578804,-0.340424,0.225396,0.386695,-1.383018,-2.03101,0.681385,-0.372678
4,-0.744791,1.686342,2.054263,-0.578804,-0.182278,0.225396,0.386695,1.175045,-2.03101,-1.467599,2.683282


In [None]:
from sklearn.linear_model import LogisticRegression
#create an instance and fit the model 
logmodel2 = LogisticRegression()
logmodel2.fit(X_train2, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
#predictions
predictions2 = logmodel2.predict(X_test2)

In [None]:
print(confusion_matrix(y_test, predictions2))
print(classification_report(y_test,predictions2))

[[15 20]
 [ 1 87]]
              precision    recall  f1-score   support

           0       0.94      0.43      0.59        35
           1       0.81      0.99      0.89        88

    accuracy                           0.83       123
   macro avg       0.88      0.71      0.74       123
weighted avg       0.85      0.83      0.81       123



**Observation: Both scaling methods give an accuracy of 83%, compared with 82% without scaling.**

**Decision: We will use the "Normalization" method for our model.**
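In the cells above, a new scaler is fitted on `X_test` separately, so the training and test sets are scaled with different minimum and maximum values. A commonly recommended alternative is to fit the scaler on the training data only and reuse it for the test data. The sketch below shows that pattern on synthetic arrays; the `_demo` names are hypothetical, and no claim is made about how the scores above would change.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-ins for X_train and X_test (hypothetical values).
rng = np.random.default_rng(0)
X_train_demo = rng.uniform(0, 100, size=(80, 2))
X_test_demo = rng.uniform(0, 100, size=(20, 2))

scaler = MinMaxScaler()
# Learn the per-column min and max from the training rows only.
X_train_scaled = scaler.fit_transform(X_train_demo)
# Apply the same min and max to the test rows (no re-fitting).
X_test_scaled = scaler.transform(X_test_demo)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
print(X_test_scaled.min(axis=0), X_test_scaled.max(axis=0))
```

With this pattern, test values can fall slightly outside the 0 to 1 range, which is expected because the test rows did not take part in fitting.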


##**IX) Write out the transformed Input file for further usage**

In [None]:
X1 = df.drop(['Loan_Status'], axis = 1)
Y1 = df['Loan_Status']
X1.head()

Unnamed: 0,Dependents,Education,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Gender_Male,Married_Yes,Self_Employed_Yes
0,0,0,5849,0.0,120.0,360.0,1.0,2,1,0,0
1,1,0,4583,1508.0,128.0,360.0,1.0,0,1,1,0
2,0,0,3000,0.0,66.0,360.0,1.0,2,1,1,1
3,0,1,2583,2358.0,120.0,360.0,1.0,2,1,1,0
4,0,0,6000,0.0,141.0,360.0,1.0,2,1,0,0


In [None]:
min_max_scaler_object = preprocessing.MinMaxScaler()
X2 = min_max_scaler_object.fit_transform(X1)
X2 = pd.DataFrame(X2 , columns = columnNames)
print(X2.shape)
X2.head()

(614, 11)


Unnamed: 0,Dependents,Education,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Gender_Male,Married_Yes,Self_Employed_Yes
0,0.0,0.0,0.070489,0.0,0.160637,0.74359,1.0,1.0,1.0,0.0,0.0
1,0.333333,0.0,0.05483,0.036192,0.172214,0.74359,1.0,0.0,1.0,1.0,0.0
2,0.0,0.0,0.03525,0.0,0.082489,0.74359,1.0,1.0,1.0,1.0,1.0
3,0.0,1.0,0.030093,0.056592,0.160637,0.74359,1.0,1.0,1.0,1.0,0.0
4,0.0,0.0,0.072356,0.0,0.191027,0.74359,1.0,1.0,1.0,0.0,0.0


In [None]:
df1 = pd.DataFrame(data=X2)
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Dependents         614 non-null    float64
 1   Education          614 non-null    float64
 2   ApplicantIncome    614 non-null    float64
 3   CoapplicantIncome  614 non-null    float64
 4   LoanAmount         614 non-null    float64
 5   Loan_Amount_Term   614 non-null    float64
 6   Credit_History     614 non-null    float64
 7   Property_Area      614 non-null    float64
 8   Gender_Male        614 non-null    float64
 9   Married_Yes        614 non-null    float64
 10  Self_Employed_Yes  614 non-null    float64
dtypes: float64(11)
memory usage: 52.9 KB


In [None]:
df1.head()

Unnamed: 0,Dependents,Education,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Gender_Male,Married_Yes,Self_Employed_Yes
0,0.0,0.0,0.070489,0.0,0.160637,0.74359,1.0,1.0,1.0,0.0,0.0
1,0.333333,0.0,0.05483,0.036192,0.172214,0.74359,1.0,0.0,1.0,1.0,0.0
2,0.0,0.0,0.03525,0.0,0.082489,0.74359,1.0,1.0,1.0,1.0,1.0
3,0.0,1.0,0.030093,0.056592,0.160637,0.74359,1.0,1.0,1.0,1.0,0.0
4,0.0,0.0,0.072356,0.0,0.191027,0.74359,1.0,1.0,1.0,0.0,0.0


In [None]:
df1 = pd.concat([df1,Y1], axis=1)
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Dependents         614 non-null    float64
 1   Education          614 non-null    float64
 2   ApplicantIncome    614 non-null    float64
 3   CoapplicantIncome  614 non-null    float64
 4   LoanAmount         614 non-null    float64
 5   Loan_Amount_Term   614 non-null    float64
 6   Credit_History     614 non-null    float64
 7   Property_Area      614 non-null    float64
 8   Gender_Male        614 non-null    float64
 9   Married_Yes        614 non-null    float64
 10  Self_Employed_Yes  614 non-null    float64
 11  Loan_Status        614 non-null    int64  
dtypes: float64(11), int64(1)
memory usage: 57.7 KB


In [None]:
df1.head()

Unnamed: 0,Dependents,Education,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Gender_Male,Married_Yes,Self_Employed_Yes,Loan_Status
0,0.0,0.0,0.070489,0.0,0.160637,0.74359,1.0,1.0,1.0,0.0,0.0,1
1,0.333333,0.0,0.05483,0.036192,0.172214,0.74359,1.0,0.0,1.0,1.0,0.0,0
2,0.0,0.0,0.03525,0.0,0.082489,0.74359,1.0,1.0,1.0,1.0,1.0,1
3,0.0,1.0,0.030093,0.056592,0.160637,0.74359,1.0,1.0,1.0,1.0,0.0,1
4,0.0,0.0,0.072356,0.0,0.191027,0.74359,1.0,1.0,1.0,0.0,0.0,1


In [None]:
# Save the preprocessed DataFrame to Google Drive as a CSV file
df1.to_csv("gdrive/My Drive/SRM-MLP-Internship-2021/Projects/Classification/01-Loan-Status-Prediction/Data-Files/Loan_Status_Train_Preprocessed.csv", index = False)
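A quick way to confirm that a written file can be used later is to read it back and compare it with what was saved. The sketch below does this with a small hypothetical frame and a temporary directory, so it does not touch the Drive path above; the file name is an assumption for illustration.

```python
import os
import tempfile

import pandas as pd

# Small hypothetical frame standing in for the preprocessed data.
demo = pd.DataFrame({"Credit_History": [1.0, 0.0], "Loan_Status": [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_preprocessed.csv")  # hypothetical file name
    # index=False avoids writing the row index as an extra column.
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.shape)
print(list(reloaded.columns))
```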