# **Loan Approval Prediction** 

### **I chose this dataset because it presents several challenges: it is relatively small (614 rows) and contains many missing values and outliers. I addressed these issues through data preprocessing and reached a cross-validated accuracy of about 80%, which is a strong result given the limited size of the data.**



## Import data

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

In [2]:
df = pd.read_csv('/Users/mohammedmahmood/Desktop/Data projects/Data science/Loan Default Detection Prediction  Main/Loan_Default_Detection_Prediction.csv')
df

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


## Check duplication

In [4]:
df.duplicated().sum()

np.int64(0)

## Null Values

In [5]:
df.isna().mean() * 100

Loan_ID              0.000000
Gender               2.117264
Married              0.488599
Dependents           2.442997
Education            0.000000
Self_Employed        5.211726
ApplicantIncome      0.000000
CoapplicantIncome    0.000000
LoanAmount           3.583062
Loan_Amount_Term     2.280130
Credit_History       8.143322
Property_Area        0.000000
Loan_Status          0.000000
dtype: float64

### I will handle the null values within the machine learning pipeline to ensure proper data preprocessing and prevent data leakage
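To make the leakage point concrete, here is a minimal sketch on toy data (the `X_demo` and `y_demo` names and the median strategy are assumptions for illustration, not the notebook's actual setup). Because the imputer is a pipeline step, each cross-validation fold fits it on that fold's training rows only, so no statistic from the held-out rows reaches the model.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): one numeric feature with missing values and a binary target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=100)})
X_demo.loc[::10, "feature"] = np.nan          # every 10th value is missing
y_demo = (rng.random(100) > 0.5).astype(int)

# The imputer lives inside the pipeline, so it is re-fitted on each fold's training rows.
leak_free = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", LogisticRegression()),
])
fold_scores = cross_val_score(leak_free, X_demo, y_demo, cv=5, scoring="accuracy")
print(fold_scores.shape)
```

Imputing the full DataFrame before splitting would instead compute the fill value from all rows, including those later used for testing.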

In [6]:
# Loan_ID is only an identifier and carries no predictive information, so drop it
df.drop("Loan_ID", axis=1, inplace=True)

In [7]:
df.columns

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

## Univariate analysis

### **In this step I will go through each column, fix any problems in it, and do EDA to understand the features**

## Gender

In [8]:
df.Gender.unique()

array(['Male', 'Female', nan], dtype=object)

In [9]:
df.Gender.value_counts()

Gender
Male      489
Female    112
Name: count, dtype: int64

## Married

In [10]:
df.Married.unique()

array(['No', 'Yes', nan], dtype=object)

In [11]:
df.Married.value_counts()

Married
Yes    398
No     213
Name: count, dtype: int64

## Dependents

In [12]:
df.Dependents.unique()

array(['0', '1', '2', '3+', nan], dtype=object)

In [13]:
df['Dependents'] = df['Dependents'].replace('3+', '3')

In [14]:
df.Dependents.info()

<class 'pandas.core.series.Series'>
RangeIndex: 614 entries, 0 to 613
Series name: Dependents
Non-Null Count  Dtype 
--------------  ----- 
599 non-null    object
dtypes: object(1)
memory usage: 4.9+ KB


In [15]:
df.Dependents.unique()

array(['0', '1', '2', '3', nan], dtype=object)

In [16]:
df['Dependents'] = df['Dependents'].astype('Int64')

In [17]:
df.Dependents.info()

<class 'pandas.core.series.Series'>
RangeIndex: 614 entries, 0 to 613
Series name: Dependents
Non-Null Count  Dtype
--------------  -----
599 non-null    Int64
dtypes: Int64(1)
memory usage: 5.5 KB
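The same cleaning can be sketched on a small stand-in Series (`dependents_demo` is hypothetical). Going through `pd.to_numeric` first is one way to make the cast to the nullable `Int64` dtype explicit; note that after the replacement the value 3 really means "3 or more".

```python
import numpy as np
import pandas as pd

# Stand-in for the Dependents column: strings, a '3+' bucket, and a missing value.
dependents_demo = pd.Series(["0", "1", "3+", np.nan], dtype="object")

# Replace the bucket label, parse to numbers, then cast to the nullable integer dtype.
cleaned = pd.to_numeric(dependents_demo.replace("3+", "3")).astype("Int64")
print(cleaned.dtype)
print(cleaned.isna().sum())
```

The missing entry survives as `<NA>` instead of forcing the column to float, which is why `Int64` (capital I) is used rather than `int64`.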


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             601 non-null    object 
 1   Married            611 non-null    object 
 2   Dependents         599 non-null    Int64  
 3   Education          614 non-null    object 
 4   Self_Employed      582 non-null    object 
 5   ApplicantIncome    614 non-null    int64  
 6   CoapplicantIncome  614 non-null    float64
 7   LoanAmount         592 non-null    float64
 8   Loan_Amount_Term   600 non-null    float64
 9   Credit_History     564 non-null    float64
 10  Property_Area      614 non-null    object 
 11  Loan_Status        614 non-null    object 
dtypes: Int64(1), float64(4), int64(1), object(6)
memory usage: 58.3+ KB


## Education

In [19]:
df.Education.unique()

array(['Graduate', 'Not Graduate'], dtype=object)

## Self_Employed

In [20]:
df.Self_Employed.unique()

array(['No', 'Yes', nan], dtype=object)

## ApplicantIncome

In [21]:
df.ApplicantIncome.describe() 

count      614.000000
mean      5403.459283
std       6109.041673
min        150.000000
25%       2877.500000
50%       3812.500000
75%       5795.000000
max      81000.000000
Name: ApplicantIncome, dtype: float64

In [22]:
px.histogram(df, x= "ApplicantIncome") 

In [23]:
px.box(df, x= "ApplicantIncome")

### **Taking the log of the column to reduce skewness and dampen outliers**

In [24]:
df["log_ApplicantIncome"] = np.log(df["ApplicantIncome"]) 

In [25]:
px.histogram(df, x= "log_ApplicantIncome")

In [26]:
px.box(df, x= "log_ApplicantIncome")
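The effect of the log transform can be checked numerically as well as visually. The sketch below uses synthetic right-skewed data (a log-normal sample, which is an assumption and not the real income column) and compares the sample skewness before and after the transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "income" values (hypothetical, log-normally distributed).
rng = np.random.default_rng(42)
income_demo = pd.Series(rng.lognormal(mean=8.0, sigma=0.8, size=1000))

skew_raw = income_demo.skew()            # strongly positive for right-skewed data
skew_log = np.log(income_demo).skew()    # much closer to zero after the transform
print(round(skew_raw, 2), round(skew_log, 2))
```

Running `df["ApplicantIncome"].skew()` and `df["log_ApplicantIncome"].skew()` in the notebook would give the corresponding figures for the real column.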

## LoanAmount

In [27]:
df.LoanAmount.describe() 

count    592.000000
mean     146.412162
std       85.587325
min        9.000000
25%      100.000000
50%      128.000000
75%      168.000000
max      700.000000
Name: LoanAmount, dtype: float64

In [28]:
px.histogram(df, x= "LoanAmount") 

In [29]:
px.box(df, x= "LoanAmount")

### **Taking the log of the column to reduce skewness and dampen outliers**

In [30]:
df["log_LoanAmount"] = np.log(df["LoanAmount"]) 

In [31]:
px.histogram(df, x= "log_LoanAmount")

In [32]:
px.box(df, x= "log_LoanAmount")

In [33]:
df.columns

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status',
       'log_ApplicantIncome', 'log_LoanAmount'],
      dtype='object')

## Loan_Amount_Term

In [34]:
df.Loan_Amount_Term.describe()

count    600.00000
mean     342.00000
std       65.12041
min       12.00000
25%      360.00000
50%      360.00000
75%      360.00000
max      480.00000
Name: Loan_Amount_Term, dtype: float64

In [35]:
px.histogram(df, x= "Loan_Amount_Term" )

## Credit_History

In [36]:
df.Credit_History.unique()

array([ 1.,  0., nan])

In [37]:
df.Credit_History.value_counts()

Credit_History
1.0    475
0.0     89
Name: count, dtype: int64

## Property_Area

In [38]:
df.Property_Area.unique()

array(['Urban', 'Rural', 'Semiurban'], dtype=object)

In [39]:
df.Property_Area.value_counts() 

Property_Area
Semiurban    233
Urban        202
Rural        179
Name: count, dtype: int64

## Loan_Status

In [40]:
df["Loan_Status"]. value_counts()

Loan_Status
Y    422
N    192
Name: count, dtype: int64

In [41]:
# Map Target
df["Loan_Status"] = df["Loan_Status"].map({"N" : 0, "Y": 1})

### **The target is imbalanced; I will handle this during modeling**
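The class counts above also give a useful reference point. As a small worked calculation, a model that always predicts "approved" would already be right for every approved row, so its accuracy equals the approved share:

```python
# Class counts taken from the value_counts() output above.
approved, rejected = 422, 192
total = approved + rejected

majority_baseline = approved / total   # accuracy of always predicting "approved"
print(f"Approved share: {majority_baseline:.3f}")
print(f"Rejected share: {rejected / total:.3f}")
```

Any accuracy reported later is best read against this baseline of roughly 0.69, and recall on each class helps show whether a model is doing more than favoring the majority.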

## Multivariate analysis 

### **Multivariate analysis was performed to examine the relationships between multiple features and the target**

In [42]:
df.columns

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status',
       'log_ApplicantIncome', 'log_LoanAmount'],
      dtype='object')

## 1-Does higher income always mean loan approval?

In [43]:
fig = px.box(df, x='Loan_Status', y='ApplicantIncome', color='Loan_Status',
             title='Applicant Income Distribution by Loan Status')
fig.show()


### **Insights** :

### **Higher income doesn’t consistently lead to loan approval; other factors likely play a larger role**

## 2-Is there a relationship between number of dependents and loan approval?

In [44]:
fig = px.histogram(df, x='Dependents', color='Loan_Status',
                   barmode='group', 
                   title='Loan Status by Dependents')
fig.show()


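## Insights:
### **Applicants with 0 dependents have the highest approval count. A lower financial burden is one possible explanation, but the chart shows counts rather than approval rates, so it does not by itself show that fewer dependents improve the chance of approval.**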

## 3-Does property area affect loan approval?

In [45]:
fig = px.histogram(df, x= "Property_Area", color='Loan_Status',
                    barmode='group',
                   title='Loan Approval by Property Area')
fig.show()


## Insights:
### **Semiurban areas have the highest approval count, followed by rural and urban areas, suggesting that semiurban applications may be approved more often**

## 4-Are shorter or longer loans more likely to be approved?

In [46]:
fig = px.histogram(df, x='Loan_Amount_Term', color='Loan_Status', barmode='group',
                   title='Loan Approval by Loan Term Duration')
fig.show()


## Insights:
### **Longer loans are more likely to be approved, possibly due to better repayment structuring or applicant profiles suited to extended terms.**

## 5-Do coapplicants increase approval chances?


In [47]:
fig = px.box(df, x='Loan_Status', y='CoapplicantIncome', color='Loan_Status',
             title='Coapplicant Income by Loan Status')
fig.show()


## Insights: 
### **Coapplicant income has a limited impact on improving loan approval chances** 

# Feature Engineering

## Total_Income

In [48]:
df["Total_Income"] = df["ApplicantIncome"] + df["CoapplicantIncome"] 

In [49]:
px.histogram(df, x= "Total_Income")

In [50]:
df["log_Total_Income"] = np.log(df["Total_Income"]) 

In [51]:
px.histogram(df, x= "log_Total_Income")

## Loan_Monthly_Paid

In [None]:
df["Loan_Monthly_Paid"] = round((df["LoanAmount"] * 1000 ) / (df["Loan_Amount_Term"]) , 2 ) 
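As a worked example of this formula, take row 1 of the preview (`LoanAmount` 128.0, `Loan_Amount_Term` 360.0). The sketch assumes, as the multiplication by 1000 suggests, that `LoanAmount` is recorded in thousands and the term in months; it is a principal-only installment that ignores interest.

```python
# Values from row 1 of the DataFrame preview.
loan_amount_thousands = 128.0   # LoanAmount (assumed to be in thousands)
term_months = 360.0             # Loan_Amount_Term (assumed to be in months)

# Principal-only monthly installment, rounded to 2 decimals as in the notebook cell.
monthly_paid = round(loan_amount_thousands * 1000 / term_months, 2)
print(monthly_paid)
```

Rows where either `LoanAmount` or `Loan_Amount_Term` is missing yield NaN for this feature, so it carries more missing values than either input alone.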

In [53]:
px.histogram(df, x="Loan_Monthly_Paid")

In [54]:
df["log_Loan_Monthly_Paid"] = np.log(df["Loan_Monthly_Paid"]) 

In [55]:
px.histogram(df, x= "log_Loan_Monthly_Paid")

## Income_After_Loan

In [56]:
df["Income_After_Loan"] = df["ApplicantIncome"] - df["Loan_Monthly_Paid"] 

In [57]:
px.histogram(df, x= "Income_After_Loan")

In [58]:
df["log_Income_After_Loan"] = np.log(df["Income_After_Loan"]) 


invalid value encountered in log
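The warning appears because `np.log` is undefined for negative inputs, which occur when the monthly payment exceeds the applicant's income. One way to make that handling explicit, sketched here on a hypothetical Series, is to mask non-positive values before taking the log so they become NaN deliberately rather than through a warning:

```python
import numpy as np
import pandas as pd

# Hypothetical values: two valid incomes, one negative, one zero.
income_after_demo = pd.Series([2500.0, 120.5, -300.0, 0.0])

# Keep only strictly positive values; everything else becomes NaN before the log.
positive_only = income_after_demo.where(income_after_demo > 0)
log_demo = np.log(positive_only)
print(log_demo.isna().sum())
```

The resulting NaN values are then treated like any other missing value by the imputer in the modeling pipeline.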



In [59]:
px.histogram(df, x= "log_Income_After_Loan")

In [60]:
df['Income_to_LoanRatio'] = (df['Total_Income'] / df['LoanAmount'].replace(0, np.nan)).round(2)


In [61]:
df['Income_to_LoanRatio'].describe()

count    592.000000
mean      51.225811
std       37.916858
min       12.090000
25%       35.525000
50%       41.425000
75%       51.775000
max      396.370000
Name: Income_to_LoanRatio, dtype: float64

# Feature Importance

## Check Correlation

In [62]:
df_corr = df.select_dtypes(include="number")
corr = df_corr.corr()['Loan_Status'].sort_values(ascending=False)
print(corr)


Loan_Status              1.000000
Credit_History           0.561678
Income_to_LoanRatio      0.024229
log_ApplicantIncome      0.010977
log_Total_Income         0.007240
Dependents               0.006781
Income_After_Loan       -0.004473
ApplicantIncome         -0.004710
log_Income_After_Loan   -0.006132
Loan_Monthly_Paid       -0.015394
Loan_Amount_Term        -0.021268
Total_Income            -0.031271
log_Loan_Monthly_Paid   -0.031284
LoanAmount              -0.037318
log_LoanAmount          -0.039100
CoapplicantIncome       -0.059187
Name: Loan_Status, dtype: float64


In [63]:
df.select_dtypes(include="number").columns

Index(['Dependents', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Loan_Status',
       'log_ApplicantIncome', 'log_LoanAmount', 'Total_Income',
       'log_Total_Income', 'Loan_Monthly_Paid', 'log_Loan_Monthly_Paid',
       'Income_After_Loan', 'log_Income_After_Loan', 'Income_to_LoanRatio'],
      dtype='object')

In [64]:
df_importance = df.copy()

In [65]:
df_importance.dropna(inplace= True)

In [66]:
df_Cat = df.select_dtypes(include="object")

for col in df_Cat.columns :
    print(f"Column {col} has {df_Cat[col].nunique()} values")

Column Gender has 2 values
Column Married has 2 values
Column Education has 2 values
Column Self_Employed has 2 values
Column Property_Area has 3 values


### Encoding the categorical columns to compute feature importance with ExtraTreesClassifier

In [67]:
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import ExtraTreesClassifier

cat_cols = [
    'Gender', 'Married', 'Education', "Self_Employed", "Property_Area"
]

# NOTE: 'Loan_Status' is the target. Keeping it in num_cols puts it into X below,
# so the importances that follow are dominated by the leaked label.
num_cols = ['Dependents', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Loan_Status',
       'log_ApplicantIncome', 'log_LoanAmount', 'Total_Income',
       'log_Total_Income', 'Loan_Monthly_Paid', 'log_Loan_Monthly_Paid',
       'Income_After_Loan', 'log_Income_After_Loan', 'Income_to_LoanRatio']

# Label-encode each categorical column on the copy that has no missing rows
df_encoded = df_importance.copy()
for col in cat_cols:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))


X = df_encoded[cat_cols + num_cols]
y = df_encoded['Loan_Status']  

## 2-Model-Based Feature Importance (ExtraTreesClassifier)

In [68]:
# Fit ExtraTreesClassifier
model = ExtraTreesClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

#  Get feature importances
importances = pd.Series(model.feature_importances_, index=X.columns)
importances_sorted = importances.sort_values(ascending=False)

#  Show top features
importances_sorted

Loan_Status              0.749878
Credit_History           0.140936
Total_Income             0.007512
CoapplicantIncome        0.007220
Income_After_Loan        0.007055
Property_Area            0.006850
Married                  0.006777
log_LoanAmount           0.006702
log_Total_Income         0.006638
LoanAmount               0.006623
Loan_Monthly_Paid        0.006439
log_Loan_Monthly_Paid    0.006135
Loan_Amount_Term         0.005903
log_Income_After_Loan    0.005721
log_ApplicantIncome      0.005526
Income_to_LoanRatio      0.005378
ApplicantIncome          0.005309
Gender                   0.003877
Dependents               0.003737
Education                0.002988
Self_Employed            0.002795
dtype: float64
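The listing above ranks `Loan_Status` first, which is what happens when the target column is part of the feature matrix: the trees can read the answer directly. A minimal sketch on synthetic data (the `demo` frame and its `signal`, `noise`, and `target` columns are hypothetical) shows the usual pattern of dropping the target before fitting, so the ranking only reflects real predictors.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data (hypothetical): 'signal' determines the target, 'noise' does not.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "signal": rng.integers(0, 2, size=n),
    "noise": rng.normal(size=n),
})
demo["target"] = demo["signal"]

# Drop the target from the features before fitting.
features = demo.drop(columns="target")
model_demo = ExtraTreesClassifier(n_estimators=50, random_state=42)
model_demo.fit(features, demo["target"])

importances_demo = pd.Series(
    model_demo.feature_importances_, index=features.columns
).sort_values(ascending=False)
print(importances_demo)
```

In the notebook, the equivalent change would be to build `X` from the feature columns only, for example with `df_encoded.drop(columns='Loan_Status')`, and then re-run the ranking.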

In [69]:
df.drop(["Gender", "Dependents", "Education", "Self_Employed"], inplace= True, axis= 1 )

In [70]:
df.columns

Index(['Married', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status',
       'log_ApplicantIncome', 'log_LoanAmount', 'Total_Income',
       'log_Total_Income', 'Loan_Monthly_Paid', 'log_Loan_Monthly_Paid',
       'Income_After_Loan', 'log_Income_After_Loan', 'Income_to_LoanRatio'],
      dtype='object')

In [71]:
df_streamlit= df.copy()

### DataFrame for the Streamlit deployment

In [72]:
df_streamlit.drop(['log_ApplicantIncome', 'log_LoanAmount', 'log_Loan_Monthly_Paid', 'log_Income_After_Loan', 'Income_After_Loan', "log_Total_Income"] , axis = 1 , inplace = True)


In [73]:
df_streamlit.columns

Index(['Married', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status',
       'Total_Income', 'Loan_Monthly_Paid', 'Income_to_LoanRatio'],
      dtype='object')

### Final DataFrame for modeling

In [74]:
df.drop(['ApplicantIncome', 'CoapplicantIncome', 'Total_Income', 'Loan_Monthly_Paid', 'Income_After_Loan', "LoanAmount"] , axis = 1 , inplace = True)


In [75]:
df.columns

Index(['Married', 'Loan_Amount_Term', 'Credit_History', 'Property_Area',
       'Loan_Status', 'log_ApplicantIncome', 'log_LoanAmount',
       'log_Total_Income', 'log_Loan_Monthly_Paid', 'log_Income_After_Loan',
       'Income_to_LoanRatio'],
      dtype='object')

In [76]:
df.duplicated().sum()

np.int64(1)

In [77]:
df.drop_duplicates(inplace= True)
df.reset_index(drop= True, inplace= True)

## Modeling

In [78]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder , StandardScaler , LabelEncoder , RobustScaler
from sklearn.impute import SimpleImputer , KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC 
from sklearn.naive_bayes import MultinomialNB , GaussianNB
from sklearn.model_selection import train_test_split , cross_validate
from sklearn.metrics import accuracy_score , recall_score , precision_score , f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# from xgboost import XGBClassifier 

In [79]:
df.isna().sum()

Married                   3
Loan_Amount_Term         14
Credit_History           50
Property_Area             0
Loan_Status               0
log_ApplicantIncome       0
log_LoanAmount           22
log_Total_Income          0
log_Loan_Monthly_Paid    36
log_Income_After_Loan    41
Income_to_LoanRatio      22
dtype: int64

In [80]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 613 entries, 0 to 612
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Married                610 non-null    object 
 1   Loan_Amount_Term       599 non-null    float64
 2   Credit_History         563 non-null    float64
 3   Property_Area          613 non-null    object 
 4   Loan_Status            613 non-null    int64  
 5   log_ApplicantIncome    613 non-null    float64
 6   log_LoanAmount         591 non-null    float64
 7   log_Total_Income       613 non-null    float64
 8   log_Loan_Monthly_Paid  577 non-null    float64
 9   log_Income_After_Loan  572 non-null    float64
 10  Income_to_LoanRatio    591 non-null    float64
dtypes: float64(8), int64(1), object(2)
memory usage: 52.8+ KB


In [81]:
df.select_dtypes(include="object").columns

Index(['Married', 'Property_Area'], dtype='object')

In [82]:
x = df.drop("Loan_Status" , axis = 1 )
y = df["Loan_Status"]

In [83]:
Num_Columns = x.select_dtypes(include="number")
Cat_Columns = x.select_dtypes(include="object")

In [84]:
Num_Steps = []
Num_Steps.append(("Num_Imputer" , KNNImputer()))
Num_Steps.append(("Scaler" , StandardScaler()))
Num_Pipeline = Pipeline(steps=Num_Steps)

In [85]:
Cat_Steps = []
Cat_Steps.append(("Cat_Imputer" , SimpleImputer(strategy="most_frequent")))
Cat_Steps.append(('Encoder' , OneHotEncoder(sparse_output=False, drop = "first")))
Cat_Pipeline = Pipeline(steps=Cat_Steps)
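The `drop="first"` option means a categorical column with k levels becomes k-1 indicator columns, with the first level (in sorted order) acting as the reference. As a rough illustration, the sketch below uses pandas' `get_dummies` as a stand-in for the encoder (an assumption made to keep the example small; it is not the pipeline's `OneHotEncoder`).

```python
import pandas as pd

# The three Property_Area levels seen in the data.
areas = pd.Series(["Urban", "Rural", "Semiurban", "Urban"], name="Property_Area")

# drop_first=True removes the first level in sorted order ('Rural'), leaving k-1 columns.
dummies = pd.get_dummies(areas, drop_first=True)
print(list(dummies.columns))
```

A row with zeros in both remaining columns therefore represents the dropped reference level.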

In [86]:
Transformer = ColumnTransformer(transformers=[("Num" , Num_Pipeline ,Num_Columns.columns) , ("Cat" , Cat_Pipeline,Cat_Columns.columns)] , remainder="passthrough")

## Testing that the pipeline works

In [87]:
steps = []
steps.append(("Preprocessing" , Transformer))
steps.append(("Model" , LogisticRegression()))
pipeline = Pipeline(steps = steps)

In [88]:
res = cross_validate(pipeline,x,y,cv = 5 , scoring="accuracy" ,return_train_score=True)

In [89]:
res["train_score"].mean()


np.float64(0.8042478906022694)

In [90]:
res["test_score"].mean()

np.float64(0.7978408636545382)

In [91]:
models = list()
models.append(("LR" , LogisticRegression()))
#models.append(("MNBA" , MultinomialNB()))
models.append(("GNB" , GaussianNB()))
models.append(("SVM" , SVC(gamma= .09)))
models.append(("CART" , DecisionTreeClassifier()))
models.append(("RF" , RandomForestClassifier()))
# models.append(("XG" , XGBClassifier()))
models.append(("KNN" , KNeighborsClassifier()))

In [92]:
y.value_counts()

Loan_Status
1    422
0    191
Name: count, dtype: int64

## Now try all models and take the best one 

In [93]:
# imblearn's Pipeline replaces sklearn's here: it applies the resampling step only during fit
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.metrics import make_scorer, recall_score, accuracy_score, precision_score, f1_score


scoring = {
    'accuracy': make_scorer(accuracy_score),
    'recall': make_scorer(recall_score),
}

for model in models:
    steps = []
    steps.append(("Preprocessing", Transformer))
    steps.append(("SmoteTomek", SMOTETomek(smote=SMOTE(sampling_strategy={0: 250}, random_state=24))))
    steps.append(model)  
    
    pipeline = Pipeline(steps=steps)
    
    kf = KFold(n_splits=5, shuffle=True, random_state=24)
    scores = cross_validate(pipeline, x, y, scoring=scoring, cv=kf, return_train_score=True)
    print(model[0])
    print("Train Accuracy:", scores["train_accuracy"].mean())
    print("Test Accuracy:", scores["test_accuracy"].mean())
    print("Train Recall:", scores["train_recall"].mean())
    print("Test Recall:", scores["test_recall"].mean())
    print("-" * 40)
    print("\n")


LR
Train Accuracy: 0.8026094185128226
Test Accuracy: 0.7927895508463282
Train Recall: 0.9336751640800827
Test Recall: 0.933492350136612
----------------------------------------


GNB
Train Accuracy: 0.7964944511409451
Test Accuracy: 0.791176862588298
Train Recall: 0.9478954467506637
Test Recall: 0.9505029504765055
----------------------------------------


SVM
Train Accuracy: 0.8319763913712125
Test Accuracy: 0.8026122884179661
Train Recall: 0.9758106951510944
Test Recall: 0.9642076173884095
----------------------------------------


CART
Train Accuracy: 0.9600315890103495
Test Accuracy: 0.7405837664934026
Train Recall: 0.9774559270263119
Test Recall: 0.8193282062744759
----------------------------------------


RF
Train Accuracy: 0.9612560788062678
Test Accuracy: 0.7879248300679728
Train Recall: 0.9851196358502389
Test Recall: 0.8973910395052501
----------------------------------------


KNN
Train Accuracy: 0.8303404131510037
Test Accuracy: 0.7552179128348661
Train Recall: 0.878668218
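The comparison loop builds its folds with `KFold(shuffle=True)`, which does not look at the labels, so the share of approved loans can drift from fold to fold. The sketch below is an illustration rather than part of the original experiment: it uses placeholder features and labels with the same class counts as `y` (422 and 191) to compare that drift with `StratifiedKFold`, which keeps the class ratio nearly constant.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Labels with the same class counts as y above; the features are placeholders.
y_demo = np.array([1] * 422 + [0] * 191)
X_demo = np.zeros((len(y_demo), 1))

def positive_share_per_fold(splitter):
    """Return the share of class 1 in each test fold."""
    return [y_demo[test_idx].mean() for _, test_idx in splitter.split(X_demo, y_demo)]

plain = positive_share_per_fold(KFold(n_splits=5, shuffle=True, random_state=24))
stratified = positive_share_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=24)
)
print("KFold:          ", np.round(plain, 3))
print("StratifiedKFold:", np.round(stratified, 3))
```

With only about 600 rows, steadier folds can make the per-model scores easier to compare, though whether it changes the ranking here is not something this sketch establishes.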

## Choosing SVM as the best model

### SVM had the highest test accuracy (about 0.80) of the models compared. Hyperparameter tuning was skipped because the dataset is small and earlier tuning attempts mostly gave poor results.

In [94]:
# Final pipeline with the SVC model (same gamma as the SVM evaluated above)
steps = list()
steps.append(("Preprocessing" , Transformer))
steps.append(("SmoteTomek", SMOTETomek(smote=SMOTE(sampling_strategy={0: 250}, random_state=24))))
steps.append(("SVC" , SVC(gamma=0.09)))
Final_pipeline = Pipeline(steps = steps)


In [95]:
Final_model = Final_pipeline.fit(x,y)

In [96]:
Final_model 

In [97]:
df_streamlit.columns

Index(['Married', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status',
       'Total_Income', 'Loan_Monthly_Paid', 'Income_to_LoanRatio'],
      dtype='object')

In [107]:
df.columns

Index(['Married', 'Loan_Amount_Term', 'Credit_History', 'Property_Area',
       'Loan_Status', 'log_ApplicantIncome', 'log_LoanAmount',
       'log_Total_Income', 'log_Loan_Monthly_Paid', 'log_Income_After_Loan',
       'Income_to_LoanRatio'],
      dtype='object')

In [101]:
df.select_dtypes(include="object").columns

Index(['Married', 'Property_Area'], dtype='object')

In [104]:
df.Property_Area.unique()

array(['Urban', 'Rural', 'Semiurban'], dtype=object)

In [None]:
# import pickle

# file_name = '/Users/mohammedmahmood/Desktop/Data projects/Loan_Approval_analysis_and_prediction.sav'

# with open(file_name, "wb") as file:
#     pickle.dump(Final_model, file)
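For reference, a save-and-reload round trip with `pickle` can be sketched as follows. The small `stand_in` pipeline and the temporary file name are hypothetical substitutes for `Final_model` and the real path; the point is only that the reloaded object predicts the same way as the original.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in for the fitted final pipeline (hypothetical model and data).
stand_in = Pipeline(steps=[("Scaler", StandardScaler()), ("Model", LogisticRegression())])
stand_in.fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, "stand_in_model.sav")
    with open(file_name, "wb") as file:
        pickle.dump(stand_in, file)          # serialize the fitted pipeline
    with open(file_name, "rb") as file:
        restored = pickle.load(file)         # load it back, as a deployed app would

new_rows = [[0.5], [2.5]]
same_predictions = list(restored.predict(new_rows)) == list(stand_in.predict(new_rows))
print(same_predictions)
```

Pickled pipelines generally need the same library versions at load time, and pickle files should only be loaded from trusted sources.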