# Problem Statement
The director of SZE bank has found that manually screening loan applications, to decide which applicants should be granted a loan and which should be rejected, is tedious and time-consuming. He wants to automate the process and make the bank more efficient. After he asks around, your name comes up as one of the few data scientists who can deliver this in a limited time. Will you help the director out?

## Objective
The goal of this ML project is to build a model and a web application that the bank can use to classify whether an applicant can be granted a loan.

## About the Data
The dataset contains information about loan applicants. There are 12 independent columns and 1 dependent column, covering attributes such as the loan ID, gender, marital status, education level, and the applicant’s income.

### **Data Description**

- Loan_ID: A unique ID assigned to every loan applicant
- Gender: Gender of the applicant (Male, Female)
- Married: The marital status of the applicant (Yes, No)
- Dependents: No. of people dependent on the applicant (0,1,2,3+)
- Education: Education level of the applicant (Graduate, Not Graduate)
- Self_Employed: If the applicant is self-employed or not (Yes, No)
- ApplicantIncome: The amount of income the applicant earns
- CoapplicantIncome: The amount of income the co-applicant earns
- LoanAmount: The loan amount the applicant has requested
- Loan_Amount_Term: The no. of days over which the loan will be paid
- Credit_History: A record of the borrower's repayment of past debts (1 = all debts paid, 0 = not paid)
- Property_Area: The type of location where the applicant’s property lies (Rural, Semiurban, Urban)
- Loan_Status: Whether the loan was granted (Y, N; stored as 1 and 0 in the training file loaded below)

In [5]:
%pip install xgboost

Collecting xgboost
  Downloading xgboost-1.3.1-py3-none-win_amd64.whl (95.2 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.3.1
Note: you may need to restart the kernel to use updated packages.



In [29]:
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer 

from sklearn.impute import SimpleImputer 

import xgboost as xgb

from sklearn.metrics import f1_score

In [116]:
loan_df  = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/Loan_Data/loan_train.csv", index_col = 0)

loan_df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP002305,Female,No,0,Graduate,No,4547,0.0,115.0,360.0,1.0,Semiurban,1
1,LP001715,Male,Yes,3+,Not Graduate,Yes,5703,0.0,130.0,360.0,1.0,Rural,1
2,LP002086,Female,Yes,0,Graduate,No,4333,2451.0,110.0,360.0,1.0,Urban,0
3,LP001136,Male,Yes,0,Not Graduate,Yes,4695,0.0,96.0,,1.0,Urban,1
4,LP002529,Male,Yes,2,Graduate,No,6700,1750.0,230.0,300.0,1.0,Semiurban,1


In [117]:
loan_df.drop('Loan_ID', axis = 'columns', inplace=True)

In [11]:
loan_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 491 entries, 0 to 490
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            491 non-null    object 
 1   Gender             481 non-null    object 
 2   Married            490 non-null    object 
 3   Dependents         482 non-null    object 
 4   Education          491 non-null    object 
 5   Self_Employed      462 non-null    object 
 6   ApplicantIncome    491 non-null    int64  
 7   CoapplicantIncome  491 non-null    float64
 8   LoanAmount         475 non-null    float64
 9   Loan_Amount_Term   478 non-null    float64
 10  Credit_History     448 non-null    float64
 11  Property_Area      491 non-null    object 
 12  Loan_Status        491 non-null    int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 53.7+ KB


In [13]:
loan_df.isnull().sum()

Loan_ID               0
Gender               10
Married               1
Dependents            9
Education             0
Self_Employed        29
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           16
Loan_Amount_Term     13
Credit_History       43
Property_Area         0
Loan_Status           0
dtype: int64

In [14]:
loan_df['Gender'].value_counts()

Male      393
Female     88
Name: Gender, dtype: int64

In [15]:
loan_df['Married'].value_counts()

Yes    324
No     166
Name: Married, dtype: int64

In [16]:
loan_df['Self_Employed'].value_counts()

No     398
Yes     64
Name: Self_Employed, dtype: int64

In [18]:
loan_df['Dependents'].value_counts()

0     276
1      85
2      78
3+     43
Name: Dependents, dtype: int64

In [17]:
loan_df.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

### Handling missing values

In [112]:
def missing_data(df):
    """Fill missing values in df: mode for categorical columns, median for numerical ones."""
    # Columns that contain at least one missing value
    missing_cols = df.loc[:, df.isnull().any()]

    # Separate the categorical (object) and numerical columns
    cat_cols = missing_cols.select_dtypes('object').columns
    num_cols = missing_cols.select_dtypes('number').columns

    # Impute only when a group is non-empty; SimpleImputer rejects an empty selection
    if len(cat_cols) > 0:
        df[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(df[cat_cols])
    if len(num_cols) > 0:
        df[num_cols] = SimpleImputer(strategy='median').fit_transform(df[num_cols])

    return df
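
The function above fits a fresh imputer on whatever frame it receives, so a test set passed through it is filled with its own modes and medians rather than the ones seen during training. Below is a minimal sketch of one alternative, assuming the fill values should be learned from the training data only. `fit_imputers` and `apply_imputers` are hypothetical helper names, and the two toy frames are made-up stand-ins for the loan data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def fit_imputers(train_df):
    """Fit one imputer per column group on the training frame (hypothetical helper)."""
    num_cols = list(train_df.select_dtypes(include="number").columns)
    cat_cols = list(train_df.select_dtypes(exclude="number").columns)
    cat_imp = SimpleImputer(strategy="most_frequent").fit(train_df[cat_cols])
    num_imp = SimpleImputer(strategy="median").fit(train_df[num_cols])
    return cat_cols, num_cols, cat_imp, num_imp


def apply_imputers(df, fitted):
    """Return a copy of df filled with the statistics learned by fit_imputers."""
    cat_cols, num_cols, cat_imp, num_imp = fitted
    out = df.copy()
    out[cat_cols] = cat_imp.transform(out[cat_cols])
    out[num_cols] = num_imp.transform(out[num_cols])
    return out


# Toy stand-ins for the training and test frames (made-up values).
train = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", np.nan],
    "LoanAmount": [100.0, 120.0, np.nan, 140.0],
})
test = pd.DataFrame({
    "Gender": [np.nan, "Female"],
    "LoanAmount": [np.nan, 500.0],
})

fitted = fit_imputers(train)
train_filled = apply_imputers(train, fitted)
test_filled = apply_imputers(test, fitted)  # uses the training mode and median
print(test_filled)
```

In this sketch the missing test values are filled with the training mode and median, not with values derived from the test rows themselves.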

In [119]:
final_df = missing_data(loan_df.copy())

final_df.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Female,No,0,Graduate,No,4547,0.0,115.0,360.0,1.0,Semiurban,1
1,Male,Yes,3+,Not Graduate,Yes,5703,0.0,130.0,360.0,1.0,Rural,1
2,Female,Yes,0,Graduate,No,4333,2451.0,110.0,360.0,1.0,Urban,0
3,Male,Yes,0,Not Graduate,Yes,4695,0.0,96.0,360.0,1.0,Urban,1
4,Male,Yes,2,Graduate,No,6700,1750.0,230.0,300.0,1.0,Semiurban,1


### Dealing with categorical data

In [124]:
def encoding(df):
    # Select the categorical (object) columns
    cat_cols = df.select_dtypes('object').columns

    # Convert each categorical column to integer codes with a LabelEncoder
    # fitted on that column (codes follow the sorted order of its values)
    for col in cat_cols:
        df[col] = LabelEncoder().fit_transform(df[col])

    return df
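
Because `encoding` refits a `LabelEncoder` on every frame it is given, the integer assigned to a category depends on which categories happen to be present in that frame. The sketch below shows one way to keep the mapping fixed, assuming scikit-learn's `OrdinalEncoder` is an acceptable substitute for the per-column `LabelEncoder`. The toy frames and the `-1` code for unseen categories are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-ins for the training and test frames (made-up rows).
train = pd.DataFrame({
    "Property_Area": ["Rural", "Semiurban", "Urban", "Urban"],
    "Education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
})
test = pd.DataFrame({
    "Property_Area": ["Urban", "Urban"],  # only one category appears here
    "Education": ["Not Graduate", "Graduate"],
})

cat_cols = ["Property_Area", "Education"]

# Learn the category -> integer mapping once, from the training data.
# Categories never seen during fit are encoded as -1 instead of raising.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[cat_cols])

train_enc = train.copy()
test_enc = test.copy()
train_enc[cat_cols] = encoder.transform(train[cat_cols])
test_enc[cat_cols] = encoder.transform(test[cat_cols])

print(encoder.categories_)
print(test_enc)
```

Here `Urban` keeps the code it received on the training data, whereas a `LabelEncoder` fitted on the toy test frame alone would assign it 0.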

In [126]:
final_df = encoding(final_df.copy())

final_df.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,0,0,0,0,0,4547,0.0,115.0,360.0,1.0,1,1
1,1,1,3,1,1,5703,0.0,130.0,360.0,1.0,0,1
2,0,1,0,0,0,4333,2451.0,110.0,360.0,1.0,2,0
3,1,1,0,1,1,4695,0.0,96.0,360.0,1.0,2,1
4,1,1,2,0,0,6700,1750.0,230.0,300.0,1.0,1,1


In [130]:
x = final_df.drop('Loan_Status', axis = 'columns')
y = final_df['Loan_Status']
xtr, xtt, ytr, ytt = train_test_split(x, y, test_size = 0.2, random_state = 77)
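
The split above is purely random. If the two `Loan_Status` classes are not equally frequent, the held-out 20% can end up with a different class mix from the training part. A small sketch of a stratified split follows, using a made-up toy frame with a 3:1 imbalance. Whether stratification changes anything on the actual loan data is not something this notebook measures.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 3:1 class imbalance (made-up values).
toy = pd.DataFrame({
    "feature": range(40),
    "Loan_Status": [1] * 30 + [0] * 10,
})

x_toy = toy.drop("Loan_Status", axis="columns")
y_toy = toy["Loan_Status"]

# stratify=y_toy asks for the same class proportions in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=77, stratify=y_toy
)

# Share of positive labels in each part
print(y_tr.mean(), y_te.mean())
```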

## Model Building

In [177]:
model = xgb.XGBClassifier(n_estimators=150, max_depth=3, random_state=77)

In [178]:
model.fit(xtr, ytr)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=150, n_jobs=4, num_parallel_tree=1, random_state=77,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [179]:
ypred = model.predict(xtt)

In [180]:
f1_score(ytt, ypred)

0.8535031847133758
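
By default `f1_score` reports the F1 of the positive class (label 1, a granted loan), which is the harmonic mean of precision and recall for that class. The sketch below recomputes it from a confusion matrix on made-up labels, purely to show how the number is assembled. These are not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels and predictions, not the notebook's results.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_hat = [1, 1, 1, 0, 1, 0, 1, 0]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted approvals that were correct
recall = tp / (tp + fn)     # share of true approvals that were found
f1_manual = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(f1_manual, f1_score(y_true, y_hat))
```

Since the score only tracks class 1, it is worth checking the confusion matrix as well when wrongly approved loans (false positives) are the costlier error.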

In [198]:
test_df = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/Loan_Data/loan_test.csv')
test_df.drop('Loan_ID', axis = 'columns', inplace = True)

test_df = missing_data(test_df.copy())

test_df = encoding(test_df)

test_df.isnull().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
dtype: int64

In [199]:
test_df.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,1,0,0,1,0,3748,1668.0,110.0,360.0,1.0,1
1,1,1,3,0,0,4000,7750.0,290.0,360.0,1.0,1
2,1,1,0,0,0,2625,6250.0,187.0,360.0,1.0,0
3,1,0,0,1,0,3902,1666.0,109.0,360.0,1.0,0
4,1,1,0,1,0,6096,0.0,218.0,360.0,0.0,0


In [189]:
test_df.shape

(123, 11)

In [200]:
xttn = test_df.copy()

In [201]:
y_predn = model.predict(xttn)

In [202]:
y_predn.shape

(123,)

### Save predictions

In [203]:
predict_data = pd.DataFrame(data=y_predn, columns=['prediction'])
predict_data.to_csv('prediction_results.csv', index=False)