<a href="https://colab.research.google.com/github/Jeffresh/Datathon-3-Dphi-Loan-or-No-Loan/blob/main/Datathon_3_Loan_or_Not_Loan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Imports

In [423]:
!pip install shap
import shap
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler, LabelBinarizer, OneHotEncoder
from sklearn.compose import ColumnTransformer 

from sklearn.impute import SimpleImputer 

import xgboost as xgb

from sklearn.metrics import f1_score



## Import Data

In [424]:
loan_data  = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/Loan_Data/loan_train.csv" )
test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/Loan_Data/loan_test.csv')

## Basic Exploratory Data Analysis

*   The `Unnamed: 0` and `Loan_ID` columns carry no predictive information, so we drop them.



In [425]:
loan_data.head()

Unnamed: 0.1,Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,0,LP002305,Female,No,0,Graduate,No,4547,0.0,115.0,360.0,1.0,Semiurban,1
1,1,LP001715,Male,Yes,3+,Not Graduate,Yes,5703,0.0,130.0,360.0,1.0,Rural,1
2,2,LP002086,Female,Yes,0,Graduate,No,4333,2451.0,110.0,360.0,1.0,Urban,0
3,3,LP001136,Male,Yes,0,Not Graduate,Yes,4695,0.0,96.0,,1.0,Urban,1
4,4,LP002529,Male,Yes,2,Graduate,No,6700,1750.0,230.0,300.0,1.0,Semiurban,1


In [426]:
data = loan_data.drop(columns=['Unnamed: 0', 'Loan_ID'])
data

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Female,No,0,Graduate,No,4547,0.0,115.0,360.0,1.0,Semiurban,1
1,Male,Yes,3+,Not Graduate,Yes,5703,0.0,130.0,360.0,1.0,Rural,1
2,Female,Yes,0,Graduate,No,4333,2451.0,110.0,360.0,1.0,Urban,0
3,Male,Yes,0,Not Graduate,Yes,4695,0.0,96.0,,1.0,Urban,1
4,Male,Yes,2,Graduate,No,6700,1750.0,230.0,300.0,1.0,Semiurban,1
...,...,...,...,...,...,...,...,...,...,...,...,...
486,,Yes,1,Graduate,Yes,9833,1833.0,182.0,180.0,1.0,Urban,1
487,Female,No,1,Graduate,No,3812,0.0,112.0,360.0,1.0,Rural,1
488,Male,Yes,1,Graduate,No,14583,0.0,185.0,180.0,1.0,Rural,1
489,Male,No,0,Graduate,No,1836,33837.0,90.0,360.0,1.0,Urban,0


In [427]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 491 entries, 0 to 490
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             481 non-null    object 
 1   Married            490 non-null    object 
 2   Dependents         482 non-null    object 
 3   Education          491 non-null    object 
 4   Self_Employed      462 non-null    object 
 5   ApplicantIncome    491 non-null    int64  
 6   CoapplicantIncome  491 non-null    float64
 7   LoanAmount         475 non-null    float64
 8   Loan_Amount_Term   478 non-null    float64
 9   Credit_History     448 non-null    float64
 10  Property_Area      491 non-null    object 
 11  Loan_Status        491 non-null    int64  
dtypes: float64(4), int64(2), object(6)
memory usage: 46.2+ KB
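The `info()` output above shows several columns with fewer than 491 non-null values (for example, `Credit_History` has 448). A minimal sketch of how to list the missing counts directly; the `toy` frame and its values are made up for illustration and stand in for `data`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [115.0, np.nan, 110.0, np.nan],
    "ApplicantIncome": [4547, 5703, 4333, 4695],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

On the real frame the same two lines would be applied to `data` instead of `toy`.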


## Split features/target and train/validation sets

In [428]:
X = data.drop(columns='Loan_Status')
y = data['Loan_Status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=123, stratify=y, shuffle=True)
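The `stratify=y` argument asks `train_test_split` to keep the class proportions of `Loan_Status` roughly equal in both parts. A small sketch of that effect on synthetic labels (the 70/30 class balance and the names `X_toy`, `y_toy` are assumptions for illustration, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 70% positive class (illustrative only).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 70 + [0] * 30)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123, stratify=y_toy
)

# With stratification, both parts keep the 70% positive rate.
print(len(y_tr), y_tr.mean())
print(len(y_va), y_va.mean())
```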

## Helpers to select the numeric or categorical columns that contain missing values

In [429]:
def divide_per_type(data, dtype):
  # Return the columns of `data` that have the given dtype, plus their names.
  cols = data.select_dtypes([dtype]).columns
  return data[cols], cols


def get_null_cols(data, dtype):
  # Return the columns of the given dtype that contain at least one missing value.
  null_counts = data.isnull().sum()
  null_value_columns = null_counts[null_counts > 0].index
  _, cols = divide_per_type(data, dtype)
  null_cols = [x for x in cols if x in null_value_columns]

  return data[null_cols]

## Handling missing values

In [430]:
def handle_missing_values(data):
  # Imputes in place: most frequent value for categorical columns, median for numeric ones.
  # The imputers are fitted on whichever frame is passed in.
  cat_data = get_null_cols(data, 'object')
  num_data = get_null_cols(data, 'number')

  # SimpleImputer cannot be fitted on zero columns, so skip a group with nothing to impute.
  if len(cat_data.columns) > 0:
    imputer_cat = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    dat_cat = imputer_cat.fit_transform(cat_data)
    for i, col in enumerate(cat_data.columns):
      data[col] = dat_cat[:, i]

  if len(num_data.columns) > 0:
    imputer_num = SimpleImputer(missing_values=np.nan, strategy='median')
    dat_num = imputer_num.fit_transform(num_data)
    for i, col in enumerate(num_data.columns):
      data[col] = dat_num[:, i]


In [431]:
handle_missing_values(X_train)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
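The warning above appears because `X_train` is derived from another DataFrame and is then modified column by column. One common way to avoid it is to take an explicit copy before assigning. A minimal sketch on a made-up frame (`frame` and `subset` are hypothetical names, not objects from this notebook):

```python
import pandas as pd

frame = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", "z"]})

# Take an explicit copy of the selection so later assignments
# clearly target a new object rather than a view of `frame`.
subset = frame[frame["b"] != "z"].copy()
subset["a"] = subset["a"].fillna(subset["a"].median())

print(subset)
print(frame["a"].isnull().sum())  # the original frame is left untouched
```

In this notebook the equivalent would be calling `.copy()` on the split frames before passing them to the preprocessing functions.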


## Dealing with categorical data

In [432]:
def map_dependents(data):
  # Convert the string categories to integers; '3+' is capped at 3.
  dependents_map = {'0': 0, '1': 1, '2': 2, '3+': 3}
  data['Dependents'] = data['Dependents'].map(dependents_map)

In [433]:
X_train['Dependents'].value_counts()

0     200
1      56
2      56
3+     31
Name: Dependents, dtype: int64

In [434]:
map_dependents(X_train)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [435]:
X_train['Dependents']

261    0
97     0
257    0
428    2
266    0
      ..
226    0
401    1
78     2
378    2
298    2
Name: Dependents, Length: 343, dtype: int64
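Note that `Series.map` with a dictionary returns `NaN` for any value that is not a key, so an unexpected category would silently become a missing value. A short sketch of a guard for that case, on a made-up series (the names `dependents` and `mapping` are illustrative):

```python
import pandas as pd

dependents = pd.Series(["0", "1", "3+", "2"])
mapping = {"0": 0, "1": 1, "2": 2, "3+": 3}

mapped = dependents.map(mapping)

# Any value missing from `mapping` would show up here as NaN.
unmapped = dependents[mapped.isnull()].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped Dependents values: {unmapped}")

print(mapped.tolist())
```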

In [436]:
def need_hot_encoding(data):
  # For each categorical column, print whether it is binary (True) or needs one-hot encoding (False).
  _ , cat_cols = divide_per_type(data, 'object')
  for col in cat_cols:
    print("*"*10 + col + "*"*10)
    print(data[col].value_counts().shape[0] == 2)

def handle_cat_data(data):
  # Binary columns become a single 0/1 column; columns with more than two
  # categories are one-hot encoded as <col>_0, <col>_1, ...
  # Note: the encoders are re-fitted on whichever frame is passed in.
  enc = LabelBinarizer()
  enc_h = OneHotEncoder(handle_unknown='ignore')
  _ , cat_cols = divide_per_type(data, 'object')

  print(cat_cols)

  for col in cat_cols:
    print(col)
    if data[col].nunique() <= 2:
      df = pd.DataFrame(enc.fit_transform(data[col]), columns=[col])
    else:
      values = enc_h.fit_transform(data[[col]]).toarray()
      cols = [col+'_'+str(x) for x in range(values.shape[1])]
      df = pd.DataFrame(data=values, columns=cols)

    # Keep the original row index so the concat aligns row by row.
    df.index = data.index
    data = data.drop(columns=col)
    data = pd.concat([data, df], axis=1)

  return data



In [437]:
# need_hot_encoding(X_train)

In [438]:
X_train = handle_cat_data(X_train)
X_train

Index(['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area'], dtype='object')
Gender
Married
Education
Self_Employed
Property_Area


Unnamed: 0,Dependents,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender,Married,Education,Self_Employed,Property_Area_0,Property_Area_1,Property_Area_2
261,0,20833,6667.0,480.0,360.0,1.0,1,1,0,0,0.0,0.0,1.0
97,0,5695,4167.0,175.0,360.0,1.0,1,1,0,0,0.0,1.0,0.0
257,0,2929,2333.0,139.0,360.0,1.0,0,1,0,0,0.0,1.0,0.0
428,2,2500,1840.0,109.0,360.0,1.0,1,1,0,0,0.0,0.0,1.0
266,0,3366,2200.0,135.0,360.0,1.0,1,1,0,0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
226,0,5050,0.0,118.0,360.0,1.0,1,0,0,0,0.0,1.0,0.0
401,1,7250,1667.0,110.0,360.0,0.0,1,1,0,0,0.0,0.0,1.0
78,2,3917,0.0,124.0,360.0,1.0,1,1,1,0,0.0,1.0,0.0
378,2,1993,1625.0,113.0,180.0,1.0,1,1,1,0,0.0,1.0,0.0
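Because `handle_cat_data` fits new encoders on every call, the meaning of `Property_Area_0`, `Property_Area_1`, ... depends on the categories present in the frame it receives. A possible alternative, sketched here on made-up rows, is to fit the encoder once on the training frame and reuse it; the `encode` helper and the toy frames are assumptions for illustration, not part of the notebook:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban", "Urban"]})
valid = pd.DataFrame({"Property_Area": ["Rural", "Urban"]})

# Fit once on the training data; unknown categories later encode as all zeros.
enc = OneHotEncoder(handle_unknown="ignore").fit(train[["Property_Area"]])
names = ["Property_Area_" + str(c) for c in enc.categories_[0]]

def encode(frame):
    # Hypothetical helper: apply the already-fitted encoder to any frame.
    values = enc.transform(frame[["Property_Area"]]).toarray()
    return pd.DataFrame(values, columns=names, index=frame.index)

train_enc = encode(train)
valid_enc = encode(valid)
print(valid_enc)
```

Naming the columns after the categories also makes the encoded frame easier to read than positional suffixes.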


## Validation data

In [439]:
handle_missing_values(X_test)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
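`handle_missing_values` fits its imputers on whichever frame it receives, so the validation set here is filled with its own medians and modes rather than those learned from the training set. A hedged sketch of the alternative, fitting on training data and reusing the fitted imputer (the toy frames and values are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"LoanAmount": [100.0, 120.0, np.nan, 200.0]})
valid = pd.DataFrame({"LoanAmount": [np.nan, 90.0]})

# Learn the median from the training rows only.
imputer = SimpleImputer(strategy="median").fit(train)

# Apply the same fitted imputer to both frames.
train_filled = pd.DataFrame(imputer.transform(train), columns=train.columns, index=train.index)
valid_filled = pd.DataFrame(imputer.transform(valid), columns=valid.columns, index=valid.index)

print(valid_filled)
```

The missing validation value is filled with the training median (120.0), not with a statistic computed from the validation rows.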


In [440]:
map_dependents(X_test)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [441]:
X_test['Dependents'].isnull().sum()

0

In [442]:
X_test = handle_cat_data(X_test)

Index(['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area'], dtype='object')
Gender
Married
Education
Self_Employed
Property_Area


## Test data

In [444]:
test_data = test_data.drop(columns=['Loan_ID'])
handle_missing_values(test_data)
map_dependents(test_data)
test_data = handle_cat_data(test_data)

Index(['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area'], dtype='object')
Gender
Married
Education
Self_Employed
Property_Area
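Before predicting, it can help to confirm that the test frame has the same columns, in the same order, as the frame used for training. A small sketch using `DataFrame.reindex`; the column list and the toy frame are assumptions for illustration:

```python
import pandas as pd

# Column order used at training time (illustrative subset).
train_cols = ["Dependents", "Gender", "Property_Area_0", "Property_Area_1", "Property_Area_2"]

# A test frame with a different column order and one encoded column missing.
test_frame = pd.DataFrame({
    "Gender": [1, 0],
    "Dependents": [0, 2],
    "Property_Area_0": [1.0, 0.0],
    "Property_Area_1": [0.0, 1.0],
})

# Reorder to the training layout; columns absent from the test frame are filled with 0.
aligned = test_frame.reindex(columns=train_cols, fill_value=0)
print(aligned)
```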


## Classifier

In [None]:
xgc_model = xgb.XGBClassifier(n_estimators=500, max_depth=5, random_state=123)

In [None]:
xgc_model.fit(X_train, y_train)

In [None]:
f1_score(y_test, xgc_model.predict(X_test))
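For reference, `f1_score` is the harmonic mean of precision and recall for the positive class. A tiny worked example on made-up labels (not the notebook's predictions):

```python
from sklearn.metrics import f1_score

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]

# True positives = 3, false positives = 1, false negatives = 1,
# so precision = 3/4 and recall = 3/4.
precision = 3 / 4
recall = 3 / 4
manual_f1 = 2 * precision * recall / (precision + recall)

score = f1_score(y_true, y_pred)
print(manual_f1, score)
```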

In [445]:
predictions = xgc_model.predict(test_data)

## Save predictions

In [451]:
predict_data = pd.DataFrame(data=predictions, columns=['prediction'])
predict_data.to_csv('prediction_results.csv', index=False)
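A quick way to check the saved file is to read it back and compare it with the predictions. The sketch below uses an in-memory buffer and made-up predictions instead of the real `prediction_results.csv`:

```python
import io

import numpy as np
import pandas as pd

preds = np.array([1, 0, 1, 1])  # illustrative predictions
out = pd.DataFrame({"prediction": preds})

# Write to an in-memory buffer in the same way as to a file, then read it back.
buffer = io.StringIO()
out.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored["prediction"].tolist())
```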