<a href="https://www.kaggle.com/code/invaed/homecredit?scriptVersionId=93164925" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Objective

*Many people struggle to get loans due to insufficient or non-existent credit histories. Unfortunately, this population is often taken advantage of by untrustworthy lenders. We strive to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. To make sure this underserved population has a positive loan experience, we make use of a variety of alternative data, including telco and transactional information, to* **predict clients' repayment abilities.**

# Data Collection

The data comes from the Kaggle competition Home Credit Default Risk and is downloaded using the Kaggle API.

*https://www.kaggle.com/competitions/home-credit-default-risk/data*

* Install the required packages

In [None]:
%sh 
pip install kaggle
pip install mlflow

* Delete any existing .csv files in the /databricks/driver directory

In [None]:
%sh
# -f keeps the command from failing when no .csv files exist yet
rm -f /databricks/driver/*.csv

* Download the dataset from Kaggle

In [None]:
%sh
# Supply your own Kaggle API credentials; do not hard-code a real key in a shared notebook
export KAGGLE_USERNAME="your-kaggle-username"
export KAGGLE_KEY="your-kaggle-api-key"
kaggle competitions download -c home-credit-default-risk
unzip home-credit-default-risk.zip

In [None]:
%sh
ls  "/databricks/driver/"

* Import the required packages

In [None]:
from pyspark.ml.feature import StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator,MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt   
import seaborn as sns
import pyspark.sql.functions as F

* Read the Data

In [None]:
home_credit_df = spark.read.csv('file:///databricks/driver/application_train.csv', inferSchema=True, header=True, mode='DROPMALFORMED')


In [None]:
display(home_credit_df.limit(5))

# Exploratory Data Analysis

In [None]:
home_credit_df.printSchema()

In [None]:
display(home_credit_df.summary())

**Data Explanation** <br>
*From the above, the data set consists of 307511 records with 122 features.<br>
The target column is TARGET, with values 1 and 0.<br>
A value of 1 indicates a client with payment difficulties; 0 indicates payment on time.
The feature columns are static data for each application and describe the applicant's various statuses. One row represents one loan in our data sample.*

**Data Visualization**

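* Plotting the number of clients with payment difficulties (TARGET = 1) by gender shows a lower count for men than for women. These are raw counts rather than rates, so they also reflect how many applicants of each gender are in the data.

In [None]:
# Summing TARGET counts the clients with payment difficulties in each group
display(home_credit_df.groupby("CODE_GENDER").agg({"TARGET":"sum"}).withColumnRenamed("sum(TARGET)","PaymentDifficultyCount"))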
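
Raw counts depend on group size, so a per-group rate is often easier to compare. The sketch below uses a small made-up pandas frame (the values are illustrative, not taken from the competition data) to show the difference between summing and averaging `TARGET`. In the notebook, a comparable Spark aggregation could use `F.avg("TARGET")`.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "CODE_GENDER": ["F"] * 8 + ["M"] * 2,
    "TARGET":      [1, 1, 0, 0, 0, 0, 0, 0] + [1, 0],
})

summary = toy.groupby("CODE_GENDER")["TARGET"].agg(
    difficulty_count="sum",   # number of clients with TARGET == 1
    applicants="count",       # group size
    difficulty_rate="mean",   # share of clients with TARGET == 1
)
print(summary)
```

In this toy frame the larger group has the higher count but the lower rate, which is why the two views can lead to different conclusions.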

* Plot the number of clients with payment difficulties by income type, to see which income classes account for the most unpaid loans.

In [None]:
display(
    home_credit_df
    .select("NAME_INCOME_TYPE", "TARGET")
    .groupby("NAME_INCOME_TYPE")
    .agg({"TARGET": "sum"})
    .withColumnRenamed("sum(TARGET)", "PaymentDifficultyCount")
)

* Plot the number of clients with payment difficulties by occupation, to see which occupations account for the most unpaid loans.

In [None]:
display(
    home_credit_df
    .select("OCCUPATION_TYPE", "TARGET")
    .filter(~(F.col("OCCUPATION_TYPE") == ""))
    .groupby("OCCUPATION_TYPE")
    .agg({"TARGET": "sum"})
    .withColumnRenamed("sum(TARGET)", "PaymentDifficultyCount")
)


**Feature Selection**

* Check for correlation and remove the highly correlated features

In [None]:
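pdata = home_credit_df.toPandas()
fig, ax = plt.subplots()
fig.set_size_inches(20, 20)
# Correlation is only defined for numeric columns, so leave out the string columns
corr = pdata.select_dtypes(include="number").corr()
sns.heatmap(corr, ax=ax)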

*The heat map above shows that many features are highly correlated. These highly correlated fields are therefore removed before prediction.*
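
One common way to turn a correlation matrix into a removal list is to scan its upper triangle and drop one column from every pair whose absolute correlation exceeds a threshold. The sketch below illustrates that idea on synthetic data. The 0.9 threshold, the helper name `highly_correlated_columns`, and the toy columns are assumptions for illustration, not necessarily the procedure used to build the list in this notebook.

```python
import numpy as np
import pandas as pd

def highly_correlated_columns(frame, threshold=0.9):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Synthetic example: "b" is an exact multiple of "a", "c" is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": 2 * a, "c": rng.normal(size=200)})

to_drop = highly_correlated_columns(toy)
print(to_drop)
```

Only the later column of each correlated pair is flagged, so one representative of every correlated group stays in the data.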

In [None]:
corr_remove_list = [
    'LIVE_CITY_NOT_WORK_CITY', 'REGION_RATING_CLIENT_W_CITY', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
    # Building statistics (_AVG)
    'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG',
    'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG',
    'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG',
    # Building statistics (_MODE)
    'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE',
    'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE',
    'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE',
    # Building statistics (_MEDI)
    'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI',
    'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI',
    'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI',
    'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE',
    'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
    'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR',
    'DAYS_LAST_PHONE_CHANGE', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
]

In [None]:
feature_cols = [col for col in home_credit_df.columns  if col not in corr_remove_list ]
len(feature_cols)

In [None]:
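pdata = home_credit_df.select(feature_cols).toPandas()
fig, ax = plt.subplots()
fig.set_size_inches(20, 20)
# Correlation is only defined for numeric columns, so leave out the string columns
corr = pdata.select_dtypes(include="number").corr()
sns.heatmap(corr, ax=ax)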

*After removing the correlated columns, the heat map shows that the correlation is reduced.*

In [None]:
# Selecting the required features for Modeling
home_credit_df = home_credit_df.select(feature_cols)
print("Selected features: {}".format(home_credit_df.columns))

*Data Cleaning: Dropping Nulls*

In [None]:
# Remove rows that contain null/NA values
print("Rows before dropping nulls: {}".format(home_credit_df.count()))
df = home_credit_df.dropna()
print("Rows after dropna: {}".format(df.count()))
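
Because `dropna()` discards a row as soon as any one of its columns is null, it can help to check how many nulls each column holds before dropping. The pandas sketch below shows the idea on a made-up frame (the values are illustrative only). On the Spark DataFrame, a comparable per-column count could be built with `F.count(F.when(F.col(c).isNull(), c))`.

```python
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100.0, 200.0, None, 400.0],
    "OCCUPATION_TYPE": ["Laborers", None, None, "Drivers"],
    "TARGET": [0, 1, 0, 0],
})

null_counts = toy.isna().sum()     # nulls per column
rows_kept = len(toy.dropna())      # rows with no null in any column

print(null_counts)
print("rows kept:", rows_kept, "of", len(toy))
```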

# Data Transformation

In [None]:
splitted_data = df.randomSplit([0.7, 0.3], 24)   # 70/30 train/test split; 24 is the random seed
train_data = splitted_data[0]
test_data = splitted_data[1]

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))

*Create a pipeline that converts all the string columns to numeric indices*

In [None]:
stringIndexer_contract = StringIndexer(inputCol="NAME_CONTRACT_TYPE", outputCol="NAME_CONTRACT_TYPE_IX")
stringIndexer_gender = StringIndexer(inputCol="CODE_GENDER", outputCol="CODE_GENDER_IX")
stringIndexer_car = StringIndexer(inputCol="FLAG_OWN_CAR", outputCol="FLAG_OWN_CAR_IX")
stringIndexer_realty = StringIndexer(inputCol="FLAG_OWN_REALTY", outputCol="FLAG_OWN_REALTY_IX")
stringIndexer_suite = StringIndexer(inputCol="NAME_TYPE_SUITE", outputCol="NAME_TYPE_SUITE_IX")
stringIndexer_income = StringIndexer(inputCol="NAME_INCOME_TYPE", outputCol="NAME_INCOME_TYPE_IX")
stringIndexer_education = StringIndexer(inputCol="NAME_EDUCATION_TYPE", outputCol="NAME_EDUCATION_TYPE_IX")
stringIndexer_family_status = StringIndexer(inputCol="NAME_FAMILY_STATUS", outputCol="NAME_FAMILY_STATUS_IX")
stringIndexer_housing_type = StringIndexer(inputCol="NAME_HOUSING_TYPE", outputCol="NAME_HOUSING_TYPE_IX")
stringIndexer_occupation = StringIndexer(inputCol="OCCUPATION_TYPE", outputCol="OCCUPATION_TYPE_IX")
stringIndexer_appr_process_start = StringIndexer(inputCol="WEEKDAY_APPR_PROCESS_START", outputCol="WEEKDAY_APPR_PROCESS_START_IX")


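# Feature columns: every non-string column except the row ID and the label.
# TARGET must be excluded, otherwise the label leaks into the feature vector.
columns = [name for name, dtype in train_data.dtypes
           if dtype != 'string' and name not in ('SK_ID_CURR', 'TARGET')]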

columns = columns + ['NAME_CONTRACT_TYPE_IX',
                     'CODE_GENDER_IX',
                     'FLAG_OWN_CAR_IX',
                     'FLAG_OWN_REALTY_IX',
                     'NAME_TYPE_SUITE_IX',
                     'NAME_INCOME_TYPE_IX',
                     'NAME_EDUCATION_TYPE_IX',
                     'NAME_FAMILY_STATUS_IX',
                     'NAME_HOUSING_TYPE_IX',
                     'OCCUPATION_TYPE_IX',
                     'WEEKDAY_APPR_PROCESS_START_IX']

# Create the data transformation pipeline
pipeline = Pipeline(stages=[stringIndexer_contract, 
                            stringIndexer_gender, 
                            stringIndexer_car, 
                            stringIndexer_realty,
                            stringIndexer_suite,
                            stringIndexer_income,
                            stringIndexer_education,
                            stringIndexer_family_status,
                            stringIndexer_housing_type,
                            stringIndexer_occupation, 
                            stringIndexer_appr_process_start
                           ])
                            
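# Give categories that appear only in the test set an extra index instead of raising an error
for indexer in pipeline.getStages():
    indexer.setHandleInvalid("keep")

# Fit the indexers on the training data only and reuse the fitted model for the
# test data, so both sets share the same string-to-index mapping
indexer_model = pipeline.fit(train_data)
train_data = indexer_model.transform(train_data)
test_data = indexer_model.transform(test_data)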
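
Fitting an indexer separately on two data sets can assign different indices to the same category. The plain-Python sketch below mimics frequency-based label indexing, which is how `StringIndexer` orders labels by default (the most frequent label gets index 0). The helper `fit_label_index` and the toy values are hypothetical; this illustrates the idea, not Spark's implementation.

```python
from collections import Counter

def fit_label_index(values):
    """Map each label to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

# Toy values: the majority class differs between the two samples
train_values = ["Cash loans", "Cash loans", "Cash loans", "Revolving loans"]
test_values = ["Revolving loans", "Revolving loans", "Cash loans"]

train_mapping = fit_label_index(train_values)
test_mapping = fit_label_index(test_values)

print(train_mapping)
print(test_mapping)
print("same mapping:", train_mapping == test_mapping)
```

When the mappings differ, a model trained on one encoding sees different numbers for the same category at prediction time, so a single fitted mapping should be applied to both sets.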


# Modeling & Tuning

*Data modeling and tuning using ParamGridBuilder and CrossValidator*

In [None]:
# Assemble the feature columns into a single vector column
vectorAssembler_features = VectorAssembler(inputCols=columns, outputCol="features")

# Baseline model (RandomForestClassifier) using TARGET as the label column
rf = RandomForestClassifier(labelCol="TARGET")

# Create the modeling pipeline
pipeline2 = Pipeline(stages=[vectorAssembler_features, rf])
evaluator = MulticlassClassificationEvaluator(
    labelCol="TARGET", predictionCol="prediction", metricName="accuracy")

# Define the parameter grid: try 2, 3 and 4 trees
params = ParamGridBuilder()\
  .addGrid(rf.numTrees, [2, 3, 4])\
  .build()

# Define the cross-validator
cv = CrossValidator()\
    .setEstimator(pipeline2)\
    .setEvaluator(evaluator)\
    .setEstimatorParamMaps(params)\
    .setNumFolds(2)

# Fit the model on the training data
fittedGrid = cv.fit(train_data)
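
Cross-validated grid search fits one model per parameter combination per fold, then refits the best combination on the full training set. The plain-Python sketch below counts those fits for the grid above (three `numTrees` values, two folds). The commented-out `maxDepth` entry is a hypothetical second parameter, shown only to hint at how the grid would multiply.

```python
from itertools import product

grid = {
    "numTrees": [2, 3, 4],
    # "maxDepth": [5, 10],   # hypothetical second parameter, not in the notebook
}
num_folds = 2

# Every combination of parameter values is one candidate model
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

fits_during_search = len(candidates) * num_folds
total_fits = fits_during_search + 1   # plus the final refit of the best candidate

print(candidates)
print(fits_during_search, total_fits)
```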


# Prediction and Evaluation

In [None]:
# Predictions on the test data using the best model
rfBest = fittedGrid.bestModel
predictions = rfBest.transform(test_data)
display(predictions.select("features", "TARGET", "prediction").limit(30))

In [None]:
#Model Evaluation
evaluatorRF = MulticlassClassificationEvaluator(labelCol="TARGET", predictionCol="prediction", metricName="accuracy")
accuracy = evaluatorRF.evaluate(predictions)
# accuracy=fittedGrid.bestModel.stages[1].summary.accuracy
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))
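
Accuracy can look high on imbalanced data even when a model never flags a single client with payment difficulties. The toy calculation below assumes a made-up set of 100 loans with 8 positives (illustrative numbers, not measured from this data set) and a trivial model that always predicts 0.

```python
# Made-up labels: 92 loans repaid on time (0), 8 with payment difficulties (1)
labels = [0] * 92 + [1] * 8
# A trivial "model" that always predicts the majority class
always_zero = [0] * len(labels)

correct = sum(1 for y, p in zip(labels, always_zero) if y == p)
accuracy = correct / len(labels)

true_positives = sum(1 for y, p in zip(labels, always_zero) if y == 1 and p == 1)
recall = true_positives / sum(labels)

print("accuracy:", accuracy)
print("recall:", recall)
```

For this reason it is worth reading the accuracy figure alongside the confusion matrix and the ROC-based metrics.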

*Confusion Matrix*

In [None]:
display(predictions.groupby("TARGET").pivot("prediction").count())
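
Precision, recall, and F1 for the positive class (TARGET = 1) can be read off a 2x2 confusion matrix. The helper `binary_metrics` is a hypothetical name, and the counts below are placeholders chosen for illustration, not results from this model.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Placeholder counts: tp/fn are actual positives, fp/tn are actual negatives
metrics = binary_metrics(tp=30, fp=10, fn=20, tn=940)
for name, value in metrics.items():
    print(name, round(value, 4))
```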

*Plotting ROC curve*

In [None]:
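# Note: the model summary holds metrics computed on the training data,
# so this is the training ROC curve, not the test one
roc = rfBest.stages[1].summary.roc.toPandas()
plt.figure(figsize=(5,5))
plt.plot([0, 1], [0, 1], 'r--')   # diagonal = random classifier
plt.plot(roc['FPR'], roc['TPR'])
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()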

In [None]:
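# The summary is computed on the training data, so this is the training AUC
print("Area under the ROC curve (training data) is {}".format(rfBest.stages[1].summary.areaUnderROC))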
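
The area under the ROC curve is the integral of TPR over FPR, which can be approximated from the curve's points with the trapezoidal rule. The sketch below does this in plain Python for a few made-up (FPR, TPR) points; the helper name `trapezoid_auc` is hypothetical. For a test-set figure in the notebook, the already imported `BinaryClassificationEvaluator` with `labelCol="TARGET"` and `metricName="areaUnderROC"` could be applied to `predictions`.

```python
def trapezoid_auc(points):
    """Approximate the area under a curve given (fpr, tpr) points."""
    ordered = sorted(points)
    area = 0.0
    # Sum the trapezoid between each pair of neighbouring points
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up ROC points, including the (0, 0) and (1, 1) end points
roc_points = [(0.0, 0.0), (0.1, 0.5), (0.4, 0.8), (1.0, 1.0)]
auc = trapezoid_auc(roc_points)
print(round(auc, 3))
```

A curve that follows the diagonal gives 0.5 under this calculation, which is the baseline drawn as the dashed red line in the plot.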