# Project: Creditworthiness (Part 3)

### by Sooyeon Won
### Keywords
- Analytical Framework 
- Application of Selected Classification Model

## Conclusion

> In the previous part of the analysis, we found that the **Random Forest model** is the most suitable for predicting the creditworthiness of credit applicants. Now I will apply this classification model to predict the creditworthiness of the bank's **new** credit applicants.

In [1]:
# Import the relevant libraries 
import pandas as pd 
import numpy as np 
import statsmodels.api as sm 
import matplotlib.pyplot as plt 
from sklearn.linear_model import LinearRegression 
import seaborn as sns 
sns.set()

import pickle

In [2]:
with open('model_rf.pickle', 'rb') as model_file:  # Reload the saved Random Forest model; the file is closed afterwards
    forest = pickle.load(model_file)

In [3]:
new_customer_data = pd.read_excel('customers-to-score.xlsx')
new_customer_data.info() # The columns need the same preprocessing as the training data before scoring.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 19 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   Account-Balance                    500 non-null    object
 1   Duration-of-Credit-Month           500 non-null    int64 
 2   Payment-Status-of-Previous-Credit  500 non-null    object
 3   Purpose                            500 non-null    object
 4   Credit-Amount                      500 non-null    int64 
 5   Value-Savings-Stocks               500 non-null    object
 6   Length-of-current-employment       500 non-null    object
 7   Instalment-per-cent                500 non-null    int64 
 8   Guarantors                         500 non-null    object
 9   Duration-in-Current-address        500 non-null    int64 
 10  Most-valuable-available-asset      500 non-null    int64 
 11  Age-years                          500 non-null    int64 
 12  Concurre

In [4]:
# (1) Replace the hyphens in the column names with underscores
new_customer_data.columns = new_customer_data.columns.str.replace("-", "_")

In [5]:
# (2) Merge the '1-4 yrs' and '4-7 yrs' employment levels into a single category
new_customer_data['Length_of_current_employment'] = new_customer_data['Length_of_current_employment'].replace(
    ['1-4 yrs', '4-7 yrs'], 'longer than 1 yrs employment')

In [6]:
# Merge 'No Problems (in this bank)' and 'Paid Up' into a single category
new_customer_data['Payment_Status_of_Previous_Credit'] = new_customer_data['Payment_Status_of_Previous_Credit'].replace(
    ['No Problems (in this bank)', 'Paid Up'], 'No Problems (in this bank) or Paid Up')
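
The two cells above collapse several raw labels into one level. A minimal sketch of the same idea with an explicit mapping, run on a toy frame; the `toy` data and the `employment_map` name are illustrative and not part of the original notebook:

```python
import pandas as pd

# Toy data standing in for the real applicants file (hypothetical values).
toy = pd.DataFrame({
    "Length_of_current_employment": ["< 1yr", "1-4 yrs", "4-7 yrs", "< 1yr"],
})

# Map each raw label to its merged level; labels not in the map stay unchanged.
employment_map = {
    "1-4 yrs": "longer than 1 yrs employment",
    "4-7 yrs": "longer than 1 yrs employment",
}
toy["Length_of_current_employment"] = toy["Length_of_current_employment"].replace(employment_map)

# Count how many rows fall into each level after merging.
merged_counts = toy["Length_of_current_employment"].value_counts().to_dict()
print(merged_counts)
```

Assigning the result back to the column, rather than calling `replace(..., inplace=True)` on the attribute, keeps the update explicit and does not rely on modifying the frame through a column accessor.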

In [7]:
# (3) - (6) Drop the irrelevant or unnecessary columns
new_customer_data.drop(['Telephone'], axis=1, inplace=True)  # Irrelevant feature.
new_customer_data.drop(['Occupation', 'Concurrent_Credits'], axis=1, inplace=True)  # Columns with the same value for every data point.
new_customer_data.drop(['Guarantors', 'Foreign_Worker', 'No_of_dependents'], axis=1, inplace=True)  # Columns with low variability.
new_customer_data.drop(['Duration_in_Current_address'], axis=1, inplace=True)  # Column with a large share of missing values.

In [8]:
new_customer_mod=pd.get_dummies(new_customer_data) # Create dummy variables for each categorical variable

In [9]:
# One dummy variable is removed from each categorical variable. 
new_customer_mod.drop([ 'Account_Balance_No Account', 
           'Payment_Status_of_Previous_Credit_No Problems (in this bank) or Paid Up',
           'Value_Savings_Stocks_None', 'Length_of_current_employment_< 1yr', 
           'No_of_Credits_at_this_Bank_1' ], axis =1, inplace= True)
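
`pd.get_dummies` only creates columns for categories that actually occur in the data, so a new file can yield fewer or differently ordered columns than the training set. A hedged sketch of aligning the encoded frame to the training layout with `DataFrame.reindex`; `training_columns` is a hypothetical list here, and in practice it would have to be saved together with the model in the previous part:

```python
import pandas as pd

# Hypothetical feature layout the model was trained on.
training_columns = [
    "Credit_Amount",
    "Purpose_Home Related",
    "Purpose_New car",
    "Purpose_Used car",
]

# Toy new data in which the 'Used car' category never occurs.
new_data = pd.DataFrame({
    "Credit_Amount": [1200, 3400],
    "Purpose": ["Home Related", "New car"],
})
encoded = pd.get_dummies(new_data)

# Add any missing dummy columns as zeros and enforce the training column order.
aligned = encoded.reindex(columns=training_columns, fill_value=0)
print(list(aligned.columns))
```

Note that `reindex` also silently discards columns that are not in `training_columns`, so comparing the two column sets first can be worthwhile.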

In [10]:
new_customer_mod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 15 columns):
 #   Column                                                     Non-Null Count  Dtype
---  ------                                                     --------------  -----
 0   Duration_of_Credit_Month                                   500 non-null    int64
 1   Credit_Amount                                              500 non-null    int64
 2   Instalment_per_cent                                        500 non-null    int64
 3   Most_valuable_available_asset                              500 non-null    int64
 4   Age_years                                                  500 non-null    int64
 5   Type_of_apartment                                          500 non-null    int64
 6   Account_Balance_Some Balance                               500 non-null    uint8
 7   Payment_Status_of_Previous_Credit_Some Problems            500 non-null    uint8
 8   Purpose_Home Related          

In [11]:
new_customer_mod['Predicted_Creditworthiness'] = forest.predict(new_customer_mod) # Predict the creditworthiness of each new applicant

In [12]:
new_customer_mod['Predicted_Creditworthiness'].sum()

408

> Based on this analysis using the Random Forest model, **408** of the 500 new applicants are predicted to belong to the segment 'Creditworthy'.
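
Summing the predictions counts creditworthy applicants only if that class is encoded as 1. A small sketch of a tally that does not depend on the encoding, using `value_counts` on made-up predictions; the values below are illustrative and are not the notebook's output:

```python
import pandas as pd

# Made-up predictions; the real ones come from forest.predict(...).
predictions = pd.Series([1, 0, 1, 1, 0, 1], name="Predicted_Creditworthiness")

# Number of applicants per predicted class.
counts = predictions.value_counts()

# Share of applicants per predicted class.
shares = predictions.value_counts(normalize=True)

print(counts.to_dict())
print(shares.round(2).to_dict())
```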