# Credit Risk Classification Exercise

A certain bank asks you to define a classification model that decides whether a client is eligible to be granted a credit. To train the model, you are given a dataset containing some attributes of the bank's clients, together with the tag the bank gave each client regarding the credit concession. This dataset is provided in the `credit_customers.csv` file (obtained from [this link](https://www.kaggle.com/datasets/ppb00x/credit-risk-customers)). The bank also reminds you that it is always better to tag a good customer as bad than to tag as good a bad customer who will probably not repay the credit. Thus, this is a classification problem with imbalanced misclassification costs. You are also told that the cost matrix for this problem is as follows:

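| Actual / Predicted | Good |  Bad  |
|-----|------|-------|
|**Good** |  0   |   1   |
| **Bad** |  5   |   0   |

where the row labels are the actual classes and the column labels are the predicted classes. Tagging a bad customer as good (cost 5) is therefore penalized five times as heavily as tagging a good customer as bad (cost 1), in line with the bank's remark. We will take this into account when we evaluate the model after training.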
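As a minimal sketch of how this cost matrix could be applied at evaluation time, the snippet below sums the cost of each prediction. It assumes, following the bank's remark, that a bad customer tagged as good costs 5 and a good customer tagged as bad costs 1. The `COSTS` dictionary, the `total_cost` helper, and the example labels are hypothetical names and values introduced only for illustration.

```python
# Misclassification costs keyed by (actual, predicted) label.
# Assumption: bad tagged as good costs 5, good tagged as bad costs 1.
COSTS = {
    ("good", "good"): 0,
    ("good", "bad"): 1,
    ("bad", "good"): 5,
    ("bad", "bad"): 0,
}


def total_cost(actual, predicted):
    """Sum the misclassification cost over paired actual/predicted labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(COSTS[(a, p)] for a, p in zip(actual, predicted))


# Hypothetical labels, not taken from the dataset.
actual = ["good", "good", "bad", "bad", "good"]
predicted = ["good", "bad", "good", "bad", "good"]

# One good customer tagged as bad (1) plus one bad customer tagged as good (5).
example_cost = total_cost(actual, predicted)
print(example_cost)  # 6
```

A lower total cost is better, so two models with the same accuracy can still differ a lot under this metric.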

We first import all the libraries that we will need during the project:

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

We can see what our dataframe looks like by inspecting its first 5 records:

In [5]:
df = pd.read_csv("credit_customers.csv")
pd.options.display.max_columns = None
df.head()

| | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <0 | 6.0 | critical/other existing credit | radio/tv | 1169.0 | no known savings | >=7 | 4.0 | male single | none | 4.0 | real estate | 67.0 | none | own | 2.0 | skilled | 1.0 | yes | yes | good |
| 1 | 0<=X<200 | 48.0 | existing paid | radio/tv | 5951.0 | <100 | 1<=X<4 | 2.0 | female div/dep/mar | none | 2.0 | real estate | 22.0 | none | own | 1.0 | skilled | 1.0 | none | yes | bad |
| 2 | no checking | 12.0 | critical/other existing credit | education | 2096.0 | <100 | 4<=X<7 | 2.0 | male single | none | 3.0 | real estate | 49.0 | none | own | 1.0 | unskilled resident | 2.0 | none | yes | good |
| 3 | <0 | 42.0 | existing paid | furniture/equipment | 7882.0 | <100 | 4<=X<7 | 2.0 | male single | guarantor | 4.0 | life insurance | 45.0 | none | for free | 1.0 | skilled | 2.0 | none | yes | good |
| 4 | <0 | 24.0 | delayed previously | new car | 4870.0 | <100 | 1<=X<4 | 3.0 | male single | none | 4.0 | no known property | 53.0 | none | for free | 2.0 | skilled | 2.0 | none | yes | bad |


We can also check the type of each column:

In [7]:
df.dtypes

checking_status            object
duration                  float64
credit_history             object
purpose                    object
credit_amount             float64
savings_status             object
employment                 object
installment_commitment    float64
personal_status            object
other_parties              object
residence_since           float64
property_magnitude         object
age                       float64
other_payment_plans        object
housing                    object
existing_credits          float64
job                        object
num_dependents            float64
own_telephone              object
foreign_worker             object
class                      object
dtype: object
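Rather than counting the types by eye in the listing above, the columns can be split by type programmatically. The following is a small sketch using `DataFrame.select_dtypes` on a hypothetical four-column sample (`sample` is not the real dataset). On the real data the same two calls would be made on `df`.

```python
import pandas as pd

# Hypothetical sample mimicking a few columns of the dataset.
sample = pd.DataFrame({
    "duration": [6.0, 48.0],
    "credit_amount": [1169.0, 5951.0],
    "purpose": ["radio/tv", "radio/tv"],
    "class": ["good", "bad"],
})

# Numeric columns (float64 here) versus everything else (text columns).
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(len(numerical_cols), numerical_cols)
print(len(categorical_cols), categorical_cols)
```

Note that the label column `class` is counted among the non-numeric columns, so it has to be set aside before treating the rest as features.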

We can see that our dataset has 21 columns: 7 numerical columns and 14 categorical ones (the label included). The 20 features in this dataset are:

- **checking_status**: status of existing checking account.
- **duration**: duration in months of the requested credit.
- **credit_history**: feature related to credits taken by the customer and the status of those credits.
- **purpose**: the purpose of the credit.
- **credit_amount**: amount of money requested.
- **savings_status**: status of savings account.
- **employment**: the time the customer has been working in their current employment, in years.
- **installment_commitment**: installment rate in percentage of disposable income.
- **personal_status**: sex data about the customer (male or female) and marital status.
- **other_parties**: other debtors or guarantors.
- **residence_since**: the time the customer has been living in their current residence, in years.
- **property_magnitude**: property (real estate, life insurance, car or no known property).
- **age**: the customer's age in years.
- **other_payment_plans**: other installment plans (banks, stores).
- **housing**: whether they live in an owned house, a rented one, or one they occupy for free.
- **existing_credits**: how many other credits the customer has in this same bank.
- **job**: type of job.
- **num_dependents**: number of people the customer is liable to provide maintenance for.
- **own_telephone**: whether the customer owns a telephone or not.
- **foreign_worker**: whether the customer is a foreign worker or not.

The label we want to predict with this model is the **class** column, which tags the customer as *good* (eligible) or *bad* (not eligible) to get the credit. 
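Since the label drives everything that follows, it is worth checking how the two tags are distributed. The sketch below shows one way to do that with `Series.value_counts`. The `labels` series is a hypothetical five-row example, not the real distribution. On the real data the same calls would be made on `df["class"]`.

```python
import pandas as pd

# Hypothetical labels, for illustration only.
labels = pd.Series(["good", "bad", "good", "good", "bad"], name="class")

counts = labels.value_counts()                # absolute count per tag
shares = labels.value_counts(normalize=True)  # fraction of rows per tag

print(counts)
print(shares)
```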

We can check whether there are any null values in the dataframe:

In [7]:
df.isnull().any()

checking_status           False
duration                  False
credit_history            False
purpose                   False
credit_amount             False
savings_status            False
employment                False
installment_commitment    False
personal_status           False
other_parties             False
residence_since           False
property_magnitude        False
age                       False
other_payment_plans       False
housing                   False
existing_credits          False
job                       False
num_dependents            False
own_telephone             False
foreign_worker            False
class                     False
dtype: bool

We find that there is not a single null value in any column.
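When a dataset does contain missing values, a per-column count is usually more informative than a True/False flag. The sketch below shows both variants on a hypothetical sample with one missing age. The `sample` frame is invented for illustration and is not part of the real data.

```python
import pandas as pd

# Hypothetical sample with one missing value in "age".
sample = pd.DataFrame({
    "age": [67.0, None, 49.0],
    "housing": ["own", "own", "for free"],
})

null_counts = sample.isnull().sum()            # number of nulls per column
has_nulls = bool(sample.isnull().any().any())  # True if any cell is null

print(null_counts)
print(has_nulls)
```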