# Pillars of the credit chain:

- Prospection;
- Concession;
- Risk management;
- Recovery.

The focus of this course is on the concession pillar.

## Credit chain:
The credit chain is the process of granting credit to a customer. It involves three agents:

- The savers;
- The financial intermediaries;
- The borrowers.

## Credit scoring:

Credit scoring is the process of evaluating the creditworthiness of a customer. The score is expressed as a probability, so its value lies between 0 and 1. The process has three steps:
- Data collection;
- Data analysis;
- Decision making.
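
To illustrate why a score expressed as a probability stays between 0 and 1, here is a minimal sketch of the logistic (sigmoid) function, which logistic regression applies to a linear combination of the features. The function name `sigmoid` and the sample inputs are illustrative assumptions, not part of the course material.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative raw scores: strongly negative, neutral, strongly positive
for z in (-4.0, 0.0, 4.0):
    print(f"z = {z:+.1f} -> probability = {sigmoid(z):.3f}")
```

A raw score of 0 maps to a probability of exactly 0.5. Larger scores approach 1 and smaller ones approach 0 without ever reaching either bound.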

## Credit risk:

Credit risk is the risk of loss due to a borrower's default on a loan or other line of credit.


# Importing libraries

In [159]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

In [142]:
SEED = 77

# Reading the data

In [143]:
df_german_credit = pd.read_csv('../data/statlog_german_credit/german_credit.csv')
df_german_credit.head(3)

| | default | account_check_status | duration_in_month | credit_history | purpose | credit_amount | savings | present_emp_since | installment_as_income_perc | personal_status_sex | ... | present_res_since | property | age | other_installment_plans | housing | credits_this_bank | job | people_under_maintenance | telephone | foreign_worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | < 0 DM | 6 | critical account/ other credits existing (not ... | domestic appliances | 1169 | unknown/ no savings account | .. >= 7 years | 4 | male : single | ... | 4 | real estate | 67 | none | own | 2 | skilled employee / official | 1 | yes, registered under the customers name | yes |
| 1 | 1 | 0 <= ... < 200 DM | 48 | existing credits paid back duly till now | domestic appliances | 5951 | ... < 100 DM | 1 <= ... < 4 years | 2 | female : divorced/separated/married | ... | 2 | real estate | 22 | none | own | 1 | skilled employee / official | 1 | none | yes |
| 2 | 0 | no checking account | 12 | critical account/ other credits existing (not ... | (vacation - does not exist?) | 2096 | ... < 100 DM | 4 <= ... < 7 years | 2 | male : single | ... | 3 | real estate | 49 | none | own | 1 | unskilled - resident | 2 | none | yes |


In [144]:
df_german_credit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   default                     1000 non-null   int64 
 1   account_check_status        1000 non-null   object
 2   duration_in_month           1000 non-null   int64 
 3   credit_history              1000 non-null   object
 4   purpose                     1000 non-null   object
 5   credit_amount               1000 non-null   int64 
 6   savings                     1000 non-null   object
 7   present_emp_since           1000 non-null   object
 8   installment_as_income_perc  1000 non-null   int64 
 9   personal_status_sex         1000 non-null   object
 10  other_debtors               1000 non-null   object
 11  present_res_since           1000 non-null   int64 
 12  property                    1000 non-null   object
 13  age                         1000 non-null   int64

In [145]:
lines_size, columns_size = df_german_credit.shape
lines_size, columns_size

(1000, 21)

In [146]:
duplicate_lines = df_german_credit.duplicated()
print(f'The dataset has {duplicate_lines.sum()} duplicated lines.')

The dataset has 0 duplicated lines.


In [147]:
empty_values = df_german_credit.isnull()
empty_values.sum()

default                       0
account_check_status          0
duration_in_month             0
credit_history                0
purpose                       0
credit_amount                 0
savings                       0
present_emp_since             0
installment_as_income_perc    0
personal_status_sex           0
other_debtors                 0
present_res_since             0
property                      0
age                           0
other_installment_plans       0
housing                       0
credits_this_bank             0
job                           0
people_under_maintenance      0
telephone                     0
foreign_worker                0
dtype: int64

# Logistic regression

## Data categorization

In [148]:
def categorization_data(df, column_name):
    """Encode a text column as integer codes and return the label-to-code mapping."""
    series_categorized = pd.Categorical(
        df[column_name],
        ordered=False
    )
    # Each code is the position of its label in `.categories`, so enumerate()
    # rebuilds the mapping without relying on the DataFrame index.
    # Missing values would be coded as -1; this dataset has none.
    data_categorized_code = {
        column_name: {
            label: code
            for code, label in enumerate(series_categorized.categories)
        }
    }
    return series_categorized.codes, data_categorized_code
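
The encoding performed by the function above can be checked on a toy column. This is a minimal sketch: the `toy` DataFrame and its values are illustrative and only loosely modelled on the `housing` feature. It also shows `pd.get_dummies` as a one-hot alternative, which the notebook itself does not use.

```python
import pandas as pd

# Toy data (illustrative values, not taken from the real dataset)
toy = pd.DataFrame({"housing": ["own", "rent", "own", "for free"]})

# Same idea as categorization_data: integer codes plus a label-to-code mapping
categorical = pd.Categorical(toy["housing"], ordered=False)
codes = categorical.codes.tolist()
mapping = {label: code for code, label in enumerate(categorical.categories)}

print(codes)    # [1, 2, 1, 0]
print(mapping)  # {'for free': 0, 'own': 1, 'rent': 2}

# One-hot alternative: one indicator column per label, no implied ordering
one_hot = pd.get_dummies(toy["housing"])
print(list(one_hot.columns))  # ['for free', 'own', 'rent']
```

Integer codes are compact, but a linear model such as logistic regression reads them as ordered magnitudes. One-hot encoding avoids that implied order at the cost of more columns.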


In [149]:
object_columns = df_german_credit.select_dtypes(include=['object']).columns
data_categorized_code_complete = {}

for column_name in object_columns:
    df_german_credit[column_name], data_categorized_code = categorization_data(
        df_german_credit, column_name)
    data_categorized_code_complete.update(data_categorized_code)

df_german_credit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   default                     1000 non-null   int64
 1   account_check_status        1000 non-null   int8 
 2   duration_in_month           1000 non-null   int64
 3   credit_history              1000 non-null   int8 
 4   purpose                     1000 non-null   int8 
 5   credit_amount               1000 non-null   int64
 6   savings                     1000 non-null   int8 
 7   present_emp_since           1000 non-null   int8 
 8   installment_as_income_perc  1000 non-null   int64
 9   personal_status_sex         1000 non-null   int8 
 10  other_debtors               1000 non-null   int8 
 11  present_res_since           1000 non-null   int64
 12  property                    1000 non-null   int8 
 13  age                         1000 non-null   int64
 14  other_ins

## Split data between modelling (x) and response (y)

In [150]:
x, y = df_german_credit.drop(columns=['default']), df_german_credit['default']
print(f'data complete shape: {df_german_credit.shape} | x shape: {x.shape} | y shape: {y.shape}')

data complete shape: (1000, 21) | x shape: (1000, 20) | y shape: (1000,)


## Split data between training and testing

In [151]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=SEED)
print(f'The train dataset has {x_train.shape[0]} lines and the test dataset has {x_test.shape[0]} lines.')

The train dataset has 700 lines and the test dataset has 300 lines.
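
When the target classes are imbalanced, a plain random split can leave the train and test sets with slightly different default rates. The sketch below shows the `stratify` argument of `train_test_split` on synthetic labels. The 30% positive share and the `*_demo` names are assumptions chosen for illustration, not values read from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 77

# Synthetic stand-in: 1000 rows with an assumed 30% share of positives
y_demo = np.array([1] * 300 + [0] * 700)
x_demo = np.arange(1000).reshape(-1, 1)

# stratify=y_demo keeps the class proportions the same in both subsets
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=SEED, stratify=y_demo)

print(f"positive share in train: {y_tr.mean():.2%}")
print(f"positive share in test:  {y_te.mean():.2%}")
```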


## Define the model

In [152]:
model = LogisticRegression(max_iter=1000, random_state=SEED)
model.fit(x_train, y_train)

LogisticRegression(max_iter=1000, random_state=77)

In [158]:
f'The model score is {model.score(x_test, y_test):.2%}'

'The model score is 73.33%'
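
For a classifier, `model.score` returns the mean accuracy, which does not show how errors split between defaulters and non-defaulters. The sketch below computes a confusion matrix and ROC AUC as complementary views. It runs on synthetic data from `make_classification`, used here only as a stand-in, so its numbers say nothing about the German credit data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

SEED = 77

# Synthetic stand-in for the credit data (assumed 70/30 class weights)
x_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3], random_state=SEED)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=SEED)

clf = LogisticRegression(max_iter=1000, random_state=SEED).fit(x_tr, y_tr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, clf.predict(x_te))
# AUC is computed from the predicted probability of class 1
auc = roc_auc_score(y_te, clf.predict_proba(x_te)[:, 1])

print(cm)
print(f"ROC AUC: {auc:.3f}")
```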

In [185]:
kf = KFold(n_splits=5, shuffle=True, random_state=SEED)
for i, (train_index, test_index) in enumerate(kf.split(x)):
    print(f"Fold {i}:")
    # KFold yields positional indices, so rows are selected with .iloc
    x_fold_train, x_fold_test = x.iloc[train_index], x.iloc[test_index]
    y_fold_train, y_fold_test = y.iloc[train_index], y.iloc[test_index]
    model.fit(x_fold_train, y_fold_train)
    print(f'The model score is {model.score(x_fold_test, y_fold_test):.2%}')

Fold 0:
The model score is 76.00%
Fold 1:
The model score is 70.00%
Fold 2:
The model score is 72.50%
Fold 3:
The model score is 72.00%
Fold 4:
The model score is 71.50%
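
The per-fold scores are easier to compare once summarized. The sketch below first averages the five accuracies reported above, then shows `cross_val_score` as a one-call equivalent of the manual loop. The second part runs on synthetic data, used as an assumed stand-in for the credit dataset, so its scores will differ from the ones above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

SEED = 77

# Fold accuracies copied from the output above
reported = np.array([0.76, 0.70, 0.725, 0.72, 0.715])
print(f"mean of reported folds: {reported.mean():.2%}")  # 72.40%

# Synthetic stand-in data, for illustration only
x_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, random_state=SEED)

kf = KFold(n_splits=5, shuffle=True, random_state=SEED)
# cross_val_score fits a fresh clone of the estimator on each fold
scores = cross_val_score(
    LogisticRegression(max_iter=1000, random_state=SEED), x_demo, y_demo, cv=kf)
print(f"synthetic folds: mean {scores.mean():.2%}, std {scores.std():.2%}")
```

Because `cross_val_score` clones the estimator for every fold, any previously fitted `model` object is left untouched.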
