# German Credit Risk Analysis

![](https://d30pf83g3s2iw3.cloudfront.net/wp-content/uploads/2019/11/debit-vs-credit-card-holiday-shopping.gif)

# Content

The original dataset contains 1000 entries with 20 categorical/symbolic attributes prepared by Prof. Hofmann. In this dataset, each entry represents a person who takes out a credit from a bank. Each person is classified as a good or bad credit risk according to the set of attributes.

# The following steps were followed in this project:
1. Import Modules and Data
1. Data Analysis
1. Data Classification
1. Data Visualization
1. Data Preprocessing
1. Building Models
    - DecisionTree Model
    - GradientBoosting Model
    - XGBoost Model
    - LightGBM Model

# Import Modules and Data

In [38]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')

In [39]:
#Link data : https://www.kaggle.com/kabure/german-credit-data-with-risk?select=german_credit_data.csv
df=pd.read_csv("german_credit_data.csv")
df=df.iloc[:,1:]
df.head()


## Variable Description

Meaning of the Values:

1. Age: Age of the person applying for the credit.
1. Sex: Gender of the person applying for the credit.
1. Job: Job category coded as 0, 1, 2 or 3 (relabeled later in this notebook as unskilled, resident, skilled and highly skilled).
1. Housing: own, rent or free.
1. Saving accounts: Level of savings in the person's savings account.
1. Checking account: Status of the person's checking account.
1. Credit amount: Amount of credit requested.
1. Duration: Time given for credit payment.
1. Purpose: Purpose of the credit application.
1. Risk: Whether the applicant is classified as a good or bad credit risk.

# Data Analysis

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
def unique_value(data_set, column_name):
    return data_set[column_name].nunique()

print("Number of the Unique Values:")
print(unique_value(df,list(df.columns)))

In [None]:
def missing_value_table(df):
    # Count missing values per column, largest first
    missing_value = df.isna().sum().sort_values(ascending=False)
    # True division keeps fractional percentages that floor division would drop
    missing_value_percent = 100 * df.isna().sum() / len(df)
    missing_value_table = pd.concat([missing_value, missing_value_percent], axis=1)
    missing_value_table_return = missing_value_table.rename(columns = {0 : 'Missing Values', 1 : '% Value'})
    cm = sns.light_palette("lightgreen", as_cmap=True)
    missing_value_table_return = missing_value_table_return.style.background_gradient(cmap=cm)
    return missing_value_table_return

missing_value_table(df)
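
As a small illustration of the percentage column, the sketch below builds a made-up frame (its values are hypothetical and not taken from the data set) and compares floor division with true division when computing the share of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 1 missing value out of 8 rows in "Saving accounts"
toy = pd.DataFrame({
    "Saving accounts": ["little", np.nan, "rich", "little",
                        "moderate", "little", "rich", "little"],
    "Age": [22, 35, 41, 29, 53, 31, 47, 26],
})

missing_count = toy.isna().sum()
pct_floor = 100 * missing_count // len(toy)  # floor division drops the fraction
pct_true = 100 * missing_count / len(toy)    # true division keeps it

print(pct_floor["Saving accounts"], pct_true["Saving accounts"])
```

With one missing value in eight rows, floor division reports 12 while true division reports 12.5, so small missing rates can be understated or vanish entirely under `//`.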

In [None]:
pd.crosstab(df["Sex"],df["Risk"])

In [None]:
pd.crosstab(df["Housing"],df["Risk"])

In [None]:
# This call raises an error, so it is left commented out to keep the notebook's structure intact.
#pd.crosstab(df["Housing"],df["Risk"],df["Credit amount"],aggfunc=["mean"],normalize=True)

* There are multiple groups in "Purpose".
* In this situation we can apply an ANOVA test.
* This way we can see whether the requested credit amount differs across purposes.

In [None]:
df.head(2)

In [None]:
from scipy import stats

df1 = df.copy()

df1 = df1[["Credit amount","Purpose"]]

group = pd.unique(df1.Purpose.values)

d_v1 = {grp:df1["Credit amount"][df1.Purpose == grp] for grp in group}


* One of the conditions of the ANOVA test is equal variances across groups.

* Levene's test was applied; according to the result, the variances between groups are not equal.

In [None]:
# Applying levene
stats.levene(d_v1['radio/TV'],d_v1['furniture/equipment'],d_v1['car'],d_v1['business'],d_v1['domestic appliances'],d_v1['repairs'],
                     d_v1['vacation/others'],d_v1['education'])

* P value << 0.05

In [None]:
f, p = stats.f_oneway(d_v1['radio/TV'],d_v1['furniture/equipment'],d_v1['car'],d_v1['business'],d_v1['domestic appliances'],d_v1['repairs'],
                     d_v1['vacation/others'],d_v1['education'])

("F statistics: "+str(f)+" | P value : "+str(p))

* H0: There are no significant differences between the group means.

* H1: At least one group's mean is different.

* P value < 0.05

* Reject H0. Because Levene's test indicated unequal variances, this result should be read with some caution.
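
Because ANOVA assumes equal variances, a rank-based test such as Kruskal-Wallis is sometimes used as a cross-check. The sketch below is not part of the original analysis; it uses randomly generated, made-up amounts for three hypothetical groups, only to show the `scipy.stats.kruskal` call.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Made-up credit amounts for three hypothetical purposes with different spreads
group_a = rng.normal(loc=1000, scale=100, size=50)
group_b = rng.normal(loc=3000, scale=800, size=50)
group_c = rng.normal(loc=1200, scale=300, size=50)

# Kruskal-Wallis compares the groups using ranks, so it does not rely on equal variances
h_stat, p_value = stats.kruskal(group_a, group_b, group_c)
print("H statistic:", h_stat, "| P value:", p_value)
```

On the real data, the same call could take the per-purpose series held in `d_v1`, under the assumption that a rank-based comparison suits the question being asked.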

In [None]:
(df.groupby(by=["Purpose"])[["Credit amount"]].agg("sum") / df["Credit amount"].sum())*100

* The result shows that the groups differ.
* The query above gives each purpose's share of the total credit amount, which makes the difference visible.

# Data Classification

In [None]:
sns.set(font_scale=1,style="whitegrid")
fig,ax=plt.subplots(ncols=2,nrows=3,figsize=(16,12))
cat_list=["Age","Credit amount","Duration"]
for i,col in enumerate(cat_list):
    # histplot replaces the deprecated distplot; fill replaces the deprecated shade
    sns.histplot(x=df[col],ax=ax[i][0],color="#e88d67")
    sns.kdeplot(x=df[col],ax=ax[i][1],fill=True,color="#f2cbac")

* "Monthly pay" and "Credit amount^2" (the squared credit amount) are added to the data frame.

In [None]:
df["Monthly pay"] = (df["Credit amount"] / df["Duration"])
df["Credit amount^2"] = df["Credit amount"]**2
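
As a quick worked example of the two derived columns, the sketch below applies the same formulas to two made-up rows (the numbers are hypothetical).

```python
import pandas as pd

# Two made-up applications: amount of credit and duration of repayment
toy = pd.DataFrame({"Credit amount": [1200, 6000], "Duration": [12, 48]})
toy["Monthly pay"] = toy["Credit amount"] / toy["Duration"]   # 1200/12 and 6000/48
toy["Credit amount^2"] = toy["Credit amount"] ** 2
print(toy)
```

The first row yields a monthly pay of 100 and the second 125, so a larger credit does not necessarily mean a proportionally larger monthly payment.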

* Binning the 'Age' and 'Duration' columns into categories

In [None]:
# Half-open bins [left, right): e.g. age 25 falls in "25-30"; ages of 76 and above stay NaN
age_bins=[0,25,30,35,40,50,76]
age_labels=["0-25","25-30","30-35","35-40","40-50","50-75"]
df.insert(1,"Cat Age",pd.cut(df["Age"],bins=age_bins,labels=age_labels,right=False))

In [None]:
# Half-open bins [left, right); the last bin [60, 73) covers durations up to 72 months
duration_bins=[0,12,24,36,48,60,73]
duration_labels=["0-12","12-24","24-36","36-48","48-60","60-72"]
df.insert(9,"Cat Duration",pd.cut(df["Duration"],bins=duration_bins,labels=duration_labels,right=False))
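
To see how half-open age bins behave at their edges, here is a minimal sketch using `pd.cut` on made-up ages. The bin edges and labels mirror the age categories used in this notebook; the input values are hypothetical.

```python
import pandas as pd

ages = pd.Series([19, 25, 29, 30, 75])  # made-up ages
age_bins = [0, 25, 30, 35, 40, 50, 76]
age_labels = ["0-25", "25-30", "30-35", "35-40", "40-50", "50-75"]

# right=False makes each bin include its left edge and exclude its right edge
cat_age = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)
print(cat_age.tolist())
```

An age of exactly 25 lands in "25-30" rather than "0-25", which matches the `>=` and `<` comparisons used for the age categories.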

In [None]:
# Map the numeric job codes to readable labels
job_labels={0:"unskilled",1:"resident",2:"skilled",3:"highly skilled"}
df.insert(4,"Cat Job",df["Job"].map(job_labels).astype("category"))

In [None]:
df["Job"]=pd.Categorical(df["Job"],categories=[0,1,2,3],ordered=True)
df["Cat Age"]=pd.Categorical(df["Cat Age"],categories=['0-25','25-30', '30-35','35-40','40-50','50-75'])
df["Cat Duration"]=pd.Categorical(df["Cat Duration"],categories=['0-12','12-24', '24-36','36-48','48-60','60-72'])

In [None]:
df.head()

# Data Visualization

In [None]:
fig,ax=plt.subplots(ncols=2,figsize=(16,5))
df["Risk"].value_counts().plot.pie(autopct="%.2f%%",colors=['#00FF7F','#FF2424'],explode = (0.1, 0.1),ax=ax[0])
sns.countplot(x=df["Risk"],ax=ax[1],palette=['#00FF7F','#FF2424'])

In [None]:
fig,ax=plt.subplots(ncols=2,nrows=3,figsize=(16,20))
cat_list=["Cat Age","Sex","Cat Job","Housing","Cat Duration","Purpose"]
palette=["red","blue","purple","green","yellow","cyan"]
count=0
for i in range(3):
    for j in range(2):
        sns.countplot(x=df[cat_list[count]],ax=ax[i][j],palette=sns.dark_palette(palette[count],reverse=True))
        ax[i][j].set_xticklabels(ax[i][j].get_xticklabels(),rotation=30)
        count+=1

In [None]:
fig, ax = plt.subplots(1,2,figsize=(16,5))

sns.countplot(x=df['Sex'], ax=ax[0]).set_title('Male - Female Ratio');
sns.countplot(x=df.Risk, ax=ax[1]).set_title('Good - Bad Risk Ratio');

In [None]:
plt.figure(figsize=(16,5))
sns.countplot(x="Housing", hue="Risk", data=df).set_title("Housing and Frequency Graph by Risk", fontsize=15);
plt.show()

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(16,5))
sns.countplot(x="Saving accounts", hue="Risk", data=df, ax=ax1);
sns.countplot(x="Checking account", hue="Risk", data=df, ax=ax2);
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45)
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=45)
fig.show()

In [None]:
plt.figure(figsize=(16,5))
sns.barplot(data=df,x="Sex",y="Age",hue="Cat Job",palette="hsv_r")

In [None]:
plt.figure(figsize = (16, 5))
sns.stripplot(x = "Cat Age", y = "Credit amount", data = df)

In [None]:
fig, ax = plt.subplots(2,1,figsize=(16,5))
plt.tight_layout(pad=2)
sns.lineplot(data=df, x='Age', y='Credit amount', hue='Sex', lw=2, ax=ax[0]).set_title("Credit Amount Graph Depending on Age and Duration by Sex", fontsize=15);
sns.lineplot(data=df, x='Duration', y='Credit amount', hue='Sex', lw=2, ax=ax[1]);

In [None]:
sns.FacetGrid(data=df,col="Risk",aspect=1.5,height=4).map(sns.pointplot,"Cat Age","Credit amount","Sex",palette=["#FF7659","#30AB55"],ci=None).add_legend();

In [None]:
plt.figure(figsize=(8.5,5.5))
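# Restrict the correlation to numeric columns; string and categorical columns cannot be correlated
corr_matrix = df.corr(numeric_only=True)
corr = sns.heatmap(corr_matrix,xticklabels=corr_matrix.columns,yticklabels=corr_matrix.columns,annot=True)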

# Data Preprocessing

In [None]:
df.head()

In [None]:
df["Age"],df["Duration"],df["Job"]=df["Cat Age"],df["Cat Duration"],df["Cat Job"]
df=df.drop(["Cat Age","Cat Duration","Cat Job"],axis=1)

In [None]:
liste_columns=list(df.columns)
liste_columns.remove("Sex")
liste_columns.remove("Risk")
liste_columns.remove("Credit amount")
liste_columns.remove("Monthly pay")
liste_columns.remove("Credit amount^2")

In [None]:
from sklearn.preprocessing import LabelEncoder
label=LabelEncoder()
df["Sex"]=label.fit_transform(df["Sex"])
df["Risk"]=label.fit_transform(df["Risk"])
df=pd.get_dummies(df,columns=liste_columns,prefix=liste_columns)
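
`LabelEncoder` assigns integer codes following the sorted order of the class names, so which label becomes 0 depends on alphabetical order. The sketch below shows this on a few made-up labels.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["good", "bad", "good", "good"])  # made-up labels

print(list(encoder.classes_))  # class names in the order of their integer codes
print(list(codes))
```

Here "bad" is encoded as 0 and "good" as 1; `encoder.classes_` is a convenient way to confirm the mapping before interpreting model output.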

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
df["Credit amount"]=scaler.fit_transform(df[["Credit amount"]])
df["Monthly pay"]=scaler.fit_transform(df[["Monthly pay"]])
df["Credit amount^2"]=scaler.fit_transform(df[["Credit amount^2"]])
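
As a small worked example of what `MinMaxScaler` computes, the sketch below rescales three made-up credit amounts with (x - min) / (max - min), then applies the already fitted scaler to a new value. The numbers are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

amounts = np.array([[250.0], [1000.0], [4000.0]])  # made-up credit amounts
scaler = MinMaxScaler()
scaled = scaler.fit_transform(amounts)      # (x - 250) / (4000 - 250)
new_value = scaler.transform([[2125.0]])    # reuses the fitted min and max

print(scaled.ravel(), new_value.ravel())
```

The three amounts map to 0, 0.2 and 1, and the new value maps to 0.5. Calling `transform` without refitting is how a scaler fitted on one set of rows can be reused on another.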

In [None]:
df.head()

# Building Models

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix,roc_curve,roc_auc_score,auc,classification_report

In [None]:
X=df.drop(["Risk"],axis=1)
Y=df["Risk"]
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=0)

## Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeClassifier
cart_model=DecisionTreeClassifier(criterion='gini',max_depth=4,min_samples_leaf=54,min_samples_split=2).fit(X_train,Y_train)

In [None]:
print("Train Accuracy Score : ",accuracy_score(Y_train,cart_model.predict(X_train)))
print("Test Accuracy Score : ",accuracy_score(Y_test,cart_model.predict(X_test)))

In [None]:
print(classification_report(Y_test,cart_model.predict(X_test)))
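
The classification report summarizes counts that can also be read directly from a confusion matrix. As a minimal sketch on made-up labels (not the model's actual predictions), `confusion_matrix` can be used like this:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]  # made-up true labels
y_pred = [0, 1, 1, 1, 0]  # made-up predictions

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

For the models in this notebook the equivalent call would be `confusion_matrix(Y_test, cart_model.predict(X_test))`.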

## GradientBoosting Model

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gbm_model = GradientBoostingClassifier(learning_rate = 0.01,max_depth = 5,min_samples_split = 10,n_estimators = 100).fit(X_train, Y_train)

In [None]:
print("Train Accuracy Score : ",accuracy_score(Y_train,gbm_model.predict(X_train)))
print("Test Accuracy Score : ",accuracy_score(Y_test,gbm_model.predict(X_test)))

In [None]:
print(classification_report(Y_test,gbm_model.predict(X_test)))

In [None]:
X_train.shape

In [None]:
Importance=pd.DataFrame({"Values":gbm_model.feature_importances_*100},index=list(X_test.columns))
Importance.sort_values("Values",inplace=True,ascending=True)
Importance[28:].plot.barh()
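
The slice used for the plot depends on the number of feature columns. One possible alternative, sketched here with made-up importance values and hypothetical feature names, is to select the top entries with `nlargest`, which does not depend on the column count.

```python
import pandas as pd

# Hypothetical importance values; the feature names are made up
importance = pd.Series({"f_a": 5.0, "f_b": 40.0, "f_c": 15.0, "f_d": 30.0, "f_e": 10.0})

# Keep the three largest values, sorted ascending so a horizontal bar chart puts the largest on top
top3 = importance.nlargest(3).sort_values(ascending=True)
print(top3.index.tolist())
```

Calling `top3.plot.barh()` afterwards would draw the same kind of chart as the cell above.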

## XGBoost Model

In [None]:
from xgboost import XGBClassifier
xgb_model = XGBClassifier(learning_rate = 0.05, max_depth = 5,n_estimators=100,subsample=0.8).fit(X_train,Y_train)

In [None]:
print("Train Accuracy Score : ",accuracy_score(Y_train,xgb_model.predict(X_train)))
print("Test Accuracy Score : ",accuracy_score(Y_test,xgb_model.predict(X_test)))

In [None]:
print(classification_report(Y_test,xgb_model.predict(X_test)))

In [None]:
Importance=pd.DataFrame({"Values":xgb_model.feature_importances_*100},index=list(X_test.columns))
Importance.sort_values("Values",inplace=True,ascending=True)
Importance[28:].plot.barh()

## LightGBM Model

In [None]:
from lightgbm import LGBMClassifier
lgbm_model=LGBMClassifier(learning_rate=0.02,max_depth=3,min_child_samples=10,n_estimators=200,subsample=0.6).fit(X_train,Y_train)

In [None]:
print("Train Accuracy Score : ",accuracy_score(Y_train,lgbm_model.predict(X_train)))
print("Test Accuracy Score : ",accuracy_score(Y_test,lgbm_model.predict(X_test)))

In [None]:
print(classification_report(Y_test,lgbm_model.predict(X_test)))

In [None]:
Importance=pd.DataFrame({"Values":lgbm_model.feature_importances_*100},index=list(X_test.columns))
Importance.sort_values("Values",inplace=True,ascending=True)
Importance[28:].plot.barh()

In [None]:
list_model=[cart_model,gbm_model,xgb_model,lgbm_model]
list_model_name=["DecisionTree Model","GradientBoosting Model","XGBoost Model","LightGBM Model"]
fig,ax=plt.subplots(nrows=2,ncols=2,figsize=(16,10))
for model,name,axis in zip(list_model,list_model_name,ax.ravel()):
    # Use the predicted probability of class 1 for both the curve and the AUC
    proba=model.predict_proba(X_test)[:,1]
    model_roc_auc=roc_auc_score(Y_test,proba)
    fpr,tpr,thresholds=roc_curve(Y_test,proba)
    axis.plot(fpr,tpr,label="AUC = %0.2f"%model_roc_auc)
    # Diagonal reference line for a random classifier
    axis.plot([0,1],[0,1],color="red")
    axis.legend(loc="lower right")
    axis.set_title(name,fontsize=15)
fig.suptitle("ROC Curve",fontsize=18);
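
The AUC is usually computed from predicted scores rather than from hard class labels, and the two can give different values. The sketch below illustrates this on made-up labels and probabilities; the 0.5 threshold is an assumption chosen for the example.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1, 1]              # made-up labels
scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.4]  # made-up predicted probabilities of class 1
hard = [1 if s >= 0.5 else 0 for s in scores]  # labels after thresholding at 0.5

auc_scores = roc_auc_score(y_true, scores)  # uses the full ranking of the scores
auc_hard = roc_auc_score(y_true, hard)      # only sees the thresholded labels
print(auc_scores, auc_hard)
```

Every positive example is ranked above every negative one, so the score-based AUC is 1.0, while thresholding loses one positive and the label-based AUC drops to about 0.83.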

In [None]:
model_data=pd.DataFrame({"Model":["DecisionTree Model","GradientBoosting Model","XGBoost Model","LightGBM Model"],
                   "Train Accuracy":[accuracy_score(Y_train,cart_model.predict(X_train)),accuracy_score(Y_train,gbm_model.predict(X_train)),accuracy_score(Y_train,xgb_model.predict(X_train)),accuracy_score(Y_train,lgbm_model.predict(X_train))],
                   "Test Accuracy":[accuracy_score(Y_test,cart_model.predict(X_test)),accuracy_score(Y_test,gbm_model.predict(X_test)),accuracy_score(Y_test,xgb_model.predict(X_test)),accuracy_score(Y_test,lgbm_model.predict(X_test))]})

In [None]:
fig,ax=plt.subplots(ncols=2,figsize=(16,5))
sns.barplot(x="Model",y="Train Accuracy",data=model_data,ax=ax[0],palette="tab20c_r")
sns.barplot(x="Model",y="Test Accuracy",data=model_data,ax=ax[1],palette="tab20c_r")
ax[0].set_xticklabels(ax[0].get_xticklabels(),rotation=30)
ax[1].set_xticklabels(ax[1].get_xticklabels(),rotation=30);

* The model results highlighted some important features.
* Now we create a tree image.
* This tree image shows what is going on behind the scenes.

In [None]:
df.head(1)

* Selecting 4 features according to importance

In [None]:
from sklearn.ensemble import RandomForestClassifier

variable = ["Risk","Monthly pay","Credit amount","Checking account_little","Checking account_moderate"]

data = df.loc[:,variable]

data.head(2)

In [None]:
X = data.drop("Risk",axis=1)
y = data["Risk"]

forest = RandomForestClassifier(max_depth = 3, n_estimators=4)
forest.fit(X,y)

In [None]:
estimator = forest.estimators_[3]

In [None]:
# LabelEncoder sorts class names alphabetically, so "bad" is encoded as 0 and "good" as 1
target_names = ["0: bad","1: good"]

In [None]:
from sklearn.tree import export_graphviz

export_graphviz(estimator,out_file="tree_limited.dot",feature_names=X.columns,
                class_names=target_names,rounded = True, proportion = False, precision = 2, filled = True)

In [None]:
forest_1 = RandomForestClassifier(max_depth = None, n_estimators=4)
forest_1 = forest_1.fit(X,y)
estimator_non = forest_1.estimators_[3]

In [None]:
export_graphviz(estimator_non, out_file='tree_nonlimited.dot', feature_names = X.columns,
                class_names = target_names,
                rounded = True, proportion = False, precision = 2, filled = True)

In [None]:
!dot -Tpng tree_limited.dot -o tree_limited.png -Gdpi=600

In [None]:
from IPython.display import Image
Image(filename = 'tree_limited.png')