# Introduction

![](https://comps.canstockphoto.com/credit-risk-drawings_csp11709232.jpg)

## Context

The original dataset contains 1000 entries with 20 categorical/symbolic attributes prepared by Prof. Hofmann. In this dataset, each entry represents a person who takes credit from a bank. Each person is classified as a good or bad credit risk according to the set of attributes.

## Content

1. [Load and Check Data](#0)
1. [Dataset Description](#1)
1. [Standardization of Data](#2)
1. [Missing Value Analysis](#3)
1. [Outlier Value Analysis](#4)
1. [Data Visualization](#5)
1. [ML Modeling](#6)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # visualization
import matplotlib.pyplot as plt
import missingno as msno
import scipy.stats as stats
import statsmodels.api as sm
import pylab 
import scipy
from scipy.stats import mannwhitneyu
from scipy.stats import chi2_contingency
from scipy.stats import kstest
from yellowbrick.cluster import KElbowVisualizer
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import sklearn.metrics as metrics


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a id="0"></a>
# Load and Check Data

* Load file

In [None]:
germanCreditData=pd.read_csv("/kaggle/input/german-credit-data-with-risk/german_credit_data.csv")

* First 10 records in the Dataset

In [None]:
df=germanCreditData.copy()

df.head(10)

In [None]:
df.info()
print("\n")
print("shape: ",df.shape)


* Our data set consists of 11 columns and 1000 observations.
* The variables have two data types: int and object.
* The Unnamed variable has no effect on the data set, so it will be removed in the next steps.
* The data set information shows that there is missing data in the Saving accounts and Checking account columns.

<a id="1"></a>
# Dataset Description

* Unnecessary variable deletion.

In [None]:
df.drop(df.columns[[0]],axis=1,inplace=True)

* Check that the deletion was successful. The new number of variables is 10.

In [None]:
df.columns

<h2>Content</h2>
It is almost impossible to understand the original dataset due to its complicated system of categories and symbols. Thus, I wrote a small Python script to convert it into a readable CSV file. Several columns are simply ignored, because in my opinion either they are not important or their descriptions are obscure. The selected attributes are:

<b>Age </b>(numeric)<br>
<b>Sex </b>(text: male, female)<br>
<b>Job </b>(numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)<br>
<b>Housing</b> (text: own, rent, or free)<br>
<b>Saving accounts</b> (text: little, moderate, quite rich, rich)<br>
<b>Checking account </b>(text: little, moderate, rich)<br>
<b>Credit amount</b> (numeric, in DM - Deutsche Mark)<br>
<b>Duration</b> (numeric, in months)<br>
<b>Purpose </b>(text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)<br>
<b>Risk </b> (Value target - Good or Bad Risk)<br>

<a id="2"></a>
# Standardization of Data

* The column names in the data set are standardized. The old and new column names are kept in lists for this operation.

In [None]:
oldColumn = df.columns

newColumn = ["age","sex","job","housing","savingAccounts","checkingAccount","creditAmount","duration","purpose","risk"]

* Old column names are replaced with new column names.

In [None]:
# Map each old column name to its new name and rename everything in one call
df = df.rename(columns=dict(zip(oldColumn, newColumn)))

df

<a id="3"></a>
# Missing Value Analysis

* Is there any missing data?

In [None]:
df.isnull().values.any()

* How many values are missing in each variable?

In [None]:
df.isnull().sum()

* Missing data is observed in the savingAccounts and checkingAccount variables.

# Missing Data Visualization

* The number of non-missing observations for each variable is shown with a bar plot.

In [None]:
msno.bar(df,color=sns.color_palette("deep"));

* The relationship between the missing observations of different variables is examined with a heatmap.

In [None]:
msno.heatmap(df);

* The heatmap shows a correlation of 0.1 between missing savingAccounts values and missing checkingAccount values.

* Whether the missing savingAccounts and checkingAccount observations occur at random is examined with the matrix plot.

In [None]:
msno.matrix(df,color=(0.5,0.3,0.2));

* The matrix plot confirms that the missing observations of the two variables are only weakly related.
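
The value read from the heatmap can be cross-checked by hand. As far as I understand it, `msno.heatmap` correlates the missing-value indicators of the columns, which can be approximated in plain pandas. The sketch below uses a small made-up frame (the values and the names `toy`, `is_missing`, `with_gaps` and `nullity_corr` are illustrative only, not part of the notebook).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the credit data (values are made up for illustration)
toy = pd.DataFrame({
    "savingAccounts": ["little", np.nan, "rich", np.nan, "moderate", "little"],
    "checkingAccount": ["little", np.nan, "moderate", "rich", "moderate", np.nan],
    "age": [25, 40, 33, 51, 29, 60],
})

# 1 where a value is missing, 0 where it is present
is_missing = toy.isnull().astype(int)

# Keep only the columns that contain missing values, then correlate the indicators
with_gaps = is_missing.loc[:, is_missing.sum() > 0]
nullity_corr = with_gaps.corr()
print(nullity_corr)
```

A value near 0 means that a gap in one column says little about a gap in the other, while a value near 1 means the two columns tend to be missing together.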


* A customer may simply not have an account at the bank, so missing data are filled with "no account".

In [None]:
df.savingAccounts=df.savingAccounts.fillna(value="no account")
df.checkingAccount=df.checkingAccount.fillna(value="no account")
df

In [None]:
ekle = pd.DataFrame(
        {'housing': pd.Categorical(
              values =  df["housing"],
              categories=["free","rent","own"]),

         # Ordered from lowest to highest balance ("quite rich" ranks below "rich")
         'savingAccounts': pd.Categorical(
             values = df["savingAccounts"],
             categories=["no account","little","moderate","quite rich","rich"]),

         'checkingAccount': pd.Categorical(
             values = df["checkingAccount"],
             categories=["no account","little","moderate","rich"])
        }
    )

In [None]:
df1 = df.copy()
ekle = ekle.apply(lambda x: x.cat.codes)
ekle.head()
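
The integer codes produced by `cat.codes` follow the position of each label in the `categories` list, so the order of that list decides the encoding. A minimal sketch on made-up values (the names `values`, `order`, `codes` and `unseen` are illustrative only):

```python
import pandas as pd

# Made-up values using the same labels as the checkingAccount variable
values = pd.Series(["little", "no account", "rich", "moderate", "little"])
order = ["no account", "little", "moderate", "rich"]

# Each label is replaced by its position in `order`
cat = pd.Categorical(values, categories=order, ordered=True)
codes = pd.Series(cat.codes)
print(dict(zip(values, codes)))

# A label that is not listed in `categories` is treated as missing and gets the code -1
unseen = pd.Categorical(["unknown"], categories=order).codes
print(unseen)
```

Because unlisted labels silently become -1, it is worth checking for that code after encoding.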

In [None]:
del df1["savingAccounts"]
del df1["checkingAccount"]
del df1["housing"]
df1 = pd.concat([df1,ekle],axis=1)
df1.head()

In [None]:
df1=pd.get_dummies(df1, columns = ["sex"], prefix = ["sex"])
df1=pd.get_dummies(df1, columns = ["risk"], prefix = ["risk"])

In [None]:
del df1["sex_male"]
del df1["risk_bad"]
df1.rename(columns={"risk_good":"risk",
                  "sex_female":"sex"},inplace=True)

In [None]:
df.duration.plot(kind='hist',color='green',bins=20,figsize=(10,5))
plt.title("duration Variable Histogram Chart");

<a id="4"></a>
# Outlier Value Analysis

* The normality of the creditAmount variable is examined with a histogram and a probability plot.

In [None]:
plt.subplot(2,1,1)
df.creditAmount.plot(kind='hist',color='pink',bins=50,figsize=(10,10))
plt.title("creditAmount Variable Histogram Chart");

* The creditAmount variable is right-skewed (most values are small, with a long tail of large amounts), so it is not normally distributed.

In [None]:
stats.probplot(df.creditAmount, dist="norm", plot=pylab)
pylab.show()

In [None]:
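# Compare against a normal distribution with the sample's own mean and standard deviation;
# kstest(..., 'norm') on its own tests against the standard normal N(0, 1).
credit = df["creditAmount"]
stat, p = stats.kstest(credit, 'norm', args=(credit.mean(), credit.std()))
print('Statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('Credit Amount is normally distributed (H0: fail to reject)')
else:
    print('Credit Amount is not normally distributed (H0: reject)')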

* The points in the probability plot do not lie along the reference line, so the creditAmount variable is not normally distributed.

* The Kolmogorov-Smirnov test also shows that it is not normally distributed.
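
One detail of the Kolmogorov-Smirnov test is easy to miss: `kstest(x, 'norm')` compares the sample with the standard normal N(0, 1), so unscaled data is rejected even when it really is normal. The sketch below illustrates this on synthetic data (the sample and the variable names are assumptions for illustration, not the credit data).

```python
import numpy as np
from scipy import stats

# Synthetic, genuinely normal sample with a large mean and spread
rng = np.random.default_rng(0)
sample = rng.normal(loc=3000, scale=500, size=1000)

# Test against N(0, 1): rejected only because of location and scale
stat_raw, p_raw = stats.kstest(sample, "norm")

# Test against a normal with the sample's own mean and standard deviation
stat_fit, p_fit = stats.kstest(sample, "norm",
                               args=(sample.mean(), sample.std(ddof=1)))

print("raw:    stat=%.3f, p=%.3f" % (stat_raw, p_raw))
print("fitted: stat=%.3f, p=%.3f" % (stat_fit, p_fit))
```

Estimating the mean and standard deviation from the same sample makes the test conservative, so the fitted p-value should be read as a rough guide rather than an exact figure.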

In [None]:
group1 = df1["creditAmount"][df1["risk"] == 1]
group2 = df1["creditAmount"][df1["risk"] == 0]
stat, p = scipy.stats.mannwhitneyu(group1,group2)
print('Statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('it is not significant between Risk and Credit Amount(H0:fail to reject)')
else:
    print('it is significant between Risk and Credit Amount(H0:reject)')
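
The Mann-Whitney U test compares two groups through the ranks of their values, which is why it suits a variable that is not normally distributed. A minimal sketch on two made-up groups (the numbers and the names `low` and `high` are illustrative only):

```python
from scipy.stats import mannwhitneyu

# Two made-up groups whose values do not overlap at all
low = [1, 2, 3, 4, 5]
high = [6, 7, 8, 9, 10]

# U counts how often a value from the first group exceeds a value from the second
stat, p = mannwhitneyu(low, high, alternative="two-sided")
print("Statistics=%.3f, p=%.4f" % (stat, p))
```

Here no value of `low` exceeds any value of `high`, so U is 0 and the test rejects the hypothesis that both groups come from the same distribution.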

* The relationship between CreditAmount and risk is analyzed with the boxplot chart.

In [None]:
sns.set(style="ticks", palette="pastel")
sns.boxplot(x="risk",y="creditAmount",
             palette=["m", "g"],
            data=df)
sns.despine(offset=10, trim=True)

* The relationship between the credit amount and housing is visualized with a box plot grouped by risk.

In [None]:

sns.set(style="ticks", palette="pastel")
# Draw a nested boxplot of credit amount by housing type, split by risk
sns.boxplot(x="housing",y="creditAmount",
            hue="risk", palette=["m", "g"],
            data=df)
sns.despine(offset=10, trim=True)

* According to the graph, most outliers are observed in the "own" class of the housing variable.

* The relationship between creditAmount variable and job variable is visualized with violinplot.

In [None]:
sns.set(style="whitegrid", palette="pastel", color_codes=True)
sns.violinplot(x="job", y="creditAmount", hue="risk",
               split=True, inner="quartile",
               palette={"good": "g", "bad": "b"},
               data=df);
sns.despine(left=True);

* The violin plot shows the distribution of the data together with its quartiles. The widest part of each violin marks the creditAmount values that occur most often for that job category.

In [None]:
df.purpose.value_counts()

* The boxen plot shows the outliers of creditAmount for each purpose category.

In [None]:
sns.set(style="whitegrid")
sns.boxenplot(x="purpose", y="creditAmount",
              color="b",
              scale="linear", data=df);

* The Pairplot chart shows the relationship between creditAmount and the duration variable.

In [None]:
sns.pairplot(df, height=3,
                 vars=["creditAmount","duration"],hue="risk");

* Most observations lie between 0 and 50 on the duration axis and between 0 and 10000 on the creditAmount axis.

* The relationship between creditAmount and the sex variable is visualized with a bar plot.

In [None]:
sns.barplot(x='sex',y='creditAmount',hue='risk',data=df);

In [None]:
sns.boxplot(x=df.creditAmount);

* The IQR is calculated to detect extreme values.

**IQR (Interquartile Range)**

In [None]:
Q1 = df1.creditAmount.quantile(0.25)
Q3 = df1.creditAmount.quantile(0.75)
IQR = Q3 - Q1

In [None]:
print("Q1:",Q1)
print("Q3:",Q3)
print("IQR:",IQR)

In [None]:
upper_value = Q3 + 1.5*IQR
lower_value = Q1 - 1.5*IQR

In [None]:
print("upper_value:",upper_value)
print("lower_value:",lower_value)

* Using the threshold values, outliers in the data set are detected.

In [None]:
outlier_values = (df1.creditAmount < lower_value) | (df1.creditAmount > upper_value)

* Total outliers.

In [None]:
outlier_values.sum()

**Outlier Value Correction**

In [None]:
upper_outlier = df1.creditAmount> upper_value
upper_outlier.sum()

* All outliers are upper outliers.

In [None]:
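# Cap values above the upper threshold (avoids chained assignment on a column slice)
df1["creditAmount"] = df1["creditAmount"].clip(upper=upper_value)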
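
The IQR rule and the capping step can be followed by hand on a tiny example. The series below is made up for illustration, and the names `amounts`, `lower`, `upper`, `is_outlier` and `capped` are not part of the notebook.

```python
import pandas as pd

# Made-up amounts with one obviously extreme value
amounts = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = amounts.quantile(0.25)   # first quartile
q3 = amounts.quantile(0.75)   # third quartile
iqr = q3 - q1

# Values outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (amounts < lower) | (amounts > upper)

# Capping replaces each outlier with the nearest threshold
capped = amounts.clip(lower=lower, upper=upper)
print(q1, q3, iqr, lower, upper)
print(capped.tolist())
```

Capping keeps every row in the data set, at the cost of flattening the tail of the distribution.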

* After Correction

In [None]:
sns.boxplot(x=df1.creditAmount);

<a id="5"></a>
# Data Visualization

In [None]:
df1.columns

* We have 10 variables in total: 2 of them are numeric and 8 are categorical. Each variable will be analyzed against the target, and standardization work will be done for it.

* Unique values of observations are examined.

In [None]:
print("Purpose : ",df.purpose.unique())
print("Sex : ",df.sex.unique())
print("Housing : ",df.housing.unique())
print("Saving accounts : ",df['savingAccounts'].unique())
print("Risk : ",df['risk'].unique())
print("Checking account : ",df['checkingAccount'].unique())

* Categorical variables are examined and new variables are created from categorical variables using the dummy method and categorical methods.

* Age variable

In [None]:
df1.age.unique()

In [None]:
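# Compare against a normal distribution with the sample's own mean and standard deviation;
# kstest(..., 'norm') on its own tests against the standard normal N(0, 1).
age = df["age"]
stat, p = stats.kstest(age, 'norm', args=(age.mean(), age.std()))
print('Statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('Age is normally distributed (H0: fail to reject)')
else:
    print('Age is not normally distributed (H0: reject)')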

In [None]:
group1 = df["age"][df1["risk"] == 1]
group2 = df["age"][df1["risk"] == 0]
stat, p = scipy.stats.mannwhitneyu(group1,group2)
print('Statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('it is not significant between Risk and Age(H0:fail to reject)')
else:
    print('it is significant between Risk and Age(H0:reject)')

There is a significant difference in age between the risk groups, so we decided to group the age variable into classes.
K-Means is applied for this.

In [None]:
sns.swarmplot(x='risk',y='age',hue='sex',data=df1);

In [None]:
from sklearn.cluster import KMeans
columns = ['job', 'creditAmount', 'duration', 'purpose', 'housing',
       'savingAccounts', 'checkingAccount', 'sex', 'risk']
kumeleme = df1.drop(columns,axis=1)
kumeleme

In [None]:
kmeans = KMeans()
clust = KElbowVisualizer(kmeans, k = (2,20))
clust.fit(kumeleme)
clust.show()

In [None]:
df1.head()

In [None]:
k_means = KMeans(n_clusters = 3).fit(kumeleme)
cluster = k_means.labels_
plt.scatter(df1.iloc[:,0], df.iloc[:,9], c = cluster, s = 60, cmap = "winter");

In [None]:
df1["age"] = cluster

In [None]:
df1.age.value_counts()
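
K-Means cluster labels are arbitrary integers, so label 0 is not necessarily the youngest group. If an ordered age class is wanted, the labels can be re-ranked by cluster center. The sketch below shows one way to do this on made-up ages (the data and the names `ages`, `rank` and `age_group` are assumptions for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up ages forming three clearly separated groups
ages = np.array([19, 20, 21, 22, 40, 41, 42, 43, 65, 66, 67, 68]).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ages)

# Sort the cluster centers and map each original label to its rank (0 = youngest)
order = np.argsort(km.cluster_centers_.ravel())
rank = np.empty_like(order)
rank[order] = np.arange(len(order))

age_group = rank[km.labels_]
print(age_group)
```

Fixing `random_state` also keeps the grouping reproducible between runs.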

In [None]:
nl = "\n"
crosstab = pd.crosstab(df1['age'], df1['risk'])
chi2, p, dof, expected = stats.chi2_contingency(crosstab)
print(f"Chi-square= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}")
alpha = 0.05
if p > alpha:
    print('it is not significant between Age and Risk(H0:fail to reject)')
else:
    print('it is significant between Age and Risk(H0:reject)')
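
The same chi-square check is repeated for several variables in this section, so it can be convenient to wrap it in a small helper. The function `chi2_report` below is a hypothetical helper, not part of the notebook, and it is demonstrated on a made-up table.

```python
import pandas as pd
from scipy import stats

def chi2_report(data, feature, target, alpha=0.05):
    """Chi-square test of independence between two categorical columns."""
    crosstab = pd.crosstab(data[feature], data[target])
    chi2, p, dof, expected = stats.chi2_contingency(crosstab)
    return {"chi2": chi2, "p": p, "dof": dof, "significant": p <= alpha}

# Made-up data: group "a" is mostly bad risk, group "b" is mostly good risk
toy = pd.DataFrame({
    "group": ["a"] * 30 + ["b"] * 30,
    "risk": ["good"] * 10 + ["bad"] * 20 + ["good"] * 20 + ["bad"] * 10,
})

result = chi2_report(toy, "group", "risk")
print(result)
```

For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default, which makes the statistic slightly smaller than the uncorrected value.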

In [None]:
sns.countplot(x="age",hue="risk",data=df1);

* Sex variable

In [None]:
df.sex.value_counts()

In [None]:
sns.countplot(x="sex",hue="risk",data=df);

In [None]:
nl = "\n"
crosstab = pd.crosstab(df1['sex'], df1['risk'])
chi2, p, dof, expected = stats.chi2_contingency(crosstab)
print(f"Chi-square= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}")
alpha = 0.05
if p > alpha:
    print('it is not significant between Sex and Risk(H0:fail to reject)')
else:
    print('it is significant between Sex and Risk(H0:reject)')

* Risk variable


In [None]:
df.risk.value_counts()

* Housing variable

In [None]:
df.housing.value_counts()

In [None]:
sns.countplot(x="housing",hue="risk",data=df);

In [None]:
nl = "\n"
crosstab = pd.crosstab(df1['housing'], df1['risk'])
chi2, p, dof, expected = stats.chi2_contingency(crosstab)
print(f"Chi-square= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}")
alpha = 0.05
if p > alpha:
    print('it is not significant between Housing and Risk(H0:fail to reject)')
else:
    print('it is significant between Housing and Risk(H0:reject)')

* CheckingAccount variable

In [None]:
df.checkingAccount.value_counts()

In [None]:
sns.countplot(x="checkingAccount",hue="risk",data=df);

In [None]:
nl = "\n"
crosstab = pd.crosstab(df1['checkingAccount'], df1['risk'])
chi2, p, dof, expected = stats.chi2_contingency(crosstab)
print(f"Chi-square= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}")
alpha = 0.05
if p > alpha:
    print('it is not significant between Checking Account and Risk(H0:fail to reject)')
else:
    print('it is significant between Checking Account and Risk(H0:reject)')

* SavingAccount variable

In [None]:
df.savingAccounts.value_counts()

In [None]:
sns.countplot(x="savingAccounts",hue="risk",data=df);

In [None]:
nl = "\n"
crosstab = pd.crosstab(df1['savingAccounts'], df1['risk'])
chi2, p, dof, expected = stats.chi2_contingency(crosstab)
print(f"Chi-square= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}")
alpha = 0.05
if p > alpha:
    print('it is not significant between Saving Accounts and Risk(H0:fail to reject)')
else:
    print('it is significant between Saving Accounts and Risk(H0:reject)')

In [None]:
risk2=df.risk.value_counts()

In [None]:
sns.barplot( x=risk2.index,y=risk2.values,data=df);

* Purpose Variable

In [None]:
sns.countplot(x="purpose",hue="risk",data=df);

In [None]:
nl = "\n"
crosstab = pd.crosstab(df1['purpose'], df1['risk'])
chi2, p, dof, expected = stats.chi2_contingency(crosstab)
print(f"Chi-square= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}")
alpha = 0.05
if p > alpha:
    print('it is not significant between Purpose and Risk(H0:fail to reject)')
else:
    print('it is significant between Purpose and Risk(H0:reject)')

In [None]:
purpose_vs_Risk = pd.crosstab(index=df1["purpose"], 
                             columns=df1["risk"],
                             margins=True)

purpose_vs_Risk


There is no significant association between the purpose and risk variables.

Regrouping the categories may reveal a significant association.

We decided to combine domestic appliances and furniture/equipment.

In [None]:
# Use .loc so the assignment modifies df1 itself rather than a temporary copy
df1.loc[df1.purpose == "domestic appliances", "purpose"] = "furniture/equipment"

In [None]:
nl = "\n"
crosstab = pd.crosstab(df1['purpose'], df1['risk'])
chi2, p, dof, expected = stats.chi2_contingency(crosstab)
print(f"Chi-square= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}")
alpha = 0.05
if p > alpha:
    print('it is not significant between Purpose and Risk(H0:fail to reject)')
else:
    print('it is significant between Purpose and Risk(H0:reject)')

In [None]:
ekle1 = pd.DataFrame({'purpose': pd.Categorical(
             values = df1["purpose"],
             categories=["repairs","vacation/others","furniture/equipment"
                         ,"radio/TV","education","business","car"])
    }
)

In [None]:
df2 = df1.copy()
ekle1 = ekle1.apply(lambda x: x.cat.codes)
ekle1.head()

In [None]:
del df2["purpose"]
df2 = pd.concat([df2,ekle1],axis=1)
df2.head()

* SECOND TRIAL

In [None]:
df1=pd.get_dummies(df1, columns = ["purpose"], prefix = ["p"])

In [None]:
del df1["p_repairs"]
df1.head()
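
In this second trial each purpose becomes its own 0/1 column, and one column is dropped to serve as the reference category. A minimal sketch on made-up rows (the frame `toy` and the name `encoded` are illustrative only):

```python
import pandas as pd

# Made-up purposes for four customers
toy = pd.DataFrame({"purpose": ["car", "repairs", "business", "car"]})

# One indicator column per category, prefixed with "p_"
encoded = pd.get_dummies(toy, columns=["purpose"], prefix=["p"])

# Drop one category as the reference; a row of all zeros then means "repairs"
encoded = encoded.drop(columns="p_repairs")
print(encoded)
```

Unlike the integer codes used in the first trial, this encoding does not impose any order on the purposes.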

* Job Variable

In [None]:
nl = "\n"
crosstab = pd.crosstab(df1['job'], df1['risk'])
chi2, p, dof, expected = stats.chi2_contingency(crosstab)
print(f"Chi-square= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}")
alpha = 0.05
if p > alpha:
    print('it is not significant between Job and Risk(H0:fail to reject)')
else:
    print('it is significant between Job and Risk(H0:reject)')

There is no significant association between the job and risk variables.

In [None]:
job_vs_Risk = pd.crosstab(index=df1["job"], 
                             columns=df1["risk"],
                             margins=True)

job_vs_Risk

<a id="6"></a>
# ML Modeling

In [None]:
df2.head()

In [None]:
y = df2["risk"]
X = df2.drop(["risk"], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.30, 
                                                    random_state=982)

In [None]:
xgb_tuned1 = XGBClassifier(learning_rate= 0.01, 
                                max_depth= 7, 
                                n_estimators= 1000, 
                                subsample= 0.7).fit(X_train, y_train)
y_pred = xgb_tuned1.predict(X_test)
accuracy_score(y_test,y_pred)

In [None]:
metrics.confusion_matrix(y_test,y_pred)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
# Use predicted probabilities of the positive class so the curve covers more than one threshold
y_score = xgb_tuned1.predict_proba(X_test)[:, 1]
fpr, tpr, threshold = metrics.roc_curve(y_test, y_score)
roc_auc = metrics.auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
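
A ROC curve is meant to sweep over classification thresholds, so it is normally computed from scores or probabilities rather than from hard 0/1 predictions. The sketch below compares the two on made-up numbers (the arrays and variable names are assumptions for illustration, not model output).

```python
import numpy as np
from sklearn import metrics

# Made-up true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.45, 0.3, 0.7, 0.9])

# Hard predictions at the usual 0.5 cut-off
y_label = (y_score >= 0.5).astype(int)

# AUC from probabilities: every threshold contributes a point on the curve
fpr_s, tpr_s, _ = metrics.roc_curve(y_true, y_score)
auc_scores = metrics.auc(fpr_s, tpr_s)

# AUC from hard labels: the curve collapses to a single operating point
fpr_l, tpr_l, _ = metrics.roc_curve(y_true, y_label)
auc_labels = metrics.auc(fpr_l, tpr_l)

print("AUC from probabilities: %.3f" % auc_scores)
print("AUC from hard labels:   %.3f" % auc_labels)
```

The two values differ because the hard labels discard the ranking information contained in the probabilities.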

In [None]:
feature_imp = pd.Series(xgb_tuned1.feature_importances_,
                        index=X_train.columns).sort_values(ascending=False)
sns.barplot(x=feature_imp, y=feature_imp.index)
plt.xlabel("Feature Importance Scores")
plt.ylabel('Features')
plt.title("Feature Importances")
plt.show()

* SECOND TRIAL ML MODEL

In [None]:
df1.head()

In [None]:
y = df1["risk"]
X = df1.drop(["risk"], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.30, 
                                                    random_state=982)

In [None]:
xgb_tuned2 = XGBClassifier(learning_rate= 0.01, 
                                max_depth= 7, 
                                n_estimators= 1000, 
                                subsample= 0.7).fit(X_train, y_train)
y_pred = xgb_tuned2.predict(X_test)
accuracy_score(y_test,y_pred)

In [None]:
metrics.confusion_matrix(y_test,y_pred)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
# Use predicted probabilities of the positive class so the curve covers more than one threshold
y_score = xgb_tuned2.predict_proba(X_test)[:, 1]
fpr, tpr, threshold = metrics.roc_curve(y_test, y_score)
roc_auc = metrics.auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
feature_imp = pd.Series(xgb_tuned2.feature_importances_,
                        index=X_train.columns).sort_values(ascending=False)

sns.barplot(x=feature_imp, y=feature_imp.index)
plt.xlabel("Feature Importance Scores")
plt.ylabel('Features')
plt.title("Feature Importances")
plt.show()