
## About Dataset
### Context
The original dataset contains 1000 entries with 20 categorical/symbolic attributes prepared by Prof. Hofmann. Each entry represents a person who takes credit from a bank, and each person is classified as a good or bad credit risk according to the set of attributes. The link to the original dataset can be found below.

### Content
The original dataset is hard to read because of its complicated system of categories and symbols, so I wrote a small Python script to convert it into a readable CSV file. Several columns are dropped because, in my opinion, they are either unimportant or their descriptions are obscure. The selected attributes are:

1. Age (numeric)
2. Sex (text: male, female)
3. Job (numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)
4. Housing (text: own, rent, or free)
5. Saving accounts (text: little, moderate, quite rich, rich)
6. Checking account (text in this CSV, e.g. little, moderate; the original attribute is a balance range in DM, Deutsche Mark)
7. Credit amount (numeric, in DM)
8. Duration (numeric, in months)
9. Purpose (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)

### Acknowledgements
Source: [UCI](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29)

### Importing Modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as po
from plotly.subplots import make_subplots

# Enable inline Plotly rendering in the notebook
po.init_notebook_mode(connected=True)

### Load and Preview Data

In [2]:
credit = pd.read_csv('German Credit Risk - With Target/german_credit_data.csv', index_col=0)
credit

Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk
0,67,male,2,own,,little,1169,6,radio/TV,good
1,22,female,2,own,little,moderate,5951,48,radio/TV,bad
2,49,male,1,own,little,,2096,12,education,good
3,45,male,2,free,little,little,7882,42,furniture/equipment,good
4,53,male,2,free,little,little,4870,24,car,bad
...,...,...,...,...,...,...,...,...,...,...
995,31,female,1,own,little,,1736,12,furniture/equipment,good
996,40,male,3,own,little,little,3857,30,car,good
997,38,male,2,own,little,,804,12,radio/TV,good
998,23,male,2,free,little,little,1845,45,radio/TV,bad


In [3]:
credit.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Age               1000 non-null   int64 
 1   Sex               1000 non-null   object
 2   Job               1000 non-null   int64 
 3   Housing           1000 non-null   object
 4   Saving accounts   817 non-null    object
 5   Checking account  606 non-null    object
 6   Credit amount     1000 non-null   int64 
 7   Duration          1000 non-null   int64 
 8   Purpose           1000 non-null   object
 9   Risk              1000 non-null   object
dtypes: int64(4), object(6)
memory usage: 85.9+ KB
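
The `info()` output shows that `Saving accounts` and `Checking account` have only 817 and 606 non-null values, i.e. 183 and 394 missing entries out of 1000. A minimal sketch of one way to count and handle such gaps is below. It uses a small hypothetical frame rather than the real CSV, and treating a missing value as its own `'unknown'` category is an assumption, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the two columns with gaps.
toy = pd.DataFrame({
    'Saving accounts': [np.nan, 'little', 'little', 'rich'],
    'Checking account': ['little', 'moderate', np.nan, np.nan],
})

# Count missing values per column.
missing_counts = toy.isna().sum()

# Assumption: treat "no recorded account" as a separate category.
filled = toy.fillna('unknown')

print(missing_counts)
print(filled)
```

Keeping the gaps as an explicit category avoids dropping a large share of the rows, which would matter here given how many checking-account values are missing.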


In [4]:
credit.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,1000.0,35.546,11.375469,19.0,27.0,33.0,42.0,75.0
Job,1000.0,1.904,0.653614,0.0,2.0,2.0,2.0,3.0
Credit amount,1000.0,3271.258,2822.736876,250.0,1365.5,2319.5,3972.25,18424.0
Duration,1000.0,20.903,12.058814,4.0,12.0,18.0,24.0,72.0


In [5]:
credit.nunique()

Age                  53
Sex                   2
Job                   4
Housing               3
Saving accounts       4
Checking account      3
Credit amount       921
Duration             33
Purpose               8
Risk                  2
dtype: int64

In [6]:
credit.columns = [x.lower().replace(' ', '_') for x in credit.columns]
credit.head(3)

Unnamed: 0,age,sex,job,housing,saving_accounts,checking_account,credit_amount,duration,purpose,risk
0,67,male,2,own,,little,1169,6,radio/TV,good
1,22,female,2,own,little,moderate,5951,48,radio/TV,bad
2,49,male,1,own,little,,2096,12,education,good


### EDA

In [7]:
# rename_axis keeps the label column named 'index' across pandas versions
risk_count = credit['risk'].value_counts().rename_axis('index').reset_index(name='count')
risk_count


Unnamed: 0,index,count
0,good,700
1,bad,300
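
The counts above imply a 70/30 split between good and bad risks. As a small sketch, the shares can be computed directly with `value_counts(normalize=True)`. The Series below is rebuilt from the printed counts rather than read from the CSV.

```python
import pandas as pd

# Rebuild the target column from the counts printed above (700 good, 300 bad).
risk = pd.Series(['good'] * 700 + ['bad'] * 300, name='risk')

# normalize=True returns fractions instead of raw counts.
risk_share = risk.value_counts(normalize=True)
print(risk_share)
```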


In [29]:
fig = px.bar(risk_count, 'index', 'count', color='index', labels={'count':'', 'index': ''},
       title='<b>Risk Distribution</b>', text_auto=True, width=700, height=600)
fig.update_yaxes(showticklabels=False)
fig.update_traces(textposition='outside', cliponaxis=False)


In [9]:
good_credit = credit[credit['risk'] == 'good']['age'].values.tolist()
bad_credit = credit[credit['risk'] == 'bad']['age'].values.tolist()
age_dist = credit['age']


In [27]:
trace1 = go.Histogram(
    x=good_credit,
    histnorm='probability',
    name='Good Credit'
)

trace2 = go.Histogram(
    x=bad_credit,
    histnorm='probability',
    name='Bad Credit'
)

trace3 = go.Histogram(
    x=age_dist,
    histnorm='probability',
    name='Age [Overall]'
)

fig = make_subplots(
    rows=2, cols=2, specs=[[{}, {}], [{'colspan':2}, None]],
    subplot_titles=('Good', 'Bad', 'General Distribution')
)

# One trace per subplot; the overall distribution spans the bottom row
fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=1, col=2)
fig.add_trace(trace3, row=2, col=1)

fig.update_layout(title='<b>Age Distribution by Risk</b>', bargap=0.02, width=700, height=600)
fig.show()


In [11]:
age_category = ['Student', 'Youth', 'Adult', 'Elderly']
credit['age_category'] = pd.cut(credit['age'], [18, 25, 35, 60, 120], labels=age_category)

good_cred = credit[credit['risk'] == 'good']
bad_cred = credit[credit['risk'] == 'bad']
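
`pd.cut` builds right-closed intervals by default, so the bins used here are (18, 25], (25, 35], (35, 60] and (60, 120]. The sketch below uses a few hypothetical ages to show where the boundary values land.

```python
import pandas as pd

labels = ['Student', 'Youth', 'Adult', 'Elderly']
ages = pd.Series([19, 25, 26, 35, 36, 60, 61])  # hypothetical sample ages

# Same bin edges as in the cell above; intervals are right-closed by default.
binned = pd.cut(ages, [18, 25, 35, 60, 120], labels=labels)
print(binned.tolist())
# ['Student', 'Student', 'Youth', 'Youth', 'Adult', 'Adult', 'Elderly']
```

An age of exactly 18 would fall outside the first bin and become `NaN`; that does not occur in this dataset, where the minimum age is 19.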

In [28]:
trace1 = go.Box(
    x=good_cred['age_category'],
    y=good_cred['credit_amount'],
    name='Good Credit',
)

trace2 = go.Box(
    x=bad_cred['age_category'],
    y=bad_cred['credit_amount'],
    name='Bad Credit'
)
data = [trace1, trace2]

layout = go.Layout(
    boxmode='group',
    xaxis={'title': '<b>Age Category</b>'},
    yaxis={'title': '<b>Credit Amount (DM)</b>'},
    title='<b>Credit Amount vs. Age Category</b>',
    width=700, height=600
)

fig = go.Figure(data=data, layout=layout)
fig.show()

In [35]:
gud_huz_risk = credit[credit['risk'] == 'good']['housing'].value_counts()
bad_huz_risk = credit[credit['risk'] == 'bad']['housing'].value_counts()
gud_huz_risk

own     527
rent    109
free     64
Name: housing, dtype: int64
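
Among good credits, `own` dominates the raw counts (527 of 700), so a share within each housing type can be easier to compare than absolute numbers. A possible sketch with `pd.crosstab` follows. The toy rows are hypothetical and only illustrate the call; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows with the same column names as the cleaned DataFrame.
toy = pd.DataFrame({
    'housing': ['own', 'own', 'own', 'rent', 'rent', 'free'],
    'risk':    ['good', 'good', 'bad', 'good', 'bad', 'bad'],
})

# normalize='index' turns each row (housing type) into shares that sum to 1.
risk_share_by_housing = pd.crosstab(toy['housing'], toy['risk'], normalize='index')
print(risk_share_by_housing)
```

On the real data the same call would be `pd.crosstab(credit['housing'], credit['risk'], normalize='index')`.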

In [41]:
trace1 = go.Bar(
    x=gud_huz_risk.index,
    y=gud_huz_risk.values,
    name='Good Credit',
    text=gud_huz_risk.values
)

trace2 = go.Bar(
    x=bad_huz_risk.index,
    y=bad_huz_risk.values,
    name='Bad Credit',
    text=bad_huz_risk.values
)

data = [trace1, trace2]

layout = go.Layout(
    title='<b>Housing Distribution by Risk</b>'
)

fig = go.Figure(data=data, layout=layout)
fig.show()