### Exploring Credit Risks

This activity is another open exploration of a dataset using both cleaning methods and visualizations.  The data describes customers as good or bad credit risks based on a small set of features specified below.  Your task is to create a Jupyter notebook with an exploration of the data using both your `pandas` cleaning and analysis skills and your visualization skills using `matplotlib`, `seaborn`, and `plotly`.  Your final notebook should be formatted with appropriate headers and markdown cells with written explanations for the code that follows. 

Post your notebook file in Canvas, along with a brief (3-4 sentence) description of what you found through your analysis. Respond to your peers with reflections on their analysis. 

-----


##### Data Description

```
1. Status of existing checking account, in Deutsche Mark.
2. Duration in months
3. Credit history (credits taken, paid back duly, delays, critical accounts)
4. Purpose of the credit (car, television,...)
5. Credit amount
6. Status of savings account/bonds, in Deutsche Mark.
7. Present employment, in number of years.
8. Installment rate in percentage of disposable income
9. Personal status (married, single,...) and sex
10. Other debtors / guarantors
11. Present residence since X years
12. Property (e.g. real estate)
13. Age in years
14. Other installment plans (banks, stores)
15. Housing (rent, own,...)
16. Number of existing credits at this bank
17. Job
18. Number of people being liable to provide maintenance for
19. Telephone (yes,no)
20. Foreign worker (yes,no)
```

In [23]:
import pandas as pd
import plotly.express as px

In [18]:
df = pd.read_csv('data/dataset_31_credit-g.csv')

In [19]:
df.head(3)

Unnamed: 0,checking_status,duration,credit_history,purpose,credit_amount,savings_status,employment,installment_commitment,personal_status,other_parties,...,property_magnitude,age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker,class
0,'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,...,'real estate',67,none,own,2,skilled,1,yes,yes,good
1,'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,...,'real estate',22,none,own,1,skilled,1,none,yes,bad
2,'no checking',12,'critical/other existing credit',education,2096,'<100','4<=X<7',2,'male single',none,...,'real estate',49,none,own,1,'unskilled resident',2,none,yes,good


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   checking_status         1000 non-null   object
 1   duration                1000 non-null   int64 
 2   credit_history          1000 non-null   object
 3   purpose                 1000 non-null   object
 4   credit_amount           1000 non-null   int64 
 5   savings_status          1000 non-null   object
 6   employment              1000 non-null   object
 7   installment_commitment  1000 non-null   int64 
 8   personal_status         1000 non-null   object
 9   other_parties           1000 non-null   object
 10  residence_since         1000 non-null   int64 
 11  property_magnitude      1000 non-null   object
 12  age                     1000 non-null   int64 
 13  other_payment_plans     1000 non-null   object
 14  housing                 1000 non-null   object
 15  exist

### Clean Up
Clean up columns of interest here prior to analysis.

In [21]:
# Clean up job column
df.value_counts('job')

# Shorten the long job labels; one replacement per line for readability
df['job'] = (
    df['job']
    .str.replace('unskilled resident', 'unskilled')
    .str.replace('high qualif/self emp/mgmt', 'management')
    .str.replace('unemp/unskilled non res', 'unemployed')
)
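
The `head()` outputs suggest that multi-word categories carry literal single quotes from the CSV, which substring replacement leaves in place (for example `'unskilled'`). A minimal sketch of stripping them, assuming the quotes only appear at the ends of a value; the `jobs` Series and the `job_labels` mapping are small stand-ins introduced here, not part of the original notebook.

```python
import pandas as pd

# Stand-in for df['job']; multi-word categories arrive wrapped in literal single quotes.
jobs = pd.Series(["skilled", "'unskilled resident'", "'high qualif/self emp/mgmt'"])

# Hypothetical mapping from the long labels to short ones.
job_labels = {
    "unskilled resident": "unskilled",
    "high qualif/self emp/mgmt": "management",
    "unemp/unskilled non res": "unemployed",
}

# Strip leading/trailing single quotes first, then replace whole values via the mapping.
cleaned = jobs.str.strip("'").replace(job_labels)
print(cleaned.tolist())
```

Stripping the quotes first means the mapping can match whole values exactly, so a label is never altered by an accidental substring match.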

### How do job and age relate to credit worthiness?

In [22]:
df[['age','job','class']].head()

Unnamed: 0,age,job,class
0,67,skilled,good
1,22,skilled,bad
2,49,'unskilled',good
3,45,skilled,good
4,53,skilled,bad


In [27]:
# Count customers in each (job, class) combination, then plot grouped bars
counts = df.value_counts(['job', 'class']).reset_index()
px.bar(counts, x='class', y='count', color='job', barmode='group')

Looking at counts of good and bad credit ratings split by job type, the proportions look similar in each group. I expected unskilled workers to be classified as bad credit risks more often, but that is not really the case.
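
Raw counts make proportions hard to compare because the job groups differ in size. One way to check the "similar proportions" impression is a row-normalized cross-tabulation. The sketch below uses a toy frame with made-up rows in place of `df[['job', 'class']]`, so the numbers are illustrative only.

```python
import pandas as pd

# Toy rows standing in for df[['job', 'class']]; the values are illustrative only.
toy = pd.DataFrame({
    "job":   ["skilled", "skilled", "skilled", "skilled", "unskilled", "unskilled"],
    "class": ["good",    "good",    "good",    "bad",     "good",      "bad"],
})

# normalize='index' converts each row (one job) into shares that sum to 1.
shares = pd.crosstab(toy["job"], toy["class"], normalize="index")
print(shares)
```

On the real data, the same call would be `pd.crosstab(df['job'], df['class'], normalize='index')`.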

In [30]:
px.box(df, x='class', y='age')

Looking at age versus credit rating, customers rated "good" tend to be slightly older. The difference is small, and both credit classes have outliers in the 70+ range.
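
A numeric summary can back up what the box plot shows. This is a sketch on a toy frame with invented ages, assuming the same `class` and `age` column names as the dataset.

```python
import pandas as pd

# Toy data standing in for df[['class', 'age']]; the ages are invented for illustration.
toy = pd.DataFrame({
    "class": ["good", "good", "good", "bad", "bad", "bad"],
    "age":   [30, 40, 50, 25, 30, 35],
})

# Count, median, and mean age per credit class.
age_summary = toy.groupby("class")["age"].agg(["count", "median", "mean"])
print(age_summary)
```

Applied to the real frame, `df.groupby('class')['age'].agg(['count', 'median', 'mean'])` would show how large the age gap between the two classes actually is.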

In [32]:
px.box(df, x='class', y='age', color='job')

If we also split by job, we see that in both credit rating groups skilled workers are slightly younger than unskilled workers. In the bad credit rating group, management tends to be older still.

The unemployed boxes look odd because of the low count, so we shouldn't read too much into them here.
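
To see which boxes rest on very few observations, the group sizes can be listed directly. The sketch below uses toy rows and a hypothetical cutoff `min_n`; neither comes from the original notebook.

```python
import pandas as pd

# Toy rows standing in for df[['class', 'job']]; illustrative only.
toy = pd.DataFrame({
    "class": ["good", "good", "good", "bad", "bad", "good"],
    "job":   ["skilled", "skilled", "skilled", "skilled", "skilled", "unemployed"],
})

min_n = 2  # hypothetical cutoff below which a box plot is unreliable

# Number of rows behind each (class, job) box.
group_sizes = toy.groupby(["class", "job"]).size()

# Keep only the groups that fall under the cutoff.
small_groups = group_sizes[group_sizes < min_n]
print(small_groups)
```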

### How does amount of savings relate to credit worthiness?

In [36]:
counts=df.value_counts(['class','savings_status']).reset_index()
counts

Unnamed: 0,class,savings_status,count
0,good,'<100',386
1,bad,'<100',217
2,good,'no known savings',151
3,good,'100<=X<500',69
4,good,'500<=X<1000',52
5,good,'>=1000',42
6,bad,'100<=X<500',34
7,bad,'no known savings',32
8,bad,'500<=X<1000',11
9,bad,'>=1000',6


In [39]:
px.bar(counts, color='class', y='count', x='savings_status', barmode='group')


The most apparent thing here is that in every savings bucket, most customers have a good credit rating. I'm surprised to see that most customers with "no known savings" are also rated good.

Once savings reach 500 DM or more, the vast majority of customers are rated good (52 of 63 in the 500 to 1000 bucket and 42 of 48 at 1000 or more).
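
The share of good ratings per bucket can be computed from the counts table shown earlier. The sketch below re-enters those counts by hand (with the literal quotes around the labels dropped) so it runs on its own; `wide` and `share_good` are names introduced here for illustration.

```python
import pandas as pd

# Counts copied from the value_counts output above (quotes around labels dropped).
counts = pd.DataFrame({
    "class": ["good", "bad", "good", "good", "good",
              "good", "bad", "bad", "bad", "bad"],
    "savings_status": ["<100", "<100", "no known savings", "100<=X<500", "500<=X<1000",
                       ">=1000", "100<=X<500", "no known savings", "500<=X<1000", ">=1000"],
    "count": [386, 217, 151, 69, 52, 42, 34, 32, 11, 6],
})

# One row per savings bucket, one column per credit class.
wide = counts.pivot(index="savings_status", columns="class", values="count")

# Fraction of customers in each bucket rated "good".
wide["share_good"] = wide["good"] / (wide["good"] + wide["bad"])
print(wide.sort_values("share_good").round(3))
```

Sorting by `share_good` puts the buckets in order of how favorable they are, which is easier to read than comparing bar heights across groups of very different sizes.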