# Content
#### 1. Tasks
#### 2. Brainstorming on Dataset
#### 3. Reproducible Model Training
#### (4. Optimization of Model)
#### 5. Documentation and Reflection
#### 6. Summary

# 1. Tasks 

<ol>
    <li> Build a model that is predicting the creditworthiness of applicants. Make use of the pre-processed South German Credit dataset </li>
    <li> Optimize your model, until you are satisfied </li>
    <li> Reflect on what made you satisfied with the model's performance </li>
</ol>

Make sure that..
   <ul>
        <li> .. you have a clear directory structure. </li> 
        <li> .. you have a file (e.g. README.txt) documenting all your data, files and folders. </li> 
        <li> .. you document your code and the steps you did </li> 
        <li> .. your results are completely reproducible by using the `run all cells` command in your jupyter notebook </li> 
   </ul>
   

# 2. Brainstorming on Dataset

**[Contextualization](https://github.com/FUB-HCC/hcds-summer-2022/blob/main/lecture/HCDS22-03_Reproducibility.pdf)**: The dataset is from the 1970's, therefore we can assume that [moral and social implications](https://github.com/FUB-HCC/hcds-summer-2022/blob/main/lecture/HCDS22-05_Bias_Discrimination.pdf) that matched the general opinion of the average soceity will have impact on the dataset.

_We assume_: 
<ul> 
    <li> The data was not collected manually (no computers) </li>
    <li> Data contains biases connected to data </li>
    <li> Not every household had a car or telephone </li>
</ul>




# 3. Reproducible Model Training

### 3.1. Pipenv

In [63]:
##############TODO

### 3.2. Model

In [145]:
# import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn import svm

In [146]:
df = pd.read_csv('south_german_credit_data_preprocessed.csv')

##### Data Preparation

In [147]:
# remove first line, because it contains unnecessary content
df.drop(index=df.index[0], 
        axis=0, 
        inplace=True)

# rename categorical coloumns to human readable values
df_readable = df.replace( {'checking account':   {'1': 'No Account', '2':'<0 DM', '3':'0-200 DM', '4':'>=200 DM'},
             'credit history':     {'0': 'past delay', '1':'critical/open', '2':'none open', '3':'paid open rates', '4': 'fully paid back'},
             'credit purpose':     {'0': 'others', '1': 'car (new)','2' : 'car (used)','3' : 'furniture/equipment','4' : 'radio/television','5': 'domestic appliances','6':'repairs','7':'education','8':'vacation','9':'retraining','10':'business'},
             'savings account':    {'1': 'unknown/no savings account','2': '<100 DM','3':'100-500 DM','4':'500-1000 DM','5':'>=1000 DM'},
             'employment since..': {'1': 'unemployed','2':'<1 yr','3': '1-3 yrs','4' :'4-6 yrs','5': '>= 7 yrs'},
             'installment rate':   {'1': '>= 35', '2': '25-34','3' : '20-24','4' : '<20'},
             'status : sex':       {'1': 'male : divorced/separated','2': 'female: non-single or male: single','3': 'male : married/widowed','4': 'female : single'},
             'other debtors / guarantors': { '1' : 'none', '2': 'co-applicant','3' : 'guarantor'},
             'residence since':    {'1': '< 1 yr','2' : '1-3 yrs','3' : '4-6 yrs','4' : '>= 7 yrs'},
             'property':           {'1': 'unknown / no property','2' : 'car or other','3' : 'building soc. savings agr./life insurance','4' : 'real estate'},
             'other installment plans': {'1' : 'bank','2' : 'stores','3' : 'none'},
             'housing':            {'1': 'for free', '2' : 'rent','3' : 'own'},
             'job':                {'1': 'unemployed/unskilled - non-resident','2' : 'unskilled - resident','3' : 'skilled employee/official','4' : 'manager/self-empl./highly qualif'},
             'people to provide maintenance for': {'1' : '3 or more','2' : '0 to 2'},
             'telephone':          {'1': 'no', '2':'yes'},
             'foreign worker':     {'1':'yes','2':'no'},
             'goodness':           {'0':'bad', '1':'good'}
             })

#df.head()
df_readable.head()

Unnamed: 0,checking account,duration in month,credit history,credit purpose,credit amount,savings account,employment since..,installment rate,status : sex,other debtors / guarantors,...,property,relationship : age,other installment plans,housing,existing credits,job,people to provide maintenance for,telephone,foreign worker,goodness
1,No Account,18,fully paid back,car (used),1049,unknown/no savings account,<1 yr,<20,female: non-single or male: single,none,...,car or other,21,none,for free,1,skilled employee/official,0 to 2,no,no,good
2,No Account,9,fully paid back,others,2799,unknown/no savings account,1-3 yrs,25-34,male : married/widowed,none,...,unknown / no property,36,none,for free,2,skilled employee/official,3 or more,no,no,good
3,<0 DM,12,none open,retraining,841,<100 DM,4-6 yrs,25-34,female: non-single or male: single,none,...,unknown / no property,23,none,for free,1,unskilled - resident,0 to 2,no,no,good
4,No Account,12,fully paid back,others,2122,unknown/no savings account,1-3 yrs,20-24,male : married/widowed,none,...,unknown / no property,39,none,for free,2,unskilled - resident,3 or more,no,yes,good
5,No Account,12,fully paid back,others,2171,unknown/no savings account,1-3 yrs,<20,male : married/widowed,none,...,car or other,38,bank,rent,2,unskilled - resident,0 to 2,no,yes,good


In [148]:
# save processed CSV file
df_readable.to_csv('../A3_Reproducibility/df_germanData_readable.csv', index = False)

In [149]:
# exclude sex as factor, because we assume the highest bias in this category
df = df.drop(columns=['status : sex'])

# shuffle data, because all 'good' scores are in the top of the table
df = df.sample(frac=1).reset_index(drop=True)
n = len(df)

In [150]:
# Split the data in test data and training data
X = df.drop(columns = 'goodness', axis = 1) 
y = df['goodness']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,random_state=42)

##### Training: Logistic Regression

In [151]:
logisticRegr = LogisticRegression(max_iter=10000)
logisticRegr.fit(X_train, y_train)
predictions = logisticRegr.predict(X_test)
score = logisticRegr.score(X_test, y_test)

In [152]:
# Model Accuracy
accur_score = accuracy_score(y_test, pred)
print("Accuracy:", metrics.accuracy_score(y_test, pred))

#### TODO:
# Model Precision
#print("Precision:", metrics.precision_score(y_test, pred))


Accuracy: 0.5966666666666667


In [154]:
# Create a Support Vector Machine Classifier with a Linear Kernel
SVM = svm.SVC(kernel='linear')

# Training
SVM.fit(X_train, y_train)

# Prediction for test dataset
y_pred = SVM.predict(X_test)

# Model Accuracy
accur_score = accuracy_score(y_test, pred)
print("Accuracy:", metrics.accuracy_score(y_test, pred))

#### TODO:
# Model Precision
#print("Precision:", metrics.precision_score(y_test, pred))

Accuracy: 0.5966666666666667


##### Training: [Decision Tree](https://scikit-learn.org/stable/modules/tree.html) Modeling 

In [155]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)

# Predicting values for test data
pred=clf.predict(X_test)

# Finding accuracy score of model
accur_score = accuracy_score(y_test, pred)
print("Accuracy:", metrics.accuracy_score(y_test, pred))

Accuracy: 1.0


##### Training [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) Modeling 

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred = clf.predict(X_test)
acurr_score = accuracy_score(y_test, y_pred)
print(accur_score)

#### Plotting

In [None]:
######## TODO
# Plotting the decision tree
#tree.plot_tree(clf)

In [None]:
# Export tree as PDF
# TODO: change the size of the nodes
#fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (10,10),dpi=300)
#tree.plot_tree(clf, feature_names=list(df.columns),class_names="01",filled =True)
#fig.savefig("SGCD_decisionTree.pdf")

# 5. Documentation and Reflection

#### 1. Tasks: 
        The tasks are taken from the assignment sheet.
        
#### 2. Brainstorming on Dataset
        In this section, we apply some social scientific considerations on our data sciences task. Our approach follows the arguments and theories from the lecture and are linked if there are any direct connections to contents of the lecture. 
        
#### 3. Reproducible Model Training

##### 3.1. Local dependencies created
         The environment that a project is created in has a large impact on the reproducibility [1] of the results on another computer. Therefore we use Pipenv, so the next user can install all the dependencies from the file created.
         
###### 3.2. Model
        During assignment 2, we did an exploratory data analysis and calculated a logistic regression, which follows a state-of-the-art approach to data sciences. 
        We test that logistic regression model on accuracy and precision, and compare it with two others: Decision tree model and random forest model. 
        
        Logistic Regression is a great model to start mastering a classification problem. Accuracy tests how often a classifier is correct. Precision shows to what percentage positive tuples are labeled as such.
        RESULTS: An accuracy of .59 is quite bad. We see what happens if we try to improve trough applying support vector machines (SVM) [3].
        
        Decision Trees are a fairly simple and explainable algorithm. They are applicable for classification problems (that we find here) as well as for regression problems. We decide to use this method because it is a common tool to use for creditworthiness, because they are easy to reads, easy to prepare and require less data cleaning [2]. From the lecture we learned that a clean data set is essential for the goodness of our model and since we don't know for this task if the results of our modelling will be used by data sciences experts, we assume the base case of someone with little data sciences expertise will need our model. 
        RESULTS: 
            ACCURACY: We get an accuracy of the model of 1.0 which is explained: We use a part of the 
            testdata set for the training, therefor the accuracy must be really high. In a next step 
            we should evaluate the model on a true holdout sample that is independent of the training 
            data. 
            
      Random Forests help to bring robustness to the model [4]. Random forests are a Bagging model, which brings stability to different models. Bagging is a method to combine predictions of different classificationmodells (with high variance of the prediction). By that, the variance is lowered. [5] 
        

###### Further remarks: Unittesting for Code. 

#### (4. Optimization of Model)
#### 5. Documentation and Reflection
#### (6. Summary)

Material we used: [1](https://towardsdatascience.com/5-tools-for-reproducible-data-science-c099c6b881e5), [2](https://corporatefinanceinstitute.com/resources/knowledge/other/decision-tree/), [3](https://www.geeksforgeeks.org/introduction-to-support-vector-machines-svm/), [4](https://towardsdatascience.com/6-predictive-models-models-every-beginner-data-scientist-should-master-7a37ec8da76d), [5](https://de.wikipedia.org/wiki/Bootstrap_aggregating)


# 6. Summary