### German Credit Dataset
Source : https://www.mldata.io/dataset-details/german_credit_data/

checking_account_status	- Status of existing checking account (A11: < 0 DM, A12: 0 <= x < 200 DM, A13 : >= 200 DM / salary assignments for at least 1 year, A14 : no checking account)

duration	-	Duration in month

credit_history	-	A30: no credits taken/ all credits paid back duly, A31: all credits at this bank paid back duly, A32: existing credits paid back duly till now, A33: delay in paying off in the past, A34 : critical account/ other credits existing (not at this bank)

purpose	- Purpose of Credit (A40 : car (new), A41 : car (used), A42 : furniture/equipment, A43 : radio/television, A44 : domestic appliances, A45 : repairs, A46 : education, A47 : (vacation - does not exist?), A48 : retraining, A49 : business, A410 : others)

savings	- Savings in accounts/bonds (A61 : < 100 DM, A62 : 100 <= x < 500 DM, A63 : 500 <= x < 1000 DM, A64 : >= 1000 DM, A65 : unknown/ no savings account

credit_amount	- savings	string	Savings in accounts/bonds (A61 : < 100 DM, A62 : 100 <= x < 500 DM, A63 : 500 <= x < 1000 DM, A64 : >= 1000 DM, A65 : unknown/ no savings account

present_employment	- A71 : unemployed, A72 : < 1 year, A73 : 1 <= x < 4 years, A74 : 4 <= x < 7 years, A75 : .. >= 7 years

installment_rate	-	Installment Rate in percentage of disposable income

personal	-	Personal Marital Status and Sex (A91 : male : divorced/separated, A92 : female : divorced/separated/married, A93 : male : single, A94 : male : married/widowed, A95 : female : single)

other_debtors	-	A101 : none, A102 : co-applicant, A103 : guarantor

present_residence	-	Present residence since

property	-	A121 : real estate, A122 : if not A121 : building society savings agreement/ life insurance, A123 : if not A121/A122 : car or other, not in attribute 6, A124 : unknown / no property

age	-	Age in years

other_installment_plans	-	A141 : bank, A142 : stores, A143 : none

customer_type	-	Predictor Class: 1=Good, 2=Bad

## Importing all libraries

In [848]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier 
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


## Importing the data 

In [849]:
data = pd.read_csv('german_credit_data_dataset.csv')

data.head()

Unnamed: 0,checking_account_status,duration,credit_history,purpose,credit_amount,savings,present_employment,installment_rate,personal,other_debtors,...,property,age,other_installment_plans,housing,existing_credits,job,dependents,telephone,foreign_worker,customer_type
0,A11,6,A34,A43,1169.0,A65,A75,4.0,A93,A101,...,A121,67.0,A143,A152,2.0,A173,1,A192,A201,1
1,A12,48,A32,A43,5951.0,A61,A73,2.0,A92,A101,...,A121,22.0,A143,A152,1.0,A173,1,A191,A201,2
2,A14,12,A34,A46,2096.0,A61,A74,2.0,A93,A101,...,A121,49.0,A143,A152,1.0,A172,2,A191,A201,1
3,A11,42,A32,A42,7882.0,A61,A74,2.0,A93,A103,...,A122,45.0,A143,A153,1.0,A173,2,A191,A201,1
4,A11,24,A33,A40,4870.0,A61,A73,3.0,A93,A101,...,A124,53.0,A143,A153,2.0,A173,2,A191,A201,2


In [850]:
data.shape

(1000, 21)

### Dropping the extra/ low contributing attributes

In [851]:
data = data.drop(['telephone', 'personal', 'present_residence', 'other_installment_plans'], axis=1) 


### Checking for nulls

In [852]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   checking_account_status  1000 non-null   object 
 1   duration                 1000 non-null   int64  
 2   credit_history           1000 non-null   object 
 3   purpose                  1000 non-null   object 
 4   credit_amount            1000 non-null   float64
 5   savings                  1000 non-null   object 
 6   present_employment       1000 non-null   object 
 7   installment_rate         1000 non-null   float64
 8   other_debtors            1000 non-null   object 
 9   property                 1000 non-null   object 
 10  age                      1000 non-null   float64
 11  housing                  1000 non-null   object 
 12  existing_credits         1000 non-null   float64
 13  job                      1000 non-null   object 
 14  dependents               

### Converting rest of the attributes to keys 

In [853]:
data.keys()

Index(['checking_account_status', 'duration', 'credit_history', 'purpose',
       'credit_amount', 'savings', 'present_employment', 'installment_rate',
       'other_debtors', 'property', 'age', 'housing', 'existing_credits',
       'job', 'dependents', 'foreign_worker', 'customer_type'],
      dtype='object')

### Finding and ranking out all of the unique saving types 

In [854]:
data.savings.unique()

array(['A65', 'A61', 'A63', 'A64', 'A62'], dtype=object)

In [855]:
saving_dictionary= {'A65' : 0, 'A61' : 1, 'A62' : 2, 'A63': 4, 'A64' : 5}

data.savings.replace(saving_dictionary, inplace = True)

data.head()

Unnamed: 0,checking_account_status,duration,credit_history,purpose,credit_amount,savings,present_employment,installment_rate,other_debtors,property,age,housing,existing_credits,job,dependents,foreign_worker,customer_type
0,A11,6,A34,A43,1169.0,0,A75,4.0,A101,A121,67.0,A152,2.0,A173,1,A201,1
1,A12,48,A32,A43,5951.0,1,A73,2.0,A101,A121,22.0,A152,1.0,A173,1,A201,2
2,A14,12,A34,A46,2096.0,1,A74,2.0,A101,A121,49.0,A152,1.0,A172,2,A201,1
3,A11,42,A32,A42,7882.0,1,A74,2.0,A103,A122,45.0,A153,1.0,A173,2,A201,1
4,A11,24,A33,A40,4870.0,1,A73,3.0,A101,A124,53.0,A153,2.0,A173,2,A201,2


### Encoding rest of the categorical values
Encoding the rest of the categorical attributes into numerical values using the dummy encoding

In [856]:
data = pd.get_dummies(data, columns =['checking_account_status', 'credit_history', 'purpose', 'present_employment', 'property', 'housing', 'other_debtors', 'job', 'foreign_worker'])

data.head()

Unnamed: 0,duration,credit_amount,savings,installment_rate,age,existing_credits,dependents,customer_type,checking_account_status_A11,checking_account_status_A12,...,housing_A153,other_debtors_A101,other_debtors_A102,other_debtors_A103,job_A171,job_A172,job_A173,job_A174,foreign_worker_A201,foreign_worker_A202
0,6,1169.0,0,4.0,67.0,2.0,1,1,1,0,...,0,1,0,0,0,0,1,0,1,0
1,48,5951.0,1,2.0,22.0,1.0,1,2,0,1,...,0,1,0,0,0,0,1,0,1,0
2,12,2096.0,1,2.0,49.0,1.0,2,1,0,0,...,0,1,0,0,0,1,0,0,1,0
3,42,7882.0,1,2.0,45.0,1.0,2,1,1,0,...,1,0,0,1,0,0,1,0,1,0
4,24,4870.0,1,3.0,53.0,2.0,2,2,1,0,...,1,1,0,0,0,0,1,0,1,0


## Rechecking for nulls 

In [857]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 48 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   duration                     1000 non-null   int64  
 1   credit_amount                1000 non-null   float64
 2   savings                      1000 non-null   int64  
 3   installment_rate             1000 non-null   float64
 4   age                          1000 non-null   float64
 5   existing_credits             1000 non-null   float64
 6   dependents                   1000 non-null   int64  
 7   customer_type                1000 non-null   int64  
 8   checking_account_status_A11  1000 non-null   uint8  
 9   checking_account_status_A12  1000 non-null   uint8  
 10  checking_account_status_A13  1000 non-null   uint8  
 11  checking_account_status_A14  1000 non-null   uint8  
 12  credit_history_A30           1000 non-null   uint8  
 13  credit_history_A31 

In [None]:
### Logistic Regression Classifier

In [858]:
def logistic_classifier(x_train, y_train):
    print("Logistic Regression Classifier:")
    lg_classifier = LogisticRegression(penalty='l2', solver='liblinear')
    lg_classifier.fit(x_train, y_train)
    return lg_classifier

### Naive Baye's Classifier

In [859]:
def naive_bayes_classifier(x_train, y_train):
    print("Naive Bayes Classifier:")
    nb_classifier = GaussianNB()
    nb_classifier.fit(x_train, y_train)
    return nb_classifier

### K-nearest-neighbors Classifier

In [860]:
def k_nearest_neighbor_classifier(x_train, y_train):
    print("K Nearest Neighbor Classifier:")
    kn_classifier = KNeighborsClassifier(n_neighbors=10)
    kn_classifier.fit(x_train, y_train)
    return kn_classifier

### Support Vector Classifier

In [861]:
def svc_classifier(x_train, y_train):
    print("Support Vector Classifier:")
    svc_classifier = SVC(kernel='rbf', gamma ='scale')
    svc_classifier.fit(x_train, y_train)
    return svc_classifier

### MLP Neural Network Classifier

In [862]:
def mlp_classifier(x_train, y_train):
    mlp_classifier = MLPClassifier(activation='logistic', hidden_layer_sizes=(12, 12, 12, 12), solver='lbfgs', verbose=True, max_iter=100000)
    mlp_classifier.fit(x_train, y_train)
    return mlp_classifier

### Decision Tree Classifier

In [863]:
def decision_tree_classifer(x_train, y_train):
    dt_classifier = DecisionTreeClassifier(max_depth=6)
    dt_classifier.fit(x_train, y_train)
    return dt_classifier

### Train-Test Split and Standard Scaler

Defining the X and Y variables and splitting the data 80% training and 20% test, shuffling the data

In [864]:
X = data.drop('customer_type', axis= 1)
Y = data['customer_type']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, shuffle=True, random_state = 0)

### Build, Train, test and evaluate model 
 For our evaluations We not only measured the accuracy but also displayed the confusion matrix which evaluates the performance of each classification. From the confusion matrix we can derive the Precision (ratio of True positive(TP)/(TP + FP(False Positive))) , Recall(ratio of TP/TP + FN (False Negative)) and f1 - scores(where weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. f1 = 2 * (precision * recall) / (precision + recall)).

In [865]:
def build_and_train_model(data, target_name, class_fn):
    model = class_fn(x_train, y_train)
    score = model.score(x_train, y_train)
    print("Training Score : ", score)

    y_pred = model.predict(x_test)

    accuracy = accuracy_score(y_test, y_pred)
    print("Testing Score: ", accuracy)
   

    Confusion_Matrix = confusion_matrix(y_test, y_pred)
    Classification_report = classification_report(y_test, y_pred)

    df_y = pd.DataFrame({'y_test' : y_test, 'y_pred' : y_pred})
    print(df_y.sample(10))

    return {'model': model,
            'x_train' : x_train, 'x_test' : x_test,
            'y_train' : y_train, 'y_test' : y_test, 
            'y_pred' : y_pred, 'sample' : df_y.sample(10), 'Confusion Matrix' : print(' \n Confusion_matrix : \n', Confusion_Matrix),
            'Classification Report' : print('\n Classification Report : \n', Classification_report) 
            }



Utilizing the build and train model function to predict classification (male or female), score and generate classification report for eeach model on the data. 3 inputs as needed: data file, value to predict(label), and the type of model below. Results in order Training score, Testing/accuracy, precision, recall and f1- score to evaluate the classification accuracy and the Y_test: actual values and Y_pred = predicted values. 

In [866]:
Logistic_classifier = build_and_train_model(data, 'customer_type', logistic_classifier)

Logistic Regression Classifier:
Training Score :  0.77875
Testing Score:  0.745
     y_test  y_pred
939       1       1
299       1       1
706       2       2
298       1       1
923       1       1
145       1       2
175       2       1
45        1       1
319       1       1
141       1       2
 
 Confusion_matrix : 
 [[121  21]
 [ 30  28]]

 Classification Report : 
               precision    recall  f1-score   support

           1       0.80      0.85      0.83       142
           2       0.57      0.48      0.52        58

    accuracy                           0.74       200
   macro avg       0.69      0.67      0.67       200
weighted avg       0.73      0.74      0.74       200



In [867]:
Naive_bayes_classifier = build_and_train_model(data, 'customer_type', naive_bayes_classifier)

Naive Bayes Classifier:
Training Score :  0.7625
Testing Score:  0.695
     y_test  y_pred
615       1       2
65        1       1
643       1       1
236       2       2
636       1       1
939       1       1
372       1       1
927       2       2
736       2       2
34        1       1
 
 Confusion_matrix : 
 [[107  35]
 [ 26  32]]

 Classification Report : 
               precision    recall  f1-score   support

           1       0.80      0.75      0.78       142
           2       0.48      0.55      0.51        58

    accuracy                           0.69       200
   macro avg       0.64      0.65      0.65       200
weighted avg       0.71      0.69      0.70       200



In [868]:
k_nearest_neighbor_classifier = build_and_train_model(data, 'customer_type', k_nearest_neighbor_classifier)

K Nearest Neighbor Classifier:
Training Score :  0.72875
Testing Score:  0.7
     y_test  y_pred
367       1       1
736       2       2
386       1       1
873       1       1
315       2       1
267       1       1
785       1       1
996       1       1
832       2       2
298       1       1
 
 Confusion_matrix : 
 [[133   9]
 [ 51   7]]

 Classification Report : 
               precision    recall  f1-score   support

           1       0.72      0.94      0.82       142
           2       0.44      0.12      0.19        58

    accuracy                           0.70       200
   macro avg       0.58      0.53      0.50       200
weighted avg       0.64      0.70      0.63       200



In [869]:
support_vector_classifier = build_and_train_model(data, 'customer_type', svc_classifier)

Support Vector Classifier:
Training Score :  0.71
Testing Score:  0.715
     y_test  y_pred
780       2       1
576       1       1
231       1       1
529       1       1
466       2       1
122       1       1
601       2       1
214       1       1
267       1       1
458       1       1
 
 Confusion_matrix : 
 [[139   3]
 [ 54   4]]

 Classification Report : 
               precision    recall  f1-score   support

           1       0.72      0.98      0.83       142
           2       0.57      0.07      0.12        58

    accuracy                           0.71       200
   macro avg       0.65      0.52      0.48       200
weighted avg       0.68      0.71      0.62       200



In [870]:
mlp_classifier = build_and_train_model(data, 'customer_type', mlp_classifier)

Training Score :  0.76
Testing Score:  0.725
     y_test  y_pred
481       1       2
832       2       2
236       2       2
264       1       1
760       1       1
909       1       1
458       1       1
316       1       1
864       2       1
666       1       2
 
 Confusion_matrix : 
 [[109  33]
 [ 22  36]]

 Classification Report : 
               precision    recall  f1-score   support

           1       0.83      0.77      0.80       142
           2       0.52      0.62      0.57        58

    accuracy                           0.73       200
   macro avg       0.68      0.69      0.68       200
weighted avg       0.74      0.72      0.73       200



In [873]:
decision_tree_classifier_model = build_and_train_model(data, 'customer_type', decision_tree_classifer)

Training Score :  0.80875
Testing Score:  0.67
     y_test  y_pred
367       1       2
492       1       1
740       1       2
214       1       1
142       1       2
767       1       1
654       1       1
644       1       2
790       2       1
489       1       1
 
 Confusion_matrix : 
 [[109  33]
 [ 33  25]]

 Classification Report : 
               precision    recall  f1-score   support

           1       0.77      0.77      0.77       142
           2       0.43      0.43      0.43        58

    accuracy                           0.67       200
   macro avg       0.60      0.60      0.60       200
weighted avg       0.67      0.67      0.67       200



### Split the training data further into 2 parts to test warm_start

In [874]:
x_train_1, x_train_2, y_train_1, y_train_2 = train_test_split(x_train, y_train, test_size=0.5)

In [875]:
random_classifier_model = RandomForestClassifier(max_depth=4, n_estimators=2, warm_start=True)

In [876]:
random_classifier_model.fit(x_train_1, y_train_1)
y_pred = random_classifier_model.predict(x_test)
test_score = accuracy = accuracy_score(y_test, y_pred)
Confusion_Matrix = confusion_matrix(y_test, y_pred)
Classification_report = classification_report(y_test, y_pred)
print("Testing Score : ", test_score)
print('\n Confusion_matrix : \n', Confusion_Matrix)
print('\n Classification Report : \n', Classification_report)

Testing Score :  0.67

 Confusion_matrix : 
 [[128  14]
 [ 52   6]]

 Classification Report : 
               precision    recall  f1-score   support

           1       0.71      0.90      0.80       142
           2       0.30      0.10      0.15        58

    accuracy                           0.67       200
   macro avg       0.51      0.50      0.47       200
weighted avg       0.59      0.67      0.61       200



In [877]:
random_classifier_model.n_estimators += 2
random_classifier_model.fit(x_train_2, y_train_2)
y_pred = random_classifier_model.predict(x_test)
test_score = accuracy = accuracy_score(y_test, y_pred)
print("Testing Score : ", test_score)
print('\n Confusion_matrix : \n', Confusion_Matrix)
print('\n Classification Report : \n', Classification_report)

Testing Score :  0.71

 Confusion_matrix : 
 [[128  14]
 [ 52   6]]

 Classification Report : 
               precision    recall  f1-score   support

           1       0.71      0.90      0.80       142
           2       0.30      0.10      0.15        58

    accuracy                           0.67       200
   macro avg       0.51      0.50      0.47       200
weighted avg       0.59      0.67      0.61       200

