# German Credit Card Fraud

Data source: UCI machine learning library, found here: http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

Project purpose:
I am enrolled in the Data Science with Python module under the Thinkful bootcamp program. The project represents a simulation of fraud detection for adding to my project portfolio. An open dataset allows me to practice applying machine learning algorithms through the prediction modeling process. 

Business Case: In this simulated fraud case, the objectives are to increase accuracy, precision and recall. Recall is important because I want to catch as many of the truly bad creditors as possible. Precision is important because I do not want to accidentally mislabel good creditors as bad.
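
As a minimal sketch of how these three metrics are computed, the counts below are made up for illustration (they are not results from this dataset) and treat "bad creditor" as the positive class:

```python
# Hypothetical confusion-matrix counts, with "bad creditor" as the positive class
tp = 40   # bad creditors correctly flagged as bad
fp = 10   # good creditors wrongly flagged as bad
fn = 20   # bad creditors the model missed
tn = 130  # good creditors correctly left alone

precision = tp / (tp + fp)                  # share of flagged creditors who really are bad
recall = tp / (tp + fn)                     # share of bad creditors the model caught
accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```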

Cases to avoid:
I am not interested in decreasing processing time like helping the algorithm converge because the code is not meant to be scaled into production. 

Software: Python using packages: sklearn, pandas, numpy, etc.

Machine learning problem: supervised learning - classification based on 20 variables

Algorithm: logistic regression

## Import Data

The data comes from the UCI repository, which does not provide a link for downloading a .csv file. I copied and pasted the data from the UCI page into an .xlsx file, split the columns on the space delimiter, and typed in each column name. I decided to keep the categorical codes (for example A11) instead of replacing them with their original descriptions.

In [1]:
#Import pandas package
import pandas as pd
german_credit = pd.read_csv('German.csv')
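
As an alternative to the spreadsheet step, a space-separated text file can be read directly with pandas. The sketch below assumes the raw data is plain text with no header row. The two rows are illustrative and only follow the layout described above, and `sample` is a name introduced here.

```python
import io
import pandas as pd

# Illustrative rows in the space-separated layout (not guaranteed to match the real file)
raw = io.StringIO(
    "A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1 A192 A201 1\n"
    "A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173 1 A191 A201 2\n"
)

# Column names used throughout this notebook
columns = ['Status checking', 'Duration', 'Credit history', 'Purpose',
           'Credit amount', 'Savings account/bonds', 'Present employment',
           'Installment rate', 'Personal status/sex', 'Debtors/guarantors',
           'Present resident since', 'Property', 'Age', 'Other installment plans',
           'Housing', 'Number existing credits', 'Job', 'Number of people liable',
           'Telephone', 'Foreign worker', 'Classification']

# sep=r"\s+" splits on whitespace; header=None because the text has no header row
sample = pd.read_csv(raw, sep=r"\s+", header=None, names=columns)
print(sample.shape)
```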

## Data Exploration

In [2]:
#Dimension of dataset
german_credit.shape

(1000, 21)

1000 rows by 21 columns

In [3]:
#Column datatypes
german_credit.dtypes

Status checking            object
Duration                    int64
Credit history             object
Purpose                    object
Credit amount               int64
Savings account/bonds      object
Present employment         object
Installment rate            int64
Personal status/sex        object
Debtors/guarantors         object
Present resident since      int64
Property                   object
Age                         int64
Other installment plans    object
Housing                    object
Number existing credits     int64
Job                        object
Number of people liable     int64
Telephone                  object
Foreign worker             object
Classification              int64
dtype: object

13 categorical variables and 7 numeric variables (excluding the class label)

In [4]:
#Number of non-null values in each column (1000 means no NaN values)
german_credit.notnull().sum()

Status checking            1000
Duration                   1000
Credit history             1000
Purpose                    1000
Credit amount              1000
Savings account/bonds      1000
Present employment         1000
Installment rate           1000
Personal status/sex        1000
Debtors/guarantors         1000
Present resident since     1000
Property                   1000
Age                        1000
Other installment plans    1000
Housing                    1000
Number existing credits    1000
Job                        1000
Number of people liable    1000
Telephone                  1000
Foreign worker             1000
Classification             1000
dtype: int64

No NaN values found in dataset

Summary statistics of the numeric variables, including the classification label

In [5]:
#Summary statistics
german_credit.describe()

| | Duration | Credit amount | Installment rate | Present resident since | Age | Number existing credits | Number of people liable | Classification |
|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 20.903 | 3271.258 | 2.973 | 2.845 | 35.546 | 1.407 | 1.155 | 1.3 |
| std | 12.058814 | 2822.736876 | 1.118715 | 1.103718 | 11.375469 | 0.577654 | 0.362086 | 0.458487 |
| min | 4.0 | 250.0 | 1.0 | 1.0 | 19.0 | 1.0 | 1.0 | 1.0 |
| 25% | 12.0 | 1365.5 | 2.0 | 2.0 | 27.0 | 1.0 | 1.0 | 1.0 |
| 50% | 18.0 | 2319.5 | 3.0 | 3.0 | 33.0 | 1.0 | 1.0 | 1.0 |
| 75% | 24.0 | 3972.25 | 4.0 | 4.0 | 42.0 | 2.0 | 1.0 | 2.0 |
| max | 72.0 | 18424.0 | 4.0 | 4.0 | 75.0 | 4.0 | 2.0 | 2.0 |


Histograms of numeric variables

In [6]:
#Import matplotlib
import matplotlib.pyplot as plt
#Duration
plt.hist(german_credit['Duration'])
plt.xlabel('Duration')
plt.ylabel('Frequency')
plt.title('Histogram of Duration in Months')
plt.show()
#Credit amount
plt.hist(german_credit['Credit amount'])
plt.xlabel('Credit amount')
plt.ylabel('Frequency')
plt.title('Histogram of Credit Amount')
plt.show()
#Installment rate
plt.hist(german_credit['Installment rate'])
plt.xlabel('Installment rate')
plt.ylabel('Frequency')
plt.title('Histogram of Installment Rate in % of Disposable Income')
plt.show()
#Years of present residence - fix 
plt.hist(german_credit['Present resident since'])
plt.xlabel('Present resident since')
plt.ylabel('Frequency')
plt.title('Histogram of Present Residence Since')
plt.show()
#Age
plt.hist(german_credit['Age'])
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Histogram of Age')
plt.show()
#Number of existing credits - fix 
plt.hist(german_credit['Number existing credits'])
plt.xlabel('Number existing credits')
plt.ylabel('Frequency')
plt.title('Histogram Of Number Of Existing Credits At This Bank')
plt.show()
#Number of people liable
plt.hist(german_credit['Number of people liable'])
plt.xlabel('Number of people liable')
plt.ylabel('Frequency')
plt.title('Histogram of Number of people being liable to provide maintenance for')
plt.show()

Categorical value descriptions:
-Number of factors in each variable
-Descriptions of each factor

Status of existing checking account 
A11 : ... < 0 DM 
A12 : 0 <= ... < 200 DM 
A13 : ... >= 200 DM / salary assignments for at least 1 year 
A14 : no checking account 

In [7]:
#Status checking
german_credit['Status checking'].value_counts()

A201    963
A202     37
dtype: int64

Credit history 
A30 : no credits taken/ all credits paid back duly 
A31 : all credits at this bank paid back duly 
A32 : existing credits paid back duly till now 
A33 : delay in paying off in the past 
A34 : critical account/ other credits existing (not at this bank) 

In [35]:
#Credit history
german_credit['Credit history'].value_counts()

A32    530
A34    293
A33     88
A31     49
A30     40
dtype: int64

Purpose 
A40 : car (new) 
A41 : car (used) 
A42 : furniture/equipment 
A43 : radio/television 
A44 : domestic appliances 
A45 : repairs 
A46 : education 
A47 : (vacation - does not exist?) 
A48 : retraining 
A49 : business 
A410 : others 

In [36]:
#Purpose
german_credit['Purpose'].value_counts()

A43     280
A40     234
A42     181
A41     103
A49      97
A46      50
A45      22
A44      12
A410     12
A48       9
dtype: int64

Savings account/bonds 
A61 : ... < 100 DM 
A62 : 100 <= ... < 500 DM 
A63 : 500 <= ... < 1000 DM 
A64 : .. >= 1000 DM 
A65 : unknown/ no savings account 

In [33]:
#Savings account/bonds
german_credit['Savings account/bonds'].value_counts()

A61    603
A65    183
A62    103
A63     63
A64     48
dtype: int64

Present employment since 
A71 : unemployed 
A72 : ... < 1 year 
A73 : 1 <= ... < 4 years 
A74 : 4 <= ... < 7 years 
A75 : .. >= 7 years 

In [None]:
#Present employment
german_credit['Present employment'].value_counts()

Personal status and sex 
A91 : male : divorced/separated 
A92 : female : divorced/separated/married 
A93 : male : single 
A94 : male : married/widowed 
A95 : female : single 

In [None]:
#Personal status/sex
german_credit['Personal status/sex'].value_counts()

Other debtors / guarantors 
A101 : none 
A102 : co-applicant 
A103 : guarantor 

In [None]:
#Debtors/guarantors
german_credit['Debtors/guarantors'].value_counts()

Property 
A121 : real estate 
A122 : if not A121 : building society savings agreement/ life insurance 
A123 : if not A121/A122 : car or other, not in attribute 6 
A124 : unknown / no property 

In [None]:
#Property
german_credit['Property'].value_counts()

Other installment plans 
A141 : bank 
A142 : stores 
A143 : none 

In [37]:
#Other installment plans
german_credit['Other installment plans'].value_counts()

A143    814
A141    139
A142     47
dtype: int64

Housing 
A151 : rent 
A152 : own 
A153 : for free 

In [38]:
#Housing
german_credit['Housing'].value_counts()

A152    713
A151    179
A153    108
dtype: int64

Job 
A171 : unemployed/ unskilled - non-resident 
A172 : unskilled - resident 
A173 : skilled employee / official 
A174 : management/ self-employed/ highly qualified employee/ officer 

In [39]:
#Job
german_credit['Job'].value_counts()

A173    630
A172    200
A174    148
A171     22
dtype: int64

Telephone 
A191 : none 
A192 : yes, registered under the customers name 

In [34]:
#Telephone
german_credit['Telephone'].value_counts()

A191    596
A192    404
dtype: int64

Foreign worker 
A201 : yes 
A202 : no 

In [None]:
#Foreign worker
german_credit['Foreign worker'].value_counts()

Check balance of class labels for good or bad creditor

In [8]:
german_credit['Classification'].value_counts()

1    700
2    300
dtype: int64

The dataset is unbalanced: there are 700 observations of good creditors and 300 of bad creditors. Label 1 (good) is the majority class and label 2 (bad) is the minority class.

## Data Cleaning

Class label changes

Change the classification label from (1,2) to (0,1) to align with sklearn's conventions, so that 0 is a good creditor and 1 is a bad creditor. Without doing this, I got higher precision and recall scores from the logit model. By default, sklearn's precision and recall treat label 1 as the positive class, and in the original coding 1 is the good creditor, which is not the class we want to predict. The scores were higher because the model was being evaluated on predicting good creditors, who make up 70% of the data, instead of bad creditors.

In [9]:
german_credit['Classification'] = german_credit['Classification'].map(lambda x: x-1) 
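
To illustrate why the relabeling matters, here is a small sketch with made-up labels (not taken from the dataset). It relies on the default `pos_label=1` of sklearn's `precision_score` and `recall_score`. All the variable names are introduced here for illustration.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels in the original coding: 1 = good creditor, 2 = bad creditor
y_true = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 2, 2, 1, 1]

# Default pos_label=1 scores the good class
p_good = precision_score(y_true, y_pred)
r_good = recall_score(y_true, y_pred)

# Scoring the bad class in the original coding needs pos_label=2
p_bad = precision_score(y_true, y_pred, pos_label=2)
r_bad = recall_score(y_true, y_pred, pos_label=2)

# After subtracting 1, the bad class is label 1 and the defaults score it directly
y_true01 = [y - 1 for y in y_true]
y_pred01 = [y - 1 for y in y_pred]
p_bad01 = precision_score(y_true01, y_pred01)

print(f"good class: precision={p_good:.3f} recall={r_good:.3f}")
print(f"bad class:  precision={p_bad:.3f} recall={r_bad:.3f}")
```

In this toy case the good-class scores are noticeably higher than the bad-class scores, which is the same effect described above.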

Convert categorical variables to dummy

Logistic regression in sklearn does not automatically convert the categorical variables to binary.

In [10]:
dummy_var = pd.get_dummies(german_credit[['Status checking', 'Credit history', 'Purpose', 
                              'Savings account/bonds', 'Present employment', 
                              'Personal status/sex', 'Debtors/guarantors', 
                              'Property', 'Other installment plans', 'Housing', 
                              'Job', 'Telephone', 'Foreign worker']])

Identify number of new columns after converting to dummy

In [11]:
dummy_var.shape   

(1000, 54)

Each of the 54 factors turned into a binary variable
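
A toy example of the same conversion, using a made-up dataframe with two of the categorical columns (the rows are hypothetical): every distinct code becomes its own indicator column.

```python
import pandas as pd

# Hypothetical rows using codes from the Housing and Telephone variables
toy = pd.DataFrame({
    'Housing': ['A151', 'A152', 'A153', 'A152'],
    'Telephone': ['A191', 'A192', 'A191', 'A191'],
})

toy_dummies = pd.get_dummies(toy)

# 3 Housing levels + 2 Telephone levels = 5 indicator columns
print(list(toy_dummies.columns))
print(toy_dummies.shape)
```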

Subset numeric variables by dropping categorical from dataframe

In [12]:
credit_new = german_credit.drop(['Status checking', 'Credit history', 'Purpose', 
                              'Savings account/bonds', 'Present employment', 
                              'Personal status/sex', 'Debtors/guarantors', 
                              'Property', 'Other installment plans', 'Housing', 
                              'Job', 'Telephone', 'Foreign worker'], axis = 1)

Get dimensions of numeric dataframe

In [13]:
credit_new.shape                              

(1000, 8)

Combine numeric and categorical dataframes using .join() 

In [14]:
german_new_credit = dummy_var.join(credit_new)  

Get dimensions of new dataframe

In [15]:
german_new_credit.shape        

(1000, 62)

## Model Building Version 1

Here the logistic regression is fit on NumPy arrays. Separate the data into independent (X) and dependent (Y) variables, then convert them to array format.

In [16]:
#Subset dataframe into independent and dependent variables
X = german_new_credit.drop('Classification', axis = 1)
Y = german_credit['Classification']
#Convert dataframe to NumPy arrays (as_matrix() was removed in pandas 1.0; .values is equivalent)
X_mat = X.values
Y_mat = Y.values

Logistic regression

In [17]:
#Import logit model in sklearn
import sklearn.linear_model as ln
#Create logistic regression object
logreg = ln.LogisticRegression()
#Fit the logistic regression 
logreg.fit(X_mat, Y_mat)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)

## Model Evaluation Version 1

In [44]:
#Import k-fold cross validation in sklearn
import numpy as np
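#cross_val_score lives in sklearn.model_selection (sklearn.cross_validation was removed in sklearn 0.20)
from sklearn.model_selection import cross_val_score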

In [24]:
#Accuracy of test set
score = cross_val_score(logreg, X_mat, Y_mat, scoring = 'accuracy', cv = 10)
np.mean(score)

0.748

In [25]:
#Recall of test set
recall_score = cross_val_score(logreg, X_mat, Y_mat, scoring = 'recall', cv = 10)
np.mean(recall_score)

0.46333333333333326

In [26]:
#Precision of test set
precision_score = cross_val_score(logreg, X_mat, Y_mat, scoring = 'precision', cv = 10)
np.mean(precision_score)

0.61677208633660263

In [27]:
#AUC
auc_score = cross_val_score(logreg, X_mat, Y_mat, scoring = 'roc_auc', cv = 10)
np.mean(auc_score)

0.79119047619047611
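
The four metrics can also be collected in a single call with `cross_validate` from `sklearn.model_selection`. This is only a sketch on synthetic stand-in data from `make_classification`, not the German credit data, and the names `X_demo`, `y_demo` and `mean_scores` are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data with a roughly 70/30 class split
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     weights=[0.7, 0.3], random_state=0)

metrics = ['accuracy', 'recall', 'precision', 'roc_auc']

# One 10-fold cross validation run that scores all four metrics
cv_results = cross_validate(LogisticRegression(max_iter=1000),
                            X_demo, y_demo, scoring=metrics, cv=10)

# Results are stored under 'test_<metric name>'
mean_scores = {m: cv_results['test_' + m].mean() for m in metrics}
for name, value in mean_scores.items():
    print(f"{name}: {value:.3f}")
```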

## Model 1 Analysis 

The classifier has an overall accuracy of 74.8%, but because of the class imbalance accuracy alone is not enough and other classification metrics are required. Recall shows that 46.3% of the bad creditors are correctly labeled as bad. Precision shows that 61.7% of the creditors labeled bad by the model were actually bad.
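
For context, a quick piece of arithmetic using the class counts reported earlier (700 good, 300 bad) shows what a trivial model that labels every creditor as good would score. This baseline is an illustration added here, not a model that was fit in this notebook.

```python
n_good, n_bad = 700, 300  # class counts from the Classification value counts

# A model that always predicts "good" gets every good creditor right
baseline_accuracy = n_good / (n_good + n_bad)

# ...and never flags a bad creditor, so it catches none of them
baseline_recall_bad = 0 / n_bad

print(f"baseline accuracy={baseline_accuracy:.3f}, recall on bad creditors={baseline_recall_bad:.3f}")
```

Against this baseline, the 74.8% accuracy of Model 1 is a modest gain, which is why recall and precision on the bad class carry more weight here.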

## Model Building Version 3 - Fixing Unbalanced Dataset

Continuing from Version 1, using the combined categorical and numeric data

The UnbalancedDataset package for Python provides different algorithms for fixing unbalanced datasets. Here’s the link: https://github.com/fmfn/UnbalancedDataset. I applied the SMOTE algorithm to create 400 more synthetic observations of the minority class (bad creditors), giving a 50/50 distribution of 700 observations each.
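
The core idea of SMOTE is to create each synthetic observation by interpolating between a minority-class point and one of its nearest minority-class neighbours. The sketch below is a simplified NumPy illustration of that idea, not the package's implementation. The points and the helper `smote_like_sample` are hypothetical, and it assumes there are no duplicate points.

```python
import numpy as np

rng = np.random.default_rng(0)

# How many synthetic bad creditors are needed for a 50/50 split
n_majority, n_minority = 700, 300
n_new = n_majority - n_minority  # 400 new observations
ratio = n_new / n_minority       # about 1.33 new samples per existing minority sample

# Hypothetical minority-class points (rows) in a 2-feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random point and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    gap = rng.random()  # uniform in [0, 1)
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(n_new, round(ratio, 3), synthetic.shape)
```

Because every synthetic point lies on a line segment between two real minority points, the new observations stay inside the region already covered by the minority class.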

In [19]:
#Import module for applying oversampling using SMOTE
from unbalanced_dataset import SMOTE
#Set verbose as false to show less information
verbose = False
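#Ratio of new synthetic samples to existing minority samples (400/300, roughly the 1.335 used here) for a 50/50 distribution
smote = SMOTE(ratio = 1.335, verbose = False, kind = 'regular')
#Convert data to arrays (as_matrix() was removed in pandas 1.0; .values is equivalent)
X_mod = X.values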
Y_mod = np.array(Y)
#Create new dataset
smox, smoy = smote.fit_transform(X_mod, Y_mod) 
#Check ratio of good and bad creditors
#Convert matrix to dataframe
y_data = pd.DataFrame(smoy, columns = ['classification'])
#check work
y_data['classification'].value_counts()

1    700
0    700
dtype: int64

Logistic regression 

## Model Building Version 3

In [32]:
#Create logistic regression object
logreg_ov = ln.LogisticRegression()
#Model building
#Fit the logistic regression 
logreg_ov.fit(smox, smoy)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)

## Model Evaluation Version 3

In [28]:
#Accuracy of test set
score_ov = cross_val_score(logreg_ov, smox, smoy, scoring = 'accuracy', cv = 10)
np.mean(score_ov)

0.76642857142857135

In [29]:
#Recall of test set
recall_score_ov = cross_val_score(logreg_ov, smox, smoy, scoring = 'recall', cv = 10)
np.mean(recall_score_ov)

0.7857142857142857

In [30]:
#Precision of test set
precision_score_ov = cross_val_score(logreg_ov, smox, smoy, scoring = 'precision', cv = 10)
np.mean(precision_score_ov)

0.75793480734232133

In [31]:
#AUC
auc_score_ov = cross_val_score(logreg_ov, smox, smoy, scoring = 'roc_auc', cv = 10)
np.mean(auc_score_ov)

0.83169387755102042

# Conclusion 

# Additional Comments

## Model Building Version 2 - Standardizing Numeric Variables

There are seven numeric variables in the dataframe, and I wanted to check whether standardizing them increases accuracy or precision.

# Data Cleaning 

Subset the numeric and categorical data like before. The StandardScaler() function requires the datatype as float.
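
Standardizing rescales each variable to mean 0 and standard deviation 1 by computing z = (x - mean) / std, where `StandardScaler` uses the population standard deviation. A minimal standard-library sketch with made-up ages (not taken from the dataset):

```python
from statistics import mean, pstdev

ages = [19, 27, 33, 42, 75]  # hypothetical values

mu = mean(ages)
sigma = pstdev(ages)  # population standard deviation (divides by n)

# z-score for each value
z = [(a - mu) / sigma for a in ages]

print([round(v, 3) for v in z])
```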

In [21]:
#Import module for standardizing variables from sklearn
from sklearn.preprocessing import StandardScaler
#Subset numeric data
num_credit = german_credit[['Duration', 'Credit amount', 
                            'Installment rate', 'Present resident since',
                            'Age', 'Number existing credits', 'Number of people liable']]
#Apply function to change datatype
num_credit_st = num_credit.astype('float')                            
#Standardization object and fit to data
stan = StandardScaler().fit(num_credit_st)
#Transform dataset
stan_data = stan.transform(num_credit_st)
#Convert array to dataframe
#List of numeric column names
col_names = ['Duration', 'Credit amount', 'Installment rate', 'Present resident since',
             'Age', 'Number existing credits', 'Number of people liable']
new_stan = pd.DataFrame(stan_data, columns = col_names)                            

#Subset categorical data
cat_credit = german_credit[['Status checking', 'Credit history', 'Purpose', 
                              'Savings account/bonds', 'Present employment', 
                              'Personal status/sex', 'Debtors/guarantors', 
                              'Property', 'Other installment plans', 'Housing', 
                              'Job', 'Telephone', 'Foreign worker']]
#Change categorical variables to dummy
dummy_var = pd.get_dummies(cat_credit)  
#Join dataframes together
german_new_credit = dummy_var.join(new_stan) 

## Model building with standardized variables

In [22]:
#Subset dataframe into independent and dependent variables
X_st = german_new_credit
Y_st = german_credit['Classification'] 

#Create logistic regression object
logreg_st = ln.LogisticRegression()
#Convert dataframe to NumPy arrays (as_matrix() was removed in pandas 1.0; .values is equivalent)
X_stm = X_st.values
Y_stm = Y_st.values

#Fit the logistic regression 
logreg_st.fit(X_stm, Y_stm)

## Evaluation of dataset with standardized variables

In [40]:
#Accuracy of test set
score_st = cross_val_score(logreg_st, X_stm, Y_stm, scoring = 'accuracy', cv = 10)
np.mean(score_st)

0.75

In [41]:
#Recall of test set
recall_score_st = cross_val_score(logreg_st, X_stm, Y_stm, scoring = 'recall', cv = 10)
np.mean(recall_score_st)

0.46666666666666662

In [42]:
#Precision of test set
precision_score_st = cross_val_score(logreg_st, X_stm, Y_stm, scoring = 'precision', cv = 10)
np.mean(precision_score_st)

0.62043418308344656

In [43]:
#AUC
auc_score_st = cross_val_score(logreg_st, X_stm, Y_stm, scoring = 'roc_auc', cv = 10)
np.mean(auc_score_st)

0.79028571428571437

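Standardizing the numeric variables did not meaningfully help: compared with Model 1, accuracy (0.748 to 0.750), recall (0.463 to 0.467), precision (0.617 to 0.620) and AUC (0.791 to 0.790) each changed by less than half a percentage point.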