# German Credit Card Fraud

Data source: UCI Machine Learning Repository, found here: http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

Project purpose:
I am enrolled in the Data Science with Python module under the Thinkful bootcamp program. The project represents a simulation of fraud detection for adding to my project portfolio. An open dataset allows me to practice applying machine learning algorithms through the prediction modeling process. 

Cases to avoid:
I am not concerned with reducing processing time, such as helping the algorithm converge faster, because the code is not meant to be scaled into production.

Software: Python with the packages sklearn, pandas, numpy, etc.

Machine learning problem: supervised learning - classification based on 20 variables

Algorithm: logistic regression

## Import Data

The data comes from the UCI repository, which does not provide a link for downloading a .csv file. I copied the data from the repository page and pasted it into an .xlsx file. There, I split the columns on the space delimiter and typed in each column name. I kept the coded categorical values as they are instead of replacing them with the original descriptive names.

In [1]:
#Import pandas package
import pandas as pd
german_credit = pd.read_csv('German.csv')
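As a possible alternative to the manual copy-and-paste step, pandas can split space-delimited text directly. The sketch below is only an illustration: the two rows are made up in the dataset's coded format, and `cols` is a shortened, hypothetical list of column names (the real file has 21 columns).

```python
import io
import pandas as pd

# Two illustrative rows (not real records) with only 5 of the 21 columns
raw = "A11 6 A43 1169 1\nA12 48 A43 5951 2\n"

# Hypothetical, shortened column list for this sketch
cols = ['Status checking', 'Duration', 'Purpose', 'Credit amount', 'Classification']

# sep=' ' splits on single spaces; header=None because the raw text has no header row
sample = pd.read_csv(io.StringIO(raw), sep=' ', header=None, names=cols)
print(sample.dtypes)
```

With a real file, the `io.StringIO(raw)` argument would be replaced by the path to the downloaded data.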

## Data Exploration

In [2]:
#Dimension of dataset
german_credit.shape

(1000, 21)

1000 rows by 21 columns

In [3]:
#Column datatypes
german_credit.dtypes

Status checking            object
Duration                    int64
Credit history             object
Purpose                    object
Credit amount               int64
Savings account/bonds      object
Present employment         object
Installment rate            int64
Personal status/sex        object
Debtors/guarantors         object
Present resident since      int64
Property                   object
Age                         int64
Other installment plans    object
Housing                    object
Number existing credits     int64
Job                        object
Number of people liable     int64
Telephone                  object
Foreign worker             object
Classification              int64
dtype: object

13 categorical variables and 7 numeric variables (excluding the class label)

In [4]:
#Number of non-null values in each column
german_credit.notnull().sum()

Status checking            1000
Duration                   1000
Credit history             1000
Purpose                    1000
Credit amount              1000
Savings account/bonds      1000
Present employment         1000
Installment rate           1000
Personal status/sex        1000
Debtors/guarantors         1000
Present resident since     1000
Property                   1000
Age                        1000
Other installment plans    1000
Housing                    1000
Number existing credits    1000
Job                        1000
Number of people liable    1000
Telephone                  1000
Foreign worker             1000
Classification             1000
dtype: int64

No NaN values found in dataset
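`notnull().sum()` counts the non-missing entries, so a column is complete when its count equals the number of rows. A small sketch on a made-up frame (the name `toy` and its values are illustrative) shows the complementary `isnull().sum()`, which counts the missing entries directly:

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value, for illustration only
toy = pd.DataFrame({'Age': [25, np.nan, 40], 'Duration': [12, 24, 36]})

present = toy.notnull().sum()   # non-missing entries per column
missing = toy.isnull().sum()    # missing entries per column
print(missing)
```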

Summary statistics of the numeric variables, including the classification label

In [None]:
#Summary statistics
german_credit.describe()

| | Duration | Credit amount | Installment rate | Present resident since | Age | Number existing credits | Number of people liable | Classification |
|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 20.903 | 3271.258 | 2.973 | 2.845 | 35.546 | 1.407 | 1.155 | 1.3 |
| std | 12.058814 | 2822.736876 | 1.118715 | 1.103718 | 11.375469 | 0.577654 | 0.362086 | 0.458487 |
| min | 4.0 | 250.0 | 1.0 | 1.0 | 19.0 | 1.0 | 1.0 | 1.0 |
| 25% | 12.0 | 1365.5 | 2.0 | 2.0 | 27.0 | 1.0 | 1.0 | 1.0 |
| 50% | 18.0 | 2319.5 | 3.0 | 3.0 | 33.0 | 1.0 | 1.0 | 1.0 |
| 75% | 24.0 | 3972.25 | 4.0 | 4.0 | 42.0 | 2.0 | 1.0 | 2.0 |
| max | 72.0 | 18424.0 | 4.0 | 4.0 | 75.0 | 4.0 | 2.0 | 2.0 |


Histograms of numeric variables

In [None]:
#Import matplotlib
import matplotlib.pyplot as plt
#Duration
plt.hist(german_credit['Duration'])
plt.xlabel('Duration')
plt.ylabel('Frequency')
plt.title('Histogram of Duration in Months')
plt.show()
#Credit amount
plt.hist(german_credit['Credit amount'])
plt.xlabel('Credit amount')
plt.ylabel('Frequency')
plt.title('Histogram of Credit Amount')
plt.show()
#Installment rate
plt.hist(german_credit['Installment rate'])
plt.xlabel('Installment rate')
plt.ylabel('Frequency')
plt.title('Histogram of Installment Rate in % of Disposable Income')
plt.show()
#Years of present residence - fix 
plt.hist(german_credit['Present resident since'])
plt.xlabel('Present resident since')
plt.ylabel('Frequency')
plt.title('Histogram of Present Residence Since')
plt.show()
#Age
plt.hist(german_credit['Age'])
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Histogram of Age')
plt.show()
#Number of existing credits - fix 
plt.hist(german_credit['Number existing credits'])
plt.xlabel('Number existing credits')
plt.ylabel('Frequency')
plt.title('Histogram Of Number Of Existing Credits At This Bank')
plt.show()
#Number of people liable
plt.hist(german_credit['Number of people liable'])
plt.xlabel('Number of people liable')
plt.ylabel('Frequency')
plt.title('Histogram of Number of people being liable to provide maintenance for')
plt.show()


Count number of factors in each categorical variable

In [None]:
#Status checking
german_credit['Status checking'].value_counts()
#Credit history
german_credit['Credit history'].value_counts()
#Purpose
german_credit['Purpose'].value_counts()
#Savings account/bonds
german_credit['Savings account/bonds'].value_counts()
#Present employment
german_credit['Present employment'].value_counts()
#Personal status/sex
german_credit['Personal status/sex'].value_counts()
#Debtors/guarantors
german_credit['Debtors/guarantors'].value_counts()
#Property
german_credit['Property'].value_counts()
#Other installment plans
german_credit['Other installment plans'].value_counts()
#Housing
german_credit['Housing'].value_counts()
#Job
german_credit['Job'].value_counts()
#Telephone
german_credit['Telephone'].value_counts()
#Foreign worker
german_credit['Foreign worker'].value_counts()
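In a notebook, only the last expression in a cell is displayed, so a cell with many `value_counts()` calls shows a single table. One possible way to print every table is to loop over the columns. The sketch uses a made-up two-column frame; `toy` and its values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Made-up categorical frame in the dataset's coded style
toy = pd.DataFrame({
    'Housing': ['A152', 'A151', 'A152', 'A153'],
    'Telephone': ['A191', 'A192', 'A191', 'A191'],
})

# Store and print the level counts of every column
counts = {}
for col in toy.columns:
    counts[col] = toy[col].value_counts()
    print(counts[col], end='\n\n')
```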

Check balance of class labels for good or bad creditor

In [None]:
german_credit['Classification'].value_counts()

The dataset is unbalanced: there are 700 observations of customers with good credit and 300 with bad credit. Class 1 (good) is the majority class because it has more observations, and class 2 (bad) is the minority class.
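The class split can also be read off as proportions with `value_counts(normalize=True)`. The sketch rebuilds a label column with the same 700/300 split instead of loading the file, so the series `labels` is a stand-in for the real column.

```python
import pandas as pd

# Rebuild the label column from the 700/300 counts (1 = good, 2 = bad)
labels = pd.Series([1] * 700 + [2] * 300, name='Classification')

# normalize=True returns shares instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions)
```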

## Data Cleaning

Change the classification label from (1, 2) to (0, 1) to align with sklearn's convention. Before I did this, the logit model produced higher precision and recall scores. Sklearn's precision counts a correct prediction of label 1 as a true positive, and since 1 (good creditor) is not the class I want to predict, the scores were inflated. With the original labels, the metrics measured how well the model predicted good creditors, who make up 70% of the data, instead of bad creditors.

In [None]:
german_credit['Classification'] = german_credit['Classification'].map(lambda x: x-1) 
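The effect of the label coding on the metric can be sketched with a few made-up predictions. `precision_score` treats label 1 as the positive class by default (`pos_label=1`), so the same predictions give a different score before and after the shift. The label lists are illustrative, not model output.

```python
from sklearn.metrics import precision_score

# Made-up labels: 1 = good creditor, 2 = bad creditor
y_true = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 2, 1, 1, 2]

# With the (1, 2) coding, the default pos_label=1 scores the good creditors
precision_good = precision_score(y_true, y_pred)

# After shifting to (0, 1), label 1 means bad creditor
y_true01 = [y - 1 for y in y_true]
y_pred01 = [y - 1 for y in y_pred]
precision_bad = precision_score(y_true01, y_pred01)

print(precision_good, precision_bad)
```

Here the score for the good creditors is 6/8 and the score for the bad creditors is 1/2, which mirrors the inflated result described above.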

Convert categorical variables to dummy

Logistic regression in sklearn does not automatically convert the categorical variables to binary.

In [None]:
dummy_var = pd.get_dummies(german_credit[['Status checking', 'Credit history', 'Purpose', 
                              'Savings account/bonds', 'Present employment', 
                              'Personal status/sex', 'Debtors/guarantors', 
                              'Property', 'Other installment plans', 'Housing', 
                              'Job', 'Telephone', 'Foreign worker']])

Identify number of new columns after converting to dummy

In [None]:
dummy_var.shape   

The 13 categorical variables expand into 54 binary variables, one for each factor level
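To illustrate how the column count grows, here is a minimal sketch on a made-up frame with one three-level and one two-level variable. `pd.get_dummies` creates one column per level, named `<variable>_<level>`.

```python
import pandas as pd

# Made-up frame: Housing has 3 levels, Telephone has 2
toy = pd.DataFrame({
    'Housing': ['A152', 'A151', 'A152', 'A153'],
    'Telephone': ['A191', 'A192', 'A191', 'A191'],
})

# One binary column per level, so 3 + 2 = 5 columns
toy_dummies = pd.get_dummies(toy[['Housing', 'Telephone']])
print(toy_dummies.columns.tolist())
print(toy_dummies.shape)
```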

Subset numeric variables by dropping categorical from dataframe

In [None]:
credit_new = german_credit.drop(['Status checking', 'Credit history', 'Purpose', 
                              'Savings account/bonds', 'Present employment', 
                              'Personal status/sex', 'Debtors/guarantors', 
                              'Property', 'Other installment plans', 'Housing', 
                              'Job', 'Telephone', 'Foreign worker'], axis = 1)

Get dimensions of numeric dataframe

In [None]:
credit_new.shape                              

Combine numeric and categorical dataframes using .join() 

In [None]:
german_new_credit = dummy_var.join(credit_new)  

Get dimensions of new dataframe

In [None]:
german_new_credit.shape        
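A possible shortcut, shown here on a made-up frame: when `pd.get_dummies` receives a whole dataframe, it converts the string columns and passes the numeric columns through. That yields the same set of columns as dummy-coding, dropping, and joining in separate steps, although the column order differs. The frame `toy` is illustrative.

```python
import pandas as pd

# Made-up mixed frame: one categorical column and two numeric columns
toy = pd.DataFrame({
    'Housing': ['A152', 'A151', 'A152', 'A153'],
    'Age': [67, 22, 49, 45],
    'Classification': [0, 1, 0, 0],
})

# Step-by-step route: dummy-code the categorical part, then join the numeric part
joined = pd.get_dummies(toy[['Housing']]).join(toy.drop(['Housing'], axis=1))

# Shortcut: let get_dummies handle the mixed frame in one call
direct = pd.get_dummies(toy)

print(joined.shape, direct.shape)
```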

## Model Building Version 1

Sklearn fits the logistic regression on a matrix or array. Separate the data into independent (X) and dependent (Y) variables, then convert each to an array.

In [None]:
#Subset dataframe into independent and dependent variables
X = german_new_credit.drop('Classification', axis = 1)
Y = german_credit['Classification']
#Convert dataframe to array (.as_matrix() was removed from pandas; .values works across versions)
X_mat = X.values
Y_mat = Y.values

Logistic regression

In [None]:
#Import logit model in sklearn
import sklearn.linear_model as ln
#Create logistic regression object
logreg = ln.LogisticRegression()
#Fit the logistic regression 
logreg.fit(X_mat, Y_mat)

## Model Evaluation Version 1

In [None]:
#Import k-fold cross validation in sklearn
import numpy as np
from sklearn.model_selection import cross_val_score
#Accuracy of test set
score = cross_val_score(logreg, X_mat, Y_mat, scoring = 'accuracy', cv = 10)
np.mean(score)
#Recall of test set
recall_score = cross_val_score(logreg, X_mat, Y_mat, scoring = 'recall', cv = 10)
np.mean(recall_score)
#Precision of test set
precision_score = cross_val_score(logreg, X_mat, Y_mat, scoring = 'precision', cv = 10)
np.mean(precision_score)
#AUC
auc_score = cross_val_score(logreg, X_mat, Y_mat, scoring = 'roc_auc', cv = 10)
np.mean(auc_score)
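The four metrics can also be collected in a single pass with `cross_validate`, which accepts a list of scorers. This is a sketch on synthetic data from `make_classification` (a stand-in with a roughly 70/30 class split, not the credit data); `max_iter=1000` is an arbitrary choice to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the credit data: 1000 rows, roughly 70/30 classes
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     weights=[0.7, 0.3], random_state=0)

# One call runs 10-fold cross-validation for every metric in the list
metrics = ['accuracy', 'recall', 'precision', 'roc_auc']
cv_results = cross_validate(LogisticRegression(max_iter=1000), X_demo, y_demo,
                            scoring=metrics, cv=10)

# Results are stored under 'test_<metric>', one value per fold
for name in metrics:
    print(name, cv_results['test_' + name].mean())
```

Printing inside the loop also avoids the notebook behaviour of displaying only the last expression in a cell.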

## Model Building Version 3 - Fixing Unbalanced Dataset

Continuing from Version 1, after combining the categorical and numeric data.

The UnbalancedDataset package for Python provides different algorithms for fixing unbalanced datasets. Here’s the link: https://github.com/fmfn/UnbalancedDataset. I applied the SMOTE algorithm to create 400 more observations of the minority class (bad creditors), giving a 50/50 distribution of 700 observations each. I set the ratio parameter so that the majority and minority classes end up balanced.

In [None]:
#Import module for applying oversampling using SMOTE
from unbalanced_dataset import SMOTE
#Set verbose as false to show less information
verbose = False
#Ratio of majority to minority class for 50/50 distribution
smote = SMOTE(ratio = 1.335, verbose = False, kind = 'regular')
#Fit data and transform
X_mod = X.values
Y_mod = np.array(Y)
#Create new dataset
smox, smoy = smote.fit_transform(X_mod, Y_mod) 
#Check ratio of good and bad creditors
#Convert matrix to dataframe
y_data = pd.DataFrame(smoy, columns = ['classification'])
#check work
y_data['classification'].value_counts()
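The arithmetic behind the oversampling can be sketched as follows. Treating `ratio` as the number of synthetic observations per existing minority observation is an assumption about how the package reads the parameter. The function `smote_sample` is a simplified, hypothetical illustration of the SMOTE idea (it interpolates toward the single nearest neighbour, whereas SMOTE picks among several neighbours), not the package's implementation. The three minority points are made up.

```python
import numpy as np

# Counts from the dataset: 700 good creditors, 300 bad creditors
n_good, n_bad = 700, 300
n_synthetic = n_good - n_bad      # observations needed to reach 700 each
ratio = n_synthetic / n_bad       # about 1.333, close to the 1.335 used above

rng = np.random.default_rng(0)
minority = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 1.0]])  # made-up points

def smote_sample(points, rng):
    """Create one synthetic point between a random point and its nearest neighbour."""
    i = rng.integers(len(points))
    base = points[i]
    dists = np.linalg.norm(points - base, axis=1)
    dists[i] = np.inf                      # exclude the point itself
    neighbour = points[np.argmin(dists)]
    gap = rng.random()                     # random position along the segment
    return base + gap * (neighbour - base)

synthetic = np.array([smote_sample(minority, rng) for _ in range(5)])
print(n_synthetic, round(ratio, 3))
```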

Logistic regression 

## Model Building and Evaluation Version 3

In [None]:
#Create logistic regression object
logreg_ov = ln.LogisticRegression()
#Model building
#Fit the logistic regression 
logreg_ov.fit(smox, smoy)

#Model testing
#Accuracy of test set
score_ov = cross_val_score(logreg_ov, smox, smoy, scoring = 'accuracy', cv = 10)
np.mean(score_ov)
#Recall of test set
recall_score_ov = cross_val_score(logreg_ov, smox, smoy, scoring = 'recall', cv = 10)
np.mean(recall_score_ov)
#Precision of test set
precision_score_ov = cross_val_score(logreg_ov, smox, smoy, scoring = 'precision', cv = 10)
np.mean(precision_score_ov)
#AUC
auc_score_ov = cross_val_score(logreg_ov, smox, smoy, scoring = 'roc_auc', cv = 10)
np.mean(auc_score_ov)

# Conclusion 

# Additional Comments

## Model Building Version 2 - Standardizing Numeric Variables

There are seven numeric variables in the dataframe, and I wanted to check whether standardizing them increases accuracy or precision.

Subset the numeric and categorical data as before. StandardScaler() expects float data, so the numeric columns are converted first.

In [None]:
#Import module for standardizing variables from sklearn
from sklearn.preprocessing import StandardScaler
#Subset numeric data
num_credit = german_credit[['Duration', 'Credit amount', 
                            'Installment rate', 'Present resident since',
                            'Age', 'Number existing credits', 'Number of people liable']]
#Apply function to change datatype
num_credit_st = num_credit.astype('float')                            
#Standardization object and fit to data
stan = StandardScaler().fit(num_credit_st)
#Transform dataset
stan_data = stan.transform(num_credit_st)
#Convert array to dataframe
#Column names of the numeric variables
col_names = ['Duration', 'Credit amount', 'Installment rate', 'Present resident since',
             'Age', 'Number existing credits', 'Number of people liable']
new_stan = pd.DataFrame(stan_data, columns = col_names)                            

#Subset categorical data
cat_credit = german_credit[['Status checking', 'Credit history', 'Purpose', 
                              'Savings account/bonds', 'Present employment', 
                              'Personal status/sex', 'Debtors/guarantors', 
                              'Property', 'Other installment plans', 'Housing', 
                              'Job', 'Telephone', 'Foreign worker']]
#Change categorical variables to dummy
dummy_var = pd.get_dummies(cat_credit)  
#Join dataframes together
german_new_credit = dummy_var.join(new_stan) 
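As a quick check of what the scaler does, the sketch below compares `StandardScaler` with the z-score computed by hand on a made-up frame. Each column is shifted to mean 0 and scaled to standard deviation 1, using the population standard deviation (`ddof=0`). The values in `toy` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up numeric frame, cast to float as in the cell above
toy = pd.DataFrame({'Duration': [6, 48, 12, 42], 'Age': [67, 22, 49, 45]}).astype('float')

# Scale and wrap the resulting array back into a dataframe with the same column names
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# The same z-score by hand; StandardScaler uses the population std (ddof=0)
by_hand = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.round(3))
```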

Model building and evaluation

In [None]:
#Subset dataframe into independent and dependent variables
X_st = german_new_credit
Y_st = german_credit['Classification'] 

#Create logistic regression object
logreg_st = ln.LogisticRegression()
#Convert dataframe to matrix 
X_stm = X_st.values
Y_stm = Y_st.values

#Fit the logistic regression 
logreg_st.fit(X_stm, Y_stm)

#Accuracy of test set
score_st = cross_val_score(logreg_st, X_stm, Y_stm, scoring = 'accuracy', cv = 10)
np.mean(score_st)
#Recall of test set
recall_score_st = cross_val_score(logreg_st, X_stm, Y_stm, scoring = 'recall', cv = 10)
np.mean(recall_score_st)
#Precision of test set
precision_score_st = cross_val_score(logreg_st, X_stm, Y_stm, scoring = 'precision', cv = 10)
np.mean(precision_score_st)
#AUC
auc_score_st = cross_val_score(logreg_st, X_stm, Y_stm, scoring = 'roc_auc', cv = 10)
np.mean(auc_score_st)

Standardizing the numeric variables did not change the evaluation scores by more than a thousandth of a percentage point.