# Using Machine Learning to decide on credit allocation

This week we use a logistic regression to decide which customers in Germany should be granted credit. The dataset was donated in 1994; you can find more information about it and its variables here: https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)

## Importing the libraries and loading the dataset

Because we are not using neural networks or any form of "deep" learning this week, we no longer need to import Keras. Instead, we use Scikit-learn, which covers most traditional machine learning algorithms, together with imbalanced-learn (`imblearn`) for resampling. We also use Pandas again for data handling.

In [None]:
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RepeatedEditedNearestNeighbours
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import fbeta_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import RidgeClassifier
from numpy import mean
from numpy import std
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours

As usual, we upload our training data.

In [None]:
# Upload your dataset here by clicking "Choose Files" after running the code.
# It should be in .csv format and named "german.csv".

from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

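This step loads the dataset and splits it into the input variables and the target. As in previous cases, the algorithm can only read numbers, so all categorical variables need to be converted into numbers. Here, the target is label-encoded into the classes 0 and 1, and the categorical and numerical input columns are identified so that they can be encoded and scaled later in the pipeline.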

In [None]:
def load_dataset(full_path):
	# load the dataset as a pandas DataFrame (the file has no header row)
	dataframe = read_csv(full_path, header=None)
	last_ix = len(dataframe.columns) - 1
	X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
	# select categorical and numerical features
	cat_ix = X.select_dtypes(include=['object', 'bool']).columns
	num_ix = X.select_dtypes(include=['int64', 'float64']).columns
	# label encode the target variable to have the classes 0 and 1
	y = LabelEncoder().fit_transform(y)
	return X.values, y, cat_ix, num_ix

# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
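
To illustrate the label-encoding step, here is a minimal sketch on made-up target values. It assumes, following the UCI documentation, that the raw target column uses 1 for a good customer and 2 for a bad one; the names `raw_target`, `encoder` and `encoded_target` are only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed raw target values: 1 = good customer, 2 = bad customer
raw_target = [1, 2, 2, 1, 1]

encoder = LabelEncoder()
encoded_target = encoder.fit_transform(raw_target)

# classes_ lists the original labels in the order of their new codes (0, 1, ...)
print(list(encoder.classes_))
print(list(encoded_target))
```

Under this assumption, the sorted labels are mapped so that good customers become class 0 and bad customers become class 1.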

## Defining and training the model
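As usual, this is the step where we train the model, in this case a logistic regression. The pipeline one-hot encodes the categorical variables, scales the numerical ones to the range 0 to 1, undersamples the data with repeated edited nearest neighbours, and then fits the model. However, unlike with neural networks, we do not monitor validation performance during training; the model is evaluated separately in the next section.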

In [None]:
# define model to evaluate
model = LogisticRegression(solver='liblinear', class_weight='balanced')
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
# scale, then undersample, then fit model
pipeline = Pipeline(steps=[('t',ct), ('s', RepeatedEditedNearestNeighbours()), ('m',model)])
# fit the model
pipeline.fit(X, y)
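
To see what the `ColumnTransformer` does to the data before it reaches the model, here is a small sketch on hypothetical toy data (one categorical and one numerical column). The names `toy_X`, `toy_ct` and `toy_out` are made up for this example and are not part of the notebook above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data: column 0 is categorical, column 1 is numerical
toy_X = np.array([['A11', 6], ['A12', 48], ['A11', 27]], dtype=object)

toy_ct = ColumnTransformer([
    ('c', OneHotEncoder(), [0]),   # one 0/1 column per category
    ('n', MinMaxScaler(), [1]),    # rescale to the range 0..1
])
toy_out = toy_ct.fit_transform(toy_X)

# The result may be a sparse matrix, so convert to a dense array if needed
if hasattr(toy_out, 'toarray'):
    toy_out = toy_out.toarray()
toy_out = np.asarray(toy_out, dtype=float)
print(toy_out)
```

The categorical column becomes two indicator columns (one for 'A11', one for 'A12'), and the numerical column is rescaled so that its smallest value maps to 0 and its largest to 1.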

## Evaluating the model

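Because model training has no validation component, we have to assess the model's performance after training. First, we calculate an F2 score, which weights recall more heavily than precision and therefore penalizes false negatives more than false positives. Since the positive class (1) is a bad customer, a false negative here is a bad customer who is wrongly classified as good. The score is computed with three repeats of stratified 10-fold cross-validation, and we report its mean and standard deviation.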

In [None]:
# calculate f2-measure
def f2_measure(y_true, y_pred):
	return fbeta_score(y_true, y_pred, beta=2)

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define the model evaluation metric
	metric = make_scorer(f2_measure)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
	return scores
 
scores = evaluate_model(X, y, pipeline)
print('%.3f (%.3f)' % (mean(scores), std(scores)))
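
To make the F2 score more concrete, here is a small worked sketch on made-up labels (not taken from the dataset). It compares the F1 and F2 scores and recomputes F2 by hand from precision and recall; the variable names are only for illustration.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical labels: 1 = bad customer (positive class), 0 = good customer
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # 3 TP / (3 TP + 3 FP)
recall = recall_score(y_true, y_pred)        # 3 TP / (3 TP + 1 FN)
f1 = fbeta_score(y_true, y_pred, beta=1)
f2 = fbeta_score(y_true, y_pred, beta=2)

# F-beta by hand with beta = 2: (1 + 4) * P * R / (4 * P + R)
f2_by_hand = 5 * precision * recall / (4 * precision + recall)

print('precision=%.3f recall=%.3f' % (precision, recall))
print('F1=%.3f F2=%.3f (by hand: %.3f)' % (f1, f2, f2_by_hand))
```

In this example recall is higher than precision, so the F2 score comes out higher than the F1 score: missing a bad customer costs more than wrongly flagging a good one.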

Finally, we can create some "fictional" customers to see what the algorithm decides. A prediction of 0 means that the customer is classified as good and will be granted the credit, while a prediction of 1 means that the customer is classified as bad and no credit is given. If you are unsure what the numbers and letters mean, you can again check the dataset information here: https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)

In the `data` lists below, each row corresponds to one individual, who in turn gets one prediction. You can change these values. For instance, to change the job situation of the first customer to "unemployed/ unskilled - non-resident", change the corresponding value from 'A174' to 'A171'.

In [None]:
# evaluate on some good customer cases (known class 0)
print('Good Customers:')
data = [['A11', 6, 'A34', 'A43', 1169, 'A65', 'A75', 4, 'A93', 'A101', 4, 'A121', 67, 'A143', 'A152', 2, 'A174', 1, 'A192', 'A201'],
	['A14', 12, 'A34', 'A46', 2096, 'A61', 'A74', 2, 'A93', 'A101', 3, 'A121', 49, 'A143', 'A152', 1, 'A172', 2, 'A191', 'A201'],
	['A14', 24, 'A34', 'A49', 12096, 'A61', 'A71', 4, 'A92', 'A101', 3, 'A124', 25, 'A143', 'A151', 3, 'A171', 0, 'A191', 'A202']]
for row in data:
	# make prediction
	yhat = pipeline.predict([row])
	# get the label
	label = yhat[0]
	# summarize
	print('>Predicted=%d (expected 0)' % (label))
# evaluate on some bad customers (known class 1)
print('Bad Customers:')
data = [['A13', 18, 'A32', 'A43', 2100, 'A61', 'A73', 4, 'A93', 'A102', 2, 'A121', 37, 'A142', 'A152', 1, 'A173', 1, 'A191', 'A201'],
	['A11', 24, 'A33', 'A40', 4870, 'A61', 'A73', 3, 'A93', 'A101', 4, 'A124', 53, 'A143', 'A153', 2, 'A173', 2, 'A191', 'A201'],
	['A11', 24, 'A32', 'A43', 1282, 'A62', 'A73', 4, 'A92', 'A101', 2, 'A123', 32, 'A143', 'A152', 1, 'A172', 1, 'A191', 'A201']]
for row in data:
	# make prediction
	yhat = pipeline.predict([row])
	# get the label
	label = yhat[0]
	# summarize
	print('>Predicted=%d (expected 1)' % (label))