Identification of risky credits using SVM
===

Financial institutions want to improve their credit approval procedures to reduce the risk of non-payment, which causes them losses. The practical problem is deciding whether to approve a particular loan based on information that can be easily collected by phone or on the web. The sample contains 1000 observations; each record has 20 attributes describing the applicant's credit and financial situation. Build a recommender system that uses support vector machines.

The data file is available at the following link:

https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/german.csv



The attributes and their values are as follows:

     Attribute 1:  (qualitative)
     	      Status of existing checking account
     	      A11 :      ... <    0 DM
     	      A12 : 0 <= ... <  200 DM
     	      A13 :      ... >= 200 DM /
     	            salary assignments for at least 1 year
     	      A14 : no checking account

     Attribute 2:  (numerical)
     	      Duration in months

     Attribute 3:  (qualitative)
     	      Credit history
     	      A30 : no credits taken/
     	            all credits paid back duly
     	      A31 : all credits at this bank paid back duly
     	      A32 : existing credits paid back duly till now
     	      A33 : delay in paying off in the past
     	      A34 : critical account/
     	            other credits existing (not at this bank)

     Attribute 4:  (qualitative)
     	      Purpose
     	      A40 : car (new)
     	      A41 : car (used)
     	      A42 : furniture/equipment
     	      A43 : radio/television
     	      A44 : domestic appliances
     	      A45 : repairs
     	      A46 : education
     	      A47 : (vacation - does not exist?)
     	      A48 : retraining
     	      A49 : business
     	      A410 : others

     Attribute 5:  (numerical)
     	      Credit amount

     Attribute 6:  (qualitative)
     	      Savings account/bonds
     	      A61 :          ... <  100 DM
     	      A62 :   100 <= ... <  500 DM
     	      A63 :   500 <= ... < 1000 DM
     	      A64 :          .. >= 1000 DM
     	      A65 :   unknown/ no savings account

     Attribute 7:  (qualitative)
     	      Present employment since
     	      A71 : unemployed
     	      A72 :       ... < 1 year
     	      A73 : 1  <= ... < 4 years  
     	      A74 : 4  <= ... < 7 years
     	      A75 :       .. >= 7 years

     Attribute 8:  (numerical)
     	      Installment rate in percentage of disposable income

     Attribute 9:  (qualitative)
     	      Personal status and sex
     	      A91 : male   : divorced/separated
     	      A92 : female : divorced/separated/married
     	      A93 : male   : single
     	      A94 : male   : married/widowed
     	      A95 : female : single

     Attribute 10: (qualitative)
     	      Other debtors / guarantors
     	      A101 : none
     	      A102 : co-applicant
     	      A103 : guarantor

     Attribute 11: (numerical)
     	      Present residence since

     Attribute 12: (qualitative)
     	      Property
     	      A121 : real estate
     	      A122 : if not A121 : building society savings agreement/
     				   life insurance
     	      A123 : if not A121/A122 : car or other, not in attribute 6
     	      A124 : unknown / no property

     Attribute 13: (numerical)
     	      Age in years

     Attribute 14: (qualitative)
     	      Other installment plans 
     	      A141 : bank
     	      A142 : stores
     	      A143 : none

     Attribute 15: (qualitative)
     	      Housing
     	      A151 : rent
     	      A152 : own
     	      A153 : for free

     Attribute 16: (numerical)
              Number of existing credits at this bank

     Attribute 17: (qualitative)
     	      Job
     	      A171 : unemployed/ unskilled  - non-resident
     	      A172 : unskilled - resident
     	      A173 : skilled employee / official
     	      A174 : management/ self-employed/
     		         highly qualified employee/ officer

     Attribute 18: (numerical)
     	      Number of people being liable to provide maintenance for

     Attribute 19: (qualitative)
     	      Telephone
     	      A191 : none
     	      A192 : yes, registered under the customer's name

     Attribute 20: (qualitative)
     	      foreign worker
     	      A201 : yes
     	      A202 : no


In [2]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

df = pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/german.csv"
)

df.head()

Unnamed: 0,checking_balance,months_loan_duration,credit_history,purpose,amount,savings_balance,employment_length,installment_rate,personal_status,other_debtors,residence_history,property,age,installment_plan,housing,existing_credits,default,dependents,telephone,foreign_worker,job
0,< 0 DM,6,critical,radio/tv,1169,unknown,> 7 yrs,4,single male,none,4,real estate,67,none,own,2,1,1,yes,yes,skilled employee
1,1 - 200 DM,48,repaid,radio/tv,5951,< 100 DM,1 - 4 yrs,2,female,none,2,real estate,22,none,own,1,2,1,none,yes,skilled employee
2,unknown,12,critical,education,2096,< 100 DM,4 - 7 yrs,2,single male,none,3,real estate,49,none,own,1,1,2,none,yes,unskilled resident
3,< 0 DM,42,repaid,furniture,7882,< 100 DM,4 - 7 yrs,2,single male,guarantor,4,building society savings,45,none,for free,1,1,2,none,yes,skilled employee
4,< 0 DM,24,delayed,car (new),4870,< 100 DM,1 - 4 yrs,3,single male,none,4,unknown/none,53,none,for free,2,2,2,none,yes,skilled employee
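
Several of the columns shown above hold text, which an SVM cannot consume directly. As a minimal sketch of what `LabelEncoder` does to one such column, the snippet below encodes five `checking_balance` values copied from the preview; the `sample` variable exists only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Values copied from the first five rows of checking_balance
sample = ["< 0 DM", "1 - 200 DM", "unknown", "< 0 DM", "< 0 DM"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sample)

# classes_ holds the distinct strings in sorted order;
# each code is the index of the value within classes_
print(list(encoder.classes_))
print(codes.tolist())
```

The integer codes follow the sorted order of the strings, not any financial ordering of the categories: here `1 - 200 DM` becomes 0 and `< 0 DM` becomes 1.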


In [3]:
#
# Use the LabelEncoder transformer to preprocess
# the alphanumeric columns of the dataframe.
#
# Use the first 900 rows to train the model
# and the remaining 100 for validation.
#
# Build the SVM using the default parameter
# values.
#
# Compute the confusion matrix for the
# validation sample.
#
# answer/
# True
# True
# True
# True
#

df = pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/german.csv"
)

# >>> INSERT YOUR CODE HERE >>>
# Create the label encoder
encoder = LabelEncoder()

# Select the categorical (text) columns of df
cat_cols = df.select_dtypes(include='object').columns.tolist()
# Convert each categorical column to integer codes;
# fit_transform refits the encoder for every column
for col in cat_cols:
    df[col] = encoder.fit_transform(df[col])

# Split into train and validation: the first 900 rows go to
# training and the remaining 100 to validation
train = df[:900]
val = df[900:]

X_train, y_train = train.drop(columns='default'), train['default']
X_val, y_val = val.drop(columns='default'), val['default']

# Train SVM classifier with default parameters
clf = SVC()
clf.fit(X_train, y_train)


# ---->>> Evaluation ---->>>
# cm is the confusion matrix
y_pred = clf.predict(X_val)
cm = confusion_matrix(y_val, y_pred)
print(cm[0][0] == 67)
print(cm[0][1] == 1)
print(cm[1][0] == 30)
print(cm[1][1] == 2)

True
True
True
True
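
The four checks above pin down the confusion matrix of the default model. As a small sketch of how to read it, the snippet below rebuilds that matrix by hand and derives the accuracy and the per-class recall. It assumes, as is usual for this dataset, that `default == 2` marks a loan that was not repaid; that interpretation is not stated in the data file itself.

```python
import numpy as np

# Confusion matrix reported above: rows = actual class,
# columns = predicted class, in sorted label order (1, 2)
cm = np.array([[67, 1],
               [30, 2]])

# Correct predictions lie on the diagonal
accuracy = np.trace(cm) / cm.sum()
# Recall of each class = correct predictions / actual members of the class
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(f"accuracy: {accuracy:.2f}")
print(f"recall for label 1: {recall_per_class[0]:.3f}")
print(f"recall for label 2: {recall_per_class[1]:.3f}")
```

Under that assumption, the default SVC gets 69 of 100 validation rows right but flags only 2 of the 32 applicants labelled 2, so accuracy alone overstates how useful it is for spotting risky credits.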


In [4]:
#
# Find the best combination of kernel and
# regularization parameter among the supplied
# values during training, and compute the
# confusion matrix for the validation sample.
#
# answer/
# True
# True
# True
# True
#

kernels = ['rbf', 'linear', 'poly', 'sigmoid']
Cs = [1, 2, 3, 4, 5]

# >>> INSERT YOUR CODE HERE >>>
from sklearn.model_selection import GridSearchCV

# Evaluate every (C, kernel) pair with 5-fold cross-validation on the training set
param_grid = {'C': Cs, 'kernel': kernels}
gridSearchCV = GridSearchCV(
    estimator=SVC(),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    refit=True,
    return_train_score=False,
)
gridSearchCV.fit(X_train, y_train)

print(gridSearchCV.best_estimator_)

# ---->>> Evaluation ---->>>
# cm is the confusion matrix
clf = SVC(C=gridSearchCV.best_params_['C'], kernel=gridSearchCV.best_params_['kernel']).fit(X_train, y_train)
y_pred = clf.predict(X_val)
cm = confusion_matrix(y_val, y_pred)
print(cm[0][0] == 68)
print(cm[0][1] == 0)
print(cm[1][0] == 30)
print(cm[1][1] == 2)
cm

SVC(C=2, kernel='linear')
False
False
False
False


array([[62,  6],
       [25,  7]])
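
`GridSearchCV` scores each combination by cross-validation on the training data only, and with `refit=True` it then retrains the winning combination on all of that data. The sketch below illustrates this on synthetic data from `make_classification` (an assumption made so the example runs without the credit file; the variable names are illustrative), checking that `best_estimator_` predicts the same labels as an `SVC` built by hand from `best_params_`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the credit data (assumption, not the real dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_fit, y_fit = X[:150], y[:150]
X_holdout = X[150:]

grid = GridSearchCV(
    estimator=SVC(),
    param_grid={"C": [1, 2, 3], "kernel": ["rbf", "linear"]},
    cv=5,
    scoring="accuracy",
    refit=True,
)
grid.fit(X_fit, y_fit)

# Rebuild the winning model by hand and compare it with the refitted estimator
manual = SVC(**grid.best_params_).fit(X_fit, y_fit)
same = (manual.predict(X_holdout) == grid.best_estimator_.predict(X_holdout)).all()
print(grid.best_params_, same)
```

Because the selection relies on cross-validated accuracy over the training rows, the chosen combination is not guaranteed to be the one that scores best on a separate validation sample.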