# Workshop Task - Training models and preprocessing


We have a binary classification problem where we have to predict whether a credit should be approved or not for a new client of a bank.


|Field Name|	Order|	Type (Format)|Description|
| -------| -------|-----------|---------|
|checking_status|	1|	string (default)|Status of existing checking account, in Deutsche Mark.|	
|duration	|2|	number (default)	|Duration in months|
|credit_history	|3|	string (default)	|Credit history (credits taken, paid back duly, delays, critical accounts)|
|purpose	|4|	string (default)	|Purpose of the credit (car, television,…)|
|credit_amount	|5|	number (default)	|Credit amount|
|savings_status	|6|	string (default)	|Status of savings account/bonds, in Deutsche Mark.|
|employment	|7|	string (default)	|Present employment, in number of years.|
|installment_commitment	|8|	number (default)|Installment rate in percentage of disposable income|	
|personal_status	|9|	string (default)|Personal status (married, single,…) and sex|
|other_parties	|10|	string (default)|Other debtors / guarantors|	
|residence_since	|11|	number (default)|Present residence since X years|	
|property_magnitude	|12|	string (default)|Property (e.g. real estate)|	
|age	|13|	number (default)	|Age in years|
|other_payment_plans	|14|	string (default)|Other installment plans (banks, stores)|
|housing	|15|	string (default)	|Housing (rent, own,…)|
|existing_credits	|16|	number (default)|Number of existing credits at this bank|	
|job	|17|	string (default)	|Job|
|num_dependents	|18|	number (default)|Number of people being liable to provide maintenance for|	
|own_telephone	|19|	string (default)|Telephone (yes,no)|	
|foreign_worker	|20|	string (default)|Foreign worker (yes,no)|	
accepted	|21|	string (default)	|Class|


Your task is to : 
  1. Use some EDA techniques we learned the last 2 weeks
  2. Detect missing values
  4. From the seaborn package use the functions displot and boxplot to plot the distributions of the numerical variables. This should give you insight into what scaling type you should use. The boxplots will give a good indication on the presence of outliers.

  5. Scale the data.

  6. For the categorical features try different encodings e.g. target, label... 
    
  7. Make train/test split : with train(70%), test(30%) with random_state = 0

  8. Try to build quickly a few models, a Decision Tree, a Random Forest, a polynomial SVM, a Radial Basis SVM, KNN. Try to achieve performance of 80% + on test set. 
  
  9. Evaluate the model
    
  10. For reproducibility please use random_state on train_test_split and model initialization
  
  11. Write a summary :
    - Which model gives the best result?
    - What can we improve in the future?
    - BONUs: Which encoding give better performance on this dataset?

Bonus:
- Try building a model with only a subset of features. Try any of the feature selection techniques to find the 5 most important features according to each of the methods we learned in the previous week. Write a short summary of the results.
    



In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from collections import Counter
from sklearn.metrics import classification_report

In [2]:
dataset = pd.read_csv('dataset-workshop.csv')


In [3]:
dataset = dataset.drop(['Unnamed: 0','checking_status','duration','purpose','savings_status',
'personal_status','property_magnitude',
'other_payment_plans','job','own_telephone'], axis=1)

In [4]:
dataset['credit_history'] = dataset['credit_history'].map({'Critical_acct_other_credits_existing':0,'Existing_credits_paid_till_now':1,'Delay_in_past': 2,'None': 3,'No_credits_taken_or_all_paid':4,'All_credits_paid_duly':5})

In [5]:
dataset['employment'] =dataset['employment'].map({'>7yrs':0,'1_to_4yrs':1,'4_to_7yrs':2,'unemployed':3,'<1yr':4})

In [6]:
dataset['other_parties'] = dataset['other_parties'].map({'None':0, 'guarantor':1,'co-applicant':2})

In [7]:
dataset['housing'] = dataset['housing'].map({'own': 0, 'for_free': 1, 'rent':2})

In [8]:
dataset['foreign_worker'] = dataset['foreign_worker'].map({'yes': 0, 'no': 1})

In [9]:
zero_not_accepted = ['age']
for column in zero_not_accepted:
    dataset[column] = dataset[column].replace(0,np.NaN)
    mean = int(dataset[column].mean(skipna=True))
    dataset[column] = dataset[column].replace(np.NaN,mean)

In [10]:
cols = ['credit_amount','age']
for col in cols:
   dataset[col] = dataset[col].apply(lambda x: int(x) if x == x else 0)

In [11]:
train_df, test_df = train_test_split(dataset,test_size=None)

In [12]:
print(train_df.shape)
print(test_df.shape)

(752, 12)
(251, 12)


In [13]:
print(train_df.describe())
print(test_df.describe())

       credit_history  credit_amount  employment  installment_commitment  \
count      752.000000     752.000000  752.000000              752.000000   
mean         1.174202    4493.726064    1.514628                2.994681   
std          1.269419    3651.895451    1.374990                1.125884   
min          0.000000     508.000000    0.000000                1.000000   
25%          0.000000    1930.750000    0.000000                2.000000   
50%          1.000000    3625.500000    1.000000                3.000000   
75%          1.000000    5871.250000    2.000000                4.000000   
max          5.000000   26200.000000    4.000000                4.000000   

       other_parties  residence_since         age     housing  \
count     752.000000       752.000000  752.000000  752.000000   
mean        0.121011         2.872340   35.507979    0.468085   
std         0.422389         1.088461   11.057889    0.778570   
min         0.000000         1.000000   19.000000    0.

In [14]:
X = train_df.drop(columns = ['accepted'])

In [15]:
y = train_df['accepted']

In [16]:
X_train = train_df.drop(columns = ['accepted']).values
X_test = test_df.drop(columns = ['accepted']).values
y_train = train_df['accepted'].values
y_test = test_df['accepted'].values


In [17]:
X_train, X_test,y_train, y_test = train_test_split(X_train,y_train, random_state=0, test_size=0.3, train_size=0.7)

mms = MinMaxScaler()
X_train = mms.fit_transform(X_train)
X_test = mms.fit_transform(X_test) 

In [18]:
print(Counter(y_train))
print(Counter(y_test))

Counter({0: 372, 1: 154})
Counter({0: 160, 1: 66})


In [19]:
assert X_train.shape[0] == y_train.shape[0]
assert X_test.shape[0] == y_test.shape[0]

assert X_train.shape[1] == X_test.shape[1]
assert type(y_train) == type(y_test)

In [20]:
print("X_train:", X_train.shape)
print("y_train:", y_train.shape)

print("X_test:", X_test.shape)
print("y_test:", y_test.shape)

X_train: (526, 11)
y_train: (526,)
X_test: (226, 11)
y_test: (226,)


In [21]:
print("Whole dataset : ", len(train_df)+len(test_df))
print("X Train size", len(X_train))
print("X Test size", len(X_test))
print("y train size", len(y_train))
print("y test size", len(y_test))

print(X_train.shape)
print(len(X_train))

Whole dataset :  1003
X Train size 526
X Test size 226
y train size 526
y test size 226
(526, 11)
526


In [22]:
X_test[0]
X_train[0]

array([0.        , 0.15678032, 0.25      , 1.        , 0.        ,
       0.33333333, 0.07407407, 0.        , 0.33333333, 0.        ,
       0.        ])

In [23]:
model = GaussianNB()

In [24]:
model

GaussianNB()

In [25]:
tm = model.fit(X_train, y_train)

In [26]:
tm

GaussianNB()

In [27]:
y_pred = tm.predict(X_test)

In [28]:
mpf = model.partial_fit(X_test,y_test, np.unique(y_test))
mpf

GaussianNB()

In [29]:
y_pred

array([1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0], dtype=int64)

In [30]:
y_test

array([0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1,
       0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 0], dtype=int64)

In [31]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.84      0.90       160
           1       0.71      0.95      0.81        66

    accuracy                           0.87       226
   macro avg       0.84      0.90      0.86       226
weighted avg       0.90      0.87      0.88       226



In [32]:
accuracy_score(y_train, model.predict(X_train))

0.8498098859315589

In [33]:
#svc = SVC(kernel='rbf', C=5)
svc = LinearSVC(C=100, max_iter=100000)

In [34]:
svc.fit(X_train, y_train)

LinearSVC(C=100, max_iter=100000)

In [35]:
y_pred = svc.predict(X_test)

In [36]:
y_pred

array([0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1,
       0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0], dtype=int64)

In [37]:
print(classification_report(y_test, y_pred,zero_division=0))

              precision    recall  f1-score   support

           0       0.97      0.84      0.90       160
           1       0.70      0.94      0.81        66

    accuracy                           0.87       226
   macro avg       0.84      0.89      0.85       226
weighted avg       0.89      0.87      0.87       226



In [38]:
svc.classes_

array([0, 1], dtype=int64)

In [39]:
svc.n_features_in_

11

In [None]:
## VASIL STAMENKOSKI