# Review Lab: Classification

How do you know which algorithm to use when? Use the following guide:

**KNearestNeighbours**
- Suitable for classification and regression
- For small/medium-sized datasets
- Very simple; training is fast because the model essentially just stores the training data
- Does not work well for datasets with a very large number of features

**Support Vector Machines**
- Mainly a standard choice for binary classification, or when you want a clear margin between classes (the main goal of the algorithm)
- For small/medium-sized datasets
- More expensive to train than KNN, but classifies faster afterwards
- Kernel choice and hyperparameter tuning are essential

**Decision Trees**
- When you want a 'white box', 'explainable' model
- Easily handles both numerical and categorical features
- Copes very well with missing data points
- The risk of overfitting is high. When in doubt, use a random forest

**Random forest**
- Same advantages as a decision tree, with less risk of overfitting
- Suitable for medium-sized/large datasets
- Suitable for data with (very) many features
- Naturally needs more training time than a single decision tree

**Adaboost**
- When the data is not well balanced (unequal number of items per class)
- Avoids underfitting: the ensemble performs better than a single underlying decision tree

**XGBoost**
- XGBoost is almost *always* a good choice
- Medium-sized / large datasets
- Datasets with (very) many features
- Although it is a boosting algorithm, the XGBoost implementation still allows partial parallelization.

In practice there are often several good choices. It is important to know why a given model would or would not be a good choice, then try the most promising models and see which works best.
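
The "try the most promising models and compare" step can be sketched as follows. This is a minimal illustration, not part of the lab solution: the data comes from `make_classification` as a stand-in, and the candidate list, the `candidates` and `scores` names, and the 5-fold setting are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; replace with your own features and labels)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Distance- and margin-based models get a scaler; tree-based models do not need one
candidates = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "svm": make_pipeline(StandardScaler(), SVC(random_state=42)),
    "tree": DecisionTreeClassifier(random_state=42),
    "forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Mean 5-fold cross-validated accuracy per candidate
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

# Print the candidates from best to worst
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Cross-validation on the training data keeps the test set untouched until a final model has been chosen.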

### What is a small / medium-sized / large dataset?

That depends somewhat on how long you are willing to wait. In this course we use the following common rule of thumb:
- Small: <10k
- Medium-sized: 10k - 100k
- Large: >100k

This will no doubt be completely outdated in a few years. Keep in mind that in real life you always have to weigh this against the available compute power: on a Raspberry Pi 3 a medium-sized dataset already means a hefty training time, while the same job is child's play on a cloud compute server.
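
As a small illustration, the thresholds can be written as a helper. The function name `dataset_size_category` is hypothetical, and reading the thresholds as row counts is an assumption (the rule above does not name a unit); the example counts are made up.

```python
def dataset_size_category(n_rows: int) -> str:
    """Map a row count to the course's rule-of-thumb size category."""
    if n_rows < 10_000:
        return "small"
    if n_rows <= 100_000:
        return "medium"
    return "large"


# Made-up row counts, purely for illustration
for n_rows in (500, 50_000, 500_000):
    print(n_rows, dataset_size_category(n_rows))
```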

## Exercises

Work through the following 3 datasets; you can download them from Digitap. Load each one, inspect it, and judge which algorithms make the most sense. Apply the ML algorithms and report the final score of your model.
- Diabetes: the 'Outcome' column indicates whether a person has diabetes.
- Customer Churn: in commercial settings people often speak of 'churn', that is, tracking where customers drop out of a process. This dataset contains a 'Churn' column that encodes this.
- Creditcard: contains data that has already been passed through another algorithm, so the columns no longer have recognizable names. The last column indicates fraud (1) or a valid transaction (0).

#### <u>diabetes dataset</u>

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import GridSearchCV

In [2]:
diabetes_data = pd.read_csv("data/diabetes.csv")
diabetes_data

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


##### We start with an SVM algorithm

In [3]:
seed = 42
scaler = StandardScaler()
x = scaler.fit_transform(
    diabetes_data.drop('Outcome', axis=1)
)
y = diabetes_data['Outcome']
x_trn, x_tst, y_trn, y_tst = train_test_split(x,y, test_size=0.3, random_state = seed)
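
The cell above fits the scaler on the full dataset before splitting, so statistics from the test rows influence the scaling. One possible alternative, sketched here on synthetic stand-in data, is to split first and put the scaler in a pipeline so that it is fitted only on training folds. The data, the reduced parameter grid, and the names `pipe` and `search` are assumptions for the example, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; the notebook uses diabetes.csv)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Split first, so the test rows never touch the scaler
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.3, random_state=42)

# The scaler is refitted on the training part of every cross-validation fold
pipe = make_pipeline(StandardScaler(), SVC(random_state=42))

# Step parameters are addressed as <step name>__<parameter>
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.1, 0.01]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_trn, y_trn)

test_accuracy = search.score(X_tst, y_tst)
print(search.best_params_, test_accuracy)
```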

In [4]:
# Define an SVC
svc = SVC(random_state=seed)

param_grid = [{'kernel': ['rbf'],
               'C': [0.1, 1, 10, 100, 1000],
               'gamma': [1, 0.1, 0.01, 0.001, 0.0001]},
              {'kernel': ['linear'],
               'C': [0.1, 1, 10, 100, 1000]},
              {'kernel': ['poly'],
               'C': [0.1, 1, 10, 100, 1000],
               'degree': [2, 3, 4, 5],
               'coef0': [0.0, 1.0]}]

grid_search = GridSearchCV(svc, param_grid=param_grid)
grid_search.fit(x_trn, y_trn)

best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")
best_classifier = grid_search.best_estimator_

Best Parameters: {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}


In [5]:
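# GridSearchCV refits the best estimator on the full training set by default (refit=True),
# so no extra fit call is needed here.
svc_best = grid_search.best_estimator_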

In [6]:
y_pred = svc_best.predict(x_tst)

In [7]:
print(f"Accuracy: {accuracy_score(y_tst, y_pred)}")
print(f"Precision: {precision_score(y_tst, y_pred, average='macro')}")
print(f"Recall: {recall_score(y_tst, y_pred, average='macro')}")
print(f"F1-Score: {f1_score(y_tst, y_pred, average='macro')}")
print(f"Confusion Matrix: \n{confusion_matrix(y_tst, y_pred)}")

#pretty good accuracy

Accuracy: 0.7575757575757576
Precision: 0.7327044025157232
Recall: 0.7205298013245033
F1-Score: 0.7254668930390492
Confusion Matrix: 
[[127  24]
 [ 32  48]]
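
To see where the macro-averaged numbers come from, they can be recomputed by hand from the confusion matrix printed above. This is only a worked check; the variable names are chosen for the example.

```python
import numpy as np

# Confusion matrix reported above (rows: true class, columns: predicted class)
cm = np.array([[127, 24],
               [32, 48]])

true_positives = np.diag(cm)

# Precision divides by the column sums (everything predicted as that class),
# recall divides by the row sums (everything that truly is that class)
precision_per_class = true_positives / cm.sum(axis=0)
recall_per_class = true_positives / cm.sum(axis=1)
f1_per_class = 2 * precision_per_class * recall_per_class / (
    precision_per_class + recall_per_class
)

accuracy = true_positives.sum() / cm.sum()
macro_precision = precision_per_class.mean()
macro_recall = recall_per_class.mean()
macro_f1 = f1_per_class.mean()

print(accuracy, macro_precision, macro_recall, macro_f1)
```

The per-class recall shows that the positive class (diabetes) is recognised less often than the negative class, which the single accuracy figure hides.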


##### trying xgboost

In [14]:
from xgboost import XGBClassifier

xgb_clf = XGBClassifier(random_state=seed)

param_grid = {
    'n_estimators': [100],
    'max_depth': [2, 3, 4],
    'learning_rate': [0.02, 0.05, 0.1],
    'subsample': [0.8, 1]
}

grid_search = GridSearchCV(xgb_clf, param_grid=param_grid)
grid_search.fit(x_trn, y_trn)

best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")

# best_estimator_ is already refit on the full training set (refit=True by default)
xgb_best = grid_search.best_estimator_

y_pred = xgb_best.predict(x_tst)

Best Parameters: {'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 100, 'subsample': 1}


In [15]:
print(f"Accuracy: {accuracy_score(y_tst, y_pred)}")
print(f"Precision: {precision_score(y_tst, y_pred, average='macro')}")
print(f"Recall: {recall_score(y_tst, y_pred, average='macro')}")
print(f"F1-Score: {f1_score(y_tst, y_pred, average='macro')}")
print(f"Confusion Matrix: \n{confusion_matrix(y_tst, y_pred)}")

# same accuracy as the SVC; small improvement in macro recall and F1

Accuracy: 0.7575757575757576
Precision: 0.7322847682119206
Recall: 0.7322847682119206
F1-Score: 0.7322847682119205
Confusion Matrix: 
[[123  28]
 [ 28  52]]


#### <u>creditcard dataset</u>

In [28]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import BaggingClassifier
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier 
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

In [29]:
creditcard_data = pd.read_csv("data/creditcard.csv")
creditcard_data

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


In [30]:
seed = 42
scaler = StandardScaler()
x = scaler.fit_transform(
    creditcard_data.drop('Class', axis=1)
)

y = creditcard_data['Class']
x_trn, x_tst, y_trn, y_tst = train_test_split(x,y, test_size=0.3, random_state = seed)

In [46]:
xgb = XGBClassifier(max_depth=5, learning_rate=0.01, n_estimators=10, random_state=42)
bagging_clf = BaggingClassifier(xgb, random_state=42)
pipeline = Pipeline([("classifier", bagging_clf)])

In [47]:
pipeline.fit(x_trn, y_trn)

In [48]:
y_pred = pipeline.predict(x_tst)

In [49]:
print(f"Accuracy: {accuracy_score(y_tst, y_pred)}")
print(f"Precision: {precision_score(y_tst, y_pred, average='macro')}")
print(f"Recall: {recall_score(y_tst, y_pred, average='macro')}")
print(f"F1-Score: {f1_score(y_tst, y_pred, average='macro')}")
print(f"Confusion Matrix: \n{confusion_matrix(y_tst, y_pred)}")

Accuracy: 0.9995435553526912
Precision: 0.9606937729803765
Recall: 0.8896531316994192
F1-Score: 0.9221964779554972
Confusion Matrix: 
[[85298     9]
 [   30   106]]
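
Because fraud is so rare in this dataset, the accuracy figure says little on its own. The sketch below uses only the confusion matrix printed above to compare the model against a baseline that always predicts "no fraud"; the variable names are chosen for the example.

```python
# Confusion matrix reported above (rows: true class, columns: predicted class)
tn, fp = 85298, 9
fn, tp = 30, 106

total = tn + fp + fn + tp

# Accuracy of a trivial model that labels every transaction as valid
baseline_accuracy = (tn + fp) / total

model_accuracy = (tn + tp) / total

# Metrics for the fraud class only
fraud_recall = tp / (tp + fn)
fraud_precision = tp / (tp + fp)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Model accuracy:    {model_accuracy:.4f}")
print(f"Fraud recall:      {fraud_recall:.4f}")
print(f"Fraud precision:   {fraud_precision:.4f}")
```

The always-valid baseline already scores above 99.8% accuracy, so the fraud-class recall and precision are the more informative numbers here.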


#### <u>Telco-customer-churn</u>

In [98]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

##### preprocessing

In [99]:
telco_data = pd.read_csv("data/Telco-Customer-Churn.csv")
telco_data.drop(columns='customerID',inplace=True)
telco_data['TotalCharges'] = telco_data['TotalCharges'].replace(' ', np.nan)
telco_data['TotalCharges'] = telco_data['TotalCharges'].astype(float)

In [100]:
telco_data

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.50,No
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,No,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.50,No
7039,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.90,No
7040,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.60,Yes


In [101]:
numeric_features = telco_data.select_dtypes(include=['float64', 'int64']).columns.tolist()
categorical_features = telco_data.select_dtypes(include=['object']).columns.tolist()

scaler = StandardScaler()
telco_data_scaled = scaler.fit_transform(telco_data[numeric_features])
telco_data_scaled_df = pd.DataFrame(telco_data_scaled, columns=numeric_features)
telco_data_combined = pd.concat([telco_data.drop(columns=numeric_features), telco_data_scaled_df], axis=1)

encoder = OneHotEncoder(handle_unknown='ignore', drop='if_binary')
telco_data_onehot = encoder.fit_transform(telco_data_combined[categorical_features])
telco_data_onehot_df = pd.DataFrame(telco_data_onehot.toarray(), columns=encoder.get_feature_names_out(categorical_features))

telco_data_transformed = pd.concat([telco_data_combined.drop(columns=categorical_features), telco_data_onehot_df], axis=1)

In [102]:
telco_data_transformed

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,gender_Male,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No,MultipleLines_No phone service,...,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,Churn_Yes
0,-0.439916,-1.277445,-1.160323,-0.994194,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
1,-0.439916,0.066327,-0.259629,-0.173740,1.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,-0.439916,-1.236724,-0.362660,-0.959649,1.0,0.0,0.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0
3,-0.439916,0.514251,-0.746535,-0.195248,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,-0.439916,-1.236724,0.197365,-0.940457,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,-0.439916,-0.340876,0.665992,-0.129180,1.0,1.0,1.0,1.0,0.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
7039,-0.439916,1.613701,1.277533,2.241056,0.0,1.0,1.0,1.0,0.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
7040,-0.439916,-0.870241,-1.168632,-0.854514,0.0,1.0,1.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
7041,2.273159,-1.155283,0.320338,-0.872095,1.0,1.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0


In [103]:
x = telco_data_transformed.drop('Churn_Yes', axis=1)
y = telco_data_transformed['Churn_Yes']

In [104]:
seed = 42
x_trn, x_tst, y_trn, y_tst = train_test_split(x,y, test_size=0.25, random_state = seed)

##### xgboost

In [105]:
xgb = XGBClassifier(max_depth=5, learning_rate=0.01, n_estimators=10, random_state=seed)
xgb.fit(x_trn, y_trn)

In [106]:
y_pred = xgb.predict(x_tst)

In [118]:
print(f"Accuracy: {accuracy_score(y_tst, y_pred)}")
print(f"Precision: {precision_score(y_tst, y_pred, average='macro')}")
print(f"Recall: {recall_score(y_tst, y_pred, average='macro')}")
print(f"F1-Score: {f1_score(y_tst, y_pred, average='macro')}")
print(f"Confusion Matrix: \n{confusion_matrix(y_tst, y_pred)}")

Accuracy: 0.8018171493469619
Precision: 0.7600681117898265
Recall: 0.7036972501864585
F1-Score: 0.7219035423317353
Confusion Matrix: 
[[1178  104]
 [ 245  234]]


##### svm

In [119]:
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

param_grid = {
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf']
}

clf = make_pipeline(SimpleImputer(), SVC())

grid_search = GridSearchCV(clf, param_grid, cv=5)

grid_search.fit(x_trn, y_trn)

best_params = grid_search.best_params_
# best_estimator_ is already refit on the full training set (refit=True by default)
best_estimator = grid_search.best_estimator_

In [120]:
y_pred = best_estimator.predict(x_tst)

In [121]:
print(f"Accuracy: {accuracy_score(y_tst, y_pred)}")
print(f"Precision: {precision_score(y_tst, y_pred, average='macro')}")
print(f"Recall: {recall_score(y_tst, y_pred, average='macro')}")
print(f"F1-Score: {f1_score(y_tst, y_pred, average='macro')}")
print(f"Confusion Matrix: \n{confusion_matrix(y_tst, y_pred)}")

Accuracy: 0.8114707552526973
Precision: 0.7658882333661094
Recall: 0.7377881962877679
F1-Score: 0.7492907351311668
Confusion Matrix: 
[[1153  129]
 [ 203  276]]
