# Machine Learning: Lab 3
## Classification
TA: Bryan Coulier (bryan.coulier@kuleuven.be)

### Simple Classification
Create and train the following classification models for the iris dataset:
- K-nearest neighbors (n_neighbors=5)
- Support vector machine
- Gaussian Naive Bayes
- Decision Tree Classifier

Determine the accuracy of each trained model.

In [14]:
from sklearn import datasets
%matplotlib inline

iris = datasets.load_iris()
X = iris.data
y = iris.target

In [15]:
# Split the data into training and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)  # KNN is a lazy learner: fit only stores the training data

y_pred_knn = knn.predict(X_test)

accuracy_knn = accuracy_score(y_test, y_pred_knn)
print("KNN Accuracy:", accuracy_knn)

KNN Accuracy: 1.0


In [None]:
# SVM
from sklearn.svm import SVC

# SVC separates the classes with maximum-margin hyperplanes.
# Its default RBF kernel applies the kernel trick: the kernel computes the inner product
# of two data points in a higher-dimensional feature space without mapping them explicitly.
svm = SVC()
svm.fit(X_train, y_train)

y_pred_svm = svm.predict(X_test)

accuracy_svm = accuracy_score(y_test, y_pred_svm)
print("SVM Accuracy:", accuracy_svm)

SVM Accuracy: 1.0


In [None]:
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
# Naive Bayes assumes the features are conditionally independent given the class;
# the Gaussian variant also assumes each feature is normally distributed within a class.
# Bayes' rule: P(y|x) = P(x|y)*P(y) / P(x), where y is the class and x is the feature vector.
# fit estimates the mean and variance of each feature for each class, and the class priors.
gnb.fit(X_train, y_train)

y_pred_gnb = gnb.predict(X_test)  # picks the class with the highest posterior probability

accuracy_gnb = accuracy_score(y_test, y_pred_gnb)
print("Gaussian Naive Bayes Accuracy:", accuracy_gnb)

Gaussian Naive Bayes Accuracy: 0.9777777777777777


In [None]:
# Decision Tree Classifier.

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)  # random_state makes the result reproducible
dt.fit(X_train, y_train)  # starts at the root node and recursively splits the data into subsets
# Decision nodes split the data; leaf nodes predict the class.
# The best split is chosen by Gini impurity (the default criterion)
# or, with criterion="entropy", by information gain.

y_pred_dt = dt.predict(X_test)

accuracy_dt = accuracy_score(y_test, y_pred_dt)
print("Decision Tree Accuracy:", accuracy_dt)

Decision Tree Accuracy: 1.0


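KNN, SVM and the decision tree reach 100% accuracy on the test set; Gaussian Naive Bayes reaches about 97.8% (44 of the 45 test samples). Such high scores are expected because the Iris dataset is small, relatively simple and well-separated.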
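The four cells above repeat the same fit/predict/score steps, so they can be condensed into one loop. The following is a minimal sketch using the same split and model settings as above; the `models` and `accuracies` dictionaries are names introduced here for illustration.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Same data and split as in the cells above
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Illustrative container: model name -> unfitted estimator
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Fit each model on the training set and score it on the test set
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name} accuracy: {accuracies[name]:.4f}")
```

Keeping the models in a dictionary makes it easy to add another classifier without duplicating the evaluation code.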

### Cross-validation
Determine the mean accuracy and standard deviation with a confidence interval of 98% for the KNN-model by using 5-fold cross-validation.

Explain why you chose a certain method to determine the confidence interval and how it is calculated.

What does this confidence interval tell you about the accuracy of the model?

In case you assume a normal distribution, justify this.


In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Create the KNN model
knn = KNeighborsClassifier(n_neighbors=5)

# Perform 5-fold cross-validation
cv_scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
# 5-fold cross-validation means that the data is split into 5 parts, and the model is trained on 4 parts and tested on 1 part
# this is done 5 times, and the accuracy is calculated each time

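# Calculate mean accuracy and standard deviation
# Note: np.std defaults to ddof=0 (population standard deviation);
# np.std(cv_scores, ddof=1) would give the slightly larger sample standard deviation
mean_accuracy = np.mean(cv_scores)
std_accuracy = np.std(cv_scores)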

print("Cross-Validation Scores:", cv_scores)
print("Mean Accuracy:", mean_accuracy)
print("Standard Deviation:", std_accuracy)

Cross-Validation Scores: [0.96666667 1.         0.93333333 0.96666667 1.        ]
Mean Accuracy: 0.9733333333333334
Standard Deviation: 0.02494438257849294


In [None]:
from scipy.stats import t

# Number of samples (folds)
n = len(cv_scores)

# Degrees of freedom
dof = n - 1

# Calculate the t-value for a 98% confidence interval
t_value = t.ppf(0.99, dof)  # 0.99 because the interval is two-sided: 1% in each tail leaves 98% in the middle
# ppf is the percent point function (inverse CDF): it returns the t-value below which the given probability lies

# Calculate the margin of error
margin_of_error = t_value * (std_accuracy / np.sqrt(n))
# the margin of error is used to calculate the confidence interval

# Calculate the confidence interval
confidence_interval = (mean_accuracy - margin_of_error, mean_accuracy + margin_of_error)

print("98% Confidence Interval:", confidence_interval)

98% Confidence Interval: (np.float64(0.9315343853193323), np.float64(1.0151322813473345))


**Cross validation**

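Cross-validation is a technique for estimating how well a machine learning model will perform on unseen data. Because every sample is used for testing exactly once, the estimate depends less on a single lucky or unlucky train/test split, which helps to detect overfitting.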

- Split your dataset into k equal parts (folds). Here, k=5, so 5 folds.
- Train the model on k-1 folds (4 folds) and test it on the remaining 1 fold.
- Repeat this k times, each time using a different fold as the test set.
- Calculate a performance metric (e.g., accuracy) for each fold.
- Average these metrics to estimate the model’s overall performance.
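The steps above can be written out as an explicit loop. This is a sketch, assuming stratified folds without shuffling, which is what `cross_val_score` uses for a classifier when `cv` is an integer; `manual_scores` and `library_scores` are names introduced here.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# Split the data into k=5 folds that preserve the class proportions
skf = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    # Train a fresh copy of the model on 4 folds ...
    fold_model = clone(knn)
    fold_model.fit(X[train_idx], y[train_idx])
    # ... and test it on the remaining fold
    y_pred = fold_model.predict(X[test_idx])
    manual_scores.append(accuracy_score(y[test_idx], y_pred))
manual_scores = np.array(manual_scores)

# The library helper performs the same procedure in one call
library_scores = cross_val_score(knn, X, y, cv=5, scoring="accuracy")

print("Manual scores: ", manual_scores)
print("Library scores:", library_scores)
print("Mean accuracy: ", manual_scores.mean())
```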

Method for determining the confidence interval: **t-distribution**
- The sample size is small (n=5 folds in cross-validation), so the t-distribution is more appropriate than the normal distribution.
- The population standard deviation is unknown (we only have the sample standard deviation).
- The t-distribution is specifically designed for small sample sizes and accounts for the additional uncertainty introduced by estimating the standard deviation from the sample.
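As a worked example of how the interval is calculated, the formula can be filled in with the rounded values printed by the cells above (mean 0.9733, standard deviation 0.0249, n = 5, and the t-value for 4 degrees of freedom at the 0.99 quantile, approximately 3.747):

$$
\bar{x} \pm t_{0.99,\,n-1}\,\frac{s}{\sqrt{n}}
= 0.9733 \pm 3.747 \cdot \frac{0.0249}{\sqrt{5}}
\approx 0.9733 \pm 0.0418
$$

This gives an interval of roughly (0.932, 1.015), matching the printed result.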

What does this confidence interval tell you about the accuracy of the model?
- The 98% confidence interval for the mean accuracy is approximately (0.932, 1.015).
- Loosely speaking, we are 98% confident that the true mean accuracy of the KNN model lies within this range. Because accuracy cannot exceed 1, the upper bound is effectively 1.
- The interval lies close to 1, indicating that the model performs very well on the Iris dataset.
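The interval can be cross-checked with `scipy.stats.t.interval`. The sketch below hard-codes the fold scores printed earlier and compares the population standard deviation (`ddof=0`, as used above) with the sample standard deviation (`ddof=1`); the helper `t_interval` and the clipping to [0, 1] are additions made here for illustration, not part of the original solution.

```python
import numpy as np
from scipy import stats

# Fold accuracies copied from the cross-validation output above
cv_scores = np.array([0.96666667, 1.0, 0.93333333, 0.96666667, 1.0])


def t_interval(scores, confidence, ddof):
    """Two-sided t-based confidence interval for the mean of the scores."""
    n = len(scores)
    # Standard error of the mean
    sem = np.std(scores, ddof=ddof) / np.sqrt(n)
    return stats.t.interval(confidence, n - 1, loc=np.mean(scores), scale=sem)


ci_pop = t_interval(cv_scores, 0.98, ddof=0)     # reproduces the interval above
ci_sample = t_interval(cv_scores, 0.98, ddof=1)  # slightly wider interval

# Accuracy is bounded by [0, 1], so the reported interval can be clipped
ci_clipped = tuple(np.clip(ci_sample, 0.0, 1.0))

print("ddof=0:", ci_pop)
print("ddof=1:", ci_sample)
print("ddof=1, clipped:", ci_clipped)
```

Using `ddof=1` is the conventional choice when the standard deviation is estimated from the sample itself; with only five folds the difference is noticeable.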

Assumption of Normality:
- The t-distribution is used because of the small sample size and unknown population standard deviation.
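- Normality can only be assumed approximately: five scores are too few to test it, accuracy is bounded above by 1 (which is why the upper bound of the interval exceeds 1), and the folds share training data, so the scores are not fully independent. The interval is therefore a rough indication rather than an exact statement.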
- If we wanted a more distribution-free approach, we could use bootstrapping to construct a confidence interval without assuming normality.
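A percentile bootstrap over the fold scores could look like the sketch below. It hard-codes the five scores printed earlier; the number of resamples and the seed are arbitrary choices made here. With only five values the bootstrap distribution is coarse, so this illustrates the technique rather than giving a more trustworthy interval.

```python
import numpy as np

# Fold accuracies copied from the cross-validation output above
cv_scores = np.array([0.96666667, 1.0, 0.93333333, 0.96666667, 1.0])

rng = np.random.default_rng(0)  # fixed seed for reproducibility
n_resamples = 10_000

# Draw n_resamples bootstrap samples (with replacement), each of size 5
resamples = rng.choice(cv_scores, size=(n_resamples, len(cv_scores)), replace=True)

# Mean accuracy of every bootstrap sample
boot_means = resamples.mean(axis=1)

# 98% percentile interval: cut off 1% in each tail
lower, upper = np.percentile(boot_means, [1, 99])

print("Bootstrap 98% interval:", (lower, upper))
```

Unlike the t-based interval, the percentile bootstrap can never produce a bound above 1, because every resampled mean is an average of observed accuracies.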

#### Hyper-parameter Tuning
Use Grid search to tune the hyperparameters for Gaussian Naive Bayes and retrain the model with these hyper-parameters. 

Use 5-fold Cross Validation for the grid search.

Print out the selected Hyperparameter values and their grid search score.

Determine the accuracy of the retrained model, and compare it to the original model.

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # random_state is used to reproduce the same results

# Define the hyperparameter grid
# var_smoothing is the portion of the largest feature variance that is added to all
# variances for numerical stability (it avoids near-zero variances)
param_grid = {
    'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]
}

# Create the Gaussian Naive Bayes model
gnb = GaussianNB()

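# Perform grid search with 5-fold cross-validation to find the best hyperparameters in param_grid
# Note: fitting on the full X, y lets the test rows influence the hyperparameter choice;
# fitting on X_train, y_train instead would keep the test set untouched
grid_search = GridSearchCV(gnb, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)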

# Print the best hyperparameters and their score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Grid Search Score:", grid_search.best_score_)

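# best_estimator_ was already refit by the grid search on all the data passed to it;
# retrain it on the training set only so that it can be evaluated on the test set
best_gnb = grid_search.best_estimator_
best_gnb.fit(X_train, y_train)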

# Predict on the test set
y_pred_best_gnb = best_gnb.predict(X_test)

# Calculate accuracy
accuracy_best_gnb = accuracy_score(y_test, y_pred_best_gnb)
print("Accuracy of Retrained Model:", accuracy_best_gnb)

# Train the original Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on the test set
y_pred_gnb = gnb.predict(X_test)

# Calculate accuracy
accuracy_gnb = accuracy_score(y_test, y_pred_gnb)
print("Accuracy of Original Model:", accuracy_gnb)

Best Hyperparameters: {'var_smoothing': 1e-09}
Best Grid Search Score: 0.9533333333333334
Accuracy of Retrained Model: 0.9777777777777777
Accuracy of Original Model: 0.9777777777777777


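The original and retrained models are identical: the grid search selected var_smoothing = 1e-09, which is also the default value, and no other value in the grid gave a higher cross-validation score on the Iris dataset. Both models therefore reach the same test accuracy (about 0.978).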
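To keep the test set completely out of the hyperparameter selection, the grid search can be fitted on the training split only. The sketch below is one way to do that, not the cell that produced the output above, so its scores may differ; it also lists the score of every candidate through `cv_results_`. The `results` and `test_accuracy` names are introduced here.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

param_grid = {"var_smoothing": [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]}

# Fit the grid search on the training data only (no leakage from the test set)
grid_search = GridSearchCV(GaussianNB(), param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

# One row per candidate, to see how much the scores actually differ
results = pd.DataFrame(grid_search.cv_results_)[
    ["param_var_smoothing", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results)

# With refit=True (the default) the best model is already retrained on X_train
test_accuracy = grid_search.score(X_test, y_test)
print("Best hyperparameters:", grid_search.best_params_)
print("Test accuracy:", test_accuracy)
```

Inspecting all candidates shows whether the selected value is clearly better or whether several values perform about equally well.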

## Example
#### Problem statement

Given the "obesity_dataset.csv" containing information about individuals, the task is to train a machine learning model to predict whether an individual is overweight or not and which type of overweight they have.
A complete description of the dataset can be found in the paper.

Requirements:
* Preprocess the data to make it suitable for a machine learning model.
* Create a pipeline for your classifier.
* Use grid search or random search with 5-fold cross-validation to find a good set of parameters for your classifier.
* Determine the mean accuracy of the best grid search model by using 5-fold cross-validation and a confidence interval of 95%.
You can assume the accuracy is normally distributed here without justification.

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Step 1: Preprocess the data to make it suitable for a machine learning model
# Load the dataset
df = pd.read_csv('obesity_dataset.csv')
print(df.head())

# Check for missing values
print(df.isnull().sum()) # No missing values

# If there were missing values:
# Fill missing numerical values with the median
# df.fillna(df.select_dtypes(include=['number']).median(), inplace=True)
# Fill missing categorical values with the mode (most frequent value)
# for column in df.select_dtypes(include=['object']).columns:
#     df[column] = df[column].fillna(df[column].mode()[0])

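# Encode the categorical variables as integers
# Note: LabelEncoder is designed for target labels; applied column by column it assigns
# integers in alphabetical order, which is acceptable for tree-based models
label_encoders = {}  # keep the fitted encoders so the values can be decoded later
for column in df.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])  # learn the categories and replace them with integers
    label_encoders[column] = le
# Check the result after encoding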
print(df.head())

# Split the data into features and target
X = df.drop('NObeyesdad', axis=1)  # NObeyesdad is the target
y = df['NObeyesdad']


   Gender   Age  Height  Weight family_history_with_overweight FAVC  FCVC  \
0  Female  21.0    1.62    64.0                            yes   no   2.0   
1  Female  21.0    1.52    56.0                            yes   no   3.0   
2    Male  23.0    1.80    77.0                            yes   no   2.0   
3    Male  27.0    1.80    87.0                             no   no   3.0   
4    Male  22.0    1.78    89.8                             no   no   2.0   

   NCP       CAEC SMOKE  CH2O  SCC  FAF  TUE        CALC  \
0  3.0  Sometimes    no   2.0   no  0.0  1.0          no   
1  3.0  Sometimes   yes   3.0  yes  3.0  0.0   Sometimes   
2  3.0  Sometimes    no   2.0   no  2.0  1.0  Frequently   
3  3.0  Sometimes    no   2.0   no  2.0  0.0  Frequently   
4  1.0  Sometimes    no   2.0   no  0.0  0.0   Sometimes   

                  MTRANS           NObeyesdad  
0  Public_Transportation        Normal_Weight  
1  Public_Transportation        Normal_Weight  
2  Public_Transportation        
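An alternative to integer-encoding every categorical column is one-hot encoding of the nominal features through a `ColumnTransformer`. The sketch below runs on a tiny made-up frame: the column names are taken from the output above, but the rows and label values are hypothetical and are not read from `obesity_dataset.csv`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Hypothetical toy rows; only the column names come from the dataset
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [21.0, 23.0, 27.0, 30.0],
    "Weight": [64.0, 77.0, 87.0, 95.0],
    "MTRANS": ["Public_Transportation", "Walking", "Automobile",
               "Public_Transportation"],
    "NObeyesdad": ["Normal_Weight", "Normal_Weight", "Overweight_Level_I",
                   "Obesity_Type_I"],
})

X_toy = toy.drop(columns="NObeyesdad")
numeric_columns = ["Age", "Weight"]
categorical_columns = ["Gender", "MTRANS"]

# Scale the numeric columns and one-hot encode the nominal ones
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_columns),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
X_encoded = preprocessor.fit_transform(X_toy)

# LabelEncoder is meant for the target column
target_encoder = LabelEncoder()
y_encoded = target_encoder.fit_transform(toy["NObeyesdad"])

print(X_encoded.shape)
print(y_encoded, target_encoder.classes_)
```

One-hot encoding avoids implying an order between categories such as transport modes, at the cost of more columns. The preprocessor can also be used as the first step of a `Pipeline`, so that it is fitted on the training folds only.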

In [None]:
# Step 2: Create a pipeline for your classifier.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Create a pipeline: it chains several steps and runs them in sequence
pipeline = Pipeline([
    # Standardize each feature to mean 0 and variance 1. Tree-based models do not need
    # scaling, but it helps if the classifier is swapped for a scale-sensitive one.
    ('scaler', StandardScaler()),
    # Random forest: an ensemble that aggregates the predictions of many decision trees
    ('classifier', RandomForestClassifier(random_state=42))
])

# Define the hyperparameter grid
param_grid = {
    'classifier__n_estimators': [50, 100, 200],  # number of trees in the forest
    'classifier__max_depth': [None, 10, 20, 30],  # maximum depth of each tree (None = unlimited)
    'classifier__min_samples_split': [2, 5, 10]  # a node is split only if it holds at least this many samples
}

In [42]:
# Step 3: Use grid search or random search with 5 fold cross validation to find a good 
# set of parameters for your classifier.
from sklearn.model_selection import GridSearchCV

# Perform grid search with 5-fold cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)

# Print the best hyperparameters and their score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Grid Search Score:", grid_search.best_score_)

# With refit=True (the default), best_estimator_ has already been retrained
# on all of X, y with the best hyperparameters
best_model = grid_search.best_estimator_

Best Hyperparameters: {'classifier__max_depth': None, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 200}
Best Grid Search Score: 0.9375640034508643
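The requirements also allow random search. A minimal sketch with `RandomizedSearchCV` follows; it samples a fixed number of parameter combinations instead of trying all of them. Synthetic data from `make_classification` stands in for the obesity dataset, and the parameter ranges and `n_iter=5` are arbitrary choices made here to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the obesity dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Distributions to sample from; lists are sampled uniformly
param_distributions = {
    "classifier__n_estimators": randint(20, 61),      # integers 20..60
    "classifier__max_depth": [None, 10, 20, 30],
    "classifier__min_samples_split": randint(2, 11),  # integers 2..10
}

# Try 5 random combinations, each evaluated with 5-fold cross-validation
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=5,
    cv=5,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print("Best Hyperparameters:", random_search.best_params_)
print("Best Random Search Score:", random_search.best_score_)
```

Random search scales better than grid search when the grid is large, because the number of fits is set by `n_iter` rather than by the product of all value lists.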


In [43]:
# Step 4: Determine the mean accuracy of the best grid search model by using 5-fold 
# cross validation and a confidence interval of 95%.

from sklearn.model_selection import cross_val_score
import numpy as np
from scipy.stats import norm


# Perform 5-fold cross-validation on the best model
cv_scores = cross_val_score(best_model, X, y, cv=5, scoring='accuracy')

# Calculate mean accuracy and standard deviation
mean_accuracy = np.mean(cv_scores)
std_accuracy = np.std(cv_scores)

print("Cross-Validation Scores:", cv_scores)
print("Mean Accuracy:", mean_accuracy)
print("Standard Deviation:", std_accuracy)


# Number of samples (folds)
n = len(cv_scores)

# Calculate the z-value for a 95% confidence interval
z_value = norm.ppf(0.975)  # 0.975 because it's two-tailed

# Calculate the margin of error
margin_of_error = z_value * (std_accuracy / np.sqrt(n))

# Calculate the confidence interval
confidence_interval = (mean_accuracy - margin_of_error, mean_accuracy + margin_of_error)

print("95% Confidence Interval:", confidence_interval)

Cross-Validation Scores: [0.73995272 0.99052133 0.98104265 0.98578199 0.99052133]
Mean Accuracy: 0.9375640034508643
Standard Deviation: 0.09886813788221695
95% Confidence Interval: (np.float64(0.8509038520522678), np.float64(1.0242241548494608))
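The upper bound above exceeds 1, which is impossible for an accuracy. The sketch below recomputes the normal-based interval from the printed fold scores, compares it with a t-based interval for the same five scores, and clips the result to [0, 1]. The t-based variant and the clipping are additions made here for comparison; the assignment itself allows the normal assumption.

```python
import numpy as np
from scipy.stats import norm, t

# Fold accuracies copied from the output above
cv_scores = np.array([0.73995272, 0.99052133, 0.98104265, 0.98578199, 0.99052133])

n = len(cv_scores)
mean_accuracy = cv_scores.mean()
# Standard error with ddof=0, as in the cell above
sem = cv_scores.std() / np.sqrt(n)

# Normal-based margin (as above) and t-based margin with n-1 degrees of freedom
z_margin = norm.ppf(0.975) * sem
t_margin = t.ppf(0.975, n - 1) * sem

z_ci = (mean_accuracy - z_margin, mean_accuracy + z_margin)
t_ci = (mean_accuracy - t_margin, mean_accuracy + t_margin)

# Accuracy is bounded by [0, 1], so clip the interval before reporting it
t_ci_clipped = (max(t_ci[0], 0.0), min(t_ci[1], 1.0))

print("Normal-based 95% CI:", z_ci)
print("t-based 95% CI:     ", t_ci)
print("t-based, clipped:   ", t_ci_clipped)
```

With only five folds the t-based interval is noticeably wider than the normal-based one, which reflects the extra uncertainty from estimating the standard deviation.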


Preprocessing:
- Checked for missing values (none were found) and encoded the categorical variables to make the data suitable for machine learning.

Pipeline:
- Created a pipeline to standardize features and train a Random Forest classifier.

Grid Search:
- Used Grid Search with 5-fold cross-validation to find the best hyperparameters.

Confidence Interval:
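- Calculated the mean accuracy (about 0.938) and a 95% confidence interval of roughly (0.851, 1.024) to assess the model's performance. The upper bound exceeds 1 because the normal approximation ignores that accuracy is bounded, and the first fold (0.740) scores well below the other four, which inflates the standard deviation.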