# **Course: Fundamental Concepts of AI**

# **Homework 3: Applying ML Algorithms**



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Student name and surname:**
Almir Mustafic
**Student index:**
20114
**Date:** January 3, 2025


## Objective:
In this assignment, you will apply the machine learning models introduced in Labs 5 and 6 to the dataset you chose and analyzed in Homework 2 (Assignment 3). Your goal is to train and evaluate these models on your dataset and compare their performance.

---

## Tasks:

### 1. Data Preparation and Preprocessing
Prepare your dataset for machine learning by completing the following steps:

1. **Load the Dataset**  
   - Use the dataset you analyzed in Homework 2.

2. **Inspect the Data**  
   - Display the first few rows, column names, and data types.  
   - Check for missing values and handle them if necessary (if you haven't already).

3. **Split the Data**  
   - Split the dataset into training, validation and testing sets (e.g., 70% training, 20% validating, 10% testing).

4. **Standardize the Features**  
   - Normalize all numerical features so that they have a mean of 0 and a standard deviation of 1 (if you haven't already).  
   - Skip this step for categorical features or the target variable.

5. **Encode the Target Variable**  
   - Ensure that the target column is in a numeric format if necessary (e.g., for classification tasks).

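The preparation steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: scikit-learn's built-in iris data stands in for the Homework 2 dataset, and the variable names are hypothetical. The scaler is fitted on the training split only, so no statistics from the validation or test data leak into training.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Stand-in data (assumption): iris replaces the dataset from Homework 2.
iris = load_iris()
features = iris.data
labels = iris.target_names[iris.target]  # string labels such as "setosa"

# Encode the string target as integers 0..n_classes-1.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# 70% training, then split the remaining 30% into 20% validation and 10% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    features, y, test_size=0.30, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=1 / 3, random_state=42, stratify=y_rest)

# Fit the scaler on the training split only, then reuse it for the other splits.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_val_s.shape, X_test_s.shape)
```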
---

### 2. Applying ML Models
Train and evaluate the following machine learning models on your dataset. For each model:

1. **K-Nearest Neighbors (KNN)**  
   - Experiment with a few values of `k`.  
   - Report the accuracy on the test set for each experiment.
   - Note: Perform this task if your dataset involves a classification problem.

2. **Decision Tree (DT)**  
   - Train a Decision Tree model with default settings.  
   - Report the accuracy on the test set.
   - Note: Perform this task if your dataset involves a classification problem.


3. **Support Vector Classifier (SVC)**  
   - Train a Support Vector Classifier model with default settings.  
   - Report the accuracy on the test set.
   - Note: Perform this task if your dataset involves a classification problem.


4. **Logistic Regression (LR)**  
   - Train a Logistic Regression model.  
   - Report the accuracy on the test set.
   - Note: Only perform this task if your chosen dataset is a binary classification problem (i.e., it has exactly two classes).

5. **Neural Network (NN)**  
   - Build and train a simple Neural Network.  
   - Report the accuracy on the test set.
   - Note: Perform this task if your dataset involves a classification or regression problem.


6. **Linear Regression (LinR)**  
   - Train a Linear Regression model on your dataset.  
   - Report the Mean Squared Error (MSE) on the test set.
   - Note: Perform this task only if your dataset involves a regression problem (predicting continuous values).

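One possible way to run the "experiment with a few values of `k`" step is to pick `k` on the validation set and touch the test set only once at the end. The sketch below assumes the iris data as a stand-in dataset and an arbitrary list of candidate `k` values; neither comes from the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): iris instead of the Homework 2 dataset.
X, y = load_iris(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=1 / 3, random_state=42, stratify=y_rest)

# Score each candidate k on the validation split.
val_scores = {}
for k in (1, 3, 5, 11):  # hypothetical candidate values
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_train, y_train)
    val_scores[k] = accuracy_score(y_val, model.predict(X_val))

# Keep the k with the highest validation accuracy and evaluate it once on the test split.
best_k = max(val_scores, key=val_scores.get)
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))

print(f"validation scores: {val_scores}")
print(f"best k: {best_k}, test accuracy: {test_accuracy:.3f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the scaling statistics tied to the training data for every candidate.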

---

### 3. Model Comparison
1. Create a table summarizing the test set accuracy of all models (KNN, DT, SVC, LR, NN and LinR).  
2. Discuss the following:
   - Which model performed the best?  
   - Why do you think this model performed better than the others?  

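A comparison table like the one requested above could be built with pandas along these lines. This is a sketch, not the expected solution: it assumes the iris data as a stand-in, a single 90/10 split, and default model settings, and the scores it prints apply only to that setup.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris instead of the Homework 2 dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42, stratify=y)

models = {
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(random_state=42),
    "SVC": SVC(),
}

# Train every model and collect one table row per model.
rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append({"Model": name, "Test Accuracy": model.score(X_test, y_test)})

# Sort so that the best-performing model appears first.
comparison = (
    pd.DataFrame(rows)
    .sort_values("Test Accuracy", ascending=False)
    .reset_index(drop=True)
)
print(comparison.to_string(index=False))
```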
---

### 4. Save and Submit
1. **Save Your Work:**  
   - Save your trained models using an appropriate method (e.g., pickle for Scikit-learn models or `model.save` for Keras models).  
   - Save the final notebook with all outputs, tables, discussions included.

2. **Submit:**  
   - A Colab Notebook to c3 Homework 3 Assignment section.  
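
For the save step, a pickle round trip for a Scikit-learn model might look like the sketch below. The temporary directory and file name are hypothetical placeholders (on Colab this would typically be a folder on the mounted Drive), and iris again stands in for the real dataset. Keras models are not covered here; for those the task suggests `model.save`.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data and model (assumption): any fitted Scikit-learn estimator works the same way.
X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=42).fit(X, y)

# Hypothetical output location; replace with a real folder when saving for submission.
out_dir = Path(tempfile.mkdtemp())
model_path = out_dir / "decision_tree_model.pkl"

# The with-blocks make sure the file handles are closed after writing and reading.
with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(model_path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the predictions of the original one.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```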

---

## Grading Criteria:
- **Data Preparation (2 points)**  
- **Implementation of ML Models (2 points)**  
- **Comparison and Analysis (1 point)**  
- **Code Clarity and Documentation (1 point)**


In [3]:
import pandas as pd
import numpy as np
import tensorflow as tf
import pickle
from google.colab import drive
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, classification_report
# from sklearn.metrics import accuracy_score, classification_report


# Mount Google Drive
drive.mount('/content/drive')

# 1.1) Load the dataset
file_path = '/content/drive/MyDrive/Colab Notebooks/penguins.csv'
penguins_data = pd.read_csv(file_path)

# 1.2.1) Display the first few rows, column names, and data types.
print(penguins_data.head())
print("Column names:", penguins_data.columns)
print("Data types:", penguins_data.dtypes)

# 1.2.2) Check for missing values and handle them
# Replace the missing values in numerical columns with a median value
# Replace the missing values in categorical columns with the most common value
# Check if there are missing values
print("\nMissing values handling")
print(penguins_data.isnull().sum())
numerical_columns = penguins_data.select_dtypes(include=['float64', 'int64']).columns
penguins_data[numerical_columns] = penguins_data[numerical_columns].fillna(penguins_data[numerical_columns].median())
categorical_columns = penguins_data.select_dtypes(include=['object']).columns

for col in categorical_columns: # 'species', 'island', 'sex'
    most_common_value = penguins_data[col].mode()[0] # if more than 1 take the first
    penguins_data[col] = penguins_data[col].fillna(most_common_value) # replace empty with the most common one

# Check if there are any remaining missing values
# print(penguins_data.isnull().sum())

# 1.3) Split the dataset into training, validation and testing sets (e.g., 70% training, 20% validating, 10% testing)
# The number 42 is famously used as a "random" choice due to its cultural reference in The Hitchhiker's Guide to
# the Galaxy, where it is the answer to the "Ultimate Question of Life, the Universe, and Everything."
train_data, remaining_data = train_test_split(penguins_data, test_size=0.3, random_state=42)
validation_data, test_data = train_test_split(remaining_data, test_size=0.33, random_state=42)
# print("Training data:", train_data.shape)
# print("Validation data:", validation_data.shape)
# print("Test data:", test_data.shape)

# 1.4) Normalize all numerical features so that they have a mean of 0 and a standard deviation of 1
# 1.4.1) Skip this step for categorical features or the target variable.
# Normalized Value = (Value − Mean) / Standard Deviation
# Standard deviation = square root of the average of the squared differences between each value and the mean of the data
# NOTE: this scales the full penguins_data frame only. train_data, validation_data and test_data were split off above
# and are independent copies, so they keep their original (unscaled) values.
numerical_columns = penguins_data.select_dtypes(include=['float64', 'int64']).columns
scaler = StandardScaler()
penguins_data[numerical_columns] = scaler.fit_transform(penguins_data[numerical_columns])
# print(penguins_data.head())

# 1.5) Convert the target column to numeric (species)
# NOTE: only penguins_data is encoded here; the labels taken from the earlier split (train_data etc.) remain strings.
unique_species = penguins_data['species'].unique()
penguins_data['species'] = penguins_data['species'].astype('category').cat.codes # .cat.codes converts the values of a pandas category column into integer codes
unique_species_numerical = penguins_data['species'].unique()
# print(unique_species)
# print(unique_species_numerical)


# Prepare data for training and testing
# Take the labels i.e. categories and training, validation and test data without species column
X_train = train_data.drop(columns=['species'])
X_validation = validation_data.drop(columns=['species'])
X_test = test_data.drop(columns=['species'])
y_train = train_data['species']
y_validation = validation_data['species']
y_test = test_data['species']

# Convert categorical columns into numeric codes
categorical_columns = X_train.select_dtypes(include=['object']).columns
for col in categorical_columns:
    X_train[col] = X_train[col].astype('category').cat.codes
    X_validation[col] = X_validation[col].astype('category').cat.codes
    X_test[col] = X_test[col].astype('category').cat.codes


# 2.1) K-Nearest Neighbors (KNN)
# 2.1.1) Experiment with a few values of k
# 2.1.2) Report the accuracy on the test set for each experiment
# 2.1.3) Note: Perform this task if your dataset involves a classification problem
print("K-NEAREST NEIGHBORS (KNN)")
print("Testing k values on validation data set")
best_k = None
best_accuracy = 0
k_values = [3, 17, 65, 66, 99]

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_val_pred = knn.predict(X_validation)
    validation_accuracy = accuracy_score(y_validation, y_val_pred)
    print(f"k={k} - accuracy={validation_accuracy:.3f}")

    if validation_accuracy > best_accuracy:
        best_k = k
        best_accuracy = validation_accuracy

print(f"Best-performing k is {best_k}")

knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)
y_test_pred = knn.predict(X_test)
test_accuracy = knn_test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"The test data set accuracy after training and for best k is {test_accuracy:.3f}")
print(f"Test vs validation data performance difference: {test_accuracy - best_accuracy:.3f}")
print(f"\nKNN Neighbors Accuracy: {test_accuracy:.3f}")
print("==============================================")


# 2.2) Decision Tree (DT)
# 2.2.1) Train a Decision Tree model with default settings
# 2.2.2) Report the accuracy on the test set
# 2.2.3) Note: Perform this task if your dataset involves a classification problem
# max_depth is the maximum depth of the Decision Tree, i.e. it controls how deep the tree can grow.
# Smaller values limit the tree's depth, which reduces overfitting but might lead to underfitting.
# Larger values let the tree grow deeper and capture more patterns, but might lead to overfitting. None means unlimited.
print("\nDECISION TREE (DT)")
max_depth_values = [None, 1, 5, 10, 20]
decision_tree = None
best_accuracy = 0

print("Validation after training accuracy")
for max_depth in max_depth_values:
    dt_model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    dt_model.fit(X_train, y_train)
    y_val_pred = dt_model.predict(X_validation)
    val_accuracy = accuracy_score(y_validation, y_val_pred)
    print(f"max_depth={max_depth} - accuracy={val_accuracy:.3f}")

    if val_accuracy > best_accuracy:
        decision_tree = dt_model
        best_accuracy = val_accuracy

y_test_pred = decision_tree.predict(X_test)
test_accuracy = decision_tree_test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"The test data set accuracy after training and for best max_depth is {test_accuracy:.3f}")
print(f"Test vs validation data performance difference: {test_accuracy - best_accuracy:.3f}")
print(f"\nDecision Tree Accuracy: {test_accuracy:.3f}")
print("==============================================")


# 2.3) Support Vector Classifier (SVC)
# 2.3.1) Train a Support Vector Classifier model with default settings.
# 2.3.2) Report the accuracy on the test set.
# 2.3.3) Note: Perform this task if your dataset involves a classification problem.
print("\nSUPPORT VECTOR CLASSIFIER (SVC)")
# The goal of SVC is to maximize the margin between classes while minimizing the classification error.
# The C parameter controls the trade-off between these two objectives.
# High C means the model will be very strict about fitting the training data (fewer misclassifications), but it might lead to overfitting.
# Lower C means the model will tolerate more misclassifications, which might lead to a wider margin and more generalization, but also to
# less accuracy on the training data.
C_values = [0.1, 1, 10, 100]
best_accuracy = 0
best_C = None
support_vector_classifier = None

for C in C_values:
    svc_model = SVC(C=C, random_state=42)
    svc_model.fit(X_train, y_train)
    y_pred_validation = svc_model.predict(X_validation)
    accuracy = accuracy_score(y_validation, y_pred_validation)
    print(f"C={C} - accuracy={accuracy:.3f}")

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_C = C
        support_vector_classifier = svc_model

y_pred_test = support_vector_classifier.predict(X_test)
test_accuracy = svc_test_accuracy = accuracy_score(y_test, y_pred_test)
print(f"The test data set accuracy after training, validation and for best C is {test_accuracy:.3f}")
print(f"Test vs validation data performance difference: {test_accuracy - best_accuracy:.3f}")
print(f"\nSVC Accuracy: {test_accuracy:.3f}")
print("====================================")


# 2.4) Logistic Regression (LR)
# 2.4.1) Train a Logistic Regression model.
# 2.4.2) Report the accuracy on the test set.
# 2.4.3) Note: Only perform this task if your chosen dataset is a binary classification problem (i.e., it has exactly two classes).
# NOTE: My data set has more classes, but I decided to change the classification task to 'sex' instead of species, so that I can do this
# task out of curiosity and future reference. If I should not have done this, please ignore it.
print("\nLOGISTIC REGRESSION")
penguins_binary_sex = penguins_data.dropna(subset=['sex']).copy()  # explicit copy, so the column assignments below do not trigger SettingWithCopyWarning
penguins_binary_sex['sex'] = penguins_binary_sex['sex'].str.lower()
penguins_binary_sex['sex'] = penguins_binary_sex['sex'].map({'male': 1, 'female': 0})

X_lr = penguins_binary_sex.drop('sex', axis=1)
y_lr = penguins_binary_sex['sex']

non_numeric_columns = X_lr.select_dtypes(exclude=['number']).columns

for col in non_numeric_columns:
    most_common_value = X_lr[col].mode()[0] # Fill missing values with the most common value (mode)
    X_lr[col] = X_lr[col].fillna(most_common_value)

X_lr = X_lr.apply(pd.to_numeric, errors='coerce') # Convert to numeric
X_lr = X_lr.fillna(0)  # Replace any remaining NaN values (e.g. text columns coerced above) with 0

X_lr_train, X_lr_temp, y_lr_train, y_lr_temp = train_test_split(X_lr, y_lr, test_size=0.4, random_state=42)
X_lr_val, X_lr_test, y_lr_val, y_lr_test = train_test_split(X_lr_temp, y_lr_temp, test_size=0.5, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_lr_train, y_lr_train)
y_val_pred = model.predict(X_lr_val)
y_test_pred = model.predict(X_lr_test)
validation_accuracy = accuracy_score(y_lr_val, y_val_pred)
test_accuracy = accuracy_score(y_lr_test, y_test_pred)

print(f"Validation Accuracy: {validation_accuracy:.3f}")
print(f"Test vs validation data performance difference: {test_accuracy - validation_accuracy:.3f}")
print(f"\nLR Accuracy: {test_accuracy:.3f}")
print("===================================")


# 2.5) Neural Network (NN)
# 2.5.1) Build and train a simple Neural Network.
# 2.5.2) Report the accuracy on the test set.
# 2.5.3) Note: Perform this task if your dataset involves a classification or regression problem.
print("\nNEURAL NETWORK (NN)")
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_validation_scaled = scaler.transform(X_validation)
X_test_scaled = scaler.transform(X_test)

# Encode target labels and convert data to arrays
label_encoder = {label: idx for idx, label in enumerate(y_train.unique())}
y_train_encoded = y_train.map(label_encoder)
y_validation_encoded = y_validation.map(label_encoder)
y_test_encoded = y_test.map(label_encoder)

X_train_np = np.array(X_train_scaled)
X_validation_np = np.array(X_validation_scaled)
X_test_np = np.array(X_test_scaled)
y_train_np = np.array(y_train_encoded)
y_validation_np = np.array(y_validation_encoded)
y_test_np = np.array(y_test_encoded)

neural_network = tf.keras.Sequential([
    tf.keras.Input(shape=(X_train_np.shape[1],)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(len(label_encoder), activation='softmax')
])

# sparse_categorical_crossentropy used when target labels are integers
# categorical_crossentropy used when target labels are one-hot encoded
# verbose: 0 - no output, 1 - displays a progress bar, 2 - displays one line per epoch with training details
# validation_data passes the explicit validation split used here; validation_split=0.2 would instead hold out 20% of X_train
# when no separate validation set is available
# early_stopping is a callback that stops training based on validation loss to prevent overfitting
neural_network.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
history = neural_network.fit(X_train_np, y_train_np, validation_data=(X_validation_np, y_validation_np), epochs=50, batch_size=32, verbose=0, callbacks=[early_stopping])
val_accuracy = history.history['val_accuracy'][-1]
train_accuracy = history.history['accuracy'][-1]
loss, accuracy = neural_network.evaluate(X_test_np, y_test_np)
nn_test_accuracy = accuracy
print(f"Train Accuracy: {train_accuracy:.3f}")
print(f"Validation Accuracy: {val_accuracy:.3f}")
print(f"Test vs validation data performance difference: {accuracy - val_accuracy:.3f}")
print(f"\nNN Accuracy: {accuracy:.3f}\nTest Loss: {loss:.3f}")
print("=======================================================")


# 2.6) Linear Regression (LinR)
# 2.6.1) Train a Linear Regression model on your dataset.
# 2.6.2) Report the Mean Squared Error (MSE) on the test set.
# 2.6.3) Note: Perform this task only if your dataset involves a regression problem (predicting continuous values).
# NOTE: My data set is based on classes, but I decided to do this out of curiosity and for future reference. I predict the body mass based on
# 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm'. If I should not have done this, please ignore it.
print("\nLINEAR REGRESSION (LinR)")
X = penguins_data[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']]
y = penguins_data['body_mass_g']  # Target
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_validation, X_test, y_validation, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_val_pred = model.predict(X_validation)
y_test_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_test_pred)
r2_val_accuracy = r2_score(y_validation, y_val_pred)
r2_test_accuracy = r2_score(y_test, y_test_pred)
# R^2 is reported as the "accuracy-like" score; body_mass_g was standardized above, so the MSE is in standardized units
print(f"Validation R^2: {r2_val_accuracy:.3f}")
print(f"Test MSE: {mse:.3f}")
print(f"Test vs validation data performance difference: {r2_test_accuracy - r2_val_accuracy:.3f}")
print(f"\nLinR R^2 (test): {r2_test_accuracy:.3f}")
print("=======================================")


# 3) Model Comparison
print("\nACCURACIES")
model_comparison = {
    'Model': ['KNN', 'Decision Tree', 'SVC', 'Neural Network'],
    'Test Accuracy': [knn_test_accuracy, decision_tree_test_accuracy, svc_test_accuracy, nn_test_accuracy]
}

comparison_df = pd.DataFrame(model_comparison)
print(comparison_df)

# 3.2.1) Which model performed the best?
# The Neural Network performed the best with a test accuracy of 1.000000, although it was probably the most difficult model to set up and train.
# Choosing the number of layers, neurons, epochs, batch size, etc. to get a decent accuracy was really challenging.

# 3.2.2) Why did this model perform better?
# Because it took me hours to set it up and train it, just kidding :)
# Neural Networks can capture complex patterns in data thanks to their multiple layers and non-linear activation functions.
# This makes them more flexible than the other models: they can adapt to various types of
# data relationships and often achieve better performance on large datasets.


# 4) Save models
# 4.1) Save your trained models using an appropriate method (e.g., pickle for Scikit-learn models or model.save for Keras models)
# 4.2) Save the final notebook with all outputs, tables, discussions included
models_dir = '/content/drive/MyDrive/Colab Notebooks/trained_models'
sklearn_models = {
    'knn_model': knn,
    'decision_tree_model': decision_tree,
    'support_vector_classifier_model': support_vector_classifier,
}
for name, trained_model in sklearn_models.items():
    with open(f'{models_dir}/{name}.pkl', 'wb') as f:  # the with-block closes the file after writing
        pickle.dump(trained_model, f)
# The Keras model is saved with model.save, as the task suggests (the native format needs the .keras extension)
neural_network.save(f'{models_dir}/neural_network_model.keras')

# Load models (each one back into its own variable)
with open(f'{models_dir}/knn_model.pkl', 'rb') as f:
    knn = pickle.load(f)
with open(f'{models_dir}/decision_tree_model.pkl', 'rb') as f:
    decision_tree = pickle.load(f)
with open(f'{models_dir}/support_vector_classifier_model.pkl', 'rb') as f:
    support_vector_classifier = pickle.load(f)
neural_network = tf.keras.models.load_model(f'{models_dir}/neural_network_model.keras')




Mounted at /content/drive
   rowid species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0      1  Adelie  Torgersen            39.1           18.7              181.0   
1      2  Adelie  Torgersen            39.5           17.4              186.0   
2      3  Adelie  Torgersen            40.3           18.0              195.0   
3      4  Adelie  Torgersen             NaN            NaN                NaN   
4      5  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  year  
0       3750.0    male  2007  
1       3800.0  female  2007  
2       3250.0  female  2007  
3          NaN     NaN  2007  
4       3450.0  female  2007  
Column names: Index(['rowid', 'species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
      dtype='object')
Data types: rowid                  int64
species               object
island                object
bill_length_mm       float64
bill_d