<!-- Notebook title -->
# Vanilla KNN

# 1. Notebook Description

### 1.1 Task Description
<!-- 
- A brief description of the problem you're solving with machine learning.
- Define the objective (e.g., classification, regression, clustering, etc.).
-->

#### Implement and Evaluate the K-Nearest Neighbors (KNN) Algorithm from Scratch

In this task, you will implement the K-Nearest Neighbors (KNN) algorithm using only Python, without the use of any machine learning libraries like scikit-learn. You will then evaluate the performance of your implementation using various metrics.

##### Download the Dataset

Use the Pima Indian dataset, which can be found [here](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database).

##### Implement the KNN Classifier

- Implement the K-Nearest Neighbors (KNN) algorithm from scratch. Your implementation should include:
  - Calculation of distances between instances.
  - Selecting K neighbors to identify the nearest neighbors.
  - Voting mechanism to predict the class of a data point based on its neighbors.

##### Predict the Classes

- Use your KNN implementation to predict the two classes in the Pima Indian dataset (diabetic or non-diabetic).

##### Hyperparameter Tuning

- Experiment with different values of the hyperparameter `K` (number of neighbors) to find the best fit for the model.
- Discuss how the choice of `K` affects the model’s performance.

##### Evaluate the Algorithm

- Evaluate the performance of your KNN implementation using the following metrics:
  - **Accuracy**: The ratio of correctly predicted instances to the total instances.
  - **F1 Score**: The harmonic mean of precision and recall.
  - **Precision**: The ratio of correctly predicted positive observations to all predicted positive observations.
  - **Recall**: The ratio of correctly predicted positive observations to all observations in that actual class.
  - **Mean Squared Error (MSE)**: The average of the squares of the errors between the predicted and actual values.
  - **Confusion Matrix**: A table that describes the performance of the classification model by showing the true positives, true negatives, false positives, and false negatives.

- Plot accuracy and loss graphs (plot an accuracy and loss graph).

##### Additional Instructions

- Choose the network architecture with care.
- Train and validate all algorithms.
- Make the necessary assumptions.

### 1.2 Useful Resources
<!--
- Links to relevant papers, articles, or documentation.
- Description of the datasets (if external).
-->

### 1.2.1 Data

#### 1.2.1.1 Common

* [Datasets Kaggle](https://www.kaggle.com/datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A vast repository of datasets across various domains provided by Kaggle, a platform for data science competitions.
  
* [Toy datasets from Sklearn](https://scikit-learn.org/stable/datasets/toy_dataset.html)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of small datasets that come with the Scikit-learn library, useful for quick prototyping and testing algorithms.
  
* [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)  
  &nbsp;&nbsp;&nbsp;&nbsp;A widely-used repository for machine learning datasets, with a variety of real-world datasets available for research and experimentation.
  
* [Google Dataset Search](https://datasetsearch.research.google.com/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A tool from Google that helps to find datasets stored across the web, with a focus on publicly available data.
  
* [AWS Public Datasets](https://registry.opendata.aws/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A registry of publicly available datasets that can be analyzed on the cloud using Amazon Web Services (AWS).
  
* [Microsoft Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of curated datasets from various domains, made available by Microsoft Azure for use in machine learning and analytics.
  
* [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A GitHub repository that lists a wide variety of datasets across different domains, curated by the community.
  
* [Data.gov](https://www.data.gov/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A portal to the US government's open data, offering access to a wide range of datasets from various federal agencies.
  
* [Google BigQuery Public Datasets](https://cloud.google.com/bigquery/public-data)  
  &nbsp;&nbsp;&nbsp;&nbsp;Public datasets hosted by Google BigQuery, allowing for quick and powerful querying of large datasets in the cloud.
  
* [Papers with Code](https://paperswithcode.com/datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A platform that links research papers with the corresponding code and datasets, helping researchers reproduce results and explore new data.
  
* [Zenodo](https://zenodo.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;An open-access repository that allows researchers to share datasets, software, and other research outputs, often linked to academic publications.
  
* [The World Bank Open Data](https://data.worldbank.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A comprehensive source of global development data, with datasets covering various economic and social indicators.
  
* [OpenML](https://www.openml.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;An online platform for sharing datasets, machine learning experiments, and results, fostering collaboration in the ML community.
  
* [Stanford Large Network Dataset Collection (SNAP)](https://snap.stanford.edu/data/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of large-scale network datasets from Stanford University, useful for network analysis and graph-based machine learning.
  
* [KDnuggets Datasets](https://www.kdnuggets.com/datasets/index.html)  
  &nbsp;&nbsp;&nbsp;&nbsp;A curated list of datasets for data mining and data science, compiled by the KDnuggets community.


#### 1.2.1.2 Project

### 1.2.2 Learning

* [K-Nearest Neighbors on Kaggle](https://www.kaggle.com/code/mmdatainfo/k-nearest-neighbors)

* [Complete Guide to K-Nearest-Neighbors](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor)

### 1.2.3 Documentation

---

# 2. Setup

## 2.1 Imports
<!--
- Import necessary libraries (e.g., `numpy`, `pandas`, `matplotlib`, `scikit-learn`, etc.).
-->

In [183]:
from ikt450.src.common_imports import *
from ikt450.src.config import get_paths
from ikt450.src.common_func import load_dataset, save_dataframe, ensure_dir_exists

## 2.2 Global Variables
<!--
- Define global constants, paths, and configuration settings used throughout the notebook.
-->

### 2.2.1 Paths

In [184]:
paths = get_paths()

### 2.2.2 Seed

In [185]:
RANDOM_SEED = 7

### 2.2.3 Split ratio

In [186]:
SPLITRATIO = 0.8

### 2.2.4 Results

In [187]:
results = []

## 2.3 Function Definitions
<!--
- Define helper functions that will be used multiple times in the notebook.
- Consider organizing these into separate sections (e.g., data processing functions, model evaluation functions).
-->

### 2.3.1 Distance Calculation

#### 2.3.1.1 Euclidian Distance

In [188]:
def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

In [189]:
def distance(one,two):
    return np.linalg.norm(one-two)

### 2.3.2 K-Nearest Neighbors

In [190]:
def knn(X_train, y_train, X_test, k):
    y_pred = []
    for x_test in X_test:
        # Calculate distances between x_test and all training samples
        distances = [euclidean_distance(x_test, x_train) for x_train in X_train]
        # Get the indices of k-nearest neighbors
        k_indices = np.argsort(distances)[:k]
        # Get the labels of the k-nearest neighbors
        k_nearest_labels = [y_train[i] for i in k_indices]
        # Determine the most common class label
        most_common = Counter(k_nearest_labels).most_common(1)
        y_pred.append(most_common[0][0])
    return np.array(y_pred)

---

# 3. System Setup 
<!-- (Optional but recommended) -->

## 3.1 Styling
<!--
- Set up any visual styles (e.g., for plots).
- Configure notebook display settings (e.g., `matplotlib` defaults, pandas display options).
-->

## 3.2 Environment Configuration
<!--
- Check system dependencies, versions, and ensure reproducibility (e.g., set random seeds).
-->

### 3.2.1 Seed

In [191]:
np.random.seed(RANDOM_SEED)

---

# 4. Data Processing

## 4.1 Data loading
<!--
- Load datasets from files or other sources.
-->

In [192]:
%ls {paths['PATH_COMMON_DATASETS']}

pima-indians-diabetes.data.csv


In [193]:
df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/pima-indians-diabetes.data.csv", delimiter=",")

## 4.2 Data inspection
<!--
- Preview the data (e.g., `head`, `describe`).
-->

### 4.2.1 Info

In [194]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


### 4.2.2 Describe

In [195]:
df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


### 4.2.3 Head

In [196]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


## 4.3 Data Cleaning
<!--
- Handle missing values, outliers, and inconsistencies.
- Remove or impute missing data.
-->

### 4.3.1 NULL, NaN, Missing values

In [197]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [198]:
df.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [199]:
df.duplicated().sum()

np.int64(0)

In [200]:
#df.corr()

## 4.4 Feature Engineering
<!--
- Create new features from existing data.
- Normalize or standardize features.
- Encode categorical variables.
-->

## 4.6 Data Splitting
<!--
- Split data into training, validation, and test sets.
-->

In [201]:
dataset = df.to_numpy()

In [202]:
np.random.shuffle(dataset)

In [203]:
# Split the dataset into training and validation sets
X_train = dataset[:int(len(dataset)*SPLITRATIO), 0:8]
X_val = dataset[int(len(dataset)*SPLITRATIO):, 0:8]
Y_train = dataset[:int(len(dataset)*SPLITRATIO), 8]
Y_val = dataset[int(len(dataset)*SPLITRATIO):, 8]

In [204]:
# Normalization step
X_train = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
X_val = (X_val - X_train.mean(axis=0)) / X_train.std(axis=0)

In [205]:
print(X_train)
print(Y_train)

[[-0.83643661 -1.03771391 -0.39900222 ... -0.6550122   0.30404563
  -0.80057075]
 [ 0.89470523  1.94455268  0.75289391 ...  0.45435501  0.32199314
   1.47981788]
 [ 2.62584707  0.99416003  1.06704739 ... -0.70601759  0.75572463
   0.80414718]
 ...
 [ 2.04879979  0.53534978  0.22930476 ...  0.4798577   0.23524684
   1.39535905]
 [-0.83643661 -0.51335935  0.33402259 ...  0.65837656 -0.84160376
  -0.63165307]
 [ 1.76027615 -0.97216959 -0.39900222 ... -0.82077971 -0.93134131
  -0.20935888]]
[0. 1. 1. 0. 1. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 1. 1. 0. 0. 0. 0. 1. 1.
 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 1. 0. 0. 1.
 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1.
 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1.
 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0. 0. 0. 1. 0. 0. 

---

# 5. Model Development

## 5.1 Model Selection
<!--
- Choose the model(s) to be trained (e.g., linear regression, decision trees, neural networks).
-->

In [206]:
from ikt450.common.classes.knn import KNN
knn = KNN(k=3)

## 5.2 Model Training
<!--
- Train the selected model(s) using the training data.
-->

In [207]:
knn.fit(X_train, Y_train)

## 5.3 Model Evaluation
<!--
- Evaluate model performance on validation data.
- Use appropriate metrics (e.g., accuracy, precision, recall, RMSE).
-->

In [208]:
accuracy, precision, recall, f1_score = knn.evaluate(X_val, Y_val)

Accuracy: 0.3116883116883117
Recall: 1.0
Precision: 0.3116883116883117
F1 Score: 0.4752475247524752


In [209]:
y_pred = knn.predict(X_val)
# Error metrics
mse = np.mean((Y_val - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(Y_val - y_pred))
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("Mean Absolute Error:", mae)

Mean Squared Error: 0.6883116883116883
Root Mean Squared Error: 0.8296455196719189
Mean Absolute Error: 0.6883116883116883


In [210]:
results.append({
    'approach': 'Custom KNN',
    'accuracy': accuracy,
    'precision': precision,
    'recall': recall,
    'f1_score': f1_score,
    'mse': mse,
    'rmse': rmse,
    'mae': mae
})

In [211]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix


# Load dataset using Pandas
df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/pima-indians-diabetes.data.csv", delimiter=",")

# Convert the DataFrame to a NumPy array
dataset = df.to_numpy()

# Shuffle the dataset
np.random.shuffle(dataset)

# Split the dataset into input (X) and output (Y) variables
X = dataset[:, 0:8]
y = dataset[:, 8]

# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=7)

# Normalization (if required)
X_train = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
X_val = (X_val - X_train.mean(axis=0)) / X_train.std(axis=0)

# Create an instance of the KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)

# Fit the model with training data
knn.fit(X_train, Y_train)

# Predict the class labels for the validation set
y_pred = knn.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(Y_val, y_pred)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

# Error metrics
mse = np.mean((y_val - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_val - y_pred))
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("Mean Absolute Error:", mae)

Accuracy: 0.44805194805194803
Precision: 0.3978494623655914
Recall: 0.6065573770491803
F1 Score: 0.4805194805194805
Mean Squared Error: 0.5194805194805194
Root Mean Squared Error: 0.7207499701564472
Mean Absolute Error: 0.5194805194805194


In [212]:
results.append({
    'approach': 'Sklearn KNN',
    'accuracy': accuracy,
    'precision': precision,
    'recall': recall,
    'f1_score': f1,
    'mse': mse,
    'rmse': rmse,
    'mae': mae
})

In [213]:
import numpy as np
import pandas as pd

def stratified_train_test_split(X, y, test_size=0.2, random_state=None):
    """
    Custom function to split data into training and testing sets with stratification.
    
    Parameters:
    - X: Features.
    - y: Labels.
    - test_size: Proportion of the data to use as test data.
    - random_state: Seed for the random number generator.
    
    Returns:
    - X_train, X_test, y_train, y_test: The split datasets.
    """
    
    if random_state is not None:
        np.random.seed(random_state)
    
    # Get unique classes and their corresponding indices
    unique_classes, y_indices = np.unique(y, return_inverse=True)
    
    # List to hold training and test indices
    train_indices = []
    test_indices = []
    
    # Split the data for each class
    for class_index in range(len(unique_classes)):
        class_indices = np.where(y_indices == class_index)[0]
        np.random.shuffle(class_indices)  # Shuffle the class indices
        n_test = int(np.floor(test_size * len(class_indices)))  # Determine number of test samples
        test_indices.extend(class_indices[:n_test])
        train_indices.extend(class_indices[n_test:])
    
    # Convert lists to arrays
    train_indices = np.array(train_indices)
    test_indices = np.array(test_indices)
    
    # Split the data
    X_train, X_test = X[train_indices], X[test_indices]
    y_train, y_test = y[train_indices], y[test_indices]
    
    return X_train, X_test, y_train, y_test

# Load dataset using Pandas
df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/pima-indians-diabetes.data.csv", delimiter=",")

# Convert the DataFrame to a NumPy array
dataset = df.to_numpy()

# Shuffle the entire dataset
np.random.shuffle(dataset)

# Split the dataset into input (X) and output (y) variables
X = dataset[:, 0:8]
y = dataset[:, 8]

# Perform stratified train-test split
X_train, X_val, y_train, y_val = stratified_train_test_split(X, y, test_size=0.2, random_state=7)

# Normalization (if required)
X_train = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
X_val = (X_val - X_train.mean(axis=0)) / X_train.std(axis=0)

# Check the distribution of classes in training and validation sets
print("Training set class distribution:", np.bincount(y_train.astype(int)))
print("Validation set class distribution:", np.bincount(y_val.astype(int)))

# Use the custom KNN class (assuming you've implemented it as before)
from ikt450.common.classes.knn import KNN
knn = KNN(k=3)
knn.fit(X_train, y_train)
accuracy, precision, recall, f1_score = knn.evaluate(X_val, y_val)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1_score)

y_pred = knn.predict(X_val)
# Error metrics
mse = np.mean((y_val - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_val - y_pred))
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("Mean Absolute Error:", mae)



Training set class distribution: [400 215]
Validation set class distribution: [100  53]
Accuracy: 0.5490196078431373
Recall: 0.5849056603773585
Precision: 0.3974358974358974
F1 Score: 0.47328244274809156
Accuracy: 0.5490196078431373
Precision: 0.3974358974358974
Recall: 0.5849056603773585
F1 Score: 0.47328244274809156
Mean Squared Error: 0.45098039215686275
Root Mean Squared Error: 0.6715507368448513
Mean Absolute Error: 0.45098039215686275


In [214]:
results.append({
    'approach': 'Custom stratified KNN',
    'accuracy': accuracy,
    'precision': precision,
    'recall': recall,
    'f1_score': f1_score,
    'mse': mse,
    'rmse': rmse,
    'mae': mae
})

In [215]:
for result in results:
    print(f"Results for {result['approach']}:")
    print(f"Accuracy: {result['accuracy']}")
    print(f"Precision: {result['precision']}")
    print(f"Recall: {result['recall']}")
    print(f"F1 Score: {result['f1_score']}")
    print(f"Mean Squared Error: {result['mse']}")
    print(f"Root Mean Squared Error: {result['rmse']}")
    print(f"Mean Absolute Error: {result['mae']}")
    print("-" * 40)

Results for Custom KNN:
Accuracy: 0.3116883116883117
Precision: 0.3116883116883117
Recall: 1.0
F1 Score: 0.4752475247524752
Mean Squared Error: 0.6883116883116883
Root Mean Squared Error: 0.8296455196719189
Mean Absolute Error: 0.6883116883116883
----------------------------------------
Results for Sklearn KNN:
Accuracy: 0.44805194805194803
Precision: 0.3978494623655914
Recall: 0.6065573770491803
F1 Score: 0.4805194805194805
Mean Squared Error: 0.5194805194805194
Root Mean Squared Error: 0.7207499701564472
Mean Absolute Error: 0.5194805194805194
----------------------------------------
Results for Custom stratified KNN:
Accuracy: 0.5490196078431373
Precision: 0.3974358974358974
Recall: 0.5849056603773585
F1 Score: 0.47328244274809156
Mean Squared Error: 0.45098039215686275
Root Mean Squared Error: 0.6715507368448513
Mean Absolute Error: 0.45098039215686275
----------------------------------------


In [216]:
# Predict the class labels for the validation set
y_pred = knn.predict(X_val)
y_pred

array([0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1])

In [217]:
# Error metrics
mse = np.mean((y_val - y_pred) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y_val - y_pred))
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("Mean Absolute Error:", mae)


Mean Squared Error: 0.45098039215686275
Root Mean Squared Error: 0.6715507368448513
Mean Absolute Error: 0.45098039215686275


## 5.4 Hyperparameter Tuning
<!--
- Fine-tune the model using techniques like Grid Search or Random Search.
- Evaluate the impact of different hyperparameters.
-->

## 5.5 Model Testing
<!--
- Evaluate the final model on the test dataset.
- Ensure that the model generalizes well to unseen data.
-->

## 5.6 Model Interpretation (Optional)
<!--
- Interpret the model results (e.g., feature importance, SHAP values).
- Discuss the strengths and limitations of the model.
-->

---

# 6. Predictions


## 6.1 Make Predictions
<!--
- Use the trained model to make predictions on new/unseen data.
-->

## 6.2 Save Model and Results
<!--
- Save the trained model to disk for future use.
- Export prediction results for further analysis.
-->

---

# 7. Documentation and Reporting

## 7.1 Summary of Findings
<!--
- Summarize the results and findings of the analysis.
-->

## 7.2 Next Steps
<!--
- Suggest further improvements, alternative models, or future work.
-->

## 7.3 References
<!--
- Cite any resources, papers, or documentation used.
-->