# Machine Learning 2024/2025 - Progress Task 2 (Application of K-Nearest Neighbours)

## Abstract

Given that a few alternatives to Random Forest could potentially outperform it, the team decided to try the algorithm whose performance had been closest to it: KNN. After several runs with different random states, KNN did not give better results than Random Forest.

## Introduction

K-Nearest Neighbours (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function). KNN has been used in statistical estimation and pattern recognition, and it ranks among the simplest of all machine learning algorithms.
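
The idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in this task: the function name `knn_predict`, the toy points and the choice of Euclidean distance are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.3, 0.3)))  # query near the first cluster
print(knn_predict(points, labels, (4.8, 5.0)))  # query near the second cluster
```

There is no training step beyond storing the data; all the work happens at prediction time, when the distances to the stored cases are computed.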

## Methodology

The team used the K-Nearest Neighbours algorithm to classify the data, with the following hyperparameters:

- n_neighbors: 5
- weights: distance
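
To illustrate what the `weights` setting changes, the sketch below compares `uniform` and `distance` weighting in scikit-learn. The one-dimensional data and `n_neighbors=3` are hypothetical values chosen to keep the example small; they are not the team's dataset or configuration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one close neighbour of class 1, two far neighbours of class 0
X_train = np.array([[0.0], [2.0], [2.1]])
y_train = np.array([1, 0, 0])
query = np.array([[0.1]])

# With uniform weights every neighbour counts as one vote
uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_train, y_train)
# With distance weights each vote is weighted by the inverse of its distance
distance = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)

uniform_pred = int(uniform.predict(query)[0])
distance_pred = int(distance.predict(query)[0])
print(f"uniform: {uniform_pred}, distance: {distance_pred}")
```

With uniform weights the two distant class-0 points outvote the single nearby class-1 point; with distance weights the nearby point dominates, so the two models disagree on this query.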

## Implementation

The team used the following code to implement the K-Nearest Neighbours algorithm:

In [None]:
!pip install mlflow pandas scipy scikit-learn

### Initial Configuration

In [None]:
# IMPORTS
import mlflow
import mlflow.data
import mlflow.data.pandas_dataset
from mlflow.models import infer_signature
from mlflow.data.pandas_dataset import PandasDataset
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

import pandas as pd

from sklearn.model_selection import train_test_split
from create_dataset import Dataset
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import MultiOutputClassifier
import logging
import configparser

# GLOBAL VARIABLES

# Load configuration file
CONFIG_FILE_PATH = "mlflow/test.conf"

config = configparser.ConfigParser()
config.read(CONFIG_FILE_PATH)
DATASET_PATH = config["data"]["dataset_path"]
TEST_DATASET_PATH = config["data"]["dataset_test_path"]
DATASET_INDEX_FEATURE = config["data"]["dataset_index"]
DATASET_TARGET_FEATURES = ["h1n1_vaccine", "seasonal_vaccine"]
CONFIG_SECTION_MLFLOW = "mlflow"
CONFIG_SECTION_NAMES = "names"
CONFIG_SECTION_DATA = "data"
CONFIG_PARAM_MLFLOW_ADDRESS = "mflow_address"
CONFIG_PARAM_MLFLOW_PORT = "mlflow_port"
CONFIG_PARAM_EXPERIMENT_NAME = "mlflow_experiment_name"
MLFLOW_LOCATION = config[CONFIG_SECTION_MLFLOW][CONFIG_PARAM_MLFLOW_ADDRESS] + config[CONFIG_SECTION_MLFLOW][CONFIG_PARAM_MLFLOW_PORT]
OPTIMIZED_SUFFIX = config[CONFIG_SECTION_NAMES]["model_optimized"]
ROC_AUC_NAME = config[CONFIG_SECTION_NAMES]["parameter_roc_auc"]
ACCURACY_NAME = config[CONFIG_SECTION_NAMES]["parameter_accuracy"]
OUTPUT_FILE_PATH = config[CONFIG_SECTION_DATA]["output_path"]

# Load logging configuration
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%H:%M:%S' 
)
# Use getLogger so the loggers inherit the handler and level set by basicConfig
tuner_logger = logging.getLogger("[Tuner]")
run_logger = logging.getLogger("[Run]")
logger_main = logging.getLogger("[Main]")

### Dataset Setup

In [None]:
# Dataset Class creation

class Dataset:
    '''
    ## Dataset
    
    This class represents a dataset. It handles dataset loading and splitting.
    
    ### Attributes
    
    - test: The test dataset.
    
    '''
    def __init__(self):
        '''
        Constructor for the Dataset class.
        '''
        data = pd.read_csv(DATASET_PATH)
        target = DATASET_TARGET_FEATURES
        data.set_index(DATASET_INDEX_FEATURE, inplace=True)
        self._y = data[target]
        self._X = data.drop(columns=target)
        test_data =  pd.read_csv(TEST_DATASET_PATH)
        test_data.set_index(DATASET_INDEX_FEATURE, inplace=True)
        self.test = test_data
    
    def with_correlation(self):
        '''
        ## with_correlation
        Method that returns a copy of the dataset features and targets.
        
        ### Returns
        (X, y): A tuple containing the dataset features and targets.
        '''
        
        return self._X.copy(), self._y.copy()
    

### Method Definition for running the Experiment

#### Hyperparameter Fine-Tuning

In [None]:
# TODO: This method only works for RandomForestClassifier with Randomized Search. It should ideally work with any model.
def hyperparameters(model_to_train):
    '''
    ## hyperparameters
    
    Defines the hyperparameter search space for a Random Forest model and wraps the model in a randomized search.
    
    :param model_to_train: The model to train.
    
    :return model: The unfitted RandomizedSearchCV wrapper; the search runs when `fit` is called.
    
    '''
    

    
    param_dist_random = {
                'estimator__n_estimators': randint(50, 200),
                'estimator__max_depth': [None, 10, 20, 30],
                'estimator__min_samples_split': randint(2, 11),
                'estimator__min_samples_leaf': randint(1, 5)
    }
    tuner_logger.info("Search space defined. Building randomized search...")
    model = RandomizedSearchCV(estimator=model_to_train, param_distributions=param_dist_random,
                               n_iter=50, cv=5, n_jobs=-1, verbose=0,
                               scoring=ROC_AUC_NAME)
    tuner_logger.info("Randomized search built successfully!")
    
    return model
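
The search space above uses Random Forest parameter names, so it would not apply to the KNN model used in this task. A minimal sketch of an equivalent search space for KNN follows. The parameter ranges, the toy data and the use of `accuracy` as the scoring metric are assumptions made for illustration, not values the team used.

```python
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical search space; the "estimator__" prefix reaches the KNN
# model wrapped inside MultiOutputClassifier
knn_param_dist = {
    "estimator__n_neighbors": randint(3, 30),
    "estimator__weights": ["uniform", "distance"],
    "estimator__p": [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}

# Toy data with two binary targets (stand-in for the real dataset)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = np.column_stack([
    X_toy[:, 0] > 0,
    X_toy[:, 1] + X_toy[:, 2] > 0,
]).astype(int)

search = RandomizedSearchCV(
    estimator=MultiOutputClassifier(KNeighborsClassifier()),
    param_distributions=knn_param_dist,
    n_iter=5,
    cv=3,
    scoring="accuracy",  # subset accuracy, chosen here only for simplicity
    random_state=0,
)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

Passing the parameter distributions in as an argument, rather than hard-coding them inside `hyperparameters`, would be one way to address the TODO above.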

#### Running the Experiment

In [None]:
def play_model(model, model_name : str, X : pd.DataFrame, y : pd.DataFrame, output : pd.DataFrame):
    
    '''
    ## play_model
    
    Performs a Machine Learning run using the model passed as a parameter and some input data.
    
    :param Any model: The model to train.
    :param str model_name: The name of the model.
    :param pd.DataFrame X: The input data.
    :param pd.DataFrame y: The target data.
    :param pd.DataFrame output: The data used to test the model.
    
    :return None: This method saves the predictions for the testing in a file within the `mlflow` directory.
    '''
        
    # Split the dataset into train and testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    # This code block encloses the part of the run tracked by MLflow (anything outside this block won't be tracked)
    with mlflow.start_run():
        
        ################################## Initial logs ##################################
        
        run_logger.info(f"========== Starting initial MLflow logging for model {model_name} =========")
        
        # Log the "presentation card" of the model. What is it trying to achieve?
        mlflow.set_tag("Objective", "Compare multiple models with dataset with correlation")
        
        # Log the input data of the run: features and targets used for training the model.
        pd_train = pd.concat([X_train, y_train], axis=1)
        pd_dataset = mlflow.data.pandas_dataset.from_pandas(pd_train, 
                                                            source = "df_encoded.csv", name="whole dataset and correlation")
        mlflow.log_input(pd_dataset, "training")
               
        # Tune the hyperparameters of the model (if needed. Only for optimized models) and log them
        if model_name.endswith(OPTIMIZED_SUFFIX):
            run_logger.info(f"Model {model_name} is optimized. Tuning hyperparameters...")
            model = hyperparameters(model)
        mlflow.log_params(model.get_params())           


        ########################## Training, testing and evaluation ######################
        
        # Train the model
        run_logger.info(f"Training model {model_name}...")
        model.fit(X_train, y_train)
        signature = infer_signature(X_train, model.predict(X_train))    # Infer the model signature (logged with the model below)
        run_logger.info(f"Model {model_name} trained successfully!")
       
        # Predict the test data
        run_logger.info(f"Predicting test data with trained model {model_name}...")
        y_pred = model.predict(X_test)
        run_logger.info("Test data predictions finished!")

        # Evaluate the model (get the metrics)
        run_logger.info(f"Evaluating model {model_name}...")
        accuracy = accuracy_score(y_test, y_pred)
        # Note: y_pred holds hard class labels, so this ROC AUC reflects a single decision threshold
        roc_auc = roc_auc_score(y_test, y_pred, average="macro")
        run_logger.info(f"Model {model_name} evaluated successfully!\nAccuracy: {accuracy}\nROC AUC: {roc_auc}")


        ################################## Result logs ##################################
        
        # Log the model's metrics and information to MLflow
        mlflow.log_metric(ROC_AUC_NAME, float(roc_auc))
        mlflow.log_metric(ACCURACY_NAME, float(accuracy))
        model_info = mlflow.sklearn.log_model(
            sk_model=model,
            artifact_path="model",
            signature=signature,
            input_example=X_train.head(),  # a few rows are enough for an input example
            registered_model_name=model_name,
        )
        
        run_logger.info(f"Model {model_name} logged successfully on MLflow.")
        
        # Predict probabilities for the output data
        predictions = model.predict_proba(output)
        
        h1n1_probs = predictions[0][:, 1]  # Positive-class probabilities for h1n1_vaccine
        seasonal_probs = predictions[1][:, 1]  # Positive-class probabilities for seasonal_vaccine

        predict = pd.DataFrame({
            "respondent_id": output.index,
            "h1n1_vaccine": h1n1_probs,
            "seasonal_vaccine": seasonal_probs
        })
        
        # The predictions are indexed by their value of respondent_id
        predict.set_index("respondent_id", inplace=True)
        
        ################################## Final logs ##################################
        
        # Store the predictions in a file and log them to MLflow
        predict.to_csv(OUTPUT_FILE_PATH) 
        mlflow.log_artifact(OUTPUT_FILE_PATH)
        run_logger.info("predictions saved")
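
The indexing `predictions[0][:, 1]` used above relies on the shape of what `MultiOutputClassifier.predict_proba` returns: a list with one array per target, each with one column per class. The sketch below shows this on hypothetical toy data (a small grid with two derived binary targets), not on the team's dataset.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: a 5x5 grid of points with two binary targets
X_toy = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
y_toy = np.column_stack([X_toy[:, 0] >= 2, X_toy[:, 1] >= 2]).astype(int)

model = MultiOutputClassifier(KNeighborsClassifier(n_neighbors=5, weights="distance"))
model.fit(X_toy, y_toy)

proba = model.predict_proba(X_toy[:4])

# One (n_samples, n_classes) array per target
print(len(proba), proba[0].shape)

# Column 1 holds the probability of the positive class for each target
first_target_positive = proba[0][:, 1]
second_target_positive = proba[1][:, 1]
```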

#### Main Function

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

def main():
    
    # Set our tracking server uri for logging
    mlflow.set_tracking_uri(uri=MLFLOW_LOCATION)
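    # Read the experiment name from the config (assumes the key lives in the [mlflow] section)
    experiment_name = config[CONFIG_SECTION_MLFLOW][CONFIG_PARAM_EXPERIMENT_NAME]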
    if not mlflow.get_experiment_by_name(experiment_name):
        mlflow.create_experiment(experiment_name)
    # Create a new MLflow Experiment
    mlflow.set_experiment(experiment_name)
    logger_main.info("fetching data")
    data = Dataset()
    X, y = data.with_correlation()
    output = data.test
    print(f"Showing the number of null values for the training data:\n {X.isnull().sum()}")
    print(f"Showing the number of null values for the test data:\n {output.isnull().sum()}")
    # Models to run (the Random Forest baselines are kept commented out for reference)
    # models = {'RandomForest_no_opt': MultiOutputClassifier(RandomForestClassifier(n_estimators=100, random_state=42), n_jobs=-1), 
    #           'RandomForest_si_opt': MultiOutputClassifier(RandomForestClassifier(random_state=42), n_jobs=-1)}
    models = {
        f"{config[CONFIG_SECTION_NAMES]['knn_model_name']}": MultiOutputClassifier(KNeighborsClassifier(n_neighbors=5, weights='distance'), n_jobs=-1),
        }

    for model_name, model in models.items():
        logger_main.info(f"Starting run with {model_name}")
        play_model(model, model_name, X, y, output)

#### Execution

In [None]:
if __name__ == "__main__":
    main()

## Results

On average, the K-Nearest Neighbours algorithm did not outperform the previously executed algorithms. It obtained the following results:

| Metric | Value |
| --- | --- |
| Accuracy | 0.6142 |
| ROC AUC | 0.6928 |

These results were taken from the MLflow logs of this experiment and can be observed in the following image:

![KNNResults](KNNResults.png)
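
When reading the ROC AUC value, note that `play_model` passes the output of `model.predict` (hard class labels) to `roc_auc_score`. The score can differ when positive-class probabilities are used instead, because the metric then takes the ranking of the predictions into account. The sketch below shows the difference on hand-written toy values; the numbers and the helper `positive_class_scores` are hypothetical and say nothing about how the team's reported score would change.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical toy values for a single binary target
y_true = np.array([0, 0, 1, 1])
proba = np.array([0.1, 0.6, 0.4, 0.9])   # predicted positive-class probabilities
hard = (proba >= 0.5).astype(int)         # labels obtained with a 0.5 threshold

auc_from_labels = roc_auc_score(y_true, hard)
auc_from_proba = roc_auc_score(y_true, proba)
print(auc_from_labels, auc_from_proba)


def positive_class_scores(proba_per_target):
    """Stack the positive-class column of each target into one score matrix.

    `proba_per_target` is the list returned by
    MultiOutputClassifier.predict_proba (one array per target).
    """
    return np.column_stack([p[:, 1] for p in proba_per_target])
```

For the multi-output case, the matrix returned by `positive_class_scores(model.predict_proba(X_test))` could be passed to `roc_auc_score(y_test, ..., average="macro")` in place of `y_pred`.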


## Conclusion

KNN, with a ROC AUC of 0.6928, did not outperform the Random Forest algorithm, which remains the best algorithm tested so far with a ROC AUC of 0.75. The team will continue to test other algorithms to find the best one for the dataset.