# ID2214/FID3214 Fx Assignment
Abyel Tesfay, Abyel@kth.se

### Instructions
The following Jupyter notebook contains solutions to a set of tasks in the form of simulations and tests, together with comments explaining the solutions and any assumptions made. The notebook was written to complete the assignments below and receive the grade E. Each assignment consists of an explanation, a simulation (or its results) and a conclusion. Below the assignments you will find instructions for recreating the results.

## Load packages used

In [1]:
import numpy as np
import pandas as pd

## 1a. Methodology

It depends on the models produced by the hyper-parameter settings and the algorithm used. The measured performance of the best-performing model depends on how the dataset is randomly split into two samples. Its accuracy may therefore be too optimistic, because the good score partly reflects the particular sample that was randomly generated. To investigate this, I performed the following steps.
- I chose the dataset "healthcare-dataset-stroke-data.csv", which has binary class labels
- I prepared two equally sized samples using randomized sampling
- For modelling I used RandomForest with the hyper-parameters 'n_estimators', 'criterion' and 'max_features'. The best-performing model was the one with the highest average accuracy in a ten-fold cross-validation
- For performance estimation I trained a model with the best configuration and a baseline model, using the first half as the training set. I then tested both models using the second half as the test set.
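
The steps above could be sketched roughly as follows. This is a minimal illustration and not the code in the Fx_*.py files: it assumes scikit-learn is available, uses a synthetic dataset from `make_classification` as a stand-in for the stroke data, searches a reduced grid to keep the run short, and assumes a majority-class `DummyClassifier` as the baseline, since the baseline is not specified in the text above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the stroke dataset (assumption: binary labels, numeric features).
X, y = make_classification(n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=0)

# Two equally sized halves obtained by randomized sampling.
X_first, X_second, y_first, y_second = train_test_split(X, y, test_size=0.5, random_state=0)

# Reduced hyper-parameter grid (the full grid in the text is larger).
param_grid = {
    "n_estimators": [1, 10, 50],
    "criterion": ["gini", "entropy"],
    "max_features": [1, 2, 3],
}

# Pick the configuration with the highest mean accuracy in ten-fold cross-validation.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, scoring="accuracy", cv=10)
search.fit(X_first, y_first)

# GridSearchCV refits the best configuration on the whole first half by default.
best_model = search.best_estimator_
# Assumed baseline: always predict the majority class of the first half.
baseline = DummyClassifier(strategy="most_frequent").fit(X_first, y_first)

best_cv_accuracy = search.best_score_
best_test_accuracy = best_model.score(X_second, y_second)
baseline_test_accuracy = baseline.score(X_second, y_second)

print("best configuration:", search.best_params_)
print("CV accuracy (first half):", best_cv_accuracy)
print("test accuracy (second half), best vs baseline:", best_test_accuracy, baseline_test_accuracy)
```

Comparing `best_cv_accuracy` with `best_test_accuracy` gives a rough impression of how optimistic the cross-validated estimate is for one particular split.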

Using the hyper-parameters 'n_estimators' = [1, 10, 50, 100, 250], 'criterion' = ['gini', 'entropy'] and 'max_features' = [1, 2, ..., 10], I obtained the following results:

The results show that even if the best-performing configuration of hyper-parameters (and algorithm) outperforms the baseline model on the first half of the data, the baseline model may still *outperform the best-performing configuration* on the second half. I also counted the models that performed better than the baseline during modelling, to see whether a majority of them outperformed it on the first half of the data. If so, the best-performing configuration would be *more likely to outperform* the baseline on the second half of the data.
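
The optimism of the selected model's score can be illustrated with a small simulation that involves no real models. It is a simplified sketch under stated assumptions: all 100 configurations are assumed to have the same true accuracy of 0.75, each half is assumed to contain 500 instances, and the score on the first half is modelled as a plain hold-out accuracy rather than a cross-validated one. All of these numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_configs, n_instances, true_acc, n_trials = 100, 500, 0.75, 2000

# Observed accuracy of every configuration on each half, repeated over many random splits.
first_half = rng.binomial(n_instances, true_acc, size=(n_trials, n_configs)) / n_instances
second_half = rng.binomial(n_instances, true_acc, size=(n_trials, n_configs)) / n_instances

# In every trial, select the configuration that looks best on the first half.
winner = first_half.argmax(axis=1)
rows = np.arange(n_trials)
selected_first = first_half[rows, winner]    # score that led to the selection
selected_second = second_half[rows, winner]  # score of the same configuration on fresh data

print("mean accuracy of the winner on the first half: ", selected_first.mean())
print("mean accuracy of the winner on the second half:", selected_second.mean())
```

Although no configuration is truly better than any other, the winner's first-half accuracy is on average clearly above 0.75, while its second-half accuracy falls back to about 0.75. A baseline with a similar true accuracy would therefore beat the selected configuration on the second half in a substantial share of the splits.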

## 1b. Data preparation


Assuming that the model was trained on an imbalanced training set that contains instances not present in the test set, we should expect a **lower accuracy but a similar AUC** when evaluating the model on the class-balanced set. The reason is that a model trained on an imbalanced set, where the majority class is frequent, tends to favor the majority class. When it is evaluated on a class-balanced test set (which has a lower frequency of the majority class), the accuracy will decrease. The AUC, however, will be similar. The AUC only measures the probability that the model ranks a randomly chosen positive instance ahead of a randomly chosen negative instance, which does not depend on the class proportions of the test set, so the lower accuracy does not affect it.
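
The argument can be checked numerically with a fixed scoring model. The sketch below is an illustration under assumptions that are not taken from the experiments: scores for the majority (negative) class are drawn from N(0, 1), scores for the minority (positive) class from N(1.5, 1), and the decision threshold of 1.5 is a hypothetical value chosen so that the model favors the majority class. The helper names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_test_set(n_neg, n_pos):
    """Draw scores of a fixed model for n_neg majority and n_pos minority instances."""
    scores = np.concatenate([rng.normal(0.0, 1.0, n_neg), rng.normal(1.5, 1.0, n_pos)])
    labels = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
    return scores, labels

def accuracy(scores, labels, threshold=1.5):
    """Share of instances whose thresholded score matches the label."""
    return np.mean((scores > threshold).astype(int) == labels)

def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC; assumes continuous scores without ties."""
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Majority class 4 times more frequent than the minority class.
imb_scores, imb_labels = sample_test_set(n_neg=8000, n_pos=2000)
# Class-balanced test set with fewer instances.
bal_scores, bal_labels = sample_test_set(n_neg=2000, n_pos=2000)

acc_imb, acc_bal = accuracy(imb_scores, imb_labels), accuracy(bal_scores, bal_labels)
auc_imb, auc_bal = auc(imb_scores, imb_labels), auc(bal_scores, bal_labels)

print("accuracy, imbalanced vs balanced:", acc_imb, acc_bal)
print("AUC, imbalanced vs balanced:     ", auc_imb, auc_bal)
```

The accuracy is a class-frequency-weighted average of the per-class hit rates, so it drops when the well-predicted majority class becomes less frequent. The AUC compares positives with negatives pairwise and stays close to the same value on both test sets.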

The following steps were taken with two different datasets:
- Select a data set for the task
- Split the dataset into two halves, one training set and one 'sampling' set 
- Use the sampling set to create the following test sets described in 1b:
    - An imbalanced test set in which the majority class is 4 times more frequent than the minority class
    - A class-balanced test set (which, however, has fewer instances than the imbalanced test set)
- Perform data preparation on the training set: filtering and imputation
- Generate and train two identical models using a selected algorithm, e.g. RandomForest
- Evaluate the models using both the imbalanced and balanced test sets
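
The construction of the two test sets could look roughly like the sketch below. It is not the content of Fx_prepare_test_sets_B2.py. The function name `make_test_sets`, the column name `label` and the toy data are hypothetical, and reusing the same minority instances in both test sets is an assumption made here for simplicity.

```python
import numpy as np
import pandas as pd

def make_test_sets(sampling_set, label_col, ratio=4, seed=0):
    """Return (imbalanced, balanced) test sets drawn from a binary-labelled frame."""
    counts = sampling_set[label_col].value_counts()  # sorted by frequency, descending
    majority_label, minority_label = counts.index[0], counts.index[-1]
    majority = sampling_set[sampling_set[label_col] == majority_label]
    minority = sampling_set[sampling_set[label_col] == minority_label]

    # Largest minority sample for which ratio * n majority instances are available.
    n_min = min(len(minority), len(majority) // ratio)
    minority_sample = minority.sample(n=n_min, random_state=seed)

    imbalanced = pd.concat([minority_sample, majority.sample(n=ratio * n_min, random_state=seed)])
    balanced = pd.concat([minority_sample, majority.sample(n=n_min, random_state=seed)])
    # Shuffle the rows so the classes are not grouped together.
    return imbalanced.sample(frac=1, random_state=seed), balanced.sample(frac=1, random_state=seed)

# Toy sampling set with roughly 10% minority instances (hypothetical data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"feature": rng.normal(size=1000),
                    "label": (rng.random(1000) < 0.1).astype(int)})

imbalanced, balanced = make_test_sets(toy, "label")
print(imbalanced["label"].value_counts())
print(balanced["label"].value_counts())
```

Because the balanced set keeps only as many majority instances as there are minority instances, it is smaller than the imbalanced set, as noted in the list above.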

Results, smiles_one_hot.csv:

Results, diabetes_binary_health_indicators_BRFSS2015.csv:

------------------------------

## Code for the assignments
Below you will find instructions for recreating the simulations and tests using the same datasets. The process consists of running several Python files (data preparation, modelling and testing) in order to obtain the same results as the student.

### 1a
Prerequisites: Place the provided healthcare-dataset-stroke-data.csv file in the same directory. Other datasets with binary classification should also work (though you must provide the correct dataset and class label in the code).

Steps:
1. Run the Fx_data_preparation_A.py file to obtain two equal-sized halves of the dataset, training_set and test_set
2. Run the Fx_RF_modelling_A.py file to find the best-performing configuration of the algorithm and hyper-parameters. The output shows the parameters that generate the best-performing model and a comparison with the baseline model.
    * The output also shows the number of models that perform better than the baseline model.
3. Lastly, run the Fx_RF_testing_A.py file to compare the best-performing model with the baseline model when evaluated on the second half of the data.

### 1b
Prerequisites: Place the provided healthcare-dataset-stroke-data.csv and smiles_one_hot.csv files in the same directory

Steps:
1. Run Fx_data_preparation_B1.py to obtain the training_set and test_set
2. Run Fx_prepare_test_sets_B2.py to obtain the imbalanced (majority-dominated) and class-balanced test sets
3. Run Fx_testing_B3.py to evaluate both models and obtain their accuracy and AUC