# Report Statistical Foundations of Machine Learning

- Khalil Oauld Chaib  0557031  Master of Applied Informatics: Artificial Intelligence
- Mohammed Shabot     0563065  Master of Science in Applied Sciences and Engineering: Computer Science: Artificial Intelligence
- Ferit Fikri Murad   0620940  Master of Applied Informatics: Artificial Intelligence

In [6]:
%pip install pandas scikit-learn


In [2]:
from Dataset.pre_processing import *
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
import pandas as pd
from sklearn.metrics import f1_score, make_scorer
from sklearn.preprocessing import StandardScaler
from nn import datasplitter
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import learning_curve
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import accuracy_score

ModuleNotFoundError: No module named 'pandas'

In this project for the course 'Statistical Foundations of Machine Learning', we are expected to explore and address three fundamental research questions. We address them in the form of an interactive report that combines textual explanations, visualizations and executable Python code to give a comprehensive account of the research undertaken. These are the three research questions we chose:

1. Which features have the highest impact on predicting obesity or CVD risk?
2. What is the effect of hyperparameter choices, such as learning rate, batch size, or number of hidden units, on the performance of machine learning algorithms?
3. What is the impact of feature imbalance and mislabeling on the performance of machine learning models?

To address these research questions, we used two machine learning models: random forests and neural networks. Both models are explained in the next sections. Thereafter, we briefly discuss the preprocessing phase. Lastly, we describe the experiments and discuss the results in the last chapter.


## Research
### Random forests
Random forests is an ensemble learning method, which, as the name suggests, is based on constructing multiple decision trees. Random forests can be used for both classification and regression tasks. In the case of classification, the result of the random forest depends on the class that is returned by the majority of the trees. For regression, the result of the random forest is the mean of the outputs from all the trees. Decision trees are known to overfit, especially when noisy data is added, but random forests mitigate this problem by averaging the results of multiple trees, thus enhancing robustness and generalizability.

As mentioned above, random forests is an ensemble learning method. The model uses __bagging__ (Bootstrap Aggregating) to increase performance and robustness. Bagging combines multiple models to decrease variance and increase accuracy, which a single model on its own might lack. At a high level, the random forest procedure consists of three steps.

1. __Bootstrap Samples__:
   Multiple subsets (bootstrap samples) are drawn from the dataset by random sampling with replacement. Each bootstrap sample has the same number of observations as the original dataset; however, some observations might be selected multiple times, while others might not be selected at all.
2. __Model training__:
   Each individual model is trained on its own bootstrap sample. In the case of random forests, these models are decision trees. Each model is therefore trained on a slightly different dataset, which results in slightly different decision trees.
3. __Aggregating__:
   For classification problems, the random forest determines the final result by selecting the class predicted by the majority of the trees (majority voting). For regression problems, the final prediction is the average of the predictions from all the trees.
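The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is a hypothetical one-feature toy set, and each "tree" is reduced to a single threshold split (a decision stump) written by hand, so it is not the scikit-learn implementation used elsewhere in this report.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) observations with replacement."""
    return rng.choices(rows, k=len(rows))

def fit_stump(rows):
    """Step 2: fit a depth-1 'tree' (one threshold on one feature).

    Tries every observed value as threshold and keeps the split with
    the fewest training errors.
    """
    best = None
    for threshold, _ in rows:
        for left_label, right_label in ((0, 1), (1, 0)):
            errors = sum(
                (left_label if x <= threshold else right_label) != y
                for x, y in rows
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def bagged_predict(models, x):
    """Step 3: aggregate the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Hypothetical toy data: feature x in [0, 10), label 1 when x >= 5.
rows = [(x / 10, int(x >= 50)) for x in range(100)]
models = [fit_stump(bootstrap_sample(rows, rng)) for _ in range(25)]

print(bagged_predict(models, 1.0), bagged_predict(models, 9.0))
```

Because every stump sees a different bootstrap sample, the learned thresholds differ slightly from model to model; the vote smooths out these differences.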

Let's delve into some mathematics.
Given a dataset D with N samples, bootstrap sampling creates B different subsets D<sub>1</sub>, D<sub>2</sub>, ..., D<sub>B</sub>, each of size N, by sampling from D with replacement. The probability that any given sample x<sub>i</sub> is included in a single bootstrap sample D<sub>j</sub> is (Efron & Tibshirani, 1993):

$$P(x_i \in D_j) = 1 - \left(1 - \frac{1}{N}\right)^N$$
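As a quick numerical check of this formula, the sketch below evaluates it for a few values of N and compares it with a simulated estimate. The function names, the values of N and the number of trials are illustrative choices, not part of the original experiments.

```python
import random

def inclusion_probability(n):
    """Closed form: 1 - (1 - 1/N)^N."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def empirical_inclusion(n, trials, rng):
    """Fraction of bootstrap samples (size n) that contain observation 0."""
    hits = 0
    for _ in range(trials):
        sample = rng.choices(range(n), k=n)  # sampling with replacement
        if 0 in sample:
            hits += 1
    return hits / trials

rng = random.Random(42)
for n in (10, 100, 1000):
    exact = inclusion_probability(n)
    simulated = empirical_inclusion(n, 2000, rng)
    print(f"N={n}: formula={exact:.4f}, simulated={simulated:.4f}")
```

For large N the formula approaches $1 - 1/e \approx 0.632$, so each tree sees roughly 63% of the distinct observations and the rest are left out of its bootstrap sample.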

Each decision tree T<sub>i</sub> in the forest is trained on a bootstrap sample D<sub>i</sub>. At each point where the tree splits, only a random selection of m features out of the total p features is considered to find the best split. This added randomness makes the trees less correlated with each other, which makes the averaged model more robust and reliable.
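The per-split feature selection can be illustrated as follows. The values of p and m are hypothetical; choosing m close to the square root of p is a common heuristic for classification, not a setting taken from this report.

```python
import math
import random

def candidate_features(p, m, rng):
    """Draw m distinct feature indices out of p, without replacement."""
    return sorted(rng.sample(range(p), m))

p = 16                      # hypothetical total number of features
m = max(1, math.isqrt(p))   # heuristic: m is about sqrt(p)
rng = random.Random(0)

# Every split draws a fresh random subset of features to evaluate.
for split in range(3):
    print(f"split {split}: consider features {candidate_features(p, m, rng)}")
```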

Lastly, after all the individual trees have been trained on the bootstrap samples, we have to aggregate the predictions.

__Classification__: When a new input is given to the random forest, each tree in the forest makes its own prediction T<sub>i</sub>(x). The final prediction is determined by a majority vote: $$\hat{y} = \text{mode}(T_1(x), T_2(x), \ldots, T_B(x))$$

__Regression__: When a new input is given to the random forest, each tree provides its own prediction T<sub>i</sub>(x). The final prediction $\hat{y}$ is the average of these predictions: $$\hat{y} = \frac{1}{B} \sum_{i=1}^{B} T_i(x)$$
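The two aggregation rules can be written down directly. The tree outputs below are made-up values for illustration; in case of a tie, this sketch simply returns the class that was seen first.

```python
from collections import Counter
from statistics import fmean

def aggregate_classification(tree_predictions):
    """Majority vote: the mode of T_1(x), ..., T_B(x)."""
    return Counter(tree_predictions).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Mean of T_1(x), ..., T_B(x)."""
    return fmean(tree_predictions)

# Hypothetical outputs of B = 3 trees for one input x.
print(aggregate_classification(["Normal", "Obesity I", "Normal"]))  # Normal
print(aggregate_regression([24.0, 26.0, 25.0]))                     # 25.0
```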




### Neural Networks


Neural networks are a class of machine learning models inspired by the structure and function of the human brain: they are designed to recognize complex structures and patterns in data through a series of interconnected nodes, called neurons, which are organized into layers. Their applications are wide-ranging, from classification and regression to image and speech recognition.

### High level overview of neural networks
A neural network typically consists of three types of layers: the input layer, one or more hidden layers and the output layer. At a high level, the training of a neural network can be summarized in the following steps:

1. The weights of the connections between the neurons are initialized. In our case, scikit-learn's MLPClassifier handles this initialization.
2. Before training, the non-numerical columns have to be encoded and the features standardized. We achieved this by using LabelEncoder and StandardScaler.
3. __Forward Propagation__: Data from the input layer is passed through the network, where each neuron applies a transformation to its input. This continues through each layer until the data reaches the final layer (output layer), where the predictions are made.
4. __Loss Calculation__: After the predictions are made, they are compared to the actual target values using a loss function. The loss function measures the distance between the predicted values and the true values. The MLPClassifier that we use applies the cross-entropy loss function for classification tasks.
5. __Backward Propagation__: Subsequently, the network adjusts its weights so that the loss is minimized. This is done through the backpropagation algorithm, which calculates the gradient of the loss function with respect to each weight; the weights are then updated accordingly using an optimization method such as gradient descent.
6. __Iteration__: Steps 3 to 5 are repeated for many passes over the training data, called epochs, until the network's performance stabilizes and the loss is minimized.
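To make steps 3 to 6 concrete, the sketch below runs the loop for the smallest possible "network": a single sigmoid neuron trained with full-batch gradient descent on a toy problem (logical AND). This is a deliberate simplification rather than the MLPClassifier used in this report; the data, learning rate and number of epochs are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, bias, x):
    """Forward propagation for one neuron: weighted sum, then activation."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

def cross_entropy(p, y):
    """Binary cross-entropy loss for one prediction p and target y."""
    eps = 1e-12  # avoids log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def train(samples, learning_rate=0.5, epochs=200):
    weights, bias = [0.0, 0.0], 0.0          # step 1: initialization
    losses = []
    for _ in range(epochs):                  # step 6: iterate over epochs
        grad_w, grad_b, total = [0.0, 0.0], 0.0, 0.0
        for x, y in samples:
            p = forward(weights, bias, x)    # step 3: forward propagation
            total += cross_entropy(p, y)     # step 4: loss calculation
            error = p - y                    # step 5: dL/dz for sigmoid + cross-entropy
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        n = len(samples)
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
        losses.append(total / n)
    return weights, bias, losses

# Toy data: logical AND of two binary inputs.
samples = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0), ((1.0, 1.0), 1)]
weights, bias, losses = train(samples)
print(f"loss after first epoch: {losses[0]:.3f}, after last epoch: {losses[-1]:.3f}")
```

A multi-layer network follows the same loop; the only difference is that backpropagation applies the chain rule through every hidden layer to obtain the gradients.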

A key aspect to consider when training neural networks is the choice of activation function. Each neuron in the network receives input from all the neurons in the previous layer, and an activation function is applied to the weighted sum of these inputs. The activation function determines whether, and how strongly, the neuron is activated, that is, whether it passes a significant signal on to the next layer or not. In our grid search (which will be discussed later on in this notebook), we consider two activation functions:

- __ReLU__ (Rectified Linear Unit):
  The ReLU activation function is defined as follows: $$ \text{ReLU}(z) = \max(0, z) $$
  where z is the weighted sum of the outputs x<sub>i</sub> of the neurons in the previous layer plus a bias b: $$ z = \sum_{i} w_i x_i + b $$
  The ReLU activation function helps to mitigate the vanishing gradient problem, making it easier to train deep networks. It is computationally efficient and helps the network to converge faster.
- __Tanh__ (hyperbolic tangent):
  The tanh activation function is defined as follows:
  $$ \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} $$
  where z, in the same fashion as above, is the weighted sum of the outputs of the neurons in the previous layer plus a bias:
  $$ z = \sum_{i} w_i x_i + b $$
  The tanh activation function outputs values between -1 and 1, making it centered around zero, which can be useful during training. It is often used in hidden layers to provide a smooth gradient and better convergence properties.
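Both activation functions, and the weighted sum z they are applied to, can be sketched as follows. The inputs, weights and bias are arbitrary illustrative numbers.

```python
import math

def relu(z):
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)

def tanh(z):
    """tanh(z) = (e^z - e^-z) / (e^z + e^-z)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Compute z = sum_i w_i * x_i + b and apply the activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values: z = 0.5*1.0 + 0.25*(-2.0) - 0.5 = -0.5
inputs, weights, bias = [1.0, -2.0], [0.5, 0.25], -0.5

print(neuron(inputs, weights, bias, relu))   # 0.0, negative z is cut off
print(neuron(inputs, weights, bias, tanh))   # about -0.462, sign is preserved
```

The example shows the practical difference between the two: for a negative z, ReLU passes no signal at all, while tanh passes a negative one.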

Another key aspect to consider is gradient-based optimization. After backpropagation, where all the gradients are computed, the weights are adjusted using a gradient-based optimization method. For our experiments, we chose to work with Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam).

__SGD__: Stochastic Gradient Descent is, as mentioned above, an optimization method to minimize the loss function by updating the weights iteratively. The weights are updated as follows: $$ w_{t+1} = w_t - \eta \nabla L(w_t) $$

where:
- $w_t$ is the weight vector at iteration $t$
- $\eta$ is the learning rate
- $\nabla L(w_t)$ is the gradient of the loss function $L$ with respect to $w$ at iteration $t$
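The update rule can be tried out on a toy loss. The sketch below assumes a hypothetical one-dimensional loss $L(w) = (w - 3)^2$ with an exact gradient; in actual SGD the gradient would instead be estimated on a randomly drawn sample or mini-batch of the training data.

```python
def gradient(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def sgd_step(w, learning_rate):
    """One update: w_{t+1} = w_t - eta * grad L(w_t)."""
    return w - learning_rate * gradient(w)

w = 0.0
for t in range(50):
    w = sgd_step(w, learning_rate=0.1)

print(w)  # close to the minimizer w = 3
```

With a learning rate of 0.1 the distance to the minimum shrinks by a factor of 0.8 per step; a much larger learning rate would overshoot, which is one reason the learning rate is part of the hyperparameter search.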

    
__Adam__: Adam is an optimization algorithm that combines the benefits of both the AdaGrad and RMSProp algorithms to minimize the loss function by updating the weights iteratively. The weights are updated using the following equations (Kingma & Ba, 2014):

- $m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L(w_t)$
- $v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla L(w_t))^2$
- $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$
- $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
- $w_{t+1} = w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$

where:
- $w_t$ is the weight vector at iteration $t$
- $\eta$ is the learning rate
- $\nabla L(w_t)$ is the gradient of the loss function $L$ with respect to $w$ at iteration $t$
- $m_t$ and $v_t$ are the first and second moment estimates at iteration $t$
- $\beta_1$ and $\beta_2$ are the exponential decay rates for the moment estimates
- $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected moment estimates
- $\epsilon$ is a small constant to prevent division by zero
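The five equations translate almost line by line into code. The sketch below again assumes the hypothetical toy loss $L(w) = (w - 3)^2$; the decay rates and $\epsilon$ are the defaults suggested by Kingma & Ba, while the learning rate of 0.1 is an illustrative choice.

```python
import math

def gradient(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def adam_step(w, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    g = gradient(w)
    m = beta1 * m + (1 - beta1) * g          # first moment estimate m_t
    v = beta2 * v + (1 - beta2) * g * g      # second moment estimate v_t
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 0.0, 0.0, 0.0
history = []
for t in range(1, 301):                      # t starts at 1 for the bias correction
    w, m, v = adam_step(w, m, v, t)
    history.append(w)

print(history[0], history[-1])
```

Note that the very first step has a size of almost exactly $\eta$, regardless of how large the gradient is: after bias correction, $\hat{m}_1 / \sqrt{\hat{v}_1}$ reduces to the sign of the gradient. This normalization is what makes Adam less sensitive to the scale of the gradients than plain SGD.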


  
  

## Preprocessing

In this study, we used a dataset that focuses on obesity classification. It includes features concerning eating habits and physical condition.

__Eating habits__:
- FAVC: Frequent consumption of high-caloric food
- FCVC: Frequency of consumption of vegetables
- NCP: Number of main meals
- CAEC: Consumption of food between meals
- CH2O: Daily consumption of water
- CALC: Consumption of alcohol

__Physical condition__:

- SCC: Calorie consumption monitoring
- FAF: Physical activity frequency
- TUE: Time using technology devices
- MTRANS: Transportation used

Other variables in the dataset are Gender, Age, Height and Weight. Lastly, there are 6 obesity types, which are the target values. They are defined by ranges of the body mass index (BMI):

- Underweight: less than 18.5
- Normal: 18.5 to 24.9
- Overweight: 25.0 to 29.9
- Obesity I: 30.0 to 34.9
- Obesity II: 35.0 to 39.9
- Obesity III: higher than 40
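As an illustration of how these classes relate to the Height and Weight variables, the sketch below assumes that the ranges above are body mass index values in kg/m² (weight divided by height squared). The function names are hypothetical, and the handling of boundary values (for example, a BMI of exactly 40 is assigned to Obesity III) is an assumption, since the listed ranges leave small gaps.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2

def obesity_category(bmi_value):
    """Map a BMI value to one of the six classes listed above."""
    upper_bounds = [
        (18.5, "Underweight"),
        (25.0, "Normal"),
        (30.0, "Overweight"),
        (35.0, "Obesity I"),
        (40.0, "Obesity II"),
    ]
    for upper, label in upper_bounds:
        if bmi_value < upper:
            return label
    return "Obesity III"

print(obesity_category(bmi(70, 1.75)))    # Normal (BMI about 22.9)
print(obesity_category(bmi(120, 1.70)))   # Obesity III (BMI about 41.5)
```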

All of this data can be found at https://www.kaggle.com/datasets/aravindpcoder/obesity-or-cvd-risk-classifyregressorcluster/data.

In [1]:
dataset = pd.read_csv('/Dataset/Dataset.csv')
dataset.head()

NameError: name 'pd' is not defined