# Lab 03: Advanced Numerical and Categorical Techniques


#### Lab Overview

This lab focuses on using a processed tabular dataset to extract useful features, build an optimized model, and evaluate its performance using advanced techniques.

---

#### Objective

By the end of this lab, you will be able to:

1. Select and split data
2. Select an ML model and perform hyperparameter tuning
3. Test and evaluate the trained model
4. Apply explainability methods to interpret model decisions

---


#### Data Splitting

**How to split**<br>
The primary goal of data splitting is to ensure that the model you build is evaluated fairly and can generalize to new, unseen data. You split the data into three main subsets:

- Training Set: used for model training and consists of 60%-80% of the total data
- Validation Set: used for model tuning and consists of 10%-20% of the total data
- Test Set: used for model evaluation and consists of 10%-20% of the total data

**Best practices**

- Randomize Data: Always shuffle the data before splitting to ensure the splits are representative of the entire dataset.
- Stratify for Imbalanced Data: For classification tasks with imbalanced classes, stratify your data to ensure that the proportions of classes in your splits are similar to the full dataset.
- Avoid Data Leakage: Ensure that the test set remains unseen throughout the training and validation process. The model should never have access to any data in the test set before final evaluation.
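
The split and the best practices above can be sketched as follows. This is a minimal example, assuming scikit-learn is available; the synthetic data from `make_classification`, the 60/20/20 ratio, and the variable names are placeholders, not the lab dataset or a required ratio.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the lab dataset (assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Step 1: hold out 20% as the test set; shuffle and stratify on the label.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, shuffle=True, stratify=y, random_state=42
)

# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remaining 80% equals 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, shuffle=True, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
print(y.mean(), y_train.mean(), y_val.mean(), y_test.mean())
```

The test set is created first and then left untouched, so nothing fitted or tuned later can see it before the final evaluation.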

---


#### Model Selection

Below is an overview of common classification and regression algorithms, including brief descriptions, when to use each algorithm, and key parameters that can be tuned to optimize performance. By adjusting these parameters, you can find the best model for your dataset and task.


##### Classification Algorithms

| Algorithm                        | Description                                                                                          | When to Use                                                                                                                      | Parameters to Tune                                                                                                                      |
| -------------------------------- | ---------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| **Logistic Regression**          | A statistical method for binary classification, based on the logistic function.                      | When the output is binary (0/1) or multi-class (1-vs-rest). Suitable for linear decision boundaries.                             | - **C** (Inverse regularization strength)<br>- **solver** (Algorithm to use for optimization)<br>- **max_iter** (Maximum iterations)    |
| **K-Nearest Neighbors (KNN)**    | A non-parametric algorithm that classifies based on the majority class of nearest neighbors.         | When the data has many classes, or non-linear decision boundaries. Suitable for small datasets with well-defined boundaries.     | - **n_neighbors** (Number of neighbors)<br>- **weights** (Distance weighting)<br>- **metric** (Distance metric)                         |
| **Support Vector Machine (SVM)** | A supervised learning model that finds the hyperplane that best separates classes.                   | When you need high-dimensional space separation and robust decision boundaries.                                                  | - **C** (Regularization parameter)<br>- **kernel** (Linear, RBF, etc.)<br>- **gamma** (Kernel coefficient)                              |
| **Random Forest**                | An ensemble method that creates multiple decision trees and combines their predictions.              | When you have large datasets with high dimensionality and feature interactions. Works well for imbalanced data.                  | - **n_estimators** (Number of trees)<br>- **max_depth** (Max depth of each tree)<br>- **min_samples_split** (Minimum samples for split) |
| **Decision Tree**                | A tree-based algorithm that splits data into subsets to make decisions based on feature values.      | When the model needs to be interpretable or easy to visualize. Can handle both classification and regression tasks.              | - **max_depth** (Max depth of the tree)<br>- **min_samples_split** (Minimum samples for split)<br>- **criterion** (Split quality)       |
| **Naive Bayes**                  | A probabilistic classifier based on Bayes' Theorem with strong independence assumptions.             | When features are independent and you need a fast and simple model. Works well with categorical data.                            | - **alpha** (Laplace smoothing)<br>- **fit_prior** (Whether to learn class prior probabilities)                                         |
| **Gradient Boosting (GBDT)**     | An ensemble method that builds decision trees sequentially, optimizing for errors of previous trees. | When you need robust performance and are willing to trade off model complexity for higher accuracy. Suitable for large datasets. | - **n_estimators** (Number of trees)<br>- **learning_rate** (Step size)<br>- **max_depth** (Tree depth)                                 |
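
As an illustration of how the parameters in the table map to code, the sketch below builds several of the listed classifiers and compares them with 5-fold cross-validation. It assumes scikit-learn, whose estimators use the same parameter names as the table; the synthetic data and the specific parameter values are arbitrary starting points, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the lab dataset (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Scale-sensitive models are wrapped in a pipeline with a scaler;
# tree-based models do not need feature scaling.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)
    ),
    "knn": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5, weights="distance")
    ),
    "svm": make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf", gamma="scale")),
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=5, random_state=0
    ),
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
    ),
}

# Mean 5-fold cross-validated accuracy per candidate.
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {cv_scores[name]:.3f}")
```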


##### Regression Algorithms

| Algorithm                           | Description                                                                                                                           | When to Use                                                                                                     | Parameters to Tune                                                                                                                      |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| **Linear Regression**               | A linear approach for modeling the relationship between a dependent variable and one or more independent variables.                   | When the relationship between input features and the target is linear.                                          | - **fit_intercept** (Whether to include an intercept term)<br>- **normalize** (Removed in scikit-learn 1.2; scale features beforehand instead) |
| **Decision Tree Regression**        | A tree-based algorithm that predicts continuous values by splitting data based on feature values.                                     | When relationships between features are non-linear or when interpretability is important.                       | - **max_depth** (Max depth of the tree)<br>- **min_samples_split** (Minimum samples for split)<br>- **criterion** (Split quality)       |
| **Random Forest Regression**        | An ensemble method using multiple decision trees to make predictions and average their results.                                       | When there are complex feature interactions and non-linear relationships. Works well for high-dimensional data. | - **n_estimators** (Number of trees)<br>- **max_depth** (Max depth of each tree)<br>- **min_samples_split** (Minimum samples for split) |
| **Support Vector Regression (SVR)** | A regression method based on the principles of Support Vector Machine (SVM), which tries to fit the error within a certain threshold. | When the data is noisy and you need robust regression with high-dimensional data.                               | - **C** (Regularization parameter)<br>- **kernel** (Linear, RBF, etc.)<br>- **epsilon** (Epsilon for the margin of error)               |
| **K-Nearest Neighbors Regression**  | A non-parametric regression method that predicts values based on the average of k nearest neighbors.                                  | When the relationship between features and target is complex, and there's no clear functional form.             | - **n_neighbors** (Number of neighbors)<br>- **weights** (Distance weighting)<br>- **metric** (Distance metric)                         |
| **Gradient Boosting Regression**    | An ensemble method that builds models sequentially, focusing on correcting errors made by previous models.                            | When you need high accuracy for regression problems and are willing to handle more complex models.              | - **n_estimators** (Number of trees)<br>- **learning_rate** (Step size)<br>- **max_depth** (Tree depth)                                 |
| **Lasso Regression**                | A linear model that uses L1 regularization to penalize the absolute value of coefficients, helping with feature selection.            | When there is multicollinearity in the dataset, or you need to perform feature selection.                       | - **alpha** (Regularization strength)                                                                                                   |
| **Ridge Regression**                | Similar to Lasso but with L2 regularization, penalizing the square of the coefficients.                                               | When you need to prevent overfitting but don't need feature selection like Lasso.                               | - **alpha** (Regularization strength)                                                                                                   |
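
The difference between L1 and L2 regularization in the last two rows can be seen by counting coefficients that are driven exactly to zero. The sketch below assumes scikit-learn's `Lasso` and `Ridge`; the synthetic data (5 informative features out of 20) and `alpha=1.0` are illustrative choices only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: only 5 of the 20 features carry signal (assumption).
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)
X = StandardScaler().fit_transform(X)  # regularized models expect scaled features

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 sets some coefficients exactly to zero; L2 only shrinks them.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso zero coefficients:", n_zero_lasso)
print("Ridge zero coefficients:", n_zero_ridge)
```

Features whose Lasso coefficient is zero are effectively dropped from the model, which is why Lasso is described as performing feature selection.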


##### Model Training and Hyperparameter Tuning

**Training Options:**

- Train Base Model, Then Fine-tune Using Validation Set: Train a base model on the training set, then use the validation set to adjust hyperparameters and improve performance.
- Train Multiple Models on Entire Dataset: Train several models with different parameters on the full dataset without a separate validation set, and compare their performance (for example, with cross-validation).
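
The first option can be sketched as a manual loop over candidate values scored on the validation set. This is a minimal example under stated assumptions: scikit-learn, synthetic data, and a hypothetical candidate list for `C`; it is not the only way to use a validation set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab dataset (assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Fit one model per candidate value and keep the one with the
# highest validation accuracy.
best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)  # accuracy on the validation set
    if score > best_score:
        best_C, best_score = C, score

print("best C:", best_C, "validation accuracy:", round(best_score, 3))
```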

**Search Options:**

- Grid Search: Exhaustively tests all combinations of hyperparameters.
- Random Search: Randomly selects hyperparameter combinations from the search space.
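
Both search strategies can be sketched with scikit-learn's `GridSearchCV` and `RandomizedSearchCV`. The parameter ranges, the F1 scoring choice, and the synthetic data below are assumptions made for illustration; adapt them to your model and task.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Grid search: evaluates every combination (2 * 3 * 2 = 12 candidates).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="f1"
)
grid.fit(X, y)

# Random search: samples a fixed number of candidates (n_iter=8)
# from lists or distributions.
param_dist = {
    "n_estimators": randint(50, 201),
    "max_depth": [3, 5, None],
    "min_samples_split": randint(2, 11),
}
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_dist,
    n_iter=8,
    cv=3,
    scoring="f1",
    random_state=0,
)
rand.fit(X, y)

print("grid best:", grid.best_params_, round(grid.best_score_, 3))
print("random best:", rand.best_params_, round(rand.best_score_, 3))
```

Grid search cost grows multiplicatively with each added parameter, while random search cost is fixed by `n_iter`, which makes it the more practical choice for large search spaces.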


---

#### Model Evaluation

Each type of model has its own set of metrics to evaluate its performance. The model is assessed by predicting the output for the test data and then calculating the metrics by comparing the predictions to the actual outcomes.


##### Classification Model Evaluation Metrics

| **Metric**               | **Description**                                                                                                       | **When to Use**                                                                             | **How to Identify if the Model is Good**                                                             |
| ------------------------ | --------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
| **Accuracy**             | The proportion of correctly predicted instances out of the total instances.                                           | When the classes are balanced or when you just need a quick measure of overall performance. | High accuracy (close to 1) indicates a good model, but may be misleading with imbalanced data.       |
| **Precision**            | The proportion of true positive predictions out of all positive predictions made by the model.                        | Useful when false positives are costly (e.g., spam detection).                              | High precision (close to 1) means the model has few false positives.                                 |
| **Recall (Sensitivity)** | The proportion of true positives detected out of all actual positive instances.                                       | Useful when false negatives are costly (e.g., medical diagnoses).                           | High recall (close to 1) indicates that most actual positives are identified.                        |
| **F1-Score**             | The harmonic mean of precision and recall. Balances both metrics, providing a single score for performance.           | When both precision and recall are important to balance, especially in imbalanced datasets. | A high F1-score (close to 1) suggests a good balance between precision and recall.                   |
| **AUC-ROC Curve**        | The area under the Receiver Operating Characteristic curve, showing the tradeoff between sensitivity and specificity. | Useful to evaluate model performance for binary classification across various thresholds.   | AUC closer to 1 indicates a better model, with 0.5 meaning no discrimination ability.                |
| **Confusion Matrix**     | A matrix showing true positives, true negatives, false positives, and false negatives.                                | Provides detailed insight into classification errors, especially in imbalanced datasets.    | Good models will have high true positives and true negatives, and low false positives and negatives. |
| **Log Loss**             | Measures the performance of classification models where the prediction is a probability value.                        | When working with probabilistic models, especially in multi-class classification.           | Lower log loss values (close to 0) indicate better predictive accuracy.                              |
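
The metrics in the table can be computed with `sklearn.metrics`, as in the sketch below. The logistic regression model and the synthetic, mildly imbalanced data are placeholders (assumptions); note that AUC and log loss need predicted probabilities, while the other metrics use predicted labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab dataset (assumption).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)               # hard labels
y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_proba),
    "log_loss": log_loss(y_test, y_proba),
}
cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(cm)
```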


##### Regression Model Evaluation Metrics

| **Metric**                                | **Description**                                                                                 | **When to Use**                                                                | **How to Identify if the Model is Good**                                                  |
| ----------------------------------------- | ----------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------- |
| **Mean Absolute Error (MAE)**             | The average of the absolute differences between the predicted and actual values.                | When you want a simple and interpretable error metric.                         | Lower MAE values indicate better model performance, with 0 being perfect.                 |
| **Mean Squared Error (MSE)**              | The average of the squared differences between the predicted and actual values.                 | When penalizing larger errors is more important than smaller ones.             | Lower MSE values indicate better model performance.                                       |
| **Root Mean Squared Error (RMSE)**        | The square root of the mean squared error, providing the error in the same units as the target. | When you want to emphasize larger errors.                                      | Lower RMSE values suggest better model performance, with 0 being perfect.                 |
| **R-squared (R²)**                        | The proportion of variance in the dependent variable explained by the model.                    | When you want to measure how well the model explains the variance in the data. | R² close to 1 indicates a good fit, but very high values may indicate overfitting.        |
| **Adjusted R-squared**                    | Adjusted version of R-squared that accounts for the number of predictors in the model.          | When you want to compare models with different numbers of features.            | Higher values indicate better model fit after accounting for complexity.                  |
| **Mean Absolute Percentage Error (MAPE)** | Measures the average percentage error between predicted and actual values.                      | When you need an interpretable error expressed as a percentage.                | Lower MAPE values (below 10% is often considered good) indicate better model performance. |
| **Explained Variance Score**              | Measures the proportion of variance explained by the model.                                     | When you want to know how much of the data variance is captured by the model.  | Higher values closer to 1 indicate a better model.                                        |
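
A minimal sketch of computing these regression metrics, assuming scikit-learn and NumPy. The linear model and synthetic data are placeholders; adjusted R² has no dedicated scikit-learn function, so it is computed here from R² with the standard formula `1 - (1 - R²)(n - 1)/(n - p - 1)`, where `n` is the number of evaluated samples and `p` the number of features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab dataset (assumption).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
y = y + 1000.0  # shift targets away from zero, where MAPE is unstable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # same units as the target
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)  # a fraction, not a percent
evs = explained_variance_score(y_test, y_pred)

# Adjusted R-squared penalizes R-squared for the number of predictors.
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f}")
print(f"R2={r2:.4f} adjusted R2={adj_r2:.4f} MAPE={mape:.4%} EVS={evs:.4f}")
```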


#### Hands-on Activity

For the output dataset from Lab 2, perform the following tasks:

**Task 1**: Based on the statistical tests performed in Lab 2, provide a list of relevant features to be used in the model. If needed, conduct additional statistical tests. Explain your reasoning for selecting or dropping features.

_You now have two datasets: one with all features and one with selected features. Apply the following tasks to both datasets._

**Task 2**: Split both datasets into training, validation (optional), and test sets, following a reasonable ratio that aligns with best practices. Justify the proportions and approach you used for the split.

**Task 3**: For each dataset, select two algorithms, build models, train them, and tune hyperparameters using grid search or random search. Explain your rationale behind your choices.

**Task 4**: Evaluate the models on the test set using appropriate metrics for the type of problem.

**Task 5**: Compare the performance of the models on both the full-feature dataset and the feature-selected dataset.

**Task 6**: Analyze the impact of feature selection on model performance. Provide insights into whether feature selection improved the models and why.

**Task 7**: Conclude which dataset, model, and hyperparameters performed the best, based on the evaluation results.

##### Note the following:

- Where necessary, briefly explain the logic or reasoning behind each data procedure you perform.
- Write clean code, and allocate at least one code block to each task.
