# Week 7 Applying Machine Learning

Thus far, you have been introduced to various types of models. In this week's main module, you will be tasked with solving a problem using any of the tools you have learned so far. This pre-module aims to introduce you to two additional models that may be useful in tackling this challenge: Support Vector Machines (SVMs) and extreme Gradient Boosting (XGBoost). While we will not delve into the details of each model, or how it works, we will provide enough details on their hyperparameters to enable you to apply them to your dataset.

As usual, we will be using the Pima Indian Diabetes dataset:

In [None]:
from sklearn.datasets import fetch_openml
import pandas as pd
import numpy as np

# we fetch the dataset from https://www.openml.org/search?type=data&status=active&id=37
X, y = fetch_openml(data_id=37, as_frame=True, return_X_y=True)

# convert tested_positive and tested_negative to 1 and 0
y = (y == 'tested_positive').astype(int)

---

##### **Q1: Split the data into a train and test set.**

In [None]:
from sklearn.model_selection import train_test_split
### Your Code Here.
## NOTE: please use X_train, X_test, y_train, and y_test as your variable names.

---

## Support Vector Machines (SVMs)

In Week 5, you explored the basics of machine learning models like Logistic Regression. This week, we will introduce a new model: **Support Vector Machines (SVMs)**.

SVMs are supervised learning models used for **classification** and **regression** tasks. They work by finding the **hyperplane** (the multi-dimensional analogue of a line) that best separates data points from different classes in a dataset. For example, in a biological context, an SVM could classify whether a gene is active or inactive based on certain gene expression features.

---

### Key Definitions:

- **Hyperplane**: A hyperplane is the decision boundary that separates different classes. For a 2D dataset, it is a line; for a 3D dataset, it is a plane; and in higher dimensions, it is generalized to a hyperplane.  
- **Support Vectors**: These are the data points closest to the hyperplane. They “support” the hyperplane and play a crucial role in defining its position.  
- **Margin**: The margin is the distance between the hyperplane and the closest data points from each class. SVMs aim to maximize this margin for better generalization.
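
To make these definitions concrete, here is a minimal sketch on a made-up 2D dataset. The points, the variable names (`X_toy`, `toy_svm`), and the large `C` value are assumptions chosen for illustration; none of them come from the course dataset. The sketch fits a linear SVM and reads off the support vectors, the hyperplane, and the margin, which for a linear SVM equals 1 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data (made up for illustration): two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                  [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel with a large C behaves approximately like a hard-margin SVM
toy_svm = SVC(kernel='linear', C=1000.0)
toy_svm.fit(X_toy, y_toy)

w = toy_svm.coef_[0]                 # normal vector of the hyperplane
b = toy_svm.intercept_[0]            # offset of the hyperplane
margin = 1.0 / np.linalg.norm(w)     # distance from the hyperplane to the closest points

print("Support vectors:\n", toy_svm.support_vectors_)
print("Hyperplane: %.2f*x1 + %.2f*x2 + %.2f = 0" % (w[0], w[1], b))
print("Margin on each side: %.3f" % margin)
```

Only the points listed in `support_vectors_` determine where the hyperplane sits; moving any of the other points slightly would leave the boundary unchanged.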

![hyperplane](hpplane.png)





### Kernel Trick for Non-Linear Data

As you well know by now, not all data can be neatly separated by a straight line. SVMs employ functions known as kernels, which let the model behave as if the data had been projected into a higher-dimensional space where it becomes separable. One commonly used kernel is the Radial Basis Function (RBF), which scores how similar two points are based on the distance between them. Its effect is comparable to adding extra features to the data: for instance, a feature equal to the distance of a point from the centre of a circle, which allows the model to separate classes that are not linearly separable.

![kernel_trick](kernel_Trick.png)
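
The idea can be sketched on synthetic data. The example below is an illustration only: it uses scikit-learn's `make_circles` to generate an inner circle surrounded by a ring (not the diabetes dataset), and the variable names are assumptions. It compares a linear SVM, a linear SVM given a hand-made "squared distance from the centre" feature, and an RBF SVM on the original two features.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data: an inner circle (class 1) surrounded by an outer ring (class 0)
X_circ, y_circ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# A straight line cannot separate a ring from the circle inside it
linear_svm = SVC(kernel='linear').fit(X_circ, y_circ)
linear_acc = linear_svm.score(X_circ, y_circ)
print("Linear kernel accuracy:", linear_acc)

# Hand-made extra feature: squared distance from the origin
r2 = (X_circ ** 2).sum(axis=1, keepdims=True)
X_lifted = np.hstack([X_circ, r2])
lifted_svm = SVC(kernel='linear').fit(X_lifted, y_circ)
lifted_acc = lifted_svm.score(X_lifted, y_circ)
print("Linear kernel + distance feature:", lifted_acc)

# The RBF kernel achieves a similar effect without building the feature by hand
rbf_svm = SVC(kernel='rbf').fit(X_circ, y_circ)
rbf_acc = rbf_svm.score(X_circ, y_circ)
print("RBF kernel accuracy:", rbf_acc)
```

The accuracies here are measured on the training data, which is enough to show separability but says nothing about generalization.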

### Parameters to Tune

The SVM has three primary parameters that can be tuned:
1. `max_iter`: Similar to Logistic Regression, this parameter caps the number of iterations the solver performs during training. In scikit-learn the default of `-1` means no limit.
2. `kernel`: This parameter determines the kernel function to be used. Common choices include `linear`, `rbf` (radial basis function), and `poly` (polynomial, which by default adds terms up to the cubic).
3. `C`: This parameter, which must be positive (the scikit-learn default is 1.0), controls how strongly the model is penalized for misclassifying training points. Higher values of C make the hyperplane follow the training data more closely, at the cost of a narrower margin; lower values tolerate more training errors in exchange for a wider margin. If your model is overfitting, reducing the value of C may help by making the hyperplane more robust to noisy data points.
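
One rough way to see what `C` does is to train the same model with several values and compare training and testing accuracy. The sketch below uses synthetic data from scikit-learn's `make_moons` rather than the diabetes dataset, and the variable names (`Xd_train`, `c_results`, and so on) and the three `C` values are assumptions picked for illustration. Note that scikit-learn accepts any positive value for `C`.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, noisy two-class data (not the diabetes dataset)
X_demo, y_demo = make_moons(n_samples=400, noise=0.3, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

c_results = {}
for C in [0.01, 1.0, 100.0]:
    svm = SVC(kernel='rbf', C=C).fit(Xd_train, yd_train)
    # store (training accuracy, testing accuracy) for this value of C
    c_results[C] = (svm.score(Xd_train, yd_train), svm.score(Xd_test, yd_test))
    print(f"C={C:>6}: train={c_results[C][0]:.3f}, test={c_results[C][1]:.3f}")
```

A growing gap between training and testing accuracy as `C` increases is a hint that the model is starting to overfit.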

Below, we provide an example of training an SVM on the diabetes dataset:

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

model = SVC(max_iter=1000, # at most 1000 solver iterations
            kernel='rbf', # Radial Basis Function kernel
            C=1.0) # default penalty for misclassified training points
model.fit(X_train, y_train)


---

##### **Q2: Make predictions on the train and test set, and provide the training and testing accuracy.**

In [None]:
# make predictions
train_predictions = ... # model.predict()
test_predictions = ...

# determine accuracy
train_accuracy = ... # accuracy_score()
test_accuracy = ...

---


## XGBoost

In Week 6, you learned about decision trees. This week, we will introduce **XGBoost**, which uses decision trees as building blocks for a much more robust model.

XGBoost, short for e**X**treme **G**radient **Boost**ing, is a supervised learning algorithm that builds a strong prediction model by combining many weak models (typically decision trees). It is widely used for structured/tabular data in both **classification** and **regression** tasks.


### Boosting

XGBoost operates by training multiple decision trees. To illustrate, imagine playing darts and aiming for a target. With a single throw, you are likely to miss the target. However, with a second throw, you can adjust based on the initial miss, increasing your chances of hitting the target.

XGBoost functions similarly. It begins by training one decision tree and making predictions on the training set. Subsequently, it trains another tree to correct and adjust these predictions. The combined predictions from the two trees are then analyzed to identify the remaining errors, and another tree is trained to address them. This process is repeated until the requested number of trees has been built.
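
The loop described above can be sketched by hand. This is a simplified illustration of plain gradient boosting for a regression problem, not XGBoost's actual implementation (which adds regularization and other refinements). The synthetic data, the variable names, and the chosen `learning_rate` are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1D regression data (made up for illustration)
rng = np.random.default_rng(0)
X_toy = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.5
prediction = np.zeros_like(y_toy)   # start from a prediction of zero everywhere
errors = []

for round_number in range(20):
    residual = y_toy - prediction                # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)    # a deliberately weak tree
    tree.fit(X_toy, residual)                    # learn to predict the remaining error
    prediction += learning_rate * tree.predict(X_toy)   # nudge the combined prediction
    errors.append(np.mean((y_toy - prediction) ** 2))   # training MSE after this round

print("MSE after round 1: %.4f" % errors[0])
print("MSE after round 20: %.4f" % errors[-1])
```

Each tree on its own is a poor model, but because every new tree targets what the previous ones missed, the training error shrinks round by round.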


![training](boosting.png)

---

### Key Definitions:

- **Boosting**: Boosting is an ensemble method that combines the predictions of several weak models (e.g., shallow decision trees) to form a strong model.  
- **Learning Rate**: This parameter determines the extent to which each tree contributes to the final model and adjusts the predictions of the preceding tree.



### Parameters to Tune

Some commonly tuned XGBoost parameters are:
1. `n_estimators`: The number of trees to build (one per boosting round).
2. `max_depth`: Controls the maximum depth of each individual tree.
3. `learning_rate`: As mentioned above, this parameter controls how much each tree updates the predictions of the previous trees.
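
To get a feel for how these three parameters interact, the sketch below compares a few settings on synthetic data. To avoid requiring an extra install, it uses scikit-learn's `GradientBoostingClassifier` as a stand-in: it is a different implementation from XGBoost, but it accepts parameters with the same three names. The data, variable names, and settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data (not the diabetes dataset)
X_syn, y_syn = make_classification(n_samples=500, n_features=8,
                                   n_informative=4, random_state=0)
Xg_train, Xg_test, yg_train, yg_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0)

settings = [
    {"n_estimators": 10, "max_depth": 1, "learning_rate": 0.1},   # few, very shallow trees
    {"n_estimators": 100, "max_depth": 2, "learning_rate": 0.1},  # moderate setting
    {"n_estimators": 300, "max_depth": 6, "learning_rate": 0.3},  # many deep trees, large steps
]

gb_results = []
for params in settings:
    gb = GradientBoostingClassifier(random_state=0, **params).fit(Xg_train, yg_train)
    # store (training accuracy, testing accuracy) for this setting
    gb_results.append((gb.score(Xg_train, yg_train), gb.score(Xg_test, yg_test)))
    print(params, "train=%.3f test=%.3f" % gb_results[-1])
```

Larger, deeper ensembles tend to fit the training set more closely, so it is the testing accuracy that indicates whether the extra capacity actually helps.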

Below, we provide an example of training an XGBoost model on the diabetes dataset:

In [None]:
# we often first need to install xgboost,
# which we will do with the command below
!pip install xgboost

from xgboost import XGBClassifier

xgb = XGBClassifier(
        n_estimators=100, # 100 trees
        max_depth=2, # each tree only has a depth of 2
        learning_rate=0.1, # update scale
)

xgb.fit(X_train, y_train)

---

##### **Q3: Make predictions on the train and test set using XGBoost, and provide the training and testing accuracy.**

In [None]:
# make predictions
train_predictions = ... # xgb.predict()
test_predictions = ...

# determine accuracy
train_accuracy = ... # accuracy_score()
test_accuracy = ...

---

### Graded Question:





##### **GQ1: Compare the performance of the decision tree model created in the first half of this module with the model of your choice developed in the second half. Which model performed better? (1 mark) Justify your reasoning using model evaluation metrics (1 mark) and explain the reasons for any observed differences (1 mark).**

*Your Answer Here*

## Conclusion
This week, you were introduced to two powerful and widely used machine learning models: SVMs and XGBoost. You gained practical experience in training and evaluating them. In the next module, you will apply them independently to solve a biological case study problem.