<a href="https://colab.research.google.com/github/EndangSupriyadi/GCI_GLOBAL_2025/blob/master/HW6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 6: Model Evaluation & Feature Engineering

## Mission
Create a breast cancer classifier with a ROC-AUC score above 0.99.

## Task
In this homework, your task is to build a breast cancer classification model that predicts whether a tumor with the given properties is benign (`1`) or malignant (`0`), with a ROC-AUC score above 0.99.

Before building the classifier, we first load the Breast Cancer Wisconsin data from scikit-learn as `X` (`pd.DataFrame`, features) and `y` (`pd.Series`, target). Then we split the data into train and test sets (`X_train`, `y_train`, `X_test`, `y_test`) using `train_test_split` with `test_size=0.2`, `random_state=42`, and `stratify=y`. These steps are done **before** the solution cell where you write your answer `homework()`, and you **do not** need to include them in your answer.

In `homework()`, include the following steps 1-4.

1. **Define your model**: Build your classifier as an SVM using scikit-learn's `SVC` class.
  - We recommend standardizing the features with `StandardScaler()` before applying the SVM.
  - Because Step 3 needs predicted probabilities, pass `probability=True` as an argument when initializing `SVC`.

2. **Determine hyperparameters**: Run a grid search with five-fold cross-validation on `X_train` and `y_train` to determine the hyperparameters of your SVM model.
  - Search `C` over `[0.5, 1, 2, 4, 8]` and `gamma` over `[0.01, 0.1, 1.0]`. Choose the best `C` and `gamma` by ROC-AUC.
  - Use `GridSearchCV`. Passing `scoring="roc_auc"` as an argument makes the grid search select the hyperparameters that maximize the ROC-AUC score.

3. **Output the prediction & ROC-AUC**: With the model you've found, predict the probability of being benign (class `1`) for `X_test`, the test data, and compute the ROC-AUC score on it. Aim for a good score!

4. (Optional) **Try other classification models and ensemble**: Try other models like logistic regression, random forest, etc. Then build an ensemble model with voting or other methods combining the base models.
<!-- Evaluate with the same CV setup on the training set, then **fit on the full training set** and report **test** ROC AUC and **test** Accuracy. -->
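
For the optional step 4, one possible approach is soft voting, which averages the predicted probabilities of several base models. The sketch below is only an illustration: the choice of base models, their hyperparameters, and the names `build_ensemble`, `ensemble`, `ensemble_proba`, and `ensemble_auc` are assumptions, not part of the assignment.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_ensemble(seed=42):
    """Hypothetical helper: soft-voting ensemble of three base models."""
    svc = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", SVC(kernel="rbf", probability=True, random_state=seed)),
    ])
    logreg = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000, random_state=seed)),
    ])
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    # voting="soft" averages the predict_proba outputs of the base models
    return VotingClassifier(
        estimators=[("svc", svc), ("logreg", logreg), ("forest", forest)],
        voting="soft",
    )


# Same loading and split settings as described in the task
data = load_breast_cancer(as_frame=True)
X = data.frame.drop(columns=["target"])
y = data.frame["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

ensemble = build_ensemble()
ensemble.fit(X_train, y_train)
ensemble_proba = ensemble.predict_proba(X_test)[:, 1]  # column 1 = class 1 (benign)
ensemble_auc = roc_auc_score(y_test, ensemble_proba)
print(ensemble_auc)
```

Whether an ensemble beats a tuned single SVM on this dataset is not guaranteed, so compare both with the same cross-validation setup before choosing.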

## Inputs/Outputs of `homework()`
+ Inputs:
    - `X_train` (`pd.DataFrame`)
    - `y_train` (`pd.Series`)
    - `X_test` (`pd.DataFrame`)
    - `y_test` (`pd.Series`)
+ Outputs:
    - `proba` (`np.ndarray`): array with probabilities of being benign on test data
    - `auc` (`float`): ROC-AUC score on test data

**Example of Expected Output:**
```python
proba, auc = homework(X_train, y_train, X_test, y_test)
print(proba)
> [5.88824186e-08 9.99988665e-01 6.41082462e-03 ...]
print(auc)
> 0.996492...
```
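
Since `proba` should hold the probability of class `1` (benign), it helps to know that `predict_proba` returns one column per class, ordered as in the fitted model's `classes_` attribute. The following minimal sketch uses made-up toy data and a `LogisticRegression` model purely for illustration; the names `X_toy`, `y_toy`, `positive_col`, `proba_toy`, and `auc_toy` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Toy data (illustrative only): one feature, labels 0 and 1
X_toy = np.array([[0.0], [0.5], [1.0], [2.0], [2.5], [3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X_toy, y_toy)

# predict_proba has one column per class, ordered as in model.classes_
positive_col = list(model.classes_).index(1)
proba_toy = model.predict_proba(X_toy)[:, positive_col]

# roc_auc_score expects the scores of the positive class (label 1)
auc_toy = float(roc_auc_score(y_toy, proba_toy))
print(model.classes_, proba_toy.shape, auc_toy)
```

Looking up the column through `classes_` avoids silently scoring the wrong class, which would flip the ROC-AUC to one minus its true value.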

## Hints
+ You can combine the standardization and the SVM classifier with scikit-learn's `Pipeline` class, for example:
  ```python
  pipe_svc = Pipeline([
      ("scaler", StandardScaler()),
      ("clf", SVC(kernel="rbf", probability=True, random_state=42)),
  ])
  ```
+ To set the hyperparameter grid, you can write:
  ```python
  param_svc = {"clf__C": [0.5, 1, 2, 4, 8], "clf__gamma": [0.01, 0.1, 1.0]}
  ```
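
Putting the two hints together, a grid search might look like the following sketch. It runs on a synthetic dataset from `make_classification` as a stand-in so that it is self-contained; the dataset and the names `X_demo`, `cv`, `search`, `best_model`, and `proba_demo` are assumptions made for illustration. The `gamma` values are the ones listed in the task description.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the homework dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

pipe_svc = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SVC(kernel="rbf", probability=True, random_state=42)),
])
# "clf__C" means: parameter C of the pipeline step named "clf"
param_svc = {"clf__C": [0.5, 1, 2, 4, 8], "clf__gamma": [0.01, 0.1, 1.0]}

# Five stratified folds; scoring="roc_auc" ranks candidates by mean CV ROC-AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe_svc, param_svc, scoring="roc_auc", cv=cv)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is refit on all of X_tr
best_model = search.best_estimator_
proba_demo = best_model.predict_proba(X_te)[:, 1]
print(search.best_params_, roc_auc_score(y_te, proba_demo))
```

Keeping the scaler inside the pipeline means it is refit on each training fold, so no statistics from the validation fold leak into the standardization.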

## Leaderboard
The fun part of today's HW: the `auc` value from this homework goes on the leaderboard! When you submit your answer to Omnicampus, you will be able to see a leaderboard of `auc` scores. Aim for a high score!

## Submission Guidelines
When submitting your solution, only submit the entire `homework()` function. Submit by selecting this week's assignment in the Omnicampus homework section, pasting the function into the submission area, and then clicking [Submit Python Code].

Please pay attention to the following points when submitting.
- Erase the `# ! WRITE ME !` placeholder when submitting
- Write your answer as one function


## Deadline

Wed, Nov 5th, 20:00 JST (GMT+9)

## 1. Importing Libraries

```python
# Please do not modify this cell
import numpy as np
import pandas as pd

import sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
```

## 2. Load the Breast Cancer data

```python
# Please do not modify this cell
data = load_breast_cancer(as_frame=True)
X = data.frame.drop(columns=['target'])
y = data.frame['target']
```
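
Before modeling, it can help to check the shape of the data and how the labels are encoded. This is a minimal, optional sketch that uses only the loader from the cell above:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X = data.frame.drop(columns=["target"])
y = data.frame["target"]

print(X.shape)                  # (n_samples, n_features)
print(list(data.target_names))  # label names, indexed by target value
print(y.value_counts())         # number of samples per class
```

The position of each name in `target_names` is its numeric label, which confirms that `0` is malignant and `1` is benign.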

## 3. Split the data

```python
# Please do not modify this cell
seed = 42

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=seed, stratify=y
)
```
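
The `stratify=y` argument keeps the class ratio nearly the same in the train and test sets. An optional sketch to confirm this, repeating the load and split so that it runs on its own:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X = data.frame.drop(columns=["target"])
y = data.frame["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Split sizes, then the fraction of class 1 (benign) overall and per split
print(len(y_train), len(y_test))
print(y.mean(), y_train.mean(), y_test.mean())
```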

## 4. Solution

```python
def homework(X_train, y_train, X_test, y_test):
    # ! WRITE ME !

    return proba, auc
```

## Try your code output
Run the cell below to test your code's output.

**Note**:
The score is evaluated on training and test data generated with three hidden seed values. You need to get `auc` > 0.99 for at least two of the three seeds. Thus, **you might get a score of -1 even though the output of the next cell is >0.99**. In that case, you need to improve your model.
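
Because grading uses hidden seeds, it may be worth checking your function on several splits locally. The sketch below is one guess at how such a check could look. It assumes the seed controls the train/test split, which the note does not state explicitly. The helper `count_passing_seeds`, the seed values `[0, 1, 2]`, and the placeholder `baseline_homework` are all hypothetical; replace the placeholder with your own `homework`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def baseline_homework(X_train, y_train, X_test, y_test):
    """Placeholder with the homework() signature (not an SVM solution)."""
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    return proba, roc_auc_score(y_test, proba)


def count_passing_seeds(homework_fn, X, y, seeds, threshold=0.99):
    """Re-split the data for each seed and count AUCs above the threshold."""
    aucs = []
    for seed in seeds:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y
        )
        _, auc = homework_fn(X_tr, y_tr, X_te, y_te)
        aucs.append(float(auc))
    return sum(a > threshold for a in aucs), aucs


data = load_breast_cancer(as_frame=True)
X = data.frame.drop(columns=["target"])
y = data.frame["target"]

# Arbitrary example seeds; the grader's seeds are hidden
n_pass, aucs = count_passing_seeds(baseline_homework, X, y, seeds=[0, 1, 2])
print(n_pass, aucs)
```

A model that clears the threshold on most local splits is more likely, though not certain, to pass on the hidden ones.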

```python
proba, auc = homework(X_train, y_train, X_test, y_test)
print(auc)
```