## Gradient Boosting Classifier


In this challenge, we’ll use Gradient Boosting to classify tumors as either benign or malignant. For this exercise, we'll use the breast cancer dataset that ships with the scikit-learn library.

import pandas as pd
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)

## Visualize the data

As usual, get comfortable with the data first: try to understand the meaning of each feature (column) and have a look at the target column. You should also make a few plots to visualize the data.

Answer the following questions:
1. What is the type of each feature column?
2. How many features do we have for each sample?
3. How many samples do we have in the dataset?
4. Is this a classification or a regression problem? If classification, what are the different categories?
5. How many malignant tumors do we have in the dataset?
6. How many benign tumors?
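One possible way to start answering these questions is sketched below. It only uses standard pandas calls (`dtypes`, `shape`, `value_counts`) on the `X` and `y` objects built earlier; the variable names `n_samples`, `n_features` and `class_counts` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)

# Summary of the column types found among the features
print(X.dtypes.value_counts())

# Number of samples (rows) and features (columns)
n_samples, n_features = X.shape
print(f"{n_samples} samples, {n_features} features")

# Categories of the target and how many samples fall into each
class_counts = pd.Series(y).value_counts()
print(class_counts)
```

`X.describe()` and `X.hist()` are natural next steps if you want summary statistics or a quick look at the feature distributions.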

## Split and "clean" the data

As you should have noticed, the target is **categorical**. So we should start by converting these categories into numbers, using the `LabelEncoder` class of scikit-learn. For this problem, we’ll set malignant to 1 and benign to 0.

Split the data 60/40 between training and test sets using **train_test_split** from scikit-learn. Add the option `random_state=1`.
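A minimal sketch of these two steps follows. `LabelEncoder` sorts the class names alphabetically, so benign maps to 0 and malignant to 1, which matches the convention above. Converting `y` with `np.asarray` before encoding is a defensive choice made here, not a requirement stated in the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)

# Encode the class names as integers (classes are sorted alphabetically)
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(np.asarray(y))
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))

# 60% of the samples for training, 40% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.4, random_state=1
)
print(X_train.shape, X_test.shape)
```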

## Create, train, predict and measure with Gradient Boosting

You should be familiar with these steps by now:
- initialize your model
- train ("fit") your model on the training data
- make predictions on the test data
- evaluate the performance of your model

We will use `max_depth=1`. As explained before, we want weak (but fast) learners, and `max_depth=1` is a good way to get them. Basically, we tell our model that we’d like our forest to be composed of trees with a single decision node and two leaves (so-called decision stumps).

**Quick reminder**: `n_estimators` specifies the number of trees in our forest.

For this part of the exercise, you have to:
- create an instance of `GradientBoostingClassifier` with `max_depth=1` and 50 trees (`n_estimators=50`); the classifier builds its own decision trees internally, so you do not need to pass it a `DecisionTreeClassifier`
- train your classifier on the training data
- predict whether a tumor is malignant or benign on the test set
- evaluate the model using a **confusion matrix**. Note: do not use this method blindly. False positives, true negatives, etc. are important concepts in machine learning.
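The steps above can be sketched as follows. This is one possible solution, not the only one; the `random_state=1` passed to the classifier is an assumption added here for reproducibility and is not required by the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)
y_encoded = LabelEncoder().fit_transform(np.asarray(y))  # benign -> 0, malignant -> 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.4, random_state=1
)

# 50 decision stumps (trees with one split), boosted sequentially
model = GradientBoostingClassifier(max_depth=1, n_estimators=50, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are the true classes, columns the predicted classes (0 = benign, 1 = malignant)
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

With malignant encoded as 1, a false positive is a benign tumor predicted as malignant, and a false negative is a malignant tumor predicted as benign.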



Can you answer these questions?
1. How many false positives do we have? What does that mean in this context?
2. How many false negatives do we have? What does that mean in this context?
3. How many true negatives do we have? What does that mean in this context?
4. What are the indexes of the false positives?
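For the last question, boolean masks are one convenient way to recover the indexes. The sketch below uses small made-up arrays (the values and index labels are hypothetical, chosen only to illustrate the logic); with the real data you would apply the same masks to your test labels and predictions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values: true labels with their original row indexes, and predictions
y_true = pd.Series([0, 1, 0, 0, 1], index=[12, 45, 7, 301, 88])
y_hat = np.array([0, 1, 1, 0, 0])

# False positive: predicted malignant (1) while the tumor is benign (0)
false_positive_mask = (y_hat == 1) & (y_true.to_numpy() == 0)
# False negative: predicted benign (0) while the tumor is malignant (1)
false_negative_mask = (y_hat == 0) & (y_true.to_numpy() == 1)

fp_index = list(y_true.index[false_positive_mask])
fn_index = list(y_true.index[false_negative_mask])
print(fp_index)  # [7]
print(fn_index)  # [88]
```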



## Optimise our classifier further

### Find the optimal number of trees

We can plot the error against the number of estimators to find how many trees are needed to get the best performance for our current problem. As the number of trees increases, you should see the error flatten out and reach a plateau.

**👉Plot the training error of the Gradient Boosting model as a function of the number of trees to find the optimal number of trees needed**

Hint: you can access the predictions made at each boosting stage through the `staged_predict` method of the Gradient Boosting class.
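A possible sketch of this plot is given below. `staged_predict` yields the predictions of the ensemble after each additional tree, so one error value can be computed per stage. The choice of 200 trees, the extra test-error curve, and the classifier's `random_state=1` are assumptions made here for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)
y_encoded = LabelEncoder().fit_transform(np.asarray(y))
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.4, random_state=1
)

# Fit more trees than we expect to need, then inspect every intermediate stage
model = GradientBoostingClassifier(max_depth=1, n_estimators=200, random_state=1)
model.fit(X_train, y_train)

# Misclassification rate after 1, 2, ..., n_estimators trees
train_errors = [np.mean(stage != y_train) for stage in model.staged_predict(X_train)]
test_errors = [np.mean(stage != y_test) for stage in model.staged_predict(X_test)]
n_trees = np.arange(1, len(train_errors) + 1)

plt.plot(n_trees, train_errors, label="training error")
plt.plot(n_trees, test_errors, label="test error")
plt.xlabel("Number of trees")
plt.ylabel("Misclassification rate")
plt.legend()
plt.show()
```

The training error alone keeps decreasing as trees are added, which is why the test curve is drawn next to it: the point where the test error stops improving is a more useful guide for choosing `n_estimators`.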

**We could feed the output of this classifier to a logistic regression** in order to improve the decision. **The leaf each sample lands in, for each tree, becomes a new feature for that sample**, encoded with what is called a "one-hot" encoding.

👉Look at [the `apply` method of the gradient boosting class](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier.apply)

The idea is to use this `apply` method to get new features, which we can then encode as one-hot vectors.

**👉Fit and transform a one-hot encoder over the leaf position of every entry in the dataset**

**👉Use those new one-hot encoded features as input for a logistic regression model, and compare the score with the previous score**
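The sketch below shows one way to chain these pieces. For binary classification, `apply` returns an array of shape `(n_samples, n_estimators, 1)`, hence the `[:, :, 0]` slice. Fitting the encoder on the training leaves only, `handle_unknown="ignore"`, `max_iter=1000` and the classifier's `random_state=1` are choices made here, not requirements of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)
y_encoded = LabelEncoder().fit_transform(np.asarray(y))
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.4, random_state=1
)

gbc = GradientBoostingClassifier(max_depth=1, n_estimators=50, random_state=1)
gbc.fit(X_train, y_train)

# Index of the leaf each sample ends up in, one column per tree
leaves_train = gbc.apply(X_train)[:, :, 0]
leaves_test = gbc.apply(X_test)[:, :, 0]

# One binary column per (tree, leaf) pair
one_hot = OneHotEncoder(handle_unknown="ignore")
Z_train = one_hot.fit_transform(leaves_train)
Z_test = one_hot.transform(leaves_test)

# Logistic regression trained on the leaf-based features
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(Z_train, y_train)

gbc_score = gbc.score(X_test, y_test)
lr_score = log_reg.score(Z_test, y_test)
print(f"Gradient Boosting accuracy: {gbc_score:.3f}")
print(f"Gradient Boosting + logistic regression accuracy: {lr_score:.3f}")
```

Here both models are trained on the same samples for simplicity; training the logistic regression on a separate subset of the training data is a common way to limit overfitting.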

This additional prediction layer can be useful for more complex modeling problems.

## ROC Curve

ROC curves are extensively used for classification problems.

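You can plot the ROC curves of both models (Gradient Boosting alone, and Gradient Boosting followed by logistic regression), zoomed in on the region where the curves differ.

Hint: see [how to plot the ROC curve](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html), with the False Positive Rate (FPR) on the x-axis and the True Positive Rate (TPR) on the y-axis.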
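A minimal sketch of such a plot, using `roc_curve` and `auc` from `sklearn.metrics`, could look like the following. The zoom window (`xlim` and `ylim`), the classifier's `random_state=1` and the variable names are assumptions made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)
y_encoded = LabelEncoder().fit_transform(np.asarray(y))
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.4, random_state=1
)

# Model 1: Gradient Boosting alone
gbc = GradientBoostingClassifier(max_depth=1, n_estimators=50, random_state=1)
gbc.fit(X_train, y_train)

# Model 2: logistic regression on the one-hot encoded leaf positions
one_hot = OneHotEncoder(handle_unknown="ignore")
Z_train = one_hot.fit_transform(gbc.apply(X_train)[:, :, 0])
Z_test = one_hot.transform(gbc.apply(X_test)[:, :, 0])
log_reg = LogisticRegression(max_iter=1000).fit(Z_train, y_train)

# ROC curves are built from the predicted probability of the positive class (malignant)
fpr_gbc, tpr_gbc, _ = roc_curve(y_test, gbc.predict_proba(X_test)[:, 1])
fpr_lr, tpr_lr, _ = roc_curve(y_test, log_reg.predict_proba(Z_test)[:, 1])
auc_gbc = auc(fpr_gbc, tpr_gbc)
auc_lr = auc(fpr_lr, tpr_lr)

plt.plot(fpr_gbc, tpr_gbc, label=f"GB (AUC = {auc_gbc:.3f})")
plt.plot(fpr_lr, tpr_lr, label=f"GB + LR (AUC = {auc_lr:.3f})")
plt.plot([0, 1], [0, 1], "k--", label="chance")
plt.xlim(0, 0.2)   # zoom in on the top-left corner
plt.ylim(0.8, 1)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="lower right")
plt.show()
```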


NB: Here the two models give nearly identical results because the dataset is small, but encoding new features from a boosting classifier to feed another, simpler estimator can be very powerful!