# Exploring Data

In [None]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

adult_census.head()

# Getting counts for the target column
# Check if there is a class imbalance
target_column = 'class'
data = adult_census.drop(columns=[target_column])
target = adult_census[target_column]
target.value_counts()

## Visualising the Data

In [None]:
_ = adult_census.hist(figsize=(20, 14))

Using a `pairplot` can quickly visualise the distribution of each variable (on the diagonal) and the pairwise relationships between the variables (on the off-diagonals).

In [None]:
import seaborn as sns

# We will plot a subset of the data to keep the plot readable and make the
# plotting faster
n_samples_to_plot = 5000
columns = ['age', 'education-num', 'hours-per-week']
_ = sns.pairplot(data=adult_census[:n_samples_to_plot], vars=columns,
                 hue=target_column, plot_kws={'alpha': 0.2},
                 height=3, diag_kind='hist', diag_kws={'bins': 30})

# Diagram Mode for sklearn

In [None]:
from sklearn import set_config

set_config(display="diagram")

# Feature Scaling
[Reasons for Feature Scaling:](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html)
- Models that rely on the distance between a pair of samples, for instance k-nearest neighbors, should be trained on normalized features to make each feature contribute approximately equally to the distance computations.
- Many models such as logistic regression use a numerical solver (based on gradient descent) to find their optimal parameters. This solver converges faster when the features are scaled.
- Models that use regularization, for instance logistic regression, need normalized features so that the penalty applies evenly to all the weights.
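
To make the distance argument concrete, here is a minimal sketch on made-up values (the two features, age and income, are hypothetical and not taken from the census data). It compares the Euclidean distance between two samples before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: age (years), income (dollars)
X = np.array([[25.0, 20_000.0], [40.0, 55_000.0], [60.0, 120_000.0]])

# Before scaling, the distance is dominated by the income column
raw_distance = np.linalg.norm(X[0] - X[1])

# After standardization, each column has zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
scaled_distance = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(f"raw distance: {raw_distance:.1f}, scaled distance: {scaled_distance:.2f}")
```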

# Handling Categorical/Numerical Variables

## Strategies to Encode Categories

### Encoding Ordinal Categories
The `OrdinalEncoder` will encode each category with a different number.

**Cons:** If a categorical variable does not carry any meaningful order information, then this encoding might be misleading to downstream statistical models.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

categorical_columns = data.select_dtypes(["object"])
numerical_columns = data.select_dtypes(["integer", "float"])

encoder = OrdinalEncoder()
cols_encoded = encoder.fit_transform(categorical_columns)

# To check the mapping between each category and the encoding
encoder.categories_

### Encoding Nominal Categories (without assuming any order)
`OneHotEncoder` is an alternative encoder that prevents the downstream models from making a false assumption about the ordering of categories. For a given feature, it will create as many new columns as there are possible categories. For a given sample, the value of the column corresponding to the category will be set to `1` while all the columns of the other categories will be set to `0`.

**Cons:** Creates many additional columns to denote each possible category value.

In [None]:
from sklearn.preprocessing import OneHotEncoder

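# `sparse_output` was named `sparse` before scikit-learn 1.2
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
cols_encoded = encoder.fit_transform(categorical_columns)

# One name per generated column, of the form "<column>_<category>"
# (`get_feature_names_out` replaced `get_feature_names` in scikit-learn 1.0)
feature_names = encoder.get_feature_names_out(categorical_columns.columns)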
data_encoded = pd.DataFrame(cols_encoded, columns=feature_names)
data_encoded

### Choosing an Encoding strategy
Choosing an encoding strategy will depend on the underlying models and the type of categories (i.e. ordinal vs. nominal).

Indeed, using an `OrdinalEncoder` will output ordinal categories. It means that there is an order in the resulting categories (e.g. 0 < 1 < 2). The impact of violating this ordering assumption depends on the downstream models. Linear models will be impacted by misordered categories while tree-based models will not be.

Thus, in general `OneHotEncoder` is the encoding strategy used when the downstream models are **linear models** while `OrdinalEncoder` is used with **tree-based models**.

You still can use an `OrdinalEncoder` with linear models but you need to be sure that:

- the original categories (before encoding) have an ordering
- the encoded categories follow the same ordering as the original categories. The next exercise highlights the issue of misusing `OrdinalEncoder` with a linear model.

Also, there is no need to use a `OneHotEncoder` with a tree-based model, even if the original categories do not have a given order. It will be the purpose of the final exercise of this sequence.
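
As a small illustration of the second condition, the sketch below uses a made-up size feature (S < M < L, not part of the census data). By default `OrdinalEncoder` sorts categories lexicographically, so the intended order has to be passed through `categories`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature with a natural order S < M < L
sizes = np.array([["M"], ["S"], ["L"], ["S"]])

# Default behaviour: categories are sorted as L, M, S, which breaks the order
default_codes = OrdinalEncoder().fit_transform(sizes).ravel()

# Passing the categories explicitly preserves S < M < L
ordered_encoder = OrdinalEncoder(categories=[["S", "M", "L"]])
ordered_codes = ordered_encoder.fit_transform(sizes).ravel()

print(default_codes, ordered_codes)
```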

### Handling Unknown Categories
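A rare category may be absent from the training data, in which case the encoder fails when it meets that category at prediction time. In scikit-learn, there are two solutions to bypass this issue: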

- list all the possible categories and provide it to the encoder via the keyword argument `categories`
- use the parameter `handle_unknown`
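
A minimal sketch of both options on a made-up colour feature, where "blue" only shows up at prediction time. With `handle_unknown="ignore"` the unseen category is encoded as a row of zeros; with an explicit `categories` list it gets its own column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["red"], ["green"], ["red"]])
test = np.array([["green"], ["blue"]])  # "blue" was never seen during fit

# Option 1: ignore unknown categories (encoded as all zeros)
ignoring_encoder = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded_ignore = ignoring_encoder.transform(test).toarray()

# Option 2: declare every possible category up front
listing_encoder = OneHotEncoder(categories=[["blue", "green", "red"]]).fit(train)
encoded_listed = listing_encoder.transform(test).toarray()

print(encoded_ignore)
print(encoded_listed)
```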

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"), 
    LogisticRegression(max_iter=500)
)

model.get_params()

In [None]:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(model, categorical_columns, target)
cv_results

In [None]:
scores = cv_results["test_score"]
print(f"The accuracy is: {scores.mean():.3f} +/- {scores.std():.3f}")

### Dispatch columns to a specific processor

Scikit-learn provides a `ColumnTransformer` class which will send specific
columns to a specific transformer, making it easy to fit a single predictive
model on a dataset that combines both kinds of variables together
(heterogeneously typed tabular data).

We first define the columns depending on their data type:

* **one-hot encoding** will be applied to categorical columns. Besides, we
  use `handle_unknown="ignore"` to solve the potential issues due to rare
  categories.
* **numerical scaling** will be applied to numerical features, which will be
  standardized.

Now, we create our `ColumnTransformer` by specifying three values:
the preprocessor name, the transformer, and the columns.

A `ColumnTransformer` does the following:

* It **splits the columns** of the original dataset based on the column names
  or indices provided. We will obtain as many subsets as the number of
  transformers passed into the `ColumnTransformer`.
* It **transforms each subset**. A specific transformer is applied to
  each subset: it will internally call `fit_transform` or `transform`. The
  output of this step is a set of transformed datasets.
* It then **concatenates the transformed datasets** into a single dataset.

The important thing is that `ColumnTransformer` is like any other
scikit-learn transformer. In particular it can be combined with a classifier
in a `Pipeline`:

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, selector(dtype_include=object)), #Only need to provide the column names
    ('standard-scaler', numerical_preprocessor, selector(dtype_exclude=object))],
    remainder='passthrough')
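
The following sketch shows one way to combine such a preprocessor with a classifier in a pipeline. The toy DataFrame and the names `toy`, `toy_target`, `toy_preprocessor` and `toy_model` are made up for illustration, and the columns are listed by name rather than selected by dtype.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (not the adult census dataset)
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "workclass": ["private", "public", "private", "self", "public", "private"],
})
toy_target = [0, 0, 1, 1, 1, 0]

# Each tuple is (name, transformer, columns)
toy_preprocessor = ColumnTransformer([
    ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    ("standard-scaler", StandardScaler(), ["age"]),
])

# The preprocessor behaves like any other transformer inside a pipeline
toy_model = make_pipeline(toy_preprocessor, LogisticRegression(max_iter=500))
toy_model.fit(toy, toy_target)
predictions = toy_model.predict(toy)
transformed_shape = toy_model[0].transform(toy).shape
print(transformed_shape)
```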

# Selecting the Best Model

# Overfitting/Underfitting
- High Bias == Underfitting
    - systematic prediction errors
    - the model prefers to ignore some aspects of the data
    - misspecified models
- High Variance == Overfitting
    - prediction errors without obvious structure
    - small change in the training set, large change in model
    - unstable models
    
**Overfitting** is caused by the **limited size** of the training set, the **noise** in the data, and the **high flexibility** of common machine learning models.

**Underfitting** happens when the learnt prediction function suffers from **systematic errors**. This can be caused by a choice of model family and parameters, which leads to a **lack of flexibility** to capture the repeatable structure of the true data generating process.

Increasing the training set size will **decrease overfitting** but can also cause an **increase of underfitting**.
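
A rough sketch of both regimes on synthetic data (a noisy sine curve, chosen here only for illustration): a depth-1 tree is too rigid, while an unconstrained tree memorises the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: noisy sine curve (assumption for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Map each max_depth to its (train R2, test R2)
results = {}
for depth in (1, 4, None):
    fitted = DecisionTreeRegressor(max_depth=depth, random_state=0)
    fitted.fit(X_train, y_train)
    results[depth] = (fitted.score(X_train, y_train),
                      fitted.score(X_test, y_test))
    print(depth, results[depth])
```

A large gap between the train and test scores points to overfitting, while low scores on both point to underfitting.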

## Validation Curve
Some model hyperparameters are usually the key to go from a model that
underfits to a model that overfits, hopefully going through a region where we
can get a good balance between the two. We can find this region by plotting
a curve called the validation curve, which shows the train and test scores
of the above experiment as the value of one hyperparameter varies.

In [None]:
from sklearn.tree import DecisionTreeRegressor

regressor = DecisionTreeRegressor()

In [None]:
import pandas as pd
from sklearn.model_selection import cross_validate, ShuffleSplit

cv = ShuffleSplit(n_splits=30, test_size=0.2)
cv_results = cross_validate(regressor, data, target,
                            cv=cv,
                            return_train_score=True, n_jobs=-1)
cv_results = pd.DataFrame(cv_results)

In [None]:
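%%time
from sklearn.model_selection import validation_curve

# Vary the depth of the tree defined above. The scorer returns the negative
# MAE, so the sign is flipped to plot errors.
param_range = [1, 5, 10, 15, 20, 25]
train_scores, test_scores = validation_curve(
    regressor, data, target, param_name="max_depth", param_range=param_range,
    cv=cv, scoring="neg_mean_absolute_error", n_jobs=-1)
train_scores, test_scores = -train_scores, -test_scores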

In [None]:
import matplotlib.pyplot as plt

plt.plot(param_range, train_scores.mean(axis=1), label="Training error")
plt.plot(param_range, test_scores.mean(axis=1), label="Testing error")
plt.legend()

plt.xlabel("Maximum depth of decision tree")
plt.ylabel("Mean absolute error (k$)")
_ = plt.title("Validation curve for decision tree")

In [None]:
# To plot the variance in train/test errors
plt.errorbar(param_range, train_scores.mean(axis=1),
             yerr=train_scores.std(axis=1), label='Training error')
plt.errorbar(param_range, test_scores.mean(axis=1),
             yerr=test_scores.std(axis=1), label='Testing error')
plt.legend()

plt.xlabel("Maximum depth of decision tree")
plt.ylabel("Mean absolute error (k$)")
_ = plt.title("Validation curve for decision tree")

### Checking the size of Train Set to use

In [None]:
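import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed here: `pipeline` is a scaler followed by a support vector classifier
pipeline = make_pipeline(StandardScaler(), SVC())

train_sizes = np.linspace(0.1, 1, num=10)
results = learning_curve(
    pipeline, data, target, train_sizes=train_sizes, cv=cv, n_jobs=-1
)
train_size, train_scores, test_scores = results[:3]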

In [None]:
plt.errorbar(train_size, train_scores.mean(axis=1),
             yerr=train_scores.std(axis=1), label='Training score')
plt.errorbar(train_size, test_scores.mean(axis=1),
             yerr=test_scores.std(axis=1), label='Testing score')
plt.legend()

plt.xlabel("Number of samples in the training set")
plt.ylabel("Accuracy")
_ = plt.title("Learning curve for support vector machine")

# Hyperparameter Tuning

## Manual Tuning

In [None]:
# List the parameter names that can be tuned in the pipeline
model.get_params()

from sklearn.model_selection import cross_val_score

learning_rate = [0.05, 0.1, 0.5, 1, 5]
max_leaf_nodes = [3, 10, 30, 100]

best_score = 0
best_params = {}
for lr in learning_rate:
    for mln in max_leaf_nodes:
        print(f"Evaluating model with learning rate {lr:.3f} and max leaf nodes {mln}...", end="")
        model.set_params(
            classifier__learning_rate = lr,
            classifier__max_leaf_nodes = mln
        )
        scores = cross_val_score(model, data_train, target_train, cv=2)
        mean_score = scores.mean()
        print(f"score: {mean_score:.3f}")
        if mean_score > best_score:
            best_score = mean_score
            best_params = {"learning-rate":lr,
                          "max leaf nodes":mln}
            print(f"Found new best model with score {best_score:.3f}!")

print(f"The best accuracy obtained is {best_score:.3f}")
print(f"The best parameters found are:\n {best_params}")

## Automated using grid-search
The `GridSearchCV` estimator takes a `param_grid` parameter which defines
all hyperparameters and their associated values. The grid-search will be in
charge of creating all possible combinations and testing them.

The number of combinations will be equal to the product of the
number of values to explore for each parameter (e.g. in our example 5 x 4
combinations). Thus, adding new parameters with their associated values to be
explored rapidly becomes computationally expensive.

Once the grid-search is fitted, it can be used as any other predictor by
calling `predict` and `predict_proba`. Internally, it will use the model with
the best parameters found during `fit`.

**Cons:** Does not scale when the number of parameters to tune increases. It also imposes a regular grid on the search, which can miss good values that lie between grid points.
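
To see how quickly the grid grows, the combinations can be enumerated with `ParameterGrid`. The sketch below reuses the same values as the manual search; `example_grid` is a name introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

example_grid = {
    "classifier__learning_rate": (0.05, 0.1, 0.5, 1, 5),
    "classifier__max_leaf_nodes": (3, 10, 30, 100),
}

# Every combination the grid-search would evaluate
combinations = list(ParameterGrid(example_grid))
print(len(combinations))  # 5 values x 4 values

# Adding a third parameter with 3 values multiplies the cost by 3
bigger_grid = dict(example_grid, classifier__max_depth=(3, 5, None))
print(len(ParameterGrid(bigger_grid)))
```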

In [None]:
%%time
from sklearn.model_selection import GridSearchCV

param_grid = {
    'classifier__learning_rate': (0.05, 0.1, 0.5, 1, 5),
    'classifier__max_leaf_nodes': (3, 10, 30, 100)}
model_grid_search = GridSearchCV(model, param_grid=param_grid)
model_grid_search.fit(data_train, target_train)

In [None]:
accuracy = model_grid_search.score(data_test, target_test)
print(
    f"The test accuracy score of the grid-searched pipeline is: "
    f"{accuracy:.2f}"
)

print(f"The best set of parameters is: "
      f"{model_grid_search.best_params_}")

In [None]:
# To inspect all the test scores when using different parameters
cv_results = pd.DataFrame(model_grid_search.cv_results_).sort_values(
    "mean_test_score", ascending=False)
cv_results.head()

Another way to run `GridSearchCV`, with a custom scoring metric and 10-fold cross-validation:

In [None]:
grid_search = GridSearchCV(
    model,
    param_grid=param_grid,
    scoring="balanced_accuracy",
    cv=10,
).fit(data, target)
grid_search.cv_results_

# To sort the results into a dataframe
results = (
    pd.DataFrame(grid_search.cv_results_)
    .sort_values(by="mean_test_score", ascending=False)
)

results = results[
    [c for c in results.columns if c.startswith("param_")]
    + ["mean_test_score", "std_test_score"]
]

## Automated using randomized-search
A randomized-search allows a search with a fixed budget even with an increasing number of hyperparameters.

**Note:** Random search (with RandomizedSearchCV) is typically beneficial compared to grid search (with GridSearchCV) to optimize 3 or more hyperparameters.
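
The fixed budget can be illustrated with `ParameterSampler`, which draws candidates from distributions in the same way a randomized search does. The parameter names below are placeholders, not tied to a specific model.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

# Placeholder parameter names; distributions or lists of values are both allowed
distributions = {
    "learning_rate": loguniform(0.001, 10),
    "max_leaf_nodes": [3, 10, 30, 100],
    "with_mean": [True, False],
}

# The budget stays at n_iter candidates however many parameters are listed
candidates = list(ParameterSampler(distributions, n_iter=10, random_state=0))
for candidate in candidates[:3]:
    print(candidate)
```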

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, randint

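# Every key must match a name listed by `model.get_params()`
param_distributions = {
    'classifier__learning_rate': loguniform(0.001, 10),
    'columntransformer__standard-scaler__with_mean' : [True, False],
}
# For a bagging estimator, the distributions could instead be, for example:
# {"n_estimators": randint(10, 30), "max_samples": [0.5, 0.8, 1.0]}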

model_random_search = RandomizedSearchCV(
    model, param_distributions=param_distributions, n_iter=500,
    n_jobs=4, cv=5)
model_random_search.fit(data_train, target_train)

print("The best parameters are:")
print(model_random_search.best_params_)

In [None]:
import pandas as pd

# To see the results of all the params tried
cv_results = pd.DataFrame(model_random_search.cv_results_)
cv_results = cv_results.sort_values(by="rank_test_score")
cv_results

# Linear Models

In [None]:
from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression()
linear_regression.fit(data, target)
linear_regression.coef_ #To get the coefficients
linear_regression.intercept_ #To get the intercept

## Modelling Non-Linear Relationships (Feature Engineering)

We can use `sklearn.preprocessing.PolynomialFeatures` to generate polynomial features

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

polynomial_regression = make_pipeline(
    PolynomialFeatures(degree=3),
    LinearRegression(),
)
polynomial_regression.fit(data, target)
target_predicted = polynomial_regression.predict(data)
mse = mean_squared_error(target, target_predicted)

### Using SVM (Kernels)
See [documentation](https://scikit-learn.org/stable/modules/svm.html)

Kernels allow for a locally-based decision function instead of a global linear decision function.

Kernel methods such as SVR are very efficient for small to medium datasets.

For larger datasets with `n_samples >> 10_000`, it is often computationally
more efficient to perform explicit feature expansion using
`PolynomialFeatures` or other non-linear transformers from scikit-learn such
as
[KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html)
or
[Nystroem](https://scikit-learn.org/stable/modules/generated/sklearn.kernel_approximation.Nystroem.html).


In [None]:
from sklearn.svm import SVR

# Using a Linear kernel
svr = SVR(kernel="linear")
svr.fit(data, target)
target_predicted = svr.predict(data)
mse = mean_squared_error(target, target_predicted)

# Using a non-linear (polynomial) kernel
svr = SVR(kernel="poly", degree=3)
svr.fit(data, target)
target_predicted = svr.predict(data)
mse = mean_squared_error(target, target_predicted)

## Regularization in Linear Models
Use `RidgeCV` to create a linear model with ridge regression, and to identify the optimal `alpha` parameter value.

Requires features to be **scaled**. If two features are equally important, the model will boost the weights of features with small dispersion and reduce the weights of features with high dispersion. Recall that regularization forces weights to be closer to zero. Intuitively, working with rescaled data therefore makes it easier to find an optimal regularization parameter and thus an adequate model.
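
A small sketch of the shrinkage effect on synthetic data (generated with `make_regression`; the alpha values are arbitrary): the norm of the coefficient vector decreases as `alpha` grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Norm of the fitted weights for increasingly strong regularization
coef_norms = {}
for alpha in (0.01, 1.0, 100.0, 10_000.0):
    ridge_pipeline = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    ridge_pipeline.fit(X, y)
    coef_norms[alpha] = np.linalg.norm(ridge_pipeline[-1].coef_)
    print(f"alpha={alpha}: ||coef|| = {coef_norms[alpha]:.2f}")
```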

In [None]:
from sklearn.linear_model import Ridge

ridge_model = make_pipeline(
    PolynomialFeatures(degree=2),
    Ridge(alpha=10)
)

In [None]:
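import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

alphas = np.logspace(-2, 0, num=20)
# `store_cv_values` (and the fitted `cv_values_`) are named `store_cv_results`
# and `cv_results_` from scikit-learn 1.5 onwards
ridge = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(),
                      RidgeCV(alphas=alphas, store_cv_values=True))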

In [None]:
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=5, random_state=1)
cv_results = cross_validate(ridge, data, target,
                            cv=cv, scoring="neg_mean_squared_error",
                            return_train_score=True,
                            return_estimator=True, n_jobs=-1)

Plotting the effect of regularization term `alpha` on the MSE

In [None]:
mse_alphas = [est[-1].cv_values_.mean(axis=0)
              for est in cv_results["estimator"]]
cv_alphas = pd.DataFrame(mse_alphas, columns=alphas)

cv_alphas.mean(axis=0).plot(marker="+")
plt.ylabel("Mean squared error\n (lower is better)")
plt.xlabel("alpha")
_ = plt.title("Error obtained by cross-validation")

To inspect how much the model coefficients vary across the cross-validation folds:

In [None]:
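# One row of coefficients per fitted estimator (1D target assumed)
coefs = [estimator[-1].coef_ for estimator in cv_results["estimator"]]
# `feature_names` must list the expanded features, for example
# cv_results["estimator"][0][0].get_feature_names_out()
coefs = pd.DataFrame(coefs, columns=feature_names)

# Define the style of the box plot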
boxplot_property = {
    "vert": False,
    "whis": 100,
    "patch_artist": True,
    "widths": 0.5,
    "boxprops": dict(linewidth=3, color="black", alpha=0.9),
    "medianprops": dict(linewidth=2.5, color="black", alpha=0.9),
    "whiskerprops": dict(linewidth=3, color="black", alpha=0.9),
    "capprops": dict(linewidth=3, color="black", alpha=0.9),
}

_, ax = plt.subplots(figsize=(10, 35))
_ = coefs.abs().plot.box(**boxplot_property, ax=ax)

## Logistic Regression

`LogisticRegression` in `sklearn` is regularised by default. The parameter `C` is the inverse of the regularisation strength: smaller values mean stronger regularisation, and larger values mean weaker regularisation.
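
A short sketch of the effect of `C` on synthetic data (generated with `make_classification`; the values of `C` are arbitrary): the weights shrink as `C` decreases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Norm of the fitted weights for several regularization strengths
weight_norms = {}
for C in (0.01, 1.0, 100.0):
    classifier = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    weight_norms[C] = np.linalg.norm(classifier.coef_)
    print(f"C={C}: ||coef|| = {weight_norms[C]:.3f}")
```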

# Decision Trees

Decision Trees are non-parametric models that predict piecewise-constant values, hence they cannot extrapolate predictions beyond the range of the training set.
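
The sketch below illustrates this on a made-up linear relationship (`y = 2x` for `x` between 0 and 9.5): far outside the training range, the tree keeps predicting the largest target it has seen, while a linear model follows the trend.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Training data covers x in [0, 9.5] with y = 2x
X_train = np.arange(0, 10, 0.5).reshape(-1, 1)
y_train = 2 * X_train.ravel()

# A query point far outside the training range
X_far = np.array([[100.0]])

tree_model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
linear_model = LinearRegression().fit(X_train, y_train)
tree_pred = tree_model.predict(X_far)[0]
linear_pred = linear_model.predict(X_far)[0]

print(f"tree: {tree_pred:.1f}, linear: {linear_pred:.1f}")
```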

## Classification Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(data_train, target_train)
tree.score(data_test, target_test)

In [None]:
import numpy as np
import matplotlib.pyplot as plt


def plot_decision_function(fitted_classifier, range_features, ax=None):
    """Plot the boundary of the decision function of a classifier."""
    from sklearn.preprocessing import LabelEncoder

    feature_names = list(range_features.keys())
    # create a grid to evaluate all possible samples
    plot_step = 0.02
    xx, yy = np.meshgrid(
        np.arange(*range_features[feature_names[0]], plot_step),
        np.arange(*range_features[feature_names[1]], plot_step),
    )

    # compute the associated prediction
    Z = fitted_classifier.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = LabelEncoder().fit_transform(Z)
    Z = Z.reshape(xx.shape)

    # make the plot of the boundary and the data samples
    if ax is None:
        _, ax = plt.subplots()
    ax.contourf(xx, yy, Z, alpha=0.4, cmap="RdBu")

    return ax

In [None]:
# To plot the decision boundaries (2D)
import seaborn as sns

# create a palette to be used in the scatterplot
palette = ["tab:red", "tab:blue", "black"]

ax = sns.scatterplot(data=penguins, x=culmen_columns[0], y=culmen_columns[1],
                     hue=target_column, palette=palette)
plot_decision_function(tree, range_features, ax=ax)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
_ = plt.title("Decision boundary using a decision tree")

In [None]:
from sklearn.tree import plot_tree

_, ax = plt.subplots(figsize=(16, 12))
_ = plot_tree(tree, feature_names=data_columns,
             class_names=tree.classes_, ax=ax)

## Regression Decision Trees

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(max_depth=1)
tree.fit(data_train, target_train)
target_predicted = tree.predict(data_test)

In [None]:
from sklearn.tree import plot_tree

_, ax = plt.subplots(figsize=(8, 6))
_ = plot_tree(tree, feature_names=data_columns, ax=ax)

## Hyperparameters of Decision Trees
The hyperparameters `min_samples_leaf`, `min_samples_split`,
`max_leaf_nodes`, or `min_impurity_decrease` allow growing asymmetric trees
and apply a constraint at the level of the leaves or nodes.

In [None]:
from sklearn.model_selection import GridSearchCV

# Finding optimal max_depth parameter using GridSearchCV
param_grid = {"max_depth": np.arange(2, 10, 1)}
tree_clf = GridSearchCV(DecisionTreeClassifier(), param_grid=param_grid)
tree_reg = GridSearchCV(DecisionTreeRegressor(), param_grid=param_grid)
tree_clf.fit(data, target)
tree_reg.fit(data, target)

tree_clf.best_params_['max_depth'] #Optimal max_depth for Classification tree
tree_reg.best_params_['max_depth'] #Optimal max_depth for Regression tree

Estimating the optimal depth with nested cross-validation (a grid-search is run inside each outer fold)

In [None]:
import numpy as np
from sklearn.model_selection import GridSearchCV

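# `tree` is assumed to be a pipeline built with `make_pipeline` whose last
# step is a `DecisionTreeRegressor`, hence the "decisiontreeregressor__" prefix
params = {"decisiontreeregressor__max_depth": np.arange(1, 15)}
search = GridSearchCV(tree, params, cv=10)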
cv_results_tree_optimal_depth = cross_validate(
    search, data_numerical, target, cv=10, return_estimator=True, n_jobs=-1,
)
cv_results_tree_optimal_depth["test_score"].mean()

In [None]:
import seaborn as sns
sns.set_context("talk")

max_depth = [
    estimator.best_params_["decisiontreeregressor__max_depth"]
    for estimator in cv_results_tree_optimal_depth["estimator"]
]
max_depth = pd.Series(max_depth, name="max depth")
sns.swarmplot(max_depth)

# Ensemble of Models

## Bagging Models
**Bagging** is a general strategy that can work with any base model (linear, trees, etc)
- Bagging selects random subsets of the full data
- Fit one model on each bagged subset, independent of the other fitted models
- Average predictions

**Random Forests** are bagged *randomized* decision trees
- At each split: a random subset of features is selected
- The best split is taken among the restricted subset
- Extra randomisation **decorrelates** the prediction errors
- Uncorrelated errors make bagging work better

Each deep tree overfits individually but averaging the tree predictions reduces overfitting

A Bootstrap sample corresponds to a resampling with replacement, of the original dataset, to obtain a sample that is the same size as the original dataset.
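
A minimal NumPy sketch of one bootstrap sample, assuming a dataset of 1000 rows identified by their indices. Because sampling is done with replacement, some rows appear several times and roughly a third are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000
indices = np.arange(n_samples)

# Draw as many rows as the original dataset, with replacement
bootstrap_indices = rng.choice(indices, size=n_samples, replace=True)

# Fraction of distinct original rows that made it into the sample
unique_fraction = np.unique(bootstrap_indices).size / n_samples
print(f"{unique_fraction:.1%} of the original rows are in the bootstrap sample")
```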

The parameter `n_estimators` controls how many models will be used, hence a larger value will give a smoother bagging prediction.

In [None]:
%%time
from sklearn.ensemble import BaggingRegressor

# `estimator` was named `base_estimator` before scikit-learn 1.2
base_estimator = DecisionTreeRegressor(random_state=0)
bagging_regressor = BaggingRegressor(
    estimator=base_estimator, n_estimators=20,
    random_state=0)

cv_results = cross_validate(bagging_regressor, data, target, n_jobs=-1)
scores = cv_results["test_score"]

print(f"R2 score obtained by cross-validation: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

### Random Forests
Main parameter to tune: `n_estimators`

In general, the more trees in the forest, the better the statistical
performance will be. However, it will slow down the fitting and prediction
time. The goal is to balance computing time and statistical performance when
setting the number of estimators when putting such learner in production.

For random forests, it is possible to control the amount of randomness for
each split by setting the value of `max_features` hyperparameter:

- `max_features=0.5` means that 50% of the features are considered at each
  split;
- `max_features=1.0` means that all features are considered at each split
  which effectively disables feature subsampling.

By default, `RandomForestRegressor` disables feature subsampling while
`RandomForestClassifier` uses `max_features="sqrt"`, i.e. the square root of
the number of features. These
default values reflect good practices given in the scientific literature.

However, `max_features` is one of the hyperparameters to consider when tuning
a random forest:
- too much randomness in the trees can lead to underfitted base models and
  can be detrimental for the ensemble as a whole,
- too little randomness in the trees leads to more correlation of the prediction
  errors and as a result reduces the benefits of the averaging step in terms
  of overfitting control.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

random_forest = RandomForestClassifier(n_estimators=50, n_jobs=2, random_state=0)

scores = cross_val_score(random_forest, data, target, n_jobs=-1)

print(f"Accuracy obtained by cross-validation: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

## Boosted Models
### Boosting
**Traditional Boosting:** Adaptive Boosting / `AdaBoost`
- Mispredicted **samples are re-weighted** at each step
- Can use any base model that accepts `sample_weight`

### Gradient Boosting
**Gradient Boosting:** `HistGradientBoosting`
- Each base model predicts the **residual error** of previous models (according to some loss function)

Differences between `HistGradientBoosting` and `GradientBoosting`
 - `GradientBoosting`
    - Implements the traditional (exact) method, which requires sorting the feature values to find each split
    - Too slow for n_samples > 10,000
 - `HistGradientBoosting`
    - Discretizes numerical features (256 bins by default)
    - Efficient multi-core implementation
    - Much faster when n_samples is large

(Gradient) boosting fits trees sequentially: each shallow tree underfits individually, and sequentially adding trees reduces underfitting.

Gradient boosting tends to perform slightly better than bagging and random forest, and furthermore shallow trees predict faster.
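
The residual-fitting idea can be sketched by hand with two shallow trees. This is a simplified illustration, assuming a squared-error loss, a learning rate of 1 and synthetic data; it is not how the scikit-learn estimators are implemented internally.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: noisy parabola (assumption for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=200)

# First shallow tree: underfits on its own
first_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
residuals = y - first_tree.predict(X)

# Second shallow tree: trained to predict the errors of the first one
second_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
second_tree.fit(X, residuals)

# The ensemble prediction is the sum of both trees
mse_one = mean_squared_error(y, first_tree.predict(X))
mse_two = mean_squared_error(y, first_tree.predict(X) + second_tree.predict(X))
print(f"1 tree: {mse_one:.3f}, 2 trees: {mse_two:.3f}")
```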

## Traditional Boosting (AdaBoost)
**Cons:** Can overfit if too many estimators are used.

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import AdaBoostRegressor

adaboost = AdaBoostClassifier(n_estimators=3, algorithm="SAMME",
                              random_state=0)
adaboost.fit(data, target)

## Gradient Boosting
Main parameters: `max_depth`, and `learning_rate`

Increasing the number of trees in the ensemble keeps improving the fit of gradient boosting until it reaches a plateau, after which adding new trees only makes fitting and scoring slower.

To avoid adding unnecessary trees, gradient boosting offers an **early-stopping option** (instead of tuning `n_estimators`).  
Internally, the algorithm will use an out-of-sample set to compute the statistical performance of the model at each addition of a tree.  
Thus, if the statistical performance is not improving for several iterations, it will stop adding trees.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate

grad_boost = GradientBoostingRegressor(n_estimators=1000, n_iter_no_change=5)
grad_boost.fit(data_train, target_train)
grad_boost.n_estimators_  # Number of trees actually fitted before early stopping

cv_results_gbdt = cross_validate(
    grad_boost, 
    data, target, scoring="neg_mean_absolute_error",
    n_jobs=-1,
)

## Histogram Gradient Boosting (Faster)
Implementation is similar to **LightGBM** whereby the number of splits considered within the tree building is reduced by **binning the data** before passing them into the gradient boosting.

In [None]:
# No experimental import is needed since scikit-learn 1.0
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.ensemble import HistGradientBoostingClassifier

histogram_gradient_boosting = HistGradientBoostingRegressor(
    max_iter=1000, random_state=0, early_stopping=True)
cv_results_hgbdt = cross_validate(
    histogram_gradient_boosting, data, target,
    scoring="neg_mean_absolute_error", n_jobs=-1,
)

# Evaluating Model Performance
Model performance should always be compared against a baseline model, for instance one that always predicts the most frequent class or the mean target.
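
One possible baseline is scikit-learn's `DummyClassifier`. The sketch below compares it with a logistic regression on synthetic imbalanced data (generated with `make_classification`, roughly 80% / 20%).

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data (assumption for illustration)
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, weights=[0.8, 0.2], class_sep=2.0, random_state=0)

# Baseline: always predict the most frequent class
baseline_scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y)
model_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y)

print(f"baseline accuracy: {baseline_scores.mean():.3f}")
print(f"model accuracy:    {model_scores.mean():.3f}")
```

An accuracy close to 0.8 would look decent in isolation, yet the baseline reaches it without learning anything.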

## Cross Validation Strategies

### Stratification
Creates K folds with stratified data, preserving the proportion of classes within each fold.  
Ensures that no class is left out during training.
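
A quick check on a made-up target with 90 samples of class 0 and 10 of class 1: every test fold keeps the same 90/10 proportion.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target, sorted by class
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # the features are irrelevant for the split itself

# Number of minority-class samples in each test fold
minority_counts = [
    int(y[test_idx].sum())
    for _, test_idx in StratifiedKFold(n_splits=5).split(X, y)
]
print(minority_counts)
```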

In [None]:
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=3)
results = cross_validate(model, data, target, cv=cv)

### Shuffling the Data
Using `KFold` with `shuffle=True` shuffles the data before splitting, so that any ordering present in the dataset does not influence the folds.

In [None]:
from sklearn.model_selection import KFold

cv = KFold(shuffle=True)
results = cross_validate(model, data, target, cv=cv)

In [None]:
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit()

### Handling Non-iid Data (Time Series)
The data has to be split in chronological order, using the first n% of the data to predict the remaining (100-n)%.
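
A small sketch on 12 ordered samples showing that, with `TimeSeriesSplit`, every training index precedes every test index.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 samples assumed to be sorted in time
X = np.arange(12).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```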

In [None]:
from sklearn.model_selection import TimeSeriesSplit

groups = data.index.to_period("Q")
cv = TimeSeriesSplit(n_splits=groups.nunique())

### Grouped Data
Use a grouped cross-validation strategy to account for groups in the dataset (e.g. several samples from the same subject); otherwise the results will be over-optimistic.
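
One such strategy is `GroupKFold`. The sketch below uses made-up group labels and checks that no group is shared between the train and test sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
# Hypothetical group labels, e.g. one letter per subject
groups = np.repeat(["a", "b", "c", "d"], 3)

group_splits = list(GroupKFold(n_splits=4).split(X, groups=groups))
for train_idx, test_idx in group_splits:
    print("test groups:", sorted(set(groups[test_idx])))
```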

### Nested Cross-Validation
To both evaluate the model and tune the model's hyperparameters

When optimizing parts of the machine learning pipeline (e.g.
hyperparameter, transform, etc.), one needs to use nested cross-validation to
evaluate the statistical performance of the predictive model. Otherwise, the
results obtained without nested cross-validation are over-optimistic.

In [None]:
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV

# Declare the inner and outer cross-validation
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

# Inner cross-validation for parameter search
model = GridSearchCV(
    estimator=model_to_tune, param_grid=param_grid, cv=inner_cv, n_jobs=-1)

# Outer cross-validation to compute the testing score
test_score = cross_val_score(model, data, target, cv=outer_cv, n_jobs=-1)
print(f"The mean score using nested cross-validation is: "
      f"{test_score.mean():.3f} +/- {test_score.std():.3f}")

## Evaluation Metrics

### Classification Metrics

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(target_test, target_predicted)
print(f"Accuracy: {accuracy:.3f}")

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

# `plot_confusion_matrix` was removed in scikit-learn 1.2
_ = ConfusionMatrixDisplay.from_estimator(classifier, data_test, target_test)

In [None]:
from sklearn.metrics import precision_score, recall_score

precision = precision_score(target_test, target_predicted, pos_label="donated")
recall = recall_score(target_test, target_predicted, pos_label="donated")

In [None]:
from sklearn.metrics import balanced_accuracy_score

balanced_accuracy = balanced_accuracy_score(target_test, target_predicted)
print(f"Balanced accuracy: {balanced_accuracy:.3f}")

The function `cross_validate` allows the computation of multiple scores by passing a list of strings or scorers to the parameter `scoring`.

In [None]:
from sklearn.model_selection import cross_validate
scoring = ["accuracy", "balanced_accuracy"]

scores = cross_validate(tree, data, target, cv=cv, scoring=scoring)
scores

### Regression

**Metrics:**
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
- $R^2$ Score: represents the proportion of variance explained by the model (default score in `sklearn`)

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
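
The three metrics can be compared on a handful of made-up values, alongside a manual computation of each.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# The same quantities computed by hand
manual_mse = np.mean((y_true - y_pred) ** 2)
manual_mae = np.mean(np.abs(y_true - y_pred))
manual_r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MSE: {mse:.3f}, MAE: {mae:.3f}, R2: {r2:.3f}")
```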