# **Gradient Boost**

Gradient Boost is an ensemble learning technique that builds models sequentially, where each new model attempts to correct the errors made by the previous ones. It combines weak learners (often decision trees) to create a strong predictive model.

It is a powerful technique that combines the strengths of multiple weak learners to create a robust predictive model. It is widely used in various domains and has become a standard tool in the machine learning toolkit

It is particularly effective for both regression and classification tasks, and it can handle various types of data, including categorical and numerical features.

Gradient Boosting works by optimizing a loss function through gradient descent, where each new model is trained on the residuals (errors) of the previous models. 
This iterative process continues until a specified number of models are built or until the model performance stops improving.

It is widely used in machine learning competitions and real-world applications due to its high accuracy and flexibility. 
However, it can be prone to overfitting if not properly regularized, and it may require careful tuning of hyperparameters to achieve optimal performance.

Gradient Boosting is implemented in various libraries, including Scikit-Learn, XGBoost, LightGBM, and CatBoost, each with its own optimizations and features.

Gradient Boosting is particularly effective for structured data and is often used in applications such as fraud detection, customer churn prediction, and ranking tasks.

It is important to note that while Gradient Boosting can achieve high accuracy, it may require more computational resources and time to train compared to simpler models. Therefore, it is essential to balance model complexity with training time and resource availability.

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import datasets

In [7]:
# as_frame=True loads the dataset as a pandas DataFrame
# This is useful for easier data manipulation and visualization.
wine = datasets.load_wine(as_frame=True)
display(wine)

{'data':      alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
 0      14.23        1.71  2.43               15.6      127.0           2.80   
 1      13.20        1.78  2.14               11.2      100.0           2.65   
 2      13.16        2.36  2.67               18.6      101.0           2.80   
 3      14.37        1.95  2.50               16.8      113.0           3.85   
 4      13.24        2.59  2.87               21.0      118.0           2.80   
 ..       ...         ...   ...                ...        ...            ...   
 173    13.71        5.65  2.45               20.5       95.0           1.68   
 174    13.40        3.91  2.48               23.0      102.0           1.80   
 175    13.27        4.28  2.26               20.0      120.0           1.59   
 176    13.17        2.59  2.37               20.0      120.0           1.65   
 177    14.13        4.10  2.74               24.5       96.0           2.05   
 
      flavanoids  nonflavanoid

In [8]:
wine.feature_names

['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']

In [None]:
X = wine.data.drop(columns=['magnesium'])
Y = wine.data['magnesium']

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [12]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, accuracy_score, r2_score

In [13]:
gbc = GradientBoostingRegressor()
gbc.fit(X_train, Y_train)

In [14]:
y_pred = gbc.predict(X_test)
display(y_pred)

array([108.51162395, 117.1751086 , 100.7580958 , 116.07507535,
        87.24993741, 111.91232403,  80.62445067, 108.59224685,
       109.904828  , 103.99342012, 112.47455697, 103.92000418,
       101.16431518,  97.397633  , 126.74826003,  87.99453223,
        91.18661663,  90.0414513 , 111.45754722,  89.89936959,
       112.52447361,  89.77328928,  95.15577995,  96.8965493 ,
        92.50236912,  97.91169707,  86.07898403,  95.87796528,
        93.63249246, 112.62877353, 105.54124926,  85.65628935,
       100.15635114, 112.12564095, 115.81602274, 111.43076457])

In [15]:
print("Mean Squared Error:", mean_squared_error(Y_test, y_pred))
print("R^2 Score:", r2_score(Y_test, y_pred))
print("Cross-Validation Scores:", cross_val_score(gbc, X, Y, cv=5))
print("Cross-Validation Mean Score:", cross_val_score(gbc, X, Y, cv=5).mean())
print("Cross-Validation Standard Deviation:", cross_val_score(gbc, X, Y, cv=5).std())
print("Accuracy Score:", accuracy_score(Y_test, y_pred.round()))
print("Feature Importances:", gbc.feature_importances_)

Mean Squared Error: 134.8847698439398
R^2 Score: 0.11631005253415505
Cross-Validation Scores: [-0.02741016 -0.01653129 -0.02828623 -0.46420355  0.03777905]
Cross-Validation Mean Score: -0.07533300839622972
Cross-Validation Standard Deviation: 0.18217550463973922
Accuracy Score: 0.027777777777777776
Feature Importances: [0.03426317 0.06930774 0.10105258 0.05709802 0.04467351 0.05107233
 0.18304399 0.05234966 0.07730106 0.32983794]
