
# Introduction to XGBoost

**Overview of XGBoost**

- What is XGBoost?

  - An advanced implementation of the Gradient Boosting algorithm designed for speed and performance

  - It introduces various enhancements that make it faster, more efficient and capable of handling complex datasets

- Improvements Over Traditional Gradient Boosting

  - Speed

  - Handling Missing Data

  - Regularization

  - Custom Loss Functions

  - Tree Pruning

**Key features of XGBoost**

- Handling Missing Data

  - At each split, learns a default direction for missing values and sends them to the branch that minimizes the loss function

  - Reduces preprocessing steps for datasets with missing values

- Regularization

  - Includes penalties for overly complex models, reducing overfitting

  - Hyperparameters

    - Lambda: L2 regularization term

    - Alpha: L1 regularization term

- Parallel Processing

  - Splits calculations for tree construction across multiple cores, significantly reducing training time

**Hyperparameters in XGBoost and How to Tune Them**

- Key parameters

  - Learning Rate (eta)

    - Controls the contribution of each tree to the model

    - Typical range: 0.01-0.3

  - Number of Trees (n_estimators)

    - Determines the number of boosting rounds

    - Larger values may improve performance but increase computation time

  - Tree Depth (max_depth)

    - Limits the depth of each tree, balancing bias and variance

    - Shallower trees tend to generalize better, while deeper trees may overfit

  - Subsample

    - Fraction of training rows sampled for each tree

    - Helps reduce overfitting

    - Typical range: 0.5-1.0

  - colsample_bytree

    - Fraction of features sampled for each tree

    - Typical range: 0.5-1.0

  - Regularization parameters: lambda and alpha control L2 and L1 regularization respectively

**1. Train an XGBoost model on a dataset, tune hyperparameters using cross-validation, and compare its performance with a Gradient Boosting model**

In [3]:
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier

import warnings
warnings.filterwarnings("ignore")

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display dataset info
print(f"Features: {data.feature_names}")
print(f"classes: {data.target_names}")

# convert dataset to DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Train XGBoost Model
params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "max_depth": 3,
    "eta": 0.1,
}

xgb_model = xgb.train(params, dtrain, num_boost_round=100)

# Predict
y_pred = (xgb_model.predict(dtest)>0.5).astype(int)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"XGBoost Accuracy: \n {accuracy}")
print("Classification Report: \n", classification_report(y_test, y_pred))

# Define hyperparameter grid
param_grid ={
    "learning_rate":[0.01, 0.1, 0.2],
    "n_estimators":[50, 100, 200],
    "max_depth":[3, 5, 7],
    "subsample":[0.8, 1.0],
    "colsample_bytree":[0.8, 1.0]
}

# Initialize XGBoost classifier (use_label_encoder is deprecated, so it is omitted)
xgb_clf = XGBClassifier(eval_metric="logloss")

# Perform GridSearch
grid_search = GridSearchCV(estimator=xgb_clf, param_grid=param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

# Display best parameters and score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation accuracy: {grid_search.best_score_}")

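# Train and evaluate a Gradient Boosting model for comparison
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"Gradient Boosting Accuracy: {accuracy_gb}")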

Features: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
classes: ['malignant' 'benign']
XGBoost Accuracy: 
 0.956140350877193
Classification Report: 
               precision    recall  f1-score   support

           0       0.95      0.93      0.94        43
           1       0.96      0.97      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

Best Paramete