# XGBoost

XGBoost (Extreme Gradient Boosting) is a widely used gradient boosting library and algorithm in the family of ensemble learning methods. It is particularly effective for regression and classification problems on tabular data. XGBoost builds weak learners (typically decision trees) one after another, fitting each new tree to correct the errors of the ensemble so far, and sums their predictions to form a strong learner. Here are some key features and advantages of XGBoost:

### Features of XGBoost:

1. **Regularization:**
   - XGBoost includes L1 and L2 regularization terms on the leaf weights, plus a penalty on the number of leaves, in its objective function, which helps prevent overfitting.

2. **Parallel Processing:**
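   - It parallelizes split finding within each tree across CPU threads and supports distributed training, which often makes it faster than many other gradient boosting implementations. The trees themselves are still built sequentially.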

3. **Tree Pruning:**
   - XGBoost grows each tree up to a maximum depth (`max_depth`) and then prunes back splits whose loss reduction falls below a threshold (`gamma`), which reduces complexity and improves efficiency.

4. **Handling Missing Values:**
   - XGBoost has a built-in mechanism to handle missing values: at each split it learns a default direction for missing entries, so imputation is often unnecessary.

5. **Cross-validation:**
   - It has a built-in cross-validation function (`xgb.cv`), which helps with model selection, for example when choosing the number of boosting rounds.

6. **Flexibility:**
   - XGBoost can be used for regression, classification, and ranking problems, and accepts both dense and sparse numeric inputs.

7. **Feature Importance:**
   - It provides a feature importance score, which helps in understanding the relative importance of different features in the dataset.
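
The regularization and pruning items above can be made concrete with a small pure-Python sketch of the regularized leaf weight and split gain described in the XGBoost documentation's introduction to boosted trees. The function names and the toy gradient sums are illustrative assumptions, not part of the library's API; only `reg_lambda` and `gamma` mirror real parameter names.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)


def leaf_score(grad_sum, hess_sum, reg_lambda=1.0):
    """Structure score G^2 / (H + lambda) contributed by a single leaf."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting one leaf in two, minus the per-leaf penalty gamma."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (
        leaf_score(g_left, h_left, reg_lambda)
        + leaf_score(g_right, h_right, reg_lambda)
    )
    return 0.5 * (children - parent) - gamma


# Made-up gradient and hessian sums for the two candidate children.
g_left, h_left = -4.0, 3.0
g_right, h_right = 2.0, 1.0

gain_no_penalty = split_gain(g_left, h_left, g_right, h_right, gamma=0.0)
gain_with_penalty = split_gain(g_left, h_left, g_right, h_right, gamma=3.0)

print(f"gain with gamma=0: {gain_no_penalty:.2f}")    # positive, split is kept
print(f"gain with gamma=3: {gain_with_penalty:.2f}")  # negative, split is pruned

# A larger lambda shrinks the leaf weight toward zero.
print(leaf_weight(g_left, h_left, reg_lambda=1.0))
print(leaf_weight(g_left, h_left, reg_lambda=5.0))
```

In this toy case the same split is worth keeping without a leaf penalty but would be pruned once `gamma` exceeds its raw gain, which is the mechanism the pruning item refers to.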

### Advantages of XGBoost:

1. **Accuracy:**
   - XGBoost often produces highly accurate models, as it combines the strengths of multiple weak learners.

2. **Speed:**
   - It is optimized for performance and efficiency, making it faster than many other boosting algorithms.

3. **Scalability:**
   - XGBoost scales to large datasets through multithreading, distributed training, and out-of-core (external memory) computation.

4. **Robustness:**
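   - Regularization, shrinkage, and row and column subsampling make it less prone to overfitting than unregularized boosting, and it tolerates a moderate amount of noise in the data.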

5. **Flexibility:**
   - XGBoost supports custom objective functions and evaluation metrics, so it can be adapted to a wide range of data types and problem types.


# Practical implementation

In [1]:
!pip install xgboost




In [2]:
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

In [3]:
# Load the breast cancer dataset (binary classification)
data = load_breast_cancer()
X, y = data.data, data.target

In [4]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [5]:
# Create an XGBoost classifier
xgb_model = xgb.XGBClassifier()

In [6]:
# Train the model
xgb_model.fit(X_train, y_train)

In [7]:
# Make predictions on the test set
y_pred = xgb_model.predict(X_test)

In [8]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy without Hyperparameter Tuning: {accuracy}")

Accuracy without Hyperparameter Tuning: 0.956140350877193


In [9]:
# With Hyperparameter Tuning using Grid Search

# Define a parameter grid for Grid Search
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

In [10]:
# Create a GridSearchCV object
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=2,
)

In [11]:
# Fit the GridSearchCV object to the data
grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 108 candidates, totalling 324 fits
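
The figures in the log line above follow directly from the grid definition. A short sketch of the arithmetic, using only the standard library and the same `param_grid` as in the notebook:

```python
from itertools import product

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

# Every combination of one value per parameter is one candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)  # 3 * 3 * 3 * 2 * 2

n_folds = 3
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 108 324
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set after these fits. Because each extra value multiplies the total, grids grow quickly.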


In [12]:
# Get the best parameters and the best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

In [13]:
# Make predictions on the test set using the best model
y_pred_tuned = best_model.predict(X_test)

In [14]:
# Calculate accuracy after tuning
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print(f"Accuracy after Hyperparameter Tuning: {accuracy_tuned}")
print(f"Best Parameters: {best_params}")

Accuracy after Hyperparameter Tuning: 0.9649122807017544
Best Parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 200, 'subsample': 0.8}
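
To put the two accuracy figures in perspective, it helps to translate them back into counts of test samples. The sketch below does that arithmetic, assuming the 569-sample breast cancer dataset and the 20% test split used above; the variable names are introduced here for illustration.

```python
import math

n_samples = 569                      # rows in scikit-learn's breast cancer dataset
n_test = math.ceil(0.2 * n_samples)  # train_test_split rounds the test share up

accuracy_default = 0.956140350877193
accuracy_tuned = 0.9649122807017544

# Convert each accuracy back into a number of correctly classified samples.
correct_default = round(accuracy_default * n_test)
correct_tuned = round(accuracy_tuned * n_test)

print(n_test, correct_default, correct_tuned)  # 114 109 110
```

On this split the tuned model classifies one more test sample correctly than the default one, so the improvement is real but small, and it may not hold for a different `random_state`.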
