# Introduction to XGBoost for Classification 🚀

**XGBoost (eXtreme Gradient Boosting)** is an optimized, open-source implementation of the **Gradient Boosting** framework. It is a common first choice for structured (tabular) data and has featured in many winning entries in machine learning competitions.

### Why is XGBoost so popular?
* **Performance:** It delivers strong predictive accuracy on a wide range of structured (tabular) data problems.
* **Speed:** It is optimized for computational efficiency and can be much faster than standard gradient boosting implementations.
* **Flexibility:** It includes built-in regularization to prevent overfitting, can handle missing values, and offers many hyperparameters for fine-tuning.

Like other boosting methods, XGBoost builds a series of decision trees **sequentially**. Each new tree is trained to correct the errors (more precisely, to follow the gradient of the loss) of the trees built so far, and the final prediction combines the outputs of all trees. This lets the model learn complex, non-linear patterns that a single shallow tree would miss.
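The sequential, error-correcting idea can be sketched by hand for the simplest case, squared-error regression, where "the errors" are just the residuals. This is only an illustration of plain gradient boosting, not XGBoost's actual algorithm (XGBoost additionally uses second-order gradient information and a regularized objective). The toy data, `learning_rate`, `n_rounds`, and tree depth are assumptions chosen for the sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D toy regression data, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3  # shrinks each tree's contribution (assumed value)
n_rounds = 20        # number of trees to add (assumed value)

# Start from a constant prediction: the mean of the targets.
prediction = np.full_like(y_toy, y_toy.mean())
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared error, the residuals point along the negative gradient of the loss.
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)
    # Add a damped version of the new tree's output to the running prediction.
    prediction += learning_rate * tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

Each round fits a small tree to what the current ensemble still gets wrong, so the training error shrinks as trees are added.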

This notebook will compare the performance of a simple Logistic Regression model against an XGBoost classifier on a synthetic multi-class dataset.


## 1. Generating a Multi-Class Dataset

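First, we'll use `scikit-learn`'s `make_classification` to generate a synthetic dataset for a three-class classification problem. The dataset will have 10,000 samples and 10 features, of which 8 are informative and 2 are redundant (linear combinations of the informative ones).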


In [6]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10000,
    n_features=10,
    n_informative=8,
    n_redundant=2,
    n_repeated=0,
    n_classes=3,
    random_state=42
)

Let's inspect the shape and a few samples of our generated data.

In [3]:
X.shape

(10000, 10)

In [4]:
X[:2]

array([[-5.31573515,  0.6775586 , -4.43495008, -1.755074  , -0.47264511,
        -2.96504643,  2.39563871, -0.38616042, -5.99696616,  2.70706827],
       [-1.71149777,  1.42608068, -0.56808572,  1.19785018, -1.45465463,
         2.03940975, -1.64207421,  0.54053374, -1.52128605,  1.09364584]])

In [5]:
y[:2]

array([2, 1])
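Beyond the shape and the first rows, it can help to check how the samples are spread across the three classes, since accuracy is easier to interpret when classes are roughly balanced. The following is a minimal sketch that regenerates the same dataset so it runs on its own; the variable names `classes` and `counts` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same generation call as in the notebook, repeated so this cell is self-contained.
X, y = make_classification(
    n_samples=10000,
    n_features=10,
    n_informative=8,
    n_redundant=2,
    n_repeated=0,
    n_classes=3,
    random_state=42,
)

# Count how many samples fall into each class label.
classes, counts = np.unique(y, return_counts=True)
for label, count in zip(classes, counts):
    print(f"class {label}: {count} samples ({count / len(y):.1%})")
```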

## 2. Baseline Model: Logistic Regression

Before using XGBoost, let's establish a performance baseline with a simpler model, `LogisticRegression`. We'll hold out 20% of the data (2,000 samples) as a test set and train the model on the remaining 80%.


In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

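# Hold out 20% of the samples for testing; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Fit the baseline on the training split and predict on the held-out split.
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall, and F1, plus overall accuracy.
print("--- Logistic Regression Report ---")
print(classification_report(y_test, y_pred))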

--- Logistic Regression Report ---
              precision    recall  f1-score   support

           0       0.74      0.70      0.72       677
           1       0.76      0.77      0.76       664
           2       0.68      0.71      0.70       659

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.73      0.73      0.73      2000



The baseline Logistic Regression model achieves an accuracy of **73%** on the 2,000-sample test set.
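A single train/test split can be somewhat noisy, so one way to check that the baseline figure is representative is k-fold cross-validation. The sketch below is an optional sanity check, not part of the original notebook; `cv=5` and `max_iter=1000` are assumptions chosen here (the latter only to give the solver room to converge).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Same dataset as in the notebook, regenerated so this cell runs on its own.
X, y = make_classification(
    n_samples=10000,
    n_features=10,
    n_informative=8,
    n_redundant=2,
    n_repeated=0,
    n_classes=3,
    random_state=42,
)

# Accuracy on each of 5 folds (the default splitter stratifies folds for classifiers).
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy"
)
print(f"accuracy per fold: {scores.round(3)}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

If the fold scores cluster tightly around the single-split figure, the baseline is a stable reference point for the comparison that follows.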

## 3. Training the XGBoost Classifier

Now, let's train an `XGBClassifier` on the same data. The XGBoost library provides an easy-to-use, `scikit-learn`-compatible interface.


In [8]:
from xgboost import XGBClassifier

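# Train XGBoost with default hyperparameters on the same training split.
model = XGBClassifier()
model.fit(X_train, y_train)

# Evaluate on the same held-out test split used for the baseline.
y_pred = model.predict(X_test)
print("--- XGBoost Classifier Report ---")
print(classification_report(y_test, y_pred))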

--- XGBoost Classifier Report ---
              precision    recall  f1-score   support

           0       0.88      0.90      0.89       677
           1       0.91      0.91      0.91       664
           2       0.91      0.90      0.91       659

    accuracy                           0.90      2000
   macro avg       0.90      0.90      0.90      2000
weighted avg       0.90      0.90      0.90      2000



The XGBoost model achieves an accuracy of **90%** on the same test set, 17 percentage points above the Logistic Regression baseline, with precision and recall improving for all three classes.


## 4. Conclusion

| Model | Accuracy |
|:--- |:--- |
| Logistic Regression | 73% |
| **XGBoost Classifier** | **90%** |

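On this dataset, XGBoost with default settings outperforms the Logistic Regression baseline by 17 percentage points. Part of the gap comes from the data itself: by default, `make_classification` places each class in more than one cluster, so the classes are not linearly separable and a linear model is at a disadvantage. For classification tasks with this kind of non-linear structure, ensemble methods like XGBoost can provide a substantial performance boost over simpler linear models. While XGBoost is more complex under the hood, its default settings often give strong results, making it a powerful and accessible tool.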