# Introduction to XGBoost

## What is XGBoost?
XGBoost (Extreme Gradient Boosting) is a widely used open-source library for supervised machine learning that implements gradient-boosted decision trees. It was originally developed as a C++ command-line application and became well known after it was used in the winning solution of the Higgs Machine Learning Challenge. Today it has bindings for several languages, including Python, R, Scala, and Julia. This note focuses on the Python API.

## Why is XGBoost Popular?
- **Speed and Performance**: XGBoost is highly optimized and parallelizes the construction of each tree, making it fast to train on large datasets.
- **Scalability**: It can run on multiple CPU cores, GPUs, and even distributed computing environments.
- **Consistent Accuracy**: XGBoost often outperforms other single-algorithm machine learning methods in competitions and real-world tasks.

## Key Characteristics
- Designed for supervised learning tasks (classification and regression).
- Handles large datasets with high efficiency.
- Builds on gradient boosting and adds L1/L2 regularization and parallelized training.
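
The regularization mentioned above can be made concrete. As a sketch based on the objective described in the XGBoost documentation, training minimizes a loss term plus a complexity penalty for each tree. The symbols below are introduced here for illustration and do not appear elsewhere in this note: $l$ is the loss, $f_k$ is the $k$-th tree, $T$ is its number of leaves, and $w$ is its vector of leaf weights.

```latex
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i),
\qquad
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2
```

In the Python API, $\gamma$ and $\lambda$ correspond to the `gamma` and `reg_lambda` parameters, and an optional L1 penalty on the leaf weights is controlled by `reg_alpha`.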

## Why Learn XGBoost?
- It is one of the top-performing algorithms in machine learning competitions.
- It provides both flexibility and speed for data science projects.
- It integrates easily with the Python ecosystem (scikit-learn compatible API).



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import xgboost as xgb

In [2]:
churn = pd.read_csv("C:/Users/Emigb/Documents/Data Science/datasets/telecom_churn_clean.csv")
churn.head()

Unnamed: 0.1,Unnamed: 0,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,customer_service_calls,churn
0,0,128,415,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0
1,1,107,415,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0
2,2,137,415,0,0,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0
3,3,84,408,1,0,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0
4,4,75,415,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0


In [3]:
# Create arrays for the features and the target: X, y
X, y = churn.iloc[:,:-1], churn.iloc[:,-1]
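
The preview above shows two columns, `Unnamed: 0.1` and `Unnamed: 0`, that look like row indices written out by earlier CSV saves. Because `iloc[:, :-1]` selects by position, they end up in `X` as features. The following is a minimal sketch, assuming those columns carry no real signal, of dropping them before building the arrays. The small `demo` frame and its values are hypothetical stand-ins for the churn data, and the results reported in this note were obtained with the original selection.

```python
import pandas as pd

# Hypothetical three-row stand-in for the churn table (values are made up).
demo = pd.DataFrame({
    "Unnamed: 0.1": [0, 1, 2],
    "Unnamed: 0": [0, 1, 2],
    "account_length": [128, 107, 137],
    "customer_service_calls": [1, 1, 0],
    "churn": [0, 0, 1],
})

# Collect columns whose name starts with "Unnamed", i.e. saved row indices.
index_like = [col for col in demo.columns if col.startswith("Unnamed")]
demo_clean = demo.drop(columns=index_like)

# Select features and target by name rather than by position.
X_demo = demo_clean.drop(columns="churn")
y_demo = demo_clean["churn"]

print(list(X_demo.columns))
```

Selecting the target by name also protects against the column order changing in a later version of the file.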

### XGBoost: Fit and Predict

This example demonstrates how to build and evaluate a simple XGBoost classification model using the scikit-learn API.

**Steps:**
1. Import `xgboost` as `xgb`.
2. Split your dataset into training and test sets using an 80/20 split (`random_state=123`).
3. Create an `XGBClassifier` with:
   - `n_estimators=10`
   - `objective='binary:logistic'`
4. Fit the model to your training data using `.fit(X_train, y_train)`.
5. Predict the labels for your test set with `.predict(X_test)`.
6. Calculate accuracy by comparing predictions to the true labels.

This workflow helps you quickly train an XGBoost model and measure its initial performance.


In [4]:
# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Instantiate the XGBClassifier: xgb_cl
# (random_state is the scikit-learn-style name for the seed)
xgb_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, random_state=123)

# Fit the classifier to the training set
xgb_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xgb_cl.predict(X_test)

# Compute the accuracy: accuracy
accuracy = float(np.mean(preds == y_test))
print(f"accuracy: {accuracy:f}")

accuracy: 0.947526
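
An accuracy figure is easier to interpret next to a simple baseline, especially when one class dominates, which may be the case for churn data. The sketch below uses small hypothetical label arrays (not taken from the churn dataset) to show that the manual formula used above matches scikit-learn's `accuracy_score`, and how a majority-class baseline can be computed for comparison.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions (not taken from the churn data).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 0])
y_pred = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

# Manual accuracy: the fraction of predictions that match the true labels.
manual_acc = float(np.sum(y_pred == y_true)) / y_true.shape[0]

# scikit-learn's helper computes the same quantity.
sklearn_acc = accuracy_score(y_true, y_pred)

# Baseline: always predict the most frequent class in y_true.
majority_class = np.bincount(y_true).argmax()
baseline_acc = float(np.mean(y_true == majority_class))

print(f"manual: {manual_acc:.2f}, sklearn: {sklearn_acc:.2f}, baseline: {baseline_acc:.2f}")
```

If the model's accuracy is only slightly above the baseline, metrics such as precision and recall on the minority class are usually more informative.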
