## XGBoost Decision Tree Classifier

XGBoost (Extreme Gradient Boosting) is an open-source library that implements gradient-boosted decision trees. It is widely used for classification tasks because it trains quickly, is often highly accurate on tabular data, and scales to large datasets.

### What is a Decision Tree Classifier?
A decision tree classifier predicts target labels by learning simple decision rules from features in the data. Each node splits the data based on a condition, and the process continues until the model forms a tree structure that maps inputs to outputs.
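As a rough illustration of those decision rules, the sketch below fits a shallow tree with scikit-learn and prints the learned splits as text. The synthetic data and the feature names `f0` to `f3` are assumptions made for this example; they are not part of the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-class data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# A depth-2 tree keeps the printed rule set small enough to read
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_demo, y_demo)

# Each printed line is one split condition or one leaf with its predicted class
rules = export_text(tree, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```

Each indented level in the printed output corresponds to one split, and each `class:` line is a leaf.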

### How XGBoost Improves Decision Trees
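- Uses **gradient boosting**, which builds trees sequentially, with each new tree fit to correct the errors of the ensemble built so far.
- Incorporates **regularization** (L1 and L2 penalties on leaf weights) to reduce overfitting.
- Handles missing values internally by learning a default branch direction for them at each split.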
- Supports parallel processing for faster training.

### Key Features
- Suitable for both **binary and multi-class classification**.
- Can handle **numeric and categorical features** (with proper encoding).
- Includes hyperparameters for fine-tuning (e.g., `learning_rate`, `max_depth`, `n_estimators`).

### Typical Use Case
- Prepare your dataset (features `X` and target `y`).
- Create an `XGBClassifier()` model.
- Train it with `.fit(X_train, y_train)`.
- Make predictions using `.predict(X_test)`.

XGBoost is often the go-to choice for Kaggle competitions, business analytics, and real-world problems requiring high predictive performance.


### Building a Decision Tree Classifier

The dataset contains telecom customer account records (call minutes, charges, plan flags, and customer service calls) with a `churn` label indicating whether each customer left. This model classifies customers as churned or not churned based on those features.

- Split the dataset into training and testing sets using `X` (features) and `y` (labels).
- Define a `DecisionTreeClassifier` and set the `max_depth` parameter to control the maximum depth of the tree.
- Fit the classifier on the training data (`X_train`, `y_train`).


In [2]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.tree import DecisionTreeClassifier

In [3]:
churn = pd.read_csv("C:/Users/Emigb/Documents/Data Science/datasets/telecom_churn_clean.csv")
churn.head()

Unnamed: 0.1,Unnamed: 0,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,customer_service_calls,churn
0,0,128,415,0,1,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,0
1,1,107,415,0,1,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,0
2,2,137,415,0,0,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,0
3,3,84,408,1,0,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,0
4,4,75,415,1,0,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,0


In [4]:
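# Features: every column except the last (`churn`); this includes the two `Unnamed` index columns
X = churn.iloc[:, :-1]
# Target: the `churn` label
y = churn['churn']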

In [8]:
# 1. Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# 2. Hold out 20% of the data for testing; random_state=123 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# 3. Instantiate a DecisionTreeClassifier with max_depth=4, the maximum number of
#    successive split points allowed before reaching a leaf node
dt_clf_4 = DecisionTreeClassifier(max_depth=4, random_state=123)

# 4. Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)

# 5. Predict the labels of the test set
preds = dt_clf_4.predict(X_test)

# 6. Compute accuracy: the fraction of test labels predicted correctly
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]


In [9]:
print('accuracy:', accuracy)

accuracy: 0.9070464767616192


### Measuring Accuracy with XGBoost Cross-Validation

- Convert the input data into an XGBoost `DMatrix` format.
- Use the `xgboost.cv` function to run cross-validation on the dataset.
- Evaluate the model's performance using the cross-validation results.
