<a href="https://colab.research.google.com/github/MathMachado/DSWP/blob/master/XGBoost_Imbalanced_sample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# XGBoost "scale_pos_weight" vs "sample_weight" for Imbalanced Classification
* When working with imbalanced classification tasks, where the number of instances in each class is significantly different, XGBoost offers two main ways to reweight the training data: the scale_pos_weight model parameter, which applies a single multiplier to every positive instance, and the sample_weight argument of fit(), which assigns a weight to each training instance individually.

This example demonstrates how to use both options and compares the resulting models using evaluation metrics on a synthetic imbalanced dataset.
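
As a rough sketch of why the two options are closely related: for the binary logistic objective, a per-instance weight multiplies that instance's gradient and Hessian, and scale_pos_weight acts as an extra multiplier on positive instances only. The NumPy code below is an illustrative reimplementation, not XGBoost's own code; `weighted_logistic_grad_hess` and the toy arrays are hypothetical names introduced here.

```python
import numpy as np


def weighted_logistic_grad_hess(margin, y, weight):
    """Gradient and Hessian of weighted log loss w.r.t. the raw margin."""
    p = 1.0 / (1.0 + np.exp(-margin))  # sigmoid turns margins into probabilities
    grad = (p - y) * weight            # first derivative, scaled per instance
    hess = p * (1.0 - p) * weight      # second derivative, scaled per instance
    return grad, hess


# Toy data: three negatives and one positive, all starting at margin 0 (p = 0.5)
y = np.array([0.0, 0.0, 0.0, 1.0])
margin = np.zeros_like(y)

# A scale_pos_weight of 3 expressed as per-instance weights
spw = 3.0
weights = np.where(y == 1, spw, 1.0)

grad, hess = weighted_logistic_grad_hess(margin, y, weights)
print(grad)  # the single positive now pulls as hard as the three negatives combined
print(hess)
```

With the multiplier set to the negative-to-positive ratio, the gradients of the two classes cancel at the starting point, which is the intuition behind using that ratio as a first guess.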

## scale_pos_weight example:

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Generate an imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compute scale_pos_weight as ratio of negative to positive instances in train set
scale_pos_weight = len(y_train[y_train == 0]) / len(y_train[y_train == 1])

# Initialize XGBClassifier with scale_pos_weight
model_spw = XGBClassifier(n_estimators=100, scale_pos_weight=scale_pos_weight, random_state=42)

# Train model and evaluate performance on test set
model_spw.fit(X_train, y_train)
pred_spw = model_spw.predict(X_test)
print("scale_pos_weight model:")
print("Confusion Matrix:")
print(confusion_matrix(y_test, pred_spw))
print("\nClassification Report:")
print(classification_report(y_test, pred_spw))

## sample_weight example:

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np

# Generate an imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create sample_weight array mapping class weights to instances in train set
class_weights = {0: 1, 1: 10}
sample_weights = np.array([class_weights[class_id] for class_id in y_train])

# Initialize XGBClassifier with default parameters
model_sw = XGBClassifier(n_estimators=100, random_state=42)

# Train model using sample_weight and evaluate on test set
model_sw.fit(X_train, y_train, sample_weight=sample_weights)
pred_sw = model_sw.predict(X_test)
print("sample_weight model:")
print("Confusion Matrix:")
print(confusion_matrix(y_test, pred_sw))
print("\nClassification Report:")
print(classification_report(y_test, pred_sw))
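
A quick sanity check on a sample_weight array is to add up the weights per class; if the totals are close, the weighting roughly balances the two classes. The sketch below uses a hypothetical helper, `class_weight_totals`, and a toy label array rather than the dataset above.

```python
import numpy as np


def class_weight_totals(y, sample_weight):
    """Sum the per-instance weights separately for each integer class label."""
    return np.bincount(y, weights=sample_weight)


# Toy labels with a 9:1 imbalance (nine negatives, one positive)
y_toy = np.array([0] * 9 + [1])

# Same mapping as in the example above: positives weighted 10x
class_weights = {0: 1, 1: 10}
w_toy = np.array([class_weights[label] for label in y_toy], dtype=float)

totals = class_weight_totals(y_toy, w_toy)
print(totals)  # total weight carried by class 0 and by class 1
```

If the positive total is far larger than the negative total, the chosen class weight probably overcorrects for the imbalance.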

# XGBoost: Configure "scale_pos_weight" in Place of "class_weight" for Imbalanced Classification
* Class imbalance is a common issue in real-world classification problems, where the number of instances in one class significantly outweighs the other.

XGBoost provides the scale_pos_weight parameter to effectively handle imbalanced datasets by adjusting the weights of the positive class.

Note that while many scikit-learn estimators expose a parameter named class_weight, XGBClassifier does not; XGBoost uses scale_pos_weight to handle class imbalance in binary problems.
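
For readers coming from scikit-learn, the connection can be sketched numerically. scikit-learn's class_weight="balanced" heuristic assigns each class the weight n_samples / (n_classes * class_count); for two classes, the ratio of the positive weight to the negative weight equals the negative-to-positive count ratio, which is the usual starting value for scale_pos_weight. The helper name `balanced_class_weights` and the toy labels are assumptions made for illustration.

```python
import numpy as np


def balanced_class_weights(y):
    """Per-class weights following the n_samples / (n_classes * count) heuristic."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)


# Toy labels: 90 negatives and 10 positives
y_toy = np.array([0] * 90 + [1] * 10)

weights = balanced_class_weights(y_toy)
ratio = weights[1] / weights[0]                 # relative weight of the positive class
spw = (y_toy == 0).sum() / (y_toy == 1).sum()   # heuristic used for scale_pos_weight

print(weights, ratio, spw)
```

The two settings agree only up to an overall scale of the weights, and that scale can interact with settings such as min_child_weight, so results need not match exactly.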

This example demonstrates how to compute and set the scale_pos_weight parameter when training an XGBoost model on imbalanced data.

We’ll generate a synthetic imbalanced binary classification dataset using scikit-learn, train an XGBClassifier with scale_pos_weight, and evaluate the model’s performance using the confusion matrix and classification report.

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Generate an imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compute the scale_pos_weight
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(f'pos weight: {scale_pos_weight}')

# Initialize XGBClassifier with scale_pos_weight
model = XGBClassifier(scale_pos_weight=scale_pos_weight, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Generate predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# XGBoost: Configure the "max_delta_step" Parameter for Imbalanced Classification
* When working with imbalanced classification tasks in XGBoost, where the number of instances in each class differs significantly, the model tends to be biased toward the majority class.

The max_delta_step parameter can help by capping the output of each tree leaf, which makes every boosting update more conservative. The default of 0 means no constraint; the XGBoost documentation notes that the parameter is usually not needed but that values between 1 and 10 might help with logistic regression when the classes are extremely imbalanced.
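
To make the mechanism concrete, here is a simplified sketch of how a cap on the leaf output behaves. It assumes the standard second-order leaf value -G / (H + lambda), where G and H are the summed gradients and Hessians in a leaf, and it ignores L1 regularization and the learning rate; `leaf_output` is a hypothetical helper, not an XGBoost function.

```python
import math


def leaf_output(sum_grad, sum_hess, reg_lambda=1.0, max_delta_step=0.0):
    """Second-order leaf value, optionally capped in absolute value."""
    value = -sum_grad / (sum_hess + reg_lambda)
    # 0 means "no constraint"; a positive value caps the magnitude
    if max_delta_step > 0 and abs(value) > max_delta_step:
        value = math.copysign(max_delta_step, value)
    return value


# A leaf with a large summed gradient but a small summed Hessian,
# as can happen when predicted probabilities sit near 0 or 1
G, H = -6.0, 0.5

unconstrained = leaf_output(G, H)               # -(-6) / (0.5 + 1)
capped = leaf_output(G, H, max_delta_step=1.0)  # magnitude limited to 1
print(unconstrained, capped)
```

The cap only changes leaves whose unconstrained value exceeds it, so small values of max_delta_step affect more leaves than large ones.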

This example demonstrates how to use the max_delta_step parameter and evaluates its impact on model performance using a synthetic imbalanced dataset.

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Generate an imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize XGBClassifier with default settings
model_default = XGBClassifier(n_estimators=100, random_state=42)

# Initialize XGBClassifier with max_delta_step set to a non-default value
model_mds = XGBClassifier(n_estimators=100, max_delta_step=1, random_state=42)

# Fit the models
model_default.fit(X_train, y_train)
model_mds.fit(X_train, y_train)

# Generate predictions
pred_default = model_default.predict(X_test)
pred_mds = model_mds.predict(X_test)

# Evaluate the models
print("Model with default settings:")
print("Confusion Matrix:")
print(confusion_matrix(y_test, pred_default))
print("\nClassification Report:")
print(classification_report(y_test, pred_default))

print("\nModel with max_delta_step set to 1:")
print("Confusion Matrix:")
print(confusion_matrix(y_test, pred_mds))
print("\nClassification Report:")
print(classification_report(y_test, pred_mds))