
#### Markdown:
In this cell, we:
1. **Imported required libraries**: `pandas` for data handling, `XGBClassifier` from `xgboost` for model implementation, and metrics from `sklearn`.
2. **Loaded the prepared split datasets**: Reading the CSV files for sampled training and validation sets.
3. **Separated features and labels**: `X_train_sample` and `y_train_sample` for training, `X_val` and `y_val` for validation.
4. **Verified shapes**: Printed the shapes to ensure consistency and correct data loading.

In [None]:
# Import necessary libraries
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, accuracy_score
import os

# Define dataset filenames
dataset_files = {
    "X_train_sample": "split_data/train_1_split.csv",
    "X_val": "split_data/val_1_split.csv"
}

# Construct dynamic paths based on the current working directory
current_dir = os.getcwd()
dataset_paths = {key: os.path.join(current_dir, "data/features", filename) for key, filename in dataset_files.items()}

# Load the prepared datasets
try:
    X_train_sample = pd.read_csv(dataset_paths["X_train_sample"])
    y_train_sample = X_train_sample.pop('label')

    X_val = pd.read_csv(dataset_paths["X_val"])
    y_val = X_val.pop('label')
except FileNotFoundError as e:
    # Fail fast: the shape checks below need all four objects to exist
    print(f"File not found: {e}")
    raise


# Confirm shapes for training and validation sets
print("Sampled Training Features Shape:", X_train_sample.shape)
print("Sampled Training Labels Shape:", y_train_sample.shape)
print("Validation Features Shape:", X_val.shape)
print("Validation Labels Shape:", y_val.shape)


Sampled Training Features Shape: (1024933, 177)
Sampled Training Labels Shape: (1024933,)
Validation Features Shape: (256234, 177)
Validation Labels Shape: (256234,)


#### Markdown:
In this cell, we:
1. **Defined the XGBoost Classifier**:
   - Set `random_state` for reproducibility.
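   - Specified `eval_metric='mlogloss'` (multi-class log loss) as the evaluation metric.
   - `use_label_encoder` is ignored by recent XGBoost releases (see the "are not used" warning in the cell output), so it has no effect on the model.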
2. **Trained the model**: The classifier is trained using the sampled training dataset.
3. **Made predictions**: We generated predictions on the validation dataset.
4. **Evaluated the classifier**: Output accuracy and a full classification report including precision, recall, and F1-score for each class.
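
For reference, `mlogloss` is the mean negative log-probability that the model assigns to each sample's true class. A minimal pure-Python sketch of that formula follows. The function name `multiclass_log_loss`, the clipping constant `eps`, and the toy probabilities are illustrative assumptions, not XGBoost internals or values from this notebook.

```python
import math

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, proba):
        # Clip to avoid log(0); eps is an assumed value for this sketch
        p = min(max(row[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)

# Hypothetical toy example: 3 samples, 3 classes (labels are column indices)
y_true = [0, 2, 1]
proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]

loss = multiclass_log_loss(y_true, proba)
print(f"mlogloss: {loss:.4f}")
```

Lower values are better: a model that puts probability close to 1 on every true class drives the loss toward 0.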

In [None]:
# Define the XGBoost Classifier with default hyperparameters.
# use_label_encoder is omitted: recent XGBoost releases ignore it and only emit a warning.
xgb_clf = XGBClassifier(random_state=42, eval_metric='mlogloss')

# Train the classifier on the sampled training data
xgb_clf.fit(X_train_sample, y_train_sample)

# Make predictions on the validation data
xgb_y_pred = xgb_clf.predict(X_val)

# Evaluate the model
print("XGBoost Classifier Accuracy:", accuracy_score(y_val, xgb_y_pred))
print("\nClassification Report:\n", classification_report(y_val, xgb_y_pred))


Parameters: { "use_label_encoder" } are not used.



### Model Evaluation Metrics

In this cell, we evaluate the XGBoost model using accuracy and the classification report, which together cover:

1. **Accuracy**: The overall percentage of correct predictions.
2. **Precision (Weighted)**: The weighted average of precision for all classes, considering class imbalance.
3. **Recall (Weighted)**: The weighted average of recall for all classes.
4. **F1 Score (Weighted)**: The weighted average of the F1 score for all classes, giving a balance between precision and recall.

A **Confusion Matrix**, plotted in the Visualization section, gives a visual breakdown of true vs. predicted labels and shows class-level performance.
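
To make the "weighted" averages concrete, here is a small pure-Python sketch that weights each class's precision, recall, and F1 by its support (the number of true samples of that class). The helper name `weighted_metrics` and the toy labels are hypothetical; this notebook itself relies on `classification_report` for these numbers.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall, and F1 across all classes."""
    support = Counter(y_true)  # true-sample count per class
    n = len(y_true)
    precision = recall = f1 = 0.0
    for c, n_c in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        prec_c = tp / predicted if predicted else 0.0
        rec_c = tp / n_c
        denom = prec_c + rec_c
        f1_c = 2 * prec_c * rec_c / denom if denom else 0.0
        weight = n_c / n  # class weight = share of true samples
        precision += weight * prec_c
        recall += weight * rec_c
        f1 += weight * f1_c
    return precision, recall, f1

# Hypothetical toy labels, not the notebook's data
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

precision, recall, f1 = weighted_metrics(y_true, y_pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Note that weighted recall always equals overall accuracy, because each class's recall is scaled by its share of the samples.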


In [None]:
# Display model metrics and classification report for XGBoost

from sklearn.metrics import classification_report, accuracy_score

# Step 1: Calculate and display accuracy
accuracy_xgb = accuracy_score(y_val, xgb_y_pred)
print(f"XGBoost Classifier Accuracy: {accuracy_xgb:.4f}")

# Step 2: Generate and display classification report
xgb_classification_report = classification_report(y_val, xgb_y_pred)
print("\nClassification Report:\n", xgb_classification_report)


## Visualization
1. **Confusion Matrix**:
   - Displays the counts of true vs. predicted labels for each class.
   - Helps identify patterns of misclassification.

2. **ROC Curve and AUC**:
   - Plots a ROC curve for each class in a one-vs-rest setup.
   - The Area Under the Curve (AUC) provides a measure of how well the classifier distinguishes between classes.

3. **Feature Importance Plot**:
   - Highlights the top 10 features ranked by `weight`, i.e. how many times each feature is used to split the data across all trees.
   - This plot shows which features the model split on most often; split counts do not measure how much each split improved the predictions.
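
The one-vs-rest setup can be illustrated without any libraries: each class in turn is treated as the positive class, and its AUC is the probability that a positive sample receives a higher score than a negative one. The sketch below is an assumption-laden illustration, not the notebook's method: the helpers `binarize` and `auc_from_scores` and the toy probabilities are hypothetical, and the pairwise ranking formula stands in for `roc_curve` plus `auc`.

```python
def binarize(y, classes):
    """One-vs-rest indicator rows, one column per class."""
    return [[1 if label == c else 0 for c in classes] for label in y]

def auc_from_scores(targets, scores):
    """AUC as the chance a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(targets, scores) if t == 1]
    neg = [s for t, s in zip(targets, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 5 samples, 3 classes, one probability row per sample
y_true = [0, 1, 2, 1, 0]
proba = [
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.5, 0.2, 0.3],
    [0.6, 0.3, 0.1],
]

classes = sorted(set(y_true))
y_bin = binarize(y_true, classes)

aucs = {}
for i, c in enumerate(classes):
    targets = [row[i] for row in y_bin]   # 1 where the true label is class c
    scores = [row[i] for row in proba]    # predicted probability of class c
    aucs[c] = auc_from_scores(targets, scores)
    print(f"Class {c}: AUC = {aucs[c]:.2f}")
```

An AUC of 0.5 corresponds to the diagonal reference line on a ROC plot, and 1.0 means every positive sample is ranked above every negative one.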


In [None]:
# Import necessary libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, roc_curve, auc
from sklearn.preprocessing import label_binarize  # needed for the one-vs-rest ROC step
from xgboost import plot_importance

# Step 1: Confusion Matrix
conf_matrix = confusion_matrix(y_val, xgb_y_pred)
plt.figure(figsize=(10, 7))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix for XGBoost Classifier")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

# Step 2: ROC Curve and AUC for each class (One-vs-Rest)
# Use the classifier's own class order so columns line up with predict_proba
y_val_binary = label_binarize(y_val, classes=xgb_clf.classes_)
xgb_y_pred_proba = xgb_clf.predict_proba(X_val)

plt.figure(figsize=(10, 7))
for i in range(y_val_binary.shape[1]):
    fpr, tpr, _ = roc_curve(y_val_binary[:, i], xgb_y_pred_proba[:, i])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"Class {i} (AUC = {roc_auc:.2f})")

plt.plot([0, 1], [0, 1], "k--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("XGBoost Classifier ROC Curve (One-vs-Rest)")
plt.legend(loc="best")
plt.show()

# Step 3: Feature Importance Plot
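# Pass an explicit Axes: plot_importance otherwise creates its own figure and the figsize is lost
fig, ax = plt.subplots(figsize=(12, 8))
plot_importance(xgb_clf, ax=ax, max_num_features=10, importance_type="weight")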
plt.title("Top 10 Feature Importances for XGBoost Classifier")
plt.show()



1. **Displaying Model Accuracy**:
   - Shows the overall performance of the XGBoost classifier.

2. **Generating and Displaying the Classification Report**:
   - Outputs metrics including precision, recall, and f1-score for each class, which provides insights into the model's ability to handle different categories.
