# XGBoost
* XGBoost (Extreme Gradient Boosting) is an optimized implementation of gradient boosting. It follows the usual gradient boosting procedure but adds several features that improve model performance, training speed, and scalability:
    * Regularization: XGBoost adds L1 and L2 regularization terms to the objective function to prevent overfitting and control tree complexity
    * Loss Functions: Besides the built-in objectives, users can supply a custom loss function (through its gradient and Hessian), which allows flexibility for problem-specific tasks
    * Tree Construction: XGBoost can use a histogram-based approach (`tree_method='hist'`) to find the best splits. Feature values are bucketed into bins and gradient statistics are accumulated per bin, so only bin boundaries need to be evaluated as split candidates. Sparse data is handled by a sparsity-aware split search that visits only the non-missing entries
    * Missing Values: XGBoost handles missing values during training. Each split learns a default direction (left or right) for missing values from the data
    * Scalability: Parallel and distributed computing is supported, making it efficient for larger datasets
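    * Categorical Features: XGBoost can split directly on categorical features without One-Hot Encoding, but this must be enabled explicitly (`enable_categorical=True` with pandas `category` columns). Plain integer codes are otherwise treated as ordinary numeric values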
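
To make the regularization bullet concrete, here is a minimal sketch of the closed-form leaf weight and split gain that follow from a regularized second-order objective of the kind XGBoost uses. It is plain Python written for illustration, not code taken from the library. The function names are hypothetical; `reg_lambda`, `reg_alpha`, and `gamma` mirror the XGBoost parameter names for the L2 penalty, the L1 penalty, and the minimum split loss, and `G` and `H` stand for the sums of the loss gradients and Hessians over the samples in a node.

```python
def soft_threshold(g, alpha):
    """Shrink g toward zero by alpha; this is how the L1 penalty acts on a leaf."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0


def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf value for gradient sum G and Hessian sum H."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)


def leaf_score(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf quality: a larger score means a larger reduction of the objective."""
    return soft_threshold(G, reg_alpha) ** 2 / (H + reg_lambda)


def split_gain(G_left, H_left, G_right, H_right,
               reg_lambda=1.0, reg_alpha=0.0, gamma=0.0):
    """Gain of splitting a node into two children, minus the per-split penalty gamma."""
    children = (leaf_score(G_left, H_left, reg_lambda, reg_alpha)
                + leaf_score(G_right, H_right, reg_lambda, reg_alpha))
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda, reg_alpha)
    return 0.5 * (children - parent) - gamma


# Toy node statistics (made up for illustration).
w_default = leaf_weight(-4.0, 3.0, reg_lambda=1.0)               # 1.0
w_strong_l2 = leaf_weight(-4.0, 3.0, reg_lambda=5.0)             # 0.5
w_with_l1 = leaf_weight(-4.0, 3.0, reg_lambda=1.0, reg_alpha=1.0)  # 0.75
gain = split_gain(-4.0, 3.0, 4.0, 3.0, reg_lambda=1.0)           # 4.0
```

Raising `reg_lambda` or `reg_alpha` shrinks leaf values toward zero, and a split is only worth making when its gain stays positive after `gamma` is subtracted, which is how these terms limit tree complexity.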

XGBoost is widely regarded as one of the most effective machine learning algorithms for tabular data. Other packages that implement gradient boosting are also worth checking out, notably LightGBM and CatBoost. LightGBM focuses on training speed and memory usage, while CatBoost (Categorical Boosting) handles categorical features natively, using a permutation-based approach to reduce the overfitting that encoding categorical features can cause.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier, XGBRegressor
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

In [2]:
# read in test and train data from S3
import pandas as pd
test_data = pd.read_csv('https://project4-wine-quality-2023.s3.us-west-2.amazonaws.com/test.csv')
train_data = pd.read_csv('https://project4-wine-quality-2023.s3.us-west-2.amazonaws.com/train.csv')
submission = pd.read_csv('C:/Users/nicho/Downloads/playground-series-s3e5/sample_submission.csv')

In [3]:
# Train and test sets are provided as separate files, so no train/test split is needed.
# A validation set is held out from the training data below for model evaluation.
X_train = train_data.drop(['Id', 'quality'], axis= 1)
y_train = train_data['quality'].copy()
X_test = test_data.copy()
y_train -= 3  # shift quality scores (which start at 3) to zero-based class labels
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=0)
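
The `y_train -= 3` step presumably maps quality scores 3 to 8 onto 0 to 5, the zero-based labels that go with `num_class = 6`. A fixed offset only works when every score in the range is present. As a hedged alternative, the sketch below uses `np.unique` to build the mapping from whatever labels occur and to decode predictions afterwards. The `quality` array is made-up toy data, not the competition data.

```python
import numpy as np

# Toy quality labels; note that score 5 is missing, so a fixed offset
# of 3 would leave a gap in the encoded classes.
quality = np.array([3, 4, 6, 8, 6, 3, 7])

# classes holds the sorted distinct labels; encoded holds, for every sample,
# the index of its label within classes (0 .. n_classes - 1, no gaps).
classes, encoded = np.unique(quality, return_inverse=True)

# Decoding a prediction is just an index lookup into classes.
decoded = classes[encoded]
```

With the notebook's data, where the offset is known, subtracting and re-adding 3 is simpler; the lookup version is mainly useful when the label set is not contiguous.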


In [4]:
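# Compute per-sample weights to compensate for the imbalanced classes.
# Note: the weights only take effect if passed to fit() as sample_weight, which is not done below.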
from sklearn.utils.class_weight import compute_sample_weight
sample_weights = compute_sample_weight(class_weight = 'balanced', y = y_train)
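
For reference, the `'balanced'` heuristic assigns each class the weight `n_samples / (n_classes * class_count)` and gives every sample the weight of its class. The sketch below reproduces that formula with NumPy on toy labels; `balanced_sample_weights` is an illustrative helper name, not a scikit-learn function.

```python
import numpy as np


def balanced_sample_weights(y):
    """Per-sample weights following n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    class_weight = len(y) / (len(classes) * counts)
    # classes is sorted, so searchsorted maps each label to its class index.
    return class_weight[np.searchsorted(classes, y)]


# Toy labels: class 0 is three times as frequent as class 1.
y_toy = np.array([0, 0, 0, 1])
w_toy = balanced_sample_weights(y_toy)  # rare class gets the larger weight
```

Each class ends up with the same total weight, and the weights sum to the number of samples. They influence training only when handed to the model, for example through the `sample_weight` argument of `XGBClassifier.fit`.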

In [5]:
# Create an XGBoost classifier
xgb = XGBClassifier(objective = 'multi:softprob', num_class = 6, eval_metric = 'mlogloss')

# Define the parameter grid for hyperparameter tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(xgb, param_grid, cv=5, verbose = 3)
grid_search.fit(X_train, y_train)

# Get the best XGBoost model from grid search
best_xgb = grid_search.best_estimator_

# Make predictions on the validation set using the best model
y_pred = best_xgb.predict(X_val)

# Calculate metrics
accuracy = accuracy_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred, average="weighted")
precision = precision_score(y_val, y_pred, average="weighted")
recall = recall_score(y_val, y_pred, average="weighted")

Fitting 5 folds for each of 27 candidates, totalling 135 fits
[CV 1/5] END learning_rate=0.01, max_depth=3, n_estimators=50;, score=0.573 total time=   0.0s
[CV 2/5] END learning_rate=0.01, max_depth=3, n_estimators=50;, score=0.583 total time=   0.0s
[CV 3/5] END learning_rate=0.01, max_depth=3, n_estimators=50;, score=0.576 total time=   0.0s
[CV 4/5] END learning_rate=0.01, max_depth=3, n_estimators=50;, score=0.545 total time=   0.0s
[CV 5/5] END learning_rate=0.01, max_depth=3, n_estimators=50;, score=0.554 total time=   0.0s
[CV 1/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.559 total time=   0.0s
[CV 2/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.604 total time=   0.0s
[CV 3/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.559 total time=   0.0s
[CV 4/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.545 total time=   0.0s
[CV 5/5] END learning_rate=0.01, max_depth=3, n_estimators=100;, score=0.571 to

  _warn_prf(average, modifier, msg_start, len(result))
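
The "27 candidates, totalling 135 fits" line in the log follows directly from the grid: three values for each of three hyperparameters, each evaluated on 5 folds. A small sketch that enumerates the same combinations with `itertools.product` (the variable names `candidates` and `total_fits` are illustrative):

```python
from itertools import product

param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
}

# Every combination of one value per hyperparameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5
total_fits = len(candidates) * n_folds
```

The cost grows multiplicatively, so adding a fourth hyperparameter with three values would triple the number of fits.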


In [6]:
grid_search.best_params_

{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50}

In [7]:
final_model = XGBClassifier(objective = 'multi:softprob', num_class = 6, eval_metric = 'mlogloss', **grid_search.best_params_)

In [9]:
# Refit on the full training set (including the validation rows) with the best hyperparameters
final_model.fit(train_data.drop(['Id', 'quality'], axis= 1), train_data['quality'] - 3)

In [10]:
# Predict on the test set and shift the labels back to the original quality scale
y_pred = final_model.predict(X_test.drop('Id', axis = 1)) + 3

In [11]:
submission["quality"] = y_pred

In [12]:
submission.to_csv("test_submission.csv", index = False)

In [13]:
print(f"Best Estimators: {best_xgb.n_estimators}")
print(f"Best Learning Rate: {best_xgb.learning_rate}")
print(f"Best Max Depth: {best_xgb.max_depth}")
print(f"Accuracy: {accuracy:.2f}")
print(f"F1: {f1:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")

Best Estimators: 50
Best Learning Rate: 0.1
Best Max Depth: 3
Accuracy: 0.59
F1: 0.57
Precision: 0.56
Recall: 0.59
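
Accuracy and recall are both 0.59 above, which is expected: with `average="weighted"`, per-class recall is weighted by class frequency, and that sum reduces to overall accuracy. The sketch below recomputes the weighted scores with NumPy on toy labels to show this. `weighted_scores` is an illustrative helper, and treating the precision of a never-predicted class as 0 is an assumption made here (such classes are a likely cause of the `_warn_prf` warning in the earlier output).

```python
import numpy as np


def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision and recall."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)

    support = np.array([(y_true == c).sum() for c in classes])
    predicted = np.array([(y_pred == c).sum() for c in classes])
    true_pos = np.array([((y_true == c) & (y_pred == c)).sum() for c in classes])

    # Precision is set to 0 for classes that were never predicted.
    precision = np.divide(true_pos, predicted,
                          out=np.zeros(len(classes)), where=predicted > 0)
    recall = true_pos / support
    weights = support / support.sum()

    return {
        "accuracy": float((y_true == y_pred).mean()),
        "precision": float((weights * precision).sum()),
        "recall": float((weights * recall).sum()),
    }


# Toy labels: class 2 occurs once and is never predicted.
scores = weighted_scores([0, 0, 0, 1, 1, 2], [0, 0, 1, 1, 1, 0])
```

Because weighted recall carries no information beyond accuracy, macro-averaged scores or per-class scores are often more informative on an imbalanced dataset like this one.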
