XGBoost in Python via scikit-learn and 5-fold CV

In [9]:
# Import libraries
import pyreadr
import xgboost as xgb
from sklearn.model_selection import cross_val_score
import pandas as pd
import time

This step imports the required Python packages: pyreadr loads the .rds datasets exported from R, xgboost and scikit-learn provide the tools for training and cross-validating the XGBoost model, time measures the elapsed time, and pandas stores the results.



In [10]:
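# Define a function that runs XGBoost with 5-fold CV
def run_xgb_scikit(data):
    # Split the predictors from the target column
    X = data.drop("outcome", axis=1)
    y = data["outcome"].astype(int)
    # use_label_encoder is ignored by recent XGBoost releases and only triggers a warning
    model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
    # Time the whole cross-validation run (five fits plus scoring)
    start = time.time()
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    end = time.time()
    # Mean accuracy over the five held-out folds, and elapsed seconds
    return scores.mean(), end - start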

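The run_xgb_scikit() function evaluates an XGBoost classifier.
It separates the predictors (X) from the target (y), then uses cross_val_score() for 5-fold CV.
The function returns the mean accuracy across the five held-out folds and the elapsed time for the whole cross-validation run, which covers five model fits plus scoring rather than a single fit.
Applying the same function to every dataset size keeps the evaluation consistent.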
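Because the timer wraps the whole cross_val_score() call, fitting and scoring time are reported together. The sketch below shows one way to separate them with scikit-learn's cross_validate(), which returns per-fold fit_time and score_time. The synthetic data and the GradientBoostingClassifier are stand-ins chosen for illustration (assumptions, not part of the notebook); xgb.XGBClassifier could be passed in the same position. With an integer cv and a classifier, scikit-learn uses stratified folds without shuffling, so the explicit, seeded StratifiedKFold here is also an assumption about what one might prefer.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (assumption): the notebook loads .rds files instead.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Stand-in estimator (assumption): any scikit-learn compatible classifier works here.
model = GradientBoostingClassifier(n_estimators=20, random_state=0)

# Explicit, seeded stratified folds so repeated runs use the same splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

start = time.time()
cv_results = cross_validate(model, X, y, cv=cv, scoring="accuracy")
wall_time = time.time() - start

# Per-fold arrays: one entry for each of the five folds.
mean_accuracy = cv_results["test_score"].mean()
total_fit_time = cv_results["fit_time"].sum()
total_score_time = cv_results["score_time"].sum()

print(f"mean accuracy:    {mean_accuracy:.4f}")
print(f"total fit time:   {total_fit_time:.3f} s")
print(f"total score time: {total_score_time:.3f} s")
print(f"wall-clock time:  {wall_time:.3f} s")
```

Reporting the summed fit time next to the wall-clock time makes it clearer how much of the measured duration is model fitting and how much is prediction and overhead.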

In [11]:
# Loop through the dataset sizes: load each RDS file and evaluate it
dataset_sizes = [100, 1000, 10000, 100000, 1000000, 10000000]
results = []

for sz in dataset_sizes:
    try:
        path = f"bootstrap_data_{sz}.rds"
        result = pyreadr.read_r(path)
        df = result[None]
        acc, duration = run_xgb_scikit(df)
        results.append({
            "Method used": "XGBoost in Python via scikit-learn and 5-fold CV",
            "Dataset size": sz,
            "Testing-set predictive performance": round(acc, 4),
            "Time taken for the model to be fit": round(duration, 2)
        })
    except Exception as e:
        results.append({
            "Method used": "XGBoost in Python via scikit-learn and 5-fold CV",
            "Dataset size": sz,
            "Testing-set predictive performance": "Error",
            "Time taken for the model to be fit": str(e)
        })

Parameters: { "use_label_encoder" } are not used.


The code loops through all defined dataset sizes, from 100 to 10 million rows.
Each .rds file is read using pyreadr, and the model is evaluated with run_xgb_scikit().
The accuracy and the time taken are collected in a dictionary, one per dataset size.
If loading or fitting fails, the exception is caught and its message is recorded for that size in place of the timing, so the loop continues with the remaining sizes.
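The loop stores an error message in the same column that otherwise holds a number, which gives that column mixed types. A minimal sketch of an alternative is shown below: failures go into a dedicated "Error" field and the numeric fields stay empty. The function evaluate_sizes() and the helpers fake_load() and fake_evaluate() are hypothetical names introduced only for this illustration; in the notebook, the loader would wrap pyreadr.read_r() and the evaluator would be run_xgb_scikit().

```python
def evaluate_sizes(sizes, load_data, evaluate):
    """Run evaluate(load_data(size)) for each size; record failures without stopping."""
    rows = []
    for size in sizes:
        row = {"Dataset size": size, "Accuracy": None, "Seconds": None, "Error": None}
        try:
            acc, seconds = evaluate(load_data(size))
            row["Accuracy"] = round(acc, 4)
            row["Seconds"] = round(seconds, 2)
        except Exception as exc:
            # Keep the exception type so different failures are easy to tell apart.
            row["Error"] = f"{type(exc).__name__}: {exc}"
        rows.append(row)
    return rows


# Hypothetical stand-ins so the sketch runs without any data files.
def fake_load(size):
    if size > 1000:
        raise FileNotFoundError(f"bootstrap_data_{size}.rds")
    return list(range(size))


def fake_evaluate(data):
    # Pretend accuracy is constant and time grows with the number of rows.
    return 0.9, 0.01 * len(data)


rows = evaluate_sizes([100, 1000, 10000], fake_load, fake_evaluate)
for row in rows:
    print(row)
```

Keeping the numeric columns numeric means the resulting table can be sorted, plotted, or summarised without first filtering out the failed runs.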

In [12]:
# Output the results as a DataFrame
results_df = pd.DataFrame(results)
print(results_df)

                                        Method used  Dataset size  \
0  XGBoost in Python via scikit-learn and 5-fold CV           100   
1  XGBoost in Python via scikit-learn and 5-fold CV          1000   
2  XGBoost in Python via scikit-learn and 5-fold CV         10000   
3  XGBoost in Python via scikit-learn and 5-fold CV        100000   
4  XGBoost in Python via scikit-learn and 5-fold CV       1000000   
5  XGBoost in Python via scikit-learn and 5-fold CV      10000000   

   Testing-set predictive performance  Time taken for the model to be fit  
0                              0.9200                                0.97  
1                              0.9500                                0.86  
2                              0.9777                                0.76  
3                              0.9872                                3.87  
4                              0.9919                               46.43  
5                              0.9932                       
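The printed frame wraps into two blocks because the column names are long and print() limits the display width. A minimal sketch of one way to keep each row on a single line is DataFrame.to_string(), which does not wrap by default. The shorter column names are assumptions made for readability, and only the five rows whose values are fully shown in the output above are reproduced.

```python
import pandas as pd

# Values copied from the first five rows of the output above.
rows = [
    (100, 0.9200, 0.97),
    (1_000, 0.9500, 0.86),
    (10_000, 0.9777, 0.76),
    (100_000, 0.9872, 3.87),
    (1_000_000, 0.9919, 46.43),
]

# Shorter, hypothetical column names so the table stays narrow.
df = pd.DataFrame(rows, columns=["Dataset size", "Mean CV accuracy", "CV time (s)"])

# to_string() renders the full frame without wrapping columns onto a second block.
text = df.to_string(index=False)
print(text)
```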

The results from the Python-based XGBoost implementation using scikit-learn and 5-fold cross-validation show strong predictive performance that improves steadily with dataset size. For the smaller datasets (100 to 10,000 rows), mean cross-validated accuracy ranged from 92% to nearly 98%, and each run finished in under one second. At 100,000 and 1 million rows, accuracy rose further to 98.72% and 99.19%, while the run time increased to 3.87 and 46.43 seconds respectively. For the largest dataset with 10 million records, accuracy reached 99.32%, but the run required over 6 minutes to complete. Note that the reported times cover the full cross-validation run (five fits plus scoring), not a single model fit. These results show that the Python XGBoost model scales well in terms of accuracy, though computation time becomes a critical factor at very large scales.
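To make the scaling claim concrete, the short sketch below computes how the run time and the accuracy change each time the dataset grows tenfold. It uses only the five rows whose values are fully reported in the table above; the 10 million row run is left out because only an approximate duration is given for it.

```python
# Values taken from the results table (first five dataset sizes).
sizes = [100, 1_000, 10_000, 100_000, 1_000_000]
seconds = [0.97, 0.86, 0.76, 3.87, 46.43]
accuracy = [0.9200, 0.9500, 0.9777, 0.9872, 0.9919]

time_ratios = []
accuracy_gains = []
for i in range(1, len(sizes)):
    # How many times longer the run took after a tenfold increase in rows.
    time_ratios.append(seconds[i] / seconds[i - 1])
    # Accuracy change in percentage points over the same step.
    accuracy_gains.append((accuracy[i] - accuracy[i - 1]) * 100)

for i, (ratio, gain) in enumerate(zip(time_ratios, accuracy_gains), start=1):
    print(
        f"{sizes[i - 1]:>9,} -> {sizes[i]:>9,} rows: "
        f"time x{ratio:.2f}, accuracy {gain:+.2f} points"
    )
```

On these figures the run time stays roughly flat up to 10,000 rows, which suggests fixed overhead dominates there, and then grows about 12-fold for the step from 100,000 to 1 million rows, while the accuracy gain for that same step is under half a percentage point.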








