## Instructions {-}

1. This notebook serves as the template for your code and final report on the Prediction Problem.

2. You may modify the template as needed, but it should include all required sections and information listed below.

3. Please make sure to include your name at the top of the assignment.

In [71]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures, KBinsDiscretizer
from sklearn.model_selection import KFold, train_test_split, GridSearchCV, cross_val_score, cross_val_predict, RepeatedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from lightgbm import LGBMClassifier
from skopt.space import Integer, Categorical, Real
from skopt import BayesSearchCV

## 1) Model Setup

This section should include any **data preprocessing** (e.g., encoding, scaling) and **feature engineering** you applied specifically for the final model.


I'll be making a stacking classifier that consists of:
- KNN
- Logistic regression
- Bagging
- Boosting

In [56]:
X = pd.read_csv('../Datasets/train_X.csv')
y_csv = pd.read_csv('../Datasets/train_y.csv')
y = y_csv['ON_TIME_AND_COMPLETE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**KNN preprocessing**

In [54]:
object_cols = list(X.select_dtypes('object').columns)
knn_preprocessor = ColumnTransformer(
    transformers=[
    ('impute_then_bin', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('binner', KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform'))
    ]), ['AVERAGE_VENDOR_ORDER_CYCLE_DAYS', 'AVERAGE_ORDER_CYCLE_DAYS']),
    ('impute', SimpleImputer(strategy = 'mean'), 
        ['AVERAGE_DAILY_DEMAND_CASES', 'AVERAGE_ORDER_CYCLE_CASES']),
    ('binning', KBinsDiscretizer(n_bins = 5, encode = 'ordinal', strategy = 'uniform'),
        ['GIVEN_TIME_TO_LEAD_TIME_RATIO', 'PRODUCT_CLASSIFICATION']),
    ('polynomial', PolynomialFeatures(degree = 2, include_bias = False),
        ['PURCHASING_LEAD_TIME', 'TRANSIT_LEAD_TIME', 'GIVEN_TIME_TO_LEAD_TIME_RATIO']), #or selector(dtype_include='number') or include=['int64', 'float64']
    ('encode_categorical', OneHotEncoder(handle_unknown = 'ignore', sparse_output = False),
        ['PRODUCT_CLASSIFICATION', 'DIVISION_NUMBER', 'DISTANCE_IN_MILES', 'PURCHASE_ORDER_TYPE', 'SHIP_FROM_VENDOR', 'LEAD_TIME_TO_DISTANCE_RATIO']),
    ('drop_objects', 'drop', object_cols),
    ('drop_id', 'drop', ['ID'])
        ],
    remainder = 'passthrough'
)

**Logistic regression preprocessing**

In [59]:
logistic_preprocessor = ColumnTransformer(
    transformers=[
    ('imputer', SimpleImputer(strategy='most_frequent'), 
        ['AVERAGE_DAILY_DEMAND_CASES', 'AVERAGE_VENDOR_ORDER_CYCLE_DAYS', 'AVERAGE_ORDER_CYCLE_DAYS']),
    ('encode_categorical', OneHotEncoder(handle_unknown='ignore', drop='first'), 
        ['ORDER_DAY_OF_WEEK', 'PURCHASE_ORDER_TYPE', 'TRANSIT_LEAD_TIME', 'DUE_DATE_WEEKDAY']),
    ('binning', KBinsDiscretizer(n_bins = 5, encode = 'ordinal', strategy = 'uniform'),
        ['LEAD_TIME_TO_DISTANCE_RATIO']),
    ('drop_cols', 'drop', 
        ['DIVISION_CODE', 'RESERVABLE_INDICATOR', 'PRODUCT_STATUS', 'DAYS_BETWEEN_ORDER_AND_DUE_DATE', 'ID', 'DIVISION_NUMBER', 
         'AVERAGE_PRODUCT_ORDER_QUANTITY_MARKET', 'AVERAGE_ORDER_CYCLE_CASES', 'ORDER_DATE','PURCHASE_ORDER_DUE_DATE']),
            # Drops columns that contain no information and columns that have a high VIF.
            # Note: 'AVERAGE_PRODUCT_ORDER_QUANTITY_MARKET' is also listed under 'skewed' below,
            # so it is still power-transformed and kept in the output.
    ('skewed', PowerTransformer(method='yeo-johnson'),
        ['DISTANCE_IN_MILES', 'AVERAGE_PRODUCT_ORDER_QUANTITY_MARKET', 'ORDER_QUANTITY_DEVIATION', 'PURCHASING_LEAD_TIME']),
    ],
    remainder='passthrough'
)


**Bagging preprocessing**

In [16]:
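# Rank numeric columns by absolute correlation with the target and keep the top 20
# (head(21) includes the target itself, which is removed afterwards)
merged_numeric = pd.merge(X, y_csv, on='ID').select_dtypes(include='number')
target_corrs = np.abs(merged_numeric.corr()['ON_TIME_AND_COMPLETE']).sort_values(ascending=False)
top_correlations = list(target_corrs.head(21).index)
top_correlations.remove('ON_TIME_AND_COMPLETE')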
bagging_preprocessor = ColumnTransformer(
    transformers=[
    ('top_corrs', 'passthrough', top_correlations)
    ],
    remainder='drop'
)
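
The correlation ranking above is computed on the full `X`. A variant that ranks features using only the training rows keeps the held-out rows out of the feature selection. The sketch below is one way to do that, shown on synthetic data; the helper `top_k_correlated` and the demo columns are hypothetical and are not part of the code used for the reported results.

```python
import numpy as np
import pandas as pd

def top_k_correlated(X, y, k):
    """Return the k numeric columns of X with the largest absolute correlation with y."""
    numeric = X.select_dtypes(include='number')
    corrs = numeric.corrwith(y).abs().sort_values(ascending=False)
    return list(corrs.head(k).index)

# Synthetic stand-in data: one informative column, one noise column, one text column
rng = np.random.default_rng(0)
y_demo = pd.Series(rng.integers(0, 2, size=200))
X_demo = pd.DataFrame({
    'signal': y_demo + rng.normal(0, 0.1, size=200),
    'noise': rng.normal(0, 1, size=200),
    'label_text': ['x'] * 200,  # non-numeric, ignored by the ranking
})

selected = top_k_correlated(X_demo, y_demo, k=1)
print(selected)
```

In the notebook's setting this would be called as `top_k_correlated(X_train, y_train, 20)`, assuming `X_train` and `y_train` share an index, which they do after `train_test_split`.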

**Boosting preprocessing**

In [67]:
# Drop columns with no information and the two datetime columns.
# The model severely overfits when the dates are included, and derived columns
# (e.g., days since the start, day of month) do not improve performance.
boosting_preprocessor = ColumnTransformer(
    transformers=[
        ('drop', 'drop', 
         ['DIVISION_CODE', 'RESERVABLE_INDICATOR', 'PRODUCT_STATUS', 'ORDER_DATE','PURCHASE_ORDER_DUE_DATE', 'ID'])
    ],
    remainder='passthrough'
)

## 2) Model Training

Build and train your base model(s) in this section.  
If you are using techniques like stacking or voting, be sure to show how the base models and the final estimator are defined and fit.


Since the base dataset is roughly balanced, I did not stratify the train/test split. For the same reason, accuracy is an appropriate metric here (if the classes were imbalanced, I would use F1 score instead).
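
A quick way to check the balance claim, and to see what stratifying would change, is sketched below on synthetic labels (the data and variable names are illustrative only). `value_counts(normalize=True)` reports the class shares, and `stratify=` keeps those shares nearly identical in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, roughly balanced binary target standing in for ON_TIME_AND_COMPLETE
rng = np.random.default_rng(42)
y_demo = pd.Series(rng.integers(0, 2, size=1000), name='target')
X_demo = pd.DataFrame({'feature': rng.normal(size=1000)})

# Share of each class; values near 0.5 indicate a balanced target
class_shares = y_demo.value_counts(normalize=True)
print(class_shares)

# Stratified split: the positive-class share is preserved in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
share_gap = abs(y_tr.mean() - y_te.mean())
print(f'Difference in positive share between train and test: {share_gap:.4f}')
```

With a balanced target and a large sample, an unstratified split usually gives similar shares anyway, which is why skipping stratification is reasonable here.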

**KNN model**

In [32]:
knn_params = {'metric': 'euclidean',
              'n_neighbors': np.int64(8),
              'weights': 'uniform'}

knn_pipeline = Pipeline([
    ('preprocessor', knn_preprocessor),
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(**knn_params))
])

knn_pipeline.fit(X_train, y_train)

print(f'Train accuracy: {accuracy_score(y_train, knn_pipeline.predict(X_train)):.4f}')
print(f'Test accuracy: {accuracy_score(y_test, knn_pipeline.predict(X_test)):.4f}')

Train accuracy: 0.8344
Test accuracy: 0.8041


**Logistic regression model**

In [None]:
logistic_params = {'C': 0.52641053,
                   'l1_ratio': 0.21052632,
                   'solver': 'saga', 
                   'penalty': 'elasticnet'}

logistic_pipeline = Pipeline([
    ('preprocessor', logistic_preprocessor),
    # ('scaler', StandardScaler()),
    ('model', LogisticRegression(**logistic_params))
])

logistic_pipeline.fit(X_train, y_train)
print(f'Train accuracy: {accuracy_score(y_train, logistic_pipeline.predict(X_train)):.4f}')
print(f'Test accuracy: {accuracy_score(y_test, logistic_pipeline.predict(X_test)):.4f}')

Train accuracy: 0.6464
Test accuracy: 0.6424




**Bagging model**

In [65]:
bagging_params = {'bootstrap_features': True,
                  'max_features': 0.5,
                  'max_samples': 0.5,
                  'n_estimators': 100}

bagging_pipeline = Pipeline([
    ('preprocessor', bagging_preprocessor),
    ('classifier', BaggingClassifier(**bagging_params))
])

bagging_pipeline.fit(X_train, y_train)
print(f'Train accuracy: {accuracy_score(y_train, bagging_pipeline.predict(X_train)):.4f}')
print(f'Test accuracy: {accuracy_score(y_test, bagging_pipeline.predict(X_test)):.4f}')

Train accuracy: 0.8962
Test accuracy: 0.8168


**Boosting model**

In [68]:
boosting_params = {'learning_rate': 0.03813597268507302,
                   'max_depth': 7,
                   'lambda_l1': 1.1672465647983294,
                   'lambda_l2': 5,
                   'min_split_gain': 0.04485197937078586,
                   'num_leaves': 183,
                   'feature_fraction': 0.41751919689550565,
                   'min_data_in_leaf': 84,
                   'n_estimators': 250,
                   'bagging_fraction': 0.7461616073499773,
                   'bagging_freq': 7}

boosting_pipeline = Pipeline([
    ('preprocessor', boosting_preprocessor),
    ('classifier', LGBMClassifier(**boosting_params, boosting_type='gbdt', verbose=-1))
])

boosting_pipeline.fit(X_train, y_train)
print(f'Train accuracy: {accuracy_score(y_train, boosting_pipeline.predict(X_train)):.4f}')
print(f'Test accuracy: {accuracy_score(y_test, boosting_pipeline.predict(X_test)):.4f}')



Train accuracy: 0.8436
Test accuracy: 0.8290


## 3) Hyperparameter Tuning

Describe and implement any hyperparameter tuning you applied (e.g., using Optuna, BayesSearchCV, or other methods).  
Include your code and clearly report the best parameters found.

> ⚠️ Even if your tuned model did not outperform the default settings, this step is still required. You must demonstrate and document your tuning efforts.


I will use a stacking model with a logistic regression metamodel, tuning the regularization hyperparameter with BayesSearchCV.

In [78]:
base_learners = [
    ('knn', knn_pipeline),
    ('logistic_regression', logistic_pipeline),
    ('bagging', bagging_pipeline),
    ('boosting', boosting_pipeline)
]

stacking_classifier = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression()
)

stacking_param_space = {
    'final_estimator__C': Real(1e-4, 1e4, prior='log-uniform')
}

stacking_bayes = BayesSearchCV(
    estimator=stacking_classifier,
    search_spaces=stacking_param_space,
    n_iter=5,
    cv=3,
    scoring='accuracy'
)
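
To make the stacking structure concrete, here is a small self-contained sketch on synthetic data. `make_classification` stands in for the real dataset, and the two base learners are illustrative rather than the tuned pipelines above. For a binary target, `StackingClassifier` feeds the metamodel one column per base learner (the predicted probability of the positive class, generated out-of-fold during fitting), so the metamodel has one coefficient per base learner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for the project data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

stack = StackingClassifier(
    estimators=[
        ('knn', KNeighborsClassifier()),
        ('tree', DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,  # folds used to build out-of-fold predictions for the metamodel
)
stack.fit(X_demo, y_demo)

# One meta-feature per base learner for a binary problem
meta_features = stack.transform(X_demo)
print(meta_features.shape)
print(stack.final_estimator_.coef_.shape)
```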

Train the stacking model and measure its accuracy before tuning the metamodel's C hyperparameter:

In [75]:
stacking_classifier.fit(X_train, y_train)




In [76]:
print(f'Train accuracy: {accuracy_score(y_train, stacking_classifier.predict(X_train)):.4f}')
print(f'Test accuracy: {accuracy_score(y_test, stacking_classifier.predict(X_test)):.4f}')



Train accuracy: 0.8569
Test accuracy: 0.8327




Tune the regularization hyperparameter:

In [None]:
stacking_bayes.fit(X_train, y_train)

In [80]:
stacking_bayes.best_params_

OrderedDict([('final_estimator__C', 11.065843210137139)])

The best C value is **11.065843210137139**
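
The search space for C spans eight orders of magnitude (1e-4 to 1e4) with a log-uniform prior. The NumPy sketch below illustrates what log-uniform sampling means; it is not skopt's internal implementation, only a demonstration that such a prior spreads candidates evenly across orders of magnitude, so about half of them fall below 1.0.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 1e-4, 1e4

# Sample the exponent uniformly, then exponentiate: this is log-uniform sampling
samples = 10 ** rng.uniform(np.log10(low), np.log10(high), size=10_000)

share_below_one = float(np.mean(samples < 1.0))
print(f'min={samples.min():.2e}, max={samples.max():.2e}, '
      f'share below 1.0={share_below_one:.2f}')
```

A plain uniform prior over the same range would place almost every candidate above 1.0, which is why a log scale is the usual choice for regularization strengths.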

In [81]:
print(f'Train accuracy: {accuracy_score(y_train, stacking_bayes.best_estimator_.predict(X_train)):.4f}')
print(f'Test accuracy: {accuracy_score(y_test, stacking_bayes.best_estimator_.predict(X_test)):.4f}')



Train accuracy: 0.8548
Test accuracy: 0.8322




Inspect the metamodel coefficients:

In [86]:
stacking_coefs = {}

stacking_coefs['Base Learner'] = [name for name, _ in base_learners]
stacking_coefs['Meta Model Coefficient'] = stacking_bayes.best_estimator_.final_estimator_.coef_[0]

pd.DataFrame(stacking_coefs).sort_values(by = 'Meta Model Coefficient', ascending = False)

|   | Base Learner | Meta Model Coefficient |
|---|---|---|
| 3 | boosting | 4.561875 |
| 0 | knn | 1.170416 |
| 2 | bagging | 0.66821 |
| 1 | logistic_regression | -1.002871 |
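
Because every meta-feature is a predicted probability between 0 and 1, the coefficients are on a comparable scale. One rough heuristic for summarizing them is each learner's share of the total absolute weight, computed below from the values in the table. This is only a descriptive summary: the base learners' predictions are correlated, so a negative coefficient does not by itself mean a learner is harmful.

```python
import pandas as pd

# Coefficients copied from the metamodel output above
coefs = pd.Series({
    'boosting': 4.561875,
    'knn': 1.170416,
    'bagging': 0.66821,
    'logistic_regression': -1.002871,
})

# Share of total absolute weight carried by each base learner
abs_share = coefs.abs() / coefs.abs().sum()
print(abs_share.round(3))
```

By this measure the boosting model carries roughly 62% of the total absolute weight.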


## 4) Final Model Training & Prediction

Retrain your final model using the best hyperparameters from the tuning step, and generate predictions for evaluation.  
Make sure to include code for both training and prediction, along with the evaluation metric (e.g., MAE).


In [83]:
public_private_X = pd.read_csv('../Datasets/public_private_X.csv')
submission = pd.DataFrame()
submission['ID'] = public_private_X['ID']

output_prediction = stacking_bayes.best_estimator_.predict(public_private_X)

submission['ON_TIME_AND_COMPLETE'] = output_prediction
submission.to_csv('Submissions/final.csv', index = False)
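
The cell above predicts with `best_estimator_`, which BayesSearchCV refits on `X_train` only. If a final refit on all labeled data were wanted before predicting, one option is `sklearn.base.clone`, which copies an estimator's hyperparameters without its fitted state. The sketch below shows the pattern on synthetic data, with a plain `LogisticRegression` standing in for the tuned stacking model; it is not what produced the submission.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled data
X_all, y_all = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# Stand-in for the tuned model (in the notebook: stacking_bayes.best_estimator_)
tuned_model = LogisticRegression(C=11.07).fit(X_tr, y_tr)

# clone() keeps the hyperparameters but discards the fit; refit on all labeled rows
final_model = clone(tuned_model).fit(X_all, y_all)
predictions = final_model.predict(X_te)
print(final_model.get_params()['C'], len(predictions))
```

The trade-off is that a model refit on all labeled rows has no held-out split left to evaluate it on, so the earlier test accuracy serves as the performance estimate.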



## 5) Justification for Final Model Credit

Explain why this model goes beyond the boosting model, even if the performance is not better.  
Highlight any techniques you applied—such as stacking, voting, extensive tuning, or any other technique that you tried—that reflect a meaningful effort beyond what was done for the boosting model.

**Example justification:**

*I applied stacking with multiple base models and meta-learning. I also tuned all components using Optuna. Although the leaderboard score did not surpass the boosting model, this work demonstrates a thoughtful attempt and applies techniques covered in class.*


This model goes beyond the boosting model because it is a stacking model that combines four base learners with a logistic regression metamodel whose regularization hyperparameter was tuned. It draws on several different models covered in class, as well as Bayes search for hyperparameter tuning.

## 6) Comparison with Boosting Model

Was your final model better than your previously submitted boosting model?

> **Yes / No**  
> _Brief explanation:_ What metric(s) did you use to compare? Did the final model improve MAE on cross-validation or the leaderboard? If not, why do you think that happened?


As the metamodel coefficients show, the boosting model is weighted far more heavily than the other base learners, so the stacked model behaves very similarly to the boosting model on its own. Consistent with this, the final model performed the same as the boosting model.

## 7) Key Takeaways (Short Reflection)

- Provide a brief summary of your key takeaways from working on this prediction problem.
- Reflect on any challenges you faced and lessons you learned during the model-building process.
- You may also include insights about what worked well, what didn’t, and what you might do differently in future projects.

Working with many different base models gave me great insight into how each behaves. Each base model needed different preprocessing: some needed categorical encoding while others handle categories natively, and different models performed better with different types of numerical preprocessing. Logistic regression benefited greatly from binning and power transformations to offset skew.  
Combining these models in a stacking classifier showed me how the metamodel weights each base learner differently based on its performance.  
I tended to have issues with overfitting: my model would perform drastically better on the test set made from train_test_split than on the Kaggle public set, and slightly better on the Kaggle public set than on the Kaggle private set. The difference between the Kaggle public and private sets was small enough to be due to randomness, but I found two ways to reduce the obvious overfitting between my test set and the Kaggle public set. First, I dropped categorical columns with too many distinct categories. Second, I used more regularization by raising the lower bounds of the search space for the L1 and L2 regularization hyperparameters. This slightly decreased performance on my validation set but noticeably increased performance on the Kaggle set.  
In future projects, I will apply what I've learned about combating overfitting and about combining diversified base learners.