In [17]:
import os
import sys

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from flaml import AutoML
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Make the project root importable so that `src` can be found
sys.path.append(os.path.abspath(".."))
from src.pipeline import build_pipeline

pd.options.display.max_rows = 999
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)


In [18]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [19]:
print("Loading raw data...")
raw_df = pd.read_json("../data/dataset.json")
raw_df.drop_duplicates(subset=["request_id"], keep="first", inplace=True)

y = raw_df["requester_received_pizza"]
X = raw_df.drop("requester_received_pizza", axis=1)

# Split the data into a training set and a stratified holdout test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Data split: {len(X_train)} training samples, {len(X_test)} test samples.")


preprocessor_pipeline = build_pipeline(use_tfidf=True)
print("Fitting the preprocessor on the training data...")
X_train_processed = preprocessor_pipeline.fit_transform(X_train, y_train)
X_test_processed = preprocessor_pipeline.transform(
    X_test
)

Loading raw data...
Data split: 3232 training samples, 808 test samples.
Fitting the preprocessor on the training data...


# 5. Building a machine learning model

Our initial choice for the primary evaluation metric was ROC AUC. While it's excellent for measuring a model's ability to rank predictions, we found it gave a misleading picture of performance for this specific problem. A model could achieve a high AUC score by being good at ranking, but still be practically useless if its decision threshold resulted in a very low recall for the positive class.
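To make this concrete, here is a minimal sketch on made-up toy scores (not the pizza data): the ranking is perfect, so ROC AUC is 1.0, yet no score crosses the default 0.5 threshold and recall is zero.

```python
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

# Toy labels: 2 positives among 8 samples (illustrative values only)
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])
# Both positives are ranked above every negative, but all scores stay below 0.5
y_score_toy = np.array([0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.35, 0.40])

toy_auc = roc_auc_score(y_true_toy, y_score_toy)
toy_preds = (y_score_toy >= 0.5).astype(int)  # default decision threshold
toy_recall = recall_score(y_true_toy, toy_preds)

print(f"ROC AUC: {toy_auc:.2f}, recall at threshold 0.5: {toy_recall:.2f}")
```

AUC only looks at the ordering of the scores, while recall depends on where the threshold cuts them.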

For this reason, we pivoted to the F1-score as our primary metric. With only ~24% of requests being successful, our main goal is to correctly identify this small, positive group. The F1-score is the harmonic mean of Precision and Recall, which forces the model to be good at two critical things:

*   **Precision:** When the model predicts "Pizza Received," how often is it correct?

*   **Recall:** Of all the requests that actually resulted in a pizza, how many did our model find?

A model can't achieve a high F1-score by simply predicting the majority class ("No Pizza"). It must make a genuine effort to find the positive cases, making it a far more reliable and realistic measure of the model's practical value for this challenge.
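As a quick check of that claim, the sketch below scores a classifier that always predicts "No Pizza". It uses the test-set class counts that appear in the classification reports of this notebook (609 negatives, 199 positives); the helper name `f1_from_counts` is ours.

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


n_neg, n_pos = 609, 199  # test-set support per class

# Always predicting the majority class: no true or false positives at all
majority_accuracy = n_neg / (n_neg + n_pos)
majority_f1 = f1_from_counts(tp=0, fp=0, fn=n_pos)

print(f"Accuracy: {majority_accuracy:.4f}, F1: {majority_f1:.4f}")
```

Accuracy looks respectable (about 0.75) while F1 is exactly zero, which is why accuracy alone is not a useful yardstick here.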

In [20]:
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.naive_bayes import GaussianNB

# Shared seed for reproducibility
random_state = 42

# A dictionary to store the results for comparison
model_results = {}


## Exploration of base classification models

In [21]:
print("--- Training Logistic Regression (Baseline) ---")
lr_model = LogisticRegression(random_state=random_state, max_iter=1000)
lr_model.fit(X_train_processed, y_train)

# Make predictions on the test set, need probabilities for ROC AUC calculation
lr_probs = lr_model.predict_proba(X_test_processed)[:, 1]
lr_preds = lr_model.predict(X_test_processed)

# Calculate all metrics
lr_auc = roc_auc_score(y_test, lr_probs)
lr_accuracy = accuracy_score(y_test, lr_preds)
lr_recall = recall_score(y_test, lr_preds)
lr_f1 = f1_score(y_test, lr_preds)

# Store results for comparison
model_results["Logistic Regression"] = {
    "auc": lr_auc,
    "probs": lr_probs,
    "accuracy": lr_accuracy,
    "recall": lr_recall,
    "f1": lr_f1,
}

print(f"\n--- Logistic Regression Performance Metrics ---")
print(f"ROC AUC:   {lr_auc:.4f}")
print(f"Accuracy:  {lr_accuracy:.4f}")
print(f"Recall:    {lr_recall:.4f}")
print(f"F1 Score:  {lr_f1:.4f}")

print("\nDetailed Classification Report:")
print(classification_report(y_test, lr_preds))


--- Training Logistic Regression (Baseline) ---

--- Logistic Regression Performance Metrics ---
ROC AUC:   0.6254
Accuracy:  0.7537
Recall:    0.0905
F1 Score:  0.1532

Detailed Classification Report:
              precision    recall  f1-score   support

       False       0.77      0.97      0.86       609
        True       0.50      0.09      0.15       199

    accuracy                           0.75       808
   macro avg       0.63      0.53      0.50       808
weighted avg       0.70      0.75      0.68       808



Recall for the positive class is very low (0.09): the model misses most successful requests, so we have a lot of false negatives. This was expected given the class imbalance. The Logistic Regression F1 score of 0.1532 will be our baseline.
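One common way to counter this kind of imbalance is to reweight the classes. The sketch below is an illustration on synthetic data from `make_classification`, not on the pizza dataset, and was not part of the experiments above; it compares a plain `LogisticRegression` with one using `class_weight="balanced"`. All `*_demo` names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 25% positives and a weak signal (assumption)
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=20, n_informative=5,
    weights=[0.75, 0.25], class_sep=0.5, flip_y=0.05, random_state=42,
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

demo_scores = {}
for label, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    model = LogisticRegression(max_iter=1000, class_weight=class_weight)
    model.fit(X_tr_demo, y_tr_demo)
    preds = model.predict(X_te_demo)
    demo_scores[label] = {
        "recall": recall_score(y_te_demo, preds),
        "f1": f1_score(y_te_demo, preds),
    }
    print(f"{label:>10}: recall={demo_scores[label]['recall']:.3f}, "
          f"f1={demo_scores[label]['f1']:.3f}")
```

Balanced weights typically trade some precision for recall; whether F1 improves on the real features would have to be checked.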

In [22]:
print("\n--- Training Gaussian Naive Bayes ---")

# Naive Bayes requires a dense array, not a sparse matrix
gnb_model = GaussianNB()
gnb_model.fit(X_train_processed.toarray(), y_train)

# Make predictions
gnb_probs = gnb_model.predict_proba(X_test_processed.toarray())[:, 1]
gnb_preds = gnb_model.predict(X_test_processed.toarray())

# Calculate all metrics
gnb_auc = roc_auc_score(y_test, gnb_probs)
gnb_accuracy = accuracy_score(y_test, gnb_preds)
gnb_recall = recall_score(y_test, gnb_preds)
gnb_f1 = f1_score(y_test, gnb_preds)

# Store results for comparison
model_results["Naive Bayes"] = {
    "auc": gnb_auc,
    "probs": gnb_probs,
    "accuracy": gnb_accuracy,
    "recall": gnb_recall,
    "f1": gnb_f1,
}

print(f"\n--- Naive Bayes Performance Metrics ---")
print(f"ROC AUC:   {gnb_auc:.4f}")
print(f"Accuracy:  {gnb_accuracy:.4f}")
print(f"Recall:    {gnb_recall:.4f}")
print(f"F1 Score:  {gnb_f1:.4f}")

print("\nDetailed Classification Report:")
print(classification_report(y_test, gnb_preds))



--- Training Gaussian Naive Bayes ---

--- Naive Bayes Performance Metrics ---
ROC AUC:   0.5749
Accuracy:  0.5384
Recall:    0.5678
F1 Score:  0.3773

Detailed Classification Report:
              precision    recall  f1-score   support

       False       0.79      0.53      0.63       609
        True       0.28      0.57      0.38       199

    accuracy                           0.54       808
   macro avg       0.54      0.55      0.51       808
weighted avg       0.66      0.54      0.57       808



Naive Bayes is the only base model that flags positives aggressively. With an F1 of 0.38, it finds over half of the positives (recall 0.57), but at a cost: precision is only 0.28 and accuracy drops to 0.54, so false alarms are frequent. By F1 it is still the strongest of the base models.
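The trade-off is easier to see as raw counts. As a worked example, the sketch below rebuilds the Naive Bayes confusion matrix from the figures printed above (support, positive-class recall, and accuracy), rounding to whole requests.

```python
n_neg, n_pos = 609, 199   # support column of the report
recall_pos = 0.5678       # reported recall for the positive class
accuracy = 0.5384         # reported accuracy

tp = round(recall_pos * n_pos)             # positives that were found
fn = n_pos - tp                            # positives that were missed
correct = round(accuracy * (n_neg + n_pos))
tn = correct - tp                          # negatives correctly rejected
fp = n_neg - tn                            # false alarms

precision = tp / (tp + fp)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"precision={precision:.4f}, F1={f1:.4f}")
```

The reconstruction gives 113 true positives against 287 false alarms, which reproduces the reported precision (0.28) and F1 (0.3773).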

In [23]:
print("\n--- Training Random Forest ---")

# init and train the model
rf_model = RandomForestClassifier(random_state=42, n_jobs=-1)
rf_model.fit(X_train_processed, y_train)

# make predictions
rf_probs = rf_model.predict_proba(X_test_processed)[:, 1]
rf_preds = rf_model.predict(X_test_processed)

# Calculate all metrics
rf_auc = roc_auc_score(y_test, rf_probs)
rf_accuracy = accuracy_score(y_test, rf_preds)
rf_recall = recall_score(y_test, rf_preds)
rf_f1 = f1_score(y_test, rf_preds)

# Store results for comparison
model_results["Random Forest"] = {
    "auc": rf_auc,
    "probs": rf_probs,
    "accuracy": rf_accuracy,
    "recall": rf_recall,
    "f1": rf_f1,
}

print(f"\n--- Random Forest Performance Metrics ---")
print(f"ROC AUC:   {rf_auc:.4f}")
print(f"Accuracy:  {rf_accuracy:.4f}")
print(f"Recall:    {rf_recall:.4f}")
print(f"F1 Score:  {rf_f1:.4f}")

print("\nDetailed Classification Report:")
print(classification_report(y_test, rf_preds))



--- Training Random Forest ---

--- Random Forest Performance Metrics ---
ROC AUC:   0.6062
Accuracy:  0.7574
Recall:    0.0251
F1 Score:  0.0485

Detailed Classification Report:
              precision    recall  f1-score   support

       False       0.76      1.00      0.86       609
        True       0.71      0.03      0.05       199

    accuracy                           0.76       808
   macro avg       0.74      0.51      0.45       808
weighted avg       0.75      0.76      0.66       808



Although its accuracy (0.76) matches logistic regression, this model almost never flags positives. Its recall of 0.03 means it misses nearly all true positives, which drives its F1 score close to zero (0.05) and shows it is too cautious to be useful for finding the rare class.
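A model that is "too cautious" at the default 0.5 cut-off can sometimes be rescued by lowering the decision threshold. The sketch below is a hedged illustration on synthetic probabilities (not the Random Forest output above): it picks the threshold that maximizes F1 using `precision_recall_curve`. In practice the threshold should be chosen on a validation split, never on the test set; `best_f1_threshold` is a helper name introduced here.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve


def best_f1_threshold(y_true, y_prob):
    """Return the probability cut-off with the highest F1, and that F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # The last precision/recall pair has no matching threshold, so drop it
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])


# Synthetic "cautious" scores: positives score a bit higher, few exceed 0.5
rng = np.random.default_rng(42)
y_val_demo = (rng.random(1000) < 0.25).astype(int)
p_val_demo = np.clip(
    0.20 + 0.10 * y_val_demo + rng.normal(0.0, 0.10, 1000), 0.0, 1.0
)

threshold, f1_at_threshold = best_f1_threshold(y_val_demo, p_val_demo)
f1_default = f1_score(
    y_val_demo, (p_val_demo >= 0.5).astype(int), zero_division=0
)
print(f"F1 at 0.5: {f1_default:.3f}")
print(f"F1 at tuned threshold {threshold:.3f}: {f1_at_threshold:.3f}")
```

The tuned threshold can never score worse than 0.5 on the data it was tuned on, so the real test is whether the gain carries over to held-out data.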

In [24]:
print("\n--- Training LightGBM ---")

# init and train the model
lgbm_model = LGBMClassifier(random_state=42)
lgbm_model.fit(X_train_processed, y_train)

# make predictions
lgbm_probs = lgbm_model.predict_proba(X_test_processed)[:, 1]
lgbm_preds = lgbm_model.predict(X_test_processed)

# Calculate all metrics
lgbm_auc = roc_auc_score(y_test, lgbm_probs)
lgbm_accuracy = accuracy_score(y_test, lgbm_preds)
lgbm_recall = recall_score(y_test, lgbm_preds)
lgbm_f1 = f1_score(y_test, lgbm_preds)

# Store results for comparison
model_results["LightGBM"] = {
    "auc": lgbm_auc,
    "probs": lgbm_probs,
    "accuracy": lgbm_accuracy,
    "recall": lgbm_recall,
    "f1": lgbm_f1,
}

print(f"\n--- LightGBM Performance Metrics ---")
print(f"ROC AUC:   {lgbm_auc:.4f}")
print(f"Accuracy:  {lgbm_accuracy:.4f}")
print(f"Recall:    {lgbm_recall:.4f}")
print(f"F1 Score:  {lgbm_f1:.4f}")

print("\nDetailed Classification Report:")
print(classification_report(y_test, lgbm_preds))



--- Training LightGBM ---

--- LightGBM Performance Metrics ---
ROC AUC:   0.5967
Accuracy:  0.7339
Recall:    0.1608
F1 Score:  0.2294

Detailed Classification Report:
              precision    recall  f1-score   support

       False       0.77      0.92      0.84       609
        True       0.40      0.16      0.23       199

    accuracy                           0.73       808
   macro avg       0.59      0.54      0.53       808
weighted avg       0.68      0.73      0.69       808



LightGBM finds more true positives than logistic regression and random forest, but still only about one in six (recall 0.16). Its F1 of 0.23 suggests it could do better with more tuning or by giving extra weight to the positive class. With the right adjustments, it might strike a better balance between finding positives and avoiding false alarms.
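A minimal sketch of the "extra weight" idea, assuming LightGBM's `scale_pos_weight` parameter is used: a common heuristic sets it to the ratio of negatives to positives. The counts below are the test-set supports, used only for illustration; in practice the ratio should come from `y_train`. The helper name `positive_class_weight` is ours, and this configuration was not run in the notebook.

```python
import numpy as np


def positive_class_weight(y):
    """Ratio of negative to positive samples, a common scale_pos_weight choice."""
    y = np.asarray(y).astype(bool)
    n_pos = int(y.sum())
    n_neg = int(y.size - n_pos)
    return n_neg / n_pos


# Illustrative labels with the class balance seen in the reports above
y_balance_demo = np.array([0] * 609 + [1] * 199)
pos_weight = positive_class_weight(y_balance_demo)

lgbm_params = {"random_state": 42, "scale_pos_weight": pos_weight}
print(f"scale_pos_weight = {pos_weight:.2f}")

# Hypothetical usage, not executed here:
# LGBMClassifier(**lgbm_params).fit(X_train_processed, y_train)
```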

In [25]:

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score, classification_report

print("\n--- Starting GridSearchCV for Scikit-learn MLP ---")
print("This may take a few minutes...")

# 1. Initialize the MLP Classifier
# early_stopping is great for preventing overfitting and speeding up the search.
mlp_model = MLPClassifier(
    random_state=42,
    max_iter=500,
    early_stopping=True,
    n_iter_no_change=10,
)

param_grid = {
    "hidden_layer_sizes": [
        (64,),
        (32, 16),
        (64, 32),  
        (128, 64, 32),  # More complex architecture
    ],  
    "alpha": [0.0001, 0.001],  # L2 regularization 
    "learning_rate_init": [0.005, 0.001, 0.0001],  # Initial learning rate   
}

grid_search = GridSearchCV(
    estimator=mlp_model,
    param_grid=param_grid,
    scoring="f1_macro",  # Macro-averaged F1: unweighted mean of the per-class F1 scores
    cv=3,
    n_jobs=-1,
    verbose=1,
)

#  Grid Search
grid_search.fit(X_train_processed, y_train)

print("\n--- GridSearchCV Results ---")
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best F1 score: {grid_search.best_score_:.4f}")

print("\n--- Final Evaluation on Holdout Test Set ---")
best_mlp_model = grid_search.best_estimator_

# Make predictions on the test set
mlp_probs = best_mlp_model.predict_proba(X_test_processed)[:, 1]
mlp_preds = best_mlp_model.predict(X_test_processed)

# Calculate all metrics
mlp_auc = roc_auc_score(y_test, mlp_probs)
mlp_accuracy = accuracy_score(y_test, mlp_preds)
mlp_recall = recall_score(y_test, mlp_preds)
mlp_f1 = f1_score(y_test, mlp_preds)

# Store results for comparison
model_results["Neural Network (MLP)"] = {
    "auc": mlp_auc,
    "probs": mlp_probs,
    "accuracy": mlp_accuracy,
    "recall": mlp_recall,
    "f1": mlp_f1,
}
print(f"\n--- MLP Performance Metrics ---")
print(f"ROC AUC:   {mlp_auc:.4f}")
print(f"Accuracy:  {mlp_accuracy:.4f}")
print(f"Recall:    {mlp_recall:.4f}")
print(f"F1 Score:  {mlp_f1:.4f}")

print("\nDetailed Classification Report:")
print(classification_report(y_test, mlp_preds))


--- Starting GridSearchCV for Scikit-learn MLP ---
This may take a few minutes...
Fitting 3 folds for each of 24 candidates, totalling 72 fits

--- GridSearchCV Results ---
Best parameters found: {'alpha': 0.0001, 'hidden_layer_sizes': (128, 64, 32), 'learning_rate_init': 0.005}
Best F1 score: 0.5019

--- Final Evaluation on Holdout Test Set ---

--- MLP Performance Metrics ---
ROC AUC:   0.5911
Accuracy:  0.7525
Recall:    0.0402
F1 Score:  0.0741

Detailed Classification Report:
              precision    recall  f1-score   support

       False       0.76      0.99      0.86       609
        True       0.47      0.04      0.07       199

    accuracy                           0.75       808
   macro avg       0.61      0.51      0.47       808
weighted avg       0.69      0.75      0.66       808



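MLP Neural Network (F1 = 0.074)
On the holdout set the tuned MLP scored ROC AUC 0.591, accuracy 0.753, precision 0.47 and recall 0.040, giving an F1 of only 0.074. This is below the no-skill F1 of roughly 0.25 that random guessing at the ~25% positive rate would achieve, which means the network is not learning useful patterns for the minority class. Note that the grid search optimized macro-averaged F1, which averages in the majority class: the test report shows a macro F1 of 0.47 even though almost no positives are found. To improve, try class weighting, oversampling the positives, experimenting with different architectures (depth/width), or using alternative optimizers and learning-rate schedules. Dropout or batch normalization are further options, but they are not available in scikit-learn's MLPClassifier and would require another framework.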
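For reference, the no-skill F1 can be worked out directly. The sketch below uses the approximation that a guesser who ignores the features has precision equal to the positive rate and recall equal to the fraction of samples it flags; the positive rate is taken from the test-set supports (199 of 808). The function name is ours.

```python
def expected_f1_random(prevalence, predict_rate):
    """Approximate F1 of a guesser that flags `predict_rate` of samples at random."""
    precision = prevalence   # flagged samples are positive at the base rate
    recall = predict_rate    # each positive is flagged with this probability
    return 2 * precision * recall / (precision + recall)


prevalence = 199 / 808  # positive rate in the test set

f1_match_rate = expected_f1_random(prevalence, prevalence)  # guess at base rate
f1_always_pos = expected_f1_random(prevalence, 1.0)         # flag everything

print(f"Random guessing at the base rate: F1 ~ {f1_match_rate:.3f}")
print(f"Always predicting positive:       F1 ~ {f1_always_pos:.3f}")
```

Guessing at the base rate gives an F1 equal to the positive rate (about 0.246), and flagging every request gives about 0.395, so any useful model should clear both.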

## Optimization with an AutoML tool

In [28]:
y_train_clean = y_train.astype(int)
y_test_clean = y_test.astype(int)


print(f"After conversion - y_train dtype: {y_train_clean.dtype}")
print(f"y_train unique values: {y_train_clean.unique()}")


print("\nStarting FLAML search on pre-processed data...")
automl = AutoML()
settings = {
    "time_budget": 400,  # Increased time budget
    "metric": "f1",
    "task": "classification",
    "log_file_name": "flaml_run.log",
    "seed": 42,
}

# Use cleaned target variables
automl.fit(X_train=X_train_processed, y_train=y_train_clean, **settings)
print("FLAML search complete.")


After conversion - y_train dtype: int64
y_train unique values: [0 1]

Starting FLAML search on pre-processed data...
[flaml.automl.logger: 06-27 11:00:15] {1752} INFO - task = classification
[flaml.automl.logger: 06-27 11:00:15] {1763} INFO - Evaluation method: holdout
[flaml.automl.logger: 06-27 11:00:15] {1862} INFO - Minimizing error metric: 1-f1
[flaml.automl.logger: 06-27 11:00:15] {1979} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'sgd', 'lrl1']
[flaml.automl.logger: 06-27 11:00:15] {2282} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 06-27 11:00:15] {2417} INFO - Estimated sufficient time budget=1332s. Estimated necessary time budget=31s.
[flaml.automl.logger: 06-27 11:00:15] {2466} INFO -  at 0.1s,	estimator lgbm's best error=1.0000,	best estimator lgbm's best error=1.0000
[flaml.automl.logger: 06-27 11:00:15] {2282} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 06-27 11:00:15] {2466} INFO

[flaml.automl.logger: 06-27 11:05:04] {2466} INFO -  at 289.6s,	estimator xgboost's best error=0.5844,	best estimator sgd's best error=0.5289
[flaml.automl.logger: 06-27 11:05:04] {2282} INFO - iteration 3511, current learner sgd
[flaml.automl.logger: 06-27 11:05:04] {2466} INFO -  at 289.6s,	estimator sgd's best error=0.5289,	best estimator sgd's best error=0.5289
[flaml.automl.logger: 06-27 11:05:04] {2282} INFO - iteration 3512, current learner sgd
[flaml.automl.logger: 06-27 11:05:04] {2466} INFO -  at 289.6s,	estimator sgd's best error=0.5289,	best estimator sgd's best error=0.5289
[flaml.automl.logger: 06-27 11:05:04] {2282} INFO - iteration 3513, current learner xgb_limitdepth
[flaml.automl.logger: 06-27 11:05:05] {2466} INFO -  at 290.1s,	estimator xgb_limitdepth's best error=0.5669,	best estimator sgd's best error=0.5289
[flaml.automl.logger: 06-27 11:05:05] {2282} INFO - iteration 3514, current learner sgd
[flaml.automl.logger: 06-27 11:05:05] {2466} INFO -  at 290.1s,	estima

[flaml.automl.logger: 06-27 11:05:17] {2466} INFO -  at 302.7s,	estimator xgboost's best error=0.5844,	best estimator sgd's best error=0.5289
[flaml.automl.logger: 06-27 11:05:17] {2282} INFO - iteration 3691, current learner sgd
[flaml.automl.logger: 06-27 11:05:17] {2466} INFO -  at 302.7s,	estimator sgd's best error=0.5289,	best estimator sgd's best error=0.5289
[flaml.automl.logger: 06-27 11:05:17] {2282} INFO - iteration 3692, current learner xgb_limitdepth
[flaml.automl.logger: 06-27 11:05:18] {2466} INFO -  at 302.9s,	estimator xgb_limitdepth's best error=0.5669,	best estimator sgd's best error=0.5289
[flaml.automl.logger: 06-27 11:05:18] {2282} INFO - iteration 3693, current learner xgb_limitdepth
[flaml.automl.logger: 06-27 11:05:18] {2466} INFO -  at 303.4s,	estimator xgb_limitdepth's best error=0.5669,	best estimator sgd's best error=0.5289
[flaml.automl.logger: 06-27 11:05:18] {2282} INFO - iteration 3694, current learner sgd
[flaml.automl.logger: 06-27 11:05:18] {2466} INF

In [30]:
# Evaluate the best model found by the FLAML search
print("\n--- FLAML Results ---")
print(f"Best model: {automl.model.estimator}")

# Make predictions on test set
y_pred_flaml = automl.predict(X_test_processed)
y_pred_proba_flaml = automl.predict_proba(X_test_processed)[:, 1]

# Calculate all metrics
flaml_auc = roc_auc_score(y_test, y_pred_proba_flaml)
flaml_accuracy = accuracy_score(y_test, y_pred_flaml)
flaml_recall = recall_score(y_test, y_pred_flaml)
flaml_f1 = f1_score(y_test, y_pred_flaml)

# Store results for comparison
model_results["FLAML AutoML"] = {
    "auc": flaml_auc,
    "probs": y_pred_proba_flaml,
    "accuracy": flaml_accuracy,
    "recall": flaml_recall,
    "f1": flaml_f1,
}

print(f"\n--- FLAML Performance Metrics ---")
print(f"ROC AUC:   {flaml_auc:.4f}")
print(f"Accuracy:  {flaml_accuracy:.4f}")
print(f"Recall:    {flaml_recall:.4f}")
print(f"F1 Score:  {flaml_f1:.4f}")

print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred_flaml))



--- FLAML Results ---
Best model: SGDClassifier(alpha=0.002022932789263593, eta0=0.0022707739703656357,
              learning_rate='invscaling', loss='log_loss', n_jobs=-1,
              penalty='l1', power_t=1.0, tol=0.0001)

--- FLAML Performance Metrics ---
ROC AUC:   0.5751
Accuracy:  0.6423
Recall:    0.3367
F1 Score:  0.3168

Detailed Classification Report:
              precision    recall  f1-score   support

       False       0.77      0.74      0.76       609
        True       0.30      0.34      0.32       199

    accuracy                           0.64       808
   macro avg       0.54      0.54      0.54       808
weighted avg       0.66      0.64      0.65       808



The AutoML search did not beat the simplest baseline, most likely because the current features carry little predictive signal. The selected model reaches a ROC AUC of only 0.575 on the test set, barely above the 0.5 of random guessing, and a test F1 of 0.317. The visible part of the FLAML log reports a best validation error of 0.5289 (a validation F1 of about 0.47), so the score dropped noticeably on unseen data. A plausible explanation is that, when the predictive signal is this weak, a search over thousands of flexible configurations starts to fit random noise in the training and validation data. These "patterns" are useless on the unseen test set, causing performance to drop.

Gaussian Naive Bayes achieved the highest F1 (0.377), plausibly because it lacks the complexity to learn this noise and has to rely on the faint, true signals that exist. Its ROC AUC (0.575) is essentially the same as FLAML's, however, so its advantage comes from flagging positives far more often rather than from ranking requests better. The clear next step is to pause model tuning and focus on feature engineering to create more meaningful and powerful predictors from the raw data.
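If the search is rerun later, one hedged option is to make FLAML's validation estimate less noisy by switching from a single holdout split to k-fold cross-validation and narrowing the learner list. The settings below are a sketch under those assumptions and were not run in this notebook; `eval_method`, `n_splits`, and `estimator_list` are `AutoML.fit` arguments, while the specific values and the log file name are illustrative. The last lines simply quantify the validation-to-test gap from the numbers reported above.

```python
# Sketch of alternative FLAML settings (values are illustrative assumptions)
cv_settings = {
    "time_budget": 400,
    "metric": "f1",
    "task": "classification",
    "eval_method": "cv",   # k-fold CV instead of a single holdout split
    "n_splits": 5,
    "estimator_list": ["lgbm", "rf", "lrl1"],  # smaller search space
    "log_file_name": "flaml_run_cv.log",       # hypothetical file name
    "seed": 42,
}

# Hypothetical usage, not executed here:
# automl.fit(X_train=X_train_processed, y_train=y_train_clean, **cv_settings)

# Gap between the best validation score in the visible log and the test score
validation_f1 = 1 - 0.5289   # FLAML minimizes 1 - f1
test_f1 = 0.3168
f1_gap = validation_f1 - test_f1
print(f"validation F1 = {validation_f1:.4f}, test F1 = {test_f1:.4f}, "
      f"gap = {f1_gap:.4f}")
```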

### What to do next

Our initial model tests were a critical first step. They suggest that the real predictive power is not in the surface-level data but hidden within the text of the requests. With more time, we would focus on unlocking that information.

*   **Deep Text Understanding with a Model like BERT:** Our highest priority would be to use a modern language model like BERT. Instead of just counting words, BERT can capture the context and sentiment behind what people write. It could tell the difference between a simple request and a compelling story, which is likely the key to making a great prediction.

*   **Smarter Feature Creation and Data Balancing:** We would also create more insightful features by hand, like a "politeness score" or user activity ratios. Alongside this, we would address the data imbalance by using techniques like SMOTE, applied to the training split only, to generate more examples of successful requests. This would give our models a much fairer and richer dataset to learn from.

*   **Focused Tuning of a Promising Model (LightGBM):** With this new, higher-quality data, we could then properly fine-tune LightGBM, the most promising of the untuned tree-based models. We would specifically adjust its settings to make it focus more on the minority class, the successful pizza requests. This targeted tuning, combined with better data, is the most direct path to a significantly improved F1-score.
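As a rough illustration of the hand-made features mentioned above, here is a minimal sketch of a "politeness score" and a smoothed activity ratio. The word list, function names, and example text are all hypothetical; a real version would need a vetted lexicon and the actual column names from the dataset.

```python
import re

# Hypothetical mini-lexicon; a real feature would use a curated word list
POLITE_WORDS = {"please", "thank", "thanks", "appreciate", "grateful", "kindly"}


def politeness_score(text):
    """Share of tokens in the text that belong to the politeness lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in POLITE_WORDS for token in tokens) / len(tokens)


def activity_ratio(numerator, denominator):
    """Ratio of two activity counts, with +1 smoothing to avoid division by zero."""
    return numerator / (denominator + 1)


example_request = "Please help, I would really appreciate a pizza. Thanks!"
print(f"politeness: {politeness_score(example_request):.3f}")
print(f"activity ratio: {activity_ratio(12, 3):.2f}")
```

Features like these could be appended to the existing pipeline output and evaluated with the same F1-based comparison used throughout this section.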