# Predicting Daily Milk Yield - Fall 2025 ML Course Project

In [None]:
# Felipe Benitez, feb478
# Edwin Torres, ert863
# Gora Bepary, gcb883
# Sankarsh Narayanan, EID

'''
Make cells to explain what dataset is
Objective
Tools used
Summarizing final results, such as "Best local CV RMSE... Best Kaggle RMSE... Kaggle LB Score..., etc." Things to orient reader.
'''

# Data Loading & Initial Inspection

In [None]:
'''
Goal: show we know what we're working with before touching models.
So show train/test. Other things we could show include train and test shape, train head, info, describe, # of rows/columns, target column, presence of categorical columns, and missing values, just to name a few. Show whatever is important and useful. This will contribute to Data Exploration + Quality & Clarity.
'''

# Data Cleaning Section 3

## 3.1 Summary of what worked

- **Removed invalid target values**  
  - Filtered out rows where `Milk_Yield_L < 0` for both CatBoost and XGBoost pipelines.

- **Standardized categorical text**  
  - Trimmed whitespace from `Breed` and corrected inconsistent spelling (`Holstien → Holstein`).

- **Imputed missing values**
  - `Housing_Score`: replaced missing values using the column median.
  - `Feed_Quantity_kg`: imputed per‐`Feed_Type` median to preserve context.
  - All remaining numeric fields: filled missing entries using column medians.
  - The median is robust to extreme values, which makes it a safe default for imputation.

- **Temporal feature extraction from `Date`**  
  - Parsed as datetime and derived features: `month`, `day`, `dayofweek`, `weekofyear`, `quarter`, and `is_weekend`; then removed `Date`.

- **Categorical feature handling**
  - CatBoost: preserved original categorical columns and passed them as native categorical features.
  - XGBoost: applied one-hot encoding after other feature engineering steps.

- **Farm identifier treatment**
  - Removed `Farm_ID` in CatBoost pipeline.
  - In XGBoost pipeline, applied fold-safe target encoding to convert `Farm_ID` into a numeric performance metric.

- **Outlier columns and IDs**
  - Dropped columns that uniquely identify samples (`Cattle_ID`) in both pipelines.
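
The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration on a toy frame, not the exact notebook code: the function name `clean_frame` and the toy values are assumptions, and for simplicity the medians are computed on the frame being cleaned rather than carried over from the training set.

```python
import numpy as np
import pandas as pd

def clean_frame(df: pd.DataFrame, is_train: bool = True) -> pd.DataFrame:
    """Apply the cleaning steps from Section 3.1 to a copy of df (illustrative)."""
    out = df.copy()

    # Drop physically impossible targets (only meaningful for training rows).
    if is_train and "Milk_Yield_L" in out.columns:
        out = out[out["Milk_Yield_L"] >= 0].copy()

    # Standardize breed labels: trim whitespace and fix the known misspelling.
    out["Breed"] = out["Breed"].str.strip().replace({"Holstien": "Holstein"})

    # Feed quantity: impute with the median of the same feed type.
    by_type = out.groupby("Feed_Type")["Feed_Quantity_kg"].transform("median")
    out["Feed_Quantity_kg"] = out["Feed_Quantity_kg"].fillna(by_type)

    # All remaining numeric predictors (including Housing_Score): column median.
    num_cols = out.select_dtypes(include="number").columns.drop(
        "Milk_Yield_L", errors="ignore"
    )
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())

    # Columns that uniquely identify a row carry no generalizable signal.
    return out.drop(columns=["Cattle_ID"], errors="ignore").reset_index(drop=True)

# Toy data (made-up values) exercising each step.
toy = pd.DataFrame({
    "Cattle_ID": ["c1", "c2", "c3", "c4", "c5"],
    "Breed": [" Holstein", "Holstien", "Jersey ", "Jersey", "Holstein"],
    "Feed_Type": ["Hay", "Hay", "Silage", "Silage", "Hay"],
    "Feed_Quantity_kg": [10.0, np.nan, 14.0, 16.0, 12.0],
    "Housing_Score": [0.5, np.nan, 0.7, 0.9, 0.1],
    "Milk_Yield_L": [20.0, 18.0, -3.0, 25.0, 22.0],
})
cleaned = clean_frame(toy)
print(cleaned)
```

In a real pipeline the imputation medians would normally be learned on the training split and reused for validation and test rows, so that no statistics flow from held-out data into training.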


## 3.2 Summary of what did not work
- **Handling Negative Values**
    - Taking the absolute value of the negative entries (the rumination feature discussed in Section 5.4) made performance worse.
    - Since 55% of entries were negative, we treated this as a systemic issue rather than random noise. Positive values (mean 5.74, max 31.2) and negative values (mean −4.22, min −8.8) appeared to represent different behaviors, so taking the absolute value mixed distinct patterns. We attempted to separate them into different features, but this split did not improve performance.

# Exploratory Data Analysis 4

# Target Distribution 4.2

# Relationships with Key Features 4.3

# Farm-Level Differences 4.4

In [None]:
'''
Since we did farm clustering, show why and how, and also how it ended up being wrong in some way. Just talk about why we did this for example
'''

# Feature Engineering 5

### 5.1 Overview

We didn’t just feed the raw CSVs directly into our models. We iteratively engineered features, tested them with cross-validation, and only kept transformations that were neutral or helpful for RMSE. Most experiments were guided by dairy domain reasoning, short and focused code changes, and immediate validation through average CV RMSE. If a feature block added noise or made the models worse, we removed it from the final pipeline.

A major early issue involved the **date column**. The raw file stored the date as an **object** rather than a true `datetime`. Because our preprocessing step converted all object columns into categoricals, the date was being treated as thousands of unrelated categories instead of a temporal variable. This caused instability and strange model behavior. Converting the date into a proper `datetime` type and extracting structured components such as **year**, **month**, **day**, and **day of week** immediately fixed this instability and made the models behave far more consistently. This correction became an important foundation for every later feature block.
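
A minimal sketch of the date fix, assuming the column is named `Date` as in Section 3.1. The helper name `add_date_features` and the two example dates are illustrative only.

```python
import pandas as pd

def add_date_features(df: pd.DataFrame, col: str = "Date") -> pd.DataFrame:
    """Parse a string date column and replace it with numeric calendar features."""
    out = df.copy()
    dt = pd.to_datetime(out[col], errors="coerce")  # unparseable strings become NaT

    out["year"] = dt.dt.year
    out["month"] = dt.dt.month
    out["day"] = dt.dt.day
    out["dayofweek"] = dt.dt.dayofweek  # Monday = 0, Sunday = 6
    out["weekofyear"] = dt.dt.isocalendar().week.astype("Int64")
    out["quarter"] = dt.dt.quarter
    out["is_weekend"] = (dt.dt.dayofweek >= 5).astype(int)

    # Drop the raw string so it can no longer be mistaken for a categorical.
    return out.drop(columns=[col])

raw = pd.DataFrame({"Date": ["2024-01-06", "2024-03-11"], "Weight_kg": [550, 610]})
feats = add_date_features(raw)
print(feats)
```

Dropping the raw column at the end is the important part: any later step that casts remaining string columns to categoricals will then never see one category per calendar day.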


### 5.2 Core CatBoost Feature Engineering

For CatBoost, we built features incrementally, keeping only the pieces that consistently helped or at least did not hurt CV RMSE.

**Kept features and steps included:**
- Biologically meaningful ratios:
  - `Feed_per_kg_bw`
  - Temperature-Humidity Index (THI) for heat stress  
  - `Walk_per_graze` (distance walked per grazing hour)
- Explicitly labeling categorical columns so CatBoost uses its native encoding properly.
- Dropping 74 rows with **negative** `Milk_Yield_L` labels, which are physically impossible and were hurting training.
- Lactation curve features such as:
  - `is_peak_lactation`, `is_early_lactation`, `is_late_lactation`
  - `dim_squared`, `dim_cubed`, `dim_log`
  - `dim_parity`

These changes moved CatBoost from about **4.114 RMSE → ~4.108**, with dropping negative labels being the biggest single improvement.
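
The kept features could be computed along these lines. This is a sketch under several assumptions that are not confirmed by the write-up: the column names `Humidity_percent`, `Walking_Distance_km`, `Grazing_Duration_hrs`, and `Parity` are hypothetical, the THI expression is one commonly used formulation, and the lactation-stage cutoffs (60, 120, 200 days) are placeholder values.

```python
import numpy as np
import pandas as pd

def add_bio_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Feed intake relative to body weight.
    out["Feed_per_kg_bw"] = out["Feed_Quantity_kg"] / out["Weight_kg"]

    # Distance walked per grazing hour; zero grazing hours become NaN, not inf.
    graze = out["Grazing_Duration_hrs"].replace(0, np.nan)  # hypothetical column
    out["Walk_per_graze"] = out["Walking_Distance_km"] / graze  # hypothetical column

    # Temperature-Humidity Index (T in Celsius, RH in percent); assumed formula.
    t = out["Ambient_Temperature_C"]
    rh = out["Humidity_percent"]  # hypothetical column
    out["THI"] = (1.8 * t + 32) - (0.55 - 0.0055 * rh) * (1.8 * t - 26)

    # Lactation-curve features; the stage cutoffs are placeholders.
    dim = out["Days_in_Milk"]
    out["is_early_lactation"] = (dim < 60).astype(int)
    out["is_peak_lactation"] = dim.between(60, 120).astype(int)
    out["is_late_lactation"] = (dim > 200).astype(int)
    out["dim_squared"] = dim ** 2
    out["dim_cubed"] = dim ** 3
    out["dim_log"] = np.log1p(dim)
    out["dim_parity"] = dim * out["Parity"]  # hypothetical column
    return out

toy = pd.DataFrame({
    "Feed_Quantity_kg": [20.0, 15.0],
    "Weight_kg": [500.0, 600.0],
    "Walking_Distance_km": [4.0, 2.0],
    "Grazing_Duration_hrs": [8.0, 0.0],
    "Ambient_Temperature_C": [30.0, 10.0],
    "Humidity_percent": [50.0, 80.0],
    "Days_in_Milk": [90, 250],
    "Parity": [2, 3],
})
bio = add_bio_features(toy)
print(bio[["Feed_per_kg_bw", "Walk_per_graze", "THI", "dim_log"]])
```

Guarding the ratio denominators matters for tree models as well: infinite values can either raise errors or be binned in ways that look like signal.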


### 5.3 CatBoost Ideas We Tested But Did Not Keep

We experimented with several larger interaction blocks that ended up hurting RMSE:

- Additional efficiency ratios (`water_per_weight`, `age_parity_ratio`, `age_parity_product`, etc.)
- Activity combinations (`total_activity`, `rest_activity_ratio`)
- More complicated temperature-humidity interactions
- Bundled vaccine indicators or sums

Most of these made RMSE worse or added noise, so we removed them and kept only the simpler, more stable features.


### 5.4 Rumination and Farm-Level Statistics

We explored multiple strategies to handle the unusual rumination values and farm-level context:

- Making all rumination values positive made the model worse.
- Splitting rumination into “positive mode” and “negative mode” plus flags was more logical but still not better in practice.
- Treating negative rumination values as **missing (NaN)** was neutral and aligned with the idea of faulty sensors.
- For farm-level context, we computed the per-farm mean and standard deviation of predictors such as:
  - `Weight_kg`, `Feed_Quantity_kg`, `Water_Intake_L`, `Age_Months`
  - `Days_in_Milk`, `Ambient_Temperature_C`

Because these statistics are computed from predictors only, they describe the farm environment without leaking the target.

In the final pipeline, we used a simpler rumination approach and a **lighter** set of farm statistics to avoid overfitting.
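
As a rough illustration of the two ideas in this subsection, the sketch below masks negative rumination readings as missing and attaches per-farm means and standard deviations of predictors. The rumination column name `Rumination_Time_hrs` is hypothetical, and pooling train and test predictors to compute the farm statistics is an assumption of this sketch (it uses no target values), not necessarily what the final pipeline did.

```python
import numpy as np
import pandas as pd

RUMINATION_COL = "Rumination_Time_hrs"  # hypothetical column name
FARM_STAT_COLS = ["Weight_kg", "Feed_Quantity_kg"]

def add_farm_context(train: pd.DataFrame, test: pd.DataFrame):
    train, test = train.copy(), test.copy()

    # Treat negative rumination readings as sensor faults: keep values >= 0, else NaN.
    for part in (train, test):
        part[RUMINATION_COL] = part[RUMINATION_COL].where(part[RUMINATION_COL] >= 0)

    # Per-farm mean/std of predictors only, so no target information is involved.
    pooled = pd.concat([train, test], ignore_index=True)
    stats = pooled.groupby("Farm_ID")[FARM_STAT_COLS].agg(["mean", "std"])
    stats.columns = [f"{col}_farm_{stat}" for col, stat in stats.columns]
    stats = stats.reset_index()

    train = train.merge(stats, on="Farm_ID", how="left")
    test = test.merge(stats, on="Farm_ID", how="left")
    return train, test

train_toy = pd.DataFrame({
    "Farm_ID": ["A", "A", "B"],
    "Weight_kg": [500.0, 600.0, 400.0],
    "Feed_Quantity_kg": [10.0, 12.0, 8.0],
    RUMINATION_COL: [5.0, -4.0, 6.0],
})
test_toy = pd.DataFrame({
    "Farm_ID": ["A", "B"],
    "Weight_kg": [700.0, 500.0],
    "Feed_Quantity_kg": [14.0, 10.0],
    RUMINATION_COL: [-2.0, 3.0],
})
tr_ctx, te_ctx = add_farm_context(train_toy, test_toy)
print(tr_ctx)
```

Both CatBoost and XGBoost handle `NaN` in numeric features natively, which is why masking is a low-risk way to encode "reading not trustworthy".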


### 5.5 Peer-Relative Features and Clustering

We explored two types of contextual features:

#### 1. Peer-relative features
For each farm, we compared each cow to its farm peers via:
- **Difference** features (e.g., `Weight_kg_vs_farm_diff`)
- **Ratio** features (e.g., `Weight_kg_vs_farm_ratio`)

These were meant to capture whether a cow is above or below typical farm-level baselines.

#### 2. Farm clustering
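We tried clustering farms with KMeans and assigning each farm a cluster label.
However, we clustered train and test separately, so the cluster IDs did not align: a given cluster number in train could describe a different group of farms than the same number in test. The feature therefore carried misleading signal at prediction time.  
Once we realized this, we removed `Farm_Cluster`.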

Peer-relative features were conceptually strong but did not outperform simpler normalized farm stats, so we relied mainly on the latter.
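
For reference, here is a sketch of both ideas. The peer-relative helper follows the naming pattern above. The clustering helper shows one way the label misalignment could have been avoided, namely fitting KMeans once on training-farm profiles and reusing the fitted model for test farms. This is an illustration of the fix, not what our pipeline did, and the function names and toy values are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def add_peer_relative(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Compare each cow to the mean of its own farm."""
    out = df.copy()
    farm_mean = out.groupby("Farm_ID")[col].transform("mean")
    out[f"{col}_vs_farm_diff"] = out[col] - farm_mean
    out[f"{col}_vs_farm_ratio"] = out[col] / farm_mean
    return out

def assign_farm_clusters(train, test, cols, n_clusters=3, seed=0):
    """Fit the clustering on train farms only, then reuse it for test farms."""
    prof_train = train.groupby("Farm_ID")[cols].mean()
    prof_test = test.groupby("Farm_ID")[cols].mean()

    scaler = StandardScaler().fit(prof_train)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km.fit(scaler.transform(prof_train))

    # The same fitted centroids label both splits, so cluster IDs mean the same thing.
    train_map = pd.Series(km.labels_, index=prof_train.index)
    test_map = pd.Series(km.predict(scaler.transform(prof_test)), index=prof_test.index)
    return train["Farm_ID"].map(train_map), test["Farm_ID"].map(test_map)

cols = ["Weight_kg", "Feed_Quantity_kg"]
train_toy = pd.DataFrame({
    "Farm_ID": ["F1"] * 3 + ["F2"] * 3 + ["F3"] * 3 + ["F4"] * 3,
    "Weight_kg": [400, 410, 420, 405, 415, 425, 700, 710, 720, 690, 705, 715],
    "Feed_Quantity_kg": [8, 9, 10, 8.5, 9.5, 10.5, 18, 19, 20, 17, 18, 19],
})
test_toy = pd.DataFrame({
    "Farm_ID": ["F1", "F1", "F3", "F3"],
    "Weight_kg": [408, 418, 702, 712],
    "Feed_Quantity_kg": [9, 10, 18.5, 19.5],
})
train_cluster, test_cluster = assign_farm_clusters(train_toy, test_toy, cols, n_clusters=2)
peer = add_peer_relative(train_toy, "Weight_kg")
print(train_cluster.tolist(), test_cluster.tolist())
```

With separately fitted models, the integer labels are arbitrary per fit, so even identical farm groupings can receive swapped IDs. Sharing one fitted model removes that ambiguity.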


### 5.6 XGBoost Feature Engineering

We also maintained an XGBoost pipeline to complement CatBoost and use as a second model family.  
XGBoost was very sensitive to early feature blocks, especially before we correctly processed the **date column**, which had been treated as a high-cardinality categorical. Once we converted the date to `datetime` and stabilized the feature set, **we returned to XGBoost**, and it became much more consistent.

We engineered and tested:

- Age and parity transforms (`Age_Years`, `Age_Years2`, `Parity2`, `Age_x_Parity`)
- Nonlinear `Days_in_Milk` transforms (`DIM_log`)
- Efficiency ratios (`Feed_per_kgBW`, `Water_per_kgBW`, `PrevYield_per_Feed`)
- THI and simple vaccine summaries (`Vax_Sum`)
- Farm-delta and cohort-relative features
- A large biological block of lactation and health indicators

The most consistently helpful additions were:
- Simple **parity categories**
- **Farm-normalized predictors**
- A **Vax_Sum + farm-delta** combination


### 5.7 Encodings and Dimensionality Reduction

Instead of PCA, we used targeted supervised encodings and structured normalization:

**Tried:**
- Farm-normalized predictors (subtracting or standardizing by farm averages)
- K-Fold target encoding with out-of-fold means
- Frequency encoding for large categoricals such as `Breed`

Target encoding usually hurt XGBoost and didn’t help CatBoost, so we removed it.  
Farm-normalized predictors and simple parity categories were the most reliable encodings.

We considered PCA, but tree models handle moderate dimensionality well and PCA reduces interpretability.  
Targeted feature selection and selective dropping worked better for our dataset.
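
To make the "out-of-fold" idea concrete, the following sketch encodes a categorical column with target means computed only from the other folds, and also shows a simple frequency encoding. The helper names and toy data are illustrative, and the fallback to the fitting folds' overall mean for unseen categories is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5, seed=42):
    """Encode `col` with target means learned on the other folds only."""
    enc = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit = df.iloc[fit_idx]
        fold_means = fit.groupby(col)[target].mean()
        encoded = df[col].iloc[enc_idx].map(fold_means)
        # Categories unseen in the fitting folds fall back to those folds' overall mean.
        enc.iloc[enc_idx] = encoded.fillna(fit[target].mean()).to_numpy()
    return enc

def frequency_encode(series: pd.Series) -> pd.Series:
    """Replace each category by its share of rows."""
    return series.map(series.value_counts(normalize=True))

toy = pd.DataFrame({
    "Farm_ID": ["A", "B"] * 10,
    "Milk_Yield_L": np.arange(20, dtype=float),
})
enc_a = oof_target_encode(toy, "Farm_ID", "Milk_Yield_L")

# Leakage check: changing a row's own target must not change that row's encoding.
toy_changed = toy.copy()
toy_changed.loc[0, "Milk_Yield_L"] = 1000.0
enc_b = oof_target_encode(toy_changed, "Farm_ID", "Milk_Yield_L")
freq = frequency_encode(toy["Farm_ID"])
print(enc_a.iloc[0], enc_b.iloc[0])
```

The leakage check at the end is a cheap sanity test worth keeping in any target-encoding code: if a row's encoding moves when only its own label changes, the encoding is not fold-safe.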


### 5.8 Summary

Across all iterations, the most important feature engineering wins were:

- **Fixing the date column** by converting it from object → datetime  
- **Dropping physically impossible negative labels**
- **Meaningful biological ratios and farm-normalized predictors**
- **Lightweight farm statistics**
- **Avoiding oversized interaction blocks**
- **Revisiting XGBoost only after stabilizing the feature space**

These steps led to a stable, interpretable, and high-performing feature set that consistently improved CatBoost and allowed XGBoost to be evaluated fairly.


### 5.9 Model Feedback During Feature Engineering

Our modeling work during feature engineering focused on using CatBoost, LightGBM, and XGBoost as feedback tools rather than finalized models. We applied the same preprocessing pipeline and cross-validation splits to each model to understand how different engineered features affected stability, signal strength, and overall RMSE.

Key observations:

- CatBoost consistently responded the best to our engineered features, making it a reliable indicator of whether a new feature block was helpful or harmful.
- LightGBM performed reasonably but tended to lose performance once the feature space became more complex.
- XGBoost struggled early on, especially before we fixed the date column, but improved significantly after the feature pipeline was stabilized and we revisited it near the end.

We also monitored feature interactions through multiple CatBoost runs, tracking how RMSE changed as new biological ratios, farm-normalized predictors, and date-derived components were added. This allowed us to keep only the transformations that consistently improved model behavior.

Overall, this model feedback loop was essential during feature engineering. It helped confirm which features were robust across different frameworks and guided the final feature set before moving on to full modeling and tuned ensembles.


# Baseline Model (Modeling Approach) 6

In [None]:
'''
Here we show the baseline model we used w/ default-ish parameters. Show it, train it, run it, no optuna, show fold RMSEs, Average CV RMSEs, and explain what it's doing for us and anything else we can add to make this section full
'''

## 6. Modeling Approach & Experiments
In this section we describe our full modeling process: the baseline model, how we tuned and compared models, the experiments we ran, and how we arrived at our final CatBoost ensemble with blending. The goal was not only to get a good leaderboard score but also to understand which modeling decisions actually drove improvements.


### 6.1 Hyperparameter Tuning with Optuna

After we selected CatBoost as our main model and saw that additional feature engineering yielded only small gains relative to the number of experiments we ran, we focused on **hyperparameter tuning** to get as much performance as possible from our single strongest learner.

We actually ran **two separate Optuna studies** for CatBoost:

- **Run 1 – 40 trials (our best hyperparameters)**
- **Run 2 – 80 trials with an expanded search space**

Both runs used the same 5-fold CV split and the same preprocessing pipeline, so their RMSEs are directly comparable.


#### 6.1.1 First Optuna Run (40 trials)

In the first study, we tuned the “core” CatBoost hyperparameters:

- `depth` (5–7)
- `learning_rate` (0.02–0.04)
- `l2_leaf_reg` (L2 regularization)
- `subsample` (row subsampling)
- `random_strength` (randomness in split selection)
- `bagging_temperature` (controls how aggressive the sampling is)
- `n_estimators` was capped at 3000, with early stopping on each fold

For each trial, the objective function trained on 4 folds and validated on the 5th, and we minimized the **mean 5-fold RMSE**.
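
The quantity each trial minimized can be sketched as a plain function. To keep the example dependency-free, scikit-learn's `GradientBoostingRegressor` stands in for CatBoost, the data are synthetic, and early stopping is omitted. The function name `mean_cv_rmse` and the parameter values are illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def mean_cv_rmse(params, X, y, n_splits=5, seed=42):
    """Train on 4 folds, validate on the 5th, and average the fold RMSEs."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_rmse = []
    for tr_idx, va_idx in kf.split(X):
        # Stand-in model; the actual study used CatBoost with early stopping.
        model = GradientBoostingRegressor(random_state=seed, **params)
        model.fit(X[tr_idx], y[tr_idx])
        pred = model.predict(X[va_idx])
        fold_rmse.append(float(np.sqrt(mean_squared_error(y[va_idx], pred))))
    return float(np.mean(fold_rmse)), fold_rmse

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

params = {"max_depth": 3, "learning_rate": 0.05, "n_estimators": 100, "subsample": 0.85}
score, folds = mean_cv_rmse(params, X, y)
print(round(score, 4), [round(f, 4) for f in folds])
```

In an Optuna study, the objective would draw `params` from the trial (for example with `trial.suggest_int` and `trial.suggest_float`) and return the mean score. Fixing the `KFold` seed across trials is what makes the trial scores comparable.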

- **Best CV RMSE (Run 1):** ≈ **4.10639**
- **Best hyperparameters (Run 1):**  
  - `depth = 6`  
  - `learning_rate ≈ 0.0229`  
  - `l2_leaf_reg ≈ 4.01`  
  - `subsample ≈ 0.847`  
  - `random_strength ≈ 0.73`  
  - `bagging_temperature ≈ 0.46`

When we retrained a 5-fold CV ensemble with these parameters, we got:

- **Final 5-fold CV RMSE:** ≈ **4.1064**  
- **Best iterations per fold:** around **1,000–1,150 trees**, with an average of ~**1,094** boosting rounds

This first Optuna run gave us the **strongest configuration** we found and became the base for our ensembling and blending experiments.


#### 6.1.2 Second Optuna Run (80 trials, expanded space)

Later, we ran a **second Optuna study with 80 trials**, this time expanding the search space to include more tree-shape and regularization parameters:

- New hyperparameters included:
  - `border_count` (number of candidate split points)
  - `min_data_in_leaf` (minimum samples per leaf)
  - `bootstrap_type = "Bayesian"` with tuned `bagging_temperature`
- We continued to tune:
  - `learning_rate`
  - `l2_leaf_reg`
  - `random_strength`

Again we minimized the mean 5-fold CV RMSE.

- **Best CV RMSE (Run 2):** ≈ **4.10642**

So the second run reached essentially the **same performance**, **very slightly worse** than Run 1 (a difference of about 0.00003 in RMSE, which is negligible and likely within CV noise).

The best hyperparameters from Run 2 looked like:

- `depth = 6` (fixed)
- `border_count = 128`
- `learning_rate ≈ 0.0150`
- `l2_leaf_reg ≈ 1.94`
- `random_strength ≈ 0.30`
- `bagging_temperature ≈ 2.16`
- `min_data_in_leaf = 44`
- `bootstrap_type = "Bayesian"`
- `grow_policy = "SymmetricTree"`

When we retrained with these parameters:

- **Final 5-fold CV RMSE:** again ≈ **4.1064**
- **Best iterations per fold:** much **larger**, around **1,700–2,100 trees**, with an average of ~**1,883** boosting rounds


#### 6.1.3 Interpreting the differences between the two runs

Even though Run 2 searched a bigger space and ran for more trials, it did **not** improve RMSE beyond Run 1. The difference in parameters tells us why:

- **Learning rate**
  - Run 1: `learning_rate ≈ 0.0229`  
  - Run 2: `learning_rate ≈ 0.0150` (smaller)  
  → A smaller learning rate usually needs **more trees** (which we see from the best iterations) and can make training slower without guaranteeing a better optimum. In our case, the lower learning rate just led to **more boosting rounds** with essentially the same RMSE.

- **L2 regularization (`l2_leaf_reg`)**
  - Run 1: `≈ 4.01` (stronger regularization)  
  - Run 2: `≈ 1.94` (weaker regularization)  
  → Run 2 allowed individual leaves to fit slightly more aggressively, but we also increased bagging and min leaf size. These trade-offs roughly canceled out, leading again to almost identical performance.

- **Bagging behavior**
  - Run 1: `bagging_temperature ≈ 0.46` (milder stochasticity)  
  - Run 2: `bagging_temperature ≈ 2.16` + `bootstrap_type = "Bayesian"`  
  → Run 2 used **much more aggressive Bayesian-style bagging**, injecting more randomness into which data points each tree sees. This can help reduce overfitting, but because our dataset is large and our Run 1 model was already well-regularized, the extra randomness did not produce a clear RMSE gain.

- **Tree shape and leaf constraints**
  - Run 1: used CatBoost’s default `border_count` and leaf constraints  
  - Run 2: explicitly tuned  
    - `border_count = 128` (fewer split candidates than the CPU default of 254, slightly simpler trees)  
    - `min_data_in_leaf = 44` (prevents tiny leaves, smooths predictions)  
  → These changes **regularize** the tree structure: they avoid overly fine splits and tiny leaves. That can improve generalization if the model is overfitting, but in our case the Run 1 configuration was already near the bias-variance sweet spot, so the extra constraints did not translate into a meaningful RMSE improvement.

Overall, the second Optuna run **validated** that we were already sitting in a very flat optimum: many slightly different hyperparameter combinations (with different regularization/bagging trade-offs) all land around RMSE ≈ 4.1064.

Because **Run 1** achieved the **lowest CV RMSE** and used a slightly simpler set of hyperparameters, we treated it as our **primary “best” configuration** and used it as the base for our CatBoost ensembles and blending experiments. The second run mainly served as a robustness check and showed that even after 80 more trials and a richer search space, we could not significantly beat our original tuned model.

#### 6.1.4 XGBoost Hyperparameter Tuning with Optuna (500 trials)

To build a strong **second model** for ensembling, we also ran a large Optuna study for **XGBoost** with **500 trials**, using the same 5-fold CV and essentially the same preprocessing pipeline as CatBoost:

- cleaned and engineered features (date features, farm clustering, vaccine sum, parity indicators),
- applied **fold-safe target encoding** on `Farm_ID` and created farm-delta features (`Prev_vs_Farm`, `Prev_over_Farm`),
- one-hot encoded categorical variables **within each fold** and aligned train/validation columns to avoid leakage.

Each Optuna trial trained a 5-fold CV XGBoost regressor (with early stopping) and minimized the **mean 5-fold RMSE**.
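
The fold-wise one-hot step mentioned above can be sketched as follows: the encoding is derived from the training part of each fold, and the validation part is reindexed to exactly those columns. The helper name `one_hot_fold` and the toy rows are illustrative.

```python
import pandas as pd

def one_hot_fold(train_part, valid_part, cat_cols):
    """One-hot encode within a fold and align validation columns to training."""
    tr = pd.get_dummies(train_part, columns=cat_cols, dtype=int)
    va = pd.get_dummies(valid_part, columns=cat_cols, dtype=int)
    # Categories unseen in training are dropped; missing ones are filled with 0.
    va = va.reindex(columns=tr.columns, fill_value=0)
    return tr, va

train_part = pd.DataFrame({
    "Weight_kg": [500, 620, 580],
    "Breed": ["Holstein", "Jersey", "Holstein"],
})
valid_part = pd.DataFrame({
    "Weight_kg": [540, 600],
    "Breed": ["Jersey", "Guernsey"],
})
tr_enc, va_enc = one_hot_fold(train_part, valid_part, ["Breed"])
print(va_enc)
```

Without the `reindex`, a category that appears only in the validation fold would add a column the model never saw, and a category missing from validation would shift the column order.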

The search space covered both **tree structure** and **regularization**:

- Tree growth and structure:
  - `grow_policy ∈ {depthwise, lossguide}`
  - `max_depth` (0–12, depending on `grow_policy`)
  - `max_leaves` (16–256)
  - `max_bin ∈ {128, 256, 512}`
- Learning and sampling:
  - `learning_rate ∈ [0.005, 0.1]` (log-scaled)
  - `subsample ∈ [0.5, 0.95]`
  - `colsample_bytree`, `colsample_bylevel`, `colsample_bynode ∈ [0.5, 0.95]`
- Regularization:
  - `gamma ∈ [1e-4, 10]` (log-scaled)
  - `reg_alpha ∈ [1e-4, 50]` (log-scaled)
  - `reg_lambda ∈ [1e-3, 50]` (log-scaled)

The **best trial** out of 500 achieved a **mean CV RMSE ≈ 4.1151** with a relatively shallow but strongly regularized configuration:

- `grow_policy = "depthwise"`
- `max_depth = 4`
- `max_bin = 128`
- `learning_rate ≈ 0.00681`
- `max_leaves ≈ 171`
- `min_child_weight ≈ 2.88`
- `subsample ≈ 0.529`
- `colsample_bytree ≈ 0.528`
- `colsample_bylevel ≈ 0.920`
- `colsample_bynode ≈ 0.941`
- `gamma ≈ 0.632`
- `reg_alpha ≈ 49.96`
- `reg_lambda ≈ 35.12`

Using these tuned hyperparameters, we then retrained:

- A **5-fold CV ensemble**, which achieved  
  - **Final 5-fold CV RMSE:** ≈ **4.1151**  
  - Fold RMSEs in the range **4.106–4.127**, with best iteration counts around **2,800–3,300** trees per fold.
- A **full-data multi-seed ensemble**, where we:
  - re-fit on all training data for roughly the average best-iteration count from CV,
  - used multiple random seeds (e.g., 42, 100, 200, 300, 400),
  - averaged their predictions to produce our final XGBoost test predictions.

Even after this large 500-trial search, tuned XGBoost remained slightly weaker than our best CatBoost configuration (≈ **4.115** vs. ≈ **4.106** RMSE). However, because its error pattern was different, this XGBoost model was still **valuable as a complementary learner**, and we used its OOF and test predictions in our **stacking / blending experiments** described in Section 6.2.

### 6.2 Ensembling, Stacking, and Snapshot Strategy

Once we had a strong single CatBoost model, we explored techniques to reduce variance and squeeze out additional performance:

#### 6.2.1 Multi-Seed Ensembling

Even with fixed hyperparameters, CatBoost’s training process is stochastic (e.g., random permutations of the training rows used when encoding categorical features, bootstrap sampling). We trained the same CatBoost configuration with **different random seeds** and averaged their predictions:

- Trained the tuned CatBoost model on the full training data with several seeds (e.g., 5 seeds).  
- Averaged the predictions from all seed models on the test set.

This simple **multi-seed ensemble** slightly improved CV RMSE and also stabilized leaderboard performance, consistent with variance-reduction theory.
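
A minimal sketch of multi-seed averaging, again with scikit-learn's `GradientBoostingRegressor` as a stand-in for the tuned CatBoost model and synthetic data. The seed list mirrors the example seeds quoted for XGBoost in 6.1.4; everything else is illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

SEEDS = [42, 100, 200, 300, 400]

def multi_seed_predict(X_train, y_train, X_test, params, seeds=SEEDS):
    """Fit the same configuration once per seed and average the predictions."""
    preds = []
    for seed in seeds:
        # Stand-in model; subsample < 1 makes training depend on the seed.
        model = GradientBoostingRegressor(random_state=seed, **params)
        model.fit(X_train, y_train)
        preds.append(model.predict(X_test))
    preds = np.vstack(preds)
    return preds.mean(axis=0), preds

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=200)
X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]

params = {"n_estimators": 50, "max_depth": 3, "learning_rate": 0.1, "subsample": 0.8}
avg_pred, all_preds = multi_seed_predict(X_tr, y_tr, X_te, params)

mse_each = ((all_preds - y_te) ** 2).mean(axis=1)
mse_avg = float(((avg_pred - y_te) ** 2).mean())
print(mse_each.round(4), round(mse_avg, 4))
```

For squared error, the averaged prediction can never do worse than the mean of the individual models' errors (a consequence of convexity), which is the formal version of the variance-reduction argument.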


#### 6.2.2 Stacking and Blending

We also experimented with **stacking / blending** strategies:

- **CatBoost + XGBoost blend**  
  - Trained both a tuned CatBoost model and a tuned XGBoost model (500-trial Optuna study described in 6.1.4).  
  - Created out-of-fold (OOF) predictions for each model using 5-fold CV.  
  - Searched over a blending weight `alpha` in  
    `y_blend = alpha * y_catboost + (1 - alpha) * y_xgboost`.  
  - Selected the `alpha` that minimized OOF RMSE, then used that same weight to blend test predictions.

- **Ridge stacking on top of OOF predictions**  
  - Built a 2-feature meta-dataset where each row contained `(CatBoost_OOF, XGBoost_OOF)`.  
  - Trained a Ridge regression model to learn the optimal linear combination.  
  - Used the fitted Ridge model to combine test predictions.  

These stacking experiments generally gave small gains on CV, but in some cases were less stable on the leaderboard than a pure CatBoost ensemble. This is likely because the meta-learner can overfit to the noise in the OOF predictions, especially when the base models are already highly correlated.
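
Both combination schemes can be sketched on synthetic out-of-fold predictions. The arrays below are simulated (truth plus independent noise) and stand in for the real CatBoost and XGBoost OOF vectors. The grid resolution and the Ridge penalty are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def best_blend_weight(y, oof_a, oof_b, grid=None):
    """Grid-search alpha in y_blend = alpha * a + (1 - alpha) * b on OOF predictions."""
    grid = np.linspace(0.0, 1.0, 101) if grid is None else grid
    scores = [rmse(y, a * oof_a + (1 - a) * oof_b) for a in grid]
    best = int(np.argmin(scores))
    return float(grid[best]), scores[best]

# Simulated OOF predictions: truth plus independent noise of different sizes.
rng = np.random.default_rng(7)
y = rng.normal(10.0, 4.0, size=2000)
oof_cat = y + rng.normal(0.0, 1.0, size=2000)
oof_xgb = y + rng.normal(0.0, 1.2, size=2000)

alpha, blend_rmse = best_blend_weight(y, oof_cat, oof_xgb)

# Ridge stacking: learn the linear combination from a 2-column meta-dataset.
# Note: Ridge's `alpha` is the regularization strength, unrelated to the blend weight.
meta_X = np.column_stack([oof_cat, oof_xgb])
stacker = Ridge(alpha=1.0).fit(meta_X, y)
stack_rmse = rmse(y, stacker.predict(meta_X))

print(alpha, round(blend_rmse, 4), stacker.coef_.round(3), round(stack_rmse, 4))
```

Because the grid includes `alpha = 0` and `alpha = 1`, the blended OOF RMSE is never worse than the better single model on the same data. Whether that gain survives on unseen data is a separate question, which matches the leaderboard instability we observed.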

#### 6.2.3 Snapshot-Style Ensembling (Parameter Variants)

We then tried to extend this idea to a **snapshot ensemble** during our last moments before the submission deadline:

- In addition to changing the random seed, we trained several “nearby” versions of the model by slightly varying:
  - `depth` (e.g., 5, 6, 7)
  - `learning_rate` (e.g., ±10% around the tuned value)
- For each parameter variant, we trained models with multiple seeds and averaged all of them.

Intuition:

- **Shallower trees** (e.g., depth 5) capture smoother, more global patterns.
- **Deeper trees** (e.g., depth 7) can model more complex feature interactions but risk overfitting.
- Slight learning-rate shifts change the effective regularization.

By averaging these diverse models, we hoped to obtain a more robust predictor that was less sensitive to any specific hyperparameter setting or random seed. We started this run at 10 pm, before the 11:55 pm deadline, but it needed 4+ hours to train, so we stopped it once it was clear it would not finish in time. We believe it would have performed best, since it builds directly on our strongest configuration, but we could not verify this.
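
The size of such a grid is easy to underestimate. The sketch below only enumerates the configurations. The variant values follow the examples above (depths 5 to 7, learning rate ±10%), while the five seeds and the helper name are assumptions.

```python
from itertools import product

BASE = {"depth": 6, "learning_rate": 0.0229}  # rounded Run 1 values from 6.1.1

def variant_configs(base, depths=(5, 6, 7), lr_scales=(0.9, 1.0, 1.1),
                    seeds=(42, 100, 200, 300, 400)):
    """Enumerate every (depth, learning-rate, seed) combination around `base`."""
    configs = []
    for depth, scale, seed in product(depths, lr_scales, seeds):
        cfg = dict(base)
        cfg.update(
            depth=depth,
            learning_rate=round(base["learning_rate"] * scale, 6),
            random_seed=seed,
        )
        configs.append(cfg)
    return configs

configs = variant_configs(BASE)
print(len(configs), configs[0])
```

Under these assumed sizes the grid contains 3 × 3 × 5 = 45 full model fits, so the total cost is 45 times that of a single tuned model. Estimating this up front would have told us the run could not fit in the time remaining.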

In the final model, we relied primarily on **CatBoost-only CV + Multi-Seed ensembling**, with blending used carefully and only when it showed consistent improvement across folds.


### 6.3 Cross-Validation Experiments and Robustness

A major focus of our modeling approach was making sure our CV estimates were **reliable** and not overly optimistic.

We experimented with multiple validation strategies:

- **Standard KFold with shuffling**  
  - Our default setup: 5-fold KFold with `shuffle=True` and a fixed `random_state`.  
  - Simple and effective for quickly comparing model variants.

- **GroupKFold by `Farm_ID` (sanity check)**  
  - To test whether the model was accidentally memorizing per-farm patterns in a way that would not transfer to unseen farms, we ran experiments where entire farms were held out in validation.  
  - Result: RMSE under GroupKFold was very similar to our standard KFold RMSE, suggesting that the model generalizes reasonably well across farms.

- **Time-related considerations**  
  - We created date-based features (e.g., month, week of year, day of week) and considered that the hidden test set might correspond to later calendar periods.  
  - We monitored whether features like `year` or `date_ordinal` appeared to cause over-optimistic CV. In parallel, we compared CV scores with and without these trend-like features as a robustness check.

Overall, these CV experiments increased our confidence that the improvements we saw during development were not purely due to leakage or overfitting to a specific fold configuration.
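
The KFold versus GroupKFold comparison can be reproduced in miniature as below. The data are synthetic, a `Ridge` regressor stands in for the real model to keep the example fast, and the helper name `fold_rmses` is illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, KFold

def fold_rmses(model, X, y, splitter, groups=None):
    """Return the validation RMSE of each fold produced by `splitter`."""
    scores = []
    for tr_idx, va_idx in splitter.split(X, y, groups):
        model.fit(X[tr_idx], y[tr_idx])
        err = y[va_idx] - model.predict(X[va_idx])
        scores.append(float(np.sqrt(np.mean(err ** 2))))
    return scores

# Synthetic data: 20 farms with 15 rows each.
rng = np.random.default_rng(3)
farm_ids = np.repeat(np.arange(20), 15)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.5, -1.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=300)

kfold_scores = fold_rmses(Ridge(), X, y, KFold(n_splits=5, shuffle=True, random_state=42))
group_scores = fold_rmses(Ridge(), X, y, GroupKFold(n_splits=5), groups=farm_ids)

# With GroupKFold, no farm contributes rows to both sides of a split.
overlaps = [
    len(set(farm_ids[tr]) & set(farm_ids[va]))
    for tr, va in GroupKFold(n_splits=5).split(X, y, farm_ids)
]
print(np.mean(kfold_scores).round(4), np.mean(group_scores).round(4), overlaps)
```

If the grouped RMSE were much worse than the shuffled one, that would indicate the model relies on farm-specific patterns that do not transfer to unseen farms. Similar scores, as we observed, suggest it does not.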


### 6.4 Final Model and Reflection

Our final submission is based on a **CatBoost-only ensemble**, with:

- Carefully tuned hyperparameters obtained via Optuna.
- Multiple random seeds to create a diverse ensemble.
- An alpha-blending step to optimally combine:
  - The CV-trained CatBoost model, and  
  - The multi-seed ensemble.

On our internal 5-fold CV, this final configuration achieved a best RMSE of approximately **4.106145**, and it placed 4th out of 52 teams on the course Kaggle leaderboard.

**What helped the most:**

- Moving from simple baselines to tuned CatBoost gave the largest improvement.
- Thoughtful feature engineering (especially around date, feed, and farm-level behavior) provided strong, interpretable signals.
- Multi-seed ensembling added a smaller but reliable boost and made predictions more stable (the snapshot-style variant from 6.2.3 did not finish in time, so it is not reflected in the final model).

**What helped less or was risky:**

- Very aggressive stacking and meta-models sometimes improved CV by a tiny margin but did not always translate into better leaderboard performance.
- Certain features and encodings (e.g., farm clustering variations, very trend-heavy date features) required careful validation and could lead to overly optimistic CV if not tested properly.

Overall, our modeling process was iterative and experimental: we started from simple models, gradually introduced more sophisticated techniques, and continuously used cross-validation and leaderboard feedback to decide which ideas were genuinely helpful and which were mainly adding complexity. This approach allowed us to build a final model that is both **accurate** and **robust**, while also learning a lot about practical model development in the process.