# Predicting Daily Milk Yield - Fall 2025 ML Course Project

In [None]:
# Felipe Benitez, feb478
# Edwin Torres, EID
# Gora Bepary, EID
# Sankarsh Narayanan, EID

'''
Make cells to explain what dataset is
Objective
Tools used
Summary of final results, such as "Best local CV RMSE... Best Kaggle RMSE... Kaggle LB Score...", plus anything else that orients the reader.
'''

# Data Loading & Initial Inspection

In [None]:
'''
Goal: show we know what we're working with before touching models.
So show train/test. Other things we could show as example is train and test shape, train head, info, describe, # of rows/columns, target column, presence of categorical columns, missing values, just to name a few. Show whatever is important and useful. This will contribute to Data Exploration + Quality & Clarity.
'''
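
A minimal sketch of the first-look summary described in the note above. The helper name `quick_summary` and the toy frame are hypothetical stand-ins; in the notebook the frames would come from the competition CSVs, whose file names are not shown here.

```python
from typing import Optional

import numpy as np
import pandas as pd


def quick_summary(df: pd.DataFrame, target: Optional[str] = None) -> dict:
    """Collect basic facts about a frame before any modeling."""
    # Treat every non-numeric column as a candidate categorical feature.
    non_numeric = df.select_dtypes(exclude=["number"]).columns.tolist()
    missing = df.isna().sum()
    return {
        "n_rows": df.shape[0],
        "n_cols": df.shape[1],
        "has_target": target is not None and target in df.columns,
        "non_numeric_cols": non_numeric,
        # Only report columns that actually have gaps.
        "missing_by_col": missing[missing > 0].to_dict(),
    }


# Toy stand-in for the real training data (all values are made up).
toy_train = pd.DataFrame(
    {
        "Farm_ID": ["F1", "F1", "F2"],
        "Breed": ["Holstein", None, "Jersey"],
        "Weight_kg": [550.0, np.nan, 430.0],
        "Milk_Yield_L": [18.2, 21.5, 12.9],
    }
)
summary = quick_summary(toy_train, target="Milk_Yield_L")
print(summary)
```

Running the same helper on the test frame with `target=None` gives a quick side-by-side check that both files share the same columns apart from the label.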

# Data Cleaning Section 3

# Handling Missing Values 3.1

In [None]:
'''
Show how we handled missing values here and explain it. Like median, dropping impossible targets like milk yield < 0, etc. 
'''
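
A hedged sketch of the two cleaning steps named in the note above: dropping impossible targets and median imputation. The function names are hypothetical, and learning the medians from the training frame only is an assumption about how the imputation would be done.

```python
import numpy as np
import pandas as pd

TARGET = "Milk_Yield_L"


def drop_impossible_targets(train: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with a non-negative target (this also drops NaN targets)."""
    return train[train[TARGET] >= 0].reset_index(drop=True)


def impute_with_train_medians(train, test, cols):
    """Fill numeric gaps in both frames using medians learned from train only."""
    medians = train[cols].median()
    # fillna with a Series fills each column by its matching label.
    return train.fillna(medians), test.fillna(medians), medians


# Toy data (made-up values) with one negative label and two missing weights.
toy_train = pd.DataFrame(
    {
        "Weight_kg": [500.0, np.nan, 600.0, 700.0],
        "Milk_Yield_L": [20.0, 15.0, -3.0, 25.0],
    }
)
toy_test = pd.DataFrame({"Weight_kg": [np.nan, 480.0]})

clean = drop_impossible_targets(toy_train)
train_imp, test_imp, medians = impute_with_train_medians(
    clean, toy_test, ["Weight_kg"]
)
print(train_imp)
print(test_imp)
```

Dropping the bad labels before computing medians keeps the rows we consider invalid from influencing the fill values.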

# Outliers 3.2

In [None]:
'''
Do what's needed here
'''

# Exploratory Data Analysis 4

# Target Distribution 4.2

# Relationships with Key Features (or Other Things We Tried) 4.3

# Farm-Level Differences 4.4

In [None]:
'''
Since we did farm clustering, show why and how, and also how it ended up being wrong in some way. Just talk about why we did this for example
'''

# Feature Engineering 5

In [None]:
''' 
Here we tell the story of our features
'''

### 5.1 Overview

We did not just throw the raw CSV into CatBoost or XGBoost. We iteratively engineered features, tested them with cross validation, and only kept ideas that were neutral or helpful for RMSE. Most experiments were guided by dairy domain logic and short, focused code changes, followed by logging the new average CV RMSE. If a change made the model worse or clearly added noise, we removed it from the final pipeline.

### 5.2 Core CatBoost feature engineering

For CatBoost we started from a clean baseline and then tried small, targeted feature blocks.

Kept features and steps:

- New biologically meaningful ratios:
  - `Feed_per_kg_bw` (feed quantity divided by body weight).
  - Temperature humidity index (THI) as a standard heat stress indicator.
  - Grazing efficiency `Walk_per_graze` (distance walked over grazing hours).
- Telling CatBoost exactly which columns are categorical so it can use its native categorical handling.
- Dropping 74 rows with **negative** `Milk_Yield_L` labels because they are physically impossible and hurt training.
- Lactation curve features such as `is_peak_lactation`, `is_early_lactation`, `is_late_lactation`, `dim_squared`, `dim_cubed`, `dim_log`, and `dim_parity`. These were mostly neutral but did not make things worse, so we kept them for biological interpretability.

These changes moved CatBoost from about 4.114 RMSE down to about 4.108 on average, with dropping negative labels being the single biggest win.
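
The kept features above can be sketched as follows. This is an illustration, not the exact pipeline code: the column names `Humidity_percent`, `Walking_Distance_km`, and `Grazing_Duration_hrs` are assumptions, the THI expression is one commonly used formulation, and the 40 to 100 day window for peak lactation is a placeholder cutoff.

```python
import numpy as np
import pandas as pd


def add_core_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Feed intake relative to body size; a zero weight becomes NaN, not inf.
    out["Feed_per_kg_bw"] = out["Feed_Quantity_kg"] / out["Weight_kg"].replace(0, np.nan)
    # One common THI formulation (temperature in Celsius, humidity in percent).
    t, rh = out["Ambient_Temperature_C"], out["Humidity_percent"]
    out["THI"] = (1.8 * t + 32) - (0.55 - 0.0055 * rh) * (1.8 * t - 26)
    # Distance walked per grazing hour; zero grazing hours become NaN.
    out["Walk_per_graze"] = out["Walking_Distance_km"] / out[
        "Grazing_Duration_hrs"
    ].replace(0, np.nan)
    # Lactation curve terms built from days in milk.
    dim = out["Days_in_Milk"]
    out["dim_squared"] = dim ** 2
    out["dim_log"] = np.log1p(dim)
    out["is_peak_lactation"] = dim.between(40, 100).astype(int)  # assumed window
    return out


# Toy rows with made-up values.
toy = pd.DataFrame(
    {
        "Feed_Quantity_kg": [20.0, 15.0],
        "Weight_kg": [500.0, 450.0],
        "Ambient_Temperature_C": [25.0, 30.0],
        "Humidity_percent": [50.0, 80.0],
        "Walking_Distance_km": [4.0, 3.0],
        "Grazing_Duration_hrs": [8.0, 0.0],
        "Days_in_Milk": [60, 200],
    }
)
feats = add_core_features(toy)
print(feats[["Feed_per_kg_bw", "THI", "Walk_per_graze", "is_peak_lactation"]])
```

Replacing zero denominators with NaN leaves the decision about missing ratios to the model instead of feeding it infinite values.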


### 5.3 CatBoost ideas that we tested but did not keep

We also tried several larger “interaction blocks” that ended up hurting performance:

- Additional efficiency ratios like `water_per_weight`, `age_parity_ratio`, `age_parity_product`, and combined activity metrics such as `total_activity` and `rest_activity_ratio`.
- Extra interaction terms between temperature and humidity such as `temp_humidity` and heavy weight–feed products.
- Bundling various vaccine columns and using sums or more complex combinations.

These sets usually moved CatBoost’s average RMSE in the wrong direction, so we removed them from the final pipeline and kept only the simpler, more robust pieces. 


### 5.4 Rumination and farm-level statistics

We explored several ways to handle the strange `Rumination_Time_hrs` values and farm context:

- Naively making all rumination values positive performed much worse.
- A more careful approach that split positive and negative rumination into separate modes and created flags seemed conceptually better but did not beat the simpler baselines.
- Setting negative rumination values to missing (NaN) and letting the model handle them was roughly neutral and is closer to how we would treat systemic sensor problems.
- We added farm-level mean and standard deviation features per `Farm_ID` for important columns like `Weight_kg`, `Feed_Quantity_kg`, `Water_Intake_L`, `Age_Months`, `Days_in_Milk`, and `Ambient_Temperature_C`. These helped describe the overall environment of each farm without leaking label information.

In the final model we kept a simpler rumination treatment and a lighter version of farm stats to avoid overfitting.
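
A minimal sketch of two of the ideas above: turning negative rumination values into NaN and attaching per-farm mean and standard deviation columns. The helper names and the `_farm_mean` / `_farm_std` suffixes are hypothetical, and computing the statistics on the training frame and merging them onto both frames is an assumption about how this would be wired up.

```python
import numpy as np
import pandas as pd


def negative_to_nan(df: pd.DataFrame, col: str = "Rumination_Time_hrs") -> pd.DataFrame:
    """Treat negative readings as missing instead of flipping their sign."""
    out = df.copy()
    out[col] = out[col].where(out[col] >= 0)
    return out


def add_farm_stats(train, test, cols, key="Farm_ID"):
    """Attach per-farm mean/std of feature columns (never the target)."""
    stats = train.groupby(key)[cols].agg(["mean", "std"])
    stats.columns = [f"{c}_farm_{s}" for c, s in stats.columns]
    stats = stats.reset_index()
    # Left joins keep row order; farms unseen in train get NaN stats.
    return (
        train.merge(stats, on=key, how="left"),
        test.merge(stats, on=key, how="left"),
    )


# Toy frames with made-up values.
toy_train = pd.DataFrame(
    {
        "Farm_ID": ["A", "A", "B", "B"],
        "Weight_kg": [500.0, 600.0, 400.0, 440.0],
        "Rumination_Time_hrs": [7.5, -3.0, 8.0, 6.0],
    }
)
toy_test = pd.DataFrame(
    {
        "Farm_ID": ["A", "B", "C"],
        "Weight_kg": [520.0, 410.0, 480.0],
        "Rumination_Time_hrs": [-1.0, 5.0, 6.0],
    }
)

train_f, test_f = add_farm_stats(
    negative_to_nan(toy_train), negative_to_nan(toy_test), ["Weight_kg"]
)
print(train_f)
print(test_f)
```

Because only predictor columns go into the aggregation, these statistics describe the farm environment without touching `Milk_Yield_L`.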


### 5.5 Peer-relative features and clustering

We tried two main ideas for farm context beyond simple stats:

1. Peer-relative features  
   For each farm we computed how a cow compares to its farm mates for key predictors like `Weight_kg`, `Age_Months`, `Days_in_Milk`, `Previous_Week_Avg_Yield`, `Water_Intake_L`, `Body_Condition_Score`, and `Feed_Quantity_kg`. For each column we created:
   - A difference feature, like `Weight_kg_vs_farm_diff`.
   - A ratio feature, like `Weight_kg_vs_farm_ratio`.

   These were designed to capture “is this cow above or below the typical animal on this farm” rather than only absolute levels.

2. Farm clustering  
   We also experimented with clustering farms using KMeans on farm-level aggregates and using the cluster label as `Farm_Cluster`. However, doing this separately on train and test created mismatched cluster labels and effectively injected anti-signal in evaluation. Once we understood this, we removed `Farm_Cluster` and avoided that form of clustering in the final pipeline.

Overall, peer-relative features were conceptually strong but did not clearly beat our simpler combination of farm stats and core predictors, so we relied more heavily on the latter in the final CatBoost version.
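
The peer-relative features can be sketched as below. The helper name is hypothetical. Computing the farm reference means once, on the training frame, and reusing them for both frames is an assumption; it is one way to avoid the kind of train/test mismatch that affected `Farm_Cluster`.

```python
import numpy as np
import pandas as pd


def add_peer_relative(df, farm_means, cols, key="Farm_ID"):
    """Compare each cow with the mean of its farm for the given columns."""
    out = df.copy()
    for col in cols:
        # Look up the farm-level reference value for every row.
        peer = out[key].map(farm_means[col])
        out[f"{col}_vs_farm_diff"] = out[col] - peer
        out[f"{col}_vs_farm_ratio"] = out[col] / peer.replace(0, np.nan)
    return out


# Toy frames with made-up values.
toy_train = pd.DataFrame(
    {"Farm_ID": ["A", "A", "B"], "Weight_kg": [500.0, 600.0, 400.0]}
)
toy_test = pd.DataFrame({"Farm_ID": ["A", "B"], "Weight_kg": [605.0, 380.0]})

cols = ["Weight_kg"]
# One shared reference table, so train and test are measured the same way.
farm_means = toy_train.groupby("Farm_ID")[cols].mean()

train_rel = add_peer_relative(toy_train, farm_means, cols)
test_rel = add_peer_relative(toy_test, farm_means, cols)
print(test_rel)
```

A positive difference or a ratio above 1 marks a cow that is heavier than the typical animal on its farm.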


### 5.6 XGBoost feature engineering

We also built an XGBoost pipeline to complement CatBoost and used it as a second family of models. We tried a wide range of engineered features:

- Age and parity transforms: `Age_Years`, `Age_Years2`, `Parity2`, and `Age_x_Parity`.
- Nonlinear day in milk transforms such as `DIM_log`.
- Efficiency ratios: `Feed_per_kgBW`, `Water_per_kgBW`, and `PrevYield_per_Feed`.
- Heat stress index THI and health summaries like `Vax_Sum`.
- Farm deltas and cohort features that compare a cow’s performance to its farm or to breed plus lactation stage averages.
- A large “biological feature block” including lactation curve flags, behavioral health indicators, parity categories, stress indicators, water and body condition flags, and age parity deviation.

Most of these had very small effects on average XGBoost RMSE. The best improvements came from relatively simple parity categories, certain farm normalized predictors, and a `Vax_Sum` plus farm delta combination that gave a small but consistent gain.
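
A rough sketch of a few of the XGBoost features listed above. The feature names come from our list, but the exact definitions, the `Parity` column name, the vaccine column names, the `Parity_Cat` name, and the parity bins are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_xgb_features(df: pd.DataFrame, vax_cols) -> pd.DataFrame:
    out = df.copy()
    # Age and parity transforms.
    out["Age_Years"] = out["Age_Months"] / 12.0
    out["Age_Years2"] = out["Age_Years"] ** 2
    out["Parity2"] = out["Parity"] ** 2
    out["Age_x_Parity"] = out["Age_Years"] * out["Parity"]
    # Nonlinear days-in-milk transform; log1p is safe at zero.
    out["DIM_log"] = np.log1p(out["Days_in_Milk"])
    # Count of vaccines given, assuming 0/1 indicator columns.
    out["Vax_Sum"] = out[vax_cols].sum(axis=1)
    # Coarse parity categories (assumed bins: 1, 2, 3 or more).
    out["Parity_Cat"] = pd.cut(
        out["Parity"],
        bins=[0, 1, 2, np.inf],
        labels=["first", "second", "third_plus"],
    )
    return out


# Toy rows; the vaccine column names are hypothetical.
toy = pd.DataFrame(
    {
        "Age_Months": [48, 30],
        "Parity": [2, 4],
        "Days_in_Milk": [0, 150],
        "Vax_A": [1, 0],
        "Vax_B": [0, 0],
        "Vax_C": [1, 1],
    }
)
xgb_feats = add_xgb_features(toy, ["Vax_A", "Vax_B", "Vax_C"])
print(xgb_feats)
```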


### 5.7 Encodings and dimensionality reduction

Instead of classic PCA style dimensionality reduction we focused on encodings that compress categorical structure in a supervised way.

We tried:

- Farm normalized predictors where we subtract or standardize by farm means and standard deviations for variables like `Previous_Week_Avg_Yield`, `Feed_Quantity_kg`, `Water_Intake_L`, and `Weight_kg`. This effectively reduces useless absolute scale variation and focuses on deviations within each farm.
- KFold target encoding for several high signal categorical variables, with out of fold means on the training set and full train means on the test set, to avoid leakage.
- Frequency encoding for some remaining categoricals like `Breed`.

Target encoding in particular did not help XGBoost in this setup and sometimes hurt RMSE, so we removed it from the final feature set. Farm normalized predictors and simple parity categories were the most reliable pieces we kept.
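
For reference, the out-of-fold target encoding and frequency encoding described above can be sketched like this. The function names are hypothetical, the folds are built with plain NumPy rather than a library splitter, and falling back to the global mean for unseen categories is an assumption.

```python
import numpy as np
import pandas as pd


def kfold_target_encode(train, test, col, target, n_splits=5, seed=42):
    """Out-of-fold means for train, full-train means for test."""
    global_mean = train[target].mean()
    oof = pd.Series(np.nan, index=train.index, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(train)), n_splits)
    for val_pos in folds:
        fit_mask = np.ones(len(train), dtype=bool)
        fit_mask[val_pos] = False
        # Category means come only from rows outside the current fold.
        means = train.iloc[fit_mask].groupby(col)[target].mean()
        oof.iloc[val_pos] = train[col].iloc[val_pos].map(means).to_numpy()
    full_means = train.groupby(col)[target].mean()
    test_enc = test[col].map(full_means)
    # Categories missing from the fitting rows fall back to the global mean.
    return oof.fillna(global_mean), test_enc.fillna(global_mean)


def frequency_encode(train, test, col):
    """Replace each category with its share of the training rows."""
    freq = train[col].value_counts(normalize=True)
    return train[col].map(freq), test[col].map(freq).fillna(0.0)


# Toy data (made-up values); five folds on five rows is leave-one-out.
toy_train = pd.DataFrame(
    {"Breed": ["A", "A", "A", "B", "B"], "Milk_Yield_L": [10.0, 20.0, 30.0, 40.0, 60.0]}
)
toy_test = pd.DataFrame({"Breed": ["A", "B", "C"]})

oof_enc, test_enc = kfold_target_encode(toy_train, toy_test, "Breed", "Milk_Yield_L")
train_freq, test_freq = frequency_encode(toy_train, toy_test, "Breed")
print(oof_enc.tolist(), test_enc.tolist())
```

In the toy run, no training row's encoding uses its own label, which is the property that prevents leakage.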

We considered PCA style dimensionality reduction on standardized numeric features, but since our strongest models are tree based and already handle moderate dimensionality well, PCA did not provide a clear benefit and would have reduced interpretability. We chose targeted feature selection, encoding, and dropping noisy columns instead of global projection methods.


### 5.8 Final feature set used

In the final models we kept a compact but expressive set of engineered features:

- Cleaned labels with negative milk yields removed.
- Date features like year, month, day of week, week of year, quarter, `date_ordinal`, and sometimes a weekend flag.
- Key dairy science features: feed per body weight, THI, grazing efficiency, and lactation curve terms.
- Light but useful farm context features based on per farm means and standard deviations.
- Select efficiency and interaction terms that consistently did not hurt cross validation.
- Parity categories and a small number of health and management summaries.

We intentionally dropped many of the more complex interaction blocks and heavy encodings because they either did not help or made performance worse. This left us with a feature space that is easier to explain and tuned to what the cross validation actually supported.
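
The date features in the list above can be derived with a few pandas accessors. This is a sketch that assumes the raw date column is called `Date`; the output column names other than `date_ordinal` are illustrative.

```python
import pandas as pd


def add_date_features(df: pd.DataFrame, date_col: str = "Date") -> pd.DataFrame:
    out = df.copy()
    d = pd.to_datetime(out[date_col])
    out["year"] = d.dt.year
    out["month"] = d.dt.month
    out["day_of_week"] = d.dt.dayofweek  # Monday is 0
    out["week_of_year"] = d.dt.isocalendar().week.astype(int)
    out["quarter"] = d.dt.quarter
    # Days since a fixed origin, so the model sees a monotonic time index.
    # This assumes there are no missing dates in the column.
    out["date_ordinal"] = d.map(lambda ts: ts.toordinal())
    out["is_weekend"] = (d.dt.dayofweek >= 5).astype(int)
    return out


# Two made-up dates: a Saturday in January and a Monday in July.
toy = pd.DataFrame({"Date": ["2024-01-06", "2024-07-15"]})
dated = add_date_features(toy)
print(dated)
```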


## 6. Modeling Approach

Our modeling work focused on gradient boosted tree methods. We compared CatBoost, LightGBM, and XGBoost using the same preprocessing pipeline and cross validation splits.

Key steps:

- Established baseline CatBoost and LightGBM models, which showed CatBoost was consistently stronger on this dataset.
- Removed LightGBM from the final ensemble once we saw that blending it with CatBoost hurt leaderboard performance.
- Tuned CatBoost with Optuna over depth, learning rate, regularization, subsample, random strength, and bagging temperature, reaching a best CV RMSE around 4.1064 with depth 6 and a five fold ensemble.
- Built multiple CatBoost models using the best hyperparameters and different random seeds, then blended:
  - CV only ensemble.
  - Full data seed ensemble.
  - Weighted blends between CV and full style predictions.
- Trained and tuned an XGBoost model mainly as a diverse second opinion. It never beat CatBoost alone but helped us better understand which engineered features were robust across model classes.

Overall, the best performing pipeline was an optimized CatBoost ensemble with carefully chosen features and tuned hyperparameters, rather than very heavy feature blocks or wide model diversity at all costs.
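
The seed averaging and weighted blending steps reduce to a few lines of NumPy. This sketch uses made-up prediction arrays instead of real CatBoost outputs, and the names `seed_average`, `blend`, and `alpha` are illustrative.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def seed_average(preds_by_seed) -> np.ndarray:
    """Average predictions from models that differ only in random seed."""
    return np.mean(np.vstack(preds_by_seed), axis=0)


def blend(cv_pred, full_pred, alpha: float) -> np.ndarray:
    """Weighted mix: alpha on the CV ensemble, 1 - alpha on the full-data one."""
    return alpha * np.asarray(cv_pred) + (1 - alpha) * np.asarray(full_pred)


# Made-up numbers standing in for held-out labels and model predictions.
y_holdout = np.array([10.0, 20.0, 30.0])
cv_pred = np.array([12.0, 18.0, 33.0])
full_pred = seed_average([[9.0, 21.0, 28.0], [11.0, 23.0, 30.0]])

blended = blend(cv_pred, full_pred, alpha=0.5)
print(rmse(y_holdout, cv_pred), rmse(y_holdout, full_pred), rmse(y_holdout, blended))
```

In this toy case the blend scores better than either input. On real data, `alpha` would need to be chosen on out-of-fold or held-out predictions, never on the test set.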

# Baseline Model (Modeling Approach) 6

In [None]:
'''
Here we show the baseline model we used with default-ish parameters. Show it, train it, and run it without Optuna. Show the fold RMSEs and the average CV RMSE, explain what the baseline does for us, and add anything else that makes this section complete.
'''
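
A minimal sketch of the fold-by-fold reporting described in the note above. The helper `cross_val_rmse` is hypothetical and takes any fit-and-predict callable; the mean predictor used here is only a stand-in so the example runs without extra libraries, and it is not the CatBoost baseline itself.

```python
import numpy as np


def cross_val_rmse(fit_predict, X, y, n_splits=5, seed=42):
    """Return per-fold RMSEs and their average for a fit-and-predict callable."""
    X, y = np.asarray(X), np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    scores = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), val_idx)
        pred = fit_predict(X[train_idx], y[train_idx], X[val_idx])
        scores.append(float(np.sqrt(np.mean((y[val_idx] - pred) ** 2))))
    return scores, float(np.mean(scores))


def mean_baseline(X_tr, y_tr, X_val):
    """Stand-in model: always predict the training mean."""
    return np.full(len(X_val), y_tr.mean())


# Made-up data: ten rows with a linear target, then a constant target.
X_toy = np.zeros((10, 1))
scores, avg = cross_val_rmse(mean_baseline, X_toy, np.arange(10.0))
const_scores, const_avg = cross_val_rmse(mean_baseline, X_toy, np.full(10, 5.0))

for i, s in enumerate(scores, start=1):
    print(f"Fold {i} RMSE: {s:.4f}")
print(f"Average CV RMSE: {avg:.4f}")
```

Swapping `mean_baseline` for a function that fits a real model on `(X_tr, y_tr)` and predicts on `X_val` keeps the reporting loop unchanged.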

# Modeling Experiments & Improvements 7


In [None]:
'''
Show the process we took here, not just the final script. Subsections could cover how we started with CatBoost, XGBoost, and LightGBM (I have how we started in the doc). For each one, show how we approached it and why we ended up choosing CatBoost as the main model. We did a lot here, so there is no reason not to use it all. Add another subsection for hyperparameter tuning with Optuna: show the code or how we integrated it, and show the outputs. We could also show our cross validation strategy, plus the ensembles: snapshot, seeds, blending, stacking.
'''

# Final Model & Leaderboard Results

In [None]:
''' 
Short but important. Here we show what we stuck with and why, the local RMSE, the final Kaggle RMSE, and a table of the different models we tried and their results. Write a description of what we stuck with and why. As always, make it good and useful for the reader.

'''

# Conclusion

In [None]:
'''
What worked best (CatBoost, features, multi-seed ensemble, blending, alpha). What didn't help (certain features, and some stacking ideas that improved CV but not LB, as you can see in Felipe's notebook). What we would do with more time.
'''