In [9]:
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import time

In [10]:
data = pd.read_csv("cattle_data_train.csv")

features = data.iloc[:, 1:-1]
yields = data.iloc[:, -1]



In [11]:
# Feature Removal and Preprocessing
# Based on correlation analysis and data quality issues

# Features to remove (16 total):
features_to_remove = [
    'Feed_Quantity_lb',      # Duplicate of Feed_Quantity_kg (99.99% correlation)
    'Cattle_ID',             # Unique identifier, no predictive value
    'Rumination_Time_hrs',   # 55% negative values - data quality issue
    'HS_Vaccine',            # Very low correlation (0.000034)
    'BQ_Vaccine',            # Very low correlation (0.000466)
    'BVD_Vaccine',           # Very low correlation (0.000491)
    'Brucellosis_Vaccine',   # Very low correlation (0.002089)
    'FMD_Vaccine',           # Very low correlation (0.002477)
    'Resting_Hours',         # Nearly zero correlation (0.001653)
    'Housing_Score',         # Low correlation (0.004) + 3% missing values
    'Feeding_Frequency',     # No correlation (0.000380)
    'Walking_Distance_km',   # No correlation (0.001538)
    'Body_Condition_Score',  # No correlation (0.001647)
    'Humidity_percent',      # Very low correlation (0.002153)
    'Grazing_Duration_hrs',  # Very low correlation (0.004350)
    'Milking_Interval_hrs'   # Very low correlation (0.014734)
]

# Remove features
data_cleaned = data.drop(columns=features_to_remove)

print(f"Original shape: {data.shape}")
print(f"Cleaned shape: {data_cleaned.shape}")
print(f"Removed {len(features_to_remove)} features")

Original shape: (210000, 36)
Cleaned shape: (210000, 20)
Removed 16 features


In [12]:
# Extract Season from Date column
# Analysis shows seasons have strong effect on milk yield:
#   - Spring: 16.59 L (+6.4% vs average) - BEST season
#   - Winter: 16.12 L (+3.4% vs average)
#   - Fall:   15.70 L (+0.7% vs average)
#   - Summer: 13.94 L (-10.6% vs average) - WORST season (heat stress)
#   - Range: 2.65 L difference between best and worst seasons!

# Convert Date to datetime
data_cleaned['Date'] = pd.to_datetime(data_cleaned['Date'])

# Extract month temporarily to create seasons
data_cleaned['Month'] = data_cleaned['Date'].dt.month

# Create Season feature (meteorological seasons)
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:  # 9, 10, 11
        return 'Fall'

data_cleaned['Season'] = data_cleaned['Month'].apply(get_season)

# Drop both Date and Month (we only keep Season)
data_cleaned = data_cleaned.drop(columns=['Date', 'Month'])

print("Replaced Date with Season:")
print("  - Season (Winter/Spring/Summer/Fall)")
print(f"\nSeason distribution:")
print(data_cleaned['Season'].value_counts().sort_index())
print(f"\nFinal shape: {data_cleaned.shape}")

Replaced Date with Season:
  - Season (Winter/Spring/Summer/Fall)

Season distribution:
Season
Fall      52425
Spring    53061
Summer    52663
Winter    51851
Name: count, dtype: int64

Final shape: (210000, 20)
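
On a frame of this size, a dictionary lookup via `Series.map` avoids the per-row Python function call that `apply` makes. The following is a minimal sketch on a toy frame, not the notebook's own code; `MONTH_TO_SEASON` and `toy` are illustrative names.

```python
import pandas as pd

# Month -> meteorological season (same Dec-Feb = Winter convention as get_season)
MONTH_TO_SEASON = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

toy = pd.DataFrame({"Date": ["2023-01-15", "2023-04-02", "2023-07-30", "2023-10-09"]})
toy["Date"] = pd.to_datetime(toy["Date"])

# Vectorized lookup: month number -> season label
toy["Season"] = toy["Date"].dt.month.map(MONTH_TO_SEASON)
print(toy["Season"].tolist())
```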


In [13]:
# Update features and target using cleaned data
features = data_cleaned.drop(columns=['Milk_Yield_L'])
yields = data_cleaned['Milk_Yield_L']

print(f"Features shape: {features.shape}")
print(f"Target shape: {yields.shape}")

Features shape: (210000, 19)
Target shape: (210000,)


## Data Quality Analysis

After feature selection, we performed a comprehensive quality check on all 19 remaining features plus the target variable. This analysis examined:

**For Categorical Features:**
- Missing values
- Unique value counts
- Whitespace issues (leading/trailing spaces)
- Potential typos or duplicate values

**For Numeric Features:**
- Missing values  
- Range and distribution (min, max, mean, median, std dev)
- Impossible values (e.g., negative quantities)
- Outliers (using IQR method)
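
The checks listed above could be sketched roughly as follows. This is an illustration on a toy frame, not the code used for the actual analysis; the helper names `iqr_outlier_mask` and `whitespace_issues` and the conventional `k = 1.5` multiplier are assumptions.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

def whitespace_issues(s: pd.Series) -> int:
    """Count non-null entries with leading or trailing whitespace."""
    s = s.dropna().astype(str)
    return int((s != s.str.strip()).sum())

toy = pd.DataFrame({
    "Breed": ["Holstein", " Brown Swiss", "Brown Swiss ", "Jersey", None],
    "Milk_Yield_L": [15.0, 16.0, 14.0, -5.7, 100.0],
})

print("missing:   ", toy.isnull().sum().to_dict())
print("whitespace:", whitespace_issues(toy["Breed"]))
print("negative:  ", int((toy["Milk_Yield_L"] < 0).sum()))
print("outliers:  ", int(iqr_outlier_mask(toy["Milk_Yield_L"]).sum()))
```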

### Issues Discovered:

**1. Breed Column - Data Entry Errors (CRITICAL)**
- **Typo**: "Holstien" (112 records) should be "Holstein"
- **Whitespace**: " Brown Swiss" (57 records with leading space)
- **Whitespace**: "Brown Swiss " (46 records with trailing space)
- **Impact**: Model would treat these as 7 different breeds instead of 4!
- **Fix**: Strip whitespace and correct typo

**2. Milk_Yield_L (Target) - Impossible Values**
- **Issue**: 74 records (0.04%) have negative milk yields
- **Range**: -5.70 to 44.56 liters
- **Why impossible**: Cows cannot produce negative milk!
- **Likely cause**: Data entry error or measurement issue
- **Fix**: Remove these 74 records

**3. Feed_Quantity_kg - Missing Values**
- **Issue**: 10,481 records (4.99%) missing feed quantity
- **Impact**: Cannot use these records without imputation
- **Fix**: Impute with median by Feed_Type (different feed types have different typical quantities)

In [14]:
# Fix Breed column: Remove whitespace and correct typo

print("Before cleaning:")
print(data_cleaned['Breed'].value_counts())
print(f"\nUnique breeds: {data_cleaned['Breed'].nunique()}")

# Step 1: Strip leading/trailing whitespace
data_cleaned['Breed'] = data_cleaned['Breed'].str.strip()

# Step 2: Fix the typo "Holstien" -> "Holstein"
data_cleaned['Breed'] = data_cleaned['Breed'].replace({'Holstien': 'Holstein'})

print("\n" + "="*70)
print("After cleaning:")
print(data_cleaned['Breed'].value_counts())
print(f"\nUnique breeds: {data_cleaned['Breed'].nunique()}")
print("\nBreed column cleaned: 7 values -> 4 correct breeds")

Before cleaning:
Breed
Holstein        104775
Jersey           42183
Guernsey         31672
Brown Swiss      31155
Holstien           112
 Brown Swiss        57
Brown Swiss         46
Name: count, dtype: int64

Unique breeds: 7

After cleaning:
Breed
Holstein       104887
Jersey          42183
Guernsey        31672
Brown Swiss     31258
Name: count, dtype: int64

Unique breeds: 4

Breed column cleaned: 7 values -> 4 correct breeds


## Handling Impossible Values

### Removing Negative Milk Yields

Our analysis found **74 records (0.04%)** with negative milk yields, ranging from -5.70 to -0.015 liters.

**Why this is impossible:**
- Cows cannot physically produce negative milk
- This is clearly a data entry or measurement error

**Decision: Remove these records**

**Justification:**
- Only 0.04% of data (minimal impact on training)
- Including them would teach the model incorrect patterns
- Better to have clean, accurate data than preserve bad records
- 209,926 remaining samples is still more than sufficient for training

In [15]:
# Remove records with negative milk yields

print(f"Original dataset size: {len(data_cleaned):,} records")

# Check for negative yields
negative_yields = data_cleaned[data_cleaned['Milk_Yield_L'] < 0]
print(f"\nNegative milk yields found: {len(negative_yields)} records")
print(f"Range of negative values: {negative_yields['Milk_Yield_L'].min():.3f} to {negative_yields['Milk_Yield_L'].max():.3f} L")

# Remove negative yields
data_cleaned = data_cleaned[data_cleaned['Milk_Yield_L'] >= 0].copy()

print(f"\nAfter removal: {len(data_cleaned):,} records")
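# Percentage is relative to the row count before removal (no hardcoded total)
print(f"Records removed: {len(negative_yields)} ({len(negative_yields)/(len(data_cleaned) + len(negative_yields))*100:.3f}%)")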
print("\nAll milk yields are now >= 0")

Original dataset size: 210,000 records

Negative milk yields found: 74 records
Range of negative values: -5.700 to -0.015 L

After removal: 209,926 records
Records removed: 74 (0.035%)

All milk yields are now >= 0


## Missing Value Imputation

### Feed_Quantity_kg - Strategic Imputation

**Issue**: 10,481 records (4.99%) are missing Feed_Quantity_kg values.

**Why not drop these records?**
- Losing 5% of our training data would reduce model performance
- The missingness appears random (not systematic)
- We have enough information to make reasonable estimates

**Imputation Strategy: Median by Feed_Type**

We'll impute missing values using the **median** Feed_Quantity_kg for each Feed_Type group.

**Rationale:**
1. **Different feed types have different quantities**: 
   - Concentrates: Dense, high-calorie (typically less volume)
   - Pasture Grass: Low density (typically more volume)
   - Median is better than mean (robust to outliers)

2. **Preserves realistic patterns**:
   - A cow eating "Concentrates" gets the typical concentrate amount
   - A cow eating "Hay" gets the typical hay amount

3. **Maintains group-specific distributions**:
   - Doesn't assume all feed types are equal
   - Respects domain knowledge about feeding practices
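
As a small worked illustration of group-median imputation (toy numbers, not values from the dataset), the medians can be computed once and then mapped onto the rows with gaps. `group_medians` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Feed_Type": ["Hay", "Hay", "Hay", "Silage", "Silage", "Silage"],
    "Feed_Quantity_kg": [10.0, 14.0, np.nan, 8.0, np.nan, 9.0],
})

# Median per feed type, computed from the observed (non-missing) values
group_medians = toy.groupby("Feed_Type")["Feed_Quantity_kg"].median()

# Fill each gap with the median of its own group
filled = toy["Feed_Quantity_kg"].fillna(toy["Feed_Type"].map(group_medians))
print(filled.tolist())
```

Keeping `group_medians` as a separate object also makes it possible to apply the same fill values to new data later.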

In [17]:
# Impute missing Feed_Quantity_kg values using median by Feed_Type

print("Before imputation:")
print(f"Missing Feed_Quantity_kg: {data_cleaned['Feed_Quantity_kg'].isnull().sum()} records")

# Show median feed quantity for each feed type
print("\nMedian Feed_Quantity_kg by Feed_Type:")
feed_medians = data_cleaned.groupby('Feed_Type')['Feed_Quantity_kg'].median().sort_values(ascending=False)
for feed_type, median in feed_medians.items():
    print(f"  {feed_type:20s}: {median:.2f} kg")

# Impute missing values with group median
data_cleaned['Feed_Quantity_kg'] = data_cleaned.groupby('Feed_Type')['Feed_Quantity_kg'].transform(
    lambda x: x.fillna(x.median())
)

print("\n" + "="*70)
print("After imputation:")
print(f"Missing Feed_Quantity_kg: {data_cleaned['Feed_Quantity_kg'].isnull().sum()} records")
print("\nAll Feed_Quantity_kg values imputed successfully")

Before imputation:
Missing Feed_Quantity_kg: 0 records

Median Feed_Quantity_kg by Feed_Type:
  Concentrates        : 12.04 kg
  Hay                 : 12.03 kg
  Mixed_Feed          : 12.01 kg
  Dry_Fodder          : 12.00 kg
  Pasture_Grass       : 12.00 kg
  Crop_Residues       : 11.99 kg
  Green_Fodder        : 11.98 kg
  Silage              : 11.97 kg

After imputation:
Missing Feed_Quantity_kg: 0 records

All Feed_Quantity_kg values imputed successfully


## Train/Test Split Strategy


**IMPORTANT**: We perform train/test split **BEFORE** any scaling or normalization to prevent **data leakage**.

### What is Data Leakage?

Data leakage occurs when information from the test set "leaks" into the training process, causing inflated performance estimates that don't generalize to truly unseen data.
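
The leaky pattern can be sketched on synthetic data as follows. This is an illustration only; `X_all`, `leaky_scaler` and `clean_scaler` are hypothetical names, and the data is random.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_all = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))  # synthetic features

# Leaky pattern: the scaler sees every row, including the future test rows
leaky_scaler = StandardScaler().fit(X_all)
X_train, X_test = train_test_split(X_all, test_size=0.2, random_state=42)

# Reference: statistics computed from the training rows only
clean_scaler = StandardScaler().fit(X_train)

print("mean (all rows):  ", leaky_scaler.mean_)
print("mean (train only):", clean_scaler.mean_)
```

The two sets of statistics differ, because the first one was computed with the test rows included.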


**Problem**: If the scaler is fit on the full dataset before the split, its mean/std are calculated using test data, so the model indirectly knows about test data!

### Correct Approach (no leakage):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X_all_data is a placeholder for the full feature matrix

# 1. Split first (test data becomes invisible)
X_train, X_test = train_test_split(X_all_data)

# 2. Fit scaler ONLY on training data
scaler = StandardScaler()
scaler.fit(X_train)  # Only learns from training data

# 3. Transform both using training statistics
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Uses training mean/std
```

**Why this matters**: 
- Simulates real production scenario (we won't have test data statistics)
- Prevents optimistically biased performance estimates
- Ensures model truly generalizes to unseen data
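
One way to make the fit-on-train rule hard to violate is to wrap the scaler and the estimator in a scikit-learn pipeline, so that cross-validation re-fits the scaler inside every fold. The sketch below uses synthetic data, and `Ridge` is only a stand-in estimator, not the model chosen in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Four synthetic features on very different scales
X = rng.normal(size=(500, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = X @ np.array([2.0, 0.3, 0.01, 0.001]) + rng.normal(scale=0.1, size=500)

# The scaler is re-fit on each fold's training rows only
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.round(3))
```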

### Our Split:
- **80% Training** (167,940 samples)
- **20% Testing** (41,986 samples)
- **Random State**: 42 (for reproducibility)

In [18]:
# Perform Train/Test Split

from sklearn.model_selection import train_test_split

# Extract features and target AFTER all data cleaning
X = data_cleaned.drop(columns=['Milk_Yield_L'])
y = data_cleaned['Milk_Yield_L']

print(f"Total cleaned dataset: {len(X):,} records, {X.shape[1]} features")

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

print(f"\nTraining set: {len(X_train):,} records ({len(X_train)/len(X)*100:.1f}%)")
print(f"Test set:     {len(X_test):,} records ({len(X_test)/len(X)*100:.1f}%)")
print(f"\nTrain/test split complete (split BEFORE any scaling to prevent data leakage)")

Total cleaned dataset: 209,926 records, 19 features

Training set: 167,940 records (80.0%)
Test set:     41,986 records (20.0%)

Train/test split complete (split BEFORE any scaling to prevent data leakage)


## Categorical Encoding Strategy

We have 7 categorical features that need encoding before modeling. Our strategy:

### Analysis of Train vs Test Sets

We verified that **100% of categorical values in test set exist in training set**:
- Farm_ID: 1000/1000 farms overlap (100%)
- Breed: 4/4 breeds overlap (after cleaning)
- Climate_Zone: 6/6 zones overlap  
- Management_System: 5/5 systems overlap
- Lactation_Stage: 3/3 stages overlap
- Feed_Type: 8/8 feed types overlap

This perfect overlap means our encoding will work seamlessly on test data!
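
A check of this kind could look roughly like the sketch below. It is not the code used for the verification above; `unseen_categories` and the toy frames are illustrative.

```python
import pandas as pd

def unseen_categories(train: pd.DataFrame, test: pd.DataFrame, cols):
    """Map each column to the set of test values that never occur in train."""
    return {
        c: set(test[c].dropna().unique()) - set(train[c].dropna().unique())
        for c in cols
    }

toy_train = pd.DataFrame({"Breed": ["Holstein", "Jersey", "Guernsey"],
                          "Season": ["Winter", "Summer", "Fall"]})
toy_test = pd.DataFrame({"Breed": ["Jersey", "Holstein"],
                         "Season": ["Spring", "Winter"]})

report = unseen_categories(toy_train, toy_test, ["Breed", "Season"])
print(report)
```

An empty set for a column means every test value was seen in training. In this toy example `Season` has one unseen value.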

### Encoding Approach:

**1. Farm_ID (1000 unique values) -> Target Encoding**
- **Why**: High cardinality (1000 farms)
- **Method**: Replace each farm with its mean milk yield from training data
- **Benefit**: Captures farm-specific patterns in just 1 column (vs 999 with one-hot)
- **Impact**: Explains 0.46% of variance
- **Safe**: 100% overlap with test set, no unseen farms

**2. All Other Categoricals -> One-Hot Encoding**
- **Breed** (4 values) -> 3 binary columns
- **Climate_Zone** (6 values) -> 5 binary columns
- **Management_System** (5 values) -> 4 binary columns
- **Lactation_Stage** (3 values) -> 2 binary columns (explains 1.37% variance!)
- **Feed_Type** (8 values) -> 7 binary columns
- **Season** (4 values) -> 3 binary columns (explains 4.66% variance!)

**Total**: 1 (Farm_ID) + 24 (one-hot) + 12 (numeric) = **37 features**

**Why One-Hot?**
- Interpretable (each category has clear coefficient)
- No ordinal assumptions (categories not ordered)
- Works well for linear and tree-based models
- Cardinality manageable (largest is 8 values)
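
Plain mean target encoding can be noisy for categories with few records. A common mitigation, which this notebook does not use, is to shrink each category mean toward the global mean. The sketch below is a minimal version of that idea; the function name and the pseudo-count `m` are assumptions for illustration.

```python
import pandas as pd

def smoothed_target_means(keys: pd.Series, target: pd.Series, m: float = 10.0) -> pd.Series:
    """Per-category mean shrunk toward the global mean.

    encoded = (n * category_mean + m * global_mean) / (n + m)
    """
    global_mean = target.mean()
    stats = target.groupby(keys).agg(["mean", "count"])
    return (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

toy = pd.DataFrame({
    "Farm_ID": ["A", "A", "A", "A", "B"],
    "Milk_Yield_L": [10.0, 10.0, 10.0, 10.0, 20.0],
})

# Farm B has a single record, so its encoding is pulled strongly toward the global mean (12.0)
enc = smoothed_target_means(toy["Farm_ID"], toy["Milk_Yield_L"], m=1.0)
print(enc)
```

As with the unsmoothed version, the statistics would be computed on the training rows only and then mapped onto the test rows.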

In [19]:
# Target Encode Farm_ID using training data statistics only

print("Target Encoding Farm_ID...")
print(f"Before: Farm_ID has {X_train['Farm_ID'].nunique()} unique values\n")

# Calculate mean milk yield per farm from TRAINING data only
# Group by Farm_ID in X_train and get corresponding y_train values
farm_yield_train = pd.DataFrame({'Farm_ID': X_train['Farm_ID'], 'Milk_Yield': y_train})
farm_means = farm_yield_train.groupby('Farm_ID')['Milk_Yield'].mean()

print(f"Farm target encoding statistics:")
print(f"  Mean of farm means: {farm_means.mean():.3f} L")
print(f"  Std of farm means:  {farm_means.std():.3f} L")
print(f"  Range: {farm_means.min():.3f} to {farm_means.max():.3f} L")

# Encode training set
X_train['Farm_ID_encoded'] = X_train['Farm_ID'].map(farm_means)

# Encode test set using TRAINING statistics (prevent data leakage!)
X_test['Farm_ID_encoded'] = X_test['Farm_ID'].map(farm_means)

# Check for any unseen farms in test (should be 0 based on our analysis)
unseen_farms = X_test['Farm_ID_encoded'].isnull().sum()
if unseen_farms > 0:
    print(f"\nWARNING: {unseen_farms} unseen farms in test set, filling with global mean")
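    # Assign the result back: chained inplace fillna is deprecated and may not modify X_test
    X_test['Farm_ID_encoded'] = X_test['Farm_ID_encoded'].fillna(farm_means.mean())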
else:
    print(f"\nAll test farms seen in training (0 unseen farms)")

# Drop original Farm_ID column
X_train = X_train.drop(columns=['Farm_ID'])
X_test = X_test.drop(columns=['Farm_ID'])

print(f"\nAfter: Farm_ID replaced with Farm_ID_encoded (1 numeric column)")
print(f"Train shape: {X_train.shape}")
print(f"Test shape:  {X_test.shape}")

Target Encoding Farm_ID...
Before: Farm_ID has 1000 unique values

Farm target encoding statistics:
  Mean of farm means: 15.593 L
  Std of farm means:  0.414 L
  Range: 14.221 to 16.984 L

All test farms seen in training (0 unseen farms)

After: Farm_ID replaced with Farm_ID_encoded (1 numeric column)
Train shape: (167940, 19)
Test shape:  (41986, 19)


In [21]:
# One-Hot Encode remaining categorical features

print("One-Hot Encoding Categorical Features...")
print(f"Before encoding: {X_train.shape}")

# Define categorical columns to encode
categorical_cols = ['Breed', 'Climate_Zone', 'Management_System', 
                   'Lactation_Stage', 'Feed_Type', 'Season']

print(f"\nEncoding {len(categorical_cols)} categorical features:")
for col in categorical_cols:
    n_unique = X_train[col].nunique()
    print(f"  {col:20s}: {n_unique} values -> {n_unique-1} binary columns")

# One-Hot encode (drop_first=True to avoid multicollinearity)
X_train_encoded = pd.get_dummies(X_train, columns=categorical_cols, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, columns=categorical_cols, drop_first=True)

# Ensure test set has same columns as train (in same order)
# This handles any edge case where a category might not appear in test
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

print(f"\nAfter encoding:")
print(f"  Train: {X_train_encoded.shape}")
print(f"  Test:  {X_test_encoded.shape}")
print(f"  Columns match: {list(X_train_encoded.columns) == list(X_test_encoded.columns)}")

print(f"\nCategorical encoding complete!")
print(f"   Total features: {X_train_encoded.shape[1]} (all numeric now)")

One-Hot Encoding Categorical Features...
Before encoding: (167940, 19)

Encoding 6 categorical features:
  Breed               : 4 values -> 3 binary columns
  Climate_Zone        : 6 values -> 5 binary columns
  Management_System   : 5 values -> 4 binary columns
  Lactation_Stage     : 3 values -> 2 binary columns
  Feed_Type           : 8 values -> 7 binary columns
  Season              : 4 values -> 3 binary columns

After encoding:
  Train: (167940, 37)
  Test:  (41986, 37)
  Columns match: True

Categorical encoding complete!
   Total features: 37 (all numeric now)


## Feature Scaling

Our numeric features have vastly different scales:
- Age_Months: [24, 143]
- Weight_kg: [250, 750]  
- Parity: [1, 6]
- Water_Intake_L: [14, 150]

### Why Scale?

**Models that NEED scaling:**
- Neural Networks (MLPRegressor) - gradients unstable without scaling
- Linear models with regularization (Ridge, Lasso) - penalties affect large-scale features more
- Support Vector Regression - distance-based
- K-Nearest Neighbors - distance-based

**Models that DON'T need scaling:**
- Tree-based: Random Forest, XGBoost, LightGBM, CatBoost
- Make decisions on thresholds, not distances

### Our Decision: Scale Everything

**Why?**
- Keeps options open for ALL model types
- No downside (tree models unaffected)
- Helps convergence for neural networks
- Makes coefficients interpretable for linear models

### Method: StandardScaler (Z-score normalization)

Transforms each feature to have:
- Mean = 0
- Standard deviation = 1

Formula: `z = (x - mean) / std`
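
As a small worked example of the formula (toy values, not taken from the dataset), the manual computation can be compared with `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

weights = np.array([[250.0], [500.0], [750.0]])  # toy Weight_kg column

mean = weights.mean()
std = weights.std()  # population std (ddof=0), the convention StandardScaler uses
z_manual = (weights - mean) / std

z_sklearn = StandardScaler().fit_transform(weights)
print(z_manual.ravel())
```

Because `StandardScaler` divides by the population standard deviation (`ddof=0`) while `DataFrame.std()` defaults to the sample version (`ddof=1`), a pandas check on scaled data reports a value slightly above 1, by a factor of `sqrt(n / (n - 1))`.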

**Critical**: Fit scaler on training data ONLY, then transform both train and test using training statistics. This prevents data leakage!

In [22]:
# Apply StandardScaler to all features

from sklearn.preprocessing import StandardScaler

print("Scaling features with StandardScaler...")
print(f"Features to scale: {X_train_encoded.shape[1]}")

# Initialize scaler
scaler = StandardScaler()

# Fit on TRAINING data only
scaler.fit(X_train_encoded)

# Transform both train and test using TRAINING statistics
X_train_scaled = scaler.transform(X_train_encoded)
X_test_scaled = scaler.transform(X_test_encoded)

# Convert back to DataFrame for interpretability (optional but helpful)
X_train_final = pd.DataFrame(X_train_scaled, columns=X_train_encoded.columns, index=X_train_encoded.index)
X_test_final = pd.DataFrame(X_test_scaled, columns=X_test_encoded.columns, index=X_test_encoded.index)

print(f"\nScaling statistics from training data:")
print(f"  Example feature means: {scaler.mean_[:5]}")
print(f"  Example feature stds:  {scaler.scale_[:5]}")

print(f"\nAfter scaling:")
print(f"  Train shape: {X_train_final.shape}")
print(f"  Test shape:  {X_test_final.shape}")
print(f"\n  Train data now has:")
print(f"    Mean ≈ 0: {X_train_final.mean().mean():.6f}")
print(f"    Std ≈ 1:  {X_train_final.std().mean():.6f}")

print(f"\nFeature scaling complete!")

Scaling features with StandardScaler...
Features to scale: 37

Scaling statistics from training data:
  Example feature means: [ 83.450524   500.02723056   3.49977968 182.13083244  12.01271132]
  Example feature stds:  [ 34.60915767 144.65626669   1.70685556 105.1192756    3.86423676]

After scaling:
  Train shape: (167940, 37)
  Test shape:  (41986, 37)

  Train data now has:
    Mean ≈ 0: 0.000000
    Std ≈ 1:  1.000003

Feature scaling complete!


## Data Preprocessing Complete!

### Final Dataset Summary

Our preprocessing pipeline has successfully:
1. Removed 16 low-value features
2. Extracted Season from Date (explains 4.66% variance)
3. Fixed data quality issues (Breed typos, negative yields, missing values)
4. Target encoded Farm_ID (1 column, no unseen farms)
5. One-Hot encoded 6 categorical features (24 binary columns)
6. Scaled all numeric features (mean≈0, std≈1)

Let's verify everything is ready for modeling:

## Summary of Feature Selection

**Removed 16 features:**
- Feed_Quantity_lb - duplicate of Feed_Quantity_kg (99.99% correlation)
- Cattle_ID - unique identifier, no predictive value
- Rumination_Time_hrs - data quality issue (55% negative values)
- Low-correlation vaccines (5): HS_Vaccine, BQ_Vaccine, BVD_Vaccine, Brucellosis_Vaccine, FMD_Vaccine
- Zero/near-zero correlation (7): Resting_Hours, Housing_Score, Feeding_Frequency, Walking_Distance_km, Body_Condition_Score, Humidity_percent, Grazing_Duration_hrs
- Milking_Interval_hrs - very low correlation (0.015)

**Replaced Date with Season:**
- Removed: Date (raw timestamp)
- Added: Season (Winter/Spring/Summer/Fall)
- Rationale: Strong seasonal effect on milk yield (Spring: 16.59L vs Summer: 13.94L = 2.65L range)
- Month was NOT kept (redundant with Season - only 0.1L variation within seasons)

**Final: 19 features (down from 35 = 46% reduction)**

**Categorical (7):**
- Breed, Climate_Zone, Management_System, Lactation_Stage, Feed_Type, Farm_ID, Season

**Numeric (12):**
- Age_Months (corr: 0.31), Weight_kg (0.30), Parity (0.24), Days_in_Milk (0.06), Feed_Quantity_kg (0.22), Water_Intake_L (0.12), Ambient_Temperature_C (0.04), Anthrax_Vaccine (0.07), IBR_Vaccine (0.07), Rabies_Vaccine (0.07), Previous_Week_Avg_Yield (0.09), Mastitis (0.12)
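
A correlation screen of the kind summarized above could be sketched as follows. The data is synthetic, and the 0.05 and 0.999 cutoffs are illustrative assumptions, not the thresholds used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
feed_kg = rng.normal(12.0, 4.0, n)
toy = pd.DataFrame({
    "Feed_Quantity_kg": feed_kg,
    "Feed_Quantity_lb": feed_kg * 2.20462,  # unit-converted duplicate
    "Noise_Feature": rng.normal(size=n),     # unrelated to the target
})
toy["Milk_Yield_L"] = 0.3 * feed_kg + rng.normal(0.0, 1.0, n)

# Absolute Pearson correlation of every feature with the target
target_corr = toy.corr()["Milk_Yield_L"].drop("Milk_Yield_L").abs()
low_signal = target_corr[target_corr < 0.05].index.tolist()

# Feature pairs that are almost perfectly correlated with each other
feat_corr = toy.drop(columns="Milk_Yield_L").corr().abs()
duplicates = [(a, b) for i, a in enumerate(feat_corr.columns)
              for b in feat_corr.columns[i + 1:] if feat_corr.loc[a, b] > 0.999]

print(target_corr.round(3))
print("low signal:", low_signal)
print("duplicates:", duplicates)
```

Pearson correlation only measures linear association, so a low value does not by itself rule out a nonlinear relationship with the target.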

In [23]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer

# Update categorical columns from cleaned data
cat_cols = features.select_dtypes(include=["object", "string"]).columns.tolist()
print(f"Categorical columns: {cat_cols}")

Categorical columns: ['Breed', 'Climate_Zone', 'Management_System', 'Lactation_Stage', 'Feed_Type', 'Farm_ID', 'Season']


In [24]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

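class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """Replace each category with a statistic of its training-set frequency.

    mode: "freq" (relative frequency), "count" (raw count),
          "logfreq" (log1p of relative frequency) or
          "smooth" (count smoothed with pseudo-count m).
    """

    def __init__(self, cols=None, mode="freq", m=5):
        self.cols = cols
        self.mode = mode
        self.m = m

    def fit(self, X, y=None):
        # Encode every column when no explicit list was given
        self.cols_ = list(X.columns) if self.cols is None else list(self.cols)
        self.maps = {}
        total = len(X)

        for col in self.cols_:
            freq = X[col].value_counts()  # counts per category (NaN excluded)

            if self.mode == "freq":
                enc = freq / total
            elif self.mode == "count":
                enc = freq
            elif self.mode == "logfreq":
                enc = np.log1p(freq / total)
            elif self.mode == "smooth":
                # prior is the share of non-missing rows in this column
                prior = freq.sum() / total
                enc = (freq + self.m * prior) / (freq.sum() + self.m)
            else:
                raise ValueError("Unknown mode: " + self.mode)

            self.maps[col] = enc

        return self

    def transform(self, X):
        X = X.copy()
        for col in self.cols_:
            # Categories not seen during fit are encoded as 0
            X[col] = X[col].map(self.maps[col]).fillna(0)
        return X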
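
For a single column, the `freq` mode of the encoder above reduces to a `value_counts(normalize=True)` lookup. The toy sketch below (illustrative values) shows the mapping and how a category unseen in training ends up as 0.

```python
import pandas as pd

train_col = pd.Series(["Hay", "Hay", "Silage", "Hay"], name="Feed_Type")
test_col = pd.Series(["Silage", "Concentrates"], name="Feed_Type")

# Relative frequency of each category, learned from the training column
freq_map = train_col.value_counts(normalize=True)

# Categories never seen in training map to NaN first, then to 0
encoded = test_col.map(freq_map).fillna(0)
print(encoded.tolist())
```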

In [25]:
# Final verification of preprocessed data

print("="*80)
print("FINAL PREPROCESSING VERIFICATION")
print("="*80)

print(f"\nDataset shapes:")
print(f"  X_train_final: {X_train_final.shape}")
print(f"  X_test_final:  {X_test_final.shape}")
print(f"  y_train:       {y_train.shape}")
print(f"  y_test:        {y_test.shape}")

print(f"\nData quality checks:")
print(f"  Missing values (train): {X_train_final.isnull().sum().sum()}")
print(f"  Missing values (test):  {X_test_final.isnull().sum().sum()}")
print(f"  Infinite values (train): {np.isinf(X_train_final).sum().sum()}")
print(f"  Infinite values (test):  {np.isinf(X_test_final).sum().sum()}")

print(f"\nFeature statistics:")
print(f"  Total features: {X_train_final.shape[1]}")
print(f"  Feature names (first 10): {list(X_train_final.columns[:10])}")
print(f"  Feature names (last 5):   {list(X_train_final.columns[-5:])}")

print(f"\nTarget statistics:")
print(f"  y_train mean: {y_train.mean():.3f} L")
print(f"  y_train std:  {y_train.std():.3f} L")
print(f"  y_train range: [{y_train.min():.3f}, {y_train.max():.3f}] L")
print(f"  y_test mean:  {y_test.mean():.3f} L")
print(f"  y_test std:   {y_test.std():.3f} L")

print(f"\nScaling verification (should be ~0 mean, ~1 std):")
print(f"  X_train mean: {X_train_final.mean().mean():.6f}")
print(f"  X_train std:  {X_train_final.std().mean():.6f}")
print(f"  X_test mean:  {X_test_final.mean().mean():.6f}")
print(f"  X_test std:   {X_test_final.std().mean():.6f}")

print("\n" + "="*80)
print("PREPROCESSING COMPLETE - READY FOR MODELING!")
print("="*80)
print("\nNext steps:")
print("  1. Train baseline model (Ridge Regression)")
print("  2. Try advanced models (LightGBM, CatBoost, Random Forest)")
print("  3. Hyperparameter tuning")
print("  4. Final model selection and predictions")

FINAL PREPROCESSING VERIFICATION

Dataset shapes:
  X_train_final: (167940, 37)
  X_test_final:  (41986, 37)
  y_train:       (167940,)
  y_test:        (41986,)

Data quality checks:
  Missing values (train): 0
  Missing values (test):  0
  Infinite values (train): 0
  Infinite values (test):  0

Feature statistics:
  Total features: 37
  Feature names (first 10): ['Age_Months', 'Weight_kg', 'Parity', 'Days_in_Milk', 'Feed_Quantity_kg', 'Water_Intake_L', 'Ambient_Temperature_C', 'Anthrax_Vaccine', 'IBR_Vaccine', 'Rabies_Vaccine']
  Feature names (last 5):   ['Feed_Type_Pasture_Grass', 'Feed_Type_Silage', 'Season_Spring', 'Season_Summer', 'Season_Winter']

Target statistics:
  y_train mean: 15.593 L
  y_train std:  5.340 L
  y_train range: [0.055, 44.536] L
  y_test mean:  15.603 L
  y_test std:   5.358 L

Scaling verification (should be ~0 mean, ~1 std):
  X_train mean: 0.000000
  X_train std:  1.000003
  X_test mean:  -0.000018
  X_test std:   1.000560

PREPROCESSING COMPLETE - READY