# CS 363M Machine Learning Project

## Coders/Writers:
-   Carlos Olvera (cao2546)
-   Ariel Zolton
-   Patricio Hernandez

## Background Information

In this project, we predict the milk yield of dairy cows from characteristics such as age, weight, feed type, vaccination status, and weather conditions. The problem is interesting because accurate predictions could be used to optimize feeding and grazing strategies, improve resource planning and management, forecast financial outcomes, monitor animal health, and support data-driven decisions for more sustainable farm operations.

To do this, we use a dataset of 250,000 dairy cow records. It includes each animal's biological characteristics, nutritional factors, health status, and the environmental conditions during the recorded milking period.

The target is each cow's milk yield (in liters), predicted from attributes such as age, weight, feed intake, water consumption, and weather conditions. Accurate milk yield prediction can significantly affect farm productivity and animal welfare. The main challenge is to work with a large real-world dataset and build models that generalize to unseen data, capturing the complex relationships between these factors and milk production without overfitting.

## Preparations:

In [7]:
### Import packages

In [8]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as spstats
import seaborn as sns
import sklearn as sk

## Cleaning

In [9]:
# Load the training data
train_data = pd.read_csv('cattle_data_train.csv')

# Print shape to see dimensions
print("Shape: ", train_data.shape)

# Display first few rows
train_data.head()

Shape:  (210000, 36)


|   | Cattle_ID | Breed | Climate_Zone | Management_System | Age_Months | Weight_kg | Parity | Lactation_Stage | Days_in_Milk | Feed_Type | ... | BVD_Vaccine | Rabies_Vaccine | Previous_Week_Avg_Yield | Body_Condition_Score | Milking_Interval_hrs | Date | Farm_ID | Feed_Quantity_lb | Mastitis | Milk_Yield_L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CATTLE_133713 | Holstein | Tropical | Intensive | 114 | 544.8 | 4 | Mid | 62 | Concentrates | ... | 0 | 1 | 6.31 | 3.0 | 12 | 2024-01-15 | FARM_0301 | 36.8235 | 1 | 12.192634 |
| 1 | CATTLE_027003 | Holstein | Arid | Mixed | 136 | 298.9 | 4 | Mid | 213 | Crop_Residues | ... | 0 | 0 | 17.16 | 4.0 | 12 | 2023-10-31 | FARM_0219 |  | 0 | 14.717031 |
| 2 | CATTLE_122459 | Holstein | Tropical | Semi_Intensive | 64 | 336.6 | 4 | Late | 16 | Hay | ... | 1 | 0 | 4.07 | 3.5 | 12 | 2024-05-20 | FARM_0802 | 16.0965 | 0 | 14.006142 |
| 3 | CATTLE_213419 | Jersey | Mediterranean | Intensive | 58 | 370.5 | 1 | Early | 339 | Crop_Residues | ... | 0 | 0 | 10.23 | 3.0 | 24 | 2024-07-22 | FARM_0034 | 40.7925 | 0 | 24.324325 |
| 4 | CATTLE_106260 | Guernsey | Subtropical | Intensive | 84 | 641.5 | 6 | Early | 125 | Mixed_Feed | ... | 1 | 1 | 20.68 | 3.0 | 12 | 2023-01-03 | FARM_0695 | 33.7365 | 1 | 12.023074 |


One notable feature of this dataset is the diversity of variables it captures: biological factors (age, weight, parity), nutritional information (feed type, feed quantity, water intake), behavioral metrics (walking distance, grazing duration, rumination time), environmental conditions (temperature, humidity, climate zone), and health indicators (vaccination status, body condition score, mastitis presence). Some variables are categorical (such as breed, management system, and lactation stage), while others are continuous numerical measurements.
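The categorical/numerical split described above could be automated from the column dtypes. The following is a minimal sketch, not the project's actual pipeline: `split_feature_types` is a hypothetical helper, and `toy` is a small stand-in frame that reuses a few column names and values from the `head()` output rather than the real `train_data`.

```python
import pandas as pd


def split_feature_types(df, target="Milk_Yield_L"):
    """Return (numeric_cols, categorical_cols) for all columns except the target.

    Hypothetical helper: it relies only on pandas dtypes, so numeric-coded
    flags (e.g. Mastitis) are reported as numeric.
    """
    features = df.drop(columns=[target])
    numeric_cols = features.select_dtypes(include="number").columns.tolist()
    categorical_cols = features.select_dtypes(exclude="number").columns.tolist()
    return numeric_cols, categorical_cols


# Toy stand-in for train_data (assumption: same column names, far fewer rows).
toy = pd.DataFrame({
    "Breed": ["Holstein", "Jersey", "Guernsey"],
    "Age_Months": [114, 58, 84],
    "Weight_kg": [544.8, 370.5, 641.5],
    "Lactation_Stage": ["Mid", "Early", "Early"],
    "Milk_Yield_L": [12.192634, 24.324325, 12.023074],
})

numeric_cols, categorical_cols = split_feature_types(toy)
print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

On the full dataset, identifier and date columns such as `Cattle_ID`, `Farm_ID`, and `Date` would presumably also land in the non-numeric group, so they would need to be handled separately from true categorical features.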

Another important observation is that the dataset includes the Previous_Week_Avg_Yield feature, the average milk yield from the previous week. This temporal information could be highly predictive of current milk yield, as milk production tends to follow patterns over time. We'll need to handle both categorical and numerical features carefully, check for missing values and inconsistencies, and explore correlations between variables. The goal is a robust model that accurately forecasts milk yield for dairy cows across different farms and conditions.
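The missing-value and correlation checks mentioned above might look like the sketch below. Both helper functions are hypothetical, and `sample` only copies three columns from the five rows displayed by `head()` (including the one empty `Feed_Quantity_lb` entry), so any numbers it produces say nothing about the full 210,000-row training set.

```python
import numpy as np
import pandas as pd


def missing_value_summary(df):
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": 100 * counts / len(df),
    })
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


def target_correlations(df, target="Milk_Yield_L"):
    """Pearson correlation of each numeric feature with the target,
    ordered by absolute value (pandas ignores missing pairs)."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# The five rows shown by train_data.head(), restricted to three columns.
sample = pd.DataFrame({
    "Previous_Week_Avg_Yield": [6.31, 17.16, 4.07, 10.23, 20.68],
    "Feed_Quantity_lb": [36.8235, np.nan, 16.0965, 40.7925, 33.7365],
    "Milk_Yield_L": [12.192634, 14.717031, 14.006142, 24.324325, 12.023074],
})

missing = missing_value_summary(sample)
correlations = target_correlations(sample)
print(missing)
print(correlations)
```

Applied to `train_data` instead of `sample`, the same two calls would give a first indication of which columns need imputation and whether Previous_Week_Avg_Yield is as predictive as expected.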