# Herd Production Assessment

## Background

Dairy producers of all sizes are under constant economic pressure to produce more with less to meet the global demand for dairy products.  Farmers therefore need to carefully monitor the environments of their cattle to prevent injury, encourage production, and stop the spread of disease.  Careful examination of input factors such as genetics, nutrition, climate, facilities, and negative health events may provide actionable insights into ways to modify operations and improve the health, well-being, and production of their herds.

## Report Objective

This report is intended to provide insight and guidance aimed at improving the herd health and milk production volumes of a single herd in Franklin County, Pennsylvania.  The current scope of analysis can be grouped into the following areas: facilities and weather, herd composition, and nutrition.  This report focuses on milk production data for calendar years 2016 and 2017.

### Findings: Milk Production History

The herd produced an estimated 522,244 gallons of milk (4,493,019 milk-pounds) between January 1, 2016, and December 31, 2017.  The total milk-pounds produced per month ranged from 313,223 milk-pounds (36,421 gallons) to 395,524 milk-pounds (45,991 gallons).  The chart below summarizes total herd production by month for the two-year period.

<div class='col-md-12'><img style="height:auto" src="figures/herd-total-milk-by-month.png"></div>

For calendar years 2016 and 2017, the total number of animals milked per month ranged from 141 to 155.  The plot below shows the total number of cows milked per month.  The number of animals milked per month has an upper constraint dictated by the size of the milking and housing facilities.

<div class='col-md-12'><img style="height:auto" src="figures/count-cows-milked-per-month.png"></div>

### Findings: Performance by Number of Lactations

Older cattle, having gone through at least three lactations, outperformed younger cattle by 17.5% over the first 305 days after calving, with an average of 24,435 milk-pounds (2,841 gallons) compared to 20,789 milk-pounds (2,417 gallons).  The performance gap is most pronounced in the 20-100 day post-calving range and gradually narrows, closing at approximately 400 days after calving.  The variance in milk weights beyond 400 days is believed to be the result of incomplete calving data.  The visualization below shows the average milk weight produced after calving for aged cows versus cows that are less than 36 months old.

<div class='col-md-12'>
    <div class='col-md-12'><img style="height:auto" src="figures/milk-production-after-calving.png"></div>
</div>
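The 17.5% figure above is a relative difference between the two 305-day averages. The following is a minimal sketch of that arithmetic, not the analysis code itself; the function names and the assumption that days are counted 1 through 305 inclusive are illustrative.

```python
def cumulative_305_day(daily_weights):
    """Sum milk-pounds for days 1..305 after calving.

    daily_weights maps days-since-calving to milk-pounds for that day.
    The inclusive 1..305 window is an assumption made for this sketch.
    """
    return sum(w for day, w in daily_weights.items() if 1 <= day <= 305)


def relative_gap(older_avg, younger_avg):
    """Return how much older_avg exceeds younger_avg, as a fraction of younger_avg."""
    return (older_avg - younger_avg) / younger_avg


# Averages reported in this section (milk-pounds over the first 305 days).
print(f"{relative_gap(24_435, 20_789):.1%}")  # prints 17.5%
```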

### Findings: Calvings per Month

Calvings occurred regularly throughout 2016 and 2017.  The monthly average number of calves born was 16.75, with 203 calves born in 2016 and 199 calves born in 2017.  The plot below indicates that a steady stream of calvings allowed cows to have a regular 2-3 month dry period after each lactation while permitting the herd overall to keep a relatively consistent output for year-round income.

<div class='col-md-12'>
    <div class='col-md-12'><img style="height:auto" src="figures/calvings-per-month.png"></div>
</div>
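The monthly counts and the 16.75 average can be reproduced from a list of calving dates. This is a small sketch under the assumption that calving dates are already available as `datetime.date` values; the function names are illustrative and are not taken from the project scripts.

```python
from collections import Counter
from datetime import date


def calvings_per_month(calving_dates):
    """Count calvings for each (year, month) that appears in calving_dates."""
    return Counter((d.year, d.month) for d in calving_dates)


def monthly_average(total_calvings, months):
    """Average number of calvings per month over the period."""
    return total_calvings / months


# Totals reported above: 203 calves in 2016 and 199 in 2017, over 24 months.
print(monthly_average(203 + 199, 24))  # prints 16.75
```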

### Findings: Top Producers 2016-2017

For calendar years 2016 and 2017, the top 10% of animals by total milk volume are shown below.  Animals were only considered for this analysis if they had produced milk for 400 or more days.  Of the cows that milked more than 400 days in that two-year period, the top 25% produced at least 44,050 pounds of milk and the top 50% produced at least 39,343 pounds of milk.  The plot below highlights the top 10 producing animals by total milk volume in that two-year period.  As high producers, these animals should continue to be part of the herd, and animals with similar profiles should be added to the herd to support higher milk volumes and the associated financial success.

<div class='col-md-12'><img style="height:auto" src="figures/top-producers.png"></div>
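The ranking and the "top X% produced at least Y pounds" thresholds described above can be sketched as follows. This is an illustration only: the dictionary keys (`id`, `days_milked`, `total_pounds`) and the choice to round the group size up are assumptions, not the method used to produce the figures in this report.

```python
import math


def eligible_totals(animals, min_days=400):
    """Keep (animal_id, total_pounds) for animals milked at least min_days."""
    return [(a["id"], a["total_pounds"]) for a in animals if a["days_milked"] >= min_days]


def top_fraction(totals, fraction):
    """Return the highest-producing fraction of animals, best first."""
    ranked = sorted(totals, key=lambda pair: pair[1], reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))  # round the group size up
    return ranked[:keep]


def threshold(totals, fraction):
    """Smallest total within the top fraction: 'the top X% produced at least this'."""
    return top_fraction(totals, fraction)[-1][1]
```

Sorting in ascending order instead gives the corresponding bottom group of animals.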

### Findings: Under Performers, 2016-2017

For calendar years 2016 and 2017, the bottom 10% of animals by total milk volume are pictured below.  These animals were milked for more than 400 days in that two-year period and produced at most 30,398 milk-pounds.  These animals should be evaluated for medical conditions and prioritized for replacement with cows that are capable of producing higher milk weights.

<div class='col-md-12'><img style="height:auto" src="figures/bottom-producers.png"></div>

### Findings: Facilities and Climate

In the years 2015, 2016, and 2017, Franklin County experienced a total of 6 days where the average across 3 weather stations recorded a maximum temperature greater than 90 degrees, and 33 days where a low temperature of less than 10 degrees occurred.  Milk production during and immediately after the extreme-temperature days did not show a statistically significant change.  As a result, current data suggest that existing facilities and practices have been sufficiently effective for heat and cold abatement.  The current recommendation is to maintain existing ventilation, cooling, insulation, and heating strategies.  Additional capital investment to improve these facilities and practices beyond regular maintenance may not lead to improved milk volumes.
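Counting extreme-temperature days as described above amounts to averaging station readings per date and applying the two thresholds. The sketch below assumes temperatures in degrees Fahrenheit and assumes the station average is used for both the high and the low count; the report only states this explicitly for the maximum temperature. The tuple layout and function names are illustrative.

```python
from collections import defaultdict
from statistics import mean


def daily_station_average(readings):
    """Average TMAX and TMIN across stations for each date.

    readings: iterable of (date, station, tmax, tmin) tuples.
    Returns {date: (average_tmax, average_tmin)}.
    """
    by_date = defaultdict(list)
    for day, _station, tmax, tmin in readings:
        by_date[day].append((tmax, tmin))
    return {
        day: (mean(t[0] for t in temps), mean(t[1] for t in temps))
        for day, temps in by_date.items()
    }


def count_extreme_days(averages, hot_above=90, cold_below=10):
    """Return (hot_days, cold_days) using the thresholds from this section."""
    hot = sum(1 for tmax, _ in averages.values() if tmax > hot_above)
    cold = sum(1 for _, tmin in averages.values() if tmin < cold_below)
    return hot, cold
```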

### Findings: Cluster Analysis

Identification of low and high performing animals is essential to making well-informed decisions regarding the retention and removal of animals from the milking herd.  The following table shows the features used in the analysis.

<div class='col-md-12'><img style="height:auto" src="figures/clustering-data-example.png"></div>

After reducing the dimensions of the dataset through PCA and applying k-means analysis, two clusters were identified, as pictured below.  The lowest performing animals are concentrated in a single cluster.  The animals in the low performing cluster should be prioritized for replacement and removal from the herd.  However, given the relatively weak clustering, further metrics and analysis should be considered before removing an animal from the herd.

<div class='col-md-12'>
    <div class='col-md-6'><img style="height:auto" src="figures/silhouette-score.png"></div>
    <div class='col-md-6'><img style="height:350px" src="figures/top-and-bottom-producers-clusters.png"></div>
    
</div>
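For readers unfamiliar with k-means, the following is a minimal hand-rolled sketch of the two-cluster case. It is not the code used for this analysis: it assumes the points have already been reduced to two dimensions (the PCA step is not reproduced), it does not compute a silhouette score, and its deterministic initialization (the first point plus the point farthest from it) is a simplification chosen for this example.

```python
import math


def kmeans_two_clusters(points, iterations=20):
    """Plain Lloyd's k-means with k=2 on a list of (x, y) tuples.

    Returns (labels, centroids), where labels[i] is 0 or 1 for points[i].
    """
    first = points[0]
    farthest = max(points, key=lambda p: math.dist(p, first))
    centroids = [first, farthest]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min((0, 1), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in (0, 1):
            members = [p for p, label in zip(points, labels) if label == c]
            if not members:  # leave the centroid of an empty cluster in place
                new_centroids.append(centroids[c])
                continue
            new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return labels, centroids
```

With well-separated groups the labels settle after a few iterations; with weakly separated data, as noted above, cluster membership alone is not a sufficient basis for a removal decision.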

### Findings: Predicting Daily Milk Weights



The ability to predict future milk production translates into a single tangible business objective: modeling future cash flows.  Predicting the total dollar value of milk produced during a single cow's lactation cycle requires some additional components:

- Expected future calving date
- Planned length of dry periods
- Projected sale price for milk produced

These elements, coupled with the total predicted milk volumes over a given lactation period, can produce an estimate of income from a given animal over the projected period.  When applied across the herd, this gives a sense of predictability to overall income.  A prerequisite to building this estimate is a reliable model of the milk volume produced by an animal on a given day.  The following steps were taken to produce such a model.
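Before turning to the model itself, the income estimate described above can be sketched as a simple product of predicted volume and price. This is an illustration under stated assumptions: milk is priced per hundredweight (100 pounds), days in the dry period are simply absent from the input, and the function names and example price are hypothetical.

```python
def projected_income(predicted_daily_pounds, price_per_cwt):
    """Estimate income for one animal from predicted daily milk weights.

    predicted_daily_pounds: predicted milk-pounds for each day in milk
    price_per_cwt: projected sale price in dollars per hundredweight (100 lb)
    """
    total_pounds = sum(predicted_daily_pounds)
    return total_pounds / 100 * price_per_cwt


def projected_herd_income(predictions_by_animal, price_per_cwt):
    """Sum the per-animal estimates across the herd."""
    return sum(
        projected_income(daily, price_per_cwt)
        for daily in predictions_by_animal.values()
    )
```

For example, an animal predicted to produce 70 pounds per day for 305 days at a hypothetical price of $17.00 per hundredweight would be estimated at 21,350 / 100 × 17 = $3,629.50.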

A series of single-day production profiles was assembled, including data such as milk weight, the age of the animal, days after calving, prior year production, classification scores, and genetic metrics.  The following image is an example of this data.

<div class='col-md-12'>
    <div class='col-md-12'><img style="height:auto" src="figures/regression-data-example.png"></div>
</div>

Given the aggregated data, a series of regression algorithms were applied.  Of those tested, a random forest regressor produced the best results.  However, the model had limited success, explaining only approximately 28% of the variability in milk production.  The following plots visualize the results.

<div class='col-md-12'>
    <div class='col-md-6'><img style="height:auto" src="figures/regression-predicted-versus-actual.png"></div>
    <div class='col-md-6'><img style="height:auto" src="figures/regression-residual-errors.png"></div>
</div>
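"Explaining approximately 28% of the variability" is conventionally read as a coefficient of determination (R²) of about 0.28; that reading is an assumption here, since the report does not name the metric. A minimal sketch of how R² is computed from actual and predicted milk weights:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    # Total variability of the observed values around their mean.
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    # Variability left unexplained by the predictions.
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total
```

A model that always predicts the mean scores 0, and a perfect model scores 1, so a value near 0.28 leaves most day-to-day variation unexplained.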

Given the relatively low level of accuracy, additional experimentation should be conducted to improve the reliability of the model before leveraging it for business decisions.  Considerations for future iterations include:

- Prediction of milk-weight per week, or per month
- Prediction of milk-weight at herd level, using the 'average' animal
- Prediction of total milk-weight produced during entire lactation
- Collection of additional data and development of features such as
    - Number of lactations
    - Changes in Classification Score
    - Animal Weight
    - Milk Production Performance in 1st, 2nd, and 3rd lactations 
    
As additional years of data become available, these data-sets should be included in the current analysis and experimented upon to improve the predictive accuracy of the model.

---

## Appendix
### Appendix: Selected Terms

The following terms should provide additional context for those unfamiliar with the dairy industry.

#### Milk Weight (milk-pounds)

The amount of milk produced by an animal. Measured in pounds of milk. For reference, a gallon of milk weighs approximately 8.6 pounds.
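The gallon figures quoted throughout this report follow from that approximate conversion factor. A small sketch of the conversion (the constant and function names are illustrative):

```python
POUNDS_PER_GALLON = 8.6  # approximate weight of one gallon of milk, as stated above


def milk_pounds_to_gallons(milk_pounds):
    """Convert a milk weight in pounds to an approximate volume in gallons."""
    return milk_pounds / POUNDS_PER_GALLON


def gallons_to_milk_pounds(gallons):
    """Convert a volume in gallons to an approximate milk weight in pounds."""
    return gallons * POUNDS_PER_GALLON


# Lowest monthly production reported in the findings.
print(round(milk_pounds_to_gallons(313_223)))  # prints 36421
```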

#### Dry Period

The period when a cow is not producing milk. Often serves as a time of rest following a lactation period.

#### Lactation Period

The period when a cow is producing milk.

#### Days Since Calving

The number of days that have passed since a cow has given birth.

#### Linear Classification Score

An integer score between 50 and 99 given to a milk cow, providing a numerical representation of how well the physical attributes of an animal fit the profile of an 'ideal' milking cow. It is a weighted summary of 18+ assessments of a given animal.

### Appendix: Data Pipeline - Milk Weight

#### Description

Daily milk-pound production data was derived from the on-site storage of the DeLaval ALPRO™ herd management system, from files such as [this](../references/example_files/milk_volume_example.txt).  Daily log files were collected for a date range spanning from July 2015 to December 2017.  Approximately 15 files were corrupted, and no log data was retained for those files.  The results of each milking session are captured in daily system logs, stored as a series of text files in local storage.  The following lines provide an example of the relevant data elements:

``` txt
04:52:14    R    1831    Cow    Duration1    6:25
04:52:14    R    1831    Cow    AverFlow1    3.6
04:52:14    R    1831    Cow    PeakFlow1    4.8
04:52:14    R    1831    Cow    MilkToday1    23.2
```

The lines above suggest that Cow #1831 produced 23.2 pounds of milk in six minutes and twenty-five seconds, with an average flow rate of 3.6 lb/min and a peak flow of 4.8 lb/min.  This milking occurred at 04:52:14 am.
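The interpretation above can be sketched in code. This is not the contents of `parse_milk_volume.py`; it is a minimal illustration that assumes every relevant line has six whitespace-separated fields in the order shown in the example, and it only handles the four field names that appear there. The function names are hypothetical.

```python
SAMPLE_LINES = [
    "04:52:14    R    1831    Cow    Duration1    6:25",
    "04:52:14    R    1831    Cow    AverFlow1    3.6",
    "04:52:14    R    1831    Cow    PeakFlow1    4.8",
    "04:52:14    R    1831    Cow    MilkToday1    23.2",
]

NUMERIC_PREFIXES = ("AverFlow", "PeakFlow", "MilkToday")


def duration_to_seconds(text):
    """Convert a 'minutes:seconds' duration such as '6:25' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def summarize_milkings(lines):
    """Group log fields into one dict per (time, cow_id) milking."""
    sessions = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 6 or parts[3] != "Cow":
            continue  # skip lines that do not match the assumed layout
        time_of_day, _record_type, cow_id, _entity, field, value = parts
        record = sessions.setdefault((time_of_day, int(cow_id)), {})
        if field.startswith("Duration"):
            record[field] = duration_to_seconds(value)
        elif field.startswith(NUMERIC_PREFIXES):
            record[field] = float(value)
    return sessions
```

Applied to `SAMPLE_LINES`, this yields a single milking for cow 1831 with a duration of 385 seconds and a `MilkToday1` value of 23.2.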

#### Raw Data Acquisition

The system logs were manually retrieved from the herd management system and uploaded into secured private storage utilizing Amazon Web Services (AWS) for on-demand, repeatable retrieval, processing, and backup.

#### Data Wrangling

Prior to analysis, the contents of each log file are [downloaded via script](../scripts/get_data.py) from AWS and brought into local storage. Each file is [processed individually](../scripts/parse_milk_volume.py) and put into [local storage](../scripts/load_database.py) for future analysis.
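The report does not describe the local storage format. As one possible shape for the loading step, the sketch below writes parsed milk weights into a SQLite table; the table name, columns, and the choice of SQLite are all assumptions and may differ from what `load_database.py` actually does.

```python
import sqlite3


def load_milk_weights(connection, rows):
    """Insert (milking_date, milking_time, cow_id, milk_pounds) rows.

    Re-loading the same milking replaces the earlier row, so the load
    can be repeated safely after a file is re-parsed.
    """
    connection.execute(
        "CREATE TABLE IF NOT EXISTS milk_weights ("
        "milking_date TEXT NOT NULL, "
        "milking_time TEXT NOT NULL, "
        "cow_id INTEGER NOT NULL, "
        "milk_pounds REAL NOT NULL, "
        "PRIMARY KEY (milking_date, milking_time, cow_id))"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO milk_weights VALUES (?, ?, ?, ?)", rows
    )
    connection.commit()
```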

#### Future Improvements

This process can be improved through an automated retrieval, ingestion, and cleansing of daily milk production data.  This process would be enabled by the connection of the herd management system to an active network connection and the creation of automated scripts to conduct daily uploads of production data.

### Appendix: Data Pipeline - Linear Classification Score

#### Description

Linear Classification Scores provide a periodic assessment of the physical attributes of a given animal.  Animals are classified on a scale from 50 to 99 based on a set of measured characteristics for comparison against the 'ideal' milking cow.  These [Linear Classification Reports](http://www.holsteinusa.com/programs_services/classification.html) were conducted by a representative of [Holstein Association USA](http://www.holsteinusa.com/) between August 8, 2014, and July 10, 2017.

``` txt
8/5/14,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
BARN_ID,AGE,LAC,DATE_CALVED,ST,SR,BD,DF,RA,RW,LS,RL,LO,FA,FU,UH,UW,UC,UD,TP,RT,TL,UT,CS,FC,DS,RP,FL,MS,FS,E,%BAA
1485,7-Jul,6,7/10/14,50,45,44,42,25,35,29,25,25,25,14,26,35,50,5,35,35,35,35,25,92,92,92,82,80,85,,106
1542,9-Jun,4,8/6/13,50,35,35,42,15,35,50,25,,25,35,36,35,35,40,25,25,25,26,17,93,93,90,84,93,91,2,113.5
```

The example above shows the scoring for cows number 1485 and 1542.  The animals received a final linear score of 85 and 91 respectively.  The assessment occurred on August 5, 2014.
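Reading this layout takes a little care because the first row holds only the assessment date and the second row holds the column names. The following sketch parses the example with the standard `csv` module; the function name and the added `ASSESSED_ON` key are illustrative, and the date format is assumed to be month/day/two-digit year as in the example.

```python
import csv
import io
from datetime import datetime

SAMPLE = """\
8/5/14,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
BARN_ID,AGE,LAC,DATE_CALVED,ST,SR,BD,DF,RA,RW,LS,RL,LO,FA,FU,UH,UW,UC,UD,TP,RT,TL,UT,CS,FC,DS,RP,FL,MS,FS,E,%BAA
1485,7-Jul,6,7/10/14,50,45,44,42,25,35,29,25,25,25,14,26,35,50,5,35,35,35,35,25,92,92,92,82,80,85,,106
1542,9-Jun,4,8/6/13,50,35,35,42,15,35,50,25,,25,35,36,35,35,40,25,25,25,26,17,93,93,90,84,93,91,2,113.5
"""


def parse_classification(text):
    """Parse one export: a date row, a header row, then one row per animal."""
    rows = list(csv.reader(io.StringIO(text)))
    assessed_on = datetime.strptime(rows[0][0], "%m/%d/%y").date()
    header = rows[1]
    records = []
    for row in rows[2:]:
        if not row or not row[0]:
            continue  # skip blank lines
        record = dict(zip(header, row))  # values are kept as strings
        record["ASSESSED_ON"] = assessed_on
        records.append(record)
    return records
```

For the sample above, the `FS` column yields `"85"` for cow 1485 and `"91"` for cow 1542, matching the final scores described in the text.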

#### Raw Data Acquisition

The classification reports were retrieved in the form of paper reports.  The contents of the reports were scanned to PDF and parsed into [CSV files](../references/example_files/classification_example.csv) using the program [PDF Element by Wondershare](https://pdf.wondershare.com/).  The resulting CSVs were uploaded to a private AWS S3 bucket for on-demand, repeatable retrieval via [script](../scripts/get_data.py).

#### Future Improvements

This process can be improved through integration with the Holstein USA online systems.  A software integration was explored early in the process but was abandoned due to a cost-prohibitive pay-per-drink pricing model, billed per animal per classification.  In the event of further automation, an alternative data acquisition process would be required to prevent the analysis from becoming too costly.

### Appendix: Data Pipeline - Genetic Evaluations

#### Description

[Holstein Association USA](http://www.holsteinusa.com/) conducts additional analysis on individual animal genetics based on available pedigree data, genomic sequencing, and actual production information from the animal and its genetic siblings where available.  CTPI and Milk are two values taken from this report: [CTPI](http://www.holsteinusa.com/genetic_evaluations/ss_tpi_formula.html) is an aggregated indicator of milking performance, and Milk is an indicator focused solely on the likelihood of higher volumes of milk production.  In both cases, higher values are more favorable.

``` csv
ANIMAL_ID,NAME,FS,PRO,%P,Fat,%F,Rel,Milk,SCS,PL,DPR,TYPE,REL,UDC,FLC,CTPI
1999 ,"     BELSHWAY PLANET 1999
     USA 71404944100-NA12/12/2012",86 ,49,-0.02,40,-0.09,50 ,1772,2.94 ,3.4,0.2,1.34,53 ,0.55,-0.29 ,2198
2043 ,"     BELSHWAY MASSEY 2043
     USA 72758233100-NA 06/26/2013",79 ,36,0.01,38,-0.01,47 ,1132,2.76 ,3.6,-0.2,0.94,53 ,0.66,0.94 ,2150
```

The example above indicates that the animal with the ID of 1999 had a Milk indicator of 1772 and a CTPI of 2198.  Cow #2043 had a Milk indicator of 1132.
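Note that the `NAME` field in this export is quoted and spans two lines, and several values carry trailing spaces. A minimal sketch of reading the two indicators with the standard `csv` module, which handles quoted multi-line fields; the function name and the returned structure are illustrative.

```python
import csv
import io

SAMPLE = '''\
ANIMAL_ID,NAME,FS,PRO,%P,Fat,%F,Rel,Milk,SCS,PL,DPR,TYPE,REL,UDC,FLC,CTPI
1999 ,"     BELSHWAY PLANET 1999
     USA 71404944100-NA12/12/2012",86 ,49,-0.02,40,-0.09,50 ,1772,2.94 ,3.4,0.2,1.34,53 ,0.55,-0.29 ,2198
2043 ,"     BELSHWAY MASSEY 2043
     USA 72758233100-NA 06/26/2013",79 ,36,0.01,38,-0.01,47 ,1132,2.76 ,3.6,-0.2,0.94,53 ,0.66,0.94 ,2150
'''


def parse_genetics(text):
    """Return {animal_id: {"Milk": int, "CTPI": int}} from a genetic evaluation export."""
    result = {}
    for row in csv.DictReader(io.StringIO(text)):
        animal_id = int(row["ANIMAL_ID"].strip())  # IDs carry trailing spaces
        result[animal_id] = {
            "Milk": int(row["Milk"].strip()),
            "CTPI": int(row["CTPI"].strip()),
        }
    return result
```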

#### Raw Data Acquisition

The genetic evaluation reports were retrieved in the form of paper reports.  The contents of the reports were scanned to PDF and parsed into CSV files using the program [PDF Element by Wondershare](https://pdf.wondershare.com/).  The resulting CSVs were uploaded to a private AWS S3 bucket for on-demand, repeatable retrieval via [script](../scripts/get_data.py).

#### Future Improvements

This portion of the data pipeline can be improved through integration with the Holstein USA online systems.  A software integration was explored early in the process but was abandoned due to a cost-prohibitive pay-per-drink pricing model.  The expected format of the [CTPI](http://www.holsteinusa.com/genetic_evaluations/Topctpi.html) report is available online.  In the event of further automation, an alternative data acquisition process would be required to prevent this analysis from becoming too costly.

### Appendix: Data Pipeline - Weather Data

#### Description

The weather data set consists of daily summaries of weather measurements for Franklin County, Pennsylvania such as low temperature, high temperature, and total precipitation.  The following provides an example of the CSV file format.

``` csv
STATION,NAME,LATITUDE,LONGITUDE,ELEVATION,DATE,PRCP,PRCP_ATTRIBUTES,SNOW,SNOW_ATTRIBUTES,SNWD,SNWD_ATTRIBUTES,TMAX,TMAX_ATTRIBUTES,TMIN,TMIN_ATTRIBUTES,TOBS,TOBS_ATTRIBUTES,WESD,WESD_ATTRIBUTES,WESF,WESF_ATTRIBUTES,WT01,WT01_ATTRIBUTES,WT03,WT03_ATTRIBUTES,WT04,WT04_ATTRIBUTES,WT06,WT06_ATTRIBUTES,WT11,WT11_ATTRIBUTES
USC00361354,"CHAMBERSBURG, PA US",39.9353,-77.6394,195.1,2016-01-01,0,",,7,2100",0,",,7",0,",,7",38,",,7",34,",,7",34,",,7,2100",,,,,,,,,,,,,,
USC00361354,"CHAMBERSBURG, PA US",39.9353,-77.6394,195.1,2016-01-02,0,",,7,2100",0,",,7",0,",,7",42,",,7",28,",,7",29,",,7,2100",,,,,,,,,,,,,,

```

#### Raw Data Acquisition

The CSV files were requested from the [NOAA Climate Data Online Search](https://www.ncdc.noaa.gov/cdo-web/search) for the full calendar years 2014, 2015, and 2016, and then again for all available data in 2017.  The resulting CSV files were uploaded to AWS S3 to be programmatically retrieved by the script [get_data.py](../scripts/get_data.py).  The raw files are processed by the script [parse_weather.py](../scripts/parse_weather.py) to produce daily weather summaries.
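The processing in `parse_weather.py` is not reproduced in this report. As a rough sketch of what producing a daily summary could look like, the following reads precipitation and temperatures per station and date with the standard `csv` module. The sample is trimmed to the columns used here, and the function name and output structure are assumptions.

```python
import csv
import io

# Trimmed to the columns used below; the real export has many more.
SAMPLE = """\
STATION,NAME,DATE,PRCP,TMAX,TMIN
USC00361354,"CHAMBERSBURG, PA US",2016-01-01,0,38,34
USC00361354,"CHAMBERSBURG, PA US",2016-01-02,0,42,28
"""


def to_number(value):
    """Convert a CSV cell to float, treating an empty cell as missing."""
    return float(value) if value else None


def parse_weather(text):
    """Return {(station, date): {"PRCP", "TMAX", "TMIN"}} from a daily-summaries CSV."""
    summaries = {}
    for row in csv.DictReader(io.StringIO(text)):
        summaries[(row["STATION"], row["DATE"])] = {
            "PRCP": to_number(row["PRCP"]),
            "TMAX": to_number(row["TMAX"]),
            "TMIN": to_number(row["TMIN"]),
        }
    return summaries
```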

#### Future Improvements

Automated scripts with the [NOAA weather API](https://www.ncdc.noaa.gov/cdo-web/webservices/v2) would reduce the workload required to collect this data.

### Appendix: Data Pipeline - Calving Records

#### Description

The calving data set consists of daily records of animals born, their identification number, and their mother's identification number.  These records were transcribed to CSV format to collect the birthdates of each animal in the herd, as well as the calving dates for animals between January 1st, 2015 and December 31st, 2017.

#### Raw Data Acquisition

The CSV files were stored on AWS S3 to be programmatically retrieved by the script [get_data.py](../scripts/get_data.py).  The raw files are processed by the scripts [parse_birthdates.py](../scripts/parse_birthdates.py) and [parse_calvings.py](../scripts/parse_calvings.py).

#### Future Improvements

This data is also collected by Holstein USA.  Scripted collection of this information should be investigated for future iterations.