<style>
*
{
	text-align: justify;
	line-height: 1.5;
	font-family: "Arial", sans-serif;
	font-size: 12px;
}

h2, h3, h4, h5, h6
{
	font-family: "Arial", sans-serif;
	font-size: 12px;
	font-weight: bold;
}
h2
{
	font-size: 14px;
}
h1
{
	font-family: "Wingdings", sans-serif;
	font-size: 16px;
}
</style>

## EDA of Irish Bovine Tuberculosis

<!--
import data_analytics.github as github
print(github.create_jupyter_notebook_header("tahirawwad", "agriculture-data-analytics", "notebooks/notebook-2-02-eda-irish-bovine-tuberculosis.ipynb", "master"))
-->
<table style="margin: auto;"><tr><td><a href="https://mybinder.org/v2/gh/tahirawwad/agriculture-data-analytics/master?filepath=notebooks/notebook-2-02-eda-irish-bovine-tuberculosis.ipynb" target="_parent"><img src="https://mybinder.org/badge_logo.svg" alt="Open In Binder"/></a></td><td>online editors</td><td><a href="https://colab.research.google.com/github/tahirawwad/agriculture-data-analytics/blob/master/notebooks/notebook-2-02-eda-irish-bovine-tuberculosis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a></td></tr></table>

### Objective

The objective is to provide an Exploratory Data Analysis (EDA) of the `cso-daa01-bovine-tuberculosis-2022-01-Jan-15.csv` file provided by the <a href="https://data.cso.ie/table/DAA01" target="_new">CSO: DAA01 Table</a>. The EDA is performed to investigate the data, spot anomalies, and clean the data.

### Setup

Import the required third-party Python libraries and supporting functions, and set up the data source file paths.

In [1]:
# Local
#!pip install -r script/requirements.txt
# Remote option
#!pip install -r https://raw.githubusercontent.com/tahirawwad/agriculture-data-analytics/requirements.txt
#Options: --quiet --user

In [2]:
from agriculture_data_analytics.project_manager import *
from agriculture_data_analytics.dataframe_labels import *
from pandas import DataFrame
import data_analytics.exploratory_data_analysis_reports as eda_reports
import data_analytics.github as github
import os
import pandas

In [3]:
artifact_manager: ProjectArtifactManager = ProjectArtifactManager()
asset_manager: ProjectAssetManager = ProjectAssetManager()
artifact_manager.is_remote = asset_manager.is_remote = True
github.display_jupyter_notebook_data_sources(
    [asset_manager.get_bovine_tuberculosis_filepath()])
artifact_manager.is_remote = asset_manager.is_remote = False

https://github.com/markcrowe-com/agriculture-data-analytics/assets/cso-daa01-bovine-tuberculosis-2022-01-Jan-15.csv?raw=true


### Loading the CSV file

#### Create Data Frames

In [4]:
filepath: str = "./../assets/cso-daa01-bovine-tuberculosis.csv"
bovine_tuberculosis_dataframe: DataFrame = pandas.read_csv(filepath)

#### Renaming Columns

In [5]:
old_to_new_column_names_dictionary = {
    "Regional Veterinary Offices": VETERINARY_OFFICE,
    "Herds in County": HERDS_COUNT,
    "Animals in County": ANIMAL_COUNT,
    "Herds Restricted since 1st of January": RESTRICTED_HERDS_AT_START_OF_YEAR,
    "Herds Restricted by 31st of December": RESTRICTED_HERDS_AT_END_OF_YEAR,
    "Herds Tested": HERDS_TESTED,
    "Herd Incidence": HERD_INCIDENCE_RATE,
    "Tests on Animals": TESTS_ON_ANIMALS,
    "Reactors per 1000 Tests A.P.T.": REACTORS_PER_1000_TESTS_APT,
    "Reactors to date": REACTORS_TO_DATE,
    UNIT.upper(): UNIT,
    VALUE.upper(): VALUE,
}
bovine_tuberculosis_dataframe = bovine_tuberculosis_dataframe.rename(
    columns=old_to_new_column_names_dictionary)
bovine_tuberculosis_dataframe.head(0)

Unnamed: 0,Statistic,Year,Veterinary Office,Unit,Value
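The renaming step above can be illustrated on a toy frame. This is a minimal sketch, assuming the raw headers are `Regional Veterinary Offices`, `UNIT` and `VALUE` (inferred from the dictionary keys and the output above); the row values are made up. It also shows that `DataFrame.rename` ignores dictionary keys that match no column, which is why the statistic names in the dictionary have no effect at this stage.

```python
import pandas as pd

# Hypothetical toy frame with assumed raw headers; the values are made up.
raw_frame = pd.DataFrame({
    "Statistic": ["Herds Tested"],
    "Year": [2010],
    "Regional Veterinary Offices": ["Carlow"],
    "UNIT": ["Number"],
    "VALUE": [1.0],
})

# "Herds in County" is a value in the Statistic column, not a column name,
# so rename() silently skips it here (errors="ignore" is the default).
name_map = {
    "Regional Veterinary Offices": "Veterinary Office",
    "Herds in County": "Herds Count",
    "UNIT": "Unit",
    "VALUE": "Value",
}
renamed_frame = raw_frame.rename(columns=name_map)
print(renamed_frame.columns.tolist())
```

The same dictionary becomes useful a second time once the statistic names are turned into column headers.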


#### Data Type Analysis Quick View

Print an analysis report of the data frame.  
- Show a sample of five rows of the data frame as a quick look at the data.
- Show the data types of each column.
- Report the count of any duplicate rows.
- Report the counts of any missing values.

In [6]:
filename: str = os.path.basename(filepath)
eda_reports.print_dataframe_analysis_report(bovine_tuberculosis_dataframe, filename)

Unnamed: 0,Statistic,Year,Veterinary Office,Unit,Value
1504,Herds Restricted by 31st of December,2016,Cork North,Number,130.0
2874,Reactors per 1000 Tests A.P.T.,2017,Tipperary South,Number,0.91
941,Herds Restricted since 1st of January,2019,Kilkenny,Number,131.0
2075,Tests on Animals,2013,Cork South,Number,588299.0
1239,Herd Incidence,2018,Kerry,%,2.62


Statistic             object
Year                   int64
Veterinary Office     object
Unit                  object
Value                float64
dtype: object
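The internals of `eda_reports.print_dataframe_analysis_report` are not shown in this notebook. The four checks it is described as performing could be approximated with plain pandas as follows. The function name `summarize_dataframe` and the toy data are hypothetical; this is a sketch, not the library's implementation.

```python
import pandas as pd


def summarize_dataframe(frame: pd.DataFrame, sample_size: int = 5) -> dict:
    """Collect a row sample, column dtypes, duplicate-row count and missing-value counts."""
    return {
        # Sample at most sample_size rows; random_state makes the sample repeatable.
        "sample": frame.sample(n=min(sample_size, len(frame)), random_state=0),
        "dtypes": frame.dtypes,
        # duplicated() flags every repeat of an earlier row.
        "duplicate_rows": int(frame.duplicated().sum()),
        # Missing values counted per column.
        "missing_values": frame.isna().sum(),
    }


# Made-up data with one duplicated row and one missing value.
toy_frame = pd.DataFrame({
    "Year": [2010, 2010, 2011, 2011],
    "Value": [1.0, 1.0, None, 4.0],
})
report = summarize_dataframe(toy_frame)
print(report["duplicate_rows"])
print(report["missing_values"])
```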

### Normalizing the table

In [7]:
bovine_tuberculosis_dataframe = bovine_tuberculosis_dataframe.set_index(
    [YEAR, VETERINARY_OFFICE, STATISTIC])[VALUE].unstack().reset_index()
bovine_tuberculosis_dataframe.columns = bovine_tuberculosis_dataframe.columns.tolist()
bovine_tuberculosis_dataframe = bovine_tuberculosis_dataframe.rename(
    columns=old_to_new_column_names_dictionary)

In [8]:
bovine_tuberculosis_dataframe.head()

Unnamed: 0,Year,Veterinary Office,Animal Count,Herd Incidence Rate,Restricted Herds at end of Year,Restricted Herds at start of Year,Herds Tested,Herds Count,Reactors per 1000 Tests A.P.T.,Reactors to date,Tests on Animals
0,2010,Carlow,86258.0,4.02,28.0,52.0,1295.0,1353.0,1.14,124.0,108584.0
1,2010,Cavan,202119.0,5.32,124.0,257.0,4832.0,4915.0,3.13,981.0,313822.0
2,2010,Clare,237260.0,5.71,175.0,350.0,6134.0,6282.0,5.05,1947.0,385705.0
3,2010,Cork North,462707.0,4.43,119.0,259.0,5849.0,5986.0,1.62,1078.0,664648.0
4,2010,Cork South,417478.0,6.3,216.0,385.0,6107.0,6310.0,2.72,1592.0,586105.0
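The reshaping above turns one row per statistic into one column per statistic. A minimal sketch of the same `set_index` / `unstack` pattern on a four-row toy frame, using the Carlow and Cavan 2010 figures from the output above and literal column names in place of the project's label constants:

```python
import pandas as pd

# Long format: one row per (year, office, statistic).
long_frame = pd.DataFrame({
    "Year": [2010, 2010, 2010, 2010],
    "Veterinary Office": ["Carlow", "Carlow", "Cavan", "Cavan"],
    "Statistic": ["Herds Tested", "Herds in County"] * 2,
    "Value": [1295.0, 1353.0, 4832.0, 4915.0],
})

# Move the innermost index level (Statistic) into the columns.
wide_frame = (
    long_frame.set_index(["Year", "Veterinary Office", "Statistic"])["Value"]
    .unstack()
    .reset_index()
)
# Replace the named column index ("Statistic") with a plain list of labels.
wide_frame.columns = wide_frame.columns.tolist()
print(wide_frame)
```

Note that `unstack` raises a `ValueError` if any (year, office, statistic) combination appears more than once, so a duplicate check before reshaping is worthwhile.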


#### Data Type Analysis Quick View

In [9]:
eda_reports.print_dataframe_analysis_report(bovine_tuberculosis_dataframe, filename)

Unnamed: 0,Year,Veterinary Office,Animal Count,Herd Incidence Rate,Restricted Herds at end of Year,Restricted Herds at start of Year,Herds Tested,Herds Count,Reactors per 1000 Tests A.P.T.,Reactors to date,Tests on Animals
53,2011,Tipperary North,261294.0,4.33,103.0,148.0,3419.0,3488.0,2.08,775.0,371880.0
165,2015,Louth,81508.0,3.95,21.0,47.0,1190.0,1228.0,1.14,120.0,105062.0
6,2010,Dublin,20455.0,7.16,16.0,28.0,391.0,398.0,1.99,60.0,30096.0
135,2014,Louth,81025.0,3.73,19.0,43.0,1153.0,1196.0,1.03,115.0,111171.0
7,2010,Galway,356143.0,3.98,221.0,463.0,11639.0,11796.0,3.15,1579.0,500797.0


Year                                   int64
Veterinary Office                     object
Animal Count                         float64
Herd Incidence Rate                  float64
Restricted Herds at end of Year      float64
Restricted Herds at start of Year    float64
Herds Tested                         float64
Herds Count                          float64
Reactors per 1000 Tests A.P.T.       float64
Reactors to date                     float64
Tests on Animals                     float64
dtype: object

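The table contains both county-level data and state-level data, the latter being an aggregate of the county-level data. The cells below split it into a county-level data frame and a state-level data frame.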

In [10]:
# Keep only the county-level rows.
county_bovine_tuberculosis_dataframe = bovine_tuberculosis_dataframe[
    bovine_tuberculosis_dataframe[VETERINARY_OFFICE] != "State"]
eda_reports.print_dataframe_analysis_report(county_bovine_tuberculosis_dataframe, filename)

Unnamed: 0,Year,Veterinary Office,Animal Count,Herd Incidence Rate,Restricted Herds at end of Year,Restricted Herds at start of Year,Herds Tested,Herds Count,Reactors per 1000 Tests A.P.T.,Reactors to date,Tests on Animals
209,2016,Wicklow W,42015.0,12.52,34.0,69.0,551.0,554.0,6.72,470.0,69902.0
111,2013,Sligo,102421.0,2.91,48.0,104.0,3572.0,3615.0,1.51,213.0,140999.0
226,2017,Mayo,243720.0,2.2,101.0,197.0,8936.0,9033.0,1.54,507.0,329714.0
185,2016,Donegal,170385.0,2.21,56.0,117.0,5300.0,5375.0,1.15,249.0,217131.0
138,2014,Monaghan,190963.0,2.54,61.0,106.0,4166.0,4246.0,1.45,372.0,257034.0


Year                                   int64
Veterinary Office                     object
Animal Count                         float64
Herd Incidence Rate                  float64
Restricted Herds at end of Year      float64
Restricted Herds at start of Year    float64
Herds Tested                         float64
Herds Count                          float64
Reactors per 1000 Tests A.P.T.       float64
Reactors to date                     float64
Tests on Animals                     float64
dtype: object

In [11]:
# Keep only the state-level rows.
bovine_tuberculosis_dataframe = bovine_tuberculosis_dataframe[
    bovine_tuberculosis_dataframe[VETERINARY_OFFICE] == "State"]
eda_reports.print_dataframe_analysis_report(bovine_tuberculosis_dataframe, filename)

Unnamed: 0,Year,Veterinary Office,Animal Count,Herd Incidence Rate,Restricted Herds at end of Year,Restricted Herds at start of Year,Herds Tested,Herds Count,Reactors per 1000 Tests A.P.T.,Reactors to date,Tests on Animals
262,2018,State,6398745.0,3.51,2176.0,3874.0,110454.0,112105.0,1.97,17491.0,8869856.0
292,2019,State,6363409.0,3.72,2273.0,4060.0,109175.0,111004.0,1.62,17058.0,8827682.0
112,2013,State,6146958.0,3.88,2512.0,4430.0,114051.0,115765.0,1.84,15612.0,8474961.0
142,2014,State,6115528.0,3.64,2177.0,4111.0,112937.0,114508.0,1.91,16145.0,8445262.0
82,2012,State,6145469.0,4.26,2665.0,4856.0,113887.0,115787.0,2.16,18476.0,8534677.0


Year                                   int64
Veterinary Office                     object
Animal Count                         float64
Herd Incidence Rate                  float64
Restricted Herds at end of Year      float64
Restricted Herds at start of Year    float64
Herds Tested                         float64
Herds Count                          float64
Reactors per 1000 Tests A.P.T.       float64
Reactors to date                     float64
Tests on Animals                     float64
dtype: object
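The county/state split can also be followed by a quick consistency check. The sketch below is an illustration on made-up data: the `State` figure is simply the sum of the two county figures, and whether the real CSO state rows equal the county sums is an assumption that would need checking. Only count columns are candidates for this check, because rates such as the herd incidence do not add up.

```python
import pandas as pd

# Toy data: the State value is invented as the sum of the two counties.
toy_frame = pd.DataFrame({
    "Year": [2010, 2010, 2010],
    "Veterinary Office": ["Carlow", "Cavan", "State"],
    "Herds Tested": [1295.0, 4832.0, 6127.0],
})

# A boolean mask splits the frame without dropping rows by index label.
is_state = toy_frame["Veterinary Office"] == "State"
state_frame = toy_frame[is_state]
county_frame = toy_frame[~is_state]

# Compare the per-year county totals with the state rows.
county_totals = county_frame.groupby("Year")["Herds Tested"].sum()
state_totals = state_frame.set_index("Year")["Herds Tested"]
largest_gap = (county_totals - state_totals).abs().max()
print(largest_gap)
```

A non-zero gap on the real data would point to missing offices or to a statistic that is not a simple sum.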

### Save Artifacts

Save the state-level and county-level data frames as CSV artifacts.

In [12]:
# index=False omits the row index from the saved files.
bovine_tuberculosis_dataframe.to_csv(
    artifact_manager.get_bovine_tuberculosis_eda_filepath(), index=False)
county_bovine_tuberculosis_dataframe.to_csv(
    artifact_manager.get_county_bovine_tuberculosis_eda_filepath(), index=False)
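Leaving the row index out of the saved file matters when the artifact is read back later. A small round-trip sketch, using an in-memory buffer as a stand-in for the artifact file path and a one-row toy frame:

```python
import io

import pandas as pd

toy_frame = pd.DataFrame({"Year": [2010], "Herds Tested": [1295.0]})

# Write without the row index, then read the CSV text back.
buffer = io.StringIO()
toy_frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
```

Had the index been written, the reloaded frame would carry an extra unnamed leading column.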

Author &copy; 2021 <a href="https://github.com/markcrowe-com" target="_parent">Mark Crowe</a>. All rights reserved.