# Respiratory Disease Deaths | Processing

The main tasks completed to clean and preprocess this dataset were:

**Data Manipulation**
1. Rename columns.
2. Reformat 'financial year' values from XX/YY to XXXX/YYYY.
3. Remove 'LHD' from LHD name values.
4. Remove 'All' data (representing a state-wide average).
5. Remove separate age groups, keeping only rows with "All Ages".
6. Remove columns holding confidence interval data.
7. Remove rows holding 'Persons' data in the sex column (representing a rate per 100,000 for all sexes combined).
8. Remove the 'risk group' column.
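
As a rough illustration of steps 2, 5 and 7, the sketch below expands two-digit financial years and filters rows on a small made-up frame. The helper `expand_financial_year`, its `pivot` cut-off for choosing the century, the column names `sex` and `age group`, and all sample values are assumptions for illustration; they are not taken from the dataset.

```python
import pandas as pd


def expand_financial_year(value: str, pivot: int = 50) -> str:
    """Expand 'XX/YY' to 'XXXX/YYYY'.

    Assumption: two-digit years below `pivot` belong to the 2000s,
    all others to the 1900s.
    """
    def expand(two_digit: str) -> str:
        century = 2000 if int(two_digit) < pivot else 1900
        return str(century + int(two_digit))

    start, end = value.split('/')
    return f"{expand(start)}/{expand(end)}"


# Hypothetical sample rows; column names and values are illustrative only.
sample = pd.DataFrame({
    'financial year': ['01/02', '99/00', '18/19'],
    'sex': ['Males', 'Persons', 'Females'],
    'age group': ['All Ages', 'All Ages', '0-4 years'],
})

# Step 2: reformat the financial year values.
sample['financial year'] = sample['financial year'].map(expand_financial_year)

# Step 7: drop the combined 'Persons' rows.
sample = sample[sample['sex'] != 'Persons']

# Step 5: keep only the 'All Ages' rows.
sample = sample[sample['age group'] == 'All Ages']

print(sample)
```

Only the first sample row survives both filters, with its financial year expanded to `2001/2002`.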

## Set Up

Install the required library by running the following command in a terminal before executing the notebook:
- pip install pandas


Run the following cell in the Jupyter notebook first to import the required library:

In [1]:
import pandas as pd

## Load Dataset

In [2]:
# File path.
file_path = 'data-raw.csv'

# Read the file.
df = pd.read_csv(file_path)

df

| | Disease type | LHD | Period | Rate per 100,000 population | LL 95% CI | UL 95% CI |
|---|---|---|---|---|---|---|
| 0 | Influenza and pneumonia | Sydney LHD | 2001-2003 | 13.5 | 11.7 | 15.5 |
| 1 | Influenza and pneumonia | Sydney LHD | 2002-2004 | 15.9 | 13.9 | 18.1 |
| 2 | Influenza and pneumonia | Sydney LHD | 2003-2005 | 16.3 | 14.3 | 18.6 |
| 3 | Influenza and pneumonia | Sydney LHD | 2004-2006 | 15.8 | 13.8 | 17.9 |
| 4 | Influenza and pneumonia | Sydney LHD | 2005-2007 | 13.3 | 11.5 | 15.2 |
| ... | ... | ... | ... | ... | ... | ... |
| 1819 | Total | All LHDs | 2015-2017 | 50.2 | 49.4 | 51.0 |
| 1820 | Total | All LHDs | 2016-2018 | 49.2 | 48.5 | 50.0 |
| 1821 | Total | All LHDs | 2017-2019 | 49.0 | 48.2 | 49.7 |
| 1822 | Total | All LHDs | 2018-2020 | 43.5 | 42.8 | 44.2 |


## Data Manipulation

Rename columns to match Air Quality data set.

In [3]:
# Rename columns.
df = df.rename(columns={
    'LHD': 'lhd',
    'Period': 'financial year'
})

# Set column names to lower case.
df.columns = df.columns.str.lower()
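
The effect of the rename and the lower-casing can be checked on an empty frame that carries only the headers shown in the loaded dataset. This is a minimal sketch; the name `demo` is illustrative.

```python
import pandas as pd

# Empty frame with the same headers as the raw dataset shown above.
demo = pd.DataFrame(columns=[
    'Disease type', 'LHD', 'Period',
    'Rate per 100,000 population', 'LL 95% CI', 'UL 95% CI',
])

# Same two steps as the cell above: rename, then lower-case every header.
demo = demo.rename(columns={'LHD': 'lhd', 'Period': 'financial year'})
demo.columns = demo.columns.str.lower()

print(list(demo.columns))
```

Because the lower-casing runs after the rename, the confidence interval headers become `ll 95% ci` and `ul 95% ci`, which is the form the later column filter matches on.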

Remove the ' LHD' suffix from the Local Health District (LHD) names.

In [4]:
# Remove ' LHD' from the 'lhd' column.
df['lhd'] = df['lhd'].str.replace(' LHD', '')
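
A literal replacement removes ' LHD' wherever it occurs, so the state-wide label 'All LHDs' turns into 'Alls'. The sketch below compares that behaviour with a suffix-anchored pattern on a few sample names; the variable names are illustrative, and the anchored variant is an alternative for comparison, not the notebook's approach.

```python
import pandas as pd

names = pd.Series(['Sydney LHD', 'Far West LHD', 'All LHDs'])

# Literal replacement, equivalent to the cell above.
literal = names.str.replace(' LHD', '', regex=False)

# Alternative: only strip ' LHD' when it ends the string.
suffix_only = names.str.replace(r' LHD$', '', regex=True)

print(literal.tolist())      # ['Sydney', 'Far West', 'Alls']
print(suffix_only.tolist())  # ['Sydney', 'Far West', 'All LHDs']
```

Either form still contains the substring 'All' in the state-wide label, so a later filter on 'All' catches it in both cases.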

Remove rows representing state-wide aggregated data.

In [5]:
# Remove rows with NaN in the 'lhd' column.
df = df.dropna(subset=['lhd'])

# Remove rows with 'All' in the 'lhd' column.
df = df[~df['lhd'].str.contains('All')]
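
The two filters can be tried on a small made-up frame. The sketch assumes the state-wide label has already become 'Alls' after the earlier literal replacement; the `rate` column and its values are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({
    'lhd': ['Sydney', None, 'Alls', 'Far West'],
    'rate': [13.5, 1.0, 50.2, 65.8],
})

# Drop rows with a missing district first, so the string test below
# only ever sees real strings.
demo = demo.dropna(subset=['lhd'])

# Drop any district whose name contains 'All' (the state-wide rows).
demo = demo[~demo['lhd'].str.contains('All', regex=False)]

print(demo['lhd'].tolist())  # ['Sydney', 'Far West']
```

A substring test would also drop any district whose name happens to contain 'All', so an exact comparison against the state-wide label may be safer if such names exist in the data.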

Remove columns holding confidence interval data.

In [6]:
# Drop columns with '% ci' in the header
df = df.loc[:, ~df.columns.str.contains('% ci')]
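
The column filter builds a boolean mask over the headers and keeps only the columns where the mask is false. A minimal sketch on a one-row frame with illustrative values:

```python
import pandas as pd

demo = pd.DataFrame({
    'lhd': ['Sydney'],
    'rate per 100,000 population': [13.5],
    'll 95% ci': [11.7],
    'ul 95% ci': [15.5],
})

# True for every header that contains the literal text '% ci'.
ci_mask = demo.columns.str.contains('% ci', regex=False)

# Keep all rows, and only the columns that are not CI bounds.
trimmed = demo.loc[:, ~ci_mask]

print(list(trimmed.columns))  # ['lhd', 'rate per 100,000 population']
```

The match is case-sensitive, so it relies on the headers having been lower-cased beforehand.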

## Output Processed Dataset

In [10]:
# File path.
file_path_output = 'data-processed.csv'

# Save the file.
df.to_csv(file_path_output, index=False)
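
The `index=False` argument matters when the file is read back. The sketch below writes a small illustrative frame to in-memory buffers instead of disk and compares the reloaded headers with and without the index; the sample values and variable names are assumptions.

```python
import io

import pandas as pd

demo = pd.DataFrame(
    {'lhd': ['Sydney', 'Far West'], 'rate per 100,000 population': [13.5, 69.5]},
    index=[0, 1803],  # non-contiguous index, as left behind by row filtering
)

# Written without the index: the reloaded frame has only the data columns.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

# Written with the index: it comes back as an extra 'Unnamed: 0' column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```

Dropping the index also means the reloaded frame is renumbered from zero, so the gaps left by the removed rows do not carry over to the processed file.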

## View Dataset

In [9]:
df

| | disease type | lhd | financial year | rate per 100,000 population |
|---|---|---|---|---|
| 0 | Influenza and pneumonia | Sydney | 2001-2003 | 13.5 |
| 1 | Influenza and pneumonia | Sydney | 2002-2004 | 15.9 |
| 2 | Influenza and pneumonia | Sydney | 2003-2005 | 16.3 |
| 3 | Influenza and pneumonia | Sydney | 2004-2006 | 15.8 |
| 4 | Influenza and pneumonia | Sydney | 2005-2007 | 13.3 |
| ... | ... | ... | ... | ... |
| 1800 | Total | Far West | 2015-2017 | 65.8 |
| 1801 | Total | Far West | 2016-2018 | 56.3 |
| 1802 | Total | Far West | 2017-2019 | 65.3 |
| 1803 | Total | Far West | 2018-2020 | 69.5 |
