# Influenza and Pneumonia Deaths | Processing

The main tasks completed to clean and preprocess this dataset were:

**Data Cleaning**
1. Drop columns with irrelevant/incompatible data (the confidence interval, CI, columns).
2. Drop rows with 'Persons' in Sex (remove redundant aggregates).
3. Drop rows with 'All LHDs' in LHD (remove redundant aggregates).

**Data Normalization**
1. Rename columns.
2. Remove 'LHD' from LHD name values.
3. Normalize 'financial year' values to match the other processed datasets.
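
The steps listed above can be sketched as a single function. This is a minimal sketch, not the notebook's own code: the function name `clean_influpneu_deaths` and the `sample` frame are hypothetical, and the sample rows are copied from the raw output shown later in this notebook.

```python
import pandas as pd


def clean_influpneu_deaths(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning and normalization steps to a raw-style frame."""
    # Data cleaning: drop confidence-interval columns and aggregate rows.
    df = df.drop(columns=[col for col in df.columns if "CI" in col])
    df = df[(df["Sex"] != "Persons") & (df["LHD"] != "All LHDs")].copy()

    # Data normalization: rename and lower-case columns, tidy the values.
    df = df.rename(columns={"Period": "financial year"})
    df.columns = df.columns.str.lower()
    df["lhd"] = df["lhd"].str.replace(" LHD", "", regex=False)
    df["financial year"] = df["financial year"].str.replace("-", "/", regex=False)
    return df


# Hypothetical sample with the same columns as the raw file.
sample = pd.DataFrame({
    "Sex": ["Males", "Males", "Persons"],
    "LHD": ["Sydney LHD", "Sydney LHD", "All LHDs"],
    "Period": ["2011-2012", "2012-2013", "2016-2017"],
    "Rate per 100,000 population": [9.2, 8.7, 10.7],
    "LL 95% CI": [6.7, 6.3, 10.3],
    "UL 95% CI": [12.3, 11.6, 11.2],
})

cleaned = clean_influpneu_deaths(sample)
print(cleaned)
```

Wrapping the steps in one function is only a convenience for re-running the pipeline; the cells below perform the same steps one at a time.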

## Set Up

Ensure that the required libraries are installed by running the following command in the terminal:
- pip install pandas


Run the following cell in the Jupyter notebook to import the required libraries:

In [4]:
import pandas as pd

## Load Dataset

In [5]:
# Read the raw data file.
df_influpneu_deaths = pd.read_csv("data-raw.csv")

df_influpneu_deaths

Unnamed: 0,Sex,LHD,Period,"Rate per 100,000 population",LL 95% CI,UL 95% CI
0,Males,Sydney LHD,2011-2012,9.2,6.7,12.3
1,Males,Sydney LHD,2012-2013,8.7,6.3,11.6
2,Males,Sydney LHD,2013-2014,9.0,6.6,11.9
3,Males,Sydney LHD,2014-2015,7.9,5.8,10.7
4,Males,Sydney LHD,2015-2016,8.4,6.2,11.1
...,...,...,...,...,...,...
445,Persons,All LHDs,2016-2017,10.7,10.3,11.2
446,Persons,All LHDs,2017-2018,10.3,9.8,10.7
447,Persons,All LHDs,2018-2019,8.9,8.5,9.3
448,Persons,All LHDs,2019-2020,7.5,7.1,7.8


## Data Cleaning

Drop unnecessary columns (the confidence interval bounds).

In [6]:
# Drop every column whose name contains 'CI' (the 95% confidence interval bounds).
df_influpneu_deaths = df_influpneu_deaths.drop(columns=[col for col in df_influpneu_deaths.columns if 'CI' in col])
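
Because the match is a plain substring test, it can help to list the matched columns before dropping them. The sketch below assumes an empty frame with the same column names as the raw file; the names `raw_columns`, `ci_columns`, and `remaining` are illustrative only.

```python
import pandas as pd

# Column names as they appear in the raw output above.
raw_columns = [
    "Sex", "LHD", "Period",
    "Rate per 100,000 population", "LL 95% CI", "UL 95% CI",
]
df = pd.DataFrame(columns=raw_columns)

# Inspect which columns the substring test selects before dropping them.
ci_columns = [col for col in df.columns if "CI" in col]
print(ci_columns)  # ['LL 95% CI', 'UL 95% CI']

remaining = df.drop(columns=ci_columns)
print(list(remaining.columns))
```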

Remove aggregate data entries.

In [7]:
# Remove the aggregate 'Persons' rows from the 'Sex' column.
df_influpneu_deaths = df_influpneu_deaths[df_influpneu_deaths['Sex'] != 'Persons']

# Remove the aggregate 'All LHDs' rows from the 'LHD' column.
df_influpneu_deaths = df_influpneu_deaths[df_influpneu_deaths['LHD'] != 'All LHDs']
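
One way to confirm that the filter removed only the aggregate rows is to build the mask explicitly and compare row counts. This is a hedged sketch on a hypothetical four-row frame; `is_aggregate` and `detail_rows` are illustrative names.

```python
import pandas as pd

# Hypothetical frame mixing detail rows with aggregate rows.
df = pd.DataFrame({
    "Sex": ["Males", "Females", "Persons", "Persons"],
    "LHD": ["Sydney LHD", "Sydney LHD", "Sydney LHD", "All LHDs"],
})

# A row is an aggregate if either column holds its aggregate label.
is_aggregate = (df["Sex"] == "Persons") | (df["LHD"] == "All LHDs")
detail_rows = df[~is_aggregate]

print(len(df), "->", len(detail_rows))  # 4 -> 2
```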

## Data Normalization

Rename columns to match processed data standards.

In [8]:
# Rename columns.
df_influpneu_deaths = df_influpneu_deaths.rename(columns={
    'LHD': 'lhd',
    'Period': 'financial year'
})

# Set column names to lower case.
df_influpneu_deaths.columns = df_influpneu_deaths.columns.str.lower()

Remove the ' LHD' suffix from the LHD values.

In [9]:
# Remove ' LHD' from the 'lhd' column.
df_influpneu_deaths['lhd'] = df_influpneu_deaths['lhd'].str.replace(' LHD', '', regex=False)
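
A literal replacement removes ' LHD' wherever it occurs in the string. If only a trailing suffix should be stripped, an anchored regular expression is one alternative. The sketch below is an assumption about a stricter variant, not the notebook's method; the district names are taken from the outputs in this notebook.

```python
import pandas as pd

names = pd.Series(["Sydney LHD", "Western NSW LHD", "Hunter New England LHD"])

# Strip only a trailing ' LHD' (whitespace followed by LHD at the end).
stripped = names.str.replace(r"\s+LHD$", "", regex=True)
print(stripped.tolist())
```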

Format financial year to match processed data standards.

In [10]:
# Reformat 'financial year' values from XXXX-YYYY to XXXX/YYYY.
df_influpneu_deaths['financial year'] = df_influpneu_deaths['financial year'].str.replace('-', '/', regex=False)
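
After reformatting, the values can be checked against the expected XXXX/YYYY pattern. This is a minimal sketch on two sample periods from the raw output; `formatted` and `is_valid` are illustrative names, and `Series.str.fullmatch` requires pandas 1.1 or later.

```python
import pandas as pd

periods = pd.Series(["2011-2012", "2019-2020"])

# Literal replacement of the hyphen with a slash.
formatted = periods.str.replace("-", "/", regex=False)

# Flag any value that does not look like four digits, a slash, four digits.
is_valid = formatted.str.fullmatch(r"\d{4}/\d{4}")
print(formatted.tolist(), bool(is_valid.all()))
```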

## Alternate Output Config

In [11]:
# Pivot the dataframe to have 'sex' as columns
df_alt = df_influpneu_deaths.pivot_table(
    index=['financial year', 'lhd'],
    columns='sex',
    values='rate per 100,000 population'
).reset_index()

# Rename the columns to match the desired format
df_alt.columns.name = None
df_alt = df_alt.rename(columns={'Males': 'Male rate per 100,000 population', 'Females': 'Female rate per 100,000 population'})

df_alt

Unnamed: 0,financial year,lhd,"Female rate per 100,000 population","Male rate per 100,000 population"
0,2011/2012,Central Coast,7.6,10.2
1,2011/2012,Hunter New England,7.8,11.3
2,2011/2012,Illawarra Shoalhaven,8.6,7.9
3,2011/2012,Mid North Coast,10.6,9.8
4,2011/2012,Murrumbidgee,9.3,10.6
...,...,...,...,...
135,2020/2021,South Western Sydney,4.8,6.3
136,2020/2021,Southern NSW,3.9,4.5
137,2020/2021,Sydney,4.0,6.0
138,2020/2021,Western NSW,4.8,8.6
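
Note that `pivot_table` averages duplicate (financial year, lhd, sex) combinations by default. If each combination is expected to appear exactly once, `DataFrame.pivot` is a stricter alternative that raises an error on duplicates instead of aggregating them. The sketch below assumes that uniqueness and uses four values from the output above; `long_df` and `wide` are illustrative names, and a list-valued `index` requires pandas 1.1 or later.

```python
import pandas as pd

long_df = pd.DataFrame({
    "sex": ["Males", "Females", "Males", "Females"],
    "lhd": ["Central Coast", "Central Coast",
            "Hunter New England", "Hunter New England"],
    "financial year": ["2011/2012"] * 4,
    "rate per 100,000 population": [10.2, 7.6, 11.3, 7.8],
})

# pivot() raises ValueError on duplicate index/column pairs.
wide = long_df.pivot(
    index=["financial year", "lhd"],
    columns="sex",
    values="rate per 100,000 population",
).reset_index()
wide.columns.name = None
wide = wide.rename(columns={
    "Males": "Male rate per 100,000 population",
    "Females": "Female rate per 100,000 population",
})
print(wide)
```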


## Output Processed Dataset

In [12]:
df_influpneu_deaths.to_csv("data-processed.csv", index=False)

df_alt.to_csv("data-processed-alt.csv", index=False)

df_influpneu_deaths

Unnamed: 0,sex,lhd,financial year,"rate per 100,000 population"
0,Males,Sydney,2011/2012,9.2
1,Males,Sydney,2012/2013,8.7
2,Males,Sydney,2013/2014,9.0
3,Males,Sydney,2014/2015,7.9
4,Males,Sydney,2015/2016,8.4
...,...,...,...,...
285,Females,Western NSW,2016/2017,14.8
286,Females,Western NSW,2017/2018,11.5
287,Females,Western NSW,2018/2019,11.9
288,Females,Western NSW,2019/2020,9.8
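
The rate column name contains a comma, so it is worth confirming that the written file reads back with the same columns. The sketch below does a round trip through an in-memory buffer rather than a file on disk; the one-row `processed` frame is a stand-in for the real output.

```python
import io

import pandas as pd

processed = pd.DataFrame({
    "sex": ["Males"],
    "lhd": ["Sydney"],
    "financial year": ["2011/2012"],
    "rate per 100,000 population": [9.2],
})

# Write without the index, as in the cell above, then read it back.
buffer = io.StringIO()
processed.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.columns.tolist())
```

`to_csv` quotes the header that contains a comma, so `read_csv` recovers it as a single column name.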
