# Influenza and Pneumonia | Processing

The main tasks completed to clean and preprocess this dataset were:

**Data Manipulation**
1. Rename columns.
2. Reformat 'financial year' values from XX/YY to XXXX/YYYY.
3. Remove 'LHD' from LHD name values.
4. Remove 'All' data (Representing a state-wide average).
5. Remove separate age groups, keeping only rows with "All ages".
6. Remove columns holding Confidence Interval data.
7. Remove rows holding 'Persons' data in the sex column (Representing a genderless rate per 100,000).
8. Remove 'age (years)' column.
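
The steps above can be sketched as a single function. This is a minimal sketch rather than the notebook's own code: the function name `preprocess` is hypothetical, the sample rows are taken from the tables shown in this notebook (put back into the raw layout), and the hard-coded `20` prefix assumes every period falls in the 2000s.

```python
import pandas as pd

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps to a frame shaped like the raw export (hypothetical helper)."""
    # Step 1: rename columns and lower-case all headers.
    df = raw.rename(columns={'LHD': 'lhd', 'Period': 'financial year'})
    df.columns = df.columns.str.lower()
    # Step 2: XX/YY -> 20XX/20YY (assumes all years are in the 2000s).
    df['financial year'] = df['financial year'].map(
        lambda x: f'20{x[:2]}/20{x[3:]}' if isinstance(x, str) else x
    )
    # Step 3: strip the ' LHD' suffix as a literal string.
    df['lhd'] = df['lhd'].str.replace(' LHD', '', regex=False)
    df = df.dropna(subset=['lhd', 'age (years)'])
    # Step 4: drop the state-wide rows.
    df = df[~df['lhd'].str.contains('All')]
    # Step 5: keep only the 'All ages' rows.
    df = df[df['age (years)'].str.contains('All ages')]
    # Step 6: drop the confidence interval columns.
    df = df.loc[:, ~df.columns.str.contains('% ci')]
    # Step 7: drop the combined 'Persons' rows.
    df = df[~df['sex'].str.contains('Persons')]
    # Step 8: the age column now holds a single value, so drop it.
    return df.drop(columns=['age (years)'])

raw = pd.DataFrame({
    'Age (years)': ['0-4 years', 'All ages', 'All ages', 'All ages'],
    'Sex': ['Males', 'Persons', 'Males', 'Persons'],
    'LHD': ['Sydney LHD', 'All LHDs', 'Sydney LHD', 'Western NSW LHD'],
    'Period': ['01/02', '16/17', '01/02', '16/17'],
    'Rate per 100,000 population': [817.3, 337.4, 345.0, 539.2],
    'LL 95% CI': [678.7, 333.6, 319.8, 514.1],
    'UL 95% CI': [975.8, 341.3, 371.7, 565.2],
})
clean = preprocess(raw)
print(clean)
```

Only the "All ages" row for males in Sydney survives every filter in this sample.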

## Set Up

Install the required library by running the following in a terminal before executing the notebook:
- pip install pandas


Run the following cell in the Jupyter notebook first to import the required library:

In [14]:
import pandas as pd

## Load Dataset

In [15]:
# File path.
file_path = 'data-raw.csv'

# Read the file.
df = pd.read_csv(file_path)

df

Unnamed: 0,Age (years),Sex,LHD,Period,"Rate per 100,000 population",LL 95% CI,UL 95% CI
0,0-4 years,Males,Sydney LHD,01/02,817.3,678.7,975.8
1,0-4 years,Males,Sydney LHD,02/03,720.5,591.6,869.1
2,0-4 years,Males,Sydney LHD,03/04,786,652.2,939.2
3,0-4 years,Males,Sydney LHD,04/05,596.6,481.6,730.9
4,0-4 years,Males,Sydney LHD,05/06,418.3,323.5,532.2
...,...,...,...,...,...,...,...
2695,All ages,Persons,All LHDs,16/17,337.4,333.6,341.3
2696,All ages,Persons,All LHDs,17/18,391.3,387.2,395.4
2697,All ages,Persons,All LHDs,18/19,372.6,368.6,376.6
2698,All ages,Persons,All LHDs,19/20,351.5,347.7,355.4


## Data Manipulation

Rename columns to match the Air Quality dataset.

In [16]:
# Rename columns.
df = df.rename(columns={
    'LHD': 'lhd',
    'Period': 'financial year'
})

# Set column names to lower case.
df.columns = df.columns.str.lower()

Expand the two-digit years in the 'financial year' column to four digits.

In [17]:
# Reformat 'financial year' values from XX/YY to XXXX/YYYY.
df['financial year'] = df['financial year'].apply(
    lambda x: f'20{x[:2]}/20{x[3:]}' if isinstance(x, str) else x
)
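
The lambda above always prepends `20`, which works for the periods shown here (01/02 onwards) but would turn a value such as 99/00 into 2099/2000. A minimal sketch of a century-aware variant follows. The helper name `expand_financial_year` and the `pivot` cut-off of 50 are assumptions made for illustration, not part of the original notebook.

```python
def expand_financial_year(period, pivot=50):
    """Expand 'XX/YY' to 'XXXX/YYYY' (hypothetical helper).

    Two-digit years >= `pivot` are treated as 19XX, all others as 20XX.
    The cut-off is an assumption and should be chosen to suit the data.
    """
    if not isinstance(period, str):
        return period  # leave NaN and other non-string values untouched
    start, sep, end = period.partition('/')
    well_formed = (
        sep == '/'
        and len(start) == 2 and len(end) == 2
        and start.isdigit() and end.isdigit()
    )
    if not well_formed:
        return period  # unexpected shape (e.g. a stray header value): leave as-is

    def with_century(two_digits):
        return ('19' if int(two_digits) >= pivot else '20') + two_digits

    return f'{with_century(start)}/{with_century(end)}'

print(expand_financial_year('01/02'))
print(expand_financial_year('99/00'))
```

It could be applied with `df['financial year'].map(expand_financial_year)` in place of the lambda.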

Remove 'LHD' from the Local Health District values.

In [18]:
# Remove ' LHD' from the 'lhd' column.
df['lhd'] = df['lhd'].str.replace(' LHD', '')

Remove rows representing state-wide aggregated data.

In [19]:
# Remove rows with NaN in the 'lhd' column.
df = df.dropna(subset=['lhd'])

# Remove rows with 'All' in the 'lhd' column.
df = df[~df['lhd'].str.contains('All')]
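
One detail worth noting: after the ' LHD' suffix is removed, the state-wide label 'All LHDs' becomes 'Alls', which is why a substring match on 'All' still catches it. The sketch below, using a small sample series rather than the full dataset, shows this and an alternative that handles missing values with the `na` argument of `str.contains` instead of a separate `dropna` call.

```python
import pandas as pd

# Sample labels in the raw format, including a missing value.
lhd = pd.Series(['Sydney LHD', 'Western NSW LHD', 'All LHDs', None])

# Literal (non-regex) removal of the suffix.
cleaned = lhd.str.replace(' LHD', '', regex=False)

# na=True marks missing values as matches, so they are dropped with the 'All' rows.
is_statewide_or_missing = cleaned.str.contains('All', na=True)
kept = cleaned[~is_statewide_or_missing]

print(cleaned.tolist())
print(kept.tolist())
```

A substring match would also drop any district whose name happens to contain 'All', so an exact comparison may be safer if such names exist.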

Remove rows for specific age groups, keeping only the "All ages" rows for each Local Health District.

In [20]:
# Remove rows with NaN in the 'age (years)' column.
df = df.dropna(subset=['age (years)'])

# Keep only rows with 'All ages' in the 'age (years)' column.
df = df[df['age (years)'].str.contains('All ages')]

df

Unnamed: 0,age (years),sex,lhd,financial year,"rate per 100,000 population",ll 95% ci,ul 95% ci
1800,All ages,Males,Sydney,2001/2002,345,319.8,371.7
1801,All ages,Males,Sydney,2002/2003,349.1,323.2,376.6
1802,All ages,Males,Sydney,2003/2004,362.4,335.1,391.3
1803,All ages,Males,Sydney,2004/2005,364.8,337.5,393.7
1804,All ages,Males,Sydney,2005/2006,285.2,262.7,309
...,...,...,...,...,...,...,...
2675,All ages,Persons,Western NSW,2016/2017,539.2,514.1,565.2
2676,All ages,Persons,Western NSW,2017/2018,515.6,491.2,541
2677,All ages,Persons,Western NSW,2018/2019,472.4,448.9,496.9
2678,All ages,Persons,Western NSW,2019/2020,444.9,422.1,468.6


Remove columns holding Confidence Interval data.

In [21]:
# Drop columns with '% ci' in the header.
df = df.loc[:, ~df.columns.str.contains('% ci')]

Remove rows holding "Persons" data in the sex column.

In [22]:
# Drop rows with 'Persons' in the 'sex' column.
df = df[~df['sex'].str.contains('Persons')]

Remove the "age (years)" column.

In [23]:
# Drop the 'age (years)' column.
df = df.drop(columns=['age (years)'])

Drop rows where column titles had been entered as data.

In [24]:
# Drop rows where column title had been entered as data.
df = df[~df['sex'].str.contains('Sex')]
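
The same check can be written as an exact, case-insensitive comparison against the column's own header, which avoids removing legitimate values that merely contain the header text. The sketch below is an illustration only: the helper name `drop_repeated_headers` and the sample frame are hypothetical.

```python
import pandas as pd

def drop_repeated_headers(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Drop rows whose value in `column` equals that column's header (hypothetical helper)."""
    values = df[column].astype(str).str.strip().str.lower()
    is_header = values == column.strip().lower()
    return df[~is_header]

# Hypothetical sample: the second row is a header repeated as data.
sample = pd.DataFrame({
    'sex': ['Males', 'Sex', 'Females', None],
    'rate per 100,000 population': ['345', 'Rate per 100,000 population', '172.8', None],
})
cleaned_sample = drop_repeated_headers(sample, 'sex')
print(cleaned_sample)
```

Rows with a missing value in the checked column are kept, so missing data still needs to be handled separately.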

## Output Processed Dataset

In [25]:
# File path.
file_path_output = 'data-processed.csv'

# Save the file.
df.to_csv(file_path_output, index=False)

## View Dataset

In [26]:
df

Unnamed: 0,sex,lhd,financial year,"rate per 100,000 population"
1800,Males,Sydney,2001/2002,345
1801,Males,Sydney,2002/2003,349.1
1802,Males,Sydney,2003/2004,362.4
1803,Males,Sydney,2004/2005,364.8
1804,Males,Sydney,2005/2006,285.2
...,...,...,...,...
2375,Females,Western NSW,2016/2017,490.2
2376,Females,Western NSW,2017/2018,476.2
2377,Females,Western NSW,2018/2019,463.4
2378,Females,Western NSW,2019/2020,435.9


## Alternative Approach
In this layout, the composite primary key is reduced to 'financial year' and 'lhd'.

Each row holds both the 'Male rate per 100,000 population' and the 'Female rate per 100,000 population'.
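
Because `pivot_table` aggregates rows that share a key (its default `aggfunc` is the mean), duplicate combinations of 'financial year', 'lhd' and 'sex' would be averaged silently. The sketch below shows one way to check for duplicates first, and uses `DataFrame.pivot`, which raises a `ValueError` on duplicate keys instead of aggregating. The rows are copied from the wide table shown in this notebook and put back into long form. The variable names are illustrative.

```python
import pandas as pd

# Long-form rows reconstructed from the wide output table.
long_df = pd.DataFrame({
    'financial year': ['2020/2021'] * 4,
    'lhd': ['Sydney', 'Sydney', 'Western NSW', 'Western NSW'],
    'sex': ['Females', 'Males', 'Females', 'Males'],
    'rate per 100,000 population': [172.8, 221.6, 289.3, 338.9],
})

# Each (financial year, lhd, sex) combination should appear at most once.
key = ['financial year', 'lhd', 'sex']
has_duplicates = bool(long_df.duplicated(subset=key).any())

# pivot() refuses to reshape if the key is not unique.
wide = long_df.pivot(
    index=['financial year', 'lhd'],
    columns='sex',
    values='rate per 100,000 population',
).reset_index()
wide.columns.name = None

print(has_duplicates)
print(wide)
```

If the key is unique, `pivot` and `pivot_table` produce the same values, so the choice mainly affects how duplicates are reported.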

### Reconfigure Table

In [27]:
# Convert 'rate per 100,000 population' to numeric
df['rate per 100,000 population'] = pd.to_numeric(df['rate per 100,000 population'], errors='coerce')

# Pivot the dataframe to have 'sex' as columns
df_alt = df.pivot_table(index=['financial year', 'lhd'], columns='sex', values='rate per 100,000 population').reset_index()

# Rename the columns to match the desired format
df_alt.columns.name = None
df_alt = df_alt.rename(columns={'Males': 'Male rate per 100,000 population', 'Females': 'Female rate per 100,000 population'})

df_alt

Unnamed: 0,financial year,lhd,"Female rate per 100,000 population","Male rate per 100,000 population"
0,2001/2002,Central Coast,198.1,294.6
1,2001/2002,Hunter New England,249.6,354.7
2,2001/2002,Illawarra Shoalhaven,198.4,286.8
3,2001/2002,Mid North Coast,301.8,381.6
4,2001/2002,Murrumbidgee,333.3,464.6
...,...,...,...,...
275,2020/2021,South Western Sydney,176.4,233.6
276,2020/2021,Southern NSW,255.7,306.4
277,2020/2021,Sydney,172.8,221.6
278,2020/2021,Western NSW,289.3,338.9


In [28]:
# File path.
file_path_output_alt = 'data-processed-alt.csv'

# Save the file.
df_alt.to_csv(file_path_output_alt, index=False)