# Asthma Hospitalisations | Processing

The main tasks completed to clean and preprocess this dataset were:

**Data Manipulation**
1. Rename columns.
2. Reformat 'financial year' values from XX/YY to XXXX/YYYY.
3. Remove ' LHD' from LHD name values.
4. Remove 'All LHDs' data (representing a state-wide average).
5. Remove separate age groups, keeping only rows with 'All ages' data.
6. Remove columns holding confidence interval data.
7. Remove rows holding 'Persons' data in the 'sex' column (representing a rate per 100,000 that is not broken down by sex).
8. Remove the 'risk group' column.

## Set Up

Ensure that the required libraries are installed by running the following command in the terminal before execution:
- pip install pandas

Run the following cell in the Jupyter notebook first to import the required libraries:

In [13]:
import pandas as pd

## Load Dataset

In [20]:
# File path.
file_path = 'data-raw.csv'

# Read the file.
df = pd.read_csv(file_path)

df

Unnamed: 0,Sex,LHD,Risk group,Period,"Rate per 100,000 population",LL 95% CI,UL 95% CI
0,Males,Sydney LHD,5-34 years,01/02,123.6,100.8,149.7
1,Males,Sydney LHD,5-34 years,02/03,112.8,91.3,137.5
2,Males,Sydney LHD,5-34 years,03/04,120.0,97.9,145.5
3,Males,Sydney LHD,5-34 years,04/05,124.6,101.8,150.8
4,Males,Sydney LHD,5-34 years,05/06,117.6,95.3,143.4
...,...,...,...,...,...,...,...
4027,Persons,All LHDs,All ages,17/18,140.3,137.7,143.0
4028,Persons,All LHDs,All ages,18/19,145.0,142.4,147.8
4029,Persons,All LHDs,All ages,19/20,119.5,117.1,122.0
4030,Persons,All LHDs,All ages,20/21,95.4,93.2,97.6
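Before manipulating the data, it can help to confirm that the loaded frame has the columns the later steps rely on. The following is a minimal sketch, assuming the raw file uses the header shown in the output above. The in-memory sample (two rows copied from that output) and the names `sample_csv`, `expected_columns` and `missing` are illustrative only and stand in for `data-raw.csv`.

```python
import io

import pandas as pd

# Hypothetical in-memory sample standing in for 'data-raw.csv';
# both rows are copied from the output shown above.
sample_csv = io.StringIO(
    'Sex,LHD,Risk group,Period,"Rate per 100,000 population",LL 95% CI,UL 95% CI\n'
    'Males,Sydney LHD,5-34 years,01/02,123.6,100.8,149.7\n'
    'Persons,All LHDs,All ages,20/21,95.4,93.2,97.6\n'
)

# Read 'Period' as text so values such as '01/02' are kept verbatim.
sample = pd.read_csv(sample_csv, dtype={'Period': str})

# Columns that the later cleaning steps refer to (before renaming).
expected_columns = {
    'Sex',
    'LHD',
    'Risk group',
    'Period',
    'Rate per 100,000 population',
    'LL 95% CI',
    'UL 95% CI',
}

# Any name left in this set is missing from the loaded frame.
missing = expected_columns - set(sample.columns)
print(missing)
```

An empty set means every expected column is present.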


## Data Manipulation

Rename columns to match the Air Quality dataset.

In [21]:
# Rename columns.
df = df.rename(columns={
    'LHD': 'lhd',
    'Period': 'financial year'
})

# Set column names to lower case.
df.columns = df.columns.str.lower()

Make the year data in the 'financial year' column more verbose.

In [22]:
# Reformat 'financial year' values from XX/YY to XXXX/YYYY.
df['financial year'] = df['financial year'].apply(
    lambda x: f'20{x[:2]}/20{x[3:]}' if isinstance(x, str) else x
)
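The effect of this reformatting can be checked on a few sample values. This is a small sketch rather than part of the pipeline; the helper name `expand_financial_year` is hypothetical and simply wraps the same expression as the lambda above. Note the built-in assumption that every period falls in the 2000s: a value such as '99/00' would be expanded incorrectly.

```python
import pandas as pd


def expand_financial_year(value):
    """Expand 'XX/YY' to '20XX/20YY'; return non-string values unchanged."""
    if isinstance(value, str):
        return f'20{value[:2]}/20{value[3:]}'
    return value


# Two periods taken from the raw data shown earlier, plus a missing value.
periods = pd.Series(['01/02', '20/21', None])
expanded = periods.apply(expand_financial_year)
print(expanded.tolist())

# A period that crosses the century is NOT handled by this approach.
print(expand_financial_year('99/00'))
```

The dataset shown here starts at 01/02, so the assumption holds for it, but it would need revisiting if earlier periods were added.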

Remove ' LHD' from Local Health District values.

In [23]:
# Remove ' LHD' from the 'lhd' column.
df['lhd'] = df['lhd'].str.replace(' LHD', '')
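One side effect of this replacement is worth seeing on sample values: the state-wide label 'All LHDs' also contains ' LHD', so it becomes 'Alls' rather than being left intact. The sketch below uses three illustrative values taken from the outputs in this notebook; `regex=False` is added here only to make the literal match explicit.

```python
import pandas as pd

# Sample 'lhd' values as they appear in the raw data.
lhd = pd.Series(['Sydney LHD', 'Far West LHD', 'All LHDs'])

# Literal (non-regex) removal of the ' LHD' substring.
stripped = lhd.str.replace(' LHD', '', regex=False)
print(stripped.tolist())

# A substring match on 'All' still identifies the state-wide rows.
is_statewide = stripped.str.contains('All')
print(is_statewide.tolist())
```

This is why a substring test on 'All', rather than an exact comparison with 'All LHDs', still finds the state-wide rows after this step.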

Remove rows representing state-wide aggregated data.

In [24]:
# Remove rows with NaN in the 'lhd' column.
df = df.dropna(subset=['lhd'])

# Remove rows with 'All' in the 'lhd' column.
df = df[~df['lhd'].str.contains('All')]

Remove rows for separate age groups, keeping only the 'All ages' aggregate for each Local Health District.

In [25]:
# Remove rows with NaN in the 'risk group' column.
df = df.dropna(subset=['risk group'])

# Remove rows without 'All ages' in the 'risk group' column.
df = df[df['risk group'].str.contains('All ages')]

df

Unnamed: 0,sex,lhd,risk group,financial year,"rate per 100,000 population",ll 95% ci,ul 95% ci
63,Males,Sydney,All ages,2001/2002,150.9,134.9,168.3
64,Males,Sydney,All ages,2002/2003,134.4,119.4,150.8
65,Males,Sydney,All ages,2003/2004,144.8,129.2,161.8
66,Males,Sydney,All ages,2004/2005,151.7,135.8,169.0
67,Males,Sydney,All ages,2005/2006,150.7,135.0,167.8
...,...,...,...,...,...,...,...
3943,Persons,Far West,All ages,2017/2018,157.1,113.4,211.5
3944,Persons,Far West,All ages,2018/2019,190.8,143.0,249.0
3945,Persons,Far West,All ages,2019/2020,157.8,114.2,212.2
3946,Persons,Far West,All ages,2020/2021,117.3,80.0,165.5


Remove columns holding Confidence Interval data.

In [26]:
# Drop columns with '% ci' in the header
df = df.loc[:, ~df.columns.str.contains('% ci')]
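As a cross-check, the pattern-based selection can be compared with dropping the two confidence interval columns by name. This is a sketch on a one-row sample (values copied from the output above), assuming the lower-cased column names 'll 95% ci' and 'ul 95% ci'; the names `sample`, `by_pattern` and `by_name` are illustrative.

```python
import pandas as pd

# One-row sample using the lower-cased column names from the earlier step.
sample = pd.DataFrame({
    'sex': ['Males'],
    'rate per 100,000 population': [150.9],
    'll 95% ci': [134.9],
    'ul 95% ci': [168.3],
})

# Pattern-based: keep every column whose header does not contain '% ci'.
by_pattern = sample.loc[:, ~sample.columns.str.contains('% ci', regex=False)]

# Name-based: drop the two confidence interval columns explicitly.
by_name = sample.drop(columns=['ll 95% ci', 'ul 95% ci'])

print(list(by_pattern.columns))
print(by_pattern.equals(by_name))
```

The pattern-based form adapts if further '% ci' columns appear, while the name-based form fails loudly if a column is renamed; which is preferable depends on how stable the source header is.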

Remove rows holding 'Persons' data in the sex column.

In [27]:
# Drop rows with 'Persons' in the 'sex' column.
df = df[~df['sex'].str.contains('Persons')]

Remove 'risk group' column.

In [28]:
# Drop the 'risk group' column.
df = df.drop(columns=['risk group'])

## Output Processed Dataset

In [29]:
# File path.
file_path_output = 'data-processed.csv'

# Save the file.
df.to_csv(file_path_output, index=False)
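A quick sanity check on the processed data is to confirm that no key appears twice. The sketch below assumes the composite key at this stage is ('sex', 'lhd', 'financial year'), which is inferred from the remaining columns rather than stated by the source. The sample rows are copied from the processed output in this notebook, and `key_columns` and `duplicate_keys` are illustrative names.

```python
import pandas as pd

# Sample rows copied from the processed dataset.
processed = pd.DataFrame({
    'sex': ['Males', 'Males', 'Females'],
    'lhd': ['Sydney', 'Sydney', 'Far West'],
    'financial year': ['2001/2002', '2002/2003', '2020/2021'],
    'rate per 100,000 population': [150.9, 134.4, 153.4],
})

# Assumed composite key for the long-format table.
key_columns = ['sex', 'lhd', 'financial year']

# Flag every row whose key occurs more than once.
duplicate_keys = processed.duplicated(subset=key_columns, keep=False)
print(int(duplicate_keys.sum()))
```

Running the same check on the full `df` would show whether the key holds for the whole dataset.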

## View Dataset

In [30]:
df

Unnamed: 0,sex,lhd,financial year,"rate per 100,000 population"
63,Males,Sydney,2001/2002,150.9
64,Males,Sydney,2002/2003,134.4
65,Males,Sydney,2003/2004,144.8
66,Males,Sydney,2004/2005,151.7
67,Males,Sydney,2005/2006,150.7
...,...,...,...,...
2599,Females,Far West,2017/2018,198.2
2600,Females,Far West,2018/2019,230.0
2601,Females,Far West,2019/2020,127.6
2602,Females,Far West,2020/2021,153.4


## Alternative Approach
The composite primary key is reduced to 'lhd' and 'financial year'.

This approach tracks 'Male rate per 100,000 population' and 'Female rate per 100,000 population' on the same row.

### Reconfigure Table

In [None]:
# Pivot the dataframe to have 'sex' as columns
df_alt = df.pivot_table(index=['financial year', 'lhd'], columns='sex', values='rate per 100,000 population').reset_index()

# Rename the columns to match the desired format
df_alt.columns.name = None
df_alt = df_alt.rename(columns={'Males': 'Male rate per 100,000 population', 'Females': 'Female rate per 100,000 population'})

df_alt

Unnamed: 0,financial year,lhd,"Female rate per 100,000 population","Male rate per 100,000 population"
0,2014/2015,Central Coast,373.6,354.3
1,2014/2015,Far West,771.9,603.0
2,2014/2015,Hunter New England,455.6,445.5
3,2014/2015,Illawarra Shoalhaven,410.1,383.6
4,2014/2015,Mid North Coast,495.6,478.8
...,...,...,...,...
130,2022/2023,South Western Sydney,250.3,248.3
131,2022/2023,Southern NSW,440.6,418.8
132,2022/2023,Sydney,206.4,193.3
133,2022/2023,Western NSW,620.9,575.8
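Because `pivot_table` aggregates with the mean by default, duplicate ('financial year', 'lhd', 'sex') combinations would be averaged silently. The following sketch swaps in `DataFrame.pivot`, which raises a `ValueError` on duplicates instead, and then checks that the reduced key is unique. The long-format sample is reconstructed from the first two rows of the output above, and `long_sample` and `wide_sample` are illustrative names.

```python
import pandas as pd

# Long-format sample rebuilt from the first two rows of the output above.
long_sample = pd.DataFrame({
    'sex': ['Males', 'Females', 'Males', 'Females'],
    'lhd': ['Central Coast', 'Central Coast', 'Far West', 'Far West'],
    'financial year': ['2014/2015'] * 4,
    'rate per 100,000 population': [354.3, 373.6, 603.0, 771.9],
})

# pivot() reshapes without aggregating and fails on duplicate keys.
wide_sample = long_sample.pivot(
    index=['financial year', 'lhd'],
    columns='sex',
    values='rate per 100,000 population',
).reset_index()
wide_sample.columns.name = None

# The reduced composite key should now identify each row uniquely.
key_is_unique = not wide_sample.duplicated(subset=['financial year', 'lhd']).any()
print(list(wide_sample.columns))
print(key_is_unique)
```

If `pivot` raises on the full dataset, that indicates duplicate keys that `pivot_table` would otherwise have averaged.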


### Output Alternative Processed Dataset

In [None]:
# File path.
file_path_output_alt = 'data-processed-alt.csv'

# Save the file.
df_alt.to_csv(file_path_output_alt, index=False)