# Asthma Hospitalisations | Processing

The main tasks completed to clean and preprocess this dataset were:

**Data Manipulation**
1. Rename columns.
2. Reformat 'financial year' values from XX/YY to XXXX/YYYY.
3. Remove ' LHD' from LHD name values.
4. Remove 'All LHDs' data (representing state-wide aggregated data).
5. Remove 'All ages' data (representing an aggregate of all risk groups).
6. Remove columns holding confidence interval data.

## Set Up

Ensure that the required libraries are installed by running the following command in the terminal:
- pip install pandas


Run the following in the Jupyter notebook to import the required libraries:

In [1]:
import pandas as pd

## Load Dataset

In [2]:
# File path.
file_path = 'data-raw.csv'

# Read the file.
df = pd.read_csv(file_path)

df

Unnamed: 0,Sex,LHD,Risk group,Period,"Rate per 100,000 population",LL 95% CI,UL 95% CI
0,Males,Sydney LHD,5-34 years,01/02,123.6,100.8,149.7
1,Males,Sydney LHD,5-34 years,02/03,112.8,91.3,137.5
2,Males,Sydney LHD,5-34 years,03/04,120.0,97.9,145.5
3,Males,Sydney LHD,5-34 years,04/05,124.6,101.8,150.8
4,Males,Sydney LHD,5-34 years,05/06,117.6,95.3,143.4
...,...,...,...,...,...,...,...
4027,Persons,All LHDs,All ages,17/18,140.3,137.7,143.0
4028,Persons,All LHDs,All ages,18/19,145.0,142.4,147.8
4029,Persons,All LHDs,All ages,19/20,119.5,117.1,122.0
4030,Persons,All LHDs,All ages,20/21,95.4,93.2,97.6


## Data Manipulation

Rename columns to match Air Quality data set.

In [3]:
# Rename columns.
df = df.rename(columns={
    'LHD': 'lhd',
    'Period': 'financial year'
})

# Set column names to lower case.
df.columns = df.columns.str.lower()

Make the year data in the 'financial year' column more verbose.

In [4]:
# Reformat 'financial year' values from XX/YY to XXXX/YYYY.
# Note: this assumes every year falls in the 2000s.
df['financial year'] = df['financial year'].apply(
    lambda x: f'20{x[:2]}/20{x[3:]}' if isinstance(x, str) else x
)
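
The same reformatting can be sketched on a few toy values. This is a minimal illustration rather than the notebook's own cell: the `periods` series is a hypothetical sample, and the vectorised `str.replace` call is swapped in for the `lambda`. It relies on the same assumption that every year falls in the 2000s.

```python
import pandas as pd

# Hypothetical sample of raw 'XX/YY' values, including one missing entry.
periods = pd.Series(['01/02', '19/20', '20/21', None])

# Prefix both two-digit years with '20'. Values that do not match the
# 'XX/YY' pattern (including missing values) are left unchanged.
expanded = periods.str.replace(
    r'^(\d{2})/(\d{2})$', r'20\1/20\2', regex=True
)

print(expanded.tolist())
```

Anchoring the pattern with `^` and `$` means a value that is already in XXXX/YYYY form is not expanded a second time if the cell is re-run.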

Remove ' LHD' from Local Health District values.

In [5]:
# Remove ' LHD' from the 'lhd' column.
df['lhd'] = df['lhd'].str.replace(' LHD', '', regex=False)

Remove rows representing state-wide aggregated data.

In [6]:
# Remove rows with NaN in the 'lhd' column.
df = df.dropna(subset=['lhd'])

# Remove rows with 'All' in the 'lhd' column.
df = df[~df['lhd'].str.contains('All')]
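
The interaction between the two 'lhd' steps can be checked on a toy frame. This is a sketch under stated assumptions: the `toy` frame and its `rate` values are made up for illustration. It shows that stripping ' LHD' turns 'All LHDs' into 'Alls', which the 'All' filter still removes.

```python
import pandas as pd

# Hypothetical rows; the 'rate' values are placeholders, not real data.
toy = pd.DataFrame({
    'lhd': ['Sydney LHD', 'Far West LHD', 'All LHDs', None],
    'rate': [1.0, 2.0, 3.0, 4.0],
})

# Strip the literal ' LHD' substring; 'All LHDs' becomes 'Alls'.
toy['lhd'] = toy['lhd'].str.replace(' LHD', '', regex=False)
stripped = toy['lhd'].tolist()

# Drop missing names first so the boolean mask contains no NaN values.
toy = toy.dropna(subset=['lhd'])

# Remove the state-wide rows, which still contain 'All'.
toy = toy[~toy['lhd'].str.contains('All', regex=False)]
remaining = toy['lhd'].tolist()

print(remaining)
```

Because `str.contains` matches substrings, this filter would also drop any district whose name happened to include 'All'; an exact comparison against the state-wide label would be stricter if that ever became a concern.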

Remove rows representing an aggregate of all risk groups within a Local Health District.

In [7]:
# Remove rows with NaN in the 'risk group' column.
df = df.dropna(subset=['risk group'])

# Remove rows with 'All ages' in the 'risk group' column.
df = df[~df['risk group'].str.contains('All ages')]

Remove columns holding Confidence Interval data.

In [8]:
# Drop columns with '% ci' in the header (column names are already lower case).
df = df.loc[:, ~df.columns.str.contains('% ci')]
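
A small sketch of the column mask, using an empty frame with the lower-cased headers from this dataset. The names `toy`, `is_ci`, and `trimmed` are illustrative only.

```python
import pandas as pd

# Empty frame carrying the lower-cased column headers.
toy = pd.DataFrame(columns=[
    'sex', 'lhd', 'risk group', 'financial year',
    'rate per 100,000 population', 'll 95% ci', 'ul 95% ci',
])

# Boolean array marking headers that contain the literal text '% ci'.
is_ci = toy.columns.str.contains('% ci', regex=False)

# Keep every column that is not flagged as a confidence-interval column.
trimmed = toy.loc[:, ~is_ci]

print(trimmed.columns.tolist())
```

The match is case-sensitive, so it only works after the headers have been lower-cased; run against the raw headers ('LL 95% CI'), the mask would flag nothing.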

## Output Processed Dataset

In [9]:
# File path.
file_path_output = 'data-processed.csv'

# Save the file.
df.to_csv(file_path_output, index=False)
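
The effect of `index=False` can be illustrated with an in-memory round trip. This is a sketch only: it writes a hypothetical two-row frame with placeholder rates to `io.StringIO` buffers instead of files on disk.

```python
import io

import pandas as pd

# Hypothetical processed rows; the rate values are placeholders.
processed = pd.DataFrame({
    'sex': ['Males', 'Persons'],
    'lhd': ['Sydney', 'Far West'],
    'rate per 100,000 population': [1.0, 2.0],
})

# Write without the index, then read the CSV text back.
without_index = io.StringIO()
processed.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

# Write with the default index=True for comparison.
with_index = io.StringIO()
processed.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

print(reloaded.columns.tolist())
print(reloaded_with_index.columns.tolist())
```

When the index is written, it comes back as an extra leading column named 'Unnamed: 0' on reload, which is why the save step passes `index=False`.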

## View Dataset

In [10]:
df

Unnamed: 0,sex,lhd,risk group,financial year,"rate per 100,000 population"
0,Males,Sydney,5-34 years,2001/2002,123.6
1,Males,Sydney,5-34 years,2002/2003,112.8
2,Males,Sydney,5-34 years,2003/2004,120.0
3,Males,Sydney,5-34 years,2004/2005,124.6
4,Males,Sydney,5-34 years,2005/2006,117.6
...,...,...,...,...,...
3922,Persons,Far West,65+ years,2017/2018,152.4
3923,Persons,Far West,65+ years,2018/2019,69.2
3924,Persons,Far West,65+ years,2019/2020,72.5
3925,Persons,Far West,65+ years,2020/2021,80.6
