# Asthma-Like Illness Emergency Department Presentations (monthly) | Processing

The main tasks completed to clean and preprocess this dataset were:

**Data Manipulation**
1. Rename columns.
2. Move the 'date' column to the first position.
3. Remove the ' LHD' suffix from Local Health District (LHD) names.
4. Remove 'All LHDs' rows (the state-wide aggregate).
5. Remove the columns holding confidence interval data.
6. Remove rows with 'Persons' in the sex column (the combined rate per 100,000 population for all sexes).
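
The six steps above can be sketched as a single function. This is only an illustration, not the notebook's own code: the function name `process` is hypothetical, the raw header names other than 'LHD' and 'Period' are assumed, and the numbers in the sample frame are made up.

```python
import pandas as pd


def process(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the six cleaning steps to a frame shaped like the raw export."""
    # 1. Rename columns, then lower-case every header.
    df = raw.rename(columns={'LHD': 'lhd', 'Period': 'date'})
    df.columns = df.columns.str.lower()
    # 3. Strip ' LHD' from the district names.
    df['lhd'] = df['lhd'].str.replace(' LHD', '', regex=False)
    # 4. Drop the state-wide rows ('All LHDs' has become 'Alls' by now).
    df = df[~df['lhd'].str.contains('All')]
    # 5. Drop the confidence interval columns.
    df = df.loc[:, ~df.columns.str.contains('% ci')]
    # 6. Drop the combined 'Persons' rows.
    df = df[~df['sex'].str.contains('Persons')]
    # 2. Move 'date' to the first position (order of the steps does not
    #    change the result, so this is done last).
    return df[['date'] + [c for c in df.columns if c != 'date']]


# Made-up sample with assumed raw header names.
raw = pd.DataFrame({
    'Sex': ['Males', 'Females', 'Persons', 'Persons'],
    'LHD': ['Sydney LHD', 'Sydney LHD', 'Sydney LHD', 'All LHDs'],
    'Period': ['2014-07'] * 4,
    'Rate per 100,000 population': [22.6, 20.1, 21.4, 24.0],
    'LL 95% CI': [17.4, 15.2, 17.9, 22.9],
    'UL 95% CI': [28.8, 26.1, 25.3, 25.2],
})

processed = process(raw)
print(processed)
```

Wrapping the steps in one function makes it easy to rerun the whole cleaning pass on a refreshed export.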

## Set Up

Ensure that the required libraries are installed by running the following command in the terminal:
- pip install pandas


Run the following cell in the Jupyter notebook to import the required libraries:

In [48]:
import pandas as pd

## Load Dataset

In [49]:
# File path.
file_path = 'data-raw.csv'

# Read the file.
df = pd.read_csv(file_path)

## Data Manipulation

Rename columns to match the Air Quality dataset.

In [50]:
# Rename columns.
df = df.rename(columns={
    'LHD': 'lhd',
    'Period': 'date'
})

# Set column names to lower case.
df.columns = df.columns.str.lower()

df

Unnamed: 0,sex,lhd,date,"rate per 100,000 population",ll 95% ci,ul 95% ci
0,Males,Sydney LHD,2014-07,22.6,17.4,28.8
1,Males,Sydney LHD,2014-08,28.9,22.6,36.2
2,Males,Sydney LHD,2014-09,15.7,11.4,21.2
3,Males,Sydney LHD,2014-10,19.2,14.2,25.3
4,Males,Sydney LHD,2014-11,19.7,14.8,25.8
...,...,...,...,...,...,...
4855,Persons,All LHDs,2023-02,24.0,22.9,25.2
4856,Persons,All LHDs,2023-03,22.1,21.1,23.2
4857,Persons,All LHDs,2023-04,25.0,23.8,26.1
4858,Persons,All LHDs,2023-05,30.7,29.5,32.0


Move 'date' column to the first position.

In [51]:
# Move 'date' column to the first position.
date_col = df.pop('date')
df.insert(0, 'date', date_col)

df

Unnamed: 0,date,sex,lhd,"rate per 100,000 population",ll 95% ci,ul 95% ci
0,2014-07,Males,Sydney LHD,22.6,17.4,28.8
1,2014-08,Males,Sydney LHD,28.9,22.6,36.2
2,2014-09,Males,Sydney LHD,15.7,11.4,21.2
3,2014-10,Males,Sydney LHD,19.2,14.2,25.3
4,2014-11,Males,Sydney LHD,19.7,14.8,25.8
...,...,...,...,...,...,...
4855,2023-02,Persons,All LHDs,24.0,22.9,25.2
4856,2023-03,Persons,All LHDs,22.1,21.1,23.2
4857,2023-04,Persons,All LHDs,25.0,23.8,26.1
4858,2023-05,Persons,All LHDs,30.7,29.5,32.0


Remove the ' LHD' suffix from Local Health District values.

In [52]:
# Remove ' LHD' from the 'lhd' column.
df['lhd'] = df['lhd'].str.replace(' LHD', '', regex=False)
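
A small sketch of what this replacement does to the three kinds of value seen in the 'lhd' column. The series `names` is illustrative; `regex=False` is passed so that the pattern is treated as a literal string regardless of the pandas version's default.

```python
import pandas as pd

# District names as they appear in the raw 'lhd' column.
names = pd.Series(['Sydney LHD', 'Western NSW LHD', 'All LHDs'])

# Literal (non-regex) replacement of the ' LHD' substring.
stripped = names.str.replace(' LHD', '', regex=False)

print(stripped.tolist())  # ['Sydney', 'Western NSW', 'Alls']
```

Because the replacement matches a substring rather than a suffix, 'All LHDs' becomes 'Alls'. The next step filters on the substring 'All', so those rows are still removed.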

Remove rows representing state-wide aggregated data.

In [53]:
# Remove rows with 'All' in the 'lhd' column.
df = df[~df['lhd'].str.contains('All')]

Remove columns holding Confidence Interval data.

In [54]:
# Drop columns with '% ci' in the header.
df = df.loc[:, ~df.columns.str.contains('% ci')]
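
As a hedged illustration of the column mask used here, the sketch below builds a one-row frame from the first row of the output shown earlier and compares the mask-based selection with an explicit `drop(columns=...)`. The variable names are illustrative.

```python
import pandas as pd

# One row taken from the output above, with the lower-cased headers.
sample = pd.DataFrame(
    [['2014-07', 'Males', 'Sydney', 22.6, 17.4, 28.8]],
    columns=['date', 'sex', 'lhd', 'rate per 100,000 population',
             'll 95% ci', 'ul 95% ci'],
)

# Boolean mask over the column labels: True where the header contains '% ci'.
is_ci = sample.columns.str.contains('% ci', regex=False)

# Keep every column whose mask value is False.
by_mask = sample.loc[:, ~is_ci]

# Equivalent result when the column names are known in advance.
by_name = sample.drop(columns=['ll 95% ci', 'ul 95% ci'])

print(list(by_mask.columns))
```

The mask form keeps working if the interval columns are renamed slightly, while the explicit form fails loudly if an expected column is missing. Which behaviour is preferable depends on how stable the source export is.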

Remove rows holding 'Persons' data in the sex column.

In [55]:
# Drop rows with 'Persons' in the 'sex' column.
df = df[~df['sex'].str.contains('Persons')]
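
A quick check of this filter on a made-up sample: the substring match used above and an exact comparison give the same result when 'Persons' is the whole cell value. The frame and its numbers are illustrative only.

```python
import pandas as pd

# Made-up sample; only the 'sex' labels mirror the real data.
sample = pd.DataFrame({
    'sex': ['Males', 'Persons', 'Females', 'Persons'],
    'rate per 100,000 population': [22.6, 21.4, 20.1, 24.0],
})

# Substring match, as in the cell above.
by_contains = sample[~sample['sex'].str.contains('Persons')]

# Exact match; equivalent here because 'Persons' is the full cell value.
by_equality = sample[sample['sex'] != 'Persons']

remaining = sorted(by_contains['sex'].unique())
print(remaining)  # ['Females', 'Males']
print(by_contains.equals(by_equality))  # True
```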

## Output Processed Dataset

In [56]:
# File path.
file_path_output = 'data-processed.csv'

# Save the file.
df.to_csv(file_path_output, index=False)
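
The filtering steps leave gaps in the row index, which `index=False` keeps out of the saved file. The sketch below shows this with an in-memory buffer instead of a file on disk; the frame, the short column name 'rate', and the index labels are illustrative.

```python
import io

import pandas as pd

# Illustrative frame whose index has a gap, as after row filtering.
filtered = pd.DataFrame(
    {'date': ['2014-07', '2014-08'],
     'lhd': ['Sydney', 'Sydney'],
     'rate': [22.6, 28.9]},
    index=[0, 5],
)

# Write to an in-memory buffer without the index column.
buffer = io.StringIO()
filtered.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Reading the text back yields a fresh 0..n-1 index.
reloaded = pd.read_csv(io.StringIO(csv_text))

print(csv_text.splitlines()[0])  # date,lhd,rate
print(reloaded.index.tolist())   # [0, 1]
```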

## View Dataset

In [57]:
df

Unnamed: 0,date,sex,lhd,"rate per 100,000 population"
0,2014-07,Males,Sydney,22.6
1,2014-08,Males,Sydney,28.9
2,2014-09,Males,Sydney,15.7
3,2014-10,Males,Sydney,19.2
4,2014-11,Males,Sydney,19.7
...,...,...,...,...
3127,2023-02,Females,Western NSW,41.4
3128,2023-03,Females,Western NSW,43.3
3129,2023-04,Females,Western NSW,43.9
3130,2023-05,Females,Western NSW,57.7


## Alternative Approach
In this layout, the composite primary key is reduced to 'date' and 'lhd'.

The 'Male rate per 100,000 population' and 'Female rate per 100,000 population' values are tracked in separate columns on the same row.

### Reconfigure Table

In [58]:
# Pivot the dataframe so that each 'sex' value becomes its own column.
df_alt = df.pivot_table(
    index=['date', 'lhd'],
    columns='sex',
    values='rate per 100,000 population'
).reset_index()

# Rename the columns to match the desired format.
df_alt.columns.name = None
df_alt = df_alt.rename(columns={
    'Males': 'Male rate per 100,000 population',
    'Females': 'Female rate per 100,000 population'
})

df_alt

Unnamed: 0,date,lhd,"Female rate per 100,000 population","Male rate per 100,000 population"
0,2014-07,Central Coast,29.8,21.8
1,2014-07,Hunter New England,45.3,32.3
2,2014-07,Illawarra Shoalhaven,42.1,26.7
3,2014-07,Mid North Coast,58.4,35.5
4,2014-07,Murrumbidgee,46.2,51.4
...,...,...,...,...
1507,2023-06,South Western Sydney,18.9,18.4
1508,2023-06,Southern NSW,29.7,40.7
1509,2023-06,Sydney,16.9,14.4
1510,2023-06,Western NSW,55.6,38.1
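
`pivot_table` aggregates duplicate keys (by mean, by default), so it can be worth confirming that each ('date', 'lhd', 'sex') combination occurs only once before reshaping. The sketch below is one way to do that on a made-up sample, using `DataFrame.pivot`, which raises an error on duplicates instead of averaging them. The frame `long_df` and its numbers are illustrative.

```python
import pandas as pd

# Made-up long-format sample in the shape of the processed dataset.
long_df = pd.DataFrame({
    'date': ['2014-07', '2014-07', '2014-08', '2014-08'],
    'lhd': ['Sydney'] * 4,
    'sex': ['Males', 'Females', 'Males', 'Females'],
    'rate per 100,000 population': [22.6, 20.1, 28.9, 25.0],
})

# Count rows that repeat an earlier ('date', 'lhd', 'sex') combination.
key = ['date', 'lhd', 'sex']
n_dupes = int(long_df.duplicated(subset=key).sum())
print(n_dupes)  # 0

# pivot() performs no aggregation and raises ValueError on duplicate keys.
wide = long_df.pivot(
    index=['date', 'lhd'],
    columns='sex',
    values='rate per 100,000 population',
).reset_index()
wide.columns.name = None

print(list(wide.columns))  # ['date', 'lhd', 'Females', 'Males']
```

When the key is unique, `pivot` and `pivot_table` produce the same values, so the choice mainly affects how duplicate rows would be handled.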


### Output Alternative Processed Dataset

In [59]:
# File path.
file_path_output_alt = 'data-processed-alt.csv'

# Save the file.
df_alt.to_csv(file_path_output_alt, index=False)