# Asthma Deaths | Processing

The main tasks completed to clean and preprocess this dataset were:

**Data Manipulation**
1. Rename columns.
2. Remove 'LHD' from LHD name values.
3. Remove 'All' rows (representing a state-wide average).
4. Remove columns holding Confidence Interval data.

**Data Normalization**
1. Convert 2-year totals to 1-year totals.

## Set Up

Install the required library by running the following command in a terminal before executing the notebook:
- pip install pandas


Run the following cell in the Jupyter notebook first to import the required library:

In [37]:
import pandas as pd

## Load Dataset

In [38]:
# File path.
file_path = 'data-raw.csv'

# Read the file.
df = pd.read_csv(file_path)

## Data Manipulation

Rename columns to match the Air Quality dataset.

In [39]:
# Rename columns.
df = df.rename(columns={
    'LHD': 'lhd',
    'Period': 'date'
})

# Set column names to lower case.
df.columns = df.columns.str.lower()

Remove the ' LHD' suffix from Local Health District values.

In [40]:
# Remove ' LHD' from the 'lhd' column.
df['lhd'] = df['lhd'].str.replace(' LHD', '')
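
As a minimal sketch of how this replacement behaves, the example below runs it on a few made-up names (the sample values are assumptions, not rows from the raw file). It also shows a hypothetical anchored variant that only strips a trailing ' LHD', which may be safer if a name could contain the substring elsewhere.

```python
import pandas as pd

# Hypothetical sample names; the real values come from data-raw.csv.
names = pd.Series(['Central Coast LHD', 'Western Sydney LHD', 'All LHDs'])

# Literal replacement, equivalent to the cell above.
# regex=False makes the literal intent explicit across pandas versions.
literal = names.str.replace(' LHD', '', regex=False)

# Anchored variant (an assumption, not the notebook's method):
# only removes ' LHD' when it ends the string.
anchored = names.str.replace(r'\s+LHD$', '', regex=True)

print(literal.tolist())
print(anchored.tolist())
```

Note that the literal form also rewrites 'All LHDs', while the anchored form leaves it untouched; in both cases the value still contains 'All'.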

Remove rows representing state-wide aggregated data.

In [41]:
# Remove rows with NaN in the 'lhd' column.
df = df.dropna(subset=['lhd'])

# Remove rows with 'All' in the 'lhd' column.
df = df[~df['lhd'].str.contains('All')]
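
The missing values are dropped first because `str.contains` returns NaN for them, and such a mask cannot be safely negated. A possible alternative, sketched here on made-up data (the sample frame and the `value` column are assumptions), is to pass `na=False` so the mask stays boolean:

```python
import pandas as pd

# Hypothetical sample; 'value' stands in for the real measurement column.
sample = pd.DataFrame({
    'lhd': ['Central Coast', None, 'All LHDs', 'Western Sydney'],
    'value': [1.2, 0.9, 1.1, 1.4],
})

# na=False treats missing names as "no match", so the mask is purely boolean.
is_statewide = sample['lhd'].str.contains('All', na=False)

# Keep rows that have a name and are not the state-wide aggregate.
filtered = sample[sample['lhd'].notna() & ~is_statewide].reset_index(drop=True)

print(filtered)
```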

Remove columns holding Confidence Interval data.

In [42]:
# Drop columns with '% ci' in the header
df = df.loc[:, ~df.columns.str.contains('% ci')]
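
To illustrate the column filter, the sketch below applies the same mask to a one-row frame. The confidence-interval header names (`ll 95% ci`, `ul 95% ci`) are assumptions chosen for the example; the actual headers in the raw file may differ.

```python
import pandas as pd

# Hypothetical frame; the CI column names are illustrative only.
sample = pd.DataFrame({
    'lhd': ['Central Coast'],
    'rate per 100,000 population': [1.2],
    'll 95% ci': [0.8],
    'ul 95% ci': [1.7],
})

# Boolean mask over the column labels: True for columns to keep.
keep = ~sample.columns.str.contains('% ci', regex=False)
trimmed = sample.loc[:, keep]

print(list(trimmed.columns))
```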

## Data Normalization

Convert rolling 2-Year totals to annual totals.

In [43]:
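# Value column to convert (name as shown in the View Dataset output).
col = 'rate per 100,000 population'

yearly_data = []

# Iterate over the rows, splitting each 2-year period into two financial years.
for index, row in df.iterrows():
    start_year, end_year = map(int, row['date'].split('-'))
    mid_year = start_year + 1
    split_value = row[col] / 2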

    yearly_data.extend([
        {'lhd': row['lhd'], 'financial year': f"{start_year}/{mid_year}", col: split_value},
        {'lhd': row['lhd'], 'financial year': f"{mid_year}/{end_year}", col: split_value}
    ])
    
# Create a new DataFrame.
df_yearly = pd.DataFrame(yearly_data)

# Group by 'lhd' and 'financial year'. Average the half-values from overlapping periods.
df_yearly = df_yearly.groupby(['lhd', 'financial year']).mean().reset_index()

# Assign to original DataFrame.
df = df_yearly
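
The effect of this step is easier to see on a small worked example. The sketch below is a vectorized variant of the same idea, not the notebook's own code. It assumes every period spans exactly two years in `start-end` form, and it uses a hypothetical `value` column and made-up numbers. Each period gives half its value to each of its two financial years, and a year covered by two overlapping periods receives the mean of the two halves.

```python
import pandas as pd

col = 'value'  # hypothetical column name for this example

# Two overlapping 2-year periods for one district (made-up numbers).
sample = pd.DataFrame({
    'lhd': ['Central Coast', 'Central Coast'],
    'date': ['2011-2013', '2012-2014'],
    col: [2.0, 4.0],
})

# First year of each period, e.g. 2011 from '2011-2013'.
start = sample['date'].str.split('-').str[0].astype(int)

# Build one frame per half: offset 0 is the first financial year, 1 the second.
halves = [
    pd.DataFrame({
        'lhd': sample['lhd'],
        'financial year': (start + offset).astype(str) + '/' + (start + offset + 1).astype(str),
        col: sample[col] / 2,
    })
    for offset in (0, 1)
]

# Average the halves that land in the same district and financial year.
annual = (
    pd.concat(halves, ignore_index=True)
    .groupby(['lhd', 'financial year'], as_index=False)[col]
    .mean()
)

print(annual)
```

Here 2011/2012 receives 1.0 (half of 2.0), 2013/2014 receives 2.0 (half of 4.0), and the shared year 2012/2013 receives the mean of 1.0 and 2.0, which is 1.5.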

## Output Processed Dataset

In [44]:
# File path.
file_path_output = 'data-processed.csv'

# Save the file.
df.to_csv(file_path_output, index=False)

## View Dataset

In [45]:
df

| | lhd | financial year | rate per 100,000 population |
|---|---|---|---|
| 0 | Central Coast | 2011/2012 | 0.600 |
| 1 | Central Coast | 2012/2013 | 0.600 |
| 2 | Central Coast | 2013/2014 | 0.600 |
| 3 | Central Coast | 2014/2015 | 0.675 |
| 4 | Central Coast | 2015/2016 | 0.725 |
| ... | ... | ... | ... |
| 135 | Western Sydney | 2016/2017 | 1.025 |
| 136 | Western Sydney | 2017/2018 | 0.900 |
| 137 | Western Sydney | 2018/2019 | 0.725 |
| 138 | Western Sydney | 2019/2020 | 0.675 |
