# Asthma Deaths | Processing

The main tasks completed to clean and preprocess this dataset were:

**Data Manipulation**
1. Rename columns.
2. Remove the ' LHD' suffix from Local Health District (LHD) names.
3. Remove 'All' rows (representing the state-wide average).
4. Remove columns holding Confidence Interval data.

**Data Normalization**
1. Convert rolling 2-year values to 1-year (financial year) values.

## Set Up

Install the required library by running the following in a terminal before executing the notebook:
- pip install pandas


Run the following cell in the Jupyter notebook first so that the required library is imported:

In [82]:
import pandas as pd

## Load Dataset

In [83]:
# File path.
file_path = 'data-raw.csv'

# Read the file.
df = pd.read_csv(file_path)

## Data Manipulation

Rename columns to match the Air Quality dataset.

In [84]:
# Rename columns.
df = df.rename(columns={
    'LHD': 'lhd',
    'Period': 'date'
})

# Set column names to lower case.
df.columns = df.columns.str.lower()

Remove the ' LHD' suffix from Local Health District values.

In [85]:
# Remove the literal text ' LHD' from the 'lhd' column.
df['lhd'] = df['lhd'].str.replace(' LHD', '', regex=False)

Remove rows representing state-wide aggregated data.

In [86]:
# Remove rows with NaN in the 'lhd' column.
df = df.dropna(subset=['lhd'])

# Remove rows with 'All' in the 'lhd' column.
df = df[~df['lhd'].str.contains('All')]

Remove columns holding Confidence Interval data.

In [87]:
# Drop columns with '% ci' in the header
df = df.loc[:, ~df.columns.str.contains('% ci')]

df

Unnamed: 0,lhd,date,"rate per 100,000 population"
0,Sydney,2011-2013,1.6
1,Sydney,2012-2014,2.4
2,Sydney,2013-2015,1.9
3,Sydney,2014-2016,1.5
4,Sydney,2015-2017,0.8
...,...,...,...
121,Western NSW,2015-2017,2.2
122,Western NSW,2016-2018,2.6
123,Western NSW,2017-2019,2.4
124,Western NSW,2018-2020,2.8
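
The column filter above can be illustrated on a small frame. This is a minimal sketch: the confidence-interval header names (`ll 95% ci`, `ul 95% ci`) and their values are hypothetical, since the raw file's headers are not shown here.

```python
import pandas as pd

# Hypothetical headers and values; only the '% ci' substring matters here.
df_demo = pd.DataFrame({
    'lhd': ['Sydney'],
    'date': ['2011-2013'],
    'rate per 100,000 population': [1.6],
    'll 95% ci': [0.9],
    'ul 95% ci': [2.6],
})

# Boolean mask: True for every header containing the literal text '% ci'.
is_ci = df_demo.columns.str.contains('% ci', regex=False)

# Keep only the columns where the mask is False.
df_demo = df_demo.loc[:, ~is_ci]
kept_columns = list(df_demo.columns)
print(kept_columns)
```

Passing `regex=False` treats the pattern as plain text, which makes the intent explicit even though `'% ci'` contains no regex metacharacters.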


## Data Normalization

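Convert rolling 2-year values to annual values. Each period (for example, 2011-2013) is split into two financial years (2011/2012 and 2012/2013), and each receives half of the period's value. Because consecutive periods overlap, a financial year covered by two periods takes the average of its two halves.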

In [88]:
yearly_data = []

# Iterate over the rows.
for index, row in df.iterrows():
    start_year, end_year = map(int, row['date'].split('-'))
    mid_year = start_year + 1
    split_value = row['rate per 100,000 population'] / 2

    yearly_data.extend([
        {'lhd': row['lhd'], 'financial year': f"{start_year}/{mid_year}", 'rate per 100,000 population': split_value},
        {'lhd': row['lhd'], 'financial year': f"{mid_year}/{end_year}", 'rate per 100,000 population': split_value}
    ])
    
# Create a new DataFrame.
df_yearly = pd.DataFrame(yearly_data)

# Group by 'lhd' and 'financial year'. Average the 'rate per 100,000 population' column.
df_yearly = df_yearly.groupby(['lhd', 'financial year']).mean().reset_index()

# Assign to original DataFrame.
df = df_yearly
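
To make the split-and-average step concrete, the sketch below applies the same logic to the first two Sydney rows from the table above (1.6 for 2011-2013 and 2.4 for 2012-2014). The shortened column name `rate` is used only for this illustration.

```python
import pandas as pd

# Two overlapping Sydney periods taken from the table above.
df_demo = pd.DataFrame({
    'lhd': ['Sydney', 'Sydney'],
    'date': ['2011-2013', '2012-2014'],
    'rate': [1.6, 2.4],  # 'rate' is a shortened, illustrative column name
})

rows = []
for _, row in df_demo.iterrows():
    start_year, end_year = map(int, row['date'].split('-'))
    mid_year = start_year + 1
    half = row['rate'] / 2
    # Each period contributes half its value to each of its two financial years.
    rows.append({'lhd': row['lhd'], 'financial year': f"{start_year}/{mid_year}", 'rate': half})
    rows.append({'lhd': row['lhd'], 'financial year': f"{mid_year}/{end_year}", 'rate': half})

# Financial years covered by two periods are averaged.
result = (
    pd.DataFrame(rows)
    .groupby(['lhd', 'financial year'])['rate']
    .mean()
    .reset_index()
)
print(result)
```

Here 2012/2013 is covered by both periods, so it receives the mean of 0.8 and 1.2, while 2011/2012 and 2013/2014 each keep the single half assigned to them.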

## Output Processed Dataset

In [89]:
# File path.
file_path_output = 'data-processed.csv'

# Save the file.
df.to_csv(file_path_output, index=False)

## View Dataset

In [90]:
df

Unnamed: 0,lhd,financial year,"rate per 100,000 population"
0,Central Coast,2011/2012,0.600
1,Central Coast,2012/2013,0.600
2,Central Coast,2013/2014,0.600
3,Central Coast,2014/2015,0.675
4,Central Coast,2015/2016,0.725
...,...,...,...
135,Western Sydney,2016/2017,1.025
136,Western Sydney,2017/2018,0.900
137,Western Sydney,2018/2019,0.725
138,Western Sydney,2019/2020,0.675


## Alt Output

Set the range of financial years from 2014/2015 to 2023/2024.

In [91]:
# Drop pre-2014/2015 data.
df = df[~df['financial year'].isin(['2011/2012', '2012/2013', '2013/2014'])]

# Add rows for each LHD for the missing years until 2023/2024.
missing_rows = []
lhds = df['lhd'].unique()
years = [f"{year}/{year + 1}" for year in range(2014, 2024)]  # 2014/2015 to 2023/2024.

for lhd in lhds:
    for year in years:
        # Record any (LHD, year) combination that has no row yet.
        if not ((df['lhd'] == lhd) & (df['financial year'] == year)).any():
            missing_rows.append({'lhd': lhd, 'financial year': year, 'rate per 100,000 population': None})

# Create a DataFrame from the missing rows and concatenate it to the original DataFrame.
if missing_rows:
    df_missing = pd.DataFrame(missing_rows)
    df = pd.concat([df, df_missing], ignore_index=True)

# Sort the DataFrame by 'lhd' and 'financial year'.
df = df.sort_values(by=['lhd', 'financial year']).reset_index(drop=True)

# View the last 5 rows of the DataFrame.
df.tail()


Unnamed: 0,lhd,financial year,"rate per 100,000 population"
135,Western Sydney,2019/2020,0.675
136,Western Sydney,2020/2021,0.65
137,Western Sydney,2021/2022,
138,Western Sydney,2022/2023,
139,Western Sydney,2023/2024,
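
The same gap-filling can also be done without a nested loop by reindexing against the full grid of LHD and financial-year combinations. This is a sketch of an alternative, not the approach used in this notebook, and the two-row frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the processed data.
df_demo = pd.DataFrame({
    'lhd': ['Sydney', 'Western Sydney'],
    'financial year': ['2014/2015', '2015/2016'],
    'rate per 100,000 population': [0.75, 0.9],
})

years = [f"{year}/{year + 1}" for year in range(2014, 2024)]

# Every (lhd, financial year) combination that should exist.
full_index = pd.MultiIndex.from_product(
    [df_demo['lhd'].unique(), years],
    names=['lhd', 'financial year'],
)

# Reindexing inserts NaN rows for the combinations that are missing.
df_full = (
    df_demo.set_index(['lhd', 'financial year'])
    .reindex(full_index)
    .reset_index()
)
print(df_full.shape)
```

This relies on each (LHD, financial year) pair appearing at most once, which holds after the earlier `groupby` step.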


Fill missing values using linear interpolation.

In [92]:
# Fill missing values using linear interpolation.
df['rate per 100,000 population'] = df['rate per 100,000 population'].interpolate()

# View the DataFrame.
df

Unnamed: 0,lhd,financial year,"rate per 100,000 population"
0,Central Coast,2014/2015,0.675
1,Central Coast,2015/2016,0.725
2,Central Coast,2016/2017,0.700
3,Central Coast,2017/2018,0.675
4,Central Coast,2018/2019,0.675
...,...,...,...
135,Western Sydney,2019/2020,0.675
136,Western Sydney,2020/2021,0.650
137,Western Sydney,2021/2022,0.650
138,Western Sydney,2022/2023,0.650
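
Because the rows are sorted by LHD, a column-wide `interpolate()` call can bridge the trailing gap of one district towards the first value of the next. The sketch below contrasts that with a per-LHD variant on made-up data; the district names, values, and shortened column name `rate` are all illustrative assumptions, and the grouped version is an alternative rather than the method used above.

```python
import pandas as pd

# Made-up values: district A has a trailing gap, district B starts much higher.
df_demo = pd.DataFrame({
    'lhd': ['A', 'A', 'A', 'B', 'B'],
    'financial year': ['2019/2020', '2020/2021', '2021/2022', '2019/2020', '2020/2021'],
    'rate': [1.0, None, None, 4.0, 5.0],
})

# Column-wide: the gap in A is bridged towards B's first value.
column_wide = df_demo['rate'].interpolate()

# Per district: each group is interpolated on its own, so A's trailing gap
# is filled with A's last known value instead.
per_lhd = df_demo.groupby('lhd')['rate'].transform(lambda s: s.interpolate())

print(column_wide.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
print(per_lhd.tolist())      # [1.0, 1.0, 1.0, 4.0, 5.0]
```

For the last district in the sort order the two approaches agree, since there is no following district to interpolate towards.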


## Output Alt Processed Dataset

In [93]:
df.to_csv('data-processed-alt.csv', index=False)