# Chronic Obstructive Pulmonary Disease Deaths | Processing

The main tasks completed to clean and preprocess this dataset were:

**Data Cleaning**
1. Drop columns with irrelevant/incompatible data (CI, Risk group)
2. Drop rows with 'Persons' in Sex (remove redundant aggregates)
3. Drop rows with 'All LHDs' in LHD (remove redundant aggregates)

**Data Normalization**
1. Rename columns.
2. Remove 'LHD' from LHD name values.
3. Normalize 'financial year' to match other processed datasets.

## Set Up

Install the required libraries by running the following in the terminal before execution:
- pip install pandas

Then import the required libraries in the Jupyter notebook:

In [63]:
import pandas as pd

## Load Dataset

In [64]:
# Read the raw data file, skipping the first row as it is redundant
df_copd_deaths = pd.read_csv("data-raw.csv", skiprows=1)

df_copd_deaths

Unnamed: 0,Risk group,Sex,LHD,Period,"Rate per 100,000 population",LL 95% CI,UL 95% CI
0,Older adults (65+ years),Males,Sydney LHD,2011-2012,192.2,158.8,230.4
1,Older adults (65+ years),Males,Sydney LHD,2012-2013,199.8,166.3,237.9
2,Older adults (65+ years),Males,Sydney LHD,2013-2014,227.2,192.1,266.8
3,Older adults (65+ years),Males,Sydney LHD,2014-2015,212.0,178.8,249.6
4,Older adults (65+ years),Males,Sydney LHD,2015-2016,164.4,135.7,197.3
...,...,...,...,...,...,...,...
955,All ages,Persons,All LHDs,2016-2017,25.8,25.1,26.6
956,All ages,Persons,All LHDs,2017-2018,24.3,23.6,25.0
957,All ages,Persons,All LHDs,2018-2019,23.0,22.3,23.6
958,All ages,Persons,All LHDs,2019-2020,21.9,21.3,22.6
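
As a minimal sketch of what `skiprows=1` does, the example below reads a small in-memory CSV whose first line is not part of the table. The contents of `raw_text`, including the title line, are hypothetical, since the first row of `data-raw.csv` is not shown in this notebook.

```python
import io

import pandas as pd

# Hypothetical file contents: a title line followed by the real header row.
raw_text = (
    "Some title line\n"
    "Sex,LHD,Period\n"
    "Males,Sydney LHD,2011-2012\n"
)

# skiprows=1 discards the first physical line, so the second line becomes the header.
frame = pd.read_csv(io.StringIO(raw_text), skiprows=1)

print(list(frame.columns))
```

Without `skiprows=1`, pandas would try to use the title line as the header.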


## Data Cleaning

Remove aggregate and/or duplicate data entries.

In [65]:
# Remove the 'Persons' aggregate rows from Sex
df_copd_deaths = df_copd_deaths[df_copd_deaths['Sex'] != 'Persons']

# Remove the 'All LHDs' aggregate rows from LHD
df_copd_deaths = df_copd_deaths[df_copd_deaths['LHD'] != 'All LHDs']

# Remove the 'Older adults (65+ years)' rows from Risk group, as they overlap with 'All ages'
df_copd_deaths = df_copd_deaths[df_copd_deaths['Risk group'] != 'Older adults (65+ years)']

df_copd_deaths

Unnamed: 0,Risk group,Sex,LHD,Period,"Rate per 100,000 population",LL 95% CI,UL 95% CI
480,All ages,Males,Sydney LHD,2011-2012,26.5,22.2,31.5
481,All ages,Males,Sydney LHD,2012-2013,27.8,23.4,32.7
482,All ages,Males,Sydney LHD,2013-2014,31.1,26.6,36.3
483,All ages,Males,Sydney LHD,2014-2015,29.5,25.1,34.4
484,All ages,Males,Sydney LHD,2015-2016,23.3,19.5,27.7
...,...,...,...,...,...,...,...
785,All ages,Females,Far West LHD,2016-2017,18.9,8.4,38.6
786,All ages,Females,Far West LHD,2017-2018,20.7,9.4,41.5
787,All ages,Females,Far West LHD,2018-2019,28.1,14.5,50.9
788,All ages,Females,Far West LHD,2019-2020,36.3,21.3,59.9
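
The three filters above can also be combined into a single boolean mask. The sketch below shows this on a toy frame; the names `sample`, `keep` and `cleaned` are hypothetical, and the rows are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Toy rows that mimic the raw column layout (illustrative only).
sample = pd.DataFrame({
    "Risk group": ["All ages", "All ages", "All ages", "Older adults (65+ years)"],
    "Sex": ["Males", "Persons", "Females", "Males"],
    "LHD": ["Sydney LHD", "Sydney LHD", "All LHDs", "Sydney LHD"],
    "Period": ["2011-2012"] * 4,
})

# A row is kept only if it passes all three conditions.
keep = (
    (sample["Sex"] != "Persons")
    & (sample["LHD"] != "All LHDs")
    & (sample["Risk group"] != "Older adults (65+ years)")
)
cleaned = sample[keep]

print(cleaned)
```

Only the first toy row survives, because each of the other rows fails exactly one condition.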


Drop unnecessary columns.

In [66]:
df_copd_deaths = df_copd_deaths.drop(columns=[col for col in df_copd_deaths if 'Risk group' in col or 'CI' in col])
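
Because the drop above selects columns by substring, it can help to inspect which names match before dropping them. The sketch below applies the same condition to an empty frame that carries the column names shown in the raw output; `frame`, `to_drop` and `trimmed` are hypothetical names.

```python
import pandas as pd

# Column names as displayed in the raw dataset output.
columns = [
    "Risk group", "Sex", "LHD", "Period",
    "Rate per 100,000 population", "LL 95% CI", "UL 95% CI",
]
frame = pd.DataFrame(columns=columns)

# Same substring test as the notebook cell, kept in a list so it can be checked first.
to_drop = [col for col in frame.columns if "Risk group" in col or "CI" in col]
trimmed = frame.drop(columns=to_drop)

print(to_drop)
print(list(trimmed.columns))
```

A substring test such as `'CI' in col` would also match any other column name containing those letters, so an explicit list of names may be safer if the raw layout changes.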

## Data Normalization

Rename columns to match processed data standards.

In [67]:
# Rename columns.
df_copd_deaths = df_copd_deaths.rename(columns={
    'LHD': 'lhd',
    'Period': 'financial year'
})

# Set column names to lower case.
df_copd_deaths.columns = df_copd_deaths.columns.str.lower()

Remove 'LHD' from LHD row entries.

In [68]:
# Remove ' LHD' from the 'lhd' column.
df_copd_deaths['lhd'] = df_copd_deaths['lhd'].str.replace(' LHD', '', regex=False)
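
The replacement above removes ' LHD' wherever it occurs in a value. As a minimal sketch, assuming the text should only be stripped when it is a trailing suffix, the pattern can be anchored to the end of the string. The sample names are LHDs that appear in this notebook; `names` and `stripped` are hypothetical variables.

```python
import pandas as pd

names = pd.Series(["Sydney LHD", "Far West LHD", "Western NSW LHD"])

# The '$' anchor restricts the match to a trailing ' LHD'.
stripped = names.str.replace(r" LHD$", "", regex=True)

print(stripped.tolist())
```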

Format financial year to match processed data standards.

In [69]:
# Reformat 'financial year' values from XXXX-YYYY to XXXX/YYYY.
df_copd_deaths['financial year'] = df_copd_deaths['financial year'].str.replace('-', '/', regex=False)
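
A minimal sketch of the same replacement on a small sample, followed by a check that every value ends up in the XXXX/YYYY form. The check is an addition for illustration, not part of the original processing, and `periods`, `normalised` and `is_valid` are hypothetical names.

```python
import pandas as pd

periods = pd.Series(["2011-2012", "2019-2020"])

# Literal replacement of the hyphen with a slash.
normalised = periods.str.replace("-", "/", regex=False)

# True for each value that consists of four digits, a slash, then four digits.
is_valid = normalised.str.fullmatch(r"\d{4}/\d{4}")

print(normalised.tolist())
print(bool(is_valid.all()))
```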

## Alternative Output Config

In [70]:
# Pivot the dataframe to have 'sex' as columns
df_alt = df_copd_deaths.pivot_table(index=['financial year', 'lhd'], columns='sex', values='rate per 100,000 population').reset_index()

# Rename the columns to match the desired format
df_alt.columns.name = None
df_alt = df_alt.rename(columns={'Males': 'Male rate per 100,000 population', 'Females': 'Female rate per 100,000 population'})

df_alt

Unnamed: 0,financial year,lhd,"Female rate per 100,000 population","Male rate per 100,000 population"
0,2011/2012,Central Coast,24.9,33.1
1,2011/2012,Far West,30.9,38.5
2,2011/2012,Hunter New England,19.1,34.3
3,2011/2012,Illawarra Shoalhaven,20.3,34.0
4,2011/2012,Mid North Coast,18.8,38.8
...,...,...,...,...
145,2020/2021,South Western Sydney,15.4,23.3
146,2020/2021,Southern NSW,22.1,32.6
147,2020/2021,Sydney,9.8,21.6
148,2020/2021,Western NSW,31.0,37.6
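
`pivot_table` aggregates with the mean by default, so any duplicate (financial year, lhd, sex) combinations would be averaged silently. The sketch below checks for such duplicates before pivoting. It uses a toy long-form frame with placeholder LHD names and rates; `long_form`, `has_duplicates` and `wide` are hypothetical names.

```python
import pandas as pd

# Toy long-form data with placeholder values (not from the dataset).
long_form = pd.DataFrame({
    "sex": ["Males", "Females", "Males", "Females"],
    "lhd": ["A", "A", "B", "B"],
    "financial year": ["2011/2012"] * 4,
    "rate per 100,000 population": [1.0, 2.0, 3.0, 4.0],
})

# Each (financial year, lhd, sex) combination should appear at most once.
key = ["financial year", "lhd", "sex"]
has_duplicates = long_form.duplicated(subset=key).any()

# Same pivot as the notebook cell: one row per year and LHD, one column per sex.
wide = long_form.pivot_table(
    index=["financial year", "lhd"],
    columns="sex",
    values="rate per 100,000 population",
).reset_index()
wide.columns.name = None

print(bool(has_duplicates))
print(wide)
```

If `has_duplicates` is `False`, the mean is taken over a single value per cell, so the pivoted rates equal the original ones.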


## Output Processed Dataset

In [72]:
df_copd_deaths.to_csv("data-processed.csv", index=False)

df_alt.to_csv("data-processed-alt.csv", index=False)

df_copd_deaths

Unnamed: 0,sex,lhd,financial year,"rate per 100,000 population"
480,Males,Sydney,2011/2012,26.5
481,Males,Sydney,2012/2013,27.8
482,Males,Sydney,2013/2014,31.1
483,Males,Sydney,2014/2015,29.5
484,Males,Sydney,2015/2016,23.3
...,...,...,...,...
785,Females,Far West,2016/2017,18.9
786,Females,Far West,2017/2018,20.7
787,Females,Far West,2018/2019,28.1
788,Females,Far West,2019/2020,36.3
