# Respiratory Disease Hospitalizations | Processing

The main tasks completed to clean and preprocess this dataset were:

**Data Manipulation**
1. Rename columns.
2. Reformat 'financial year' values from XX/YY to XXXX/YYYY.
3. Remove 'LHD' from LHD name values.
4. Remove 'All' data (representing a state-wide average).
5. Remove separate age groups, keeping only rows with "All ages".
6. Remove columns holding Confidence Interval data.
7. Remove rows holding 'Persons' data in the sex column (representing a combined rate per 100,000 for all sexes).

## Set Up

Ensure that the required library is installed by running the following command in a terminal before executing the notebook:
- pip install pandas


Run the following cell in the Jupyter notebook first to import the required library:

In [1]:
import pandas as pd

## Load Dataset

In [2]:
# File path.
file_path = 'data-raw.csv'

# Read the file.
df = pd.read_csv(file_path)

df

Unnamed: 0,Acute respiratory infection,Age (years),Sex,LHD,Period,"Rate per 100,000 population",LL 95% CI,UL 95% CI
0,Influenza and pneumonia,0-4 years,Males,Sydney LHD,01/02,817.3,678.7,975.8
1,Influenza and pneumonia,0-4 years,Males,Sydney LHD,02/03,720.5,591.6,869.1
2,Influenza and pneumonia,0-4 years,Males,Sydney LHD,03/04,786,652.2,939.2
3,Influenza and pneumonia,0-4 years,Males,Sydney LHD,04/05,596.6,481.6,730.9
4,Influenza and pneumonia,0-4 years,Males,Sydney LHD,05/06,418.3,323.5,532.2
...,...,...,...,...,...,...,...,...
5395,All acute respiratory infection,All ages,Persons,All LHDs,16/17,671.5,665.9,677
5396,All acute respiratory infection,All ages,Persons,All LHDs,17/18,721.7,716,727.4
5397,All acute respiratory infection,All ages,Persons,All LHDs,18/19,730.2,724.5,736
5398,All acute respiratory infection,All ages,Persons,All LHDs,19/20,633.8,628.5,639.2


## Data Manipulation

Rename columns to match the Air Quality dataset.

In [3]:
# Rename columns.
df = df.rename(columns={
    'LHD': 'lhd',
    'Period': 'financial year'
})

# Set column names to lower case.
df.columns = df.columns.str.lower()

Expand the two-digit years in the 'financial year' column to four digits.

In [4]:
# Reformat 'financial year' values from XX/YY to XXXX/YYYY.
df['financial year'] = df['financial year'].apply(
    lambda x: f'20{x[:2]}/20{x[3:]}' if isinstance(x, str) else x
)
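
As a hedged illustration, the reformatting step can be checked on a few sample values. The helper name `expand_financial_year` and the sample values are assumptions made for this sketch, not part of the notebook. Unlike the lambda above, it only rewrites strings that match the two-digit XX/YY pattern, and it assumes every year falls in the 2000s, as in the rows shown in this dataset (01/02 to 19/20).

```python
import re

import pandas as pd

# Matches exactly two digits, a slash, and two digits (e.g. '01/02').
FINANCIAL_YEAR_PATTERN = re.compile(r'(\d{2})/(\d{2})')


def expand_financial_year(value):
    """Expand 'XX/YY' to '20XX/20YY'; return anything else unchanged."""
    if isinstance(value, str):
        match = FINANCIAL_YEAR_PATTERN.fullmatch(value)
        if match:
            return f'20{match.group(1)}/20{match.group(2)}'
    return value


# Sample values: two valid periods, a stray header value, and a missing value.
sample = pd.Series(['01/02', '19/20', 'Period', None])
expanded = sample.apply(expand_financial_year)
print(expanded.tolist())
```

Leaving non-matching strings untouched means a stray header value such as 'Period' stays recognisable, so it can still be found and removed in a later step.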

Remove 'LHD' from Local Health District values.

In [5]:
# Remove ' LHD' from the 'lhd' column.
df['lhd'] = df['lhd'].str.replace(' LHD', '')

Remove rows representing state-wide aggregated data.

In [6]:
# Remove rows with NaN in the 'lhd' column.
df = df.dropna(subset=['lhd'])

# Remove rows with 'All' in the 'lhd' column.
df = df[~df['lhd'].str.contains('All')]
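
The interaction between the suffix removal and the state-wide filter can be seen on a small sample. This is a minimal sketch on made-up rows (the `rate` column and its values are placeholders), not the notebook's data. Note that stripping ' LHD' turns 'All LHDs' into 'Alls', which is why a substring match on 'All' still catches the state-wide rows.

```python
import pandas as pd

# Hypothetical sample rows; only the 'lhd' values mirror the dataset's format.
sample = pd.DataFrame({
    'lhd': ['Sydney LHD', 'Western NSW LHD', 'All LHDs', None],
    'rate': [1.0, 2.0, 3.0, 4.0],
})

# Strip the literal ' LHD' suffix; 'All LHDs' becomes 'Alls'.
sample['lhd'] = sample['lhd'].str.replace(' LHD', '', regex=False)
stripped = sample['lhd'].tolist()

# Drop missing district names first so the boolean mask contains no NaN.
sample = sample.dropna(subset=['lhd'])

# Drop the state-wide rows with a literal substring match.
sample = sample[~sample['lhd'].str.contains('All', regex=False)]
remaining = sample['lhd'].tolist()
print(remaining)
```

A substring match would also drop any district whose name happened to contain 'All', so an exact comparison may be safer if district names change in future releases.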

Remove columns holding Confidence Interval data.

In [7]:
# Drop columns with '% ci' in the header
df = df.loc[:, ~df.columns.str.contains('% ci')]
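
The column filter depends on the headers having already been lower-cased, since the raw headers read 'LL 95% CI' and 'UL 95% CI'. The sketch below illustrates this on an empty frame whose column names are copied from the raw dataset; the variable names are assumptions made for the example.

```python
import pandas as pd

# Empty frame with the raw header names from the dataset.
sample = pd.DataFrame(columns=[
    'Sex', 'LHD', 'Rate per 100,000 population', 'LL 95% CI', 'UL 95% CI',
])

# Before lower-casing, the lower-case pattern '% ci' matches nothing.
kept_before = sample.loc[:, ~sample.columns.str.contains('% ci', regex=False)]

# After lower-casing, both confidence interval columns are dropped.
sample.columns = sample.columns.str.lower()
kept_after = sample.loc[:, ~sample.columns.str.contains('% ci', regex=False)]

print(list(kept_before.columns))
print(list(kept_after.columns))
```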

Remove rows holding "Persons" data in the sex column.

In [8]:
# Drop rows with 'Persons' in the 'sex' column.
df = df[~df['sex'].str.contains('Persons')]

Remove rows with separate age groups, keeping only "All ages".

In [12]:
# Remove rows with NaN in the 'age (years)' column.
df = df.dropna(subset=['age (years)'])

# Keep only rows with 'All ages' in the 'age (years)' column.
df = df[df['age (years)'].str.contains('All ages')]

# Drop age column.
df = df.drop(columns=['age (years)'])
df

Unnamed: 0,acute respiratory infection,sex,lhd,financial year,"rate per 100,000 population"
900,Influenza and pneumonia,Males,Sydney,2001/2002,345
901,Influenza and pneumonia,Males,Sydney,2002/2003,349.1
902,Influenza and pneumonia,Males,Sydney,2003/2004,362.4
903,Influenza and pneumonia,Males,Sydney,2004/2005,364.8
904,Influenza and pneumonia,Males,Sydney,2005/2006,285.2
...,...,...,...,...,...
5075,All acute respiratory infection,Females,Western NSW,2016/2017,1036.70
5076,All acute respiratory infection,Females,Western NSW,2017/2018,1009.00
5077,All acute respiratory infection,Females,Western NSW,2018/2019,976.2
5078,All acute respiratory infection,Females,Western NSW,2019/2020,884.9


Drop rows where column titles had been entered as data.

In [13]:
# Drop rows where column title had been entered as data.
df = df[~df['sex'].str.contains('Sex')]
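
A repeated header row in a CSV generally causes pandas to read the affected columns as text rather than numbers. The following sketch, on made-up rows, shows an exact-match variant of the filter above and an optional conversion of the rate column with `pd.to_numeric`. The conversion is not part of the notebook and is included only as a suggestion for when numeric rates are needed downstream.

```python
import pandas as pd

# Hypothetical sample in which a header row was entered as data.
sample = pd.DataFrame({
    'sex': ['Males', 'Sex', 'Females'],
    'rate per 100,000 population': [
        '345', 'Rate per 100,000 population', '1036.70',
    ],
})

# Keep only rows whose 'sex' value is not the literal header text.
cleaned = sample[sample['sex'] != 'Sex'].copy()

# Optional: convert the remaining rate strings to floats.
cleaned['rate per 100,000 population'] = pd.to_numeric(
    cleaned['rate per 100,000 population']
)
print(cleaned)
```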

## Output Processed Dataset

In [15]:
# File path.
file_path_output = 'data-processed.csv'

# Save the file.
df.to_csv(file_path_output, index=False)
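
As a quick sanity check, a saved file can be read back to confirm what was written. The sketch below uses a temporary directory and a two-row sample frame (both assumptions made for illustration) to show that `index=False` leaves the original row labels out of the file, so the reloaded frame gets a fresh index starting at 0.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample that keeps non-contiguous row labels, like the filtered df.
sample = pd.DataFrame(
    {
        'lhd': ['Sydney', 'Western NSW'],
        'financial year': ['2001/2002', '2019/2020'],
    },
    index=[900, 5078],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data-processed.csv')

    # index=False writes only the data columns, not the row labels.
    sample.to_csv(path, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(path)

print(reloaded)
```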

## View Dataset

In [14]:
df

Unnamed: 0,acute respiratory infection,sex,lhd,financial year,"rate per 100,000 population"
900,Influenza and pneumonia,Males,Sydney,2001/2002,345
901,Influenza and pneumonia,Males,Sydney,2002/2003,349.1
902,Influenza and pneumonia,Males,Sydney,2003/2004,362.4
903,Influenza and pneumonia,Males,Sydney,2004/2005,364.8
904,Influenza and pneumonia,Males,Sydney,2005/2006,285.2
...,...,...,...,...,...
5075,All acute respiratory infection,Females,Western NSW,2016/2017,1036.70
5076,All acute respiratory infection,Females,Western NSW,2017/2018,1009.00
5077,All acute respiratory infection,Females,Western NSW,2018/2019,976.2
5078,All acute respiratory infection,Females,Western NSW,2019/2020,884.9
