# Prevalence in Children | Processing

The main tasks completed to clean and preprocess this dataset were:

**Data Manipulation**
1. Rename columns.
2. Remove 'LHD' from LHD name values.
3. Remove rows representing state-wide aggregated data.
4. Remove 'Ever Had Asthma' data, keeping only cases where children had asthma at the time.
5. Remove columns holding Confidence Interval data.

## Set Up

Ensure that all the required libraries are installed by running the following in the terminal:

- pip install pandas

Run the following in the Jupyter notebook to import the required libraries:

In [1]:
import pandas as pd

## Load the Dataset

In [2]:
# File path.
file_path = 'data-raw.csv'

# Read the file.
df = pd.read_csv(file_path)

df

Unnamed: 0,Asthma Type,LHD,Period,Per cent,LL 95% CI,UL 95% CI
0,Current Asthma,Sydney LHD,2002-2004,11.7,7.6,15.8
1,Current Asthma,Sydney LHD,2003-2005,10.7,6.6,14.8
2,Current Asthma,Sydney LHD,2004-2006,13.2,8.6,17.8
3,Current Asthma,Sydney LHD,2005-2007,12.5,8.0,17.0
4,Current Asthma,Sydney LHD,2006-2008,14.6,9.9,19.3
...,...,...,...,...,...,...
507,Ever Had Asthma,All LHDs,2013-2015,20.2,18.6,21.9
508,Ever Had Asthma,All LHDs,2014-2016,18.5,16.9,20.0
509,Ever Had Asthma,All LHDs,2015-2017,18.7,17.2,20.2
510,Ever Had Asthma,All LHDs,2016-2018,19.9,18.4,21.4
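
Before manipulating the data, it can help to confirm that the columns used by the later steps are present. The following is a minimal sketch, assuming a two-row in-memory sample (rows copied from the output above) in place of `data-raw.csv`; the names `sample_csv`, `sample`, `expected_columns` and `missing` are hypothetical and not part of the original notebook.

```python
import io
import pandas as pd

# Two rows copied from the output above stand in for 'data-raw.csv' (assumption).
sample_csv = io.StringIO(
    "Asthma Type,LHD,Period,Per cent,LL 95% CI,UL 95% CI\n"
    "Current Asthma,Sydney LHD,2002-2004,11.7,7.6,15.8\n"
    "Ever Had Asthma,All LHDs,2016-2018,19.9,18.4,21.4\n"
)
sample = pd.read_csv(sample_csv)

# Hypothetical list of the columns the later steps rely on.
expected_columns = ['Asthma Type', 'LHD', 'Period', 'Per cent', 'LL 95% CI', 'UL 95% CI']
missing = [col for col in expected_columns if col not in sample.columns]

print(missing)                         # empty list when every column is present
print(sample['Asthma Type'].unique())  # the asthma categories found in the sample
```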


## Data Manipulation

Rename the columns to match the Air Quality dataset.

In [4]:
# Rename columns.
df = df.rename(columns={
    'LHD': 'lhd',
    'Period': 'financial year'
})

# Set column names to lower case.
df.columns = df.columns.str.lower()

df

Unnamed: 0,asthma type,lhd,financial year,per cent,ll 95% ci,ul 95% ci
0,Current Asthma,Sydney LHD,2002-2004,11.7,7.6,15.8
1,Current Asthma,Sydney LHD,2003-2005,10.7,6.6,14.8
2,Current Asthma,Sydney LHD,2004-2006,13.2,8.6,17.8
3,Current Asthma,Sydney LHD,2005-2007,12.5,8.0,17.0
4,Current Asthma,Sydney LHD,2006-2008,14.6,9.9,19.3
...,...,...,...,...,...,...
507,Ever Had Asthma,All LHDs,2013-2015,20.2,18.6,21.9
508,Ever Had Asthma,All LHDs,2014-2016,18.5,16.9,20.0
509,Ever Had Asthma,All LHDs,2015-2017,18.7,17.2,20.2
510,Ever Had Asthma,All LHDs,2016-2018,19.9,18.4,21.4
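
By default, `DataFrame.rename` silently ignores mapping keys that do not match any column, so a mistyped source name would go unnoticed. The sketch below shows the same rename on a one-row sample, with `errors='raise'` added as an optional safeguard; the sample frame and the variable names `sample` and `renamed` are assumptions for illustration only.

```python
import pandas as pd

# One-row sample with the original headers (values copied from the output above).
sample = pd.DataFrame({
    'Asthma Type': ['Current Asthma'],
    'LHD': ['Sydney LHD'],
    'Period': ['2002-2004'],
    'Per cent': [11.7],
})

# errors='raise' makes rename fail with a KeyError if a source column is missing.
renamed = sample.rename(
    columns={'LHD': 'lhd', 'Period': 'financial year'},
    errors='raise',
)

# Lower-case every header, as in the cell above.
renamed.columns = renamed.columns.str.lower()

print(list(renamed.columns))
# ['asthma type', 'lhd', 'financial year', 'per cent']
```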


Remove 'LHD' from the Local Health District values.

In [5]:
# Remove ' LHD' from the 'lhd' column.
# Note: this plain substring replacement also turns 'All LHDs' into 'Alls'.
df['lhd'] = df['lhd'].str.replace(' LHD', '', regex=False)
df

Unnamed: 0,asthma type,lhd,financial year,per cent,ll 95% ci,ul 95% ci
0,Current Asthma,Sydney,2002-2004,11.7,7.6,15.8
1,Current Asthma,Sydney,2003-2005,10.7,6.6,14.8
2,Current Asthma,Sydney,2004-2006,13.2,8.6,17.8
3,Current Asthma,Sydney,2005-2007,12.5,8.0,17.0
4,Current Asthma,Sydney,2006-2008,14.6,9.9,19.3
...,...,...,...,...,...,...
507,Ever Had Asthma,Alls,2013-2015,20.2,18.6,21.9
508,Ever Had Asthma,Alls,2014-2016,18.5,16.9,20.0
509,Ever Had Asthma,Alls,2015-2017,18.7,17.2,20.2
510,Ever Had Asthma,Alls,2016-2018,19.9,18.4,21.4
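
The output above shows that the substring replacement changes 'All LHDs' into 'Alls'. As an alternative (not the approach used in this notebook), an anchored regular expression strips ' LHD' only when it ends the value. This is a minimal sketch on an illustrative `names` Series; 'Far West LHD' is assumed to be the raw form of 'Far West'.

```python
import pandas as pd

# Illustrative values; 'Far West LHD' is an assumed raw form of 'Far West'.
names = pd.Series(['Sydney LHD', 'Far West LHD', 'All LHDs'])

# Plain substring replacement, as in the cell above: 'All LHDs' becomes 'Alls'.
plain = names.str.replace(' LHD', '', regex=False)

# Anchored alternative: strip ' LHD' only at the end of the value.
anchored = names.str.replace(r' LHD$', '', regex=True)

print(plain.tolist())     # ['Sydney', 'Far West', 'Alls']
print(anchored.tolist())  # ['Sydney', 'Far West', 'All LHDs']
```

With the anchored form, any later filter for the state-wide rows would need to match 'All LHDs' rather than 'Alls'.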


Remove rows representing state-wide aggregated data.

In [7]:
# Remove rows with NaN in the 'lhd' column.
df = df.dropna(subset=['lhd'])

# Remove the state-wide rows: their 'All LHDs' label was reduced to 'Alls'
# when ' LHD' was stripped from the 'lhd' column.
df = df[~df['lhd'].str.contains('Alls')]

df

Unnamed: 0,asthma type,lhd,financial year,per cent,ll 95% ci,ul 95% ci
0,Current Asthma,Sydney,2002-2004,11.7,7.6,15.8
1,Current Asthma,Sydney,2003-2005,10.7,6.6,14.8
2,Current Asthma,Sydney,2004-2006,13.2,8.6,17.8
3,Current Asthma,Sydney,2005-2007,12.5,8.0,17.0
4,Current Asthma,Sydney,2006-2008,14.6,9.9,19.3
...,...,...,...,...,...,...
491,Ever Had Asthma,Far West,2013-2015,19.6,10.6,28.6
492,Ever Had Asthma,Far West,2014-2016,18.0,11.2,24.7
493,Ever Had Asthma,Far West,2015-2017,18.8,12.2,25.3
494,Ever Had Asthma,Far West,2016-2018,17.0,11.2,22.8
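
`str.contains` matches substrings, so it would also remove any district whose name happened to contain 'Alls'. A stricter variant compares the whole value. The sketch below is an assumption-laden illustration on a small `sample` frame, not the notebook's own method.

```python
import pandas as pd

# Small illustrative frame, including a missing name and the 'Alls' label.
sample = pd.DataFrame({'lhd': ['Sydney', 'Far West', 'Alls', None]})

# Drop missing names first, mirroring the cell above.
sample = sample.dropna(subset=['lhd'])

# Exact comparison: only rows whose value is exactly 'Alls' are removed.
district_only = sample[sample['lhd'] != 'Alls']

print(district_only['lhd'].tolist())  # ['Sydney', 'Far West']
```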


Remove rows where children have ever had asthma ('Ever Had Asthma') to limit the dataset to current asthma cases in each period.
Once those rows are removed, drop the asthma type column, since every remaining row has the same value.

In [8]:
# Remove rows with 'Ever Had Asthma' in the 'asthma type' column.
df = df[~df['asthma type'].str.contains('Ever Had Asthma')]
df = df.drop(columns=['asthma type'])

df


Unnamed: 0,lhd,financial year,per cent,ll 95% ci,ul 95% ci
0,Sydney,2002-2004,11.7,7.6,15.8
1,Sydney,2003-2005,10.7,6.6,14.8
2,Sydney,2004-2006,13.2,8.6,17.8
3,Sydney,2005-2007,12.5,8.0,17.0
4,Sydney,2006-2008,14.6,9.9,19.3
...,...,...,...,...,...
235,Far West,2013-2015,15.1,6.4,23.7
236,Far West,2014-2016,14.1,8.1,20.1
237,Far West,2015-2017,14.8,8.7,20.9
238,Far West,2016-2018,12.5,7.4,17.5
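
Filtering keeps the original row labels, which is why the last row above is still labelled 238. If a contiguous index is preferred, `reset_index(drop=True)` renumbers the rows. This is a minimal sketch on three rows taken from the outputs above; the variable names are hypothetical, and the notebook itself does not reset the index.

```python
import pandas as pd

# Three rows taken from the outputs above, keeping their original labels.
sample = pd.DataFrame({
    'asthma type': ['Current Asthma', 'Current Asthma', 'Ever Had Asthma'],
    'lhd': ['Sydney', 'Far West', 'Far West'],
    'per cent': [11.7, 12.5, 17.0],
}, index=[0, 238, 494])

# Keep current cases with an exact comparison, then drop the constant column.
current = sample[sample['asthma type'] == 'Current Asthma']
current = current.drop(columns=['asthma type'])
print(current.index.tolist())     # [0, 238]

# Optional: renumber the rows from zero.
renumbered = current.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```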


Remove columns holding Confidence Interval data.

In [9]:
# Drop columns with '% ci' in the header
df = df.loc[:, ~df.columns.str.contains('% ci')]

df

Unnamed: 0,lhd,financial year,per cent
0,Sydney,2002-2004,11.7
1,Sydney,2003-2005,10.7
2,Sydney,2004-2006,13.2
3,Sydney,2005-2007,12.5
4,Sydney,2006-2008,14.6
...,...,...,...
235,Far West,2013-2015,15.1
236,Far West,2014-2016,14.1
237,Far West,2015-2017,14.8
238,Far West,2016-2018,12.5
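
The column filter works by building a boolean mask over the headers and inverting it. The sketch below makes that mask visible on a one-row sample copied from the output above; `is_ci` and `trimmed` are hypothetical names, and `regex=False` is added here so that '% ci' is treated as literal text.

```python
import pandas as pd

# One row copied from the output above, with the lower-cased headers.
sample = pd.DataFrame({
    'lhd': ['Sydney'],
    'financial year': ['2002-2004'],
    'per cent': [11.7],
    'll 95% ci': [7.6],
    'ul 95% ci': [15.8],
})

# Boolean mask over the headers: True where the header contains '% ci'.
is_ci = sample.columns.str.contains('% ci', regex=False)
print(is_ci.tolist())         # [False, False, False, True, True]

# Keep only the columns where the mask is False.
trimmed = sample.loc[:, ~is_ci]
print(list(trimmed.columns))  # ['lhd', 'financial year', 'per cent']
```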


## Output Processed Data

In [10]:
# File path.
file_path_output = 'data-processed.csv'

# Save the file.
df.to_csv(file_path_output, index=False)
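
Passing `index=False` leaves the row labels out of the file, so the gaps left by filtering (for example 238 as the last label) are not written. A minimal round-trip sketch, assuming an in-memory buffer in place of `data-processed.csv` and two rows copied from the final output; `processed`, `buffer` and `reloaded` are hypothetical names.

```python
import io
import pandas as pd

# Two rows copied from the final output, keeping their original labels.
processed = pd.DataFrame({
    'lhd': ['Sydney', 'Far West'],
    'financial year': ['2002-2004', '2016-2018'],
    'per cent': [11.7, 12.5],
}, index=[0, 238])

# Write to an in-memory buffer instead of a file on disk (assumption).
buffer = io.StringIO()
processed.to_csv(buffer, index=False)
print(buffer.getvalue().splitlines()[0])  # lhd,financial year,per cent

# Reading the data back gives a fresh index starting at zero.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.index.tolist())            # [0, 1]
```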

## View Final Dataset

In [11]:
df

Unnamed: 0,lhd,financial year,per cent
0,Sydney,2002-2004,11.7
1,Sydney,2003-2005,10.7
2,Sydney,2004-2006,13.2
3,Sydney,2005-2007,12.5
4,Sydney,2006-2008,14.6
...,...,...,...
235,Far West,2013-2015,15.1
236,Far West,2014-2016,14.1
237,Far West,2015-2017,14.8
238,Far West,2016-2018,12.5
