# NSW Air Quality Monthly Averages 2000 - 2024 | Combine Partial Datasets
The dataset was downloaded in 4-year chunks (the final chunk covers 2020-2024) to avoid gateway timeouts on airquality.nsw.gov.au. This notebook outlines the process for combining these partial datasets into one.

No additional pre-processing or cleaning is completed in this workflow.


## Set Up
Ensure that the required libraries are available by running the following commands in a terminal before executing the notebook:
- pip install pandas
- pip install openpyxl (used to write the combined .xlsx output)


Run the following cell in the Jupyter notebook to import pandas and install the package needed to read .xls files:

In [1]:
import pandas as pd

# xlrd lets pandas read the legacy .xls file format.
%pip install xlrd

Note: you may need to restart the kernel to use updated packages.
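
The notebook relies on two optional pandas dependencies: `xlrd` for reading the `.xls` inputs and `openpyxl` for the `engine='openpyxl'` call in the output cell. The following is a minimal sketch for confirming that they can be imported before running the remaining cells. `missing_packages` is a hypothetical helper introduced here, not part of pandas.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# xlrd reads the .xls inputs; openpyxl writes the .xlsx output.
missing = missing_packages(["pandas", "xlrd", "openpyxl"])
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If anything is reported as missing, install it and restart the kernel, as the note above suggests.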


## Load Partial Datasets

In [31]:
# File paths for the datasets.
file_path_2000 = 'partial-datasets/nsw-monthly-air-quality-2000-2003-raw.xls'
file_path_2004 = 'partial-datasets/nsw-monthly-air-quality-2004-2007-raw.xls'
file_path_2008 = 'partial-datasets/nsw-monthly-air-quality-2008-2011-raw.xls'
file_path_2012 = 'partial-datasets/nsw-monthly-air-quality-2012-2015-raw.xls'
file_path_2016 = 'partial-datasets/nsw-monthly-air-quality-2016-2019-raw.xls'
file_path_2020 = 'partial-datasets/nsw-monthly-air-quality-2020-2024-raw.xls'

# Read the datasets.
data_2000 = pd.read_excel(file_path_2000)
data_2004 = pd.read_excel(file_path_2004)
data_2008 = pd.read_excel(file_path_2008)
data_2012 = pd.read_excel(file_path_2012)
data_2016 = pd.read_excel(file_path_2016)
data_2020 = pd.read_excel(file_path_2020)
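
The six file paths share one naming pattern, so they could also be generated from the year ranges rather than typed out. This is only a sketch of that idea: `year_ranges` and `chunk_path` are names introduced here for illustration, and the folder name is taken from the paths in the cell above.

```python
# Start and end year of each downloaded chunk, taken from the file names above.
year_ranges = [(2000, 2003), (2004, 2007), (2008, 2011),
               (2012, 2015), (2016, 2019), (2020, 2024)]


def chunk_path(start, end, folder="partial-datasets"):
    """Build the path of one raw chunk from its start and end year."""
    return f"{folder}/nsw-monthly-air-quality-{start}-{end}-raw.xls"


# Map each chunk's start year to its file path.
file_paths = {start: chunk_path(start, end) for start, end in year_ranges}

for start, path in file_paths.items():
    print(start, path)
```

Each generated path can then be passed to `pd.read_excel` exactly as in the cell above.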




## Combine Datasets

In [32]:
# In each raw frame, the row at position 1 holds the column names and the data starts at position 2.
raw_datasets = [data_2000, data_2004, data_2008, data_2012, data_2016, data_2020]

# Promote the header row to column names and keep only the data rows of each dataset.
datasets = []
for raw in raw_datasets:
    dataset = raw.iloc[2:].copy()
    dataset.columns = raw.iloc[1]
    datasets.append(dataset)

# Concatenate the datasets.
data = pd.concat(datasets)
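
To illustrate what the header promotion and concatenation do, here is a small sketch on made-up frames that mimic the raw layout (a title row, a header row, then data). `promote_header` and the sample values are hypothetical. Note that this sketch passes `ignore_index=True`, which the cell above does not: the notebook keeps each chunk's original row labels, whereas the sketch renumbers rows from 0.

```python
import pandas as pd


def promote_header(raw, header_row=1):
    """Use row `header_row` as the column names and keep only the rows below it."""
    out = raw.iloc[header_row + 1:].copy()
    out.columns = raw.iloc[header_row]
    return out


# Two tiny stand-ins for the raw sheets: a title row, a header row, then data.
raw_a = pd.DataFrame([["title", None], ["Date", "SITE SO2"], ["31/01/2000", 0.1]])
raw_b = pd.DataFrame([["title", None], ["Date", "SITE SO2"], ["31/01/2004", 0.2]])

# Stack the cleaned frames and renumber the rows from 0.
combined = pd.concat([promote_header(raw_a), promote_header(raw_b)], ignore_index=True)
print(combined)
```

Whether to renumber the rows is a design choice. Keeping the original labels records each row's position in its source file, while a fresh index avoids repeated labels across chunks.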

## Output Combined Dataset

In [35]:
# File path for the output dataset.
file_path_output = 'nsw-monthly-air-quality-2000-2024-raw.xlsx'

# Save the concatenated dataset.
data.to_excel(file_path_output, index=False, engine='openpyxl')

# Display the concatenated dataset.
data

1,Date,RANDWICK SO2 monthly average [pphm],ROZELLE SO2 monthly average [pphm],LINDFIELD SO2 monthly average [pphm],LIVERPOOL SO2 monthly average [pphm],BRINGELLY SO2 monthly average [pphm],CHULLORA SO2 monthly average [pphm],WYONG SO2 monthly average [pphm],WALLSEND SO2 monthly average [pphm],CARRINGTON SO2 monthly average [pphm],...,BERESFIELD PM10 monthly average [µg/m³],TAMWORTH PM10 monthly average [µg/m³],WOLLONGONG PM10 monthly average [µg/m³],KEMBLA GRANGE PM10 monthly average [µg/m³],RICHMOND PM10 monthly average [µg/m³],BARGO PM10 monthly average [µg/m³],ALBURY PM10 monthly average [µg/m³],WAGGA WAGGA PM10 monthly average [µg/m³],ST MARYS PM10 monthly average [µg/m³],VINEYARD PM10 monthly average [µg/m³]
2,31/01/2000,0.1,,0.1,,0,,,0.2,,...,16.8,,18.3,,15.3,,,,16.9,15.5
3,29/02/2000,0.1,,0.1,,0.1,,,0.2,,...,20.3,,27.8,,21.2,,,,21.5,18.5
4,31/03/2000,,,0.1,,0,,,0.2,,...,17.9,,21.8,,15,,,,15.7,14.5
5,30/04/2000,,,0.1,,0,,,0.2,,...,15.8,,14.9,,13.6,,,,13.2,15.4
6,31/05/2000,,,0.1,,0,,,0.2,,...,16.9,,15.7,,11.4,,,,11.6,13.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53,30/04/2024,0.1,0.1,,0.1,0,,0,0.2,0.2,...,15.5,12.1,15.8,21.8,13.6,14.1,23.5,,15.6,
54,31/05/2024,0.1,0.1,,0.1,0,,0,0.2,0.2,...,14.8,13,12.9,16.5,11.3,10.7,24.2,,11.7,
55,30/06/2024,0.1,0,,0.1,0,,0,0.2,0.3,...,12.8,12.3,9.5,11.4,9.2,8.6,11.7,,10.3,
56,31/07/2024,0.1,0,,0,0,,0,0.2,0.2,...,13.1,10.5,9.7,14.5,8.6,7.9,10.5,,10.9,
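
Since the combined data comes from separately downloaded chunks, a quick check for missing months can be useful. The sketch below assumes the `Date` values are strings in `DD/MM/YYYY` form, as they appear in the output above. If pandas has already parsed them as dates, the `format` argument is unnecessary. `missing_months` is a hypothetical helper and the sample dates are made up.

```python
import pandas as pd


def missing_months(dates):
    """Return the months (as 'YYYY-MM') absent between the first and last date."""
    months = pd.to_datetime(pd.Series(dates), format="%d/%m/%Y").dt.to_period("M")
    expected = pd.period_range(months.min(), months.max(), freq="M")
    present = set(months)
    return [str(month) for month in expected if month not in present]


# Made-up sample with March 2000 left out on purpose.
sample_dates = ["31/01/2000", "29/02/2000", "30/04/2000"]
print(missing_months(sample_dates))
```

On the real data, the equivalent call would be `missing_months(data["Date"])`. An empty list means every month between the first and last record is present.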
