# NSW Local Health Districts and Air Quality Recording Sites - Preprocessing

## Dependencies

Ensure that the required libraries have been installed locally as per the README.md file included in this project.

Run the following cell to import the required dependencies for this notebook.

In [15]:
# Import libraries
import pandas as pd

## Load Dataset

In [16]:
# Load the raw data.
df = pd.read_csv('raw.csv')

## Exploratory Analysis of Raw Data

In [17]:
# Generate summary statistics for object columns.
object_summary_stats = df.describe(include=['O']).transpose()                                                       # Generate summary statistics for object columns.
object_summary_stats['missing_values'] = df.isnull().sum()                                                          # Add missing values to the summary table.
object_summary_stats['present_values'] = df.notnull().sum()                                                         # Add present values to the summary table.
object_summary_stats['datatype'] = df.dtypes                                                                        # Add data types to the summary table.
object_summary_stats = object_summary_stats[['datatype', 'present_values', 'missing_values', 'unique']]             # Select features and reorder table.

# Display the summary tables with titles.
print("Dataset Head:")                                                                                              # Display the dataset head title.
display(df.head().style.set_table_styles([{'selector': 'th', 'props': [('min-width', '100px')]}]))                  # Display the dataset head. For better visualization, set the minimum width of the table headers to 100px.

print("\nObject Summary Statistics:")                                                                               # Display the object summary statistics.
display(object_summary_stats.style.set_table_styles([{'selector': 'th', 'props': [('min-width', '100px')]}]))       # Display the object summary statistics. For better visualization, set the minimum width of the table headers to 100px.

Dataset Head:


| | WKT | name | description |
|---|---|---|---|
| 0 | POINT (148.8180856 -34.4392256) | Albion Park South | Murrumbidgee |
| 1 | POINT (146.9135418 -36.0737293) | Albury | Albury Wodonga Health (Network with Victoria) |
| 2 | POINT (151.1902576 -33.9080273) | Alexandria | Sydney |
| 3 | POINT (150.5772064 -34.28291919999999) | Bargo | South Western Sydney |
| 4 | POINT (149.5786977 -33.4111925) | Bathurst | Western NSW |



Object Summary Statistics:


| | datatype | present_values | missing_values | unique |
|---|---|---|---|---|
| WKT | object | 50 | 0 | 50 |
| name | object | 50 | 0 | 50 |
| description | object | 50 | 0 | 14 |
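
The `unique` count of 14 for `description` against 50 rows means that several recording sites share a district. The following is a minimal sketch of how the sites per district could be counted. It uses a small hypothetical sample shaped like the raw data rather than `raw.csv` itself; the rows named `Site A` and `Site B` are invented for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the shape of the raw data (not the real file).
# 'Site A' and 'Site B' are made-up rows added only to show repeated districts.
sample = pd.DataFrame({
    'name': ['Alexandria', 'Bargo', 'Bathurst', 'Site A', 'Site B'],
    'description': ['Sydney', 'South Western Sydney', 'Western NSW', 'Sydney', 'Sydney'],
})

# Count how many recording sites fall in each district, largest first.
sites_per_lhd = sample['description'].value_counts()
print(sites_per_lhd)
```

Running the same `value_counts()` call on `df['description']` would give the per-district counts for the full dataset.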


## Preprocess

In [18]:
# Data cleaning.
df = df.rename(columns={'name': 'suburb', 'description': 'lhd'})      # Rename columns.
df = df.drop(columns=['WKT'])                                         # Drop the 'WKT' column; its longitude/latitude points are not used in this analysis.
df = df.drop(df[df['suburb'] == 'Albury'].index)                      # Drop the row with 'Albury' as the suburb name. Albury is part of a joint LHD with Victoria and not included in the analysis.
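
The three cleaning steps can also be wrapped in a function so they are easy to re-run and check. This is only a sketch: the function name `clean_sites` is hypothetical, the sample rows are copied from the dataset head above, and the boolean filter is an alternative that should give the same result as dropping the matching index labels.

```python
import pandas as pd

def clean_sites(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the rename, column drop, and Albury filter to a copy of the input."""
    cleaned = raw.rename(columns={'name': 'suburb', 'description': 'lhd'})
    cleaned = cleaned.drop(columns=['WKT'])
    # Keep every row whose suburb is not 'Albury'.
    cleaned = cleaned[cleaned['suburb'] != 'Albury']
    return cleaned

# Three rows taken from the dataset head.
sample = pd.DataFrame({
    'WKT': [
        'POINT (148.8180856 -34.4392256)',
        'POINT (146.9135418 -36.0737293)',
        'POINT (151.1902576 -33.9080273)',
    ],
    'name': ['Albion Park South', 'Albury', 'Alexandria'],
    'description': [
        'Murrumbidgee',
        'Albury Wodonga Health (Network with Victoria)',
        'Sydney',
    ],
})

cleaned = clean_sites(sample)
print(cleaned)
```

The original index labels are kept, so the result has a gap where the Albury row was. Chaining `.reset_index(drop=True)` would renumber the rows if a contiguous index is preferred.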

## Output Processed Dataset

In [19]:
# Save the processed data.
df.to_csv('processed.csv', index=False)
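
Because the file is written with `index=False`, the gapped index left by the dropped row is not saved. A small round-trip sketch illustrates this. It uses a two-row stand-in frame and a temporary directory, both assumed for illustration, rather than the real `processed.csv`.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the processed frame; the index skips 1, as it does after a row is dropped.
processed = pd.DataFrame(
    {'suburb': ['Albion Park South', 'Alexandria'], 'lhd': ['Murrumbidgee', 'Sydney']},
    index=[0, 2],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'processed.csv')  # temporary path, for illustration only
    processed.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The file holds only the two data columns, and the reloaded
# frame receives a fresh default index starting at 0.
print(list(reloaded.columns))
print(list(reloaded.index))
```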

## View Processed Dataset

In [20]:
# Display the processed data.
display(df.head())

| | suburb | lhd |
|---|---|---|
| 0 | Albion Park South | Murrumbidgee |
| 2 | Alexandria | Sydney |
| 3 | Bargo | South Western Sydney |
| 4 | Bathurst | Western NSW |
| 5 | Beresfield | Hunter New England |
