# Preliminary Steps

The first step of this project is to load the data correctly and check for any missing values. The data is stored in the `data` folder.

Each of the given Excel workbooks contains a sheet named `data`, which holds the data for the respective suburb.

The leftmost column (as seen when the workbook is opened in Microsoft Excel) describes the category of each feature; we'll refer to this column as the 'metadata' column.

In the original format, the 'metadata' column contains only a few entries: each category label appears once and is aligned against the multiple feature names to its right that fall into that category. To resolve this entry mismatch, we repeated each category entry for every feature in its block and saved the result as a single `test.xlsx` file. The first column of this file is later used to replace the metadata column of the other workbooks.
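
The repetition of category labels described above could also be done programmatically. The following is a minimal sketch, assuming the blank cells in each block load as `NaN`; the `raw` frame is a hypothetical stand-in for one raw sheet (it reuses a few values from the table shown for `test.xlsx`) and is not the procedure actually used to build that file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a raw sheet: the category label appears only on
# the first row of each block, and the remaining rows are assumed to be NaN.
raw = pd.DataFrame({
    0: ["Community", np.nan, "Geography", np.nan, np.nan],
    1: ["Community Name", "Region", "Map reference", "Grid reference", "Location"],
    2: ["Braybrook (Suburb)", "Northern and Western Metropolitan", 4, "A3",
        "10km WNW of Melbourne"],
})

# Forward-fill copies each category label down to the rows beneath it.
filled = raw.copy()
filled[0] = filled[0].ffill()

print(filled[0].tolist())
```

A forward fill only works if every block starts with its label; a sheet whose first row is blank would keep a leading `NaN`.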

In [1]:
import pandas as pd
import numpy as np

df_test = pd.read_excel('test.xlsx', sheet_name='data', header=None)

df_test

|     | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | Community | Community Name | Braybrook (Suburb) |
| 1 | Community | Region | Northern and Western Metropolitan |
| 2 | Geography | Map reference | 4 |
| 3 | Geography | Grid reference | A3 |
| 4 | Geography | Location | 10km WNW of Melbourne |
| ... | ... | ... | ... |
| 221 | Hospital | Distance to nearest public hospital with emerg... | 10.161988 |
| 222 | Hospital | Presentations to emergency departments due to ... | 543.631989 |
| 223 | Hospital | Presentations to emergency departments due to ... | 20.647263 |
| 224 | Hospital | Category 4 & 5 emergency department presentations | 1683.966712 |


We can now create a reference column that will replace the first column of each of the other workbooks.

A dictionary is also created for looking up the feature-names that belong to the same category, with the categories as the keys.

In [18]:
reference_column=df_test.iloc[:,0]
feature_category_dict=df_test.groupby(0)[1].unique().to_dict()
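
To illustrate how such a lookup dictionary behaves, here is a small sketch on a toy frame. The names `toy`, `toy_dict` and `geography_features` are hypothetical, and the rows are a handful of values taken from the table above rather than the full sheet.

```python
import pandas as pd

# Toy frame with the same layout as df_test: column 0 is the category,
# column 1 is the feature name.
toy = pd.DataFrame({
    0: ["Community", "Community", "Geography", "Geography", "Geography"],
    1: ["Community Name", "Region", "Map reference", "Grid reference", "Location"],
})

# Same construction as feature_category_dict: category -> unique feature names.
toy_dict = toy.groupby(0)[1].unique().to_dict()

# Look up every feature name that belongs to one category.
geography_features = list(toy_dict["Geography"])
print(geography_features)
```

Each dictionary value is an array of feature names in order of first appearance, so wrapping it in `list(...)` is only needed when a plain Python list is preferred.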

## Loading the Excel workbooks

We first check whether the number of features is identical across the datasets. If it is, we load all the datasets into memory and adjust them for downstream tasks.

In [51]:
# utility function to load the excel files
def excel_loader(xlsx_path, sheet_name='data', ref_col=reference_column):
	# load the excel file
	df = pd.read_excel(xlsx_path, sheet_name=sheet_name, header=None)
	# replace the first column with the reference column
	df.iloc[:, 0] = ref_col
	# change the column names
	df.columns = ['feature_kind', 'feature_name', 'feature_value']
	return df

# get the list of Excel workbooks in the data folder

import os
dataset_list=[i for i in os.listdir('./data/') if i.endswith('.xlsx')]

# sort the list
dataset_list.sort()


len_list=[]
feature_num_consistency = True
# record the number of rows (features) in each workbook
for i in dataset_list:
	len_list.append(len(pd.read_excel('./data/'+i, sheet_name='data', header=None)[1]))


# check if the number of features is consistent across all the datasets
if all(x == len_list[0] for x in len_list):
	print('The number of features is consistent across all the datasets')
	print('The number of features is:', len_list[0])
else:
	print('The number of features is not consistent across all the datasets')
	feature_num_consistency = False
	print('The number of features in each dataset is:', len_list)

# load and adjust the excel files, but only if the feature counts match
if feature_num_consistency:
	df_all = [excel_loader('./data/'+i) for i in dataset_list]

The number of features is consistent across all the datasets
The number of features is: 226
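
The consistency check can also be expressed as a small reusable helper. This is a sketch under stated assumptions: `row_counts_consistent` is a hypothetical function that is not part of the notebook, and the toy frames stand in for workbooks that have already been read into memory.

```python
import pandas as pd

def row_counts_consistent(frames):
    """Return (is_consistent, counts) for a dict mapping name -> DataFrame."""
    counts = {name: len(df) for name, df in frames.items()}
    # All row counts are equal exactly when the set of counts has one element.
    return len(set(counts.values())) <= 1, counts

# Toy stand-ins for loaded workbooks (hypothetical file names).
frames = {
    "a.xlsx": pd.DataFrame({0: ["x", "y"], 1: [1, 2]}),
    "b.xlsx": pd.DataFrame({0: ["x", "y"], 1: [3, 4]}),
    "c.xlsx": pd.DataFrame({0: ["x"], 1: [5]}),
}

ok_all, counts_all = row_counts_consistent(frames)
ok_ab, _ = row_counts_consistent({k: frames[k] for k in ("a.xlsx", "b.xlsx")})
print(ok_all, counts_all)
print(ok_ab)
```

Returning the per-file counts alongside the flag makes it easy to see which workbook deviates when the check fails.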


### Checking which sheets contain missing values

In [52]:
# check for missing values in any of the datasets
missing_values = [df.isnull().values.any() for df in df_all]
# collect the indices of the datasets with missing values
dfs_with_na=[i for i, x in enumerate(missing_values) if x]

print('Datasets with missing values:')
for i in dfs_with_na:
	print(f'Index: {i}, Dataset: {dataset_list[i]}')

Datasets with missing values:
Index: 8, Dataset: Malvern-Suburb - XLSX.xlsx
Index: 9, Dataset: Melbourne-Airport-Suburb - XLSX.xlsx
Index: 22, Dataset: Sorrento-Suburb - XLSX.xlsx
Index: 26, Dataset: St-Andrews-Beach-Suburb - XLSX.xlsx
Index: 29, Dataset: St-Kilda-West-Suburb - XLSX.xlsx
Index: 30, Dataset: Toorak-Suburb - XLSX.xlsx
Index: 31, Dataset: Tyabb-Suburb - XLSX.xlsx
Index: 32, Dataset: Waterways-Suburb - XLSX.xlsx
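
Once a dataset is flagged, a natural follow-up is to see which rows and columns hold the missing entries. The sketch below uses a toy frame with the same column names as the loaded datasets; its values and the positions of the `NaN` entries are made up for illustration and do not reflect any of the workbooks listed above.

```python
import numpy as np
import pandas as pd

# Toy frame with the loader's column names; the NaN positions are invented.
toy = pd.DataFrame({
    "feature_kind": ["Community", "Geography", "Hospital"],
    "feature_name": ["Region", "Location", "Some feature"],
    "feature_value": ["Northern and Western Metropolitan", np.nan, np.nan],
})

# Rows where at least one column is missing.
rows_with_na = toy[toy.isnull().any(axis=1)]

# Number of missing entries in each column.
na_per_column = toy.isnull().sum()

print(rows_with_na)
print(na_per_column)
```

Applied to one of the flagged datasets, for example `df_all[8]`, the same two expressions would show where its missing values are.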
