# Data Integration and Data Reshaping of Victorian Housing Data


In this project, I integrate several input datasets that arrive in very different formats (HTML, JSON, XML, PDF, XLSX, TXT, shapefiles) into one unified dataset. I then study how different normalization and transformation methods affect selected attributes, and how those choices affect a linear model built on them.

The data schema of the final unified dataset is as follows:


## Table of Contents
1. [Load and Parse data files of different formats](#1)
2. [Data Integration](#2)
3. [Data Reshaping](#3)

In [2]:
#Load libraries
%matplotlib inline

#For dataframe manipulation
import pandas as pd
from pandas import DataFrame
import numpy as np

#For parsing data in xml and html format
from bs4 import BeautifulSoup

#For parsing data in json format, and converting json format data into dataframe
import json
#(pandas>=1.0; on older versions use: from pandas.io.json import json_normalize)
from pandas import json_normalize

#For accessing directory
import os

#For extracting tables in pdf files
from tabula import read_pdf

#For reading shapefile
import shapefile

#For processing data in shapefiles
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon

#To perform mathematical operation
from math import radians, cos, sin, asin, sqrt,floor,log,exp

#To convert datetime data into proper format
from datetime import datetime

#For plotting graph
from matplotlib import pyplot as plt

#For different normalization/transformation methods
from sklearn.preprocessing import MinMaxScaler,minmax_scale,MaxAbsScaler,StandardScaler,RobustScaler,Normalizer,QuantileTransformer,PowerTransformer

## 1. Load and Parse data files of different formats <a class="anchor" id="1"></a>

### Load `shoppingcenters` excel file

In [4]:
shopping_centers_excel=pd.ExcelFile('DataIntegration/shopingcenters.xlsx')
shopping_centers_excel

<pandas.io.excel._base.ExcelFile at 0x1fd223ef588>

In [5]:
shopping_centers_excel.sheet_names
#Only 1 sheet called Sheet1

['Sheet1']

In [7]:
shopping_centers=shopping_centers_excel.parse('Sheet1')

#Keep only the relevant columns
shopping_centers=shopping_centers.iloc[:,1:4]
shopping_centers.head()

Unnamed: 0,sc_id,lat,lng
0,SC_001,-37.767915,145.04179
1,SC_002,-37.819375,145.171472
2,SC_003,-37.971131,145.089065
3,SC_004,-35.280406,149.13255
4,SC_005,-37.574572,144.920452


In [8]:
shopping_centers.dtypes
#Correct datatypes

sc_id     object
lat      float64
lng      float64
dtype: object

### Load `hospitals` html file

In [9]:
with open('DataIntegration/hospitals.html',encoding='utf8') as html_file:
    #Use BeautifulSoup package to load and parse the html file
    #Name the parser explicitly ('lxml' is also used for the xml file later on)
    bsobj=BeautifulSoup(html_file,'lxml')
bsobj

<html><body><table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>id</th>
<th>lat</th>
<th>lng</th>
<th>name</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>hospital_001</td>
<td>-37.990622</td>
<td>145.072836</td>
<td>Como Private Hospital</td>
</tr>
<tr>
<th>1</th>
<td>hospital_002</td>
<td>-37.855469</td>
<td>145.268183</td>
<td>Mountain District Private Hospital</td>
</tr>
<tr>
<th>2</th>
<td>hospital_003</td>
<td>-37.792230</td>
<td>144.889128</td>
<td>Western Hospital</td>
</tr>
<tr>
<th>3</th>
<td>hospital_004</td>
<td>-37.756042</td>
<td>145.061896</td>
<td>Mercy Hospital for Women</td>
</tr>
<tr>
<th>4</th>
<td>hospital_005</td>
<td>-37.760623</td>
<td>144.815624</td>
<td>Sunshine Hospital</td>
</tr>
<tr>
<th>5</th>
<td>hospital_006</td>
<td>-36.359274</td>
<td>145.410832</td>
<td>Shepparton Private Hospital</td>
</tr>
<tr>
<th>6</th>
<td>hospital_007</td>
<td>-37.774573</td>
<td>144.923973</td>
<td>Ascot Vale Road Specialist Rooms</td>
</t

We can see that every piece of information needed to build the dataframe sits inside a `<tr>` tag, so we extract all of the `<tr>` elements.

In [10]:
bsobj_info=bsobj.findAll('tr')
bsobj_info

[<tr style="text-align: right;">
 <th></th>
 <th>id</th>
 <th>lat</th>
 <th>lng</th>
 <th>name</th>
 </tr>,
 <tr>
 <th>0</th>
 <td>hospital_001</td>
 <td>-37.990622</td>
 <td>145.072836</td>
 <td>Como Private Hospital</td>
 </tr>,
 <tr>
 <th>1</th>
 <td>hospital_002</td>
 <td>-37.855469</td>
 <td>145.268183</td>
 <td>Mountain District Private Hospital</td>
 </tr>,
 <tr>
 <th>2</th>
 <td>hospital_003</td>
 <td>-37.792230</td>
 <td>144.889128</td>
 <td>Western Hospital</td>
 </tr>,
 <tr>
 <th>3</th>
 <td>hospital_004</td>
 <td>-37.756042</td>
 <td>145.061896</td>
 <td>Mercy Hospital for Women</td>
 </tr>,
 <tr>
 <th>4</th>
 <td>hospital_005</td>
 <td>-37.760623</td>
 <td>144.815624</td>
 <td>Sunshine Hospital</td>
 </tr>,
 <tr>
 <th>5</th>
 <td>hospital_006</td>
 <td>-36.359274</td>
 <td>145.410832</td>
 <td>Shepparton Private Hospital</td>
 </tr>,
 <tr>
 <th>6</th>
 <td>hospital_007</td>
 <td>-37.774573</td>
 <td>144.923973</td>
 <td>Ascot Vale Road Specialist Rooms</td>
 </tr>,
 <tr>
 

Within `bsobj_info`, the `<th>` tags of the first element hold the column headers, while the `<td>` tags of every other element hold the values of one row (tuple). We can extract the information accordingly.

In [12]:
html_list,header=[],[]
for i in range(len(bsobj_info)):
    #Column header row
    if i==0:
        content=bsobj_info[i].findAll('th')
        #First element is <th></th>, so remove this first element
        content.pop(0)
        for tag in content:
            #Get only the content within the tag
            result=tag.text
            header.append(result)
        html_list.append(header)
    
    #tuple rows
    else:
        tuple_list=[]
        content=bsobj_info[i].findAll('td')
        for tag in content:
            #Get only the content within the tag
            result=tag.text
            tuple_list.append(result)
        html_list.append(tuple_list)

In [13]:
#Make it into a dataframe and change column header accordingly
hospitals=DataFrame(html_list)
header=hospitals.iloc[0]  
hospitals=hospitals[1:]  #remove the first header row
hospitals.columns=header  #make the header
hospitals

Unnamed: 0,id,lat,lng,name
1,hospital_001,-37.990622,145.072836,Como Private Hospital
2,hospital_002,-37.855469,145.268183,Mountain District Private Hospital
3,hospital_003,-37.792230,144.889128,Western Hospital
4,hospital_004,-37.756042,145.061896,Mercy Hospital for Women
5,hospital_005,-37.760623,144.815624,Sunshine Hospital
...,...,...,...,...
195,hospital_195,-38.234091,146.406812,Maryvale Private Hospital
196,hospital_196,-37.837972,144.996182,South Yarra Clinic
197,hospital_197,-37.798231,144.957169,Prof George Andrew Varigos Specialist Practice
198,hospital_198,-37.910968,144.990415,Mr Harry Clitherow - Orthopaedic Surgeon


In [14]:
hospitals.dtypes #Change datatypes of lat and lng to appropriate datatypes

0
id      object
lat     object
lng     object
name    object
dtype: object

In [15]:
hospitals[['lat','lng']]=hospitals[['lat','lng']].astype(float)
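
`astype(float)` raises a `ValueError` on the first cell that is not a valid number. As a hedged alternative, the sketch below uses `pd.to_numeric` with `errors='coerce'`, which turns unparseable cells into `NaN` so they can be inspected afterwards. The frame `hospitals_sample` and its deliberately broken third `lat` value are made up for illustration; only the remaining values come from the table above.

```python
import pandas as pd

# Hypothetical sample: rows from the hospitals table, with one made-up bad 'lat' value
hospitals_sample = pd.DataFrame({
    'id': ['hospital_001', 'hospital_002', 'hospital_003'],
    'lat': ['-37.990622', '-37.855469', 'n/a'],
    'lng': ['145.072836', '145.268183', '144.889128'],
})

# Convert each coordinate column; unparseable strings become NaN instead of raising
for col in ['lat', 'lng']:
    hospitals_sample[col] = pd.to_numeric(hospitals_sample[col], errors='coerce')

# Rows that failed conversion can be reviewed before they are fixed or dropped
bad_rows = hospitals_sample[hospitals_sample[['lat', 'lng']].isna().any(axis=1)]
print(bad_rows)
```

If `bad_rows` is empty, the result is the same as with `astype(float)`.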

### Load `real_state` json file

In [17]:
with open('DataIntegration/real_state.json') as json_file:
    real_state_json=json.load(json_file)
real_state_json

[{'property_id': 31200,
  'lat': -37.754143,
  'lng': 144.990876,
  'addr_street': '37 Fyffe Street',
  'price': 3968000,
  'property_type': 'house',
  'year': 2009,
  'bedrooms': 2,
  'bathrooms': 1,
  'parking_space': 0},
 {'property_id': 10280,
  'lat': -37.795382,
  'lng': 144.932417,
  'addr_street': '23 Hardiman Street',
  'price': 9675000,
  'property_type': 'house',
  'year': 2011,
  'bedrooms': 2,
  'bathrooms': 1,
  'parking_space': 0},
 {'property_id': 66367,
  'lat': -37.786888,
  'lng': 145.307171,
  'addr_street': '4 Summit Court',
  'price': 5040000,
  'property_type': 'house',
  'year': 2014,
  'bedrooms': 4,
  'bathrooms': 2,
  'parking_space': 2},
 {'property_id': 31340,
  'lat': -37.761896,
  'lng': 145.01209,
  'addr_street': '56 Clyde Street',
  'price': 14200000,
  'property_type': 'house',
  'year': 2014,
  'bedrooms': 4,
  'bathrooms': 1,
  'parking_space': 2},
 {'property_id': 80290,
  'lat': -37.982964,
  'lng': 145.168388,
  'addr_street': '46 Kingsclere Aven

In [18]:
#Convert to dataframe
real_state_json=json_normalize(real_state_json)
real_state_json.head()

  


Unnamed: 0,property_id,lat,lng,addr_street,price,property_type,year,bedrooms,bathrooms,parking_space
0,31200,-37.754143,144.990876,37 Fyffe Street,3968000,house,2009,2,1,0
1,10280,-37.795382,144.932417,23 Hardiman Street,9675000,house,2011,2,1,0
2,66367,-37.786888,145.307171,4 Summit Court,5040000,house,2014,4,2,2
3,31340,-37.761896,145.01209,56 Clyde Street,14200000,house,2014,4,1,2
4,80290,-37.982964,145.168388,46 Kingsclere Avenue,9100000,house,2015,4,2,1


In [19]:
real_state_json.dtypes  #Correct datatypes

property_id        int64
lat              float64
lng              float64
addr_street       object
price              int64
property_type     object
year               int64
bedrooms           int64
bathrooms          int64
parking_space      int64
dtype: object

### Load `real_state` xml file

In [20]:
#Read the whole XML file into one string
with open('DataIntegration/real_state.xml','r') as file:
    myfile=file.read()
#'lxml' is the HTML parser, which is why the output below is wrapped in <html><body><p>
real_state_xml=BeautifulSoup(myfile,'lxml')

In [21]:
#Take a look at the prettified version of the xml file
print(real_state_xml.prettify())

<html>
 <body>
  <p>
   b'
   <?xml version="1.0" encoding="UTF-8" ?>
   <root>
    <property_id type="dict">
     <n16867 type="int">
      17402
     </n16867>
     <n16384 type="int">
      16919
     </n16384>
     <n5935 type="int">
      5956
     </n5935>
     <n29022 type="int">
      29557
     </n29022>
     <n42455 type="int">
      42990
     </n42455>
     <n84948 type="int">
      85483
     </n84948>
     <n89564 type="int">
      90099
     </n89564>
     <n23760 type="int">
      24295
     </n23760>
     <n23033 type="int">
      23568
     </n23033>
     <n44518 type="int">
      45053
     </n44518>
     <n25114 type="int">
      25649
     </n25114>
     <n11627 type="int">
      12142
     </n11627>
     <n11575 type="int">
      12090
     </n11575>
     <n78577 type="int">
      79112
     </n78577>
     <n35831 type="int">
      36366
     </n35831>
     <n58512 type="int">
      59047
     </n58512>
     <n44423 type="int">
      44958
     </n44423>
     <n70

In [22]:
#Get property_id tag
property_id=real_state_xml.find('property_id')

#Get all the property id
property_id_list=[]
for i in property_id:
    result=i.text
    property_id_list.append(result)

In [23]:
#Get lat tag
lat=real_state_xml.find('lat')
#Get all the lat values
lat_list=[]
for i in lat:
    result=i.text
    lat_list.append(result)

In [24]:
#Get lng tag
lng=real_state_xml.find('lng')
#Get all the lng values
lng_list=[]
for i in lng:
    result=i.text
    lng_list.append(result)

In [25]:
#Get addr_street tag
address=real_state_xml.find('addr_street')
#Get all the addr_street values
address_list=[]
for i in address:
    result=i.text
    address_list.append(result)

In [26]:
#Get price tag
price=real_state_xml.find('price')
#Get all the price values
price_list=[]
for i in price:
    result=i.text
    price_list.append(result)

In [27]:
#Get property_type tag
property_type=real_state_xml.find('property_type')
#Get all the property_type values
property_type_list=[]
for i in property_type:
    result=i.text
    property_type_list.append(result)

In [28]:
#Get year tag
year=real_state_xml.find('year')
#Get all the year values
year_list=[]
for i in year:
    result=i.text
    year_list.append(result)

In [29]:
#Get bedrooms tag
bedrooms=real_state_xml.find('bedrooms')
#Get all the bedrooms values
bedrooms_list=[]
for i in bedrooms:
    result=i.text
    bedrooms_list.append(result)

In [30]:
#Get bathrooms tag
bathrooms=real_state_xml.find('bathrooms')
#Get all the bathrooms values
bathrooms_list=[]
for i in bathrooms:
    result=i.text
    bathrooms_list.append(result)

In [31]:
#Get parking space tag
parking_space=real_state_xml.find('parking_space')
#Get all the parking_space values
parking_space_list=[]
for i in parking_space:
    result=i.text
    parking_space_list.append(result)

In [32]:
#Join all the lists into a dataframe
real_state_xml=pd.DataFrame(
    {'Property_id':property_id_list,
     'lat':lat_list,
     'lng':lng_list,
     'addr_street':address_list,
     'price':price_list,
     'property_type':property_type_list,
     'year':year_list,
     'bedrooms':bedrooms_list,
     'bathrooms':bathrooms_list,
     'parking_space':parking_space_list
    })
real_state_xml.head()

Unnamed: 0,Property_id,lat,lng,addr_street,price,property_type,year,bedrooms,bathrooms,parking_space
0,17402,-37.71226501,144.9121552,52 Stanley Street,5300000,house,2009,4,2,1
1,16919,-37.702736,144.947733,26 Hilton Street,5844000,house,2011,2,1,1
2,5956,-37.780454,144.839847,1 Osbert Street,5908000,house,2016,3,1,1
3,29557,-37.778578,144.999559,17 Plant Street,7794000,house,2010,2,1,1
4,42990,-37.7242012,145.1026001,23 Bimbadeen Crescent,5320000,house,2009,4,1,2


In [33]:
real_state_xml.dtypes
#Incorrect datatypes, need to change

Property_id      object
lat              object
lng              object
addr_street      object
price            object
property_type    object
year             object
bedrooms         object
bathrooms        object
parking_space    object
dtype: object

In [34]:
real_state_xml[['Property_id', 'price','year','bedrooms','bathrooms','parking_space']] = real_state_xml[['Property_id',
                                        'price','year','bedrooms','bathrooms','parking_space']].astype(int)

In [35]:
real_state_xml[['lat', 'lng']]=real_state_xml[['lat', 'lng']].astype(float)
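
The ten near-identical cells above (In [22] to In [31]) differ only in the tag name. A minimal sketch of the same idea as one loop is shown below. It makes several assumptions: it uses the standard-library `xml.etree.ElementTree` instead of BeautifulSoup, it runs on a small inline sample that imitates the printed structure (the `lat`/`lng` child tags and their `type` attributes are guessed), and the helper name `xml_columns_to_frame` is hypothetical. The real file starts with a stray `b'`, which would have to be stripped before a strict XML parser accepts it.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Illustrative sample imitating the structure printed by prettify() above
sample_xml = """<root>
<property_id type="dict"><n16867 type="int">17402</n16867><n16384 type="int">16919</n16384></property_id>
<lat type="dict"><n16867 type="float">-37.71226501</n16867><n16384 type="float">-37.702736</n16384></lat>
<lng type="dict"><n16867 type="float">144.9121552</n16867><n16384 type="float">144.947733</n16384></lng>
</root>"""

def xml_columns_to_frame(xml_text, columns):
    """Collect the child values of each column tag into one dataframe column."""
    root = ET.fromstring(xml_text)
    data = {}
    for col in columns:
        node = root.find(col)
        # Each child element holds one value; rows are matched by position
        data[col] = [child.text.strip() for child in node]
    return pd.DataFrame(data)

sample_frame = xml_columns_to_frame(sample_xml, ['property_id', 'lat', 'lng'])

# All values arrive as strings, so cast them in one call
sample_frame = sample_frame.astype({'property_id': int, 'lat': float, 'lng': float})
print(sample_frame.dtypes)
```

Like the original cells, this matches rows by position, so it relies on every column listing its children in the same order.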

### Load `supermarkets` pdf file

In [37]:
#Using read_pdf from tabula package, extract the portion of the table from each page
#Parse the pdf only once; read_pdf returns one dataframe per extracted table
supermarket_pages=read_pdf('DataIntegration/supermarkets.pdf',pages='all')
supermarkets_1,supermarkets_2,supermarkets_3,supermarkets_4,supermarkets_5=supermarket_pages[:5]

In [38]:
#Concatenate all the portions to form a complete table, and drop unnecessary columns
supermarkets=pd.concat([supermarkets_1,supermarkets_2,supermarkets_3,supermarkets_4,supermarkets_5],ignore_index=True)
supermarkets=supermarkets.drop(columns=['Unnamed: 0'])
supermarkets

Unnamed: 0,id,lat,lng,type
0,S_001,-37.883978,144.735287,Woolworths
1,S_002,-41.161591,147.514797,Woolworths
2,S_003,-37.984078,145.077167,Woolworths
3,S_004,-37.707023,144.938740,Woolworths
4,S_005,-37.597670,144.938413,Woolworths
...,...,...,...,...
234,S_235,-37.860188,145.028920,Coles
235,S_236,-37.875984,144.614490,Coles
236,S_237,-37.047330,143.744610,Coles
237,S_238,-38.351648,144.922881,Coles


### Load `GTFS` text files

In [39]:
#Check all the text files in the GTFS data folder
os.listdir('GTFS_MelbourneTrainInformation')

['agency.txt',
 'calendar.txt',
 'calendar_dates.txt',
 'routes.txt',
 'shapes.txt',
 'stops.txt',
 'stop_times.txt',
 'trips.txt']

In [40]:
#Load agency text file
agency=pd.read_csv('GTFS_MelbourneTrainInformation/agency.txt',sep=',')
agency

Unnamed: 0,agency_id,agency_name,agency_url,agency_timezone,agency_lang
0,1,PTV,http://www.ptv.vic.gov.au,Australia/Melbourne,EN


In [41]:
#Load calendar text file
calendar=pd.read_csv('GTFS_MelbourneTrainInformation/calendar.txt',sep=',')
calendar

Unnamed: 0,service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date
0,T2,0,0,0,0,0,1,0,20151009,20151011
1,UJ,0,0,0,0,0,0,1,20151009,20151011
2,T6,0,0,0,0,1,0,0,20151009,20151011
3,T5,1,1,1,1,0,0,0,20151012,20151015
4,T2_1,0,0,0,0,0,1,0,20151016,20151018
5,UJ_1,0,0,0,0,0,0,1,20151016,20151018
6,T6_1,0,0,0,0,1,0,0,20151016,20151018
7,T5_1,1,1,1,1,0,0,0,20151019,20151022
8,T0,1,1,1,1,1,0,0,20151023,20151122
9,T2_2,0,0,0,0,0,1,0,20151023,20151122


In [42]:
#Load calendar_dates text file
calendar_dates=pd.read_csv('GTFS_MelbourneTrainInformation/calendar_dates.txt',sep=',')
calendar_dates

Unnamed: 0,service_id,date,exception_type
0,T0,20151103,2
1,T0+a5,20151103,2


In [43]:
#Load routes text file
routes=pd.read_csv('GTFS_MelbourneTrainInformation/routes.txt',sep=',')
routes

Unnamed: 0,route_id,agency_id,route_short_name,route_long_name,route_type
0,2-ALM-B-mjp-1,1,Alamein,Alamein - City (Flinders Street),2
1,2-ALM-C-mjp-1,1,Alamein,Alamein - City (Flinders Street),2
2,2-ALM-D-mjp-1,1,Alamein,Alamein - City (Flinders Street),2
3,2-ALM-E-mjp-1,1,Alamein,Alamein - City (Flinders Street),2
4,2-ALM-F-mjp-1,1,Alamein,Alamein - City (Flinders Street),2
...,...,...,...,...,...
76,2-WMN-B-mjp-1,1,Williamstown,Williamstown - City (Flinders Street),2
77,2-WMN-C-mjp-1,1,Williamstown,Williamstown - City (Flinders Street),2
78,2-WMN-D-mjp-1,1,Williamstown,Williamstown - City (Flinders Street),2
79,2-WMN-E-mjp-1,1,Williamstown,Williamstown - City (Flinders Street),2


In [45]:
#Load shapes text file
shapes_txt=pd.read_csv('GTFS_MelbourneTrainInformation/shapes.txt',sep=',')
shapes_txt

Unnamed: 0,shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
0,2-ain-mjp-1.1.H,-37.818631,144.951994,1,0.000000
1,2-ain-mjp-1.1.H,-37.817425,144.951050,2,157.543645
2,2-ain-mjp-1.1.H,-37.817241,144.950828,3,185.827916
3,2-ain-mjp-1.1.H,-37.816327,144.950047,4,308.469671
4,2-ain-mjp-1.1.H,-37.816127,144.949950,5,332.239399
...,...,...,...,...,...
339706,2-WMN-F-mjp-1.6.R,-37.864271,144.895021,17,2655.879090
339707,2-WMN-F-mjp-1.6.R,-37.864818,144.896370,18,2789.160747
339708,2-WMN-F-mjp-1.6.R,-37.867094,144.903228,19,3443.379365
339709,2-WMN-F-mjp-1.6.R,-37.867382,144.904208,20,3535.406535


In [46]:
#Load stops text file
stops=pd.read_csv('GTFS_MelbourneTrainInformation/stops.txt',sep=',')
stops

Unnamed: 0,stop_id,stop_name,stop_short_name,stop_lat,stop_lon
0,15351,Sunbury Railway Station,Sunbury,-37.579091,144.727319
1,15353,Diggers Rest Railway Station,Diggers Rest,-37.627017,144.719922
2,19827,Stony Point Railway Station,Crib Point,-38.374235,145.221837
3,19828,Crib Point Railway Station,Crib Point,-38.366123,145.204043
4,19829,Morradoo Railway Station,Crib Point,-38.354033,145.189602
...,...,...,...,...,...
213,44817,Coolaroo Railway Station,Coolaroo,-37.661003,144.926056
214,45793,Lynbrook Railway Station,Lynbrook,-38.057341,145.249275
215,45794,Cardinia Road Railway Station,Pakenham,-38.071290,145.437791
216,45795,South Morang Railway Station,South Morang,-37.649159,145.067032


In [47]:
#Load stop_times text file
stop_times=pd.read_csv('GTFS_MelbourneTrainInformation/stop_times.txt',sep=',')
stop_times

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled
0,17182517.T2.2-ALM-B-mjp-1.1.H,04:57:00,04:57:00,19847,1,,0,0,0.000000
1,17182517.T2.2-ALM-B-mjp-1.1.H,04:58:00,04:58:00,19848,2,,0,0,723.017818
2,17182517.T2.2-ALM-B-mjp-1.1.H,05:00:00,05:00:00,19849,3,,0,0,1951.735072
3,17182517.T2.2-ALM-B-mjp-1.1.H,05:02:00,05:02:00,19850,4,,0,0,2899.073349
4,17182517.T2.2-ALM-B-mjp-1.1.H,05:04:00,05:04:00,19851,5,,0,0,3927.090952
...,...,...,...,...,...,...,...,...,...
390300,17199140.UJ.2-ain-mjp-1.4.R,18:09:00,18:09:00,20028,1,,0,0,0.000000
390301,17199140.UJ.2-ain-mjp-1.4.R,18:15:00,18:15:00,19973,4,,0,0,4011.161109
390302,17199140.UJ.2-ain-mjp-1.4.R,18:19:00,18:19:00,22180,5,,0,0,5676.741894
390303,17199142.T2.2-ain-mjp-1.5.R,24:00:00,24:00:00,20027,1,,0,0,0.000000
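
The last row above has an arrival time of `24:00:00`. GTFS allows times at or past 24:00:00 for trips that run beyond midnight of the service day, and `datetime.strptime` with `%H:%M:%S` rejects such values. One possible way to handle them, sketched here under the assumption that elapsed time since the start of the service day is what is needed, is `pd.to_timedelta`. The variable names are illustrative.

```python
import pandas as pd

# Three arrival_time values copied from the stop_times output above
arrival_time = pd.Series(['04:57:00', '18:09:00', '24:00:00'])

# Timedeltas measure elapsed time since the start of the service day,
# so hour values of 24 or more are accepted
arrival_delta = pd.to_timedelta(arrival_time)

# Seconds since the start of the service day, convenient for computing travel times
arrival_seconds = arrival_delta.dt.total_seconds().astype(int)
print(arrival_seconds.tolist())
```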


In [48]:
#Load trips text file
trips=pd.read_csv('GTFS_MelbourneTrainInformation/trips.txt',sep=',')
trips

Unnamed: 0,route_id,service_id,trip_id,shape_id,trip_headsign,direction_id
0,2-ALM-F-mjp-1,T0,17067982.T0.2-ALM-F-mjp-1.1.H,2-ALM-F-mjp-1.1.H,City (Flinders Street),0
1,2-ALM-F-mjp-1,T0,17067988.T0.2-ALM-F-mjp-1.1.H,2-ALM-F-mjp-1.1.H,City (Flinders Street),0
2,2-ALM-F-mjp-1,T0,17067992.T0.2-ALM-F-mjp-1.1.H,2-ALM-F-mjp-1.1.H,City (Flinders Street),0
3,2-ALM-F-mjp-1,T0,17067999.T0.2-ALM-F-mjp-1.1.H,2-ALM-F-mjp-1.1.H,City (Flinders Street),0
4,2-ALM-F-mjp-1,T0,17068003.T0.2-ALM-F-mjp-1.1.H,2-ALM-F-mjp-1.1.H,City (Flinders Street),0
...,...,...,...,...,...,...
23804,2-WMN-F-mjp-1,UJ_2,17072252.UJ.2-WMN-F-mjp-1.6.R,2-WMN-F-mjp-1.6.R,Williamstown,1
23805,2-WMN-F-mjp-1,UJ_2,17072256.UJ.2-WMN-F-mjp-1.6.R,2-WMN-F-mjp-1.6.R,Williamstown,1
23806,2-WMN-F-mjp-1,UJ_2,17072260.UJ.2-WMN-F-mjp-1.6.R,2-WMN-F-mjp-1.6.R,Williamstown,1
23807,2-WMN-F-mjp-1,UJ_2,17072264.UJ.2-WMN-F-mjp-1.6.R,2-WMN-F-mjp-1.6.R,Williamstown,1
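
The eight cells above repeat the same `pd.read_csv` call with a different file name. As a sketch of a more compact alternative, the hypothetical helper `load_gtfs_tables` below reads every `.txt` file in a folder into a dictionary keyed by file name. The demonstration writes two tiny files into a temporary folder (their contents are copied from the outputs above) so that the block runs without the real data.

```python
import os
import tempfile
import pandas as pd

def load_gtfs_tables(folder):
    """Read every .txt file in folder into a dict keyed by file name without extension."""
    tables = {}
    for file_name in sorted(os.listdir(folder)):
        if file_name.endswith('.txt'):
            table_name = os.path.splitext(file_name)[0]
            tables[table_name] = pd.read_csv(os.path.join(folder, file_name))
    return tables

# Demonstration on a throwaway folder with two small GTFS-like files
with tempfile.TemporaryDirectory() as demo_folder:
    with open(os.path.join(demo_folder, 'agency.txt'), 'w') as f:
        f.write('agency_id,agency_name\n1,PTV\n')
    with open(os.path.join(demo_folder, 'calendar_dates.txt'), 'w') as f:
        f.write('service_id,date,exception_type\nT0,20151103,2\n')
    gtfs_demo = load_gtfs_tables(demo_folder)

print(sorted(gtfs_demo))
```

With the real data, `load_gtfs_tables('GTFS_MelbourneTrainInformation')['stops']` would correspond to the `stops` dataframe loaded above.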


### Load `vic_suburb_boundary` shape files

In [49]:
sf=shapefile.Reader("vic_suburb_boundary/VIC_LOCALITY_POLYGON_shp") 
recs=sf.records()
shapes=sf.shapes()
recs

[Record #0: ['6670', datetime.date(2011, 8, 31), None, 'VIC2615', datetime.date(2012, 4, 27), None, 'UNDERBOOL', '', '', 'G', None, '2'],
 Record #1: ['6671', datetime.date(2011, 8, 31), None, 'VIC1986', datetime.date(2012, 4, 27), None, 'NURRAN', '', '', 'G', None, '2'],
 Record #2: ['6672', datetime.date(2011, 8, 31), None, 'VIC2862', datetime.date(2012, 4, 27), None, 'WOORNDOO', '', '', 'G', None, '2'],
 Record #3: ['6673', datetime.date(2011, 8, 31), None, 'VIC734', datetime.date(2017, 8, 9), None, 'DEPTFORD', '', '', 'G', None, '2'],
 Record #4: ['6674', datetime.date(2011, 8, 31), None, 'VIC2900', datetime.date(2012, 4, 27), None, 'YANAC', '', '', 'G', None, '2'],
 Record #5: ['6405', datetime.date(2011, 8, 31), None, 'VIC1688', datetime.date(2012, 4, 27), None, 'MINIMAY', '', '', 'G', None, '2'],
 Record #6: ['6451', datetime.date(2011, 8, 31), None, 'VIC999', datetime.date(2012, 4, 27), None, 'GLEN FORBES', '', '', 'G', None, '2'],
 Record #7: ['6452', datetime.date(2011, 8, 31

In [50]:
len(recs),len(shapes)

(2973, 2973)

## 2. Data Integration <a class="anchor" id="2"></a>
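
The notebook leaves this section to the sample solution. Purely as a hedged sketch of one plausible first step (not the author's method), the two real-estate frames could be stacked once their column names agree: the XML frame was built with `Property_id`, while the JSON frame uses `property_id`. The tiny frames below are stand-ins with only two columns, and the repeated `property_id` 31200 in `xml_part` is made up to show the de-duplication.

```python
import pandas as pd

# Stand-in for real_state_json (two rows from its head())
json_part = pd.DataFrame({'property_id': [31200, 10280],
                          'price': [3968000, 9675000]})

# Stand-in for real_state_xml; the second row is a made-up duplicate of 31200
xml_part = pd.DataFrame({'Property_id': [17402, 31200],
                         'price': [5300000, 3968000]})

# Align the column name, stack the frames, and keep one row per property
xml_part = xml_part.rename(columns={'Property_id': 'property_id'})
real_state = pd.concat([json_part, xml_part], ignore_index=True)
real_state = real_state.drop_duplicates(subset='property_id').reset_index(drop=True)
print(real_state)
```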

## 3. Data Reshaping <a class="anchor" id="3"></a>


**Refer to sample solution**
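
As a rough, hedged illustration of the kind of comparison this section is about (not the sample solution), the sketch below applies three common transformations to the five `price` values shown in `real_state_json.head()`. Min-max scaling and standardization are written out with NumPy; they are the per-column formulas behind `MinMaxScaler` and `StandardScaler`. The variable names are illustrative.

```python
import numpy as np

# price and bedrooms from the first five rows of real_state_json
price = np.array([3968000, 9675000, 5040000, 14200000, 9100000], dtype=float)
bedrooms = np.array([2, 2, 4, 4, 4], dtype=float)

# Min-max scaling maps the values onto [0, 1]
price_minmax = (price - price.min()) / (price.max() - price.min())

# Standardization (z-score) gives mean 0 and unit variance (population std, ddof=0)
price_zscore = (price - price.mean()) / price.std()

# Log transformation compresses large values more than small ones
price_log = np.log(price)

# Linear rescaling leaves the correlation with another attribute unchanged,
# whereas a non-linear transform such as log generally changes it
corr_raw = np.corrcoef(price, bedrooms)[0, 1]
corr_minmax = np.corrcoef(price_minmax, bedrooms)[0, 1]
corr_log = np.corrcoef(price_log, bedrooms)[0, 1]
print(corr_raw, corr_minmax, corr_log)
```

Because min-max scaling and standardization are linear, a single-predictor linear model with an intercept fits equally well before and after them; only the coefficient scale changes. Non-linear transformations such as the logarithm are the ones that can change the fit.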
