# Iteration 02 Data Wrangling - On Your Toes (Part 1 - Continued from iteration 1) 


Created Date : 23 April 2020

Version : 3.1

Last Update : 15 May 2020


Data Sets analysed:


* Crime Data : Data Tables - LGA Criminal Incidents Visualisation - year ending December 2019
* VIC_LOCALITY_POLYGON_shp.shp
* Offence_Severity_Levels.csv (Custom dataset)

In this notebook, the crime dataset `Table 03 Criminal incidents by principal offence, local government area and postcode or suburb/town - January 2010 to December 2019` is put through the wrangling tasks performed in iteration 1 and joined with the new location information in `VIC_LOCALITY_POLYGON_shp.shp` in order to create a meaningful map that indicates the crime level of each suburb.


Detailed data plan : https://mahara.infotech.monash.edu.au/mahara/artefact/file/download.php?file=239816

-------------------------------------------------------------

#### Import Libraries and Packages

In [1]:
import pandas as pd
import numpy as np
import shapefile as shp
import re
from shapely.geometry import Point, Polygon

------------------------------------------------------

## Crime Data

### Dataset Info

Name: Data Tables - LGA Criminal Incidents Visualisation - year ending December 2019

Description: Table 03 Criminal incidents by principal offence, local government area and postcode or suburb/town - January 2010 to December 2019 

Link : https://www.crimestatistics.vic.gov.au/sites/default/files/embridge_cache/emshare/original/public/users/202003/e7/d73f8b607/Data_Tables_LGA_Recorded_Offences_Year_Ending_December_2019.xlsx

Permissions : CC BY 4.0

### Wrangling tasks


Read the data into Python and examine the column names

In [2]:
crime_df = pd.read_excel('Data_Tables_LGA_Criminal_Incidents_Year_Ending_December_2019.xlsx','Table 03')
print("Dataset shape : ",crime_df.shape)
print(crime_df.columns)

Dataset shape :  (309358, 9)
Index(['Year', 'Year ending', 'Local Government Area', 'Postcode',
       'Suburb/Town Name', 'Offence Division', 'Offence Subdivision',
       'Offence Subgroup', 'Incidents Recorded'],
      dtype='object')


Check dataset for null values

In [3]:
crime_df.isnull().sum()

Year                     0
Year ending              0
Local Government Area    0
Postcode                 0
Suburb/Town Name         0
Offence Division         0
Offence Subdivision      0
Offence Subgroup         0
Incidents Recorded       0
dtype: int64

Check dataset for duplicate values

In [4]:
print("Number of duplicate rows : ",len(crime_df[crime_df.duplicated(subset=None, keep='first')]))

Number of duplicate rows :  0


------------------------------------------------------------------------


Create "Suburb, Postcode" column for easy data filtering

In [5]:
# Convert columns to string data type
crime_df["Suburb/Town Name"] = crime_df["Suburb/Town Name"].astype(str)
crime_df["Postcode"]=crime_df["Postcode"].astype(str)

# Add new column "Suburb, Postcode" to the dataset
crime_df["SuburbPostcode"] = crime_df["Suburb/Town Name"] +", "+ crime_df["Postcode"]
crime_df.head(3)

Unnamed: 0,Year,Year ending,Local Government Area,Postcode,Suburb/Town Name,Offence Division,Offence Subdivision,Offence Subgroup,Incidents Recorded,SuburbPostcode
0,2019,December,Alpine,3691,Dederang,B Property and deception offences,B30 Burglary/Break and enter,B321 Residential non-aggravated burglary,1,"Dederang, 3691"
1,2019,December,Alpine,3691,Dederang,E Justice procedures offences,E10 Justice procedures,E14 Pervert the course of justice or commit pe...,1,"Dederang, 3691"
2,2019,December,Alpine,3691,Dederang,A Crimes against the person,A20 Assault and related offences,A212 Non-FV Serious assault,1,"Dederang, 3691"


------------------------

Create a subset of the dataset that contains the local government area and the suburb from the crime_df dataset

In [6]:
# Create a dataset of LGA and suburb/postcode pairs (duplicates are dropped below)
lga_suburb = crime_df[['Local Government Area','SuburbPostcode']]

# Reset index
lga_suburb = lga_suburb.reset_index(drop=True)

# Rename "Local Government Area" column name as "LGA" for easier processing
lga_suburb.rename(columns={'Local Government Area':'LGA'}, inplace=True)

# Drop duplicates
lga_suburb= lga_suburb.drop_duplicates()
print("Length of lga_suburb dataset : ",len(lga_suburb))
lga_suburb.head(2)

Length of lga_suburb dataset :  3126


Unnamed: 0,LGA,SuburbPostcode
0,Alpine,"Dederang, 3691"
7,Alpine,"Glen Creek, 3691"


Export dataset as a CSV file

In [7]:
lga_suburb.to_csv("lga_suburb_export.csv", index=False)

Create "Total Incidents by Suburb/PC" column for data filtering purposes

In [8]:
# Create custom dataframe to store the total incident numbers grouped by "Suburb, Postcode column"
crime_sum=crime_df.groupby(['SuburbPostcode'])[['Incidents Recorded']].sum().reset_index()

# Create a new dataframe merging the crime_df and crime_sum to include "Total incidents by Suburb/PC"
crime_total_inc = pd.merge(crime_df, crime_sum, how='outer', on='SuburbPostcode')
crime_total_inc.rename(columns={'Incidents Recorded_x':'Incidents Recorded','Incidents Recorded_y':'Total Incidents by Suburb/PC'}, inplace=True)

crime_total_inc.head(2)

Unnamed: 0,Year,Year ending,Local Government Area,Postcode,Suburb/Town Name,Offence Division,Offence Subdivision,Offence Subgroup,Incidents Recorded,SuburbPostcode,Total Incidents by Suburb/PC
0,2019,December,Alpine,3691,Dederang,B Property and deception offences,B30 Burglary/Break and enter,B321 Residential non-aggravated burglary,1,"Dederang, 3691",59
1,2019,December,Alpine,3691,Dederang,E Justice procedures offences,E10 Justice procedures,E14 Pervert the course of justice or commit pe...,1,"Dederang, 3691",59
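
The group-then-merge pattern above can also be written with `groupby(...).transform('sum')`, which returns each group's total aligned to every original row, so no merge or column renaming is needed. A minimal sketch on hypothetical toy data (the `toy` dataframe and its values are illustrative, not taken from the crime dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for crime_df
toy = pd.DataFrame({
    "SuburbPostcode": ["A, 3000", "A, 3000", "B, 3001"],
    "Incidents Recorded": [1, 4, 2],
})

# transform('sum') broadcasts each group's total back onto its rows
toy["Total Incidents by Suburb/PC"] = (
    toy.groupby("SuburbPostcode")["Incidents Recorded"].transform("sum")
)
print(toy)
```

The merge-based version used in this notebook gives the same column; the `transform` form simply avoids the `_x`/`_y` suffixes that then have to be renamed.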


Create "Total incidents by Offence Division and Suburb, Postcode" column for filtering purposes

In [9]:
# Create custom dataframe to store the total incident numbers grouped by "Offence Division" and "SuburbPostcode" column
crime_div_sum=crime_total_inc.groupby(['Offence Division','SuburbPostcode'])[['Incidents Recorded']].sum().reset_index()

# Create a new dataframe merging crime_total_inc and crime_div_sum to include "Total Incidents by Offence Division and Suburb/PC"
suburb_and_od = pd.merge(crime_total_inc, crime_div_sum, how='outer', on=['Offence Division','SuburbPostcode'])
suburb_and_od.rename(columns={'Incidents Recorded_x':'Incidents Recorded','Incidents Recorded_y':'Total Incidents by Offence Division and Suburb/PC'}, inplace=True)

suburb_and_od.head(2)

Unnamed: 0,Year,Year ending,Local Government Area,Postcode,Suburb/Town Name,Offence Division,Offence Subdivision,Offence Subgroup,Incidents Recorded,SuburbPostcode,Total Incidents by Suburb/PC,Total Incidents by Offence Division and Suburb/PC
0,2019,December,Alpine,3691,Dederang,B Property and deception offences,B30 Burglary/Break and enter,B321 Residential non-aggravated burglary,1,"Dederang, 3691",59,20
1,2019,December,Alpine,3691,Dederang,B Property and deception offences,B20 Property damage,B21 Criminal damage,1,"Dederang, 3691",59,20


Create "Total incidents by Offence Subgroup and Suburb, Postcode" column for filtering purposes

In [10]:
# Create custom dataframe to store the total incident numbers grouped by "Offence Subgroup" and "SuburbPostcode"
crime_sub_sum=crime_total_inc.groupby(['Offence Subgroup','SuburbPostcode'])[['Incidents Recorded']].sum().reset_index()

# Create a new dataframe merging suburb_and_od and crime_sub_sum to include "Total Incidents by Offence Subgroup and Suburb/PC"
crime_all_totals = pd.merge(suburb_and_od, crime_sub_sum, how='outer', on=['Offence Subgroup','SuburbPostcode'])
crime_all_totals.rename(columns={'Incidents Recorded_x':'Incidents Recorded','Incidents Recorded_y':'Total Incidents by Offence Subgroup and Suburb/PC'}, inplace=True)

crime_all_totals.head(2)

Unnamed: 0,Year,Year ending,Local Government Area,Postcode,Suburb/Town Name,Offence Division,Offence Subdivision,Offence Subgroup,Incidents Recorded,SuburbPostcode,Total Incidents by Suburb/PC,Total Incidents by Offence Division and Suburb/PC,Total Incidents by Offence Subgroup and Suburb/PC
0,2019,December,Alpine,3691,Dederang,B Property and deception offences,B30 Burglary/Break and enter,B321 Residential non-aggravated burglary,1,"Dederang, 3691",59,20,2
1,2014,December,Alpine,3691,Dederang,B Property and deception offences,B30 Burglary/Break and enter,B321 Residential non-aggravated burglary,1,"Dederang, 3691",59,20,2


Filter dataframe for year 2017 - 2019 as the app uses the last 3 years to calculate safety levels

In [11]:
crime_all_totals=crime_all_totals[crime_all_totals["Year"].isin([2017, 2018,2019])]
crime_all_totals.shape

(100833, 13)

-------------------------------------------------

#### Import offence severity levels dataset

This is a custom dataset based on the offences recorded in `Table 03 Criminal incidents by principal offence, local government area and postcode or suburb/town - January 2010 to December 2019`, filtered to the offences that someone who works out outdoors might encounter, with a severity rating assigned to each offence.

In [12]:
offence_severity_levels = pd.read_csv("Offence_Severity_Levels.csv")
print("Length of offence_severity_levels dataset", len(offence_severity_levels))

Length of offence_severity_levels dataset 27


Merge the offence_severity_levels dataset and crime_all_totals

In [13]:
crime_severity_added = pd.merge(crime_all_totals, offence_severity_levels, how='outer', on='Offence Subgroup')
crime_severity_added.drop(crime_severity_added.tail(1).index,inplace=True)

# Since Missing values appear in the new dataset due to the outer join performed, replace missing values with 0
crime_severity_added["Rating"].fillna(0, inplace = True) 

crime_severity_added.head(2)

Unnamed: 0,Year,Year ending,Local Government Area,Postcode,Suburb/Town Name,Offence Division,Offence Subdivision,Offence Subgroup,Incidents Recorded,SuburbPostcode,Total Incidents by Suburb/PC,Total Incidents by Offence Division and Suburb/PC,Total Incidents by Offence Subgroup and Suburb/PC,Rating
0,2019.0,December,Alpine,3691,Dederang,B Property and deception offences,B30 Burglary/Break and enter,B321 Residential non-aggravated burglary,1.0,"Dederang, 3691",59.0,20.0,2.0,0.0
1,2018.0,December,Alpine,3697,Tawonga,B Property and deception offences,B30 Burglary/Break and enter,B321 Residential non-aggravated burglary,1.0,"Tawonga, 3697",58.0,19.0,2.0,0.0
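
To see which rows an outer join leaves unmatched before filling them, `pd.merge` accepts `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only` or `right_only`. A minimal sketch with hypothetical toy frames (`left`, `right` and their values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for crime_all_totals and offence_severity_levels
left = pd.DataFrame({"Offence Subgroup": ["A212", "B321"], "Incidents Recorded": [3, 1]})
right = pd.DataFrame({"Offence Subgroup": ["A212", "D11"], "Rating": [3, 3]})

merged = pd.merge(left, right, how="outer", on="Offence Subgroup", indicator=True)

# Rows present on one side only have NaN in the other side's columns
merged["Rating"] = merged["Rating"].fillna(0)
print(merged)
```

Rows marked `right_only` carry a rating but no crime records, which may be worth inspecting before any rows are dropped.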


Create custom dataframe to store the total incident numbers grouped by "Rating" and "SuburbPostcode"

In [14]:
crime_sev_sub_sum=crime_severity_added.groupby(['SuburbPostcode','Rating'])[['Incidents Recorded']].sum().reset_index()

# Create a new dataframe merging crime_severity_added and crime_sev_sub_sum to include "Total Incidents by rating and Suburb/PC"
crime_final = pd.merge(crime_severity_added, crime_sev_sub_sum, how='outer', on=['SuburbPostcode','Rating'])
crime_final.rename(columns={'Incidents Recorded_x':'Incidents Recorded','Incidents Recorded_y':'Total Incidents by rating and Suburb/PC'}, inplace=True)

crime_final.head(2)

Unnamed: 0,Year,Year ending,Local Government Area,Postcode,Suburb/Town Name,Offence Division,Offence Subdivision,Offence Subgroup,Incidents Recorded,SuburbPostcode,Total Incidents by Suburb/PC,Total Incidents by Offence Division and Suburb/PC,Total Incidents by Offence Subgroup and Suburb/PC,Rating,Total Incidents by rating and Suburb/PC
0,2019.0,December,Alpine,3691,Dederang,B Property and deception offences,B30 Burglary/Break and enter,B321 Residential non-aggravated burglary,1.0,"Dederang, 3691",59.0,20.0,2.0,0.0,6.0
1,2019.0,December,Alpine,3691,Dederang,B Property and deception offences,B20 Property damage,B21 Criminal damage,1.0,"Dederang, 3691",59.0,20.0,4.0,0.0,6.0


Create a list of offences to retain in the dataset so that they cover the possible safety risks for someone who works out outdoors

In [15]:
retain_offences =  offence_severity_levels["Offence Subgroup"].unique().tolist()

Filter dataframe using the offence subgroups listed above

In [16]:
crime_severity_df = crime_final[crime_final["Offence Subgroup"].isin(retain_offences)]
crime_severity_df.head(2)

Unnamed: 0,Year,Year ending,Local Government Area,Postcode,Suburb/Town Name,Offence Division,Offence Subdivision,Offence Subgroup,Incidents Recorded,SuburbPostcode,Total Incidents by Suburb/PC,Total Incidents by Offence Division and Suburb/PC,Total Incidents by Offence Subgroup and Suburb/PC,Rating,Total Incidents by rating and Suburb/PC
72592,2019.0,December,Alpine,3691,Dederang,A Crimes against the person,A20 Assault and related offences,A212 Non-FV Serious assault,1.0,"Dederang, 3691",59.0,12.0,4.0,3.0,5.0
72593,2018.0,December,Alpine,3691,Dederang,A Crimes against the person,A20 Assault and related offences,A232 Non-FV Common assault,1.0,"Dederang, 3691",59.0,12.0,2.0,3.0,5.0


Group the dataset by "SuburbPostcode" column to get the total number of retained offences of each "SuburbPostcode"

In [17]:
total_off_sub=crime_severity_df.groupby(['SuburbPostcode'])[['Incidents Recorded']].sum().reset_index()

# Create a new dataframe merging crime_severity_df and total_off_sub to include "Total Relevant Incidents - Suburb"
total_off_sub_df = pd.merge(crime_severity_df, total_off_sub, how='outer', on='SuburbPostcode')
total_off_sub_df.rename(columns={'Incidents Recorded_x':'Incidents Recorded','Incidents Recorded_y':'Total Relevant Incidents - Suburb'}, inplace=True)

total_off_sub_df.head(2)

Unnamed: 0,Year,Year ending,Local Government Area,Postcode,Suburb/Town Name,Offence Division,Offence Subdivision,Offence Subgroup,Incidents Recorded,SuburbPostcode,Total Incidents by Suburb/PC,Total Incidents by Offence Division and Suburb/PC,Total Incidents by Offence Subgroup and Suburb/PC,Rating,Total Incidents by rating and Suburb/PC,Total Relevant Incidents - Suburb
0,2019.0,December,Alpine,3691,Dederang,A Crimes against the person,A20 Assault and related offences,A212 Non-FV Serious assault,1.0,"Dederang, 3691",59.0,12.0,4.0,3.0,5.0,8.0
1,2018.0,December,Alpine,3691,Dederang,A Crimes against the person,A20 Assault and related offences,A232 Non-FV Common assault,1.0,"Dederang, 3691",59.0,12.0,2.0,3.0,5.0,8.0


----------------------------------------------------------------------------------

### Crime rate dataset

### Dataset Info

Name: Data Tables - LGA Criminal Incidents Visualisation - year ending December 2019

Description: Table 02 Criminal incidents and rate per 100,000 population by principal offence, local government area and police service area - January 2010 to December 2019

Link : https://www.crimestatistics.vic.gov.au/sites/default/files/embridge_cache/emshare/original/public/users/202003/e7/d73f8b607/Data_Tables_LGA_Recorded_Offences_Year_Ending_December_2019.xlsx

Permissions : CC BY 4.0

### Wrangling Tasks

Read the data into Python

In [18]:
rate_df = pd.read_excel('Data_Tables_LGA_Criminal_Incidents_Year_Ending_December_2019.xlsx','Table 02')
print("Dataset shape : ",rate_df.shape)
print(rate_df.columns)
rate_df.head(2)

Dataset shape :  (50104, 10)
Index(['Year', 'Year ending', 'Police Service Area', 'Local Government Area',
       'Offence Division', 'Offence Subdivision', 'Offence Subgroup',
       'Incidents Recorded', 'PSA Rate per 100,000 population',
       'LGA Rate per 100,000 population'],
      dtype='object')


Unnamed: 0,Year,Year ending,Police Service Area,Local Government Area,Offence Division,Offence Subdivision,Offence Subgroup,Incidents Recorded,"PSA Rate per 100,000 population","LGA Rate per 100,000 population"
0,2019,December,Ballarat,Ballarat,A Crimes against the person,A10 Homicide and related offences,A10 Homicide and related offences,2,1.712247,1.827338
1,2019,December,Ballarat,Ballarat,A Crimes against the person,A20 Assault and related offences,A211 FV Serious assault,150,128.418495,137.050315


Filter the dataset for the relevant year range of 2017 - 2019

In [19]:
rate_df=rate_df[rate_df["Year"].isin([2017, 2018,2019])]

Calculate the overall median of the "LGA Rate per 100,000 population" column across all offences and LGAs

In [20]:
median_rate = rate_df["LGA Rate per 100,000 population"].median()

Filter the dataframe to only include the offences identified previously in the project

In [21]:
rate_df = rate_df[rate_df["Offence Subgroup"].isin(retain_offences)]
print(rate_df.shape)
rate_df.head(3)

(4128, 10)


Unnamed: 0,Year,Year ending,Police Service Area,Local Government Area,Offence Division,Offence Subdivision,Offence Subgroup,Incidents Recorded,"PSA Rate per 100,000 population","LGA Rate per 100,000 population"
2,2019,December,Ballarat,Ballarat,A Crimes against the person,A20 Assault and related offences,A212 Non-FV Serious assault,169,144.684838,154.410021
5,2019,December,Ballarat,Ballarat,A Crimes against the person,A20 Assault and related offences,A232 Non-FV Common assault,184,157.526687,168.115053
9,2019,December,Ballarat,Ballarat,A Crimes against the person,A50 Robbery,A51 Aggravated robbery,32,27.395946,29.2374


Calculate the median LGA rate of each offence in the above dataframe

In [22]:
lga_rate_medians = rate_df.groupby('Offence Subgroup')['LGA Rate per 100,000 population'].median().reset_index()
lga_rate_medians.head(2)

Unnamed: 0,Offence Subgroup,"LGA Rate per 100,000 population"
0,A212 Non-FV Serious assault,110.91531
1,A232 Non-FV Common assault,118.765584


Obtain total incidents recorded by suburb, offence and rating from total_off_sub_df and create a new dataset

In [23]:
incident_total = total_off_sub_df.groupby(['SuburbPostcode','Offence Subgroup','Rating'])[['Incidents Recorded']].sum().reset_index()
incident_total.head(2)

Unnamed: 0,SuburbPostcode,Offence Subgroup,Rating,Incidents Recorded
0,"Abbeyard, 3737",A212 Non-FV Serious assault,3.0,1.0
1,"Abbeyard, 3737",A232 Non-FV Common assault,3.0,1.0


Assign a severity weight to each rating: 1 -> 0.2, 2 -> 0.3 and 3 -> 0.5

In [24]:
# Map each rating to its severity weight: 3 -> 0.5, 2 -> 0.3, anything else -> 0.2
rating = incident_total['Rating']
incident_total['Severity Weight'] = np.select([rating == 3.0, rating == 2.0], [0.5, 0.3], default=0.2)

Create new dataframe to combine incident_total and lga_rate_medians through an outer join

In [25]:
median_comparison = pd.merge(incident_total, lga_rate_medians, how='outer',left_on=['Offence Subgroup'], right_on = ['Offence Subgroup'])
median_comparison.head(2)

Unnamed: 0,SuburbPostcode,Offence Subgroup,Rating,Incidents Recorded,Severity Weight,"LGA Rate per 100,000 population"
0,"Abbeyard, 3737",A212 Non-FV Serious assault,3.0,1.0,0.5,110.91531
1,"Abbotsford, 3067",A212 Non-FV Serious assault,3.0,69.0,0.5,110.91531


Check one suburb ("Beaconsfield, 3807") to inspect the dataset

In [26]:
median_comparison[median_comparison["SuburbPostcode"]=="Beaconsfield, 3807"]

Unnamed: 0,SuburbPostcode,Offence Subgroup,Rating,Incidents Recorded,Severity Weight,"LGA Rate per 100,000 population"
72,"Beaconsfield, 3807",A212 Non-FV Serious assault,3.0,20.0,0.5,110.91531
1182,"Beaconsfield, 3807",A232 Non-FV Common assault,3.0,26.0,0.5,118.765584
2190,"Beaconsfield, 3807",A51 Aggravated robbery,3.0,9.0,0.5,20.265544
2702,"Beaconsfield, 3807",A52 Non-Aggravated robbery,2.0,1.0,0.3,3.979991
2973,"Beaconsfield, 3807",A712 Non-FV Stalking,3.0,1.0,0.5,13.489831
3537,"Beaconsfield, 3807",A722 Non-FV Harassment and private nuisance,3.0,1.0,0.5,8.86549
4066,"Beaconsfield, 3807",A732 Non-FV Threatening behaviour,3.0,4.0,0.5,29.902518
5392,"Beaconsfield, 3807",A89 Other dangerous or negligent acts endanger...,1.0,5.0,0.2,38.43702
6425,"Beaconsfield, 3807",D11 Firearms offences,3.0,1.0,0.5,37.846532
7453,"Beaconsfield, 3807",D12 Prohibited and controlled weapons offences,3.0,23.0,0.5,86.003491



Create a new column, "Flag", in the median_comparison dataset to store a binary indicator that denotes whether the incidents recorded for an offence in that suburb exceed the median LGA rate per 100,000 population for that offence. Create another column, "Suburb Score", to store the product of the flag and the severity weight

In [27]:
median_comparison['Flag'] = np.where(median_comparison['Incidents Recorded']>median_comparison['LGA Rate per 100,000 population'], 1,0)
median_comparison['Suburb Score'] = median_comparison['Flag']*median_comparison['Severity Weight']
median_comparison.head(2)

Unnamed: 0,SuburbPostcode,Offence Subgroup,Rating,Incidents Recorded,Severity Weight,"LGA Rate per 100,000 population",Flag,Suburb Score
0,"Abbeyard, 3737",A212 Non-FV Serious assault,3.0,1.0,0.5,110.91531,0,0.0
1,"Abbotsford, 3067",A212 Non-FV Serious assault,3.0,69.0,0.5,110.91531,0,0.0


Group the median comparison dataframe by "SuburbPostcode" column to obtain the total Suburb Score

In [28]:
median_comparison_df =median_comparison.groupby(['SuburbPostcode'])[['Suburb Score']].sum().reset_index()
median_comparison_df.head(2)

Unnamed: 0,SuburbPostcode,Suburb Score
0,"Abbeyard, 3737",0.0
1,"Abbotsford, 3067",1.6
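
The flag-and-sum scoring in the two cells above can be traced on a small example. The sketch below uses hypothetical suburbs and values (none of them come from the crime data) and repeats the same three steps: flag, multiply by the weight, sum per suburb.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two offences in each of two made-up suburbs
toy = pd.DataFrame({
    "SuburbPostcode": ["X, 3000", "X, 3000", "Y, 3001", "Y, 3001"],
    "Incidents Recorded": [150.0, 2.0, 1.0, 3.0],
    "Severity Weight": [0.5, 0.3, 0.5, 0.2],
    "LGA Rate per 100,000 population": [110.9, 4.0, 110.9, 38.4],
})

# Flag is 1 only where the incident count exceeds the median rate
toy["Flag"] = np.where(
    toy["Incidents Recorded"] > toy["LGA Rate per 100,000 population"], 1, 0
)
toy["Suburb Score"] = toy["Flag"] * toy["Severity Weight"]

# Sum the weighted flags to get one score per suburb
scores = toy.groupby("SuburbPostcode")["Suburb Score"].sum()
print(scores)
```

Only the first row exceeds its median, so the first suburb scores 0.5 and the second scores 0.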


In [29]:
median_comparison_df['Suburb Score'].unique()

array([0. , 1.6, 0.2, 0.3, 0.8, 1. , 0.5, 0.6, 1.3, 3.5, 5.1, 0.7, 4.3,
       3.6, 3. , 1.8, 6.5, 4. , 2.1, 2.8, 3.8, 1.1, 5. , 4.7, 5.2, 8.9,
       6.2, 1.5, 4.1, 4.7, 8.3, 5.2, 1.7, 1.2, 5.5, 2. , 9.5, 8.1, 1.9,
       3.1, 8. , 5.7, 6.1, 2.6, 6.7, 3.1, 7.5, 6.9, 5.9, 3.9, 2.5, 7.9,
       5.8, 6.2, 6.6, 5.4, 4.8, 2.4, 2.9, 7.4, 7. ])

Create column "Indicator" to assign "Safe" and "Unsafe" values to each suburb based on the total suburb score. If suburb score = 0, the indicator is "Safe" and if suburb score >0, the indicator is "Unsafe"

In [30]:
median_comparison_df['Indicator'] = np.where(median_comparison_df['Suburb Score']==0, "Safe","Unsafe")
median_comparison_df.rename(columns={'Local Government Area':'LGA','Suburb, Postcode':'SuburbPostcode','Suburb Score':'SuburbScore' }, inplace=True)

median_comparison_df.head(2)

Unnamed: 0,SuburbPostcode,SuburbScore,Indicator
0,"Abbeyard, 3737",0.0,Safe
1,"Abbotsford, 3067",1.6,Unsafe


-----------------------------------------------------------------------------------------------

#### Import geographical data

Import the Victorian locality borders dataset, `VIC_LOCALITY_POLYGON_shp.shp`, into Python so that the calculated suburb scores can be integrated with the map

In [31]:
# read shapefile
sf = shp.Reader("VIC_LOCALITY_POLYGON_shp.shp")

# take the shapefile's field names, omitting the first pseudo field
fields = [x[0] for x in sf.fields][1:]
records = sf.records()
shps = [s.points for s in sf.shapes()]

#write the records into a dataframe
shapefile_dataframe = pd.DataFrame(columns=fields, data=records)

#add the coordinate data to a column called "coords"
shapefile_dataframe = shapefile_dataframe.assign(coords=shps)
shapefile_dataframe.head()

Unnamed: 0,LC_PLY_PID,DT_CREATE,DT_RETIRE,LOC_PID,VIC_LOCALI,VIC_LOCA_1,VIC_LOCA_2,VIC_LOCA_3,VIC_LOCA_4,VIC_LOCA_5,VIC_LOCA_6,VIC_LOCA_7,coords
0,6670,2011-08-31,,VIC2615,2012-04-27,,UNDERBOOL,,,G,,2,"[(141.74552399, -35.07228701), (141.74552471, ..."
1,6671,2011-08-31,,VIC1986,2012-04-27,,NURRAN,,,G,,2,"[(148.668767, -37.39571245), (148.66876202, -3..."
2,6672,2011-08-31,,VIC2862,2012-04-27,,WOORNDOO,,,G,,2,"[(142.92287999, -37.97885997), (142.90449196, ..."
3,6673,2011-08-31,,VIC734,2018-08-03,,DEPTFORD,,,G,,2,"[(147.82335712, -37.66000897), (147.8231274, -..."
4,6674,2011-08-31,,VIC2900,2012-04-27,,YANAC,,,G,,2,"[(141.279783, -35.99858911), (141.27988533, -3..."


Check size of the shapefile dataframe and the count of unique values in the 'VIC_LOCA_2' column which corresponds to the suburb/locality name

In [32]:
print("Rows in dataframe : ", len(shapefile_dataframe), "\nUnique values in 'VIC_LOCA_2' column : ", len(shapefile_dataframe["VIC_LOCA_2"].unique()))

Rows in dataframe :  2973 
Unique values in 'VIC_LOCA_2' column :  2957


Create new dataframe extracting the suburb column from the shapefile dataframe for convenient processing

In [33]:
suburbs = shapefile_dataframe[["VIC_LOCA_2"]]
# Locality names are left in upper case; the crime suburb names are upper-cased later to match
print(len(suburbs))
suburbs.head(3)

2973


Unnamed: 0,VIC_LOCA_2
0,UNDERBOOL
1,NURRAN
2,WOORNDOO


Rename the column

In [34]:
# Assign the renamed result rather than renaming in place, which avoids pandas' SettingWithCopyWarning
suburbs = suburbs.rename(columns={'VIC_LOCA_2':'Suburb'})


Create new dataframe from 'median_comparison_df' to only include the 'SuburbPostcode', 'SuburbScore' and 'Indicator' columns

In [35]:
# .copy() gives an independent dataframe, so new columns can be added safely
mc_edit_df = median_comparison_df[['SuburbPostcode','SuburbScore','Indicator']].copy()

# Split the SuburbPostcode column into two columns, Suburb and Postcode
mc_edit_df[['Suburb','Postcode']] = mc_edit_df['SuburbPostcode'].str.split(", ",expand=True)

# Convert 'Suburb' column to uppercase
mc_edit_df['Suburb'] = mc_edit_df['Suburb'].str.upper()
mc_edit_df.head(3)

Unnamed: 0,SuburbPostcode,SuburbScore,Indicator,Suburb,Postcode
0,"Abbeyard, 3737",0.0,Safe,ABBEYARD,3737
1,"Abbotsford, 3067",1.6,Unsafe,ABBOTSFORD,3067
2,"Aberfeldie, 3040",0.0,Safe,ABERFELDIE,3040


Merge 'mc_edit_df' with 'suburbs' dataframe to combine the information

In [36]:
mc_df_2 = suburbs.merge(mc_edit_df, on='Suburb', how='outer')
mc_df_2.head(3)


Unnamed: 0,Suburb,SuburbPostcode,SuburbScore,Indicator,Postcode
0,UNDERBOOL,"Underbool, 3509",0.0,Safe,3509.0
1,NURRAN,,,,
2,WOORNDOO,"Woorndoo, 3272",0.0,Safe,3272.0
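
Because the join key is a free-text suburb name, the match only succeeds when both sides use the same casing and spacing. A minimal sketch, on hypothetical toy frames (`localities` and `scores` are illustrative stand-ins), of normalising the key and counting what fails to match:

```python
import pandas as pd

# Hypothetical stand-ins for the suburbs and mc_edit_df dataframes
localities = pd.DataFrame({"Suburb": ["UNDERBOOL", "NURRAN"]})
scores = pd.DataFrame({"SuburbPostcode": ["Underbool, 3509"], "SuburbScore": [0.0]})

# Split "Name, Postcode" and upper-case the name to match the locality names
parts = scores["SuburbPostcode"].str.split(", ", expand=True)
scores["Suburb"] = parts[0].str.strip().str.upper()
scores["Postcode"] = parts[1]

joined = localities.merge(scores, on="Suburb", how="outer")

# Localities with no crime record end up with NaN in the score column
unmatched = joined["SuburbScore"].isna().sum()
print(joined)
print("Localities without a score:", unmatched)
```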


Check a few random suburb names to verify whether the data is coming through properly

In [37]:
mc_df_2[mc_df_2["Suburb"]=="KINGS PARK"]

Unnamed: 0,Suburb,SuburbPostcode,SuburbScore,Indicator,Postcode
2179,KINGS PARK,"Kings Park, 3021",0.0,Safe,3021


In [38]:
mc_df_2[mc_df_2["Suburb"]=="ALBANVALE"]

Unnamed: 0,Suburb,SuburbPostcode,SuburbScore,Indicator,Postcode
536,ALBANVALE,"Albanvale, 3021",0.0,Safe,3021


In [39]:
mc_df_2[mc_df_2["Suburb"]=="KEALBA"]

Unnamed: 0,Suburb,SuburbPostcode,SuburbScore,Indicator,Postcode
958,KEALBA,"Kealba, 3021",0.0,Safe,3021


In [40]:
len(mc_df_2)

3007

Assign group to each suburb record in the dataset as follows.

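* SuburbScore from 0 up to (but not including) 3 - group 1
* SuburbScore from 3 up to (but not including) 6 - group 2
* SuburbScore from 6 up to (but not including) 10 - group 3
* SuburbScore missing or outside these ranges - group 0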


In [41]:
group_list = []

for i, row in mc_df_2.iterrows():
    if (mc_df_2["SuburbScore"][i]>=0) & (mc_df_2["SuburbScore"][i]<3):
        group_list.append(1)
    elif (mc_df_2["SuburbScore"][i]>=3) & (mc_df_2["SuburbScore"][i]<6):
        group_list.append(2)
    elif (mc_df_2["SuburbScore"][i]>=6)& (mc_df_2["SuburbScore"][i]<10):
        group_list.append(3)
    else:
        group_list.append(0)
        
mc_df_2['Group'] = pd.Series(group_list).values 
mc_df_2.head(5)

Unnamed: 0,Suburb,SuburbPostcode,SuburbScore,Indicator,Postcode,Group
0,UNDERBOOL,"Underbool, 3509",0.0,Safe,3509.0,1
1,NURRAN,,,,,0
2,WOORNDOO,"Woorndoo, 3272",0.0,Safe,3272.0,1
3,DEPTFORD,,,,,0
4,YANAC,"Yanac, 3418",0.0,Safe,3418.0,1
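
The same grouping rule can be expressed without an explicit loop by using `pd.cut` with left-closed bins. This is a sketch on hypothetical scores, not the notebook's own method; the bin edges mirror the conditions in the loop above.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, including boundary values and a missing value
scores = pd.Series([0.0, 2.9, 3.0, 5.9, 6.0, 9.5, np.nan])

# Bins are closed on the left: [0, 3), [3, 6), [6, 10)
groups = pd.cut(scores, bins=[0, 3, 6, 10], labels=[1, 2, 3], right=False)

# Values outside every bin (including NaN) become group 0
groups = groups.astype("float").fillna(0).astype(int)
print(groups.tolist())
```

Setting `right=False` is what makes a score of exactly 3 fall in group 2 and exactly 6 fall in group 3, as in the loop.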


Replace the NaN values in SuburbScore with 'Score Unavailable' and NaN values in 'Indicator' with 'Data Unavailable'

In [42]:
mc_df_2["SuburbScore"].fillna('Score Unavailable', inplace=True)
mc_df_2["Indicator"].fillna('Data Unavailable', inplace=True)

# Drop duplicates
mc_df_2= mc_df_2.drop_duplicates()
mc_df_2.head(2)

Unnamed: 0,Suburb,SuburbPostcode,SuburbScore,Indicator,Postcode,Group
0,UNDERBOOL,"Underbool, 3509",0,Safe,3509.0,1
1,NURRAN,,Score Unavailable,Data Unavailable,,0


In [43]:
# Exclude rows with postcode 3004 or 3005
mc_df_2 = mc_df_2[~mc_df_2['Postcode'].isin(['3004','3005'])]
len(mc_df_2)

2982

In [44]:
mc_df_2[mc_df_2["Suburb"]=="KINGS PARK"]

Unnamed: 0,Suburb,SuburbPostcode,SuburbScore,Indicator,Postcode,Group
2179,KINGS PARK,"Kings Park, 3021",0,Safe,3021,1


In [45]:
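# Suburb names are stored in upper case, so match on "ST ALBANS"
mc_df_2[mc_df_2["Suburb"]=="ST ALBANS"]
# Drop the row at index 1013, then remove the 'Indicator' column
mc_df_2=mc_df_2.drop([1013])
mc_df_2=mc_df_2.drop(['Indicator'], axis = 1)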

In [46]:
mc_df_2['Suburb'] = mc_df_2['Suburb'].str.upper()
mc_df_2.head()

Unnamed: 0,Suburb,SuburbPostcode,SuburbScore,Postcode,Group
0,UNDERBOOL,"Underbool, 3509",0,3509.0,1
1,NURRAN,,Score Unavailable,,0
2,WOORNDOO,"Woorndoo, 3272",0,3272.0,1
3,DEPTFORD,,Score Unavailable,,0
4,YANAC,"Yanac, 3418",0,3418.0,1


Create the final dataset only including the suburb and score group

In [47]:
mc_df_2=mc_df_2[['Suburb','Group']]
mc_df_2.rename(columns={'Suburb':'SUBURB','Group':'GROUP'}, inplace=True)
mc_df_2.head(3)

Unnamed: 0,SUBURB,GROUP
0,UNDERBOOL,1
1,NURRAN,0
2,WOORNDOO,1


Export dataset as CSV file

In [49]:
mc_df_2.to_csv("SAFETY_EXPORT.csv", index=False)

-----------------------------------------------------------------------------------------------

#### Export files produced

* lga_suburb_export.csv --> Contains the LGA, Suburb and Postcode information

* SAFETY_EXPORT.csv --> Contains the safety indicator group for each suburb

* lga_crimes_export.csv --> Contains the LGA, Location and percentage of crimes


---------------------------------------------

## End of data wrangling process for iteration 02, continued from iteration 01