**Doug Marcum  
DSC 540 - Term Project Milestone 3**  

### Cleaning/Formatting Website Data

Perform at least 5 data transformation and/or cleansing steps on your website data. For example: replace headers, format data into a more readable format, identify outliers and bad data, find duplicates, fix casing or inconsistent values, or conduct fuzzy matching.

### Part 1 - Loading Data

In [1]:
# load libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import unicodedata
import re

In [2]:
# make request to website url
page = requests.get('https://www.narcity.com/ca/on/toronto/news/toronto-neighbourhoods-ranked-by-how-dangerous-they-are-right-now-based-on-2018-crime-rates')
page

<Response [200]>

In [3]:
# run BeautifulSoup and display prettified data
soup = BeautifulSoup(page.content, 'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-CA">
 <head>
  <link crossorigin="" href="https://fonts.gstatic.com/" rel="preconnect"/>
  <link crossorigin="" href="https://www.google-analytics.com" rel="preconnect"/>
  <link crossorigin="" href="https://connect.facebook.net" rel="preconnect"/>
  <link crossorigin="" href="https://securepubads.g.doubleclick.net" rel="preconnect"/>
  <link crossorigin="" href="https://fundingchoicesmessages.google.com" rel="preconnect"/>
  <link crossorigin="" href="https://tpc.googlesyndication.com" rel="preconnect"/>
  <link crossorigin="" href="https://contributor.google.com" rel="preconnect"/>
  <script async="async" src="https://securepubads.g.doubleclick.net/tag/js/gpt.js">
  </script>
  <link as="script" href="https://www.narcity.com/polemos.7ebaedfc1e57734fd8a9.js" rel="preload"/>
  <link as="script" href="https://www.narcity.com/v2_vendors.7ebaedfc1e57734fd8a9.js" rel="preload"/>
  <link as="script" href="https://www.narcity.com/v2_main.7ebaedfc1e57734fd8a9.js

### Part 2a - Neighborhoods Transformation Steps

In [4]:
# use find_all to find all <em> tags for lists of neighborhoods
tag = soup.find_all('em')

# use get_text to extract text
neigh = [x.get_text() for x in tag]
neigh

['CP24',
 '"',
 'Here Are The Most Dangerous Cities In Canada Right Now, Ranked By The Crime"',
 '\xa0',
 '(Yorkdale-Glen Park, Englemout-Lawrence, Briar Hill-Belgravia, Forest Hill North, Caledonia-Fairbanks, Oakwood-Vaughan, Humewood-Cedarvale, Forest Hill South, Corsa Italia-Davenport, Dovercourt-Walla Emerson-Junction, Wychwood, Casa Loma)',
 '\xa0',
 "(Flemingdon Park, Victoria Village, O'Connor- Parkview, Old East York, Woodbine-Liamsden, Crescent Town, East End Danforth, Old East York, Danforth Village East York, Danforth Village Toronto, Broadview North, Playter Estates-Danforth)",
 '\xa0',
 '(North Riverdale, Blake-Jones, Greenwood-Coxwell, Woodbine Corridor, East End Danforth, The Beaches, South Riverdale)',
 '\xa0',
 '(Bayview Woods-Steeles, Hillcrest Village, Bayview Village, Don Valley Village, Pleasant View, Henry Farm, St. Andrew Windfields, Parkwoods Donalda Victoria Village, Banbury Don Mills, Bridle Path, Sunnybrook, York Mills)',
 '\xa0',
 '(Lambton-Baby Point, Rockc

In [5]:
# layers of filtering to make the text more readable and usable
neigh = list(filter(lambda a: a != 'CP24', neigh))
neigh = list(filter(lambda a: a != '\xa0', neigh))
neigh = list(filter(lambda a: a != '"', neigh))
neigh = list(filter(lambda a: a != 'CP 24 Crime Map\xa0', neigh))
neigh

['Here Are The Most Dangerous Cities In Canada Right Now, Ranked By The Crime"',
 '(Yorkdale-Glen Park, Englemout-Lawrence, Briar Hill-Belgravia, Forest Hill North, Caledonia-Fairbanks, Oakwood-Vaughan, Humewood-Cedarvale, Forest Hill South, Corsa Italia-Davenport, Dovercourt-Walla Emerson-Junction, Wychwood, Casa Loma)',
 "(Flemingdon Park, Victoria Village, O'Connor- Parkview, Old East York, Woodbine-Liamsden, Crescent Town, East End Danforth, Old East York, Danforth Village East York, Danforth Village Toronto, Broadview North, Playter Estates-Danforth)",
 '(North Riverdale, Blake-Jones, Greenwood-Coxwell, Woodbine Corridor, East End Danforth, The Beaches, South Riverdale)',
 '(Bayview Woods-Steeles, Hillcrest Village, Bayview Village, Don Valley Village, Pleasant View, Henry Farm, St. Andrew Windfields, Parkwoods Donalda Victoria Village, Banbury Don Mills, Bridle Path, Sunnybrook, York Mills)',
 '(Lambton-Baby Point, Rockcliffe-Smythe, Runnymede-Bloor West Village, High Park-Swanse
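The four filter passes above can also be written as a single pass. The following is a minimal sketch, assuming the unwanted labels are known in advance; `raw_tokens` and `unwanted` are illustrative names and values, not the scraped data itself.

```python
# Illustrative tokens only; the real list comes from the <em> tags scraped above.
raw_tokens = ['CP24', '"', 'Headline"', '\xa0', '(Weston, Mount Dennis)', 'CP 24 Crime Map\xa0']

# Hypothetical set of labels to discard, compared after trimming whitespace.
unwanted = {'CP24', '"', 'CP 24 Crime Map'}

# str.strip() also removes non-breaking spaces, so '\xa0' becomes an empty string
# and is dropped by the first condition.
kept = [t for t in raw_tokens if t.strip() and t.strip() not in unwanted]
print(kept)
```

Comparing the trimmed value means entries that differ only by a trailing `\xa0` do not need their own filter.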

In [6]:
# create data frame
df_1 = pd.DataFrame({'neighborhoods': neigh})
df_1

Unnamed: 0,neighborhoods
0,Here Are The Most Dangerous Cities In Canada R...
1,"(Yorkdale-Glen Park, Englemout-Lawrence, Briar..."
2,"(Flemingdon Park, Victoria Village, O'Connor- ..."
3,"(North Riverdale, Blake-Jones, Greenwood-Coxwe..."
4,"(Bayview Woods-Steeles, Hillcrest Village, Bay..."
5,"(Lambton-Baby Point, Rockcliffe-Smythe, Runnym..."
6,"(Pelmo Park- Humberlea, Weston, Mount Dennis, ..."
7,"(Thorncliffe Park, Leaside-Bennington, Bridle ..."
8,"(Eringate Centennial/West Deane, Edenbridge - ..."
9,"(Steeles, L'amoreaux, Tam O'Shanter Sullivan, ..."


In [7]:
# drop row 0 as this text is not necessary and reset index
df_1 = df_1.drop(df_1.index[0]).reset_index(drop = True)
df_1

Unnamed: 0,neighborhoods
0,"(Yorkdale-Glen Park, Englemout-Lawrence, Briar..."
1,"(Flemingdon Park, Victoria Village, O'Connor- ..."
2,"(North Riverdale, Blake-Jones, Greenwood-Coxwe..."
3,"(Bayview Woods-Steeles, Hillcrest Village, Bay..."
4,"(Lambton-Baby Point, Rockcliffe-Smythe, Runnym..."
5,"(Pelmo Park- Humberlea, Weston, Mount Dennis, ..."
6,"(Thorncliffe Park, Leaside-Bennington, Bridle ..."
7,"(Eringate Centennial/West Deane, Edenbridge - ..."
8,"(Steeles, L'amoreaux, Tam O'Shanter Sullivan, ..."
9,"(University, Bay Street Corridor, Kensington-C..."


In [8]:
# after reviewing the data, it was clear that Birchcliffe had been split off by some messy markup on the website.
# the following adds Birchcliffe back to the list, joins the two disconnected rows, and drops the leftover row.
# .loc is used instead of chained indexing so the assignment always writes to df_1 itself.
df_1.loc[16, 'neighborhoods'] = 'Birchcliffe' + df_1.loc[16, 'neighborhoods']
df_1.loc[15, 'neighborhoods'] = df_1.loc[15, 'neighborhoods'][0:25] + df_1.loc[16, 'neighborhoods']
df_1 = df_1.drop(df_1.index[16]).reset_index(drop = True)
df_1

Unnamed: 0,neighborhoods
0,"(Yorkdale-Glen Park, Englemout-Lawrence, Briar..."
1,"(Flemingdon Park, Victoria Village, O'Connor- ..."
2,"(North Riverdale, Blake-Jones, Greenwood-Coxwe..."
3,"(Bayview Woods-Steeles, Hillcrest Village, Bay..."
4,"(Lambton-Baby Point, Rockcliffe-Smythe, Runnym..."
5,"(Pelmo Park- Humberlea, Weston, Mount Dennis, ..."
6,"(Thorncliffe Park, Leaside-Bennington, Bridle ..."
7,"(Eringate Centennial/West Deane, Edenbridge - ..."
8,"(Steeles, L'amoreaux, Tam O'Shanter Sullivan, ..."
9,"(University, Bay Street Corridor, Kensington-C..."


In [9]:
# remove the remaining \xa0 from the strings
# NFKD normalization turns the non-breaking space into a regular space
for i in (6, 7, 14):
    df_1.loc[i, 'neighborhoods'] = unicodedata.normalize("NFKD", df_1.loc[i, 'neighborhoods'])

print(df_1.neighborhoods[6])
print(df_1.neighborhoods[7])
print(df_1.neighborhoods[14])

(Thorncliffe Park, Leaside-Bennington, Bridle Path, Sunnybrook, York Mills, Mount Pleasant East, Mount Pleasant West, Lawrence Park South, Yonge-Eglinton, Yonge-St. Clair, Forest Hill South, Casa Loma, Annex, Lawrence Park South, Bedford Park-Norton, Forest Hill North) 
(Eringate Centennial/West Deane, Edenbridge - Humber Valley, Etobicoke - West Mall, Kingsway South, Alderwood, Long Branch, Princess Rosethorn, Markland Wood, Islington - City Centre West, Stonegate - Queensway, Mimico, New Toronto)
(York University Heights, Bathurst Manot, Westminster-Branson, Newtonbrook West, Willowdale West, Newtonbrook East, Willowdale East, St. Andrew-Windfields, Lansing-Westgate, Bridle Path, Sunnybrook, York Mills, Lawrence Park North, Bedford Park-Nortown, Clanton Park, Englemount-Lawrence, Yorkdale-Glen Park, Downsview-Roding-CFB)
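Note that the first printed row still ends with a space, because normalization converts `\xa0` to a regular space rather than deleting it. That trailing space appears to be what later keeps `strip('()')` from removing the closing parenthesis on a few names. A minimal sketch of normalizing and trimming every row in one step, using made-up sample strings rather than the scraped rows:

```python
import unicodedata

# Illustrative rows; each contains a non-breaking space in a different position.
rows = ['(Thorncliffe Park, Annex)\xa0', '(Mimico,\xa0New Toronto)']

# Normalize first, then trim whitespace so the parentheses sit at the very ends,
# and only then strip the parentheses themselves.
cleaned = [unicodedata.normalize('NFKD', r).strip().strip('()') for r in rows]
print(cleaned)
```

With a pandas Series the same function could be applied to the whole column through `.map`, which avoids having to find the affected row numbers by hand.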


In [10]:
# here rows are stripped of '()', split on ',', and returned to a data frame
df_1 = df_1.neighborhoods.str.strip('()')
df_1 = df_1.str.split(',')
df_1 = pd.DataFrame({'neighborhoods': df_1})
df_1

Unnamed: 0,neighborhoods
0,"[Yorkdale-Glen Park, Englemout-Lawrence, Bri..."
1,"[Flemingdon Park, Victoria Village, O'Connor..."
2,"[North Riverdale, Blake-Jones, Greenwood-Cox..."
3,"[Bayview Woods-Steeles, Hillcrest Village, B..."
4,"[Lambton-Baby Point, Rockcliffe-Smythe, Runn..."
5,"[Pelmo Park- Humberlea, Weston, Mount Dennis..."
6,"[Thorncliffe Park, Leaside-Bennington, Bridl..."
7,"[Eringate Centennial/West Deane, Edenbridge -..."
8,"[Steeles, L'amoreaux, Tam O'Shanter Sullivan..."
9,"[University, Bay Street Corridor, Kensington..."
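Splitting on a bare comma leaves a leading space on every item after the first, which has to be trimmed later. As a hedged alternative, the split pattern can absorb the surrounding whitespace. The sample values below are illustrative, and the `regex=True` argument assumes pandas 1.4 or newer.

```python
import pandas as pd

# Illustrative rows; the second has a trailing space after the closing parenthesis.
rows = pd.Series(['(Yorkdale-Glen Park, Casa Loma)', '(Weston, Mount Dennis) '])

lists = (
    rows.str.strip()                           # trim outer whitespace first
        .str.strip('()')                       # then remove the parentheses
        .str.split(r'\s*,\s*', regex=True)     # split on commas plus any padding
)
print(lists.tolist())
```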


### Part 2b - Police Divisions Transformation Steps

In [11]:
# use find_all to find all <h2> tags for list of police divisions
tag = soup.find_all('h2')

# use get_text to extract text
divisions = [x.get_text() for x in tag]
divisions

['How does your neighbourhood rank?',
 '17. Division 13:',
 '16. Division 54:',
 '15. Division 55:',
 '14. Division 33:',
 '13. Division 11:\xa0',
 '12. Division 12:\xa0',
 '11. Division 53:',
 '10. Division 22:',
 '9. Division 42:',
 '8. Division 52:',
 '7. Division 23:',
 '6. Division 31:',
 '5. Division 43:',
 '4. Division 14:',
 '3. Division 32:',
 '2. Division 41:',
 '1. Division 51:']

In [12]:
# create dataframe
df_2 = pd.DataFrame({'divisions': divisions})
df_2

Unnamed: 0,divisions
0,How does your neighbourhood rank?
1,17. Division 13:
2,16. Division 54:
3,15. Division 55:
4,14. Division 33:
5,13. Division 11:
6,12. Division 12:
7,11. Division 53:
8,10. Division 22:
9,9. Division 42:


In [13]:
# drop row 0 as this text is not necessary and reset index
df_2 = df_2.drop(df_2.index[0]).reset_index(drop = True)
df_2

Unnamed: 0,divisions
0,17. Division 13:
1,16. Division 54:
2,15. Division 55:
3,14. Division 33:
4,13. Division 11:
5,12. Division 12:
6,11. Division 53:
7,10. Division 22:
8,9. Division 42:
9,8. Division 52:


In [14]:
# split on '.' and then split on ':' to remove unnecessary characters
df_2['divisions'] = df_2['divisions'].str.split('.').str[1]
df_2['divisions'] = df_2['divisions'].str.split(':').str[0]
df_2

Unnamed: 0,divisions
0,Division 13
1,Division 54
2,Division 55
3,Division 33
4,Division 11
5,Division 12
6,Division 53
7,Division 22
8,Division 42
9,Division 52
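The two splits above leave a leading space in front of each division name, which is removed later. A minimal sketch of an alternative that captures the rank and the division name directly with `str.extract`; the sample headings are copied from the output above, and the column names `rank` and `division` are chosen here for illustration.

```python
import pandas as pd

headings = pd.Series(['17. Division 13:', '13. Division 11:\xa0', '9. Division 42:'])

# Named groups become column names; anything after the division number is ignored.
parsed = headings.str.extract(r'^(?P<rank>\d+)\.\s*(?P<division>Division \d+)')
parsed['rank'] = parsed['rank'].astype(int)
print(parsed)
```

Keeping the rank as its own column may be useful later, since the article orders the divisions by it.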


### Part 3 - Combining Data Frames and Final Combined Transformations

In [15]:
# concatenate the two data frames
df_3 = pd.concat([df_1, df_2], axis=1)

# sort by 'divisions' to make it a bit easier to read
df_3.sort_values(by=['divisions'], inplace=True)
df_3

Unnamed: 0,neighborhoods,divisions
4,"[Lambton-Baby Point, Rockcliffe-Smythe, Runn...",Division 11
5,"[Pelmo Park- Humberlea, Weston, Mount Dennis...",Division 12
0,"[Yorkdale-Glen Park, Englemout-Lawrence, Bri...",Division 13
13,"[Christie-Ossington, Annex, Palmerston-Littl...",Division 14
7,"[Eringate Centennial/West Deane, Edenbridge -...",Division 22
10,"[Mount Oliver-Silverstone-Jamestown, Thistlet...",Division 23
11,"[Humber Summit, Humbermede, Pelmo Park-Humbe...",Division 31
14,"[York University Heights, Bathurst Manot, We...",Division 32
3,"[Bayview Woods-Steeles, Hillcrest Village, B...",Division 33
15,"[Kennedy Park, Oakridge, Birchcliffe-Cliffsi...",Division 41


In [16]:
# explode the lists in each row
df_exploded = df_3.explode('neighborhoods')

In [17]:
# sort by neighborhoods for easier reading
df_exploded.sort_values(by = ['neighborhoods'], inplace = True)
df_exploded

Unnamed: 0,neighborhoods,divisions
8,Agincourt North,Division 42
8,Agincourt South Malvern West,Division 42
7,Alderwood,Division 22
13,Annex,Division 14
6,Annex,Division 53
...,...,...
8,Steeles,Division 42
6,Thorncliffe Park,Division 53
9,University,Division 52
14,York University Heights,Division 32


In [18]:
# format fields in data frame
df_exploded['neighborhoods'] = df_exploded['neighborhoods'].str.lower().str.strip().str.replace(' ', '_')
df_exploded['divisions'] = df_exploded['divisions'].str.lower().str.lstrip().str.replace(' ', '_')
df_exploded = df_exploded.reset_index(drop = True)
df_exploded

Unnamed: 0,neighborhoods,divisions
0,agincourt_north,division_42
1,agincourt_south_malvern_west,division_42
2,alderwood,division_22
3,annex,division_14
4,annex,division_53
...,...,...
166,steeles,division_42
167,thorncliffe_park,division_53
168,university,division_52
169,york_university_heights,division_32


In [19]:
# make a number of spelling/format corrections to align with government datasets
# there is likely a more compact way to do this, but with so many inconsistencies,
# explicit replacements were the most accurate way to make the corrections
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('bathurst_manot', 'bathurst_manor')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('bedford_park-norton', 'bedford_park-nortown')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('bridle_path', 'bridle_path-sunnybrook-york_mills')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('crescent_town', 'danforth')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('cabbagetown-south_st._jamestown', 'cabbagetown-south_st_james_town')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('corsa_italia-davenport', 'corso_italia-davenport')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('dovercourt-walla_emerson-junction', 'dovercourt-wallace_emerson-junction')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('englemout-lawrence', 'englemount-lawrence')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('eringate_centennial/west_deane', 'eringate-centennial_west_deane')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('maple_leafs', 'maple_leaf')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('oakwood-vaughan', 'oakwood_village')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('st._andrew-windfields', 'st_andrew-windfields')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('st._andrew_windfields', 'st_andrew-windfields')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('stonegate_-_queensway', 'stonegate-queensway')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('woodbine-liamsden', 'woodbine-lumsden')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('edenbridge_-_humber_valley', 'edenbridge-humber_valley')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('etobicoke_-_west_mall', 'etobicoke-west_mall')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('forest_hill_north)', 'forest_hill_north', regex=False)
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('islington_-_city_centre_west', 'islington-city_centre_west')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('north_st._jamestown', 'north_st_jamestown')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace("o'connor-_parkview", "o'connor-parkview")
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('pelmo_park-_humberlea', 'pelmo_park-humberlea')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('yonge-st._clair', 'yonge-st_clair')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('lambton-baby_point', 'lambton_baby_point')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('caledonia-fairbanks', 'caledonia-fairbank')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('waterfront)', 'niagara', regex=False)
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('weston_pelham_park)', 'weston_pelham_park', regex=False)
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('weston_pelham_park', 'weston-pelham_park')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('trinity_bellwoods', 'trinity-bellwoods')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('etobicoke-west_mall', 'etobicoke_west_mall')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('princess_rosethorn', 'princess-rosethorn')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('eringate-centennial_west_deane', 'eringate-centennial-west_deane')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('mount_oliver-silverstone-jamestown', 'mount_olive-silverstone-jamestown')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('thistletown-beaumonde_heights', 'thistletown-beaumond_heights')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('banbury_don_mills', 'banbury-don_mills')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('parkwoods_donalda_victoria_village', 'parkwoods-donalda')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('woodford/maryvale', 'wexford/maryvale')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace("tam_o\'shanter_sullivan", "tam_o'shanter-sullivan")
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('agincourt_south_malvern_west', 'agincourt_south-malvern_west')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('north_st_jamestown', 'north_st_james_town')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('east_end_danforth', 'east_end-danforth')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('danforth_village_east_york', 'danforth_east_york')
df_exploded.neighborhoods = df_exploded.neighborhoods.str.replace('danforth_village_toronto', 'danforth')
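One possible way to shorten a list of corrections like this is fuzzy matching against the official names, using `difflib` from the standard library. This is only a sketch: `official` is a hypothetical stand-in for the neighborhood names in the government dataset, the cutoff of 0.85 is an arbitrary assumption, and renames that are not simple misspellings (for example `crescent_town` to `danforth`) would still need explicit entries.

```python
import difflib

# Hypothetical reference names; in practice these would come from the government dataset.
official = ['bathurst_manor', 'englemount-lawrence', 'woodbine-lumsden', 'annex']

# A few scraped names, three of them misspelled on the website.
scraped = ['bathurst_manot', 'englemout-lawrence', 'woodbine-liamsden', 'annex']

def suggest(name, choices, cutoff=0.85):
    """Return the closest official name, or the input unchanged if nothing is close."""
    matches = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else name

# Build the correction table once so it can be reviewed before it is applied.
corrections = {n: suggest(n, official) for n in scraped if n not in official}
cleaned = [corrections.get(n, n) for n in scraped]
print(corrections)
```

Once reviewed, a dictionary like `corrections` could be applied to the column with `Series.replace`, which matches whole values rather than substrings.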


In [20]:
# change junction to junction_area, and change the duplicate old_east_york entry to taylor-massey
# .loc is used instead of chained indexing so the assignment always writes to df_exploded itself
df_exploded.loc[66, 'neighborhoods'] = 'junction_area'
df_exploded.loc[67, 'neighborhoods'] = 'junction_area'
df_exploded.loc[97, 'neighborhoods'] = 'taylor-massey'

In [21]:
# drop duplicate rows since they were previously combined
df_exploded.drop(df_exploded.index[122:125], inplace = True)
df_exploded.drop(df_exploded.index[146:149], inplace = True)
df_exploded.drop(df_exploded.index[150], inplace = True)

In [22]:
# reset the index after the drops, discarding the old index values
df_cleaned = df_exploded.reset_index(drop = True)
df_cleaned.sort_values(by = ['divisions'], inplace = True) 
df_cleaned

Unnamed: 0,neighborhoods,divisions
107,rockcliffe-smythe,division_11
37,dovercourt-wallace_emerson-junction,division_11
57,high_park_north,division_11
109,roncesvalles,division_11
58,high_park-swansea,division_11
...,...,...
14,blake-jones,division_55
41,east_end-danforth,division_55
156,north_riverdale,division_55
54,greenwood-coxwell,division_55


In [23]:
# check for duplicates
dupes = df_exploded['neighborhoods'].value_counts()
dupes.head(25)

bridle_path-sunnybrook-york_mills      3
lawrence_park_south                    2
weston-pelham_park                     2
bedford_park-nortown                   2
casa_loma                              2
york_university_heights                2
east_end-danforth                      2
danforth                               2
st_andrew-windfields                   2
annex                                  2
kensington-chinatown                   2
rouge                                  2
rockcliffe-smythe                      2
yorkdale-glen_park                     2
pelmo_park-humberlea                   2
south_parkdale                         2
englemount-lawrence                    2
forest_hill_south                      2
south_riverdale                        2
dovercourt-wallace_emerson-junction    2
junction_area                          2
waterfront_communities-the_island      2
downsview-roding-cfb                   2
forest_hill_north                      2
maple_leaf      

Duplicates were found, but this was expected. The City of Toronto has 140 neighborhoods and 17 police divisions covering these neighborhoods. The duplicates exist only where a neighborhood legitimately appears under more than one police division. GIS information is available from multiple sources, but not at the level of detail needed to create subcategories for each overlapping neighborhood/police division.

No additional duplicates were found.
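A count by neighborhood alone cannot distinguish a neighborhood shared by two divisions from the same pair listed twice. The sketch below separates the two cases on a toy frame; the rows are illustrative and not taken from `df_exploded`.

```python
import pandas as pd

# Toy data: 'annex' is shared by two divisions, and one pair is repeated by mistake.
pairs = pd.DataFrame({
    'neighborhoods': ['annex', 'annex', 'alderwood', 'annex'],
    'divisions': ['division_14', 'division_53', 'division_22', 'division_14'],
})

# True duplicates: the same neighborhood/division pair appears more than once.
repeated = pairs[pairs.duplicated(subset=['neighborhoods', 'divisions'], keep=False)]

# Expected overlaps: one neighborhood listed under several distinct divisions.
shared = pairs.drop_duplicates().groupby('neighborhoods')['divisions'].nunique()
shared = shared[shared > 1]

print(repeated)
print(shared)
```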

In [24]:
# groupby the neighborhoods in each division and place the neighborhoods in lists
divs = df_cleaned.groupby('divisions')['neighborhoods'].apply(lambda x: x.values.tolist())

# create new data frame
xfx = pd.DataFrame(divs)
xfx = xfx.reset_index()

In [25]:
# transpose the data frame and make 'divisions' the header, 
# as this is how it will be used in future portions of the overall project
divisions = xfx.T
divisions.columns = divisions.iloc[0]
divisions = divisions.drop(divisions.index[0])

# completed dataframe for Milestone 3
divisions

divisions,division_11,division_12,division_13,division_14,division_22,division_23,division_31,division_32,division_33,division_41,division_42,division_43,division_51,division_52,division_53,division_54,division_55
neighborhoods,"[rockcliffe-smythe, dovercourt-wallace_emerson...","[keelesdale-eglinton_west, brookhaven-amesbury...","[yorkdale-glen_park, forest_hill_north, humewo...","[annex, trinity-bellwoods, niagara, south_park...","[islington-city_centre_west, princess-rosethor...","[thistletown-beaumond_heights, kingsview_villa...","[pelmo_park-humberlea, glenfield-jane_heights,...","[york_university_heights, downsview-roding-cfb...","[don_valley_village, banbury-don_mills, parkwo...","[birchcliffe-cliffside, dorset_park, oakridge,...","[rouge, tam_o'shanter-sullivan, steeles, aginc...","[west_hill, eglinton_east, woburn, cliffcrest,...","[church-yonge_corridor, cabbagetown-south_st_j...","[kensington-chinatown, university, waterfront_...","[thorncliffe_park, annex, bedford_park-nortown...","[victoria_village, old_east_york, taylor-masse...","[south_riverdale, woodbine_corridor, blake-jon..."


This data frame will need more manipulation once it is merged with the other data frames. Each column will pull data from the other data frames to make calculations for each division. The web scraping and this data frame were necessary in order to have the neighborhoods and police divisions available together; the result serves as a reference point for future analysis upon completion of Milestone 4.
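As a rough illustration of how such a reference could be used, the long-format pairs can be turned into lookups in both directions. This is a sketch on toy data; `long_form`, `by_division`, and `by_neighborhood` are hypothetical names and not part of the project code.

```python
import pandas as pd

# Toy long-format frame with one row per neighborhood/division pair.
long_form = pd.DataFrame({
    'neighborhoods': ['annex', 'annex', 'alderwood'],
    'divisions': ['division_14', 'division_53', 'division_22'],
})

# division -> list of neighborhoods (same content as the transposed frame above)
by_division = long_form.groupby('divisions')['neighborhoods'].agg(list).to_dict()

# neighborhood -> list of divisions (useful when joining neighborhood-level data)
by_neighborhood = long_form.groupby('neighborhoods')['divisions'].agg(list).to_dict()

print(by_neighborhood['annex'])
```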

In [26]:
# save to pickle for future use
divisions.to_pickle("divisions.pkl")