# Cleaning Data Scraped From the Web

In this notebook I clean the data scraped from Wikipedia: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

First, import some necessary Python libraries.

In [1]:
# import necessary libraries
import numpy as np
import pandas as pd
import geocoder

Now read the .csv file into a pandas dataframe and take a peek at the initial dataframe (the raw scrape).

In [2]:
# create dataframe from the .csv file that was scraped from Wikipedia
toronto_df = pd.read_csv('toronto_postal_data.csv')

toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


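Next I remove the rows whose borough is 'Not assigned'. In this scrape that should also remove the rows whose neighborhood is 'Not assigned'. The info() method shows that 210 rows remain with no null values, although it counts 'Not assigned' as an ordinary string and so does not confirm this on its own.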

In [3]:
# filter out 'Not assigned' values
toronto_df = toronto_df[toronto_df['Borough'] != 'Not assigned']

toronto_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 210 entries, 2 to 285
Data columns (total 3 columns):
PostalCode      210 non-null object
Borough         210 non-null object
Neighborhood    210 non-null object
dtypes: object(3)
memory usage: 6.6+ KB
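
Since `info()` only reports non-null counts, a more direct check is to count the placeholder strings that are left after filtering. The following is a minimal sketch on a small hypothetical sample rather than the scraped file; the `M9Z` row and the name `sample_df` are invented for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for the scraped table (the M9Z row is invented)
sample_df = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A', 'M9Z'],
    'Borough': ['Not assigned', 'North York', 'North York', 'Example Borough'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village', 'Not assigned'],
})

# Same filter as in the notebook: keep rows with an assigned borough
sample_df = sample_df[sample_df['Borough'] != 'Not assigned']

# Count neighborhoods that still hold the placeholder value
remaining = int((sample_df['Neighborhood'] == 'Not assigned').sum())
print(remaining)
```

A non-zero count means some assigned boroughs still carry a placeholder neighborhood and need separate handling.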


Just taking a peek at the nice, neat structure of the dataframe so far.

In [4]:
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


The desired dataframe has one row per postal code with its corresponding borough, and all neighborhoods that share a postal code/borough pair are merged into a single comma-separated string in the neighborhood column. To start this merge, I build a list of unique (postal code, borough) tuples. Converting the zip object to a set removes the duplicates, and converting the set to a list lets me iterate over it in the next step.

In [5]:
# create list of tuples with unique postal code and corresponding unique borough
uni_list = zip(list(toronto_df['PostalCode']),list(toronto_df['Borough']))
uni_list = set(uni_list)
uni_list = list(uni_list)
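
A set does not preserve order, so the tuples above come out in arbitrary order. As an alternative sketch (not the approach used in this notebook), pandas' `drop_duplicates` yields the same unique pairs in order of first appearance. The sample rows are copied from the filtered dataframe shown earlier; `sample_df`, `pairs_df` and `uni_pairs` are illustrative names.

```python
import pandas as pd

# Rows copied from the filtered dataframe shown earlier
sample_df = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A', 'M6A', 'M6A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto',
                'North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Harbourfront',
                     'Lawrence Heights', 'Lawrence Manor'],
})

# Unique (PostalCode, Borough) pairs, kept in order of first appearance
pairs_df = sample_df[['PostalCode', 'Borough']].drop_duplicates()
uni_pairs = list(pairs_df.itertuples(index=False, name=None))
print(uni_pairs)
```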

To finish the merge I create a new, empty dataframe with the proper column names. Then I loop over the list of tuples defined above and, for each pair, join its neighborhoods into one row. Finally, I take a peek at the resulting dataframe.

In [6]:
# create new, filtered dataframe and take a (big) peek
toronto_cleaned_df = pd.DataFrame(columns = ['PostalCode', 'Borough', 'Neighborhood'])

for i, tuple_ in enumerate(uni_list):
    postal_list = list(toronto_df[toronto_df['PostalCode'] == tuple_[0]]['Neighborhood'])
    toronto_cleaned_df.loc[i] = [tuple_[0], tuple_[1],
                                 ', '.join(postal_list)]


toronto_cleaned_df.head(50)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M4B,East York,"Woodbine Gardens, Parkview Hill"
1,M9C,Etobicoke,"Bloordale Gardens, Eringate, Markland Wood, Ol..."
2,M2K,North York,Bayview Village
3,M6G,Downtown Toronto,Christie
4,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel"
5,M1B,Scarborough,"Rouge, Malvern"
6,M4M,East Toronto,Studio District
7,M6L,North York,"Downsview, North Park, Upwood Park"
8,M1X,Scarborough,Upper Rouge
9,M5G,Downtown Toronto,Central Bay Street
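
The loop above can also be expressed as a single `groupby`, which avoids filtering the dataframe once per postal code and returns rows sorted by the group keys. This is a sketch of that alternative under the same column names, not the code that produced the output above; the sample rows are copied from the filtered dataframe shown earlier.

```python
import pandas as pd

# Rows copied from the filtered dataframe shown earlier
sample_df = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A', 'M6A', 'M6A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto',
                'North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Harbourfront',
                     'Lawrence Heights', 'Lawrence Manor'],
})

# Group by postal code and borough, then join each group's neighborhoods
merged_df = (
    sample_df
    .groupby(['PostalCode', 'Borough'])['Neighborhood']
    .agg(', '.join)
    .reset_index()
)
print(merged_df)
```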


Checking the number of observations and features in the new, cleaned dataframe.

In [7]:
# check number of observations and features
toronto_cleaned_df.shape

(103, 3)

We see that there are 103 observations (103 unique postal codes).

# Assigning Lat./Long. Coords. to Postal Codes

Geocoder has not been working for me and, as stated in the assignment description, it is not always reliable. I have therefore used the provided .csv file, which lists each postal code with its latitude and longitude. I read that file into a dataframe, sort both dataframes by postal code (ascending), drop the postal code column from the lat/long dataframe, and concatenate its remaining columns onto the cleaned dataframe. Because the concatenation matches rows by index label rather than by postal code value, this relies on both dataframes containing the same postal codes in the same order.

In [8]:
# read dataframe from geospatial .csv file (provided in assignment description)
# and sort rows by postal code
lat_long_df = pd.read_csv('Geospatial_Coordinates.csv')
lat_long_df.sort_values(by=['Postal Code'], inplace=True)

lat_long_df.head(10)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [9]:
# drop 'Postal Code' column from lat_long_df, sort the cleaned dataframe, and concatenate
lat_long_df.reset_index(drop=True, inplace=True) # concat aligns on the index, so make it follow the sorted order
lat_long_df.drop(['Postal Code'], axis=1, inplace=True)
toronto_cleaned_df.sort_values(by=['PostalCode'], inplace=True) # sorts cleaned dataframe (index is out of order)
toronto_cleaned_df.reset_index(drop=True, inplace=True) # resets the index and drops the old index
toronto_geo_df = pd.concat([toronto_cleaned_df, lat_long_df], axis=1, sort=False) # concatenate dataframes' columns

toronto_geo_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
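
Sorting and concatenating only lines up correctly when both dataframes hold exactly the same postal codes. A key-based join does not depend on row order. The sketch below shows that alternative with `DataFrame.merge`, using three rows copied from the outputs above and deliberately shuffled; `cleaned_df`, `coords_df` and `geo_df` are illustrative names, not variables from this notebook.

```python
import pandas as pd

# Three rows copied from the outputs above, deliberately out of order
cleaned_df = pd.DataFrame({
    'PostalCode': ['M1C', 'M1B', 'M1E'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Highland Creek, Rouge Hill, Port Union',
                     'Rouge, Malvern',
                     'Guildwood, Morningside, West Hill'],
})
coords_df = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M1E'],
    'Latitude': [43.806686, 43.784535, 43.763573],
    'Longitude': [-79.194353, -79.160497, -79.188711],
})

# Join on the postal code itself; validate raises if a code is duplicated
geo_df = (
    cleaned_df
    .merge(coords_df, left_on='PostalCode', right_on='Postal Code',
           how='left', validate='one_to_one')
    .drop(columns='Postal Code')
    .sort_values('PostalCode')
    .reset_index(drop=True)
)
print(geo_df)
```

With `how='left'`, a postal code missing from the coordinates file shows up as NaN in `Latitude` and `Longitude` instead of silently shifting the rows below it.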


# Clustering

I will cluster neighborhoods based on postal codes. In the rendered map, each marker's label should display the postal code followed by the borough that contains it.

In [10]:
# importing necessary libraries for clustering and visuals

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# import folium for map rendering
import folium
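
As a rough idea of how `KMeans` could be applied once the features are ready, the sketch below clusters postal codes on latitude and longitude alone. Both the feature choice and the value `k = 3` are assumptions made for illustration, not the notebook's actual clustering setup. The coordinates are copied from the first ten rows shown above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Latitude/longitude pairs copied from the first ten rows shown above
coords = np.array([
    [43.806686, -79.194353],
    [43.784535, -79.160497],
    [43.763573, -79.188711],
    [43.770992, -79.216917],
    [43.773136, -79.239476],
    [43.744734, -79.239476],
    [43.727929, -79.262029],
    [43.711112, -79.284577],
    [43.716316, -79.239476],
    [43.692657, -79.264848],
])

k = 3  # hypothetical number of clusters, chosen only for this sketch

# Fit k-means and read back one cluster label per postal code
kmeans = KMeans(n_clusters=k, random_state=0, n_init=10).fit(coords)
labels = kmeans.labels_
print(labels)
```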

In [11]:
# grabbing the lat/long coords for Toronto, Ontario, Canada

address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
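
Nominatim's `geocode` returns `None` when it finds no match, and the request itself can fail, so the attribute access on `location` may raise. Below is a minimal sketch of a guard with a fallback. The helper `resolve_center` and the stand-in geocoders are hypothetical, and the fallback values are the coordinates printed above.

```python
from types import SimpleNamespace

# Fallback taken from the coordinates printed above
TORONTO_FALLBACK = (43.653963, -79.387207)

def resolve_center(geocode, address, fallback=TORONTO_FALLBACK):
    """Return (lat, lng) from geocode(address), or fallback if the lookup fails."""
    try:
        location = geocode(address)
    except Exception:  # e.g. network errors, timeouts, rate limits
        location = None
    if location is None:
        return fallback
    return (location.latitude, location.longitude)

# Stand-in geocoders so the sketch runs offline
def fake_ok(address):
    return SimpleNamespace(latitude=43.7, longitude=-79.4)

def fake_missing(address):
    return None

center_ok = resolve_center(fake_ok, 'Toronto, ON, Canada')
center_missing = resolve_center(fake_missing, 'Toronto, ON, Canada')
print(center_ok)
print(center_missing)
```

In the notebook, the first argument would be `geolocator.geocode` in place of the stand-ins.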


In [12]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=9)

# add markers to map
for lat, lng, borough, postal in zip(toronto_geo_df['Latitude'], toronto_geo_df['Longitude'], toronto_geo_df['Borough'], toronto_geo_df['PostalCode']):
    label = '{}, {}'.format(postal, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto