## Segmenting and Clustering Neighborhoods in Toronto

Importing the modules required for the assignment.

In [15]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# install geopy & folium
#pip install geopy
#!pip install folium=0.5.0 

from geopy.geocoders import Nominatim
import folium

### Question 1: Scrape the Wikipedia page and get the dataframe

The wikipedia page is obtained using the python requests module. Once this is done, the beautiful soup package is used
for the purpose of getting the table extracted from the html output

In [2]:
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('tbody')
rows = table.select('tr')
row = [r.get_text() for r in rows]
row[0:5]

['\nPostal code\n\nBorough\n\nNeighborhood\n',
 '\nM1A\n\nNot assigned\n\n\n',
 '\nM2A\n\nNot assigned\n\n\n',
 '\nM3A\n\nNorth York\n\nParkwoods\n',
 '\nM4A\n\nNorth York\n\nVictoria Village\n']

As we can see that in the above output we have some newline characters. So we cansplit by these characters and push them to our dataframe.



### Extract the column values and store it in the dataframe.

In [3]:
df = pd.DataFrame(row)
df1 = df[0].str.split('\n', expand=True)
df1.rename(columns=df1.iloc[0], inplace=True)
df1.drop(df1.index[0],inplace=True)
df1.head()

Unnamed: 0,Unnamed: 1,Postal code,Unnamed: 3,Borough,Unnamed: 5,Neighborhood,Unnamed: 7
1,,M1A,,Not assigned,,,
2,,M2A,,Not assigned,,,
3,,M3A,,North York,,Parkwoods,
4,,M4A,,North York,,Victoria Village,
5,,M5A,,Downtown Toronto,,Regent Park / Harbourfront,


If we look into the data we can see that the rows having the Not Assigned can be ignored since the neighborhood column for that row will always be empty

In [4]:
# Filter out the columns where the Borough is Not Assigned. This can also be done by removing rows with Empty neighborhood.

df1 = df1[df1["Borough"]!="Not Assigned"]
df1.head()

Unnamed: 0,Unnamed: 1,Postal code,Unnamed: 3,Borough,Unnamed: 5,Neighborhood,Unnamed: 7
1,,M1A,,Not assigned,,,
2,,M2A,,Not assigned,,,
3,,M3A,,North York,,Parkwoods,
4,,M4A,,North York,,Victoria Village,
5,,M5A,,Downtown Toronto,,Regent Park / Harbourfront,


### Now we will combine the neighborhoods that have the same postal code.


In [5]:
df2 = df1.groupby(['Postal code', 'Borough'], sort = False).agg(','.join)
df2.reset_index(inplace = True)
df2.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


### If a cell has a borough but a Not assigned neighborhood, then change the neighborhood will be the same as the borough.

In [6]:
df2['Neighborhood'].replace('Not assigned', "Queen's Park", inplace=True)
df2.head(10)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,Malvern / Rouge


In [7]:
df2.shape

(180, 3)






##  Question 2: Get the location Coordinates for the neighborhoods



In this section the latitude and longitudes are obtained and merged with dataframe obtained above

In [8]:
df_coordinates = pd.read_csv('http://cocl.us/Geospatial_data')
df_coordinates.columns = ['Postal code', 'Latitude', 'Longitude']

In [12]:
df_merge = pd.merge(df2, df_coordinates, on=['Postal code'], how='inner')

df_location = df_merge[['Borough', 'Neighborhood', 'Postal code', 'Latitude', 'Longitude']].copy()

df_location.head(15)

Unnamed: 0,Borough,Neighborhood,Postal code,Latitude,Longitude
0,North York,Parkwoods,M3A,43.753259,-79.329656
1,North York,Victoria Village,M4A,43.725882,-79.315572
2,Downtown Toronto,Regent Park / Harbourfront,M5A,43.65426,-79.360636
3,North York,Lawrence Manor / Lawrence Heights,M6A,43.718518,-79.464763
4,Downtown Toronto,Queen's Park / Ontario Provincial Government,M7A,43.662301,-79.389494
5,Etobicoke,Islington Avenue,M9A,43.667856,-79.532242
6,Scarborough,Malvern / Rouge,M1B,43.806686,-79.194353
7,North York,Don Mills,M3B,43.745906,-79.352188
8,East York,Parkview Hill / Woodbine Gardens,M4B,43.706397,-79.309937
9,Downtown Toronto,Garden District / Ryerson,M5B,43.657162,-79.378937





## Question 3: Clustering of neighborhoods of Toronto


Here the lattitude and longitude are obtained for Toronto. The Toronto is located in the map and all the neighborhood cluster is added to the map.

In [17]:
address = 'Toronto, Canada'

geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

  This is separate from the ipykernel package so we can avoid doing imports until


In [21]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_location['Latitude'], df_location['Longitude'], df_location['Borough'], df_location['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3199cc',
        fill_opacity=0.3,
        parse_html=False).add_to(map_toronto)  
    
map_toronto