<h1> Peer-Reviewed Assignment: Segmenting and Clustering Neighborhoods in Toronto </h1>

## Problem 3

<hr>

For this assignment, I explore and cluster the neighborhoods in Toronto, working only with boroughs whose names contain the word "Toronto".

In [26]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

!conda install -c conda-forge folium=0.5.0 --yes
import folium 
import matplotlib.cm as cm
import matplotlib.colors as colors
from IPython.display import display_html
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated
from sklearn.cluster import KMeans


print('Complete.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Complete.


***Scrape the List of Canadian Postal Codes***

In [2]:

List_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(List_url).text

soup = BeautifulSoup(source, 'html.parser')  # the page is HTML, so use an HTML parser rather than 'xml'

table=soup.find('table')




***Define Dataframes***

In [3]:
column_names = ['Postalcode','Borough','Neighborhood']
df = pd.DataFrame(columns = column_names)

for tr_cell in table.find_all('tr'):
    row_data=[]
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip().replace('/', ','))
        
    if len(row_data)==3:
        df.loc[len(df)] = row_data
        

df.head()

| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park , Harbourfront |


***Data Wrangling***

In [6]:
# Dropping the rows where Borough is 'Not assigned'
df1=df[df.Borough !='Not assigned']


# Combining the neighbourhoods with same Postalcode
df2 = df1.groupby(['Postalcode','Borough'], sort=False).agg(','.join)
df2.reset_index(inplace=True)

# Replacing the name of the neighbourhoods which are 'Not assigned' with names of Borough
df2['Neighborhood'] = np.where(df2['Neighborhood'] == 'Not assigned',df2['Borough'], df2['Neighborhood'])



df2



| | Postalcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway , Montgomery Road , Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing CentrE |
| 101 | M8Y | Etobicoke | Old Mill South , King's Mill Park , Sunnylea ,... |
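
The three wrangling steps above can be checked on a handful of rows before running them on the full table. The sketch below uses hypothetical toy rows (not the scraped data) and selects the `Neighborhood` column explicitly before joining; it illustrates the steps under those assumptions and is not the notebook's exact cell.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the scraped table.
toy = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M9A", "Etobicoke", "Not assigned"],
    ],
    columns=["Postalcode", "Borough", "Neighborhood"],
)

# Step 1: drop rows whose borough is not assigned.
assigned = toy[toy["Borough"] != "Not assigned"]

# Step 2: join the neighborhoods that share a postal code.
grouped = (
    assigned.groupby(["Postalcode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

# Step 3: fall back to the borough name when no neighborhood is assigned.
grouped["Neighborhood"] = np.where(
    grouped["Neighborhood"] == "Not assigned",
    grouped["Borough"],
    grouped["Neighborhood"],
)

print(grouped)
```

With `sort=False`, the groups keep the order in which each postal code first appears, which makes the result easy to compare against the source rows.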


In [7]:
df.shape

(180, 3)

***Import the CSV file containing the latitudes and longitudes for Canadian neighbourhoods***

In [18]:
import geocoder  # third-party package, assumed to be installed

def get_geocode(postal_code):
    # initialize the variable to None
    lat_lng_coords = None

    # loop until the geocoder returns coordinates
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude
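
Because the loop above retries until coordinates come back, a postal code that never resolves would keep it spinning indefinitely. A minimal sketch of a bounded variant follows. The names `get_geocode_bounded`, `lookup`, `max_attempts`, and `delay_seconds` are hypothetical, and the lookup is passed in as a callable so the example runs without network access.

```python
import time

def get_geocode_bounded(postal_code, lookup, max_attempts=5, delay_seconds=0.0):
    """Return (latitude, longitude), or None after max_attempts failed lookups.

    `lookup` is a hypothetical callable that takes a query string and
    returns [lat, lng] or None.
    """
    query = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(delay_seconds)  # optional pause between attempts
    return None

# Stub lookup that fails twice and then succeeds, to exercise the retry path.
calls = []

def fake_lookup(query):
    calls.append(query)
    return [43.65, -79.38] if len(calls) >= 3 else None

result = get_geocode_bounded('M5A', fake_lookup)
print(result, len(calls))
```

With the `geocoder` package, the callable could be something like `lambda q: geocoder.google(q).latlng`, mirroring the call in the cell above.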


In [19]:
geo_df=pd.read_csv('http://cocl.us/Geospatial_data')
geo_df.head()


| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


***Merging the two tables to get latitudes and longitudes for the analyzed neighborhoods***

In [11]:
geo_df.rename(columns={'Postal Code':'Postalcode'},inplace=True)
geo_merged = pd.merge(geo_df, df2, on='Postalcode')  # merge with the cleaned table (df2), not the raw scrape (df)

geo_data=geo_merged[['Postalcode','Borough','Neighborhood','Latitude','Longitude']]

geo_data.head()


| | Postalcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern , Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill , Port Union , Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood , Morningside , West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
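
`pd.merge` defaults to an inner join, so a postal code with no match in the coordinates file simply disappears from the result. One way to surface such rows is sketched below on toy frames. It uses a left merge with `indicator=True` and `validate='one_to_one'`; the `M9Z` row and `Hypothetical Borough` are invented solely to force a miss.

```python
import pandas as pd

# Toy frames; the M9Z row is hypothetical and has no coordinates on purpose.
neighborhoods = pd.DataFrame({
    "Postalcode": ["M4E", "M4K", "M9Z"],
    "Borough": ["East Toronto", "East Toronto", "Hypothetical Borough"],
})
coords = pd.DataFrame({
    "Postalcode": ["M4E", "M4K", "M1B"],
    "Latitude": [43.676357, 43.679557, 43.806686],
    "Longitude": [-79.293031, -79.352188, -79.194353],
})

# The left merge keeps every neighborhood row, validate raises if a postal
# code is duplicated on either side, and indicator adds a "_merge" column.
checked = pd.merge(
    neighborhoods,
    coords,
    on="Postalcode",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Postal codes that found no coordinates.
missing = checked.loc[checked["_merge"] == "left_only", "Postalcode"].tolist()
print(missing)
```

If `missing` is empty, switching back to the default inner merge loses nothing.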


***Selecting the rows whose Borough contains "Toronto"***

In [13]:
df_toronto = geo_data[geo_data['Borough'].str.contains('Toronto', regex=False)].copy()  # copy so columns can be added later without a SettingWithCopyWarning
df_toronto.head()


| | Postalcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 37 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 41 | M4K | East Toronto | The Danforth West , Riverdale | 43.679557 | -79.352188 |
| 42 | M4L | East Toronto | India Bazaar , The Beaches West | 43.668999 | -79.315572 |
| 43 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 44 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |


<h1> Visualizing all the Neighborhoods of the above data frame using Folium </h1>

<h2> Below is a map of the neighborhoods in Toronto. </h2>

In [48]:
map_toronto = folium.Map(location=[43.728020,-79.388790],zoom_start=10)

for lat, lng, borough, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=6,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
map_toronto

In [56]:
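k = 4
# cluster on the coordinates only
toronto_clustering = df_toronto[['Latitude', 'Longitude']]

kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)

# store the labels so the map cell can read df_toronto['Cluster Labels']
df_toronto['Cluster Labels'] = kmeans.labels_

df_toronto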

| | Cluster Number | Cluster | Clusters | Cluster Labels | Postalcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|
| 37 | 2 | 3 | 3 | 1 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 41 | 2 | 3 | 3 | 1 | M4K | East Toronto | The Danforth West , Riverdale | 43.679557 | -79.352188 |
| 42 | 2 | 3 | 3 | 1 | M4L | East Toronto | India Bazaar , The Beaches West | 43.668999 | -79.315572 |
| 43 | 2 | 3 | 3 | 1 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 44 | 3 | 1 | 2 | 2 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |
| 45 | 3 | 1 | 2 | 2 | M4P | Central Toronto | Davisville North | 43.712751 | -79.390197 |
| 46 | 3 | 1 | 2 | 2 | M4R | Central Toronto | North Toronto West | 43.715383 | -79.405678 |
| 47 | 3 | 1 | 2 | 2 | M4S | Central Toronto | Davisville | 43.704324 | -79.38879 |
| 48 | 3 | 1 | 2 | 2 | M4T | Central Toronto | Moore Park , Summerhill East | 43.689574 | -79.38316 |
| 49 | 3 | 1 | 2 | 2 | M4V | Central Toronto | Summerhill West , Rathnelly , South Hill , For... | 43.686412 | -79.400049 |
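
To make the clustering step less of a black box, here is a small NumPy sketch of Lloyd's algorithm, the assign-then-update iteration that k-means is built on. It is not scikit-learn's implementation (which adds smarter initialization and other refinements). The function name `lloyd_kmeans` and the six toy points are hypothetical.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd iteration: assign points to the nearest center, then recenter."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points; keep it if the cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # assignments have stabilized
        centers = new_centers
    return labels, centers

# Hypothetical toy points forming two well-separated groups.
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])
labels, centers = lloyd_kmeans(points, k=2)
print(labels)
print(centers)
```

The label numbers themselves are arbitrary: which group ends up as 0 or 1 depends on the random start, so only the grouping is meaningful.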


<h2> Below is a map of the clustered neighborhoods in East, Central, Downtown, and West Toronto. </h2>

In [57]:
# create map
map_clusters = folium.Map(location=[43.651070,-79.347015],zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighborhood, cluster in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighborhood'], df_toronto['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
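
One detail of the color lookup is worth spelling out. Assuming the labels come straight from `kmeans.labels_`, they run from 0 to k-1, so `rainbow[cluster-1]` sends label 0 to index -1, which Python resolves to the last list element. Each cluster still gets its own color, but the mapping is shifted by one position. The sketch below shows this with placeholder hex values rather than the actual `cm.rainbow` output.

```python
k = 4
# Placeholder palette; the real values come from cm.rainbow in the cell above.
palette = ['#111111', '#222222', '#333333', '#444444']

# Lookup as written in the map cell: label 0 wraps around to the last color.
by_offset = {label: palette[label - 1] for label in range(k)}

# Direct lookup: label i gets color i.
by_label = {label: palette[label] for label in range(k)}

print(by_offset)
print(by_label)
```

Either lookup gives k distinct colors; indexing with `cluster` directly just keeps label and color positions aligned, which is easier to reason about when writing a legend.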