# Segmenting and Clustering Neighborhoods in Toronto

 *This project scrapes a Wikipedia page to obtain the Canadian postal codes beginning with M (the Toronto area), then processes and cleans the data for clustering. Clustering is done with K-Means, and the resulting clusters are plotted using the Folium library*.

### Importing and Installing all the required Libraries

In [34]:
import pandas as pd
import numpy as np
import requests
import random
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
from geopy.geocoders import Nominatim
from IPython.display import Image
from IPython.core.display import HTML
from IPython.display import display_html
# json_normalize is exposed at the top level of pandas since version 1.0
from pandas import json_normalize
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans

## Scraping the webpage for the table of postal codes, Canada

 _The __BeautifulSoup__ library is used to scrape the table from Wikipedia. To confirm that the page was fetched and parsed successfully, the title of the webpage is printed; the table of postal codes can also be displayed by uncommenting the last line of the cell_.

In [35]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
print(soup.title)
# Keep the first table on the page as an HTML string for pandas
tab = str(soup.table)
# display_html(tab, raw=True)

<title>List of postal codes of Canada: M - Wikipedia</title>

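To make the table extraction step more concrete, here is a rough sketch that pulls rows out of table markup using only the standard library's `html.parser`. It is an illustration of the idea, not how BeautifulSoup or pandas work internally; the `TableRows` class and the `sample_html` string are hypothetical stand-ins for the scraped page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical miniature version of the postal code table
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
header, *records = parser.rows
print(header)
print(records)
```

In the notebook itself this work is delegated to BeautifulSoup and pandas, which also cope with malformed markup and spanning cells that this sketch ignores.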

### Convert HTML table to Pandas DataFrame

 _To clean and preprocess the data, the HTML table is first converted into a pandas DataFrame_.

In [36]:
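from io import StringIO

# Recent pandas versions expect literal HTML to be wrapped in a file-like object
canada_data = pd.read_html(StringIO(tab))
canada = canada_data[0]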
print(canada.shape)
canada.head()

(180, 3)


|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


## Preprocessing and Cleaning the Data

 *Before preprocessing, the dataframe 'canada' is copied into another dataframe 'df' using __copy()__. Rows whose 'Borough' is 'Not assigned' are then dropped using __replace()__ and __dropna()__. Any 'Neighborhood' value that is 'Not assigned' is replaced with the name of its 'Borough'. The dimensions of the preprocessed dataframe 'df' are displayed using __shape__*.

In [37]:
df = canada.copy()

# Replace 'Not assigned' with NaN and drop the rows whose 'Borough' is missing
df.replace("Not assigned", np.nan, inplace=True)
df.dropna(subset=["Borough"], axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)
# 'Not assigned' neighborhoods are NaN at this point, so fill them with the borough name
df['Neighborhood'] = df['Neighborhood'].fillna(df['Borough'])
print(df.head())
df.shape

  Postal Code           Borough                                 Neighborhood
0         M3A        North York                                    Parkwoods
1         M4A        North York                             Victoria Village
2         M5A  Downtown Toronto                    Regent Park, Harbourfront
3         M6A        North York             Lawrence Manor, Lawrence Heights
4         M7A  Downtown Toronto  Queen's Park, Ontario Provincial Government


(103, 3)

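The cleaning rules can be checked on a small frame before running them on the scraped data. The sketch below assumes three hypothetical rows (the `toy` frame is not taken from the Wikipedia table) covering the cases that matter: an unassigned borough, a fully assigned row, and an assigned borough with an unassigned neighborhood. Because 'Not assigned' is turned into NaN first, the neighborhood fallback is written with `fillna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering the three cases the cleaning step must handle
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M9A"],
    "Borough": ["Not assigned", "North York", "Etobicoke"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

cleaned = (
    toy.replace("Not assigned", np.nan)   # mark placeholders as missing
       .dropna(subset=["Borough"])        # drop rows without a borough
       .reset_index(drop=True)
)
# Fall back to the borough name where the neighborhood is missing
cleaned["Neighborhood"] = cleaned["Neighborhood"].fillna(cleaned["Borough"])
print(cleaned)
```

The row with no borough disappears, and the row with only a borough keeps that name as its neighborhood.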
## Import the CSV file that contains the coordinates of neighborhoods in Canada.

In [38]:
lat_lon = pd.read_csv('http://cocl.us/Geospatial_data')
lat_lon.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


### Merging Tables

 *To attach a latitude and longitude to each neighborhood, the two tables 'df' and 'lat_lon' are merged on their shared 'Postal Code' column*.

In [39]:
new_df = pd.merge(df, lat_lon, on= 'Postal Code')
new_df.head()

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |

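`pd.merge` performs an inner join by default, so a postal code missing from either table would be dropped without any message. One way to check for that, sketched here on hypothetical miniature tables (`left`, `right`, and the code 'M0X' are made up for illustration), is a left join with `indicator=True`:

```python
import pandas as pd

# Hypothetical miniature versions of 'df' and 'lat_lon'
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M0X"],
    "Borough": ["North York", "North York", "Nowhere"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Latitude": [43.753259, 43.725882, 43.654260],
    "Longitude": [-79.329656, -79.315572, -79.360636],
})

# how="left" keeps every row of 'left'; indicator=True adds a '_merge' column
checked = pd.merge(left, right, on="Postal Code", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)

# The default inner join drops the unmatched code without warning
inner = pd.merge(left, right, on="Postal Code")
print(len(inner))
```

An empty `unmatched` list on the real data would indicate that every cleaned postal code found its coordinates.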

## Clustering and Plotting the Neighborhoods Whose Borough Contains 'Toronto'

 *Select all rows of the dataframe 'new_df' whose 'Borough' field contains 'Toronto'*.

In [40]:
# .copy() makes 'Toronto' an independent frame, so columns can be added later without warnings
Toronto = new_df[new_df['Borough'].str.contains('Toronto', regex=False)].copy()
Toronto.head(10)

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 20 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 |
| 24 | M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 |
| 25 | M6G | Downtown Toronto | Christie | 43.669542 | -79.422564 |
| 30 | M5H | Downtown Toronto | Richmond, Adelaide, King | 43.650571 | -79.384568 |
| 31 | M6H | West Toronto | Dufferin, Dovercourt Village | 43.669005 | -79.442259 |

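The substring filter can be tried on a few values first. This sketch uses hypothetical data (`boroughs` and `frame` are made up) and shows two details that are easy to miss: `na=False` treats missing boroughs as non-matches, and taking a `.copy()` of the filtered rows lets a column be inserted later without touching the original frame.

```python
import pandas as pd

# Hypothetical borough names, including one missing value
boroughs = pd.Series(["Downtown Toronto", "North York", "East Toronto", None])

# regex=False does a plain substring test; na=False maps missing values to False
mask = boroughs.str.contains("Toronto", regex=False, na=False)
print(mask.tolist())

frame = pd.DataFrame({"Borough": boroughs, "Latitude": [43.65, 43.75, 43.68, 43.70]})
subset = frame[mask].copy()                  # independent copy of the matching rows
subset.insert(0, "Cluster Labels", [0, 1])   # safe: 'frame' is left unchanged
print(subset)
```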

### Visualizing the Neighborhoods of the DataFrame 'Toronto'

 *Using the __Folium__ library, all the neighborhoods in 'Toronto' are visualized on a map*.

In [41]:
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [42]:
# Create the Toronto map from the latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# Add a marker for each neighborhood
for lat, lng, borough, neighborhood in zip(Toronto['Latitude'], Toronto['Longitude'], Toronto['Borough'], Toronto['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_Toronto)
map_Toronto

## Clustering of Neighborhoods

 *The neighborhoods are clustered with __KMeans__, using only their latitude and longitude as features*.

In [45]:
k = 5
# Cluster on the coordinates only
toronto_clustering = Toronto.drop(columns=['Postal Code', 'Borough', 'Neighborhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)
# Store each neighborhood's cluster label as the first column
Toronto.insert(0, 'Cluster Labels', kmeans.labels_)


In [50]:
Toronto.head(10)

|   | Cluster Labels | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 2 | 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 4 | 0 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 9 | 0 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | 0 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | 4 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 20 | 0 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 |
| 24 | 0 | M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 |
| 25 | 3 | M6G | Downtown Toronto | Christie | 43.669542 | -79.422564 |
| 30 | 0 | M5H | Downtown Toronto | Richmond, Adelaide, King | 43.650571 | -79.384568 |
| 31 | 1 | M6H | West Toronto | Dufferin, Dovercourt Village | 43.669005 | -79.442259 |

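For intuition about what the clustering step computes, here is a simplified sketch of the basic k-means (Lloyd) iteration in NumPy. It is not scikit-learn's implementation, which by default uses k-means++ initialization and other refinements; the function `lloyd_kmeans` and the `coords` array are hypothetical. Like the notebook, it treats latitude and longitude as plain Euclidean coordinates, which is only an approximation of real distances.

```python
import numpy as np


def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iteration: assign to the nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old centroid if a cluster ends up empty
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical (latitude, longitude) pairs forming two obvious groups
coords = np.array([
    [43.65, -79.38], [43.66, -79.39], [43.64, -79.37],
    [43.68, -79.29], [43.67, -79.30],
])
labels, centroids = lloyd_kmeans(coords, k=2)
print(labels)
print(centroids)
```

The label numbers themselves are arbitrary; only the grouping matters, which is also why the cluster numbers in the table above carry no ordering.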

In [51]:
# create map
map_clusters = folium.Map(location=[latitude,longitude],zoom_start=10)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, neighborhood, cluster in zip(Toronto['Latitude'], Toronto['Longitude'], Toronto['Neighborhood'], Toronto['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        # KMeans labels run from 0 to k-1, so they index the color list directly
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
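
The color scheme only needs one distinct hex color per cluster. As a sketch of that idea without matplotlib, the hypothetical helper `cluster_palette` below spaces hues evenly with the standard library's `colorsys`; the saturation and brightness values are arbitrary choices, and the resulting colors differ from the rainbow colormap used in the cell.

```python
import colorsys


def cluster_palette(k):
    """Return k evenly spaced hues as '#rrggbb' strings (hypothetical helper)."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.9, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = cluster_palette(5)
# Cluster labels run from 0 to k-1, so a label indexes the palette directly
for cluster in range(5):
    print(cluster, palette[cluster])
```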