# Segmenting & Clustering the neighborhoods in the city of Toronto, Canada

## Project Brief
In this project, data is extracted from a Wikipedia page that contains the information needed to explore and cluster the 
neighborhoods in Toronto. The page is scraped, and the data is wrangled, cleaned, and read into 
a pandas dataframe so that it is in a structured format. Once the data is in a structured format, the dataset is analyzed
to explore and cluster the neighborhoods in the city of Toronto. The Foursquare API was employed to find the venues in each postal code zone.

## Contents

1. <a href="#item1">Data Extraction of Toronto neighborhoods from Wikipedia</a>

2. <a href="#item2">Clean and Transform Neighbourhood DataFrame</a>
    
3. <a href="#item3">Obtaining Coordinates of each Neighborhood</a>
    
4. <a href="#item4">Creating a map with the coordinates of each postal code and markers for each Postcode</a>
    


In [1]:
# importing libraries
import numpy as np # library to handle data in a vectorized manner
import json # JSON file manipulation
import requests # HTTP library
from bs4 import BeautifulSoup # scraping library

import pandas as pd # library for data analysis
try:
    from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0)
except ImportError:
    from pandas.io.json import json_normalize # location of the same function in older pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## 1. Data Extraction of Toronto neighborhoods from Wikipedia
The data was extracted with the BeautifulSoup library. The article presents the data in a table, so the page is downloaded as text,
the text is parsed as HTML, and the rows of the table are read out.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

text_result = requests.get(url).text # get the entire html of the article as a str
html_parsed_result = BeautifulSoup(text_result, 'html.parser') # transform the text to html

neighborhood_info_table = html_parsed_result.find('table', class_ = 'wikitable')
neighborhood_rows = neighborhood_info_table.find_all('tr')

# extract the info ('Postcode', 'Borough', 'Neighbourhood') from the table
neighborhood_info = []
for row in neighborhood_rows:
    info = row.text.split('\n')[1:-1] # remove empty str (first and last items)
    neighborhood_info.append(info)
    
neighborhood_info[0:10]

[['Postcode', 'Borough', 'Neighbourhood'],
 ['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M5A', 'Downtown Toronto', 'Regent Park'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', "Queen's Park", 'Not assigned']]
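The row parsing in the cell above relies on each row's text splitting cleanly on newlines. As a minimal alternative sketch, the cells can be read one by one from their `th`/`td` tags. The `SAMPLE_HTML` string and the `table_to_rows` helper are assumptions made for illustration: the HTML is a hand-written stand-in for the Wikipedia table, not the live page.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (assumption, not fetched from the web).
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>
</table>
"""

def table_to_rows(html):
    """Return the first 'wikitable' in html as a list of rows of stripped cell text."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', class_='wikitable')
    rows = []
    for tr in table.find_all('tr'):
        # read header and data cells individually instead of splitting row.text
        cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        if cells:
            rows.append(cells)
    return rows

rows = table_to_rows(SAMPLE_HTML)
print(rows)
```

Reading cell by cell keeps the row length tied to the number of cells, whatever whitespace the page puts between them.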

### Transform the data into a pandas dataframe
The neighborhood_info list is passed to pandas to create a DataFrame, with the first row supplying the column names.

In [3]:
#create a Neighborhoods dataframe
df = pd.DataFrame(neighborhood_info[1:], columns=neighborhood_info[0])

df.head(10)

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Not assigned |
| 9 | M8A | Not assigned | Not assigned |


In [4]:
df.shape

(289, 3)

## 2. Clean & Transform Neighbourhood DataFrame

In [5]:
# Ignore cells with a borough that is Not assigned.
df = df[df.Borough != 'Not assigned']

In [6]:
# Combine neighbourhoods that share a postal code into one comma-separated row
df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].agg(', '.join).reset_index()


In [7]:
# If a neighbourhood is 'Not assigned', use the borough name instead.
# Assigning to the rows yielded by iterrows() is not guaranteed to update df, so use a boolean mask.
not_assigned = df['Neighbourhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']
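
The three cleaning steps above (drop unassigned boroughs, combine neighbourhoods per postal code, fall back to the borough name) can be sketched on a handful of rows. This is a minimal sketch, not the notebook's exact cells: the `clean_postcodes` helper name is an assumption, and the sample rows are copied from the scraped output shown in section 1.

```python
import pandas as pd

# Sample rows copied from the scraped table shown earlier in the notebook.
sample = pd.DataFrame(
    [
        ['M1A', 'Not assigned', 'Not assigned'],
        ['M2A', 'Not assigned', 'Not assigned'],
        ['M3A', 'North York', 'Parkwoods'],
        ['M4A', 'North York', 'Victoria Village'],
        ['M5A', 'Downtown Toronto', 'Harbourfront'],
        ['M5A', 'Downtown Toronto', 'Regent Park'],
        ['M6A', 'North York', 'Lawrence Heights'],
        ['M6A', 'North York', 'Lawrence Manor'],
        ['M7A', "Queen's Park", 'Not assigned'],
    ],
    columns=['Postcode', 'Borough', 'Neighbourhood'],
)

def clean_postcodes(raw):
    """Hypothetical helper that applies the three cleaning steps in order."""
    # 1. drop rows whose borough is not assigned
    assigned = raw[raw['Borough'] != 'Not assigned']
    # 2. join the neighbourhoods that share a postal code
    merged = (
        assigned.groupby(['Postcode', 'Borough'])['Neighbourhood']
        .agg(', '.join)
        .reset_index()
    )
    # 3. where the neighbourhood is not assigned, reuse the borough name
    missing = merged['Neighbourhood'] == 'Not assigned'
    merged.loc[missing, 'Neighbourhood'] = merged.loc[missing, 'Borough']
    return merged

cleaned = clean_postcodes(sample)
print(cleaned)
```

Wrapping the steps in one function makes it easy to check the result on a few rows before running it on the full table.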


In [8]:
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [9]:
df.shape

(103, 3)

The cleaned dataframe has 103 rows and 3 columns.

## 3. Obtaining Coordinates of each Neighborhood

Next, the latitude and longitude coordinates of each neighborhood are obtained, which will allow us to use the Foursquare location data.
Because the Geocoder Python package can be very unreliable, I obtained the coordinates from the following link: http://cocl.us/Geospatial_data

In [10]:
# read the coordinate data from the csv file
df1 = pd.read_csv('Geospatial_Coordinates.csv')
df1.head()

|   | Postal Code | Latitude | Longitude |
|---|-------------|----------|-----------|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [11]:
df.shape

(103, 3)

In [12]:
# rename 'Postal Code' to 'Postcode' to allow merging the two dataframes on Postcode
df1 = df1.rename(columns={'Postal Code': 'Postcode'})
df1.head()

|   | Postcode | Latitude | Longitude |
|---|----------|----------|-----------|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [13]:
# Merge the two dataframes df and df1 on Postcode
df = pd.merge(df1, df, on='Postcode')
df.head()

|   | Postcode | Latitude | Longitude | Borough | Neighbourhood |
|---|----------|----------|-----------|---------|---------------|
| 0 | M1B | 43.806686 | -79.194353 | Scarborough | Rouge, Malvern |
| 1 | M1C | 43.784535 | -79.160497 | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | 43.763573 | -79.188711 | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | 43.770992 | -79.216917 | Scarborough | Woburn |
| 4 | M1H | 43.773136 | -79.239476 | Scarborough | Cedarbrae |
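
By default `pd.merge` performs an inner join, so a postal code missing from either table would be dropped without any message. The sketch below shows one way to check for that, using the `how`, `validate`, and `indicator` parameters of `pd.merge`. The two small dataframes are toy stand-ins for `df` and `df1`, and the postal code `M9Z` is made up purely to produce an unmatched row.

```python
import pandas as pd

# Toy stand-ins for df (neighbourhoods) and df1 (coordinates).
# 'M9Z' is a made-up postal code with no coordinates (assumption for illustration).
neighbourhoods = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
    'Neighbourhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union', 'Nowhere'],
})
coordinates = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

merged = pd.merge(
    neighbourhoods,
    coordinates,
    on='Postcode',
    how='left',               # keep every neighbourhood row, matched or not
    validate='one_to_one',    # raise if a postal code repeats in either table
    indicator=True,           # add a '_merge' column saying where each row came from
)

# postal codes that found no coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

If `unmatched` is empty on the real data, the inner join used in the notebook loses nothing.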


In [14]:
# reorder columns and show the dataframe
df = df[['Postcode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude']]
df.head()

|   | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|----------|---------|---------------|----------|-----------|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


In [15]:
df.shape


(103, 5)

## 4. Creating a map with the coordinates of each postal code & markers for each Postcode

In [17]:
# Create a map centred on Toronto (named toronto_map to avoid shadowing the built-in map)
toronto_map = folium.Map(location=[43.6532, -79.3832], zoom_start=11)

for location in df.itertuples(): # iterate over each row of the dataframe
    label = 'Postal Code: {};  Borough: {};  Neighborhoods: {}'.format(
        location.Postcode, location.Borough, location.Neighbourhood)
    label = folium.Popup(label, parse_html=True)
    # small marker at the postal code's coordinates
    folium.CircleMarker(
        [location.Latitude, location.Longitude],
        radius=1,
        color='red',
        fill=True,
        fill_color='#8631CC',
        fill_opacity=0.7).add_to(toronto_map)
    # circle with a 500 m radius around the same point, carrying the popup
    folium.Circle(
        radius=500,
        popup=label,
        location=[location.Latitude, location.Longitude],
        color='#8631CC',
        fill=True,
        fill_color='#86CC31'
    ).add_to(toronto_map)

toronto_map
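
The map above is centred on a hard-coded coordinate. As a hedged alternative, the centre and the bounding box can be derived from the Latitude and Longitude columns themselves. The `map_extent` helper below is an assumption made for illustration, and it is run on only the five Scarborough rows shown earlier, so its centre is not the centre of Toronto.

```python
import pandas as pd

# The five rows shown in the coordinate table earlier in the notebook.
coords = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M1E', 'M1G', 'M1H'],
    'Latitude': [43.806686, 43.784535, 43.763573, 43.770992, 43.773136],
    'Longitude': [-79.194353, -79.160497, -79.188711, -79.216917, -79.239476],
})

def map_extent(frame):
    """Hypothetical helper: return (centre, bounds) as [lat, lon] pairs."""
    centre = [float(frame['Latitude'].mean()), float(frame['Longitude'].mean())]
    south_west = [float(frame['Latitude'].min()), float(frame['Longitude'].min())]
    north_east = [float(frame['Latitude'].max()), float(frame['Longitude'].max())]
    return centre, [south_west, north_east]

centre, bounds = map_extent(coords)
print(centre)
print(bounds)
```

The `centre` value could be passed as the `location` argument of `folium.Map`, and `bounds` to the map's `fit_bounds` method, so that the view follows the data if the set of postal codes changes.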