## Segmentation and Clustering of Neighborhoods in Toronto, Canada

### Introduction
1. Web-scraping the Toronto neighborhood data using the requests and BeautifulSoup libraries.

2. Performing exploratory data analysis using the pandas and NumPy libraries.

3. Using the Foursquare API to explore the neighborhoods of Toronto.

4. Using the scikit-learn library to apply the K-means clustering algorithm.

## Table of Contents

1. Performing Web-scraping and Extracting Neighborhood and borough information from Wikipedia Page

2. Performing EDA

3. Explore Neighborhoods in Toronto

4. Analyze Each Neighborhood

5. Cluster Neighborhoods

6. Examine Clusters    


## 1.  Performing Web-scraping and Extracting Neighborhood and borough information from Wikipedia Page

In [26]:
# Importing the libraries used for web scraping and analysis
import lxml # HTML/XML parser, used by pd.read_html when available
import requests # library to handle HTTP requests
# Import the BeautifulSoup library so we can parse HTML and XML documents
from bs4 import BeautifulSoup as BS

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas import json_normalize # transform JSON data into a pandas DataFrame (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [58]:
# Storing the URL of the Wikipedia page in wiki_url
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Getting the source code of the HTML page from the URL using the requests library
source = requests.get(wiki_url).text

# Converting the HTML source code to a BeautifulSoup object
soup = BS(source,'html.parser')

# Reading every table on the page into a list of DataFrames
df_old = pd.read_html(wiki_url)

# Displaying the first table, which holds the postal code, borough and neighbourhood columns
df_old[0]

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 7 | M8A | Not assigned | Not assigned |
| 8 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 9 | M1B | Scarborough | Malvern, Rouge |
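
The cell above builds a `soup` object but reads the table through `pd.read_html`. As a minimal sketch of the BeautifulSoup route, assuming the table has one header row of `<th>` cells followed by rows of `<td>` cells, the rows could be collected as shown here. The `sample_html` string and the `table_to_dataframe` helper are hypothetical stand-ins for illustration, not the real Wikipedia markup.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page source (not the real Wikipedia markup)
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

def table_to_dataframe(html):
    """Parse the first <table> in `html` into a DataFrame (illustrative helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = soup.find('table').find_all('tr')
    # The first row holds the column names
    header = [th.get_text(strip=True) for th in rows[0].find_all('th')]
    # Every later row holds one record
    records = [[td.get_text(strip=True) for td in row.find_all('td')]
               for row in rows[1:]]
    return pd.DataFrame(records, columns=header)

sample_df = table_to_dataframe(sample_html)
print(sample_df)
```

On the real page the table may contain nested tags or extra header rows, so the selectors would likely need adjusting.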


In [60]:
# Copying the first table into a working DataFrame and previewing the first rows
df = df_old[0].copy()
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


## 2. Performing EDA
The dataset has a total of 3 boroughs and 177 neighborhoods. To segment the neighborhoods and explore them, we need a dataset that contains the 3 boroughs, the neighborhoods that exist in each borough, and the latitude and longitude coordinates of each neighborhood.

### Data Handling and Cleaning

In [61]:
# Checking the dataFrame for missing Values
df.isna().sum()

Postal Code      0
Borough          0
Neighbourhood    0
dtype: int64

In [66]:
# dropping the rows having 'Not assigned' in Borough column
df = df[~(df['Borough'] == 'Not assigned')]
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [80]:
# Checking again for any missing or null values
df.isna().sum()

Postal Code      0
Borough          0
Neighbourhood    0
dtype: int64

In [81]:
df.shape

(103, 3)

In [10]:
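# Extracting neighborhood names from values in 'name/name' format and 'name–name' (en dash) format
# Note: this cell expects a 'Name' column, which the table shown above does not have
def extract(x):
    x = x.split('–')[0]  # keep the part before an en dash
    x = x.split('/')[0]  # keep the part before a slash
    return x
df['Name'] = df['Name'].apply(extract)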
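
To make the effect of this kind of splitting concrete, here is a small sketch on made-up names. The helper `extract_first_name` is a hypothetical variant of `extract` that also strips surrounding whitespace, which the original function does not do.

```python
def extract_first_name(x):
    """Keep only the first name of a 'name–name' or 'name / name' value."""
    x = x.split('–')[0]  # part before an en dash
    x = x.split('/')[0]  # part before a slash
    return x.strip()     # added here: drop leftover spaces around the name

# Hypothetical sample values, chosen only to illustrate the two formats
samples = ['Agincourt', 'Cabbagetown / St. James Town', 'Yonge–Eglinton']
cleaned = [extract_first_name(s) for s in samples]
print(cleaned)  # ['Agincourt', 'Cabbagetown', 'Yonge']
```

Stripping matters when the cleaned name is later embedded in a geocoding query such as `'<name>, Toronto'`.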

### Using the geopy library to get the latitude and longitude of each neighborhood

43.695403
-79.293099


In [43]:
# Creating a new DataFrame to store the latitude and longitude of each neighborhood
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# Create a single Nominatim geocoder instance with the user agent "tor_agent"
geolocator = Nominatim(user_agent="tor_agent")

def geocode_rows(names, boroughs):
    """Geocode each '<name>, Toronto' address; store NaN when the lookup fails."""
    rows = []
    for name, bor in zip(names, boroughs):
        try:
            location = geolocator.geocode(f'{name}, Toronto')
            lat, lon = location.latitude, location.longitude
        except Exception:
            # geocode() returned None or the request failed
            lat, lon = np.nan, np.nan
        rows.append({'Borough': bor, 'Neighborhood': name,
                     'Latitude': lat, 'Longitude': lon})
    return rows

# Nominatim is not meant for a large number of requests, so the dataset is processed in two parts
# (rows are collected in a list because DataFrame.append was removed in pandas 2.0)
rows = geocode_rows(df.Name[0:70], df.Borough[0:70])  # first part of the dataset
rows += geocode_rows(df.Name[70:], df.Borough[70:])   # second part of the dataset
Neighborhood = pd.DataFrame(rows, columns=column_names)

# Checking for any null values
Neighborhood.isna().sum()

Borough         0
Neighborhood    0
Latitude        6
Longitude       6
dtype: int64
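
Splitting the loop into two parts does not slow the requests down by itself. A minimal sketch of spacing calls apart is shown below. The `throttled` wrapper and the offline `fake_geocode` function are hypothetical names introduced for illustration; in the notebook the wrapped function would be `geolocator.geocode`. geopy also provides a `RateLimiter` helper in `geopy.extra.rate_limiter` that serves the same purpose.

```python
import time

def throttled(func, min_delay_seconds=1.0, sleep=time.sleep, clock=time.monotonic):
    """Wrap `func` so successive calls start at least `min_delay_seconds` apart."""
    last_call = [None]  # time of the previous call, None before the first call

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_delay_seconds - (clock() - last_call[0])
            if wait > 0:
                sleep(wait)
        last_call[0] = clock()
        return func(*args, **kwargs)

    return wrapper

# Hypothetical stand-in for geolocator.geocode, so the sketch runs offline
def fake_geocode(address):
    known = {'Crescent Town, Toronto': (43.695403, -79.293099)}
    return known.get(address)  # None when the address is unknown

geocode = throttled(fake_geocode, min_delay_seconds=0.05)
results = [geocode(a) for a in ['Crescent Town, Toronto', 'Nowhere, Toronto']]
print(results)
```

With the public Nominatim service, a delay of about one second between requests is the usual choice; the short delay above only keeps the demo fast.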

In [44]:
# Removing the names not having Proper coordinates
Neighborhood = Neighborhood[~Neighborhood.Latitude.isna()]

# Checking for any Missing values or Null values
Neighborhood.isna().sum()

Borough         0
Neighborhood    0
Latitude        0
Longitude       0
dtype: int64

In [32]:
# The map visualization below shows that the coordinates returned for the Niagara neighborhood are not correct, so we drop that row
Neighborhood = Neighborhood[~ (Neighborhood['Neighborhood'] == 'Niagara')]
Neighborhood.shape

(169, 4)
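
The Niagara row was found by eye on the map. A rough automated check, sketched here under assumptions, flags any point that lies farther than some cut-off from the city centre. The helper `haversine_km`, the 40 km cut-off, and the far-away sample point are all hypothetical; the centre and the Crescent Town coordinates are the values that appear in this notebook.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

TORONTO = (43.6534817, -79.3839347)  # centre of Toronto as geocoded in section 3
MAX_KM = 40.0                        # hypothetical cut-off, tune as needed

points = {
    'Crescent Town': (43.695403, -79.293099),  # value from this notebook
    'Far-away match': (43.09, -79.07),         # made-up point well outside the city
}
# True means the point is suspiciously far from the centre
flags = {name: haversine_km(*TORONTO, lat, lon) > MAX_KM
         for name, (lat, lon) in points.items()}
print(flags)
```

A check like this only catches results that land outside the city; a wrong match inside Toronto would still need a manual look.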

In [45]:
Neighborhood.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | East York | Crescent Town | 43.695403 | -79.293099 |
| 1 | East York | Governor's Bridge | 43.689423 | -79.369426 |
| 2 | East York | Leaside | 43.704798 | -79.36809 |
| 3 | East York | O'Connor | 43.750275 | -79.317901 |
| 4 | East York | Old East York | 43.670862 | -79.372792 |


The Final Extracted dataset has 3 boroughs and 169 neighborhoods.

In [46]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(Neighborhood['Borough'].unique()),
        Neighborhood.shape[0]
    )
)

The dataframe has 3 boroughs and 167 neighborhoods.
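
Beyond the overall totals, it can help to see how the neighborhoods are spread across boroughs. The sketch below uses a small made-up DataFrame with the same `Borough` and `Neighborhood` columns; `sample`, `n_boroughs`, and `per_borough` are illustrative names, and the rows are not the notebook's full data.

```python
import pandas as pd

# Small made-up sample with the same columns as the Neighborhood DataFrame
sample = pd.DataFrame({
    'Borough': ['East York', 'East York', 'Scarborough'],
    'Neighborhood': ['Crescent Town', 'Leaside', 'Agincourt'],
})

# Number of distinct boroughs (equivalent to len(sample['Borough'].unique()))
n_boroughs = sample['Borough'].nunique()

# Number of neighborhoods listed under each borough
per_borough = sample.groupby('Borough')['Neighborhood'].count()

print(n_boroughs)
print(per_borough)
```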


## 3. Explore Neighborhoods in Toronto

In [36]:
# Using the Nominatim geocoder from geopy.geocoders to convert an address to latitude and longitude
neighborhood_address = "toronto"
geolocator = Nominatim(user_agent="tor_agent")
location = geolocator.geocode(neighborhood_address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

43.6534817 -79.3839347


In [47]:
# Creating a map of Toronto Neighborhood using Folium Library and latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, long, borough, neigh in zip(Neighborhood['Latitude'], 
                                     Neighborhood['Longitude'], 
                                     Neighborhood['Borough'], 
                                     Neighborhood['Neighborhood']):
    label = '{}, {}'.format(neigh, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=label,
        color='blue',  # outline color (the option is named `color`, not `colors`)
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.4).add_to(map_toronto)
map_toronto

In [None]:
## For generating N values of random color in 
# x_coordinates = [0, 1, 2, 3, 4, 5]
# y_coordinates = [0, 1, 2, 3, 4, 5]

# for x, y in zip(x_coordinates, y_coordinates):
#     rgb = (random.random(), random.random(), random.random())
#     plt.scatter(x, y, c=[rgb])
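
The commented-out cell above experiments with random colors. As an alternative sketch, not the approach used in this notebook, evenly spaced hues give a distinct and reproducible color per borough, which could then be passed to a map marker. The helper `distinct_hex_colors` and the sample borough list are hypothetical.

```python
import colorsys

def distinct_hex_colors(n):
    """Return n hex color strings spaced evenly around the hue wheel."""
    colors_out = []
    for i in range(n):
        # Fixed saturation and brightness, varying hue only
        r, g, b = colorsys.hsv_to_rgb(i / n, 0.8, 0.9)
        colors_out.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors_out

boroughs = ['East York', 'North York', 'Scarborough']  # sample labels
borough_color = dict(zip(boroughs, distinct_hex_colors(len(boroughs))))
print(borough_color)
```

Unlike random colors, this mapping stays the same between runs, so the map legend does not change each time the notebook is executed.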

### Exploring the dataset for the Scarborough borough only

In [50]:
Scarborough_Neighborhood = Neighborhood[Neighborhood['Borough'] == 'Scarborough']
Scarborough_Neighborhood.reset_index(drop=True)

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Scarborough | Agincourt | 43.785353 | -79.278549 |
| 1 | Scarborough | Alexandra Park | 43.650787 | -79.404318 |
| 2 | Scarborough | Allenby | 43.711351 | -79.553424 |
| 3 | Scarborough | Amesbury | 43.706162 | -79.483492 |
| 4 | Scarborough | Armour Heights | 43.743944 | -79.430851 |
| 5 | Scarborough | Banbury | 43.742796 | -79.369957 |
| 6 | Scarborough | Bathurst Manor | 43.665519 | -79.411937 |
| 7 | Scarborough | Bay Street Corridor | 43.667342 | -79.388457 |
| 8 | Scarborough | Bayview Village | 43.769197 | -79.376662 |
| 9 | Scarborough | Bayview Woods | 43.798127 | -79.382973 |


## Exploring the Borough of Scarborough
#### As we did for all the neighborhoods, let's visualize the neighborhoods of Scarborough


In [55]:
# Using the Nominatim geocoder from geopy.geocoders to convert an address to latitude and longitude
neighborhood_address = "Scarborough, Toronto"
geolocator = Nominatim(user_agent="scarborough_agent")
location = geolocator.geocode(neighborhood_address)
latitude = location.latitude
longitude = location.longitude
print('The latitude and longitude of Scarborough, Toronto are {}, {}'.format(latitude, longitude) )

The latitude and longitude of Scarborough, Toronto are 43.773077, -79.257774


In [57]:
# Creating a map of the Scarborough neighborhoods using the Folium library and latitude and longitude values
map_scarborough = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, long, borough, neigh in zip(Scarborough_Neighborhood['Latitude'], 
                                     Scarborough_Neighborhood['Longitude'], 
                                     Scarborough_Neighborhood['Borough'], 
                                     Scarborough_Neighborhood['Neighborhood']):
    label = '{}, {}'.format(neigh, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=label,
        color='blue',  # outline color (the option is named `color`, not `colors`)
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.4).add_to(map_scarborough)
map_scarborough