# Segmenting and Clustering Neighborhoods in Toronto
1. Introduction
This notebook uses the Foursquare API to explore neighborhoods in Toronto. The k-means clustering algorithm groups the neighborhoods into clusters, and the Folium library visualizes the neighborhoods and their emerging clusters on a map.

2. Download the dependencies

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # installs folium; comment this line out if it is already installed
import folium # map rendering library

print('Libraries imported.')

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    geographiclib: 1.49-py_0   conda-forge
    geopy:         1.18.1-py_0 conda-forge

geographiclib- 100% |################################| Time: 0:00:00  14.81 MB/s
geopy-1.18.1-p 100% |################################| Time: 0:00:00  29.13 MB/s
Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    altair:  2.2.2-py35_1 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge

altair-2.2.2-p 100% |################################| Time: 0:00:00  50.06 MB/s
branca-0.3.1-p 100% |################################| Time: 0:00:00  23.96 MB/s
vincent-0.4.4- 100% |###################

# 3. Download and Explore the Dataset
Because no ready-made dataset of Toronto neighborhoods is available, the data must be downloaded from Wikipedia and wrangled into a usable form.

3.1 Read the content of the Wikipedia page containing the data about Toronto

In [2]:
# Get data from the wikipedia page
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
toronto_data_wiki=requests.get(url)
# Extract the page content from the response as an HTML string
toronto_wiki_page=toronto_data_wiki.text

# 3.2 Wrangle the data
The Wikipedia page contains an HTML table that needs to be converted to a pandas dataframe. The BeautifulSoup package is used to parse the page and extract that table.

In [4]:
# Install beautiful soup
!conda install -c anaconda beautifulsoup4 --yes

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following packages will be UPDATED:

    beautifulsoup4: 4.6.0-py35h442a8c9_1 --> 4.6.3-py35_0 anaconda

beautifulsoup4 100% |################################| Time: 0:00:00  26.00 MB/s


The following packages will be UPDATED:

    cryptography:    2.3.1-py36hdffb7b8_0    conda-forge --> 2.4.1-py36h1ba5d50_0    anaconda
    grpcio:          1.16.0-py36hd60e7a3_0   conda-forge --> 1.16.1-py36hf8bcb03_1   anaconda
    libarchive:      3.3.3-h823be47_0        conda-forge --> 3.3.3-h5d8350f_4        anaconda
    libcurl:         7.63.0-hbdb9355_0       conda-forge --> 7.63.0-h20c2e04_1000            
    libssh2:         1.8.0-h5b517e9_3        conda-forge --> 1.8.0-h1ba5d50_4        anaconda
    openssl:         1.0.2p-h470a237_2       conda-forge --> 1.1.1-h7b6447c_0        anaconda
    pycurl:          7.43.0.2-py36hb7f436b_0             --> 7.43.0.2-py36h1ba5d50_0         
    python:          3.6.6-h5001a0f_3        conda-forge --> 3.6.7-h0371630_0        anaconda

The following packages will be DOWNGRADED:

    ca-certificates: 2018.11.29-ha4d7672_0   conda-forge --> 2018.03.07-0            anaconda
    certifi:         2018.11.29-py36_1000    conda-forge --> 2018.10.15-py36_0       anaconda
    krb5:            1.16.2-hbb41f41_0       conda-forge --> 1.16.1-h173b8e3_7       anaconda

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

In [5]:
# Import the BeautifulSoup Package 
from bs4 import BeautifulSoup
bs = BeautifulSoup(toronto_wiki_page,'lxml')
# Use prettify function to determine the html table class name which needs to be extracted
#print(bs.prettify())

In [6]:
# Extract the table from the page
toronto_table = bs.find('table',{'class':'wikitable sortable'})
# toronto_table // Display the table

In [7]:
# Extract postcode, borough and neighbourhood lists based on the conditions specified
rows = toronto_table.find_all('tr')
postcode=[]
borough=[]
neighbourhood=[]
for row in rows:
    cols=row.find_all('td')
    cols=[x.text.strip() for x in cols] # cols is a list with 3 elements in the order postcode, borough, neighbourhood
    if cols:
        if(cols[1] !='Not assigned'): # Ignore rows with a borough that is Not assigned.
            # Append to all three lists together so that they always stay the same length
            postcode.append(cols[0])
            borough.append(cols[1])
            if(cols[2] == 'Not assigned'): # If neighbourhood is not assigned, it is the same as borough
                neighbourhood.append(cols[1])
            else:
                neighbourhood.append(cols[2])

# Display lists
#print(postcode)
#print(borough)
#print(neighbourhood)
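
The two cleaning rules applied in the loop above (skip rows without a borough, and fall back to the borough name when the neighbourhood is missing) can also be expressed as vectorized pandas operations. The following is a minimal sketch on a hand-written toy dataframe; the names `raw` and `cleaned` and the sample rows are illustrative assumptions, not part of the notebook or the scraped data.

```python
import pandas as pd

# Toy rows in the same (postcode, borough, neighbourhood) layout as the Wikipedia table.
# The values are illustrative placeholders, not scraped data.
raw = pd.DataFrame({
    "Postalcode": ["M1A", "M3A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Harbourfront", "Not assigned"],
})

# Rule 1: drop rows whose borough is not assigned.
cleaned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: keep the neighbourhood where it is assigned, otherwise use the borough name.
cleaned["Neighborhood"] = cleaned["Neighborhood"].where(
    cleaned["Neighborhood"] != "Not assigned", cleaned["Borough"]
)
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Working on the dataframe directly keeps the three columns aligned by construction, which is the property the list-based loop has to maintain by hand.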

In [8]:
# Combine the lists to a dataframe
toronto_df_ini=pd.DataFrame(
    {'Postalcode': postcode,
     'Borough': borough,
     'Neighborhood': neighbourhood
    })
toronto_df_ini.head()

Unnamed: 0,Borough,Neighborhood,Postalcode
0,North York,Parkwoods,M3A
1,North York,Victoria Village,M4A
2,Downtown Toronto,Harbourfront,M5A
3,Downtown Toronto,Regent Park,M5A
4,North York,Lawrence Heights,M6A


Several neighborhoods can share one postal code. Group the rows by postal code with the pandas groupby function and combine the neighborhood names of each group into a single comma-separated string.

In [9]:
def concatenate_neighborhood(x):
    # Join all neighborhood names of one postal code into a comma-separated string
    return ", ".join(x)

def select_Borough(x):
    # All rows of one postal code are expected to share a single borough
    if x.nunique() != 1:
        print(x) # show the conflicting boroughs before failing
        raise Exception("Postcode comprises two Boroughs")
    return x.iloc[0]

# as_index is an option of groupby(), not of agg(); reset_index() turns Postalcode back into a column
toronto_df = toronto_df_ini.groupby("Postalcode").agg({"Borough": select_Borough,
                                 "Neighborhood": concatenate_neighborhood})
toronto_df = toronto_df.reset_index()
toronto_df.head(5)

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
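
To make the grouping step easier to follow, here is a small self-contained sketch of the same idea on three hand-written rows. It is a simplification under stated assumptions: the dataframe `df` is illustrative, and the built-in `"first"` aggregation stands in for the notebook's `select_Borough` check, so it does not verify that all boroughs in a group agree.

```python
import pandas as pd

# Illustrative rows: two neighborhoods share M5A, one postal code stands alone.
df = pd.DataFrame({
    "Postalcode": ["M5A", "M3A", "M5A"],
    "Borough": ["Downtown Toronto", "North York", "Downtown Toronto"],
    "Neighborhood": ["Harbourfront", "Parkwoods", "Regent Park"],
})

# "first" keeps the borough of the first row in each group;
# ", ".join concatenates the neighborhood names of the group.
grouped = (
    df.groupby("Postalcode")
      .agg({"Borough": "first", "Neighborhood": ", ".join})
      .reset_index()
)
print(grouped)
```

The result has one row per postal code, with the groups sorted by postal code and the neighborhood names kept in their original order within each group.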


## 4. Display Shape of dataframe


In [10]:
toronto_df.shape

(103, 3)

## 5. Geospatial Analysis
5.1 Download the CSV file
Since the geocoder package is not reliable enough, the CSV file provided by Coursera is used for the latitude and longitude coordinates of each postal code.

In [11]:
# Download the csv file
!wget -q -O 'lat_long_data.csv' http://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


In [12]:
# Read the csv data into a pandas data frame
lat_long_df = pd.read_csv('lat_long_data.csv')
lat_long_df.head(5)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [13]:
# Remove postal code column from lat_long_df before concatenation
lat_long_df.drop(['Postal Code'],axis=1,inplace=True)
lat_long_df.head()

Unnamed: 0,Latitude,Longitude
0,43.806686,-79.194353
1,43.784535,-79.160497
2,43.763573,-79.188711
3,43.770992,-79.216917
4,43.773136,-79.239476
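
Dropping the postal code and then concatenating by position only gives correct results if both dataframes list the postal codes in exactly the same order. A possible alternative, sketched here as an assumption rather than as the notebook's method, is to keep the `Postal Code` column and join on it. The dataframes `neighborhoods` and `coords` are small stand-ins built from rows shown earlier, with the row order deliberately swapped.

```python
import pandas as pd

# Stand-ins for toronto_df and the coordinates file; note the different row order.
neighborhoods = pd.DataFrame({
    "Postalcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# Join on the postal code so each row receives its own coordinates regardless of row order,
# then drop the duplicate key column.
merged = neighborhoods.merge(
    coords, left_on="Postalcode", right_on="Postal Code", how="left"
).drop(columns="Postal Code")
print(merged)
```

With `how="left"`, a postal code that is missing from the coordinates file shows up as a row with empty latitude and longitude instead of silently shifting the remaining rows.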
