<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto, CA</font></h1>

## Introduction

In this lab, you will learn how to convert addresses into their equivalent latitude and longitude values. You will also use the Foursquare API to explore neighborhoods in Toronto. You will use the **explore** function to get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters with the *k*-means clustering algorithm. Finally, you will use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Explore Neighborhoods in Toronto</a>

3. <a href="#item3">Analyze Each Neighborhood</a>

4. <a href="#item4">Cluster Neighborhoods</a>

5. <a href="#item5">Examine Clusters</a>    
</font>
</div>

Before we get the data and start exploring it, let's install and import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if you have already completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON into a pandas dataframe (on pandas >= 1.0, use: from pandas import json_normalize)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if you have already completed the Foursquare API lab
import folium # map rendering library

from bs4 import BeautifulSoup # library for scraping web pages

!conda install -c conda-forge geocoder --yes
import geocoder # import geocoder

print('Libraries imported.')

Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geopy-1.18.1               |             py_0          51 KB  conda-forge
    geographiclib-1.49         |             py_0          32 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          84 KB

The following NEW packages will be INSTALLED:

    geographiclib: 1.49-py_0   conda-forge
    geopy:         1.18.1-py_0 conda-forge


Downloading and Extracting Packages
geopy-1.18.1         | 51 KB     | ##################################### | 100% 
geographiclib-1.49   | 32 KB     | ##################################### | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Solving environme

<a id='item1'></a>

## 1. Download and Explore Dataset

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = requests.get(url)

soup = BeautifulSoup(page.content,'html.parser')

#### Get the table

In [3]:
table = soup.find('table').prettify()
table

'<table class="wikitable sortable">\n <tbody>\n  <tr>\n   <th>\n    Postcode\n   </th>\n   <th>\n    Borough\n   </th>\n   <th>\n    Neighbourhood\n   </th>\n  </tr>\n  <tr>\n   <td>\n    M1A\n   </td>\n   <td>\n    Not assigned\n   </td>\n   <td>\n    Not assigned\n   </td>\n  </tr>\n  <tr>\n   <td>\n    M2A\n   </td>\n   <td>\n    Not assigned\n   </td>\n   <td>\n    Not assigned\n   </td>\n  </tr>\n  <tr>\n   <td>\n    M3A\n   </td>\n   <td>\n    <a href="/wiki/North_York" title="North York">\n     North York\n    </a>\n   </td>\n   <td>\n    <a href="/wiki/Parkwoods" title="Parkwoods">\n     Parkwoods\n    </a>\n   </td>\n  </tr>\n  <tr>\n   <td>\n    M4A\n   </td>\n   <td>\n    <a href="/wiki/North_York" title="North York">\n     North York\n    </a>\n   </td>\n   <td>\n    <a href="/wiki/Victoria_Village" title="Victoria Village">\n     Victoria Village\n    </a>\n   </td>\n  </tr>\n  <tr>\n   <td>\n    M5A\n   </td>\n   <td>\n    <a href="/wiki/Downtown_Toronto" title="Downtown 
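
As a rough sketch of what the table lookup returns, the snippet below parses a small inline HTML fragment instead of the live Wikipedia page and reads the cells row by row. The fragment reuses two data rows from the output above; the names `sample_html`, `header`, and `records` are illustrative assumptions and not part of the lab.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia table (rows copied from the output above).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td><a href="/wiki/Parkwoods">Parkwoods</a></td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table")  # first <table> element in the document

# Collect the text of every header/data cell, one list per table row.
rows = []
for tr in sample_table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, records = rows[0], rows[1:]
print(header)
print(records)
```

Note that `get_text` drops the `<a>` link markup and keeps only the visible cell text, which is the part the later steps rely on.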

#### Transform the data into a *pandas* dataframe

In [4]:
# read_html returns a list of dataframes, one per <table> tag in the input string;
# header=0 marks the first row as the column labels
dfs = pd.read_html(table, header=0)
table_df = dfs[0] # first dataframe in the list
table_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289 entries, 0 to 288
Data columns (total 3 columns):
Postcode         289 non-null object
Borough          289 non-null object
Neighbourhood    289 non-null object
dtypes: object(3)
memory usage: 6.9+ KB


In [5]:
table_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


#### Remove the rows whose Borough is "Not assigned"

In [6]:
table_df2 = table_df[table_df.Borough != 'Not assigned'].reset_index(drop=True)
table_df2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


In [7]:
table_df2 = table_df2.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
table_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 3 columns):
Postcode         103 non-null object
Borough          103 non-null object
Neighbourhood    103 non-null object
dtypes: object(3)
memory usage: 2.5+ KB
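
The `groupby` step above collapses rows that share a postcode into a single row with a comma-separated neighbourhood list, which is why the row count drops to 103. A minimal sketch on three rows copied from the filtered table (the `sample` and `combined` names are illustrative):

```python
import pandas as pd

# Sample rows from the filtered table; postcode M5A appears twice.
sample = pd.DataFrame({
    "Postcode": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park"],
})

# Group by postcode and borough, then join each group's neighbourhoods with commas.
combined = (
    sample.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .apply(",".join)
    .reset_index()
)
print(combined)
```

The two M5A rows become one row whose `Neighbourhood` value is the joined string, while M3A is left unchanged.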


#### Use the Borough as the Neighbourhood for records whose Neighbourhood is "Not assigned"

In [8]:
table_df2.loc[table_df2.Neighbourhood=='Not assigned'].reset_index(drop=True)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M7A,Queen's Park,Not assigned


In [9]:
table_df2.loc[table_df2.Neighbourhood=='Not assigned', 'Neighbourhood']=table_df2.Borough
table_df2.loc[table_df2.Neighbourhood=='Not assigned'].reset_index(drop=True) # sanity check: no rows should satisfy this condition now
table_df2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
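
The masked assignment above can be tried on a two-row sample to see its effect in isolation. This is only a sketch: the `sample` frame and the `mask` variable are illustrative, and the rows are copied from the outputs above.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M1G", "M7A"],
    "Borough": ["Scarborough", "Queen's Park"],
    "Neighbourhood": ["Woburn", "Not assigned"],
})

# Boolean mask selecting the rows that still lack a neighbourhood name.
mask = sample["Neighbourhood"] == "Not assigned"

# Copy the borough name into the neighbourhood column for those rows only.
sample.loc[mask, "Neighbourhood"] = sample.loc[mask, "Borough"]
print(sample)
```

Rows that already have a neighbourhood (here M1G) are untouched, because `.loc` only writes to the rows where the mask is true.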


#### Number of rows in the dataframe

In [10]:
print('Number of Rows =', table_df2.shape[0])

Number of Rows = 103


## Getting the Latitude and Longitude

In [11]:
table_df2['Latitude']=''
table_df2['Longitude']=''

In [16]:
cord_df = pd.read_csv("http://cocl.us/Geospatial_data")
cord_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [18]:
table_df2 = table_df2.rename(columns={'Postcode':'PostalCode'})
table_df2.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",,
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",,
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",,
3,M1G,Scarborough,Woburn,,
4,M1H,Scarborough,Cedarbrae,,


In [22]:
# Copy each postal code's coordinates into the matching row;
# .loc (unlike .at) accepts a boolean mask as the row selector
for _, row in cord_df.iterrows():
    mask = table_df2['PostalCode'] == row['Postal Code']
    table_df2.loc[mask, 'Latitude'] = row['Latitude']
    table_df2.loc[mask, 'Longitude'] = row['Longitude']
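
The row-by-row lookup above can also be expressed as a single join. The sketch below is an alternative, not the approach used in this notebook: it assumes the placeholder `Latitude` and `Longitude` columns have not been created (or have been dropped first), and the `neighbourhoods`, `coords`, and `merged` names are illustrative. The sample values are copied from the outputs above.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge,Malvern", "Highland Creek,Rouge Hill,Port Union"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Left join keeps every neighbourhood row, even if a postal code has no coordinates.
merged = (
    neighbourhoods
    .merge(coords, how="left", left_on="PostalCode", right_on="Postal Code")
    .drop(columns="Postal Code")  # the join key is now duplicated, so drop one copy
)
print(merged)
```

One practical difference is that the merged coordinate columns keep their numeric `float64` dtype, whereas columns initialised with empty strings hold generic Python objects.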

In [23]:
table_df2

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.8067,-79.1944
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.7845,-79.1605
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.7636,-79.1887
3,M1G,Scarborough,Woburn,43.771,-79.2169
4,M1H,Scarborough,Cedarbrae,43.7731,-79.2395
5,M1J,Scarborough,Scarborough Village,43.7447,-79.2395
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.7279,-79.262
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.7111,-79.2846
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.7163,-79.2395
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.6927,-79.2648
