<a href="https://colab.research.google.com/github/ClintonGJohnson/Coursera_Capstone/blob/main/Applied_Data_Science_Capstone_Week3_My_Submission.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Applied Data Science Capstone Week 3
## Peer Graded Assignment
### Segmenting and Clustering Neighbourhoods in Toronto

---
### Instructions

In this assignment, you will be required to explore, segment, and cluster the Neighbourhoods in the city of Toronto based on the postal code and borough information. However, unlike New York, the Neighbourhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

For the Toronto Neighbourhood data, a Wikipedia page exists that has all the information we need to explore and cluster the Neighbourhoods in Toronto. You will be required to scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas DataFrame so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the Neighbourhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook on your Github repository.


---
### My Approach


1.   **Compile a list (DataFrame df) of neighbourhoods in Toronto**

> Gather the list of neighbourhood names and coordinates (latitude and longitude of neighbourhood centroids)


> Clean the list removing rows or columns with missing data



2.   **Compile a list of venues near each neighbourhood** (within 500 meters of the centroid, which may lead to overlaps in some cases and may not encompass the full Neighbourhood in other cases)

3. **Analyze Each Neighbourhood**

4. **Cluster Neighbourhoods**

5. **Examine Clusters**
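
To make the 500 meter radius in step 2 concrete, the sketch below computes the great-circle (haversine) distance between two neighbourhood centroids taken from the geocoding output later in this notebook (Lawrence Heights and Lawrence Manor). The helper name `haversine_m` and the mean Earth radius of 6,371 km are assumptions made for illustration; they are not part of the original analysis. When two centroids are closer than twice the search radius, their venue searches can overlap.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed value)
SEARCH_RADIUS_M = 500       # venue search radius from step 2

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Centroids as geocoded in this notebook
lawrence_heights = (43.72357, -79.43711)
lawrence_manor = (43.72294, -79.43116)

distance = haversine_m(*lawrence_heights, *lawrence_manor)
circles_overlap = distance < 2 * SEARCH_RADIUS_M  # two 500 m circles intersect
print('Distance: {:.0f} m, search areas overlap: {}'.format(distance, circles_overlap))
```

In this case the two centroids are less than 1,000 meters apart, so some venues would likely be counted for both neighbourhoods.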







### Install and import relevant libraries
* pandas
* NumPy
* scikit-learn
* Matplotlib
* GeoPy
* json
* Requests
* Folium

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
#from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # install folium; if conda is unavailable (as on Colab), the import below still works when folium is already present
import folium # map rendering library

print('Libraries imported.')

/bin/bash: conda: command not found
Libraries imported.


# 1. Compile a list (DataFrame df) of neighbourhoods in Toronto
> Gather the list of neighbourhood names and coordinates (latitude and longitude of neighbourhood centroids)
> 1. **Gather Neighbourhood List from Wiki page:** https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050.
> The first Table in the web page contains a list of Neighbourhoods in Toronto
> 2. **Clean the list** removing rows or columns with missing data
> Remove any records with a "Not assigned" Borough or Neighbourhood
> 3. **Geocode each neighbourhood** to collect coordinates

### 1.1. **Gather Neighbourhood List from Wiki page:**

In [2]:
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050"
html_tables = pd.read_html(url)

html_tables[0].head() # confirming that we have the correct table (Columns: Postcode, Borough, Neighbourhood)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### **1.2**. **Clean the list**

In [3]:
df = html_tables[0] # set the DataFrame df
# replace "Not assigned" with None so that those rows count as missing
df = df.replace({'Not assigned':None})
df.count()

Postcode         287
Borough          210
Neighbourhood    210
dtype: int64

In [4]:
# drop nulls (Nones)
df.dropna(axis=0, inplace=True)
df.count()

Postcode         210
Borough          210
Neighbourhood    210
dtype: int64
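
The same cleaning step can be checked on a small, made-up table. This is a minimal sketch: the three rows below are illustrative stand-ins for the Wikipedia table, and `subset` is used (an assumption, not the notebook's exact call) so that only a missing Borough or Neighbourhood causes a row to be dropped.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; the real table comes from pd.read_html above
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Mark the placeholder string as missing, then drop rows lacking a borough or neighbourhood
cleaned = (
    sample.replace('Not assigned', np.nan)
          .dropna(subset=['Borough', 'Neighbourhood'])
          .reset_index(drop=True)
)
print(cleaned)
```

Resetting the index is optional; the notebook keeps the original index, which is why `df.head()` starts at row 2.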

Explore the list

In [5]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


In [6]:
df.shape

(210, 3)

# 2. Geocode Neighbourhoods

In [13]:
!pip install arcgis # Using Esri's ArcGIS for accurate geocoding
#!conda install -c esri arcgis



In [7]:
from arcgis.gis import GIS
from arcgis.geocoding import Geocoder, get_geocoders, geocode


from IPython.display import display

arcgis_online = GIS()
items = arcgis_online.content.search('Geocoder', 'geocoding service', max_items=3)
    
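# construct a geocoder using the first geocoding service item
# (the first search hit is not necessarily a worldwide geocoding service; check the URL in the output)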
worldgeocoder = Geocoder.fromitem(items[0])
worldgeocoder   

<Geocoder url:"https://geocoder.arcgisonline.nl/arcgis/rest/services/Geocoder_BAG_RD/GeocodeServer">

Geocode each neighbourhood

In [10]:
address_format = "{}, {}, ON, CA" # Neighbourhood, Borough, ON, CA

columns = ['Postcode', 'Borough', 'Neighbourhood', 'Address', 'Latitude', 'Longitude']
rows = [] # one dict per neighbourhood (DataFrame.append was removed in pandas 2.0)
no_match = 0
matched = 0

for p, b, n in zip(df['Postcode'], df['Borough'], df['Neighbourhood']):
  address = address_format.format(n, b)
  print('Geocoding {}'.format(address))

  latitude = None
  longitude = None
  location = None

  try:
    matches = geocode(address) # geocode using ArcGIS Online
    if len(matches) > 0:
      location = matches[0] # keep the first candidate returned
  except TimeoutError:
    print('Geocoder Request Timed Out')
  except Exception as e: # report any other failure and continue with the next address
    print('Error has occurred: {}'.format(e))

  if location is None:
    no_match += 1
    print('...No results :(')
  else:
    latitude = location['location']['y']
    longitude = location['location']['x']
    matched += 1
    print('...{}, {}'.format(latitude, longitude))

  rows.append(
      {
        'Address': address,
        'Latitude': latitude,
        'Longitude': longitude,
        'Neighbourhood': n,
        'Borough': b,
        'Postcode': p
      }
  )

# build the DataFrame once, after the loop
toronto_data = pd.DataFrame(rows, columns=columns)

print('Geocoding Complete! {} locations matched, but {} not matched'.format(matched, no_match))

toronto_data.shape

Geocoding Parkwoods, North York, ON, CA
...44.20973226495906, -79.47189723748289
Geocoding Victoria Village, North York, ON, CA
...43.73154000000005, -79.31427999999994
Geocoding Harbourfront, Downtown Toronto, ON, CA
...43.65011000000004, -79.38289999999995
Geocoding Lawrence Heights, North York, ON, CA
...43.72357000000005, -79.43710999999996
Geocoding Lawrence Manor, North York, ON, CA
...43.72294000000005, -79.43115999999998
Geocoding Queen's Park, Downtown Toronto, ON, CA
...43.660673101153115, -79.39083464301146
Geocoding Islington Avenue, Etobicoke, ON, CA
...43.738221166575215, -79.56573343932973
Geocoding Rouge, Scarborough, ON, CA
...43.807660000000055, -79.17404999999997
Geocoding Malvern, Scarborough, ON, CA
...43.81023000000005, -79.22037999999998
Geocoding Don Mills North, North York, ON, CA
...43.705685127473515, -79.33385691603588
Geocoding Woodbine Gardens, East York, ON, CA
...43.70626000000004, -79.30090999999999
Geocoding Parkview Hill, East York, ON, CA
...43.70464

(210, 6)
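
The loop above gives up on an address after a single timeout. One possible refinement is to retry a few times before recording a miss. The sketch below is hypothetical: `geocode_with_retry`, its parameters, and the stand-in `flaky_geocode` function are invented for illustration. In the notebook, `geocode_fn` would be the `geocode` function imported from `arcgis.geocoding`.

```python
import time

def geocode_with_retry(geocode_fn, address, attempts=3, delay_s=0.0):
    """Call geocode_fn(address), retrying on TimeoutError; return the first match or None."""
    for attempt in range(1, attempts + 1):
        try:
            matches = geocode_fn(address)
            return matches[0] if matches else None
        except TimeoutError:
            print('Attempt {} timed out for {}'.format(attempt, address))
            if attempt < attempts:
                time.sleep(delay_s)  # pause before the next attempt
    return None

# Stand-in geocoder (hypothetical): times out once, then returns one fixed candidate
calls = {'count': 0}

def flaky_geocode(address):
    calls['count'] += 1
    if calls['count'] == 1:
        raise TimeoutError
    return [{'location': {'x': -79.31428, 'y': 43.73154}}]

result = geocode_with_retry(flaky_geocode, 'Victoria Village, North York, ON, CA')
print(result['location']['y'], result['location']['x'])
```

A non-zero `delay_s` would be gentler on a shared geocoding service; it is left at zero here so the example runs instantly.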

Review the data

In [12]:
toronto_data.head(11)

Unnamed: 0,Postcode,Borough,Neighbourhood,Address,Latitude,Longitude
0,M3A,North York,Parkwoods,"Parkwoods, North York, ON, CA",44.209732,-79.471897
1,M4A,North York,Victoria Village,"Victoria Village, North York, ON, CA",43.73154,-79.31428
2,M5A,Downtown Toronto,Harbourfront,"Harbourfront, Downtown Toronto, ON, CA",43.65011,-79.3829
3,M6A,North York,Lawrence Heights,"Lawrence Heights, North York, ON, CA",43.72357,-79.43711
4,M6A,North York,Lawrence Manor,"Lawrence Manor, North York, ON, CA",43.72294,-79.43116
5,M7A,Downtown Toronto,Queen's Park,"Queen's Park, Downtown Toronto, ON, CA",43.660673,-79.390835
6,M9A,Etobicoke,Islington Avenue,"Islington Avenue, Etobicoke, ON, CA",43.738221,-79.565733
7,M1B,Scarborough,Rouge,"Rouge, Scarborough, ON, CA",43.80766,-79.17405
8,M1B,Scarborough,Malvern,"Malvern, Scarborough, ON, CA",43.81023,-79.22038
9,M3B,North York,Don Mills North,"Don Mills North, North York, ON, CA",43.705685,-79.333857
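
In the table above, Parkwoods was geocoded to a latitude of about 44.21, while the other rows fall between roughly 43.65 and 43.81, which suggests a mismatch. A simple sanity check is to flag coordinates that fall outside a rough bounding box for Toronto. The box limits below are approximate assumptions chosen for illustration, and the three rows are copied from the output above.

```python
import pandas as pd

# Approximate bounding box for the City of Toronto (assumed values, not authoritative)
LAT_MIN, LAT_MAX = 43.58, 43.86
LON_MIN, LON_MAX = -79.64, -79.11

# Three rows copied from the geocoding results above
sample = pd.DataFrame({
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Harbourfront'],
    'Latitude': [44.209732, 43.73154, 43.65011],
    'Longitude': [-79.471897, -79.31428, -79.3829],
})

# True where both coordinates fall inside the box
in_box = (
    sample['Latitude'].between(LAT_MIN, LAT_MAX)
    & sample['Longitude'].between(LON_MIN, LON_MAX)
)

suspect = sample[~in_box]  # rows that probably need to be geocoded again
print(suspect)
```

Applied to the full `toronto_data` DataFrame, the same mask would list every neighbourhood whose coordinates deserve a second look before venues are collected.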
