<center><h1>Segmentating and Clustering Neighborhoods in Toronto</h1></center>

### Introduction

In this notebook, we will explore, segment, and cluster the neighborhoods in the city of Toronto based on the postal code and borough information. However, unlike New York, the neighborhood data is not readily available on the internet. So first thing first, we need to 


 *After retreiving the URL and creating a Beautiful soup object** 

 **Firstly create a list**  

 **Later after finding the table and table data  create a dictionary called cell having 3 keys PostalCode, Borough and Neighborhood.**

**As postal code contains upto 3 characters extract that using tablerow.p.text**

 **Next use split ,strip and replace functions for getting Borough and Neighborhood information.**.

 **Append to the list**  

 **Create a dataframe with list**

First thing first, lets import necessary libraries and download all dependencies we need

In [1]:
import numpy as np #handl data in a vectorised manner
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


import json
from pandas.io.json import json_normalize #json to panda df

#Visualisation
import matplotlib.cm as cm
import matplotlib.colors as colors

#k-means cluster
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes #already installed
import folium #map rendering library

#web scraping libs
from bs4 import BeautifulSoup
import requests

### 1. Download, Explore and Prepare Dataset

Unfortunately, the data is not readily available to download. So, we need to do web scraping to collect the data for Toronto neighborhood.

For Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. What we aim to do now is:

* Scrape data from wiki page
* Wrangle the data
* Clean and read into panda dataframe


Scrape the web for the data

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" # wiki page with postal codes
html = requests.get(url).text

In [3]:
soup = BeautifulSoup(html, 'html5lib')

In [4]:
soup.title

<title>List of postal codes of Canada: M - Wikipedia</title>

In [5]:
#print(soup.prettify())

In [6]:
table = soup.find('table')
table_contents = []

<details><summary>My approach</summary>

```python
table_contents = []
for row in table.find_all('tr'):
    #print(row)
    data_cell = {}
    for col in row.find_all('td'):
        if col.span.text == 'Not assigned': # if the 
              pass
        else:
            data_cell['PostalCode'] = col.p.text[:3] 
            data_cell['Borough'] = (col.span.text).split('(')[0] # gives borough
            #(row.span.text).split('(')[0]
            data_cell['Neighborhood'] = ((((col.span.text).split('(')[1]).replace(' /',',')).replace(')',' ')).strip(' ')
            table_contents.append(data_cell)
   ```

</details>   

In [7]:
# iterate through each row and extract postcode, borough and neighborhood info
for row in table.findAll('td'):
    cell = {} #cell array for dic
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3] # the first three letters are postal code
        cell['Borough'] = (row.span.text).split('(')[0] # split string at parenthesis. the first half is the borough
        #essentially, we remove the parenthesis and replace '/' with a ','
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

In [8]:
df = pd.DataFrame(table_contents)
#cleaning data to remove unwanted info such as email link, po box etc... 
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

In [9]:
df.shape

(103, 3)

In [18]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


-------

### 2. Get latitude and longitude of the neighborhood location

Here, we will use python geocoder APO to get longitude and latitude data for each neighborhood location. We will then use these lat,long with foursquare API 

In [17]:
#!pip install geocoder

The geocoder api returned None. Hence, we will download the csv file directly

Code for getting latlong data using geocoder  -- API KEYS NEEDED WHICH IS CHARGED SO AVOIDED!!
```python
from IPython.display import clear_output
import geocoder # import geocoder

# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
while(lat_lng_coords is None):
    g = geocoder.google('Mountain View, CA')
    lat_lng_coords = g.latlng
    clear_output(wait=False)
    print(lat_lng_coords)

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]
```

In [28]:
!pip install pgeocode


Collecting pgeocode
  Downloading pgeocode-0.3.0-py3-none-any.whl (8.5 kB)
Installing collected packages: pgeocode
Successfully installed pgeocode-0.3.0


> Tried pgeocode library as suggested by Laxmi in the discussion. I can get the lat long but the values are slightly off at 3 decimal place. SUGGESTED OPTION WAS TO USE THE .CSV FILE PROVIDED !!!
```python
import pgeocode
pgeocode.Nominatim('ca')
geolocator = pgeocode.Nominatim('ca')
postal_codes = df['PostalCode'].tolist()
latitudes = []
longitudes = []
for i, postal_code in enumerate(postal_codes):
    # initialize your variable to None
    #print(f'--Getting Postal Code: {postal_code}')
    g = geolocator.query_postal_code(postal_code)
    
    if not g.empty:
        #print(f'Postal Code {postal_code} has been retrieved. {len(postal_codes) - (i + 1)} codes left')
        latitudes.append(g.latitude)
        longitudes.append(g.longitude)
df['Latitudes'] = latitudes
df['Longitudes'] = longitudes
```

We will be using the provided csv file with latitude and longitude data

In [16]:
#!wget -O geospatial.csv "http://cocl.us/Geospatial_data"

In [14]:
geo_spa_df = pd.read_csv('geospatial.csv')

In [15]:
geo_spa_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merge two tables to create a combined tables with borough, neighborhood and lat long data

In [26]:
# The col name is different so change it to match
geo_spa_df.rename(columns = {"Postal Code":"PostalCode"}, inplace = True)
new_df = df.merge(geo_spa_df, on = "PostalCode", how="inner")

In [24]:
new_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


----------------------

### 3. Exploration of neighborhood clusters in Toronto

This is the last part of the assignment where we will use the data frame created up until now for cluster analysis. We will use foursquare api to explore the neighborhood and segment them, and use folium to visualisation

<b>FourSquare API credentials<b>

In [73]:
CLIENT_ID = 'NQNDVRYUJJPXJPI4X13F5ZL3XRX4NRNHLTOEC1GATLBCLPG0' # your Foursquare ID
CLIENT_SECRET = '5E10NP0GXAWXZPOOQHEJVSFQ4JE1KULDDYROQGHS2YFXDJCP' # your Foursquare Secret
ACCESS_TOKEN = 'ZOSS1PTM5BIFOGBDSGRKGCK02BRTT5F43XJCIVEUKBP3ZFCN' # your FourSquare Access Token
VERSION = '20180604'
LIMIT = 100
radius = 500
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: NQNDVRYUJJPXJPI4X13F5ZL3XRX4NRNHLTOEC1GATLBCLPG0
CLIENT_SECRET:5E10NP0GXAWXZPOOQHEJVSFQ4JE1KULDDYROQGHS2YFXDJCP


In [44]:
print('Toronto has {} boroughs and {} neighborhoods.'.format(
        len(df['Borough'].unique()),
        df.shape[0]
    )
)

Toronto has 15 boroughs and 103 neighborhoods.


<b> Create a map of Toronto with neighborhoods.<b>

In [42]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="tr_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of New York City are 43.6534817, -79.3839347.


<b> Create a folium map of Toronto with boroughs and neighborhood<b>

In [45]:
#create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location = [latitude, longitude], zoom_start = 10)

#add markers
for lat,lng, borough, neighborhood in zip(new_df['Latitude'],new_df['Longitude'],new_df['Borough'],new_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat,lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.6,
        parse_html = False).add_to(map_toronto)
map_toronto

### To make it simpler lets extract only the boroughs with Toronto in the name.

In [128]:
toronto_df = new_df.loc[new_df['Borough'].str.contains("Toronto")]
toronto_df.reset_index(drop = True, inplace=True)
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


In [130]:
toronto_df.loc[1,'Neighborhood']

'Garden District, Ryerson'

### Explore all the neighborhodds in boroughs which has Toronto in the name

#### Helper function to extract category of the venue -- from foursquare lab

In [113]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Functino to iterate through each neighborhood and search upto desired distance for 100 most popular venues

In [198]:
def getNearbyVenues(names, latitudes, longitudes, radius = 500 ):
    venues_list = []
    for name, lat, long in zip(names, latitudes, longitudes):
        
        #set the url with foursquare credentials
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,CLIENT_SECRET,VERSION,lat,long,radius,LIMIT)
        
        
        #Get data as json with a GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        #iterates through each item. 
        #if you wish to iterate via for loop, then use results[i]['venue'].... 
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    #reshape the list for dataframe and store into as a panda df
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood latitude',
                             'Neighbor longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return (nearby_venues)


Now lets pass our toronto_df data for neighborhoods with toronto in the name to this function to return venues and co-ordinates

In [199]:
toronto_venues = getNearbyVenues(toronto_df['Neighborhood'],
                                 toronto_df['Latitude'],
                                 toronto_df['Longitude'],
                                radius = 500)

In [201]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood latitude,Neighbor longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65426,-79.520999,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65426,-79.520999,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65426,-79.520999,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,"Regent Park, Harbourfront",43.65426,-79.520999,Impact Kitchen,43.656369,-79.35698,Restaurant
4,"Regent Park, Harbourfront",43.65426,-79.520999,Body Blitz Spa East,43.654735,-79.359874,Spa
