# Segmenting and Clustering Neighborhoods in Toronto

## To do this, I will break the process down into three parts:
1. Scrape the [Canadian Postal Code Wiki page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) to get the table containing postal code, borough, and neighborhood information for the city of Toronto
2. Obtain coordinates for each of the postal codes using the Geocoder package
3. Explore and cluster the Toronto neighborhoods for analysis

## Part 1
To get some practice web-scraping I'm going to obtain the postal code table with the BeautifulSoup package.

In [1]:
# Import BeautifulSoup4
from bs4 import BeautifulSoup

# Import Requests library so that we can feed the document behind the url to the BeautifulSoup constructor
import requests

# Get the html for soup
text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning
soup = BeautifulSoup(text, 'html.parser')

postalCodeTable = soup.find('table')

In [2]:
# Get the table headings for later
headings = []
for th in postalCodeTable.find('tr').find_all('th'):
    headings.append(th.text.replace('\n', ' ').strip())

# loop through each table row 'tr' and get table data 'td'
# store this data in an array and append the row data to a larger array of rows
arrayOfRows = []
for tr in postalCodeTable.find_all('tr'):
    temp_row = []
    for td in tr.find_all('td'):
        temp_row.append(td.text.replace('\n', ' ').strip())
    arrayOfRows.append(temp_row)

# Remove the empty first row (the header row has 'th' cells but no 'td' cells)
del arrayOfRows[0]
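
To see why the first entry of `arrayOfRows` comes back empty, here is a minimal sketch on a hand-written table. The HTML string below is made up for illustration and is not taken from the Wikipedia page. The header row holds only `th` cells, so collecting `td` cells from it yields an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the real page
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th></tr>
  <tr><td>M3A
</td><td>North York
</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')

# Same approach as above: collect the cleaned 'td' text of every row
sample_rows = []
for tr in sample_table.find_all('tr'):
    sample_rows.append([td.text.replace('\n', ' ').strip() for td in tr.find_all('td')])

print(sample_rows)
```

The first list is empty because the header row contributes no `td` cells, which is the row that `del arrayOfRows[0]` drops.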

Now that we have the data scraped and placed into an array of rows, let's put it all together into a Pandas DataFrame object. To do this we will need to import some libraries.

In [3]:
import pandas as pd
import numpy as np

# Use the headings and row data to make a DataFrame object
df = pd.DataFrame(arrayOfRows, columns = headings)
df.head()

Unnamed: 0,Postal Code,District,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Now that we have the DataFrame object 'df', we have to clean it to make it look like the project description. This means changing the name of the 'District' column to 'Borough' and deleting any rows that have no borough (District) listed.

In [4]:
# Rename 'District' to 'Borough'
df.columns = ['Postal Code', 'Borough', 'Neighbourhood']

# Remove postal codes with no borough (District) assigned to them
df = df[df.Borough != 'Not assigned']

# Make sure that there are no unassigned neighbourhoods
print('There are {} postal codes with unassigned neighbourhoods.'.format(
    df[df.Neighbourhood == 'Not assigned'].shape[0]))

df.head()

There are 0 postal codes with unassigned neighbourhoods.


Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [5]:
# Print the number of rows in my Dataframe 
print('There are {} rows in my DataFrame of Toronto postal codes!'.format(
    df.shape[0]))

There are 103 rows in my DataFrame of Toronto postal codes!


## Part 2

For part 2 we have to obtain the latitude and longitude coordinates of each postal code so that we can make calls to the Foursquare API in the final part of this assignment.

To obtain these coordinates I was going to be using the geocoder package, but...

### Apparently the geocoder package can be "very unreliable"

I wrote the following code to obtain the coordinates using the geocoder package, but it loops indefinitely without returning any data (I also tried simpler examples, with no success).

```Python
# Import geocoder (install if you haven't)
# !pip install geocoder
import geocoder

latitudes = []  # empty list to hold latitude values
longitudes = []  # empty list to hold longitude values

for postal_code in df['Postal Code']:
    # initialize your variable to None
    lat_lng_coords = None

    # loop until you get the coordinates
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng

    latitudes.append(lat_lng_coords[0])
    longitudes.append(lat_lng_coords[1])
    
```
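
One way to avoid an endless loop like this is to cap the number of attempts. The sketch below is not tied to geocoder: `lookup` is a hypothetical callable standing in for whatever geocoding call is used, and `lookup_with_retries`, `max_attempts`, and `delay_seconds` are names I made up for illustration.

```python
import time

def lookup_with_retries(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns something other than None,
    trying at most max_attempts times."""
    for attempt in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay_seconds)  # pause before the next attempt
    return None  # give up instead of looping forever


# Stand-in geocoder that fails twice before answering (illustrative values only)
calls = {'count': 0}

def flaky_lookup(query):
    calls['count'] += 1
    return [43.7533, -79.3297] if calls['count'] >= 3 else None

def always_failing_lookup(query):
    return None

print(lookup_with_retries(flaky_lookup, 'M3A, Toronto, Ontario'))
print(lookup_with_retries(always_failing_lookup, 'M3A, Toronto, Ontario'))
```

With a cap in place, a postal code that never resolves comes back as `None` and can be handled separately rather than stalling the whole run.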

The assignment page links a CSV file with the coordinate data in case this happens, so I will import it below.

In [6]:
# Read in the csv and look at how it is structured
coordinateData = pd.read_csv("https://cocl.us/Geospatial_data")
coordinateData.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [7]:
# Join the two DataFrames on the Postal Code values
df = df.join(coordinateData.set_index('Postal Code'), on='Postal Code')
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.753259,-79.329656
3,M4A,North York,Victoria Village,43.725882,-79.315572
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
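
Because `join` performs a left join by default, any postal code missing from the CSV would quietly end up with NaN coordinates. A quick check on toy data might look like the sketch below. The two frames are made up, and `M9Z` is a deliberately unmatched, hypothetical code.

```python
import pandas as pd

# Toy stand-ins for df and coordinateData
codes = pd.DataFrame({'Postal Code': ['M3A', 'M4A', 'M9Z'],
                      'Borough': ['North York', 'North York', 'Etobicoke']})
coords = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})

# Same join pattern as above
joined = codes.join(coords.set_index('Postal Code'), on='Postal Code')

# Rows whose postal code had no match in the coordinate table
missing = joined[joined['Latitude'].isna()]
print('{} postal code(s) without coordinates: {}'.format(
    missing.shape[0], list(missing['Postal Code'])))
```

Running the same `isna()` check on the real `df` would confirm that every postal code received coordinates before any API calls are made.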


## Part 3

Now that we have all of the coordinates for the postal codes in Toronto, we can get to analyzing and clustering them using data from the Foursquare API!

To visualize the boroughs and eventual clusters on a map we will use the folium package. We will also need some other packages to get this part done, so let's import them now.

In [8]:
# Importing the libraries
import folium
from sklearn.cluster import KMeans
import matplotlib

In [9]:
torontoMap = folium.Map(location = [43.6532, -79.3832], zoom_start = 12)
torontoMap

Now that we have a Map object focused on Toronto, let's add in the neighborhoods and boroughs that we found coordinates for earlier!

In [10]:
# Add a marker for each postal code, labelled with its neighbourhood(s) and borough
for lat, lng, borough, neighbourhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#ADD8E6',
        fill_opacity=0.6,
        parse_html=False).add_to(torontoMap)

torontoMap

Let's now use the Foursquare API to get some information about the venues near each postal code. We can then use that information to cluster and group the postal codes.

In [11]:
import os
import json
# json_normalize is available at the top level of pandas since version 1.0
from pandas import json_normalize

In [12]:
CLIENT_ID =  os.environ.get('FOURSQUARE_CLIENT_ID')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET')
VERSION = '20200826' 
LIMIT = 100 # Top 100 venues
RADIUS = 500 # Within a 500 meter radius of the coordinates
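
These settings end up as query parameters of each request. As a sketch of how they fit together, the snippet below builds the same kind of URL with the standard library's `urlencode` instead of string formatting. The credentials are placeholders, and `build_explore_url` is a helper name of my own.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, latitude, longitude, radius, limit):
    """Assemble a venues/explore URL from its query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(latitude, longitude),  # "lat,lng" pair
        'radius': radius,  # meters
        'limit': limit,    # maximum number of venues
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials, for illustration only
example_url = build_explore_url('PLACEHOLDER_ID', 'PLACEHOLDER_SECRET', '20200826',
                                43.753259, -79.329656, 500, 100)
print(example_url)
```

Note that `urlencode` percent-encodes the comma in `ll` as `%2C`, which is the standard encoding of the same value.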


Now that we have the API credentials and various request settings saved, let's create a function to get the venue information (namely the categories) for each postal code. We will make a call to the Foursquare API for each postal code and extract the category of each venue returned. The venue categories will be saved along with the postal code information so that we can cluster the postal codes based on the types of venues nearby.

In [13]:
def getNearbyVenues(postal_codes, boroughs, neighbourhoods, latitudes, longitudes):
    
    # Create an empty list to hold the venue categories and their respective postal code information
    venuesList = []
    
    # Loop through all of the postal codes to make requests about the venues surrounding each one
    for postal_code, borough, neighbourhood, latitude, longitude in zip(postal_codes, boroughs, neighbourhoods, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            latitude,
            longitude,
            RADIUS, # 500 meter radius
            LIMIT) # 100 venue limit

        results = requests.get(url).json()

        venues_information = results['response']['groups'][0]['items']

        # For every venue of that given postal code's result set, store the category and postal code information
        for venue in venues_information:
            venuesList.append([postal_code, borough, neighbourhood, latitude, longitude, venue['venue']['categories'][0]['name']])

    # Once every postal code has been processed, transform the list of venue categories
    # and their corresponding postal codes into a DataFrame (built once, outside the loop)
    nearbyVenues = pd.DataFrame(
        venuesList,
        columns=['Postal Code', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude', 'Venue Category'])

    return nearbyVenues
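
The indexing into `results` assumes the nested layout that the function relies on (`response`, then `groups`, then `items`, then `venue`, then `categories`). The payload below is hand-written to mimic that layout and is not a real API response. `extract_categories` is a hypothetical helper that also skips venues with an empty category list instead of raising an `IndexError`.

```python
def extract_categories(results):
    """Return the first category name of each venue in an explore-style payload."""
    groups = results.get('response', {}).get('groups', [])
    if not groups:
        return []
    names = []
    for item in groups[0].get('items', []):
        categories = item['venue'].get('categories', [])
        if categories:  # skip venues that carry no category
            names.append(categories[0]['name'])
    return names

# Hand-made payload for illustration only
fake_results = {
    'response': {
        'groups': [{
            'items': [
                {'venue': {'name': 'A', 'categories': [{'name': 'Park'}]}},
                {'venue': {'name': 'B', 'categories': []}},
                {'venue': {'name': 'C', 'categories': [{'name': 'Coffee Shop'}]}},
            ]
        }]
    }
}

print(extract_categories(fake_results))
print(extract_categories({'response': {}}))
```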

Now that we have our function created, let's run it with information from the DataFrame containing all of the postal codes! The function will return a DataFrame object that we will call 'nearbyVenues'.

In [14]:
nearbyVenues = getNearbyVenues(df['Postal Code'], df['Borough'], df['Neighbourhood'], df['Latitude'], df['Longitude'])

In [15]:
nearbyVenues

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Venue Category
0,M3A,North York,Parkwoods,43.753259,-79.329656,Park
1,M3A,North York,Parkwoods,43.753259,-79.329656,Food & Drink Shop
2,M4A,North York,Victoria Village,43.725882,-79.315572,Hockey Arena
3,M4A,North York,Victoria Village,43.725882,-79.315572,Coffee Shop
4,M4A,North York,Victoria Village,43.725882,-79.315572,Portuguese Restaurant
...,...,...,...,...,...,...
2131,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Grocery Store
2132,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Hardware Store
2133,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Flower Shop
2134,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Tanning Salon


Now we can move on and get ready for cluster analysis.
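
A common way to prepare this kind of table for clustering is to one-hot encode the venue categories and average them per postal code, which gives each postal code a vector of category frequencies. The sketch below uses a toy frame of my own and is not necessarily the approach taken in the rest of the notebook.

```python
import pandas as pd

# Toy stand-in for nearbyVenues
toy_venues = pd.DataFrame({
    'Postal Code': ['M3A', 'M3A', 'M4A', 'M4A', 'M4A', 'M4A'],
    'Venue Category': ['Park', 'Coffee Shop', 'Coffee Shop', 'Coffee Shop', 'Hockey Arena', 'Park'],
})

# One column per category, 1.0 where the venue belongs to it
one_hot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
one_hot.insert(0, 'Postal Code', toy_venues['Postal Code'])

# Mean of each column per postal code = share of venues in that category
category_frequencies = one_hot.groupby('Postal Code').mean()
print(category_frequencies)
```

A table like `category_frequencies` has one row per postal code and one numeric column per category, which is the shape that `KMeans` expects as input.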