# Toronto Neighborhoods Project

In this project, I will be exploring and clustering neighborhoods in Toronto. 

This project highlights my skills in:
- BeautifulSoup web scraping
- Dataframe manipulation with Pandas
- Mapping with Folium
- Collecting data from the Foursquare API


## Part 1: Preparing the Dataframe

First, I will scrape data from the following website to gather postal codes of various neighborhoods. 
- https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


In [8]:
# Imports

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
from IPython.display import display, HTML # Allows for aesthetic data frame displays

print("Imports complete.")

Imports complete.


In [9]:
# Scrape data from the Wikipedia page

page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
page.raise_for_status() # Stop early if the request failed
soup = bs(page.text, 'lxml')
table = soup.find("table", class_ = "wikitable")

In [10]:
# Transforming data into a dataframe

A = []
B = []
C = []

# In HTML, <tr> denotes a row in a table and <td> denotes each element
# These tags are used to isolate the data from the table

for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 3: # Isolates rows from the body
        A.append(cells[0].find(string=True))
        B.append(cells[1].find(string=True))
        C.append(cells[2].find(string=True))
        
df = pd.DataFrame(A, columns = ['Postal Code'])
df['Borough'] = B
df['Neighborhood'] = C



print(df.shape)
display(df.head(10))


(288, 3)


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
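
The row-by-row extraction above can be tried on a small inline table without a network request. The sketch below is only an illustration: `sample_html` is a made-up snippet (the real Wikipedia markup may differ), and it uses `get_text(strip=True)`, which also trims the trailing newline that the scraped cells otherwise carry.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; the real page markup may differ.
sample_html = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")  # built-in parser, no lxml needed
table = soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 3:  # the header row uses <th>, so it is skipped
        rows.append([cell.get_text(strip=True) for cell in cells])

sample_df = pd.DataFrame(rows, columns=["Postal Code", "Borough", "Neighborhood"])
print(sample_df)
```

Collecting each row as a list keeps the three columns together, which avoids maintaining three parallel lists.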


Now, I will remove the rows where a borough is not assigned. I will also assign the borough name to the neighborhood if one is not assigned.

Additionally, more than one neighborhood can exist in one postal code area. These rows will be combined to show a list in the neighborhood column. 

In [11]:
# Remove rows where a borough is not assigned

df.drop(df[df.Borough == "Not assigned"].index, inplace = True)
print(df.shape)
display(df.head(10))

(211, 3)


Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


In [12]:
# Remove the \n that appears in some rows

df.replace('\n','', regex = True, inplace = True)


# Combine neighborhoods that share the same postal code

df2 = df.groupby(['Postal Code', 'Borough'])['Neighborhood'].apply(", ".join).reset_index()

print(df2.shape)
display(df2.head(10))

(103, 3)


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
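
To see what the grouping step does in isolation, here is a minimal sketch on three rows taken from the tables above. The name `toy` is only for illustration. Note that `groupby` sorts by the group keys by default, which is why the combined table starts at M1B rather than M3A.

```python
import pandas as pd

# Three rows copied from the displayed data; `toy` is an illustrative name.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M4A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Victoria Village"],
})

# One output row per (Postal Code, Borough); neighborhood names are joined with ", ".
combined = (
    toy.groupby(["Postal Code", "Borough"])["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```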


In [13]:
# Assign the borough name to the neighborhood for any rows with no assigned neighborhood

not_assigned = df2['Neighborhood'] == "Not assigned"
df2.loc[not_assigned, 'Neighborhood'] = df2.loc[not_assigned, 'Borough']
print(df2.shape)
display(df2.head(10))

(103, 3)


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
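
None of the first ten rows changes, because the only affected postal code seen so far is M7A (Queen's Park). As a hedged illustration of the same fallback, the sketch below uses `Series.where` on two rows taken from the earlier output; `toy` is an illustrative name.

```python
import pandas as pd

# Two rows from the earlier output: one with a missing neighborhood, one without.
toy = pd.DataFrame({
    "Postal Code": ["M7A", "M9A"],
    "Borough": ["Queen's Park", "Etobicoke"],
    "Neighborhood": ["Not assigned", "Islington Avenue"],
})

# where() keeps a value when the condition is True and otherwise takes the
# matching row from `other` (here, the Borough column).
toy["Neighborhood"] = toy["Neighborhood"].where(
    toy["Neighborhood"] != "Not assigned", toy["Borough"]
)
print(toy)
```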


The data frame is now complete and ready for future analysis.

**Here is the final data frame and shape:**

In [14]:
print(df2.shape)
display(df2)

(103, 3)


Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


Therefore, the final **shape** is 103 rows by 3 columns.

## Part 2: Adding Location Data

In this part, I will add latitude and longitude data from a provided CSV file. This will allow for future analysis of the region. 

In [15]:
# Load the coordinates for each postal code (pandas reads the URL directly)

csv_url = "http://cocl.us/Geospatial_data"
coordinates = pd.read_csv(csv_url)

print(coordinates.shape)
display(coordinates.head())

(103, 3)


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [16]:
# Merge the two dataframes

df_full = pd.merge(df2, coordinates, on = "Postal Code")
print(df_full.shape)
display(df_full.head())

(103, 5)


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
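
The merged shape (103, 5) shows that every postal code found a match. If that were not guaranteed, a left merge with `indicator=True` and `validate="one_to_one"` could flag problems. The sketch below is one possible check, not part of the original workflow; the postal code `M9Z` and borough `Example` are invented to force a mismatch.

```python
import pandas as pd

left = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],  # M9Z is a made-up code with no coordinates
    "Borough": ["Scarborough", "Scarborough", "Example"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises if either side has duplicate keys;
# indicator adds a "_merge" column that records where each row came from.
checked = pd.merge(left, right, on="Postal Code", how="left",
                   validate="one_to_one", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```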


## Part 3: Explore and Cluster

Now we can explore the dataset and generate maps to visualize the neighborhoods and how they cluster together. 

Let's explore **Central and Downtown Toronto.**

In [17]:

c_toronto = df_full[df_full['Borough'].str.contains("Central Toronto")].reset_index()
d_toronto = df_full[df_full['Borough'].str.contains("Downtown Toronto")].reset_index()
cd_toronto = pd.concat([c_toronto, d_toronto])
display(cd_toronto)

Unnamed: 0,index,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
1,45,M4P,Central Toronto,Davisville North,43.712751,-79.390197
2,46,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
3,47,M4S,Central Toronto,Davisville,43.704324,-79.38879
4,48,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
5,49,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049
6,63,M5N,Central Toronto,Roselawn,43.711695,-79.416936
7,64,M5P,Central Toronto,"Forest Hill North, Forest Hill West",43.696948,-79.411307
8,65,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678
0,50,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
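
Because the two filtered frames are concatenated after separate `reset_index()` calls, the row labels restart at 0 for the Downtown rows. One alternative, sketched here on a toy frame, is a single `isin` filter. It matches borough names exactly rather than by substring, and it keeps the rows in their original order instead of listing Central Toronto first.

```python
import pandas as pd

# Toy frame with borough and neighborhood names taken from the tables above.
toy = pd.DataFrame({
    "Borough": ["Scarborough", "Central Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Woburn", "Davisville", "Rosedale", "Parkwoods"],
})

wanted = ["Central Toronto", "Downtown Toronto"]
# drop=True discards the old labels, so no extra "index" column is created.
selected = toy[toy["Borough"].isin(wanted)].reset_index(drop=True)
print(selected)
```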


In [18]:
# Now we can drop the first two columns

drop_col = ['index', 'Postal Code']
cd_toronto_data = cd_toronto.drop(columns = drop_col)
display(cd_toronto_data)

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Central Toronto,Lawrence Park,43.72802,-79.38879
1,Central Toronto,Davisville North,43.712751,-79.390197
2,Central Toronto,North Toronto West,43.715383,-79.405678
3,Central Toronto,Davisville,43.704324,-79.38879
4,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
5,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049
6,Central Toronto,Roselawn,43.711695,-79.416936
7,Central Toronto,"Forest Hill North, Forest Hill West",43.696948,-79.411307
8,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678
0,Downtown Toronto,Rosedale,43.679563,-79.377529


Now we can plot these neighborhoods with Folium, centering the map on Toronto's coordinates. This will allow us to see these boroughs in relation to Toronto as a whole. 

In [19]:
import folium

In [30]:
# Toronto latitude and longitude
toronto_latitude = 43.7170226
toronto_longitude = -79.4197830350134


# Generating a map of Toronto
map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start = 10)

# Adding markers for the neighborhoods
for latitude, longitude, borough, neighborhood in zip(cd_toronto_data['Latitude'], cd_toronto_data['Longitude'], cd_toronto_data['Borough'], cd_toronto_data['Neighborhood']):
    label = "{}, {}".format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius = 5, 
        popup = label,
        color = "purple",
        fill = True,
        fill_color = "#8f52e4", 
        fill_opacity = 0.6).add_to(map_toronto)
    
map_toronto
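
The map center above is a fixed coordinate for Toronto. If the goal were to center on the plotted neighborhoods instead, the center could be derived from the data. This is a small sketch under that assumption, using three coordinate pairs from the table above; `points` and `center` are illustrative names.

```python
import pandas as pd

# Coordinates for Lawrence Park, Davisville North and Rosedale from the table above.
points = pd.DataFrame({
    "Latitude": [43.728020, 43.712751, 43.679563],
    "Longitude": [-79.388790, -79.390197, -79.377529],
})

# The mean of each column gives a simple center for the markers.
center = [points["Latitude"].mean(), points["Longitude"].mean()]
print(center)
```

The resulting list can be passed as the `location` argument of `folium.Map`.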

## Foursquare API Integration

Now we can pull in data from Foursquare. 

Let's look at the coffee shops in the region.

In [35]:
CLIENT_ID = "YOUR_FOURSQUARE_CLIENT_ID" # Replace with your own; do not publish real keys
CLIENT_SECRET = "YOUR_FOURSQUARE_CLIENT_SECRET"
VERSION = "20190608"
venue = "4bf58dd8d48988d196941735"
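
Credentials are safer outside the notebook. One common approach, sketched here, reads them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions for this example, not something Foursquare prescribes.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables.

    The variable names are hypothetical; use whatever names you export.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None

# Usage with an explicit mapping, so the sketch runs without real keys.
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print(demo_id)
```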

In [38]:
def getNearbyVenues(names, latitudes, longitudes, radius = 500):
    
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}".format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius)
            
        # make the GET request
        results = requests.get(url).json()["response"]["groups"][0]["items"]
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v["venue"]["name"], 
            v["venue"]["location"]["lat"], 
            v["venue"]["location"]["lng"],  
            v["venue"]['categories'][0]["name"]) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ["Neighborhood", 
                  "Neighborhood Latitude", 
                  "Neighborhood Longitude", 
                  "Venue", 
                  "Venue Latitude", 
                  "Venue Longitude", 
                  "Venue Category"]
    
    return nearby_venues
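
The function above indexes `categories[0]` directly, which would raise an `IndexError` for a venue with an empty category list. Below is a hedged sketch of the flattening step on its own, with a fallback of `None`. The helper `flatten_items` and the `sample_items` payload are hand-written for illustration and shaped like the fields the function reads; they are not a recorded API response.

```python
def flatten_items(name, lat, lng, items):
    """Turn explore-style items into flat tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written sample; the second venue is invented to exercise the fallback.
sample_items = [
    {"venue": {"name": "Sherwood Park",
               "location": {"lat": 43.716551, "lng": -79.387776},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unlabelled Venue",
               "location": {"lat": 43.71, "lng": -79.39},
               "categories": []}},
]
rows = flatten_items("Davisville North", 43.712751, -79.390197, sample_items)
print(rows)
```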

In [39]:
all_venues = getNearbyVenues(names = cd_toronto_data["Neighborhood"],
                            latitudes = cd_toronto_data['Latitude'],
                            longitudes = cd_toronto_data['Longitude'])

Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront, Regent Park
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie


In [40]:
print(all_venues.shape)
display(all_venues)

(585, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Lawrence Park,43.728020,-79.388790,Lawrence Park Ravine,43.726963,-79.394382,Park
1,Lawrence Park,43.728020,-79.388790,Zodiac Swim School,43.728532,-79.382860,Swim School
2,Lawrence Park,43.728020,-79.388790,TTC Bus #162 - Lawrence-Donway,43.728026,-79.382805,Bus Line
3,Davisville North,43.712751,-79.390197,Sherwood Park,43.716551,-79.387776,Park
4,Davisville North,43.712751,-79.390197,Summerhill Market North,43.715499,-79.392881,Food & Drink Shop
5,Davisville North,43.712751,-79.390197,Homeway Restaurant & Brunch,43.712641,-79.391557,Breakfast Spot
6,Davisville North,43.712751,-79.390197,Best Western Roehampton Hotel & Suites,43.708878,-79.390880,Hotel
7,Davisville North,43.712751,-79.390197,Circle K,43.712834,-79.391554,Grocery Store
8,Davisville North,43.712751,-79.390197,Subway,43.708378,-79.390473,Sandwich Place
9,Davisville North,43.712751,-79.390197,Gym,43.713126,-79.393537,Gym


In [48]:
cd_toronto_coffee = all_venues[all_venues['Venue Category'] == "Coffee Shop"]

display(cd_toronto_coffee)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
14,North Toronto West,43.715383,-79.405678,Starbucks,43.715456,-79.400303,Coffee Shop
20,North Toronto West,43.715383,-79.405678,Tim Hortons,43.714894,-79.399776,Coffee Shop
33,Davisville,43.704324,-79.38879,Starbucks,43.706084,-79.389355,Coffee Shop
47,Davisville,43.704324,-79.38879,Second Cup,43.704344,-79.388659,Coffee Shop
63,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049,Starbucks,43.687101,-79.398612,Coffee Shop
65,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049,Tim Hortons,43.687682,-79.39684,Coffee Shop
86,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,Creeds Coffee Bar,43.6741,-79.410838,Coffee Shop
98,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,Super Jet International Coffee Shop,43.674332,-79.409325,Coffee Shop
99,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,Tim Hortons,43.6758,-79.403532,Coffee Shop
130,"Cabbagetown, St. James Town",43.667967,-79.367675,Jetfuel Coffee,43.665295,-79.368335,Coffee Shop


Let's also group these coffee shops by the neighborhoods we are looking at.

In [52]:
coffee_counts = cd_toronto_coffee.groupby("Neighborhood").count()
drop_cols = ['Neighborhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
coffee_counts = coffee_counts.drop(columns = drop_cols)
# After count(), the remaining column holds the number of rows per neighborhood
coffee_counts.rename(columns = {"Neighborhood Latitude":"Number of Coffee Shops"},
                    inplace = True)
display(coffee_counts)

Neighborhood,Number of Coffee Shops
"Adelaide, King, Richmond",1
Berczy Park,1
"Cabbagetown, St. James Town",2
Central Bay Street,7
"Chinatown, Grange Park, Kensington Market",1
Christie,1
Church and Wellesley,1
"Commerce Court, Victoria Hotel",3
Davisville,2
"Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West",2

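The count-then-drop approach works, but the same tally can be produced more directly. The sketch below shows one option with `groupby(...).size()` on a toy frame. The neighborhood names come from the output above, while the `Example Cafe` row is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["Davisville", "Davisville", "Christie"],
    "Venue": ["Starbucks", "Second Cup", "Example Cafe"],  # last venue is made up
    "Venue Category": ["Coffee Shop"] * 3,
})

# size() counts rows per group, so no columns need to be dropped afterwards.
counts = (
    toy.groupby("Neighborhood").size()
       .rename("Number of Coffee Shops")
       .sort_values(ascending=False)
)
print(counts)
```

Sorting in descending order puts the neighborhoods with the most coffee shops first, which makes the busiest areas easy to spot.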