# Data Science Capstone

Notebook for the *Data Science Capstone Course*, part of the <b>Data Science Professional Certificate</b> on Coursera, offered by IBM. <br>
<b>Notebook Owner:</b> Adriana Cortés Buelvas

In [1]:
import folium
import requests
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from geopy.geocoders import Nominatim 

In [2]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


## Segmenting and Clustering Neighborhoods in Toronto

### Exploring and Scraping : Assignment Part 1


In this first part of the Capstone Project I am going to explore and cluster some information about Toronto neighborhoods. For this, I will need *'to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe'*, as indicated in the course.

To do that, I am going to read the tables on that page with the pandas function `pd.read_html`. Since the web page holds more than one table, `read_html` returns a list of DataFrames, and I select the first one with the index `[0]`. <br>
Then, I will drop all the rows that have a value of 'Not assigned' in the **Borough** column and, where **Neighborhood** is 'Not assigned', give it the value of the row's Borough.

In [3]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
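# Keep only rows with an assigned borough and rebuild a clean 0..n-1 index
df = df[df.Borough != 'Not assigned'].reset_index(drop=True)
# Where the neighborhood is unassigned, reuse the borough name
df.loc[df.Neighborhood == 'Not assigned', 'Neighborhood'] = df.Borough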
df.head(11)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


From the DataFrame above, notice that Postal Codes that correspond to more than one neighbourhood list all of them in the same row, separated by commas. For example, look at M6A: in the column **Neighborhood** it has Lawrence Manor and Lawrence Heights.

Finally, I print the shape to know the number of rows that the DataFrame has.

In [4]:
df.shape

(103, 3)
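
The cleaning rule from this part can be checked on a small hand-made table. The sketch below is only an illustration: the four rows are invented (they are not taken from the Wikipedia page), and the variable names `toy`, `cleaned`, and `missing` are assumptions of this example.

```python
import pandas as pd

# Hypothetical miniature of the postal code table, invented for illustration
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Regent Park, Harbourfront", "Not assigned"],
})

# 1. Drop rows whose borough is unknown
cleaned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# 2. Where the neighborhood is unknown, reuse the borough name
missing = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[missing, "Neighborhood"] = cleaned.loc[missing, "Borough"]

print(cleaned)
```

After both steps no 'Not assigned' value should remain in either column, which is a quick check that can also be run on the real `df`.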

### Latitude and Longitude : Assignment Part 2


Now, for the data to be usable with the Foursquare API, I need the coordinates of each Postal Code. For this, I am going to use the data in a CSV file that Coursera provided. <br>

In [5]:
coordinates = pd.read_csv('https://cocl.us/Geospatial_data')
coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


After exploring the CSV file, I will merge this information with the previous dataset so we can obtain all of the information in a single DataFrame

In [6]:
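# Join on the shared key explicitly so the merge does not depend on column-name inference
df_coordinates = pd.merge(df, coordinates, on='Postal Code', how='left')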
df_coordinates.head(11)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


Notice that the DataFrame is not printed in the same order as the example given in the Assignment Instructions. However, I ran a test to verify that the information is accurate. The following code compares the information for the M4M postal code; you can verify it with any postal code in the example.

In [7]:
print(df_coordinates.loc[df_coordinates['Postal Code'] == 'M4M'])

   Postal Code       Borough     Neighborhood   Latitude  Longitude
54         M4M  East Toronto  Studio District  43.659526 -79.340923


Go ahead and try different Postal Codes in the code above to verify the coordinates against the DataFrame shown in the course example: <br>
<img src="https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/HZ3jNHNOEeiMwApe4i-fLg_f44f0f10ccfaf42fcbdba9813364e173_Screen-Shot-2018-06-18-at-7.18.16-PM.png?expiry=1592352000000&hmac=KBUZNSrBuMzrJCGlMGWJQSlzSjzMBJutNI7OzulUn18"
     alt="Course Data frame Example"
     style="float: left; margin-right: 6px;" />
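
The spot check on a single postal code can be extended to every row. As a minimal sketch, assuming small stand-in tables (`areas` and `coords` are hypothetical names, and the code "M9Z" is invented so that one row has no match), a left merge with `indicator=True` flags the rows that found no coordinates, and `validate="one_to_one"` makes pandas raise an error if a postal code is duplicated on either side.

```python
import pandas as pd

# Stand-in tables for illustration; "M9Z" is deliberately absent from coords
areas = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Etobicoke"],
})
coords = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

merged = pd.merge(areas, coords, on="Postal Code", how="left",
                  validate="one_to_one", indicator=True)

# Rows flagged "left_only" found no coordinates in the second table
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

On the real data, an empty `unmatched` list would suggest that every postal code received a latitude and longitude.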

### Explore and Cluster Analysis : Assignment Part 3

First, I am going to generate a map of Toronto's Boroughs and Neighborhoods with `folium`, using the DataFrame generated before. For that, I will need to get the coordinates of Toronto using `geopy` with the user_agent named *toronto_explorer*.

In [8]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
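
One detail worth guarding against: geopy's `geocode` returns `None` when it finds no match, so reading `.latitude` directly would fail with an `AttributeError`. The helper below is a sketch under that assumption. `get_coordinates`, `FakeLocation`, and `fake_geocode` are hypothetical names for this example, and the fake geocoder exists only so the code runs without a network call.

```python
from typing import Callable, Optional, Tuple


def get_coordinates(geocode: Callable[[str], Optional[object]],
                    address: str) -> Tuple[float, float]:
    """Return (latitude, longitude) for an address, or raise if it is not found."""
    location = geocode(address)
    if location is None:
        raise ValueError(f"No geocoding result for {address!r}")
    return location.latitude, location.longitude


# Stand-in geocoder so the sketch runs offline; with geopy you would pass
# Nominatim(user_agent="toronto_explorer").geocode instead.
class FakeLocation:
    latitude = 43.6534817
    longitude = -79.3839347


def fake_geocode(address: str):
    return FakeLocation() if address == "Toronto, CA" else None


print(get_coordinates(fake_geocode, "Toronto, CA"))
```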


Now, let's see the map.

In [9]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_coordinates['Latitude'], df_coordinates['Longitude'], df_coordinates['Borough'], df_coordinates['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Toronto is a pretty big city. Let's say we want to segment and analyse only the neighborhoods in the **North York** Borough. For that, let's first generate a Dataframe with the data we're interested in. 

In [10]:
NorthYork_data = df_coordinates[df_coordinates['Borough'] == 'North York'].reset_index(drop=True)
NorthYork_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
3,M3B,North York,Don Mills,43.745906,-79.352188
4,M6B,North York,Glencairn,43.709577,-79.445073


Now, I will obtain the coordinates for North York and display its map with markers on each of its postal codes. 

In [11]:
address = 'North York, CA'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of North York are {}, {}.'.format(latitude, longitude))

The geographical coordinates of North York are 43.7543263, -79.44911696639593.


In [12]:
map_NorthYork = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(NorthYork_data['Latitude'], NorthYork_data['Longitude'], NorthYork_data['Borough'], NorthYork_data['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_NorthYork)  
    
map_NorthYork

After this, I can start using the Foursquare API to explore these neighborhoods in North York and segment the data. To do that, I need to define my Foursquare credentials and the API version.

In [13]:
CLIENT_ID = '35VPWLPU40NOKIXO5LTPEIT1AIRAKREX1Y5CAE3MROUMG50F' 
CLIENT_SECRET = 'JZQH5WVUWHLEWAGKB3EEJBX0V0R243L3DN5XDLTNCKWZEIA1'
VERSION = '20180605' # Foursquare API version
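
An alternative to writing the credentials in the notebook is to read them from environment variables, which keeps them out of shared copies. This is only a sketch: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions of this example, not something Foursquare or the course prescribes.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the client ID and secret from a mapping such as os.environ."""
    try:
        # Hypothetical variable names; use whatever names you export in your shell
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Environment variable {missing} is not set") from None


# Demonstration with a plain dict standing in for os.environ
demo_env = {"FOURSQUARE_CLIENT_ID": "my-id", "FOURSQUARE_CLIENT_SECRET": "my-secret"}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print(demo_id)
```

In the notebook itself, calling `load_foursquare_credentials()` with no argument would read the real environment.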

Let's say I want to explore up to 100 venues within a radius of 500 meters of each neighborhood (the Foursquare `radius` parameter is expressed in meters). We need to define these values so they can be used in the analysis.

In [14]:
LIMIT = 100
radius = 500

Now, I am going to use the function defined in the course's New York lab to get all the nearby venues of each neighborhood in North York. The aim is to later create a DataFrame with this information so it can be analyzed more easily.

In [15]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
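
The function above indexes straight into the JSON, so a response without groups, or a venue without categories, would raise an error. The sketch below shows one way to flatten a payload more defensively. It assumes only the keys that `getNearbyVenues` already reads. `extract_venue_rows` is a hypothetical helper, and `sample` is a hand-written payload, not a real API response (its second venue is invented).

```python
def extract_venue_rows(payload, name, lat, lng):
    """Flatten an explore-style payload into row tuples.

    Missing groups yield no rows; a venue without categories gets None.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hand-written payload that mimics the keys used above (not real API output)
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Brookbanks Park",
               "location": {"lat": 43.751976, "lng": -79.33214},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Uncategorized Place",
               "location": {"lat": 43.7530, "lng": -79.3300},
               "categories": []}},
]}]}}

rows = extract_venue_rows(sample, "Parkwoods", 43.753259, -79.329656)
print(rows)
```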

The DataFrame where this information is going to be stored is called `NorthYork_venues`. To create it, I just need to call the function defined above on the neighborhoods in the North York Borough.

In [16]:
NorthYork_venues = getNearbyVenues(names=NorthYork_data['Neighborhood'],
                                   latitudes=NorthYork_data['Latitude'],
                                   longitudes=NorthYork_data['Longitude']
                                  )

Parkwoods
Victoria Village
Lawrence Manor, Lawrence Heights
Don Mills
Glencairn
Don Mills
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Fairview, Henry Farm, Oriole
Northwood Park, York University
Bayview Village
Downsview
York Mills, Silver Hills
Downsview
North Park, Maple Leaf Park, Upwood Park
Humber Summit
Willowdale, Newtonbrook
Downsview
Bedford Park, Lawrence Manor East
Humberlea, Emery
Willowdale, Willowdale East
Downsview
York Mills West
Willowdale, Willowdale West


Now, let's take a look into the DataFrame generated

In [17]:
print(NorthYork_venues.shape)
NorthYork_venues.head()

(241, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,GTA Restoration,43.753396,-79.333477,Fireworks Store
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
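
As a sanity check on the search radius, the distance between a neighborhood's coordinates and one of its venues can be computed with the haversine formula. The sketch below uses the Parkwoods and Brookbanks Park coordinates from the table above. `haversine_m` is a hypothetical helper name, and the Earth is treated as a sphere with a mean radius of 6,371 km, which is an approximation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Parkwoods (neighborhood) to Brookbanks Park (venue), values from the table above
distance = haversine_m(43.753259, -79.329656, 43.751976, -79.332140)
print(f"{distance:.0f} m")
```

The result should fall well under the 500 passed as `radius`, which is consistent with that parameter being in meters.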


Now, we can see how many venues each neighborhood has. For this, I did some processing so the data can be read more easily.

In [18]:
# Count venues per neighborhood and sort from most to fewest
Num_venues = (
    NorthYork_venues.groupby('Neighborhood')['Venue']
    .count()
    .rename('Number of Venues')
    .sort_values(ascending=False)
    .reset_index()
)

Num_venues

Unnamed: 0,Neighborhood,Number of Venues
0,"Fairview, Henry Farm, Oriole",62
1,"Willowdale, Willowdale East",33
2,Don Mills,28
3,"Bedford Park, Lawrence Manor East",24
4,"Bathurst Manor, Wilson Heights, Downsview North",20
5,Downsview,17
6,"Lawrence Manor, Lawrence Heights",11
7,"Willowdale, Willowdale West",6
8,Glencairn,6
9,"Northwood Park, York University",6


And finally, let's see how many unique categories there are in the North York Borough

In [19]:
print('There are {} unique categories.'.format(NorthYork_venues['Venue Category'].nunique()))

There are 104 unique categories.


And this is all :) <br>
If we wanted to go further we could also create a chart using `Matplotlib`.
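
A minimal sketch of such a chart, assuming the ten rows of venue counts shown earlier are typed in by hand (in the notebook the values would come from `Num_venues` instead). The `venue_counts` dictionary and the styling choices are assumptions of this example.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
from matplotlib import pyplot as plt

# Counts copied from the first ten rows of the venue table shown earlier
venue_counts = {
    "Fairview, Henry Farm, Oriole": 62,
    "Willowdale, Willowdale East": 33,
    "Don Mills": 28,
    "Bedford Park, Lawrence Manor East": 24,
    "Bathurst Manor, Wilson Heights, Downsview North": 20,
    "Downsview": 17,
    "Lawrence Manor, Lawrence Heights": 11,
    "Willowdale, Willowdale West": 6,
    "Glencairn": 6,
    "Northwood Park, York University": 6,
}

# Reverse the order so the largest bar is drawn at the top
names = list(venue_counts)[::-1]
values = [venue_counts[n] for n in names]

fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(names, values, color="#3186cc")
ax.set_xlabel("Number of venues")
ax.set_title("Venues per neighborhood in North York (first ten rows)")
fig.tight_layout()
```

A horizontal bar chart is used here because the neighborhood names are long and read more easily along the vertical axis.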