# Coursera Capstone Project
Anthony Suárez

This notebook contains my capstone project for the IBM Data Science Specialization.

## Week 1

In [1]:
import pandas as pd
import numpy as np
import requests
import bs4
from bs4 import BeautifulSoup

In [2]:
print("Hello Coursera Capstone Project!")

Hello Coursera Capstone Project!


## Week 3

### Collect data about Toronto neighborhoods

I will use Beautiful Soup to do web scraping and get data from Wikipedia.

In [3]:
page_url = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto#Lists_of_city-designated_neighbourhoods"
page_html = requests.get(page_url, timeout=10)

page_html

<Response [200]>

In [4]:
toronto_soup = BeautifulSoup(page_html.content, "html.parser")
# print(toronto_soup.prettify())

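We are interested in the "Multiple listing service districts and neighbourhoods" table. In the browser it carries the classes "wikitable sortable jquery-tablesorter", but "jquery-tablesorter" is added by JavaScript after the page loads, so the downloaded HTML is matched on "wikitable sortable" only.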

In [5]:
districts_table = toronto_soup.find("table", {"class": "wikitable sortable"})
# districts_table.__dict__

In [6]:
from io import StringIO

# Wrap the HTML in StringIO: recent pandas versions deprecate passing literal HTML strings
districts_df = pd.read_html(StringIO(str(districts_table)))[0]
districts_df.head()

Unnamed: 0,District Number,Neighbourhoods Included
0,C01,"Downtown, Harbourfront, Little Italy, Little P..."
1,C02,"The Annex, Yorkville, South Hill, Summerhill, ..."
2,C03,"Forest Hill South, Oakwood–Vaughan, Humewood–C..."
3,C04,"Bedford Park, Lawrence Manor, North Toronto, F..."
4,C06,"North York, Clanton Park, Bathurst Manor"


Now that we have parsed the table from Wikipedia, we need to split each district's comma-separated list into individual neighborhoods.

In [7]:
neighborhoods = []

for row in districts_df["Neighbourhoods Included"]:
    neighborhoods.extend(row.split(', '))

print(f'{len(neighborhoods)} neighborhoods found.')

225 neighborhoods found.
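
The split above assumes that every name is separated by exactly a comma and one space. As a minimal sketch of a more forgiving variant (not what the notebook ran), the helper below trims stray whitespace and drops blanks and repeated names. `split_neighbourhoods` and "Example Heights" are hypothetical; only the first sample row comes from the table output.

```python
def split_neighbourhoods(cells):
    """Split comma-separated cells into a flat list of unique, trimmed names."""
    names = (name.strip() for cell in cells for name in cell.split(","))
    # dict.fromkeys keeps first-seen order while dropping duplicates;
    # the `if name` filter removes empty strings left by trailing commas
    return list(dict.fromkeys(name for name in names if name))


sample_cells = [
    "North York, Clanton Park, Bathurst Manor",  # row 4 of the table above
    "Clanton Park,Example Heights ,",            # made-up row with messy spacing
]
print(split_neighbourhoods(sample_cells))
```

Because duplicates are removed here, the count could differ from the 225 reported above if a name appears in more than one district.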


Now we have a list of 225 individual neighborhoods in Toronto. Because almost every one of them has a Wikipedia page under its own name, we can use those pages to extract the coordinates of each neighborhood.

In [8]:
neighborhoods_df = pd.DataFrame(neighborhoods, columns=['Neighborhood'])
neighborhoods_df.head()

Unnamed: 0,Neighborhood
0,Downtown
1,Harbourfront
2,Little Italy
3,Little Portugal
4,Dufferin Grove


In [9]:
def find_wiki_coords(page):
    possible_titles = [
        page.replace(' ', '_') + "_Toronto",
        page.replace(' ', '_') + ",_Toronto",
        "Toronto_" + page.replace(' ', '_') ,
        "Toronto,_" + page.replace(' ', '_') ,
        page.replace(' ', '_')
    ]
    
    possible_urls = []
    for title in possible_titles:
        possible_urls.append("https://en.wikipedia.org/wiki/" + title)
    
    for url in possible_urls:
        wiki_page = requests.get(url, timeout=10)
    
        if wiki_page.status_code == 200:
            soup = BeautifulSoup(wiki_page.content, "html.parser")
            latitude = soup.find("span", {"class": "latitude"})
            longitude = soup.find("span", {"class": "longitude"})

            if latitude and longitude:
                return [latitude.text, longitude.text]
            
    return None
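
The title guesses can be inspected without any network traffic if they are built by a separate helper. The sketch below reproduces the same five guesses in the same order and percent-encodes them with the standard library. `candidate_wiki_urls` and `WIKI_BASE` are hypothetical names, and this is only one way to factor the function.

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"


def candidate_wiki_urls(name):
    """Build the five title guesses used by find_wiki_coords, percent-encoded."""
    slug = name.replace(" ", "_")
    titles = [
        slug + "_Toronto",
        slug + ",_Toronto",
        "Toronto_" + slug,
        "Toronto,_" + slug,
        slug,  # bare name last: it is the guess most likely to hit another place
    ]
    # quote() escapes characters such as the en dash; commas are left readable
    return [WIKI_BASE + quote(title, safe=",") for title in titles]


urls = candidate_wiki_urls("Oakwood–Vaughan")
for url in urls:
    print(url)
```

Keeping the bare name as the last guess matters, because the loop returns the first page that exposes coordinates.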

The following code will find the coordinates of each neighborhood in Toronto. It takes a bit of time to run, so the resulting dataframe was saved in a .csv file.

```python
latitudes = []
longitudes = []

for neighborhood in neighborhoods_df["Neighborhood"]:
    coords = find_wiki_coords(neighborhood)
    
    if coords:
        latitudes.append(coords[0])
        longitudes.append(coords[1])
    else:
        latitudes.append(None)
        longitudes.append(None)

neighborhoods_df["Latitude"] = latitudes
neighborhoods_df["Longitude"] = longitudes
neighborhoods_df.to_csv("data/toronto_neighborhoods_coords.csv", index=False)

neighborhoods_df.head()
```
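
The save-then-reload pattern used here can be wrapped in a small helper so the slow step runs only when the file is missing. The sketch below is an illustration with the standard library only; `load_or_build`, `slow_scrape`, and the temporary path are hypothetical and not part of the notebook.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["Neighborhood", "Latitude", "Longitude"]


def load_or_build(csv_path, build_rows, fieldnames):
    """Build the rows once, cache them as CSV, and read the cache afterwards."""
    csv_path = Path(csv_path)
    if not csv_path.exists():
        # Slow path: run the expensive step and write its result to disk
        csv_path.parent.mkdir(parents=True, exist_ok=True)
        with csv_path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(build_rows())
    # Always return what is stored on disk, so both paths yield the same types
    with csv_path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


calls = []


def slow_scrape():
    """Hypothetical stand-in for the scraping loop; records each invocation."""
    calls.append(1)
    return [{"Neighborhood": "Harbourfront",
             "Latitude": "43°38′17″N",
             "Longitude": "79°23′06″W"}]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "data" / "coords.csv"
    first = load_or_build(cache, slow_scrape, FIELDS)
    second = load_or_build(cache, slow_scrape, FIELDS)

print(first == second, len(calls))
```

The helper also creates the parent directory, which a plain write to `data/...` does not do on its own.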

In [10]:
neighborhoods_df = pd.read_csv("data/toronto_neighborhoods_coords.csv")
neighborhoods_df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Downtown,43°39′9.01″N,79°23′0.81″W
1,Harbourfront,43°38′17″N,79°23′06″W
2,Little Italy,43°39′18″N,79°24′47″W
3,Little Portugal,43°39′00″N,79°26′08″W
4,Dufferin Grove,43°39′25″N,79°25′41″W


In [11]:
# For folium we need coordinates as decimals.

def dms_to_dd(coords_str):
    
    if isinstance(coords_str, str):
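        # Drop the last two characters: the closing unit mark and the hemisphere letter
        new_str = coords_str[:-2]
        delimiters = ["°", "′"]

        for delimiter in delimiters:
            new_str = new_str.replace(delimiter, ',')

        dms = new_str.split(',')
        # Pad with zeros so that missing minutes or seconds default to 0
        dms = dms + [0, 0, 0]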
        degrees = float(dms[0])
        minutes = float(dms[1])
        seconds = float(dms[2])

        decimal = degrees + (minutes / 60) + (seconds / 3600)
        return decimal
    return None
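
`dms_to_dd` discards the hemisphere letter, so the sign has to be fixed by hand afterwards. As a hedged alternative, the sketch below parses the same strings with a regular expression and applies the sign itself. `dms_to_signed_dd` is a hypothetical name, and the pattern assumes the `°`, `′`, and `″` marks seen in the scraped values.

```python
import re

# Degrees are required; minutes and seconds are optional; hemisphere is required
_DMS = re.compile(
    r"(?P<deg>\d+(?:\.\d+)?)°"
    r"(?:(?P<min>\d+(?:\.\d+)?)′)?"
    r"(?:(?P<sec>\d+(?:\.\d+)?)″)?"
    r"(?P<hemi>[NSEW])"
)


def dms_to_signed_dd(text):
    """Convert e.g. '79°23′06″W' to signed decimal degrees, or None if unparseable."""
    if not isinstance(text, str):
        return None  # covers the NaN values pandas uses for missing cells
    match = _DMS.fullmatch(text.strip())
    if match is None:
        return None
    degrees = float(match["deg"])
    minutes = float(match["min"] or 0)
    seconds = float(match["sec"] or 0)
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative by convention
    return -value if match["hemi"] in "SW" else value


print(dms_to_signed_dd("43°38′17″N"), dms_to_signed_dd("79°23′06″W"))
```

With this variant, strings in an unexpected format return `None` instead of raising, and the separate multiplication by -1 is not needed.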

In [12]:
decimal_latitudes = []
decimal_longitudes = []

for latitude in neighborhoods_df['Latitude']:
    decimal_latitudes.append(dms_to_dd(latitude))
    
for longitude in neighborhoods_df['Longitude']:
    decimal_longitudes.append(dms_to_dd(longitude))
    
neighborhoods_df['Latitude'] = decimal_latitudes
neighborhoods_df['Longitude'] = decimal_longitudes
neighborhoods_df['Longitude'] = neighborhoods_df['Longitude'] * -1 # dms_to_dd ignores the hemisphere letter; Toronto is west of the prime meridian, so its longitudes are negative

neighborhoods_df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Downtown,43.652503,-79.383558
1,Harbourfront,43.638056,-79.385
2,Little Italy,43.655,-79.413056
3,Little Portugal,43.65,-79.435556
4,Dufferin Grove,43.656944,-79.428056


In [13]:
print(neighborhoods_df.shape)

(225, 3)


In [14]:
# Drop neighborhoods whose coordinates could not be found (NaN)
neighborhoods_df = neighborhoods_df.dropna(axis=0)
print(neighborhoods_df.shape)

(163, 3)


### Visualize Toronto Neighborhoods

In [15]:
# !pip install folium
import folium

In [16]:
toronto_coords = [43.651070, -79.347015]
toronto_map = folium.Map(location=toronto_coords,
                         tiles='Stamen Toner',
                         zoom_start=10.5)

for i, row in neighborhoods_df.iterrows():
    marker = folium.CircleMarker(
        location=[row.Latitude, row.Longitude],
        popup=row.Neighborhood,
        color='crimson',
        radius=5,
        fill=True
    ).add_to(toronto_map)

toronto_map

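If we zoom out on the map, we can see that some neighborhood coordinates are wrong. This is probably a side effect of how the coordinates were scraped: `find_wiki_coords` falls back to the bare neighborhood name, which can resolve to a Wikipedia page about a different place with the same name. I will remove those neighborhoods manually.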

In [17]:
neighborhoods_df[neighborhoods_df['Neighborhood'] == 'Hunt Club'].index

Int64Index([126], dtype='int64')

In [18]:
wrong_neighborhoods = ['Westmount', 'Adelaide']

# Keep only the rows whose name is not in the list of bad matches
neighborhoods_df = neighborhoods_df[~neighborhoods_df['Neighborhood'].isin(wrong_neighborhoods)]

neighborhoods_df.shape

(161, 3)
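
As an alternative to spotting bad markers by eye, misplaced points could be flagged by their distance from the city centre. The sketch below uses the haversine formula with the standard library only. The 50 km threshold, the function names, and the "Hypothetical outlier" point are assumptions for illustration; they are not the values behind the two removals above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
TORONTO_CENTER = (43.651070, -79.347015)  # same centre as toronto_coords


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def far_from_toronto(points, max_km=50):
    """Return the names whose coordinates lie more than max_km from the centre."""
    return [name for name, coords in points.items()
            if haversine_km(TORONTO_CENTER, coords) > max_km]


points = {
    "Downtown": (43.652503, -79.383558),    # value from the table above
    "Hypothetical outlier": (45.0, -75.0),  # made-up point far from Toronto
}
print(far_from_toronto(points))
```

A distance check only catches points outside the city; a wrong match that happens to fall inside Toronto would still need manual review.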

In [19]:
toronto_coords = [43.651070, -79.347015]
toronto_map = folium.Map(location=toronto_coords,
                         tiles='Stamen Toner',
                         zoom_start=10.5)

for i, row in neighborhoods_df.iterrows():
    marker = folium.CircleMarker(
        location=[row.Latitude, row.Longitude],
        popup=row.Neighborhood,
        color='crimson',
        radius=5,
        fill=True
    ).add_to(toronto_map)

toronto_map

### Explore venues with Foursquare API

In [20]:
from dotenv import load_dotenv # Use pip to install python-dotenv package
import os

In [21]:
load_dotenv()

# If you are going to run this notebook, please add your own Foursquare credentials in a .env file.
CLIENT_ID = os.getenv('CLIENT_ID')
CLIENT_SECRET = os.getenv('CLIENT_SECRET')

In [22]:
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
radius = 500

As a test, I'll explore the first neighborhood in the dataframe.

In [23]:
first_row = neighborhoods_df.iloc[0] # first row by position, whatever its index label
test_neighborhood = first_row['Neighborhood']
test_lat = first_row['Latitude']
test_long = first_row['Longitude']

In [24]:
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    test_lat, 
    test_long, 
    radius, 
    LIMIT)

results = requests.get(url, timeout=10)
results = results.json()
# results
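
The query string can also be assembled from a dictionary, which avoids counting `{}` placeholders and escapes the values. The sketch below uses `urllib.parse.urlencode` with the same endpoint and parameter names as the cell above. `build_explore_url` and the `DEMO_*` credentials are hypothetical.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, lat, lng,
                      version="20180605", radius=500, limit=100):
    """Build the explore URL with the same parameters the notebook formats by hand."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in "ll" readable instead of encoding it as %2C
    return EXPLORE_ENDPOINT + "?" + urlencode(params, safe=",")


demo_url = build_explore_url("DEMO_ID", "DEMO_SECRET", 43.652503, -79.383558)
print(demo_url)
```

Passing the same dictionary as `params=` to `requests.get` should give an equivalent request without building the string manually.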

In [25]:
# Function that extracts the category of the venue. By Coursera.
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [26]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Downtown Toronto,Neighborhood,43.653232,-79.385296
1,Nathan Phillips Square,Plaza,43.65227,-79.383516
2,Indigo,Bookstore,43.653515,-79.380696
3,Four Seasons Centre for the Performing Arts,Concert Hall,43.650592,-79.385806
4,The Keg Steakhouse + Bar - York Street,Restaurant,43.649987,-79.384103


It works. Now it's time to repeat the process for all Toronto neighborhoods. The following function is also taken from the Coursera IBM Data Science course:

In [27]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request (with a timeout so one stalled call cannot hang the loop)
        results = requests.get(url, timeout=10).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
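
`getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for a venue that has no category. The sketch below shows one way to build a row defensively. `venue_row` is a hypothetical helper, and the second sample item is made up; both items only mirror the keys the function above reads.

```python
def venue_row(name, lat, lng, item):
    """Flatten one explore item into a tuple, tolerating a missing category."""
    venue = item["venue"]
    categories = venue.get("categories") or []
    category = categories[0]["name"] if categories else None
    return (
        name,
        lat,
        lng,
        venue["name"],
        venue["location"]["lat"],
        venue["location"]["lng"],
        category,
    )


sample_items = [
    {"venue": {"name": "Indigo",
               "location": {"lat": 43.653515, "lng": -79.380696},
               "categories": [{"name": "Bookstore"}]}},
    {"venue": {"name": "Uncategorised place",  # made-up venue without a category
               "location": {"lat": 43.65, "lng": -79.38},
               "categories": []}},
]
rows = [venue_row("Downtown", 43.652503, -79.383558, item) for item in sample_items]
for row in rows:
    print(row)
```

Rows with a `None` category can then be dropped or kept explicitly before clustering, rather than stopping the whole download.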

As the next piece of code takes some time to run, I exported the results to a .csv file.

```python
neighborhoods_venues = getNearbyVenues(names=neighborhoods_df['Neighborhood'],
                                      latitudes=neighborhoods_df['Latitude'],
                                      longitudes=neighborhoods_df['Longitude'])

neighborhoods_venues.to_csv('data/toronto_neighborhoods_venues.csv', index=False)
```

In [28]:
neighborhoods_venues = pd.read_csv('data/toronto_neighborhoods_venues.csv')
neighborhoods_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Downtown,43.652503,-79.383558,Downtown Toronto,43.653232,-79.385296,Neighborhood
1,Downtown,43.652503,-79.383558,Nathan Phillips Square,43.65227,-79.383516,Plaza
2,Downtown,43.652503,-79.383558,Indigo,43.653515,-79.380696,Bookstore
3,Downtown,43.652503,-79.383558,Four Seasons Centre for the Performing Arts,43.650592,-79.385806,Concert Hall
4,Downtown,43.652503,-79.383558,The Keg Steakhouse + Bar - York Street,43.649987,-79.384103,Restaurant


In [29]:
neighborhoods_venues['Venue Category'].unique().shape

(307,)

Perfect! Now we have used Foursquare data to obtain information about venues close to each neighborhood in Toronto. The next step is processing the obtained data to feed it into our clustering algorithm.

### Data wrangling