# Coursera Capstone Project
Anthony Suárez

This notebook contains my work on the Capstone project for the IBM Data Science Specialization.

## Week 1

In [1]:
import pandas as pd
import numpy as np
import requests
import bs4
from bs4 import BeautifulSoup

In [2]:
print("Hello Coursera Capstone Project!")

Hello Coursera Capstone Project!


## Week 2

### Collect data about Toronto neighborhoods

I will use Beautiful Soup to scrape the neighborhood data from Wikipedia.

In [3]:
page_url = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto#Lists_of_city-designated_neighbourhoods"
page_html = requests.get(page_url, timeout=10)

page_html

<Response [200]>

In [4]:
# Name the parser explicitly so Beautiful Soup does not have to guess one.
toronto_soup = BeautifulSoup(page_html.content, "html.parser")
print(toronto_soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of neighbourhoods in Toronto - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"72c0627f-1310-4b05-80d1-de6074d275cb","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_neighbourhoods_in_Toronto","wgTitle":"List of neighbourhoods in Toronto","wgCurRevisionId":989720641,"wgRevisionId":989720641,"wgArticleId":1150939,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","Articles with short description","Short description is 

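We are interested in the "Multiple listing service districts and neighbourhoods" table. In the browser it shows the classes "wikitable sortable jquery-tablesorter", but "jquery-tablesorter" is added by JavaScript after the page loads, so in the downloaded HTML the table only has the classes "wikitable sortable".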

In [5]:
districts_table = toronto_soup.find("table", {"class": "wikitable sortable"})
districts_table.__dict__

{'parser_class': bs4.BeautifulSoup,
 'name': 'table',
 'namespace': None,
 'prefix': None,
 'known_xml': False,
 'attrs': {'class': ['wikitable', 'sortable']},
 'contents': ['\n',
  <tbody><tr bgcolor="lightblue">
  <th width="10%">District Number
  </th>
  <th width="90%">Neighbourhoods Included
  </th></tr>
  <tr>
  <td>C01
  </td>
  <td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown</a>, <a class="mw-redirect" href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>, <a href="/wiki/Little_Italy,_Toronto" title="Little Italy, Toronto">Little Italy</a>, <a href="/wiki/Little_Portugal,_Toronto" title="Little Portugal, Toronto">Little Portugal</a>, Dufferin Grove, Palmerston, University, Bay Street Corridor, Kensington Market, Chinatown, Trinity Bellwoods, South Niagara, Island airport, The Islands, waterfront communities C1, Queen's Park, Ontario Provincial Government, Victoria Hotel, Central Bay Street, First Canadian Place, Design Exchan

In [6]:
from io import StringIO

# Recent pandas versions deprecate passing a literal HTML string, so wrap it in StringIO.
districts_df = pd.read_html(StringIO(str(districts_table)))[0]
districts_df.head()

| | District Number | Neighbourhoods Included |
|---|---|---|
| 0 | C01 | Downtown, Harbourfront, Little Italy, Little P... |
| 1 | C02 | The Annex, Yorkville, South Hill, Summerhill, ... |
| 2 | C03 | Forest Hill South, Oakwood–Vaughan, Humewood–C... |
| 3 | C04 | Bedford Park, Lawrence Manor, North Toronto, F... |
| 4 | C06 | North York, Clanton Park, Bathurst Manor |


Now that we have parsed the table from Wikipedia, we need to split each district's comma-separated list into individual neighborhood names.

In [7]:
neighborhoods = []

for row in districts_df["Neighbourhoods Included"]:
    # Each cell holds a comma-separated list of neighborhood names.
    neighborhoods.extend(row.split(', '))
    
neighborhoods

['Downtown',
 'Harbourfront',
 'Little Italy',
 'Little Portugal',
 'Dufferin Grove',
 'Palmerston',
 'University',
 'Bay Street Corridor',
 'Kensington Market',
 'Chinatown',
 'Trinity Bellwoods',
 'South Niagara',
 'Island airport',
 'The Islands',
 'waterfront communities C1',
 "Queen's Park",
 'Ontario Provincial Government',
 'Victoria Hotel',
 'Central Bay Street',
 'First Canadian Place',
 'Design Exchange',
 'Adelaide',
 'University of Toronto',
 'Union Station',
 'The Annex',
 'Yorkville',
 'South Hill',
 'Summerhill',
 'Wychwood Park',
 'Deer Park',
 'Casa Loma',
 'Forest Hill South',
 'Oakwood–Vaughan',
 'Humewood–Cedarvale',
 'Corso Italia',
 'Humewood-Cedarvale',
 'Forest Hill Road Park',
 'Bedford Park',
 'Lawrence Manor',
 'North Toronto',
 'Forest Hill North',
 'Lawrence Park',
 'Lawrence Heights',
 'Roselawn',
 'North York',
 'Clanton Park',
 'Bathurst Manor',
 'Willowdale West',
 'Newtonbrook West',
 'Westminster–Branson',
 'Lansing-Westgate',
 'Cabbagetown',
 'St. La

In [8]:
len(neighborhoods)

225
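
The list printed above contains at least one name twice with different dash characters ('Humewood–Cedarvale' and 'Humewood-Cedarvale'), so the count of 225 probably includes some duplicates. The sketch below shows one possible way to remove them while keeping the first spelling seen. The helper names `normalize_name` and `unique_names` are hypothetical and are not part of the original notebook.

```python
def normalize_name(name):
    """Build a comparison key: same dash character, no outer spaces, case-insensitive."""
    return name.replace("–", "-").strip().casefold()


def unique_names(names):
    """Return the names in their original order, skipping repeats and empty entries."""
    seen = set()
    unique = []
    for name in names:
        key = normalize_name(name)
        if key and key not in seen:
            seen.add(key)
            # Keep the first spelling encountered, only trimmed.
            unique.append(name.strip())
    return unique


sample = ['Oakwood–Vaughan', 'Humewood–Cedarvale', 'Corso Italia', 'Humewood-Cedarvale']
print(unique_names(sample))
```

Keeping the first spelling rather than the normalized key means later lookups by page title still use the name exactly as it appeared in the table.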

Now we have a list of 225 neighborhood names in Toronto. Since almost every one of them has a Wikipedia page under its own name, we can use those pages to extract the coordinates of each neighborhood.

In [9]:
neighborhoods_df = pd.DataFrame(neighborhoods, columns=['Neighborhood'])
neighborhoods_df.head()

| | Neighborhood |
|---|---|
| 0 | Downtown |
| 1 | Harbourfront |
| 2 | Little Italy |
| 3 | Little Portugal |
| 4 | Dufferin Grove |


In [10]:
def find_wiki_coords(page):
    """Return [latitude, longitude] as DMS strings from the first matching Wikipedia page, or None."""
    title = page.replace(' ', '_')
    # Try Toronto-specific titles first; the bare title is the last resort.
    possible_titles = [
        title + "_Toronto",
        title + ",_Toronto",
        "Toronto_" + title,
        "Toronto,_" + title,
        title
    ]

    for candidate in possible_titles:
        url = "https://en.wikipedia.org/wiki/" + candidate
        wiki_page = requests.get(url, timeout=10)

        if wiki_page.status_code == 200:
            soup = BeautifulSoup(wiki_page.content, "html.parser")
            # Wikipedia renders page coordinates in spans with these classes.
            latitude = soup.find("span", {"class": "latitude"})
            longitude = soup.find("span", {"class": "longitude"})

            if latitude and longitude:
                return [latitude.text, longitude.text]

    return None

The following code finds the coordinates of each neighborhood in Toronto. It makes several web requests per neighborhood and takes a while to run, so the resulting dataframe is saved to a .csv file.

In [11]:
latitudes = []
longitudes = []

for neighborhood in neighborhoods_df["Neighborhood"]:
    coords = find_wiki_coords(neighborhood)
    
    if coords:
        latitudes.append(coords[0])
        longitudes.append(coords[1])
    else:
        latitudes.append(None)
        longitudes.append(None)

neighborhoods_df["Latitude"] = latitudes
neighborhoods_df["Longitude"] = longitudes
neighborhoods_df.to_csv("data/toronto_neighborhoods_coords.csv", index=False)

neighborhoods_df.head()

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Downtown | 43°39′9.01″N | 79°23′0.81″W |
| 1 | Harbourfront | 43°38′17″N | 79°23′06″W |
| 2 | Little Italy | 43°39′18″N | 79°24′47″W |
| 3 | Little Portugal | 43°39′00″N | 79°26′08″W |
| 4 | Dufferin Grove | 43°39′25″N | 79°25′41″W |
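
Because the lookup is slow, it can help to skip it entirely when the saved file already exists. The following is a minimal sketch of that idea using only the standard library. The function `load_or_build_rows` and its parameters are hypothetical names introduced for illustration, and the sketch stores plain rows rather than a pandas dataframe.

```python
import csv
from pathlib import Path


def load_or_build_rows(cache_path, build_rows, fieldnames):
    """Return rows from the CSV cache if it exists; otherwise build and save them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open(newline="", encoding="utf-8") as handle:
            # Values come back as strings; missing values come back as "".
            return list(csv.DictReader(handle))

    rows = build_rows()  # the slow step, e.g. one web lookup per neighborhood
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

With a helper like this, rerunning the notebook would only trigger the scraping step when the cache file is missing; deleting the file forces a refresh.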


In [12]:
neighborhoods_df = pd.read_csv("data/toronto_neighborhoods_coords.csv")
neighborhoods_df.head()

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Downtown | 43°39′9.01″N | 79°23′0.81″W |
| 1 | Harbourfront | 43°38′17″N | 79°23′06″W |
| 2 | Little Italy | 43°39′18″N | 79°24′47″W |
| 3 | Little Portugal | 43°39′00″N | 79°26′08″W |
| 4 | Dufferin Grove | 43°39′25″N | 79°25′41″W |


In [13]:
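# For folium we need coordinates as decimal degrees.

def dms_to_dd(coords_str):
    """Convert a DMS string such as 43°39′9.01″N to unsigned decimal degrees."""
    if isinstance(coords_str, str):
        # Drop the last two characters: the final unit symbol and the hemisphere letter.
        new_str = coords_str[:-2]

        for delimiter in ["°", "′"]:
            new_str = new_str.replace(delimiter, ',')

        # Pad with zeros so strings without minutes or seconds still parse.
        dms = new_str.split(',') + [0, 0, 0]
        degrees = float(dms[0])
        minutes = float(dms[1])
        seconds = float(dms[2])

        return degrees + (minutes / 60) + (seconds / 3600)
    # Missing coordinates (NaN after reading the CSV) are not strings.
    return None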
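
The function above discards the hemisphere letter, so every value comes out positive and the sign has to be fixed separately. As an alternative, here is a sketch of a converter that reads the hemisphere and returns a signed value directly. The name `dms_to_signed_dd` and the regular expression are my own assumptions about the string format, based on the samples shown in this notebook.

```python
import re

# Accepts strings such as 43°39′9.01″N, 43°39′N or 43.65°N.
_DMS_PATTERN = re.compile(
    r"^\s*(?P<deg>\d+(?:\.\d+)?)°"
    r"(?:(?P<min>\d+(?:\.\d+)?)′)?"
    r"(?:(?P<sec>\d+(?:\.\d+)?)″)?"
    r"\s*(?P<hemi>[NSEW])\s*$"
)


def dms_to_signed_dd(coord):
    """Convert a DMS string to signed decimal degrees, or return None if it does not parse."""
    if not isinstance(coord, str):
        return None
    match = _DMS_PATTERN.match(coord)
    if match is None:
        return None

    degrees = float(match.group("deg"))
    minutes = float(match.group("min") or 0)
    seconds = float(match.group("sec") or 0)
    decimal = degrees + minutes / 60 + seconds / 3600

    # South and west are negative by convention.
    return -decimal if match.group("hemi") in "SW" else decimal


print(dms_to_signed_dd("43°39′9.01″N"))
print(dms_to_signed_dd("79°23′06″W"))
```

A signed conversion also makes coordinates from the wrong hemisphere stand out, since they are no longer forced to look like western longitudes.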

In [14]:
decimal_latitudes = []
decimal_longitudes = []

for latitude in neighborhoods_df['Latitude']:
    decimal_latitudes.append(dms_to_dd(latitude))
    
for longitude in neighborhoods_df['Longitude']:
    decimal_longitudes.append(dms_to_dd(longitude))
    
neighborhoods_df['Latitude'] = decimal_latitudes
neighborhoods_df['Longitude'] = decimal_longitudes
neighborhoods_df['Longitude'] = neighborhoods_df['Longitude'] * -1 # dms_to_dd drops the hemisphere; Toronto's longitudes are west, so they must be negative

neighborhoods_df.head()

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Downtown | 43.652503 | -79.383558 |
| 1 | Harbourfront | 43.638056 | -79.385 |
| 2 | Little Italy | 43.655 | -79.413056 |
| 3 | Little Portugal | 43.65 | -79.435556 |
| 4 | Dufferin Grove | 43.656944 | -79.428056 |


In [15]:
print(neighborhoods_df.shape)

(225, 3)


In [16]:
# Drop neighborhoods for which no coordinates were found (NaN values)
neighborhoods_df = neighborhoods_df.dropna(axis=0)
print(neighborhoods_df.shape)

(163, 3)


### Visualize Toronto Neighborhoods

In [17]:
# !pip install folium
import folium

In [18]:
toronto_coords = [43.651070, -79.347015]
toronto_map = folium.Map(location=toronto_coords,
                         tiles='Stamen Toner',
                         zoom_start=10.5)

for i, row in neighborhoods_df.iterrows():
    marker = folium.CircleMarker(
        location=[row.Latitude, row.Longitude],
        popup=row.Neighborhood,
        color='crimson',
        radius=5,
        fill=True
    ).add_to(toronto_map)

toronto_map

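If we zoom out on the map, we can see that some neighborhood coordinates are wrong. This is probably a side effect of how find_wiki_coords works: its last fallback is the bare page title, which can match an unrelated Wikipedia article with the same name. I will remove those neighborhoods manually.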

In [24]:
neighborhoods_df[neighborhoods_df['Neighborhood'] == 'Hunt Club'].index

Int64Index([126], dtype='int64')

In [29]:
wrong_neighborhoods = ['Westmount', 'Adelaide']

for neighborhood in wrong_neighborhoods:
    neighborhoods_df = neighborhoods_df[neighborhoods_df['Neighborhood'] != neighborhood]
    
neighborhoods_df.shape

(161, 3)
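
Instead of listing wrong neighborhoods by name, the outliers could also be filtered by position. The sketch below checks whether a point falls inside a rough bounding box around Toronto. The box limits and the function name `in_toronto_bbox` are assumptions chosen for illustration, not official city boundaries.

```python
# Rough bounding box around the City of Toronto (assumed values for illustration).
LAT_MIN, LAT_MAX = 43.5, 43.9
LON_MIN, LON_MAX = -79.7, -79.1


def in_toronto_bbox(latitude, longitude):
    """Return True if the point lies inside the assumed bounding box."""
    return LAT_MIN <= latitude <= LAT_MAX and LON_MIN <= longitude <= LON_MAX


print(in_toronto_bbox(43.652503, -79.383558))  # Downtown, from the table above

# Hypothetical usage with the dataframe from this notebook:
# mask = neighborhoods_df.apply(
#     lambda row: in_toronto_bbox(row.Latitude, row.Longitude), axis=1)
# neighborhoods_df = neighborhoods_df[mask]
```

A positional filter would catch every mismatched page at once, but it depends on picking sensible limits, so the rows it removes are still worth inspecting by hand.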

In [30]:
toronto_coords = [43.651070, -79.347015]
toronto_map = folium.Map(location=toronto_coords,
                         tiles='Stamen Toner',
                         zoom_start=10.5)

for i, row in neighborhoods_df.iterrows():
    marker = folium.CircleMarker(
        location=[row.Latitude, row.Longitude],
        popup=row.Neighborhood,
        color='crimson',
        radius=5,
        fill=True
    ).add_to(toronto_map)

toronto_map