# Coursera Capstone Project
Anthony Suárez

This notebook contains my Capstone project for the IBM Data Science Specialization.

## Week 1

In [1]:
import pandas as pd
import numpy as np
import requests
import bs4
from bs4 import BeautifulSoup

In [2]:
print("Hello Coursera Capstone Project!")

Hello Coursera Capstone Project!


## Week 3

### Collect data about Toronto neighborhoods

I will use Beautiful Soup to scrape the data from Wikipedia.

In [3]:
page_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page_html = requests.get(page_url, timeout=10)

page_html

<Response [200]>

In [4]:
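# name the parser explicitly so the result does not depend on which parsers are installed
toronto_soup = BeautifulSoup(page_html.content, "html.parser")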
# print(toronto_soup.prettify())

We are interested in the "Toronto - 103 FSAs" table, which has the classes "wikitable sortable".

In [5]:
fsas_table = toronto_soup.find("table", {"class": "wikitable sortable"})
# fsas_table.__dict__
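
The class lookup above can be sketched on a tiny, made-up HTML snippet (the markup below is hypothetical, not the real Wikipedia page). As far as I understand the Beautiful Soup documentation, passing a string with several classes to `find` matches the exact value of the `class` attribute, so a different class order or an extra class would make it miss the table. A CSS selector through `select_one` is one way to match on the individual classes instead:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page; the class order differs from "wikitable sortable".
sample_html = """
<table class="sortable wikitable extra">
  <tr><th>Postal Code</th></tr>
  <tr><td>M3A</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact-string lookup, as used in the notebook: depends on the attribute text.
exact_match = sample_soup.find("table", {"class": "wikitable sortable"})

# CSS selector: requires both classes, in any order, with extras allowed.
css_match = sample_soup.select_one("table.wikitable.sortable")

print(exact_match is None, css_match is not None)
```

For the Wikipedia page used here the exact-string lookup worked, so this is only a more tolerant alternative.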

In [6]:
from io import StringIO

# read_html returns a list of DataFrames; our table is the first (and only) one
fsas_df = pd.read_html(StringIO(str(fsas_table)))[0]
fsas_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Next, we need to drop the rows whose Borough is "Not assigned".

In [7]:
rows_not_assigned = fsas_df[fsas_df["Borough"] == "Not assigned"]
rows_not_assigned.index

Int64Index([  0,   1,   7,  10,  15,  16,  19,  24,  25,  28,  29,  33,  34,
             35,  37,  38,  42,  43,  44,  51,  52,  53,  60,  61,  62,  69,
             70,  71,  78,  79,  87,  88,  96,  97, 101, 105, 106, 110, 115,
            118, 119, 123, 124, 125, 127, 128, 131, 132, 133, 134, 136, 137,
            140, 141, 145, 146, 149, 150, 154, 155, 158, 159, 161, 162, 163,
            164, 166, 167, 170, 171, 172, 173, 174, 175, 176, 177, 179],
           dtype='int64')

In [8]:
fsas_df = fsas_df.drop(rows_not_assigned.index, axis=0).reset_index(drop=True)
fsas_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
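
As a side note, the same filtering can be written in one step with a boolean mask. The sketch below runs on a small, hand-made frame (`toy_df` is a hypothetical stand-in, not the scraped table):

```python
import pandas as pd

# Hypothetical rows that mimic the structure of the scraped table.
toy_df = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough and renumber the index from 0.
assigned_df = toy_df[toy_df["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned_df)
```

This avoids building the intermediate index of rows to drop, at the cost of not seeing which rows were removed.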


If a neighborhood doesn't have a name assigned, we will assume it's the same name as its borough.

In [9]:
# select every row whose neighbourhood is missing and copy the borough name into it
not_assigned = fsas_df["Neighbourhood"] == "Not assigned"
fsas_df.loc[not_assigned, "Neighbourhood"] = fsas_df.loc[not_assigned, "Borough"]

fsas_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
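
None of the first five rows needs the fallback, so the output above looks unchanged. To see the rule take effect, here is a minimal sketch on made-up rows (the postal code "M9Z" and "Example Borough" are invented for illustration), using `Series.mask` as one possible way to express it:

```python
import pandas as pd

# Hypothetical rows: the second one has no neighbourhood name.
toy_df = pd.DataFrame({
    "Postal Code": ["M3A", "M9Z"],
    "Borough": ["North York", "Example Borough"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Where the neighbourhood is missing, take the borough name instead.
missing_name = toy_df["Neighbourhood"] == "Not assigned"
toy_df["Neighbourhood"] = toy_df["Neighbourhood"].mask(missing_name, toy_df["Borough"])
print(toy_df)
```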


In [10]:
fsas_df.rename(columns={"Neighbourhood": "Neighborhoods"}, inplace=True)
fsas_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhoods
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Now that we have parsed the table from Wikipedia, we have to get the coordinates of each neighborhood.

In [11]:
import geocoder

In [12]:
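def find_coords(postal_code, max_attempts=100):
    """Return (latitude, longitude) for a postal code, or (None, None) if every attempt fails."""
    # initialize your variable to None
    lat_lng_coords = None

    # retry until coordinates come back or the attempt limit is reached
    attempts = 0
    while lat_lng_coords is None and attempts < max_attempts:
        attempts += 1
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng

    # the service may never answer, so avoid indexing into None
    if lat_lng_coords is None:
        return None, None
    return lat_lng_coords[0], lat_lng_coords[1]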
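
The bounded-retry idea behind this function can be tried without calling any external service. The sketch below is an illustration only: `retry_lookup` and `flaky_lookup` are hypothetical helpers, and `flaky_lookup` merely simulates a geocoder that returns `None` twice before answering.

```python
import time


def retry_lookup(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a value or attempts run out.

    Returns a (result, attempts_made) pair; result is None on failure.
    """
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result, attempt
        time.sleep(delay_seconds)  # optional pause between attempts
    return None, max_attempts


# Hypothetical stand-in for a geocoding call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return [43.753259, -79.329656]


coords_found, attempts_used = retry_lookup(flaky_lookup, "M3A, Toronto, Ontario")

# A lookup that never answers stops after max_attempts instead of looping forever.
never_found, attempts_spent = retry_lookup(
    lambda query: None, "M0Z, Toronto, Ontario", max_attempts=4
)
print(coords_found, attempts_used, never_found, attempts_spent)
```

Returning the attempt count alongside the result is a design choice for this sketch; it makes it easy to see how unreliable the lookup was.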

Because the geocoder package proved unreliable, I used the CSV file provided by Coursera instead.

In [19]:
coords = pd.read_csv("data/Geospatial_Coordinates.csv")
coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Now, we need to combine the coords and fsas dataframes.

In [23]:
neighborhoods_data = pd.merge(fsas_df, coords, on="Postal Code")
neighborhoods_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhoods,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
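
`pd.merge` performs an inner join by default, so any postal code missing from the coordinates file would be dropped without warning. One way to check for that, sketched here on made-up frames (`toy_fsas`, `toy_coords` and the code "M0Z" are hypothetical), is a left join with `indicator=True`:

```python
import pandas as pd

# Hypothetical inputs; "M0Z" has no matching coordinates on purpose.
toy_fsas = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M0Z"],
    "Borough": ["North York", "North York", "Nowhere"],
})
toy_coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# Left join keeps every FSA row; validate raises if a postal code is duplicated.
checked = pd.merge(
    toy_fsas,
    toy_coords,
    on="Postal Code",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Rows marked "left_only" found no coordinates.
unmatched_codes = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched_codes)
```

An empty list would suggest that the inner join used in the notebook lost no rows.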


End of part 2