# My Coursera Capstone Project
## By Francesco Rivano

I've just created this Jupyter Notebook and I'm going to use it for the Capstone Project. What is it going to be about? Ah, that's the exciting part: I don't know yet.

In [1]:
import pandas as pd
import numpy as np
print('Hello Capstone Project Course!')


Hello Capstone Project Course!


## Data Scraping and Generation of a Map of Toronto Neighbourhoods

This notebook scrapes the list of Toronto postal codes from a Wikipedia article and then generates a map of the corresponding neighbourhoods.
I will use BeautifulSoup for data scraping, pandas to handle the datasets, and Folium to generate the map.

In [2]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import numpy as np
wiki = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

soup = BeautifulSoup(requests.get(wiki).text, 'lxml')

## Finding the table in the article

In [3]:
postcodes_table = soup.find('table',{'class':'wikitable sortable'})

## And creating a table

In [4]:
parsed_table_data = []
rows = postcodes_table.find_all('tr')
for row in rows:
    children = row.find_all('td')
    row_text = []

    for child in children:
        clean_text = child.text
        clean_text = clean_text.strip()
        clean_text = clean_text.replace('\n', '')
        row_text.append(clean_text)
    parsed_table_data.append(row_text)
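
To see what the loop above produces, here is a minimal sketch of the same row-by-row parsing on a tiny inline table. The HTML string and the names `soup_demo` and `rows_demo` are made up for illustration, and I use the built-in `html.parser` here so the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned\n</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods\n</td></tr>
</table>
"""

soup_demo = BeautifulSoup(html, 'html.parser')
table = soup_demo.find('table', {'class': 'wikitable sortable'})

rows_demo = []
for tr in table.find_all('tr'):
    # The header row only has <th> cells, so it yields an empty list
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    rows_demo.append(cells)

print(rows_demo)
```

The empty list that comes from the header row is the reason the first parsed row has to be skipped when the DataFrame is built.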

In [5]:
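# Skip the first parsed row: it comes from the header, which has no <td> cells
df = pd.DataFrame(parsed_table_data)[1:]
df.columns = ['Postcode', 'Borough', 'Neighbourhood']
df = df.replace('Not assigned', np.nan)
# A missing neighbourhood takes the name of its borough
# (assigning to the rows yielded by iterrows() is not guaranteed to write back to df)
df['Neighbourhood'] = df['Neighbourhood'].fillna(df['Borough'])
# Drop the postcodes that have no borough assigned
df.dropna(axis=0, inplace=True)
df = df.reset_index(drop=True)
df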

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


## The table is "too long" because we haven't grouped the neighbourhoods by postcode and borough yet. We can, however, check how many unique Postcode values there are: there should be 103, just like in the coordinates CSV file we will deal with later.

In [6]:
len(df.Postcode.value_counts())

103

In [7]:
boroughs = df.Borough.unique()
postcodes = df.Postcode.unique()
# Seed row of NaNs; it is dropped again further down
final_df = pd.DataFrame(data={'Postcode': [np.nan], 'Borough': [np.nan], 'Neighbourhood': [np.nan]})
for postcode in postcodes:
    for borough in boroughs:
        # All neighbourhoods that share this postcode and borough
        temp = df[(df.Postcode == postcode) & (df.Borough == borough)]
        neighbourhoods = ', '.join(temp.Neighbourhood.values)
        if neighbourhoods != '':
            # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
            new_row = pd.DataFrame(data={'Postcode': [postcode], 'Borough': [borough], 'Neighbourhood': [neighbourhoods]})
            final_df = pd.concat([final_df, new_row])
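
The nested loops above can probably be replaced by a single `groupby`. This is only a sketch on a small hypothetical `sample` DataFrame with the same three columns, not the code I actually ran; `sort=False` keeps the groups in order of first appearance.

```python
import pandas as pd

# Hypothetical sample in the same shape as df above
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Join the neighbourhoods of each (Postcode, Borough) pair into one string
grouped = (
    sample.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)
print(grouped)
```

Built this way, the result has one row per postcode and borough, with no seed row and no repeated index values.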


In [9]:
len(final_df.Postcode.value_counts())

103

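## There are 103 unique postcodes, but the shape below reports 104 rows. Why? The all-NaN row used to initialise final_df is still there, and value_counts() ignores it. Let's get rid of it and also reset the index (otherwise every row would keep index 0).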

In [10]:
final_df.shape

(104, 3)
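
The mismatch between the two numbers comes from how pandas counts missing values. Here is a small sketch on a hypothetical `seeded` DataFrame (one all-NaN row plus two real rows) that shows the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one all-NaN seed row followed by two real rows
seeded = pd.DataFrame({'Postcode': [np.nan, 'M3A', 'M4A'],
                       'Borough': [np.nan, 'North York', 'North York']})

n_rows = seeded.shape[0]                                       # counts every row
n_unique = seeded['Postcode'].nunique()                        # skips NaN by default
n_unique_with_nan = seeded['Postcode'].nunique(dropna=False)   # counts NaN as a value
n_counted = len(seeded['Postcode'].value_counts())             # skips NaN by default too
print(n_rows, n_unique, n_unique_with_nan, n_counted)
```

So the shape counts the seed row, while `value_counts()` and `nunique()` silently leave it out.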

In [11]:
final_df = final_df.reset_index()
final_df = final_df[['Postcode', 'Borough', 'Neighbourhood']].dropna(axis=0)
final_df

Unnamed: 0,Postcode,Borough,Neighbourhood
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,"Harbourfront, Regent Park"
4,M6A,North York,"Lawrence Heights, Lawrence Manor"
5,M7A,Queen's Park,Queen's Park
6,M9A,Etobicoke,Islington Avenue
7,M1B,Scarborough,"Rouge, Malvern"
8,M3B,North York,Don Mills North
9,M4B,East York,"Woodbine Gardens, Parkview Hill"
10,M5B,Downtown Toronto,"Ryerson, Garden District"


In [12]:
coordinates = pd.read_csv('Geospatial_Coordinates.csv')
coordinates 

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


## Let's sort both the coordinates DataFrame and final_df by postcode, just in case, and then check that they actually contain the same set of values.

In [13]:
coordinates.columns = ['Postcode', 'Latitude', 'Longitude']
coordinates.sort_values(['Postcode'], inplace=True)
coordinates

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [14]:
final_df.sort_values('Postcode', inplace=True)

# Check that both datasets hold exactly the same postcodes after sorting; if they do, we can move on
if np.array_equal(final_df.Postcode.values, coordinates.Postcode.values):
    print('Everything okay.')
else:
    print("Something didn't go as planned.")

Everything okay.
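
If the check above ever fails, it only says that something is off, not which postcodes are the problem. A possible follow-up, sketched here with two hypothetical Series (`left_codes` and `right_codes` are not part of the notebook), is to compare the values as sets.

```python
import pandas as pd

# Hypothetical postcode columns; 'M9Z' is deliberately missing from the second one
left_codes = pd.Series(['M1B', 'M1C', 'M9Z'])
right_codes = pd.Series(['M1C', 'M1B'])

# Set differences list the codes that appear on one side only
only_left = sorted(set(left_codes) - set(right_codes))
only_right = sorted(set(right_codes) - set(left_codes))
print(only_left, only_right)
```

Unlike `np.array_equal`, this does not depend on the order of the rows, so it also works before sorting.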


In [15]:
final_df = final_df.merge(coordinates, on='Postcode')
final_df

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
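
The merge itself can also double-check the join keys. This is a minimal sketch, assuming two small hypothetical frames `left` and `right` built from the first two rows shown above: `validate='one_to_one'` raises an error if a postcode is duplicated on either side, and `indicator=True` adds a `_merge` column that flags rows without a match.

```python
import pandas as pd

# Hypothetical two-row versions of final_df and coordinates
left = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                     'Borough': ['Scarborough', 'Scarborough']})
right = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# The outer join keeps unmatched rows so that they can be inspected
merged = left.merge(right, on='Postcode', how='outer',
                    validate='one_to_one', indicator=True)
unmatched = merged[merged['_merge'] != 'both']
print(len(merged), len(unmatched))
```

An empty `unmatched` frame means every postcode found its coordinates.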


In [16]:
# The map in all its glory.

In [17]:
import folium
from folium.plugins import MarkerCluster

# Centre the map on the mean of all postcode coordinates
m = folium.Map(location=[final_df.Latitude.mean(), final_df.Longitude.mean()], tiles="cartodbdark_matter", zoom_start=11)
cluster = MarkerCluster().add_to(m)

# One marker per postcode; popups are rendered as HTML, so <br> is used for line breaks
for row in final_df.itertuples():
    popup = f'Postal code: {row.Postcode}<br><br>Neighbourhoods: {row.Neighbourhood}<br><br>[{row.Longitude}, {row.Latitude}]'
    folium.Marker([row.Latitude, row.Longitude], popup=popup).add_to(cluster)

# Add the layer control last so that it picks up the cluster layer
folium.LayerControl().add_to(m)
m