# IBM Data Science on Coursera - Applied Data Science Capstone
## Week 3 - Part 1: Scrape neighborhood information from Wikipedia

Build a pandas dataframe with the postal codes of each Toronto Neighborhood / Borough. The postal code information is scraped from Wikipedia and then cleaned up.

### Let's start by installing and importing all the libraries we need.

In [1]:
!pip install pandas -U
import pandas as pd
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated since pandas 1.0
print("\n*** Pandas Installed, Updated, & Imported\n")
print("\n*** JSON_normalize Imported\n")

!pip install numpy -U
import numpy as np
print("\n*** NumPy Installed, Updated, & Imported\n")

import requests
import urllib.request
print("\n*** Requests Imported\n")

import random
print("\n*** Random Imported\n")

!pip install geopy -U
from geopy.geocoders import Nominatim
print("\n*** Geopy Installed, Updated, & Imported\n")
print("\n*** Nominatim Imported\n")

!pip install ipython -U
from IPython.display import Image
from IPython.core.display import HTML
print("\n*** IPython Installed, Updated, & Imported\n")
print("\n*** Image & HTML Imported\n")

!pip install folium -U
import folium
print("\n*** Folium Installed, Updated, & Imported\n")

!pip install BeautifulSoup4 -U
from bs4 import BeautifulSoup

Requirement already up-to-date: pandas in c:\users\alexi\anaconda3\lib\site-packages (1.0.3)

*** Pandas Installed, Updated, & Imported


*** JSON_normalize Imported

Requirement already up-to-date: numpy in c:\users\alexi\anaconda3\lib\site-packages (1.18.3)

*** NumPy Installed, Updated, & Imported


*** Requests Imported


*** Random Imported

Requirement already up-to-date: geopy in c:\users\alexi\anaconda3\lib\site-packages (1.21.0)

*** Geopy Installed, Updated, & Imported


*** Nominatim Imported

Requirement already up-to-date: ipython in c:\users\alexi\anaconda3\lib\site-packages (7.13.0)

*** IPython Installed, Updated, & Imported


*** Image & HTML Imported

Requirement already up-to-date: folium in c:\users\alexi\anaconda3\lib\site-packages (0.10.1)

*** Folium Installed, Updated, & Imported

Requirement already up-to-date: BeautifulSoup4 in c:\users\alexi\anaconda3\lib\site-packages (4.9.0)


### Scrape data from Wikipedia w/ BeautifulSoup

Using BeautifulSoup, we scrape the Wikipedia page and look through its markup for the table containing our data. There are several tables on the page, but the one containing our Postal Code / Borough / Neighborhood information has the class "wikitable sortable". Let's isolate it.

In [67]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")

In [68]:
table = soup.find('table', class_='wikitable sortable')

### Parse table, & append to Pandas DF

We're parsing the "sortable wikitable" and extracting each cell on each row. The cells are appended to their own lists. Then, a pandas dataframe is initialized using the data in the lists to populate each column.

In [71]:
PostalCode = []
Borough = []
Neighborhood = []

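# find_all / string= are the current spellings of the deprecated findAll / text=
for row in table.find_all("tr"):
    cells = row.find_all("td")
    # Header rows use <th>, so only data rows have exactly three <td> cells
    if len(cells) == 3:
        PostalCode.append(cells[0].find(string=True))
        Borough.append(cells[1].find(string=True))
        Neighborhood.append(cells[2].find(string=True))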

In [72]:
df = pd.DataFrame(PostalCode, columns=["PostalCode"])
df["Borough"] = Borough
df["Neighborhood"] = Neighborhood
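The same row-by-row parsing can be tried offline on a small HTML snippet. The sketch below is only an illustration: the `html` string is a made-up stand-in for the Wikipedia markup, `toy_df` is a hypothetical name, and it uses `get_text(strip=True)` and the built-in `html.parser` rather than the calls in the cells above.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up stand-in for the Wikipedia table (assumption, not the real page).
# The "\n" escapes become real newlines inside the cells, like the scraped data.
html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A\n</td><td>Not assigned\n</td><td>\n</td></tr>
  <tr><td>M3A\n</td><td>North York\n</td><td>Parkwoods\n</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    # The header row only has <th> cells, so it is skipped here
    if len(cells) == 3:
        # get_text(strip=True) removes the trailing newline while parsing
        rows.append([cell.get_text(strip=True) for cell in cells])

toy_df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
print(toy_df)
```

Stripping whitespace at parse time means fewer cleanup steps later, at the cost of hiding what the raw cells looked like.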

### Cleanup & Formatting

Let's see what we've scraped up. There are a lot of **\n (newline)** characters, **empty cells**, and **Not assigned** cells, all of which need to be cleaned up. Otherwise, it looks like we picked up all of the postal codes (from M1A to M9Z) and the corresponding Boroughs and Neighborhoods.

In [76]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,\n
1,M2A\n,Not assigned\n,\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,Regent Park / Harbourfront\n
...,...,...,...
175,M5Z\n,Not assigned\n,\n
176,M6Z\n,Not assigned\n,\n
177,M7Z\n,Not assigned\n,\n
178,M8Z\n,Etobicoke\n,Mimico NW / The Queensway West / South of Bloo...


In [77]:
df.shape

(180, 3)

In [78]:
df.dtypes

PostalCode      object
Borough         object
Neighborhood    object
dtype: object

In [80]:
# Remove \n (newline) characters, & replace "Not assigned" with NaN
# (replace with an empty string, not a space, so no trailing whitespace is left behind)
df = df.replace(r'\n', '', regex=True)
df = df.replace(r'Not assigned', np.nan, regex=True)

In [81]:
# Drop rows containing NaN
df.dropna(axis = 0, how = "any", inplace = True)

In [82]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
...,...,...,...
160,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,Business reply mail Processing CentrE
169,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...
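To see the cleanup steps in isolation, here is a minimal sketch on a two-row toy dataframe. The names `toy` and `cleaned` are hypothetical, and the sketch strips whitespace with `str.strip()` instead of a regex replace, which is an alternative to the cells above and not a copy of them.

```python
import numpy as np
import pandas as pd

# Toy data shaped like the raw scrape (values taken from the output above)
toy = pd.DataFrame({
    "PostalCode": ["M1A\n", "M3A\n"],
    "Borough": ["Not assigned\n", "North York\n"],
    "Neighborhood": ["\n", "Parkwoods\n"],
})

# 1. Strip leading/trailing whitespace (including newlines) from every column
cleaned = toy.apply(lambda col: col.str.strip())

# 2. Mark unassigned boroughs as missing
cleaned = cleaned.replace("Not assigned", np.nan)

# 3. Drop rows without a borough and renumber the index
cleaned = cleaned.dropna(subset=["Borough"]).reset_index(drop=True)

print(cleaned)
```

Restricting `dropna` to the `Borough` column keeps the intent explicit: a row is discarded because its borough is unassigned, not because some other cell happens to be empty.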


### Grouping the dataframe

Now, let's group the dataframe by Postal Code & Borough. At the same time, we will be replacing the forward slashes (" / ") that separate neighborhood names with commas.

In [86]:
df_grouped = df.groupby(["PostalCode", "Borough"], as_index=False).agg(lambda x: ", ".join(x))
df_grouped = df_grouped.replace(r' / ',  ', ', regex=True)

In [87]:
df_grouped

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [85]:
df_grouped.shape

(103, 3)
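As a small illustration of the grouping step, the sketch below runs the same idea on toy rows. The dataframe `toy` and the result name `grouped` are hypothetical, and the slash replacement uses a plain-string `str.replace` on one column rather than the dataframe-wide regex replace above.

```python
import pandas as pd

# Toy rows: one postal code split over two rows, one with a slash-separated name
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M5A"],
    "Borough": ["Scarborough", "Scarborough", "Downtown Toronto"],
    "Neighborhood": ["Malvern", "Rouge", "Regent Park / Harbourfront"],
})

# Join all neighborhoods that share a postal code and borough
grouped = (
    toy.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
    .agg(", ".join)
)

# Turn the " / " separators into commas as well
grouped["Neighborhood"] = grouped["Neighborhood"].str.replace(" / ", ", ", regex=False)

print(grouped)
```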

## Week 3 - Part 2: Adding Geospatial Data to our dataframe

Let's start by getting the lat, lon for these postal codes. We're going to be using the .csv file instead of fiddling about with geocoder.

In [58]:
coords = pd.read_csv("https://cocl.us/Geospatial_data")
coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Concatenate the Coordinates, and the Postal Codes dataframe

In [114]:
df_grp_coords = pd.concat([df_grouped, coords], axis=1, sort=False)
df_grp_coords = df_grp_coords.drop(columns = ["Postal Code"])
df_grp_coords

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
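Concatenating side by side only lines up correctly when both dataframes have the same rows in the same order. A key-based merge does not depend on row order. The following is a sketch on toy data, not the notebook's method: `toy_hoods`, `toy_coords`, and `merged` are hypothetical names, and it assumes the postal codes carry no stray whitespace.

```python
import pandas as pd

# Deliberately out of order relative to the coordinates below
toy_hoods = pd.DataFrame({
    "PostalCode": ["M1C", "M1B"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge Hill, Port Union, Highland Creek", "Malvern, Rouge"],
})

toy_coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Match rows on the postal code itself; validate raises if a key is duplicated
merged = toy_hoods.merge(
    toy_coords,
    left_on="PostalCode",
    right_on="Postal Code",
    how="left",
    validate="one_to_one",
).drop(columns="Postal Code")

print(merged)
```

With `how="left"`, a postal code missing from the coordinates file would show up as NaN in `Latitude` / `Longitude` instead of silently shifting every later row.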


## Week 3 - Part 3: Map & Explore with FourSquare

Let's start by visualizing our data and working from there. Eventually we are going to use FourSquare to get more information on these neighborhoods, so let's see what we have to work with.

In [115]:
tdot = folium.Map(location=[43.653963, -79.387207], zoom_start=11)


for lat, lng, borough, neighborhood in zip(df_grp_coords['Latitude'], df_grp_coords['Longitude'], df_grp_coords['Borough'], df_grp_coords['Neighborhood']):
    label = '{}- {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius=5, popup=label, color='blue', fill=True, fill_color='red').add_to(tdot)

tdot

Let's drop some of the peripheral Boroughs / Neighborhoods, and focus only on the more central areas. We're going to drop the data pertaining to Etobicoke, Mississauga, Scarborough, and North York, as these are not really part of Toronto proper.

In [133]:
out_there = ['Etobicoke', 'Mississauga', 'Scarborough', 'North York']
df_central = df_grp_coords[~df_grp_coords.Borough.str.contains('|'.join(out_there))]
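To check what this kind of filter keeps, here is a minimal sketch on a toy list of boroughs. The names `toy`, `excluded`, and `central` are hypothetical, and `re.escape` is an added precaution (an assumption, not in the cell above) in case a borough name ever contains regex metacharacters.

```python
import re

import pandas as pd

toy = pd.DataFrame({
    "Borough": ["Downtown Toronto", "North York", "East York", "Etobicoke", "York"],
})

excluded = ["Etobicoke", "Mississauga", "Scarborough", "North York"]

# Build one alternation pattern; str.contains matches it anywhere in the string
pattern = "|".join(re.escape(name) for name in excluded)

# ~ inverts the boolean mask, keeping boroughs that match none of the names
central = toy[~toy["Borough"].str.contains(pattern)]

print(central["Borough"].tolist())
```

Because `str.contains` does substring matching, "East York" and "York" survive (neither contains "North York"), while a borough whose name merely includes "Etobicoke" would also be dropped.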

In [134]:
tdot_center = folium.Map(location=[43.7, -79.4], zoom_start=11)


for lat, lng, borough, neighborhood in zip(df_central['Latitude'], df_central['Longitude'], df_central['Borough'], df_central['Neighborhood']):
    label = '{}- {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius=5, popup=label, color='blue', fill=True, fill_color='red').add_to(tdot_center)

tdot_center

### Using the FourSquare API

In [4]:
import config as cfg
