<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>
<br>
<h1 align=center><font size = 5>IBM Data Science Capstone - Eugene Bible</font></h1>

# Introduction

This is the Jupyter notebook used for the IBM Data Science Professional Certificate capstone on Coursera.

In this notebook, neighborhoods in the city of Toronto are explored, segmented, and clustered.
First, the data is scraped from the Wikipedia page, then wrangled, cleaned, and read into a structured pandas dataframe. Once structured, the neighborhoods in Toronto are explored and clustered.

Author: Eugene Bible

## Table of Contents
<div class="alert alert-block alert-info" style="margin-top: 20px">
<font size = 3>

[1. Segmenting and Clustering Neighborhoods in Toronto](#1.-Segmenting-and-Clustering-Neighborhoods-in-Toronto) <br>
[2. The Battle of Neighborhoods - Part 1](#2.-The-Battle-of-Neighborhoods-\--Part-1)<br>
[3. The Battle of Neighborhoods - Part 2](#3.-The-Battle-of-Neighborhoods-\--Part-2)


</font>
</div>

Import the necessary libraries:

In [7]:
import urllib.request,urllib.parse,urllib.error # Standard library modules to fetch and parse URLs
from bs4 import BeautifulSoup # Library to parse HTML for web scraping

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#!conda install -c conda-forge geocoder --yes
import geocoder # import geocoder for getting lat/lon

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# 1. Segmenting and Clustering Neighborhoods in Toronto

## Part 1.1 - Scraping the Data and Creating the Dataframe

### Scrape the Wikipedia page to obtain the table of postal codes/boroughs/neighborhoods

Page found at: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [8]:
# Define the URL of the webpage to scrape
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# Get HTML
html = urllib.request.urlopen(url).read()
#print(html)

# Use BeautifulSoup to parse the html
soup = BeautifulSoup(html, 'html.parser')
#print(soup.prettify())

# Parse the table only and pull all rows
table = soup.find('table')
#print(table.prettify())
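
`soup.find('table')` returns the first `<table>` element in the document, so it works only while the postal-code table stays first on the page. Below is a minimal sketch of a more targeted lookup, run on an inline HTML snippet rather than the live page. The `wikitable` class name and the sample markup are assumptions for illustration, not something checked against the page in this notebook.

```python
from bs4 import BeautifulSoup

# Inline stand-in for a page with two tables (illustrative markup only)
sample_html = """
<table class="infobox"><tr><td>Not the table we want</td></tr></table>
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

first_table = sample_soup.find('table')                       # first table in the document
target_table = sample_soup.find('table', class_='wikitable')  # table matched by its class (assumed name)

first_table_classes = first_table.get('class')
header_cells = [th.text.strip() for th in target_table.find_all('th')]
print(first_table_classes, header_cells)
```

Selecting by class depends on the page's markup as much as selecting by position does, so printing the header cells right after the lookup is a cheap sanity check either way.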

### Transform the table into a Pandas Dataframe
Requirements:
- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
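
The filtering, fill-in, and combining rules above can be sketched in pandas on a small toy frame. The rows and the `toy_` names below are made up for illustration and follow the column names from the requirements; this is one way to express the rules, not necessarily how the cells that follow implement them.

```python
import pandas as pd

# Toy rows mimicking the scraped table (values are illustrative)
toy_df = pd.DataFrame(
    [
        ['M1A', 'Not assigned', 'Not assigned'],
        ['M5A', 'Downtown Toronto', 'Harbourfront'],
        ['M5A', 'Downtown Toronto', 'Regent Park'],
        ['M7A', "Queen's Park", 'Not assigned'],
    ],
    columns=['PostalCode', 'Borough', 'Neighborhood'],
)

# 1. Keep only rows with an assigned borough
toy_df = toy_df[toy_df['Borough'] != 'Not assigned'].copy()

# 2. An unassigned neighborhood takes the name of its borough
unassigned = toy_df['Neighborhood'] == 'Not assigned'
toy_df.loc[unassigned, 'Neighborhood'] = toy_df.loc[unassigned, 'Borough']

# 3. Combine neighborhoods that share a postal code into one comma-separated row
toy_result = (
    toy_df.groupby(['PostalCode', 'Borough'])['Neighborhood']
          .agg(', '.join)
          .reset_index()
)

print(toy_result)
print(toy_result.shape)
```

Doing the filtering on the dataframe instead of inside the scraping loop keeps the raw scrape intact, which helps when the source table changes and the rules need to be re-checked.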

In [48]:
# Scrape the rows of the table
table_rows = table.find_all('tr')

rows = []
for tr in table_rows:
    header_cells = tr.find_all('th') # Header cells (present only in the header row)
    data_cells = tr.find_all('td') # Data cells
    if header_cells:
        row = [cell.text.rstrip() for cell in header_cells]
        rows.append(row)
    else:
        row = [cell.text.rstrip() for cell in data_cells]
        if row[1] == 'Not assigned': continue # If no borough, don't keep the row
        if row[2] == 'Not assigned': row[2] = row[1] # If neighborhood is not assigned, set it to the Borough
        rows.append(row)
            
# Assign list of rows to a Pandas DF
nhScraped_df = pd.DataFrame(rows)

# Name the columns after the first row (headers) then drop the header row
nhScraped_df.columns = nhScraped_df.iloc[0]
nhScraped_df = nhScraped_df.drop(nhScraped_df.index[0])
nhScraped_df.head(10)


Unnamed: 0,Postcode,Borough,Neighbourhood
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,Harbourfront
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Downtown Toronto,Queen's Park
7,M9A,Queen's Park,Queen's Park
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
10,M3B,North York,Don Mills North


In [76]:
# Let's check our counts of postal codes, boroughs, and neighborhoods.

print('The dataframe has {} postal codes, {} boroughs, and {} neighborhoods.'.format(
        len(nhScraped_df['Postcode'].unique()),
        len(nhScraped_df['Borough'].unique()),
        len(nhScraped_df['Neighbourhood'].unique())
    )
)

The dataframe has 103 postal codes, 11 boroughs, and 207 neighborhoods.


In [78]:
# Now that the 'Not assigned' entries are taken care of, combine the rows that share a Postcode but list different neighborhoods

# Group the rows by Postcode and Borough (groupby also sorts the group keys by default)
nhGrouped_df = nhScraped_df.groupby(['Postcode','Borough'])
#print(nhGrouped_df)

# Print each group to check the grouping
for key, group in nhGrouped_df:
    print(group, "\n\n")

0 Postcode      Borough Neighbourhood
8      M1B  Scarborough         Rouge
9      M1B  Scarborough       Malvern 


0  Postcode      Borough   Neighbourhood
21      M1C  Scarborough  Highland Creek
22      M1C  Scarborough      Rouge Hill
23      M1C  Scarborough      Port Union 


0  Postcode      Borough Neighbourhood
33      M1E  Scarborough     Guildwood
34      M1E  Scarborough   Morningside
35      M1E  Scarborough     West Hill 


0  Postcode      Borough Neighbourhood
39      M1G  Scarborough        Woburn 


0  Postcode      Borough Neighbourhood
43      M1H  Scarborough     Cedarbrae 


0  Postcode      Borough        Neighbourhood
54      M1J  Scarborough  Scarborough Village 


0  Postcode      Borough         Neighbourhood
66      M1K  Scarborough  East Birchmount Park
67      M1K  Scarborough               Ionview
68      M1K  Scarborough          Kennedy Park 


0  Postcode      Borough Neighbourhood
79      M1L  Scarborough      Clairlea
80      M1L  Scarborough   Gold

In [80]:
# Join neighborhoods of the grouped entries (i.e. the groups with the same postcode), separated by commas
nhJoined_df = nhGrouped_df['Neighbourhood'].apply(', '.join)

nhJoined_df = nhJoined_df.reset_index()

nhJoined_df.head(20)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [82]:
# Check the number of rows in our df, should still have 103 postal codes, meaning 103 rows now.
nhJoined_df.shape

(103, 3)

### Getting latitude and longitude for each neighborhood
Now that we have a dataframe with the postal code, borough name, and neighborhood names for each area, we need the latitude and longitude coordinates of each postal code in order to use the Foursquare location data.

In [84]:
## Using geopy's Nominatim geocoder. NOTE: TOO UNRELIABLE. USED THE GEOSPATIAL_DATA PAGE PROVIDED BY INSTRUCTOR.
#
## initialize your variable to None
#lat_lng_coords = None
#
#address = 'M1J Scarborough Scarborough Village'
#geolocator = Nominatim(user_agent="EB_explores_canada")
#
#while(lat_lng_coords is None):
#    lat_lng_coords = geolocator.geocode(address)
#
#print(lat_lng_coords)
#    
#nlat = lat_lng_coords.latitude
#nlon = lat_lng_coords.longitude
#
#
#print(nlat, nlon)

Note: The geocoding package was too unreliable to return lat/lon values for every postal code, so the coordinates come from the instructor-provided CSV file that lists the geographical coordinates of each postal code: http://cocl.us/Geospatial_data
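
The commented-out loop above retries until the geocoder returns a result, so it never finishes for an address the service cannot resolve. A bounded retry is one way to make that failure visible. The helper name `geocode_with_retries`, its parameters, and the stand-in `flaky_geocoder` are hypothetical and written only for this sketch; no network call is made here.

```python
import time

def geocode_with_retries(geocode_fn, address, max_attempts=3, delay_seconds=0.0):
    """Call geocode_fn(address) up to max_attempts times; return None if every attempt fails."""
    for attempt in range(max_attempts):
        try:
            result = geocode_fn(address)
        except Exception:
            result = None  # treat a service error like an empty lookup
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)  # pause before the next attempt
    return None

# Stand-in geocoder that fails twice and then succeeds (for illustration only)
calls = {'count': 0}
def flaky_geocoder(address):
    calls['count'] += 1
    return None if calls['count'] < 3 else (43.744734, -79.239476)

coords = geocode_with_retries(flaky_geocoder, 'M1J Scarborough Scarborough Village')
missing = geocode_with_retries(lambda address: None, 'unknown place', max_attempts=2)
print(coords, missing)
```

With geopy, `geocode_fn` could be the `geocode` method of a `Nominatim` instance, which returns `None` when nothing matches. A `None` result from the helper then marks a postal code that needs another data source, such as the CSV file used here.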

In [85]:
url = 'http://cocl.us/Geospatial_data'
response = urllib.request.urlopen(url)
latlng_df = pd.read_csv(response)


latlng_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [86]:
# Change column name so both dataframes have the same column name
latlng_df.rename(columns={'Postal Code': 'Postcode'}, inplace=True)
latlng_df.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [87]:
# Merge two Dataframes on column 'Postcode'
neighborhoods_df = nhJoined_df.merge(latlng_df, on='Postcode')

neighborhoods_df = neighborhoods_df.rename(columns={'Neighbourhood': 'Neighborhood'}) # Because I'm American. (Sorry.)

neighborhoods_df.head(20)

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


In [89]:
# Let's check our postal code and borough counts. Should still have 103 postal codes and 11 boroughs.
print('The dataframe has {} postal codes and {} boroughs.'.format(
        len(neighborhoods_df['Postcode'].unique()),
        len(neighborhoods_df['Borough'].unique())
    )
)

The dataframe has 103 postal codes and 11 boroughs.
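
`merge` defaults to an inner join, so a postal code missing from the coordinates file would be dropped without any warning, and the count check above would catch that only indirectly. Here is a minimal sketch of a more direct check using a left join with `indicator=True`. The toy frames and the `M9Z` row are invented for illustration and stand in for `nhJoined_df` and `latlng_df`.

```python
import pandas as pd

# Toy frames standing in for nhJoined_df and latlng_df (values are illustrative)
left_df = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],
                        'Borough': ['Scarborough', 'Scarborough', 'Hypothetical']})
right_df = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                         'Latitude': [43.806686, 43.784535],
                         'Longitude': [-79.194353, -79.160497]})

# A left join keeps every postal code; indicator=True adds a '_merge' column
checked_df = left_df.merge(right_df, on='Postcode', how='left', indicator=True)

# Postal codes with no coordinate match are flagged 'left_only'
unmatched = checked_df.loc[checked_df['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every postal code received coordinates.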


## Part 1.2 - Analysis

# 2. The Battle of Neighborhoods - Part 1

## Part 2.1 - 

## Part 2.2 - 

# 3. The Battle of Neighborhoods - Part 2

## Part 3.1 -

## Part 3.2 -