### <h1><center>Segmenting and Clustering Neighborhoods in Toronto</center></h1>

## Part 1. Scraping data from Wikipedia page to a dataframe

### Option 1: Use Python, Urllib, Beautiful Soup and Pandas

#### a. Import urllib library for working with URLs.

In [23]:
# import the library we use to open URLs
import urllib.request

#### b. Next we specify the URL of the Wikipedia page we are looking to scrape

In [24]:
# specify which URL/web page we are going to be scraping
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

#### c. Using the urllib.request library to query the page and put the HTML data into a variable

In [25]:
# open the url using urllib.request and put the HTML into the page variable
page = urllib.request.urlopen(url)
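
Some servers reject requests that carry no identifying headers, and a call without a timeout can hang indefinitely. The following is a minimal sketch, not part of the original notebook, of the same fetch with an explicit `User-Agent` header and a timeout. The helper names `build_request` and `fetch_html` and the agent string are hypothetical placeholders.

```python
import urllib.request

URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"


def build_request(url, user_agent="toronto-neighborhoods-notebook/0.1"):
    """Return a Request that carries an explicit User-Agent header.

    The agent string is a made-up placeholder, not a required value.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes (performs a network call)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; fetch_html(URL) does.
request = build_request(URL)
```

Calling `fetch_html(URL)` returns bytes, which `BeautifulSoup` accepts in the same way as the `page` object above.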

#### d. We then use Beautiful Soup to parse the HTML data stored in the variable and store it in a new variable in the Beautiful Soup format. 

In [21]:
# First install BeautifulSoup4
!pip install BeautifulSoup4

Collecting BeautifulSoup4
  Downloading https://files.pythonhosted.org/packages/66/25/ff030e2437265616a1e9b25ccc864e0371a0bc3adb7c5a404fd661c6f4f6/beautifulsoup4-4.9.1-py3-none-any.whl (115kB)
Collecting soupsieve>1.2 (from BeautifulSoup4)
  Downloading https://files.pythonhosted.org/packages/6f/8f/457f4a5390eeae1cc3aeab89deb7724c965be841ffca6cfca9197482e470/soupsieve-2.0.1-py3-none-any.whl
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.9.1 soupsieve-2.0.1


In [26]:
# import the BeautifulSoup library so we can parse HTML and XML documents
from bs4 import BeautifulSoup

#### Beautiful Soup warns when no parser is specified, so we explicitly choose the "lxml" parser:

In [18]:
# First we install lxml
!pip install lxml

Note: you may need to restart the kernel to use updated packages.


In [28]:
# Parse the HTML from our URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(page, "lxml")

In [None]:
# Use Beautiful Soup’s prettify function to view the HTML code
#print(soup.prettify())

#### e. From the HTML code, we find the table we want. It appears as `<table class="wikitable sortable">`

In [30]:
soup.title

<title>List of postal codes of Canada: M - Wikipedia</title>

In [None]:
# use the 'find_all' function to bring back all instances of the 'table' tag in the HTML and store in 'all_tables' variable
all_tables=soup.find_all("table")
all_tables

In [None]:
right_table=soup.find('table', class_='wikitable sortable')
right_table

In [33]:
# We loop through the rows. Create 3 empty lists to store column data in
A=[]
B=[]
C=[]

# Use Beautiful Soup's find_all to collect every 'tr' (table row) tag,
# then loop through the rows one by one
for row in right_table.find_all('tr'):
    # Within each row, use find_all again to collect the 'td' (table data) cells
    cells=row.find_all('td')
    # Only keep rows with exactly 3 cells (one per column); rows without 'td' cells, such as a header row, are skipped
    if len(cells)==3:
        # find(text=True) returns the first text string inside each cell; append it to the matching list A-C
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
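
The row loop can be tried without a network connection on a small inline table. The sketch below is an illustration rather than the notebook's own code: the HTML is a made-up sample that mimics the Wikipedia layout, the built-in `"html.parser"` is used so that `lxml` is not required, and `get_text(strip=True)` is swapped in for `find(text=True)` so that trailing newlines are trimmed during extraction.

```python
from bs4 import BeautifulSoup

# Made-up sample that mimics the layout of the Wikipedia table.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable sortable")

rows = []
for tr in sample_table.find_all("tr"):
    cells = tr.find_all("td")
    # The header row holds <th> cells only, so it is skipped here.
    if len(cells) == 3:
        # get_text(strip=True) trims surrounding whitespace, including newlines.
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

Note that `get_text` concatenates all text inside a cell, including text nested in links, whereas `find(text=True)` returns only the first text node, so the two can differ on cells with nested markup.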

#### f. Move data to Pandas

We assign each of the lists A-C to a column named after the source table columns, i.e. postal code, borough and neighborhood.

In [36]:
# Build the dataframe: list A becomes 'Postal Code', B becomes 'Borough', C becomes 'Neighborhood'
import pandas as pd
df_canada = pd.DataFrame(A,columns=['Postal Code'])
df_canada['Borough']=B
df_canada['Neighborhood']=C
df_canada.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A\n,Not assigned\n,Not assigned\n
1,M2A\n,Not assigned\n,Not assigned\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


Each value ends with a trailing "\n". Let's remove it using ".str.replace".

In [37]:
df_canada['Postal Code'] = df_canada['Postal Code'].str.replace('\n', '')
df_canada['Borough'] = df_canada['Borough'].str.replace('\n', '')
df_canada['Neighborhood'] = df_canada['Neighborhood'].str.replace('\n', '')
df_canada.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
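
The three `.str.replace` calls can also be written as a single loop over the columns. The sketch below is one possible alternative, assuming every column holds strings; it uses `.str.strip()`, which removes leading and trailing whitespace, on a small made-up frame rather than the scraped data.

```python
import pandas as pd

# Small stand-in for the scraped frame, with trailing newlines as in the output above.
sample = pd.DataFrame(
    {
        "Postal Code": ["M3A\n", "M4A\n"],
        "Borough": ["North York\n", "North York\n"],
        "Neighborhood": ["Parkwoods\n", "Victoria Village\n"],
    }
)

# Strip leading and trailing whitespace from every column in one pass.
for column in sample.columns:
    sample[column] = sample[column].str.strip()

print(sample)
```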


### Option 2: Import directly using read_html

In [8]:
pip install lxml

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/55/6f/c87dffdd88a54dd26a3a9fef1d14b6384a9933c455c54ce3ca7d64a84c88/lxml-4.5.1-cp36-cp36m-manylinux1_x86_64.whl (5.5MB)
Installing collected packages: lxml
Successfully installed lxml-4.5.1
Note: you may need to restart the kernel to use updated packages.


In [5]:
import pandas as pd
df_canada = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df_canada.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [10]:
df_canada.shape

(180, 3)

#### Drop rows whose borough is "Not assigned"

In [11]:
df_canada = df_canada[df_canada['Borough'] != "Not assigned"]
df_canada.reset_index(drop=True, inplace=True)
df_canada.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


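#### Where more than one neighborhood exists in one postal code area, combine them into one row with the neighborhoods separated by commas

In [14]:
# groupby returns a new object, so assign the result back; sort=False keeps the original row order
df_canada = df_canada.groupby(['Postal Code', 'Borough'], sort=False)['Neighborhood'].agg(', '.join).reset_index()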
df_canada.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
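
A `groupby` returns a new object and leaves the original frame untouched, so its result has to be assigned to take effect. The sketch below shows the combining step on made-up rows in which one postal code appears twice; the frame and the name `combined` are illustrative only.

```python
import pandas as pd

# Made-up rows: postal code M5A appears twice, once per neighborhood.
sample = pd.DataFrame(
    {
        "Postal Code": ["M5A", "M3A", "M5A"],
        "Borough": ["Downtown Toronto", "North York", "Downtown Toronto"],
        "Neighborhood": ["Regent Park", "Parkwoods", "Harbourfront"],
    }
)

# Group on both key columns so Borough survives, and join the names with commas.
combined = (
    sample.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
```

Grouping on `Postal Code` alone would drop the `Borough` column from the result, which is why both columns are used as keys here.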


#### Number of rows in the dataframe

In [17]:
print("The number of rows in the dataframe is:", df_canada.shape[0])

The number of rows in the dataframe is: 103


## Part 2. Getting the latitude and longitude coordinates of each neighborhood

In [47]:
df_cord = pd.read_csv('Geospatial_Coordinates.csv')
df_cord.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
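
When two frames are joined on `Postal Code`, an inner merge silently drops any row without a match. The sketch below shows one way to check for that, using a left merge with `indicator=True` and `validate="one_to_one"` on made-up frames. The code `M9Z` and the names `left`, `right` and `checked` are hypothetical.

```python
import pandas as pd

# Made-up frames standing in for the neighborhood and coordinate tables.
left = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A", "M9Z"],
        "Borough": ["North York", "North York", "Unknown"],
    }
)
right = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A"],
        "Latitude": [43.753259, 43.725882],
        "Longitude": [-79.329656, -79.315572],
    }
)

# validate raises if a postal code is duplicated on either side;
# indicator adds a "_merge" column recording where each row came from.
checked = pd.merge(
    left, right, on="Postal Code", how="left",
    validate="one_to_one", indicator=True,
)

# Postal codes that found no coordinates in the right-hand frame.
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every postal code has coordinates, in which case an inner merge loses nothing.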


In [52]:
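# Join the coordinates onto the neighborhoods using the shared 'Postal Code' column
df_canada = pd.merge(df_canada, df_cord, on='Postal Code')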
df_canada

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
