# Segmenting and Clustering Neighborhoods in Toronto
## Week 3 Assignment for Applied Data Science Capstone
### Notebook 1: Scraping neighborhood data from the web

#### Import libraries

In [16]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

#### A table of Toronto neighborhoods by postal code is available on Wikipedia

In [17]:
wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

#### Use requests to get the raw HTML, then parse it with BeautifulSoup

In [18]:
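r = requests.get(wiki_url, timeout=30)  # avoid hanging indefinitely on a stalled connection
r.raise_for_status()  # stop on an HTTP error instead of parsing an error page
html_data = r.content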

soup = BeautifulSoup(html_data, 'html.parser')

#### Use BeautifulSoup to find the table object

In [19]:
table = soup.find('table')
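
`soup.find('table')` returns the first `<table>` in the document, so it only works while the postal code table comes first on the page. A minimal sketch of a more explicit selection follows. It assumes the target table carries a CSS class such as `wikitable`, which is an assumption about the page markup, and the inline HTML is made-up sample data, not the real page.

```python
from bs4 import BeautifulSoup

# Made-up sample markup: an unrelated table followed by the data table.
sample_html = """
<table class="infobox"><tr><td>navigation</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th></tr>
  <tr><td>M3A</td><td>North York</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() without filters returns the first table in document order.
first_table = sample_soup.find("table")

# class_ matches a tag if any one of its classes equals the given value.
data_table = sample_soup.find("table", class_="wikitable")

first_classes = first_table.get("class")
data_classes = data_table.get("class")
```

If the class filter finds nothing, `find` returns `None`, so checking for that before parsing gives a clearer error than a later `AttributeError`.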

#### Retrieve headers and row contents, remove newline (\n) characters

In [20]:
headers = [h.text.replace('\n', '') for h in table.find_all('th')]
rows = table.find_all('tr')

row_data = []

for row in rows[1:]:  # first row is the header
    row_contents = [c.text.replace('\n', '') for c in row.find_all('td')]
    row_data.append(row_contents)
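
The extraction step can be tried without a network connection. The sketch below wraps the same idea in a hypothetical helper, `parse_table`, and runs it on a small made-up table. It uses `get_text(strip=True)` rather than `replace('\n', '')`, which also trims other surrounding whitespace, and it skips rows that contain no `<td>` cells instead of slicing off the first row.

```python
from bs4 import BeautifulSoup


def parse_table(table):
    """Return (headers, rows) for a bs4 <table> tag. Hypothetical helper."""
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row holds only <th> cells, so it yields no cells
            rows.append(cells)
    return headers, rows


# Made-up sample; each "\n" becomes a real newline inside the cell text.
sample_html = """
<table>
  <tr><th>Postal Code\n</th><th>Borough\n</th><th>Neighborhood\n</th></tr>
  <tr><td>M1A\n</td><td>Not assigned\n</td><td>\n</td></tr>
  <tr><td>M3A\n</td><td>North York\n</td><td>Parkwoods\n</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
sample_headers, sample_rows = parse_table(sample_table)
```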

#### Create a pandas DataFrame from the above contents:

In [21]:
df = pd.DataFrame(data=row_data, columns=headers)
df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


#### Remove 'Not assigned' boroughs

In [22]:
df = df.loc[df['Borough'] != 'Not assigned']

In [23]:
df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
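
The filtered frame keeps the original row labels, which is why the index above starts at 2. A minimal sketch on a toy frame shows the same boolean filter with two optional additions: `.copy()`, so later in-place changes do not touch a slice of the original, and `reset_index(drop=True)`, so the labels start again at 0. The variable names here are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough, as an independent copy.
assigned = sample.loc[sample["Borough"] != "Not assigned"].copy()

# Optional: renumber the rows from 0 and discard the old labels.
assigned = assigned.reset_index(drop=True)
```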


#### Let's see whether any Neighborhood name appears in more than one row...

In [24]:
df['Neighborhood'].value_counts()

Downsview                                                4
Don Mills                                                2
Willowdale                                               2
Toronto Dominion Centre, Design Exchange                 1
Rosedale                                                 1
                                                        ..
Dorset Park, Wexford Heights, Scarborough Town Centre    1
Weston                                                   1
Malvern, Rouge                                           1
Cliffside, Cliffcrest, Scarborough Village West          1
North Park, Maple Leaf Park, Upwood Park                 1
Name: Neighborhood, Length: 98, dtype: int64
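
`value_counts()` on `Neighborhood` shows that a name repeats, but not whether the repeats fall in different boroughs. One way to check that directly is to count distinct boroughs per name, sketched here on made-up rows (all names and codes below are placeholders, not values from the Wikipedia table).

```python
import pandas as pd

sample = pd.DataFrame({
    "Postal Code": ["X1A", "X1B", "X2A", "X3A"],
    "Borough": ["Borough A", "Borough A", "Borough B", "Borough C"],
    "Neighborhood": ["Repeated Name", "Repeated Name", "Example Park", "Example Park"],
})

# Number of distinct boroughs in which each neighborhood name occurs.
boroughs_per_name = sample.groupby("Neighborhood")["Borough"].nunique()

# Names that occur in more than one borough.
multi_borough = boroughs_per_name[boroughs_per_name > 1]
```

In this toy frame only `Example Park` spans two boroughs, while `Repeated Name` repeats within a single one.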

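Some names repeat: Downsview appears four times, and Don Mills and Willowdale twice each. As a simple, if not the most rigorous, fix, I will drop duplicate Borough and Neighborhood combinations and keep the first row of each. This should be harmless as long as each repeated neighborhood is simply split across several postal code areas, although the postal codes of the dropped rows are lost.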

In [25]:
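# Reassign instead of using inplace=True: df is a filtered result, and
# modifying it in place can raise a SettingWithCopyWarning.
df = df.drop_duplicates(subset=['Borough', 'Neighborhood'])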
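
Dropping duplicates keeps only the first postal code of each repeated neighborhood. As a hedged alternative, not what this notebook does, the rows could be merged so that every postal code is retained. The sketch below compares both on placeholder data.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postal Code": ["X1A", "X1B", "X2A"],
    "Borough": ["Borough A", "Borough A", "Borough B"],
    "Neighborhood": ["Split Name", "Split Name", "Single Name"],
})

# Dropping duplicates keeps the first row of each (Borough, Neighborhood) pair.
deduped = sample.drop_duplicates(subset=["Borough", "Neighborhood"])

# Alternative: one row per pair, with all postal codes joined into one string.
merged = (
    sample.groupby(["Borough", "Neighborhood"], as_index=False, sort=False)["Postal Code"]
    .agg(", ".join)
)
```

The merged variant changes the meaning of the `Postal Code` column from a single code to a list of codes, so any later lookup by postal code would need to account for that.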

Every neighborhood name now appears exactly once:

In [26]:
df['Neighborhood'].value_counts()

Toronto Dominion Centre, Design Exchange             1
Harbourfront East, Union Station, Toronto Islands    1
Westmount                                            1
Woodbine Heights                                     1
Islington Avenue                                     1
                                                    ..
Weston                                               1
Malvern, Rouge                                       1
Cliffside, Cliffcrest, Scarborough Village West      1
Victoria Village                                     1
North Park, Maple Leaf Park, Upwood Park             1
Name: Neighborhood, Length: 98, dtype: int64
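
Because the `value_counts()` output is truncated, reading it by eye cannot rule out a remaining duplicate. A minimal sketch of a programmatic check, using a hypothetical helper `has_duplicate_pairs` and placeholder data:

```python
import pandas as pd


def has_duplicate_pairs(frame, columns=("Borough", "Neighborhood")):
    """Return True if any combination of `columns` occurs in more than one row."""
    return bool(frame.duplicated(subset=list(columns)).any())


before = pd.DataFrame({
    "Borough": ["Borough A", "Borough A", "Borough B"],
    "Neighborhood": ["Split Name", "Split Name", "Single Name"],
})
after = before.drop_duplicates(subset=["Borough", "Neighborhood"])

before_has_duplicates = has_duplicate_pairs(before)
after_has_duplicates = has_duplicate_pairs(after)

# Stricter check: every neighborhood name is unique on its own.
names_unique_after = after["Neighborhood"].is_unique
```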

#### Save the DataFrame to a CSV file that will be imported in the second notebook

In [27]:
df.to_csv('Toronto_PostalCodeData.csv', index=False)
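
With `index=False` the row labels are not written, so the file has only the three data columns and a reader gets a fresh 0-based index. The sketch below illustrates this with an in-memory buffer in place of a file, using two rows taken from the table above.

```python
import io

import pandas as pd

sample = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M5A"],
        "Borough": ["North York", "Downtown Toronto"],
        "Neighborhood": ["Parkwoods", "Regent Park, Harbourfront"],
    },
    index=[2, 4],  # non-contiguous labels, as left behind by the filtering step
)

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read it back the way a second notebook might.
restored = pd.read_csv(io.StringIO(csv_text))
header_line = csv_text.splitlines()[0]
```

Values containing commas, such as `Regent Park, Harbourfront`, are quoted on write and restored as a single field on read.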

#### Inspect the number of rows per borough and the DataFrame's shape

In [28]:
df['Borough'].value_counts()

Downtown Toronto    19
North York          19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East York            5
East Toronto         5
York                 5
Mississauga          1
Name: Borough, dtype: int64

In [29]:
df.shape

(98, 3)
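
Since every row belongs to exactly one borough, the per-borough counts should add up to the number of rows reported by `df.shape`. A quick arithmetic check using the counts printed above:

```python
# Counts copied from the df['Borough'].value_counts() output.
# York and East York share the count 5; the dict keys keep them distinct.
borough_counts = {
    "Downtown Toronto": 19,
    "North York": 19,
    "Scarborough": 17,
    "Etobicoke": 12,
    "Central Toronto": 9,
    "West Toronto": 6,
    "East York": 5,
    "East Toronto": 5,
    "York": 5,
    "Mississauga": 1,
}

total_rows = sum(borough_counts.values())
matches_shape = total_rows == 98  # 98 is the row count from df.shape
```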