# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto - Part 1

### For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes.
3. Transform the data into a pandas dataframe.


#### This notebook uses the Beautiful Soup package to parse HTML. More information: http://beautiful-soup-4.readthedocs.io/en/latest/

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import requests # library to handle requests
from urllib.request import urlopen
from bs4 import BeautifulSoup # BeautifulSoup library to parse HTML input


In [2]:
# Extract data from Wikipedia
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(wiki_url).text


In [3]:
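# Parse the HTML already downloaded with requests; naming the parser avoids a BeautifulSoup warning
soup = BeautifulSoup(page, 'html.parser')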


In [4]:
# Build the lists behind a three-column (PostalCode, Borough, Neighborhood) dataframe.
codes_list = []
borough_list = []
neighborhood_list = []
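
# Walk the table cells in order; every three consecutive cells form one table row.
for i, tag in enumerate(soup.table.find_all('td')):
    column = i % 3  # 0 = postal code, 1 = borough, 2 = neighborhood
    if column == 0:
        codes_list.append(tag.text)
    elif column == 1:
        borough_list.append(tag.text)
    else:
        neighborhood_list.append(tag.text)
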
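The loop above relies on every table row having exactly three cells. As a hedged alternative, the sketch below walks the table row by row, so a row with missing cells cannot shift later values into the wrong column. The `sample_html` string and the `parse_rows` helper are made up for illustration and stand in for the real Wikipedia table.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; not the real page content.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td>M4A</td><td>North York</td></tr>
</table>
"""

def parse_rows(html):
    """Return one (code, borough, neighborhood) tuple per complete table row."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.table.find_all('tr'):
        # get_text(strip=True) also removes the trailing newline in each cell
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if len(cells) == 3:  # skips the header row and incomplete rows
            rows.append(tuple(cells))
    return rows

rows = parse_rows(sample_html)
print(rows)
```

In this sample the header row and the two-cell row are skipped, so only complete rows reach the result.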

In [5]:
# Transform the data into a pandas DataFrame
toronto_df = pd.DataFrame(columns = ['PostalCode', 'Borough', 'Neighborhood'])
toronto_df['PostalCode'] = codes_list
toronto_df['Borough'] = borough_list
toronto_df['Neighborhood'] = neighborhood_list
toronto_df.head(15) # Showing the first 15 rows

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n
5,M6A,North York,Lawrence Heights\n
6,M6A,North York,Lawrence Manor\n
7,M7A,Downtown Toronto,Queen's Park\n
8,M8A,Not assigned,Not assigned\n
9,M9A,Queen's Park,Not assigned\n


In [6]:
# Clean the Neighborhood column by replacing the newline character (\n) with a space
toronto_df = toronto_df.replace('\n', ' ', regex = True)
toronto_df.head(15)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Queen's Park,Not assigned


In [7]:
# Ignore cells with a borough that is Not assigned.
toronto_df.drop(toronto_df.index[toronto_df['Borough'] == 'Not assigned'], inplace = True)
# Reset the index, dropping the previous one
toronto_df = toronto_df.reset_index(drop = True)

toronto_df.head(15)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
5,M7A,Downtown Toronto,Queen's Park
6,M9A,Queen's Park,Not assigned
7,M1B,Scarborough,Rouge
8,M1B,Scarborough,Malvern
9,M3B,North York,Don Mills North


In [8]:
# Combine the neighborhoods that share the same PostalCode and Borough
toronto_df = toronto_df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(','.join).reset_index()
toronto_df.columns = ['PostalCode', 'Borough', 'Neighborhood']
toronto_df.head(15)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge ,Malvern"
1,M1C,Scarborough,"Highland Creek ,Rouge Hill ,Port Union"
2,M1E,Scarborough,"Guildwood ,Morningside ,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park ,Ionview ,Kennedy Park"
7,M1L,Scarborough,"Clairlea ,Golden Mile ,Oakridge"
8,M1M,Scarborough,"Cliffcrest ,Cliffside ,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff ,Cliffside West"


In [9]:
# Use .str.strip() to remove whitespace at the start and end of each string (spaces before the commas inside a string remain)
toronto_df['Neighborhood'] = toronto_df['Neighborhood'].str.strip()
toronto_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge ,Malvern"
1,M1C,Scarborough,"Highland Creek ,Rouge Hill ,Port Union"
2,M1E,Scarborough,"Guildwood ,Morningside ,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park ,Ionview ,Kennedy Park"
7,M1L,Scarborough,"Clairlea ,Golden Mile ,Oakridge"
8,M1M,Scarborough,"Cliffcrest ,Cliffside ,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff ,Cliffside West"
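
`.str.strip()` only trims the two ends of each combined string, which is why the spaces before the commas survive in the output above. One possible way around this, sketched on a small made-up frame, is to strip each neighborhood before joining. The `demo_df` and `combined` names and the sample rows are illustrative only.

```python
import pandas as pd

# Illustrative rows carrying the trailing space left by the newline replacement.
demo_df = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge ', 'Malvern ', 'Woburn '],
})

# Strip each value first, then join the values within each group.
demo_df['Neighborhood'] = demo_df['Neighborhood'].str.strip()
combined = (
    demo_df.groupby(['PostalCode', 'Borough'])['Neighborhood']
    .apply(', '.join)
    .reset_index()
)
print(combined)
```

Stripping before the join keeps the separator under the code's control, here a comma followed by a single space.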


In [10]:
# If a row has a borough but its neighborhood is 'Not assigned', use the borough name as the neighborhood
toronto_df.loc[toronto_df['Neighborhood'] == 'Not assigned', 'Neighborhood'] = toronto_df['Borough']
toronto_df.head(15)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge ,Malvern"
1,M1C,Scarborough,"Highland Creek ,Rouge Hill ,Port Union"
2,M1E,Scarborough,"Guildwood ,Morningside ,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park ,Ionview ,Kennedy Park"
7,M1L,Scarborough,"Clairlea ,Golden Mile ,Oakridge"
8,M1M,Scarborough,"Cliffcrest ,Cliffside ,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff ,Cliffside West"
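
The `.loc` assignment in the cell above works because pandas aligns the right-hand `Borough` series with the selected rows by index label. The sketch below, on a made-up two-row frame (`demo_df` and `missing` are hypothetical names), shows the same fill written with `Series.mask`, which some readers may find more explicit.

```python
import pandas as pd

# Illustrative data: one row with a missing neighborhood, one complete row.
demo_df = pd.DataFrame({
    'PostalCode': ['M7A', 'M1G'],
    'Borough': ["Queen's Park", 'Scarborough'],
    'Neighborhood': ['Not assigned', 'Woburn'],
})

# Where the neighborhood is 'Not assigned', take the borough name instead.
missing = demo_df['Neighborhood'] == 'Not assigned'
demo_df['Neighborhood'] = demo_df['Neighborhood'].mask(missing, demo_df['Borough'])
print(demo_df)
```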


In [11]:
toronto_df.shape

(103, 3)

In [13]:
# Saving the file in csv format for the next part of this assignment
# to_csv returns None when given a path, so the result is not assigned; index=False omits the row index
toronto_df.to_csv(r'C:\Users\tatir\OneDrive\Documentos\IT\IBM Data Science\toronto_df.csv', index = False, header = True)
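
The absolute Windows path above is specific to one machine. As a sketch of a more portable variant, the block below builds the output path with `pathlib`, writes the CSV, and reads it back to confirm the round trip. The temporary directory, the `demo_df` frame, and the `out_path` and `reloaded` names are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative frame standing in for toronto_df.
demo_df = pd.DataFrame({
    'PostalCode': ['M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge, Malvern', 'Woburn'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / 'toronto_df.csv'
    demo_df.to_csv(out_path, index=False)  # index=False omits the row index
    reloaded = pd.read_csv(out_path)       # read back while the file still exists

print(reloaded)
```

In a real notebook the temporary directory would likely be replaced by a project folder, for example `Path('data')`, so the file is still there for the next part of the assignment.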

In [14]:
print(toronto_df)

    PostalCode      Borough                                       Neighborhood
0          M1B  Scarborough                                     Rouge ,Malvern
1          M1C  Scarborough             Highland Creek ,Rouge Hill ,Port Union
2          M1E  Scarborough                  Guildwood ,Morningside ,West Hill
3          M1G  Scarborough                                             Woburn
4          M1H  Scarborough                                          Cedarbrae
..         ...          ...                                                ...
98         M9N         York                                             Weston
99         M9P    Etobicoke                                          Westmount
100        M9R    Etobicoke  Kingsview Village ,Martin Grove Gardens ,Richv...
101        M9V    Etobicoke  Albion Gardens ,Beaumond Heights ,Humbergate ,...
102        M9W    Etobicoke                                          Northwest

[103 rows x 3 columns]
