# <center> Segmenting and Clustering Neighborhoods in Toronto </center>
------

For this project, I explored and clustered the neighborhoods in Toronto.

The Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M was scraped to obtain the table of postal codes, which was then loaded into a pandas dataframe like the one shown below:

-----------
### Package requirements

- pandas


In [1]:

# !conda install -c anaconda pandas


The dataframe will consist of three columns: Postal Code, Borough, and Neighborhood.

_There are different website scraping libraries and packages in Python. For scraping the above table, I simply use pandas to read the table into a pandas dataframe._

_Another approach, which is worth learning for more complicated web scraping cases, is the BeautifulSoup package. Here is the package's main documentation page: http://beautiful-soup-4.readthedocs.io/en/latest/_
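As a rough illustration of the BeautifulSoup route mentioned above, here is a minimal sketch that parses a small table into a dataframe. The `sample_html` string is a made-up stand-in for the downloaded page (the real markup on Wikipedia may differ), and the names `sample_html`, `records`, and `sample_df` are my own assumptions, not part of the notebook.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the page source; in practice the HTML would be
# downloaded from the Wikipedia URL first.
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table")

# The first row holds the header cells; the remaining rows hold the data.
rows = table.find_all("tr")
columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [
    [td.get_text(strip=True) for td in row.find_all("td")]
    for row in rows[1:]
]

sample_df = pd.DataFrame(records, columns=columns)
print(sample_df)
```

This takes more code than `pd.read_html`, but it gives direct control over each cell, which can help when a page's tables are irregular.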

In [2]:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
toronto_data = pd.read_html(url)
toronto_data = toronto_data[0]
toronto_data.head(12)

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 7 | M8A | Not assigned | Not assigned |
| 8 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 9 | M1B | Scarborough | Malvern, Rouge |


### Data Cleaning
I need only the cells that have an assigned borough, so I ignore cells whose borough is Not assigned.

In [3]:
# Let's clean the table:
# use the postal code as the index and drop every row whose Borough is "Not assigned"
toronto_data = toronto_data.set_index('Postal Code')
toronto_data1 = toronto_data[toronto_data.Borough != 'Not assigned']
toronto_data1.head(10)


| Postal Code | Borough | Neighborhood |
|---|---|---|
| M3A | North York | Parkwoods |
| M4A | North York | Victoria Village |
| M5A | Downtown Toronto | Regent Park, Harbourfront |
| M6A | North York | Lawrence Manor, Lawrence Heights |
| M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| M1B | Scarborough | Malvern, Rouge |
| M3B | North York | Don Mills |
| M4B | East York | Parkview Hill, Woodbine Gardens |
| M5B | Downtown Toronto | Garden District, Ryerson |


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows are combined into one row with the neighborhoods separated by a comma, as shown in row 3 of the table above.
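If the scraped table still listed a postal code on several rows, the combining step described above could be sketched with a `groupby`. This is only an illustration on made-up toy rows; the names `sample` and `combined` are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy data (assumed for illustration): M5A appears on two separate rows.
sample = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group rows that share a postal code and borough, then join their
# neighborhoods into one comma-separated string.
# sort=False keeps the groups in order of first appearance.
combined = (
    sample.groupby(["Postal Code", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined)
```

Grouping on both the postal code and the borough keeps the borough column in the result without needing a separate aggregation for it.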

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
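The rule above could be applied with a boolean mask, roughly as follows. The rows here are invented toy data, and `sample` and `missing` are hypothetical names used only for this sketch.

```python
import pandas as pd

# Toy data (assumed for illustration): one row has a borough but no neighborhood.
sample = pd.DataFrame(
    {
        "Borough": ["Queen's Park", "North York"],
        "Neighborhood": ["Not assigned", "Parkwoods"],
    },
    index=pd.Index(["M7A", "M3A"], name="Postal Code"),
)

# Mark the rows whose neighborhood is missing, then copy the borough into them.
missing = sample["Neighborhood"] == "Not assigned"
sample.loc[missing, "Neighborhood"] = sample.loc[missing, "Borough"]
print(sample)
```

Rows that already have a neighborhood are left untouched, so the step is safe to run even when no neighborhood is missing.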


In [4]:
# Let's check the Neighborhood column to see if any "Not assigned" value exists
toronto_data1[toronto_data1.Neighborhood == 'Not assigned']

| Postal Code | Borough | Neighborhood |
|---|---|---|


I checked for any Not assigned neighborhood, and the empty result above shows there are none.

In [5]:
toronto_data1.shape

(103, 2)

------
This notebook was created by [Maxwell Ihiaso](https://github.com/Maxwell-ihiaso). I hope you found this lab interesting and educational. Feel free to contact me if you have any questions!