## Segmenting and Clustering Neighborhoods in Toronto

In this project I explore, segment, and cluster the neighborhoods in the city of Toronto. The postal code data comes from [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

First, I scrape the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) and load its postal code table into a pandas DataFrame.

In [95]:
import requests
from bs4 import BeautifulSoup
import geocoder
import pandas as pd

In [60]:
# Download the page HTML and parse it with the lxml parser.
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(html, 'lxml')
# print(soup.prettify())

In [88]:
rows = []
table = soup.find("table", {"class": "wikitable sortable"})

for row in table.find_all("tr"):
    cells = row.find_all("td")
    # Data rows have exactly three cells: postal code, borough, neighborhood.
    if len(cells) != 3:
        continue
    postcode, borough, neighborhood = (cell.get_text(strip=True) for cell in cells)
    # Skip postal codes that have no assigned borough.
    if borough == 'Not assigned':
        continue
    # A neighborhood that is not assigned takes the borough's name.
    if neighborhood == 'Not assigned':
        neighborhood = borough
    rows.append({'PostalCode': postcode, 'Borough': borough, 'Neighborhood': neighborhood})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows.
df = pd.DataFrame(rows, columns=['PostalCode', 'Borough', 'Neighborhood'])

In [89]:
# Combine neighborhoods that share a postal code into one comma-separated row.
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].agg(', '.join).reset_index()
df.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


The dataframe above is built according to these rules:
* It has three columns: **PostalCode, Borough, and Neighborhood**.
* Only cells with an assigned borough are processed; cells whose borough is *Not assigned* are ignored.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, *M5A* is listed twice with two neighborhoods: *Harbourfront* and *Regent Park*. Such rows are combined into one row, with the neighborhoods separated by commas.
* If a cell has a borough but a *Not assigned* neighborhood, the neighborhood is set to the borough name.
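
The rules above can also be expressed as vectorized pandas operations rather than a row-by-row loop. The following is a minimal sketch on a handful of toy rows, not the scraping code used in this notebook; the names `raw`, `clean`, and `unassigned` are illustrative assumptions.

```python
import pandas as pd

# Toy rows standing in for the scraped table (illustrative sample, not the full page).
raw = pd.DataFrame(
    [
        ("M1A", "Not assigned", "Not assigned"),
        ("M5A", "Downtown Toronto", "Harbourfront"),
        ("M5A", "Downtown Toronto", "Regent Park"),
        ("M7A", "Queen's Park", "Not assigned"),
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Rule: ignore rows whose borough is not assigned.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule: an unassigned neighborhood takes the borough's name.
unassigned = clean["Neighborhood"] == "Not assigned"
clean.loc[unassigned, "Neighborhood"] = clean.loc[unassigned, "Borough"]

# Rule: combine neighborhoods that share a postal code into one row.
clean = (
    clean.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(clean)
```

On this sample, *M1A* is dropped, the two *M5A* rows collapse into a single row, and *M7A* takes its borough name as its neighborhood.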

In [92]:
df.shape

(103, 3)

In [103]:
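# Load the latitude/longitude of each postal code and align the key column name with df.
long_lat = pd.read_csv("http://cocl.us/Geospatial_data")
long_lat.rename(columns={'Postal Code':'PostalCode'}, inplace=True)

# A left merge keeps every postal code in df and attaches its coordinates.
toronto_data = pd.merge(df, long_lat, on = 'PostalCode', how = 'left')
toronto_data.head(10)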

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |


In [102]:
toronto_data.shape

(103, 5)
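
Because the merge uses `how='left'`, a postal code with no match in the coordinates file would keep its row but receive `NaN` coordinates, so the shape alone would not reveal a mismatch. The following is a minimal sketch of one way to check for that, using toy frames rather than the notebook's data; `codes`, `coords`, and the postal code `M9Z` are made up for illustration.

```python
import pandas as pd

# Toy frames for illustration; "M9Z" is a made-up code with no coordinates.
codes = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M9Z"]})
coords = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

# indicator=True adds a _merge column naming which side each row came from.
merged = pd.merge(codes, coords, on="PostalCode", how="left", indicator=True)

# Rows marked "left_only" had no match in the coordinates table.
missing = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```

If the same check on `df` and `long_lat` gives an empty list, every postal code received coordinates.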