<h1>Segmenting and Clustering Neighborhoods in Toronto</h1>

### Webscraping and loading of information
First, we install the packages required for web scraping:

In [1]:
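# --yes answers conda's confirmation prompt, which a notebook cell cannot respond to
!conda install --yes beautifulsoup4
!conda install --yes lxml
!conda install --yes html5lib
!conda install --yes requests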

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.



Then we load the packages that are going to be used:

In [2]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import re

Next, we download the HTML of the Wikipedia page and parse it so that we can extract only the information that is in the table:

In [12]:
# Download the page and parse its HTML
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

# soup.find returns the first <table> element on the page
table = soup.find('table')

We extract the cells of each table row into a list of dictionaries and convert that list into a pandas DataFrame:

In [13]:
# Empty list for loading data from the website
post_codes = []

# [2:] skips the first two <tr> rows of the table
for tr in table.find_all('tr')[2:]:
    cells = tr.find_all('td')  # look up the row's cells once
    post_codes.append({
            'PostalCode': cells[0].text,
            'Borough': cells[1].text,
            # Strip the trailing newline from the last cell
            'Neighborhood': re.sub('\n$', '', cells[2].text)
            })

post_codes = pd.DataFrame(post_codes)
post_codes.head(12)

Unnamed: 0,Borough,Neighborhood,PostalCode
0,Not assigned,Not assigned,M2A
1,North York,Parkwoods,M3A
2,North York,Victoria Village,M4A
3,Downtown Toronto,Harbourfront,M5A
4,North York,Lawrence Heights,M6A
5,North York,Lawrence Manor,M6A
6,Downtown Toronto,Queen's Park,M7A
7,Not assigned,Not assigned,M8A
8,Queen's Park,Not assigned,M9A
9,Scarborough,Rouge,M1B
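
The loop above skips rows by position. As a minimal sketch of an alternative, the rows could be selected by content instead, keeping only those that contain exactly three `<td>` cells. The `SAMPLE_HTML` string and the `parse_rows` helper are hypothetical stand-ins for illustration; the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

def parse_rows(html):
    """Return one dict per data row, skipping rows without three <td> cells."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find('table').find_all('tr'):
        # Header rows use <th>, so they yield an empty list here
        cells = [td.get_text().strip() for td in tr.find_all('td')]
        if len(cells) != 3:
            continue
        rows.append(dict(zip(['PostalCode', 'Borough', 'Neighborhood'], cells)))
    return rows

rows = parse_rows(SAMPLE_HTML)
print(rows)
```

Selecting by cell count does not depend on how many header rows the table has, at the cost of silently dropping malformed rows.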


Next, we clean the dataset:
- Filter out the rows without an assigned __Borough__

In [14]:
# ~ inverts the boolean mask, keeping rows whose Borough is assigned
post_codes = post_codes[~post_codes['Borough'].str.contains("Not assigned")]
post_codes.head(12)

Unnamed: 0,Borough,Neighborhood,PostalCode
1,North York,Parkwoods,M3A
2,North York,Victoria Village,M4A
3,Downtown Toronto,Harbourfront,M5A
4,North York,Lawrence Heights,M6A
5,North York,Lawrence Manor,M6A
6,Downtown Toronto,Queen's Park,M7A
8,Queen's Park,Not assigned,M9A
9,Scarborough,Rouge,M1B
10,Scarborough,Malvern,M1B
12,North York,Don Mills North,M3B
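
To make the filtering step concrete, here is a small sketch on a toy DataFrame. The three rows are chosen for illustration only, and an exact comparison with `==` is used here in place of `str.contains`, which is an assumption about how strict the match should be.

```python
import pandas as pd

# Toy rows modelled on the scraped table (values chosen for illustration).
df = pd.DataFrame({
    'PostalCode': ['M2A', 'M3A', 'M9A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Boolean mask: True where the borough is missing
no_borough = df['Borough'] == 'Not assigned'

# ~ inverts the mask, so only rows with a borough are kept
assigned = df[~no_borough].reset_index(drop=True)
print(assigned)
```

Calling `reset_index(drop=True)` is optional; it renumbers the rows so that the gaps left by the removed rows disappear.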


- For those rows without an assigned __Neighborhood__, use the name of the corresponding __Borough__

In [15]:
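# Use .loc rather than chained indexing so the assignment reaches post_codes itself
unassigned = post_codes['Neighborhood'].str.contains("Not assigned")
post_codes.loc[unassigned, 'Neighborhood'] = np.nan
post_codes['Neighborhood'] = post_codes['Neighborhood'].fillna(post_codes['Borough'])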
post_codes.head(12)

Unnamed: 0,Borough,Neighborhood,PostalCode
1,North York,Parkwoods,M3A
2,North York,Victoria Village,M4A
3,Downtown Toronto,Harbourfront,M5A
4,North York,Lawrence Heights,M6A
5,North York,Lawrence Manor,M6A
6,Downtown Toronto,Queen's Park,M7A
8,Queen's Park,Queen's Park,M9A
9,Scarborough,Rouge,M1B
10,Scarborough,Malvern,M1B
12,North York,Don Mills North,M3B
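
The same replacement can be sketched in a single step with `Series.mask`, which swaps in values from another Series wherever a condition holds. This is an alternative formulation on toy data, not the approach used in the cell above.

```python
import pandas as pd

# Toy rows for illustration
df = pd.DataFrame({
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

# Where the condition is True, take the value from Borough instead
df['Neighborhood'] = df['Neighborhood'].mask(
    df['Neighborhood'] == 'Not assigned', df['Borough'])
print(df)
```

This avoids the intermediate `NaN` values, since the borough name is substituted directly.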


- Combine all the __Neighborhoods__ that share a __Postal Code__ into a single comma-separated row

In [17]:
# set() removes duplicate names but does not guarantee their order
post_codes = post_codes.groupby(["PostalCode", "Borough"], as_index=False)['Neighborhood'].agg(lambda x : ', '.join(set(x)))
post_codes.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Highland Creek, Port Union"
2,M1E,Scarborough,"West Hill, Morningside, Guildwood"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Oakridge, Clairlea"
8,M1M,Scarborough,"Cliffside, Scarborough Village West, Cliffcrest"
9,M1N,Scarborough,"Cliffside West, Birch Cliff"
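
Because a `set` has no defined order, the joined names can come out in a different order from run to run. A minimal sketch of a reproducible variant, assuming alphabetical order is acceptable, wraps the set in `sorted`. The toy rows below are for illustration.

```python
import pandas as pd

# Toy rows: two neighborhoods share the postal code M1B
df = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# sorted(set(...)) drops duplicates and fixes the order of the joined names
grouped = (
    df.groupby(['PostalCode', 'Borough'], as_index=False)['Neighborhood']
      .agg(lambda names: ', '.join(sorted(set(names))))
)
print(grouped)
```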


We then check the DataFrame's shape:

In [18]:
#Dimensions of DataFrame
post_codes.shape

(103, 3)
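
Beyond the shape, a few other properties are worth checking after cleaning. The sketch below uses a hypothetical `summarize` helper on a two-row stand-in for the grouped table; it reports the shape, whether each postal code appears only once, and how many rows still contain "Not assigned".

```python
import pandas as pd

# Toy stand-in for the grouped table (two rows instead of 103)
grouped = pd.DataFrame({
    'PostalCode': ['M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Malvern, Rouge', 'Woburn'],
})

def summarize(frame):
    """Return a few facts worth checking after cleaning (hypothetical helper)."""
    return {
        'shape': frame.shape,
        'unique_codes': frame['PostalCode'].is_unique,
        # Number of rows where any column still reads "Not assigned"
        'unassigned_left': int((frame == 'Not assigned').any(axis=1).sum()),
    }

summary = summarize(grouped)
print(summary)
```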

Obtain the latitude and longitude of each __Postal Code__:

In [20]:
geo = 'http://cocl.us/Geospatial_data'

geo_data = pd.read_csv(geo)
geo_data = geo_data.rename(columns={'Postal Code': 'PostalCode'})
geo_data.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Join both DataFrames on __Postal Code__:

In [23]:
toronto_codes = pd.merge(post_codes, geo_data, on=['PostalCode'])
toronto_codes.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Highland Creek, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"West Hill, Morningside, Guildwood",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
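
By default `pd.merge` performs an inner join, so a postal code missing from either table is dropped without notice. As a sketch of how such cases could be surfaced, the example below uses a left join with `indicator=True` and `validate='one_to_one'`. The code `M9Z` is a made-up value included only to show an unmatched row.

```python
import pandas as pd

# Toy tables; 'M9Z' is a hypothetical code with no coordinates
codes = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Etobicoke'],
})
geo = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# how='left' keeps every row of codes; validate raises if a key is duplicated;
# indicator adds a _merge column saying where each row was found
merged = pd.merge(codes, geo, on='PostalCode', how='left',
                  validate='one_to_one', indicator=True)

# Postal codes that found no match in the coordinates table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

If `unmatched` is empty, the inner join used above loses no rows from the postal code table.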
