## Scraping a Wikipedia page to get neighborhoods in Toronto and segmenting them into clusters

__1. First step__ is to install the required packages and import the necessary libraries.    
Web scraping is done with the BeautifulSoup library, using the lxml parser.

In [1]:
!pip install beautifulsoup4

Requirement not upgraded as not directly required: beautifulsoup4 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages


In [2]:
!pip install lxml

Requirement not upgraded as not directly required: lxml in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages


In [3]:
import lxml
import pandas as pd
import requests
from bs4 import BeautifulSoup

Using the requests and BeautifulSoup libraries, the HTML of the page is downloaded and parsed. Uncomment the prettify line to print it with indentation.

In [4]:
website_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(website_link).text

soup = BeautifulSoup(source, 'lxml')
#print(soup.prettify())

In [5]:
#Let's see if we got the right article:
title = soup.title.text
print(title.split(' - ')[0])

List of postal codes of Canada: M


__Next__, let's get the information from the table in the Wikipedia article.

In [6]:
#get the info from the table
table = soup.find('table', class_='wikitable sortable')
#print(table.prettify())
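
One caveat with the lookup above: a multi-word value passed to class_ is compared against the class attribute as an exact string, so it stops matching if the page lists the same classes in a different order. The sketch below shows this on a tiny made-up table (the markup and the names demo_soup, exact_match and css_match are illustrative, and html.parser is used only so the sketch runs without lxml); a CSS selector is one possible alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article's markup; the class order is
# reversed on purpose to show the difference between the two lookups.
html = '<table class="sortable wikitable"><tr><td>M3A</td></tr></table>'

# html.parser is an assumption made here to avoid depending on lxml.
demo_soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: misses the reordered classes.
exact_match = demo_soup.find("table", class_="wikitable sortable")

# CSS selector: matches a table carrying both classes, in any order.
css_match = demo_soup.select_one("table.wikitable.sortable")

print(exact_match is None)
print(css_match is not None)
```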

__Next__, let's put the information from each table row into a list, which we will then transform into a dataframe with the right column names.

Meanwhile, rows without an assigned borough are removed, as are any rows whose borough is 0.

In [7]:
info_list = []
for row in table.find_all('tr'):
    #collect the stripped text of every cell in the row
    data = row.find_all('td')
    info_list.append([i.text.strip() for i in data])

column_names = ['PostalCode', 'Borough', 'Neighborhood']
#skip the header row when building the dataframe
neighborhoods_df = pd.DataFrame(columns=column_names, data=info_list[1:])

#drop rows without an assigned borough
neighborhoods_df = neighborhoods_df[neighborhoods_df.Borough != 'Not assigned']
neighborhoods_df = neighborhoods_df[neighborhoods_df.Borough != 0]
neighborhoods_df.reset_index(inplace=True, drop=True)

neighborhoods_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
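
To make the row extraction and filtering easier to follow, here is a minimal sketch that runs the same steps on a small inline table instead of the live article. The HTML snippet and the names demo_table, rows and demo_df are illustrative, and html.parser is used only so the sketch runs without lxml.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Illustrative markup mimicking the shape of the article's table.
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M7A</td><td>Queen's Park</td><td>Not assigned</td></tr>
</table>
"""
demo_table = BeautifulSoup(html, "html.parser").find("table")

rows = []
for tr in demo_table.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)

demo_df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])

# Keep only the rows that have an assigned borough.
demo_df = demo_df[demo_df["Borough"] != "Not assigned"].reset_index(drop=True)
print(demo_df)
```

Skipping empty rows explicitly, rather than slicing off the first list element, is a design choice of this sketch; it does not depend on the header being the first row.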


Let's look at some information about the current data:

In [8]:
print('There are:')
print('  {} Postal codes'.format(neighborhoods_df['PostalCode'].unique().shape[0]))
print('  {} Boroughs'.format(neighborhoods_df['Borough'].unique().shape[0]))
print('  {} Neighborhoods'.format(neighborhoods_df['Neighborhood'].unique().shape[0]))

There are:
  103 Postal codes
  11 Boroughs
  209 Neighborhoods


However, some neighborhoods still have a 'Not assigned' value. These are replaced with the name of their borough.

In [9]:
neighborhoods_df.loc[neighborhoods_df.Neighborhood == 'Not assigned', 'Neighborhood'] = neighborhoods_df.Borough
neighborhoods_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
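
The replacement above relies on pandas aligning the assigned values by index. A minimal sketch on a two-row toy dataframe (the name fill_df and the explicit mask are illustrative) makes the selection on both sides visible:

```python
import pandas as pd

# Toy data shaped like the scraped dataframe.
fill_df = pd.DataFrame({
    "PostalCode": ["M6A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Lawrence Heights", "Not assigned"],
})

# Boolean mask of the rows whose neighborhood is missing.
mask = fill_df["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood column for those rows only.
fill_df.loc[mask, "Neighborhood"] = fill_df.loc[mask, "Borough"]

print(fill_df["Neighborhood"].tolist())
```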


Since there are more neighborhoods than postal codes, we combine the rows with the same postal code into one, separating the neighborhoods with a comma in the last column.

In [10]:
neighborhoods_df=neighborhoods_df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
neighborhoods_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
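
As a small illustration of the grouping step, the sketch below applies the same groupby and join to five toy rows (the names rows_df and combined are illustrative) and then checks that each postal code appears exactly once.

```python
import pandas as pd

# Toy rows: two postal codes with two neighborhoods each, one with a single one.
rows_df = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M6A", "M6A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto",
                "North York", "North York", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Regent Park",
                     "Lawrence Heights", "Lawrence Manor", "Queen's Park"],
})

# Join the neighborhood names of each (postal code, borough) group.
combined = (
    rows_df.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
print(combined["PostalCode"].is_unique)
```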


Let's see how many rows the dataframe has now. The number should match the count of unique postal codes found earlier (103).

In [11]:
neighborhoods_df.shape

(103, 3)

Now that the neighborhood data has been extracted and organized into a dataframe keyed by postal code, it's time to gather the coordinates in order to segment the neighborhoods into clusters.

The coordinates will be extracted from the following source: http://cocl.us/Geospatial_data and directly converted into a pandas dataframe.

In [12]:
location_data=pd.read_csv('http://cocl.us/Geospatial_data')
location_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [13]:
#rename the postal code column to match with the other dataframe
location_data.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)

__Next__, we'll merge both dataframes on the postal code.

In [14]:
neigh_location = pd.merge(neighborhoods_df, location_data, on='PostalCode')

In [15]:
#let's see what the new dataframe looks like
neigh_location.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
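
Note that pd.merge performs an inner join by default, so a postal code missing from either dataframe would be dropped without warning. The sketch below shows one possible way to detect that, using the how, validate and indicator parameters of pd.merge on toy data. The names left, right, checked and unmatched are illustrative, and the postal code M9Z is made up to force a mismatch.

```python
import pandas as pd

# Toy neighborhood data; M9Z is a made-up code with no coordinates.
left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})

# Toy coordinate data for the first two codes only.
right = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left join keeps every neighborhood row, validate raises an error if a
# postal code is duplicated on either side, and indicator adds a _merge
# column that records where each row came from.
checked = pd.merge(left, right, on="PostalCode", how="left",
                   validate="one_to_one", indicator=True)

# Postal codes that found no coordinates.
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```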
