# Week 3 Segmenting and Clustering Neighborhoods Part 2

This notebook scrapes the following Wikipedia page
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
to obtain the table of postal codes and transform it into a pandas dataframe using BeautifulSoup.

## Part 1: Use the BeautifulSoup package to transform the data in the table on the Wikipedia page into the above pandas dataframe

Any assumptions I am making are clearly explained in the comments of the code below.

In [1]:
# Load the required libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Locate the first table on the page with BeautifulSoup, then let pandas parse it
res = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0] 
df = pd.read_html(str(table))


# Wrangle/transform the data
# pd.read_html returns a list of DataFrames; take the first (and only) one
data = pd.DataFrame(df[0])

# Rename the columns as instructed
data = data.rename(columns={0:'PostalCode', 1:'Bourough', 2:'Neighbourhood'})

# Get rid of the first row which contained the table headers from the webpage
data = data.iloc[1:]


# Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
data = data[~data['Bourough'].str.contains('Not assigned')]


# More than one neighborhood can exist in one postal code area.
# For example, in the table on the Wikipedia page, M5A is listed twice
# and has two neighborhoods: Harbourfront and Regent Park. These two rows
# are combined into one row with the neighborhoods separated by a comma.
df2 = data.groupby(['PostalCode', 'Bourough']).apply(lambda group: ', '.join(group['Neighbourhood']))


# Convert the Series back into a DataFrame and put the 'Neighbourhood' column label back in
df2=df2.to_frame().reset_index()
df2 = df2.rename(columns={0:'Neighbourhood'})

# If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
df2.loc[df2.Neighbourhood == 'Not assigned', 'Neighbourhood' ] = df2.Bourough

# Display the DataFrame
df2   

Unnamed: 0,PostalCode,Bourough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
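
The wrangling steps in the cell above (drop unassigned boroughs, combine neighbourhoods per postal code, fall back to the borough name) can be tried without touching the network. The following is a minimal sketch on a made-up four-row frame. The names `raw`, `assigned`, `combined`, and `mask` are illustrative and not part of the notebook, and it uses `groupby(...).agg(', '.join)` as one possible alternative to the `apply` call above rather than the notebook's own method.

```python
import pandas as pd

# Toy rows shaped like the scraped table (illustrative values only)
raw = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Bourough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# Keep only rows that have an assigned borough
assigned = raw[raw['Bourough'] != 'Not assigned']

# One row per postal code, with neighbourhoods joined by a comma
combined = (
    assigned.groupby(['PostalCode', 'Bourough'], as_index=False)['Neighbourhood']
    .agg(', '.join)
)

# Where the neighbourhood is unassigned, reuse the borough name
mask = combined['Neighbourhood'] == 'Not assigned'
combined.loc[mask, 'Neighbourhood'] = combined.loc[mask, 'Bourough']

print(combined)
```

With `as_index=False` the result is already a DataFrame with a `Neighbourhood` column, so the `to_frame`, `reset_index`, and `rename` steps would not be needed in this variant.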


In [2]:
# In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
df2.shape

(103, 3)
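
Part 1 asks for BeautifulSoup, while the cell above uses it only to locate the table and hands the parsing to `pd.read_html`. As a rough sketch of the other route, the rows can be pulled out cell by cell with BeautifulSoup itself. The HTML string below is a made-up stand-in for the Wikipedia table, the name `scraped` is hypothetical, and the built-in `html.parser` is used here instead of `lxml` purely to avoid an extra dependency.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML standing in for the Wikipedia table (assumption: three columns)
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

rows = []
for tr in table.find_all('tr'):
    # The header row contains only <th> cells, so it yields no <td> and is skipped
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if len(cells) == 3:
        rows.append(cells)

scraped = pd.DataFrame(rows, columns=['PostalCode', 'Bourough', 'Neighbourhood'])
print(scraped)
```

Building the frame this way sets the column names directly, so there is no header row to drop afterwards.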

## Part 2: Use the Geocoder package or the csv file to create the dataframe

In [3]:
# Using the CSV option because of limited time
!wget -O to_geo_space.csv http://cocl.us/Geospatial_data

# Read the CSV into a dataframe
gf = pd.read_csv('to_geo_space.csv')




--2019-01-05 14:51:44--  http://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 169.48.113.201
Connecting to cocl.us (cocl.us)|169.48.113.201|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cocl.us/Geospatial_data [following]
--2019-01-05 14:51:44--  https://cocl.us/Geospatial_data
Connecting to cocl.us (cocl.us)|169.48.113.201|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-01-05 14:51:45--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.25.197, 107.152.24.197
Connecting to ibm.box.com (ibm.box.com)|107.152.25.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-01-05 14:51:45--  https://ibm.ent

In [4]:
# Rename the columns so they match
gf = gf.rename(columns={'Postal Code':'PostalCode'})

# Join the two dataframes on PostalCode as instructed
df_new = pd.merge(df2, gf, on='PostalCode', how='inner')

# display the new dataframe
df_new

Unnamed: 0,PostalCode,Bourough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


In [5]:
# Check that the row count is unchanged after the merge
df_new.shape


(103, 5)
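
Comparing shapes confirms that no rows were lost, but an inner join would silently drop any postal code that has no coordinates. A hedged sketch of a stricter check, on toy data, uses the `indicator` and `validate` parameters of `pd.merge`. The frames `left` and `coords` are illustrative, and the missing `M1E` row is invented for the example.

```python
import pandas as pd

# Toy frames; the coordinates for M1E are deliberately left out (made-up scenario)
left = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M1E'],
    'Bourough': ['Scarborough', 'Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left join keeps every postal code; validate raises if a key is duplicated on either side
checked = pd.merge(left, coords, on='PostalCode', how='left',
                   validate='one_to_one', indicator=True)

# Rows marked 'left_only' had no match in the coordinates file
missing = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(missing)
```

An empty `missing` list on the real data would mean every postal code found its coordinates, which is what the unchanged row count above suggests.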