## Segmenting and Clustering Neighborhoods in Toronto, Part 2

In order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

Before we start the assignment lab, let's install all the dependencies we will need.

In [36]:
!conda install -c anaconda beautifulsoup4 --yes
!conda install -c anaconda lxml --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [37]:
# import dependencies
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [38]:
# Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the data
web_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
# Parse as HTML with the lxml parser installed above (the 'xml' parser is meant for XML documents)
soup = BeautifulSoup(web_text, 'lxml')
table = soup.find('table',{'class':'wikitable'})

In [39]:
# Get the data from the table (class='wikitable') and collect it row by row for a pandas DataFrame
table_rows = table.find_all('tr')
data = []
for row in table_rows:
    td=[]
    for t in row.find_all('td'):
        td.append(t.text.strip())
    data.append(td)

# define the dataframe columns
column_names = ['PostalCode', 'Borough', 'Neighborhood'] 

# instantiate the dataframe
df = pd.DataFrame(data, columns=column_names)

# drop rows without data cells (the header row contains only <th> elements, so its Borough is null)
df = df[~df['Borough'].isnull()]
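
The row-by-row extraction above can be tried offline on a small table. The following is a minimal sketch, assuming a hypothetical three-row HTML snippet (`sample_html`) that mimics the layout of the Wikipedia table; it uses Python's built-in `html.parser` backend so that it does not depend on lxml. The names `sample_html`, `sample_soup`, `sample_table`, and `sample_df` are illustrative only.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, used only for illustration.
sample_html = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", {"class": "wikitable"})

rows = []
for tr in sample_table.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)

sample_df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
print(sample_df)
```

Skipping empty rows while looping is an alternative to filtering null boroughs afterwards; both approaches remove the header row.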

In [40]:
# Only process the cells that have an assigned borough. Ignore cells whose borough is 'Not assigned'.
df = df[df['Borough'] != 'Not assigned']

# Combine neighborhoods that share the same postal code, separated by commas
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(lambda x: ','.join(x)).reset_index()

# If a cell has a borough but a 'Not assigned' neighborhood, use the borough as the neighborhood
not_assigned = df['Neighborhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighborhood'] = df.loc[not_assigned, 'Borough']

# The Wikipedia table appears to have been updated: neighborhoods with the same postal code are already combined, separated by /
# To comply with the assignment requirements, replace the / with a comma (,)
df['Neighborhood'] = df['Neighborhood'].str.replace('/', ',', regex=False)
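
To see what the cleaning steps do, here is a small sketch on made-up rows. The `toy` dataframe and its values are hypothetical and chosen only to exercise each rule: one row with no borough, two neighborhoods sharing a postal code, and one borough without a neighborhood.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# 1. Drop rows without an assigned borough.
toy = toy[toy["Borough"] != "Not assigned"]

# 2. Join the neighborhoods that share a postal code into one comma-separated string.
toy = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(",".join)
    .reset_index()
)

# 3. Where the neighborhood is still 'Not assigned', fall back to the borough name.
missing = toy["Neighborhood"] == "Not assigned"
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]

print(toy)
```

The boolean mask with `.loc` assigns the borough row by row, which avoids relying on how `replace` treats a Series passed as the replacement value.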

Use the .shape attribute to print the number of rows and columns of the dataframe.

In [41]:
df.shape

(103, 3)

In [42]:
import requests
import io

# in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.
url = "http://cocl.us/Geospatial_data"
request = requests.get(url).content
coordenates = pd.read_csv(io.StringIO(request.decode('utf-8')))
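
The `io.StringIO` wrapper lets `pd.read_csv` treat downloaded text as if it were a file. A minimal offline sketch follows, assuming a two-line CSV string in place of the real download; the header name `Postal Code` is an assumption about the file's layout, and `sample_csv` and `sample_coords` are illustrative names.

```python
import io

import pandas as pd

# Stand-in for the decoded response body; the header names are assumed.
sample_csv = (
    "Postal Code,Latitude,Longitude\n"
    "M1B,43.806686,-79.194353\n"
    "M1C,43.784535,-79.160497\n"
)

# StringIO exposes the string through a file-like interface that read_csv accepts.
sample_coords = pd.read_csv(io.StringIO(sample_csv))

# Rename the columns so the postal code column matches the other dataframe's key.
sample_coords.columns = ["PostalCode", "Latitude", "Longitude"]
print(sample_coords.dtypes)
```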

In [43]:
# rename the first column to allow merging dataframes on PostalCode
coordenates.columns = ['PostalCode', 'Latitude', 'Longitude']

# Merge Data
df = pd.merge(coordenates, df, on='PostalCode')

# Reorder column names and show the dataframe
df = df[['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude']]
df.head(11)

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern , Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill , Port Union , Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood , Morningside , West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | Kennedy Park , Ionview , East Birchmount Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Golden Mile , Clairlea , Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffside , Cliffcrest , Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff , Cliffside West | 43.692657 | -79.264848 |
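
Because `pd.merge` performs an inner join by default, a postal code missing from the coordinates file would be dropped silently. The sketch below shows one way to check for that, using a left join with pandas' `validate` and `indicator` options. The `neighborhoods` and `coords` dataframes are small stand-ins, and the `M9Z` row is a made-up code added only to produce a mismatch.

```python
import pandas as pd

# Small stand-in dataframes; the M9Z row is hypothetical.
neighborhoods = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
    "Neighborhood": ["Malvern , Rouge", "Rouge Hill , Port Union , Highland Creek", "Nowhere"],
})
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every neighborhood row, validate raises if a postal code
# is duplicated on either side, and indicator adds a "_merge" column.
merged = pd.merge(
    neighborhoods,
    coords,
    on="PostalCode",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Postal codes that found no coordinates are marked "left_only".
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

In the notebook itself the row count is 103 both before and after the merge, which suggests every postal code found a match.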


In [44]:
df.shape

(103, 5)