# Segmenting and Clustering Neighborhoods in Toronto - part 2

For this assignment, we will explore and cluster the neighborhoods in Toronto. Before we get the data and start exploring it, let's import the libraries that we will need.


In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import requests # library to handle requests

## 1. Download and Explore the Dataset

We are going to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes and load it into a pandas dataframe.

### 1.1 Create and Prepare the Dataframe

We are going to use `pandas` to extract all tables from the wiki page (HTML) and put them in a list of dataframes.
Then we are going to save the relevant dataframe to a csv file.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html = requests.get(url).content
df_list = pd.read_html(html) # Extract all tables from the wiki page (HTML) into a list of dataframes
toronto_PC = df_list[-3] # For this page, the relevant table is the third one from the bottom
toronto_PC.to_csv('toronto_postal_codes.csv') # Save the relevant dataframe (table) to a csv file
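Picking the table by position (`df_list[-3]`) depends on the page layout staying the same. As a hedged alternative, the sketch below selects the table by its column names instead. The helper `pick_table` is hypothetical (it is not part of pandas), and the `tables` list is toy data standing in for the list that `pd.read_html` returns.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first dataframe whose columns include all required_columns."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(set(map(str, table.columns))):
            return table
    raise ValueError(f"No table with columns {sorted(required)} found")

# Toy stand-in for the list returned by pd.read_html (not the real page content)
tables = [
    pd.DataFrame({"Province": ["Ontario"], "Prefix": ["M"]}),
    pd.DataFrame({
        "Postal Code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

postal_table = pick_table(tables, ["Postal Code", "Borough", "Neighbourhood"])
print(postal_table.shape)  # (2, 3)
```

With the real page, the same call would take `df_list` as its first argument, assuming the table still uses these three column names.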

Now, we are going to read the csv file and create our main dataframe.

In [3]:
toronto_PC = pd.read_csv('toronto_postal_codes.csv')

# Drop the saved index column ("Unnamed: 0") and rename the remaining columns,
# so the dataframe consists of three columns: PostalCode, Borough, and Neighborhood
toronto_PC = toronto_PC.drop(columns=["Unnamed: 0"]).rename(columns={"Postal Code": "PostalCode", "Neighbourhood": "Neighborhood"})
toronto_PC.head()

,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### 1.2 Process the Dataframe

In [4]:
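# Ignore rows whose borough is Not assigned
# (assign the results back instead of using inplace=True on a column selection,
# which may not update the dataframe in recent pandas versions)
toronto_PC['Borough'] = toronto_PC['Borough'].replace('Not assigned', np.nan)
toronto_PC = toronto_PC.dropna(subset=['Borough'])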
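The same filtering can also be written as a single boolean mask, without going through `NaN`. This is a minimal sketch on toy data (the `sample` dataframe is made up for illustration), not a change to the cell above.

```python
import pandas as pd

# Toy rows mirroring the structure of the postal code table
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A"],
    "Borough": ["Not assigned", "Not assigned", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods"],
})

# Keep only the rows whose borough is assigned
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```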

In [5]:
# More than one neighborhood can exist in one postal code area; rows that share a postal code are combined into one row, with the neighborhoods separated by commas
toronto_PC = toronto_PC.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
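To illustrate what the grouping does, here is a small sketch on toy data where one postal code is split across two rows. The `sample` dataframe is an assumption for illustration. In the table shown earlier the neighborhoods already appear comma-joined, so on the real data this step may leave the rows unchanged.

```python
import pandas as pd

# Toy data: postal code M5A appears on two separate rows
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M4A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Victoria Village"],
})

# Join the neighborhood names of each (PostalCode, Borough) group into one string
combined = (
    sample.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)
print(combined)
```

Note that `groupby` sorts the group keys by default, so the result is ordered by postal code rather than by original row order.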

In [6]:
# Check whether any row has a borough but a Not assigned neighborhood; if so, the neighborhood should be set to the borough name
print(toronto_PC['Neighborhood'].where(toronto_PC['Neighborhood'] == 'Not assigned'))

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
      ... 
98     NaN
99     NaN
100    NaN
101    NaN
102    NaN
Name: Neighborhood, Length: 103, dtype: object


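`where` keeps a value only where the condition holds and returns `NaN` elsewhere, so the `NaN` entries above are rows that did not match: there is no `Not assigned` neighborhood!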
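A more direct way to run this check is to count the matching rows and, if any exist, copy the borough name into the neighborhood column. The sketch below uses toy data (the `sample` dataframe is hypothetical) because our dataframe has no such rows.

```python
import pandas as pd

# Toy data with one neighborhood that is Not assigned
sample = pd.DataFrame({
    "PostalCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Boolean mask of rows that need a fallback value
unassigned = sample["Neighborhood"] == "Not assigned"
print(unassigned.sum())  # number of rows with a Not assigned neighborhood

# Use the borough name as the neighborhood for those rows
sample.loc[unassigned, "Neighborhood"] = sample.loc[unassigned, "Borough"]
print(sample)
```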

Let's check out the first 5 rows of our dataframe after processing.

In [7]:
toronto_PC.head()

,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


Let's use the `.shape` attribute to print the number of rows and columns of our dataframe.

In [8]:
toronto_PC.shape

(103, 3)

## 2. Get the Coordinates of each Neighborhood

We have built a dataframe with the postal code, borough name, and neighborhood name of each neighborhood. To use the Foursquare location data, we also need the `latitude` and `longitude` coordinates of each neighborhood.

In [9]:
# Read the csv file that has the geographical coordinates of each postal code and prepare the dataframe
url_c = 'http://cocl.us/Geospatial_data'
geospatial = pd.read_csv(url_c).rename(columns={"Postal Code": "PostalCode"})

Let's **merge** our main dataframe, which has the postal code of each neighborhood, with the `geospatial` dataframe, which has the coordinates of each postal code.

In [10]:
toronto_geo = pd.merge(toronto_PC, geospatial, how='inner', on=["PostalCode"])
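An inner merge silently drops any postal code that has no coordinates. As a hedged sanity check, the sketch below uses a left merge with the `indicator` and `validate` options of `pd.merge` to list unmatched codes and to confirm that each postal code appears only once on each side. The two small dataframes are toy data, and `M9Z` is a made-up code included only to show an unmatched row.

```python
import pandas as pd

# Toy stand-ins for the postal code and coordinate dataframes
codes = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],  # M9Z is hypothetical
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left merge keeps every postal code; "_merge" records where each row came from,
# and validate raises an error if a postal code is duplicated on either side
checked = pd.merge(codes, coords, how="left", on="PostalCode",
                   indicator=True, validate="one_to_one")
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)  # ['M9Z']
```

If `missing` comes back empty on the real data, the inner merge above loses no rows.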

Let's check out the first 5 rows of our new dataframe after merging.

In [11]:
toronto_geo.head()

,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
