# Segmenting and Clustering Neighborhoods in Toronto

In this assignment, I explore, segment, and cluster the neighborhoods in the city of Toronto.
The neighborhood data comes from a [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) that lists the postal codes, boroughs, and neighborhoods needed for the analysis. I scrape the page with [Beautiful Soup](http://beautiful-soup-4.readthedocs.io/en/latest/), clean and wrangle the data, and load it into a pandas dataframe.
I then cluster the data and plot the clusters on a map using the Folium library.

# Step 1: Data Collection
### This is done in two steps:
- Scrape the data from the website and store it in a dataframe.
- Get the latitude and longitude of each place from its postal code.

## Step 1.a
In this step, the requests and Beautiful Soup libraries are used to scrape the data from the website. The data is then formatted and stored in a dataframe with three columns: Postcode, Borough, and Neighbourhood.

In [19]:
# import requests to fetch the website and BeautifulSoup to scrape the data
import requests
from bs4 import BeautifulSoup

In [20]:
# store the URL of the data website
tor_data_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# fetch the website as HTML text
source_data = requests.get(tor_data_url).text
# parse the HTML
soup = BeautifulSoup(source_data, 'lxml')

In [21]:
# every table row is a tr element; collect them all in a list
table_row = soup.find_all('tr')

# the first element of the list is the table header
table_header = table_row[0].text
table_header = table_header.split('\n')[1:-1]
print(table_header)

# the remaining elements, excluding the last five, are the data rows of the table
table_body = table_row[1:-5]
# build a list of rows; each parsed row is appended to this list
rows = []
for row in table_body:
    row = row.text.split('\n')[1:-1]
    rows.append(row)
print(rows[0])
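
Splitting `row.text` on newlines depends on the exact whitespace in the page markup. A slightly more defensive variant reads each `td` cell directly. The sketch below is illustrative only: the HTML snippet is a small stand-in for the Wikipedia table, and `parse_rows`, `sample_header`, and `sample_rows` are hypothetical names that do not appear in the notebook.

```python
from bs4 import BeautifulSoup

# Small stand-in for the Wikipedia table (illustrative data only).
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def parse_rows(html):
    """Return (header, rows) from the first table in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    header = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)
    return header, rows

sample_header, sample_rows = parse_rows(SAMPLE_HTML)
print(sample_header)
print(sample_rows)
```

Selecting the table first also avoids the need to slice off unrelated trailing rows, provided the target table is the first one on the page (an assumption that should be checked against the live page).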

In [22]:
# import pandas and numpy to store the data in a dataframe and format it for further data collection
import pandas as pd
import numpy as np

### The following conditions are used to create the dataframe:
- The dataframe consists of three columns: PostalCode, Borough, and Neighborhood (named Postcode, Borough, and Neighbourhood in the scraped table).
- Only cells that have an assigned borough are processed. Cells whose borough is Not assigned are ignored.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows are combined into one row with the neighborhoods separated by a comma, as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, the neighborhood is set to the borough. So for the 9th cell in the table on the Wikipedia page, the value of both the Borough and the Neighborhood columns is Queen's Park.
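
The conditions above can be tried on a few hand-written rows before running them on the scraped table. This is a minimal sketch under stated assumptions: the `toy` dataframe is made-up sample data that mirrors the cases in the list, and it uses `fillna` and `agg` as one possible way to express the rules.

```python
import numpy as np
import pandas as pd

# Toy rows mirroring the cases described above (illustrative data only).
toy = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postcode", "Borough", "Neighbourhood"],
)

# Treat 'Not assigned' as missing and drop rows without a borough.
toy = toy.replace("Not assigned", np.nan).dropna(subset=["Borough"])

# A missing neighbourhood takes the borough name.
toy["Neighbourhood"] = toy["Neighbourhood"].fillna(toy["Borough"])

# Merge neighbourhoods that share a postal code into one comma-separated row.
toy = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(toy)
```

With these four input rows, the result should contain one row for M5A with both neighbourhoods joined and one row for M7A whose neighbourhood equals its borough.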

In [112]:
# create a dataframe with three columns: Postcode, Borough, Neighbourhood
df = pd.DataFrame(data = rows, columns = table_header)

# replace 'Not assigned' values with np.nan
df.replace({'Not assigned':np.nan}, inplace=True)

# drop all rows that have a null value in Borough
df.dropna(subset=['Borough'], inplace=True)

# where the neighbourhood is null, use the borough value instead (via np.where)
n_is_null = df['Neighbourhood'].isnull()
df['Neighbourhood'] = np.where(n_is_null,df['Borough'],df['Neighbourhood'])

print(df.describe())
df.head()

       Postcode    Borough Neighbourhood
count       211        211           211
unique      103         11           209
top         M9V  Etobicoke     Runnymede
freq          8         45             2


|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |


The counts above are equal for all three columns (211), so no missing values remain.

In [110]:
# group the dataframe by postcode and borough, joining the neighbourhoods of each group with a comma
df = df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


The output above shows that the grouping worked: neighbourhoods that share a postal code are now combined into a single row.
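
Inspecting `df.head()` only covers the first five rows. A programmatic check over the whole dataframe is a possible complement. The sketch below is an assumption about what is worth checking, not part of the original notebook: `check_grouped` is a hypothetical helper, and the sample frame reuses two rows from the output above.

```python
import pandas as pd

def check_grouped(frame):
    """Raise ValueError if the grouped frame breaks the expected shape."""
    # After grouping, each postal code should appear exactly once.
    if not frame["Postcode"].is_unique:
        raise ValueError("duplicate postcodes remain after grouping")
    # No cell should be missing once the cleaning rules are applied.
    if frame.isnull().values.any():
        raise ValueError("missing values remain")
    return True

# Two rows copied from the grouped output above.
sample = pd.DataFrame(
    {
        "Postcode": ["M1B", "M1C"],
        "Borough": ["Scarborough", "Scarborough"],
        "Neighbourhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
    }
)
print(check_grouped(sample))
```

In the notebook, the same helper could be called as `check_grouped(df)` right after the `groupby` cell.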

## Step 1.b
In this step, two new columns, latitude and longitude, are added to the dataframe. The Bing geocoding service, accessed through the geocoder library, is used to get the latitude and longitude coordinates of each neighborhood.

In [113]:
# use the geocoder library; if it is not installed, run !conda install -c conda-forge geocoder
import geocoder
# the Bing geocoder needs an API key; save the key in the OS environment variable BING_API_KEY
# and then read that key here
import os
BING_API_KEY = os.environ['BING_API_KEY']
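
Reading the key from the environment keeps it out of the notebook, but a missing variable then surfaces as a bare `KeyError`. A small wrapper can give a clearer message. This is a minimal sketch: `load_api_key` is a hypothetical helper, and `EXAMPLE_KEY` is a placeholder variable set only for the demonstration.

```python
import os

def load_api_key(name="BING_API_KEY"):
    """Return the key stored in the environment variable `name`."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key

# Demonstration with a placeholder variable (not a real key).
os.environ["EXAMPLE_KEY"] = "not-a-real-key"
print(load_api_key("EXAMPLE_KEY"))
```

In the notebook, `BING_API_KEY = load_api_key()` would then fail early with a readable error if the variable has not been set.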

In [124]:
# this function takes an address and returns the latitude and longitude of that address
def get_latlng(address):
    g = geocoder.bing(address, key = BING_API_KEY)
    return pd.Series(g.latlng)

In [125]:
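# use the get_latlng function to fill the Latitude and Longitude columns of the dataframe;
# the address parts are joined with ', ' so the geocoder receives them as separate fields
df[['Latitude','Longitude']] = df.apply(lambda x: get_latlng(', '.join([x.Postcode, x.Borough, x.Neighbourhood])), axis=1)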
df.head()

|    | Postcode | Borough | Neighbourhood | latlng | Latitude | Longitude |
|----|----------|---------|---------------|--------|----------|-----------|
| 2  | M3A | North York | Parkwoods | [43.75766, -79.31806] | 43.757660 | -79.318060 |
| 3  | M4A | North York | Victoria Village | [43.76826095581055, -79.41262817382812] | 43.768261 | -79.412628 |
| 4  | M5A | Downtown Toronto | Harbourfront | [43.64079, -79.37768] | 43.640790 | -79.377680 |
| 5  | M5A | Downtown Toronto | Regent Park | [43.68138885498047, -79.46888732910156] | 43.681389 | -79.468887 |
| 6  | M6A | North York | Lawrence Heights | [43.71296, -79.46003] | 43.712960 | -79.460030 |
| 7  | M6A | North York | Lawrence Manor | [43.71296, -79.46003] | 43.712960 | -79.460030 |
| 8  | M7A | Queen's Park | Queen's Park | [43.66436004638672, -79.39096069335938] | 43.664360 | -79.390961 |
| 10 | M9A | Etobicoke | Islington Avenue | [43.66979, -79.53449] | 43.669790 | -79.534490 |
| 11 | M1B | Scarborough | Rouge | [43.8093, -79.18765] | 43.809300 | -79.187650 |
| 12 | M1B | Scarborough | Malvern | [43.79749, -79.23609] | 43.797490 | -79.236090 |


In [126]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 211 entries, 2 to 286
Data columns (total 6 columns):
Postcode         211 non-null object
Borough          211 non-null object
Neighbourhood    211 non-null object
latlng           207 non-null object
Latitude         207 non-null float64
Longitude        207 non-null float64
dtypes: float64(2), object(4)
memory usage: 11.5+ KB
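
The `df.info()` output above reports 207 non-null coordinates out of 211 rows, so four lookups returned nothing. One way to make such failures explicit is to retry a lookup and fall back to NaN values. The sketch below is an assumption about how this could be handled, not the notebook's method: `safe_latlng` is a hypothetical helper, and `fake_lookup` is a stand-in for a real geocoder call so that the example runs without network access.

```python
import numpy as np
import pandas as pd

COLUMNS = ["Latitude", "Longitude"]

def safe_latlng(lookup, address, retries=2):
    """Call lookup(address) up to retries + 1 times; return NaNs on failure."""
    for _ in range(retries + 1):
        result = lookup(address)
        if result:  # an empty or None result means nothing was found
            return pd.Series(result, index=COLUMNS)
    return pd.Series([np.nan, np.nan], index=COLUMNS)

# Stand-in for a real geocoder call (hypothetical; no network access).
def fake_lookup(address):
    known = {"M3A, North York, Parkwoods": [43.75766, -79.31806]}
    return known.get(address)

demo = pd.DataFrame({"Address": ["M3A, North York, Parkwoods", "unknown place"]})
coords = demo["Address"].apply(lambda a: safe_latlng(fake_lookup, a))
print(coords)
print(coords["Latitude"].isna().sum(), "lookup(s) failed")
```

In the notebook, `lookup` could be something like `lambda a: geocoder.bing(a, key=BING_API_KEY).latlng`, and the rows with NaN coordinates could then be listed with `df[df['Latitude'].isna()]` for manual review.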


# Step 2: Data Preparation
Prepare the data for modelling.

# Step 3: Modelling
Create a model and fit it to the data.