# Toronto's Neighborhoods Segmentation

### Contents
>1. Part 1: Scrape Toronto's postal codes from Wikipedia resource
>2. Part 2: Use the Google Geocoding API for geocoding the Boroughs
>3. Part 3: Mapping and clustering

# Part 1: Scrape Toronto's postal codes from Wikipedia resource

## Step 0: Load libraries for Part 1

> Note: The **pandas.read_html** method parses an HTML table more directly than BeautifulSoup, so only pandas is needed here.

In [1]:
import pandas as pd
print('Part I Libraries imported.')

Part I Libraries imported.


## Step 1: Load Toronto's postal codes from Wikipedia resource into a pandas DataFrame.

In [2]:
def toronto_wiki_to_df(geocode_FSA_only=True):
    """
    Parameter: geocode_FSA_only (default: True); note: 'FSA' = Forward Sortation Area.
               This parameter selects one of two levels of segmentation: by Borough (True) or by Neighborhood (False).

               True: one row per postal code; the 'Neighborhood' column lists all the neighborhoods of that
               postal code, comma-separated, so that the geocoding uses the 'PostalCode' (its FSA) only;
               False: one row per neighborhood; the 'Neighborhood' column contains a single neighborhood name,
               so that the geocoding can use the 'Borough' and the 'Neighborhood'.
    
    Returns a pandas dataframe with columns: 'PostalCode', 'Borough', and 'Neighborhood'
 
    DataSource
        wikipedia page: 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
    """

    url_toronto_codes = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

    tbl_list = pd.read_html(url_toronto_codes)[0].unstack()

    df_records = []
    
    for v in tbl_list:
        
        if v[-12:] != 'Not assigned':
            postal = v[0:3]

            rhs1 = v[3:].strip().split('(')
            boro = rhs1[0].strip()

            # extra boro cleanup:
            if postal == 'M9W' and boro.endswith('eNorthwest'):
                boro = 'Etobicoke Northwest'

            if postal == 'M4J' and boro == 'East YorkEast Toronto':
                boro = 'East York'


            if len(rhs1) == 1:
                hoods = boro
                df_records.append([postal, boro, hoods])

            else:
                if geocode_FSA_only:
                    if not ('/' in rhs1[1]):
                        hoods = rhs1[1][:-1].strip()

                    else:
                        hoods = rhs1[1][:-1].replace(' / ', ', ')


                    df_records.append([postal, boro, hoods])

                else:
                    rhs2 = rhs1[1][:-1].split(' / ')

                    if len(rhs2) == 1:
                        hoods = rhs2[0]
                        df_records.append([postal, boro, hoods])
                    else:
                        df_records.extend([postal, boro, h.strip()] for h in rhs2)


    df = pd.DataFrame(df_records, columns = ['PostalCode', 'Borough', 'Neighborhood'])
    
    if not geocode_FSA_only:
        print('Note: Dataframe setup for geocoding by Neighborhood.')
        
    n1 = df.shape[0]
    # Drop records for mail processing centers
    df.drop(df[df.Neighborhood.str.startswith('Enclave')].index, axis=0, inplace=True)
    df.reset_index(inplace=True, drop=True)
    deleted = n1 - df.shape[0]
    
    print('Note: {} "Enclave" postal {areas} deleted'.format(deleted, areas=('area was' if deleted == 1 else 'areas were')))
    
    return df
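
The per-cell parsing inside `toronto_wiki_to_df` can be tried in isolation, without fetching the Wikipedia page. The sketch below assumes each table cell arrives as a single string of the form `<FSA><Borough>(<hood> / <hood>)`, which is what the slicing in the function implies. `parse_cell` is a hypothetical helper name that is not part of the notebook, the sample strings are hand-written, and the two borough clean-up special cases are left out.

```python
def parse_cell(cell):
    """Split one table cell into (postal, borough, [neighborhoods]).

    Returns None for cells marked 'Not assigned'.
    """
    if cell.endswith('Not assigned'):
        return None

    postal = cell[:3]  # the FSA is the first three characters
    borough, sep, rest = cell[3:].strip().partition('(')
    borough = borough.strip()

    if not sep:
        # No parenthesised list: the borough doubles as the neighborhood.
        return postal, borough, [borough]

    # Drop the closing parenthesis, then split the slash-separated names.
    hoods = [h.strip() for h in rest.rstrip(')').split('/')]
    return postal, borough, hoods


# Hand-written sample cells (assumed format, not fetched from Wikipedia).
print(parse_cell('M1BScarborough(Malvern / Rouge)'))
print(parse_cell('M1ANot assigned'))
```

Returning the neighborhoods as a list keeps both segmentation levels available: joining the list with `', '` corresponds to `geocode_FSA_only=True`, while emitting one record per list item corresponds to `geocode_FSA_only=False`.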


In [3]:
tor_df = toronto_wiki_to_df()

tor_df.head()

Note: 3 "Enclave" postal areas were deleted


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [4]:
print('Dataframe size: {}'.format(tor_df.shape))

Dataframe size: (100, 3)


# Part 2: Use the Google Geocoding API for geocoding the Boroughs

## Step 0: Load libraries for Part 2

In [5]:
import requests  # for use with Google API

## Step 1: Obtain the geolocation of each borough with Google geocode API

In [6]:
# Note: Using the Google Geocoding API in the implementation by Borough (as per exercise)
#       has the same behavior as that of geopy (without the limitations):
#       several geolocations can be returned for each Borough; only the first one is retrieved.

def get_goo_ll(api_key, address):
    """Return (lat, lon) of the first geocoding result for `address`.

    Returns ('d', 'd') if the request is denied and (0, 0) if no result is available.
    """
    url = 'https://maps.googleapis.com/maps/api/geocode/json'
    # Let requests URL-encode the parameters; the address stays quoted as before.
    params = {'key': api_key, 'address': '"{}"'.format(address)}

    response = requests.get(url, params=params).json()
    lat, lon = 0, 0

    if response['status'] == 'REQUEST_DENIED':
        return ('d', 'd')

    # Only the 'OK' status guarantees a non-empty 'results' list.
    if response['status'] == 'OK':
        geoloc = response['results'][0]['geometry']['location']
        lat = geoloc['lat']
        lon = geoloc['lng']

    return (lat, lon)
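
The response handling in `get_goo_ll` can be exercised without an API key or network access. The sketch below is a minimal illustration under stated assumptions: `extract_lat_lon` is a hypothetical helper name, the payloads are hand-written and trimmed to the fields the notebook reads (`status`, `results`, `geometry`, `location`), and it returns `None` on failure instead of the notebook's `(0, 0)` and `('d', 'd')` sentinels.

```python
def extract_lat_lon(payload):
    """Return (lat, lon) of the first result in a geocoding payload, or None."""
    if payload.get('status') != 'OK':
        return None  # covers ZERO_RESULTS, REQUEST_DENIED and any other status

    results = payload.get('results') or []
    if not results:
        return None

    location = results[0]['geometry']['location']
    return location['lat'], location['lng']


# Hand-written payloads for illustration only (not real API responses).
sample_ok = {
    'status': 'OK',
    'results': [
        {'geometry': {'location': {'lat': 43.806686, 'lng': -79.194353}}},
        {'geometry': {'location': {'lat': 43.8, 'lng': -79.2}}},
    ],
}
sample_empty = {'status': 'ZERO_RESULTS', 'results': []}
sample_denied = {'status': 'REQUEST_DENIED', 'results': []}

for payload in (sample_ok, sample_empty, sample_denied):
    print(payload['status'], extract_lat_lon(payload))
```

A single `None` sentinel is one possible design choice: it cannot be mistaken for a real coordinate pair, whereas `(0, 0)` is a valid location on the map.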

Save the API key in the `GOO_GEO_API` variable (hidden cell).

In [7]:
# The code was removed by Watson Studio for sharing.

In [8]:
TORONTO_ADRS = 'Canada Toronto {}'

tor_df['loc'] = tor_df['PostalCode'].apply(lambda x: get_goo_ll(GOO_GEO_API, TORONTO_ADRS.format(x)))

tor_df['Latitude'] = tor_df['loc'].apply(lambda x: x[0])
tor_df['Longitude'] = tor_df['loc'].apply(lambda x: x[1])

tor_df.drop('loc', axis=1, inplace=True)
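
When the dataframe is built with `geocode_FSA_only=False`, the same postal code appears on several rows, so applying the lookup row by row would repeat identical requests. The sketch below shows one way to avoid that with `functools.lru_cache`. It is an illustration only: `fake_geocode` is a hypothetical stand-in that returns made-up coordinates instead of calling the Google API, and `cached_geocode` is likewise not part of the notebook.

```python
from functools import lru_cache

calls = []  # records every simulated API request


def fake_geocode(address):
    """Hypothetical stand-in for a real geocoding request (made-up result)."""
    calls.append(address)
    return (43.0, -79.0)


@lru_cache(maxsize=None)
def cached_geocode(address):
    """Look up an address once; repeated addresses are served from the cache."""
    return fake_geocode(address)


# Repeated postal codes, as produced by the per-neighborhood layout.
postal_codes = ['M1B', 'M1B', 'M1C', 'M1C', 'M1C']
coords = [cached_geocode('Canada Toronto {}'.format(pc)) for pc in postal_codes]

# Split the (lat, lon) tuples into two parallel lists.
latitudes = [lat for lat, _ in coords]
longitudes = [lon for _, lon in coords]

print(len(coords), 'lookups,', len(calls), 'simulated API requests')
```

With the default per-postal-code layout every code is unique, so the cache would bring no saving there.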

In [9]:
tor_df.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


# Part 3: Mapping and clustering