# Capstone Project - Toronchester

## Table of Contents

* [1. Defining the Problem](#problem_definition)
* [2. The Proposed Solution](#proposed_solution)
* [3. The Data](#the_data)
* [4. Collecting and Cleaning Data](#collecting_data)
  * [4.1 Importing Data](#importing_data)
  * [4.2 Cleaning Data](#cleaning_data)
* [5. Getting Location Data](#getting_location_data)
  * [5.1 Location Data for Toronto](#toronto_location_data)
  * [5.2 Location Data for Manchester](#manchester_location_data)
* [6. Mapping the Cities](#mapping)

## 1. Defining the Problem <a name="problem_definition"></a>

For this project, we are going to assume that I wish to move from Levenshulme, Manchester, UK, to Toronto, ON, Canada. I have developed a crippling addiction to maple syrup that just can't be satisfied in Britain. But I like where I live in Levenshulme, and would like to live somewhere similar in Toronto.

While some people might just ask someone in Toronto which neighbourhoods to choose, I, like many computer scientists, wish to avoid human interaction as much as possible, and ideally would like to choose my new home without talking to anybody at all.

While this solution will be of benefit to me personally rather than a group of stakeholders, it can easily be adapted for any user who wishes to move from one city to another.

## 2. The Proposed Solution <a name="proposed_solution"></a>

We already have the information about each neighbourhood in Toronto from the Week 3 Segmentation and Clustering exercise. Similar data can also be found for Manchester. My plan is to combine the data for Manchester and Toronto to make one giant city, which we will call Toronchester. The neighbourhoods of Toronchester can then be grouped the same way that the neighbourhoods of Toronto were, based on Foursquare data for the most common types of venue in each neighbourhood. Any Toronto neighbourhoods in the same group as Levenshulme should be acceptable.
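
The final selection step can be sketched as follows. This is only an illustration: the cluster labels below are made-up placeholders rather than results from this project, and `similar_neighbourhoods` is a hypothetical helper name.

```python
# Hypothetical cluster labels for a handful of Toronchester neighbourhoods.
# The label values are placeholders, NOT output from the clustering in this project.
cluster_labels = {
    ("Manchester", "Levenshulme"): 2,
    ("Manchester", "Ancoats"): 0,
    ("Toronto", "Parkwoods"): 2,
    ("Toronto", "Victoria Village"): 1,
    ("Toronto", "Church and Wellesley"): 0,
}


def similar_neighbourhoods(labels, home, target_city):
    """Return the neighbourhoods in target_city that share a cluster with home."""
    home_cluster = labels[home]
    return sorted(
        name
        for (city, name), cluster in labels.items()
        if city == target_city and cluster == home_cluster
    )


candidates = similar_neighbourhoods(
    cluster_labels, ("Manchester", "Levenshulme"), "Toronto"
)
print(candidates)
```

With real labels, `candidates` would be the shortlist of Toronto neighbourhoods to consider.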

## 3. The Data <a name="the_data"></a>

Postcode data for the neighbourhoods can be found on Wikipedia. The Toronto postcode data can be imported from "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M". Similar data for Manchester is available from "https://en.wikipedia.org/wiki/M_postcode_area".


We will also need geospatial data giving the latitude and longitude of each neighbourhood. For Toronto we can use exactly the same dataset as in the Week 3 project. For Manchester, however, we will have to find the latitude and longitude of each neighbourhood ourselves.

These coordinates will then provide the arguments for our requests to the Foursquare API, where we will find the venues present in each neighbourhood in both Manchester and Toronto. While we could limit the Manchester requests to Levenshulme, or the M19 postcode area, carrying out the exercise for the whole city will provide us with more information. It will also later allow comparison of neighbourhoods within the same city, giving an idea of how well the model has worked.
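
As a rough sketch of how a latitude and longitude become request arguments, the snippet below builds a venue search URL. It assumes the legacy Foursquare v2 `venues/explore` endpoint of the kind used in the Week 3 exercise. The function name `build_explore_url`, the default values and the credentials are placeholders, not part of this project.

```python
from urllib.parse import urlencode

# Assumption: the legacy Foursquare v2 "explore" endpoint.
FOURSQUARE_EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build a venue search URL centred on (lat, lng); no request is sent."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date (placeholder value)
        "ll": f"{lat},{lng}",    # latitude and longitude of the neighbourhood
        "radius": radius,        # search radius in metres
        "limit": limit,          # maximum number of venues to return
    }
    return f"{FOURSQUARE_EXPLORE_URL}?{urlencode(params)}"


# Coordinates for the M19 postcode area, placeholder credentials
url = build_explore_url(53.4379299, -2.1988786, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

Building the URL separately from sending the request makes it easy to inspect what would be sent for each neighbourhood.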

## 4. Collecting and Cleaning Data <a name="collecting_data"></a>

In [1]:
toronto_wiki_page = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
manchester_wiki_page = "https://en.wikipedia.org/wiki/M_postcode_area"

### 4.1 Importing Data <a name="importing_data"></a>

First we fetch the two wiki pages and use them to create BeautifulSoup objects:

In [9]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

# Download each page, stopping early if the request failed
toronto_source_html = requests.get(toronto_wiki_page)
toronto_source_html.raise_for_status()
toronto_soup = BeautifulSoup(toronto_source_html.text, 'lxml')

manchester_source_html = requests.get(manchester_wiki_page)
manchester_source_html.raise_for_status()
manchester_soup = BeautifulSoup(manchester_source_html.text, 'lxml')

We can define a function to parse the HTML:

In [10]:
def parse_html(soup):
    # Walk the rows of the first .wikitable on the page and collect the text of each cell
    data = []     # one list of cell values per table row
    columns = []  # the table header
    table = soup.find(class_='wikitable')
    for index, tr in enumerate(table.find_all('tr')):
        section = []
        for cell in tr.find_all(['th', 'td']):
            section.append(cell.text.rstrip())

        # The first row of the table is the header
        if index == 0:
            columns = section
        else:
            data.append(section)

    return data, columns
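
To see the row-by-row walk in isolation, here is a minimal offline sketch that treats the first row as the header and the rest as data. It does not use BeautifulSoup. It swaps in the standard library's `html.parser` and a hand-written HTML snippet, and the `TableRows` class is a hypothetical stand-in rather than the function above.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hand-written sample in the shape of the Wikipedia table
sample_html = """
<table class="wikitable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
columns, data = parser.rows[0], parser.rows[1:]   # header row, then data rows
print(columns)
print(data)
```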

In [7]:
data, columns = parse_html(toronto_soup)
toronto_df = pd.DataFrame(data=data, columns=columns)
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


And we can repeat the same process for Manchester:

In [97]:
data, columns = parse_html(manchester_soup)
manchester_df = pd.DataFrame(data=data, columns=columns)
manchester_df.head()

Unnamed: 0,Postcode district,Post town,Coverage,Local authority area(s)
0,M1,MANCHESTER,"Piccadilly, City Centre, Market Street",Manchester
1,M2,MANCHESTER,"Deansgate, City Centre",Manchester
2,"M3(Sectors 1, 2, 3, 4 and 9)",MANCHESTER,"City Centre, Deansgate, Castlefield",Manchester
3,"M3(Sectors 5, 6 and 7)",SALFORD,"Blackfriars, Greengate, Trinity",Salford
4,M4,MANCHESTER,"Ancoats, Northern Quarter, Strangeways",Manchester


### 4.2 Cleaning Data <a name="cleaning_data"></a>

For Toronto, we need to remove any entries where a borough is not assigned:

In [89]:
toronto_df = toronto_df[toronto_df['Borough'] != 'Not assigned']
toronto_df.reset_index(drop=True, inplace=True)
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


For Manchester, a number of entries list separate sectors within a postcode district. These labels will not work with the geocoding searches we will be doing later, so we remove those rows:

In [99]:
# Keep only rows whose district label does not mention sectors; copy so later column assignments are safe
manchester_df = manchester_df[~manchester_df["Postcode district"].str.contains('Sector', na=False)].copy()
manchester_df.head()

Unnamed: 0,Postcode district,Post town,Coverage,Local authority area(s)
0,M1,MANCHESTER,"Piccadilly, City Centre, Market Street",Manchester
1,M2,MANCHESTER,"Deansgate, City Centre",Manchester
4,M4,MANCHESTER,"Ancoats, Northern Quarter, Strangeways",Manchester
5,M5,SALFORD,"Ordsall, Seedley, Weaste, University",Salford
6,M6,SALFORD,"Pendleton, Irlams o' th' Height, Langworthy, S...",Salford
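
Note that the filter removes both M3 rows, so M3 drops out of the dataset entirely. If one wanted to keep such districts, an alternative (not what this notebook does) would be to strip the sector note and keep one row per district. A minimal sketch, where `split_district` and the regular expression are my own assumptions about the label format:

```python
import re

# Matches labels such as "M3(Sectors 1, 2, 3, 4 and 9)"
SECTOR_PATTERN = re.compile(
    r"^(?P<district>[A-Z]+\d+[A-Z]?)\s*\((?P<sectors>Sectors?[^)]*)\)$"
)


def split_district(label):
    """Return (district, sector_note); sector_note is None when no sectors are listed."""
    match = SECTOR_PATTERN.match(label.strip())
    if match is None:
        return label.strip(), None
    return match.group("district"), match.group("sectors")


labels = ["M1", "M3(Sectors 1, 2, 3, 4 and 9)", "M3(Sectors 5, 6 and 7)", "M4"]

districts = []
for label in labels:
    district, _ = split_district(label)
    if district not in districts:   # keep the first occurrence of each district
        districts.append(district)

print(districts)
```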


## 5. Getting Location Data <a name="getting_location_data"></a>

### 5.1 Location Data for Toronto <a name="toronto_location_data"></a>

For Toronto, we can access the geospatial data the same way as previously:

In [100]:
toronto_geo_data_df = pd.read_csv('Geospatial_Coordinates.csv')
toronto_geo_data_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Now we can merge this location data with our existing Toronto dataframe:

In [101]:
toronto_df = pd.merge(toronto_df, toronto_geo_data_df, on='Postal Code')
toronto_df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
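
A merge like this silently drops any postal code that is missing from the coordinates file, because `pd.merge` defaults to an inner join. One way to check for that, sketched here on a tiny hand-made example rather than the real dataframes, is a left join with `indicator=True`:

```python
import pandas as pd

# Small stand-ins for toronto_df and toronto_geo_data_df
neighbourhoods = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],   # M5A deliberately left out for the demonstration
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every neighbourhood, validate checks postal codes are unique
# on both sides, and indicator adds a _merge column saying where each row came from
merged = pd.merge(
    neighbourhoods, coordinates, on="Postal Code",
    how="left", validate="one_to_one", indicator=True,
)

unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean every neighbourhood received coordinates.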


### 5.2 Location Data for Manchester <a name="manchester_location_data"></a>

For Manchester the data is not so readily available, so we will use the Google Maps Geocoding API.

We will try this for Levenshulme, starting by setting the API key:

In [56]:
import os
API_key = os.environ["GOOGLE_MAPS_API_KEY"]  # assumed variable name; keeps the secret out of the notebook

We can define functions to get the latitude and longitude for any Manchester postcode district:

In [138]:
GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def get_lat(postcode):
    # postcode is one row of manchester_df
    postcode_district = postcode['Postcode district']
    # Passing params lets requests URL-encode the address
    params = {'address': f"{postcode_district}, manchester, UK", 'key': API_key}
    data = requests.get(GEOCODE_URL, params=params).json()
    try:
        return data['results'][0]['geometry']['location']['lat']
    except (IndexError, KeyError):
        # No usable result came back for this district
        print(f"could not find lat for {postcode_district}")
        return None

def get_lng(postcode):
    # postcode is one row of manchester_df
    postcode_district = postcode['Postcode district']
    params = {'address': f"{postcode_district}, manchester, UK", 'key': API_key}
    data = requests.get(GEOCODE_URL, params=params).json()
    try:
        return data['results'][0]['geometry']['location']['lng']
    except (IndexError, KeyError):
        # No usable result came back for this district
        print(f"could not find long for {postcode_district}")
        return None
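
The cells that follow call a `get_lat_long` helper whose definition is not shown in this notebook. The sketch below is one way such a helper might look, not the original. It takes the key as an explicit `api_key` argument and separates the response parsing into `extract_lat_lng`, both of which are my own assumptions. The sample payload is hand-written and trimmed to the fields used here.

```python
def extract_lat_lng(payload):
    """Return (lat, lng) from a Geocoding API JSON payload, or None if there is no result."""
    try:
        location = payload["results"][0]["geometry"]["location"]
        return location["lat"], location["lng"]
    except (KeyError, IndexError):
        return None


def get_lat_long(place, api_key):
    """Hypothetical helper: geocode "<place>, Manchester, UK" with a single request."""
    import requests  # imported here so extract_lat_lng can be tried without it

    response = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": f"{place}, Manchester, UK", "key": api_key},
    )
    return extract_lat_lng(response.json())


# Hand-written payloads in the shape the API returns (trimmed)
sample_payload = {
    "results": [{"geometry": {"location": {"lat": 53.4379299, "lng": -2.1988786}}}],
    "status": "OK",
}
empty_payload = {"results": [], "status": "ZERO_RESULTS"}

print(extract_lat_lng(sample_payload))
print(extract_lat_lng(empty_payload))
```

Returning both values from one response means each place needs only one request.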

In [85]:
get_lat_long("M19")

(53.4379299, -2.1988786)

In [86]:
get_lat_long("city centre")

(53.4807593, -2.2426305)

In [82]:
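# Assumed request: the cell that fetched `data` is not shown, so the same URL pattern is reused for Levenshulme
URL = f"https://maps.googleapis.com/maps/api/geocode/json?address=Levenshulme,+manchester,+UK&key={API_key}"
data = requests.get(URL).json()
levy_lat = data['results'][0]['geometry']['location']['lat']
levy_long = data['results'][0]['geometry']['location']['lng']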

print(f"The latitude of Levenshulme is {levy_lat}")
print(f"The longitude of Levenshulme is {levy_long}")

The latitude of Levenshulme is 53.4488443
The longitude of Levenshulme is -2.1931977


In [102]:
manchester_df.head()

Unnamed: 0,Postcode district,Post town,Coverage,Local authority area(s)
0,M1,MANCHESTER,"Piccadilly, City Centre, Market Street",Manchester
1,M2,MANCHESTER,"Deansgate, City Centre",Manchester
4,M4,MANCHESTER,"Ancoats, Northern Quarter, Strangeways",Manchester
5,M5,SALFORD,"Ordsall, Seedley, Weaste, University",Salford
6,M6,SALFORD,"Pendleton, Irlams o' th' Height, Langworthy, S...",Salford


We can then add the latitude and longitude to this dataframe, first making sure the postcode district column holds strings:

In [119]:
manchester_df['Postcode district'] = manchester_df['Postcode district'].astype("string")
manchester_df.dtypes

Postcode district          string
Post town                  object
Coverage                   object
Local authority area(s)    object
dtype: object

In [139]:
manchester_df['Latitude'] = manchester_df.apply(get_lat, axis=1)
manchester_df.head()

Unnamed: 0,Postcode district,Post town,Coverage,Local authority area(s),Latitude
0,M1,MANCHESTER,"Piccadilly, City Centre, Market Street",Manchester,53.475109
1,M2,MANCHESTER,"Deansgate, City Centre",Manchester,53.479696
4,M4,MANCHESTER,"Ancoats, Northern Quarter, Strangeways",Manchester,53.487411
5,M5,SALFORD,"Ordsall, Seedley, Weaste, University",Salford,53.479642
6,M6,SALFORD,"Pendleton, Irlams o' th' Height, Langworthy, S...",Salford,53.399576


In [140]:
manchester_df['Longitude'] = manchester_df.apply(get_lng, axis=1)
manchester_df.head()

Unnamed: 0,Postcode district,Post town,Coverage,Local authority area(s),Latitude,Longitude
0,M1,MANCHESTER,"Piccadilly, City Centre, Market Street",Manchester,53.475109,-2.234693
1,M2,MANCHESTER,"Deansgate, City Centre",Manchester,53.479696,-2.242458
4,M4,MANCHESTER,"Ancoats, Northern Quarter, Strangeways",Manchester,53.487411,-2.227485
5,M5,SALFORD,"Ordsall, Seedley, Weaste, University",Salford,53.479642,-2.281064
6,M6,SALFORD,"Pendleton, Irlams o' th' Height, Langworthy, S...",Salford,53.399576,-2.51115
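
Calling `get_lat` and `get_lng` separately sends two requests per row. A possible refinement, sketched here with a fixed lookup table standing in for the API so that it runs offline, is to return both values from one function and expand them into two columns. `lookup_lat_lng` and `FAKE_COORDINATES` are hypothetical names for illustration.

```python
import pandas as pd

# Stand-in for a real geocoding call: a fixed lookup so the sketch runs offline
FAKE_COORDINATES = {
    "M1": (53.475109, -2.234693),
    "M2": (53.479696, -2.242458),
}


def lookup_lat_lng(row):
    """Return (lat, lng) for one row; (None, None) when the district is unknown."""
    return FAKE_COORDINATES.get(row["Postcode district"], (None, None))


df = pd.DataFrame({"Postcode district": ["M1", "M2", "M99"]})

# result_type="expand" turns each returned tuple into one column per element
coords = df.apply(lookup_lat_lng, axis=1, result_type="expand")
coords.columns = ["Latitude", "Longitude"]
df = df.join(coords)
print(df)
```

Rows with missing coordinates are then easy to review before mapping, which also makes it easier to spot results that look out of place for the city.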


## 6. Mapping the Cities <a name="mapping"></a>