# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
1. [Introduction: Business Problem](#introduction)
2. [Data](#data)



## Introduction: Business Problem <a name="introduction"></a>

The objective of this project is to compare the neighbourhoods of two major cities: **London, UK** and **Toronto, Canada**. I will focus on downtown Toronto and western central London (the WC postcode area). By exploring the most common venues in each neighbourhood, I aim to identify **differences between a European and a North American city**, which may reflect *different city designs, lifestyles and cultures.*

This project might be interesting for:
* Students who want to study abroad in either North America or Europe
* Adults who are considering working abroad
* Travellers who are looking for their next destinations
* Researchers in the field of urban studies/human geography

## Data <a name="data"></a>

I will use the following datasets to collect the information needed for this project.


* The postal codes of western central London will be obtained from https://en.wikipedia.org/wiki/WC_postcode_area.
* The postal codes of downtown Toronto will be obtained from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.
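* The geographical coordinates of each London neighbourhood will be obtained with the **geopy** package (Nominatim geocoder); those of each Toronto postal code will be read from the CSV file at https://cocl.us/Geospatial_data.
* The types and locations of venues in each neighbourhood will be obtained using the **Foursquare API**.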
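
For orientation, a venue query against Foursquare could be assembled roughly as follows. This is only a sketch: it assumes the legacy v2 `venues/explore` endpoint and its `client_id`, `client_secret`, `v`, `ll`, `radius` and `limit` query parameters. The helper name `build_explore_url`, the credentials and the default version, radius and limit values are placeholders, not values taken from this project.

```python
from urllib.parse import urlencode

# Assumption: the legacy Foursquare v2 "explore" endpoint.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Return the request URL for venues around one (lat, lng) point.

    All default values are illustrative placeholders.
    """
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, format YYYYMMDD
        "ll": f"{lat},{lng}",    # latitude,longitude of the search centre
        "radius": radius,        # search radius in metres
        "limit": limit,          # maximum number of venues to return
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


# Coordinates of New Oxford Street (WC1A); credentials are placeholders.
url = build_explore_url(51.517302, -0.123046,
                        "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

The resulting URL would then be passed to `requests.get(...)`, once per neighbourhood.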

### 1). Gather the postal codes of western central London

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [2]:
# Scrape the Wikipedia page
source1 = requests.get('https://en.wikipedia.org/wiki/WC_postcode_area').text
soup1 = BeautifulSoup(source1,'lxml')

table1 = soup1.find('table',{'class':'wikitable sortable'})

In [3]:
# Iteration: loop through the rows to get the data
PostalCode =[]
PostTown = []
Neighbourhood = []

for row in table1.findAll("tr"):
    cells = row.findAll("th")
    if len(cells) == 1:
        PostalCode.append(cells[0].find(text=True))
    
    cells = row.findAll("td")
    if len(cells) == 3: 
        PostTown.append(cells[0].find(text=True))
        Neighbourhood.append(cells[1].find(text=True))

london = pd.DataFrame(PostalCode, columns = ['PostalCode'])
london['PostTown'] = PostTown
london['Neighbourhood'] = Neighbourhood
london.head()

|   | PostalCode | PostTown | Neighbourhood |
|---|---|---|---|
| 0 | WC1A | LONDON | New Oxford Street |
| 1 | WC1B | LONDON | Bloomsbury |
| 2 | WC1E | LONDON | University College London |
| 3 | WC1H | LONDON | St Pancras |
| 4 | WC1N | LONDON | Russell Square |
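
The loop above relies on each data row holding one `th` cell (the postcode district) followed by three `td` cells. The sketch below illustrates that row rule on a small hand-written HTML sample, not the real page. It swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages; `RowCollector` and `extract_records` are hypothetical names. Unlike the loop above, it reads all three values from the same row, so the columns cannot drift out of alignment.

```python
from html.parser import HTMLParser

# Hand-written sample that mimics the assumed row layout; not the real page.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Postcode district</th><th>Post town</th><th>Coverage</th><th>Local authority area(s)</th></tr>
<tr><th>WC1A</th><td>LONDON</td><td>New Oxford Street</td><td>Camden</td></tr>
<tr><th>WC1B</th><td>LONDON</td><td><a href="#">Bloomsbury</a></td><td>Camden</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect every <tr> as a list of (tag, text) pairs for its cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None        # cells of the row being parsed
        self._cell_tag = None   # "th" or "td" while inside a cell
        self._cell_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell_tag = tag
            self._cell_text = []

    def handle_data(self, data):
        if self._cell_tag is not None:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell_tag == tag:
            text = "".join(self._cell_text).strip()
            self._row.append((tag, text))
            self._cell_tag = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_records(html):
    """Return (postal_code, post_town, neighbourhood) for each data row."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    records = []
    for row in parser.rows:
        headers = [text for tag, text in row if tag == "th"]
        cells = [text for tag, text in row if tag == "td"]
        # Data rows: exactly one header cell and three data cells.
        if len(headers) == 1 and len(cells) == 3:
            records.append((headers[0], cells[0], cells[1]))
    return records


records = extract_records(SAMPLE_HTML)
print(records)
```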


In [4]:
# Change 'Kings Cross' to 'Kings Cross Station' for clarity
london['Neighbourhood'] = london['Neighbourhood'].replace('Kings Cross','Kings Cross Station')

### Get the latitudes and longitudes for each neighbourhood in western central London

In [5]:
from geopy.geocoders import Nominatim

In [6]:
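# Create the geocoder once and reuse it for every query
geolocator = Nominatim(user_agent="ld_explorer")

Latitude = []
Longitude = []

# Note: names are queried without any city context, so an ambiguous name can resolve outside London
for name in london['Neighbourhood']:
    location = geolocator.geocode(name)
    # geocode() returns None when nothing matches; store a missing value instead of crashing
    Latitude.append(location.latitude if location else None)
    Longitude.append(location.longitude if location else None)

london['Latitude'] = Latitude
london['Longitude'] = Longitude
london.head()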

|   | PostalCode | PostTown | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | WC1A | LONDON | New Oxford Street | 51.517302 | -0.123046 |
| 1 | WC1B | LONDON | Bloomsbury | 51.523126 | -0.126066 |
| 2 | WC1E | LONDON | University College London | 51.523161 | -0.128204 |
| 3 | WC1H | LONDON | St Pancras | 53.316558 | -6.28224 |
| 4 | WC1N | LONDON | Russell Square | 51.521699 | -0.126074 |
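
The St Pancras row above resolved to latitude 53.32 and longitude -6.28, far from the other rows, which suggests that the bare name matched a place outside London. One possible mitigation, sketched below, is to append a city suffix to every query and to tolerate lookups that return nothing. `geocode_all`, its `suffix` parameter and the stand-in `fake_geocode` are hypothetical helpers for illustration; the stand-in replaces the real geocoder so the example runs offline.

```python
from collections import namedtuple


def geocode_all(names, geocode, suffix=""):
    """Return one (latitude, longitude) pair per name.

    `geocode` is any callable that takes a query string and returns an
    object with `.latitude` and `.longitude`, or None when nothing matches.
    Failed lookups yield (None, None) instead of raising an error.
    """
    coords = []
    for name in names:
        location = geocode(name + suffix)
        if location is None:
            coords.append((None, None))
        else:
            coords.append((location.latitude, location.longitude))
    return coords


# Stand-in geocoder for demonstration only; it knows a single entry.
Point = namedtuple("Point", ["latitude", "longitude"])
FAKE_INDEX = {"Bloomsbury, London, UK": Point(51.523126, -0.126066)}


def fake_geocode(query):
    return FAKE_INDEX.get(query)


result = geocode_all(["Bloomsbury", "Nowhere Street"], fake_geocode,
                     suffix=", London, UK")
print(result)
```

With geopy, `geolocator.geocode` from the cell above could be passed as the `geocode` argument. Whether the suffix actually changes the St Pancras result would need to be checked against the live service.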


In [7]:
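# Drop 'St Pancras' and 'Charing Cross', whose geocoded coordinates are far from the other neighbourhoods
# (e.g. 'St Pancras' resolved to 53.316558, -6.28224).
# Note: both drops are positional, and the second position is evaluated after the first row is already gone.
london = london.drop(london.index[3])
london = london.drop(london.index[11])
london.head()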

|   | PostalCode | PostTown | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | WC1A | LONDON | New Oxford Street | 51.517302 | -0.123046 |
| 1 | WC1B | LONDON | Bloomsbury | 51.523126 | -0.126066 |
| 2 | WC1E | LONDON | University College London | 51.523161 | -0.128204 |
| 4 | WC1N | LONDON | Russell Square | 51.521699 | -0.126074 |
| 5 | WC1R | LONDON | Gray's Inn | 51.518938 | -0.112812 |
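
The rows above were identified by inspection and dropped by position. As an alternative, misplaced rows could be flagged by their great-circle distance from the median coordinate. The sketch below uses the haversine formula on the first five geocoded rows shown earlier. The 50 km threshold and the names `haversine_km` and `flag_outliers` are assumptions chosen for illustration.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def flag_outliers(points, max_km=50.0):
    """Return the names lying more than `max_km` from the median coordinate.

    `points` maps a name to a (latitude, longitude) pair.
    """
    centre_lat = median(lat for lat, _ in points.values())
    centre_lon = median(lon for _, lon in points.values())
    return [name for name, (lat, lon) in points.items()
            if haversine_km(lat, lon, centre_lat, centre_lon) > max_km]


# First five geocoded rows, before any rows were dropped.
points = {
    "New Oxford Street": (51.517302, -0.123046),
    "Bloomsbury": (51.523126, -0.126066),
    "University College London": (51.523161, -0.128204),
    "St Pancras": (53.316558, -6.28224),
    "Russell Square": (51.521699, -0.126074),
}

outliers = flag_outliers(points)
print(outliers)
```

The median is used as the centre because, unlike the mean, it is not pulled towards a single badly geocoded row.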


### 2). Gather the postal codes of downtown Toronto

In [8]:
# Scrape the Wikipedia page
source2 = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup2 = BeautifulSoup(source2,'lxml')

table2 = soup2.find('table',{'class':'wikitable sortable'})

In [9]:
# Iteration: loop through the rows to get the data
PostalCode =[]
Borough = []
Neighbourhood =[]

for row in table2.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 3:
        PostalCode.append(cells[0].find(text=True))
        Borough.append(cells[1].find(text=True))
        Neighbourhood.append(cells[2].find(text=True))
        
toronto = pd.DataFrame(PostalCode, columns = ['PostalCode'])
toronto['Borough'] = Borough
toronto['Neighbourhood'] = Neighbourhood
toronto.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


#### Clean data

In [10]:
# 1. Remove rows whose borough is 'Not assigned'
condition = toronto['Borough'] == 'Not assigned'
toronto = toronto.drop(toronto[condition].index)

In [11]:
# 2. For rows with a 'Not assigned' neighbourhood, replace the neighbourhood with the borough.
toronto['Neighbourhood'] = toronto['Neighbourhood'].str.strip()

import numpy as np
toronto['Neighbourhood'] = np.where(toronto['Neighbourhood'] =='Not assigned', toronto['Borough'], toronto['Neighbourhood'])

In [12]:
# 3. Combine neighbourhoods that share a postal code into one comma-separated row
toronto2 = toronto.groupby(['PostalCode','Borough'], as_index = False).agg(', '.join)

In [13]:
toronto2.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
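
To make the three cleaning rules easy to check by hand, the sketch below restates them in plain Python on a handful of rows. It illustrates the logic only and does not replace the pandas cells above. `clean_rows` is a hypothetical helper, and the `M9Z` row is an invented example of a borough without a neighbourhood.

```python
def clean_rows(rows, missing="Not assigned"):
    """Apply the three cleaning rules to (postal_code, borough, neighbourhood) rows."""
    grouped = {}
    for code, borough, neighbourhood in rows:
        # Rule 1: skip rows whose borough is not assigned.
        if borough == missing:
            continue
        # Rule 2: fall back to the borough name when the neighbourhood is missing.
        neighbourhood = neighbourhood.strip()
        if neighbourhood == missing:
            neighbourhood = borough
        # Rule 3: collect neighbourhoods that share a postal code and borough.
        grouped.setdefault((code, borough), []).append(neighbourhood)
    # Sort by (postal code, borough), as groupby does by default.
    return [(code, borough, ", ".join(names))
            for (code, borough), names in sorted(grouped.items())]


sample = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M3A", "North York", "Parkwoods"),
    ("M5A", "Downtown Toronto", "Harbourfront"),
    ("M5A", "Downtown Toronto", "Regent Park\n"),   # trailing newline, as scraped
    ("M9Z", "Example Borough", "Not assigned"),     # hypothetical row
]

cleaned = clean_rows(sample)
for row in cleaned:
    print(row)
```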


### Get the latitudes and longitudes for each neighbourhood in downtown Toronto

In [14]:
geodata = pd.read_csv('https://cocl.us/Geospatial_data')

In [15]:
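# Join on the postal code rather than relying on both tables having the same row order
toronto3 = toronto2.merge(geodata, left_on='PostalCode', right_on='Postal Code').drop('Postal Code', axis = 1)
toronto3.head()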

|   | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


In [16]:
# We will focus on downtown Toronto.
dt_trt = toronto3[toronto3['Borough'] == 'Downtown Toronto'].reset_index(drop=True)

### Now I have two cleaned datasets of neighbourhoods and their coordinates in central London and downtown Toronto.

The dataset of central London is called **london**.

In [17]:
london.head()

|   | PostalCode | PostTown | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | WC1A | LONDON | New Oxford Street | 51.517302 | -0.123046 |
| 1 | WC1B | LONDON | Bloomsbury | 51.523126 | -0.126066 |
| 2 | WC1E | LONDON | University College London | 51.523161 | -0.128204 |
| 4 | WC1N | LONDON | Russell Square | 51.521699 | -0.126074 |
| 5 | WC1R | LONDON | Gray's Inn | 51.518938 | -0.112812 |


The dataset of downtown Toronto is called **dt_trt**.

In [18]:
dt_trt.head()

|   | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M4W | Downtown Toronto | Rosedale | 43.679563 | -79.377529 |
| 1 | M4X | Downtown Toronto | Cabbagetown, St. James Town | 43.667967 | -79.367675 |
| 2 | M4Y | Downtown Toronto | Church and Wellesley | 43.66586 | -79.38316 |
| 3 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.65426 | -79.360636 |
| 4 | M5B | Downtown Toronto | Ryerson, Garden District | 43.657162 | -79.378937 |
