# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
1. [Introduction: Business Problem](#introduction)
2. [Data](#data)
3. [Methodology](#methodology)
4. [Results](#results)
5. [Discussion and Conclusion](#conclusion)



## 1. Introduction: Business Problem <a name="introduction"></a>

The objective of this project is to compare the neighbourhoods of two major cities: **London, UK** and **Toronto, Canada**. I will focus on downtown Toronto and western central London. By exploring the most common venues in each neighbourhood, I aim to identify **the differences between European and North American cities**, which may reflect *different city designs, lifestyles and cultures.*

This project might be interesting for:
* Students who want to study abroad in either North America or Europe
* Adults who are considering working abroad
* Travellers who are looking for their next destinations
* Researchers in the field of urban studies/human geography

## 2. Data <a name="data"></a>

I will use the following datasets to collect the information needed for this project.


* The postal codes of western central London will be obtained from https://en.wikipedia.org/wiki/WC_postcode_area.
* The postal codes of downtown Toronto will be obtained from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.
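* The geographical coordinates of each London neighbourhood will be obtained using the **geopy** package (Nominatim geocoder); those of the Toronto postal codes will be read from a prepared CSV file (https://cocl.us/Geospatial_data).
* The types and locations of venues in each neighbourhood will be obtained using the **Foursquare API**.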

### 2.1. Gather the postal codes of western central London

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [2]:
# Scrape the wikipedia page
source1 = requests.get('https://en.wikipedia.org/wiki/WC_postcode_area').text
soup1 = BeautifulSoup(source1,'lxml')

table1 = soup1.find('table',{'class':'wikitable sortable'})

In [3]:
# Iteration: loop through the rows to get the data
PostalCode =[]
PostTown = []
Neighbourhood = []

for row in table1.findAll("tr"):
    cells = row.findAll("th")
    if len(cells) == 1:
        PostalCode.append(cells[0].find(text=True))
    
    cells = row.findAll("td")
    if len(cells) == 3: 
        PostTown.append(cells[0].find(text=True))
        Neighbourhood.append(cells[1].find(text=True))

london = pd.DataFrame(PostalCode, columns = ['PostalCode'])
london['PostTown'] = PostTown
london['Neighbourhood'] = Neighbourhood
london.head()

| | PostalCode | PostTown | Neighbourhood |
|---|---|---|---|
| 0 | WC1A | LONDON | New Oxford Street |
| 1 | WC1B | LONDON | Bloomsbury |
| 2 | WC1E | LONDON | University College London |
| 3 | WC1H | LONDON | St Pancras |
| 4 | WC1N | LONDON | Russell Square |


In [4]:
# Change 'Kings Cross' to 'Kings Cross Station' for clarity
london['Neighbourhood'] = london['Neighbourhood'].replace('Kings Cross','Kings Cross Station')

### 2.2 Get the latitudes and longitudes for each neighbourhood in western central London

In [None]:
from geopy.geocoders import Nominatim

In [None]:
Latitude = []
Longitude = []

# Create the geocoder once and reuse it for every lookup
geolocator = Nominatim(user_agent="ld_explorer")

for i in london['Neighbourhood']:
    # Look up the coordinates of each neighbourhood by name
    location = geolocator.geocode(i)

    Latitude.append(location.latitude)
    Longitude.append(location.longitude)

london['Latitude'] = Latitude
london['Longitude'] = Longitude
london.head()
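
A geocoder lookup can return `None` when no match is found, in which case reading `location.latitude` raises an `AttributeError`. The following is a minimal sketch of a more defensive loop. The helper name `geocode_places` and the fake lookup table are illustrative assumptions, not part of the original notebook, and the coordinates are placeholders.

```python
from collections import namedtuple

# Stand-in for the object returned by a geocoder (illustrative only).
Location = namedtuple("Location", ["latitude", "longitude"])


def geocode_places(names, geocode):
    """Return (latitudes, longitudes), using None where a lookup fails.

    `geocode` is any callable that maps a place name to an object with
    `.latitude` and `.longitude`, or to None when nothing is found.
    """
    latitudes, longitudes = [], []
    for name in names:
        location = geocode(name)
        if location is None:
            # Keep the output aligned with the input instead of crashing.
            latitudes.append(None)
            longitudes.append(None)
        else:
            latitudes.append(location.latitude)
            longitudes.append(location.longitude)
    return latitudes, longitudes


# Fake lookup table so the sketch runs without network access.
# The coordinates are made-up placeholders, not real geocoding results.
fake_results = {"Bloomsbury": Location(1.0, 2.0)}
lats, lons = geocode_places(["Bloomsbury", "Nowhere"], fake_results.get)
print(lats, lons)
```

In the notebook, `geolocator.geocode` could be passed as the `geocode` argument. Nominatim's usage policy limits clients to one request per second, so a pause between calls (for example with geopy's `RateLimiter` from `geopy.extra.rate_limiter`) may also be needed.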

In [None]:
# Drop 'St Pancras' and 'Charing Cross' which are far away from other neighbourhoods
london = london.drop(london.index[3])
london = london.drop(london.index[11])
london.head()
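
Dropping rows by position depends on the order of the scraped table, and the positions shift after each drop. A possible alternative is to filter by neighbourhood name, sketched here on a toy frame. The rows below are illustrative, and the sketch assumes the scraped labels are exactly 'St Pancras' and 'Charing Cross'.

```python
import pandas as pd

# Toy frame; column names mirror the notebook, rows are illustrative.
df = pd.DataFrame({
    "PostalCode": ["WC1A", "WC1B", "WC1H", "WC2N"],
    "Neighbourhood": ["New Oxford Street", "Bloomsbury",
                      "St Pancras", "Charing Cross"],
})

to_drop = ["St Pancras", "Charing Cross"]

# Keep rows whose neighbourhood is not in the drop list, then renumber.
filtered = df[~df["Neighbourhood"].isin(to_drop)].reset_index(drop=True)
print(filtered)
```

Filtering by label keeps the cell correct even if the Wikipedia table gains or loses a row.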

### 2.3 Gather the postal codes of downtown Toronto

In [None]:
# Scrape the wikipedia page
source2 = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup2 = BeautifulSoup(source2,'lxml')

table2 = soup2.find('table',{'class':'wikitable sortable'})

In [None]:
# Iteration: loop through the rows to get the data
PostalCode =[]
Borough = []
Neighbourhood =[]

for row in table2.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 3:
        PostalCode.append(cells[0].find(text=True))
        Borough.append(cells[1].find(text=True))
        Neighbourhood.append(cells[2].find(text=True))
        
toronto = pd.DataFrame(PostalCode, columns = ['PostalCode'])
toronto['Borough'] = Borough
toronto['Neighbourhood'] = Neighbourhood
toronto.head()

#### Clean data

In [None]:
# 1. Remove cells with a borough that is 'Not assigned'
condition = toronto.Borough == 'Not assigned'
toronto = toronto.drop(toronto[condition].index, axis = 0, inplace = False)

In [None]:
# 2. For cells with a 'Not assigned' neighborhood, replace the neighborhood with the borough.
toronto['Neighbourhood'] = toronto['Neighbourhood'].str.strip()

import numpy as np
toronto['Neighbourhood'] = np.where(toronto['Neighbourhood'] =='Not assigned', toronto['Borough'], toronto['Neighbourhood'])

In [None]:
# 3. Combine Neighbourhood with the same postal code
toronto2 = pd.DataFrame(toronto.groupby(['PostalCode','Borough'], as_index = False).agg(', '.join))

In [None]:
toronto2.head()
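
To illustrate what the grouping step does, here is a small sketch on made-up rows (the postal codes and names are only examples): neighbourhoods that share a postal code and borough are joined into one comma-separated string.

```python
import pandas as pd

# Toy data: two neighbourhoods share the postal code M5A.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M5B"],
    "Borough": ["Downtown Toronto"] * 3,
    "Neighbourhood": ["Harbourfront", "Regent Park", "Ryerson"],
})

# One row per (PostalCode, Borough); names are joined with ', '.
combined = (
    toy.groupby(["PostalCode", "Borough"], as_index=False)
       .agg({"Neighbourhood": ", ".join})
)
print(combined)
```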

### 2.4 Get the latitudes and longitudes for each neighbourhood in downtown Toronto

In [None]:
geodata = pd.read_csv('https://cocl.us/Geospatial_data')

In [None]:
toronto3 = pd.concat([toronto2, geodata], axis=1).drop('Postal Code',axis = 1)
toronto3.head()
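
Note that `pd.concat(..., axis=1)` aligns the two tables by row index, so it only gives correct coordinates if both are in the same order. A hedged alternative is to merge on the postal code itself, sketched below with placeholder coordinates. In the real data, the scraped codes may need `.str.strip()` first so that the keys match.

```python
import pandas as pd

codes = pd.DataFrame({
    "PostalCode": ["M5B", "M5A"],            # deliberately not sorted
    "Neighbourhood": ["Ryerson", "Harbourfront"],
})
coords = pd.DataFrame({
    "Postal Code": ["M5A", "M5B"],
    "Latitude": [10.0, 20.0],                # placeholder values
    "Longitude": [-10.0, -20.0],
})

# Match rows by key rather than by position; a left merge keeps
# every row of `codes` in its original order.
merged = (
    codes.merge(coords, left_on="PostalCode", right_on="Postal Code", how="left")
         .drop(columns="Postal Code")
)
print(merged)
```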

In [None]:
# We will focus on downtown Toronto.
dt_trt = toronto3[toronto3['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
dt_trt.head()

### 2.5 Now I have two cleaned datasets of neighbourhoods and their coordinates in central London and downtown Toronto.

The dataset of central London is called **london**.

In [None]:
london.head()

The dataset of downtown Toronto is called **dt_trt**.

In [None]:
dt_trt.head()

## 3. Methodology <a name="methodology"></a>

After cleaning the data, I will first visualize all neighbourhoods in central London (using a **folium map**) to take a closer look at their locations.

Using the **Foursquare API**, I will then explore up to 100 venues within a radius of 500 meters of each neighbourhood. The coordinates and category of each venue are recorded in a dataset called ***london_venues***.

By calculating the average frequency of occurrence of each category, I will identify the top 10 most common venues in each neighbourhood, which are recorded in a dataset called ***london_neighborhoods_venues_sorted***.

Next, I will employ a machine learning algorithm called **K-Means Clustering** to separate the neighbourhoods into three clusters, and visualize them on the map. I will then label each cluster based on its most common venues.

The same analysis will be performed on the dataset of downtown Toronto to cluster its neighbourhoods.

Finally, I will compare the neighbourhood clusters in these two cities and discuss their differences and similarities.

### 3.1 Visualize all neighbourhoods in western central London

In [None]:
address = 'London'

geolocator = Nominatim(user_agent="ld_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London are {}, {}.'.format(latitude, longitude))

In [None]:
import folium

# Create map of London using latitude and longitude values
map_london = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add markers to map
for lat, lng, borough, neighborhood in zip(london['Latitude'], london['Longitude'], london['PostTown'], london['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_london)  
    
map_london

### 3.2 Use the Foursquare API to get nearby venues in each neighborhood

In [None]:
import os
# Read credentials from environment variables (names are illustrative) rather than hard-coding them
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20181110'

In [None]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
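
The list comprehension above assumes every venue has at least one category; a venue with an empty `categories` list would raise an `IndexError`. The sketch below shows one way to extract the same fields more defensively. The helper name `venue_rows` and the fake payload are assumptions for illustration; the payload only mimics the fields this notebook reads and is not a full Foursquare response.

```python
def venue_rows(name, lat, lng, items):
    """Flatten explore-style items into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when a venue carries no category.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Fake payload with made-up values, shaped like the fields used above.
fake_items = [
    {"venue": {"name": "Cafe A",
               "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unlabelled Place",
               "location": {"lat": 1.1, "lng": 2.1},
               "categories": []}},
]

rows = venue_rows("Bloomsbury", 0.0, 0.0, fake_items)
for row in rows:
    print(row)
```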

In [None]:
london_venues = getNearbyVenues(names = london['Neighbourhood'],
                                latitudes = london['Latitude'],
                                longitudes = london['Longitude'])       

In [None]:
london_venues.head()

In [None]:
# Count how many venues were returned for each neighborhood
london_venues.groupby('Neighbourhood').count()

In [None]:
print('There are {} unique categories among all the returned venues.'.format(len(london_venues['Venue Category'].unique())))

### 3.3 Get the top 10 most common venues in each neighborhood

In [None]:
london_onehot = pd.get_dummies(london_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
london_onehot['Neighbourhood'] = london_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [london_onehot.columns[-1]] + list(london_onehot.columns[:-1])
london_onehot = london_onehot[fixed_columns]

london_onehot.head()

#### Calculate the average frequency of occurrence of each category

In [None]:
london_grouped = london_onehot.groupby('Neighbourhood').mean().reset_index()
london_grouped.head()
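
As a small worked example of this step (with made-up venues), the mean of the one-hot columns within a neighbourhood is simply the share of that neighbourhood's venues falling in each category, so every row sums to 1.

```python
import pandas as pd

# Toy venue list: neighbourhood "A" has four venues, "B" has two.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Pub", "Pub", "Cafe", "Theater", "Cafe", "Cafe"],
})

# One indicator column per category (floats so the mean is well defined).
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# The mean of 0/1 indicators is the category's share of the venues.
grouped = onehot.groupby("Neighbourhood").mean()
print(grouped)

# The most common category in "A" is the column with the largest share.
top_a = grouped.loc["A"].sort_values(ascending=False).index[0]
print(top_a)
```

Because the values are shares rather than counts, neighbourhoods with many venues and neighbourhoods with few are compared on the same scale.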

#### Get the top 10 most common venues in each neighborhood

In [None]:
# Sort the venues in descending order

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # No special suffix beyond 'rd', so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
london_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
london_neighborhoods_venues_sorted['Neighbourhood'] = london_grouped['Neighbourhood']

for ind in np.arange(london_grouped.shape[0]):
    london_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(london_grouped.iloc[ind, :], num_top_venues)

london_neighborhoods_venues_sorted.head()
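
The `indicators` list works for the ten columns used here, but it would label an 11th or 21st column incorrectly if `num_top_venues` were increased. A possible general helper is sketched below; the function name `ordinal` is an assumption for illustration and is not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th and 13th are exceptions to the st/nd/rd rule.
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


# Build the same column names as the cell above.
columns = ["Neighbourhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, 11)
]
print(columns[1], "|", columns[10])
```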

### 3.4 Cluster neighbourhoods in central London

In [None]:
from sklearn.cluster import KMeans

In [None]:
# Pass axis by keyword; the positional form was removed in pandas 2.0
london_grouped_clustering = london_grouped.drop('Neighbourhood', axis=1)

kclusters = 3
kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(london_grouped_clustering)

kmeans.labels_[0:10] # cluster labels generated for each row in the dataframe
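
To give a sense of what the clustering does with the frequency rows, here is a minimal NumPy sketch of the basic K-means (Lloyd's) iteration on toy data. It is an illustration of the idea only, not scikit-learn's implementation, which adds smarter initialisation among other refinements. The function name `lloyd_kmeans`, the toy points and the fixed starting centroids are all assumptions.

```python
import numpy as np


def lloyd_kmeans(points, centroids, n_iter=10):
    """Alternate between assigning points and updating centroids."""
    centroids = centroids.astype(float).copy()
    for _ in range(n_iter):
        # Distance from every point to every centroid: shape (n_points, k).
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        # Assign each point to its nearest centroid.
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Toy frequency rows; the two columns could stand for (Pub, Cafe) shares.
points = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])

# Start from two of the points so the result is deterministic.
labels, centroids = lloyd_kmeans(points, centroids=points[[0, 2]])
print(labels)
print(centroids)
```

Neighbourhoods with similar venue shares end up with the same label, which is why each cluster can afterwards be named after its dominant venue categories.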

In [None]:
# add clustering labels
london_neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

london_merged = london
london_merged = london_merged.join(london_neighborhoods_venues_sorted.set_index('Neighbourhood'), 
                                   on ='Neighbourhood')

london_merged.head()

### 3.5 Visualize the clusters in London

In [None]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [None]:
# create map
london_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(london_merged['Latitude'],london_merged['Longitude'],london_merged['Neighbourhood'],london_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(london_clusters)
       
london_clusters

### 3.6 Label the clusters in London

#### Cluster 1:  Theater

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 0, 
                  london_merged.columns[[1] + list(range(5, london_merged.shape[1]))]]

#### Cluster 2: Cafe

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 1, 
                  london_merged.columns[[1] + list(range(5, london_merged.shape[1]))]]

#### Cluster 3: Pub

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 2, 
                  london_merged.columns[[1] + list(range(5, london_merged.shape[1]))]]

### Now I am going to perform the same analysis on downtown Toronto. 
### 3.7 Visualize all neighbourhoods in downtown Toronto

In [None]:
address = 'Downtown Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

In [None]:
# Create map of Downtown Toronto using latitude and longitude values
map_dt_trt = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(dt_trt['Latitude'], dt_trt['Longitude'], dt_trt['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dt_trt)  
    
map_dt_trt

### 3.8 Use the Foursquare API to get nearby venues in each neighborhood

In [None]:
dt_venues = getNearbyVenues(names = dt_trt['Neighbourhood'],
                            latitudes = dt_trt['Latitude'],
                            longitudes = dt_trt['Longitude'])   

In [None]:
dt_venues.head()

In [None]:
# Count how many venues were returned for each neighborhood
dt_venues.groupby('Neighbourhood').count()

In [None]:
print('There are {} unique categories among all the returned venues.'.format(len(dt_venues['Venue Category'].unique())))

In [None]:
dt_onehot = pd.get_dummies(dt_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
dt_onehot['Neighbourhood'] = dt_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [dt_onehot.columns[-1]] + list(dt_onehot.columns[:-1])
dt_onehot = dt_onehot[fixed_columns]

dt_onehot.head()

### 3.9 Get the top 10 most common venues in each neighborhood

#### Calculate the average frequency of occurrence of each category

In [None]:
dt_grouped = dt_onehot.groupby('Neighbourhood').mean().reset_index()
dt_grouped.head()

#### Get the top 10 most common venues in each neighborhood

In [None]:
num_top_venues = 5

for hood in dt_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = dt_grouped[dt_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
# Sort the venues in descending order

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # No special suffix beyond 'rd', so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
trt_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
trt_neighborhoods_venues_sorted['Neighbourhood'] = dt_grouped['Neighbourhood']

for ind in np.arange(dt_grouped.shape[0]):
    trt_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dt_grouped.iloc[ind, :], num_top_venues)

trt_neighborhoods_venues_sorted.head()

### 3.10 Cluster neighbourhoods in downtown Toronto

In [None]:
# Pass axis by keyword; the positional form was removed in pandas 2.0
dt_grouped_clustering = dt_grouped.drop('Neighbourhood', axis=1)

kclusters = 6
kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(dt_grouped_clustering)

kmeans.labels_[0:10] # cluster labels generated for each row in the dataframe

In [None]:
# add clustering labels
trt_neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

dt_merged = dt_trt

# merge dt_grouped with dt_trt to add latitude/longitude for each neighborhood
dt_merged = dt_merged.join(trt_neighborhoods_venues_sorted.set_index('Neighbourhood'), 
                           on ='Neighbourhood')

dt_merged.head()

### 3.11 Visualize the clusters in Toronto

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dt_merged['Latitude'], dt_merged['Longitude'], dt_merged['Neighbourhood'], dt_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### 3.12 Label the clusters in Toronto

#### Cluster 1: Cafe & Restaurant

In [None]:
dt_merged.loc[dt_merged['Cluster Labels'] == 0, 
              dt_merged.columns[[1] + list(range(5, dt_merged.shape[1]))]]

#### Cluster 2: Park

In [None]:
dt_merged.loc[dt_merged['Cluster Labels'] == 1, 
              dt_merged.columns[[1] + list(range(5, dt_merged.shape[1]))]]

#### Cluster 3: Grocery Store

In [None]:
dt_merged.loc[dt_merged['Cluster Labels'] == 2, 
              dt_merged.columns[[1] + list(range(5, dt_merged.shape[1]))]]

#### Cluster 4: Airport 

In [None]:
dt_merged.loc[dt_merged['Cluster Labels'] == 3, 
              dt_merged.columns[[1] + list(range(5, dt_merged.shape[1]))]]

#### Cluster 5: Cafe

In [None]:
dt_merged.loc[dt_merged['Cluster Labels'] == 4, 
              dt_merged.columns[[1] + list(range(5, dt_merged.shape[1]))]]

#### Cluster 6: Bar

In [None]:
dt_merged.loc[dt_merged['Cluster Labels'] == 5, 
              dt_merged.columns[[1] + list(range(5, dt_merged.shape[1]))]]

## 4. Results <a name="results"></a>

#### Create a table to summarize the categories of 3 clusters in London and 6 clusters in Toronto.

In [None]:
Clusters = [1,2,3,4,5,6]
comparison = pd.DataFrame(Clusters, columns = ['Clusters'])

London = ['Theater','Cafe','Pub','-','-','-']
Toronto = ['Cafe & Restaurant','Park','Grocery Store','Airport','Cafe','Bar']

comparison['London'] = London
comparison['Toronto'] = Toronto

comparison

## 5. Discussion and Conclusion <a name="conclusion"></a>

The clustering results suggest that central London and downtown Toronto are broadly similar in terms of the most common venues in their neighbourhoods.

Both cities have a lot of coffee shops, which is probably true of most Western cities. Both also have a wide variety of restaurants, ranging from Italian and French to Japanese and Chinese. This suggests that both cities are culturally diverse, and that different cultures are celebrated and embraced in both. Therefore, if you are considering studying or working abroad in either London or Toronto, you may not need to worry too much about cultural barriers. It is very likely that you will find some signs of your own culture, such as a restaurant serving food from your hometown.

However, there are some differences between London and Toronto.

First, Toronto tends to have more parks than London does. This is a very positive sign, especially for a large, crowded city like Toronto. If you are thinking about living abroad for a long period of time, the living environment is an important factor to consider.

Second, London tends to have more theatres, exhibits and bookstores than Toronto does. London is famous for its rich history, culture and arts, so this difference is not surprising. For people who are interested in history or the arts, London is an ideal place to experience and learn about European culture.

With increasing globalization, major cities around the world tend to become more similar in terms of city design. However, they still have unique histories and cultures, which make them different from each other to some extent. For researchers in the field of urban studies, I hope this project can provide additional insights into the differences between European and North American cities.