# Coursera Capstone Final Project

## Introduction

Traveling Master is an international travel agency that provides tourism products for its customers. Compared with other organizations' products, the distinguishing feature of Traveling Master's trips is their focus on varied experiences: a trip avoids repeating the same kind of experience and instead offers highly differentiated ones. For example, if someone plans a 3-day trip to a city, Traveling Master analyzes the city and segments its neighborhoods by their nearby venues so that the customer can experience as many different things as possible in the limited time. This report shows one method Traveling Master uses for segmenting, taking Toronto as an example.

## Data

Neighborhood segmentation is the final objective of this report. First, geographical information for Toronto is retrieved from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. Second, the latitude and longitude of each postal code are added to the data from the first step. Third, nearby venues for each neighborhood are retrieved from the Foursquare API. Based on this information, the neighborhoods are segmented into three categories, which Traveling Master can use to plan a 3-day trip in Toronto for its customers.

## Methodology

### Data Collection

First, I need postal code data from Wikipedia, stored in a pandas DataFrame. To do that, the following steps are carried out:

1. Import relevant libraries

In [4]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import csv

2. Retrieve data from Wikipedia

In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

# newline='' prevents blank rows on Windows; the with-block closes the file.
with open('my_project.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Postcode', 'Borough', 'Neighborhood'])

    table = soup.find('table')
    for element in table.find_all('tr')[1:]:
        lst = [i.text.strip() for i in element.find_all('td')]
        if len(lst) < 3:  # skip rows without the three expected cells
            continue
        postcode, borough, neighborhood = lst[:3]
        csv_writer.writerow([postcode, borough, neighborhood])
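The row-extraction logic can be tried without a network connection. The following is a minimal sketch, assuming a small hand-written HTML snippet (`SAMPLE_HTML`) in place of the downloaded page; the helper name `parse_postcode_rows` is hypothetical and not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
  <tr><td>broken row</td></tr>
</table>
"""

def parse_postcode_rows(html):
    """Return [postcode, borough, neighborhood] lists from the first table."""
    soup = BeautifulSoup(html, 'html.parser')  # built-in parser, no lxml needed
    table = soup.find('table')
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if len(cells) == 3:  # skips the header row (th cells) and malformed rows
            rows.append(cells)
    return rows

rows = parse_postcode_rows(SAMPLE_HTML)
```

Checking the cell count per row is one way to avoid an `IndexError` when the page contains a row with fewer than three cells.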

3. Clean the data  
Rows with "Not assigned" in the "Borough" column are dropped.  
Rows with "Not assigned" in the "Neighborhood" column take their "Borough" value as the neighborhood.  
Rows with the same postal code are combined into one row.

In [5]:
df = pd.read_csv('my_project.csv')
missing1 = df.loc[df['Borough'] == 'Not assigned']
df.drop(index=missing1.index, inplace=True)
df.reset_index(inplace=True, drop=True)
missing2 = df.loc[df['Neighborhood'] == 'Not assigned']
df.loc[missing2.index, 'Neighborhood'] = df.loc[missing2.index, 'Borough']
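# Merge rows that share a postal code. A repeated postal code is merged with
# the row directly above it, so duplicates are assumed to be adjacent.
postcode_list = []
delete_list = []
for i in range(len(df)):
    if df.loc[i, 'Postcode'] in postcode_list:
        # Keep row i with the combined names and mark the previous row for deletion.
        delete_list.append(i-1)
        df.loc[i, 'Neighborhood'] = df.loc[i, 'Neighborhood']+', '+df.loc[i-1, 'Neighborhood']
    else:
        postcode_list.append(df.loc[i, 'Postcode'])
df.drop(index=delete_list, inplace=True)
df.reset_index(inplace=True, drop=True)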
print(df.shape)
df.head()

(103, 3)


|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Queen's Park |
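The three cleaning rules can also be written with boolean masks and a `groupby`, which does not depend on duplicate postal codes being adjacent. This is a sketch on a small hypothetical sample (`raw`), not the notebook's own method.

```python
import pandas as pd

# Small hypothetical sample with the same columns as my_project.csv.
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M3A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'North York',
                'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Parkwoods',
                     'Regent Park', 'Not assigned'],
})

# 1. Drop rows whose borough is not assigned.
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# 2. Fall back to the borough name where the neighborhood is not assigned.
unassigned = clean['Neighborhood'] == 'Not assigned'
clean.loc[unassigned, 'Neighborhood'] = clean.loc[unassigned, 'Borough']

# 3. Combine rows sharing a postal code, wherever they sit in the table.
clean = (clean.groupby(['Postcode', 'Borough'], sort=False)['Neighborhood']
              .agg(', '.join)
              .reset_index())
```

Note that this variant joins names in order of appearance ("Harbourfront, Regent Park"), whereas the loop above prepends each later name.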


4. Get latitude and longitude information from an existing document

In [6]:
# Named geo_df rather than csv so the imported csv module is not shadowed.
geo_df = pd.read_csv('http://cocl.us/Geospatial_data')
df = df.merge(geo_df, how='left', left_on='Postcode', right_on='Postal Code')
df.drop(columns=['Postal Code'], inplace=True)
df.head()

|   | Postcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
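A left merge silently leaves `NaN` coordinates for postal codes missing from the coordinate file. The sketch below, on two small hypothetical frames (the `M9Z` row is invented for illustration), uses the `indicator` option of `merge` to list such codes before the helper columns are dropped.

```python
import pandas as pd

# Hypothetical sample frames; 'M9Z' has no match in the coordinate table.
neighborhoods = pd.DataFrame({'Postcode': ['M3A', 'M4A', 'M9Z'],
                              'Borough': ['North York', 'North York', 'Hypothetical']})
coords = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                       'Latitude': [43.753259, 43.725882],
                       'Longitude': [-79.329656, -79.315572]})

# indicator=True adds a '_merge' column saying where each row came from.
merged = neighborhoods.merge(coords, how='left', left_on='Postcode',
                             right_on='Postal Code', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
merged = merged.drop(columns=['Postal Code', '_merge'])
```

An empty `unmatched` list would confirm that every postal code received coordinates.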


5. Select boroughs with "Toronto" in their names

In [7]:
def isToronto(a):
    return('Toronto' in a)

toronto = df.loc[df['Borough'].apply(isToronto)].reset_index(drop=True)
toronto.head()

|   | Postcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 1 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 2 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 3 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 4 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 |
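The same selection can be expressed with the vectorized `str.contains` accessor instead of a helper function. A brief sketch on a hypothetical four-row sample:

```python
import pandas as pd

sample = pd.DataFrame({'Borough': ['North York', 'Downtown Toronto',
                                   'East Toronto', "Queen's Park"]})

# regex=False treats 'Toronto' as a plain substring rather than a pattern.
toronto_only = sample[sample['Borough'].str.contains('Toronto', regex=False)]
toronto_only = toronto_only.reset_index(drop=True)
```

Either form keeps only boroughs whose name contains "Toronto", so a borough such as "Queen's Park" is excluded by this filter.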


### Segmenting

1. Define Foursquare Credentials and Version

In [8]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'  # placeholder: real credentials should not be published
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'  # placeholder
VERSION = '20180605'
LIMIT = 100
radius = 500

2. Get nearby venues for all neighborhoods in Toronto

In [21]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                            'Neighborhood Latitude',
                            'Neighborhood Longitude',
                            'Venue',
                            'Venue Latitude',
                            'Venue Longitude',
                            'Venue Category']
    return(nearby_venues)

toronto_venues = getNearbyVenues(toronto['Neighborhood'],
                                toronto['Latitude'],
                                toronto['Longitude']
                                )
toronto_venues.head()

|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Regent Park, Harbourfront | 43.65426 | -79.360636 | Roselle Desserts | 43.653447 | -79.362017 | Bakery |
| 1 | Regent Park, Harbourfront | 43.65426 | -79.360636 | Tandem Coffee | 43.653559 | -79.361809 | Coffee Shop |
| 2 | Regent Park, Harbourfront | 43.65426 | -79.360636 | Toronto Cooper Koo Family Cherry St YMCA Centre | 43.653191 | -79.357947 | Gym / Fitness Center |
| 3 | Regent Park, Harbourfront | 43.65426 | -79.360636 | Morning Glory Cafe | 43.653947 | -79.361149 | Breakfast Spot |
| 4 | Regent Park, Harbourfront | 43.65426 | -79.360636 | Body Blitz Spa East | 43.654735 | -79.359874 | Spa |
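`getNearbyVenues` indexes straight into the JSON response, so an empty result or a venue without categories would raise an exception. The sketch below shows one defensive way to flatten a response; the helper `venues_from_response` and the `fake_payload` dictionary are hypothetical, shaped only after the fields the notebook itself reads, and no request is sent.

```python
import pandas as pd

def venues_from_response(name, lat, lng, payload):
    """Flatten one explore-style JSON payload into venue rows."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # Fall back to an empty dict when a venue has no category listed.
        categories = venue.get('categories') or [{}]
        rows.append({
            'Neighborhood': name,
            'Neighborhood Latitude': lat,
            'Neighborhood Longitude': lng,
            'Venue': venue['name'],
            'Venue Latitude': venue['location']['lat'],
            'Venue Longitude': venue['location']['lng'],
            'Venue Category': categories[0].get('name'),
        })
    return rows

# Hypothetical payload shaped like the fields the notebook reads.
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': []}},
]}]}}

rows = venues_from_response('Regent Park, Harbourfront', 43.65426, -79.360636, fake_payload)
rows += venues_from_response('Empty Area', 0.0, 0.0, {'response': {}})  # yields no rows
venues = pd.DataFrame(rows)
```

Building rows as dictionaries also names the columns at creation time, so no separate `columns` assignment is needed.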


3. Analyze each neighborhood  
Group the venue information and rank the top 10 categories for each neighborhood.

In [22]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix='', prefix_sep='')
toronto_onehot['neighborhood'] = toronto_venues['Neighborhood']

fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_grouped = toronto_onehot.groupby('neighborhood').mean().reset_index()


def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    # 1st, 2nd and 3rd take their own suffix; every later rank uses 'th'.
    if ind < len(indicators):
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    else:
        columns.append('{}th Most Common Venue'.format(ind+1))
        
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)
    
neighborhoods_venues_sorted.head()

|   | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Berczy Park | Coffee Shop | Restaurant | Cocktail Bar | Café | Farmers Market | Pub | Seafood Restaurant | Cheese Shop | Beer Bar | Italian Restaurant |
| 1 | Business Reply Mail Processing Centre 969 Eastern | Yoga Studio | Garden | Pizza Place | Park | Recording Studio | Restaurant | Burrito Place | Brewery | Skate Park | Smoke Shop |
| 2 | Central Bay Street | Coffee Shop | Café | Italian Restaurant | Bar | Burger Joint | Chinese Restaurant | Indian Restaurant | Ice Cream Shop | Spa | Sandwich Place |
| 3 | Christie | Café | Grocery Store | Park | Convenience Store | Coffee Shop | Baby Store | Nightclub | Diner | Italian Restaurant | Restaurant |
| 4 | Church and Wellesley | Japanese Restaurant | Coffee Shop | Sushi Restaurant | Gay Bar | Restaurant | Burger Joint | Gastropub | Fast Food Restaurant | Men's Store | Café |
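To make the one-hot and averaging step concrete, here is a small sketch on a hypothetical six-venue sample. Averaging the one-hot columns per neighborhood gives the share of venues in each category, and sorting one row ranks the categories.

```python
import pandas as pd

# Hypothetical venues: neighborhood A has 3 coffee shops, 2 parks, 1 bakery.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Coffee Shop',
                       'Park', 'Park', 'Bakery', 'Pub', 'Pub'],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)

# Mean of the indicators = fraction of the neighborhood's venues per category.
freq = onehot.groupby(venues['Neighborhood']).mean()

def top_categories(row, n):
    """Return the n category names with the highest frequency in one row."""
    return row.sort_values(ascending=False).index[:n].tolist()

top_a = top_categories(freq.loc['A'], 2)
```

In this sample, neighborhood A is half coffee shops and a third parks, so those two categories rank first and second.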


4. Cluster neighborhoods  
Here we use the KMeans method to cluster the neighborhoods into 3 categories.

In [26]:
from sklearn.cluster import KMeans
kclusters = 3
toronto_grouped_clustering = toronto_grouped.drop(columns='neighborhood')  # positional axis argument is not accepted by recent pandas
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

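# kmeans.labels_ follows the row order of toronto_grouped (sorted by neighborhood
# name), not the order of toronto, so the labels are attached to the
# per-neighborhood table and then joined on the neighborhood name.
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = toronto.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')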
toronto_merged.head()

|   | Postcode | Borough | Neighborhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 | 0 | Coffee Shop | Bakery | Park | Pub | Café | Restaurant | Theater | Mexican Restaurant | Breakfast Spot | Event Space |
| 1 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 | 0 | Coffee Shop | Clothing Store | Café | Cosmetics Shop | Middle Eastern Restaurant | Bar | Theater | Ramen Restaurant | Plaza | Tea Room |
| 2 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 | 0 | Coffee Shop | Restaurant | Café | Hotel | Clothing Store | Cosmetics Shop | Bakery | Italian Restaurant | Park | Cocktail Bar |
| 3 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 | 0 | Coffee Shop | Pub | Neighborhood | Gym / Fitness Center | Ethiopian Restaurant | Donut Shop | Dumpling Restaurant | Eastern European Restaurant | Electronics Store | Yoga Studio |
| 4 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 | 0 | Coffee Shop | Restaurant | Cocktail Bar | Café | Farmers Market | Pub | Seafood Restaurant | Cheese Shop | Beer Bar | Italian Restaurant |
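Cluster labels come back in the row order of the table that was fitted, so they need to be matched to the source table by neighborhood name rather than by position. The following sketch illustrates this on hypothetical data: `grouped` is sorted by name as a `groupby` would produce it, while `source` lists the same neighborhoods in another order.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical frequency table, sorted by name as groupby would produce it.
grouped = pd.DataFrame({
    'neighborhood': ['Alpha', 'Beta', 'Delta', 'Gamma'],
    'Coffee Shop': [0.9, 0.8, 0.0, 0.1],
    'Park': [0.1, 0.2, 1.0, 0.9],
})
# Source table listing the same neighborhoods in a different row order.
source = pd.DataFrame({'Neighborhood': ['Gamma', 'Alpha', 'Delta', 'Beta']})

features = grouped.drop(columns='neighborhood')
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(features)

# Pair each label with its neighborhood name, then merge on that name.
labels = pd.DataFrame({'Neighborhood': grouped['neighborhood'],
                       'Cluster Labels': kmeans.labels_})
labelled = source.merge(labels, on='Neighborhood', how='left')
by_name = labelled.set_index('Neighborhood')['Cluster Labels']
```

Here the two coffee-heavy neighborhoods share one label and the two park-heavy ones share the other, regardless of the order in `source`.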


5. Visualize segmenting results

In [29]:
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
!conda install -c conda-forge folium=0.5.0 --yes
import folium
print('Libraries imported')


from geopy.geocoders import Nominatim
address = 'Toronto'

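geolocator = Nominatim(user_agent='toronto_explorer')  # recent geopy versions require a user_agent; this name is an arbitrary example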
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lng, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.5.0                      py_0    conda-forge
Libraries imported


In [37]:
map_clusters
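The colour list used for the markers only needs one distinct colour per cluster. A minimal sketch of that step in isolation, assuming the same `rainbow` colormap as the cell above (the helper name `cluster_palette` is hypothetical):

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

def cluster_palette(n_clusters):
    """Return one hex colour per cluster label, evenly spaced on the rainbow map."""
    samples = cm.rainbow(np.linspace(0, 1, n_clusters))
    return [colors.rgb2hex(rgba) for rgba in samples]

palette = cluster_palette(3)

# Map each cluster label directly to its colour.
color_for = {label: palette[label] for label in range(3)}
```

Indexing the palette by the label itself makes the label-to-colour mapping explicit when reading the map.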