## The Problem - How to assess "Quality of Living" in a city where it is "unknown"?

The Mercer Quality of Living Survey every year ranks cities of the world according to their quality of living. However, only big or at least bigger cities are considered in the survey (a total of 231 cities in 2019, from Vienna (#1) to Baghdad (#231)). The Swiss city of Basel for example was only added to the list in 2018, when it directly placed very high in the ranking, as the 10th most livable city on the planet. 

Suppose one has multiple job offers in cities that are not listed in the survey (just like Basel before 2018) and would like to decide which offer to take based on the expected quality of life. Can we build a dataset that represents the current top 20 cities in the Mercer Quality of Living Survey and then compare the data of our cities of interest to this dataset, to determine which city to move to?


## The data

The first data set I will be using for this project is the list of cities with the highest quality of living in 2019, as provided by Mercer, and which can be found on Wikipedia.
https://mobilityexchange.mercer.com/Insights/quality-of-living-rankings; 
https://en.wikipedia.org/wiki/Mercer_Quality_of_Living_Survey

I will then use Foursquare location data for these cities to build a comprehensive dataset of what defines a "livable" city with respect to amenities e.g. restaurants, parks etc. While I understand that aspects like safety, economic and natural environment cannot be grasped by Foursquare location data, I assume that those factors are similar in the cities one applies to for a job, or at least were considered before.

Then, I will use three query cities that rank just outside the top 30 (Dublin (#33), Boston (#36) and Paris (#39)) and three cities that rank rather low (Istanbul (#130), Bangkok (#133) and Moscow (#167)) to determine how my dataset compares to already ranked cities. This shows the dynamic range I am able to achieve, i.e. how well I can discriminate a high-ranked from a low-ranked city.

Lastly, I will ask how the quality of living, based on the criteria defined above, compares in Quebec (Canada), Nice (France) and Zug (Switzerland), and hence decide where one should move to.


### Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

### Obtaining a dataframe containing the 20 highest ranked cities with the best quality of living in 2019

This data is extracted from Wikipedia, but was compiled by Mercer and can also be seen on their website (see above).

In [24]:
source = requests.get('https://en.wikipedia.org/wiki/Mercer_Quality_of_Living_Survey').text
soup = BeautifulSoup(source, 'lxml')

table = soup.find('table',{'class':'wikitable sortable'})

#print(table.prettify())

In [25]:
list_of_entries = []
for entries in table.find_all('td'):
        table_text_entry = entries.text
        list_of_entries.append(table_text_entry)

list_of_entries = [item.strip() for item in list_of_entries if str(item)]

In [33]:
df = pd.DataFrame.from_dict(list_of_entries)

df.columns = ['Rank']
df.insert(1,'a','')
df.insert(2,'b','')
df.insert(3,'c','')
df.insert(4,'d','')
df.insert(5,'e','')
df.insert(6,'f','')
df.insert(7,'City','')
df.insert(8,'Country','')
df.insert(9,'g','')

new_df = pd.DataFrame({'Rank':df['Rank'].iloc[::10].values,
                       'City':df['Rank'].iloc[7::10].values,
                       'Country':df['Rank'].iloc[8::10].values}) 

top20_df = new_df[0:20]
top20_df.head()

Unnamed: 0,Rank,City,Country
0,1,Vienna,Austria
1,2,Zürich,Switzerland
2,3,Munich,Germany
3,3,Auckland,New Zealand
4,3,Vancouver,Canada
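
The slicing in the cell above assumes that every table row contributes exactly ten `<td>` cells, with rank, city and country at offsets 0, 7 and 8. The following minimal sketch shows the same stride logic on a made-up flat list. The cell values and the shorter row width of 4 are illustrative assumptions, not the real Wikipedia table, and the helper name `column` is hypothetical.

```python
# Illustrative flat cell list: 4 cells per row (rank, filler, city, country).
# The real table in this notebook has 10 cells per row; the width here is made up.
cells = ["1", "x", "Vienna", "Austria",
         "2", "x", "Zürich", "Switzerland"]
row_width = 4


def column(flat, offset, width):
    """Return every `width`-th entry of `flat`, starting at `offset`."""
    return flat[offset::width]


# Guard against a changed table layout before slicing.
assert len(cells) % row_width == 0, "cell count is not a multiple of the row width"

ranks = column(cells, 0, row_width)
cities = column(cells, 2, row_width)
countries = column(cells, 3, row_width)
print(list(zip(ranks, cities, countries)))
```

A check like the `assert` above is cheap insurance: if the table layout on the page changes, the stride-based slicing would otherwise silently mix up columns.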


## Import more libraries

In [42]:
import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

Collecting package metadata: done
Solving environment: done

# All requested packages already installed.



### Adding coordinates to the city names

In [67]:
# create the geocoder once and reuse it for every city
geolocator = Nominatim(user_agent="city_explorer")

rows = []
for address in top20_df['City']:
    location = geolocator.geocode(address)
    rows.append({'City': address, 'latitude': location.latitude, 'longitude': location.longitude})

# build the dataframe once; DataFrame.append was removed in pandas 2.0
top20_list = pd.DataFrame(rows)

top20_list.head()

Unnamed: 0,City,latitude,longitude
0,Vienna,48.208354,16.372504
1,Zürich,47.372394,8.542333
2,Munich,48.137108,11.575382
3,Auckland,-36.853467,174.765551
4,Vancouver,49.260872,-123.113953
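
The geocoding loop above assumes that every city name resolves. geopy's `geocode` returns `None` when it finds no match, so a short name could make the loop fail with an `AttributeError`. Below is a minimal sketch of a more defensive loop. The helper name `geocode_cities` is hypothetical, and a dictionary lookup stands in for the real geocoder so the example runs offline.

```python
from collections import namedtuple

# Stand-in for the location object returned by a geocoder (assumption for this demo).
Location = namedtuple("Location", ["latitude", "longitude"])


def geocode_cities(names, geocode):
    """Return (rows, missing): coordinates for resolved names and the names that failed."""
    rows, missing = [], []
    for name in names:
        location = geocode(name)
        if location is None:  # no match found for this name
            missing.append(name)
            continue
        rows.append({"City": name,
                     "latitude": location.latitude,
                     "longitude": location.longitude})
    return rows, missing


# Fake lookup table instead of a network call; "Atlantis" is deliberately unknown.
fake_lookup = {"Vienna": Location(48.208354, 16.372504)}
rows, missing = geocode_cities(["Vienna", "Atlantis"], fake_lookup.get)
print(rows)
print("not found:", missing)
```

With the real geocoder, `geolocator.geocode` would be passed in place of `fake_lookup.get`.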


### Create a map that shows those top 20 cities

In [227]:
# Make an empty map
worldmap = folium.Map(location=[0, 20], tiles="Mapbox Bright", zoom_start=1.5)
 
# Add the markers to the map one by one
for i in range(0,len(top20_list)):
    folium.Marker([top20_list.iloc[i]['latitude'], top20_list.iloc[i]['longitude']],
                  popup = top20_list.iloc[i]['City']).add_to(worldmap)
    
worldmap    

## Observation (1)

We can see that all cities within the top 20 are located in developed countries. Interestingly, none of the top 20 cities is located in the USA.

# Retrieving Foursquare data for Top 20 cities

The next step consists of getting the Foursquare data for each city and extracting relevant features, so that every city is represented by a set of features and the cities can be compared based on these features.

In [120]:
CLIENT_ID = 'xxx' # your Foursquare ID
CLIENT_SECRET = 'xxx' # your Foursquare Secret
VERSION = '20191105' # Foursquare API version
LIMIT = 1000 # upper limit on the number of venues requested from Foursquare per city (the search radius is set separately in getNearbyVenues)

In [334]:
def getNearbyVenues(names, latitudes, longitudes, radius=2000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    City_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    City_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(City_venues)
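
The request URL in `getNearbyVenues` is assembled with a long format string. As a sketch of an alternative, the query string can be built from a dictionary with the standard library's `urlencode`, which also takes care of escaping. The helper name `build_explore_url` is hypothetical; the endpoint and parameter names are the ones used in the function above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude of the city center
        "radius": radius,                # search radius in meters
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)


url = build_explore_url("xxx", "xxx", "20191105", 48.208354, 16.372504, 2000, 1000)
print(url)
```

Passing the same dictionary as `params=` to `requests.get(BASE_URL, params=params)` should achieve the same effect without building the string by hand.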

In [155]:
City_venues = getNearbyVenues(names = top20_list['City'],
                                   latitudes = top20_list['latitude'],
                                   longitudes = top20_list['longitude']
                                  )

Vienna
Zürich
Munich
Auckland
Vancouver
Düsseldorf
Frankfurt
Geneva
Copenhagen
Basel
Sydney
Amsterdam
Berlin
Bern
Wellington
Toronto
Melbourne
Luxembourg
Ottawa
Hamburg


In [156]:
print(City_venues.shape)
City_venues.head(5)

(1807, 7)


Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Vienna,48.208354,16.372504,Stephansplatz,48.208345,16.372118,Plaza
1,Vienna,48.208354,16.372504,Stephansdom,48.208466,16.373169,Church
2,Vienna,48.208354,16.372504,KLEINOD Die Bar,48.207372,16.373671,Bar
3,Vienna,48.208354,16.372504,DO & CO Restaurant,48.20824,16.371758,Restaurant
4,Vienna,48.208354,16.372504,Roberto American Bar,48.209889,16.37232,Cocktail Bar
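
The list comprehension in `getNearbyVenues` indexes `v['venue']['categories'][0]` directly, which would raise an `IndexError` for a venue with an empty category list. The sketch below shows a more defensive extraction. The helper name `extract_venue_rows` is hypothetical, and the sample items are hand-written to mimic the nesting accessed above (the second, uncategorized item is made up); they are not a real API response.

```python
def extract_venue_rows(city, lat, lng, items):
    """Turn response items into row tuples, skipping venues without a category."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        if not categories:
            continue  # nothing to one-hot encode later, so skip this venue
        location = venue.get("location", {})
        rows.append((city, lat, lng,
                     venue.get("name"),
                     location.get("lat"),
                     location.get("lng"),
                     categories[0]["name"]))
    return rows


# Hand-written sample in the shape used by the notebook code (illustrative only).
sample_items = [
    {"venue": {"name": "Stephansplatz",
               "location": {"lat": 48.208345, "lng": 16.372118},
               "categories": [{"name": "Plaza"}]}},
    {"venue": {"name": "Uncategorized place",
               "location": {"lat": 48.2, "lng": 16.3},
               "categories": []}},
]

rows = extract_venue_rows("Vienna", 48.208354, 16.372504, sample_items)
print(rows)
```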


### One hot encoding of venue categories extracted for all cities

In [157]:
# one hot encoding
Cities_onehot = pd.get_dummies(City_venues[['Venue Category']], prefix="", prefix_sep="")

# add City column back to dataframe
Cities_onehot['City'] = City_venues['City'] 

# move City column to the first column
fixed_columns = [Cities_onehot.columns[-1]] + list(Cities_onehot.columns[:-1])
Cities_onehot = Cities_onehot[fixed_columns]

Cities_onehot.head()

Unnamed: 0,City,Accessories Store,African Restaurant,American Restaurant,Antique Shop,Apple Wine Pub,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Vacation Rental,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,Vienna,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Vienna,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Vienna,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Vienna,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Vienna,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Grouping cities and averaging the venue category count for every city

In [268]:
Cities_grouped = Cities_onehot.groupby('City').mean().reset_index()
Cities_grouped.head()

Unnamed: 0,City,Accessories Store,African Restaurant,American Restaurant,Antique Shop,Apple Wine Pub,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Vacation Rental,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Zoo
0,Amsterdam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0
1,Auckland,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Basel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,...,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01
3,Berlin,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.02,0.0,...,0.01,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0
4,Bern,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,...,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01
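
Taking the mean of one-hot columns per city is the same as computing the relative frequency of each venue category within that city. A minimal sketch in plain Python illustrates this on a handful of made-up (city, category) pairs; the data are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Made-up venue list: (city, venue category)
venues = [("Basel", "Hotel"), ("Basel", "Café"), ("Basel", "Hotel"), ("Basel", "Zoo"),
          ("Bern", "Café"), ("Bern", "Café")]

# Count categories per city
counts = defaultdict(Counter)
for city, category in venues:
    counts[city][category] += 1

# Divide by the number of venues in the city to get relative frequencies
freq = {}
for city, counter in counts.items():
    total = sum(counter.values())
    freq[city] = {category: n / total for category, n in counter.items()}

print(freq)
```

Because each row of frequencies sums to 1, cities with many and few returned venues become directly comparable.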


### Defining a function to return the most common venues and make a list of the 20 most common venues per city

In [159]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [287]:
num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
City_venues_sorted = pd.DataFrame(columns=columns)
City_venues_sorted['City'] = Cities_grouped['City']

for ind in np.arange(Cities_grouped.shape[0]):
    City_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Cities_grouped.iloc[ind, :], num_top_venues)

City_venues_sorted.head()

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,...,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,Amsterdam,Hotel,French Restaurant,Sandwich Place,Café,Bar,Marijuana Dispensary,Bookstore,Restaurant,Beer Bar,...,Cocktail Bar,Dessert Shop,Organic Grocery,Breakfast Spot,Hotel Bar,Clothing Store,Plaza,Cheese Shop,Bakery,Burger Joint
1,Auckland,Café,Japanese Restaurant,Hotel,Restaurant,Dessert Shop,Bar,Bistro,Gym,Coffee Shop,...,Pizza Place,Mexican Restaurant,Indian Restaurant,Steakhouse,Ice Cream Shop,Sushi Restaurant,Italian Restaurant,Monument / Landmark,Food Court,Theater
2,Basel,Hotel,Café,Swiss Restaurant,Italian Restaurant,French Restaurant,Plaza,Bar,Pub,Art Museum,...,Coffee Shop,Museum,Beer Garden,Thai Restaurant,Food Court,Coworking Space,Comic Shop,Zoo,Pastry Shop,Pedestrian Plaza
3,Berlin,Hotel,History Museum,Art Gallery,Monument / Landmark,Bookstore,Theater,Historic Site,Concert Hall,Ice Cream Shop,...,Plaza,Japanese Restaurant,Clothing Store,Coffee Shop,Park,Middle Eastern Restaurant,Steakhouse,Cocktail Bar,Indie Theater,Art Museum
4,Bern,Swiss Restaurant,Café,Bar,Restaurant,Italian Restaurant,Park,Plaza,Hotel,Bakery,...,Ice Cream Shop,Gastropub,Creperie,Clothing Store,Cocktail Bar,Coffee Shop,Performing Arts Venue,Church,Convention Center,Burger Joint
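
The column names above are built with a `try`/`except` around a three-element suffix list, which works for 20 columns but would produce "21th" for larger values. As a sketch, a small helper (hypothetical name `ordinal`) can generate correct English ordinals for any positive integer:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th, 21st."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 10th to 20th, 110th to 120th, ... always take "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


num_top_venues = 20
columns = ["City"] + ["{} Most Common Venue".format(ordinal(i))
                      for i in range(1, num_top_venues + 1)]
print(columns[:4], "...", columns[-1])
```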


### Perform K-means clustering with 4 clusters on the City-Venue dataset

4 clusters were chosen, to visualize coarse trends.

In [187]:
# set number of clusters
kclusters = 4

Cities_grouped_clustering = Cities_grouped.drop(columns='City')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Cities_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:20] 

array([0, 0, 2, 0, 2, 0, 0, 0, 2, 0, 3, 0, 0, 1, 0, 0, 0, 2, 0, 2],
      dtype=int32)
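
For readers who want to see what the clustering step does conceptually, here is a bare-bones sketch of Lloyd's iteration in NumPy. It is not scikit-learn's implementation (there is no k-means++ initialization and there are no multiple restarts), the function name `kmeans_sketch` is hypothetical, and the toy points are made up.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # squared Euclidean distance of every point to every center, shape (n, k)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # final assignment against the final centers
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centers


# Two well-separated toy groups of points
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = kmeans_sketch(X, k=2)
print(labels)
```

In the notebook, each row of `X` corresponds to one city and each column to the frequency of one venue category.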

In [188]:
# add clustering labels
City_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Cities_top20= top20_list

# join the sorted venue table onto the city coordinates to add latitude/longitude for each city
Cities_top20 = Cities_top20.join(City_venues_sorted.set_index('City'), on='City')

Cities_top20.head()

Unnamed: 0,City,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,...,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,Vienna,48.208354,16.372504,2,Restaurant,Austrian Restaurant,Hotel,Italian Restaurant,Plaza,Café,...,Art Museum,French Restaurant,History Museum,Bakery,Pedestrian Plaza,Palace,Canal,Church,Clothing Store,Sandwich Place
1,Zürich,47.372394,8.542333,2,Swiss Restaurant,Bar,Café,Plaza,Cocktail Bar,French Restaurant,...,Italian Restaurant,Gourmet Shop,Hotel,Boutique,Gym / Fitness Center,Pedestrian Plaza,Dessert Shop,Bratwurst Joint,Flea Market,Bookstore
2,Munich,48.137108,11.575382,0,Café,Plaza,German Restaurant,Hotel,Cocktail Bar,Bavarian Restaurant,...,Bookstore,Church,Trattoria/Osteria,Boutique,Opera House,Steakhouse,Fast Food Restaurant,Jazz Club,Fish Market,Cupcake Shop
3,Auckland,-36.853467,174.765551,0,Café,Japanese Restaurant,Hotel,Restaurant,Dessert Shop,Bar,...,Pizza Place,Mexican Restaurant,Indian Restaurant,Steakhouse,Ice Cream Shop,Sushi Restaurant,Italian Restaurant,Monument / Landmark,Food Court,Theater
4,Vancouver,49.260872,-123.113953,0,Coffee Shop,Brewery,Japanese Restaurant,Bakery,Chinese Restaurant,Park,...,Trail,Café,Lounge,Taco Place,Bagel Shop,Pizza Place,Liquor Store,Ice Cream Shop,Grocery Store,Vietnamese Restaurant


### Creating a world map, visualizing the clusters in different colours

In [189]:
# create map
map_clusters = folium.Map(location=[0, 20], zoom_start=2)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Cities_top20['latitude'], Cities_top20['longitude'], Cities_top20['City'], Cities_top20['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Observation (2)

A quick k-means clustering analysis on the venue-category frequencies (1,807 venues in total across the 20 cities, collected within a 2 km radius around each city center) suggests that there is a specific signature of Swiss/Austrian cities (turquoise). All other cities cluster together, except for Luxembourg and Ottawa, two special cases. The former is the only major city of its country and a European political and banking capital. The latter is a rather small Canadian city (much smaller than Toronto or Vancouver) and, with its mix of French and English influences, quite unique.

_____________________________________________________________________________________________________________________

# Exploration of cities outside the top 20

I next explore cities that were ranked just outside the top 30 and cities that were ranked 130 and lower, to see whether I can distinguish them from the top 20 cities shown above by using Foursquare data.

These cities are Dublin (#33), Boston (#36) and Paris (#39), as well as Istanbul (#130), Bangkok (#133) and Moscow (#167).

In [469]:
new_Cities_df = pd.DataFrame({'Rank': [33, 36, 39, 130, 133, 167],
                               'City':['Dublin', 'Boston', 'Paris', 'Istanbul', 'Bangkok', 'Moscow'],
                               'Country':['Ireland', 'USA', 'France', 'Turkey', 'Thailand', 'Russia']}) 
new_Cities_df.head(6)

Unnamed: 0,Rank,City,Country
0,33,Dublin,Ireland
1,36,Boston,USA
2,39,Paris,France
3,130,Istanbul,Turkey
4,133,Bangkok,Thailand
5,167,Moscow,Russia


### Retrieving coordinates for the new cities

In [471]:
# named City_list because the map and venue cells below refer to it by that name
geolocator = Nominatim(user_agent="city_explorer")

rows = []
for address in new_Cities_df['City']:
    location = geolocator.geocode(address)
    rows.append({'City': address, 'latitude': location.latitude, 'longitude': location.longitude})

# build the dataframe once; DataFrame.append was removed in pandas 2.0
City_list = pd.DataFrame(rows)

City_list.head(6)

Unnamed: 0,City,latitude,longitude
0,Dublin,53.349764,-6.260273
1,Boston,42.360253,-71.058291
2,Paris,48.85661,2.351499
3,Istanbul,41.009633,28.965165
4,Bangkok,13.753893,100.81608
5,Moscow,55.750446,37.617494


### Create a map visualizing all three city sets: Top 20 (blue), just outside Top 30 (green) and below 100 (red)

In [240]:
# Make an empty map
worldmap = folium.Map(location=[0, 20], tiles="Mapbox Bright", zoom_start=2)
 
# Top 20 cities in blue
for i in range(0,len(top20_list)):
    folium.Marker([top20_list.iloc[i]['latitude'], top20_list.iloc[i]['longitude']],
                  popup = top20_list.iloc[i]['City']).add_to(worldmap)

# Just outside top 30 cities in green
for i in range(0,3):
    folium.Marker([City_list.iloc[i]['latitude'], City_list.iloc[i]['longitude']],
                  popup = City_list.iloc[i]['City'],
                  icon = folium.Icon(color='green')).add_to(worldmap)
    
# Below top 100 cities in red
for i in range(3,len(City_list)):
    folium.Marker([City_list.iloc[i]['latitude'], City_list.iloc[i]['longitude']],
                  popup = City_list.iloc[i]['City'],
                  icon = folium.Icon(color='red')).add_to(worldmap)
    
worldmap  

# The following steps repeat from above (getting Foursquare data and clustering)

In [241]:
new_City_venues = getNearbyVenues(names = City_list['City'],
                                   latitudes = City_list['latitude'],
                                   longitudes = City_list['longitude']
                                  )

Dublin
Boston
Paris
Istanbul
Bangkok
Moscow


In [311]:
# one hot encoding
new_Cities_onehot = pd.get_dummies(new_City_venues[['Venue Category']], prefix="", prefix_sep="")

# add City column back to dataframe
new_Cities_onehot['City'] = new_City_venues['City'] 

# move City column to the first column
new_fixed_columns = [new_Cities_onehot.columns[-1]] + list(new_Cities_onehot.columns[:-1])
new_Cities_onehot = new_Cities_onehot[new_fixed_columns]

new_Cities_grouped = new_Cities_onehot.groupby('City').mean().reset_index()

num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
new_City_venues_sorted = pd.DataFrame(columns=columns)
new_City_venues_sorted['City'] = new_Cities_grouped['City']

for ind in np.arange(new_Cities_grouped.shape[0]):
    new_City_venues_sorted.iloc[ind, 1:] = return_most_common_venues(new_Cities_grouped.iloc[ind, :], num_top_venues)
    
new_City_venues_sorted

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,...,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,Bangkok,Restaurant,Hotpot Restaurant,Bus Stop,Coffee Shop,Convenience Store,Department Store,Market,Noodle House,Accessories Store,...,Comic Shop,Farmers Market,Falafel Restaurant,Exhibit,Event Space,Electronics Store,Donut Shop,Discount Store,Diner,Dessert Shop
1,Boston,Italian Restaurant,Coffee Shop,Park,Pizza Place,Hotel,Historic Site,Seafood Restaurant,Market,American Restaurant,...,Salad Place,Sandwich Place,Yoga Studio,Gastropub,Aquarium,Grocery Store,Opera House,Neighborhood,Church,Pastry Shop
2,Dublin,Pub,Café,Coffee Shop,Restaurant,Theater,Hotel,Sushi Restaurant,Ice Cream Shop,Italian Restaurant,...,Museum,Plaza,Pizza Place,Deli / Bodega,Cocktail Bar,Tapas Restaurant,Gastropub,Vietnamese Restaurant,Korean Restaurant,Comic Shop
3,Istanbul,Hotel,Turkish Restaurant,Historic Site,Jewelry Store,Coffee Shop,Café,Mosque,Art Gallery,Plaza,...,Bookstore,Department Store,Dessert Shop,Neighborhood,Gift Shop,Restaurant,Arts & Crafts Store,Electronics Store,Food & Drink Shop,Health Food Store
4,Moscow,Plaza,Coffee Shop,Art Gallery,Hotel,Boutique,Bookstore,Jewelry Store,Cocktail Bar,Theater,...,Salon / Barbershop,Asian Restaurant,Opera House,Art Museum,Park,Russian Restaurant,Pizza Place,Concert Hall,Restaurant,Photography Studio
5,Paris,French Restaurant,Cocktail Bar,Plaza,Sandwich Place,Art Gallery,Ice Cream Shop,Japanese Restaurant,Coffee Shop,Restaurant,...,Wine Bar,Pedestrian Plaza,Art Museum,Gourmet Shop,Tapas Restaurant,Café,Hotel,Museum,Bakery,Historic Site


_____________________________________________________________________________________________________________________

## Clustering of new and established top 20 cities

First, I cluster the six query cities together with the top 20 cities and see if they can be separated.

In [473]:
new_Cities_grouped_clustering = new_Cities_grouped.drop(columns='City')
Cities_grouped_clustering = Cities_grouped.drop(columns='City')

frames = [Cities_grouped_clustering, new_Cities_grouped_clustering]
comb_Cities_clustering = pd.concat(frames,sort=False).reset_index(drop = True)
comb_Cities_clustering.fillna(value=0, inplace = True)
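
The two frequency tables do not share the same venue categories, so concatenating them leaves gaps that are then filled with zeros. A minimal sketch with two tiny made-up frames shows the effect; the category names and values are illustrative assumptions.

```python
import pandas as pd

# Made-up frequency tables with partly different category columns
top = pd.DataFrame({"Café": [0.10, 0.20], "Hotel": [0.05, 0.00]})
new = pd.DataFrame({"Café": [0.15], "Mosque": [0.02]})

# Concatenating takes the union of the columns; missing entries become NaN
combined = pd.concat([top, new], sort=False).reset_index(drop=True)

# A category that never occurs in a city has frequency 0
combined = combined.fillna(0)
print(combined)
```

Filling with 0 is the natural choice here, because an absent column means the category was not among the venues returned for that city.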

In [474]:
# set number of clusters
kclusters = 2

# run k-means clustering
kmeans2 = KMeans(n_clusters=kclusters, random_state=0).fit(comb_Cities_clustering)

# check cluster labels generated for each row in the dataframe
kmeans2.labels_ 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0], dtype=int32)
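
One simple way to quantify whether the two clusters separate the two groups of cities is to count label occurrences per group. The sketch below uses the 26 labels printed above and relies on the row order of the concatenation (the first 20 rows are the top 20 cities, the last 6 the query cities). The group names are just illustrative strings.

```python
from collections import Counter

# Labels copied from the k-means output above (26 cities)
labels = [0] * 10 + [1] + [0] * 15

# Row order follows the concatenation: 20 top cities, then 6 query cities
groups = ["top 20"] * 20 + ["query"] * 6

table = Counter(zip(groups, labels))
for (group, label), n in sorted(table.items()):
    print("{:>7} | cluster {} | {} cities".format(group, label, n))
```

If the clusters reflected the ranking, the query cities (or at least the three low-ranked ones) would fall into their own cluster; here all six share a cluster with 19 of the top 20 cities.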

### After clustering, the labels are added to the cities and displayed on the map

In [475]:
num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
City_venues_sorted = pd.DataFrame(columns=columns)
City_venues_sorted['City'] = Cities_grouped['City']

for ind in np.arange(Cities_grouped.shape[0]):
    City_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Cities_grouped.iloc[ind, :], num_top_venues)

In [476]:
frames = [City_venues_sorted, new_City_venues_sorted]
comb_venues = pd.concat(frames,sort=False).reset_index(drop = True)

frames2 = [top20_list, City_list]
comb_Cities_list = pd.concat(frames2,sort=False).reset_index(drop = True)

# add clustering labels
comb_venues.insert(0, 'Cluster Labels', kmeans2.labels_)
new_comb_Cities_list = comb_Cities_list.join(comb_venues.set_index('City'), on='City')

In [321]:
# create map
map_clusters = folium.Map(location=[0, 20], zoom_start=2)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(new_comb_Cities_list['latitude'], new_comb_Cities_list['longitude'],
                                  new_comb_Cities_list['City'], new_comb_Cities_list['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Observation (3)

I cannot assign the cities with a lower quality of living, according to the Mercer list, to a different cluster than the cities within the top 40. Again, only Ottawa clusters apart, as its top venues seem more rural than urban.

There are a number of possible reasons why the clustering was unsuccessful:

1. The features extracted were not good enough for clustering "low" vs. "high" quality.
2. The cities do not differ as much as expected, based on the venues listed in Foursquare.
3. The differences were too subtle to be detected by a binary clustering (low vs. high quality).

I therefore adjust my strategy so that I can distinguish the cities with a lower quality of living from the ones with a higher quality.

### First, I repeat the clustering with more than only 2 clusters - 4 to be exact

In [332]:
new_Cities_grouped_clustering = new_Cities_grouped.drop(columns='City')
Cities_grouped_clustering = Cities_grouped.drop(columns='City')

frames = [Cities_grouped_clustering, new_Cities_grouped_clustering]
comb_Cities_clustering = pd.concat(frames,sort=False).reset_index(drop = True)
comb_Cities_clustering.fillna(value=0, inplace = True)
comb_Cities_clustering.head()

# set number of clusters
kclusters = 4

# run k-means clustering
kmeans2 = KMeans(n_clusters=kclusters, random_state=0).fit(comb_Cities_clustering)

num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
City_venues_sorted = pd.DataFrame(columns=columns)
City_venues_sorted['City'] = Cities_grouped['City']

for ind in np.arange(Cities_grouped.shape[0]):
    City_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Cities_grouped.iloc[ind, :], num_top_venues)

City_venues_sorted.head()

frames = [City_venues_sorted, new_City_venues_sorted]
comb_venues = pd.concat(frames,sort=False).reset_index(drop = True)

frames2 = [top20_list, City_list]
comb_Cities_list = pd.concat(frames2,sort=False).reset_index(drop = True)

# add clustering labels
comb_venues.insert(0, 'Cluster Labels', kmeans2.labels_)

new_comb_Cities_list = comb_Cities_list.join(comb_venues.set_index('City'), on='City')

In [335]:
# create map
map_clusters = folium.Map(location=[0, 20], zoom_start=2)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(new_comb_Cities_list['latitude'], new_comb_Cities_list['longitude'],
                                  new_comb_Cities_list['City'], new_comb_Cities_list['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Observation (4)

Using more clusters (4 in this case) revealed that Bangkok clusters apart from all other cities, just like Ottawa and Luxembourg, the two distinct cities discussed above.

This means that either my feature selection is not sufficient to distinguish low from high quality of life cities or, again, the cities do not differ greatly based on the data available from Foursquare.

Therefore, I will broaden my search radius around the city center and include more venues in my search/clustering. This likely smooths out the differences between e.g. Swiss and German cities, but could allow distinguishing low from high quality of life cities, as classified by Mercer.

### I will get the venue data, now within 10 km around the city center, for the top 20 and the other selected cities

In [385]:
def getNearbyVenues(names, latitudes, longitudes, radius=10000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    City_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    City_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(City_venues)

In [386]:
City_venues = getNearbyVenues(names = top20_list['City'],
                                   latitudes = top20_list['latitude'],
                                   longitudes = top20_list['longitude']
                                  )

new_City_venues = getNearbyVenues(names = City_list['City'],
                                   latitudes = City_list['latitude'],
                                   longitudes = City_list['longitude']
                                  )

Vienna
Zürich
Munich
Auckland
Vancouver
Düsseldorf
Frankfurt
Geneva
Copenhagen
Basel
Sydney
Amsterdam
Berlin
Bern
Wellington
Toronto
Melbourne
Luxembourg
Ottawa
Hamburg
Dublin
Boston
Paris
Istanbul
Bangkok
Moscow


In [387]:
# one hot encoding
Cities_onehot = pd.get_dummies(City_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Cities_onehot['City'] = City_venues['City'] 

# move City column to the first column
fixed_columns = [Cities_onehot.columns[-1]] + list(Cities_onehot.columns[:-1])
Cities_onehot = Cities_onehot[fixed_columns]

Cities_grouped = Cities_onehot.groupby('City').mean().reset_index()

num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
City_venues_sorted = pd.DataFrame(columns=columns)
City_venues_sorted['City'] = Cities_grouped['City']

for ind in np.arange(Cities_grouped.shape[0]):
    City_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Cities_grouped.iloc[ind, :], num_top_venues)

In [388]:
# one hot encoding
new_Cities_onehot = pd.get_dummies(new_City_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
new_Cities_onehot['City'] = new_City_venues['City'] 

# move City column to the first column
new_fixed_columns = [new_Cities_onehot.columns[-1]] + list(new_Cities_onehot.columns[:-1])
new_Cities_onehot = new_Cities_onehot[new_fixed_columns]

new_Cities_grouped = new_Cities_onehot.groupby('City').mean().reset_index()

num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
new_City_venues_sorted = pd.DataFrame(columns=columns)
new_City_venues_sorted['City'] = new_Cities_grouped['City']

for ind in np.arange(new_Cities_grouped.shape[0]):
    new_City_venues_sorted.iloc[ind, 1:] = return_most_common_venues(new_Cities_grouped.iloc[ind, :], num_top_venues)

In [390]:
new_Cities_grouped_clustering = new_Cities_grouped.drop('City', axis=1)
Cities_grouped_clustering = Cities_grouped.drop('City', axis=1)

frames = [Cities_grouped_clustering, new_Cities_grouped_clustering]
comb_Cities_clustering = pd.concat(frames,sort=False).reset_index(drop = True)
comb_Cities_clustering.fillna(value=0, inplace = True)
comb_Cities_clustering.head()

# set number of clusters
kclusters = 4

# run k-means clustering
kmeans2 = KMeans(n_clusters=kclusters, random_state=0).fit(comb_Cities_clustering)

num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
City_venues_sorted = pd.DataFrame(columns=columns)
City_venues_sorted['City'] = Cities_grouped['City']

for ind in np.arange(Cities_grouped.shape[0]):
    City_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Cities_grouped.iloc[ind, :], num_top_venues)

City_venues_sorted.head()

frames = [City_venues_sorted, new_City_venues_sorted]
comb_venues = pd.concat(frames,sort=False).reset_index(drop = True)

frames2 = [top20_list, City_list]
comb_Cities_list = pd.concat(frames2,sort=False).reset_index(drop = True)

# add clustering labels
comb_venues.insert(0, 'Cluster Labels', kmeans2.labels_)

new_comb_Cities_list = comb_Cities_list.join(comb_venues.set_index('City'), on='City')

In [391]:
# create map
map_clusters = folium.Map(location=[0, 20], zoom_start=2)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(new_comb_Cities_list['latitude'], new_comb_Cities_list['longitude'],
                                  new_comb_Cities_list['City'], new_comb_Cities_list['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Observation (5)

Using more information (Foursquare venues within a radius of 10 km around the city centre) revealed that Bangkok still clusters apart from all other cities. However, the remaining clusters mix cities across continents and across coastal and inland locations, and they show no separation that matches the Mercer ranking.
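
One way to make this comparison explicit is to cross-tabulate the cluster labels against a coarse Mercer tier. The following is a minimal sketch with pandas. The ranks are those quoted in this notebook, but the cluster labels are invented placeholders (not the output of the clustering above), and the tier boundaries are an assumption chosen for illustration.

```python
import pandas as pd

# Ranks as quoted in this notebook; cluster labels are invented placeholders.
demo = pd.DataFrame({
    "City": ["Vienna", "Basel", "Dublin", "Paris", "Bangkok", "Moscow"],
    "Mercer rank": [1, 10, 33, 39, 133, 167],
    "Cluster Labels": [0, 0, 1, 0, 2, 1],
})

# Assumed tier boundaries: top 20, ranks 21-100, ranks 101-231.
demo["Tier"] = pd.cut(demo["Mercer rank"], bins=[0, 20, 100, 231],
                      labels=["top 20", "21-100", "101-231"])

# Rows are tiers, columns are cluster labels, cells are city counts.
table = pd.crosstab(demo["Tier"], demo["Cluster Labels"])
print(table)
```

If clusters tracked quality of living, each tier would concentrate in its own column; counts spread across columns indicate the mixing described above.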

_____________________________________________________________________________________________________________________


# Intermediate Result

Based on Foursquare data alone, one cannot cluster cities according to their quality of living (in comparison to a professionally curated list that takes politics and the natural and economic environment into account).

It is, however, possible to work out differences between cities, even within Europe or Canada, when focusing on the venues in the city centers.

### Next steps:

- Confine search for venues to city centers
- Increase cluster number, to allow finding smaller differences
- Search for the query cities, which are not in the list (Quebec, Nice, Zug) and compare them to my already explored cities.
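
The cluster number in this notebook is chosen by hand. As a possible cross-check, the sketch below scores several values of `k` with the silhouette coefficient from scikit-learn. The data is synthetic (three well-separated blobs standing in for city feature vectors), so it only shows the mechanics and says nothing about the right `k` for the city data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 well-separated groups of 10 "cities" with 5 features each.
centers = np.array([[0.0] * 5, [5.0] * 5, [10.0] * 5])
X = np.vstack([c + rng.normal(scale=0.3, size=(10, 5)) for c in centers])

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is the best-separated partition by this measure.
best_k = max(scores, key=scores.get)
```

With only 26 to 29 cities, scores for larger `k` rest on very small clusters, so such a measure is a rough guide at best.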

### I again search for venues, this time though within 2.5 km of the city centers

### I also increase the cluster number to 8, in order to be able to pick up more subtle differences between city characteristics

In [392]:
def getNearbyVenues(names, latitudes, longitudes, radius=2500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    City_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    City_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(City_venues)

In [393]:
City_venues = getNearbyVenues(names = top20_list['City'],
                                   latitudes = top20_list['latitude'],
                                   longitudes = top20_list['longitude']
                                  )

new_City_venues = getNearbyVenues(names = City_list['City'],
                                   latitudes = City_list['latitude'],
                                   longitudes = City_list['longitude']
                                  )

Vienna
Zürich
Munich
Auckland
Vancouver
Düsseldorf
Frankfurt
Geneva
Copenhagen
Basel
Sydney
Amsterdam
Berlin
Bern
Wellington
Toronto
Melbourne
Luxembourg
Ottawa
Hamburg
Dublin
Boston
Paris
Istanbul
Bangkok
Moscow


In [394]:
# one hot encoding
Cities_onehot = pd.get_dummies(City_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Cities_onehot['City'] = City_venues['City'] 

# move City column to the first column
fixed_columns = [Cities_onehot.columns[-1]] + list(Cities_onehot.columns[:-1])
Cities_onehot = Cities_onehot[fixed_columns]

Cities_grouped = Cities_onehot.groupby('City').mean().reset_index()

num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
City_venues_sorted = pd.DataFrame(columns=columns)
City_venues_sorted['City'] = Cities_grouped['City']

for ind in np.arange(Cities_grouped.shape[0]):
    City_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Cities_grouped.iloc[ind, :], num_top_venues)

In [395]:
# one hot encoding
new_Cities_onehot = pd.get_dummies(new_City_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
new_Cities_onehot['City'] = new_City_venues['City'] 

# move City column to the first column
new_fixed_columns = [new_Cities_onehot.columns[-1]] + list(new_Cities_onehot.columns[:-1])
new_Cities_onehot = new_Cities_onehot[new_fixed_columns]

new_Cities_grouped = new_Cities_onehot.groupby('City').mean().reset_index()

num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
new_City_venues_sorted = pd.DataFrame(columns=columns)
new_City_venues_sorted['City'] = new_Cities_grouped['City']

for ind in np.arange(new_Cities_grouped.shape[0]):
    new_City_venues_sorted.iloc[ind, 1:] = return_most_common_venues(new_Cities_grouped.iloc[ind, :], num_top_venues)

In [397]:
new_Cities_grouped_clustering = new_Cities_grouped.drop('City', axis=1)
Cities_grouped_clustering = Cities_grouped.drop('City', axis=1)

frames = [Cities_grouped_clustering, new_Cities_grouped_clustering]
comb_Cities_clustering = pd.concat(frames,sort=False).reset_index(drop = True)
comb_Cities_clustering.fillna(value=0, inplace = True)
comb_Cities_clustering.head()

# set number of clusters
kclusters = 8

# run k-means clustering
kmeans2 = KMeans(n_clusters=kclusters, random_state=0).fit(comb_Cities_clustering)

num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
City_venues_sorted = pd.DataFrame(columns=columns)
City_venues_sorted['City'] = Cities_grouped['City']

for ind in np.arange(Cities_grouped.shape[0]):
    City_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Cities_grouped.iloc[ind, :], num_top_venues)

City_venues_sorted.head()

frames = [City_venues_sorted, new_City_venues_sorted]
comb_venues = pd.concat(frames,sort=False).reset_index(drop = True)

frames2 = [top20_list, City_list]
comb_Cities_list = pd.concat(frames2,sort=False).reset_index(drop = True)

# add clustering labels
comb_venues.insert(0, 'Cluster Labels', kmeans2.labels_)

new_comb_Cities_list = comb_Cities_list.join(comb_venues.set_index('City'), on='City')

In [398]:
# create map
map_clusters = folium.Map(location=[0, 20], zoom_start=2)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(new_comb_Cities_list['latitude'], new_comb_Cities_list['longitude'],
                                  new_comb_Cities_list['City'], new_comb_Cities_list['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Observation (6)

Using Foursquare information from only within a radius of 2.5 km around the city centre revealed that Bangkok, Moscow, Istanbul, Ottawa and Luxembourg each cluster alone. While we still cannot separate low from high quality of living, we now see many more differences between cities, which is a good basis for further comparisons.
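
Cities that sit alone in their cluster can be listed directly from the labels instead of being read off the map. This is a minimal sketch using only the standard library; `singleton_cities` is a hypothetical helper, and the labels below are invented for illustration.

```python
from collections import Counter


def singleton_cities(cities, labels):
    """Return the cities whose cluster label occurs exactly once."""
    counts = Counter(labels)
    return [city for city, label in zip(cities, labels) if counts[label] == 1]


# Invented example labels, not the clustering result above.
cities = ["Bern", "Basel", "Ottawa", "Bangkok"]
labels = [1, 1, 4, 7]

singletons = singleton_cities(cities, labels)
```

With the notebook's variables, a call such as `singleton_cities(comb_venues['City'], kmeans2.labels_)` should work, assuming both are in the same row order.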

# Query cities

### Now I define my query cities from the beginning, search for their venue data on Foursquare and cluster them with the cities I already have information for

In [438]:
query_Cities_df = pd.DataFrame({'City':['Quebec City', 'Nice', 'Zug'],
                               'Country':['Canada','France','Switzerland']}) 

# geocode each query city, collecting the rows in a list
# (DataFrame.append was removed in pandas 2.0)
geolocator = Nominatim(user_agent="city_explorer2")
rows = []

for address in query_Cities_df['City']:
    location = geolocator.geocode(address)
    rows.append({'City': address, 'latitude': location.latitude, 'longitude': location.longitude})

query_city_list = pd.DataFrame(rows)

query_city_list.head(6)

Unnamed: 0,City,latitude,longitude
0,Quebec City,46.82596,-71.235223
1,Nice,43.700936,7.268391
2,Zug,47.16799,8.517365


In [439]:
query_City_venues = getNearbyVenues(names = query_city_list['City'],
                                   latitudes = query_city_list['latitude'],
                                   longitudes = query_city_list['longitude']
                                  )

Quebec City
Nice
Zug


In [440]:
# one hot encoding
query_Cities_onehot = pd.get_dummies(query_City_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
query_Cities_onehot['City'] = query_City_venues['City'] 

# move City column to the first column
query_fixed_columns = [query_Cities_onehot.columns[-1]] + list(query_Cities_onehot.columns[:-1])
query_Cities_onehot = query_Cities_onehot[query_fixed_columns]

query_Cities_grouped = query_Cities_onehot.groupby('City').mean().reset_index()

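num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create the sorted-venue table for the query cities (it is combined with the other tables later)
query_City_venues_sorted = pd.DataFrame(columns=columns)
query_City_venues_sorted['City'] = query_Cities_grouped['City']

for ind in np.arange(query_Cities_grouped.shape[0]):
    query_City_venues_sorted.iloc[ind, 1:] = return_most_common_venues(query_Cities_grouped.iloc[ind, :], num_top_venues)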

In [452]:
new_Cities_grouped_clustering = new_Cities_grouped.drop('City', axis=1)
Cities_grouped_clustering = Cities_grouped.drop('City', axis=1)
query_grouped_clustering = query_Cities_grouped.drop('City', axis=1)

frames = [Cities_grouped_clustering, new_Cities_grouped_clustering, query_grouped_clustering]
comb_Cities_clustering_2 = pd.concat(frames,sort=False).reset_index(drop = True)
comb_Cities_clustering_2.fillna(value=0, inplace = True)

In [456]:
# set number of clusters
kclusters = 8

# run k-means clustering
kmeans2 = KMeans(n_clusters=kclusters, random_state=0).fit(comb_Cities_clustering_2)

num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
City_venues_sorted = pd.DataFrame(columns=columns)
City_venues_sorted['City'] = Cities_grouped['City']

for ind in np.arange(Cities_grouped.shape[0]):
    City_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Cities_grouped.iloc[ind, :], num_top_venues)

    
# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
new_City_venues_sorted = pd.DataFrame(columns=columns)
new_City_venues_sorted['City'] = new_Cities_grouped['City']

for ind in np.arange(new_Cities_grouped.shape[0]):
    new_City_venues_sorted.iloc[ind, 1:] = return_most_common_venues(new_Cities_grouped.iloc[ind, :], num_top_venues)
    

    
frames = [City_venues_sorted, new_City_venues_sorted,query_City_venues_sorted]
comb_venues_2 = pd.concat(frames,sort=False).reset_index(drop = True)

frames2 = [top20_list, City_list, query_city_list]
comb_Cities_list_2 = pd.concat(frames2,sort=False).reset_index(drop = True)


# add clustering labels
comb_venues_2.insert(0, 'Cluster Labels', kmeans2.labels_)
new_comb_Cities_list_2 = comb_Cities_list_2.join(comb_venues_2.set_index('City'), on='City')

# show the first rows of the combined table
new_comb_Cities_list_2.head(10)

Unnamed: 0,City,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,...,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,Vienna,48.208354,16.372504,6,Restaurant,Hotel,Austrian Restaurant,Plaza,Italian Restaurant,Ice Cream Shop,...,Pedestrian Plaza,Concert Hall,French Restaurant,Cocktail Bar,Theater,Museum,Jazz Club,Lounge,Bratwurst Joint,Jewish Restaurant
1,Zürich,47.372394,8.542333,1,Bar,Swiss Restaurant,Plaza,Café,Cocktail Bar,Vegetarian / Vegan Restaurant,...,Boutique,Pedestrian Plaza,Gourmet Shop,Coffee Shop,Dessert Shop,Gym / Fitness Center,Italian Restaurant,Shoe Store,Wine Bar,Bookstore
2,Munich,48.137108,11.575382,3,Café,Plaza,Cocktail Bar,Hotel,German Restaurant,Coffee Shop,...,Boutique,Restaurant,Bavarian Restaurant,Steakhouse,Italian Restaurant,Seafood Restaurant,Bookstore,Board Shop,Fast Food Restaurant,Farmers Market
3,Auckland,-36.853467,174.765551,5,Café,Japanese Restaurant,Hotel,Bar,Dessert Shop,Park,...,Indian Restaurant,Ice Cream Shop,Steakhouse,Theater,Monument / Landmark,Italian Restaurant,Food Court,Malay Restaurant,Asian Restaurant,Gym
4,Vancouver,49.260872,-123.113953,5,Bakery,Coffee Shop,Trail,Sushi Restaurant,Dessert Shop,Park,...,Middle Eastern Restaurant,Taco Place,Seafood Restaurant,BBQ Joint,Liquor Store,Furniture / Home Store,Japanese Restaurant,Indian Restaurant,Vegetarian / Vegan Restaurant,Yoga Studio
5,Düsseldorf,51.225402,6.776314,5,Japanese Restaurant,Hotel,Coffee Shop,Bar,Grocery Store,Café,...,Park,Bakery,Bookstore,Ramen Restaurant,Steakhouse,Wine Bar,Pizza Place,Sushi Restaurant,Pedestrian Plaza,German Restaurant
6,Frankfurt,50.110644,8.682092,5,Café,Bar,Hotel,Japanese Restaurant,Restaurant,Plaza,...,Coffee Shop,Thai Restaurant,Theater,Park,Ice Cream Shop,Apple Wine Pub,German Restaurant,Scenic Lookout,Bakery,Pedestrian Plaza
7,Geneva,46.201756,6.146601,3,French Restaurant,Hotel,Italian Restaurant,Plaza,Bar,Coffee Shop,...,Department Store,Café,Steakhouse,Jewelry Store,Farmers Market,Swiss Restaurant,Movie Theater,Neighborhood,Ice Cream Shop,Monument / Landmark
8,Copenhagen,55.686724,12.570072,5,Café,Beer Bar,Bakery,Coffee Shop,Scandinavian Restaurant,Cocktail Bar,...,Sandwich Place,Bar,Steakhouse,Seafood Restaurant,Burger Joint,Breakfast Spot,Art Museum,Camera Store,Pub,Pizza Place
9,Basel,47.558108,7.587826,1,Café,Hotel,Italian Restaurant,Swiss Restaurant,Bar,Plaza,...,Restaurant,Thai Restaurant,Coffee Shop,Museum,Food Court,Japanese Restaurant,Jazz Club,Juice Bar,Park,Mexican Restaurant


In [466]:
# create map
map_clusters = folium.Map(location=[0, 20], zoom_start=2)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(new_comb_Cities_list_2['latitude'], new_comb_Cities_list_2['longitude'],
                                  new_comb_Cities_list_2['City'], new_comb_Cities_list_2['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [463]:
new_comb_Cities_list_2.loc[new_comb_Cities_list_2['Cluster Labels'] == 1]

Unnamed: 0,City,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,...,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
1,Zürich,47.372394,8.542333,1,Bar,Swiss Restaurant,Plaza,Café,Cocktail Bar,Vegetarian / Vegan Restaurant,...,Boutique,Pedestrian Plaza,Gourmet Shop,Coffee Shop,Dessert Shop,Gym / Fitness Center,Italian Restaurant,Shoe Store,Wine Bar,Bookstore
9,Basel,47.558108,7.587826,1,Café,Hotel,Italian Restaurant,Swiss Restaurant,Bar,Plaza,...,Restaurant,Thai Restaurant,Coffee Shop,Museum,Food Court,Japanese Restaurant,Jazz Club,Juice Bar,Park,Mexican Restaurant
13,Bern,46.948271,7.451451,1,Swiss Restaurant,Café,Bar,Restaurant,Plaza,Park,...,Gastropub,Creperie,Science Museum,Bakery,Coffee Shop,Hotel Bar,Rock Club,Art Gallery,Asian Restaurant,Pizza Place
28,Zug,47.16799,8.517365,1,Swiss Restaurant,Hotel,Restaurant,Supermarket,Italian Restaurant,Fast Food Restaurant,...,Bakery,Cocktail Bar,Salad Place,Light Rail Station,Shopping Mall,Park,Sports Bar,Train Station,Sushi Restaurant,Mexican Restaurant


In [464]:
new_comb_Cities_list_2.loc[new_comb_Cities_list_2['Cluster Labels'] == 3]

Unnamed: 0,City,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,...,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
2,Munich,48.137108,11.575382,3,Café,Plaza,Cocktail Bar,Hotel,German Restaurant,Coffee Shop,...,Boutique,Restaurant,Bavarian Restaurant,Steakhouse,Italian Restaurant,Seafood Restaurant,Bookstore,Board Shop,Fast Food Restaurant,Farmers Market
7,Geneva,46.201756,6.146601,3,French Restaurant,Hotel,Italian Restaurant,Plaza,Bar,Coffee Shop,...,Department Store,Café,Steakhouse,Jewelry Store,Farmers Market,Swiss Restaurant,Movie Theater,Neighborhood,Ice Cream Shop,Monument / Landmark
22,Paris,48.85661,2.351499,3,Plaza,Cocktail Bar,French Restaurant,Wine Bar,Japanese Restaurant,Sandwich Place,...,Burger Joint,Hotel,Ice Cream Shop,Coffee Shop,Restaurant,Gourmet Shop,Modern European Restaurant,Monument / Landmark,Provençal Restaurant,Wine Shop
27,Nice,43.700936,7.268391,3,French Restaurant,Mediterranean Restaurant,Italian Restaurant,Hotel,Ice Cream Shop,Seafood Restaurant,...,Wine Bar,Japanese Restaurant,Bar,Middle Eastern Restaurant,Pizza Place,Plaza,Restaurant,Art Museum,Tapas Restaurant,Bistro


In [465]:
new_comb_Cities_list_2.loc[new_comb_Cities_list_2['Cluster Labels'] == 5]

Unnamed: 0,City,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,...,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
3,Auckland,-36.853467,174.765551,5,Café,Japanese Restaurant,Hotel,Bar,Dessert Shop,Park,...,Indian Restaurant,Ice Cream Shop,Steakhouse,Theater,Monument / Landmark,Italian Restaurant,Food Court,Malay Restaurant,Asian Restaurant,Gym
4,Vancouver,49.260872,-123.113953,5,Bakery,Coffee Shop,Trail,Sushi Restaurant,Dessert Shop,Park,...,Middle Eastern Restaurant,Taco Place,Seafood Restaurant,BBQ Joint,Liquor Store,Furniture / Home Store,Japanese Restaurant,Indian Restaurant,Vegetarian / Vegan Restaurant,Yoga Studio
5,Düsseldorf,51.225402,6.776314,5,Japanese Restaurant,Hotel,Coffee Shop,Bar,Grocery Store,Café,...,Park,Bakery,Bookstore,Ramen Restaurant,Steakhouse,Wine Bar,Pizza Place,Sushi Restaurant,Pedestrian Plaza,German Restaurant
6,Frankfurt,50.110644,8.682092,5,Café,Bar,Hotel,Japanese Restaurant,Restaurant,Plaza,...,Coffee Shop,Thai Restaurant,Theater,Park,Ice Cream Shop,Apple Wine Pub,German Restaurant,Scenic Lookout,Bakery,Pedestrian Plaza
8,Copenhagen,55.686724,12.570072,5,Café,Beer Bar,Bakery,Coffee Shop,Scandinavian Restaurant,Cocktail Bar,...,Sandwich Place,Bar,Steakhouse,Seafood Restaurant,Burger Joint,Breakfast Spot,Art Museum,Camera Store,Pub,Pizza Place
10,Sydney,-33.854816,151.216454,5,Café,Scenic Lookout,Hotel,Theater,Park,Australian Restaurant,...,Japanese Restaurant,Italian Restaurant,Brewery,Burger Joint,Cocktail Bar,Chinese Restaurant,Pool,Coffee Shop,Garden,Concert Hall
14,Wellington,-41.288747,174.777209,5,Café,Restaurant,Coffee Shop,Bar,Gastropub,Asian Restaurant,...,Chinese Restaurant,Park,Mexican Restaurant,Brewery,Art Gallery,Exhibit,Pizza Place,Seafood Restaurant,Bridge,Grocery Store
15,Toronto,43.653963,-79.387207,5,Coffee Shop,Café,Hotel,Theater,Gym,Pizza Place,...,Concert Hall,Burrito Place,Sandwich Place,Plaza,Arts & Crafts Store,Art Gallery,Vegetarian / Vegan Restaurant,Cocktail Bar,Mediterranean Restaurant,Clothing Store
16,Melbourne,-37.814218,144.963161,5,Coffee Shop,Café,Bar,Cocktail Bar,Italian Restaurant,Hotel,...,Plaza,Dessert Shop,Korean Restaurant,Shopping Mall,Spanish Restaurant,Sushi Restaurant,Bookstore,Wine Bar,Australian Restaurant,Whisky Bar
20,Dublin,53.349764,-6.260273,5,Pub,Café,Coffee Shop,Hotel,Restaurant,Theater,...,Museum,Pizza Place,Cocktail Bar,Tapas Restaurant,Burger Joint,Italian Restaurant,Plaza,Record Shop,Indie Movie Theater,Burrito Place


## Observation (7) - Final Observation

From the map, but mostly from the tables above (the map gets a bit crowded), one can make the following observations:

1. Zug clusters together with three other Swiss cities (Zürich, Basel and Bern).
2. Nice clusters together with Paris, Geneva (Swiss but French-speaking) and Munich. Munich is the biggest city in southern Germany and has a Mediterranean flair associated with it. Therefore, this clustering is interesting and potentially accurate.
3. Quebec City was added to a big cluster with other Canadian cities (minus Ottawa), European cities and cities from Australia and New Zealand. One can interpret this cluster as "international cities" with many amenities. Chances are that if one wants or needs something from back home, one will find it in Quebec City.

# Results & Discussion

My overall results show...

- ... that overall quality of living (according to Mercer) cannot be replicated from the listing of venues derived from Foursquare alone.
- ... that Foursquare venue data can be used to cluster cities according to their characteristics and that venues in the city center have a stronger influence on clustering than venues located throughout the city and suburbs.
- ... that given a base dataset one can search for new cities and assign them to clusters very well. In this way one can get a good idea of what a city, with respect to its venues, is like.
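
In this notebook the query cities are clustered by re-fitting k-means on all cities together, which can also shift the clusters of the base cities. An alternative, sketched below under stated assumptions, is to fit the model on the base cities only and then assign a new city with `predict` after aligning its venue categories to the base columns. This is not the method used above, and all frequencies here are invented for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented venue-category frequencies; each row is one base city.
base = pd.DataFrame(
    {"Café": [0.5, 0.4, 0.0, 0.1], "Swiss Restaurant": [0.0, 0.1, 0.6, 0.5]},
    index=["A", "B", "C", "D"],
)

# A query city with one known category and one category unseen in the base data.
query = pd.DataFrame({"Swiss Restaurant": [0.55], "Supermarket": [0.1]}, index=["Q"])

# Fit on the base cities only.
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(base)

# Align the query to the base columns: unseen categories are dropped,
# categories missing from the query are filled with 0.
query_aligned = query.reindex(columns=base.columns, fill_value=0)

# Assign the query city to the nearest existing cluster centre.
query_label = model.predict(query_aligned)[0]
base_labels = dict(zip(base.index, model.labels_))
```

The trade-off is that categories seen only in the query city are ignored, whereas re-fitting on all cities keeps them but no longer leaves the base clusters fixed.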

More specifically, I could show that Zug seems to be a typical Swiss town, Nice is very French and Mediterranean, and Quebec City is an international town that probably requires little adaptation when moving there.

If one needs to decide where to move, such a search can give a good overview of what to expect in each city. The search is not exhaustive, though: the natural environment (coastal city, mountains, desert etc.), the political environment (dictatorship, democracy etc.) and pay would need to be considered as well.

# Conclusion

Overall, while clustering cities based on venue data has its caveats, it is a powerful technique that can give interested persons a fast overview of how an unknown city compares to known cities and can help in decision making.