# City Similarity Test (Complete)
## Business Problem
A New York luxury brand has built its business in several cities in the United States, including Los Angeles, New York, and Chicago. Due to its success and growing popularity in these cities, the CEO and his team want to expand their business to other cities in the United States and also explore markets in big cities in other countries, such as China and the UK. The CEO has hired a data scientist and assigned her the task of measuring the similarity between big cities around the world and grouping them into clusters, so that the Board of Directors can make a better decision about which business mode to operate in new cities. (For example: if London is grouped into the same cluster as New York, the business mode operated in the New York market will be considered for London.)

To carry out the task, the data scientist should make full use of the Foursquare API and collect a dataset for at least 15 cities, including cities in the United States and in other countries. Visualizations are required to present the clustering outcome so that the Board of Directors has a clearer view of the results.

## Data Section
We chose 27 of the world's best-known cities and clustered them by the similarity of their venue distributions. The venue information is retrieved from the Foursquare API. At most 500 venues were requested for each city, although the resulting dataset holds 2700 venues in total, an average of 100 per city.

### Import Necessary Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib as mpl

### Set up a World City Dataframe
The world city dataframe includes the city names, their location features (latitude and longitude), and their country names.

In [3]:
## Create a dataframe of cities
City_data = {'City': ['New York', 'London','Edinburgh', 'Toronto', 'Sydney', 'Singapore', 'Melbourne', 'Hong Kong', 'Los Angeles', 'Chicago', 'Boston', 'San Francisco', 'Dublin', 'Washington','Beijing', 'Shanghai','Guangzhou', 'Shenzhen', 'Mumbai', 'Tokyo', 'Seoul–Incheon','Moscow','Paris', 'Taipei', 'Berlin', 'Jakarta', 'Mexico City']}
City_df = pd.DataFrame(City_data)

## add columns for 'Latitude', 'Longitude', and 'Country'
## for latitude and longitude, we use zeros first as placeholders to be filled in later
City_df.insert(1, 'Latitude', np.zeros(27))
City_df.insert(2, 'Longitude', np.zeros(27))
City_df.insert(3, 'Country', ['US', 'UK', 'UK', 'Canada', 'Australia', 'Singapore', 'Australia', 'China', 'US', 'US', 'US', 'US', 'Ireland', 'US', 'China', 'China', 'China', 'China', 'India', 'Japan', 'South Korea', 'Russia', 'France', 'China', 'Germany', 'Indonesia', 'Mexico'])          

In [4]:
City_df.head()

Unnamed: 0,City,Latitude,Longitude,Country
0,New York,0.0,0.0,US
1,London,0.0,0.0,UK
2,Edinburgh,0.0,0.0,UK
3,Toronto,0.0,0.0,Canada
4,Sydney,0.0,0.0,Australia


### Add in Latitude & Longitude Values

In [5]:
## Import necessary libraries
import geopy
from geopy.geocoders import Nominatim

## use the geolocation package to fill the location features (lat & lng) into the dataframe
## the geocoder is created once and reused for every city
geolocator = Nominatim(user_agent = "explorer2")
for index, row in City_df.iterrows():
    city = row['City']
    location_city = geolocator.geocode(str(city))
    City_df.loc[index, 'Latitude'] = location_city.latitude
    City_df.loc[index, 'Longitude'] = location_city.longitude
    
City_df.head()

Unnamed: 0,City,Latitude,Longitude,Country
0,New York,40.712728,-74.006015,US
1,London,51.507322,-0.127647,UK
2,Edinburgh,55.952148,-3.188991,UK
3,Toronto,43.653963,-79.387207,Canada
4,Sydney,-33.854816,151.216454,Australia
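
geopy's `geocode` returns `None` when it finds no match, in which case the loop above would stop with an `AttributeError`. The following is a minimal sketch of a more defensive lookup, not the notebook's own method. `fill_coordinates`, `Location`, and the stub `fake_geocode` are hypothetical names introduced here so the example runs without network access; in the notebook the `geocode` argument would be `Nominatim(...).geocode`.

```python
from collections import namedtuple

import pandas as pd

# Stand-in for the object returned by geopy's geocode(); only the two
# attributes used in the notebook are modelled here.
Location = namedtuple("Location", ["latitude", "longitude"])


def fill_coordinates(df, geocode):
    """Return a copy of df with Latitude/Longitude filled where geocode succeeds,
    plus the list of city names that could not be resolved."""
    out = df.copy()
    unresolved = []
    for index, city in out["City"].items():
        location = geocode(city)
        if location is None:  # the geocoder found no match
            unresolved.append(city)
            continue
        out.loc[index, "Latitude"] = location.latitude
        out.loc[index, "Longitude"] = location.longitude
    return out, unresolved


# Hypothetical offline stub used only for this illustration.
def fake_geocode(name):
    known = {"New York": Location(40.712728, -74.006015)}
    return known.get(name)


cities = pd.DataFrame({"City": ["New York", "Atlantis"],
                       "Latitude": [0.0, 0.0],
                       "Longitude": [0.0, 0.0]})
filled, missing = fill_coordinates(cities, fake_geocode)
```

Collecting the unresolved names, rather than failing on the first one, makes it easier to spot cities whose zero placeholders would otherwise end up on the map.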


### Visualize Locations on World Map

In [7]:
## Install relevant packages for visualization
!conda install -c conda-forge folium=0.5.0 --yes


In [8]:
## import necessary lib
import folium

## create a world map
world_map = folium.Map()

## add location marks on the world map
for lati, lngi, city in zip(City_df['Latitude'], City_df['Longitude'], City_df['City']):
    label = '{}'.format(city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lati, lngi],
        radius = 3,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.6,
        parse_html = False
    ).add_to(world_map)
    
world_map

### Retrieve Venue Information for Cities
We request up to 500 venues for each city and add the venue names and venue categories to the dataframe.

In [9]:
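## import necessary packages
import os
import requests

## Client information for Foursquare
## Credentials are read from environment variables rather than hard-coded;
## the variable names below are placeholders chosen for this notebook
CLIENT_ID = os.environ["FOURSQUARE_CLIENT_ID"]
CLIENT_SECRET = os.environ["FOURSQUARE_CLIENT_SECRET"]
VERSION = '20190829'
LIMIT = 500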

In [10]:
## Create a function to repeat the process for all cities
def getNearbyVenues(names, latitudes, longitudes, radius=10000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        
        # create the API request URL
        url_city = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url_city).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return nearby_venues
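
The list comprehension in `getNearbyVenues` assumes every venue has at least one category; an empty `categories` list would raise an `IndexError`. Below is a small sketch of a more tolerant flattening step. `extract_venue_rows`, the `"Uncategorized"` placeholder, and the sample payload are all hypothetical: the payload is made up for illustration and mirrors only the keys that the function above reads.

```python
def extract_venue_rows(city, lat, lng, items):
    """Flatten items (shaped like the keys read in getNearbyVenues) into tuples."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to a placeholder instead of raising IndexError
        category = categories[0]["name"] if categories else "Uncategorized"
        rows.append((city, lat, lng,
                     venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows


# Made-up payload for illustration only.
sample_items = [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Unlabelled Place",
               "location": {"lat": 1.1, "lng": 2.1},
               "categories": []}},
]
rows = extract_venue_rows("Testville", 1.0, 2.0, sample_items)
```

Each tuple has the same seven fields, in the same order, as the columns assigned to `nearby_venues`.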

In [11]:
## Pass the location information of the cities into the function and return a single dataframe of venues for all cities
world_venues = getNearbyVenues(names = City_df['City'], latitudes = City_df['Latitude'], longitudes = City_df['Longitude'])
world_venues.head()

Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,New York,40.712728,-74.006015,The Bar Room at Temple Court,40.711448,-74.006802,Hotel Bar
1,New York,40.712728,-74.006015,Four Seasons Hotel New York Downtown,40.712612,-74.00938,Hotel
2,New York,40.712728,-74.006015,Korin,40.714824,-74.009404,Furniture / Home Store
3,New York,40.712728,-74.006015,Aire Ancient Baths,40.718141,-74.004941,Spa
4,New York,40.712728,-74.006015,9/11 Memorial North Pool,40.712077,-74.013187,Memorial Site


In [12]:
## Check out the size of the dataset
world_venues.shape

(2700, 7)

### Analysis for the Venue Distribution in Cities
We will find the 10 most common venue categories in each city.

In [13]:
## Apply one-hot encoding to venue categories
world_onehot = pd.get_dummies(world_venues['Venue Category'], prefix = "", prefix_sep= "")
world_onehot.head()

Unnamed: 0,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Amphitheater,Aquarium,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,...,Wine Shop,Winery,Wings Joint,Xinjiang Restaurant,Yakitori Restaurant,Yoga Studio,Yoshoku Restaurant,Yunnan Restaurant,Zhejiang Restaurant,Zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
## Add city column back to dataframe
world_onehot['City'] = world_venues['City']

# move city column to the first column
fixed_columns = [world_onehot.columns[-1]] + list(world_onehot.columns[:-1])
world_onehot_city = world_onehot[fixed_columns]

world_onehot_city.head()

Unnamed: 0,City,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Amphitheater,Aquarium,Arcade,Argentinian Restaurant,Art Gallery,...,Wine Shop,Winery,Wings Joint,Xinjiang Restaurant,Yakitori Restaurant,Yoga Studio,Yoshoku Restaurant,Yunnan Restaurant,Zhejiang Restaurant,Zoo
0,New York,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,New York,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,New York,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,New York,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,New York,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
## Group the dataset by city name to get the share of each venue category
world_grouped = world_onehot_city.groupby('City').mean().reset_index()
world_grouped.head()

Unnamed: 0,City,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Amphitheater,Aquarium,Arcade,Argentinian Restaurant,Art Gallery,...,Wine Shop,Winery,Wings Joint,Xinjiang Restaurant,Yakitori Restaurant,Yoga Studio,Yoshoku Restaurant,Yunnan Restaurant,Zhejiang Restaurant,Zoo
0,Beijing,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.04,0.01,0.0
1,Berlin,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
2,Boston,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0
3,Chicago,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0
4,Dublin,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
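
To see why the grouped values can be read as shares, here is a toy sketch with made-up data (the names `toy_venues`, `toy_onehot`, and `toy_grouped` are illustrative). Averaging a 0/1 indicator column within a city gives the fraction of that city's venues in the category, so each row sums to 1; for a city with 100 venues, a value of 0.01 would correspond to a single venue.

```python
import pandas as pd

# Made-up venues: city A has 4 venues, city B has 2
toy_venues = pd.DataFrame({
    "City": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Hotel", "Park", "Hotel", "Café", "Park", "Park"],
})

# One indicator column per category
toy_onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
toy_onehot.insert(0, "City", toy_venues["City"])

# The mean of a 0/1 column within a city is that category's share of the city's venues
toy_grouped = toy_onehot.groupby("City").mean().reset_index()
```

In this toy data, city A is half hotels and city B is entirely parks, which is the kind of profile the clustering step later compares across cities.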


In [16]:
## Define a function that sorts the values in rows

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
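
As a quick illustration of what this function returns, the sketch below applies it to a single made-up row (`toy_row` is hypothetical). The first element of the row is skipped because it holds the city name; the remaining values are sorted from largest to smallest and the top category names are returned. The order among tied categories is not something this sketch relies on.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    # Same logic as the notebook cell: drop the city name, sort shares descending
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Made-up category shares for one city
toy_row = pd.Series({"City": "A", "Café": 0.25, "Hotel": 0.5, "Park": 0.15, "Zoo": 0.10})
top3 = list(return_most_common_venues(toy_row, 3))
```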

In [17]:
## create columns according to number of top venues
## every rank uses the plain 'th' suffix, which matches the column names in the output below
columns = ['City']
for ind in np.arange(10):
    columns.append('{}th Most Common Venue'.format(ind+1))

## Create a dataframe
## set up the column names for the dataframe
City_venue_sorted = pd.DataFrame(columns = columns)

## set the column of "City" 
City_venue_sorted['City'] = world_grouped['City']

## Set the other column values -- the top 10 venue names
for ind in np.arange(world_grouped.shape[0]):
    City_venue_sorted.iloc[ind, 1:] = return_most_common_venues(world_grouped.iloc[ind, :], 10)

City_venue_sorted

Unnamed: 0,City,1th Most Common Venue,2th Most Common Venue,3th Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Beijing,Historic Site,Hotel,Park,Chinese Restaurant,Yunnan Restaurant,Café,Temple,Peking Duck Restaurant,Hostel,Coffee Shop
1,Berlin,Coffee Shop,Park,Bookstore,Ice Cream Shop,Concert Hall,Garden,Sandwich Place,Hotel,Wine Bar,Bakery
2,Boston,Park,Bakery,Hotel,Seafood Restaurant,Gym,Mexican Restaurant,Historic Site,Pizza Place,Gastropub,Theater
3,Chicago,Hotel,Park,Theater,Italian Restaurant,Coffee Shop,Boat or Ferry,Yoga Studio,Mediterranean Restaurant,New American Restaurant,Restaurant
4,Dublin,Coffee Shop,Café,Pub,Restaurant,Hotel,Cocktail Bar,Park,Italian Restaurant,Bookstore,Gastropub
5,Edinburgh,Café,Bar,Coffee Shop,Hotel,French Restaurant,Park,Pub,Museum,Restaurant,Beer Bar
6,Guangzhou,Hotel,Coffee Shop,Shopping Mall,Park,Turkish Restaurant,Chinese Restaurant,Cantonese Restaurant,Restaurant,Electronics Store,Cocktail Bar
7,Hong Kong,Hotel,Japanese Restaurant,Bar,Gym / Fitness Center,Italian Restaurant,Lounge,Yoga Studio,Cocktail Bar,Café,Steakhouse
8,Jakarta,Hotel,Coffee Shop,Indonesian Restaurant,Restaurant,Shopping Mall,Dessert Shop,Japanese Restaurant,Sushi Restaurant,Steakhouse,Asian Restaurant
9,London,Hotel,Theater,Cocktail Bar,Park,Art Museum,Bookstore,Department Store,Coffee Shop,Juice Bar,Boutique


### Set Up and Train the Model
We will use the grouped one-hot encoded data to fit a k-means clustering model.

In [18]:
## Drop out the city column of the grouped data for model training
world_grouped_clustering = world_grouped.drop(['City'], axis = 1)

## import machine learning packages
import sklearn
from sklearn.cluster import KMeans

## Create and fit a kmeans model 
model_kmeans = KMeans(n_clusters = 5, random_state = 0)
model_kmeans.fit(world_grouped_clustering)

## Check out the labels
kmeans_labels = model_kmeans.labels_
kmeans_labels

array([0, 1, 1, 3, 4, 4, 3, 3, 3, 3, 1, 4, 1, 1, 4, 1, 1, 1, 1, 3, 3, 2,
       4, 2, 3, 1, 1], dtype=int32)
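
To make concrete what k-means does with the per-city category shares, here is a simplified NumPy sketch of Lloyd's algorithm. It is not scikit-learn's implementation (which, among other differences, uses k-means++ initialisation by default); `simple_kmeans` and the toy matrix are hypothetical and only illustrate the alternating assign-and-update idea.

```python
import numpy as np


def simple_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every centroid
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties out
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilised
        centroids = new_centroids
    return labels, centroids


# Toy "category share" rows: two hotel-heavy cities and two café-heavy cities
toy = np.array([[0.8, 0.1, 0.1],
                [0.7, 0.2, 0.1],
                [0.1, 0.8, 0.1],
                [0.2, 0.7, 0.1]])
toy_labels, toy_centroids = simple_kmeans(toy, k=2)
```

The label numbers themselves are arbitrary; only which rows share a label is meaningful. The same applies to the labels 0 to 4 produced above.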

### Update the Dataframe with Cluster Labels and Location Features
We will add the cluster labels to the dataframe -- City_venue_sorted


In [19]:
## Add Clustering labels 
City_venue_sorted.insert(0, 'Cluster Labels', kmeans_labels)
City_venue_sorted.head()

Unnamed: 0,Cluster Labels,City,1th Most Common Venue,2th Most Common Venue,3th Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,Beijing,Historic Site,Hotel,Park,Chinese Restaurant,Yunnan Restaurant,Café,Temple,Peking Duck Restaurant,Hostel,Coffee Shop
1,1,Berlin,Coffee Shop,Park,Bookstore,Ice Cream Shop,Concert Hall,Garden,Sandwich Place,Hotel,Wine Bar,Bakery
2,1,Boston,Park,Bakery,Hotel,Seafood Restaurant,Gym,Mexican Restaurant,Historic Site,Pizza Place,Gastropub,Theater
3,3,Chicago,Hotel,Park,Theater,Italian Restaurant,Coffee Shop,Boat or Ferry,Yoga Studio,Mediterranean Restaurant,New American Restaurant,Restaurant
4,4,Dublin,Coffee Shop,Café,Pub,Restaurant,Hotel,Cocktail Bar,Park,Italian Restaurant,Bookstore,Gastropub


In [20]:
## Check out the shape of the City_venue_sorted
City_venue_sorted.shape

(27, 12)

In [21]:
## Check out the shape of City_df
City_df.shape

(27, 4)

In [22]:
## Both dataframes have one row per city (27 rows each), so we can join them on the 'City' column
world_merged = City_df.join(City_venue_sorted.set_index('City'), on = 'City')

world_merged.head()

Unnamed: 0,City,Latitude,Longitude,Country,Cluster Labels,1th Most Common Venue,2th Most Common Venue,3th Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,New York,40.712728,-74.006015,US,1,Park,Bookstore,Ice Cream Shop,Scenic Lookout,Cycle Studio,Gym,Italian Restaurant,Theater,Gourmet Shop,Furniture / Home Store
1,London,51.507322,-0.127647,UK,3,Hotel,Theater,Cocktail Bar,Park,Art Museum,Bookstore,Department Store,Coffee Shop,Juice Bar,Boutique
2,Edinburgh,55.952148,-3.188991,UK,4,Café,Bar,Coffee Shop,Hotel,French Restaurant,Park,Pub,Museum,Restaurant,Beer Bar
3,Toronto,43.653963,-79.387207,Canada,1,Coffee Shop,Café,Restaurant,Hotel,Park,Sandwich Place,Yoga Studio,Pizza Place,Japanese Restaurant,Italian Restaurant
4,Sydney,-33.854816,151.216454,Australia,4,Café,Park,Theater,Hotel,Scenic Lookout,Coffee Shop,Bakery,Cocktail Bar,Chinese Restaurant,Thai Restaurant
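
The join works even though the two dataframes list the cities in different orders (`City_df` in its original order, `City_venue_sorted` alphabetically after the `groupby`), because rows are matched on the `City` key rather than by position. A toy sketch with three cities and their cluster labels from the results above (`cities`, `labels`, and `merged` are illustrative names):

```python
import pandas as pd

# Original (non-alphabetical) order, like City_df
cities = pd.DataFrame({"City": ["New York", "London", "Beijing"],
                       "Country": ["US", "UK", "China"]})

# Alphabetical order, as produced by groupby('City')
labels = pd.DataFrame({"City": ["Beijing", "London", "New York"],
                       "Cluster Labels": [0, 3, 1]})

# Rows are aligned on the 'City' key, not on row position
merged = cities.join(labels.set_index("City"), on="City")
```

The result keeps the row order of the left dataframe, with each city picking up its own label.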


### Visualize the Cluster Results on the Map
We will visualize the clusters on the map, using a different color for each cluster.

In [23]:
## import necessary lib and packages
import matplotlib.cm as cm
import matplotlib.colors as colors

## Create map
map_clusters = folium.Map()

## set color scheme for the clusters (one color per cluster)
n_clusters = 5
colors_array = cm.rainbow(np.linspace(0, 1, n_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, city, country, cluster in zip(world_merged['Latitude'], world_merged['Longitude'], world_merged['City'], world_merged['Country'], world_merged['Cluster Labels']):
    label = folium.Popup(str(city) + ', ' + str(country) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=3,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

## Conclusion
Based on the analysis above, the 27 cities were clustered into 5 groups, and the Board of Directors can make an informed decision based on this result. Let's look at which cities are more similar to one another.
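
As a compact alternative to filtering one cluster at a time, the membership of every cluster can be listed in a single pass. This is a sketch on a small sample of the results (five of the 27 cities, with the labels reported in this notebook); `sample`, `cities_by_cluster`, and `cluster_sizes` are illustrative names, and in the notebook the same two lines would be applied to `world_merged`.

```python
import pandas as pd

# A small sample of cities and the cluster labels assigned above
sample = pd.DataFrame({
    "City": ["Beijing", "Singapore", "Taipei", "Edinburgh", "Dublin"],
    "Cluster Labels": [0, 2, 2, 4, 4],
})

# Cluster label -> list of member cities
cities_by_cluster = sample.groupby("Cluster Labels")["City"].apply(list).to_dict()

# Cluster label -> number of member cities
cluster_sizes = sample["Cluster Labels"].value_counts().sort_index().to_dict()
```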

### Cluster 0

In [34]:
## Filter out the cluster 0 cities and change the column name to cluster 0
Cluster0 = pd.DataFrame(world_merged[world_merged['Cluster Labels'] == 0][['City', 'Country']])
Cluster0.columns = ['Cluster 0', 'Country']
Cluster0

Unnamed: 0,Cluster 0,Country
14,Beijing,China


Beijing forms a group of its own, which suggests that its venue distribution differs from that of every other big city in the dataset. One possible reason is that Beijing, as the capital of China, is an ancient city with many historic sites, especially in the city center; Historic Site is its most common venue category. This points to a strong local cultural character, so the Board of Directors might want to localize their business more in Beijing.

### Cluster 1

In [35]:
## Filter out the cluster 1 cities and change the column name to cluster 1
Cluster1 = pd.DataFrame(world_merged[world_merged['Cluster Labels'] == 1][['City', 'Country']])
Cluster1.columns = ['Cluster 1', 'Country']
Cluster1

Unnamed: 0,Cluster 1,Country
0,New York,US
3,Toronto,Canada
8,Los Angeles,US
10,Boston,US
11,San Francisco,US
13,Washington,US
20,Seoul–Incheon,South Korea
21,Moscow,Russia
22,Paris,France
24,Berlin,Germany
26,Mexico City,Mexico


In [31]:
# Count how many cities in cluster 1
Cluster1.shape

(11, 2)

These 11 cities are segmented into one group. As shown in the table, most of the big US cities in the dataset (5 of 6; Chicago is the exception) fall into this cluster, which suggests that the business mode for New York and Los Angeles could be copied to most American cities. Another observation is that a majority of these cities are in North America.

### Cluster 2

In [36]:
## Filter out the cluster 2 cities and change the column name to cluster 2
Cluster2 = pd.DataFrame(world_merged[world_merged['Cluster Labels'] == 2][['City', 'Country']])
Cluster2.columns = ['Cluster 2', 'Country']
Cluster2

Unnamed: 0,Cluster 2,Country
5,Singapore,Singapore
23,Taipei,China


This cluster has only two cities, Singapore and Taipei, which lie in Southeast and East Asia respectively. This suggests that the Board of Directors might need to consider a new, localized operation mode for their business in this region.

### Cluster 3

In [37]:
## Filter out the cluster 3 cities and change the column name to cluster 3
Cluster3 = pd.DataFrame(world_merged[world_merged['Cluster Labels'] == 3][['City', 'Country']])
Cluster3.columns = ['Cluster 3', 'Country']
Cluster3

Unnamed: 0,Cluster 3,Country
1,London,UK
7,Hong Kong,China
9,Chicago,US
15,Shanghai,China
16,Guangzhou,China
17,Shenzhen,China
19,Tokyo,Japan
25,Jakarta,Indonesia


In [38]:
# Count how many cities in cluster 3
Cluster3.shape

(8, 2)

This cluster is a diverse one: its 8 cities come from across the world, including China, Japan, Indonesia, the UK, and the US. However, the majority of cities in cluster 3 (6 of 8) are in Asia. Since Chicago is grouped into this cluster, the Board of Directors could consider applying the Chicago business mode to the other cities in this group.

### Cluster 4

In [39]:
## Filter out the cluster 4 cities and change the column name to cluster 4
Cluster4 = pd.DataFrame(world_merged[world_merged['Cluster Labels'] == 4][['City', 'Country']])
Cluster4.columns = ['Cluster 4', 'Country']
Cluster4

Unnamed: 0,Cluster 4,Country
2,Edinburgh,UK
4,Sydney,Australia
6,Melbourne,Australia
12,Dublin,Ireland
18,Mumbai,India


This cluster is an interesting group too, as these 5 cities show a comparatively strong British influence: they are in the UK, Ireland, Australia, and India. However, London, the capital of the UK, is not included in this cluster. The Board of Directors might need to come up with a British style of business mode for these cities.