# Coursera Applied Data Science Capstone Final Project Report

### By Elliot Taylor


## Section 1 - Project background

What do the world's top 50 financial centers have in common?

This is the overarching question I will be looking to answer with my analysis. I will use the list of world financial centers ranked by the Global Financial Centres Index (GFCI), the latest edition of which was published in September 2020. I will use this list alongside Foursquare data for each city to categorise and group the cities based upon their venues. Following this I will explore any cultural, political or economic reasons which can explain the different clusters.

The reasoning behind this is that many of the top financial centers around the world are seen as having a similar feel and environment. In my personal experience visiting London, New York, Hong Kong and other financial centers across Europe, North America and Asia, I have always felt there is a familiarity where each city evokes a similar feeling despite massive cultural and political differences between their respective countries. Therefore I would like to explore whether the Foursquare data can be used to demonstrate any relationships between such cities.

Following this analysis I will take a deeper dive into any clusters of interest or any specific cities which appear as outliers and try to describe why such clusters or anomalies exist.

GFCI report: https://www.zyen.com/media/documents/GFCI_28_Full_Report_2020.09.25_v1.1.pdf

## Section 2 - Data

As mentioned, I will take the list of cities from the GFCI; to get these I will use the Wikipedia page which lists them in order (linked below). I will then feed the list of cities into the Foursquare API, which will be used to create the final data set of each city's venues to be fed into the K-means clustering algorithm.

Before using the Foursquare API I needed the coordinates for each city. To obtain these I used the Nominatim geocoder from the geopy library, which allowed me to feed it my scraped city names and retrieve the coordinates for each.
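The geocoding step can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the helper names `make_nominatim_geocode` and `geocode_cities` and the `user_agent` string are hypothetical, and geopy is imported lazily so the lookup logic can be tried without network access.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Tuple[float, float]


def make_nominatim_geocode(user_agent: str = "gfci_capstone") -> Callable[[str], object]:
    """Return geopy's Nominatim.geocode (imported lazily; needs `pip install geopy`)."""
    from geopy.geocoders import Nominatim

    # The user_agent value is a placeholder; Nominatim requires one to be set.
    return Nominatim(user_agent=user_agent).geocode


def geocode_cities(
    cities: Iterable[str], geocode: Callable[[str], object]
) -> Dict[str, Optional[Coordinates]]:
    """Map each city name to (latitude, longitude), or None when nothing matches."""
    coords: Dict[str, Optional[Coordinates]] = {}
    for city in cities:
        location = geocode(city)  # geopy returns None when no match is found
        coords[city] = (location.latitude, location.longitude) if location else None
    return coords
```

A call such as `geocode_cities(["London", "Tokyo"], make_nominatim_geocode())` would query the live service. Keeping `None` for unmatched names makes it easy to spot cities whose names need manual reformatting before retrying.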

Using the Foursquare API I was able to obtain 100 venues for each city (with the exception of Luxembourg and Qingdao, which both returned 65). Across these venues there were 390 unique venue categories. As with the Toronto example, I one-hot encoded these categories and used pandas groupby to create a frequency value for each category for each city.

As an example, below we can see the top 5 venue categories for the top 4 financial centres in the world. These values alongside all 390 venue frequency values will be fed into the K-means clustering algorithm to produce clusters of cities. 

<pre>
5 Most Common Venue Categories For:  New York
            Venue  Frequency
0            Park       0.13
1  Ice Cream Shop       0.05
2          Bakery       0.04
3  Scenic Lookout       0.04
4       Bookstore       0.04


5 Most Common Venue Categories For:  London
        Venue  Frequency
0       Hotel       0.17
1     Theater       0.05
2      Lounge       0.04
3  Art Museum       0.04
4        Park       0.04


5 Most Common Venue Categories For:  Shanghai
               Venue  Frequency
0              Hotel       0.18
1        Coffee Shop       0.05
2      Shopping Mall       0.05
3  French Restaurant       0.05
4             Bakery       0.04


5 Most Common Venue Categories For:  Tokyo
              Venue  Frequency
0             Hotel       0.09
1        Art Museum       0.05
2  Ramen Restaurant       0.04
3     Wagashi Place       0.04
4          Sake Bar       0.04

</pre>




GFCI wiki page: https://en.wikipedia.org/wiki/Global_Financial_Centres_Index 

## Section 3 - Methodology

### 3.1 Data Collection and Preparation

In this section I will outline the steps I took from data acquisition mentioned above through to completing the k-means clustering on my dataset to create city clusters. The main sections are: data ETL (extract transform load), data validation, and modelling using K-means.  

As described above, I combined the GFCI list from Wikipedia with the Foursquare API data.

Firstly I used the BeautifulSoup Python package to scrape the HTML data from the Wikipedia page. This was then read into a pandas dataframe to allow me to manipulate and use it. The list of cities was then passed into the geopy Nominatim geocoder, which provided latitude and longitude coordinates for each city. With this I generated the below world map, which I used to validate that I had the correct coordinates for each city. I will discuss the data validation done using this map in the next section.

![image](markedWorldMap.PNG)

Once I had my list of cities and coordinates I needed the Foursquare data. As I had validated the coordinates of each city, I chose to feed these into the API. I chose not to use city names as formatting may have led to misleading or outright wrong data. Feeding in the coordinates and a radius of 12km, I managed to retrieve 100 records (the limit for my API access) for all cities with the exception of Luxembourg and Qingdao. I chose 12km as this is large enough to encompass the centre of most cities, as shown below over London. For Luxembourg and Qingdao only 65 records were returned each. I believe this is because Luxembourg is a relatively small city, and Qingdao is small relative to other Chinese cities and sees far fewer tourists.
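As an illustration of querying by coordinates rather than by name, the sketch below builds a request URL with the 12km radius and 100-record limit described above. It assumes the legacy Foursquare v2 `venues/explore` endpoint used in the course labs; the function name, the credentials and the version date are placeholders, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumed endpoint: the legacy Foursquare v2 "explore" route from the course labs.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      radius_m=12_000, limit=100, version="20180605"):
    """Build a venue-search URL centred on a validated coordinate pair."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date (placeholder value)
        "ll": f"{lat},{lng}",            # query by coordinates, not by city name
        "radius": radius_m,              # 12 km expressed in metres
        "limit": limit,                  # 100 was the cap for the access tier used
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Example with approximate coordinates for central London.
print(build_explore_url(51.5074, -0.1278, "YOUR_ID", "YOUR_SECRET"))
```

The resulting URL could then be fetched with any HTTP client. A city returning fewer than `limit` records, as Luxembourg and Qingdao did, simply means fewer venues were found inside the radius.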

![image](RadiusRing.PNG)

With all of the records collected, I then one-hot encoded the venue categories to get a dummy variable for each category. This was done using the pandas get_dummies function. Following these steps the data was ready to be fed into the K-means clustering algorithm as discussed below.
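The encoding and per-city frequency calculation can be sketched on a toy table as below. The column names (`City`, `Venue Category`) and the five sample rows are illustrative assumptions, not the notebook's actual data.

```python
import pandas as pd

# Toy venue records; column names and rows are illustrative only.
venues = pd.DataFrame({
    "City": ["London", "London", "London", "Tokyo", "Tokyo"],
    "Venue Category": ["Hotel", "Hotel", "Park", "Hotel", "Sake Bar"],
})

# One dummy column per venue category (1.0 where the venue has that category).
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "City", venues["City"])

# The mean of each dummy column per city is that category's frequency in the city.
city_freq = onehot.groupby("City").mean()
print(city_freq)
```

Each row of `city_freq` sums to 1, so cities with fewer returned venues (65 rather than 100) are still directly comparable with the rest.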

### 3.2 Data Validation

In order to be confident using the data in the clustering algorithm I first had to validate it was accurate.

For the coordinates I used the map shown above to confirm that the labels for each city were in the correct place and labelled correctly. This led to the correction of 3 cities, as the coordinates provided by the geocoder were initially incorrect. New York and Washington D.C. at first returned incorrect coordinates because the formatting of their names did not match the format expected by the geocoder API; this was corrected by manually formatting their names. For Beijing the coordinates appeared to be correct but in reality were ~10km away from the centre of Beijing. I believe this is for national security reasons and related to maps not working as expected in parts of China. This was corrected by manually updating the coordinates to the centre of Beijing.
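The report does not say how the ~10km offset was measured. One way to quantify such a discrepancy, shown here purely as an assumed approach, is to compute the great-circle (haversine) distance between the geocoded point and a manually chosen reference point. The function name is hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

Any city whose geocoded point lies more than a few kilometres from its reference could then be flagged for the kind of manual correction described above.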

To validate the Foursquare API data I used the dummy variables to create a list of the 5 most common venue types for each city, as shown in the example for New York, London, Shanghai, and Tokyo above. I checked each city to make sure that it had a reasonable range of venues. This check revealed no anomalous cities, so I continued on to modelling.
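A check like this can be sketched with pandas as follows. The helper name `top_categories` is hypothetical and the frequency table is a small toy example, so the numbers should not be read as the project's results.

```python
import pandas as pd


def top_categories(city_freq: pd.DataFrame, n: int = 5) -> dict:
    """Return {city: Series of the n highest category frequencies}."""
    return {city: row.nlargest(n).round(2) for city, row in city_freq.iterrows()}


# Toy frequency table (one row per city, one column per venue category).
city_freq = pd.DataFrame(
    {"Hotel": [0.17, 0.09], "Park": [0.04, 0.02], "Theater": [0.05, 0.0]},
    index=["London", "Tokyo"],
)

top = top_categories(city_freq, n=2)
for city, freqs in top.items():
    print(f"Most common venue categories for: {city}")
    print(freqs.to_string(), "\n")
```

Scanning this output city by city is a quick way to spot a location dominated by a single unexpected category, which would hint at bad coordinates or a sparse result set.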

### 3.3 Data Modelling - K-Means Clustering

This was actually the simplest section as the majority of the work had been done in the data preparation stages outlined already. 

With my prepared data of dummy variables I used K-means clustering with the number of clusters set to 8. This was chosen as it is the highest number of clusters for which no cluster contains only a single city. To implement the clustering I used scikit-learn's KMeans class. After completing the clustering I recombined the cluster labels, the dummy variables and the dataframe containing the city names and coordinates. Using the now merged dataframe I generated the below map, which displays the 8 clusters generated by the model.
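The rule for choosing the number of clusters can be sketched as a simple search. This is one interpretation of the selection criterion, not the notebook's code: the function name, `k_max`, the fixed `random_state` and the synthetic data are all assumptions, and with real data the answer may vary with the random seed.

```python
import numpy as np
from sklearn.cluster import KMeans


def largest_k_without_singletons(X, k_max, random_state=0):
    """Return (k, labels) for the largest k <= k_max whose clusters all have >= 2 members."""
    for k in range(k_max, 1, -1):
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        if np.bincount(labels, minlength=k).min() >= 2:
            return k, labels
    return None, None


# Synthetic stand-in for the city-by-category frequency matrix:
# three well-separated groups of four points each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.1, size=(4, 2)) for c in centres])

k, labels = largest_k_without_singletons(X, k_max=6)
print(k, np.bincount(labels))
```

Applied to the real frequency matrix, the returned labels could be joined back to the city dataframe by row order before plotting.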

![image](clusteredWorldMap.PNG)

### 3.4 Additional steps

Following all this I also decided to add each city’s population to the dataframe to provide further information when analysing each cluster. To do this I called the public opendatasoft API using the name of each city. This returned values for all cities with the exception of Hong Kong, Beijing, Luxembourg, Mumbai, and Busan. For these I manually provided populations.


## Section 4 - Results

A summary of each cluster is as follows:

<pre>

---- Cluster: 1----
Cities in cluster: Paris, Madrid, Brussels, 
Cluster Mean Population:  2,077,453
            venue  freq
0           Plaza  0.10
1      Art Museum  0.04
2            Park  0.04
3  Sandwich Place  0.03
4      Restaurant  0.03


---- Cluster: 2----
Cities in cluster: Tokyo, Washington, Chicago, Amsterdam, Stockholm, Vancouver, Seoul, Abu Dhabi, Osaka, Busan, 
Cluster Mean Population:  5,312,528
                venue  freq
0               Hotel  0.08
1         Coffee Shop  0.06
2                Park  0.04
3              Bakery  0.03
4  Italian Restaurant  0.02


---- Cluster: 3----
Cities in cluster: Zurich, Geneva, Frankfurt, Melbourne, Hamburg, Stuttgart, Milan, Tel Aviv, Casablanca, Munich, 
Cluster Mean Population:  1,361,467
         venue  freq
0         Café  0.09
1         Park  0.06
2  Coffee Shop  0.05
3        Hotel  0.05
4        Plaza  0.04


---- Cluster: 4----
Cities in cluster: Edinburgh, Montreal, Toronto, Sydney, Dublin, Wellington, 
Cluster Mean Population:  2,319,058
         venue  freq
0         Café  0.12
1         Park  0.07
2  Coffee Shop  0.07
3   Restaurant  0.03
4        Hotel  0.03


---- Cluster: 5----
Cities in cluster: London, Shanghai, Hong Kong, Singapore, Luxembourg, Dubai, Kuala Lumpur, Taipei, 
Cluster Mean Population:  4,843,574
                venue  freq
0               Hotel  0.15
1                Café  0.03
2                Park  0.03
3  Italian Restaurant  0.03
4       Shopping Mall  0.03


---- Cluster: 6----
Cities in cluster: Beijing, Shenzhen, Guangzhou, Chengdu, Qingdao, 
Cluster Mean Population:  6,257,620
                  venue  freq
0                 Hotel  0.17
1           Coffee Shop  0.09
2         Shopping Mall  0.08
3                  Park  0.05
4  Fast Food Restaurant  0.03


---- Cluster: 7----
Cities in cluster: Mumbai, New Delhi, 
Cluster Mean Population:  14,669,135
               venue  freq
0  Indian Restaurant  0.12
1              Hotel  0.10
2               Café  0.08
3                Bar  0.04
4        Coffee Shop  0.04


---- Cluster: 8----
Cities in cluster: New York, San Francisco, Los Angeles, Boston, Copenhagen, Oslo, 
Cluster Mean Population:  2,531,174
                venue  freq
0                Park  0.07
1         Coffee Shop  0.06
2              Bakery  0.06
3      Ice Cream Shop  0.03
4  Italian Restaurant  0.02

</pre>

Cluster 1 includes Paris, Madrid and Brussels. These appear to be clustered based on a higher frequency of plazas and art museums, suggesting a more creative focus for the venues found in these cities. This fits, as they are all historical European capitals with rich artistic and cultural histories.

Cluster 2 includes Tokyo, Washington, Chicago, Amsterdam, Stockholm, Vancouver, Seoul, Abu Dhabi, Osaka, and Busan. This is one of the most geographically diverse clusters, and diversity may be the best way to describe it. With a higher number of hotels, coffee shops and parks, these cities could be clustered based on their metropolitan and tourist appeal.

Cluster 3 includes Zurich, Geneva, Frankfurt, Melbourne, Hamburg, Stuttgart, Milan, Tel Aviv, Casablanca, and Munich. This cluster includes a lot of European cities with a high number of cafes and parks. This cluster has the lowest average population. Taking these factors into account this cluster appears to represent smaller more residential cities. 

Cluster 4 includes Edinburgh, Montreal, Toronto, Sydney, Dublin and Wellington. This cluster appears to be very similar to the previous cluster with a higher frequency of cafes and lower number of hotels. In my opinion looking at cluster 3 and 4 they both include relatively small cities with a residential/tourist focus and could be combined into a single cluster.

Cluster 5 includes London, Shanghai, Hong Kong, Singapore, Luxembourg, Dubai, Kuala Lumpur, and Taipei. This cluster has the second highest frequency of hotels, behind cluster 6. These cities all have a large focus on tourism and business travel. Another consideration is that each city represents a gateway into its respective market: London and Luxembourg into Europe; Shanghai, Hong Kong and Taipei into China; Kuala Lumpur and Singapore into South East Asia; and Dubai into the Middle East. The only major financial gateway missing from this cluster would be New York into the US market.

Cluster 6 includes Beijing, Shenzhen, Guangzhou, Chengdu, and Qingdao. This cluster includes all mainland Chinese financial centres with the exception of Shanghai. This separation seems reasonable, as Shanghai is often regarded as being different from other mainland Chinese cities. This cluster has the highest frequency of hotels, along with a high number of coffee shops (possibly including tea houses) and shopping malls. These venues are typical of Chinese cities.

Cluster 7 includes Mumbai and New Delhi. Similar to cluster 6 this cluster highlights geographical differences as it contains the two Indian financial centres. As expected these include a high frequency of Indian restaurants as well as hotels and cafes. 

Cluster 8 includes New York, San Francisco, Los Angeles, Boston, Copenhagen, and Oslo. Here we have the majority of the US financial centres along with two Scandinavian centres. Alongside cluster 1, this is one of only two clusters not to include hotels in the top 5 venue categories. Although all of these cities are known for having lots of tourists, this would suggest that the centre of each city is more targeted towards commercial venues and businesses.


## Section 5 - Discussion

Overall the results categorise the financial centres in a way that highlights different types of financial centre based not only on their location but also on their focus, whether that be tourism, residence or business. Personally I agree with all of the categories with the exception of New York. I believe New York fits my description of cluster 5 better than cluster 8, because it is a gateway city with lots of tourism and business travel.

Based on these results I can make a number of recommendations for both travel and work in these centres. Cluster 1 would be perfect for cultural travel, or for working if you have a keen interest in the arts in your free time. Cluster 2 provides a diverse range of financial centres with metropolitan appeal and lots of green spaces. For smaller, more residential financial centres, clusters 3 and 4 would be a good fit, although many of these centres are also tourist hotspots. Cluster 5 includes fast-paced, business- and travel-orientated cities; if a busy, upbeat city appeals to you for work or travel, I would recommend these. Clusters 6 and 7 include cities grouped by country, so for an authentic experience of China or India respectively these clusters would be best, as the data suggest their cities are strongly related. Finally, the cities in cluster 8 appear to be less accessible to tourists (at least in the city centre), which suggests that their city centres are aimed more towards work and business.

## Section 6 - Conclusion

In conclusion, looking at the top 50 financial centres in the world, we can see that there are both similarities and differences between them. Although all are financial centres, many have different focuses as cities, whether that be culture, tourism or business. We can also see that some regions have distinct venue categories, whilst some city types are found on multiple continents and across massively different cultures. Overall I believe that the clusters described give a good summary of the feel of each city in the cluster. This can be especially useful if you like to work or travel in one city, as another city in the same cluster may also be suitable for you.