London’s Asian Restaurants: A Cluster Analysis
Christopher Liu
1st June 2019

1.	Introduction

In this notebook, I analyze Foursquare data from London to map the geographical distribution of Asian cuisine restaurants in the London area.

1.1  Background

In Week 3 of the Applied Data Science Capstone Course, I analyzed Toronto's most central neighborhoods and used K-Means to cluster them. The clusters obtained were not distinct enough to be characterized by their most common venue type. One reason is the small data population: data from the entire Toronto area was narrowed to the set of boroughs containing the word 'Toronto', which is mainly the central area of the city.

Another issue is the high number of venue types. For 38 neighborhoods in the final Toronto data frame, there are 238 unique venue types. This translates into low frequency counts and most common venue types that are too dispersed to cluster neatly. Looking at the final clusters, there is often an overlap of venue types that could have been resolved earlier in the process. Overlapping labels like 'coffee' and 'coffee shop' can be regrouped into one label, as can 'bar', 'cocktail bar', 'pub' and 'beer bar', thereby reducing the count of unique venue types.

1.2  Problem 

For this project I chose London, another large, populous and multicultural city. I am assuming that the initial data returned by Foursquare's API is large enough to be narrowed down to Asian cuisines.

London is home to many Asian minorities, so I first try to regroup Foursquare's API results into a set of clearly named buckets. Depending on the number of venue types in the Asian cuisine group, I would like to narrow this set to 6 unique labels: Chinese, Japanese, Korean, Indian, South-East Asian and Middle-Eastern. Furthermore, if the venue type returned by Foursquare is a dish that is a specialty of a certain cuisine, it is re-labeled with the cuisine it belongs to. For example, 'Sushi' is re-labeled as 'Japanese', 'bubble tea' as 'Chinese', and 'Phô' as 'Vietnamese', which falls under 'South-East Asian'.

The main challenges for this analysis are twofold. The first is to find a proper and efficient way to label and group venue types according to a pre-defined set of labels. The second is to obtain statistically meaningful clusters that correspond to the cuisine types defined above.

The result should be a map of clusters defining areas of London where one is more likely to find a certain type of cuisine. A traveler or specialist would find this a useful guide. 

1.3 Interest
A consultancy or banking business might also find this helpful when assessing proposals for a new restaurant or real estate venture.

This is also of interest to data scientists as an exercise in honing the K-Means algorithm.



2. Data Acquisition and Cleaning

2.1 Data Sources

a.	Wikipedia 
Web link: https://en.wikipedia.org/wiki/London_postal_district
I parse the nested table on the Wikipedia page for London Postcodes and district names with Beautiful Soup.

b.	London Datastore Site
Web link: https://data.london.gov.uk/dataset/postcode-directory-for-london, Office for National Statistics, Accessed in May 2019
Filename: London_postcode-ONS-postcode-Directory-May15  (89.94 MB)

c.	Foursquare API
JSON-formatted results queried from an account created for the purpose of this course


2.2 Data Cleaning

First, Postcodes and their district names are parsed from Wikipedia into a Pandas DataFrame. This involves using Beautiful Soup. 
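
The parsing step can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the HTML snippet, the `wikitable` class and the two column names are assumptions standing in for the real page, whose table layout may differ.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the downloaded page; the real table may have
# more columns and a different structure.
html = """
<table class="wikitable">
  <tr><th>Postcode district</th><th>District name</th></tr>
  <tr><td>E1</td><td>Eastern head district</td></tr>
  <tr><td>N1</td><td>Northern head district</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):
    # Header rows only contain <th> cells, so they yield an empty list here.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 2:
        rows.append(cells)

# Column names are illustrative choices, not taken from the notebook.
districts = pd.DataFrame(rows, columns=["Postcode", "District"])
print(districts)
```

In practice the HTML would come from the Wikipedia link listed in section 2.1 rather than from an inline string.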

Second, a join operation is carried out on the London Datastore file to match a latitude and longitude to every postcode district. This step involves heavy data cleaning, as the file lists statistics for every postcode in the Greater London administrative area. Postcodes have to be aggregated by district and narrowed down to the inner area of the London postal region (eight postcode areas). The BR, CR, DA, EN, HA, IG, SL, TN, KT, RM, SM, TW, UB and WD postcode areas (14 outer London postcode areas), which make up the outer area of the London postal region, are therefore excluded from the scope of this project. I chose the mean latitude and longitude of each district as the aggregate value to join on the first dataframe.
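
A rough sketch of this aggregation and join is given below. The column names (`pcd`, `lat`, `long`), the sample rows and the district names are assumptions made for illustration; the actual file has many more columns and rows.

```python
import pandas as pd

# Hypothetical extract of the postcode directory: one row per full postcode.
postcodes = pd.DataFrame({
    "pcd": ["E1 6AN", "E1 7HT", "N1 9GU", "BR1 1AA"],
    "lat": [51.520, 51.516, 51.532, 51.406],
    "long": [-0.075, -0.071, -0.121, 0.015],
})

# The district is taken to be the outward code, i.e. the part before the space.
postcodes["Postcode"] = postcodes["pcd"].str.split().str[0]

# Keep only the eight postcode areas of the inner London postal region.
inner_areas = {"E", "EC", "N", "NW", "SE", "SW", "W", "WC"}
area = postcodes["Postcode"].str.extract(r"^([A-Z]+)", expand=False)
inner = postcodes[area.isin(inner_areas)]

# Aggregate every district to its mean latitude and longitude.
centroids = inner.groupby("Postcode", as_index=False)[["lat", "long"]].mean()

# Join the coordinates onto the district names parsed from Wikipedia
# (placeholder rows here).
districts = pd.DataFrame({
    "Postcode": ["E1", "N1"],
    "District": ["Eastern head district", "Northern head district"],
})
merged = districts.merge(centroids, on="Postcode", how="left")
print(merged)
```

A left join keeps every district from the first dataframe, so a district with no match in the postcode file shows up with missing coordinates instead of silently disappearing.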

3.	Methodology

3.1 K-means Algorithm
I chose the K-Means algorithm because it was used in the course’s previous assignment, ‘The Battle of the Neighborhoods’, to cluster areas of Toronto by the venue types they have in common in Foursquare data. I keep using this algorithm but try to improve its clustering ability.

3.2	Feature Selection

A. Reducing the number of labels
The main shortcomings of the results from the ‘Battle of the Neighborhoods’ assignment are that it is difficult to understand what the clusters mean and that venue types are similar across clusters.

I think this is due to a relatively low number of results from the Foursquare query, which makes it more difficult for the K-Means algorithm to cluster neighborhoods by their most common venue types. Another issue is the high number of unique venue types, which dilutes the clusters further.

To try to solve this problem, I increased the number of results by choosing London rather than Toronto and by widening the search radius from 500 to 800 meters. This raises the count of results and accounts for London's greater geographical spread.
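
As an illustration of the wider search radius, the sketch below builds a request URL without sending it. It assumes the Foursquare v2 `venues/explore` endpoint and its usual parameter names; the helper name, the `limit` and version values, and the credentials are placeholders, not values taken from the notebook.

```python
from urllib.parse import urlencode

def build_explore_url(lat, lng, client_id, client_secret,
                      radius=800, limit=100, version="20190601"):
    """Build a venues/explore URL (hypothetical helper, nothing is sent)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, placeholder value
        "ll": f"{lat},{lng}",    # centre of the search
        "radius": radius,        # 800 m instead of the 500 m used for Toronto
        "limit": limit,          # assumed cap on results per call
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

url = build_explore_url(51.518, -0.073, "YOUR_ID", "YOUR_SECRET")
print(url)
```

Keeping the radius as a parameter makes it easy to repeat the query with 500 and 800 meters and compare the number of venues returned.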

Furthermore, I lowered the number of unique categories by grouping them into buckets of cuisine types. This brings the count of unique categories in the Foursquare results from 312 down to 6 cuisine labels:

•	Middle-Eastern which includes the venue types labeled as: 'Middle Eastern Restaurant', 'Turkish Restaurant', 'Persian Restaurant', 'Lebanese Restaurant', 'Israeli Restaurant', 'Iraqi Restaurant', 'Afghan Restaurant'

•	Chinese which includes the venue types labeled as: 'Chinese', 'Chinese Restaurant', 'Ramen Restaurant', 'Szechuan Restaurant', 'Xinjiang Restaurant', 'Cantonese Restaurant'

•	South East Asian which includes the venue types labeled as: 'Thai Restaurant', 'Vietnamese Restaurant', 'Malay Restaurant'

•	Indian which includes the venue types labeled as: 'South Indian Restaurant', 'North Indian Restaurant', 'Indian Restaurant', 'Himalaya', 'Pakistani Restaurant'

•	Japanese which includes the venue types labeled as: 'Okonomiyaki Restaurant', 'Sushi Restaurant', 'Japanese Restaurant'

•	Korean which includes the venue types labeled as: 'Korean Restaurant'


Here, another category was added for fun but also as a control to see how well the algorithm would do. This ‘Fast Foods’ category includes the venue types labeled as 'Fried Chicken Joint', 'Doner Restaurant', 'Fast Food Restaurant'. This venue type is common throughout London, so it is an interesting question whether it improves the K-Means results or, on the contrary, makes them more confusing. The results are discussed later in the paper.
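
The grouping described above can be expressed as a simple lookup table. The sketch below copies the buckets from the lists in this section; the names `BUCKETS` and `relabel` are illustrative, and returning `None` for categories outside the buckets (so that they can be dropped) is an assumption about how unmatched venues are handled.

```python
# Buckets as listed in this section, plus the extra 'Fast Foods' category.
BUCKETS = {
    "Middle-Eastern": [
        "Middle Eastern Restaurant", "Turkish Restaurant", "Persian Restaurant",
        "Lebanese Restaurant", "Israeli Restaurant", "Iraqi Restaurant",
        "Afghan Restaurant",
    ],
    "Chinese": [
        "Chinese", "Chinese Restaurant", "Ramen Restaurant",
        "Szechuan Restaurant", "Xinjiang Restaurant", "Cantonese Restaurant",
    ],
    "South East Asian": [
        "Thai Restaurant", "Vietnamese Restaurant", "Malay Restaurant",
    ],
    "Indian": [
        "South Indian Restaurant", "North Indian Restaurant",
        "Indian Restaurant", "Himalaya", "Pakistani Restaurant",
    ],
    "Japanese": [
        "Okonomiyaki Restaurant", "Sushi Restaurant", "Japanese Restaurant",
    ],
    "Korean": ["Korean Restaurant"],
    "Fast Foods": [
        "Fried Chicken Joint", "Doner Restaurant", "Fast Food Restaurant",
    ],
}

# Invert the table so that each Foursquare category points to its bucket.
CATEGORY_TO_BUCKET = {
    category: bucket
    for bucket, categories in BUCKETS.items()
    for category in categories
}

def relabel(category):
    """Return the bucket for a Foursquare category, or None if unmatched."""
    return CATEGORY_TO_BUCKET.get(category)

venues = ["Sushi Restaurant", "Coffee Shop", "Turkish Restaurant"]
labels = [relabel(v) for v in venues]
print(labels)
```

Keeping the buckets in one dictionary means a category can be moved to another bucket, or the 'Fast Foods' control removed, by editing a single entry.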

B. 	Returning only the most common venue type

The final dataframe matches every neighborhood from the first dataframe with its latitude, longitude and most common venue type before passing them through K-Means. Lowering the number of most common venues from 5 in the previous assignment to 1 is also an attempt to improve the clusters and follows trials with values from 5 down to 1.
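
One way to reduce each neighborhood to a single most common venue type is sketched below, assuming a dataframe with one row per venue. The column names and the sample rows are invented for illustration.

```python
import pandas as pd

# Hypothetical venue-level data after re-labeling: one row per venue.
venues = pd.DataFrame({
    "Neighborhood": ["E1", "E1", "E1", "N1", "N1", "W1"],
    "Cuisine": ["Indian", "Indian", "Chinese", "Japanese", "Japanese", "Korean"],
})

# For every neighborhood, keep the label that occurs most often.
top = (
    venues.groupby("Neighborhood")["Cuisine"]
    .agg(lambda s: s.value_counts().idxmax())
    .reset_index(name="Most Common Cuisine")
)
print(top)
```

If two labels are tied within a neighborhood, this sketch keeps whichever one `value_counts` happens to list first, so ties may need an explicit rule in a real analysis.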

C. 	Setting the number of clusters

The number of clusters k in the KMeans algorithm is set to 6, the number of cuisine bins introduced earlier in this section. This is a logical choice and also a test of whether the algorithm can regroup venues according to their labels.
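
The clustering step might look roughly like the sketch below. The data is invented, and two choices are assumptions rather than the notebook's confirmed setup: the cuisine label is one-hot encoded and used as the only feature, and the coordinates are kept for mapping only. `random_state` and `n_init` are set just to make the run repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical final dataframe: one row per neighborhood.
df = pd.DataFrame({
    "Neighborhood": ["E1", "E2", "N1", "NW1", "SE1", "SW1", "W1", "W2"],
    "lat": [51.518, 51.529, 51.538, 51.534, 51.498, 51.497, 51.516, 51.514],
    "long": [-0.063, -0.060, -0.099, -0.146, -0.089, -0.137, -0.147, -0.179],
    "Most Common Cuisine": [
        "Indian", "Indian", "Chinese", "Middle-Eastern",
        "South East Asian", "Korean", "Japanese", "Japanese",
    ],
})

# One-hot encode the label so that K-Means can work with numeric features.
features = pd.get_dummies(df["Most Common Cuisine"]).astype(float)

# k = 6, matching the number of cuisine bins.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(features)
df["Cluster"] = kmeans.labels_
print(df[["Neighborhood", "Most Common Cuisine", "Cluster"]])
```

Adding `lat` and `long` to the features would pull geographically close neighborhoods together as well, at the cost of clusters that no longer map one-to-one onto labels.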



4. Results
The clusters resulting from the final machine-learning step show that increasing the number of results and reducing the number of categories has a positive effect on the K-Means algorithm. Compared to the ‘Battle of the Neighborhoods’ assignment, the clusters are much easier to interpret: 5 out of 6 clusters contain a single cuisine label. The remaining cluster mixes cuisine types, with a skew towards ‘Middle-Eastern’.
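
A check of how many clusters contain a single label can be sketched as follows. The cluster assignments below are made up for illustration and are not the project's results; the column names are also assumptions.

```python
import pandas as pd

# Made-up cluster assignments, NOT the project's actual output.
clustered = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 2, 2, 2],
    "Most Common Cuisine": [
        "Indian", "Indian", "Japanese", "Japanese",
        "Middle-Eastern", "Middle-Eastern", "Chinese",
    ],
})

grouped = clustered.groupby("Cluster")["Most Common Cuisine"]
summary = pd.DataFrame({
    "n_labels": grouped.nunique(),                                # distinct labels
    "dominant": grouped.agg(lambda s: s.value_counts().idxmax()), # top label
    "share": grouped.agg(lambda s: s.value_counts(normalize=True).max()),
})

# A cluster counts as 'pure' when it holds exactly one cuisine label.
pure = int((summary["n_labels"] == 1).sum())
print(summary)
print(f"{pure} of {len(summary)} clusters contain a single label")
```

The `share` column gives a quick measure of how strongly a mixed cluster leans towards its dominant label.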

Mapping these clusters with Folium shows the areas of London where one type of Asian cuisine is the most common. The results are in line with my own experience of living in London, which was the intuition behind this data science inquiry.

These clusters can be discussed briefly. The more affluent areas known as the West End are grouped into a single cluster (Cluster 2). Interestingly, this cluster consists of Japanese cuisine, a type of restaurant that is typically more expensive.
 


Cluster 3 is a mix of cuisines but skewed towards Middle-Eastern with a geographical clustering around the North West area of London, shown in pale green on the map.
 
 
5. Conclusion
Machine learning on Foursquare data shows that results can be improved by cleaning the datasets and by further tuning the parameters of the algorithm.

In this project, I have attempted to refine the clusters produced by the K-Means algorithm into statistically meaningful ones by grouping results into bins of the most common labels. This has resulted in 6 clusters, 5 of which are exclusively labeled as one type. This in turn has a positive effect on visualization.

 



