# Analysis of venue ratings in Queens

## Introduction: Business Problem

This project investigates the ratings of venues in Queens. To do this, we identify clusters of highly rated venues, mid-range venues and venues with low ratings.

This information can be visualised on a map and can be useful for the following use cases:  
* Deciding where to live  
* Choosing where to meet for a meal, in an area with more highly rated restaurants  
* Choosing where to open a business, by identifying the competition or the reputation of a particular area

## Data

In order to draw conclusions on this problem, we will require the following information:
* Venue locations **provided by Foursquare**
* Ratings for each venue **also provided by Foursquare**

To obtain this information, we used New York neighborhood information **from Coursera** and obtained up to 20 venues within a 500m radius of each neighborhood in Queens. From these venues, we only used those which had been rated on Foursquare.

The venues will be clustered based on their ratings and the location information will be used to visualise the clusters from a geographical perspective.


### Collecting the data

Import all the required libraries

In [93]:
import pandas as pd
import numpy as np
import wget
import json
import matplotlib.cm as cm
import matplotlib.colors as colors
from pandas import json_normalize # top-level import path (pandas 1.0 and later)
from sklearn.cluster import KMeans
import folium # map rendering library
import requests

Download the neighborhood information and convert into a dataframe

In [95]:
wget.download('https://cocl.us/new_york_dataset') # Download New York dataset
with open('new_york_dataset') as json_data:
    newyork_data = json.load(json_data)

neighborhoods_data = newyork_data['features']

column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] # define the dataframe columns

# Collect one row per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

100% [............................................................................] 115774 / 115774
The dataframe has 5 boroughs and 306 neighborhoods.
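
The loop above relies on the GeoJSON convention that point coordinates are stored as `[longitude, latitude]`. As a minimal sketch of that per-feature step, the hypothetical helper `feature_to_row` below (not part of the original notebook) turns one feature into a row dictionary. The sample feature is hand-written to mirror the keys used above, with the coordinates listed for Astoria in this notebook.

```python
def feature_to_row(feature):
    """Convert one GeoJSON-style feature into a flat row dictionary."""
    # GeoJSON stores point coordinates as [longitude, latitude]
    lon, lat = feature['geometry']['coordinates'][:2]
    props = feature['properties']
    return {'Borough': props['borough'],
            'Neighborhood': props['name'],
            'Latitude': lat,
            'Longitude': lon}

# Hand-written sample with the same keys as the dataset's features
sample_feature = {
    'properties': {'borough': 'Queens', 'name': 'Astoria'},
    'geometry': {'type': 'Point', 'coordinates': [-73.915654, 40.768509]},
}
row = feature_to_row(sample_feature)
print(row)
```

Swapping the two coordinates is an easy mistake to make, so keeping the unpacking in one small function makes it simpler to check.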


Extract all the information for neighborhoods in Queens

In [66]:
queens_data = neighborhoods[neighborhoods['Borough']=='Queens']
queens_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 129 | Queens | Astoria | 40.768509 | -73.915654 |
| 130 | Queens | Woodside | 40.746349 | -73.901842 |
| 131 | Queens | Jackson Heights | 40.751981 | -73.882821 |
| 132 | Queens | Elmhurst | 40.744049 | -73.881656 |
| 133 | Queens | Howard Beach | 40.654225 | -73.838138 |


### Neighborhoods in Queens

Plot all the neighborhoods on a map.

In [96]:
latitude = 40.7282
longitude = -73.7949

map_queens = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(queens_data['Latitude'], queens_data['Longitude'], queens_data['Borough'], queens_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_queens)  
map_queens

Define a function and call it to get all the relevant information for nearby venues. In this case we are getting up to 20 venues within a 500m radius.

In [117]:
def getNearbyVenues(names, latitudes, longitudes, radius, limit):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            v['venue']['name'],
            v['venue']['id'],
            v['venue']['location']['lat'], 
            v['venue']['location']['lng']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = [
                  'Neighborhood',
                  'Venue',
                  'Venue ID',
                  'Venue Latitude', 
                  'Venue Longitude']
    return nearby_venues
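
The request URL above is assembled with `str.format`, which leaves the query values unescaped. As an alternative sketch, the standard library's `urllib.parse.urlencode` can build the same query string from a dictionary. The helper name `build_explore_url` and the placeholder credentials are assumptions for illustration only; the endpoint and parameter names are the ones used in `getNearbyVenues`.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude pair
        'radius': radius,                # search radius in metres
        'limit': limit,                  # maximum venues to return
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

# Placeholder credentials; no request is sent here
example_url = build_explore_url('MY_ID', 'MY_SECRET', '20180605',
                                40.768509, -73.915654, 500, 20)
print(example_url)
```

The resulting string can be passed to `requests.get` in the same way as the formatted URL in the function above.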

In [None]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your own Foursquare credentials
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'
VERSION = '20180605'
radius = 500 # search radius in metres
limit = 20 # maximum number of venues per neighborhood

queens_venues = getNearbyVenues(names=queens_data['Neighborhood'],
                                   latitudes=queens_data['Latitude'],
                                   longitudes=queens_data['Longitude'],
                                   radius = radius,
                                   limit = limit)


In [132]:
queens_venues.head()

| | Neighborhood | Venue | Venue ID | Venue Latitude | Venue Longitude |
|---|---|---|---|---|---|
| 0 | Astoria | Favela Grill | 4bdf502a89ca76b062b75d5e | 40.767348 | -73.917897 |
| 1 | Astoria | Orange Blossom | 52c580e8498eddd52d925dd9 | 40.769856 | -73.917012 |
| 2 | Astoria | Titan Foods Inc. | 4a9c0105f964a520b03520e3 | 40.769198 | -73.919253 |
| 3 | Astoria | CrossFit Queens | 4c94d26d58d4b60c40fc2b29 | 40.769404 | -73.918977 |
| 4 | Astoria | Simply Fit Astoria | 4d7ce85486cfa14365a2d2a0 | 40.769114 | -73.912403 |


Create a map to show all the venues collected

In [134]:
latitude = 40.7282
longitude = -73.7949

map_venues = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, venue, neighborhood in zip(queens_venues['Venue Latitude'], queens_venues['Venue Longitude'], queens_venues['Venue'], queens_venues['Neighborhood']):
    label = '{}, {}'.format(neighborhood, venue)  # popup text: neighborhood, venue name
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_venues)  
map_venues

Collect the ratings data for all the venues that were found using the previous search.

In [131]:
# CLIENT_ID, CLIENT_SECRET and VERSION are defined in the credentials cell above
# Loop through every venue ID in queens_venues
ratings_list = []
for index, row in queens_venues.iterrows():
    venue_id = row['Venue ID']
    url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(   
        venue_id,
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION)
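    try:
        result = requests.get(url).json()['response']['venue']['rating']
    except (KeyError, ValueError, requests.RequestException):
        # no rating in the response, undecodable response, or failed request
        result = None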
    ratings_list.append([venue_id, result])

ratings_df = pd.DataFrame(ratings_list)

print(ratings_df.head())
print(ratings_df.shape)

                          0    1
0  4bdf502a89ca76b062b75d5e  8.6
1  52c580e8498eddd52d925dd9  8.1
2  4a9c0105f964a520b03520e3  9.3
3  4c94d26d58d4b60c40fc2b29  8.9
4  4d7ce85486cfa14365a2d2a0  8.5
(117, 2)
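
When a response has no `rating` entry, the lookup above raises an exception and the loop falls back to `None`. A minimal sketch of the same lookup without a `try` block is shown here. `extract_rating` is a hypothetical helper, and the sample payloads are hand-written to mimic the `response`, `venue`, `rating` nesting used above rather than real API output.

```python
def extract_rating(payload):
    """Return the venue rating from a decoded response, or None if absent."""
    venue = payload.get('response', {}).get('venue', {})
    return venue.get('rating')

# Hand-written payloads: one rated venue, one unrated venue, one empty response
rated = {'response': {'venue': {'id': 'abc', 'rating': 8.6}}}
unrated = {'response': {'venue': {'id': 'def'}}}
empty = {'response': {}}

ratings = [extract_rating(p) for p in (rated, unrated, empty)]
print(ratings)  # [8.6, None, None]
```

Returning `None` for missing ratings keeps the later `dropna` step unchanged.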


Add the ratings to the venue information and remove all venues that do not have a rating.

In [111]:
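ratings_df.columns = ['Venue ID', 'rating']
# Index both dataframes by venue ID (the column name contains a space) so the ratings align with the venues
queens_venues.set_index('Venue ID', inplace=True)
ratings_df.set_index('Venue ID', inplace=True)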

queens_venues['rating'] = ratings_df['rating']
queens_venues.dropna(axis=0, inplace=True)
queens_venues.head()

| Venue ID | Venue | Venue Latitude | Venue Longitude |
|---|---|---|---|
| 4bdf502a89ca76b062b75d5e | Favela Grill | 40.767348 | -73.917897 |
| 52c580e8498eddd52d925dd9 | Orange Blossom | 40.769856 | -73.917012 |
| 4a9c0105f964a520b03520e3 | Titan Foods Inc. | 40.769198 | -73.919253 |
| 4c94d26d58d4b60c40fc2b29 | CrossFit Queens | 40.769404 | -73.918977 |
| 4d7ce85486cfa14365a2d2a0 | Simply Fit Astoria | 40.769114 | -73.912403 |


## Methodology

In this project, we aim to identify the general quality of venues in specific locations. The area of interest is Queens. A 500m radius around each neighborhood was used to search for venues, with a maximum of 20 venues per neighborhood. The limit of 20 venues was chosen because of the quota enforced by the Foursquare API. As a result, the analysis must take into account that ratings are unavailable for some venues.

To carry out the investigation, we acquired the name, coordinates and rating of each venue. This data was then clustered using the k-means clustering method to divide the venues into three tiers of quality: low, medium and high.

We then use a map to visualise the clusters with regard to their locations. From this, we can draw conclusions about which neighborhoods have a higher density of venues at each quality level and explore which locations would be best to live in, or to meet with friends.
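
To make the clustering step concrete, here is a minimal pure-Python sketch of Lloyd's algorithm, the iteration commonly used for k-means, applied to one-dimensional ratings. It is an illustration only: the analysis itself uses scikit-learn's `KMeans`, and the function name `kmeans_1d`, the evenly spaced initialisation (which needs `k >= 2`) and the sample ratings are all assumptions made for this example.

```python
def kmeans_1d(values, k, max_iter=100):
    """Cluster 1-D values into k groups (k >= 2) with Lloyd's algorithm."""
    ordered = sorted(values)
    # Simple deterministic start: k evenly spaced points from the sorted data
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:  # centers stopped moving, so we have converged
            break
        centers = new_centers
    return labels, centers

sample_ratings = [6.1, 6.3, 6.0, 7.9, 8.1, 8.0, 9.2, 9.4]  # made-up values
labels, centers = kmeans_1d(sample_ratings, k=3)
print(labels)
print(centers)
```

With well-separated values like these, the three groups correspond to the low, medium and high tiers described above.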

## Analysis

In this section, we extract the venue names and ratings and apply k-means clustering to the ratings. We then add the cluster labels back to the dataframe containing the other venue information for further analysis.

In [102]:
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt

# Keep only the venue names and ratings, indexed by venue name
queens_ratings = queens_venues[['Venue', 'rating']].set_index('Venue')

# Initialise k-means and fit it to the ratings
num_clusters = 3
k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=12)
k_means.fit(queens_ratings)
labels = k_means.labels_
print('label size: ', labels.size, ' df size: ',queens_venues.shape)

queens_venues.insert(0, 'Cluster Labels', labels)

label size:  10  df size:  (10, 4)


We will now compute the mean rating for each cluster. From the results below, we can see that:  
* Cluster 0: low rated venues  
* Cluster 1: mid rated venues  
* Cluster 2: high rated venues  

In [58]:
queens_venues.groupby('Cluster Labels')['rating'].mean()


Cluster Labels
0    7.800000
1    8.447368
2    9.000000
Name: rating, dtype: float64
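
The numeric labels returned by k-means are arbitrary, so which label corresponds to low, mid or high ratings can change from run to run. The sketch below shows one way to renumber clusters by ascending mean rating so that 0 always denotes the lowest tier. The helper name `relabel_by_mean` and the sample values are assumptions for illustration, not output from this notebook.

```python
from collections import defaultdict

def relabel_by_mean(labels, ratings):
    """Renumber cluster labels so that they increase with mean rating."""
    groups = defaultdict(list)
    for label, rating in zip(labels, ratings):
        groups[label].append(rating)
    means = {label: sum(vals) / len(vals) for label, vals in groups.items()}
    # Old labels sorted from lowest to highest mean rating
    ordered = sorted(means, key=means.get)
    mapping = {old: new for new, old in enumerate(ordered)}
    return [mapping[label] for label in labels], mapping

raw_labels = [2, 0, 1, 0, 2, 1]                   # arbitrary k-means labels
sample_ratings = [7.8, 9.0, 8.4, 9.1, 7.6, 8.5]   # made-up ratings
new_labels, mapping = relabel_by_mean(raw_labels, sample_ratings)
print(mapping)
print(new_labels)
```

Applying such a mapping before plotting would keep the low, mid and high interpretation stable across reruns.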

We can now display our findings on a map to identify locations of interest and draw conclusions about different locations.

In [60]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)	# create map

# set color scheme for the clusters
x = np.arange(num_clusters)
ys = [i + x + (i*x)**2 for i in range(num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(queens_venues['Venue Latitude'], queens_venues['Venue Longitude'], queens_venues['Venue'], queens_venues['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels start at 0, so index the palette directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Results and Discussion

In this project, which assesses the quality of venues in Queens as rated by Foursquare users, we excluded venues with no ratings. The resulting visualisation therefore does not cover all venues, but it most likely indicates where venues are more frequently visited.

In addition, Foursquare imposes a quota on the number of calls that can be made per day (I was unable to work around a bug when creating an app with a verified personal account, so I had to use the Sandbox account).

From the available data, we can see three distinct areas where data could be collected. Compared with the raw number of venues found, this investigation does not contain enough data to draw meaningful conclusions about Queens as a whole.

If we limit our investigation to these three areas, we can only rank them in order of preference relative to each other.

Area 1 has the most venues belonging to cluster 2 (high quality venues), whereas area 2 has the highest concentration of venues, most of which belong to cluster 0, indicating that venues in this area are not of a very high standard. Area 3 has a combination of mid and low quality venues.

From these observations, we can conclude that the order of preference for locations to live around would be area 1, area 3, then area 2 if we base our decision on the quality of venues in the area.


## Conclusion

From this investigation, we found:  
* With the limited data obtained from Foursquare, venues found were clustered into three distinct areas (indicated on the map)
* Clustering this data into groups of low, medium and high quality venues based on user ratings gave the following mean rating per cluster:
    * Cluster 0: 7.800000
    * Cluster 1: 8.447368
    * Cluster 2: 9.000000
* Area 1 contained the highest proportion of high quality venues, while still containing a mixture of low to mid range venues on its northern side.
* Area 2 contained the highest density of venues, most of them low range venues.
* Area 3 contained a mixture of low and mid range venues, with no high range venues.
* The order of preference for locations to live based on venue quality would be area 1, area 3, then area 2.