# A Coffee Lover's Guide to America: Comparing Five Major US Cities

----------------------------------------------------------------------------------------------------------------------
 Felix Reznitskiy
 
 December 18, 2020
 
----------------------------------------------------------------------------------------------------------------------

## Introduction

![image](./coffee-caffeinated-history.jpg)

Coffee first became popular in the U.S. after the Boston Tea Party, when the switch was seen as “patriotic,” [according to PBS](http://www.pbs.org/food/the-history-kitchen/history-coffee/). Since Starbucks debuted in 1971, the drink has become accessible almost anywhere you go. A recent survey by the National Coffee Association found that [62 percent](https://www.ncausa.org/Newsroom/NCA-releases-Atlas-of-American-Coffee) of Americans drink coffee every day, with the average coffee drinker consuming 3 cups daily.

What gave rise to java culture? Science, for one, has convinced us that caffeine offers health benefits beyond mental stimulation. At the right dosages, caffeine may contribute to [longevity](https://time.com/5326420/coffee-longevity-study/). Perhaps just as important, though, is coffee’s social purpose. Today, coffee stations are a staple of the workplace, and tens of thousands of shops serve as meeting places for friends, dates and coworkers, though in 2020 many have had to switch to take-out service only because of the COVID-19 pandemic.

## Business Problem

To determine the best city for coffee lovers, we will identify which of five major US cities has the highest density of coffee shops.

## Data Description

We will use the Foursquare API to fetch the locations of coffee shops in the five largest US cities:
 -	New York City, NY (Population: 8,622,357)
 -	Los Angeles, CA (Population: 4,085,014)
 -	Chicago, IL (Population: 2,670,406)
 -	Houston, TX (Population: 2,378,146)
 -	Phoenix, AZ (Population: 1,743,469)

Next, we will use this data to measure the density of coffee shops in each city. We define density as the mean distance from the venues to the city-center coordinates: the lower the mean distance, the more tightly the coffee shops cluster around the center. The city with the lowest mean distance will be considered the best.
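
Written out, the measure described above could be expressed as follows. The symbols are introduced here for illustration only: $(\phi_i, \lambda_i)$ are the latitude and longitude of venue $i$, $(\phi_c, \lambda_c)$ are the city-center coordinates, and $n$ is the number of venues. The formula assumes a plain Euclidean distance in latitude/longitude degrees, which is what the code later in this notebook computes.

$$
\bar{d} = \frac{1}{n} \sum_{i=1}^{n} \sqrt{(\phi_i - \phi_c)^2 + (\lambda_i - \lambda_c)^2}
$$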

In [16]:
import numpy as np # library for working with arrays, vectors, etc.
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import requests # library to handle HTTP requests
import folium # library for generating the maps
#import json # library to handle JSON files
from pandas import json_normalize # transform JSON into a pandas DataFrame
import math
from scipy.spatial.distance import cdist # for calculating the distance between two points

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

print('Libraries imported.')

Libraries imported.


In [2]:
cityList = ['New York, NY', 'Los Angeles, CA', 'Chicago, IL', 'Houston, TX', 'Phoenix, AZ']

cityCoordinates = {}

for city in cityList:
    address = city # 'New York City, NY'
    geolocator = Nominatim(user_agent="my_coffee_explorer")
    location = geolocator.geocode(address)
    cityCoordinates[city] = [location.latitude, location.longitude]
    print('The geographical coordinates of {} are {}, {}.'.format(city, cityCoordinates[city][0], cityCoordinates[city][1]))

The geographical coordinates of New York, NY are 40.7127281, -74.0060152.
The geographical coordinates of Los Angeles, CA are 34.0536909, -118.242766.
The geographical coordinates of Chicago, IL are 41.8755616, -87.6244212.
The geographical coordinates of Houston, TX are 29.7589382, -95.3676974.
The geographical coordinates of Phoenix, AZ are 33.4484367, -112.0741417.
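
The geocoding loop above assumes every lookup succeeds. As a minimal sketch of a more defensive variant, the helper below accepts any geocoding callable and fails with a clear message when a city cannot be resolved (geopy's `geocode` returns `None` in that case). The names `geocode_cities` and `fake_geocode` are hypothetical, and the stand-in geocoder only replays two coordinates from the output above so that the example runs without network access.

```python
from types import SimpleNamespace


def geocode_cities(cities, geocode):
    """Map each city name to [latitude, longitude] using the given callable.

    `geocode` is any function that takes an address string and returns an
    object with `.latitude` and `.longitude`, or None if nothing was found.
    """
    coordinates = {}
    for city in cities:
        location = geocode(city)
        if location is None:
            raise ValueError(f"Could not geocode {city!r}")
        coordinates[city] = [location.latitude, location.longitude]
    return coordinates


# Hypothetical stand-in for a real geocoder; values copied from the output above.
_KNOWN = {
    "New York, NY": (40.7127281, -74.0060152),
    "Chicago, IL": (41.8755616, -87.6244212),
}


def fake_geocode(address):
    if address not in _KNOWN:
        return None
    lat, lng = _KNOWN[address]
    return SimpleNamespace(latitude=lat, longitude=lng)


demo = geocode_cities(["New York, NY", "Chicago, IL"], fake_geocode)
```

With geopy installed, the same helper could be called as `geocode_cities(cityList, Nominatim(user_agent="my_coffee_explorer").geocode)`, which also creates the geolocator once instead of once per city.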


In [3]:
search_query = 'Coffee'
#search_query = 'Coffee Shop'
radius = 500
#print(search_query + ' .... OK!')
CLIENT_ID = '0YOP1FXJVEUP5BOXUZG1FH3Y2EIWH04A5EYLAVRC2SUXR2XT' # your Foursquare ID
CLIENT_SECRET = 'BF5TKWMGU004CM3X2FVAAAAMOOHLCJ1PJUNO2ARGKFYBJW1A' # your Foursquare Secret
ACCESS_TOKEN = '2SU305WMFH3PXFEAUD0D0H4H0KCA0NOC04CRBVIXHYHCYL03' # your FourSquare Access Token
VERSION = '20180605'
LIMIT = 1 # we will use this single query result to fetch the category Id
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 0YOP1FXJVEUP5BOXUZG1FH3Y2EIWH04A5EYLAVRC2SUXR2XT
CLIENT_SECRET:BF5TKWMGU004CM3X2FVAAAAMOOHLCJ1PJUNO2ARGKFYBJW1A


First, we need to find the category ID for coffee shops so that we can fetch coffee shop data through the Foursquare API.

In [4]:
neighborhood_latitude = cityCoordinates['New York, NY'][0]
neighborhood_longitude = cityCoordinates['New York, NY'][1]

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude,ACCESS_TOKEN, VERSION, search_query, radius, LIMIT)

# checking the URL
#print(url)

#fetching one Coffee Shop in order to get the category Id
queryResult = requests.get(url).json()

# fetching category name and id
print(queryResult['response']['venues'][0]['categories'][0]['name'] + ", " + queryResult['response']['venues'][0]['categories'][0]['id']) #'4bf58dd8d48988d1e0931735'

Coffee Shop, 4bf58dd8d48988d1e0931735
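
The lookup above indexes straight into the response and would raise an error if the search returned no venues or a venue without categories. A small sketch of a safer extraction follows; `first_category` is a hypothetical helper, and `sample` is a hand-built dictionary that only mimics the part of the response shape used above.

```python
def first_category(search_response):
    """Return (name, id) of the first category of the first venue, or None."""
    venues = search_response.get("response", {}).get("venues", [])
    if not venues:
        return None
    categories = venues[0].get("categories", [])
    if not categories:
        return None
    return categories[0]["name"], categories[0]["id"]


# Hand-built sample mirroring the fields accessed in the cell above.
sample = {
    "response": {
        "venues": [
            {"categories": [{"name": "Coffee Shop", "id": "4bf58dd8d48988d1e0931735"}]}
        ]
    }
}

category = first_category(sample)
```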


Now we can pull up to 100 coffee shops per city using that category ID.

In [5]:
LIMIT = 100
results = {}
for city in cityList:
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&near={}&limit={}&categoryId={}'.format(
        CLIENT_ID, CLIENT_SECRET, VERSION, city, LIMIT,
        "4bf58dd8d48988d1e0931735") # Category from the previous step
    results[city] = requests.get(url).json()
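
Building the query string by hand leaves the spaces and commas in city names such as `New York, NY` unescaped in the URL text. One possible alternative, sketched below, is to keep the parameters in a dictionary and let the HTTP library encode them. The helper name `build_explore_params` and the environment variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are assumptions made for this example; the endpoint, parameter names, and category ID are the ones used above.

```python
import os
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"
COFFEE_SHOP_CATEGORY_ID = "4bf58dd8d48988d1e0931735"  # from the previous step


def build_explore_params(city, client_id, client_secret,
                         version="20180605", limit=100):
    """Collect the query parameters for one city in a plain dictionary."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "near": city,
        "limit": limit,
        "categoryId": COFFEE_SHOP_CATEGORY_ID,
    }


# Hypothetical variable names; reading credentials from the environment
# keeps them out of the notebook itself.
client_id = os.environ.get("FOURSQUARE_CLIENT_ID", "YOUR_CLIENT_ID")
client_secret = os.environ.get("FOURSQUARE_CLIENT_SECRET", "YOUR_CLIENT_SECRET")

params = build_explore_params("New York, NY", client_id, client_secret)

# urlencode shows how the parameters end up escaped in the final URL.
query_string = urlencode(params)
```

With `requests`, the dictionary can be passed directly, for example `requests.get(EXPLORE_URL, params=params).json()`, and the library performs the same escaping.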

In [6]:
df_venues = {}
for city in cityList:
    # flatten the nested JSON items into one row per venue
    venues = json_normalize(results[city]['response']['groups'][0]['items'])
    # copy the selected columns so that renaming does not act on a view of `venues`
    df_venues[city] = venues[['venue.name', 'venue.location.address', 'venue.location.lat', 'venue.location.lng']].copy()
    df_venues[city].columns = ['name', 'address', 'lat', 'lng']
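
To make the flattening step easier to follow, here is a small sketch of what `json_normalize` does with nested venue records. The two records are invented for illustration (they are not Foursquare data); the second one has no address, to show that a missing field becomes `NaN` rather than an error.

```python
import pandas as pd

# Invented records that mimic the nesting of the 'items' list used above.
items = [
    {"venue": {"name": "Cafe A",
               "location": {"address": "1 Main St", "lat": 40.71, "lng": -74.00}}},
    {"venue": {"name": "Cafe B",
               "location": {"lat": 40.72, "lng": -74.01}}},  # no address
]

# Nested keys are joined with dots, e.g. 'venue.location.lat'.
flat = pd.json_normalize(items)

# Select the columns of interest and give them short names in one step.
column_names = {
    "venue.name": "name",
    "venue.location.address": "address",
    "venue.location.lat": "lat",
    "venue.location.lng": "lng",
}
venues = flat[list(column_names)].rename(columns=column_names)
```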

Let's take a look at the map to see the density of coffee shops in each city.

In [7]:
CoffeeShopsPerCity = [] # this list will be used later for the final report
maps = {} # will contain five maps of the cities
for city in cityList:
    maps[city] = folium.Map(location=[cityCoordinates[city][0], cityCoordinates[city][1]], zoom_start=11)

    # add markers to map
    for lat, lng, label in zip(df_venues[city]['lat'], df_venues[city]['lng'], df_venues[city]['name']):
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7).add_to(maps[city])
    print(f"Total number of coffee shops in {city} = ", results[city]['response']['totalResults'])
    CoffeeShopsPerCity.append(results[city]['response']['totalResults'])

Total number of coffee shops in New York, NY =  220
Total number of coffee shops in Los Angeles, CA =  203
Total number of coffee shops in Chicago, IL =  184
Total number of coffee shops in Houston, TX =  156
Total number of coffee shops in Phoenix, AZ =  161


In [8]:
maps[cityList[0]]

In [9]:
maps[cityList[1]]

In [10]:
maps[cityList[2]]

In [11]:
maps[cityList[3]]

In [12]:
maps[cityList[4]]

The maps suggest that New York and Chicago have the highest density of coffee shops.

To support these observations, we will measure the density and build a table with concrete numbers.
We will use two measures: the average distance from the coffee shops to the corresponding city center, and the average distance from the coffee shops to their own mean coordinates. Both are Euclidean distances in raw latitude/longitude degrees, so they are only roughly comparable across cities at different latitudes.
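
The two measures can be sketched compactly with NumPy's vectorized `np.hypot`. This is an illustrative variant under the same assumption of Euclidean distance in degrees, not a replacement for the cell below; `mean_distances` is a hypothetical helper and the four points are toy values chosen so the result is easy to check by hand.

```python
import numpy as np


def mean_distances(lat, lng, center):
    """Return (mean distance to `center`, mean distance to the points' own mean).

    Distances are Euclidean in whatever units the coordinates use.
    """
    lat = np.asarray(lat, dtype=float)
    lng = np.asarray(lng, dtype=float)
    to_center = np.hypot(lat - center[0], lng - center[1]).mean()
    to_mean = np.hypot(lat - lat.mean(), lng - lng.mean()).mean()
    return float(to_center), float(to_mean)


# Toy points at the corners of a 6 x 8 rectangle centered on the origin.
lat = [3.0, -3.0, 3.0, -3.0]
lng = [4.0, 4.0, -4.0, -4.0]

# The "city center" is placed on one corner, so the two measures differ:
# distances to the corner are 0, 6, 8 and 10; distances to the mean are all 5.
to_center, to_mean = mean_distances(lat, lng, center=(3.0, 4.0))
```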

In [17]:
# these lists will be used later for creating the final report columns
citiesCol=[]
distance1Col=[]
distance2Col=[]

for city in cityList:
    # calculating mean coordinates of the coffee shops
    coffeeShopsMeanCoordinates = [df_venues[city]['lat'].mean(), df_venues[city]['lng'].mean()] 
    #print(city)
    # calculating average distance from coffee shops to the city center coordinates
    averageDistanceToCenter = np.mean(np.apply_along_axis(lambda x: math.hypot(x[0]-cityCoordinates[city][0],x[1]-cityCoordinates[city][1]),1,df_venues[city][['lat','lng']].values))
    #print(averageDistanceToCenter)
    # calculating average distance from coffee shops to the mean coordinates
    averageDistanceToMean = np.mean(np.apply_along_axis(lambda x: math.hypot(x[0]-coffeeShopsMeanCoordinates[0],x[1]-coffeeShopsMeanCoordinates[1]),1,df_venues[city][['lat','lng']].values))
    #print(averageDistanceToMean)
    citiesCol.append(city)
    distance1Col.append(averageDistanceToCenter)
    distance2Col.append(averageDistanceToMean)

df = pd.DataFrame()
df['City'] = citiesCol
df['Average_Proximity_To_The_City_Center'] = distance1Col
df['Average_Distance_To_Mean_Coordinates'] = distance2Col
df['Coffee_Shops_Per_City'] = CoffeeShopsPerCity

# sorting the results by average proximity (the lower, the better)
df = df.sort_values(by=['Average_Proximity_To_The_City_Center'], ascending=True)
df.reset_index(drop = True, inplace = True)

df.head()

| | City | Average_Proximity_To_The_City_Center | Average_Distance_To_Mean_Coordinates | Coffee_Shops_Per_City |
|---|---|---|---|---|
| 0 | New York, NY | 0.032419 | 0.022122 | 220 |
| 1 | Chicago, IL | 0.07216 | 0.053993 | 184 |
| 2 | Houston, TX | 0.115663 | 0.106295 | 156 |
| 3 | Phoenix, AZ | 0.128295 | 0.117186 | 161 |
| 4 | Los Angeles, CA | 0.136265 | 0.099407 | 203 |
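
The distances in this table are in coordinate degrees, and a degree of longitude covers less ground the farther a city is from the equator. If distances in kilometers were preferred, one option (an alternative to the calculation above, not the method used for these results) would be the haversine great-circle formula. The sketch below uses only the standard library; `haversine_km` and `mean_distance_km` are hypothetical helper names, and the Earth radius is the usual mean-radius approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mean_distance_km(venues, center):
    """Mean haversine distance from (lat, lng) pairs to a (lat, lng) center."""
    total = sum(haversine_km(lat, lng, center[0], center[1]) for lat, lng in venues)
    return total / len(venues)


# One degree of latitude versus one degree of longitude at 60 degrees north.
one_degree_lat = haversine_km(0.0, 0.0, 1.0, 0.0)
one_degree_lng_at_60 = haversine_km(60.0, 0.0, 60.0, 1.0)
```

Applied to this notebook, `mean_distance_km(zip(df_venues[city]['lat'], df_venues[city]['lng']), cityCoordinates[city])` would give the first measure in kilometers for one city.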


## Conclusion

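New York and Chicago have the highest density of coffee shops: on both measures they show the smallest average distances (0.032 and 0.072 degrees to the city center, respectively). By the criterion set out in the Data Description, New York is therefore the best of the five cities for coffee lovers.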