# Part I - Explanation of the problem and why it is interesting

## 1 - Problem

Indication: Clearly define a problem or an idea of your choice, where you would need to leverage the Foursquare location data to solve or execute. Remember that data science problems always target an audience and are meant to help a group of stakeholders solve a problem, so make sure that you explicitly describe your audience and why they would care about your problem.

Israel is a very specific place in the Arab world, with very specific needs and demands. Since its creation on 14 May 1948, following the UN vote, it has been largely influenced by the international community and, specifically, by Western culture.
Nevertheless, Israel remains part of the Arab world and is affected by its culture.

This combination of cultures makes Israeli culture and consumption behaviour extremely difficult to understand.

Several international companies would like to understand this context better in order to establish a strategic plan for entering the market.

#### Nota bene: there is nothing political about this notebook; its purpose is purely scientific.

## 2 - Data

Indication: Describe the data that you will be using to solve the problem or execute your idea. Remember that you will need to use the Foursquare location data to solve the problem or execute your idea. You can absolutely use other datasets in combination with the Foursquare location data. So make sure that you provide adequate explanation and discussion, with examples, of the data that you will be using, even if it is only Foursquare location data.

We would like to benchmark the Israeli city of Tel Aviv against two well-known cities to characterise the consumption behaviour of its population.

For the benchmark, we have selected New York City and Amman. We will use Foursquare location data to compare the venues of the three cities in terms of shops, cafés, restaurants and so on. By establishing a venue profile for each city from Foursquare, we should be able to characterise Tel Aviv and then compare it to Amman and New York City. We will then investigate the differences and conclude by identifying the best proxy.

The number of venues will be divided by the population of each city so that the comparison remains consistent.

If neither of the two cities is close enough, we will iterate with other cities among the 50 largest cities.

# Part II - Method to solve it

## 0 - Import of libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


In [2]:
CLIENT_ID = 'DCPMWEGCH0C2PZR13QN0IB0KEJSDLZMIPARV415TX5Y52FQ1' # your Foursquare ID
CLIENT_SECRET = 'W3BY4ARSZPO4YLRLD4BHCEWHJKLZW5UJYHEHEVDPKVXGRZXX' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: DCPMWEGCH0C2PZR13QN0IB0KEJSDLZMIPARV415TX5Y52FQ1
CLIENT_SECRET: W3BY4ARSZPO4YLRLD4BHCEWHJKLZW5UJYHEHEVDPKVXGRZXX


In [4]:
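# function that extracts the name of the first category of a venue
def get_category_type(row):
    # the category list is stored under 'categories' or 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue without any category yields None
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']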
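
As a quick check of the helper above, the sketch below applies the same logic to two hand-written rows. The rows are made up for illustration and use plain dictionaries instead of dataframe rows; the function is re-declared (catching `KeyError` explicitly) so that the snippet runs on its own.

```python
# Re-declaration of the helper so this snippet is self-contained
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


# Hypothetical rows, shaped like the venue records used in this notebook
row_with_category = {'venue.categories': [{'name': 'Café'}, {'name': 'Bakery'}]}
row_without_category = {'categories': []}

print(get_category_type(row_with_category))     # Café
print(get_category_type(row_without_category))  # None
```

Only the first category of each venue is kept, so a venue tagged with several categories is counted once.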

## 1 - Exploration: Amman

In [7]:
address = 'Amman'

geolocator = Nominatim(user_agent="my access")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Amman are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Amman are 31.9515694, 35.9239625.


In [8]:
LIMIT = 1000 # limit of number of venues returned by Foursquare API

radius = 10000 # define radius in meters

# create URL

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
results = requests.get(url).json()
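
The query string above is assembled with `str.format`. A hedged alternative is to let the standard library encode the parameters, which avoids a malformed URL if a value contains special characters. The function name `build_explore_url` and the placeholder credentials are assumptions made for illustration; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Return the explore URL with properly encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma of the "ll" value unescaped
    return EXPLORE_ENDPOINT + '?' + urlencode(params, safe=',')


# Placeholder credentials, not real ones
example_url = build_explore_url('MY_ID', 'MY_SECRET', '20180605',
                                31.9515694, 35.9239625, 10000, 1000)
print(example_url)
```

The resulting string can be passed to `requests.get` exactly as in the cell above.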

In [9]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Fakhr Al Din (مطعم فخر الدين) | Middle Eastern Restaurant | 31.952205 | 35.920381 |
| 1 | Jo Bedu (جوبدو) | Clothing Store | 31.956443 | 35.926776 |
| 2 | Mijana (ميجنا) | Café | 31.950485 | 35.92552 |
| 3 | Rumi Cafe (مقهى رومي) | Café | 31.956113 | 35.925881 |
| 4 | Falafel Al-Quds (فلافل القدس) | Falafel Restaurant | 31.949464 | 35.926499 |


In [10]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.


In [11]:
venues_grouped = nearby_venues.drop(['lat', 'lng'], axis=1).groupby('categories').count().sort_values(['name'], ascending=[False]).reset_index()
venues_grouped.rename(columns={'name':'count'}, inplace=True)
venues_grouped.head(10)

| | categories | count |
|---|---|---|
| 0 | Café | 12 |
| 1 | Hotel | 8 |
| 2 | Coffee Shop | 6 |
| 3 | Middle Eastern Restaurant | 5 |
| 4 | Italian Restaurant | 5 |
| 5 | Ice Cream Shop | 4 |
| 6 | Dessert Shop | 4 |
| 7 | Pub | 3 |
| 8 | Historic Site | 3 |
| 9 | Bakery | 3 |
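
The ranking above is a frequency count of the `categories` column. As a cross-check that does not depend on pandas, the same kind of count can be sketched with `collections.Counter`. The category list below is a small made-up sample, not the Amman data.

```python
from collections import Counter

# Hypothetical sample of venue categories (illustration only)
sample_categories = ['Café', 'Hotel', 'Café', 'Bakery', 'Hotel', 'Café']

category_counts = Counter(sample_categories)

# most_common sorts by decreasing count, like sort_values(ascending=False)
top_two = category_counts.most_common(2)
print(top_two)  # [('Café', 3), ('Hotel', 2)]
```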


## 2 - Deploy the same analysis on other cities

In [5]:
# function that collects the venues around a city and returns its category profile (one row per city)
def collect_venues(address):
    geolocator = Nominatim(user_agent="my access")
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude

    LIMIT = 10000 # limit of number of venues returned by Foursquare API

    radius = 20000 # define radius in meters

    # create URL

    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        latitude, 
        longitude, 
        radius, 
        LIMIT)
    results = requests.get(url).json()

    venues = results['response']['groups'][0]['items']
    
    nearby_venues = json_normalize(venues) # flatten JSON

    # filter columns
    filtered_columns = ['venue.name', 'venue.location.city', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
    nearby_venues = nearby_venues.loc[:, filtered_columns]

    # filter the category for each row
    nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

    # clean columns
    nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
    
    # one hot encoding
    result_onehot = pd.get_dummies(nearby_venues[['categories']], prefix="", prefix_sep="")

    # add the city name as a column
    #result_onehot['Name'] = nearby_venues['name']
    result_onehot['City'] = address

    # move the city column to the first position
    fixed_columns = [result_onehot.columns[-1]] + list(result_onehot.columns[:-1])
    result_onehot = result_onehot[fixed_columns]

    # average the one-hot rows: share of the city's venues in each category
    result_onehot = result_onehot.groupby('City').mean().reset_index()

    #venues_grouped = nearby_venues.drop(['lat', 'lng'], axis=1).groupby('categories').count().reset_index()
    #venues_grouped.rename(columns={'name':'count'}, inplace=True)
    return result_onehot
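
Inside `collect_venues`, one-hot encoding the categories and then taking the mean per city gives, for each category, the share of that city's venues falling in the category. Note that this normalises by the number of venues returned rather than by population. A dependency-free sketch of the same computation, on a made-up list of categories and assuming every venue has a category:

```python
from collections import Counter


def category_shares(categories):
    """Return {category: fraction of venues}, i.e. the mean of a one-hot encoding."""
    total = len(categories)
    counts = Counter(categories)
    return {name: count / total for name, count in counts.items()}


# Hypothetical venue categories for one city (illustration only)
shares = category_shares(['Café', 'Café', 'Hotel', 'Pub'])
print(shares)  # {'Café': 0.5, 'Hotel': 0.25, 'Pub': 0.25}
```

Because the shares of one city sum to 1, profiles of cities with different numbers of returned venues remain comparable.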

In [36]:
# Names of the largest cities to compare (a tuple, passed one by one to collect_venues)
largest_cities = ('Abidjan', 'Abu Dhabi', 'Abuja', 'Accra', 'Addis Ababa', 'Ahmedabad', 'Ahvaz', 'Alexandria', 'Algiers', 'Allahabad', 'Almaty', 'Ankara', 'Auckland', 'Baghdad', 'Baku', 'Bandung', 'Bangalore', 'Bangkok', 'Baoding', 'Barcelona', 'Barranquilla', 'Basra', 'Beijing', 'Belgrade', 'Belo Horizonte', 'Berlin', 'Bhopal', 'Birmingham', 'Bogotá', 'Brasília', 'Brazzaville', 'Brisbane', 'Bucharest', 'Budapest', 'Buenos Aires', 'Bulawayo', 'Busan', 'Cairo', 'Calgary', 'Cali', 'Caloocan', 'Campinas', 'Cape Town', 'Caracas', 'Cartagena', 'Casablanca', 'Cebu City', 'Changchun', 'Changsha', 'Chaozhou', 'Chengdu', 'Chennai', 'Chicago', 'Chittagong', 'Chongqing', 'Cologne', 'Córdoba', 'Curitiba', 'Daegu', 'Daejeon', 'Dakar', 'Dalian', 'Dallas', 'Dar es Salaam', 'Davao City', 'Delhi', 'Dhaka', 'Dongguan', 'Douala', 'Dubai', 'Durban', 'Ekurhuleni', 'Faisalabad', 'Fez', 'Fortaleza', 'Foshan', 'Fukuoka', 'Fuzhou', 'Giza', 'Guadalajara', 'Guangzhou', 'Guatemala City', 'Guayaquil', 'Gujranwala', 'Gwangju', 'Hamburg', 'Hangzhou', 'Hanoi', 'Harare', 'Harbin', 'Havana', 'Hefei', 'Hiroshima', 'Ho Chi Minh City', 'Hong Kong', 'Houston', 'Hyderabad', 'Hyderabad', 'Ibadan', 'Incheon', 'Isfahan', 'Islamabad', 'Istanbul', 'İzmir', 'Jaipur', 'Jakarta', 'Jeddah', 'Jinan', 'Johannesburg', 'Kabul', 'Kampala', 'Kano', 'Kanpur', 'Kaohsiung', 'Karachi', 'Karaj', 'Kathmandu', 'Kawasaki', 'Kazan', 'Kharkiv', 'Khartoum', 'Kinshasa', 'Kobe', 'Kolkata', 'Kuala Lumpur', 'Kyiv', 'Kyoto', 'Lagos', 'Lahore', 'Lanzhou', 'Lima', 'London', 'Los Angeles', 'Luanda', 'Lucknow', 'Lusaka', 'Madrid', 'Makassar', 'Managua', 'Mandalay', 'Manila', 'Maputo', 'Maracaibo', 'Mashhad', 'Medan', 'Medellin', 'Melbourne', 'Mexico City', 'Milan', 'Minsk', 'Monterrey', 'Montevideo', 'Montreal', 'Moscow', 'Multan', 'Mumbai', 'Munich', 'Nagoya', 'Nagpur', 'Nairobi', 'Nanjing', 'New Taipei City', 'New York', 'Ningbo', 'Nizhny Novgorod', 'Novosibirsk', 'Nur-Sultan', 'Omsk', 'Oran', 'Osaka', 'Ouagadougou', 'Palembang', 
'Paris', 'Patna', 'Peshawar', 'Philadelphia', 'Phnom Penh', 'Phoenix', 'Porto Alegre', 'Prague', 'Pune', 'Pyongyang', 'Qingdao', 'Qom', 'Quanzhou', 'Quezon City', 'Quito', 'Rawalpindi', 'Recife', 'Rio de Janeiro', 'Riyadh', 'Rome', 'Rosario', 'Rostov-on-Don', 'Saint Petersburg', 'Saitama', 'Salvador', 'San Antonio', 'San Diego', "Sana'a", 'Santa Cruz de la Sierra', 'Santiago', 'São Paulo', 'Sapporo', 'Semarang', 'Seoul', 'Shanghai', 'Shantou', 'Shenyang', 'Shenzhen', 'Shijiazhuang', 'Shiraz', 'Singapore', 'Sofia', 'Surabaya', 'Surat', 'Suwon', 'Suzhou', 'Sydney', 'Tbilisi', 'Tabriz', 'Taichung', 'Tainan', 'Taipei', 'Tangshan', 'Taoyuan', 'Tashkent', 'Tehran', 'Tel Aviv', 'Tianjin', 'Tijuana', 'Tokyo', 'Toronto', 'Tripoli', 'Tunis', 'Ulsan', 'Vienna', 'Vijayawada', 'Visakhapatnam', 'Warsaw', 'Wenzhou', 'Wuhan', "Xi'an", 'Xiamen', 'Yangon', 'Yaoundé', 'Yekaterinburg', 'Yerevan', 'Yokohama', 'Zhengzhou', 'Zhongshan', 'Zunyi')

In [39]:
city_frames = []

# query each city once and keep its one-row category profile
for city in largest_cities:
    print(city)
    city_frames.append(collect_venues(city))

# stack the profiles; categories absent from a city become NaN, then 0
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
global_df = pd.concat(city_frames, sort=False)
global_df.fillna(0, inplace=True)
global_df.set_index('City', inplace=True)
global_df = global_df.sort_index(axis=1)  # alphabetical category columns
global_df.head()

Abidjan
Abu Dhabi
Abuja
Accra
Addis Ababa
Ahmedabad
Ahvaz
Alexandria
Algiers
Allahabad
Almaty
Ankara
Auckland
Baghdad
Baku
Bandung
Bangalore
Bangkok
Baoding
Barcelona
Barranquilla
Basra
Beijing
Belgrade
Belo Horizonte
Berlin
Bhopal
Birmingham
Bogotá
Brasília


KeyError: "None of [Index(['venue.name', 'venue.location.city', 'venue.categories',\n       'venue.location.lat', 'venue.location.lng'],\n      dtype='object')] are in the [columns]"
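
The `KeyError` above says that none of the expected columns exist, which is what happens when the flattened dataframe is empty, most likely because Foursquare returned no venues for that query. A minimal sketch of a guard, assuming the response layout used in `collect_venues`; the helper name `extract_items` is hypothetical.

```python
def extract_items(results):
    """Return the venue items of an explore response, or [] if there are none."""
    groups = results.get('response', {}).get('groups', [])
    if not groups:
        return []
    return groups[0].get('items', [])


# Hand-written responses mimicking the structure used above (illustration only)
empty_response = {'response': {'groups': [{'items': []}]}}
filled_response = {'response': {'groups': [{'items': [{'venue': {'name': 'X'}}]}]}}

print(len(extract_items(empty_response)))   # 0
print(len(extract_items(filled_response)))  # 1
```

A caller could skip any city for which `extract_items` returns an empty list before selecting columns, so that one empty result does not stop the whole loop.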

In [None]:
# build the comparison table for the three benchmark cities, querying each one once
city_frames = [collect_venues(city) for city in ('Amman', 'New York City', 'Tel Aviv')]

# stack the profiles; categories absent from a city become NaN, then 0
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
global_df = pd.concat(city_frames, sort=False)
global_df.fillna(0, inplace=True)
global_df.set_index('City', inplace=True)
global_df = global_df.sort_index(axis=1)  # alphabetical category columns
global_df
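
To identify which city is the closest proxy for Tel Aviv, the rows of `global_df` can be compared with a similarity measure. The sketch below uses cosine similarity; that choice is an assumption made here, not something fixed earlier in the notebook, and the profiles and city labels are made-up values rather than real results.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two {category: share} profiles."""
    categories = set(profile_a) | set(profile_b)
    dot = sum(profile_a.get(c, 0.0) * profile_b.get(c, 0.0) for c in categories)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    return dot / (norm_a * norm_b)


# Hypothetical category shares (illustration only, not real results)
target = {'Café': 0.5, 'Hotel': 0.5}
candidates = {
    'City A': {'Café': 0.5, 'Hotel': 0.5},
    'City B': {'Pub': 1.0},
}

# the best proxy is the candidate whose profile is most similar to the target
best_proxy = max(candidates, key=lambda city: cosine_similarity(target, candidates[city]))
print(best_proxy)  # City A
```

With the dataframe built above, each profile could be obtained with `global_df.loc[city].to_dict()` before calling `cosine_similarity`.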