# __Data Science Capstone Notebook__
### _Daniel Svidenko_

## Introduction/Business Problem

Hypothetically, the owner of a successful chain business within Toronto is deciding whether to expand and open new locations in downtown New York or Paris. Logically, opening locations in a place with similar venues is more likely to replicate the local success of the business in Toronto.

Problem: Are the venues in downtown New York or Paris more similar to the venues of downtown Toronto?

## Data Selection

The data used to solve this problem is the Foursquare venue location data for downtown Toronto, Paris, and New York. The coordinates of the downtown area of each city will be retrieved from Google Street Maps. The valuable features that can be retrieved from the Foursquare dataset are the categories of the top 100 venues in the downtown of each city.

Imports and installations:

In [167]:

import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes

# transforming JSON data into a pandas dataframe
from pandas import json_normalize


Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



Storing coordinates of Toronto, New York and Paris:

In [89]:
torontoloc = [43.6548, -79.3883]
nyloc = [40.7128, -74.0060]
parisloc = [48.864716, 2.349014]
londonloc=[51.5074,-0.1278]
locs = [parisloc, nyloc, torontoloc]
names = ['Paris', 'New York', 'Toronto']
latitudes = [locs[0][0],locs[1][0],locs[2][0]]
longitudes = [locs[0][1],locs[1][1],locs[2][1]]

Setting up Foursquare credentials:

In [118]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # placeholder: supply your own Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # placeholder: supply your own Foursquare secret
VERSION = '20180605'
LIMIT=100

Defining a function to retrieve nearby venues:

In [119]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    """Return a dataframe of up to LIMIT venues within `radius` metres of each location."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # build the Foursquare "explore" request for this location
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)

        # keep only the list of venue items from the response
        results = requests.get(url).json()['response']['groups'][0]['items']
        # one tuple per venue: city info, venue name, venue coordinates, first listed category
        venues_list.append([(name, lat, lng, v['venue']['name'], v['venue']['location']['lat'], v['venue']['location']['lng'], v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-city lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 'Latitude', 'Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']

    return nearby_venues

In [120]:
venues_df=getNearbyVenues(names,latitudes,longitudes)

Paris
New York
Toronto


In [122]:
venues_df.shape
venues_df.head()

| | City | Latitude | Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Paris | 48.864716 | 2.349014 | Passage du Grand Cerf | 48.86476 | 2.349486 | Pedestrian Plaza |
| 1 | Paris | 48.864716 | 2.349014 | Redd | 48.866237 | 2.347772 | Wine Bar |
| 2 | Paris | 48.864716 | 2.349014 | Spa Nuxe | 48.864017 | 2.34665 | Spa |
| 3 | Paris | 48.864716 | 2.349014 | Ma Cave Fleury | 48.865505 | 2.350544 | Wine Bar |
| 4 | Paris | 48.864716 | 2.349014 | Raviolis Chinois Nord-Est | 48.862851 | 2.349547 | Chinese Restaurant |


## Methodology
Now that the necessary data has been collected, it is time to process it and analyze it.

I will begin by grouping the data by city using .count(), as well as determining the number of unique venue categories, to get a better idea of the data I will be working with.

Next, I will one-hot encode the data using the pandas get_dummies() function and group the resulting dataframe by city using .mean(). This yields a dataframe with one row per city and one column for each unique venue category, where each cell holds the mean of that category's indicator column, i.e. the fraction of the city's venues that fall in that category.
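
As a minimal sketch of the encoding-and-averaging step described above, the following uses a small made-up dataframe (the cities "A" and "B", their venue categories, and the names `toy`, `toy_onehot`, and `toy_profile` are hypothetical and are not part of the Foursquare data). It shows why the per-city mean of the indicator columns can be read as the share of venues in each category.

```python
import pandas as pd

# Hypothetical toy data: two cities with a handful of venues each.
toy = pd.DataFrame({
    "City": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Bar", "Cafe", "Bar"],
})

# One 0/1 indicator column per category.
toy_onehot = pd.get_dummies(toy["Venue Category"], dtype=int)
toy_onehot.insert(0, "City", toy["City"])

# The mean of each indicator column per city is the share of that
# city's venues belonging to the category, so each row sums to 1.
toy_profile = toy_onehot.groupby("City").mean()
print(toy_profile)
```

Because every venue contributes exactly one 1 across the indicator columns, each city's row sums to 1, which makes rows comparable even if the cities had different venue counts.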

Finally, I will convert the dataframe into a NumPy array and use the SciPy pdist() function to calculate the Euclidean distance between the Toronto row and each of the New York and Paris rows. A larger distance means a more dissimilar venue composition. I will present a table with the dissimilarity of New York and of Paris to Toronto.
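
The distance step can be sketched on made-up numbers as follows. The `profiles` array below is hypothetical (three invented category-share vectors, not the real city data), and computing all pairs at once with `squareform` is just one possible way to organise the calculation. The Euclidean distance between two share vectors p and q is sqrt(sum_i (p_i - q_i)^2), and the sketch checks that `pdist` agrees with that definition.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical category-share vectors (each row sums to 1).
profiles = np.array([
    [0.50, 0.25, 0.25],   # city X
    [0.25, 0.50, 0.25],   # city Y
    [0.00, 0.50, 0.50],   # reference city
])

# pdist returns the condensed distances for the pairs (0,1), (0,2), (1,2);
# squareform expands them into a symmetric 3x3 matrix.
condensed = pdist(profiles, metric="euclidean")
dist_matrix = squareform(condensed)

# Distances from city X and city Y to the reference city (row 2).
to_reference = dist_matrix[2, :2]

# The same distances computed directly from the definition.
manual = np.sqrt(((profiles[:2] - profiles[2]) ** 2).sum(axis=1))
print(to_reference, manual)
```

In this toy case city Y has the smaller distance to the reference row, so it would be reported as the more similar of the two.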

In [169]:
venues_df.groupby('City').count()

| City | Latitude | Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| New York | 100 | 100 | 100 | 100 | 100 | 100 |
| Paris | 100 | 100 | 100 | 100 | 100 | 100 |
| Toronto | 100 | 100 | 100 | 100 | 100 | 100 |


In [124]:
print('There are {} unique categories.'.format(len(venues_df['Venue Category'].unique())))

There are 123 unique categories.


One-hot encoding the venue categories:

In [125]:
onehot = pd.get_dummies(venues_df[['Venue Category']], prefix="", prefix_sep="")
onehot['City'] = venues_df['City'] 
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]
print(onehot.shape)
onehot.head()

(300, 124)


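| | City | American Restaurant | Art Gallery | Art Museum | Arts & Crafts Store | Asian Restaurant | Auditorium | Australian Restaurant | Bagel Shop | Bakery | ... | Thai Restaurant | Theater | Toy / Game Store | University | Vegetarian / Vegan Restaurant | Vietnamese Restaurant | Wine Bar | Wine Shop | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Paris | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Paris | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | Paris | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Paris | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | Paris | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |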


Grouping the one-hot dataframe by city, taking the mean occurrence of each category:

In [164]:
venues_grouped=onehot.groupby('City').mean().reset_index()

In [165]:
venues_grouped.head()

| | City | American Restaurant | Art Gallery | Art Museum | Arts & Crafts Store | Asian Restaurant | Auditorium | Australian Restaurant | Bagel Shop | Bakery | ... | Thai Restaurant | Theater | Toy / Game Store | University | Vegetarian / Vegan Restaurant | Vietnamese Restaurant | Wine Bar | Wine Shop | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | New York | 0.02 | 0.0 | 0.0 | 0.0 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.0 | 0.0 | 0.03 | 0.01 | 0.02 |
| 1 | Paris | 0.0 | 0.02 | 0.01 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.08 | ... | 0.01 | 0.0 | 0.01 | 0.0 | 0.01 | 0.01 | 0.03 | 0.0 | 0.01 | 0.0 |
| 2 | Toronto | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 | ... | 0.02 | 0.03 | 0.0 | 0.01 | 0.02 | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 |


Turning the dataframe into a numpy array to prepare it for analysis:

In [133]:
vg_np=venues_grouped.drop('City',axis=1).to_numpy()
vg_np

array([[0.02, 0.  , 0.  , 0.  , 0.01, 0.01, 0.01, 0.01, 0.01, 0.  , 0.  ,
        0.01, 0.  , 0.  , 0.01, 0.  , 0.01, 0.02, 0.  , 0.04, 0.01, 0.  ,
        0.03, 0.09, 0.  , 0.01, 0.  , 0.  , 0.02, 0.01, 0.  , 0.  , 0.01,
        0.  , 0.01, 0.03, 0.01, 0.  , 0.02, 0.01, 0.  , 0.  , 0.  , 0.01,
        0.01, 0.  , 0.01, 0.02, 0.03, 0.01, 0.  , 0.03, 0.01, 0.02, 0.  ,
        0.01, 0.  , 0.  , 0.  , 0.01, 0.01, 0.  , 0.01, 0.  , 0.  , 0.  ,
        0.03, 0.  , 0.  , 0.  , 0.  , 0.01, 0.01, 0.  , 0.  , 0.  , 0.  ,
        0.02, 0.  , 0.02, 0.  , 0.01, 0.01, 0.03, 0.  , 0.  , 0.  , 0.02,
        0.02, 0.01, 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.02, 0.01, 0.  ,
        0.01, 0.  , 0.01, 0.  , 0.  , 0.  , 0.02, 0.01, 0.01, 0.02, 0.  ,
        0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.  , 0.03,
        0.01, 0.02],
       [0.  , 0.02, 0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.08, 0.04, 0.01,
        0.01, 0.01, 0.  , 0.  , 0.01, 0.  , 0.  , 0.  , 0.  , 0.04, 0.  ,
        0.03, 0.0

In [135]:
from scipy.spatial.distance import pdist

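Calculating the Euclidean distance from Toronto's venue profile to those of New York and Paris to determine which city is more similar to Toronto. After grouping, the rows are in alphabetical order: New York (index 0), Paris (index 1), and Toronto (index 2):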

In [158]:
tor_ny_distance=pdist([vg_np[2],vg_np[0]],metric='euclidean')[0]
tor_paris_distance=pdist([vg_np[2],vg_np[1]],metric='euclidean')[0]

In [159]:
print('Dissimilarity to New York:',tor_ny_distance,'Dissimilarity to Paris:',tor_paris_distance)

Dissimilarity to New York: 0.1341640786499874 Dissimilarity to Paris: 0.19798989873223324


In [163]:
# build the results table directly (DataFrame.append was removed in pandas 2.0)
results = pd.DataFrame({
    'City': ['New York', 'Paris'],
    'Dissimilarity From Toronto': [tor_ny_distance, tor_paris_distance],
})
results

| | City | Dissimilarity From Toronto |
|---|---|---|
| 0 | New York | 0.134164 |
| 1 | Paris | 0.19799 |


New York's venue composition is closer to Toronto's than Paris's is (a distance of about 0.134 versus 0.198), so by this measure New York is the more similar city.

## Results/Discussion

Using one-hot encoding, I built a venue-composition profile for Toronto, New York, and Paris from the top 100 venues retrieved from Foursquare for each city. Using Euclidean distance, I determined that New York is more similar to Toronto in terms of venue composition (a distance of about 0.134, versus 0.198 for Paris). This may mean that a successful business chain in Toronto has a greater chance of replicating its local success in New York than in Paris. However, this approach is very basic and does not take into account many factors, such as the category of the business that is looking to expand, or the possibility that the dissimilarity could make the business even more successful in Paris than in Toronto. The result of this analysis should therefore only loosely guide a business decision, in tandem with other field knowledge and analysis.

## Conclusion

The purpose of this project was to determine whether a successful Toronto business should open locations in New York or in Paris, based on how similar the venues in those cities are to those of Toronto. Because this model potentially overlooks many factors and rests on the potentially faulty assumption that more similar venues are better, the final decision on where to expand should be made using much more detailed knowledge and analysis than this report provides. The similarity of venues should be only one factor taken into account when making that decision.