# Capstone Project - The Paris Job

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)




## Introduction: Business Problem <a name="introduction"></a>

In this project we explore starting a new business in Paris. Suppose we want to open a shop but have not yet decided what kind. To help us decide, we will examine venues across Paris to find the most common venue category, and then plan to open a shop of that type.
Second, we will try to find an optimal location for this shop. Specifically, this report is targeted at stakeholders interested in opening a **new shop** in **Paris**, France.

Since there are lots of shops in Paris, we will try to detect **areas with no shops of this type in the vicinity**. We would also prefer locations **as close to the city center as possible**, assuming that the first condition is met.

We will use our data science powers to generate a few of the most promising neighborhoods based on these criteria. The advantages of each area will then be clearly expressed so that the best possible final location can be chosen by stakeholders.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* number of and distance to shops in the neighborhood, if any
* distance of neighborhood from city center

We decided to use a regularly spaced grid of locations, centered on the city center, to define our neighborhoods.

The following data sources will be needed to extract/generate the required information:
* centers of candidate areas will be generated algorithmically and given simple labels (Paris0, Paris1, and so on)
* the number of venues in every neighborhood, with their type and location, will be obtained using the **Foursquare API**
* the coordinates of the Paris city center will be obtained using the **geopy Nominatim** geocoder

### Neighborhood Candidates

Let's create latitude and longitude coordinates for the centroids of our candidate neighborhoods. We will create a grid of cells covering our area of interest, which is approximately 12x12 kilometers centered on the Paris city center.


In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe


In [8]:
address = 'Paris, France'

geolocator = Nominatim(user_agent="paris_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Paris are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Paris are 48.8566969, 2.3514616.


In [3]:
map_paris = folium.Map(location =[latitude,longitude], zoom_start = 11)
map_paris

In [2]:
latitude = 48.8566969
longitude = 2.3514616
paris_center = [latitude , longitude]
#paris_center = [48.8566969, 2.3514616]

Now let's create a grid of area candidates, equally spaced, centered on the city center and within ~6km of the Paris center. Our neighborhoods will be defined as circular areas with a radius of 400 meters, so our neighborhood centers will be 800 meters apart.

To calculate distances accurately, we need to create our grid of locations in a 2D Cartesian coordinate system, which allows us to calculate distances in meters (not in latitude/longitude degrees). Then we'll project those coordinates back to latitude/longitude degrees to be shown on a Folium map. So let's create functions to convert between the WGS84 spherical coordinate system (latitude/longitude degrees) and the UTM Cartesian coordinate system (X/Y coordinates in meters).

In [4]:
import shapely.geometry

#!pip install pyproj
import pyproj

import math

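# Projections are built once and reused. UTM zone 33 is kept so that the outputs
# below stay reproducible; note that Paris itself lies in UTM zone 31.
_proj_latlon = pyproj.Proj(proj='latlong', datum='WGS84')
_proj_xy = pyproj.Proj(proj='utm', zone=33, datum='WGS84')
# pyproj.transform is deprecated in recent pyproj releases; Transformer replaces it
_to_xy = pyproj.Transformer.from_proj(_proj_latlon, _proj_xy, always_xy=True)
_to_lonlat = pyproj.Transformer.from_proj(_proj_xy, _proj_latlon, always_xy=True)

def lonlat_to_xy(lon, lat):
    """Convert WGS84 longitude/latitude (degrees) to UTM X/Y (meters)."""
    x, y = _to_xy.transform(lon, lat)
    return x, y

def xy_to_lonlat(x, y):
    """Convert UTM X/Y (meters) back to WGS84 longitude/latitude (degrees)."""
    lon, lat = _to_lonlat.transform(x, y)
    return lon, lat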

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('Paris center longitude={}, latitude={}'.format(paris_center[1], paris_center[0]))
x, y = lonlat_to_xy(paris_center[1], paris_center[0])
print('Paris center UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('Paris center longitude={}, latitude={}'.format(lo, la))

Coordinate transformation check
-------------------------------
Paris center longitude=2.3514616, latitude=48.8566969
Paris center UTM X=-426736.18866927875, Y=5489149.384570399
Paris center longitude=2.3514616000000026, latitude=48.856696899999996


Let's create a **hexagonal grid of cells**: we offset every other row and adjust the vertical row spacing so that **every cell center is equally distant from all its neighbors**.

In [5]:
paris_center_x, paris_center_y = lonlat_to_xy(paris_center[1], paris_center[0]) # City center in Cartesian coordinates

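k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
half_size = 6000 # Half the side of the area of interest, in meters (named to avoid shadowing the built-in min)
x_min = paris_center_x - half_size
x_step = 800
y_min = paris_center_y - half_size - (int(21/k)*k*800 - 12000)/2
y_step = 800 * k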

latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = (x_step/2) if i%2==0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(paris_center_x, paris_center_y, x, y)
        if (distance_from_center <= 6001):
            lon, lat = xy_to_lonlat(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(latitudes), 'candidate neighborhood centers generated.')

199 candidate neighborhood centers generated.
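The equal-spacing property of the hexagonal grid can be checked with a little arithmetic. The sketch below is only an illustration: the helper name `hex_neighbor_distances` is hypothetical and not part of the notebook. It compares the distance to a same-row neighbor with the distance to a neighbor in the adjacent, half-step-shifted row.

```python
import math

def hex_neighbor_distances(step):
    """Return distances to a same-row neighbor and to an adjacent-row neighbor."""
    k = math.sqrt(3) / 2  # same row-spacing factor as in the grid code
    same_row = step       # neighbor one column over in the same row
    # Adjacent rows are shifted by step/2 horizontally and step*k vertically
    adjacent_row = math.hypot(step / 2, step * k)
    return same_row, adjacent_row

same_row, adjacent_row = hex_neighbor_distances(800)
print(same_row, round(adjacent_row, 6))
```

Both distances come out equal to the grid step (up to floating-point rounding), which is why the row spacing is scaled by `sqrt(3)/2` rather than left at the full step.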


Let's visualize the data we have so far, namely the city center location and the candidate neighborhood centers:

In [6]:
map_paris_1 = folium.Map(location=paris_center, zoom_start=13)
folium.Marker(paris_center, popup='Paris Center').add_to(map_paris_1)
for lat, lon in zip(latitudes, longitudes):
    folium.Circle([lat, lon], radius=(x_step/2), color='blue', fill=False).add_to(map_paris_1)
map_paris_1

OK, we now have the coordinates of the centers of the neighborhoods/areas to be evaluated, equally spaced (the distance from every point to its neighbors is exactly the same) and within ~6km of the Paris center.

For simplicity, we will name these locations Paris0 to Paris198.

In [7]:
# Build the dataframe in one step; labels run from Paris0 upward
paris_data = pd.DataFrame({
    'Neighborhood': ['Paris' + str(i) for i in range(len(latitudes))],
    'Latitude': latitudes,
    'Longitude': longitudes,
})

In [8]:
LIMIT = 100 # maximum number of venues requested per Foursquare call

In [9]:
paris_data.head()

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Paris0 | 48.805676 | 2.342644 |
| 1 | Paris1 | 48.806862 | 2.353274 |
| 2 | Paris2 | 48.808047 | 2.363905 |
| 3 | Paris3 | 48.809231 | 2.374537 |
| 4 | Paris4 | 48.810415 | 2.38517 |


Looking good. Let's now place all this into a Pandas dataframe.

In [10]:
import pandas as pd

df_locations = pd.DataFrame({'Address': paris_data['Neighborhood'],
                             'Latitude': latitudes,
                             'Longitude': longitudes,
                             'X': xs,
                             'Y': ys,
                             'Distance from center': distances_from_center})

df_locations.head(10)

| | Address | Latitude | Longitude | X | Y | Distance from center |
|---|---|---|---|---|---|---|
| 0 | Paris0 | 48.805676 | 2.342644 | -428336.188669 | 5483607.0 | 5768.882041 |
| 1 | Paris1 | 48.806862 | 2.353274 | -427536.188669 | 5483607.0 | 5600.0 |
| 2 | Paris2 | 48.808047 | 2.363905 | -426736.188669 | 5483607.0 | 5542.562584 |
| 3 | Paris3 | 48.809231 | 2.374537 | -425936.188669 | 5483607.0 | 5600.0 |
| 4 | Paris4 | 48.810415 | 2.38517 | -425136.188669 | 5483607.0 | 5768.882041 |
| 5 | Paris5 | 48.809975 | 2.325141 | -429536.188669 | 5484300.0 | 5600.0 |
| 6 | Paris6 | 48.811163 | 2.335772 | -428736.188669 | 5484300.0 | 5245.950819 |
| 7 | Paris7 | 48.81235 | 2.346403 | -427936.188669 | 5484300.0 | 4995.998399 |
| 8 | Paris8 | 48.813536 | 2.357035 | -427136.188669 | 5484300.0 | 4866.210024 |
| 9 | Paris9 | 48.814721 | 2.367668 | -426336.188669 | 5484300.0 | 4866.210024 |
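As a rough cross-check that does not depend on the map projection, the great-circle distance between two adjacent candidate centers (Paris0 and Paris1 from the table above) can be computed with the haversine formula. This is a sketch under stated assumptions: the helper `haversine_m` is hypothetical, and the Earth is treated as a sphere with a mean radius of 6,371 km.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius (spherical approximation, an assumption)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Coordinates copied from the first two rows of the candidate table
paris0 = (48.805676, 2.342644)
paris1 = (48.806862, 2.353274)
d = haversine_m(*paris0, *paris1)
print('Paris0 to Paris1: {:.0f} m'.format(d))
```

The result comes out a little under the 800 m grid step, by roughly 1%. Most of that gap is expected: the grid is built in UTM zone 33, whose central meridian is far from Paris (which lies in zone 31), so projected distances are slightly stretched relative to distances on the ground.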


### Foursquare
Now that we have our location candidates, let's use the Foursquare API to get info on venues in each neighborhood.


Foursquare credentials are defined in the cell below.

In [27]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
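One way to keep credentials out of a shared notebook is to read them from environment variables. The following is a minimal sketch, not the notebook's own method: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        # Report which variable is absent without echoing any secret value
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("Credentials loaded:", bool(client_id and client_secret))
```

In real use the function would be called without arguments, so that the values come from the shell environment and never appear in the notebook or its output.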


In [28]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Query Foursquare for venues around each neighborhood center and return one dataframe."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # build the API request; requests encodes the query parameters
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT,
        }

        # make the GET request and fail early on HTTP errors
        response = requests.get(url, params=params)
        response.raise_for_status()
        results = response.json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues
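The function above indexes `categories[0]` directly, which would raise an `IndexError` for any venue returned without a category. A more defensive way to flatten one response is sketched below. The helper `extract_venue_rows` is hypothetical, the response shape is assumed from the keys the notebook accesses, and the sample payload is made up for illustration (it is not real Foursquare data).

```python
def extract_venue_rows(name, lat, lng, response_json):
    """Flatten one explore-style response into row tuples, skipping venues without a category."""
    groups = response_json.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        if not categories:
            continue  # no category to report, so skip instead of failing
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0]["name"],
        ))
    return rows

# Made-up payload with one categorized venue and one venue lacking categories
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Bistro",
               "location": {"lat": 48.81, "lng": 2.34},
               "categories": [{"name": "French Restaurant"}]}},
    {"venue": {"name": "Uncategorized Place",
               "location": {"lat": 48.80, "lng": 2.35},
               "categories": []}},
]}]}}

rows = extract_venue_rows("Paris0", 48.805676, 2.342644, sample)
print(rows)
```

Skipping uncategorized venues is a design choice; an alternative would be to keep them with a placeholder category so that venue counts per neighborhood stay complete.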

In [65]:
paris_venues = getNearbyVenues(names=paris_data['Neighborhood'],
                                   latitudes=paris_data['Latitude'],
                                   longitudes=paris_data['Longitude']
                                  )


Paris0
Paris1
Paris2
...
Paris121
Paris122
...

In [11]:
# paris_venues.to_csv('paris_venues_test.csv')
paris_venues = pd.read_csv('paris_venues_test.csv')
print(paris_venues.shape)
paris_venues.head()

(9150, 8)


| | Unnamed: 0 | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Paris0 | 48.805676 | 2.342644 | Skatepark d'Arcueil | 48.806097 | 2.341609 | Skate Park |
| 1 | 1 | Paris0 | 48.805676 | 2.342644 | Centre Sportif François Vincent Raspail | 48.8059 | 2.340608 | Athletics & Sports |
| 2 | 2 | Paris0 | 48.805676 | 2.342644 | Parc Départemental du Coteau de la Bièvre | 48.806781 | 2.342727 | Park |
| 3 | 3 | Paris0 | 48.805676 | 2.342644 | Tourte & Petitin | 48.803658 | 2.33844 | Photography Studio |
| 4 | 4 | Paris0 | 48.805676 | 2.342644 | Marin Beaux-Arts | 48.801768 | 2.342825 | Arts & Crafts Store |


In [12]:
print('There are {} unique categories.'.format(len(paris_venues['Venue Category'].unique())))

There are 355 unique categories.


We will now check which venue category is the most common.

In [13]:
print(paris_venues['Venue Category'].value_counts().idxmax())
paris_venues['Venue Category'].value_counts().max()

French Restaurant


1208
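The same "most common category" lookup can be illustrated without pandas, using `collections.Counter`. This is only a sketch: the helper `most_common_category` is hypothetical and the sample list is made up, so the counts do not reflect the Paris data.

```python
from collections import Counter

def most_common_category(categories):
    """Return (category, count) for the most frequent entry in a list of category names."""
    counts = Counter(categories)
    return counts.most_common(1)[0]

# Made-up sample standing in for the 'Venue Category' column
sample = ["Park", "French Restaurant", "Hotel", "French Restaurant", "Bakery"]
category, count = most_common_category(sample)
print(category, count)
```

`most_common(n)` with a larger `n` gives a ranked shortlist, which can be useful when the top category is not a practical choice for a new shop.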

In [14]:
french_restaurant = paris_venues[paris_venues['Venue Category'].str.contains("French Restaurant")]
french_restaurant

| | Unnamed: 0 | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 5 | 5 | Paris0 | 48.805676 | 2.342644 | Le Patio Brasserie | 48.810096 | 2.342963 | French Restaurant |
| 32 | 32 | Paris2 | 48.808047 | 2.363905 | Hippopotamus | 48.812053 | 2.362346 | French Restaurant |
| 83 | 83 | Paris7 | 48.812350 | 2.346403 | Aux Foudres de Bacchus | 48.814739 | 2.351580 | French Restaurant |
| 85 | 85 | Paris7 | 48.812350 | 2.346403 | La Terrasse du Marché | 48.815978 | 2.350205 | French Restaurant |
| 104 | 104 | Paris8 | 48.813536 | 2.357035 | La Place Rouge | 48.812639 | 2.361797 | French Restaurant |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9110 | 9110 | Paris196 | 48.905344 | 2.338985 | La Péricole | 48.903057 | 2.336493 | French Restaurant |
| 9117 | 9117 | Paris196 | 48.905344 | 2.338985 | Chez Louisette | 48.903007 | 2.343813 | French Restaurant |
| 9123 | 9123 | Paris196 | 48.905344 | 2.338985 | Café de l'Avenir | 48.906125 | 2.332298 | French Restaurant |
| 9130 | 9130 | Paris197 | 48.906532 | 2.349636 | Le Comptoir | 48.907469 | 2.344114 | French Restaurant |


In [46]:
xr = []
yr = []

# Project every restaurant's longitude/latitude to UTM X/Y (meters)
for latr, lonr in zip(french_restaurant['Venue Latitude'], french_restaurant['Venue Longitude']):
    x, y = lonlat_to_xy(lonr, latr)
    xr.append(x)
    yr.append(y)

In [54]:
french_restaurant = french_restaurant.copy() # work on an explicit copy to avoid SettingWithCopyWarning
french_restaurant['Y'] = yr
french_restaurant['X'] = xr


In [19]:
french_restaurant = pd.read_csv('french_restaurant.csv',index_col=0)
#french_restaurant.to_csv('french_restaurant.csv')
french_restaurant

| | Unnamed: 0.1 | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category | Y | X |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 5 | Paris0 | 48.805676 | 2.342644 | Le Patio Brasserie | 48.810096 | 2.342963 | French Restaurant | 5.484093e+06 | -428230.051519 |
| 32 | 32 | Paris2 | 48.808047 | 2.363905 | Hippopotamus | 48.812053 | 2.362346 | French Restaurant | 5.484070e+06 | -426775.408592 |
| 83 | 83 | Paris7 | 48.812350 | 2.346403 | Aux Foudres de Bacchus | 48.814739 | 2.351580 | French Restaurant | 5.484500e+06 | -427512.771559 |
| 85 | 85 | Paris7 | 48.812350 | 2.346403 | La Terrasse du Marché | 48.815978 | 2.350205 | French Restaurant | 5.484655e+06 | -427590.159990 |
| 104 | 104 | Paris8 | 48.813536 | 2.357035 | La Place Rouge | 48.812639 | 2.361797 | French Restaurant | 5.484142e+06 | -426804.603443 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9110 | 9110 | Paris196 | 48.905344 | 2.338985 | La Péricole | 48.903057 | 2.336493 | French Restaurant | 5.494469e+06 | -426960.996641 |
| 9117 | 9117 | Paris196 | 48.905344 | 2.338985 | Chez Louisette | 48.903007 | 2.343813 | French Restaurant | 5.494373e+06 | -426427.474789 |
| 9123 | 9123 | Paris196 | 48.905344 | 2.338985 | Café de l'Avenir | 48.906125 | 2.332298 | French Restaurant | 5.494861e+06 | -427209.747255 |
| 9130 | 9130 | Paris197 | 48.906532 | 2.349636 | Le Comptoir | 48.907469 | 2.344114 | French Restaurant | 5.494864e+06 | -426321.804774 |


In [20]:
print('Total number of French restaurants:', len(french_restaurant))

Total number of French restaurants: 1213


Let's now see all the collected restaurants in our area of interest on a map.

In [21]:
map_paris_l = folium.Map(location=paris_center, zoom_start=13)
folium.Marker(paris_center, popup='Paris center').add_to(map_paris_l)
for lat, lon in zip(latitudes, longitudes):
    folium.Circle([lat, lon], radius=(x_step/2), color='blue', fill=False).add_to(map_paris_l)
for index, row in french_restaurant.iterrows():
    # look columns up by name rather than by position
    lat = row['Venue Latitude']
    lon = row['Venue Longitude']
    color = 'red'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_paris_l)
map_paris_l

So now we have all the French restaurants within a few kilometers of the Paris center. We also know exactly which restaurants are in the vicinity of every neighborhood candidate center.
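As a rough illustration of how the projected X/Y coordinates can be used, the sketch below counts how many restaurants fall within a given radius of one candidate center. The helper `count_within_radius` is hypothetical, and the coordinates are made-up values in meters rather than rows from the dataframes above.

```python
import math

def count_within_radius(center, points, radius):
    """Count the points lying within `radius` meters of `center` in a Cartesian plane."""
    cx, cy = center
    return sum(1 for x, y in points if math.hypot(x - cx, y - cy) <= radius)

# Made-up Cartesian coordinates in meters, for illustration only
center = (0.0, 0.0)
restaurants = [(100.0, 0.0), (300.0, 300.0), (1000.0, 0.0)]
nearby = count_within_radius(center, restaurants, radius=500)
print(nearby, 'restaurants within 500 m')
```

Working in projected meters keeps the distance test a plain Euclidean one, which is the reason the X and Y columns were added next to latitude and longitude.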

This concludes the data gathering phase - we're now ready to use this data for analysis to produce the report on optimal locations for a new French restaurant!