<h1>Coursera IBM's Applied Data Science Capstone - Week 4</h1>

In this notebook we describe a problem that can be answered with the Foursquare location data.

Then we collect the data needed to present a solution.

Finally, we visualize the proposed solution and draw a conclusion.

<h1>Choosing a Paris district for the best bakeries</h1>

<h2>Context, Problem and Solution</h2>

<h3>Context</h3>

<i>We are a travel agency in France that promises its clients the best experience. To back up our recommendations, we base our choices on data. This also gives our clients evidence of the quality of our service.</i>

<h3>Problem</h3>

As we all know, France is famous around the world for its bread. Paris was the 3rd most visited city in the world in 2018, according to <a href='https://www.businessinsider.fr/us/most-visited-cities-in-the-world-2018-9'>Business Insider</a>.

We can imagine that many of these tourists are willing to pay to wake up to a French breakfast made with the best French bread. To do so, they need to stay near quality bakeries.

How can we estimate the best places to stay for someone who wants quality bakeries nearby?

<h3>Solution</h3>

We can use the Foursquare API explore call to get the 5 most recommended bakeries for each of the 20 districts of Paris, then make a VENUE_ID call to get the rating of each of these bakeries.<br>By averaging the ratings of the top 5 bakeries in each district, we can highlight the districts with better-rated bakeries and recommend them to our clients.

<h3>The Data</h3>

First we need the districts with their latitudes and longitudes.

To get these coordinates we will scrape the website <a href='https://mapcarta.com'>mapcarta.com</a>, pausing between requests.

This data will then be put into a DataFrame.

After that, we will get the 5 most recommended bakeries (categoryId=4bf58dd8d48988d16a941735) around each district location.

We will then get the rating of each of these bakeries and compute the average rating for each district.

We will finally display a map with one marker per district, colored according to the average rating of its recommended bakeries.

Importing dependencies.

In [101]:
import pandas as pd
pd.set_option('display.max_rows', 20)

import numpy as np
import json
import requests
import time

!pip install geocoder
import geocoder

!pip install folium
import folium

import matplotlib.cm as cm
import matplotlib.colors as colors

!pip install beautifulsoup4
from bs4 import BeautifulSoup

print("Dependencies imported!")

Dependencies imported!


To get the district names, we generate them with a simple for loop that adds the ordinal indicator and the suffix 'arrondissement', the French word for district.

In [102]:
num_districts = 20

indicators = ['st', 'nd', 'rd']

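# build the URL suffix of each district; the first element is a column header
districts = ['Districts']
for i in np.arange(num_districts):
    try:
        # the first three districts take 'st', 'nd' and 'rd'
        districts.append('{}{}_arrondissement'.format(i+1, indicators[i]))
    except IndexError:
        # every other district up to the 20th takes 'th'
        districts.append('{}th_arrondissement'.format(i+1))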
        
print(districts)

['Districts', '1st_arrondissement', '2nd_arrondissement', '3rd_arrondissement', '4th_arrondissement', '5th_arrondissement', '6th_arrondissement', '7th_arrondissement', '8th_arrondissement', '9th_arrondissement', '10th_arrondissement', '11th_arrondissement', '12th_arrondissement', '13th_arrondissement', '14th_arrondissement', '15th_arrondissement', '16th_arrondissement', '17th_arrondissement', '18th_arrondissement', '19th_arrondissement', '20th_arrondissement']
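
The try/except above is enough for numbers 1 to 20, because only the first three take a special suffix. As a hedged sketch of a more general approach, the helper below (the names `ordinal` and `district_slugs` are ours, not part of the notebook) also handles cases such as 11th, 21st or 22nd.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # Numbers ending in 11, 12 or 13 take 'th' even though they end in 1, 2 or 3.
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Same suffixes as the list printed above, without the 'Districts' header.
district_slugs = ['{}_arrondissement'.format(ordinal(i)) for i in range(1, 21)]
print(district_slugs[:3], district_slugs[-1])
```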


To get the latitude and longitude of each district, we scrape the mapcarta.com website, since its URL format corresponds to the district names we just listed.

In [103]:
lat = ['Latitude']
lng = ['Longitude']

for suffix in districts[1:]:
    url = "https://mapcarta.com/Paris/{}".format(suffix)
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    # the coordinates are expected among list items 12 to 14 of the page
    items = soup.find_all('li')
    potential_li = items[12:15]

    for li in potential_li:
        # characters 10 to 16 hold the decimal value that precedes the degree sign
        if li.text.startswith('Latitude'):
            lat.append(float(li.text[10:17].split('°')[0]))
        if li.text.startswith('Longitude'):
            lng.append(float(li.text[10:17].split('°')[0]))

    print('{} done\n'.format(suffix))
    # wait between requests to avoid overloading the site
    time.sleep(3)

1st_arrondissement done

2nd_arrondissement done

3rd_arrondissement done

4th_arrondissement done

5th_arrondissement done

6th_arrondissement done

7th_arrondissement done

8th_arrondissement done

9th_arrondissement done

10th_arrondissement done

11th_arrondissement done

12th_arrondissement done

13th_arrondissement done

14th_arrondissement done

15th_arrondissement done

16th_arrondissement done

17th_arrondissement done

18th_arrondissement done

19th_arrondissement done

20th_arrondissement done
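
The fixed character positions used in the scraping cell depend on the exact layout of the page text. A minimal sketch of a less position-dependent alternative is shown below, assuming the list item text contains a decimal number followed by a degree sign. The helper name `parse_degrees` and the sample strings are ours; the real page text may differ.

```python
import re

def parse_degrees(text):
    """Return the first decimal number followed by a degree sign, or None."""
    match = re.search(r'(-?\d+(?:\.\d+)?)\s*°', text)
    return float(match.group(1)) if match else None

# Sample strings written for this sketch, not copied from the website.
print(parse_degrees('Latitude: 48.8592° N'))
print(parse_degrees('Longitude: 2.3417° E'))
print(parse_degrees('no coordinate here'))
```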



We can now make a DataFrame with each district and its geographical coordinates.

In [104]:
df = pd.DataFrame()
df['District'] = districts[1:]
df['Latitude'] = lat[1:]
df['Longitude'] = lng[1:]
df.head()

Unnamed: 0,District,Latitude,Longitude
0,1st_arrondissement,48.8592,2.3417
1,2nd_arrondissement,48.8655,2.3426
2,3rd_arrondissement,48.8637,2.3615
3,4th_arrondissement,48.8601,2.3507
4,5th_arrondissement,48.8448,2.3471


We define our Foursquare credentials. We will use this API to get a list of recommended venues for each district.

In [116]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print("Credentials defined!")

Credentials defined!
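
Keys written directly in a notebook are easy to leak when the notebook is shared. A minimal sketch of one alternative is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are hypothetical choices for this example.

```python
import os

def load_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping such as os.environ."""
    # FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are hypothetical names.
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing))

# Demonstration with a plain dict standing in for the environment.
demo_id, demo_secret = load_credentials({'FOURSQUARE_CLIENT_ID': 'my-id',
                                         'FOURSQUARE_CLIENT_SECRET': 'my-secret'})
print(demo_id, demo_secret)
```

In the notebook, the call would simply be `CLIENT_ID, CLIENT_SECRET = load_credentials()` once both variables are set in the environment.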


We define a function to get the nearby venues for a given district.

In [121]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    """Return the recommended venues around each district as a DataFrame.

    Relies on the globals CLIENT_ID, CLIENT_SECRET, VERSION, LIMIT and categoryId,
    which must be defined before the function is called.
    """
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT,
            categoryId)

        # make the GET request and keep the list of recommended items
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['id'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-district lists into one row per venue
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['District',
                  'District Latitude',
                  'District Longitude',
                  'Venue',
                  'Venue id',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues
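
To see which part of the response the function reads without calling the API, here is a small sketch on a hand-written payload. Only the keys accessed by `getNearbyVenues` are included, all values are made up, and `extract_venues` is a hypothetical helper name.

```python
# A hand-written stand-in for the JSON returned by the explore call.
# Only the keys that getNearbyVenues reads are present; the values are made up.
sample_response = {
    'response': {
        'groups': [{
            'items': [{
                'venue': {
                    'name': 'Example Bakery',
                    'id': 'example-id',
                    'location': {'lat': 48.86, 'lng': 2.34},
                    'categories': [{'name': 'Bakery'}],
                }
            }]
        }]
    }
}

def extract_venues(payload):
    """Flatten an explore payload into (name, id, lat, lng, category) tuples."""
    items = payload['response']['groups'][0]['items']
    return [(v['venue']['name'],
             v['venue']['id'],
             v['venue']['location']['lat'],
             v['venue']['location']['lng'],
             v['venue']['categories'][0]['name']) for v in items]

rows = extract_venues(sample_response)
print(rows)
```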



We apply this function to every district, limiting the results to 5 venues and setting the categoryId to '4bf58dd8d48988d16a941735', the Foursquare category for bakeries.

In [122]:
LIMIT = 5
categoryId = '4bf58dd8d48988d16a941735'

districts_bakeries = getNearbyVenues(names=df['District'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

1st_arrondissement
2nd_arrondissement
3rd_arrondissement
4th_arrondissement
5th_arrondissement
6th_arrondissement
7th_arrondissement
8th_arrondissement
9th_arrondissement
10th_arrondissement
11th_arrondissement
12th_arrondissement
13th_arrondissement
14th_arrondissement
15th_arrondissement
16th_arrondissement
17th_arrondissement
18th_arrondissement
19th_arrondissement
20th_arrondissement


Let's check the shape of the resulting DataFrame and whether we get 100 bakeries (5 bakeries for each of the 20 districts).

In [123]:
print(districts_bakeries.shape)
districts_bakeries.head(15)

(100, 8)


Unnamed: 0,District,District Latitude,District Longitude,Venue,Venue id,Venue Latitude,Venue Longitude,Venue Category
0,1st_arrondissement,48.8592,2.3417,Boulangerie Julien,4c862093e602b1f71304bd7a,48.861251,2.34417,Bakery
1,1st_arrondissement,48.8592,2.3417,Aux Castelblangeois,4c0e2f57d64c0f471ab5275d,48.862239,2.339207,Bakery
2,1st_arrondissement,48.8592,2.3417,Le Moulin de la Vierge,527e3948498e3c991897db98,48.866202,2.341155,Bakery
3,1st_arrondissement,48.8592,2.3417,La Parisienne,56745148498e17815e5e2ced,48.860525,2.346304,Bakery
4,1st_arrondissement,48.8592,2.3417,La Couleur des Blés,4c9ce119542b224b3fa9e49f,48.862216,2.340083,Bakery
5,2nd_arrondissement,48.8655,2.3426,Le Moulin de la Vierge,527e3948498e3c991897db98,48.866202,2.341155,Bakery
6,2nd_arrondissement,48.8655,2.3426,Cloud Cakes,57dd0809cd10776425c22dd7,48.865641,2.346302,Bakery
7,2nd_arrondissement,48.8655,2.3426,La Boulangerie du Nil,567566b2498e54c42fd070f9,48.867722,2.347654,Bakery
8,2nd_arrondissement,48.8655,2.3426,Boulangerie Aki,4c360bbd93db0f4705641d92,48.866211,2.335458,Bakery
9,2nd_arrondissement,48.8655,2.3426,Boulangerie Julien,4c862093e602b1f71304bd7a,48.861251,2.34417,Bakery
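
Because each search covers a 1000 m radius around a district center, neighboring districts can return the same bakery. In the rows shown above, Le Moulin de la Vierge and Boulangerie Julien appear in both the 1st and the 2nd arrondissement. The sketch below counts such shared venues on the ten displayed rows only; the same grouping could be applied to the full `districts_bakeries` DataFrame.

```python
import pandas as pd

# (District, Venue id) pairs copied from the ten rows displayed above.
shown = pd.DataFrame({
    'District': ['1st_arrondissement'] * 5 + ['2nd_arrondissement'] * 5,
    'Venue id': ['4c862093e602b1f71304bd7a', '4c0e2f57d64c0f471ab5275d',
                 '527e3948498e3c991897db98', '56745148498e17815e5e2ced',
                 '4c9ce119542b224b3fa9e49f', '527e3948498e3c991897db98',
                 '57dd0809cd10776425c22dd7', '567566b2498e54c42fd070f9',
                 '4c360bbd93db0f4705641d92', '4c862093e602b1f71304bd7a'],
})

# Number of distinct districts in which each venue appears.
districts_per_venue = shown.groupby('Venue id')['District'].nunique()
shared = districts_per_venue[districts_per_venue > 1]
print(shared)
```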


Now we want the rating of each bakery. For that we need to make an API call for every single bakery.

In [124]:
def getRate(ids):
    """Return a one-column DataFrame with the rating of each venue id (NaN if missing)."""
    ratings_list = []
    for ident in ids:

        try:
            # create the API request URL
            url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(
                ident,
                CLIENT_ID,
                CLIENT_SECRET,
                VERSION)

            # make the GET request
            rating = requests.get(url).json()["response"]['venue']['rating']

            # keep the rating of the venue
            ratings_list.append(rating)
        except KeyError:
            # the response holds no rating for this venue
            ratings_list.append(np.nan)

    ratings = pd.DataFrame()
    ratings['Rating'] = ratings_list

    return ratings
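
The `except KeyError` branch treats every payload that lacks the expected keys in the same way. A hedged sketch of the same idea without an exception is shown below on hand-written payloads. `extract_rating` is a hypothetical helper and the dictionaries are not real API responses.

```python
import math

def extract_rating(payload):
    """Return the venue rating from a venue payload, or NaN if it is absent."""
    venue = payload.get('response', {}).get('venue', {})
    return venue.get('rating', math.nan)

# Hand-written payloads for this sketch (not real API responses).
rated = {'response': {'venue': {'id': 'a', 'rating': 8.4}}}
unrated = {'response': {'venue': {'id': 'b'}}}
empty = {'response': {}}

print(extract_rating(rated), extract_rating(unrated), extract_rating(empty))
```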

Let's check the head of the resulting DataFrame.

In [125]:
ratings = getRate(districts_bakeries['Venue id'])
ratings.head(10)

Unnamed: 0,Rating
0,8.4
1,8.1
2,8.6
3,8.0
4,7.9
5,8.6
6,8.6
7,8.6
8,8.8
9,8.4


Now we add the ratings to the bakeries DataFrame.

In [126]:
districts_bakeries['Rating'] = ratings['Rating']
districts_bakeries

Unnamed: 0,District,District Latitude,District Longitude,Venue,Venue id,Venue Latitude,Venue Longitude,Venue Category,Rating
0,1st_arrondissement,48.8592,2.3417,Boulangerie Julien,4c862093e602b1f71304bd7a,48.861251,2.344170,Bakery,8.4
1,1st_arrondissement,48.8592,2.3417,Aux Castelblangeois,4c0e2f57d64c0f471ab5275d,48.862239,2.339207,Bakery,8.1
2,1st_arrondissement,48.8592,2.3417,Le Moulin de la Vierge,527e3948498e3c991897db98,48.866202,2.341155,Bakery,8.6
3,1st_arrondissement,48.8592,2.3417,La Parisienne,56745148498e17815e5e2ced,48.860525,2.346304,Bakery,8.0
4,1st_arrondissement,48.8592,2.3417,La Couleur des Blés,4c9ce119542b224b3fa9e49f,48.862216,2.340083,Bakery,7.9
5,2nd_arrondissement,48.8655,2.3426,Le Moulin de la Vierge,527e3948498e3c991897db98,48.866202,2.341155,Bakery,8.6
6,2nd_arrondissement,48.8655,2.3426,Cloud Cakes,57dd0809cd10776425c22dd7,48.865641,2.346302,Bakery,8.6
7,2nd_arrondissement,48.8655,2.3426,La Boulangerie du Nil,567566b2498e54c42fd070f9,48.867722,2.347654,Bakery,8.6
8,2nd_arrondissement,48.8655,2.3426,Boulangerie Aki,4c360bbd93db0f4705641d92,48.866211,2.335458,Bakery,8.8
9,2nd_arrondissement,48.8655,2.3426,Boulangerie Julien,4c862093e602b1f71304bd7a,48.861251,2.344170,Bakery,8.4


We remove the bakeries for which the API call didn't return any rating; these were assigned np.nan.

In [127]:
# drop only the rows whose rating is missing
districts_bakeries.dropna(subset=['Rating'], inplace=True)
districts_bakeries

Unnamed: 0,District,District Latitude,District Longitude,Venue,Venue id,Venue Latitude,Venue Longitude,Venue Category,Rating
0,1st_arrondissement,48.8592,2.3417,Boulangerie Julien,4c862093e602b1f71304bd7a,48.861251,2.344170,Bakery,8.4
1,1st_arrondissement,48.8592,2.3417,Aux Castelblangeois,4c0e2f57d64c0f471ab5275d,48.862239,2.339207,Bakery,8.1
2,1st_arrondissement,48.8592,2.3417,Le Moulin de la Vierge,527e3948498e3c991897db98,48.866202,2.341155,Bakery,8.6
3,1st_arrondissement,48.8592,2.3417,La Parisienne,56745148498e17815e5e2ced,48.860525,2.346304,Bakery,8.0
4,1st_arrondissement,48.8592,2.3417,La Couleur des Blés,4c9ce119542b224b3fa9e49f,48.862216,2.340083,Bakery,7.9
5,2nd_arrondissement,48.8655,2.3426,Le Moulin de la Vierge,527e3948498e3c991897db98,48.866202,2.341155,Bakery,8.6
6,2nd_arrondissement,48.8655,2.3426,Cloud Cakes,57dd0809cd10776425c22dd7,48.865641,2.346302,Bakery,8.6
7,2nd_arrondissement,48.8655,2.3426,La Boulangerie du Nil,567566b2498e54c42fd070f9,48.867722,2.347654,Bakery,8.6
8,2nd_arrondissement,48.8655,2.3426,Boulangerie Aki,4c360bbd93db0f4705641d92,48.866211,2.335458,Bakery,8.8
9,2nd_arrondissement,48.8655,2.3426,Boulangerie Julien,4c862093e602b1f71304bd7a,48.861251,2.344170,Bakery,8.4
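
Called without arguments, `dropna` removes a row as soon as any of its columns is missing, not only the rating. The sketch below, on made-up rows, contrasts this with restricting the check to the `Rating` column through the `subset` parameter.

```python
import numpy as np
import pandas as pd

# Made-up rows: bakery B has no category, bakery C has no rating.
toy = pd.DataFrame({
    'Venue': ['A', 'B', 'C'],
    'Venue Category': ['Bakery', None, 'Bakery'],
    'Rating': [8.4, 8.1, np.nan],
})

all_columns = toy.dropna()                   # drops B and C
rating_only = toy.dropna(subset=['Rating'])  # drops only C
print(len(all_columns), len(rating_only))
```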


We then compute the average rating of the bakeries in each district.

In [128]:
# average the ratings per district, keeping the district coordinates
districts_bakeries_rating = districts_bakeries.groupby(['District', 'District Latitude', 'District Longitude'])['Rating'].mean().reset_index()
districts_bakeries_rating.rename(columns={'Rating': 'Average Rating'}, inplace=True)
# best-rated districts first
districts_bakeries_rating.sort_values(by=['Average Rating'], inplace=True, ascending=False)
districts_bakeries_rating

Unnamed: 0,District,District Latitude,District Longitude,Average Rating
8,18th_arrondissement,48.8925,2.3444,8.95
13,3rd_arrondissement,48.8637,2.3615,8.86
0,10th_arrondissement,48.8709,2.3561,8.68
7,17th_arrondissement,48.8835,2.3219,8.65
15,5th_arrondissement,48.8448,2.3471,8.64
12,2nd_arrondissement,48.8655,2.3426,8.6
1,11th_arrondissement,48.8574,2.3795,8.32
6,16th_arrondissement,48.8637,2.2769,8.266667
14,4th_arrondissement,48.8601,2.3507,8.22
10,1st_arrondissement,48.8592,2.3417,8.2
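
Not every average rests on five ratings once unrated bakeries are dropped. Assuming ratings carry one decimal, an average such as 8.266667 suggests that the 16th arrondissement is averaged over three rated bakeries. A minimal sketch, on made-up ratings, of reporting the count next to the mean:

```python
import pandas as pd

# Made-up ratings; district 'Y' has only two rated bakeries.
toy = pd.DataFrame({
    'District': ['X', 'X', 'X', 'Y', 'Y'],
    'Rating': [8.0, 8.2, 8.6, 9.0, 8.0],
})

# Mean and number of ratings per district, side by side.
summary = (toy.groupby('District')['Rating']
              .agg(['mean', 'count'])
              .rename(columns={'mean': 'Average Rating', 'count': 'Rated Bakeries'}))
print(summary)
```

Showing the count lets a reader judge how much weight to give each average.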


We finally display a map of Paris with a marker for each district: the darker the marker, the higher the district's average rating.

In [129]:
# Create map
lat_paris = 48.8534
lng_paris = 2.3488

map_clusters = folium.Map(location=[lat_paris, lng_paris], zoom_start=12)

# Set the color scheme: a blue gradient from light (lowest average) to dark (highest average)
!pip install webcolors
import webcolors as wc

min_avg = np.min(districts_bakeries_rating['Average Rating'])
max_avg = np.max(districts_bakeries_rating['Average Rating'])
delta = max_avg-min_avg

# named marker_colors so that the matplotlib.colors import is not shadowed
marker_colors = []
for avg in districts_bakeries_rating['Average Rating']:
    # scale the average to a step between 0 and 100
    tmp = int((avg-min_avg)*(100/delta))
    marker_colors.append(wc.rgb_to_hex((105-tmp, 155-tmp, 225-tmp)))

# Add markers to the map
for lat, lng, dis, avg, color in zip(
                                districts_bakeries_rating['District Latitude'],
                                districts_bakeries_rating['District Longitude'],
                                districts_bakeries_rating['District'],
                                districts_bakeries_rating['Average Rating'],
                                marker_colors):
    label = folium.Popup(str(dis) + ' average rating: ' + str(avg))
    folium.CircleMarker([lat, lng],
                        radius=5,
                        popup=label,
                        color=color,
                        fill=True,
                        fill_color=color,
                        fill_opacity=1).add_to(map_clusters)

map_clusters
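
The color computation can be isolated and checked without drawing a map. The sketch below reuses the RGB offsets of the cell above, formats the hex string with plain Python instead of webcolors, and adds two choices of our own: `round()` instead of `int()`, so that floating-point error cannot lose a step, and a guard for the case where all averages are equal. `rating_to_hex` is a hypothetical helper name and the 8.0 to 9.0 range is only an example.

```python
def rating_to_hex(avg, min_avg, max_avg):
    """Map an average rating to a blue shade: light for the lowest, dark for the highest."""
    delta = max_avg - min_avg
    # Guard against identical averages, which would otherwise divide by zero.
    step = round((avg - min_avg) * 100 / delta) if delta else 0
    # Same RGB offsets as the cell above: (105, 155, 225) minus a step of 0 to 100.
    return '#{:02x}{:02x}{:02x}'.format(105 - step, 155 - step, 225 - step)

# Example range chosen for this sketch.
print(rating_to_hex(8.0, 8.0, 9.0))  # lightest shade
print(rating_to_hex(9.0, 8.0, 9.0))  # darkest shade
```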

