# The Wine bar project 

### Introduction & Business problem

In case my career as a data scientist fails (*let's hope it doesn't*), I want to open a wine bar in Paris, France. <br/> 
Wine, of course: **I'm French!** <br/>

The problem is that, in my experience, Paris has several areas where people go out for a drink, and these areas are spread around the city rather than concentrated in one place. <br/>

So where is the best location to open a new wine bar and attract enough customers to be successful? <br/>

To ensure success, I need the bar to be in a location where the concentration of venues such as theaters, cinemas and restaurants demonstrates an active life in the area. Using Foursquare data, I will geolocate the venues and find the best spot to open my wine bar.

### Data section

To give an analytical answer to the business problem of where to open my future wine bar in Paris, I will use:<br/>
- A segmentation of inner-city Paris into neighborhoods, taken from a .geojson file
- Venue data for each neighborhood from the Foursquare API (venue category, customer rating, ...)

### Methodology


In [1]:
import pandas as pd
import numpy as np
import requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)
import folium # map rendering library
import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geopy.distance
from math import sqrt

#### Loading the Paris coordinates

In [2]:
with open('arrondissements.geojson') as json_data:
    parisarr = json.load(json_data)
    
par_data = parisarr['features']
colnames = ['PostCode', 'Neighborhood', 'Latitude', 'Longitude']
dfparis = pd.DataFrame(columns=colnames)

In [3]:
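rows = []
for d in par_data:
    latlon = d['properties']['geom_x_y']  # [latitude, longitude] of the arrondissement center
    code = d['properties']['c_ar']        # arrondissement number
    neigh = d['properties']['l_aroff']    # arrondissement name

    rows.append({'PostCode': code, 'Neighborhood': neigh, 'Latitude': latlon[0], 'Longitude': latlon[1]})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
dfparis = pd.DataFrame(rows, columns=colnames)

dfparis.head()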

Unnamed: 0,PostCode,Neighborhood,Latitude,Longitude
0,3,Temple,48.862872,2.360001
1,1,Louvre,48.862563,2.336443
2,5,Panthéon,48.844443,2.350715
3,6,Luxembourg,48.84913,2.332898
4,12,Reuilly,48.834974,2.421325
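
To make the loading step easier to follow without the real file, here is a minimal, dependency-free sketch of the same parsing. The inline GeoJSON is a hand-trimmed stand-in for `arrondissements.geojson`: it keeps only the three properties the notebook reads (`c_ar`, `l_aroff`, `geom_x_y`) and reuses two rows from the table above. The function name `features_to_rows` and the assumption that `c_ar` is stored as an integer are mine, not taken from the file.

```python
import json

# Inline stand-in for arrondissements.geojson (assumption: the real features
# carry more properties and a geometry; only the fields used here are kept).
sample_geojson = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"c_ar": 3, "l_aroff": "Temple", "geom_x_y": [48.862872, 2.360001]}},
  {"type": "Feature", "properties": {"c_ar": 1, "l_aroff": "Louvre", "geom_x_y": [48.862563, 2.336443]}}
]}
"""

def features_to_rows(geojson_text):
    """Return one dict per arrondissement, sorted by arrondissement number."""
    rows = []
    for feature in json.loads(geojson_text)["features"]:
        props = feature["properties"]
        lat, lon = props["geom_x_y"]  # stored as [latitude, longitude]
        rows.append({
            "PostCode": props["c_ar"],
            "Neighborhood": props["l_aroff"],
            "Latitude": lat,
            "Longitude": lon,
        })
    return sorted(rows, key=lambda row: row["PostCode"])

rows = features_to_rows(sample_geojson)
for row in rows:
    print(row)
```

A list of dicts like this can be passed straight to `pd.DataFrame(rows)`, which avoids growing the dataframe one row at a time.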


In [4]:
dfparis = dfparis.sort_values(by='PostCode').reset_index(drop=True)  # keep row labels aligned with the sorted order

In [5]:
address = 'Paris, France'

geolocator = Nominatim(user_agent="par_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Paris are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Paris are 48.8566969, 2.3514616.


#### Creating a map of Paris with Folium

In [6]:
# create map of Paris using latitude and longitude values
map_paris = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(dfparis['Latitude'], dfparis['Longitude'], dfparis['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_paris)  
    
map_paris

The map above shows Paris with the center coordinates of its 20 arrondissements (neighborhoods).

In [7]:
df_coor = dfparis[['Latitude', 'Longitude']]
dfparis['Distance from center'] = ''

In [8]:
#Function to calculate the distance (in meters) from the center coordinates of each neighborhood to the center of Paris
def calc_xy_distance(coords_1, coords_2):
    # geopy.distance.vincenty was removed in geopy 2.0; geodesic is its replacement
    return geopy.distance.geodesic(coords_1, coords_2).m
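
As a sanity check on these distances, the great-circle (haversine) formula can be computed with the standard library alone. This is a sketch of a different, simpler technique than geopy's ellipsoidal calculation: it treats the Earth as a sphere of mean radius 6,371,008.8 m, so small differences from geopy's result are expected. The function name `haversine_m` is mine; the two coordinate pairs are the Paris center and the Louvre arrondissement center printed elsewhere in this notebook.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters (spherical approximation)

def haversine_m(coords_1, coords_2):
    """Great-circle distance in meters between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(radians, coords_1)
    lat2, lon2 = map(radians, coords_2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

paris_center = (48.8566969, 2.3514616)
louvre_center = (48.862563, 2.336443)

distance_louvre = haversine_m(louvre_center, paris_center)
print(round(distance_louvre), "m")
```

The result should land close to the value of about 1,280 m that the notebook reports for Louvre in the `Distance from center` column.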

In [9]:
# Use .loc to write into dfparis directly and avoid pandas' SettingWithCopyWarning
for i in dfparis.index:
    dfparis.loc[i, 'Distance from center'] = calc_xy_distance((dfparis.loc[i, 'Latitude'], dfparis.loc[i, 'Longitude']), (latitude, longitude))


Now let's identify the venues around each of these center coordinates of the city using the **Foursquare API**.

#### Foursquare

Let's use the Foursquare API to get information on the bars in each neighborhood.<br/>

We're interested in venues in the 'Night life' category, since their density indicates how active an area is. More precisely, we are looking for areas with a good density of bars, nightclubs and pubs but fewer wine bars. In our list, a venue counts as a wine bar only if its first listed category is Foursquare's wine bar category; every other venue is counted as an 'other bar'.

In [10]:
#Foursquare credentials (placeholders: keep the real values out of the shared notebook)
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret

Let's retrieve the venues using the Foursquare API. To do so, we will send one query per Paris neighborhood, centered on its coordinates, and look for venues in the *Night life* category. <br/>

In [11]:
# Category IDs corresponding to Night life, Bar and Wine bar were taken from the Foursquare web site (https://developer.foursquare.com/docs/resources/categories):

categ_parent = '4d4b7105d754a06376d81259' #Night life
#= '4bf58dd8d48988d116941735' #Bar
categ_wine = '4bf58dd8d48988d123941735' #Wine bar category
categ_beer = '56aa371ce4b08b9a8d57356c' #Beer bar
categ_cocktail = '4bf58dd8d48988d11e941735' #Cocktail bar
categ_pub = '4bf58dd8d48988d11b941735' #Pub
categ_club = '4bf58dd8d48988d11f941735' #Club

In [12]:
def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', France', '')
    return address


def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    version = '20180724'

    url = 'https://api.foursquare.com/v2/venues/explore'
    # let requests build and escape the query string
    params = {'client_id': client_id, 'client_secret': client_secret, 'v': version,
              'll': '{},{}'.format(lat, lon), 'categoryId': category, 'radius': radius, 'limit': limit}
    try:
        results = requests.get(url, params=params, timeout=10).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]
    except (requests.RequestException, ValueError, KeyError, IndexError):
        # network failure, invalid JSON or unexpected response shape: treat as "no venues"
        venues = []
    return venues
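
To show what the function above extracts, here is a small sketch that runs the same flattening on a hand-built payload instead of a live request. The nesting (`response` → `groups[0]` → `items` → `venue`) mirrors the keys the notebook reads; the venue values are copied from the first row of the "other bars" table further down, while the category id and the split of the address into parts are placeholders I made up for illustration. `parse_venues` is a hypothetical helper name.

```python
# Hand-built mock payload (assumption: a real response carries many more fields).
mock_payload = {
    "response": {"groups": [{"items": [{"venue": {
        "id": "4f4e96eae4b0a99a78161d9e",
        "name": "Bar de l'Hôtel Jules et Jim",
        "categories": [{"name": "Hotel Bar", "id": "placeholder-category-id"}],
        "location": {
            "lat": 48.863463,
            "lng": 2.357393,
            "distance": 201,
            "formattedAddress": ["11 rue des Gravilliers", "75003 Paris", "France"],
        },
    }}]}]}
}

def parse_venues(payload):
    """Flatten an explore-style payload into the tuples used by the notebook."""
    try:
        items = payload["response"]["groups"][0]["items"]
    except (KeyError, IndexError, TypeError):
        return []  # unexpected shape: behave like "no venues found"
    venues = []
    for item in items:
        venue = item["venue"]
        loc = venue["location"]
        address = ", ".join(loc["formattedAddress"]).replace(", France", "")
        categories = [(cat["name"], cat["id"]) for cat in venue["categories"]]
        venues.append((venue["id"], venue["name"], categories,
                       (loc["lat"], loc["lng"]), address, loc["distance"]))
    return venues

parsed = parse_venues(mock_payload)
print(parsed[0][1], "|", parsed[0][4], "|", parsed[0][5], "m")
```

Separating the parsing from the HTTP call this way makes it possible to test the tuple layout without spending API quota.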

We create dictionaries to separate the venues whose category makes them our most likely competitors (wine bars) from the other bars.

In [13]:
wine_bars, beer_bars, cocktail_bars, pub_bars, club_bars, other_bars = {}, {}, {}, {}, {}, {}, 
location_bars = []  

In [14]:
df2=dfparis

In [15]:
for i in range(0, len(dfparis)):
    lat = dfparis['Latitude'][i]
    lon = dfparis['Longitude'][i]
    nb_wine, nb_beer, nb_cocktail, nb_pub, nb_club, nb_other = 0, 0, 0, 0, 0, 0, 
    
    
    venues = get_venues_near_location(lat, lon, categ_parent, CLIENT_ID, CLIENT_SECRET, radius=700, limit=150)    
    area_bars = []
    
    for venue in venues:
        venue_id = venue[0]
        venue_name = venue[1]        
        venue_categories = venue[2]        
        venue_latlon = venue[3]
        venue_address = venue[4]
        venue_distance = venue[5]        
        venue_bar = (venue_id, venue_name, venue_categories[0][0], venue_latlon[0], venue_latlon[1], venue_address, venue_distance)
        
        if venue_categories[0][1] == categ_wine:
            wine_bars[venue_id] = venue_bar
            nb_wine = nb_wine + 1
#         elif venue_categories[0][1] == categ_beer:
#             beer_bars[venue_id] = venue_bar
#             nb_beer = nb_beer + 1
#         elif venue_categories[0][1] == categ_cocktail:
#             cocktail_bars[venue_id] = venue_bar
#             nb_cocktail = nb_cocktail + 1
#         elif venue_categories[0][1] == categ_pub:
#             pub_bars[venue_id] = venue_bar
#             nb_pub = nb_pub + 1
#         elif venue_categories[0][1] == categ_club:
#             club_bars[venue_id] = venue_bar
#             nb_club = nb_club + 1
        else:
            other_bars[venue_id] = venue_bar
            nb_other = nb_other + 1
         
    # record the totals once per neighborhood, even when no venue was returned
    dfparis.loc[i, 'Nb wine bars'] = nb_wine
    dfparis.loc[i, 'Nb other bars'] = nb_other

#     dfparis.loc[i, 'Nb beer bars'] = nb_beer
#     dfparis.loc[i, 'Nb cocktail bars'] = nb_cocktail
#     dfparis.loc[i, 'Nb pubs'] = nb_pub
#     dfparis.loc[i, 'Nb clubs'] = nb_club


In [21]:
dfparis.reset_index(drop=True, inplace=True)
dfparis.head()

Unnamed: 0,PostCode,Neighborhood,Latitude,Longitude,Distance from center,Nb wine bars,Nb other bars
0,1,Louvre,48.862563,2.336443,1280.59,8.0,51.0
1,2,Bourse,48.868279,2.342803,1436.21,19.0,81.0
2,3,Temple,48.862872,2.360001,929.653,11.0,89.0
3,4,Hôtel-de-Ville,48.854341,2.35763,522.961,4.0,69.0
4,5,Panthéon,48.844443,2.350715,1363.8,6.0,63.0
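
One possible way to turn these counts into a first ranking is to compute, for each neighborhood, the share of wine bars among all nightlife venues found, and to sort from the lowest share up. This is only a sketch of one candidate metric, not a result of the analysis: it ignores the absolute number of venues, which also matters for the business question. The counts are the five rows shown above; `wine_share` is a hypothetical helper name.

```python
# (neighborhood, number of wine bars, number of other bars), copied from the table above
counts = [
    ("Louvre", 8, 51),
    ("Bourse", 19, 81),
    ("Temple", 11, 89),
    ("Hôtel-de-Ville", 4, 69),
    ("Panthéon", 6, 63),
]

def wine_share(wine, other):
    """Fraction of the venues found in an area that are wine bars."""
    return wine / (wine + other)

# Skip areas with no venues at all, then rank from the lowest wine-bar share up.
with_venues = [row for row in counts if row[1] + row[2] > 0]
ranked = sorted(with_venues, key=lambda row: wine_share(row[1], row[2]))

for name, wine, other in ranked:
    print(f"{name:15s} venues={wine + other:3d}  wine share={wine_share(wine, other):.1%}")
```

On these five rows, Hôtel-de-Ville has the lowest share of wine bars and Bourse the highest.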


### Place a Bar chart here !!!

In [17]:
#Converting dict to pd dataFrame
df_winebars = pd.DataFrame.from_dict(wine_bars, orient='index', columns=['Id', 'Name', 'Category', 'Latitude', 'Longitude', 'Adresse', 'Distance from center'])
df_otherbars = pd.DataFrame.from_dict(other_bars, orient='index', columns=['Id', 'Name', 'Category', 'Latitude', 'Longitude', 'Adresse', 'Distance from center'])
df_winebars.reset_index(drop=True, inplace=True)
df_otherbars.reset_index(drop=True, inplace=True)

In [18]:
count_bars = df_otherbars[['Id', 'Category']].groupby('Category').count()

In [19]:
df_otherbars.head()

Unnamed: 0,Id,Name,Category,Latitude,Longitude,Adresse,Distance from center
0,4f4e96eae4b0a99a78161d9e,Bar de l'Hôtel Jules et Jim,Hotel Bar,48.863463,2.357393,"11 rue des Gravilliers, 75003 Paris",201
1,5079f1e0e4b0eb8b83f90b0d,Little Red Door,Speakeasy,48.863703,2.363514,"60 rue Charlot, 75003 Paris",273
2,4d77b39caf63cbff3997be0f,Candelaria,Cocktail Bar,48.863032,2.364059,"56 rue de Saintonge, 75003 Paris",297
3,58e13f3babb86a218991cb2f,La Ruée Vers L'Orge,Beer Bar,48.865601,2.359647,"6 rue des Fontaines du Temple, 75003 Paris",304
4,5116b70ce4b0d096ad258d22,Le Mary Céleste,Cocktail Bar,48.861742,2.365012,"1 rue Commines (Rue Froissart), 75004 Paris",387


Let's visualize the wine bars (red) and the other bars (blue) on a map of Paris

In [20]:
# create map of Paris using latitude and longitude values
map_paris_wine = folium.Map(location=[latitude, longitude], zoom_start=12)


# add markers of the other bars to map (blue)
for lat, lng, label in zip(df_otherbars['Latitude'], df_otherbars['Longitude'], df_otherbars['Name']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_paris_wine)

# add markers of wine bars to map
for lat, lng, label in zip(df_winebars['Latitude'], df_winebars['Longitude'], df_winebars['Name']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_paris_wine)  


map_paris_wine

#### One-hot encoding

In [25]:
# one hot encoding
paris_onehot = pd.get_dummies(df_otherbars[['Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
# DownT_onehot['Neighborhood'] = DonwT_venues['Neighborhood'] 

# move neighborhood column to the first column
# fixed_columns = [DownT_onehot.columns[-1]] + list(DownT_onehot.columns[:-1])
# DownT_onehot = DownT_onehot[fixed_columns]

print(paris_onehot.shape)
paris_onehot.head()

(720, 62)


Unnamed: 0,African Restaurant,American Restaurant,BBQ Joint,Bar,Beach Bar,Beer Bar,Beer Garden,Beer Store,Bistro,Brasserie,...,Seafood Restaurant,Smoke Shop,South American Restaurant,Spanish Restaurant,Speakeasy,Sports Bar,Steakhouse,Tapas Restaurant,Tea Room,Vietnamese Restaurant
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
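
The commented-out lines above suggest the next step would be to attach a neighborhood to each venue and average the one-hot columns per neighborhood, which gives the frequency of each category in each area. Here is a minimal sketch of that idea in plain Python, so the logic is visible without pandas. The five categories are the ones shown in `df_otherbars.head()`; assigning each venue to a neighborhood by the postal code in its address (75003 for Temple, 75004 for Hôtel-de-Ville) is my assumption, not something the notebook does yet.

```python
# (neighborhood, category) pairs; the neighborhood assignment is an assumption
venues = [
    ("Temple", "Hotel Bar"),
    ("Temple", "Speakeasy"),
    ("Temple", "Cocktail Bar"),
    ("Temple", "Beer Bar"),
    ("Hôtel-de-Ville", "Cocktail Bar"),
]

# One column per distinct category, in alphabetical order
categories = sorted({category for _, category in venues})

def one_hot(category):
    """Return a 0/1 vector with a single 1 in the column of `category`."""
    return [1 if category == column else 0 for column in categories]

def neighborhood_profiles(venue_list):
    """Average the one-hot vectors of all venues in each neighborhood."""
    sums, totals = {}, {}
    for neighborhood, category in venue_list:
        previous = sums.get(neighborhood, [0] * len(categories))
        sums[neighborhood] = [p + v for p, v in zip(previous, one_hot(category))]
        totals[neighborhood] = totals.get(neighborhood, 0) + 1
    return {n: [s / totals[n] for s in sums[n]] for n in sums}

profiles = neighborhood_profiles(venues)
print(categories)
for neighborhood, profile in profiles.items():
    print(neighborhood, profile)
```

In pandas the same profile would typically come from adding a `Neighborhood` column to the one-hot dataframe and calling `groupby('Neighborhood').mean()`.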


Question: should all the venue results go into a single dataframe? <br/>

Idea: follow the previous capstone project on Toronto, then use exploratory analysis to identify the neighborhoods with a concentration of wine bars. Use K-means to check which types of venues are popular in those neighborhoods. Finally, look for neighborhoods that fall into similar clusters but have fewer wine bars. These neighborhoods will be interesting candidates.
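
The clustering idea in the note above can be sketched as follows. This is a toy illustration only: the four neighborhoods `A` to `D`, their two-feature venue profiles and their wine bar counts are invented values, not data from this project, and the tiny K-means implementation stands in for whatever library would be used in the real analysis. The last step picks, inside each cluster, the neighborhood with the fewest wine bars as a candidate.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    return [sum(column) / len(points) for column in zip(*points)]

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centroid, recompute, repeat."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # keep the old centroid if a cluster ends up empty
            new_centroids.append(mean_point(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: (share of cocktail bars, share of pubs) per neighborhood
names = ["A", "B", "C", "D"]
profiles = [[0.8, 0.1], [0.7, 0.2], [0.1, 0.8], [0.2, 0.7]]
wine_bars = {"A": 12, "B": 3, "C": 9, "D": 2}  # hypothetical counts

labels, _ = kmeans(profiles, k=2)

# In each cluster, the neighborhood with the fewest wine bars is a candidate
candidates = []
for cluster in sorted(set(labels)):
    members = [name for name, label in zip(names, labels) if label == cluster]
    candidates.append(min(members, key=lambda name: wine_bars[name]))

print(dict(zip(names, labels)), sorted(candidates))
```

With real data the profiles would be the per-neighborhood category frequencies, and the number of clusters would need to be chosen rather than fixed at 2.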

### Results

### Discussion

### Conclusion