# Capstone Project - THE NEXT BAKERY
### Author: Carlos Gonzalez Romero

### INDEX
* [INTRODUCTION / BUSINESS PROBLEM](#link_01)
 * [Background](#link_01_01)
 * [Problem](#link_01_02)
 * [Interest](#link_01_03)
* [DATA](#link_02)
 * [Data sources](#link_02_01)
 * [Data assumption](#link_02_02)
 * [Data cleaning](#link_02_03)
* [METHODOLOGY](#link_03)
 * [First criterion](#link_03_01)
 * [Second criterion](#link_03_02)
* [RESULTS / DISCUSSION](#link_04)
 * [First criterion](#link_04_01)
 * [Second criterion](#link_04_02)
* [CONCLUSION](#link_05)

### INTRODUCTION / BUSINESS PROBLEM <a name="link_01"></a>

**Background**<a name="link_01_01"></a>  
Levaduramadre is a gourmet bakery brand in full expansion in the city of Madrid, Spain. Although it was founded in 2007, the expansion took place in 2017, mainly in the center and north-center of Madrid.

**Problem**<a name="link_01_02"></a>  
To keep the expansion geographically even, we will try to find an optimal location for a new Levaduramadre bakery in the south-center of Madrid. To do this, we assume that the bakery owner gives us a choice of 3 specific locations, and we have to justify our choice among them based on two criteria:  
* How similar the neighborhood of the future Levaduramadre is to the neighborhoods that currently have one.  
* How many bakeries there are around the future Levaduramadre and how far away they are.

**Interest**<a name="link_01_03"></a>  
This report has been requested by the owner of another Levaduramadre bakery, who is interested in opening a new one in the south-center of Madrid.



### DATA <a name="link_02"></a>

**Data sources**<a name="link_02_01"></a>  
For this work we will use data obtained from the Foursquare API. Although the API is the same, we will obtain three datasets, one for each purpose:
* The *first dataset* will contain information about all Levaduramadre locations (existing and possible future ones). It will be used to obtain the second dataset.
* The *second dataset* will contain information about the venues around every Levaduramadre location. It will be used to cluster the locations (using k-means clustering), which gives the first criterion for identifying the best location for the next Levaduramadre.
* The *third dataset* will contain information about the bakeries around the possible locations. It will be used to compute the second criterion for identifying the best location for the next Levaduramadre.



**Data assumption**<a name="link_02_02"></a>  
The locations of the possible future Levaduramadre were chosen with the following steps:
1. Find all existing Levaduramadre bakeries in Madrid (with the first dataset) and plot them on a map.
2. Select three (approximate) zones in the south-center of Madrid without a Levaduramadre.
3. Choose a random address in each zone, with the only criterion that it is located on a street, and get its latitude and longitude from Google Maps.



`Import libraries`

In [1]:
import pandas as pd
import numpy as np

from IPython.display import display # render objects as Jupyter output
import folium # draw maps
import requests # make HTTP calls (GET, POST, ...)
from sklearn.cluster import KMeans # k-means clustering model
import matplotlib.pyplot as plt


print('\n\nDONE')



DONE


``Credentials for Foursquare API``

In [2]:
import os # credentials are read from environment variables (the variable names are an assumption) instead of being hard-coded
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # Foursquare Secret
VERSION = '20180604'


print('\n\nDONE')



DONE


`Search for Levaduramadre in Madrid`  
This search gives us part of the first dataset, which we store in the dataframe "df_first_dataset".
The search is a call to the Foursquare API with the keyword "Levaduramadre", so the response contains all venues named Levaduramadre within 20 km of the center of Madrid.

In [3]:
# Madrid geolocation
latitud=40.4165000
longitud=-3.7025600

# Parameters of the API call
palabra_clave = 'Levaduramadre'
area_busqueda = 20000 # search radius in meters around the geolocation
LIMIT=200 # maximum number of venues requested

# Make the call
url_01 = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitud, longitud, VERSION, palabra_clave, area_busqueda, LIMIT)
results_01 = requests.get(url_01).json()

# Extract the venue information from the JSON response
info_001 = results_01['response']['venues']

# Load the information into a dataframe
df_first_dataset = pd.json_normalize(info_001)
display(df_first_dataset.head(3))
print(df_first_dataset.shape)


print('\n\nDONE')

Unnamed: 0,id,name,categories,referralId,hasPerk,location.address,location.lat,location.lng,location.labeledLatLngs,location.distance,location.postalCode,location.cc,location.city,location.state,location.country,location.formattedAddress,location.crossStreet
0,5b20c34635f983002c06448f,Levaduramadre,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",v-1595065952,False,Espronceda 3,40.440712,-3.698919,"[{'label': 'display', 'lat': 40.440712, 'lng':...",2712,28003,ES,Madrid,Madrid,España,"[Espronceda 3, 28003 Madrid Madrid, España]",
1,50576da2e4b0d211d20aa09e,Levadura Madre,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",v-1595065952,False,"C. Alcalá, 179",40.42673,-3.671556,"[{'label': 'display', 'lat': 40.42672977537444...",2863,28009,ES,Madrid,Madrid,España,"[C. Alcalá, 179, 28009 Madrid Madrid, España]",
2,5cfbe8a7f5e9d7002ceac4bf,Levadura Madre,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",v-1595065952,False,,40.410534,-3.706628,"[{'label': 'display', 'lat': 40.410534, 'lng':...",748,28005,ES,Madrid,Madrid,España,"[28005 Madrid Madrid, España]",


(20, 17)


DONE
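The search URL above is assembled with `str.format`, which works for a single-word keyword but does not escape spaces or special characters. As a minimal sketch, the same URL can be built with the standard library's `urlencode`. The endpoint and parameter names are the ones used in this notebook; the helper name `build_search_url` and the placeholder credentials are assumptions made for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_search_url(client_id, client_secret, lat, lon, version, radius, limit, query=None):
    """Build a venue-search URL with URL-encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lon),  # latitude,longitude of the search center
        'v': version,
        'radius': radius,                # search radius in meters
        'limit': limit,
    }
    if query is not None:
        params['query'] = query          # the keyword is optional
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params)

# Placeholder credentials, for illustration only
example_url = build_search_url('MY_ID', 'MY_SECRET', 40.4165, -3.70256,
                               '20180604', 20000, 200, query='Levadura Madre')
print(example_url)
```

With this helper, a keyword that contains a space (such as the "Levadura Madre" spelling that appears in the results) is escaped automatically.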


`Plot on a map the existing Levaduramadre and the zones for the future Levaduramadre`

In [4]:
# Create a map centered on the starting point
madrid_map = folium.Map(location=[latitud, longitud], zoom_start=12)

# Mark the existing Levaduramadre bakeries
for lat, lon, id_venue in zip(df_first_dataset['location.lat'], df_first_dataset['location.lng'], df_first_dataset.id):
    folium.Marker([lat, lon], tooltip=id_venue, icon=folium.Icon(icon='', color='cadetblue')).add_to(madrid_map)

# Draw the three zones of the possible Levaduramadre locations
for zone_center in [[40.404, -3.675], [40.403, -3.708], [40.391, -3.686]]:
    folium.CircleMarker(
        zone_center,
        radius=25,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.1
    ).add_to(madrid_map)

# Show the map
display(madrid_map)



print('\n\nDONE')



DONE


`Insert in a dataframe the latitude and longitude of the possible Levaduramadre locations obtained from Google Maps`

In [5]:
df_option=pd.DataFrame({'NAME':['Option_01','Option_02','Option_03'],'LAT':[40.403677,40.405408,40.392024],'LON':[-3.703524,-3.676730,-3.688217]})
display(df_option)


print('\n\nDONE')

Unnamed: 0,NAME,LAT,LON
0,Option_01,40.403677,-3.703524
1,Option_02,40.405408,-3.67673
2,Option_03,40.392024,-3.688217




DONE


`Plot on a map the existing and the possible Levaduramadre locations`

In [6]:
# Create a map centered on the starting point
madrid_map = folium.Map(location=[latitud, longitud], zoom_start=12)

# Mark the existing Levaduramadre bakeries
for lat, lon, id_venue in zip(df_first_dataset['location.lat'], df_first_dataset['location.lng'], df_first_dataset.id):
    folium.Marker([lat, lon], tooltip=id_venue, icon=folium.Icon(icon='', color='cadetblue')).add_to(madrid_map)

# Mark the possible Levaduramadre locations
for lat, lon, id_venue in zip(df_option.LAT, df_option.LON, df_option.NAME):
    folium.Marker([lat, lon], tooltip=id_venue, icon=folium.Icon(icon='', color='red')).add_to(madrid_map)

# Draw the three zones of the possible Levaduramadre locations
for zone_center in [[40.404, -3.675], [40.403, -3.708], [40.391, -3.686]]:
    folium.CircleMarker(
        zone_center,
        radius=25,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.1
    ).add_to(madrid_map)

# Show the map
display(madrid_map)



print('\n\nDONE')



DONE


**Data cleaning**<a name="link_02_03"></a>  

<u>*First Dataset*</u>

`Delete the non-useful rows from the first dataset`  
To obtain the second dataset, I need to clean the first dataset, because the search returns every venue named Levaduramadre, including the factory called “Obrador” and a Levaduramadre that is not located in Madrid. So I delete the venue with “Obrador” in its name and keep only the venues whose city is Madrid.

In [7]:
print('Dimension of dataframe df_first_dataset before cleaning = ',df_first_dataset.shape)

# Boolean masks avoid dropping rows by label inside a loop, which fails once the index has gaps
is_obrador = df_first_dataset['name'].str.contains('Obrador', na=False)
in_madrid = df_first_dataset['location.city'] == 'Madrid'
df_first_dataset = df_first_dataset[~is_obrador & in_madrid].reset_index(drop=True)

print('Dimension of dataframe df_first_dataset after cleaning = ',df_first_dataset.shape)
        
print('\n\nDONE')

Dimension of dataframe df_first_dataset before cleaning =  (20, 17)
Dimension of dataframe df_first_dataset after cleaning =  (18, 17)


DONE


`Rename the names of the venues`  
For easier identification, I replace the name of each venue with “Levaduramadre_” followed by a number.

In [8]:
# Assign all the new names in one step (avoids chained assignment)
df_first_dataset['name'] = ['Levaduramadre_' + str(i) for i in range(len(df_first_dataset))]

display(df_first_dataset.head(3))

        
print('\n\nDONE')



Unnamed: 0,id,name,categories,referralId,hasPerk,location.address,location.lat,location.lng,location.labeledLatLngs,location.distance,location.postalCode,location.cc,location.city,location.state,location.country,location.formattedAddress,location.crossStreet
0,5b20c34635f983002c06448f,Levaduramadre_0,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",v-1595065952,False,Espronceda 3,40.440712,-3.698919,"[{'label': 'display', 'lat': 40.440712, 'lng':...",2712,28003,ES,Madrid,Madrid,España,"[Espronceda 3, 28003 Madrid Madrid, España]",
1,50576da2e4b0d211d20aa09e,Levaduramadre_1,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",v-1595065952,False,"C. Alcalá, 179",40.42673,-3.671556,"[{'label': 'display', 'lat': 40.42672977537444...",2863,28009,ES,Madrid,Madrid,España,"[C. Alcalá, 179, 28009 Madrid Madrid, España]",
2,5cfbe8a7f5e9d7002ceac4bf,Levaduramadre_2,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",v-1595065952,False,,40.410534,-3.706628,"[{'label': 'display', 'lat': 40.410534, 'lng':...",748,28005,ES,Madrid,Madrid,España,"[28005 Madrid Madrid, España]",




DONE


`Clean the first dataset`  
The last step for the first dataset is to keep only the name, latitude and longitude. I also rename the columns to match the dataframe of possible Levaduramadre locations.

In [9]:
df_first_dataset=df_first_dataset[['name','location.lat','location.lng']].copy() # select by column name; copy to get an independent dataframe
df_first_dataset.columns=['NAME','LAT','LON']
display(df_first_dataset.head())
        
print('\n\nDONE')

Unnamed: 0,NAME,LAT,LON
0,Levaduramadre_0,40.440712,-3.698919
1,Levaduramadre_1,40.42673,-3.671556
2,Levaduramadre_2,40.410534,-3.706628
3,Levaduramadre_3,40.422757,-3.70419
4,Levaduramadre_4,40.424831,-3.701068




DONE


`Complete the first dataset`  
Then I append the dataframe of possible Levaduramadre locations.

In [10]:
print('Dimension of dataframe df_first_dataset before join = ',df_first_dataset.shape)

df_first_dataset=pd.concat([df_first_dataset, df_option]).reset_index(drop=True) # DataFrame.append was removed in pandas 2.0

print('Dimension of dataframe df_first_dataset after join = ',df_first_dataset.shape)
        
print('\n\nDONE')

Dimension of dataframe df_first_dataset before join =  (18, 3)
Dimension of dataframe df_first_dataset after join =  (21, 3)


DONE


`FINAL FIRST DATASET`

In [11]:
df_first_dataset

Unnamed: 0,NAME,LAT,LON
0,Levaduramadre_0,40.440712,-3.698919
1,Levaduramadre_1,40.42673,-3.671556
2,Levaduramadre_2,40.410534,-3.706628
3,Levaduramadre_3,40.422757,-3.70419
4,Levaduramadre_4,40.424831,-3.701068
5,Levaduramadre_5,40.434561,-3.708538
6,Levaduramadre_6,40.425808,-3.716907
7,Levaduramadre_7,40.417014,-3.676648
8,Levaduramadre_8,40.435278,-3.702351
9,Levaduramadre_9,40.439179,-3.712758


<u>*Second Dataset*</u>

`Create a dataframe with the response of Foursquare API`  
With the first dataset we can run a search (with the Foursquare API) to obtain the data for the second dataset. This is done in three steps:
* Create an auxiliary list.
* Call the Foursquare API and save the responses in the auxiliary list. These calls do not use a keyword, so each one returns all venues within 200 meters of a point, in this case each venue of the first dataset.
* Combine the auxiliary list into a dataframe (named “df_foursquare”).


In [12]:
# Change the search parameters
area_busqueda =200 # search radius in meters
limite=100

# Call the Foursquare API for each venue of the first dataset and keep each response as a dataframe
list_foursquare=[]
for i in range(len(df_first_dataset)):
    initial_point=df_first_dataset.iloc[i,:]
    latitud=initial_point.LAT
    longitud=initial_point.LON
    url_02 = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitud, longitud, VERSION, area_busqueda, limite)
    results_02 = requests.get(url_02).json()
    info_002 = results_02['response']['venues']
    df_auxiliar=pd.json_normalize(info_002)
    df_auxiliar['name_leva']=initial_point['NAME'] # reference venue of this search
    list_foursquare.append(df_auxiliar)

# Combine the responses into one dataframe (DataFrame.append was removed in pandas 2.0)
df_foursquare=pd.concat(list_foursquare)

display(df_foursquare.shape)
display(df_foursquare.head())



print('\n\nDONE')

(2013, 20)

Unnamed: 0,id,name,categories,referralId,hasPerk,location.address,location.lat,location.lng,location.labeledLatLngs,location.distance,location.postalCode,location.cc,location.city,location.state,location.country,location.formattedAddress,location.crossStreet,location.neighborhood,venuePage.id,name_leva
0,5b20c34635f983002c06448f,Levaduramadre,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",v-1595065953,False,Espronceda 3,40.440712,-3.698919,"[{'label': 'display', 'lat': 40.440712, 'lng':...",0,28003,ES,Madrid,Madrid,España,"[Espronceda 3, 28003 Madrid Madrid, España]",,,,Levaduramadre_0
1,55fdd399498e45bcfcb7f0ee,El Secreto de Ponzano,"[{'id': '4bf58dd8d48988d1db931735', 'name': 'T...",v-1595065953,False,Ponzano 48,40.4407,-3.699052,"[{'label': 'display', 'lat': 40.44070000586889...",11,28004,ES,Madrid,Madrid,España,"[Ponzano 48, 28004 Madrid Madrid, España]",,,,Levaduramadre_0
2,4b9bce21f964a5204a2736e3,El Escudo,"[{'id': '4bf58dd8d48988d150941735', 'name': 'S...",v-1595065953,False,ponzano 49,40.440719,-3.699143,"[{'label': 'display', 'lat': 40.44071888573661...",19,28003,ES,Madrid,Madrid,España,"[ponzano 49, 28003 Madrid Madrid, España]",,,,Levaduramadre_0
3,5b53a5952f97ec002cbb2d0a,Kemuri 49,"[{'id': '4bf58dd8d48988d111941735', 'name': 'J...",v-1595065953,False,"Ponzano, 49",40.440665,-3.699301,"[{'label': 'display', 'lat': 40.440665, 'lng':...",32,28003,ES,Madrid,Madrid,España,"[Ponzano, 49, 28003 Madrid Madrid, España]",,,,Levaduramadre_0
4,57800f05498e010ca9a1a323,Arima-Basque Gastronomy,"[{'id': '4bf58dd8d48988d11e941735', 'name': 'C...",v-1595065953,False,"C/ Ponzano, 51",40.440947,-3.699293,"[{'label': 'display', 'lat': 40.4409472, 'lng'...",41,28003,ES,Madrid,Madrid,España,"[C/ Ponzano, 51, 28003 Madrid Madrid, España]",,,,Levaduramadre_0




DONE


`Create the second dataset`  
We keep only the useful information: name, latitude, longitude and the reference venue of the search (in this case the Levaduramadre).

In [13]:
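df_second_dataset=df_foursquare[['name','location.lat','location.lng','name_leva']].copy() # copy so that later column assignments do not raise SettingWithCopyWarning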
df_second_dataset.columns=['NAME','LAT','LON','REF']
display(df_second_dataset.shape)
display(df_second_dataset.head())

print('\n\nDONE')

(2013, 4)

Unnamed: 0,NAME,LAT,LON,REF
0,Levaduramadre,40.440712,-3.698919,Levaduramadre_0
1,El Secreto de Ponzano,40.4407,-3.699052,Levaduramadre_0
2,El Escudo,40.440719,-3.699143,Levaduramadre_0
3,Kemuri 49,40.440665,-3.699301,Levaduramadre_0
4,Arima-Basque Gastronomy,40.440947,-3.699293,Levaduramadre_0




DONE


`Insert categories in the second dataset`

In [14]:
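list_catego=[]
for categories in df_foursquare['categories']:
    try:
        list_catego.append(categories[0]['name']) # name of the first (primary) category
    except (IndexError, KeyError, TypeError):
        list_catego.append('NaN') # venue without a category

df_second_dataset=df_second_dataset.assign(CATEGO=list_catego) # assign returns a new dataframe, so no SettingWithCopyWarning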
display(df_second_dataset.head())


print('\n\nDONE')



Unnamed: 0,NAME,LAT,LON,REF,CATEGO
0,Levaduramadre,40.440712,-3.698919,Levaduramadre_0,Bakery
1,El Secreto de Ponzano,40.4407,-3.699052,Levaduramadre_0,Tapas Restaurant
2,El Escudo,40.440719,-3.699143,Levaduramadre_0,Spanish Restaurant
3,Kemuri 49,40.440665,-3.699301,Levaduramadre_0,Japanese Restaurant
4,Arima-Basque Gastronomy,40.440947,-3.699293,Levaduramadre_0,Cocktail Bar




DONE
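The category of each venue comes from the nested `categories` field of the API response, which is a list of dictionaries and can be empty. The following is a minimal sketch of that extraction in plain Python. The helper name `primary_category` is hypothetical, and the `'NaN'` string default mirrors the placeholder used in this notebook.

```python
def primary_category(categories, default='NaN'):
    """Return the name of the first category, or `default` when there is none."""
    if isinstance(categories, list) and categories:
        first = categories[0]
        if isinstance(first, dict):
            return first.get('name', default)
    return default

# Three shapes the field can take: one category, an empty list, a missing value
sample = [
    [{'id': '4bf58dd8d48988d16a941735', 'name': 'Bakery'}],
    [],
    None,
]
sample_names = [primary_category(c) for c in sample]
print(sample_names)
```

Checking the shape explicitly makes it clear which inputs end up as `'NaN'`, instead of relying on a bare `except`.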


`FINAL SECOND DATASET`

In [15]:
df_second_dataset

Unnamed: 0,NAME,LAT,LON,REF,CATEGO
0,Levaduramadre,40.440712,-3.698919,Levaduramadre_0,Bakery
1,El Secreto de Ponzano,40.440700,-3.699052,Levaduramadre_0,Tapas Restaurant
2,El Escudo,40.440719,-3.699143,Levaduramadre_0,Spanish Restaurant
3,Kemuri 49,40.440665,-3.699301,Levaduramadre_0,Japanese Restaurant
4,Arima-Basque Gastronomy,40.440947,-3.699293,Levaduramadre_0,Cocktail Bar
...,...,...,...,...,...
54,Devoteam,40.390890,-3.688619,Option_03,Coworking Space
55,Rocódromo Planetario,40.393651,-3.687054,Option_03,Rock Climbing Spot
56,Exclusive Lounge Bar,40.390910,-3.688034,Option_03,Cocktail Bar
57,Delivery Media,40.393087,-3.689400,Option_03,Office


<u>*Third Dataset*</u>

`Create the third dataset`  
The process to create the third dataset is very similar to that of the second dataset, but we change the parameters of the call to the Foursquare API: we use a category ID (corresponding to Bakery). The response contains all bakeries around a point, in this case each of the three possible future Levaduramadre locations.

In [16]:
# Change the search parameters
area_busqueda =250 # search radius in meters
limite=100
catego_ID='4bf58dd8d48988d16a941735' # category ID of Bakery

# Call the Foursquare API for each possible location and keep each response as a dataframe
list_foursquare_02=[]
for i in range(len(df_option)):
    initial_point=df_option.iloc[i,:]
    latitud=initial_point.LAT
    longitud=initial_point.LON
    url_03 = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&categoryId={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitud, longitud, VERSION, catego_ID, area_busqueda, limite)
    results_03 = requests.get(url_03).json()
    info_003 = results_03['response']['venues']
    df_auxiliar_02=pd.json_normalize(info_003)
    df_auxiliar_02['name_option']=initial_point['NAME'] # reference location of this search
    list_foursquare_02.append(df_auxiliar_02)

# Combine the responses into one dataframe (DataFrame.append was removed in pandas 2.0)
df_third_dataset=pd.concat(list_foursquare_02)

display(df_third_dataset.shape)
display(df_third_dataset.head())



print('\n\nDONE')

(14, 18)

Unnamed: 0,id,name,categories,referralId,hasPerk,location.address,location.crossStreet,location.lat,location.lng,location.labeledLatLngs,location.distance,location.postalCode,location.cc,location.city,location.state,location.country,location.formattedAddress,name_option
0,5432d32d498e7df38dd37e10,Granier,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",v-1595065963,False,Gta. de Embajadores 1,C. Embajadores,40.404785,-3.702284,"[{'label': 'display', 'lat': 40.40478529498308...",162.0,28012.0,ES,Madrid,Madrid,España,"[Gta. de Embajadores 1 (C. Embajadores), 28012...",Option_01
1,4f49f131e4b0570c8eb117d2,Panaderia Rovier,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",v-1595065963,False,Calle Ercilla,,40.403458,-3.703091,"[{'label': 'display', 'lat': 40.40345812799989...",44.0,,ES,,,España,"[Calle Ercilla, España]",Option_01
2,4db55e3cf7b121c29f5c3e5e,La Rinconada Pasteleria,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",v-1595065963,False,Glorieta Embajadores,,40.404757,-3.702387,"[{'label': 'display', 'lat': 40.40475733246809...",154.0,28012.0,ES,Madrid,Madrid,España,"[Glorieta Embajadores, 28012 Madrid Madrid, Es...",Option_01
3,4c434f15ff711b8d99c51405,PANISHOP,"[{'id': '4bf58dd8d48988d16a941735', 'name': 'B...",v-1595065963,False,Paseo de las Acacias 29,,40.403938,-3.704983,"[{'label': 'display', 'lat': 40.4039382441524,...",127.0,28005.0,ES,Madrid,Madrid,España,"[Paseo de las Acacias 29, 28005 Madrid Madrid,...",Option_01
4,5bb9e02518d43b002c2b8b68,Ytalia Bakery & Coffee,"[{'id': '4bf58dd8d48988d1e0931735', 'name': 'C...",v-1595065963,False,,,40.402534,-3.700828,"[{'label': 'display', 'lat': 40.402534, 'lng':...",261.0,28012.0,ES,Madrid,Madrid,España,"[28012 Madrid Madrid, España]",Option_01




DONE


`Clean the third dataset`  
We keep only the useful information: name, distance and the reference venue of the search (in this case the possible future Levaduramadre).

In [17]:
df_third_dataset=df_third_dataset.iloc[:,[1,10,17]]
df_third_dataset.columns=['NAME','DIST','OPT']
display(df_third_dataset.shape)
display(df_third_dataset.head())

print('\n\nDONE')

(14, 3)

Unnamed: 0,NAME,DIST,OPT
0,Granier,162.0,Option_01
1,Panaderia Rovier,44.0,Option_01
2,La Rinconada Pasteleria,154.0,Option_01
3,PANISHOP,127.0,Option_01
4,Ytalia Bakery & Coffee,261.0,Option_01




DONE


`FINAL THIRD DATASET`

In [18]:
df_third_dataset

Unnamed: 0,NAME,DIST,OPT
0,Granier,162.0,Option_01
1,Panaderia Rovier,44.0,Option_01
2,La Rinconada Pasteleria,154.0,Option_01
3,PANISHOP,127.0,Option_01
4,Ytalia Bakery & Coffee,261.0,Option_01
0,La Petite Sara,128.0,Option_02
1,VillaGarcia Pasteleria,49.0,Option_02
2,Horno San Miguel,124.0,Option_02
3,Panadería San Miguel,146.0,Option_02
4,L'atelier Del Pan,151.0,Option_02
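The `DIST` column is taken directly from the `location.distance` field of the API response. As a sanity check, the distance can be recomputed from coordinates with the haversine formula. This is a sketch under the assumption of a spherical Earth with a mean radius of 6,371 km; the function name `haversine_m` is hypothetical, and the coordinates are those of Option_01 and Granier as shown in the tables above.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Option_01 (40.403677, -3.703524) to Granier (40.404785, -3.702284)
d_granier = haversine_m(40.403677, -3.703524, 40.404785, -3.702284)
print(round(d_granier))
```

The result should land within a few meters of the 162 m that the API reports for Granier, which suggests that `DIST` is a straight-line distance rather than a walking distance.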


### METHODOLOGY <a name="link_03"></a>

In this project we calculate two criteria to choose the best location to open a new bakery of the Levaduramadre brand.  
In the **first step** we collected and cleaned the required data and saved it in two dataframes (called the second and third datasets). To obtain the second dataset we first had to create a first dataset with only Levaduramadre data.  
In the **second step** we use the datasets to develop the criteria:
* The second dataset is used to develop the **first criterion**, which consists of clustering the neighborhoods that have a Levaduramadre together with the neighborhoods that may get one. We run one clustering with 4 clusters and another with 5 clusters (to have more data for the criterion).
* The third dataset is used to develop the **second criterion**, which consists of a weighted average of the distances between each possible future Levaduramadre and the bakeries within a radius of 250 meters around it.  

In the **third step** we analyze the results of the criteria, and finally we draw a conclusion.
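To make the second criterion concrete, here is a minimal sketch of a weighted average of distances. The weights used for the actual criterion are not defined at this point, so the equal weights below are only an assumption for illustration (they reduce the result to the plain mean). The helper name `weighted_mean` is hypothetical, and the distances are the five Option_01 rows of the third dataset.

```python
def weighted_mean(values, weights):
    """Weighted average: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError('values and weights must have the same length')
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError('the weights must not sum to zero')
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Distances (in meters) from Option_01 to its surrounding bakeries
option_01_distances = [162.0, 44.0, 154.0, 127.0, 261.0]

# Assumption: equal weights, which gives the plain mean of the distances
equal_weights = [1] * len(option_01_distances)
option_01_score = weighted_mean(option_01_distances, equal_weights)
print(option_01_score)
```

Any other weighting scheme (for example, giving nearby bakeries more weight) only changes the `weights` list, so the options stay comparable as long as the same scheme is applied to all three.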


**First Criterion**<a name="link_03_01"></a>  

`Count of categories`  
We count the categories in the second dataset.

In [19]:
df_conteo=df_second_dataset['CATEGO'].value_counts(dropna=False,ascending=False).reset_index()
df_conteo.columns=['CATEGO','Conteo']
print(df_conteo.head(40))


print('\n\nDONE')

                                      CATEGO  Conteo
0                                        NaN      84
1                                        Bar      74
2                         Spanish Restaurant      68
3                                 Restaurant      59
4                         Salon / Barbershop      55
5                                     Office      55
6                           Tapas Restaurant      52
7                                     Bakery      45
8                                       Bank      43
9                                       Café      35
10                                Nail Salon      32
11                               Coffee Shop      30
12  Residential Building (Apartment / Condo)      28
13                            Clothing Store      27
14                          Dentist's Office      27
15                        Miscellaneous Shop      21
16                             Grocery Store      21
17                           Doctor's Office  

`Select the venues with top categories`  
To cluster with good data, we discard the venues whose category is 'NaN' and keep only the venues whose category appears at least 15 times.

In [20]:
print('Dimension of dataframe df_second_dataset before clean categories = ',df_second_dataset.shape,'\n\n')

list_catego_02=[]
for i in range(len(df_conteo)):
    if df_conteo.Conteo[i] >= 15 and df_conteo.CATEGO[i] != 'NaN':
        list_catego_02.append(df_conteo.CATEGO[i])
print(list_catego_02)


boolean_CATEGO = df_second_dataset.CATEGO.isin(list_catego_02)
df_second_dataset = df_second_dataset[boolean_CATEGO].reset_index(drop=True)
display(df_second_dataset)

print('Dimension of dataframe df_second_dataset after clean categories = ',df_second_dataset.shape)


print('\n\nDONE')

Dimension of dataframe df_second_dataset before clean categories =  (2013, 5) 


['Bar', 'Spanish Restaurant', 'Restaurant', 'Salon / Barbershop', 'Office', 'Tapas Restaurant', 'Bakery', 'Bank', 'Café', 'Nail Salon', 'Coffee Shop', 'Residential Building (Apartment / Condo)', 'Clothing Store', "Dentist's Office", 'Miscellaneous Shop', 'Grocery Store', "Doctor's Office", 'General Entertainment', 'Pharmacy', 'Bookstore', 'Automotive Shop', 'Pizza Place', 'Mediterranean Restaurant', 'Furniture / Home Store', 'Shoe Store', 'Medical Center', 'Snack Place', 'Building', 'Arts & Crafts Store', 'Brewery', 'Electronics Store']


Unnamed: 0,NAME,LAT,LON,REF,CATEGO
0,Levaduramadre,40.440712,-3.698919,Levaduramadre_0,Bakery
1,El Secreto de Ponzano,40.440700,-3.699052,Levaduramadre_0,Tapas Restaurant
2,El Escudo,40.440719,-3.699143,Levaduramadre_0,Spanish Restaurant
3,Candeli,40.440627,-3.699117,Levaduramadre_0,Spanish Restaurant
4,Cervecería Lola,40.440597,-3.698988,Levaduramadre_0,Bar
...,...,...,...,...,...
928,Gesycal Madrid,40.390672,-3.688332,Option_03,Office
929,SEDIGAS - Asociación Española del Gas,40.391221,-3.688252,Option_03,Office
930,Arkadin Spain,40.391340,-3.689081,Option_03,Office
931,Delivery Media,40.393087,-3.689400,Option_03,Office


Dimension of dataframe df_second_dataset after clean categories =  (933, 5)


DONE


`Preparing data to cluster`  
Now we group the data by Levaduramadre (existing and possible future ones) and compute, for each one, the mean frequency of every category among its surrounding venues.

In [21]:
df_clustering = pd.get_dummies(df_second_dataset[['CATEGO']])
df_clustering['grouping']=df_second_dataset.REF
df_clustering = df_clustering.groupby('grouping').mean().reset_index()
display(df_clustering)

df_cluster = df_clustering.drop(columns='grouping') # keep only numeric values in the dataframe

print('\n\nDONE')

Unnamed: 0,grouping,CATEGO_Arts & Crafts Store,CATEGO_Automotive Shop,CATEGO_Bakery,CATEGO_Bank,CATEGO_Bar,CATEGO_Bookstore,CATEGO_Brewery,CATEGO_Building,CATEGO_Café,...,CATEGO_Office,CATEGO_Pharmacy,CATEGO_Pizza Place,CATEGO_Residential Building (Apartment / Condo),CATEGO_Restaurant,CATEGO_Salon / Barbershop,CATEGO_Shoe Store,CATEGO_Snack Place,CATEGO_Spanish Restaurant,CATEGO_Tapas Restaurant
0,Levaduramadre_0,0.0,0.018868,0.056604,0.0,0.150943,0.018868,0.018868,0.0,0.018868,...,0.037736,0.037736,0.018868,0.018868,0.09434,0.037736,0.0,0.0,0.09434,0.169811
1,Levaduramadre_1,0.021277,0.021277,0.021277,0.06383,0.085106,0.021277,0.0,0.021277,0.042553,...,0.042553,0.042553,0.0,0.042553,0.021277,0.085106,0.0,0.021277,0.042553,0.021277
2,Levaduramadre_10,0.021739,0.021739,0.021739,0.108696,0.0,0.021739,0.0,0.065217,0.0,...,0.195652,0.021739,0.0,0.021739,0.043478,0.086957,0.021739,0.0,0.021739,0.0
3,Levaduramadre_11,0.0,0.037037,0.055556,0.074074,0.111111,0.0,0.0,0.018519,0.0,...,0.037037,0.018519,0.0,0.018519,0.074074,0.074074,0.055556,0.018519,0.055556,0.037037
4,Levaduramadre_12,0.020833,0.041667,0.0,0.0,0.083333,0.0,0.0,0.0,0.041667,...,0.020833,0.020833,0.020833,0.041667,0.083333,0.083333,0.0,0.0,0.1875,0.041667
5,Levaduramadre_13,0.0,0.069767,0.093023,0.023256,0.069767,0.023256,0.0,0.023256,0.069767,...,0.046512,0.023256,0.023256,0.023256,0.069767,0.023256,0.0,0.0,0.116279,0.023256
6,Levaduramadre_14,0.0,0.0,0.020833,0.0625,0.020833,0.0,0.0,0.0625,0.0,...,0.125,0.020833,0.0,0.020833,0.0625,0.041667,0.0,0.0,0.083333,0.020833
7,Levaduramadre_15,0.021277,0.0,0.06383,0.021277,0.042553,0.0,0.06383,0.0,0.06383,...,0.06383,0.021277,0.042553,0.0,0.106383,0.042553,0.0,0.021277,0.085106,0.085106
8,Levaduramadre_16,0.0,0.057692,0.096154,0.057692,0.019231,0.019231,0.038462,0.0,0.057692,...,0.0,0.0,0.038462,0.019231,0.019231,0.057692,0.038462,0.038462,0.096154,0.076923
9,Levaduramadre_17,0.020833,0.0,0.083333,0.0625,0.0625,0.020833,0.0,0.020833,0.041667,...,0.041667,0.0,0.0,0.0,0.083333,0.104167,0.020833,0.020833,0.104167,0.020833




DONE
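One-hot encoding the category and then averaging per reference venue is equivalent to computing, for each venue, the share of its surrounding venues that fall in each category. A minimal sketch of that equivalence in plain Python, on toy data (the rows and the helper name `category_shares` are hypothetical):

```python
from collections import Counter, defaultdict

def category_shares(rows):
    """rows: iterable of (reference, category) pairs.
    Returns {reference: {category: share of that reference's venues}}."""
    counts = defaultdict(Counter)
    for ref, category in rows:
        counts[ref][category] += 1
    shares = {}
    for ref, counter in counts.items():
        total = sum(counter.values())
        shares[ref] = {cat: n / total for cat, n in counter.items()}
    return shares

# Toy data: four venues around Option_01, two around Option_02
toy_rows = [
    ('Option_01', 'Bar'), ('Option_01', 'Bar'),
    ('Option_01', 'Bakery'), ('Option_01', 'Bank'),
    ('Option_02', 'Office'), ('Option_02', 'Bakery'),
]
toy_shares = category_shares(toy_rows)
print(toy_shares['Option_01'])
```

Because every row of the resulting table sums to 1, locations with many surrounding venues and locations with few are compared on the same scale.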


###### Cluster (n=4)

`Develop cluster (n=4)`  
Apply the k-means model to group the locations into 4 clusters.

In [22]:
# Run model
model_001 = KMeans(init="k-means++", n_clusters=4, n_init=12, random_state=0).fit(df_cluster)

# See clusters
print(model_001.labels_)


print('\n\nDONE')

[2 0 2 2 0 0 2 0 0 0 0 0 3 0 2 2 2 2 2 2 1]


DONE


`Make a dataframe with venue data and cluster`

In [23]:
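# groupby sorted the rows of df_clustering alphabetically (Levaduramadre_10 comes before Levaduramadre_2),
# so the labels are matched to the venues by name instead of by position
cluster_by_name=dict(zip(df_clustering['grouping'], model_001.labels_))
df_first_cluster=df_first_dataset.copy()
df_first_cluster['CLUSTER']=df_first_cluster['NAME'].map(cluster_by_name)
display(df_first_cluster.head())



print('\n\nDONE')

Unnamed: 0,NAME,LAT,LON,CLUSTER
0,Levaduramadre_0,40.440712,-3.698919,2
1,Levaduramadre_1,40.42673,-3.671556,0
2,Levaduramadre_2,40.410534,-3.706628,0
3,Levaduramadre_3,40.422757,-3.70419,0
4,Levaduramadre_4,40.424831,-3.701068,3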




DONE
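One detail to keep in mind when attaching `labels_` to venue names: pandas `groupby` sorts the group keys by default, and string sorting puts `Levaduramadre_10` before `Levaduramadre_2`. The labels follow the row order of the grouped dataframe, so pairing them with names by key is safer than pairing them by position. A minimal sketch in plain Python, using the label array printed for the 4-cluster model (the variable names are hypothetical):

```python
# Names in the order of the first dataset: 18 existing bakeries, then the 3 options
names_in_dataset_order = ['Levaduramadre_{}'.format(i) for i in range(18)]
names_in_dataset_order += ['Option_01', 'Option_02', 'Option_03']

# Order of the rows after a groupby on the name (alphabetical string sort)
names_in_group_order = sorted(names_in_dataset_order)

# Labels printed by the 4-cluster model, one per grouped row
labels = [2, 0, 2, 2, 0, 0, 2, 0, 0, 0, 0, 0, 3, 0, 2, 2, 2, 2, 2, 2, 1]

# Pair each label with its name by key rather than by position
cluster_by_name = dict(zip(names_in_group_order, labels))

print(names_in_group_order[:4])
print({name: cluster_by_name[name] for name in ('Option_01', 'Option_02', 'Option_03')})
```

The three options sort after every `Levaduramadre_*` name, so they occupy the last three positions in both orderings; only the existing bakeries can be affected by a positional mismatch.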


`Plot the first clustering on a map`

In [24]:
# Madrid coordinates
latitud=40.4165000
longitud=-3.7025600

# Create a map centered on Madrid
madrid_map = folium.Map(location=[latitud, longitud], zoom_start=12) 

# Add one marker per location, colored by cluster
for name, lat, lon, cluster in zip(df_first_cluster.NAME, df_first_cluster.LAT, df_first_cluster.LON, df_first_cluster.CLUSTER):
    if cluster == 0:
        folium.Marker([lat, lon],tooltip=cluster,icon=(folium.Icon(icon='',color='cadetblue'))).add_to(madrid_map)
    elif cluster == 1:
        folium.Marker([lat, lon],tooltip=cluster,icon=(folium.Icon(icon='',color='orange'))).add_to(madrid_map)
    elif cluster == 2:
        folium.Marker([lat, lon],tooltip=cluster,icon=(folium.Icon(icon='',color='green'))).add_to(madrid_map)
    else:
        folium.Marker([lat, lon],tooltip=cluster,icon=(folium.Icon(icon='',color='red'))).add_to(madrid_map)
        
# Draw the candidate Levaduramadre zones
folium.CircleMarker(
    [40.404, -3.675],
    radius=25,
    color='red',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.1
    ).add_to(madrid_map)

folium.CircleMarker(
    [40.403, -3.708],
    radius=25,
    color='red',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.1
    ).add_to(madrid_map)

folium.CircleMarker(
    [40.391, -3.686],
    radius=25,
    color='red',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.1
    ).add_to(madrid_map)

# Show the map
display(madrid_map)


print('\n\nDONE')



DONE


`Count how many locations fall in each cluster`

In [25]:
df_count_cluster_01=df_first_cluster.copy()
df_count_cluster_01.set_index('CLUSTER', inplace=True)
df_count_cluster_01.index.name = None
display(df_count_cluster_01.index.value_counts())


print('\n\nDONE')

2    10
0     9
3     1
1     1
dtype: int64



DONE


###### Cluster (n=5)

`Develop cluster (n=5)`  
Fit a K-means model that groups the locations into 5 clusters.

In [26]:
# Run model
model_002 = KMeans(init="k-means++", n_clusters=5, n_init=12, random_state=0).fit(df_cluster)

# See clusters
print(model_002.labels_)


print('\n\nDONE')

[0 2 3 0 2 0 3 0 0 0 2 0 4 0 0 0 0 0 0 0 1]


DONE


`Build a dataframe with the location data and its cluster`

In [27]:
df_second_cluster=df_first_dataset.copy()
df_second_cluster['CLUSTER']=model_002.labels_
display(df_second_cluster.head())



print('\n\nDONE')

Unnamed: 0,NAME,LAT,LON,CLUSTER
0,Levaduramadre_0,40.440712,-3.698919,0
1,Levaduramadre_1,40.42673,-3.671556,2
2,Levaduramadre_2,40.410534,-3.706628,3
3,Levaduramadre_3,40.422757,-3.70419,0
4,Levaduramadre_4,40.424831,-3.701068,2




DONE


`Plot the second clustering on a map`

In [28]:
# Madrid coordinates
latitud=40.4165000
longitud=-3.7025600

# Create a map centered on Madrid
madrid_map = folium.Map(location=[latitud, longitud], zoom_start=12) 

# Add one marker per location, colored by cluster
for name, lat, lon, cluster in zip(df_second_cluster.NAME, df_second_cluster.LAT, df_second_cluster.LON, df_second_cluster.CLUSTER):
    if cluster == 0:
        folium.Marker([lat, lon],tooltip=cluster,icon=(folium.Icon(icon='',color='cadetblue'))).add_to(madrid_map)
    elif cluster == 1:
        folium.Marker([lat, lon],tooltip=cluster,icon=(folium.Icon(icon='',color='orange'))).add_to(madrid_map)
    elif cluster == 2:
        folium.Marker([lat, lon],tooltip=cluster,icon=(folium.Icon(icon='',color='green'))).add_to(madrid_map)
    elif cluster == 3:
        folium.Marker([lat, lon],tooltip=cluster,icon=(folium.Icon(icon='',color='purple'))).add_to(madrid_map)        
    else:
        folium.Marker([lat, lon],tooltip=cluster,icon=(folium.Icon(icon='',color='red'))).add_to(madrid_map)
        
# Draw the candidate Levaduramadre zones
folium.CircleMarker(
    [40.404, -3.675],
    radius=25,
    color='red',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.1
    ).add_to(madrid_map)

folium.CircleMarker(
    [40.403, -3.708],
    radius=25,
    color='red',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.1
    ).add_to(madrid_map)

folium.CircleMarker(
    [40.391, -3.686],
    radius=25,
    color='red',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.1
    ).add_to(madrid_map)

# Show the map
display(madrid_map)


print('\n\nDONE')



DONE
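
The `if`/`elif` chain that picks a marker color in the map cells could also be written as a dictionary lookup. The sketch below is a minimal illustration that does not need folium; `CLUSTER_COLORS` and `marker_color` are hypothetical names that are not part of the original notebook, and the colors are the ones used in the cell above.

```python
# Hypothetical helper (not in the original notebook): map a cluster label
# to the marker color used in the map cells.
CLUSTER_COLORS = {0: "cadetblue", 1: "orange", 2: "green", 3: "purple"}


def marker_color(cluster, default="red"):
    """Return the marker color for a cluster; any other label falls back to red."""
    return CLUSTER_COLORS.get(int(cluster), default)


# Show the color chosen for each of the five cluster labels.
for label in range(5):
    print(label, marker_color(label))
```

Inside the plotting loop, `folium.Icon(icon='', color=marker_color(cluster))` could then replace the separate branches.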


`Count how many locations fall in each cluster`

In [29]:
df_count_cluster_02=df_second_cluster.copy()
df_count_cluster_02.set_index('CLUSTER', inplace=True)
df_count_cluster_02.index.name = None
display(df_count_cluster_02.index.value_counts())


print('\n\nDONE')

0    14
2     3
3     2
4     1
1     1
dtype: int64



DONE


**Second criterion**<a name="link_03_02"></a>  

`Calculate the average`  
First we compute the plain (unweighted) average distance to the surrounding bakeries, and then three cases of weighted average.

Since Option_03 has no bakeries nearby, the averages are computed only for Option_01 and Option_02.

In [30]:
list_avera_01=[]
list_avera_02=[]
for i in range(len(df_third_dataset)):
    if df_third_dataset.iloc[i,2] == 'Option_01':
        list_avera_01.append(df_third_dataset.iloc[i,1])
    elif df_third_dataset.iloc[i,2] == 'Option_02':
        list_avera_02.append(df_third_dataset.iloc[i,1])        
average_01=(sum(list_avera_01))/(len(list_avera_01))
average_02=(sum(list_avera_02))/(len(list_avera_02))
print('Average of Option_01 =',average_01.round())
print('Average of Option_02 =',average_02.round())


print('\n\nDONE')

Average of Option_01 = 150.0
Average of Option_02 = 155.0


DONE


*CASE 01*

To compute the weighted average, we first define the weight assigned to each distance range.

| Weight | Distance range (m) |
| -- | -- | 
| 0.8 | 0-75 |
| 0.6 | 75-150 |
| 0.4 | 150-250 |
| 0.2 | >250 |
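
The weight table can be read as a step function of the distance. The sketch below expresses it with the standard-library `bisect` module; `UPPER_BOUNDS`, `WEIGHTS` and `weight_for` are hypothetical names chosen for illustration, and each upper bound is treated as inclusive, as in the notebook's `<=` comparisons.

```python
from bisect import bisect_left

# Upper bounds (in meters) of the first three ranges, and the Case 01 weight
# of each of the four ranges. Names are illustrative, not from the notebook.
UPPER_BOUNDS = [75, 150, 250]
WEIGHTS = [0.8, 0.6, 0.4, 0.2]


def weight_for(distance):
    """Return the weight of a distance; each upper bound is inclusive."""
    return WEIGHTS[bisect_left(UPPER_BOUNDS, distance)]


# Distances of the Option_01 bakeries as shown in the notebook output.
for d in [162.0, 44.0, 154.0, 127.0, 261.0]:
    print(d, weight_for(d))
```

Changing the scheme then only requires a different `WEIGHTS` list, rather than a new chain of comparisons.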

`Assign weight in dataframe`

In [31]:
list_weight=[]
for i in range(len(df_third_dataset)):
    if df_third_dataset.iloc[i,1] <=75:
        list_weight.append(0.8)
    elif df_third_dataset.iloc[i,1] <=150:
        list_weight.append(0.6)
    elif df_third_dataset.iloc[i,1] <=250:
        list_weight.append(0.4)
    else:
        list_weight.append(0.2)
    
df_third_dataset['PESOS']=list_weight
display(df_third_dataset)


print('\n\nDONE')

Unnamed: 0,NAME,DIST,OPT,PESOS
0,Granier,162.0,Option_01,0.4
1,Panaderia Rovier,44.0,Option_01,0.8
2,La Rinconada Pasteleria,154.0,Option_01,0.4
3,PANISHOP,127.0,Option_01,0.6
4,Ytalia Bakery & Coffee,261.0,Option_01,0.2
0,La Petite Sara,128.0,Option_02,0.6
1,VillaGarcia Pasteleria,49.0,Option_02,0.8
2,Horno San Miguel,124.0,Option_02,0.6
3,Panadería San Miguel,146.0,Option_02,0.6
4,L'atelier Del Pan,151.0,Option_02,0.4




DONE


`Calculate the weighted average for each candidate Levaduramadre`  

In [32]:
list_mult_01=[]
list_peso_01=[]
for i in range(len(df_third_dataset)):
    if df_third_dataset.iloc[i,2] == 'Option_01':
        multi=(df_third_dataset.iloc[i,1])*(df_third_dataset.iloc[i,3])
        list_mult_01.append(multi.round())
        list_peso_01.append(df_third_dataset.iloc[i,3])
display(list_mult_01)
display(list_peso_01)
media_01=(sum(list_mult_01))/(sum(list_peso_01))
print('Weighted average of Option_01 =',media_01.round())


print('\n\nDONE')

[65.0, 35.0, 62.0, 76.0, 52.0]

[0.4, 0.8, 0.4, 0.6, 0.2]

Weighted average of Option_01 = 121.0


DONE


In [33]:
list_mult_02=[]
list_peso_02=[]
for i in range(len(df_third_dataset)):
    if df_third_dataset.iloc[i,2] == 'Option_02':
        multi=(df_third_dataset.iloc[i,1])*(df_third_dataset.iloc[i,3])
        list_mult_02.append(multi.round())
        list_peso_02.append(df_third_dataset.iloc[i,3])
display(list_mult_02)
display(list_peso_02)
media_02=(sum(list_mult_02))/(sum(list_peso_02))
print('Weighted average of Option_02 =',media_02.round())


print('\n\nDONE')

[77.0, 39.0, 74.0, 88.0, 60.0, 62.0, 84.0, 61.0, 56.0]

[0.6, 0.8, 0.6, 0.6, 0.4, 0.4, 0.4, 0.4, 0.2]

Weighted average of Option_02 = 137.0


DONE
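
The two cells above compute the sum of distance times weight divided by the sum of weights. As a minimal sketch, the same calculation can be wrapped in one helper; `weighted_average` is a hypothetical name, and unlike the notebook cells it does not round each product before summing. The input values are the Option_01 distances and Case 01 weights shown in the output.

```python
def weighted_average(values, weights):
    """Return sum(w * v) / sum(w) for two sequences of equal length."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Option_01 distances (m) and their Case 01 weights, from the notebook output.
distances = [162.0, 44.0, 154.0, 127.0, 261.0]
weights = [0.4, 0.8, 0.4, 0.6, 0.2]

result = weighted_average(distances, weights)
print("Weighted average of Option_01 =", round(result))  # rounds to 121
```

For these values, skipping the intermediate rounding gives the same rounded result as the notebook.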


*CASE 02*

`First change of weights`  
To see how much the choice of weights affects the result, we increase the weights of the shortest distance ranges.

Starting from the Case 01 weights, we double the first weight, increase the second by half of its value, increase the third by a quarter of its value, and halve the last one.

| Weight | Distance range (m) |
| -- | -- | 
| 1.6 | 0-75 |
| 0.9 | 75-150 |
| 0.5 | 150-250 |
| 0.1 | >250 |
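
The rule for Case 02 amounts to multiplying the Case 01 weights by 2, 1.5, 1.25 and 0.5. The short sketch below checks that arithmetic; `CASE_01`, `FACTORS` and `case_02` are illustrative names, and the products are rounded to two decimals only to hide floating-point noise.

```python
# Case 01 weights, ordered from the shortest to the longest distance range.
CASE_01 = [0.8, 0.6, 0.4, 0.2]

# Double, add a half, add a quarter, halve.
FACTORS = [2.0, 1.5, 1.25, 0.5]

# Rounding removes floating-point noise such as 0.8999999999999999.
case_02 = [round(w * f, 2) for w, f in zip(CASE_01, FACTORS)]
print(case_02)
```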

`Assign weight in dataframe`

In [34]:
# Copy the dataset so the Case 01 weights in df_third_dataset are not overwritten
df_third_dataset_02=df_third_dataset.copy()
list_weight_02=[]
for i in range(len(df_third_dataset_02)):
    if df_third_dataset_02.iloc[i,1] <=75:
        list_weight_02.append(1.6)
    elif df_third_dataset_02.iloc[i,1] <=150:
        list_weight_02.append(0.9)
    elif df_third_dataset_02.iloc[i,1] <=250:
        list_weight_02.append(0.5)
    else:
        list_weight_02.append(0.1)
    
df_third_dataset_02['PESOS']=list_weight_02
display(df_third_dataset_02)


print('\n\nDONE')

Unnamed: 0,NAME,DIST,OPT,PESOS
0,Granier,162.0,Option_01,0.5
1,Panaderia Rovier,44.0,Option_01,1.6
2,La Rinconada Pasteleria,154.0,Option_01,0.5
3,PANISHOP,127.0,Option_01,0.9
4,Ytalia Bakery & Coffee,261.0,Option_01,0.1
0,La Petite Sara,128.0,Option_02,0.9
1,VillaGarcia Pasteleria,49.0,Option_02,1.6
2,Horno San Miguel,124.0,Option_02,0.9
3,Panadería San Miguel,146.0,Option_02,0.9
4,L'atelier Del Pan,151.0,Option_02,0.5




DONE


`Calculate the weighted average for each candidate Levaduramadre`  

In [35]:
list_mult_03=[]
list_peso_03=[]
for i in range(len(df_third_dataset_02)):
    if df_third_dataset_02.iloc[i,2] == 'Option_01':
        multi=(df_third_dataset_02.iloc[i,1])*(df_third_dataset_02.iloc[i,3])
        list_mult_03.append(multi.round())
        list_peso_03.append(df_third_dataset_02.iloc[i,3])
display(list_mult_03)
display(list_peso_03)
media_03=(sum(list_mult_03))/(sum(list_peso_03))
print('Weighted average of Option_01 =',media_03.round())


print('\n\nDONE')

[81.0, 70.0, 77.0, 114.0, 26.0]

[0.5, 1.6, 0.5, 0.9, 0.1]

Weighted average of Option_01 = 102.0


DONE


In [36]:
list_mult_04=[]
list_peso_04=[]
for i in range(len(df_third_dataset_02)):
    if df_third_dataset_02.iloc[i,2] == 'Option_02':
        multi=(df_third_dataset_02.iloc[i,1])*(df_third_dataset_02.iloc[i,3])
        list_mult_04.append(multi.round())
        list_peso_04.append(df_third_dataset_02.iloc[i,3])
display(list_mult_04)
display(list_peso_04)
media_04=(sum(list_mult_04))/(sum(list_peso_04))
print('Weighted average of Option_02 =',media_04.round())


print('\n\nDONE')

[115.0, 78.0, 112.0, 131.0, 76.0, 78.0, 104.0, 76.0, 28.0]

[0.9, 1.6, 0.9, 0.9, 0.5, 0.5, 0.5, 0.5, 0.1]

Weighted average of Option_02 = 125.0


DONE


*CASE 03*

`Second change of weights`  
To see how much the choice of weights affects the result, we now increase the weights of the longest distance ranges.

For this case we use the same weights as in Case 01, but in the opposite order.

| Weight | Distance range (m) |
| -- | -- | 
| 0.2 | 0-75 |
| 0.4 | 75-150 |
| 0.6 | 150-250 |
| 0.8 | >250 |

`Assign weight in dataframe`

In [37]:
# Copy the dataset so the weights stored in df_third_dataset are not overwritten
df_third_dataset_03=df_third_dataset.copy()
list_weight_03=[]
for i in range(len(df_third_dataset_03)):
    if df_third_dataset_03.iloc[i,1] <=75:
        list_weight_03.append(0.2)
    elif df_third_dataset_03.iloc[i,1] <=150:
        list_weight_03.append(0.4)
    elif df_third_dataset_03.iloc[i,1] <=250:
        list_weight_03.append(0.6)
    else:
        list_weight_03.append(0.8)
    
df_third_dataset_03['PESOS']=list_weight_03
display(df_third_dataset_03)


print('\n\nDONE')

Unnamed: 0,NAME,DIST,OPT,PESOS
0,Granier,162.0,Option_01,0.6
1,Panaderia Rovier,44.0,Option_01,0.2
2,La Rinconada Pasteleria,154.0,Option_01,0.6
3,PANISHOP,127.0,Option_01,0.4
4,Ytalia Bakery & Coffee,261.0,Option_01,0.8
0,La Petite Sara,128.0,Option_02,0.4
1,VillaGarcia Pasteleria,49.0,Option_02,0.2
2,Horno San Miguel,124.0,Option_02,0.4
3,Panadería San Miguel,146.0,Option_02,0.4
4,L'atelier Del Pan,151.0,Option_02,0.6




DONE


`Calculate the weighted average for each candidate Levaduramadre`  

In [38]:
list_mult_05=[]
list_peso_05=[]
for i in range(len(df_third_dataset_03)):
    if df_third_dataset_03.iloc[i,2] == 'Option_01':
        multi=(df_third_dataset_03.iloc[i,1])*(df_third_dataset_03.iloc[i,3])
        list_mult_05.append(multi.round())
        list_peso_05.append(df_third_dataset_03.iloc[i,3])
display(list_mult_05)
display(list_peso_05)
media_05=(sum(list_mult_05))/(sum(list_peso_05))
print('Weighted average of Option_01 =',media_05.round())


print('\n\nDONE')

[97.0, 9.0, 92.0, 51.0, 209.0]

[0.6, 0.2, 0.6, 0.4, 0.8]

Weighted average of Option_01 = 176.0


DONE


In [39]:
list_mult_06=[]
list_peso_06=[]
for i in range(len(df_third_dataset_03)):
    if df_third_dataset_03.iloc[i,2] == 'Option_02':
        multi=(df_third_dataset_03.iloc[i,1])*(df_third_dataset_03.iloc[i,3])
        list_mult_06.append(multi.round())
        list_peso_06.append(df_third_dataset_03.iloc[i,3])
display(list_mult_06)
display(list_peso_06)
media_06=(sum(list_mult_06))/(sum(list_peso_06))
print('Weighted average of Option_02 =',media_06.round())


print('\n\nDONE')

[51.0, 10.0, 50.0, 58.0, 91.0, 93.0, 125.0, 92.0, 222.0]

[0.4, 0.2, 0.4, 0.4, 0.6, 0.6, 0.6, 0.6, 0.8]

Weighted average of Option_02 = 172.0


DONE


### RESULTS / DISCUSSION <a name="link_04"></a>

**First Criterion**<a name="link_04_01"></a>  

In [40]:
print('RESULT OF CLUSTER (n=4)')
display(df_count_cluster_01.index.value_counts())
display(df_first_cluster.tail(3))
print('\n\nRESULT OF CLUSTER (n=5)')
display(df_count_cluster_02.index.value_counts())
display(df_second_cluster.tail(3))


print('\n\nDONE')

RESULT OF CLUSTER (n=4)


2    10
0     9
3     1
1     1
dtype: int64

Unnamed: 0,NAME,LAT,LON,CLUSTER
18,Option_01,40.403677,-3.703524,2
19,Option_02,40.405408,-3.67673,2
20,Option_03,40.392024,-3.688217,1




RESULT OF CLUSTER (n=5)


0    14
2     3
3     2
4     1
1     1
dtype: int64

Unnamed: 0,NAME,LAT,LON,CLUSTER
18,Option_01,40.403677,-3.703524,0
19,Option_02,40.405408,-3.67673,0
20,Option_03,40.392024,-3.688217,1




DONE


*Discussion first criterion*   

As the results of the first criterion show, Option_01 and Option_02 fall in the largest cluster in both clustering models (10 of 21 locations for n=4 and 14 of 21 for n=5), so these locations can be considered good candidates for a new Levaduramadre.  

Option_03, however, is alone in its cluster in both models, so it is not similar to any existing Levaduramadre neighborhood. This location is therefore not recommended for a new Levaduramadre, at least not without a high level of risk.
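
The cluster counts above include the three candidates themselves. As a rough cross-check, the sketch below counts how many existing stores share a cluster with each candidate, using the labels exactly as printed by `model_001.labels_` and `model_002.labels_`. It assumes that the last three rows are Option_01, Option_02 and Option_03 (as `tail(3)` shows) and that the first 18 rows are existing stores; `stores_sharing_cluster` is a hypothetical helper name.

```python
from collections import Counter

# Labels as printed in the notebook for the two K-means models.
LABELS = {
    "n=4": [2, 0, 2, 2, 0, 0, 2, 0, 0, 0, 0, 0, 3, 0, 2, 2, 2, 2, 2, 2, 1],
    "n=5": [0, 2, 3, 0, 2, 0, 3, 0, 0, 0, 2, 0, 4, 0, 0, 0, 0, 0, 0, 0, 1],
}
OPTIONS = ["Option_01", "Option_02", "Option_03"]  # assumed to be rows 18-20


def stores_sharing_cluster(labels, n_options=3):
    """For each candidate (last rows), count existing stores with the same label."""
    existing, candidates = labels[:-n_options], labels[-n_options:]
    counts = Counter(existing)
    return [(label, counts[label]) for label in candidates]


summary = {model: stores_sharing_cluster(labels) for model, labels in LABELS.items()}
for model, rows in summary.items():
    for name, (label, shared) in zip(OPTIONS, rows):
        print(model, name, "cluster", label, "shared with", shared, "existing stores")
```

Under these assumptions, Option_01 and Option_02 share their cluster with 8 of the 18 existing stores for n=4 and with 12 of 18 for n=5, while Option_03 shares its cluster with none.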

**Second Criterion**<a name="link_04_02"></a>  

In [41]:
print('Average of Option_01 =',average_01.round())
print('Average of Option_02 =',average_02.round())

print('\nCASE 01')
print('Weighted average of Option_01 =',media_01.round())
print('Weighted average of Option_02 =',media_02.round())

print('\nCASE 02')
print('Weighted average of Option_01 =',media_03.round())
print('Weighted average of Option_02 =',media_04.round())

print('\nCASE 03')
print('Weighted average of Option_01 =',media_05.round())
print('Weighted average of Option_02 =',media_06.round())


print('\n\nDONE')

Average of Option_01 = 150.0
Average of Option_02 = 155.0

CASE 01
Weighted average of Option_01 = 121.0
Weighted average of Option_02 = 137.0

CASE 02
Weighted average of Option_01 = 102.0
Weighted average of Option_02 = 125.0

CASE 03
Weighted average of Option_01 = 176.0
Weighted average of Option_02 = 172.0


DONE


*Discussion second criterion*   

We discuss only Option_01 and Option_02, because Option_03 has no bakeries nearby. This can mean two things: either a bakery could open there without competition, or it is not a good place for a bakery (I lean toward the second).

The unweighted averages differ by only 5 meters, so the two options look very similar. When the weights favor shorter distances (Case 01), the average of Option_01 drops by 29 meters, more than that of Option_02 (18 meters), leaving a gap of 16 meters between them. When the weights of the shortest distances are increased further (Case 02), the gap grows to 23 meters. We can therefore say that nearby bakeries weigh more heavily on Option_01. When the weights favor longer distances instead (Case 03), the gap between the two options is 4 meters, with Option_01 now slightly higher. This is about the same size as in the unweighted case, so we conclude that distant bakeries have less influence on the candidate locations.
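
The differences quoted in this discussion can be reproduced from the rounded averages printed in the results cell. The sketch below is only that arithmetic; `AVERAGES`, `gaps` and `drops_case_01` are illustrative names.

```python
# Rounded averages (in meters) from the results cell: (Option_01, Option_02).
AVERAGES = {
    "unweighted": (150.0, 155.0),
    "case_01": (121.0, 137.0),
    "case_02": (102.0, 125.0),
    "case_03": (176.0, 172.0),
}

# Gap between the two options in each case (Option_02 minus Option_01).
gaps = {name: opt_02 - opt_01 for name, (opt_01, opt_02) in AVERAGES.items()}

# How far each option's average falls when moving from unweighted to Case 01.
drops_case_01 = tuple(
    plain - weighted
    for plain, weighted in zip(AVERAGES["unweighted"], AVERAGES["case_01"])
)

print(gaps)
print(drops_case_01)
```

A negative gap, as in Case 03, means that Option_01 has the larger average.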

### CONCLUSION <a name="link_05"></a>

The aim of this project is to select the best location for a new Levaduramadre bakery. We had to choose one of three options, and we considered two criteria to do so.  

With the first criterion we discard Option_03, for the reason given in the Discussion section.  

Based on the second criterion we consider Option_02 the best option, since Option_01 is more affected by the shortest distances, which we interpret as facing more competition.  

Finally **our recommendation** is to open the next Levaduramadre in the **Option_02 location**.