# Battle of the Neighborhoods - Capstone Project



## Table of contents
* [Introduction/Business Problem](#introduction)
* [Data](#datadiscussion)
* [Importing Libraries and Dataframe Creation](#table&libraries)
* [Obtaining the Neighborhoods' Geographical Data](#geo)
* [Clustering Analysis](#clusteranalysis)
* [Results and Discussion](#results&discussion)
* [Conclusion](#conclusions)

## 1. Introduction/Business Problem
<a name="introduction"></a>

A small company wants to enter the restaurant business, specifically by opening a Pizza Place. They plan to buy and remodel an available property in the center of São Paulo, the financial capital of Brazil and the country's largest consumer of pizza, but **do not know which location would best leverage their efforts to stand out in the food industry**. To answer that question, they want to use data science and machine learning.

## 2. Data
<a name="datadiscussion"></a> 

This project uses the neighborhood data available at https://www.hastedesign.com.br/lab/planilha-areas-de-entrega-por-ceps-de-sao-paulo-woocommerce/ to map out the city of São Paulo. The latitude and longitude of each neighborhood are then retrieved with geopy, a Python geocoding library. Next, the distance from each neighborhood to the geographical center of the city is calculated, and a distance criterion is used to select the centermost locations. The FourSquare API is then used to retrieve the Pizza Places and Italian Restaurants around each selected location. Finally, the venues are clustered with a machine learning algorithm and folium renders a map of the clusters, which shows which locations deserve attention as potential sites for the new business.

## 3. Importing Libraries and Dataframe Creation
<a name="table&libraries"></a>

In [1]:
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np #library to handle data in a vectorized manner 

import requests # library to handle requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# visualization library 
import seaborn as sns

!pip install folium
import folium # map rendering library

# import DBSCAN for the clustering stage
from sklearn.cluster import DBSCAN 

!pip install geopandas
!conda install -c conda-forge geopy -y



Next, we extract the data on São Paulo's neighborhoods from the .csv file found at the source mentioned in the **Data** section:

In [49]:
#read the .csv file into a pandas Dataframe
saopaulo_df = pd.read_csv('WooCommerce Zonas de Entrega - São Paulo - Áreas de Entrega.csv', header = None)

saopaulo_df.head()

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | | | | | |
| 1 | Zonas de Entrega por CEP - São Paulo e Região ... | | | | |
| 2 | | Nome da área | Regiões da área (CEP) | Descrição | Entrega |
| 3 | 1 | Zona Norte I | 02000...02099 | Santana / Carandiru / Vila Guilherme / Jardim ... | 10.00 |
| 4 | 2 | Zona Norte I | 02100...02199 | Vila Maria / Parque Novo Mundo / Jardim Japão | 10.00 |


In [50]:
#dropping the first 3 rows of the DF, which contain NaN values
saopaulo_df.dropna(axis = 0, inplace = True)

#dropping the columns [0,2,4] for they do not contain information that will be used in this notebook
saopaulo_df.drop([0, 2, 4], axis = 1, inplace = True)

#reset its index
saopaulo_df.reset_index(inplace = True, drop = True)

saopaulo_df.head()

| | 1 | 3 |
|---|---|---|
| 0 | Zona Norte I | Santana / Carandiru / Vila Guilherme / Jardim ... |
| 1 | Zona Norte I | Vila Maria / Parque Novo Mundo / Jardim Japão |
| 2 | Zona Norte II | Tucuruvi / Jaçanã / Parque Edu Chaves / Vila M... |
| 3 | Zona Norte II | Jardim Tremembé / Barro Branco / Água Fria |
| 4 | Zona Norte II | Mandaqui / Imirim / Lauzane Paulista / Santa T... |


In [51]:
#renaming the columns
saopaulo_df.rename(columns = {1:'Borough', 3: 'Neighborhood'}, inplace = True)

saopaulo_df.head()

| | Borough | Neighborhood |
|---|---|---|
| 0 | Zona Norte I | Santana / Carandiru / Vila Guilherme / Jardim ... |
| 1 | Zona Norte I | Vila Maria / Parque Novo Mundo / Jardim Japão |
| 2 | Zona Norte II | Tucuruvi / Jaçanã / Parque Edu Chaves / Vila M... |
| 3 | Zona Norte II | Jardim Tremembé / Barro Branco / Água Fria |
| 4 | Zona Norte II | Mandaqui / Imirim / Lauzane Paulista / Santa T... |


In [54]:
# Explode/Split column into multiple rows
df1 = pd.DataFrame(saopaulo_df['Neighborhood'].str.split(' / ').tolist(), index=saopaulo_df['Borough']).stack()

df1.head()

Borough        
Zona Norte I  0             Santana
              1           Carandiru
              2      Vila Guilherme
              3    Jardim São Paulo
              0          Vila Maria
dtype: object

In [55]:
#convert the Series object to DF
df2 = pd.DataFrame(df1)

#resetting its index
df2 = df2.reset_index([0, 'Borough'])

#renaming the columns
df2.rename(columns = { 0 : 'Neighborhood'}, inplace = True)


print(df2.head())

        Borough      Neighborhood
0  Zona Norte I           Santana
1  Zona Norte I         Carandiru
2  Zona Norte I    Vila Guilherme
3  Zona Norte I  Jardim São Paulo
4  Zona Norte I        Vila Maria
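
The split-and-stack steps above can also be written with `Series.str.split` followed by `DataFrame.explode` (available in pandas 0.25 and later). The following is a minimal sketch of that alternative, not the code the notebook ran; `toy_df` is a hand-made stand-in for `saopaulo_df` with two shortened rows.

```python
import pandas as pd

# Hypothetical stand-in for saopaulo_df (two shortened rows, for illustration only)
toy_df = pd.DataFrame({
    'Borough': ['Zona Norte I', 'Zona Norte I'],
    'Neighborhood': ['Santana / Carandiru',
                     'Vila Maria / Parque Novo Mundo / Jardim Japão'],
})

# Split each cell into a list, then give every list element its own row
exploded = (
    toy_df.assign(Neighborhood=toy_df['Neighborhood'].str.split(' / '))
          .explode('Neighborhood')
          .reset_index(drop=True)
)

print(exploded)
```

The result has one row per neighborhood and keeps the `Borough` value of the row it came from, so no separate rename or index reset by level is needed.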


## 4. Obtaining the Neighborhoods' Geographical Data 
<a name="geo"></a>

In [7]:
#create a Series object with the same number of rows as df2, holding the repeated value 'Região Imediata de São Paulo'.
#It is appended to each neighborhood name further ahead, just before the addresses are fed to the geocoder
saopaulo_region = pd.Series(['Região Imediata de São Paulo'])

#repeat the only row 224 times
saopaulo_region = saopaulo_region.repeat(224)

#resetting its index
saopaulo_region.reset_index(inplace = True, drop = True)

#converting to a DF
saopaulo_region = saopaulo_region.to_frame()

#rename its column
saopaulo_region.rename(columns = { 0 : 'Region' }, inplace = True)

saopaulo_region.head()

| | Region |
|---|---|
| 0 | Região Imediata de São Paulo |
| 1 | Região Imediata de São Paulo |
| 2 | Região Imediata de São Paulo |
| 3 | Região Imediata de São Paulo |
| 4 | Região Imediata de São Paulo |


In [8]:
#merging the data from two DFs: saopaulo_region and df2
df = pd.merge(df2, saopaulo_region, left_index = True, right_index = True)

df.head()

| | Borough | Neighborhood | Region |
|---|---|---|---|
| 0 | Zona Norte I | Santana | Região Imediata de São Paulo |
| 1 | Zona Norte I | Carandiru | Região Imediata de São Paulo |
| 2 | Zona Norte I | Vila Guilherme | Região Imediata de São Paulo |
| 3 | Zona Norte I | Jardim São Paulo | Região Imediata de São Paulo |
| 4 | Zona Norte I | Vila Maria | Região Imediata de São Paulo |


In [9]:
#correct one of the values of the DF
df.at[11, 'Neighborhood'] = 'Vila Ede'

#aggregating the values of ['Neighborhood'] and ['Region'] into a new column, ['Address']
df['Address'] = df[['Neighborhood', 'Region']].agg(', '.join, axis=1)

df.head()

| | Borough | Neighborhood | Region | Address |
|---|---|---|---|---|
| 0 | Zona Norte I | Santana | Região Imediata de São Paulo | Santana, Região Imediata de São Paulo |
| 1 | Zona Norte I | Carandiru | Região Imediata de São Paulo | Carandiru, Região Imediata de São Paulo |
| 2 | Zona Norte I | Vila Guilherme | Região Imediata de São Paulo | Vila Guilherme, Região Imediata de São Paulo |
| 3 | Zona Norte I | Jardim São Paulo | Região Imediata de São Paulo | Jardim São Paulo, Região Imediata de São Paulo |
| 4 | Zona Norte I | Vila Maria | Região Imediata de São Paulo | Vila Maria, Região Imediata de São Paulo |
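
The region and address columns can also be built without hard-coding the row count (224) or merging a helper frame. The sketch below is one possible shortcut, assuming the same column names as above; `toy_df2` is a hypothetical two-row stand-in for `df2`.

```python
import pandas as pd

REGION = 'Região Imediata de São Paulo'

# Hypothetical stand-in for df2 (illustration only)
toy_df2 = pd.DataFrame({
    'Borough': ['Zona Norte I', 'Zona Norte I'],
    'Neighborhood': ['Santana', 'Carandiru'],
})

# assign() broadcasts the scalar to every row, whatever the frame's length
with_address = toy_df2.assign(Region=REGION)

# Element-wise string concatenation builds the text passed to the geocoder
with_address['Address'] = with_address['Neighborhood'] + ', ' + with_address['Region']

print(with_address['Address'].tolist())
```

Because the scalar is broadcast, the code keeps working if the number of neighborhoods in the source file changes.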


In [10]:
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

locator = Nominatim(user_agent='my_Geocoder')

# 1 - convenient function to delay between geocoding calls
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)

# 2 - create location column
df['Location'] = df['Address'].apply(geocode)

# 3 - create longitude, latitude and altitude from location column (returns tuple)
df['Point'] = df['Location'].apply(lambda loc: tuple(loc.point) if loc else None)

# 4 - split point column into latitude, longitude and altitude columns
df[['Latitude', 'Longitude', 'Altitude']] = pd.DataFrame(df['Point'].tolist(), index=df.index)

In [11]:
df

| | Borough | Neighborhood | Region | Address | Location | Point | Latitude | Longitude | Altitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Zona Norte I | Santana | Região Imediata de São Paulo | Santana, Região Imediata de São Paulo | (Santana, São Paulo, Região Imediata de São Pa... | (-23.499321000000002, -46.6289326497014, 0.0) | -23.499321 | -46.628933 | 0.0 |
| 1 | Zona Norte I | Carandiru | Região Imediata de São Paulo | Carandiru, Região Imediata de São Paulo | (Carandiru, 2487, Avenida Cruzeiro do Sul, San... | (-23.5095466, -46.62497714745851, 0.0) | -23.509547 | -46.624977 | 0.0 |
| 2 | Zona Norte I | Vila Guilherme | Região Imediata de São Paulo | Vila Guilherme, Região Imediata de São Paulo | (Vila Guilherme, São Paulo, Região Imediata de... | (-23.5096065, -46.606228901507535, 0.0) | -23.509607 | -46.606229 | 0.0 |
| 3 | Zona Norte I | Jardim São Paulo | Região Imediata de São Paulo | Jardim São Paulo, Região Imediata de São Paulo | (Jardim São Paulo, Itaquaquecetuba, Região Ime... | (-23.4662552, -46.319589, 0.0) | -23.466255 | -46.319589 | 0.0 |
| 4 | Zona Norte I | Vila Maria | Região Imediata de São Paulo | Vila Maria, Região Imediata de São Paulo | (Vila Maria, São Paulo, Região Imediata de São... | (-23.512369999999997, -46.57558413659218, 0.0) | -23.51237 | -46.575584 | 0.0 |
| 5 | Zona Norte I | Parque Novo Mundo | Região Imediata de São Paulo | Parque Novo Mundo, Região Imediata de São Paulo | (Parque Novo Mundo, Vila Maria, São Paulo, Reg... | (-23.5144938, -46.5684596, 0.0) | -23.514494 | -46.56846 | 0.0 |
| 6 | Zona Norte I | Jardim Japão | Região Imediata de São Paulo | Jardim Japão, Região Imediata de São Paulo | (Jardim Japão, Caucaia do Alto, Cotia, Região ... | (-23.6645771, -47.0740279, 0.0) | -23.664577 | -47.074028 | 0.0 |
| 7 | Zona Norte II | Tucuruvi | Região Imediata de São Paulo | Tucuruvi, Região Imediata de São Paulo | (Tucuruvi, 100, Avenida Doutor Antônio Maria L... | (-23.4800747, -46.6032701, 0.0) | -23.480075 | -46.60327 | 0.0 |
| 8 | Zona Norte II | Jaçanã | Região Imediata de São Paulo | Jaçanã, Região Imediata de São Paulo | (Jaçanã, São Paulo, Região Imediata de São Pau... | (-23.4579935, -46.57694665772727, 0.0) | -23.457994 | -46.576947 | 0.0 |
| 9 | Zona Norte II | Parque Edu Chaves | Região Imediata de São Paulo | Parque Edu Chaves, Região Imediata de São Paulo | (Parque Edu Chaves, Jardim Modelo, Jaçanã, São... | (-23.475745, -46.5668027, 0.0) | -23.475745 | -46.566803 | 0.0 |
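
The geocoder returns `None` when it finds no match, which is why the lambda above guards with `if loc else None`. The sketch below shows one way to make that handling explicit and to list the rows that need a manual lookup. `FakeLocation` and `to_lat_lon` are hypothetical names introduced here so the example runs offline; geopy's real `Location` objects expose the same `.latitude` and `.longitude` attributes.

```python
import math
from collections import namedtuple

# Hypothetical offline stand-in for geopy's Location object
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])

def to_lat_lon(location):
    """Return (latitude, longitude), or (nan, nan) when geocoding found nothing."""
    if location is None:
        return (math.nan, math.nan)
    return (location.latitude, location.longitude)

# One successful lookup and one failed lookup (illustrative values)
results = [FakeLocation(-23.499321, -46.628933), None]
lat_lon_pairs = [to_lat_lon(loc) for loc in results]

# Positions of the entries that still need coordinates
missing = [i for i, (lat, _) in enumerate(lat_lon_pairs) if math.isnan(lat)]
print(missing)
```

Listing the missing positions this way makes it easier to see which rows have to be filled in by hand or dropped.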


In [12]:
#drop the columns with redundant or irrelevant information concerning the analysis that will be performed later on 
df.drop(['Point', 'Region', 'Location', 'Altitude'], axis = 1, inplace = True)

#some entries came back without geographical information. The coordinates of each of these places were looked up on Google and entered manually into the DF,
#except for the last 3, which belong to different municipalities
df.loc[44, ['Latitude', 'Longitude']] = [-23.5890068, -46.5818722]
df.loc[64, ['Latitude', 'Longitude']] = [-23.5003246, -46.5175597]
df.loc[67, ['Latitude', 'Longitude']] = [-23.4992570, -46.4779202]
df.loc[96, ['Latitude', 'Longitude']] = [-23.6445419, -46.6714579]
df.loc[105, ['Latitude', 'Longitude']] = [-23.5908905, -46.6844933]
df.loc[113, ['Latitude', 'Longitude']] = [-23.6438238, -46.6738619]
df.loc[115, ['Latitude', 'Longitude']] = [-23.6273397, -46.7015531]
df.loc[126, ['Latitude', 'Longitude']] = [-23.686726, -46.7047161]
df.loc[156, ['Latitude', 'Longitude']] = [-23.5865182, -46.6999690]
df.loc[170, ['Latitude', 'Longitude']] = [-23.5426083, -46.6290212]

In [13]:
#drop the NaN rows
df.dropna(inplace = True)

#reset the index
df.reset_index(inplace = True, drop = True)

df.head()

| | Borough | Neighborhood | Address | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Zona Norte I | Santana | Santana, Região Imediata de São Paulo | -23.499321 | -46.628933 |
| 1 | Zona Norte I | Carandiru | Carandiru, Região Imediata de São Paulo | -23.509547 | -46.624977 |
| 2 | Zona Norte I | Vila Guilherme | Vila Guilherme, Região Imediata de São Paulo | -23.509607 | -46.606229 |
| 3 | Zona Norte I | Jardim São Paulo | Jardim São Paulo, Região Imediata de São Paulo | -23.466255 | -46.319589 |
| 4 | Zona Norte I | Vila Maria | Vila Maria, Região Imediata de São Paulo | -23.51237 | -46.575584 |


## 5. Clustering Analysis
<a name="clusteranalysis"></a>

In [14]:
#search for the geographical information of the center of the city of São Paulo. The information displayed in the output was validated through GoogleMaps

address = 'São Paulo, São Paulo'

geolocator = Nominatim(user_agent="sp_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('==================================================================================')
print('The geographical coordinates of São Paulo, São Paulo are {}, {}.'.format(latitude, longitude))
print('==================================================================================')

The geographical coordinates of São Paulo, São Paulo are -23.5506507, -46.6333824.


Now, we will use the **folium** library to visualize the geographical data:

In [15]:
map_saopaulo = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_saopaulo)  
    
map_saopaulo

Next, we create a function that calculates the distance from each neighborhood to the center of São Paulo using their coordinates. The function is based on the **haversine formula**:

In [16]:
import math

def calc_distance(lat, long):
    R = 6373.0
    lat_const = math.radians(-23.5506507)
    long_const = math.radians(-46.6333824)
    lat_transf = math.radians(lat)
    long_transf = math.radians(long)
    dlong = long_transf - long_const
    dlat = lat_transf - lat_const
    a = math.sin(dlat / 2)**2 + math.cos(lat_const) * math.cos(lat_transf) * math.sin(dlong / 2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R*c
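
The same formula can be applied to whole columns at once with NumPy instead of calling the function row by row. This is a sketch of that variant, assuming the same Earth radius and city-center coordinates as `calc_distance`; the names `haversine_km`, `EARTH_RADIUS_KM`, `CENTER_LAT` and `CENTER_LON` are introduced here for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6373.0                            # same radius as calc_distance
CENTER_LAT, CENTER_LON = -23.5506507, -46.6333824   # city center used in calc_distance

def haversine_km(lat, lon, lat0=CENTER_LAT, lon0=CENTER_LON):
    """Great-circle distance in km from (lat0, lon0) to each (lat, lon) pair, all in degrees."""
    lat, lon = np.radians(lat), np.radians(lon)
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    a = (np.sin((lat - lat0) / 2) ** 2
         + np.cos(lat0) * np.cos(lat) * np.sin((lon - lon0) / 2) ** 2)
    # arcsin(sqrt(a)) is equivalent to atan2(sqrt(a), sqrt(1 - a))
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# The city center itself and the Santana coordinates from the table above
distances = haversine_km(np.array([-23.5506507, -23.499321]),
                         np.array([-46.6333824, -46.628933]))
print(distances)
```

With a DataFrame, the call would look like `haversine_km(df['Latitude'].to_numpy(), df['Longitude'].to_numpy())`, which avoids the per-row `apply`.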

In [17]:
#apply the previous function to the DF
df['Distance'] = df.apply(lambda x: calc_distance(x['Latitude'], x['Longitude']), axis=1)

df

| | Borough | Neighborhood | Address | Latitude | Longitude | Distance |
|---|---|---|---|---|---|---|
| 0 | Zona Norte I | Santana | Santana, Região Imediata de São Paulo | -23.499321 | -46.628933 | 5.727401 |
| 1 | Zona Norte I | Carandiru | Carandiru, Região Imediata de São Paulo | -23.509547 | -46.624977 | 4.651662 |
| 2 | Zona Norte I | Vila Guilherme | Vila Guilherme, Região Imediata de São Paulo | -23.509607 | -46.606229 | 5.339521 |
| 3 | Zona Norte I | Jardim São Paulo | Jardim São Paulo, Região Imediata de São Paulo | -23.466255 | -46.319589 | 33.354486 |
| 4 | Zona Norte I | Vila Maria | Vila Maria, Região Imediata de São Paulo | -23.51237 | -46.575584 | 7.271354 |
| 5 | Zona Norte I | Parque Novo Mundo | Parque Novo Mundo, Região Imediata de São Paulo | -23.514494 | -46.56846 | 7.746544 |
| 6 | Zona Norte I | Jardim Japão | Jardim Japão, Região Imediata de São Paulo | -23.664577 | -47.074028 | 46.664506 |
| 7 | Zona Norte II | Tucuruvi | Tucuruvi, Região Imediata de São Paulo | -23.480075 | -46.60327 | 8.429556 |
| 8 | Zona Norte II | Jaçanã | Jaçanã, Região Imediata de São Paulo | -23.457994 | -46.576947 | 11.804916 |
| 9 | Zona Norte II | Parque Edu Chaves | Parque Edu Chaves, Região Imediata de São Paulo | -23.475745 | -46.566803 | 10.748592 |


In [18]:
#correct two geographical entries that were wrongly retrieved by Nominatim
df.loc[3, ['Latitude', 'Longitude']] = [-23.4926257, -46.6131064]
df.loc[6, ['Latitude', 'Longitude']] = [-23.507307, -46.5721079]

#apply the function once more
df['Distance'] = df.apply(lambda x: calc_distance(x['Latitude'], x['Longitude']), axis=1)

As stated in the **Introduction**, the stakeholders want the business to be located in the centermost region of São Paulo. To meet that requirement, a radius of 7 km around the city center was chosen, considering that São Paulo covers roughly 1,522 km², and the values of the 'Distance' column were used to select suitable locations.

In [19]:
#keep only the neighborhoods within 7 km of the city center
df = df[df['Distance'] <= 7.0].copy()

#drop any NaN values
df.dropna(inplace = True)

#reset its index
df.reset_index(inplace = True, drop = True)

df

| | Borough | Neighborhood | Address | Latitude | Longitude | Distance |
|---|---|---|---|---|---|---|
| 0 | Zona Norte I | Santana | Santana, Região Imediata de São Paulo | -23.499321 | -46.628933 | 5.727401 |
| 1 | Zona Norte I | Carandiru | Carandiru, Região Imediata de São Paulo | -23.509547 | -46.624977 | 4.651662 |
| 2 | Zona Norte I | Vila Guilherme | Vila Guilherme, Região Imediata de São Paulo | -23.509607 | -46.606229 | 5.339521 |
| 3 | Zona Norte I | Jardim São Paulo | Jardim São Paulo, Região Imediata de São Paulo | -23.492626 | -46.613106 | 6.777298 |
| 4 | Zona Norte II | Imirim | Imirim, Região Imediata de São Paulo | -23.491095 | -46.64706 | 6.769643 |
| 5 | Zona Norte II | Santa Teresinha | Santa Teresinha, Região Imediata de São Paulo | -23.490583 | -46.634307 | 6.681952 |
| 6 | Zona Norte II | Casa Verde | Casa Verde, Região Imediata de São Paulo | -23.499124 | -46.654098 | 6.108307 |
| 7 | Zona Norte II | Parque Peruche | Parque Peruche, Região Imediata de São Paulo | -23.49767 | -46.654869 | 6.287237 |
| 8 | Zona Leste I | Brás | Brás, Região Imediata de São Paulo | -23.545114 | -46.616336 | 1.844109 |
| 9 | Zona Leste I | Belém | Belém, Região Imediata de São Paulo | -23.538476 | -46.595039 | 4.137746 |


In [20]:
#drop every neighborhood name that appears more than once (keep = False removes all copies, not just the repeats)
df.drop_duplicates(subset ="Neighborhood", keep = False, inplace = True)

#reset the index
df.reset_index(drop = True, inplace = True)

Next we will use the FourSquare API to retrieve the venues of the selected locations.

In [1]:
#This code contained client_id and client_secret to the FourSquare API, thus it was omitted from the viewer version 
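
One common way to keep such credentials out of a shared notebook is to read them from environment variables. The sketch below assumes the hypothetical variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`; it is not the omitted cell, and the value of `VERSION` (passed as the `v` request parameter further below) is not shown in the original, so it is left out here.

```python
import os

def load_foursquare_credentials():
    """Read the API credentials from environment variables (hypothetical names)."""
    client_id = os.environ.get('FOURSQUARE_CLIENT_ID')
    client_secret = os.environ.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        # Fail early with a clear message instead of sending an unauthenticated request
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first')
    return client_id, client_secret
```

In the notebook this would be used as `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()`.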

In [22]:
#define a function which will retrieve the venues of the section 'food'

LIMIT = 100
section = 'food'

def getNearbyRestaurants(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&section={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT,
            section)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
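
The list comprehension in the function above raises an `IndexError` if a venue comes back with an empty `categories` list. The following sketch shows a more defensive way to flatten the `items` list; `extract_venue_rows` and `sample_items` are hypothetical names, and the sample payload is hand-made in the shape the function above reads, not real API output.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten the 'items' list of an explore response into one tuple per venue."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        categories = venue.get('categories') or []
        # Skip venues without a category instead of failing on categories[0]
        if not categories:
            continue
        location = venue.get('location', {})
        rows.append((name, lat, lng,
                     venue.get('name'),
                     location.get('lat'),
                     location.get('lng'),
                     categories[0].get('name')))
    return rows

# Hand-made payload for illustration; the second venue has no category
sample_items = [
    {'venue': {'name': 'Pizzaria A',
               'location': {'lat': -23.50, 'lng': -46.63},
               'categories': [{'name': 'Pizza Place'}]}},
    {'venue': {'name': 'No Category',
               'location': {'lat': -23.51, 'lng': -46.62},
               'categories': []}},
]

rows = extract_venue_rows('Santana', -23.499321, -46.628933, sample_items)
print(rows)
```

The returned tuples follow the same column order as `nearby_venues`, so they could be appended to `venues_list` in place of the list comprehension.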

In [39]:
#create a DF with the venues
saopaulo_restaurants = getNearbyRestaurants(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Santana
Carandiru
Vila Guilherme
Jardim São Paulo
Imirim
Santa Teresinha
Casa Verde
Parque Peruche
Brás
Belém
Pari
Canindé
Moóca
Alto da Moóca
Ana Rosa
Quarta Parada
Parque Moóca
Vila Zelina
Vila Ema
Tatuapé
Vila Mariana
Vila Clementino
Paraíso
Mirandópolis
Jardim Glória
Água Funda
Ipiranga
Sacomã
Vila Sta Catarina
Vila Nova Conceição
Perdizes
Água Branca
Pompéia
Vila Madalena
Sé
República
Barra Funda
Bom Retiro
Luz
Ponte Pequena
Santa Cecília
Pacaembú
Sumaré
Higienópolis
Bela Vista
Cerqueira César
Jardim Paulista
Jardim Europa
Liberdade
Cambuci
Aclimação
Vila Monumento
Jardim da Glória


In [40]:
print(saopaulo_restaurants.shape)
saopaulo_restaurants.head()

(2053, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Santana | -23.499321 | -46.628933 | Nação Verde | -23.500173 | -46.627697 | Vegetarian / Vegan Restaurant |
| 1 | Santana | -23.499321 | -46.628933 | Kombi do Samuca | -23.500076 | -46.630593 | Food Truck |
| 2 | Santana | -23.499321 | -46.628933 | Muradi Cozinha Árabe | -23.501741 | -46.627226 | Halal Restaurant |
| 3 | Santana | -23.499321 | -46.628933 | Canto da Marechal | -23.497762 | -46.632148 | Restaurant |
| 4 | Santana | -23.499321 | -46.628933 | Dolce Caffe | -23.502046 | -46.62724 | Café |


The stakeholders are only interested in Pizza Places, but Italian Restaurants commonly serve pizza as well, so both categories are used to select the venues of interest.

In [41]:
saopaulo_restaurants = saopaulo_restaurants[(saopaulo_restaurants['Venue Category'] == 'Pizza Place') | (saopaulo_restaurants['Venue Category'] == 'Italian Restaurant')]
saopaulo_restaurants.reset_index(inplace = True, drop = True)
saopaulo_restaurants.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Santana | -23.499321 | -46.628933 | La Delichia Pizzaria | -23.502218 | -46.630512 | Pizza Place |
| 1 | Santana | -23.499321 | -46.628933 | Fioresi Pizza Artesanal | -23.498851 | -46.626583 | Pizza Place |
| 2 | Santana | -23.499321 | -46.628933 | Lassù | -23.498937 | -46.624743 | Italian Restaurant |
| 3 | Santana | -23.499321 | -46.628933 | Pizzaria Cézanne | -23.5023 | -46.630118 | Pizza Place |
| 4 | Santana | -23.499321 | -46.628933 | Pizzaria Casarão | -23.497307 | -46.628381 | Pizza Place |
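
The two equality checks joined with `|` can also be written with `Series.isin`, which scales better if more categories are added later. A minimal sketch, using a hypothetical three-row frame (`toy_venues`) built from venues shown in the tables above:

```python
import pandas as pd

WANTED_CATEGORIES = ['Pizza Place', 'Italian Restaurant']

# Hypothetical stand-in for saopaulo_restaurants (illustration only)
toy_venues = pd.DataFrame({
    'Venue': ['Nação Verde', 'Lassù', 'Pizzaria Casarão'],
    'Venue Category': ['Vegetarian / Vegan Restaurant',
                       'Italian Restaurant',
                       'Pizza Place'],
})

# Keep the rows whose category is in the wanted list
selected = toy_venues[toy_venues['Venue Category'].isin(WANTED_CATEGORIES)]
selected = selected.reset_index(drop=True)

print(selected)
```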


Next, we will implement the ML algorithm. **DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that finds core samples in high-density regions and expands clusters from them. It was chosen over K-Means because it yields satisfactory results with geospatial data: it does not need the number of clusters in advance, it can work with the haversine distance between coordinates, and it labels isolated points as noise instead of forcing them into a cluster.

In [42]:
#create a DF with the venues' geographical data
coords = saopaulo_restaurants[['Venue Latitude', 'Venue Longitude']]

#define the Earth's mean radius in km (the number of km per radian)
kms_per_radian = 6371.0088

#convert the neighborhood radius of 1.25 km to radians; this value is plugged into the object below
epsilon = 1.25 / kms_per_radian

#defining the DBSCAN object
db = DBSCAN(eps=epsilon, min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))

#get the labels from the object
cluster_labels = db.labels_

#get the number of distinct labels (this count includes the noise label -1 when noise is present)
num_clusters = len(set(cluster_labels))

#create a Series holding the coordinates that belong to each label

clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])

print('Number of clusters: {}'.format(num_clusters))

Number of clusters: 9
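
To illustrate how DBSCAN behaves with the settings used above, the sketch below runs it on a handful of synthetic coordinates (two tight groups and one isolated point, not real venues) and counts clusters and noise points separately. The names `points`, `n_clusters` and `n_noise` are introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

KMS_PER_RADIAN = 6371.0088

# Synthetic (lat, lon) pairs in degrees: two groups of 5 points about 55 m apart, plus one outlier
group_a = [(-23.5000 + i * 0.0005, -46.6300) for i in range(5)]
group_b = [(-23.6000 + i * 0.0005, -46.7000) for i in range(5)]
outlier = [(-23.9000, -46.2000)]
points = np.array(group_a + group_b + outlier)

# Same parameters as the cell above; the haversine metric expects radians
db = DBSCAN(eps=1.25 / KMS_PER_RADIAN, min_samples=5,
            algorithm='ball_tree', metric='haversine').fit(np.radians(points))
labels = db.labels_

# The label -1 marks noise, so it is excluded from the cluster count
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(n_clusters, n_noise)
```

Each group forms its own cluster because every point has at least 5 neighbors (itself included) within 1.25 km, while the isolated point has none and is labeled -1.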


In [43]:
# add clustering labels
saopaulo_restaurants.insert(0, 'Cluster Labels',  cluster_labels)

saopaulo_restaurants

| | Cluster Labels | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Santana | -23.499321 | -46.628933 | La Delichia Pizzaria | -23.502218 | -46.630512 | Pizza Place |
| 1 | 0 | Santana | -23.499321 | -46.628933 | Fioresi Pizza Artesanal | -23.498851 | -46.626583 | Pizza Place |
| 2 | 0 | Santana | -23.499321 | -46.628933 | Lassù | -23.498937 | -46.624743 | Italian Restaurant |
| 3 | 0 | Santana | -23.499321 | -46.628933 | Pizzaria Cézanne | -23.5023 | -46.630118 | Pizza Place |
| 4 | 0 | Santana | -23.499321 | -46.628933 | Pizzaria Casarão | -23.497307 | -46.628381 | Pizza Place |
| 5 | 1 | Vila Guilherme | -23.509607 | -46.606229 | Skina Pizzaria e Choperia | -23.505522 | -46.607002 | Pizza Place |
| 6 | -1 | Vila Guilherme | -23.509607 | -46.606229 | Pizzaria Kapricho | -23.507298 | -46.605413 | Pizza Place |
| 7 | 1 | Jardim São Paulo | -23.492626 | -46.613106 | Mr. Texas Pizza Pan | -23.495248 | -46.610614 | Pizza Place |
| 8 | 1 | Jardim São Paulo | -23.492626 | -46.613106 | Pizzaria Valpolicella | -23.491103 | -46.610556 | Pizza Place |
| 9 | 1 | Jardim São Paulo | -23.492626 | -46.613106 | Delicata Pizzaria | -23.493647 | -46.609766 | Pizza Place |


A Cluster Label of -1 means that DBSCAN treats the data point as noise: it does not belong to any of the clusters created. Because the noise label is one of the 9 distinct labels counted above, 8 of them are actual clusters.
The noise points can be seen in the map created below.

In [44]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(num_clusters)
ys = [i + x + (i*x)**2 for i in range(num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(saopaulo_restaurants['Venue Latitude'], saopaulo_restaurants['Venue Longitude'], saopaulo_restaurants['Neighborhood'], saopaulo_restaurants['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
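
The map above picks colors by indexing `rainbow[cluster-1]`, which relies on negative indexing to give the noise label its own color. A more explicit alternative is to build a label-to-color dictionary and reserve a fixed color for noise. The sketch below assumes gray for noise; `build_label_colors` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

def build_label_colors(labels, noise_color='#808080'):
    """Map each DBSCAN label to a hex color; the noise label -1 gets a fixed gray."""
    cluster_ids = sorted(set(int(label) for label in labels) - {-1})
    # One evenly spaced rainbow color per real cluster
    palette = cm.rainbow(np.linspace(0, 1, max(len(cluster_ids), 1)))
    mapping = {cid: colors.rgb2hex(rgba) for cid, rgba in zip(cluster_ids, palette)}
    mapping[-1] = noise_color
    return mapping

# Illustrative labels: three clusters and one noise point
label_colors = build_label_colors([0, 0, 1, -1, 2])
print(label_colors)
```

Inside the marker loop, `color=label_colors[cluster]` would then replace `color=rainbow[cluster-1]`.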

Since the visualization alone may not be enough to determine the most suitable locations, we use a pandas operation to list the neighborhoods with the fewest Pizza Places/Italian Restaurants.

In [46]:
saopaulo_restaurants.groupby('Neighborhood')['Venue Category'].count().nsmallest(15, keep = 'all')

Neighborhood
Ana Rosa          1
Barra Funda       1
Bom Retiro        1
Brás              1
Cambuci           1
Canindé           1
Casa Verde        1
Imirim            1
Jardim Europa     1
Parque Peruche    1
Sé                1
Vila Madalena     1
Vila Zelina       1
Pari              2
Quarta Parada     2
Sacomã            2
Vila Ema          2
Vila Guilherme    2
Vila Monumento    2
Name: Venue Category, dtype: int64
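
Note that a `groupby` on the filtered venues can only list neighborhoods with at least one matching venue; a neighborhood with none would not appear at all. One way to include such neighborhoods is to reindex the counts against the full neighborhood list. The sketch below uses made-up toy data: `all_neighborhoods` plays the role of `df['Neighborhood']` and `toy_pizza` the role of the filtered `saopaulo_restaurants`, and the counts do not reflect the real results.

```python
import pandas as pd

# Toy stand-ins (illustration only, not the real counts)
all_neighborhoods = ['Santana', 'Brás', 'Luz']
toy_pizza = pd.DataFrame({
    'Neighborhood': ['Santana', 'Santana', 'Brás'],
    'Venue Category': ['Pizza Place', 'Italian Restaurant', 'Pizza Place'],
})

counts = (
    toy_pizza.groupby('Neighborhood')['Venue Category'].count()
             # Add the neighborhoods that have no matching venue, with a count of 0
             .reindex(all_neighborhoods, fill_value=0)
             .sort_values()
)

print(counts)
```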

## 6. Results and Discussion
<a name="results&discussion"></a>

Our analysis found 247 Pizza Places/Italian Restaurants in the relatively small area that we studied. However, as the map above shows, those 247 venues are not evenly distributed. There are empty pockets in the area that do not contain a single venue and, on the other hand, neighborhoods that house many venues closely packed together.

Combining the visualization with the data analysis we can say that the best locations, regarding relative distance to competitors and distance to the center of the city, lie in the following pockets:  

- The northeastern region between **Vila Guilherme** and **Pari**;
- The eastern region between **Brás** and **Belém**; and
- The southeastern region between **Moóca** and **Aclimação**.

That, naturally, does not imply that these are definitively the most suitable locations for a new Pizza Place. These areas were chosen because they are not crowded with the same type of restaurant and because their distance to the center of the city is favorable in terms of accessibility to the other regions of the city (in case the venue also offers delivery). They should therefore be treated as candidate locations and as a starting point for further analysis as the business project unfolds.

## 7. Conclusion
<a name="conclusions"></a>

The objective of this project was to assess the suitability of neighborhoods in São Paulo for a new Pizza Place. Far from being a definitive answer, the results of the analysis are merely a starting point. Besides the relative distance to competitors, many other factors should be considered when opening a venue, such as the socioeconomic profile of the target population, the region's foot traffic, and accessibility for suppliers and customers.

The decision on the most suitable location for a new business rests with the stakeholders of the project. They will weigh the factors mentioned above, and many others, to reach a final verdict, but they no longer have to do so without the help of data science and machine learning, as was done in the past. This project aimed to give a glimpse of the potential of these fields for any type of business, and hopefully succeeded.