# New Venue Implementation Analysis in Sao Paulo / Brazil

### Utilizing unsupervised Machine Learning algorithms based on socioeconomic and geospatial data  
### Target venue: New Healthy Foods Market

#### Author: Juliano Garcia  
<juliano.garcia@protonmail.com>

# Executive Summary

This Data Science problem is designed to support the decision of choosing the best neighborhoods in Sao Paulo, Brazil, in which to establish a new Healthy Foods market.  
  
The data is built from a Wikipedia table containing neighborhood names, HDI (Human Development Index) and Zone. Additional data is then gathered from geospatial sources: the Geopy library for coordinates and the Foursquare API for venue information.  
  
The analysis uses the gathered data to identify the neighborhood attributes best suited to a healthy food market. An unsupervised machine learning algorithm such as K-means is used to group similar neighborhoods; the cluster with the most healthy-food venues is then analyzed to identify its main characteristics, such as HDI and geolocation within the city of Sao Paulo.

__Table of Contents__  
  
1. [Introduction](#1)<br>  
  
  
2. [Data](#2)<br>  
  
    2.1. [Gathering & Structuring Data](#2.1)<br>  
      
    2.2. [Final Data](#2.2)<br>  
      

3. [Methodology](#3)<br>  
  
  
4. [Results](#4)<br>  
  
  
5. [Discussion](#5)<br>  
  
  
6. [Conclusion](#6)<br>

# 1. Introduction<a id="1"></a>

This Data Science problem is designed to support the decision of choosing the best neighborhoods in Sao Paulo, Brazil, in which to establish a new Healthy Foods market.  
  
<img src = "https://i.pinimg.com/originals/1b/a1/94/1ba194e3034e8e9352c4c4b790a5215c.jpg" width = 300>  
  
Today in Sao Paulo, the region in which the new venue is established can be a key factor in its success or failure. Social development differs widely across neighborhoods, with most of the city's development concentrated in only a few of them.  
  
Also, healthy foods in the region are considerably more expensive than their non-healthy substitutes, so we need to make sure that the residents of the chosen neighborhoods have the purchasing power to buy the products.

An analysis based on the distribution of venues across the neighborhoods, together with their socioeconomic indicators, will be conducted in order to cluster the neighborhoods and identify those most likely to improve the new venue's chances of success.

In [2]:
#!pip install folium

In [1]:
# Importing basic libraries

# Data manipulation libraries
import pandas as pd
from pandas import json_normalize  # top-level import available since pandas 1.0
import numpy as np
import itertools

# Data gathering libraries
from bs4 import BeautifulSoup
import lxml
import json
import requests

# Data visualization tools
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as cls
import seaborn as sns

# Geospatial tools
import folium
from geopy.geocoders import Nominatim

# 2. Data<a id="2"></a>

For the present problem we will use data collected from 3 main sources:
- Wikipedia: used to gather all the neighborhoods in Sao Paulo, as well as some characteristics such as the zone and the Human Development Index (HDI);  
  
  
- Geopy: this library is used to get the coordinates (latitude and longitude) of Sao Paulo and of each of its neighborhoods;  
  
  
- Foursquare API: used to get the venues surrounding each neighborhood, in order to cluster the neighborhoods and better understand how involved each one is in the healthy foods sector.  
  
Each source and its gathering method is shown in the code cells below.

## 2.1. Gathering & Structuring Data<a id="2.1"></a>

### 2.1.1. Neighborhoods (Wikipedia)<a id="2.1.1"></a>

Getting the Sao Paulo neighborhoods from Wikipedia's web page:  
https://pt.wikipedia.org/wiki/Lista_dos_distritos_de_S%C3%A3o_Paulo_por_%C3%8Dndice_de_Desenvolvimento_Humano  
Let's pull the Zone, Name and the HDI (Human Development Index)

#### Now let's get the data from the Wikipedia page

In [2]:
# Fetching and parsing the web page
html_page = requests.get('https://pt.wikipedia.org/wiki/Lista_dos_distritos_de_S%C3%A3o_Paulo_por_%C3%8Dndice_de_Desenvolvimento_Humano').text
soup = BeautifulSoup(html_page, 'lxml')

# Getting all the tables on the page
tables = soup.find_all('table')

Building the DataFrame with the BeautifulSoup library  
<i>The <b>'table_ranges'</b> values are based on the observation of the web page's source code</i>

In [3]:
# Collect the rows in a list first (DataFrame.append was removed in pandas 2.0)
rows = []

# Set the row numbers for all the neighborhoods we need based on the distribution of the zone tables
table_ranges = [(3,10), (17,29), (36,43), (50,64), (71,85), (92,102), (109,115), (122,128), (132,143)]

# Get the main table rows and the zone headings once, outside the loop
d1 = tables[9].find_all('tr')
d2 = d1[0].find_all('tr')
zone_headings = tables[9].find_all('h3')

for i, (start, finish) in enumerate(table_ranges):
    # Defining Zone (keeping only the text before the first '[')
    zone = zone_headings[i].text.split("[")[0]

    # Getting Position, Name and HDI for each table row
    for row in range(start, finish+1):
        d3 = d2[row].find_all('td')

        rows.append({'Zone' : zone,
                     'Position': d3[0].text,
                     'Neighborhood': d3[1].text,
                     'HDI': d3[2].text[:-1]})  # drop the trailing character

# Build the DataFrame in one step
neighborhoods = pd.DataFrame(rows, columns=['Zone', 'Position', 'Neighborhood', 'HDI'])
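The row-index approach above depends on the page layout at the time of scraping. As a hedged illustration of the same `find_all('tr')` / `find_all('td')` pattern, the sketch below parses a small hand-written HTML snippet rather than the real Wikipedia markup. The names `sample_html` and `parse_rows` are hypothetical, and `html.parser` is used only so the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one zone table; the real page markup differs.
sample_html = """
<table>
  <tr><th>Position</th><th>Neighborhood</th><th>HDI</th></tr>
  <tr><td>1</td><td>Consolação</td><td>0,950
</td></tr>
  <tr><td>2</td><td>Bela Vista</td><td>0,940
</td></tr>
</table>
"""

def parse_rows(html, zone):
    """Return one dict per data row, skipping rows without enough <td> cells."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find_all('tr'):
        cells = tr.find_all('td')
        if len(cells) < 3:  # header rows use <th>, so they have no <td> cells
            continue
        rows.append({'Zone': zone,
                     'Position': cells[0].text.strip(),
                     'Neighborhood': cells[1].text.strip(),
                     'HDI': cells[2].text.strip()})  # strip() removes the newline
    return rows

sample_rows = parse_rows(sample_html, 'Região Central')
print(sample_rows)
```

Filtering on the number of `<td>` cells, instead of hard-coded row ranges, is one way to make the scrape less sensitive to layout changes, at the cost of needing another way to assign the zone.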

In [4]:
neighborhoods.head(10)

Unnamed: 0,Zone,Position,Neighborhood,HDI
0,Região Central,1,Consolação,"0,950"
1,Região Central,2,Bela Vista,"0,940"
2,Região Central,3,Liberdade,"0,936"
3,Região Central,4,Santa Cecília,"0,930"
4,Região Central,5,Cambuci,"0,903"
5,Região Central,6,Sé,"0,854"
6,Região Central,7,República,"0,858"
7,Região Central,8,Bom Retiro,"0,847"
8,Leste 1,1,Penha,"0,865"
9,Leste 1,2,Vila Matilde,"0,864"


In [5]:
neighborhoods.shape

(96, 4)

Cleaning and structuring the data

In [6]:
# Let's create a dictionary to translate the zone names
translate = {'Região Central' : 'Central',
             'Leste 1' : 'East 1',
             'Leste 2' : 'East 2',
             'Sudeste' : 'Southeast',
             'Oeste' : 'West',
             'Nordeste' : 'Northeast',
             'Noroeste' : 'Northwest',
             'Centro-Sul' : 'Center-South',
             'Sul' : 'South'}

In [7]:
# Replacing ',' with '.' since the Brazilian numeric standard differs from the US one - ',' is used as the decimal separator
neighborhoods['HDI'] = neighborhoods['HDI'].str.replace(',', '.', regex=False)

# Translating Zone
neighborhoods['Zone'] = neighborhoods['Zone'].map(translate)

neighborhoods.sample(10)

Unnamed: 0,Zone,Position,Neighborhood,HDI
51,West,8,Butantã,0.928
30,Southeast,2,Mooca,0.909
11,East 1,4,Artur Alvim,0.833
70,Northwest,1,São Domingos,0.854
6,Central,7,República,0.858
92,South,9,Grajaú,0.754
85,South,2,Vila Andrade,0.853
95,South,12,Marsilac,0.701
64,Northeast,6,Limão,0.847
32,Southeast,4,Belém,0.897


Now let's cast the columns to the correct types

In [8]:
neighborhoods['HDI'] = neighborhoods['HDI'].astype('float')
neighborhoods['Position'] = neighborhoods['Position'].astype('int')
neighborhoods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 4 columns):
Zone            96 non-null object
Position        96 non-null int64
Neighborhood    96 non-null object
HDI             96 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 3.1+ KB
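The decimal-comma cleanup and the cast to float can be seen in isolation on a few toy strings. This is only a sketch: `raw_hdi` is a hypothetical Series holding three values in the Brazilian format, not the scraped column.

```python
import pandas as pd

# Toy values using ',' as the decimal separator
raw_hdi = pd.Series(['0,950', '0,858', '0,701'])

# Vectorized literal replace, then cast to float
hdi = raw_hdi.str.replace(',', '.', regex=False).astype(float)

print(hdi.tolist())
print(hdi.dtype)
```

Passing `regex=False` makes the replacement a literal one, which matters because `.` is a wildcard when the pattern is treated as a regular expression.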


### 2.1.2. Coordinates (Geopy Library)<a id="2.1.2"></a>

#### Geolocation for Sao Paulo - Brazil

In [9]:
address = 'Sao Paulo / BR'

geolocator = Nominatim(user_agent="sp_explorer")
location = geolocator.geocode(address)
sp_coord = (location.latitude, location.longitude)

sp_coord

(-23.5506507, -46.6333824)

Getting the Latitudes and Longitudes for the neighborhoods  
Let's use a separate DataFrame so we only need to run it once in order to avoid hitting the query limit of the API

In [10]:
# Defining the function to get the Latitude and Longitude of a given neighborhood
def get_lat_long(neigh):
    address = neigh + ', Sao Paulo / BR'

    # Reuse the geolocator created above instead of building a new one per call
    location = geolocator.geocode(address)

    # geocode returns None when no match is found
    if location is None:
        return (np.nan, np.nan)

    return (location.latitude, location.longitude)

In [11]:
# Instantiate the dataframe
coords = pd.DataFrame(columns=['Neighborhood', 'coords'])
coords['Neighborhood'] = neighborhoods['Neighborhood']

In [12]:
# AVOID RE-RUNNING THIS CELL!! ______________________________________________________________ !!!
coords['coords'] = coords['Neighborhood'].apply(get_lat_long)
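One way to make the "run it once" rule harder to break is to save the results to disk and only query names that are not cached yet. The sketch below is an assumption-laden illustration, not part of the original notebook: `geocode_with_cache`, its arguments and the JSON cache file are all hypothetical, and the lookup function is passed in so the caching logic can be tried without touching the geocoding service.

```python
import json
import os

def geocode_with_cache(names, geocode_fn, cache_path):
    """Look up coordinates for each name, reusing results saved in cache_path."""
    # Load previously saved results, if any
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding='utf-8') as f:
            cache = json.load(f)

    for name in names:
        if name not in cache:
            # Only names missing from the cache trigger a new lookup
            cache[name] = list(geocode_fn(name))

    # Persist the updated cache for the next run
    with open(cache_path, 'w', encoding='utf-8') as f:
        json.dump(cache, f, ensure_ascii=False)

    return {name: tuple(cache[name]) for name in names}
```

In this notebook it could be called as `geocode_with_cache(neighborhoods['Neighborhood'], get_lat_long, 'sp_coords.json')`, where the file name is arbitrary; a second run would then read the coordinates from the file instead of querying again.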

#### Adding the Coordinates to the Neighborhood DataFrame, cast as floats

In [13]:
neighborhoods['Latitude'] = coords['coords'].apply(lambda x : x[0]).astype('float')
neighborhoods['Longitude'] = coords['coords'].apply(lambda x : x[1]).astype('float')
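As an alternative to two separate `apply` calls, a column of `(latitude, longitude)` tuples can be expanded into two float columns in one step. This is a sketch on a hypothetical two-row frame, `coords_demo`, using coordinates that appear in the sample output of this notebook.

```python
import pandas as pd

coords_demo = pd.DataFrame({
    'Neighborhood': ['Liberdade', 'Cidade Líder'],
    'coords': [(-23.566703, -46.631809), (-23.56277, -46.494333)],
})

# Expand the tuples into two columns, keeping the original index aligned
latlng = pd.DataFrame(coords_demo['coords'].tolist(),
                      columns=['Latitude', 'Longitude'],
                      index=coords_demo.index).astype(float)

coords_demo = coords_demo.join(latlng)
print(coords_demo[['Neighborhood', 'Latitude', 'Longitude']])
```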

In [14]:
neighborhoods.sample(5)

Unnamed: 0,Zone,Position,Neighborhood,HDI,Latitude,Longitude
74,Northwest,5,Anhangüera,0.774,-23.432908,-46.788534
13,East 1,6,Cidade Líder,0.817,-23.56277,-46.494333
24,East 2,4,Cidade Tiradentes,0.766,-23.582497,-46.409207
2,Central,3,Liberdade,0.936,-23.566703,-46.631809
82,Center-South,6,Campo Grande,0.921,-23.675548,-46.687226


Let's see how the data that we have appears on a map!

In [15]:
# create map of Sao Paulo using latitude and longitude values
map_sp = folium.Map(location=[sp_coord[0], sp_coord[1]], zoom_start=11)

# add markers to map
for lat, lng, zone, name in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Zone'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(name, zone)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_sp)  
    
map_sp

### 2.1.3. Venues (Foursquare API)<a id="2.1.3"></a>

API Credentials hidden for sharable version

In [59]:
# The code was removed by Watson Studio for sharing.

In [17]:
VERSION = '20190609' # Foursquare API version

Before we proceed, let's define a function to get the categories of all the venues.

In [18]:
def getVenues(df, names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(df[names], df[latitudes], df[longitudes]):
            
        # build the API request URL and its query parameters
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {'client_id': CLIENT_ID,
                  'client_secret': CLIENT_SECRET,
                  'v': VERSION,
                  'll': '{},{}'.format(lat, lng),
                  'radius': radius,
                  'limit': LIMIT}
            
        # make the GET request
        results = requests.get(url, params=params).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    
    return(nearby_venues)
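The function above indexes directly into the response (`["response"]['groups'][0]['items']` and `['categories'][0]`), so an unexpected or empty payload raises an exception. The sketch below shows a more defensive version of the same extraction on a hand-written payload. `mock_payload` only mimics the nesting used above and is not a real API response, and `extract_venues` is a hypothetical helper.

```python
# Hand-written payload mimicking the nesting the function above relies on
mock_payload = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Venue A',
                   'location': {'lat': -23.5467, 'lng': -46.6608},
                   'categories': [{'name': 'Restaurant'}]}},
        {'venue': {'name': 'Venue B',
                   'location': {'lat': -23.5480, 'lng': -46.6576},
                   'categories': []}},  # venue without a category
    ]}]}
}

def extract_venues(payload, neighborhood):
    """Return (neighborhood, venue, lat, lng, category) tuples from one payload."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Fall back to None instead of raising IndexError on an empty list
        category = categories[0]['name'] if categories else None
        rows.append((neighborhood, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows

venue_rows = extract_venues(mock_payload, 'Consolação')
print(venue_rows)
```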

Now let's use the above function to create a new DataFrame ('sp_venues') with all the venues we gathered from the API

In [20]:
# AVOID RE-RUNNING THIS CELL!! ______________________________________________________________ !!!
sp_venues = getVenues(df=neighborhoods,
                      names='Neighborhood',
                      latitudes='Latitude',
                      longitudes='Longitude')

In [21]:
print(sp_venues.shape)
sp_venues.head()

(2840, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Consolação,-23.54808,-46.660029,Carlota,-23.546694,-46.66078,Restaurant
1,Consolação,-23.54808,-46.660029,Bráz Pizzaria,-23.547989,-46.657645,Pizza Place
2,Consolação,-23.54808,-46.660029,Petí Panamericana,-23.549036,-46.659611,Restaurant
3,Consolação,-23.54808,-46.660029,Ici Bistrô,-23.549389,-46.65819,French Restaurant
4,Consolação,-23.54808,-46.660029,Loja Mod,-23.546138,-46.658954,Furniture / Home Store


Let's check how many venues, on average, the API returned for each neighborhood

In [22]:
sp_venues[['Neighborhood', 'Venue']].groupby('Neighborhood').count().describe()

Unnamed: 0,Venue
count,94.0
mean,30.212766
std,28.259882
min,1.0
25%,8.0
50%,21.0
75%,45.5
max,100.0
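The `count` of 94 is lower than the 96 rows in `neighborhoods`. One possible explanation, which is an assumption and not verified here, is that some neighborhoods returned no venues and therefore never appear in `sp_venues`. A sketch of how this could be checked, on toy frames with hypothetical names:

```python
import pandas as pd

all_neigh = pd.DataFrame({'Neighborhood': ['A', 'B', 'C']})
venues = pd.DataFrame({'Neighborhood': ['A', 'A', 'C'],
                       'Venue': ['v1', 'v2', 'v3']})

# Venues per neighborhood, as in the cell above
counts = venues.groupby('Neighborhood')['Venue'].count()

# Reindex against the full list so neighborhoods with no venues show up as 0
counts_full = counts.reindex(all_neigh['Neighborhood'], fill_value=0)
missing = counts_full[counts_full == 0].index.tolist()

print(counts_full.tolist())
print(missing)
```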


How many unique categories did we get?

In [23]:
len(sp_venues['Venue Category'].unique())

284

Let's one-hot encode the categories and take the average over all venues in each neighborhood

In [51]:
# Creating the encoded DataFrame with 'get_dummies' from pandas
sp_encoded = sp_venues[['Neighborhood']]
sp_encoded = sp_encoded.join(pd.get_dummies(sp_venues[['Venue Category']], prefix="", prefix_sep=""))

# Next, let's group rows by neighborhood, taking the mean frequency of occurrence of each category
sp_encoded = sp_encoded.groupby('Neighborhood').mean().reset_index()
sp_encoded.head()

Unnamed: 0,Neighborhood,Acai House,Accessories Store,African Restaurant,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Alto de Pinheiros,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333
1,Anhangüera,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Aricanduva,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Artur Alvim,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047619,0.0,0.0
4,Barra Funda,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
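To make the meaning of these values concrete, the same encode-then-average step can be run on a toy table. In this sketch `toy_venues` is hypothetical data; each resulting cell is the share of a neighborhood's venues that fall in that category, so every row sums to 1.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Restaurant', 'Restaurant', 'Pizza Place',
                       'Yoga Studio', 'Restaurant', 'Pizza Place'],
})

# One indicator column per category (1.0 where the venue has that category)
dummies = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
toy_encoded = pd.concat([toy_venues[['Neighborhood']], dummies], axis=1)

# The mean of the indicators is the category's share of the neighborhood's venues
toy_freq = toy_encoded.groupby('Neighborhood').mean()
print(toy_freq)
```

Neighborhood A has four venues, two of them restaurants, so its `Restaurant` value is 0.5.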


Finally let's add the HDI and Zone (encoded) to the encoded DataFrame

In [52]:
# Adding the columns
sp_encoded = sp_encoded.merge(right=neighborhoods[['Neighborhood', 'Latitude', 'Longitude', 'HDI', 'Zone']],
                              on='Neighborhood')

In [53]:
# Now let's encode the zone
sp_encoded = sp_encoded.join(pd.get_dummies(sp_encoded[['Zone']], prefix="", prefix_sep=""))

In [56]:
# Finally let's create lists of the column names so they are easy to manipulate
columns = list(sp_encoded.columns)
zone_list = columns[-9:]
category_list = columns[1:-13]
coord_list = ['Latitude', 'Longitude']
columns = ['Neighborhood'] + ['HDI'] + coord_list + zone_list + category_list

# Now let's re-arrange the columns
sp_encoded = sp_encoded[columns]
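The positional slices above (`-9:` and `1:-13`) rely on the exact column order after the merge and join. A name-based selection is one possible alternative. The sketch below uses a hypothetical two-row frame, `toy`, and only two of the nine zones; the category values are made up.

```python
import pandas as pd

zone_names = ['Central', 'East 1']  # subset of the translated zone names
meta_cols = ['Neighborhood', 'HDI', 'Latitude', 'Longitude']

toy = pd.DataFrame({
    'Neighborhood': ['Liberdade', 'Artur Alvim'],
    'Restaurant': [0.5, 0.2],
    'Yoga Studio': [0.0, 0.1],
    'Latitude': [-23.566703, -23.539221],
    'Longitude': [-46.631809, -46.485265],
    'HDI': [0.936, 0.833],
    'Zone': ['Central', 'East 1'],
    'Central': [1, 0],
    'East 1': [0, 1],
})

# Pick the columns by name instead of by position
zone_cols = [c for c in toy.columns if c in zone_names]
excluded = set(meta_cols + zone_cols + ['Zone'])
category_cols = [c for c in toy.columns if c not in excluded]

toy = toy[meta_cols + zone_cols + category_cols]
print(list(toy.columns))
```

This assumes that no venue category shares a name with a zone; if one did, it would be classified as a zone column.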

## 2.2. Final Data<a id="2.2"></a>

We now have our final dataset!
  
- __sp_encoded__ DataFrame: contains the venue category frequencies as well as neighborhood information such as Zone (encoded), coordinates and HDI

In [58]:
print(sp_encoded.shape)
sp_encoded.head()

(94, 297)


Unnamed: 0,Neighborhood,HDI,Latitude,Longitude,Center-South,Central,East 1,East 2,Northeast,Northwest,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Alto de Pinheiros,0.955,-23.549906,-46.707642,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333
1,Anhangüera,0.774,-23.432908,-46.788534,0,0,0,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Aricanduva,0.885,-23.578024,-46.511454,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Artur Alvim,0.833,-23.539221,-46.485265,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047619,0.0,0.0
4,Barra Funda,0.917,-23.525462,-46.667513,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
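As a rough sketch of how a table shaped like this could feed the K-means clustering mentioned in the Executive Summary, the example below clusters a toy frame with scikit-learn. Everything here is an assumption for illustration: the data, the `Health Food Store` column name, the choice of two clusters and the standardization step are not taken from the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy stand-in for sp_encoded: one socioeconomic feature, one venue frequency
toy = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C', 'D'],
    'HDI': [0.95, 0.94, 0.75, 0.70],
    'Health Food Store': [0.10, 0.08, 0.00, 0.00],
})

# K-means is distance based, so features on different scales are standardized
features = toy.drop(columns='Neighborhood')
scaled = StandardScaler().fit_transform(features)

# Fit K-means with a fixed seed so the result is reproducible
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
toy['Cluster'] = kmeans.labels_

print(toy[['Neighborhood', 'Cluster']])
```

On the real data, columns such as latitude and longitude have a much larger numeric range than the frequencies, which is why some form of scaling is worth considering before clustering.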


# 3. Methodology<a id="3"></a>

# 4. Results<a id="4"></a>

# 5. Discussion<a id="5"></a>

# 6. Conclusion<a id="6"></a>