# Capstone Project - The Battle of Neighborhoods (Week 1)

# Doing Business in Brazil

## Introduction/Business Problem

The business problem is to understand how Brazilians behave and what they value, and to use that understanding to build rapport when doing business in Brazil.

When doing business in a country, it is important to understand local behavior and what people appreciate most. With that understanding, you can build rapport and make your negotiations easier.


## Data and how it will be used to solve the problem

I will explore the cities that contribute most to Brazil's GDP, as listed on a Wikipedia page that has all the information I need (https://pt.wikipedia.org/wiki/Lista_de_munic%C3%ADpios_do_Brasil_por_PIB).

I will use the Foursquare API's **explore** endpoint to get the most common venue categories in each city, and the *k*-means clustering algorithm to group cities with similar venue profiles. Finally, I will use the Folium library to visualize the cities, compare their venues and determine what types of places Brazilians like.
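As a rough illustration of this plan, the sketch below turns a list of venue categories per city into frequency vectors, which is the kind of numeric input a clustering algorithm such as *k*-means expects. The city names and categories are made-up sample data, and `category_frequencies` is a hypothetical helper, not part of the Foursquare API or of this notebook.

```python
from collections import Counter

def category_frequencies(venues_by_city):
    """Return (sorted category names, {city: frequency vector})."""
    # Collect every category seen in any city so all vectors share one layout.
    categories = sorted({cat for cats in venues_by_city.values() for cat in cats})
    vectors = {}
    for city, cats in venues_by_city.items():
        counts = Counter(cats)
        total = len(cats) or 1  # avoid dividing by zero for a city with no venues
        # Share of the city's venues that fall in each category.
        vectors[city] = [counts[cat] / total for cat in categories]
    return categories, vectors

# Made-up sample data, for illustration only.
sample = {
    "City A": ["Theater", "Theater", "Bar", "Cafe"],
    "City B": ["Bar", "Bar", "Bar", "Cafe"],
}
categories, vectors = category_frequencies(sample)
print(categories)
print(vectors["City A"])
```

Using shares rather than raw counts keeps cities comparable even when the API returns a different number of venues for each.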


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

</font>
</div>

Before I get the data and start exploring it, let me import all the dependencies that I will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import available in pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## 1. Download and Explore Dataset

For the data, a Wikipedia page has all the information I need to explore and cluster the cities in Brazil. I will scrape the page, clean the data, and read it into a pandas dataframe so that it is in a structured format.

In [2]:
Table = pd.read_html('https://pt.wikipedia.org/wiki/Lista_de_munic%C3%ADpios_do_Brasil_por_PIB',header=0)[0]
Table.rename(columns={'Município':'City','PIB 2016 (R$ 1.000)':'GDP'},inplace=True)
Table.head()

| | Posição | City | GDP | Estado |
|---|---|---|---|---|
| 0 | 1 | São Paulo | 687 035 890 | SP |
| 1 | 2 | Rio de Janeiro | 329 431 360 | RJ |
| 2 | 3 | Brasília | 235 497 107 | DF |
| 3 | 4 | Belo Horizonte | 88 277 463 | MG |
| 4 | 5 | Curitiba | 83 788 904 | PR |
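The GDP values above are shown as text with spaces as thousands separators, so they would need converting before any numeric comparison. This notebook drops the column in the next step, but should the values be needed later, a minimal sketch of the conversion could look like this. `parse_gdp` is a hypothetical helper, and whether the scraped page uses regular or non-breaking spaces is an assumption, so the function strips both.

```python
def parse_gdp(text):
    """Convert a string such as '687 035 890' to the integer 687035890."""
    # str.split() with no arguments splits on any whitespace, including
    # non-breaking spaces, so joining the pieces removes every separator.
    return int("".join(text.split()))

# The first three values from the table above.
gdp_values = [parse_gdp(s) for s in ["687 035 890", "329 431 360", "235 497 107"]]
print(gdp_values)
```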


In [3]:
Table.shape

(5570, 4)

In [4]:
gdp_cities_brazil = Table.drop(columns=['Posição', 'GDP'])
gdp_cities_brazil.head()

| | City | Estado |
|---|---|---|
| 0 | São Paulo | SP |
| 1 | Rio de Janeiro | RJ |
| 2 | Brasília | DF |
| 3 | Belo Horizonte | MG |
| 4 | Curitiba | PR |


In [5]:
gdp_cities_brazil.shape

(5570, 2)

I will work with the top 50 cities by GDP.

In [6]:
top_50 = gdp_cities_brazil.head(50)
top_50

| | City | Estado |
|---|---|---|
| 0 | São Paulo | SP |
| 1 | Rio de Janeiro | RJ |
| 2 | Brasília | DF |
| 3 | Belo Horizonte | MG |
| 4 | Curitiba | PR |
| 5 | Osasco | SP |
| 6 | Porto Alegre | RS |
| 7 | Manaus | AM |
| 8 | Salvador | BA |
| 9 | Fortaleza | CE |


In [7]:
for index, row in top_50.iterrows():
    print (row['City'],row['Estado'])

São Paulo SP
Rio de Janeiro RJ
Brasília DF
Belo Horizonte MG
Curitiba PR
Osasco SP
Porto Alegre RS
Manaus AM
Salvador BA
Fortaleza CE
Campinas SP
Guarulhos SP
Recife PE
Barueri SP
Goiânia GO
São Bernardo do Campo SP
Duque de Caxias RJ
Jundiaí SP
São José dos Campos SP
Uberlândia MG
Paulínia SP
Sorocaba SP
Ribeirão Preto SP
Belém PA
São Luís MA
Contagem MG
Santo André SP
Campo Grande MS
Joinville SC
Betim MG
Niterói RJ
Cuiabá MT
Santos SP
Camaçari BA
Natal RN
Vitória ES
Piracicaba SP
Maceió AL
Caxias do Sul RS
São José dos Pinhais PR
Canoas RS
Itajaí SC
Teresina PI
João Pessoa PB
Florianópolis SC
Londrina PR
Serra ES
Cubatão SP
Macaé RJ
Campos dos Goytacazes RJ


#### Use geopy library to get the latitude and longitude values of Cities.

In order to define an instance of the geocoder, I need to define a user_agent. I will name the agent <em>br_explorer</em>, as shown below.

In [8]:
geolocator = Nominatim(user_agent="br_explorer")  # one geocoder instance is enough for every lookup
for index, row in top_50.iterrows():
    address = row['City'] + ", " + row['Estado']
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))

The geographical coordinates of São Paulo, SP are -23.5506507, -46.6333824.
The geographical coordinates of Rio de Janeiro, RJ are -22.9110137, -43.2093727.
The geographical coordinates of Brasília, DF are -15.7934036, -47.8823172.
The geographical coordinates of Belo Horizonte, MG are -19.9227318, -43.9450948.
The geographical coordinates of Curitiba, PR are -25.4295963, -49.2712724.
The geographical coordinates of Osasco, SP are -23.5324859, -46.7916801.
The geographical coordinates of Porto Alegre, RS are -30.0324999, -51.2303767.
The geographical coordinates of Manaus, AM are -3.1316333, -59.9825041.
The geographical coordinates of Salvador, BA are -12.9822499, -38.4812772.
The geographical coordinates of Fortaleza, CE are -3.7304512, -38.5217989.
The geographical coordinates of Campinas, SP are -22.90556, -47.06083.
The geographical coordinates of Guarulhos, SP are -23.4430602, -46.524459.
The geographical coordinates of Recife, PE are -8.0641931, -34.8781517.
The geographical coordinates of Barueri, SP are 

In [9]:
# define the dataframe columns
column_names = ['State', 'City', 'Latitude', 'Longitude'] 

# instantiate the dataframe
Brazil = pd.DataFrame(columns=column_names)
Brazil

| | State | City | Latitude | Longitude |
|---|---|---|---|---|


In [10]:
geolocator = Nominatim(user_agent="br_explorer")  # reuse one geocoder for every lookup
rows = []
for index, row in top_50.iterrows():
    address = row['City'] + ", " + row['Estado']
    location = geolocator.geocode(address)
    rows.append({'State': row['Estado'],
                 'City': row['City'],
                 'Latitude': location.latitude,
                 'Longitude': location.longitude})

# DataFrame.append was removed in pandas 2.0, so build the dataframe from the collected rows
Brazil = pd.DataFrame(rows, columns=column_names)
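The geocoding loop above assumes every lookup succeeds. geopy's `geocode` returns `None` when it finds no match, and the public Nominatim service asks clients to stay at or below one request per second. The sketch below shows one way to handle both. `geocode_all` and the stand-in `fake_geocode` are hypothetical names used only for illustration; with geopy, the `geocode` argument would be `Nominatim(user_agent="br_explorer").geocode`.

```python
import time

def geocode_all(addresses, geocode, delay_seconds=1.0):
    """Look up each address once; return ({address: (lat, lon)}, [failures])."""
    coordinates = {}
    not_found = []
    for i, address in enumerate(addresses):
        if i > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        location = geocode(address)
        if location is None:
            # No match: record the address instead of raising AttributeError.
            not_found.append(address)
        else:
            coordinates[address] = (location.latitude, location.longitude)
    return coordinates, not_found


# Stand-in geocoder so the example runs without network access.
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude

def fake_geocode(address):
    known = {"São Paulo, SP": FakeLocation(-23.5506507, -46.6333824)}
    return known.get(address)

found, missing = geocode_all(["São Paulo, SP", "Nowhere, XX"], fake_geocode, delay_seconds=0)
print(found)
print(missing)
```

Collecting the failures in a list makes it easy to retry or correct those addresses by hand afterwards.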

In [11]:
Brazil

| | State | City | Latitude | Longitude |
|---|---|---|---|---|
| 0 | SP | São Paulo | -23.550651 | -46.633382 |
| 1 | RJ | Rio de Janeiro | -22.911014 | -43.209373 |
| 2 | DF | Brasília | -15.793404 | -47.882317 |
| 3 | MG | Belo Horizonte | -19.922732 | -43.945095 |
| 4 | PR | Curitiba | -25.429596 | -49.271272 |
| 5 | SP | Osasco | -23.532486 | -46.79168 |
| 6 | RS | Porto Alegre | -30.0325 | -51.230377 |
| 7 | AM | Manaus | -3.131633 | -59.982504 |
| 8 | BA | Salvador | -12.98225 | -38.481277 |
| 9 | CE | Fortaleza | -3.730451 | -38.521799 |


In [12]:
address = 'Brazil'

geolocator = Nominatim(user_agent="br_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Brazil are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Brazil are -10.3333333, -53.2.


#### Create a map of Brazil.

In [13]:
# create map of Brazil using latitude and longitude values
map_brazil = folium.Map(location=[latitude, longitude], zoom_start=4)

# add markers to map
for lat, lng, state, city in zip(Brazil['Latitude'], Brazil['Longitude'], Brazil['State'], Brazil['City']):
    label = '{}, {}'.format(city, state)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_brazil)
map_brazil

Next, I am going to start utilizing the Foursquare API to explore the cities and segment them.

#### Define Foursquare Credentials and Version

In [14]:
CLIENT_ID = 'LPHXG0C4IITZCYW2BJN1T5KJXBL1EF4AGODY54HKZPLOZ5SC' # your Foursquare ID
CLIENT_SECRET = 'IIFSFH05BMHIJBCQS3GJ4WW1YWMJKO05RWOLP2HVHBT4TVKF' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: LPHXG0C4IITZCYW2BJN1T5KJXBL1EF4AGODY54HKZPLOZ5SC
CLIENT_SECRET:IIFSFH05BMHIJBCQS3GJ4WW1YWMJKO05RWOLP2HVHBT4TVKF
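Hard-coding the ID and secret in a shared notebook exposes them to every reader. One common alternative, sketched here, is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are assumptions made for this example, not names defined by Foursquare.

```python
import os

def load_credentials(environ=os.environ):
    """Return (client_id, client_secret) read from environment variables."""
    # The variable names below are hypothetical; any names would work.
    client_id = environ.get("FOURSQUARE_CLIENT_ID")
    client_secret = environ.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set in the environment")
    return client_id, client_secret

# Demonstration with a plain dict standing in for os.environ.
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
demo_id, demo_secret = load_credentials(demo_env)
print(demo_id)
```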


#### Let's explore the first city in my dataframe.

In [15]:
Brazil.loc[0, 'City']

'São Paulo'

Get the city's latitude and longitude values.

In [16]:
city_latitude = Brazil.loc[0, 'Latitude'] # city latitude value
city_longitude = Brazil.loc[0, 'Longitude'] # city longitude value

city_name = Brazil.loc[0, 'City'] # city name

print('Latitude and longitude values of {} are {}, {}.'.format(city_name, 
                                                               city_latitude, 
                                                               city_longitude))

Latitude and longitude values of São Paulo are -23.5506507, -46.6333824.


#### Now, let's get the top 100 venues in São Paulo.

First, let's create the GET request URL and store it in a variable named **url**.

In [17]:
LIMIT = 100 # maximum number of venues returned by the Foursquare API
radius = 50000 # search radius in meters
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    city_latitude, 
    city_longitude, 
    radius, 
    LIMIT)
url # display URL


'https://api.foursquare.com/v2/venues/explore?&client_id=LPHXG0C4IITZCYW2BJN1T5KJXBL1EF4AGODY54HKZPLOZ5SC&client_secret=IIFSFH05BMHIJBCQS3GJ4WW1YWMJKO05RWOLP2HVHBT4TVKF&v=20180605&ll=-23.5506507,-46.6333824&radius=50000&limit=100'
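String formatting works here because none of the values need escaping. As a hedged alternative, the standard library's `urlencode` can assemble the query string and escape values where needed. The endpoint and parameter names are the ones used above; `build_explore_url` is a hypothetical helper and the credentials are placeholders.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL with encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in the ll value readable instead of %2C.
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")

# Placeholder credentials, for illustration only.
demo_url = build_explore_url("MY_ID", "MY_SECRET", "20180605",
                             -23.5506507, -46.6333824, 50000, 100)
print(demo_url)
```

The same `params` dictionary could also be passed to `requests.get(base_url, params=params)`, which performs the encoding itself.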

Send the GET request and examine the results.

In [18]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5c62ee201ed2192b5938ec41'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'São Paulo',
  'headerFullLocation': 'São Paulo',
  'headerLocationGranularity': 'city',
  'totalResults': 217,
  'suggestedBounds': {'ne': {'lat': -23.10065024999955,
    'lng': -46.14341107531995},
   'sw': {'lat': -24.000651150000447, 'lng': -47.12335372468005}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4b17eb00f964a520a1c923e3',
       'name': 'Centro Cultural Banco do Brasil (CCBB)',
       'location': {'address': 'R. Álvares Penteado, 112',
        'crossStreet': 'R. Quitanda',
        'lat': -23.547588190396358,
        'lng': -46.6346831174672,
        'lab
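The response carries a status code under `meta` and the venues under `response` → `groups` → `items`. A minimal sketch of checking the code before reading the items follows; `extract_items` is a hypothetical helper, and the sample payload is a trimmed-down stand-in shaped like the response shown above.

```python
def extract_items(payload):
    """Return the list of recommended items from an explore response."""
    code = payload.get("meta", {}).get("code")
    if code != 200:
        raise ValueError("Foursquare request failed with code {}".format(code))
    groups = payload.get("response", {}).get("groups", [])
    # The notebook reads the first group; no groups means no venues.
    return groups[0]["items"] if groups else []

# A trimmed-down payload with the same nesting as the real response.
sample_payload = {
    "meta": {"code": 200},
    "response": {"groups": [{"name": "recommended",
                             "items": [{"venue": {"name": "Centro Cultural Banco do Brasil (CCBB)"}}]}]},
}
items = extract_items(sample_payload)
print(len(items))
```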

From the Foursquare lab in the previous module, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [19]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now I am ready to clean the JSON and structure it into a *pandas* dataframe.

In [20]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Centro Cultural Banco do Brasil (CCBB) | Cultural Center | -23.547588 | -46.634683 |
| 1 | Teatro Renault | Theater | -23.55412 | -46.638695 |
| 2 | Theatro Municipal de São Paulo | Theater | -23.545387 | -46.638765 |
| 3 | Casa de Francisca | Music Venue | -23.548733 | -46.634763 |
| 4 | Casa Mathilde | Dessert Shop | -23.545409 | -46.634746 |
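By default, `json_normalize` flattens nested dictionaries into columns whose names are joined by dots, which is why the code filters on names such as `venue.location.lat` and then keeps only the last segment. The sketch below mimics that naming in plain Python to show where the column names come from. It is an illustration only, not pandas' implementation, and `flatten` is a hypothetical helper.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single dict with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value  # lists and scalars are kept as-is
    return flat

# One item reduced to the fields this notebook keeps.
item = {"venue": {"name": "Teatro Renault",
                  "categories": [{"name": "Theater"}],
                  "location": {"lat": -23.55412, "lng": -46.638695}}}
flat = flatten(item)
print(sorted(flat))
# Keeping only the last dotted segment gives the short column names.
print(sorted(key.split(".")[-1] for key in flat))
```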


And how many venues were returned by Foursquare?

In [21]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.
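As a rough sanity check on the 50 km search radius, the distance from the city coordinates to a venue can be estimated with the haversine formula. The sketch below uses the São Paulo coordinates from the geocoder and the first venue in the table above. `haversine_m` is a hypothetical helper, and the Earth radius is an approximate mean value.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters (approximate)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

city = (-23.5506507, -46.6333824)  # São Paulo, from the geocoder
venue = (-23.547588, -46.634683)   # Centro Cultural Banco do Brasil (CCBB)
distance = haversine_m(city[0], city[1], venue[0], venue[1])
print(distance <= 50000)  # is the venue inside the search radius?
```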


#### Let's find out how many unique categories can be curated from all the returned venues

In [22]:
print('There are {} unique categories.'.format(len(nearby_venues['categories'].unique())))

There are 55 unique categories.
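Beyond the number of distinct categories, the most frequent ones are what the analysis is after. A minimal sketch with the standard library's `Counter` follows, using only the five venues shown in the table earlier in this section as sample data. With the dataframe, `nearby_venues['categories'].value_counts()` would give the equivalent ranking.

```python
from collections import Counter

# Categories of the five venues shown in the table earlier in this section.
sample_categories = ["Cultural Center", "Theater", "Theater", "Music Venue", "Dessert Shop"]

counts = Counter(sample_categories)
print(len(counts))            # number of distinct categories
print(counts.most_common(1))  # the most frequent category and its count
```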
