### <center>Universidade Federal do Rio Grande do Norte<br>Programa de Pós-Graduação em Engenharia Elétrica e Computação<br>Module: Data Science (Tópicos Especiais F)<br>Professor Ivanovitch Silva

### <center> Students: Marianne Diniz / Taline Nóbrega


# <center> Project 3: Youtubers for Data Science - a choropleth map perspective

# <div class="alert alert-info">Part A - Population density</div>

## 1. Introduction


A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the value of the statistical variable being displayed, such as population density or per-capita income. It is very useful for analyzing statistical data spatially: a color progression encodes the value of the variable in each region of the map.
This first part of the project analyzes Brazil's population density, especially in the Northeast region. Choropleth maps were used to represent the data more clearly.
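
As a rough illustration of the color progression idea, the sketch below maps a value onto a two-color ramp by linear interpolation. The function names and the two default colors are assumptions chosen for the example; they are not part of folium or branca.

```python
def hex_to_rgb(color):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints."""
    color = color.lstrip('#')
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def value_to_color(value, vmin, vmax, low='#ffffb2', high='#bd0026'):
    """Linearly interpolate between two colors for a value in [vmin, vmax]."""
    if vmax == vmin:
        t = 0.0
    else:
        # clamp to [0, 1] so out-of-range values receive the end colors
        t = min(max((value - vmin) / (vmax - vmin), 0.0), 1.0)
    lo, hi = hex_to_rgb(low), hex_to_rgb(high)
    rgb = [round(a + t * (b - a)) for a, b in zip(lo, hi)]
    return '#{:02x}{:02x}{:02x}'.format(*rgb)


lowest = value_to_color(0, 0, 100)     # color of the smallest value
highest = value_to_color(100, 0, 100)  # color of the largest value
```

Each region of a choropleth map would receive the color returned for its own value, so regions with similar values look similar.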


## 2. The data

Population density is the average number of inhabitants per square kilometre. This project analyzes Brazil's population density according to the [IBGE - Instituto Brasileiro de Geografia e Estatística](https://downloads.ibge.gov.br/downloads_estatisticas.htm) database.
The dataset used was the 2017 Population Estimates, which gives the estimated population of each municipality.
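
The definition above can be written as a one-line computation. This is only a sketch: the population estimates list inhabitants per municipality, so the area is assumed to come from a separate source, and the two municipalities below are hypothetical values used for illustration.

```python
def population_density(population, area_km2):
    """Return inhabitants per square kilometre."""
    if area_km2 <= 0:
        raise ValueError('area must be positive')
    return population / area_km2


# hypothetical municipalities: (estimated population, area in km2)
examples = {
    'Municipality A': (50000, 250.0),
    'Municipality B': (50000, 1000.0),
}
densities = {name: population_density(pop, area)
             for name, (pop, area) in examples.items()}
```

Two municipalities with the same population can therefore have very different densities.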

## 3. Objectives

The main objective of this part of the project is to understand how choropleth maps help in analyzing statistical data. Additionally, the results of the analysis support the idea that population density is higher in state capitals than in other regions. In general, people tend to live near city centers.

## 4. The code

In [8]:
import os
import folium
import numpy as np
import json
import pandas as pd
from branca.colormap import linear

In [2]:
# dataset name
dataset_pop_2017 = os.path.join('data', 'population_2017.csv')

# read the data to a dataframe
data2017 = pd.read_csv(dataset_pop_2017)

# replace spaces in column names with underscores
data2017.columns = [cols.replace(' ', '_') for cols in data2017.columns]

# display the first five rows
data2017.head()

Unnamed: 0,UF,COD._UF,COD._MUNIC,NOME_DO_MUNICÍPIO,POPULAÇÃO_ESTIMADA
0,RO,11.0,15.0,Alta Floresta D'Oeste,25437.0
1,RO,11.0,23.0,Ariquemes,107345.0
2,RO,11.0,31.0,Cabixi,6224.0
3,RO,11.0,49.0,Cacoal,88507.0
4,RO,11.0,56.0,Cerejeiras,17934.0


In [3]:
# filtering data by Northeast region
northeast_states = ['AL', 'BA', 'CE', 'MA', 'PI', 'PB', 'PE', 'RN', 'SE']
dataNE = data2017[data2017['UF'].isin(northeast_states)]

dataNE = dataNE.sort_values('UF')
dataNE

Unnamed: 0,UF,COD._UF,COD._MUNIC,NOME_DO_MUNICÍPIO,POPULAÇÃO_ESTIMADA
1739,AL,27.0,8402.0,São José da Tapera,32626.0
1725,AL,27.0,7008.0,Pindoba,2953.0
1724,AL,27.0,6901.0,Pilar,35552.0
1723,AL,27.0,6802.0,Piaçabuçu,18074.0
1722,AL,27.0,6703.0,Penedo,64497.0
1721,AL,27.0,6604.0,Paulo Jacinto,7679.0
1720,AL,27.0,6505.0,Passo de Camaragibe,15461.0
1718,AL,27.0,6422.0,Pariconha,10684.0
1717,AL,27.0,6406.0,Pão de Açúcar,24792.0
1716,AL,27.0,6307.0,Palmeira dos Índios,74208.0


In [4]:
# Importing GeoJson files

# searching the files in geojson/geojs-xx-mun.json
br_nordeste = os.path.join('geojson', 'geojs-nordeste-mun.json')

# load the data using 'latin-1' encoding because the names contain accented characters
with open(br_nordeste, encoding='latin-1') as geojson_file:
    geo_json_data_nordeste = json.load(geojson_file)

In [5]:
# Measuring max and min population estimates

pop_max = dataNE.POPULAÇÃO_ESTIMADA.max()
pop_min = dataNE.POPULAÇÃO_ESTIMADA.min()

print(dataNE.sort_values(['POPULAÇÃO_ESTIMADA'],ascending=[True]))

# Min - PI (1228 Miguel Leão)
# Max - BA (2953986 Salvador)

      UF  COD._UF  COD._MUNIC           NOME_DO_MUNICÍPIO  POPULAÇÃO_ESTIMADA
794   PI     22.0      6308.0                 Miguel Leão              1228.0
1240  RN     24.0     14902.0                      Viçosa              1731.0
1375  PB     25.0     10659.0                      Parari              1769.0
1433  PB     25.0     14651.0   São José do Brejo do Cruz              1806.0
1306  PB     25.0      4850.0                    Coxixola              1925.0
1397  PB     25.0     12606.0                     Quixaba              1964.0
1403  PB     25.0     12788.0     Riacho de Santo Antônio              1985.0
848   PI     22.0      9450.0  Santo Antônio dos Milagres              2125.0
1257  PB     25.0      1153.0           Areia de Baraúnas              2126.0
1161  RN     24.0      7906.0        Monte das Gameleiras              2178.0
1464  PB     25.0     17407.0                      Zabelê              2245.0
1251  PB     25.0       734.0                      Amparo       

In [6]:
# colormap yellow and red (YlOrRd)
colormap = linear.YlOrRd.scale(pop_min,pop_max)

print(colormap(70000.0))
colormap

#fffaaa


In [9]:
# Create a map object
m = folium.Map(
    location=[-5.826592, -35.212558],
    zoom_start=5,
    tiles='Stamen Terrain'
)
# Create the class breaks (thresholds) shown in the legend
threshold_scale = np.linspace(dataNE['POPULAÇÃO_ESTIMADA'].min(),
                              dataNE['POPULAÇÃO_ESTIMADA'].max(), 6, dtype=int).tolist()

m.choropleth(
    line_color='silver',
    geo_data=geo_json_data_nordeste,
    data=dataNE,
    columns=['NOME_DO_MUNICÍPIO', 'POPULAÇÃO_ESTIMADA'],
    key_on='feature.properties.description',
    fill_color='YlGnBu',
    legend_name='Population estimation (2017)',
    highlight=True,
    threshold_scale = threshold_scale
)
m

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.
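
The `threshold_scale` above splits the range between the smallest and largest population into equal-width classes. The following pure-Python sketch reproduces that idea with the minimum and maximum reported earlier (1228 and 2953986); the helper names `linear_bins` and `bin_index` are assumptions made for this example.

```python
from bisect import bisect_right


def linear_bins(vmin, vmax, n_edges=6):
    """Return n_edges evenly spaced class breaks from vmin to vmax."""
    step = (vmax - vmin) / (n_edges - 1)
    return [vmin + i * step for i in range(n_edges)]


def bin_index(value, edges):
    """Return the index of the class (0 .. len(edges) - 2) containing value."""
    i = bisect_right(edges, value) - 1
    # clamp so the minimum and maximum fall in the first and last class
    return min(max(i, 0), len(edges) - 2)


edges = linear_bins(1228, 2953986)
small_town_class = bin_index(70000, edges)
largest_city_class = bin_index(2953986, edges)
```

With equal-width classes and a strongly skewed variable, most municipalities land in the first class. Quantile-based breaks are one possible alternative when more contrast among small values is wanted.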


## 5. Results and Conclusion

As expected, the map shows that, in general, state capitals have considerably higher population density than other regions. The choropleth map uses a color scale to show the differences in population density among all municipalities of Brazil's Northeast region. Salvador, the capital of Bahia, is the municipality with the highest population density, and Miguel Leão has the lowest. The results are consistent with theory.


**Note:** Before running the code presented in this part of the project, start the notebook server with the following command to avoid the error *"IOPub data rate exceeded"*:

jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000



# <div class="alert alert-info">Part B - Uber waiting time</div>

## 1. Introduction

Uber is a company that offers car rides through a technology platform. Clients only need the Uber app installed on their smartphone to request a ride. The service is convenient, inexpensive and safe. It is an on-demand car service, which means that the number of drivers in a specific area can depend on many factors, such as the number of user requests.

This part of the project analyzes the distribution of Uber requests within Natal (a city in Rio Grande do Norte, Brazil) using choropleth maps. The analysis was based on the time needed to get an Uber ride in each neighborhood of Natal. The main purpose was to analyze the waiting time a client experiences before an Uber car becomes available.

## 2. The data

The data used in this project was collected from the Uber API by creating an [Uber](https://developer.uber.com/) session with a server token to authenticate the application and the rider. The information collected was the list of available products (UberX and UberSelect) and the estimates returned by the [Time Estimates endpoint](https://developer.uber.com/docs/riders/references/api/v1.2/estimates-time-get). 

The Products endpoint returns information about the Uber products offered at a given location. The response includes the display name and other details about each product, and lists the products in the proper display order.
In some markets, the list of products returned from this endpoint may vary by the time of day due to time restrictions on when that product may be utilized.

The Time Estimates endpoint returns ETAs (estimated times of arrival) for all products currently available at a given location, with the ETA for each product expressed as an integer number of seconds. If a product returned from GET /v1.2/products is not returned from this endpoint for a given latitude/longitude pair, then none of that product is currently available to request. Uber recommends calling this endpoint every minute to provide the most accurate, up-to-date ETAs.
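
To make the shape of these estimates concrete, the sketch below converts a response-like dictionary into minutes per product. The field names `times`, `estimate` and `localized_display_name` are the ones read by the collection code in this project; the sample payload and the helper name `eta_minutes` are hypothetical.

```python
def eta_minutes(response_json):
    """Map each product's display name to its ETA in minutes."""
    return {
        item['localized_display_name']: item['estimate'] / 60
        for item in response_json.get('times', [])
    }


# hypothetical payload containing only the fields used in this project
sample = {'times': [
    {'localized_display_name': 'uberX', 'estimate': 240},
    {'localized_display_name': 'UberSELECT', 'estimate': 420},
]}
etas = eta_minutes(sample)
```

A product that is missing from the `times` list is simply absent from the result, which matches the endpoint's convention that unavailable products are not returned.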

The Uber API enforces rate limits to help distribute resources among apps. Each registered app's server_token is limited to 2000 requests per hour. 
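
The hourly limit constrains how often the whole city can be sampled. The arithmetic below is a sketch under stated assumptions: 36 neighborhoods, 3 sampled points per neighborhood, 2 products, and one products request plus one time-estimate request per product at each point. All of these counts are illustrative parameters, not figures confirmed by the Uber documentation.

```python
import math


def calls_per_cycle(n_neighborhoods, points_per_neighborhood, n_products):
    """Requests in one pass: one products call per point plus one
    time-estimate call per product at that point."""
    return n_neighborhoods * points_per_neighborhood * (1 + n_products)


def min_cycle_seconds(calls, hourly_limit=2000):
    """Shortest cycle period that keeps the hourly request count within the limit."""
    return math.ceil(calls * 3600 / hourly_limit)


calls = calls_per_cycle(n_neighborhoods=36, points_per_neighborhood=3, n_products=2)
period = min_cycle_seconds(calls)
```

Under these assumptions one pass costs 324 requests, so a new pass should start no more often than roughly every 584 seconds to stay within 2000 requests per hour.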

## 3. The code

In [1]:
import os
import folium
import json
import time
import datetime as dt
import pandas as pd
import numpy as np
import csv
import requests
from uber_rides.session import Session
from uber_rides.client import UberRidesClient
from branca.colormap import linear
from shapely.geometry import Polygon
from shapely.geometry import Point
from numpy import random

In [2]:
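# Creating an Uber session with a server token
# (read from the environment so the secret is not stored in the notebook;
# the variable name UBER_SERVER_TOKEN is an assumption)
session = Session(server_token=os.environ['UBER_SERVER_TOKEN'])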
client = UberRidesClient(session)

In [3]:
# import geojson file about natal neighborhood
natal_neigh = os.path.join('geojson', 'natal.geojson')

# load the data using 'UTF-8' encoding
with open(natal_neigh, encoding='UTF-8') as geojson_file:
    geo_json_natal = json.load(geojson_file)

In [4]:
# listing all neighborhoods
neighborhood = [neigh['properties']['name'] for neigh in geo_json_natal['features']]

In [5]:
def nearest_road_distance(log, lat):
    # ask the public OSRM server for the road nearest to (longitude, latitude)
    url = 'http://router.project-osrm.org/nearest/v1/car/{},{}'.format(log, lat)
    response = requests.get(url)
    response_json = response.json()
    # distance (in meters) between the input point and the nearest road
    distance = response_json.get('waypoints')[0]['distance']
    
    return distance
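
The function above assumes that the OSRM response always contains at least one waypoint. A more defensive way to read the same field is sketched below; the helper name and both sample payloads are hypothetical, and only the `code`, `waypoints` and `distance` fields of the OSRM nearest service are relied upon.

```python
def nearest_distance_from_json(response_json):
    """Return the snap distance in meters, or None if no road was found."""
    if response_json.get('code') != 'Ok':
        return None
    waypoints = response_json.get('waypoints') or []
    if not waypoints:
        return None
    return waypoints[0].get('distance')


# hypothetical payloads shaped like OSRM nearest responses
sample_ok = {'code': 'Ok',
             'waypoints': [{'distance': 12.5, 'name': 'Example Street',
                            'location': [-35.2, -5.8]}]}
sample_error = {'code': 'NoSegment', 'message': 'no segment found'}

found = nearest_distance_from_json(sample_ok)
missing = nearest_distance_from_json(sample_error)
```

Returning `None` lets the caller decide whether to discard the point or retry, instead of failing with an exception in the middle of a long collection run.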

In [6]:
# return a number of points inside the polygon
def generate_random(number, polygon, neighborhood):
    list_of_points = []
    minx, miny, maxx, maxy = polygon.bounds
    max_distance = 400
    counter = 0
    while counter < number:
        x = random.uniform(minx, maxx)
        y = random.uniform(miny, maxy)
        pnt = Point(x, y)
        if polygon.contains(pnt) and nearest_road_distance(x, y) <= max_distance:
            list_of_points.append([x,y,neighborhood])
            counter += 1
    return list_of_points
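
`generate_random` is a form of rejection sampling: candidates are drawn uniformly from the bounding box of the polygon and kept only if they fall inside it. The sketch below shows the same idea without Shapely, using a ray-casting test for containment. It omits the road-distance filter, and the function names and the triangle are assumptions made for the example.

```python
import random


def point_in_polygon(x, y, vertices):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_points(n_points, vertices, rng):
    """Draw n_points uniformly inside the polygon by rejection sampling."""
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    points = []
    while len(points) < n_points:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, vertices):
            points.append((x, y))
    return points


triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
points = sample_points(5, triangle, random.Random(0))
```

The expected number of draws grows as the polygon covers less of its bounding box, which is worth keeping in mind for long, thin neighborhoods.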

In [None]:
number_of_points = 3

while True:
    # append to the CSV file; the with-block closes it at the end of each pass
    with open(os.path.join('data', 'uber.csv'), 'a', newline='') as csv_file:
        writer = csv.writer(csv_file)

        for feature in geo_json_natal['features']:
            # get the name of neighborhood
            neigh_name = feature['properties']['name']
            # take the coordinates (log,lat) of neighborhood
            geom = feature['geometry']['coordinates']
            # create a polygon using all coordinates
            polygon = Polygon(geom[0])
            # return number_of_points by neighborhood as a list [[log,lat,name],....]
            points = generate_random(number_of_points, polygon, neigh_name)
            # iterate over all points and query the Uber API for each one
            for log, lat, name in points:
                try:
                    # API - get/products
                    response = client.get_products(lat, log)
                    products = response.json.get('products')
                except Exception as error:
                    print('products request failed:', error)
                    continue

                for product in products:
                    now = dt.datetime.now()
                    try:
                        wait_time = client.get_pickup_time_estimates(lat, log, product['product_id'])
                        estimate = wait_time.json.get('times')[0]
                        # six fields, matching the columns read back for the analysis
                        row = [name, now, estimate['estimate'], lat, log,
                               estimate['localized_display_name']]
                        writer.writerow(row)
                        print(row)
                    except Exception as error:
                        print('time estimate failed:', error)

    time.sleep(180) # 3 minutes break

In [7]:
# dataset name
dataset_uber = os.path.join('data', 'uber.csv')

# read the data to a dataframe; the collected file has no header row,
# so the column names are defined here
data_uber = pd.read_csv(dataset_uber, encoding='ISO-8859-1', header=None,
                        names=['NEIGHBORHOOD','DATE','WAIT_TIME','LATITUDE','LONGITUDE','PRODUCT'])

In [8]:
# Dividing the dataset between the two products
data_uberX = data_uber[(data_uber['PRODUCT'] == 'uberX')]
data_uberSelect = data_uber[(data_uber['PRODUCT'] == 'UberSELECT')] 

In [9]:
# Measuring mean time for UberX and UberSelect
uberX_meanTime = data_uberX.pivot_table(index='NEIGHBORHOOD', values='WAIT_TIME', aggfunc='mean')
uberSelect_meanTime = data_uberSelect.pivot_table(index='NEIGHBORHOOD', values='WAIT_TIME', aggfunc='mean')
# turn the NEIGHBORHOOD index back into a column so it can be used as the choropleth key
uberX_meanTime = uberX_meanTime.reset_index()
uberSelect_meanTime = uberSelect_meanTime.reset_index()
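
The pivot table above computes one mean waiting time per neighborhood. For readers less familiar with pandas, the same aggregation can be sketched in plain Python; the rows below are hypothetical `(neighborhood, wait time in seconds)` pairs and the function name is an assumption.

```python
from collections import defaultdict


def mean_wait_by_neighborhood(rows):
    """Return {neighborhood: mean wait time}, sorted by neighborhood name."""
    totals = defaultdict(lambda: [0.0, 0])  # name -> [sum of waits, count]
    for name, wait in rows:
        totals[name][0] += wait
        totals[name][1] += 1
    return {name: total / count
            for name, (total, count) in sorted(totals.items())}


# hypothetical samples: (neighborhood, wait time in seconds)
rows = [('Neighborhood A', 300), ('Neighborhood B', 600), ('Neighborhood A', 500)]
means = mean_wait_by_neighborhood(rows)
```

A neighborhood with no samples is absent from the result, so the number of rows can be smaller than the number of polygons in the GeoJSON file.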

In [10]:
# Create a map object - UBER X
m = folium.Map(
    location=[-5.826592, -35.212558],
    zoom_start=12,
    tiles='OpenStreetMap'
)

# create the class breaks (thresholds) shown in the legend
threshold_scale = np.linspace(uberX_meanTime['WAIT_TIME'].min(),
                              uberX_meanTime['WAIT_TIME'].max(), 6, dtype=int).tolist()

m.choropleth(
    geo_data=geo_json_natal,
    data=uberX_meanTime,
    columns=['NEIGHBORHOOD', 'WAIT_TIME'],
    key_on='feature.properties.name',
    fill_color='YlOrRd',
    legend_name='MEAN EXPECTED WAIT TIME (UBER X)',
    highlight=True,
    threshold_scale = threshold_scale
)


# print one marker on each neighborhood
for neighborhood in geo_json_natal['features']:
    
    # get the name of neighborhood
    name = neighborhood['properties']['name']
    # take the coordinates (log,lat) of neighborhood
    geom = neighborhood['geometry']['coordinates']
    # create a polygon using all coordinates
    polygon = Polygon(geom[0])
    
    # create a popup with the name of the neighborhood
    popup = folium.Popup(name, max_width=580)
    # place a marker at the centroid of the neighborhood
    folium.Marker([polygon.centroid.y, polygon.centroid.x],
                  popup=popup
                 ).add_to(m)
    

# print the map
m
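
Each marker above is placed at `polygon.centroid`, with the coordinates swapped to `[y, x]` because folium expects latitude first. As an illustration of what that centroid is, the sketch below computes the area-weighted centroid of a simple polygon without holes using the shoelace formula. The function name is an assumption, and the result should agree with Shapely's for such polygons.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        cross = x1 * y2 - x2 * y1
        area2 += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    if area2 == 0:
        raise ValueError('degenerate polygon')
    return cx / (3 * area2), cy / (3 * area2)


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
center_x, center_y = polygon_centroid(square)
```

A GeoJSON ring that repeats its first vertex at the end gives the same result, since the extra zero-length edge contributes nothing to the sums.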

In [11]:
# Create a map object - UBER Select
m = folium.Map(
    location=[-5.826592, -35.212558],
    zoom_start=12,
    tiles='OpenStreetMap'
)

# create the class breaks (thresholds) shown in the legend
threshold_scale = np.linspace(uberSelect_meanTime['WAIT_TIME'].min(),
                              uberSelect_meanTime['WAIT_TIME'].max(), 6, dtype=int).tolist()
m.choropleth(
    geo_data=geo_json_natal,
    data=uberSelect_meanTime,
    columns=['NEIGHBORHOOD', 'WAIT_TIME'],
    key_on='feature.properties.name',
    fill_color='YlOrRd',
    legend_name='MEAN EXPECTED WAIT TIME (UBER SELECT)',
    highlight=True,
    threshold_scale = threshold_scale
)

for neighborhood in geo_json_natal['features']:
    
    # get the name of neighborhood
    name = neighborhood['properties']['name']
    # take the coordinates (log,lat) of neighborhood
    geom = neighborhood['geometry']['coordinates']
    # create a polygon using all coordinates
    polygon = Polygon(geom[0])
      
    # create a popup with the name of the neighborhood
    popup = folium.Popup(name, max_width=580)
    
    folium.Marker([polygon.centroid.y, polygon.centroid.x],
                  popup=popup
                 ).add_to(m)
m

## 4. Results and Conclusion

This project used the Uber API and choropleth maps to analyze the distribution of Uber products (UberX and UberSelect) in Natal. The analysis lasted one week, and the methodology consisted of collecting the estimated waiting time for each neighborhood of the city. In other words, the aim was to measure the waiting time to get an Uber ride in each part of the city throughout the day. 

The results were quite interesting. Comparing the two choropleth maps generated for UberX and UberSelect, it is clear that the north region of the city, which is the socially and economically less developed area, had the longest waiting times. This suggests that this kind of service is associated with social issues. 

Choropleth maps proved to be a useful tool for studying and analyzing statistical data spatially. They give a clear understanding of how a variable is distributed. 

As a suggestion for future projects, these results could be compared with economic data to show how social segregation influences fair access to technology. 