# IBM Data Science Professional Certificate Capstone Project
# Bike Sharing Stations in Bogota

## Introduction
Bogota is ranked as the fifth most traffic-congested city in the world and the first among South American capitals. According to the INRIX Global Traffic Scorecard (2018), a Bogota resident loses 75 hours a year in traffic jams. In fact, most Latin American cities have to deal with rapid and unplanned urban growth, which poses a major challenge for mobility and traffic dynamics.

Against this backdrop, promoting the bicycle as a daily and safe mode of transport has become a common objective of sustainability and equity policies in large cities. Cycling not only reduces carbon emissions in big cities; it also helps to alleviate traffic congestion, decreases travel times, and benefits people's health and wellbeing.

Bogota is known worldwide as a bike-friendly city. It has a population of around eight million people and more than 360 km (220 miles) of cycle paths. Almost 84,000 people use Bogota’s cycle route network every day, which represents only around 1% of the total population. This has led the local government to ask: 'How can the bicycle become a daily and safe means of transport for most people?'

I think a bike-sharing system would help answer this question for a city like Bogota. It also addresses other concerns among citizens, including vandalism, parking or storage, and maintenance.

This raises another question: **what would be the ideal locations for bike-sharing points within the city?**

To answer this question, I assumed that bike-sharing stations should be located near the places citizens visit most. Bogota is divided into 20 districts; in this project I chose one of them, 'Teusaquillo'. I used the Foursquare API to get the top venues within this district and clustered them with DBSCAN (density-based spatial clustering of applications with noise), a machine learning algorithm that groups together points that are closely packed (points with many nearby neighbors) and marks as outliers the points that lie alone in low-density regions. The centers of the clusters are then taken as the locations of the bike-sharing stations.

Finally, a GeoJSON file with the bike paths of Bogota helps to visualize how far these stations would be from a bike route.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Downloading and Exploring Dataset</a>

2. <a href="#item2">Making Calls with Foursquare</a>

3. <a href="#item3">Exploring Top Venues in Bogota</a>

4. <a href="#item4">Clustering Venues using DBSCAN</a>

5. <a href="#item5">Defining the Location of our Bike Stations</a>    
</font>
</div>

## 1. Download and Explore Dataset

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [96]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import tools for clustering stage
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library


print('Libraries imported.')

Collecting package metadata: done
Solving environment: done

# All requested packages already installed.

Collecting package metadata: done
Solving environment: done

# All requested packages already installed.

Libraries imported.


### A little about Bogota organization...

Bogota is divided in 112 Zonal Planning Units (**UPZ** in Spanish) and 4 Rural Planning Units (**UPR**), each one pertaining to one of the 20 total existing districts.

In order to segment the UPZs and explore them, we need a dataset that contains the 20 districts and the UPZs in each district, as well as the latitude and longitude coordinates of each UPZ.

I was able to download a dataset with all the UPZs and their locations as a csv file from [Mapas Bogota](https://mapas.bogota.gov.co). I uploaded it to my GitHub repository [here](https://github.com/GhostDeini/bogota_bici_maps/blob/master/Unidad_de_planeamiento.csv). However, it does not include the district of each UPZ. Luckily, I found a table that lists each UPZ’s number and its corresponding district on [this Wikipedia page](https://es.wikipedia.org/wiki/Unidades_de_Planeamiento_Zonal). I used **BeautifulSoup** to scrape the wiki page into a dataframe, then used the merge command to join both dataframes into a final one.
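The join described above can be sketched on toy data before running it on the real tables. The two small frames below are made up for illustration (only the column names and a few UPZ numbers mirror the ones used in this notebook); an inner join keeps only the UPZ numbers present in both frames.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (illustration only)
locations = pd.DataFrame({
    "UPZ": [100, 101, 999],  # 999 has no district entry
    "Name UPZ": ["GALERIAS", "TEUSAQUILLO", "UNKNOWN"],
})
districts = pd.DataFrame({
    "UPZ": [100, 101, 83],   # 83 has no location entry
    "District": ["TEUSAQUILLO", "TEUSAQUILLO", "KENNEDY"],
})

# Inner join on the shared key: rows without a match on both sides are dropped
merged = pd.merge(locations, districts, on="UPZ", how="inner")
print(merged)
```

Comparing the row count before and after the merge is a quick way to check that no UPZ was lost in the join.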

### First dataframe - Zonal Planning Units and their location

In [150]:
df_bog1=pd.read_csv('https://raw.githubusercontent.com/GhostDeini/bogota_bici_maps/master/Unidad_de_planeamiento.csv')
print('Dataset downloaded and read into a pandas dataframe!')

Dataset downloaded and read into a pandas dataframe!


In [151]:
# let's take a look
df_bog1.head()

Unnamed: 0,OBJECTID,Identificador unico de la unidad de planeamiento,Tipo de unidad de planeamiento,Nombre de la unidad de planeamiento,Acto administrativo de la unidad de planeamiento,Area de la unidad de planeamiento,coord_x,coord_y
0,129,UPZ100,1,GALERIAS,Dcto. 621-29/12/2006 (Gaceta 456/2007),2375681.0,"-74,07208281899997","4,642903876652436"
1,130,UPZ83,1,LAS MARGARITAS,,1472415.0,"-74,1781751275","4,637752138652276"
2,131,UPZ107,1,QUINTA PAREDES,Dcto. 086-8/03/2011 Modificatorio del 1096-26/...,1739560.0,"-74,0902518645","4,631541124152076"
3,132,UPZ101,1,TEUSAQUILLO,Dcto. 492-26/10/2007 Mod.=Res 253/2009 (Gacet...,2357008.0,"-74,075133155","4,626522263151916"
4,133,UPZ91,1,SAGRADO CORAZON,Dcto. 492-26/10/2007 Mod.=Res 249/2009 (Gacet...,1461893.0,"-74,0639951005","4,6192456991516915"


In [152]:
df_bog1.shape

(116, 8)

In [153]:
# remove UPR rows
df_bog1.drop(df_bog1[df_bog1['Identificador unico de la unidad de planeamiento'].str.contains('UPR')].index, inplace=True)
df_bog1.shape

(112, 8)

In [154]:
df_bog1.columns

Index(['OBJECTID', 'Identificador unico de la unidad de planeamiento',
       'Tipo de unidad de planeamiento', 'Nombre de la unidad de planeamiento',
       'Acto administrativo de la unidad de planeamiento',
       'Area de la unidad de planeamiento', 'coord_x', 'coord_y'],
      dtype='object')

In [155]:
# rename some columns
df_bog1.rename(columns={'Identificador unico de la unidad de planeamiento': 'UPZ',
                        'Nombre de la unidad de planeamiento': 'Name UPZ',
                        'coord_x': 'longitude',
                        'coord_y': 'latitude'}, inplace=True)

# drop unnecessary ones
df_bog1.drop(['OBJECTID', 'Tipo de unidad de planeamiento',
              'Acto administrativo de la unidad de planeamiento',
              'Area de la unidad de planeamiento'], axis=1, inplace=True)
df_bog1.head()

Unnamed: 0,UPZ,Name UPZ,longitude,latitude
0,UPZ100,GALERIAS,"-74,07208281899997","4,642903876652436"
1,UPZ83,LAS MARGARITAS,"-74,1781751275","4,637752138652276"
2,UPZ107,QUINTA PAREDES,"-74,0902518645","4,631541124152076"
3,UPZ101,TEUSAQUILLO,"-74,075133155","4,626522263151916"
4,UPZ91,SAGRADO CORAZON,"-74,0639951005","4,6192456991516915"


In [156]:
# modify the 'UPZ' column to leave just a number (strip every non-digit character)
df_bog1['UPZ'] = df_bog1['UPZ'].str.replace(r'\D', '', regex=True)
df_bog1.head()

Unnamed: 0,UPZ,Name UPZ,longitude,latitude
0,100,GALERIAS,"-74,07208281899997","4,642903876652436"
1,83,LAS MARGARITAS,"-74,1781751275","4,637752138652276"
2,107,QUINTA PAREDES,"-74,0902518645","4,631541124152076"
3,101,TEUSAQUILLO,"-74,075133155","4,626522263151916"
4,91,SAGRADO CORAZON,"-74,0639951005","4,6192456991516915"


In [157]:
df_bog1.dtypes

UPZ          object
Name UPZ     object
longitude    object
latitude     object
dtype: object

In [158]:
# cast 'UPZ' into an integer
df_bog1['UPZ']=df_bog1['UPZ'].astype(int)
df_bog1.dtypes

UPZ           int64
Name UPZ     object
longitude    object
latitude     object
dtype: object

In [159]:
df_bog1.reset_index(inplace=True)

### Second dataframe - Zonal Planning Units and their district

In [160]:
from bs4 import BeautifulSoup

The **requests** library is also needed to send the HTTP request. I passed the URL of the page we are going to scrape to `requests.get` and stored the response in a variable named `wiki_url`.

In [161]:
import requests
wiki_url = requests.get('https://es.wikipedia.org/wiki/Unidades_de_Planeamiento_Zonal')
soup = BeautifulSoup(wiki_url.content,'lxml')
#print(soup.prettify())

You can uncomment the last line to take a look at the whole HTML document.

Now, let's use the `find` method to look up the first table with the class ‘wikitable’ in the HTML and assign it to the variable `my_table`.

In [162]:
my_table = soup.find('table',{'class':'wikitable'})
#my_table

In [163]:
len(my_table.find_all("tr"))

113

The list has 113 elements, so the table has 113 rows, including the header row. Let's create an empty array of 113 rows and 4 columns, matching the table on the wiki page.

In [164]:
matrix = np.empty((len(my_table.find_all("tr")), 4), dtype=object)

Now we fill the array with the values of the table using a for loop. Note the use of the `stripped_strings` generator.
When a tag contains more than one string (as in our case), you can iterate over just the strings with the `.strings` generator. Since these strings tend to carry a lot of extra whitespace, the `.stripped_strings` generator removes it.

In [165]:
for i, val in enumerate(my_table.find_all("tr")):
    for j,string in enumerate(val.stripped_strings):
        matrix[i][j]=string
#matrix
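To make the difference between `.strings` and `.stripped_strings` concrete, here is a minimal sketch on a one-row table. The HTML snippet is invented for illustration and is not copied from the wiki page; it uses the built-in `html.parser` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table, padded with whitespace like real wiki markup
html = """
<table class="wikitable">
  <tr>
    <td> 100 </td>
    <td> <a href="#">Galerías</a> </td>
    <td> 13 </td>
    <td> <a href="#">Teusaquillo</a> </td>
  </tr>
</table>
"""

row = BeautifulSoup(html, "html.parser").find("tr")

raw = list(row.strings)             # includes whitespace-only strings
clean = list(row.stripped_strings)  # whitespace trimmed, empty strings skipped
print(clean)
```

Each cell contributes exactly one cleaned string here, which is why the loop above can write the j-th string of a row into the j-th column of the array.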

Let's convert it to a **pandas** dataframe.

In [166]:
df_bog2=pd.DataFrame(data=matrix[1:,0:],
                       columns=matrix[0,0:])
df_bog2.head()

Unnamed: 0,Número,Nombre,Localidad,None
0,1,Paseo de los Libertadores,1,Usaquén
1,9,Verbenal,1,Usaquén
2,10,La Uribe,1,Usaquén
3,11,San Cristóbal Norte,1,Usaquén
4,12,Toberín,1,Usaquén


In [167]:
# rename some columns
df_bog2.rename(columns={'Número': 'UPZ', None: 'District', 'Localidad': 'Dis_ID'}, inplace=True)
# drop the UPZ name column because we already have it in the first dataset
df_bog2.drop('Nombre', axis=1, inplace=True)
# strip accents from the 'District' column and convert it to uppercase
df_bog2['District']=df_bog2['District'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8').str.upper()
df_bog2.head()

Unnamed: 0,UPZ,Dis_ID,District
0,1,1,USAQUEN
1,9,1,USAQUEN
2,10,1,USAQUEN
3,11,1,USAQUEN
4,12,1,USAQUEN
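The `normalize` / `encode` / `decode` chain used above is what turns 'Usaquén' into 'USAQUEN'. The same idea can be sketched for a single string with the standard library; the helper name `strip_accents_upper` is mine and only serves this illustration.

```python
import unicodedata

def strip_accents_upper(text):
    """Return `text` without diacritics, in upper case."""
    # NFKD splits an accented letter into the base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors='ignore' drops the combining marks
    ascii_only = decomposed.encode("ascii", errors="ignore").decode("ascii")
    return ascii_only.upper()

print(strip_accents_upper("Usaquén"))
```

Note that any character without an ASCII base letter is dropped silently, so this approach suits names like these but not arbitrary text.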


In [168]:
# cast the UPZ and Dis_ID columns into integers
df_bog2['UPZ']=df_bog2['UPZ'].astype(int)
df_bog2['Dis_ID']=df_bog2['Dis_ID'].astype(int)
df_bog2.dtypes

UPZ          int64
Dis_ID       int64
District    object
dtype: object

In [169]:
df_bog2.head()

Unnamed: 0,UPZ,Dis_ID,District
0,1,1,USAQUEN
1,9,1,USAQUEN
2,10,1,USAQUEN
3,11,1,USAQUEN
4,12,1,USAQUEN


### Now we merge both dataframes using an inner join on 'UPZ' number

In [170]:
df_bog=pd.merge(df_bog1, df_bog2, on='UPZ', how='inner')
df_bog.head()

Unnamed: 0,index,UPZ,Name UPZ,longitude,latitude,Dis_ID,District
0,0,100,GALERIAS,"-74,07208281899997","4,642903876652436",13,TEUSAQUILLO
1,1,83,LAS MARGARITAS,"-74,1781751275","4,637752138652276",8,KENNEDY
2,2,107,QUINTA PAREDES,"-74,0902518645","4,631541124152076",13,TEUSAQUILLO
3,3,101,TEUSAQUILLO,"-74,075133155","4,626522263151916",13,TEUSAQUILLO
4,4,91,SAGRADO CORAZON,"-74,0639951005","4,6192456991516915",3,SANTA FE


In [171]:
df_bog.shape

(112, 7)

In [172]:
# drop the repeated 'index' column
df_bog.drop(['index'], axis=1, inplace=True)
df_bog.head()

Unnamed: 0,UPZ,Name UPZ,longitude,latitude,Dis_ID,District
0,100,GALERIAS,"-74,07208281899997","4,642903876652436",13,TEUSAQUILLO
1,83,LAS MARGARITAS,"-74,1781751275","4,637752138652276",8,KENNEDY
2,107,QUINTA PAREDES,"-74,0902518645","4,631541124152076",13,TEUSAQUILLO
3,101,TEUSAQUILLO,"-74,075133155","4,626522263151916",13,TEUSAQUILLO
4,91,SAGRADO CORAZON,"-74,0639951005","4,6192456991516915",3,SANTA FE


In [173]:
df_bog.dtypes

UPZ           int64
Name UPZ     object
longitude    object
latitude     object
Dis_ID        int64
District     object
dtype: object

In [174]:
# cast the latitude and longitude columns to float type
df_bog['latitude']=df_bog['latitude'].str.replace(',','.')
df_bog['longitude']=df_bog['longitude'].str.replace(',','.')
df_bog['latitude']=df_bog['latitude'].astype('float64')
df_bog['longitude']=df_bog['longitude'].astype('float64')
df_bog.head()

Unnamed: 0,UPZ,Name UPZ,longitude,latitude,Dis_ID,District
0,100,GALERIAS,-74.072083,4.642904,13,TEUSAQUILLO
1,83,LAS MARGARITAS,-74.178175,4.637752,8,KENNEDY
2,107,QUINTA PAREDES,-74.090252,4.631541,13,TEUSAQUILLO
3,101,TEUSAQUILLO,-74.075133,4.626522,13,TEUSAQUILLO
4,91,SAGRADO CORAZON,-74.063995,4.619246,3,SANTA FE
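The comma-to-dot replacement above is needed because the source file writes coordinates with a decimal comma. As a side note, `pandas.read_csv` has a `decimal` parameter that can do this conversion while reading. The sketch below uses a made-up, semicolon-separated extract (the separator and the rounded values are assumptions for illustration); whether it applies directly to the original file depends on how that file quotes its fields.

```python
import io
import pandas as pd

# Hypothetical two-row extract; ';' as separator is an assumption for this sketch
raw = io.StringIO(
    "UPZ;coord_x;coord_y\n"
    "UPZ100;-74,072083;4,642904\n"
    "UPZ101;-74,075133;4,626522\n"
)

# decimal=',' makes pandas parse '4,642904' as the float 4.642904
df = pd.read_csv(raw, sep=";", decimal=",")
print(df.dtypes)
```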


Our dataframe is ready!

## 2. Making Calls with Foursquare

### Defining Foursquare credentials

In [175]:
CLIENT_ID = 'NZHMV3O4FE4Z15HFB54BPMTMJ1B1VVKZYX05RWZFIT4DXOWZ' # your Foursquare ID
CLIENT_SECRET = 'FREZIQU3ILKVDTMHALFNU3MMOY2EDJJWWNXS3GZHA4Q4QVAP' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: NZHMV3O4FE4Z15HFB54BPMTMJ1B1VVKZYX05RWZFIT4DXOWZ
CLIENT_SECRET:FREZIQU3ILKVDTMHALFNU3MMOY2EDJJWWNXS3GZHA4Q4QVAP


### Let's explore the first UPZ

In [176]:
df_bog.loc[0, 'Name UPZ']

'GALERIAS'

In [177]:
# get latitude and longitude values
UPZ_latitude = df_bog.loc[0, 'latitude'] # UPZ latitude value
UPZ_longitude = df_bog.loc[0, 'longitude'] # UPZ longitude value

UPZ_name = df_bog.loc[0, 'Name UPZ'] # UPZ name

print('Latitude and longitude values of {} are {}, {}.'.format(UPZ_name, 
                                                               UPZ_latitude, 
                                                               UPZ_longitude))

Latitude and longitude values of GALERIAS are 4.642903876652436, -74.07208281899997.


### Now, let's get the top 100 venues that are in Galerias within a radius of 500 meters.

In [178]:
# create the GET request URL
radius=500
LIMIT=100
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, UPZ_latitude, UPZ_longitude, VERSION, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?client_id=NZHMV3O4FE4Z15HFB54BPMTMJ1B1VVKZYX05RWZFIT4DXOWZ&client_secret=FREZIQU3ILKVDTMHALFNU3MMOY2EDJJWWNXS3GZHA4Q4QVAP&ll=4.642903876652436,-74.07208281899997&v=20180605&radius=500&limit=100'
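The URL above is built with string formatting. A minimal alternative sketch, using only the standard library, lets `urlencode` assemble and escape the query string. The function name `build_explore_url` and the placeholder credentials are mine, not part of the Foursquare lab.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lng, version, radius, limit):
    """Build a venues/explore URL from its query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),
        "v": version,
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials, for illustration only
url = build_explore_url("MY_ID", "MY_SECRET", 4.6429, -74.0721, "20180605", 500, 100)
print(url)
```

`urlencode` escapes the comma in `ll` as `%2C`, which is equivalent once decoded. Passing the same dictionary through the `params=` argument of `requests.get` is another option.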

In [179]:
# send the GET request and examine the results
results = requests.get(url).json()
#results

From the Foursquare lab in previous modules, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [180]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [181]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Farmatodo,Pharmacy,4.641142,-74.071437
1,Bogotá Beer Company,Pub,4.64265,-74.073772
2,El Caracol Rojo,Seafood Restaurant,4.641487,-74.072033
3,Crepes & Waffles,Restaurant,4.642651,-74.07392
4,Panamericana,Paper / Office Supplies Store,4.641836,-74.073853


And how many venues were returned by Foursquare?

In [182]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

34 venues were returned by Foursquare.


## 3. Exploring Top Venues in Bogota - Teusaquillo district

For this project I used a single district, **Teusaquillo**, together with its UPZs. Most of the cultural life and commercial activity of the city takes place there.

### Let's create a map of Bogotá to visualize where Teusaquillo is located.

In [183]:
# GeoJSON file with the limits of all the districts
!wget --quiet https://github.com/decolector/bta-geodata/raw/master/local.geojson -O localidades_bog.json
    
print('GeoJSON file downloaded!')
loc_bog = r'localidades_bog.json' # geojson file

GeoJSON file downloaded!


In [184]:
# use geopy library to get the latitude and longitude values to put a marker on the map
address1 = 'Teusaquillo'

# recent geopy versions require a user_agent; the name used here is arbitrary
geolocator = Nominatim(user_agent='bogota_bike_stations')
location1 = geolocator.geocode(address1)
lat_teusaq = location1.latitude
lng_teusaq = location1.longitude

address2 = 'Bogota'
location2 = geolocator.geocode(address2)
lat_bog = location2.latitude
lng_bog = location2.longitude

print('The geographical coordinates of Bogota are {}, {}.'.format(lat_bog, lng_bog))
print('The geographical coordinates of Teusaquillo are {}, {}.'.format(lat_teusaq, lng_teusaq))

The geographical coordinates of Teusaquillo are 4.6423434, -74.0872169.


In [185]:
# plot a map of Bogota city
bogo_map = folium.Map(location=[lat_bog, lng_bog], zoom_start=10)

# add geojson
folium.GeoJson(
    loc_bog,
    name='geojson'
).add_to(bogo_map)

# add a marker located in the center of Teusaquillo district
loc = folium.map.FeatureGroup()
    
folium.Marker(
    location=[lat_teusaq, lng_teusaq],
    icon=folium.Icon(color='red', icon='bookmark')
).add_to(loc)

bogo_map.add_child(loc)

### Now let's create a sub-dataframe with our selected district

In [186]:
df_teusaq = df_bog[df_bog['District'].str.contains('TEUSAQUILLO')]
df_teusaq = df_teusaq.reset_index(drop=True)
# take a look
df_teusaq

Unnamed: 0,UPZ,Name UPZ,longitude,latitude,Dis_ID,District
0,100,GALERIAS,-74.072083,4.642904,13,TEUSAQUILLO
1,107,QUINTA PAREDES,-74.090252,4.631541,13,TEUSAQUILLO
2,101,TEUSAQUILLO,-74.075133,4.626522,13,TEUSAQUILLO
3,106,LA ESMERALDA,-74.08695,4.647636,13,TEUSAQUILLO
4,109,CIUDAD SALITRE ORIENTAL,-74.101445,4.644015,13,TEUSAQUILLO
5,104,PARQUE SIMON BOLIVAR - CAN,-74.091185,4.648992,13,TEUSAQUILLO


### Let's create a function to repeat the same process as before and get the top venues of each UPZ in Teusaquillo

In [187]:
def getNearbyVenues(names, latitudes, longitudes, radius):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Name UPZ', 
                  'UPZ Latitude', 
                  'UPZ Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [188]:
#get the venues of all the UPZs which belong to Teusaquillo
top_teusaq = getNearbyVenues(names=df_teusaq['Name UPZ'],
                             latitudes=df_teusaq['latitude'],
                             longitudes=df_teusaq['longitude'],
                             radius=1600
                            )

GALERIAS
QUINTA PAREDES
TEUSAQUILLO
LA ESMERALDA
CIUDAD SALITRE ORIENTAL
PARQUE SIMON BOLIVAR - CAN


In [189]:
print('{} venues were returned by Foursquare.'.format(top_teusaq.shape[0]))

600 venues were returned by Foursquare.


In [190]:
# remove duplicates
top_teusaq.drop_duplicates(inplace=True)

In [191]:
# take a look
top_teusaq.head()

Unnamed: 0,Name UPZ,UPZ Latitude,UPZ Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,GALERIAS,4.642904,-74.072083,Farmatodo,4.641142,-74.071437,Pharmacy
1,GALERIAS,4.642904,-74.072083,Bogotá Beer Company,4.64265,-74.073772,Pub
2,GALERIAS,4.642904,-74.072083,Restaurante Cañón del Chicamocha,4.644697,-74.069896,Restaurant
3,GALERIAS,4.642904,-74.072083,La Famosa Sandwichería,4.640282,-74.072085,Sandwich Place
4,GALERIAS,4.642904,-74.072083,Papa John's,4.640751,-74.073745,Pizza Place


We are ready to plot our venues in a Folium map!

### Let's visualize our venues on the map

In [192]:
# plot a map centered around Teusaquillo
bogo_map = folium.Map(location=[lat_teusaq, lng_teusaq], zoom_start=13)

# add geojson
#folium.GeoJson(
#    loc_bog,
#    name='geojson'
#).add_to(bogo_map)


# add marker on Teusaquillo district
bogo_map.add_child(loc)

# let's plot the venues
# instantiate a feature group for the venues in the dataframe
putvenues = folium.map.FeatureGroup()

# loop through the venues and add each to the putvenues feature group
for lat, lng, in zip(top_teusaq['Venue Latitude'], top_teusaq['Venue Longitude']):
    putvenues.add_child(
        folium.features.CircleMarker(
            [lat, lng],
            radius=4, # define how big you want the circle markers to be
            color='green',
            fill=True,
            fill_color='green',
            fill_opacity=0.6
        )
    )
# add venues to map
bogo_map.add_child(putvenues)

# also add markers for each UPZ
putUPZ = folium.map.FeatureGroup()
# loop through the UPZs and add each to the putUPZ feature group
for lat, lng, upz in zip(df_teusaq['latitude'], df_teusaq['longitude'], df_teusaq['Name UPZ']):
    label = folium.Popup(upz, parse_html=True)
    putUPZ.add_child(
        folium.features.CircleMarker(
            [lat, lng],
            radius=7, # define how big you want the circle markers to be
            popup=label,
            color='yellow',
            fill=True,
            fill_color='yellow',
            fill_opacity=0.6
        )
    )    
# add UPZ to map
bogo_map.add_child(putUPZ)

Venues are plotted in green, while each UPZ belonging to Teusaquillo is plotted in yellow. Because the search radius is fairly large, some of the venues fall in the district of Chapinero, which is acceptable here.

We are ready to cluster our venues!

## 4. Clustering using DBSCAN

DBSCAN is a density-based data clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).
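Before applying it to the venues, the behavior described above can be seen on a tiny synthetic dataset. The points below are invented for illustration: two tight groups of five points plus one isolated point.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic points: two tight groups of five plus one isolated point
group_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
group_b = group_a + 5.0
loner = np.array([[2.5, 2.5]])
points = np.vstack([group_a, group_b, loner])

# eps is the neighborhood radius; min_samples is the number of points
# (the point itself included) required within eps to form a dense region
labels = DBSCAN(eps=0.3, min_samples=3).fit(points).labels_
print(labels)
```

Points in the same dense group share a non-negative label, and the isolated point gets the label `-1`, which is how scikit-learn marks outliers.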

### Let's cluster our venues.

In [193]:
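# cluster
# (no random seed is set: DBSCAN has no random component)
Clus_dataSet = top_teusaq[['Venue Longitude','Venue Latitude']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
# standardize both coordinates; eps below is therefore in standardized units, not meters
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)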

# Compute DBSCAN
db = DBSCAN(eps=0.25, min_samples=12).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
top_teusaq["Clus_Db"]=labels

realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels)) 

# A sample of clusters
top_teusaq[["Venue","Clus_Db"]].head(5)

Unnamed: 0,Venue,Clus_Db
0,Farmatodo,0
1,Bogotá Beer Company,0
2,Restaurante Cañón del Chicamocha,0
3,La Famosa Sandwichería,0
4,Papa John's,0


In [194]:
set(labels)

{-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

In [195]:
# how many outliers
top_teusaq.loc[top_teusaq['Clus_Db'] == -1].shape[0]

138
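A side note on units: because the coordinates were standardized before clustering, `eps=0.25` is not a distance in meters. One alternative, which is not what this notebook does, is to cluster the raw coordinates with scikit-learn's haversine metric so that `eps` can be derived from a radius in meters. The points and the 100 m radius below are invented for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, in meters

# Hypothetical venues as (latitude, longitude) in degrees:
# three close together and one far away
coords_deg = np.array([
    [4.6420, -74.0720],
    [4.6422, -74.0722],
    [4.6424, -74.0719],
    [4.6600, -74.1000],
])

# The haversine metric expects [lat, lon] in radians and returns angles,
# so a radius in meters is divided by the Earth's radius
eps_m = 100.0
db = DBSCAN(eps=eps_m / EARTH_RADIUS_M, min_samples=3,
            metric="haversine", algorithm="ball_tree")
labels = db.fit(np.radians(coords_deg)).labels_
print(labels)
```

With this setup, `eps` and `min_samples` read as "at least 3 venues within 100 m", which may be easier to reason about when choosing station spacing.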

In [196]:
# create sub-dataframes for outliers and clusters
top_teusaq_1=top_teusaq.loc[top_teusaq['Clus_Db'] != -1]
top_teusaq_2=top_teusaq.loc[top_teusaq['Clus_Db'] == -1]

### Let's plot our clusters

In [197]:
# set color scheme for the clusters (one color per label)
colors_array = cm.rainbow(np.linspace(0, 1, clusterNum))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# plot a map
bogo_map = folium.Map(location=[lat_teusaq, lng_teusaq], zoom_start=13)

# add geojson
#folium.GeoJson(
#    loc_bog,
#    name='geojson'
#).add_to(bogo_map)

# add a marker to locate Teusaquillo
bogo_map.add_child(loc)

# add yellow markers to locate each UPZ
bogo_map.add_child(putUPZ)

# plot the clusters
putvenues_final = folium.map.FeatureGroup()
for i, val in enumerate(top_teusaq_1["Clus_Db"].unique()):
    for lat, lng, in zip(top_teusaq_1.loc[top_teusaq_1['Clus_Db'] == val]['Venue Latitude'], top_teusaq_1.loc[top_teusaq_1['Clus_Db'] == val]['Venue Longitude']):
        label = folium.Popup(' Cluster ' + str(val), parse_html=True)
        putvenues_final.add_child(
            folium.features.CircleMarker(
                [lat, lng],
                radius=2, # define how big you want the circle markers to be
                popup=label,
                color=rainbow[i-1],
                fill=True,
                fill_color=rainbow[i-1],
                fill_opacity=0.6
            )
        )
        
# plot outliers in grey
put_outliers = folium.map.FeatureGroup()

for lat, lng, in zip(top_teusaq_2['Venue Latitude'], top_teusaq_2['Venue Longitude']):
    label = folium.Popup('Outlier', parse_html=True)
    put_outliers.add_child(
        folium.features.CircleMarker(
            [lat, lng],
            radius=2, # define how big you want the circle markers to be
            popup=label,
            color='gray',
            fill=True,
            fill_color='gray',
            fill_opacity=0.6
        )
    )

# add each of feature group to the map
bogo_map.add_child(putvenues_final)
bogo_map.add_child(put_outliers)

Outliers are plotted in gray.

## 5. Defining the Location of our Bike Stations

As stated in the introduction, the centroids of the clusters are taken as the locations of the bike-sharing stations. To find the centroid of each DBSCAN cluster, I computed the mean latitude and longitude of its venues. The centroids are then plotted in black.

In [198]:
# one centroid per real cluster (np.float was removed from recent NumPy versions)
lat_mean = np.zeros(realClusterNum, dtype=float)
lng_mean = np.zeros(realClusterNum, dtype=float)
for i, val in enumerate(top_teusaq_1["Clus_Db"].unique()):
    cluster = top_teusaq_1.loc[top_teusaq_1['Clus_Db'] == val]
    lat_mean[i] = cluster['Venue Latitude'].mean()
    lng_mean[i] = cluster['Venue Longitude'].mean()
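The same centroids can also be obtained in one step with a `groupby`. This is a minimal sketch on invented coordinates, reusing the column names of this notebook; it is an alternative formulation, not the code that produced the maps here.

```python
import pandas as pd

# Hypothetical clustered venues (coordinates invented for illustration)
venues = pd.DataFrame({
    "Venue Latitude":  [4.640, 4.642, 4.650, 4.652, 4.700],
    "Venue Longitude": [-74.070, -74.072, -74.090, -74.092, -74.100],
    "Clus_Db":         [0, 0, 1, 1, -1],
})

# Drop outliers (label -1), then average the coordinates inside each cluster
centroids = (
    venues[venues["Clus_Db"] != -1]
    .groupby("Clus_Db")[["Venue Latitude", "Venue Longitude"]]
    .mean()
)
print(centroids)
```

Keep in mind that DBSCAN clusters can have arbitrary shapes, so the mean of a cluster is not guaranteed to fall on one of its dense spots.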

Just for fun, let's add a GeoJSON layer with the existing bike paths in Bogota to see how far our bike-sharing points would be from a bike route.

In [199]:
# GeoJSON file with the existing bike paths (ciclorutas) in Bogota
!wget --quiet https://raw.githubusercontent.com/GhostDeini/bogota_bici_maps/master/cicloruta7.geojson -O bogota_ciclorutas.json
    
print('GeoJSON file downloaded!')
bici_bog = r'bogota_ciclorutas.json' # geojson file

GeoJSON file downloaded!


In [200]:
# plot again

bogo_map = folium.Map(location=[lat_teusaq, lng_teusaq], zoom_start=13)

# add geojson
folium.GeoJson(
    bici_bog,
    name='geojson'
).add_to(bogo_map)

#bogo_map.add_child(loc)

# plot the centroids
put_centroids = folium.map.FeatureGroup()

for i, val in enumerate(lat_mean):
    label = folium.Popup('Centroid Center', parse_html=True)
    put_centroids.add_child(
        folium.features.CircleMarker(
            [lat_mean[i], lng_mean[i]],
            radius=3, # define how big you want the circle markers to be
            popup=label,
            color='black',
            fill=True,
            fill_color='black',
            fill_opacity=0.6
        )
    )

# add venues to map
bogo_map.add_child(putvenues_final)
#bogo_map.add_child(putUPZ)
bogo_map.add_child(put_centroids)
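The map gives a visual impression of how close each station is to a bike route. To put a rough number on it, one could compute the great-circle distance from a station to the nearest vertex of a path. The sketch below uses only the standard library; the station, the path vertices, and the helper names are hypothetical, and measuring to vertices rather than to the segments between them is a simplification.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in meters."""
    r = 6371000.0  # mean Earth radius in meters
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * r * asin(sqrt(a))

def distance_to_path_m(station, path_points):
    """Distance from a station to the closest vertex of a path."""
    return min(haversine_m(station[0], station[1], lat, lng) for lat, lng in path_points)

# Hypothetical station and bike-path vertices as (latitude, longitude)
station = (4.6420, -74.0720)
path = [(4.6420, -74.0730), (4.6500, -74.0800)]
print(round(distance_to_path_m(station, path)))
```

When reading vertices from a GeoJSON file, remember that GeoJSON stores coordinates as [longitude, latitude], so they need to be swapped before calling these helpers.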