<h1> Capstone Project - The Battle of Neighborhoods</h1>
<h2>Recommender System for Opening Businesses in Bogota, Colombia</h2>

<h3> Table of Contents </h3>

<h3> Introduction </h3>
<ul>
<li><p>This project aims to determine the best location to open a new business in Bogota, Colombia, using a book cafe as the example business.</p></li>
<li><p>We need to consider in which localities this kind of business is visited most frequently, and whether many similar businesses already operate in the vicinity (to avoid too much competition).</p></li>
<li><p>Additionally, since this project is aimed at helping people who lost their business due to the covid pandemic, other factors matter as well: the number of covid cases in the area (to avoid opening a business in a locality that might face more restrictions or lockdowns) and the availability of open spaces such as nearby parks, so that the business can offer an open-air section and comply with the covid restrictions established by the government.</p></li>
</ul>

<h2> Data Capture/Import </h2>
<p>As a first step we will capture the data needed for the project. This includes:</p>
<ul>
<li>Import data from a .csv file with the names and codes of Bogota's neighbourhoods and their corresponding localities. Original source <a href="https://es.wikipedia.org/wiki/Unidades_de_Planeamiento_Zonal">here</a>. Save it in a pandas dataframe.</li>
<li>Import data from a .csv file, downloaded from an official government source, with the covid cases recorded in Bogota, Colombia. Original source <a href="https://saludata.saludcapital.gov.co/osb/index.php/datos-de-salud/enfermedades-trasmisibles/covid19/">here</a>. Save it in a pandas dataframe.</li>
</ul>
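<p>As a minimal sketch of the import step, the snippet below reads a small in-memory stand-in for the covid file. The sample rows and the name <code>df_sample</code> are assumptions made for illustration; only the <code>;</code> separator and the column names come from the real file.</p>

```python
import io

import pandas as pd

# Hypothetical miniature of the covid file: semicolon-separated, same column names
sample_csv = io.StringIO(
    "CASO;FECHA_DIAGNOSTICO;CIUDAD;LOCALIDAD_ASIS\n"
    "1;2020-03-06;Bogotá;Usaquén\n"
    "2;2020-03-10;Bogotá;Engativá\n"
)

# sep=";" sets the delimiter; parse_dates converts the date column while reading
df_sample = pd.read_csv(sample_csv, sep=";", parse_dates=["FECHA_DIAGNOSTICO"])

print(df_sample.shape)
print(df_sample.dtypes)
```

<p>Parsing the date while reading is optional; converting it later with <code>pd.to_datetime</code> gives the same result.</p>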


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import pandas as pd
import numpy as np



In [59]:
df_neighborhoods = pd.read_csv("/content/drive/MyDrive/Coursera IBM Data Science/Capstone Project/Bogota Neighbourhoods - Hoja 1.csv")

In [60]:
df_covid = pd.read_csv("/content/drive/MyDrive/Coursera IBM Data Science/Capstone Project/osb_enftransm-covid-19_13072021.csv",sep=";")

<h2>Data Exploration</h2>


In [61]:
df_neighborhoods.head()

Unnamed: 0,Numero,Nombre,Localidad
0,1,Paseo de los Libertadores,Usaquén
1,9,Verbenal,Usaquén
2,10,La Uribe,Usaquén
3,11,San Cristóbal Norte,Usaquén
4,12,Toberín,Usaquén


In [62]:
df_neighborhoods.dtypes

Numero        int64
Nombre       object
Localidad    object
dtype: object

In [63]:
df_covid.head()

Unnamed: 0,CASO,FECHA_DE_INICIO_DE_SINTOMAS,FECHA_DIAGNOSTICO,CIUDAD,LOCALIDAD_ASIS,EDAD,UNI_MED,SEXO,FUENTE_O_TIPO_DE_CONTAGIO,UBICACION,ESTADO
0,1,2020-02-26,2020-03-06,Bogotá,Usaquén,19,1,F,Importado,Casa,Recuperado
1,2,2020-03-04,2020-03-10,Bogotá,Engativá,22,1,F,Importado,Casa,Recuperado
2,3,2020-03-07,2020-03-10,Bogotá,Engativá,28,1,F,Importado,Casa,Recuperado
3,4,2020-03-06,2020-03-12,Bogotá,Fontibón,36,1,F,Importado,Casa,Recuperado
4,5,2020-03-06,2020-03-12,Bogotá,Kennedy,42,1,F,Importado,Casa,Recuperado


In [64]:
df_covid.dtypes

CASO                            int64
FECHA_DE_INICIO_DE_SINTOMAS    object
FECHA_DIAGNOSTICO              object
CIUDAD                         object
LOCALIDAD_ASIS                 object
EDAD                            int64
UNI_MED                         int64
SEXO                           object
FUENTE_O_TIPO_DE_CONTAGIO      object
UBICACION                      object
ESTADO                         object
dtype: object

In [65]:
#Data Exploration of the city column
df_covid['CIUDAD'].unique()

array(['Bogotá', 'Fuera de Bogotá', 'Sin dato'], dtype=object)

<p> Check whether the locality names match between the two dataframes </p>

In [66]:
df_covid['LOCALIDAD_ASIS'].unique()

array(['Usaquén', 'Engativá', 'Fontibón', 'Kennedy', 'Suba',
       'Teusaquillo', 'Chapinero', 'Ciudad Bolívar', 'Barrios Unidos',
       'Los Mártires', 'La Candelaria', 'Rafael Uribe Uribe',
       'Puente Aranda', 'Tunjuelito', 'Bosa', 'San Cristóbal', 'Santa Fe',
       'Antonio Nariño', 'Usme', 'Fuera de Bogotá', 'Sin dato', 'Sumapaz'],
      dtype=object)

In [67]:
df_neighborhoods['Localidad'].unique()

array(['Usaquén', 'Chapinero', 'Santa Fe', 'San Cristóbal', 'Usme',
       'Tunjuelito', 'Bosa', 'Kennedy', 'Fontibón', 'Engativá', 'Suba',
       'Barrios Unidos', 'Teusaquillo', 'Mártires', 'Antonio Nariño',
       'Puente Aranda', 'La Candelaria', 'Rafael Uribe', 'Ciudad Bolívar'],
      dtype=object)

In [68]:
localities = df_covid['LOCALIDAD_ASIS'].unique()

In [69]:
for index in range(0,localities.size):
  exists1 = localities[index] in df_neighborhoods['Localidad'].values
  print("Localidad {} in the neighbourhoods df: {}".format(localities[index],exists1))

Localidad Usaquén in the neighbourhoods df: True
Localidad Engativá in the neighbourhoods df: True
Localidad Fontibón in the neighbourhoods df: True
Localidad Kennedy in the neighbourhoods df: True
Localidad Suba in the neighbourhoods df: True
Localidad Teusaquillo in the neighbourhoods df: True
Localidad Chapinero in the neighbourhoods df: True
Localidad Ciudad Bolívar in the neighbourhoods df: True
Localidad Barrios Unidos in the neighbourhoods df: True
Localidad Los Mártires in the neighbourhoods df: False
Localidad La Candelaria in the neighbourhoods df: True
Localidad Rafael Uribe Uribe in the neighbourhoods df: False
Localidad Puente Aranda in the neighbourhoods df: True
Localidad Tunjuelito in the neighbourhoods df: True
Localidad Bosa in the neighbourhoods df: True
Localidad San Cristóbal in the neighbourhoods df: True
Localidad Santa Fe in the neighbourhoods df: True
Localidad Antonio Nariño in the neighbourhoods df: True
Localidad Usme in the neighbourhoods df: True
Localidad
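
<p>The same comparison can also be sketched with set differences, which list the mismatches in both directions at once. The two series below are hypothetical subsets of the real columns, used only to keep the example self-contained.</p>

```python
import pandas as pd

# Hypothetical subsets of the locality columns of the two dataframes
covid_localities = pd.Series(["Usaquén", "Los Mártires", "Rafael Uribe Uribe", "Sumapaz"])
neighbourhood_localities = pd.Series(["Usaquén", "Mártires", "Rafael Uribe"])

covid_set = set(covid_localities.unique())
neighbourhood_set = set(neighbourhood_localities.unique())

# Names that appear in one dataframe but not in the other
only_in_covid = sorted(covid_set - neighbourhood_set)
only_in_neighbourhoods = sorted(neighbourhood_set - covid_set)

print("Only in covid data:", only_in_covid)
print("Only in neighbourhoods data:", only_in_neighbourhoods)
```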

<h2> Data Cleaning </h2>
<p> We will now clean the localities and covid data by dropping the columns that we will not be using. </p>
<p><b> For covid data </b></p>
<ul>
<li>We only need the date on which the citizen was diagnosed with covid (fecha_diagnostico) and the locality where the citizen lives (localidad_asis).</li>
<li>The data includes citizens who live outside Bogota (our target city) and citizens with no city data; we need to delete these records since they are not useful for our project.</li>
<li>The diagnosis date (fecha_diagnostico) has type object; we'll need to convert it to a date type.</li>
</ul>
<p> Since the locality names do not share the same format in both dataframes, we need to normalize them so that we can match them between dataframes when needed. </p>
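
<p>The cleaning steps described above can be sketched as a single chain on a small hypothetical dataframe. The sample rows, the <code>aliases</code> mapping and the variable names are assumptions for illustration; the cells that follow apply the same ideas to the real data step by step.</p>

```python
import pandas as pd

# Hypothetical sample with the same column names as the covid dataframe
df = pd.DataFrame({
    "FECHA_DIAGNOSTICO": ["2020-03-06", "2020-03-10", "2020-03-12"],
    "CIUDAD": ["Bogotá", "Fuera de Bogotá", "Bogotá"],
    "LOCALIDAD_ASIS": ["Usaquén", "Fuera de Bogotá", "Los Mártires"],
    "EDAD": [19, 22, 36],
})

clean = (
    # Keep only Bogotá records and the two columns we need
    df.loc[df["CIUDAD"] == "Bogotá", ["FECHA_DIAGNOSTICO", "LOCALIDAD_ASIS"]]
    .assign(
        # Convert the diagnosis date from object to datetime
        FECHA_DIAGNOSTICO=lambda d: pd.to_datetime(d["FECHA_DIAGNOSTICO"]),
        # Normalize locality names: trim whitespace and lowercase
        LOCALIDAD_ASIS=lambda d: d["LOCALIDAD_ASIS"].str.strip().str.lower(),
    )
    .reset_index(drop=True)
)

# Map the short names used in the neighbourhoods data to the names in the covid data
aliases = {"mártires": "los mártires", "rafael uribe": "rafael uribe uribe"}
neighbourhood_localities = (
    pd.Series(["Mártires", "Rafael Uribe", "Usaquén"])
    .str.strip()
    .str.lower()
    .replace(aliases)
)

print(clean)
print(neighbourhood_localities.tolist())
```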



In [70]:
df_covid.drop(df_covid[df_covid['CIUDAD']!='Bogotá'].index, inplace=True)

In [71]:
df_covid['CIUDAD'].unique()

array(['Bogotá'], dtype=object)

In [72]:
df_covid.drop(columns=['CASO','FECHA_DE_INICIO_DE_SINTOMAS','CIUDAD','EDAD','UNI_MED','SEXO','FUENTE_O_TIPO_DE_CONTAGIO','UBICACION','ESTADO'], inplace=True)

In [73]:
df_covid.head()

Unnamed: 0,FECHA_DIAGNOSTICO,LOCALIDAD_ASIS
0,2020-03-06,Usaquén
1,2020-03-10,Engativá
2,2020-03-10,Engativá
3,2020-03-12,Fontibón
4,2020-03-12,Kennedy


In [74]:
df_covid['FECHA_DIAGNOSTICO'] = pd.to_datetime(df_covid['FECHA_DIAGNOSTICO'])

In [75]:
df_covid.dtypes

FECHA_DIAGNOSTICO    datetime64[ns]
LOCALIDAD_ASIS               object
dtype: object

In [78]:
df_covid['LOCALIDAD_ASIS'] = df_covid['LOCALIDAD_ASIS'].str.lower()
df_neighborhoods['Localidad'] = df_neighborhoods['Localidad'].str.lower()

In [80]:
localities = df_covid['LOCALIDAD_ASIS'].unique()
for index in range(0,localities.size):
  exists1 = localities[index] in df_neighborhoods['Localidad'].values
  print("Localidad {} in the neighbourhoods df: {}".format(localities[index],exists1))

Localidad usaquén in the neighbourhoods df: True
Localidad engativá in the neighbourhoods df: True
Localidad fontibón in the neighbourhoods df: True
Localidad kennedy in the neighbourhoods df: True
Localidad suba in the neighbourhoods df: True
Localidad teusaquillo in the neighbourhoods df: True
Localidad chapinero in the neighbourhoods df: True
Localidad ciudad bolívar in the neighbourhoods df: True
Localidad barrios unidos in the neighbourhoods df: True
Localidad los mártires in the neighbourhoods df: False
Localidad la candelaria in the neighbourhoods df: True
Localidad rafael uribe uribe in the neighbourhoods df: False
Localidad puente aranda in the neighbourhoods df: True
Localidad tunjuelito in the neighbourhoods df: True
Localidad bosa in the neighbourhoods df: True
Localidad san cristóbal in the neighbourhoods df: True
Localidad santa fe in the neighbourhoods df: True
Localidad antonio nariño in the neighbourhoods df: True
Localidad usme in the neighbourhoods df: True
Localidad

In [42]:
#In our neighbourhoods df, locality "Rafael Uribe Uribe" appears as "Rafael Uribe". Let's change that
df_neighborhoods['Localidad'] = df_neighborhoods['Localidad'].replace(['rafael uribe'],'rafael uribe uribe')

In [44]:
#In our neighbourhoods df, locality "Los Mártires" appears as "Mártires". Let's change that
df_neighborhoods['Localidad'] = df_neighborhoods['Localidad'].replace(['mártires'],'los mártires')

In [81]:
#Unfortunately, we don't have much information about the neighbourhoods in the locality Sumapaz, so we'll delete its records from the covid dataframe
df_covid.drop(df_covid[df_covid['LOCALIDAD_ASIS']=='sumapaz'].index, inplace=True)

<h1>Data Capture II</h1>
<h2> Get location data for our neighbourhoods </h2>
<p> Create a function using the geocoder and geopy libraries so that we can retrieve the location of each neighbourhood and add it to our dataframe. </p>
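
<p>One possible shape for such a function is sketched below. It takes the geocoding callable as a parameter so it can be tried without network access; <code>get_location_safe</code>, <code>FakeLocation</code> and <code>fake_geocode</code> are hypothetical names introduced here, not part of geopy. With geopy, the callable would typically be <code>Nominatim(user_agent=...).geocode</code>, which returns <code>None</code> when an address is not found.</p>

```python
from typing import Callable, Optional, Tuple


def get_location_safe(
    neighbourhood: str,
    geocode: Callable[[str], object],
    city: str = "Bogota, Colombia",
) -> Tuple[Optional[float], Optional[float]]:
    """Return (latitude, longitude), or (None, None) if the address is not found."""
    result = geocode(f"{neighbourhood}, {city}")
    if result is None:
        return (None, None)
    return (result.latitude, result.longitude)


class FakeLocation:
    """Hypothetical stand-in for the object a geocoder returns."""

    def __init__(self, latitude: float, longitude: float) -> None:
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(address: str) -> Optional[FakeLocation]:
    # Hypothetical lookup table replacing the real network call
    known = {"Verbenal, Bogota, Colombia": FakeLocation(4.76515, -74.038394)}
    return known.get(address)


print(get_location_safe("Verbenal", fake_geocode))
print(get_location_safe("Unknown Place", fake_geocode))
```

<p>Returning <code>None</code> instead of raising makes missing neighbourhoods easy to find afterwards. When calling the public Nominatim service, wrapping the geocoder in <code>geopy.extra.rate_limiter.RateLimiter</code> with <code>min_delay_seconds=1</code> is one way to space out the requests.</p>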

In [None]:
!pip install geocoder
!pip install geopy
import geocoder
from geopy.geocoders import Nominatim

In [82]:
def get_location(neighbourhood):
  address = neighbourhood+',Bogota, Colombia'

  geolocator = Nominatim(user_agent="mygeocoder")
  location = geolocator.geocode(address)
  latitude = location.latitude
  longitude = location.longitude
  return(latitude,longitude)

In [102]:
#Create lists to store the information returned by our function
#Use exception handling because some addresses might not be found
lat_list = []
lon_list = []

for i in range(0,df_neighborhoods.shape[0]):
  try:
    lat,lon = get_location(df_neighborhoods['Nombre'][i])
    lat_list.append(lat)
    lon_list.append(lon)
  except Exception:
    #Use 0 as a placeholder for neighbourhoods that could not be geocoded
    lat_list.append(0)
    lon_list.append(0)


In [106]:
#Add lists to our neighbourhoods data as new columns
df_neighborhoods['Latitude'] = lat_list
df_neighborhoods['Longitude'] = lon_list

In [119]:
df_neighborhoods.head()

Unnamed: 0,Numero,Nombre,Localidad,Latitude,Longitude,"Latitude,Longitude"
0,1,Paseo de los Libertadores,usaquén,4.791482,-74.03373,
1,9,Verbenal,usaquén,4.76515,-74.038394,
2,10,La Uribe,usaquén,4.7524,-74.045013,
3,11,San Cristóbal Norte,usaquén,4.734501,-74.017543,
4,12,Toberín,usaquén,4.747274,-74.043719,


In [114]:
#Neighbourhoods which our function couldn't find
df_neighborhoods[df_neighborhoods['Longitude']==0]

Unnamed: 0,Numero,Nombre,Localidad,Latitude,Longitude
0,1,Paseo de los Libertadores,usaquén,4.791482,0.0
10,89,San Isidro-Patios,chapinero,0.0,0.0
20,33,Sociego,san cristóbal,0.0,0.0
30,61,Ciudad Usme,usme,0.0,0.0
51,76,Fontibón-San Pablo,fontibón,0.0,0.0
53,110,Ciudad Salitre Occidente,fontibón,0.0,0.0
69,17,San José de Bavaria,suba,0.0,0.0


<p> Since there are just 7 neighbourhoods for which our function couldn't retrieve the location information, we can search for this information and add it manually. </p>
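
<p>A minimal sketch of this manual fill, keyed by neighbourhood name instead of row index, is shown below. The small dataframe and the <code>manual_coords</code> dictionary are assumptions for illustration (they reuse two of the coordinate pairs entered in the next cell), and missing values are marked with <code>NaN</code> here rather than 0.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two neighbourhoods still lack coordinates
df = pd.DataFrame({
    "Nombre": ["Verbenal", "Sociego", "Ciudad Usme"],
    "Latitude": [4.76515, np.nan, np.nan],
    "Longitude": [-74.038394, np.nan, np.nan],
})

# Coordinates looked up by hand, keyed by neighbourhood name
manual_coords = {
    "Sociego": (4.578078, -74.08552),
    "Ciudad Usme": (4.479657, -74.111202),
}

for name, (lat, lon) in manual_coords.items():
    df.loc[df["Nombre"] == name, ["Latitude", "Longitude"]] = [lat, lon]

# Rows that still miss a coordinate after the manual fill
missing = df[df[["Latitude", "Longitude"]].isna().any(axis=1)]
print(missing)
```

<p>Keying by name avoids depending on row positions, which can shift if the dataframe is re-sorted or filtered.</p>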

In [121]:
df_neighborhoods.loc[0,['Latitude','Longitude']] = [4.791482,-74.03373]
df_neighborhoods.loc[10,['Latitude','Longitude']] = [4.667742,-74.01946]
df_neighborhoods.loc[20,['Latitude','Longitude']] = [4.578078,-74.08552]
df_neighborhoods.loc[30,['Latitude','Longitude']] = [4.479657,-74.111202]
df_neighborhoods.loc[51,['Latitude','Longitude']] = [4.694824,-74.16060]
df_neighborhoods.loc[53,['Latitude','Longitude']] = [4.655403,-74.111695]
df_neighborhoods.loc[69,['Latitude','Longitude']] = [4.585334, -74.17093]

In [123]:
df_neighborhoods[df_neighborhoods['Longitude']==0]

Unnamed: 0,Numero,Nombre,Localidad,Latitude,Longitude


<h1>Data Capture III</h1>
<h3> Mapping Bogota's Localities and Neighbourhoods to get a better understanding of the city</h3>
<h3>Use the Foursquare API to get the venues data from our neighbourhoods</h3>
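
<p>The notebook does not show the Foursquare request itself at this point, so the following is only a hedged sketch. It assumes the legacy Foursquare Places v2 <code>venues/explore</code> endpoint and its response layout; newer versions of the API differ. The function names, the credentials and the sample payload are hypothetical, and no network call is made.</p>

```python
from typing import Dict, List, Tuple


def build_explore_request(
    lat: float,
    lon: float,
    client_id: str,
    client_secret: str,
    version: str = "20180605",
    radius: int = 500,
    limit: int = 100,
) -> Tuple[str, Dict[str, object]]:
    """Build the URL and query parameters for a venue search (assumed v2 API)."""
    url = "https://api.foursquare.com/v2/venues/explore"
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lon}",
        "radius": radius,
        "limit": limit,
    }
    return url, params


def extract_venues(payload: dict) -> List[Tuple[str, str]]:
    """Pull (venue name, first category name) pairs from the assumed layout."""
    items = payload["response"]["groups"][0]["items"]
    return [
        (item["venue"]["name"], item["venue"]["categories"][0]["name"])
        for item in items
    ]


# Hypothetical payload that mimics the assumed response structure
sample_payload = {
    "response": {
        "groups": [
            {"items": [{"venue": {"name": "Cafe Ejemplo", "categories": [{"name": "Café"}]}}]}
        ]
    }
}

url, params = build_explore_request(4.76515, -74.038394, "MY_ID", "MY_SECRET")
print(url, params["ll"])
print(extract_venues(sample_payload))
```

<p>With real credentials, the pair could be passed to <code>requests.get(url, params=params)</code> and the decoded JSON handed to <code>extract_venues</code>.</p>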

In [124]:
import json 
import requests 
from pandas import json_normalize  #pandas.io.json.json_normalize is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

In [125]:
#Get location of Bogotá, Colombia
address = 'Bogota, Colombia'

geolocator = Nominatim(user_agent="mygeocoder")
location = geolocator.geocode(address)
bog_latitude = location.latitude
bog_longitude = location.longitude

In [126]:
df_neighborhoods.head()

Unnamed: 0,Numero,Nombre,Localidad,Latitude,Longitude
0,1,Paseo de los Libertadores,usaquén,4.791482,-74.03373
1,9,Verbenal,usaquén,4.76515,-74.038394
2,10,La Uribe,usaquén,4.7524,-74.045013
3,11,San Cristóbal Norte,usaquén,4.734501,-74.017543
4,12,Toberín,usaquén,4.747274,-74.043719


In [127]:
map_bogota = folium.Map(location=[bog_latitude,bog_longitude],zoom_start=10)

for lat,lng,neigh,loc in zip(df_neighborhoods['Latitude'],df_neighborhoods['Longitude'],df_neighborhoods['Nombre'],df_neighborhoods['Localidad']):
  label = '{}, {}'.format(neigh,loc)
  label = folium.Popup(label,parse_html=True)
  folium.CircleMarker(
      [lat,lng],
      radius=5,
      popup=label,
      color='purple',
      fill=True,
      fill_color='#800080',
      fill_opacity=0.7,
      parse_html=False).add_to(map_bogota)
map_bogota
  