<h1> Capstone Project </h1>
<p> This project makes RESTful calls to the Foursquare API to retrieve data about venues in different neighborhoods around the world, 
and uses Python and its pandas library to manipulate the data. </p>

# Table of Contents

<div style="margin-top: 10px">
    
1. [Introduction. Business Problem](#introduction)
2. [Feature engineering. Downloading and prepping data](#data)
3. [Analysis](#analysis)
4. [Results and discussion](#results)
5. [Conclusions](#conclusions)
</div>

<hr>

<h1> Introduction. Business Problem </h1> <a name="introduction"></a>

The purpose of this project is to use the Foursquare API, together with machine learning techniques, to retrieve location data and cluster venue information by region, in order to decide which neighbourhood in my home city, Guadalajara, in the state of Jalisco, Mexico, might be the best place to open an authentic Spanish restaurant.

<hr>

<h1> Feature Engineering. Downloading and prepping data </h1> <a name="data"></a>

Before starting, all the required dependencies and packages are installed and imported.

In [19]:
import numpy as np  
import pandas as pd
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from sklearn.cluster import KMeans
import json                           # library to handle JSON files
!conda install -c conda-forge geocoder --yes
import geocoder
!conda install -c conda-forge folium=0.5.0 --yes
import folium                         # map rendering library
import requests
print("Libraries imported.")

Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
geopy                     1.19.0                     py_0    conda-forge
Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
geocoder                  1.38.1                     py_0    conda-forge
Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.5.0                      py_0    conda-forge
Libraries imported.


In [3]:
df = pd.read_csv("https://datos.jalisco.gob.mx/node/413/download", encoding="latin-1")
print("The original data frame size is", df.shape)
df.head()

The original data frame size is (1560, 187)


Unnamed: 0,No.,Municipio,Colonia,Población Total,Población masculina,Población femenina,Población de 0 a 2 años,Población masculina de 0 a 2 años,Población femenina de 0 a 2 años,Población 3 y más años,...,Viviendas particulares habitadas sin ningún bien,Viviendas particulares habitadas que disponen de radio,Viviendas particulares habitadas que disponen de televisor,Viviendas particulares habitadas que disponen de refrigerador,Viviendas particulares habitadas que disponen de lavadora,Viviendas particulares habitadas que disponen de automóvil o camioneta,Viviendas particulares habitadas que disponen de computadora,Viviendas particulares habitadas que disponen de línea telefónica fija,Viviendas particulares habitadas que disponen de teléfono celular,Viviendas particulares habitadas que disponen de internet
0,1,El Salto,ALAMEDA,188,95,93,9,6,3,179,...,0,36,40,38,32,16,8,16,27,6
1,2,El Salto,ALVAREZ DEL CASTILLO,630,305,294,22,7,7,563,...,0,128,137,134,117,95,48,53,108,26
2,3,El Salto,BAJA CALIFORNIA,2070,1046,1015,146,57,64,1890,...,0,378,459,438,360,235,96,222,371,51
3,4,El Salto,BONITO JALISCO,2885,1451,1434,215,90,70,2632,...,0,641,737,710,603,401,232,286,688,92
4,5,El Salto,CARDENAS DEL RIO,1660,819,821,94,37,17,1524,...,0,291,336,280,216,115,11,60,224,6


In [4]:
df_gdl = df[(df['Municipio'] == 'Guadalajara')]
df_gdl.head()

Unnamed: 0,No.,Municipio,Colonia,Población Total,Población masculina,Población femenina,Población de 0 a 2 años,Población masculina de 0 a 2 años,Población femenina de 0 a 2 años,Población 3 y más años,...,Viviendas particulares habitadas sin ningún bien,Viviendas particulares habitadas que disponen de radio,Viviendas particulares habitadas que disponen de televisor,Viviendas particulares habitadas que disponen de refrigerador,Viviendas particulares habitadas que disponen de lavadora,Viviendas particulares habitadas que disponen de automóvil o camioneta,Viviendas particulares habitadas que disponen de computadora,Viviendas particulares habitadas que disponen de línea telefónica fija,Viviendas particulares habitadas que disponen de teléfono celular,Viviendas particulares habitadas que disponen de internet
52,53,Guadalajara,18 DE MARZO,3694,1715,1978,82,26,24,3550,...,987,0,935,985,979,934,752,617,886,835
53,54,Guadalajara,18 DE MARZO (ANEXO),3682,1753,1914,153,72,69,3501,...,881,0,805,872,848,745,538,393,648,718
54,55,Guadalajara,1o. DE MAYO,3389,1681,1708,157,82,51,3127,...,717,0,654,707,692,607,373,226,458,554
55,56,Guadalajara,5 DE MAYO 1a. SECCIÓN,4032,1968,2058,203,76,79,3737,...,859,0,792,857,836,750,486,326,585,678
56,57,Guadalajara,5 DE MAYO 2a. SECCIÓN,12265,6052,6181,580,234,210,11169,...,2510,0,2284,2503,2446,2161,1525,1023,1756,2099


In [5]:
df_gdl = df_gdl[['Municipio','Colonia','Población Total', 'Población económicamente activa', 'Población no económicamente activa', 'Población ocupada', 'Población desocupada']].reset_index(drop=True)
print("The cleaned data set contains {} neighbourhoods".format(df_gdl.shape[0]))
df_gdl.head()

The cleaned data set contains 584 neighbourhoods


Unnamed: 0,Municipio,Colonia,Población Total,Población económicamente activa,Población no económicamente activa,Población ocupada,Población desocupada
0,Guadalajara,18 DE MARZO,3694,1758,1371,1699,31
1,Guadalajara,18 DE MARZO (ANEXO),3682,1767,1171,1702,52
2,Guadalajara,1o. DE MAYO,3389,1440,1132,1378,46
3,Guadalajara,5 DE MAYO 1a. SECCIÓN,4032,1745,1314,1670,55
4,Guadalajara,5 DE MAYO 2a. SECCIÓN,12265,5388,3860,5181,133


Get the location information (latitude and longitude) of each neighbourhood. This may take a while, since it iterates over all 584 neighbourhoods.

In [6]:
def get_latlng(neighborhood):
    
    lat_lng_coords = None
    
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Guadalajara, Jalisco'.format(neighborhood))
        lat_lng_coords = g.latlng
        
    return lat_lng_coords
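
The loop in `get_latlng` keeps retrying until the geocoder returns something, so a neighbourhood that can never be resolved would block the run indefinitely. A minimal sketch of a bounded retry is shown below. The helper name `geocode_with_retry`, its parameters, and the stand-in `fake_geocoder` are assumptions made for illustration; in the notebook, `geocode_fn` could wrap `geocoder.arcgis(query).latlng`.

```python
import time


def geocode_with_retry(query, geocode_fn, max_attempts=3, delay_seconds=1.0):
    """Call geocode_fn(query) until it returns coordinates or attempts run out.

    geocode_fn is any callable that returns [lat, lng] or None.
    Returns None when every attempt fails, so the caller decides what to do.
    """
    for attempt in range(1, max_attempts + 1):
        coords = geocode_fn(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # brief pause before the next attempt
    return None


# Hypothetical stand-in geocoder: fails on the first call, then succeeds.
_calls = {"n": 0}


def fake_geocoder(query):
    _calls["n"] += 1
    return None if _calls["n"] < 2 else [20.67, -103.34]


result = geocode_with_retry(
    "AGRARIA, Guadalajara, Jalisco", fake_geocoder, delay_seconds=0.0
)
print(result)
```

Rows that come back as `None` would then need to be handled explicitly (dropped or geocoded by another service) before building the coordinates data frame.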

In [7]:
coordinatess = [ get_latlng(neighborhood) for neighborhood in df_gdl["Colonia"].tolist() ]

The coordinates are then collected into a new data frame.

In [8]:
df_coords = pd.DataFrame(coordinatess, columns=['Latitude', 'Longitude'])
df_coords.head()

Unnamed: 0,Latitude,Longitude
0,20.628689,-103.347974
1,20.628689,-103.347974
2,20.642851,-103.473042
3,20.673063,-103.339797
4,20.673096,-103.339908


In [9]:
df_gdl = pd.concat([df_gdl, df_coords], axis=1)

Finally, the data set with each neighbourhood and its corresponding latitude and longitude looks like this.

In [10]:
df_gdl.head(10)

Unnamed: 0,Municipio,Colonia,Población Total,Población económicamente activa,Población no económicamente activa,Población ocupada,Población desocupada,Latitude,Longitude
0,Guadalajara,18 DE MARZO,3694,1758,1371,1699,31,20.628689,-103.347974
1,Guadalajara,18 DE MARZO (ANEXO),3682,1767,1171,1702,52,20.628689,-103.347974
2,Guadalajara,1o. DE MAYO,3389,1440,1132,1378,46,20.642851,-103.473042
3,Guadalajara,5 DE MAYO 1a. SECCIÓN,4032,1745,1314,1670,55,20.673063,-103.339797
4,Guadalajara,5 DE MAYO 2a. SECCIÓN,12265,5388,3860,5181,133,20.673096,-103.339908
5,Guadalajara,8 DE JULIO,3393,1477,1281,1428,31,20.64786,-103.35715
6,Guadalajara,AARÓN JOAQUÍN,7644,3445,2337,3327,97,20.68165,-103.27708
7,Guadalajara,ACADEMIA DE POLICÍA (IRREGULAR),628,260,182,254,4,20.70501,-103.28354
8,Guadalajara,AGRARIA,245,101,95,98,0,20.71252,-103.38313
9,Guadalajara,AGUSTÍN YÁÑEZ,4088,1886,1223,1834,41,20.71806,-103.33887
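
In the table above, some distinct neighbourhoods share identical or nearly identical coordinates (for example `18 DE MARZO` and `18 DE MARZO (ANEXO)`), which may indicate that the geocoder fell back to a coarser match. The sketch below, using only the first five rows shown above, groups names by rounded coordinates to surface such cases. The helper name `shared_coordinates` and the rounding to three decimals (on the order of 100 m) are assumptions, not part of the original analysis.

```python
from collections import defaultdict

# First rows of the table above: (Colonia, Latitude, Longitude).
rows = [
    ("18 DE MARZO", 20.628689, -103.347974),
    ("18 DE MARZO (ANEXO)", 20.628689, -103.347974),
    ("1o. DE MAYO", 20.642851, -103.473042),
    ("5 DE MAYO 1a. SECCIÓN", 20.673063, -103.339797),
    ("5 DE MAYO 2a. SECCIÓN", 20.673096, -103.339908),
]


def shared_coordinates(rows, decimals=3):
    """Group neighbourhood names by rounded (lat, lng); keep groups with 2+ names."""
    groups = defaultdict(list)
    for name, lat, lng in rows:
        groups[(round(lat, decimals), round(lng, decimals))].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


duplicates = shared_coordinates(rows)
for key, names in duplicates.items():
    print(key, names)
```

Neighbourhoods that end up in the same group would query almost the same area in Foursquare, so their venue lists are likely to overlap heavily.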


<hr>

<h1> Analysis </h1> <a name="analysis"></a>

In [11]:
address = 'Guadalajara, Jalisco'

geolocator = Nominatim(user_agent="gdl_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Guadalajara are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Guadalajara are 20.6720375, -103.3383962.


In [50]:
df_gdl[(df_gdl['Colonia'] == 'LA NATIVIDAD')]

Unnamed: 0,Municipio,Colonia,Población Total,Población económicamente activa,Población no económicamente activa,Población ocupada,Población desocupada,Latitude,Longitude
272,Guadalajara,LA NATIVIDAD,3136,1553,1035,1509,28,20.411379,-103.641363


In [12]:
# create map of Guadalajara using latitude and longitude values
gdl_map = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(df_gdl['Latitude'], df_gdl['Longitude'], df_gdl['Colonia']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(gdl_map)  
    
gdl_map

<hr>

Inspecting the resulting map shows that a few neighbourhoods were geocoded incorrectly, since their markers fall outside Guadalajara. These locations therefore need to be removed manually from the data set.

In [15]:
wrong_neighbours = ["VILLAS DE SAN JUAN 1a. SECCIÓN", "LA NATIVIDAD", "1o. DE MAYO", "AVENIDA UNIVERSIDAD", "JARDINES DEL NILO 3a. SECCIÓN", "SAN ANDRÉS 4a. SECCIÓN", "HABITAT 2001", "COLÓN C.R.O.C.", "SAN RAFAEL AYALA"]

for i in range(0, len(wrong_neighbours)):    
    idx = df_gdl[(df_gdl['Colonia'] == wrong_neighbours[i])].index.values.astype(int)[0]
    df_gdl.drop(labels=idx, axis=0, inplace=True)    
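
The list of wrong neighbourhoods above was built by looking at the map. As a complement, badly geocoded points could be flagged automatically by their great-circle distance to the city centre. The sketch below uses the haversine formula with the city coordinates printed earlier and three rows from the data set. The 12 km threshold and the names `haversine_km` and `MAX_KM` are assumptions that would need tuning; this is not the method used in the notebook.

```python
import math


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# City centre as returned by Nominatim in the notebook.
center = (20.6720375, -103.3383962)

# A few rows from the data set: (Colonia, Latitude, Longitude).
sample = [
    ("LA NATIVIDAD", 20.411379, -103.641363),
    ("1o. DE MAYO", 20.642851, -103.473042),
    ("AGRARIA", 20.71252, -103.38313),
]

MAX_KM = 12.0  # assumed cut-off; adjust after checking the map

flagged = [
    name for name, lat, lng in sample
    if haversine_km(center[0], center[1], lat, lng) > MAX_KM
]
print(flagged)
```

A distance filter like this only catches points that land far away; neighbourhoods geocoded to the wrong spot inside the city would still need a visual check.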

In [16]:
# create map of Guadalajara using latitude and longitude values
gdl_map = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(df_gdl['Latitude'], df_gdl['Longitude'], df_gdl['Colonia']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(gdl_map)  
    
gdl_map

In [17]:
# The code was removed by Watson Studio for sharing.
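
The hidden cell presumably defines `CLIENT_ID`, `CLIENT_SECRET` and `VERSION`, which the next cell uses. One way to keep such values out of a shared notebook is to read them from environment variables, as sketched below. The environment variable names, the helper `load_foursquare_credentials`, and the placeholder values are all hypothetical; they are not taken from the original cell.

```python
import os

# Hypothetical environment variable names, chosen for this sketch only.
REQUIRED_KEYS = (
    "FOURSQUARE_CLIENT_ID",
    "FOURSQUARE_CLIENT_SECRET",
    "FOURSQUARE_VERSION",
)


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret, version) from a mapping such as os.environ."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[key] for key in REQUIRED_KEYS)


# A plain dict stands in for os.environ here; the values are placeholders.
example_env = {
    "FOURSQUARE_CLIENT_ID": "my-id",
    "FOURSQUARE_CLIENT_SECRET": "my-secret",
    "FOURSQUARE_VERSION": "YYYYMMDD",
}
CLIENT_ID, CLIENT_SECRET, VERSION = load_foursquare_credentials(example_env)
print(CLIENT_ID, VERSION)
```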

In [20]:
radius = 2000
LIMIT = 100
venues = []

for lat, long, neighborhood in zip(df_gdl['Latitude'], df_gdl['Longitude'], df_gdl['Colonia']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))

KeyError: 'groups'
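
The `KeyError: 'groups'` above means that at least one response did not contain a `groups` entry, which stops the loop partway through the neighbourhoods. A plausible cause is an error response (for example, an exhausted request quota), but that is an assumption. The sketch below parses a response dictionary defensively so that one such response yields an empty list instead of aborting the run. The helper names and the two sample payloads are hypothetical stand-ins that only mirror the keys used in the cell above.

```python
def extract_items(payload):
    """Return the venue items of an explore-style response dict, or [] if absent."""
    groups = payload.get("response", {}).get("groups") or []
    if not groups:
        return []
    return groups[0].get("items", [])


def venue_row(neighborhood, lat, lng, item):
    """Build the same 7-field tuple as the loop above, tolerating missing categories."""
    venue = item["venue"]
    categories = venue.get("categories") or []
    category = categories[0]["name"] if categories else None
    return (
        neighborhood,
        lat,
        lng,
        venue["name"],
        venue["location"]["lat"],
        venue["location"]["lng"],
        category,
    )


# Hypothetical payloads used only to exercise the helpers.
ok_payload = {"response": {"groups": [{"items": [{"venue": {
    "name": "Pasteleria Espiritu Santo",
    "location": {"lat": 20.689363, "lng": -103.306308},
    "categories": [{"name": "Bakery"}],
}}]}]}}
error_payload = {"meta": {"code": 429}, "response": {}}

rows = [
    venue_row("SAN RAMÓN", 20.695116, -103.30504, item)
    for payload in (ok_payload, error_payload)
    for item in extract_items(payload)
]
print(rows)
```

In the loop above, this would replace the direct indexing with `extract_items(requests.get(url).json())`; logging the neighbourhoods that return no items makes it possible to retry them later.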

In [23]:
venues_df = pd.DataFrame(venues)

venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.tail()

(44490, 7)


Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
44485,SAN RAMÓN,20.695116,-103.30504,Life Gym Estadio,20.704874,-103.313448,Gym
44486,SAN RAMÓN,20.695116,-103.30504,Farmacia Guadalajara Suc. San Marcos,20.696796,-103.313226,Pharmacy
44487,SAN RAMÓN,20.695116,-103.30504,Pasteleria Espiritu Santo,20.689363,-103.306308,Bakery
44488,SAN RAMÓN,20.695116,-103.30504,Menudo Doña Chuy,20.680253,-103.310317,Mexican Restaurant
44489,SAN RAMÓN,20.695116,-103.30504,Glorieta Belisario Dominguez,20.707955,-103.313804,Sculpture Garden
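
As a possible first look at the collected venues, the number of venues per neighbourhood whose category matches a keyword can be counted. The sketch below uses the five rows shown above in the same tuple layout as the `venues` list. The helper `count_matching` is an assumption, as is the idea that Spanish food places can be found by a keyword such as "Spanish" in the category name.

```python
from collections import Counter

# The last five rows of venues_df, as tuples in the same order as `venues`.
sample_venues = [
    ("SAN RAMÓN", 20.695116, -103.30504, "Life Gym Estadio", 20.704874, -103.313448, "Gym"),
    ("SAN RAMÓN", 20.695116, -103.30504, "Farmacia Guadalajara Suc. San Marcos", 20.696796, -103.313226, "Pharmacy"),
    ("SAN RAMÓN", 20.695116, -103.30504, "Pasteleria Espiritu Santo", 20.689363, -103.306308, "Bakery"),
    ("SAN RAMÓN", 20.695116, -103.30504, "Menudo Doña Chuy", 20.680253, -103.310317, "Mexican Restaurant"),
    ("SAN RAMÓN", 20.695116, -103.30504, "Glorieta Belisario Dominguez", 20.707955, -103.313804, "Sculpture Garden"),
]


def count_matching(venues, keyword):
    """Count venues per neighbourhood whose category contains keyword (case-insensitive)."""
    counts = Counter()
    for row in venues:
        neighborhood, category = row[0], row[6]
        if keyword.lower() in category.lower():
            counts[neighborhood] += 1
    return counts


restaurant_counts = count_matching(sample_venues, "Restaurant")
spanish_counts = count_matching(sample_venues, "Spanish")
print(restaurant_counts["SAN RAMÓN"], spanish_counts["SAN RAMÓN"])
```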


<h1> Results and discussion </h1> <a name="results"></a>

<hr>

<h1> Conclusions </h1> <a name="conclusions"></a>