<h1 align=center> 
    <font size = 5> Scraping Cdmx venues  </font>  
</h1>


## Introduction
In this notebook we use <a href="https://foursquare.com/developers"> Foursquare </a> to collect the venues of CDMX (Mexico City). Foursquare is queried with the latitude and longitude of each neighborhood; those coordinates come from <a href="https://datos.cdmx.gob.mx"> datos cdmx </a>. The final result is a dataframe with the venues Foursquare returns for every neighborhood in the city.


## 1. Import libraries

In [1]:
import pandas as pd
import numpy as np 
import requests # library to handle requests
import json
from pandas import json_normalize # transform a JSON file into a pandas dataframe (pandas >= 1.0)

In [2]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    altair-4.1.0               |             py_1         614 KB  conda-forge
    branca-0.4.1               |             py_0          26 KB  conda-forge
    brotlipy-0.7.0             |py36h8c4c3a4_1000         346 KB  conda-forge
    chardet-3.0.4              |py36h9f0ad1d_1006         188 KB  conda-forge
    cryptography-2.9.2         |   py36h45558ae_0         613 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    pandas-1.0.5               |   py36h83

In [3]:
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



## 2. Load the data
I got the neighborhood data (name, longitude, latitude) from the government open-data site: https://datos.cdmx.gob.mx/pages/home/
Once it is downloaded, the next step is to load it into a dataframe.

In [4]:
url= 'https://datos.cdmx.gob.mx/explore/dataset/coloniascdmx/download/?format=csv&timezone=America/Mexico_City&lang=es&use_labels_for_header=true&csv_separator=%2C'
cdmx_geo = pd.read_csv(url)

# uncomment the next line to see what the data looks like
# cdmx_geo.head()    

## 3.  Data Wrangling

Drop the unnecessary columns.

In [5]:
cdmx_geo.drop(['CVE_ALC', 'CVE_COL', 'SECC_COM','SECC_PAR', 'ENTIDAD', 'Geo Shape' ],  axis=1, inplace=True )
cdmx_geo.head(3)

Unnamed: 0,COLONIA,Geo Point,ALCALDIA
0,LOMAS DE CHAPULTEPEC,"19.4228411174,-99.2157935754",MIGUEL HIDALGO
1,LOMAS DE REFORMA (LOMAS DE CHAPULTEPEC),"19.4106158914,-99.2262487268",MIGUEL HIDALGO
2,DEL BOSQUE (POLANCO),"19.4342189235,-99.2094037513",MIGUEL HIDALGO


### Set the right format
Split the column 'Geo Point' into two columns, latitude and longitude.

In [6]:
# create two columns, 'Lat' and 'Lon', from 'Geo Point'
cdmx_geo[['Lat', 'Lon']] = cdmx_geo['Geo Point'].str.split(',', n=1, expand=True)
cdmx_geo.drop('Geo Point', axis=1, inplace=True)  # drop the 'Geo Point' column

cdmx_geo.head(3)

  


Unnamed: 0,COLONIA,ALCALDIA,Lat,Lon
0,LOMAS DE CHAPULTEPEC,MIGUEL HIDALGO,19.4228411174,-99.2157935754
1,LOMAS DE REFORMA (LOMAS DE CHAPULTEPEC),MIGUEL HIDALGO,19.4106158914,-99.2262487268
2,DEL BOSQUE (POLANCO),MIGUEL HIDALGO,19.4342189235,-99.2094037513


Check the data types. 'Lat' and 'Lon' are stored as text (object), not as numbers.


In [7]:
cdmx_geo.dtypes

COLONIA     object
ALCALDIA    object
Lat         object
Lon         object
dtype: object

Let's correct the data types and check them again.

In [8]:
cdmx_geo['COLONIA'] = cdmx_geo['COLONIA'].astype(str) #change COLONIA to string
cdmx_geo['ALCALDIA'] = cdmx_geo['ALCALDIA'].astype(str) #change ALCALDIA to string
cdmx_geo['Lat'] = cdmx_geo['Lat'].astype(float) #change Lat to float
cdmx_geo['Lon'] = cdmx_geo['Lon'].astype(float) #change Lon to float
cdmx_geo.dtypes

COLONIA      object
ALCALDIA     object
Lat         float64
Lon         float64
dtype: object
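
If we want to validate each 'Geo Point' value while converting it, a small helper can do the split and the float conversion in one step. This is only a sketch: `parse_geo_point` is a hypothetical name, and it assumes every valid value looks like `"lat,lon"`.

```python
def parse_geo_point(value):
    """Split a 'lat,lon' string into two floats; return (None, None) if it cannot be parsed."""
    if not isinstance(value, str):
        return (None, None)  # e.g. NaN coming from an empty CSV cell
    parts = value.split(',', 1)
    if len(parts) != 2:
        return (None, None)
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return (None, None)
    # reject values outside the valid coordinate ranges
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return (None, None)
    return (lat, lon)


print(parse_geo_point("19.4228411174,-99.2157935754"))
print(parse_geo_point("not a point"))
```

Rows that come back as `(None, None)` are the ones that need the missing-data treatment.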

#### Deal with missing data

Let's check for null values.

In [9]:
cdmx_geo.isnull().sum()

COLONIA     0
ALCALDIA    0
Lat         4
Lon         4
dtype: int64

Create a dataframe with the rows that have null values in the 'Lon' or 'Lat' column.

In [10]:
null_data = cdmx_geo[cdmx_geo['Lon'].isnull() | cdmx_geo['Lat'].isnull() ]
null_data

Unnamed: 0,COLONIA,ALCALDIA,Lat,Lon
420,MAZA,CUAUHTEMOC,,
885,SAN PABLO OZTOTEPEC (PBLO),MILPA ALTA,,
1002,LOS CERRILLOS I,XOCHIMILCO,,
1021,CHIMALCOYOC,TLALPAN,,


Use Nominatim to look up the coordinates of the addresses that are missing them.

In [11]:

geolocator = Nominatim(user_agent="cdmx_explorer")  # create the geocoder once

for nul in null_data.index.tolist():  # loop over the index of the rows with missing coordinates

    col = cdmx_geo.loc[nul,'COLONIA'] # 'COLONIA' of the row with null values
    alc = cdmx_geo.loc[nul,'ALCALDIA'] # 'ALCALDIA' of the row with null values
    address = col +' '+ alc + ', ' +'cdmx' # address string passed to the geolocator

    try:
        location = geolocator.geocode(address)  # returns None when nothing matches
        latitude = location.latitude
        longitude = location.longitude
        print(address + ': geographical coordinates {}, {}.'.format(latitude, longitude))

        # replace the null values in the dataframe with the new ones
        cdmx_geo.loc[nul,'Lat'] = latitude
        cdmx_geo.loc[nul,'Lon'] = longitude

    except Exception:  # no match (location is None) or the geocoding service failed
        print( address +' NOT FOUND')

MAZA CUAUHTEMOC, cdmx: geographical coordinates 19.4549636, -99.1282072.
SAN PABLO OZTOTEPEC (PBLO) MILPA ALTA, cdmx NOT FOUND
LOS CERRILLOS I XOCHIMILCO, cdmx NOT FOUND
CHIMALCOYOC TLALPAN, cdmx NOT FOUND
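
Some of the names that were not found carry tags such as "(PBLO)" or a section number such as "I". One thing we could try is to retry with simpler versions of the address. The sketch below is an assumption, not something tested against Nominatim: `candidate_queries` and `geocode_with_fallback` are hypothetical helpers, and there is no guarantee the simplified queries return the right place.

```python
import re


def candidate_queries(colonia, alcaldia, city="CDMX"):
    """Return progressively simpler address strings to try, most specific first."""
    # drop parenthetical tags such as "(PBLO)" or "(U HAB)"
    no_tag = re.sub(r"\s*\([^)]*\)", "", colonia).strip()
    # drop a trailing roman-numeral section marker such as " I" or " II"
    no_section = re.sub(r"\s+[IVX]+$", "", no_tag).strip()

    queries = []
    for name in (colonia, no_tag, no_section):
        query = "{}, {}, {}".format(name.title(), alcaldia.title(), city)
        if query not in queries:  # keep the order, skip duplicates
            queries.append(query)
    return queries


def geocode_with_fallback(geocode, colonia, alcaldia):
    """Try each candidate query with the given geocode callable; return the first hit or None."""
    for query in candidate_queries(colonia, alcaldia):
        location = geocode(query)
        if location is not None:
            return location
    return None


print(candidate_queries("SAN PABLO OZTOTEPEC (PBLO)", "MILPA ALTA"))
print(candidate_queries("LOS CERRILLOS I", "XOCHIMILCO"))
```

With geopy, the callable would be `geolocator.geocode`. It can be wrapped in `geopy.extra.rate_limiter.RateLimiter` (for example with `min_delay_seconds=1`) to space out the requests.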


Let's check the dataframe row that we were able to complete.

In [12]:
cdmx_geo.loc[[420], :]

Unnamed: 0,COLONIA,ALCALDIA,Lat,Lon
420,MAZA,CUAUHTEMOC,19.454964,-99.128207


Drop the remaining rows with missing values.

In [None]:
cdmx_geo.dropna(inplace=True)
cdmx_geo = cdmx_geo.reset_index(drop=True)  # renumber the rows without keeping the old index

In [49]:
#check Again for missing values
cdmx_geo.isnull().sum()

COLONIA     0
ALCALDIA    0
Lat         0
Lon         0
dtype: int64

In [3]:
# cdmx_geo.to_csv('cdmx_locations.csv')

## 4. Data Explore

In [51]:
print('There are {} different "Alcaldias", which are:  {}'.format(len(cdmx_geo.ALCALDIA.unique()), cdmx_geo.ALCALDIA.unique()  ))

There are 16 different "Alcaldias", which are:  ['MIGUEL HIDALGO' 'COYOACAN' 'VENUSTIANO CARRANZA' 'GUSTAVO A. MADERO'
 'TLALPAN' 'XOCHIMILCO' 'MILPA ALTA' 'IZTACALCO' 'AZCAPOTZALCO'
 'ALVARO OBREGON' 'CUAUHTEMOC' 'TLAHUAC' 'CUAJIMALPA DE MORELOS'
 'IZTAPALAPA' 'BENITO JUAREZ' 'LA MAGDALENA CONTRERAS']


Count the neighborhoods (colonias) in each Alcaldia.

In [52]:
cdmx_geo.groupby(['ALCALDIA']).count()

ALCALDIA,COLONIA,Lat,Lon
ALVARO OBREGON,249,249,249
AZCAPOTZALCO,111,111,111
BENITO JUAREZ,64,64,64
COYOACAN,153,153,153
CUAJIMALPA DE MORELOS,43,43,43
CUAUHTEMOC,64,64,64
GUSTAVO A. MADERO,232,232,232
IZTACALCO,55,55,55
IZTAPALAPA,293,293,293
LA MAGDALENA CONTRERAS,52,52,52


#### Map

Optionally, we can plot the location of each neighborhood on a map.

In [18]:
address = 'CDMX'
#use nominatim to look for latitude and longitude 
geolocator = Nominatim(user_agent="cdmx_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates are {}, {}.'.format(latitude, longitude))

The geographical coordinates are 19.320556250000003, -99.15170107077653.


In [19]:
# create map using latitude and longitude values
map_MH = folium.Map(location=[latitude, longitude], zoom_start=12)


# add markers to map
for lat, lng, label in zip(cdmx_geo['Lat'], cdmx_geo['Lon'], cdmx_geo['COLONIA']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.6,
        parse_html=False).add_to(map_MH)
    
map_MH

## 5. Getting data from Foursquare

The first step is to create a Foursquare developer account and get the credentials.

In [20]:
CLIENT_ID = '******'     # your Foursquare ID
CLIENT_SECRET = '******' # your Foursquare Secret
VERSION = '******'       # Foursquare API version
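
To keep the credentials out of the notebook, one option is to read them from environment variables. This is a minimal sketch: the variable names (`FOURSQUARE_CLIENT_ID` and so on) and the helpers `load_credentials` and `mask` are hypothetical, and the default version is just the `v` value that appears in the request URL of this notebook.

```python
import os


def load_credentials(env=os.environ):
    """Read Foursquare credentials from environment variables (names are hypothetical)."""
    try:
        client_id = env["FOURSQUARE_CLIENT_ID"]
        client_secret = env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("missing environment variable: {}".format(missing)) from None
    # the v parameter is a date in YYYYMMDD form
    version = env.get("FOURSQUARE_VERSION", "20180605")
    return client_id, client_secret, version


def mask(secret, visible=4):
    """Hide all but the first few characters so a value can be printed safely."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


print(mask("EXAMPLESECRET123"))
```

After exporting the variables in the shell, `CLIENT_ID, CLIENT_SECRET, VERSION = load_credentials()` would replace the hard-coded strings.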

### To make the process easier to follow, we first work with a single neighborhood

The first step is to pick an arbitrary neighborhood (COLONIA) and inspect its data.

In [55]:
a_info = cdmx_geo.loc[1232, :]
a_info

COLONIA          LINDAVISTA I
ALCALDIA    GUSTAVO A. MADERO
Lat                   19.4899
Lon                  -99.1317
Name: 1232, dtype: object

Save the following variables; they will be used in the request.

In [56]:
a_lat = cdmx_geo.loc[1232, 'Lat'] # neighborhood latitude value
a_lon= cdmx_geo.loc[1232, 'Lon'] # neighborhood longitude value

Then we build the Foursquare request.

In [24]:

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 400 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    a_lat, 
    a_lon, 
    radius, 
    LIMIT)

url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=******&client_secret=******&v=20180605&ll=19.4899449462,-99.1316707528&radius=400&limit=100'
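
The same URL can also be assembled with `urllib.parse.urlencode`, which escapes the parameter values for us. The sketch below uses the endpoint and parameters shown above; `build_explore_url` and `redact_url` are hypothetical helper names, and `redact_url` is there so the URL can be displayed without showing the credentials.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lon, radius=400, limit=100):
    """Build the explore URL with properly escaped query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lon),  # the comma is escaped as %2C by urlencode
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


def redact_url(url, secret_keys=("client_id", "client_secret")):
    """Return the URL with credential values replaced, safe to display."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    safe = {k: ("REDACTED" if k in secret_keys else v[0]) for k, v in query.items()}
    return parts._replace(query=urlencode(safe)).geturl()


url = build_explore_url("MY_ID", "MY_SECRET", "20180605", 19.4899449462, -99.1316707528)
print(redact_url(url))
```

Another option is to pass the same dictionary to `requests.get(EXPLORE_ENDPOINT, params=params)` and let requests build the query string.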

In [25]:
results = requests.get(url).json()
#results  #uncomment to see the whole file <<JSON format>>
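
Before digging into `results`, it may help to check that the request succeeded. The sketch below assumes the response carries a `meta` block with a numeric `code` (and `errorType` / `errorDetail` on failure), next to the `response` block used in this notebook; `extract_items` is a hypothetical helper.

```python
def extract_items(payload):
    """Return the list of venue items from an explore response, or raise with the API's error."""
    meta = payload.get("meta", {})
    code = meta.get("code")
    if code != 200:
        raise RuntimeError("Foursquare request failed: code={} type={} detail={}".format(
            code, meta.get("errorType"), meta.get("errorDetail")))
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []  # nothing recommended around this location
    return groups[0].get("items", [])


# small made-up payloads to show both paths
ok = {"meta": {"code": 200}, "response": {"groups": [{"items": [{"venue": {"name": "A"}}]}]}}
print(len(extract_items(ok)))

try:
    extract_items({"meta": {"code": 429, "errorType": "quota_exceeded"}})
except RuntimeError as err:
    print(err)
```

In the notebook this would be called as `extract_items(results)` instead of indexing `results['response']['groups'][0]['items']` directly.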

If we want, we can work with the information as a dataframe (optional)

In [26]:
venues = results['response']['groups'][0]['items'] #relevant information from results file
    
venues1235 = json_normalize(venues) #transform JSON format into a dataframe 
venues1235.head(3)



Unnamed: 0,referralId,reasons.count,reasons.items,venue.id,venue.name,venue.location.address,venue.location.crossStreet,venue.location.lat,venue.location.lng,venue.location.labeledLatLngs,...,venue.location.cc,venue.location.city,venue.location.state,venue.location.country,venue.location.formattedAddress,venue.categories,venue.photos.count,venue.photos.groups,venue.location.neighborhood,venue.venuePage.id
0,e-0-4c06b5762e80a593098a74f9-0,0,"[{'summary': 'This spot is popular', 'type': '...",4c06b5762e80a593098a74f9,Parrilla Danesa,Matanzas 669,Av. Montevideo,19.490257,-99.132664,"[{'label': 'display', 'lat': 19.49025697990055...",...,MX,Gustavo A. Madero,Distrito Federal,México,"[Matanzas 669 (Av. Montevideo), 07300 Gustavo ...","[{'id': '4bf58dd8d48988d109941735', 'name': 'E...",0,[],,
1,e-0-564fbfe8498e3172273b6b5a-1,0,"[{'summary': 'This spot is popular', 'type': '...",564fbfe8498e3172273b6b5a,La Casa de Toño,Av. Montevideo 363,Av. Instituto Politécnico Nacional,19.491137,-99.133654,"[{'label': 'display', 'lat': 19.49113659394063...",...,MX,Gustavo A. Madero,Distrito Federal,México,[Av. Montevideo 363 (Av. Instituto Politécnico...,"[{'id': '4bf58dd8d48988d1c1941735', 'name': 'M...",0,[],,
2,e-0-57d4582bcd1076ef271b0c62-2,0,"[{'summary': 'This spot is popular', 'type': '...",57d4582bcd1076ef271b0c62,El Pescadito,Avenida Montevideo 363,,19.491236,-99.133791,"[{'label': 'display', 'lat': 19.49123613522448...",...,MX,Ciudad de México,Distrito Federal,México,"[Avenida Montevideo 363, 07300 Ciudad de Méxic...","[{'id': '4bf58dd8d48988d1ce941735', 'name': 'S...",0,[],,


In [27]:
#this is the way we can get the category of the first venue
venues1235.loc[0, 'venue.categories'][0]['name']

#this is the way we can get the category of the second venue
venues1235.loc[1, 'venue.categories'][0]['name']




'Mexican Restaurant'

This time I decided to keep working with the data in JSON format.

In [28]:
#check the number of venues we got 
len(results['response']['groups'][0]['items'])

63

In [29]:
# This is how we access the data of each venue
#venue = results['response']['groups'][0]['items'][0]['venue']  #first Venue

venue2 = results['response']['groups'][0]['items'][1]['venue']  #Second Venue
venue2

{'id': '564fbfe8498e3172273b6b5a',
 'name': 'La Casa de Toño',
 'location': {'address': 'Av. Montevideo 363',
  'crossStreet': 'Av. Instituto Politécnico Nacional',
  'lat': 19.491136593940638,
  'lng': -99.13365446109052,
  'labeledLatLngs': [{'label': 'display',
    'lat': 19.491136593940638,
    'lng': -99.13365446109052}],
  'distance': 246,
  'postalCode': '07300',
  'cc': 'MX',
  'city': 'Gustavo A. Madero',
  'state': 'Distrito Federal',
  'country': 'México',
  'formattedAddress': ['Av. Montevideo 363 (Av. Instituto Politécnico Nacional)',
   '07300 Gustavo A. Madero, Distrito Federal',
   'México']},
 'categories': [{'id': '4bf58dd8d48988d1c1941735',
   'name': 'Mexican Restaurant',
   'pluralName': 'Mexican Restaurants',
   'shortName': 'Mexican',
   'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/mexican_',
    'suffix': '.png'},
   'primary': True}],
 'photos': {'count': 0, 'groups': []}}
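
The `distance` field above appears to be the distance in meters between the venue and the `ll` point of the request (that reading is an assumption). A quick haversine calculation lets us sanity-check it, and also confirm that a venue falls inside the requested radius. `haversine_m` is a hypothetical helper name.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# neighborhood point from the request URL and venue point from the output above
d = haversine_m(19.4899449462, -99.1316707528, 19.491136593940638, -99.13365446109052)
print(round(d), d <= 400)  # compare with 'distance': 246 and radius = 400
```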

In [30]:
# then we save relevant information in the next variables

category = results['response']['groups'][0]['items'][1]['venue']['categories'][0]['name']
name  =  results['response']['groups'][0]['items'][1]['venue']['name']
lat = results['response']['groups'][0]['items'][1]['venue']['location']['lat']
lng = results['response']['groups'][0]['items'][1]['venue']['location']['lng']


Create an empty dataframe where we will save the information.

In [31]:
nearby_venues = pd.DataFrame(columns=['Neighborhood', 
                                      'Neighborhood Latitude', 
                                      'Neighborhood Longitude', 
                                      'Venue', 
                                      'Venue Latitude', 
                                      'Venue Longitude', 
                                      'Venue Category'])


nearby_venues


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category


The next step is to fill the dataframe with the venue data taken from the JSON.

In [32]:

#First venue
nearby_venues.loc[0,'Venue'] = name
nearby_venues.loc[0,'Venue Category']= category                                            
nearby_venues.loc[0,'Venue Latitude']= lat 
nearby_venues.loc[0,'Venue Longitude']= lng 

#second venue 
#Remember to repeat the variable assignments above with the second venue's data under different names (otherwise the second row will hold the same information as the first row)
nearby_venues.loc[1,'Venue'] = name
nearby_venues.loc[1,'Venue Category']= category                                            
nearby_venues.loc[1,'Venue Latitude']= lat 
nearby_venues.loc[1,'Venue Longitude']= lng 
    
nearby_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,,,,La Casa de Toño,19.4911,-99.1337,Mexican Restaurant
1,,,,La Casa de Toño,19.4911,-99.1337,Mexican Restaurant


Now that we understand how to get the information out of the JSON and save it into a dataframe, let's create a function that does it for every neighborhood.

In [57]:
def getVenues(cities, names, latitudes, longitudes):
    
    LIMIT = 100 # limit of number of venues returned by Foursquare API
    radius = 450 # define radius
    pos = 0  # row offset: number of venues already written to the dataframe
    
    # create a new dataframe with only the column names
    nearby_venues = pd.DataFrame(columns=['City',
                                          'Neighborhood', 
                                          'Neighborhood Latitude', 
                                          'Neighborhood Longitude', 
                                          'Venue', 
                                          'Venue Category',
                                          'Venue Latitude', 
                                          'Venue Longitude'])
    
    #Make a request for each neighborhood latitude and longitude
    for city, name, lat, lng in zip(cities, names, latitudes, longitudes):
        print(city, name) #this is to look for the current progress while the function is running 
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION, 
                lat, 
                lng, 
                radius, 
                LIMIT)

        #Save the venues data
        results = requests.get(url).json()['response']['groups'][0]['items']

        #Fill the dataframe with the JSON data
        for row in range(len(results)):

            categories = results[row]['venue']['categories']
            v_category = categories[0]['name'] if categories else None  # a venue may come without a category
            v_name  =    results[row]['venue']['name']
            v_lat =      results[row]['venue']['location']['lat']
            v_lng =      results[row]['venue']['location']['lng']
            #Fill the dataframe in the correct position using loc method 
            nearby_venues.loc[row + pos ,'Venue'] = v_name
            nearby_venues.loc[row + pos ,'Venue Category']= v_category                                            
            nearby_venues.loc[row + pos ,'Venue Latitude']= v_lat 
            nearby_venues.loc[row + pos ,'Venue Longitude']= v_lng 
            nearby_venues.loc[row + pos ,'City'] = city
            nearby_venues.loc[row + pos ,'Neighborhood'] = name
            nearby_venues.loc[row + pos ,'Neighborhood Latitude'] = lat
            nearby_venues.loc[row + pos ,'Neighborhood Longitude'] = lng

        pos = pos + len(results)  
    
    return nearby_venues
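
Writing to the dataframe one cell at a time with `.loc` gets slow as the dataframe grows. A common alternative is to collect plain dictionaries in a list and build the dataframe once at the end. The sketch below is a variation on the function above, not a replacement tested against the API: `collect_venues` and the injected `fetch_items` callable are hypothetical, and `fetch_items(lat, lng)` is assumed to return the same list of items that `results` holds above.

```python
def collect_venues(neighborhoods, fetch_items):
    """Build one flat record per venue.

    neighborhoods: iterable of (city, name, lat, lng) tuples.
    fetch_items: callable (lat, lng) -> list of explore items; injected so the
                 network call can be swapped or mocked (hypothetical helper).
    """
    rows = []
    for city, name, lat, lng in neighborhoods:
        try:
            items = fetch_items(lat, lng)
        except Exception as err:  # keep going if one neighborhood fails
            print("skipping {} ({}): {}".format(name, city, err))
            continue
        for item in items:
            venue = item.get("venue", {})
            categories = venue.get("categories") or []
            location = venue.get("location", {})
            rows.append({
                "City": city,
                "Neighborhood": name,
                "Neighborhood Latitude": lat,
                "Neighborhood Longitude": lng,
                "Venue": venue.get("name"),
                "Venue Category": categories[0]["name"] if categories else None,
                "Venue Latitude": location.get("lat"),
                "Venue Longitude": location.get("lng"),
            })
    return rows
```

The result can be turned into a dataframe with `pd.DataFrame(rows)`, where the neighborhoods come from `zip(cdmx_geo['ALCALDIA'], cdmx_geo['COLONIA'], cdmx_geo['Lat'], cdmx_geo['Lon'])`.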

In [59]:
#1 
cdmx_venues1 = getVenues(cities=cdmx_geo['ALCALDIA'], 
                        names=cdmx_geo['COLONIA'],
                        latitudes = cdmx_geo['Lat'],
                        longitudes=cdmx_geo['Lon'])

LOMAS DE CHAPULTEPEC
LOMAS DE REFORMA (LOMAS DE CHAPULTEPEC)
DEL BOSQUE (POLANCO)
PEDREGAL DE SANTA URSULA I
AJUSCO I
VISTAS DEL MAUREL (U HAB)
IGNACIO ZARAGOZA I
CENTRO II
VALENTIN GOMEZ FARIAS
MORELOS II
NICOLAS BRAVO
5 DE MAYO
PLUTARCO ELIAS CALLES
CAMPESTRE COYOACAN (FRACC)
JANITZIO
FELIPE ANGELES
AZTECA
ARTES GRAFICAS
CUATRO ARBOLES
CTM X CULHUACAN (U HAB)
EMILIANO ZAPATA
SAN FRANCISCO CULHUACAN (PBLO)
PARQUE SAN ANDRES
OLIMPICA
CTM EL RISCO (U HAB)
EL ARBOLILLO 2 (U HAB)
C T M ARAGON (U)
EDUARDO MOLINA II (U HAB)
VILLA DE ARAGON (FRACC)
NUEVA VALLEJO
SAN JUAN DE ARAGON 6A SECCION (U HAB) I
SANTIAGO ATEPETLAC (LA SELVITA) (U HAB)
EL CARMEN
HEROE DE NACOZARI
MAXIMINO AVILA CAMACHO
JOSE MARIA MORELOS Y PAVON I (U HAB)
SAN FELIPE DE JESUS IV
MARTIN CARRERA I
MESA LOS HORNOS, TEXCALTENCO
MIRADOR 1A SECC
LA PRIMAVERA
LOMAS ALTAS DE PADIERNA SUR
LOMAS DE PADIERNA (AMPL)
SAN PEDRO MARTIR (PBLO)
TLALCOLIGIA
SAUZALES CEBADALES (U HAB)
LA NORIA TEPEPAN
POTRERO DE SAN BERNARDINO
SAN ANDRES T

In [60]:
#Save it as a csv file 
cdmx_venues1.to_csv('cdmx_venues.csv')

In [61]:
cdmx_venues1

Unnamed: 0,City,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Category,Venue Latitude,Venue Longitude
0,MIGUEL HIDALGO,LOMAS DE CHAPULTEPEC,19.4228,-99.2158,Studio Gourmet,Gastropub,19.4203,-99.2155
1,MIGUEL HIDALGO,LOMAS DE CHAPULTEPEC,19.4228,-99.2158,Loma Linda,Steakhouse,19.4202,-99.2181
2,MIGUEL HIDALGO,LOMAS DE CHAPULTEPEC,19.4228,-99.2158,Cabina Radio Disney 99.3 Grupo Acir,Music Venue,19.4208,-99.216
3,MIGUEL HIDALGO,LOMAS DE CHAPULTEPEC,19.4228,-99.2158,Silence Track,Recording Studio,19.4204,-99.2175
4,MIGUEL HIDALGO,LOMAS DE CHAPULTEPEC,19.4228,-99.2158,City Market,Gourmet Shop,19.419,-99.2148
...,...,...,...,...,...,...,...,...
32349,VENUSTIANO CARRANZA,JARDIN BALBUENA I,19.4216,-99.1054,Buffet Chon Hing,Chinese Restaurant,19.4186,-99.1068
32350,VENUSTIANO CARRANZA,JARDIN BALBUENA I,19.4216,-99.1054,Créme brûlée,Pastry Shop,19.4203,-99.1073
32351,VENUSTIANO CARRANZA,JARDIN BALBUENA I,19.4216,-99.1054,Gino's,Bakery,19.418,-99.1039
32352,VENUSTIANO CARRANZA,JARDIN BALBUENA I,19.4216,-99.1054,Kaloc,Chinese Restaurant,19.4183,-99.1062


...and that's it. We have a dataframe with the venues Foursquare returned for every neighborhood in the city.  

###### thanks for reading <3