# <center>Coursera Capstone: Battle of Neighborhoods<br>Migrating to Spain – Exploring Similar Neighborhood</center>
<hr>

## Table of Contents

<ol>
    <li><b><a href="#introduction">Introduction</a></b></li>
    <li><b><a href="#data">Data</a></b></li>
    <li><b><a href="#meth">Methodology</a></b></li>
    <li><b><a href="#ana">Analysis</a></b></li>
    <li><b><a href="#rnd">Results and Conclusion</a></b></li>
</ol>

<hr>

## Introduction <a name="introduction"></a>

Moving from one city to another is often a hectic process: a new place, new people, a new culture and, most importantly, a new neighborhood. Exploring the new city means starting again from square one. It would help a great deal to find, in the city you are moving to, amenities, restaurants and other venues like the ones near your current home.
Here, I assume that I am moving from my current city, Pune, India, to Madrid, Spain. In this capstone, I will apply the techniques learned throughout the Data Science courses to explore the neighborhoods of Madrid, the capital of Spain.
I will collect the venues around my current location using the Foursquare API, then use the same API to look for similar venues in Madrid.

## Data <a name='data'></a>

In [4]:
import pandas as pd
import numpy as np
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
import io
import json
from geopy.geocoders import Nominatim  # convert an address into latitude and longitude values
from pandas import json_normalize  # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.metrics import pairwise_distances
import folium

In [5]:
CLIENT_ID = 'CIJXLJ1OLJYY3BHETOTK1TK4BRGRAAUTV21RTCTM013PEULS'
CLIENT_SECRET = 'KPURRZOCR42OV0QL52TD2XFIPQ1KBEAGXRXD0ECROKEG0ZUD'
VERSION = '20190719' # Foursquare API version

In [6]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # the column is named 'venue.categories' after json_normalize
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']

    return (nearby_venues)


def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]

## Data for current location

As discussed in the introduction, my current location is the city of Pune in India. Let's find the venues around it using the Foursquare API's explore endpoint, limiting the result to 80 venues within a radius of 1000 meters.

In [7]:
state_city = 'Pune'
state = 'Maharashtra'
address = state_city + ', ' + state

geolocator = Nominatim(user_agent="state_explorer")
location = geolocator.geocode(address)
state_latitud = location.latitude
state_longitud = location.longitude
print(location, state_latitud, state_longitud)

Pune, Pune District, Maharashtra, 411001, India 18.5203062 73.8543185


**GET https://api.foursquare.com/v2/venues/explore**

In [8]:
radius = 1000
LIMIT = 80

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, state_latitud, state_longitud, VERSION, radius, LIMIT)
print(url)

results = requests.get(url).json()
print('There are {} venues around your location.'.format(len(results['response']['groups'][0]['items'])))

state_venues = results['response']['groups'][0]['items']

state_nearby_venues = json_normalize(state_venues)  # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.location.lat', 'venue.location.lng', 'venue.categories']
state_nearby_venues = state_nearby_venues.loc[:, filtered_columns]

# filter the category for each row
state_nearby_venues['venue.categories'] = state_nearby_venues.apply(get_category_type, axis=1)

# clean columns
state_nearby_venues.columns = [col.split(".")[-1] for col in state_nearby_venues.columns]

state_nearby_venues['Neighborhood'] = state_city
state_nearby_venues['Neighborhood Latitude'] = state_latitud
state_nearby_venues['Neighborhood Longitude'] = state_longitud
cols = state_nearby_venues.columns.tolist()
cols = cols[-3:] + cols[:-3]
state_nearby_venues = state_nearby_venues[cols]

state_nearby_venues.head()
print('{} venues were returned by Foursquare.'.format(state_nearby_venues.shape[0]))
print('There are {} unique categories.'.format(len(state_nearby_venues['categories'].unique())))

https://api.foursquare.com/v2/venues/explore?client_id=CIJXLJ1OLJYY3BHETOTK1TK4BRGRAAUTV21RTCTM013PEULS&client_secret=KPURRZOCR42OV0QL52TD2XFIPQ1KBEAGXRXD0ECROKEG0ZUD&ll=18.5203062,73.8543185&v=20190719&radius=1000&limit=80
There are 47 venues around your location.
47 venues were returned by Foursquare.
There are 27 unique categories.
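
The request URL above is assembled by string formatting. As a hedged alternative, the query string can be built from a dictionary so that each parameter is named and URL-encoded. The sketch below uses only the standard library; `build_explore_url` is a hypothetical helper name and the credentials are placeholders, not the ones used in this notebook.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"  # endpoint used in this notebook


def build_explore_url(lat, lng, client_id, client_secret, version, radius=1000, limit=80):
    """Hypothetical helper: build an explore URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,  # meters
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


# placeholder credentials, for illustration only
example_url = build_explore_url(18.5203062, 73.8543185, "MY_ID", "MY_SECRET", "20190719")

# parse the URL back to check that every parameter survived the encoding
query = parse_qs(urlsplit(example_url).query)
print(query["ll"], query["radius"], query["limit"])
```

The same dictionary can also be passed as `params=` to `requests.get`, which performs the encoding itself.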


## Data for city of Madrid

For obtaining neighborhoods in Madrid, let's use data from [Portal de datos abiertos del Ayuntamiento de Madrid](https://datos.madrid.es/portal/site/egob/). Specifically I will download a CSV file titled [Relación de barrios (superficie y perímetro)](https://datos.madrid.es/portal/site/egob/menuitem.c05c1f754a33a9fbe4b2e4b284f1a5a0/?vgnextoid=46b55cde99be2410VgnVCM1000000b205a0aRCRD&vgnextchannel=374512b9ace9f310VgnVCM100000171f5a0aRCRD&vgnextfmt=default). 

This file is a list of Districts and Neighborhoods i.e. Distrito and Barrio in Madrid. 

Let's find coordinates by using **Nominatim** from **geopy.geocoders**.

In [9]:
url = "https://www.dropbox.com/s/77vqznq3bik3q57/madrid_barrios.csv?dl=1"
urlData = requests.get(url).content
barrios_df = pd.read_csv(io.StringIO(urlData.decode('utf-8')))

print(barrios_df.shape)
print(barrios_df.head(10))

(128, 2)
     Distrito                      Barrio
0  Arganzuela                      Atocha
1  Arganzuela                    Delicias
2  Arganzuela                    Imperial
3  Arganzuela                  La Chopera
4  Arganzuela                 Las Acacias
5  Arganzuela                     Legazpi
6  Arganzuela             Palos de Moguer
7     Barajas                  Aeropuerto
8     Barajas            Alameda de Osuna
9     Barajas  Casco Historico de Barajas


In [13]:
geo_barrios = pd.DataFrame(columns=['Distrito', 'Barrio', 'Latitud', 'Longitud'])
j = 0

geolocator = Nominatim(user_agent="state_explorer")  # create the geocoder once, not once per row

for idx, row in barrios_df.iterrows():
    address = row['Barrio'] + ', ' + row['Distrito'] + ', Madrid'
    location = geolocator.geocode(address, timeout=30)
    if location is not None:  # barrios that Nominatim cannot resolve are skipped
        latitude = location.latitude
        longitude = location.longitude
        geo_barrios.loc[j] = [row['Distrito'], row['Barrio'], latitude, longitude]
        j += 1

print(geo_barrios.shape)
print(geo_barrios.head(10))

(119, 4)
     Distrito            Barrio    Latitud  Longitud
0  Arganzuela            Atocha  40.405731 -3.690142
1  Arganzuela          Delicias  40.397292 -3.689495
2  Arganzuela          Imperial  40.406915 -3.717329
3  Arganzuela        La Chopera  40.394893 -3.699705
4  Arganzuela       Las Acacias  40.400759 -3.706995
5  Arganzuela           Legazpi  40.391172 -3.695190
6  Arganzuela   Palos de Moguer  40.403927 -3.695561
7     Barajas        Aeropuerto  40.494426 -3.564283
8     Barajas  Alameda de Osuna  40.457581 -3.587975
9     Barajas        Corralejos  40.468164 -3.587073
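
The geocoding step returned coordinates for 119 of the 128 barrios, so nine were dropped without being reported. A minimal sketch of how the unresolved names could be recorded is shown below. `geocode_all` and `fake_geocode` are hypothetical names, and the fake geocoder stands in for `geolocator.geocode` so that the example runs without network access.

```python
import time


def geocode_all(addresses, geocode_fn, min_delay_seconds=1.0):
    """Hypothetical helper: geocode each address and keep track of failures.

    geocode_fn is any callable that returns an object with .latitude and
    .longitude, or None when the address cannot be resolved.
    """
    found, missing = {}, []
    for address in addresses:
        location = geocode_fn(address)
        if location is None:
            missing.append(address)
        else:
            found[address] = (location.latitude, location.longitude)
        time.sleep(min_delay_seconds)  # pause between requests to the service
    return found, missing


class FakeLocation:
    """Stand-in for a geocoding result, for illustration only."""

    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(address):
    # stand-in for geolocator.geocode: it knows a single address
    known = {"Atocha, Arganzuela, Madrid": FakeLocation(40.405731, -3.690142)}
    return known.get(address)


found, missing = geocode_all(
    ["Atocha, Arganzuela, Madrid", "Nowhere, Madrid"], fake_geocode, min_delay_seconds=0)
print(found)
print(missing)
```

Printing `missing` after the real run would show which barrios need a differently spelled address or a manual lookup.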


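Using the Foursquare API's explore endpoint again, limited to 80 venues per neighborhood, I obtained the following result. Note that the call below does not pass `radius`, so `getNearbyVenues` falls back to its default of 500 meters rather than the 1000 meters used for Pune: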

In [16]:
madrid_venues = getNearbyVenues(names=geo_barrios['Barrio'],
                                   latitudes=geo_barrios['Latitud'],
                                   longitudes=geo_barrios['Longitud']
                                  )

print(madrid_venues.shape)
print(madrid_venues.head())
print('There are {} unique categories.'.format(len(madrid_venues['Venue Category'].unique())))

Atocha
Delicias
Imperial
La Chopera
Las Acacias
Legazpi
Palos de Moguer
Aeropuerto
Alameda de Osuna
Corralejos
Timon
Abrantes
Buenavista
Comillas
Opanel
Puerta Bonita
San Isidro
Vista Alegre
Cortes
Embajadores
Justicia
Palacio
Sol
Universidad
Castilla
Ciudad Jardin
Nueva Espana
Prosperidad
Almagro
Arapiles
Gaztambide
Rios Rosas
Trafalgar
Vallehermoso
Atalaya
Colina
Concepcion
Costillares
Pueblo Nuevo
Quintana
San Juan Bautista
San Pascual
Ventas
El Pardo
El Pilar
La Paz
Mirasierra
Apostol Santiago
Canillas
Pinar del Rey
Piovera
Valdefuentes
Aluche
Campamento
Cuatro Vientos
Las aguilas
Lucero
Puerta del angel
Aravaca
Arguelles
Casa de Campo
Ciudad Universitaria
El Plantio
Valdemarin
Valdezarza
Fontarron
Horcajo
Marroquina
Media Legua
Pavones
Vinateros
Entrevias
Numancia
Palomeras Bajas
Palomeras Sureste
Portazgo
San Diego
Adelfas
Estrella
Ibiza
Jeronimos
Nino Jesus
Pacifico
Castellana
Fuente del Berro
Goya
Guindalera
Lista
Recoletos
Amposta
Arcos
Canillejas
El Salvador
Hellin
Rejas
Rosa

## Methodology <a name="meth"></a>

Let us transform the data into numeric form so that we can apply k-means. I will follow these steps:

* Append the location of my current city (Pune) to the locations of the Madrid neighborhoods (variable `geo_barrios`)
* Append the venues found around Pune to those found in Madrid (variable `madrid_venues`)
* Use one-hot encoding to turn each venue category into its own numeric column
* Group the resulting matrix by neighborhood, taking the mean of each category
* Apply k-means for several values of K (3 to 11) and keep the best-scoring one
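
The encoding and grouping steps can be illustrated on a toy table before applying them to the real data. This is a minimal sketch with made-up neighborhoods and categories; the names `toy_venues`, `toy_onehot` and `toy_grouped` are assumptions for illustration only.

```python
import pandas as pd

# toy data with made-up neighborhoods and categories
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B", "B"],
    "Venue Category": ["Bar", "Hotel", "Bar", "Bar", "Park", "Bar"],
})

# one column per category, 1.0 where the venue belongs to it
toy_onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
toy_onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# the mean of a 0/1 column is the share of venues in that category
toy_grouped = toy_onehot.groupby("Neighborhood").mean()
print(toy_grouped)
```

Each row of the grouped table sums to 1, so neighborhoods with very different venue counts become comparable by the mix of categories they offer.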

In [17]:
tmp = {'Distrito': state, 'Barrio':state_city, 'Latitud': state_latitud, 'Longitud': state_longitud}
geo_barrios = pd.concat([geo_barrios, pd.DataFrame([tmp])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
print(geo_barrios.shape)
print(geo_barrios.head())

(120, 4)
     Distrito       Barrio    Latitud  Longitud
0  Arganzuela       Atocha  40.405731 -3.690142
1  Arganzuela     Delicias  40.397292 -3.689495
2  Arganzuela     Imperial  40.406915 -3.717329
3  Arganzuela   La Chopera  40.394893 -3.699705
4  Arganzuela  Las Acacias  40.400759 -3.706995


In [18]:
cols = madrid_venues.columns.tolist()
state_nearby_venues.columns = cols

madrid_venues = pd.concat([madrid_venues, state_nearby_venues], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

print(madrid_venues.shape)
print(madrid_venues.head())

(3493, 7)
  Neighborhood  Neighborhood Latitude  Neighborhood Longitude  \
0       Atocha              40.405731               -3.690142   
1       Atocha              40.405731               -3.690142   
2       Atocha              40.405731               -3.690142   
3       Atocha              40.405731               -3.690142   
4       Atocha              40.405731               -3.690142   

                                 Venue  Venue Latitude  Venue Longitude  \
0                Only You Hotel Atocha       40.407161        -3.688438   
1                       Bodegas Rosell       40.403803        -3.690620   
2                        Pandora's Vox       40.405600        -3.691992   
3               Running Company Madrid       40.406714        -3.686904   
4  Estación de Madrid-Puerta de Atocha       40.406611        -3.690551   

        Venue Category  
0                Hotel  
1   Spanish Restaurant  
2          Music Venue  
3  Sporting Goods Shop  
4        Train Station 

In [19]:
# let us onehot encode
madrid_onehot = pd.get_dummies(madrid_venues[['Venue Category']], prefix="", prefix_sep="")

# adding neighborhood column back to dataframe
madrid_onehot['Nborhood'] = madrid_venues['Neighborhood']

# moving neighborhood column to the first column
fixed_columns = [madrid_onehot.columns[-1]] + list(madrid_onehot.columns[:-1])
madrid_onehot = madrid_onehot[fixed_columns]

print(madrid_onehot.shape)
print(madrid_onehot.head())

(3493, 257)
  Nborhood  Accessories Store  Adult Boutique  African Restaurant  \
0   Atocha                  0               0                   0   
1   Atocha                  0               0                   0   
2   Atocha                  0               0                   0   
3   Atocha                  0               0                   0   
4   Atocha                  0               0                   0   

   Airport Lounge  American Restaurant  Arcade  Arepa Restaurant  \
0               0                    0       0                 0   
1               0                    0       0                 0   
2               0                    0       0                 0   
3               0                    0       0                 0   
4               0                    0       0                 0   

   Argentinian Restaurant  Art Gallery     ...       Train Station  \
0                       0            0     ...                   0   
1                       

In [20]:
madrid_grouped = madrid_onehot.groupby('Nborhood').mean().reset_index()
print(madrid_grouped.shape)
print(madrid_grouped.head())

(117, 257)
           Nborhood  Accessories Store  Adult Boutique  African Restaurant  \
0          Abrantes                0.0             0.0                 0.0   
1           Adelfas                0.0             0.0                 0.0   
2        Aeropuerto                0.0             0.0                 0.0   
3  Alameda de Osuna                0.0             0.0                 0.0   
4           Almagro                0.0             0.0                 0.0   

   Airport Lounge  American Restaurant  Arcade  Arepa Restaurant  \
0        0.000000               0.0000     0.0               0.0   
1        0.000000               0.0000     0.0               0.0   
2        0.266667               0.0000     0.0               0.0   
3        0.000000               0.0000     0.0               0.0   
4        0.000000               0.0125     0.0               0.0   

   Argentinian Restaurant  Art Gallery     ...       Train Station  \
0                     0.0        0.000   

## Exploring Neighborhood

Let's use k-means to group the Madrid neighborhoods together with my neighborhood in Pune and see which ones are most similar to mine.

To choose the number of clusters, let's use 2 metrics:
- Silhouette Coefficient
- Calinski-Harabasz index
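
Both metrics reward clusters that are dense and well separated, which is easiest to see on data where the right answer is known. The sketch below is an illustration on synthetic points, not part of the analysis: three tight groups are generated around made-up centers and both scores are computed for several values of K.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])  # made-up centers
# 30 points scattered tightly around each of the three centers
X = np.vstack([c + 0.5 * rng.randn(30, 2) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = (silhouette_score(X, labels), calinski_harabasz_score(X, labels))

# the K with the highest Silhouette Coefficient
best_k_silhouette = max(scores, key=lambda k: scores[k][0])
print(best_k_silhouette)
```

On real venue data the groups are far less distinct, so low and noisy scores are to be expected.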

In [21]:
madrid_grouped_clustering = madrid_grouped.drop(columns='Nborhood')
best_sc = 0
best_sc_k = 0
best_chi = 0 
best_chi_k = 0

for x in range(3,12):
    kclusters = x
    # run k-means clustering
    kmeans_model = KMeans(n_clusters=kclusters, random_state=0).fit(madrid_grouped_clustering)
    # get cluster labels
    labels = kmeans_model.labels_
    # compute Silhouette Coefficient
    sc = metrics.silhouette_score(madrid_grouped_clustering, labels, metric='euclidean')
    print('Number of clusters: ', kclusters, ' Silhouette Coefficient: ', sc)
    # compute Calinski-Harabasz Index (spelled calinski_harabaz_score before scikit-learn 0.20)
    chi = metrics.calinski_harabasz_score(madrid_grouped_clustering, labels)
    print('Number of clusters: ', kclusters, ' Calinski-Harabaz Index: ', chi)
    if sc > best_sc:
        best_sc = sc
        best_sc_k = kclusters
    if chi > best_chi:
        best_chi = chi
        best_chi_k = kclusters  

print('Best # clusters according to Silhouette Coefficient: ', best_sc_k, ' with score = ', best_sc)
print('Best # clusters according to Calinski-Harabaz Index: ', best_chi_k, ' with score = ', best_chi)

Number of clusters:  3  Silhouette Coefficient:  0.18436927235489234
Number of clusters:  3  Calinski-Harabaz Index:  8.392106058755832
Number of clusters:  4  Silhouette Coefficient:  0.04171943162442763
Number of clusters:  4  Calinski-Harabaz Index:  6.54489720309675
Number of clusters:  5  Silhouette Coefficient:  0.06894314913315158
Number of clusters:  5  Calinski-Harabaz Index:  7.125551839985787
Number of clusters:  6  Silhouette Coefficient:  0.14267063508098968
Number of clusters:  6  Calinski-Harabaz Index:  6.209229083138543
Number of clusters:  7  Silhouette Coefficient:  0.04057894179178734
Number of clusters:  7  Calinski-Harabaz Index:  5.762384883442128
Number of clusters:  8  Silhouette Coefficient:  0.016130486711397354
Number of clusters:  8  Calinski-Harabaz Index:  6.3437008821395136
Number of clusters:  9  Silhouette Coefficient:  0.03493157856232075
Number of clusters:  9  Calinski-Harabaz Index:  5.7757117790951895
Number of clusters:  10  Silhouette Coefficien

Let's choose K from the best Silhouette Coefficient to use for the final model.

Then compute the distance from each point to every cluster centroid using *kmeans.transform*.

In [22]:
kmeans_model = KMeans(n_clusters=best_sc_k, random_state=0).fit(madrid_grouped_clustering)
labels = kmeans_model.labels_
# get distance from centroids
distance = kmeans_model.transform(madrid_grouped_clustering)
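
As a quick check of what `transform` returns, the sketch below fits k-means on four made-up points and compares the result with distances computed by hand. The array `X` and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])  # made-up points
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

distances = model.transform(X)  # one row per point, one column per centroid

# the same Euclidean distances computed by hand from the fitted centroids
manual = np.linalg.norm(X[:, None, :] - model.cluster_centers_[None, :, :], axis=2)

# the nearest centroid of each point is the cluster it was assigned to
nearest = distances.argmin(axis=1)
print(distances.shape, np.allclose(distances, manual), (nearest == model.labels_).all())
```

Column `j` of the result therefore holds every point's distance to centroid `j`, including points that belong to other clusters.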

## Analysis <a name='ana'></a>

We will look at the 12 most frequent venue categories for each neighborhood. This helps us draw conclusions about which neighborhoods of Madrid are most similar to my neighborhood in Pune.

In [34]:
num_top_venues = 12

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # from the 4th onwards there is no special suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = madrid_grouped['Nborhood']

for ind in np.arange(madrid_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(madrid_grouped.iloc[ind, :], num_top_venues)

print(neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted.head()

(117, 13)


| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue | 11th Most Common Venue | 12th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abrantes | Bakery | Soccer Field | Fast Food Restaurant | Pizza Place | Park | Athletics & Sports | Restaurant | Grocery Store | Flea Market | Financial or Legal Service | Fish & Chips Shop | Flower Shop |
| 1 | Adelfas | Bar | Hotel | Diner | Tapas Restaurant | Bakery | Fast Food Restaurant | Supermarket | Grocery Store | Soccer Field | Brewery | Café | Korean Restaurant |
| 2 | Aeropuerto | Airport Lounge | Fast Food Restaurant | Duty-free Shop | Spanish Restaurant | Coffee Shop | Sporting Goods Shop | Café | Breakfast Spot | Dumpling Restaurant | Food | Fruit & Vegetable Store | Frozen Yogurt Shop |
| 3 | Alameda de Osuna | Hotel | Smoke Shop | Restaurant | Tapas Restaurant | Italian Restaurant | Bakery | Fried Chicken Joint | Chinese Restaurant | Scenic Lookout | Breakfast Spot | Bookstore | Metro Station |
| 4 | Almagro | Spanish Restaurant | Restaurant | Hotel | French Restaurant | Cocktail Bar | Bookstore | Supermarket | Mediterranean Restaurant | Bar | Japanese Restaurant | Asian Restaurant | Italian Restaurant |
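
The column labels above are built with a try/except around a three-item suffix list, which is enough for 12 columns. A small helper makes the suffix rule explicit and would also cope with larger counts; `ordinal` is a hypothetical name used here for illustration.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


# same labels as in the cell above, for 12 top venues
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 13)]
print(columns[1], columns[11], columns[12])
```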


In [35]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans_model.labels_)
# find index of state neighborhood
state_ind = neighborhoods_venues_sorted.index[neighborhoods_venues_sorted['Neighborhood'] == state_city].tolist()
# find cluster of state neighborhood
state_cluster = neighborhoods_venues_sorted.loc[state_ind]['Cluster Labels'].values[0]

- Take the distance of every point to the centroid of the cluster that contains the Pune neighborhood.
- Add the cluster label to each point.
- Sort the rows by cluster and distance.
- Keep only the rows of the cluster that contains the Pune neighborhood.
- Get the position of Pune in this last dataframe.

In [39]:
# create dataframe with centroids distance to state cluster
dist_df = pd.DataFrame(distance[:,state_cluster], columns=['distance'])
dist_df['cluster'] = kmeans_model.labels_
# sort dataframe by cluster and distance
dist_df.sort_values(by=['cluster', 'distance'], inplace=True)
dist_df.reset_index(inplace=True)
# keep only the rows from the state cluster
dist_df = dist_df.loc[dist_df['cluster'] == state_cluster]
# get state position in sorted dataframe
state_pos = dist_df.index[dist_df['index'] == state_ind[0]].tolist()[0]  # state_ind is a one-element list

Select the neighborhoods closest to the one in Pune, according to the distances to the centroid.

In [None]:
# get 10 closest neighborhoods from state neighborhood (5 below and 5 above in the sorted dataframe)
top10 = dist_df.loc[state_pos+1:state_pos+5]
top10 = pd.concat([top10, dist_df.loc[state_pos-5:state_pos]], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
neighborhoods_venues_sorted = neighborhoods_venues_sorted.join(top10.set_index('index'), how='inner')
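
The selection above ranks neighborhoods by their distance to the cluster centroid and then takes Pune's neighbors in that ordering. A different technique, not the one used in this notebook, is to measure the distance from each neighborhood's category-frequency vector directly to Pune's. The sketch below does this on made-up data; `closest_rows` and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def closest_rows(features, target, n=10):
    """Hypothetical helper: the n rows of `features` nearest to row `target`.

    features is a dataframe indexed by neighborhood name, one column per category.
    """
    diffs = features.drop(index=target) - features.loc[target]
    distances = np.sqrt((diffs ** 2).sum(axis=1))  # Euclidean distance to the target row
    return distances.sort_values().head(n)


# made-up category frequencies, for illustration only
toy = pd.DataFrame(
    {"Bar": [0.5, 0.4, 0.0, 0.5], "Hotel": [0.5, 0.6, 0.1, 0.0], "Park": [0.0, 0.0, 0.9, 0.5]},
    index=["Pune", "A", "B", "C"],
)
nearest_two = closest_rows(toy, "Pune", n=2)
print(nearest_two)
```

With the variables of this notebook, the call would look like `closest_rows(madrid_grouped.set_index('Nborhood'), state_city)`.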

Join neighborhoods_venues_sorted dataframe to geo_barrios dataframe to get the final result.

In [42]:
madrid_merged = geo_barrios

# join geo_barrios with neighborhoods_venues_sorted to add latitude/longitude for each selected neighborhood
madrid_merged = madrid_merged.set_index('Barrio').join(neighborhoods_venues_sorted.set_index('Neighborhood'), how="inner")

madrid_merged.reset_index(inplace=True)
madrid_merged = madrid_merged.rename(index=str, columns={"index": "Barrio"})
madrid_merged

| | Barrio | Distrito | Latitud | Longitud | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue | 11th Most Common Venue | 12th Most Common Venue | distance | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Buenavista | Carabanchel | 40.372831 | -3.743086 | 1 | Tapas Restaurant | Spanish Restaurant | Metro Station | Grocery Store | Hotel | Food & Drink Shop | Café | Supermarket | Candy Store | Restaurant | Yoga Studio | Financial or Legal Service | 0.27022 | 1 |
| 1 | Canillas | Hortaleza | 40.461889 | -3.643352 | 1 | Spanish Restaurant | Juice Bar | Bar | Gymnastics Gym | Comic Shop | Skating Rink | Plaza | Soccer Field | Pizza Place | Sushi Restaurant | Football Stadium | Food Truck | 0.28688 | 1 |
| 2 | Campamento | Latina | 40.394683 | -3.768279 | 1 | Coffee Shop | Bar | Italian Restaurant | Gym / Fitness Center | Department Store | Burger Joint | Spanish Restaurant | Tapas Restaurant | Grocery Store | Pizza Place | Historic Site | Ice Cream Shop | 0.273437 | 1 |
| 3 | Valdemarin | Moncloa-Aravaca | 40.467452 | -3.787834 | 1 | Restaurant | Supermarket | Pharmacy | Park | Spanish Restaurant | Mediterranean Restaurant | Bar | Mexican Restaurant | Asian Restaurant | Italian Restaurant | Gastropub | Hotel | 0.281147 | 1 |
| 4 | Horcajo | Moratalaz | 40.409155 | -3.626392 | 1 | Soccer Stadium | Bakery | Park | Fast Food Restaurant | Soccer Field | Spanish Restaurant | Tapas Restaurant | Pizza Place | History Museum | Historic Site | Fountain | Football Stadium | 0.308557 | 1 |
| 5 | Media Legua | Moratalaz | 40.412095 | -3.657005 | 1 | Fast Food Restaurant | Athletics & Sports | Coffee Shop | Pizza Place | Supermarket | American Restaurant | Restaurant | Big Box Store | Pool | Sandwich Place | Chinese Restaurant | Financial or Legal Service | 0.306984 | 1 |
| 6 | San Diego | Puente de Vallecas | 40.389439 | -3.667741 | 1 | Spanish Restaurant | Italian Restaurant | Supermarket | Chinese Restaurant | Music Venue | Pub | Restaurant | Financial or Legal Service | Fish & Chips Shop | Flea Market | Food | Flower Shop | 0.324554 | 1 |
| 7 | Arcos | San Blas-Canillejas | 40.420833 | -3.618406 | 1 | Restaurant | Multiplex | Clothing Store | Beer Garden | Lottery Retailer | Gym / Fitness Center | Metro Station | Big Box Store | Chinese Restaurant | Soccer Field | Food | Food & Drink Shop | 0.318213 | 1 |
| 8 | Almendrales | Usera | 40.383841 | -3.698357 | 1 | Spanish Restaurant | Seafood Restaurant | Bar | Train Station | Fast Food Restaurant | BBQ Joint | Chinese Restaurant | Gym | Pub | Gastropub | Grocery Store | Food | 0.27014 | 1 |
| 9 | Santa Eugenia | Villa de Vallecas | 40.382924 | -3.611767 | 1 | Bar | Miscellaneous Shop | Gym | Supermarket | Neighborhood | Spanish Restaurant | Train Station | Soccer Field | Football Stadium | Fountain | Fast Food Restaurant | Food Truck | 0.310594 | 1 |


Draw a map of the selected neighborhoods in Madrid.

In [43]:
address = "Madrid, España"

geolocator = Nominatim(user_agent="state_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(location, latitude, longitude)

Madrid, Área metropolitana de Madrid y Corredor del Henares, Comunidad de Madrid, 28001, España 40.4167047 -3.7035825


In [44]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
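# use the number of clusters of the final model; kclusters still holds the last value tried in the search loop
x = np.arange(best_sc_k)
ys = [i + x + (i * x) ** 2 for i in range(best_sc_k)]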
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(madrid_merged['Latitud'], madrid_merged['Longitud'], madrid_merged['Barrio'],
                                  madrid_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster - 1],
        fill=True,
        fill_color=rainbow[cluster - 1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

## Results and Conclusion <a name='rnd'></a>

This project shows that a model of this kind can help people who are moving to another city find conditions similar to those around their current home, using public data available through the Foursquare API.

One of the difficulties with the k-means algorithm is choosing the value of K. To decide which value to use, we ran the algorithm with different values of K and, for each one, calculated the Silhouette Coefficient and the Calinski-Harabasz Index.
These are 2 metrics that indicate whether the clusters are dense and well separated. Among the values of K whose scores appear in the output above (3 to 9), K = 3 scored highest on both indicators.

According to the Foursquare results, what distinguishes my neighborhood is its variety of places to eat, shops and places to exercise. The same characteristics are present in almost all of the selected neighborhoods.