<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Barcelona by their relative suitability to host a new gym/fitness center</font></h1>


## Introduction

In this notebook, we segment and cluster the neighborhoods of Barcelona by their relative suitability to host a new gym/fitness center. We use the geopy library to obtain the coordinates of Barcelona and the **explore** function of the Foursquare API to retrieve the gyms/fitness centers, transportation venues and services around each neighborhood. From the venue counts we derive three density features, standardize and weight them, and group the neighborhoods with the _k_-means clustering algorithm. Finally, we use the Folium library to visualize the neighborhoods in Barcelona and their emerging clusters.


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1.  <a href="#item1">Download, Pre-process and Explore Datasets</a>

2.  <a href="#item2">Explore Gyms/Fitness Centers, Communications and Services by Neighborhoods in Barcelona</a>

3.  <a href="#item3">Standardization and Feature Weighting</a>

4.  <a href="#item4">Cluster Neighborhoods</a>

5.  <a href="#item5">Examine Clusters</a>  
    </font>
    </div>


Before we get the data and start exploring it, let's download all the dependencies that we will need.


In [57]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
!pip install folium
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>


## 1. Download, Pre-process and Explore Datasets


We download the two datasets used in this project from the Barcelona City Council Open Data website: the population of each neighbourhood, and the area and coordinates of each neighbourhood.
Both datasets need some pre-processing before we can work with them.

In [58]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,"ï»¿Year,District_Code,District_Name,Neighbourhood_Code,Neighbourhood_Name,Sex,Year_registered,Number"
0,"2020,1,Ciutat Vella,1,el Raval,Dona,Menys d'1 ..."
1,"2020,1,Ciutat Vella,2,el Barri Gotic,Dona,Meny..."
2,"2020,1,Ciutat Vella,3,la Barceloneta,Dona,Meny..."
3,"2020,1,Ciutat Vella,4,Sant Pere Santa Caterina..."
4,"2020,2,Eixample,5,el Fort Pienc,Dona,Menys d'1..."


In [59]:
# The whole CSV was read into a single column. The header starts with the three
# stray characters left by a byte-order mark (BOM), which [3:] strips.
df_pop_bcn_cols_names = df_data_0.columns.values[0][3:].split(",")

# Each row holds one comma-separated string: split it into its fields
df_pop_bcn_data = []
for ix, rw in df_data_0.iterrows():
    df_pop_bcn_data.append(rw.values[0].split(","))
df_pop_bcn = pd.DataFrame(columns = df_pop_bcn_cols_names, data = df_pop_bcn_data)
df_pop_bcn.head(10)

Unnamed: 0,Year,District_Code,District_Name,Neighbourhood_Code,Neighbourhood_Name,Sex,Year_registered,Number
0,2020,1,Ciutat Vella,1,el Raval,Dona,Menys d'1 any,2857
1,2020,1,Ciutat Vella,2,el Barri Gotic,Dona,Menys d'1 any,1689
2,2020,1,Ciutat Vella,3,la Barceloneta,Dona,Menys d'1 any,1052
3,2020,1,Ciutat Vella,4,Sant Pere Santa Caterina i la Ribera,Dona,Menys d'1 any,1607
4,2020,2,Eixample,5,el Fort Pienc,Dona,Menys d'1 any,1666
5,2020,2,Eixample,6,la Sagrada Familia,Dona,Menys d'1 any,2438
6,2020,2,Eixample,7,la Dreta de l'Eixample,Dona,Menys d'1 any,2291
7,2020,2,Eixample,8,l'Antiga Esquerra de l'Eixample,Dona,Menys d'1 any,2233
8,2020,2,Eixample,9,la Nova Esquerra de l'Eixample,Dona,Menys d'1 any,2622
9,2020,2,Eixample,10,Sant Antoni,Dona,Menys d'1 any,1718
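
The `[3:]` slice in the cell above removes the three stray characters (`ï»¿`) that a UTF-8 byte-order mark leaves at the start of the header when the file is decoded with a different encoding. Since the loading cell is not shown, the following is only a sketch of an alternative, assuming the source is a plain comma-separated UTF-8 file with a BOM: reading it with `encoding="utf-8-sig"` lets pandas drop the mark and split the columns directly. The sample bytes are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample: two rows of a UTF-8 CSV that starts with a byte-order mark (BOM)
raw_bytes = (
    b"\xef\xbb\xbfYear,Neighbourhood_Name,Number\n"
    b"2020,el Raval,2857\n"
    b"2020,el Barri Gotic,1689\n"
)

# 'utf-8-sig' strips the BOM, so the first column is named 'Year' and no slicing is needed
df_sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="utf-8-sig")
print(df_sample.columns.tolist())
```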


In [60]:
df_pop_bcn = df_pop_bcn.astype({"Year": int, "District_Code": int, "Neighbourhood_Code": int, "Number": int})
df_pop_bcn.sample(5)

Unnamed: 0,Year,District_Code,District_Name,Neighbourhood_Code,Neighbourhood_Name,Sex,Year_registered,Number
592,2020,2,Eixample,9,la Nova Esquerra de l'Eixample,Home,Mes de 15 anys,12709
694,2020,7,Horta-Guinardo,38,la Teixonera,Home,No consta,5
452,2020,3,Sants-Montjuïc,15,Hostafrancs,Home,D'1 a 5 anys,2041
367,2020,1,Ciutat Vella,3,la Barceloneta,Home,Menys d'1 any,1110
149,2020,1,Ciutat Vella,4,Sant Pere Santa Caterina i la Ribera,Dona,De 6 a 15 anys,2251


In [61]:
df_pop_bcn = df_pop_bcn.groupby(['Neighbourhood_Name'])['Number'].sum().reset_index()
df_pop_bcn

Unnamed: 0,Neighbourhood_Name,Number
0,Baro de Viver,2625
1,Can Baro,9331
2,Can Peguera,2234
3,Canyelles,6869
4,Ciutat Meridiana,11091
5,Diagonal Mar i el Front Maritim del Poblenou,13526
6,Horta,28363
7,Hostafrancs,16203
8,Montbau,5225
9,Navas,22457


In [62]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,"Year,District_Code,District_Name,Neighbourhood_Code,Neighbourhood_Name,Area (ha),Latitude,Longitude"
0,"2019,1,Ciutat Vella,1,el Raval,110,41.38,2.16861"
1,"2019,1,Ciutat Vella,2,el Barri Gotic,81.6,41.3..."
2,"2019,1,Ciutat Vella,3,la Barceloneta,109.5,41...."
3,"2019,1,Ciutat Vella,4,Sant Pere Santa Caterina..."
4,"2019,2,Eixample,5,el Fort Pienc,92.9,41.395675..."


In [63]:
# As before, the whole CSV was read into a single column (this header has no BOM to strip)
df_geo_bcn_cols_names = df_data_1.columns.values[0].split(",")

# Each row holds one comma-separated string: split it into its fields
df_geo_bcn_data = []
for ix, rw in df_data_1.iterrows():
    df_geo_bcn_data.append(rw.values[0].split(","))
df_geo_bcn = pd.DataFrame(columns = df_geo_bcn_cols_names, data = df_geo_bcn_data)
df_geo_bcn.head(10)

Unnamed: 0,Year,District_Code,District_Name,Neighbourhood_Code,Neighbourhood_Name,Area (ha),Latitude,Longitude
0,2019,1,Ciutat Vella,1,el Raval,110.0,41.38,2.16861
1,2019,1,Ciutat Vella,2,el Barri Gotic,81.6,41.382778,2.176944
2,2019,1,Ciutat Vella,3,la Barceloneta,109.5,41.37944,2.18917
3,2019,1,Ciutat Vella,4,Sant Pere Santa Caterina i la Ribera,111.0,41.3847,2.1826
4,2019,2,Eixample,5,el Fort Pienc,92.9,41.395675,2.183703
5,2019,2,Eixample,6,la Sagrada Familia,104.2,41.403561,2.174347
6,2019,2,Eixample,7,la Dreta de l'Eixample,212.0,41.395278,2.166667
7,2019,2,Eixample,8,l'Antiga Esquerra de l'Eixample,122.8,41.390061,2.155061
8,2019,2,Eixample,9,la Nova Esquerra de l'Eixample,134.1,41.383389,2.149
9,2019,2,Eixample,10,Sant Antoni,80.4,41.37801,2.15949


In [64]:
df_geo_bcn = df_geo_bcn.astype({"Year": int, "District_Code": int, "Neighbourhood_Code": int, "Area (ha)": float, "Latitude": float, "Longitude": float})
df_geo_bcn.sample(5)

Unnamed: 0,Year,District_Code,District_Name,Neighbourhood_Code,Neighbourhood_Name,Area (ha),Latitude,Longitude
49,2019,8,Nou Barris,50,les Roquetes,64.3,41.448644,2.172031
33,2019,7,Horta-Guinardo,34,Can Baro,38.4,41.416384,2.162356
8,2019,2,Eixample,9,la Nova Esquerra de l'Eixample,134.1,41.383389,2.149
13,2019,3,Sants-Montjuic,14,la Font de la Guatlla,29.7,41.369681,2.144811
16,2019,3,Sants-Montjuic,17,Sants - Badal,41.5,41.375278,2.126667


In [65]:
df_nhs_bcn = df_pop_bcn.merge(df_geo_bcn, on = 'Neighbourhood_Name')
df_nhs_bcn

Unnamed: 0,Neighbourhood_Name,Number,Year,District_Code,District_Name,Neighbourhood_Code,Area (ha),Latitude,Longitude
0,Baro de Viver,2625,2019,9,Sant Andreu,58,23.0,41.447906,2.200742
1,Can Baro,9331,2019,7,Horta-Guinardo,34,38.4,41.416384,2.162356
2,Can Peguera,2234,2019,8,Nou Barris,47,11.9,41.4349,2.166188
3,Canyelles,6869,2019,8,Nou Barris,49,79.0,41.442684,2.166015
4,Ciutat Meridiana,11091,2019,8,Nou Barris,55,37.7,41.460914,2.174433
5,Diagonal Mar i el Front Maritim del Poblenou,13526,2019,10,Sant Marti,69,120.3,41.4096,2.216306
6,Horta,28363,2019,7,Horta-Guinardo,43,307.0,41.429503,2.1601
7,Hostafrancs,16203,2019,3,Sants-Montjuic,15,41.0,41.375556,2.143056
8,Montbau,5225,2019,7,Horta-Guinardo,40,205.5,41.430925,2.142967
9,Navas,22457,2019,9,Sant Andreu,63,42.4,41.415744,2.1869
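
`merge` performs an inner join by default, so a neighbourhood whose name is spelled differently in the two datasets is silently dropped from the result. A minimal sketch of how one might check for that, using made-up rows and pandas' `indicator` option (the real frames would be `df_pop_bcn` and `df_geo_bcn`):

```python
import pandas as pd

# Made-up rows for illustration; note the different capitalisation of the last name
pop = pd.DataFrame({"Neighbourhood_Name": ["Horta", "Navas", "el Raval"],
                    "Number": [300, 200, 100]})
geo = pd.DataFrame({"Neighbourhood_Name": ["Horta", "Navas", "El Raval"],
                    "Area (ha)": [30.0, 20.0, 10.0]})

# An outer join with indicator=True adds a '_merge' column telling where each row came from
check = pop.merge(geo, on="Neighbourhood_Name", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", "Neighbourhood_Name"].tolist()
print(sorted(unmatched))
```

An empty list would mean that every neighbourhood was matched in both datasets.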


In [66]:
df_nhs_bcn['Population Density'] = (df_nhs_bcn['Number'] / df_nhs_bcn['Area (ha)']).astype('float')
df_nhs_bcn

Unnamed: 0,Neighbourhood_Name,Number,Year,District_Code,District_Name,Neighbourhood_Code,Area (ha),Latitude,Longitude,Population Density
0,Baro de Viver,2625,2019,9,Sant Andreu,58,23.0,41.447906,2.200742,114.130435
1,Can Baro,9331,2019,7,Horta-Guinardo,34,38.4,41.416384,2.162356,242.994792
2,Can Peguera,2234,2019,8,Nou Barris,47,11.9,41.4349,2.166188,187.731092
3,Canyelles,6869,2019,8,Nou Barris,49,79.0,41.442684,2.166015,86.949367
4,Ciutat Meridiana,11091,2019,8,Nou Barris,55,37.7,41.460914,2.174433,294.190981
5,Diagonal Mar i el Front Maritim del Poblenou,13526,2019,10,Sant Marti,69,120.3,41.4096,2.216306,112.435578
6,Horta,28363,2019,7,Horta-Guinardo,43,307.0,41.429503,2.1601,92.387622
7,Hostafrancs,16203,2019,3,Sants-Montjuic,15,41.0,41.375556,2.143056,395.195122
8,Montbau,5225,2019,7,Horta-Guinardo,40,205.5,41.430925,2.142967,25.425791
9,Navas,22457,2019,9,Sant Andreu,63,42.4,41.415744,2.1869,529.646226


#### We use the geopy library to get the latitude and longitude values of Barcelona.


In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>bcn_explorer</em>, as shown below.


In [67]:
address = 'Barcelona, ES'
geolocator = Nominatim(user_agent = "bcn_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Barcelona are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Barcelona are 41.3828939, 2.1774322.


#### Create a map of Barcelona with neighborhoods superimposed on top.


In [68]:
# create map of Barcelona using latitude and longitude values
map_bcn = folium.Map(location = [latitude, longitude], zoom_start = 10)

# add markers to map
for lat, lng, district, neighborhood in zip(df_nhs_bcn['Latitude'], df_nhs_bcn['Longitude'], df_nhs_bcn['District_Name'], df_nhs_bcn['Neighbourhood_Name']):
    label = '{}, {}'.format(neighborhood, district)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7).add_to(map_bcn)
map_bcn

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.


#### Define Foursquare Credentials and Version


In [69]:
import os

# Read the credentials from environment variables rather than hard-coding them in the notebook.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are our own choice.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20201212' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Credentials loaded.')

Credentials loaded.


From the Foursquare lab in the previous module, we know that all the information is in the _items_ key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.


In [70]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

<a id='item2'></a>


## 2. Explore Gyms/Fitness Centers, Communications and Services by Neighborhoods in Barcelona


In [71]:
GYM_CODE = "4bf58dd8d48988d175941735"
TRANSPORTATION_CODES = "4bf58dd8d48988d1fd931735,4bf58dd8d48988d129951735,52f2ab2ebcbc57f1066b8b51,4e4c9077bd41f78e849722f9"
SERVICES_CODES = "4bf58dd8d48988d1f6941735,4bf58dd8d48988d1f9941735,4bf58dd8d48988d104941735,4bf58dd8d48988d12e941735"

#### Let's create a function that retrieves venue data from the Foursquare API for every neighbourhood in Barcelona

In [76]:
def getNearbyVenues(names, latitudes, longitudes, categories, radius = 600):
    
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng,
            categories,
            radius, 
            LIMIT)
            
        # make the GET request once and keep the parsed JSON (every request counts against the API quota)
        response = requests.get(url).json()
        print(response)
        results = response["response"]['groups'][0]['items']
        
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

#### Now we write the code to run the above function on each neighborhood and create three new dataframes called _df_bcn_gyms_, _df_bcn_comms_ and _df_bcn_servs_.


In [77]:
df_bcn_gyms = getNearbyVenues(names = df_nhs_bcn['Neighbourhood_Name'],
                                   latitudes = df_nhs_bcn['Latitude'],
                                   longitudes = df_nhs_bcn['Longitude'],
                                   categories = GYM_CODE
                                  )

Baro de Viver
{'meta': {'code': 429, 'errorType': 'quota_exceeded', 'errorDetail': 'Quota exceeded', 'requestId': '6016b87eac754233f6674cb3'}, 'response': {}}


KeyError: 'groups'
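
The `KeyError` above is a consequence of the `quota_exceeded` response: when the API reports an error, `response` is an empty dictionary, so indexing `['groups']` fails. A minimal sketch of a helper that surfaces the API error instead, assuming the response layout printed above (`meta.code`, `meta.errorType`, `response.groups[0].items`) and that a successful call reports code 200. `extract_items` is a hypothetical name, not part of the Foursquare API.

```python
def extract_items(payload):
    """Return the venue items of an explore response, or raise a readable error."""
    meta = payload.get("meta", {})
    if meta.get("code") != 200:
        raise RuntimeError("Foursquare request failed: {} ({})".format(
            meta.get("errorType"), meta.get("code")))
    groups = payload.get("response", {}).get("groups", [])
    # An empty 'groups' list means that no venues were returned
    return groups[0]["items"] if groups else []


# The error payload printed above
quota_payload = {"meta": {"code": 429, "errorType": "quota_exceeded",
                          "errorDetail": "Quota exceeded"},
                 "response": {}}
try:
    extract_items(quota_payload)
except RuntimeError as err:
    message = str(err)
    print(message)
```

Inside `getNearbyVenues`, such a helper could replace the direct indexing of the response, so that a quota problem stops the loop with an explicit message.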

In [None]:
df_bcn_comms = getNearbyVenues(names = df_nhs_bcn['Neighbourhood_Name'],
                                   latitudes = df_nhs_bcn['Latitude'],
                                   longitudes = df_nhs_bcn['Longitude'],
                                   categories = TRANSPORTATION_CODES
                                  )

In [None]:
df_bcn_servs = getNearbyVenues(names = df_nhs_bcn['Neighbourhood_Name'],
                                   latitudes = df_nhs_bcn['Latitude'],
                                   longitudes = df_nhs_bcn['Longitude'],
                                   categories = SERVICES_CODES
                                  )

#### Let's check the size of the resulting dataframes


In [None]:
print(df_bcn_gyms.shape)
print(df_bcn_comms.shape)
print(df_bcn_servs.shape)

In [None]:
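# display() is needed to show all three samples: a cell only renders its last expression
display(df_bcn_gyms.sample(3))
display(df_bcn_comms.sample(3))
display(df_bcn_servs.sample(3))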

Let's check how many venues were returned for each neighborhood


In [None]:
df_bcn_gyms.groupby('Neighborhood').count()
#df_bcn_comms.groupby('Neighborhood').count()
#df_bcn_servs.groupby('Neighborhood').count()

In [None]:
df_nhs_bcn.sample(10)

In [None]:
df_nhs_bcn['Gyms/Pop Density'] = np.nan
df_nhs_bcn['Communications Density'] = np.nan
df_nhs_bcn['Services Density'] = np.nan

# number of venues of each type found per neighbourhood
gym_counts = df_bcn_gyms.groupby('Neighborhood')['Venue'].count()
comm_counts = df_bcn_comms.groupby('Neighborhood')['Venue'].count()
serv_counts = df_bcn_servs.groupby('Neighborhood')['Venue'].count()

for ix, rw in df_nhs_bcn.iterrows():
    name = rw['Neighbourhood_Name']
    # neighbourhoods with no returned venues get a count of 0
    n_gyms = gym_counts.get(name, 0)
    n_comms = comm_counts.get(name, 0)
    n_servs = serv_counts.get(name, 0)
    # gyms are related to the population density; communications and services to the area
    df_nhs_bcn.loc[ix, 'Gyms/Pop Density'] = float(n_gyms / rw['Population Density'])
    df_nhs_bcn.loc[ix, 'Communications Density'] = float(n_comms / rw['Area (ha)'])
    df_nhs_bcn.loc[ix, 'Services Density'] = float(n_servs / rw['Area (ha)'])
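
The same features can also be computed without an explicit loop. A minimal sketch with made-up venue counts, which maps the per-neighbourhood counts onto the neighbourhood table and fills the neighbourhoods without venues with 0 (only the gym feature is shown; the other two would divide by `Area (ha)` instead):

```python
import pandas as pd

# Made-up data for illustration
nhs = pd.DataFrame({"Neighbourhood_Name": ["A", "B", "C"],
                    "Area (ha)": [10.0, 20.0, 40.0],
                    "Population Density": [100.0, 200.0, 50.0]})
gyms = pd.DataFrame({"Neighborhood": ["A", "A", "B"],
                     "Venue": ["Gym 1", "Gym 2", "Gym 3"]})

# Venues per neighbourhood; neighbourhoods without venues are absent from the index
gym_counts = gyms.groupby("Neighborhood")["Venue"].count()

# map() aligns the counts by name, fillna(0) covers neighbourhoods with no venues
counts = nhs["Neighbourhood_Name"].map(gym_counts).fillna(0)
nhs["Gyms/Pop Density"] = counts / nhs["Population Density"]
print(nhs["Gyms/Pop Density"].tolist())
```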

#### Let's find out how many unique categories can be curated from all the returned venues


In [None]:
print('There are {} unique Gym/Fitness Center categories.'.format(len(df_bcn_gyms['Venue Category'].unique())))
print('There are {} unique Transportation/Communications categories.'.format(len(df_bcn_comms['Venue Category'].unique())))
print('There are {} unique Services categories.'.format(len(df_bcn_servs['Venue Category'].unique())))

<a id='item3'></a>


## 3. Standardization and Feature Weighting


In [None]:
df_nhs_bcn.sample(10)

We standardize the three features of the 3-D K-Means clustering:

In [None]:
from sklearn.preprocessing import StandardScaler

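# select the three clustering features by name and convert them to a float array
feature_cols = ['Gyms/Pop Density', 'Communications Density', 'Services Density']
X = df_nhs_bcn[feature_cols].to_numpy(dtype = float)
X = np.nan_to_num(X)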
cluster_dataset = StandardScaler().fit_transform(X)
cluster_dataset
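
Standardization rescales each column to zero mean and unit variance, so that no feature dominates the distance computation merely because of its units. A small sketch with made-up values on two very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: three rows, two features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

X_scaled = StandardScaler().fit_transform(X_demo)

# After scaling, each column has mean 0 and (population) standard deviation 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```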

We assign the gym/population density feature double the weight of the other two features:

In [None]:
cluster_dataset[:, 0] = cluster_dataset[:, 0] * 0.50
cluster_dataset[:, 1] = cluster_dataset[:, 1] * 0.25
cluster_dataset[:, 2] = cluster_dataset[:, 2] * 0.25
cluster_dataset
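
Note that _k_-means works with squared Euclidean distances, so multiplying a standardized column by a factor w scales that column's contribution to the squared distance by w². The sketch below, with two made-up points, illustrates this for the factors 0.50 and 0.25 used above: a unit difference in the first feature contributes four times as much to the squared distance as a unit difference in either of the other two.

```python
import numpy as np

weights = np.array([0.50, 0.25, 0.25])

# Two made-up standardized points that differ by one unit in every feature
a = np.array([0.0, 0.0, 0.0])
b = np.array([1.0, 1.0, 1.0])

# Per-feature contribution to the squared Euclidean distance after weighting
contrib = ((a - b) * weights) ** 2
ratio = contrib[0] / contrib[1]
print(contrib, ratio)
```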

<a id='item4'></a>


## 4. Cluster Neighborhoods


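We use the elbow method to choose the number of clusters: we fit _k_-means for _k_ = 1 to 11 and plot the distortion (the inertia, i.e. the within-cluster sum of squared distances) against _k_.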

In [None]:
distortions = []
K = range(1, 12)
for k in K:
    kmeanModel = KMeans(n_clusters = k, n_init = 10, random_state = 0) # n_init set explicitly to avoid version-dependent defaults
    kmeanModel.fit(cluster_dataset)
    distortions.append(kmeanModel.inertia_)

In [None]:
plt.figure(figsize = (16, 8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
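
As an illustration of what such a plot is expected to show, here is a minimal sketch on synthetic data (three made-up, well-separated blobs, not the Barcelona features): the inertia drops sharply until _k_ reaches the number of groups and flattens afterwards.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Inertia (within-cluster sum of squared distances) for k = 1..6
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

# Reduction in inertia gained by each additional cluster
drops = -np.diff(inertias)
print(np.round(inertias, 1))
```

In this synthetic case the reduction gained by going from 2 to 3 clusters is far larger than the one gained by going from 3 to 4, which is the "elbow".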

Run _k_-means to cluster the neighborhoods into 4 clusters.


In [None]:
# set number of clusters
kclusters = 4

# run k-means clustering
kmeans = KMeans(n_clusters = kclusters, n_init = 10, random_state = 0).fit(cluster_dataset)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

In [None]:
df_nhs_bcn.head()

In [None]:
# add clustering labels
df_nhs_bcn['Cluster Labels'] = kmeans.labels_

In [None]:
df_nhs_bcn[45:50]

Finally, let's visualize the resulting clusters


In [None]:
# create map
map_clusters = folium.Map(location = [latitude, longitude], zoom_start = 11)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_nhs_bcn['Latitude'], df_nhs_bcn['Longitude'], df_nhs_bcn['Neighbourhood_Name'], df_nhs_bcn['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='item5'></a>


## 5. Examine Clusters


Now, we examine each cluster and determine which of the three features (gym/population density, communications density and services density) distinguish it. Based on these defining features, we can then assign a name to each cluster.

#### Cluster 1


In [None]:
df_nhs_bcn.loc[df_nhs_bcn['Cluster Labels'] == 0]

#### Cluster 2


In [None]:
df_nhs_bcn.loc[df_nhs_bcn['Cluster Labels'] == 1]

#### Cluster 3


In [None]:
df_nhs_bcn.loc[df_nhs_bcn['Cluster Labels'] == 2]

#### Cluster 4


In [None]:
df_nhs_bcn.loc[df_nhs_bcn['Cluster Labels'] == 3]
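
To compare the clusters at a glance, the mean of each clustering feature per cluster can be tabulated. A minimal sketch with made-up values; on the real data the frame would be `df_nhs_bcn`, restricted to the `Cluster Labels` column and the three density features:

```python
import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame({
    "Cluster Labels": [0, 0, 1, 1],
    "Gyms/Pop Density": [0.10, 0.30, 0.00, 0.02],
    "Services Density": [1.0, 3.0, 0.5, 1.5],
})

# Mean of each feature within each cluster
summary = demo.groupby("Cluster Labels").mean()
print(summary)
```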