## IBM Data Science Professional Certificate
### Capstone Project - The Battle of Neighborhoods

Damien Azzopardi - July 2021

<h2>Table of Contents</h2>
<br>
<ol>
    <li><a href="introduction"><b>Introduction</b></a>
<br>
<br>
    <li><a href="data"><b>Data</b></a></li>
        <ul>
            <li><a href="neighborhoods_and_coordinates">Neighborhoods & Coordinates</a>
            <li><a href="districts">Districts</a>    
            <li><a href="venues">Venues</a>     
        </ul>
<br>
    <li><a href="#data_manipulation"><b>Data Manipulation</b></a></li>
        <ul>
            <li><a href="1">1</a>
            <li><a href="2">2</a>    
            <li><a href="3">3</a>     
        </ul>
<hr>

<h2 id="introduction">Introduction</h2>

**The Green Alternative** is a group of vegetarian restaurants, which started operating in Madrid, Spain, in 2010. We are currently running six different restaurants across different neighborhoods in Madrid, oriented towards locals. As our group is becoming successful in the spanish capital, this year, we would like to expand our operations and open a vegetarian restaurant in Barcelona.

The question we are trying to answer is; **what is the best neighborhood to open a vegetarian restaurant in Barcelona?**

After running a market research and looking into the data collected from our six current restaurants in Madrid, we found that our most successful locations are in neighborhoods which:
- Are close to a **metro** or **train station**, where the flow of people is high.
- Have a **park** or **garden** closeby, where our customers like to have lunch.
- Have a **gym** closeby, as most of our customers come for lunch or dinner after a training at the gym.

Knowing this, we'll leverage the Foursquare location data in order to calculate the density of metro and train stations, parks, gardens, and gyms, for each neighborhood in Barcelona, and pick the one with higher density of selected venues to open our first vegetarian restaurant in the city of Barcelona.

<h2 id="data">Data</h2>

The data we will be using to help us answer our question comes from the following sources.

<h3 id="neighborhoods_and_coordinates">Neighborhoods & Coordinates</h3>

<h4>Metabolism of Cities</h4>

The full list of Barcelona's neighborhoods, along with their corresponding coordinates is available in [this](https://data.metabolismofcities.org/library/maps/577245/view/) page (*metabolismofcities.org*). It consists of a table with two rows, **Neighborhoods** and **Coordinates**. We will scrap the table containing the list of neighborhoods and coordinates directly in this workbook.


<h3 id="districts">Districts</h3>

<h4>Wikipedia</h4>

The full list of Barcelona's districts, along with their corresponding neirhborhoods is available in [this](https://en.wikipedia.org/wiki/Districts_of_Barcelona) page (*wikipedia.org*). We will export a CSV containing two rows, **Districts** and  **Neighborhoods**, that we will read directly in this workbook, and join it with the first dataset containing the **Neighborhoods** and **Coordinates**.


<h3 id="venues">Venues</h3>

<h4>Foursquare</h4>

We will leverage the Foursquare location data in order to calculate the density venues we have selected for the analysis. We will join it with the first two datasets containing the **District**, **Neighborhoods** and **Coordinates**.

<h2 id="xxx">Segmenting and Clustering Neighborhoods in Barcelona</h2>

### Data extraction and manipulation

In [1]:
# load libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim
from pandas.io.json import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

<h3 id="xxx">1. Scrap Barcelona's neighborhoods and coordinates</h3>

The full list of Barcelona's neighborhoods, along with their corresponding coordinates is available in [this Metabolism of Cities page](https://data.metabolismofcities.org/library/maps/577245/view/).

In [2]:
# scrap Barcelona's neighborhoods and coordinates table
url = 'https://data.metabolismofcities.org/library/maps/577245/view/'

r = requests.get(url)
html = r.text

soup = BeautifulSoup(html, 'lxml')
table = soup.find('table')
rows = table.find_all('tr')
data = []
for row in rows[1:]:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

# convert to dataframe
df = pd.DataFrame(data)

# rename columns
df.columns = ['Neighborhood', 'Coordinates']

df.head()

Unnamed: 0,Neighborhood,Coordinates
0,Baró de Viver,"[41.44581467347341, 2.19899775842406]"
1,Can Baró,"[41.4167603624773, 2.1623865539676492]"
2,Can Peguera,"[41.43484212038238, 2.1664501320817235]"
3,Canyelles,"[41.445032990983854, 2.1634504252403164]"
4,Ciutat Meridiana,"[41.46120773644666, 2.1748476502321963]"


In [3]:
# split the 'Coordinates' column into two new columns 'Latitude' and 'Longitude'
df[['Latitude','Longitude']] = df.Coordinates.str.split(', ', expand = True)

df.head()

Unnamed: 0,Neighborhood,Coordinates,Latitude,Longitude
0,Baró de Viver,"[41.44581467347341, 2.19899775842406]",[41.44581467347341,2.19899775842406]
1,Can Baró,"[41.4167603624773, 2.1623865539676492]",[41.4167603624773,2.1623865539676492]
2,Can Peguera,"[41.43484212038238, 2.1664501320817235]",[41.43484212038238,2.1664501320817235]
3,Canyelles,"[41.445032990983854, 2.1634504252403164]",[41.445032990983854,2.1634504252403164]
4,Ciutat Meridiana,"[41.46120773644666, 2.1748476502321963]",[41.46120773644666,2.1748476502321963]


In [4]:
# drop the 'Coordinates' column
df_bcn = df.drop(['Coordinates'], axis = 1)

df_bcn.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Baró de Viver,[41.44581467347341,2.19899775842406]
1,Can Baró,[41.4167603624773,2.1623865539676492]
2,Can Peguera,[41.43484212038238,2.1664501320817235]
3,Canyelles,[41.445032990983854,2.1634504252403164]
4,Ciutat Meridiana,[41.46120773644666,2.1748476502321963]


In [5]:
# special characters to remove from the dataframe
spec_chars = ["[","]"]

# removing special characters from the 'Latitude' column
for char in spec_chars:
    df_bcn['Latitude'] = df_bcn['Latitude'].str.replace(char,'', regex=True)

# removing special characters from the 'Longitude column'    
for char in spec_chars:
    df_bcn['Longitude'] = df_bcn['Longitude'].str.replace(char,'', regex=True)

df_bcn.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Baró de Viver,41.44581467347341,2.19899775842406
1,Can Baró,41.4167603624773,2.162386553967649
2,Can Peguera,41.43484212038238,2.1664501320817235
3,Canyelles,41.445032990983854,2.1634504252403164
4,Ciutat Meridiana,41.46120773644666,2.1748476502321963


In [6]:
# check column type
df_bcn.dtypes

Neighborhood    object
Latitude        object
Longitude       object
dtype: object

In [7]:
# change column type 
df_bcn = df_bcn.astype({"Neighborhood": str, "Latitude": float, "Longitude": float})

# check column type
df_bcn.dtypes

Neighborhood     object
Latitude        float64
Longitude       float64
dtype: object

In [8]:
# import dataset with Districts
df_bcn_districts = pd.read_csv("/Users/damienazzopardi/Documents/GitHub/Coursera_Capstone/Districts_Barcelona.csv")
df_bcn_districts.head()

Unnamed: 0,Neighborhoods,District
0,Baró de Viver,Sant Andreu
1,Can Baró,Horta-Guinardó
2,Can Peguera,Nou Barris
3,Canyelles,Nou Barris
4,Ciutat Meridiana,Nou Barris


In [9]:
# merge both dataframes into one
df_barcelona = pd.merge(df_bcn, df_bcn_districts, how = 'left', left_on = 'Neighborhood', right_on = 'Neighborhoods')
df_barcelona.drop("Neighborhoods", axis = 1, inplace = True)
df_barcelona.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,District
0,Baró de Viver,41.445815,2.198998,Sant Andreu
1,Can Baró,41.41676,2.162387,Horta-Guinardó
2,Can Peguera,41.434842,2.16645,Nou Barris
3,Canyelles,41.445033,2.16345,Nou Barris
4,Ciutat Meridiana,41.461208,2.174848,Nou Barris


In [10]:
address = 'Barcelona, Spain'

geolocator = Nominatim(user_agent="barcelona_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Barcelona are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Barcelona are 41.3828939, 2.1774322.


In [11]:
# create map of Toronto using latitude and longitude values
map_barcelona = folium.Map(location=[latitude, longitude], zoom_start=12)

# add neighborhoods markers to map
for lat, lng, district, neighborhoods in zip(df_barcelona['Latitude'], df_barcelona['Longitude'], df_barcelona['District'], df_barcelona['Neighborhood']):
    label = '{}, {}'.format(neighborhoods, district)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_barcelona)  
    
map_barcelona

Define Foursquare credentials and version.

In [12]:
CLIENT_ID = 'RIYT5OPWMC205ZAGHGCHANWVHPBEQ2HJSTIRKLYXIAY0AWOJ'
CLIENT_SECRET = 'CTFKGNQ13Y5ISMAXCEZL1ZB40WHZB1IXG0E44ZFBRLSXKCMD'
VERSION = '20180605'
LIMIT = 100

Explore the first neighborhood in **Barcelona**.

In [13]:
# get the neighborhood name
df_barcelona.loc[0, 'Neighborhood']

'Baró de Viver'

In [14]:
# get the neighborhood's latitude and longitude values
neighborhood_latitude = df_barcelona.loc[0, 'Latitude']
neighborhood_longitude = df_barcelona.loc[0, 'Longitude']

neighborhood_name = df_barcelona.loc[0, 'Neighborhood']

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Baró de Viver are 41.44581467347341, 2.19899775842406.


In [15]:
# get the top 50 venues that are in Baró de Viver within a radius of 500 meters
LIMIT = 50
radius = 500

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

# get the results in a json format
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '60fc1ff524e7595c13a29f9a'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Baró de Viver',
  'headerFullLocation': 'Baró de Viver, Barcelona',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 11,
  'suggestedBounds': {'ne': {'lat': 41.45031467797341,
    'lng': 2.204989901077587},
   'sw': {'lat': 41.441314668973405, 'lng': 2.1930056157705327}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e272929e4cd60a2c62cd3e1',
       'name': 'Ibericus',
       'location': {'address': 'La Maquinista',
        'lat': 41.44150464406915,
        'lng': 2.197811752873911,
        'labeledLatLngs': [{'label': 'display',
          'lat': 41.44

In [16]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [17]:
# clean the json and structure it into a pandas dataframe
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(100)

Unnamed: 0,name,categories,lat,lng
0,Ibericus,Food,41.441505,2.197812
1,Pasteleria Buenavista,Dessert Shop,41.442172,2.199977
2,A Loja do Gato Preto,Furniture / Home Store,41.441689,2.197742
3,Restaurante Avenida II,Restaurant,41.443,2.202642
4,Enrique Tomás - C.C. La Maquinista,Spanish Restaurant,41.441421,2.197978
5,Enrique Tomás - C.C. La Maquinista Local 004,Deli / Bodega,41.442171,2.197763
6,Wok You,Asian Restaurant,41.442989,2.200887
7,Parc del Nus de la Trinitat,Park,41.449112,2.195828
8,Mister Guau,Pet Store,41.441843,2.197756
9,Drim,Toy / Game Store,41.441612,2.19779


In [18]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

11 venues were returned by Foursquare.


In [19]:
# function to repeat the same process to all neighborhoods in Barcelona
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [20]:
barcelona_venues = getNearbyVenues(names=df_barcelona['Neighborhood'],
                                   latitudes=df_barcelona['Latitude'],
                                   longitudes=df_barcelona['Longitude']
                                  )

Baró de Viver
Can Baró
Can Peguera
Canyelles
Ciutat Meridiana
Diagonal Mar i el Front Marítim del Poblenou
Horta
Hostafrancs
Montbau
Navas
Pedralbes
Porta
Provençals del Poblenou
Sant Andreu
Sant Antoni
Sant Genís dels Agudells
Sant Gervasi - Galvany
Sant Gervasi - la Bonanova
Sant Martí de Provençals
Sant Pere, Santa Caterina i la Ribera
Sants
Sants - Badal
Sarrià
Torre Baró
Vallbona
Vallcarca i els Penitents
Vallvidrera, el Tibidabo i les Planes
Verdun
Vilapicina i la Torre Llobeta
el Baix Guinardó
el Barri Gòtic
el Besòs i el Maresme
el Bon Pastor
el Camp d'en Grassot i Gràcia Nova
el Camp de l'Arpa del Clot
el Carmel
el Clot
el Coll
el Congrés i els Indians
el Fort Pienc
el Guinardó
el Parc i la Llacuna del Poblenou
el Poble-sec
el Poblenou
el Putxet i el Farró
el Raval
el Turó de la Peira
l'Antiga Esquerra de l'Eixample
la Barceloneta
la Bordeta
la Clota
la Dreta de l'Eixample
la Font d'en Fargues
la Font de la Guatlla
la Guineueta
la Marina de Port
la Marina del Prat Vermell
la M

In [21]:
barcelona_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Baró de Viver,41.445815,2.198998,Ibericus,41.441505,2.197812,Food
1,Baró de Viver,41.445815,2.198998,Pasteleria Buenavista,41.442172,2.199977,Dessert Shop
2,Baró de Viver,41.445815,2.198998,A Loja do Gato Preto,41.441689,2.197742,Furniture / Home Store
3,Baró de Viver,41.445815,2.198998,Restaurante Avenida II,41.443,2.202642,Restaurant
4,Baró de Viver,41.445815,2.198998,Enrique Tomás - C.C. La Maquinista,41.441421,2.197978,Spanish Restaurant


In [22]:
print('There are {} uniques categories.'.format(len(barcelona_venues['Venue Category'].unique())))

There are 262 uniques categories.


In [23]:
# one hot encoding
df_barcelona_onehot = pd.get_dummies(barcelona_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
df_barcelona_onehot['Neighborhood'] = barcelona_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [df_barcelona_onehot.columns[-1]] + list(df_barcelona_onehot.columns[:-1])
df_barcelona_onehot = df_barcelona_onehot[fixed_columns]

df_barcelona_onehot.head(100)

Unnamed: 0,Yoga Studio,Accessories Store,African Restaurant,American Restaurant,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Vacation Rental,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
96,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
97,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
# group rows by neighborhood and y taking the mmean of the frequency of occurence of each category
df_barcelona_grouped = df_barcelona_onehot.groupby('Neighborhood').mean().reset_index()
df_barcelona_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,African Restaurant,American Restaurant,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,...,Vacation Rental,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Baró de Viver,0.00,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,...,0.0,0.00,0.00,0.0,0.00,0.0,0.00,0.00,0.0,0.0
1,Can Baró,0.00,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,...,0.0,0.00,0.00,0.0,0.00,0.0,0.00,0.00,0.0,0.0
2,Can Peguera,0.00,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,...,0.0,0.00,0.00,0.0,0.00,0.0,0.00,0.00,0.0,0.0
3,Canyelles,0.00,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,...,0.0,0.00,0.00,0.0,0.00,0.0,0.00,0.00,0.0,0.0
4,Ciutat Meridiana,0.00,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,...,0.0,0.00,0.00,0.0,0.00,0.0,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,la Vila Olímpica del Poblenou,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,...,0.0,0.02,0.00,0.0,0.00,0.0,0.00,0.00,0.0,0.0
68,la Vila de Gràcia,0.00,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,...,0.0,0.02,0.02,0.0,0.02,0.0,0.02,0.02,0.0,0.0
69,les Corts,0.00,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,...,0.0,0.02,0.00,0.0,0.00,0.0,0.00,0.00,0.0,0.0
70,les Roquetes,0.00,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,...,0.0,0.00,0.00,0.0,0.00,0.0,0.00,0.00,0.0,0.0


In [25]:
df_barcelona_grouped_3 = df_barcelona_grouped[['Neighborhood','Park','Garden','Metro Station','Train Station', 'Gym']]

df_barcelona_grouped_3 = df_barcelona_grouped_3.astype({"Park": float, "Garden": float, "Metro Station": float, "Train Station": float, 'Gym': float})

df_barcelona_grouped_3["Sum"] = df_barcelona_grouped_3.sum(axis=1)

df_barcelona_grouped_3 = pd.DataFrame(df_barcelona_grouped_3)
df_barcelona_grouped_3.head(20)

Unnamed: 0,Neighborhood,Park,Garden,Metro Station,Train Station,Gym,Sum
0,Baró de Viver,0.090909,0.0,0.0,0.0,0.0,0.090909
1,Can Baró,0.047619,0.0,0.0,0.0,0.0,0.047619
2,Can Peguera,0.181818,0.0,0.0,0.0,0.0,0.181818
3,Canyelles,0.0,0.0,0.166667,0.0,0.0,0.166667
4,Ciutat Meridiana,0.111111,0.0,0.333333,0.111111,0.0,0.555556
5,Diagonal Mar i el Front Marítim del Poblenou,0.035714,0.0,0.0,0.0,0.0,0.035714
6,Horta,0.0,0.1,0.1,0.0,0.0,0.2
7,Hostafrancs,0.030303,0.0,0.0,0.0,0.030303,0.060606
8,Montbau,0.2,0.0,0.0,0.0,0.0,0.2
9,Navas,0.0,0.0,0.0,0.027027,0.0,0.027027


In [26]:
df_barcelona_grouped_4 = df_barcelona_grouped_3[['Neighborhood','Sum']]
df_barcelona_grouped_4

Unnamed: 0,Neighborhood,Sum
0,Baró de Viver,0.090909
1,Can Baró,0.047619
2,Can Peguera,0.181818
3,Canyelles,0.166667
4,Ciutat Meridiana,0.555556
...,...,...
67,la Vila Olímpica del Poblenou,0.060000
68,la Vila de Gràcia,0.000000
69,les Corts,0.020000
70,les Roquetes,0.090909


In [27]:
# print each neighborhood along with the top 5 most common venues
num_top_venues = 10

for hood in df_barcelona_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = df_barcelona_grouped[df_barcelona_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Baró de Viver----
                    venue  freq
0        Toy / Game Store  0.09
1               Pet Store  0.09
2           Deli / Bodega  0.09
3             Supermarket  0.09
4            Dessert Shop  0.09
5                    Park  0.09
6              Restaurant  0.09
7      Spanish Restaurant  0.09
8  Furniture / Home Store  0.09
9        Asian Restaurant  0.09


----Can Baró----
                venue  freq
0  Spanish Restaurant  0.19
1      Scenic Lookout  0.10
2    Tapas Restaurant  0.10
3  Chinese Restaurant  0.10
4       Grocery Store  0.10
5                Pool  0.05
6      Breakfast Spot  0.05
7                Café  0.05
8                Park  0.05
9    Basketball Court  0.05


----Can Peguera----
                venue  freq
0                Park  0.18
1          Restaurant  0.09
2         Sports Club  0.09
3         Escape Room  0.09
4   German Restaurant  0.09
5   Food & Drink Shop  0.09
6    Tapas Restaurant  0.09
7  Basketball Stadium  0.09
8              Hostel  0.

9        Chinese Restaurant  0.04


----el Carmel----
                   venue  freq
0                  Plaza  0.19
1            Coffee Shop  0.12
2          Metro Station  0.12
3               Mountain  0.06
4     Spanish Restaurant  0.06
5             Food Court  0.06
6  General Entertainment  0.06
7                   Park  0.06
8                    Gym  0.06
9                 Bakery  0.06


----el Clot----
                      venue  freq
0          Tapas Restaurant  0.08
1  Mediterranean Restaurant  0.08
2                      Café  0.08
3                Restaurant  0.06
4        Spanish Restaurant  0.06
5               Supermarket  0.04
6                     Hotel  0.04
7                      Park  0.04
8             Hot Dog Joint  0.02
9              Climbing Gym  0.02


----el Coll----
                      venue  freq
0                     Hotel  0.22
1                      Park  0.22
2            Scenic Lookout  0.22
3  Mediterranean Restaurant  0.11
4                  Mounta

In [28]:
# function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [29]:
# create a dataframe and display the top 10 venues for each neighborhood
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = df_barcelona_grouped['Neighborhood']

for ind in np.arange(df_barcelona_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_barcelona_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Baró de Viver,Toy / Game Store,Pet Store,Deli / Bodega,Supermarket,Dessert Shop
1,Can Baró,Spanish Restaurant,Scenic Lookout,Tapas Restaurant,Chinese Restaurant,Grocery Store
2,Can Peguera,Park,Restaurant,Sports Club,Escape Room,German Restaurant
3,Canyelles,Market,Metro Station,Mediterranean Restaurant,Café,Soccer Field
4,Ciutat Meridiana,Metro Station,Grocery Store,Train Station,Mediterranean Restaurant,Park
...,...,...,...,...,...,...
67,la Vila Olímpica del Poblenou,Spanish Restaurant,Paella Restaurant,Park,Restaurant,Café
68,la Vila de Gràcia,Plaza,Bar,Toy / Game Store,Theater,Indie Movie Theater
69,les Corts,Restaurant,Spanish Restaurant,Café,Bakery,Hotel
70,les Roquetes,Falafel Restaurant,Baby Store,Music Venue,Supermarket,Castle


In [30]:
# set number of clusters
kclusters = 5

df_barcelona_grouped_clustering = df_barcelona_grouped_4.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_barcelona_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 4, 3, 3, 2, 0, 3, 4, 3, 0], dtype=int32)

In [31]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

df_barcelona_merged = df_barcelona

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
df_barcelona_merged = df_barcelona_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# change Cluster Labels column type
df_barcelona_merged = df_barcelona_merged.fillna(0)
df_barcelona_merged = df_barcelona_merged.astype({"Cluster Labels": int})

df_barcelona_merged.head(10)

Unnamed: 0,Neighborhood,Latitude,Longitude,District,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Baró de Viver,41.445815,2.198998,Sant Andreu,4,Toy / Game Store,Pet Store,Deli / Bodega,Supermarket,Dessert Shop
1,Can Baró,41.41676,2.162387,Horta-Guinardó,4,Spanish Restaurant,Scenic Lookout,Tapas Restaurant,Chinese Restaurant,Grocery Store
2,Can Peguera,41.434842,2.16645,Nou Barris,3,Park,Restaurant,Sports Club,Escape Room,German Restaurant
3,Canyelles,41.445033,2.16345,Nou Barris,3,Market,Metro Station,Mediterranean Restaurant,Café,Soccer Field
4,Ciutat Meridiana,41.461208,2.174848,Nou Barris,2,Metro Station,Grocery Store,Train Station,Mediterranean Restaurant,Park
5,Diagonal Mar i el Front Marítim del Poblenou,41.405276,2.212946,Sant Martí,0,Mediterranean Restaurant,Restaurant,Hotel,Beach,Buffet
6,Horta,41.43989,2.15193,Horta-Guinardó,3,Spanish Restaurant,Scenic Lookout,Soccer Stadium,Bakery,Sandwich Place
7,Hostafrancs,41.375305,2.144255,Sants-Montjuïc,4,Tapas Restaurant,Mediterranean Restaurant,Pizza Place,Bar,Spanish Restaurant
8,Montbau,41.435267,2.137761,Horta-Guinardó,3,Trail,Restaurant,Park,Breakfast Spot,Multiplex
9,Navas,41.418,2.185948,Sant Andreu,0,Supermarket,Grocery Store,Bakery,Italian Restaurant,Spanish Restaurant


In [32]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df_barcelona_merged['Latitude'], df_barcelona_merged['Longitude'], df_barcelona_merged['Neighborhood'], df_barcelona_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Now let's examine each cluster in details

In [33]:
df_barcelona_merged.loc[df_barcelona_merged['Cluster Labels'] == 0, df_barcelona_merged.columns[[1] + list(range(5, df_barcelona_merged.shape[1]))]]

Unnamed: 0,Latitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
5,41.405276,Mediterranean Restaurant,Restaurant,Hotel,Beach,Buffet
9,41.418,Supermarket,Grocery Store,Bakery,Italian Restaurant,Spanish Restaurant
11,41.434867,Supermarket,Spanish Restaurant,Grocery Store,Food,Sandwich Place
12,41.411123,Spanish Restaurant,Mediterranean Restaurant,Pizza Place,Café,Asian Restaurant
14,41.378513,Coffee Shop,Café,Tapas Restaurant,Mediterranean Restaurant,Bar
15,41.425455,Tennis Court,Sports Bar,Tapas Restaurant,Yoga Studio,Pakistani Restaurant
16,41.397452,Mediterranean Restaurant,Deli / Bodega,Spanish Restaurant,Sandwich Place,Café
19,41.386795,Tapas Restaurant,Bar,Wine Bar,Cocktail Bar,Camera Store
20,41.377465,Tapas Restaurant,Bar,Restaurant,Pie Shop,Hostel
21,41.374649,Supermarket,Bakery,Pizza Place,Restaurant,Café
