# Capstone Project - Music Venues in Madrid
### Applied Data Science Capstone by IBM/Coursera

This study is part of the Applied Data Science Capstone by IBM/Coursera. We assess which neighborhoods of Madrid are the best candidates for opening a new music venue. We consider two factors: the distance from the neighborhood to the city center and the number of music venues already in the neighborhood. We group the neighborhoods on these two factors with the K-Means clustering method, and we regard as the best candidates the neighborhoods that are close to the center and have the fewest music venues, where competition will be weaker.

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for a new venue. Specifically, this report is aimed at stakeholders interested in opening a **music venue** in **Madrid**, Spain.

Since there are many music venues in Madrid, we will try to detect **neighborhoods that are not already crowded with music venues**. A neighborhood crowded with music venues is good for the music business, but the competition there will be tough. We will still report on these neighborhoods, because the stakeholders may decide that the risk is manageable. We would also prefer locations **as close to the city center as possible**. The reason is that cultural activities, such as going to music venues, are more common in neighborhoods close to the center, so a venue located there is more likely to succeed.

We will use the K-Means clustering method to find the most promising neighborhoods based on these criteria. We will then describe each cluster clearly so that stakeholders can choose the best possible final location.

## Data <a name="data"></a>

The data we need for our project is the following:
- The complete list of **Madrid neighborhoods**. This list is available on Wikipedia, so we will scrape it into a dataframe.
- The **coordinates of each neighborhood**. We obtain them from the list of neighborhoods with the geocoder library (ArcGIS provider).
- The **distance from each neighborhood to the center**. We get the coordinates of the center of Madrid with geopy and then calculate the distance between locations, also with geopy.
- The **music venues of each neighborhood**. A music venue is any location used for a concert or musical performance. Foursquare has a category for this kind of venue, so we will use the Foursquare API to get the music venues for each neighborhood, treating each neighborhood as a circle of radius 500 m centered on its geocoded coordinates. After getting the complete list of music venues, we will count the **number of venues in each neighborhood**.

The following steps show how we collect this information.

#### 1. Importing libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder # to get coordinates

from geopy.distance import geodesic # calculate distance between coordinates

import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print("Libraries imported.")

Libraries imported.


#### 2. Scraping the Neighborhoods list from Wikipedia page into a DataFrame

Wikipedia has a list of the neighborhoods of Madrid. We read it with the `read_html` function from pandas and then drop the columns that we do not need.

In [2]:
html = 'https://en.wikipedia.org/wiki/List_of_neighborhoods_of_Madrid'
df = pd.read_html(html)[0] 
df = df.drop(['District name (number)','District location','Number','Image'],axis = 1)
df = df.rename(columns = {'Name':'Neighborhood'})

df.head()

| | Neighborhood |
|---|---|
| 0 | Palacio |
| 1 | Embajadores |
| 2 | Cortes |
| 3 | Justicia |
| 4 | Universidad |


#### 3. Getting the coordinates of each neighborhood

Now we need the coordinates of each neighborhood, to be able both to calculate the distance to the center and to get the music venues that each neighborhood has.

In [3]:
# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Madrid, Spain'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords
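
The loop above keeps retrying for as long as the geocoder returns nothing, so a name that cannot be resolved would block the notebook. A minimal sketch of a bounded retry follows. The names `get_latlng_bounded`, `lookup`, `max_tries` and `pause` are illustrative assumptions, not part of the original notebook; `lookup` stands in for a call such as `geocoder.arcgis(address).latlng`.

```python
import time


def get_latlng_bounded(neighborhood, lookup, max_tries=3, pause=0.0):
    """Return [lat, lng] for a neighborhood, or None after max_tries failed lookups.

    `lookup` is a hypothetical callable that takes an address string and
    returns [lat, lng] or None; it stands in for geocoder.arcgis(...).latlng.
    """
    address = '{}, Madrid, Spain'.format(neighborhood)
    for _ in range(max_tries):
        coords = lookup(address)
        if coords is not None:
            return coords
        time.sleep(pause)  # wait before the next attempt
    return None


# Fake lookup for illustration: fails once, then returns the Palacio coordinates.
calls = {'n': 0}


def fake_lookup(address):
    calls['n'] += 1
    return None if calls['n'] < 2 else [40.41517, -3.71273]


palacio = get_latlng_bounded('Palacio', fake_lookup)
missing = get_latlng_bounded('Nowhere', lambda address: None)
```

With this variant, neighborhoods that could not be geocoded come back as `None` and have to be dropped or handled before the coordinates are put into a dataframe.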

In [4]:
# call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(neighborhood) for neighborhood in df["Neighborhood"].tolist() ]

In [5]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

# merge the coordinates into the original dataframe
df['Latitude'] = df_coords['Latitude']
df['Longitude'] = df_coords['Longitude']

df.head()

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Palacio | 40.41517 | -3.71273 |
| 1 | Embajadores | 40.40803 | -3.70067 |
| 2 | Cortes | 40.41589 | -3.69636 |
| 3 | Justicia | 40.42479 | -3.69308 |
| 4 | Universidad | 40.42565 | -3.70726 |


#### 4. Creating a Map of Madrid with its neighborhoods

Now, we will create a Map of Madrid with its neighborhoods, just for visualization purposes. 

In [6]:
# get the coordinates of Madrid
address = 'Madrid, Spain'

geolocator = Nominatim(user_agent="http")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
Madrid = ((latitude, longitude))

print('The geographical coordinate of Madrid, Spain: {}, {}.'.format(latitude, longitude))

The geographical coordinate of Madrid, Spain: 40.4167047, -3.7035825.


In [7]:
# create map of Madrid using latitude and longitude values
map_mad = folium.Map(Madrid, zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_mad)  
    
map_mad

#### 5. Calculating the distance from each neighborhood to the center

We already have the coordinates of the center of Madrid, stored in the variable `Madrid`. We calculate the distance from the center to each neighborhood with `geodesic` from the geopy library. After that, we clean the result so that it can be used in the K-Means clustering: first we remove the " km" suffix from each value, and then we convert the column to the float type.

In [8]:
header_list = ['Neighborhood', 'Latitude', 'Longitude', 'Distance from center']
df = df.reindex(columns = header_list) 
df_coords = df.drop(['Neighborhood','Distance from center'], axis=1) # axis must be passed by keyword in recent pandas

for row in df_coords.index:
    df.loc[row,'Distance from center']=str(geodesic(Madrid,df_coords.loc[row,:]))

df.head()  

| | Neighborhood | Latitude | Longitude | Distance from center |
|---|---|---|---|---|
| 0 | Palacio | 40.41517 | -3.71273 | 0.794863607816637 km |
| 1 | Embajadores | 40.40803 | -3.70067 | 0.9944762511532348 km |
| 2 | Cortes | 40.41589 | -3.69636 | 0.6196350480297036 km |
| 3 | Justicia | 40.42479 | -3.69308 | 1.2651171324805244 km |
| 4 | Universidad | 40.42565 | -3.70726 | 1.0411873551730422 km |


In [9]:
# remove the " km" suffix; str.strip(" km") would instead strip any of the characters ' ', 'k', 'm'
df['Distance from center'] = df['Distance from center'].str.replace(' km', '', regex=False)

df['Distance from center'] = df['Distance from center'].astype(float)

df.head()

| | Neighborhood | Latitude | Longitude | Distance from center |
|---|---|---|---|---|
| 0 | Palacio | 40.41517 | -3.71273 | 0.794864 |
| 1 | Embajadores | 40.40803 | -3.70067 | 0.994476 |
| 2 | Cortes | 40.41589 | -3.69636 | 0.619635 |
| 3 | Justicia | 40.42479 | -3.69308 | 1.265117 |
| 4 | Universidad | 40.42565 | -3.70726 | 1.041187 |
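
The distances above can be sanity-checked without geopy. The sketch below uses the haversine formula, which assumes a spherical Earth, so its results differ slightly from the ellipsoidal `geodesic` values. The function name `haversine_km` and the radius constant are illustrative choices, not part of the notebook. Separately, geopy distance objects expose a `.km` attribute, which would avoid the string conversion and suffix removal altogether.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; the sphere is an approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


# Coordinates taken from the outputs above
madrid_center = (40.4167047, -3.7035825)
palacio = (40.41517, -3.71273)

distance = haversine_km(madrid_center, palacio)
```

For Palacio the result should land close to the 0.79 km reported by `geodesic`.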


#### 6. Getting the music venues with Foursquare

At this point, we will get the different music venues that we can find in each neighborhood using the Foursquare API.

In [19]:
# define Foursquare Credentials and Version
CLIENT_ID = 'your Foursquare ID' # your Foursquare ID
CLIENT_SECRET = 'your Foursquare Secret' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: your Foursquare ID
CLIENT_SECRET: your Foursquare Secret


In [11]:
radius = 500
LIMIT = 100
category_ID = '4bf58dd8d48988d1e5931735'
venues = []

for lat, long, neighborhood in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT,
        category_ID)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
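
The request above assumes that every response contains `response -> groups -> [0] -> items`; a rate-limited or failed call would raise a `KeyError` or `IndexError` in the middle of the loop. The sketch below shows one way to build the same URL from a parameter dictionary and to read the items defensively. The helper names `build_explore_url` and `extract_items` are hypothetical, and the sample payloads are made up to mirror the shape used in the cell above.

```python
from urllib.parse import urlencode

EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit, category_id):
    """Build the explore URL with properly escaped query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
        'categoryId': category_id,
    }
    return EXPLORE_URL + '?' + urlencode(params)


def extract_items(payload):
    """Return the venue items of an explore response, or [] if the shape is unexpected."""
    groups = payload.get('response', {}).get('groups', [])
    return groups[0].get('items', []) if groups else []


url = build_explore_url('ID', 'SECRET', '20180605', 40.41517, -3.71273,
                        500, 100, '4bf58dd8d48988d1e5931735')

# Made-up payloads that mimic a successful and a failed response
sample = {'response': {'groups': [{'items': [{'venue': {'name': 'Contraclub'}}]}]}}
names = [item['venue']['name'] for item in extract_items(sample)]
empty = extract_items({'meta': {'code': 429}})
```

`requests.get` also accepts the same dictionary through its `params` argument, which removes the need to format the URL by hand.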

In [12]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

(209, 7)


| | Neighborhood | Latitude | Longitude | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|---|---|---|---|---|---|---|
| 0 | Palacio | 40.41517 | -3.71273 | Corral de la Morería | 40.412619 | -3.714249 | Performing Arts Venue |
| 1 | Palacio | 40.41517 | -3.71273 | La Taberna de Mister Pinkleton | 40.414536 | -3.708108 | Music Venue |
| 2 | Palacio | 40.41517 | -3.71273 | Contraclub | 40.412639 | -3.713827 | Music Venue |
| 3 | Palacio | 40.41517 | -3.71273 | Marula Café | 40.413439 | -3.713393 | Music Venue |
| 4 | Palacio | 40.41517 | -3.71273 | Restaurante El Cosaco | 40.412939 | -3.711632 | Music Venue |


#### 7. Counting the number of music venues for each neighborhood

In this section, we group the music venues by neighborhood and count them. Then we add this count to our dataframe and fill in the value 0 for the neighborhoods without venues.

In [13]:
venues_df = venues_df.drop(['Latitude','Longitude','VenueLatitude','VenueLongitude','VenueCategory'],axis = 1)
count = venues_df.groupby(["Neighborhood"]).count()
count = count.rename(columns = {'VenueName':'Number of Venues'})

count.head()

| Neighborhood | Number of Venues |
|---|---|
| Acacias | 4 |
| Adelfas | 3 |
| Almagro | 5 |
| Almenara | 1 |
| Almendrales | 3 |


In [14]:
df = pd.merge(df, count, on='Neighborhood', how = 'left')

df.head()

| | Neighborhood | Latitude | Longitude | Distance from center | Number of Venues |
|---|---|---|---|---|---|
| 0 | Palacio | 40.41517 | -3.71273 | 0.794864 | 6.0 |
| 1 | Embajadores | 40.40803 | -3.70067 | 0.994476 | 8.0 |
| 2 | Cortes | 40.41589 | -3.69636 | 0.619635 | 10.0 |
| 3 | Justicia | 40.42479 | -3.69308 | 1.265117 | 7.0 |
| 4 | Universidad | 40.42565 | -3.70726 | 1.041187 | 14.0 |


In [15]:
df['Number of Venues'] = df['Number of Venues'].fillna(0)
df.head()

| | Neighborhood | Latitude | Longitude | Distance from center | Number of Venues |
|---|---|---|---|---|---|
| 0 | Palacio | 40.41517 | -3.71273 | 0.794864 | 6.0 |
| 1 | Embajadores | 40.40803 | -3.70067 | 0.994476 | 8.0 |
| 2 | Cortes | 40.41589 | -3.69636 | 0.619635 | 10.0 |
| 3 | Justicia | 40.42479 | -3.69308 | 1.265117 | 7.0 |
| 4 | Universidad | 40.42565 | -3.70726 | 1.041187 | 14.0 |
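
The group, merge and fill steps above amount to counting venues per neighborhood and assigning 0 to neighborhoods with no match. A library-free sketch of the same logic follows. The three venue rows come from the venues table earlier in this section, while the second neighborhood name is a made-up placeholder used only to show the zero fill.

```python
from collections import Counter

# (neighborhood, venue name) pairs; a toy subset of the venues table
venue_rows = [
    ('Palacio', 'Corral de la Morería'),
    ('Palacio', 'Contraclub'),
    ('Palacio', 'Marula Café'),
]

# 'Hypothetical barrio' is an invented name standing in for a neighborhood without venues
all_neighborhoods = ['Palacio', 'Hypothetical barrio']

# Count venues per neighborhood; a Counter returns 0 for keys it has never seen
counts = Counter(neighborhood for neighborhood, _ in venue_rows)
venues_per_neighborhood = {name: counts[name] for name in all_neighborhoods}
```

In the dataframe, the counts appear as floats (6.0, 8.0) because the left merge introduces missing values before they are filled; chaining `.astype(int)` after `fillna(0)` would restore integer counts.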


## Methodology <a name="methodology"></a>

In this project we focus on detecting neighborhoods of Madrid that have a low density of music venues, particularly those at a short distance from the city center. Our analysis covers all the neighborhoods of the city listed on Wikipedia.

In the first step we collected the required data: **the location of each neighborhood and its distance to the center, and the location of each music venue and the number of music venues per neighborhood**.

The second step in our analysis is running the **K-Means clustering**. We drop all the columns from the dataframe except **'Distance from center'** and **'Number of Venues'**, so that K-Means groups the neighborhoods on these two variables only. We then add the cluster label to each neighborhood, which lets us see the features common to each cluster and the number of neighborhoods in each one. We also draw a map to show the clusters graphically.

Finally, we will explain the features of each cluster so that the stakeholders can decide which spot best suits their interests.
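
To make the clustering step concrete, here is a minimal sketch of the K-Means idea (Lloyd's algorithm) on two-dimensional points. It is not the scikit-learn implementation used in the analysis: it starts from randomly sampled centroids rather than k-means++, runs a single initialization, and the function names are illustrative. The five sample points are (distance in km, number of venues) pairs taken from rows of the clustering output.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids (not k-means++)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append((sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # centroids stopped moving: converged
        centroids = updated
    return labels, centroids


# (distance from center in km, number of venues)
points = [
    (0.619635, 10.0),  # Cortes
    (1.041187, 14.0),  # Universidad
    (0.236221, 15.0),  # Sol
    (1.581258, 2.0),   # Imperial
    (3.494919, 1.0),   # Legazpi
]

labels, centroids = kmeans(points, k=2)
```

On this toy input the three central, venue-dense neighborhoods end up in one group and the two others in the second group; the numeric label given to each group depends on the initialization.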

## Analysis <a name="analysis"></a>

#### 1. Clustering

Now we run the K-Means clustering with 'Distance from center' and 'Number of Venues' as variables. Afterwards, we color the neighborhoods on the map according to their cluster.

In [16]:
k=5
clustering = df.drop(df.columns[[0,1,2]],axis=1)
kmeans = KMeans(n_clusters = k,random_state=0).fit(clustering)
df.insert(0, 'Cluster Labels', kmeans.labels_)
df

| | Cluster Labels | Neighborhood | Latitude | Longitude | Distance from center | Number of Venues |
|---|---|---|---|---|---|---|
| 0 | 4 | Palacio | 40.41517 | -3.71273 | 0.794864 | 6.0 |
| 1 | 4 | Embajadores | 40.40803 | -3.70067 | 0.994476 | 8.0 |
| 2 | 2 | Cortes | 40.41589 | -3.69636 | 0.619635 | 10.0 |
| 3 | 4 | Justicia | 40.42479 | -3.69308 | 1.265117 | 7.0 |
| 4 | 2 | Universidad | 40.42565 | -3.70726 | 1.041187 | 14.0 |
| 5 | 2 | Sol | 40.41802 | -3.70577 | 0.236221 | 15.0 |
| 6 | 1 | Imperial | 40.40833 | -3.71865 | 1.581258 | 2.0 |
| 7 | 4 | Acacias | 40.40137 | -3.70669 | 1.723112 | 4.0 |
| 8 | 4 | Chopera | 40.39536 | -3.69833 | 2.41174 | 7.0 |
| 9 | 1 | Legazpi | 40.38702 | -3.6899 | 3.494919 | 1.0 |
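
The clustering above works on the raw values, so a difference of one kilometer weighs the same as a difference of one venue. One common option, not applied in this notebook, is to standardize both columns first so that each contributes on a comparable scale; doing so would change the clusters reported here. The sketch below shows the rescaling on the first five rows of the table; `standardize` is an illustrative helper, and scikit-learn's `StandardScaler` performs the equivalent operation.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# First five rows of the clustering table
distance_km = [0.794864, 0.994476, 0.619635, 1.265117, 1.041187]
venues = [6.0, 8.0, 10.0, 7.0, 14.0]

scaled_distance = standardize(distance_km)
scaled_venues = standardize(venues)

# One (distance, venues) feature pair per neighborhood, ready for clustering
features = list(zip(scaled_distance, scaled_venues))
```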


#### 2. Counting the neighborhoods of each cluster

In [17]:
clusters_df = df.drop(['Latitude','Longitude','Distance from center','Number of Venues'],axis = 1)
count_neigh = clusters_df.groupby(["Cluster Labels"]).count()
count_neigh = count_neigh.rename(columns = {'Neighborhood':'Number of Neighborhoods'})

count_neigh

| Cluster Labels | Number of Neighborhoods |
|---|---|
| 0 | 45 |
| 1 | 53 |
| 2 | 3 |
| 3 | 12 |
| 4 | 15 |


#### 3. Mapping the clusters

In [18]:
# create map
map_clusters = folium.Map(Madrid,zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighbourhood, cluster in zip(df['Latitude'], df['Longitude'], df['Neighborhood'], df['Cluster Labels']):
    # show the neighborhood name and its cluster in the popup
    label = folium.Popup(str(neighbourhood) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to k-1, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Results <a name="results"></a>

The five clusters differ as follows:
- Cluster 0 contains neighborhoods that are 6-8 km from the center and have 0-2 music venues, so they are not the most interesting ones. There are 45 of these neighborhoods.
- Cluster 1 contains neighborhoods that are roughly 1-5 km from the center and have 0-3 music venues, so these could be good places to open a new music venue. There are 53 of these neighborhoods.
- Neighborhoods in cluster 2 are about 0-1 km from the center but already have 10-15 music venues. These places are good for music venues, but the competition can be tough. There are 3 of these neighborhoods.
- Neighborhoods in cluster 3 are more than 10 km from the center, so they are not good spots for music venues. There are 12 of these neighborhoods.
- Finally, neighborhoods in cluster 4 are roughly 1-5 km from the center, like those in cluster 1, but they already have about 4-8 music venues each (Embajadores, for example, has 8). They could be good places, but the competition will be harder than in the neighborhoods of cluster 1. There are 15 of these neighborhoods.
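
The ranges quoted above can be derived from the labeled data by taking the minimum and maximum of each variable per cluster. A minimal sketch follows, using only the ten rows displayed in the clustering output, so the ranges it produces cover that sample rather than all the neighborhoods. The variable names are illustrative.

```python
from collections import defaultdict

# (cluster label, distance from center in km, number of venues)
# for the ten rows shown in the clustering output
rows = [
    (4, 0.794864, 6.0),
    (4, 0.994476, 8.0),
    (2, 0.619635, 10.0),
    (4, 1.265117, 7.0),
    (2, 1.041187, 14.0),
    (2, 0.236221, 15.0),
    (1, 1.581258, 2.0),
    (4, 1.723112, 4.0),
    (4, 2.41174, 7.0),
    (1, 3.494919, 1.0),
]

# Collect the (distance, venues) pairs of each cluster
by_cluster = defaultdict(list)
for label, dist, n_venues in rows:
    by_cluster[label].append((dist, n_venues))

# Summarize each cluster by its size and by the min/max of both variables
summary = {}
for label, members in sorted(by_cluster.items()):
    dists = [d for d, _ in members]
    counts = [c for _, c in members]
    summary[label] = {
        'size': len(members),
        'distance_km': (min(dists), max(dists)),
        'venues': (min(counts), max(counts)),
    }
```

On the full dataframe, the same summary could be obtained by grouping on 'Cluster Labels' and aggregating the two columns with their minimum and maximum.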

## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify Madrid neighborhoods close to the center with a low number of music venues, in order to help stakeholders choose a location for a new music venue. The analysis weighs the possible extra profit from proximity to the center and from the concentration of music consumption in a neighborhood against the risk of competition if the neighborhood is already crowded with music venues.

We found 5 different types of neighborhoods according to these two features. There are two kinds of neighborhoods that are far from the center, so they can be discarded directly. Another cluster contains neighborhoods that are very crowded with music venues. Music consumption in these locations is high, but competition can be too tough. Finally, there are two clusters of neighborhoods close to the center: one contains neighborhoods with few music venues and the other contains neighborhoods with more music venues.

The final decision on the optimal music venue location will be made by the stakeholders; we only provide the information on each neighborhood that we think can help with this decision. More variables should be taken into account, such as the accessibility of each neighborhood by public transport, prices, and the social and economic dynamics of each neighborhood.