# Capstone Project - The Battle of the Neighborhoods
###  Veterinary Clinic in Madrid, Spain
#### Applied Data Science Capstone by IBM/Coursera

## Introduction: Business Problem

A veterinarian planning to open a new clinic in Madrid, Spain, needs to know how pets and existing veterinary clinics are distributed across the city's neighbourhoods in order to find an optimal location for the new facility.

We will analyze the number of pets and the number of veterinary clinics per neighbourhood to detect the neighbourhoods with the highest ratio of pets per clinic (i.e. those with the greatest need for a new facility).

It is also important to check how the number of pets in Madrid has grown over the last five years and to identify the boroughs with the highest growth.

With these two criteria we will cluster the neighbourhoods to detect the most promising ones for a new veterinary clinic (and the most "saturated" ones to avoid).

## Data

Based on the definition of the problem, the factors that will influence our decision are:

- increase in the number of pets in the last 5 years
- number of pets in the neighbourhood,
- number of veterinary clinics in the neighbourhood
   

The following data sources will be needed to extract or generate the required information:

- name and location of the boroughs and neighbourhoods of Madrid: the City Hall Public Data website publishes the street guide, including the numbering of all urban premises (206866 premises). We will extract the names of the boroughs and neighbourhoods and their locations.

- number of pets in the last five years: the City Hall Public Data website publishes the number of dogs and cats per borough for the last 5 years. We will extract the increase in the total number of pets per borough. Since the number of pets is published only per borough, we will distribute it among the neighbourhoods in proportion to their population.

- number of veterinary clinics in every neighbourhood: the City Hall Public Data website publishes the active venues in Madrid (163251 venues). We will extract the number of veterinary clinics per neighbourhood.

(Note: I have decided not to use the Foursquare API because too few Madrid venues are registered in the application so far. A check of this is included in the code.)

The final dataframe will contain, for each neighbourhood:
- Borough
- Location: Latitude and Longitude
- Total Number of pets
- Increase of the number of pets in the last 5 years
- Number of veterinarian clinics
- Ratio pets/vet
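
The proportional allocation and the pets/vet ratio can be sketched on made-up numbers. The boroughs, populations, pet counts and clinic counts below are illustrative assumptions, not values from the Madrid datasets. A neighbourhood without any clinic is counted as if it had one, so its ratio equals its estimated number of pets.

```python
import pandas as pd

# Hypothetical toy data: two boroughs, four neighbourhoods
toy = pd.DataFrame({
    'Borough': ['A', 'A', 'B', 'B'],
    'Neighbourhood': ['a1', 'a2', 'b1', 'b2'],
    'Population': [3000, 1000, 2000, 2000],
    'Vets': [2, 0, 1, 4],
})
pets_per_borough = {'A': 400, 'B': 200}  # pets are only known per borough

# Share of the borough population living in each neighbourhood
pop_share = toy['Population'] / toy.groupby('Borough')['Population'].transform('sum')

# Distribute the borough's pets according to that share
toy['Pets'] = toy['Borough'].map(pets_per_borough) * pop_share

# Pets per clinic; zero clinics are counted as one to avoid dividing by zero
toy['pets/vet'] = toy['Pets'] / toy['Vets'].clip(lower=1)

print(toy)
```

In this toy case neighbourhood `a1` receives three quarters of borough A's pets because it holds three quarters of its population.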


In [1]:
# The code was removed by Watson Studio for sharing.

In [None]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')



## 1. Download and Explore Dataset - Map of Neighbourhoods

The Madrid City Hall website publishes a large amount of public data about the city.

Data Source: City Hall Public Data web https://datos.madrid.es/portal/site/egob


In [None]:
# The list of boroughs, neighbourhoods and their locations is extracted from the street directory of the city
df=pd.read_csv(project.get_file('CALLEJERO_VIGENTE_NUMERACIONES_201908.csv'), sep=';', encoding = "ISO-8859-1")
df.head(5)

In [None]:
# it is a very big dataframe
df.shape

In [None]:
# Only the 6 columns required are selected
names = list(df.columns)
names[6]='Borough Code'
names[7]='Borough'
names[8]='Neighbourhood Code'
names[9]='Neighbourhood'
names[18] = 'Longitude'
names[19] = 'Latitude'
df.columns = names
df=df[['Borough Code','Borough','Neighbourhood Code','Neighbourhood','Longitude','Latitude']]
df.head()

In [None]:
# table index for the borough codes is created to be used later
codigos_distrito=df[['Borough Code','Borough']].drop_duplicates().sort_values(['Borough Code'])
codigos_distrito=codigos_distrito.reset_index(drop=True)
codigos_distrito

In [None]:
# functions to convert coordinates from degrees, minutes and seconds to decimal degrees (longitudes are negated: Madrid lies west of Greenwich)
def latDD(x): 
    D = int(x[:x.find('º')]) 
    M = int(x[x.find('º')+1:x.find("'")]) 
    S = float(x[x.find("'")+1:x.find("''")]) 
    DD = D + float(M)/60 + float(S)/3600
    return DD

def lonDD(x): 
    D = int(x[:x.find('º')]) 
    M = int(x[x.find('º')+1:x.find("'")]) 
    S = float(x[x.find("'")+1:x.find("''")]) 
    DD = -(D + float(M)/60 + float(S)/3600)
    return DD
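
As a quick check of the conversion, the sketch below applies the same degrees/minutes/seconds arithmetic to single strings. The sample values and their exact layout (`º`, `'` and `''` separators followed by a hemisphere letter) are assumptions about how the street guide encodes coordinates, and the helper name `dms_to_dd` is hypothetical.

```python
def dms_to_dd(text, negative=False):
    """Convert a string such as 3º41'15.5'' W to decimal degrees."""
    deg_end = text.find('º')
    min_end = text.find("'")
    sec_end = text.find("''")
    degrees = int(text[:deg_end])
    minutes = int(text[deg_end + 1:min_end])
    seconds = float(text[min_end + 1:sec_end])
    value = degrees + minutes / 60 + seconds / 3600
    # Western longitudes are expressed as negative decimal degrees
    return -value if negative else value

lat = dms_to_dd("40º24'36.0'' N")
lon = dms_to_dd("3º42'0.0'' W", negative=True)
print(round(lat, 4), round(lon, 4))
```

A single helper with a sign flag is used here only to keep the example short; it performs the same calculation as the two functions above.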

In [None]:
# apply the coordinate conversion functions
columnas=list(df.columns)
df['Longitude']=df['Longitude'].apply(lonDD)
df['Latitude']=df['Latitude'].apply(latDD)
df.head()

In [None]:
# grouping by neighbourhood (mean of the numeric columns only) and reordering columns
df=df.groupby(['Neighbourhood']).mean(numeric_only=True)
df=df.reset_index()
df=df.join(codigos_distrito.set_index('Borough Code'), on='Borough Code')
columnas=df.columns.tolist()
columnas= columnas[1:2] + columnas[-1:]+ columnas[2:3]+columnas[0:1]+columnas[3:5]
df=df[columnas]
df=df.sort_values(['Borough Code', 'Neighbourhood Code'])
df=df.reset_index(drop=True)
df

In [None]:
# calculate number of boroughs and neighbourhoods
print('Madrid has {} boroughs and {} neighbourhoods.'.format(
        len(df['Borough'].unique()),
        df.shape[0]
    )
)

### Map of Madrid Neighbourhoods

In [None]:
address = 'Madrid'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Madrid are {}, {}.'.format(latitude, longitude))

In [None]:
# create map of Madrid using latitude and longitude values
map_madrid = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_madrid)  
    
map_madrid

## 2. Explore Neighbourhoods in Madrid - via Foursquare

Analysis of the Madrid data obtained from the Foursquare application

In [None]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

CLIENT_ID = 'I2JEH30YZ5DWC4DQBZAY05XPA3QR3JLCGB0KDB4B34B3XEXM' # your Foursquare ID
CLIENT_SECRET = 'IQJUV2X3DXB0E0EQ4VE4TCNHTIEGZ3O5OQ2S5I2RHAGXIMVG' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [None]:
madrid_venues = getNearbyVenues(names=df['Neighbourhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )


In [None]:
print(madrid_venues.shape)
madrid_venues.head()

In [None]:
print('Foursquare application has {} registered venues in Madrid.'.format(
        madrid_venues.shape[0]
    )
)

In [None]:
# The Museo del Prado is one of the most important venues in the city of Madrid and one of the most important
# art museums in the world. Let's check whether it is included in the venues of Madrid (substring match on the name)
madrid_venues[madrid_venues['Venue'].str.contains('Prado', na=False)]

In [None]:
madrid_venues[madrid_venues['Venue Category']=='Museum']
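
Searching for a venue by part of its name or category needs a substring match rather than an equality test. The sketch below shows the idea on a made-up venue table. The rows are assumptions, not Foursquare results. Category names such as "Art Museum" are included because an exact comparison with 'Museum' may miss them.

```python
import pandas as pd

# Hypothetical venue table with one missing name
venues = pd.DataFrame({
    'Venue': ['Museo Nacional del Prado', 'Parque del Retiro', None],
    'Venue Category': ['Art Museum', 'Park', 'Museum'],
})

# Case-insensitive substring match; missing names count as no match
prado = venues[venues['Venue'].str.contains('prado', case=False, na=False)]

# Categories that merely contain the word "Museum" are matched as well
museums = venues[venues['Venue Category'].str.contains('Museum', na=False)]
print(len(prado), len(museums))
```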

According to the City Council, the city of Madrid has around 70 museums (far more than the 10 Foursquare records), including, of course, the Prado Museum, the Thyssen-Bornemisza Museum and the Reina Sofía Museum. For this reason, I consider that the Foursquare application does not yet have enough records for the city of Madrid. For the rest of the Capstone I will use the Madrid City Hall Public Data website.

## 3. Explore Neighbourhoods in Madrid - via City Hall Public Data

Data Source: City Hall Public Data web https://datos.madrid.es/portal/site/egob

As we are interested in opening a new veterinary clinic, we first analyse the increase in the number of pets.
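
The growth figure is the relative change between the first and the last year, (pets in 2018 / pets in 2014 - 1) x 100. A small sketch with invented counts (the borough names and numbers are assumptions) shows the calculation. It also shows that the unweighted mean of the borough percentages generally differs from the city-wide increase.

```python
import pandas as pd

# Hypothetical census in long format: one row per borough and year
census = pd.DataFrame({
    'Year': [2014, 2018, 2014, 2018],
    'Borough': ['A', 'A', 'B', 'B'],
    'Total_pets': [1000, 1100, 200, 300],
})

# One row per borough, one column per year
wide = census.pivot_table(values='Total_pets', index='Borough', columns='Year')

# Relative change between 2014 and 2018, in percent
wide['Inc (%)'] = (wide[2018] / wide[2014] - 1) * 100

mean_of_boroughs = wide['Inc (%)'].mean()
city_wide = (wide[2018].sum() / wide[2014].sum() - 1) * 100
print(wide)
print(round(mean_of_boroughs, 1), round(city_wide, 1))
```

Here the small borough B grows by 50% and pulls the unweighted mean well above the increase computed on the totals.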

In [None]:
# In the file "Censo animales.csv" we have the total number of pets per Borough and the increase in the last 5 years
pets=pd.read_csv(project.get_file('Censo animales .csv'), sep=';', encoding = "ISO-8859-1")
names = list(pets.columns)
names[0]='Year'
names[1]='Borough_code'
names[2]='Borough'
names[3]='Dogs'
names[4]='Cats'
pets.columns = names
pets['Total_pets']=pets['Dogs']+pets['Cats']

pets.head(30)


In [None]:
#pivot table to analyse the number of pets per borough and year
pets=pets.pivot_table('Total_pets',['Borough_code','Borough'],'Year')
pets=pets.rename_axis(None, axis=1).reset_index()

# calculation of the increase in the number of pets per borough in the last five years
pets['Inc_5_y (%)']=(pets[2018]/pets[2014]-1)*100
pets=pets.round({'Inc_5_y (%)': 1})


pets

In [None]:
# select only required columns
pets=pets[['Borough_code','Borough',2018,'Inc_5_y (%)']]
names=list(pets.columns)
names[2]='Total_pets'
pets.columns = names
pets


In [None]:
print('There are {} registered pets in Madrid.'.format(
        pets['Total_pets'].sum()
    )
)

print('The average increase in the number of pets per borough in the last 5 years is {}%.'.format(
        round(pets['Inc_5_y (%)'].mean(),1)
    )
)

In [None]:
# as we have the number of pets per borough, we use the proportion of population to distribute the number of pets
# per neighbourhood. First we download the population file from the City Hall Public web
population=pd.read_csv(project.get_file('Rango_Edades_Seccion_201908.csv'), sep=';', encoding = "ISO-8859-1")
names = list(population.columns)
names[0]='Borough_code'
names[1]='Borough'
names[2]='Neighbourhood_code'
names[3]='Neighbourhood'
names[8]='SpanishMen'
names[9]='SpanishWomen'
names[10]='OtherMen'
names[11]='OtherWomen'
population.columns = names
population=population.fillna(0)
population['Total_Pop_Neighbourhood']=population['SpanishMen']+population['SpanishWomen']+population['OtherMen']+population['OtherWomen']
population.head()

In [None]:
# the population per neighbourhood
pop_neigh=population.pivot_table('Total_Pop_Neighbourhood',['Borough_code','Borough','Neighbourhood_code','Neighbourhood'],aggfunc='sum')
pop_neigh=pop_neigh.rename_axis(None, axis=1).reset_index()
pop_neigh

In [None]:
# the population per borough
pop_bor=population.pivot_table('Total_Pop_Neighbourhood',['Borough_code'],aggfunc='sum')
pop_bor=pop_bor.rename_axis(None, axis=1).reset_index()
pop_bor.rename(columns={'Total_Pop_Neighbourhood': 'Total_Pop_Bor'}, inplace=True)
pop_bor

In [None]:
# pets and population per borough
pets=pets.set_index('Borough_code').join(pop_bor.set_index('Borough_code') )
pets=pets.rename_axis(None, axis=1).reset_index()
pets=pets.drop('Borough',axis=1)
pets

In [None]:
# distribution of pets per neighbourhood
pets=pop_neigh.set_index('Borough_code').join(pets.set_index('Borough_code') )
pets=pets.rename_axis(None, axis=1).reset_index()
pets.rename(columns={'Total_pets': 'Total_Pets_Borough'}, inplace=True)
pets['Total_Pets_Neighbourhood']=pets['Total_Pets_Borough']/pets['Total_Pop_Bor']*pets['Total_Pop_Neighbourhood']
pets=pets.round({'Total_Pets_Neighbourhood': 0})
pets=pets.drop('Borough',axis=1)
pets=pets.set_index('Neighbourhood').join(df.set_index('Neighbourhood') )
pets=pets.rename_axis(None, axis=1).reset_index()
pets=pets[['Neighbourhood','Borough','Longitude','Latitude','Total_Pets_Neighbourhood','Inc_5_y (%)']]

pets

In [None]:
# download venues from the City Hall Data Web
vets=pd.read_csv(project.get_file('OPEN DATA Locales-Epigrafes201907.csv'), sep=';', encoding = "ISO-8859-1", low_memory=False)
vets.head()

In [None]:
vets.shape

In [None]:
# counting the veterinary clinics per neighbourhood
vets.rename(columns={'desc_barrio_local': 'Neighbourhood','desc_division': 'Activity' }, inplace=True)
vets=vets[['Neighbourhood','Activity' ]]
# keep only the venues whose activity mentions veterinary services (missing activities are excluded)
vets=vets[vets['Activity'].str.contains("VETERINARIAS", na=False)]
vets=vets['Neighbourhood'].value_counts()
vets=pd.DataFrame(vets)
vets=vets.rename_axis(None, axis=1).reset_index()
names = list(vets.columns)
names[0]='Neighbourhood'
names[1]='Number_of_Vets'
vets.columns = names
vets

In [None]:
print('There are {} Veterinary Clinics in Madrid.'.format(
        vets['Number_of_Vets'].sum()
    )
)




In [None]:
# joining pets per neighbourhood + vets data
final_df=pets.set_index('Neighbourhood').join(vets.set_index('Neighbourhood') )
final_df=final_df.rename_axis(None, axis=1).reset_index()
final_df=final_df.fillna(0)
# pets per clinic; a neighbourhood without clinics is treated as having one, so its ratio equals its number of pets
final_df['pets/vet'] = final_df['Total_Pets_Neighbourhood']/final_df['Number_of_Vets'].clip(lower=1)
final_df=final_df.round({'pets/vet': 0})
final_df

## 4. Cluster Neighbourhoods

In [None]:
# clustering neighbourhood by increase in the last 5 years and ratio pets/vet

madrid_grouped_clustering = final_df.drop(['Neighbourhood','Borough','Longitude','Latitude','Total_Pets_Neighbourhood','Number_of_Vets'], axis=1)

from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(madrid_grouped_clustering)

madrid_grouped_clustering=scaler.transform(madrid_grouped_clustering) 

The elbow method is applied to determine the optimal number of clusters K.

In [None]:
# optimize number of clusters
ks = 10
inertia_clusters = list()

for i in range(1,ks):
        # Object KMeans
        kmeans=KMeans(n_clusters=i, random_state=0).fit(madrid_grouped_clustering)

        # Obtain inertia
        inertia_clusters.append([i, kmeans.inertia_])


In [None]:
import matplotlib.pyplot as plt
x, y = zip(*[inertia for inertia in inertia_clusters])
plt.plot(x, y, 'ro-', markersize=8, lw=2)
plt.xlabel('Num Clusters')
plt.ylabel('Inertia')
plt.show()


According to the plot, the optimal number of clusters is 4.
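
To illustrate how an elbow is read, the sketch below repeats the procedure on synthetic data with three well-separated groups. The data and the value of three are assumptions made for the example, not a result from the Madrid data. Inertia drops sharply until the number of groups is reached and flattens afterwards.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Synthetic 2-D points around three hypothetical centres
centres = np.array([[0, 0], [10, 0], [0, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centres])

# Standardise so both features weigh equally in the Euclidean distance
scaled = StandardScaler().fit_transform(points)

inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertias.append(model.inertia_)

# Drop in inertia obtained by adding one more cluster
drops = -np.diff(inertias)
print(np.round(inertias, 2))
print(np.round(drops, 2))
```

The elbow is the last value of K for which the drop is still large compared with the drops that follow.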

In [None]:
# set number of clusters
kclusters = 4

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(madrid_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

In [None]:
# add clustering labels

madrid_grouped_sorted=final_df.copy() # copy so that final_df is left unchanged and the cell can be re-run

madrid_grouped_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

madrid_grouped_sorted


In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(madrid_grouped_sorted['Latitude'], madrid_grouped_sorted['Longitude'], madrid_grouped_sorted['Neighbourhood'], madrid_grouped_sorted['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Cluster 0

In [None]:
madrid_cluster_0=madrid_grouped_sorted.loc[madrid_grouped_sorted['Cluster Labels'] == 0, madrid_grouped_sorted.columns[[1] + list(range(5, madrid_grouped_sorted.shape[1]))]]
madrid_cluster_0

In [None]:
madrid_cluster_0.describe()

These 4 neighbourhoods are located in boroughs with a high increase in the number of pets in the last 5 years and have a high pets/vet ratio. The neighbourhoods in this cluster are the preferred locations for a new veterinary clinic.

#### Cluster 1

In [None]:
madrid_cluster_1=madrid_grouped_sorted.loc[madrid_grouped_sorted['Cluster Labels'] == 1, madrid_grouped_sorted.columns[[1] + list(range(5, madrid_grouped_sorted.shape[1]))]]
madrid_cluster_1

In [None]:
madrid_cluster_1.describe()

These 65 neighbourhoods are located in boroughs with a low increase in the number of pets in the last 5 years and have a very low pets/vet ratio. The neighbourhoods in this cluster are the worst locations for a new veterinary clinic.

#### Cluster 2

In [None]:
madrid_cluster_2=madrid_grouped_sorted.loc[madrid_grouped_sorted['Cluster Labels'] == 2, madrid_grouped_sorted.columns[[1] + list(range(5, madrid_grouped_sorted.shape[1]))]]
madrid_cluster_2

In [None]:
madrid_cluster_2.describe()

These 25 neighbourhoods are located in boroughs with a low increase in the number of pets in the last 5 years but have a high pets/vet ratio. The neighbourhoods in this cluster are still good locations for a new veterinary clinic.

#### Cluster 3

In [None]:
madrid_cluster_3 = madrid_grouped_sorted.loc[madrid_grouped_sorted['Cluster Labels'] == 3, madrid_grouped_sorted.columns[[1] + list(range(5, madrid_grouped_sorted.shape[1]))]]
madrid_cluster_3

In [None]:
madrid_cluster_3.describe()

These 37 neighbourhoods are located in boroughs with a high increase in the number of pets in the last 5 years but have a low pets/vet ratio. The neighbourhoods in this cluster are not good locations for a new veterinary clinic at present, but they can be revisited in the future to see how the increase evolves.
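
One way to check cluster descriptions like the ones above is to look at the cluster centres in the original units rather than in standardised ones. The sketch below does this on invented values. The column names mirror the two clustering features, but the numbers and the choice of two clusters are assumptions for illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical neighbourhood features
features = pd.DataFrame({
    'Inc_5_y (%)': [2.0, 3.0, 2.5, 20.0, 21.0, 19.0],
    'pets/vet': [500.0, 600.0, 550.0, 4000.0, 4200.0, 3900.0],
})

scaler = StandardScaler().fit(features)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaler.transform(features))

# Map the centres back to percentages and pets per clinic
centres = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=features.columns,
)
# Sort so the profile with the highest pets/vet ratio comes first
centres = centres.sort_values('pets/vet', ascending=False).reset_index(drop=True)
print(centres.round(1))
```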