## IBM Data Science Professional Certificate
### Capstone Project - The Battle of Neighborhoods

Damien Azzopardi - July 2021

<h2>Table of Contents</h2>

<ol>
    <li><a href="#introduction"><b>Introduction</b></a>
        <ul>
            <li><a href="#business_problem">Business Problem</a></li>
        </ul>
    </li>
    <li><a href="#data"><b>Data</b></a>
        <ul>
            <li><a href="#data_1">Data 1</a></li>
        </ul>
    </li>
</ol>
<hr>

<h2 id="introduction">Introduction</h2>

Clearly define a problem or an idea of your choice, where you would need to leverage the Foursquare location data to solve or execute. Remember that data science problems always target an audience and are meant to help a group of stakeholders solve a problem, so make sure that you explicitly describe your audience and why they would care about your problem.

<h3 id="business_problem">Business Problem</h3>

XXX

<h2 id="data">Data</h2>

Describe the data that you will be using to solve the problem or execute your idea. Remember that you will need to use the Foursquare location data to solve the problem or execute your idea. You can absolutely use other datasets in combination with the Foursquare location data. So make sure that you provide adequate explanation and discussion, with examples, of the data that you will be using, even if it is only Foursquare location data.

- **Neighborhoods list**:
The list of neighborhoods in Barcelona comes from the [Districts of Barcelona Wikipedia page](https://en.wikipedia.org/wiki/Districts_of_Barcelona). The data manipulation required to scrape the list and put it in the proper format is done directly in the workbook.


- **Neighborhoods coordinates**:

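The neighborhood coordinates come from the Metabolism of Cities data library. The first page below is the one scraped in the workbook:

https://data.metabolismofcities.org/library/maps/577245/view/

https://data.metabolismofcities.org/referencespaces/view/577264/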

XXX

- **Foursquare**:

In [1]:
# load libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim
from pandas import json_normalize  # top-level import; pandas.io.json.json_normalize is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

In [53]:
url = 'https://data.metabolismofcities.org/library/maps/577245/view/'

r = requests.get(url)
html = r.text

soup = BeautifulSoup(html, 'lxml')
table = soup.find('table')
rows = table.find_all('tr')
data = []
for row in rows[1:]:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

# convert the scraped table to a dataframe
df = pd.DataFrame(data)

# name the columns
df.columns = ['Neighborhoods', 'Coordinates']

df.head()

Unnamed: 0,Neighborhoods,Coordinates
0,Baró de Viver,"[41.44581467347341, 2.19899775842406]"
1,Can Baró,"[41.4167603624773, 2.1623865539676492]"
2,Can Peguera,"[41.43484212038238, 2.1664501320817235]"
3,Canyelles,"[41.445032990983854, 2.1634504252403164]"
4,Ciutat Meridiana,"[41.46120773644666, 2.1748476502321963]"
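
The row-extraction logic used above can be checked offline, without requesting the page. The following is a minimal sketch that swaps BeautifulSoup for the standard library's `html.parser`, so it runs with no extra dependencies. The `TableRowParser` class and the `sample_html` snippet are hypothetical: the markup only mimics a two-column table and is not the real page source.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # only keep text that sits inside a <td>
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None


# Assumed markup, for illustration only
sample_html = """
<table>
  <tr><th>Name</th><th>Coordinates</th></tr>
  <tr><td>Baró de Viver</td><td>[41.44581467347341, 2.19899775842406]</td></tr>
  <tr><td>Can Baró</td><td>[41.4167603624773, 2.1623865539676492]</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)
scraped_rows = parser.rows
```

Unlike the cell above, this sketch keeps empty cells, so column positions stay aligned even when a cell has no text.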


## Data Manipulation

In [54]:
# split the 'Coordinates' column into two new columns, 'Latitude' and 'Longitude' (the brackets are removed in a later cell)
df[['Latitude','Longitude']] = df.Coordinates.str.split(', ', expand = True)

df.head()

Unnamed: 0,Neighborhoods,Coordinates,Latitude,Longitude
0,Baró de Viver,"[41.44581467347341, 2.19899775842406]",[41.44581467347341,2.19899775842406]
1,Can Baró,"[41.4167603624773, 2.1623865539676492]",[41.4167603624773,2.1623865539676492]
2,Can Peguera,"[41.43484212038238, 2.1664501320817235]",[41.43484212038238,2.1664501320817235]
3,Canyelles,"[41.445032990983854, 2.1634504252403164]",[41.445032990983854,2.1634504252403164]
4,Ciutat Meridiana,"[41.46120773644666, 2.1748476502321963]",[41.46120773644666,2.1748476502321963]


In [91]:
# drop the 'Coordinates' column
neighborhoods_bcn = df.drop(['Coordinates'], axis = 1)

neighborhoods_bcn.head()

Unnamed: 0,Neighborhoods,Latitude,Longitude
0,Baró de Viver,[41.44581467347341,2.19899775842406]
1,Can Baró,[41.4167603624773,2.1623865539676492]
2,Can Peguera,[41.43484212038238,2.1664501320817235]
3,Canyelles,[41.445032990983854,2.1634504252403164]
4,Ciutat Meridiana,[41.46120773644666,2.1748476502321963]


In [108]:
# special characters to remove from the dataframe
spec_chars = ["[", "]"]

# remove the brackets from the 'Latitude' and 'Longitude' columns;
# regex=False treats each character literally ("[" on its own is not a valid regular expression)
for column in ['Latitude', 'Longitude']:
    for char in spec_chars:
        neighborhoods_bcn[column] = neighborhoods_bcn[column].str.replace(char, '', regex=False)

neighborhoods_bcn.head()

Unnamed: 0,Neighborhoods,Latitude,Longitude
0,Baró de Viver,41.44581467347341,2.19899775842406
1,Can Baró,41.4167603624773,2.162386553967649
2,Can Peguera,41.43484212038238,2.1664501320817235
3,Canyelles,41.445032990983854,2.1634504252403164
4,Ciutat Meridiana,41.46120773644666,2.1748476502321963
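
The split-and-strip steps above leave the coordinates as strings. As an alternative, each `Coordinates` value has the shape of a JSON array, so it could be parsed into two floats in one pass. This is a minimal sketch using only the standard library; `parse_coordinates` is a hypothetical helper name, and the two input strings are copied from the table output above.

```python
import json


def parse_coordinates(raw):
    """Turn a string such as '[41.44, 2.19]' into a (latitude, longitude) tuple of floats."""
    latitude, longitude = json.loads(raw)
    return float(latitude), float(longitude)


raw_coordinates = [
    "[41.44581467347341, 2.19899775842406]",
    "[41.4167603624773, 2.1623865539676492]",
]

parsed = [parse_coordinates(raw) for raw in raw_coordinates]
```

In the workbook, the same helper could be applied to the column with `df['Coordinates'].apply(parse_coordinates)`, which would give numeric values ready for mapping.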


In [110]:
neighborhoods_bcn.shape

(73, 3)

Our dataset of all neighborhoods in Barcelona, along with their coordinates, is now ready for use. To estimate the potential market of each neighborhood, however, we would also like to know its size and number of inhabitants. That information is listed on the [Districts of Barcelona Wikipedia page](https://en.wikipedia.org/wiki/Districts_of_Barcelona).

In [10]:
url = 'https://en.wikipedia.org/wiki/Districts_of_Barcelona'

r = requests.get(url)
html = r.text

soup = BeautifulSoup(html, 'lxml')
table = soup.find('table',{'class':"wikitable"})
rows = table.find_all('tr')
data = []
for row in rows[1:]:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

# join the neighborhood names (column 5) and the district names (column 1)
# of the ten districts into comma-separated strings
neighborhoods = ', '.join(row[5] for row in data[:10])
districts = ', '.join(row[1] for row in data[:10])

districts

'Ciutat Vella, Eixample, Sants-Montjuïc, Les Corts, Sarrià-Sant Gervasi, Gràcia, Horta-Guinardó, Nou Barris, Sant Andreu, Sant Martí'

In [13]:
# convert the string of neighborhoods to a list
# (split the 'neighborhoods' string, not the 'data' list of rows)
neighborhoods = neighborhoods.split(", ")
neighborhoods

In [14]:
# convert the list of neighborhoods to a dataframe
neighborhoods = pd.DataFrame(neighborhoods, columns=['Neighborhoods'])
neighborhoods.head()
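
Each scraped Wikipedia row holds a district name and a comma-separated string of its neighborhoods, so the two can be paired up directly instead of being joined into separate strings. This is a minimal sketch, assuming the same column positions as the cell above (index 1 for the district, index 5 for the neighborhoods). `pair_neighborhoods` is a hypothetical helper, and `sample_rows` are toy rows with placeholder cells, not the real scraped table.

```python
# Toy rows mimicking the scraped layout (index 1: district, index 5: neighborhoods).
# The "-" cells are placeholders, not real values.
sample_rows = [
    ["1", "Nou Barris", "-", "-", "-", "Can Peguera, Canyelles, Ciutat Meridiana"],
    ["2", "Sant Andreu", "-", "-", "-", "Baró de Viver"],
]


def pair_neighborhoods(rows, district_idx=1, neighborhoods_idx=5):
    """Return one (district, neighborhood) tuple per neighborhood."""
    pairs = []
    for row in rows:
        for name in row[neighborhoods_idx].split(","):
            pairs.append((row[district_idx], name.strip()))
    return pairs


pairs = pair_neighborhoods(sample_rows)
```

The resulting list could be passed to `pd.DataFrame(pairs, columns=['District', 'Neighborhoods'])` so that every neighborhood keeps a reference to its district.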