# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera
#### Made by: Holzel, Gabriela

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

The city chosen for this analysis is <b>Buenos Aires</b>, the capital and largest city of Argentina.

Buenos Aires is the financial, industrial, and commercial hub of Argentina. The economy of the city proper alone, measured by Gross Geographic Product (adjusted for purchasing power), totaled US\\$84.7 billion (US$34,200 per capita) in 2011, nearly a quarter of Argentina's total. This makes Buenos Aires a highly competitive market, especially for anyone who wants to open an <b>Italian restaurant</b>. The goal of this project is to help a prospective stakeholder better understand the city and its restaurant market through useful insights.

Target audience:
* A business entrepreneur who wants to open a new Italian restaurant in Buenos Aires.
* Business analysts or data scientists who wish to analyze the neighborhoods of Buenos Aires using Python, Jupyter Notebook, and some machine learning techniques.
* Anyone curious about data who wants an idea of how beneficial it is to open a restaurant and what the pros and cons of this business are.

## Data <a name="data"></a>

This project uses the Foursquare API to explore all the neighborhoods in Buenos Aires. 

Furthermore, we will explore the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. 
Finally, this project uses the Folium library to visualize the neighborhoods in Buenos Aires.


 
First, we import the libraries we need and install any that are missing.

Then we find the latitude and longitude of the Buenos Aires city center using Nominatim.

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

In [2]:
from geopy.geocoders import Nominatim

In [3]:
!pip install folium
import folium

Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1


In [4]:
geolocator = Nominatim(user_agent="ca_explorer")
city ="Buenos Aires"
country ="AR"
loc = geolocator.geocode(city+','+ country)
latitude_BA = loc.latitude
longitude_BA = loc.longitude

print("The latitude of Buenos Aires is:", latitude_BA, "\nThe longitude of Buenos Aires is:", longitude_BA)

The latitude of Buenos Aires is: -34.6075682 
The longitude of Buenos Aires is: -58.4370894
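
`geocode` returns `None` when Nominatim finds no match, in which case reading `loc.latitude` would raise an `AttributeError`. A minimal sketch of a guard follows. The helper name `geocode_or_fail` is hypothetical, and it accepts any object with a `geocode(query)` method, so the stub below stands in for `Nominatim` and no network call is made:

```python
from types import SimpleNamespace


def geocode_or_fail(geolocator, query):
    """Return (latitude, longitude) for query, or raise if nothing is found."""
    loc = geolocator.geocode(query)
    if loc is None:  # geopy returns None when there is no match
        raise ValueError(f"No geocoding result for {query!r}")
    return loc.latitude, loc.longitude


class StubGeolocator:
    """Stand-in for Nominatim so the example runs offline (illustrative only)."""

    def geocode(self, query):
        if query == "Buenos Aires,AR":
            return SimpleNamespace(latitude=-34.6075682, longitude=-58.4370894)
        return None


print(geocode_or_fail(StubGeolocator(), "Buenos Aires,AR"))
```

With the real geolocator, the call would be `geocode_or_fail(geolocator, city + ',' + country)`.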


We will now scrape the neighborhoods of Buenos Aires from a web page and build a dataframe from them, using BeautifulSoup to parse the HTML.

In [5]:
URL = 'https://www.coordenadas.com.es/argentina/pueblos-de-ciudad-buenos-aires/7/1'
res = requests.get(URL).text
soup = BeautifulSoup(res,'lxml')
print(soup)

<!DOCTYPE html>
<html lang="es-ES">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="es" http-equiv="content-language"/>
<link href="https://www.coordenadas.com.es/css/style.css" rel="stylesheet"/>
<link href="https://www.coordenadas.com.es/argentina/pueblos-de-ciudad-buenos-aires/7/1" rel="canonical"/>
<title>Coordenadas, longitud y latitud de Ciudad-Buenos-Aires Argentina Pagina 1</title>
<meta content="Coordenadas pueblos de  Ciudad-Buenos-Aires Argentina  de los pueblos de la provincia de Ciudad-Buenos-Aires 1 " name="description"/>
<meta content="coordenadas geograficas, longitud, latitud, geolocalizar " name="keywords"/>
<link href="https://www.coordenadas.com.es/favicon.ico" rel="shortcut icon"/>
<meta content="all" name="googlebot"/>
<meta content="index" name="googlebot"/>
<meta content="follow" name="googlebot"/>
<meta content="all" name="robots"/>
<meta content="index" name="robots"/>
<meta content="follow"

In [7]:
r = requests.get('https://www.coordenadas.com.es/argentina/pueblos-de-ciudad-buenos-aires/7/1')
soup = BeautifulSoup(r.text, 'lxml')

# Neighborhood names are the text nodes of the links at positions 6..91
# (the slice bounds are specific to this page's layout).
Neighborhood = []
for link in soup.find_all('a')[6:92]:
    for child in link.descendants:
        Neighborhood.append(child)

# Flatten every node found inside the table cells.
Coord = []
for cell in soup.find_all('td'):
    for child in cell.descendants:
        Coord.append(child)

# The "lat,lon" string is every fourth entry, starting at index 3.
Coordinates = Coord[3::4]

df = pd.DataFrame({'Neighborhoods': Neighborhood, 'Coordinates': Coordinates})
df.head(15)

| | Neighborhoods | Coordinates |
|---|---|---|
| 0 | Agronomia | -34.6,-58.48333 |
| 1 | Almagro | -34.6,-58.41667 |
| 2 | Almirante-Brown | -34.66667,-58.45 |
| 3 | Balbastro | -34.65,-58.46667 |
| 4 | Balvanera | -34.61018,-58.40654 |
| 5 | Barracas | -34.65,-58.36667 |
| 6 | Barrio-Norte | -34.58333,-58.4 |
| 7 | Belgrano | -34.56667,-58.46667 |
| 8 | Boca | -34.63333,-58.35 |
| 9 | Boedo | -34.63333,-58.41667 |
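
The coordinate extraction relies on the flattened cell list repeating in groups of four, with the `lat,lon` string in the fourth slot. The sketch below illustrates that index arithmetic on placeholder cells. The first three slots of each group are made up, and only the coordinate strings come from the output above. It shows that a stride slice selects the same entries as a `try`/`except` loop over `4*i + 3`:

```python
# Placeholder flattened cells: three filler entries, then the "lat,lon" string.
cells = [
    "a1", "b1", "c1", "-34.6,-58.48333",
    "a2", "b2", "c2", "-34.6,-58.41667",
    "a3", "b3", "c3", "-34.66667,-58.45",
]

# Loop form: index 4*i + 3, skipping indices that run past the end.
picked_loop = []
for i in range(len(cells)):
    try:
        picked_loop.append(cells[4 * i + 3])
    except IndexError:
        pass

# Slice form: start at index 3 and take every fourth element.
picked_slice = cells[3::4]

print(picked_slice)
```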


We will now make a few modifications to the dataframe.

In [8]:
df[['Latitude','Longitude']] = df.Coordinates.str.split(",",expand=True)
df.drop('Coordinates', inplace=True, axis=1)
df['Neighborhoods'] = df['Neighborhoods'].str.replace('-',' ')
df['Longitude'] = df['Longitude'].astype(float).round(4)
df['Latitude'] = df['Latitude'].astype(float).round(4)
df.head(15)

| | Neighborhoods | Latitude | Longitude |
|---|---|---|---|
| 0 | Agronomia | -34.6 | -58.4833 |
| 1 | Almagro | -34.6 | -58.4167 |
| 2 | Almirante Brown | -34.6667 | -58.45 |
| 3 | Balbastro | -34.65 | -58.4667 |
| 4 | Balvanera | -34.6102 | -58.4065 |
| 5 | Barracas | -34.65 | -58.3667 |
| 6 | Barrio Norte | -34.5833 | -58.4 |
| 7 | Belgrano | -34.5667 | -58.4667 |
| 8 | Boca | -34.6333 | -58.35 |
| 9 | Boedo | -34.6333 | -58.4167 |
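
The split-and-round step can be checked on a single value without pandas. A minimal sketch, where `split_coordinates` is a hypothetical helper name and the input is the first `Coordinates` value shown earlier:

```python
def split_coordinates(text, ndigits=4):
    """Split a "lat,lon" string into two floats rounded to ndigits places."""
    lat_text, lon_text = text.split(",")
    return round(float(lat_text), ndigits), round(float(lon_text), ndigits)


print(split_coordinates("-34.6,-58.48333"))
```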




Let's see the result on a map!

In [9]:
map_BA = folium.Map(location=[latitude_BA, longitude_BA], zoom_start=12)

for lat, lng, neigh in zip(df['Latitude'], df['Longitude'], df['Neighborhoods']):
    # One unfilled blue circle per neighborhood; clicking it shows the name.
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=folium.Popup(neigh, parse_html=True),
        color='blue',
        fill=False).add_to(map_BA)

map_BA

## Foursquare

Now that we know where the neighborhoods are located, we will use the Foursquare API to get information on the restaurants in each of them.

We will include only venues whose category name contains 'restaurant' or a similar keyword, and we'll make sure to detect all the subcategories of the specific 'Italian restaurant' category, since we need information on the Italian restaurants in each neighborhood.


In [11]:
!pip install shapely
import shapely.geometry

Collecting shapely
  Downloading Shapely-1.7.1-cp37-cp37m-manylinux1_x86_64.whl (1.0 MB)
Installing collected packages: shapely
Successfully installed shapely-1.7.1


In [12]:
!pip install pyproj
import pyproj

Collecting pyproj
  Downloading pyproj-3.0.1-cp37-cp37m-manylinux2010_x86_64.whl (6.5 MB)
Installing collected packages: pyproj
Successfully installed pyproj-3.0.1


In [13]:
import math

In [14]:
# Buenos Aires (about 58.4° W, 34.6° S) lies in UTM zone 21 South (EPSG:32721).
# always_xy=True keeps the (lon, lat) / (x, y) argument order.
_to_xy = pyproj.Transformer.from_crs('EPSG:4326', 'EPSG:32721', always_xy=True)
_to_lonlat = pyproj.Transformer.from_crs('EPSG:32721', 'EPSG:4326', always_xy=True)

def lonlat_to_xy(lon, lat):
    """Project WGS84 longitude/latitude to UTM coordinates in metres."""
    return _to_xy.transform(lon, lat)

def xy_to_lonlat(x, y):
    """Convert UTM coordinates in metres back to WGS84 longitude/latitude."""
    return _to_lonlat.transform(x, y)

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)
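
A UTM projection is only accurate inside its own zone, so the zone used for the projection should match the city's longitude. The sketch below applies the standard zone formula and derives the matching WGS84 EPSG code (326xx for the northern hemisphere, 327xx for the southern). The helper names are hypothetical, and the special zones around Norway and Svalbard are ignored:

```python
import math


def utm_zone(lon):
    """UTM zone number (1..60) for a longitude in degrees."""
    return int(math.floor((lon + 180.0) / 6.0)) % 60 + 1


def utm_epsg(lon, lat):
    """EPSG code of the WGS84 / UTM zone containing (lon, lat)."""
    base = 32600 if lat >= 0 else 32700
    return base + utm_zone(lon)


# Buenos Aires city center, as returned by Nominatim earlier in the notebook.
print(utm_zone(-58.4370894), utm_epsg(-58.4370894, -34.6075682))
```

For the city center this yields zone 21 in the southern hemisphere, that is, EPSG:32721.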

In [50]:
import os

# Credentials are read from environment variables so that they are neither stored
# nor printed in the notebook (the variable names below are an assumption).
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Credentials loaded.')

Credentials loaded.


In [54]:
food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues

italian_restaurant_categories = ['4bf58dd8d48988d110941735','55a5a1ebe4b013909087cbb6','55a5a1ebe4b013909087cb7c',
                                 '55a5a1ebe4b013909087cba7','55a5a1ebe4b013909087cba1','55a5a1ebe4b013909087cba4',
                                 '55a5a1ebe4b013909087cb95','55a5a1ebe4b013909087cb89','55a5a1ebe4b013909087cb9b',
                                 '55a5a1ebe4b013909087cb98','55a5a1ebe4b013909087cbbf','55a5a1ebe4b013909087cb79',
                                 '55a5a1ebe4b013909087cbb0','55a5a1ebe4b013909087cbb3','55a5a1ebe4b013909087cb74',
                                 '55a5a1ebe4b013909087cbaa','55a5a1ebe4b013909087cb83','55a5a1ebe4b013909087cb8c',
                                 '55a5a1ebe4b013909087cb92','55a5a1ebe4b013909087cb8f','55a5a1ebe4b013909087cb86',
                                 '55a5a1ebe4b013909087cbb9','55a5a1ebe4b013909087cb7f','55a5a1ebe4b013909087cbbc',
                                 '55a5a1ebe4b013909087cb9e','55a5a1ebe4b013909087cbc2','55a5a1ebe4b013909087cbad',
                                 '52af3a5e3cf9994f4e043bea','52af3a723cf9994f4e043bec','52af3a7c3cf9994f4e043bed',
                                 '58daa1558bbb0b01f18ec1d3','52af3a673cf9994f4e043beb','52af3a903cf9994f4e043bee',
                                 '4bf58dd8d48988d1f5931735','52af3a9f3cf9994f4e043bef','52e81612bcbc57f1066b79ff',
                                 '4bf58dd8d48988d16e941735']

def is_restaurant(categories, specific_filter=None):
    restaurant_words = ['restaurant', 'diner', 'taverna', 'steakhouse','place','pub','house']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        if 'fast food' in category_name:
            restaurant = False
        if specific_filter is not None and category_id in specific_filter:
            specific = True
            restaurant = True
    return restaurant, specific

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

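def format_address(location):
    # Assumed helper: joins the 'formattedAddress' lines that Foursquare v2
    # returns in a venue's 'location' object.
    return ', '.join(location.get('formattedAddress', []))

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]
    except (requests.RequestException, ValueError, KeyError, IndexError) as err:
        # Network failure, non-JSON body, or an unexpected response shape.
        print('Foursquare request failed:', err)
        venues = []
    return venues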
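
The query string above is assembled with `str.format`. One alternative, sketched here, keeps the parameters in a dictionary, which can be URL-encoded with the standard library or passed to `requests.get(..., params=...)`. The function name `explore_params` and the dummy credentials are illustrative. The parameter names are the ones used in the URL above:

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def explore_params(lat, lon, category, client_id, client_secret,
                   radius=500, limit=100, version="20180724"):
    """Build the query parameters for a venues/explore request."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lon}",
        "categoryId": category,
        "radius": radius,
        "limit": limit,
    }


params = explore_params(-34.6, -58.4167, "4d4b7105d754a06374d81259",
                        "DUMMY_ID", "DUMMY_SECRET")
# safe="," keeps the comma in "ll" readable instead of encoding it as %2C.
url = EXPLORE_URL + "?" + urlencode(params, safe=",")
print(url)
```

Keeping the parameters in one dictionary also makes it easier to log a request without the secret, by dropping that key before printing.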

In [52]:
import pickle

In [61]:
def get_restaurants(lats, lons):
    restaurants = {}
    italian_restaurants = {}
    location_restaurants = []

    print('Obtaining venues around candidate locations:', end='')
    for lat, lon in zip(lats, lons):
        venues = get_venues_near_location(lat, lon, food_category, CLIENT_ID, CLIENT_SECRET, radius=500, limit=100)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_res, is_italian = is_restaurant(venue_categories, specific_filter=italian_restaurant_categories)
            if is_res:
                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_italian, x, y)
                if venue_distance<=470:
                    area_restaurants.append(restaurant)
                restaurants[venue_id] = restaurant
                if is_italian:
                    italian_restaurants[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, italian_restaurants, location_restaurants

# Try to load from local file system in case we did this before
restaurants = {}
italian_restaurants = {}
location_restaurants = []
loaded = False
try:
    with open('restaurants_350.pkl', 'rb') as f:
        restaurants = pickle.load(f)
    with open('italian_restaurants_350.pkl', 'rb') as f:
        italian_restaurants = pickle.load(f)
    with open('location_restaurants_350.pkl', 'rb') as f:
        location_restaurants = pickle.load(f)
    print('Restaurant data loaded.')
    loaded = True
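except (OSError, pickle.UnpicklingError, EOFError):
    # No usable cache on disk; the API download below will run instead.
    pass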

# If load failed use the Foursquare API to get the data
if not loaded:
    restaurants, italian_restaurants, location_restaurants = get_restaurants(list(df['Latitude']), list(df['Longitude']))
    
    # Let's persist this in the local file system
    with open('restaurants_350.pkl', 'wb') as f:
        pickle.dump(restaurants, f)
    with open('italian_restaurants_350.pkl', 'wb') as f:
        pickle.dump(italian_restaurants, f)
    with open('location_restaurants_350.pkl', 'wb') as f:
        pickle.dump(location_restaurants, f)

Restaurant data loaded.
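
The load-or-download pattern used above can be factored into a small reusable helper. A sketch, assuming a hypothetical function name `load_or_compute`. The `expensive` function stands in for the Foursquare download, and a temporary directory is used so that the example leaves no files behind:

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the object pickled at path, or compute it and cache it there."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (OSError, pickle.UnpicklingError, EOFError):
        result = compute()
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result


calls = []


def expensive():
    """Stand-in for the Foursquare download; records each call."""
    calls.append(1)
    return {"venue-1": ("Some restaurant", True)}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "restaurants.pkl")
    first = load_or_compute(cache_file, expensive)   # computes and writes the cache
    second = load_or_compute(cache_file, expensive)  # reads the cache

print(first == second, len(calls))
```

Note that a cached file is reused even if the search radius or category list changes, so the cache should be deleted whenever those inputs change.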


In [62]:
print('Total number of restaurants:', len(restaurants))
print('Total number of Italian restaurants:', len(italian_restaurants))
print('Percentage of Italian restaurants: {:.2f}%'.format(len(italian_restaurants) / len(restaurants) * 100))
print('Average number of restaurants in neighborhood:', np.array([len(r) for r in location_restaurants]).mean())

Total number of restaurants: 208
Total number of Italian restaurants: 24
Percentage of Italian restaurants: 11.54%
Average number of restaurants in neighborhood: 2.4302325581395348


We can see that 11.54% of the restaurants are Italian. Let's see the result on a map! The red circles represent Italian restaurants and the blue ones represent other types of restaurants.

In [63]:
map_BA = folium.Map(location=[latitude_BA, longitude_BA], zoom_start=13)
folium.Marker([latitude_BA, longitude_BA], popup='Buenos Aires').add_to(map_BA)
for res in restaurants.values():
    lat = res[2]; lon = res[3]
    is_italian = res[6]
    color = 'red' if is_italian else 'blue'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_BA)
map_BA

## Methodology <a name="methodology"></a>

For this report I used several maps that can help a new investor choose the best neighborhood. To do so, I combined the data gathered above with maps that show where the Italian restaurants are located.

We will now analyze the data to determine the restaurant density across the city of Buenos Aires.


## Analysis <a name="analysis"></a>

First, we add a new column to our dataframe showing how many restaurants are in each neighborhood.

In [65]:
location_restaurants_count = [len(res) for res in location_restaurants]
df_locations = df.copy()  # copy so the original dataframe is left unchanged
df_locations['Restaurants in area'] = location_restaurants_count

print('The average number of restaurants in each area (500 m radius) is:', np.array(location_restaurants_count).mean())

df_locations.head(10)

The average number of restaurants in each area (500 m radius) is: 2.4302325581395348


| | Neighborhoods | Latitude | Longitude | Restaurants in area |
|---|---|---|---|---|
| 0 | Agronomia | -34.6 | -58.4833 | 0 |
| 1 | Almagro | -34.6 | -58.4167 | 14 |
| 2 | Almirante Brown | -34.6667 | -58.45 | 0 |
| 3 | Balbastro | -34.65 | -58.4667 | 0 |
| 4 | Balvanera | -34.6102 | -58.4065 | 4 |
| 5 | Barracas | -34.65 | -58.3667 | 0 |
| 6 | Barrio Norte | -34.5833 | -58.4 | 11 |
| 7 | Belgrano | -34.5667 | -58.4667 | 4 |
| 8 | Boca | -34.6333 | -58.35 | 0 |
| 9 | Boedo | -34.6333 | -58.4167 | 4 |
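
The counts above rely on the `distance` field returned by Foursquare. As a cross-check, the same kind of count can be computed directly from coordinates with the haversine formula. This is a sketch that assumes a spherical Earth with a mean radius of 6,371 km. The function names and the sample venue points are made up for illustration:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def count_within(center, points, radius_m):
    """Number of points no farther than radius_m from center."""
    return sum(haversine_m(*center, *p) <= radius_m for p in points)


almagro = (-34.6, -58.4167)          # neighborhood center from the table
sample_venues = [                    # hypothetical venue coordinates
    (-34.6010, -58.4170),            # roughly 115 m away
    (-34.6030, -58.4200),            # roughly 450 m away
    (-34.6100, -58.4300),            # well over 1 km away
]
print(count_within(almagro, sample_venues, 500))
```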


In [66]:
restaurant_latlons = [[res[2], res[3]] for res in restaurants.values()]
italian_latlons = [[res[2], res[3]] for res in italian_restaurants.values()]

In [67]:
from folium.plugins import HeatMap

map_BA = folium.Map(location=[latitude_BA, longitude_BA], zoom_start=12)

for lat, lng, neigh in zip(df['Latitude'], df['Longitude'], df['Neighborhoods']):
    # Mark each neighborhood center; clicking the circle shows its name.
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=folium.Popup(neigh, parse_html=True),
        color='blue',
        fill=False).add_to(map_BA)

# Light basemap so the heat map stands out ('cartodbdark_matter' is a dark alternative).
folium.TileLayer('cartodbpositron').add_to(map_BA)
HeatMap(restaurant_latlons).add_to(map_BA)
folium.Marker([latitude_BA, longitude_BA]).add_to(map_BA)
folium.Circle([latitude_BA, longitude_BA], radius=1000, fill=False, color='white').add_to(map_BA)
folium.Circle([latitude_BA, longitude_BA], radius=2000, fill=False, color='white').add_to(map_BA)
folium.Circle([latitude_BA, longitude_BA], radius=3000, fill=False, color='white').add_to(map_BA)
map_BA

## Results and Discussion <a name="results"></a>

We have reached the end of the analysis. I tried to set up a realistic data-analysis scenario using several techniques and tools: web scraping with BeautifulSoup, geocoding with geopy, mapping with Folium, the Foursquare API, and others.

Now we can discuss our findings:
* There are certain areas with a high density of Italian restaurants, such as Puerto Madero or Recoleta.
* There are other areas, such as Parque Patricios or Caballito, where that density is medium.
* Lastly, there are areas like Retiro or Villa del Parque, where there are few or no Italian restaurants.

We suggest that stakeholders consider opening a new Italian restaurant in the areas with low restaurant density.
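
To make the recommendation concrete, neighborhoods can be ranked by their restaurant count. The sketch below uses only the ten rows shown in the Analysis table. The full dataframe has more neighborhoods, and a count of zero may also reflect sparse Foursquare coverage and not a real absence of restaurants:

```python
# (neighborhood, restaurants in area) for the ten rows shown in the Analysis table
counts = [
    ("Agronomia", 0), ("Almagro", 14), ("Almirante Brown", 0), ("Balbastro", 0),
    ("Balvanera", 4), ("Barracas", 0), ("Barrio Norte", 11), ("Belgrano", 4),
    ("Boca", 0), ("Boedo", 4),
]

# Sort ascending by count; ties keep their original (alphabetical) order.
ranked = sorted(counts, key=lambda row: row[1])
lowest = [name for name, n in ranked if n == ranked[0][1]]
highest = ranked[-1]

print("Lowest density:", lowest)
print("Highest density:", highest)
```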

## Conclusion <a name="conclusion"></a>

In this project I got a small glimpse of what real-life data science projects look like. I used some common Python libraries to scrape web data, used the Foursquare API to explore the neighborhoods of Buenos Aires, and visualized the results with Folium maps. The analysis shows how this kind of approach can be applied to a real-life business problem.

Because the analysis was performed on a small set of data, better results could be achieved with more neighborhood information. Even so, Buenos Aires is an international city with room for many types of new restaurant businesses, and we have gone through the process of identifying the business problem, specifying the data required, cleaning the datasets, and providing some useful tips to our stakeholder.