# Capstone Project - The Battle of Neighborhoods (Week 1)
This is the first part of the Week 1 Capstone Project. In this notebook I describe the problem at hand,
explain why it matters, and explain how and where I'll retrieve the data needed to accomplish my objective.
The assignment asks me to explore geographical locations (New York, Toronto, and another city of my choice)
using Foursquare location data, and to be creative in finding a problem that this approach can solve.

## 1. Introduction
Everyone has faced, or will face, a difficult decision at some point in life: whether or not to get married, to move
out of your parents' house into your own place, to have kids, to enroll in that awesome course, etc.

The situation I present may be a particularly difficult one because of the challenge it poses. Let's
imagine that you, an entrepreneur and creative professional keen on producing innovative products, decide (along with
some good and daring colleagues) to start a StartUp company. You are offered some help to establish an office, but the offer
covers only 3 cities, all of them away from your hometown and from your colleagues' hometowns.

Let's define our problem further.
### 1.1 Problem Definition
The 3 cities are major, diverse hubs: each is the main financial capital of its country and full of opportunities.
They are New York (USA), Toronto (Canada), and Lisbon (Portugal).

To complicate things, you are an American citizen and your colleagues are citizens of Central Europe. You
first became acquainted at a major meeting related to your area of expertise.

Your StartUp, and its work, may be defined by the following keywords:
* Innovative;
* Culturally impacting;
* Environment friendly;
* Information Technology;
* Game changing;
* Daring designs and approaches;

You and your colleagues have the following concerns and needs (in no particular order):
* Affordability of office and living;
* Good weather;
* Multicultural environment;
* Near essential commodities;
* Near places to relax and meet people (good night life, sightseeing, etc.);
* Safety, good overall security;
* A buzzing city, but without stressful situations;
* Easy to move around, in and out, without your own vehicle;
* Not very far from your hometown;

Now let's look at the cities' background.

### 1.2 Objective
My objective is to use Data Science and Machine Learning to recommend the best city and neighborhood in which to set up
the StartUp company, based on the factors specified above.


## 2. Data
All data used in the project is either stored in the resources folder of the GitHub repository associated with this project
or available on the internet.

I'll use the New York data from the Coursera clustering lab (https://geo.nyu.edu/catalog/nyu_2451_34572; https://cocl.us/new_york_dataset)
and the data from the previous project on clustering Toronto
neighborhoods. The data for the city of Lisbon was retrieved from official Portuguese government open data.
I'll also use data from the "Nomad List" (https://nomadlist.com/) to rank the cities so I can put similarities and dissimilarities
in perspective when comparing their neighborhoods.

In [1]:
# import of relevant packages
import pandas as pd
import urllib.request
import json
from pandas import json_normalize  # top-level location since pandas 1.0; the pandas.io.json path is deprecated
import re
from geopy.geocoders import Nominatim
import requests
from bs4 import BeautifulSoup as bs
import time
import random
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
print('Libraries imported...')

Libraries imported...


### 2.1 Loading of New York neighborhood and location data
This data was first retrieved in the Coursera lab that clustered the neighborhoods of New York. It is stored in JSON format.

In [2]:
with open('resources_battle_n/newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
# assessing the data from the json file
json_normalize(newyork_data['features']).head()

Unnamed: 0,geometry.coordinates,geometry.type,geometry_name,id,properties.annoangle,properties.annoline1,properties.annoline2,properties.annoline3,properties.bbox,properties.borough,properties.name,properties.stacked,type
0,"[-73.84720052054902, 40.89470517661]",Point,geom,nyu_2451_34572.1,0.0,Wakefield,,,"[-73.84720052054902, 40.89470517661, -73.84720...",Bronx,Wakefield,1,Feature
1,"[-73.82993910812398, 40.87429419303012]",Point,geom,nyu_2451_34572.2,0.0,Co-op,City,,"[-73.82993910812398, 40.87429419303012, -73.82...",Bronx,Co-op City,2,Feature
2,"[-73.82780644716412, 40.887555677350775]",Point,geom,nyu_2451_34572.3,0.0,Eastchester,,,"[-73.82780644716412, 40.887555677350775, -73.8...",Bronx,Eastchester,1,Feature
3,"[-73.90564259591682, 40.89543742690383]",Point,geom,nyu_2451_34572.4,0.0,Fieldston,,,"[-73.90564259591682, 40.89543742690383, -73.90...",Bronx,Fieldston,1,Feature
4,"[-73.9125854610857, 40.890834493891305]",Point,geom,nyu_2451_34572.5,0.0,Riverdale,,,"[-73.9125854610857, 40.890834493891305, -73.91...",Bronx,Riverdale,1,Feature


In [3]:
# Create DataFrame of NY
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']
rows = []  # collect one dict per neighborhood and build the DataFrame once (DataFrame.append was removed in pandas 2.0)

for data in newyork_data['features']:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_lat_lon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_lat_lon[1]
    neighborhood_lon = neighborhood_lat_lon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

NY_neighborhoods = pd.DataFrame(rows, columns=column_names)
# Check if data was correctly loaded
NY_neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [4]:
# Save the dataframe as a csv file in the resources folder
NY_neighborhoods.to_csv('resources_battle_n/NY_neighborhoods.csv')

### 2.2 Loading of Toronto neighborhood and location data
This data was first retrieved in the previous project of this Data Science Capstone, which clustered the
neighborhoods in Toronto.

In [5]:
TO_neighborhoods = pd.read_csv('resources_battle_n/PC_N_Toronto_coord.csv')\
                       .iloc[:,2:].rename(columns={'Neigborhood':'Neighborhood'}) # selection of only the correct labels
TO_neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,Scarborough,Woburn,43.770992,-79.216917
4,Scarborough,Cedarbrae,43.773136,-79.239476


In [6]:
# Save the dataframe as a csv file in the resources folder
TO_neighborhoods.to_csv('resources_battle_n/TO_neighborhoods.csv')

### 2.3 Loading of Greater Lisbon neighborhood and location data
This data was retrieved from official Portuguese government URLs. The city of Lisbon is part of a greater area, so
I'll aggregate the data from Lisbon and the adjacent municipalities that make up the Lisbon Metropolitan Area, namely:
Alcochete; Almada; Amadora; Barreiro; Cascais; Loures; Moita; Montijo; Odivelas; Oeiras; Seixal; Sintra; Vila Franca de Xira.

For the sake of simplicity, I'll call these municipalities Boroughs. From the administrative point of view this is incorrect, but
the categorization resembles the administrative division in New York and Toronto, which makes the comparison in
the later analysis more consistent.

In [7]:
# We'll start by getting the data from the City of Lisbon
lx_url = 'https://services.arcgis.com/1dSrzEWVQn5kHHyK/arcgis/rest/services/Administracao_Publica/FeatureServer/1/query?where=1%3D1&outFields=OBJECTID,NOME&outSR=4326&f=json'
with urllib.request.urlopen(lx_url) as url:
    lx_data = json.load(url)
json_normalize(lx_data['features']).head()

Unnamed: 0,attributes.NOME,attributes.OBJECTID,geometry.x,geometry.y
0,Junta de Freguesia das Avenidas Novas,1,-9.148398,38.730269
1,Junta de Freguesia do Beato,2,-9.124203,38.736438
2,Junta de Freguesia da Estrela,3,-9.168894,38.711975
3,Junta de Freguesia de Benfica,4,-9.204372,38.73417
4,Junta de Freguesia de Arroios,5,-9.14437,38.731079


In [8]:
# let's save it to a Dataframe
rows = []  # one dict per parish; the DataFrame is built once after the loop

for data in lx_data['features']:
    borough = 'Lisboa'
    # regex to clean the name of neighborhoods: drops the "Junta de Freguesia de/do/da/das" prefix
    # ([A-Za-z] replaces [A-z], which also matched the punctuation between 'Z' and 'a')
    neighborhood_name = re.sub(r'^[A-Za-z]*\s[a-z]*\s[A-Za-z]*\s[a-z]*\s', '',
                               data['attributes']['NOME'])

    neighborhood_lat = data['geometry']['y']
    neighborhood_lon = data['geometry']['x']

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

LX_neighborhoods = pd.DataFrame(rows, columns=['Borough', 'Neighborhood', 'Latitude', 'Longitude'])
# Check if data was correctly loaded
LX_neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Lisboa,Avenidas Novas,38.730269,-9.148398
1,Lisboa,Beato,38.736438,-9.124203
2,Lisboa,Estrela,38.711975,-9.168894
3,Lisboa,Benfica,38.73417,-9.204372
4,Lisboa,Arroios,38.731079,-9.14437


Now that we have the information for central Lisbon, we can do the same for the adjacent municipalities: extract their
postal codes (Portugal) and query the geolocator API for their coordinates.

In [9]:
# We'll start by getting the data from all the municipalities of portugal
pt_nbh_url = 'https://dados.gov.pt/pt/datasets/r/2266425a-18ca-44a8-8655-9c39624c0ccb'
with urllib.request.urlopen(pt_nbh_url) as url:
    pt_nbh_url = json.load(url)

In [10]:
# Check the needed columns and relevant data
json_normalize(pt_nbh_url['d']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3091 entries, 0 to 3090
Data columns (total 21 columns):
PartitionKey     3091 non-null object
RowKey           3091 non-null object
Timestamp        3091 non-null object
areaha           3091 non-null object
codigo           3091 non-null object
codigoine        3091 non-null object
codigopostal     3091 non-null object
descrpostal      3091 non-null object
eleitores2011    3091 non-null object
email            3091 non-null object
entidade         3091 non-null object
entityid         3091 non-null object
fax              3091 non-null object
localidade       3091 non-null object
nif              3091 non-null object
nomecompleto     3091 non-null object
populacao2011    3091 non-null object
rua              3091 non-null object
sitio            3091 non-null object
telefone         3091 non-null object
tipoentidade     3091 non-null object
dtypes: object(21)
memory usage: 507.2+ KB


In [11]:
# let's save the data from 'entidade' and 'codigopostal' to a DataFrame
# I only want the municipalities named before
m_target = ['ALCOCHETE', 'ALMADA', 'AMADORA', 'BARREIRO', 'CASCAIS', 'LOURES',
            'MOITA', 'MONTIJO', 'ODIVELAS', 'OEIRAS', 'SEIXAL', 'SINTRA', 'VILA FRANCA DE XIRA']
rows = []  # one dict per entry; the DataFrame is built once after the loop

for data in pt_nbh_url['d']:
    borough = re.sub(r'^(.)*(?=\s\(.)\s\(', '', # regex to keep only the text inside the parentheses
                     data['entidade']).strip(')')
    neighborhood_name = re.sub(r'\s\((.*)', '', # regex to drop the parenthesized suffix
                               data['entidade'])
    postalcode = data['codigopostal']

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'PC': postalcode})

PT_neighborhoods = pd.DataFrame(rows, columns=['Borough', 'Neighborhood', 'PC'])
PT_neighborhoods = PT_neighborhoods[PT_neighborhoods['Borough'].isin(m_target)].reset_index(drop=True)
# Check the resulting DataFrame
PT_neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,PC
0,SINTRA,Agualva e Mira-Sintra,2735-054
1,AMADORA,Águas Livres,2720-797
2,CASCAIS,Alcabideche,2645-060
3,ALCOCHETE,Alcochete,2890-017
4,AMADORA,Alfragide,2610-123
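
The two substitutions used above to split `entidade` can also be written as a single pattern with named groups. This is a minimal sketch, assuming the labels follow the form `<neighborhood> (<MUNICIPALITY>)` that the regexes above imply; the sample strings and the helper name `split_entity` are illustrative and not part of the original notebook.

```python
import re

# Assumed label format: "<neighborhood> (<MUNICIPALITY>)"
ENTITY_PATTERN = re.compile(r"^(?P<neighborhood>.*)\s\((?P<borough>[^()]*)\)$")

def split_entity(label):
    """Return (borough, neighborhood); borough is None when the label has no suffix."""
    match = ENTITY_PATTERN.match(label)
    if match is None:
        return None, label
    return match.group("borough"), match.group("neighborhood")

# Illustrative labels, not copied from the dataset
print(split_entity("Agualva e Mira-Sintra (SINTRA)"))
print(split_entity("Alcochete"))
```

Returning `None` for labels without a parenthesized suffix makes unexpected rows easy to spot before the `isin(m_target)` filter drops them silently.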


In [12]:
# Let's try to use the geopy.geocoders package to search for coordinates based on portuguese postal code
geolocator = Nominatim(user_agent='pt_coordinates')
pc = '2735-054'
borough ="SINTRA"
loc = geolocator.geocode(pc + ',' + borough)
print("latitude is :" ,loc.latitude,"\nlongitude is:" ,loc.longitude)

latitude is : 38.79846 
longitude is: -9.3881


It works fine, so I'll use this approach to find the coordinates for every row of the previous DataFrame.

In [13]:
# Build a function to extract and store coordinates in a DataFrame
def find_coordinates_pt(place, nbh, postal_code, df):
    gl = Nominatim(user_agent='pt_coordinates')  # one geocoder reused for every row
    list_ex = [r',.*', r'(\se\s).*']  # list of regex to shorten the string from nbh
    lats, lons = [], []
    for n, post in zip(df[nbh], df[postal_code]):  # for every line
        # first try "postalcode, nbh"; if that fails, try each shortened name followed by ", Portugal"
        queries = [post + ',' + n] + [re.sub(ex, '', n) + ', Portugal' for ex in list_ex]
        lat, lon = np.nan, np.nan  # left blank (NaN) if no query returns coordinates
        for q in queries:
            try:
                l = gl.geocode(q)
            except Exception:  # service or network error: move on to the next query
                l = None
            if l is not None:
                lat, lon = l.latitude, l.longitude
                break
        lats.append(lat)
        lons.append(lon)
    # attach the coordinates row by row; this avoids the duplicated rows that a merge
    # on postal code produces when two neighborhoods share the same code
    out = df.copy()
    out['Latitude'] = lats
    out['Longitude'] = lons
    return out
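
Because the function above sends at least one request per row, it helps to space the calls out and to avoid repeating identical queries; the public Nominatim service asks for no more than one request per second. The wrapper below is a minimal sketch of that idea. `make_throttled` and `fake_geocode` are hypothetical names introduced here, and the fake geocoder only stands in for `geolocator.geocode` so the example runs without network access.

```python
import time

def make_throttled(lookup, min_delay_seconds=1.0, sleep=time.sleep, clock=time.monotonic):
    """Wrap `lookup` so real calls are spaced out and repeated queries come from a cache."""
    cache = {}
    last_call = [None]  # time of the previous real lookup, None before the first one

    def throttled(query):
        if query in cache:
            return cache[query]
        if last_call[0] is not None:
            wait = min_delay_seconds - (clock() - last_call[0])
            if wait > 0:
                sleep(wait)
        last_call[0] = clock()
        cache[query] = lookup(query)  # failed lookups (None) are cached too
        return cache[query]

    return throttled

# Stand-in for geolocator.geocode (assumption: returns None when nothing is found)
calls = []
def fake_geocode(query):
    calls.append(query)
    return (38.79846, -9.3881) if query == "2735-054,SINTRA" else None

geocode = make_throttled(fake_geocode, min_delay_seconds=0.01)
print(geocode("2735-054,SINTRA"))
print(geocode("2735-054,SINTRA"))  # second call is answered from the cache
print(len(calls))
```

geopy also ships a ready-made `RateLimiter` in `geopy.extra.rate_limiter`, which would be the more natural choice in the notebook itself; the hand-written wrapper is shown only to make the mechanism visible.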

In [14]:
# Let's create our final dataset for the locations of the Greater Lisbon Area
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
GLX_neighborhoods = pd.concat([LX_neighborhoods,
                               find_coordinates_pt('Borough',
                                                   'Neighborhood',
                                                   'PC',
                                                   PT_neighborhoods).drop('PC', axis=1)],
                              sort=False)

Now our final dataframe is built, so let's take a look at it:

In [15]:
GLX_neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Lisboa,Avenidas Novas,38.730269,-9.148398
1,Lisboa,Beato,38.736438,-9.124203
2,Lisboa,Estrela,38.711975,-9.168894
3,Lisboa,Benfica,38.73417,-9.204372
4,Lisboa,Arroios,38.731079,-9.14437


In [16]:
# Save the dataframe as a csv file in the resources folder
GLX_neighborhoods.to_csv('resources_battle_n/GLX_neighborhoods.csv')
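
Since the three location tables now share the same four columns, they can be stacked into one table with a city label for the later comparison. This is a minimal sketch with one row per city copied from the previews in this notebook; the helper name `stack_cities` and the `City` column are assumptions of mine, not something the notebook defines at this point.

```python
import pandas as pd

# One row per city, copied from the previews in this notebook
ny = pd.DataFrame([{"Borough": "Bronx", "Neighborhood": "Wakefield",
                    "Latitude": 40.894705, "Longitude": -73.847201}])
to = pd.DataFrame([{"Borough": "Scarborough", "Neighborhood": "Woburn",
                    "Latitude": 43.770992, "Longitude": -79.216917}])
lx = pd.DataFrame([{"Borough": "Lisboa", "Neighborhood": "Beato",
                    "Latitude": 38.736438, "Longitude": -9.124203}])

def stack_cities(frames_by_city):
    """Stack per-city tables and tag every row with its city (hypothetical helper)."""
    tagged = [frame.assign(City=city) for city, frame in frames_by_city.items()]
    return pd.concat(tagged, ignore_index=True)

all_neighborhoods = stack_cities({"New York": ny, "Toronto": to, "Lisbon": lx})
print(all_neighborhoods)
```

With the full tables, the same call would take `NY_neighborhoods`, `TO_neighborhoods` and `GLX_neighborhoods` in place of the one-row samples.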

### 2.4 Scraping of data from Nomad List
This data is intended to further help in selecting the best place for the StartUp. I expect to find matching
neighborhoods across all 3 cities, so city-level scores give an objective way to select an appropriate neighborhood, or at least the
top candidates.

In [44]:
# Define the list of Cities
list_cities = ['new-york-city', 'toronto', 'lisbon']

content = []
col_data = []

In [48]:
# Scraping data
try:
    for c in list_cities:
        time.sleep(random.choice(np.linspace(1, 5))) # a random delay to avoid blocking
        req_parsed = bs(requests.get('https://nomadlist.com/{}'.format(c)).text, 'html.parser')
        # Find relevant information and store it
        table = req_parsed.find('table', {'class': 'details'})
        fullrows = []
        for row in table.find_all('tr'):
            col = row.find_all('td')
            fullrows.append([cell.text for cell in col])
            fullrows[-1][0] = re.sub(r'\W', '', fullrows[-1][0])  # label, later used as column name
            fullrows[-1][1] = re.sub(r'\W', ' ', fullrows[-1][1]) # value, symbols replaced by spaces
        fullrows.append(['City', c])
        row_data = [r[1] for r in fullrows]  # values for this city
        col_data = [r[0] for r in fullrows]  # labels, shared by all cities
        content.append(pd.DataFrame(row_data).T)
    print('Done!')
except Exception as err:  # report the cause instead of hiding it
    print('Something went wrong...', err)

Done!


In [49]:
# constructing final df
content.append(pd.DataFrame(col_data).T)
NL_data = pd.concat(content).reset_index(drop=True)
NL_cities = NL_data.iloc[:-1, :].copy()  # copy so the column assignments below do not act on a view
NL_cities.columns = NL_data.iloc[-1, :]  # the last row holds the labels

In [50]:
# now some data clean up
corresp = {'Great':5, 'Good':4, 'Okay':3, 'Bad':1}
NL_cities['NomadScore'] = NL_cities['NomadScore'].astype('str').str.\
    replace(r'\s\d+\s\w+$', '', regex=True).\
    replace(r'\s', '.', regex=True).astype('float')
NL_cities['Cost'] = NL_cities['Cost'].astype('str').str.\
    replace(r'[A-Za-z]\s*', '', regex=True).replace(r'\s', '', regex=True).astype('int')
NL_cities['Internet'] = NL_cities['Internet'].astype('str').str.\
    replace(r'[A-Za-z]\s*', '', regex=True).astype('int')
NL_cities['Weather'] = NL_cities['Weather'].astype('str').str.\
    replace(r'[^0-9]', '', regex=True).\
    replace(r'[0-9]{2}$', '', regex=True).astype('int')
NL_cities['Airqualitynow'] = NL_cities['Airqualitynow'].astype('str').str.\
    replace(r'[^0-9]{2}.', '', regex=True).astype('float')
list_cols = ['Fun','Safety', 'Qualityoflife', 'Walkability', 
           'Peace', 'Trafficsafety', 'Hospitals', 
           'Happiness', 'Nightlife', 'FreeWiFiincity', 
           'Placestoworkfrom', 'ACorheating', 
           'Friendlytoforeigners', 'Englishspeaking', 
           'Freedomofspeech', 'Racialtolerance', 
           'Femalefriendly', 'LGBTfriendly', 'StartupScore']
for col in list_cols:
    NL_cities[col] = NL_cities[col].replace(corresp).astype('int')
NL_cities['Peopledensity'] = NL_cities['Peopledensity'].astype('str').str.\
    replace(r'k.*', '000', regex=True).replace(r'\w*\s+', '', regex=True).astype('int')
NL_cities = NL_cities.rename(columns={'Cost':'CostPerMonth',
                                      'Internet':'Internet_Mbps_avg',
                                      'Weather':'WeatherCelsiusNow',
                                      'Airqualitynow':'Airqualitynow_mcg_cubicm',
                                      'Peopledensity':'Peopledensity_sqrKm'})
cols = NL_cities.columns.tolist()
cols = cols[-1:] + cols[:-1]
NL_cities = NL_cities[cols]

Finally, I have a clean, final DataFrame for the 3 cities.
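
One fragile point in the clean-up is the label-to-number mapping: a label that is missing from `corresp` would only surface as a conversion error. The sketch below shows one way to fail with a clearer message. It assumes the rating columns hold plain text labels, possibly padded with spaces; `ratings_to_int` and the sample values are hypothetical.

```python
import pandas as pd

corresp = {'Great': 5, 'Good': 4, 'Okay': 3, 'Bad': 1}  # same mapping as in the clean-up cell

def ratings_to_int(series, mapping):
    """Map rating labels to integers and fail loudly on labels missing from `mapping`."""
    cleaned = series.astype(str).str.strip()
    unknown = sorted(set(cleaned) - set(mapping))
    if unknown:
        raise ValueError('Unmapped rating labels: {}'.format(unknown))
    return cleaned.map(mapping).astype(int)

sample = pd.Series([' Great', 'Okay ', 'Bad'], name='Safety')  # made-up values
print(ratings_to_int(sample, corresp).tolist())
```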

In [51]:
NL_cities.to_csv('resources_battle_n/NL_cities.csv')
NL_cities

City,NomadScore,CostPerMonth,Internet_Mbps_avg,WeatherCelsiusNow,Airqualitynow_mcg_cubicm,Fun,Safety,Qualityoflife,Peopledensity_sqrKm,Walkability,Peace,Trafficsafety,Hospitals,Happiness,Nightlife,FreeWiFiincity,Placestoworkfrom,ACorheating,Friendlytoforeigners,Englishspeaking,Freedomofspeech,Racialtolerance,Femalefriendly,LGBTfriendly,StartupScore
new-york-city,3.56,5239,21,32,23.0,5,4,4,10000,5,3,5,4,4,5,5,5,3,5,5,4,4,4,5,3
toronto,3.32,3331,18,27,59.0,4,4,4,4000,5,4,5,4,5,5,4,5,3,4,5,5,5,4,5,3
lisbon,3.77,2346,15,24,43.0,4,4,4,6000,5,4,5,1,3,3,3,5,3,4,4,5,3,4,4,3