# Capstone Project - The Battle of Neighborhoods (Week 1)
### Applied Data Science Capstone by IBM/Coursera
*Author: Henrique M. L. Pereira*

This is the final part of the Coursera Capstone Project. In this notebook I describe the problem at hand,
explain why it is important, and explain how and where I'll retrieve the data needed to accomplish my objective.
The assignment asks me to explore geographical locations (New York, Toronto or another city of my choice)
using Foursquare location data, and to be creative in finding a problem that can be solved with this approach.

## Table of Contents
* [Introduction](#introduction)
* [Data and Methodology](#data)
* [Data Exploration](#exploration)
* [Results and Discussion](#discussion)
* [Conclusion](#conclusion)

## 1. Introduction <a name="introduction"></a>
Everyone has faced, or will face at some point in life, a difficult decision: for example, whether to get married, move
out of your parents' house into your own place, have kids, or enroll in that awesome course.

The situation I present may be a particularly difficult one. Let's
imagine that you, an entrepreneur and creative professional keen on producing innovative products, decide (along with
some good and daring colleagues) to start a StartUp company. You are offered help to establish an office, but only in one of 5 cities,
all of them away from your hometown and your colleagues' hometowns.

Let's define the problem further.

### 1.1 Problem Definition <a name="problem"></a>
The 5 cities are major hubs in Portugal, diverse in many ways (financial, economic, artistic, academic) and full of opportunities.
They were selected using data from the Nomad List, where they are classified as good cities to establish a startup. They are 'LISBOA', 'PORTO', 'COIMBRA', 'BRAGA' and 'AVEIRO'.

Your StartUp, and its work, may be defined by the following keywords:
* Innovative;
* Culturally impacting;
* Environment friendly;
* Information Technology;
* Game changing;
* Daring designs and approaches;

You and your colleagues have the following concerns and needs (in no particular order):
* Affordability of office and living;
* Multicultural environment;
* Near essential commodities;
* Near places to relax and meet people;
* A buzzing city to live in, but without stressful situations;

### 1.2 Objective <a name="Objective"></a>
My objective is to use Data Science and Machine Learning to recommend the best city and neighborhood in which to set up said StartUp
company, based on the factors specified above as well as the data from these cities, which will be retrieved dynamically.

## 2. Data <a name="data"></a>
All data used in the project is either stored in the resources folder of the GitHub repository associated with this project
or available on the internet.

I'll use data about the cities of Portugal from the official Portuguese Government open data.
I'll also use data from the "Nomad List" ( https://nomadlist.com/ ) to rank the cities, so I can put similarities and dissimilarities in perspective when comparing selected neighborhoods.

In [None]:
# import of relevant packages
import pandas as pd
import urllib.request
import json
from pandas import json_normalize  # top-level import (pandas >= 1.0); the pandas.io.json path is deprecated
import re
from geopy.geocoders import Nominatim
import geocoder
import requests
from bs4 import BeautifulSoup as bs
import time
import random
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
print('Libraries imported...')
%matplotlib inline

### 2.1 Scraping of data from Nomad List
This data is intended to further help in selecting the best place for the startup. I expect neighborhoods from
different cities to match each other, so this city-level data is an objective way to choose between them, or at least
to pick the top few.

In [None]:
# Define the list of Cities
list_cities = ['lisbon', 'porto', 'coimbra', 'braga', 'aveiro']
content = []
col_data = []
# Scraping data
try:
    for i, c in enumerate(list_cities):
        time.sleep(random.choice(np.linspace(1, 5))) # a random delayer to avoid blocking
        req_parsed = bs(requests.get('https://nomadlist.com/{}'.format(c)).text, 'html.parser')
        #Find relevant information and store it
        table = req_parsed.find('table',{'class':'details'})
        fullrows = []
        for row in table.find_all('tr'):
            col = row.find_all('td')
            fullrows.append([cell.text for cell in col])
            # raw strings avoid invalid-escape warnings for \W
            fullrows[-1][0] = re.sub(r'\W', '', fullrows[-1][0])
            fullrows[-1][1] = re.sub(r'\W', ' ', fullrows[-1][1])
        fullrows.append(['City', list_cities[i]])
        row_data = np.array(fullrows).T.tolist()[1]
        col_data = np.array(fullrows).T.tolist()[0]
        fulldata = pd.DataFrame(row_data).T
        fulldata.columns = col_data
        content.append(fulldata)
    print('Done!')
except Exception as exc:
    # report the actual error instead of hiding it behind a bare except
    print('Something went wrong...', exc)
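
The row-parsing step of the cell above can be tried offline. The sketch below applies the same `find_all('tr')` / `find_all('td')` logic to a small hand-written HTML fragment. The fragment, its values and the helper name `parse_details_table` are made up for illustration; they are not Nomad List data, and the real page layout may differ.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML fragment shaped like a two-column "details" table
sample_html = """
<table class="details">
  <tr><td>Nomad Score</td><td>4.2/5</td></tr>
  <tr><td>Quality of life</td><td>Good</td></tr>
</table>
"""


def parse_details_table(html, city):
    """Turn a two-column HTML table into a one-row DataFrame."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'class': 'details'})
    record = {}
    for row in table.find_all('tr'):
        cells = [cell.text for cell in row.find_all('td')]
        if len(cells) < 2:
            continue  # skip header or malformed rows
        key = re.sub(r'\W', '', cells[0])     # column name without spaces or punctuation
        value = re.sub(r'\W', ' ', cells[1])  # punctuation replaced by spaces
        record[key] = value
    record['City'] = city
    return pd.DataFrame([record])


sample_row = parse_details_table(sample_html, 'lisbon')
print(sample_row)
```

Note how the value `4.2/5` becomes `4 2 5` after the `\W` substitution, which is why the clean-up cell further down needs additional regular expressions to rebuild the numbers.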

In [None]:
# constructing final df
NL_cities = pd.concat(content, axis=0, sort=False).reset_index(drop=True)
NL_cities

In [None]:
# now some data clean up
for col in NL_cities.columns:
    NL_cities[col] = NL_cities[col].fillna(NL_cities[col].dropna(axis=0).mode()[0])

corresp = {'Great':5, 'Good':4, 'Okay':3, 'Bad':1}

NL_cities['NomadScore'] = NL_cities['NomadScore'].astype('str').\
    replace(r'\s\d+\s\w+$', '', regex=True).\
    replace(r'\s', '.', regex=True).astype('float')

NL_cities['Cost'] = NL_cities['Cost'].astype('str').\
    replace(r'[A-z]\s*', '', regex=True).replace(r'\s', '', regex=True).astype('int')

NL_cities['Internet'] = NL_cities['Internet'].astype('str').\
    replace(r'[A-z]\s*', '', regex=True).astype('int')

NL_cities['Temperaturenow'] = NL_cities['Temperaturenow'].astype('str').\
    replace(r'[^0-9]', '', regex=True).\
    replace(r'[0-9]{2}$', '', regex=True).astype('int')

NL_cities['Humiditynow'] = NL_cities['Humiditynow'].astype('str').\
    replace(r'[^0-9]', '', regex=True).\
    replace(r'[0-9]{1}$', '', regex=True).astype('int')
     
NL_cities['Airqualitynow'] = NL_cities['Airqualitynow'].astype('str').\
    replace(r'[^0-9]{2}.', '', regex=True).astype('float')

list_cols = ['Fun','Safety', 'Qualityoflife', 'Walkability', 
           'Peace', 'Trafficsafety', 'Hospitals', 
           'Happiness', 'Nightlife', 'FreeWiFiincity', 
           'Placestoworkfrom', 'ACorheating', 
           'Friendlytoforeigners', 'Englishspeaking', 
           'Freedomofspeech', 'Racialtolerance', 
           'Femalefriendly', 'LGBTfriendly', 'StartupScore']
for col in list_cols:
    NL_cities[col] = NL_cities[col].replace(corresp).astype('int')

NL_cities['Peopledensity'] = NL_cities['Peopledensity'].astype('str').\
    replace(r'k.*', '000', regex=True).replace(r'\w*\s+', '', regex=True).astype('int')

NL_cities = NL_cities.rename(columns={'Cost':'CostPerMonth',
                                      'Internet':'Internet_Mbps_avg',
                                      'Weather':'WeatherCelsiusNow',
                                      'Airqualitynow':'Airqualitynow_mcg_cubicm',
                                      'Peopledensity':'Peopledensity_sqrKm'})

cols = NL_cities.columns.tolist()
cols = cols[-1:] + cols[:-1]
NL_cities = NL_cities[cols]
NL_cities['City'] = NL_cities['City'].str.upper().replace('LISBON', 'LISBOA') 
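
The conversion of text ratings to numbers can be illustrated on its own. This is a minimal sketch on made-up ratings (the city names and values are hypothetical); it uses `Series.map` rather than `replace`, as an alternative that turns any label missing from the dictionary into `NaN`, so unexpected labels are caught before the cast to integer.

```python
import pandas as pd

# Hypothetical ratings, not scraped values
ratings = pd.DataFrame({
    'City': ['A', 'B', 'C'],
    'Fun': ['Great', 'Okay', 'Bad'],
    'Safety': ['Good', 'Good', 'Great'],
})

# Same correspondence table as in the notebook (note that no label maps to 2)
corresp = {'Great': 5, 'Good': 4, 'Okay': 3, 'Bad': 1}

for col in ['Fun', 'Safety']:
    mapped = ratings[col].map(corresp)  # unknown labels become NaN
    if mapped.isna().any():
        raise ValueError(f'unmapped label in column {col}')
    ratings[col] = mapped.astype('int')

print(ratings)
```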

Finally, I have a clean, final DataFrame with all the cities.

In [None]:
NL_cities.to_csv('resources_battle_n/NL_cities.csv')
NL_cities

Let's scale the data between 0 and 1 so it can be used later to rank these cities.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
features = NL_cities.set_index('City')  # numeric columns only, indexed by city
NL_cities_norm = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
NL_cities_norm.insert(0, 'City', NL_cities['City'])  # put the city name back as the first column
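
Min-max scaling maps every column to the interval [0, 1] with `(x - min) / (max - min)`, which is what `MinMaxScaler` computes with its default `feature_range`. A small sketch on made-up numbers (the cities and values are hypothetical), written with plain pandas so the formula is visible:

```python
import pandas as pd

# Made-up values for three hypothetical cities
toy = pd.DataFrame(
    {'CostPerMonth': [1000, 1500, 2000], 'Fun': [3, 5, 4]},
    index=['A', 'B', 'C'],
)

# Column-wise min-max scaling: the smallest value becomes 0, the largest 1
toy_norm = (toy - toy.min()) / (toy.max() - toy.min())
print(toy_norm)
```

After scaling, columns measured in very different units (a monthly cost and a 1 to 5 rating) become directly comparable.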

In [None]:
NL_cities_norm

Now I'll remove the variables that are not relevant to our startup:

In [None]:
columns_to_remove = ['Internet_Mbps_avg','Nightlife','FreeWiFiincity','Placestoworkfrom','ACorheating','Temperaturenow','Humiditynow',
                     'Airqualitynow_mcg_cubicm','Friendlytoforeigners','Englishspeaking','Freedomofspeech','Racialtolerance','Femalefriendly','LGBTfriendly',]
NL_cities = NL_cities.drop(columns_to_remove, axis=1)
NL_cities_norm = NL_cities_norm.drop(columns_to_remove, axis=1)

### 2.2 Loading of Portuguese cities' neighborhood and location data
This data was retrieved from official Portuguese government URLs (available at https://dados.gov.pt/). The precise URLs are given in the code below.

In [None]:
column_names = ['City', 'Neighborhood', 'Latitude', 'Longitude']

# We'll start by getting the data from all the cities of portugal
pt_nbh_url = 'https://dados.gov.pt/pt/datasets/r/2266425a-18ca-44a8-8655-9c39624c0ccb'

with urllib.request.urlopen(pt_nbh_url) as url:
    pt_nbh_url = json.load(url)

# Check the needed columns and relevant data
json_normalize(pt_nbh_url['d']).info()

In [None]:
# let's save data from 'entidade' and 'codigopostal' to a Dataframe
# I will want only the cities named in the introduction
m_target = ['LISBOA', 'PORTO', 'COIMBRA', 'BRAGA', 'AVEIRO']
rows = []  # DataFrame.append was removed in pandas 2.0, so rows are collected in a list first

for data in pt_nbh_url['d']:
    # the city is the text inside the trailing parentheses
    city = re.sub(r'^(.)*(?=\s\(.)\s\(', '', data['entidade']).strip(')')
    # the neighborhood is everything before the first " ("
    neighborhood_name = re.sub(r'\s\((.*)', '', data['entidade'])
    postalcode = data['codigopostal']

    rows.append({'City': city,
                 'Neighborhood': neighborhood_name,
                 'PC': postalcode})

PT_neighborhoods = pd.DataFrame(rows, columns=['City', 'Neighborhood', 'PC'])
PT_neighborhoods = PT_neighborhoods[PT_neighborhoods['City'].isin(m_target)].reset_index(drop=True)

# Check of a sample from the resulting DataFrame
PT_neighborhoods.sample(5)
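
The two regular expressions above split a single `entidade` string into a city and a neighborhood. The sketch below shows the idea on example strings. The `"Neighborhood (CITY)"` shape is inferred from the regular expressions, the sample strings are illustrative rather than taken from the dataset, and `split_entity` is a hypothetical helper using slightly simplified patterns.

```python
import re


def split_entity(entity):
    """Split 'Neighborhood (CITY)' into (city, neighborhood)."""
    # drop everything up to and including the last " (", then the closing ")"
    city = re.sub(r'^.*\s\(', '', entity).strip(')')
    # drop everything from the first " (" onwards
    neighborhood = re.sub(r'\s\(.*', '', entity)
    return city, neighborhood


print(split_entity('Ajuda (LISBOA)'))
print(split_entity('No Parentheses Here'))
```

When a string has no parenthesised part, both values come back equal to the input, so such rows are filtered out later by the `isin(m_target)` check unless the whole string happens to be a target city name.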

In [None]:
# Build a function that uses the geocoder package to look up coordinates from the Portuguese postal code and store them in a DataFrame
def find_coordinates_pt(place, nbh, postal_code, df):
    rows = []  # DataFrame.append was removed in pandas 2.0, so rows are collected in a list
    print('Retrieving coordinates: ', end='')

    for p, n, post in zip(df[place], df[nbh], df[postal_code]):  # for every row
        try:
            # first attempt: full address (neighborhood, city, postal code)
            l = geocoder.arcgis(n + ', ' + p + ', ' + post + ', PT').latlng
            rows.append({'PC': post, 'Latitude': l[0], 'Longitude': l[1]})
            print(' .', end='')
        except Exception:
            # fallback: postal code alone
            l = geocoder.arcgis(post + ', PT').latlng
            rows.append({'PC': post, 'Latitude': l[0], 'Longitude': l[1]})
            print(' o', end='')

    print('End of function!')
    coordinates_df = pd.DataFrame(rows, columns=['PC', 'Latitude', 'Longitude'])

    return pd.merge(df, coordinates_df, on=postal_code, how='inner')  # merge the 2 dataframes on the postal code key
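
The "full address first, postal code second" strategy can be tested without network access by passing the geocoding call in as a parameter. This is only a sketch: `geocode_with_fallback` and `fake_geocode` are hypothetical names, the addresses and coordinates are made up, and the fake function merely stands in for a call such as `geocoder.arcgis(query).latlng`.

```python
import pandas as pd


def geocode_with_fallback(row, geocode):
    """Try the full address first, then the postal code alone.

    `geocode` is any callable that takes a query string and returns
    a (lat, lng) pair, or None when nothing is found.
    """
    queries = [
        f"{row['Neighborhood']}, {row['City']}, {row['PC']}, PT",
        f"{row['PC']}, PT",
    ]
    for query in queries:
        latlng = geocode(query)
        if latlng:
            return latlng[0], latlng[1]
    return None, None  # left missing so a later isna() check can report it


def fake_geocode(query):
    """Stand-in geocoder that only 'knows' one postal code."""
    known = {'1000-001, PT': (38.7, -9.1)}
    return known.get(query)


places = pd.DataFrame({
    'City': ['LISBOA', 'PORTO'],
    'Neighborhood': ['Example A', 'Example B'],
    'PC': ['1000-001', '4000-001'],
})

coords = [geocode_with_fallback(row, fake_geocode) for _, row in places.iterrows()]
located = places.assign(
    Latitude=[c[0] for c in coords],
    Longitude=[c[1] for c in coords],
)
print(located)
```

Returning missing values instead of raising keeps one output row per input row, so the coordinates can be attached by position rather than merged back on the postal code.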

In [None]:
# Let's create our final dataset for the locations
Neighborhoods = find_coordinates_pt('City', 'Neighborhood', 'PC', PT_neighborhoods).drop('PC', axis=1)

Now our dataframe is built with the cities we want, their neighborhoods and coordinates, so let's take a look at it:

In [None]:
Neighborhoods.sample(5)

In [None]:
# Check if every location was returned a value of latitude and longitude
print('Found', sum(Neighborhoods.Latitude.isna()), 'Latitude values missing from Dataframe, and',
      sum(Neighborhoods.Longitude.isna()), 'Longitude values missing from Dataframe.')

In [None]:
# Save the dataframe as a csv file in the resources folder
Neighborhoods.to_csv('resources_battle_n/Neighborhoods.csv')

### 2.3 Retrieval of Venue Data from Foursquare
I'll retrieve the venues near each coordinate using the Foursquare API, and concatenate this information into another
DataFrame.

First I'll define the Foursquare credentials and API version required to build the request URL.

In [None]:
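import os
# Credentials are read from environment variables instead of being hardcoded in the notebook
# (the variable names FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are an assumption; set them before running)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']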
VERSION = '20190618'
LIMIT = 100 # number of maximum venues to be returned for every neighborhood
RADIUS = 500 # radius in meters to search for venues

I'll create a function that repeats the process of pulling venues for every neighborhood.

In [None]:
def get_nearby_venues(cities, names, latitudes, longitudes, radius=RADIUS, limit=LIMIT):
    venues_list=[]
    print('Retrieving venues: ', end='')
    for city, name, lat, lng in zip(cities, names, latitudes, longitudes):
        # build of the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
        # make the GET request
        try:
            results = requests.get(url).json()["response"]['groups'][0]['items']
            # return only relevant information for each nearby venue
            venues_list.append([(
                city,
                name, 
                lat, 
                lng, 
                v['venue']['id'],
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['id'],  
                v['venue']['categories'][0]['name']) for v in results])
            print(' .', end='')
        except KeyError:
            print('KeyError... The API is not responding or max requests per day maxed out...')
            break
        
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City',
                  'Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude',
                  'VenueID',
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude',
                  'Venue CategoryID',
                  'Venue Category']
    
    
    # this will return the parent categories for each venue child category
    url = 'https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION,)

    categories_results = requests.get(url).json()['response']['categories']
    
    fs_categories = json_normalize(data=categories_results, 
                               record_path='categories', 
                               record_prefix='child_', 
                               meta=['name'])[['name', 'child_name', 'child_id']]
    
    nearby_venues = pd.merge(nearby_venues, fs_categories, left_on='Venue CategoryID', right_on='child_id')\
        .drop(['Venue CategoryID', 'child_name', 'child_id'], axis=1)\
        .rename(columns={'name':'ParentCategory'})
    
    print('End of function!')
    
    return nearby_venues
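
The last part of the function attaches a parent category to each venue. The sketch below reproduces that step on a tiny hand-written payload. The payload only imitates the nested `categories` structure the function reads; the ids, names and venues are hypothetical. It also uses a left join, as an alternative to the inner join above, so venues whose category id is not among the first-level children are kept with a missing parent instead of being dropped.

```python
import pandas as pd

# Hypothetical payload: parent categories, each with a list of child categories
categories_results = [
    {'id': 'p1', 'name': 'Food',
     'categories': [{'id': 'c1', 'name': 'Bakery'},
                    {'id': 'c2', 'name': 'Cafe'}]},
    {'id': 'p2', 'name': 'Arts & Entertainment',
     'categories': [{'id': 'c3', 'name': 'Museum'}]},
]

# One row per child category; the parent's name is carried along as metadata
fs_categories = pd.json_normalize(
    categories_results,
    record_path='categories',
    record_prefix='child_',
    meta=['name'],
)[['name', 'child_name', 'child_id']]

# Hypothetical venues with the category id returned for each one
venues = pd.DataFrame({
    'Venue': ['Venue X', 'Venue Y'],
    'Venue CategoryID': ['c3', 'c1'],
})

with_parent = (
    venues.merge(fs_categories, left_on='Venue CategoryID',
                 right_on='child_id', how='left')
          .drop(columns=['child_name', 'child_id'])
          .rename(columns={'name': 'ParentCategory'})
)
print(with_parent)
```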

Now I'll retrieve, with the above function, every venue for each neighborhood and 
create a new dataframe with this information.

In [None]:
all_venues = get_nearby_venues(
                cities = Neighborhoods['City'],
                names = Neighborhoods['Neighborhood'],
                latitudes = Neighborhoods['Latitude'],
                longitudes = Neighborhoods['Longitude'])

The resulting DataFrame has been created; now I'll save it and check its shape and the information about the retrieved columns.

In [None]:
all_venues.to_csv('resources_battle_n/all_venues_retrieved.csv')
all_venues.shape

In [None]:
all_venues.info()

In [None]:
all_venues = pd.read_csv('resources_battle_n/all_venues_retrieved.csv').iloc[:,1:]

I want to check how many venues were retrieved per city.

In [None]:
all_venues.groupby(by='City').count()['Venue']

### 2.4 Methodology
For this problem I will first carry out an exploratory data analysis to find hidden patterns in the data, and to check whether the venues of the selected cities differ substantially from one another.

Next I'll apply two unsupervised machine learning techniques: first K-Means, then OPTICS. My intention is to validate the clusters found by K-Means using the ability of OPTICS to find outliers (in a similar fashion to DBSCAN).
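
To make the idea concrete, here is a minimal sketch on made-up 2-D points, not on the venue data. It assumes scikit-learn 0.21 or later (where `OPTICS` is available), and the parameter values (`n_clusters=2`, `min_samples=3`, `eps=0.5`) are chosen only to suit this toy example. K-Means must assign every point to a cluster, while OPTICS can label an isolated point as noise (`-1`).

```python
import numpy as np
from sklearn.cluster import KMeans, OPTICS

# Toy data: two tight groups plus one isolated point
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],  # group B
    [2.0, 2.0],                                      # isolated point
])

# K-Means forces the isolated point into one of the two clusters
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# OPTICS with DBSCAN-style extraction marks points with no dense neighborhood as -1
optics_labels = OPTICS(min_samples=3, cluster_method='dbscan', eps=0.5).fit_predict(points)

print('K-Means labels:', kmeans_labels)
print('OPTICS labels: ', optics_labels)
```

Comparing the two label arrays shows which members of a K-Means cluster OPTICS considers outliers; those are the assignments that deserve a second look.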

After selecting the best-suited neighborhoods with the appropriate venues for our startup, I'll use the data from Nomad List to rank them and narrow down the eligible neighborhoods, so that the clients can easily select the best one in which to establish their startup.

In all steps I'll try to make this notebook as dynamic and automated as possible, so the results can be reproduced and can reflect future changes in the data.