# 1. Introduction (Week 1)

The goal is to find the best location for opening a new luxury restaurant in Toronto, Canada. The restaurant will target clients with high purchasing power. For this type of restaurant, the main points to consider are:
  - Neighborhood in Toronto: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
  - Neighborhood's average income: https://en.wikipedia.org/wiki/Demographics_of_Toronto_neighbourhoods
  - Neighborhood's criminality rate: https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#6ff36980-d2f4-f438-d940-3e6a5c315588
  - Other similar restaurants in the vicinity: Foursquare.
  - Neighborhood's geo position.

# 2. Data (Week 1)

## 2.1 Neighborhoods:

The list of neighborhoods is taken from the following Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

In [2]:
# Use an HTTP client (requests) to get the document behind the URL
request = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

# Print information about request
if request.status_code == 200:
    print('Success!')
elif request.status_code == 404:
    print('Not Found.')

Success!


In [3]:
# Create soup object
soup = BeautifulSoup(request.content, 'html.parser')

In [4]:
# Find table in the HTML with the needed information
soup_table = soup.find_all('tbody', limit=1)[0]

In [5]:
# Collect one record per table row, then build the data frame in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []

for tag in soup_table.find_all('tr'):
    postcod = tag.find_next()
    borough = postcod.find_next()
    neighborhood = borough.find_next()

    str_borough = borough.string
    # if the cell contains an "a" tag, take its text and skip to the next tag
    if borough.a is not None:
        str_borough = borough.a.string.strip()
        neighborhood = neighborhood.find_next()

    str_neighborhood = neighborhood.string
    # if the cell contains an "a" tag, take its text
    if neighborhood.a is not None:
        str_neighborhood = neighborhood.a.string.strip()

    rows.append({
        'Postcode': postcod.string.strip(),
        'Borough': str_borough,
        'Neighborhood': str_neighborhood.strip()})

df_Toronto = pd.DataFrame(rows, columns=['Postcode', 'Borough', 'Neighborhood'])

In [6]:
# Remove first row of the table which is not needed (just header)
df_Toronto = df_Toronto.iloc[1:]

# Delete Boroughs which are "Not assigned"
df_Toronto = df_Toronto.loc[df_Toronto['Borough']  != 'Not assigned']

# Change Neighbourhood: "Not assigned" to the name of the Borough
df_Toronto.Neighborhood = np.where(df_Toronto.Neighborhood.eq('Not assigned'), df_Toronto.Borough, df_Toronto.Neighborhood)

In [7]:
df_Toronto.head()

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Harbourfront |
| 6 | M5A | Downtown Toronto | Regent Park |
| 7 | M6A | North York | Lawrence Heights |
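
The two cleaning rules applied above (drop rows with an unassigned borough, then fall back to the borough name when the neighborhood is unassigned) can be checked on a small hand-made frame. This is only a sketch: the three sample rows are hypothetical and not scraped data, and the `.copy()` call is an addition to avoid pandas' chained-assignment warning.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows (not scraped data) covering all three cases
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Rule 1: drop rows whose borough is unknown
cleaned = sample.loc[sample['Borough'] != 'Not assigned'].copy()

# Rule 2: use the borough name where the neighborhood is unknown
cleaned['Neighborhood'] = np.where(
    cleaned['Neighborhood'].eq('Not assigned'),
    cleaned['Borough'],
    cleaned['Neighborhood'],
)

print(cleaned)
```

The filtered frame keeps the original row labels, which is why the index of the real table does not start at 0.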


## 2.2 Average income:

The average income per neighborhood is taken from the following Wikipedia page: https://en.wikipedia.org/wiki/Demographics_of_Toronto_neighbourhoods

In [8]:
# Use an HTTP client (requests) to get the document behind the URL
request = requests.get('https://en.wikipedia.org/wiki/Demographics_of_Toronto_neighbourhoods')

# Print information about request
if request.status_code == 200:
    print('Success!')
elif request.status_code == 404:
    print('Not Found.')

Success!


In [9]:
# Create soup object
soup = BeautifulSoup(request.content, 'html.parser')

In [10]:
# Find table in the HTML with the needed information
soup_table = soup.find_all('tbody', limit=2)[1]

In [11]:
# Collect one record per table row, then build the data frame in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []

for tag in soup_table.find_all('tr'):
    try:
        neighborhood = tag.find_next()
        population = neighborhood.find_next().find_next().find_next()
        income = population.find_next().find_next().find_next().find_next().find_next()

        str_neighborhood = neighborhood.string
        # if the cell contains an "a" tag, take its text and skip to the next tag
        if neighborhood.a is not None:
            str_neighborhood = neighborhood.a.string.strip()
            population = population.find_next()

        str_population = population.string.strip().replace(',', '')
        # if the cell contains an "a" tag, take its text
        if population.a is not None:
            str_population = population.a.string.strip()

        str_income = income.string.strip().replace(',', '')
        # if the cell contains an "a" tag, take its text
        if income.a is not None:
            str_income = income.a.string.strip()

        rows.append({
            'Neighborhood': str_neighborhood,
            'Population': int(str_population),
            'Income': int(str_income)})
    except (AttributeError, ValueError, TypeError):
        # skip header rows and rows with missing or non-numeric cells
        continue

df_Toronto_income = pd.DataFrame(rows, columns=['Neighborhood', 'Population', 'Income'])
df_Toronto_income["Income"] = pd.to_numeric(df_Toronto_income["Income"])
df_Toronto_income["Population"] = pd.to_numeric(df_Toronto_income["Population"])

df_Toronto_income.head()

|   | Neighborhood | Population | Income |
|---|---|---|---|
| 0 | Agincourt | 44577 | 25750 |
| 1 | Alderwood | 11656 | 35239 |
| 2 | Alexandra Park | 4355 | 19687 |
| 3 | Allenby | 2513 | 245592 |
| 4 | Amesbury | 17318 | 27546 |


In [12]:
# Inner join on the neighborhood name: rows without an exact name match in both tables are dropped
df_Toronto = df_Toronto.merge(df_Toronto_income, on='Neighborhood')
df_Toronto.head()

|   | Postcode | Borough | Neighborhood | Population | Income |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 26533 | 34811 |
| 1 | M4A | North York | Victoria Village | 17047 | 29657 |
| 2 | M6A | North York | Lawrence Heights | 3769 | 29867 |
| 3 | M6A | North York | Lawrence Manor | 13750 | 36361 |
| 4 | M1B | Scarborough | Rouge | 22724 | 29230 |
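
Because the merge keeps only neighborhoods whose names match exactly in both tables, some rows disappear (Harbourfront and Regent Park are in the postcode table above but not in the merged result). One way to see which names are lost is a left merge with pandas' `indicator=True` option. The sketch below uses a few rows copied from the outputs above as stand-ins for the full tables; the frame names `postcodes` and `income` are illustrative.

```python
import pandas as pd

# Small stand-ins for the two scraped tables (values copied from the outputs above)
postcodes = pd.DataFrame({
    'Postcode': ['M3A', 'M5A', 'M5A'],
    'Neighborhood': ['Parkwoods', 'Harbourfront', 'Regent Park'],
})
income = pd.DataFrame({
    'Neighborhood': ['Parkwoods', 'Agincourt'],
    'Income': [34811, 25750],
})

# A left merge keeps every postcode row; the _merge column records where each row was found
audit = postcodes.merge(income, on='Neighborhood', how='left', indicator=True)

# Rows marked 'left_only' have no income record and would be dropped by an inner merge
unmatched = audit.loc[audit['_merge'] == 'left_only', 'Neighborhood'].tolist()
print(unmatched)
```

Running the same audit on the real frames before each merge would show how much of the city is excluded from the analysis by name mismatches.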


## 2.3 Criminality rate:

The crime data is taken from the City of Toronto open data catalogue: https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#6ff36980-d2f4-f438-d940-3e6a5c315588

In [13]:
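# Load the 2011 safety indicators and keep only the columns needed here
# (select with a list: indexing a data frame with a set is not supported in recent pandas)
test = pd.read_excel('WB-Safety.xlsx', sheet_name='RawData-Ref Period 2011')
test = test[['Assaults', 'Neighborhood']]
test.head()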

|   | Assaults | Neighborhood |
|---|---|---|
| 0 | 390 | West Humber-Clairville |
| 1 | 316 | Mount Olive-Silverstone-Jamestown |
| 2 | 85 | Thistletown-Beaumond Heights |
| 3 | 59 | Rexdale-Kipling |
| 4 | 77 | Elms-Old Rexdale |


In [14]:
# Inner join on the neighborhood name: neighborhoods without assault data are dropped
df_Toronto = df_Toronto.merge(test, on='Neighborhood')
df_Toronto.head()

|   | Postcode | Borough | Neighborhood | Population | Income | Assaults |
|---|---|---|---|---|---|---|
| 0 | M4A | North York | Victoria Village | 17047 | 29657 | 107 |
| 1 | M1B | Scarborough | Rouge | 22724 | 29230 | 170 |
| 2 | M1B | Scarborough | Malvern | 44324 | 25677 | 319 |
| 3 | M1C | Scarborough | Highland Creek | 12853 | 33640 | 70 |
| 4 | M3C | North York | Flemingdon Park | 21287 | 23471 | 159 |


## 2.4 Geo Location:

The coordinates are taken from the file 'Geospatial_Coordinates.csv' (already discussed in another part of this course).

In [15]:
# Importing CSV
pdGeoToronto = pd.read_csv('Geospatial_Coordinates.csv')

In [16]:
# Merge both data frames
df_Toronto = df_Toronto.merge(pdGeoToronto, left_on='Postcode', right_on='Postal Code')

# Remove 'Postal Code' because it is a duplicated column
df_Toronto.drop('Postal Code', axis=1, inplace=True)
df_Toronto.head()

|   | Postcode | Borough | Neighborhood | Population | Income | Assaults | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | M4A | North York | Victoria Village | 17047 | 29657 | 107 | 43.725882 | -79.315572 |
| 1 | M1B | Scarborough | Rouge | 22724 | 29230 | 170 | 43.806686 | -79.194353 |
| 2 | M1B | Scarborough | Malvern | 44324 | 25677 | 319 | 43.806686 | -79.194353 |
| 3 | M1C | Scarborough | Highland Creek | 12853 | 33640 | 70 | 43.784535 | -79.160497 |
| 4 | M3C | North York | Flemingdon Park | 21287 | 23471 | 159 | 43.7259 | -79.340923 |


In [17]:
# Join the names of all neighborhoods that share a postcode
df_Toronto_group = df_Toronto.groupby('Postcode')['Neighborhood'].agg(', '.join).reset_index()

# Average the numeric columns per postcode
# (numeric_only=True skips text columns such as Borough, which pandas 2.x would otherwise reject)
df_Toronto_group_temp = df_Toronto.groupby('Postcode').mean(numeric_only=True).reset_index()

df_Toronto_group_merge = df_Toronto_group.merge(df_Toronto_group_temp, on='Postcode')
df_Toronto_group_merge.head()

|   | Postcode | Neighborhood | Population | Income | Assaults | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | M1B | Rouge, Malvern | 33524.0 | 27453.5 | 244.5 | 43.806686 | -79.194353 |
| 1 | M1C | Highland Creek | 12853.0 | 33640.0 | 70.0 | 43.784535 | -79.160497 |
| 2 | M1E | Guildwood, Morningside, West Hill | 16641.333333 | 31960.333333 | 197.333333 | 43.763573 | -79.188711 |
| 3 | M1G | Woburn | 48507.0 | 26190.0 | 412.0 | 43.770992 | -79.216917 |
| 4 | M1J | Scarborough Village | 12796.0 | 24413.0 | 226.0 | 43.744734 | -79.239476 |
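
The per-postcode grouping can also be written as a single step with pandas' named aggregation, which avoids building two frames and merging them. This is an alternative sketch, not the notebook's method; it uses three rows copied from the merged table above and reproduces the M1B values shown in the output.

```python
import pandas as pd

# Three rows copied from the merged table above
sample = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1C'],
    'Neighborhood': ['Rouge', 'Malvern', 'Highland Creek'],
    'Population': [22724, 44324, 12853],
    'Income': [29230, 25677, 33640],
    'Assaults': [170, 319, 70],
})

# One pass: join the names and average the numeric columns per postcode
grouped = (
    sample.groupby('Postcode')
    .agg(
        Neighborhood=('Neighborhood', ', '.join),
        Population=('Population', 'mean'),
        Income=('Income', 'mean'),
        Assaults=('Assaults', 'mean'),
    )
    .reset_index()
)

print(grouped)
```

Note that averaging population and assault counts across neighborhoods gives a per-neighborhood mean for the postcode, not a total; whether a sum would suit the analysis better is a modelling choice left open here.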


## 2.5 Foursquare 

In [18]:
# Foursquare Credentials and Version
CLIENT_ID = '***' # your Foursquare ID
CLIENT_SECRET = '***' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [19]:
# Calibration of venues limits and radius
LIMIT = 100
radius = 500

In [20]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Query Foursquare for venues near each location and return them as one data frame."""

    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
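
The flattening step inside `getNearbyVenues` can be tried without calling the API. The sketch below uses a hypothetical helper, `venue_rows`, and a hand-written payload that contains only the fields the function reads (`venue.name`, `venue.location.lat`, `venue.location.lng`, `venue.categories[0].name`). It also guards against an empty `categories` list, which would raise an `IndexError` in the list comprehension above; whether real responses ever omit the category is an assumption.

```python
def venue_rows(name, lat, lng, items):
    """Flatten Foursquare-style items into tuples; venues without a category get None."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((
            name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows


# Hypothetical payload: the first entry mirrors a row from the notebook output,
# the second is invented to exercise the missing-category case
sample_items = [
    {'venue': {'name': "Wendy's",
               'location': {'lat': 43.807448, 'lng': -79.199056},
               'categories': [{'name': 'Fast Food Restaurant'}]}},
    {'venue': {'name': 'Example Kiosk',
               'location': {'lat': 43.8070, 'lng': -79.1990},
               'categories': []}},
]

rows = venue_rows('Rouge, Malvern', 43.806686, -79.194353, sample_items)
for row in rows:
    print(row)
```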

In [21]:
df_Toronto_venues = getNearbyVenues(names=df_Toronto_group_merge['Neighborhood'],
                                   latitudes=df_Toronto_group_merge['Latitude'],
                                   longitudes=df_Toronto_group_merge['Longitude'])
df_Toronto_venues.head()

|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Rouge, Malvern | 43.806686 | -79.194353 | Wendy's | 43.807448 | -79.199056 | Fast Food Restaurant |
| 1 | Highland Creek | 43.784535 | -79.160497 | Royal Canadian Legion | 43.782533 | -79.163085 | Bar |
| 2 | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 | Swiss Chalet Rotisserie & Grill | 43.767697 | -79.189914 | Pizza Place |
| 3 | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 | G & G Electronics | 43.765309 | -79.191537 | Electronics Store |
| 4 | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 | Big Bite Burrito | 43.766299 | -79.19072 | Mexican Restaurant |


In [22]:
# one hot encoding
df_Toronto_onehot = pd.get_dummies(df_Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
df_Toronto_onehot['Neighborhood'] = df_Toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [df_Toronto_onehot.columns[-1]] + list(df_Toronto_onehot.columns[:-1])
df_Toronto_onehot = df_Toronto_onehot[fixed_columns]

df_Toronto_onehot.head()

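|   | Yoga Studio | American Restaurant | Art Gallery | Asian Restaurant | Athletics & Sports | Bakery | Bank | Bar | Baseball Field | Beer Store | ... | Toy / Game Store | Trail | Vegetarian / Vegan Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wings Joint | Women's Store |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |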


In [23]:
df_Toronto_grouped = df_Toronto_onehot.groupby('Neighborhood').mean().reset_index()
df_Toronto_grouped.head()

|   | Neighborhood | Yoga Studio | American Restaurant | Art Gallery | Asian Restaurant | Athletics & Sports | Bakery | Bank | Bar | Baseball Field | ... | Toy / Game Store | Trail | Vegetarian / Vegan Restaurant | Video Game Store | Video Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wings Joint | Women's Store |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alderwood, Long Branch | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Bathurst Manor | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.052632 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.052632 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Bayview Village | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Cliffcrest | 0.0 | 0.333333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Dorset Park | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.142857 | 0.0 | 0.0 | 0.0 | 0.0 |
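
Taking the mean of one-hot columns per neighborhood gives the share of venues in each category, so every row of the grouped table sums to 1. A minimal sketch with hypothetical neighborhoods `A` and `B` illustrates this; `dtype=float` is passed explicitly so the result does not depend on the default dummy dtype of the installed pandas version.

```python
import pandas as pd

# Hypothetical venues: four in neighborhood A, one in neighborhood B
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B'],
    'Venue Category': ['Bank', 'Coffee Shop', 'Coffee Shop', 'Bar', 'Bar'],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of a 0/1 column is the fraction of venues in that category
shares = onehot.groupby('Neighborhood').mean()
print(shares)
```

A value such as 0.25 therefore means "a quarter of the venues returned for this neighborhood", which says nothing about how many venues were returned in total.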


In [24]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [25]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        # the first three ranks take 'st', 'nd', 'rd'
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # every later rank takes 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = df_Toronto_grouped['Neighborhood']

for ind in np.arange(df_Toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(df_Toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

|   | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alderwood, Long Branch | Pizza Place | Coffee Shop | Pool | Pub | Sandwich Place | Skating Rink | Pharmacy | Gym | Athletics & Sports | Gift Shop |
| 1 | Bathurst Manor | Coffee Shop | Fried Chicken Joint | Diner | Sandwich Place | Bridal Shop | Restaurant | Supermarket | Middle Eastern Restaurant | Sushi Restaurant | Fast Food Restaurant |
| 2 | Bayview Village | Japanese Restaurant | Bank | Chinese Restaurant | Café | Women's Store | Fast Food Restaurant | Deli / Bodega | Department Store | Dessert Shop | Dim Sum Restaurant |
| 3 | Cliffcrest | American Restaurant | Movie Theater | Motel | Women's Store | Fast Food Restaurant | Cupcake Shop | Deli / Bodega | Department Store | Dessert Shop | Dim Sum Restaurant |
| 4 | Dorset Park | Indian Restaurant | Furniture / Home Store | Vietnamese Restaurant | Pet Store | Chinese Restaurant | Latin American Restaurant | Women's Store | Eastern European Restaurant | Dog Run | Discount Store |
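
The ranking step can be illustrated on a single row. The sketch below uses two hypothetical helpers, `ordinal` and `top_categories`, and an invented row of category shares. `ordinal` builds the column labels without relying on an exception and also handles ranks such as 11th or 21st, which the notebook does not need for a top 10.

```python
import pandas as pd


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'


def top_categories(shares, n):
    """Return the n labels with the largest values in a Series of category shares."""
    return shares.sort_values(ascending=False).index[:n].tolist()


# Hypothetical category shares for one neighborhood
row = pd.Series({'Bank': 0.25, 'Bar': 0.0, 'Coffee Shop': 0.5, 'Pizza Place': 0.1})

labels = [f'{ordinal(i)} Most Common Venue' for i in range(1, 4)]
top = top_categories(row, 3)
print(dict(zip(labels, top)))
```

When a neighborhood has fewer distinct categories than `num_top_venues`, the remaining slots are filled with categories whose share is 0, in an order that carries no meaning. This appears to be the case for Cliffcrest above, where the leading category has a share of 0.333333.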
