# <center>Best location to open a gym in Toronto</center>

## 1. Introduction

### 1.1 Background

Fitness has become a regular part of many people's lifestyles, and the number of gyms and fitness centers has surged to meet this demand. Opening a gym, like any other business, is a tough undertaking that involves many difficult decisions: Who are our target customers? How much should a membership cost? Are there competitors in the area? One of the most important questions, and one that needs a thorough answer, is which location is the most convenient for customers to come and exercise and will, in turn, maximize profitability.

### 1.2 Business Problem

Imagine a client who wants to open a gym in Toronto and asks us to help find the location that will benefit the business most in the long run. Which location in Toronto is the best candidate? We first need to think about the factors that contribute to the answer: income, competition, and the population density of each neighborhood. To solve this problem, we will mainly use the Foursquare API to get venue locations, the list of Toronto neighborhoods from Wikipedia, and census data from Toronto’s Open Data Portal.

### 1.3 Interest 

The target audience of this project is business people who want to open a new gym or expand their franchise. Through this study, they will get a clear overview of the locations in Toronto and can confidently target specific clients, which will give them a competitive advantage and a head start in the gym business.

## 2. Data Acquisition and Wrangling

### 2.1 Data Sources

We mainly focus on four data sources in this study.
1. <a href='https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050'>Wikipedia</a>: We will extract the postal codes, boroughs and neighborhoods of Toronto.
2. <a href='https://cocl.us/Geospatial_data'>Geospatial Data</a>: A geospatial dataset that maps each Toronto postal code to a latitude and longitude.
3. Foursquare API: API calls to get the locations and details of venues in Toronto. (The Foursquare API requires a developer account.)
4. <a href='http://map.toronto.ca/wellbeing/#eyJ0b3Itd2lkZ2V0LWNsYXNzYnJlYWsiOsSAcGVyY2VudE9wYWNpdHnElzcwfSwiY3VzxIJtYcSTYcSXxIBuZWlnaGJvdXJob29kc8S2fcSrxIHEg8SFxIfEicSLdGFixYXEmCLEo3RpdmVUxZBJZMSXxYnEhMWPYi1pbmRpY2HEgnLFhcWIYWdzTWFwxLYiesWCbcSXMTPErHjEly04ODM3NzQ2LjDEqTc4MDnErMSnOjU0MTI5MzkuOTIyxorGmsWIxaTFpsWoxarFksSAxZjFq2lvbsSXMsSsc8WkZ2xlxYbErMWWbWVzxJtpxrbGssStxL%2FEk8SfScWlxafFqcSDTcWDxrE6IsatbsavxrHFhw%3D%3D'>Toronto Census data</a>: List of total population, household income and other info in the neighborhoods in toronto. 

### 2.2 Data Cleaning

The data above will be combined into a single table using the pandas library, and standard scaling will then be applied to the features used by the model in this study.
The data cleaning and wrangling procedure is as follows:
1. Pull data from the data sources.
2. Drop rows and columns based on data quality.
3. Map all data into one table.
4. Prepare the data by selecting the features for the k-means clustering model and applying standard scaling to them.
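
Step 4 can be sketched as follows. This is a minimal illustration, not the notebook's actual scaling code: the five sample rows are copied from the final table built at the end of this section, and the names `features` and `scaled` are assumptions made for the example. It uses plain pandas arithmetic with the population standard deviation (`ddof=0`), which is the same formula scikit-learn's `StandardScaler` applies.

```python
import pandas as pd

# Sample rows taken from the final table of this section (illustrative subset)
features = pd.DataFrame({
    'Total Population': [17510.0, 46496.0, 43794.0, 12494.0, 21933.0],
    'Average Family Income': [65104.0, 86997.0, 64497.0, 98857.0, 55824.0],
    'No. of Gym Center': [12.0, 2.0, 2.0, 3.0, 17.0],
})

# Standard scaling: subtract the column mean, divide by the population standard deviation
scaled = (features - features.mean()) / features.std(ddof=0)

print(scaled.round(2))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the distance computation in k-means just because of its units.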

#### Install required packages

In [1]:
#Uncomment the lines below if the packages are not installed yet
#!pip install beautifulsoup4
#!pip install requests
#!pip install geocoder
#!pip install folium

### *Note: If the maps in folium library don't render properly, please copy the notebook link or file and view it in https://nbviewer.jupyter.org/*

#### Import necessary libraries

In [3]:
from bs4 import BeautifulSoup # library for web scraping
import requests # library for HTTP GET requests
import pandas as pd # library for data analysis
import folium # map rendering library
from pandas import json_normalize # transform JSON into a pandas DataFrame (pandas >= 1.0)
import numpy as np # library to handle data in a vectorized manner

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#### Extract Toronto neighborhoods along with Postal Code from Wikipedia

In [4]:
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050"
html_doc = requests.get(url).text
soup = BeautifulSoup(html_doc, 'html.parser')

table_contents = []
table = soup.find('table')  # the first table on the page holds the postal codes

for row in table.find_all('tr'):
    # collect the text of every cell in the row, without line breaks
    arr = [td.text.replace('\n', '') for td in row.find_all('td')]

    # keep only complete rows (the header row has no <td> cells)
    if len(arr) == 3:
        table_contents.append({'PostalCode': arr[0],
                               'Borough': arr[1],
                               'Neighborhood': arr[2]})

df = pd.DataFrame(table_contents)
df['Borough'] = df['Borough'].replace({
    'Stn A PO Boxes 25 The Esplanade': 'Downtown Toronto Stn A',
    'Business Reply Mail Processing Centre 969 Eastern': 'East Toronto Business'})

# drop rows where both the borough and the neighborhood are 'Not assigned'
df = df[~((df.Borough == 'Not assigned') & (df.Neighborhood == 'Not assigned'))]
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


#### Download Geospatial data that contains latitude and longitude of neighborhoods in Toronto

In [7]:
!wget -q -O 'geospatial.csv' https://cocl.us/Geospatial_data
sp_df = pd.read_csv('geospatial.csv')
sp_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


#### Merge Toronto neighborhood with Geospatial data

In [8]:
df_neighborhood = df.join(sp_df.set_index('Postal Code'), on='PostalCode')
df_neighborhood.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.753259,-79.329656
3,M4A,North York,Victoria Village,43.725882,-79.315572
4,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
5,M6A,North York,Lawrence Heights,43.718518,-79.464763
6,M6A,North York,Lawrence Manor,43.718518,-79.464763
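
Because several neighborhoods can share one postal code (M6A above), the join is many-to-one from neighborhoods to coordinates. The sketch below performs the same kind of merge with `pd.merge` and its `validate` argument, which raises an error if a postal code appears more than once in the coordinates table. The two sample frames are built from values shown in this notebook; the frame names are hypothetical.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    'PostalCode': ['M3A', 'M6A', 'M6A'],
    'Neighborhood': ['Parkwoods', 'Lawrence Heights', 'Lawrence Manor'],
})
coordinates = pd.DataFrame({
    'Postal Code': ['M3A', 'M6A'],
    'Latitude': [43.753259, 43.718518],
    'Longitude': [-79.329656, -79.464763],
})

# validate='many_to_one' checks that each postal code is unique on the right side
merged = pd.merge(neighborhoods, coordinates,
                  left_on='PostalCode', right_on='Postal Code',
                  how='left', validate='many_to_one').drop(columns='Postal Code')
print(merged)
```

Both M6A neighborhoods receive the same latitude and longitude, which matters later because the venue search is centred on these coordinates.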


#### Read the Toronto census data file stored within the project

In [9]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,Neighbourhood,Neighbourhood Id,Combined Indicators,Total Population,Average Family Income,After-Tax Household Income,Pop 15 - 64 years
0,West Humber-Clairville,1.0,,33312.0,72820.0,59703.0,23285.0
1,Mount Olive-Silverstone-Jamestown,2.0,,32954.0,57411.0,46986.0,22300.0
2,Thistletown-Beaumond Heights,3.0,,10360.0,70838.0,57522.0,6760.0
3,Rexdale-Kipling,4.0,,10529.0,69367.0,51194.0,7165.0
4,Elms-Old Rexdale,5.0,,9456.0,61196.0,49425.0,6370.0
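
The loading code is not included in the shared notebook. A minimal sketch of what it might look like, assuming the census export is a CSV file with the columns shown above: `census_csv` is an in-memory stand-in for the real file (whose name and location are not given in the source), while `df_data_1` is the name the later cells use.

```python
import io
import pandas as pd

# Stand-in for the census file; a real run would pass the file path to pd.read_csv
census_csv = io.StringIO(
    "Neighbourhood,Neighbourhood Id,Combined Indicators,Total Population,"
    "Average Family Income,After-Tax Household Income,Pop 15 - 64 years\n"
    "West Humber-Clairville,1.0,,33312.0,72820.0,59703.0,23285.0\n"
    "Mount Olive-Silverstone-Jamestown,2.0,,32954.0,57411.0,46986.0,22300.0\n"
)

df_data_1 = pd.read_csv(census_csv)
print(df_data_1.head())
```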


#### Merge all relevant tables into a single table

<dl>
    <dt>Note:</dt>
    <dd>* <i>Rows with NaN in the Total Population column are dropped.</i></dd> 
    <dd>* <i>The tables are joined on the neighborhood name because the census data does not contain postal codes.</i></dd>
</dl>

In [10]:
df_merge = df_neighborhood.join(df_data_1[['Neighbourhood', 'Total Population', 'Average Family Income']].set_index('Neighbourhood'), on='Neighborhood')
df_merge = df_merge[~df_merge['Total Population'].isnull()]
df_merge.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Total Population,Average Family Income
3,M4A,North York,Victoria Village,43.725882,-79.315572,17510.0,65104.0
10,M1B,Scarborough,Rouge,43.806686,-79.194353,46496.0,86997.0
11,M1B,Scarborough,Malvern,43.806686,-79.194353,43794.0,64497.0
26,M1C,Scarborough,Highland Creek,43.784535,-79.160497,12494.0,98857.0
30,M3C,North York,Flemingdon Park,43.7259,-79.340923,21933.0,55824.0
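
Joining on names silently discards every neighborhood whose Wikipedia name has no exact match in the census table; Parkwoods, for example, no longer appears in the output above. One way to see which names are lost before dropping them is a merge with `indicator=True`. The sketch uses a few rows from the tables above, and the frame names are illustrative.

```python
import pandas as pd

wiki = pd.DataFrame({'Neighborhood': ['Parkwoods', 'Victoria Village', 'Harbourfront']})
census = pd.DataFrame({
    'Neighbourhood': ['Victoria Village'],
    'Total Population': [17510.0],
    'Average Family Income': [65104.0],
})

# indicator=True adds a '_merge' column that marks rows found only on the left side
check = wiki.merge(census, left_on='Neighborhood', right_on='Neighbourhood',
                   how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'Neighborhood'].tolist()
print(unmatched)
```

Reviewing the unmatched list makes it possible to decide whether a name should be mapped by hand or whether losing the neighborhood is acceptable.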


#### Now that we have the required Toronto information, let's see how many gyms are in each neighborhood using the Foursquare API

#### Define Foursquare Credentials and Version


In [11]:
# The code was removed by Watson Studio for sharing.
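
The credentials cell is withheld as well. A sketch of how the three names used by `getAllVenues` could be defined without hard-coding secrets, assuming they are supplied through environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the version date are assumptions, not values from the original notebook.

```python
import os

# Hypothetical environment variable names; export them before starting the notebook
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')

# The v parameter of the Foursquare API is a date in YYYYMMDD format (value assumed here)
VERSION = '20200401'

if not CLIENT_ID or not CLIENT_SECRET:
    print('Foursquare credentials are not set; API calls will fail.')
```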

#### Define getAllVenues to loop through all neighborhoods and find gym venues within a 2,500 m radius

In [12]:
def getAllVenues(names, latitudes, longitudes):
    """Search Foursquare for gym venues around each neighborhood's coordinates."""
    venues_list = []
    query = 'gym'
    radius = 2500  # search radius in meters

    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&v={}&ll={},{}&query={}&radius={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            query,
            radius
        )

        # make the GET request and keep the list of venues from the response
        results = requests.get(url).json()['response']['venues']

        # keep only the relevant information for each nearby venue
        for v in results:
            try:
                vname = v['name']
                vlat = v['location']['lat']
                vlng = v['location']['lng']
                vcat = v['categories'][0]['name']
            except IndexError:
                # the venue has no category: record a blank placeholder row
                vname = ''
                vlat = 0
                vlng = 0
                vcat = ''

            venues_list.append((name, lat, lng, vname, vlat, vlng, vcat))

    # one row per (neighborhood, venue) pair
    nearby_venues = pd.DataFrame(venues_list, columns=[
        'Neighborhood',
        'Neighborhood Latitude',
        'Neighborhood Longitude',
        'Venue',
        'Venue Latitude',
        'Venue Longitude',
        'Venue Category'
    ])

    return nearby_venues
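
Note that the `except IndexError` branch above blanks the venue name and coordinates whenever a venue has no category, even though those fields are present in the response. A possible alternative, shown as a sketch rather than the code that produced the outputs below, is a small helper that only defaults the category. `parse_venue` is a hypothetical name, and the sample dictionaries are illustrative values that mimic the fields the function reads.

```python
def parse_venue(v):
    """Return (name, lat, lng, category); the category is '' when none is listed."""
    categories = v.get('categories') or []
    category = categories[0]['name'] if categories else ''
    return v['name'], v['location']['lat'], v['location']['lng'], category


# Sample dictionaries shaped like the fields getAllVenues reads (illustrative values)
with_category = {'name': "Clancy's Boxing Gym",
                 'location': {'lat': 43.71851, 'lng': -79.308707},
                 'categories': [{'name': 'Gym / Fitness Center'}]}
without_category = {'name': 'Example Gym',
                    'location': {'lat': 43.7, 'lng': -79.4},
                    'categories': []}

print(parse_venue(with_category))
print(parse_venue(without_category))
```

With this variant an uncategorised venue keeps its real name and position, so summary statistics on the coordinates are not distorted by zeros.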

#### Store all venues returned by the Foursquare API in the toronto_venues DataFrame

In [13]:
toronto_venues = getAllVenues(names=df_merge['Neighborhood'],
                                   latitudes=df_merge['Latitude'],
                                   longitudes=df_merge['Longitude']
                                  )
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Victoria Village,43.725882,-79.315572,Gym,43.725821,-79.309808,Residential Building (Apartment / Condo)
1,Victoria Village,43.725882,-79.315572,Clancy's Boxing Gym,43.71851,-79.308707,Gym / Fitness Center
2,Victoria Village,43.725882,-79.315572,Tridel Accolade Gym,43.724403,-79.327789,Gym / Fitness Center
3,Victoria Village,43.725882,-79.315572,Gym 9,43.715557,-79.303993,Athletics & Sports
4,Victoria Village,43.725882,-79.315572,Bell Wynford Gym,43.726643,-79.33144,Athletics & Sports


#### Let's look at summary statistics for the venues data

In [14]:
toronto_venues.describe(include='all')

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
count,419,419.0,419.0,419.0,419.0,419.0,419
unique,34,,,218.0,,,26
top,Little Portugal,,,,,,Gym
freq,30,,,20.0,,,195
mean,,43.713467,-79.382305,,41.627738,-75.588953,
std,,0.056246,0.095099,,9.331211,16.943845,
min,,43.602414,-79.577201,,0.0,-79.599456,
25%,,43.676357,-79.442259,,43.657278,-79.424579,
50%,,43.711112,-79.411307,,43.714512,-79.392559,
75%,,43.763573,-79.328247,,43.765892,-79.311214,
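
The minimum `Venue Latitude` of 0.0 and the blank most frequent `Venue` come from the placeholder rows that `getAllVenues` writes for venues without a category, which also pull the mean coordinates away from Toronto. If those rows are not wanted, they can be filtered out on the blank category. A sketch on a three-row sample, with illustrative frame names:

```python
import pandas as pd

sample = pd.DataFrame({
    'Venue': ["Clancy's Boxing Gym", '', 'Tridel Accolade Gym'],
    'Venue Latitude': [43.71851, 0.0, 43.724403],
    'Venue Longitude': [-79.308707, 0.0, -79.327789],
    'Venue Category': ['Gym / Fitness Center', '', 'Gym / Fitness Center'],
})

# Keep only rows that carry a real category (placeholders have an empty string)
cleaned = sample[sample['Venue Category'] != ''].reset_index(drop=True)
print(cleaned['Venue Latitude'].min())
```

The gym counts computed later are unaffected either way, since the placeholder rows belong to neither gym category.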


#### Group and count the venues returned by the Foursquare API search for each neighborhood

In [16]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add the neighborhood column back to the dataframe (named 'Neighborhoods' because 'Neighborhood' is also a venue category)
toronto_onehot['Neighborhoods'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_grouped = toronto_onehot.groupby('Neighborhoods').sum().reset_index()
print('The group table has {} columns and {} rows.'.format(toronto_grouped.shape[1], toronto_grouped.shape[0]))
toronto_grouped.head(5)

The group table has 27 columns and 34 rows.


Unnamed: 0,Neighborhoods,Unnamed: 2,Athletics & Sports,Basketball Court,Boxing Gym,Building,College Gym,Dance Studio,General Entertainment,Gym,...,Park,Playground,Pool,Residential Building (Apartment / Condo),School,Spa,Sports Bar,Sports Club,Student Center,Yoga Studio
0,Agincourt North,0,0,0,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0
1,Alderwood,0,1,0,0,0,2,0,0,4,...,0,0,0,0,0,1,0,0,0,0
2,Bathurst Manor,1,0,0,0,0,1,0,0,13,...,0,0,0,5,0,0,0,1,0,0
3,Bayview Village,1,0,0,0,1,1,0,0,10,...,1,0,0,1,0,0,0,0,0,0
4,Cliffcrest,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
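
The one-hot encoding followed by `groupby(...).sum()` yields a count of venues per category for each neighborhood. The same kind of table can be obtained more directly with `pd.crosstab`. The sketch below uses a handful of made-up rows to show the idea and is an alternative, not the notebook's method.

```python
import pandas as pd

# Made-up rows for illustration only
venues = pd.DataFrame({
    'Neighborhood': ['Alderwood', 'Alderwood', 'Alderwood', 'Cliffcrest'],
    'Venue Category': ['Gym', 'Gym', 'Spa', 'Park'],
})

# Rows are neighborhoods, columns are categories, cells are venue counts
counts = pd.crosstab(venues['Neighborhood'], venues['Venue Category'])
print(counts)
```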


#### Combine the gym-related columns

In [17]:
toronto_grouped['No. of Gym Center'] = toronto_grouped['Gym'] + toronto_grouped['Gym / Fitness Center']
toronto_gym = toronto_grouped[['Neighborhoods', 'No. of Gym Center']]
toronto_gym.head()

Unnamed: 0,Neighborhoods,No. of Gym Center
0,Agincourt North,1
1,Alderwood,5
2,Bathurst Manor,20
3,Bayview Village,12
4,Cliffcrest,0
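
Adding the two columns directly raises a `KeyError` if a search happens to return no venue in one of the categories. A slightly more defensive sketch, assuming the same two category names, uses `reindex` to supply a zero column when a category is missing. The sample values follow the first two rows above; `grouped` and `gym_columns` are illustrative names.

```python
import pandas as pd

grouped = pd.DataFrame({
    'Neighborhoods': ['Agincourt North', 'Alderwood'],
    'Gym': [1, 4],
    'Gym / Fitness Center': [0, 1],
    'Yoga Studio': [0, 0],
})

gym_columns = ['Gym', 'Gym / Fitness Center']

# reindex fills in a zero column for any category absent from the search results
grouped['No. of Gym Center'] = grouped.reindex(columns=gym_columns, fill_value=0).sum(axis=1)
print(grouped[['Neighborhoods', 'No. of Gym Center']])
```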


#### Create the final table for our study by mapping the number of gyms onto the Toronto data

In [18]:
toronto_info = df_merge.join(toronto_gym.set_index('Neighborhoods'), on='Neighborhood').reset_index(drop=True)
# drop neighborhoods without a gym count, then renumber the rows (reset_index returns a new frame)
toronto_info = toronto_info.dropna(subset=['No. of Gym Center']).reset_index(drop=True)
toronto_info.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Total Population,Average Family Income,No. of Gym Center
0,M4A,North York,Victoria Village,43.725882,-79.315572,17510.0,65104.0,12.0
1,M1B,Scarborough,Rouge,43.806686,-79.194353,46496.0,86997.0,2.0
2,M1B,Scarborough,Malvern,43.806686,-79.194353,43794.0,64497.0,2.0
3,M1C,Scarborough,Highland Creek,43.784535,-79.160497,12494.0,98857.0,3.0
4,M3C,North York,Flemingdon Park,43.7259,-79.340923,21933.0,55824.0,17.0
