<h1>Applied Data Science Capstone</h1>

_This notebook will be used mainly for the applied data science capstone project that is part of the <b>IBM Data Science Professional Certificate</b> course._





<h2>1. Introduction</h2>

This project aims to help a group of stakeholders decide where best to open a highly sophisticated 24-hour gym in Toronto. Toronto is the most populous city in Canada and ranks among the top cities in the world for health and quality of living, which makes it an attractive market for an investment group. Before committing, we first need to explore the current gym market in the city and identify the locations where a new gym is most likely to be profitable.

The investors are interested in neighborhoods that meet the following criteria:

- The neighborhood should have an average or above-average population.
- Since the gym is open 24 hours, a higher percentage of residents younger than 45 is preferred.
- Since it is a highly sophisticated gym, membership fees will be higher, so household incomes in the area should be average or above average.

Using the data gathered against these criteria, the goal is to find and recommend the most suitable areas for the investors to open their gym. They can then use the results to find premises to buy or rent for their business.
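
As a rough illustration of how the population and income criteria could be applied once the data is prepared, the sketch below keeps the neighborhoods that are at or above the mean on both columns. The five rows are copied from the prepared data shown later in this notebook. The helper name `meets_criteria` and the use of the sample mean as the "average" are assumptions, and the age criterion is left out because none of the tables loaded here contain age data.

```python
import pandas as pd

# Five rows copied from the prepared data shown later in this notebook
sample = pd.DataFrame({
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Lawrence Heights',
                     'Lawrence Manor', 'Rouge'],
    'Population': [26533, 17047, 3769, 13750, 22724],
    'Average Income': [34811, 29657, 29867, 36361, 29230],
})

def meets_criteria(frame):
    """Hypothetical helper: keep rows at or above the mean of both columns."""
    enough_people = frame['Population'] >= frame['Population'].mean()
    enough_income = frame['Average Income'] >= frame['Average Income'].mean()
    return frame[enough_people & enough_income]

candidates = meets_criteria(sample)
print(candidates['Neighborhood'].tolist())
```

On the full dataset the thresholds would be computed over all neighborhoods rather than over this small sample.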

This information can also be shared with other investors who are looking to open a new gym or a recreational center.

<h2>2. Data</h2>

<h3>2.1 Data Description</h3>

The data for this project comes from the following sources:
- City of Toronto Neighborhoods:
https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050
- City of Toronto Neighborhoods demographics data:
https://en.wikipedia.org/wiki/Demographics_of_Toronto_neighbourhoods
- Foursquare API to collect information on other gyms and competitors in Toronto

<h3>2.2 Data Preparation</h3>

First let's import the libraries that will be used

In [1]:

import pandas as pd
import numpy as np
import lxml # parser backend for pd.read_html

#!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#!pip install requests
import requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0; older versions: pandas.io.json)

Now let's load the postal codes, boroughs and neighborhoods of the city into a pandas dataframe and attach the coordinates of each postal code

In [77]:
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050"
data = pd.read_html(url, header=0)
df = data[0]

#rename the columns
df.rename(columns={'Postcode':'PostalCode','Neighbourhood':'Neighborhood'},inplace=True)

#delete the rows with an unassigned Borough or Neighborhood
df=df[df['Borough']!='Not assigned']
df=df[df['Neighborhood']!='Not assigned']

#df=df.groupby(['PostalCode', 'Borough']).agg({'Neighborhood' : ','.join})

#Filtering leaves gaps in the row labels, so reset them; the old labels are kept in a new 'index' column
df.reset_index(inplace=True)

#Alternative to dropping (disabled): replace unassigned neighborhoods with the name of the borough
#df['Neighborhood'][df['Neighborhood']=='Not assigned']=df['Borough'][df['Neighborhood']=='Not assigned']

coords = pd.read_csv('https://cocl.us/Geospatial_data')

coords.rename(columns={'Postal Code':'PostalCode'},inplace=True)

df1 = pd.merge(df, coords, left_on=  ['PostalCode'],
            right_on= ['PostalCode'], 
            how = 'left')
            
df1.head()

Unnamed: 0,index,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,2,M3A,North York,Parkwoods,43.753259,-79.329656
1,3,M4A,North York,Victoria Village,43.725882,-79.315572
2,4,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,5,M6A,North York,Lawrence Heights,43.718518,-79.464763
4,6,M6A,North York,Lawrence Manor,43.718518,-79.464763


Now let's load the demographics for the city.
We will only need the Name of the neighborhood, Population, Population Density and the Average Income data columns.

In [78]:
url = "https://en.wikipedia.org/wiki/Demographics_of_Toronto_neighbourhoods"
data = pd.read_html(url, header=0)
df_n = data[1][['Name','Population','Density (people/km2)','Average Income']]
df_n.head(10)

Unnamed: 0,Name,Population,Density (people/km2),Average Income
0,Toronto CMA Average,5113149,866,40704
1,Agincourt,44577,3580,25750
2,Alderwood,11656,2360,35239
3,Alexandra Park,4355,13609,19687
4,Allenby,2513,4333,245592
5,Amesbury,17318,4934,27546
6,Armour Heights,4384,1914,116651
7,Banbury,6641,2442,92319
8,Bathurst Manor,14945,3187,34169
9,Bay Street Corridor,4787,43518,40598


Now let's merge the location dataframe with the demographics dataframe

In [79]:
#merge the dataframes
df_f=pd.merge(df1, df_n, left_on=  ['Neighborhood'],
            right_on= ['Name'], 
            how = 'left')
df_f.head()       

Unnamed: 0,index,PostalCode,Borough,Neighborhood,Latitude,Longitude,Name,Population,Density (people/km2),Average Income
0,2,M3A,North York,Parkwoods,43.753259,-79.329656,Parkwoods,26533.0,5349.0,34811.0
1,3,M4A,North York,Victoria Village,43.725882,-79.315572,Victoria Village,17047.0,3612.0,29657.0
2,4,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636,,,,
3,5,M6A,North York,Lawrence Heights,43.718518,-79.464763,Lawrence Heights,3769.0,1178.0,29867.0
4,6,M6A,North York,Lawrence Manor,43.718518,-79.464763,Lawrence Manor,13750.0,6425.0,36361.0
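
The left merge keeps every row of the location dataframe, so neighborhoods without a match in the demographics table (such as Harbourfront above) end up with empty values. The following is a small sketch of how the `indicator` option of `pd.merge` could be used to list the unmatched names. The dataframes `left` and `right` are made-up stand-ins for `df1` and `df_n`.

```python
import pandas as pd

# Made-up stand-ins for df1 (locations) and df_n (demographics)
left = pd.DataFrame({'Neighborhood': ['Parkwoods', 'Victoria Village', 'Harbourfront']})
right = pd.DataFrame({'Name': ['Parkwoods', 'Victoria Village'],
                      'Population': [26533, 17047]})

merged = pd.merge(left, right, left_on='Neighborhood', right_on='Name',
                  how='left', indicator=True)

# '_merge' is 'both' for matched rows and 'left_only' for unmatched ones
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Neighborhood'].tolist()
print(unmatched)
```

Checking this list before continuing shows how many neighborhoods will have no demographics because their names differ between the two sources.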


Now let's clean up the data. Delete unnecessary columns and rename some columns.

In [80]:
#delete unnecessary columns
df_f.drop(['Name','PostalCode'],axis=1,inplace=True)

#rename 
df_f.rename(columns={'Density (people/km2)':'Population Density'},inplace=True)
df_f.head()

Unnamed: 0,index,Borough,Neighborhood,Latitude,Longitude,Population,Population Density,Average Income
0,2,North York,Parkwoods,43.753259,-79.329656,26533.0,5349.0,34811.0
1,3,North York,Victoria Village,43.725882,-79.315572,17047.0,3612.0,29657.0
2,4,Downtown Toronto,Harbourfront,43.65426,-79.360636,,,
3,5,North York,Lawrence Heights,43.718518,-79.464763,3769.0,1178.0,29867.0
4,6,North York,Lawrence Manor,43.718518,-79.464763,13750.0,6425.0,36361.0
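
Harbourfront still has empty demographic values at this point, yet it does not appear in the list of neighborhoods printed during the venue search further down, and row 2 is missing from the later tables. This suggests that rows without demographics were dropped in a step that is not shown. A possible version of that step, written as an assumption rather than the original code and using a small stand-in dataframe `demo`:

```python
import numpy as np
import pandas as pd

# Stand-in for df_f with one neighborhood that has no demographics
demo = pd.DataFrame({
    'Neighborhood': ['Parkwoods', 'Harbourfront', 'Victoria Village'],
    'Population': [26533.0, np.nan, 17047.0],
    'Population Density': [5349.0, np.nan, 3612.0],
    'Average Income': [34811.0, np.nan, 29657.0],
})

# Assumed step: drop neighborhoods that have no demographic data
demo = demo.dropna(subset=['Population', 'Population Density', 'Average Income'])
print(demo['Neighborhood'].tolist())
print(demo.index.tolist())
```

`dropna` keeps the original row labels, which would explain the gaps in the row labels of the later tables.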


Now let's get the coordinates of Toronto

In [73]:
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode('Toronto, Canada')
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


Set up the Foursquare API

In [16]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 
CLIENT_SECRET:
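
Hard-coding credentials in a shared notebook makes them easy to leak. One common alternative is to read them from environment variables. The sketch below assumes they are stored under the hypothetical names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`. `load_credentials` is an illustrative helper and not part of any Foursquare library.

```python
import os

def load_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping of environment variables.

    The variable names are hypothetical; use whatever names you exported.
    """
    client_id = env.get('FOURSQUARE_CLIENT_ID', '')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET', '')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set')
    return client_id, client_secret

# Example with an explicit mapping instead of the real environment
example_env = {'FOURSQUARE_CLIENT_ID': 'my-id',
               'FOURSQUARE_CLIENT_SECRET': 'my-secret'}
CLIENT_ID, CLIENT_SECRET = load_credentials(example_env)
print('CLIENT_ID is set:', bool(CLIENT_ID))
```

In the notebook, calling `load_credentials()` without arguments would read from the real environment.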


Function to get the nearby venues for each neighborhood

In [17]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return a dataframe of the venues within `radius` meters of each neighborhood.

    Relies on the globals CLIENT_ID, CLIENT_SECRET, VERSION and LIMIT,
    which must be defined before the function is called.
    """
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into one dataframe;
    # passing the column names here also works when no venues were found
    columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list],
                                 columns=columns)
    
    return nearby_venues
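
The list comprehension in `getNearbyVenues` indexes `v['venue']['categories'][0]`, which raises an `IndexError` if a venue comes back with an empty category list. As a sketch of a more defensive variant, the hypothetical helper `parse_items` below handles that case. It assumes only the response layout already used above, and the sample payload is made up for illustration.

```python
def parse_items(name, lat, lng, items):
    """Turn the 'items' list of an explore response into flat tuples."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to None when the venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows

# Made-up payload in the same shape as response['groups'][0]['items']
sample_items = [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unlabelled venue',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': []}},
]

rows = parse_items('Parkwoods', 43.753259, -79.329656, sample_items)
print(rows[0][-1], rows[1][-1])
```

Keeping the parsing separate from the HTTP request also makes it possible to check the logic without calling the API.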

Now let's get all nearby venues for each neighborhood

In [18]:
LIMIT = 100
toronto_venues = getNearbyVenues(names=df_f['Neighborhood'],
                                   latitudes=df_f['Latitude'],
                                   longitudes=df_f['Longitude']
                                  )

Parkwoods
Victoria Village
Lawrence Heights
Lawrence Manor
Rouge
Malvern
Garden District
Princess Gardens
West Deane Park
Highland Creek
Rouge Hill
Port Union
Flemingdon Park
St. James Town
Eringate
Markland Wood
Guildwood
Morningside
West Hill
The Beaches
Woburn
Leaside
Bathurst Manor
Wilson Heights
Thorncliffe Park
Scarborough Village
Henry Farm
Toronto Islands
Little Portugal
Ionview
Bayview Village
Riverdale
Brockton
Clairlea
Oakridge
York Mills
Downsview
Humber Summit
Cliffcrest
Cliffside
Newtonbrook
Willowdale
Bedford Park
Mount Dennis
Silverthorn
Humberlea
Birch Cliff
Lawrence Park
Runnymede
Weston


Let's examine the shape of the venues dataframe 

In [19]:
print(toronto_venues.shape)
toronto_venues.head()

(1639, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Allwyn's Bakery,43.75984,-79.324719,Caribbean Restaurant
1,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
2,Parkwoods,43.753259,-79.329656,Tim Hortons,43.760668,-79.326368,Café
3,Parkwoods,43.753259,-79.329656,A&W,43.760643,-79.326865,Fast Food Restaurant
4,Parkwoods,43.753259,-79.329656,Bruno's valu-mart,43.746143,-79.32463,Grocery Store


In [21]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Bathurst Manor,28,28,28,28,28,28
Bayview Village,14,14,14,14,14,14
Bedford Park,41,41,41,41,41,41
Birch Cliff,11,11,11,11,11,11
Brockton,100,100,100,100,100,100
Clairlea,27,27,27,27,27,27
Cliffcrest,12,12,12,12,12,12
Cliffside,12,12,12,12,12,12
Downsview,11,11,11,11,11,11
Eringate,19,19,19,19,19,19


Now let's see how many venues have the word Gym or Fitness in their category. We will assume these venues are the competition in the city.

In [22]:
toronto_venues[(toronto_venues['Venue Category'].str.contains('Gym', regex=False)) |
                 (toronto_venues['Venue Category'].str.contains('Fitness', regex=False)) ].count()

Neighborhood              57
Neighborhood Latitude     57
Neighborhood Longitude    57
Venue                     57
Venue Latitude            57
Venue Longitude           57
Venue Category            57
dtype: int64
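
The two `str.contains` calls can also be written as a single pattern. The following is a minimal sketch on a made-up stand-in for `toronto_venues`. It assumes a regular-expression alternation, with `na=False` so that a missing category counts as a non-match.

```python
import pandas as pd

# Made-up stand-in for toronto_venues; the category values are illustrative
venues = pd.DataFrame({
    'Venue': ['A', 'B', 'C', 'D', 'E'],
    'Venue Category': ['Gym', 'Gym / Fitness Center', 'Park',
                       'Fitness Studio', None],
})

# One regular expression instead of two literal searches joined with |
is_gym = venues['Venue Category'].str.contains('Gym|Fitness', regex=True, na=False)
print(int(is_gym.sum()))
```

Storing the mask in a variable such as `is_gym` also avoids repeating the same filter in the next cell.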

Now let's make a dataframe that holds the count of gyms in each neighborhood

In [50]:
toronto_gyms = toronto_venues[(toronto_venues['Venue Category'].str.contains('Gym', regex=False)) |
                 (toronto_venues['Venue Category'].str.contains('Fitness', regex=False))].groupby(['Neighborhood']).count()
toronto_gyms.drop(['Neighborhood Latitude', 'Neighborhood Longitude', 'Venue Longitude', 'Venue', 'Venue Latitude'], axis = 1, inplace = True)
toronto_gyms.rename(columns = {'Venue Category':'Number of gyms'}, inplace=True)
toronto_gyms.head()

Neighborhood,Number of gyms
Birch Cliff,2
Brockton,4
Clairlea,1
Flemingdon Park,3
Garden District,2


Now let's join the gyms data to the demographics dataframe

In [51]:
df_f = df_f.join(toronto_gyms, on='Neighborhood')
df_f.head()

Unnamed: 0,index,Borough,Neighborhood,Latitude,Longitude,Population,Population Density,Average Income,Number of gyms
0,2,North York,Parkwoods,43.753259,-79.329656,26533.0,5349.0,34811.0,
1,3,North York,Victoria Village,43.725882,-79.315572,17047.0,3612.0,29657.0,2.0
3,5,North York,Lawrence Heights,43.718518,-79.464763,3769.0,1178.0,29867.0,2.0
4,6,North York,Lawrence Manor,43.718518,-79.464763,13750.0,6425.0,36361.0,2.0
7,10,Scarborough,Rouge,43.806686,-79.194353,22724.0,791.0,29230.0,1.0


Neighborhoods where no gym was found have NaN in the gym count, so fill the NaN values with 0

In [66]:
df_f = df_f.fillna(0)
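
Note that `df_f.fillna(0)` replaces missing values in every column, so a neighborhood with missing demographics would also end up with a population or income of 0. If only the gym count should default to zero, a narrower variant is possible. The sketch below uses a small made-up dataframe, and the name `Example Heights` is a placeholder.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df_f; 'Example Heights' is a placeholder name
frame = pd.DataFrame({
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Example Heights'],
    'Population': [26533.0, 17047.0, np.nan],
    'Number of gyms': [np.nan, 2.0, 1.0],
})

# Fill only the gym count and store it as an integer
frame['Number of gyms'] = frame['Number of gyms'].fillna(0).astype(int)
print(frame)
```

The missing population stays missing, so it cannot be mistaken for a real value of 0 in the later analysis.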

Examine the final data

In [67]:
df_f.head()

Unnamed: 0,index,Borough,Neighborhood,Latitude,Longitude,Population,Population Density,Average Income,Number of gyms
0,2,North York,Parkwoods,43.753259,-79.329656,26533.0,5349.0,34811.0,0.0
1,3,North York,Victoria Village,43.725882,-79.315572,17047.0,3612.0,29657.0,2.0
3,5,North York,Lawrence Heights,43.718518,-79.464763,3769.0,1178.0,29867.0,2.0
4,6,North York,Lawrence Manor,43.718518,-79.464763,13750.0,6425.0,36361.0,2.0
7,10,Scarborough,Rouge,43.806686,-79.194353,22724.0,791.0,29230.0,1.0


With this, the data preparation process is complete.

<h2>3. Methodology and Analysis</h2>