# Buying a first home in Toronto, Canada

### This notebook will be used to showcase my Data Science Capstone! 

#### Table of Contents 
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

A common problem for prospective first-time homeowners is deciding where to buy. Where is the ideal location for what is likely one of their largest investments?

My goal is to recommend where a first-time buyer should purchase a home, based on a few assumptions about what first-time buyers look for.


### Target Audience

The target demographic for this study is young first-time buyers. I therefore make the following assumptions about their preferences.

My “simplified and targeted” first-time buyer looks for the following when buying a home:

* Wants to be close to as many restaurants as possible
* Wants to be close to public transportation (subway lines)
* Wants to be as close to downtown as possible


## Data <a name="data"></a>

In [216]:
#Imports
import pandas as pd 
import numpy as np
import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
import folium # map rendering library

### 1. Toronto Neighborhoods

In [217]:
link =  "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

#Read the table from the Wikipedia link
df = pd.read_html(link)[0]

# Use the first row as the header
new_header = df.iloc[0]

# Keep the rest of the data, minus the header row
df = df[1:] 

# Set the header row as the df header
df.columns = new_header 

#Remove rows with a "Not assigned" Borough
df = df[df.Borough != 'Not assigned']

#Reset the index
df.reset_index(inplace=True)
df.drop('index',axis=1,inplace=True)

#If a Borough has a value but its Neighbourhood is "Not assigned", use the Borough name
df['Neighbourhood']=df['Neighbourhood'].mask(df['Neighbourhood'] == 'Not assigned', df['Borough'])

#Group by Postcode and Borough, joining the neighbourhood names with commas
df = df.groupby(['Postcode','Borough']).agg(lambda col: ', '.join(col))

#Final clean up
df.reset_index(inplace = True)

csv_file = 'http://cocl.us/Geospatial_data'

#Reading the dataframe and cleaning it up
df_codes = pd.read_csv(csv_file)
df_codes.columns = ['Postcode', 'Latitude', 'Longitude']

#Merging the data frames together
df_complete = pd.merge(df, df_codes, on='Postcode')

#Getting a data frame with Boroughs with 'Toronto' in them
#.copy() gives an independent frame, which avoids pandas' SettingWithCopyWarning
df_Toronto = df_complete[df_complete['Borough'].str.contains("Toronto")].copy()

df_Toronto.reset_index(drop=True, inplace=True)

df_Toronto.head()


| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 1 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 2 | M4L | East Toronto | The Beaches West, India Bazaar | 43.668999 | -79.315572 |
| 3 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 4 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |
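
The grouping step in the cell above collapses rows that share a postal code into a single row with a comma-separated neighbourhood list. Here is a minimal sketch of that step on a toy frame; the rows are invented for illustration and are not taken from the Wikipedia table.

```python
import pandas as pd

# Toy rows, invented for illustration: two neighbourhoods share postcode M1A.
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M1A', 'M2B'],
    'Borough': ['East Toronto', 'East Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Alpha', 'Beta', 'Gamma'],
})

# One row per (Postcode, Borough); neighbourhood names are joined with ', '.
combined = (
    toy.groupby(['Postcode', 'Borough'], as_index=False)['Neighbourhood']
       .agg(', '.join)
)
print(combined)
```

Passing `as_index=False` keeps `Postcode` and `Borough` as ordinary columns, so no `reset_index` call is needed afterwards.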


### 2. Subway Locations in Toronto

In [218]:
link = 'http://scruss.com/wordpress/wp-content/yonge-university-spadina-NAD83.csv'
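# NOTE: the labels below are swapped relative to the values (the 'Longitude' column
# holds latitudes, as the output shows); the map cells unpack them as lon, lat to compensate
df_subway_loc = pd.read_csv(link , names = ['Longitude','Latitude','Station'])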
df_subway_loc.head()

| | Longitude | Latitude | Station |
|---|---|---|---|
| 0 | 43.750054 | -79.462343 | Downsview |
| 1 | 43.734581 | -79.449929 | Wilson |
| 2 | 43.724813 | -79.447509 | Yorkdale |
| 3 | 43.716381 | -79.444029 | Lawrence West |
| 4 | 43.70982 | -79.441528 | Glencairn |


### 3. Restaurants in Toronto

This one is a little trickier. To get the restaurants for the areas defined in **data set 1**, we need to use the *Foursquare API*.

Using the following assumptions:
* Only considering up to 25 restaurants per location
* Considering anywhere within a 500 m radius of the area center as a reasonable distance to travel

In [219]:
import os

LIMIT = 25   # maximum number of restaurants to request per neighborhood
RADIUS = 500 # search radius in metres around each neighborhood center

# Foursquare credentials are read from the environment instead of being hard-coded.
# The variable names below are an assumed convention; set them before running this cell.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605' # Foursquare API version date

def getNearbyVenues(names, latitudes, longitudes, radius=RADIUS):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        # the categoryId filters the results for restaurants
        url = 'https://api.foursquare.com/v2/venues/explore?&categoryId=4d4b7105d754a06374d81259&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into one data frame
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [220]:
toronto_venues = getNearbyVenues(names=df_Toronto['Neighbourhood'],
                                   latitudes=df_Toronto['Latitude'],
                                   longitudes=df_Toronto['Longitude']
                                  )

toronto_venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | The Beaches | 43.676357 | -79.293031 | Domino's Pizza | 43.679058 | -79.297382 | Pizza Place |
| 1 | The Beaches | 43.676357 | -79.293031 | Fearless Meat | 43.680337 | -79.290289 | Burger Joint |
| 2 | The Beaches | 43.676357 | -79.293031 | Seaspray Restaurant | 43.678888 | -79.298167 | Asian Restaurant |
| 3 | The Danforth West, Riverdale | 43.679557 | -79.352188 | Pantheon | 43.677621 | -79.351434 | Greek Restaurant |
| 4 | The Danforth West, Riverdale | 43.679557 | -79.352188 | Mezes | 43.677962 | -79.350196 | Greek Restaurant |
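
The venue rows above come from indexing into the JSON response as `["response"]["groups"][0]["items"]`. The sketch below separates that flattening step from the network call so it can be checked offline. The helper name `venues_to_rows` and the sample payload are hypothetical; the payload is hand-written to mimic the shape the notebook indexes into, and it is not real API output.

```python
def venues_to_rows(name, lat, lng, payload):
    """Flatten one explore-style JSON payload into row tuples for a neighborhood."""
    groups = payload.get('response', {}).get('groups') or [{}]
    rows = []
    for item in groups[0].get('items', []):
        venue = item['venue']
        # If a venue has an empty category list, fall back to None instead of raising.
        categories = venue.get('categories') or [{}]
        rows.append((
            name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            categories[0].get('name'),
        ))
    return rows


# Hand-written payload in the shape the notebook indexes into (values are made up).
sample_payload = {
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Cafe One',
                   'location': {'lat': 43.65, 'lng': -79.38},
                   'categories': [{'name': 'Cafe'}]}},
        {'venue': {'name': 'No Category Diner',
                   'location': {'lat': 43.66, 'lng': -79.39},
                   'categories': []}},
    ]}]}
}

rows = venues_to_rows('Sample Area', 43.655, -79.385, sample_payload)
empty = venues_to_rows('Sample Area', 43.655, -79.385, {})
print(rows)
print(empty)
```

Keeping the parsing in its own function means a missing `groups` key or an empty category list yields fewer or partial rows rather than an exception partway through the loop over neighborhoods.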


## Methodology <a name="methodology"></a>

In order to find the ideal location, we need to satisfy 3 conditions:

##### 1. We need to find a neighborhood with a sufficient number of restaurants (25 to simplify)

To do this, we will use the **Foursquare** API to find the neighborhoods that reach this number of restaurants.

##### 2. We need to find a neighborhood close to the subway line 

We will use **folium** to plot our subway data set on a map of *Toronto* and see where the stations lie.

##### 3. We need to find a neighborhood that is classified under the "downtown" cluster

Finally, we will compare the results of the first 2 conditions with the locations of the neighborhoods classified as downtown.
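
Condition 2 is judged visually in this notebook. As a complement, closeness to the subway could also be expressed as a number. The sketch below is one possible approach, not the method used in the analysis: it computes the great-circle (haversine) distance from a neighborhood center to its nearest station. The function names are hypothetical, and the coordinates are taken from the sample rows shown in the Data section.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_station(lat, lon, stations):
    """Return (station name, distance in km) for the closest (name, lat, lon) entry."""
    return min(
        ((name, haversine_km(lat, lon, s_lat, s_lon)) for name, s_lat, s_lon in stations),
        key=lambda pair: pair[1],
    )


# Two stations from the subway sample rows, written as (name, latitude, longitude).
stations = [
    ('Downsview', 43.750054, -79.462343),
    ('Glencairn', 43.70982, -79.441528),
]

# Lawrence Park center from the neighborhood sample rows.
name, km = nearest_station(43.72802, -79.38879, stations)
print(name, round(km, 2))
```

A distance column built this way would let the neighborhoods be sorted by subway access instead of compared by eye.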

## Analysis <a name="analysis"></a>

### 1. Neighborhoods with most restaurants

To meet the criterion that the neighborhood should be dense with restaurants, we group the restaurant data set by neighborhood and keep the neighborhoods that reach our cap of 25 restaurants.

In [221]:
toronto_venues_grouped = toronto_venues.groupby('Neighborhood').count()

#Keep only neighborhoods that reach the cap of 25 restaurants, then take the first 10
#(groupby sorts by name, so these are the first 10 alphabetically, not a ranking)
toronto_venues_grouped = toronto_venues_grouped[toronto_venues_grouped['Venue'] == 25]
top10_neighborhoods_restaurants = toronto_venues_grouped.head(10)

df_Toronto = df_Toronto.rename(columns={'Neighbourhood': 'Neighborhood'})
df_Toronto_Top10 = pd.merge(df_Toronto, top10_neighborhoods_restaurants, on='Neighborhood')
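
The cell above keeps the neighborhoods that hit the 25-venue cap and then takes the first ten rows. If an explicit ranking by restaurant count is preferred, one option is to count and sort before taking the top rows. This is a sketch on invented toy data, and the `Restaurant Count` column name is hypothetical. With a cap of 25, many real neighborhoods would tie at the top, so the ranking is only informative below the cap.

```python
import pandas as pd

# Toy venue list, invented for illustration: A has 3 venues, C has 2, B has 1.
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'C', 'C'],
    'Venue': ['v1', 'v2', 'v3', 'v4', 'v5', 'v6'],
})

# Count venues per neighborhood and sort from most to fewest.
counts = (
    toy_venues.groupby('Neighborhood')['Venue']
              .count()
              .sort_values(ascending=False)
              .rename('Restaurant Count')
)

# Keep the top 2 here; the notebook would use 10.
top_n = counts.head(2).reset_index()
print(top_n)
```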

#### Creating a Map Showing the Top Neighborhoods according to our Restaurant Criteria

In [222]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [223]:
# create map
top10_map = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat, lon, poi in zip(df_Toronto_Top10['Latitude'], df_Toronto_Top10['Longitude'], df_Toronto_Top10['Neighborhood']):
    label = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        fill=True,
        fill_opacity=0.7).add_to(top10_map)
       
top10_map

### 2. Now let's look at a map of the Subway line 

In [224]:
# create map
station_maps = folium.Map(location=[latitude, longitude], zoom_start=12)

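# NOTE: df_subway_loc's column labels are swapped relative to their values
# ('Latitude' holds longitudes and vice versa), so they are unpacked as lon, lat here
for lon, lat, poi in zip(df_subway_loc['Latitude'], df_subway_loc['Longitude'], df_subway_loc['Station']):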
    label2 = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        color = 'yellow' ,
        popup=label2,
        fill=True,
        fill_opacity=0.9).add_to(station_maps)
       
station_maps

### 3. Pinpointing the heart of Downtown

Time for a quick recap:

* We've identified the neighborhoods with the most restaurants
* We've identified where the subway stations are relative to our selected neighborhoods
* **Now we need to identify where the heart of downtown is, to satisfy the need to be close to downtown**

#### Downtown Borough Group

In [229]:
df_Downtown_cluster = df_Toronto.loc[df_Toronto['Borough'] == 'Downtown Toronto']
df_Downtown_cluster.head()

| | Postcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 10 | M4W | Downtown Toronto | Rosedale | 43.679563 | -79.377529 |
| 11 | M4X | Downtown Toronto | Cabbagetown, St. James Town | 43.667967 | -79.367675 |
| 12 | M4Y | Downtown Toronto | Church and Wellesley | 43.66586 | -79.38316 |
| 13 | M5A | Downtown Toronto | Harbourfront | 43.65426 | -79.360636 |
| 14 | M5B | Downtown Toronto | Ryerson, Garden District | 43.657162 | -79.378937 |


In [232]:
# create map
dt_map = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat, lon, poi in zip(df_Downtown_cluster['Latitude'], df_Downtown_cluster['Longitude'], df_Downtown_cluster['Neighborhood']):
    label2 = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        color = 'black' ,
        popup=label2,
        fill=True,
        fill_opacity=0.4).add_to(dt_map)
       
dt_map

### 4. Let's tie it all together

Let's look at all of our maps 

* Subway line 
* Top neighborhoods with restaurants 
* Downtown neighborhoods

In [238]:
agg_maps = folium.Map(location=[latitude, longitude], zoom_start=14)

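# NOTE: df_subway_loc's column labels are swapped relative to their values
# ('Latitude' holds longitudes and vice versa), so they are unpacked as lon, lat here
for lon, lat, poi in zip(df_subway_loc['Latitude'], df_subway_loc['Longitude'], df_subway_loc['Station']):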
    label2 = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        color = 'yellow' ,
        popup=label2,
        fill=True,
        fill_opacity=0.9).add_to(agg_maps)
    
for lat, lon, poi in zip(df_Toronto_Top10['Latitude'], df_Toronto_Top10['Longitude'], df_Toronto_Top10['Neighborhood']):
    label = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        color = 'blue',
        popup=label,
        fill=True,
        fill_opacity=1.0).add_to(agg_maps)
    
for lat, lon, poi in zip(df_Downtown_cluster['Latitude'], df_Downtown_cluster['Longitude'], df_Downtown_cluster['Neighborhood']):
    label2 = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        color = 'black' ,
        popup=label2,
        fill=True,
        fill_opacity=0.2).add_to(agg_maps)

    
       
agg_maps

## Results and Discussion <a name="results"></a>

As the map above shows, **Adelaide, King, Richmond** is the neighborhood that best fits our criteria: close to the __subway line__, the __maximum number of restaurants__, and classified as __downtown__.

As a result, the ideal location for a first-time buyer is this neighborhood, since it satisfies our *simple* assumptions.
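
The recommendation above is read off the combined map by eye. The same three criteria could also be checked numerically. The sketch below assumes a frame with hypothetical columns `restaurant_count`, `is_downtown` and `station_km`; the neighborhood names and values are invented for illustration and do not come from the analysis.

```python
import pandas as pd

CAP = 25  # restaurant cap used throughout the notebook

# Invented candidates; column names are assumptions for this sketch.
candidates = pd.DataFrame({
    'Neighborhood': ['N1', 'N2', 'N3', 'N4'],
    'restaurant_count': [25, 25, 12, 25],
    'is_downtown': [True, False, True, True],
    'station_km': [0.9, 0.2, 0.1, 0.3],
})

# Keep downtown neighborhoods that reach the cap, closest to a station first.
meets_criteria = (candidates['restaurant_count'] >= CAP) & candidates['is_downtown']
shortlist = (
    candidates[meets_criteria]
    .sort_values('station_km')
    .reset_index(drop=True)
)
print(shortlist[['Neighborhood', 'station_km']])
```

Treating the restaurant and downtown criteria as filters and the subway distance as the sort key mirrors how the map is read: the first two narrow the field, and the third breaks the tie.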

## Conclusion <a name="conclusion"></a>

Problem: Identify the ideal location for a first-time homeowner in Toronto, Ontario.

Data: Used subway station locations, restaurant locations and downtown neighborhood locations to test our assumptions.

Analysis: Consolidated the data on a single map to reach our final recommendation.

**Final recommendation: look for a home in the Adelaide, King, Richmond neighborhood.**