# Coursera Capstone Project - Renting Prices and Neighborhood Analysis of Toronto

## 1. Introduction

### 1.1 - Description and discussion of the background.

Thanks to Coursera, I finished my Data Science course and got a job in Toronto as a Data Scientist. The company offers me 60,000 CAD a year; according to a simple online calculator, that leaves about 45,000 CAD net (after taxes), or 3,750 CAD a month. Some experts say you should spend at most 1/3 of your net income on rent, which means I can afford a flat that costs at most 1,250 CAD a month. Since I have to move to Toronto, I want to find the best location to live, where I can pay for my flat and still go out, go shopping and so on. I'd like to feel at home, so I will look for a neighbourhood that is quite similar to the place where I live now. The criteria are:
- The maximum rent is 1250 CAD/month for a flat with at least 1 bedroom (two bedrooms for that price would also be fine).
- The area should have one or more grocery stores where I can buy everyday food and drink.
- In summer I like to relax or do sports in a park, so I'd like to have a park nearby.
- In winter I don't want to stop training, so I need a gym or fitness center.
- I like swimming, so a swimming pool or a spa would also be great.
- It should be as close as possible to a subway station.
- Pubs, restaurants and shopping malls are less important to me, so I don't mind travelling a bit further to have a beer with my friends.

__Business Problem:__

Because I have never been to Toronto, I will build a model to analyse its neighborhoods. The goal is to find the area in the city of Toronto that is most similar to my home. The model should find an area where I can rent a flat for at most 1250 CAD, with the most opportunities for shopping and sports, and as close as possible to a subway station.
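
The budget used throughout this notebook follows from simple arithmetic. The short sketch below only restates the numbers from the introduction; the variable names are my own and are not used elsewhere in the notebook.

```python
# Net yearly income according to the online calculator (from the introduction)
net_annual = 45_000

# Monthly net income
net_monthly = net_annual / 12

# Rule of thumb: spend at most one third of the net income on rent
max_rent = net_monthly / 3

print(net_monthly, max_rent)
```

This is where the 1250 CAD limit in the following sections comes from.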

### 1.2 - Raw data and the strategy.

For the solution I will use the following websites, data and APIs:

1. The average rental prices by neighbourhood in Toronto are available on this website: https://www03.cmhc-schl.gc.ca/hmip-pimh/en/TableMapChart/Table?TableId=2.2.11&GeographyId=2270&GeographyTypeId=3&DisplayAs=Table&GeograghyName=Toronto
2. The csv files with the subway lines and stations are available on this page: http://scruss.com/blog/2005/12/14/toronto-subway-station-gps-locations/. I need them to check how many stations there are in each area.
3. To geocode and visualise the neighborhoods, I will use the Google Maps API and the Folium Python package.
4. For the location analysis I will use the Foursquare API.

The first page lets you download the average rental prices in csv format. This will be the first filter of my data, because I can't pay more than 1250 CAD a month.

With Foursquare I will collect some information about the neighbourhoods: how many shopping centers, pubs, parks etc. are in the area. I will choose the district with the highest diversity of venues that, ideally, also has some subway stations. Furthermore, I will use folium to create interactive maps of my data and my results.

----------------

## 2. Data Preprocessing

### 2.1 - Preprocessing the rental prices

In [1]:
#Import libraries

import numpy as np
import pandas as pd

I downloaded the rental prices from the website (see above) in csv format. Let's create the _rentingPrices_ dataframe.

In [2]:
rentingPrices = pd.read_csv('Toronto_Renting_Prices.csv')
rentingPrices.head()

Unnamed: 0.1,Unnamed: 0,Bachelor,Unnamed: 2,1 Bedroom,Unnamed: 4,2 Bedroom,Unnamed: 6,3 Bedroom +,Unnamed: 8,Total,Unnamed: 10
0,Agincourt/Malvern,,,1105.0,b,1316.0,a,1569.0,a,1341.0,a
1,Ajax/Pickering,,,953.0,a,1248.0,a,1397.0,a,1283.0,a
2,Alderwood,,,1169.0,c,1462.0,c,,,1435.0,c
3,Aurora,,,1127.0,a,1347.0,a,,,1298.0,b
4,Banbury-Don Mills/York Mills,,,1163.0,a,1335.0,b,1643.0,c,1286.0,b


I'd like to rent a 1 or 2 bedroom flat, so I will keep only the neighborhood column ('Unnamed: 0', renamed to 'Neighborhood' below) and the '1 Bedroom' and '2 Bedroom' columns.

In [3]:
rentingPrices = rentingPrices[['Unnamed: 0', '1 Bedroom', '2 Bedroom']]

In [4]:
#Rename Unnamed: 0 to Neighborhood

rentingPrices= rentingPrices.rename(columns={"Unnamed: 0": "Neighborhood"})

In [5]:
rentingPrices.head()

Unnamed: 0,Neighborhood,1 Bedroom,2 Bedroom
0,Agincourt/Malvern,1105.0,1316.0
1,Ajax/Pickering,953.0,1248.0
2,Alderwood,1169.0,1462.0
3,Aurora,1127.0,1347.0
4,Banbury-Don Mills/York Mills,1163.0,1335.0


Let's check the basic statistics.

In [6]:
rentingPrices.describe()

Unnamed: 0,1 Bedroom,2 Bedroom
count,122.0,119.0
mean,1200.401639,1436.89916
std,181.947439,304.335603
min,830.0,945.0
25%,1080.0,1253.0
50%,1160.5,1337.0
75%,1277.0,1552.0
max,2013.0,3114.0


If a 1 bedroom flat costs more than 1250, then the 2 bedroom flat there will be over 1250 as well. The maximum 1 Bedroom price is 2013, so I keep only the rows where 1 Bedroom is below 1250 CAD and reset the index.

In [7]:
rentingPrices = rentingPrices[rentingPrices['1 Bedroom'] < 1250].reset_index(drop=True)

Within the same neighborhood, I'd rather take a 2 Bedroom flat than a 1 Bedroom flat wherever the budget allows. Let's check in which neighbourhoods a 2 Bedroom flat costs less than 1250.

In [8]:
rentingPrices[rentingPrices['2 Bedroom'] < 1250]

Unnamed: 0,Neighborhood,1 Bedroom,2 Bedroom
1,Ajax/Pickering,953.0,1248.0
7,Beechborough-Greenbrook,1022.0,1137.0
8,Bendale,1070.0,1173.0
11,Bradford/West Gwillimbury/New Tecumseth,977.0,1181.0
25,Danforth Village-East York,1080.0,1248.0
27,Dorset Park,1015.0,1139.0
33,Eglinton East,1088.0,1227.0
37,Georgina,830.0,1008.0
39,Ionview,1069.0,1207.0
41,Keelesdale-Eglinton West,910.0,945.0


In [9]:
rentingPrices[rentingPrices['2 Bedroom'] < 1250].shape

(27, 3)

So there are 27 neighborhoods where a 2 Bedroom flat is also available within the budget. Now I create a dataset with 3 columns: Neighborhood, Price and Bedrooms, and call it _pricesFinal_. In the neighborhoods where a 2 Bedroom flat fits the 1250 budget, I will choose that price.

In [10]:
rows = []

for i in range(len(rentingPrices)):
    row = rentingPrices.iloc[i]
    # Prefer the 2 Bedroom flat when it fits the budget (a missing price compares as False)
    if row['2 Bedroom'] <= 1250:
        price, bedrooms = row['2 Bedroom'], 2
    else:
        price, bedrooms = row['1 Bedroom'], 1
    rows.append({'Neighborhood': row['Neighborhood'], 'Price': price, 'Bedrooms': bedrooms})

# DataFrame.append was removed in pandas 2.0, so the frame is built from a list of dicts
pricesFinal = pd.DataFrame(rows, columns=['Neighborhood', 'Price', 'Bedrooms'])

Let's check what the _pricesFinal_ dataset looks like and count the neighborhoods.

In [11]:
pricesFinal.head()

Unnamed: 0,Neighborhood,Price,Bedrooms
0,Agincourt/Malvern,1105.0,1
1,Ajax/Pickering,1248.0,2
2,Alderwood,1169.0,1
3,Aurora,1127.0,1
4,Banbury-Don Mills/York Mills,1163.0,1


In [12]:
pricesFinal.shape

(85, 3)
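
The row-by-row selection above can also be written without a loop. The following is a minimal sketch using `np.where`, not the code the notebook ran: it works on a three-row sample copied from the first rows of the price table, and the names `sample`, `budget`, `two_ok` and `result` are introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Three sample rows taken from the head of the rental price table
sample = pd.DataFrame({
    'Neighborhood': ['Agincourt/Malvern', 'Ajax/Pickering', 'Alderwood'],
    '1 Bedroom': [1105.0, 953.0, 1169.0],
    '2 Bedroom': [1316.0, 1248.0, 1462.0],
})

budget = 1250

# True where the 2 Bedroom flat fits the budget (missing prices compare as False)
two_ok = sample['2 Bedroom'] <= budget

result = pd.DataFrame({
    'Neighborhood': sample['Neighborhood'],
    'Price': np.where(two_ok, sample['2 Bedroom'], sample['1 Bedroom']),
    'Bedrooms': np.where(two_ok, 2, 1),
})

print(result)
```

On a table of this size the loop is fast enough, so the vectorised form is mostly a matter of taste.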

There are 85 rows that meet the rental price criterion, but some rows contain more than one neighborhood (like Agincourt/Malvern, Ajax/Pickering etc.). I'll split them into separate rows. A description of the next lines can be found here: https://gist.github.com/sureshsarda/00c3b7423ea7b6cba4250a719d6b7424

In [13]:
# We start by creating a new dataframe from the series, with Price and Bedrooms in the index
splittedPrices = pd.DataFrame(pricesFinal.Neighborhood.str.split('/').tolist(), 
                              index=[pricesFinal.index, pricesFinal.Price, pricesFinal.Bedrooms]).stack()

#Now turn Price and Bedrooms back into columns
splittedPrices = splittedPrices.reset_index([0, 'Price', 'Bedrooms'])

#Drop level_0 column
splittedPrices = splittedPrices.drop(['level_0'], axis=1)

#Rename 0 as Neighborhood
splittedPrices = splittedPrices.rename(columns={0: "Neighborhood"})

#Reorder dataframe, sort alphabetically and reset index
pricesFinal = splittedPrices[['Neighborhood', 'Price', 'Bedrooms']].sort_values('Neighborhood').reset_index(drop=True)

Let's check the _pricesFinal_ dataset again

In [14]:
pricesFinal.head()

Unnamed: 0,Neighborhood,Price,Bedrooms
0,Agincourt,1105.0,1
1,Ajax,1248.0,2
2,Alderwood,1169.0,1
3,Aurora,1127.0,1
4,Banbury-Don Mills,1163.0,1


In [17]:
pricesFinal.shape

(109, 3)

Now the neighborhoods are split (e.g. 0: Agincourt/Malvern, 1: Ajax/Pickering -> 0: Agincourt, 1: Ajax, ...). This dataset has 109 rows, which means that in Toronto (or near Toronto) there are 109 neighborhoods where the rent of a 1 or 2 bedroom flat is under 1250 CAD.
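
For readers on a recent pandas version, the same split can probably be expressed more compactly with `Series.str.split` followed by `DataFrame.explode`. This is a sketch on a three-row sample, not the code used for the results above; `sample` and `split` are illustrative names.

```python
import pandas as pd

# Three sample rows in the shape of pricesFinal before the split
sample = pd.DataFrame({
    'Neighborhood': ['Agincourt/Malvern', 'Ajax/Pickering', 'Alderwood'],
    'Price': [1105.0, 1248.0, 1169.0],
    'Bedrooms': [1, 2, 1],
})

split = (
    sample.assign(Neighborhood=sample['Neighborhood'].str.split('/'))  # lists of names
          .explode('Neighborhood')        # one row per name, Price and Bedrooms repeated
          .sort_values('Neighborhood')
          .reset_index(drop=True)
)

print(split)
```

Each part of a combined name inherits the price and bedroom count of the original row, which is the same assumption the stack-based version makes.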

### 2.2 - Neighborhood coordinates

Now I have the neighborhoods with the rental prices, but for the further analysis I need the geographical coordinates of these areas. For that part I used the Google Maps API. There is a free trial for 12 months with 300 USD of credit, which is more than enough for my purpose.

If you want to try this part of the code, just paste your API key into the 2nd line.

In [18]:
import googlemaps
gmaps_key = googlemaps.Client(key= "***Your API-key***")

rows = []

for i in range(0, len(pricesFinal)):
    neighborhood = pricesFinal.iloc[i]['Neighborhood']
    # geocode() returns a list of matches, which is empty when nothing is found
    geocode_results = gmaps_key.geocode('Canada, Toronto, ' + neighborhood)

    if geocode_results:
        lat = geocode_results[0]['geometry']['location']['lat']
        lon = geocode_results[0]['geometry']['location']['lng']
    else:
        lat = None
        lon = None

    rows.append({'Neighborhood': neighborhood, 'Latitude': lat, 'Longitude': lon})

# DataFrame.append was removed in pandas 2.0, so the frame is built from a list of dicts
coordinates = pd.DataFrame(rows, columns=['Neighborhood', 'Latitude', 'Longitude'])
coordinates.to_csv('Toronto_Coordinates.csv') #Saving results in a csv file.
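
The geocoding step relies on the result being a list of matches, each with a nested `geometry` / `location` entry. The sketch below isolates that lookup in a small helper so it can be tried without an API key. `extract_lat_lng` and `fake_response` are hypothetical names, and the fake response is hand-made to contain only the fields the notebook reads.

```python
def extract_lat_lng(geocode_results):
    """Return (lat, lng) of the first match, or (None, None) for an empty result list."""
    if not geocode_results:
        return None, None
    location = geocode_results[0]['geometry']['location']
    return location['lat'], location['lng']


# Hand-made stand-in for an API response, reduced to the fields used here
fake_response = [{'geometry': {'location': {'lat': 43.788009, 'lng': -79.283882}}}]

print(extract_lat_lng(fake_response))
print(extract_lat_lng([]))
```

Checking for an empty list before indexing avoids an `IndexError` when a neighborhood name cannot be geocoded.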

Creating the _coordinates_ dataset from this csv sheet:

In [19]:
coordinates = pd.read_csv('Toronto_Coordinates.csv', index_col=0)
coordinates.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Agincourt,43.788009,-79.283882
1,Ajax,43.850855,-79.020373
2,Alderwood,43.60171,-79.545238
3,Aurora,44.00648,-79.450396
4,Banbury-Don Mills,43.749115,-79.366359


Okay, now I have the coordinates, but are all of them in Toronto? To find out, let's use geopy to determine the coordinates of Toronto and create a folium map with the neighbourhood markers.

In [20]:
import folium
from geopy.geocoders import Nominatim

In [21]:
#Lat and Lon of Toronto, using geolocator
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_ontario")
location = geolocator.geocode(address)
toronto_latitude = location.latitude
toronto_longitude = location.longitude
print('The coordinates of Toronto are {}, {}.'.format(toronto_latitude, toronto_longitude))

The coordinates of Toronto are 43.653963, -79.387207.


In [23]:
#Creating a Folium Map with the neighborhoods
map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=9)

for lat, lng, label in zip(coordinates['Latitude'], coordinates['Longitude'], coordinates['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)

map_toronto

Ok, it worked, but some neighborhoods are outside of Toronto (like Orangeville, Aurora, East and West Gwillimbury etc.). That is not a problem, because in the next steps I will count the subway stations near each neighborhood and then drop the neighborhoods without any subway station from my dataset. But before doing that, I will create the *df_toronto* dataset from the pricesFinal and coordinates datasets.

If you can't see the picture above open: images/Folium_Toronto_1.jpg

In [24]:
df_toronto = pd.merge(pricesFinal, coordinates, on='Neighborhood')
df_toronto.head()

Unnamed: 0,Neighborhood,Price,Bedrooms,Latitude,Longitude
0,Agincourt,1105.0,1,43.788009,-79.283882
1,Ajax,1248.0,2,43.850855,-79.020373
2,Alderwood,1169.0,1,43.60171,-79.545238
3,Aurora,1127.0,1,44.00648,-79.450396
4,Banbury-Don Mills,1163.0,1,43.749115,-79.366359


### 2.3 - Subway stations

There are 4 subway lines in Toronto. I downloaded the coordinates of their stations (see the link in the Introduction). Let's create dataframes from the csv files and merge them into the _torontoSubway_ dataset.

In [25]:
subway1 = pd.read_csv('bloor-danforth-NAD83.csv', header=None)
subway2 = pd.read_csv('sheppard-yonge-NAD83.csv', header=None)
subway3 = pd.read_csv('srt-NAD83.csv', header=None)
subway4 = pd.read_csv('yonge-university-spadina-NAD83.csv', header=None)

#Merging all datasets
torontoSubway = pd.concat([subway1, subway2, subway3, subway4]).reset_index(drop=True)

In [26]:
torontoSubway.head()

Unnamed: 0,0,1,2
0,43.63802,-79.536388,Kipling
1,43.64595,-79.523948,Islington
2,43.648804,-79.511541,Royal York
3,43.650576,-79.495225,Old Mill
4,43.650291,-79.484772,Jane


In [27]:
#Rename the columns and check the shape of the dataset
torontoSubway = torontoSubway.rename(columns={0: "Latitude", 1: "Longitude", 2: "Station"})
torontoSubway.shape

(74, 3)

If we check the subway map of Toronto (https://www.ttc.ca/Spadina/images/about%20the%20project/SubwayFutureMap_lg.jpg ), there are 5 stations where two subway lines meet: Spadina, St. George, Bloor (Yonge), Sheppard and Kennedy. Let's delete the repeated rows to avoid duplicates.

In [28]:
#First find the rows that contain these stations
print(torontoSubway[torontoSubway['Station'].str.contains("Spadina")])
print(torontoSubway[torontoSubway['Station'].str.contains("St George")])
print(torontoSubway[torontoSubway['Station'].str.contains("Yonge")])
print(torontoSubway[torontoSubway['Station'].str.contains("Sheppard")])
print(torontoSubway[torontoSubway['Station'].str.contains("Kennedy")])

     Latitude  Longitude  Station
14  43.667648 -79.403758  Spadina
50  43.667715 -79.403752  Spadina
     Latitude  Longitude    Station
15  43.668312 -79.398643  St George
51  43.668319 -79.398672  St George
     Latitude  Longitude         Station
17  43.670706 -79.385880           Yonge
31  43.761618 -79.410989  Sheppard-Yonge
     Latitude  Longitude         Station
31  43.761618 -79.410989  Sheppard-Yonge
71  43.761674 -79.410987        Sheppard
     Latitude  Longitude  Station
30  43.732118 -79.265698  Kennedy
36  43.732192 -79.265696  Kennedy


Okay, there are the duplicates. I will delete rows 50, 51, 17, 31 and 36.

In [29]:
torontoSubway = torontoSubway.drop([17,31,36,50,51]).reset_index(drop=True)
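
Interchange stations can also be spotted by position instead of by name, since the two entries of one station are only a few metres apart. The sketch below rounds the coordinates to 3 decimals and keeps the first row of each rounded pair. It is an alternative idea, not the method used above: it keeps the first occurrence rather than the rows chosen by hand, and two nearby points can still fall on different sides of a rounding boundary. The sample rows are copied from the tables above.

```python
import pandas as pd

# Sample rows from the subway table: one plain station and three interchange pairs
stations = pd.DataFrame({
    'Latitude': [43.63802, 43.667648, 43.668312, 43.732118,
                 43.667715, 43.668319, 43.732192],
    'Longitude': [-79.536388, -79.403758, -79.398643, -79.265698,
                  -79.403752, -79.398672, -79.265696],
    'Station': ['Kipling', 'Spadina', 'St George', 'Kennedy',
                'Spadina', 'St George', 'Kennedy'],
})

# Round to 3 decimals so that near-identical positions share the same key
rounded = stations[['Latitude', 'Longitude']].round(3)

# Keep the first row for every rounded position
deduped = stations[~rounded.duplicated()].reset_index(drop=True)

print(deduped)
```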

### 2.4 - Neighborhoods vs. Subway stations

Now let's create a folium map with the neighborhoods and the subway stations. I've already imported folium and calculated the lat and lon values of the city. On the map, the green points are the subway stations and the blue circles show the areas within a 1000 m radius of each neighborhood's middle point. In the end I'd like to find a neighborhood with at least 1 subway station within 1 km, a distance that usually takes 10-15 min to walk.

In [30]:
map_toronto = folium.Map(location=[toronto_latitude, toronto_longitude], zoom_start=11)

for lat, lng, label in zip(torontoSubway['Latitude'], torontoSubway['Longitude'], torontoSubway['Station']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#71c763',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
    
for lat, lng, label in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.Circle(
        [lat, lng],
        radius=1000,
        popup=label,
        color='blue',
        parse_html=False).add_to(map_toronto)    

map_toronto

In the middle we can see an empty area. That is because of the prices: in this area there are no flats with a rent under 1250 CAD. In the next step I will count the subway stations around each middle point to see how many are available in each neighborhood. As mentioned above, the maximum distance should be 1000 m. For the calculation I will use the geopy package again. It offers two different methods for calculating distances:
- Great circle
- Geodesic

Over short distances (like here) there is no relevant difference between the results, so it is up to us which one to use. I'll take geodesic.
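
For reference, one common way to write a great-circle distance is the haversine formula. This is only a sketch of the spherical model, assuming a mean Earth radius of $R \approx 6371$ km; it is not necessarily the exact expression geopy evaluates internally, and the geodesic method works on an ellipsoid rather than a sphere.

$$
d = 2R \arcsin\sqrt{\sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right)}
$$

Here $\varphi_1, \varphi_2$ are the latitudes and $\lambda_1, \lambda_2$ the longitudes of the two points, in radians. The symbols are my own notation and do not appear elsewhere in this notebook.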


If you can't see the picture above open: images/Folium_Toronto_2.jpg

In [31]:
#Import library to calculate the distances
from geopy.distance import geodesic

The next for loop calculates the distance between every neighborhood and every subway station. It also counts how many stations are within a 1 km radius of the middle point. I will add a 'Station Number' column to my *df_toronto* dataset.

In [32]:
stationNumbers = []

for i in range(0,len(df_toronto)):
    counter = 0
    coords_1 = (df_toronto['Latitude'][i], df_toronto['Longitude'][i])
    
    for j in range(0,len(torontoSubway)):
        coords_2 = (torontoSubway['Latitude'][j], torontoSubway['Longitude'][j])
        distance = geodesic(coords_1, coords_2).km
        if (distance <= 1):
            counter = counter + 1

    stationNumbers.append(counter)

df_toronto['Station Number'] = stationNumbers

In [33]:
df_toronto.head()

Unnamed: 0,Neighborhood,Price,Bedrooms,Latitude,Longitude,Station Number
0,Agincourt,1105.0,1,43.788009,-79.283882,0
1,Ajax,1248.0,2,43.850855,-79.020373,0
2,Alderwood,1169.0,1,43.60171,-79.545238,0
3,Aurora,1127.0,1,44.00648,-79.450396,0
4,Banbury-Don Mills,1163.0,1,43.749115,-79.366359,0
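
To make the counting logic easy to try without geopy installed, here is a dependency-free sketch. It swaps geopy's `geodesic` for a plain haversine (spherical) distance, which is an assumption of this example and not what the notebook ran. The station coordinates are taken from the subway table, while `center` is a made-up point near Spadina station; `haversine_km` and `count_within` are illustrative names.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(p1, p2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(radians, p1)
    lat2, lon2 = map(radians, p2)
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def count_within(center, points, max_km=1.0):
    """Count how many points lie within max_km of center."""
    return sum(1 for p in points if haversine_km(center, p) <= max_km)


# Kipling, Spadina, St George and Yonge, as listed in the subway table
stations = [
    (43.63802, -79.536388),
    (43.667648, -79.403758),
    (43.668312, -79.398643),
    (43.670706, -79.385880),
]

# Hypothetical middle point close to Spadina station
center = (43.668, -79.401)

print(count_within(center, stations))
```

The nested loop in the notebook does the same thing for every neighborhood, one station at a time.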


In [35]:
#order values by Station numbers
df_toronto.sort_values(by=['Station Number'], ascending=False).head()

Unnamed: 0,Neighborhood,Price,Bedrooms,Latitude,Longitude,Station Number
19,City Centre South,1196.0,1,43.654262,-79.385975,8
43,High Park-Swansea,1216.0,1,43.653556,-79.465258,3
40,Georgina,1008.0,2,43.684144,-79.392643,3
36,Englemount-Lawrence,1063.0,1,43.718533,-79.439446,3
83,Riverdale,1189.0,1,43.678985,-79.34491,3


It looks like City Centre South has the most stations (8). Now let's delete every row where the station number is 0.

In [36]:
df_toronto = df_toronto[df_toronto['Station Number'] >= 1].reset_index(drop=True)

In [37]:
df_toronto.head()

Unnamed: 0,Neighborhood,Price,Bedrooms,Latitude,Longitude,Station Number
0,City Centre South,1196.0,1,43.654262,-79.385975,8
1,Clairlea-Birchmount,1056.0,1,43.716205,-79.282842,1
2,Crescent Town,1096.0,1,43.695981,-79.293736,1
3,Danforth Village-East York,1248.0,2,43.689136,-79.296554,2
4,Dorset Park,1139.0,2,43.765831,-79.281111,2


In [38]:
df_toronto.shape

(21, 6)

There are 21 neighborhoods left that have a subway station and a rent of no more than 1250 CAD. In the next part I will analyse only these neighborhoods.

----------------------

## 3. Exploring areas

### 3.1 - Using Foursquare API

There are 21 neighborhoods left. Let's analyse these neighborhoods with the Foursquare API. For this exploration I will set a limit of 100 venues within a 500 m radius.

In [39]:
#Import requests
import requests

#Define Foursquare Credentials and Version

CLIENT_ID = 'Your Foursquare Client-ID'
CLIENT_SECRET = 'Your Foursquare Client-Secret'
VERSION = 'Your Foursquare Version'

In [40]:
#Define the limit and the radius
LIMIT = 100
radius = 500

#Create a getNearbyVenues function to collect the venues in the neighborhoods
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

    return(nearby_venues)
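
The request URL in the function above is assembled with string formatting. A possible alternative, sketched here with the standard library only, is to let `urlencode` build the query string so that special characters are escaped automatically. The endpoint and parameter names are the ones used above; `build_explore_url`, the credentials and the version string `'20180605'` are placeholders chosen for this example.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL for one location; the values are only encoded, not validated."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return BASE_URL + '?' + urlencode(params)


# Placeholder credentials and version, coordinates of City Centre South
url = build_explore_url('ID', 'SECRET', '20180605', 43.654262, -79.385975)
print(url)
```

The comma in the `ll` parameter is percent-encoded as `%2C`, which is the standard URL encoding for a comma in a query string.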

Let's run the above function on each neighborhood and create a new dataframe called *toronto_venues*

In [41]:
toronto_venues = getNearbyVenues(names=df_toronto['Neighborhood'],
                                   latitudes=df_toronto['Latitude'],
                                   longitudes=df_toronto['Longitude']
                                )

City Centre South
Clairlea-Birchmount
Crescent Town
Danforth Village-East York
Dorset Park
Dovercourt
Dufferin Grove
Englemount-Lawrence
Georgina
High Park-Swansea
Ionview
Islington
Lambton Baby Point
Milton
Newmarket
Oakwood-Vaughan
Playter Estates-Danforth
Riverdale
Woodbine Corridor
Woodbine-Lumsden
Wychwood


In [42]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,City Centre South,43.654262,-79.385975,Downtown Toronto,43.653232,-79.385296,Neighborhood
1,City Centre South,43.654262,-79.385975,Japango,43.655268,-79.385165,Sushi Restaurant
2,City Centre South,43.654262,-79.385975,Sansotei Ramen 三草亭,43.655157,-79.386501,Ramen Restaurant
3,City Centre South,43.654262,-79.385975,Rolltation,43.654918,-79.387424,Japanese Restaurant
4,City Centre South,43.654262,-79.385975,Chatime 日出茶太,43.655542,-79.384684,Bubble Tea Shop


Let's check how many venues were returned for each neighborhood

In [43]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
City Centre South,100,100,100,100,100,100
Clairlea-Birchmount,7,7,7,7,7,7
Crescent Town,5,5,5,5,5,5
Danforth Village-East York,28,28,28,28,28,28
Dorset Park,14,14,14,14,14,14
Dovercourt,9,9,9,9,9,9
Dufferin Grove,35,35,35,35,35,35
Englemount-Lawrence,4,4,4,4,4,4
Georgina,52,52,52,52,52,52
High Park-Swansea,14,14,14,14,14,14


Let's find out how many unique categories can be curated from all the returned venues

In [44]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 142 uniques categories.


### 3.2 - Analyze Each Neighborhood

There are 142 different venue categories, let's analyse them.

In [47]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

toronto_onehot.head()

Unnamed: 0,Accessories Store,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Automotive Shop,BBQ Joint,Bagel Shop,Bakery,Bank,...,Turkish Restaurant,University,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's see which venue categories are available and pick the ones that matter most to me.

In [48]:
for name in toronto_onehot.columns:
    print(name)

Accessories Store
American Restaurant
Art Gallery
Arts & Crafts Store
Asian Restaurant
Automotive Shop
BBQ Joint
Bagel Shop
Bakery
Bank
Bar
Baseball Field
Beer Bar
Beer Store
Bookstore
Boutique
Brazilian Restaurant
Breakfast Spot
Bubble Tea Shop
Burger Joint
Burrito Place
Bus Line
Butcher
Café
Cajun / Creole Restaurant
Camera Store
Candy Store
Caribbean Restaurant
Chinese Restaurant
Clothing Store
Coffee Shop
Comic Shop
Concert Hall
Convenience Store
Cosmetics Shop
Dance Studio
Dessert Shop
Diner
Discount Store
Doctor's Office
Donut Shop
Electronics Store
Event Space
Falafel Restaurant
Farmers Market
Fast Food Restaurant
Fish & Chips Shop
Flower Shop
Food
Food & Drink Shop
Food Court
French Restaurant
Fried Chicken Joint
Frozen Yogurt Shop
Fruit & Vegetable Store
Furniture / Home Store
Gaming Cafe
Garden
Gastropub
General Entertainment
German Restaurant
Gift Shop
Golf Course
Gourmet Shop
Greek Restaurant
Grocery Store
Gym
Gym / Fitness Center
Hardware Store
Health Food Store
Hotel
Ice 

The 5 most important venue categories for me (in descending order) are: Park, Grocery Store, Gym or Gym / Fitness Center, Spa (for swimming) and Shopping Mall. Let's select these columns and put them in *df_toronto_selected*. The neighborhood name is stored in the 'Neighborhood' column, so I will keep this column as well.

In [49]:
df_toronto_selected=toronto_onehot.loc[:, ['Neighborhood', 'Park', 'Grocery Store', 'Gym', 
                                           'Gym / Fitness Center', 'Spa', 'Shopping Mall']]
df_toronto_selected.head()

Unnamed: 0,Neighborhood,Park,Grocery Store,Gym,Gym / Fitness Center,Spa,Shopping Mall
0,City Centre South,0,0,0,0,0,0
1,City Centre South,0,0,0,0,0,0
2,City Centre South,0,0,0,0,0,0
3,City Centre South,0,0,0,0,0,0
4,City Centre South,0,0,0,0,0,0


Group by neighborhood and sum up the venues to make the neighborhoods easier to analyse.

In [51]:
df_toronto_selected = df_toronto_selected.groupby(['Neighborhood']).sum()
df_toronto_selected.head()

Neighborhood,Park,Grocery Store,Gym,Gym / Fitness Center,Spa,Shopping Mall
City Centre South,0,0,1,1,1,1
Clairlea-Birchmount,0,0,1,0,0,0
Crescent Town,1,0,0,0,0,0
Danforth Village-East York,1,0,0,2,0,0
Dorset Park,0,1,0,0,0,0


Because Gym and Gym / Fitness Center are the same thing for my purposes, let's add them together and drop 'Gym'.

In [52]:
df_toronto_selected['Gym / Fitness Center'] = df_toronto_selected['Gym'] + df_toronto_selected['Gym / Fitness Center']
df_toronto_selected = df_toronto_selected.drop(['Gym'], axis = 1)

Let's check what my dataset looks like.

In [53]:
df_toronto_selected

Neighborhood,Park,Grocery Store,Gym / Fitness Center,Spa,Shopping Mall
City Centre South,0,0,2,1,1
Clairlea-Birchmount,0,0,1,0,0
Crescent Town,1,0,0,0,0
Danforth Village-East York,1,0,2,0,0
Dorset Park,0,1,0,0,0
Dovercourt,2,0,1,0,0
Dufferin Grove,1,0,1,0,0
Englemount-Lawrence,0,0,0,0,0
Georgina,2,1,2,2,0
High Park-Swansea,0,0,0,0,0


Let's count how many different venue types are available in each neighborhood; the result is saved in a new column called 'Number of Venue Type'. Finally, order the rows by 'Number of Venue Type'.

In [54]:
df_toronto_selected['Number of Venue Type'] = df_toronto_selected.astype(bool).sum(axis=1)
df_toronto_selected.sort_values(by = 'Number of Venue Type', ascending=False)

Neighborhood,Park,Grocery Store,Gym / Fitness Center,Spa,Shopping Mall,Number of Venue Type
Georgina,2,1,2,2,0,4
City Centre South,0,0,2,1,1,3
Riverdale,0,2,1,1,0,3
Danforth Village-East York,1,0,2,0,0,2
Dovercourt,2,0,1,0,0,2
Dufferin Grove,1,0,1,0,0,2
Lambton Baby Point,2,0,0,0,0,1
Woodbine-Lumsden,0,0,0,1,0,1
Woodbine Corridor,0,0,0,1,0,1
Playter Estates-Danforth,0,1,0,0,0,1
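
One detail worth noting: counting over the whole dataframe works on the first run, but if the cell is executed again, the 'Number of Venue Type' column itself would be counted as an extra type. The sketch below restricts the count to an explicit list of venue columns. The four rows are copied from the table above; `selected` and `venue_columns` are illustrative names.

```python
import pandas as pd

# Four rows from the table above
selected = pd.DataFrame(
    {
        'Park': [2, 0, 0, 2],
        'Grocery Store': [1, 0, 2, 0],
        'Gym / Fitness Center': [2, 2, 1, 1],
        'Spa': [2, 1, 1, 0],
        'Shopping Mall': [0, 1, 0, 0],
    },
    index=pd.Index(['Georgina', 'City Centre South', 'Riverdale', 'Dovercourt'],
                   name='Neighborhood'),
)

venue_columns = ['Park', 'Grocery Store', 'Gym / Fitness Center', 'Spa', 'Shopping Mall']

# Count the categories with at least one venue, looking only at the venue columns
selected['Number of Venue Type'] = (selected[venue_columns] > 0).sum(axis=1)

print(selected.sort_values(by='Number of Venue Type', ascending=False))
```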


It looks much clearer. I'd like to live in a neighborhood where at least 3 of the 5 venue types are available, so let's select the rows with at least 3 kinds of venues.

In [55]:
df_toronto_selected = df_toronto_selected[df_toronto_selected['Number of Venue Type'] >= 3]
df_toronto_selected

Neighborhood,Park,Grocery Store,Gym / Fitness Center,Spa,Shopping Mall,Number of Venue Type
City Centre South,0,0,2,1,1,3
Georgina,2,1,2,2,0,4
Riverdale,0,2,1,1,0,3


There are only 3 neighborhoods left: City Centre South, Georgina and Riverdale. Let's make the final decision.

### 3.3 - Final decision

Okay, there are only 3 neighborhoods left. I could already decide that I'd like to move to Georgina, but for my final decision I'll add the number of bedrooms and the rental prices to this dataframe and save it as *toronto_final*.

In [56]:
#save df_toronto_selected neighborhoods in a list
indexes = df_toronto_selected.index
print(indexes)

Index(['City Centre South', 'Georgina', 'Riverdale'], dtype='object', name='Neighborhood')


In [57]:
#Merging df_toronto_selected and df_toronto together in toronto final
toronto_final = pd.merge(df_toronto.loc[df_toronto['Neighborhood'].isin(indexes)], df_toronto_selected, on='Neighborhood')
toronto_final

Unnamed: 0,Neighborhood,Price,Bedrooms,Latitude,Longitude,Station Number,Park,Grocery Store,Gym / Fitness Center,Spa,Shopping Mall,Number of Venue Type
0,City Centre South,1196.0,1,43.654262,-79.385975,8,0,0,2,1,1,3
1,Georgina,1008.0,2,43.684144,-79.392643,3,2,1,2,2,0,4
2,Riverdale,1189.0,1,43.678985,-79.34491,3,0,2,1,1,0,3


Order the final dataset by the Number of Venue Type.

In [58]:
toronto_final.sort_values(by = 'Number of Venue Type', ascending=False)

Unnamed: 0,Neighborhood,Price,Bedrooms,Latitude,Longitude,Station Number,Park,Grocery Store,Gym / Fitness Center,Spa,Shopping Mall,Number of Venue Type
1,Georgina,1008.0,2,43.684144,-79.392643,3,2,1,2,2,0,4
0,City Centre South,1196.0,1,43.654262,-79.385975,8,0,0,2,1,1,3
2,Riverdale,1189.0,1,43.678985,-79.34491,3,0,2,1,1,0,3
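
Two of the three candidates are tied on the number of venue types, so the sort above leaves their order to chance. One possible tie-break, shown here purely as an assumption of this sketch, is to sort by the lower rent second. The values are copied from the final table; `final` and `ranked` are illustrative names.

```python
import pandas as pd

# Relevant columns of the final table
final = pd.DataFrame({
    'Neighborhood': ['City Centre South', 'Georgina', 'Riverdale'],
    'Price': [1196.0, 1008.0, 1189.0],
    'Station Number': [8, 3, 3],
    'Number of Venue Type': [3, 4, 3],
})

# Most venue types first, cheaper rent first among ties
ranked = final.sort_values(
    by=['Number of Venue Type', 'Price'],
    ascending=[False, True],
).reset_index(drop=True)

print(ranked)
```

A different tie-break, for example the number of subway stations, would put City Centre South ahead of Riverdale, so the second key should reflect personal priorities.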


---------

## 4. Conclusion

When we order the neighborhoods by Number of Venue Type, we can see that Georgina has 4 of the 5 venue types, which is the most. For me the number of venue types is more important than the total number of venues. Furthermore, it has the lowest rent, and for that price I can even rent a two bedroom flat. Admittedly, there is no Shopping Mall in this area (which was last on my importance list), but there is a Grocery Store where I can buy everything I need for every day (food, drink). This neighborhood also has 2 Gyms and 2 Spas where I can relax and do sports, and 3 subway stations, which is more than enough for me. My second choice would be Riverdale because it has 2 grocery stores, although it has no park. The third would be City Centre South, which has no parks and no grocery stores in the Foursquare results. Okay, there is a Shopping Mall, and I'm sure it contains a grocery store, but the main goal is to have a flat with a park nearby. So my final decision (according to this model) is **Georgina**. Now I have to search the Internet for a flat in this area, move in and take a walk in David A. Balfour Park.

## Thank you for your attention!