# Battle of the Neighborhoods

## Final Assignment Notebook

This notebook was created by Alejandro Somarriba in order to complete the final assignment of the IBM Data Science Course on Coursera.<br>
Parts of the code used for this notebook were modified from some of the labs seen throughout the course.

## Part 1: Installing Libraries

The first thing to do is to install all the necessary libraries:

In [1]:
!pip install beautifulsoup4
!pip install lxml

!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes

Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/3b/c8/a55eb6ea11cd7e5ac4bacdf92bac4693b90d3ba79268be16527555e186f0/beautifulsoup4-4.8.1-py3-none-any.whl (101kB)
Collecting soupsieve>=1.2 (from beautifulsoup4)
  Downloading https://files.pythonhosted.org/packages/81/94/03c0f04471fc245d08d0a99f7946ac228ca98da4fa75796c507f61e688c2/soupsieve-1.9.5-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.8.1 soupsieve-1.9.5
Collecting lxml
  Downloading https://files.pythonhosted.org/packages/68/30/affd16b77edf9537f5be051905f33527021e20d563d013e8c42c7fd01949/lxml-4.4.2-cp36-cp36m-manylinux1_x86_64.whl (5.8MB)
Installing collected packages: lxml
Successfully installed lxml-4.4.2
Solving environment: done


  current version: 4.5.11
  latest

<hr>
Once the libraries are all installed, import them:

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

from geopy.geocoders import Nominatim

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium

import json
from pandas.io.json import json_normalize

print("Libraries imported")

Libraries imported


<hr>

I will be comparing Toronto and New York based on how many of their neighborhoods have hospitals within a 1 kilometer radius.
<hr>

## Part 2: Getting the data frames and first maps
### Part 2.1: Getting the data frame for Toronto

#### Part 2.1.1: Getting the neighborhoods for Toronto
I need a data frame that lists all of the neighborhoods in Toronto along with their coordinates.<br>

First I need a list of neighborhoods; then I will use the geopy library to get their coordinates.<br>
I get the neighborhoods from a Wikipedia table that lists all the neighborhoods in Toronto by postal code.<br>
To do that, I first save the HTML code of the page as a variable, and then I use the BeautifulSoup scraper to find the table.<br>
I then store the table in a Python variable:

In [4]:
html = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text

wikiPage = BeautifulSoup(html, "lxml")
    
postalTable = wikiPage.find("table")

<hr>

I created a list of headers for the column names for the table.<br>
The following code loops through all the <code>\<th></code> tags, which contain the names of the columns, and stores the names in a list.<br>
(It also removes the \n of the last item.)

In [5]:
headers = []

for headName in postalTable.tbody.tr.find_all("th"):
    headers.append(headName.text.replace("\n", ""))
    
print(headers)

['Postcode', 'Borough', 'Neighbourhood']


<hr>

I created a list of nested lists as rows to populate the table.<br>
The following code loops through all the <code>\<tr></code> tags, which contain the values for the rows.<br>
It loops through every <code>\<td></code> tag in the <code>\<tr></code> tags, which are the individual cells in each row.<br>
Lastly, it deletes the first row, which is empty because the header row contains <code>\<th></code> cells instead of <code>\<td></code> cells.<br>
(It also removes the \n of the last item of each row.)

In [6]:
rows = []

for row in postalTable.tbody.find_all("tr"):
    rows.append([])
    for cell in row.find_all("td"):
        rows[-1].append(cell.text.replace("\n", ""))
        
del(rows[0])
print(len(rows), "rows")
print(rows[0:5])

287 rows
[['M1A', 'Not assigned', 'Not assigned'], ['M2A', 'Not assigned', 'Not assigned'], ['M3A', 'North York', 'Parkwoods'], ['M4A', 'North York', 'Victoria Village'], ['M5A', 'Downtown Toronto', 'Harbourfront']]
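
The empty first row can be reproduced on a small hand-written table. The sketch below uses Python's built-in `html.parser` instead of BeautifulSoup so that it runs without extra libraries; the `CellCollector` class and the sample markup are made up for illustration and are not part of the notebook.

```python
from html.parser import HTMLParser

# Hand-written sample table (hypothetical markup, modelled on the rows printed above)
SAMPLE = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])        # start a new (possibly empty) row
        elif tag == "td":
            self.in_td = True
            self.rows[-1].append("")    # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:                  # text inside <th> cells is ignored
            self.rows[-1][-1] += data.replace("\n", "")


parser = CellCollector()
parser.feed(SAMPLE)
rows = parser.rows
header_row = rows[0]
print(header_row)   # the header row contributes no <td> cells
del rows[0]
print(rows)
```

The header row only has `<th>` cells, so it yields an empty list, which is why the notebook deletes `rows[0]`.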


<hr>

I then created a data frame using the <code>headers</code> list for the column names and the <code>rows</code> list for the rows.<br>
I also gave the data frame a shorter alias, <code>nht</code> (which stands for ***n***eighbor***h***ood***t***able).

In [7]:
neighborhoodTable = pd.DataFrame(columns=headers, data=rows)

nht = neighborhoodTable

nht

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
285,M8Z,Etobicoke,South of Bloor


At this point, <code>nht</code> is the data frame that corresponds to all the Postal Codes in Toronto, along with their respective Boroughs and Neighborhoods.<br>
Further along, I will be cleaning the data frame to get rid of useless rows.

<hr>

The following code is for cleaning the data frame.<br>
<ul>
    <li>It renames the first column to "PostalCode"</li>
    <li>It replaces all the "Not assigned" cells with <code>NaN</code> values</li>
    <li>It drops the rows where "Borough" has a <code>NaN</code> value</li>
    <li>It replaces the <code>NaN</code> values in "Neighbourhood" with the corresponding value in "Borough"</li>
</ul>
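
As a plain-Python illustration of the drop and fill rules (not the pandas code used in the notebook), the sketch below applies them to three rows. The last row is a hypothetical case with a borough but no neighbourhood.

```python
# Illustrative rows in the (postcode, borough, neighbourhood) layout;
# the last row is a hypothetical case with a borough but no neighbourhood
rows = [
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["M7A", "Queen's Park", "Not assigned"],
]

cleaned = []
for postcode, borough, neighbourhood in rows:
    if borough == "Not assigned":
        continue                      # rule: drop rows without a borough
    if neighbourhood == "Not assigned":
        neighbourhood = borough       # rule: fall back to the borough name
    cleaned.append([postcode, borough, neighbourhood])

print(cleaned)
```

Each missing neighbourhood takes the borough from its own row, which is the behavior the last bullet describes.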

In [8]:
nht.rename(columns={"Postcode":"PostalCode"}, inplace=True)

nht.replace("Not assigned", np.nan, inplace=True)

nht.dropna(subset=["Borough"], inplace=True)
nht.reset_index(drop=True, inplace=True)

# Fill each missing neighbourhood with the borough from the same row
nht["Neighbourhood"] = nht["Neighbourhood"].fillna(nht["Borough"])

In [9]:
nht

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
...,...,...,...
205,M8Z,Etobicoke,Kingsway Park South West
206,M8Z,Etobicoke,Mimico NW
207,M8Z,Etobicoke,The Queensway West
208,M8Z,Etobicoke,Royal York South West


Now, the <code>nht</code> data frame has been cleaned. I have a list of all the neighborhoods in Toronto.
<hr>

#### Part 2.1.2: Getting the Coordinates for the neighborhoods in Toronto

First, I created another data frame <code>TOnht</code> (which stands for ***TO***ronto***n***eighbor***h***ood***t***able), which contains only the neighborhoods in Toronto.<br>
I will use this to get the coordinates for the neighborhoods later.

In [14]:
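# .copy() makes TOnht independent of nht, so later edits do not raise SettingWithCopyWarning
TOnht = nht[["Neighbourhood"]].copy()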

Before proceeding, I used the following code cells to rename or drop the rows that the geopy library could not resolve.<br>
This extra cleaning let me get coordinates for as many neighborhoods as possible later.

In [15]:
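# Print the hyphenated names, then split every name on "-".
# Names without a hyphen become one-element lists, which explode() turns back into single rows.
hyphenated = TOnht["Neighbourhood"].str.contains("-")
for name in TOnht.loc[hyphenated, "Neighbourhood"]:
    print(name)

TOnht["Neighbourhood"] = TOnht["Neighbourhood"].str.split("-")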

Humewood-Cedarvale
Caledonia-Fairbanks




In [16]:
TOnht = TOnht.explode("Neighbourhood")
TOnht.reset_index(drop=True, inplace=True)
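
To make the effect of the split and `explode` steps concrete, here is a plain-Python sketch of the same idea on three names taken from the outputs above. It is only an illustration; the notebook itself works on the data frame.

```python
# Names taken from the outputs above; two of them contain a hyphen
names = ["Parkwoods", "Humewood-Cedarvale", "Caledonia-Fairbanks"]

# Split each name on "-" and flatten the pieces into one list, one entry per piece
exploded = [part for name in names for part in name.split("-")]

print(exploded)
```

Each hyphenated name becomes two separate entries, so each part can be geocoded on its own.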

In [17]:
# Drop the rows that geopy could not resolve, then renumber the index
TOnht.drop([103, 137, 170, 172], inplace=True)
TOnht.reset_index(drop=True, inplace=True)

In [18]:
TOnht.loc[39, "Neighbourhood"] = "Fairbank"
TOnht.loc[71, "Neighbourhood"] = "Canadian Forces Base"
TOnht.loc[172, "Neighbourhood"] = "Beaumonde Heights"
TOnht.loc[181, "Neighbourhood"] = "The Esplanade"
TOnht.loc[194, "Neighbourhood"] = "969 Eastern"

<hr>
Once the data frame has been cleaned, I add 2 more columns: one for the latitude and one for the longitude of the neighborhoods.

In [21]:
TOnht.insert(1, "Latitude", None)
TOnht.insert(2, "Longitude", None)
TOnht

Unnamed: 0,Neighbourhood,Latitude,Longitude
0,Parkwoods,,
1,Victoria Village,,
2,Harbourfront,,
3,Lawrence Heights,,
4,Lawrence Manor,,
...,...,...,...
203,Kingsway Park South West,,
204,Mimico NW,,
205,The Queensway West,,
206,Royal York South West,,


Now that the data frame is ready, I use a loop to iterate through all of the neighborhoods and assign them their coordinates using the geopy library.<br>
I wrapped it in a <code>while</code> loop so that, if a request fails, it resumes from the neighborhood where it stopped.

In [22]:
# Create the geocoder once instead of once per neighborhood
geolocator = Nominatim(user_agent="CA_explorer")

pos = 0
# Repeat until the last row has coordinates; after an error, resume at the row that failed
while (TOnht.loc[len(TOnht)-1, "Latitude"] is None or TOnht.loc[len(TOnht)-1, "Longitude"] is None):
    try:
        for index, neighbourhood in enumerate(TOnht.loc[pos:, "Neighbourhood"]):
            address = '{}, Toronto, Ontario'.format(neighbourhood)
            location = geolocator.geocode(address)
            latitude = location.latitude
            longitude = location.longitude

            TOnht.loc[index+pos, "Latitude"] = latitude
            TOnht.loc[index+pos, "Longitude"] = longitude
            #print('{} The geographical coordinates of {}, Toronto, Ontario are {}, {}.'.format(index+pos, neighbourhood, latitude, longitude))
    except Exception:
        # Typically a timeout, or a name that could not be geocoded (location is None).
        # Note: a name that never resolves would make this loop retry forever.
        pos = index+pos
        print("Loop stopped at", pos, "- attempting to continue...")
        
print("All coordinates have been filled.")

All coordinates have been filled.
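
The resume-on-failure idea can also be written with a bounded number of retries, so that a name that never resolves cannot stall the run. The sketch below is a hypothetical variant, not the code used in this notebook: `geocode_all` and `fake_geocode` are invented names, and the stand-in geocoder replaces the real Nominatim calls so the example runs offline.

```python
import time


def geocode_all(names, geocode, max_retries=3, pause=1.0):
    """Return {name: (lat, lon) or None}; `geocode` is any callable returning (lat, lon) or None."""
    coords = {}
    for name in names:
        coords[name] = None
        for attempt in range(max_retries):
            try:
                coords[name] = geocode(name)
                break                      # success (or a definite "not found"): move on
            except Exception as error:     # e.g. a timeout from the geocoding service
                print("Attempt", attempt + 1, "failed for", name, "-", error)
                time.sleep(pause)          # wait before retrying the same name
    return coords


# Stand-in geocoder (hypothetical): fails on its first call, then answers from a small lookup
calls = {"count": 0}


def fake_geocode(name):
    calls["count"] += 1
    if name == "Parkwoods" and calls["count"] == 1:
        raise TimeoutError("simulated timeout")
    known = {"Parkwoods": (43.7588, -79.3202), "Victoria Village": (43.7327, -79.3112)}
    return known.get(name)


result = geocode_all(["Parkwoods", "Victoria Village", "Nowhere"], fake_geocode, pause=0)
print(result)
```

With a real geocoder, the `pause` between attempts also helps to stay within the service's request limits.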


In [23]:
TOnht

Unnamed: 0,Neighbourhood,Latitude,Longitude
0,Parkwoods,43.7588,-79.3202
1,Victoria Village,43.7327,-79.3112
2,Harbourfront,43.6401,-79.3801
3,Lawrence Heights,43.7228,-79.4509
4,Lawrence Manor,43.7221,-79.4375
...,...,...,...
203,Kingsway Park South West,43.6504,-79.5
204,Mimico NW,43.6167,-79.4968
205,The Queensway West,43.6236,-79.5148
206,Royal York South West,43.6482,-79.5113


Now we have the data frame with all the neighborhoods and their respective coordinates.
<hr>

### Part 2.2: Getting the data frame for New York

For the New York data frame, I used the same JSON file provided by the course for one of the labs.<br>
I iterated through the JSON file and got the Neighborhoods, Latitudes, and Longitudes.

In [None]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset

In [24]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
    
neighborhoods_data = newyork_data['features']

column_names = ['Neighborhood', 'Latitude', 'Longitude'] 

# Collect one record per neighborhood and build the data frame in a single step
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
records = []

for data in neighborhoods_data:
    neighborhood_name = data['properties']['name']
        
    # The coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    records.append({'Neighborhood': neighborhood_name,
                    'Latitude': neighborhood_lat,
                    'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(records, columns=column_names)
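
The order of the coordinates is easy to get wrong, so here is a small offline sketch of the same extraction. The `sample` string is a hand-written stand-in for one entry of the file, using the Wakefield values from the table below; the real file has more fields.

```python
import json

# Minimal stand-in for one entry of the file (hand-written; the real file has more fields)
sample = json.loads("""
{"features": [
  {"properties": {"name": "Wakefield"},
   "geometry": {"coordinates": [-73.847201, 40.894705]}}
]}
""")

records = []
for feature in sample["features"]:
    lon, lat = feature["geometry"]["coordinates"]   # GeoJSON order: longitude first
    records.append({"Neighborhood": feature["properties"]["name"],
                    "Latitude": lat,
                    "Longitude": lon})

print(records)
```

GeoJSON positions list the longitude before the latitude, which is why the notebook reads index 1 for the latitude and index 0 for the longitude.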

I save the <code>neighborhoods</code> data frame as NYnht (which stands for ***N***ew***Y***ork***n***eighbor***h***ood***t***able).

In [25]:
NYnht = neighborhoods

In [26]:
NYnht

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Wakefield,40.894705,-73.847201
1,Co-op City,40.874294,-73.829939
2,Eastchester,40.887556,-73.827806
3,Fieldston,40.895437,-73.905643
4,Riverdale,40.890834,-73.912585
...,...,...,...
301,Hudson Yards,40.756658,-74.000111
302,Hammels,40.587338,-73.805530
303,Bayswater,40.611322,-73.765968
304,Queensbridge,40.756091,-73.945631


Now we have the New York data frame with all its neighborhoods and their corresponding coordinates.
<hr>

In case it becomes necessary to access these data frames in the future without having to go through the whole process again, I save them as CSV files.

In [None]:
TOnht.to_csv("Toronto_coords.csv")
NYnht.to_csv("NewYork_coords.csv")

I would retrieve them like this:

In [4]:
TOnht = pd.read_csv("Toronto_coords.csv", index_col = 0)
NYnht = pd.read_csv("NewYork_coords.csv", index_col = 0)

<hr>

### Part 2.3: Plotting the neighborhoods

I will use the folium library to make maps that display the neighborhoods using the data frames that have just been created.<br>
Later on, I will make more maps with more data.

#### Part 2.3.1: Plotting the neighborhoods in Toronto

Using the geopy library, I get the coordinates for Toronto, Ontario, which will be used to center the map:

In [5]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="CA_explorer")
location = geolocator.geocode(address)
CAlatitude = location.latitude
CAlongitude = location.longitude
print('The geographical coordinates of Toronto, Ontario are {}, {}.'.format(CAlatitude, CAlongitude))

The geographical coordinates of Toronto, Ontario are 43.653963, -79.387207.


Now, I proceed to generate a map of Toronto with all its neighborhoods.

In [6]:
map_Toronto = folium.Map(location=[CAlatitude, CAlongitude], zoom_start=10)

for lat, lng, neighbourhood in zip(TOnht['Latitude'], TOnht['Longitude'], TOnht['Neighbourhood']):
    label = '{}'.format(neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  
    
map_Toronto

All the blue points on the map are neighborhoods.

<hr>

#### Part 2.3.2: Plotting the neighborhoods in New York

I repeated the same steps for New York.

In [7]:
address = 'New York, US'

geolocator = Nominatim(user_agent="US_explorer")
location = geolocator.geocode(address)
USlatitude = location.latitude
USlongitude = location.longitude
print('The geographical coordinates of New York are {}, {}.'.format(USlatitude, USlongitude))

The geographical coordinates of New York are 40.7127281, -74.0060152.


In [8]:
map_NewYork = folium.Map(location=[USlatitude, USlongitude], zoom_start=10)

for lat, lng, neighborhood in zip(NYnht['Latitude'], NYnht['Longitude'], NYnht['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_NewYork)  
    
map_NewYork

Once again, all the blue points on the map represent neighborhoods in New York.<br>

Comparing the 2 maps already suggests two things: New York has more neighborhoods than Toronto, and the neighborhoods in Toronto are more spread out than the ones in New York.

<hr>

## Part 3: Getting the hospital venues
### Part 3.1: Preparing to use the FourSquare API
#### Part 3.1.1: Getting the credentials

Before defining the function that will obtain the location data I'm looking for, I define my credentials.<br>
<ul><li><em>Note that my credentials are hidden in this notebook</em></li></ul>
I also specify the version, the limit of results, and radius (in meters).

In [20]:
CLIENT_ID = '*************'
CLIENT_SECRET = '*************'
VERSION = '20190605'

LIMIT = 300
radius = 1000

<hr>

#### Part 3.1.2: Defining the request function

I used some of the code from one of the labs to get the list of nearby hospital-related venues for every neighborhood in the table.

It's possible to get a specific type of venue from the FourSquare API by adding the query parameter "categoryId" to the URL.<br>
In this case, I used the ID for hospitals: 4bf58dd8d48988d196941735
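
As a sketch of how the query parameters fit together, the request URL can also be assembled with the standard library's `urlencode`. The helper `build_search_url` and the placeholder credentials are hypothetical; the function defined below builds the same kind of URL with string formatting.

```python
from urllib.parse import urlencode, urlparse, parse_qs

HOSPITAL_CATEGORY_ID = "4bf58dd8d48988d196941735"


def build_search_url(lat, lng, client_id, client_secret,
                     version="20190605", radius=1000, limit=300):
    """Build a venues/search URL; every value is passed as a query parameter."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
        "categoryId": HOSPITAL_CATEGORY_ID,   # restricts the search to the Hospital category
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)


# Placeholder credentials and the Harbourfront coordinates from the Toronto table
url = build_search_url(43.6401, -79.3801, "MY_ID", "MY_SECRET")
query = parse_qs(urlparse(url).query)
print(query["categoryId"], query["ll"])
```

Letting `urlencode` build the query string avoids stray separators and escapes any special characters in the values.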

In [19]:
def getNearbyHospitals(names, latitudes, longitudes, radius=1000):
    
    hospital_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # Search for venues in the Hospital category around the neighborhood's coordinates
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId=4bf58dd8d48988d196941735'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        results = requests.get(url).json()["response"]['venues']
        
        # One tuple per venue; venues without any category are skipped to avoid an IndexError
        hospital_list.append([(
            name, 
            lat, 
            lng, 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng'],  
            v['categories'][0]['name']) for v in results if v['categories']])

    # Flatten the per-neighborhood lists into a single list of rows
    nearby_hospitals = pd.DataFrame([item for venue_list in hospital_list for item in venue_list])
    nearby_hospitals.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_hospitals

### Part 3.2: Getting the API requests

Once the function has been defined, I use it to create 2 new data frames, <code>toronto_hospitals</code> and <code>newyork_hospitals</code>.

In [None]:
toronto_hospitals = getNearbyHospitals(names=TOnht['Neighbourhood'],
                                   latitudes=TOnht['Latitude'],
                                   longitudes=TOnht['Longitude']
                                  )

In [None]:
newyork_hospitals = getNearbyHospitals(names=NYnht['Neighborhood'],
                                   latitudes=NYnht['Latitude'],
                                   longitudes=NYnht['Longitude']
                                  )

These data frames hold every venue that matches the criteria I specified in the API call, that is, all the hospital-related venues that exist within 1 kilometer of every neighborhood in both Toronto and New York.

In [33]:
toronto_hospitals

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Victoria Village,43.732658,-79.311189,Clearview MediSpa,43.741662,-79.317577,Hospital
1,Harbourfront,43.640080,-79.380150,Harbourfront Animal Hospital,43.639399,-79.389530,Veterinarian
2,Harbourfront,43.640080,-79.380150,Cheo Ottawa,43.642088,-79.380305,Hospital
3,Harbourfront,43.640080,-79.380150,Scripps Rancho Bernardo,43.646181,-79.380815,Hospital
4,Harbourfront,43.640080,-79.380150,Toronto Cosmetic Surgery Institute,43.647224,-79.376242,Hospital
...,...,...,...,...,...,...,...
1284,South of Bloor,43.667662,-79.394698,Mount Sinai Nuclear Medicine,43.658653,-79.381157,Hospital
1285,South of Bloor,43.667662,-79.394698,Mount Sinai Hospital Special Pregnancy Program,43.658399,-79.389683,Hospital
1286,South of Bloor,43.667662,-79.394698,Princess Margaret Hospital-Red Pod,43.658183,-79.389962,Hospital
1287,South of Bloor,43.667662,-79.394698,Mount Sinai Hospital Women's and Infants' Depa...,43.659612,-79.390761,Hospital


In [34]:
newyork_hospitals

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Co-op City,40.874294,-73.829939,Statcare Urgent & Walk-In Medical Care (Bronx ...,40.870168,-73.828404,Hospital
1,Co-op City,40.874294,-73.829939,CityMD Baychester Urgent Care - Bronx,40.866795,-73.827051,Hospital
2,Co-op City,40.874294,-73.829939,wellcare,40.874247,-73.837745,Hospital
3,Fieldston,40.895437,-73.905643,The Mollie & Jack Zicklin Jewish Hospice Resid...,40.888478,-73.910047,Hospital
4,Riverdale,40.890834,-73.912585,The Mollie & Jack Zicklin Jewish Hospice Resid...,40.888478,-73.910047,Hospital
...,...,...,...,...,...,...,...
2115,Queensbridge,40.756091,-73.945631,Goldwater Memorial Hospital,40.755334,-73.956673,Hospital
2116,Fox Hills,40.617311,-74.081740,Bayley Seton Hospital,40.622068,-74.074856,Hospital
2117,Fox Hills,40.617311,-74.081740,St Elizabeth Ann Nursing Home,40.622370,-74.077920,Hospital
2118,Fox Hills,40.617311,-74.081740,Richmond University Medical Center Psych Center,40.622365,-74.076279,Hospital
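
One way to sanity-check the 1 kilometer radius is to compute the straight-line distance between a neighborhood and one of its venues. The sketch below uses the haversine formula with a mean Earth radius of 6371 km (an approximation); `haversine_km` is a helper written for this illustration, and the coordinates come from the first row of the Toronto table above.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (latitude, longitude) points."""
    earth_radius_km = 6371.0  # mean Earth radius (approximation)
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


# First row of the Toronto table above: Victoria Village -> Clearview MediSpa
distance = haversine_km(43.732658, -79.311189, 43.741662, -79.317577)
print(round(distance, 2), "km")
```

For this row the distance comes out slightly more than 1 kilometer, so the `radius` value is probably best read as approximate rather than as a strict cut-off.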


In case it becomes necessary to access these data frames in the future without having to go through the whole process again, I save them as CSV files.

In [None]:
toronto_hospitals.to_csv("Toronto_Hospitals.csv")
newyork_hospitals.to_csv("NewYork_Hospitals.csv")

I would retrieve them like this:

In [32]:
toronto_hospitals = pd.read_csv("Toronto_Hospitals.csv", index_col = 0)
newyork_hospitals = pd.read_csv("NewYork_Hospitals.csv", index_col = 0)

<hr>

### Part 3.3: Exploring and cleaning the data frames

Before moving on with mapping and clustering the neighborhoods, I thought it was a good idea to see some of the results that turned up from the API calls.

#### Part 3.3.1: Exploring and cleaning the <code>toronto_hospitals</code> dataframe

First, I wanted to see what kind of venues I received to make sure I got precisely what I wanted and not something else, so I used the <code>.value_counts()</code> method on the data frames to see the different venue categories I got:

In [35]:
toronto_hospitals["Venue Category"].value_counts()

Hospital           1139
Hospital Ward        63
Conference Room      36
Medical Center       18
Emergency Room       18
Veterinarian          9
Building              4
Medical Lab           1
Doctor's Office       1
Name: Venue Category, dtype: int64

These are all the venue categories that came up when using the Hospital categoryId with the FourSquare API. I will clean the data frames to get rid of the rows with venues that I don't need, such as Veterinarian and Conference Room.<br>
The following code keeps only the venues whose category I'm interested in and discards the rest:

In [36]:
# Keep only the hospital-related categories
keep = ["Hospital", "Hospital Ward", "Medical Center", "Emergency Room"]
toronto_hospitals = toronto_hospitals[toronto_hospitals["Venue Category"].isin(keep)]

toronto_hospitals = toronto_hospitals.reset_index(drop=True)
toronto_hospitals["Venue Category"].value_counts()

Hospital          1139
Hospital Ward       63
Medical Center      18
Emergency Room      18
Name: Venue Category, dtype: int64

<hr>

Once I only had the venues I cared about, I wanted to see how many neighborhoods in Toronto met the conditions I set:

In [37]:
len(toronto_hospitals["Neighborhood"].value_counts())

94

This shows that only 94 neighborhoods in Toronto have at least 1 hospital within a radius of 1 kilometer.

<hr>

#### Part 3.3.2: Exploring and cleaning the <code>newyork_hospitals</code> dataframe

I repeated the same steps with the data frame for New York.<br>
First I wanted to see the venue categories:

In [38]:
newyork_hospitals["Venue Category"].value_counts()

Hospital                    1709
Hospital Ward                163
Doctor's Office              114
Medical Center                46
Office                        20
Emergency Room                10
Pharmacy                       7
Optical Shop                   7
Medical School                 7
Government Building            6
Eye Doctor                     5
Medical Lab                    3
College Science Building       3
Veterinarian                   3
Auditorium                     3
Spiritual Center               3
Building                       3
Urgent Care Center             2
Bus Station                    2
Scenic Lookout                 2
Dentist's Office               1
High School                    1
Name: Venue Category, dtype: int64

Then I filtered the data frame and got rid of the ones I didn't need:

In [39]:
# Keep only the hospital-related categories
keep = ["Hospital", "Hospital Ward", "Medical Center", "Emergency Room", "Urgent Care Center"]
newyork_hospitals = newyork_hospitals[newyork_hospitals["Venue Category"].isin(keep)]

newyork_hospitals = newyork_hospitals.reset_index(drop=True)
newyork_hospitals["Venue Category"].value_counts()

Hospital              1709
Hospital Ward          163
Medical Center          46
Emergency Room          10
Urgent Care Center       2
Name: Venue Category, dtype: int64

<hr>

Once again, I checked how many neighborhoods in New York met my criteria:

In [42]:
len(newyork_hospitals["Neighborhood"].value_counts())

232

Likewise, there are 232 neighborhoods in New York that have at least 1 hospital within a radius of 1 kilometer.<br>
That is almost 2.5 times as many as in Toronto.<br><br>
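
The arithmetic behind that comparison is a simple ratio of the two counts. Since the two cities list a different number of neighborhoods, the share of neighborhoods with a hospital may be a fairer measure; the `share` helper below is hypothetical, and its totals would have to come from `len(TOnht)` and `len(NYnht)`.

```python
toronto_with_hospital = 94     # from the Toronto count above
newyork_with_hospital = 232    # from the New York count above

# Ratio of the raw counts
ratio = newyork_with_hospital / toronto_with_hospital
print(round(ratio, 2))


def share(with_hospital, total):
    """Fraction of a city's neighborhoods that have at least 1 hospital nearby."""
    return with_hospital / total
```

For example, `share(94, len(TOnht))` would give Toronto's share once the data frame is loaded.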
<hr>
Now, I have data frames that contain the information of all the hospital-related venues within 1 kilometer of the neighborhoods in Toronto and New York.

<hr>

Once again, in case it becomes necessary to access these data frames in the future without having to go through the whole process again, I save them as CSV files.

In [52]:
toronto_hospitals.to_csv("Toronto_Hospitals_clean.csv")
newyork_hospitals.to_csv("NewYork_Hospitals_clean.csv")

I would retrieve them like this:

In [9]:
toronto_hospitals = pd.read_csv("Toronto_Hospitals_clean.csv", index_col = 0)
newyork_hospitals = pd.read_csv("NewYork_Hospitals_clean.csv", index_col = 0)

<hr>

## Part 4: One-hot encoding and second maps

### Part 4.1: One-hot encoding

I used some of the code from one of the labs to transform the categorical values of the venues' categories into numerical values.

#### Part 4.1.1: One-hot encoding the <code>toronto_hospitals</code> data frame

First, I turned the venue categories for the <code>toronto_hospitals</code> data frame into numerical values:

In [10]:
toronto_onehot = pd.get_dummies(toronto_hospitals[['Venue Category']], prefix="", prefix_sep="")

toronto_onehot['Neighborhood'] = toronto_hospitals['Neighborhood'] 

fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot

Unnamed: 0,Neighborhood,Emergency Room,Hospital,Hospital Ward,Medical Center
0,Victoria Village,0,1,0,0
1,Harbourfront,0,1,0,0
2,Harbourfront,0,1,0,0
3,Harbourfront,0,1,0,0
4,Lawrence Manor,0,1,0,0
...,...,...,...,...,...
1233,South of Bloor,0,1,0,0
1234,South of Bloor,0,1,0,0
1235,South of Bloor,0,1,0,0
1236,South of Bloor,0,1,0,0


Then I grouped the neighborhoods according to the mean occurrence of different venues (similar to how it was done in one of the labs).<br>
This will help when getting the most common venues later.

In [11]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Emergency Room,Hospital,Hospital Ward,Medical Center
0,Adelaide,0.021739,0.891304,0.065217,0.021739
1,Bayview Village,0.000000,1.000000,0.000000,0.000000
2,Berczy Park,0.000000,1.000000,0.000000,0.000000
3,Bloordale Gardens,0.000000,1.000000,0.000000,0.000000
4,CN Tower,0.000000,1.000000,0.000000,0.000000
...,...,...,...,...,...
89,Willowdale,0.000000,1.000000,0.000000,0.000000
90,Willowdale South,0.000000,1.000000,0.000000,0.000000
91,Willowdale West,0.000000,1.000000,0.000000,0.000000
92,Wilson Heights,0.000000,1.000000,0.000000,0.000000
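
To see what the grouped mean represents, here is a plain-Python sketch on made-up rows (not taken from the real results). Averaging the one-hot columns per neighborhood gives the share of each venue category in that neighborhood.

```python
from collections import Counter, defaultdict

# Made-up (neighborhood, category) rows, not taken from the real results
rows = [
    ("A", "Hospital"), ("A", "Hospital"), ("A", "Hospital"), ("A", "Emergency Room"),
    ("B", "Hospital"),
]

# Count the categories per neighborhood
counts = defaultdict(Counter)
for neighborhood, category in rows:
    counts[neighborhood][category] += 1

# Divide each count by the neighborhood's total, which equals the mean of the one-hot column
categories = sorted({category for _, category in rows})
grouped = {
    neighborhood: {c: counter[c] / sum(counter.values()) for c in categories}
    for neighborhood, counter in counts.items()
}

print(grouped)
```

In this toy data, neighborhood "A" has 3 hospitals and 1 emergency room, so its row holds 0.75 and 0.25, and each row sums to 1.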


For now, this is all I need to do with this data frame. I will use this later.
<hr>

#### Part 4.1.2: One-hot encoding the <code>newyork_hospitals</code> data frame

I repeated the same steps with the New York data frame.

In [21]:
newyork_onehot = pd.get_dummies(newyork_hospitals[['Venue Category']], prefix="", prefix_sep="")

newyork_onehot['Neighborhood'] = newyork_hospitals['Neighborhood'] 

fixed_columns2 = [newyork_onehot.columns[-1]] + list(newyork_onehot.columns[:-1])
newyork_onehot = newyork_onehot[fixed_columns2]

newyork_onehot.head()

Unnamed: 0,Neighborhood,Emergency Room,Hospital,Hospital Ward,Medical Center,Urgent Care Center
0,Co-op City,0,1,0,0,0
1,Co-op City,0,1,0,0,0
2,Co-op City,0,1,0,0,0
3,Fieldston,0,1,0,0,0
4,Riverdale,0,1,0,0,0


In [22]:
newyork_grouped = newyork_onehot.groupby('Neighborhood').mean().reset_index()
newyork_grouped

Unnamed: 0,Neighborhood,Emergency Room,Hospital,Hospital Ward,Medical Center,Urgent Care Center
0,Allerton,0.0,1.000000,0.000000,0.000000,0.0
1,Arlington,0.0,1.000000,0.000000,0.000000,0.0
2,Arverne,0.0,1.000000,0.000000,0.000000,0.0
3,Astoria,0.0,1.000000,0.000000,0.000000,0.0
4,Auburndale,0.0,1.000000,0.000000,0.000000,0.0
...,...,...,...,...,...,...
227,Wingate,0.0,0.950000,0.050000,0.000000,0.0
228,Woodhaven,0.0,1.000000,0.000000,0.000000,0.0
229,Woodlawn,0.0,1.000000,0.000000,0.000000,0.0
230,Woodside,0.0,0.750000,0.250000,0.000000,0.0


For now, this is all I need to do with this data frame. I will use this later.
<hr>

### Part 4.2: Maps with venues

Before doing the clustering, I wanted to take a look at where the venues were located in Toronto and New York.

#### Part 4.2.1: Toronto map with venues

First, I made a list that contained every unique neighborhood in the <code>toronto_grouped</code> data frame so I could use it later to generate the map.<br>
The list <code>TOnhlist</code> stands for ***TO***ronto***n***eighbor***h***ood***list*** because it's a list that only has the neighborhoods.

In [18]:
# List of the neighborhoods that have at least 1 hospital-related venue
TOnhlist = toronto_grouped["Neighborhood"].tolist()

Then, I made another data frame from the <code>toronto_hospitals</code> data frame that only had the Venue name, Venue Latitude, Venue Longitude, and Venue Category.<br>
The data frame <code>TOHs</code> stands for ***TO***ronto***H***ospital***s*** because it's a data frame that has all the hospitals and their coordinates.<br>
This will be used to plot all the venues in the map.

In [14]:
TOHs = toronto_hospitals[["Venue", "Venue Latitude", "Venue Longitude", "Venue Category"]]
TOHs.shape

(1238, 4)

Seeing as the data frame had 1238 rows, I considered the possibility that some venues may be repeated due to the overlap between neighborhoods (meaning that 2 different neighborhoods may have the same hospital within 1 kilometer, and therefore appear twice in the data frame).<br>
To account for this, I kept only the first occurrence of each venue name in the <code>TOHs</code> data frame and stored the unique names in a list.<br>
The <code>TOHslist</code> list stands for ***TO***ronto***H***ospital***slist*** because it only has the hospitals.

In [15]:
# Keep the first occurrence of each venue name and drop the repeats
TOHs = TOHs.drop_duplicates(subset="Venue").reset_index(drop=True)

# List of the unique hospital names
TOHslist = TOHs["Venue"].tolist()

TOHs.shape



(167, 4)

After cleaning, the <code>TOHs</code> data frame had 167 unique rows.
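
The overlap is visible in the New York results printed earlier, where the same venue appears for both Fieldston and Riverdale. The plain-Python sketch below illustrates the "keep the first occurrence" rule on those rows; the venue name is truncated in that output and is kept as printed.

```python
# (neighborhood, venue) pairs copied from the New York results printed earlier;
# the venue name is truncated in that output and is kept as printed
pairs = [
    ("Co-op City", "wellcare"),
    ("Fieldston", "The Mollie & Jack Zicklin Jewish Hospice Resid..."),
    ("Riverdale", "The Mollie & Jack Zicklin Jewish Hospice Resid..."),
]

seen = set()
unique_venues = []
for _, venue in pairs:
    if venue not in seen:      # keep only the first occurrence of each name
        seen.add(venue)
        unique_venues.append(venue)

print(len(pairs), "rows ->", len(unique_venues), "unique venues")
```

Matching on the name alone would also merge two different venues that happen to share a name; matching on the name together with the coordinates is a possible stricter alternative.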
<hr>
Now, using the folium library again, I placed all the neighborhoods on a map and colored the ones that have a hospital within 1 kilometer in green. I also plotted all of the venues returned by the FourSquare API as red circles.<br>
(Note: I used the coordinates of one of the neighborhoods to center the map better, with a zoom of 11.)

In [19]:
map_Toronto2 = folium.Map(location=[43.7098517, -79.4042948], zoom_start=11)

for lat, lng, neighbourhood in zip(TOnht['Latitude'], TOnht['Longitude'], TOnht['Neighbourhood']):
    label = '{}'.format(neighbourhood)
    label = folium.Popup(label, parse_html=True)
    # Green if the neighborhood has a hospital within 1 kilometer, blue otherwise
    if (neighbourhood in TOnhlist):
        color, fill_color = 'green', '#31cc6a'
    else:
        color, fill_color = 'blue', '#3186cc'
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=color,
        fill=True,
        fill_color=fill_color,
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto2)

for lat, lng, hospital, category in zip(TOHs['Venue Latitude'], TOHs['Venue Longitude'], TOHs['Venue'], TOHs["Venue Category"]):
    label = '{} ({})'.format(hospital, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='red',
        fill=True,
        fill_color='white',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto2)

map_Toronto2

To recap:<br>
- Green Points: Neighborhoods with at least 1 hospital within 1 kilometer
- Blue Points: All the other neighborhoods
- Red Points: Hospital-related venues within 1 kilometer of a neighborhood

There seems to be a high concentration of hospital-related venues between Queen's Park and Mount Olive.

#### Part 4.2.2: New York map with venues

I repeated the same process as before.<br>
The <code>NYnhlist</code> stands for ***N***ew***Y***ork***n***eighbor***h***ood***list*** because it's a list with all the neighborhoods in New York with a hospital within 1 kilometer.

In [24]:
NYnhlist = []
for neighborhood in newyork_grouped["Neighborhood"]:
    NYnhlist.append(neighborhood)

The <code>NYHs</code> data frame stands for ***N***ew***Y***ork***H***ospital***s*** because it's a data frame with all the hospitals and their coordinates (with duplicates).

In [25]:
NYHs = newyork_hospitals[["Venue", "Venue Latitude", "Venue Longitude", "Venue Category"]]
NYHs.shape

(1930, 4)

Once again I cleaned the data frame to account for overlaps.<br>
The <code>NYHslist</code> stands for ***N***ew***Y***ork***H***ospital***slist*** because it's a list of the unique hospitals in the data frame.

In [26]:
NYHslist = []
for index, hospital in enumerate(NYHs["Venue"]):
    if (hospital not in NYHslist):
        NYHslist.append(hospital)
    else:
        NYHs.drop(index, inplace=True)
        
NYHs.reset_index(drop=True, inplace=True)

NYHs.shape

(862, 4)

After cleaning the <code>NYHs</code> data frame, it had 862 unique rows (which is a bit over 5 times the number of venues in Toronto).
<hr>
Again, I plotted the New York map with color-coded markers:

In [28]:
map_NewYork2 = folium.Map(location=[USlatitude, USlongitude], zoom_start=11)

for lat, lng, neighborhood in zip(NYnht['Latitude'], NYnht['Longitude'], NYnht['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    if (neighborhood in NYnhlist):
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='green',
            fill=True,
            fill_color='#31cc6a',
            fill_opacity=0.7,
            parse_html=False).add_to(map_NewYork2)
    else:
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_NewYork2)

for lat, lng, hospital, category in zip(NYHs['Venue Latitude'], NYHs['Venue Longitude'], NYHs['Venue'], NYHs["Venue Category"]):
    label = '{} ({})'.format(hospital, category)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='red',
        fill=True,
        fill_color='white',
        fill_opacity=0.7,
        parse_html=False).add_to(map_NewYork2)

map_NewYork2

To recap:<br>
- Green Points: Neighborhoods with at least 1 hospital within 1 kilometer
- Blue Points: All the other neighborhoods
- Red Points: Hospital-related venues within 1 kilometer of a neighborhood

(I think the data I used to get the Toronto neighborhoods may not have been the best, seeing as there are various blank spaces between neighborhoods, unlike in New York.)

<hr>

Lastly, before moving on, I saved the data frames as CSVs in case I need them in the future.

In [None]:
TOHs.to_csv("Toronto_Hospital_Venues.csv")
NYHs.to_csv("NewYork_Hospital_Venues.csv")

<hr>

## Part 5: Most common venues, clusters, and final maps

### Part 5.1: Getting the most common venues

Once again, I used some of the code from one of the labs to get the most common venues per neighborhood.

In [29]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now, I create yet another data frame that contains the 4 most common venues near the neighborhoods in Toronto.<br>
I chose 4 because there were only 4 different venue categories returned by the Foursquare API for the neighborhoods in Toronto that I was interested in.<br>
(This may not have been the best idea, seeing as not every neighborhood had all 4 venue categories, and as such, may not have a 2nd, 3rd, or 4th most common venue.)

In [30]:
num_top_venues = 4

indicators = ['st', 'nd', 'rd']

TOcolumns = ['Neighborhood']
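for ind in np.arange(num_top_venues):
    try:
        # 1st, 2nd, 3rd take their suffix from the indicators list
        TOcolumns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # every position after the 3rd uses the 'th' suffix
        TOcolumns.append('{}th Most Common Venue'.format(ind+1))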

TOneighborhoods_venues_sorted = pd.DataFrame(columns=TOcolumns)
TOneighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    TOneighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

TOneighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
0,Adelaide,Hospital,Hospital Ward,Medical Center,Emergency Room
1,Bayview Village,Hospital,Medical Center,Hospital Ward,Emergency Room
2,Berczy Park,Hospital,Medical Center,Hospital Ward,Emergency Room
3,Bloordale Gardens,Hospital,Medical Center,Hospital Ward,Emergency Room
4,CN Tower,Hospital,Medical Center,Hospital Ward,Emergency Room
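
As noted above, a neighborhood may not have venues in every category. In that case `return_most_common_venues` still fills every column, and categories with a frequency of 0 are ranked in whatever order the sort leaves them. The sketch below is one possible variant (the name `top_venues_nonzero` and the example row are hypothetical) that leaves those positions empty instead. It assumes, as in the lab code, that the first entry of the row is the neighborhood name and the rest are numeric frequencies.

```python
import pandas as pd

def top_venues_nonzero(row, num_top_venues):
    """Return the top categories, using None where the frequency is 0."""
    # Skip the neighborhood name and sort the frequencies, highest first
    freqs = row.iloc[1:].astype(float).sort_values(ascending=False)
    # Keep a category name only if the neighborhood actually has that venue type
    names = [name if value > 0 else None for name, value in freqs.items()]
    return names[:num_top_venues]

# Hypothetical row: only two of the four categories are present
example_row = pd.Series({
    "Neighborhood": "Example",
    "Emergency Room": 0.0,
    "Hospital": 0.75,
    "Hospital Ward": 0.0,
    "Medical Center": 0.25,
})

top = top_venues_nonzero(example_row, 4)
print(top)
```

For this row, only the 1st and 2nd positions carry a category name; the 3rd and 4th stay empty rather than showing a category the neighborhood does not have.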


Then I created another data frame that contains the 5 most common venues near the neighborhoods in New York.<br>
I chose 5 because there were only 5 different venue categories returned by the Foursquare API for the neighborhoods in New York that I was interested in.

In [31]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

NYcolumns = ['Neighborhood']
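for ind in np.arange(num_top_venues):
    try:
        # 1st, 2nd, 3rd take their suffix from the indicators list
        NYcolumns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # every position after the 3rd uses the 'th' suffix
        NYcolumns.append('{}th Most Common Venue'.format(ind+1))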

NYneighborhoods_venues_sorted = pd.DataFrame(columns=NYcolumns)
NYneighborhoods_venues_sorted['Neighborhood'] = newyork_grouped['Neighborhood']

for ind in np.arange(newyork_grouped.shape[0]):
    NYneighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(newyork_grouped.iloc[ind, :], num_top_venues)

NYneighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Allerton,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room
1,Arlington,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room
2,Arverne,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room
3,Astoria,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room
4,Auburndale,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room


<hr>

### Part 5.2: Clustering and final maps

First, the clustering.<br>
The following code cell fits two <code>KMeans</code> models with 5 clusters each, one for Toronto and one for New York, so that every neighborhood gets a cluster label.

In [32]:
kclusters = 5

# Drop the name column so only the numeric venue columns are clustered
toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)
newyork_grouped_clustering = newyork_grouped.drop('Neighborhood', axis=1)

kmeansCA = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
kmeansUS = KMeans(n_clusters=kclusters, random_state=0).fit(newyork_grouped_clustering)

#kmeansCA.labels_[0:10]
#kmeansUS.labels_[0:10]
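
The fitted models store one label per input row in `labels_`, in the same order as the rows of the data frame that was passed to `fit`. A quick way to see how the neighborhoods are spread across clusters is to count the labels. The sketch below uses randomly generated frequencies as a hypothetical stand-in for `toronto_grouped_clustering`; `n_init=10` is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in data: 30 rows of category frequencies that sum to 1
rng = np.random.RandomState(0)
toy_clustering = pd.DataFrame(
    rng.dirichlet(np.ones(4), size=30),
    columns=["Emergency Room", "Hospital", "Hospital Ward", "Medical Center"],
)

kclusters = 5
toy_kmeans = KMeans(n_clusters=kclusters, random_state=0, n_init=10).fit(toy_clustering)

# labels_ has one entry per row; counting them gives the size of each cluster
cluster_sizes = pd.Series(toy_kmeans.labels_).value_counts().sort_index()
print(cluster_sizes)
```

Clusters with only one or two members are worth a closer look, since they may be outliers rather than meaningful groups.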

#### 5.2.1: Clustering and mapping Toronto

Starting with the Toronto data, I create another data frame <code>TONs</code> (which stands for ***TO***ronto***N***eighborhood***s***), which collects all the unique neighborhoods in Toronto so they can be plotted on the map after being clustered.

In [33]:
TONs = toronto_hospitals[["Neighborhood", "Neighborhood Latitude", "Neighborhood Longitude"]]
TONslist = []
for index, neighborhood in enumerate(TONs["Neighborhood"]):
    if (neighborhood not in TONslist):
        TONslist.append(neighborhood)
    else:
        TONs.drop(index, inplace=True)
        
TONs.sort_values("Neighborhood", inplace=True)
TONs.reset_index(drop=True, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':


In [34]:
TONs

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude
0,Adelaide,43.650809,-79.377917
1,Bayview Village,43.769197,-79.376662
2,Berczy Park,43.647984,-79.375396
3,Bloordale Gardens,43.635317,-79.563674
4,CN Tower,43.642564,-79.387087
...,...,...,...
89,Willowdale,43.775356,-79.416686
90,Willowdale South,43.775356,-79.416686
91,Willowdale West,43.775356,-79.416686
92,Wilson Heights,43.740519,-79.440017


Afterwards, I added a column with each neighborhood's cluster label to the <code>TOneighborhoods_venues_sorted</code> data frame and merged it with the <code>TONs</code> data frame that was just generated. Having the labels and coordinates in one data frame helped when mapping.

In [35]:
TOneighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeansCA.labels_)

toronto_merged = TONs

toronto_merged = toronto_merged.join(TOneighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
0,Adelaide,43.650809,-79.377917,1,Hospital,Hospital Ward,Medical Center,Emergency Room
1,Bayview Village,43.769197,-79.376662,0,Hospital,Medical Center,Hospital Ward,Emergency Room
2,Berczy Park,43.647984,-79.375396,0,Hospital,Medical Center,Hospital Ward,Emergency Room
3,Bloordale Gardens,43.635317,-79.563674,0,Hospital,Medical Center,Hospital Ward,Emergency Room
4,CN Tower,43.642564,-79.387087,0,Hospital,Medical Center,Hospital Ward,Emergency Room
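
The `join` call above matches rows by neighborhood name and keeps every row of the left data frame. A small sketch with made-up data shows the same left join written with `merge`, plus a check that no neighborhood was left without a cluster label. All names and values in the toy frames are hypothetical.

```python
import pandas as pd

# Hypothetical coordinates table (plays the role of TONs)
toy_coords = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Neighborhood Latitude": [43.65, 43.77],
    "Neighborhood Longitude": [-79.38, -79.38],
})

# Hypothetical venues table with cluster labels already inserted
toy_venues = pd.DataFrame({
    "Cluster Labels": [1, 0],
    "Neighborhood": ["A", "B"],
    "1st Most Common Venue": ["Hospital", "Hospital"],
})

# Left join on the neighborhood name: every row of toy_coords is kept
toy_merged = toy_coords.merge(toy_venues, on="Neighborhood", how="left")

# Rows without a match would show up as missing cluster labels
unmatched = int(toy_merged["Cluster Labels"].isna().sum())
print(unmatched)
```

If `unmatched` were greater than 0, some neighborhood names would differ between the two data frames, and those rows would need fixing before mapping.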


<hr>

Lastly, a map is generated with the different clusters marked with different color markers.

In [36]:
map_clustersTO = folium.Map(location=[43.7098517, -79.4042948], zoom_start=11)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Neighborhood Latitude'], toronto_merged['Neighborhood Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clustersTO)
       
map_clustersTO

This map only shows the neighborhoods that have at least 1 hospital within 1 kilometer. However, this time, every neighborhood is marked with a color that depends on the cluster label it was assigned.<br>
The neighborhoods in red are generally the ones that are more spread out, while the ones in purple and blue are closer together. Lastly, the neighborhoods in green and orange appear to be outliers.
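
Which color belongs to which cluster label follows from the `rainbow[cluster-1]` indexing in the map code: label 0 picks the last color in the list, and labels 1 to 4 pick the first four. The sketch below rebuilds the same kind of color list (using `kclusters = 5` as above) and prints the mapping; the dictionary name `label_to_color` is introduced only for illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# Five evenly spaced colors from the rainbow colormap, as hex strings
rainbow = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Same indexing as the map code: label 0 wraps around to the last color
label_to_color = {label: rainbow[label - 1] for label in range(kclusters)}

for label, hex_color in label_to_color.items():
    print(label, hex_color)
```

Because of the wrap-around, the colors do not follow the label order, which is worth keeping in mind when matching the map to the cluster tables.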
<hr>

#### 5.2.2: Clustering and mapping New York

I create another data frame <code>NYNs</code> (which stands for ***N***ew***Y***ork***N***eighborhood***s***), which collects all the unique neighborhoods in New York so they can be plotted on the map after being clustered.

In [37]:
NYNs = newyork_hospitals[["Neighborhood", "Neighborhood Latitude", "Neighborhood Longitude"]]
NYNslist = []
for index, neighborhood in enumerate(NYNs["Neighborhood"]):
    if (neighborhood not in NYNslist):
        NYNslist.append(neighborhood)
    else:
        NYNs.drop(index, inplace=True)
        
NYNs.sort_values("Neighborhood", inplace=True)
NYNs.reset_index(drop=True, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':


In [38]:
NYNs

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude
0,Allerton,40.865788,-73.859319
1,Arlington,40.635325,-74.165104
2,Arverne,40.589144,-73.791992
3,Astoria,40.768509,-73.915654
4,Auburndale,40.761730,-73.791762
...,...,...,...
227,Wingate,40.660947,-73.937187
228,Woodhaven,40.689887,-73.858110
229,Woodlawn,40.898273,-73.867315
230,Woodside,40.746349,-73.901842


Afterwards, I added a column with each neighborhood's cluster label to the <code>NYneighborhoods_venues_sorted</code> data frame and merged it with the <code>NYNs</code> data frame that was just generated. Having the labels and coordinates in one data frame helped when mapping.

In [39]:
NYneighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeansUS.labels_)

newyork_merged = NYNs

newyork_merged = newyork_merged.join(NYneighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

newyork_merged.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Allerton,40.865788,-73.859319,0,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room
1,Arlington,40.635325,-74.165104,0,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room
2,Arverne,40.589144,-73.791992,0,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room
3,Astoria,40.768509,-73.915654,0,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room
4,Auburndale,40.76173,-73.791762,0,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room


<hr>

The last map:

In [41]:
map_clustersNY = folium.Map(location=[USlatitude, USlongitude], zoom_start=11)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(newyork_merged['Neighborhood Latitude'], newyork_merged['Neighborhood Longitude'], newyork_merged['Neighborhood'], newyork_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clustersNY)
       
map_clustersNY

Again, this map only shows the neighborhoods that have at least 1 hospital within 1 kilometer. However, this time, every neighborhood is marked with a color that depends on the cluster label it was assigned.<br>
The New York map seems to follow the same pattern as the Toronto map: the red neighborhoods are the most abundant and spread out, while the blue ones are closer together. The orange and green neighborhoods still appear to be outliers, and the only difference appears to be the purple neighborhoods, which are slightly more spread out than in Toronto.
<hr>

## Part 6: Individual cluster analysis

The following code cells display the data frames for the individual clusters of neighborhoods. With this it's possible to observe patterns.
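
Each cell below uses the same selection: rows are filtered by cluster label, and `columns[[0] + list(range(3, n))]` keeps the first column (the neighborhood name) and everything from the fourth column onward, which skips the latitude and longitude. A minimal sketch on a made-up frame (all values hypothetical) wraps this in a small helper, `show_cluster`, that is not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for toronto_merged / newyork_merged
toy_merged = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Neighborhood Latitude": [43.65, 43.77, 43.64],
    "Neighborhood Longitude": [-79.38, -79.38, -79.56],
    "Cluster Labels": [1, 0, 0],
    "1st Most Common Venue": ["Hospital", "Hospital", "Hospital"],
})

def show_cluster(merged, label):
    """Return the rows of one cluster without the coordinate columns."""
    # Position 0 is the name; positions 1 and 2 (latitude, longitude) are skipped
    keep = [0] + list(range(3, merged.shape[1]))
    return merged.loc[merged["Cluster Labels"] == label, merged.columns[keep]]

cluster_0 = show_cluster(toy_merged, 0)
print(cluster_0)
```

Note that the headings below count clusters from 1, while the labels in the data start at 0, so "Cluster 1" corresponds to label 0.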

### Part 6.1: Toronto clusters

#### Cluster 1

In [107]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[0] + list(range(3, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
1,Bayview Village,0,Hospital,Medical Center,Hospital Ward,Emergency Room
2,Berczy Park,0,Hospital,Medical Center,Hospital Ward,Emergency Room
3,Bloordale Gardens,0,Hospital,Medical Center,Hospital Ward,Emergency Room
4,CN Tower,0,Hospital,Medical Center,Hospital Ward,Emergency Room
8,Christie,0,Hospital,Medical Center,Hospital Ward,Emergency Room
...,...,...,...,...,...,...
88,Wexford Heights,0,Hospital,Medical Center,Hospital Ward,Emergency Room
89,Willowdale,0,Hospital,Medical Center,Hospital Ward,Emergency Room
90,Willowdale South,0,Hospital,Medical Center,Hospital Ward,Emergency Room
91,Willowdale West,0,Hospital,Medical Center,Hospital Ward,Emergency Room


#### Cluster 2

In [108]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[0] + list(range(3, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
0,Adelaide,1,Hospital,Hospital Ward,Medical Center,Emergency Room
6,Central Bay Street,1,Hospital,Hospital Ward,Medical Center,Emergency Room
7,Chinatown,1,Hospital,Hospital Ward,Medical Center,Emergency Room
22,Fairview,1,Hospital,Hospital Ward,Medical Center,Emergency Room
23,First Canadian Place,1,Hospital,Medical Center,Hospital Ward,Emergency Room
24,Garden District,1,Hospital,Hospital Ward,Medical Center,Emergency Room
25,Grange Park,1,Hospital,Hospital Ward,Medical Center,Emergency Room
36,Jamestown,1,Hospital,Hospital Ward,Medical Center,Emergency Room
38,Kensington Market,1,Hospital,Hospital Ward,Medical Center,Emergency Room
48,Mount Olive,1,Hospital,Hospital Ward,Medical Center,Emergency Room


#### Cluster 3

In [109]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[0] + list(range(3, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
5,Cabbagetown,2,Hospital,Hospital Ward,Medical Center,Emergency Room
9,Church and Wellesley,2,Hospital,Hospital Ward,Medical Center,Emergency Room
31,Henry Farm,2,Hospital,Hospital Ward,Medical Center,Emergency Room
53,Oriole,2,Hospital,Hospital Ward,Medical Center,Emergency Room
64,Silver Hills,2,Hospital,Hospital Ward,Medical Center,Emergency Room
67,South of Bloor,2,Hospital,Hospital Ward,Medical Center,Emergency Room
68,St. James Town,2,Hospital,Hospital Ward,Medical Center,Emergency Room
80,Underground city,2,Hospital,Hospital Ward,Medical Center,Emergency Room
84,Victoria Hotel,2,Hospital,Hospital Ward,Medical Center,Emergency Room
93,Yorkville,2,Hospital,Hospital Ward,Medical Center,Emergency Room


#### Cluster 4

In [110]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[0] + list(range(3, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
60,Rosedale,3,Hospital,Hospital Ward,Medical Center,Emergency Room


#### Cluster 5

In [111]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[0] + list(range(3, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
47,Morningside,4,Hospital,Emergency Room,Medical Center,Hospital Ward


<hr>

### Part 6.2: New York clusters

#### Cluster 1

In [112]:
newyork_merged.loc[newyork_merged['Cluster Labels'] == 0, newyork_merged.columns[[0] + list(range(3, newyork_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Allerton,0,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room
1,Arlington,0,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room
2,Arverne,0,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room
3,Astoria,0,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room
4,Auburndale,0,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room
...,...,...,...,...,...,...,...
224,Williamsburg,0,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room
225,Willowbrook,0,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room
227,Wingate,0,Hospital,Hospital Ward,Urgent Care Center,Medical Center,Emergency Room
228,Woodhaven,0,Hospital,Urgent Care Center,Medical Center,Hospital Ward,Emergency Room


#### Cluster 2

In [113]:
newyork_merged.loc[newyork_merged['Cluster Labels'] == 1, newyork_merged.columns[[0] + list(range(3, newyork_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
86,Georgetown,1,Hospital,Hospital Ward,Urgent Care Center,Medical Center,Emergency Room
94,Gravesend,1,Hospital,Hospital Ward,Urgent Care Center,Medical Center,Emergency Room
96,Greenpoint,1,Hospital,Hospital Ward,Urgent Care Center,Medical Center,Emergency Room
105,Homecrest,1,Hospital Ward,Hospital,Urgent Care Center,Medical Center,Emergency Room
113,Kensington,1,Hospital,Hospital Ward,Urgent Care Center,Medical Center,Emergency Room
133,Marine Park,1,Hospital Ward,Hospital,Urgent Care Center,Medical Center,Emergency Room
140,Mill Basin,1,Hospital Ward,Hospital,Urgent Care Center,Medical Center,Emergency Room
219,West Farms,1,Hospital Ward,Hospital,Urgent Care Center,Medical Center,Emergency Room


#### Cluster 3

In [114]:
newyork_merged.loc[newyork_merged['Cluster Labels'] == 2, newyork_merged.columns[[0] + list(range(3, newyork_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
5,Bath Beach,2,Hospital,Medical Center,Urgent Care Center,Hospital Ward,Emergency Room
6,Battery Park City,2,Hospital,Hospital Ward,Urgent Care Center,Medical Center,Emergency Room
15,Belmont,2,Hospital,Hospital Ward,Urgent Care Center,Medical Center,Emergency Room
18,Boerum Hill,2,Hospital,Hospital Ward,Urgent Care Center,Medical Center,Emergency Room
19,Borough Park,2,Hospital,Hospital Ward,Urgent Care Center,Medical Center,Emergency Room
20,Briarwood,2,Hospital,Hospital Ward,Urgent Care Center,Medical Center,Emergency Room
21,Brighton Beach,2,Hospital,Hospital Ward,Urgent Care Center,Medical Center,Emergency Room
24,Brooklyn Heights,2,Hospital,Hospital Ward,Urgent Care Center,Medical Center,Emergency Room
27,Bushwick,2,Hospital,Medical Center,Urgent Care Center,Hospital Ward,Emergency Room
31,Carroll Gardens,2,Hospital,Hospital Ward,Urgent Care Center,Medical Center,Emergency Room


#### Cluster 4

In [115]:
newyork_merged.loc[newyork_merged['Cluster Labels'] == 3, newyork_merged.columns[[0] + list(range(3, newyork_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
29,Canarsie,3,Medical Center,Urgent Care Center,Hospital Ward,Hospital,Emergency Room


#### Cluster 5

In [116]:
newyork_merged.loc[newyork_merged['Cluster Labels'] == 4, newyork_merged.columns[[0] + list(range(3, newyork_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
58,East Flatbush,4,Hospital Ward,Urgent Care Center,Medical Center,Hospital,Emergency Room


<hr>
<hr>

## Final observations

- Overall, Toronto has fewer neighborhoods and hospital-related venues than New York
- New York has Urgent Care Centers within 1 kilometer of some of its neighborhoods, while Toronto does not
- New York's neighborhoods are closer together than Toronto's
- Looking at the clusters, for the majority of neighborhoods in both Toronto and New York, the most common venues in order are: Hospitals, Urgent Care Centers (New York only), Medical Centers, Hospital Wards, and Emergency Rooms
- Overall, it would seem that New York has more hospital-related venues within 1 kilometer of its neighborhoods, which would suggest that it is better equipped to handle health care
  - Please note that this implication is based solely on the data compiled in this notebook, and does not take into account other measures such as population, medical insurance policies, or quality of health care; for this reason, the implication that New York is more prepared than Toronto regarding health care may not be completely accurate
    - The purpose of this project was simply to visually compare the two cities based on how many hospitals their neighborhoods have nearby, and to see which venues were more common

<hr>

This is the end of the Notebook.