<a href="https://colab.research.google.com/github/Starsa/Battle_of_the_Neighborhoods_NYC/blob/main/CourseraCapstone_BattleoftheNeighborhoods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Comparison of neighbourhoods in Toronto and New York**
###**Table of contents**
*   Introduction: Business Problem
*   Data
*   Methodology
*   Analysis
*   Results
*   Conclusion

#### **Introduction: Business Problem**
New York City and Toronto are both financial capitals of USA and Canada respectively. Both cities are located on the east coast of North America. Toronto has population of 2.7 million people while New York 8.2 million. Both cities are most populous in their respective countries.

In this project we aim to investigate the similarities and differences between the neighbourhoods of both cities based on the venues that are present in them.

#### **Data** 
Toronto data for postal codes for neighbourhoods was obtained from wikipedia pages of postal codes of Canada [here](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').

New York has a total of 5 boroughs and 306 neighborhoods. Data was available [here]( https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json) as part of the coursera course.

The data on the venues for both cities was obtained using Foursquare API.

#### **Methodoogy**
Data was preprocessed to produce dataframes of each neighbourhoods with corresponding top 10 venues in that neighbourhood. K means cluster analysis was then done for individual cities and on the combined dataset with neighbourhoods of both cities.



### **Analysis** 
Installing necessary packages and begining data preprocessing

In [1]:
!pip install wget



In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

# !conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# Install website scraping libraries and packages in Python from BeautifulSoup 
#!conda install -c conda-forge beautifulsoup4 --yes  # uncomment this line if you haven't completed 
from bs4 import BeautifulSoup

import wget

import matplotlib.pyplot as plt
print('Libraries imported.')

Libraries imported.


In [3]:
#getting the URL for postal codes for each neighbourhour in toronto
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
#print(source)
soup  = BeautifulSoup(source,'lxml')

In [4]:
# following course instruction on creating the dataframe with Postal codes, Boroughs and Neighbourhoods 
# by extracting from soup object

table_contents=[]
table=soup.find('table')
for row in table.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

# print(table_contents)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

In [5]:
print("The shape is: ", df.shape)
df.head()

The shape is:  (103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


In [6]:
lat_long = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Geospatial_Coordinates.csv")
lat_long.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [7]:
lat_long = lat_long.rename(columns={"Postal Code": "PostalCode"})

In [8]:
df = pd.merge(df, lat_long, on='PostalCode')
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


##### **Create Map of Toronto Neighborhoods**

*We will get the coordinates* 

In [9]:
address = 'Toronto, TOR'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto are 43.7370584, -79.2442535.


In [10]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

**Access Foursquare with credintials**

In [11]:
#Load FourSquare Credentials
CLIENT_ID = '5JK34TYV4CICAX50KGSELMTXPS1BEFZT11RREKVZUNOWTWAD' # your Foursquare ID
CLIENT_SECRET = 'PXBWLCT5FHKQJMJR5BBMB2YP1SI5XRSEWDBHIDAGRGUYGIWY' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100


Create category function for later use

In [12]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Define function and get all venues in the neighborhoods we selected

In [13]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]["groups"][0]["items"]
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [14]:
toronto_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Parkwoods


KeyError: ignored

In [None]:
print(toronto_venues.shape)
toronto_venues.head()

In [None]:
grouped_tor=toronto_venues.groupby('Neighborhood').count()
grouped_tor

In [None]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))


In [None]:
counts, bins = np.histogram(grouped_tor)
plt.hist(bins[:-1], bins, weights=counts)

A majority of Neighborhoods have les than 20 venues

In [None]:
import seaborn as sns

ax = sns.countplot(x="Neighborhood", data=toronto_venues)

#### **Analyze Each Neighborhood**

In [None]:
# one hot encoding 
tor_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
tor_onehot.head()

# add neighborhood column back to dataframe
tor_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 
tor_onehot.head()

#as it didnt go to the last spot i remove it and add it back at first location
col_name='Neighborhood'
first_col = tor_onehot.pop(col_name)
tor_onehot.insert(0, col_name, first_col)
tor_onehot.head()

In [None]:
tor_onehot.shape

**Group by neighborhood and mean occurance for category**

In [None]:
tor_grouped = tor_onehot.groupby('Neighborhood').mean().reset_index()
tor_grouped

##### **Find top 5 venues in each neighbourhood**

In [None]:
num_top_venues = 5

for hood in tor_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = tor_grouped[tor_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

**Create a Dataframe from results**

We will also include a function to sort the venues

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
#Now let's create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = tor_grouped['Neighborhood']

for ind in np.arange(tor_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(tor_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()


#### **Cluster Neighborhoods**

In [None]:
# set number of clusters
kclusters = 10

tor_grouped_clustering = tor_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=42).fit(tor_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Create a dataframe with clusters and top 10 venues for each neighborhood

In [None]:
tor_merged = toronto_venues

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
tor_merged = tor_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

tor_merged.head() # check the last columns! - its adding in alphabetic order column number 5

In [None]:
#rename column names
tor_merged.rename(columns={'Neighborhood Latitude': 'Latitude', 'Neighborhood Longitude': 'Longitude'}, inplace = True)

#### **Visualize Clusters with a Map**

Lets Visualize the Clusters before we map them

In [None]:
sns.countplot("Cluster Labels", data = tor_merged)

Now let's look at our clusters on a map

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(tor_merged['Latitude'], tor_merged['Longitude'], tor_merged['Neighborhood'], tor_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### **New York Analysis**



#### **Loading and preparing Data**

##### **Loading Neighboorhood Data for New York City**

This data is taken from IBM Coursera course and contains 5 boroughs and 306 neighborhoods. We will need a dataset that contains the 5 boroughs and their respecgtive neighborhoods as well as lang and lat coordinates.



In [None]:
#download data from course
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json
print('Data downloaded!')

In [None]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [None]:
neighborhoods_data = newyork_data['features']

**Transforming JSON into Dataframe**

In [None]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                                  'Neighborhood': neighborhood_name,
                                                  'Latitude': neighborhood_lat,
                                                  'Longitude': neighborhood_lon}, ignore_index=True)

neighborhoods.head()

In [None]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)


**Using geopy for NYC coordinates**

In [None]:
address = 'New York City, NY'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))

**Creating A Map of NYC with Neighborhoods**

In [None]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10, 
                         min_zoom=9, max_zoom=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], 
                                           neighborhoods['Longitude'], 
                                           neighborhoods['Borough'], 
                                           neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='#333333',
        fill=True,
        fill_color='#ffb300',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

**Working with FourSquare Data for NYC Venues**

In [None]:
#call function from before for "getNearbyVenues"
ny_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

In [None]:
print(nyc_venues.shape)
nyc_venues.head()

In [None]:
grouped_neigh = nyc_venues.groupby('Neighborhood').count().head(20)
grouped_neigh

**Let's find out the number of venue categories**

In [None]:
print('There are {} uniques categories.'.format(len(nyc_venues['Venue Category'].unique())))

In [None]:
counts, bins = np.histogram(grouped_neighbor)
plt.hist(bins[:-1], bins, weights=counts)

**Encoding**

In [None]:
# one hot encoding
nyc_onehot = pd.get_dummies(nyc_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
nyc_onehot['Neighborhood'] = nyc_venues['Neighborhood'] 

# move neighborhood column to the first column
col_name='Neighborhood'
first_col = ny_onehot.pop(col_name)
ny_onehot.insert(0, col_name, first_col)
ny_onehot.head()

In [None]:
nyc_onehot.shape

**Grouping by neighborhood** 

In [None]:
nyc_grouped = nyc_onehot.groupby('Neighborhood').mean().reset_index()
print(nyc_grouped.shape)
nyc_grouped.head()

**Identifying the most common catergories**

Call function we wrote previously

In [None]:
#Now let's create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
nyc_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
nyc_neighborhoods_venues_sorted['Neighborhood'] = nyc_grouped['Neighborhood']

for ind in np.arange(nyc_grouped.shape[0]):
    nyc_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(nyc_grouped.iloc[ind, :], num_top_venues)

nyc_neighborhoods_venues_sorted.head()


##### **Clustering**

Now we apply k-means clustering in the data.

In [None]:
# set number of clusters
kclusters = 10

nyc_grouped_clustering = nyc_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=42).fit(nyc_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

create datafram that includes cluster labels

In [None]:
# add clustering labels
nyc_neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [None]:
nyneighborhoods_venues_sorted.head()

In [None]:
nyc_merged = neighborhoods

# merge ny data to add latitude/longitude for each neighborhood
nyc_merged = ny_merged.join(nyc_neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

nyc_merged.head()

**Let's visualize our new clusters**

Lets visualize the clusters in a countplot before we map them

In [None]:
sns.countplot(x="Cluster Labels", data=nyc_merged)

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(nyc_merged['Latitude'], nyc_merged['Longitude'], nyc_merged['Neighborhood'], nyc_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters


### **KMeans Analysis for all data**

First we must merge both cleaned datasets and create a new column for "City" meaning NYC or Toronto

In [None]:
nyc_venues['City'] = 'NY'
toronto_venues['City'] = 'TOR'

all_venues = toronto_venues.append(nyc_venues)

print("The shape of our combined dataset is ", all_venues.shape)
all_venues.head()

In [None]:
#get dummy variables for our categories
onehot_all = pd.get_dummies(all_venues[['Venue Category']], prefix ="", prefix_sep="")

#add string data back to df
onehot_all['Neighborhood'] = all_venues['Neighborhood']
onehot_all['City'] = all_venues['City']


#like we did before, remove  and add it back at first location
#col_name='Neighborhood'
#first_col = onehot_all.pop(col_name)
#onehot_all.insert(0, col_name, first_col)
#onehot_all.head()

#Put city in front too
#col_name='City'
#second_col = onehot_all.pop(col_name)
#onehot_all.insert(1, col_name, second_col)
#onehot_all.head()
print("The shape of our encoded dataset is ", onehot_all.shape)
onehot_all.head()

#### Next, let's group rows by neighborhood and  mean of the frequency of occurrence of each category


In [None]:
grouped = all_onehot.groupby('Neighborhood').mean().reset_index()
grouped


grouped.shape

#### **Now we will create a new DataFrame containing top 10 venues for each neighborhood**

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Neighborhood'] = grouped['Neighborhood']


for ind in np.arange(grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)

venues_sorted.head()

#### **Cluster Neighborhoods**
Lets aim for 10 clusters

In [None]:
# set number of clusters
kclusters = 10

grouped_clustering = grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

Create a new Df with clusters included

In [None]:
venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

venues_sorted.head()

Now we will prep our data for another merge

In [None]:
venues_sorted.head()
formerge = venues.loc[:, ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'City']] 
formerge.head()

In [None]:
# merge data to add latitude/longitude for each neighborhood
merged = venues_sorted.join(formerge.set_index('Neighborhood'), on='Neighborhood')

merged.head() # check the last columns!

In [None]:
merged.rename(columns={'Neighborhood Latitude': 'Latitude', 'Neighborhood Longitude': 'Longitude'}, inplace = True)

#### **Cluster Comparison**

Lets Visualize count of clusters first then we will look at our map

In [None]:
sns.countplot(x="Cluster", hue="City", data=merged)

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(merged['Latitude'], merged['Longitude'], merged['Neighborhood'], merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## ***Results***

In this study we investigated neighborhoods of Toronto and New York. We used k means clustering method to make sense of the data. The majority of the neighborhoods seem to fall into busy district type of neighborhoods. There is also clearly more parks in Toronto than in NY. Also it seems that there are beaches in NY and not in Toronto. That could be because its colder in Toronto compared to New York.

## **Conclusion**

There are a lot of similarities between the two cities, Toronto & New York,  but each remain unique.