# Capstone Project - The Battle of the Neighborhoods (Week 1)

## Table of contents
* [Introduction: Problem Description](#introduction)
* [Data Description](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)


### Introduction: Problem Description
<a id="introduction"></a>

Two of the most important cities in the world are New York City and the City of Toronto. Every year, millions of tourists visit these cities for business and pleasure.

The main reasons tourists visit these cities are:

* Discovering different neighborhoods
* Standing in awe of the skyscrapers
* Enjoying international and experimental food
* Visiting world-renowned art galleries

People from different parts of the world come to explore places, taste familiar and unfamiliar cuisines, enjoy popular sites, and much more. A visitor who has to choose between the two cities will want to compare them against their own likes and dislikes.

A comparison of the venues in the two cities will help people decide where to visit. A data analysis of New York City and Toronto that gives a picture of the most sought-after venues serves this purpose.

### Data Description
<a id="data"></a>

To compare venues of interest between the two cities (New York and Toronto), we need reliable datasets for both.

The following data sources are used to extract or generate the required information:

* New York City data is obtained from a JSON file provided by the IBM Developer Skills Network
* City of Toronto postal codes are obtained from a Wikipedia page
* City of Toronto coordinates are obtained from a CSV file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data
* The number of restaurants, with their type and location, in every neighborhood is obtained using the Foursquare API

Communication with the Foursquare database goes through its RESTful API. A uniform resource identifier (URI) is created, and extra parameters are appended depending on the data we are seeking. Every request starts from the base URI, api.foursquare.com/v2, after which you can request data about venues, users, or tips.
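
As a rough illustration of how such a request URI is assembled, the sketch below appends query parameters to the `venues/explore` endpoint used later in this notebook. The helper name `build_explore_url`, the credentials, and the coordinates are placeholders chosen for this example, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base URI of the Foursquare v2 API, as described above
BASE_URI = "https://api.foursquare.com/v2"


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Return a venues/explore URL with the query parameters appended (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",  # latitude and longitude as one comma-separated value
        "radius": radius,      # search radius in meters
        "limit": limit,        # maximum number of venues requested
    }
    # urlencode percent-encodes the values and joins them with '&'
    return f"{BASE_URI}/venues/explore?{urlencode(params)}"


# Placeholder credentials and coordinates, for illustration only
example_url = build_explore_url("MY_ID", "MY_SECRET", "20180605", 43.65, -79.38)
print(example_url)
```

Building the query string with `urlencode` rather than string formatting takes care of escaping, so the comma inside `ll` arrives as `%2C`.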

For New York City, we will use the coordinates of Manhattan to conduct the search.
For the City of Toronto, the coordinates of Toronto will be used.



In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas.io.json.json_normalize is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes  
#uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries imported.


#### City of Toronto Data Preparation

A dataframe will be created from a Wikipedia page. It will consist of three columns: PostalCode, Borough, and Neighbourhood.

In [2]:
url='https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969'

In [3]:
df = pd.read_html(url)[0]
df.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [4]:
filt = (df['Borough'] !='Not assigned')
df=df[filt]
df.rename(columns={'Postal Code':'PostalCode'}, inplace =True)
df.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [5]:
df.reset_index(drop=True, inplace=True)
df.head()

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [6]:
df_postal = df.groupby(['PostalCode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
df_postal.set_index('PostalCode', inplace = True)
df_postal.head()

| PostalCode | Borough | Neighbourhood |
|---|---|---|
| M1B | Scarborough | Malvern, Rouge |
| M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| M1E | Scarborough | Guildwood, Morningside, West Hill |
| M1G | Scarborough | Woburn |
| M1H | Scarborough | Cedarbrae |


Create a dataframe with latitude and longitude from the CSV file that has the geographical coordinates of each postal code.

In [7]:
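# Geospatial_Coordinates.csv must be in the working directory; download it from http://cocl.us/Geospatial_data first
df_latlong = pd.read_csv('Geospatial_Coordinates.csv')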
df_latlong.rename(columns={'Postal Code':'PostalCode'}, inplace=True)
df_latlong.set_index('PostalCode', inplace = True)
df_latlong.head()

FileNotFoundError: [Errno 2] No such file or directory: 'Geospatial_Coordinates.csv'

Join the two dataframes to create one dataframe with PostalCode, Borough, Neighbourhood, Latitude, and Longitude.

In [None]:
toronto_data = df_postal.join(df_latlong)
toronto_data.reset_index(inplace=True)
toronto_data.head()

In [None]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(toronto_data['Borough'].unique()),
        toronto_data.shape[0]
    )
)

##### Let's get the geographical coordinates of Toronto

In [None]:
address = 'Toronto, ON'
geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [None]:
import os

# Credentials are read from environment variables (the variable names are our choice) instead of being hard-coded
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Credentials loaded for CLIENT_ID: ' + CLIENT_ID)

### 2. Explore Neighbourhoods in Toronto

Let's create a function to get the venues for all the neighborhoods in Toronto.

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
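
To make the nested indexing in `getNearbyVenues` easier to follow, here is a small sketch that walks a hand-made dictionary shaped like the part of the response the function reads (`response` → `groups[0]` → `items` → `venue`). The sample data and the helper name `extract_venues` are invented for illustration, and the guard for an empty `categories` list is an added assumption; the function above indexes `categories[0]` directly.

```python
# Hand-made stand-in for requests.get(url).json(); real responses carry many more fields
fake_response = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Example Cafe",
                               "location": {"lat": 43.651, "lng": -79.381},
                               "categories": [{"name": "Coffee Shop"}]}},
                    {"venue": {"name": "Example Park",
                               "location": {"lat": 43.652, "lng": -79.382},
                               "categories": []}},
                ]
            }
        ]
    }
}


def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload (hypothetical helper)."""
    items = payload["response"]["groups"][0]["items"]
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Assumption: fall back to None when a venue has no category instead of raising IndexError
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows


venue_rows = extract_venues(fake_response)
for row in venue_rows:
    print(row)
```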

#### The following code runs the above function on each neighborhood and creates a new dataframe called toronto_venues.

In [None]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighbourhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

##### Let's check the size of the resulting dataframe

In [None]:
print(toronto_venues.shape)
toronto_venues.head()

In [None]:
len(toronto_venues['Venue'].unique())

In [None]:
len(toronto_venues['Venue Category'].unique())

In [None]:
print('Total number of venues in Toronto :{}'.format(toronto_venues.shape[0]))
print('Total number of unique venues in Toronto:{}'.format(len(toronto_venues['Venue'].unique())))
print('Total number of venue categories in Toronto :{}'.format(len(toronto_venues['Venue Category'].unique())))

#### Let's create a dataframe with Toronto venues

In [None]:
toronto_df = pd.DataFrame({'No. of Venue':[toronto_venues.shape[0]],
                          'No. of Unique Venue':[len(toronto_venues['Venue'].unique())],
                          'No. of Unique Venue Category':[len(toronto_venues['Venue Category'].unique())]})
toronto_df = toronto_df.T
toronto_df.columns =['Toronto']
toronto_df

In [None]:
toronto_df.reset_index(inplace=True)

In [None]:
toronto_df.columns = ['Items', 'Toronto']
toronto_df

##### Let's get the frequency of every venue category, that is, the total number of venues in each category in Toronto

In [None]:
toronto_venue_cat = toronto_venues.groupby('Venue Category')['Neighbourhood'].count().sort_values(ascending=False).to_frame()

In [None]:
toronto_venue_cat.reset_index(inplace = True)

In [None]:
toronto_venue_cat.head()

In [None]:
toronto_venue_cat = toronto_venue_cat.rename(columns={'Venue Category':'Venue Category', 'Neighbourhood':'Toronto'})
toronto_venue_cat

In [None]:
toronto_venue_cat.shape

##### Let's get the total number of restaurants in Toronto

In [None]:
filt_1 = toronto_venue_cat['Venue Category'].str.contains('Restaurant')
toronto_restaurants = toronto_venue_cat[filt_1]
#toronto_restaurants = toronto_restaurants.rename(columns = {'Venue Category':'Venue Category', 'Neighbourhood':'No. of Restaurant'})
toronto_restaurants.set_index('Venue Category', inplace=True)
#toronto_restaurants.reset_index(inplace= True)

In [None]:
toronto_restaurants

In [None]:
# Sum the per-category counts; .sum() avoids position-based lookups on a label-indexed Series
Number_of_Toronto_restaurants = int(toronto_restaurants['Toronto'].sum())

print('Total number of restaurants in Toronto is :', Number_of_Toronto_restaurants)

In [None]:
#toronto_restaurants.to_excel('toronto_restaurant.xlsx')

##### Let's check how many venues were returned for each neighborhood

In [None]:
toronto_venues.groupby('Neighbourhood').count()

##### Let's find out how many unique categories can be curated from all the returned venues

In [None]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

### 3. Analyze Each Neighborhood

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighbourhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

##### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped
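
The mean of a one-hot column within a neighbourhood is simply the share of that neighbourhood's venues that fall in the category. The sketch below reproduces that idea with the standard library on a few made-up rows; it is not the pandas pipeline itself, and the neighbourhood and category values are invented sample data.

```python
from collections import Counter, defaultdict

# Made-up (neighbourhood, venue category) pairs standing in for toronto_venues
venues = [
    ("Parkwoods", "Park"),
    ("Parkwoods", "Coffee Shop"),
    ("Parkwoods", "Park"),
    ("Parkwoods", "Bakery"),
    ("Woburn", "Coffee Shop"),
    ("Woburn", "Coffee Shop"),
]

# Count venues per category within each neighbourhood
counts = defaultdict(Counter)
for hood, category in venues:
    counts[hood][category] += 1

# Divide by the neighbourhood total to get the relative frequency of each category
frequencies = {
    hood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
    for hood, counter in counts.items()
}

for hood, shares in frequencies.items():
    print(hood, shares)
```

One difference from the pandas version: `groupby(...).mean()` also reports 0.0 for categories a neighbourhood lacks, whereas this sketch only lists the categories that occur.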

##### Confirm the new size

In [None]:
toronto_grouped.shape

##### Let's print each neighborhood along with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

##### Let's put that into a pandas dataframe

First, let's write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()
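
The column headers are built by pairing each rank with an ordinal suffix, falling back to "th" once the short suffix list runs out. A minimal sketch of the same labelling logic, written with an explicit length check instead of exception handling (the function name `venue_column_labels` is ours):

```python
def venue_column_labels(num_top_venues):
    """Build '1st Most Common Venue', '2nd Most Common Venue', ... labels (hypothetical helper)."""
    indicators = ['st', 'nd', 'rd']
    labels = []
    for ind in range(num_top_venues):
        # Ranks 1-3 take their own suffix; every later rank falls back to 'th'
        suffix = indicators[ind] if ind < len(indicators) else 'th'
        labels.append('{}{} Most Common Venue'.format(ind + 1, suffix))
    return labels


labels = venue_column_labels(5)
print(labels)
```

This simple rule is only correct up to rank 20; rank 21 would come out as "21th", which does not matter for the ten columns used here.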

In [None]:
neighbourhoods_venues_sorted.shape

In [None]:
len(neighbourhoods_venues_sorted['1st Most Common Venue'].unique())

### 4. Cluster Neighbourhoods

We analyze the clusters of the City of Toronto and Manhattan by the most common venues in each cluster.

In [None]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

##### Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood of Toronto.

In [None]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# merge neighbourhoods_venues_sorted with toronto_data to add latitude/longitude for each neighbourhood
toronto_merged = toronto_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood', how = 'right')

toronto_merged.head() # check the last columns!

##### Finally, let's visualize the resulting clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [None]:
Toronto_Cluster_1 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
Toronto_Cluster_1

In [None]:
Toronto_Cluster_2 =toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
Toronto_Cluster_2

In [None]:
Toronto_Cluster_3 = toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
Toronto_Cluster_3

In [None]:
Toronto_Cluster_4=toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
Toronto_Cluster_4

In [None]:
Toronto_Cluster_5=toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]
Toronto_Cluster_5

In [None]:
type(Toronto_Cluster_5['1st Most Common Venue'].value_counts())

### New York City data preparation
Data is extracted from the following URL, stored in the variable newyorkurl.

In [None]:
import urllib.request

In [None]:
newyorkurl = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json'

In [None]:
response = urllib.request.urlopen(newyorkurl)
content = response.read()
newyork_data = json.loads(content.decode("utf8"))
print(newyork_data)


Notice how all the relevant data is in the _features_ key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.

In [None]:
neighborhoods_data = newyork_data['features']

Let's look at the first item in this list.

In [None]:
neighborhoods_data[0]
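
Each entry in the features list is read through its `properties` (borough and name) and its `geometry` (coordinates stored as longitude first, then latitude). The sketch below shows that access pattern on a hand-written feature; the sample values and the helper name `feature_to_row` are made up, and only the keys used later in this notebook are included.

```python
# Hand-written feature with only the keys this notebook reads; the values are invented
sample_feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-73.99, 40.75]},
    "properties": {"name": "Example Neighborhood", "borough": "Manhattan"},
}


def feature_to_row(feature):
    """Flatten one feature into the four columns of the dataframe (hypothetical helper)."""
    # The coordinates are ordered [longitude, latitude]
    lon, lat = feature["geometry"]["coordinates"][:2]
    props = feature["properties"]
    return {
        "Borough": props["borough"],
        "Neighborhood": props["name"],
        "Latitude": lat,
        "Longitude": lon,
    }


row = feature_to_row(sample_feature)
print(row)
```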

##### Transform the data into a pandas dataframe

In [None]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [None]:
neighborhoods

In [None]:
# collect one row per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

In [None]:
neighborhoods.head()

In [None]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

##### Let's slice the original dataframe and create a new dataframe of the Manhattan data.

In [None]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent ny_explorer, as shown below.

In [None]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

In [None]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

### 2. Explore Neighborhoods in Manhattan

Let's create a function to get the venues for all the neighborhoods in Manhattan.

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

##### Now write the code to run the above function on each neighborhood and create a new dataframe called _manhattan_venues_.

In [None]:
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

**Let's get the number of unique venues in Manhattan**

In [None]:
print('The total number of venues in manhattan is {}'.format(manhattan_venues.shape[0]))
manhattan_venues.head()

In [None]:
print('The number of unique venues in manhattan is {}'.format(len(manhattan_venues['Venue'].unique())))

**Let's get the number of unique venue categories in Manhattan**

In [None]:
print('The number of unique venue categories in manhattan is {}'.format(len(manhattan_venues['Venue Category'].unique())))

In [None]:
manhattan_venues.groupby('Venue Category').count()

**Let's create a dataframe with no. of venues, no. of unique venues and number of unique venue categories for Manhattan called manhattan_df**

In [None]:
manhattan_df = pd.DataFrame({'No. of Venue':[manhattan_venues.shape[0]],
                          'No. of Unique Venue':[len(manhattan_venues['Venue'].unique())],
                          'No. of Unique Venue Category':[len(manhattan_venues['Venue Category'].unique())]})
manhattan_df = manhattan_df.T
manhattan_df.columns =['Manhattan']
manhattan_df

In [None]:
manhattan_df.reset_index(inplace = True)

In [None]:
manhattan_df.columns = ['Items', 'Manhattan']
manhattan_df

##### Let's get the frequency of every venue category, that is, the total number of venues in each category in Manhattan, in descending order, so the most popular categories come first.
Let's name the dataframe manhattan_venue_cat.

In [None]:
manhattan_venue_cat = manhattan_venues.groupby('Venue Category')['Neighborhood'].count().sort_values(ascending=False).to_frame()

In [None]:
manhattan_venue_cat.reset_index(inplace = True)

In [None]:
manhattan_venue_cat.rename(columns={'Venue Category':'Venue Category', 'Neighborhood':'Manhattan'}, inplace=True)
manhattan_venue_cat

In [None]:
manhattan_venue_cat.to_excel('manhattan_venue_cat.xlsx')

In [None]:
print('There are {} unique venue categories in manhattan.'.format(len(manhattan_venues['Venue Category'].unique())))

##### Let's get the number of restaurants in manhattan

In [None]:
filt_2 = manhattan_venue_cat['Venue Category'].str.contains('Restaurant')
manhattan_restaurants = manhattan_venue_cat[filt_2]
manhattan_restaurants = manhattan_restaurants.rename(columns = {'Venue Category':'Venue Category', 'Neighborhood':'Manhattan'})
manhattan_restaurants.set_index('Venue Category', inplace=True)
#manhattan_restaurants.reset_index(inplace= True)

In [None]:
#manhattan_restaurants.rename(columns = {'Neighborhood':'Manhattan'}, inplace = True)
manhattan_restaurants

In [None]:
# Sum the per-category counts; .sum() avoids position-based lookups on a label-indexed Series
Number_of_Manhattan_restaurants = int(manhattan_restaurants['Manhattan'].sum())

print('Total number of restaurants in Manhattan is :', Number_of_Manhattan_restaurants)

In [None]:
#filt = man_res_percentage['No. of Restaurants']>.50
#man_res_percentage[filt].shape

**Let's say we want to get the number of Italian Restaurants in Manhattan**

In [None]:
italian_restaurant = (manhattan_venues['Venue Category']== 'Italian Restaurant')
print('The number of Italian restaurants in manhattan is {}'.format(manhattan_venues[italian_restaurant].shape[0]))

**Let's check how many venues were returned for each neighborhood**

In [None]:
manhattan_venues.groupby('Neighborhood').count()

Let's find out how many unique categories can be curated from all the returned venues

### 3. Analyze Each Neighborhood

In [None]:
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = manhattan_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.head()

And let's examine the new dataframe size.

In [None]:
manhattan_onehot.shape

##### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [None]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped

In [None]:
manhattan_grouped.shape

##### Let's print each neighborhood along with the top 5 most common venues

In [None]:
num_top_venues = 5

for hood in manhattan_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = manhattan_grouped[manhattan_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

#### Let's put that into a pandas dataframe
First, let's write a function to sort the venues in descending order.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

### 4. Cluster Neighborhoods

###### Run k-means to cluster the Manhattan neighborhoods into 5 clusters.

In [None]:
# set number of clusters
kclusters = 5

manhattan_grouped_clustering = manhattan_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

**Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood of Manhattan.**

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

manhattan_merged = manhattan_data

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
manhattan_merged = manhattan_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

manhattan_merged.head() # check the last columns!

In [None]:
manhattan_merged['Cluster Labels'].unique()

Finally, let's visualize the resulting clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### 5. Examine Clusters

Now, let's examine each cluster and determine the discriminating venue categories that distinguish each cluster.

In [None]:
Manhattan_Cluster_1 = manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 0, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]
Manhattan_Cluster_1

In [None]:
Manhattan_Cluster_2=manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 1, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]
Manhattan_Cluster_2

In [None]:
Manhattan_Cluster_3=manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 2, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]
Manhattan_Cluster_3

In [None]:
Manhattan_Cluster_4=manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 3, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]
Manhattan_Cluster_4

In [None]:
Manhattan_Cluster_5=manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 4, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]
Manhattan_Cluster_5

### Methodology 
<a id = "methodology"></a>

The two cities chosen for the analysis are the City of Toronto and New York City. Since Manhattan is the most visited of New York's boroughs, we chose the borough of Manhattan.

We compare the venues of the **City of Toronto** with those of the borough of **Manhattan** in New York City.

In the first step, we gathered datasets for both cities and explored the neighborhoods based on the number of venues.

In the second step, we analyzed the neighborhoods with respect to the top ten venue categories of each neighborhood.

In the final step, we used **K-means clustering**, an unsupervised algorithm, to group the neighborhoods by their venue categories. Neighborhoods within a cluster are similar to each other, while neighborhoods in different clusters are dissimilar. For both cities, we considered venues within 500 meters of each neighborhood location in Toronto and Manhattan. The clusters show which venue categories dominate each cluster.
By comparing the venue categories across the clusters of the two cities, we can identify the similarities and differences between them.
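
To make the clustering step concrete, here is a minimal sketch of the K-means idea written with NumPy only. It is not the code used in this notebook: the function `kmeans`, the toy matrix `X` (rows are neighborhoods, columns are made-up venue-category frequencies) and the choice of `k=2` are all assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: assign points to the nearest centroid, then update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows of X chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every row to every centroid, shape (n_rows, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if the cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids

# Hypothetical frequencies for three venue categories in four neighborhoods.
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.1, 0.8],
    [0.0, 0.2, 0.8],
])

labels, centroids = kmeans(X, k=2)
print(labels)
```

The first two rows end up in one cluster and the last two in the other, because their venue-frequency profiles are close to each other. The numbering of the clusters themselves is arbitrary.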


### Analysis
<a id="analysis"></a>

##### Let's get a dataframe with the number of venues in both Toronto and Manhattan

In [None]:
result = pd.merge(toronto_df,manhattan_df, on ='Items')
result

In [None]:
result.set_index('Items', inplace=True)

In [None]:
result
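
Note that `pd.merge` performs an inner join by default, so any row whose `Items` value appears in only one of the two frames is dropped from `result`. The sketch below illustrates the difference with made-up frames (`toy_toronto`, `toy_manhattan` and their numbers are hypothetical) and shows how `how="outer"` with `indicator=True` reveals the unmatched rows.

```python
import pandas as pd

# Hypothetical summary frames; the values are invented for illustration.
toy_toronto = pd.DataFrame({
    "Items": ["Venues", "Unique venues", "Parks"],
    "Toronto": [10, 7, 2],
})
toy_manhattan = pd.DataFrame({
    "Items": ["Venues", "Unique venues", "Museums"],
    "Manhattan": [15, 12, 3],
})

# Default inner join: only items present in both frames survive.
inner = pd.merge(toy_toronto, toy_manhattan, on="Items")

# Outer join keeps everything; the _merge column says where each row came from.
outer = pd.merge(toy_toronto, toy_manhattan, on="Items", how="outer", indicator=True)

print(inner)
print(outer)
```

If both frames are known to share exactly the same items, the default inner join is sufficient.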

In [None]:
#converting the columns to lists
item_list= result.index.to_list()
toronto = result['Toronto'].to_list()
manhattan = result['Manhattan'].to_list()

In [None]:
import numpy as np
import matplotlib.pyplot as plt

indx = np.arange(len(item_list))  # the label locations
score_label = np.arange(0, 4500, 500)
bar_width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(12,6))
barToronto = ax.bar(indx - bar_width/2, toronto, bar_width, label='Toronto')
barManhattan = ax.bar(indx + bar_width/2, manhattan, bar_width, label='Manhattan')

# inserting x axis label
ax.set_xticks(indx)
ax.set_xticklabels(item_list)

# inserting y axis label
ax.set_yticks(score_label)
ax.set_yticklabels(score_label)

ax.legend()
ax.set_title("Comparison of Toronto and Manhattan Venues", fontsize=16)

def insert_data_labels(bars):
	for bar in bars:
		bar_height = bar.get_height()
		ax.annotate('{0:.0f}'.format(bar.get_height()),
			xy=(bar.get_x() + bar.get_width() / 2, bar_height),
			xytext=(0, 3),
			textcoords='offset points',
			ha='center',
			va='bottom'
		)

insert_data_labels(barToronto)
insert_data_labels(barManhattan)

#plt.xticks(rotation=90)

plt.show()

In [None]:
toronto_venue_cat.head(3)

In [None]:
manhattan_venue_cat.head(10)

In [None]:
venue_category_merged = pd.merge(manhattan_venue_cat, toronto_venue_cat, on ='Venue Category')

In [None]:
venue_category_merged.shape

In [None]:
venue_category_merged.set_index('Venue Category', inplace = True)

In [None]:
venue_category_merged.rename(columns={'Neighborhood':'Manhattan', 'Frequency':'Toronto'},inplace=True)
venue_category_merged

In [None]:
venue_category_merged_50 = venue_category_merged.head(50)

In [None]:
venue_category_merged_50

In [None]:
venue_cat= venue_category_merged_50.index.to_list()
manhattan_venue = venue_category_merged_50['Manhattan'].to_list()
toronto_venue = venue_category_merged_50['Toronto'].to_list()

In [None]:
import numpy as np
import matplotlib.pyplot as plt


indx = np.arange(len(venue_cat))  # the label locations
score_label = np.arange(0, 225, 20)
bar_width = 0.4  # the width of the bars

fig, ax = plt.subplots(figsize=(14,8))
barManhattan = ax.bar(indx - bar_width/2, manhattan_venue, bar_width, label='Manhattan')
barToronto = ax.bar(indx + bar_width/2, toronto_venue, bar_width, label='Toronto')

# inserting x axis label
ax.set_xticks(indx)
ax.set_xticklabels(venue_cat)

# inserting y axis label
ax.set_yticks(score_label)
ax.set_yticklabels(score_label)
ax.set_title("First 50 Venue Categories of Manhattan & Toronto", fontsize=16)
ax.legend()

def insert_data_labels(bars):
	for bar in bars:
		bar_height = bar.get_height()
		ax.annotate('{0:.0f}'.format(bar.get_height()),
			xy=(bar.get_x() + bar.get_width() / 2, bar_height),
			xytext=(0, 3),
			textcoords='offset points',
			ha='center',
			va='bottom'
		)

insert_data_labels(barManhattan)
insert_data_labels(barToronto)

plt.xticks(rotation=90)

plt.show()

#### Let's analyze the restaurants in Toronto and Manhattan 

In [None]:
print('Number of restaurants in Toronto : ',Number_of_Toronto_restaurants)
print('Number of restaurants in Manhattan : ',Number_of_Manhattan_restaurants)

In [None]:
toronto_restaurants.head(3)

In [None]:
manhattan_restaurants.head(3)

In [None]:
restaurants_merged = pd.merge(toronto_restaurants, manhattan_restaurants, on ='Venue Category')
restaurants_merged.rename(columns={'Frequency':'Toronto', 'Neighborhood':'Manhattan'}, inplace=True)
restaurants_merged

In [None]:
restaurant= restaurants_merged.index.to_list()
toronto_res = restaurants_merged['Toronto'].to_list()
manhattan_res = restaurants_merged['Manhattan'].to_list()

In [None]:
import numpy as np
import matplotlib.pyplot as plt


indx = np.arange(len(restaurant))  # the label locations
score_label = np.arange(0, 150, 10)
bar_width = 0.35  # the width of the bars

fig, ax = plt.subplots(figsize=(14,8))
barToronto = ax.bar(indx - bar_width/2, toronto_res, bar_width, label='Toronto')
barManhattan = ax.bar(indx + bar_width/2, manhattan_res, bar_width, label='Manhattan')
# inserting x axis label
ax.set_xticks(indx)
ax.set_xticklabels(restaurant)
ax.set_title('Comparison of Toronto and Manhattan Restaurants', fontsize=16)
# inserting y axis label
ax.set_yticks(score_label)
ax.set_yticklabels(score_label)

ax.legend()

def insert_data_labels(bars):
	for bar in bars:
		bar_height = bar.get_height()
		ax.annotate('{0:.0f}'.format(bar.get_height()),
			xy=(bar.get_x() + bar.get_width() / 2, bar_height),
			xytext=(0, 3),
			textcoords='offset points',
			ha='center',
			va='bottom'
		)

insert_data_labels(barManhattan)
insert_data_labels(barToronto)

plt.xticks(rotation=90)

plt.show()

**Let's compute the percentage share of each restaurant category in Manhattan and name it `man_res_percentage`.**

In [None]:
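# Share of each restaurant category among all Manhattan restaurants, in percent
man_res_percentage = (100 * manhattan_restaurants / Number_of_Manhattan_restaurants).round(2)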
man_res_percentage
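
As a small illustration of the percentage calculation, the sketch below divides each count by the sum of all counts, so the denominator always matches the data. The series `toy_counts` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical restaurant counts per category.
toy_counts = pd.Series(
    {"Italian Restaurant": 5, "Mexican Restaurant": 3, "Korean Restaurant": 2},
    name="Count",
)

# Percentage share of each category; the shares add up to 100.
toy_share = (100 * toy_counts / toy_counts.sum()).round(2)
print(toy_share)
```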

**Let's find the restaurants that are present in Manhattan but not found in Toronto**

In [None]:
to = toronto_restaurants.index.to_list()
ma = manhattan_restaurants.index.to_list()
manhattan_res_only = []
for m in ma:
    if m not in to:
        manhattan_res_only.append(m)
print('List of the restaurants only found in Manhattan: \n',manhattan_res_only)

In [None]:
print(len(manhattan_res_only))
#print(len(toronto_res_only))

**Let's find the restaurants that are present in Toronto but not found in Manhattan**

In [None]:
toronto_res_only = []
for t in to:
    if t not in ma:
        toronto_res_only.append(t)
print('List of the restaurants only found in Toronto: \n',toronto_res_only)

In [None]:
print('Number of restaurants found only in Manhattan: {}'.format(len(manhattan_res_only)))
print('Number of restaurants found only in Toronto: {}'.format(len(toronto_res_only)))

In [None]:
print(len(ma))
print(len(to))
print(len(manhattan_res_only))
print(len(toronto_res_only))
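
The two loops above compute set differences between the category lists. A minimal sketch of the same idea with Python sets follows; the lists `toy_toronto_cats` and `toy_manhattan_cats` are made up for illustration.

```python
# Hypothetical restaurant categories for each city.
toy_toronto_cats = ["Fast Food Restaurant", "Italian Restaurant", "Thai Restaurant"]
toy_manhattan_cats = ["Italian Restaurant", "Korean Restaurant", "Mexican Restaurant"]

# Categories found in one city but not the other, and those found in both.
manhattan_only = sorted(set(toy_manhattan_cats) - set(toy_toronto_cats))
toronto_only = sorted(set(toy_toronto_cats) - set(toy_manhattan_cats))
shared = sorted(set(toy_toronto_cats) & set(toy_manhattan_cats))

print(manhattan_only)
print(toronto_only)
print(shared)
```

Set membership tests are also faster than `in` on a list, although that hardly matters at this data size.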

In [None]:
print('Number of neighborhoods in cluster_1 with similar characteristics: {}'.format(Toronto_Cluster_1.shape[0]))

**Let's count how often each venue is the 1st most common venue in each of the five Toronto clusters**

In [None]:
tc_1 = Toronto_Cluster_1['1st Most Common Venue'].value_counts().to_frame().reset_index()
tc_1

In [None]:
#Toronto_Cluster_2.shape
tc_2 =Toronto_Cluster_2['1st Most Common Venue'].value_counts().to_frame().reset_index()
tc_2

In [None]:
#Toronto_Cluster_3.shape
tc_3 = Toronto_Cluster_3['1st Most Common Venue'].value_counts().to_frame().reset_index()
tc_3.head()

In [None]:
#Toronto_Cluster_4.shape
tc_4 = Toronto_Cluster_4['1st Most Common Venue'].value_counts().to_frame().reset_index()
tc_4

In [None]:
#Toronto_Cluster_5.shape
tc_5 = Toronto_Cluster_5['1st Most Common Venue'].value_counts().to_frame().reset_index()
tc_5

**Let's count how often each venue is the 1st most common venue in each of the five Manhattan clusters**

In [None]:
#Manhattan_Cluster_1.shape
mc_1 = Manhattan_Cluster_1['1st Most Common Venue'].value_counts().to_frame().reset_index()
mc_1

In [None]:
#Manhattan_Cluster_2.shape
mc_2 = Manhattan_Cluster_2['1st Most Common Venue'].value_counts().to_frame().reset_index()
mc_2

In [None]:
#Manhattan_Cluster_3.shape
mc_3 = Manhattan_Cluster_3['1st Most Common Venue'].value_counts().to_frame().reset_index()
mc_3

In [None]:
mc_4 = Manhattan_Cluster_4['1st Most Common Venue'].value_counts().to_frame().reset_index()
mc_4

In [None]:
mc_5 = Manhattan_Cluster_5['1st Most Common Venue'].value_counts().to_frame().reset_index()
mc_5

### Results and Discussion
<a id="results"></a>

In the analysis, we divided the neighborhoods into five clusters. Then, for each cluster, we found the venue that most often appears as the 1st most common venue.
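
As a minimal sketch of this step, the most frequent "1st Most Common Venue" of every cluster can be collected in one loop over a `groupby`. The frame `toy` and its values are hypothetical; only the two column names follow the notebook.

```python
import pandas as pd

# Hypothetical cluster assignments and top venues.
toy = pd.DataFrame({
    "Cluster Labels": [0, 0, 0, 1, 1, 2],
    "1st Most Common Venue": [
        "Park", "Park", "Coffee Shop", "Pizza Place", "Pizza Place", "Coffee Shop",
    ],
})

rows = []
for label, venues in toy.groupby("Cluster Labels")["1st Most Common Venue"]:
    # value_counts sorts in descending order, so position 0 is the most frequent venue.
    counts = venues.value_counts()
    rows.append({
        "Cluster": label,
        "Top venue": counts.index[0],
        "Neighborhoods": int(counts.iloc[0]),
    })

summary = pd.DataFrame(rows)
print(summary)
```

Reading the counts by position avoids relying on the column names produced by `value_counts().reset_index()`, which differ between pandas versions.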


#### The most frequent venues of all the Toronto clusters are as follows:

In [None]:
# Read by position: the column names from value_counts().reset_index() vary across pandas versions
print('Most frequent 1st Most Common Venue in Toronto_Cluster_1: {}, {}'.format(tc_1.iloc[0, 0], tc_1.iloc[0, 1]))
print('Most frequent 1st Most Common Venue in Toronto_Cluster_2: {}, {}'.format(tc_2.iloc[0, 0], tc_2.iloc[0, 1]))
print('Most frequent 1st Most Common Venue in Toronto_Cluster_3: {}, {}'.format(tc_3.iloc[0, 0], tc_3.iloc[0, 1]))
print('Most frequent 1st Most Common Venue in Toronto_Cluster_4: {}, {}'.format(tc_4.iloc[0, 0], tc_4.iloc[0, 1]))
print('Most frequent 1st Most Common Venue in Toronto_Cluster_5: {}, {}'.format(tc_5.iloc[0, 0], tc_5.iloc[0, 1]))

#### The most frequent venues of all the Manhattan clusters are as follows:

In [None]:
# Read by position: the column names from value_counts().reset_index() vary across pandas versions
print('Most frequent 1st Most Common Venue in Manhattan_Cluster_1: {}, {}'.format(mc_1.iloc[0, 0], mc_1.iloc[0, 1]))
print('Most frequent 1st Most Common Venue in Manhattan_Cluster_2: {}, {}'.format(mc_2.iloc[0, 0], mc_2.iloc[0, 1]))
print('Most frequent 1st Most Common Venue in Manhattan_Cluster_3: {}, {}'.format(mc_3.iloc[0, 0], mc_3.iloc[0, 1]))
print('Most frequent 1st Most Common Venue in Manhattan_Cluster_4: {}, {}'.format(mc_4.iloc[0, 0], mc_4.iloc[0, 1]))
print('Most frequent 1st Most Common Venue in Manhattan_Cluster_5: {}, {}'.format(mc_5.iloc[0, 0], mc_5.iloc[0, 1]))


If we take a look at the above results, we see that Toronto and Manhattan have similar characteristics.

The following are the most frequent venues of the Toronto clusters, in descending order:

1. Park
2. Fast Food Restaurant
3. Pizza Place
4. Coffee Shop
5. Baseball Field

The following are the most frequent venues of the Manhattan clusters, in descending order:

1. Italian Restaurant
2. Coffee Shop
3. Park
4. Mexican Restaurant
5. Korean Restaurant

Having analyzed the venue categories of the two cities (Toronto and Manhattan), we have identified the most frequent venues in each of them.


### Conclusion
<a id="conclusion"></a>

From the analysis and results of the clusters of the two cities above, we can draw the following conclusions:

* Toronto and Manhattan have similar categories of venues
* Park, Coffee Shop and Restaurant appear among the most frequent venues of the clusters of both Toronto and Manhattan
* Fast Food Restaurants are more prominent in Toronto
* Italian Restaurants are more prominent in Manhattan



We find the following number of venues for both the cities:

* The total number of venues for Toronto : **2133**
* The total number of venues for Manhattan : **3252**
* The total number of unique venues for Toronto : **1391**
* The total number of unique venues for Manhattan : **2767**

From the analysis of the restaurants in the two cities, we find that Manhattan has almost twice as many restaurants as Toronto:

* Number of restaurants in Toronto :  **480**
* Number of restaurants in Manhattan :  **923**
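
The "almost twice" figure can be checked directly from the two counts listed above. The variable names in this short sketch are illustrative.

```python
# Restaurant counts as reported in the list above.
toronto_restaurants_total = 480
manhattan_restaurants_total = 923

# How many Manhattan restaurants there are for each Toronto restaurant.
ratio = manhattan_restaurants_total / toronto_restaurants_total
print(round(ratio, 2))
```

The ratio is about 1.92, which is consistent with "almost twice".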

There are also some restaurant categories that are found in only one of the two cities.

* Number of unique restaurants only in Toronto :  **9**
* Number of unique restaurants only in Manhattan :  **35**



Toronto and Manhattan have similar characteristics in terms of venue types. Manhattan has a higher number of venues, unique venues, unique venue categories and restaurants than the City of Toronto. Both cities also offer some distinct venues and restaurants.