# The Battle of Neighborhoods - Capstone Project

## 1) Business Problem

The goal of this business case is to understand the similarities and differences between the venues of two cities, to gain better insight into their demographics, and to decide which neighbourhoods are the most suitable for opening a specific type of venue downtown.

The neighbourhoods of the two cities will be compared based on the clusters they fall into. New York City will be compared with Toronto to better understand the style of their venues and how similar or dissimilar they are.

This business case is aimed at new business owners. It should help them decide which neighbourhood is the most suitable for a new venue investment such as a restaurant, a coffee shop or another entertainment venue.

## 2) Analytics Approach

K-means clustering will be used to segment the cities' neighbourhoods and show which neighbourhoods are similar or dissimilar to one another, based on the categories of the venues found in each neighbourhood.
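
As a rough illustration of this approach, the sketch below clusters a small, made-up table of venue-category frequencies with scikit-learn's `KMeans`. The neighbourhood labels, the frequencies and the choice of two clusters are assumptions for the example only, not results from this project.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data: share of each venue category per neighbourhood.
freq = pd.DataFrame(
    {
        "Coffee Shop": [0.6, 0.5, 0.1, 0.0],
        "Restaurant": [0.3, 0.4, 0.1, 0.2],
        "Park": [0.1, 0.1, 0.8, 0.8],
    },
    index=["A", "B", "C", "D"],
)

# n_clusters=2 is an arbitrary choice for this toy table.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(freq)

# One cluster label per neighbourhood; the label numbers themselves carry no meaning.
labels = pd.Series(kmeans.labels_, index=freq.index, name="Cluster")
print(labels)
```

Neighbourhoods that share a label have a similar mix of venue categories, which is the sense of "similar" used in the rest of the analysis.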

## 3) Data Sourcing and Requirements 

1. The first dataset covers New York, including the boroughs and neighbourhoods within it.
2. The second dataset comes from the Wikipedia page listing Toronto's postal codes, boroughs and neighbourhoods.
3. Foursquare location data.

Both datasets will be enriched with Foursquare location data, and all of the venues for each neighbourhood will be displayed. Both datasets will then be merged into a larger dataframe that includes the two cities (NY and Toronto) with their boroughs and neighbourhoods.
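
A minimal sketch of how the two city tables could be stacked into one dataframe is shown below. The `City` column and the sample rows are assumptions for illustration; the real tables also carry coordinates and venue columns.

```python
import pandas as pd

# Illustrative single-row tables standing in for the real city datasets.
ny = pd.DataFrame({"Borough": ["Manhattan"], "Neighbourhood": ["Chelsea"]})
toronto = pd.DataFrame(
    {"Borough": ["Downtown Toronto"], "Neighbourhood": ["Regent Park, Harbourfront"]}
)

# Tag each table with its city, then stack them row-wise.
combined = pd.concat(
    [ny.assign(City="New York"), toronto.assign(City="Toronto")],
    ignore_index=True,
)
print(combined)
```

Keeping the city as a column makes it possible to cluster all neighbourhoods together and still tell afterwards which city each one belongs to.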

### Import Libraries

In [1]:
# install third-party packages before importing them
!conda install -c conda-forge geopy --yes
!pip install pandas==1.0.3
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab

import json # library to handle JSON files

import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import scipy as sc

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
%matplotlib inline
import seaborn as sns

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

#!pip install geocoder
#import geocoder # import geocoder

# initialize your variable to None
#lat_lng_coords = None

# loop until you get the coordinates
#while(lat_lng_coords is None):
#  g = geocoder.google('{}, Toronto, Ontario'.format(new_data['Postal Code']))
#  lat_lng_coords = g.latlng

#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]

print('Libraries imported.')


## 4) Data Collection  

### 1 - Collecting Toronto Data

#### A) WEB SCRAPING

In [2]:
!pip install lxml

Collecting lxml
  Downloading https://files.pythonhosted.org/packages/79/37/d420b7fdc9a550bd29b8cfeacff3b38502d9600b09d7dfae9a69e623b891/lxml-4.5.2-cp36-cp36m-manylinux1_x86_64.whl (5.5MB)
Installing collected packages: lxml
Successfully installed lxml-4.5.2


In [9]:
raw_data = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M", header=0)
raw_data = raw_data[0]
raw_data.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


#### B) Filtering only valid Boroughs - Removing "Not Assigned" Boroughs

In [5]:
new_data = raw_data[raw_data['Borough']!='Not assigned']
new_data.shape

(103, 3)

In [6]:
new_data.reset_index(drop=True,inplace=True)

In [7]:
new_data.head(12)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [8]:
# the source table spells the placeholder 'Not assigned' (lowercase "a")
new_data[new_data['Neighbourhood']=='Not assigned']

# No row with an assigned borough has a 'Not assigned' neighbourhood

Unnamed: 0,Postal Code,Borough,Neighbourhood


In [9]:
print("The Data Frame has {} rows and {} columns".format(new_data.shape[0], new_data.shape[1]))

The Data Frame has 103 rows and 3 columns


#### C) Including the Latitude and Longitude

#### Getting Latitude and Longitude of Neighbourhoods - the Google geocoder did not return results reliably, so the coordinates are read from a CSV file; an ArcGIS option exists but is not used

In [10]:
latlong = pd.read_csv('Geospatial_Coordinates.csv') 
latlong.head(5)

FileNotFoundError: [Errno 2] File Geospatial_Coordinates.csv does not exist: 'Geospatial_Coordinates.csv'
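
The error above means the CSV file was not in the working directory when the cell ran. A small guard like the sketch below can make that failure easier to diagnose. The helper name `load_coordinates` is hypothetical, and the required column names are an assumption taken from the merged table shown in the next step.

```python
from pathlib import Path

import pandas as pd


def load_coordinates(path="Geospatial_Coordinates.csv"):
    """Read the coordinates file and check that the expected columns are present."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            "{} not found in {}; place the file next to the notebook".format(
                csv_path.name, csv_path.resolve().parent
            )
        )
    coords = pd.read_csv(csv_path)
    # Column names assumed from the merged output ('Postal Code', 'Latitude', 'Longitude').
    missing = {"Postal Code", "Latitude", "Longitude"} - set(coords.columns)
    if missing:
        raise ValueError("missing columns: {}".format(sorted(missing)))
    return coords

# Usage in the notebook would be: latlong = load_coordinates()
```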

#### Joining Neighbourhoods with their Latitudes and Longitudes

In [11]:
neigh_with_latlong = new_data.merge(right=latlong,on='Postal Code')
neigh_with_latlong.head(12)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
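
`merge` performs an inner join by default, so a postal code without coordinates would be dropped silently. The sketch below shows one way to check for that with `how='left'` and `indicator=True`. The frames are small stand-ins and the postal code `M9Z` is a made-up value without a match.

```python
import pandas as pd

# Stand-in for new_data; 'M9Z' is a hypothetical code with no coordinates.
neighbourhoods = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A", "M9Z"],
        "Neighbourhood": ["Parkwoods", "Victoria Village", "Example"],
    }
)
# Stand-in for latlong.
coordinates = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A"],
        "Latitude": [43.753259, 43.725882],
        "Longitude": [-79.329656, -79.315572],
    }
)

# A left join keeps every neighbourhood; '_merge' records where each row came from.
checked = neighbourhoods.merge(coordinates, on="Postal Code", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list confirms that the inner join did not lose any rows.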


In [12]:
# Getting Latitude and Longitude using ARC GIS

#!pip install geocoder
#import geocoder # Import geocoder package
#postal_code = new_data['Postal Code'] # Postal code for each neighborhood in Toronto, Canada

# Initialize your variable to 'None'
#lat_lng_coords = None

# Create an empty list to append the Latitude values
#lat_toronto = []

# Create an empty list to append the Longitude values
#lon_toronto = []

# Loop until getting the geographical coordinates
#for postal in postal_code:
#    g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal))
#    lat_lng_coords = g.latlng
#    lat_toronto.append(lat_lng_coords[0])
#    lon_toronto.append(lat_lng_coords[1])

#neigh_with_latlong = new_data.copy()

#neigh_with_latlong['Latitude'] = lat_toronto
#neigh_with_latlong['Longitude'] = lon_toronto
#neigh_with_latlong.head(12)
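
If the ArcGIS option is ever enabled, a lookup that fails returns no coordinates, and indexing the result would raise an error. The sketch below bounds the number of retries instead of looping forever. `geocode_with_retry` and `fake_lookup` are hypothetical names; in the notebook the lookup could be something like `lambda q: geocoder.arcgis(q).latlng`.

```python
import time


def geocode_with_retry(lookup, query, max_tries=3, pause=0.0):
    """Call lookup(query) until it returns coordinates or max_tries is reached."""
    for _ in range(max_tries):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(pause)  # wait before asking again
    return None, None  # give up; the caller decides how to handle missing values


# Stand-in for a real geocoder call: fails once, then succeeds.
calls = []

def fake_lookup(query):
    calls.append(query)
    return None if len(calls) < 2 else [43.753259, -79.329656]


lat, lng = geocode_with_retry(fake_lookup, "M3A, Toronto, Ontario")
print(lat, lng)
```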

## Part 3 - Exploring and Segmenting Neighbourhoods and Venues

In [13]:
len(neigh_with_latlong['Borough'].unique())

10

#### > We have 10 unique boroughs

In [14]:
neigh_with_latlong['Borough'].value_counts()

North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
York                 5
East York            5
East Toronto         5
Mississauga          1
Name: Borough, dtype: int64

In [15]:
len(neigh_with_latlong['Neighbourhood'])

103

In [16]:
neigh_with_latlong[neigh_with_latlong[['Neighbourhood']].duplicated()]

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
13,M3C,North York,Don Mills,43.7259,-79.340923
46,M3L,North York,Downsview,43.739015,-79.506944
53,M3M,North York,Downsview,43.728496,-79.495697
60,M3N,North York,Downsview,43.761631,-79.520999


#### > We have 99 unique neighbourhoods and 103 total neighbourhoods with different postal codes
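
The figure of 99 follows from the four repeated rows listed above (103 - 4 = 99). A minimal sketch of the same check on a small frame, using four rows taken from the tables in this notebook:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Postal Code": ["M3B", "M3C", "M3L", "M3M"],
        "Neighbourhood": ["Don Mills", "Don Mills", "Downsview", "Downsview"],
    }
)

total = len(sample)                                # all rows
unique = sample["Neighbourhood"].nunique()         # distinct neighbourhood names
repeats = sample["Neighbourhood"].duplicated().sum()  # rows repeating an earlier name

print(total, unique, repeats)
```

The total always equals the number of unique names plus the number of repeated rows.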

To create an instance of the geocoder, we need to define a user_agent. We will name our agent <em>tor_explorer</em>, as shown below.

In [17]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are 43.6534817, -79.3839347.
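
`geocode` returns `None` when the service finds no match, in which case reading `.latitude` fails with an unhelpful error. The sketch below wraps the call in a small check. `city_coordinates` and the stub geolocator are hypothetical and only there so the example runs without network access; in the notebook the first argument would be the `Nominatim` instance.

```python
from types import SimpleNamespace


def city_coordinates(geolocator, address):
    """Return (latitude, longitude) for address, or raise if nothing was found."""
    location = geolocator.geocode(address)
    if location is None:
        raise ValueError("no geocoding result for {!r}".format(address))
    return location.latitude, location.longitude


class StubGeolocator:
    """Offline stand-in that only knows one address."""

    def geocode(self, address):
        if address == "Toronto, ON":
            return SimpleNamespace(latitude=43.6534817, longitude=-79.3839347)
        return None


latitude, longitude = city_coordinates(StubGeolocator(), "Toronto, ON")
print(latitude, longitude)
```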


#### Let's visualize all boroughs and neighbourhoods around Toronto using Folium (if the map does not render, open the notebook link in nbviewer)

In [18]:
# create map of Toronto using latitude and longitude values
map_tor = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neigh_with_latlong['Latitude'], neigh_with_latlong['Longitude'], neigh_with_latlong['Borough'], neigh_with_latlong['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_tor)  
    
map_tor

#### Let's visualize only Downtown Toronto and the nearby boroughs using Folium (if the map does not render, open the notebook link in nbviewer)

In [19]:
toronto_df = neigh_with_latlong[neigh_with_latlong['Borough'].str.contains('Toronto')]
toronto_df.head(3)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


In [20]:
# create map of Toronto using latitude and longitude values
map_tor = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_tor)  
    
map_tor

### Exploring Venues around Downtown Toronto and Nearby Boroughs

In [21]:
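#@hidden_cell
import os

# Credentials are read from environment variables so they are not stored in the notebook.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are a naming assumption.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # maximum number of venues returned per request
radius = 500 # search radius in metres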

Defining a function that retrieves a list of venues for each neighbourhood and creates a dataframe

In [22]:
def getNearbyVenues(postal, borough, names, latitudes, longitudes, radius=500):
    """Return a dataframe with one row per venue found within `radius` metres of each neighbourhood."""

    venues_list = []
    # loop variables are named differently from the arguments so they do not shadow them
    for code, boro, name, lat, lng in zip(postal, borough, names, latitudes, longitudes):

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            code,
            boro,
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighbourhood lists into a single table
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = [
                  'Postal Code',
                  'Borough',
                  'Neighbourhood',
                  'Neighbourhood Latitude',
                  'Neighbourhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues
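
The function above indexes straight into the response, so an empty `groups` list or a venue without categories would raise an error. The sketch below shows a more defensive way to flatten a response. `extract_venues` is a hypothetical helper, and the sample payload is hand-written to mirror only the keys the function reads; the second venue is made up to show the missing-category case.

```python
def extract_venues(payload):
    """Flatten an explore-style response into (name, lat, lng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when a venue has no category attached.
        category = categories[0]["name"] if categories else None
        rows.append(
            (venue["name"], venue["location"]["lat"], venue["location"]["lng"], category)
        )
    return rows


# Hand-written sample shaped like the keys used above (not a real API response).
sample = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Roselle Desserts",
                               "location": {"lat": 43.653447, "lng": -79.362017},
                               "categories": [{"name": "Bakery"}]}},
                    {"venue": {"name": "Uncategorised Example",
                               "location": {"lat": 43.65, "lng": -79.36},
                               "categories": []}},
                ]
            }
        ]
    }
}

rows = extract_venues(sample)
print(rows)
```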


In [23]:
toronto_venues = getNearbyVenues(  postal = toronto_df['Postal Code'],
                                   borough = toronto_df['Borough'],
                                   names = toronto_df['Neighbourhood'],
                                   latitudes=toronto_df['Latitude'],
                                   longitudes=toronto_df['Longitude']
                                  )

#### The complete DataFrame containing the Downtown Toronto boroughs, nearby boroughs and their venues

In [24]:
print(toronto_venues.shape)
toronto_venues.head(5)

(1633, 9)


Unnamed: 0,Postal Code,Borough,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
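
Since the clustering is based on venue categories, a table like this one is typically turned into per-neighbourhood category frequencies first. The following is a minimal sketch of that step, assuming one-hot encoding followed by a group mean. The first three rows come from the output above; the fourth row is made up for illustration.

```python
import pandas as pd

venues = pd.DataFrame(
    {
        "Neighbourhood": [
            "Regent Park, Harbourfront",
            "Regent Park, Harbourfront",
            "Regent Park, Harbourfront",
            "Garden District, Ryerson",  # hypothetical row for illustration
        ],
        "Venue Category": ["Bakery", "Coffee Shop", "Spa", "Coffee Shop"],
    }
)

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Mean of the indicators = share of each category within a neighbourhood.
freq = onehot.groupby("Neighbourhood").mean()
print(freq)
```

Each row of `freq` sums to 1, so neighbourhoods with many venues and neighbourhoods with few are compared on the same scale.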
