# Capstone Project - The Battle of Neighborhoods (Week 1)

# Introduction/Business Problem

#### In this project we will try to find an optimal location for a new restaurant. Specifically, the project is meant to help stakeholders interested in opening an Italian restaurant in Toronto, Canada.

#### There are many restaurants in Toronto. The aim is to detect locations that are not already crowded with restaurants, with particular interest in areas that have no Italian restaurants in the vicinity, while preferring locations as close to the City Center / Downtown as possible.

## Data

The following data sources will be needed to extract/generate the required information:
* Get the list of neighborhoods and their postal codes from the Wikipedia page listing Canadian postal codes that begin with M
* Download csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data
* The coordinates of Toronto will be obtained by geocoding (the code below uses the **Nominatim** geocoder from geopy)
* Number of restaurants and their type and location in every neighborhood will be obtained using **Foursquare API**

## First, let us prepare a DataFrame of the neighborhoods of Toronto

## To convert the Wikipedia table of postal codes into a dataframe, I have used the pandas library

In [2]:
import pandas as pd

# This Reads all the dataframes from the html contents and stores them as a list. 
# Here we are reading all the dataframes from the wikipedia URL

list_of_dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
list_of_dfs

[            0                 1  \
 0    Postcode           Borough   
 1         M1A      Not assigned   
 2         M2A      Not assigned   
 3         M3A        North York   
 4         M4A        North York   
 5         M5A  Downtown Toronto   
 6         M5A  Downtown Toronto   
 7         M6A        North York   
 8         M6A        North York   
 9         M7A      Queen's Park   
 10        M8A      Not assigned   
 11        M9A         Etobicoke   
 12        M1B       Scarborough   
 13        M1B       Scarborough   
 14        M2B      Not assigned   
 15        M3B        North York   
 16        M4B         East York   
 17        M4B         East York   
 18        M5B  Downtown Toronto   
 19        M5B  Downtown Toronto   
 20        M6B        North York   
 21        M7B      Not assigned   
 22        M8B      Not assigned   
 23        M9B         Etobicoke   
 24        M9B         Etobicoke   
 25        M9B         Etobicoke   
 26        M9B         Etobi

## The postal codes dataframe is found at index = 0

In [3]:
df = list_of_dfs[0]
df.head()

Unnamed: 0,0,1,2
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


# After identifying and obtaining the correct dataframe of postal codes, we have to promote the first row to the column header

In [4]:
df = df.rename(columns=df.iloc[0]).drop(df.index[0])
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


# Only process the rows that have an assigned borough. Rows whose Borough or Neighbourhood is "Not assigned" are dropped.

In [5]:
df.drop(df.index[df['Borough'] == 'Not assigned'], inplace = True)
df.drop(df.index[df['Neighbourhood'] == 'Not assigned'], inplace = True)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights


# If more than one neighborhood exists in one postal code area, those rows are combined into a single row with the neighborhoods separated by commas

In [6]:
# join the neighbourhood names within each (Postcode, Borough) group
df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].agg(','.join).reset_index()
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
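
The grouping step can be checked on a small example. The sketch below uses a few made-up rows (not the scraped table) and assumes `groupby(...).agg(','.join)` as the way to concatenate the names; it is an illustration rather than the notebook's exact cell.

```python
import pandas as pd

# Toy rows (made up for illustration): one postal code spans two neighbourhoods.
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M4A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Victoria Village'],
})

# Join the neighbourhood names within each (Postcode, Borough) group.
combined = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .agg(','.join)
       .reset_index()
)
print(combined)
```

Each postal code now appears once, and the two `M5A` rows collapse into a single comma-separated entry.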


# If a row has a borough but a "Not assigned" neighborhood, the neighborhood is set to the borough name

In [7]:
df['Neighbourhood'] = df.apply(lambda x: x['Borough'] if x['Neighbourhood'] == 'Not assigned' else x['Neighbourhood'], axis=1)
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"
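
The replacement rule can also be written without a row-wise `apply`. The following is a minimal sketch on made-up rows, assuming `Series.mask` as the vectorized alternative. Note that the rule only has an effect when rows with a "Not assigned" neighbourhood are still present in the dataframe at this point.

```python
import pandas as pd

# Toy rows (illustrative only): the first has a borough but no neighbourhood.
toy = pd.DataFrame({
    'Postcode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Where the neighbourhood is 'Not assigned', take the borough name instead.
unassigned = toy['Neighbourhood'] == 'Not assigned'
toy['Neighbourhood'] = toy['Neighbourhood'].mask(unassigned, toy['Borough'])
print(toy)
```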


In [87]:
df.shape

(102, 3)

# Download the csv file that has the geographical coordinates of each postal code and read it into a dataframe

#### Here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

In [8]:
df_gc = pd.read_csv('Geospatial_Coordinates.csv')
df_gc.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


# Merge the two dataframes to get a single dataframe with the latitude and longitude of every postal code. Before merging, check that the name and data type of the key column match in both dataframes.

In [9]:
print('Data type of Postal code df \n', df.dtypes)
print ()
print('Data types of GeoCode df \n', df_gc.dtypes)

Data type of Postal code df 
 Postcode         object
Borough          object
Neighbourhood    object
dtype: object

Data types of GeoCode df 
 Postal Code     object
Latitude       float64
Longitude      float64
dtype: object


In [10]:
# Before Merging rename the Postal Code in df_gc to Postcode to match with the df
df_gc.rename(columns={'Postal Code':'Postcode'}, inplace = True)
df_gc.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


# After the merge, check the shape. This is the final DataFrame of neighborhoods for the postal codes listed on the Wikipedia page.

In [11]:
# inner join on Postcode: keeps only the postal codes present in both dataframes
df_result = pd.merge(df, df_gc, how='inner', on='Postcode')

In [12]:
df_result.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848


In [13]:
df_result.shape

(103, 5)
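
An inner merge can silently drop postal codes that have no coordinates, or duplicate rows if a key repeats, so comparing row counts alone may not tell the whole story. A minimal sketch of one way to check this, using toy data and the `validate` and `indicator` options of `pd.merge`; the postal code `M9Z` is hypothetical and is included only to show an unmatched key:

```python
import pandas as pd

# Toy inputs (illustrative): 'M9Z' is a hypothetical code with no coordinates.
left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Etobicoke']})
right = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# validate raises MergeError if Postcode is duplicated on either side;
# indicator adds a '_merge' column saying where each row came from.
merged = pd.merge(left, right, on='Postcode', how='left',
                  validate='one_to_one', indicator=True)

# Postal codes that found no match in the coordinates table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

If `unmatched` is empty and the row count equals that of the left dataframe, the inner merge lost nothing.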

## We map, segment and cluster only the neighborhoods in Toronto, so let's slice the merged dataframe and create a new dataframe with the Toronto boroughs only.

In [12]:
toronto_data = df_result[df_result['Borough'].str.contains('Toronto')].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


## Get the Latitude and Longitude of Toronto

In [13]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

address = 'Toronto City, ON'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.7189883, -79.44157.


## Create a map of Toronto with neighborhoods

In [14]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import folium # map rendering library


# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Now, let's get the top venues within a radius of 6 km of the Toronto coordinates obtained above

*This is hidden*
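
The hidden cell is not reproduced here. As a rough sketch only, a request URL for the legacy Foursquare v2 `venues/explore` endpoint could be assembled as follows. The helper name `build_explore_url`, the placeholder credentials, the version date, the 6000 m radius and the limit of 100 are all assumptions (the last two are inferred from the output further down), not the author's actual values.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lng,
                      radius_m, limit, version='20180605'):
    """Hypothetical helper: build a Foursquare v2 venues/explore URL."""
    params = {
        'client_id': client_id,          # placeholder, supply your own
        'client_secret': client_secret,  # placeholder, supply your own
        'v': version,                    # API version date (assumed value)
        'll': '{},{}'.format(lat, lng),  # search centre as "lat,lng"
        'radius': radius_m,              # search radius in metres
        'limit': limit,                  # maximum number of venues returned
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

url = build_explore_url('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET',
                        43.7189883, -79.44157, radius_m=6000, limit=100)
print(url)
```

Keeping the credentials out of the published notebook, as the author does, is the reason the original cell is hidden.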

#### Use the URL generated above to send the GET request and examine the results

In [16]:
import requests # library to handle requests
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d32e17e6c0aa50023aef206'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Toronto',
  'headerFullLocation': 'Toronto',
  'headerLocationGranularity': 'city',
  'totalResults': 236,
  'suggestedBounds': {'ne': {'lat': 43.772988354000056,
    'lng': -79.36699357101782},
   'sw': {'lat': 43.66498824599994, 'lng': -79.51614642898218}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '5810fe9138fabe486b7d632c',
       'name': 'Nordstrom',
       'location': {'address': '3401 Dufferin Street, Unit 500',
        'lat': 43.7260761,
        'lng': -79.4493348,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.7260761,
          'lng': 

In [17]:
# function that extracts the category of the venue
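def get_category_type(row):
    # the column is named 'categories' or 'venue.categories',
    # depending on how the JSON was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue may have no category at all
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']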

In [18]:
# transform the JSON items into a flat pandas dataframe
# (pd.json_normalize replaces the deprecated pandas.io.json.json_normalize; requires pandas >= 1.0)

venues = results['response']['groups'][0]['items']

nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng', 'venue.location.address', 'venue.location.distance']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
print('\n')
print('Only {} venues were returned by Foursquare'.format(nearby_venues.shape[0]), 'within 6kms from Toronto City ')



Only 100 venues were returned by Foursquare within 6kms from Toronto City 


In [19]:
nearby_venues.head(5)

Unnamed: 0,name,categories,lat,lng,address,distance
0,Nordstrom,Clothing Store,43.726076,-79.449335,"3401 Dufferin Street, Unit 500",1006
1,Yorkdale Shopping Centre,Shopping Mall,43.725939,-79.451427,3401 Dufferin St,1107
2,UNIQLO ユニクロ,Clothing Store,43.726446,-79.450564,3401 Dufferin St,1101
3,Crate and Barrel,Furniture / Home Store,43.726584,-79.452661,3401 Dufferin Street,1229
4,United Bakers Dairy Restaurant,Breakfast Spot,43.720043,-79.431095,506 Lawrence Ave. W,850
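
To see what the flattening and column clean-up do, here is a small self-contained sketch. The two venue items are made up and contain only a few of the fields that appear in the response above; it assumes `pd.json_normalize` (pandas 1.0 or later).

```python
import pandas as pd

# Made-up items shaped like entries of results['response']['groups'][0]['items'].
items = [
    {'venue': {'name': 'Example Trattoria',
               'categories': [{'name': 'Italian Restaurant'}],
               'location': {'lat': 43.71, 'lng': -79.40, 'distance': 3500}}},
    {'venue': {'name': 'Example Kiosk',
               'categories': [],  # a venue without any category
               'location': {'lat': 43.72, 'lng': -79.44, 'distance': 120}}},
]

# Nested dicts become dotted column names such as 'venue.location.lat'.
flat = pd.json_normalize(items)

# Keep only the first category name, or None when the list is empty.
flat['venue.categories'] = flat['venue.categories'].apply(
    lambda cats: cats[0]['name'] if cats else None)

# Strip the prefixes: 'venue.location.lat' -> 'lat'.
flat.columns = [col.split('.')[-1] for col in flat.columns]
print(flat[['name', 'categories', 'lat', 'lng', 'distance']])
```

Venues without a category end up with a missing value in `categories`, which matters for the string filters applied later.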


In [20]:
for lat, lng, name, categories in zip(nearby_venues['lat'], nearby_venues['lng'], nearby_venues['name'], nearby_venues['categories']):
    label = '{}, {}'.format(name, categories)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color = 'red' if categories == 'Italian Restaurant' else 'blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [21]:
# select the Italian restaurants by exact category name (same test as in the map above)
italian_res_venue = nearby_venues[nearby_venues['categories'] == 'Italian Restaurant'].reset_index(drop=True)
italian_res_venue

Unnamed: 0,name,categories,lat,lng,address,distance
0,Grazie Ristorante,Italian Restaurant,43.709329,-79.398823,2373 Yonge St.,3603
1,La Vecchia Ristorante,Italian Restaurant,43.710167,-79.399086,2405A Yonge St.,3556
2,Balsamico,Italian Restaurant,43.701505,-79.397162,2029 Yonge St.,4068
3,Cibo Wine Bar,Italian Restaurant,43.711464,-79.39957,2472 Yonge St,3481
4,Marcello's Pizzeria,Italian Restaurant,43.678017,-79.442725,1163 St Clair Avenue West,4561


The table below shows each restaurant type and its count among the venues returned for Toronto

In [22]:
nearby_res = nearby_venues[nearby_venues['categories'].str.contains('Res')].groupby(['categories']).size().reset_index(name='counts')
nearby_res

Unnamed: 0,categories,counts
0,American Restaurant,1
1,Asian Restaurant,1
2,Caribbean Restaurant,1
3,Falafel Restaurant,1
4,Fast Food Restaurant,1
5,French Restaurant,1
6,Indian Restaurant,1
7,Indonesian Restaurant,1
8,Italian Restaurant,5
9,Japanese Restaurant,1


So now we have the restaurants among the 100 venues that Foursquare returned within a few kilometers of the Toronto coordinates, and we know which of them are Italian restaurants.

This concludes the data gathering phase. We are now ready to use this data for the analysis and to report on optimal locations for a new Italian restaurant.

