# Capstone Project

This notebook is for the Capstone project. We will use pandas in Python, along with various machine learning techniques, to deliver the final outcome.

<b> Install BeautifulSoup4 for data scraping

In [3]:
# Installing Dependencies

!pip install requests bs4 pandas

print("beautifulsoup4 is SUCCESSFULLY installed !")

Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Stored in directory: /home/dsxuser/.cache/pip/wheels/a0/b0/b2/4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1
beautifulsoup4 is SUCCESSFULLY installed !


In [4]:
from bs4 import BeautifulSoup # magical tool for parsing html data
from urllib.request import urlopen # for making standard html requests

import requests
import json # for parsing data
import pandas as pd # premier library for data organization

<h2>1. Web Scraping and Data Preparation

<b> Extract the table data from the Wikipedia HTML page

In [6]:
# Request html page from our target URL

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

table_data = soup.find('table')
#print(table_data.prettify())

<b>Capture the table data cells from the HTML page in a DataFrame

In [7]:
# Collect every table row after the header row and label the three columns.

data = []
for tr in table_data.find_all('tr')[1:]:
    row_data = tr.find_all('td')
    data.append([cell.text for cell in row_data])
df_data = pd.DataFrame(data, columns = ['PostalCode', 'Borough', 'Neighborhood'])
df_data.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A\n,Not assigned\n,\n
1,M2A\n,Not assigned\n,\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"


<b> Data cleanup: remove the newline character (\n) from the dataset.

In [8]:
df_data['PostalCode'] = df_data['PostalCode'].str.split('\n', expand = True)[0]
df_data['Borough'] = df_data['Borough'].str.split('\n', expand = True)[0]
df_data['Neighborhood'] = df_data['Neighborhood'].str.split('\n', expand = True)[0]

df_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
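
The cell above removes the trailing newline by splitting on `\n` and keeping the first piece. A minimal alternative sketch, assuming the only unwanted characters are leading or trailing whitespace, uses `Series.str.strip()`. The frame `df_demo` is a stand-in built from values shown in the output above, not the scraped table itself.

```python
import pandas as pd

# Toy rows shaped like the scraped cells (values copied from the output above).
df_demo = pd.DataFrame(
    {
        "PostalCode": ["M1A\n", "M3A\n"],
        "Borough": ["Not assigned\n", "North York\n"],
        "Neighborhood": ["\n", "Parkwoods\n"],
    }
)

# Strip leading/trailing whitespace (including "\n") from every text column.
for col in ["PostalCode", "Borough", "Neighborhood"]:
    df_demo[col] = df_demo[col].str.strip()

print(df_demo)
```

Note that a cell containing only `\n` becomes an empty string, not the text "Not assigned".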


<b>Only process the cells that have an assigned borough (drop rows where Borough = "Not assigned")

In [9]:
# Clean the dataset by removing records whose Borough is "Not assigned"

df_cleaned_data = df_data[df_data.Borough != 'Not assigned']
df_cleaned_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


<b>Check whether any postal code has multiple records; if so, merge the records

In [10]:
# Check whether any postal code has multiple records.

duplicateRowsDF = df_cleaned_data[df_cleaned_data.duplicated(['PostalCode'])]

Multi_records_postalCode = duplicateRowsDF.shape[0]

# For postal codes with multiple records, concatenate the 'Neighborhood' values and keep only one record in the DataFrame

if Multi_records_postalCode == 0:
    print("No postal code has multiple records to merge")
else:
    df_cleaned_data = df_cleaned_data.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()

No postal code has multiple records to merge
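
Because no postal code was duplicated in this run, the merge branch never executed. The sketch below shows what that branch would do, using hypothetical rows in which postal code M5A is deliberately split across two records. `sort=False` is an addition here to keep the groups in their original order.

```python
import pandas as pd

# Hypothetical rows: postal code M5A is deliberately split across two records.
df_dupes = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M3A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
    }
)

# Join the neighborhoods of each (PostalCode, Borough) pair into one string.
merged = (
    df_dupes.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)
print(merged)
```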


<b>If a cell has a borough but a "Not assigned" neighborhood, assign the borough name to the neighborhood.

In [11]:
# If Neighborhood == 'Not assigned' then Neighborhood = Borough

df_cleaned_data.loc[df_cleaned_data['Neighborhood'] == 'Not assigned', 'Neighborhood'] = df_cleaned_data['Borough']

df_cleaned_data.head(10)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"
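
The "A value is trying to be set on a copy of a slice" warning above appears because `df_cleaned_data` was created by filtering `df_data`, so pandas cannot tell whether the assignment is meant to reach the original frame. One common way to avoid it, sketched here on hypothetical rows, is to take an explicit `.copy()` when filtering. The names `df_demo` and `df_kept` are placeholders for this example only.

```python
import pandas as pd

# Hypothetical rows; M7A is given a "Not assigned" neighborhood for illustration.
df_demo = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M3A", "M7A"],
        "Borough": ["Not assigned", "North York", "Downtown Toronto"],
        "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
    }
)

# .copy() makes the filtered frame independent of df_demo, so later
# assignments are unambiguous.
df_kept = df_demo[df_demo["Borough"] != "Not assigned"].copy()

# Where the neighborhood is missing, fall back to the borough name.
missing = df_kept["Neighborhood"] == "Not assigned"
df_kept.loc[missing, "Neighborhood"] = df_kept.loc[missing, "Borough"]
print(df_kept)
```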


<b>Save the cleaned data to a CSV file and report the size of the DataFrame.

In [12]:
df_cleaned_data.to_csv('Final_Cleaned_dataset.csv', index = False)

df_cleaned_data.shape

(103, 3)

# 2. Get the Geo coordinates for each neighborhood.

<b> Let's get all the geo libraries installed

In [13]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transforming a JSON file into a pandas DataFrame
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.1.1g             |       h516909a_0         2.1 MB  conda-forge
    ca-certificates-2020.4.5.1 |       hecc5488_0         146 KB  conda-forge
    geopy-1.22.0               |     pyh9f0ad1d_0          63 KB  conda-forge
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    certifi-2020.4.5.1         |   py36h9f0ad1d_0         151 KB  conda-forge
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

    geographiclib:   1.50-py_0           conda-forge
    geopy:          

<b> The geocoder code below (Nominatim with the user agent "foursquare_agent") does not return a location for every neighborhood

In [14]:
geolocator = Nominatim(user_agent="foursquare_agent")

df_NeighData = pd.read_csv('Final_Cleaned_dataset.csv')

Rec_Count = df_NeighData.shape[0]

for i in range(Rec_Count):
    address = df_NeighData.Neighborhood[i]
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    print(i,address,latitude,longitude)

0 Parkwoods 37.8567738 -122.22068778004532
1 Victoria Village 43.732658 -79.3111892
2 Regent Park, Harbourfront 43.64076885 -79.37989177980148
3 Lawrence Manor, Lawrence Heights 43.7227784 -79.4509332


AttributeError: 'NoneType' object has no attribute 'latitude'
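
The loop above stops with an `AttributeError` because the geocoder returned `None` for an address it could not resolve, and the code then read `.latitude` from it. The first row also shows a second problem: "Parkwoods" resolved to a longitude near -122, which is not in Toronto. A minimal sketch of a guard for the first problem follows. `safe_coordinates`, `FakeLocation`, and `fake_geocode` are hypothetical names, and the stub stands in for a real geocoder so the example runs offline.

```python
from typing import Callable, Optional, Tuple


def safe_coordinates(
    geocode: Callable[[str], object], address: str
) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude), or None when the geocoder finds nothing."""
    location = geocode(address)
    if location is None:
        return None
    return location.latitude, location.longitude


class FakeLocation:
    """Stand-in for a geocoder result (hypothetical, for offline testing)."""

    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(address):
    """Stub geocoder: knows one address and returns None for everything else."""
    known = {"Victoria Village": FakeLocation(43.732658, -79.3111892)}
    return known.get(address)


for name in ["Victoria Village", "Some unknown place"]:
    print(name, safe_coordinates(fake_geocode, name))
```

With geopy, the same helper could be called as `safe_coordinates(geolocator.geocode, address)`. Appending a city name to the query might reduce mismatches such as the Parkwoods result, but that is an untested assumption here.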

<b> Use the geospatial data file for the latitude and longitude instead

In [15]:
!wget -O GeoCord.csv http://cocl.us/Geospatial_data/

--2020-05-30 06:09:31--  http://cocl.us/Geospatial_data/
Resolving cocl.us (cocl.us)... 158.85.108.83, 169.48.113.194, 158.85.108.86
Connecting to cocl.us (cocl.us)|158.85.108.83|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cocl.us/Geospatial_data/ [following]
--2020-05-30 06:09:31--  https://cocl.us/Geospatial_data/
Connecting to cocl.us (cocl.us)|158.85.108.83|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-05-30 06:09:32--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 185.235.236.197
Connecting to ibm.box.com (ibm.box.com)|185.235.236.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-05-30 06:09:33--  https://ibm.box.com/p

In [16]:
df_geospatial = pd.read_csv('GeoCord.csv')

df_geospatial.head(10)

Final_Dataset = pd.merge(df_NeighData, df_geospatial, left_on='PostalCode', right_on='Postal Code').drop(['Postal Code'], axis = 1)
Final_Dataset.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
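
`pd.merge` defaults to an inner join, so a postal code missing from the geospatial file would be dropped silently. The sketch below, on hypothetical rows, shows one way to make such gaps visible with the `how`, `validate`, and `indicator` parameters. The postal code "M9Z" is invented for the example.

```python
import pandas as pd

# Hypothetical inputs; "M9Z" has no matching coordinates on purpose.
neigh = pd.DataFrame(
    {
        "PostalCode": ["M3A", "M4A", "M9Z"],
        "Borough": ["North York", "North York", "Hypothetical"],
    }
)
geo = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A"],
        "Latitude": [43.753259, 43.725882],
        "Longitude": [-79.329656, -79.315572],
    }
)

# A left join keeps every neighborhood row; validate raises if a key repeats;
# indicator adds a "_merge" column that says where each row came from.
checked = pd.merge(
    neigh,
    geo,
    left_on="PostalCode",
    right_on="Postal Code",
    how="left",
    validate="one_to_one",
    indicator=True,
)
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

In this notebook both the cleaned dataset and the merged result have 103 rows, so no postal code was lost.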


<b> Save the Final Dataset with Geo coordinates

In [17]:
Final_Dataset.to_csv("Final_Dataset_with_Geo_coordinates.csv", index = False)
Final_Dataset.shape

(103, 5)

# 3. Explore and cluster the neighborhoods in Toronto

In [18]:
# Read the Final Dataset Saved as Final_Dataset_with_Geo_coordinates.csv

import pandas as pd
import numpy as np

df_FinalDataset = pd.read_csv('Final_Dataset_with_Geo_coordinates.csv')
df_FinalDataset.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


<b>Use the geopy library to get the latitude and longitude values of the city of Toronto.

In [19]:
address = 'Toronto'

#geolocator = Nominatim(user_agent="ny_explorer")
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are:', latitude, longitude)

The geographical coordinates of Toronto are: 43.6534817 -79.3839347


<b>Create a map of Toronto with neighborhoods superimposed on top.

In [20]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_FinalDataset['Latitude'], df_FinalDataset['Longitude'], df_FinalDataset['Borough'], df_FinalDataset['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    #print(label)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

<b>Simplify the above map and segment and cluster only the neighborhoods in Downtown Toronto

In [21]:
df_Toronoto_Downtown = df_FinalDataset[df_FinalDataset['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
df_Toronoto_Downtown.shape

(19, 5)

<b>Get the Geo Coordinates of Toronto Downtown

In [23]:
address = 'Downtown Toronto'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geo coordinates of Downtown Toronto are:', latitude, longitude)

The geo coordinates of Downtown Toronto are: 43.6541737 -79.38081164513409


In [24]:
# create map of Toronto Downtown using latitude and longitude values
map_torontodowntown = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_Toronoto_Downtown['Latitude'], df_Toronoto_Downtown['Longitude'], df_Toronoto_Downtown['Borough'], df_Toronoto_Downtown['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_torontodowntown)  

In [25]:
map_torontodowntown

<b>Define Foursquare credentials

In [26]:
CLIENT_ID = 'DHN0LHBZZCEPHH31FNDPKBC4AFSKYVC3NVZM2AJRB4MVE0C5' # your Foursquare ID
CLIENT_SECRET = 'PQHOCSFLIZVOBZ2IWZWBVCAZV2NI2X2U1GZPR32BRKYJSUTD' # your Foursquare Secret
VERSION = '20200521'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: DHN0LHBZZCEPHH31FNDPKBC4AFSKYVC3NVZM2AJRB4MVE0C5
CLIENT_SECRET:PQHOCSFLIZVOBZ2IWZWBVCAZV2NI2X2U1GZPR32BRKYJSUTD
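
Hard-coding API keys in a cell and printing them stores them in the saved notebook. A minimal sketch of one alternative is to read them from environment variables and mask them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_credentials` and `mask` are assumptions made for this example, not names required by Foursquare.

```python
import os


def load_credentials(env=os.environ):
    """Read Foursquare credentials from environment variables.

    The variable names are assumptions made for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None


def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# Demonstration with a stand-in mapping instead of the real environment.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "example-id",
    "FOURSQUARE_CLIENT_SECRET": "example-secret",
}
client_id, client_secret = load_credentials(demo_env)
print("CLIENT_ID:", mask(client_id))
```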


<b>Explore the first Neighborhood in our Dataset

In [28]:
Neighborhood = df_Toronoto_Downtown.loc[0,'Neighborhood']

neighborhood_latitude = df_Toronoto_Downtown.loc[0,'Latitude']
neighborhood_longitude = df_Toronoto_Downtown.loc[0,'Longitude']

print('{} neighborhood: longitude is {} and latitude is {}.'.format(Neighborhood, neighborhood_longitude, neighborhood_latitude))

Regent Park, Harbourfront neighborhood: longitude is -79.3606359 and latitude is 43.6542599.


<b>Get up to 30 venues (the value of LIMIT) around Regent Park within a radius of 500 meters.

In [29]:
radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

# display URL
url

'https://api.foursquare.com/v2/venues/explore?&client_id=DHN0LHBZZCEPHH31FNDPKBC4AFSKYVC3NVZM2AJRB4MVE0C5&client_secret=PQHOCSFLIZVOBZ2IWZWBVCAZV2NI2X2U1GZPR32BRKYJSUTD&v=20200521&ll=43.6542599,-79.3606359&radius=500&limit=30'
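
The request URL above is assembled with `str.format`. A small sketch of the same idea using `urllib.parse.urlencode` follows; it keeps the parameters in a dictionary and percent-encodes them. `build_explore_url` is a hypothetical helper, and "ID" and "SECRET" are placeholders, not real credentials.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL from named query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


demo_url = build_explore_url("ID", "SECRET", "20200521", 43.6542599, -79.3606359, 500, 30)
print(demo_url)
```

The comma in `ll` is emitted as `%2C`, which is the percent-encoded form of the same character. Passing the dictionary to `requests.get(base_url, params=params)` is another option that avoids building the string by hand.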

In [30]:
# Examine the Result set
results = requests.get(url).json()
#results

In [31]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [32]:
# clean the json and structure it into a pandas dataframe.

venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']

nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(20)
nearby_venues.shape

(30, 4)

In [33]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

30 venues were returned by Foursquare.


<b>Explore Neighborhoods in Toronto Downtown

<b>Create a function to repeat the same process for all the neighborhoods in Downtown Toronto

In [35]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
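
`getNearbyVenues` reads `v['venue']['categories'][0]['name']`, which would raise an `IndexError` for a venue with an empty category list. The sketch below shows one way to tolerate that case. `venue_rows` is a hypothetical helper, and `sample_items` only mimics the keys the function above reads; it is not a real API response.

```python
def venue_rows(name, lat, lng, items):
    """Flatten explore-style items into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when a venue has no category instead of failing on [0].
        category = categories[0]["name"] if categories else None
        rows.append(
            (
                name,
                lat,
                lng,
                venue["name"],
                venue["location"]["lat"],
                venue["location"]["lng"],
                category,
            )
        )
    return rows


# Sample items; the second venue is invented and has no category.
sample_items = [
    {
        "venue": {
            "name": "Impact Kitchen",
            "location": {"lat": 43.656369, "lng": -79.35698},
            "categories": [{"name": "Restaurant"}],
        }
    },
    {
        "venue": {
            "name": "Hypothetical Venue",
            "location": {"lat": 43.65, "lng": -79.36},
            "categories": [],
        }
    },
]
rows = venue_rows("Regent Park, Harbourfront", 43.65426, -79.360636, sample_items)
print(rows)
```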

<b>Run the above function on each neighborhood and create a new dataframe called Tornoto_Downtown_Venues.

In [36]:
Tornoto_Downtown_Venues = getNearbyVenues(names=df_Toronoto_Downtown['Neighborhood'],
                                   latitudes=df_Toronoto_Downtown['Latitude'],
                                   longitudes=df_Toronoto_Downtown['Longitude']
                                  )

In [37]:
# Count of Distinct Neighborhood in Toronto Downtown 

len(Tornoto_Downtown_Venues['Neighborhood'].unique())

19

In [38]:
is_restaurant = Tornoto_Downtown_Venues['Venue Category'].str.contains('Restaurant')
df_RstaurantOnly = Tornoto_Downtown_Venues[is_restaurant]
df_AllOthers = Tornoto_Downtown_Venues[~is_restaurant]

df_RstaurantOnly.head()
#df_AllOthers.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
5,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
21,"Regent Park, Harbourfront",43.65426,-79.360636,Cluny Bistro & Boulangerie,43.650565,-79.357843,French Restaurant
31,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Mercatto,43.660391,-79.387664,Italian Restaurant
36,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Como En Casa,43.66516,-79.384796,Mexican Restaurant
42,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,Tokyo Sushi,43.665885,-79.386977,Sushi Restaurant


In [39]:
# Count the unique venue categories in each subset

print('There are {} unique restaurant categories.'.format(len(df_RstaurantOnly['Venue Category'].unique())))
print('There are {} unique non-restaurant categories.'.format(len(df_AllOthers['Venue Category'].unique())))

There are 28 unique restaurant categories.
There are 120 unique non-restaurant categories.


<b>Analyse Each Neighborhood for Restaurant and Other Facilities

In [41]:
# one hot encoding for both Restaurants only and All Others venue Category Dataset

RestaurantOnly_onehot = pd.get_dummies(df_RstaurantOnly[['Venue Category']], prefix="", prefix_sep="")
AllOthers_onehot = pd.get_dummies(df_AllOthers[['Venue Category']], prefix="", prefix_sep="")

RestaurantOnly_onehot = pd.concat([df_RstaurantOnly['Neighborhood'], RestaurantOnly_onehot], axis = 1)
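# The non-restaurant categories include one that is itself named 'Neighborhood'; drop that
# one-hot column so it does not clash with the neighborhood-name column added here.
AllOthers_onehot = pd.concat([df_AllOthers['Neighborhood'], AllOthers_onehot.drop(['Neighborhood'], axis = 1)], axis = 1)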

#RestaurantOnly_onehot.head()
RestaurantOnly_onehot[['Neighborhood']]

Unnamed: 0,Neighborhood
5,"Regent Park, Harbourfront"
21,"Regent Park, Harbourfront"
31,"Queen's Park, Ontario Provincial Government"
36,"Queen's Park, Ontario Provincial Government"
42,"Queen's Park, Ontario Provincial Government"
48,"Queen's Park, Ontario Provincial Government"
69,"Garden District, Ryerson"
72,"Garden District, Ryerson"
78,"Garden District, Ryerson"
79,"Garden District, Ryerson"


In [42]:
RestaurantOnly_onehot.shape

RestaurantOnly_onehot.head()
AllOthers_onehot.head()

Unnamed: 0,Neighborhood,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,...,Tailor Shop,Tanning Salon,Tea Room,Theater,Trail,Train Station,Video Game Store,Wine Bar,Wings Joint,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
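
One-hot encoding turns each venue category into its own 0/1 column, so summing those columns per neighborhood gives a count of venues per category. A toy sketch of that step follows. The three rows are invented for illustration; only the neighborhood names come from this notebook.

```python
import pandas as pd

# Invented venue rows for two neighborhoods.
venues = pd.DataFrame(
    {
        "Neighborhood": ["Christie", "Christie", "Rosedale"],
        "Venue Category": ["Coffee Shop", "Park", "Park"],
    }
)

# One column per category, one row per venue.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="")
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Summing the indicator columns per neighborhood counts venues per category.
counts = onehot.groupby("Neighborhood").sum()
print(counts)
```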


<b>Let's identify the neighborhoods and the restaurant businesses around them

In [44]:
# Group the one-hot dataset by neighborhood to count the restaurants per category in each neighborhood
Neigh_Restaurant_groupby = RestaurantOnly_onehot.groupby(['Neighborhood']).sum().reset_index()


# Get the total number of restaurants in each neighborhood (sum only the category columns)
Neigh_Restaurant_groupby['RestaurantCnt'] = Neigh_Restaurant_groupby.drop(columns=['Neighborhood']).sum(axis=1)

# Create the neighborhood restaurant count table, sorted in ascending order
Neighborhood_Restaurant = Neigh_Restaurant_groupby[['Neighborhood','RestaurantCnt']].sort_values(by = 'RestaurantCnt').reset_index(drop=True)
Neighborhood_Restaurant.head()

Unnamed: 0,Neighborhood,RestaurantCnt
0,Christie,2
1,"Regent Park, Harbourfront",2
2,"Harbourfront East, Union Station, Toronto Islands",4
3,"Queen's Park, Ontario Provincial Government",4
4,"Toronto Dominion Centre, Design Exchange",5


<b>Find the other venues, or happening places (other than restaurants), around each neighborhood

In [45]:
# Group the one-hot dataset by neighborhood to count all other venues (other than restaurants) in each neighborhood
# This identifies the happening places so that neighborhoods can be rated accordingly

Neighborhood_AllOthers_groupby = AllOthers_onehot.groupby(['Neighborhood']).sum().reset_index()

# Get the total number of other venues in each neighborhood (sum only the category columns)
Neighborhood_AllOthers_groupby['OthVenueCnt'] = Neighborhood_AllOthers_groupby.drop(columns=['Neighborhood']).sum(axis=1)

# Create the neighborhood other-venue count table, sorted in ascending order
Neighborhood_AllOther = Neighborhood_AllOthers_groupby[['Neighborhood','OthVenueCnt']].sort_values(by = 'OthVenueCnt').reset_index(drop=True)
Neighborhood_AllOther.head()

Unnamed: 0,Neighborhood,OthVenueCnt
0,Rosedale,4
1,"CN Tower, King and Spadina, Railway Lands, Har...",14
2,Christie,15
3,"University of Toronto, Harbord",20
4,"Richmond, Adelaide, King",20


<b>Calculate the Restaurant Index from the number of other venues and the number of existing restaurants in each neighborhood

In [47]:
print('Neighborhood_Restaurant_index size is =',Neighborhood_Restaurant.shape)
print('Neighborhood_AllOther_index size is =',Neighborhood_AllOther.shape)

Final_Analysis_Data = Neighborhood_AllOther.set_index('Neighborhood').join(Neighborhood_Restaurant.set_index('Neighborhood')).reset_index()

# Replace missing values with 0 (neighborhoods with no restaurants have no row in Neighborhood_Restaurant)
Final_Analysis_Data['RestaurantCnt'] = Final_Analysis_Data['RestaurantCnt'].fillna(0).astype(int)

# Calculate the restaurant weightage index in comparison with all other venues around
Final_Analysis_Data['Restaurant_index'] = 1 - (Final_Analysis_Data['RestaurantCnt']/Final_Analysis_Data['OthVenueCnt'])

Final_Analysis_Data = Final_Analysis_Data.sort_values(by = 'Restaurant_index', ascending = False).reset_index(drop = True)

Neighborhood_Restaurant_index size is = (17, 2)
Neighborhood_AllOther_index size is = (19, 2)
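
The index computed above is `Restaurant_index = 1 - RestaurantCnt / OthVenueCnt`, so it equals 1.0 when a neighborhood has no restaurants and falls as restaurants become more numerous relative to other venues. A short worked sketch follows, using count pairs taken from the results table in this notebook. The zero-denominator guard is an addition for the example and is not part of the notebook's calculation.

```python
def restaurant_index(restaurant_count, other_venue_count):
    """1 - restaurants / other venues; None when there are no other venues."""
    if other_venue_count == 0:
        return None
    return 1 - restaurant_count / other_venue_count


# (OthVenueCnt, RestaurantCnt) pairs taken from the results table in this notebook.
examples = {
    "Rosedale": (4, 0),
    "Regent Park, Harbourfront": (28, 2),
    "Christie": (15, 2),
}
for name, (others, restaurants) in examples.items():
    print(name, round(restaurant_index(restaurants, others), 6))
```

The index turns negative when restaurants outnumber the other venues, so it is a relative ranking and not a proportion.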


<b>Top 5 neighborhoods best suited to starting a restaurant business

In [48]:
# Select the top 5 locations suitable for a restaurant business based on their Restaurant Index values
Final_Analysis_Data.head()

Unnamed: 0,Neighborhood,OthVenueCnt,RestaurantCnt,Restaurant_index
0,Rosedale,4,0,1.0
1,"CN Tower, King and Spadina, Railway Lands, Har...",14,0,1.0
2,"Regent Park, Harbourfront",28,2,0.928571
3,Christie,15,2,0.866667
4,"Queen's Park, Ontario Provincial Government",26,4,0.846154


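<b>Based on the above analysis, my recommendation for a new restaurant business is "Regent Park, Harbourfront": it has many happening places around it (28 other venues) but only 2 restaurants among the up to 30 venues returned. Rosedale and the CN Tower area score a higher index (1.0) but have fewer other venues nearby (4 and 14).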