# Searching for optimal locations for Cloud Kitchens in Mumbai

The project is divided into 4 stages:
- A. Generating Datasets
- B. Search Space Creation
- C. Optimal Location and Location Set Identification
- D. Cuisine recommendations for best locations

###### Approach:
Our analysis is based on computing the distance between two location coordinates, and the search is optimised using this distance as the key parameter.

Distance is defined as the sum of the x (W-E) and y (N-S) offsets between two locations, i.e. the Manhattan distance.
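As a minimal sketch of this distance, the function below converts the latitude and longitude differences into kilometres and adds them. The Earth radius constant, the cosine correction for the W-E offset and the function name are assumptions of this sketch, not necessarily what the project notebooks use.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def manhattan_distance_km(lat1, lng1, lat2, lng2):
    """Sum of the N-S and W-E offsets between two points, in kilometres."""
    # N-S offset: latitude difference converted to arc length
    ns = math.radians(abs(lat2 - lat1)) * EARTH_RADIUS_KM
    # W-E offset: longitude difference, shrunk by cos(mean latitude)
    mean_lat = math.radians((lat1 + lat2) / 2)
    we = math.radians(abs(lng2 - lng1)) * EARTH_RADIUS_KM * math.cos(mean_lat)
    return ns + we


# Colaba and Marine Lines coordinates, taken from the ward table in section A
d = manhattan_distance_km(18.949594, 72.838152, 18.946385, 72.825268)
print(round(d, 2), "km")
```

Over the short distances within a city this flat approximation should be close enough for ranking locations against each other.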

The analysis begins by rating each venue location as follows:
    rating = (minimum service time for a given set of city centres)/(service time of a location for given set of city centres)
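The rating formula can be illustrated with a few lines of Python. The venue names and service times below are hypothetical and only show how the ratio behaves: the fastest location gets a rating of 1 and slower ones get proportionally less.

```python
def rate_locations(service_times):
    """rating = minimum service time / service time of each location."""
    best = min(service_times.values())
    return {name: best / t for name, t in service_times.items()}


# hypothetical service times (in minutes) for one set of city centres
ratings = rate_locations({"venue_a": 20.0, "venue_b": 25.0, "venue_c": 40.0})
print(ratings)
```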

This data will be used to train a model that can generate similar ratings for any point using the following three attributes,
- Number of city centres within serviceable distance
- Average distance between location and the city centres
- Average potential serviceable population
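The three attributes above could be computed for a single point roughly as follows. This is a sketch under stated assumptions: the 5 km serviceable limit, the reading of "average potential serviceable population" as the mean population of the reachable city centres, and the helper names are all hypothetical.

```python
import math


def manhattan_km(p, q):
    """Approximate N-S plus W-E offset between two (lat, lng) points, in km."""
    ns = math.radians(abs(p[0] - q[0])) * 6371.0
    we = (math.radians(abs(p[1] - q[1])) * 6371.0
          * math.cos(math.radians((p[0] + q[0]) / 2)))
    return ns + we


def point_features(point, centres, max_km=5.0):
    """centres: iterable of (lat, lng, population); max_km is a hypothetical limit."""
    reachable = []
    for lat, lng, population in centres:
        d = manhattan_km(point, (lat, lng))
        if d <= max_km:
            reachable.append((d, population))
    n = len(reachable)
    if n == 0:
        return {"n_centres": 0, "avg_distance_km": None, "avg_population": 0.0}
    return {
        "n_centres": n,
        "avg_distance_km": sum(d for d, _ in reachable) / n,
        "avg_population": sum(pop for _, pop in reachable) / n,
    }


# ward coordinates and populations taken from the ward table in section A
centres = [
    (18.949594, 72.838152, 210847),  # Colaba
    (18.946385, 72.825268, 202922),  # Marine Lines
    (19.061022, 72.847717, 580835),  # Khar/Santacruz
]
features = point_features((18.949594, 72.838152), centres)
print(features)
```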

Then we generate a search space of points using the city centre location coordinates, and find points with the best ratings.

It is assumed that these points represent the locations with the maximum potential serviceable population for a given set of city centres.

After that, the top locations are sorted and searched for the combination of locations that serves the largest proportion of the population of the city centres.
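One way to picture this combination search is a brute-force scan over all sets of `k` candidate locations, scoring each set by the share of population it covers. The coverage mapping and location names below are hypothetical (the populations come from the ward table), and the project notebooks may use a different search strategy.

```python
from itertools import combinations


def best_location_set(coverage, population, k):
    """Return the k locations covering the largest share of total population.

    coverage:   location name -> set of city centres it can serve
    population: city centre name -> population
    """
    total = sum(population.values())
    best_combo, best_share = None, -1.0
    for combo in combinations(sorted(coverage), k):
        # a centre served by several locations is only counted once
        served = set().union(*(coverage[loc] for loc in combo))
        share = sum(population[c] for c in served) / total
        if share > best_share:
            best_combo, best_share = combo, share
    return best_combo, best_share


population = {"Colaba": 210847, "Marine Lines": 202922,
              "Grant Road": 382841, "Byculla": 440335}
# hypothetical coverage of three candidate locations
coverage = {
    "loc_1": {"Colaba", "Marine Lines"},
    "loc_2": {"Marine Lines", "Grant Road"},
    "loc_3": {"Grant Road", "Byculla"},
}
combo, share = best_location_set(coverage, population, k=2)
print(combo, share)
```

A brute-force scan is only practical for a short list of top locations, which is why the candidates are ranked first.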

Once we have obtained the list of best locations, we simply check the venue categories available at all city centres near our best locations.

We use the Foursquare location data to compute the average venue category at each city centre, and then use the population of the city centres to compute the weighted average for a best location.
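The weighting step can be sketched with pandas as below. The venue rows are a tiny illustrative sample rather than the real Foursquare results, and reading "average venue category" as the share of each category per city centre is an assumption of this sketch.

```python
import pandas as pd

# illustrative sample of venues per city centre (not the real dataset)
venues = pd.DataFrame({
    "Neighborhood": ["Colaba", "Colaba", "Marine Lines", "Marine Lines"],
    "Venue Category": ["Indian Restaurant", "Chinese Restaurant",
                       "Indian Restaurant", "Indian Restaurant"],
})
# populations taken from the ward table in section A
population = pd.Series({"Colaba": 210847, "Marine Lines": 202922})

# share of each category among the venues of every city centre
shares = pd.crosstab(venues["Neighborhood"], venues["Venue Category"],
                     normalize="index")

# population-weighted average of those shares across the city centres
weights = population.reindex(shares.index)
weighted = shares.mul(weights, axis=0).sum() / weights.sum()

top = weighted.sort_values(ascending=False)
print(top)
```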

###### Key comments
- We have two datasets for city centres, but one lacks population data, so it is dropped from further analysis
- We still use that dataset to generate venue locations from the Foursquare API, as it yields more results
- Venue category (cuisine) recommendations are based on Foursquare location data
- Functions for all computations will be available in related notebooks



## A. Generating Datasets

##### 1. Importing libraries

In [1]:
from bs4 import BeautifulSoup # this module helps in web scraping
import requests  # this module helps us to download a web page

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


##### 2. Foursquare location data access details

In [2]:
CLIENT_ID = 'EMUIZRLWLDLHUSBVLVDFE0EZD4MKTGSB2CUBNBT3DDWZY03U' # your Foursquare ID
CLIENT_SECRET = 'MJUZCPBLBMT00WJZOVVYXELAXD3AVSKIWRV1WQH13SDL34FL' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 1000 # A default Foursquare API limit value


print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: EMUIZRLWLDLHUSBVLVDFE0EZD4MKTGSB2CUBNBT3DDWZY03U
CLIENT_SECRET:MJUZCPBLBMT00WJZOVVYXELAXD3AVSKIWRV1WQH13SDL34FL


##### 3. Creating dataframe of City Centres in Mumbai

As mentioned before, we have two sources for city centre location coordinates, and only one source has population details.

Let us first create both datasets.

###### 3.1. Creating dataframe of Mumbai neighborhoods

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Mumbai'

results = requests.get(url).text

In [4]:
soup = BeautifulSoup(results,"html5lib")

In [5]:
table_data = soup.find("table")

In [6]:
rows = []

for row in table_data.find_all("tr"):

    row_data = row.find_all("td")

    # header rows contain only <th> cells, so row_data is empty for them
    if row_data:
        rows.append({'Neighborhood': row_data[0].text.strip(),
                     'Location': row_data[1].text.strip(),
                     'Lat': row_data[2].text.strip(),
                     'Long': row_data[3].text.strip()})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of dicts
mumbai_neighborhood = pd.DataFrame(rows, columns=['Neighborhood', 'Location', 'Lat', 'Long'])

mumbai_neighborhood.head()

Unnamed: 0,Neighborhood,Location,Lat,Long
0,Amboli,"Andheri,Western Suburbs",19.1293,72.8434
1,"Chakala, Andheri",Western Suburbs,19.111388,72.860833
2,D.N. Nagar,"Andheri,Western Suburbs",19.124085,72.831373
3,Four Bungalows,"Andheri,Western Suburbs",19.124714,72.82721
4,Lokhandwala,"Andheri,Western Suburbs",19.130815,72.82927


###### 3.2. Creating dataframe of Mumbai Wards

This information has been scraped from multiple websites, including pincode details. A mapping of pincodes to latitude-longitude pairs was used to add location coordinates to each ward.
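The pincode mapping step can be pictured as a left join on the pincode column. The sketch below uses two rows from the ward table; the `pin_lookup` frame stands in for the mapping file, which is not shown here, so its name and layout are assumptions.

```python
import pandas as pd

# two wards from the table below, before coordinates are attached
wards = pd.DataFrame({"Area": ["Colaba", "Marine Lines"],
                      "Pincode": [400001, 400002]})

# hypothetical stand-in for the pincode -> coordinates mapping file
pin_lookup = pd.DataFrame({"Pincode": [400001, 400002],
                           "Lat": [18.949594, 18.946385],
                           "Long": [72.838152, 72.825268]})

# left join keeps every ward; validate guards against duplicate pincodes in the lookup
wards_with_coords = wards.merge(pin_lookup, on="Pincode", how="left",
                                validate="many_to_one")
print(wards_with_coords)
```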

In [7]:
city_centers = pd.read_csv('m_ward_data.csv')

In [8]:
city_centers

Unnamed: 0,Ward,Area,Land Area (SKM),Households,Population,Density per Square Kilometer,Ward.1,Pincode,Lat,Long,Unnamed: 10
0,A,Colaba,13,43661,210847,16868,A,400001,18.949594,72.838152,True
1,B,Sanhurst Road,3,27225,140633,56253,B,400009,18.95702,72.842004,True
2,C,Marine Lines,2,39657,202922,112734,C,400002,18.946385,72.825268,True
3,D,Grant Road,7,79131,382841,58006,D,400007,18.958458,72.814963,True
4,E,Byculla,7,80970,440335,59505,E,400008,18.969439,72.825823,True
5,F South,Parel,14,80777,396122,28294,F South,400012,19.0008,72.83085,True
6,F North,Matunga,13,112765,524393,40338,F North,400019,19.028744,72.844147,True
7,G South,Elphinstone,10,92525,457931,45793,G South,400018,19.016674,72.816659,True
8,G North,Dadar/Plaza,9,120643,582007,63957,G North,400028,19.056294,72.843076,True
9,H East,Khar/Santacruz,14,114423,580835,43025,H East,400051,19.061022,72.847717,True


##### 4. Generating Venue Data

We will use the two city centre datasets to generate two sets of venue data from the Foursquare API, visually inspect them using Folium maps and choose one to move forward with in our analysis.

Note that we have already decided which city centre dataset to use (the ward data, since it includes population), so only that dataset is plotted during the visual inspection.

In [9]:
catid = '4d4b7105d754a06374d81259' # Category ID for food

In [10]:
# function has been copied from the previous module

def getNearbyVenues(names, latitudes, longitudes, radius=5000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng,
            catid,
            radius, 
            LIMIT)
            
        # make the GET request
        try:
            results = requests.get(url, timeout=30).json()["response"]['groups'][0]['items']

            # return only relevant information for each nearby venue
            venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])
        except (requests.RequestException, ValueError, KeyError, IndexError):
            # network failure, non-JSON reply, or an unexpected response layout
            print("Skipped a step")

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

This data will contain duplicates because the search areas overlap. Venue names give a fair indication of unique venues, but names alone are not guaranteed to be unique. Hence we create an identifier from the venue name and the venue location coordinates to identify unique venues.

In [11]:
# function to remove duplicates

def remove_duplicate_venue_data(venue_data_FS):

    # work on a copy so pandas does not raise SettingWithCopyWarning
    food_data_sorted = venue_data_FS[['Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']].copy()

    # identifier = venue name followed by its latitude and longitude
    food_data_sorted['ID'] = (food_data_sorted['Venue']
                              + food_data_sorted['Venue Latitude'].astype(str)
                              + food_data_sorted['Venue Longitude'].astype(str))

    # keep the first occurrence of every identifier
    return food_data_sorted.drop_duplicates(subset='ID', keep='first')

Generating list of venue details using Mumbai neighborhoods data and Foursquare location data

In [12]:
#venue_data1 = getNearbyVenues(mumbai_neighborhood.Neighborhood, mumbai_neighborhood.Lat, mumbai_neighborhood.Long)

In [13]:
# importing a previously generated venue_data1 set
venue_data1 = pd.read_csv('venue_data1.csv')

In [14]:
venue_data1.shape

(8350, 8)

In [15]:
venue_data1_sorted = remove_duplicate_venue_data(venue_data1)



In [16]:
venue_data1_sorted.head()

Unnamed: 0,Venue,Venue Latitude,Venue Longitude,Venue Category,ID
0,Merwans Cake shop,19.1193,72.845418,Bakery,Merwans Cake shop19.11930021588547772.84541776...
1,Jaffer Bhai's Delhi Darbar,19.137714,72.845909,Mughlai Restaurant,Jaffer Bhai's Delhi Darbar19.13771405659304772...
2,Hard Rock Cafe Andheri,19.135995,72.835335,American Restaurant,Hard Rock Cafe Andheri19.1359945078199372.8353...
3,Joey's Pizza,19.126762,72.830001,Pizza Place,Joey's Pizza19.12676215515010772.83000121236746
4,Narayan Sandwich,19.121398,72.85027,Sandwich Place,Narayan Sandwich19.1213976910787772.8502703550...


In [17]:
venue_data1_sorted.shape

(1181, 5)

In [18]:
#venue_data1.to_csv('venue_data1.csv') # saving the data for use in further analysis

Generating list of venue details using Mumbai Ward wise data and Foursquare location data

In [19]:
#venue_data2 = getNearbyVenues(city_centers.Area, city_centers.Lat, city_centers.Long)

In [20]:
# importing a previously generated venue_data2 set
venue_data2 = pd.read_csv('venue_data2.csv')

In [21]:
venue_data2.shape

(2301, 8)

In [22]:
venue_data2.head()

Unnamed: 0.1,Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,0,Colaba,18.949594,72.838152,Shree Thaker Bhojnalay,18.951217,72.828326,Indian Restaurant
1,1,Colaba,18.949594,72.838152,Gulshan-E-Iran,18.948118,72.835427,Middle Eastern Restaurant
2,2,Colaba,18.949594,72.838152,Royal China,18.938715,72.832933,Chinese Restaurant
3,3,Colaba,18.949594,72.838152,Bhagat Tarachand Restaurant,18.951802,72.830486,Indian Restaurant
4,4,Colaba,18.949594,72.838152,Shalimar Restaurant,18.95818,72.832367,Indian Restaurant


In [23]:
venue_data2_sorted = remove_duplicate_venue_data(venue_data2)



In [24]:
venue_data2_sorted.head()

Unnamed: 0,Venue,Venue Latitude,Venue Longitude,Venue Category,ID
0,Shree Thaker Bhojnalay,18.951217,72.828326,Indian Restaurant,Shree Thaker Bhojnalay18.95121698526222472.828...
1,Gulshan-E-Iran,18.948118,72.835427,Middle Eastern Restaurant,Gulshan-E-Iran18.94811792666806572.83542708588257
2,Royal China,18.938715,72.832933,Chinese Restaurant,Royal China18.9387152391562972.83293313173236
3,Bhagat Tarachand Restaurant,18.951802,72.830486,Indian Restaurant,Bhagat Tarachand Restaurant18.9518019997218947...
4,Shalimar Restaurant,18.95818,72.832367,Indian Restaurant,Shalimar Restaurant18.9581801222498972.8323665...


In [25]:
venue_data2_sorted.shape

(808, 5)

In [26]:
#venue_data2.to_csv('venue_data2.csv') # saving the data for use in further analysis

Visually inspecting which venue data to select for further analysis.

In [27]:
# creating new feature group for city centers

def plot_city_centers(city_centers, _map):
    
    city_centers_plot = folium.map.FeatureGroup()

    print(city_centers.shape)

    # plotting city centres
    for lat, lng, in zip(city_centers.Lat.astype(float), city_centers.Long.astype(float)):
        city_centers_plot.add_child(
            folium.features.CircleMarker(
                [lat, lng],
                radius=3, # define how big you want the circle markers to be
                color='blue',
                fill=True,
                fill_color='blue',
                fill_opacity=0.2
            )
        )

    # add city centres to map
    _map.add_child(city_centers_plot)
    
    return

In [28]:
# creating new feature group for venues

def plot_venue_locations(venue_data, _map):
    
    venues = folium.map.FeatureGroup()

    print(venue_data.shape)

    # plotting venues
    for lat, lng, in zip(venue_data['Venue Latitude'], venue_data['Venue Longitude']):
        venues.add_child(
            folium.features.CircleMarker(
                [lat, lng],
                radius=1, # define how big you want the circle markers to be
                color='yellow',
                fill=True,
                fill_color='blue',
                fill_opacity=0.6
            )
        )

    # add venues to map
    _map.add_child(venues)
    
    return

In [29]:
# creating new feature group for any locations stored in 'Lat' and 'Long' columns

def plot_locations(venue_data, _map):
    
    venues = folium.map.FeatureGroup()

    print(venue_data.shape)

    # plotting venues
    for lat, lng, in zip(venue_data['Lat'], venue_data['Long']):
        venues.add_child(
            folium.features.CircleMarker(
                [lat, lng],
                radius=4, # define how big you want the circle markers to be
                color='red',
                fill=True,
                fill_color='red',
                fill_opacity=0.0
            )
        )

    # add venues to map
    _map.add_child(venues)
    
    return

In [30]:
address = 'Mumbai, India'

geolocator = Nominatim(user_agent="mu_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Mumbai City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Mumbai City are 19.0759899, 72.8773928.


Checking first dataset - venue_data1

In [31]:
# create map and display it
mumbai_map1 = folium.Map(location=[latitude, longitude], zoom_start=11)

plot_city_centers(city_centers, mumbai_map1)
plot_venue_locations(venue_data1_sorted, mumbai_map1)

mumbai_map1

(24, 11)
(1181, 5)


Checking second dataset - venue_data2

In [32]:
# create map and display it
mumbai_map2 = folium.Map(location=[latitude, longitude], zoom_start=11)

plot_city_centers(city_centers, mumbai_map2)
plot_venue_locations(venue_data2_sorted, mumbai_map2)

mumbai_map2

(24, 11)
(808, 5)


The first venue set has 373 more unique venue locations than the second (1181 vs 808). Hence this is the dataset used in further analysis.

In [33]:
#venue_data1_sorted.drop(columns = 'ID', inplace = True)
#venue_data1_sorted.head()

In [34]:
#venue_data1_sorted.to_csv('m_venue_data.csv') # saving the data for use in further analysis