# Distribution solution for milk delivery to Restaurants/Cafes in Scarborough, Toronto

### Capstone Project - Battle of the Neighbourhoods


### Part 1: Problem Description
A milk contractor wants to start distributing milk across all neighbourhoods of Scarborough, Toronto. The contractor needs milk delivered on time every morning to all major clusters of restaurants, cafes, bakeries and breakfast places.
The contractor wants to build an efficient delivery network that uses at most 10 delivery trucks and still reaches every area on time. To that end, every prospective customer (restaurant, cafe, bakery or breakfast place) is assigned to a group, and each group is operated as a separate unit so that customers are served efficiently.

### Part 2: Data we need
 - We need geolocation data for the borough and its neighbourhoods, specifically the latitude and longitude of each one. We can get this from the Geopy geocoders library and the Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


 - To cluster the restaurants, cafes, bakeries and breakfast places, we need data about the venues in each neighbourhood of Scarborough. We will use Foursquare location data for this. For each venue, that means the venue id, the venue name, its precise latitude and longitude coordinates, and its category.

## Preparing Data - Part 1

In [1]:
from bs4 import BeautifulSoup
import requests

In [2]:
import numpy as np
import pandas as pd

In [3]:
import bs4

In [4]:
import lxml.html as lh

In [5]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
req = requests.get(url)

In [6]:
soup = bs4.BeautifulSoup(req.text, "html5lib")

In [7]:
data = soup.select('.wikitable.sortable')

In [8]:
print (type(data))
print (len(data))

<class 'list'>
1


In [9]:
doc = lh.fromstring(req.content)

In [10]:
tr_elements = doc.xpath('//tr')

In [11]:
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

In [12]:
tr_elements = doc.xpath('//tr')
# Create an empty list of (column name, values) pairs
col=[]
# For each cell in the header row, store its text and an empty list
for t in tr_elements[0]:
    name=t.text_content()
    col.append((name,[]))

In [13]:
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 3, the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
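
The row-by-row loop above can be tried on a small inline table. This is a minimal sketch that uses only the standard library's `html.parser` rather than the `lxml` calls used in the notebook; the `TableParser` class and the `SAMPLE_HTML` string are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample that mimics the shape of the Wikipedia table
# (three cells per row, header row first).
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False   # True while the parser is inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()

parser = TableParser()
parser.feed(SAMPLE_HTML)

header, *body = parser.rows
# Build one list per column, keyed by the header text.
columns = {name: [row[i] for row in body] for i, name in enumerate(header)}
print(columns["Borough"])
```

The resulting `columns` dictionary has the same shape as `Dict` in the notebook, so it could be passed to `pd.DataFrame` in the same way.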


In [14]:
# Convert the list into dict and then into a dataframe
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)

In [15]:
df = df[df.Borough != 'Not assigned']

In [16]:
df.rename(columns = {'Neighbourhood\n':'Neighbourhood'}, inplace = True)

In [17]:
# Strip the trailing newline from every neighbourhood name
df['Neighbourhood'] = df['Neighbourhood'].str.strip('\n')

In [18]:
df = df.reset_index(drop = True)

In [19]:
# Where no neighbourhood is assigned, fall back to the borough name
mask = df['Neighbourhood'] == 'Not assigned'
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']

In [20]:
df.head(5)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |


In [21]:
df1 = (df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(lambda x: ','.join(set(x.dropna()))).reset_index())

In [22]:
df1.head(5)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern,Rouge |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,West Hill,Morningside |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [23]:
df1.shape

(103, 3)
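
Because `set` does not preserve order, the joined neighbourhood string produced by the `groupby` above can list names in a different order from one run to the next. Below is a minimal sketch of an order-preserving variant on a hypothetical three-row frame (`sample` is not part of the notebook).

```python
import pandas as pd

# Made-up rows in the same shape as df
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M1G"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Woburn"],
})

grouped = (
    sample.groupby(["Postcode", "Borough"])["Neighbourhood"]
    # dict.fromkeys drops duplicates while keeping first-seen order
    .apply(lambda names: ",".join(dict.fromkeys(names.dropna())))
    .reset_index()
)
print(grouped)
```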

### Adding latitude and longitude

In [24]:
latlon = pd.read_csv('Geospatial_Coordinates.csv')
latlon.rename(columns = {'Postal Code':'Postcode'}, inplace = True)

In [25]:
df_final = pd.merge(df1,latlon,on=['Postcode'], how='left')

In [26]:
df_final.head()

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern,Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood,West Hill,Morningside | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


In [27]:
df_final.shape

(103, 5)
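
A left merge keeps every postcode even when no coordinates match, leaving `NaN` in `Latitude` and `Longitude`. The sketch below shows one way to spot such rows with the `indicator` option of `pd.merge`; the two small frames are made-up sample data, and `M9Z` is a placeholder postcode chosen so that it has no match.

```python
import pandas as pd

codes = pd.DataFrame({"Postcode": ["M1B", "M1C", "M9Z"]})
coords = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

merged = pd.merge(codes, coords, on="Postcode", how="left", indicator=True)
# Rows tagged 'left_only' had no match in the coordinates table
missing = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(missing)
```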

In [28]:
df_final.to_csv("Toronto_data")

## Preparing Data - Part 2 (Foursquare)

In [29]:
df1 = pd.read_csv('Toronto_data')

In [30]:
df1.drop(['Unnamed: 0'] , axis = 1 , inplace = True )

In [31]:
df1.head()

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern,Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood,West Hill,Morningside | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


As we are interested in exploring only Scarborough, let's create a data frame with all of its neighbourhoods.

In [32]:
df_Scarborough = df1[df1['Borough'] == 'Scarborough']

In [33]:
df_Scarborough.head()

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern,Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood,West Hill,Morningside | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


### Explore Neighbourhoods of Scarborough

In [34]:
def foursquare_explore(postal_code_list, neighborhood_list, lat_list, lng_list, LIMIT=500, radius=1000):
    """Query the Foursquare explore endpoint once per neighbourhood centre."""
    result_ds = []
    for postal_code, neighborhood, lat, lng in zip(postal_code_list, neighborhood_list, lat_list, lng_list):

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION,
            lat, lng, radius, LIMIT)

        # make the GET request and keep only the returned items
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # store the raw results together with the neighbourhood they belong to
        result_ds.append({
            'Postal Code': postal_code,
            'Neighborhood(s)': neighborhood,
            'Latitude': lat,
            'Longitude': lng,
            'Crawling_result': results,
        })

    return result_ds
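
The request URL in `foursquare_explore` is assembled with `str.format`. As an alternative sketch, the query string can be built with `urllib.parse.urlencode`, which also escapes special characters. `build_explore_url` and `EXPLORE_ENDPOINT` are hypothetical names, and the credential values in the usage line are placeholders.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=1000, limit=500):
    """Return the explore URL for one coordinate pair."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",   # the comma is percent-encoded by urlencode
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"

# Placeholder credentials, real coordinates for M1B from the table above
url = build_explore_url("ID", "SECRET", "20180605", 43.806686, -79.194353)
print(url)
```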

In [35]:
CLIENT_ID = 'ADG3ZXD3ROLMNTHU00E5F4XHVXWMMNQFUYB5DAIKPBENYQSA' # your Foursquare ID
CLIENT_SECRET = 'IK1MNCJN2OFQHAF5F0OZT5PJ1IDZBZDUEFQ4PAVP5P5FDX5C' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
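
Hard-coded credentials are easy to leak when a notebook is shared. Below is a minimal sketch of reading them from environment variables instead; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are assumptions made for this example, not names that Foursquare prescribes.

```python
import os

def load_credentials(env=os.environ):
    """Return (client_id, client_secret) from the given mapping, or raise if unset."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        # missing.args[0] is the name of the variable that was not found
        raise RuntimeError(f"Set the {missing.args[0]} environment variable") from None
```

With both variables exported in the shell that starts the notebook, `CLIENT_ID, CLIENT_SECRET = load_credentials()` would replace the two literal assignments.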

In [36]:
# Retrieving data from the Foursquare API

SB_Foursquare_Dataset = foursquare_explore(list(df_Scarborough['Postcode']),list(df_Scarborough['Neighbourhood']),
                           list(df_Scarborough['Latitude']),list(df_Scarborough['Longitude']),)

In [37]:
SB_Foursquare_Dataset[1]

{'Postal Code': 'M1C',
 'Neighborhood(s)': 'Highland Creek,Rouge Hill,Port Union',
 'Latitude': 43.7845351,
 'Longitude': -79.16049709999999,
 'Crawling_result': [{'reasons': {'count': 0,
    'items': [{'summary': 'This spot is popular',
      'type': 'general',
      'reasonName': 'globalInteractionReason'}]},
   'venue': {'id': '4b96e31cf964a5207deb34e3',
    'name': 'Shamrock Burgers',
    'location': {'address': '6070 Old Kingston Rd.',
     'lat': 43.78382252268771,
     'lng': -79.16840631604676,
     'labeledLatLngs': [{'label': 'display',
       'lat': 43.78382252268771,
       'lng': -79.16840631604676}],
     'distance': 640,
     'cc': 'CA',
     'city': 'Scarborough',
     'state': 'ON',
     'country': 'Canada',
     'formattedAddress': ['6070 Old Kingston Rd.',
      'Scarborough ON',
      'Canada']},
    'categories': [{'id': '4bf58dd8d48988d16c941735',
      'name': 'Burger Joint',
      'pluralName': 'Burger Joints',
      'shortName': 'Burgers',
      'icon': {'prefi

In [38]:
#extract details from foursquare dataset and save in dataframe

def get_venue_dataset(foursquare_dataset):
    columns = ['Postcode', 'Neighbourhood',
               'Neighbourhood Latitude', 'Neighbourhood Longitude', 'Venue_id',
               'Venue', 'Venue Category', 'Venue_lat', 'Venue_lng']
    rows = []

    for neigh_dict in foursquare_dataset:
        postal_code = neigh_dict['Postal Code']
        neigh = neigh_dict['Neighborhood(s)']
        lat = neigh_dict['Latitude']
        lng = neigh_dict['Longitude']

        # one output row per venue returned for this neighbourhood
        for venue_dict in neigh_dict['Crawling_result']:
            venue = venue_dict['venue']
            rows.append({'Postcode': postal_code, 'Neighbourhood': neigh,
                         'Neighbourhood Latitude': lat, 'Neighbourhood Longitude': lng,
                         'Venue_id': venue['id'],
                         'Venue': venue['name'],
                         'Venue Category': venue['categories'][0]['name'],
                         'Venue_lat': venue['location']['lat'],
                         'Venue_lng': venue['location']['lng']})

    # Build the frame once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=columns)
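
The flattening step can be tried without calling the API. The sketch below runs the same field lookups on a hand-written record shaped like the `SB_Foursquare_Dataset[1]` output shown earlier. It additionally guards against an empty `categories` list, which is an assumption about what the API may return rather than something observed in this notebook. `flatten_venues` and `sample` are illustrative names, and the second venue is invented.

```python
def flatten_venues(foursquare_dataset):
    """Return one flat dict per venue found in the nested explore results."""
    rows = []
    for area in foursquare_dataset:
        for item in area["Crawling_result"]:
            venue = item["venue"]
            categories = venue.get("categories") or []
            rows.append({
                "Postcode": area["Postal Code"],
                "Venue_id": venue["id"],
                "Venue": venue["name"],
                # Fall back to None when a venue has no category
                "Venue Category": categories[0]["name"] if categories else None,
                "Venue_lat": venue["location"]["lat"],
                "Venue_lng": venue["location"]["lng"],
            })
    return rows

sample = [{
    "Postal Code": "M1C",
    "Crawling_result": [
        {"venue": {"id": "4b96e31cf964a5207deb34e3", "name": "Shamrock Burgers",
                   "location": {"lat": 43.78382252268771, "lng": -79.16840631604676},
                   "categories": [{"name": "Burger Joint"}]}},
        # Invented venue with no category, to exercise the guard
        {"venue": {"id": "hypothetical-id", "name": "Uncategorised Place",
                   "location": {"lat": 43.78, "lng": -79.16},
                   "categories": []}},
    ],
}]

rows = flatten_venues(sample)
print(rows[0]["Venue Category"], rows[1]["Venue Category"])
```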

In [39]:
df_SB = get_venue_dataset(SB_Foursquare_Dataset)

In [40]:
# Keep only venue categories that are potential milk customers
pattern = 'Coffee|Restaurant|Breakfast|Café|Bakery'
df_final = df_SB[df_SB['Venue Category'].str.contains(pattern)]

In [41]:
df_final.head()

| | Postcode | Neighbourhood | Neighbourhood Latitude | Neighbourhood Longitude | Venue_id | Venue | Venue Category | Venue_lat | Venue_lng |
|---|---|---|---|---|---|---|---|---|---|
| 1 | M1B | Malvern,Rouge | 43.806686 | -79.194353 | 4b914562f964a520d4ae33e3 | Caribbean Wave | Caribbean Restaurant | 43.798558 | -79.195777 |
| 2 | M1B | Malvern,Rouge | 43.806686 | -79.194353 | 4b6718c2f964a5203f3a2be3 | Harvey's | Fast Food Restaurant | 43.800106 | -79.198258 |
| 3 | M1B | Malvern,Rouge | 43.806686 | -79.194353 | 579a91b3498e9bd833afa78a | Wendy's | Fast Food Restaurant | 43.802008 | -79.19808 |
| 4 | M1B | Malvern,Rouge | 43.806686 | -79.194353 | 4b16e23bf964a520edbe23e3 | Tim Hortons | Coffee Shop | 43.802 | -79.198169 |
| 5 | M1B | Malvern,Rouge | 43.806686 | -79.194353 | 4bb6b9446edc76b0d771311c | Wendy's | Fast Food Restaurant | 43.807448 | -79.199056 |
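
With a 1000 m search radius around each postcode centre, neighbouring postcodes may return the same venue more than once. If each customer should be counted once, one option is to drop repeated `Venue_id` values. This is an assumption about the intended analysis, not a step taken in this notebook, and the frame below is made-up sample data.

```python
import pandas as pd

# Made-up venues; "a1" appears under two postcodes
venues = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1C"],
    "Venue_id": ["a1", "b2", "a1"],
    "Venue": ["Cafe A", "Bakery B", "Cafe A"],
})

# Keep the first occurrence of each venue id
unique_venues = venues.drop_duplicates(subset="Venue_id").reset_index(drop=True)
print(len(venues), len(unique_venues))
```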


### df_final is the final dataset we will use to run clustering