# Finding a good location for a Vietnamese Restaurant in Waltham Forest
## Preparation and retrieving locations

First we import all the packages we may need.

In [1]:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe


from geopy.geocoders import Nominatim 

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium 

Collecting package metadata: done
Solving environment: done

# All requested packages already installed.



Next we scrape the Wikipedia page "List of areas of London".

In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_areas_of_London').text
page = BeautifulSoup(source,'lxml')

We find the table, use 'find_all' to get the set of rows, and pop the first row, which is the headers.

In [3]:
table = page.find('table', class_='wikitable sortable')
tableLines = table.find_all('tr')
tableLines.pop(0)

<tr>
<th>Location</th>
<th>London borough</th>
<th>Post town</th>
<th>Postcode district</th>
<th>Dial code</th>
<th>OS grid ref
</th></tr>

We explore the first actual row to get an idea of how the HTML is formatted.

In [4]:
tableLines[0].find_all('td')    

[<td><a href="/wiki/Abbey_Wood" title="Abbey Wood">Abbey Wood</a></td>,
 <td>Greenwich<sup class="reference" id="cite_ref-mills1_1-0"><a href="#cite_note-mills1-1">[1]</a></sup></td>,
 <td>LONDON</td>,
 <td>SE2</td>,
 <td>020</td>,
 <td><span class="plainlinks nourlexpansion" style="white-space: nowrap"><a class="external text" href="https://tools.wmflabs.org/os/coor_g/?pagename=List_of_areas_of_London&amp;params=TQ465785_region%3AGB_scale%3A25000">TQ465785</a></span>
 </td>]

Now we reformat it into a list of rows.

In [5]:
tableLinesSeperated = []
for line in tableLines:
    tableLinesSeperated.append(line.find_all('td'))

Now we pull the text out of each tag.

In [6]:
tableColumnsText=[]
for line in tableLinesSeperated:
    tableColumnsText.append([line[0].text,line[1].text,line[2].text,line[3].text])

However, there is an issue with some of the text: the Wikipedia reference markers remain on some of the London boroughs.

In [7]:
tableColumnsText[0][1]

'Greenwich[1]'

Thus, for each borough with a reference tag at the end, we snip it off.

In [8]:
for line in tableColumnsText:
    if (line[1][-1]==']'):
        while (line[1][-1]!='['):
            line[1] = line[1][:-1]
        line[1]=line[1][:-1]
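
The same trimming can also be written with a regular expression. The following is only a sketch: the helper name `strip_trailing_references` is hypothetical, and the second and third sample strings are made up for illustration rather than taken from the scraped table.

```python
import re

def strip_trailing_references(text):
    """Remove one or more trailing bracketed markers such as '[1]' or '[1][2]'."""
    return re.sub(r'(\[[^\]]*\])+$', '', text)

# The first value appears in the scraped table; the others are illustrative
samples = ['Greenwich[1]', 'Croydon', 'Ealing, Hammersmith and Fulham[2][3]']
cleaned = [strip_trailing_references(s) for s in samples]
print(cleaned)
```

Unlike the loop above, this version leaves empty strings untouched instead of raising an IndexError, and it removes several consecutive markers in one pass.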

Now we can put the scraped data into a dataframe!

In [9]:
londonLocationDataframe = pd.DataFrame(tableColumnsText, columns = ['Location','London Borough', 'Post Town', 'Postcode District'])
londonLocationDataframe.head()

Unnamed: 0,Location,London Borough,Post Town,Postcode District
0,Abbey Wood,Greenwich,LONDON,SE2
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4"
2,Addington,Croydon,CROYDON,CR0
3,Addiscombe,Croydon,CROYDON,CR0
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14"


## Retrieving the venues
First we enter the client details required to use Foursquare.

In [10]:
CLIENT_ID = 'client-id' # your Foursquare ID
CLIENT_SECRET = 'client-secret' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: client-id
CLIENT_SECRET:client-secret


The following function retrieves the venues near each location in our dataframe. It takes the names of the locations, the address of each location, and the radius around which to search. If Foursquare can find the location, we add its venues to the list; if the request fails, we record the location in failedLocations and move on to the next one. We have a large number of locations, so we are not concerned about some of them failing, since we will still have plenty of data to work with. Finally, the results are put into a dataframe. Each row gives a location in London, its longer address, the venue, the venue's latitude and longitude, and the type of venue.

In [11]:
failedLocations=[]
def getNearbyVenues(names, location, radius=500):
    
    venues_list=[]
    for name, loc in zip(names, location):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&near={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            loc,
            radius, 
            100)
            
        # make the GET request
        resultsInitial = requests.get(url).json()
        if 'errorType' in resultsInitial['meta']:
            # record the failure and skip this location, so that no venues
            # from a previous request are attributed to it
            failedLocations.append(loc)
            print(loc+' failed')
            continue
        print(loc+' succeeded')
        results = resultsInitial["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            loc, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Name', 
                  'Address',  
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
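
To make the failure handling easier to check without calling the API, the parsing step can be separated from the request. This is a minimal sketch: `extract_venue_rows` is a hypothetical helper, and both response dictionaries are hand-written stand-ins that only mimic the keys the function above reads, not real Foursquare output.

```python
def extract_venue_rows(name, loc, response_json):
    """Return one tuple per venue, or an empty list if the response reports an error."""
    if 'errorType' in response_json.get('meta', {}):
        return []
    items = response_json['response']['groups'][0]['items']
    rows = []
    for item in items:
        venue = item['venue']
        # Guard against venues without a category (an assumption, not seen in the data)
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, loc, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hand-written stand-ins for API responses
ok_response = {
    'meta': {'code': 200},
    'response': {'groups': [{'items': [
        {'venue': {'name': 'Co-op Food',
                   'location': {'lat': 51.48765, 'lng': 0.11349},
                   'categories': [{'name': 'Grocery Store'}]}},
    ]}]},
}
failed_response = {'meta': {'code': 400, 'errorType': 'failed_geocode'}}

ok_rows = extract_venue_rows('Abbey Wood', 'Abbey Wood, Greenwich', ok_response)
failed_rows = extract_venue_rows('Nowhere', 'Nowhere, Nowhere', failed_response)
print(ok_rows)
print(failed_rows)
```

A failed request then contributes no rows at all, which is the behaviour the description above asks for.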

Now that we have the function, we need the locations to put in. Since some locations are in multiple boroughs, we separate out the first borough for each location into a list.

In [12]:
firstBorough = []
for borough in londonLocationDataframe['London Borough']:
    firstBorough.append(borough.split(',')[0])
    
    

Then we add a column to the dataframe with the first borough of each location, and create another column which gives the address we shall put into the function.

In [13]:
londonLocationDataframe['First Borough']=firstBorough
londonLocationDataframe['Full Location']=londonLocationDataframe['Location']+', '+ londonLocationDataframe['First Borough']
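
The loop and the two assignments above can also be expressed with pandas string methods. The sketch below works on a two-row toy frame (named `toy` here for illustration) copied from the first rows of the scraped table, not on the full dataframe.

```python
import pandas as pd

# Two rows copied from the head of the scraped table
toy = pd.DataFrame({
    'Location': ['Abbey Wood', 'Acton'],
    'London Borough': ['Greenwich', 'Ealing, Hammersmith and Fulham'],
})

# Keep only the text before the first comma of each borough string
toy['First Borough'] = toy['London Borough'].str.split(',').str[0]
toy['Full Location'] = toy['Location'] + ', ' + toy['First Borough']
print(toy['Full Location'].tolist())
```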

In [14]:
londonLocationDataframe.head()

Unnamed: 0,Location,London Borough,Post Town,Postcode District,First Borough,Full Location
0,Abbey Wood,Greenwich,LONDON,SE2,Greenwich,"Abbey Wood, Greenwich"
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",Ealing,"Acton, Ealing"
2,Addington,Croydon,CROYDON,CR0,Croydon,"Addington, Croydon"
3,Addiscombe,Croydon,CROYDON,CR0,Croydon,"Addiscombe, Croydon"
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",Bexley,"Albany Park, Bexley"


At this point, we would run the function. However, this requires a lot of calls to Foursquare, and repeating the same request can give different results, so the data I used is provided at:
https://github.com/JPigden/Coursera_Capstone/blob/master/LondonVenues.csv

We load it with read_csv.

In [15]:
# londonVenues =getNearbyVenues(names =londonLocationDataframe['Location'],location =londonLocationDataframe['Full Location'])
londonVenues=pd.read_csv('https://github.com/JPigden/Coursera_Capstone/raw/master/LondonVenues.csv',index_col =0)

We create a copy of the dataframe, so we can muck around with one and keep the other clean.

In [16]:
londonVenuesCopy = londonVenues.copy()

Let's take a look at the dataframe.

In [17]:
londonVenues.head()

Unnamed: 0,Name,Address,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Abbey Wood,"Abbey Wood, Greenwich",Bostal Gardens,51.48667,0.110462,Playground
1,Abbey Wood,"Abbey Wood, Greenwich",Co-op Food,51.48765,0.11349,Grocery Store
2,Abbey Wood,"Abbey Wood, Greenwich",tommysdriveways,51.489386,0.104273,Construction & Landscaping
3,Abbey Wood,"Abbey Wood, Greenwich",Meghna Tandoori,51.485709,0.101681,Indian Restaurant
4,Acton,"Acton, Ealing",The Aeronaut,51.508376,-0.275216,Pub


## Creating the clusters
First we'll use one-hot encoding to transform the venue category into numeric data.

In [18]:
# one hot encoding
londonOneHot = pd.get_dummies(londonVenues[['Venue Category']], prefix="", prefix_sep="")

# add the location Name column back to the dataframe
londonOneHot['Name'] = londonVenues['Name'] 

# move the Name column to the first position
fixed_columns = [londonOneHot.columns[-1]] + list(londonOneHot.columns[:-1])
londonOneHot = londonOneHot[fixed_columns]

londonOneHot.head()

Unnamed: 0,Name,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Xinjiang Restaurant,Yakitori Restaurant,Yoga Studio,Zoo,Zoo Exhibit
0,Abbey Wood,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Abbey Wood,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Abbey Wood,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Abbey Wood,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Acton,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we group the venues by location name. Thus, we know how many venues of each type are near each location.

In [19]:
londonGrouped = londonOneHot.groupby('Name').sum().reset_index()
londonGrouped.head()

Unnamed: 0,Name,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Xinjiang Restaurant,Yakitori Restaurant,Yoga Studio,Zoo,Zoo Exhibit
0,Abbey Wood,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Acton,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Addington,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Addiscombe,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Albany Park,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
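
As a cross-check, the encode-then-sum steps give the same counts as a cross-tabulation of location against category. The sketch below uses a small made-up frame (`toy_venues`, illustrative values only) rather than the real venue data.

```python
import pandas as pd

# Made-up venues for illustration only
toy_venues = pd.DataFrame({
    'Name': ['Abbey Wood', 'Abbey Wood', 'Acton', 'Acton', 'Acton'],
    'Venue Category': ['Playground', 'Grocery Store', 'Pub', 'Pub', 'Grocery Store'],
})

# Route used above: one-hot encode, then sum per location
one_hot = pd.get_dummies(toy_venues[['Venue Category']], prefix="", prefix_sep="")
one_hot['Name'] = toy_venues['Name']
grouped = one_hot.groupby('Name').sum()

# Equivalent single step: count each (location, category) pair
counts = pd.crosstab(toy_venues['Name'], toy_venues['Venue Category'])

print(grouped)
print(counts)
```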


Taking a look at the column names, we see some venue types that are irrelevant, such as Airport Lounge: we only care that there is an airport, not what facilities that airport has. Similarly, certain venue types can be combined. For example, for the purpose of our report, we don't want to differentiate between non-Vietnamese restaurants. Thus further processing of the columns is necessary. We will take a look at all of the columns, and try to filter out or combine venue types to improve the relevance of the data.

In [20]:
columnList=list(londonGrouped.columns.values)
columnList

['Name',
 'Accessories Store',
 'Adult Boutique',
 'Afghan Restaurant',
 'African Restaurant',
 'Airport',
 'Airport Lounge',
 'Airport Service',
 'Airport Terminal',
 'American Restaurant',
 'Antique Shop',
 'Aquarium',
 'Arcade',
 'Arepa Restaurant',
 'Argentinian Restaurant',
 'Art Gallery',
 'Art Museum',
 'Arts & Crafts Store',
 'Arts & Entertainment',
 'Asian Restaurant',
 'Athletics & Sports',
 'Australian Restaurant',
 'Austrian Restaurant',
 'Auto Garage',
 'Auto Workshop',
 'Automotive Shop',
 'BBQ Joint',
 'Baby Store',
 'Bagel Shop',
 'Bakery',
 'Bank',
 'Bar',
 'Baseball Field',
 'Beach',
 'Bed & Breakfast',
 'Beer Bar',
 'Beer Garden',
 'Beer Store',
 'Belgian Restaurant',
 'Betting Shop',
 'Bike Rental / Bike Share',
 'Bike Shop',
 'Bistro',
 'Boarding House',
 'Boat or Ferry',
 'Bookstore',
 'Botanical Garden',
 'Boutique',
 'Bowling Alley',
 'Boxing Gym',
 'Brasserie',
 'Brazilian Restaurant',
 'Breakfast Spot',
 'Brewery',
 'Bridge',
 'Bubble Tea Shop',
 'Buddhist Tem

Firstly, we notice a large variety of restaurants. For future reference we want to know which areas have a Vietnamese restaurant, so we will copy the Vietnamese Restaurant column.


In [21]:
vietnameseLocations = londonGrouped[['Name','Vietnamese Restaurant']].copy()

Now we want to categorise the venue types, so we can group similar venue types together. We start by listing all of the Restaurant and Cafe venue types.

In [22]:
restaurantColumnList=['Afghan Restaurant', 'African Restaurant', 'American Restaurant', 'Arepa Restaurant', 'Argentinian Restaurant', 'Asian Restaurant', 'Australian Restaurant', 'Austrian Restaurant', 'BBQ Joint', 'Bagel Shop','Belgian Restaurant', 'Bistro', 'Brazilian Restaurant', 'Breakfast Spot','Bubble Tea Shop', 'Buffet', 'Burger Joint', 'Burrito Place', 'Café', 'Cantonese Restaurant', 'Caribbean Restaurant', 'Caucasian Restaurant', 'Chaat Place', 'Chinese Restaurant', 'Churrascaria', 'Cigkofte Place','Comfort Food Restaurant', 'Creperie', 'Cuban Restaurant', 'Cupcake Shop', 'Currywurst Joint', 'Deli / Bodega','Dessert Shop', 'Dim Sum Restaurant', 'Diner','Doner Restaurant', 'Donut Shop', 'Dumpling Restaurant', 'Eastern European Restaurant', 'English Restaurant', 'Ethiopian Restaurant', 'Falafel Restaurant',  'Fast Food Restaurant', 'Filipino Restaurant', 'Fish & Chips Shop', 'Food', 'Food & Drink Shop', 'Food Court', 'Food Stand', 'Food Truck', 'French Restaurant', 'Fried Chicken Joint', 'Frozen Yogurt Shop', 'Gaming Cafe', 'Gastropub', 'German Restaurant', 'Gluten-free Restaurant', 'Greek Restaurant', 'Halal Restaurant', 'Himalayan Restaurant', 'Hot Dog Joint', 'Hunan Restaurant', 'Ice Cream Shop', 'Indian Chinese Restaurant', 'Indian Restaurant', 'Indonesian Restaurant', 'Iraqi Restaurant', 'Italian Restaurant', 'Japanese Restaurant', 'Jewish Restaurant','Kebab Restaurant', 'Korean Restaurant', 'Kosher Restaurant', 'Latin American Restaurant', 'Lebanese Restaurant',  'Malay Restaurant', 'Mamak Restaurant', 'Mediterranean Restaurant', 'Mexican Restaurant', 'Middle Eastern Restaurant', 'Modern European Restaurant', 'Molecular Gastronomy Restaurant', 'Moroccan Restaurant', 'Noodle House', 'North Indian Restaurant', 'Okonomiyaki Restaurant',  'Pakistani Restaurant', 'Pastry Shop', 'Persian Restaurant', 'Peruvian Restaurant', 'Pizza Place', 'Poke Place', 'Polish Restaurant', 'Portuguese Restaurant', 'Ramen Restaurant', 'Restaurant', 'Russian Restaurant', 'Salad Place', 'Sandwich Place', 'Scandinavian Restaurant', 'Scottish Restaurant', 'Seafood Restaurant', 'Shabu-Shabu Restaurant', 'Smoothie Shop', 'Snack Place', 'Soba Restaurant', 'Soup Place', 'South American Restaurant', 'South Indian Restaurant', 'Southern / Soul Food Restaurant', 'Souvlaki Shop', 'Spanish Restaurant', 'Sri Lankan Restaurant', 'Steakhouse', 'Street Food Gathering', 'Sushi Restaurant', 'Szechuan Restaurant', 'Taco Place', 'Taiwanese Restaurant','Tapas Restaurant', 'Tea Room', 'Thai Restaurant', 'Theme Restaurant', 'Turkish Restaurant', 'Udon Restaurant', 'Vegetarian / Vegan Restaurant', 'Vietnamese Restaurant', 'Wings Joint', 'Xinjiang Restaurant', 'Yakitori Restaurant']

We will then add a column to the Dataframe that is the sum of those columns, entitled 'Restaurants and Cafes'.

In [23]:
londonGrouped['Restaurants and Cafes']=pd.DataFrame(londonGrouped[restaurantColumnList].sum(axis=1))[0]

Then we drop those columns we summed.

In [24]:
londonGrouped=londonGrouped.drop(restaurantColumnList, axis = 1)

We will then repeat this, grouping most of the venue types into the types 'Arts and Museums', 'Educational Facilities', 'Entertainment venues', 'Indoor Fitness', 'Outdoor Entertainment', 'Pubs and Bars', 'Stores and Shops', and 'Travel links'.

In [25]:
museumsColumnList =['Aquarium', 'Art Gallery', 'Art Museum', 'Arts & Entertainment', 'Castle', 'Concert Hall', 'Historic Site', 'History Museum', 'Memorial Site', 'Monument / Landmark', 'Museum', 'Music Venue', 'Opera House', 'Palace', 'Performing Arts Venue', 'Planetarium', 'Science Museum', 'Theater', 'Zoo', 'Zoo Exhibit']
londonGrouped['Arts and Museums']=pd.DataFrame(londonGrouped[museumsColumnList].sum(axis=1))[0]
londonGrouped=londonGrouped.drop(museumsColumnList,axis=1)

In [26]:
educationColumnList =['College Auditorium', 'College Cafeteria', 'College Quad', 'General College & University', 'Observatory', 'Student Center',  'University']
londonGrouped['Educational Facilities']=pd.DataFrame(londonGrouped[educationColumnList].sum(axis=1))[0]
londonGrouped=londonGrouped.drop(educationColumnList,axis=1)

In [27]:
entertainmentColumnList =['Arcade', 'Bowling Alley', 'Casino', 'Comedy Club', 'Community Center', 'General Entertainment', 'Go Kart Track', 'Indie Movie Theater', 'Indie Theater', 'Indoor Play Area', 'Jazz Club', 'Massage Studio', 'Mini Golf', 'Movie Theater', 'Multiplex', 'Pool Hall', 'Racecourse', 'Racetrack', 'Recreation Center', 'Rock Club', 'Social Club', 'Spa',  'Sports Club']
londonGrouped['Entertainment venues']=pd.DataFrame(londonGrouped[entertainmentColumnList].sum(axis=1))[0]
londonGrouped=londonGrouped.drop(entertainmentColumnList,axis=1)

In [28]:
fitnessColumnList =['Boxing Gym', 'Climbing Gym', 'Cycle Studio', 'Dance Studio', 'Gymnastics Gym', 'Martial Arts Dojo', 'Pilates Studio', 'Yoga Studio']
londonGrouped['Indoor Fitness']=pd.DataFrame(londonGrouped[fitnessColumnList].sum(axis=1))[0]
londonGrouped=londonGrouped.drop(fitnessColumnList,axis=1)

In [29]:
outdoorColumnList =['Athletics & Sports', 'Baseball Field', 'Beach', 'Botanical Garden', 'Cricket Ground', 'Forest', 'Fountain', 'Garden', 'Golf Course', 'Golf Driving Range', 'Gym', 'Gym / Fitness Center', 'Gym Pool', 'Harbor / Marina', 'Hill', 'Hockey Arena', 'Hockey Field', 'Hockey Rink','Lake', 'Nature Preserve', 'Other Great Outdoors', 'Outdoor Sculpture', 'Outdoors & Recreation', 'Paintball Field', 'Park', 'Pier', 'Playground', 'Plaza', 'Pool', 'Rafting', 'Reservoir', 'River', 'Road', 'Rock Climbing Spot', 'Rugby Pitch', 'Rugby Stadium', 'Scenic Lookout', 'Sculpture Garden', 'Skate Park', 'Skating Rink', 'Soccer Field', 'Soccer Stadium', 'Stables', 'Stadium', 'Tennis Court', 'Tennis Stadium', 'Theme Park', 'Theme Park Ride / Attraction', 'Track', 'Track Stadium', 'Trail', 'Volleyball Court', 'Waterfront', 'Windmill']
londonGrouped['Outdoor Entertainment']=pd.DataFrame(londonGrouped[outdoorColumnList].sum(axis=1))[0]
londonGrouped=londonGrouped.drop(outdoorColumnList,axis=1)

In [30]:
pubsColumnList =['Bar', 'Beer Bar', 'Beer Garden', 'Brewery', 'Champagne Bar', 'Cocktail Bar', 'Dive Bar', 'Gay Bar', 'Hookah Bar', 'Hotel Bar', 'Irish Pub', 'Juice Bar', 'Karaoke Bar', 'Nightclub', 'Piano Bar', 'Pub', 'Speakeasy', 'Sports Bar', 'Whisky Bar', 'Wine Bar'] 
londonGrouped['Pubs and Bars']=pd.DataFrame(londonGrouped[pubsColumnList].sum(axis=1))[0]
londonGrouped=londonGrouped.drop(pubsColumnList,axis=1)

In [31]:
storesColumnList =['Accessories Store', 'Adult Boutique', 'Antique Shop', 'Arts & Crafts Store', 'Auto Garage', 'Auto Workshop', 'Automotive Shop', 'Baby Store', 'Bakery', 'Beer Store', 'Betting Shop', 'Bike Rental / Bike Share', 'Bike Shop', 'Bookstore', 'Boutique', 'Brasserie', 'Butcher', 'Camera Store', 'Candy Store', 'Cheese Shop', 'Chocolate Shop', 'Clothing Store', 'Coffee Shop', 'Comic Shop', 'Convenience Store', 'Cosmetics Shop', 'Costume Shop', 'Department Store', 'Discount Store', 'Duty-free Shop', 'Electronics Store', 'Dry Cleaner',  'Fabric Shop', 'Farmers Market', 'Fish Market', 'Flea Market', 'Flower Shop', 'Fruit & Vegetable Store', 'Furniture / Home Store', 'Garden Center', 'Gas Station', 'Gift Shop', 'Gourmet Shop', 'Grocery Store', 'Gun Shop', 'Hardware Store', 'Health & Beauty Service', 'Health Food Store', 'Herbs & Spices Store', 'Hobby Shop', 'Home Service', 'Jewelry Store', 'Kids Store', 'Leather Goods Store', 'Lighting Store', 'Lingerie Store', 'Liquor Store', 'Locksmith', 'Market', "Men's Store", 'Miscellaneous Shop', 'Mobile Phone Shop', 'Motorcycle Shop', 'Music Store', 'Nail Salon', 'Optical Shop', 'Organic Grocery', 'Outdoor Supply Store', 'Outlet Mall', 'Outlet Store', 'Paper / Office Supplies Store', 'Perfume Shop', 'Pet Store', 'Pharmacy', 'Pie Shop', 'Post Office', 'Record Shop', 'Salon / Barbershop', 'Shoe Repair', 'Shoe Store', 'Shopping Mall', 'Shopping Plaza', 'Smoke Shop', 'Souvenir Shop', 'Sporting Goods Shop', 'Stationery Store', 'Street Art', 'Supermarket', 'Tailor Shop', 'Thrift / Vintage Store', 'Toy / Game Store', 'Used Bookstore', 'Vape Store', 'Video Game Store', 'Video Store', 'Warehouse Store', 'Watch Shop', 'Wine Shop', 'Winery', "Women's Store"] 
londonGrouped['Stores and Shops']=pd.DataFrame(londonGrouped[storesColumnList].sum(axis=1))[0]
londonGrouped=londonGrouped.drop(storesColumnList,axis=1)

In [32]:
travelColumnList =['Airport', 'Airport Lounge', 'Airport Service', 'Airport Terminal',  'Boat or Ferry', 'Bus Station', 'Bus Stop', 'Light Rail Station', 'Metro Station', 'Rental Car Location', 'Train Station', 'Tram Station']
londonGrouped['Travel links']=pd.DataFrame(londonGrouped[travelColumnList].sum(axis=1))[0]
londonGrouped=londonGrouped.drop(travelColumnList,axis=1)
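
The sum-then-drop pattern repeated in the cells above could be folded into one helper driven by a dictionary. This is a sketch under assumptions: `collapse_columns` and the toy frame are hypothetical, and the helper silently skips listed columns that are absent, whereas the cells above would raise a KeyError in that case.

```python
import pandas as pd

def collapse_columns(df, groups):
    """Sum each list of columns into one new column, then drop the originals."""
    out = df.copy()
    for new_name, columns in groups.items():
        # Only use the columns that actually exist in this frame
        present = [c for c in columns if c in out.columns]
        out[new_name] = out[present].sum(axis=1)
        out = out.drop(columns=present)
    return out

# Toy counts for illustration only
toy = pd.DataFrame({
    'Name': ['A', 'B'],
    'Bar': [1, 0],
    'Pub': [2, 3],
    'Airport': [0, 1],
    'Bus Stop': [1, 1],
})
groups = {
    'Pubs and Bars': ['Bar', 'Pub', 'Wine Bar'],   # 'Wine Bar' is absent from toy
    'Travel links': ['Airport', 'Bus Stop'],
}
collapsed = collapse_columns(toy, groups)
print(collapsed)
```

Keeping the category lists in a single dictionary also makes it easier to see which group each venue type was assigned to.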

Finally, the remaining columns don't fit into any of our categories, and occur infrequently, so we will remove them from our dataframe.

In [33]:
otherColumnList=['Bank', 'Bed & Breakfast', 'Boarding House', 'Bridge', 'Buddhist Temple', 'Building', 'Business Service', 'Campground', 'Canal', 'Canal Lock', 'Cave', 'Cemetery', 'Church', 'Construction & Landscaping', 'Convention Center', 'Design Studio', 'Distillery', "Doctor's Office", 'Event Space', 'Exhibit', 'Farm', 'Field', 'Film Studio', 'Hostel', 'Hotel', 'IT Services', 'Lawyer', 'Lounge', 'Military Base', 'Office', 'Pedestrian Plaza', 'Platform', 'Public Art', 'Recording Studio', 'Residential Building (Apartment / Condo)',  'Roof Deck', 'Tour Provider', 'Tourist Information Center', 'Tunnel','Veterinarian']
londonGrouped=londonGrouped.drop(otherColumnList, axis=1)

Let us look at the DataFrame now.

In [34]:
londonGrouped.head()

Unnamed: 0,Name,Restaurants and Cafes,Arts and Museums,Educational Facilities,Entertainment venues,Indoor Fitness,Outdoor Entertainment,Pubs and Bars,Stores and Shops,Travel links
0,Abbey Wood,1,0,0,0,0,1,0,1,0
1,Acton,7,0,0,0,0,2,5,2,0
2,Addington,27,0,0,1,0,16,14,27,4
3,Addiscombe,27,0,0,1,0,16,14,27,4
4,Albany Park,27,0,0,1,0,16,14,27,4


To use k-means properly, we want each entry in the dataframe to represent the proportion of nearby venues of that type, rather than the total count, so we shall divide each row in the dataframe by its sum. We'll move the Name column into the index so that all remaining columns are numeric.

In [35]:
londonGroupedProportions=londonGrouped.set_index('Name')
londonGroupedProportions.head()

Unnamed: 0_level_0,Restaurants and Cafes,Arts and Museums,Educational Facilities,Entertainment venues,Indoor Fitness,Outdoor Entertainment,Pubs and Bars,Stores and Shops,Travel links
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Abbey Wood,1,0,0,0,0,1,0,1,0
Acton,7,0,0,0,0,2,5,2,0
Addington,27,0,0,1,0,16,14,27,4
Addiscombe,27,0,0,1,0,16,14,27,4
Albany Park,27,0,0,1,0,16,14,27,4


Now we create a series sumDataframe for which each row is the sum of the corresponding row in londonGroupedProportions, then divide each column by it.

In [36]:
sumDataframe = londonGroupedProportions.sum(axis=1)
for column in list(londonGroupedProportions.columns.values):
    londonGroupedProportions[column]=londonGroupedProportions[column]/sumDataframe

londonGroupedProportions.head()

Unnamed: 0_level_0,Restaurants and Cafes,Arts and Museums,Educational Facilities,Entertainment venues,Indoor Fitness,Outdoor Entertainment,Pubs and Bars,Stores and Shops,Travel links
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Abbey Wood,0.333333,0.0,0.0,0.0,0.0,0.333333,0.0,0.333333,0.0
Acton,0.4375,0.0,0.0,0.0,0.0,0.125,0.3125,0.125,0.0
Addington,0.303371,0.0,0.0,0.011236,0.0,0.179775,0.157303,0.303371,0.044944
Addiscombe,0.303371,0.0,0.0,0.011236,0.0,0.179775,0.157303,0.303371,0.044944
Albany Park,0.303371,0.0,0.0,0.011236,0.0,0.179775,0.157303,0.303371,0.044944
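
The row-wise division can also be done in one call with `DataFrame.div`. The sketch below uses the non-zero counts for Abbey Wood and Acton from the table above; the variable names are illustrative.

```python
import pandas as pd

# Non-zero counts for two locations, taken from the grouped table
counts = pd.DataFrame(
    {'Restaurants and Cafes': [1, 7],
     'Outdoor Entertainment': [1, 2],
     'Pubs and Bars': [0, 5],
     'Stores and Shops': [1, 2]},
    index=pd.Index(['Abbey Wood', 'Acton'], name='Name'),
)

# Divide every row by its own total (axis=0 aligns the sums on the row index)
proportions = counts.div(counts.sum(axis=1), axis=0)
print(proportions)
```

One caveat: a location whose row total is zero would produce NaN values here, which would need handling before clustering.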


Now we can run k-means clustering! We shall run with 6 clusters.

In [37]:
kmeans = KMeans(n_clusters=6, random_state=0).fit(londonGroupedProportions)
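
The choice of 6 clusters is not justified here. One common heuristic, offered only as a sketch, is to fit k-means for a range of cluster counts and look for the point where the inertia (the within-cluster sum of squared distances) stops falling quickly. The data below is synthetic, standing in for londonGroupedProportions, so the printed numbers say nothing about the London data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in: 60 rows of 9 proportions that each sum to 1
X = rng.dirichlet(np.ones(9), size=60)

inertias = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 4))
```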

Add the cluster labels to our grouped dataframe.

In [38]:
londonGroupedProportions['Cluster Labels']=kmeans.labels_
londonGroupedProportions.head()

Unnamed: 0_level_0,Restaurants and Cafes,Arts and Museums,Educational Facilities,Entertainment venues,Indoor Fitness,Outdoor Entertainment,Pubs and Bars,Stores and Shops,Travel links,Cluster Labels
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Abbey Wood,0.333333,0.0,0.0,0.0,0.0,0.333333,0.0,0.333333,0.0,2
Acton,0.4375,0.0,0.0,0.0,0.0,0.125,0.3125,0.125,0.0,3
Addington,0.303371,0.0,0.0,0.011236,0.0,0.179775,0.157303,0.303371,0.044944,2
Addiscombe,0.303371,0.0,0.0,0.011236,0.0,0.179775,0.157303,0.303371,0.044944,2
Albany Park,0.303371,0.0,0.0,0.011236,0.0,0.179775,0.157303,0.303371,0.044944,2


Let's add the London borough of each location back in.

In [39]:
locationData=londonLocationDataframe.loc[:,['Location', 'London Borough']]
locationData.set_index('Location', inplace = True)
londonGroupedProportions=londonGroupedProportions.join(locationData, on='Name')
londonGroupedProportions.head()

Unnamed: 0_level_0,Restaurants and Cafes,Arts and Museums,Educational Facilities,Entertainment venues,Indoor Fitness,Outdoor Entertainment,Pubs and Bars,Stores and Shops,Travel links,Cluster Labels,London Borough
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Abbey Wood,0.333333,0.0,0.0,0.0,0.0,0.333333,0.0,0.333333,0.0,2,Greenwich
Acton,0.4375,0.0,0.0,0.0,0.0,0.125,0.3125,0.125,0.0,3,"Ealing, Hammersmith and Fulham"
Addington,0.303371,0.0,0.0,0.011236,0.0,0.179775,0.157303,0.303371,0.044944,2,Croydon
Addiscombe,0.303371,0.0,0.0,0.011236,0.0,0.179775,0.157303,0.303371,0.044944,2,Croydon
Albany Park,0.303371,0.0,0.0,0.011236,0.0,0.179775,0.157303,0.303371,0.044944,2,Bexley


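Now we'll analyse each of the clusters. Note that k-means is initialised randomly, so the cluster numbering, and sometimes the membership, can change between runs. We fixed random_state=0 above, but the results may still differ if the venue data or the library version changes.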

In [40]:
londonClusterZero = londonGroupedProportions.loc[londonGroupedProportions['Cluster Labels']==0]
londonClusterZero.mean(numeric_only=True)

Restaurants and Cafes     0.237249
Arts and Museums          0.018231
Educational Facilities    0.000000
Entertainment venues      0.018483
Indoor Fitness            0.001812
Outdoor Entertainment     0.068547
Pubs and Bars             0.092742
Stores and Shops          0.519263
Travel links              0.043673
Cluster Labels            0.000000
dtype: float64

In [41]:
londonClusterOne = londonGroupedProportions.loc[londonGroupedProportions['Cluster Labels']==1]
londonClusterOne.mean(numeric_only=True)

Restaurants and Cafes     0.527311
Arts and Museums          0.026629
Educational Facilities    0.000677
Entertainment venues      0.018525
Indoor Fitness            0.005619
Outdoor Entertainment     0.074797
Pubs and Bars             0.145771
Stores and Shops          0.177551
Travel links              0.023121
Cluster Labels            1.000000
dtype: float64

In [42]:
londonClusterTwo = londonGroupedProportions.loc[londonGroupedProportions['Cluster Labels']==2]
londonClusterTwo.mean(numeric_only=True)

Restaurants and Cafes     0.315332
Arts and Museums          0.012196
Educational Facilities    0.000070
Entertainment venues      0.017475
Indoor Fitness            0.000235
Outdoor Entertainment     0.151568
Pubs and Bars             0.124263
Stores and Shops          0.333014
Travel links              0.045847
Cluster Labels            2.000000
dtype: float64

In [43]:
londonClusterThree = londonGroupedProportions.loc[londonGroupedProportions['Cluster Labels']==3]
londonClusterThree.mean(numeric_only=True)

Restaurants and Cafes     0.342967
Arts and Museums          0.043403
Educational Facilities    0.000629
Entertainment venues      0.018311
Indoor Fitness            0.009436
Outdoor Entertainment     0.200178
Pubs and Bars             0.198155
Stores and Shops          0.160276
Travel links              0.026645
Cluster Labels            3.000000
dtype: float64

In [44]:
londonClusterFour = londonGroupedProportions.loc[londonGroupedProportions['Cluster Labels']==4]
londonClusterFour.mean(numeric_only=True)

Restaurants and Cafes     0.141852
Arts and Museums          0.000000
Educational Facilities    0.000000
Entertainment venues      0.033333
Indoor Fitness            0.000000
Outdoor Entertainment     0.552593
Pubs and Bars             0.042963
Stores and Shops          0.082222
Travel links              0.147037
Cluster Labels            4.000000
dtype: float64

In [45]:
londonClusterFive = londonGroupedProportions.loc[londonGroupedProportions['Cluster Labels']==5]
londonClusterFive.mean(numeric_only=True)

Restaurants and Cafes     0.484151
Arts and Museums          0.013606
Educational Facilities    0.000397
Entertainment venues      0.008634
Indoor Fitness            0.003788
Outdoor Entertainment     0.053185
Pubs and Bars             0.070552
Stores and Shops          0.342206
Travel links              0.023482
Cluster Labels            5.000000
dtype: float64
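
The six per-cluster means can also be produced side by side with a single groupby, which makes the clusters easier to compare. The sketch below uses a tiny made-up frame with the same column layout; the values are illustrative, not taken from the results above.

```python
import pandas as pd

# Made-up proportions and labels for illustration only
toy = pd.DataFrame({
    'Restaurants and Cafes': [0.5, 0.6, 0.3],
    'Stores and Shops': [0.2, 0.2, 0.5],
    'Cluster Labels': [1, 1, 0],
    'London Borough': ['Waltham Forest', 'Waltham Forest', 'Croydon'],
})

# numeric_only=True leaves the text column out of the averages
cluster_means = toy.groupby('Cluster Labels').mean(numeric_only=True)
print(cluster_means)
```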

Our results suggest that the best cluster types in which to open a restaurant are cluster 1, followed by cluster 5. Cluster 1 has a greater proportion of restaurants, and also of pubs and bars, than cluster 5, suggesting that cluster 1 locations have a vibrant evening life, whereas the high proportion of stores and shops in cluster 5 suggests that its locations tend to be high streets.

## Investigating Waltham Forest
Let's now find all the locations in Waltham Forest, and the clusters they occupy.

In [46]:
walthamForestDataframe = londonGroupedProportions.loc[londonGroupedProportions['London Borough']=='Waltham Forest']
walthamForestDataframe

Unnamed: 0_level_0,Restaurants and Cafes,Arts and Museums,Educational Facilities,Entertainment venues,Indoor Fitness,Outdoor Entertainment,Pubs and Bars,Stores and Shops,Travel links,Cluster Labels,London Borough
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Cann Hall,0.53125,0.010417,0.0,0.03125,0.020833,0.125,0.072917,0.208333,0.0,1,Waltham Forest
Chingford,0.333333,0.0,0.0,0.0,0.0,0.083333,0.083333,0.5,0.0,0,Waltham Forest
Highams Park,0.383838,0.050505,0.0,0.030303,0.0,0.121212,0.181818,0.232323,0.0,3,Waltham Forest
Leyton,0.324324,0.0,0.0,0.0,0.0,0.108108,0.135135,0.378378,0.054054,2,Waltham Forest
Leytonstone,0.4,0.04,0.0,0.0,0.0,0.04,0.16,0.32,0.04,5,Waltham Forest
Upper Walthamstow,0.57377,0.016393,0.0,0.0,0.0,0.065574,0.114754,0.180328,0.04918,1,Waltham Forest
Walthamstow,0.5,0.055556,0.0,0.0,0.0,0.111111,0.111111,0.222222,0.0,1,Waltham Forest
Walthamstow Village,0.5,0.055556,0.0,0.0,0.0,0.111111,0.111111,0.222222,0.0,1,Waltham Forest


We have some indication of which locations in Waltham Forest are best, but which already have Vietnamese restaurants? We shall use the vietnameseLocations dataframe saved earlier.

In [47]:
vietnameseLocations.set_index('Name',inplace = True)
walthamForestDataframe.join(vietnameseLocations)

Unnamed: 0_level_0,Restaurants and Cafes,Arts and Museums,Educational Facilities,Entertainment venues,Indoor Fitness,Outdoor Entertainment,Pubs and Bars,Stores and Shops,Travel links,Cluster Labels,London Borough,Vietnamese Restaurant
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Cann Hall,0.53125,0.010417,0.0,0.03125,0.020833,0.125,0.072917,0.208333,0.0,1,Waltham Forest,1
Chingford,0.333333,0.0,0.0,0.0,0.0,0.083333,0.083333,0.5,0.0,0,Waltham Forest,0
Highams Park,0.383838,0.050505,0.0,0.030303,0.0,0.121212,0.181818,0.232323,0.0,3,Waltham Forest,0
Leyton,0.324324,0.0,0.0,0.0,0.0,0.108108,0.135135,0.378378,0.054054,2,Waltham Forest,0
Leytonstone,0.4,0.04,0.0,0.0,0.0,0.04,0.16,0.32,0.04,5,Waltham Forest,0
Upper Walthamstow,0.57377,0.016393,0.0,0.0,0.0,0.065574,0.114754,0.180328,0.04918,1,Waltham Forest,1
Walthamstow,0.5,0.055556,0.0,0.0,0.0,0.111111,0.111111,0.222222,0.0,1,Waltham Forest,0
Walthamstow Village,0.5,0.055556,0.0,0.0,0.0,0.111111,0.111111,0.222222,0.0,1,Waltham Forest,0


Since we want to avoid locations which already have a Vietnamese restaurant, to minimise competition, Walthamstow, which is in cluster 1 and has no Vietnamese restaurant nearby, is the best location in which to open a new Vietnamese restaurant.
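
The selection rule used here (cluster 1, no existing Vietnamese restaurant) can be written as a filter. This is a sketch: the frame below retypes only the Cluster Labels and Vietnamese Restaurant columns from the table above, and the variable names are illustrative.

```python
import pandas as pd

# Cluster label and Vietnamese restaurant count per location, from the joined table
waltham = pd.DataFrame(
    {'Cluster Labels': [1, 0, 3, 2, 5, 1, 1, 1],
     'Vietnamese Restaurant': [1, 0, 0, 0, 0, 1, 0, 0]},
    index=pd.Index(['Cann Hall', 'Chingford', 'Highams Park', 'Leyton', 'Leytonstone',
                    'Upper Walthamstow', 'Walthamstow', 'Walthamstow Village'], name='Name'),
)

# Keep cluster 1 locations that have no Vietnamese restaurant yet
candidates = waltham[(waltham['Cluster Labels'] == 1) &
                     (waltham['Vietnamese Restaurant'] == 0)]
print(candidates.index.tolist())
```

On these values the filter keeps Walthamstow and Walthamstow Village, whose rows are identical in the table.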