# Capstone Project - The Battle of Neighbours

___

## Table of Contents
1. __Introduction: Business Problem__
2. __Data__
3. __Methodology__
4. __Discussion (Results)__
5. __Conclusion__


### 1) Introduction: Business Problem

__Stakeholders today spend a lot of money assessing the risks of starting a new business, and data analysis and machine learning play a major role in that risk assessment. In this project I will carry out an analysis to help a pharmacist choose the London neighbourhood in which a new pharmacy can be opened with the lowest possible risk.__

### 2) Data

#### Resources
1. __Wikipedia (List of Locations)__
2. __GeoCoder (Co-ordinates)__
3. __Foursquare (Venues)__

### 2.1. Wikipedia (List of Locations)

In [4]:
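from urllib.request import urlopen
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_areas_of_London'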

html = urlopen(url)

soup = BeautifulSoup(html, 'html.parser')

In [5]:
my_table = soup.find_all('table', class_= 'wikitable')

### Loop over the tables to pull the locations and boroughs from the URL above

In [7]:
location = []
borough = []


for table in my_table:
    rows = table.find_all('tr')
    
    for row in rows:
        cells = row.find_all('td')
        
        if len(cells)==6:
            location.append(cells[0].find(text=True).strip())
            borough.append(cells[1].find(text=True))
            
            

### Let's put the data in a dataframe and check if anything is missing

In [8]:
df = pd.DataFrame(location,
                  columns = ['Location'])

df['Borough'] = borough


print(df.shape)
df.head(10)

(533, 2)


Unnamed: 0,Location,Borough
0,Abbey Wood,"Bexley, Greenwich"
1,Acton,"Ealing, Hammersmith and Fulham"
2,Addington,Croydon
3,Addiscombe,Croydon
4,Albany Park,Bexley
5,Aldborough Hatch,Redbridge
6,Aldgate,City
7,Aldwych,Westminster
8,Alperton,Brent
9,Anerley,Bromley


### 2.2. GeoCoder (Co-ordinates)

### Let's get the latitudes and longitudes using GeoCoder

In [11]:
latitudes = []
longitudes = []

API_KEY = 'YOUR_API_KEY'  # Google Geocoding API key; keep the real key out of the notebook
geocode_url = 'https://maps.googleapis.com/maps/api/geocode/json'

for loc in df['Location']:
    place_name = loc + ', London, England'

    # pass the address as a query parameter so requests URL-encodes it
    r = requests.get(geocode_url, params={'address': place_name, 'key': API_KEY})
    results = r.json()['results']

    # take the coordinates of the first (best) match
    lat = results[0]['geometry']['location']['lat']
    lng = results[0]['geometry']['location']['lng']

    latitudes.append(lat)
    longitudes.append(lng)

df['Latitude'] = latitudes
df['Longitude'] = longitudes
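
The loop above indexes `results[0]` directly, so a place name the geocoder cannot resolve would raise an `IndexError`. A minimal sketch of a more defensive extraction step follows, assuming the response keeps the `results` / `geometry` / `location` layout read above. The helper name `extract_lat_lng` and both sample payloads are hypothetical.

```python
def extract_lat_lng(payload):
    """Return (lat, lng) from a geocoding-style payload, or (None, None) if empty."""
    results = payload.get('results', [])
    if not results:
        return None, None
    location = results[0]['geometry']['location']
    return location['lat'], location['lng']


# hypothetical payloads shaped like the JSON the loop reads
found = {'results': [{'geometry': {'location': {'lat': 51.4926116, 'lng': 0.1188182}}}]}
missing = {'results': [], 'status': 'ZERO_RESULTS'}

print(extract_lat_lng(found))    # (51.4926116, 0.1188182)
print(extract_lat_lng(missing))  # (None, None)
```

Appending `None` for unresolved places keeps the coordinate lists the same length as `df`, so the column assignments still line up.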

In [12]:
df.head(10)

Unnamed: 0,Location,Borough,Latitude,Longitude
0,Abbey Wood,"Bexley, Greenwich",51.492612,0.118818
1,Acton,"Ealing, Hammersmith and Fulham",51.508372,-0.27444
2,Addington,Croydon,51.358673,-0.031254
3,Addiscombe,Croydon,51.38055,-0.072274
4,Albany Park,Bexley,51.426316,0.102809
5,Aldborough Hatch,Redbridge,51.585525,0.098766
6,Aldgate,City,51.513438,-0.077171
7,Aldwych,Westminster,51.513266,-0.117183
8,Alperton,Brent,51.539601,-0.298837
9,Anerley,Bromley,51.411911,-0.067978


In [18]:
import folium

# centre the map on the mean of the geocoded coordinates
# (latitude and longitude are not defined until a later cell)
map_london = folium.Map(location=[df['Latitude'].mean(), df['Longitude'].mean()], zoom_start=10)

# loop names avoid shadowing the location and borough lists built earlier
for loc_name, boro, lat, lng in zip(df['Location'], df['Borough'], df['Latitude'], df['Longitude']):
    label = '{}, {}'.format(loc_name, boro)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_london)

map_london

### 2.3. Foursquare (Venues)

#### Define Foursquare Credentials and Version

In [19]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_CLIENT_ID
CLIENT_SECRET: YOUR_CLIENT_SECRET


#### Let's explore the first neighborhood in our dataframe.

Get the neighborhood's name.

In [20]:
df.loc[0, 'Location']

'Abbey Wood'

Get the neighborhood's latitude and longitude values.

In [21]:
location_latitude = df.loc[0, 'Latitude'] # community latitude value
location_longitude = df.loc[0, 'Longitude'] # community longitude value

location_name = df.loc[0, 'Location'] # community name

print('Latitude and longitude values of {} are {}, {}.'.format(location_name, 
                                                               location_latitude, 
                                                               location_longitude))

Latitude and longitude values of Abbey Wood are 51.4926116, 0.1188182.


#### Now, let's get the top 100 venues that are in Abbey Wood within a radius of 500 meters.

First, let's create the GET request URL. Name your URL **url**.

In [22]:
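# request parameters for the Abbey Wood query

radius = 500  # search radius in meters
latitude = location_latitude
longitude = location_longitude
LIMIT = 100  # maximum number of venues to return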

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
url



'https://api.foursquare.com/v2/venues/explore?client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET&ll=51.4926116,0.1188182&v=20180605&radius=500&limit=100'
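
As an alternative to formatting the query string by hand, the same request URL can be assembled with the standard library's `urlencode`, which also escapes special characters. This is a sketch only: the helper name `build_explore_url` is hypothetical, and the credentials are placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_explore_url(client_id, client_secret, lat, lng, version, radius, limit):
    """Assemble a venues/explore URL from its query parameters."""
    base = 'https://api.foursquare.com/v2/venues/explore'
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude pair
        'radius': radius,
        'limit': limit,
    }
    return base + '?' + urlencode(params)


example_url = build_explore_url('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET',
                                51.4926116, 0.1188182, '20180605', 500, 100)

# decode the query string again to confirm the parameters survived the round trip
query = parse_qs(urlsplit(example_url).query)
print(query['ll'], query['radius'], query['limit'])
```

Note that `urlencode` writes the comma in `ll` as `%2C`; decoding the query string returns the original `latitude,longitude` value.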

Send the GET request and examine the results

In [54]:
results = requests.get(url).json()
results

From the Foursquare lab in the previous module, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [24]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
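
To make the three branches of `get_category_type` concrete, here is a small sketch that calls it on plain dictionaries standing in for dataframe rows. The sample rows are hypothetical, and the function is repeated so that the example runs on its own.

```python
def get_category_type(row):
    """Return the name of the first category of a venue row, or None."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


# hypothetical rows: one with the flattened 'venue.' prefix, one without, one uncategorised
prefixed_row = {'venue.name': 'Lidl', 'venue.categories': [{'name': 'Supermarket'}]}
plain_row = {'name': 'Lidl', 'categories': [{'name': 'Supermarket'}]}
empty_row = {'venue.name': 'Unknown', 'venue.categories': []}

print(get_category_type(prefixed_row))  # Supermarket
print(get_category_type(plain_row))     # Supermarket
print(get_category_type(empty_row))     # None
```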

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [25]:
venues = results['response']['groups'][0]['items']

# flatten the nested JSON into one column per field; pd.json_normalize replaces the
# deprecated pandas.io.json.json_normalize import in pandas 1.0 and later
nearby_venues = pd.json_normalize(venues)

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()



Unnamed: 0,name,categories,lat,lng
0,Sainsbury's,Supermarket,51.492824,0.120724
1,Lidl,Supermarket,51.496152,0.118417
2,Platform 1,Platform,51.491023,0.119491
3,Bean @ Work,Coffee Shop,51.491172,0.120649
4,Costcutter,Convenience Store,51.491287,0.120938


And how many venues were returned by Foursquare?

In [26]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

7 venues were returned by Foursquare.


<a id='item2'></a>

#### Explore Neighborhoods in London

#### Let's create a function to repeat the same process for all the neighbourhoods in London

In [27]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
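
The list comprehension in `getNearbyVenues` reads `v['venue']['categories'][0]`, which would raise an `IndexError` for any venue returned without a category. A hedged sketch of the same row-building step with that case skipped is shown below. The helper name `venue_rows` and the sample items are hypothetical, and skipping (rather than labelling) uncategorised venues is an assumption.

```python
def venue_rows(name, lat, lng, items):
    """Build one tuple per venue, skipping venues that have no category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        if not categories:
            continue  # nothing to cluster on, so leave this venue out
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     categories[0]['name']))
    return rows


# hypothetical items shaped like the entries of response -> groups[0] -> items
items = [
    {'venue': {'name': "Sainsbury's",
               'location': {'lat': 51.492824, 'lng': 0.120724},
               'categories': [{'name': 'Supermarket'}]}},
    {'venue': {'name': 'Unlabelled venue',
               'location': {'lat': 51.49, 'lng': 0.12},
               'categories': []}},
]

rows = venue_rows('Abbey Wood', 51.492612, 0.118818, items)
print(rows)
```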

#### Now write the code to run the above function on each neighbourhood and create a new dataframe called *london_venues*.

In [28]:
london_venues = getNearbyVenues(names=df['Location'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )



Abbey Wood
Acton
Addington
Addiscombe
Albany Park
Aldborough Hatch
Aldgate
Aldwych
Alperton
Anerley
Angel
Aperfield
Archway
Ardleigh Green
Arkley
Arnos Grove
Balham
Bankside
Barbican
Barking
Barkingside
Barnehurst
Barnes
Barnes Cray
Barnet Gate
Barnet
Barnsbury
Battersea
Bayswater
Beckenham
Beckton
Becontree
Becontree Heath
Beddington
Bedford Park
Belgravia
Bellingham
Belmont
Belmont
Belsize Park
Belvedere
Bermondsey
Berrylands
Bethnal Green
Bexley
Bexleyheath
Bickley
Biggin Hill
Blackfen
Blackfriars
Blackheath
Blackheath Royal Standard
Blackwall
Blendon
Bloomsbury
Botany Bay
Bounds Green
Bow
Bowes Park
Brentford
Brent Cross
Brent Park
Brimsdown
Brixton
Brockley
Bromley
Bromley
Bromley Common
Brompton
Brondesbury
Brunswick Park
Bulls Cross
Burnt Oak
Burroughs, The
Camberwell
Cambridge Heath
Camden Town
Canary Wharf
Cann Hall
Canning Town
Canonbury
Carshalton
Castelnau
Castle Green
Catford
Chadwell Heath
Chalk Farm
Charing Cross
Charlton
Chase Cross
Cheam
Chelsea
Chelsfield
Chessington


#### Let's check the size of the resulting dataframe

In [29]:
print(london_venues.shape)
london_venues.head()

(12229, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Abbey Wood,51.492612,0.118818,Sainsbury's,51.492824,0.120724,Supermarket
1,Abbey Wood,51.492612,0.118818,Lidl,51.496152,0.118417,Supermarket
2,Abbey Wood,51.492612,0.118818,Platform 1,51.491023,0.119491,Platform
3,Abbey Wood,51.492612,0.118818,Bean @ Work,51.491172,0.120649,Coffee Shop
4,Abbey Wood,51.492612,0.118818,Costcutter,51.491287,0.120938,Convenience Store


### 3) Methodology

### 3.1. Data Cleaning

Let's select the categories that will be used in the analysis

In [28]:
knn_venues = ['Pharmacy', 'Hotel', 'Pub', 'Train Station', 'Gym', 'Bus Station', 'Supermarket', 'Bar', 'Gym / Fitness Center', 'Metro Station']
london_venues = london_venues.loc[london_venues['Venue Category'].isin(knn_venues)]

Let's check how many venues were returned for each neighborhood

In [29]:
london_venues.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Abbey Wood,3,3,3,3,3,3
Acton,7,7,7,7,7,7
Addington,1,1,1,1,1,1
Albany Park,5,5,5,5,5,5
Aldgate,12,12,12,12,12,12
...,...,...,...,...,...,...
Woolwich,13,13,13,13,13,13
Worcester Park,5,5,5,5,5,5
Wormwood Scrubs,1,1,1,1,1,1
Yeading,1,1,1,1,1,1


#### Let's find out how many unique categories can be curated from all the returned venues

In [30]:
print('There are {} unique categories.'.format(london_venues['Venue Category'].nunique()))

There are 10 unique categories.


### 3.2. Analyze Each Neighborhood

In [31]:
# one hot encoding
london_onehot = pd.get_dummies(london_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhood column back to dataframe
london_onehot['Neighbourhood'] = london_venues['Neighbourhood'] 

# move neighbourhood column to the first column
fixed_columns = [london_onehot.columns[-1]] + list(london_onehot.columns[:-1])
london_onehot = london_onehot[fixed_columns]

london_onehot.head()

Unnamed: 0,Neighbourhood,Bar,Bus Station,Gym,Gym / Fitness Center,Hotel,Metro Station,Pharmacy,Pub,Supermarket,Train Station
0,Abbey Wood,0,0,0,0,0,0,0,0,1,0
1,Abbey Wood,0,0,0,0,0,0,0,0,1,0
2,Abbey Wood,0,0,0,0,0,0,0,0,0,1
6,Acton,0,0,0,0,1,0,0,0,0,0
7,Acton,0,0,0,0,0,0,0,1,0,0


And let's examine the new dataframe size.

In [32]:
london_onehot.shape

(2263, 11)

#### Next, let's group rows by neighbourhood and take the mean of the one-hot columns, which gives the frequency of occurrence of each category

In [33]:
london_grouped = london_onehot.groupby('Neighbourhood').mean().reset_index()
london_grouped

Unnamed: 0,Neighbourhood,Bar,Bus Station,Gym,Gym / Fitness Center,Hotel,Metro Station,Pharmacy,Pub,Supermarket,Train Station
0,Abbey Wood,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.666667,0.333333
1,Acton,0.000000,0.000000,0.0,0.285714,0.142857,0.000000,0.000000,0.571429,0.000000,0.000000
2,Addington,0.000000,1.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,Albany Park,0.000000,0.000000,0.0,0.000000,0.200000,0.000000,0.400000,0.200000,0.200000,0.000000
4,Aldgate,0.083333,0.000000,0.0,0.250000,0.500000,0.000000,0.000000,0.166667,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...
447,Woolwich,0.000000,0.000000,0.0,0.000000,0.153846,0.076923,0.153846,0.384615,0.230769,0.000000
448,Worcester Park,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.400000,0.200000,0.200000,0.200000
449,Wormwood Scrubs,0.000000,0.000000,1.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
450,Yeading,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,0.000000
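
The mean of a one-hot column is simply the share of a neighbourhood's venues that fall in that category. The sketch below redoes that arithmetic in plain Python for two neighbourhoods. The venue lists are reconstructed from the counts and frequencies shown above (3 venues for Abbey Wood, 7 for Acton), so treat them as illustrative input rather than the raw data.

```python
from collections import Counter, defaultdict

# (neighbourhood, venue category) pairs, reconstructed for illustration
venues = (
    [('Abbey Wood', 'Supermarket')] * 2
    + [('Abbey Wood', 'Train Station')]
    + [('Acton', 'Pub')] * 4
    + [('Acton', 'Gym / Fitness Center')] * 2
    + [('Acton', 'Hotel')]
)

# count venues per category within each neighbourhood
counts = defaultdict(Counter)
for neighbourhood, category in venues:
    counts[neighbourhood][category] += 1

# divide by the neighbourhood total: this equals the mean of the one-hot column
frequencies = {
    hood: {cat: n / sum(cats.values()) for cat, n in cats.items()}
    for hood, cats in counts.items()
}

for hood, freqs in frequencies.items():
    print(hood, {cat: round(f, 6) for cat, f in freqs.items()})
```

Because each row sums to 1, a neighbourhood with a single venue (such as Yeading) gets a frequency of 1.0 for that category, so small counts can dominate the clustering input.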


#### Let's confirm the new size

In [34]:
london_grouped.shape

(452, 11)

#### Let's print each neighborhood along with the top 5 most common venues

In [35]:
num_top_venues = 5

for hood in london_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = london_grouped[london_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Abbey Wood----
           venue  freq
0    Supermarket  0.67
1  Train Station  0.33
2            Bar  0.00
3    Bus Station  0.00
4            Gym  0.00


----Acton----
                  venue  freq
0                   Pub  0.57
1  Gym / Fitness Center  0.29
2                 Hotel  0.14
3                   Bar  0.00
4           Bus Station  0.00


----Addington----
                  venue  freq
0           Bus Station   1.0
1                   Bar   0.0
2                   Gym   0.0
3  Gym / Fitness Center   0.0
4                 Hotel   0.0


----Albany Park----
         venue  freq
0     Pharmacy   0.4
1        Hotel   0.2
2          Pub   0.2
3  Supermarket   0.2
4          Bar   0.0


----Aldgate----
                  venue  freq
0                 Hotel  0.50
1  Gym / Fitness Center  0.25
2                   Pub  0.17
3                   Bar  0.08
4           Bus Station  0.00


----Aldwych----
                  venue  freq
0                   Pub  0.50
1                 Hotel

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [36]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 5 venues for each neighbourhood.

In [37]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond 3rd use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = london_grouped['Neighbourhood']

for ind in np.arange(london_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(london_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.tail()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
447,Woolwich,Pub,Supermarket,Pharmacy,Hotel,Metro Station
448,Worcester Park,Pharmacy,Train Station,Supermarket,Pub,Metro Station
449,Wormwood Scrubs,Gym,Train Station,Supermarket,Pub,Pharmacy
450,Yeading,Supermarket,Train Station,Pub,Pharmacy,Metro Station
451,Yiewsley,Supermarket,Bar,Pub,Hotel,Bus Station
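
One caveat when reading this table: the ranking always fills all five columns, even when a neighbourhood has fewer than five categories with a non-zero frequency. Wormwood Scrubs, for example, has a frequency of 1.0 for Gym and 0.0 for everything else, so its 2nd to 5th entries are zero-frequency categories. A minimal sketch of a ranking that keeps only non-zero categories follows. The helper name `top_nonzero_venues` is hypothetical, and the input is the Abbey Wood row from the grouped table.

```python
def top_nonzero_venues(frequencies, num_top_venues):
    """Rank categories by frequency and drop those that never occur."""
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    return [category for category, freq in ranked[:num_top_venues] if freq > 0]


# Abbey Wood's row from the grouped frequency table
abbey_wood = {
    'Bar': 0.0, 'Bus Station': 0.0, 'Gym': 0.0, 'Gym / Fitness Center': 0.0,
    'Hotel': 0.0, 'Metro Station': 0.0, 'Pharmacy': 0.0, 'Pub': 0.0,
    'Supermarket': 0.666667, 'Train Station': 0.333333,
}

print(top_nonzero_venues(abbey_wood, 5))  # ['Supermarket', 'Train Station']
```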


### 3.3. Cluster Neighborhoods

Run *k*-means to cluster the neighbourhoods into 5 clusters.

In [48]:
london_grouped_clustering.head()

Unnamed: 0,Cluster Labels,Bar,Bus Station,Gym,Gym / Fitness Center,Hotel,Metro Station,Pharmacy,Pub,Supermarket,Train Station
0,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.666667,0.333333
1,0,0.0,0.0,0.0,0.285714,0.142857,0.0,0.0,0.571429,0.0,0.0
2,2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,0.0,0.0,0.0,0.0,0.2,0.0,0.4,0.2,0.2,0.0
4,1,0.083333,0.0,0.0,0.25,0.5,0.0,0.0,0.166667,0.0,0.0


In [41]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

london_grouped_clustering = london_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(london_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 0, 2, 3, 1, 0, 3, 0, 0, 3], dtype=int32)
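
For readers who want to see what the clustering step does, here is a plain-Python sketch of the basic *k*-means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is not scikit-learn's implementation, which by default chooses its starting centroids differently. The toy frequency vectors and the choice of starting centroids are assumptions made for illustration.

```python
def assign(points, centroids):
    """Label each point with the index of its nearest centroid (squared distance)."""
    labels = []
    for p in points:
        distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels


def update(points, labels, centroids):
    """Move each centroid to the mean of its points; keep it if it has none."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# toy (Pharmacy, Pub, Supermarket) frequency vectors, illustrative only
points = [(0.0, 0.6, 0.0), (0.0, 0.5, 0.1), (0.4, 0.2, 0.2), (0.4, 0.2, 0.4)]
labels, centroids = k_means(points, centroids=[points[0], points[2]])
print(labels)  # [0, 0, 1, 1]
```

The pub-heavy vectors end up in one cluster and the pharmacy-heavy vectors in the other, which is the same kind of grouping the notebook relies on across all ten categories.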

Let's create a new dataframe that includes the cluster as well as the top 5 venues for each neighbourhood.

In [56]:
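# drop labels left over from a previous run so the insert below does not fail on a re-run
neighbourhoods_venues_sorted.drop(columns=['Cluster Labels'], errors='ignore', inplace=True)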

In [57]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

london_merged = df

# join the cluster labels and top venues onto df to add latitude/longitude for each neighbourhood
london_merged = london_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Location')

london_merged.head() # check the last columns!

Unnamed: 0,Location,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Abbey Wood,"Bexley, Greenwich",51.492612,0.118818,3.0,Supermarket,Train Station,Pub,Pharmacy,Metro Station
1,Acton,"Ealing, Hammersmith and Fulham",51.508372,-0.27444,0.0,Pub,Gym / Fitness Center,Hotel,Train Station,Supermarket
2,Addington,Croydon,51.358673,-0.031254,2.0,Bus Station,Train Station,Supermarket,Pub,Pharmacy
3,Addiscombe,Croydon,51.38055,-0.072274,,,,,,
4,Albany Park,Bexley,51.426316,0.102809,3.0,Pharmacy,Supermarket,Pub,Hotel,Train Station


<a id='item5'></a>

### 3.4. Examine Clusters

Now we can examine each cluster and determine the venue categories that distinguish it. The defining categories are used to describe each cluster in the Discussion section.

#### Cluster 1

In [64]:
london_merged.loc[london_merged['Cluster Labels'] == 0, london_merged.columns[[0] + list(range(5, london_merged.shape[1]))]]

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1,Acton,Pub,Gym / Fitness Center,Hotel,Train Station,Supermarket
7,Aldwych,Pub,Hotel,Gym / Fitness Center,Gym,Bar
9,Anerley,Pub,Train Station,Supermarket,Pharmacy,Metro Station
10,Angel,Pub,Gym / Fitness Center,Supermarket,Hotel,Bar
22,Barnes,Pub,Train Station,Supermarket,Pharmacy,Metro Station
...,...,...,...,...,...,...
496,Wembley,Pub,Train Station,Supermarket,Pharmacy,Metro Station
503,West Hackney,Pub,Train Station,Gym / Fitness Center,Bus Station,Bar
505,West Hampstead,Pub,Train Station,Supermarket,Pharmacy,Metro Station
510,West Norwood,Pub,Bus Station,Train Station,Supermarket,Pharmacy


#### Cluster 2

In [65]:
london_merged.loc[london_merged['Cluster Labels'] == 1, london_merged.columns[[0] + list(range(5, london_merged.shape[1]))]]

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
6,Aldgate,Hotel,Gym / Fitness Center,Pub,Bar,Train Station
19,Barking,Hotel,Supermarket,Pharmacy,Gym,Train Station
28,Bayswater,Hotel,Pub,Gym / Fitness Center,Pharmacy,Supermarket
35,Belgravia,Hotel,Supermarket,Gym / Fitness Center,Bar,Train Station
39,Belsize Park,Hotel,Pub,Pharmacy,Train Station,Supermarket
52,Blackwall,Hotel,Gym / Fitness Center,Train Station,Supermarket,Pub
68,Brompton,Hotel,Pub,Gym,Gym / Fitness Center,Train Station
77,Canary Wharf,Hotel,Gym / Fitness Center,Bar,Train Station,Supermarket
87,Charing Cross,Hotel,Pub,Pharmacy,Bar,Train Station
95,Chinatown,Hotel,Pub,Gym / Fitness Center,Gym,Train Station


#### Cluster 3

In [66]:
london_merged.loc[london_merged['Cluster Labels'] == 2, london_merged.columns[[0] + list(range(5, london_merged.shape[1]))]]

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,Addington,Bus Station,Train Station,Supermarket,Pub,Pharmacy
12,Archway,Pub,Bar,Hotel,Gym / Fitness Center,Train Station
13,Ardleigh Green,Pub,Gym / Fitness Center,Bar,Train Station,Supermarket
15,Arnos Grove,Metro Station,Train Station,Supermarket,Pub,Pharmacy
16,Balham,Pub,Supermarket,Bar,Pharmacy,Hotel
...,...,...,...,...,...,...
522,Wood Green,Pub,Pharmacy,Supermarket,Gym / Fitness Center,Train Station
523,Woodford,Metro Station,Train Station,Supermarket,Pub,Pharmacy
525,Woodlands,Gym / Fitness Center,Train Station,Supermarket,Pub,Pharmacy
530,Wormwood Scrubs,Gym,Train Station,Supermarket,Pub,Pharmacy


#### Cluster 4

In [67]:
london_merged.loc[london_merged['Cluster Labels'] == 3, london_merged.columns[[0] + list(range(5, london_merged.shape[1]))]]

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Abbey Wood,Supermarket,Train Station,Pub,Pharmacy,Metro Station
4,Albany Park,Pharmacy,Supermarket,Pub,Hotel,Train Station
8,Alperton,Supermarket,Metro Station,Gym / Fitness Center,Train Station,Pub
11,Aperfield,Supermarket,Train Station,Pub,Pharmacy,Metro Station
20,Barkingside,Supermarket,Pub,Pharmacy,Train Station,Metro Station
...,...,...,...,...,...,...
514,Whetstone,Pub,Supermarket,Pharmacy,Metro Station,Train Station
521,Winchmore Hill,Supermarket,Train Station,Pub,Pharmacy,Metro Station
528,Woolwich,Pub,Supermarket,Pharmacy,Hotel,Metro Station
529,Worcester Park,Pharmacy,Train Station,Supermarket,Pub,Metro Station


#### Cluster 5

In [68]:
london_merged.loc[london_merged['Cluster Labels'] == 4, london_merged.columns[[0] + list(range(5, london_merged.shape[1]))]]

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
36,Bellingham,Train Station,Bus Station,Supermarket,Pub,Pharmacy
93,Chessington,Train Station,Supermarket,Pub,Pharmacy,Metro Station
120,Crews Hill,Train Station,Supermarket,Pub,Pharmacy,Metro Station
160,Elmstead,Train Station,Supermarket,Pub,Pharmacy,Metro Station
167,Erith,Train Station,Gym / Fitness Center,Supermarket,Pub,Pharmacy
194,Grange Park,Train Station,Supermarket,Pub,Pharmacy,Metro Station
204,Hadley Wood,Train Station,Supermarket,Pub,Pharmacy,Metro Station
241,Hither Green,Train Station,Gym / Fitness Center,Supermarket,Pub,Pharmacy
259,Kenley,Train Station,Supermarket,Pub,Pharmacy,Metro Station
304,Maze Hill,Train Station,Supermarket,Pub,Pharmacy,Metro Station


### 4) Discussion (Results)

Based on the clusters shown above, and speaking as a pharmacist myself, clusters 1 and 2 appear to be the most suitable for opening a new pharmacy. Let's define each cluster first, then view clusters 1 and 2 again and discuss why they are the most suitable.


Cluster 1 --> Mostly pubs, with fewer pharmacies

Cluster 2 --> Mostly hotels, with fewer pharmacies

Cluster 3 --> Most of these areas have pharmacies as the fifth most common venue

Cluster 4 --> A mix of areas, some of which have pharmacies as their most common venue

Cluster 5 --> Areas with pharmacies as the fourth most common venue


We are looking for areas that have fewer pharmacies and more venues that bring high foot traffic to the area, such as pubs and hotels.
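
One way to check these cluster descriptions numerically is to average the Pharmacy frequency within each cluster. The sketch below does this for only the five rows of `london_grouped_clustering` printed in section 3.3, so its numbers are illustrative and are not results for the full dataset. The variable names are hypothetical.

```python
from collections import defaultdict

# (cluster label, Pharmacy frequency) for the five rows printed in section 3.3
rows = [(3, 0.0), (0, 0.0), (2, 0.0), (3, 0.4), (1, 0.0)]

# collect the pharmacy frequencies of each cluster
by_cluster = defaultdict(list)
for label, pharmacy_freq in rows:
    by_cluster[label].append(pharmacy_freq)

# mean pharmacy share per cluster label
mean_pharmacy_share = {
    label: sum(values) / len(values) for label, values in sorted(by_cluster.items())
}

# label 0 is reported as "Cluster 1" in this notebook, hence label + 1
for label, share in mean_pharmacy_share.items():
    print('Cluster {}: mean pharmacy share {:.2f}'.format(label + 1, share))
```

Running the same aggregation over every neighbourhood would show whether clusters 1 and 2 really have the lowest pharmacy share.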

### 5) Conclusion

Based on the discussion, the most suitable areas are the ones that fall in Clusters 1 and 2.

#### Cluster 1

In [64]:
london_merged.loc[london_merged['Cluster Labels'] == 0, london_merged.columns[[0] + list(range(5, london_merged.shape[1]))]]

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1,Acton,Pub,Gym / Fitness Center,Hotel,Train Station,Supermarket
7,Aldwych,Pub,Hotel,Gym / Fitness Center,Gym,Bar
9,Anerley,Pub,Train Station,Supermarket,Pharmacy,Metro Station
10,Angel,Pub,Gym / Fitness Center,Supermarket,Hotel,Bar
22,Barnes,Pub,Train Station,Supermarket,Pharmacy,Metro Station
...,...,...,...,...,...,...
496,Wembley,Pub,Train Station,Supermarket,Pharmacy,Metro Station
503,West Hackney,Pub,Train Station,Gym / Fitness Center,Bus Station,Bar
505,West Hampstead,Pub,Train Station,Supermarket,Pharmacy,Metro Station
510,West Norwood,Pub,Bus Station,Train Station,Supermarket,Pharmacy


#### Cluster 2

In [65]:
london_merged.loc[london_merged['Cluster Labels'] == 1, london_merged.columns[[0] + list(range(5, london_merged.shape[1]))]]

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
6,Aldgate,Hotel,Gym / Fitness Center,Pub,Bar,Train Station
19,Barking,Hotel,Supermarket,Pharmacy,Gym,Train Station
28,Bayswater,Hotel,Pub,Gym / Fitness Center,Pharmacy,Supermarket
35,Belgravia,Hotel,Supermarket,Gym / Fitness Center,Bar,Train Station
39,Belsize Park,Hotel,Pub,Pharmacy,Train Station,Supermarket
52,Blackwall,Hotel,Gym / Fitness Center,Train Station,Supermarket,Pub
68,Brompton,Hotel,Pub,Gym,Gym / Fitness Center,Train Station
77,Canary Wharf,Hotel,Gym / Fitness Center,Bar,Train Station,Supermarket
87,Charing Cross,Hotel,Pub,Pharmacy,Bar,Train Station
95,Chinatown,Hotel,Pub,Gym / Fitness Center,Gym,Train Station


### Thank You