# The Battle of Neighborhoods

In this project, I have clustered the 25 most populous cities in the UK based on their top ten sport venue categories, e.g., basketball, tennis, and volleyball courts, using k-means clustering. I have assumed that the demand for sport products follows the availability of sports venues; for example, if there are many tennis courts and zero football pitches in a city, there will probably be higher demand for tennis rackets and balls than for football shoes. 

This notebook is organized as follows. In Sections 1 and 2, I scrape a table of the most populous cities in the UK, transform it into a dataframe, and obtain the latitude and longitude of each city using geopy. In Section 3, I obtain the top 10 most common sport venues in each city; the cities are then clustered using this information.

In [1]:
#libraries necessary for webscraping
!pip install lxml html5lib beautifulsoup4   #lxml parser
from bs4 import BeautifulSoup               #this package is used to extract data from html files
from urllib.request import urlopen          #as the name suggests, this is used to open URLs

#libraries necessary for data manipulation and visualization 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
#import seaborn as sns
%matplotlib inline
!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library

#this is necessary to obtain geospatial information
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim       # convert an address into latitude and longitude values
import json                               # library to handle JSON files
import requests # library to handle requests
#from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

#finally, import k-means for the clustering stage
from sklearn.cluster import KMeans
print('Libraries imported')


## 1. Web Scraping and Data Wrangling

In this section, I will (i) get a table of the top 1000 UK cities by population in HTML format and then
(ii) transform it into a pandas dataframe for easier manipulation.

In [2]:
url='https://www.thegeographist.com/uk-cities-population-1000/'
html=urlopen(url)                        # Get the html of the page
soup = BeautifulSoup(html, 'lxml')       # Create a Beautiful Soup object from the html

In [4]:
tables=soup.find_all('table')                 #extract all the tables in the webpage into a soup object
print('There is {} table in this webpage'.format(len(tables)))

There is 1 table in this webpage


In [5]:
#the block below converts the table soup object into a dataframe
rows = tables[0].find_all('tr')
import re

list_rows = []                                                 
for row in rows:                                           # Iterate through the table rows
    cells = row.find_all('td')                             # and assign the cells of the rows to the object "cells"
    str_cells = str(cells)                                 # Convert the BeautifulSoup elements to strings
    cleantext = BeautifulSoup(str_cells, "lxml").get_text()# Remove the html tags from the text
    list_rows.append(cleantext)                            # Append the rows to a list, 
                                                           # which will then be converted to a dataframe

In [6]:
df = pd.DataFrame(list_rows)                               # Convert the list into a Pandas Dataframe
df.head()

Unnamed: 0,0
0,[]
1,"[1, 1, London, London, , London, 8,907,918]"
2,"[2, 1, Birmingham, West Midlands, , West Midla..."
3,"[3, 1, Glasgow, Glasgow, , Scotland, 612,040]"
4,"[4, 1, Liverpool, Merseyside, , North West, 57..."


In [7]:
df1 = df[0].str.split(', ', expand=True)   # Split the first column based on the ", " character (note: the space
                                           # after the comma is important to prevent the population number from 
                                           # splitting)
df1.head()

Unnamed: 0,0,1,2,3,4,5,6
0,[],,,,,,
1,[1,1.0,London,London,,London,"8,907,918]"
2,[2,1.0,Birmingham,West Midlands,,West Midlands,"1,153,717]"
3,[3,1.0,Glasgow,Glasgow,,Scotland,"612,040]"
4,[4,1.0,Liverpool,Merseyside,,North West,"579,256]"


In [8]:
df1.drop([0,1,3,4,5], axis=1, inplace=True)                 # Only keep City name and Population, i.e., columns 2 and 6
df1.drop([0], axis=0, inplace=True)                         # The first row does not contain useful info; hence, remove it
df1.rename({2:'City', 6:'Population'},axis=1, inplace=True) #Rename the columns
df1['Population']=df1['Population'].str.strip(']')          #Remove the trailing bracket from the Population column
df1.reset_index(inplace=True,drop=True)
df1.head()

Unnamed: 0,City,Population
0,London,8907918
1,Birmingham,1153717
2,Glasgow,612040
3,Liverpool,579256
4,Bristol,571922
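
If the population is needed as a number rather than as text, the raw cell has to lose both the trailing bracket and the thousands separators. The sketch below shows this on a few raw cells, assuming they look like the values in the split table above; the helper name `parse_population` is hypothetical and not part of the notebook.

```python
def parse_population(raw_cell):
    """Convert a raw cell such as '8,907,918]' into an integer."""
    cleaned = raw_cell.strip('[] ')        # drop stray brackets and spaces
    cleaned = cleaned.replace(',', '')     # drop the thousands separators
    return int(cleaned)

# raw values copied from the last column of the split table above
raw_cells = ['8,907,918]', '1,153,717]', '612,040]']
populations = [parse_population(cell) for cell in raw_cells]
print(populations)
```

On the dataframe itself, something like `df1['Population'].str.replace(',', '').astype(int)` should have the same effect, provided every cell follows this pattern.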


In [9]:
df1.to_csv(r'UK_cities_by_population.csv', index=False) #I am exporting this dataframe for future analysis

In [10]:
df1.shape

(1000, 2)

In [11]:
#there are 1000 cities in the table; for simplicity, I will only 
#analyze the 25 most populous ones in this project
df1=df1.loc[0:24,:]

In [12]:
df1.shape

(25, 2)

## 2. Obtain the geospatial information on the cities

In this section, I will obtain the latitude and longitude of the cities using the geopy package.

In [13]:
geolocator = Nominatim(user_agent="uk_explorer")       # Define a user_agent called "uk_explorer"

city_rows = []                                         # collect one dictionary per city
for iteration, (city, population) in enumerate(zip(df1['City'], df1['Population']), start=1):

    address = '{}, UK'.format(city)
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    print('({}/25): {} has latitude {} and longitude {}'.format(iteration, address,latitude,longitude))   #sanity check
    city_rows.append({'City': city, 'Population': population,
                      'Latitude': latitude, 'Longitude': longitude})

# DataFrame.append was removed in pandas 2.0, so the dataframe is built in one step
top_cities = pd.DataFrame(city_rows, columns=['City', 'Population', 'Latitude', 'Longitude'])
print('finished')

(1/25): London, UK has latitude 51.5073219 and longitude -0.1276474
(2/25): Birmingham, UK has latitude 52.4796992 and longitude -1.9026911
(3/25): Glasgow, UK has latitude 55.8609825 and longitude -4.2488787
(4/25): Liverpool, UK has latitude 53.407154 and longitude -2.991665
(5/25): Bristol, UK has latitude 51.4538022 and longitude -2.5972985
(6/25): Manchester, UK has latitude 53.4794892 and longitude -2.2451148
(7/25): Sheffield, UK has latitude 53.3806626 and longitude -1.4702278
(8/25): Leeds, UK has latitude 53.7974185 and longitude -1.5437941
(9/25): Edinburgh, UK has latitude 55.9533456 and longitude -3.1883749
(10/25): Leicester, UK has latitude 52.6361398 and longitude -1.1330789
(11/25): Coventry, UK has latitude 52.4081812 and longitude -1.510477
(12/25): Bradford, UK has latitude 53.7944229 and longitude -1.7519186
(13/25): Cardiff, UK has latitude 51.4816546 and longitude -3.1791934
(14/25): Belfast, UK has latitude 54.5964411 and longitude -5.9302761
(15/25): Nottingham
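
`geolocator.geocode` returns `None` when Nominatim finds no match, in which case reading `location.latitude` raises an `AttributeError`. The sketch below shows one way to guard against that. The names `safe_coordinates` and `fake_geocode` are hypothetical, and the stand-in geocoder exists only so that the example runs without network access.

```python
from collections import namedtuple

# minimal stand-in for the object returned by geopy (assumption for this sketch)
Location = namedtuple('Location', ['latitude', 'longitude'])

def fake_geocode(address):
    """Hypothetical geocoder used only for this sketch."""
    known = {'London, UK': Location(51.5073219, -0.1276474)}
    return known.get(address)              # None for unknown addresses, like geopy

def safe_coordinates(geocode, address):
    """Return (latitude, longitude), or (None, None) if the lookup finds nothing."""
    location = geocode(address)
    if location is None:
        return None, None
    return location.latitude, location.longitude

print(safe_coordinates(fake_geocode, 'London, UK'))
print(safe_coordinates(fake_geocode, 'Atlantis, UK'))
```

In the notebook, `geolocator.geocode` would be passed in place of `fake_geocode`; cities that come back as `(None, None)` can then be inspected by hand instead of stopping the loop.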

In [14]:
top_cities.head()

Unnamed: 0,City,Population,Latitude,Longitude
0,London,8907918,51.507322,-0.127647
1,Birmingham,1153717,52.479699,-1.902691
2,Glasgow,612040,55.860982,-4.248879
3,Liverpool,579256,53.407154,-2.991665
4,Bristol,571922,51.453802,-2.597298


<a id='item1'></a>


Now, I want to visualize the cities on a map of the UK to check that the locations obtained from geopy are correct.

In [15]:
geolocator = Nominatim(user_agent="uk_explorer")       # Define a user_agent called "uk_explorer"
locationUK = geolocator.geocode('UK')
latitudeUK = locationUK.latitude
longitudeUK = locationUK.longitude
print('The geographical coordinates of the UK are {}, {}.'.format(latitudeUK, longitudeUK))

The geographical coordinates of the UK are 54.7023545, -3.2765753.


In [16]:
# create map of the UK using latitude and longitude values
map_uk = folium.Map(location=[latitudeUK, longitudeUK], zoom_start=5)

# add markers to the map
for lat, lng, city in zip(top_cities['Latitude'], top_cities['Longitude'], top_cities['City']):
    label = '{}'.format(city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_uk)  
    
map_uk

The locations of the cities obtained with geopy seem correct.

In [17]:
top_cities.to_csv(r'top_25_UK_cities_by_population.csv', index=False)

## 2. Explore the sports venues in the UK cities


In this section, I will use the Foursquare API to determine which sport venue categories are the most common in UK cities.


#### Define Foursquare Credentials and Version


In [33]:
CLIENT_ID = 'YOUR_CLIENT_ID'            # your Foursquare ID (real credentials should not be published)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET'    # your Foursquare Secret
VERSION = '20180604'
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'      # your Foursquare Access Token
LIMIT = 100

Define the Foursquare category ID of different sport venues;
this info can be found at https://developer.foursquare.com/docs/build-with-foursquare/categories 

In [37]:
category1='52e81612bcbc57f1066b7a2b'   #Badminton Court
category2='4bf58dd8d48988d1e8941735'   #Baseball Field
category3='4bf58dd8d48988d1e1941735'   #Basketball Court 
category4='4bf58dd8d48988d1e6941735'   #Golf Course
category5='52f2ab2ebcbc57f1066b8b47'   #Boxing Gym
category6='503289d391d4c4b30a586d6a'   #Climbing Gym
category7='52f2ab2ebcbc57f1066b8b48'   #Gymnastics Gym
category8='4bf58dd8d48988d101941735'   #Martial Arts Dojo
category9='5744ccdfe4b0c0459246b4b2'   #Pilates Studio
category10='4bf58dd8d48988d102941735'  #Yoga Studio
category11='4f452cd44b9081a197eba860'  #Hockey Field
category12='56aa371be4b08b9a8d57352c'  #Hockey Rink
category13='52e81612bcbc57f1066b7a2c'  #Rugby Pitch
category14='4bf58dd8d48988d102941735'  #Skate Park (note: same ID as category10, Yoga Studio; verify against the Foursquare category list)
category15='4bf58dd8d48988d168941735'  #Skating Rink
category16='4cce455aebf7b749d5e191f5'  #Soccer Field
category17='52e81612bcbc57f1066b7a2d'  #Squash Court
category18='4e39a956bd410d7aed40cbc3'  #Tennis Court
category19='4eb1bf013b7b6f98df247e07'  #Volleyball Court
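
Keeping the IDs in a dictionary rather than in numbered variables makes it easy to check for copy-paste mistakes before calling the API. The sketch below uses a subset of the IDs from the cell above; the variable names (`category_ids`, `names_by_id`, `duplicates`, `category_param`) are my own and do not appear elsewhere in the notebook.

```python
from collections import defaultdict

# subset of the category IDs listed above, keyed by category name
category_ids = {
    'Martial Arts Dojo': '4bf58dd8d48988d101941735',
    'Yoga Studio': '4bf58dd8d48988d102941735',
    'Skate Park': '4bf58dd8d48988d102941735',
    'Skating Rink': '4bf58dd8d48988d168941735',
    'Tennis Court': '4e39a956bd410d7aed40cbc3',
}

# group category names by ID to reveal IDs that were entered more than once
names_by_id = defaultdict(list)
for name, category_id in category_ids.items():
    names_by_id[category_id].append(name)

duplicates = {cid: names for cid, names in names_by_id.items() if len(names) > 1}
print(duplicates)

# comma-separated string of distinct IDs, as expected by the categoryId parameter
category_param = ','.join(sorted(set(category_ids.values())))
print(category_param)
```

With the IDs as listed, the check reports that 'Yoga Studio' and 'Skate Park' share one ID, so skate parks are probably not being requested at all.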

In [38]:
#This function returns dataframe sport_venues which contains information on the top 100 sport venues in each city

def getNearbyVenues(names, latitudes, longitudes, radius=10000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?categoryId={},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},&client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&radius={}&limit={}'.format(
            category1,category2,category3,category4,category5,
            category6,category7,category8,category9,category10,
            category11,category12,category13,category14,category15,
            category16,category17,category18,category19,
            CLIENT_ID, 
            CLIENT_SECRET,  
            lat, 
            lng, 
            ACCESS_TOKEN,
            VERSION,
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()['response']['venues']
        
        # return only relevant information for each sport venue  
        
        venues_list.append([(name,lat, lng,
                            result['name'], 
                            result['categories'][0]['name'],
                            result['location']['lat'],
                            result['location']['lng']) for result in results])
    
    sport_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    sport_venues.columns =['City', 'City Latitude', 'City Longitude',
                            'Venue Name','Category','Venue Latitude', 'Venue Longitude']
    
    return(sport_venues)
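
The long format string in `getNearbyVenues` is easy to get wrong, because one missing placeholder shifts every later argument. As a rough alternative, the query string can be assembled from a dictionary with the standard library. The function name `build_search_url` is hypothetical, the parameter names are the ones used in the URL above, and `oauth_token` is left out for brevity.

```python
from urllib.parse import urlencode

def build_search_url(lat, lng, category_ids, client_id, client_secret,
                     version, radius=10000, limit=100):
    """Assemble a venues/search URL from named parameters (sketch only)."""
    params = {
        'categoryId': ','.join(category_ids),      # comma-separated category IDs
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),            # latitude,longitude
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the commas readable instead of encoding them as %2C
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params, safe=',')

url = build_search_url(51.5073219, -0.1276474,
                       ['4e39a956bd410d7aed40cbc3', '4cce455aebf7b749d5e191f5'],
                       'YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET', '20180604')
print(url)
```

No request is sent here; the resulting string could be passed to `requests.get` exactly as in the function above.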

#### Now I create dataframe UK_venues using the above function

In [39]:
# build the venues dataframe for the 25 cities
UK_venues=getNearbyVenues(names=top_cities['City'],
                          latitudes=top_cities['Latitude'],
                          longitudes=top_cities['Longitude'])

London
Birmingham
Glasgow
Liverpool
Bristol
Manchester
Sheffield
Leeds
Edinburgh
Leicester
Coventry
Bradford
Cardiff
Belfast
Nottingham
Kingston upon Hull
Newcastle upon Tyne
Stoke-on-Trent
Southampton
Derby
Portsmouth
Brighton
Plymouth
Northampton
Reading


In [18]:
UK_venues=pd.read_csv('UK_venues.csv')
UK_venues.head()

Unnamed: 0,City,City Latitude,City Longitude,Venue Name,Category,Venue Latitude,Venue Longitude
0,London,51.507322,-0.127647,Basketball Court,Basketball Court,51.514509,-0.128802
1,London,51.507322,-0.127647,Eythorne Park,Park,51.474665,-0.109728
2,London,51.507322,-0.127647,Globe Lawn Tennis Club,Tennis Court,51.550796,-0.164131
3,London,51.507322,-0.127647,London Fields Tennis Courts,Tennis Court,51.543177,-0.060551
4,London,51.507322,-0.127647,Low Hall Sports Ground,Soccer Field,51.574699,-0.040285


In [19]:
# I will save this dataframe to avoid repeating the Foursquare calls in the future
UK_venues.to_csv(r'UK_venues.csv', index=False)   

In [20]:
trial_list=[]
for i in UK_venues['Category']:
    trial_list.append(i)
print(trial_list)

['Basketball Court', 'Park', 'Tennis Court', 'Tennis Court', 'Soccer Field', 'Soccer Field', 'Tennis Court', 'Dance Studio', 'Tennis Court', 'Soccer Field', 'Forest', 'Martial Arts School', 'Climbing Gym', 'Yoga Studio', 'Gym / Fitness Center', 'Tennis Court', 'Golf Course', 'Soccer Field', 'Soccer Field', 'Soccer Field', 'Tennis Court', 'Soccer Field', 'Tennis Court', 'Boxing Gym', 'Tennis Court', 'Soccer Field', 'Park', 'Park', 'Baseball Field', 'Yoga Studio', 'Gym / Fitness Center', 'Hockey Field', 'Tennis Court', 'Yoga Studio', 'Skating Rink', 'Park', 'Climbing Gym', 'Skating Rink', 'Yoga Studio', 'Yoga Studio', 'Rugby Pitch', 'Yoga Studio', 'Yoga Studio', 'Yoga Studio', 'Yoga Studio', 'Gym / Fitness Center', 'Boxing Gym', 'Yoga Studio', 'Gym / Fitness Center', 'Gym', 'Yoga Studio', 'Soccer Field', 'Tennis Court', 'Park', 'Golf Course', 'Golf Course', 'Golf Course', 'Soccer Field', 'Martial Arts School', 'Park', 'Golf Course', 'Tennis Court', 'Soccer Field', 'Rugby Pitch', 'Soccer 
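
A raw list of several hundred labels is hard to scan, so a frequency count may be a more convenient way to spot categories that should be dropped or merged. The sketch below counts only the first twelve labels of the printed list; `sample_categories` and `category_counts` are names introduced for illustration.

```python
from collections import Counter

# first twelve categories of the list printed above
sample_categories = ['Basketball Court', 'Park', 'Tennis Court', 'Tennis Court',
                     'Soccer Field', 'Soccer Field', 'Tennis Court', 'Dance Studio',
                     'Tennis Court', 'Soccer Field', 'Forest', 'Martial Arts School']

category_counts = Counter(sample_categories)

# print the categories from most to least frequent
for category, count in category_counts.most_common():
    print('{:<22}{}'.format(category, count))
```

On the full dataframe, `UK_venues['Category'].value_counts()` should give the same kind of summary in one call.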

Two observations follow from the above result:

1. Several categories are not useful for the analysis; for example, the presence of a 'Stadium' or a 'Primary school' will not influence the demand for technical sportswear, so these categories can be removed.

2. Some categories can be combined, e.g., 'Rock Climbing Spot' and 'Climbing Gym', as they require similar equipment.

Hence, I will remove some of the categories

In [21]:
# categories that are unlikely to influence the demand for technical sportswear
excluded_categories = ['Park', 'Forest', 'Bar', 'Gym', 'Stadium', 'Playground',
                       "Women's Store", 'Spa', 'Dog Run', 'Private School', 'Lake', 'Field',
                       'Sports Club', 'Massage Studio', 'Other Great Outdoors',
                       'General Entertainment', 'Hotel', 'Performing Arts Venue',
                       'Physical Therapist', 'Athletics & Sports', 'Dance Studio',
                       'Gym / Fitness Center', 'Rugby Stadium', 'Soccer Stadium']

UK_venues = UK_venues[~UK_venues['Category'].isin(excluded_categories)]   # keep only the remaining categories
UK_venues.reset_index(drop=True, inplace=True)

and I will combine some of the categories

In [22]:
# map each category that should be merged onto its combined label
category_map = {
    'Rock Climbing Spot': 'Climbing Spot',
    'Climbing Gym': 'Climbing Spot',
    'College Football Field': 'Football/Rugby Pitch',
    'Rugby Pitch': 'Football/Rugby Pitch',
    'Disc Golf': 'Golf Course',
    'Hockey Field': 'Hockey Field/Rink',
    'Hockey Rink': 'Hockey Field/Rink',
}
UK_venues['Category'] = UK_venues['Category'].replace(category_map)   # all other categories are left unchanged

In [23]:
UK_venues.head()

Unnamed: 0,City,City Latitude,City Longitude,Venue Name,Category,Venue Latitude,Venue Longitude
0,London,51.507322,-0.127647,Basketball Court,Basketball Court,51.514509,-0.128802
1,London,51.507322,-0.127647,Globe Lawn Tennis Club,Tennis Court,51.550796,-0.164131
2,London,51.507322,-0.127647,London Fields Tennis Courts,Tennis Court,51.543177,-0.060551
3,London,51.507322,-0.127647,Low Hall Sports Ground,Soccer Field,51.574699,-0.040285
4,London,51.507322,-0.127647,Gunnersbury Sports And Social Club,Soccer Field,51.497535,-0.281597


Let's check how many venues were returned for each City


In [24]:
UK_venues.groupby('City').count()

City,City Latitude,City Longitude,Venue Name,Category,Venue Latitude,Venue Longitude
Belfast,43,43,43,43,43,43
Birmingham,42,42,42,42,42,42
Bradford,43,43,43,43,43,43
Brighton,42,42,42,42,42,42
Bristol,41,41,41,41,41,41
Cardiff,43,43,43,43,43,43
Coventry,39,39,39,39,39,39
Derby,40,40,40,40,40,40
Edinburgh,45,45,45,45,45,45
Glasgow,47,47,47,47,47,47


#### Let's find out how many unique categories can be curated from all the returned venues


In [25]:
print('There are {} unique categories.'.format(len(UK_venues['Category'].unique())))

There are 17 unique categories.


<a id='item3'></a>


## 3. Rank the venues in each city by frequency


In [26]:
UK_venues.head()

Unnamed: 0,City,City Latitude,City Longitude,Venue Name,Category,Venue Latitude,Venue Longitude
0,London,51.507322,-0.127647,Basketball Court,Basketball Court,51.514509,-0.128802
1,London,51.507322,-0.127647,Globe Lawn Tennis Club,Tennis Court,51.550796,-0.164131
2,London,51.507322,-0.127647,London Fields Tennis Courts,Tennis Court,51.543177,-0.060551
3,London,51.507322,-0.127647,Low Hall Sports Ground,Soccer Field,51.574699,-0.040285
4,London,51.507322,-0.127647,Gunnersbury Sports And Social Club,Soccer Field,51.497535,-0.281597


I will one-hot encode the venue categories so that their frequencies can be computed for each city and used for the clustering

In [27]:
# one hot encoding
UK_onehot = pd.get_dummies(UK_venues[['Category']], prefix="", prefix_sep="")
UK_onehot.head()

Unnamed: 0,Baseball Field,Basketball Court,Bowling Green,Boxing Gym,Climbing Spot,Football/Rugby Pitch,Golf Course,Gymnastics Gym,Hockey Field/Rink,Martial Arts School,Pilates Studio,Skating Rink,Soccer Field,Squash Court,Tennis Court,Volleyball Court,Yoga Studio
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


In [28]:
UK_onehot['City'] = UK_venues['City']                                      # add City column back to dataframe
fixed_columns = [UK_onehot.columns[-1]] + list(UK_onehot.columns[:-1])     #and move it to the first column
UK_onehot = UK_onehot[fixed_columns]
UK_onehot.head()

Unnamed: 0,City,Baseball Field,Basketball Court,Bowling Green,Boxing Gym,Climbing Spot,Football/Rugby Pitch,Golf Course,Gymnastics Gym,Hockey Field/Rink,Martial Arts School,Pilates Studio,Skating Rink,Soccer Field,Squash Court,Tennis Court,Volleyball Court,Yoga Studio
0,London,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,London,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,London,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,London,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,London,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


#### Next, I group the rows by city and take the mean of each one-hot column, which gives the relative frequency of each category


In [29]:
UK_grouped = UK_onehot.groupby('City').mean().reset_index()
UK_grouped

Unnamed: 0,City,Baseball Field,Basketball Court,Bowling Green,Boxing Gym,Climbing Spot,Football/Rugby Pitch,Golf Course,Gymnastics Gym,Hockey Field/Rink,Martial Arts School,Pilates Studio,Skating Rink,Soccer Field,Squash Court,Tennis Court,Volleyball Court,Yoga Studio
0,Belfast,0.0,0.0,0.0,0.093023,0.0,0.023256,0.209302,0.0,0.0,0.093023,0.023256,0.023256,0.372093,0.0,0.069767,0.0,0.093023
1,Birmingham,0.0,0.0,0.0,0.02381,0.02381,0.047619,0.214286,0.0,0.0,0.095238,0.0,0.0,0.5,0.0,0.071429,0.0,0.02381
2,Bradford,0.0,0.023256,0.0,0.023256,0.046512,0.116279,0.348837,0.023256,0.0,0.139535,0.0,0.023256,0.255814,0.0,0.0,0.0,0.0
3,Brighton,0.0,0.047619,0.0,0.0,0.047619,0.071429,0.119048,0.0,0.0,0.119048,0.02381,0.0,0.309524,0.0,0.119048,0.0,0.142857
4,Bristol,0.0,0.0,0.0,0.02439,0.04878,0.219512,0.195122,0.0,0.0,0.121951,0.0,0.0,0.195122,0.0,0.073171,0.0,0.121951
5,Cardiff,0.023256,0.046512,0.0,0.0,0.046512,0.093023,0.209302,0.0,0.0,0.116279,0.023256,0.023256,0.209302,0.0,0.139535,0.0,0.069767
6,Coventry,0.0,0.0,0.0,0.025641,0.025641,0.179487,0.153846,0.0,0.0,0.102564,0.0,0.0,0.435897,0.0,0.051282,0.0,0.025641
7,Derby,0.0,0.0,0.0,0.025,0.075,0.025,0.25,0.0,0.025,0.05,0.0,0.0,0.3,0.025,0.15,0.0,0.075
8,Edinburgh,0.022222,0.0,0.0,0.022222,0.022222,0.088889,0.288889,0.0,0.0,0.133333,0.066667,0.0,0.222222,0.0,0.022222,0.0,0.111111
9,Glasgow,0.0,0.042553,0.0,0.021277,0.021277,0.085106,0.212766,0.021277,0.06383,0.085106,0.0,0.0,0.297872,0.0,0.042553,0.0,0.106383
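
Each value in this table is the share of a city's venues that fall in a given category, so every row sums to 1. The sketch below reproduces the one-hot-then-mean step on toy data and compares it with `pd.crosstab(..., normalize='index')`, which computes the same shares directly. The cities 'A' and 'B' and all variable names are made up for the example.

```python
import pandas as pd

# toy data: city 'A' has four venues, city 'B' has two (invented for illustration)
toy_venues = pd.DataFrame({
    'City': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Category': ['Soccer Field', 'Soccer Field', 'Golf Course', 'Tennis Court',
                 'Golf Course', 'Golf Course'],
})

# route used in the notebook: one-hot encode, then average per city
onehot = pd.get_dummies(toy_venues['Category'], dtype=int)
onehot.insert(0, 'City', toy_venues['City'])
via_mean = onehot.groupby('City').mean()

# alternative: row-normalised contingency table
via_crosstab = pd.crosstab(toy_venues['City'], toy_venues['Category'], normalize='index')

print(via_mean)
print(via_crosstab)
```

Because the features are shares rather than counts, a city with 40 venues and a city with 47 venues are compared on the same scale.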


#### Let's print each city along with its top 5 most common venues


In [30]:
num_top_venues = 5

for hood in UK_grouped['City']:
    print("----"+hood+"----")
    temp = UK_grouped[UK_grouped['City'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Belfast----
                 venue  freq
0         Soccer Field  0.37
1          Golf Course  0.21
2          Yoga Studio  0.09
3           Boxing Gym  0.09
4  Martial Arts School  0.09


----Birmingham----
                  venue  freq
0          Soccer Field  0.50
1           Golf Course  0.21
2   Martial Arts School  0.10
3          Tennis Court  0.07
4  Football/Rugby Pitch  0.05


----Bradford----
                  venue  freq
0           Golf Course  0.35
1          Soccer Field  0.26
2   Martial Arts School  0.14
3  Football/Rugby Pitch  0.12
4         Climbing Spot  0.05


----Brighton----
                 venue  freq
0         Soccer Field  0.31
1          Yoga Studio  0.14
2         Tennis Court  0.12
3          Golf Course  0.12
4  Martial Arts School  0.12


----Bristol----
                  venue  freq
0  Football/Rugby Pitch  0.22
1           Golf Course  0.20
2          Soccer Field  0.20
3           Yoga Studio  0.12
4   Martial Arts School  0.12


----Cardiff----
 

#### Let's put that into a _pandas_ dataframe


First, let's write a function to sort the venues in descending order.


In [31]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each city.


In [32]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
UK_venues_sorted = pd.DataFrame(columns=columns)
UK_venues_sorted['City'] = UK_grouped['City']

for ind in np.arange(UK_grouped.shape[0]):
    UK_venues_sorted.iloc[ind, 1:] = return_most_common_venues(UK_grouped.iloc[ind, :], num_top_venues)
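
The `try`/`except` above builds the column labels by falling back to 'th' once the short suffix list runs out. A small helper can express the same idea without relying on an exception; `ordinal` is a hypothetical name, and the extra rule for 11 to 13 is only needed if more than 10 venues are ever requested.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 4 -> '4th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'                                  # 11th, 12th, 13th, ...
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# same column labels as in the cell above, for the top 10 venues
columns = ['City'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[:4])
```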

In [33]:
UK_venues_sorted.head()

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Belfast,Soccer Field,Golf Course,Yoga Studio,Martial Arts School,Boxing Gym,Tennis Court,Skating Rink,Pilates Studio,Football/Rugby Pitch,Climbing Spot
1,Birmingham,Soccer Field,Golf Course,Martial Arts School,Tennis Court,Football/Rugby Pitch,Yoga Studio,Boxing Gym,Climbing Spot,Basketball Court,Bowling Green
2,Bradford,Golf Course,Soccer Field,Martial Arts School,Football/Rugby Pitch,Climbing Spot,Basketball Court,Skating Rink,Boxing Gym,Gymnastics Gym,Yoga Studio
3,Brighton,Soccer Field,Yoga Studio,Tennis Court,Martial Arts School,Golf Course,Football/Rugby Pitch,Basketball Court,Climbing Spot,Pilates Studio,Bowling Green
4,Bristol,Football/Rugby Pitch,Golf Course,Soccer Field,Yoga Studio,Martial Arts School,Tennis Court,Climbing Spot,Boxing Gym,Basketball Court,Bowling Green


<a id='item4'></a>


## 4. Cluster the Cities


Run _k_-means to group the cities into 5 clusters.


In [34]:
# set number of clusters
kclusters = 5

UK_grouped_clustering = UK_grouped.drop(columns='City')   # the positional axis argument was removed in pandas 2.0

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(UK_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 1, 4, 4, 3, 0, 4, 1, 4], dtype=int32)
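
The labels follow the row order of `UK_grouped`, which is alphabetical by city because of the `groupby`, not the population order of `top_cities`. The sketch below pairs the ten labels printed above with the first ten cities of `UK_grouped` to make that mapping explicit; the variable names are my own.

```python
# first ten cities of UK_grouped (alphabetical) and the ten labels printed above
cities = ['Belfast', 'Birmingham', 'Bradford', 'Brighton', 'Bristol',
          'Cardiff', 'Coventry', 'Derby', 'Edinburgh', 'Glasgow']
labels = [0, 0, 1, 4, 4, 3, 0, 4, 1, 4]

label_by_city = dict(zip(cities, labels))

# invert the mapping: cluster label -> list of cities
clusters = {}
for city, label in label_by_city.items():
    clusters.setdefault(label, []).append(city)

for label in sorted(clusters):
    print(label, clusters[label])
```

This is why the labels have to be attached by city name, for instance through a join on 'City', rather than copied row by row onto a table sorted by population.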

Then, I create a new dataframe that includes the cluster label as well as the top 10 venues for each city.


In [35]:
# add clustering labels
UK_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
UK_merged = top_cities

# merge UK_venues_sorted with top_cities to add the cluster label and top venues for each city
UK_merged = UK_merged.join(UK_venues_sorted.set_index('City'), on='City')

UK_merged.head(25) # check the last columns!

Unnamed: 0,City,Population,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,London,8907918,51.507322,-0.127647,2,Yoga Studio,Tennis Court,Soccer Field,Skating Rink,Boxing Gym,Climbing Spot,Golf Course,Basketball Court,Football/Rugby Pitch,Hockey Field/Rink
1,Birmingham,1153717,52.479699,-1.902691,0,Soccer Field,Golf Course,Martial Arts School,Tennis Court,Football/Rugby Pitch,Yoga Studio,Boxing Gym,Climbing Spot,Basketball Court,Bowling Green
2,Glasgow,612040,55.860982,-4.248879,4,Soccer Field,Golf Course,Yoga Studio,Martial Arts School,Football/Rugby Pitch,Hockey Field/Rink,Tennis Court,Basketball Court,Gymnastics Gym,Climbing Spot
3,Liverpool,579256,53.407154,-2.991665,3,Golf Course,Soccer Field,Tennis Court,Martial Arts School,Yoga Studio,Baseball Field,Football/Rugby Pitch,Boxing Gym,Pilates Studio,Skating Rink
4,Bristol,571922,51.453802,-2.597298,4,Football/Rugby Pitch,Golf Course,Soccer Field,Yoga Studio,Martial Arts School,Tennis Court,Climbing Spot,Boxing Gym,Basketball Court,Bowling Green
5,Manchester,554400,53.479489,-2.245115,4,Soccer Field,Martial Arts School,Golf Course,Football/Rugby Pitch,Tennis Court,Basketball Court,Gymnastics Gym,Yoga Studio,Bowling Green,Boxing Gym
6,Sheffield,544402,53.380663,-1.470228,1,Golf Course,Soccer Field,Martial Arts School,Tennis Court,Yoga Studio,Climbing Spot,Basketball Court,Football/Rugby Pitch,Hockey Field/Rink,Gymnastics Gym
7,Leeds,503388,53.797418,-1.543794,1,Golf Course,Soccer Field,Football/Rugby Pitch,Yoga Studio,Martial Arts School,Tennis Court,Basketball Court,Gymnastics Gym,Bowling Green,Boxing Gym
8,Edinburgh,488050,55.953346,-3.188375,1,Golf Course,Soccer Field,Martial Arts School,Yoga Studio,Football/Rugby Pitch,Pilates Studio,Boxing Gym,Climbing Spot,Baseball Field,Tennis Court
9,Leicester,470965,52.63614,-1.133079,0,Soccer Field,Golf Course,Martial Arts School,Tennis Court,Gymnastics Gym,Football/Rugby Pitch,Yoga Studio,Climbing Spot,Basketball Court,Skating Rink


Finally, let's visualize the resulting clusters


In [36]:
# create map
map_clusters = folium.Map(location=[latitudeUK, longitudeUK], zoom_start=5)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(UK_merged['Latitude'], UK_merged['Longitude'], UK_merged['City'], UK_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='item5'></a>


## 5. Examine Clusters


Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster. I will leave this exercise to you.


#### Cluster 1


In [254]:
UK_merged.head()

Unnamed: 0,City,Population,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,London,8907918,51.507322,-0.127647,2,Yoga Studio,Tennis Court,Soccer Field,Skating Rink,Boxing Gym,Climbing Spot,Golf Course,Basketball Court,Football/Rugby Pitch,Hockey Field/Rink
1,Birmingham,1153717,52.479699,-1.902691,0,Soccer Field,Golf Course,Martial Arts School,Tennis Court,Football/Rugby Pitch,Yoga Studio,Boxing Gym,Climbing Spot,Basketball Court,Bowling Green
2,Glasgow,612040,55.860982,-4.248879,4,Soccer Field,Golf Course,Yoga Studio,Martial Arts School,Football/Rugby Pitch,Hockey Field/Rink,Tennis Court,Basketball Court,Gymnastics Gym,Climbing Spot
3,Liverpool,579256,53.407154,-2.991665,3,Golf Course,Soccer Field,Tennis Court,Martial Arts School,Yoga Studio,Baseball Field,Football/Rugby Pitch,Boxing Gym,Pilates Studio,Skating Rink
4,Bristol,571922,51.453802,-2.597298,4,Football/Rugby Pitch,Golf Course,Soccer Field,Yoga Studio,Martial Arts School,Tennis Court,Climbing Spot,Boxing Gym,Basketball Court,Bowling Green


In [263]:
cluster_one=UK_merged.loc[UK_merged['Cluster Labels'] == 0, UK_merged.columns[[0] + list(range(5, UK_merged.shape[1]))]]
cluster_one.reset_index(drop=True, inplace=True)
cluster_one.head(30)

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Birmingham,Soccer Field,Golf Course,Martial Arts School,Tennis Court,Football/Rugby Pitch,Yoga Studio,Boxing Gym,Climbing Spot,Basketball Court,Bowling Green
1,Leicester,Soccer Field,Golf Course,Martial Arts School,Tennis Court,Gymnastics Gym,Football/Rugby Pitch,Yoga Studio,Climbing Spot,Basketball Court,Skating Rink
2,Coventry,Soccer Field,Football/Rugby Pitch,Golf Course,Martial Arts School,Tennis Court,Yoga Studio,Boxing Gym,Climbing Spot,Basketball Court,Bowling Green
3,Belfast,Soccer Field,Golf Course,Yoga Studio,Martial Arts School,Boxing Gym,Tennis Court,Skating Rink,Pilates Studio,Football/Rugby Pitch,Climbing Spot
4,Kingston upon Hull,Soccer Field,Golf Course,Football/Rugby Pitch,Boxing Gym,Basketball Court,Skating Rink,Martial Arts School,Climbing Spot,Yoga Studio,Bowling Green
5,Newcastle upon Tyne,Soccer Field,Golf Course,Basketball Court,Martial Arts School,Yoga Studio,Tennis Court,Football/Rugby Pitch,Bowling Green,Boxing Gym,Climbing Spot
6,Stoke-on-Trent,Soccer Field,Golf Course,Martial Arts School,Football/Rugby Pitch,Yoga Studio,Tennis Court,Basketball Court,Boxing Gym,Volleyball Court,Climbing Spot
7,Southampton,Soccer Field,Golf Course,Football/Rugby Pitch,Tennis Court,Squash Court,Boxing Gym,Martial Arts School,Yoga Studio,Basketball Court,Baseball Field
8,Northampton,Soccer Field,Golf Course,Football/Rugby Pitch,Tennis Court,Yoga Studio,Boxing Gym,Martial Arts School,Volleyball Court,Climbing Spot,Basketball Court


In [271]:
# Build a dictionary with the number of times each venue category appears in each
# rank position (most common, second most common, etc.) within cluster one.

# create an empty dictionary
category_dict = {}

# iterate over every column of cluster_one except the first one (City)
for column in cluster_one.columns.tolist()[1:]:

    # start a dictionary with the first venue in the column as the key;
    # the value of this key is set to zero
    secondary_dict = {cluster_one.loc[0, column]: 0}

    # iterate through every row of the column
    for row in cluster_one[column]:
        if row in secondary_dict:        # if the key (e.g., Soccer Field) is already present,
            secondary_dict[row] += 1     # add one to its count
        else:
            secondary_dict[row] = 1      # otherwise, create a new key with a count of one

    category_dict[column] = secondary_dict   # one count dictionary per column of the dataframe

# This block orders the entries of each count dictionary by count, in descending order.

ordered_category_dictionary = {}

for i in category_dict:                        # iterate through each key; its value is itself a dictionary
    if len(category_dict[i]) <= 1:             # if there is only one entry, there is nothing to order
        ordered_category_dictionary[i] = category_dict[i]

    else:                                      # if there is more than one entry, they need to be ordered
        category_list = list(category_dict[i].keys())    # transform the keys and
        value_list = list(category_dict[i].values())     # values of category_dict[i] into lists
        ordered_category_list = []
        ordered_value_list = []

        for n in range(len(value_list)):
            # index of the largest remaining count in value_list
            max_index = value_list.index(max(value_list))

            # move the largest remaining count, and the category at the same index,
            # from the unordered lists to the end of the ordered lists
            ordered_value_list.append(value_list.pop(max_index))
            ordered_category_list.append(category_list.pop(max_index))

        # map each category to its count; using the counts as keys instead
        # would silently drop categories that share the same count
        ordered_sub_dictionary = {x: y for x, y in zip(ordered_category_list, ordered_value_list)}
        ordered_category_dictionary[i] = ordered_sub_dictionary
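
As a point of comparison, the counting and ordering done by the loops above can probably be expressed more compactly with `collections.Counter` from the standard library. This is a minimal sketch rather than the method used in this notebook: `sample_columns` is a hypothetical stand-in holding the first two venue columns of the Cluster 1 table, and `ordered_counts` is a name introduced only for this example.

```python
from collections import Counter

# Toy stand-in for two columns of cluster_one, copied from the Cluster 1 table.
# With the real dataframe, cluster_one[column] could be passed to Counter
# directly, since Counter accepts any iterable.
sample_columns = {
    "1st Most Common Venue": ["Soccer Field"] * 9,
    "2nd Most Common Venue": [
        "Golf Course", "Golf Course", "Football/Rugby Pitch",
        "Golf Course", "Golf Course", "Golf Course",
        "Golf Course", "Golf Course", "Golf Course",
    ],
}

# most_common() returns (category, count) pairs sorted by count, largest first;
# wrapping them in dict() keeps that order and maps each category to its count.
ordered_counts = {
    column: dict(Counter(venues).most_common())
    for column, venues in sample_columns.items()
}

for column, counts in ordered_counts.items():
    print(column, counts)
```

Keying the inner dictionaries by category rather than by count means that two categories with the same count both survive, which matters for the lower-ranked columns where ties are common.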

#### Cluster 2


In [252]:
UK_merged.loc[UK_merged['Cluster Labels'] == 1, UK_merged.columns[[0] + list(range(5, UK_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Sheffield,Golf Course,Soccer Field,Martial Arts School,Tennis Court,Yoga Studio,Climbing Spot,Basketball Court,Football/Rugby Pitch,Hockey Field/Rink,Gymnastics Gym
7,Leeds,Golf Course,Soccer Field,Football/Rugby Pitch,Yoga Studio,Martial Arts School,Tennis Court,Basketball Court,Gymnastics Gym,Bowling Green,Boxing Gym
8,Edinburgh,Golf Course,Soccer Field,Martial Arts School,Yoga Studio,Football/Rugby Pitch,Pilates Studio,Boxing Gym,Climbing Spot,Baseball Field,Tennis Court
11,Bradford,Golf Course,Soccer Field,Martial Arts School,Football/Rugby Pitch,Climbing Spot,Basketball Court,Skating Rink,Boxing Gym,Gymnastics Gym,Yoga Studio


#### Cluster 3


In [253]:
UK_merged.loc[UK_merged['Cluster Labels'] == 2, UK_merged.columns[[0] + list(range(5, UK_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,London,Yoga Studio,Tennis Court,Soccer Field,Skating Rink,Boxing Gym,Climbing Spot,Golf Course,Basketball Court,Football/Rugby Pitch,Hockey Field/Rink


#### Cluster 4


In [104]:
UK_merged.loc[UK_merged['Cluster Labels'] == 3, UK_merged.columns[[0] + list(range(5, UK_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Liverpool,Golf Course,Soccer Field,Tennis Court,Martial Arts School,Yoga Studio,Baseball Field,Football/Rugby Pitch,Boxing Gym,Pilates Studio,Skating Rink
12,Cardiff,Golf Course,Soccer Field,Tennis Court,Martial Arts School,Football/Rugby Pitch,Yoga Studio,Basketball Court,Climbing Spot,Baseball Field,Pilates Studio
22,Plymouth,Tennis Court,Soccer Field,Football/Rugby Pitch,Martial Arts School,Golf Course,Yoga Studio,Gymnastics Gym,Climbing Spot,Basketball Court,Baseball Field
24,Reading,Tennis Court,Golf Course,Soccer Field,Yoga Studio,Martial Arts School,Boxing Gym,Football/Rugby Pitch,Hockey Field/Rink,Climbing Spot,Basketball Court


#### Cluster 5


In [105]:
UK_merged.loc[UK_merged['Cluster Labels'] == 4, UK_merged.columns[[0] + list(range(5, UK_merged.shape[1]))]]

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Glasgow,Soccer Field,Golf Course,Yoga Studio,Martial Arts School,Football/Rugby Pitch,Hockey Field/Rink,Tennis Court,Basketball Court,Gymnastics Gym,Climbing Spot
4,Bristol,Football/Rugby Pitch,Golf Course,Soccer Field,Yoga Studio,Martial Arts School,Tennis Court,Climbing Spot,Boxing Gym,Basketball Court,Bowling Green
5,Manchester,Soccer Field,Martial Arts School,Golf Course,Football/Rugby Pitch,Tennis Court,Basketball Court,Gymnastics Gym,Yoga Studio,Bowling Green,Boxing Gym
14,Nottingham,Golf Course,Soccer Field,Yoga Studio,Tennis Court,Martial Arts School,Climbing Spot,Boxing Gym,Football/Rugby Pitch,Basketball Court,Skating Rink
19,Derby,Soccer Field,Golf Course,Tennis Court,Yoga Studio,Climbing Spot,Martial Arts School,Boxing Gym,Football/Rugby Pitch,Hockey Field/Rink,Squash Court
20,Portsmouth,Soccer Field,Golf Course,Yoga Studio,Hockey Field/Rink,Skating Rink,Martial Arts School,Tennis Court,Football/Rugby Pitch,Climbing Spot,Boxing Gym
21,Brighton,Soccer Field,Yoga Studio,Tennis Court,Martial Arts School,Golf Course,Football/Rugby Pitch,Basketball Court,Climbing Spot,Pilates Studio,Bowling Green
