# Capstone Project - The Battle of Neighborhoods

## Introduction

Over the last five years, crime rates in London have increased by more than 23%, and a total of 15,590 knife crimes were reported across the city in 2019-2020. Commercial and residential real estate agents, as well as anybody looking to buy or rent a home in London, should therefore take the concentration of crime across the boroughs into account. This report is written for those readers.

The business problem is: which parts of London are safe and attractive for residential buyers and renters across various demographic groups? To answer it, we will assess the crime levels in each of the London boroughs, and cluster neighborhoods to assess the venues on offer, such as pubs, cafes and parks.

## Data

Two data sources are used to assess the best residential areas in London: (1) the Metropolitan Police Service (MPS) Borough Level Crime data from August 2018 to July 2020 (https://data.london.gov.uk), and (2) Foursquare venue data.

The MPS Borough Level Crime data has the columns MajorText, MinorText and LookUp_BoroughName, followed by 24 columns, one for each month of crime data. The major and minor text describe the recorded offence, for example “violence against the person” and “violence with injury”. To analyse this data, we sum the 24 monthly columns to obtain a 24-month total for each row. We then group the rows by borough, which we will refer to as the ‘Address’.

## Methodology

The methodology section will consist of five sections covering the exploratory data analysis, statistical testing and machine learning involved in the project.

I. Downloading and exploring the dataset

II. Exploring neighborhoods in London

III. Analysing each neighborhood

IV. Clustering neighborhoods

V. Examining clusters

In [1]:
import numpy as np
import pandas as pd
import datetime as dt # Datetime
import json # library to handle JSON files
import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0; formerly in pandas.io.json)
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
print('Libraries imported.')

Libraries imported.


### Downloading and exploring the dataset

Loading the data and transforming it into a *pandas* dataframe

In [2]:
# Read the data for examination (Source: https://data.london.gov.uk/dataset/recorded_crime_summary)
df_mps = pd.read_csv("https://data.london.gov.uk/download/recorded_crime_summary/d2e9ccfc-a054-41e3-89fb-53c2bc3ed87a/MPS%20Borough%20Level%20Crime%20%28most%20recent%2024%20months%29.csv")

In [3]:
df_mps.head(5)

Unnamed: 0,MajorText,MinorText,LookUp_BoroughName,201808,201809,201810,201811,201812,201901,201902,...,201910,201911,201912,202001,202002,202003,202004,202005,202006,202007
0,Arson and Criminal Damage,Arson,Barking and Dagenham,5,3,8,5,1,5,2,...,9,8,6,4,5,6,2,2,4,3
1,Arson and Criminal Damage,Criminal Damage,Barking and Dagenham,101,107,132,105,88,97,127,...,109,97,121,97,103,107,80,86,121,121
2,Burglary,Burglary - Business and Community,Barking and Dagenham,18,33,32,39,33,45,24,...,30,30,25,31,17,28,29,16,16,28
3,Burglary,Burglary - Residential,Barking and Dagenham,84,99,94,106,164,114,107,...,97,114,130,116,123,97,57,41,63,72
4,Drug Offences,Drug Trafficking,Barking and Dagenham,7,10,9,7,4,5,2,...,8,12,3,14,5,6,12,13,11,20


In [4]:
df_mps.shape

(1568, 27)
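
The month columns in the raw data are headed with `YYYYMM` codes (`201808` to `202007`). As a minimal sketch, assuming the headers keep that format, readable labels could be derived from the codes rather than typed by hand. The helper name `month_label` is hypothetical and not part of the original notebook.

```python
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def month_label(code):
    """Turn a 'YYYYMM' header such as '201808' into a label such as 'Aug_18'."""
    code = str(code)
    year, month = int(code[:4]), int(code[4:6])
    return '{}_{:02d}'.format(MONTHS[month - 1], year % 100)

# Example headers copied from the first and last month columns shown above
codes = ['201808', '201809', '202007']
labels = [month_label(c) for c in codes]
print(labels)
```

Deriving the labels this way keeps each label tied to the month its column actually holds.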

Renaming the columns

In [5]:
# Assign meaningful column names
# The source month columns run from 201808 to 202007, i.e. Aug 2018 to Jul 2020
df_mps.columns = ['Major_Crime', 'Minor_Crime', 'Address', 'Aug_18', 'Sep_18', 'Oct_18', 'Nov_18', 'Dec_18',\
                  'Jan_19', 'Feb_19', 'Mar_19', 'Apr_19', 'May_19', 'Jun_19', 'Jul_19', 'Aug_19', 'Sep_19',\
                  'Oct_19', 'Nov_19', 'Dec_19', 'Jan_20', 'Feb_20', 'Mar_20', 'Apr_20', 'May_20', 'Jun_20', 'Jul_20']

In [6]:
df_mps

Unnamed: 0,Major_Crime,Minor_Crime,Address,Aug_18,Sep_18,Oct_18,Nov_18,Dec_18,Jan_19,Feb_19,...,Oct_19,Nov_19,Dec_19,Jan_20,Feb_20,Mar_20,Apr_20,May_20,Jun_20,Jul_20
0,Arson and Criminal Damage,Arson,Barking and Dagenham,5,3,8,5,1,5,2,...,9,8,6,4,5,6,2,2,4,3
1,Arson and Criminal Damage,Criminal Damage,Barking and Dagenham,101,107,132,105,88,97,127,...,109,97,121,97,103,107,80,86,121,121
2,Burglary,Burglary - Business and Community,Barking and Dagenham,18,33,32,39,33,45,24,...,30,30,25,31,17,28,29,16,16,28
3,Burglary,Burglary - Residential,Barking and Dagenham,84,99,94,106,164,114,107,...,97,114,130,116,123,97,57,41,63,72
4,Drug Offences,Drug Trafficking,Barking and Dagenham,7,10,9,7,4,5,2,...,8,12,3,14,5,6,12,13,11,20
5,Drug Offences,Possession of Drugs,Barking and Dagenham,70,72,64,75,69,79,74,...,88,94,79,98,106,107,145,180,192,115
6,Miscellaneous Crimes Against Society,Bail Offences,Barking and Dagenham,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Miscellaneous Crimes Against Society,Bigamy,Barking and Dagenham,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,Miscellaneous Crimes Against Society,Dangerous Driving,Barking and Dagenham,2,1,0,2,1,1,0,...,2,1,2,2,0,2,0,2,3,2
9,Miscellaneous Crimes Against Society,"Disclosure, Obstruction, False or Misleading S...",Barking and Dagenham,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Finding the sum of the 24 months of crime data

In [7]:
# Sum the 24 monthly columns row-wise
df_mps['Total_Crime'] = df_mps.sum(numeric_only=True, axis=1)

In [8]:
df_mps

Unnamed: 0,Major_Crime,Minor_Crime,Address,Aug_18,Sep_18,Oct_18,Nov_18,Dec_18,Jan_19,Feb_19,...,Nov_19,Dec_19,Jan_20,Feb_20,Mar_20,Apr_20,May_20,Jun_20,Jul_20,Total_Crime
0,Arson and Criminal Damage,Arson,Barking and Dagenham,5,3,8,5,1,5,2,...,8,6,4,5,6,2,2,4,3,116
1,Arson and Criminal Damage,Criminal Damage,Barking and Dagenham,101,107,132,105,88,97,127,...,97,121,97,103,107,80,86,121,121,2681
2,Burglary,Burglary - Business and Community,Barking and Dagenham,18,33,32,39,33,45,24,...,30,25,31,17,28,29,16,16,28,681
3,Burglary,Burglary - Residential,Barking and Dagenham,84,99,94,106,164,114,107,...,114,130,116,123,97,57,41,63,72,2301
4,Drug Offences,Drug Trafficking,Barking and Dagenham,7,10,9,7,4,5,2,...,12,3,14,5,6,12,13,11,20,199
5,Drug Offences,Possession of Drugs,Barking and Dagenham,70,72,64,75,69,79,74,...,94,79,98,106,107,145,180,192,115,2362
6,Miscellaneous Crimes Against Society,Bail Offences,Barking and Dagenham,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
7,Miscellaneous Crimes Against Society,Bigamy,Barking and Dagenham,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
8,Miscellaneous Crimes Against Society,Dangerous Driving,Barking and Dagenham,2,1,0,2,1,1,0,...,1,2,2,0,2,0,2,3,2,29
9,Miscellaneous Crimes Against Society,"Disclosure, Obstruction, False or Misleading S...",Barking and Dagenham,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


Dropping the non-essential columns and creating a new dataframe

In [9]:
# Keep only the borough name and the 24-month total
df_mps2 = df_mps[['Address', 'Total_Crime']].copy()

In [10]:
df_mps2

Unnamed: 0,Address,Total_Crime
0,Barking and Dagenham,116
1,Barking and Dagenham,2681
2,Barking and Dagenham,681
3,Barking and Dagenham,2301
4,Barking and Dagenham,199
5,Barking and Dagenham,2362
6,Barking and Dagenham,1
7,Barking and Dagenham,1
8,Barking and Dagenham,29
9,Barking and Dagenham,1


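Grouping the crime data by borough. Note that `.mean()` averages the 24-month totals over the crime-category rows of each borough, so `Total_Crime` below is the mean count per crime category rather than the borough's overall total (which `.sum()` would give).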

In [11]:
df_crime = df_mps2.groupby(['Address'])['Total_Crime'].mean().reset_index()

In [12]:
df_crime.head()

Unnamed: 0,Address,Total_Crime
0,Barking and Dagenham,809.3125
1,Barnet,1259.851064
2,Bexley,734.173913
3,Brent,1244.416667
4,Bromley,1033.23913
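
The choice of aggregation matters here. A minimal sketch on made-up numbers (the borough names `A` and `B` and the counts are illustrative only) shows how the mean per crime-category row can differ from the borough total when boroughs have different numbers of category rows:

```python
import pandas as pd

# Borough A has three crime-category rows, borough B has two
toy = pd.DataFrame({
    'Address': ['A', 'A', 'A', 'B', 'B'],
    'Total_Crime': [100, 200, 300, 250, 350],
})

# Compute both aggregations side by side for comparison
by_borough = toy.groupby('Address')['Total_Crime'].agg(['mean', 'sum']).reset_index()
print(by_borough)
```

In this toy case both boroughs have the same total, yet B has the higher mean, so a ranking by mean and a ranking by sum need not agree.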


Appending ', London, UK' to each borough name so that the geocoder returns the correct coordinates

In [13]:
df_crime['Address'] = df_crime['Address'] + ', London, UK'
df_crime.head()

Unnamed: 0,Address,Total_Crime
0,"Barking and Dagenham, London, UK",809.3125
1,"Barnet, London, UK",1259.851064
2,"Bexley, London, UK",734.173913
3,"Brent, London, UK",1244.416667
4,"Bromley, London, UK",1033.23913


In [14]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from geopy.extra.rate_limiter import RateLimiter # throttle geocoding requests
from sklearn.cluster import KMeans # k-means, used in the clustering stage

Using the geopy library to get the coordinates of the London boroughs

In [15]:
geolocator = Nominatim(user_agent="ldn_explorer")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
# Create location column
df_crime['location'] = df_crime['Address'].apply(geocode)
df_crime['point']=df_crime['location'].apply(lambda loc: tuple(loc.point) if loc else None)
df_crime.head()

Unnamed: 0,Address,Total_Crime,location,point
0,"Barking and Dagenham, London, UK",809.3125,"(London Borough of Barking and Dagenham, Great...","(51.5541171, 0.15050434261994267, 0.0)"
1,"Barnet, London, UK",1259.851064,"(Chipping Barnet, London Borough of Barnet, Lo...","(51.65309, -0.2002261, 0.0)"
2,"Bexley, London, UK",734.173913,"(Bexley, London Borough of Bexley, London, Gre...","(51.4416793, 0.150488, 0.0)"
3,"Brent, London, UK",1244.416667,"(London Borough of Brent, Greater London, Engl...","(51.563825800000004, -0.2757596561855699, 0.0)"
4,"Bromley, London, UK",1033.23913,"(Bromley, London, Greater London, England, BR1...","(51.4028046, 0.0148142, 0.0)"


Separating the coordinates into separate columns

In [16]:
df_crime[['Latitude', 'Longitude', 'Altitude']] = pd.DataFrame(df_crime['point'].tolist(), index=df_crime.index)
df_crime.head()

Unnamed: 0,Address,Total_Crime,location,point,Latitude,Longitude,Altitude
0,"Barking and Dagenham, London, UK",809.3125,"(London Borough of Barking and Dagenham, Great...","(51.5541171, 0.15050434261994267, 0.0)",51.554117,0.150504,0.0
1,"Barnet, London, UK",1259.851064,"(Chipping Barnet, London Borough of Barnet, Lo...","(51.65309, -0.2002261, 0.0)",51.65309,-0.200226,0.0
2,"Bexley, London, UK",734.173913,"(Bexley, London Borough of Bexley, London, Gre...","(51.4416793, 0.150488, 0.0)",51.441679,0.150488,0.0
3,"Brent, London, UK",1244.416667,"(London Borough of Brent, Greater London, Engl...","(51.563825800000004, -0.2757596561855699, 0.0)",51.563826,-0.27576,0.0
4,"Bromley, London, UK",1033.23913,"(Bromley, London, Greater London, England, BR1...","(51.4028046, 0.0148142, 0.0)",51.402805,0.014814,0.0
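
If the geocoder fails to find an address, `point` is `None` for that row, and the split into columns needs to cope with it. A minimal sketch, using made-up rows rather than real geocoder output, pads missing points with NaN before building the columns:

```python
import numpy as np
import pandas as pd

# Two illustrative rows: one geocoded successfully, one not
toy = pd.DataFrame({
    'Address': ['Found, London, UK', 'Not found, London, UK'],
    'point': [(51.5, -0.1, 0.0), None],
})

# Replace missing points with a NaN triple so every row has three values
padded = [p if p is not None else (np.nan, np.nan, np.nan) for p in toy['point']]
toy[['Latitude', 'Longitude', 'Altitude']] = pd.DataFrame(padded, index=toy.index)

# Rows that could not be geocoded are now easy to list
missing = toy[toy['Latitude'].isna()]
print(missing['Address'].tolist())
```

Listing the rows with missing coordinates makes it explicit which boroughs, if any, would need a different query string.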


Dropping the unnecessary columns

In [17]:
df_crime2 = df_crime.drop(columns=['location', 'point', 'Altitude'])
df_crime2.head()

Unnamed: 0,Address,Total_Crime,Latitude,Longitude
0,"Barking and Dagenham, London, UK",809.3125,51.554117,0.150504
1,"Barnet, London, UK",1259.851064,51.65309,-0.200226
2,"Bexley, London, UK",734.173913,51.441679,0.150488
3,"Brent, London, UK",1244.416667,51.563826,-0.27576
4,"Bromley, London, UK",1033.23913,51.402805,0.014814


In [18]:
!pip install folium
import folium
print('Libraries imported.')

Libraries imported.


In [None]:
address = 'London, UK'

geolocator = Nominatim(user_agent="ldn_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London are {}, {}.'.format(latitude, longitude))

The geographical coordinates of London are 51.5073219, -0.1276474.


### Exploring neighborhoods in London

In [None]:
!pip install geopandas
!pip install geopy


Using the Foursquare API to explore the neighborhoods and segment them

In [None]:
# Define Foursquare credentials and version
# Credentials are read from environment variables instead of being hard-coded;
# the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are placeholders
import os

CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Credentials loaded.')

In [None]:
def getNearbyVenues(names, latitudes, longitudes):
    radius=500
    LIMIT=100
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
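
The list comprehension above assumes every venue has at least one category. A minimal sketch of the same extraction with a guard for an empty `categories` list is shown below. The helper name `parse_items` and the sample payload are hypothetical and only mirror the keys the function reads (`venue`, `name`, `location`, `categories`).

```python
def parse_items(name, lat, lng, items):
    """Flatten venue items into tuples, tolerating venues without a category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when the venue carries no category instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hand-written sample shaped like the items the function iterates over
sample_items = [
    {'venue': {'name': 'Example Cafe', 'location': {'lat': 51.50, 'lng': -0.12},
               'categories': [{'name': 'Cafe'}]}},
    {'venue': {'name': 'Unlabelled Place', 'location': {'lat': 51.51, 'lng': -0.13},
               'categories': []}},
]
rows = parse_items('Example, London, UK', 51.5, -0.1, sample_items)
print(rows)
```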

In [None]:
df_crime2.drop(df_crime2.index[22], inplace = True)

Creating a new dataframe for the London venues

In [None]:
london_venues = getNearbyVenues(names=df_crime2['Address'],
                                   latitudes=df_crime2['Latitude'],
                                   longitudes=df_crime2['Longitude']
                                  )

In [None]:
london_venues.head()

Checking how many venues are returned for each neighborhood

In [None]:
london_venues.groupby('Neighborhood').count()

### Analysing each neighborhood

In [None]:
# one hot encoding
london_onehot = pd.get_dummies(london_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
london_onehot['Neighborhood'] = london_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [london_onehot.columns[-1]] + list(london_onehot.columns[:-1])
london_onehot = london_onehot[fixed_columns]

london_onehot.head()

Grouping the rows by neighborhood and taking the mean frequency of occurrence of each category

In [None]:
london_grouped = london_onehot.groupby('Neighborhood').mean().reset_index()
london_grouped.head()
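
To see what the one-hot encoding and grouping produce, here is a minimal sketch on three made-up venues (the neighborhood and category names are illustrative). Each row of the grouped frame holds the share of a neighborhood's venues that fall in each category.

```python
import pandas as pd

# Neighborhood A has one pub and one park; neighborhood B has one pub
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B'],
    'Venue Category': ['Pub', 'Park', 'Pub'],
})

# One indicator column per category, stored as floats
onehot = pd.get_dummies(toy_venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# The mean of the indicators is the frequency of each category per neighborhood
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
```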

Writing a function to sort the venues in descending order

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Creating a new dataframe and displaying the top 10 venues for each neighborhood

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = london_grouped['Neighborhood']

for ind in np.arange(london_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(london_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
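
A minimal sketch of the ranking step on a single made-up row (the frequencies are illustrative) shows what the sort returns, along with one way to build the ordinal column names without relying on exception handling. The helper name `ordinal` is hypothetical.

```python
import pandas as pd

def ordinal(n):
    """Return '1st', '2nd', '3rd', '4th', ... for the ranks 1 to 10 used here."""
    suffixes = {1: 'st', 2: 'nd', 3: 'rd'}
    return '{}{}'.format(n, suffixes.get(n, 'th'))

# One grouped row: the neighborhood name followed by category frequencies
row = pd.Series({'Neighborhood': 'A', 'Pub': 0.5, 'Park': 0.3, 'Cafe': 0.2})

# Skip the name, sort the frequencies in descending order, keep the top two labels
top = row.iloc[1:].sort_values(ascending=False).index.values[0:2]

# Column headers for the first three ranks
columns = ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 4)]
print(list(top), columns)
```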

### Clustering neighborhoods

Running *k*-means to cluster the neighborhoods into 5 clusters

In [None]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

london_grouped_clustering = london_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(london_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
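
As a minimal sketch of what the clustering step does, the example below runs *k*-means on a tiny made-up frequency table (two pub-heavy rows and two park-heavy rows) with two clusters. The data and the choice of two clusters are illustrative only and are unrelated to the five clusters used in this report.

```python
import numpy as np
from sklearn.cluster import KMeans

# Columns: share of pubs, share of parks (made-up values)
toy_freq = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

# n_init is set explicitly so the call behaves the same across scikit-learn versions
toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(toy_freq)
toy_labels = toy_kmeans.labels_
print(toy_labels)
```

Rows with similar venue profiles receive the same label; the label numbers themselves carry no meaning beyond identifying the group.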

Creating a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood 

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

london_merged = df_crime2

# join the cluster labels and top venues onto df_crime2, which holds the crime level and latitude/longitude for each neighborhood
london_merged = london_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Address')

london_merged.head()
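
Because the join keeps every row of the left-hand frame, any borough for which no venues were returned would end up with a missing cluster label, which also turns the label column into floats. A minimal sketch on made-up rows shows how such rows could be dropped and the labels restored to integers before they are used as list indices:

```python
import pandas as pd

# Borough B has crime data but no venue-based cluster label
crime = pd.DataFrame({'Address': ['A', 'B', 'C'], 'Total_Crime': [10.0, 20.0, 30.0]})
labels = pd.DataFrame({'Neighborhood': ['A', 'C'], 'Cluster Labels': [0, 1]})

merged = crime.join(labels.set_index('Neighborhood'), on='Address')

# Drop unlabelled boroughs, then convert the float labels back to integers
cleaned = merged.dropna(subset=['Cluster Labels']).copy()
cleaned['Cluster Labels'] = cleaned['Cluster Labels'].astype(int)
print(cleaned)
```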

In [None]:
neighborhoods_venues_sorted.head()

Visualising the resulting clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(london_merged['Latitude'], london_merged['Longitude'], london_merged['Address'], london_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Examining Clusters

#### Cluster 1

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 0, london_merged.columns[[1] + list(range(5, london_merged.shape[1]))]]

#### Cluster 2

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 1, london_merged.columns[[1] + list(range(5, london_merged.shape[1]))]]

#### Cluster 3

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 2, london_merged.columns[[1] + list(range(5, london_merged.shape[1]))]]

#### Cluster 4

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 3, london_merged.columns[[1] + list(range(5, london_merged.shape[1]))]]

#### Cluster 5

In [None]:
london_merged.loc[london_merged['Cluster Labels'] == 4, london_merged.columns[[1] + list(range(5, london_merged.shape[1]))]]

### Summary of crime and venue data

In [None]:
london_merged.sort_values(by=['Total_Crime','Cluster Labels'], inplace=True)
london_merged

## Results and discussion

We analysed two sources of data in this report, so we discuss the results in two parts.

In the Metropolitan Police Service Borough Level Crime data, the boroughs with the least crime are (1) Kingston upon Thames, (2) Richmond upon Thames and (3) Sutton. The boroughs with the most crime are (1) Westminster, (2) Newham and (3) Southwark. The quartiles for the crime data are 893.4, 1106.6, 1271.3 and 2868.6 crimes over the course of 24 months, and the mean is 1118.9. These figures are values of `Total_Crime` as computed above, that is, the mean 24-month count per crime category in a borough.

Clusters 2, 4 and 5 all showed below-average crime levels, whereas Clusters 1 and 3 showed mixed crime levels across their boroughs.
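
Summary figures of this kind can be obtained with `Series.quantile` and a `groupby` on the cluster label. The sketch below uses made-up values, not the report's data, to show the calculation:

```python
import pandas as pd

# Four illustrative boroughs in two clusters
toy = pd.DataFrame({
    'Cluster Labels': [0, 0, 1, 1],
    'Total_Crime': [800.0, 1600.0, 900.0, 1100.0],
})

# Quartiles and overall mean of the crime measure
quartiles = toy['Total_Crime'].quantile([0.25, 0.5, 0.75])
overall_mean = toy['Total_Crime'].mean()

# Mean crime per cluster, and the clusters that fall below the overall mean
cluster_means = toy.groupby('Cluster Labels')['Total_Crime'].mean()
below_average = cluster_means[cluster_means < overall_mean].index.tolist()
print(quartiles, overall_mean, below_average)
```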

The most common venues were as follows:

Cluster 1 – Pubs, fast food restaurants, parks.

Cluster 2 – Bus stops, grocery stores, convenience stores. 

Cluster 3 – Coffee shops, pubs and clothing stores. 

Cluster 4 – Bakeries, train stations, parks. 

Cluster 5 – Sports clubs, home services, pubs. 

The clusters that would appeal most to families and older people, with fewer pubs and more active spaces and independent stores, are Cluster 4 and Cluster 5, as they offer suitable facilities and low crime rates. Hence, the boroughs of Sutton (Cluster 4) and Richmond upon Thames (Cluster 5) are attractive for families and older people.

Cluster 1 seems appropriate for younger buyers and renters, with more pubs, fast food restaurants and parks. Within Cluster 1, the borough with the least crime is Merton, followed by Bexley.

Cluster 3 would suit either group, especially families with slightly older children, with its variety of coffee shops, pubs and clothing stores. Within this cluster, the boroughs with the least crime are Kingston upon Thames and Harrow.

Finally, Cluster 2 appears to have low crime rates, but has far fewer desirable venues such as pubs or cafes. Hence, Barking and Dagenham may be safe but less attractive for buyers and renters.

## Conclusion

The business question that we have answered in this report is: which parts of London are safe and attractive for residential buyers/renters across various demographic groups? To solve this problem, we have assessed the crime rates in each of the London boroughs, and clustered neighborhoods to assess the venues on offer, such as pubs, cafes and parks.

Overall, the ideal borough for a residential buyer or renter depends on the age and venue preferences of the individual. However, it seems that Sutton and Richmond upon Thames are most suitable for older people or families with young children. Kingston upon Thames and Harrow are ideal for families with older children. Merton and Bexley are appropriate for young people looking for more pubs and fast food restaurants. Barking and Dagenham is a safe option but has few desirable venues.

Whilst London crime rates have been growing over the last few years, there are many boroughs which are both relatively safe and offer venues which are appealing to all ages. This report acts as a guide to all commercial and residential real estate agents, as well as anybody looking to buy or rent a home in the city, and should inform individuals across all ages. 