# Capstone Project - The Battle of the Neighborhoods 
## Applied Data Science Capstone

## Table of contents
* [Introduction/Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction/Business Problem <a name="introduction"></a>

This project compares London boroughs by total recorded crime, then explores the neighborhoods of the safest borough and groups them using k-means clustering.

The report is aimed at people relocating to London who treat safety as their top priority. Crime data is used to identify the safest borough.

Within that borough, neighborhoods are clustered by their most common venues, which are obtained from the Foursquare API.


## Data <a name="data"></a>

Given the problem, the factors that drive the final decision are:
* The number of crimes recorded in each borough
* The most common venues in each neighborhood

The data comes from the following sources:

-  Crime statistics for each London borough, obtained from Kaggle
-  Coordinates of each neighborhood, obtained by geocoding with geopy's Nominatim geocoder
-  Most common venues, obtained from the Foursquare API

### Part 1: Loading Kaggle data of London Crimes <a name="part1"></a>


####  London Crime Data 

Data set URL: https://www.kaggle.com/jboysen/london-crime


#### Import  libraries

In [1]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

#!conda install -c conda-forge geocoder --yes
import geocoder

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# library for transforming a JSON file into a pandas dataframe
from pandas.io.json import json_normalize

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')


Folium installed
Libraries imported.


####  Foursquare Credentials 

In [2]:
CLIENT_ID = 'HDCAWXKGHJWZKOIGSXSVKTTSXLFZRH1R4IIZKATUT0O3JEYF' # your Foursquare ID
CLIENT_SECRET = 'TTEJSXIF5J3M4B3VJFS4DYO2V0PEURDNIDJRYSCDJJ1PLS4W' # your Foursquare Secret

VERSION = '20180604'
LIMIT = 30

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: HDCAWXKGHJWZKOIGSXSVKTTSXLFZRH1R4IIZKATUT0O3JEYF
CLIENT_SECRET:TTEJSXIF5J3M4B3VJFS4DYO2V0PEURDNIDJRYSCDJJ1PLS4W


#### Load dataset

In [32]:
# Read the data from a CSV file saved locally
df = pd.read_csv(r"F:\# Coursera\Applied Data Science with Python\london_crime.csv")

In [33]:
# View the top rows of the dataset
df.head()

Unnamed: 0,lsoa_code,borough,major_category,minor_category,value,year,month
0,E01001116,Croydon,Burglary,Burglary in Other Buildings,0,2016,11
1,E01001646,Greenwich,Violence Against the Person,Other violence,0,2016,11
2,E01000677,Bromley,Violence Against the Person,Other violence,0,2015,5
3,E01003774,Redbridge,Burglary,Burglary in Other Buildings,0,2016,3
4,E01004563,Wandsworth,Robbery,Personal Property,0,2008,6


#### Check crime rates in 2016

In [34]:
# Taking only the most recent year (2016) and dropping the rest
df.drop(df.index[df['year'] != 2016], inplace = True)

# Remove all entries where the crime value is zero
df = df[df.value != 0]

# Reset the index and dropping the previous index
df = df.reset_index(drop=True)
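
The same filtering can be expressed as a single boolean mask, which avoids in-place mutation. This is a minimal sketch on a made-up frame: the `toy` rows are invented for illustration and are not taken from the Kaggle file.

```python
import pandas as pd

# Toy rows that mimic three columns of the crime dataset; the values are invented.
toy = pd.DataFrame({
    "borough": ["Sutton", "Bromley", "Croydon", "Sutton"],
    "value": [1, 0, 3, 2],
    "year": [2016, 2016, 2015, 2016],
})

# Keep only 2016 rows with at least one recorded crime, then renumber the index.
mask = (toy["year"] == 2016) & (toy["value"] != 0)
filtered = toy.loc[mask].reset_index(drop=True)
print(filtered)
```

Building the mask first makes each condition visible in one place and leaves the original frame untouched.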

In [35]:
# Shape of the data frame
df.shape

(392042, 7)

In [36]:
# View the top of the dataset 
df.head()

Unnamed: 0,lsoa_code,borough,major_category,minor_category,value,year,month
0,E01004177,Sutton,Theft and Handling,Theft/Taking of Pedal Cycle,1,2016,8
1,E01000733,Bromley,Criminal Damage,Criminal Damage To Motor Vehicle,1,2016,4
2,E01003989,Southwark,Theft and Handling,Theft From Shops,4,2016,8
3,E01002276,Havering,Burglary,Burglary in a Dwelling,1,2016,8
4,E01003674,Redbridge,Drugs,Possession Of Drugs,2,2016,11


#### Change the column names 

In [37]:
df.columns = ['LSOA_Code', 'Borough','Major_Category','Minor_Category','No_of_Crimes','Year','Month']
df.head()

Unnamed: 0,LSOA_Code,Borough,Major_Category,Minor_Category,No_of_Crimes,Year,Month
0,E01004177,Sutton,Theft and Handling,Theft/Taking of Pedal Cycle,1,2016,8
1,E01000733,Bromley,Criminal Damage,Criminal Damage To Motor Vehicle,1,2016,4
2,E01003989,Southwark,Theft and Handling,Theft From Shops,4,2016,8
3,E01002276,Havering,Burglary,Burglary in a Dwelling,1,2016,8
4,E01003674,Redbridge,Drugs,Possession Of Drugs,2,2016,11


#### Dropping unwanted columns
['LSOA_Code', 'Minor_Category','Year','Month']

In [38]:
df.drop(columns=['LSOA_Code', 'Minor_Category','Year','Month'], inplace = True)
df.head()

Unnamed: 0,Borough,Major_Category,No_of_Crimes
0,Sutton,Theft and Handling,1
1,Bromley,Criminal Damage,1
2,Southwark,Theft and Handling,4
3,Havering,Burglary,1
4,Redbridge,Drugs,2


In [39]:
# View the information of the dataset 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392042 entries, 0 to 392041
Data columns (total 3 columns):
Borough           392042 non-null object
Major_Category    392042 non-null object
No_of_Crimes      392042 non-null int64
dtypes: int64(1), object(2)
memory usage: 9.0+ MB


#### Number of crime records in each borough (rows, not summed crime counts)

In [40]:
df['Borough'].value_counts()

Lambeth                   17605
Southwark                 16560
Croydon                   16254
Newham                    15622
Ealing                    15284
Tower Hamlets             15219
Brent                     14980
Barnet                    14668
Hackney                   14392
Lewisham                  14235
Haringey                  14202
Enfield                   13658
Wandsworth                13498
Westminster               13383
Islington                 13116
Greenwich                 12750
Camden                    12632
Hillingdon                12417
Hounslow                  12316
Waltham Forest            12121
Bromley                   11980
Redbridge                 11490
Hammersmith and Fulham    10281
Barking and Dagenham       9784
Havering                   9699
Kensington and Chelsea     9653
Harrow                     8257
Bexley                     8245
Merton                     8223
Richmond upon Thames       7199
Sutton                     6823
Kingston

#### Number of crime records per major category

In [41]:
df['Major_Category'].value_counts()

Theft and Handling             129159
Violence Against the Person    123050
Criminal Damage                 48584
Burglary                        43020
Drugs                           21782
Robbery                         14889
Other Notifiable Offences       11558
Name: Major_Category, dtype: int64
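
Note that `value_counts()` counts rows, while each row carries its own `No_of_Crimes` value. The difference can be seen on a small invented frame (the numbers below are illustrative only):

```python
import pandas as pd

# Invented rows: each row is one record with its own crime count.
toy = pd.DataFrame({
    "Borough": ["Sutton", "Sutton", "Bromley"],
    "No_of_Crimes": [4, 1, 2],
})

# Number of records (rows) per borough.
record_counts = toy["Borough"].value_counts()

# Number of crimes per borough: sum the count column within each group.
crime_totals = toy.groupby("Borough")["No_of_Crimes"].sum()

print(record_counts)
print(crime_totals)
```

Here Sutton has 2 records but 5 crimes, so the two views can rank boroughs differently.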

#### Number of crimes per borough for each major category

In [42]:
London_crime = pd.pivot_table(df,values=['No_of_Crimes'],
                               index=['Borough'],
                               columns=['Major_Category'],
                               aggfunc=np.sum,fill_value=0)
London_crime.head()

Unnamed: 0_level_0,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes
Major_Category,Burglary,Criminal Damage,Drugs,Other Notifiable Offences,Robbery,Theft and Handling,Violence Against the Person
Borough,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Barking and Dagenham,1287,1949,919,378,534,5607,6067
Barnet,3402,2183,906,499,464,9731,7499
Bexley,1123,1673,646,294,209,4392,4503
Brent,2631,2280,2096,536,919,9026,9205
Bromley,2214,2202,728,417,369,7584,6650


In [43]:
# Reset the index
London_crime.reset_index(inplace = True)

In [45]:
London_crime.shape

(33, 9)

In [46]:
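# Total crimes per borough: sum only the crime-category columns, so re-running
# the cell cannot fold an existing 'Total' into the sum. The totals shown below
# are twice the row sums (33482 vs. 16741 for Barking and Dagenham), consistent
# with the original cell having been run twice; the ranking is unaffected.
London_crime['Total'] = London_crime['No_of_Crimes'].sum(axis=1)
London_crime.head()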

Unnamed: 0_level_0,Borough,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,No_of_Crimes,Total
Major_Category,Unnamed: 1_level_1,Burglary,Criminal Damage,Drugs,Other Notifiable Offences,Robbery,Theft and Handling,Violence Against the Person,Unnamed: 9_level_1
0,Barking and Dagenham,1287,1949,919,378,534,5607,6067,33482
1,Barnet,3402,2183,906,499,464,9731,7499,49368
2,Bexley,1123,1673,646,294,209,4392,4503,25680
3,Brent,2631,2280,2096,536,919,9026,9205,53386
4,Bromley,2214,2202,728,417,369,7584,6650,40328
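
A row total is safest when it sums an explicit list of category columns, because the result then stays the same if the cell is executed more than once. The sketch below uses an invented two-borough frame; `category_cols` is a helper name introduced here for illustration.

```python
import pandas as pd

# Invented long-format data with two boroughs and two categories.
toy = pd.DataFrame({
    "Borough": ["A", "A", "B", "B"],
    "Major_Category": ["Burglary", "Drugs", "Burglary", "Drugs"],
    "No_of_Crimes": [3, 1, 2, 5],
})

pivot = pd.pivot_table(toy, values="No_of_Crimes", index="Borough",
                       columns="Major_Category", aggfunc="sum", fill_value=0)

# Remember the category columns before adding any derived column.
category_cols = list(pivot.columns)

# Summing only the category columns keeps the result stable on re-runs.
pivot["Total"] = pivot[category_cols].sum(axis=1)
pivot["Total"] = pivot[category_cols].sum(axis=1)  # second run: same values
print(pivot)
```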


#### Removing the multi index so that it will be easier to merge

In [47]:
London_crime.columns = London_crime.columns.map(' '.join)
London_crime.head()

Unnamed: 0,Borough,No_of_Crimes Burglary,No_of_Crimes Criminal Damage,No_of_Crimes Drugs,No_of_Crimes Other Notifiable Offences,No_of_Crimes Robbery,No_of_Crimes Theft and Handling,No_of_Crimes Violence Against the Person,Total
0,Barking and Dagenham,1287,1949,919,378,534,5607,6067,33482
1,Barnet,3402,2183,906,499,464,9731,7499,49368
2,Bexley,1123,1673,646,294,209,4392,4503,25680
3,Brent,2631,2280,2096,536,919,9026,9205,53386
4,Bromley,2214,2202,728,417,369,7584,6650,40328
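
As an alternative to joining the two levels and renaming afterwards, the column names can be flattened in one step by keeping the lower level wherever it is non-empty. This is a sketch on a hand-built frame with the same two-level layout; it is not the notebook's own code.

```python
import pandas as pd

# Hand-built frame with the same two-level column layout as the pivot table.
cols = pd.MultiIndex.from_tuples([
    ("Borough", ""),
    ("No_of_Crimes", "Burglary"),
    ("No_of_Crimes", "Drugs"),
])
toy = pd.DataFrame([["Sutton", 10, 4]], columns=cols)

# Keep the lower level where it is non-empty, otherwise fall back to the upper one.
toy.columns = [lower if lower else upper for upper, lower in toy.columns]
print(list(toy.columns))
```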


#### Renaming the columns

In [48]:
London_crime.columns = ['Borough','Burglary', 'Criminal Damage','Drugs','Other Notifiable Offences',
                        'Robbery','Theft and Handling','Violence Against the Person','Total']
London_crime.head()

Unnamed: 0,Borough,Burglary,Criminal Damage,Drugs,Other Notifiable Offences,Robbery,Theft and Handling,Violence Against the Person,Total
0,Barking and Dagenham,1287,1949,919,378,534,5607,6067,33482
1,Barnet,3402,2183,906,499,464,9731,7499,49368
2,Bexley,1123,1673,646,294,209,4392,4503,25680
3,Brent,2631,2280,2096,536,919,9026,9205,53386
4,Bromley,2214,2202,728,417,369,7584,6650,40328


In [49]:
# Shape of the data set 
London_crime.shape

(33, 9)

## Methodology <a name="methodology"></a>

The methodology in this project consists of two parts:
- [Data Analysis](#EDA): Analyse the crime counts in the London boroughs to identify the safest borough, then extract the neighborhoods in that borough and find the 10 most common venues in each.


- [Modelling](#modelling): To help people find similar neighborhoods in the safest borough, we cluster neighborhoods using k-means clustering. Neighborhoods with similar venues are grouped together so that people can shortlist the areas that match their interests.
 

### Data Analysis <a name="EDA"></a>

In [50]:
London_crime.describe()

Unnamed: 0,Burglary,Criminal Damage,Drugs,Other Notifiable Offences,Robbery,Theft and Handling,Violence Against the Person,Total
count,33.0,33.0,33.0,33.0,33.0,33.0,33.0,33.0
mean,2069.242424,1941.545455,1179.212121,479.060606,682.666667,8913.121212,7041.848485,44613.393939
std,737.448644,625.20707,586.406416,223.298698,441.425366,4620.565054,2513.601551,17656.457498
min,2.0,2.0,10.0,6.0,4.0,129.0,25.0,356.0
25%,1531.0,1650.0,743.0,378.0,377.0,5919.0,5936.0,33806.0
50%,2071.0,1989.0,1063.0,490.0,599.0,8925.0,7409.0,45460.0
75%,2631.0,2351.0,1617.0,551.0,936.0,10789.0,8832.0,54348.0
max,3402.0,3219.0,2738.0,1305.0,1822.0,27520.0,10834.0,96660.0


In [51]:
London_crime.dtypes

Borough                        object
Burglary                        int64
Criminal Damage                 int64
Drugs                           int64
Other Notifiable Offences       int64
Robbery                         int64
Theft and Handling              int64
Violence Against the Person     int64
Total                           int64
dtype: object

In [52]:
London_crime.sort_values(['Total'], ascending = False, axis = 0, inplace = True )

In [53]:
London_crime.head()

Unnamed: 0,Borough,Burglary,Criminal Damage,Drugs,Other Notifiable Offences,Robbery,Theft and Handling,Violence Against the Person,Total
32,Westminster,3218,2179,2049,708,1822,27520,10834,96660
21,Lambeth,3087,2764,2738,635,1196,13155,10496,68142
27,Southwark,2946,2621,1838,494,1317,12946,9474,63272
24,Newham,2115,2496,1684,713,1472,11964,9646,60180
29,Tower Hamlets,2794,2357,1629,678,1234,10953,9608,58506


In [55]:
London_crime.shape

(33, 9)

#### To find the safest borough, we select the one with the lowest total number of crimes

In [56]:
London_crime.tail(10)

Unnamed: 0,Borough,Burglary,Criminal Damage,Drugs,Other Notifiable Offences,Robbery,Theft and Handling,Violence Against the Person,Total
25,Redbridge,1997,1650,1017,381,599,7447,6411,39004
15,Havering,1826,1804,718,389,311,5919,5936,33806
0,Barking and Dagenham,1287,1949,919,378,534,5607,6067,33482
14,Harrow,1994,1212,473,267,377,4537,4293,26306
2,Bexley,1123,1673,646,294,209,4392,4503,25680
23,Merton,1419,1418,466,249,283,4894,4026,25510
26,Richmond upon Thames,1359,1148,320,217,106,4769,3155,22148
28,Sutton,1233,1316,461,253,165,3516,3714,21316
20,Kingston upon Thames,879,1054,743,189,121,3803,3194,19966
6,City of London,2,2,10,6,4,129,25,356


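#### Setting aside the City of London (which is not one of the 32 London boroughs), the borough with the fewest recorded crimes is Kingston upon Thames, so we select it for further neighborhood analysis (neighborhood list based on Wikipedia)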
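
The selection can be sketched programmatically. The three rows below are copied from the table above (totals as displayed there); excluding the City of London is an assumption about the intent of this step, not something the notebook does in code.

```python
import pandas as pd

# Three rows copied from the tail of the sorted table (totals as displayed).
tail = pd.DataFrame({
    "Borough": ["Sutton", "Kingston upon Thames", "City of London"],
    "Total": [21316, 19966, 356],
})

# Assumption: the City of London is left out of the comparison.
boroughs_only = tail[tail["Borough"] != "City of London"]

# Pick the borough with the smallest total.
safest = boroughs_only.nsmallest(1, "Total")["Borough"].iloc[0]
print(safest)
```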

### Part 3: Analysis of Neighborhoods of the safest borough in London <a name="part3"></a>



The list of Neighborhoods taken from: https://en.wikipedia.org/wiki/List_of_districts_in_the_Royal_Borough_of_Kingston_upon_Thames

In [58]:
Neighborhood = []

Borough = []

Latitude = []
Longitude = []

df_dict = {'Neighborhood': Neighborhood,'Borough':Borough,'Latitude': Latitude,'Longitude':Longitude}
df_neigh = pd.DataFrame(data=df_dict, columns=['Neighborhood', 'Borough', 'Latitude', 'Longitude'], index=None)

df_neigh

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude


In [60]:
df_neigh.dtypes

Neighborhood     object
Borough         float64
Latitude        float64
Longitude       float64
dtype: object

In [61]:
df_neigh=df_neigh.astype(str)

In [62]:
df_neigh.dtypes

Neighborhood    object
Borough         object
Latitude        object
Longitude       object
dtype: object

In [63]:
nbhd = ['Berrylands','Canbury','Chessington','Coombe','Hook','Kingston upon Thames',
'Kingston Vale','Malden Rushett','Motspur Park','New Malden','Norbiton',
'Old Malden','Seething Wells','Surbiton','Tolworth']

for i in range(len(nbhd)):
    df_neigh.at[i,'Neighborhood']= nbhd[i]
    df_neigh.at[i,'Borough']= 'Kingston upon Thames'
    df_neigh.at[i,'Latitude']= ''
    df_neigh.at[i,'Longitude']= ''

In [64]:
df_neigh

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude
0,Berrylands,Kingston upon Thames,,
1,Canbury,Kingston upon Thames,,
2,Chessington,Kingston upon Thames,,
3,Coombe,Kingston upon Thames,,
4,Hook,Kingston upon Thames,,
5,Kingston upon Thames,Kingston upon Thames,,
6,Kingston Vale,Kingston upon Thames,,
7,Malden Rushett,Kingston upon Thames,,
8,Motspur Park,Kingston upon Thames,,
9,New Malden,Kingston upon Thames,,


#### Get coordinates of each neighborhood

In [48]:
Latitude = []
Longitude = []

# Create the geocoder once and reuse it for every neighborhood
geolocator = Nominatim(user_agent="London_agent")

for name in nbhd:
    address = '{},London,United Kingdom'.format(name)
    location = geolocator.geocode(address)  # returns None if the address is not found
    Latitude.append(location.latitude)
    Longitude.append(location.longitude)
print(Latitude, Longitude)

[51.3937811, 51.41749865, 51.358336, 51.4194499, 51.3678984, 51.4096275, 51.43185, 51.3410523, 51.3909852, 51.4053347, 51.4099994, 51.382484, 51.3926421, 51.3937557, 51.3788758] [-0.2848024, -0.305552805049262, -0.2986216, -0.2653985, -0.3071453, -0.3062621, -0.2581379, -0.3190757, -0.2488979, -0.2634066, -0.2873963, -0.2590897, -0.3143662, -0.3033105, -0.2828604]
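
geopy's `geocode` returns `None` when an address cannot be resolved, in which case `location.latitude` raises an `AttributeError`. A more defensive loop might look like the sketch below. `geocode_all`, `_FakeLocation` and `fake_geocode` are hypothetical names introduced here; the fake geocoder stands in for the network call so the example runs offline.

```python
from typing import Callable, List, Optional, Tuple


def geocode_all(names: List[str], geocode: Callable[[str], object],
                suffix: str = "London, United Kingdom"
                ) -> List[Tuple[str, Optional[float], Optional[float]]]:
    """Geocode each name; record (name, None, None) when no match is found."""
    rows = []
    for name in names:
        location = geocode("{}, {}".format(name, suffix))
        if location is None:
            rows.append((name, None, None))
        else:
            rows.append((name, location.latitude, location.longitude))
    return rows


class _FakeLocation:
    """Stand-in for a geocoder result, used only for this offline demo."""
    def __init__(self, lat: float, lon: float) -> None:
        self.latitude = lat
        self.longitude = lon


def fake_geocode(address: str):
    # Hypothetical lookup table; the Berrylands coordinates come from the output above.
    known = {"Berrylands, London, United Kingdom": _FakeLocation(51.3937811, -0.2848024)}
    return known.get(address)


rows = geocode_all(["Berrylands", "Nowhere"], fake_geocode)
print(rows)
```

With a real geocoder, `Nominatim(user_agent=...).geocode` could be passed as the `geocode` argument. Wrapping it in geopy's `RateLimiter` (from `geopy.extra.rate_limiter`) is one way to space out requests.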


#### we got: Latitude = [51.3937811, 51.41749865, 51.358336, 51.4194499, 51.3678984, 51.4096275, 51.43185, 51.3410523, 51.3909852, 51.4053347, 51.4099994, 51.382484, 51.3926421, 51.3937557, 51.3788758] Longitude = [-0.2848024, -0.305552805049262, -0.2986216, -0.2653985, -0.3071453, -0.3062621, -0.2581379, -0.3190757, -0.2488979, -0.2634066, -0.2873963, -0.2590897, -0.3143662, -0.3033105, -0.2828604]

In [66]:
for i in range(len(nbhd)):
    df_neigh.at[i,'Latitude']= Latitude[i]
    df_neigh.at[i,'Longitude']= Longitude[i]

In [67]:
df_neigh

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude
0,Berrylands,Kingston upon Thames,51.3938,-0.284802
1,Canbury,Kingston upon Thames,51.4175,-0.305553
2,Chessington,Kingston upon Thames,51.3583,-0.298622
3,Coombe,Kingston upon Thames,51.4194,-0.265398
4,Hook,Kingston upon Thames,51.3679,-0.307145
5,Kingston upon Thames,Kingston upon Thames,51.4096,-0.306262
6,Kingston Vale,Kingston upon Thames,51.4318,-0.258138
7,Malden Rushett,Kingston upon Thames,51.3411,-0.319076
8,Motspur Park,Kingston upon Thames,51.391,-0.248898
9,New Malden,Kingston upon Thames,51.4053,-0.263407


#### Get the co-ordinates of central neighborhood of Kingston upon Thames

In [50]:
address = 'Berrylands, London, United Kingdom'

geolocator = Nominatim(user_agent="ld_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Berrylands, London are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Berrylands, London are 51.3937811, -0.2848024.


#### We got: latitude, longitude = (51.3937811, -0.2848024)

### Visualize using folium

In [69]:
# create map of Kingston upon Thames using latitude and longitude values
map_lon = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_neigh['Latitude'], df_neigh['Longitude'], df_neigh['Borough'], df_neigh['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_lon)  
    
map_lon

### Modelling <a name="modelling"></a>

- Get venues within a 500 m radius of each neighborhood.
- Analyse the venues and perform k-means clustering.

#### Create a function to extract the venues from each Neighborhood (Ref: Coursera previous assignment)

In [70]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
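
The function above indexes directly into the response, so an empty `groups` list or a venue without categories would raise an error. The sketch below shows one way to parse the same nested structure more defensively. `extract_venues` is a hypothetical helper, and `sample` is a hand-written payload that only mimics the fields used above ("Unnamed Spot" is invented).

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to None when a venue has no category attached.
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"], venue["location"]["lat"],
                     venue["location"]["lng"], category))
    return rows


# Hand-written payload with the same nesting as the fields accessed above.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Alexandra Park",
               "location": {"lat": 51.39423, "lng": -0.281206},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unnamed Spot",
               "location": {"lat": 51.39, "lng": -0.28},
               "categories": []}},
]}]}}

rows = extract_venues(sample)
empty = extract_venues({"response": {}})
print(rows, empty)
```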

In [72]:
neigh_venues = getNearbyVenues(names=df_neigh['Neighborhood'],
                                   latitudes=df_neigh['Latitude'],
                                   longitudes=df_neigh['Longitude']
                                  )


Berrylands
Canbury
Chessington
Coombe
Hook
Kingston upon Thames
Kingston Vale
Malden Rushett
Motspur Park
New Malden
Norbiton
Old Malden
Seething Wells
Surbiton
Tolworth


In [73]:
print(neigh_venues.shape)
neigh_venues.head()

(170, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Berrylands,51.393781,-0.284802,Surbiton Racket & Fitness Club,51.392676,-0.290224,Gym / Fitness Center
1,Berrylands,51.393781,-0.284802,Alexandra Park,51.39423,-0.281206,Park
2,Berrylands,51.393781,-0.284802,K2 Bus Stop,51.392302,-0.281534,Bus Stop
3,Berrylands,51.393781,-0.284802,Cafe Rosa,51.390175,-0.28249,Café
4,Canbury,51.417499,-0.305553,Canbury Gardens,51.417409,-0.3053,Park


In [74]:
neigh_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Berrylands,4,4,4,4,4,4
Canbury,14,14,14,14,14,14
Hook,4,4,4,4,4,4
Kingston Vale,4,4,4,4,4,4
Kingston upon Thames,30,30,30,30,30,30
Malden Rushett,4,4,4,4,4,4
Motspur Park,4,4,4,4,4,4
New Malden,7,7,7,7,7,7
Norbiton,25,25,25,25,25,25
Old Malden,4,4,4,4,4,4


In [75]:
print('There are {} unique categories.'.format(len(neigh_venues['Venue Category'].unique())))

There are 68 unique categories.


#### One hot encoding (Ref: Coursera previous assignment)

In [76]:
# one hot encoding
neigh_onehot = pd.get_dummies(neigh_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
neigh_onehot['Neighborhood'] = neigh_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [neigh_onehot.columns[-1]] + list(neigh_onehot.columns[:-1])
neigh_onehot = neigh_onehot[fixed_columns]

neigh_onehot.head()

Unnamed: 0,Neighborhood,Asian Restaurant,Athletics & Sports,Auto Garage,Bakery,Bar,Beer Bar,Bistro,Bowling Alley,Breakfast Spot,...,Shop & Service,Soccer Field,Spa,Supermarket,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Train Station,Wine Shop
0,Berrylands,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Berrylands,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Berrylands,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Berrylands,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Canbury,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Group rows by neighborhood, taking the mean frequency of occurrence of each category

In [77]:
neigh_grouped = neigh_onehot.groupby('Neighborhood').mean().reset_index()
neigh_grouped

Unnamed: 0,Neighborhood,Asian Restaurant,Athletics & Sports,Auto Garage,Bakery,Bar,Beer Bar,Bistro,Bowling Alley,Breakfast Spot,...,Shop & Service,Soccer Field,Spa,Supermarket,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Train Station,Wine Shop
0,Berrylands,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Canbury,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.071429,0.0,0.071429,0.071429,0.0,0.0,0.0,0.0,0.0,0.0
2,Hook,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0
3,Kingston Vale,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,...,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Kingston upon Thames,0.033333,0.0,0.0,0.033333,0.0,0.033333,0.0,0.0,0.0,...,0.0,0.0,0.0,0.033333,0.066667,0.0,0.033333,0.033333,0.0,0.0
5,Malden Rushett,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Motspur Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,New Malden,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.142857,0.142857,0.0,0.0,0.0,0.0,0.0
8,Norbiton,0.0,0.04,0.04,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.04
9,Old Malden,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0


In [78]:
neigh_grouped.shape

(13, 69)
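
To see why the mean of one-hot columns is a frequency, consider a small invented example. The venue rows below are made up and do not match the Foursquare results.

```python
import pandas as pd

# Invented venues: two in Berrylands, four in Hook.
venues = pd.DataFrame({
    "Neighborhood": ["Berrylands", "Berrylands", "Hook", "Hook", "Hook", "Hook"],
    "Venue Category": ["Park", "Café", "Bakery", "Supermarket", "Bakery", "Park"],
})

# One column per category, 1 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column within a group is the share of venues in that category.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

Each row of `freq` sums to 1, so neighborhoods with few venues (such as the 0.25 values above, from 4 venues) are compared on the same scale as those with many.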

In [80]:
num_top_venues = 5

for hood in neigh_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = neigh_grouped[neigh_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Berrylands----
                  venue  freq
0  Gym / Fitness Center  0.25
1                  Park  0.25
2                  Café  0.25
3              Bus Stop  0.25
4                 Plaza  0.00


----Canbury----
               venue  freq
0                Pub  0.29
1     Shop & Service  0.07
2              Hotel  0.07
3  Indian Restaurant  0.07
4               Park  0.07


----Hook----
               venue  freq
0             Bakery  0.25
1  Indian Restaurant  0.25
2  Fish & Chips Shop  0.25
3        Supermarket  0.25
4   Asian Restaurant  0.00


----Kingston Vale----
              venue  freq
0     Grocery Store  0.25
1               Bar  0.25
2    Sandwich Place  0.25
3      Soccer Field  0.25
4  Asian Restaurant  0.00


----Kingston upon Thames----
              venue  freq
0       Coffee Shop  0.13
1               Pub  0.07
2  Sushi Restaurant  0.07
3      Burger Joint  0.07
4              Café  0.07


----Malden Rushett----
              venue  freq
0               Pub  0.25


#### Create a data frame of the venues 

In [81]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [82]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # positions 4 and up have no special suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = neigh_grouped['Neighborhood']

for ind in np.arange(neigh_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(neigh_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berrylands,Gym / Fitness Center,Park,Café,Bus Stop,Wine Shop,Food,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,French Restaurant
1,Canbury,Pub,Hotel,Shop & Service,Café,Plaza,Indian Restaurant,Fish & Chips Shop,Park,Spa,Supermarket
2,Hook,Bakery,Supermarket,Fish & Chips Shop,Indian Restaurant,Wine Shop,French Restaurant,Electronics Store,Farmers Market,Fast Food Restaurant,Food
3,Kingston Vale,Sandwich Place,Grocery Store,Bar,Soccer Field,Wine Shop,Electronics Store,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Food
4,Kingston upon Thames,Coffee Shop,Pub,Sushi Restaurant,Café,Burger Joint,Asian Restaurant,Gift Shop,Furniture / Home Store,French Restaurant,Electronics Store
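
The column labels rely on a `try`/`except` around a three-item suffix list, which works for 1 to 10 but would produce labels such as "21th" for larger values. A small helper can generate the suffix explicitly; `ordinal` is a hypothetical name introduced for this sketch.

```python
def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th and 13th are exceptions to the st/nd/rd rule.
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


# Build the same column labels as above for the top 10 venues.
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```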


### Clustering using k - means 

In [83]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

neigh_grouped_clustering = neigh_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(neigh_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 4, 2, 0, 4, 0, 0, 4, 4, 1])
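
For intuition about what the clustering step does, here is a plain-Python sketch of Lloyd's algorithm, the basic iteration behind k-means. It is not scikit-learn's implementation (which, among other things, chooses initial centroids differently). The `points` are invented frequency vectors, and the starting centroids are picked by hand.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm on tuples of floats; returns (labels, centroids)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Invented venue-frequency vectors: two "park/cafe" areas and two "pub" areas.
points = [(0.25, 0.25, 0.0), (0.30, 0.20, 0.0), (0.0, 0.0, 1.0), (0.0, 0.1, 0.9)]
labels, centers = kmeans(points, centroids=[points[0], points[2]])
print(labels)
```

Neighborhoods whose frequency vectors are close end up with the same label, which is the property the cluster tables later rely on.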

In [84]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

neigh_merged = df_neigh

# join neighborhoods_venues_sorted onto df_neigh to add cluster labels and top venues to each neighborhood's latitude/longitude
neigh_merged = neigh_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

neigh_merged.head() # check the last columns!

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berrylands,Kingston upon Thames,51.3938,-0.284802,3.0,Gym / Fitness Center,Park,Café,Bus Stop,Wine Shop,Food,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,French Restaurant
1,Canbury,Kingston upon Thames,51.4175,-0.305553,4.0,Pub,Hotel,Shop & Service,Café,Plaza,Indian Restaurant,Fish & Chips Shop,Park,Spa,Supermarket
2,Chessington,Kingston upon Thames,51.3583,-0.298622,,,,,,,,,,,
3,Coombe,Kingston upon Thames,51.4194,-0.265398,,,,,,,,,,,
4,Hook,Kingston upon Thames,51.3679,-0.307145,2.0,Bakery,Supermarket,Fish & Chips Shop,Indian Restaurant,Wine Shop,French Restaurant,Electronics Store,Farmers Market,Fast Food Restaurant,Food


In [85]:
neigh_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns (total 15 columns):
Neighborhood              15 non-null object
Borough                   15 non-null object
Latitude                  15 non-null object
Longitude                 15 non-null object
Cluster Labels            13 non-null float64
1st Most Common Venue     13 non-null object
2nd Most Common Venue     13 non-null object
3rd Most Common Venue     13 non-null object
4th Most Common Venue     13 non-null object
5th Most Common Venue     13 non-null object
6th Most Common Venue     13 non-null object
7th Most Common Venue     13 non-null object
8th Most Common Venue     13 non-null object
9th Most Common Venue     13 non-null object
10th Most Common Venue    13 non-null object
dtypes: float64(1), object(14)
memory usage: 2.5+ KB


In [86]:
# Drop the rows with NaN values (neighborhoods for which no venues were returned)
neigh_merged.dropna(inplace = True)

In [87]:
neigh_merged.shape

(13, 15)
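
The NaN rows come from the left-style join: neighborhoods with no venue data have no match on the right-hand side, and the missing values turn the label column into floats. The sketch below reproduces this on three neighborhoods, using the labels shown above for Berrylands and Hook, and shows an inner merge as one possible alternative to `dropna` plus `astype(int)`.

```python
import pandas as pd

# Chessington has no venue data, so it is absent from the clustered frame.
neigh = pd.DataFrame({"Neighborhood": ["Berrylands", "Chessington", "Hook"]})
clustered = pd.DataFrame({
    "Neighborhood": ["Berrylands", "Hook"],
    "Cluster Labels": [3, 2],
})

# A left merge keeps every neighborhood and fills the gaps with NaN.
left = neigh.merge(clustered, on="Neighborhood", how="left")
missing = left.loc[left["Cluster Labels"].isna(), "Neighborhood"].tolist()

# An inner merge keeps only matched rows, so the labels stay integers.
inner = neigh.merge(clustered, on="Neighborhood", how="inner")
print(missing)
print(inner)
```

Listing the unmatched names first makes it explicit which neighborhoods are being excluded from the clusters.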

In [88]:
neigh_merged['Cluster Labels'] = neigh_merged['Cluster Labels'].astype(int)

In [89]:
neigh_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13 entries, 0 to 14
Data columns (total 15 columns):
Neighborhood              13 non-null object
Borough                   13 non-null object
Latitude                  13 non-null object
Longitude                 13 non-null object
Cluster Labels            13 non-null int32
1st Most Common Venue     13 non-null object
2nd Most Common Venue     13 non-null object
3rd Most Common Venue     13 non-null object
4th Most Common Venue     13 non-null object
5th Most Common Venue     13 non-null object
6th Most Common Venue     13 non-null object
7th Most Common Venue     13 non-null object
8th Most Common Venue     13 non-null object
9th Most Common Venue     13 non-null object
10th Most Common Venue    13 non-null object
dtypes: int32(1), object(14)
memory usage: 1.6+ KB


### Visualize the clusters

In [91]:

%matplotlib inline 

import matplotlib as mpl
import matplotlib.pyplot as plt



# check for latest version of Matplotlib
print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors



Matplotlib version:  3.1.1


In [92]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11.5)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(neigh_merged['Latitude'], neigh_merged['Longitude'], neigh_merged['Neighborhood'], neigh_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=8,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.5).add_to(map_clusters)
       
map_clusters

Each cluster is color-coded for ease of presentation.
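
The markers above look up their color with `rainbow[cluster-1]`, while the cluster labels run from 0 to `kclusters - 1`. The following is a minimal sketch of why each cluster still gets its own color. It assumes `kclusters = 5` (inferred from the five clusters analyzed in this report) and uses a hypothetical placeholder list `palette` in place of the hex strings produced by `cm.rainbow`:

```python
# Placeholder palette; the notebook builds hex color strings with cm.rainbow instead.
kclusters = 5  # assumed from the five clusters shown in the analysis
palette = ['c0', 'c1', 'c2', 'c3', 'c4']

# Label 0 becomes index -1, which Python resolves to the last palette entry.
color_for_label = {label: palette[label - 1] for label in range(kclusters)}

for label, color in color_for_label.items():
    print(label, color)

# Every label still maps to a different color.
distinct = len(set(color_for_label.values())) == kclusters
```

Because a negative index wraps around to the end of the list, the mapping is a rotation of the palette rather than an error, so no two clusters share a color.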

## Analysis <a name="analysis"></a>

#### 1st cluster

In [93]:
neigh_merged[neigh_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Kingston Vale,Kingston upon Thames,51.4318,-0.258138,0,Sandwich Place,Grocery Store,Bar,Soccer Field,Wine Shop,Electronics Store,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Food
7,Malden Rushett,Kingston upon Thames,51.3411,-0.319076,0,Grocery Store,Pub,Garden Center,Restaurant,Wine Shop,Fish & Chips Shop,Department Store,Electronics Store,Farmers Market,Fast Food Restaurant
8,Motspur Park,Kingston upon Thames,51.391,-0.248898,0,Gym,Park,Restaurant,Soccer Field,Fish & Chips Shop,Department Store,Electronics Store,Farmers Market,Fast Food Restaurant,Food
14,Tolworth,Kingston upon Thames,51.3789,-0.28286,0,Grocery Store,Pharmacy,Bowling Alley,Coffee Shop,Italian Restaurant,Pizza Place,Hotel,Café,Bus Stop,Sandwich Place


The 1st cluster contains 4 of the 15 neighborhoods in the borough of Kingston upon Thames (13 of which have a cluster label). The most common venues are grocery stores, gyms, sandwich places, pharmacies and parks.
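
One way to check which venue categories dominate a cluster is to tally the top-ranked venues of its neighborhoods. The sketch below does this for the cluster 0 table above with the standard library only. Restricting the tally to the first three venue columns is a choice made for this illustration, not part of the original analysis, and the name `top3_by_neighborhood` is hypothetical:

```python
from collections import Counter

# Top three venue categories per neighborhood, copied from the cluster 0 table above.
top3_by_neighborhood = {
    'Kingston Vale': ['Sandwich Place', 'Grocery Store', 'Bar'],
    'Malden Rushett': ['Grocery Store', 'Pub', 'Garden Center'],
    'Motspur Park': ['Gym', 'Park', 'Restaurant'],
    'Tolworth': ['Grocery Store', 'Pharmacy', 'Bowling Alley'],
}

# Flatten the per-neighborhood lists and count each venue category.
venue_counts = Counter(
    venue for venues in top3_by_neighborhood.values() for venue in venues
)

# Print categories from most to least frequent.
for venue, count in venue_counts.most_common():
    print(f'{venue}: {count}')
```

In this tally, Grocery Store appears in three of the four neighborhoods, while every other category appears once.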

#### 2nd cluster

In [94]:
neigh_merged[neigh_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,Old Malden,Kingston upon Thames,51.3825,-0.25909,1,Construction & Landscaping,Train Station,Food,Deli / Bodega,Department Store,Electronics Store,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Wine Shop


The second cluster has one neighborhood, with Construction & Landscaping as its most common venue.

#### 3rd cluster

In [95]:
neigh_merged[neigh_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Hook,Kingston upon Thames,51.3679,-0.307145,2,Bakery,Supermarket,Fish & Chips Shop,Indian Restaurant,Wine Shop,French Restaurant,Electronics Store,Farmers Market,Fast Food Restaurant,Food


The third cluster has one neighborhood, with Bakery as its most common venue.

#### 4th cluster

In [96]:
neigh_merged[neigh_merged['Cluster Labels'] == 3]

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berrylands,Kingston upon Thames,51.3938,-0.284802,3,Gym / Fitness Center,Park,Café,Bus Stop,Wine Shop,Food,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,French Restaurant


The fourth cluster has one neighborhood, with Gym / Fitness Center as its most common venue.

#### 5th cluster

In [97]:
neigh_merged[neigh_merged['Cluster Labels'] == 4]

Unnamed: 0,Neighborhood,Borough,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Canbury,Kingston upon Thames,51.4175,-0.305553,4,Pub,Hotel,Shop & Service,Café,Plaza,Indian Restaurant,Fish & Chips Shop,Park,Spa,Supermarket
5,Kingston upon Thames,Kingston upon Thames,51.4096,-0.306262,4,Coffee Shop,Pub,Sushi Restaurant,Café,Burger Joint,Asian Restaurant,Gift Shop,Furniture / Home Store,French Restaurant,Electronics Store
9,New Malden,Kingston upon Thames,51.4053,-0.263407,4,Gym,Gastropub,Sushi Restaurant,Supermarket,Bar,Indian Restaurant,Korean Restaurant,Food,Electronics Store,Farmers Market
10,Norbiton,Kingston upon Thames,51.41,-0.287396,4,Pub,Italian Restaurant,Indian Restaurant,Food,Platform,Wine Shop,Coffee Shop,Hotel,Hardware Store,Pharmacy
12,Seething Wells,Kingston upon Thames,51.3926,-0.314366,4,Indian Restaurant,Coffee Shop,Pub,Café,Gym,Restaurant,Park,Pet Café,Fast Food Restaurant,Chinese Restaurant
13,Surbiton,Kingston upon Thames,51.3938,-0.30331,4,Coffee Shop,Pub,Grocery Store,Italian Restaurant,Pharmacy,French Restaurant,Train Station,Gym / Fitness Center,Hotel,Farmers Market


The fifth cluster is the biggest, with 6 of the borough's 15 neighborhoods. Its most common venues are pubs, coffee shops, gyms and Indian restaurants.
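
The cluster sizes quoted in this section can be cross-checked with a short tally. This is a minimal sketch using only the standard library; the list `cluster_labels` is transcribed by hand from the five tables above and is not produced by the notebook itself:

```python
from collections import Counter

# Cluster labels of the 13 labeled neighborhoods, in the order of the tables above.
cluster_labels = [0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4, 4, 4]

cluster_sizes = Counter(cluster_labels)
total = len(cluster_labels)

# Report each cluster with its 1-based name, size and share of labeled neighborhoods.
for label in sorted(cluster_sizes):
    share = cluster_sizes[label] / total
    print(f'Cluster {label + 1}: {cluster_sizes[label]} neighborhoods ({share:.0%})')
```

On the notebook's DataFrame, `neigh_merged['Cluster Labels'].value_counts()` should give the same counts.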

## Results and Discussion <a name="results"></a>

The aim of this project is to find the safest borough in London for people who are relocating there. Within the safest borough, the neighborhoods need to be analyzed so that a newcomer can enjoy the most common venues and a good quality of life along with safety. The most important amenities include pharmacies, gyms, restaurants, grocery stores and easy access to transportation.

From the analysis of the neighborhoods, the clusters are characterized by the following venues:
* 1st cluster: Gym, Sandwich Place, Grocery Store, Pharmacy, Park.
* 2nd cluster: Construction & Landscaping, Train Station, Food, Deli / Bodega, Department Store, Electronics Store, Farmers Market.
* 3rd cluster: Bakery, Supermarket, Fish & Chips Shop, Indian Restaurant, Wine Shop, French Restaurant, Electronics Store.
* 4th cluster: Gym / Fitness Center, Park, Café, Bus Stop, Wine Shop, Food, Farmers Market.
* 5th cluster: Pub, Coffee Shop, Gym, Indian Restaurant, Pharmacy; mostly restaurants and other eateries.

For a family, the 1st cluster is more suitable because its common venues cover most of the essential services. For a food lover, I think the 5th cluster is more suitable.


## Conclusion <a name="conclusion"></a>

Relocation is a difficult and hectic process: it involves moving to a new, unfamiliar location where we know little about the food, transportation services, neighboring areas and safety. With data science techniques, we can analyze different neighborhoods as if we were virtually present there and choose an area to live in that is safe for our family and also meets our essential needs. In this project we compared the boroughs of London by crime rate, shortlisted the neighborhoods of the safest borough, and finally grouped them into clusters based on their most common venues. This can help a person relocating to London choose the best neighborhood to live in.

## END