### The Battle of the Neighborhoods - Week 1

#### Data

The city I have chosen to analyse is: __London__

__Data 1__

The first set of data describes the make-up of London itself and defines its boundaries. There is the 'City of London', known as 'the Square Mile', which covers only 1.12 square miles; it even has its own police service, separate from the Metropolitan Police, who police the rest of London and its 32 boroughs.

For this analysis, 'London' means the Greater London area: everything up to Enfield in the north, down to Croydon in the south, and over to Hillingdon in the west and Havering in the east.

This data will be extracted from Wikipedia and later combined with Foursquare data.

Areas of London and the boroughs they belong to:
https://en.wikipedia.org/wiki/List_of_areas_of_London

This page will be scraped below:


In [2]:
#import relevant libraries required.

from bs4 import BeautifulSoup
import numpy as np

#import pandas and set limiters on columns/rows.
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json
print('libraries numpy, pandas, json imported')

#install our geopy handling library
!pip install -q geopy
print('geopy installed')

from geopy.geocoders import Nominatim
print('Nominatim imported')

libraries numpy, pandas, json imported
geopy installed
Nominatim imported


In [4]:
#other libraries to install
import requests

#function to convert JSON into pandas dataframes (top-level import; pandas.io.json.json_normalize is deprecated)
from pandas import json_normalize

print('import successful')

import successful


In [5]:
import matplotlib.cm as cm
import matplotlib.colors as colors
print('matplot imported')

#clustering tools
from sklearn.cluster import KMeans

!pip -q install geocoder
import geocoder
import time

print('You have time!')

matplot imported
You have time!


In [6]:
!conda install -c conda-forge folium=0.5.0 --yes
!pip -q install folium 

print('folium installed')
import folium
print('folium finished')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

folium installed
folium finished


In [7]:
#Now we can begin looking at the data
wikipedia_link = 'https://en.wikipedia.org/wiki/List_of_areas_of_London'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'}
wikipedia_page = requests.get(wikipedia_link, headers = headers)
wikipedia_page

<Response [200]>

In [10]:
#Cleaning the web page to extract just the table data we are after
soup = BeautifulSoup(wikipedia_page.content, 'html.parser')

table = soup.find('table', {'class':'wikitable sortable'}).tbody

In [11]:
#Extract data

rows = table.find_all('tr')

In [12]:
columns = [i.text.replace('\n', '')
          for i in rows[0].find_all('th')]

In [13]:
#convert to a pandas df

df = pd.DataFrame(columns = columns)
df

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
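
The header cells scraped above still carry the page's non-breaking spaces (`\xa0`), which print exactly like ordinary spaces; that is why the later rename step has to spell out keys such as `'London\xa0borough'`. One option, sketched here with a hypothetical helper `clean_header` that is not part of the notebook, is to normalise the header text before building the data frame:

```python
def clean_header(text):
    """Return header text with newlines removed and non-breaking spaces turned into plain spaces."""
    return text.replace('\n', '').replace('\xa0', ' ').strip()


# Example header strings, written as they might come out of the scraped <th> cells
raw_headers = ['Location\n', 'London\xa0borough\n', 'Post town\n']
clean_headers = [clean_header(h) for h in raw_headers]
print(clean_headers)
```

With headers cleaned this way, a rename mapping could use plain-space keys such as `'London borough'`.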


In [14]:
#Now we will extract the rows of data and incorporate them into our table

for i in range(1, len(rows)):
    tds = rows[i].find_all('td')

    #strip newlines and non-breaking spaces from every cell
    values = [td.text.replace('\n', '').replace('\xa0', '') for td in tds]

    #skip any row whose cell count does not match the header
    if len(values) != len(columns):
        continue

    #append the row (DataFrame.append was removed in pandas 2.0)
    df.loc[len(df)] = values

In [15]:
df.head(5)

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,20,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",20,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,20,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,20,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728


In [16]:
#rename some of the columns to better fit our needs

df = df.rename(index=str, columns = {'Location': 'Location', 'London\xa0borough': 'Borough', 'Post town': 'Post-Town', 'Postcode\xa0district': 'Postcode', 'Dial\xa0code': 'Dial-code', 'OS grid ref': 'OSGridRef'})

In [17]:
df.head(5)

Unnamed: 0,Location,Borough,Post-Town,Postcode,Dial-code,OSGridRef
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,20,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",20,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,20,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,20,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728


In [18]:
#We will now remove the footnote markers such as [7] next to the boroughs

df['Borough'] = df['Borough'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('[').strip())

In [19]:
#see shape of table
df.shape

(533, 6)

In [20]:
df.head(5)

Unnamed: 0,Location,Borough,Post-Town,Postcode,Dial-code,OSGridRef
0,Abbey Wood,"Bexley, Greenwich",LONDON,SE2,20,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,"W3, W4",20,TQ205805
2,Addington,Croydon,CROYDON,CR0,20,TQ375645
3,Addiscombe,Croydon,CROYDON,CR0,20,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728
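
Chained `rstrip` calls work when there is exactly one marker at the very end of the string. A regular expression states the intent more directly and also copes with a marker in the middle of a string. The following is a minimal sketch using only the standard library; `strip_footnote` is a hypothetical helper name, and it assumes the markers always have the form `[digits]`.

```python
import re

# A footnote marker is a run of digits in square brackets, optionally preceded by whitespace
_FOOTNOTE = re.compile(r'\s*\[\d+\]')


def strip_footnote(text):
    """Remove every '[n]' marker from text and trim surrounding whitespace."""
    return _FOOTNOTE.sub('', text).strip()


for raw in ['Bexley, Greenwich [7]', 'Croydon[8]', 'Bexley']:
    print(repr(strip_footnote(raw)))
```

In the notebook this could be applied with `df['Borough'].map(strip_footnote)`.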


__Post codes__

Some locations span more than one postcode district, so their Postcode cell holds a comma-separated list. To handle this, I will split those cells so that each row carries a single postcode.
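
To make the reshaping concrete, here is a plain-Python sketch of the same idea on two rows taken from the table above. It illustrates the logic only and is not the pandas code used in the notebook; `split_postcodes` is a hypothetical name.

```python
def split_postcodes(records):
    """Return one record per postcode, copying the other fields unchanged."""
    result = []
    for record in records:
        for code in record['Postcode'].split(','):
            row = dict(record)               # copy so the input is not modified
            row['Postcode'] = code.strip()   # drop the space left after the comma
            result.append(row)
    return result


rows_in = [
    {'Location': 'Abbey Wood', 'Postcode': 'SE2'},
    {'Location': 'Acton', 'Postcode': 'W3, W4'},
]
rows_out = split_postcodes(rows_in)
for row in rows_out:
    print(row['Location'], row['Postcode'])
```

Acton appears twice in the result, once for each of its postcodes, which is the shape the pandas version produces as well.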

In [22]:
df0 = df.drop('Postcode', axis=1).join(df['Postcode'].str.split(',', expand=True).stack().reset_index(level=1, drop=True).rename('Postcode'))

In [23]:
df0.head(5)

Unnamed: 0,Location,Borough,Post-Town,Dial-code,OSGridRef,Postcode
0,Abbey Wood,"Bexley, Greenwich",LONDON,20,TQ465785,SE2
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,20,TQ205805,W3
1,Acton,"Ealing, Hammersmith and Fulham",LONDON,20,TQ205805,W4
10,Angel,Islington,LONDON,20,TQ345665,EC1
10,Angel,Islington,LONDON,20,TQ345665,N1


In [25]:
#Only data from certain headers will be used, we will now drop the columns not needed.

df1 = df0[['Location', 'Borough', 'Postcode', 'Post-Town']].reset_index(drop=True)

In [26]:
df1.head(5)

Unnamed: 0,Location,Borough,Postcode,Post-Town
0,Abbey Wood,"Bexley, Greenwich",SE2,LONDON
1,Acton,"Ealing, Hammersmith and Fulham",W3,LONDON
2,Acton,"Ealing, Hammersmith and Fulham",W4,LONDON
3,Angel,Islington,EC1,LONDON
4,Angel,Islington,N1,LONDON


In [29]:
#Now remove all non-London post-towns
df2 = df1
df21 = df2[df2['Post-Town'].str.contains('LONDON')]

In [30]:
df21.head(5)

Unnamed: 0,Location,Borough,Postcode,Post-Town
0,Abbey Wood,"Bexley, Greenwich",SE2,LONDON
1,Acton,"Ealing, Hammersmith and Fulham",W3,LONDON
2,Acton,"Ealing, Hammersmith and Fulham",W4,LONDON
3,Angel,Islington,EC1,LONDON
4,Angel,Islington,N1,LONDON


In [31]:
#We will now also drop the 'Post-town' column as we now know we only have the towns within London.

df3 = df21[['Location', 'Borough', 'Postcode']].reset_index(drop=True)

In [34]:
df3.head(20)

Unnamed: 0,Location,Borough,Postcode
0,Abbey Wood,"Bexley, Greenwich",SE2
1,Acton,"Ealing, Hammersmith and Fulham",W3
2,Acton,"Ealing, Hammersmith and Fulham",W4
3,Angel,Islington,EC1
4,Angel,Islington,N1
5,Church End,Brent,NW10
6,Church End,Barnet,N3
7,Clapham,"Lambeth, Wandsworth",SW4
8,Clerkenwell,Islington,EC1
9,Colindale,Barnet,NW9


In [35]:
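#strip stray whitespace left by the postcode split before saving
df_london = df3.copy()
df_london['Postcode'] = df_london['Postcode'].str.strip()
df_london.to_csv('LondonLocations.csv', index = False)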

__Specific area to review__

The area of London I wish to focus on is the diverse South and South East. It includes London's famous Borough Market, just by London Bridge, a great place to find almost any kind of food you can imagine.

In [36]:
df_london.Postcode = df_london.Postcode.str.strip()

In [37]:
df_london.head(5)

Unnamed: 0,Location,Borough,Postcode
0,Abbey Wood,"Bexley, Greenwich",SE2
1,Acton,"Ealing, Hammersmith and Fulham",W3
2,Acton,"Ealing, Hammersmith and Fulham",W4
3,Angel,Islington,EC1
4,Angel,Islington,N1


In [38]:
#Now we will extract all postcodes only within South East London (SE postcodes)

df_se = df_london[df_london['Postcode'].str.startswith(('SE'))].reset_index(drop=True)

In [39]:
#So now we should only see SE postcode areas below

df_se.head(20)

Unnamed: 0,Location,Borough,Postcode
0,Abbey Wood,"Bexley, Greenwich",SE2
1,Crofton Park,Lewisham,SE4
2,Crossness,Bexley,SE2
3,Crystal Palace,Bromley,SE19
4,Crystal Palace,Bromley,SE20
5,Crystal Palace,Bromley,SE26
6,Denmark Hill,Southwark,SE5
7,Deptford,Lewisham,SE8
8,Dulwich,Southwark,SE21
9,East Dulwich,Southwark,SE22
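
A prefix test works for `SE`, but the same approach with a one-letter area such as `E` would also match `EC1`. Extracting the area letters first avoids that. This is a small sketch under the assumption that each value is a bare postcode district such as `SE19`; `postcode_area` is a hypothetical helper.

```python
import re

# One or two leading letters followed by a digit, e.g. 'SE' in 'SE19'
_AREA = re.compile(r'^([A-Z]{1,2})\d')


def postcode_area(district):
    """Return the leading letters of a postcode district, or None if it does not parse."""
    match = _AREA.match(district.strip().upper())
    return match.group(1) if match else None


districts = ['SE2', 'E1', 'EC1', 'SW4', 'N1']
east_only = [d for d in districts if postcode_area(d) == 'E']
print(east_only)
```

On the data frame, the equivalent filter could be written as `df_london['Postcode'].map(postcode_area) == 'SE'`.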


__Data 2__

We will now examine the demographics of these areas to estimate the proportions of the different nationalities living there. This part of London is very diverse, with a large Afro-Caribbean population.
This data will give us a better idea of what cuisine is likely to be found. For this, we are using demographic data from Wikipedia.

https://en.wikipedia.org/wiki/Demography_of_London


__Data 3__

Next, we will obtain location data for our targeted areas. To accomplish this, we will use a geocoder, which provides the latitude and longitude of each postcode.

We will create this data frame for our SE locations below:

In [55]:
# Geocoder starts here
# Define a helper, get_latlng(), that returns [latitude, longitude] for a postcode
def get_latlng(postal_code):

    # Initialise the coordinates to None
    lat_lng_coords = None

    # Keep querying until the geocoder returns coordinates.
    # Note: this loop never exits if a postcode cannot be geocoded.
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, London, United Kingdom'.format(postal_code))
        lat_lng_coords = g.latlng
    return lat_lng_coords

In [57]:
#This will return the co-ordinates for the SE2 postcode
sample = get_latlng('SE2')
sample

[51.492450000000076, 0.12127000000003818]
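
The `while` loop in `get_latlng` keeps retrying for as long as the geocoder returns nothing, so a postcode that can never be resolved would stall the notebook. A bounded variant might look like the sketch below. It is not the notebook's code: `lookup_with_retries`, `max_attempts` and `delay` are hypothetical names, and `lookup` stands in for whatever function performs the real geocoder call.

```python
import time


def lookup_with_retries(lookup, query, max_attempts=3, delay=0.0):
    """Call lookup(query) until it returns a non-None value or the attempts run out."""
    for attempt in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            time.sleep(delay)   # pause before the next attempt
    return None


# Stand-in for the geocoder: fails on the first call, then returns fixed coordinates
calls = {'count': 0}


def flaky_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 2 else [51.49245, 0.12127]


print(lookup_with_retries(flaky_lookup, 'SE2'))
```

The caller then has to decide what to do with a `None` result, for example dropping the row or logging the postcode for a manual check.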

In [59]:
#We now apply this to our entire data frame of SE post-codes

start = time.time()

postal_codes = df_se['Postcode']    
coordinates = [get_latlng(postal_code) for postal_code in postal_codes.tolist()]

end = time.time()
print("Time of execution: ", end - start, "seconds")

Time of execution:  50.5660560131073 seconds
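
The list comprehension sends one request per row, even when several rows share a postcode (SE2 belongs to both Abbey Wood and Crossness). Geocoding each distinct postcode once would likely shorten the run. The sketch below shows the idea with a stand-in lookup; `geocode_unique` and `fake_lookup` are hypothetical, and the coordinates are dummy values.

```python
def geocode_unique(postcodes, lookup):
    """Geocode each distinct postcode once and return coordinates in input order."""
    cache = {}
    for code in postcodes:
        if code not in cache:
            cache[code] = lookup(code)
    return [cache[code] for code in postcodes]


# Stand-in lookup that records every postcode it is asked for
seen = []


def fake_lookup(code):
    seen.append(code)
    return [0.0, 0.0]


coords = geocode_unique(['SE2', 'SE4', 'SE2', 'SE19'], fake_lookup)
print(len(coords), len(seen))
```

In the notebook, `get_latlng` could be passed as `lookup`, and the returned list lines up row for row with `df_se['Postcode']`.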


In [61]:
#Now we add the co-ordinates to our table
df_se_loc = df_se

# The obtained coordinates (latitude and longitude) are joined with the dataframe as shown
df_se_coordinates = pd.DataFrame(coordinates, columns = ['Latitude', 'Longitude'])
df_se_loc['Latitude'] = df_se_coordinates['Latitude']
df_se_loc['Longitude'] = df_se_coordinates['Longitude']

In [62]:
df_se_loc.head(5)

Unnamed: 0,Location,Borough,Postcode,Latitude,Longitude
0,Abbey Wood,"Bexley, Greenwich",SE2,51.49245,0.12127
1,Crofton Park,Lewisham,SE4,51.46268,-0.03558
2,Crossness,Bexley,SE2,51.49245,0.12127
3,Crystal Palace,Bromley,SE19,51.4199,-0.08808
4,Crystal Palace,Bromley,SE20,51.41009,-0.05683


__Data 4__

Finally, we will use Foursquare location data to explore the boroughs and postcode areas in more detail.