# Data Pre-processing Walkthrough: Geographic IDs

Our larger goal in the [Election Data Visualisation Walkthrough](Election-Data-Visualisation-Walkthrough.ipynb) notebook is to plot the results of the 2010 and 2015 General Elections as different colours on a map of Wales. To do so, however, we first need some pre-processing and manipulation to combine several datasets and align the ID values for the geographic regions. That is what we step through in this notebook.

Before starting this walkthrough, make sure you have followed the setup instructions in the [README.md](https://github.com/MangoTheCat/rss-2018-hackathon/blob/master/README.md) and have downloaded and extracted all the datasets into your data folder.

## Election Results Data

OK, let's begin by looking at our General Election results for Wales. We'll use the `pandas` library to read the CSV of the 2015 results.

In [1]:
import pandas as pd

In [2]:
ge2015 = pd.read_csv('../../data/election/ge_2015_results.csv')

In [3]:
ge2015.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 650 entries, 0 to 649
Columns: 148 entries, Press Association ID Number to Zeb
dtypes: float64(141), object(7)
memory usage: 751.6+ KB


In [4]:
ge2015.shape

(650, 148)

In [5]:
ge2015.head()

Unnamed: 0,Press Association ID Number,Constituency ID,Constituency Name,Constituency Type,County,Region ID,Region,Country,Election Year,Electorate,...,Wessex Reg,Whig,Wigan,Worth,WP,WRP,WVPTFP,Yorks,Young,Zeb
0,1.0,W07000049,Aberavon,County,West Glamorgan,W92000004,Wales,Wales,2015.0,49821.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2.0,W07000058,Aberconwy,County,Clwyd,W92000004,Wales,Wales,2015.0,45525.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.0,S14000001,Aberdeen North,Burgh,Scotland,S92000003,Scotland,Scotland,2015.0,67745.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4.0,S14000002,Aberdeen South,Burgh,Scotland,S92000003,Scotland,Scotland,2015.0,68056.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5.0,S14000058,Aberdeenshire West & Kincardine,County,Scotland,S92000003,Scotland,Scotland,2015.0,73445.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Then we subset the data to keep only the rows for Wales.

In [6]:
wales_ge2015 = ge2015.loc[ge2015.Region == 'Wales', :] 

So we can see we have two ID numbers here, `Press Association ID Number`, and `Constituency ID`. 

## Geographical Border Data

Next we look at the boundary data for Wales, which is available from the [ONS Open Geography Portal here](http://geoportal.statistics.gov.uk/datasets/national-assembly-for-wales-constituencies-december-2015-super-generalised-clipped-boundaries-in-wales). We are mostly concerned with the shapefile, which is included in the data folder in the repo at `data/geographic/Wales-Constituency-boundaries`.

Let's read this shapefile into a dataframe using the `geopandas` package. As the name suggests, `geopandas` behaves just like the `pandas` library in its dataframe manipulations, but it also supports geographic data types.

In [7]:
import geopandas as gpd

In [8]:
borders = gpd.read_file(
    '../../data/geographic/Wales-Constituency-boundaries/National_Assembly_for_Wales_Constituencies_December_2015_Super_Generalised_Clipped_Boundaries_in_Wales.shp'
)

Let’s take a look at some features of this borders dataframe before continuing.

In [9]:
borders.shape

(40, 6)

In [10]:
borders.head()

Unnamed: 0,objectid,nawc15cd,nawc15nm,st_areasha,st_lengths,geometry
0,1,W09000001,Ynys Mon,713430200.0,245438.947583,(POLYGON ((238892.2153000003 395248.7354000006...
1,2,W09000002,Arfon,409601600.0,112884.695237,"POLYGON ((265248.2999999998 356615.0999999996,..."
2,3,W09000003,Aberconwy,606410300.0,164784.269486,"POLYGON ((283201.0451999996 381406.0425000004,..."
3,4,W09000004,Clwyd West,925135800.0,200726.490374,"POLYGON ((300311.9201999996 379240.3910000008,..."
4,5,W09000005,Vale of Clwyd,215474800.0,107307.191139,"POLYGON ((314674.2999999998 365751.5999999996,..."


## Data Merging

We now have a very similar-looking ID associated with the border geometry data, in the `nawc15cd` column. Unfortunately, these IDs do not match the `Constituency ID` values in the election results.

In [11]:
len(set(borders.nawc15cd) - set(wales_ge2015["Constituency ID"]))

40
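
To see why none of the 40 codes match, it can help to compare the leading characters of the two ID columns. The sketch below uses a few codes copied from the outputs above rather than the full datasets, so the variable names are illustrative only.

```python
import pandas as pd

# A handful of codes copied from the outputs above (not the full datasets).
border_codes = pd.Series(["W09000001", "W09000002", "W09000003"])
result_codes = pd.Series(["W07000049", "W07000058", "W07000043"])

# Compare the three-character prefix of each code.
border_prefixes = set(border_codes.str[:3])
result_prefixes = set(result_codes.str[:3])

print(border_prefixes)                    # {'W09'}
print(result_prefixes)                    # {'W07'}
print(border_prefixes & result_prefixes)  # set()
```

The two columns use different prefixes (`W09` versus `W07`), so a direct join on the codes cannot succeed however the rows are ordered.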

So could we use the names instead?

In [12]:
set(wales_ge2015["Constituency Name"]) - set(borders.nawc15nm) 

{'Alyn & Deeside',
 'Brecon & Radnorshire',
 'Cardiff South & Penarth',
 'Carmarthen East & Dinefwr',
 'Carmarthen West & Pembrokeshire South',
 'Merthyr Tydfil & Rhymney',
 'Vale Of Clwyd',
 'Vale Of Glamorgan'}

In [13]:
set(borders.nawc15nm) - set(wales_ge2015["Constituency Name"])

{'Alyn and Deeside',
 'Brecon and Radnorshire',
 'Cardiff South and Penarth',
 'Carmarthen East and Dinefwr',
 'Carmarthen West and South Pembrokeshire',
 'Merthyr Tydfil and Rhymney',
 'Vale of Clwyd',
 'Vale of Glamorgan'}

Well, that's annoying! There are some discrepancies: the geospatial data uses 'and' instead of '&' and 'of' instead of 'Of', and it names one constituency 'South Pembrokeshire' rather than 'Pembrokeshire South'. As there are only a handful, we could clean these up manually to match each other, but that would get tedious fast if we wanted to do this for larger regions.
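
For illustration, here is a rough sketch of what rule-based name cleaning might look like, using three of the mismatched names from the outputs above. The `normalise` helper is hypothetical and not part of this notebook's pipeline.

```python
import pandas as pd

# Names copied from the two set differences above.
results_names = pd.Series([
    "Alyn & Deeside",
    "Vale Of Clwyd",
    "Carmarthen West & Pembrokeshire South",
])
border_names = pd.Series([
    "Alyn and Deeside",
    "Vale of Clwyd",
    "Carmarthen West and South Pembrokeshire",
])


def normalise(names):
    """Lower-case the names and spell out '&' as 'and' (hypothetical helper)."""
    return names.str.lower().str.replace("&", "and", regex=False)


# Names from the results that still have no partner after normalising.
unmatched = set(normalise(results_names)) - set(normalise(border_names))
print(unmatched)  # {'carmarthen west and pembrokeshire south'}
```

Simple rules deal with the '&' and capitalisation differences, but the reordered 'Pembrokeshire South' would still need a hand-written exception, which is the kind of upkeep that does not scale well.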

So let's try to use these IDs somehow.

## ONS Geographic IDs

The IDs used are provided by the ONS and follow a standard naming policy (e.g. codes beginning with `W` relate to Wales). However, the various geographical boundaries and hierarchies that make up the UK are far from straightforward and, to complicate things further, they can change over time as new boundaries are agreed. The end result is many codes, which may or may not map to the same region over time and at different levels. Great!

If you wish to know more about how geographical regions in the UK are organised, the ONS have a whole book on the subject at [this address](https://www.arcgis.com/sharing/rest/content/items/41381f11374d4e909f24dcb215f3d837/data).

So, back to our problem. Our goal is to find a mapping between the `nawc15cd` column in the shapefile data and the `Constituency ID` in the Wales GE results dataset. We have found that these codes don't match, even though both are valid ONS codes; they are just codes for different definitions of the same region.

Fortunately, the ONS have a [Code History Database](https://ons.maps.arcgis.com/home/item.html?id=a6a1247a7d8e45068011e8f482cdf3c5) (CHD) that stores all these ONS geographic codes, and we will use it to try to match our shapefile IDs to our General Election results. A copy of the database is included in the repo under `data/chd`. The main file we are interested in is `Equivalents.csv`.

In [14]:
code_lookup = pd.read_csv('../../data/chd/Equivalents.csv')

  interactivity=interactivity, compiler=compiler, result=result)


Don't worry about the warnings; they just mean that some columns have missing values or mixed data types. We are going to clean the data up a bit for our needs.
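
If the warning concerns mixed column types, one possible way to avoid it is to read every column as text. This is an assumption about the warning and not what the notebook does. The sketch below uses a made-up miniature CSV in place of the real `Equivalents.csv`.

```python
import io

import pandas as pd

# Hypothetical miniature of a CHD-style file; the real Equivalents.csv is far larger.
csv_text = "GEOGCD,GEOGCDO,YEAR\nE00096883,16UCFY0002,2009\nW07000049,,2015\n"

# dtype=str reads every column as text, so pandas does not need to infer types.
lookup = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(lookup.dtypes)
print(lookup["GEOGCDO"].isna().sum())  # 1 (the empty field is still read as missing)
```

Since only the code and name columns are used in what follows, reading everything as text would lose nothing here.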

In [15]:
code_lookup.head()

Unnamed: 0,GEOGCD,GEOGNM,GEOGNMW,GEOGCDO,GEOGNMO,GEOGCDD,GEOGNMD,GEOGCDH,GEOGNMH,GEOGCDS,...,GEOGCDI,GEOGNMI,GEOGCDWG,GEOGNMWG,GEOGNMWWG,OPER_DATE,TERM_DATE,ENTITYCD,YEAR,STATUS
0,E00096883,,,16UCFY0002,,,,,,,...,,,,,,01/01/2009,,E00,2009,live
1,E00096884,,,16UCFY0003,,,,,,,...,,,,,,01/01/2009,,E00,2009,live
2,E00096885,,,16UCFY0004,,,,,,,...,,,,,,01/01/2009,,E00,2009,live
3,E00096886,,,16UCFY0005,,,,,,,...,,,,,,01/01/2009,,E00,2009,live
4,E00096887,,,16UCFY0006,,,,,,,...,,,,,,01/01/2009,,E00,2009,live


From the supporting documents that came with the database, we find that the two most useful columns for us are:

* `GEOGCD` - The ONS code for the region.
* `GEOGNM` - The ONS name for the region.

We will keep just these two columns, drop rows with missing values and duplicate rows, and save the result to file for reference.

In [16]:
code_lookup = code_lookup[["GEOGCD", "GEOGNM"]]
code_lookup = code_lookup.dropna()
code_lookup = code_lookup.drop_duplicates()
code_lookup.to_csv('../../data/ONS_geographic_code_lookup.csv', index=False)

In [17]:
code_lookup.head()

Unnamed: 0,GEOGCD,GEOGNM
407,E32000001,Barnet and Camden
408,E32000002,Bexley and Bromley
409,E32000003,Brent and Harrow
410,E32000004,City and East
411,E32000005,Croydon and Sutton


## Merging based on Names

We saw before that the names were inconsistent, however they appear to be much more standard in the CHD, so our strategy is going to be to map each ONS code to a CHD name, and then merge based on these CHD names. 
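
The strategy can be sketched on a single constituency before running it on the real data. The toy frames below reuse values from the tables in this notebook, with the geometry column left out. The `_toy` names are illustrative only.

```python
import pandas as pd

# Toy rows taken from the tables in this notebook (geometry omitted).
borders_toy = pd.DataFrame({"nawc15cd": ["W09000007"], "nawc15nm": ["Alyn and Deeside"]})
results_toy = pd.DataFrame(
    {"Constituency ID": ["W07000043"], "Constituency Name": ["Alyn & Deeside"]}
)
lookup_toy = pd.DataFrame({
    "GEOGCD": ["W09000007", "W07000043"],
    "GEOGNM": ["Alyn and Deeside", "Alyn and Deeside"],
})

# Step 1: attach the CHD name to each dataset via its own ONS code.
borders_named = borders_toy.merge(
    lookup_toy, left_on="nawc15cd", right_on="GEOGCD", how="left"
)
results_named = results_toy.merge(
    lookup_toy, left_on="Constituency ID", right_on="GEOGCD", how="left"
)

# Step 2: join the two datasets on the shared CHD name.
linked = borders_named.merge(
    results_named, on="GEOGNM", how="left", suffixes=("_border", "_result")
)
print(linked[["nawc15cd", "Constituency ID", "GEOGNM"]])
```

The two codes never have to match each other directly; each only has to resolve to the same CHD name.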

### Geographic Borders Data

In [18]:
borders_CHD = pd.merge(borders, code_lookup, left_on="nawc15cd", right_on="GEOGCD", how='left')

In [19]:
borders_CHD.shape

(41, 8)

So it looks like the merge has added one extra row.

In [20]:
borders_CHD.tail()

Unnamed: 0,objectid,nawc15cd,nawc15nm,st_areasha,st_lengths,geometry,GEOGCD,GEOGNM
36,37,W09000044,Merthyr Tydfil and Rhymney,163232900.0,68163.471761,"POLYGON ((309746.2000000002 211439.4000000004,...",W09000044,Merthyr Tydfil and Rhymney
37,38,W09000045,Ogmore,203308600.0,74062.515019,"POLYGON ((291929.5999999996 195233.5999999996,...",W09000045,Ogmore
38,39,W09000046,Pontypridd,104997900.0,63401.569715,"POLYGON ((310058.2999999998 191003, 310887 188...",W09000046,Pontypridd
39,40,W09000047,Vale of Glamorgan,316438300.0,92698.37796,"POLYGON ((302260.7999999998 179532, 302783.900...",W09000047,The Vale of Glamorgan
40,40,W09000047,Vale of Glamorgan,316438300.0,92698.37796,"POLYGON ((302260.7999999998 179532, 302783.900...",W09000047,Vale of Glamorgan


This is because the CHD holds two names for the same code: `The Vale of Glamorgan` and `Vale of Glamorgan`. Not perfect, but having both in this table will be enough for our needs going forward. We clean up the merge by dropping the repeated code column and renaming the name column.
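
A row count that grows after a left merge usually means that some key appears more than once in the right-hand table. As a sketch on toy data (not the full lookup), repeated codes can be listed with `duplicated`, and pandas can be asked to refuse such a merge through the `validate` argument.

```python
import pandas as pd

# Toy lookup reproducing the duplicate seen above.
lookup_toy = pd.DataFrame({
    "GEOGCD": ["W09000046", "W09000047", "W09000047"],
    "GEOGNM": ["Pontypridd", "The Vale of Glamorgan", "Vale of Glamorgan"],
})
borders_toy = pd.DataFrame({"nawc15cd": ["W09000046", "W09000047"]})

# Codes that occur more than once in the lookup multiply rows in a left merge.
repeated = lookup_toy[lookup_toy.duplicated(subset="GEOGCD", keep=False)]
print(repeated)

# validate="many_to_one" makes pandas raise if the right-hand keys are not unique.
merge_rejected = False
try:
    pd.merge(
        borders_toy, lookup_toy,
        left_on="nawc15cd", right_on="GEOGCD",
        how="left", validate="many_to_one",
    )
except pd.errors.MergeError as err:
    merge_rejected = True
    print("Merge check failed:", err)
```

In this walkthrough the duplicate is kept on purpose, so the check is diagnostic and nothing needs fixing.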

In [21]:
borders_CHD = borders_CHD.drop("GEOGCD", axis=1)
borders_CHD = borders_CHD.rename(columns={'GEOGNM': 'CHD_Name'})

In [22]:
borders_CHD.shape

(41, 7)

### General Election Results

Good start. Now let's do the same with the General Election results data.

In [23]:
# Keep just the first 8 columns as these relate to regional information
wales_ge2015 = wales_ge2015[wales_ge2015.columns[:8]]
wales_ge2015.shape

(40, 8)

In [24]:
wales_ge2015_CDH = pd.merge(wales_ge2015, code_lookup, left_on="Constituency ID", right_on="GEOGCD", how='left')

In [25]:
wales_ge2015_CDH.shape

(41, 10)

In [26]:
wales_ge2015_CDH.head()

Unnamed: 0,Press Association ID Number,Constituency ID,Constituency Name,Constituency Type,County,Region ID,Region,Country,GEOGCD,GEOGNM
0,1.0,W07000049,Aberavon,County,West Glamorgan,W92000004,Wales,Wales,W07000049,Aberavon
1,2.0,W07000058,Aberconwy,County,Clwyd,W92000004,Wales,Wales,W07000058,Aberconwy
2,10.0,W07000043,Alyn & Deeside,County,Clwyd,W92000004,Wales,Wales,W07000043,Alyn and Deeside
3,16.0,W07000057,Arfon,County,Gwynedd,W92000004,Wales,Wales,W07000057,Arfon
4,71.0,W07000072,Blaenau Gwent,County,Gwent and Mid Glamorgan,W92000004,Wales,Wales,W07000072,Blaenau Gwent


In [27]:
wales_ge2015_CDH.tail()

Unnamed: 0,Press Association ID Number,Constituency ID,Constituency Name,Constituency Type,County,Region ID,Region,Country,GEOGCD,GEOGNM
36,588.0,W07000060,Vale Of Clwyd,County,Clwyd,W92000004,Wales,Wales,W07000060,Vale of Clwyd
37,589.0,W07000078,Vale Of Glamorgan,County,South Glamorgan,W92000004,Wales,Wales,W07000078,The Vale of Glamorgan
38,589.0,W07000078,Vale Of Glamorgan,County,South Glamorgan,W92000004,Wales,Wales,W07000078,Vale of Glamorgan
39,641.0,W07000044,Wrexham,County,Clwyd,W92000004,Wales,Wales,W07000044,Wrexham
40,647.0,W07000041,Ynys Mon,County,Gwynedd,W92000004,Wales,Wales,W07000041,Ynys Môn


Again we have the issue of `The Vale of Glamorgan` vs `Vale of Glamorgan`, but we have matched on both, and both might be needed downstream, so let's leave them for now.

In [28]:
wales_ge2015_CDH = wales_ge2015_CDH.drop("GEOGCD", axis=1)
wales_ge2015_CDH = wales_ge2015_CDH.rename(columns={'GEOGNM': 'CHD_Name'})

In [29]:
wales_ge2015_CDH.shape

(41, 9)

In [30]:
wales_ge2015_CDH.dropna().shape

(41, 9)

### Final Merged Dataset

Now we are ready to check whether these names, which came from `GEOGNM` and now sit in `CHD_Name`, can be used to match our shapefiles to the Press Association ID Number.

In [31]:
wales_region_data = pd.merge(borders_CHD, wales_ge2015_CDH, on="CHD_Name", how='left')
wales_region_data

Unnamed: 0,objectid,nawc15cd,nawc15nm,st_areasha,st_lengths,geometry,CHD_Name,Press Association ID Number,Constituency ID,Constituency Name,Constituency Type,County,Region ID,Region,Country
0,1,W09000001,Ynys Mon,713430200.0,245438.947583,(POLYGON ((238892.2153000003 395248.7354000006...,Ynys Môn,647.0,W07000041,Ynys Mon,County,Gwynedd,W92000004,Wales,Wales
1,2,W09000002,Arfon,409601600.0,112884.695237,"POLYGON ((265248.2999999998 356615.0999999996,...",Arfon,16.0,W07000057,Arfon,County,Gwynedd,W92000004,Wales,Wales
2,3,W09000003,Aberconwy,606410300.0,164784.269486,"POLYGON ((283201.0451999996 381406.0425000004,...",Aberconwy,2.0,W07000058,Aberconwy,County,Clwyd,W92000004,Wales,Wales
3,4,W09000004,Clwyd West,925135800.0,200726.490374,"POLYGON ((300311.9201999996 379240.3910000008,...",Clwyd West,155.0,W07000059,Clwyd West,County,Clwyd,W92000004,Wales,Wales
4,5,W09000005,Vale of Clwyd,215474800.0,107307.191139,"POLYGON ((314674.2999999998 365751.5999999996,...",Vale of Clwyd,588.0,W07000060,Vale Of Clwyd,County,Clwyd,W92000004,Wales,Wales
5,6,W09000006,Delyn,283314000.0,121719.631672,(POLYGON ((312926.5999999996 385120.6999999993...,Delyn,179.0,W07000042,Delyn,County,Clwyd,W92000004,Wales,Wales
6,7,W09000007,Alyn and Deeside,154491200.0,93966.421338,"POLYGON ((335613 359943.3000000007, 335032.200...",Alyn and Deeside,10.0,W07000043,Alyn & Deeside,County,Clwyd,W92000004,Wales,Wales
7,8,W09000008,Wrexham,103538800.0,58646.889926,"POLYGON ((335613 359943.3000000007, 336736.099...",Wrexham,641.0,W07000044,Wrexham,County,Clwyd,W92000004,Wales,Wales
8,9,W09000009,Clwyd South,622386100.0,178322.75597,"POLYGON ((342307.0999999996 349382.3000000007,...",Clwyd South,154.0,W07000062,Clwyd South,County,Clwyd,W92000004,Wales,Wales
9,14,W09000015,Carmarthen East and Dinefwr,1555839000.0,271230.844918,"(POLYGON ((275309.2000000002 248901, 276034.5 ...",Carmarthen East and Dinefwr,131.0,W07000067,Carmarthen East & Dinefwr,County,Dyfed,W92000004,Wales,Wales


In [32]:
wales_region_data.shape

(41, 15)

In [33]:
wales_region_data.dropna().shape

(41, 15)
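
The `dropna()` comparison above confirms that every border row found a partner. Another way to check, sketched here on toy data, is the `indicator` argument of `pd.merge`, which labels each row by where it came from. The name "Nowhere" is made up to show what a failed match would look like.

```python
import pandas as pd

# Toy frames; "Nowhere" is a hypothetical name with no election result.
borders_toy = pd.DataFrame({"CHD_Name": ["Arfon", "Aberconwy", "Nowhere"]})
results_toy = pd.DataFrame({
    "CHD_Name": ["Arfon", "Aberconwy"],
    "Press Association ID Number": [16.0, 2.0],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
checked = pd.merge(borders_toy, results_toy, on="CHD_Name", how="left", indicator=True)

unmatched = checked.loc[checked["_merge"] == "left_only", "CHD_Name"].tolist()
print(unmatched)  # ['Nowhere']
```

An empty `unmatched` list on the real data would say the same thing as the two identical shapes above.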

Nice, this has worked. What's left is to clean up the columns in the resulting dataframe, as there is quite a bit of duplication, and to save the result for later reference.

The key bits of information we want to keep are:

* `nawc15cd`: the code from the shapefile
* `Press Association ID Number`: from the published general election results
* `Constituency ID`: the ONS code, matched against the `GEOGCD` column
* `CHD_Name`: the constituency name, taken from the ONS name column `GEOGNM`

In [34]:
wales_region_data = wales_region_data.filter(['nawc15cd', 'Press Association ID Number', 'Constituency ID', 'CHD_Name'])

In [35]:
wales_region_data = wales_region_data.sort_values("CHD_Name")

In [36]:
wales_region_data.head()

Unnamed: 0,nawc15cd,Press Association ID Number,Constituency ID,CHD_Name
20,W09000022,1.0,W07000049,Aberavon
2,W09000003,2.0,W07000058,Aberconwy
6,W09000007,10.0,W07000043,Alyn and Deeside
1,W09000002,16.0,W07000057,Arfon
30,W09000038,71.0,W07000072,Blaenau Gwent


Now we are ready to save the region data for future use in downstream analysis. 

In [37]:
wales_region_data.to_csv('../../data/geographic/wales_region_data.csv', index=False)

In [38]:
wales_region_data

Unnamed: 0,nawc15cd,Press Association ID Number,Constituency ID,CHD_Name
20,W09000022,1.0,W07000049,Aberavon
2,W09000003,2.0,W07000058,Aberconwy
6,W09000007,10.0,W07000043,Alyn and Deeside
1,W09000002,16.0,W07000057,Arfon
30,W09000038,71.0,W07000072,Blaenau Gwent
33,W09000041,89.0,W07000068,Brecon and Radnorshire
21,W09000023,94.0,W07000073,Bridgend
27,W09000035,114.0,W07000076,Caerphilly
25,W09000031,126.0,W07000050,Cardiff Central
34,W09000042,127.0,W07000051,Cardiff North


# Extensions

Now that you have followed this walkthrough to get you going, try any (or all) of the following ideas for yourself:

* Repeat a similar process for the Scotland/England data, or for the whole of Great Britain. You will need the relevant geospatial data from the [ONS geographic portal](http://geoportal.statistics.gov.uk). 
* Search for other data sources that may be useful to include for downstream visualisation or prediction of the general election, such as:
    * Amount of party spending per constituency.
    * Length of time the MP has represented the constituency. 
    * Anything else you can think of, be creative!

**Good Luck and Have Fun!**