# Modeling Walkthrough - Predicting the Winning Parties of the 2015 General Elections

In this notebook, I am going to take you through the key steps involved in fitting a machine learning model and using it to make predictions. In this case, we are going to use data from the 2010 general election to predict which party will win each constituency in the 2015 general election. To keep it simple, we are going to use a Decision Tree Classifier.

First, we import two things: **pandas**, and **DecisionTreeClassifier** from **sklearn**. We are going to use pandas to manipulate our data and then fit a DecisionTreeClassifier model to produce some predictions.

In [1]:
# Setup
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Turn off pandas' SettingWithCopyWarning
pd.options.mode.chained_assignment = None
# Show every column when displaying a DataFrame
pd.set_option('display.max_columns', None)

The first thing we have to do is load the data that will feed our model and look at some of its summary characteristics.

In [2]:
# pandas reads each CSV file into a DataFrame
ge_10 = pd.read_csv("../RSS-hackathon/data/ge_2010_results.csv")
ge_15 = pd.read_csv("../RSS-hackathon/data/ge_2015_results.csv")

# Use .head() to look at what kind of data we have.
ge_10.head(2)


Unnamed: 0,Press Association Reference,Constituency Name,Region,Election Year,Electorate,Votes,AC,AD,AGS,APNI,APP,AWL,AWP,BB,BCP,Bean,Best,BGPV,BIB,BIC,Blue,BNP,BP Elvis,C28,Cam Soc,CG,Ch M,Ch P,CIP,CITY,CNPG,Comm,Comm L,Con,Cor D,CPA,CSP,CTDP,CURE,D Lab,D Nat,DDP,DUP,ED,EIP,EPA,FAWG,FDP,FFR,Grn,GSOT,Hum,ICHC,IEAC,IFED,ILEU,Impact,Ind1,Ind2,Ind3,Ind4,Ind5,IPT,ISGB,ISQM,IUK,IVH,IZB,JAC,Joy,JP,Lab,Land,LD,Lib,Libert,LIND,LLPB,LTT,MACI,MCP,MEDI,MEP,MIF,MK,MPEA,MRLP,MRP,Nat Lib,NCDV,ND,New,NF,NFP,NICF,Nobody,NSPS,PBP,PC,Pirate,PNDP,Poet,PPBF,PPE,PPNV,Reform,Respect,Rest,RRG,RTBP,SACL,Sci,SDLP,SEP,SF,SIG,SJP,SKGP,SMA,SMRA,SNP,Soc,Soc Alt,Soc Dem,Soc Lab,South,Speaker,SSP,TF,TOC,Trust,TUSC,TUV,UCUNF,UKIP,UPS,UV,VCCA,Vote,Wessex Reg,WRP,You,Youth,YRDPL
0,1.0,Aberavon,Wales,2010.0,50838.0,30958,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,558.0,0.0,0.0,0.0,0.0,0.0,1276.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4411.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,919.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16073.0,0.0,5034.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2198.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,489.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2.0,Aberconwy,Wales,2010.0,44593.0,29966,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,137.0,0.0,0.0,0.0,0.0,0.0,10734.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7336.0,0.0,5786.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5341.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,632.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [3]:
ge_15.head(2)

Unnamed: 0,Press Association ID Number,Constituency ID,Constituency Name,Constituency Type,County,Region ID,Region,Country,Election Year,Electorate,Valid Votes,30-50,Above,Active Dem,AD,Alliance,AP,Apni,Atom,AWP,Beer BS,Birthday,BNP,Bournemouth,Bristol,Brit Dem,Brit Ind,C,Campaign,Change,Ch M,Ch P,Christian,Class War,Comm,Comm Brit,Comm Lge,Communist,Community,Consensus,CPA,Croydon,CSA,CSP,Dem Ref,Digital,DP,DUP,Eccentric,Elmo,Eng Dem,EP,FPT,Green,Green Soc,Guildford,Hoi,Hospital,Humanity,IASI,IE,Ind,Ind2,Ind CHC,IPAP,ISWSL,IZB,JACP,JMB,Lab,Lab Co-op,LD,Lib,Lib GB,Lincs Ind,Loony,LP,LU,Magna Carta,Mainstream,Manston,Meb Ker,Nat Lib,ND,NE Party,New IC,NF,NHAP,Northern,Patria,PBP,PC,Peace,PF,Pilgrim,Pirate,Plural,Poole,PPP,PP UK,PSP,Real,Realist,Reality,Rep Soc,Respect,Restore,RFAC,Rochdale,Roman,RTP,Scottish CP,SCP,SDLP,SEP,SF,S New,SNP,Soc Dem,Soc Lab,Song,Southport,Speaker,SPGB,SSP,TEP,Thanet,TSPP,TUSC,TUV,Ubuntu,UKIP,UKPDP,U Party,Uttlesford,UUP,Vapers,VAT,Wessex Reg,Whig,Wigan,Worth,WP,WRP,WVPTFP,Yorks,Young,Zeb
0,1.0,W07000049,Aberavon,County,West Glamorgan,W92000004,Wales,Wales,2015.0,49821.0,31523.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3742.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,711.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1137.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15416.0,0.0,1397.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3663.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,352.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,134.0,0.0,0.0,4971.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2.0,W07000058,Aberconwy,County,Clwyd,W92000004,Wales,Wales,2015.0,45525.0,30148.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12513.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,727.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8514.0,0.0,1391.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3536.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3467.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


From this, we can see that there are a lot of columns, which fall into two groups: first, the columns that give us generic information about the constituency and the votes cast; second, a large number of columns specifying how the votes were split between the parties.
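To make the two groups concrete, here is a minimal sketch on a made-up DataFrame. The frame `toy`, its values and the list `info_cols` are assumptions for illustration only; they are not the real election data.

```python
import pandas as pd

# Toy stand-in for ge_10; every value here is invented for illustration.
toy = pd.DataFrame({
    "Constituency Name": ["A", "B"],
    "Region": ["Wales", "London"],
    "Electorate": [50000, 60000],
    "Votes": [30000, 40000],
    "Lab": [16000, 10000],
    "Con": [4000, 25000],
    "LD": [10000, 5000],
})

# Assumption: these are the generic (non-party) columns of the toy frame.
info_cols = ["Constituency Name", "Region", "Electorate", "Votes"]

# Everything that is not generic info is treated as a per-party vote column.
party_cols = [c for c in toy.columns if c not in info_cols]

print(toy.shape)               # (number of rows, number of columns)
print(party_cols)              # the per-party vote columns
print(toy[party_cols].sum())   # total votes per party across all rows
```

The same pattern could be applied to `ge_10` and `ge_15` once the generic columns have been listed by hand.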

## Feature Engineering

According to Kaggle, the online platform for machine-learning competitions, good feature engineering is usually the defining characteristic of winning entries. As such, we are going to spend some time on it here.

Some members of the Labour party are also members of the Labour Co-operative party, and the 2015 data records their votes in a separate *Lab Co-op* column. When we get to fitting the model, this split could distort the fit and reduce the accuracy of our model. So we are going to merge the two columns in the 2015 data (there is no column for Labour Co-operative in the 2010 data).

In [4]:
# Merge Labour and Labour Co-operative party
ge_15['Lab'] = ge_15['Lab'] + ge_15['Lab Co-op']
del ge_15['Lab Co-op']
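A plain `+` returns a missing value wherever either column is missing. Whether the real data contains missing values is not checked here, so the following is only a defensive sketch on made-up numbers, using `Series.add` with `fill_value=0` so that a missing entry in one column counts as zero.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; the numbers are invented for illustration.
toy = pd.DataFrame({
    "Lab": [100.0, 200.0, np.nan],
    "Lab Co-op": [50.0, np.nan, 30.0],
})

# Treat a missing value in one column as zero when adding the two columns.
toy["Lab"] = toy["Lab"].add(toy["Lab Co-op"], fill_value=0)

# Drop the column that has been merged in.
toy = toy.drop(columns="Lab Co-op")

print(toy)
```

Note that if both columns are missing in the same row, the result for that row is still missing.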

There are a **lot** of columns in the DataFrames containing our election data. In order to make feature selection easier, we are going to reduce the columns by keeping the main parties and then summing the votes from all of the other parties into a column which we will call *Other*.

In [5]:
# Specify column names of the main parties that we are reducing the columns to
main_parties_10 = ['Lab', 'Con', 'LD', 'Grn', 'UKIP', 'SNP', 'PC']
main_parties_15 = ['Lab', 'C', 'LD', 'Green', 'UKIP', 'SNP', 'PC']

# Specify column names of the generic info that we are including in the new DataFrame 
non_parties_10 = ['Press Association Reference', 'Constituency Name', 'Region', 'Election Year', 'Electorate', 'Votes']
non_parties_15 = ['Press Association ID Number', 'Constituency ID', 'Constituency Name', 'Constituency Type', 'County', 'Region ID',
               'Region', 'Country', 'Election Year', 'Electorate', 'Valid Votes']

# Copy all of the columns that we are keeping into new DataFrames
ge_10_main = ge_10[non_parties_10 + main_parties_10].copy()
ge_15_main = ge_15[non_parties_15 + main_parties_15].copy()

# All of the columns that we have left behind... and need to sum across to get our 'Other' column
other_cols_10 = ge_10.columns.difference(ge_10_main.columns)
other_cols_15 = ge_15.columns.difference(ge_15_main.columns)

# Add a column for all of the parties that weren't included in the main parties list
ge_10_main['Other'] = ge_10[other_cols_10].sum(axis = 1)
ge_15_main['Other'] = ge_15[other_cols_15].sum(axis = 1)
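A quick way to check this kind of column reduction is to confirm that the main parties plus *Other* still add up to the total votes. The sketch below does this on a made-up DataFrame; the party names `Bean` and `Elvis` and all numbers are placeholders, not real results.

```python
import pandas as pd

# Toy data: two main parties and two minor ones (values invented).
toy = pd.DataFrame({
    "Constituency Name": ["A", "B"],
    "Votes": [100, 200],
    "Lab": [60, 50],
    "Con": [30, 120],
    "Bean": [10, 0],
    "Elvis": [0, 30],
})
info_cols = ["Constituency Name", "Votes"]
main_parties = ["Lab", "Con"]

# Keep generic info and main parties, then sum everything else into 'Other'.
toy_main = toy[info_cols + main_parties].copy()
other_cols = toy.columns.difference(toy_main.columns)
toy_main["Other"] = toy[other_cols].sum(axis=1)

# Sanity check: main parties + Other should equal the total votes per row.
party_total = toy_main[main_parties + ["Other"]].sum(axis=1)
matches_votes = party_total.eq(toy_main["Votes"])
print(matches_votes.all())
```

In the outputs shown in this notebook, the 2010 Aberavon row passes the same check: its party columns and *Other* sum to the 30958 in *Votes*.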

Next we want to add some of the columns from the 2015 DataFrame to the 2010 DataFrame, but first we must check that both DataFrames list the constituencies in the same order.

In [6]:
# Checking if there are any inconsistencies in 'Constituency Name'
len(ge_10[ge_10['Constituency Name'] != ge_15['Constituency Name']]['Constituency Name'])

20

It seems that there are some inconsistencies between them. Let's have a look at them to see how we can fix this.

In [7]:
# Check if name inconsistencies are because of different formatting or actually different Constituencies
name_diff_10 = ge_10[ge_10['Constituency Name'] != ge_15['Constituency Name']]['Constituency Name']
name_diff_15 = ge_15[ge_10['Constituency Name'] != ge_15['Constituency Name']]['Constituency Name']

pd.DataFrame([name_diff_10, name_diff_15])


Unnamed: 0,49,142,150,213,253,332,406,412,413,414,415,428,435,449,536,537,538,542,587,588
Constituency Name,Berwick-upon-Tweed,"Chester, City of",Cities of London & Westminster,"Durham, City of",Forest of Dean,Isle of Wight,Na h-Eileanan an Iar (Western Isles),Newcastle-under-Lyme,Newcastle upon Tyne Central,Newcastle upon Tyne East,Newcastle upon Tyne North,Northamptonshire South,Ochil & South Perthshire,Perth & North Perthshire,Stoke-on-Trent Central,Stoke-on-Trent North,Stoke-on-Trent South,Stratford-on-Avon,Vale of Clwyd,Vale of Glamorgan
Constituency Name,Berwick-Upon-Tweed,"Chester, City Of",Cities Of London & Westminster,"Durham, City Of",Forest Of Dean,Isle Of Wight,Na H-Eileanan An Iar,Newcastle-Under-Lyme,Newcastle Upon Tyne Central,Newcastle Upon Tyne East,Newcastle Upon Tyne North,Northamptonshire South,Ochil & Perthshire South,Perth & Perthshire North,Stoke-On-Trent Central,Stoke-On-Trent North,Stoke-On-Trent South,Stratford-On-Avon,Vale Of Clwyd,Vale Of Glamorgan


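The inconsistencies come only from differences in formatting (capitalisation, word order and a bracketed alternative name), not from different constituencies. This means that the rows of both DataFrames are in the same order, so we can add columns from the 2015 DataFrame to the 2010 DataFrame!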
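Rather than comparing the names by eye, the check can be automated by normalising both spellings before comparing them. The helper `normalise_name` below is hypothetical and deliberately blunt: it lower-cases the name, drops bracketed text and punctuation, and sorts the words, so it would also hide genuine word-order differences. Three of the name pairs from the table above are used as sample input.

```python
import re
import pandas as pd

def normalise_name(name):
    """Lower-case, drop bracketed text and punctuation, and sort the words."""
    name = re.sub(r"\(.*?\)", " ", name.lower())   # drop e.g. "(Western Isles)"
    words = re.findall(r"[a-z]+", name)            # keep alphabetic words only
    return " ".join(sorted(words))

# Sample pairs copied from the comparison table above.
names_10 = pd.Series(["Berwick-upon-Tweed",
                      "Ochil & South Perthshire",
                      "Na h-Eileanan an Iar (Western Isles)"])
names_15 = pd.Series(["Berwick-Upon-Tweed",
                      "Ochil & Perthshire South",
                      "Na H-Eileanan An Iar"])

# Count the names that still differ after normalisation.
still_different = names_10.map(normalise_name) != names_15.map(normalise_name)
print(still_different.sum())
```

Applied to the full `Constituency Name` columns, a count of zero would support the conclusion that only the formatting differs.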

In [8]:
# Add generic info columns from the 2015 data; these did not change between the two datasets
ge_10_main['Constituency Type'] = ge_15['Constituency Type']
ge_10_main['Country'] = ge_15['Country']

# Rename the columns to be consistent with 2010 DataFrame
ge_15_main.rename(columns = {'Press Association ID Number' : 'Press Association Reference', 'C' : 'Con', 'Green' : 'Grn', 'Valid Votes' : 'Votes' }, inplace = True)

# Get the winning proportion of votes and add it as a column in the DataFrames
seats_10 = ge_10_main[main_parties_10 + ['Other']]
seats_15 = ge_15_main[main_parties_10 + ['Other']]
ge_10_main['Highest Vote Prop'] = seats_10.max(axis = 1)/ge_10_main['Votes']
ge_15_main['Highest Vote Prop'] = seats_15.max(axis = 1)/ge_15_main['Votes']

ge_10_main.head(2)

Unnamed: 0,Press Association Reference,Constituency Name,Region,Election Year,Electorate,Votes,Lab,Con,LD,Grn,UKIP,SNP,PC,Other,Constituency Type,Country,Highest Vote Prop
0,1.0,Aberavon,Wales,2010.0,50838.0,30958,16073.0,4411.0,5034.0,0.0,489.0,0.0,2198.0,2753.0,County,Wales,0.519187
1,2.0,Aberconwy,Wales,2010.0,44593.0,29966,7336.0,10734.0,5786.0,0.0,632.0,0.0,5341.0,137.0,County,Wales,0.358206


In [9]:
ge_15_main.head(2)

Unnamed: 0,Press Association Reference,Constituency ID,Constituency Name,Constituency Type,County,Region ID,Region,Country,Election Year,Electorate,Votes,Lab,Con,LD,Grn,UKIP,SNP,PC,Other,Highest Vote Prop
0,1.0,W07000049,Aberavon,County,West Glamorgan,W92000004,Wales,Wales,2015.0,49821.0,31523.0,15416.0,3742.0,1397.0,711.0,4971.0,0.0,3663.0,1623.0,0.48904
1,2.0,W07000058,Aberconwy,County,Clwyd,W92000004,Wales,Wales,2015.0,45525.0,30148.0,8514.0,12513.0,1391.0,727.0,3467.0,0.0,3536.0,0.0,0.415052


For this walkthrough, we are going to choose the features using our intuition. However, there is more information on various methods to choose features [here](http://scikit-learn.org/stable/modules/feature_selection.html) and [here](https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/). The features we have chosen are *Constituency Type*, *Region*, *Highest Vote Proportion*, *Electorate* and *Votes cast*.  

#### One Hot Encoding

Many machine learning algorithms cannot process categorical variables directly. Hence, we create a new column for each category that the variable can take and, for each observation, set its value to 1 if the observation falls in that category and 0 otherwise. We then feed these new columns into the model as features.

This process is called **One Hot Encoding**. 

We are going to do this for *Region* and *Constituency Type*, the only categorical variables among our predictors, and for the *Winning Party* of each constituency, which is the target we want to predict.
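As a small illustration of the encoding itself, here is `pd.get_dummies` applied to a made-up column of four regions. The series `regions` is invented for this example.

```python
import pandas as pd

# Toy categorical column (values invented for illustration).
regions = pd.Series(["Wales", "London", "Wales", "Scotland"], name="Region")

# One new column per category, in alphabetical order of the category names.
one_hot = pd.get_dummies(regions)

print(one_hot.columns.tolist())
# Cast to int so the table shows 1/0 regardless of the pandas version's default dtype.
print(one_hot.astype(int))
```

Each row has exactly one 1, in the column matching its original category.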

In [10]:
# Find the winning party of each constituency
winner_10 = seats_10.idxmax(axis = 1)
winner_15 = seats_15.idxmax(axis = 1)

# Apply One Hot Encoding to Constituency Type, Region and the Winning Party in each constituency
one_hot_const_type_10 = pd.get_dummies(ge_10_main['Constituency Type'])
one_hot_region_10 = pd.get_dummies(ge_10_main['Region'])
y_10 = pd.get_dummies(winner_10)

one_hot_const_type_15 = pd.get_dummies(ge_15_main['Constituency Type'])
one_hot_region_15 = pd.get_dummies(ge_15_main['Region'])

In [11]:
# Create a DataFrame to hold all of the predictors and the Constituency Name
predictors_10 = ge_10_main[['Constituency Name', 'Electorate', 'Votes', 'Highest Vote Prop']]
predictors_10 = predictors_10.join([one_hot_const_type_10, one_hot_region_10])

predictors_15 = ge_15_main[['Constituency Name', 'Electorate', 'Votes', 'Highest Vote Prop']]
predictors_15 = predictors_15.join([one_hot_const_type_15, one_hot_region_15])

In [12]:
predictors_10.head(3)

Unnamed: 0,Constituency Name,Electorate,Votes,Highest Vote Prop,Borough,Burgh,County,East Midlands,Eastern,London,North East,North West,Northern Ireland,Scotland,South East,South West,Wales,West Midlands,Yorkshire and the Humber
0,Aberavon,50838.0,30958,0.519187,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
1,Aberconwy,44593.0,29966,0.358206,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
2,Aberdeen North,64808.0,37701,0.444179,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0


In [13]:
predictors_15.head(3)

Unnamed: 0,Constituency Name,Electorate,Votes,Highest Vote Prop,Borough,Burgh,County,East,East Midlands,London,North East,North West,Northern Ireland,Scotland,South East,South West,Wales,West Midlands,Yorkshire and The Humber
0,Aberavon,49821.0,31523.0,0.48904,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
1,Aberconwy,45525.0,30148.0,0.415052,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
2,Aberdeen North,67745.0,43936.0,0.564298,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0


We have ended up with a DataFrame of all of our predictors and Constituency Name. This is what we are going to use to fit our model and make predictions.
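Comparing the two tables above, the 2010 data has regions labelled *Eastern* and *Yorkshire and the Humber*, while the 2015 data has *East* and *Yorkshire and The Humber*. Because `pd.get_dummies` orders its columns alphabetically, *East* sorts before *East Midlands* but *Eastern* sorts after it, so those two columns swap positions between the years. A model fitted on one layout and applied to the other may therefore mix up those regions. The sketch below shows one possible way to align the columns; the mapping `relabel_15` is an assumption that the differently spelled labels refer to the same regions.

```python
import pandas as pd

# Toy region columns using the labels seen in the two tables above.
region_10 = pd.Series(["Eastern", "East Midlands", "Yorkshire and the Humber"])
region_15 = pd.Series(["East", "East Midlands", "Yorkshire and The Humber"])

# Assumption: these 2015 labels mean the same regions as the 2010 labels.
relabel_15 = {
    "East": "Eastern",
    "Yorkshire and The Humber": "Yorkshire and the Humber",
}

dummies_10 = pd.get_dummies(region_10)
dummies_15 = pd.get_dummies(region_15.replace(relabel_15))

# Force the 2015 columns into the 2010 order; a category missing
# from 2015 would become an all-zero column.
dummies_15 = dummies_15.reindex(columns=dummies_10.columns, fill_value=0)

print(dummies_10.columns.equals(dummies_15.columns))
```

Applying the same relabelling and `reindex` step to the real predictors would change the inputs to the model, so the accuracy reported later in this notebook would not necessarily stay the same.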

## Making Predictions

In this section we are going to fit our model, make predictions and measure how accurate those predictions are by comparing them with the actual results.

#### Fitting the model

In [32]:
# Fit the model
model = DecisionTreeClassifier()
model.fit(predictors_10[predictors_10.columns[1:]], y_10)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

#### Making predictions

In [33]:
# Make the predictions
k = model.predict(predictors_15[predictors_15.columns[1:]])

The output given by the model's .predict() method is in the same format as y_10 (one hot encoded data). To make it easier to read and compare, we convert it back to the name of the winning party.

In [34]:
k2 = pd.DataFrame(k, columns = y_10.columns)
k3 = k2.idxmax(axis = 1)
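As an alternative to one hot encoding the target, scikit-learn classifiers also accept a one-dimensional array of string labels, in which case `.predict()` returns party names directly and no `idxmax` step is needed. It also avoids the case where a one hot prediction row could, in principle, contain no 1 at all. This is a minimal sketch on made-up data, not the approach used in this notebook; `X_train`, `y_train` and `toy_model` are illustrative names.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy predictors and string labels (values invented for illustration).
X_train = pd.DataFrame({
    "Electorate": [50000, 60000, 70000, 80000],
    "Highest Vote Prop": [0.55, 0.40, 0.60, 0.35],
})
y_train = pd.Series(["Lab", "Con", "Lab", "Con"])

# Fit directly on the party names instead of a one hot encoded target.
toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit(X_train, y_train)

# The predictions come back as party names.
preds = pd.Series(toy_model.predict(X_train), index=X_train.index)
print(preds.tolist())
```

In this notebook the equivalent change would be to fit on `winner_10` rather than `y_10`, which may give different predictions from those reported here.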

#### Model Validation

Finally, we can compare what our model has predicted to the real outcome. This is important because it lets us see if our model fit needs improvement.

In [36]:
# Compute percentage of accurate predictions in 2015 data
ge_15_valid = predictors_15.copy()
ge_15_valid['Winner'] = winner_15

acc = len(ge_15_valid[ge_15_valid['Winner'] == k3])/650

print("Our model was {0:.2%} accurate!".format(acc))

Our model was 69.54% accurate!
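A single accuracy figure does not show which parties the model gets wrong. One way to look closer is a cross-tabulation of actual against predicted winners. The sketch below uses made-up labels; `actual` and `predicted` stand in for `winner_15` and `k3`.

```python
import pandas as pd

# Toy actual and predicted winners (invented for illustration).
actual = pd.Series(["Lab", "Con", "Con", "SNP", "Lab"], name="Actual")
predicted = pd.Series(["Lab", "Con", "Lab", "Lab", "Lab"], name="Predicted")

# Share of rows where the prediction matches the actual winner.
accuracy = (actual == predicted).mean()

# Rows are actual winners, columns are predicted winners.
table = pd.crosstab(actual, predicted)

print("{0:.2%}".format(accuracy))
print(table)
```

Off-diagonal cells show the mistakes: in this toy example one Con seat and one SNP seat were both predicted as Lab.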


It's a great start but it looks like our model could do with some work... Now it's over to you! 

You can fork this notebook and pick up where we left off by trying out some new models, or improve this one.
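If you want to compare a few models, cross-validation gives a fairer comparison than a single fit. The sketch below is only a starting point: it uses synthetic data from `make_classification` rather than the election data, and the two candidate models and their settings are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with your own predictors and labels.
X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=4, random_state=0)

# Candidate models to compare (settings chosen arbitrarily).
candidates = {
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each candidate.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}

for name, score in scores.items():
    print(name, "{0:.2%}".format(score))
```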

Or you could choose something else to predict, for example using the model_2015 dataset to predict how people voted in the referendum.

The possibilities are endless!