I have downloaded UK election results from 1918 (widely considered the first in the modern democratic era) to 2019.

### Importing dependencies

In [1]:
import pandas as pd

## Initial Look at the Data

Let's start by loading the election data into a DataFrame:

In [2]:
def load_election_results_csv():
    return pd.read_csv("1918-2019election_results.csv", encoding="ISO-8859-1")
df = load_election_results_csv()

In [3]:
df.columns

Index(['constituency_id', 'seats', 'constituency_name', 'country/region',
       'electorate', 'con_votes', 'con_share', 'lib_votes ', 'lib_share',
       'lab_votes', 'lab_share', 'natSW_votes', 'natSW_share', 'oth_votes',
       'oth_share', 'total_votes', 'turnout ', 'election', 'boundary_set',
       'Unnamed: 19'],
      dtype='object')

Most of these are self-explanatory. **con_votes**, **lib_votes** and **lab_votes** refer to the number of votes received for each of the three large parties of the past century: Conservative, Liberal and Labour. **natSW_votes** refers to Scottish and Welsh nationalist parties, while **oth_votes** totals all votes for all other parties.

### Fixing Data Weirdness

#### Whitespace in Field Names

Oddly, **lib_votes** and **turnout** have trailing whitespace in their column names (they appear above as `'lib_votes '` and `'turnout '`). They're the only field names affected, so let's fix them to avoid errors later:

In [4]:
df.rename(columns={"lib_votes ": "lib_votes", "turnout ": "turnout"}, inplace=True)
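A more general alternative, sketched here on a small made-up frame rather than the real file, is to strip whitespace from every column name at once, so that no offender has to be spotted by eye. The `sample` frame and its values are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical miniature frame reproducing the trailing-space problem.
sample = pd.DataFrame(
    {"lib_votes ": [10318, 6778], "turnout ": [0.642, 0.652], "seats": [1, 1]}
)

# Index.str.strip() removes leading and trailing whitespace from each name.
sample.columns = sample.columns.str.strip()

cleaned_columns = list(sample.columns)
print(cleaned_columns)
```

This leaves names without stray whitespace untouched, so it should be safe to run on the whole frame.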

In [5]:
df.head()

Unnamed: 0,constituency_id,seats,constituency_name,country/region,electorate,con_votes,con_share,lib_votes,lib_share,lab_votes,lab_share,natSW_votes,natSW_share,oth_votes,oth_share,total_votes,turnout,election,boundary_set,Unnamed: 19
0,,1,Belfast Victoria,Ireland,19494.0,,,,,,,,,13317.0,1.0,13317,0.683,1918,1918-1935,
1,,1,Carlow,Ireland,,,,,,,,,,-1.0,1.0,-1,,1918,1918-1935,
2,,1,Cavan East,Ireland,,,,,,,,,,-1.0,1.0,-1,,1918,1918-1935,
3,,1,Clare West,Ireland,,,,,,,,,,-1.0,1.0,-1,,1918,1918-1935,
4,,2,Cork City,Ireland,45017.0,2519.0,0.082,,,,,,,28281.0,0.918,30800,0.684,1918,1918-1935,


#### Multiple Seat Constituencies

There are some interesting fields in here. The first oddity is that Cork City has 2 seats. Prior to WW2, some UK constituencies returned 2 seats; these were usually university towns. These were not proportional constituencies: the seats were allocated to the candidates in order of the number of votes they received, which could in fact lead to a far more disproportionate result than "pure" FPTP does. Since these seats will distort my analysis somewhat, I would like to be able to exclude constituencies with more than 1 seat. Let's see how many of them there were in the 1918 election:

In [6]:
constituencies_results_1918 = df[df["election"] == 1918]

In [7]:
constituencies_results_1918

Unnamed: 0,constituency_id,seats,constituency_name,country/region,electorate,con_votes,con_share,lib_votes,lib_share,lab_votes,lab_share,natSW_votes,natSW_share,oth_votes,oth_share,total_votes,turnout,election,boundary_set,Unnamed: 19


That's weird: the filter returns no rows at all. I wonder if the election field is stored as a string instead of a number.

In [8]:
constituencies_results_1918 = df[df["election"] == "1918"]

In [9]:
constituencies_results_1918

Unnamed: 0,constituency_id,seats,constituency_name,country/region,electorate,con_votes,con_share,lib_votes,lib_share,lab_votes,lab_share,natSW_votes,natSW_share,oth_votes,oth_share,total_votes,turnout,election,boundary_set,Unnamed: 19
0,,1,Belfast Victoria,Ireland,19494,,,,,,,,,13317.0,1.000,13317,0.683,1918,1918-1935,
1,,1,Carlow,Ireland,,,,,,,,,,-1.0,1.000,-1,,1918,1918-1935,
2,,1,Cavan East,Ireland,,,,,,,,,,-1.0,1.000,-1,,1918,1918-1935,
3,,1,Clare West,Ireland,,,,,,,,,,-1.0,1.000,-1,,1918,1918-1935,
4,,2,Cork City,Ireland,45017,2519.0,0.082,,,,,,,28281.0,0.918,30800,0.684,1918,1918-1935,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
683,509,1,Yorkshire (West Riding) Skipton,Yorkshire and the Humber,35722,12599.0,0.550,10318,0.450,,,,,,,22917,0.642,1918,1918-1935,
684,510,1,Yorkshire (West Riding) Sowerby,Yorkshire and the Humber,34286,,,6778,0.303,7306,0.327,,,8287.0,0.370,22371,0.652,1918,1918-1935,
685,511,1,Yorkshire (West Riding) Spen Valley,Yorkshire and the Humber,38827,,,,,8508,0.444,,,10664.0,0.556,19172,0.494,1918,1918-1935,
686,512,1,Yorkshire (West Riding) Wentworth,Yorkshire and the Humber,36004,5315.0,0.244,3453,0.158,13029,0.598,,,,,21797,0.605,1918,1918-1935,


Looks like some of these fields are stored as strings when they should be numeric. Let's coerce them to be numbers:

In [10]:
def coerce_numeric_values_and_fix_field_names(df):
    df.rename(columns={"lib_votes ": "lib_votes"}, inplace=True)
    df.rename(columns={"turnout ": "turnout"}, inplace=True)
    for numeric_field in ("seats", "electorate", "con_votes", "con_share", "lib_votes", "lib_share", "lab_votes", "lab_share", "natSW_votes", "natSW_share", "oth_votes", "oth_share", "total_votes", "turnout", "election"):
        df[numeric_field] = pd.to_numeric(df[numeric_field], errors="coerce")
    return df
df = coerce_numeric_values_and_fix_field_names(df)
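One caveat with `errors="coerce"` is that anything unparseable silently becomes NaN. A minimal sketch of how to list such values before they disappear is shown below; the `raw_column` series and its contents are invented for illustration and are not taken from the CSV:

```python
import pandas as pd

# Hypothetical raw column: mostly numeric strings, one bad label, one missing value.
raw_column = pd.Series(["1918", "1922", "unknown", None, "1931"])

# Coerce to numbers; unparseable entries turn into NaN.
coerced = pd.to_numeric(raw_column, errors="coerce")

# Keep the original values that were present but failed to parse.
failed = raw_column[coerced.isna() & raw_column.notna()]

unparseable_values = sorted(failed.unique())
print(unparseable_values)
```

Running a check like this per column before overwriting it would show exactly which labels the coercion is about to discard.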

In [11]:
constituencies_with_more_than_one_seat_1918 = df[(df["seats"] > 1) & (df["election"] == 1918)]
constituencies_with_more_than_one_seat_1918

Unnamed: 0,constituency_id,seats,constituency_name,country/region,electorate,con_votes,con_share,lib_votes,lib_share,lab_votes,lab_share,natSW_votes,natSW_share,oth_votes,oth_share,total_votes,turnout,election,boundary_set,Unnamed: 19
4,,2,Cork City,Ireland,45017.0,2519.0,0.082,,,,,,,28281.0,0.918,30800.0,0.684,1918.0,1918-1935,
68,86.0,2,Blackburn,North West,61972.0,15605.0,0.337,,,14134.0,0.305,,,16638.0,0.359,46372.0,0.748,1918.0,1918-1935,
70,90.0,2,Bolton,North West,80888.0,,,,,-1.0,,,,-1.0,-1.0,,,1918.0,1918-1935,
79,97.0,2,Brighton,South East,82449.0,33325.0,0.787,,,9027.0,0.213,,,,,42352.0,0.514,1918.0,1918-1935,
99,,2,Cambridge University,University,9282.0,3925.0,0.677,,,640.0,0.11,,,1229.0,0.212,5794.0,0.624,1918.0,1918-1935,
120,12.0,2,City Of London,London,,-2.0,,,,,,,,,,-2.0,,1918.0,1918-1935,
122,,2,Combined English Universities,University,2357.0,303.0,0.152,,,366.0,0.184,,,1325.0,0.664,1994.0,0.846,1918.0,1918-1935,
123,,3,Combined Scottish Universities,University,27283.0,7005.0,0.58,,,1581.0,0.131,,,3499.0,0.29,12085.0,0.47,1918.0,1918-1935,
145,117.0,2,Derby,East Midlands,61538.0,9867.0,0.245,7102.0,0.176,16065.0,0.398,,,7287.0,0.181,40322.0,0.655,1918.0,1918-1935,
186,,2,Dublin University,University,4541.0,1904.0,0.645,,,,,,,1050.0,0.355,2954.0,0.651,1918.0,1918-1935,


In [12]:
len(constituencies_with_more_than_one_seat_1918)

18

So there were 17 constituencies in 1918 that returned 2 seats and 1 that returned 3(!), making for 37 seats in total.

In [13]:
constituencies_results_1918["seats"].sum()

np.int64(707)

In [14]:
37/707*100

5.233380480905233

So 5.2% of seats came from these multi-seat constituencies in 1918, potentially enough to have an impact on the outcome of the election. While this is annoying, it hopefully shouldn't compromise my model too much if I assume that all constituencies only return 1 seat, as in the modern UK electoral system.
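The same arithmetic can be done without typing the totals by hand. The sketch below uses a tiny invented frame (the constituency names and seat counts are assumptions, not the real 1918 data) and also shows how the single-seat rows could be kept for later analysis:

```python
import pandas as pd

# Hypothetical miniature of one election: four single-seat constituencies,
# one two-seat constituency and one three-seat constituency.
results = pd.DataFrame(
    {
        "constituency_name": ["A", "B", "C", "D", "E", "F"],
        "seats": [1, 1, 1, 1, 2, 3],
        "election": [1918] * 6,
    }
)

year = results[results["election"] == 1918]
multi_member = year[year["seats"] > 1]

multi_seat_total = int(multi_member["seats"].sum())  # seats from multi-seat constituencies
all_seat_total = int(year["seats"].sum())            # all seats in that election
multi_seat_share = multi_seat_total / all_seat_total * 100

# Rows to keep if the analysis assumes one seat per constituency.
single_member = year[year["seats"] == 1]

print(multi_seat_total, all_seat_total, round(multi_seat_share, 1))
```

Swapping the invented frame for the real `df` and looping over the election years would give the share for every election rather than 1918 alone.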

#### Election Year Weirdness and NaN Values in Vote Counts

Another major annoyance, in the 1918 results at least, is that a lot of rows have NaN values in the vote counts. This makes them pretty much unusable for my model, so I'm going to completely ignore all elections where this kind of data is missing. Let's go through all of the elections I have data for and see which ones have usable vote count data:

In [15]:
elections = df["election"].unique()
elections

array([1918., 1922., 1923., 1924., 1929., 1931., 1935., 1945., 1950.,
       1951., 1955., 1959., 1964., 1966., 1970., 1979., 1983., 1987.,
       1992., 1997., 2001., 2005., 2010., 2015., 2017., 2019.,   nan])

Well, that's interesting: why is there such a big gap between the 1970 and 1979 elections, and where does the `nan` at the end come from? Looking online I can see that there were 2 elections in 1974. Their labels may not have survived the numeric coercion, which would explain both the gap and the `nan`. Let's reload the data frame and have another look.

In [16]:
df = load_election_results_csv()
elections = df["election"].unique()
elections

array(['1918', '1922', '1923', '1924', '1929', '1931', '1935', '1945',
       '1950', '1951', '1955', '1959', '1964', '1966', '1970', '1979',
       '1983', '1987', '1992', '1997', '2001', '2005', '2010', '2015',
       '2017', '2019', '1974F', '1974O'], dtype=object)

Okay, so the 2 elections in 1974 are referred to as 1974F (February) and 1974O (October). This makes sense, but I'd still like to have a numerical value for the election year. So instead I will replace these values with 1974.1 and 1974.2 respectively, then coerce all the numerical values again.

In [17]:
df["election"] = df["election"].replace("1974F", 1974.1)
df["election"] = df["election"].replace("1974O", 1974.2)
elections = df["election"].unique()
elections

array(['1918', '1922', '1923', '1924', '1929', '1931', '1935', '1945',
       '1950', '1951', '1955', '1959', '1964', '1966', '1970', '1979',
       '1983', '1987', '1992', '1997', '2001', '2005', '2010', '2015',
       '2017', '2019', 1974.1, 1974.2], dtype=object)

In [19]:
df = coerce_numeric_values_and_fix_field_names(df)
elections = df["election"].unique()
elections

array([1918. , 1922. , 1923. , 1924. , 1929. , 1931. , 1935. , 1945. ,
       1950. , 1951. , 1955. , 1959. , 1964. , 1966. , 1970. , 1979. ,
       1983. , 1987. , 1992. , 1997. , 2001. , 2005. , 2010. , 2015. ,
       2017. , 2019. , 1974.1, 1974.2])
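As a possible variation, the relabelling and the conversion can be done in one pass by mapping the two labels to numeric strings first, which keeps the column all-string until it is converted. This is only a sketch on an invented four-value series; `label_map` is a name introduced here for illustration:

```python
import pandas as pd

# Hypothetical sample of raw election labels in the style of the CSV.
raw = pd.Series(["1970", "1974F", "1974O", "1979"])

# Map the two 1974 labels to numeric-looking strings, then convert everything.
label_map = {"1974F": "1974.1", "1974O": "1974.2"}
election_numeric = pd.to_numeric(raw.replace(label_map))

# Sorting now puts both 1974 elections between 1970 and 1979.
ordered = sorted(election_numeric.unique().tolist())
print(ordered)
```

Because no `errors="coerce"` is passed here, any label that is still non-numeric would raise an error instead of quietly turning into NaN.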

Right, now let's find all elections where NaN values are included in vote counts: