# US election prediction on historical data

## 1. Introduction

This work studies the prediction of presidential elections in the United States. Let us start with some general observations and key conclusions:

1. The history of US elections dates back to 1789, so more than 200 years had passed as of 2020. This suggests that enough data should have accumulated to attempt interesting tasks such as prediction.
2. Both party and non-party candidates have participated in elections throughout history. However, since the second half of the 19th century, only members of two parties, the Democrats and the Republicans, have become US presidents. Hence, we can treat predicting the next president as a binary classification problem in which the two parties act as the two classes.
3. The US election system is not direct, therefore we need to simplify it.

Let's take a closer look at the last point. Generally speaking, Americans first vote for electors, and the electors then elect the President. This system has a number of potential disadvantages. For example, electors are not always required to vote for the candidate they pledged to support, and other external factors can come into play. We will assume that the election is not affected by external influence, electoral bribery, or other election fraud. Under this assumption, every American citizen can influence the outcome of the election. This is the simplification of the election system used in this work.

Next, a significant question arises: what can affect the vote of an American citizen in an election? If you imagine yourself before an election, what influences your vote? Perhaps you would ask yourself: How well does the current president handle his duties? How much has education improved over this period? And health care? What about the economy? Generally speaking, many factors can influence the choice of a particular candidate.

Some of these factors are considered in this research and used to predict the next president. This work is founded on the assumption that Americans decide whether to change or keep the ruling party based on how economic and social factors changed during that party's current period in office. The trends in these factors over the period of the current party's rule will be used to predict the next ruling party.

Further plan:
1. Dataset preparation
2. Training
3. Testing
4. Next president prediction 
5. Discussion




First, let us do some preparation work.


In [1]:
# import all needed packages

from sklearn.metrics import f1_score, make_scorer, precision_score, recall_score, accuracy_score
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [2]:
# unzip the data

! unzip -qq data.zip

Let us divide the time since the beginning of the 20th century into presidential terms and label them as follows (only the terms from 1961 onward are used; the earlier ones are commented out):

'*period*' : [label of ruling party for that period, label of next ruling party]

The labels are 1 for Democrats and 0 for Republicans, so in terms of the classification problem class 1 corresponds to Democrats and class 0 to Republicans. The period between 2017 and 2019 is labeled only with the current ruling party and will be used in the last part to predict the next president.

In [3]:
# 1 for democrats, 0 for republicans

periods = {
   # '1901-1905' : 0,
   # '1905-1909' : 0,
   # '1909-1913' : 0,
   # '1913-1917' : 1,
   # '1917-1921' : 1,
   # '1921-1923' : 0,
   # '1923-1929' : 0,
   # '1929-1933' : 0,
   # '1933-1937' : 1,
   # '1937-1941' : 1,
   # '1941-1945' : 1,
   # '1945-1949' : 1,
   # '1949-1953' : 1,
   # '1953-1957' : 0,
   # '1957-1961' : 0,
    '1961-1963' : [1, 1],
    '1963-1967' : [1, 1],
    '1967-1969' : [1, 0],
    '1969-1974' : [0, 0],
    '1974-1977' : [0, 1],
    '1977-1981' : [1, 0],
    '1981-1985' : [0, 0],
    '1985-1989' : [0, 0],
    '1989-1993' : [0, 1],
    '1993-1997' : [1, 1],
    '1997-2001' : [1, 0],
    '2001-2005' : [0, 0],
    '2005-2009' : [0, 1],
    '2009-2013' : [1, 1],
    '2013-2017' : [1, 0],
#    '2017-2019' : 0
}

## 2. Data preparation

Let us describe the dataset. This research considers only 10 factors that reflect the situation in the country for the period between 1960 and 2019. Nine of them are economic and social indicators: GDP growth rate, inflation, unemployment rate, life expectancy, birth rate, crime level, and spending on education, healthcare and welfare. The tenth, and one of the most significant, is the label of the currently ruling party. The reason for such a small number of factors and such a short period is a lack of data: many resources with data on the US economy and social life are not available in Russia. To overcome this problem, a data augmentation approach is applied in this work. Data augmentation is commonly used in deep learning; it consists of generating synthetic data, which can help when data is scarce. The Synthetic Minority Over-sampling Technique (SMOTE) will be used to create new objects.

In [4]:
# load the data

gdp_growth = pd.read_csv('./gdp_growth_rate.csv')
inflation = pd.read_csv('./inflation_rate.csv')
life_exp = pd.read_csv('./life_expectancy.csv')
birth_rate = pd.read_csv('./birth_rate.csv')
#death_rate = pd.read_csv('./death_rate.csv')
# a multi-character delimiter is only supported by the python parsing engine
crime_rate = pd.read_csv('./crime_rate.csv', delimiter='    ', engine='python')
spendings = pd.read_csv('./spendings.csv', delimiter='\t')
unemp_rate = pd.read_csv('./unem_rate.csv', delimiter=' ')

Let us get the necessary data from the files and create a table of initial factors by year.

### 2.1 Preprocessing

Let's define a preprocessing function for each factor's data.

In [5]:
def general_preproc(data, prev_name, new_name):
    data['date'] = data['date'].apply(lambda x: int(x[0:4]))
    data = data[['date', prev_name]]
    data.columns = ['date', new_name]
    return data

def gdp_preproc(data):
    return general_preproc(data, ' GDP Growth (%)', 'gdp_growth')

def inf_preproc(data):
    return general_preproc(data, ' Inflation Rate (%)', 'inflation')

def life_exp_preproc(data):
    return general_preproc(data, ' Life Expectancy from Birth (Years)', 
                           'life_exp')

def birth_rate_preproc(data):
    return general_preproc(data, ' Births per 1000 People', 'births')

def crime_rate_preproc(data):
    data = data.rename(columns={'Year':'date'})
    data['crime_rate'] = data['Total'].apply(lambda x: float(x.replace(',', '.')))
    data = data[['date', 'crime_rate']]
    return data

def unemp_preproc(data):
    cols = list(data.columns)
    cols.remove('Year')
    data = data.rename(columns={'Year':'date'})
    data['unemp_rate'] = data[cols].mean(axis=1)
    data = data.drop(cols, axis=1)
    return data

def spendings_preproc(data):
    data = data.drop(['Unnamed: 4', 'Unnamed: 6', 'Unnamed: 8'], axis=1)
    data = data.rename(columns={'Year':'date'})

    health_spends = data[['date', 'Health Care - Total $ billion nominal']]
    edu_spends = data[['date', 'Education - Total $ billion nominal']]
    welfare_spend = data[['date', 'Welfare - Total $ billion nominal']]

    rename_cols = [{'Health Care - Total $ billion nominal':'health_spends'},
                  {'Education - Total $ billion nominal':'edu_spends'},
                  {'Welfare - Total $ billion nominal':'welfare_spends'}]

    health_spends = health_spends.rename(columns=rename_cols[0])
    edu_spends = edu_spends.rename(columns=rename_cols[1])
    welfare_spend = welfare_spend.rename(columns=rename_cols[2])
    return health_spends, edu_spends, welfare_spend

In [6]:
gdp_growth = gdp_preproc(gdp_growth)
inflation = inf_preproc(inflation)
life_exp = life_exp_preproc(life_exp)
birth_rate = birth_rate_preproc(birth_rate)
crime_rate = crime_rate_preproc(crime_rate)
health_spends, edu_spends, welfare_spends = spendings_preproc(spendings)
unemp_rate = unemp_preproc(unemp_rate)

Merge everything together.

In [7]:
data = pd.merge(gdp_growth, inflation, on='date')
mass = [life_exp, birth_rate, crime_rate, health_spends, edu_spends,
        welfare_spends, unemp_rate]
  
for m in mass:
    data = pd.merge(data, m, on='date')

data = data.set_index('date')
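
As a minimal sketch of what this chain of merges does, the toy example below shows that `pd.merge` with `on='date'` performs an inner join by default, so only the years present in every table survive. The three small frames and their values are made up for illustration and are not part of the dataset.

```python
from functools import reduce

import pandas as pd

# Hypothetical toy tables; the values are illustrative only
a = pd.DataFrame({'date': [1960, 1961, 1962], 'x': [1.0, 2.0, 3.0]})
b = pd.DataFrame({'date': [1961, 1962, 1963], 'y': [10.0, 20.0, 30.0]})
c = pd.DataFrame({'date': [1961, 1962], 'z': [0.1, 0.2]})

# pd.merge defaults to how='inner': a year missing from any table is dropped
merged = reduce(lambda left, right: pd.merge(left, right, on='date'), [a, b, c])
merged = merged.set_index('date')

print(merged)
```

If some factor table starts later or ends earlier than the others, the merged table silently shrinks to the common range of years, so it can be worth checking `merged.index.min()` and `merged.index.max()` after merging.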

Let's look at the current data.

In [8]:
data.head()

| date | gdp_growth | inflation | life_exp | births | crime_rate | health_spends | edu_spends | welfare_spends | unemp_rate |
|---|---|---|---|---|---|---|---|---|---|
| 1961 | 2.3 | 1.0707 | 69.93 | 22.134 | 1.906 | 5.68 | 21.21 | 10.27 | 6.691667 |
| 1962 | 6.1 | 1.1988 | 70.02 | 21.573 | 2.019 | 5.54 | 22.79 | 13.22 | 5.566667 |
| 1963 | 4.4 | 1.2397 | 70.11 | 21.012 | 2.18 | 6.09 | 24.61 | 13.27 | 5.641667 |
| 1964 | 5.8 | 1.2789 | 70.16 | 20.336 | 2.388 | 6.7 | 27.24 | 13.19 | 5.158333 |
| 1965 | 6.4 | 1.5852 | 70.21 | 19.66 | 2.449 | 7.15 | 29.82 | 12.93 | 4.508333 |


Now let's build a feature description of each presidential term. We will calculate the change in each factor between the first and the last year of each term listed in the dictionary *periods* defined in the Introduction. We will also add the label of the ruling party for the corresponding period as a feature, and the label of the next ruling party as the target variable.

In [32]:
pres_mass = []

for k, v in periods.items():
    start = int(k[:4])
    end = int(k[-4:])

    diff = data.loc[end] - data.loc[start]
    pres_mass.append(np.hstack((diff, v[0], v[1])))

In [33]:
pres_columns = ['d_' + col for col in data.columns] 
pres_columns += ['cur_pres', 'next_pres'] 
pres_dataset = pd.DataFrame(pres_mass, columns=pres_columns)

Now we have our dataset.

In [34]:
pres_dataset.head()

|   | d_gdp_growth | d_inflation | d_life_exp | d_births | d_crime_rate | d_health_spends | d_edu_spends | d_welfare_spends | d_unemp_rate | cur_pres | next_pres |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.1 | 0.169 | 0.18 | -1.122 | 0.274 | 0.41 | 3.4 | 3.0 | -1.05 | 1.0 | 1.0 |
| 1 | -1.9 | 1.5331 | 0.2 | -2.704 | 0.809 | 8.62 | 16.74 | -1.95 | -1.8 | 1.0 | 1.0 |
| 2 | 0.6 | 2.6896 | 0.262 | -1.076 | 0.691 | 6.56 | 9.65 | 3.92 | -0.35 | 1.0 | 0.0 |
| 3 | -3.6405 | 5.5924 | 1.214 | -1.776 | 1.17 | 20.3 | 30.44 | 19.58 | 2.15 | 0.0 | 0.0 |
| 4 | 5.1647 | -4.5531 | 1.098 | -0.525 | 0.227 | 25.7 | 33.18 | 31.22 | 1.408333 | 0.0 | 1.0 |
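
As a quick check of how these rows are produced, the sketch below recomputes part of row 0 (the 1961-1963 term) from the yearly values shown in the `data.head()` output. Only two of the nine columns are copied here to keep it short.

```python
import pandas as pd

# Yearly values copied from the data.head() output (two columns only)
yearly = pd.DataFrame(
    {'gdp_growth': [2.3, 6.1, 4.4], 'inflation': [1.0707, 1.1988, 1.2397]},
    index=pd.Index([1961, 1962, 1963], name='date'),
)

period = '1961-1963'
start, end = int(period[:4]), int(period[-4:])

# Change of each factor between the first and the last year of the term
diff = (yearly.loc[end] - yearly.loc[start]).round(4)
print(diff.to_dict())
```

The result, 2.1 for GDP growth and 0.169 for inflation, matches `d_gdp_growth` and `d_inflation` in row 0. Note that only the two end points of a term enter the feature; the years in between are not used.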


Further pipeline:
1. Remove highly correlated features (absolute value of the correlation coefficient > 0.8) for linear models, since multicollinearity makes the coefficients of such models unstable and harder to interpret.
2. Apply one-hot encoding to the categorical feature (cur_pres). Otherwise, models can compare the values of this feature numerically, which is not correct for categorical features.
3. Scale the features, since they have different ranges.
4. Augment the data.

Let us build the correlation matrix.

In [35]:
corr = pres_dataset.corr()
# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
corr.style.background_gradient(cmap='coolwarm').format(precision=2)

|   | d_gdp_growth | d_inflation | d_life_exp | d_births | d_crime_rate | d_health_spends | d_edu_spends | d_welfare_spends | d_unemp_rate | cur_pres | next_pres |
|---|---|---|---|---|---|---|---|---|---|---|---|
| d_gdp_growth | 1.0 | -0.24 | -0.1 | 0.09 | -0.16 | -0.12 | -0.45 | -0.52 | -0.46 | 0.08 | 0.2 |
| d_inflation | -0.24 | 1.0 | -0.02 | -0.43 | 0.63 | -0.14 | -0.08 | -0.39 | -0.21 | 0.43 | -0.33 |
| d_life_exp | -0.1 | -0.02 | 1.0 | 0.19 | 0.17 | -0.27 | -0.16 | 0.43 | 0.67 | -0.51 | -0.02 |
| d_births | 0.09 | -0.43 | 0.19 | 1.0 | -0.47 | 0.3 | 0.22 | 0.23 | 0.15 | -0.37 | -0.37 |
| d_crime_rate | -0.16 | 0.63 | 0.17 | -0.47 | 1.0 | -0.63 | -0.47 | -0.24 | 0.08 | 0.02 | -0.15 |
| d_health_spends | -0.12 | -0.14 | -0.27 | 0.3 | -0.63 | 1.0 | 0.45 | 0.13 | -0.04 | 0.01 | -0.05 |
| d_edu_spends | -0.45 | -0.08 | -0.16 | 0.22 | -0.47 | 0.45 | 1.0 | 0.13 | -0.04 | 0.24 | -0.17 |
| d_welfare_spends | -0.52 | -0.39 | 0.43 | 0.23 | -0.24 | 0.13 | 0.13 | 1.0 | 0.85 | -0.48 | 0.3 |
| d_unemp_rate | -0.46 | -0.21 | 0.67 | 0.15 | 0.08 | -0.04 | -0.04 | 0.85 | 1.0 | -0.61 | 0.12 |
| cur_pres | 0.08 | 0.43 | -0.51 | -0.37 | 0.02 | 0.01 | 0.24 | -0.48 | -0.61 | 1.0 | 0.07 |


It can be seen that the pair *diff. of welfare spendings* and *diff. of unemployment rate* has a high correlation (0.85). This can be explained by the fact that welfare spending includes unemployment benefits: when unemployment rises, welfare spending increases as well. However, no feature is strongly correlated with the target value (*next_pres*): the largest absolute coefficient in that column is 0.37.
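
Instead of reading the matrix by eye, the pairs above the threshold can be listed programmatically. The helper below is a sketch, not part of the original pipeline: `high_corr_pairs` and the toy frame are hypothetical names and data introduced for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, corr) for pairs with |corr| above threshold."""
    corr = df.corr()
    # keep only the strict upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # masked-out cells are NaN and never pass the threshold comparison
    pairs = pairs[pairs.abs() > threshold]
    return [(a, b, round(float(v), 2)) for (a, b), v in pairs.items()]

# Hypothetical toy data: 'b' is almost a linear function of 'a', 'c' is unrelated
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.1, 3.9, 6.2, 7.8, 10.1],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})
result = high_corr_pairs(toy)
print(result)
```

Applied to the feature columns of `pres_dataset`, a helper like this would be expected to report the welfare/unemployment pair discussed above.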

In [36]:
# one-hot-encoding for cur_pres column

pres_dataset = pd.get_dummies(pres_dataset, columns=['cur_pres'])
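
For reference, here is a small sketch of what `pd.get_dummies` produces for a float-coded label such as `cur_pres`. The toy frame and its `d_feature` column are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with the same float-coded party label as cur_pres
toy = pd.DataFrame({'d_feature': [0.5, -1.2, 0.3], 'cur_pres': [1.0, 0.0, 1.0]})

encoded = pd.get_dummies(toy, columns=['cur_pres'])

# The dummy columns are appended after the remaining columns,
# one per distinct value, in sorted order of the values
print(list(encoded.columns))
print(encoded.astype(float).values.tolist())
```

The encoded frame ends with the columns `cur_pres_0.0` and `cur_pres_1.0`, in that order. This order matters later, when a feature vector for a new period is assembled by hand.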

In [37]:
# separate target value from whole dataset and remove highly corr. features

target_value = np.array(pres_dataset['next_pres'])
pres_dataset = pres_dataset.drop(['next_pres'], axis=1)
pres_dataset_linear = pres_dataset.drop(['d_welfare_spends'], axis=1)

Let us scale the features into the interval [0, 1] using MinMaxScaler.

In [38]:
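# one scaler per feature set, so that each fitted scaler stays usable later
scaler = MinMaxScaler()
scaler_linear = MinMaxScaler()
pres_dataset_scaled = scaler.fit_transform(pres_dataset)
pres_dataset_linear_scaled = scaler_linear.fit_transform(pres_dataset_linear)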
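
The following sketch illustrates what the scaler computes and how a fitted scaler treats an object it has not seen. The toy matrix and the new object are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature matrix: 3 objects, 2 features
X = np.array([[1.0, 10.0],
              [3.0, 20.0],
              [5.0, 40.0]])

toy_scaler = MinMaxScaler()
X_scaled = toy_scaler.fit_transform(X)

# Same result by hand: (x - min) / (max - min), column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
assert np.allclose(X_scaled, manual)

# A new object is transformed with the minima and maxima learned from X,
# so its scaled values may fall outside [0, 1]
x_new = np.array([[6.0, 5.0]])
x_new_scaled = toy_scaler.transform(x_new)
print(x_new_scaled)
```

Any object passed to a model trained on scaled features should go through `transform` of the same fitted scaler. Calling `fit_transform` again on other data replaces the learned minima and maxima.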

### 2.2 Data augmentation

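Let us give a short description of the SMOTE algorithm.

Algorithm for a given *k*:
1. Take a random object of one class.
2. Find its nearest neighbors of the same class (e.g. using the KNN algorithm) and pick one of them at random. In the implementation below, the object itself counts as one of the *k* neighbors returned, so the choice is made among the other *k* - 1.
3. Take a random point on the segment between the chosen object (from step 1) and the chosen neighbor.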
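
The three steps can be sketched for a single synthetic point as follows. This is a brute-force NumPy variant written for illustration, not the class used in this work: the function name `smote_point` is hypothetical, and it chooses among the *k* nearest neighbors excluding the object itself.

```python
import numpy as np

def smote_point(X, k, rng):
    """Generate one synthetic point from same-class objects X (shape: n x d)."""
    i = rng.integers(len(X))                   # step 1: random object
    dists = np.linalg.norm(X - X[i], axis=1)   # distances from X[i] to all objects
    # step 2: k nearest objects; position 0 of the sorted order is X[i] itself
    # (assuming no exact duplicates), so it is skipped
    neighbors = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbors)
    coef = rng.random()                        # step 3: position on the segment
    return X[i] + coef * (X[j] - X[i])

# Hypothetical toy class: the four corners of the unit square
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
rng = np.random.default_rng(0)
point = smote_point(X_toy, k=2, rng=rng)
print(point)
```

Because the new point is a convex combination of two existing objects, it always lies inside the region already covered by the class. Synthetic points therefore add density but no genuinely new information.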


In [808]:
class SMOTE_augmentation():
    def __init__(self, k=3, N=10):
        # k - number of neighbors queried for each object (the object itself
        #     is included, so one of the other k - 1 is chosen at random)
        # N - number of synthetic objects generated for each class

        self.k = k
        self.N = N

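    def generate_for_one(self, X, y):
        # kneighbors returns the query object itself at position 0,
        # so positions 1..k-1 hold its nearest neighbors of the same class
        neigh = NearestNeighbors(n_neighbors=self.k).fit(np.array(X))
        new_points = []

        for i in range(self.N):
            rand_ind = np.random.randint(len(X))
            dists, ns = neigh.kneighbors([X[rand_ind]])
            rand_neighbor = np.random.randint(1, self.k)
            # random point on the segment between the object and the chosen neighbor
            coef = np.random.random(1)[0]
            new_point = X[rand_ind] - (X[rand_ind] - X[ns[0][rand_neighbor]]) * coef
            new_points.append(new_point)

        # all synthetic objects get the label of the class they were generated from
        return np.concatenate((X, np.array(new_points))), np.concatenate((y, y[0]*np.ones(self.N)))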

    def generate_data(self, X, y):
        first_class = X[y == 1]
        second_class = X[y == 0]

        X_1, y_1 = self.generate_for_one(first_class, y[y == 1])
        X_2, y_2 = self.generate_for_one(second_class, y[y == 0])

        return np.concatenate((X_1, X_2)), np.concatenate((y_1, y_2))

Let's upsample the datasets to 85 objects: the 15 original ones plus 35 synthetic objects per class. It is still a small amount of data, but generating more synthetic data could introduce bias.

In [813]:
augment = SMOTE_augmentation(k=5, N=35)

x_aug, y_aug = augment.generate_data(pres_dataset_scaled, target_value)
x_aug_lin, y_aug_lin = augment.generate_data(pres_dataset_linear_scaled, 
                                             target_value)

## 3. Training

Five classical machine learning classifiers are used in this work:

1. Logistic regression (*LogReg*)
2. Support Vector Machine (*SVM*)
3. Multi-layer Perceptron (*MLP*)
4. Decision Tree (*DT*)
5. K-nearest neighbors (*KNN*)

The most appropriate hyperparameters will be selected using grid search with 5-fold cross-validation. A set of candidate hyperparameter values is defined for each classifier. *F1-score* is chosen as the metric to optimize since it is more robust to class imbalance than accuracy. The *LogReg* model will be trained on the data without the highly correlated feature.

In [814]:
def f1_scorer():
    return make_scorer(f1_score)

def Grid_search_trainig(clf, params, x=x_aug, y=y_aug, scorer=f1_scorer(), cv=5):
    grid_clf = GridSearchCV(clf[0], params, cv=cv, scoring=scorer)
    grid_clf.fit(x, y)
    best_clf = grid_clf.best_estimator_
    return (best_clf, clf[1])

params = [
          {'C':[1e-2, 1e-1, 1, 5]},
          {'C':[1e-2, 1e-1, 1, 5]},
          {'hidden_layer_sizes': [(4,), (8,), (8, 4), (16, 8, 4,)], 
           'activation' : ['relu', 'logistic'],
           'solver' : ['lbfgs', 'sgd', 'adam'],
           'alpha' : [1e-2, 1e-1, 1, 5]},
          {'criterion': ['gini', 'entropy'], 
           'splitter' : ['best', 'random'],
           'min_samples_leaf' : [1, 2, 5],
           'max_depth' : [None, 2, 5]}, 
          {'n_neighbors': [3, 5, 7], 
           'algorithm':['ball_tree', 'kd_tree', 'brute']}
]

cls = [
       (LogisticRegression(), 'LogReg'), 
       (SVC(probability=True), 'SVM'),
       (MLPClassifier(), 'MLP'),
       (DecisionTreeClassifier(), 'DT'),
       (KNeighborsClassifier(), 'KNN')
]

best_clfs = []

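# use a separate loop variable so that the params list is not overwritten
for clf, clf_params in zip(cls, params):
    if clf[1] == 'LogReg':
        # the linear model is trained on the feature set without d_welfare_spends
        best_clfs.append(Grid_search_trainig(clf, clf_params, x_aug_lin, y_aug_lin))
    else:
        best_clfs.append(Grid_search_trainig(clf, clf_params))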
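
The training function above returns only the best estimator. As a sketch of how the selected hyperparameters and their cross-validated score could be inspected, the example below runs the same kind of search on made-up data; `X_toy` and `y_toy` are hypothetical stand-ins for `x_aug` and `y_aug`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical toy data standing in for x_aug, y_aug
X_toy, y_toy = make_classification(n_samples=80, n_features=5, random_state=0)

grid = GridSearchCV(SVC(), {'C': [1e-2, 1e-1, 1, 5]},
                    cv=5, scoring=make_scorer(f1_score))
grid.fit(X_toy, y_toy)

# Hyperparameters of the best candidate and its mean cross-validated F1-score
print(grid.best_params_)
print(round(grid.best_score_, 3))
```

Here `best_score_` is the mean F1-score over the five validation folds for the winning candidate, which gives a second estimate of quality next to the hold-out test in the following section.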

## 4. Testing

At this stage, the classifiers with their selected hyperparameters are compared. The entire dataset is divided into training and test parts in a 7:3 ratio. The classifiers are trained on the first part, and their quality is then checked on the second. The best model will be used to predict the next president. The *precision*, *recall* and *accuracy* metrics are compared in addition to *F1-score*.

In [821]:
# split data

X_train, X_test, y_train, y_test = train_test_split(x_aug, y_aug, test_size=0.3)
X_train_lin, X_test_lin, y_train_lin, y_test_lin = train_test_split(x_aug_lin, 
                                                                    y_aug_lin, 
                                                                    test_size=0.3)
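
To make the split sizes concrete, the sketch below splits a made-up dataset of the same size as the augmented one. The arguments `random_state` and `stratify` are additions of this sketch (for reproducibility and balanced classes in both parts) and are not used in the original call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same size as the augmented dataset:
# 15 original objects + 2 * 35 synthetic ones = 85
X_toy = np.arange(85 * 2, dtype=float).reshape(85, 2)
y_toy = np.array([0.0, 1.0] * 42 + [0.0])

# random_state and stratify are additions of this sketch
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

print(len(X_tr), len(X_te))
```

With 85 objects and `test_size=0.3`, the test part contains 26 objects and the training part 59, so every metric reported on the test part moves in steps of about 1/26.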

In [822]:
# compare classifiers

def compare_models(models):
    rows = []

    for clf, name in models:
        # LogReg uses the feature set without the highly correlated column
        if name == 'LogReg':
            train_x, test_x, train_y, test_y = X_train_lin, X_test_lin, y_train_lin, y_test_lin
        else:
            train_x, test_x, train_y, test_y = X_train, X_test, y_train, y_test

        # fit once and reuse the predictions for every metric
        y_pred = clf.fit(train_x, train_y).predict(test_x)
        rows.append({'Method': name,
                     'Precision': precision_score(test_y, y_pred),
                     'Recall': recall_score(test_y, y_pred),
                     'F1-score': f1_score(test_y, y_pred),
                     'Accuracy': accuracy_score(test_y, y_pred)})

    return pd.DataFrame(rows).set_index('Method')
  
compare_models(best_clfs)

| Method | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|
| LogReg | 0.785714 | 0.916667 | 0.846154 | 0.846154 |
| SVM | 0.933333 | 1.0 | 0.965517 | 0.961538 |
| MLP | 0.714286 | 0.714286 | 0.714286 | 0.692308 |
| DT | 0.875 | 1.0 | 0.933333 | 0.923077 |
| KNN | 0.875 | 1.0 | 0.933333 | 0.923077 |


As can be seen from the table above, the *SVM* model achieves the highest F1-score and accuracy. This model will be used to predict the future US President.
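
To show how the four metrics relate, the sketch below computes them from confusion-matrix counts. The counts are not reported in this work; they are inferred from the SVM row under the assumption that the test part holds 26 objects (30% of 85), and they reproduce that row.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Precision, recall, F1-score and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Counts inferred from the reported SVM row, assuming 26 test objects
precision, recall, f1, accuracy = metrics_from_counts(tp=14, fp=1, fn=0, tn=11)
print(round(precision, 6), round(recall, 6), round(f1, 6), round(accuracy, 6))
# prints: 0.933333 1.0 0.965517 0.961538
```

Under these counts the SVM row corresponds to a single misclassified test object, so the gaps between the top models amount to one or two objects.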

In [823]:
best_model = best_clfs[1]

## 5. Predicting the next president

Let us prepare the data for the period between 2017 and 2019 in the same way as for the other periods. Then the *SVM* model will be applied to predict the next ruling party.

In [824]:
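# test data preparation: factor changes over 2017-2019 followed by the
# one-hot flags [cur_pres_0.0, cur_pres_1.0] = [1, 0] (Republicans in office)
# note: this vector is passed to the model as is, without MinMax scaling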

diff = (data.loc[2019] - data.loc[2017]).values
test_data = np.concatenate((diff, [1], [0]))

In [827]:
# predicting the next ruling party

best_model[0].predict(test_data.reshape(1, -1))

array([1.])

As we can see, the *SVM* classifier predicts class 1 for the given data, which means that the next ruling party will be the Democrats.

In [826]:
# get the probability

best_model[0].predict_proba(test_data.reshape(1, -1))

array([[0.73467831, 0.26532169]])

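The columns of this output follow the order of the class labels (0, then 1), so the model assigns a probability of about 73% to class 0 (Republicans) and about 27% to class 1 (Democrats). For an *SVM* trained with *probability=True*, these probability estimates come from a separate calibration step and can be inconsistent with the output of *predict*, which is what happens here.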
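
When reading the output of `predict_proba`, it helps to pair each column with the corresponding entry of `classes_`. The sketch below does this on made-up, well-separated data; the toy points are hypothetical and unrelated to the election dataset.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated classes labeled 0.0 and 1.0
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                   rng.normal(2.0, 0.3, size=(20, 2))])
y_toy = np.array([0.0] * 20 + [1.0] * 20)

clf = SVC(probability=True, random_state=0).fit(X_toy, y_toy)

# Column i of predict_proba is the estimated probability of clf.classes_[i]
proba = clf.predict_proba([[2.0, 2.0]])[0]
for label, p in zip(clf.classes_, proba):
    print(label, round(float(p), 3))
```

Printing the labels next to the probabilities avoids attributing a value to the wrong class when the positive class is not in the first column.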

## 6. Discussion

To sum up, in this work the next ruling party was predicted from historical data. The approach described here has one significant problem: a lack of data. To mitigate it, the dataset was upsampled using the SMOTE algorithm. Five classical classifiers were trained and compared on the dataset. The *SVM* classifier showed the best results on the test set and was used to predict the next president. It predicts that the ruling party from 2021 will be the Democrats.


The current model is unstable and does not fully reflect the real world, since in such a task there may be additional factors that reflect the current situation in the country. Nevertheless, I believe that this work is a good baseline for future research in this field. For example, one significant improvement to this model would be to add a new set of factors for earlier time periods. I was unable to do this because resources with statistical data on the US economy and social life are unavailable in Russia.