# Modeling elections

In [1]:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
import pystan

## Data

The `electoral_votes` variable is a dictionary containing the number of Electoral College votes for each state. For example:
```
  >>> electoral_votes['Indiana']
  11
```
Data from [Wikipedia: United_States_Electoral_College](https://en.wikipedia.org/wiki/United_States_Electoral_College)

The `survey_results` variable is a dictionary mapping from states to an array of survey results for each candidate. **Each row in a survey results array represents one survey** and each column represents one candidate. There are **4 columns, representing Clinton, Trump, Johnson, and Stein** in that order. In the example below, Clinton got 340 votes in the first survey, Trump got 258, Johnson got 27, and Stein got 13.
```
  >>> survey_results['Indiana']
  array([[340, 258,  27,  13],
         [240, 155,   5,   5],
         [235, 155,  50,  20],
         [308, 266,  49,  35],
         [222, 161,  80,  30]])
```
Data from [Wikipedia: Statewide opinion polling for the United States presidential election, 2016](https://en.wikipedia.org/wiki/Statewide_opinion_polling_for_the_United_States_presidential_election,_2016)
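
To make the row/column layout concrete, here is a small sketch that turns the Indiana example above into per-survey vote shares. The names `CANDIDATES`, `indiana_surveys`, and `survey_shares` are illustrative and not part of the notebook, and plain lists are used so the snippet runs on its own.

```python
# Illustrative only: the Indiana rows are copied from the example above.
CANDIDATES = ["Clinton", "Trump", "Johnson", "Stein"]

indiana_surveys = [
    [340, 258, 27, 13],
    [240, 155, 5, 5],
    [235, 155, 50, 20],
    [308, 266, 49, 35],
    [222, 161, 80, 30],
]

def survey_shares(surveys):
    """Convert each survey's counts (one row) into proportions that sum to 1."""
    shares = []
    for row in surveys:
        total = sum(row)
        shares.append([count / total for count in row])
    return shares

shares = survey_shares(indiana_surveys)
for name, share in zip(CANDIDATES, shares[0]):
    print(f"{name}: {share:.3f}")
```

Each row is normalized by its own total because the surveys have different numbers of respondents.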


In [2]:
electoral_votes = {
    'Alabama': 9,
    'Alaska': 3,
    'Arizona': 11,
    'Arkansas': 6,
    'Colorado': 9,
}

survey_results = {
    'Alabama': np.array([], dtype=int).reshape(0, 4),
    'Alaska': np.array([400 * np.array([.47, .43, .07, .03]), 500 * np.array([.36, .37, .07, .03]), 500 * np.array([.34, .37, .10, .02]), 660 * np.array([.31, .36, .18, .06])], dtype=int),
    'Arizona': np.array([392 * np.array([.45, .47, .05, .02]), 550 * np.array([.39, .47, .04, .03]), 719 * np.array([.40, .45, .09, .03]), 769 * np.array([.44, .49, .05, .01]), 2229 * np.array([.45, .44, .07, .01]), 700 * np.array([.43, .47, .02, .02]), 550 * np.array([.41, .45, .03, .01]), 994 * np.array([.42, .44, .04, .01]), 550 * np.array([.40, .42, .05, .02]), 2385 * np.array([.48, .46, .05, .01]), 401 * np.array([.45, .46, .04, .01]), 550 * np.array([.41, .41, .05, .02]), 1538 * np.array([.39, .44, .06, .02]), 713 * np.array([.43, .38, .06, .01]), 400 * np.array([.39, .37, .08, .03]), 600 * np.array([.44, .42, .09, .01]), 718 * np.array([.42, .42, .05, .01]), 484 * np.array([.41, .46, .09, .01]), 649 * np.array([.38, .40, .12, .03])], dtype=int),
    'Arkansas': np.array([463 * np.array([.33, .56, .04, .02]), 831 * np.array([.34, .55, .03, .01]), 600 * np.array([.29, .57, .05, .03])], dtype=int),
    'Colorado': np.array([1150 * np.array([.45, .44, .05, .04]), 500 * np.array([.44, .38, .07, .02]), 550 * np.array([.39, .39, .05, .04]), 750 * np.array([.44, .41, .08, .04]), 685 * np.array([.45, .37, .10, .03]), 400 * np.array([.49, .38, .07, .03]), 602 * np.array([.44, .33, .10, .03]), 694 * np.array([.46, .40, .06, .02]), 784 * np.array([.41, .42, .13, .03]), 991 * np.array([.40, .39, .07, .02]), 644 * np.array([.44, .42, .10, .02]), 540 * np.array([.41, .34, .12, .03]), 600 * np.array([.38, .42, .13, .02]), 704 * np.array([.48, .43, .04, .02]), 605 * np.array([.43, .38, .07, .02]), 997 * np.array([.42, .39, .07, .02])], dtype=int),
}

states = sorted(survey_results.keys())
print('Modeling', len(states), 'states with', sum(electoral_votes[s] for s in states), 'electoral college votes')

Modeling 5 states with 38 electoral college votes


In [3]:
print(survey_results)

{'Alabama': array([], shape=(0, 4), dtype=int64), 'Alaska': array([[188, 172,  28,  12],
       [180, 185,  35,  15],
       [170, 185,  50,  10],
       [204, 237, 118,  39]]), 'Arizona': array([[ 176,  184,   19,    7],
       [ 214,  258,   22,   16],
       [ 287,  323,   64,   21],
       [ 338,  376,   38,    7],
       [1003,  980,  156,   22],
       [ 301,  329,   14,   14],
       [ 225,  247,   16,    5],
       [ 417,  437,   39,    9],
       [ 220,  231,   27,   11],
       [1144, 1097,  119,   23],
       [ 180,  184,   16,    4],
       [ 225,  225,   27,   11],
       [ 599,  676,   92,   30],
       [ 306,  270,   42,    7],
       [ 156,  148,   32,   12],
       [ 264,  252,   54,    6],
       [ 301,  301,   35,    7],
       [ 198,  222,   43,    4],
       [ 246,  259,   77,   19]]), 'Arkansas': array([[152, 259,  18,   9],
       [282, 457,  24,   8],
       [174, 341,  30,  18]]), 'Colorado': array([[517, 506,  57,  46],
       [220, 190,  35,  10],
       [214

## Generative model

1. For each state we generate a vector $\vec{\alpha}$, which defines a Dirichlet distribution over the proportion of votes that go to each of the 4 candidates whenever we do a survey, including the final survey, namely the election itself, which we want to predict. The **prior over each component of $\vec{\alpha}$ is a Cauchy distribution** with location 0 and scale 1. Since the components of $\vec{\alpha}$ must be positive, we actually use the positive half-Cauchy distribution.

2. For each survey in a state we **generate a probability vector $\vec{p_i} \sim \text{Dirichlet}(\vec{\alpha})$ for the probability that a voter selects each of the 4 candidates**.

3. For each survey, we then generate the number of votes going to each candidate **(the likelihood) as $\vec{k_i} \sim \text{Multinomial}(\vec{p_i})$**, with the total number of respondents in each survey fixed by the data.
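
The three steps above can be run forward as a simulation. The following is a minimal sketch using only the Python standard library. The helper names, the seed, and the survey size `n_respondents` are assumptions made for illustration (the model text does not fix a survey size), and this is not the sampler Stan uses.

```python
import math
import random

def sample_half_cauchy(rng, scale=1.0):
    """Draw from a positive half-Cauchy(0, scale) via the Cauchy inverse CDF."""
    u = rng.random()
    value = abs(scale * math.tan(math.pi * (u - 0.5)))
    return max(value, 1e-12)  # guard: gammavariate needs a strictly positive shape

def sample_dirichlet(rng, alpha):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_j, 1) draws."""
    while True:
        gammas = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(gammas)
        if total > 0:  # retry in the rare case that every draw underflows to 0
            return [g / total for g in gammas]

def sample_multinomial(rng, n, p):
    """Draw multinomial counts by assigning n respondents with weights p."""
    counts = [0] * len(p)
    for idx in rng.choices(range(len(p)), weights=p, k=n):
        counts[idx] += 1
    return counts

rng = random.Random(0)
n_respondents = 500  # hypothetical survey size, not specified by the model text

alpha = [sample_half_cauchy(rng) for _ in range(4)]  # step 1: prior draw
p = sample_dirichlet(rng, alpha)                     # step 2: vote proportions
k = sample_multinomial(rng, n_respondents, p)        # step 3: survey counts
print(alpha, p, k)
```

Running it with different seeds gives a feel for how heavy-tailed the half-Cauchy prior is: some draws of $\vec{\alpha}$ concentrate almost all votes on one candidate, others spread them evenly.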

### Tasks

* Use Stan to sample from the posterior distribution over $\alpha$ and visualize your results. There are 5 states in the data above, so you will have 5 posteriors.
* The posteriors over $\alpha$ show a lot of variation between different states. Explain the results you get in terms of the model and the data.
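
When interpreting the posteriors, it may help to recall two standard properties of the Dirichlet distribution. Here $\alpha_0 = \sum_j \alpha_j$ is a symbol introduced for this note and does not appear in the original notebook:

$$
\mathbb{E}[p_j] = \frac{\alpha_j}{\alpha_0},
\qquad
\operatorname{Var}[p_j] = \frac{\alpha_j\,(\alpha_0 - \alpha_j)}{\alpha_0^2\,(\alpha_0 + 1)}
$$

The ratios $\alpha_j / \alpha_0$ set the expected vote shares, while the total $\alpha_0$ controls how tightly individual surveys cluster around them. A state with no surveys, such as Alabama in the data above, has nothing to update the prior with, so its posterior over $\alpha$ is just the half-Cauchy prior.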

Let's summarize the model we want to build like this:

- prior for alpha -> pos-half-cauchy(0, 1)
- model for p -> dirichlet(alpha)
- likelihood for k -> multinomial(p)

Since the prior parameters for the Cauchy distribution are fixed at location 0 and scale 1, we hard-code them in the model instead of passing them in as data. The unknown quantities, `alpha` and `p`, go in the `parameters` block of the model, while the observed survey counts go in the `data` block.

In [54]:
stan_code = '''
data {
    int<lower=0> N;  //num of candidates
    int<lower=0> S;  //num of surveys in a state
    int<lower=0> polls[S, N];   //matrix with survey results   
}

parameters {
    vector<lower=0>[N] alpha;    //dirichlet parameter
    simplex[N] p[S];  //multinomial parameter
}

model {
    
    alpha ~ cauchy(0,1);
    
    for (j in 1:S) {
        p[j] ~ dirichlet(alpha);
        polls[j] ~ multinomial(p[j]);
    }
    

}
'''

stan_model = pystan.StanModel(model_code=stan_code)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_3569e5a0cd6c3c10792bde5f98dd5a51 NOW.


Now we can run this model for each state if we iteratively define the data:

In [55]:
poll_results = {}

for state in states:
    survey_data = {
        'N': 4,
        'S': len(survey_results[state]),
        'polls': survey_results[state]
    }
    poll_results[state] = stan_model.sampling(data=survey_data)
    
print(poll_results)



{'Alabama': Inference for Stan model: anon_model_3569e5a0cd6c3c10792bde5f98dd5a51.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

           mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
alpha[1]   5.04    0.79  33.35   0.04   0.38   0.99   2.48  25.91   1797    1.0
alpha[2]   4.31    0.47  24.42   0.03   0.37   0.94   2.35  24.67   2685    1.0
alpha[3]   9.12    3.51 178.61   0.04   0.42   0.99   2.41  27.87   2585    1.0
alpha[4]   8.21    2.63 162.59   0.04   0.41   0.97   2.33  25.78   3816    1.0
lp__      -5.59    0.05   1.82 -10.22  -6.59   -5.2  -4.23  -3.17   1347    1.0

Samples were drawn using NUTS at Thu Oct 24 16:09:39 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1)., 'Alaska': Inference for Stan model: anon_model_3569e5a0cd6c3c10792bde5f98dd5a51.
4 chains, each

In [67]:
def get_survey_data(state):
    survey_data = {
        'N': 4,
        'S': len(survey_results[state]),
        'polls': survey_results[state]
    }
    return survey_data

## Simulation time

Use the posterior samples to predict the outcome of the presidential elections.

* Predict the probability that each candidate **will win each state**.
   * Use the posterior $\alpha$ samples to generate posterior predictive samples for $p$ — the proportion of votes each candidate would get in each state in an election.
   * Use these $p$ samples to estimate the probability that each candidate will win each state.
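
One way to turn $p$ samples into win probabilities is to count, for each candidate, the fraction of samples in which that candidate has the largest vote share. The sketch below assumes a plurality winner; `win_probabilities` and `toy_p_samples` are hypothetical names with made-up numbers, not output from the model.

```python
def win_probabilities(p_samples):
    """Fraction of samples in which each candidate has the largest vote share."""
    n_candidates = len(p_samples[0])
    wins = [0] * n_candidates
    for p in p_samples:
        winner = max(range(n_candidates), key=lambda j: p[j])
        wins[winner] += 1
    return [w / len(p_samples) for w in wins]

# Hypothetical posterior predictive draws of p for one state (made-up numbers).
toy_p_samples = [
    [0.46, 0.44, 0.07, 0.03],
    [0.41, 0.47, 0.09, 0.03],
    [0.48, 0.42, 0.06, 0.04],
    [0.45, 0.43, 0.08, 0.04],
]
print(win_probabilities(toy_p_samples))  # [0.75, 0.25, 0.0, 0.0]
```

Note that this differs from averaging the $p$ samples: the average gives the expected vote share, whereas the count above gives the probability of finishing first. With NumPy arrays the same count could be written as `np.bincount(np.argmax(p_pred, axis=1), minlength=4) / len(p_pred)`.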


In [68]:
def get_ps(results):

    samples = results.extract()

    # Make a new array with same dimensions as alpha
    p_predicted = np.empty(samples['alpha'].shape)
    # Generate one p sample for each alpha sample
    for i in range(samples['alpha'].shape[0]):
        p_predicted[i] = stats.dirichlet(samples['alpha'][i]).rvs()
    
    return p_predicted

In [69]:
for state in states:
    results = poll_results[state]
    p_pred = get_ps(results)
    # Note: the mean of the p samples is the expected vote share of each
    # candidate, not the probability of winning the state.
    print("Mean predicted vote share in", state)
    print(np.mean(p_pred, axis=0))

Mean predicted vote share in Alabama
[0.25164425 0.24773114 0.25105299 0.24957162]
Mean predicted vote share in Alaska
[0.40733986 0.42327298 0.1163932  0.05299395]
Mean predicted vote share in Arizona
[0.45215685 0.46796059 0.06194643 0.01793613]
Mean predicted vote share in Arkansas
[0.33307032 0.58605135 0.04916752 0.03171081]
Mean predicted vote share in Colorado
[0.46351758 0.42250452 0.08466454 0.02931337]


* Predict the probability that each candidate **will win the presidential election.**
   * Use the posterior predictive probability that each candidate will win each state to generate samples over the total number Electoral College votes each candidate would get in an election.
   * Use the total number of votes to generate samples over who would win the election.
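
As an alternative to drawing each state's winner at random with `np.random.choice`, the winner can be taken as the candidate with the largest sampled vote share in each draw. The sketch below assumes winner-take-all by plurality in every state; `simulate_elections`, `toy_votes`, and `toy_p` are hypothetical names and the numbers are made up for illustration.

```python
def simulate_elections(p_by_state, votes_by_state, n_candidates=4):
    """Count how often each candidate collects the most electoral votes.

    p_by_state maps a state to a list of sampled vote-share vectors; draw i
    of every state is combined into simulated election i.
    """
    n_draws = min(len(samples) for samples in p_by_state.values())
    election_wins = [0] * n_candidates
    for i in range(n_draws):
        totals = [0] * n_candidates
        for state, samples in p_by_state.items():
            p = samples[i]
            # Plurality winner of this state in this simulated election.
            winner = max(range(n_candidates), key=lambda j: p[j])
            totals[winner] += votes_by_state[state]
        # Ties in electoral votes go to the lowest index in this sketch.
        overall = max(range(n_candidates), key=lambda j: totals[j])
        election_wins[overall] += 1
    return [w / n_draws for w in election_wins]

# Made-up example: two states, two posterior predictive draws each.
toy_votes = {"A": 9, "B": 3}
toy_p = {
    "A": [[0.50, 0.40, 0.07, 0.03], [0.40, 0.50, 0.07, 0.03]],
    "B": [[0.30, 0.60, 0.07, 0.03], [0.30, 0.60, 0.07, 0.03]],
}
result = simulate_elections(toy_p, toy_votes)
print(result)  # [0.5, 0.5, 0.0, 0.0]
```

Using the plurality winner tends to give third-party candidates far fewer simulated state wins than sampling the winner in proportion to vote share, since a candidate polling near 5% almost never has the largest share in a draw.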

In [86]:
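ps_states = {}
for state in states:     # retrieving sample probabilities for each state
    # This re-runs the sampler for every state; the fits already stored in
    # poll_results could be reused instead.
    state_data = get_survey_data(state)
    results = stan_model.sampling(data=state_data)
    ps_states[state] = get_ps(results)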
                   



In [87]:
# Let's simulate 4000 different elections:
electoral_votes = {
    'Alabama': 9,
    'Alaska': 3,
    'Arizona': 11,
    'Arkansas': 6,
    'Colorado': 9,
}
election_wins = [0, 0, 0, 0]

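for election in range(4000):   # 4000 simulated elections, one per posterior draw

    votes = [0, 0, 0, 0]

    for state in states:       # simulating each state's results
        p = ps_states[state][election]
        # The state winner is drawn at random with probability equal to the
        # sampled vote share p, rather than taken as the candidate with the
        # largest share.
        winner = np.random.choice(range(0, 4), p=p)

        votes[winner] += electoral_votes[state]

    # Most electoral votes wins; np.argmax breaks ties in favor of the lowest index.
    election_wins[np.argmax(votes)] += 1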

In [88]:
print("Chances of winning the election:")

print("Total: ", election_wins)
print("Percentage of winning: ", election_wins/np.sum(election_wins))

Chances of winning the election:
Total:  [1739, 1985, 209, 67]
Percentage of winning:  [0.43475 0.49625 0.05225 0.01675]
