# 2024 Harris vs. Trump Election Prediction
This exercise follows [ritvikmath](https://www.youtube.com/watch?v=O5-A2ensKb0)'s guidance on Bayesian methodology: the 2020 election results serve as the `prior`, current polling data for 7 swing states serves as the `likelihood`, and together they give the 2024 election prediction (`posterior`). The prediction allows for a baseline level of uncertainty, which usually shrinks as we get closer to election day.
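
As a minimal sketch of that prior-to-posterior update for a single state, the snippet below assumes a Beta prior built from a 2020 vote share and a Binomial likelihood from one poll. All numbers are made up for illustration, and the helper name `beta_posterior` is hypothetical; it is not taken from the video or from the cells in this notebook.

```python
import random

def beta_posterior(prior_share, prior_strength, poll_share, poll_size):
    """Return (alpha, beta) of the Beta posterior for one candidate's share.

    prior_share / prior_strength: the 2020 vote share, counted as if it came
    from `prior_strength` pseudo-voters (an assumption chosen by the analyst).
    poll_share / poll_size: share and sample size of a current poll.
    """
    # Beta(1, 1) is the flat starting point; pseudo-counts are added on top of it
    alpha = 1 + prior_share * prior_strength + poll_share * poll_size
    beta = 1 + (1 - prior_share) * prior_strength + (1 - poll_share) * poll_size
    return alpha, beta

# Illustrative numbers only: 49.5% in 2020 (worth 500 voters), 51% in a poll of 1000
alpha, beta = beta_posterior(0.495, 500, 0.51, 1000)
posterior_mean = alpha / (alpha + beta)

# Draw plausible shares from the posterior and count how often the candidate leads
rng = random.Random(0)
draws = [rng.betavariate(alpha, beta) for _ in range(10_000)]
p_lead = sum(d > 0.5 for d in draws) / len(draws)

print(f"posterior mean share: {posterior_mean:.4f}")
print(f"P(share > 50%): {p_lead:.2f}")
```

The larger `prior_strength` is relative to `poll_size`, the more the posterior stays near the 2020 result; this is the same trade-off that `weight_vote` and `weight_poll` control later in the notebook.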

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import re
from bs4 import BeautifulSoup
import os 
print(os.getcwd()) #print the current working directory

c:\Users\sherie.lin\OneDrive - AES Corporation\Documents\Python Scripts


In [None]:
# Set pandas to display all columns
pd.set_option('display.max_columns', None) #None as no column limit for display

In [None]:
cwd = os.getcwd()
relative_path = "Learning/votingState.txt"
file_path = os.path.join(cwd, relative_path)

In [None]:
with open(file_path, 'r') as file:
    text = file.read()
    print(text[:100])


Alabama - 9 votes

Kentucky - 8 votes

North Dakota - 3 votes

Alaska - 3 votes

Louisiana - 8 votes


In [None]:
# Regex pattern to find state name
pattern = r"^[A-Za-z\s]+(?=\s*-\s*\d+\s*votes\n)"
voting_states = re.findall(pattern, text, flags=re.M)
voting_states = [match.strip() for match in voting_states]
print(voting_states)

['Alabama', 'Kentucky', 'North Dakota', 'Alaska', 'Louisiana', 'Ohio', 'Arizona', 'Maine', 'Oklahoma', 'Arkansas', 'Maryland', 'Oregon', 'California', 'Massachusetts', 'Pennsylvania', 'Colorado', 'Michigan', 'Rhode Island', 'Connecticut', 'Minnesota', 'South Carolina', 'Delaware', 'Mississippi', 'South Dakota', 'District of Columbia', 'Missouri', 'Tennessee', 'Florida', 'Montana', 'Texas', 'Georgia', 'Nebraska', 'Utah', 'Hawaii', 'Nevada', 'Vermont', 'Idaho', 'New Hampshire', 'Virginia', 'Illinois', 'New Jersey', 'Washington', 'Indiana', 'New Mexico', 'West Virginia', 'Iowa', 'New York', 'Wisconsin', 'Kansas', 'North Carolina']


* `^` anchors the match at the start of a line
* `re.M` lets `^` match at the start of every line, not only at the start of the whole string
* `str.strip()` removes any leading or trailing whitespace
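
To see the effect of `re.M` in isolation, here is a small sketch on an inline sample string. The pattern is a simplified variation of the one above (it uses `[A-Za-z ]` and `$` instead of `\s` and `\n`), so treat it as an illustration rather than the notebook's exact pattern.

```python
import re

sample = "Alabama - 9 votes\n\nNorth Dakota - 3 votes\n\nDistrict of Columbia - 3 votes\n"

# Letters and plain spaces only, so a match can never run across a line break
pattern = r"^[A-Za-z ]+(?=\s*-\s*\d+\s*votes$)"

# Without re.M, ^ and $ only anchor at the start/end of the whole string
without_flag = re.findall(pattern, sample)

# With re.M, ^ and $ anchor at the start and end of every line
with_flag = [m.strip() for m in re.findall(pattern, sample, flags=re.M)]

print(without_flag)
print(with_flag)
```

Without the flag nothing matches, because the first line does not end where the whole string ends; with the flag each non-empty line yields one state name.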

## Read 2020 Election Data

In [None]:
cwd = os.getcwd()
relative_path = "Learning/2020_election_results.txt"
file_path = os.path.join(cwd, relative_path)

In [None]:
# Read the 2020 election results text file
try:
    with open(file_path, 'r') as file:
        text = file.read()
    print(text[:200]) # display the first 200 characters to confirm success
except FileNotFoundError:
    print(f"File not found at {file_path}")
except UnicodeDecodeError:
    print("Encoding issue while reading the file. Try another encoding.")


STATE RESULTS
President: Alabama
9 Electoral Votes
Trump
PROJECTED WINNER
+ FOLLOW
Candidate	%		Votes
Trump
62.0%	
1,441,170
Biden
36.6%	
849,624
Est. 99% In
Updated 10:17 p.m. ET, Mar. 6
Full Details


Use [`regex`](https://www.w3schools.com/python/python_regex.asp) to extract the pattern below, then compile the matches into a DataFrame:

In [None]:
pattern = (
    # Lines begin with 'President', followed by the state name into 'state' column
    r"President:\s*(?P<state>[A-Za-z\s]+)\n"
    r"(?P<electoral_votes>\d+)\s*Electoral Votes\n"
    r"(?P<winner>Trump|Biden)\nPROJECTED WINNER\n\+ FOLLOW\n"
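    # Tie votes to name; the optional percentage and any whitespace (tab/newline)
    # before each vote count are skipped, so `.` and `\s` must both be escaped
    r"Candidate\t%\t\tVotes\n(?P<candidate1>Trump|Biden)\n(?:\d+\.\d+%)?\s*\n?(?P<votes1>[\d,]+)\n(?P<candidate2>Trump|Biden)\n(?:\d+\.\d+%)?\s*\n?(?P<votes2>[\d,]+)"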
)

In [None]:
# Compile the regex pattern
regex = re.compile(pattern)

# findall would return a list of tuples:
# matches = regex.findall(text)
# finditer returns an iterator of match objects instead
matches = regex.finditer(text)

In [None]:
# matches

[('Alabama', '9', 'Trump', 'Trump', '1,441,170', 'Biden', '36'), ('Alaska', '3', 'Trump', 'Trump', '189,951', 'Biden', '42'), ('Arizona\nParty change\nBATTLEGROUND', '11', 'Biden', 'Biden', '1,672,143', 'Trump', '49'), ('Arkansas', '6', 'Trump', 'Trump', '760,647', 'Biden', '34'), ('California', '55', 'Biden', 'Biden', '11,110,250', 'Trump', '34'), ('Colorado\nBATTLEGROUND', '9', 'Biden', 'Biden', '1,804,352', 'Trump', '41'), ('Connecticut', '7', 'Biden', 'Biden', '1,080,831', 'Trump', '39'), ('Delaware', '3', 'Biden', 'Biden', '296,268', 'Trump', '39'), ('District Of Columbia', '3', 'Biden', 'Biden', '317,323', 'Trump', '5'), ('Florida\nBATTLEGROUND', '29', 'Trump', 'Trump', '5,668,731', 'Biden', '47'), ('Georgia\nParty change\nBATTLEGROUND', '16', 'Biden', 'Biden', '2,473,633', 'Trump', '49'), ('Hawaii', '4', 'Biden', 'Biden', '366,130', 'Trump', '34'), ('Idaho', '4', 'Trump', 'Trump', '554,119', 'Biden', '33'), ('Illinois', '20', 'Biden', 'Biden', '3,471,915', 'Trump', '40'), ('Indi

With `re.findall()`, `matches` is a list of tuples whose fields can only be accessed by index, such as `match[0]`. With `re.finditer()`, each match is a match object, so fields can be read by name with `.group()`.
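
The difference can be sketched on a two-state sample. The text and the shortened pattern below are illustrative stand-ins, not the full results file or the full pattern used in this notebook.

```python
import re

text = "President: Alabama\n9 Electoral Votes\nPresident: Alaska\n3 Electoral Votes\n"
regex = re.compile(
    r"President:\s*(?P<state>[A-Za-z ]+)\n"
    r"(?P<electoral_votes>\d+)\s*Electoral Votes"
)

# findall: one tuple per match, fields reachable only by position
as_tuples = regex.findall(text)
first_state_by_index = as_tuples[0][0]

# finditer: one match object per match, fields reachable by group name
as_dicts = [m.groupdict() for m in regex.finditer(text)]
first_state_by_name = as_dicts[0]["state"]

print(as_tuples)
print(as_dicts)
```

Note that `finditer()` returns a one-shot iterator: once a loop has consumed it, a second loop sees nothing, so wrap it in `list()` if the matches are needed more than once.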

In [None]:
# Process the matches
results = []
for match in matches:
    candidate1 = match.group('candidate1')
    votes1 = int(match.group('votes1').replace(',',''))
    candidate2 = match.group('candidate2')
    votes2 = int(match.group('votes2').replace(',',''))  

    # Assign votes to candidates
    if candidate1 == 'Trump':
        trump_votes, biden_votes = votes1, votes2
    else:
        trump_votes, biden_votes = votes2, votes1
    
    # Append the results
    results.append({
        "State": match.group('state'), #instead of match[0] so that we prevent indexing error
        # winner takes all
        "Electoral Votes": int(match.group('electoral_votes')),
        "Winner": match.group('winner'),
        "Trump Votes": trump_votes,
        "Biden Votes": biden_votes
    })

voting_df = pd.DataFrame(results)


In [None]:
voting_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   State            49 non-null     object
 1   Electoral Votes  49 non-null     int64 
 2   Winner           49 non-null     object
 3   Trump Votes      49 non-null     int64 
 4   Biden Votes      49 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 2.0+ KB


In [None]:
# Winner takes all
voting_df['Trump Electoral Votes'] = voting_df.apply(lambda row: row['Electoral Votes'] if row.Winner == 'Trump' else 0, 1) 
voting_df['Biden Electoral Votes'] = voting_df.apply(lambda row: row['Electoral Votes'] if row.Winner == 'Biden' else 0, 1)

In [None]:
voting_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   State                  49 non-null     object
 1   Electoral Votes        49 non-null     int64 
 2   Winner                 49 non-null     object
 3   Trump Votes            49 non-null     int64 
 4   Biden Votes            49 non-null     int64 
 5   Trump Electoral Votes  49 non-null     int64 
 6   Biden Electoral Votes  49 non-null     int64 
dtypes: int64(5), object(2)
memory usage: 2.8+ KB


As the raw matches show (e.g. `'Arizona\nParty change\nBATTLEGROUND'`), the `State` column needs cleaning: split on the newline character (`\n`) and keep only the first element.

In [None]:
voting_df.State = voting_df.State.apply(lambda x: x.split('\n')[0]) 

Check for missing data. A `set` is optimized for operations like `difference` or `intersection`, which is faster than comparing two lists directly, and it automatically removes duplicates.

In [None]:
present_states = set(voting_df.State)
missing_states = set(voting_states) - present_states
print(missing_states)

{'District of Columbia', 'Nebraska', 'Maine'}


In [None]:
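voting_df.State.unique  # no parentheses: this displays the bound method (and the Series it wraps); call .unique() to get only the distinct values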

<bound method Series.unique of 0                  Alabama
1                   Alaska
2                  Arizona
3                 Arkansas
4               California
5                 Colorado
6              Connecticut
7                 Delaware
8     District Of Columbia
9                  Florida
10                 Georgia
11                  Hawaii
12                   Idaho
13                Illinois
14                 Indiana
15                    Iowa
16                  Kansas
17                Kentucky
18               Louisiana
19                Maryland
20           Massachusetts
21                Michigan
22               Minnesota
23             Mississippi
24                Missouri
25                 Montana
26                  Nevada
27           New Hampshire
28              New Jersey
29              New Mexico
30                New York
31          North Carolina
32            North Dakota
33                    Ohio
34                Oklahoma
35                  Oreg

`voting_df['State']` is a pandas Series, so we use `isin()`

In [None]:
na_df = voting_df[voting_df['State'].isin(missing_states)] 
print(na_df)

Empty DataFrame
Columns: [State, Electoral Votes, Winner, Trump Votes, Biden Votes, Trump Electoral Votes, Biden Electoral Votes]
Index: []


In [None]:
dc_rows = voting_df[voting_df['State'] == 'District Of Columbia']
print("Rows with District of Columbia:")
print(dc_rows)


Rows with District of Columbia:
                  State  ...  Biden Electoral Votes
8  District Of Columbia  ...                      3

[1 rows x 7 columns]


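The mismatch turns out to be a letter-case issue: the results file spells it `District Of Columbia`, while the state list has `District of Columbia`. Only Maine and Nebraska are actually missing.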
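
One way to keep such case differences from showing up as false "missing" entries is to compare on a case-folded key. This is a small sketch with hypothetical lists, not the notebook's actual `voting_states` and `voting_df` objects.

```python
expected = ["Maine", "Nebraska", "District of Columbia", "Ohio"]
parsed = ["District Of Columbia", "Ohio"]

# Plain set difference is case-sensitive, so DC is reported as missing
naive_missing = set(expected) - set(parsed)

# Compare on a case-folded key, but report the original spelling
parsed_keys = {name.casefold() for name in parsed}
truly_missing = sorted(name for name in expected if name.casefold() not in parsed_keys)

print(sorted(naive_missing))
print(truly_missing)
```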

In [None]:
# Add missing data
missing_df = pd.DataFrame(
    columns=voting_df.columns,
    data=[['Maine', 4, 'Biden', 360737, 435072, 1, 3], ['Nebraska', 5, 'Trump', 556846, 374583, 4, 1]]
)

voting_df = pd.concat([voting_df, missing_df])

voting_df = voting_df.sort_values('State').reset_index(drop=True)

In [None]:
voting_df.head()

        State  Electoral Votes Winner  Trump Votes  Biden Votes  \
0     Alabama                9  Trump      1441170           36   
1      Alaska                3  Trump       189951           42   
2     Arizona               11  Biden           49      1672143   
3    Arkansas                6  Trump       760647           34   
4  California               55  Biden           34     11110250   

   Trump Electoral Votes  Biden Electoral Votes  
0                      9                      0  
1                      3                      0  
2                      0                     11  
3                      6                      0  
4                      0                     55  


## Scrape latest state-level polling data

Source: [538 Polls data](https://projects.fivethirtyeight.com/polls/president-general)

In [None]:
url

In [None]:
def get_latest_poll(url, state):

## 

In [None]:

def sum_election(df, weight_vote, weight_poll, baseline_uncertainty=None, alpha_beta_dist=False):
    """
    Simulate the 2024 election based on the 2020 election and recent polling.
    - weight_vote: level of trust in (predictive power of) the 2020 election result for the 2024 election
    - weight_poll: level of trust in (predictive power of) the most recent polling data for the 2024 election
    - baseline_uncertainty: the amount of uncertainty state by state (+- x%)
    - alpha_beta_dist: (the weighted average is calculated from weight_vote and weight_poll)
        alpha is the number of successes + 1: the weighted average number of voters who'd vote for Harris;
        beta is the number of failures + 1: the weighted average number of voters who'd vote for Trump
    
    """


In [None]:
# no. of simulations
n_sim = 250

# baseline uncertainty
baseline_uncertainty = 0.01 #ie.1% (+- 1%)

In [None]:
def simulate_election(df, weight_vote, weight_poll, baseline_uncertainty=None, return_alphas_betas=False):
    """
    Simulate one election using the joined dataframe df.
    weight_vote is our level of trust in the 2020 election being predictive of the 2024 election
    weight_poll is our level of trust in the most recent polling data being predictive of the 2024 election

    In a Beta(alpha, beta) distribution:
    - alpha is the number of "successes" + 1
    - beta is the number of "failures" + 1
    Here we thus set:
    - alpha as the weighted average number of voters who'd vote for Harris (+ 1)
    - beta as the weighted average number of voters who'd vote for Trump (+ 1)
    These weighted averages use the weight_vote and weight_poll arguments.
    """
    
    #get indices for each logical polling situation
    exists_trump_harris_polling = df.exists_trump_harris_poll.values
    exists_trump_biden_polling = df.exists_trump_biden_poll.values
    not_exists_polling = (~exists_trump_harris_polling) & (~exists_trump_biden_polling)
    
    #if we're not enforcing uncertainty, use the number of votes as the sample size
    if baseline_uncertainty is None:
        n_votes_vals = df['N Votes']
        #no extra uncertainty is imposed here, so the same sizes are reused below
        #(without these two names the masked assignments further down raise a NameError)
        n_votes_missing_harris_polling_vals = n_votes_vals
        n_votes_missing_polling_vals = n_votes_vals
    #otherwise, set sample size to allow uncertainty to be set at the given value
    #if no Trump/Harris poll for a state, but there is Trump/Biden poll, 1.5x the uncertainty
    #if no poll for a state, 2x the uncertainty
    else:
        n_votes_vals = 1/(4*baseline_uncertainty**2) - 3
        n_votes_missing_harris_polling_vals = 1/(4*(1.5*baseline_uncertainty)**2) - 3
        n_votes_missing_polling_vals = 1/(4*(2*baseline_uncertainty)**2) - 3
     
    #posterior alphas and betas for the Beta distribution of p(Harris) winning
    alphas = weight_vote * n_votes_vals * df['Biden Vote Frac'] + weight_poll * n_votes_vals * df['Harris Poll Frac'] + 1
    
    betas = weight_vote * n_votes_vals * df['Trump Vote Frac'] + weight_poll * n_votes_vals * df['Trump Poll Frac'] + 1
    
    #for states that do not have Trump/Harris polling but do have Trump/Biden polling
    alphas[exists_trump_biden_polling] = weight_vote * n_votes_missing_harris_polling_vals * df.iloc[exists_trump_biden_polling]['Biden Vote Frac'] + weight_poll * n_votes_missing_harris_polling_vals * df.iloc[exists_trump_biden_polling]['Biden Poll Frac'] + 1
    betas[exists_trump_biden_polling] = weight_vote * n_votes_missing_harris_polling_vals * df.iloc[exists_trump_biden_polling]['Trump Vote Frac'] + weight_poll * n_votes_missing_harris_polling_vals * df.iloc[exists_trump_biden_polling]['Trump Poll Frac'] + 1
    
    #for states that have no polling data at all
    alphas[not_exists_polling] = weight_vote * n_votes_missing_polling_vals * df.iloc[not_exists_polling]['Biden Vote Frac'] + weight_poll * n_votes_missing_polling_vals * HARRIS_NATIONAL_POLL_FRAC + 1
    betas[not_exists_polling] = weight_vote * n_votes_missing_polling_vals * df.iloc[not_exists_polling]['Trump Vote Frac'] + weight_poll * n_votes_missing_polling_vals * TRUMP_NATIONAL_POLL_FRAC + 1

    #using these alphas and betas, simulate the probability that Harris would win
    p_wins = [np.random.beta(a,b) for a,b in zip(alphas, betas)]
    harris_wins = np.array([p > 0.5 for p in p_wins])
    harris_evotes = df[harris_wins]['Electoral Votes'].sum()
    trump_evotes = df[~harris_wins]['Electoral Votes'].sum()
    
    if return_alphas_betas:
        return harris_evotes, trump_evotes, alphas, betas
    return harris_evotes, trump_evotes 
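
The sample-size expression `1/(4*baseline_uncertainty**2) - 3` is not explained in the cell above. One reading (an interpretation, not something the source states) is that it comes from the standard deviation of a Beta distribution: for an evenly split state with weights summing to 1, `alpha = beta = n/2 + 1`, the variance is `0.25 / (n + 3)`, and setting the standard deviation equal to the target uncertainty `u` gives `n = 1/(4u²) - 3`. The helper names below are hypothetical; the check uses only the standard library.

```python
import math

def beta_std(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return math.sqrt(alpha * beta / (total ** 2 * (total + 1)))

def n_votes_for(uncertainty):
    """Pseudo-sample size from the expression used in simulate_election."""
    return 1 / (4 * uncertainty ** 2) - 3

for u in (0.005, 0.01, 0.015, 0.02):
    n = n_votes_for(u)
    # A 50/50 state with weights summing to 1 gets alpha = beta = n/2 + 1
    alpha = beta = n / 2 + 1
    print(f"u={u:.3f}  n={n:8.1f}  std of Beta={beta_std(alpha, beta):.4f}")
```

Under this reading, the 1.5x and 2x multipliers in the function simply request a wider Beta distribution for states with weaker polling coverage.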

In [None]:
results = []
# iterate over several choices of voting and polling weights
for weight_vote in np.arange(0.01, 1.01, 0.01):
    weight_vote = round(weight_vote, 10)
    weight_poll = round(1-weight_vote, 10)
    if weight_poll < 0:
        continue
    print(weight_vote, weight_poll)
    # run n_sim simulations for this pair of weights
    for _ in range(n_sim):
        harris_evotes, trump_evotes = simulate_election(joined, weight_vote, weight_poll, baseline_uncertainty)
        results.append([weight_vote, weight_poll, harris_evotes, trump_evotes])
results = pd.DataFrame(columns=['weight_vote', 'weight_poll', 'harris_evotes', 'trump_evotes'], data=results)


In [None]:
# aggregate by the weight given to polling data
stats = results.groupby('weight_poll').agg(
    avg_harris_evotes = pd.NamedAgg('harris_evotes', 'mean'),
    dev_harris_evotes = pd.NamedAgg('harris_evotes', 'std'),
    avg_trump_evotes = pd.NamedAgg('trump_evotes', 'mean'),
    dev_trump_evotes = pd.NamedAgg('trump_evotes', 'std')
).reset_index()

In [None]:
# plot results
plt.figure(figsize=(12,6))
plt.errorbar(stats.weight_poll, stats.avg_harris_evotes, yerr=stats.dev_harris_evotes, color='cornflowerblue', linewidth=2, capsize=2)
plt.errorbar(stats.weight_poll, stats.avg_trump_evotes, yerr=stats.dev_trump_evotes, color='firebrick', linewidth=2, capsize=2)
plt.legend(['Harris', 'Trump'], fontsize=16, loc=1)
plt.ylim(200,340)
plt.xticks(np.arange(0, 1.1, 0.1), fontsize=18)
plt.xlabel('Polling Trust Level', fontsize=22)
plt.yticks(np.arange(200, 350, 20), fontsize=18)
plt.ylabel('Electoral Votes', fontsize=22)
plt.tight_layout()
plt.savefig('election_simulation.png',dpi=250)


Interpretation:

* X-axis: polling trust level from 0.0 to 1.0, i.e. from no trust to full trust in the polls
* Y-axis: average electoral votes per candidate, with error bars of one standard deviation

Rather than picking a single polling trust level, the plot below shows a distribution over all possible trust levels. The trust level is assumed to follow a bell-shaped distribution (a Beta distribution in the code): we shouldn't be reading from a trust level below 0.4 or above 0.95, and the center of mass is 0.7.

In [None]:
plt.figure(figsize=(10,3))
factor = 2
beta_dist = np.random.beta(7*factor, 3*factor, 100000)
sns.histplot(beta_dist, stat='density', color='green', alpha=0.3)
plt.xticks(np.arange(0, 1.1, 0.1), fontsize=18)
plt.xlabel('Polling Trust Level', fontsize=18)
plt.ylabel('Density', fontsize=18)
plt.tight_layout()
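
The shape of this trust distribution can also be checked numerically. The sketch below uses the closed-form mean and standard deviation of a Beta distribution, plus the binomial-tail identity for the CDF, which holds only for integer parameters such as the 14 and 6 used here. The function name `beta_cdf_int` is hypothetical.

```python
from math import comb, sqrt

a, b = 14, 6  # 7*factor and 3*factor with factor = 2

mean = a / (a + b)
std = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) for integer a and b, via the binomial-tail identity."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))

# Probability mass between the two trust levels named in the text
mass_inside = beta_cdf_int(0.95, a, b) - beta_cdf_int(0.4, a, b)

print(f"mean={mean:.2f}  std={std:.3f}  P(0.4 <= trust <= 0.95)={mass_inside:.4f}")
```

Raising `factor` keeps the mean at 0.7 but narrows the distribution, which is one way to express more confidence in a particular trust level.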


Weighting the average electoral votes at each trust level by the density of that trust level gives the final predicted electoral votes per candidate:

In [None]:
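from scipy.stats import beta  # provides beta.pdf; not part of the imports at the top of the notebook

stats['prob'] = stats.apply(lambda row: beta.pdf(row.weight_poll, 7*factor, 3*factor), 1)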
stats.prob = stats.prob / stats.prob.sum()
harris_pred = round(sum(stats.prob * stats.avg_harris_evotes))
trump_pred = round(sum(stats.prob * stats.avg_trump_evotes))
print(f'Predicted Harris Electoral Votes: {harris_pred}')
print(f'Predicted Trump Electoral Votes: {trump_pred}')
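
The same weighting step can be sketched without pandas or SciPy. The per-trust-level averages below are a made-up linear trend used only to show the mechanics; they are not results from the simulation, and `beta_pdf` is a hypothetical helper.

```python
from math import exp, lgamma, log

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at 0 < x < 1, computed in log space."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x))

# Hypothetical averages per trust level (made-up trend, not real output)
trust_levels = [i / 100 for i in range(1, 100)]
avg_harris = [300 - 60 * w for w in trust_levels]
avg_trump = [538 - h for h in avg_harris]

# Turn densities into weights that sum to 1 over the grid
density = [beta_pdf(w, 14, 6) for w in trust_levels]
total = sum(density)
weights = [d / total for d in density]

# Expected electoral votes under the trust distribution
harris_pred = sum(w * h for w, h in zip(weights, avg_harris))
trump_pred = sum(w * t for w, t in zip(weights, avg_trump))
print(round(harris_pred), round(trump_pred))
```

Normalizing the densities over the grid matters: the raw `pdf` values do not sum to 1, so without that step the weighted sums would not be averages.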

#### Outputs by scenario

| Baseline Uncertainty | Personal Polling Distribution | Harris | Trump |
| :------------------- | :---------------------------: | :----: | ----: |
| 1%                   | Centered at 0.7               | 270    | 268   |
| 0.5%                 | Centered at 0.7               | 259    | 279   |
| 1%                   | Centered at 0.1               | 284    | 254   |

* [Edited 11/26/2024]: Interestingly, the final result of the 2024 election was 226 electoral votes for Kamala Harris and 312 electoral votes for Donald J. Trump.