# 2024 Harris vs. Trump Election Prediction
The practice followed [ritvikmath](https://www.youtube.com/watch?v=O5-A2ensKb0) guidance on utilizing Bayesian methodology, using 2020 election data (`prior`), with current polling data of 7 swing states (`likelihood`) to get 2024 election prediction (`posterior`). A baseline level of uncertainty is allowed for the prediction, which usually becomes lower as we get clower to the date of election.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import re
from bs4 import BeautifulSoup
import urllib3 # to bypass SSL verification and suppress warnings (not recommended for production level)
import os 
print(os.getcwd()) #print the current working directory

c:\Users\sherie.lin\OneDrive - AES Corporation\Documents\Python Scripts\Learning


In [None]:
# import urllib3

In [2]:
# Set pandas to display all columns
pd.set_option('display.max_columns', None) #None as no column limit for display

In [3]:
cwd = os.getcwd()
relative_path = "votingState.txt"
file_path = os.path.join(cwd, relative_path)

In [4]:
with open(file_path, 'r') as file:
    text = file.read()
    print(text[:100])


Alabama - 9 votes

Kentucky - 8 votes

North Dakota - 3 votes

Alaska - 3 votes

Louisiana - 8 votes


In [5]:
# Regex pattern to find state name
pattern = r"^[A-Za-z\s]+(?=\s*-\s*\d+\s*votes\n)"
voting_states = re.findall(pattern, text, flags=re.M)
voting_states = [match.strip() for match in voting_states]
print(voting_states)

['Alabama', 'Kentucky', 'North Dakota', 'Alaska', 'Louisiana', 'Ohio', 'Arizona', 'Maine', 'Oklahoma', 'Arkansas', 'Maryland', 'Oregon', 'California', 'Massachusetts', 'Pennsylvania', 'Colorado', 'Michigan', 'Rhode Island', 'Connecticut', 'Minnesota', 'South Carolina', 'Delaware', 'Mississippi', 'South Dakota', 'District of Columbia', 'Missouri', 'Tennessee', 'Florida', 'Montana', 'Texas', 'Georgia', 'Nebraska', 'Utah', 'Hawaii', 'Nevada', 'Vermont', 'Idaho', 'New Hampshire', 'Virginia', 'Illinois', 'New Jersey', 'Washington', 'Indiana', 'New Mexico', 'West Virginia', 'Iowa', 'New York', 'Wisconsin', 'Kansas', 'North Carolina']


* `^` anchor the start of each line
* `re.M` allows `^` to work on each line independently
* `str.strip()` to remove any leading or trailing whitespace

In [6]:
# Format state names for URLs
def format_state_for_url(state_name):
    # Convert to lowercase and replace spaces with hyphens
    return state_name.lower().replace(' ', '-')

# Apply the function to all states
formated_states = [format_state_for_url(state_name) for state_name in voting_states]
print(formated_states)

['alabama', 'kentucky', 'north-dakota', 'alaska', 'louisiana', 'ohio', 'arizona', 'maine', 'oklahoma', 'arkansas', 'maryland', 'oregon', 'california', 'massachusetts', 'pennsylvania', 'colorado', 'michigan', 'rhode-island', 'connecticut', 'minnesota', 'south-carolina', 'delaware', 'mississippi', 'south-dakota', 'district-of-columbia', 'missouri', 'tennessee', 'florida', 'montana', 'texas', 'georgia', 'nebraska', 'utah', 'hawaii', 'nevada', 'vermont', 'idaho', 'new-hampshire', 'virginia', 'illinois', 'new-jersey', 'washington', 'indiana', 'new-mexico', 'west-virginia', 'iowa', 'new-york', 'wisconsin', 'kansas', 'north-carolina']


## Read 2020 Election Data

In [7]:
cwd = os.getcwd()
relative_path = "2020_election_results.txt"
file_path = os.path.join(cwd, relative_path)

In [8]:
# read 2020 election result text file and extract 
try:
    with open(file_path, 'r') as file:
        text = file.read()
    print(text[:200]) # display the first 100 characters to confirm success
except FileNotFoundError:
    print(f"File not found at {file_path}")
except UnicodeDecodeError:
    print(f"Encoding issue while reading the file. Try another eoncding.")


STATE RESULTS
President: Alabama
9 Electoral Votes
Trump
PROJECTED WINNER
+ FOLLOW
Candidate	%		Votes
Trump
62.0%	
1,441,170
Biden
36.6%	
849,624
Est. 99% In
Updated 10:17 p.m. ET, Mar. 6
Full Details


[`regex`](https://www.w3schools.com/python/python_regex.asp) to extract expression pattern, and comfile the matches into dataframe:

In [9]:
pattern = (
    # Lines begin with 'President', followed by the state name into 'state' column
    r"President:\s*(?P<state>[A-Za-z\s]+)\n"
    r"(?P<electoral_votes>\d+)\s*Electoral Votes\n"
    r"(?P<winner>Trump|Biden)\nPROJECTED WINNER\n\+ FOLLOW\n"
    # Tie votes to name
    r"Candidate\t%\t\tVotes\n(?P<candidate1>Trump|Biden)\n(?:\d+\.\d+%)?\s*\n?(?P<votes1>[\d,]+)\n(?P<candidate2>Trump|Biden)\n(?:\d+.\d+%)?s*\n?(?P<votes2>[\d,]+)"
)

In [10]:
# Compile the regrex pattern
regex = re.compile(pattern)

# Find all matches in the text and return a list of tuples
# matches = regex.findall(text)
matches = regex.finditer(text)

In [None]:
# matches

[('Alabama', '9', 'Trump', 'Trump', '1,441,170', 'Biden', '36'), ('Alaska', '3', 'Trump', 'Trump', '189,951', 'Biden', '42'), ('Arizona\nParty change\nBATTLEGROUND', '11', 'Biden', 'Biden', '1,672,143', 'Trump', '49'), ('Arkansas', '6', 'Trump', 'Trump', '760,647', 'Biden', '34'), ('California', '55', 'Biden', 'Biden', '11,110,250', 'Trump', '34'), ('Colorado\nBATTLEGROUND', '9', 'Biden', 'Biden', '1,804,352', 'Trump', '41'), ('Connecticut', '7', 'Biden', 'Biden', '1,080,831', 'Trump', '39'), ('Delaware', '3', 'Biden', 'Biden', '296,268', 'Trump', '39'), ('District Of Columbia', '3', 'Biden', 'Biden', '317,323', 'Trump', '5'), ('Florida\nBATTLEGROUND', '29', 'Trump', 'Trump', '5,668,731', 'Biden', '47'), ('Georgia\nParty change\nBATTLEGROUND', '16', 'Biden', 'Biden', '2,473,633', 'Trump', '49'), ('Hawaii', '4', 'Biden', 'Biden', '366,130', 'Trump', '34'), ('Idaho', '4', 'Trump', 'Trump', '554,119', 'Biden', '33'), ('Illinois', '20', 'Biden', 'Biden', '3,471,915', 'Trump', '40'), ('Indi

`matches` is a list of tuples, if using `re.findall()`, and can only access with index such as `match[0]`. Or use `re.finditer()` to return match objects so that you could use `.group()`

In [11]:
# Process the matches
results = []
for match in matches:
    candidate1 = match.group('candidate1')
    votes1 = int(match.group('votes1').replace(',',''))
    candidate2 = match.group('candidate2')
    votes2 = int(match.group('votes2').replace(',',''))  

    # Assign votes to candidates
    if candidate1 == 'Trump':
        trump_votes, biden_votes = votes1, votes2
    else:
        trump_votes, biden_votes = votes2, votes1
    
    # Append the results
    results.append({
        "State": match.group('state'), #instead of match[0] so that we prevent indexing error
        # winner takes all
        "Electoral Votes": int(match.group('electoral_votes')),
        "Winner": match.group('winner'),
        "Trump Votes": trump_votes,
        "Biden Votes": biden_votes
    })

voting_df = pd.DataFrame(results)


In [12]:
voting_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   State            49 non-null     object
 1   Electoral Votes  49 non-null     int64 
 2   Winner           49 non-null     object
 3   Trump Votes      49 non-null     int64 
 4   Biden Votes      49 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 2.0+ KB


In [13]:
# Winner takes all
voting_df['Trump Electoral Votes'] = voting_df.apply(lambda row: row['Electoral Votes'] if row.Winner == 'Trump' else 0, 1) 
voting_df['Biden Electoral Votes'] = voting_df.apply(lambda row: row['Electoral Votes'] if row.Winner == 'Biden' else 0, 1)

In [14]:
voting_df.head()

Unnamed: 0,State,Electoral Votes,Winner,Trump Votes,Biden Votes,Trump Electoral Votes,Biden Electoral Votes
0,Alabama,9,Trump,1441170,36,9,0
1,Alaska,3,Trump,189951,42,3,0
2,Arizona\nParty change\nBATTLEGROUND,11,Biden,49,1672143,0,11
3,Arkansas,6,Trump,760647,34,6,0
4,California,55,Biden,34,11110250,0,55


From above we see that State column requires cleansing: split on newline character(`\n`) and takes only the first index

In [15]:
voting_df.State = voting_df.State.apply(lambda x: x.split('\n')[0]) 

Check if there's missing data: `set()` are optimized for operations like `difference` or `intersection`, faster than comparing 2 lists directly, and it automatically removes duplicates

In [16]:
present_states = set(voting_df.State)
missing_states = set(voting_states) - present_states
print(missing_states)

{'Nebraska', 'Maine', 'District of Columbia'}


In [17]:
voting_df.State.unique

<bound method Series.unique of 0                  Alabama
1                   Alaska
2                  Arizona
3                 Arkansas
4               California
5                 Colorado
6              Connecticut
7                 Delaware
8     District Of Columbia
9                  Florida
10                 Georgia
11                  Hawaii
12                   Idaho
13                Illinois
14                 Indiana
15                    Iowa
16                  Kansas
17                Kentucky
18               Louisiana
19                Maryland
20           Massachusetts
21                Michigan
22               Minnesota
23             Mississippi
24                Missouri
25                 Montana
26                  Nevada
27           New Hampshire
28              New Jersey
29              New Mexico
30                New York
31          North Carolina
32            North Dakota
33                    Ohio
34                Oklahoma
35                  Oreg

`voting_df['State']` is a pandas Series, so we use `isin()`

In [18]:
na_df = voting_df[voting_df['State'].isin(missing_states)] 
print(na_df)

Empty DataFrame
Columns: [State, Electoral Votes, Winner, Trump Votes, Biden Votes, Trump Electoral Votes, Biden Electoral Votes]
Index: []


In [19]:
dc_rows = voting_df[voting_df['State'] == 'District Of Columbia']
print("Rows with District of Columbia:")
print(dc_rows)


Rows with District of Columbia:
                  State  Electoral Votes Winner  Trump Votes  Biden Votes  \
8  District Of Columbia                3  Biden            5       317323   

   Trump Electoral Votes  Biden Electoral Votes  
8                      0                      3  


Found that it's font size issue.

In [20]:
# Add missing data
missing_df = pd.DataFrame(
    columns=voting_df.columns,
    data=[['Maine', 4, 'Biden', 360737, 435072, 1, 3], ['Nebraska', 5, 'Trump', 556846, 374583, 4, 1]]
)

voting_df = pd.concat([voting_df, missing_df])

voting_df = voting_df.sort_values('State').reset_index(drop=True)

In [21]:
voting_df.head()

Unnamed: 0,State,Electoral Votes,Winner,Trump Votes,Biden Votes,Trump Electoral Votes,Biden Electoral Votes
0,Alabama,9,Trump,1441170,36,9,0
1,Alaska,3,Trump,189951,42,3,0
2,Arizona,11,Biden,49,1672143,0,11
3,Arkansas,6,Trump,760647,34,6,0
4,California,55,Biden,34,11110250,0,55


## Scrape latest state-level polling data

Source: [538 Polls data](https://projects.fivethirtyeight.com/polls/president-general)

In [27]:
# Suppress warnings if SSL verification is disabled
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

#### Core Scraper

In [None]:
# Function to scrape the latest polls result for a given state URL
def get_latest_poll(url, state):
    # Fallback: national polling url
    national_url = "https://projects.fivethirtyeight.com/polls/president-general/2024/national/"

    
    #Fetch the page for specific state
    response = requests.get(url, verify=False) #to bypass SSL error tmporarily     instead of # response = requests.get(url)
    # print(response.text) #to printe the raw HTML content 
    # check for the presence of <tr class='visible-row'>
    # if 'class="visible-row"' in response.text:
    #     print("The class 'visible-row' is present in the response.")
    # else:
    #     print("The class 'visible-row' is not found in the response.")

    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all rows for polls
    # latest_polls = soup.find_all('tr', class_='visibl-row') #this fails to process as it ask for exact match
    # Use all_rows and filter
    all_rows = soup.find_all('tr')
    latest_polls = [row for row in all_rows if 'visible-row' in row.get('class', [])]

    if not latest_polls:
        # print(f"No polling data found for state: {state}. Falling back to national polling data...")
        # Fallback to national polling data
        url = national_url 
        response = requests.get(url, verify=False)
        soup = BeautifulSoup(response.content, 'html.parser')
        all_rows = soup.find_all('tr')
        latest_polls = [row for row in all_rows if 'visible-row' in row.get('class', [])]
        print(f"Number of national rows found: {len(latest_polls)}")

        if not latest_polls:
            print("No national polling data available either.")
            return None
    
    results_dict = {
        'State': [],
        'Date': [],
        'Sample Size': [],
        'Pollster': [],
        'Democratic Candidate': [],
        'Democratic Result': [],
        'Trump Result': []
    }

    # for latest_poll in latest_polls:
    try: 
        latest_poll = latest_polls[0] # Ensure only the first row representing the latest poll returned
        # Extract data from the row
        date = latest_poll.find('td', class_='dates hide-mobile').get_text(strip=True) 
        sample_size = latest_poll.find('td', class_='sample hide-mobile').get_text(strip=True)
        pollster = latest_poll.find('div', class_='pollster-name').get_text(strip=True)
        democractic_candidate = latest_poll.find('td', class_='answer first hide-mobile').get_text(strip=True) #dynamically fetch the Democratic's name
        democratic_result = latest_poll.find('td', class_='value hide-mobile hide-default').get_text(strip=True) 
        
        # Extract Trump result
        # trump_td = latest_poll.find_all('td', string='Trump')
        trump_result = None
        # Debugging: Print the entire row containing "Trump"
        # trump_td = latest_poll.find('td', string="Trump")
        # if trump_td:
        #     print("Row containing 'Trump':")
        #     print(latest_poll.prettify())

        for i, td in enumerate(tds):
            if "Trump" in td.get_text(strip=True):
                if i > 0 and 'value hide-moible' in tds[i - 1].get('class', []):
                    trump_result = tds[i - 1].get_text(strip=True)
                break
        # if trump_td:
        #     # Access the row containing "Trump"
        #     parent_row = trump_td.find_parent('tr')
        #     if parent_row:
        #         # Find all <td> elements in the row
        #         row_tds = parent_row.find_all('td')
                
        #         # Traverse the <td> elements to locate the one with the 'value hide-mobile' class
        #         for td in row_tds:
        #             if "value hide-mobile" in td.get("class", []):
        #                 trump_result = td.get_text(strip=True)
        #                 break
        # Find all <tr> elements
        # all_rows = latest_poll.find_all('tr')

        # # Traverse rows to find the one containing "Trump"
        # for row in all_rows:
        #     # Locate the <td> containing "Trump" in the current row
        #     trump_td = row.find('td', string="Trump")
        #     if trump_td:
        #         # Find all <td> elements in this row
        #         row_tds = row.find_all('td')
                
        #         # Debugging: print all <td> elements
        #         print("All <td> elements in the row containing 'Trump':")
        #         for td in row_tds:
        #             print(td.prettify())

        #         # Locate the <td> with the class containing 'value hide-mobile'
        #         for td in row_tds:
        #             if "value hide-mobile" in td.get("class", []):
        #                 trump_result = td.get_text(strip=True)
        #                 break
                
        #         # Stop searching if the result is found
        #         if trump_result:
        #                 break

        # Add to dictionary
        results_dict['State'].append(state)
        results_dict['Date'].append(date)
        results_dict['Sample Size'].append(sample_size)
        results_dict['Pollster'].append(pollster)
        results_dict['Democratic Candidate'].append(democractic_candidate)
        results_dict['Democratic Result'].append(democratic_result)
        results_dict['Trump Result'].append(trump_result)

    except AttributeError as e:
        print(f"Error parsing data for {state}:{e}")
        None # continue # Skip rows with missing data
    
    return results_dict


In [89]:
# Testing the function
state = "north-carolina"
url = f"https://projects.fivethirtyeight.com/polls/president-general/2024/{state.lower()}/"

latest_poll = get_latest_poll(url, state)
print(latest_poll)

Row containing 'Trump':
<tr class="visible-row" data-id="216477" data-type="president-general">
 <!-- Desktop-->
 <td class="dates hide-mobile">
  <div class="date-wrapper">
   Nov. 3-4
  </div>
 </td>
 <!-- Desktop-->
 <td class="sample hide-mobile">
  1,219
 </td>
 <td class="sample-type hide-mobile">
  LV
 </td>
 <!-- Mobile-->
 <td class="dates hide-desktop">
  <div class="date-wrapper">
   Nov. 3-4
  </div>
  <div class="sample-type-wrapper">
   <span class="sample">
    1,219
   </span>
   <span class="sample-type">
    LV
   </span>
  </div>
 </td>
 <!-- Mobile and Desktop-->
 <td class="pollster" tabindex="">
  <div class="pollster-container">
   <a href="https://atlasintel.org/poll/usa-key-states-2024-11-04" target="_blank">
    <div class="pollster-name">
     AtlasIntel
    </div>
   </a>
  </div>
 </td>
 <td class="sponsor hide-mobile" tabindex="">
 </td>
 <!-- Desktop-->
 <td class="answer first hide-mobile hide-default">
  Harris
 </td>
 <td class="value hide-mobile hide-

* `'dates hide-mobile'` instead of `dates hide-desktop` as the former is hidden on mobile device, but could be more detailed, consistent and comprehensive data for scraping.

#### Intermediate Function
Dynamically builds the state URL and passes it to `get_latest_poll`

In [74]:
# Function to dynamically constructs the URL for a given state based on its abbreviation
def get_poll_info(state, democratic_candidate):
    # Define the URL for DC because of error in http
    if state == "district-of-columbia":
        state_url = f"https://projects.fivethirtyeight.com/polls/president-general/2020/{state.lower()}/"
    else:
        state_url = f"https://projects.fivethirtyeight.com/polls/president-general/2024/{state.lower()}/"
    # state_url = f"https://projects.fivethirtyeight.com/polls/president-general/2024/{state.lower()}/"

    latest_polls = get_latest_poll(state_url, state)

    if not latest_polls:
        print(f"No polling data found for {state}.")
        return None
    
    # Create DataFrame from results
    df = pd.DataFrame(latest_polls)

    if 'Democratic Candidate' not in df.columns or 'Trump Result' not in df.columns:
        print("Missing necessary columns in polling data.")
        return None
    
    # Filter rows where Trump and a Democratic candidate are present/matching
    df = df[
        (df['Democratic Candidate'].str.contains(democratic_candidate, case=False, na=False)) &
        df['Trump Result'].fillna("0%")
    ]


    print("Filtered DataFrame:")
    print(df)

    # Return only the latest poll
    if not df.empty:
        return df.iloc[[0]]  # Return only the latest poll
    return None

    # # Return only the latest poll
    # return df

In [75]:
state_poll = get_poll_info(state, democratic_candidate="Harris")
print(f"State polling data for {state} (Biden): {state_poll}")


Filtered DataFrame:
            State      Date Sample Size    Pollster Democratic Candidate  \
0  north-carolina  Nov. 3-4       1,219  AtlasIntel               Harris   

  Democratic Result Trump Result  
0               48%         None  
State polling data for north-carolina (Biden):             State      Date Sample Size    Pollster Democratic Candidate  \
0  north-carolina  Nov. 3-4       1,219  AtlasIntel               Harris   

  Democratic Result Trump Result  
0               48%         None  


#### Outer Loop (top level)
Iterate through the list of states and handle fallback logic with national polling dataif specific mathups (eg. Trump vs Harris) are unavailable

In [None]:
# Iterates through the list of states 
def collect_polling_data(states, democratic_candidates=['Harris', 'Biden']):
    # Placeholder for all states' polling data
    all_state_data = []

    for state in states:
        print(f"Fetching polling data for {state}...")

        # Get pollin data for each Democratic candidate
        state_poll = None
        for candidate in democratic_candidates:
            print(f"Trying candidate: {candidate}")
            # Get polls for the current state and candidate
            state_poll = get_poll_info(state, democratic_candidates=candidate)
            if state_poll is not None:
                print(f"Polling data found for {state} with {candidate}")
                print(state_poll.head())
                break # Stop seaching once data for a candidate is found
        
        if state_poll is None:
            # If no state-level polls, fetch national polling data
            print(f"No polling data for {state}. Falling back to national polls")
            national_url = national_url
            national_poll = get_latest_poll(national_url, "National")

            if national_poll:
                # Create a placeholder DataFrame for national fallback
                national_df = pd.DataFrame(national_poll)
                national_df['State'] = state # Assign the state name to the data
                all_state_data.append(national_df)
            else:
                print(f"No national polling data available for {state}. Skipping")
        else:
            all_state_data.append(state_poll)
    
    # Combine all states' data into a single DataFrame
    if all_state_data:
        combined_df = pd.concat(all_state_data, ignore_index=True)
        return combined_df
    else:
        print("No polling data collected.")
        return None
    

In [60]:
# Call the function
polling_data = collect_polling_data(formated_states)

# Display the combined polling data
if polling_data is not None:
    print(polling_data.head())
else:
    print("No data was collected.")

Fetching polling data for alabama...
Fetching polling data for kentucky...
Fetching polling data for north-dakota...
Fetching polling data for alaska...
Fetching polling data for louisiana...
Fetching polling data for ohio...
Fetching polling data for arizona...
Fetching polling data for maine...
Fetching polling data for oklahoma...
Fetching polling data for arkansas...
Fetching polling data for maryland...
Fetching polling data for oregon...
Fetching polling data for california...
Fetching polling data for massachusetts...
Fetching polling data for pennsylvania...
Fetching polling data for colorado...
Fetching polling data for michigan...
Fetching polling data for rhode-island...
Fetching polling data for connecticut...
Fetching polling data for minnesota...
Fetching polling data for south-carolina...
Fetching polling data for delaware...
Fetching polling data for mississippi...
Fetching polling data for south-dakota...
Fetching polling data for district-of-columbia...
Fetching polli

##

In [None]:

def sum_election(df, weight_vote, weight_poll, baseline_uncertainty=None, alpha_beta_dist=False):
    """
    Simluating 2024 election based on 2020 elction and recent polling
    - weight_vote: level of trust/predictive power of the 2020 election result to 2024 election
    - weight_poll: level of trust/predictive power of the most recent polling data to 2024 election
    - baseline uncertainty: the amount of the uncertainty state-by-state (+- x%)
    - alpha_beta_dist: (weighted avg is calculated from weight_vote and weight_poll)
        alpha as the number of success + 1: representing the weighted avg number of voters who'd vote for Harris; 
        beta as the number of failure + 1: representing the weighted ave number of voters who'd vote for Trump
    
    """


In [None]:
# no. of simulations
n_sim = 250

# baseline uncertainty
baseline_uncertainty = 0.01 #ie.1% (+- 1%)

In [None]:
def simulate_election(df, weight_vote, weight_poll, baseline_uncertainty=None, return_alphas_betas=False):
    """
    this function simulates an election using the joined dataframe df
    weight_vote is our level of trust on the 2020 election being predictive of the 2024 election
    weight_poll is our level of trust in the most recent polling data being predictive of the 2024 election
    """
    
    """
    in a Beta(alpha, beta) distribution:
    - alpha is the number of "successes" + 1
    - beta is the number of "failures" + 1
    here we'll thus set:
    - alpha as the weighted average number of voters who'd vote for Harris
    - beta as the weighted average number of voters who'd vote for Trump
    these weighted averages use the weight_vote and weight_poll we defined in the arguments
    """
    
    #get indices for each logical polling situation
    exists_trump_harris_polling = df.exists_trump_harris_poll.values
    exists_trump_biden_polling = df.exists_trump_biden_poll.values
    not_exists_polling = (~exists_trump_harris_polling) & (~exists_trump_biden_polling)
    
    #if we're not encforcing uncertainty, use sample size as number of votes
    if baseline_uncertainty is None:
        n_votes_vals = df['N Votes']
    #otherwise, set sample size to allow uncertainty to be set at the given value
    #if no Trump/Harris poll for a state, but there is Trump/Biden poll, 1.5x the uncertainty
    #if no poll for a state, 2x the uncertainty
    else:
        n_votes_vals = 1/(4*baseline_uncertainty**2) - 3
        n_votes_missing_harris_polling_vals = 1/(4*(1.5*baseline_uncertainty)**2) - 3
        n_votes_missing_polling_vals = 1/(4*(2*baseline_uncertainty)**2) - 3
     
    #posterior alphas and betas for the Beta distribution of p(Harris) winning
    alphas = weight_vote * n_votes_vals * df['Biden Vote Frac'] + weight_poll * n_votes_vals * df['Harris Poll Frac'] + 1
    
    betas = weight_vote * n_votes_vals * df['Trump Vote Frac'] + weight_poll * n_votes_vals * df['Trump Poll Frac'] + 1
    
    #for states that do not have Trump/Harris polling but do have Trump/Biden polling
    alphas[exists_trump_biden_polling] = weight_vote * n_votes_missing_harris_polling_vals * df.iloc[exists_trump_biden_polling]['Biden Vote Frac'] + weight_poll * n_votes_missing_harris_polling_vals * df.iloc[exists_trump_biden_polling]['Biden Poll Frac'] + 1
    betas[exists_trump_biden_polling] = weight_vote * n_votes_missing_harris_polling_vals * df.iloc[exists_trump_biden_polling]['Trump Vote Frac'] + weight_poll * n_votes_missing_harris_polling_vals * df.iloc[exists_trump_biden_polling]['Trump Poll Frac'] + 1
    
    #for states that have no polling data at all
    alphas[not_exists_polling] = weight_vote * n_votes_missing_polling_vals * df.iloc[not_exists_polling]['Biden Vote Frac'] + weight_poll * n_votes_missing_polling_vals * HARRIS_NATIONAL_POLL_FRAC + 1
    betas[not_exists_polling] = weight_vote * n_votes_missing_polling_vals * df.iloc[not_exists_polling]['Trump Vote Frac'] + weight_poll * n_votes_missing_polling_vals * TRUMP_NATIONAL_POLL_FRAC + 1

    #using these alphas and betas, simulate the probability that Harris would win
    p_wins = [np.random.beta(a,b) for a,b in zip(alphas, betas)]
    harris_wins = np.array([p > 0.5 for p in p_wins])
    harris_evotes = df[harris_wins]['Electoral Votes'].sum()
    trump_evotes = df[~harris_wins]['Electoral Votes'].sum()
    
    if return_alphas_betas:
        return harris_evotes, trump_evotes, alphas, betas
    return harris_evotes, trump_evotes 

In [None]:
results = []
# iterate over several choic of voting and pollin weights
for weight_vote in np.arange(0.01, 1.01, 0.01):
    weight_vote = round(weight_vote, 10)
    weight_poll = round(1-weight_vote, 10)
    if weight_poll < 0:
        continue
    print(weight_vote, weight_poll)
    # do n times simulations
    for _ in range(n_sim):
        harris_evotes, trump_evotes = sim_election(joined, weight_vote, weight_poll, baseline_uncertainty)
        results.append([weight_vote, weight_poll, harris_evotes, trump_evotes])
results = pd.DateFrame(columns=['weight_vote', 'weight_poll', 'harris_evotes', 'trump_evotes'], data=results)


In [None]:
# aggregate based on weight of polling data
stats = results.groupby('weight_poll').agg(
    avg_harris_evotes = pd.NamedAgg('harris_evote', np.mean),
    dev_harris_evotes = pd.NamedAgg('harris_evotes', np.std),
    avg_trump_evotes = pd.NamedAgg('trump_evote', np.mean),
    dev_trump_evotes = pd.NamedAgg('trump_evotes', np.std)
).reset_index()

In [None]:
# plot results
plt.figure(figsize=(12,6))
plt.errorbar(stats.weight_poll, stats.avg_harris_evotes, yerr=stats.dev_harris_evotes, color='cornflowerblue', linewidth=2, capsize=2)
plt.errorbar(stats.weight_poll, stats.avg_trump_evotes, yerr=stats.dev_trump_evotes, color='firebrick', linewidth=2, capsize=2)
plt.legend(['Harris', 'Trump'], fontsize=16, loc=1)
plt.ylim(200,340)
plt.xticks(np.arange(0, 1.1, 0.1), fontsize=18)
plt.xlabel('Polling Trust Level', fontsize=22)
plt.yticks(np.arange(200, 350, 20), fontsize=18)
plt.ylabel('Electoral Votes', fontsize=22)
plt.tight_layout()
plt.savefig('election_simluation.png',dpi=250)


Interpreation:
X-axis: 0.0 ~ 1.0 representing no trust to full trust to the poll
Y-axis: average weight votes by standard deviation

But instead of picking single polling trust level, below plot a distribution of the polling trust level across all possible trust level. And we could see that the trust level should follow a normal distribution, that we shouldn't be reading from a trust level less than 0.4 or more than 0.95, and center of mass is 0.7.

In [None]:
plt.figure(figsize=(10,3))
factor = 2
beta_dist = np.random.beta(7*factor, 3*factor, 100000)
sns.histplot(beta_dist, stat='density', color='green', alpha=0.3)
plt.xticks(np.arange(0, 1.1, 0.1), fontsize=18)
plt.xlabel('Polling Trust Level', fontsize=18)
plt.ylabel('Density', fontsize=18)
plt.tight_layout()


Pulling weighted average votes to the density of trust level to get the final predicted votes per candidates:

In [None]:
stats['prob'] = stats.apply(lambda row: beta.pdf(row.weight_poll, 7*factor, 3*factor), 1)
stats.prob = stats.prob / stats.prob.sum()
harris_pred = round(sum(stats.prob * stats.avg_harris_evotes))
trump_pred = round(sum(stats.prob * stats.avg_trump_evotes))
print(f'Predicted Harris Electoral Votes: {harris_pred}')
print(f'Predicted Trump Electoral Votes: {trump_pred}')

#### Different outputs per scenarios
| Baseline Uncertainty | Personal Polling Distribution | Harris | Trump |
| :------------------- | :---------------------------: | :----: | ----: |
| 1%                   | Centered at 0.7               | 270    | 268   |
| 0.5%                 | Centered at 0.7               | 259    | 279   |
| 1%                   | Centered at 0.1               | 284    | 254   |

* [Edited 11/26/2024]: Interestingly, the final result of the 2024 election was 226 votes for Kamala Harris, and 312 votes for Donald J. Trump.