## Data Pipeline ##

We use a web crawler to download county-level election results from Wikipedia for all but two states, covering 1972 to 2008 (10 elections). This is done by the `get_past_ten_elections_all_states()` function in `data_processing.py`. We then integrate these results with detailed county-level economic data from the Bureau of Economic Analysis (bea.gov), which contains 24 indicators for each county from 1969 to 2023. The resulting training records (d's) have the following form:

d = 

{
    
Name of County (C), 

Election Year (EY), 

Winning Party in EY,

Economic Indicator #1-#24 one year before EY, 

Economic Indicator #1-#24 two years before EY, 

Economic Indicator #1-#24 three years before EY

}
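
As a rough illustration, one record d could be represented as in the sketch below. The class name `CountyRecord`, its field names, the "R"/"D" encoding of the winning party, and the `to_features` helper are all hypothetical and are not taken from `data_processing.py`. The sketch only shows how three years of 24 indicators flatten into 72 values, which is consistent with the 72 inputs of the network described under Results.

```python
from dataclasses import dataclass
from typing import List

NUM_INDICATORS = 24  # indicators per county per year, as described above


@dataclass
class CountyRecord:
    """Hypothetical container for one training record d."""
    county: str                 # name of county (C)
    election_year: int          # election year (EY)
    winning_party: str          # assumed to be stored as "R" or "D"
    indicators_1y: List[float]  # 24 indicators, one year before EY
    indicators_2y: List[float]  # 24 indicators, two years before EY
    indicators_3y: List[float]  # 24 indicators, three years before EY

    def to_features(self) -> List[float]:
        """Concatenate the three yearly blocks into one flat vector."""
        blocks = (self.indicators_1y, self.indicators_2y, self.indicators_3y)
        for block in blocks:
            if len(block) != NUM_INDICATORS:
                raise ValueError(
                    f"expected {NUM_INDICATORS} indicators, got {len(block)}"
                )
        return [value for block in blocks for value in block]


# Example with placeholder values (not real data).
record = CountyRecord("Example County", 2008, "D",
                      [0.1] * 24, [0.2] * 24, [0.3] * 24)
features = record.to_features()
print(len(features))  # 72
```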

The d's are then normalized, adjusted for inflation, and stored in `training_data_long.csv`, which contains 27925 such records. We load them and split them into training, validation, and test sets.
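
The split step can be sketched as follows. The 80/10/10 proportions, the fixed seed, and the helper name `split_records` are assumptions made for illustration; the fractions actually used by the project are not stated here, and it may split in another way (for example, by election year).

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_records(records: Sequence[T],
                  valid_frac: float = 0.1,
                  test_frac: float = 0.1,
                  seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle records and cut them into training, validation, and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


# Stand-in for the 27925 records; real code would load the CSV rows instead.
records = list(range(27925))
train, valid, test = split_records(records)
print(len(train), len(valid), len(test))  # 22341 2792 2792
```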



## PCA for Exploratory Data Analysis ##

Run the cell below to see the results. Red dots are d's whose winning party is Republican, and blue dots are d's whose winning party is Democratic. The plot below is not interactive; for the interactive version, run `eda.py` from a terminal. After rotating the interactive plot for a while, some separation between the two classes shows up along some directions, but not much.

In [None]:
from importlib import reload  # reload is not a builtin in Python 3

import eda
reload(eda)
eda.main()
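
For readers who want to see the projection step in isolation, here is a minimal PCA sketch using NumPy's SVD. How `eda.py` actually computes and plots the components is not shown in this notebook, so the function name `pca_project`, the choice of three components, and the synthetic input matrix are assumptions.

```python
import numpy as np


def pca_project(X: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project the rows of X onto their top principal components."""
    X_centered = X - X.mean(axis=0)  # PCA works on zero-mean columns
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


# Synthetic stand-in for the (n_records, 72) feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 72))
projected = pca_project(X, n_components=3)
print(projected.shape)  # (200, 3)
```

Each row of `projected` could then be drawn as one dot in a 3D scatter plot, colored red or blue by the winning party.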

## Description of the Task

Predict county-level presidential election results from historical economic data. Take 2024 as an example: the goal is to build a model that takes a county's economic data for 2021, 2022, and 2023 as input and outputs "R" or "D".
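
The input/output contract described above can be sketched as two small helpers. Both names (`input_years`, `to_label`) are hypothetical, and the sketch assumes that the model's sigmoid output is the probability of a Republican win with a 0.5 cutoff; the actual encoding in `training.py` may differ.

```python
from typing import List


def input_years(election_year: int, lookback: int = 3) -> List[int]:
    """Return the years whose economic data feed the prediction, oldest first."""
    return [election_year - k for k in range(lookback, 0, -1)]


def to_label(prob_republican: float, threshold: float = 0.5) -> str:
    """Map a sigmoid output to "R" or "D".

    Assumes the output is the probability of a Republican win; the real
    encoding may be the other way round.
    """
    return "R" if prob_republican >= threshold else "D"


print(input_years(2024))  # [2021, 2022, 2023]
print(to_label(0.83))     # R
print(to_label(0.12))     # D
```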



## Results 

For details of the model and how it is trained, refer to the `train()` and `main()` functions in `training.py`. At best, we reach 0.96 training accuracy and 0.83 validation accuracy (see the training curves below, drawn with TensorBoard). This result was obtained with `lr = 0.0025, epochs = 3000, training_accuracy_threshold = 0.95, batchsize = len(training_set)`, where training stops once the training accuracy reaches the threshold.

The specification of the neural network is as follows:

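```
import torch

# 72 inputs = 24 economic indicators x 3 years before the election year.
model = torch.nn.Sequential(
    torch.nn.Linear(72, 256, bias=True),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(256, 512),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(512, 256),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(256, 256),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(256, 1),
    torch.nn.Sigmoid(),  # single output in (0, 1) for the binary R/D decision
)

# training_set is defined elsewhere; dividing by 1 gives full-batch training.
hparams = {"Epochs": 3000, "LR": 0.0025, "Threshold": 0.95, "Batch Size": int(len(training_set)/1)}
```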
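
To make the role of the accuracy threshold concrete, here is a minimal sketch of full-batch training that stops once training accuracy reaches 0.95. It is not the code in `training.py`: the data are synthetic, the network is deliberately much smaller so the sketch runs quickly, and the choice of `BCELoss` and the Adam optimizer is an assumption, since neither is named above.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in data: 64 records with 72 features and a 0/1 label.
X = torch.randn(64, 72)
y = (X[:, 0] > 0).float().unsqueeze(1)

# A much smaller network than the one above, to keep the sketch fast.
model = torch.nn.Sequential(
    torch.nn.Linear(72, 16),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(16, 1),
    torch.nn.Sigmoid(),
)

loss_fn = torch.nn.BCELoss()  # assumed loss; pairs with the Sigmoid output
optimizer = torch.optim.Adam(model.parameters(), lr=0.0025)  # assumed optimizer
max_epochs, threshold = 3000, 0.95

for epoch in range(1, max_epochs + 1):
    optimizer.zero_grad()
    probs = model(X)  # full batch: batch size equals the training set size
    loss = loss_fn(probs, y)
    loss.backward()
    optimizer.step()

    # Measure training accuracy after the update, without tracking gradients.
    with torch.no_grad():
        accuracy = ((model(X) >= 0.5).float() == y).float().mean().item()
    if accuracy >= threshold:  # stop once training accuracy is high enough
        break

print(f"stopped after {epoch} epochs, training accuracy {accuracy:.2f}")
```

Stopping on training accuracy alone does not guard against overfitting, which is one possible reading of the gap between the 0.96 training and 0.83 validation accuracy reported above.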


To see live results, run the cell below and then run the following command in a terminal (requires TensorBoard): `tensorboard --logdir=runs`


**Loss**
<div>
    <img src="loss.png" width="1000"/>
</div>

**Training Accuracy**

<div>
    <img src="training.png" width="1000"/>
</div>


**Validation Accuracy**

<div>
    <img src="valid.png" width="1000"/>
</div>




In [None]:
from importlib import reload  # reload is not a builtin in Python 3

import training
reload(training)
training.main()

## Analyzing the Results

1. So far, the training set contains only data with EY from 1972 to 2008, so we will also incorporate data for EY 2012, 2016, and 2020. The functions required for this are already written in `data_processing.py` and `crawl.py`.

2. Add BEA.gov's CAINC4 data, which contains a couple dozen more economic indicators, to the training data.

3. Add the field "Economic indicators four years before EY" to the d's.

4. Test the model on the 2024 election, and visualize the predictions on a US map.