
Money + Elections (End-to-End Data Engineering Project)

At-A-Glance

Election year has finally rolled around again. You know what that means? Millions, if not billions, of dollars are being spent to send candidates to Capitol Hill and the White House. In 1974, the Federal Election Commission (FEC) was created to oversee every federal election in the United States, and as a result, every candidate's financial records are openly accessible to the public. This project uses FEC data to create a dashboard monitoring election finances across the 2024 election cycle.

Project Diagram (Click to make it readable)

[image]

ERD

After figuring out what data I would need, I made an ERD based on the data documentation provided by the FEC. Each table has an ingestion DAG to move the data from the CSVs to an S3 bucket.

2024 Election ERD

Data Stack

Data Orchestration:

  • Apache Airflow

Data Processing/Cleaning:

  • Polars
  • Pandas

Data Warehouse and Staging:

  • AWS S3
  • Snowflake

Data Transformations:

  • DBT (Data Build Tool)
  • SQL (of course)

Data Visualization / BI:

  • Metabase

Other:

  • AWS IAM (Identity and Access Management)
  • Notion (Project management / Documentation)
  • Docker (App containerization)

Data Ingestion

There are two types of ingestion DAGs:

  1. DAGs that download the bulk data from the FEC website.
  2. DAGs that scrape data straight from the tables on the FEC website. These DAGs are reserved for data in "type" tables (i.e. Report Type, Transaction Type, and Political Parties).

Example of a "Bulk Data" DAG (TLDR - Full code: here):

For these types of DAGs, I chose to convert the files to Parquet because it decreased my DAG runtimes and cut costs on S3 storage (only $5, but hey, money is money!). For example, for the "Individual Contributions" DAG, the runtime went from 15m 57s → 5m 25s** and the file size went from 3.4 GB → 528.7 MB.

** = Not including the DAG trigger.

[image]
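The full DAG is linked above, but a minimal sketch of the bulk-data pattern (read the pipe-delimited FEC file with Polars, write Parquet, push it to S3) might look like the following. This assumes Airflow 2.x with the Amazon provider installed; the file path, S3 key, bucket name, and connection ID are illustrative, not the project's actual values.

```python
from datetime import datetime

import polars as pl
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def individual_contributions_ingest():
    @task
    def convert_to_parquet() -> str:
        # FEC bulk files are pipe-delimited with no header row; column names
        # come from the FEC data documentation (omitted here for brevity).
        df = pl.read_csv(
            "/tmp/itcont.txt",  # hypothetical path to the downloaded bulk file
            separator="|",
            has_header=False,
        )
        parquet_path = "/tmp/individual_contributions.parquet"
        df.write_parquet(parquet_path)  # far smaller than the raw text file
        return parquet_path

    @task
    def upload_to_s3(parquet_path: str) -> None:
        S3Hook(aws_conn_id="aws_default").load_file(
            filename=parquet_path,
            key="individual_contributions/individual_contributions.parquet",  # illustrative key
            bucket_name="FEC",  # illustrative bucket name
            replace=True,
        )

    upload_to_s3(convert_to_parquet())


individual_contributions_ingest()
```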

Example of a "Web Scraper" DAG (TLDR - Full code: here):

[image]
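Again, the real DAG is linked above; a minimal sketch of the scraping pattern, assuming the FEC page renders its code tables as plain HTML tables (the URL, S3 key, and bucket name below are illustrative), could look like this:

```python
import io
from datetime import datetime

import pandas as pd
from airflow.decorators import dag, task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# Illustrative URL for one of the FEC "type" tables; substitute the actual page.
REPORT_TYPE_URL = "https://www.fec.gov/campaign-finance-data/report-type-code-descriptions/"


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def report_type_scraper():
    @task
    def scrape_and_load() -> None:
        # read_html returns every <table> on the page; here the first one is
        # assumed to be the report-type code table.
        report_types = pd.read_html(REPORT_TYPE_URL)[0]

        buffer = io.BytesIO()
        report_types.to_parquet(buffer, index=False)
        S3Hook(aws_conn_id="aws_default").load_bytes(
            bytes_data=buffer.getvalue(),
            key="report_types/report_types.parquet",  # illustrative key
            bucket_name="FEC",                        # illustrative bucket name
            replace=True,
        )

    scrape_and_load()


report_type_scraper()
```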

Loading Data into Snowflake

Before loading the data into the data warehouse, I had to make the tables (see ERD) and create an S3 stage for each table. To make the S3 stages and allow Snowflake access to my "./FEC" S3 bucket, I created a custom role for Snowflake in AWS IAM and created the stages using a script like this:

[image]

* Snowflake recommends using CREATE STORAGE INTEGRATION to grant access to S3 instead of using an AWS access key ID and secret key!
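The script in the screenshot isn't reproduced here, but as a rough illustration of that pattern, here is a sketch using snowflake-connector-python. The integration name, IAM role ARN, database/schema, stage name, and bucket path are placeholders, not the project's actual values.

```python
import snowflake.connector

# Hypothetical connection details; in practice these would come from a secrets store.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    role="SYSADMIN",
)

ddl_statements = [
    # A storage integration delegates S3 access to an IAM role, so no AWS key
    # ID or secret key is ever stored in Snowflake.
    """
    CREATE STORAGE INTEGRATION IF NOT EXISTS fec_s3_integration
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'S3'
      ENABLED = TRUE
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-fec-role'
      STORAGE_ALLOWED_LOCATIONS = ('s3://fec-bulk-data/');
    """,
    # One external stage per table, pointing at that table's prefix in the bucket.
    """
    CREATE STAGE IF NOT EXISTS ELECTION.RAW.CANDIDATES_STAGE
      URL = 's3://fec-bulk-data/candidates/'
      STORAGE_INTEGRATION = fec_s3_integration
      FILE_FORMAT = (TYPE = PARQUET);
    """,
]

cur = conn.cursor()
for statement in ddl_statements:
    cur.execute(statement)
cur.close()
conn.close()
```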

A corresponding COPY INTO DAG is triggered at the end of both types of ingestion DAGs in the previous section. These DAGs follow these steps:

  1. The DAG receives a run date as a parameter.
  2. Truncate the table in Snowflake to avoid duplicate data.
  3. Copy the data into the table using the run date parameter.
  4. Run DBT models*

* Not included in the DAG because I use DBT locally instead of DBT Cloud, and Airflow wasn't cooperating with DBT.

Here's an example using the Candidates table:

[image]
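In outline, a COPY INTO DAG for the Candidates table could look like the sketch below, assuming the Airflow Snowflake provider, a "snowflake_default" connection, and the stage from the section above. The run date is passed in as a DAG parameter and used to pick the right files in the stage; table, stage, and parameter names are illustrative.

```python
from datetime import datetime

from airflow.decorators import dag
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator


@dag(
    schedule=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
    params={"run_date": "2024-01-01"},  # step 1: run date passed in by the ingestion DAG
)
def copy_candidates_into_snowflake():
    # Step 2: truncate so a re-run never leaves duplicate rows behind.
    truncate = SnowflakeOperator(
        task_id="truncate_candidates",
        snowflake_conn_id="snowflake_default",
        sql="TRUNCATE TABLE ELECTION.RAW.CANDIDATES;",
    )

    # Step 3: copy only the files written under this run date's prefix.
    copy_into = SnowflakeOperator(
        task_id="copy_candidates",
        snowflake_conn_id="snowflake_default",
        sql="""
            COPY INTO ELECTION.RAW.CANDIDATES
            FROM @ELECTION.RAW.CANDIDATES_STAGE/{{ params.run_date }}/
            FILE_FORMAT = (TYPE = PARQUET)
            MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
        """,
    )

    truncate >> copy_into


copy_candidates_into_snowflake()
```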

Data Transformation (DBT)

I used DBT to make data models to be used by Metabase. All my source tables are placed in the "Election.Raw" schema, and all the models created with DBT are placed in the "Election.Analytics" schema. Below you will see the lineage graphs of the "src" models and the "fct" views they build. If you want to look at the full DBT project, look here.

Lineage Graphs (Click to make readable):

  • src_candidates [image]
  • src_committee_transactions [image]
  • src_committees [image]
  • src_independent_exp [image]
  • src_individual_cont [image]
  • src_operating_exp [image]

The Dashboard

Individual Contributions Page

[image]

Operating Expenditures Page

[image]

Committee Contributions to Candidates

[image]
