
CMU Interactive Data Science Final Project (Open in Streamlit)

Data-Driven Analysis of Popular Beliefs about COVID-19

Abstract

Given rising concerns about the long-standing socioeconomic inequalities in the United States that the Covid-19 pandemic has exposed, we set out to examine these claims with a data-driven approach, analyzing the correlation between a region's socioeconomic makeup and how the virus has affected it throughout 2020. We contribute a Streamlit web application with interactive maps and visualizations that enables users to (1) explore the relationships between different socioeconomic indicators and Covid-19 statistics and (2) analyze the most common words used in tweets written by Twitter users from different states throughout the summer of 2020. We supplement our app with a narrative article that both acts as an instructional tutorial and highlights several insights we gained from using our app.

Summary Images

Bar chart of most common terms from Pennsylvania Tweets posted in the summer of 2020

Comparing poverty rates with cumulative cases of Covid-19 in Pennsylvania counties (as of Nov 29, 2020)

Comparing poverty rates with cumulative cases of Covid-19 in all United States counties (as of Nov 29, 2020)

Instructions

The easiest way to view our app is through the deployed Streamlit app; the URL is linked at the top of this README. If you'd like to run the application locally, follow the instructions below.

  1. Install the latest version of Git: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git. Note: If you don't want to install Git, you can instead download the repo from GitHub as a ZIP archive and unzip it in a directory of your choice; in that case, skip the git clone step below.
  2. Install the latest version of Python: https://www.python.org/downloads/
  3. Install the latest version of PIP: https://pip.pypa.io/en/stable/installing/
  4. Install the latest version of Streamlit: https://docs.streamlit.io/en/stable/installation.html
  5. Execute the following commands in a directory of your choice to clone the repo, download dependencies, and run the app:

     git clone https://github.com/CMU-IDS-2020/fp-mythbusters.git
     cd fp-mythbusters
     pip install -r requirements.txt
     streamlit run streamlit_app.py

Link to Paper

Final Project Report (PDF)

Link to Video

Video Presentation

Work Distribution

  • Joseph Koshakow: Joe was responsible for all things tweet-related. He wrote and executed the scripts to fetch and organize the tweet contents from the Twitter API. Additionally, he created the tweet word clouds and the tweet bar charts, as well as the functionality to switch between the two. Joe also performed the initial LDA analysis on the tweets, which ended up not being included in the final project.

  • Vivian Lee: Vivian was responsible for obtaining the socioeconomic indicator, FIPS, and covid data and for creating the interactive side-by-side state maps. She wrote scripts to fetch the covid data from Carnegie Mellon’s Epidata API and merge it with the socioeconomic and FIPS data retrieved from federal agency sites like the US Census Bureau and the US Department of Agriculture. These scripts were executed in a Jupyter notebook; the notebook and CSV files are saved in this repository to support the reproducibility of this work. Vivian used the Altair visualization library to create the side-by-side county-level state maps and add interactive features to them. These features include Streamlit widgets to control which features color-code the counties, tooltips to display specific county information, linked highlighting of counties between the maps, and a multi-select feature that lets users click on multiple counties on the covid map to control which counties' data are displayed on the correlation/time-series plot below the map.

  • James Mahler: James was responsible for some of the visualizations in the Streamlit app. In particular, he built the correlation plots and the time-series plots that are displayed for selected counties of a state. He also created the USA choropleth maps that show the covid data for the whole country and explored how to render the two state maps side by side.

  • Bradley Warren: Bradley was responsible for the app's narrative and for the deliverables for both the checkpoint and the final video. After Joe, Vivian, and James had built the core parts of the application, Bradley went through it and wrote the narrative text that appears in the app.

Project Process

Tweets

The COVID tweet dataset provides 239,861,658 tweet IDs, as well as a subset of those IDs that belong to tweets with geo data. The dataset does not provide the actual tweet content or any geo data itself, so we had to use the tweet IDs and the Twitter API to fetch tweet contents and organize tweets by location. We decided to use the Tweepy library to connect to the Twitter API because we found that it made authentication easy and straightforward, which happened to be the hardest part of using the API.

Some fetched tweets explicitly name the state they were made in, while others contain more specific locations like “Starbucks at Rockefeller Center”. All tweets contain a longitude and latitude, which theoretically makes it possible to derive the state for every US tweet. However, we were able to get enough tweets with explicit state locations to make this unnecessary.

The Twitter endpoint we used allowed us to fetch tweets in batches of 100, which made the fetching process significantly faster than fetching one tweet at a time. One challenge is that, to avoid spamming the API, Twitter enforces a rate limit of 900 requests per 15 minutes; in practice we could only fetch about 36,000 tweets an hour and had to implement logic to catch rate-limit errors and sleep for 15 minutes before resuming the fetching process. Another challenge was that, by default, Twitter only sends the first 140 characters of a tweet. This created a bias toward Twitter usernames being very prevalent in the dataset, because they generally appear at the start of a tweet. We had to rerun all of our scripts with an additional query parameter to request the full tweet contents. This whole process was done with Python scripts that ran over long periods of time.
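
For illustration, here is a minimal sketch of such a hydration loop, assuming Tweepy 3.x (where the 100-ID batch endpoint is API.statuses_lookup and rate-limit errors raise tweepy.RateLimitError); the credentials and file name are placeholders, not the project's actual values:

    import time
    import tweepy

    # Placeholder credentials; substitute your own Twitter API keys.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth)

    with open("tweet_ids.txt") as f:  # one tweet ID per line (placeholder file)
        tweet_ids = [line.strip() for line in f]

    tweets = []
    for i in range(0, len(tweet_ids), 100):  # the endpoint accepts up to 100 IDs
        batch = tweet_ids[i:i + 100]
        while True:
            try:
                # tweet_mode="extended" requests the full text, not just 140 chars
                statuses = api.statuses_lookup(batch, tweet_mode="extended")
                break
            except tweepy.RateLimitError:
                time.sleep(15 * 60)  # wait out the 15-minute rate window
        tweets.extend(status.full_text for status in statuses)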

Initially we wanted to run Latent Dirichlet Allocation (LDA) on the tweets and see what topics we could extract. We thought the topics would be meaningful in the sense that they would help us summarize what people were talking about most; for example, if the most popular topic was "wearing a mask", we could focus on data around that. However, we found that the output of LDA is a weighted vector of words, which wasn't very helpful to us. For example, one topic was [0.021*"covid" + 0.018*"https" + 0.011*"pour" + 0.009*"dan" + 0.006*"coronavirus" + 0.005*"conflits_fr" + 0.005*"masqu" + 0.005*"confin" + 0.005*"tout" + 0.004*"admit"], which doesn't do a good job of telling us what people are discussing.
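
As a hedged sketch of what this experiment can look like with gensim's LdaModel (the tokenized_tweets variable, a list of token lists, is hypothetical):

    from gensim import corpora, models

    # tokenized_tweets: a list of token lists, one per tweet (assumed input)
    dictionary = corpora.Dictionary(tokenized_tweets)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_tweets]

    lda = models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=5)
    # Each topic prints as a weighted word vector like
    # 0.021*"covid" + 0.018*"https" + ...
    for topic_id, topic in lda.print_topics(num_words=10):
        print(topic_id, topic)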

We initially displayed the tweet data as a word cloud, as we thought it would be a visually appealing way to display the aggregated contents of the tweets. We used a Python word cloud library to accomplish this (https://amueller.github.io/word_cloud/index.html). We saw some interesting common themes across states. One of the top words in almost every state was “people”. In a lot of states we also see references to the president and local elected officials, some positive/neutral and others negative. Some examples of negative words taken from PA are “tempgenocide”, “gopgenocide”, and “trumpistheworstpresidentever”. Some positive/neutral examples taken from Oklahoma are “president”, “realdonaldtrump” (the Twitter handle for President Trump), “trump”, and “govstitt” (the Twitter handle for the governor of Oklahoma).

We transformed each state's word cloud from a black rectangle into the shape of the corresponding state. Not only was this more visually appealing, it was an effective way of encoding the location of the tweets into the image itself: when people look at the word cloud, they immediately associate it with a particular state. One thing we noticed was that generating the word clouds was a bit slow, even though the generated image was always the same, since we weren't refetching the tweet contents every time and all the words were static. So instead of generating each word cloud on every run, we generated it once and saved an image of it, then simply loaded that image for each state. This significantly improved performance and lowered the size of our cache.
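
A minimal sketch of this generate-once-and-save approach with the wordcloud library; the mask image path, output path, and the freqs word-count dictionary are illustrative assumptions:

    import numpy as np
    from PIL import Image
    from wordcloud import WordCloud

    # freqs: {word: count} computed from the fetched tweets (assumed)
    mask = np.array(Image.open("masks/pennsylvania.png"))  # state-outline image
    wc = WordCloud(background_color="white", mask=mask)
    wc.generate_from_frequencies(freqs)
    # Generate once and save; the app then just loads this PNG at render time.
    wc.to_file("wordclouds/pennsylvania.png")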

One suggestion from the TAs was to consider replacing the word clouds with bar charts of the words. To allow for flexibility, we added radio buttons that let the user toggle between word clouds and bar charts. Both encode and present the same data, but they have slightly different pros and cons: word clouds are more visually appealing and can squeeze in more words, while bar charts give more accurate counts for each word, make words easier to compare, and give a better sense of scale. Given these trade-offs, we concluded the choice is mostly a matter of user preference and included both.
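
A sketch of how such a toggle can look in Streamlit with an Altair bar chart; the image path and the top_words DataFrame (with "word" and "count" columns) are illustrative, not the project's exact code:

    import altair as alt
    import streamlit as st

    view = st.radio("Display tweet terms as", ("Word cloud", "Bar chart"))
    if view == "Word cloud":
        st.image("wordclouds/pennsylvania.png")  # pre-generated image (assumed path)
    else:
        chart = alt.Chart(top_words).mark_bar().encode(
            x=alt.X("count:Q", title="Occurrences"),
            y=alt.Y("word:N", sort="-x", title="Term"),
        )
        st.altair_chart(chart, use_container_width=True)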

Socioeconomic Indicators

Like many research projects, we found that the process of arriving at our final product for comparing socioeconomic indicators and covid metrics was not linear. In our proposal, we had mentioned comparing covid metrics between different countries and measuring the health and economic impact Covid-19 had on US college towns. However, we later had to narrow our scope and deviate from this initial plan due to the lack of data available for measuring these correlations. We ultimately decided to focus on US counties, since (1) the data was readily available via Delphi's public Epidata API and (2) looking at counties instead of states allows for a more fine-grained analysis, as the socioeconomic makeup of a state can vary greatly between its counties.

Since universities were likely still in the process of gathering Covid-19 metrics from students in real time during the Fall semester, covid data on college towns was not yet publicly available. Once compiled, looking at the impact the virus has had on college towns would be an interesting area for future work. In the meantime, we stuck with existing data that was available, meaningful, and gathered from trustworthy sources. We found the United States Department of Agriculture website immensely helpful; it provided county-level datasets for various socioeconomic features, which we downloaded, cleaned, and merged with our fetched covid datasets.
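
To give a flavor of this pipeline, here is a hedged sketch using Delphi's covidcast Python client and pandas; the signal name, CSV path, and column names are assumptions rather than the project's exact choices:

    from datetime import date

    import covidcast
    import pandas as pd

    # County-level cumulative confirmed cases from the Epidata API
    cases = covidcast.signal(
        "jhu-csse", "confirmed_cumulative_num",
        date(2020, 11, 29), date(2020, 11, 29),
        geo_type="county",
    )

    # USDA socioeconomic data keyed by county FIPS code (hypothetical file)
    poverty = pd.read_csv("data/usda_poverty.csv", dtype={"FIPS": str})

    # geo_value holds the 5-digit county FIPS string, so join on it
    merged = cases.merge(poverty, left_on="geo_value", right_on="FIPS")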

We took a scientific approach by first writing down our initial hypotheses, which stemmed from both personal biases and popular public opinions within our communities and on our news feeds. Then we completed the implementation of our interactive maps and visualizations before revisiting our set of questions and observing what the data had to say about them.

After receiving feedback from our design review, we added side-by-side country-wide maps of the United States, with each county colored by its value for the selected socioeconomic and covid features. Getting a bird's-eye view of all the counties in the US also allowed us to make comparisons between states and to extrapolate feature correlations that take every county into consideration.
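
In the spirit of those maps, a minimal Altair county choropleth might look like the sketch below; county_df (with an integer fips column and a poverty_rate column) is a stand-in for the project's merged dataset:

    import altair as alt
    from vega_datasets import data

    counties = alt.topo_feature(data.us_10m.url, "counties")
    chart = (
        alt.Chart(counties)
        .mark_geoshape()
        .transform_lookup(
            lookup="id",  # numeric county FIPS in the us_10m topojson
            from_=alt.LookupData(county_df, "fips", ["poverty_rate"]),
        )
        .encode(color=alt.Color("poverty_rate:Q", title="Poverty rate (%)"))
        .project(type="albersUsa")
        .properties(width=600, height=350)
    )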

Deliverables

Proposal

  • The URL at the top of this readme needs to point to your application online. It should also list the names of the team members.
  • A completed proposal. The contact should submit it as a PDF on Canvas.

Design review

  • Develop a prototype of your project.
  • Create a 5-minute video that demonstrates your project and lists any questions you have for the course staff. The contact should submit the video on Canvas.

Final deliverables

  • All code for the project should be in the repo.
  • A 5-minute video demonstration.
  • Update Readme according to Canvas instructions.
  • A detailed project report. The contact should submit the video and report as a PDF on Canvas.

Data References

  1. Data from Delphi COVIDcast. Obtained via the Delphi Epidata API. https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html
  2. Kerchner, D., and Wrubel, L. 2020. Coronavirus Tweet Ids. (June 2020). Retrieved December 9, 2020 from Harvard Dataverse V7 https://doi.org/10.7910/DVN/LW0BTB
  3. Gupta, S. 2020. How COVID-19 worsened gender inequality in the U.S. workforce. (September 2020). Retrieved December 9, 2020 from https://www.sciencenews.org/article/covid19-worsened-gender-inequality-us-workforce
  4. Patel, J.A. et al. 2020. Poverty, inequality and COVID-19: the forgotten vulnerable. (May 2020). Retrieved December 9, 2020 from http://www.sciencedirect.com/science/article/pii/S0033350620301657
  5. Ray, R. 2020. Why are Blacks dying at higher rates from COVID-19? (April 2020). Retrieved December 9, 2020 from https://www.brookings.edu/blog/fixgov/2020/04/09/why-are-blacks-dying-at-higher-rates-from-covid-19/
  6. U.S. Bureau of Labor Statistics. 2020. Ability to work from home: evidence from two surveys and implications for the labor market in the COVID-19 pandemic : Monthly Labor Review. (June 2020). Retrieved December 9, 2020 from https://www.bls.gov/opub/mlr/2020/article/ability-to-work-from-home.htm
  7. U.S. Department of Agriculture. 2018. County-level datasets. Retrieved December 2, 2020 from https://www.ers.usda.gov/data-products/county-level-data-sets/