<a href="https://colab.research.google.com/github/ReidelVichot/DSTEP23/blob/main/assignments/dstep23_assignment_03_rvichot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **DSTEP23 // Assignment #3**

assigned : **Sep 25, 2023**

DUE : **Oct 3, 2023 11:59pm**

## Parsing Congressional Tweets

<img src="https://www.dailydot.com/wp-content/uploads/2019/01/twitter-on-capitol-hill.jpg" width=625>

---

<b>

How to complete and submit assignments:

1. Please make a copy of this notebook in your Google Drive and add your name to the filename.

2. Once you have completed the notebook, please share it with me before the due date and time by clicking the "Share" button in the upper right corner of the notebook.


Rules for assignments:

1. You may work with other students in the class, but if you do, each student with whom you worked <u>must</u> be listed below.  Direct copying from someone else's notebook is not permitted.

2. You may use generative AI models (e.g., ChatGPT) to help complete the assignment, but if you do, you must answer YES to the question below. Bear in mind that such models <u>often</u> yield incorrect and biased solutions.

3. All solutions and outputs must be derived with Python, and the notebook should be "runnable" by me (top to bottom) without errors.

4. Late assignments will be assessed a 15% late penalty up to 3 days after the due date and a 50% late penalty until the end of the term.

</b>

<u>**Instructions for tasks that will be graded are in bold below.**</u>

---

**Please list the names of the other students with whom you worked for this assignment (if none, put "None").**

None

**Did you use a generative AI model (e.g., ChatGPT) to create text or code for this assignment?**

No

---

#### **PART 0 - WEEKLY VISUALIZATION**

***<u>NOTE: Part 0 should be done by yourself and not in collaboration with others.</u>***

Beginning this week, part of the weekly assignments will include the finding and visualization of data.  This is – and will continue to be – a *very* open-ended task with two objectives:

1. ***Find a data set on the web that relates to a policy problem***

2. ***Make a plot of some characteristics/features of that data***

These "weekly visualizations" should be done in a <u>separate notebook</u> and should **include a link** to where I can find the data.  **A caption is also necessary** but it is *not* a requirement that the visualization show an obvious relationship (e.g., correlation or scaling) between the features of the data set.  

Lastly, these visualizations should be made **using Python/Jupyter running on your own machine** and not Colaboratory.  If you do not have access to your own computer on which you have permissions to install software for yourself, or if your machine does not have sufficient computational resources to load and analyze data, please let me know!

**To submit the visualization, the `.ipynb` Jupyter notebook that you create and write on your machine should be uploaded to your UD GoogleDrive and shared with me.**

---

### OVERVIEW

Social media use by the US Congress has been [booming over the past several election cycles](https://fas.org/sgp/crs/misc/R45337.pdf), with members using these platforms to directly engage with their constituents, take part in the national conversation around timely issues, and fundraise for (re)election.  Participation in social media discourse is almost mandatory for public servants looking to both compete in the messaging game and solicit feedback from those who are affected by their policies.

In this assignment, you will use string manipulations and indexing to parse congressional tweets and extract patterns of Twitter usage by Congress, focusing on issues of public health and climate.

---

### **PART 1 - Background**

Good data science (and data analysis more generally) depends on a clear understanding of the underlying problem/situation, the methods by which the data you are about to analyze are collected, and the situational context in which that data sits.  To that end:


**Read through the Congressional Research Service 2018 report on [Social Media Adoption by Members of Congress](https://sgp.fas.org/crs/misc/R45337.pdf) as well as a detailed 2020 study of congressional social media use across multiple platforms by the [Pew Research Center](https://www.pewresearch.org/internet/2020/07/16/1-the-congressional-social-media-landscape/).**

### **Part 2 - Loading and Preparing the Data**

The data that you will be using for this assignment come from [this project](https://github.com/alexlitel/congresstweets), which collects and collates tweets from congresspersons each day.  For this assignment we will concentrate on tweets from the summer of 2021, when multiple policy-relevant issues were unfolding, including the release of IPCC assessment reports and the impacts of reopening schools during the COVID pandemic.

**Run the following cell to create a list of filenames for the summer (June, July, and August) of 2021.**

In [3]:
import pandas as pd

# -- set the summer 2021 filenames
#    (one URL per day from 2021-06-01 through 2021-08-31, 92 files in total)
json_base = "https://raw.githubusercontent.com/alexlitel/congresstweets/master/data/"
json_names = [
    f"{json_base}{day:%Y-%m-%d}.json"
    for day in pd.date_range("2021-06-01", "2021-08-31")
]

**Run the following cell to load the data and describe what each step is doing by filling in the comments below (replace the "???" in each comment with your own words).  Note, that you are permitted to use generative AI models for this step, but as usual, take caution since they may provide inaccurate motivations for the individual steps.**

In [None]:
# -- Create an empty list
data_list = []

# -- Start a for loop that iterates over each element of the json_names list
for fname in json_names:

  # -- Print the name of each element of the list
  print(fname)

  # -- Read the JSON file at this URL into a pandas DataFrame and append it
  #    to data_list; if the file cannot be read, "...NOT FOUND!!!" is printed
  #    and the loop continues with the next element.
  try:
    data_list.append(pd.read_json(fname))
  except Exception:
    print("  ...NOT FOUND!!!")

# -- Create a dataframe that is the result of concatenating all elements of data_list
data = pd.concat(data_list)

Notice that you now have a DataFrame `data` containing tweets for the summer of 2021.  However, the accounts from which these are sent include all accounts affiliated with Congresspersons and we would like to focus on only official accounts.

**Select only those tweets that are from official accounts for each Congressperson by:**

<b>

1. loading the accounts data from `dstep23/data/congress_tweets/congress_social_media_handles.csv` into a DataFrame and calling it `accts`.

2. creating a `True`/`False` index for the `data` DataFrame that is `True` when `"screen_name"` from `data` is in the `"Twitter"` column in `accts`. <small>(nb, you will have to use a method that we did not use in class for this step)</small>

3. using that index to sub-select only those tweets from official accounts, put those in a new DataFrame and call it `tweets`.

</b>
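The three steps above can be sketched on toy data. This is only an illustration, assuming `Series.isin` is the membership method intended in step 2; the DataFrames `toy_data` and `toy_accts` and all handles in them are made up, and only the column names `"screen_name"` and `"Twitter"` come from the assignment text.

```python
import pandas as pd

# Toy stand-ins for the real `data` and `accts` DataFrames (all values invented).
toy_data = pd.DataFrame({
    "screen_name": ["RepExample", "CampaignExample", "SenSample"],
    "text": ["first tweet", "second tweet", "third tweet"],
})
toy_accts = pd.DataFrame({"Twitter": ["RepExample", "SenSample"]})

# True where the tweet's screen_name appears in the official-handle column.
is_official = toy_data["screen_name"].isin(toy_accts["Twitter"])

# Keep only the official-account rows; .copy() gives an independent DataFrame.
toy_tweets = toy_data[is_official].copy()
print(toy_tweets["screen_name"].tolist())
```

Note that `isin` matches strings exactly, so handles that differ only in capitalization would not match unless both sides are normalized first (for example with `.str.lower()`).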

**Create a column called `"day"` in the `tweets` DataFrame that represents the day the tweet was sent$^{\dagger}$.**

<small>$^\dagger$ Note, ignore any Warnings (but not Errors!) that you may get when creating the column.  Also note that the day is the first 10 characters in the `"time"` column.</small>
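One possible way to build the `"day"` column is string slicing with the `.str` accessor. The sketch below assumes the `"time"` column holds ISO-style strings, as the footnote implies; the timestamps themselves are invented for illustration.

```python
import pandas as pd

# Toy stand-in for `tweets`; the timestamps are made up.
toy_tweets = pd.DataFrame({
    "time": [
        "2021-06-01T09:15:00-04:00",
        "2021-06-01T18:40:12-04:00",
        "2021-08-09T07:02:45-04:00",
    ],
})

# The first 10 characters of each string are the YYYY-MM-DD date.
toy_tweets["day"] = toy_tweets["time"].str[:10]
print(toy_tweets["day"].tolist())
```

If `tweets` was created by slicing another DataFrame without `.copy()`, pandas may emit a `SettingWithCopyWarning` at this step, which is the kind of warning the footnote refers to.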

**Create two separate subsets of the `tweets` DataFrame: one containing only tweets from Congresspersons affiliated with the Republican party and one containing only tweets from Congresspersons affiliated with the Democratic party.**
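A hedged sketch of the party split follows. It assumes party affiliation lives in the accounts table under a hypothetical `"Party"` column coded `"R"`/`"D"`; the real column name and coding may differ, so check `accts.columns` first. All rows are invented.

```python
import pandas as pd

# Toy stand-ins; "Party" and its R/D/I codes are assumptions, not the real schema.
toy_tweets = pd.DataFrame({
    "screen_name": ["RepExample", "SenSample", "RepOther"],
    "text": ["tweet a", "tweet b", "tweet c"],
})
toy_accts = pd.DataFrame({
    "Twitter": ["RepExample", "SenSample", "RepOther"],
    "Party": ["R", "D", "I"],
})

# Attach each account's party to its tweets.
merged = toy_tweets.merge(
    toy_accts[["Twitter", "Party"]],
    left_on="screen_name", right_on="Twitter", how="left",
)

# Boolean indexing gives one subset per party.
toy_rep = merged[merged["Party"] == "R"]
toy_dem = merged[merged["Party"] == "D"]
print(len(toy_rep), len(toy_dem))
```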

### **Part 3 - Parsing Tweets and Generating Timeseries**

**Make a bar chart showing the top 50 most prolific tweeters in Congress in the summer of 2021$^\dagger$ <u>labeling the x-axis with the last name of the Congressperson</u> *not* their screen name.  For this labeling, last names should be determined with python code and not "hard coded" by hand.**

<small>$^\dagger$ Hint: the `.sort_values()` method of DataFrames will be useful here.</small>
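The counting and labeling can be sketched as below. The `"Name"` column and its "First Last" layout are assumptions made for illustration (the real accounts file may store names differently), and the toy example keeps the top 2 rather than the top 50.

```python
import pandas as pd

# Toy stand-ins; the "Name" column and its format are hypothetical.
toy_tweets = pd.DataFrame({
    "screen_name": ["RepExample"] * 3 + ["SenSample"] * 2 + ["RepOther"],
})
toy_accts = pd.DataFrame({
    "Twitter": ["RepExample", "SenSample", "RepOther"],
    "Name": ["Alex Example", "Sam Q. Sample", "Pat Other"],
})

# Tweets per account, most prolific first (use .head(50) on the real data).
counts = (
    toy_tweets.groupby("screen_name").size()
    .sort_values(ascending=False)
    .head(2)
    .rename_axis("Twitter")
    .reset_index(name="n_tweets")
)

# Look up each account's name and take the final whitespace-separated token.
top = counts.merge(toy_accts, on="Twitter", how="left")
top["last_name"] = top["Name"].str.split().str[-1]
print(top[["last_name", "n_tweets"]])
```

With matplotlib installed, `top.plot.bar(x="last_name", y="n_tweets")` would draw the chart. Taking the last token mislabels names with suffixes such as "Jr.", which is one assumption worth stating in the caption.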

**Plot the total number of tweets per day for all Congresspersons, Republican Congresspersons, and Democratic Congresspersons.  Be sure to include captions for all plots that you create.**
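One way to get all three daily series in a single table is `pd.crosstab`. This sketch assumes a hypothetical `"Party"` column has already been attached to each tweet; the rows are invented.

```python
import pandas as pd

# Toy stand-in for tweets with "day" and a hypothetical "Party" column.
toy = pd.DataFrame({
    "day": ["2021-06-01", "2021-06-01", "2021-06-02", "2021-06-02", "2021-06-02"],
    "Party": ["R", "D", "D", "D", "R"],
})

# Rows are days, columns are parties, cells are tweet counts.
per_day = pd.crosstab(toy["day"], toy["Party"])

# Total across every party present in the table.
per_day["All"] = per_day.sum(axis=1)
print(per_day)
```

Calling `per_day.plot()` (with matplotlib available) would then draw one line per column, which keeps the three series on a shared axis for comparison.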

**Using <u>only</u> information contained in the `tweets` DataFrame, what day was the IPCC Sixth Assessment Report released?**
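A possible approach is to count tweets that mention "IPCC" on each day and look for the day where mentions peak, on the assumption that a release produces a spike. The sketch uses invented tweets and invented dates, so its output says nothing about the real release date.

```python
import pandas as pd

# Toy stand-in for `tweets`; days and text are made up.
toy = pd.DataFrame({
    "day": ["2021-07-01", "2021-07-02", "2021-07-02", "2021-07-02", "2021-07-03"],
    "text": [
        "Looking forward to the IPCC report",
        "The new IPCC report is a wake-up call",
        "Read the ipcc findings released today",
        "Town hall at 6pm",
        "Still digesting the IPCC report",
    ],
})

# Case-insensitive match; na=False treats missing text as "no mention".
mentions = toy["text"].str.contains("IPCC", case=False, na=False)

# Mentions per day, then the day with the most mentions.
per_day = toy[mentions].groupby("day").size()
peak_day = per_day.idxmax()
print(peak_day, per_day[peak_day])
```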

**What fraction of tweets by each party mentioned COVID in the summer of 2021?  Be sure to articulate any assumptions made when determining this number.**
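Because the mean of a boolean column is the fraction of `True` values, the per-party fraction can be sketched with a group-wise mean. The search pattern (`covid` or `coronavirus`, case-insensitive) and the `"Party"` column are assumptions; the tweets are invented.

```python
import pandas as pd

# Toy stand-in for tweets with a hypothetical "Party" column.
toy = pd.DataFrame({
    "Party": ["R", "R", "D", "D", "D", "D"],
    "text": [
        "COVID relief vote today",
        "Infrastructure week",
        "Get your covid-19 vaccine",
        "Town hall tonight",
        "Coronavirus cases rising",
        "Pandemic aid update #COVID19",
    ],
})

# Assumed definition of "mentions COVID": either keyword, any capitalization.
toy["mentions_covid"] = toy["text"].str.contains(
    r"covid|coronavirus", case=False, na=False
)

# Mean of a boolean column = fraction of tweets that match.
frac = toy.groupby("Party")["mentions_covid"].mean()
print(frac)
```

Tweets that say only "the pandemic" or "the virus" are missed by this pattern, so the keyword list is itself one of the assumptions to articulate.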

**Plot the number of mentions of the Delta variant (the dominant variant at the time) for each day in the summer of 2021.  Does this track the total number of positive tests in the US during that time?**
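When counting mentions per day, days with no mentions drop out of the result entirely, which can distort a time-series plot. The sketch below, on invented tweets, fills those gaps with zeros by reindexing over the full date range; the whole-word pattern `\bdelta\b` is an assumption and would also match unrelated uses such as the airline.

```python
import pandas as pd

# Toy stand-in for `tweets`; days and text are made up.
toy = pd.DataFrame({
    "day": ["2021-06-01", "2021-06-03", "2021-06-03"],
    "text": [
        "The Delta variant is spreading",
        "delta cases are up",
        "Budget markup today",
    ],
})

# Whole-word, case-insensitive match for "delta".
is_delta = toy["text"].str.contains(r"\bdelta\b", case=False, na=False)
counts = toy.loc[is_delta, "day"].value_counts()

# Every day in the window as a YYYY-MM-DD string; missing days become 0.
all_days = pd.date_range("2021-06-01", "2021-06-03").strftime("%Y-%m-%d")
per_day = counts.reindex(all_days, fill_value=0)
print(per_day)
```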

Throughout the summer of 2021, there were significant policy decisions being made regarding the reopening of school systems.

**Count the total number of tweets that mention "teachers" AND either "safe" OR "safety".  Do the numbers of mentions differ by political party?**
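The AND/OR condition maps onto element-wise boolean operators on pandas masks. In this sketch the whole-word matching is an assumption (a plain substring search for "safe" would also match "unsafe"), the `"Party"` column is hypothetical, and the tweets are invented.

```python
import pandas as pd

# Toy stand-in for tweets with a hypothetical "Party" column.
toy = pd.DataFrame({
    "Party": ["D", "R", "D", "R"],
    "text": [
        "Teachers deserve safe classrooms",
        "School safety matters to teachers and parents",
        "Thank a teacher today",
        "Safe roads bill passes",
    ],
})

# Whole-word, case-insensitive matches; (?:...) is a non-capturing group.
has_teachers = toy["text"].str.contains(r"\bteachers\b", case=False, na=False)
has_safe = toy["text"].str.contains(r"\b(?:safe|safety)\b", case=False, na=False)

# "teachers" AND ("safe" OR "safety")
both = has_teachers & has_safe
total = int(both.sum())
by_party = toy[both].groupby("Party").size()
print(total)
print(by_party)
```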

**Describe the many and various assumptions that went into all the analyses you have presented above (200 words max).**

TEXT FOR YOUR ANSWER HERE.