# Visualization exercise

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# colors
demblue = "#0015BC"
repred = "#FF0000"
demgrey = "#9EA4BF"
repgrey = "#BF9EA2"

## 1. Histogram

**Task 1.1:** Load the data you collected via MTurk on day 2. Visualize the `WorkTimeInSeconds` column using the `histplot` function of the `seaborn` package. Can you identify any odd behaviour in the histogram?

In [None]:
results = pd.read_csv(
    # your code here
)

In [None]:
fig, ax = plt.subplots()
sns.histplot(
    # your code here
    ax=ax
)

**Alternative task 1.1:** Use the `histplot` function of the `seaborn` package to visualize how the share of belief-speaking changed for Democrats between the time periods 2010-2013 and 2019-2022. How did it change for Republicans?

**Hint:** You can use the argument `hue` of the `histplot` function to create a separate histogram for each time period in the same plot. See also the function's [documentation](https://seaborn.pydata.org/generated/seaborn.histplot.html).

In [4]:
# read the data frame with information about individual Congress Members
users = pd.read_csv("https://raw.githubusercontent.com/JanaLasser/SICSS-aachen-graz/main/04_01_visualization/exercise/data/users.csv", dtype={"author_id":str})

# histplot() requires "long-form" data, we therefore reshape the data frame
# to match the required input
belief = pd.melt(
    users, 
    id_vars=["handle", "author_id", "party"],
    value_vars=["belief_share_2010_to_2013", "belief_share_2019_to_2022"],
    var_name="time_period",
    value_name="share"
)

# rename the values in the newly created "time_period" column to contain only
# the required information
belief["time_period"] = belief["time_period"].replace({
    "belief_share_2010_to_2013":"2010 to 2013",
    "belief_share_2019_to_2022":"2019 to 2022"
})

belief.tail(3)

Unnamed: 0,handle,author_id,party,time_period,share
2051,RodneyDavis,993153006,Republican,2019 to 2022,0.057344
2052,RepDelBene,995193054,Democrat,2019 to 2022,0.052866
2053,RepDLesko,996094929733652480,Republican,2019 to 2022,0.049877


In [None]:
fig, ax = plt.subplots()
sns.histplot(
    # your code here
    ax=ax
)

**Task 1.2 (optional):** Instead of the `histplot` function, use the `kdeplot` function to visualize the distributions. When does it make sense to use a KDE plot instead of a histogram?

In [None]:
fig, ax = plt.subplots()
sns.kdeplot(
    # your code here
    ax=ax
)

**Task 1.3 (optional):** Create a 2x2 grid to show how the proportion of belief-speaking and truth-seeking tweets changed for Republicans and Democrats between 2010-2013 and 2019-2022.
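One possible way to fill a grid of panels is to loop over the rows and columns of the `axes` array. The sketch below uses random numbers and placeholder labels (`row_labels`, `col_labels`), all of which are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
row_labels = ["group 1", "group 2"]      # hypothetical row categories
col_labels = ["measure X", "measure Y"]  # hypothetical column categories

# sharex/sharey keep the panels directly comparable
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)

for i, row in enumerate(row_labels):
    for j, col in enumerate(col_labels):
        ax = axes[i, j]  # axes is a 2x2 numpy array of Axes objects
        ax.hist(rng.normal(loc=i + j, scale=1.0, size=300), bins=20)
        ax.set_title(f"{row}, {col}")

fig.tight_layout()
```

Any `seaborn` plotting function that accepts an `ax` argument can be used inside the loop in place of `ax.hist`.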

In [None]:
fig, axes = plt.subplots(2, 2)

# your code here

## 2. Bar chart

**Task 2.1:** Load the data you collected via MTurk on day 2. Visualize the answers as bar charts. Exactly how to do this and what to display depends on the kind of questions you asked. Note that it might also make sense to visualize results as a scatterplot instead. Ask for help if you are unsure.

In [None]:
results = pd.read_csv(
    # your code here
)

In [None]:
fig, ax = plt.subplots()
sns.barplot(
    # your code here
    ax=ax
)

**Alternative task 2.1:** The data frame `topics` contains information about the proportion of belief-speaking and truth-seeking that Democrats and Republicans use when discussing different topics. Use the `barplot` function of the `seaborn` package to visualize the share of both honesty components for the topics "abortion", "gun", "vaccine" and "putin", differentiated by party. For which topics do the Republicans use more belief-speaking? For which do the Democrats?
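A grouped bar chart in `seaborn` follows the same long-form logic as the histogram: one column for the category axis, one for the bar height, and one for `hue`. The sketch below uses a small invented data frame `toy`; the topic names and proportions are placeholders, and only the two party colors are taken from the color definitions at the top of this notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# invented values (assumption), one row per topic and party
toy = pd.DataFrame({
    "topic": ["topic a", "topic a", "topic b", "topic b"],
    "party": ["Democrat", "Republican", "Democrat", "Republican"],
    "proportion": [7.0, 9.0, 5.5, 8.5],
})

fig, ax = plt.subplots()
sns.barplot(
    data=toy,
    x="topic",
    y="proportion",
    hue="party",
    # a dict maps each hue level to a fixed color
    palette={"Democrat": "#0015BC", "Republican": "#FF0000"},
    ax=ax,
)
ax.set_ylabel("proportion")
```

Passing the palette as a dictionary keeps the party colors stable even if the row order of the data changes.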

In [5]:
topics = pd.read_csv("https://raw.githubusercontent.com/JanaLasser/SICSS-aachen-graz/main/04_01_visualization/exercise/data/topics.csv")
topics.head(10)

Unnamed: 0,component,party,proportion,topic_name
0,belief-speaking,Democrat,0.07142,abortion $\vert$ woman $\vert$ right $\vert$ life
1,truth-seeking,Democrat,0.099941,abortion $\vert$ woman $\vert$ right $\vert$ life
2,belief-speaking,Republican,0.094262,abortion $\vert$ woman $\vert$ right $\vert$ life
3,truth-seeking,Republican,0.111906,abortion $\vert$ woman $\vert$ right $\vert$ life
4,belief-speaking,Democrat,0.054801,gun $\vert$ violence $\vert$ background $\vert...
5,truth-seeking,Democrat,0.108535,gun $\vert$ violence $\vert$ background $\vert...
6,belief-speaking,Republican,0.089088,gun $\vert$ violence $\vert$ background $\vert...
7,truth-seeking,Republican,0.140884,gun $\vert$ violence $\vert$ background $\vert...
8,belief-speaking,Democrat,0.069394,vaccine $\vert$ vaccinate $\vert$ mandate $\ve...
9,truth-seeking,Democrat,0.252456,vaccine $\vert$ vaccinate $\vert$ mandate $\ve...


In [6]:
belief_speaking = topics[topics["component"] == "belief-speaking"][0:8].copy()
belief_speaking["proportion"] = belief_speaking["proportion"] * 100

In [None]:
fig, ax = plt.subplots()
sns.barplot(
    # your code here
    ax=ax
)


**Task 2.2 (optional):** The file `mean_corpus_values.csv` contains the mean values for belief-speaking and truth-seeking for the full corpus of tweets. Load the file into a data frame and plot two lines (one for each party) indicating the mean level of belief-speaking in the corpus. For which topics is the proportion of belief-speaking above average? For which below?
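One option for drawing a reference level across a bar chart is matplotlib's `axhline`, which spans the full width of the panel. The bar heights and the reference value below are invented for illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4))

# placeholder bars (assumption)
ax.bar(["a", "b", "c"], [6.0, 9.5, 7.2], color="#9EA4BF")

# horizontal line at y = 8.0 across the whole panel
ax.axhline(8.0, color="#0015BC", linestyle="--", label="reference level")
ax.legend()
```

Calling `axhline` once per party, each with its own color, gives the two lines the task asks for. `ax.plot` with explicit x-coordinates works as well if the line should cover only part of the panel.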

In [8]:
mean_corpus_values = pd.read_csv("https://raw.githubusercontent.com/JanaLasser/SICSS-aachen-graz/main/04_01_visualization/exercise/data/mean_corpus_values.csv")
mean_corpus_values

Unnamed: 0,component,party,corpus_mean
0,belief-speaking,Democrat,8.216097
1,belief-speaking,Republican,7.643007
2,truth-seeking,Democrat,18.290874
3,truth-seeking,Republican,16.60015


In [None]:
fig, ax = plt.subplots(figsize=(8, 4))

sns.barplot(
    # your code here
    ax=ax,
)

ax.plot(
    # your code here
)

**Task 2.3 (optional):** Create two bar plots next to each other, one for belief-speaking and one for truth-seeking, this time showing all 20 topics contained in the `topics` data frame. Which topics show the largest difference in belief-speaking and truth-seeking between the parties?

In [None]:
# your code here

## 3. Time series

**Task 3.1:** The files `belief.csv` and `truth.csv` contain the mean proportion of belief-speaking and truth-seeking tweets for every month since 2011-01-01, split by Democrats and Republicans. Create a figure with two vertically stacked panels. Plot the time series of belief-speaking split by party in the first panel, and the time series of truth-seeking in the panel below. Apply a rolling average of three months to the data to smooth the time series.
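A rolling average can be computed with the `rolling()` method in pandas. The sketch below uses a made-up monthly series; the data frame `toy` and the column `mean_smooth` are names chosen for this illustration.

```python
import pandas as pd

# made-up monthly series (assumption)
toy = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=6, freq="MS"),
    "mean": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# each value becomes the mean of itself and the two preceding rows;
# the first two rows are NaN because their window is incomplete
toy["mean_smooth"] = toy["mean"].rolling(window=3).mean()
print(toy)
```

Because the window counts rows, not calendar months, a data frame that holds several groups in one column (such as both parties) needs to be filtered or grouped first, so that each window only covers rows of a single group. Passing `center=True` to `rolling()` centers the window on each row instead of letting it trail.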

In [9]:
belief = pd.read_csv("https://raw.githubusercontent.com/JanaLasser/SICSS-aachen-graz/main/04_01_visualization/exercise/data/belief.csv", parse_dates=["date"])
truth = pd.read_csv("https://raw.githubusercontent.com/JanaLasser/SICSS-aachen-graz/main/04_01_visualization/exercise/data/truth.csv", parse_dates=["date"])
belief.head(2)

Unnamed: 0,party,mean,perc_2.5,perc_97.5,date
0,Democrat,0.066411,0.036199,0.097674,2011-01-01
1,Republican,0.062301,0.041096,0.080412,2011-01-01


In [None]:
fig, axes = plt.subplots(2, 1, figsize=(9, 4))

axes[0].plot(
    # your code here
)

axes[1].plot(
    # your code here
)


**Task 3.2:** The data frames also contain the 95% confidence intervals of the time-series generated through bootstrapping (columns `perc_2.5` and `perc_97.5`). Plot a shaded area indicating the confidence interval around the mean value.  

**Hint:** You can use matplotlib's `fill_between()` method to plot a shaded area between two lines. You can adjust the transparency of the area with the `alpha` argument (see also the [documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.fill_between.html)).
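As a minimal sketch of the hint, the example below shades a band around a synthetic curve. The arrays `center`, `lower` and `upper` are invented and stand in for a mean and its interval bounds.

```python
import numpy as np
import matplotlib.pyplot as plt

# synthetic curve with a constant-width band (assumption)
x = np.arange(10)
center = np.sin(x / 3)
lower = center - 0.2
upper = center + 0.2

fig, ax = plt.subplots()
ax.plot(x, center, color="#0015BC")

# shade the area between the lower and upper bound;
# alpha controls the transparency of the shaded area
band = ax.fill_between(x, lower, upper, color="#0015BC", alpha=0.2)
```

Using the same color for the line and the band, with a low `alpha` for the band, keeps the plot readable when the bands of several groups overlap.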

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(9, 4))

axes[0].plot(
    # your code here
)

axes[0].fill_between(
    # your code here
)

axes[1].plot(
    # your code here
)

axes[1].fill_between(
    # your code here
)

**Task 3.3:** Indicate the dates of the presidential elections in 2016 and 2020 in the plot. What can you say about the trend of belief-speaking and truth-seeking after the presidential elections?
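Vertical marker lines on a date axis can be drawn with `axvline`. The sketch below marks a single, hypothetical event date on a synthetic monthly series; neither the date nor the values come from the exercise data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# synthetic monthly series (assumption)
dates = pd.date_range("2015-01-01", periods=24, freq="MS")
values = np.linspace(0.0, 1.0, 24)

fig, ax = plt.subplots(figsize=(9, 2))
ax.plot(dates, values)

# hypothetical event date, passed as a timestamp so that it
# is placed correctly on the date axis
event = pd.Timestamp("2016-06-01")
ax.axvline(event, color="grey", linestyle="--")
```

For a figure with several panels, the call has to be repeated for each `Axes` object, for example in a loop over `axes`.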

In [None]:
# your code here

## 4. Scatter plot

**Task 4.1:** The `users.csv` dataset contains information about the average [NewsGuard](https://www.newsguardtech.com/) score of links posted by each Congress Member (column `NG_score_mean`). Use `seaborn`'s `scatterplot()` function to find out if the share of belief-speaking over the whole observation period (column `belief_share`) is correlated with the average NewsGuard score.

In [10]:
users = pd.read_csv(
    "https://raw.githubusercontent.com/JanaLasser/SICSS-aachen-graz/main/04_01_visualization/exercise/data/users.csv", 
    dtype={"author_id":str},
    parse_dates=["created_at"]
)
users = users[users["party"].isin(["Democrat", "Republican"])]
users["belief_share"] = users["belief_share"] * 100
users.head(2)

Unnamed: 0,handle,author_id,name,party,N_tweets,followers_count,following_count,tweet_count,created_at,congress,NG_score_mean,belief_share,truth_share,belief_share_2010_to_2013,truth_share_2010_to_2013,belief_share_2019_to_2022,truth_share_2019_to_2022,ideology_mean,followers_count_log,following_count_log
0,RepLipinski,1009269193,Former Rep. Daniel Lipinski,Democrat,3179,19893.0,2478.0,4359.0,2012-12-13 17:03:06+00:00,116.0,93.193439,5.067698,0.188008,,,0.053934,0.214467,0.471657,9.898123,7.815207
1,CaptClayHiggins,1011053278304592000,Clay Higgins,Republican,21,3289.0,156.0,24.0,2018-06-25 01:07:40+00:00,116.0,,0.0,0.230769,,,0.0,0.181818,0.282969,8.098339,5.049856


In [5]:
fig, ax = plt.subplots(figsize=(7, 4))

sns.scatterplot(
    # your code here
    ax=ax
)

**Task 4.2 (optional):** Use the package `statsmodels` to perform an ordinary least squares regression of the form  

`NG_score_mean ~ belief_share + truth_share + party + party * belief_share + party * truth_share`.

How does the NewsGuard score change if the proportion of belief-speaking increases by 10 percentage points? How does it change if the proportion of truth-seeking increases by 10 percentage points?
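To show how interaction terms in a formula are read, here is a small sketch on synthetic data. The variables `x`, `group` and `y` are invented, and the true slopes (2 for group A, 3 for group B) are set by construction so that the fitted coefficients can be checked against them.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    "x": rng.uniform(0, 10, n),
    "group": rng.choice(["A", "B"], n),
})

# outcome with slope 2 for group A and slope 2 + 1 = 3 for group B
is_b = (toy["group"] == "B").astype(float)
toy["y"] = 1.0 + 2.0 * toy["x"] + 1.0 * toy["x"] * is_b + rng.normal(0, 0.5, n)

# "x * group" expands to x + group + x:group
model = smf.ols("y ~ x * group", data=toy).fit()
print(model.params)
```

The coefficient of `x` is the slope for the reference group (A), and the slope for group B is the sum of the `x` coefficient and the interaction coefficient. A coefficient describes the change in the outcome per one unit of the predictor, so its interpretation depends on whether a share is stored as a fraction or as a percentage.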

In [6]:
import statsmodels.formula.api as smf

In [None]:
# your code here

**Task 4.3 (optional):** Use the fitted OLS regression model to make predictions of the NewsGuard score for a range of proportions of belief-speaking from 0% to 30%. Get the 95% confidence intervals for the predictions. Visualize the predictions including their confidence intervals on top of the scatterplot.

**Hint:** Use the `get_prediction()` method of the fitted model to get the predictions, then call `summary_frame()` on the result to get the confidence intervals.
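The hint can be sketched as follows on a synthetic one-predictor model. The data frame `toy`, the grid `new_data` and the model formula are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# synthetic data (assumption): linear trend plus noise
rng = np.random.default_rng(2)
toy = pd.DataFrame({"x": rng.uniform(0, 10, 100)})
toy["y"] = 1.0 + 2.0 * toy["x"] + rng.normal(0, 1.0, 100)

model = smf.ols("y ~ x", data=toy).fit()

# grid of predictor values for which predictions are requested
new_data = pd.DataFrame({"x": np.linspace(0, 10, 50)})

# alpha=0.05 yields 95% intervals; the frame contains the columns
# mean, mean_se, mean_ci_lower, mean_ci_upper, obs_ci_lower, obs_ci_upper
pred = model.get_prediction(new_data).summary_frame(alpha=0.05)

fig, ax = plt.subplots(figsize=(7, 4))
ax.scatter(toy["x"], toy["y"], s=10, alpha=0.5)
ax.plot(new_data["x"], pred["mean"], color="black")
ax.fill_between(
    new_data["x"],
    pred["mean_ci_lower"],
    pred["mean_ci_upper"],
    color="black",
    alpha=0.2,
)
```

For a model with several predictors, `new_data` needs one column for every variable in the formula, so the remaining predictors have to be held at fixed values while the variable of interest is varied.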

In [None]:
# your code here