# Political Polarization in the United States of America 

This notebook contains, or links to, all the *behind the scenes* material for the exam project in the course *Social graphs and interactions* (02805). A website with the key results and visualizations can be found at [groenning.net](http://groenning.net). We suggest exploring the website first and then using this notebook to get a better idea of how everything was created. The notebook also shows some other interesting results that did not make the cut for the website.

This notebook is structured as follows:
- 1. Motivation
- 2. Data
- 3. Basic statistics
- 4. Tools, Theory, and Analysis (a teaser to the three analyses)
    - 4.1. Analysis 1: Who-follow-whom (link to notebook)
    - 4.2. Analysis 2: Natural Language Processing of Tweets (link to notebook)
    - 4.3. Analysis 3: Analysis of *retweets* (link to notebook)
- 5. Discussion
- 6. Conclusion
- 7. Contributions

The analyses are kept in separate, linked notebooks to limit the size of this notebook so that more computers are able to handle it.

With that in place let's get going!

<img src="https://img.youtube.com/vi/KEkrWRHCDQU/0.jpg" alt="image info" />

# 1. Motivation

The presidency of Donald J. Trump began at noon EST (17:00 UTC) on January 20, 2017, when he was inaugurated as the 45th president of the United States, and will come to an end on January 20, 2021, as he ultimately lost the 2020 presidential election to Joe Biden. It is not far-fetched to say it has been a bizarre presidency compared to the most recent presidencies. 

It feels like America has been split in two: the supporters of Trump and those against him - Republicans against Democrats. In this project, we wanted to explore whether our hypothesis of polarization can be supported or rejected by analyzing the behavior of the United States Congress on the social media platform Twitter, including the infamous Twitter account managed by Donald J. Trump. The idea was to analyze the tweets of Congress members during the 45th presidency to explore potential polarization.


## 1.1 What is your dataset?
The idea is to analyze data from `Twitter` with a focus on tweets from the American Congress in the period 2017-2020 to get an understanding of the political polarization in the US. The data used in this project consists of tweets from 1072 congress members from the 115th and 116th Congress and from the president of the United States, Donald J. Trump. The data comes from Harvard Dataverse and the Trump Twitter Archive (links are presented in the Data section) and contains the following information about each account holder:

* The state they are from,
* whether they are a representative, senator, or POTUS (President),
* their full name,
* which party they are a member of,
* and their Twitter handle.

The Twitter handles have been used to download tweets from all members in the given period using the Twitter API. In addition, we have added 16 of the largest media outlets in the US with the same attributes but without tweets.

You can read all about how the tweets were extracted in [this] notebook. Note that extracting the tweets requires a Twitter developer account in order to access the Twitter API. Extracting all the raw data takes approx. 24 hours, whereas extracting only the cleaned and processed data takes approx. 6-8 hours, as the raw data contains many duplicates and unnecessary tweets, e.g. posts by random profiles. In this notebook, we will only consider the cleaned data, and we refer to the other notebook for a full elaboration on how the data was extracted and cleaned.

Information about followers and retweets has been extracted for all users (both persons and media) in order to create networks that might reveal some interesting information about the polarization.

<img src="../web_app/figures/congress.png" width=350 height=250 /> <img src="../web_app/figures/trumpeten.png" width=350 height=250 /> <img src="../web_app/figures/medias.jpg" width=350 height=250 />

Per Twitter's Developer Policy, tweet ids may be publicly shared for academic purposes; tweets may not (see [ToC](https://developer.twitter.com/en/developer-terms/agreement-and-policy)). Thus, the data available to our readers will not contain the tweets. Details on how they can be obtained (with some patience) follow below.

## 1.2 Why did you choose these particular datasets?
These particular datasets have been chosen as we want to investigate whether the political polarization in the US appears in the congress members' tweets. One could suspect that the polarization was especially pronounced during Donald Trump's presidency, and therefore the period of his presidency is interesting to look at. It also guarantees us a large network that can be analyzed based on followers, retweets, and a bipartite graph showing the polarization.

Twitter is also a very interesting site that more and more politicians use, as it lets them easily reach many followers. It is, however, also a catalyst for polarization, as users decide whom to follow - and many probably follow others with the same views as themselves. It would be very interesting to know if this is also the case for the Congress of the United States.

## 1.3 What was your goal for the end user's experience?

The end goal is to have a website where the user should be able to investigate the key results of the full project as interesting visualizations and nice summarization of numbers in tables. A user - who knows Twitter - should be able to understand what the visualizations indicate without having all the theoretical insight from the course. A user should also be able to get insights from just looking at the page for 5 minutes while we also have hours of material in the form of all background analysis.

On a more subject-based matter, the user should get an insight into whether the polarization of the political fronts is expressed in the form of tweets, but also whether there exists a pattern in who follows and retweets whom within Congress. Additionally, the aim is to provide insight into how the media influences this polarization. 

# 2. Data

In this section the data used will be presented. First we will explain what kind of data we have, afterwards we will explain how it was preprocessed before giving some basic stats. We will roughly follow the structure shown in the figure below:

<img src="../figures/data_processing.png" width=660 height=150 />

Before any coding starts, the required packages are imported, as can be seen below:

In [1]:
# Package import
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import urllib.request
import camelot
import tweepy
import tqdm
import networkx as nx 
import pickle
import itertools
import matplotlib.pyplot as plt
import matplotlib as mpl
from operator import itemgetter
import seaborn as sns
sns.set()
import json
import re
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# Own source
from src.data.trump_tweet_ids import get_trump_tweet_ids
from src.data.hydrate import hydrate_tweets
from src.tools.twitter_api_credentials import api_key, api_secret_key, access_token, access_token_secret

## 2.1 Raw data

In this project, we will use data from Donald Trump's presidency (until May 2020), during which the 115$^{\text{th}}$ and 116$^{\text{th}}$ US Congress were in office. For all politicians who have been part of either Congress we need to know which party they belong to, whether they are a Representative or Senator, which state they are from and, most importantly, their Twitter handles. After some research, the following two sources were found to contain the desired data:

* 116. Congress twitter info: (website) https://triagecancer.org/congressional-social-media 
* 115. Congress twitter info : (PDF) https://www.sciencecoalition.org/wp-content/uploads/2018/09/115th-Congress-Twitter-Handles.pdf  

We will almost exclusively use the Twitter handles that these sources list. Neither source comes in a format that is as easy to parse as a CSV file, but we will present a way to extract the data. Besides these Congress members, we will also include President Trump.

To extract the data from Twitter we will use the Twitter API with the Python library `Tweepy` as a wrapper. With access to the Twitter API, it is only possible to extract the most recent 3,200 tweets from a given account (which does not go far back for many American politicians). However, Twitter's Terms of Service do allow datasets of tweet ids (not the full JSON) to be distributed to third parties. Luckily, we found two sources closely related to our project that publish tweet ids openly, and one source that stores the full-length tweets of Donald Trump, namely:

* **115th U.S. Congress Tweet Ids:**
    An open dataset with 2,041,399 tweet ids from the Twitter accounts of members of the 115th U.S. Congress, collected between January 27, 2017 and January 2, 2019. The dataset consists of two files of interest, namely
    * `senators-1.txt` that contains tweet ids for Senators
    * `representatives-1.txt` that contains tweet ids for Representatives
    *Littman, Justin, 2017, "115th U.S. Congress Tweet Ids", https://doi.org/10.7910/DVN/UIVHQR, Harvard Dataverse, V5.*

* **116th U.S. Congress Tweet Ids**
    An open dataset with 2,817,747 tweet ids from the Twitter accounts of members of the 116th U.S. Congress, collected between January 27, 2019 and May 7, 2020. The dataset consists of two files of interest, namely
    * `congress116-senate-ids.txt` that contains tweet ids for Senators
    * `congress116-house-ids.txt` that contains tweet ids for Representatives
    *Wrubel, Laura; Kerchner, Daniel, 2020, "116th U.S. Congress Tweet Ids", https://doi.org/10.7910/DVN/MBOJNS, Harvard Dataverse*

* **Trump Twitter Archive**
    A site dedicated to archiving every single tweet from Donald J. Trump. Here we downloaded all tweets from the two periods January 27, 2017 to January 2, 2019 and January 27, 2019 to May 7, 2020. See more at https://www.thetrumparchive.com/

Examining the Harvard Dataverse data, we discovered that politicians in Congress can have several profiles, for instance a private profile, a profile associated with their work in Congress, and a campaign profile. An example is Alexandria Ocasio-Cortez, who has @AOC and @RepAOC. Unfortunately, the data also contains a large number of random profiles. Moreover, there are tweets dating as far back as 2008. Both issues have to be fixed.

During the project, we will also do a sentiment analysis where the sentiment scores from [this](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026752) source will be used.

We will use different media accounts for an analysis based on retweets. This data has been created manually based on prior knowledge and research, and we ended up with a data frame of 16 prominent American news media and their Twitter handles. It can be seen in `Data/Raw/LargestMedia.csv` or in [this csv](https://github.com/MikkelGroenning/social_graph/blob/main/Data/Raw/LargestMedia.csv).


## 2.2 Creating and preprocessing the data

This section describes how the data was extracted, created, and preprocessed. As mentioned previously, Twitter's Terms and Conditions do not allow tweets to be made publicly available, so we only had the ids from the above sources, which had to be processed in order to create the actual data used for the analysis. 

The easiest way to re-create the data is to clone the [github repository](https://github.com/MikkelGroenning/social_graph) and setup the corresponding Conda environment. The code below can then be run to create data and it consists of the following parts:

* **Extract Trump Tweet IDs**
    Here the tweet ids are extracted from the tweets made publicly available by https://www.thetrumparchive.com/

* **Hydrate Tweets**
    In this part all the tweet ids from the Harvard data archive, as well as Trump's tweet ids, are hydrated, i.e. the ids are turned back into tweets with metadata. 

* **Getting Congress Twitter Account**
    Here a pandas data frame is constructed from information scraped from the PDF describing the 115 Congress members' Twitter accounts and the HTML site describing the 116 Congress Twitter info.

* **Clean-up of Harvard Data**
    The Harvard data archive needs to be cleaned prior to analysis as it contains
    * Data prior to January 27, 2017
    * Duplicates
    * Many random profiles
    * Duplicate Twitter accounts
    
* **Preprocess the Twitter data**
    Many tweets contain links, emojis, etc. that make it difficult to perform natural language processing. In this part, the tweets are preprocessed such that they can be used for our analysis.

* **Twitter ID to Username**
    Create a dictionary that converts between the user id and the username of the users in the data set.

* **Get following adjacency**
    Create an adjacency matrix of who follows whom on Twitter among the users in our data set.

The bottom imports in the package import cell are functions from our own module `src`. The source code for these functions can be found in our Github repository linked above.

Moreover, it is important to note that the file `twitter_api_credentials.py` in our Github repository does not contain our Twitter API credentials, as they are confidential. To recreate the dataset, one needs access to the Twitter API through a developer account. Such an account can be requested at https://developer.twitter.com/en/apply-for-access - typically access is granted instantly if it is for student usage. The cell below will not work unless valid credentials are found.

**Warning! Due to Twitter's API limits, the code below takes around 35-40 hours to run.**

In [31]:
# Get twitter credentials
auth = tweepy.OAuthHandler(api_key, api_secret_key)
auth.set_access_token(access_token, access_token_secret)
try:
    redirect_url = auth.get_authorization_url()
except tweepy.TweepError:
    print('Error! Failed to get request token.')

### 2.2.1 Extract Trump Tweet IDs
As the site Trump Twitter Archive (https://www.thetrumparchive.com/) stores Donald Trump's tweets in a different format than the one typically returned by the Twitter API, we extracted the tweet ids from this source and stored them in the file `trump_id.txt`.

In [21]:
# Get tweets
df_trump_tweets1 = pd.read_csv('../Data/raw/tweets/trump_tweets_1st.csv')  
df_trump_tweets2 = pd.read_csv('../Data/raw/tweets/trump_tweets_2nd.csv')
df_trump = pd.concat([df_trump_tweets1, df_trump_tweets2])

# Write data
filepath = "../Data/raw/tweets/trump_id.txt"
get_trump_tweet_ids(df_trump, filepath)

11326 tweet ids saved


### 2.2.2 Hydrate Tweets
The process of turning tweet ids into actual tweets with metadata is called *hydration* and requires a Twitter developer account. In the cell below we load all tweet ids obtained from the Harvard Data Archive and the Trump Twitter Archive.

In [22]:
representatives115 = np.loadtxt(
    "../Data/Raw/Tweets/representatives115.txt", dtype=int
)
representatives116 = np.loadtxt(
    "../Data/Raw/Tweets/representatives116.txt", dtype=int
)
senators115 = np.loadtxt(
    "../Data/Raw/Tweets/senators115.txt", dtype=int
)
senators116 = np.loadtxt(
    "../Data/Raw/Tweets/senators116.txt", dtype=int
)
trump = np.loadtxt(
    "../Data/Raw/Tweets/trump_id.txt", dtype=int
)
congress = np.concatenate([representatives115, representatives116, senators115, senators116, trump])
print(len(congress))

4870472


The concatenated array consists of about 4.87 million tweet ids. All these tweets are now hydrated with the function `hydrate_tweets`, which is located in the `src/data` folder of our repository and can also be seen [here](https://github.com/MikkelGroenning/social_graph/blob/main/src/data/hydrate.py). Note that running the cell below takes $24 \pm 6$ hours, as the Twitter API limits how much can be extracted. More info about rate limits can be found at https://developer.twitter.com/en/docs/twitter-api/v1/rate-limits
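
To get a feel for where the running time comes from, the sketch below splits a list of ids into batches and estimates an ideal-case lower bound on the hydration time. It assumes a lookup endpoint that accepts up to 100 ids per request and a budget of 900 requests per 15-minute window; both numbers are assumptions made for illustration and should be checked against the rate-limit page linked above. The helpers `batched` and `estimate_hours` are hypothetical and are not part of `hydrate_tweets`.

```python
import math
from itertools import islice


def batched(ids, size):
    """Yield consecutive lists of at most `size` ids."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def estimate_hours(n_ids, ids_per_request=100, requests_per_window=900, window_minutes=15):
    """Return (number of requests, ideal running time in hours).

    The default limits are assumptions, not values taken from the project.
    """
    n_requests = math.ceil(n_ids / ids_per_request)
    n_windows = n_requests / requests_per_window
    return n_requests, n_windows * window_minutes / 60


# Small demonstration of the batching on made-up ids
demo_batches = list(batched(range(250), 100))

# Estimate for the number of ids printed in the cell above
n_requests, hours = estimate_hours(4_870_472)
print(f"{n_requests} requests, at least {hours:.1f} hours")
```

Under these assumptions the bound is roughly 13.5 hours. The observed running time is longer, which is plausible since it also includes network latency, retries and writing the results to disk.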


In [None]:
filepath = "../Data/interim/congress.pkl"
api = tweepy.API(auth, wait_on_rate_limit=True)

hydrate_tweets(
    tweet_ids=congress,
    filepath=filepath,
    api = api
)

### 2.2.3 Getting Congress Twitter Account
In this part a pandas data frame will be generated with each congress member's State, Type (Representative, Senator, POTUS), Name, Party, and Twitter handle. This part consists of three subparts:
* **116<sup>th</sup>** Here the desired data frame for the 116th Congress will be scraped
* **115<sup>th</sup>** Here the desired data frame for the 115th Congress will be scraped
* **Merge data** Here the two Congress data frames will be merged.

#### 116<sup>th</sup> congress

First the twitter handles for the 116<sup>th</sup> congress will be extracted using [this](https://triagecancer.org/congressional-social-media) source. The choice of source comes from the fact that the Twitter handle as well as the party is desired.

`BeautifulSoup` is used to extract the HTML table from the webpage (that has been downloaded to allow for offline work).

In [23]:
# Open data
with open('../Data/Raw/116_congress_twitter.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# Find table
table = soup.find('table', attrs={'id':"footable_16836"})

# Extract data row-wise from the table
l = []
for tr in table.find_all('tr'):
    cells = tr.find_all('td')
    row = [cell.text for cell in cells]
    l.append(row)

# Make the data into a Pandas data frame and drop irrelevant columns
Data116 = pd.DataFrame(l[1:], columns = [header.getText() for header in table.findAll('th')]).drop(columns = ['Name Links', 'Twitter Links', 'Instagram', 'Facebook Page', 'Facebook'])

# Ensure that the type of politician is aligned
rename_chamber = {'U.S. Representative': 'Representative', 'U.S. Senator': 'Senator'}
Data116 = Data116.replace(rename_chamber).rename(columns = {'Chamber of Congress': 'Type'})
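
The row-wise extraction above can be illustrated without the downloaded page. The following is a minimal sketch that uses the standard library's `html.parser` instead of `BeautifulSoup`, applied to a made-up two-row table; the class name `TableRows` and the sample HTML are assumptions for illustration only and do not reflect the real page's structure.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample table, not the real source
SAMPLE = """<table id="demo">
<tr><th>State</th><th>Name</th><th>Twitter</th></tr>
<tr><td>Alabama</td><td>Richard Shelby</td><td>@SenShelby</td></tr>
</table>"""

parser = TableRows()
parser.feed(SAMPLE)

# The first row holds the headers, the remaining rows hold the data
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

Separating the header row from the body mirrors the `l[1:]` slice in the cell above, where the first extracted row is skipped and the column names are taken from the `th` elements.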

In this data set the state is given together with the congressional district. This is fixed using regular expressions as shown below. Moreover, the "@" is removed from the Twitter handles, as the Twitter API does not need it. The vacant positions in Congress are also disregarded.

In [24]:
# All states abbreviations
us_state_abbrev = {
    r'Alabama.*': 'AL',
    r'Alaska.*': 'AK',
    r'American Samoa.*': 'AS',
    r'Arizona.*': 'AZ',
    r'Arkansas.*': 'AR',
    r'California.*': 'CA',
    r'Colorado.*': 'CO',
    r'Connecticut.*': 'CT',
    r'Delaware.*': 'DE',
    r'District of Columbia.*': 'DC',
    r'Florida.*': 'FL',
    r'Georgia.*': 'GA',
    r'Guam.*': 'GU',
    r'Hawaii.*': 'HI',
    r'Idaho.*': 'ID',
    r'Illinois.*': 'IL',
    r'Indiana.*': 'IN',
    r'Iowa.*': 'IA',
    r'Kansas.*': 'KS',
    r'Kentucky.*': 'KY',
    r'Louisiana.*': 'LA',
    r'Maine.*': 'ME',
    r'Maryland.*': 'MD',
    r'Massachusetts.*': 'MA',
    r'Michigan.*': 'MI',
    r'Minnesota.*': 'MN',
    r'Mississippi.*': 'MS',
    r'Missouri.*': 'MO',
    r'Montana.*': 'MT',
    r'Nebraska.*': 'NE',
    r'Nevada.*': 'NV',
    r'New Hampshire.*': 'NH',
    r'New Jersey.*': 'NJ',
    r'New Mexico.*': 'NM',
    r'New York.*': 'NY',
    r'North Carolina.*': 'NC',
    r'North Dakota.*': 'ND',
    r'Northern Mariana Islands.*':'MP',
    r'Ohio.*': 'OH',
    r'Oklahoma.*': 'OK',
    r'Oregon.*': 'OR',
    r'Pennsylvania.*': 'PA',
    r'Puerto Rico.*': 'PR',
    r'Rhode Island.*': 'RI',
    r'South Carolina.*': 'SC',
    r'South Dakota.*': 'SD',
    r'Tennessee.*': 'TN',
    r'Texas.*': 'TX',
    r'Utah.*': 'UT',
    r'Vermont.*': 'VT',
    r'Virgin Islands.*': 'VI',
    r'Virginia.*': 'VA',
    r'Washington.*': 'WA',
    r'West V.*': 'WV', # Written in different ways
    r'Wisconsin.*': 'WI',
    r'Wyoming.*': 'WY'
}

# Convert states to two letter abbreviations
Data116['State'] = Data116['State'].replace(regex = us_state_abbrev)

# Remove @
Data116 = Data116.replace(regex = {r'^@': ''})

# Remove vacant positions
Data116 = Data116[Data116.Name != "Vacant"]

# Look at the data
Data116

| | State | Type | Name | Party | Twitter |
|---|---|---|---|---|---|
| 0 | AL | Senator | Richard Shelby | R | SenShelby |
| 1 | AL | Senator | Doug Jones | D | DougJones |
| 2 | AL | Representative | Byrne, Bradley | R | RepByrne |
| 3 | AL | Representative | Roby, Martha | R | RepMarthaRoby |
| 4 | AL | Representative | Rogers, Mike | R | RepMikeRogersAL |
| ... | ... | ... | ... | ... | ... |
| 536 | WI | Representative | Tiffany, Thomas | R | TomTiffanyWI |
| 537 | WI | Representative | Gallagher, Mike | R | MikeforWI |
| 538 | WY | Senator | Enzi, Mike | R | SenatorEnzi |
| 539 | WY | Senator | Barrasso, John | R | SenJohnBarrasso |
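
One detail of the mapping above is worth spelling out: an unanchored pattern such as `Virginia.*` also matches inside "West Virginia", and `Kansas.*` also matches inside "Arkansas", so the result depends on the order in which the patterns are applied. The sketch below shows a variant with anchored patterns on a handful of states. It is an illustration of the idea under that assumption, not the code used in the project, and the example strings are made up.

```python
import re

# Anchored patterns: each one must match from the start of the string,
# so the order of the entries no longer matters.
STATE_PATTERNS = [
    (re.compile(r"^Virginia.*"), "VA"),
    (re.compile(r"^West Virginia.*"), "WV"),
    (re.compile(r"^Kansas.*"), "KS"),
    (re.compile(r"^Arkansas.*"), "AR"),
]


def abbreviate_state(value):
    """Return the two-letter abbreviation, or the input if nothing matches."""
    for pattern, abbreviation in STATE_PATTERNS:
        if pattern.match(value):
            return abbreviation
    return value


def strip_at(handle):
    """Remove a leading '@' from a Twitter handle."""
    return re.sub(r"^@", "", handle)


print(abbreviate_state("West Virginia 2nd District"))
print(abbreviate_state("Arkansas 1st District"))
print(strip_at("@SenShelby"))
```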


It is also seen that there is an inconsistency in the way the names are written. This is changed so that all names are written with the first name first:

In [25]:
Data116['Name'] = [name[1][1:]+ " " +name[0] if len(name) == 2 else name[0] for name in [name.replace(u'\xa0', u'').split(',') for name in Data116.Name]]
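
The one-liner above is compact, so here is a hedged, step-by-step version of the same idea as a small function. The name `first_name_first` is hypothetical. Unlike the cell above, which deletes non-breaking spaces, this sketch turns them into regular spaces and then normalizes whitespace.

```python
def first_name_first(name):
    """Turn 'Last, First' into 'First Last'; leave other names unchanged."""
    # Treat non-breaking spaces as ordinary spaces
    cleaned = name.replace("\xa0", " ")
    if "," not in cleaned:
        return " ".join(cleaned.split())
    # Split on the first comma only, then swap the two parts
    last, first = cleaned.split(",", 1)
    return " ".join(f"{first} {last}".split())


print(first_name_first("Byrne, Bradley"))
print(first_name_first("Richard Shelby"))
```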

#### 115<sup>th</sup> congress

Now we move on to the 115th Congress. This data is stored as a table in a PDF, so the `camelot` library is used to read it. 

In [26]:
# Get data
file115 = '../Data/Raw/115_congress_twitter.pdf'

# Read table across all pages
tables = camelot.read_pdf(file115, pages = 'all')

# Convert data to pandas data frame
Data115 = pd.DataFrame(np.concatenate([d.df.drop(0).values for d in tables]), columns=tables[0].df.iloc[0]).drop(columns = "District")

# Align chamber name with the 116 data
rename_chamber = {'Rep.': 'Representative', 'Sen.': 'Senator'}
Data115 = Data115.replace(rename_chamber)

# Align name with the 116 data and store it in one column
Data115["Name"] = Data115["First Name"] + " " + Data115["Last Name"]
Data115 = Data115.drop(columns = ["First Name", "Last Name"])

# Align columns name with the 116 data
Data115 = Data115.rename(columns = {'Title': 'Type', "Twitter Handle": "Twitter"})

#### Merge data

Now the two datasets are merged. Here we need to take duplicate accounts into account, as members who were reelected appear in both datasets.

In [27]:
# Merge data set (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
Data_Full = pd.concat([Data115, Data116], ignore_index = True)

# Get shape
Data_Full.shape

(1072, 5)

In the cell below, every handle in the full data is looked up through the Twitter API, and handles that cannot be found are recorded for removal. The display name is added in a later cell, as the full name does not always match the Twitter display name:

In [32]:
api = tweepy.API(auth, wait_on_rate_limit=True)
to_remove = []
twitter_display_name = []
for index, handle in tqdm.tqdm(enumerate(Data_Full.Twitter)):
    try:
        u=api.get_user(handle)
    except Exception:
        to_remove.append(index)

1072it [15:52,  1.13it/s]


The handles that could not be found are now removed.

In [33]:
Data_Full = Data_Full.drop(index=to_remove)

A few manual fixes to errors that were found will be carried out, and duplicates will be dropped, as duplicates are expected due to reelections.

In [34]:
# Extra duplicate from AS
Data_Full = Data_Full[Data_Full.Twitter != 'RepTomPrice']

# Drop closed users
Data_Full = Data_Full[~Data_Full.Name.isin(['Aumua Radewages', 'Madeleine Bordallo', 'Elizabeth Esty'])]

# Fix Erik
Data_Full.loc[Data_Full[Data_Full.Name == "Erik Paulsen"].index,"Twitter"] = "ErikPaulsen"

# Fix Bobby
Data_Full.loc[Data_Full[Data_Full.Name == "Bobby Scott"].index,"Twitter"] = "BobbyScott"

# Fix Dave
Data_Full.loc[Data_Full[Data_Full.Name == 'Dave Reichert'].index,"Twitter"] = "TeamReichert"

# Fix Lindsey
Data_Full.loc[Data_Full[Data_Full.Name == 'Lindsey Graham'].index,"Twitter"] = "LindseyGrahamSC"

# Darin's name
Data_Full.loc[Data_Full[Data_Full.Name == "arin LaHood"].index,"Name"] = "Darin LaHood"

# Drop dups
Data_Full = Data_Full.drop_duplicates(subset = ["Twitter"], keep = 'last')
Data_Full = Data_Full.drop_duplicates(subset = ["Name"], keep = 'last')
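
Since the 116th Congress rows were appended after the 115th, keeping the last occurrence means that the most recent record of a reelected member survives. The sketch below reproduces that behavior in plain Python on made-up rows; the helper `drop_duplicates_keep_last` and the `Congress` field are hypothetical and only serve the illustration.

```python
def drop_duplicates_keep_last(rows, key):
    """Keep only the last row for each value of `key`, preserving row order."""
    last_index = {row[key]: i for i, row in enumerate(rows)}
    return [row for i, row in enumerate(rows) if last_index[row[key]] == i]


# Made-up rows: "Member A" sits in both congresses
rows = [
    {"Name": "Member A", "Twitter": "RepMemberA", "Congress": 115},
    {"Name": "Member B", "Twitter": "RepMemberB", "Congress": 115},
    {"Name": "Member A", "Twitter": "RepMemberA", "Congress": 116},
]

deduplicated = drop_duplicates_keep_last(rows, key="Twitter")
for row in deduplicated:
    print(row)
```

Running the de-duplication once on the handle and once on the name, as in the cell above, additionally catches members who changed handle between the two congresses.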

And now President Trump will be added.

In [35]:
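# Add Trump (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
trump_row = pd.DataFrame([{'State': None, 'Party': 'R', 'Type': 'POTUS', 'Twitter': 'realDonaldTrump', 'Name': 'Donald J. Trump', 'twitter_display_name': 'Donald J. Trump'}])
Data_Full = pd.concat([Data_Full, trump_row], ignore_index=True)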

Their display name on Twitter will also be added.

In [36]:
# Add display name
twitter_display_name = [api.get_user(handle).name for handle in Data_Full.Twitter]
Data_Full['twitter_display_name'] = twitter_display_name

And lastly the data is saved for later analysis.

In [37]:
# Save data
Data_Full.to_csv('../Data/Processed/Twitter_Handles.csv')

### 2.2.4 Clean-up of Harvard Data
In this part the Harvard data is cleaned such that:
* It only contains tweets from accounts in `Data_Full`. 
* There are only tweets from the two periods January 27, 2017 to January 2, 2019 and January 27, 2019 to May 7, 2020
* There are no duplicate tweets
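
The three rules can be sketched on a handful of made-up records before they are applied to the real data frame. This is a minimal illustration in plain Python, assuming each tweet is a dict with `id`, `user_name` and `created_at`; it de-duplicates on the tweet id, whereas the pandas cells below drop fully identical rows.

```python
from datetime import datetime

# The two collection periods stated above
PERIODS = [
    (datetime(2017, 1, 27), datetime(2019, 1, 2)),
    (datetime(2019, 1, 27), datetime(2020, 5, 7)),
]


def in_period(created_at):
    """True if the timestamp lies strictly inside one of the periods."""
    return any(start < created_at < end for start, end in PERIODS)


def clean(tweets, known_users):
    """Apply the three clean-up rules and return the tweets sorted by time."""
    seen_ids = set()
    kept = []
    for tweet in sorted(tweets, key=lambda t: t["created_at"]):
        if tweet["user_name"] not in known_users:
            continue  # rule 1: unknown account
        if not in_period(tweet["created_at"]):
            continue  # rule 2: outside both periods
        if tweet["id"] in seen_ids:
            continue  # rule 3: duplicate
        seen_ids.add(tweet["id"])
        kept.append(tweet)
    return kept


# Made-up example records
tweets = [
    {"id": 1, "user_name": "Known User", "created_at": datetime(2018, 6, 1)},
    {"id": 1, "user_name": "Known User", "created_at": datetime(2018, 6, 1)},
    {"id": 2, "user_name": "Random Profile", "created_at": datetime(2018, 6, 2)},
    {"id": 3, "user_name": "Known User", "created_at": datetime(2008, 3, 1)},
    {"id": 4, "user_name": "Known User", "created_at": datetime(2019, 1, 15)},
    {"id": 5, "user_name": "Known User", "created_at": datetime(2020, 1, 1)},
]

kept = clean(tweets, known_users={"Known User"})
print([tweet["id"] for tweet in kept])
```

Note that tweet 4 is dropped even though it comes from a known account, because it falls in the gap between the two collection periods.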

In [3]:
congress = pd.read_pickle('../Data/Interim/congress.pkl')
twitter_handles = pd.read_table('../Data/Processed/Twitter_Handles_updated.csv', sep = ',')

s1 = set(twitter_handles['twitter_display_name'])
s2 = set(congress.user_name.unique())

Below, the non-overlapping Twitter profiles are briefly investigated: we count how many users appear in only one of the two sets and show 10 examples.

In [5]:
print(f'The set of symmetric difference includes {len(s1 ^ s2)} users')
list(s1 ^ s2)[:10]

The set of symmetric difference includes 1204 users


['みんなの釣果自慢@釣り人のための釣果投稿サイト🐟',
 'Rep. Devin Nunes',
 'Joe Cunningham',
 'Belu Musante',
 'ハ゜クマン(´･_･`)',
 'Xochitl Torres Small',
 'sams',
 'Brulindo🐻',
 'josé urach 💙vote45💙',
 'David Schweikert']
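
The symmetric difference mixes two kinds of mismatch: names from the handle list that never appear as tweet authors, and tweet authors that are not in the handle list. The ten examples above seem to include both kinds. A small sketch with made-up names shows how the two directions can be told apart:

```python
# Made-up display names for illustration
handles = {"Rep. A", "Sen. B", "Rep. C"}
tweeters = {"Sen. B", "Rep. C", "random profile"}

# Names in the handle list without any tweets in the data
only_in_handles = handles - tweeters

# Tweet authors that are not in the handle list
only_in_tweets = tweeters - handles

# The symmetric difference is the union of the two one-sided differences
symmetric = handles ^ tweeters

print(only_in_handles)
print(only_in_tweets)
print(len(symmetric))
```

Only the second group is removed by filtering the tweets on the handle list. The first group may point to display names that differ between the two sources and could be worth a manual look.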

So this is actually a lot! We will now make sure that we only use tweets from users in our dataset.

In [49]:
# Make sure tweets only come from users that we have Twitter handles for.
congress = congress[congress.user_name.isin(s1)]

As mentioned previously we will also have to filter out the desired time period.

In [50]:
# Keep only the periods from Harvard:
mask = (
    #January 27, 2017 and January 2, 2019 
    (congress.created_at > '2017-1-27 00:00:00') & (congress.created_at < '2019-1-2 00:00:00')
    | 
    #January 27, 2019 and May 7, 2020 
    (congress.created_at > '2019-1-27 00:00:00') & (congress.created_at < '2020-5-7 00:00:00')
)
congress = congress[mask]

Lastly, duplicates are dropped and the data is sorted and saved.

In [52]:
congress = congress.drop_duplicates(keep='first')
congress = congress.sort_values(by='created_at')
congress = congress.reset_index(drop=True)
congress.to_pickle("../data/interim/congress_cleaned.pkl")
len(congress)

1650398


The cleanup results in $4,870,472-1,650,398 = 3,220,074$ fewer tweets than the original data. The ids of the remaining tweets are saved so that they can be shared online, which Twitter permits for tweet ids. This makes it a lot faster to hydrate the tweets of interest if one wants to re-create the project. The tweet ids are publicly available at http://groenning.net/data/Cleaned_tweet_id.txt.

In [53]:
# Extract the tweets ids and convert them to integers
ids = list(congress.id.astype(int).values)

filepath = "../Data/raw/tweets/Cleaned_tweet_id.txt"
with open(filepath, 'w') as output:
    for row in ids:
        output.write(str(row) + '\n')

    print(f'{len(ids)} tweet ids saved.')

1650398 tweet ids saved.


After clean-up, the data frame with tweets from Congress contains about 34 % of the rows it had prior to clean-up, which means the data can be hydrated much more quickly. Running the cells below takes $8 \pm 6$ hours and creates the same `congress` data frame as the clean-up above. 
The cleaned tweet ids can be found at http://groenning.net/data/Cleaned_tweet_id.txt, as the file is too large to be on Github. The cell below makes sure that the list is downloaded.
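
As a quick check of the numbers quoted in this section, the share of retained tweets can be computed directly from the two counts printed earlier. This is plain arithmetic and does not touch any data files.

```python
n_raw = 4_870_472    # ids before clean-up
n_clean = 1_650_398  # ids after clean-up

n_removed = n_raw - n_clean
share_kept = n_clean / n_raw

print(f"{n_removed:,} ids removed, {share_kept:.1%} kept")
```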

In [None]:
from urllib.request import urlopen

url = 'http://groenning.net/data/Cleaned_tweet_id.txt'
file_name = "../Data/Raw/Tweets/Cleaned_tweet_id.txt"

# Download the id list once and write the raw bytes to disk
with urlopen(url) as response, open(file_name, 'wb') as out_file:
    out_file.write(response.read())

In [None]:
congress_tweet_id = np.loadtxt("../Data/Raw/Tweets/Cleaned_tweet_id.txt", dtype=int)
filepath = "../Data/interim/congress_cleaned.pkl"

hydrate_tweets(
    tweet_ids=congress_tweet_id,
    filepath=filepath,
    api = api
)

### 2.2.5 Preprocess the twitter data
In this part the cleaned tweets will be processed such that the text is suited for natural language processing. The cell below does the following:
* Convert HTML entities (here `&amp`) to plain text
* Make all tweets lowercase
* Remove all links from tweets
* Replace all unicode whitespace with a normal space
* Remove all unknown characters and symbols
* Save the processed data

In [None]:
# Load data
congress = pd.read_pickle('../Data/Interim/congress_cleaned.pkl')

# Define character set to keep
special_characters = "@#"
character_set = {
    "characters": "abcdefghijklmnopqrstuvwxyz0123456789" + special_characters,
    "space": " ",
}
alphabet = "".join(character_set.values())

# Compile the regular expressions (raw strings avoid invalid-escape warnings)
regex_links = re.compile(r"http\S+")
regex_whitespace = re.compile(r"[\s|-]+")  # whitespace, "|" and "-" all become a space
regex_unknown = re.compile(f"[^{alphabet}]+")
regex_html_tags = {
    "&amp": "and"
}

# Replace HTML entities with plain text (literal match, not a regex)
for pattern_string, char in regex_html_tags.items():
    congress["text"] = congress["text"].str.replace(pattern_string, char, regex=False)

# Overwrite the text column with the normalized tweet text
congress["text"] = (congress["text"]
    .str.lower()
    .str.replace(regex_links, "", regex=True)
    .str.replace(regex_whitespace, character_set["space"], regex=True)
    .str.replace(regex_unknown, '', regex=True)
    .str.strip()
)

# Save data
congress.to_pickle('../Data/Processed/congress_cleaned_processed.pkl')
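
To see what the pipeline does to a single tweet, the same steps can be applied to one string with the standard `re` module. The sketch below mirrors the order of the steps in the cell above; the function name `preprocess` and the example tweet are made up for illustration.

```python
import re

# Characters that survive the clean-up: letters, digits, "@", "#" and space
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789@# "

REGEX_LINKS = re.compile(r"http\S+")
REGEX_WHITESPACE = re.compile(r"[\s|-]+")
REGEX_UNKNOWN = re.compile(f"[^{ALPHABET}]+")


def preprocess(text):
    """Apply the clean-up steps to one tweet text."""
    text = text.replace("&amp", "and")      # HTML entity to plain text
    text = text.lower()                     # lowercase
    text = REGEX_LINKS.sub("", text)        # drop links
    text = REGEX_WHITESPACE.sub(" ", text)  # collapse whitespace, "|" and "-"
    text = REGEX_UNKNOWN.sub("", text)      # drop everything outside ALPHABET
    return text.strip()


# Made-up example tweet
example = "Proud to vote for #HR1 &amp; jobs!\nRead more: https://t.co/abc123 @SenShelby"
print(preprocess(example))
```

Hashtags and mentions survive because "#" and "@" are part of the allowed character set, while punctuation and the link disappear.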

### 2.2.6 Create username to id dict

Here a dictionary will be created that can be used to go from twitter user id to username (and a reverse dict can be used for the opposite).

In [2]:
# Load data
Handle_data = pd.read_csv('../../Data/Processed/Twitter_Handles.csv')
Usr = Handle_data.Twitter.values

# Get api
api = tweepy.API(auth, wait_on_rate_limit=True)

# Create dict
Usr_ID = {U: api.get_user(U).id for U in Usr}

# Save dict
with open('../../Data/Processed/Usr_ID_dict.pickle', 'wb') as handle:
    pickle.dump(Usr_ID, handle, protocol=pickle.HIGHEST_PROTOCOL)
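
Going in the opposite direction, from id to username, only works if every id occurs once. The following sketch inverts a small made-up mapping and raises an error when two usernames share an id; the helper `invert` and the ids are hypothetical.

```python
def invert(mapping):
    """Return a value-to-key dict, refusing to silently drop duplicates."""
    inverse = {}
    for key, value in mapping.items():
        if value in inverse:
            raise ValueError(f"{value!r} maps to both {inverse[value]!r} and {key!r}")
        inverse[value] = key
    return inverse


# Made-up handles and ids
usr_id = {"handle_a": 101, "handle_b": 202}
id_usr = invert(usr_id)
print(id_usr)
```

A plain dict comprehension gives the same result when all ids are unique; the explicit check merely guards against two handles resolving to the same account.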

### 2.2.7 Follow adjacency matrix

Now we will create an adjacency matrix of who follows whom among the users in our data. This will be done using the Twitter API and, as a warning, it takes 12-14 hours to generate the data due to Twitter's rate limits.

First the relevant data will be loaded.

In [8]:
# Get usr-id dict
with open('../Data/Processed/Usr_ID_dict.pickle', 'rb') as handle:
    Usr_ID = pickle.load(handle)

# Get reverse version
ID_Usr = {I: U for U, I in Usr_ID.items()}

# Load handle data
Handle_data = pd.read_csv('../Data/Processed/Twitter_Handles.csv')
Handle_dict = pd.Series(Handle_data.Name.values, index=Handle_data.Twitter.values).to_dict()

A function will now be made that returns the total number of accounts a user follows, as well as which users from our data set the user follows. If a request is rejected by the API, we retry a minute later.

In [10]:
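import time

def get_following(name):
    """Return (total number of accounts followed, followed handles that are in our data set)."""
    try:
        ids_list = []
        for page in tweepy.Cursor(api.friends_ids, screen_name=name).pages():
            ids_list.extend(page)

        return len(ids_list), [ID_Usr[user_id] for user_id in ids_list if user_id in ID_Usr]
    except Exception:
        time.sleep(60) # Wait 1 minute if the limit is reached or another error is encountered
        return get_following(name) # Retry and pass the result back to the caller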
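
A retry that waits and tries again can loop forever if the error is permanent, for example when an account has been closed. The sketch below shows one possible bounded variant of the wait-and-retry idea. The helper `call_with_retry`, its parameters and the simulated `flaky_lookup` are assumptions for illustration and are not part of the project code.

```python
import time


def call_with_retry(func, *args, max_attempts=5, wait_seconds=60, sleep=time.sleep):
    """Call func(*args), waiting and retrying on any exception, up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception as error:  # broad on purpose, like the cell above
            last_error = error
            if attempt < max_attempts:
                sleep(wait_seconds)
    raise RuntimeError(f"giving up after {max_attempts} attempts") from last_error


# Simulated lookup that fails twice before it succeeds
calls = {"n": 0}


def flaky_lookup(name):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated rate limit")
    return 2, [name.upper()]


# The sleep function is replaced so that the example runs instantly
result = call_with_retry(flaky_lookup, "handle_a", sleep=lambda seconds: None)
print(result, calls["n"])
```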

It is now time to create and populate the dataset, which is done below.

In [None]:
# Define api
api = tweepy.API(auth, wait_on_rate_limit=True)

# Init dataframe
Follow_df = pd.DataFrame(data = 0, index = Handle_dict.keys(), columns = Handle_dict.keys())
No_follow_dict = dict.fromkeys(Handle_dict.keys())

# Populate dataframe
for name in tqdm.tqdm(Follow_df.index):
    # Try again if fail (which happens first time after the api values are replenished)
    try: 
        no_follows, Follows = get_following(name)
    except TypeError:
        no_follows, Follows = get_following(name)

    # Add results
    No_follow_dict[name] = no_follows
    Follow_df.loc[name,Follows] = 1

And at last the data is saved.

In [23]:
# Write follow df
Follow_df.to_csv('../Data/Processed/Follow_df.csv')

# Write number of following dict
with open('../Data/Processed/No_follow_dict.pickle', 'wb') as handle:
    pickle.dump(No_follow_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)

# 3. Basic Statistics
In this part of the notebook, basic statistics for the two datasets `Twitter_Handles.csv` and `congress_cleaned_processed.pkl` are given. The data set `Twitter_Handles.csv` is publicly available in the 'data/processed' folder of the Github repository, whereas `congress_cleaned_processed.pkl` can't be shared per Twitter’s Developer Policy. The previous section described how the data was generated, but a dedicated notebook for extracting and cleaning the data can be found in the notebook folder on our Github as [Extract_And_Clean_data.ipynb](https://github.com/MikkelGroenning/social_graph/blob/main/Notebooks/Extract_And_Clean_data.ipynb) or quickly accessed with the following [nbviewer link](https://nbviewer.jupyter.org/github/MikkelGroenning/social_graph/blob/main/Notebooks/Extract_And_Clean_data.ipynb).

The Basic Statistics section of this notebook consists of two subparts, namely
* **Handle Data** 
* **Congress Tweets Data**

The first part is devoted to describing `Twitter_Handles.csv` dataset whereas the second part is devoted to describing `congress_cleaned_processed.pkl` dataset.

## 3.1 Handle Data
The Twitter data is loaded in the cell below. As can be seen, it describes the State, Party, Type (Senator, Representative, or President (POTUS)), their Twitter handles, their name, and their display name on Twitter (those two are not always identical) for every politician in the dataset.

In [54]:
Handle_data = pd.read_csv('../data/processed/Twitter_Handles.csv')

party_dict = {
    "D" : "Democrat", 
    "R" : "Republican",
    "I" : "Independent",
    'L' : 'Libertarian'
}
Handle_data['Party'] = [party_dict[p] for p in Handle_data['Party'] ]
Handle_data

| | Unnamed: 0 | State | Party | Type | Twitter | Name | twitter_display_name |
|---|---|---|---|---|---|---|---|
| 0 | 0 | AZ | Republican | Senator | JeffFlake | Jeff Flake | Jeff Flake |
| 1 | 1 | AZ | Republican | Senator | SenJonKyl | Jon Kyl | U.S. Senator Jon Kyl |
| 2 | 2 | CA | Democrat | Representative | reppeteaguilar | Peter Aguilar | Rep. Pete Aguilar |
| 3 | 3 | CA | Democrat | Representative | repcardenas | Tony Cardenas | Rep. Tony Cárdenas |
| 4 | 4 | CA | Republican | Representative | DarrellIssa | Darrell Issa | Darrell Issa |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 613 | 613 | WI | Republican | Representative | MikeforWI | Mike Gallagher | Mike Gallagher |
| 614 | 614 | WY | Republican | Senator | SenatorEnzi | Mike Enzi | Mike Enzi |
| 615 | 615 | WY | Republican | Senator | SenJohnBarrasso | John Barrasso | Sen. John Barrasso |
| 616 | 616 | WY | Republican | Representative | Liz_Cheney | Liz Cheney | Liz Cheney |


### 3.1.1 Party distribution

A natural way to start is to look at the distribution of parties among the 618 politicians. The counts are calculated below and shown as a bar plot. The usual party colors are used, with gray for Independents.

In [29]:
party_distribution = Handle_data.groupby('Party').agg('count')[['Name']].reset_index()
party_distribution.rename(columns={'Name':'Count'},inplace=True)
party_distribution

| | Party | Count |
|---|---|---|
| 0 | Democrat | 310 |
| 1 | Independent | 2 |
| 2 | Libertarian | 1 |
| 3 | Republican | 305 |


In [49]:
fig = px.bar(
    party_distribution,
    x="Party",
    y="Count",
    color='Party',
    color_discrete_sequence=["#0015BC", "#7f7f7f", "#FED105", "#DE0100"],
    title = 'Distribution of Party sizes',
    text = 'Count'
)
fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_layout(
    xaxis=dict(
        tickangle=-45,
    ),
    margin=dict(b=5,l=5,r=5,t=40),
    titlefont_size=16,
)
# save for website
fig.write_html(
    file = "../web_app/plotly_files/tweet_barplot_parties.html", 
    full_html = False,
    include_plotlyjs='cdn'
)
fig

In [31]:
Handle_data[~Handle_data.Party.isin(['Republican', 'Democrat'])]

| | Unnamed: 0 | State | Party | Type | Twitter | Name | twitter_display_name |
|---|---|---|---|---|---|---|---|
| 300 | 300 | ME | Independent | Senator | SenAngusKing | Angus King | Senator Angus King |
| 329 | 329 | MI | Libertarian | Representative | justinamash | Justin Amash | Justin Amash |
| 571 | 571 | VT | Independent | Senator | SenSanders | Bernie Sanders | Bernie Sanders |


An interesting name to notice is Bernie Sanders, who has run in the Democratic presidential primaries even though he is officially an independent senator for Vermont.

### 3.1.2 Type distribution

Another interesting aggregation is based on the type (i.e. Representative, Senator, or President (POTUS)). Below, the numbers are aggregated by party and type so the distribution can be shown.

In [32]:
party_type_distribution = Handle_data.groupby(['Party','Type']).agg('count')[['Name']].reset_index()
party_type_distribution.rename(columns={'Name':'Count'},inplace=True)

In [50]:
fig = px.bar(
    party_type_distribution,
    x="Type",
    y="Count",
    color='Party',
    color_discrete_sequence=["#0015BC", "#7f7f7f", "#FED105", "#DE0100"],
    title = 'Distribution per party',
    text = 'Count'
)
fig.update_layout(
    margin=dict(b=5,l=5,r=5,t=40),
    titlefont_size=16,
)
# save for website
fig.write_html(
    file = "../web_app/plotly_files/tweet_parties_2_barplot.html", 
    full_html = False,
    include_plotlyjs='cdn'
)
fig

It immediately becomes clear that Representatives dominate the dataset, with a total of 503 Representatives: 244 Republicans and 259 Democrats.
There are only 111 Senators, and naturally only one President, Donald J. Trump.
Between the 115th and 116th Congress, all 435 seats in the House of Representatives were up for election, while only around a third of the 100 seats in the Senate were. That means some members were not re-elected but still appear in the dataset.
Moreover, the United States overseas territories also get seats in the House of Representatives (though without voting power). For these reasons, it makes sense that Representatives are so dominant.
It also explains why the total numbers of Representatives and Senators exceed 435 and 100 respectively, which are the numbers of seats in the House of Representatives and the Senate.
The stacked bars also clearly illustrate a two-party system: Democrats and Republicans dominate the political landscape and are fairly equal in size.
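
The counting argument can be made concrete with a toy example. The member names below are placeholders, not rows from `Twitter_Handles.csv`; the point is only that the union of two congresses contains more people than the chamber has seats whenever some members were replaced.

```python
# Hypothetical three-seat chamber observed in two consecutive congresses.
congress_115 = {"Member A", "Member B", "Member C"}
congress_116 = {"Member A", "Member D", "Member E"}  # B and C were replaced

seats = 3
all_members = congress_115 | congress_116  # everyone who held a seat at some point
reelected = congress_115 & congress_116    # counted only once in the union

print(len(all_members), "distinct members for", seats, "seats")
print("re-elected:", sorted(reelected))
```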

### 3.1.3 Distribution of States
Another interesting feature is the state representation. Let's first look at the number of different states in the dataset.

In [55]:
print(f'There are {Handle_data.State.nunique()} unique states in the data.')

There are 56 unique states in the data.


So there are 56, which is more than the usual 50 states. This is because American Samoa, the District of Columbia, the Northern Mariana Islands, Puerto Rico, and the Virgin Islands also get seats in the House of Representatives. Their delegates do not have voting power but can participate in debates. Let's see the number of congress members for each state in the data set (please note that the President does not have a state associated with him).

In [62]:
# Load the mapping from state name to state abbreviation
with open('../data/processed/us_state_abbrev.json') as json_file:
    us_state_abbrev = json.load(json_file)
# flip state dict
us_state_abbrev = {value:key for key, value in us_state_abbrev.items()}

party_state_distribution = Handle_data.groupby(['Party', 'State']).agg('count')[['Name']].reset_index()
party_state_distribution.rename(columns={'Name':'Count'},inplace=True)
party_state_distribution['State'] = [us_state_abbrev[state_abrreviation] for state_abrreviation in party_state_distribution['State']]

In [65]:
fig = px.bar(
    party_state_distribution,
    x="State",
    y="Count",
    color='Party',
    color_discrete_sequence=["#0015BC", "#7f7f7f", "#FED105", "#DE0100"],
    title = 'Distribution per party',
)
fig.update_layout(
    xaxis=dict(
        tickangle=90,
        dtick = 1,
        showticklabels = True,
    ),
    margin=dict(b=5,l=5,r=5,t=40),
    titlefont_size=16,
)
fig.write_html(
    file = "../web_app/plotly_files/tweet_parties_3_barplot.html", 
    full_html = False,
    include_plotlyjs='cdn'
)
fig

The distribution is shown above. Remember that the data covers two congresses, so re-elected politicians are only counted once; the plot therefore does not fully illustrate the political landscape, but it does give the exact distribution of the dataset used in the project. It also becomes very apparent how the sizes (based on members of Congress) vary between states. California is the largest by some distance, followed by Texas. Then there is a large jump down to Florida and New York with 32 each, and another large jump down to Illinois and Pennsylvania with 23. Many states fall in the band of 11-14 members. The full distribution is shown as a histogram below:

In [53]:
fig = px.histogram(
    party_state_distribution.groupby('State').agg('sum').reset_index(),
    x="Count", 
    nbins=15,
    color_discrete_sequence=['#ff7f0e'],
    title='Stacked histogram of state sizes based on number of congress members'
)
fig.update_layout(
    margin=dict(b=5,l=5,r=5,t=40),
    titlefont_size=16,
)
fig.write_html(
    file = "../web_app/plotly_files/tweet_parties_histogram.html", 
    full_html = False,
    include_plotlyjs='cdn'
)
fig

From this histogram, it becomes apparent that most states have fewer than 15 members, while California and Texas are more *edge cases*. These sizes are relevant to keep in mind if community detection is done based on states.

## 3.2 Congress Tweets Data
In this part, the basic statistics of the tweets extracted from the politicians are presented. In the cell below, `congress_cleaned_processed.pkl` is loaded and merged with the Twitter handles so that the statistics can be computed at the party level.

In [43]:
df_congress = pd.read_pickle('../data/processed/congress_cleaned_processed.pkl')
df_congress = pd.merge(df_congress, Handle_data, how='left',left_on='user_name', right_on='twitter_display_name')
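
Because the merge key is a display name rather than a stable identifier, it can be worth checking how many tweets fail to find a matching handle. The sketch below uses made-up rows and the `indicator` option of `pd.merge`; the toy frames and the `unmatched` variable are illustrative only and are not part of the original pipeline.

```python
import pandas as pd

# Toy stand-ins for the tweet data and the handle data (made-up rows).
tweets = pd.DataFrame({
    "user_name": ["Jeff Flake", "Liz Cheney", "Some Other Account"],
    "text": ["tweet 1", "tweet 2", "tweet 3"],
})
handles = pd.DataFrame({
    "twitter_display_name": ["Jeff Flake", "Liz Cheney"],
    "Party": ["Republican", "Republican"],
})

merged = pd.merge(
    tweets, handles,
    how="left",
    left_on="user_name",
    right_on="twitter_display_name",
    indicator=True,  # adds a _merge column: 'both' or 'left_only'
)

# Rows that did not match any handle carry no party information.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["user_name"].tolist())
```

Rows marked `left_only` would otherwise silently drop out of the later `groupby` on `Party`, since pandas excludes missing group keys by default.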

The cell below extracts the total number of tweets collected for every politician.

In [44]:
tweet_counts = df_congress.groupby(['Name', 'Party', 'Type']).agg('count')[['text']].reset_index()
tweet_counts.rename(columns={"text": "Total Tweets"}, inplace=True)
tweet_counts.head()

| | Name | Party | Type | Total Tweets |
|---|---|---|---|---|
| 0 | Abby Finkenauer | Democrat | Representative | 2193 |
| 1 | Abigail Spanberger | Democrat | Representative | 2229 |
| 2 | Adam Kinzinger | Republican | Representative | 3182 |
| 3 | Adam Schiff | Democrat | Representative | 4103 |
| 4 | Adam Smith | Democrat | Representative | 2638 |


### 3.2.1 Total number of tweets per party
Let's first look at the number of tweets by party and type.

In [45]:
tweet_count_party = tweet_counts.groupby(['Party', 'Type']).agg('sum').reset_index()
tweet_count_party['Party and Type'] = tweet_count_party['Party'] + ' ' + tweet_count_party['Type']

In [46]:
fig = px.bar(
    tweet_count_party,
    x="Party and Type",
    y="Total Tweets",
    color='Party',
    color_discrete_sequence=["#0015BC", "#7f7f7f", "#FED105", "#DE0100"],
    text='Total Tweets',
    title = 'Total number of tweets per group'
)
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
# Rotate labels 45 degrees
fig.update_layout(
    xaxis=dict(
        tickangle=-45,
    ),
    yaxis=dict(
        range=[0, 8.5e5]
    ),
    margin=dict(b=5,l=5,r=5,t=40),
    titlefont_size=16,
)
# save for website
fig.write_html(
    file = "../web_app/plotly_files/tweet_barplot.html", 
    full_html = False,
    include_plotlyjs='cdn'
)
fig.show()

From the bar plot, it can be seen that Democrats tweet much more than their Republican colleagues even though the two parties have close to the same number of profiles. It is also interesting that Senators tweet disproportionately more than their colleagues in the House of Representatives, particularly among the Republicans. Recall there is data for
* 244 Republican Representatives
* 259 Democratic Representatives
* 60 Republican Senators
* 51 Democratic Senators

### 3.2.2 Distribution of tweets
Let's explore the distribution of the number of tweets posted per politician.

In [48]:
fig = px.histogram(
    tweet_counts, 
    x="Total Tweets", 
    title='Stacked histogram of number of total tweets posted per politician',
    color='Party',
    color_discrete_sequence=["#0015BC", "#DE0100", "#7f7f7f", "#FED105"]
)
fig.update_layout(
    margin=dict(b=5,l=5,r=5,t=40),
    titlefont_size=16,
)
# save for website
fig.write_html(
    file = "../web_app/plotly_files/tweet_distribution.html", 
    full_html = False,
    include_plotlyjs='cdn'
)
fig

The stacked histogram shows a skewed distribution. Most politicians have tweeted fewer than 5000 times, but a few politicians are extremely active on the platform.
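
A quick way to quantify this kind of skew is to compare the mean with the median. The sketch below uses made-up tweet counts, not the real `tweet_counts` frame; on the actual data the same three numbers could be computed from the `Total Tweets` column.

```python
import statistics

# Made-up tweet counts per politician; one very active account skews the mean.
tweet_totals = [800, 1500, 2200, 2600, 3200, 4100, 30000]

mean_tweets = statistics.mean(tweet_totals)
median_tweets = statistics.median(tweet_totals)
share_below_5000 = sum(t < 5000 for t in tweet_totals) / len(tweet_totals)

print(f"mean={mean_tweets:.0f}, median={median_tweets}, below 5000: {share_below_5000:.0%}")
```

A mean well above the median is the typical signature of a right-skewed distribution, where a few heavy users pull the average up.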

The statistics so far are very basic; the more interesting analyses of the graphs and text are found in the sections below.

# 4. Tools, Theory, and Analysis (a teaser to the three analyses)

In this section, we go through how we worked with the text and which network science tools and data analysis strategies we used to investigate how political polarization is expressed on Twitter.

<img src="../figures/tools.png" width=360 height=250/>


The overall idea is to use a wide variety of the tools and methods learned in the course 02805 Social Graphs and Interactions to find interesting results about political polarization. We explore the coherence of Congress by considering multiple graphs, each generated from a different view of the Twitter data. Due to the broad course curriculum as well as our chosen subject, the analysis has been split into three separate analyses.

1) **Who-Follows-Whom graph:** This section creates and investigates who follows whom in the congress. Do Senators follow Senators? Do Republicans follow Republicans? etc. This graph will be analyzed using a wide variety of tools including community analysis.

2) **Text analysis:** In this part, multiple text-analysis tools are used, taking as a starting point the communities found in part 1. The tools used are TF-TR, TF-IDF, lexical dispersion plots, and sentiment analysis.

3) **Retweet-graph:** Here, two graphs are built. The first examines whether political polarization is expressed in the retweets. Comparing the 'Follow'-graph with the 'Retweet'-graph may give insight into whether people are only lurking on their political opponents. The second extends this to the media: a bipartite graph is used to investigate whether some media are preferred by one party over the other.

Below, an introduction and conclusion for each analysis are presented, together with an nbviewer link and a Github link to the full notebook containing the analysis. The links are provided to limit the size of the explainer notebook, as a high-end computer would otherwise be needed to run it smoothly.

## 4.1. Analysis 1: Who-follow-whom graph

- NB-viewer link: https://nbviewer.jupyter.org/github/MikkelGroenning/social_graph/blob/main/Notebooks/Make_Adjancency_Graph.ipynb
- Github link: https://github.com/MikkelGroenning/social_graph/blob/main/Notebooks/Make_Adjancency_Graph.ipynb

______
In this part, a graph is constructed based on which politicians' profiles follow each other. In other words, if e.g. a Senator follows another Senator on Twitter, a directed edge between them is constructed.
The idea is illustrated in the image below. In this graph, we consider the three nodes (in the shape of politicians) *Dean Heller*, *Alexandria Ocasio-Cortez*, and *Bernie Sanders*. There is a directed edge from *Dean Heller* to *Alexandria Ocasio-Cortez*, which means that *Dean Heller* follows *Alexandria Ocasio-Cortez* on Twitter. As *Alexandria Ocasio-Cortez* does not follow *Dean Heller*, there is no edge going in the opposite direction. Similarly, it can be seen that *Bernie Sanders* follows *Alexandria Ocasio-Cortez* and vice versa, whereas *Dean Heller* follows *Bernie Sanders* but *Bernie Sanders* does not follow *Dean Heller*. This is the type of graph that is created for the 618 politicians in our dataset.

![alt text](../figures/follow_graph_illustrated.png "Title")
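
The three-node illustration can be written down directly as an edge list. The sketch below uses plain Python rather than the graph library from the linked notebook, and only the follow relations described in the text; the variable names are illustrative.

```python
# Directed follow edges from the illustration: (follower, followed).
edges = [
    ("Dean Heller", "Alexandria Ocasio-Cortez"),
    ("Bernie Sanders", "Alexandria Ocasio-Cortez"),
    ("Alexandria Ocasio-Cortez", "Bernie Sanders"),
    ("Dean Heller", "Bernie Sanders"),
]
edge_set = set(edges)
nodes = sorted({name for edge in edges for name in edge})

# In-degree = number of followers inside the graph, out-degree = number followed.
in_degree = {n: sum(1 for _, target in edges if target == n) for n in nodes}
out_degree = {n: sum(1 for source, _ in edges if source == n) for n in nodes}

# A pair is mutual when the reverse edge also exists.
mutual = sorted({tuple(sorted(e)) for e in edges if (e[1], e[0]) in edge_set})

print(in_degree)
print(out_degree)
print(mutual)
```

In this toy graph Dean Heller follows two accounts but has no followers, which is exactly the asymmetry a directed graph preserves and an undirected graph would lose.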

### 4.1.1 Subconclusion

In this part, a graph based on which politicians follow each other has been created, visualized, and analyzed. The graph was very dense! Overall, it was found that representatives very clearly followed others from their own party, while senators were more likely to follow other senators. Many different tools for analyzing networks were used, and the conclusion can be summarized as follows: politicians are by a clear margin more likely to follow others with the same beliefs as themselves, and there is a split between senators and representatives too. When using the Louvain method, three communities were found, which can be summarized as senators, Republican representatives, and Democratic representatives.
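
One simple way to express "more likely to follow their own party" as a number is the share of follow edges that stay within a party. This is a minimal sketch on made-up data and not necessarily a measure used in the linked notebook; the `party` and `edges` values are hypothetical.

```python
# Made-up parties and follow edges (follower, followed) for illustration only.
party = {"a": "Democrat", "b": "Democrat", "c": "Republican", "d": "Republican"}
edges = [("a", "b"), ("b", "a"), ("c", "d"), ("d", "c"), ("a", "c")]

# Count edges whose two endpoints belong to the same party.
same_party = sum(1 for source, target in edges if party[source] == party[target])
share_same_party = same_party / len(edges)

print(f"{share_same_party:.0%} of follow edges stay within a party")
```

The raw share should be read against a baseline, since two parties of nearly equal size would already give roughly 50% within-party edges if follows were random.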

## 4.2. Analysis 2: Natural Language Processing on the tweets

- NB-viewer link: https://nbviewer.jupyter.org/github/MikkelGroenning/social_graph/blob/main/Notebooks/Text_analysis.ipynb
- Github link: https://github.com/MikkelGroenning/social_graph/blob/main/Notebooks/Text_analysis.ipynb

________
In this part of the notebook, we look into the text of the tweets that American politicians post.
This part is composed of four sub-parts:

1. TF-TR analysis for the three communities,
1. TF-IDF on a state level,
1. Dispersion plot, and
1. Sentiment analysis.

In the first sub-part, we explore how the three previously extracted communities use Twitter differently with the TF-TR method. Recall that the three communities consist of
1. Roughly all Senators across all parties,
1. Roughly all Republican Representatives, and
1. Roughly all Democratic Representatives.

In the second sub-part, the use of words in tweets across the different states that the politicians represent is examined using the TF-IDF method.
In the third sub-part, interesting words discovered in the previous two sub-parts are shown in a dispersion plot.
The fourth sub-part examines the sentiment of tweets across the three communities and on a party basis, and explores how the politicians feel about the words selected for the dispersion plot.
Finally, we conclude our findings.
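
To make the state-level TF-IDF idea concrete, here is a minimal sketch in plain Python. It assumes the common definition with relative term frequency and `idf = log(N / df)`; the exact weighting in the linked notebook may differ, and the token lists are made up.

```python
import math
from collections import Counter

# Made-up token lists; in the analysis each state would be one document.
documents = {
    "State A": ["wildfire", "water", "jobs", "wildfire"],
    "State B": ["border", "jobs", "energy"],
    "State C": ["hurricane", "jobs", "water"],
}

n_docs = len(documents)
# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in documents.values() for word in set(tokens))


def tf_idf(tokens):
    """Return {word: tf * idf} with tf = relative frequency and idf = log(N / df)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (c / total) * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}


scores = {state: tf_idf(tokens) for state, tokens in documents.items()}
top_word_a = max(scores["State A"], key=scores["State A"].get)
print(top_word_a)
```

A word used by every state, such as "jobs" here, gets a score of zero, which is why TF-IDF surfaces state-specific vocabulary rather than generally frequent words.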

### 4.2.1 Subconclusion

In this part, various natural language processing methods have been used to analyze the tweets, starting from the communities found in the *Who-follow-whom* graph. First, TF-TR was carried out on the full set of tweets across the three communities to find important words in each community. The results were illustrated as word clouds. Some of the words made sense, but in general the results were dominated by names. Afterward, TF-IDF was carried out on a state level, leading to 55 *documents*, where word clouds were again used to illustrate the results. Some meaningful takeaways could be found even though names were again dominant. We also created a lexical dispersion plot to see how the usage of some words varied over time; many of the results made sense to anyone who has followed the last four years of American politics.

The last analysis focused on the sentiment of the tweets, again based on the communities. Here we investigated how the communities *felt* towards different words, where small differences could be seen. Lastly, we investigated how positively the politicians wrote when tagging each other. Interestingly, Democratic senators were fond of Republican representatives; apart from that the results were as expected, although the differences were fairly minor.
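
A lexicon-based sentiment score of the kind described by Dodds et al. (2011) in the references is a frequency-weighted average of per-word happiness scores. The sketch below illustrates that form only; the word scores are hypothetical values on a 1-9 scale, not entries from a real lexicon, and the linked notebook may use a different lexicon or method.

```python
from collections import Counter

# Hypothetical word scores on a 1-9 happiness scale (not real lexicon values).
word_scores = {"great": 8.0, "thank": 7.5, "crisis": 2.5, "attack": 2.0}


def average_happiness(tokens, scores):
    """Frequency-weighted mean score over the tokens that appear in the lexicon."""
    counts = Counter(t for t in tokens if t in scores)
    total = sum(counts.values())
    if total == 0:
        return None  # no scored words, so no sentiment estimate
    return sum(scores[w] * n for w, n in counts.items()) / total


tweet = "thank you for a great great town hall".split()
tweet_score = average_happiness(tweet, word_scores)
print(tweet_score)
```

Words missing from the lexicon are ignored, so short tweets often contain few scored words; averaging over all tweets of a community rather than per tweet gives more stable numbers.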

## 4.3. Analysis 3: Graphs based on retweets

- NB-viewer link: https://nbviewer.jupyter.org/github/MikkelGroenning/social_graph/blob/main/Notebooks/Make_Bipartite_Graph.ipynb
- Github link: https://github.com/MikkelGroenning/social_graph/blob/main/Notebooks/Make_Bipartite_Graph.ipynb

__________________
In this analysis, we look into how the politicians in our dataset retweet each other and the media. **Definition of a retweet:** *A retweet is when someone republishes or forwards a post to their own Twitter followers. Retweets are typically credited to their original authors, incentivizing users to create shareable content that expands their Twitter footprint.* [https://www.bigcommerce.com/ecommerce-answers/what-is-a-retweet/]
Our hypothesis is that people, and in particular American politicians, retweet tweets that confirm their world view. By analyzing the way the politicians retweet, we hope to reject or confirm this hypothesis.
This notebook is composed of three subparts.
The first part is devoted to loading the data and creating the retweet graph.
The second part is devoted to how the politicians retweet each other internally.
The third part is devoted to how politicians retweet the media. Here 16 prominent American news media Twitter profiles were chosen. The relationship between politicians and the media was modeled as a bipartite graph.
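
One way to turn a list of retweets into a weighted directed graph is to count how often each account retweets each other account. This is a minimal sketch on made-up pairs, using plain Python instead of a graph library; dropping self-retweets is shown as one possible choice, and the linked notebook may handle them differently.

```python
from collections import Counter

# Made-up (retweeter, original author) pairs for illustration only.
retweets = [
    ("a", "b"), ("a", "b"), ("c", "b"),
    ("b", "b"),  # a self-retweet
    ("c", "a"),
]

# Edge weight = number of times one account retweeted another; self-loops are dropped.
edge_weights = Counter((src, dst) for src, dst in retweets if src != dst)

# Weighted in-degree: how often each account was retweeted by others.
in_strength = Counter()
for (src, dst), weight in edge_weights.items():
    in_strength[dst] += weight

print(dict(edge_weights))
print(in_strength.most_common(1))
```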

### 4.3.1 Subconclusion

In this analysis, we have looked at how the politicians retweet each other and how they retweet the media. The retweets between the politicians showed that some are very fond of retweeting themselves; this has been corrected for. Both the in- and out-degree distributions resemble those of a scale-free network, and the in-degree in particular highlights the "popular/important" politicians in the dataset. The retweet graph much resembles the who-follow-whom graph; the most important difference between the two is that Donald J. Trump becomes the center of the Republican party in the retweet graph. Furthermore, it was found that politicians tend to retweet politicians from the same party. This aligns with our hypothesis that politicians tend to retweet people they like, assuming they are more fond of people from their own party.
The bipartite graph is used to investigate the media's contribution, if any, to political polarization. By dividing a simple graph containing all nodes into two sets, one for media and one for politicians, it can be concluded that, based on retweets, the media play a part in the political polarization.
In the bipartite analysis, it was found that politicians from the same party tend to retweet the same media. In particular, Republican politicians are fond of retweeting Fox News, whereas Democratic politicians are fond of The Hill and MSNBC. Furthermore, the Robins-Alexander bipartite clustering coefficient was calculated on the graph to measure the tendency for politicians from the same party to retweet the same media, and it confirmed this tendency.
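
The party-level media preference can be illustrated by counting, for each party, the share of media retweets that go to each outlet. The sketch below uses made-up politicians, made-up retweets and illustrative outlet labels; it is not the bipartite analysis from the linked notebook and does not compute the Robins-Alexander coefficient.

```python
from collections import Counter, defaultdict

# Made-up data: party per politician and (politician, media account) retweets.
party = {"p1": "Democrat", "p2": "Democrat", "p3": "Republican"}
media_retweets = [
    ("p1", "MSNBC"), ("p1", "thehill"), ("p2", "MSNBC"),
    ("p3", "FoxNews"), ("p3", "FoxNews"), ("p3", "thehill"),
]

# Count retweets of each media account per party.
counts = defaultdict(Counter)
for politician, outlet in media_retweets:
    counts[party[politician]][outlet] += 1

# Convert counts to shares so parties with different activity can be compared.
shares = {
    p: {outlet: n / sum(c.values()) for outlet, n in c.items()}
    for p, c in counts.items()
}
print(shares)
```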

# 5. Discussion

Much of the work carried out rests on the Twitter handles of the various politicians. Errors or ambiguities in these propagate through the project and the analyses. Sadly, such issues were apparent in the data used to get the Twitter handles. One issue is that some politicians have multiple accounts, and one had to be picked. An example is Mitch McConnell, who has both @McConnellPress and @senatemajldr. In the data we used, @McConnellPress is picked even though @senatemajldr might have been better. This is the case for many politicians who have both a personal and a campaign account. Another issue is that some close their campaign account after they leave Congress. Multiple cases were found where handles from the 115th Congress are now owned by someone who is not a politician. We filtered out the ones we found, but some could have been missed.

The Harvard data includes many random profiles that were not relevant for us. Here we were forced to use the collected Twitter handles as filters to keep only tweets from the corresponding politicians. The reason we had to use the external handles in the first place is that the Harvard data did not include the party of the tweeters, and this was a central part of the project.

After having built the original Who-follows-Whom graph, a few errors in the party labels were also found. They were noticed because the affected politicians seemed misplaced compared to the other members of their party. This was investigated further, and it was found that there were errors in the data (3 cases were found). Of course it is not good that there are errors, but on the other hand it was quite impressive that a graph drawn with `ForceAtlas` could reveal errors in the data!


In the Who-Follow-Whom graph, the community detection was performed on the undirected graph. It would have been interesting to see how the results would have differed if similar methods for directed graphs had been utilized. In making the graph undirected, some compromises had to be made, and the choices here were perhaps not ideal, though the results revealed some interesting patterns.
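
The compromise involved in dropping edge direction can be seen in a small sketch. Two common rules are shown on made-up edges: keep an undirected edge if either account follows the other, or only if the follow is mutual. Which rule was used in the linked notebook is not stated here, so both are illustrative.

```python
# Made-up directed follow edges (follower, followed).
directed = {("a", "b"), ("b", "a"), ("a", "c"), ("d", "c")}

# Option 1: keep an undirected edge if either account follows the other.
either = {frozenset(edge) for edge in directed}

# Option 2: keep an undirected edge only if the follow is mutual.
mutual = {frozenset((u, v)) for u, v in directed if (v, u) in directed}

print(len(either), "edges with the 'either' rule")
print(len(mutual), "edges with the 'mutual' rule")
```

The "either" rule keeps the graph dense but treats a one-sided follow like a mutual one, while the "mutual" rule is stricter and can disconnect accounts that are followed but do not follow back.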

Our initial idea was to include the American news media to a larger extent. But as the Twitter API only allows one to extract the latest 3200 tweets using only the profile name, that was deemed infeasible. Unfortunately, we could not find a source that stored tweet ids for American news media. It would probably have revealed more about how the media play a role in the hypothesized polarization.

We think it was particularly interesting to see in the interactive graphs how well Democrats were split from Republicans, but also how well Representatives were split from Senators. Our results align clearly with our initial hypothesis that there is political polarization in the United States Congress. Though this does not prove polarization in general, our analysis shows that polarization exists to some degree in the way the politicians act on Twitter.

## 5.1 Future work
The full project is based on only 40 months of data. It could be interesting to extend it back to Obama’s presidency, when Twitter was much younger. Time-dependent analyses could then be introduced to find other cool take-aways: is the polarization increasing?

As discussed above, there were multiple errors in the data. It might be worthwhile to invest the time to construct a dataset from the bottom up, which would also let us account for multiple accounts belonging to the same politician.

As the whole group is very interested in Machine Learning, we also discussed that it could be quite cool to develop NLP models to detect who tweets (at both a type/party level and a personal level).


 


# 6. Conclusion

In this project, political polarization in the United States Congress has been investigated based on data from Twitter. The data has been analyzed through three different analyses. Before any analysis could be carried out, the data had to be obtained using the Twitter API and then preprocessed into the format needed for the analyses.

In the first analysis, a graph based on whether politicians followed each other was created, visualized, and analyzed. A very clear pattern was obtained that separated the two major American parties. Moreover, segregation was also found between representatives and senators. This split was so clear that when the Louvain method was used to find communities, three communities were found that roughly were Senators, Republican Representatives, and Democratic Representatives.

The second analysis is based on natural language processing, where multiple tools were used to analyze the actual tweets. Word clouds were created for the found communities as well as on a state level. In general, there were fairly few interesting take-aways, as the keywords were often the politicians' own names. A lexical dispersion plot was also created, from which one could clearly get an idea of when different words were *hot* in Congress. Lastly, sentiment analysis was used to learn how the communities from part 1 *felt* towards different words as well as how they felt towards each other.

In the third analysis, two additional graphs were created based on retweets. The first was only for politicians, where it was found that some are very fond of retweeting themselves. Compared to the who-follow-whom graph, the in- and out-degree distributions much more resembled those of a scale-free network. Another difference was that President Trump was now pretty much the center of the Republican Party. Afterward, 16 media accounts were introduced and a bipartite graph was created between media and politicians. Here it was found that there is a tendency for politicians from the same party to retweet the same media.

As this project is very large, a great amount of effort has also been put into summarizing and visualizing the key results on a web-page.

# 7. Contributions

The full project has been made in close collaboration across the full project period.
We have done our best to highlight below the tasks that were performed and how each group member contributed.
We think it has been pretty difficult to assign a *lead*, as all group members have had a hand in everything.


- **Basic stats:** Mikkel Grønning + Toke Bøgelund-Andersen
- **Who-follow-whom:** Toke Bøgelund-Andersen
- **Text analysis:** Toke Bøgelund-Andersen + Ida Riis Jensen
- **Retweet-graphs**
    - **With politicians:** Mikkel Grønning
    - **Bipartite:** Ida Riis Jensen
- **Front end developer**
    - **design:** Mikkel Grønning
    - **hosting:** Mikkel Grønning (family site)
- **Dope ass interactive graphs:** Mikkel Grønning
- **Github:** Mikkel Grønning
- **Data handling:**
    - **Tweet ids:** Mikkel Grønning + Ida Riis Jensen
    - **Handle Data:** Toke Bøgelund-Andersen
    - **Media:** Ida Riis Jensen
- **Rest of the explainer notebook:** Ida Riis Jensen


## References

#### Links
Twitter API Documentation
https://developer.twitter.com/en/docs/twitter-api
Trump Twitter Archive
http://www.trumptwitterarchive.com

Littman, Justin, 2017, "115th U.S. Congress Tweet Ids", https://doi.org/10.7910/DVN/UIVHQR, Harvard Dataverse, V5

Wrubel, Laura; Kerchner, Daniel, 2020, "116th U.S. Congress Tweet Ids", https://doi.org/10.7910/DVN/MBOJNS, Harvard Dataverse, V1

Software for Complex Networks — NetworkX 2.5 documentation
https://networkx.org/documentation/networkx-2.5/

115th Congress handles, Congressional Social Media Handles - Triage Cancer-Finances-Work-Insurance | Triage Cancer
https://triagecancer.org/congressional-social-media

115th Congress handles, Sciencecoalition.org
https://www.sciencecoalition.org/wp-content/uploads/2018/09/115th-Congress-Twitter-Handles.pdf


### Papers/books
Albert-Laszlo Barabasi. (2015). Network Science. Cambridge: Cambridge University Press. 
http://networksciencebook.com

Dodds PS, Harris KD, Kloumann IM, Bliss CA, Danforth CM (2011) Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter. PLoS ONE 6(12): e26752. https://doi.org/10.1371/journal.pone.0026752