# Tweeting Democracy:
### Tweets of the 2020 Democratic Nominee Hopefuls

##### Hanna Born, Thomas Malejko, & Nicole Yoder
##### ANLY 580 (Fall 2020)
##### 8 December 2020
Note: This report does not include any code, but the notebooks are linked where applicable.

### Introduction

Since its introduction in 2006, Twitter has evolved into a key platform for modern politics, offering politicians a fast and easy way to communicate their message and priorities to the public. Under the Trump administration, Twitter grew even more central to politics, as the President often chose his Twitter account as his preferred medium for connecting with the American people. Twitter’s popularity as a medium for political discourse means that information from political campaigns (everything from campaign events to fundraising to policy platforms) is readily available in real time. The 2020 Democratic Presidential Primaries were no different. At one point, more than two dozen candidates vied for the opportunity to represent the Democratic Party in the 2020 Presidential Election. This project examined the tweets made by seven of the most prominent candidates, from the start of August 2019 until Super Tuesday (March 2020), to understand how each candidate used this growing platform to engage prospective voters.

### Literature Review

The rapid expansion of social media over the past decade has challenged the efficacy of existing analytic techniques, both because of the enormous quantity of data generated by these services (Facebook, Twitter, YouTube, etc.) and because of changes in how people communicate in these online forums. Consequently, much research has been conducted to understand the feasibility of, and the techniques required for, generating usable insights from this data. Four papers, in particular, look at the practicality of conducting topic modelling and authorship detection on tweets. <br><br>
In “Author Identification on Twitter,” Antonio Castro and Brian Lindauer show that an author can be detected with 40 percent accuracy using a regularized linear regression model that relies only on publicly available information on Twitter. Interestingly, the authors comment that tweeters can evade detection by deliberately altering “their writing voice, or limiting the amount of text posted.” While this is beneficial to the political dissenters with whom the authors are concerned, it may complicate our attempt to identify the originating Twitter account, since most politicians likely have many people writing their tweets, from public affairs staff to aides to the candidates themselves. Castro and Lindauer’s work builds upon the preeminent research conducted by Arvind Narayanan, et al. in a paper entitled “On the Feasibility of Internet-Scale Author Identification.” The latter authors showed that neural networks and regularized linear regression models perform equally well for authorship identification once the data has been normalized, and they developed enhanced evaluation metrics, including improved confidence estimators. Brunna de Sousa Pereira Amorim, et al. researched classification techniques that could identify tweets containing political content and, hence, potential electoral crimes in Brazil (for example, out-of-term political advertising is illegal). In keeping with the strong showing of linear models in the studies above, they found that logistic regression models far outperformed neural networks in this task, correctly identifying political tweets with nearly 90 percent accuracy. Finally, “An Evaluation of Topic Modelling Techniques for Twitter” evaluated multiple techniques for topic modelling on ‘short’ documents, which make up the entirety of Twitter. This paper showed that biterm topic models outperformed, as measured by coherence scores, both Latent Dirichlet Allocation (including variants modified for use on short texts) and word-embedding models such as word2vec.

### The Data

The dataset used for this research was collected from 28 May to 9 June 2020 and comprises 13,814 tweets from seven of the most prominent 2020 Democratic Presidential Nominee hopefuls: Joe Biden, Pete Buttigieg, Tulsi Gabbard, Amy Klobuchar, Bernie Sanders, Tom Steyer, and Elizabeth Warren. The tweets span from 2 August 2019 (approximately the beginning of the 2020 Democratic Primary Campaign) to 2 March 2020 (the eve of Super Tuesday), a seven-month period in which each candidate tweeted at least 1,000 times. The dataset was collected via Twitter’s API, reformatted, and saved in a comma-separated format that is 5,646 KB in size. In addition to the full text of each tweet, the dataset contains the time of posting, the retweet count, and the number of times the tweet was favorited, as well as information about the user’s account at the time of the tweet, such as follower count and friend count.

### Initial Cleaning
[Cleaning Data notebook](CleaningData.ipynb)

Before starting any analysis, the data required some cleaning. Since this research was focused on the text of the tweets, many columns were removed (like favorite_count, friends_count, etc.). Then duplicate tweets and those in languages other than English were dropped.
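
As a rough illustration of these cleaning steps (not the code from the linked notebook), the sketch below works on plain dictionaries using only the standard library. The field names `screen_name`, `created_at`, `full_text`, and `lang` are assumptions based on common Twitter API naming, and the sample rows are made up.

```python
def clean_tweets(rows, keep=("screen_name", "created_at", "full_text"),
                 text_field="full_text", lang_field="lang"):
    """Keep only text-related fields, English tweets, and first copies."""
    seen_texts = set()
    cleaned = []
    for row in rows:
        if row.get(lang_field) != "en":
            continue  # drop tweets not tagged as English
        text = row[text_field].strip()
        if text in seen_texts:
            continue  # drop exact duplicate tweets
        seen_texts.add(text)
        # Columns such as favorite_count and friends_count are left behind here.
        cleaned.append({field: row[field] for field in keep})
    return cleaned


# Tiny made-up rows, for illustration only.
raw = [
    {"screen_name": "cand_a", "created_at": "2019-08-02",
     "full_text": "Join us in Iowa!", "lang": "en",
     "favorite_count": 10, "friends_count": 5},
    {"screen_name": "cand_a", "created_at": "2019-08-03",
     "full_text": "Join us in Iowa!", "lang": "en",
     "favorite_count": 3, "friends_count": 5},
    {"screen_name": "cand_b", "created_at": "2019-08-04",
     "full_text": "Gracias a todos.", "lang": "es",
     "favorite_count": 7, "friends_count": 9},
]
cleaned = clean_tweets(raw)
```

Only exact text matches are treated as duplicates in this sketch; near-duplicates (for example, the same text with a different link) would need a looser comparison.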

### Exploratory Data Analysis
[EDA notebook](EDA.ipynb)

Initial Exploratory Data Analysis was performed to develop a better understanding of the data set as a whole and to investigate characteristics of each candidate's use of language that could provide insights about authorship.

| Candidate | Tweet Count |
| -- | -- |
| Elizabeth Warren    | 2347 |
| Joe Biden           | 2183 |
| Amy Klobuchar       | 2017 |
| Tom Steyer          | 1990 |
| Bernie Sanders      | 1861 |
| Pete Buttigieg      | 1700 |
| Tulsi Gabbard 🌺    | 1085 |

The first bar chart provides the distributions of tweet counts by candidate for the data set.

<img src="/Graphics/EDA_1.png" alt="Distribution of Tweets by Candidate" style="width: 350px; float: center;"/>

The number of tweets per candidate appears relatively comparable, with the exception of Rep. Tulsi Gabbard, whose tweet count is substantially lower for the time period considered. Next, each candidate's tweets were evaluated according to length (in tokens) and type/token ratio.

| Candidate      | Tokens per Tweet | SD of Tweet Length (in tokens) | Type/Token Ratio |
| -------------- | -------------- | -------------- | -------------- | 
| Amy Klobuchar    |   33.798883  | 13.083463  |  10.843   |
| Bernie Sanders   |   33.529924  | 11.299502  |  8.4832   |
| Elizabeth Warren |   35.564433  | 11.464681  |  8.0392   |
| Joe Biden        |   34.941204  | 11.669440  |  8.2022   |
| Pete Buttigieg   |   34.545226  | 12.688918  |  10.1462  |
| Tom Steyer       |   28.849539  | 12.371526  |  11.6813  |
| Tulsi Gabbard 🌺 |    28.365482 | 16.832461  |  15.4796  |

<img src="/Graphics/EDA_2.png" alt="Tweet Token/Type Ratio by Candidate" style="width: 350px; float: center;"/>

<img src="/Graphics/EDA_3.png" alt="Distribution of Tweet Lengths by Candidate" style="width: 800px; float: center;"/>

Rep. Tulsi Gabbard's tweets show the greatest variation in length and the highest type/token ratio. Similar observations show up later during the authorship detection analysis in Part A.
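
For readers who want to reproduce statistics of this kind, here is a minimal sketch of how tokens per tweet, their standard deviation, and a type/token ratio might be computed for one candidate. The regular-expression tokenizer is an assumption (the report does not say which tokenizer was used), and the ratio is returned as a plain fraction; the report does not state the scale of the values in its table.

```python
import re
from statistics import mean, stdev

# Assumed tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def length_and_ttr(tweets):
    """Return (mean tokens per tweet, SD of tweet length, type/token ratio)."""
    token_lists = [tokenize(t) for t in tweets]
    lengths = [len(tokens) for tokens in token_lists]
    all_tokens = [tok for tokens in token_lists for tok in tokens]
    # Distinct types divided by total tokens across all of the tweets.
    ttr = len(set(all_tokens)) / len(all_tokens)
    return mean(lengths), stdev(lengths), ttr


# Made-up tweets, for illustration only.
tweets = ["Join us in Iowa!", "Join us in New Hampshire!"]
mean_len, sd_len, ttr = length_and_ttr(tweets)
```

Because the type/token ratio falls as a corpus grows, candidates with fewer or shorter tweets will tend to score higher on it regardless of style.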

<img src="/Graphics/EDA_4.png" alt="POS Distribution by Candidate" style="width: 700px; float: center;"/>

Based on the results from POS tagging, the distribution of POS usage across candidates is highly similar, mainly due to the structure of the English language. One notable observation is that Sen. Amy Klobuchar and Rep. Tulsi Gabbard use proper nouns ('PROPN') more often than the other candidates. This observation coincides with some of the results in Part B (Topic Modeling), where we learn that these two candidates, proportionally, had more tweets aimed at campaigning and securing donations than other candidates, who spent more time tweeting about issues important to their respective campaign platforms.

### Methodology
#### Research Objectives
Based on the exploratory data analysis, background research, and relevant domain knowledge, the following research objectives were identified. These objectives were evaluated using various analytic techniques and machine learning algorithms so as to maximize predictive capability or coherence while also maximizing the insight gained about each objective.
   1. Authorship Detection: Predict the author (in this case the Democratic Presidential Nominee Hopeful) given only the text of a previously unseen tweet and understand the stylistic features that most heavily influence the final prediction. 
   2. Topic Modeling: Identify the main topics discussed in the candidates’ tweets overall and how the candidates differed in their use of those topics.


#### Part A. Authorship Detection
[Authorship Detection notebook](AuthorshipDetection.ipynb)

The process of authorship detection requires a thorough exploratory analysis, above and beyond the techniques used in the previous section, in order to understand the stylistic characteristics of each of the seven candidates in this dataset. Two 'traditional' approaches were used to accomplish this: the Mendenhall Curve and Kilgarriff's Chi-Squared methods. Combining the results of those analyses with known features of stylistic significance (based on more modern research on this topic), new features were generated from the tweet text. These features were fed into a machine learning pipeline, which tuned the hyperparameters for, and evaluated the efficacy of, four distinct models. Once the optimal model was identified, every effort would be made to extract information about the model's selection criteria.

#### Part B. Topic Modeling
[Topic Modeling notebook](TopicModeling.ipynb)

Two methods of topic modeling, Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF), were used and evaluated. For both methods, the number of topics was varied from 5 to 20. For LDA, the alpha parameter was additionally tested with the options auto, symmetric, and asymmetric. The perplexity and topic coherence measures were then analyzed to select several candidate models. After inspecting the top 25 tokens of each candidate model to see which made the most sense to a human reader, an NMF model with 6 topics was chosen as the best.
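
To make the idea behind NMF concrete, the toy sketch below factorizes a small document-term count matrix into document-topic weights `w` and topic-term weights `h`, then lists the top terms per topic. It uses plain-Python multiplicative updates (the Lee and Seung scheme) purely for illustration; it is not the implementation used in the notebook, and in practice a library routine such as scikit-learn's `NMF` would be the sensible choice. The vocabulary, counts, and function names are all made up.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Reconstruction error ||V - WH|| (Frobenius norm)."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into non-negative W (docs x topics), H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


def top_terms(h, vocab, n=3):
    """Return the n highest-weighted vocabulary terms for each topic row of H."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
            for row in h]


# Made-up vocabulary and counts: two tweets per theme.
vocab = ["iowa", "join", "town", "gun", "violence", "law"]
v = [[2, 1, 1, 0, 0, 0],
     [1, 2, 1, 0, 0, 0],
     [0, 0, 0, 2, 1, 1],
     [0, 0, 0, 1, 2, 1]]
w, h = nmf(v, n_topics=2)
topics = top_terms(h, vocab, n=3)
```

Repeating the fit for each candidate number of topics and reading the resulting `top_terms` lists mirrors, in miniature, the human inspection step described above.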

### Results
#### Part A. Authorship Detection
[Authorship Detection notebook](AuthorshipDetection.ipynb)
##### <em>Mendenhall Curve Method</em>

The analysis below is based on the work of literary scholar T.C. Mendenhall who theorized in 1887 that an author's stylistic signature could be determined by counting how often they used words of different lengths. While coarse by necessity of the time period in which he lived, the results of this analysis can be quickly generated and may provide some interesting insights (François Dominic Laramée, 2018). 
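
A characteristic curve of this kind is simply the relative frequency of each word length in an author's text. The sketch below shows one way it might be computed; the word pattern (ASCII letters with an optional straight apostrophe) and the cap at 15 letters are assumptions made for illustration, not details taken from the notebook.

```python
import re
from collections import Counter

# Assumed word pattern: ASCII letters, optionally followed by 's, 't, etc.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def word_length_curve(tweets, max_len=15):
    """Return {word length: share of words}, pooling lengths >= max_len."""
    lengths = [min(len(word.replace("'", "")), max_len)
               for tweet in tweets for word in WORD_RE.findall(tweet)]
    total = len(lengths)
    if total == 0:
        return {n: 0.0 for n in range(1, max_len + 1)}
    counts = Counter(lengths)
    return {n: counts[n] / total for n in range(1, max_len + 1)}


# Made-up tweet, for illustration only.
curve = word_length_curve(["We can win in Iowa"])
```

Plotting one such curve per candidate on shared axes gives a chart of the kind shown below.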

![Candidate Word Length Tendencies: Mendenhall's Characteristic Curve](Graphics/AD1.png "Candidate Word Length Tendencies: Mendenhall's Characteristic Curve")

Mendenhall’s Characteristic Curves of Composition show little difference in word-length usage among the Democratic Presidential Candidates on Twitter. Tom Steyer (the orange curve), a businessman and non-career politician, had the most distinct curve of the group, as he tended to use longer words more frequently than the other candidates. Could this be a sign of his naivety about what constitutes a successful social media campaign for a politician? Additionally, two of the three women candidates studied (Amy Klobuchar and Tulsi Gabbard) tended to use three-letter words much more frequently than their male counterparts. This may signal a concerted effort by both campaigns to make their candidates appear more relatable by using simple and direct language.

##### <em>Kilgarriff's Chi-Squared Method</em>

Adam Kilgarriff, in a 2001 paper, proposed using the chi-squared statistic to determine authorship. In his method, the 'statistic' measures a given author's 'distance' from the average use of the most frequent words in the comparison corpus, whether that corpus is a collective corpus, an unknown writing sample, etc. Therefore, the author with the smallest 'statistic' uses the most commonly occurring words at a rate most similar to the comparison corpus (François Dominic Laramée, 2018).
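
As a hedged illustration, the statistic can be written out following the formulation in Laramée's tutorial. The symbols are introduced here for convenience and do not come from the original analysis. Let $A$ be an author's corpus and $B$ the comparison text, containing $N_A$ and $N_B$ tokens respectively, and let $O_{i,A}$ and $O_{i,B}$ be the observed counts of the $i$-th of the $n$ most frequent words in the combined corpus. If both texts were drawn from the same population, each word's combined count would be split in proportion to corpus size:

$$
s_A = \frac{N_A}{N_A + N_B}, \qquad
E_{i,A} = s_A \,(O_{i,A} + O_{i,B}), \qquad
E_{i,B} = (1 - s_A)\,(O_{i,A} + O_{i,B})
$$

$$
\chi^2 = \sum_{i=1}^{n} \left[ \frac{(O_{i,A} - E_{i,A})^2}{E_{i,A}} + \frac{(O_{i,B} - E_{i,B})^2}{E_{i,B}} \right]
$$

A value of zero means the two texts use every one of the $n$ words at exactly the same rate; larger values indicate greater stylistic distance.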

<b> Determine the Number of Common Words Shared By All Candidates </b>

Selecting the number of types to use in this analysis is non-trivial. Some scholars suggest using between 100 and 1,000 of the most common types in the corpus, while one researcher even recommended using every word that appeared at least twice. There appears to be some consensus, however, that the number of words selected should be proportional to the corpus size (the larger the corpus, the larger the number of common words used in the analysis) so as not to give undue importance to infrequent words (François Dominic Laramée, 2018). In an attempt to be more systematic about selecting this critical number, a type-frequency diagram was constructed to identify an appropriate number of types for this analysis.

![Kilgarriff Method: Selecting Number of Common Tokens](Graphics/AD2.png "Kilgarriff Method: Selecting Number of Common Tokens")

Given that this corpus is on the smaller side, the number of 'common tokens' should be proportionally small. As such, it is no surprise that this curve appears to level off around 100 types. This cutoff includes types such as '.', 'or', and 'their' but excludes slightly less common words such as 'iowa', 'community', 'want', and 'act'.

<b> Kilgarriff Chi-Squared Method </b>

<img src="Graphics/AD3.png" alt="Kilgariff Results" style="width: 200px; float: left;"/>

Based on the results calculated above, we can say that, stylistically, Tom Steyer and Pete Buttigieg have writing styles most similar to the collective whole, while Elizabeth Warren and Bernie Sanders have writing styles that are very different from the rest of the Democratic Presidential Nominee Hopefuls. Note that two candidates having scores close to each other does not mean that their writing styles are similar, only that their writing styles are equidistant from the collective group average.

As this technique revealed some clear differences between the candidates' tweeting styles, it may be possible to correctly identify an author given a sufficient number of tweets. To test this hypothesis, the dataset was split into training and testing sets at an 80:20 ratio. Bootstrap samples of various sizes (ranging from 1 to 100 tweets) were drawn for each candidate from the testing set and passed into the Kilgarriff chi-squared algorithm. The candidate whose known tweets (from the training set) yielded the chi-squared statistic closest to zero was assigned as the likely author. These predictions were then compared to the actual authors and scored using the F1-score. The results are below:
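
The attribution procedure described above can be sketched roughly as follows. This is an illustrative reconstruction based on the chi-squared formulation in Laramée's tutorial, not the notebook's code: the function names, the toy token lists, and the choice to flatten each bootstrap sample into a single token list are assumptions, and the F1 scoring step is omitted.

```python
import random
from collections import Counter


def chi_squared(tokens_a, tokens_b, n_common=100):
    """Chi-squared distance between two token lists over their shared top words."""
    if not tokens_a or not tokens_b:
        raise ValueError("both token lists must be non-empty")
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    joint = counts_a + counts_b
    share_a = len(tokens_a) / (len(tokens_a) + len(tokens_b))
    stat = 0.0
    for word, joint_count in joint.most_common(n_common):
        # Expected counts if both texts used the word at the same rate.
        expected_a = joint_count * share_a
        expected_b = joint_count * (1 - share_a)
        stat += (counts_a[word] - expected_a) ** 2 / expected_a
        stat += (counts_b[word] - expected_b) ** 2 / expected_b
    return stat


def bootstrap_sample(tokenized_tweets, k, rng):
    """Draw k tokenized tweets with replacement and flatten them into one list."""
    picked = rng.choices(tokenized_tweets, k=k)
    return [tok for tweet in picked for tok in tweet]


def predict_author(sample_tokens, train_tokens_by_author, n_common=100):
    """Assign the author whose training tokens give the smallest statistic."""
    scores = {author: chi_squared(tokens, sample_tokens, n_common)
              for author, tokens in train_tokens_by_author.items()}
    return min(scores, key=scores.get)


# Made-up training and test data for two hypothetical authors.
train = {
    "A": ["union", "worker", "pay"] * 10,
    "B": ["gun", "law", "senate"] * 10,
}
test_tweets_a = [["union", "worker", "pay"], ["worker", "union", "pay"]]

rng = random.Random(0)
sample = bootstrap_sample(test_tweets_a, k=5, rng=rng)
predicted = predict_author(sample, train)
```

Since the statistic is never negative, picking the minimum is the same as picking the value closest to zero. Repeating the draw many times for each sample size and each candidate would yield the predictions that are then scored.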

![Kilgarriff Prediction Score by Tweet Count](Graphics/AD4.png "Kilgarriff Prediction Score by Tweet Count")

The Detailed Results for Balanced Results:

<img src="Graphics/AD5.png" alt="Balanced Results Details" style="width: 600px; float: left;"/>

The Detailed Results for Best Performance:

<img src="Graphics/AD6.png" alt="Best Performance Details" style="width: 600px; float: left;"/>

This method shows a great deal of promise for authorship detection. Using just 75 tweets, an author could be detected from a field of seven candidates with greater than 99 percent accuracy. With just 50 tweets, the algorithm still performs well, achieving nearly 80 percent accuracy. Future analyses should consider rerunning this analysis while also recording whether the correct candidate appeared among the top X candidates.

Specific to this analysis, the model tended to over-attribute tweets to Congresswoman Tulsi Gabbard, a tendency visible in the two cases shown above and also observed during the construction of the algorithm. It seems that her writing style closely matches that of a few other candidates, especially Senator Amy Klobuchar and Vice President Joe Biden.

#### Part B. Topic Modeling
[Topic Modeling notebook](TopicModeling.ipynb)

The selected NMF model had six topics, which were generalized as follows:
1. Campaigning 
    - focused on early voting states like 'Iowa' and 'New Hampshire', asking people to 'join', and highlighting events like 'town halls'.
2. Donations 
    - focused on fundraising with words like 'help', 'chip (in)', 'grassroots', 'donor', and 'donation'.
3. Health Care and Climate Change 
    - mostly focused on 'health care', but 'climate change' is also present.
4. Defeating Donald Trump 
    - this topic has some overlap with the third, since it has 'climate' and 'crisis', but tweets in this category would be focused on how President Trump mishandled the crisis in their view.
5. Workers' Rights and Education 
    - mostly focused on workers' rights with words like 'worker', 'union', and 'pay', but the less common words 'student', 'school', and 'teacher' extend this topic to education.
6. Gun Violence and Legislation 
    - mostly focused on gun violence with words like 'gun', 'violence', and 'epidemic', but the less common words 'pass', 'law', 'house', and 'senate' extend this topic to general legislation.

![Wordclouds for Each Topic](Graphics/TopicModeling3.png "Wordclouds for Each Topic")

<b>Results for All Candidates</b>

The Campaigning and Donations topics accounted for about 41% of the tweets, which shows that the candidates frequently used Twitter to advertise their events and fundraise. The other 59% of tweets were more about policies the candidates would enact or disagreed with, which is somewhat surprising given the short character limit of a tweet.

The Gun Violence and Legislation topic had an unexpectedly large number of tweets, which was caused by two factors. The first was the conversation driven by the mass shootings in El Paso, Dayton, and elsewhere in 2019. The second was the inclusion of many general political/legislative phrases in the topic, which would result in legislative tweets unrelated to gun violence also being categorized in this topic.

![Distribution of Tweets by Topic](Graphics/TopicModeling1.png "Distribution of Tweets by Topic")

<b> Results for Each Candidate </b>

Here are some insights:
- Warren came closest to tweeting about every topic equally and had proportionally more Donations tweets than the others
- Biden and Steyer had proportionally more Defeating Trump tweets
- Sanders had proportionally more Workers' Rights/Education tweets
- Gabbard and Klobuchar had more Campaigning tweets and fewer Health Care/Climate Change tweets than the rest of the candidates
- Buttigieg had proportionally more Gun Violence/Legislation tweets
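
Proportions like those compared above could be tabulated along the following lines. The sketch assumes each tweet is assigned to the topic with its highest NMF weight, which the report does not state explicitly; the author labels and weights are made up.

```python
from collections import Counter, defaultdict


def topic_shares(authors, doc_topic_weights, topic_names):
    """Return {author: {topic: percent of that author's tweets}}."""
    counts = defaultdict(Counter)
    for author, weights in zip(authors, doc_topic_weights):
        # Assumed rule: a tweet belongs to its single highest-weight topic.
        dominant = max(range(len(weights)), key=lambda k: weights[k])
        counts[author][topic_names[dominant]] += 1
    shares = {}
    for author, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[author] = {t: 100 * topic_counts[t] / total for t in topic_names}
    return shares


# Made-up authors and document-topic weights, for illustration only.
authors = ["A", "A", "A", "B"]
weights = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]]
shares = topic_shares(authors, weights, ["Campaigning", "Donations"])
```

Normalizing within each candidate, rather than across the whole corpus, keeps candidates with very different tweet counts comparable.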

President-Elect Biden's message of the importance of defeating President Trump was not unique among the field of candidates, but he emphasized it the most and made it a key pillar of his campaign. This strategy appears to have worked for him, as he won both the Democratic nomination and the general election.

![Percent of Candidates' Tweets Categorized as each Topic](Graphics/TopicModeling2.png "Percent of Candidates' Tweets Categorized as each Topic")

### Conclusion

Something...