In [3]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import findspark
findspark.init()

import pyspark
from pyspark.sql import *
import pyspark.sql.functions as func
from pyspark.sql.types import *

%run insights.py
%run plot.py

# Create spark session
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## Topics during the 2016 election

It is widely accepted in the political sphere that the 2016 U.S. presidential election deeply divided U.S. citizens. As early as January 2016, [journalists](https://www.washingtonpost.com/politics/a-divided-country-gets-a-divisive-election/2016/01/09/591bfccc-b61f-11e5-a842-0feb51d1d124_story.html?noredirect=on&utm_term=.54c240a270e9) were already reporting that the rhetoric used by the various presidential candidates was unusually violent and aggressive, dividing the opinions of the American people. This division kept growing throughout the year as the presidential candidates were struck by various controversies.

Many subreddits serve as hubs for internet users wanting to voice their support for the candidate they hoped would win. This provides a natural aggregation of the opinions of people supporting each side of the election. [r/The_Donald](https://www.reddit.com/r/The_Donald) and [r/hillaryclinton](https://www.reddit.com/r/hillaryclinton) were the main subreddits for Reddit users backing the Republican and the Democratic candidate, respectively. To observe this divisiveness, we chose to take a closer look at the most discussed topics in each community during the week preceding election day. The goal is to discern, through the topics discussed in each community, how the candidates' rhetoric was received and how the internet community dealt with this controversial election.


### Methodology: LDA

Latent Dirichlet Allocation (LDA) is the unsupervised method we chose to model the topics of Reddit discussions. Given a corpus of documents, LDA assumes that each document is produced by a mixture of a fixed number of topics. Using this algorithm, we can infer which subjects are discussed in the collection of documents.

In our case, we chose individual Reddit comments as documents, since a comment is the smallest coherent piece of information in a discussion thread. Some classic natural language preprocessing has to be done for LDA to work properly: each comment was first tokenized, then stripped of English stop words, and finally lemmatized. Once this was applied to the dataset, we observed that more preprocessing was needed to remove idiosyncrasies of Reddit:

* Comments with a low score do not represent opinions that were sufficiently appreciated by the community. We decided to remove every comment whose score was lower than 10.

* Short answers in threads do not bring much information to the discussion. We thus removed all comments shorter than 50 characters. As a beneficial side effect, this step also removes comments which were deleted or removed by moderators.

* Some subreddits deploy special programs disguised as users, called "bots", which can serve many purposes, whether automatic moderation or helping users with references. As the textual content of bot comments is mostly automatic, it does not give any information on the topic. Bot comments even pollute the topics modelled by LDA if they are kept, since a bot will likely produce the same textual content each time it is invoked. For those reasons, we removed comments obviously authored by a bot.

* Users often link to outside sources or to other Reddit comments. As URLs do not mix well with natural language processing, they had to be removed.
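
The four Reddit-specific filters above can be sketched in plain Python as follows. This is only an illustration, not the pipeline used to produce the parquet files: the thresholds come from the text, but the bot heuristic (`KNOWN_BOTS`, names ending in "bot"), the URL pattern, the dictionary layout of a comment, and the choice to strip URLs before measuring length are all assumptions.

```python
import re

# Thresholds taken from the text above.
MIN_SCORE = 10
MIN_LENGTH = 50

# Hypothetical heuristics: the notebook does not say how bots were detected.
KNOWN_BOTS = {"AutoModerator"}
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def is_bot(author):
    """Flag accounts on a known list or whose name ends in 'bot' (assumed rule)."""
    return author in KNOWN_BOTS or author.lower().endswith("bot")


def clean_comments(comments):
    """Keep well-scored, long enough, human-written comments, with URLs stripped."""
    kept = []
    for comment in comments:
        if comment["score"] < MIN_SCORE:
            continue
        if is_bot(comment["author"]):
            continue
        # Replace URLs by a space, then collapse repeated whitespace.
        body = " ".join(URL_PATTERN.sub(" ", comment["body"]).split())
        # Measured after URL removal, so link-only comments are dropped too.
        if len(body) < MIN_LENGTH:
            continue
        kept.append({**comment, "body": body})
    return kept
```

Because placeholder bodies such as `[deleted]` are far shorter than 50 characters, the length check also discards them, which matches the side effect described in the list.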

LDA can then be applied to the collection of preprocessed Reddit comments. Besides the number of topics and the number of words per topic, the algorithm also takes two other parameters: the document concentration (alpha) and the topic concentration (beta).

The alpha parameter controls how many topics are mixed together to produce a document: the larger alpha is, the more topics each document tends to draw from. In the context of Reddit comments, it is unlikely that a single comment mentions many topics, as it is a rather short piece of text and will likely focus on the subject of the thread. A small value of alpha should therefore be chosen.
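
To make the effect of alpha concrete, the sketch below draws document-topic mixtures from a symmetric Dirichlet distribution, which is the prior LDA places on them, and measures how much weight the single largest topic receives on average. It uses only the standard library (a Dirichlet draw is obtained by normalising independent Gamma draws); the function names, the number of simulated documents and the seed are illustrative choices, not part of the original analysis.

```python
import random


def sample_topic_mixture(alpha, num_topics, rng):
    """Draw one document-topic mixture from a symmetric Dirichlet(alpha)."""
    # Normalising independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_dominant_share(alpha, num_topics=8, num_docs=2000, seed=0):
    """Average weight of the largest topic across simulated documents."""
    rng = random.Random(seed)
    shares = [
        max(sample_topic_mixture(alpha, num_topics, rng))
        for _ in range(num_docs)
    ]
    return sum(shares) / num_docs


# Small alpha: each simulated comment is dominated by a single topic.
print(mean_dominant_share(0.025))
# Large alpha: the weight is spread across many topics.
print(mean_dominant_share(5.0))
```

With a small alpha the dominant topic should carry most of the weight, whereas with a large alpha the mixture is far more even, which is why a small value suits short comments.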

The beta parameter controls how much the topics are composed of the same words: the larger beta is, the more words each topic spreads its weight over, and the more the topics overlap. In other words, a low value of beta pushes LDA to produce topics which do not share many identical words.

Through various tests, we found that 8 topics of 5 words each, with a value of 0.1 for beta and 0.025 for alpha, produced the best results for our purpose.
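
As a rough sketch, these values could be passed to Spark ML's `pyspark.ml.clustering.LDA` as shown below. The notebook only loads precomputed results, so the use of this particular class, the `features` column name, the seed and the `online` optimizer are assumptions. The online optimizer is chosen here because Spark's EM optimizer requires a document concentration greater than 1.

```python
# Hyperparameters reported in the text above.
LDA_SETTINGS = {
    "k": 8,                       # number of topics
    "docConcentration": [0.025],  # alpha; a single value is replicated k times
    "topicConcentration": 0.1,    # beta
    "optimizer": "online",        # assumption: EM would reject alpha <= 1
}
WORDS_PER_TOPIC = 5


def fit_lda(token_counts):
    """Fit LDA on a DataFrame whose 'features' column holds term-count vectors.

    Returns a DataFrame with one row per topic (topic, termIndices, termWeights).
    """
    # Imported lazily so the settings above can be inspected without Spark.
    from pyspark.ml.clustering import LDA

    lda = LDA(featuresCol="features", seed=0, **LDA_SETTINGS)
    model = lda.fit(token_counts)
    return model.describeTopics(maxTermsPerTopic=WORDS_PER_TOPIC)
```

The term indices returned by `describeTopics` still have to be mapped back to words through the vocabulary of whichever vectorizer produced the `features` column. Note also that the online optimizer may adjust alpha during training unless `optimizeDocConcentration` is set to `False`.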

## Topic analysis by subreddits

In [44]:
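def display_topics(topics):
    """Print the top words of each topic, numbered from 1."""
    # The topic table is tiny, so collecting it to the driver is safe.
    rows = topics.select('topic').collect()
    for i, row in enumerate(rows, start=1):
        print("Topic %d: %s" % (i, row[0]))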

#### r/The_Donald

In [50]:
trump_lda = spark.read.load('../data/trump_lda_result.parquet/')
display_topics(trump_lda)

Topic 1: fearless donald fucking last time
Topic 2: cooking spirit propaganda traitor comey
Topic 3: clinton hillary cuck email child
Topic 4: pedophile corrupt epstein know shit
Topic 5: trump energy take train vote
Topic 6: deleted weiner laptop archive watch
Topic 7: vote trump elite thread year
Topic 8: hero clinton podesta hillary rich


Upon quick inspection, what was discussed the week before the election looks like a rather erratic aggregation of unrelated words. Each topic mentions at least one person, often one of the two presidential candidates. Other terms, however, do not seem to make much sense in the context of the election. Fortunately, an in-depth analysis of the topics reveals some interesting insights into what was discussed by the redditors supporting Donald Trump:

* **"fearless donald fucking last time"**: Supporters of Donald Trump have often described the then Republican candidate as _fearless_, which explains the occurrence of the first two words. It is harder, however, to connect the last three words to this topic, as one swear word and two rather common terms do not add much meaning to the _fearless donald_ revered by The_Donald's users.



* **"cooking spirit propaganda traitor comey"**: Two subtopics are actually present within this list of words. "_cooking spirit_" refers to an e-mail invitation to John Podesta, Hillary's 2016 campaign chairman, to take part in a "Spirit Cooking" dinner organised by Marina Abramović. This information was part of the leaked e-mails from Hillary Clinton and her associates. [Conspiracy theorists soon exaggerated this dinner into something satanic](https://en.wikipedia.org/wiki/Marina_Abramović#Controversy). On the other hand, "propaganda traitor comey" seems to refer to the then FBI director James Comey, who oversaw the investigation into Hillary Clinton's handling of her e-mails. At the end of the investigation, Comey determined that Clinton's actions did not warrant criminal charges. As Trump was fond of the idea of putting Hillary behind bars over the whole e-mail scandal, Comey's decision might have been perceived by r/The_Donald as treacherous or as Democrat-led propaganda. It is possible that redditors linked the two events together, as the spirit cooking scandal was revealed through the e-mail leak.



* **"clinton hillary cuck email child"**: While the _clinton hillary email_ part of the topic is easily interpretable, _child_ might be a reference to [the Pizzagate conspiracy theory](https://en.wikipedia.org/wiki/Pizzagate_conspiracy_theory), which posited that the leaked e-mails revealed clues about a child-sex trafficking ring established in Washington D.C. Lastly, "cuck" does not seem to be linked directly to Pizzagate or the e-mails, and could be a byproduct of the common vocabulary used in discussions on r/The_Donald.



* **"pedophile corrupt epstein know shit"**: The named entity of the topic this time is someone named _Epstein_. Considering the other words, this surely refers to Jeffrey Epstein, billionaire and former top donor to the Democratic Party, who was [sentenced to jail for underage prostitution in 2008](https://en.wikipedia.org/wiki/Jeffrey_Epstein#Criminal_proceedings). Rephrasing the topic slightly would give us "Pedophile and corrupt Jeffrey Epstein knows shit". This topic may stem from discussions theorizing that Epstein was linked to the Pizzagate conspiracy.



* **"trump energy take train vote"**: The _trump train_ is a metaphor used by Trump's supporters which represents the [swarm of great ideas](https://www.urbandictionary.com/define.php?term=The%20Trump%20Train) the New York businessman came up with during his campaign. To _take_ such a train would mean being on board with Trump's ideas and being willing to _vote_ for him.

* **"deleted weiner laptop archive watch"**: As with Jeffrey Epstein, the mention of someone named _weiner_ helps us understand what those five words mean together. Anthony Weiner is a former U.S. congressman whose career came to an end after [scandals surrounding sexually explicit photos he sent to women on Twitter](https://en.wikipedia.org/wiki/Anthony_Weiner#Sexting_scandals). He was the husband of Huma Abedin, who was the vice chair of Hillary's 2016 campaign. Weiner [_deleted_ his Twitter account after one of his sexting scandals was exposed](https://nypost.com/2016/08/29/anthony-weiner-deletes-twitter-amid-new-sexting-scandal/). As a result of this scandal, his _laptop_ was seized by the FBI, and e-mails found on the computer led James Comey to reopen the investigation into Hillary Clinton's e-mails eleven days before the election. _Archive_ could refer to the archives of Clinton's e-mails, while _watch_ could suggest that inspecting the e-mails would yield more incriminating evidence against the Democratic politician, although this reading is a bit far-fetched.

* **"vote trump elite thread year"**: _vote trump elite_ might constitute an appeal to authority used by Trump supporters to convince other people to vote for the New York businessman. The formulation might be "The elite vote for Trump", which would imply that someone voting for Donald Trump is part of the _elite_.

* **"hero clinton podesta hillary rich"**: John _Podesta_ is mentioned with _Hillary Clinton_; however, the terms _hero_ and _rich_ do not give a precise clue as to how the campaign chairman and the Democratic candidate are discussed.


In conclusion, Trump supporters' discussions seemed to focus on two main axes during the week preceding the election: scandals and conspiracy theories surrounding Trump's opponent, and the glorification of Trump, his actions and his words.


#### r/hillaryclinton

In [49]:
hillary_lda = spark.read.load('../data/hillary_lda_result.parquet/')
display_topics(hillary_lda)

Topic 1: romney markets biggest cake website
Topic 2: happening dead wikileaks completely energy
Topic 3: hillary vote trump going election
Topic 4: cosponsored already children total picture
Topic 5: emails clinton trump party used
Topic 6: believe clinton thanks trump campaign
Topic 7: polls gotv voting tomorrow shit
Topic 8: trump state model polls poll


Just like in the r/The_Donald LDA analysis, "hillary", "clinton" and "trump" are common words in the topics, which comes as no surprise considering the context. Although these topics are less coherent than those from r/The_Donald, an in-depth analysis of the r/hillaryclinton topics gives us good insight into what mattered to Reddit users supporting the former first lady:

* **"romney markets biggest cake website"**: Mitt _Romney_, Republican politician and Obama's opponent in the 2012 election, [was viewed positively by Hillary's supporters](https://www.reddit.com/r/hillaryclinton/search?q=romney&restrict_sr=1&sort=top) for stating he would not vote for Trump. Unfortunately, _markets biggest cake website_ can hardly be linked to either Mitt Romney or Hillary Clinton.


* **"happening dead wikileaks completely energy"**: A possible rephrasing of this topic could be "_Wikileaks happening completely_ killed _energy_", or "The _energy_ is _completely dead_ because of what is _happening_ with _Wikileaks_". Whatever the chosen combination of words, this topic very likely indicates what Hillary's supporters thought about the leaked e-mails scandal: it sapped the energy of Hillary's campaign. At the time, it was already public knowledge that [Wikileaks played a crucial role in leaking the Democratic candidate's e-mails](https://wikileaks.org/clinton-emails/?q=iraq%7Cbaghdad%7Cbasra%7Cmosoul).


* **"hillary vote trump going election"**: This topic likely represents simple discussion about the upcoming election, which was happening in less than a week. Both candidates are mentioned, with _going_, _vote_ and _election_.


* **"cosponsored already children total picture"**: The only trace we could find of anything "_cosponsored_" by Clinton was this [post about Hillary's support of the LGBT+ community](https://www.reddit.com/r/hillaryclinton/comments/4xw1vr/a_brief_and_incomplete_timeline_of_hillarys/). The other terms are unfortunately too common in the context of the election to draw any conclusion.


* **"emails clinton trump party used"**: The subject here seems to be _clinton_'s _emails_ again. This time, the words seem to indicate that Hillary's supporters condemned _Trump_'s _party_ for _using_ this scandal. This gives us further evidence that the redditors from r/hillaryclinton were displeased about the whole e-mail investigation.


* **"believe clinton thanks trump campaign"**: Taken together, those words could have many different interpretations. Are the Reddit users thankful for Clinton's campaign? Do they _believe_ in _Clinton_'s _campaign_ but are somehow _thanking_ _Trump_? No definitive conclusion can be drawn in this case.


* **"polls gotv voting tomorrow shit"**: The most interesting term in this topic is [_gotv_](https://blog.everyaction.com/what-is-gotv-anyway), a well-known political acronym for "Get Out The Vote". It characterizes the last stage of an election campaign, in which people are encouraged to go to the polls. The terms _tomorrow_ and _voting_, coupled with _gotv_, seem to imply some urgency on the matter from Clinton's supporters.


* **"trump state model polls poll"**: With the word appearing twice, this topic seems tied to discussions about the _polls_. [Most polls were predicting a win for Clinton](https://www.realclearpolitics.com/epolls/latest_polls/president/) in the days preceding the election. The other three terms might be attributes linked to polls, such as discussions on which _states_ the _polls_ predicted _Trump_ would win, or discussions on which _model_ the _polls_ were built on, for example to question their credibility.

Although less easily interpretable than the topics from r/The_Donald, the topics from r/hillaryclinton allow us to picture the mindset of the former first lady's supporters. We get to see their point of view on the e-mail affair, and we can observe their anticipation of the approaching election day.

### Conclusion

LDA produced better results for r/The_Donald than for r/hillaryclinton; however, the number of subscribers of each subreddit needs to be taken into account. On November 7th, 2016, [r/The_Donald had 264,883 subscribers](https://web.archive.org/web/20161107233256/https://www.reddit.com/r/The_Donald/) while [r/hillaryclinton counted only 34,392 redditors](https://web.archive.org/web/20161107005937/reddit.com/r/hillaryclinton). This large gap in subscribers translated into a large gap in the number of comments considered by LDA for each side:

In [53]:
trump_comm_number = spark.read.load('../data/trump_lda_prepro.parquet/').count()
hillary_comm_number = spark.read.load('../data/hillary_lda_prepro.parquet/').count()
print('Number of comments the week before the election :')
print("In r/The_Donald : %d"%(trump_comm_number))
print("In r/hillaryclinton : %d"%(hillary_comm_number))

Number of comments the week before the election :
In r/The_Donald : 50489
In r/hillaryclinton : 6688


Clearly, Donald Trump had a bigger base of supporters on Reddit at the end of the election campaign.
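
To put the two gaps side by side, the ratios can be computed directly from the figures quoted above. This is a small sanity check added for illustration; the variable names are ours.

```python
# Figures quoted above: subscribers on 7 November 2016 and comments kept for LDA.
subscribers = {"The_Donald": 264883, "hillaryclinton": 34392}
comments = {"The_Donald": 50489, "hillaryclinton": 6688}

subscriber_ratio = subscribers["The_Donald"] / subscribers["hillaryclinton"]
comment_ratio = comments["The_Donald"] / comments["hillaryclinton"]

print("Subscriber ratio: %.2f" % subscriber_ratio)
print("Comment ratio: %.2f" % comment_ratio)
```

Both ratios fall between 7 and 8, so the difference in the number of comments fed to LDA closely mirrors the difference in subscribers.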

LDA produced more coherent topics in the case of r/The_Donald. This could indicate that Trump's supporters were more focused in their effort to win the future president even more voters. Indeed, the topics suggest that r/The_Donald was rather aggressive in its debates about the election, as most of its discussions revolved around discrediting Trump's opponent using conspiracy theories while at the same time glorifying its own candidate.

On the other side, Hillary's supporters seemed to act in a more passive way. They merely lamented the negative coverage of the e-mail scandal while discussing meta-concerns about the election, such as the approaching election day and the polls. Their interest in the polls may have been a form of self-reassurance that Clinton still had a chance of winning despite the various scandals.

Staying within the boundaries of our dataset, we cannot tell whether those behaviours on Reddit were also present in real-life discussions, nor whether they could help explain Donald Trump's election on the 8th of November 2016. However, our analysis shows how polarized each side was on Reddit, as the reckless coverage of the scandals on r/The_Donald contrasted sharply with the passive poll-based optimism of r/hillaryclinton, even though both communities had the same goal in mind: seeing their candidate get elected.