In [3]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import findspark
findspark.init()

import pyspark
from pyspark.sql import *
import pyspark.sql.functions as func
from pyspark.sql.types import *

%run insights.py
%run plot.py

# Create spark session
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## Topics during the 2016 election

It is widely recognized in the political sphere that the 2016 U.S. presidential election deeply divided U.S. citizens. As early as January 2016, [journalists](https://www.washingtonpost.com/politics/a-divided-country-gets-a-divisive-election/2016/01/09/591bfccc-b61f-11e5-a842-0feb51d1d124_story.html?noredirect=on&utm_term=.54c240a270e9) reported that the rhetoric used by the various presidential candidates was unusually violent and aggressive, dividing the opinions of the American people. This division kept growing throughout the year as the candidates were struck by various controversies.

Many subreddits serve as hubs where internet users voice their support for the candidate they wish would win. This provides a natural aggregation of the opinions of people supporting each side of the election. [r/The_Donald](https://www.reddit.com/r/The_Donald) and [r/hillaryclinton](https://www.reddit.com/r/hillaryclinton) were the main subreddits for Reddit users supporting the Republican and the Democratic candidate, respectively. To observe this divisiveness, we chose to take a closer look at the most discussed topics in each community during the week preceding election day. However, such communities may suffer from an echo chamber effect, as highlighted earlier by the agreement factor of The_Donald being one of the highest among subreddits. We therefore compare those topics with the ones discussed on r/news, which serves as a neutral ground of observation between the candidates' respective communities. The agreement factor of r/news is one of the lowest, which indicates that the echo chamber effect has a lesser impact on this subreddit's threads.


### Methodology: LDA

Latent Dirichlet Allocation (LDA) is the unsupervised method we chose to model the topics of Reddit discussions. Given a corpus of documents, LDA assumes that each document is produced by a mixture of a fixed number of topics. Using this algorithm, we can infer the subjects covered by the document collection.

In our case, we chose individual Reddit comments as the documents, since a comment is the smallest coherent piece of information in a discussion thread. Some classic natural language preprocessing is required for LDA to work properly: each comment was first tokenized, then stripped of English stop words, and finally lemmatized. Once this was applied to the dataset, we observed that more preprocessing was needed to remove idiosyncrasies of Reddit:

* Comments with a low score do not represent opinions that were appreciated enough by the community. We decided to remove every comment whose score was lower than 10.

* Short answers in a thread do not bring much information to the discussion. We thus removed all comments whose character length was below 50. As a side effect, this step also removes comments which were deleted by their author or removed by moderators.

* Some subreddits deploy special programs called "bots" which can serve many purposes, whether automatic moderation or helping users with references. As the textual content of bot comments is mostly automatic, it does not give any information on the topic. Bot comments even pollute the topics modelled by LDA if they are kept, since a given bot will likely produce the same words each time it is invoked. So we removed comments obviously authored by a bot.

* Users often link to outside sources or to other Reddit comments. As URLs do not mix well with natural language processing, they had to be removed.
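The Reddit-specific filters listed above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the field names (`score`, `body`, `author`), the bot heuristics, and the URL patterns are hypothetical, since the notebook does not show how these steps were actually implemented. Only the two thresholds (score of 10, length of 50) come from the text.

```python
import re

MIN_SCORE = 10    # threshold stated in the text
MIN_LENGTH = 50   # minimum character length stated in the text

# Hypothetical heuristics: the text does not say how bots were detected.
BOT_AUTHORS = {"AutoModerator"}
BOT_SIGNATURE = re.compile(r"\bI am a bot\b", re.IGNORECASE)

# Bare URLs, and markdown links of the form [text](url).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MARKDOWN_LINK = re.compile(r"\[([^\]]*)\]\((?:https?://|www\.)[^)\s]*\)")


def strip_urls(text):
    """Remove URLs while keeping the visible text of markdown links."""
    text = MARKDOWN_LINK.sub(r"\1", text)
    text = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def is_bot(author, body):
    """Crude bot check: known bot account or a typical bot signature."""
    return author in BOT_AUTHORS or bool(BOT_SIGNATURE.search(body))


def keep_comment(comment):
    """Apply the score, length and bot filters to one comment dict."""
    return (comment["score"] >= MIN_SCORE
            and len(comment["body"]) >= MIN_LENGTH
            and not is_bot(comment["author"], comment["body"]))


def preprocess(comments):
    """Return the URL-free bodies of the comments that pass every filter."""
    return [strip_urls(c["body"]) for c in comments if keep_comment(c)]
```

In this sketch the length filter runs on the raw body, before URLs are stripped; whether the original pipeline applied it before or after URL removal is not stated. In a Spark pipeline the same predicates could be expressed as DataFrame filters or wrapped in user-defined functions.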

LDA can then be applied to the collection of preprocessed Reddit comments. Besides the number of topics and the number of words per topic, the algorithm takes two other parameters: the document concentration (alpha) and the topic concentration (beta).

The alpha parameter controls how many topics are mixed to produce a document: the larger alpha is, the more topics each document tends to contain. In the context of Reddit comments, it is unlikely that a single comment mentions many topics, as a comment is a rather short piece of text and will generally focus on the subject of the thread. Thus a small value of alpha should be preferred.
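To make the effect of alpha more concrete, the sketch below samples per-document topic mixtures from a symmetric Dirichlet prior, which is the prior LDA places on those mixtures, and reports the average weight of the dominant topic. The function names, the sample size, and the comparison values 1.0 and 5.0 are illustrative choices, not part of the original analysis.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) over k topics."""
    while True:
        # A Dirichlet sample is a vector of normalized Gamma(alpha, 1) draws.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(draws)
        if total > 0:  # guard against underflow when alpha is very small
            return [d / total for d in draws]


def mean_dominant_weight(alpha, k=8, n=2000, seed=0):
    """Average share of the largest topic over n sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n


for alpha in (0.025, 1.0, 5.0):
    print("alpha=%.3f -> mean weight of dominant topic: %.2f"
          % (alpha, mean_dominant_weight(alpha)))
```

With a small alpha, almost all of the weight of a sampled document falls on a single topic, which matches the intuition that a short comment usually sticks to one subject. With a large alpha, the weight is spread over many topics.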

The beta parameter controls how much the topics are composed of the same words. A low value of beta pushes LDA to produce topics which do not share many identical words.

Through various tests, we found that 8 topics of 5 words each, with a value of 0.1 for beta and 0.025 for alpha, produced the best results for our purpose.
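As a rough sketch, these settings could be passed to the LDA estimator of `pyspark.ml` as shown below. This assumes that the 0.1 value refers to beta, that the comments have already been turned into a `features` column of term counts (for example with a `CountVectorizer`), and that `vocabulary` is the matching list of words. The helper `fit_topics`, the `seed`, and the choice to keep alpha fixed are hypothetical and not taken from the notebook.

```python
# Hyperparameters reported in the text. Their mapping onto pyspark.ml's
# parameter names is an assumption, not the notebook's confirmed code.
LDA_PARAMS = {
    "k": 8,                             # number of topics
    "docConcentration": [0.025],        # alpha, replicated over the k topics
    "topicConcentration": 0.1,          # beta (assumed reading of the text)
    "optimizeDocConcentration": False,  # keep alpha fixed during training
    "seed": 42,                         # hypothetical, for reproducibility
}
WORDS_PER_TOPIC = 5


def fit_topics(features_df, vocabulary):
    """Fit LDA on a DataFrame with a 'features' column of term counts and
    return the top words of each topic as space-separated strings."""
    # Imported lazily so the constants above can be used without Spark.
    from pyspark.ml.clustering import LDA

    model = LDA(featuresCol="features", **LDA_PARAMS).fit(features_df)
    rows = model.describeTopics(maxTermsPerTopic=WORDS_PER_TOPIC).collect()
    return [" ".join(vocabulary[i] for i in row["termIndices"])
            for row in rows]
```

The strings returned by this helper have the same shape as the `topic` column read from the parquet files in the next section, but the actual job that produced those files is not shown here.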

## Topic analysis by subreddits

In [44]:
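def display_topics(topics):
    """Print the words of each topic, numbered from 1."""
    # collect() brings the (small) topic table back to the driver
    rows = topics.select('topic').collect()
    for i, row in enumerate(rows, start=1):
        print("Topic %d: %s" % (i, row[0]))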

#### r/The_Donald

In [41]:
trump_lda = spark.read.load('../data/trump_lda_result.parquet/')
display_topics(trump_lda)

T 1: fearless donald fucking last time
T 2: cooking spirit propaganda traitor comey
T 3: clinton hillary cuck email child
T 4: pedophile corrupt epstein know shit
T 5: trump energy take train vote
T 6: deleted weiner laptop archive watch
T 7: vote trump elite thread year
T 8: hero clinton podesta hillary rich


Upon quick inspection, what was discussed the week before the election seems to be a rather erratic aggregation of diverse words. Each topic mentions at least one person, but some terms seem to have been picked at random. However, an in-depth analysis of the topics reveals some interesting insights into what redditors supporting Donald Trump discussed:

* **"fearless donald fucking last time"**: Supporters of Donald Trump have often described the then Republican candidate as _fearless_, which explains the occurrence of the first two words. It is harder, however, to connect the last three words to this topic, as one swear word and two rather common terms do not add much meaning to the _fearless donald_ revered by The_Donald's users.



* **"cooking spirit propaganda traitor comey"**: Two subtopics are actually present within this list of words. "_cooking spirit_" refers to an e-mail invitation sent to John Podesta, the chairman of Hillary Clinton's 2016 campaign, to take part in a "Spirit Cooking" dinner organised by Marina Abramović. This information came from the leaked e-mails of Hillary Clinton and her associates. [This dinner was soon exaggerated by conspiracy theorists as having satanic content](https://en.wikipedia.org/wiki/Marina_Abramović#Controversy). On the other hand, "propaganda traitor comey" seems to refer to the then FBI director James Comey, who oversaw the investigation into Hillary Clinton's handling of her e-mails. At the end of the investigation, Comey concluded that Clinton's actions did not warrant criminal charges. As Trump was fond of the idea of locking Hillary up over the e-mail scandal, Comey's decision might have been perceived by r/The_Donald as treacherous or as Democrat-led propaganda. It is possible that redditors linked the two events together, as the spirit cooking scandal was revealed through the e-mail leak affair. This would explain why LDA aggregated them into one topic.



* **"clinton hillary cuck email child"**: While the _clinton hillary email_ part of the topic is easily interpretable, _child_ might be a reference to [the Pizzagate conspiracy theory](https://en.wikipedia.org/wiki/Pizzagate_conspiracy_theory), which posited that the leaked e-mails revealed clues about a child-sex trafficking ring established in Washington D.C. Lastly, "cuck" does not seem to be linked directly to Pizzagate or the e-mails, and could be a byproduct of the common vocabulary used in discussions on r/The_Donald.



* **"pedophile corrupt epstein know shit"**: The named entity of the topic this time is someone named _Epstein_. Considering the other words, this almost certainly refers to Jeffrey Epstein, a billionaire and former major donor to the Democratic Party who was [sentenced to jail in 2008 for soliciting prostitution from a minor](https://en.wikipedia.org/wiki/Jeffrey_Epstein#Criminal_proceedings). Rephrasing the topic a bit would give us "Pedophile and corrupt Jeffrey Epstein knows shit". It is possible that this topic stems from discussions theorizing that Epstein was linked to the Pizzagate conspiracy.



* **"trump energy take train vote"**: The _trump train_ is a metaphor used by Trump's supporters which represents the [swarm of great ideas](https://www.urbandictionary.com/define.php?term=The%20Trump%20Train) the New York businessman came up with during his campaign. To _take_ such a train would mean to be on board with Trump's ideas and to be willing to _vote_ for him.

* **"deleted weiner laptop archive watch"**: As with Jeffrey Epstein, the mention of someone named _weiner_ helps us understand what those five words mean together. Anthony Weiner is a former U.S. congressman whose career came to an end after [scandals surrounding sexual photos he sent to women on Twitter](https://en.wikipedia.org/wiki/Anthony_Weiner#Sexting_scandals). He was the husband of Huma Abedin, who was the Vice Chair of Hillary's 2016 campaign. Weiner [_deleted_ his Twitter account after one of his sexting scandals was exposed](https://nypost.com/2016/08/29/anthony-weiner-deletes-twitter-amid-new-sexting-scandal/). As a result of this scandal, his _laptop_ was seized by the FBI, and _archived_ e-mails found on the computer led James Comey to reopen the investigation into Hillary Clinton's e-mails eleven days before the election.

* **"vote trump elite thread year"**: _vote trump elite_ might constitute an appeal to authority used by Trump supporters to convince other people to vote for the New York businessman. "The elite vote for Trump" might be the formulation used, which would imply that someone voting for Donald Trump is part of the _elite_.

* **"hero clinton podesta hillary rich"**: John _Podesta_ is mentioned with _Hillary Clinton_; however, the terms _hero_ and _rich_ do not give a precise clue as to how the campaign chairman and the Democratic candidate were discussed.


In conclusion, Trump supporters' discussions seemed to focus on two main axes the week preceding the election: scandals and conspiracies surrounding Trump's opponent, and the glorification of Trump and his actions.


#### r/hillaryclinton

In [48]:
hillary_lda = spark.read.load('../data/hillary_lda_result.parquet/')
display_topics(hillary_lda)

Topic 1: vote believe trump think poll
Topic 2: live cosponsored child cosponsor drop
Topic 3: comey russia move president year
Topic 4: clinton know trump hillary change
Topic 5: comment turnout reddit part home
Topic 6: school do ever apparently make
Topic 7: michigan cub county team broward
Topic 8: email trump model clinton poll


In [None]:

The same initial observation as for the r/The_Donald topics can be made, 

#### r/news

In [None]:
In the case of news

In [None]:
news_lda = spark.read.load('../data/news_lda_result.parquet/')
display_topics(news_lda)