In [3]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import findspark
findspark.init()

import pyspark
from pyspark.sql import *
import pyspark.sql.functions as func
from pyspark.sql.types import *

%run insights.py
%run plot.py

# Create spark session
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## Topics during the 2016 election

It is widely recognized in the political sphere that the 2016 U.S. presidential election deeply divided U.S. citizens. As early as January 2016, [journalists](https://www.washingtonpost.com/politics/a-divided-country-gets-a-divisive-election/2016/01/09/591bfccc-b61f-11e5-a842-0feb51d1d124_story.html?noredirect=on&utm_term=.54c240a270e9) were already reporting that the rhetoric used by the various presidential candidates was unusually violent and aggressive, dividing the opinions of the American people. This division kept growing through the year as the presidential candidates were struck by various controversies.

Many subreddits serve as hubs where internet users voice their support for the candidate they want to win. This gives us a natural aggregation of the opinions of people supporting each side of the election. [r/The_Donald](https://www.reddit.com/r/The_Donald) and [r/hillaryclinton](https://www.reddit.com/r/hillaryclinton) were the main subreddits for Reddit users supporting the Republican and the Democratic candidate, respectively. To observe this divisiveness, we take a closer look at the most discussed topics in each community during the week preceding election day. However, such communities might suffer from an echo chamber effect, as highlighted earlier by the agreement factor of The_Donald being one of the highest among the subreddits. We therefore compare those topics with the ones discussed on r/news, which serves as a neutral ground of observation between the candidates' respective communities. The agreement factor of r/news is one of the lowest, which indicates that the echo chamber effect has less impact on this subreddit's threads.


### Methodology: LDA

Latent Dirichlet Allocation (LDA) is the unsupervised clustering method we chose to model the topics of Reddit discussions. Given a corpus of documents, LDA assumes that each document is produced by a mixture of a given number of topics. Using this algorithm, we can infer the subjects of the document collection.
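
For reference, the standard generative model behind LDA can be written as follows. The notation is an assumption of this note rather than something defined elsewhere in the notebook: $K$ is the number of topics, $D$ the number of documents, $\theta_d$ the topic mixture of document $d$, $\phi_k$ the word distribution of topic $k$, and $\alpha$, $\beta$ the two Dirichlet concentration parameters.

$$
\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
$$

Here $z_{d,n}$ is the topic assigned to the $n$-th word of document $d$ and $w_{d,n}$ is the word itself. Fitting LDA amounts to inferring $\theta$ and $\phi$ from the observed words only.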

In our case, we chose individual Reddit comments as the documents, since a comment is the smallest coherent piece of information in a discussion thread. Some classic natural language preprocessing has to be done for LDA to work properly: each comment was first tokenized, then stripped of English stop words, and finally lemmatized. Once this was applied to the dataset, we observed that more preprocessing was needed to remove idiosyncrasies of Reddit:

* Comments with a low score do not represent an opinion that was appreciated enough by the community. We decided to remove every comment whose score was lower than 10.

* Short answers in a thread do not bring much information to the discussion. We thus removed all comments whose character length was below 50. As a side effect, this step also removes comments which were deleted by their author or removed by moderators.

* Some subreddits deploy special programs called "bots" which can serve many purposes, whether it is automatic moderation or helping users with referencing. As the textual content of bot comments is mostly automatic, it does not give any information on the topic. If kept, bot comments even pollute the topics modelled by LDA, since a given bot will likely produce the same words each time it is invoked. So we removed comments obviously authored by a bot.

* Users often link to outside sources or to other Reddit comments. As URLs do not mix well with natural language processing, they had to be removed.
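
The filtering rules above can be sketched in plain Python as follows. This is a minimal illustration, not the code from `insights.py`: the field names (`score`, `body`, `author`), the bot heuristic, the URL pattern and the tiny stop-word list are all assumptions, and lemmatization is left out. Only the two thresholds (10 and 50) come from the text.

```python
import re

# Thresholds stated in the text.
MIN_SCORE = 10
MIN_LENGTH = 50

# Hypothetical, deliberately tiny stop-word list; a real run would use a full one.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and",
              "in", "that", "this", "it", "for", "on", "at"}

# Assumed patterns: http(s) or www links, and lowercase alphabetic tokens.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
TOKEN_PATTERN = re.compile(r"[a-z]+")


def looks_like_bot(author):
    """Crude heuristic (an assumption): AutoModerator or a name ending in 'bot'."""
    name = author.lower()
    return name == "automoderator" or name.endswith("bot")


def keep_comment(comment):
    """Apply the score, length and bot filters to one comment dict."""
    return (
        comment["score"] >= MIN_SCORE
        and len(comment["body"]) >= MIN_LENGTH
        and not looks_like_bot(comment["author"])
    )


def tokenize(body):
    """Strip URLs, lowercase, split into words and drop stop words."""
    text = URL_PATTERN.sub(" ", body.lower())
    return [tok for tok in TOKEN_PATTERN.findall(text) if tok not in STOP_WORDS]
```

Note that a body such as `[deleted]` is shorter than 50 characters, so the length filter discards it without any dedicated rule.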

LDA can then be applied to the collection of preprocessed Reddit comments. Besides the number of topics and the number of words per topic, the algorithm also takes two other parameters: the document concentration (alpha) and the topic concentration (beta).

The alpha parameter controls how many topics are mixed to produce a document: the higher alpha, the more topics each document draws from. In the context of Reddit comments, it is unlikely that a single comment mentions many topics, as it is a rather short piece of text and will generally focus on the subject of the thread. Thus a small value of alpha should be preferred.
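
The effect of alpha can be illustrated by sampling topic mixtures from a symmetric Dirichlet distribution with the standard library only. This sketch is not part of the analysis pipeline: the helper names are hypothetical, 8 topics and alpha = 0.025 are the values retained in this notebook, and alpha = 10 is an arbitrary large value chosen for contrast.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) over k topics."""
    while True:
        # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(draws)
        if total > 0:  # guard against floating-point underflow for tiny alpha
            return [d / total for d in draws]


def mean_dominant_share(alpha, k=8, n_docs=2000, seed=0):
    """Average weight of the single most important topic over simulated documents."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs))
    return sum(shares) / n_docs


sparse = mean_dominant_share(0.025)  # small alpha: one topic dominates
dense = mean_dominant_share(10.0)    # large alpha: topics are evenly mixed
print("alpha=0.025 -> mean share of dominant topic: %.2f" % sparse)
print("alpha=10    -> mean share of dominant topic: %.2f" % dense)
```

With a small alpha, almost all of a simulated document's weight falls on one topic, which matches the intuition that a short comment sticks to the subject of its thread.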

The beta parameter governs how much the topics are composed of the same words: the lower beta, the fewer words the topics share. In other words, a low value of beta leads LDA to produce topics which do not have many words in common.

Through various tests, we found that 8 topics of 5 words each, with a value of 0.1 for beta and 0.025 for alpha, produced the best results for our purpose.

## Topic analysis by subreddit

In [44]:
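def display_topics(topics):
    """Print the word list of each topic, numbered from 1."""
    # Only a handful of topics are stored, so collecting them on the driver is cheap.
    rows = topics.select('topic').collect()
    for i, row in enumerate(rows, start=1):
        print("Topic %d: %s" % (i, row[0]))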

#### r/The_Donald

In [41]:
trump_lda = spark.read.load('../data/trump_lda_result.parquet/')
display_topics(trump_lda)

T 1: fearless donald fucking last time
T 2: cooking spirit propaganda traitor comey
T 3: clinton hillary cuck email child
T 4: pedophile corrupt epstein know shit
T 5: trump energy take train vote
T 6: deleted weiner laptop archive watch
T 7: vote trump elite thread year
T 8: hero clinton podesta hillary rich


Although the topics are easily discernible, not all of them focus on a real-world issue or debate:

* "fearless donald fucking last time": Supporters of Donald Trump have often described the then-Republican candidate as "fearless", which explains the occurrence of the first two words. It is harder, however, to connect the last three words to this topic, as one swear word and two rather common words add little meaning to "fearless donald".
* "cooking spirit propaganda traitor comey": two topics are actually concatenated.

#### r/hillaryclinton

In [43]:
hillary_lda = spark.read.load('../data/hillary_lda_result.parquet/')
display_topics(hillary_lda)

T 1: russia russian suffrage student happening
T 2: poll polling lead trump ohio
T 3: believe trump vote hillary think
T 4: poll model school ground turnout
T 5: sign white guess powerful male
T 6: email vote story already whether
T 7: clinton hillary year email trump
T 8: early trump voting voter time
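
One simple way to start comparing the two communities is to measure how much vocabulary their topics share. The sketch below copies the topic word lists printed above and computes the words common to both subreddits, together with the Jaccard index of the two vocabularies. The helper names are illustrative and not part of `insights.py`, and word overlap is only a rough proxy for topical similarity.

```python
# Topic word lists copied from the outputs printed above.
trump_topics = [
    "fearless donald fucking last time",
    "cooking spirit propaganda traitor comey",
    "clinton hillary cuck email child",
    "pedophile corrupt epstein know shit",
    "trump energy take train vote",
    "deleted weiner laptop archive watch",
    "vote trump elite thread year",
    "hero clinton podesta hillary rich",
]
hillary_topics = [
    "russia russian suffrage student happening",
    "poll polling lead trump ohio",
    "believe trump vote hillary think",
    "poll model school ground turnout",
    "sign white guess powerful male",
    "email vote story already whether",
    "clinton hillary year email trump",
    "early trump voting voter time",
]


def topic_vocabulary(topics):
    """Set of distinct words appearing in any topic of a subreddit."""
    return {word for topic in topics for word in topic.split()}


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)


trump_vocab = topic_vocabulary(trump_topics)
hillary_vocab = topic_vocabulary(hillary_topics)
shared = trump_vocab & hillary_vocab

print("Shared words:", sorted(shared))
print("Jaccard index: %.3f" % jaccard(trump_vocab, hillary_vocab))
```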


#### r/news

In [None]:
In the case of news