In [None]:
# Libraries for importing data from r/politics
from psaw import PushshiftAPI
import datetime
import pandas as pd
import os
import re
import numpy as np

# 1. Motivation

### The dataset

For this project, we chose to work with data from the [r/politics](https://www.reddit.com/r/politics/) subreddit, an online forum with 8 million members that is, according to the rules stated on the site, "for current and explicitly political U.S. news."

Visitors to r/politics will quickly notice that the majority of submissions are links to news articles published by news media sites such as CNN or the Huffington Post. The headlines of the linked articles are shown on r/politics as the titles of the submissions. Other users then comment on the linked article, and these comments constitute the actual user-generated content on the site.

We restricted our data extraction to submissions from r/politics that fulfilled the following criteria:
* contained either "Trump" or "Biden" in the title
* had received more than five comments
* had been published between October 1, 2020 and November 3, 2020, i.e. roughly the month leading up to the U.S. presidential election held on November 3, 2020.

The motivation for this time frame was the assumption that political discourse would intensify in this period, which would make it easier to infer the political convictions (either Democratic or Republican) of the active redditors from their posts.

In [None]:

api = PushshiftAPI()

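my_subreddit = "politics"
# "|" is the OR operator in Pushshift queries: match submissions mentioning either candidate
query = "Trump | Biden"

# Collection window as epoch seconds (UTC): 1 Oct 2020 up to, but not including, 3 Nov 2020.
# An explicit tzinfo avoids naive datetimes being interpreted in the machine's local time zone.
date1 = int(datetime.datetime(2020, 10, 1, tzinfo=datetime.timezone.utc).timestamp())
date2 = int(datetime.datetime(2020, 11, 3, tzinfo=datetime.timezone.utc).timestamp())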

gen = api.search_submissions(num_comments= '>5',
                             subreddit=my_subreddit, 
                             after=date1, 
                             before=date2, 
                             q=query
                            )
results = list(gen)

column_names = ['title', 'id', 'score', 'author', 'num_comments', 'url']

# Build one column per field from the raw submission dicts (psaw exposes them as `.d_`),
# indexed by each submission's creation time.
df = pd.DataFrame(
    {name: [submission.d_[name] for submission in results] for name in column_names},
    index=[submission.d_['created_utc'] for submission in results])
# Convert the epoch-second index to datetimes
df.index = pd.to_datetime(df.index, unit='s')
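
As a sanity check on the three selection criteria listed above, the downloaded frame can be re-filtered locally. The following is only a minimal sketch: the helper `check_criteria` and the toy rows are hypothetical and not part of the original pipeline, and the keyword match is assumed to be case-insensitive.

```python
import pandas as pd


def check_criteria(df, start, end, min_comments=5):
    """Return a boolean Series marking rows that meet all three selection criteria.

    Hypothetical helper: assumes a 'title' column, a 'num_comments' column
    and a datetime index, as in the frame built from the Pushshift results.
    """
    # Title mentions either candidate (case-insensitive match is an assumption)
    has_keyword = df["title"].str.contains(r"trump|biden", case=False, regex=True)
    # Strictly more than `min_comments` comments
    enough_comments = df["num_comments"] > min_comments
    # Published inside the half-open window [start, end)
    in_window = (df.index >= start) & (df.index < end)
    return has_keyword & enough_comments & in_window


# Made-up rows standing in for scraped submissions
toy = pd.DataFrame(
    {
        "title": ["Trump holds rally", "Biden leads in poll",
                  "Senate passes bill", "Trump tweets"],
        "num_comments": [12, 6, 40, 3],
    },
    index=pd.to_datetime(["2020-10-05", "2020-10-20", "2020-10-21", "2020-11-02"]),
)

mask = check_criteria(toy, pd.Timestamp("2020-10-01"), pd.Timestamp("2020-11-03"))
print(mask.tolist())
```

Applied to the real `df`, `df[~mask]` would list any submissions that slipped through the API query without meeting the criteria.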

### Why this dataset

# 2. Basic stats

# 3. Tools, theory and analysis

# 4. Discussion

### What is still missing?

### What could be improved?

It would undoubtedly have been interesting to investigate the political content on r/politics over a longer time period, e.g. the six months preceding election day, which would allow for the detection of longer-term trends in redditor activity and sentiment. Such a scope would, however, require more computational resources than the group members had at their disposal.