In [62]:
# Libraries for collecting and analysing data from r/politics
from psaw import PushshiftAPI
import datetime
import pandas as pd
import os
import re
import numpy as np
from tqdm import tqdm
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# 1. Motivation

### The Reddit dataset

For this project, we chose to work with data from the [r/politics](https://www.reddit.com/r/politics/) subreddit, an online forum with 8 million members. According to the rules stated on the site, it is "for current and explicitly political U.S. news."

Visitors at r/politics will quickly notice that the majority of the submissions are by users posting links to news articles published on news media sites like CNN or The Huffington Post. The headlines of these linked articles are then shown on r/politics as the titles of the submissions. Other users can then comment on the linked article, which is what ultimately constitutes the actual user-generated content on the site. 

We restricted our data extraction to submissions from r/politics that met the following criteria: 
* __They contained "Trump" or "Biden" in the title.__ Submissions containing words and names other than "Trump" and "Biden" (e.g. "Republican" and "Democrat") might give equally good indications of the political convictions of redditors, but this textual query allowed us to limit the scope of the project while still extracting the data essential to its aim. 
* __They had received more than five comments.__ This requirement prevented us from downloading submissions with no comments section or only a very small one, as we use the comments for the later sentiment analysis and to produce a partitioning of the redditors. 
* __They had been published between October 1, 2020 and November 3, 2020.__ This period covers approximately the month leading up to the 2020 U.S. presidential election, held on November 3, 2020. Ideally, we would have covered several months leading up to election day in order to detect longer-term trends in the data. However, that proved computationally infeasible given the amount of data it would yield. 

__Submission variables__

The downloaded submissions are structured in a Pandas dataframe with the following variables for each submission, one per column: 
1. __time stamp index:__ When the submission was made.
2. __title:__ The title of the submission, usually the headline of the linked article. 
3. __id:__ A unique identifier for the submission. 
4. __author:__ The profile name of the author of the submission.
5. __num_comments:__ The number of comments received on the submission.
6. __url:__ The link stated in the submission.

__Comments variables__

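As stated, we download the associated comments section for every downloaded submission. Like the submissions, the comments are structured in a Pandas dataframe with the following variables for each comment, one per column:
1. __time stamp index:__ When the comment was posted.
2. __id:__ A unique identifier for the comment.
3. __link_id:__ The identifier of the submission the comment belongs to, prefixed with `t3_`.
4. __author:__ The profile name of the author of the comment.
5. __parent_id:__ The identifier of what the comment replies to: another comment (prefix `t1_`) or the submission itself (prefix `t3_`).
6. __body:__ The text of the comment.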
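
The `link_id` and `parent_id` fields use Reddit's prefixed identifiers, which is what later lets us tell top-level comments apart from replies. The sketch below illustrates the idea on two records whose identifiers are taken from our data; the helper names `split_fullname` and `is_top_level` are our own illustrative choices and not part of any library.

```python
def split_fullname(fullname):
    """Split a prefixed Reddit identifier such as 't3_jmybs3' into (kind, id)."""
    prefix, _, base_id = fullname.partition("_")
    kinds = {"t1": "comment", "t3": "submission"}
    return kinds.get(prefix, "unknown"), base_id


def is_top_level(comment):
    """A comment is top-level when its parent is the submission itself."""
    return comment["parent_id"] == comment["link_id"]


# Two example records: a reply to another comment and a top-level comment
reply = {"id": "gaykxxw", "link_id": "t3_jmybs3", "parent_id": "t1_gayg0dt"}
top = {"id": "gaycvmz", "link_id": "t3_jmybs3", "parent_id": "t3_jmybs3"}

print(split_fullname(reply["parent_id"]))  # kind and id of the parent
print(is_top_level(reply), is_top_level(top))
```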

__Reasons for choosing this particular data set__ <br>
1. Easy to collect using the Pushshift API.
2. Interesting topic that would fit the requirements of the project.
3. Similar format as the data we've previously worked with.

__Goal for the end user's experience__ <br>
Our goal is for the end users of our website to have a blast digesting the findings of our analyses, presented in a beautiful and thought-provoking way. 

Below, we show how we collected the data.

In [4]:
# Note: this is just a proof of concept (POC), so the download is capped at limit=100.
api = PushshiftAPI()

my_subreddit = "politics"
query = "Trump | Biden "

date1 = int(datetime.datetime(2020,10,1).timestamp())
date2 = int(datetime.datetime(2020,11,3).timestamp())

gen = api.search_submissions(num_comments='>5',
                             subreddit=my_subreddit,
                             after=date1,
                             before=date2,
                             q=query,
                             limit=100)
results = list(gen)
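
One caveat about the time window: calling `.timestamp()` on a naive `datetime` interprets it in the local timezone of the machine running the notebook, so the window can shift by a few hours between machines. A minimal sketch of pinning the boundaries to UTC instead, assuming UTC boundaries are what is wanted (the names `utc_epoch`, `date1_utc` and `date2_utc` are hypothetical and not used elsewhere in the notebook):

```python
import datetime


def utc_epoch(year, month, day):
    """Return the Unix timestamp of midnight UTC on the given date."""
    dt = datetime.datetime(year, month, day, tzinfo=datetime.timezone.utc)
    return int(dt.timestamp())


# Window boundaries that do not depend on the local timezone
date1_utc = utc_epoch(2020, 10, 1)
date2_utc = utc_epoch(2020, 11, 3)

print(date1_utc, date2_utc)
```

These values could be passed as `after` and `before` in place of `date1` and `date2`.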



In [49]:
column_names = ['title', 'id', 'author', 'num_comments', 'url']

# Build one column per field from the raw dictionaries (submission.d_),
# indexed by the creation time of each submission.
subs_df = pd.DataFrame(
    {name: [submission.d_[name] for submission in results] for name in column_names},
    index=[submission.d_['created_utc'] for submission in results])
subs_df.index = pd.to_datetime(subs_df.index, unit='s')

In [50]:
subs_df

Unnamed: 0,title,id,author,num_comments,url
2020-11-02 22:54:58,Trump ramps up Fauci attacks on eve of electio...,jmybs3,geoxol,33,https://thehill.com/homenews/administration/52...
2020-11-02 22:48:58,Trump Loves To Declare Victory Even if He Didn...,jmy7vu,Facerealityalready,16,https://www.motherjones.com/politics/2020/11/t...
2020-11-02 22:46:57,Trump creates 1776 Commission to promote 'patr...,jmy6j9,bluestblue,53,https://www.politico.com/news/2020/11/02/trump...
2020-11-02 22:42:26,Judge blocks Trump campaign challenge to Nevad...,jmy3hn,TrumpSharted,10,https://thehill.com/homenews/state-watch/52403...
2020-11-02 22:41:51,Trump boasts about newspaper endorsement that ...,jmy340,Zhana-Aul,23,https://www.independent.co.uk/news/world/ameri...
...,...,...,...,...,...
2020-11-02 19:18:22,How Trump and Barr’s October Surprise Went Bust,jmu50j,i-am-sancho,25,https://nymag.com/intelligencer/2020/11/durham...
2020-11-02 19:18:18,Who’s Giving to Trump and Biden? Top Donors by...,jmu4yn,ttkk1248,8,https://www.bloomberg.com/graphics/2020-electi...
2020-11-02 19:16:48,Eminem signals Biden support as campaign relea...,jmu410,wwabc,11,https://www.freep.com/story/entertainment/musi...
2020-11-02 19:14:18,Eminem Licenses ‘Lose Yourself’ for Biden-Harr...,jmu2e0,chanma50,661,https://variety.com/2020/music/news/eminem-lic...


To ensure that each submission is unambiguously related to exactly one of the candidates, we simply remove all submissions whose titles contain both "Trump" and "Biden". 

In [51]:
# Boolean masks marking the submissions whose title contains each name
has_trump = subs_df['title'].str.contains('Trump')
has_biden = subs_df['title'].str.contains('Biden')

# Drop the submissions whose title contains both "Trump" and "Biden"
subs_df = subs_df[~(has_trump & has_biden)]

Now we're ready to download the associated comments sections for each of the remaining submissions. 

In [18]:
comments = []
for link_id in tqdm(subs_df['id']):
    gen = api.search_comments(subreddit=my_subreddit,
                              link_id=link_id)
    comment_sec = list(gen)
    comments += comment_sec

100%|██████████| 82/82 [05:26<00:00,  3.98s/it]


In [53]:
column_names = ['id', 'link_id', 'author', 'parent_id', 'body']

# Build one column per field from the raw dictionaries (comment.d_),
# indexed by the creation time of each comment.
coms_df = pd.DataFrame(
    {name: [comment.d_[name] for comment in comments] for name in column_names},
    columns=column_names,
    index=[comment.d_['created_utc'] for comment in comments])

coms_df.index = pd.to_datetime(coms_df.index, unit='s')

Some of these comments were deleted after being posted, so we first clean the data by filtering out the comments whose author is "[deleted]".

In [55]:
coms_df = coms_df[coms_df.author != "[deleted]"]

Unnamed: 0,id,link_id,author,parent_id,body
2020-11-03 01:38:03,gaykxxw,t3_jmybs3,saint-cecelia,t1_gayg0dt,"He got on eminem, too? I can't keep up. \n\nI'..."
2020-11-03 00:52:13,gayg0dt,t3_jmybs3,NorweAmeriLove,t1_gay44pe,and Eminem!
2020-11-03 00:48:23,gayflfy,t3_jmybs3,ChickenNPisza,t1_gaye3oj,They are all saving face I bet. Contradict the...
2020-11-03 00:34:43,gaye3oj,t3_jmybs3,droplivefred,t1_gay339x,That his been his strategy since summer and hi...
2020-11-03 00:23:36,gaycvmz,t3_jmybs3,yyungpiss,t3_jmybs3,is there some sort of weird strategy to this o...
...,...,...,...,...,...
2020-11-02 19:24:57,gaxd8l3,t3_jmu2e0,Das_Man,t3_jmu2e0,"Whatever happens tomorrow, Biden's ad team des..."
2020-11-02 19:23:10,gaxd0ot,t3_jmu2e0,F6Pilot,t3_jmu2e0,"Quality move, Slim Shady!"
2020-11-02 19:18:47,gaxcgxe,t3_jmu2e0,BroadAsparagus,t3_jmu2e0,Not Afraid would also make a good campaign ad.
2020-11-02 19:18:03,gaxcdph,t3_jmu2e0,Shwetty_Morrow,t3_jmu2e0,"Yes.\n\nSo much yes.\n\nEpic win, much?\n\nAlt..."
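
Filtering on the author alone may leave some empty comments behind: as far as we understand Reddit's conventions, a comment taken down by moderators can keep its author while its text is replaced by "[removed]". The sketch below, run on a made-up toy dataframe, shows one way to filter on both columns. The helper `drop_removed` and the placeholder set are our own assumptions, not part of the pipeline above.

```python
import pandas as pd

# Placeholder strings assumed to mark deleted or removed content
PLACEHOLDERS = {"[deleted]", "[removed]"}


def drop_removed(df):
    """Keep rows whose author and body are both real content."""
    keep = ~df["author"].isin(PLACEHOLDERS) & ~df["body"].isin(PLACEHOLDERS)
    return df[keep]


# Toy data: one normal comment, one deleted, one removed by moderators
toy = pd.DataFrame({
    "author": ["alice", "[deleted]", "bob"],
    "body": ["Hello", "[deleted]", "[removed]"],
})
cleaned = drop_removed(toy)
print(cleaned)
```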


### Why this dataset

# 2. Basic stats

In [63]:
# Downloading the comments data set
url = 'https://raw.githubusercontent.com/JaQtae/SocInfo2022/FinalProject/Data/politics_comments_very_smol_fully_processed.csv'
coms_df = pd.read_csv(url,index_col=0,parse_dates=[0])

In [68]:
# Work with the first 1,000 comments only
coms_df = coms_df.iloc[:1000]

In [70]:
# Implementing the VADER sentiment analysis of the comments. 

#https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/

# Create the SentimentIntensityAnalyzer once and reuse it for every comment,
# since constructing it loads the lexicon each time.
sid_obj = SentimentIntensityAnalyzer()

def calculate_compound_sentiment_score(sentence):

    # polarity_scores returns a sentiment dictionary
    # containing the neg, neu, pos, and compound scores.
    sentiment_dict = sid_obj.polarity_scores(sentence)

    # The compound score sums the lexicon ratings of the words in the text
    # and normalizes the result to lie between -1 (most extreme negative)
    # and +1 (most extreme positive).
    return sentiment_dict['compound']
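
To make the compound score easier to read, it can be mapped to a coarse label. The sketch below uses the cut-offs commonly suggested in the VADER documentation (at least 0.05 for positive, at most -0.05 for negative). The function `label_sentiment` is our own illustrative helper, and the example scores are taken from the comments table in this notebook.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# A few compound scores observed in our comments
scores = [-0.7234, 0.7050, 0.1000, 0.0]
labels = [label_sentiment(score) for score in scores]
print(labels)
```

The threshold is a parameter so that a stricter definition of "neutral" can be tried without changing the function.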

In [71]:
tqdm.pandas()
coms_df["compound_sentiment_score"] = coms_df["body"].progress_apply(calculate_compound_sentiment_score)

100%|██████████| 1000/1000 [00:17<00:00, 58.73it/s]


In [84]:
len(coms_df["author"].unique())

869

In [102]:
coms_df

dates,id,link_id,score,author,parent_id,body,parent_author,politician,children,mentions_Trump,mentions_Biden,compound_sentiment_score
2020-11-03 00:48:23,gayflfy,t3_jmybs3,1,ChickenNPisza,t1_gaye3oj,They are all saving face I bet. Contradict the...,droplivefred,Trump,[],,,-0.7234
2020-11-02 23:25:58,gay6j16,t3_jmybs3,1,QueerlyTremendous,t1_gay46gx,In reality all trump had to do was get an hand...,LasersGirl,Trump,[],,,0.7050
2020-11-02 23:15:01,gay5bcm,t3_jmybs3,1,Venator850,t3_jmybs3,This really looks like Trump throwing. Last I ...,geoxol,Trump,[],True,,0.1000
2020-11-02 23:04:56,gay46gx,t3_jmybs3,1,LasersGirl,t3_jmybs3,He’s beside himself because Fauci’s approval r...,geoxol,Trump,['gay6j16'],,,0.4767
2020-11-02 22:59:42,gay3l02,t3_jmybs3,1,qx82717,t1_gay3b6d,"Alllright, partys getting started.",J_Class_Ford,Trump,[],,,0.0000
...,...,...,...,...,...,...,...,...,...,...,...,...
2020-11-03 03:22:17,gayw6qd,t3_jmxsgq,1,OhThatNick,t3_jmxsgq,"Ah, yes, the fear that only criminals have.",JohnBasedow_,Trump,[],,,-0.6369
2020-11-03 03:22:16,gayw6n3,t3_jmxsgq,1,shyvananana,t1_gayjf76,Agreed. That's why we have laws on the books n...,spinfip,Trump,[],,,0.4043
2020-11-03 03:22:12,gayw6el,t3_jmxsgq,1,UsurpUsername,t3_jmxsgq,Seriously. For what???,JohnBasedow_,Trump,[],,,-0.3049
2020-11-03 03:22:11,gayw6bw,t3_jmxsgq,1,BoxUpTheChildren,t3_jmxsgq,Ahh yes. Vanity fair. Good source. Good source...,JohnBasedow_,Trump,[],,,0.7906


In [125]:
group = coms_df.groupby("author")

author_df_V = group.apply(lambda x: [x["politician"].unique()] )

In [127]:
list(author_df_V)

[[array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], dtype=object)],
 [array(['Trump'], d

In [122]:
author_df_V

author
-BigMan39              [[How hunter biden used his father's name to z...
-Heart_of_Dankness-    [[Sometimes when you commit a crime for a long...
-eimaj-                [[2016: Orange (Trump) is the new Black (Obama...
0O00OO0O000O           [[Good points. I haven't read the book but I'v...
1nthenet               [[Best birthday gift if he does], [0.7964], [T...
                                             ...                        
z_funny182             [[Bulletproof vest ads on a bit], [0.0], [Trump]]
zajacdan                                [[Lock him up!], [0.0], [Trump]]
zapitron               [[Cohen has already been imprisoned, found gui...
zoodiedoodie           [[I remember 2-3 years ago listening to a podc...
ztimulating            [[His adult children will most likely be keepi...
Length: 869, dtype: object
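
A possible next step toward partitioning the redditors is to summarise each author by their average sentiment toward each politician. The sketch below shows this on a made-up toy dataframe that reuses our column names; the aggregation is an assumption about how the partitioning could be approached, not a result of the analysis.

```python
import pandas as pd

# Toy data with the same column names as coms_df
toy = pd.DataFrame({
    "author": ["a", "a", "b", "b"],
    "politician": ["Trump", "Biden", "Trump", "Trump"],
    "compound_sentiment_score": [-0.5, 0.5, 0.2, 0.4],
})

# One row per author, one column per politician, mean compound score as value.
# Authors who never commented on a politician get NaN in that column.
author_sentiment = (
    toy.groupby(["author", "politician"])["compound_sentiment_score"]
       .mean()
       .unstack("politician")
)
print(author_sentiment)
```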

# 3. Tools, theory and analysis

# 4. Discussion

### What is still missing?

### What could be improved?

It would undoubtedly have been interesting to investigate the political content on r/politics over a longer time period, e.g. the six months preceding election day, which would allow for the detection of longer-term trends in redditor activity and sentiment. For such a scope to be feasible, we would need more computational power than the group members had at their disposal. 