# Reddit r/politics/hot top comment analyses

Reddit is an extremely popular online forum, ranking #23 in internet traffic engagement (according to https://www.alexa.com/siteinfo/reddit.com). It is broken into subreddits, abbreviated as r/, where users can post links, images, videos, or simply text posts relevant to the particular subreddit. These posts are sometimes called threads. Other users may then reply to these threads with comments, and others can reply to those comments in turn, allowing for robust (occasionally), informative (sometimes), interesting (inane?), and diverse (hah!) conversation.

All users registered to Reddit may upvote or downvote any thread or comment on the site, creating a score for each item. This score is known as Karma. Reddit also tracks the total Karma for each user (the cumulative Karma earned from their posts and comments).

The purpose of this project is to analyze the top comments in trending threads from the r/politics subreddit, in the section known as hot/. Put all of this together, and you get the URL of the site we are scraping for data: https://www.reddit.com/r/politics/hot.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import colors
from matplotlib.ticker import PercentFormatter
from datetime import datetime, timedelta


#Reads dataframe from .csv
df = pd.read_csv('out.csv')
print(len(df))

#Changes columns back to DateTime, cleans out AutoModerator posts
#pd.to_datetime is used because astype with a unit-less 'datetime64' is rejected by pandas 2.x
for col in ['AccountDateTime', 'CommentDateTime', 'ThreadDateTime']:
    df[col] = pd.to_datetime(df[col])
print(df.dtypes)
df = df[df['Name'] != 'AutoModerator']

df.head()

### Analysis

Above is a snippet of the data we will be analyzing. I have been scraping data with RedditReader.ipynb and placing it into a .csv file for this section of the project. I chose this method for two reasons:

1) Data scraping can be tedious, as it involves making several requests to the Reddit API and parsing the responses.
2) It allows us to analyze data collected over several days of Reddit threads.
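
The collect-then-analyze workflow described above could be sketched roughly as follows. This is only an illustration, not the code in RedditReader.ipynb: `append_batch` is a hypothetical helper, the rows and the `out_demo.csv` file name are made up, and only the column names are taken from this notebook.

```python
import os
import tempfile

import pandas as pd

DATE_COLUMNS = ['AccountDateTime', 'CommentDateTime', 'ThreadDateTime']


def append_batch(rows, path):
    """Append one scraped batch to the CSV, writing the header only once.

    Hypothetical helper: the real scraper may store its data differently.
    """
    batch = pd.DataFrame(rows)
    write_header = not os.path.exists(path)
    batch.to_csv(path, mode='a', header=write_header, index=False)


# Made-up rows standing in for two scraping runs on different days.
day_one = [{'Name': 'user_a',
            'AccountDateTime': '2019-01-01 08:00:00',
            'CommentDateTime': '2020-06-01 12:05:00',
            'ThreadDateTime': '2020-06-01 12:00:00',
            'CommentKarma': 10}]
day_two = [{'Name': 'user_b',
            'AccountDateTime': '2020-05-20 09:30:00',
            'CommentDateTime': '2020-06-02 18:45:00',
            'ThreadDateTime': '2020-06-02 18:00:00',
            'CommentKarma': -3}]

csv_path = os.path.join(tempfile.mkdtemp(), 'out_demo.csv')
append_batch(day_one, csv_path)
append_batch(day_two, csv_path)

# parse_dates restores the datetime columns while reading the file back.
combined = pd.read_csv(csv_path, parse_dates=DATE_COLUMNS)
print(len(combined))
print(combined.dtypes)
```

Appending in batches keeps each scraping run short, while the analysis notebook can reload everything collected so far from the one file.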

This project will analyze the data in a couple of ways, looking at the Karma and creation times of comments, comment authors, and Reddit threads.

The end goal of these analyses is to attempt to ferret out accounts that are considered "troll" or "bot" accounts. These are accounts that are either automated to post ("bot") or run by people who intentionally post frequently to get a reaction ("troll"). This project is currently in the exploratory phase, however, and will draw no clear conclusions. We will merely be looking at trends in the collected data to see if anything of interest pops up.

In [None]:


df['TimeDifference'] = df['CommentDateTime'] - df['ThreadDateTime']
sortByPostSpeed = df.sort_values(by = ['TimeDifference'])
print(len(sortByPostSpeed))
sortByPostSpeed.head(100)

In [None]:

today = datetime.now()
start_date = today - timedelta(days = 30)
end_date = today

sortdf = df.sort_values(by=['AccountDateTime'])



Below, we look at comments sorted by post speed. Comments to the left were posted the soonest after their thread was posted. Unsurprisingly, these comments show the most variance. I have color-coded the scatter plot by earned comment karma.

In [None]:
sortByPostSpeed['TotalMinutes'] = sortByPostSpeed['TimeDifference'] / pd.to_timedelta(1, unit = 'm')

FastComments = sortByPostSpeed.plot.scatter(x='TotalMinutes', y='AdjustedKarmaPercent', c = 'CommentKarma',
                                            colormap='viridis')
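
To put a number on the variance seen in the scatter plot, one option is to bucket comments by how long after the thread they were posted and compare the spread within each bucket. The sketch below uses made-up values and hypothetical bucket edges; only the column names `TotalMinutes` and `AdjustedKarmaPercent` come from the cell above.

```python
import pandas as pd

# Made-up values; the column names follow the cells above.
demo = pd.DataFrame({
    'TotalMinutes': [1, 3, 4, 20, 25, 50, 55, 58],
    'AdjustedKarmaPercent': [0.0, 40.0, 5.0, 10.0, 14.0, 8.0, 9.0, 10.0],
})

# Hypothetical bucket edges in minutes; adjust to the range of the real data.
edges = [0, 5, 30, 60]
demo['Bucket'] = pd.cut(demo['TotalMinutes'], bins=edges)

# Sample variance and row count of AdjustedKarmaPercent within each bucket.
spread = (demo.groupby('Bucket', observed=True)['AdjustedKarmaPercent']
              .agg(['var', 'count']))
print(spread)
```

Reporting the count next to the variance matters here, because a bucket with only a handful of comments can look highly variable by chance.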

This is a quick look at the shape of comments from accounts created within the last 30 days, comparing their comment Karma to their account creation date.

In [None]:
condition = (sortdf['AccountDateTime'] >= start_date) & (sortdf['AccountDateTime'] <= end_date)
newAccounts = sortdf.loc[condition]
print(len(newAccounts))
newAccounts.head(100)

NewAccounts = newAccounts.plot.scatter(x='AccountDateTime', y='CommentKarma', c='AdjustedKarmaPercent', 
                                       logy = True, colormap='viridis')
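
Because `start_date` is derived from `datetime.now()`, the 30-day window moves every time the notebook is rerun, and older scraped rows eventually fall out of it. One possible alternative, sketched here with made-up rows, is to measure each account's age at the moment it commented. The `AccountAgeDays` column and the 30-day cutoff are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Made-up rows; the column names follow the cells above.
demo = pd.DataFrame({
    'Name': ['old_timer', 'newcomer', 'brand_new'],
    'AccountDateTime': pd.to_datetime(['2015-03-01', '2020-05-20', '2020-05-31']),
    'CommentDateTime': pd.to_datetime(['2020-06-01', '2020-06-01', '2020-06-01']),
})

# Age of the account when the comment was written, in whole days.
demo['AccountAgeDays'] = (demo['CommentDateTime'] - demo['AccountDateTime']).dt.days

# Keep accounts younger than the (assumed) 30-day cutoff.
young = demo[demo['AccountAgeDays'] < 30]
print(young[['Name', 'AccountAgeDays']])
```

Measured this way, the result depends only on the scraped data, so rerunning the notebook weeks later selects the same rows.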

Here we can see all accounts that started the thread with negative karma. These are accounts I would initially expect to be troll or bot accounts, as maintaining a negative karma on Reddit... can take some work. That, or one REALLY downvoted post. I'll print a few URLs of the comments in order to see what kind of comments they are.

WARNING: This is done in a live Reddit environment, and I have no control over the data. Hateful, NSFW, or outright illegal comments are a real possibility. I assume no responsibility for the results. FOLLOW THESE LINKS AT YOUR OWN RISK!!!

In [None]:
pd.set_option('display.max_colwidth', None)
condition = (sortByPostSpeed['AdjustedKarma'] < 0)


negativeKarma = sortByPostSpeed.loc[condition]
print(len(negativeKarma))
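#Caps the sample size so the cell does not fail when fewer than 5 rows match
sample = negativeKarma.sample(n = min(5, len(negativeKarma)))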
sample['CommentURL']

The above query often results in toxic comments. This is to be expected from accounts with a low Karma rating. Let's look at a few of the highest Karma comments; hopefully they will provide more uplifting results.

In [None]:
largestdf = df.nlargest(5, 'CommentKarma')
largestdf.head()
largestdf['CommentURL']

## Conclusion

I have run the previous two cells only a limited number of times, but both sets of results seem to share the same qualities:

1) short comments
2) pointed (borderline toxic) comments
    
It is hard to identify any of these accounts as "bots" (which was my original intent), but I think the results speak very clearly to the nature of communication on the internet. "Hot takes", or short, often derisive comments, seem to be king at the upper and lower ends of the spectrum. I will need to do further research to suss out which accounts are "bots", but it seems a lot of them would be considered "troll" accounts. I have been looking at their post histories, and in general these accounts seem to post short, pointed comments.
It could be that this is the nature of communication on the internet: it takes a lot more work to post a long, thought-out comment, and people may simply do that less on average. It could also be that the accounts at the upper and lower ends of the spectrum have more nefarious aims than communication.
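
One way to check the "short comments" impression more systematically would be to compare comment lengths at the extremes against the rest. The sketch below assumes a hypothetical `CommentBody` text column, which the data shown in this notebook may not include, and it uses made-up rows and arbitrary Karma thresholds, so its output proves nothing on its own.

```python
import pandas as pd

# Made-up rows; CommentBody is a hypothetical column, not one shown above.
demo = pd.DataFrame({
    'CommentKarma': [-40, -5, 3, 12, 900, 2500],
    'CommentBody': [
        'No.',
        'Wrong again.',
        'Here is a longer reply that walks through the bill section by section.',
        'I read the full report and the summary leaves out some context.',
        'Called it.',
        'Big if true.',
    ],
})

# Word count as a rough measure of comment length.
demo['WordCount'] = demo['CommentBody'].str.split().str.len()

# Label the lowest and highest scoring comments; the thresholds are arbitrary.
demo['Group'] = 'middle'
demo.loc[demo['CommentKarma'] < 0, 'Group'] = 'negative'
demo.loc[demo['CommentKarma'] >= 500, 'Group'] = 'top'

# Median length per group, which is less sensitive to a few very long comments.
median_words = demo.groupby('Group')['WordCount'].median()
print(median_words)
```

On real data, a median for the "negative" and "top" groups well below the "middle" group would support the observation, while similar medians would suggest the impression came from a small sample.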

I do not know if this is an answer I am capable of finding, but I hope to do future work on this project with the intent of making more scientific analyses of the information at hand.