Skip to content
scrape and analyze reddit comments
R
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
code
results
top200s
user_tables
.gitignore
README.html
README.md
talking_trump.Rproj

README.md

talking_trump

This repo contains everything you need to scrape reddit comments and find the most distinctive set of words that one community uses relative to another.

However, you will have to collect your own sets of usernames from the community of interest and reddit more broadly.

By convention, I call the community of interest TD and the rest of reddit nonTD, since I use this to analyze r/the_donald. But this code can be adapted for any subreddit, or group of subreddits, depending on how you choose your samples.

Using this repo

Start by running the functions in code/scraping_functions.R. Then use get_comment_history.R to scrape comment history from the users you specify. This file calls the functions defined in scraping_functions. Note that get_comment_history.R will save a .csv file for each user (files tend to be around 50-100 kB). You may want to start small.

analyze_comments.R takes the set of files you created for each user and turns them into two dataframes (one for each subsample). The dataframes contain every word ever used by any user, the number of instances of each word, and the number of unique users who used each word.

make_chatterplot.R combines these two dataframes, calculates a usage ratio for each word, and produces a chatterplot of the most distinctive words found in one community.

Usage score is calculated as (word uses + 1)/(total uses of any word) * (unique users of the word + 1)/(total users in subsample). This measure weights usage more heavily if a word is used by many users, rather than by a few users many times. Then ratio of usage scores is taken to find the most distinctive words in one subsample (TD) relative to the other (nonTD).

Sources and code I adapted

Scraping reddit: dmarx's github, https://gist.github.com/dmarx/8140428

Text analysis: Text mining with R by Julia Silge and David Robinson, https://www.tidytextmining.com/tidytext.html

Chatterplots: Daniel McNichols' Toward Data Science blog, https://towardsdatascience.com/rip-wordclouds-long-live-chatterplots-e76a76896098

You can’t perform that action at this time.