# Language of Reddit
This project looks at commonly used words in specific subreddits. Word frequencies are compared to a baseline derived from the $20$ most popular subreddits. The most popular subreddits attract a broad audience rather than a subculture, so their language is assumed to be the most neutral.

Also note that I extract only top-level comments and no replies, for three reasons:
1. Subreddits often enforce stricter rules for top-level comments, so top-level and lower-level comments differ inherently in their language. For example, top-level comments in AskReddit threads are stories or answers to the question being asked, whereas replies to them are reactions.
2. The number of replies a comment gets depends a great deal on how early it is made. Additionally, replies in conversations tend to repeat language. Consequently, including replies would introduce a temporal bias towards early comments.
3. Reddit is huge, and scraping all comments in a thread would take impossibly long and put an unethical strain on the Reddit servers. As diversity in language is more important than sampling depth, I want to capture a few comments in many threads rather than many comments in a few threads.
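
The distinction between top-level comments and replies can be sketched on made-up data. Reddit identifies a comment's parent by a prefixed name: `t3_` marks a thread and `t1_` marks another comment, so a top-level comment is one whose parent is the thread itself. The dictionaries and the helper `is_top_level` below are hypothetical and only illustrate the idea; the notebook itself gets top-level comments by iterating over `thread.comments`.

```python
def is_top_level(comment):
    """A comment is top-level when its parent is the thread (prefix t3_)."""
    return comment["parent_id"].startswith("t3_")

# Made-up sample: two answers to a thread and one reply to the first answer
sample = [
    {"id": "c1", "parent_id": "t3_abc", "body": "An answer to the question."},
    {"id": "c2", "parent_id": "t1_c1", "body": "A reaction to the answer."},
    {"id": "c3", "parent_id": "t3_abc", "body": "Another answer."},
]

# Keep only the comments that respond to the thread directly
top_level_ids = [c["id"] for c in sample if is_top_level(c)]
print(top_level_ids)
```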

## Getting data with the API and storing it in a database

In [140]:
import praw
import sqlite3
import _auth
reddit = _auth.get_api()

In [159]:
con = sqlite3.connect("reddit_comments.sqlite3")
cur = con.cursor()
cur.execute("DROP TABLE IF EXISTS comments;")
cur.execute("""
CREATE TABLE IF NOT EXISTS comments (
    subreddit TEXT,
    thread_url TEXT, 
    thread_title TEXT, 
    thread_id TEXT, 
    comment_id TEXT PRIMARY KEY,
    comment_body TEXT);""")
con.commit()
con.close()
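
The primary key on `comment_id` is what lets repeated mining runs skip comments that are already stored. A minimal sketch of that behaviour, using an in-memory database and a made-up row (the values are placeholders, not real Reddit data):

```python
import sqlite3

# In-memory database so the sketch leaves no file behind
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""
CREATE TABLE comments (
    subreddit TEXT, thread_url TEXT, thread_title TEXT, thread_id TEXT,
    comment_id TEXT PRIMARY KEY, comment_body TEXT);""")

sql = "INSERT OR IGNORE INTO comments VALUES (?, ?, ?, ?, ?, ?);"
row = ("/r/example/", "https://example.invalid/t", "A title", "t3_abc", "t1_c1", "Hello")

cur.execute(sql, row)
cur.execute(sql, row)  # same comment_id again: ignored instead of raising an error

n_rows = cur.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
con.close()
print(n_rows)
```

With a plain `INSERT`, the second statement would raise `sqlite3.IntegrityError` because of the duplicate key.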

In [177]:
def mine_subreddit(subreddit, db="reddit_comments.sqlite3", threads=50, ignore_existing=False):
    """
    Looks at the top threads of all time in a given subreddit, mines the top-level comments,
    and adds them to a database. This function does nothing if the subreddit already
    exists in the database unless ignore_existing is True. In that case, it does not
    delete any entries and only adds comments the database doesn't know yet.
    """
    
    con = sqlite3.connect(db)
    cur = con.cursor()
    
    if subreddit.url in [ii[0] for ii in cur.execute(
        "select distinct subreddit from comments").fetchall()] and not ignore_existing:
        con.close()
        return None
    
    for thread in subreddit.top(limit=threads):
        for comment in thread.comments:
            try:
                sql = "INSERT OR IGNORE INTO comments \
                    (subreddit, thread_url, thread_title, \
                    thread_id, comment_id, comment_body) \
                    VALUES (?, ?, ?, ?, ?, ?);"
                values = (
                    subreddit.url, thread.url, thread.title, 
                    thread.name, comment.name, comment.body)
                cur.execute(sql, values)
                con.commit()
            except AttributeError:
                # MoreComments placeholders have no body; skip them
                pass
    
    con.close()
    return None

def get_comments_from_subreddits(subreddits, db="reddit_comments.sqlite3"):
    """
    Returns all comments belonging to the given subreddits
    """
    
    con = sqlite3.connect(db)
    cur = con.cursor()
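    # Bind the subreddit names as parameters rather than formatting them into the SQL
    placeholders = ",".join("?" for _ in subreddits)
    comments = cur.execute(
        "SELECT * FROM comments WHERE subreddit IN ({})".format(placeholders),
        list(subreddits)).fetchall()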
    con.close()
    return comments
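
Selecting rows for a list of subreddits can be done with one bound parameter per name, which keeps names containing quotes from breaking the query. The sketch below assumes a reduced three-column table and a hypothetical helper `select_by_subreddit`; the rows are made up.

```python
import sqlite3

def select_by_subreddit(con, subreddits):
    """Return the comment bodies of all rows whose subreddit is in the list."""
    # One "?" per subreddit; sqlite3 fills in the values safely
    placeholders = ",".join("?" for _ in subreddits)
    query = "SELECT comment_body FROM comments WHERE subreddit IN ({})".format(placeholders)
    return [row[0] for row in con.execute(query, list(subreddits))]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE comments (subreddit TEXT, comment_id TEXT PRIMARY KEY, comment_body TEXT)")
con.executemany("INSERT INTO comments VALUES (?, ?, ?)", [
    ("/r/a/", "t1_1", "first"),
    ("/r/b/", "t1_2", "second"),
    ("/r/o'brien/", "t1_3", "third"),  # a quote in the name is harmless here
])

bodies = select_by_subreddit(con, ["/r/a/", "/r/o'brien/"])
con.close()
print(sorted(bodies))
```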

### Baseline: popular subreddits
The choice lies between "default" and "popular" subreddits. Because the default subreddits are chosen by Reddit administrators, they may reflect inherent biases and corporate policies. To avoid this, I use the subreddits that are popular by user participation.

I loop through all of these subreddits and extract top-level comments from the top $50$ threads of all time of each subreddit (resulting in a total of $1\,000$ threads). Reddit's API limitation means that

In [190]:
baseline_subreddits = []
for subreddit in reddit.subreddits.popular(limit=20):
    baseline_subreddits.append(subreddit.url)
    mine_subreddit(subreddit)

### Extracting word frequencies

In [194]:
subreddits = baseline_subreddits
comments = [entry[-1] for entry in get_comments_from_subreddits(subreddits)]
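
One way the comparison against the baseline might look is sketched below. It is an assumption, not necessarily the method used in this project: the tokenizer (lowercase runs of letters and apostrophes), the helper `relative_frequencies`, and the ratio of relative frequencies are all illustrative choices, and the comment strings are made up.

```python
import re
from collections import Counter

def relative_frequencies(comments):
    """Map each word to its share of all words in the given comment bodies."""
    words = []
    for body in comments:
        # Assumed tokenizer: lowercase runs of letters and apostrophes
        words.extend(re.findall(r"[a-z']+", body.lower()))
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Made-up stand-ins for the baseline comments and one subreddit's comments
baseline = relative_frequencies(["the cat sat", "the dog ran"])
target = relative_frequencies(["the cat purred", "cat cat"])

# Ratio above 1: the word is more frequent in the subreddit than in the baseline.
# Words that never occur in the baseline are left out to avoid dividing by zero.
ratio = {w: f / baseline[w] for w, f in target.items() if w in baseline}
print(ratio)
```

Words missing from the baseline would need separate handling, for example smoothing, which this sketch leaves out.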