## Exploring Reddit data with Python

This notebook contains a few examples of how to explore text data (such as Reddit comments) using text mining techniques in Python.

The document you are reading right now is a "Jupyter Notebook". This document format combines "text cells" (like the one you are reading) with "code cells" (like the one below this one). 

You can recognize a code cell by the square brackets to the left of it (`[ ]`). To run a cell, click into it and then either click the "play" button in the top panel or press "Shift+Enter".

The notebook represents a workflow, so the cells have to be run in order (the number in the square brackets shows the order in which the cells have been run).

In some cells, you can make a few adjustments to the code to change the output.

### Loading packages

Python uses "packages" to store different functions and tools. A first step is always importing the tools needed for the notebook to work. 

The cell below also sets a few plotting options, defines where data is read from and written to, and loads the spaCy language model.

In [None]:
# Packages

import os
import pandas as pd
import json
import seaborn as sns
from matplotlib import pyplot as plt
import spacy
from spacy import displacy
from datetime import datetime
import re
import numpy as np
import ast

%matplotlib inline
sns.set(rc={'figure.figsize':(20,12)})

data_path = os.path.join('.', 'data')
out_path = os.path.join('.', 'output')

nlp = spacy.load("en_core_web_sm", disable = ["ner"])

The next step is loading the data:

In [None]:
# Loading data
path = os.path.join(data_path, 'reddit_tagpro-mylittlepony_01042017-04042017_long_tokenized.csv')

posts_df = pd.read_csv(path)
posts_df['tokens'] = posts_df['tokens'].apply(ast.literal_eval)
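
The `ast.literal_eval` step is there because a CSV file stores every cell as text, so a column of token lists comes back from `pd.read_csv` as strings that only look like lists. A minimal sketch of the conversion, using a made-up cell value rather than a row from the dataset:

```python
import ast

# Hypothetical cell value, as it would look right after reading the CSV
raw_cell = "['pixel', 'place', 'canvas']"

# literal_eval parses Python literals (lists, strings, numbers)
# without executing arbitrary code
tokens = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__)  # str
print(type(tokens).__name__)    # list
print(tokens[0])                # pixel
```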

With the data loaded, we can inspect it, for example by viewing the first five rows of the data:

In [None]:
posts_df.head()

We can also inspect the "shape" (number of rows and columns):

In [None]:
posts_df.shape

The data contains a range of columns (variables) for the different types of information stored. The column names give us a hint as to what the dataset contains:

In [None]:
posts_df.columns

### Data handling

Working with data always requires various data handling operations. In the code below, we are telling Python how to work with the datetime information in the data as well as adding additional datetime columns:

In [None]:
# Adding date variables

posts_df['comment_created_date'] = pd.to_datetime(posts_df['comment_created_utc'], unit = "s")
posts_df['hour'] = posts_df['comment_created_date'].dt.hour
posts_df['comment_date'] = posts_df['comment_created_date'].dt.date
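# Zero-pad the hour so the labels sort chronologically as text ("...-02" before "...-10")
posts_df['comment_datehour'] = posts_df['comment_date'].astype('str') + "-" + posts_df['hour'].astype('str').str.zfill(2)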
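
As a rough illustration of what the datetime conversion does, the sketch below applies the same steps to two made-up Unix timestamps (seconds since 1970-01-01 UTC). The `demo` frame and its values are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical timestamps: 2017-04-01 00:00:00 and 2017-04-02 13:30:00 (UTC)
demo = pd.DataFrame({'comment_created_utc': [1491004800, 1491139800]})

# unit="s" tells pandas that the numbers are seconds, not nanoseconds
demo['comment_created_date'] = pd.to_datetime(demo['comment_created_utc'], unit="s")

# The .dt accessor extracts parts of each datetime
demo['hour'] = demo['comment_created_date'].dt.hour
demo['comment_date'] = demo['comment_created_date'].dt.date

print(demo[['comment_created_date', 'hour', 'comment_date']])
```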

With the datetime information, we can filter the data to include specifically the time period of interest (April 1st to April 4th):

In [None]:
# Filter comments (from April 1st up to, but not including, April 5th)
posts_df = posts_df.loc[(posts_df['comment_created_date'] >= '2017-04-01') & (posts_df['comment_created_date'] < '2017-04-05'), :]
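
When a datetime column is compared with a date string such as `'2017-04-05'`, pandas reads the string as midnight at the start of that day. The sketch below, with made-up timestamps, shows which values a filter for April 1st through April 4th keeps. It uses `>=` for the lower bound so that a comment posted exactly at midnight is included:

```python
import pandas as pd

# Hypothetical timestamps around the edges of the period of interest
stamps = pd.Series(pd.to_datetime([
    '2017-03-31 23:59:59',  # just before the period
    '2017-04-01 00:00:00',  # first moment of April 1st
    '2017-04-04 23:59:59',  # last second of April 4th
    '2017-04-05 00:00:00',  # first moment of April 5th
]))

# Each date string is interpreted as midnight at the start of that day
mask = (stamps >= '2017-04-01') & (stamps < '2017-04-05')

print(stamps[mask])
```

Only the two middle timestamps pass the filter, so all of April 4th is kept while April 5th is excluded.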

## Comment activity

### Visualizing comment activity

The comment activity can be visualized as a line graph of comment counts over time. With the default settings below, the graph shows the comment count per day across both subreddits.

In the cell below, you can change two options:
- Whether to look at activity subreddit-wise or across both subreddits
- What unit of time to look at

Change the values, then run both cells below to update the graph.

In [None]:
timeunit = "date" # possible options: "date", "hour"
subreddits = "all" # possible options: "compare", "all"

In [None]:
# Post activity

if timeunit == 'hour' and subreddits == 'compare':
    groups = ['comment_datehour', 'post_subreddit']
elif timeunit == 'date' and subreddits == 'compare':
    groups = ['comment_date', 'post_subreddit']
elif timeunit == 'hour' and subreddits == 'all':
    groups = ['comment_datehour']
elif timeunit == 'date' and subreddits == 'all':
    groups = ['comment_date']

# Count comments per group; reset_index turns the group keys into regular columns for plotting
df_timecount = posts_df.groupby(groups).size().reset_index(name = 'count')

# Color the lines by subreddit only when comparing subreddits
hue = 'post_subreddit' if len(groups) > 1 else None

sns.lineplot(data = df_timecount, x = groups[0], y = 'count', hue = hue)
plt.xticks(rotation = 90)
plt.title("Reddit comment activity")

plt.show()
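
The counting step in the cell above can be tried on a small example. The rows and subreddit names below are made up; only the column names follow the notebook:

```python
import pandas as pd

# Hypothetical comments: one row per comment
demo = pd.DataFrame({
    'comment_date': ['2017-04-01', '2017-04-01', '2017-04-01', '2017-04-02'],
    'post_subreddit': ['subreddit_a', 'subreddit_a', 'subreddit_b', 'subreddit_b'],
})

# size() counts the rows in each group
per_day = demo.groupby(['comment_date']).size().to_frame(name='count')
per_day_sub = demo.groupby(['comment_date', 'post_subreddit']).size().to_frame(name='count')

print(per_day)
print(per_day_sub)
```

Grouping by one column gives one count per day. Adding `post_subreddit` as a second group gives one count per day and subreddit, which is what the `hue` argument needs to draw a separate line for each subreddit.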

## Analyzing comments using language models

One of the many things that machine learning technology has brought us is the *language model*. Language models are models ("machines") that are trained to "understand" written and spoken human language.

Autosuggestions from your messenger apps, speech-to-text (Siri, Google, Alexa) and many other applications use language models.

Some of these models are freely available, which allows us to use them to analyze text as well.

**Using a language model**

In the cells below, you can see an example of how a language model works. Change the number to change which comment is analyzed.

In [None]:
# Change comment number here!

comment_no = 123 # a number between 0 and 4185

In [None]:
comment = posts_df.loc[comment_no, 'comment_body']

print(comment)

doc = nlp(comment)

displacy.render(doc, style="dep")

The language model allows us to create sophisticated tokenization functions. Tokenization is a common text pre-processing step that splits texts into individual words (tokens). Here it is combined with lemmatization, which makes sure that variations of the same word (e.g. "letter", "letters") are treated the same.

The cell below defines a custom function for how to pre-process the comment data:

In [None]:
# Functions

stop_words = list(nlp.Defaults.stop_words)
                                            
def tokenizer_custom(text, stop_words=stop_words, tags=['NOUN', 'ADJ', 'VERB', 'ADV']):
       
    text = text.replace('\n', ' ')
    numbers_re = r".*\d.*"
    punct_regex = r"[^\w\s]"
    
    doc = nlp(text)
        
    pos_tags = tags # Part-of-speech tags to keep (default: nouns, adjectives, verbs and adverbs)
    
    exceptions = []
    
    tokens = []
      
    for word in doc:
        if ((word.pos_ in pos_tags) or (any([exception in word.text for exception in exceptions]))) and (len(word.lemma_) > 2) and (word.lemma_.lower() not in stop_words) and not (re.match(numbers_re, word.lemma_.lower())):
            token = word.lemma_.lower() # Returning the word in lower-case.
            token = re.sub(punct_regex, "", token)
            tokens.append(token)

    return tokens
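
Two of the checks in `tokenizer_custom` are plain regular expressions and can be tried without the language model. The sketch below leaves out the part-of-speech and stop word checks (which need spaCy) and runs the remaining length, digit and punctuation rules on a made-up list of words:

```python
import re

numbers_re = r".*\d.*"    # matches any string that contains a digit
punct_regex = r"[^\w\s]"  # matches anything that is not a word character or whitespace

# Hypothetical lemmas, standing in for word.lemma_ from spaCy
candidates = ["pixel", "r/place", "2017", "72x72", "it", "don't"]

kept = []
for lemma in candidates:
    lemma = lemma.lower()
    # Keep words longer than two characters that contain no digits
    if len(lemma) > 2 and not re.match(numbers_re, lemma):
        # Strip punctuation from the words that are kept
        kept.append(re.sub(punct_regex, "", lemma))

print(kept)  # ['pixel', 'rplace', 'dont']
```

Words containing digits and very short words are dropped entirely, while punctuation is only removed from inside the words that remain.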

The cell below applies the function to the data. Because this takes a while, the line is "commented out"; the data loaded earlier already includes the tokens. The cell then reshapes the data so that each token gets its own row.

In [None]:
# Tokenize all data
tags = ['NOUN', 'ADJ']

#posts_df['tokens'] = posts_df['comment_body'].apply(tokenizer_custom, tags = tags)

posts_df_long = posts_df.explode('tokens')
posts_df_long['tokens'] = posts_df_long['tokens'].astype('str')
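
`explode` turns each list of tokens into one row per token, repeating the values of the other columns. A small sketch with made-up data (the `comment_id` column is hypothetical):

```python
import pandas as pd

# Hypothetical comments with their token lists
demo = pd.DataFrame({
    'comment_id': ['a', 'b', 'c'],
    'tokens': [['pixel', 'canvas'], ['place'], []],
})

# One row per token; the original row index is repeated
demo_long = demo.explode('tokens')

print(demo_long)
```

Note that a comment with an empty token list still produces one row, with a missing value in `tokens`.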

### Visualizing the use of words

The graph below shows the most frequently used words (nouns and adjectives) overall and how their use has developed over time (counts per hour).

Change the number below to change how many of the top words to visualize:

In [None]:
words_include = 10 # Change to update the number of words included in visualization

In [None]:
# Visualizing use of top tokens over time

top_tokens = list(posts_df_long['tokens'].value_counts().index[0:words_include]) # Top tokens as list
df_filter = posts_df_long.loc[posts_df_long['tokens'].isin(top_tokens), :] # Data filtered for top tokens

df_timecount = df_filter.groupby(['comment_datehour', 'tokens']).size().to_frame(name = 'count')

sns.lineplot(data = df_timecount, x = 'comment_datehour', y = 'count', hue = 'tokens')
plt.xticks(rotation = 90)
plt.title(f"Use of words (top {words_include} overall)")
plt.show()
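
The selection of top words combines `value_counts` (which sorts by frequency, most frequent first) with `isin`. A minimal sketch on a made-up list of tokens:

```python
import pandas as pd

# Hypothetical tokens, one per row as in the long-format data
tokens = pd.Series(['pixel', 'place', 'pixel', 'canvas', 'pixel', 'place', 'bot'])

top_n = 2

# The index of value_counts holds the tokens, ordered from most to least frequent
top_tokens = list(tokens.value_counts().index[0:top_n])

# Keep only the rows whose token is among the top tokens
filtered = tokens[tokens.isin(top_tokens)]

print(top_tokens)     # ['pixel', 'place']
print(len(filtered))  # 5
```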

### Visualizing the use of keywords

The use of a specific set of words can also be shown.

Keywords are matched as substrings, so any token that contains a keyword is counted under that keyword (for example, "placement" and "marketplace" would both be counted as "place").

Add keywords to the list below to see how their use changes over time.

In [None]:
keywords = ['place',
            'pixel']

In [None]:
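# Visualizing use of specific tokens over time

# Keep only tokens that contain one of the keywords; .copy() avoids modifying a view of posts_df_long
posts_filter = posts_df_long.loc[posts_df_long['tokens'].apply(lambda token: any([word in token for word in keywords])), :].copy()

# Relabel every matching token with its keyword (regex=False matches the keyword as plain text)
for keyword in keywords:
    posts_filter.loc[posts_filter['tokens'].str.contains(keyword, regex=False), 'tokens'] = keyword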

df_timecount = posts_filter.groupby(['comment_datehour', 'tokens']).size().to_frame(name = 'count')

sns.lineplot(data = df_timecount, x = 'comment_datehour', y = 'count', hue = 'tokens')
plt.xticks(rotation = 90)
plt.title("Use of specific keywords")
plt.show()
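
The keyword filter above checks whether a keyword occurs anywhere inside a token. The sketch below mimics that check in plain Python on made-up tokens. It also shows a side effect worth keeping in mind: "replace" ends up under "place" because it contains the keyword.

```python
keywords = ['place', 'pixel']

# Hypothetical tokens
tokens = ['place', 'placement', 'replace', 'pixel', 'canvas']

# Map each token to the keyword it contains (tokens without a keyword are skipped)
matched = {}
for token in tokens:
    for keyword in keywords:
        if keyword in token:  # substring check, same idea as in the filter
            matched[token] = keyword

print(matched)
```

If such matches are unwanted, one option is to compare tokens for exact equality with the keywords instead of using a substring check.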