<div align="center">
    <h1><strong>&#x1F333; Reddit Trees - constructing conversation trees from posts.</strong></h1>
</div> 

This notebook guides researchers through assembling conversation trees from Reddit posts. It accepts a dataframe as input (refer to the Aquisitions notebook) and outputs a graph object (which can be saved as a .graphml file) and an adjacency list (which can be saved as a .csv file).

Good to Know:

&#x2139; The input dataframe needs to include the key columns - link_id, parent_id and replies

&#x2139; Graph objects are usually named with a capital G by convention

&#x2139; Both graph objects and adjacency lists are portable and can be saved for use in other programs such as Gephi

&#x2139; In the graphs produced: nodes = individual comments and submissions, and edges = replies
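
To make the node and edge mapping concrete, here is a minimal sketch in plain Python of how reply relationships can be collected into an adjacency list. It is not the notebook's own implementation: the `id` field and the sample records are hypothetical, and the only convention relied on is Reddit's habit of prefixing comment identifiers with `t1_` and submission identifiers with `t3_`.

```python
from collections import defaultdict

# Hypothetical records for illustration only. `id` is an assumed field;
# `link_id` and `parent_id` use Reddit's t3_ (submission) and t1_ (comment) prefixes.
posts = [
    {"id": "c1", "link_id": "t3_s1", "parent_id": "t3_s1"},  # top-level comment
    {"id": "c2", "link_id": "t3_s1", "parent_id": "t1_c1"},  # reply to c1
    {"id": "c3", "link_id": "t3_s1", "parent_id": "t1_c1"},  # reply to c1
    {"id": "c4", "link_id": "t3_s1", "parent_id": "t1_c2"},  # reply to c2
]


def strip_prefix(fullname):
    """Drop the type prefix (for example 't1_' or 't3_') from a Reddit fullname."""
    return fullname.split("_", 1)[1] if "_" in fullname else fullname


def build_adjacency(records):
    """Map each parent node to the list of nodes that reply to it."""
    adjacency = defaultdict(list)
    for record in records:
        parent = strip_prefix(record["parent_id"])
        adjacency[parent].append(record["id"])
    return dict(adjacency)


adjacency = build_adjacency(posts)
print(adjacency)
# {'s1': ['c1'], 'c1': ['c2', 'c3'], 'c2': ['c4']}
```

Each key is a node (a submission or comment) and each value lists the nodes that reply to it, so every key/value pair corresponds to one or more edges in the tree.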


&#x1F381; Added Bonus:

This notebook also includes a workflow to assign a topic to each node based on the text of the post it represents in the graph. There are two options: BERTopic and LDA-based topic models. BERTopic requires a GPU for optimal performance, while LDA can be run without one.

In [None]:
# required imports
from reddit_topic_trees import Reddit_trees
import pandas as pd

Load data to make a working dataframe

The working dataframe should contain three key columns that the code uses to make the trees - link_id, parent_id and replies

In [None]:
data = pd.read_csv('test_data.csv')

#check the data

print(data.head(5))
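
Before building the trees, it can help to confirm that the three key columns are present. The helper below is a small sketch, not part of `reddit_topic_trees`; the name `missing_columns` and the example column list are made up for illustration. It accepts any iterable of column names, so with the dataframe loaded above it could be called as `missing_columns(data.columns)`.

```python
# The three key columns named in this notebook.
REQUIRED_COLUMNS = ("link_id", "parent_id", "replies")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Hypothetical column list standing in for `data.columns`.
example_columns = ["id", "link_id", "parent_id", "body"]
missing = missing_columns(example_columns)
if missing:
    print("Missing required columns:", ", ".join(missing))
```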

You can make some choices about the workflow at this point:

you can pass the dataframe unaltered to the graph making code;

you can run the BERTopic topic modelling and then make the graphs (requires a GPU for optimal performance);

you can run the LDA topic modelling and then make the graphs (requires extra NLP steps).

If you choose to run topic modelling, there is one further consideration. Topics are assigned at the document level, which means each document receives exactly one topic. If you want a more in-depth look at the topics being discussed (especially in longer posts), you can expand the documents to the sentence level and model the sentences instead.
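
As a rough illustration of what expanding documents to the sentence level involves, the sketch below splits each post into sentences while keeping track of which post each sentence came from. It uses a deliberately naive regular-expression splitter and hypothetical sample posts; the notebook's own expansion step may work differently, and a dedicated sentence tokenizer would handle abbreviations and similar edge cases better.

```python
import re

# Naive boundary: whitespace that follows a '.', '!' or '?'.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def expand_to_sentences(documents):
    """Turn (doc_id, text) pairs into (doc_id, position, sentence) rows."""
    rows = []
    for doc_id, text in documents:
        sentences = SENTENCE_BOUNDARY.split(text.strip())
        for position, sentence in enumerate(sentences):
            if sentence:  # skip empty strings produced by empty posts
                rows.append((doc_id, position, sentence))
    return rows


# Hypothetical posts for illustration only.
documents = [
    ("c1", "I love this game. The servers are terrible though!"),
    ("c2", "Agreed."),
]
rows = expand_to_sentences(documents)
for row in rows:
    print(row)
```

Keeping the document identifier on every row means sentence-level topics can later be mapped back to the node that represents the original post.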

Dataframe by itself


The first step is to set up the Reddit_trees class which manages the code tools.

In [None]:
#set up reddit tree tools class

reddit_workflow = Reddit_trees()

In [None]:
# build the graph object and adjacency list without topic information
G_tree, adj_list = reddit_workflow.tree_graph_and_adj_list(data, incl_topic=False)