
<img src="/home/fleetr/RedTools/pics/godcat.png" alt="Godcat" style="width:150px;"/>


<div align="center">
    <h1><strong>&#x1F333; Reddit Trees - constructing conversation trees from posts.</strong></h1>
</div> 

This notebook guides researchers through assembling conversation trees from Reddit posts. It accepts a dataframe as input (see the Aquisitions notebook) and outputs a graph object (which can be saved as a .graphml file) and an adjacency list (which can be saved as a .csv file).

Good to Know:

&#x2139; The input dataframe must include three key columns: link_id, parent_id and replies

&#x2139; Graph objects are usually named with a capital G by convention

&#x2139; Both graph objects and adjacency lists are portable and can be saved for use in other programs such as Gephi

&#x2139; In the graphs produced: nodes = individual comments and submissions, and edges = replies (directed)

&#x2139; The file names used in this notebook are defaults; feel free to change them to more meaningful names
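
The node and edge convention in the list above can be made concrete with a toy thread. This is only a sketch: it assumes the graph object is a networkx `DiGraph` (the notebook does not say which library backs it), that edges point from a parent to its reply, and that IDs carry Reddit's `t3_` (submission) and `t1_` (comment) prefixes. The records are made up.

```python
import networkx as nx

# Toy thread: one submission (t3_) and three comments (t1_).
# Each record names the post it replies to via parent_id.
toy_records = [
    {"id": "t3_aaa", "parent_id": None},      # the submission (root)
    {"id": "t1_bbb", "parent_id": "t3_aaa"},  # top-level comment
    {"id": "t1_ccc", "parent_id": "t3_aaa"},  # top-level comment
    {"id": "t1_ddd", "parent_id": "t1_bbb"},  # reply to a comment
]

G_toy = nx.DiGraph()
for rec in toy_records:
    G_toy.add_node(rec["id"])
    if rec["parent_id"] is not None:
        # Assumed direction: parent -> reply.
        G_toy.add_edge(rec["parent_id"], rec["id"])

# The root is the only node that nothing points to.
toy_roots = [node for node, degree in G_toy.in_degree() if degree == 0]
print(toy_roots, G_toy.number_of_nodes(), G_toy.number_of_edges())
```

Because every comment has exactly one parent, a single thread built this way forms a rooted tree with the submission at the top.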


&#x1F381; Added Bonus:

This notebook also includes a workflow that assigns a topic to each node based on the text of the post it represents in the graph. There are two options: BERTopic and LDA-based topic models. BERTopic requires a GPU for optimal performance, while LDA can be run without one.

In [None]:
# required imports
from reddit_topic_trees import Reddit_trees
import pandas as pd

Load data to make a working dataframe

&#x2713; Check the working dataframe contains the three key columns that the code uses to make the trees - <strong>link_id, parent_id and replies</strong>

In [None]:
#load the data from a csv file

data = pd.read_csv('test_data_csv.csv')

#check the data

print(data.head(10))
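
The column check mentioned above can be automated. The helper below is a hypothetical convenience function, not part of `Reddit_trees`, and the toy dataframe is made up to show a failing case.

```python
import pandas as pd

REQUIRED_COLUMNS = ["link_id", "parent_id", "replies"]

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]

# Made-up frame that lacks the 'replies' column.
toy_data = pd.DataFrame(
    {"link_id": ["t3_aaa"], "parent_id": ["t3_aaa"], "body": ["hello"]}
)
print(missing_columns(toy_data))
```

Calling `missing_columns(data)` on the working dataframe should return an empty list before you move on to the graph-making code.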

&#x1F6D1; You can make some choices about the workflow at this point:

you can pass the dataframe unaltered to the graph making code;

you can run the BERTopic topic modelling and then make the graphs (requires a GPU for optimal performance);

you can run the LDA topic modelling and then make the graphs (requires extra NLP steps).
<div style="background-color: #90EE90; border: 1px solid #ddd; padding: 10px;">
<strong>&#x2049;</strong> If you are interested in topic models, there is one further consideration. Topics are assigned at the document level, which means each document receives exactly one topic. For a more in-depth look at the topics being discussed (especially in longer posts), you can expand the documents to sentence level and model the sentences instead. However, because each node should carry only one topic, that workflow is handled in another notebook.
</div>
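
To illustrate what expanding documents to sentence level means, here is a minimal sketch using pandas. It is not the code from the other notebook: the splitter is a naive regular expression and the column names `id` and `body` are assumptions for the toy data.

```python
import re
import pandas as pd

toy_posts = pd.DataFrame({
    "id": ["t1_bbb", "t1_ccc"],
    "body": ["I like cats. Dogs are fine too!", "Agreed."],
})

# Naive splitter: break after '.', '!' or '?' when followed by whitespace.
toy_posts["sentence"] = toy_posts["body"].apply(
    lambda text: re.split(r"(?<=[.!?])\s+", text.strip())
)

# One row per sentence; 'id' still links each sentence to its post.
toy_sentences = toy_posts.explode("sentence").reset_index(drop=True)
print(toy_sentences[["id", "sentence"]])
```

After this step a single post can map to several rows, and therefore to several topics, which is why it does not fit the one-topic-per-node graphs built here.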

<div align="center">
    <h1><strong>Dataframe by Itself</strong></h1>
</div> 


The first step is to set up the Reddit_trees class which manages the code tools.

In [None]:
#set up reddit tree tools class

reddit_workflow = Reddit_trees()

Now we run the graph-making code. It outputs two objects: a graph object and an adjacency list dataframe.

In [None]:
#build the graph object and the adjacency list
G_tree, adj_list = reddit_workflow.tree_graph_and_adj_list(data, incl_topic = False)


We can draw a basic plot to check that it worked.

In [None]:
# plot a basic graph

reddit_workflow.plot_basic_graph(G_tree, 'basic_graph')

We can also inspect the adjacency list dataframe to see what it contains.

In [None]:
#check the adjacency list

print(adj_list.head(10))
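
An adjacency list of this kind is just a table of edges, so it can be summarised or turned back into a graph. The sketch below uses a made-up edge table; the column names `source` and `target` are assumptions and may differ from those in `adj_list`.

```python
import networkx as nx
import pandas as pd

# Made-up adjacency list; the real column names may differ.
toy_adj = pd.DataFrame({
    "source": ["t3_aaa", "t3_aaa", "t1_bbb"],
    "target": ["t1_bbb", "t1_ccc", "t1_ddd"],
})

# Number of direct replies each post received.
toy_reply_counts = toy_adj["source"].value_counts()

# Rebuild a directed graph from the edge rows.
G_from_adj = nx.from_pandas_edgelist(
    toy_adj, source="source", target="target", create_using=nx.DiGraph
)
print(toy_reply_counts.to_dict(), G_from_adj.number_of_edges())
```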

Finally, we can save the objects for later use.

In [None]:
#save the graph object

reddit_workflow.save_graph(G_tree, 'reddit_tree.graphml')

#save the adjacency list

reddit_workflow.save_adj_list(adj_list, 'reddit_adj_list.csv')
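
A saved .graphml file can be loaded again in a later session. The sketch below shows a round trip with networkx on a toy graph written to a temporary directory. It assumes the saved file is standard GraphML; whether `save_graph` uses networkx internally is not stated in this notebook.

```python
import os
import tempfile

import networkx as nx

G_saved = nx.DiGraph()
G_saved.add_edge("t3_aaa", "t1_bbb")
G_saved.add_edge("t1_bbb", "t1_ccc")

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_tree.graphml")
    nx.write_graphml(G_saved, toy_path)    # write standard GraphML
    G_loaded = nx.read_graphml(toy_path)   # reload; node IDs come back as strings

print(sorted(G_loaded.edges()))
```

The adjacency list can be reloaded in the same spirit with `pd.read_csv`.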

<div align="center">
    <h1><strong>BERTopic topic modelling</strong></h1>
</div> 

https://maartengr.github.io/BERTopic/index.html

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
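
To give a feel for the c-TF-IDF part, here is a simplified sketch of the class-based weighting described in the BERTopic documentation: a word's weight in a cluster is its frequency in that cluster times log(1 + A / f), where f is the word's frequency across all clusters and A is the average number of words per cluster. This is illustrative only; the real implementation also normalises frequencies and works on sparse matrices, and the toy clusters are made up.

```python
import math
from collections import Counter

# Made-up clusters: each class is the concatenation of its documents' tokens.
toy_classes = {
    "pets": "cat dog cat vet".split(),
    "food": "pizza pasta pizza cat".split(),
}

# Word counts per class, and word counts over all classes.
tf = {name: Counter(tokens) for name, tokens in toy_classes.items()}
f = Counter()
for class_counts in tf.values():
    f.update(class_counts)

# Average number of words per class.
A = sum(len(tokens) for tokens in toy_classes.values()) / len(toy_classes)

def ctfidf(word, cls):
    """Weight of `word` in class `cls`: tf * log(1 + A / f)."""
    return tf[cls][word] * math.log(1 + A / f[word])

# The highest-weighted word describes the cluster.
top_pets = max(tf["pets"], key=lambda w: ctfidf(w, "pets"))
top_food = max(tf["food"], key=lambda w: ctfidf(w, "food"))
print(top_pets, top_food)
```

Words that appear in many clusters get a smaller logarithmic factor, so the topic descriptions favour words that are frequent within one cluster and rare elsewhere.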

We will use the topic model to assign a topic to each node in our graph. This adds information about what is being discussed at each point in the tree representation of the Reddit conversations.

We can topic model the submissions (i.e. the original posts), the comments (i.e. the replies to the original posts), or both.

In [None]:
#load data
 
submissions  = pd.read_csv('submissions_csv.csv')

comments = pd.read_csv('comments_csv.csv')

Model the comments only

In [None]:
# model the comments

#returns two objects: the comments dataframe with added topic columns, and a list of topics

comments_df, topic_list = reddit_workflow.topic_model_comments(comments, 'body')

Model the submissions only

In [None]:
#model the submissions

#returns two objects: the submissions dataframe with added topic columns, and a list of topics

submissions_df, topic_list = reddit_workflow.topic_model_submissions(submissions, 'title', 'selftext')

Model both

In [None]:
#model the submissions and comments

#returns two objects: a dataframe of submissions and comments with added topic columns, and a list of topics

combined_df, topic_list = reddit_workflow.topic_model_combined(submissions, comments, 'body', 'selftext', 'title')

For complete graphs, pass the combined dataframe: this way the original submission and the comments share the same topic space. Note that the "incl_topic" flag must be set to True in the function arguments.

In [None]:
G_combined, adj_list_combined = reddit_workflow.tree_graph_and_adj_list(combined_df, incl_topic = True)
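
The idea of a shared topic space can be sketched as follows: submissions and comments are stacked into one corpus, so a single model sees every node's text. How `topic_model_combined` builds its documents is not shown in this notebook, so the rule used here (title plus selftext for submissions, body for comments) and the `name` and `text` columns are assumptions on toy data.

```python
import pandas as pd

toy_submissions = pd.DataFrame({
    "name": ["t3_aaa"],
    "title": ["Cats or dogs?"],
    "selftext": ["Which do you prefer?"],
})
toy_comments = pd.DataFrame({
    "name": ["t1_bbb", "t1_ccc"],
    "body": ["Cats, obviously.", "Dogs for me."],
})

# Assumed rule: a submission's document is its title plus its selftext.
toy_sub_docs = toy_submissions.assign(
    text=toy_submissions["title"].fillna("") + " " + toy_submissions["selftext"].fillna("")
)[["name", "text"]]
toy_com_docs = toy_comments.rename(columns={"body": "text"})[["name", "text"]]

# One corpus, so a single model assigns topics to every node.
toy_corpus = pd.concat([toy_sub_docs, toy_com_docs], ignore_index=True)
print(toy_corpus)
```

Modelling the two dataframes separately would instead produce two unrelated sets of topic numbers, which could not be compared across a parent and its replies.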

We can inspect the adjacency list and, finally, save the graph object as a .graphml file to explore in a visualisation tool like Gephi.

https://gephi.org/

In [None]:
print(adj_list_combined.head(10))

reddit_workflow.save_graph(G_combined, 'reddit_combined_tree.graphml')

<div align="center">
    <h1><strong>LDA topic modelling</strong></h1>
</div> 

https://en.m.wikipedia.org/wiki/Latent_Dirichlet_allocation


We will use the topic model to assign a topic to each node in our graph. This adds information about what is being discussed at each point in the tree representation of the Reddit conversations.

The major difference from BERTopic is that LDA requires you to specify the number of topics to extract from the text. This can be tricky, and there are no magic numbers: choose a reasonable number given the number of documents in the corpus. You may want to experiment.
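
One way to experiment is to fit a model for several candidate topic counts and compare a fit score. The sketch below uses scikit-learn's `LatentDirichletAllocation` as a stand-in (which LDA implementation `Reddit_trees` uses is not stated here) and compares perplexity on a made-up corpus. Perplexity on the training data tends to favour more topics, so in practice a held-out set or a coherence measure, together with reading the topics yourself, is a better guide.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "cats purr and sleep all day",
    "dogs bark and fetch the ball",
    "my cat chased the dog",
    "pizza with extra cheese",
    "pasta and pizza for dinner",
    "cheese makes pasta better",
]

# Bag-of-words counts, dropping common English stop words.
toy_counts = CountVectorizer(stop_words="english").fit_transform(toy_docs)

# Fit one model per candidate topic count and record its perplexity
# (lower is better, but on a corpus this small it is only illustrative).
perplexity_by_k = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(toy_counts)
    perplexity_by_k[k] = lda.perplexity(toy_counts)

print(perplexity_by_k)
```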

We can topic model the submissions (i.e. the original posts), the comments (i.e. the replies to the original posts), or both.

Load the data

In [None]:
#load data
 
submissions  = pd.read_csv('submissions_csv.csv')

comments = pd.read_csv('comments_csv.csv')

Model the comments only

In [None]:
comments_lda, lda_model_comments = reddit_workflow.lda_comments(comments, 'body', 5)
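
LDA works on word counts, which is why it needs the extra NLP steps mentioned earlier. Whether `lda_comments` performs them internally is not shown in this notebook, so the following is a hypothetical sketch of typical preprocessing (lowercasing, tokenising, stop-word removal) using only the standard library. The stop-word list is deliberately tiny.

```python
import re
from collections import Counter

# Tiny hand-picked stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "is", "are", "to", "of", "i", "my"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stop words and 1-letter tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS and len(tok) > 1]

toy_tokens = preprocess("My cat is the BEST, and the dog agrees!")
toy_bow = Counter(toy_tokens)  # bag-of-words counts, the kind of input LDA works on
print(toy_tokens)
```

Real pipelines usually add lemmatisation and a fuller stop-word list, for example from NLTK or spaCy.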

Model the submissions only

In [None]:
submissions_lda, lda_model_submissions = reddit_workflow.lda_submissions(submissions, 'title', 'selftext', 5)

Model both

In [None]:
combined_lda = reddit_workflow.lda_combined(submissions, comments, 'body', 'selftext', 'title', 5)

Make the graph

In [None]:
G_combined_lda, adj_list_combined_lda = reddit_workflow.tree_graph_and_adj_list(combined_lda, incl_topic = True)

We can inspect the adjacency list and, finally, save the graph object as a .graphml file to explore in a visualisation tool like Gephi.

https://gephi.org/

In [None]:
print(adj_list_combined_lda.head(10))

reddit_workflow.save_graph(G_combined_lda, 'reddit_combined_lda_tree.graphml')
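
Once topics are attached to nodes, the tree can be used to ask simple questions, such as how often a reply stays on its parent's topic. The sketch below assumes a networkx `DiGraph` whose nodes carry a `topic` attribute. That attribute name and the toy graph are assumptions, so check the node attributes of your own graph first.

```python
import networkx as nx

G_topics = nx.DiGraph()
# 'topic' as the node attribute name is an assumption.
G_topics.add_node("t3_aaa", topic=0)
G_topics.add_node("t1_bbb", topic=0)
G_topics.add_node("t1_ccc", topic=2)
G_topics.add_node("t1_ddd", topic=2)
G_topics.add_edges_from(
    [("t3_aaa", "t1_bbb"), ("t3_aaa", "t1_ccc"), ("t1_ccc", "t1_ddd")]
)

node_topics = nx.get_node_attributes(G_topics, "topic")

# Count edges whose two endpoints share a topic.
same_topic_edges = sum(
    1 for parent, reply in G_topics.edges()
    if node_topics[parent] == node_topics[reply]
)
same_topic_share = same_topic_edges / G_topics.number_of_edges()
print(same_topic_edges, same_topic_share)
```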