<img src="https://i1.sndcdn.com/avatars-000274765548-vj7h0w-t500x500.jpg" 
style='float:right; width:200px; margin: 0 20px;'>

# Reddit Conversations
---
A step-by-step walkthrough of using conversant on Reddit conversations.


## Read Data
Let's read conversation data from the Change-My-View (CMV) dataset, stored in pickle format.

One option is to load the data into a pandas DataFrame.

In [12]:
# reading data to pandas df

from conversant.data.loaders.load import load2df
cmv_df = load2df(path='./3000tree.pickle', input_format='pickle')

cmv_df.sample(5)

[03/03/2020 11:25:22] INFO Conversation sample has 3000 unique trees
[03/03/2020 11:25:22] INFO Conversation sample has 21928 unique authors


| | node_id | tree_id | timestamp | author | text | parent | index1 |
|---|---|---|---|---|---|---|---|
| 136700 | dqhjk0v | 7g9byy | 1511922188 | annoinferno | You think that people with addictions that are... | 136699 | 136700 |
| 482873 | dso4ihl | 7qb35x | 1515948387 | Chantottie | I don’t know about “fairly common” or part of ... | 482872 | 482873 |
| 394993 | dgfj958 | 664k31 | 1492539685 | Qwerty_Resident | <quote>Envelope B contains 2$x with a probabil... | 394967 | 394993 |
| 148157 | dt6gmdc | 7snbz0 | 1516820106 | aurojyoti_das | Wow, rights of my family to life is an extreme... | 148156 | 148157 |
| 73265 | dim7blb | 6fw2as | 1496903421 | jabberwockxeno | <quote> In general, most of the time people ta... | 73245 | 73265 |


A flat DataFrame is not a natural fit for conversation data, because replies form a hierarchy.

A better data structure for a conversation is a tree.

In [1]:
from conversant.data.loaders.load import load2anytree

# load conversation data to a dictionary like {'post_id' : AnyTree Node object}
cmv = load2anytree(path='./3000tree.pickle', input_format='pickle')

[03/03/2020 11:00:31] INFO Conversation sample has 3000 unique trees
[03/03/2020 11:00:31] INFO Conversation sample has 21928 unique authors
[03/03/2020 11:01:09] INFO Done converting 3000 conversations to trees
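To make the DataFrame-to-tree idea concrete, here is a minimal standard-library sketch of turning flat rows with parent pointers into nested trees. It is not conversant's implementation: the toy rows, the use of `None` for a root's parent, and the `build_trees` helper are all assumptions made for illustration.

```python
# Hypothetical toy rows shaped like the DataFrame above:
# (index, tree_id, author, parent), where parent is the index of the parent row.
# Using None as the parent of a root post is an assumption of this sketch.
rows = [
    (0, "t1", "alice", None),
    (1, "t1", "bob", 0),
    (2, "t1", "alice", 1),
    (3, "t1", "carol", 0),
    (4, "t2", "dave", None),
]


def build_trees(rows):
    """Return {tree_id: root_node}; each node is a dict with a 'children' list."""
    # First pass: create one node per row, keyed by its index.
    nodes = {idx: {"author": author, "children": []} for idx, _, author, _ in rows}
    roots = {}
    # Second pass: attach each node to its parent, or record it as a root.
    for idx, tree_id, _, parent in rows:
        if parent is None:
            roots[tree_id] = nodes[idx]
        else:
            nodes[parent]["children"].append(nodes[idx])
    return roots


trees = build_trees(rows)
print(sorted(trees))                              # tree ids found in the rows
print([c["author"] for c in trees["t1"]["children"]])  # direct replies to the root
```

The two-pass approach means rows do not have to arrive in parent-before-child order.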


We can print one random conversation structure. 

In [11]:
from anytree import RenderTree

random_root = cmv[9][457884]

for pre, fill, node in RenderTree(random_root):
    print("%s%s" % (pre, node.author))


[deleted]
├── Hq3473
│   └── [deleted]
│       └── Hq3473
│           └── [deleted]
│               └── Hq3473
├── ZEPHYREFTW
├── Skelletorr
├── [deleted]
├── DylanTheVillyn
│   └── [deleted]
│       └── DylanTheVillyn
├── Bengom
├── SchiferlED
├── teerre
└── SordidDreams
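For readers curious how such a drawing can be produced, the following is a small sketch of a recursive renderer over nested dictionaries. It only mimics the look of the output above and is not how anytree's `RenderTree` is implemented; the `render` helper and the toy tree are hypothetical.

```python
def render(node, prefix="", is_last=True, is_root=True):
    """Return the lines of a box-drawing rendering for a nested-dict tree."""
    if is_root:
        lines = [node["author"]]
        child_prefix = ""
    else:
        # The last child gets a corner connector, the others a tee.
        connector = "└── " if is_last else "├── "
        lines = [prefix + connector + node["author"]]
        # Below a non-last child, keep a vertical bar to show remaining siblings.
        child_prefix = prefix + ("    " if is_last else "│   ")
    children = node["children"]
    for i, child in enumerate(children):
        lines.extend(render(child, child_prefix, i == len(children) - 1, False))
    return lines


# Hypothetical toy conversation: the original poster, two replies, one follow-up.
tree = {
    "author": "op",
    "children": [
        {"author": "a", "children": [{"author": "op", "children": []}]},
        {"author": "b", "children": []},
    ],
}

rendered = render(tree)
print("\n".join(rendered))
```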


## Preprocessing
The preprocessing tools cover a number of common conversation-processing steps.

Each function supports both dataframe and AnyTree dictionary structures.

Let's filter out conversation trees that have fewer than 5 nodes.

In [2]:
from conversant.data.preprocessing.filters import filter_under_n

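# Count trees before and after dropping conversations with fewer than 5 nodes.
# tree_id is a DataFrame column, so this cell uses cmv_df rather than the tree dictionary.
print(f'Number of trees in cmv dataset is {cmv_df.tree_id.nunique()}')

# NOTE: the filter call is commented out, so both counts printed below are identical.
#cmv_df = filter_under_n(cmv_df)

print(f'Number of trees after applying filter is {cmv_df.tree_id.nunique()}')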

Number of trees in cmv dataset is 3000
Number of trees after applying filter is 3000
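As an illustration of what a size filter does, here is a standard-library sketch that drops every conversation with fewer than `n` nodes. It is not the code behind `filter_under_n`, whose signature and exact threshold semantics are not shown in this walkthrough; `filter_small_trees` and the toy rows are hypothetical.

```python
from collections import Counter

# Hypothetical (node_id, tree_id) pairs: tree "a" has 6 nodes, tree "b" has 2.
rows = [(f"a{i}", "a") for i in range(6)] + [(f"b{i}", "b") for i in range(2)]


def filter_small_trees(rows, n=5):
    """Keep only rows whose conversation tree has at least n nodes."""
    sizes = Counter(tree_id for _, tree_id in rows)
    return [row for row in rows if sizes[row[1]] >= n]


kept = filter_small_trees(rows, n=5)
print(len({tree_id for _, tree_id in rows}))  # number of trees before filtering
print(len({tree_id for _, tree_id in kept}))  # number of trees after filtering
```

Counting sizes once up front keeps the filter to two linear passes over the rows.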


## Enrichment 
We can use enrichment to add new relevant data to our conversation data. 

Each function supports both dataframe and AnyTree dictionary structures.

Enrichments can be based on the text or on the structure of the conversation.

Let's add a new feature called "clean_text", which is a processed version of the "text" field.

In [18]:
from conversant.data.enrichment.textual import clean_text_field

cmv_df = clean_text_field(cmv_df)

print('Example clean text:')
print(cmv_df.clean_text.sample(1).values[0])

Example clean text:
i do not have to walk down any philosophical rabbit holes all i have to do if look for any actual evidence that god exists and so far there is really none therefore i do not need anything else i am not starting with a conclusion i am simply seeking evidence and finding none and sure my deep analysis does not really pass your test but then again nothing says that it has to
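The sample above looks lowercased, stripped of punctuation, and with contractions expanded. The sketch below shows one way such normalization could be done with the standard library. It is a guess at the kind of processing involved, not the actual logic of `clean_text_field`; the `clean_text` helper, the small contraction table, and the removal of `<quote>`-style markers are all assumptions.

```python
import re

# A deliberately tiny, hypothetical contraction table.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "i'm": "i am",
    "it's": "it is",
}


def clean_text(text):
    """Lowercase, expand a few contractions, drop markup and punctuation."""
    text = text.lower().replace("’", "'")  # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"<[^>]+>", " ", text)      # drop markers such as <quote>
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


cleaned = clean_text("I don’t need <quote>anything</quote> else, really!")
print(cleaned)
```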


## Analysis

We might want to run some exploratory data analysis (EDA) on our conversational data. 

Let's plot the percentiles of the number of posts per conversation.
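Before plotting, it can help to see how the underlying numbers are obtained. The sketch below counts posts per conversation and computes nearest-rank percentiles using only the standard library. The toy `tree_ids` list and the `percentile` helper are hypothetical, and nearest-rank is just one of several percentile definitions, so a plotting library may report slightly different values.

```python
import math
from collections import Counter

# Hypothetical data: ten conversations containing 1, 2, ..., 10 posts.
tree_ids = [f"t{k}" for k in range(1, 11) for _ in range(k)]

# Posts per conversation, sorted ascending.
sizes = sorted(Counter(tree_ids).values())


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


for p in (25, 50, 90, 100):
    print(f"{p}th percentile of posts per conversation: {percentile(sizes, p)}")
```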