<a href="https://colab.research.google.com/github/Jeevesh8/arg-mining/blob/main/winning_args/convokit_winning_args.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip3 install convokit

In [None]:
from convokit import Corpus, download
corpus = Corpus(filename=download("winning-args-corpus"))

In [3]:
type(corpus)

convokit.model.corpus.Corpus

In [5]:
corpus.print_summary_stats()

Number of Speakers: 34911
Number of Utterances: 293297
Number of Conversations: 3051


## Looping over the dataset to get threads

In [30]:
for elem in corpus.iter_objs("conversation"):
    print("Common Title:", elem.meta["op-title"])
    for path in elem.get_root_to_leaf_paths():
        for utterance in path:
            print(utterance.text)
        break
    break

Common Title: CMV: Anything that is man-made is natural.
I can't remember the topic that spurred this discussion, but a friend and I were debating whether man-made things were natural. He took the position that they are unnatural. 

He cited this definition by Merriam-Webster:  existing in nature and not made or caused by people : coming from nature (http://www.merriam-webster.com/dictionary/natural) as his basis for the distinction for natural vs. unnatural.

However, I respectfully disagree with his position and furthermore that definition of natural. People arise from nature. Humankind's capacity to create, problem-solve, analyze, rationalize, and build also come from natural processes. How are the things we create unnatural? It is only through natural occurrences that we have this ability, why is it that we would give the credit of these things solely to man, as opposed to nature? We are not separate from nature, thus, how can any of our actions or creations be unnatural? If we wer

## Generating Subtrees

In [None]:
import convokit
import transformers


def size_subtrees(tree: convokit.model.UtteranceNode,
                  tokenizer: transformers.PreTrainedTokenizer,
                  extra_tokens: int = 1):
    """
    Args:
        tree:         Root UtteranceNode of a Winning Arguments conversation tree
                      whose subtrees are to be measured.
        tokenizer:    Tokenizer used to tokenize the utterance texts in the tree;
                      must implement encode().
        extra_tokens: Expected number of extra tokens that will be added to the
                      text of each comment (e.g., user tags or post tags).
    Returns:
        The same tree, with an extra "subtree_size" entry in node.utt.meta at
        every node n: the token count of the combined texts of n and of all
        nodes in the subtree rooted at n.
    """
    def core_recursion(root: convokit.model.UtteranceNode):
        # Tokens for this node's own text, including its <s> and </s> tokens.
        self_length = len(tokenizer.encode(root.utt.text)) + extra_tokens

        for subroot in root.children:
            # -2 for the child's <s> </s> tokens; one pair is already counted
            # in the initial value of self_length.
            self_length += core_recursion(subroot) - 2

        root.utt.meta["subtree_size"] = self_length
        return self_length

    # The total length is not needed here; every node now stores its own size.
    core_recursion(tree)

    return tree
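
The token accounting in `size_subtrees` can be checked by hand on a tiny tree. The sketch below is only an illustration under stated assumptions: `ToyUtterance`, `ToyNode`, `ToyTokenizer` and `toy_size_subtrees` are hypothetical stand-ins (not part of convokit or transformers), and the toy tokenizer simply splits on whitespace and wraps the result in `<s>` and `</s>`, mirroring the assumption behind the `-2` correction.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyUtterance:
    """Hypothetical stand-in for a convokit Utterance: text plus a meta dict."""
    text: str
    meta: Dict[str, int] = field(default_factory=dict)


@dataclass
class ToyNode:
    """Hypothetical stand-in for an UtteranceNode: an utterance and its children."""
    utt: ToyUtterance
    children: List["ToyNode"] = field(default_factory=list)


class ToyTokenizer:
    """Whitespace tokenizer that wraps every encoding in <s> ... </s> (assumption)."""

    def encode(self, text: str) -> List[str]:
        return ["<s>"] + text.split() + ["</s>"]


def toy_size_subtrees(node: ToyNode, tokenizer: ToyTokenizer, extra_tokens: int = 1) -> int:
    """Same recursion as size_subtrees, written against the toy classes."""
    size = len(tokenizer.encode(node.utt.text)) + extra_tokens
    for child in node.children:
        # Drop the child's own <s> </s> pair; the parent already counted one pair.
        size += toy_size_subtrees(child, tokenizer, extra_tokens) - 2
    node.utt.meta["subtree_size"] = size
    return size


leaf_a = ToyNode(ToyUtterance("I disagree"))           # 2 words
leaf_b = ToyNode(ToyUtterance("good point overall"))   # 3 words
root = ToyNode(ToyUtterance("man made things are natural"), [leaf_a, leaf_b])  # 5 words

toy_size_subtrees(root, ToyTokenizer())
print(leaf_a.utt.meta["subtree_size"])  # 2 words + 2 special + 1 extra = 5
print(leaf_b.utt.meta["subtree_size"])  # 3 words + 2 special + 1 extra = 6
print(root.utt.meta["subtree_size"])    # 10 words + 3 extra + one <s> </s> pair = 15
```

With a real tokenizer the per-utterance counts will differ, and the `-2` correction only holds if `encode()` adds exactly two special tokens per call.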

In [None]:
def subtree_generator(tokenizer: transformers.PreTrainedTokenizer,
                      corpus: convokit.model.Corpus,
                      max_token_length: int = 4096):
    """A generator that yields nodes n such that the combined number of tokens
    in all the utterances of the subtree rooted at n is at most
    max_token_length.
    """

    def gen_subtrees(root: convokit.model.UtteranceNode):
        if root.utt.meta["subtree_size"] <= max_token_length:
            yield root
        else:
            # Subtree too large: descend and try each child's subtree instead.
            for child in root.children:
                yield from gen_subtrees(child)

    for elem in corpus.iter_objs("conversation"):
        root = elem.get_subtree(elem.id)
        root = size_subtrees(root, tokenizer)
        yield from gen_subtrees(root)
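
To see which nodes the top-down splitting in `gen_subtrees` selects, here is a minimal sketch on a hand-built tree. It assumes the subtree sizes are already known; `SizedNode` and `toy_gen_subtrees` are hypothetical names used only for illustration, and the sizes are arbitrary numbers rather than real token counts.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class SizedNode:
    """Hypothetical node carrying a precomputed subtree size (in tokens)."""
    name: str
    subtree_size: int
    children: List["SizedNode"] = field(default_factory=list)


def toy_gen_subtrees(node: SizedNode, max_token_length: int) -> Iterator[SizedNode]:
    """Yield the highest nodes whose whole subtree fits within the budget."""
    if node.subtree_size <= max_token_length:
        yield node
    else:
        # Too large: skip this node and try each child's subtree instead.
        for child in node.children:
            yield from toy_gen_subtrees(child, max_token_length)


tree = SizedNode("post", 100, [
    SizedNode("reply-1", 30),
    SizedNode("reply-2", 65, [
        SizedNode("reply-2a", 40),
        SizedNode("reply-2b", 20),
    ]),
])

selected = [n.name for n in toy_gen_subtrees(tree, max_token_length=50)]
print(selected)  # ['reply-1', 'reply-2a', 'reply-2b']
```

Note one consequence of this scheme: a node whose subtree exceeds the budget is itself never yielded, so its own text is dropped, and an oversized leaf produces nothing at all.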
    

## Experiments & Tryouts

In [51]:
for elem in corpus.iter_objs("conversation"):
    while True:
        
        for att in dir(elem):
            if not att.startswith("__"):
                print(att)
        
        print("\n\n")
        
        subtree = elem.get_subtree(elem.id)
        print(type(subtree))
        
        for att in dir(subtree):
            if not att.startswith("__"):
                print(att)
        
        #Can store new attributes here
        subtree.utt.meta["subtree_size"] = 4096
        
        print(subtree.utt.meta)

        print("\n\n")

        for child in subtree.children:
            print("Child type:", type(child))
            break
        
        print("\n\n")

        # Use a separate name so the outer loop variable `elem` is not shadowed.
        for node in subtree.pre_order():
            print(node.utt.text)
            break
                
        break
    break

_add_utterance
_get_path_from_leaf_to_root
_id
_owner
_print_convo_helper
_speaker_ids
_utterance_ids
add_meta
add_vector
check_integrity
delete_vector
get_chronological_speaker_list
get_chronological_utterance_list
get_id
get_info
get_longest_paths
get_owner
get_root_to_leaf_paths
get_speaker
get_speaker_ids
get_speakers_dataframe
get_subtree
get_user
get_usernames
get_utterance
get_utterance_ids
get_utterances_dataframe
get_vector
has_vector
id
init_meta
initialize_tree_structure
iter_speakers
iter_users
iter_utterances
meta
obj_type
owner
print_conversation_stats
print_conversation_structure
retrieve_meta
set_id
set_info
set_owner
traverse
tree
vectors



<class 'convokit.model.utteranceNode.UtteranceNode'>
bfs_traversal
children
dfs_traversal
post_order
pre_order
set_children
utt
{'pair_ids': [], 'success': None, 'approved_by': None, 'author_flair_css_class': None, 'author_flair_text': None, 'banned_by': None, 'controversiality': None, 'distinguished': None, 'downs': None, 'edited'