In this demo, we show how HyperConvo's hypergraph features can be used for various predictive tasks.

In [1]:
import convokit
from convokit import Corpus, HyperConvo, download
import pickle

### Setting up the corpus

In [2]:
corpus = Corpus(filename=download("reddit-corpus-small", use_local=True))

Dataset already exists at /Users/seanzhangkx/.convokit/downloads/reddit-corpus-small


By default, each Conversation in this Corpus corresponds to a full Reddit thread (i.e. the starting post and subsequent comments). In this demo, we want to exclude the post itself from the analysis, and focus only on conversation threads that begin with the 'top-level comments' in the Reddit threads.

To do that, we reindex the corpus so that each conversation begins from a top-level comment instead of the post.

In [3]:
top_level_utterance_ids = [
    utt.id for utt in corpus.iter_utterances() if utt.id == utt.meta["top_level_comment"]
]

In [4]:
len(top_level_utterance_ids)

10000

In [5]:
threads_corpus = corpus.reindex_conversations(
    source_corpus=corpus,
    new_convo_roots=top_level_utterance_ids,
    preserve_convo_meta=True,
    preserve_corpus_meta=False,
)
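
As a rough illustration of what this reindexing does, the sketch below groups the comments of a toy thread under their top-level comments and drops the post. It is not ConvoKit's implementation: the `utterances` data and the `top_level_comment` helper are hypothetical stand-ins for the `top_level_comment` metadata field used above.

```python
from collections import defaultdict

# Toy utterances: (utterance id, id of the utterance it replies to).
# "post" is the root of the thread; "c1" and "c2" are top-level comments.
utterances = [
    ("post", None),
    ("c1", "post"),
    ("c2", "post"),
    ("c3", "c1"),
    ("c4", "c3"),
    ("c5", "c2"),
]
reply_to = dict(utterances)


def top_level_comment(utt_id):
    """Walk up the reply chain until reaching a direct reply to the post."""
    if reply_to[utt_id] is None:
        return None  # the post itself has no top-level comment
    while reply_to[reply_to[utt_id]] is not None:
        utt_id = reply_to[utt_id]
    return utt_id


# Group every comment under its top-level comment; the post is dropped.
threads = defaultdict(list)
for utt_id, _ in utterances:
    root = top_level_comment(utt_id)
    if root is not None:
        threads[root].append(utt_id)

print(dict(threads))
```

Each key of `threads` plays the role of a new conversation root, which is what `new_convo_roots` specifies in the call above.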

### Annotating dataset with predictive features: hyperconvo, volume, BoW, reply-tree

For our classification tasks (elaborated on below), we want to set up suitable predictive features for use in fitting a classifier. As our tasks will be structural in nature (i.e. focused on predicting certain structural outcomes in the threads), we annotate the corpus with the following sets of features:

- hyperconvo: The HyperConvo graph features
- volume: The number of unique speakers in the first 10 comments of the conversation
- BoW: A Bag-of-Words representation of the comments in the conversation
- reply-tree: A subset of HyperConvo features that are focused on comment-to-comment (c->c) features

#### HyperConvo features

In [6]:
hc = HyperConvo(prefix_len=10, min_convo_len=10, invalid_val=-1)
hc.fit_transform(threads_corpus)
feats = list(threads_corpus.get_vector_matrix("hyperconvo").columns)

(Note: By design of the corpus, every Conversation thread has at least 10 comments in it and will thus be of sufficient length to have a HyperConvo vector computed for it.)

In [7]:
feats[:3]

['max[indegree over c->c responses]',
 'argmax[indegree over c->c responses]',
 'norm.max[indegree over c->c responses]']
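
For intuition about feature names like these, the toy sketch below computes the indegree of each comment in a small reply tree, then its maximum and the position of that maximum. This is only an illustration under assumed definitions (indegree as the number of replies a comment receives, argmax as a chronological position); HyperConvo's exact definitions may differ, and the edge list is hypothetical.

```python
from collections import Counter

# Hypothetical reply edges: (replying comment, comment replied to).
c_to_c_replies = [("c2", "c1"), ("c3", "c1"), ("c4", "c2"), ("c5", "c1")]

# Indegree of a comment = number of replies it receives (assumed definition).
indegree = Counter(target for _, target in c_to_c_replies)

comments = ["c1", "c2", "c3", "c4", "c5"]  # chronological order
degrees = [indegree[c] for c in comments]  # Counter returns 0 for missing keys

max_indegree = max(degrees)                    # largest number of replies received
argmax_indegree = degrees.index(max_indegree)  # position of that comment

print(degrees, max_indegree, argmax_indegree)
```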

#### Reply-tree features

In [8]:
hyperconvo_matrix = threads_corpus.get_vector_matrix("hyperconvo")

In [9]:
reply_tree_matrix = hyperconvo_matrix.subset(
    columns=[c for c in hyperconvo_matrix.columns if "c->c" in c]
)

In [10]:
reply_tree_matrix.name = "reply-tree"

In [11]:
threads_corpus.append_vector_matrix(reply_tree_matrix)

In [12]:
for convo in threads_corpus.iter_conversations():
    if convo.has_vector("hyperconvo"):
        convo.add_vector("reply-tree")

#### Volume

In [13]:
## volume is the number of unique users in the first 10 comments
for convo in threads_corpus.iter_conversations():
    convo.meta["volume"] = len(
        set([utt.speaker for utt in convo.get_chronological_utterance_list()[:10]])
    )

We will add **BoW** vectors later in the notebook.

### Predictive tasks

Based on the first 10 utterances in the Conversation, we want to predict:
1. *Comment-growth*: Will the conversation grow to have at least 15 utterances?
2. *Commenter-growth*: Will the total number of participants double in the next 10 utterances (i.e. 11th-20th utterances)?

First, we annotate Conversations with binary values indicating whether they exhibited comment-growth and whether they exhibited commenter-growth.

In [14]:
for convo in threads_corpus.iter_conversations():
    convo.meta["comment-growth"] = len(list(convo.iter_utterances())) >= 15

    convo_utts = convo.get_chronological_utterance_list()
    if len(convo_utts) >= 20:
        first_10_spkrs = len(set([utt.speaker.id for utt in convo_utts[:10]]))
        first_20_spkrs = len(set([utt.speaker.id for utt in convo_utts[:20]]))
        convo.meta["commenter-growth"] = (first_20_spkrs / first_10_spkrs) >= 2.0
    else:
        convo.meta["commenter-growth"] = None
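
The labeling rule in this cell can be checked in isolation on plain lists of speaker ids. The sketch below restates the same logic without ConvoKit; `growth_labels` and the example threads are hypothetical names introduced for illustration.

```python
def growth_labels(speaker_ids):
    """speaker_ids: chronological list of speaker ids, one per utterance."""
    comment_growth = len(speaker_ids) >= 15
    if len(speaker_ids) >= 20:
        first_10 = len(set(speaker_ids[:10]))
        first_20 = len(set(speaker_ids[:20]))
        commenter_growth = (first_20 / first_10) >= 2.0
    else:
        commenter_growth = None  # too short to evaluate commenter-growth
    return comment_growth, commenter_growth


# 12 utterances by two speakers: too short for either kind of growth.
short_thread = ["a", "b"] * 6
# 20 utterances: 2 speakers in the first 10, 7 speakers in the first 20.
growing_thread = ["a", "b"] * 5 + ["c", "d", "e", "f", "g"] * 2
# 20 utterances, all by the same two speakers.
stable_thread = ["a", "b"] * 10

for thread in (short_thread, growing_thread, stable_thread):
    print(len(thread), growth_labels(thread))
```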

In [15]:
threads_corpus.random_conversation().meta

ConvoKitMeta({'original_convo_meta': {'title': 'Coming Soon: Sibling Rivalry Podcast Season 2 with Bob The Drag Queen &amp; Monét X Change', 'num_comments': 19, 'domain': 'youtube.com', 'timestamp': 1536033837, 'subreddit': 'rupaulsdragrace', 'gilded': 0, 'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0}, 'stickied': False, 'author_flair_text': 'Miz Cracker'}, 'original_convo_id': '9cs8tg', 'volume': 6, 'comment-growth': False, 'commenter-growth': None})

To control for topical factors (e.g. the subreddit the thread is from, the discussion topic of the post), we carry out a *paired prediction*, where we explicitly compare top-level comment threads that belong to the same post. To do this, we use ConvoKit's [Pairer](https://convokit.cornell.edu/documentation/pairedprediction.html) transformer to set up these pairs of threads for analysis.
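
To make the idea of pairing concrete, here is a minimal sketch of one way such pairs could be formed: threads are grouped by their originating post, and each positive thread is matched with a negative thread from the same post. This is an assumption about the general approach, not Pairer's actual implementation, and all names and data below are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical threads: (thread id, id of the post it belongs to, label).
threads = [
    ("t1", "postA", True),
    ("t2", "postA", False),
    ("t3", "postA", False),
    ("t4", "postB", True),   # no negative thread under postB
    ("t5", "postC", False),  # no positive thread under postC
]

# Bucket threads by post, separately for positive and negative labels.
pos, neg = defaultdict(list), defaultdict(list)
for thread_id, post_id, label in threads:
    (pos if label else neg)[post_id].append(thread_id)

rng = random.Random(0)
pairs = []
# Only posts that have at least one thread of each label can yield a pair.
for post_id in sorted(pos.keys() & neg.keys()):
    rng.shuffle(pos[post_id])
    rng.shuffle(neg[post_id])
    for pos_id, neg_id in zip(pos[post_id], neg[post_id]):
        # A random orientation decides which member is presented first,
        # so that a downstream classifier sees both classes.
        orientation = rng.choice(["pos", "neg"])
        pairs.append((pos_id, neg_id, orientation))

print(pairs)
```

Threads under `postB` and `postC` are left unpaired because they lack a counterpart with the opposite label, which is why the number of valid pairs is typically much smaller than the number of threads.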

In [16]:
from convokit import Classifier, Pairer

In [17]:
pairer_1 = Pairer(
    obj_type="conversation",
    pairing_func=lambda convo: convo.meta["original_convo_id"],
    pos_label_func=lambda convo: convo.meta["comment-growth"],
    neg_label_func=lambda convo: not convo.meta["comment-growth"],
    pair_id_attribute_name="pair_id_1",
    label_attribute_name="pair_obj_1",
    pair_orientation_attribute_name="pair_orientation_1",
)

In [18]:
pairer_1.transform(threads_corpus)

<convokit.model.corpus.Corpus at 0x7f7ea107b850>

In [19]:
pairer_2 = Pairer(
    obj_type="conversation",
    pairing_func=lambda convo: convo.meta["original_convo_id"],
    pos_label_func=lambda convo: convo.meta["commenter-growth"],
    neg_label_func=lambda convo: not convo.meta["commenter-growth"],
    pair_id_attribute_name="pair_id_2",
    label_attribute_name="pair_obj_2",
    pair_orientation_attribute_name="pair_orientation_2",
)
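
One detail worth keeping in mind here: conversations with fewer than 20 utterances carry a commenter-growth label of `None`, and `not None` evaluates to `True` in Python, so a negative label function written as `not label` would presumably accept those conversations as negatives. Whether that is intended is not stated; the sketch below only contrasts that check with a stricter one, using hypothetical helper names.

```python
def loose_neg(label):
    # Mirrors the pattern `not convo.meta["commenter-growth"]`.
    return not label


def strict_neg(label):
    # Only an explicit False counts as a negative example.
    return label is False


for label in (True, False, None):
    print(label, loose_neg(label), strict_neg(label))
```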

In [20]:
pairer_2.transform(threads_corpus)

<convokit.model.corpus.Corpus at 0x7f7ea107b850>

We now add BoW vectors, but only for conversations that were paired.

In [21]:
from convokit import BoWTransformer

In [22]:
bow = BoWTransformer(obj_type="conversation", vector_name="bow_1")
bow.fit_transform(threads_corpus, selector=lambda convo: convo.meta["pair_id_1"] is not None)

Initializing default unigram CountVectorizer...Done.


<convokit.model.corpus.Corpus at 0x7f7ea107b850>

In [23]:
bow2 = BoWTransformer(obj_type="conversation", vector_name="bow_2")
bow2.fit_transform(threads_corpus, selector=lambda convo: convo.meta["pair_id_2"] is not None)

Initializing default unigram CountVectorizer...Done.


<convokit.model.corpus.Corpus at 0x7f7ea107b850>

In [24]:
threads_corpus.vectors

{'bow_1', 'bow_2', 'hyperconvo', 'reply-tree'}
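
For intuition about what the `bow_1` and `bow_2` vectors contain, the toy sketch below builds unigram count vectors by hand. It uses simple whitespace tokenization on hypothetical texts, so it will not match the tokenization of the CountVectorizer that BoWTransformer initializes.

```python
from collections import Counter

# Hypothetical conversation texts, one string per conversation.
texts = ["the cat sat", "the dog sat down", "the cat saw the dog"]

# The vocabulary is the sorted set of all tokens seen in any text.
vocab = sorted({token for text in texts for token in text.split()})

# Each text becomes a vector of token counts, aligned with the vocabulary.
vectors = [[Counter(text.split())[word] for word in vocab] for text in texts]

print(vocab)
for vec in vectors:
    print(vec)
```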

### Comment-growth task: Will the conversation grow to have at least 15 utterances?

In [25]:
from convokit import PairedPrediction, PairedVectorPrediction

### Cross-validated scores for different feature sets

We initialize ConvoKit's PairedPrediction transformers, which combine the pair information added by Pairer with the predictive features and vectors annotated earlier to generate the feature and label data needed to train a classifier.

The *summarize()* function runs a cross-validation analysis using the PairedPrediction's internal classifier and outputs the mean accuracy score.
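
A common way to turn a pair into a single training example is to take the difference between the two feature vectors, with the pair's orientation deciding the sign and the label. The sketch below illustrates that idea under the assumption that this is roughly how the feature and label data are built; `paired_example` is a hypothetical helper, not part of ConvoKit.

```python
def paired_example(pos_feats, neg_feats, orientation):
    """Build one (features, label) example from a positive/negative pair."""
    if orientation == "pos":
        # Positive object first: label 1, features = positive - negative.
        return [p - n for p, n in zip(pos_feats, neg_feats)], 1
    # Negative object first: label 0, features = negative - positive.
    return [n - p for p, n in zip(pos_feats, neg_feats)], 0


pos_feats = [3.0, 1.0]  # hypothetical features of the positive thread
neg_feats = [1.0, 4.0]  # hypothetical features of the negative thread

print(paired_example(pos_feats, neg_feats, "pos"))
print(paired_example(pos_feats, neg_feats, "neg"))
```

Because the orientation is assigned at random, about half of the examples end up with each label, so an accuracy of 0.5 corresponds to chance.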

#### Hyperconvo

Notice that we use 'pair_id_1', 'pair_obj_1', and 'pair_orientation_1' because these correspond to the Pairer annotations for the __comment-growth task__.

In [26]:
pp = PairedVectorPrediction(
    obj_type="conversation",
    vector_name="hyperconvo",
    pair_id_attribute_name="pair_id_1",
    label_attribute_name="pair_obj_1",
    pair_orientation_attribute_name="pair_orientation_1",
)
pp.summarize(threads_corpus)

Found 549 valid pairs.


0.5774812343619683

#### Reply-tree

In [27]:
pp = PairedVectorPrediction(
    obj_type="conversation",
    vector_name="reply-tree",
    pair_id_attribute_name="pair_id_1",
    label_attribute_name="pair_obj_1",
    pair_orientation_attribute_name="pair_orientation_1",
)
pp.summarize(threads_corpus)

Found 549 valid pairs.


0.5973144286905754

#### Volume

In [28]:
pp = PairedPrediction(
    obj_type="conversation",
    pred_feats=["volume"],
    pair_id_attribute_name="pair_id_1",
    label_attribute_name="pair_obj_1",
    pair_orientation_attribute_name="pair_orientation_1",
)
pp.summarize(threads_corpus)

Found 549 valid pairs.


0.5993327773144287

#### Bag-of-words

In [29]:
pp = PairedVectorPrediction(
    obj_type="conversation",
    vector_name="bow_1",
    pair_id_attribute_name="pair_id_1",
    label_attribute_name="pair_obj_1",
    pair_orientation_attribute_name="pair_orientation_1",
)
pp.summarize(threads_corpus)

Found 549 valid pairs.


0.8160633861551293

In [30]:
pp.fit(threads_corpus)

Found 549 valid pairs.


<convokit.paired_prediction.pairedVectorPrediction.PairedVectorPrediction at 0x7f7e360f8820>

### Commenter-growth task: Will the total number of participants double in the next 10 utterances (i.e. 11th-20th utterances)?

#### Hyperconvo

In [31]:
pp = PairedVectorPrediction(
    obj_type="conversation",
    vector_name="hyperconvo",
    pair_id_attribute_name="pair_id_2",
    label_attribute_name="pair_obj_2",
    pair_orientation_attribute_name="pair_orientation_2",
)
pp.summarize(threads_corpus)

Found 306 valid pairs.


0.5655737704918031

#### Reply-tree

In [32]:
pp = PairedVectorPrediction(
    obj_type="conversation",
    vector_name="reply-tree",
    pair_id_attribute_name="pair_id_2",
    label_attribute_name="pair_obj_2",
    pair_orientation_attribute_name="pair_orientation_2",
)
pp.summarize(threads_corpus)

Found 306 valid pairs.


0.5296668429402432

#### Volume

In [33]:
pp = PairedPrediction(
    obj_type="conversation",
    pred_feats=["volume"],
    pair_id_attribute_name="pair_id_2",
    label_attribute_name="pair_obj_2",
    pair_orientation_attribute_name="pair_orientation_2",
)
pp.summarize(threads_corpus)

Found 306 valid pairs.


0.5849814912744579

#### Bag-of-words

In [34]:
pp = PairedVectorPrediction(
    obj_type="conversation",
    vector_name="bow_2",
    pair_id_attribute_name="pair_id_2",
    label_attribute_name="pair_obj_2",
    pair_orientation_attribute_name="pair_orientation_2",
)
pp.summarize(threads_corpus)

Found 306 valid pairs.


0.7386567953463776