This demo shows how to:

1. Annotate a Corpus's utterances with their bag-of-words vector representations
2. Use these bag-of-words vectors in predictive tasks

For an introduction to vectors in ConvoKit, check out this [demo](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/vectors/vector_demo.ipynb) first.

In [1]:
import convokit

In [2]:
from convokit import Corpus, download

In [3]:
corpus = Corpus(filename=download('subreddit-Cornell'))

Dataset already exists at /Users/calebchiam/.convokit/downloads/subreddit-Cornell


In [4]:
corpus.print_summary_stats()

Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


### Annotating the Corpus with bag-of-words vectors

To do this, we use ConvoKit's [Bag-of-words Transformer](https://convokit.cornell.edu/documentation/bow.html) and set it to vectorize the Corpus's utterances.

In [5]:
from convokit import BoWTransformer

In [6]:
bow_transformer = BoWTransformer(obj_type="utterance")

Initializing default unigram CountVectorizer...Done.
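
To make the idea of a unigram count vector concrete, here is a minimal pure-Python sketch. It is not ConvoKit's or scikit-learn's implementation: the whitespace tokenizer, the helper names (`build_vocabulary`, `bow_vector`) and the toy texts are all assumptions made for illustration, and a real `CountVectorizer` tokenizes differently.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each distinct lowercased, whitespace-separated token to a column index."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def bow_vector(text, vocabulary):
    """Return a dense count vector for one text; tokens outside the vocabulary are ignored."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector

# Hypothetical utterance texts, used only for this sketch
toy_texts = ["the exam was hard", "the the exam"]
toy_vocab = build_vocabulary(toy_texts)
toy_vectors = [bow_vector(text, toy_vocab) for text in toy_texts]
```

Each text becomes one row of counts over a shared vocabulary, which is the shape of data that the transformer attaches to the Corpus as a vector matrix.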


Note that a custom text vectorizer can be supplied through the vectorizer parameter:

e.g. BoWTransformer(obj_type="utterance", *vectorizer*=...)
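
As a sketch of what such a custom vectorizer might look like, the block below builds a scikit-learn `CountVectorizer` that uses unigrams and bigrams and records presence rather than counts. These settings and the toy texts are assumptions chosen for illustration, not recommendations from the demo. The ConvoKit call is left as a comment so that the block runs with scikit-learn alone.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical settings: unigrams plus bigrams, binary presence instead of counts
custom_vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)

# Quick check on toy texts to see which features this configuration produces
toy_texts = ["prelim was hard", "prelim was was easy"]
toy_matrix = custom_vectorizer.fit_transform(toy_texts)
vocab = custom_vectorizer.vocabulary_  # maps each n-gram to its column index

# With ConvoKit installed, a fresh vectorizer with the same settings would be
# passed in through the vectorizer parameter described above:
# bow_transformer = BoWTransformer(
#     obj_type="utterance",
#     vectorizer=CountVectorizer(ngram_range=(1, 2), binary=True),
# )
```

Inspecting `vocab` on a handful of texts is a cheap way to confirm that the tokenization and n-gram range behave as intended before vectorizing a whole Corpus.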

Let's inspect one of the Corpus utterances to see the changes that get made.

In [7]:
# before transformation
corpus.get_utterance('dsbgljl').vectors

[]

In [8]:
bow_transformer.fit_transform(corpus)

<convokit.model.corpus.Corpus at 0x7f8e991c5f90>

In [9]:
# after transformation
corpus.get_utterance('dsbgljl').vectors

['bow_vector']

The Corpus now has a new vector matrix associated with it.

In [10]:
corpus.vectors

{'bow_vector'}

In [11]:
corpus.get_vector_matrix('bow_vector')

ConvoKitMatrix('name': bow_vector, 'matrix': <74467x9340 sparse matrix of type '<class 'numpy.int64'>'
	with 2108383 stored elements in Compressed Sparse Row format>)
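
The matrix summary above also tells us how sparse the representation is. The short calculation below uses only the numbers printed in that summary, and assumes that every stored element is a non-zero count.

```python
n_rows, n_cols = 74467, 9340   # utterances x vocabulary size, from the summary above
n_stored = 2108383             # stored elements, from the summary above

# Fraction of matrix cells that hold a value
density = n_stored / (n_rows * n_cols)

# Average number of distinct vocabulary terms per utterance
avg_terms_per_utterance = n_stored / n_rows

print(f"density: {density:.4%}")
print(f"average distinct terms per utterance: {avg_terms_per_utterance:.1f}")
```

With well under 1% of cells filled, storing the matrix in a sparse format is what keeps it manageable in memory.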

### Predictive task: will an utterance (i.e. Reddit comment) have a positive score?

We want to predict whether an utterance will have a positive score (i.e. more upvotes than downvotes) based on its bag-of-words vector.

Inspecting a random utterance, we see that it has a 'score' metadata attribute.

In [12]:
corpus.random_utterance().meta

{'score': 2,
 'top_level_comment': 'c3d45vg',
 'retrieved_on': 1428110290,
 'gilded': 0,
 'gildings': None,
 'subreddit': 'Cornell',
 'stickied': False,
 'permalink': '',
 'author_flair_text': ''}

We then use ConvoKit's VectorClassifier to train a classifier that predicts whether an utterance's score is positive. The labeller specifies the binary y value that the internal model should predict, while vector_name specifies the vector feature set (i.e. the X data) used to train the classifier.

In [13]:
from convokit import VectorClassifier

In [14]:
bow_classifier = VectorClassifier(obj_type="utterance", 
                                  vector_name='bow_vector',
                                  labeller=lambda utt: utt.meta['score'] > 0)

Initialized default classification model (standard scaled logistic regression).
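
To illustrate what the labeller contributes, the sketch below applies the same `score > 0` rule to a few stand-in objects. The dictionaries are hypothetical substitutes for ConvoKit utterances, so the lambda indexes them with `["meta"]` rather than `.meta`.

```python
# Hypothetical stand-ins for utterances; only the 'score' metadata field matters here
toy_utterances = [
    {"id": "a", "meta": {"score": 2}},
    {"id": "b", "meta": {"score": 0}},
    {"id": "c", "meta": {"score": -3}},
    {"id": "d", "meta": {"score": 1}},
]

# Same rule as the labeller above: a label is True only for strictly positive scores
labeller = lambda utt: utt["meta"]["score"] > 0

# The y values a classifier would be trained to predict
y = [labeller(utt) for utt in toy_utterances]
```

Note that a score of exactly 0 is labelled False under this rule.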


In [15]:
# This fit_transform() step fits the classifier and then uses it to compute predictions for all the 
# utterances in the Corpus
bow_classifier.fit_transform(corpus)

<convokit.model.corpus.Corpus at 0x7f8e991c5f90>

In [16]:
# A DataFrame summary of the computed predictions
bow_classifier.summarize(corpus).head(10)

| id | prediction | pred_score |
|---|---|---|
| dhhm9sa | True | 1.0 |
| dw553ml | True | 1.0 |
| dvzmhdx | True | 1.0 |
| dvzpp79 | True | 1.0 |
| dw0imao | True | 1.0 |
| c3bsi2g | True | 1.0 |
| dw0mm3b | True | 1.0 |
| d5pddzi | True | 1.0 |
| dw25pga | True | 1.0 |
| 5om61s | True | 1.0 |


We can then inspect the coefficient weights assigned to the bag-of-words n-grams.

In [17]:
# The ngrams weighted most positively (i.e. utterances with these ngrams are more likely to have positive scores)
bow_classifier.get_coefs(feature_names=corpus.get_vector_matrix('bow_vector').columns).head()

| feat_name | coef |
|---|---|
| hotels | 1.270001 |
| hbhs | 1.11569 |
| engine | 1.109702 |
| involves | 1.081836 |
| lincoln | 1.071464 |


In [18]:
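# The ngrams weighted most negatively (i.e. utterances with these ngrams are less likely to have positive scores)
bow_classifier.get_coefs(feature_names=bow_transformer.get_vocabulary()).tail()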

| feat_name | coef |
|---|---|
| mahogany | -0.667785 |
| ignoreme | -0.722992 |
| hilton | -0.742234 |
| binary | -0.764383 |
| creation | -0.784593 |


### Evaluation metrics

In [19]:
# The base accuracy by predicting all objects to have the majority label, i.e. has positive score
bow_classifier.base_accuracy(corpus)

# 92.8% of the corpus utterances already have a positive score

0.9279546644822538
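
The base accuracy can be reproduced by hand as the share of the majority label. The sketch below is not ConvoKit's implementation; the helper name is hypothetical, and the label counts (69102 positive, 5365 non-positive) are taken from the support column of the classification report in this section.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Label counts taken from the classification report's support column
labels = [True] * 69102 + [False] * 5365
baseline = majority_baseline_accuracy(labels)
print(round(baseline, 4))
```

Any accuracy figure for this task should be read against this baseline, since a model that ignores its input entirely already scores about 92.8%.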

In [20]:
# Our classifier's accuracy on the Corpus
bow_classifier.accuracy(corpus)

0.9491452589737737

In [21]:
print(bow_classifier.classification_report(corpus))

              precision    recall  f1-score   support

       False       0.88      0.34      0.49      5365
        True       0.95      1.00      0.97     69102

    accuracy                           0.95     74467
   macro avg       0.91      0.67      0.73     74467
weighted avg       0.95      0.95      0.94     74467
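
The F1 and macro-average figures in the report follow from the per-class precision and recall. As a worked check, the sketch below recomputes them from the rounded values printed above, so the results agree with the report only to two decimal places.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded per-class values from the report above
f1_false = f1_from(0.88, 0.34)
f1_true = f1_from(0.95, 1.00)

# Macro averages weight both classes equally, regardless of support
macro_recall = (0.34 + 1.00) / 2
macro_f1 = (f1_false + f1_true) / 2
```

The low recall on the False class (0.34) is what pulls the macro averages well below the overall accuracy: the classifier finds only about a third of the comments with non-positive scores.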



## Bag-of-words prediction for Conversations

Just as utterances have bag-of-words vectors, we might imagine Conversations and Speakers having bag-of-words vectors as well, where:
- The text of a Conversation is the *combined* texts of all the Utterances within it
- The text of a Speaker is the *combined* texts of all the Utterances made by the Speaker

BoWTransformer provides native support for such vectorizations. In this example, we predict whether a Conversation will eventually double in length (grow from 5 utterances to at least 10) or stay at 5 utterances, based on the bag-of-words representation of its first five utterances.
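
The "combined text" idea can be sketched without ConvoKit. In the block below, the tuples, the `combined_text` helper and the toy texts are all hypothetical; the optional `first_n` argument mirrors the idea of keeping only the earliest utterances of each group.

```python
from collections import defaultdict

# Hypothetical utterances: (conversation_id, speaker_id, timestamp, text)
toy_utterances = [
    ("c1", "alice", 1, "anyone taking cs 2110"),
    ("c1", "bob", 2, "yes it is great"),
    ("c2", "alice", 3, "where is the dining hall"),
    ("c1", "alice", 4, "thanks"),
]

def combined_text(utterances, key_index, first_n=None):
    """Join utterance texts chronologically, grouped by the field at key_index."""
    groups = defaultdict(list)
    for utt in sorted(utterances, key=lambda u: u[2]):
        groups[utt[key_index]].append(utt[3])
    # texts[:None] keeps every text, texts[:n] keeps only the first n
    return {key: " ".join(texts[:first_n]) for key, texts in groups.items()}

convo_texts = combined_text(toy_utterances, key_index=0)
speaker_texts = combined_text(toy_utterances, key_index=1)
first_two_per_convo = combined_text(toy_utterances, key_index=0, first_n=2)
```

Restricting the text to the first few utterances matters for prediction tasks like this one, because text from later in the thread would leak information about how long the thread became.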

### Preprocessing

As r/Cornell's Conversations begin with the thread post (instead of only comments in the thread), we reindex the Conversations to begin with the top-level comments in each thread. This is a necessary step as our focus is on whether or not a **comment thread** will double in length.

In [22]:
top_level_comment_ids = [utt.id for utt in corpus.iter_utterances() if utt.id == utt.meta['top_level_comment']]

In [23]:
corpus.print_summary_stats()

Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


In [24]:
len(top_level_comment_ids)

32893

In [25]:
threads_corpus = corpus.reindex_conversations(new_convo_roots=top_level_comment_ids)


['c3ocsyl', 'c3p1rn8', 'c3oyf4d', 'c3p8bze', 'c3od15i']


In [26]:
threads_corpus.print_summary_stats()

Number of Speakers: 6160
Number of Utterances: 63697
Number of Conversations: 32888
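
One way to picture the reindexing is as regrouping each utterance under the top-level comment it descends from, with the thread post itself left out. The sketch below illustrates that idea on a toy reply tree. It is an assumption about the effect of the step, not a description of how `reindex_conversations` is implemented, and all ids are hypothetical.

```python
# Hypothetical thread: utterance id -> id of the utterance it replies to (None for the post)
reply_to = {
    "post": None,
    "c1": "post",
    "c2": "post",
    "c1a": "c1",
    "c1b": "c1a",
    "c2a": "c2",
}

def top_level_comment(utt_id):
    """Walk up the reply chain until reaching a direct reply to the post."""
    if reply_to[utt_id] is None:
        return None  # the post itself belongs to no comment thread
    while reply_to[reply_to[utt_id]] is not None:
        utt_id = reply_to[utt_id]
    return utt_id

# Group every comment under its top-level comment
threads = {}
for utt_id in reply_to:
    root = top_level_comment(utt_id)
    if root is not None:
        threads.setdefault(root, []).append(utt_id)
```

In this toy example one Conversation rooted at the post becomes two comment threads, which is consistent with the number of Conversations rising and the number of utterances falling in the summary statistics above.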


#### Label annotation for whether the thread doubles in length 

In [27]:
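for thread in threads_corpus.iter_conversations():
    thread_len = len(list(thread.iter_utterances()))
    if thread_len == 5:
        # exactly 5 utterances: the thread did not grow
        thread.meta['thread_doubles'] = False
    elif thread_len >= 10:
        # at least 10 utterances: the thread doubled from its first 5
        thread.meta['thread_doubles'] = True
    else:
        # fewer than 5, or between 6 and 9, utterances: excluded from the task
        thread.meta['thread_doubles'] = None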
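
The labelling rule is easiest to check as a standalone function of thread length. The function name below is hypothetical; the thresholds are the ones used in the cell above.

```python
def thread_doubles_label(thread_len):
    """Return False for threads of exactly 5 utterances, True for 10 or more, else None."""
    if thread_len == 5:
        return False
    if thread_len >= 10:
        return True
    return None  # too short, or in the ambiguous 6-9 range

# Labels for a few example thread lengths
labels = {n: thread_doubles_label(n) for n in (3, 5, 7, 10, 14)}
```

Threads labelled None are neither positive nor negative examples, which is why the later steps filter on `thread_doubles is not None`.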

#### BoW annotation of first 5 utterances

In [28]:
# We set our BoWTransformer to use only the first 5 utterances in the Conversation by configuring 'text_func'
bow_transformer2 = BoWTransformer(obj_type="conversation", vector_name='bow_vector_2',
                text_func=lambda convo: ' '.join([utt.text for utt in convo.get_chronological_utterance_list()[:5]])
                                 )

Initializing default unigram CountVectorizer...Done.


In [29]:
bow_transformer2.fit_transform(threads_corpus, selector=lambda convo: convo.meta['thread_doubles'] is not None)

<convokit.model.corpus.Corpus at 0x7f8ea1a7e6d0>

In [30]:
threads_corpus.vectors

{'bow_vector_2'}

#### Training the Classifier

In [31]:
bow_classifier2 = VectorClassifier(obj_type="conversation", vector_name='bow_vector_2',
                                   labeller=lambda convo: convo.meta['thread_doubles'])

Initialized default classification model (standard scaled logistic regression).


In [32]:
bow_classifier2.fit_transform(threads_corpus, selector=lambda convo: convo.meta['thread_doubles'] is not None)

<convokit.model.corpus.Corpus at 0x7f8ea1a7e6d0>

In [33]:
summary = bow_classifier2.summarize(threads_corpus, selector=lambda convo: convo.meta['thread_doubles'] is not None)

In [34]:
summary.head()

| id | prediction | pred_score |
|---|---|---|
| dt05qyf | True | 1.0 |
| dandio0 | True | 1.0 |
| dwa6k96 | True | 1.0 |
| dsldpxg | True | 1.0 |
| e70wjy3 | True | 1.0 |


In [35]:
summary.tail()

| id | prediction | pred_score |
|---|---|---|
| drduxx1 | False | 2.465871e-12 |
| dl7q7n2 | False | 8.168132e-14 |
| dxfib8r | False | 2.717009e-15 |
| dwqaa06 | False | 2.680858e-16 |
| d8y9akn | False | 1.600627e-16 |


In [36]:
bow_classifier2.base_accuracy(threads_corpus, selector=lambda convo: convo.meta['thread_doubles'] is not None)

0.6761904761904762

In [37]:
bow_classifier2.accuracy(threads_corpus, selector=lambda convo: convo.meta['thread_doubles'] is not None)

0.9992063492063492

In [38]:
print(bow_classifier2.classification_report(threads_corpus, selector=lambda convo: convo.meta['thread_doubles'] is not None))

              precision    recall  f1-score   support

       False       1.00      1.00      1.00       852
        True       1.00      1.00      1.00       408

    accuracy                           1.00      1260
   macro avg       1.00      1.00      1.00      1260
weighted avg       1.00      1.00      1.00      1260



In this artificial setup, our bag-of-words classifier achieves very high accuracy because it is evaluated on the same data it was trained on. With a proper train-test split, the classifier performs much worse. Setting up such a train-test evaluation is straightforward:



In [39]:
from sklearn.model_selection import train_test_split

In [40]:
# consider only conversations with exactly 5 or at least 10 utterances, i.e. from earlier,
# this is any conversation that has thread_doubles with a value that is not None.
valid_convos = list(threads_corpus.iter_conversations(lambda convo: convo.meta['thread_doubles'] is not None))

In [41]:
len(valid_convos)

1260

In [42]:
threads_corpus.print_summary_stats()

Number of Speakers: 6160
Number of Utterances: 63697
Number of Conversations: 32888


In [43]:
train_convos, test_convos = train_test_split(valid_convos, test_size=0.2)

In [44]:
print(len(train_convos), len(test_convos))

1008 252
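
The split above is random, so the exact train and test membership changes between runs; scikit-learn's `train_test_split` accepts a `random_state` argument to fix it. The standard-library sketch below shows the same idea of a seeded shuffle followed by a cut. The helper name, the seed value and the placeholder ids are assumptions, and the rounding rule may differ from scikit-learn's for sizes that do not divide evenly.

```python
import random

def seeded_split(items, test_size=0.2, seed=0):
    """Shuffle a copy of items with a fixed seed, then cut off the test portion."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder ids standing in for the 1260 valid conversations
toy_ids = [f"convo_{i}" for i in range(1260)]
train_ids, test_ids = seeded_split(toy_ids, test_size=0.2, seed=0)
print(len(train_ids), len(test_ids))
```

Fixing the seed makes reported test results repeatable, which matters here because the test set contains only 252 conversations and scores can shift noticeably between splits.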


In [45]:
for convo in train_convos:
    convo.meta['train_test_type'] = 'train'
    
for convo in test_convos:
    convo.meta['train_test_type'] = 'test'

# any other convo not part of the train/test split should have the metadata attribute value set to None
for convo in threads_corpus.iter_conversations():
    if 'train_test_type' not in convo.meta:
        convo.meta['train_test_type'] = None

In [46]:
# Fit the classifier only on train data
bow_classifier2.fit(threads_corpus, selector=lambda convo: convo.meta['train_test_type'] == 'train')

<convokit.classifier.vectorClassifier.VectorClassifier at 0x7f8ea28c91d0>

In [47]:
# Evaluating the classifier on test data

# First annotate the conversation with the prediction
bow_classifier2.transform(threads_corpus, selector=lambda convo: convo.meta['train_test_type'] == 'test')

# Then summarize the predictions made for the test conversations
bow_classifier2.summarize(threads_corpus, selector=lambda convo: convo.meta['train_test_type'] == 'test')

| id | prediction | pred_score |
|---|---|---|
| dandio0 | True | 1.000000e+00 |
| d7247x6 | True | 1.000000e+00 |
| dn521re | True | 9.999998e-01 |
| dkbth1f | True | 9.999970e-01 |
| cebb5so | True | 9.999454e-01 |
| ... | ... | ... |
| d3rfzjm | False | 1.442058e-08 |
| c7h95bx | False | 1.092930e-08 |
| cx87pi5 | False | 3.321833e-09 |
| dxfib8r | False | 2.046636e-14 |


In [48]:
print(bow_classifier2.classification_report(threads_corpus, 
                                            selector=lambda convo: convo.meta['train_test_type'] == 'test'))

              precision    recall  f1-score   support

       False       0.63      0.76      0.69       156
        True       0.41      0.27      0.33        96

    accuracy                           0.58       252
   macro avg       0.52      0.52      0.51       252
weighted avg       0.55      0.58      0.55       252



### Other evaluation methods

In [49]:
bow_classifier2.evaluate_with_cv(threads_corpus, selector=lambda convo: convo.meta['thread_doubles'] is not None)

Running a cross-validated evaluation...Done.


array([0.6031746 , 0.63095238, 0.66666667, 0.63095238, 0.67460317])
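
The per-fold accuracies are easier to compare with other numbers once summarized. The sketch below computes their mean and sample standard deviation from the values printed above.

```python
from statistics import mean, stdev

# Per-fold accuracies copied from the output above
fold_accuracies = [0.6031746, 0.63095238, 0.66666667, 0.63095238, 0.67460317]

cv_mean = mean(fold_accuracies)
cv_std = stdev(fold_accuracies)  # sample standard deviation across folds
print(f"{cv_mean:.3f} +/- {cv_std:.3f}")
```

The mean of roughly 0.64 sits below the base accuracy of about 0.676 reported earlier for the same selection of conversations, so under cross-validation this classifier does not beat always predicting the majority label.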

In [50]:
bow_classifier2.evaluate_with_train_test_split(threads_corpus, 
                                               selector=lambda convo: convo.meta['thread_doubles'] is not None,
                                               test_size=0.2)

Running a train-test-split evaluation...
Done.


(0.6071428571428571,
 array([[125,  46],
        [ 53,  28]]))
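
The first value returned is the accuracy and the second is a confusion matrix, and the two are consistent: the accuracy is the sum of the diagonal divided by the total. The check below uses the numbers printed above. The per-class reading in the comments assumes scikit-learn's convention (rows are true labels, columns are predictions, ordered False then True), which this demo does not state.

```python
# Confusion matrix copied from the output above
confusion = [[125, 46],
             [53, 28]]

total = sum(sum(row) for row in confusion)
correct = confusion[0][0] + confusion[1][1]  # diagonal entries are correct predictions
accuracy = correct / total

# Under the assumed convention (rows = true labels, columns = predictions):
# recall on the True class = correctly predicted True / all actually True
recall_true = confusion[1][1] / (confusion[1][0] + confusion[1][1])
```

The accuracy does not depend on which axis holds the true labels, but the per-class recall does, so it should be read with that assumption in mind.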

This concludes the demo. Check out [our other demo on predicting comment-growth and commenter-growth](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/hyperconvo/predictive_tasks.ipynb) to see how bag-of-words vectors can be used in a paired predictive setting.