Computing Surprise With ConvoKit
=====================
This notebook provides a demo of how to use the Surprise transformer to compute surprise across a corpus. In this demo, we will use the Surprise transformer to compute Speaker Convo Diversity, a measure of how surprising a speaker's participation in one conversation is compared to their participation in all other conversations.

In [1]:
import convokit
import itertools
import numpy as np
import spacy
from convokit import Corpus, download, Surprise
from convokit.text_processing import TextProcessor, TextParser
from sklearn.feature_extraction.text import CountVectorizer

Step 1: Load a corpus
--------
We will use data from the subreddit r/Cornell to demonstrate the functionality of this transformer.

In [2]:
corpus = Corpus(filename=download('subreddit-Cornell'))

Dataset already exists at /home/axl4/.convokit/downloads/subreddit-Cornell


In [3]:
corpus.print_summary_stats()

Number of Speakers: 7568
Number of Utterances: 74467
Number of Conversations: 10744


To speed up the demo, we will keep only the 100 most active speakers (ranked by the number of conversations they participate in).

In [4]:
SPEAKER_BLACKLIST = ['[deleted]', 'DeltaBot', 'AutoModerator']
def utterance_is_valid(utterance):
    return utterance.speaker.id not in SPEAKER_BLACKLIST and utterance.text

In [5]:
corpus.organize_speaker_convo_history(utterance_filter=utterance_is_valid)



In [6]:
speaker_activities = corpus.get_attribute_table('speaker', ['n_convos'])

In [7]:
speaker_activities.sort_values('n_convos', ascending=False).head(10)

id,n_convos
laveritecestla,781.0
EQUASHNZRKUL,726.0
CornHellUniversity,696.0
t3hasiangod,647.0
ilovemymemesboo,430.0
omgdonerkebab,425.0
cartesiancategory,341.0
cornell256,330.0
mushiettake,321.0
Fencerman2,298.0


In [8]:
top_speakers = speaker_activities.sort_values('n_convos', ascending=False).head(100).index

In [9]:
import itertools

subset_utts = [list(corpus.get_speaker(speaker).iter_utterances(selector=utterance_is_valid)) for speaker in top_speakers]
subset_corpus = Corpus(utterances=list(itertools.chain(*subset_utts)))

In [10]:
subset_corpus.print_summary_stats()

Number of Speakers: 100
Number of Utterances: 20550
Number of Conversations: 6866


Step 2: Create instance of Surprise transformer
---------------
`target_sample_size` and `context_sample_size` specify the minimum number of tokens that should be in the target and context, respectively. If we set these to 1, the most surprising statements tend to be the very short ones. If a target or context is shorter than the specified sample size, the transformer sets the surprise to `nan`. The transformer draws `n_samples` samples from the target and from the context (of sizes `target_sample_size` and `context_sample_size`, respectively). It computes the cross entropy for each pair of samples and averages the results to get the final surprise score. This minimizes the effect of length on the scores.
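The sampling procedure described above can be sketched in plain Python. This is an illustration under stated assumptions, not ConvoKit's implementation: the helper names `cross_entropy` and `sampled_surprise` are hypothetical, and the choices of sampling without replacement, a +1-smoothed unigram model, and log base 2 are assumptions made for the sketch.

```python
import math
import random
from collections import Counter


def cross_entropy(target_tokens, context_tokens, vocab_size):
    """Cross entropy of the target under a +1-smoothed unigram model of the context.

    Assumption: unigram model, log base 2. The library may differ.
    """
    context_counts = Counter(context_tokens)
    denom = len(context_tokens) + vocab_size
    total = sum(math.log2((context_counts[tok] + 1) / denom) for tok in target_tokens)
    return -total / len(target_tokens)


def sampled_surprise(target, context, target_sample_size, context_sample_size,
                     n_samples, seed=0):
    """Average cross entropy over n_samples pairs of fixed-size random samples.

    Returns nan when the target or the context is shorter than its sample size.
    """
    if len(target) < target_sample_size or len(context) < context_sample_size:
        return float("nan")
    rng = random.Random(seed)
    vocab_size = len(set(target) | set(context))
    scores = []
    for _ in range(n_samples):
        # Assumption: samples are drawn without replacement.
        target_sample = rng.sample(target, target_sample_size)
        context_sample = rng.sample(context, context_sample_size)
        scores.append(cross_entropy(target_sample, context_sample, vocab_size))
    return sum(scores) / len(scores)


# Toy data: a target that reuses the context's words vs. one that does not.
context = ["the", "cat", "sat", "on", "the", "mat"] * 5
familiar = ["the", "cat", "sat"] * 2
novel = ["quantum", "flux", "capacitor"] * 2

low = sampled_surprise(familiar, context, 4, 20, 10)
high = sampled_surprise(novel, context, 4, 20, 10)
short = sampled_surprise(["hi"], context, 4, 20, 10)  # too short, so nan
```

Because every sample has the same size, a long target and a short target are scored on equal footing, which is the point of the fixed sample sizes.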

`model_key_selector` defines how utterances in a corpus should be mapped to a model. It takes in an utterance and returns the key for the corresponding model. For this demo we want to map utterances to models based on their speaker and conversation ids.

The transformer also has an optional `tokenizer` parameter to customize tokenization. Here we will tokenize the text outside of the Surprise transformer, so our tokenizer will be an identity function.

The `smooth` parameter determines whether the transformer uses +1 Laplace smoothing (`smooth = True`) or naively replaces 0 counts with 1's, as the SpeakerConvoDiversity transformer does (`smooth = False`).
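To make the difference between the two zero-handling strategies concrete, here is a small sketch on toy counts. The function names `laplace_probs` and `zero_replaced_probs` are hypothetical, and the normalization used in each (in particular, dividing the zero-replaced counts by the unadjusted context length) is an assumption for illustration; the library's exact arithmetic is not shown in this notebook.

```python
from collections import Counter


def laplace_probs(context_tokens, vocab):
    """+1 Laplace smoothing: add one to every vocabulary count, seen or unseen."""
    counts = Counter(context_tokens)
    total = len(context_tokens) + len(vocab)
    return {tok: (counts[tok] + 1) / total for tok in vocab}


def zero_replaced_probs(context_tokens, vocab):
    """Naive alternative: keep seen counts as they are and treat each zero count as 1.

    Assumption: the denominator is the unadjusted context length.
    """
    counts = Counter(context_tokens)
    total = len(context_tokens)
    return {tok: max(counts[tok], 1) / total for tok in vocab}


context_tokens = ["a", "a", "a", "b"]
vocab = ["a", "b", "c"]  # "c" never occurs in the context

laplace = laplace_probs(context_tokens, vocab)
naive = zero_replaced_probs(context_tokens, vocab)
```

Under these assumptions the Laplace estimates form a proper distribution (4/7, 2/7, 1/7), while the zero-replaced values (3/4, 1/4, 1/4) sum to more than 1 and give the unseen token the same weight as a token seen once.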

In [11]:
import spacy

spacy_nlp = spacy.load('en_core_web_sm', disable=['ner','parser', 'tagger', 'lemmatizer'])
for utt in subset_corpus.iter_utterances():
    utt.meta['joined_tokens'] = [t.text.lower() for t in spacy_nlp(utt.text)]

In [12]:
surp = Surprise(tokenizer=lambda x: x, model_key_selector=lambda utt: '_'.join([utt.speaker.id, utt.conversation_id]), target_sample_size=100, context_sample_size=1000, n_samples=50, smooth=True)

In [13]:
# Context for each utterance: every token its speaker produced in *other* conversations.
surp = surp.fit(
    subset_corpus,
    text_func=lambda utt: [
        list(itertools.chain(*[
            u.meta['joined_tokens']
            for u in utt.speaker.iter_utterances()
            if u.conversation_id != utt.conversation_id
        ]))
    ],
)

fit1: 20550it [00:16, 1283.44it/s]
fit2: 100%|██████████| 15394/15394 [00:00<00:00, 1032033.56it/s]


Step 3: Transform corpus
--------
The object type input to `transform` determines what objects the transformer adds metadata to. Valid inputs are `'utterance'`, `'speaker'`, `'conversation'`, and `'corpus'`. Here we'll call `transform` with object type `'speaker'` so that surprise scores will be added as a metadata field for each speaker. See the tennis demo for an example where object type is utterance.

In [14]:
transformed_corpus = surp.transform(subset_corpus, obj_type='speaker')

transform: 100it [15:57,  9.57s/it]


Analysis
------

In [15]:
import pandas as pd
from functools import reduce
def combine_dicts(x,y):
    x.update(y)
    return x
surprise_scores = reduce(combine_dicts, transformed_corpus.get_speakers_dataframe()['meta.surprise'].values)
suprise_series = pd.Series(surprise_scores).dropna()

In the resulting pandas series, the keys are of the format {speaker}_{conversation id}.
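Since speaker ids can themselves contain underscores (for example `crash_over-ride`), splitting a key on the first underscore would give the wrong result. A minimal sketch of recovering the two parts, assuming conversation ids never contain an underscore (which holds for the Reddit ids in this corpus's output); `split_model_key` is a hypothetical helper, not part of ConvoKit:

```python
def split_model_key(key):
    """Split a '{speaker}_{conversation id}' key on its last underscore."""
    speaker_id, conversation_id = key.rsplit("_", 1)
    return speaker_id, conversation_id


# Example with a key whose speaker id contains an underscore.
example = split_model_key("crash_over-ride_6bjxnm")
```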

Let's take a look at some of the most surprising speaker conversation involvements.

In [16]:
most_surprising = suprise_series.sort_values(ascending=False).head(10)
most_surprising

EQUASHNZRKUL_815y6t        7.233156
SwissWatchesOnly_8g5q88    7.216094
SwissWatchesOnly_67cljd    7.129933
EQUASHNZRKUL_73xuw6        7.114335
Straight_Derpin_5kst5l     7.067594
laveritecestla_6v4ysm      7.066840
ClawofBeta_52u1nu          7.059744
Udontlikecake_7rj6a0       7.053087
syntheticity_97zg9z        7.041747
DEEP_THORAX_8drwet         7.038059
dtype: float64

Now, let's look at some of the least surprising entries.

In [17]:
least_surprising = suprise_series.sort_values(ascending=True).head(10)
least_surprising

Unga_Bunga_30ac0l         5.841967
Bisphosphate_7r8nu1       5.941750
crash_over-ride_6bjxnm    5.945221
crash_over-ride_8f7b0y    5.962945
crash_over-ride_7owfvv    5.963205
crash_over-ride_30zba1    5.970271
crash_over-ride_2vhtzx    5.970866
crash_over-ride_t6w01     5.981621
omgdonerkebab_v4a3p       5.981898
crash_over-ride_9b132c    5.983570
dtype: float64

Notice that some speakers appear multiple times among the most or least surprising (speaker, conversation) pairs. This is particularly true of the least surprising pairs, where the speaker 'crash_over-ride' accounts for seven of the ten entries. One possible explanation is that this speaker talks about very similar things across their conversations, so their participation in any one conversation is not very surprising.
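One way to quantify this repetition is to count how often each speaker appears among the listed keys. The sketch below hard-codes the ten least surprising keys from the output above so that it runs on its own; with the notebook's series, the same list could be taken from `least_surprising.index`. Splitting on the last underscore assumes that conversation ids contain no underscores.

```python
from collections import Counter

# The ten least surprising keys, copied from the output above.
least_surprising_keys = [
    "Unga_Bunga_30ac0l",
    "Bisphosphate_7r8nu1",
    "crash_over-ride_6bjxnm",
    "crash_over-ride_8f7b0y",
    "crash_over-ride_7owfvv",
    "crash_over-ride_30zba1",
    "crash_over-ride_2vhtzx",
    "crash_over-ride_t6w01",
    "omgdonerkebab_v4a3p",
    "crash_over-ride_9b132c",
]

# Drop the trailing conversation id to recover the speaker id, then count.
speaker_counts = Counter(key.rsplit("_", 1)[0] for key in least_surprising_keys)
top_speaker, top_count = speaker_counts.most_common(1)[0]
```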