# Applying the Expected Context Framework to the Switchboard Corpus

### Using `ExpectedContextModelTransformer`

This notebook demonstrates how our implementation of the Expected Context Framework can be applied to the Switchboard dataset. See [this dissertation](https://tisjune.github.io/research/dissertation) for more details about the framework, and for further commentary on the analyses below.

This notebook will show how to apply two related instances of `ExpectedContextModelTransformer`. For a version of this demo that uses `DualContextWrapper`, a wrapper transformer around these two instances, see [this notebook](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/ecf/convokit/expected_context_framework/demos/switchboard_exploration_dual_demo.ipynb).

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import numpy as np
import math
import os

## 1. Loading and preprocessing the dataset

For this demo, we'll use the Switchboard corpus---a collection of telephone conversations which have been annotated with various dialog acts. More information on the dataset, as it exists in ConvoKit format, can be found [here](https://convokit.cornell.edu/documentation/switchboard.html); the original data is described [here](https://web.stanford.edu/~jurafsky/ws97/CL-dialog.pdf).

We will actually use a preprocessed version of the Switchboard corpus, which we can access below. Since Switchboard consists of transcribed telephone conversations, there are many disfluencies and backchannels that make utterances messier and make it hard to identify what counts as an actual turn. In the version of the corpus we consider, for the purpose of demonstration, we remove the disfluencies and backchannels (acknowledging that we're discarding important parts of the conversations).

In [3]:
from convokit import Corpus
from convokit import download

In [4]:
# OPTION 1: DOWNLOAD CORPUS 
# UNCOMMENT THESE LINES TO DOWNLOAD CORPUS
# DATA_DIR = '<YOUR DIRECTORY>'
# SW_CORPUS_PATH = download('switchboard-processed-corpus', data_dir=DATA_DIR)

# OPTION 2: READ PREVIOUSLY-DOWNLOADED CORPUS FROM DISK
# UNCOMMENT THIS LINE AND REPLACE WITH THE DIRECTORY WHERE THE SWITCHBOARD CORPUS IS LOCATED
# SW_CORPUS_PATH = '<YOUR DIRECTORY>'

In [5]:
sw_corpus = Corpus(SW_CORPUS_PATH)

In [6]:
sw_corpus.print_summary_stats()

Number of Speakers: 440
Number of Utterances: 44402
Number of Conversations: 1155


In [7]:
utt_eg_id = '3496-79'

As input, we use a preprocessed version of the utterance that only contains alphabetical words, found in the `alpha_text` metadata field.

In [8]:
sw_corpus.get_utterance(utt_eg_id).meta['alpha_text']

'How old were you when you left'

In order to avoid capturing topic-specific information, we restrict our analyses to a vocabulary of unigrams that occur across many topics and across many conversations:

In [9]:
from collections import defaultdict

In [10]:
topic_counts = defaultdict(set)
for ut in sw_corpus.iter_utterances():
    topic = sw_corpus.get_conversation(ut.conversation_id).meta['topic']
    for x in set(ut.meta['alpha_text'].lower().split()):
        topic_counts[x].add(topic)
topic_counts = {x: len(y) for x, y in topic_counts.items()}

word_convo_counts = defaultdict(set)
for ut in sw_corpus.iter_utterances():
    for x in set(ut.meta['alpha_text'].lower().split()):
        word_convo_counts[x].add(ut.conversation_id)
word_convo_counts = {x:  len(y) for x, y in word_convo_counts.items()}

min_topic_words = set(x for x,y in topic_counts.items() if y >= 33)
min_convo_words = set(x for x,y in word_convo_counts.items() if y >= 200)
vocab = sorted(min_topic_words.intersection(min_convo_words))

In [11]:
len(vocab)

381
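To make the filtering rule concrete, here is a minimal sketch of the same idea on toy data. The utterances, topic labels, and thresholds below are made up for illustration (the cell above uses 33 topics and 200 conversations on the real corpus).

```python
from collections import defaultdict

# Hypothetical toy data: (conversation_id, topic, text) triples.
toy_utts = [
    ("c1", "pets", "i think you know dogs"),
    ("c2", "cars", "you know i like cars"),
    ("c3", "pets", "dogs are great you know"),
]

topics_per_word = defaultdict(set)
convos_per_word = defaultdict(set)
for convo_id, topic, text in toy_utts:
    for word in set(text.lower().split()):
        topics_per_word[word].add(topic)
        convos_per_word[word].add(convo_id)

# Toy thresholds: keep words seen in >= 2 topics and >= 3 conversations.
MIN_TOPICS, MIN_CONVOS = 2, 3
toy_vocab = sorted(
    w for w in topics_per_word
    if len(topics_per_word[w]) >= MIN_TOPICS and len(convos_per_word[w]) >= MIN_CONVOS
)
print(toy_vocab)
```

Here "dogs" is dropped because it only appears under one topic, and "i" is dropped because it appears in too few conversations; only words that are widespread on both counts survive.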

In [12]:
from convokit.expected_context_framework import ColNormedTfidfTransformer, ExpectedContextModelTransformer

## 2. Applying the Expected Context Framework

To apply the Expected Context Framework, we start by converting the input utterance text to an input vector representation. Here, we represent utterances in a term-document matrix that's _normalized by columns_ (empirically, we found that this ensures that the representations derived by the framework aren't skewed by the relative frequency of utterances). We use the `ColNormedTfidfTransformer` transformer to do this:

In [13]:
tfidf_obj = ColNormedTfidfTransformer(input_field='alpha_text', output_field='col_normed_tfidf', binary=True, vocabulary=vocab)
_ = tfidf_obj.fit(sw_corpus)
_ = tfidf_obj.transform(sw_corpus)
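For intuition about what "normalized by columns" means, the sketch below scales each column of a toy binary utterance-by-term matrix to unit L2 norm. This is only an illustration: the choice of the L2 norm is our assumption here, and the sketch does not reproduce the tf-idf weighting or the internals of `ColNormedTfidfTransformer`.

```python
import math

# Toy binary utterance-by-term matrix (rows: utterances, columns: terms).
terms = ["know", "you", "left"]
matrix = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]

def normalize_columns(rows):
    """Scale each column to unit L2 norm (all-zero columns stay zero)."""
    n_cols = len(rows[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in rows)) for j in range(n_cols)]
    return [
        [row[j] / norms[j] if norms[j] else 0.0 for j in range(n_cols)]
        for row in rows
    ]

normed = normalize_columns(matrix)
for term, col in zip(terms, zip(*normed)):
    print(term, [round(v, 3) for v in col])
```

After normalization, a term that occurs in every utterance ("know") contributes a small value per utterance, while a term that occurs once ("left") keeps a large value, so frequent terms no longer dominate by sheer count.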

We now use the Expected Context Framework. In short, the framework derives vector representations, and other characterizations, of terms and utterances that are based on their _expected conversational context_---i.e., the replies we expect will follow a term or utterance, or the preceding utterances that we expect the term/utterance will reply to. 
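As a rough intuition only, the sketch below characterizes each term by the words found in the replies that follow it. This is not the framework's actual computation (which operates on the vector representations and derives low-dimensional embeddings); the exchanges are made up, and the helper `expected_reply` is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (utterance, reply) pairs.
pairs = [
    ("how old were you", "i was ten"),
    ("how long did it take", "it took a week"),
    ("i think that is true", "yeah that is true"),
]

reply_counts = defaultdict(Counter)  # term -> word counts over its replies
n_replies = Counter()                # term -> number of replies seen

for utt, reply in pairs:
    reply_words = set(reply.split())
    for term in set(utt.split()):
        reply_counts[term].update(reply_words)
        n_replies[term] += 1

def expected_reply(term):
    """Fraction of the term's replies that contain each word."""
    return {w: c / n_replies[term] for w, c in reply_counts[term].items()}

print(expected_reply("how"))
print(expected_reply("think"))
```

In this toy setting, "how" is followed by a spread of different reply words, while "think" is followed by a single, agreement-like reply; the framework captures this kind of distinction in a more principled, lower-dimensional form.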

We start by applying the framework to derive characterizations based on the _forwards_ context, i.e., the expected replies. Here, we initialize a framework transformer object `ec_fw`.
We initialize it with the following arguments:
* `context_field` names the field in each utterance's metadata that holds the ID of the context utterance we will use; here we set it to `next_id`, which indicates the ID of the utterance's reply.
* `output_prefix` determines the names of the matrices and metadata fields that the framework object outputs, when we later call `ec_fw.transform(corpus)` (see below)
* `vect_field` and `context_vect_field` respectively denote the input vector representations of utterances and context utterances that `ec_fw` will work with. Here, we'll use the same tf-idf representations that we just computed above.
* `n_svd_dims` denotes the dimensionality of the vector representations that `ec_fw` will output. This is something that you can play around with---for this dataset, we found that more dimensions resulted in messier output, and a coarser, lower-dimensional representation was slightly more interpretable. (Technical note: `ec_fw` actually produces vector representations of dimension `n_svd_dims`-1, since by default, it removes the first latent dimension, which we find tends to strongly reflect term frequency.)
* `n_clusters` denotes the number of utterance types that `ec_fw` will infer, given the representations it computes. Note that this is an interpretative step: looking at clusters of utterances helps us get a sense of what information the representations are capturing; this value does not actually impact the representations and other characterizations we derive.
* `random_state` and `cluster_random_state` are fixed for this demo, so we produce deterministic output.

In [14]:
ec_fw = ExpectedContextModelTransformer(context_field='next_id', output_prefix='fw', 
                                    vect_field='col_normed_tfidf', context_vect_field='col_normed_tfidf', 
                                      n_svd_dims=15, n_clusters=2,
                                     random_state=1000, cluster_random_state=1000)

We'll fit the `ec_fw` transformer on the subset of utterances and replies that have at least 5 unigrams from our vocabulary.

In [15]:
ec_fw.fit(sw_corpus, selector=lambda x: x.meta.get('col_normed_tfidf__n_feats',0)>=5, 
            context_selector=lambda x: x.meta.get('col_normed_tfidf__n_feats',0)>= 5)
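The selectors above are plain functions from an utterance to a boolean. The sketch below shows how the `.meta.get(..., 0)` pattern behaves, using a hypothetical stand-in class (`ToyUtterance`) rather than ConvoKit objects.

```python
class ToyUtterance:
    """Hypothetical stand-in for an utterance: just an id and a meta dict."""

    def __init__(self, utt_id, meta):
        self.id = utt_id
        self.meta = meta

toy = [
    ToyUtterance("u1", {"col_normed_tfidf__n_feats": 7}),
    ToyUtterance("u2", {"col_normed_tfidf__n_feats": 3}),
    ToyUtterance("u3", {}),  # field missing entirely
]

MIN_FEATS = 5
selector = lambda x: x.meta.get("col_normed_tfidf__n_feats", 0) >= MIN_FEATS

selected_ids = [u.id for u in toy if selector(u)]
print(selected_ids)
```

Because `.get` falls back to 0, an utterance without the field is simply excluded instead of raising a `KeyError`.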

Next, we apply the framework to derive characterizations based on the _backwards_ context, i.e., the expected predecessors. As such, we set `context_field='reply_to'`, meaning the `ec_bk` transformer will derive characterizations based on predecessors.

Since we want the representations derived in the backwards direction to be comparable to those derived in the forwards direction, we initialize the backwards framework object, `ec_bk`, with the forwards one, via the argument `model=ec_fw`. (Under the hood, `ec_bk` is initialized with the latent context vectors that `ec_fw` has derived.) For a demonstration of `DualContextWrapper`, a wrapper that handles both expected context models we've initialized, see [this notebook](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/ecf/convokit/expected_context_framework/demos/switchboard_exploration_dual_demo.ipynb).

In [16]:
ec_bk = ExpectedContextModelTransformer(context_field='reply_to', output_prefix='bk', 
                                    vect_field='col_normed_tfidf', context_vect_field='col_normed_tfidf', 
                                      n_svd_dims=15, n_clusters=2,
                                     random_state=1000, cluster_random_state=1000,
                                        model=ec_fw)
ec_bk.fit(sw_corpus, selector=lambda x: x.meta.get('col_normed_tfidf__n_feats',0)>=5, 
            context_selector=lambda x: x.meta.get('col_normed_tfidf__n_feats',0)>= 5)

### Interpreting derived representations

Before applying the two transformers, `ec_fw` and `ec_bk`, to transform the corpus, we can examine the representations and characterizations they've derived over the training data (note that in this case, the training data is also the corpus that we analyze, but this needn't be the case in general---see [this demo](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/convokit/expected_context_framework/demos/wiki_awry_demo.ipynb) for an example).

In [17]:
from sklearn.metrics.pairwise import paired_distances

First, to interpret the representations derived by each model, we can inspect the clusters of representations that we've inferred, for both the forwards and backwards direction. The following function calls print out representative terms and utterances, as well as context terms and utterances, per cluster (next two cells; note that the output is quite long). 

In [18]:
ec_fw.print_clusters(corpus=sw_corpus)

CLUSTER 0 0
---
terms
         cluster_dist
index                
to           0.339085
that         0.340681
is           0.398213
not          0.407413
know         0.413964
for          0.421002
with         0.429641
about        0.429842
because      0.444867
me           0.447905

context terms
       cluster_dist
index              
that       0.464557
true       0.517588
do         0.558011
think      0.579059
know       0.750303
sure       0.751195
what       0.762769
no         0.825287
right      0.846140
how        0.870710


utterances
> 2303-21 0.109 Yeah , that 's , that 's possible . I still think that a lot of those people are the ones who really think that their votes do n't make a difference , though , as well . I think it 's those same people who do n't know any better about how we vote , are , are , are a lot of the people who think that well , look at me , I 'm just a little nobody . My vote 's not going to count anyway . You know , and I think that 's probably a p

In [19]:
ec_bk.print_clusters(corpus=sw_corpus)

CLUSTER 0 0
---
terms
        cluster_dist
index               
a           0.206770
i           0.236567
well        0.256482
and         0.259060
so          0.260796
it          0.298410
to          0.300810
really      0.311325
you         0.322042
uh          0.335718

context terms
        cluster_dist
index               
huh         0.391466
yeah        0.437190
i           0.450316
you         0.489002
oh          0.560207
that        0.561080
well        0.575842
it          0.608198
sounds      0.639354
okay        0.653139


utterances
> 3405-0 0.109 Okay , Lowell , so I 'd like to know , um , what , what do you do in lawn and garden , what , uh , what 's , what 's of interest to you and how do you go about it ?
> 2709-79 0.111 So , and I really do wish I knew about quotas and really wish I -
> 3014-69 0.114 So , I just trusted the CONSUMER REPORTS and the auto ,
> 4378-11 0.116 Well , I , I just live in a , I live in an apartment now . I , uh , two summers ago I went to Ma

demo continues below

We can see that in each case, two clusters emerge that roughly correspond to utterances recounting personal experiences, and those providing commentary, generally not about personal matters. We'll label them as such, noting that there's a roughly 50-50 split with slightly more "personal" utterances than "commentary" ones:

In [20]:
ec_fw.set_cluster_names(['commentary','personal'])
ec_bk.set_cluster_names(['personal', 'commentary'])

In [21]:
ec_fw.print_cluster_stats()

Unnamed: 0,utts,terms,context_utts,context_terms
commentary,0.423153,0.461942,0.404751,0.435696
personal,0.576847,0.538058,0.595249,0.564304


### Interpreting derived characterizations

`ec_fw` and `ec_bk` also compute term-level statistics that we refer to as (forwards or backwards) _ranges_, which we roughly interpret as modeling the strengths of our forwards expectations of the replies that a term tends to get, or the backwards expectations of the predecessors that the term tends to follow. To examine these statistics, we'll put them in a Pandas dataframe:

In [22]:
term_df = pd.DataFrame({'index': ec_fw.get_terms(),
                       'fw_range': ec_fw.get_term_ranges(),
                       'bk_range': ec_bk.get_term_ranges()}).set_index('index')

In [23]:
term_df.sample(5)

Unnamed: 0_level_0,fw_range,bk_range
index,Unnamed: 1_level_1,Unnamed: 2_level_1
have,0.826351,0.809287
so,0.824033,0.816046
around,0.825741,0.829453
thinking,0.807137,0.800061
full,0.807683,0.798818


Since the characterizations derived from `ec_fw` and `ec_bk` are comparable (as we initialized the latter model with the former), we can also compare characterizations across the two models. In the later analysis, we'll examine two:
* orientation: this statistic compares the relative magnitude of forwards and backwards ranges. In a [counseling conversation setting](https://www.cs.cornell.edu/~cristian/Orientation_files/orientation-forwards-backwards.pdf) we interpreted orientation as a measure of the relative extent to which an interlocutor aims to advance the conversation forwards with a term, versus address existing content.
* shift: this statistic corresponds to the distance between the backwards and forwards representations for each term; we interpret it as the extent to which a term shifts the focus of a conversation. 
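The two statistics can be sketched on toy numbers as follows. All values below are hypothetical, and we assume Euclidean distance for shift (the default metric of scikit-learn's `paired_distances`).

```python
import math

# Hypothetical per-term outputs: (fw_range, bk_range, fw_repr, bk_repr).
toy_terms = {
    "how":    (0.85, 0.80, [0.9, 0.1], [0.1, 0.8]),
    "myself": (0.78, 0.84, [0.5, 0.5], [0.4, 0.6]),
}

results = {}
for term, (fw_range, bk_range, fw_repr, bk_repr) in toy_terms.items():
    orientation = bk_range - fw_range     # negative when fw_range is larger
    shift = math.dist(fw_repr, bk_repr)   # Euclidean distance between reprs
    results[term] = (orientation, shift)
    print(term, round(orientation, 3), round(shift, 3))
```

In this toy example "how" has a negative orientation and a large shift, while "myself" has a positive orientation and a small shift.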

These statistics are admittedly somewhat hard to interpret in the Switchboard setting, perhaps due to the relative lack of structure in these conversations. As we show later on, at the utterance level, they do bear some correspondence to various discourse act labels, so it might be worth playing around and coming up with characterizations of your own that reflect better-founded ideas.

In [24]:
term_df['orn'] = term_df.bk_range - term_df.fw_range
term_df['shift'] = paired_distances(
        ec_fw.ec_model.term_reprs, ec_bk.ec_model.term_reprs
    )

In [25]:
k=10
print('low orientation')
display(term_df.sort_values('orn').head(k)[['orn']])
print('high orientation')
display(term_df.sort_values('orn').tail(k)[['orn']])
print('\nlow shift')
display(term_df.sort_values('shift').head(k)[['shift']])
print('high shift')
display(term_df.sort_values('shift').tail(k)[['shift']])

low orientation


Unnamed: 0_level_0,orn
index,Unnamed: 1_level_1
how,-0.056177
done,-0.055084
let,-0.050308
few,-0.049332
basically,-0.047425
nothing,-0.045719
nine,-0.043703
okay,-0.043319
whole,-0.042495
works,-0.042461


high orientation


Unnamed: 0_level_0,orn
index,Unnamed: 1_level_1
taking,0.036944
makes,0.037245
change,0.038951
funny,0.039011
next,0.03931
called,0.043352
least,0.051271
anymore,0.05141
seem,0.064159
myself,0.064653



low shift


Unnamed: 0_level_0,shift
index,Unnamed: 1_level_1
he,0.162183
to,0.173862
was,0.176154
car,0.195208
read,0.208384
him,0.214645
we,0.216772
for,0.218534
kids,0.22319
watch,0.223844


high shift


Unnamed: 0_level_0,shift
index,Unnamed: 1_level_1
goes,1.10464
how,1.107404
exactly,1.11714
least,1.142747
change,1.162627
taking,1.206768
before,1.223504
nothing,1.270712
without,1.271164
tell,1.361794


### Deriving utterance-level representations

We now use the `ec_fw` and `ec_bk` models to derive utterance-level characterizations, by transforming the corpus with them. Again, we focus on utterances that are sufficiently long:

In [26]:
_ = ec_fw.transform(sw_corpus, selector=lambda x: x.meta.get('col_normed_tfidf__n_feats',0)>=5)
_ = ec_bk.transform(sw_corpus, selector=lambda x: x.meta.get('col_normed_tfidf__n_feats',0)>=5)

The `transform` function does the following. 

First, it derives vector representations of utterances, stored as `fw_repr` and `bk_repr`:

In [27]:
sw_corpus.vectors

{'bk_repr', 'col_normed_tfidf', 'fw_repr'}

Next, it derives ranges of utterances, stored in the metadata as `fw_range` and `bk_range`:

In [28]:
eg_ut = sw_corpus.get_utterance(utt_eg_id)
print('Forwards range:', eg_ut.meta['fw_range'])
print('Backwards range:', eg_ut.meta['bk_range'])

Forwards range: 0.8370251853798225
Backwards range: 0.8160712383960356


It also assigns utterances to inferred types:

In [29]:
print('Forwards cluster:', eg_ut.meta['fw_clustering.cluster'])
print('Backwards cluster:', eg_ut.meta['bk_clustering.cluster'])

Forwards cluster: personal
Backwards cluster: personal


As with terms, we can derive orientations and shifts for utterances. For orientation, we compare the backwards and forwards ranges that the above calls to `transform` compute, for each utterance:

In [30]:
for ut in sw_corpus.iter_utterances(selector=lambda x: x.meta.get('col_normed_tfidf__n_feats',0)>=5):
    ut.meta['orn'] = ut.meta['bk_range'] - ut.meta['fw_range']

For shift, we compare the backwards and forwards representations:

In [31]:
utt_shifts = paired_distances(sw_corpus.get_vectors('fw_repr'), sw_corpus.get_vectors('bk_repr'))
for utt_id, shift in zip(sw_corpus.get_vector_matrix('fw_repr').ids, utt_shifts):
    sw_corpus.get_utterance(utt_id).meta['shift'] = shift

In [32]:
print('shift:', eg_ut.meta['shift'])
print('orientation:', eg_ut.meta['orn'])

shift: 0.6741247781441236
orientation: -0.020953946983786942


## 3. Analysis: correspondence to discourse act labels

We explore the relation between the characterizations we've derived, and the various annotations that the utterances are labeled with (for more information on the annotation scheme, see the [manual here](https://web.stanford.edu/~jurafsky/ws97/manual.august1.html)). See [this dissertation](https://tisjune.github.io/research/dissertation) for further explanation of the analyses and findings below. A high-level comment is that this is a tough dataset for the framework to work with, given the relative lack of structure---something future work could think more carefully about.

To facilitate the analysis, we extract relevant utterance attributes into a Pandas dataframe:

In [33]:
df = sw_corpus.get_attribute_table('utterance',
                ['bk_clustering.cluster', 'fw_clustering.cluster',
                'orn', 'shift', 'tags'])
df = df[df['bk_clustering.cluster'].notnull()]

We will stick to examining the 9 most common tags in the data:

In [34]:
tag_subset = ['aa', 'b', 'ba', 'h', 'ny', 'qw', 'qy', 'sd', 'sv'] 
for tag in tag_subset:
    df['has_' + tag] = df.tags.apply(lambda x: tag in x.split())
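The `.split()` in the cell above matters because several tags are prefixes of others. A small illustration, using a made-up tag string:

```python
tags = "ba sd"  # hypothetical tag string for one utterance

substring_match = "b" in tags        # substring test: matches inside "ba"
token_match = "b" in tags.split()    # token test: "b" is not one of the tags

print(substring_match, token_match)
```

Testing membership on the split tokens avoids counting a `ba` utterance as also carrying the `b` tag.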

To start, we explore how the forwards and backwards vector representations correspond to these labels. To do this, we will compute log-odds ratios between the inferred utterance clusters and these labels:

In [35]:
def compute_log_odds(col, bool_col, val_subset=None):
    if val_subset is not None:
        col_vals = val_subset
    else:
        col_vals = col.unique()
    log_odds_entries = []
    for val in col_vals:
        val_true = sum((col == val) & bool_col)
        val_false = sum((col == val) & ~bool_col)
        nval_true = sum((col != val) & bool_col)
        nval_false = sum((col != val) & ~bool_col)
        log_odds_entries.append({'val': val, 'log_odds': np.log((val_true/val_false)/(nval_true/nval_false))})
    return log_odds_entries
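To see what a single log-odds value means, here is the same formula applied to one made-up 2x2 table of counts. The helper name `log_odds_ratio` and the counts are hypothetical.

```python
import math

def log_odds_ratio(val_true, val_false, nval_true, nval_false):
    """Log of (odds of the label inside the cluster) / (odds outside it)."""
    return math.log((val_true / val_false) / (nval_true / nval_false))

# Hypothetical counts: 30 of 100 utterances in one cluster carry the tag,
# versus 10 of 100 utterances in the other cluster.
lo = log_odds_ratio(val_true=30, val_false=70, nval_true=10, nval_false=90)
print(round(lo, 3))

# Swapping the roles of the two clusters flips the sign.
lo_flipped = log_odds_ratio(val_true=10, val_false=90, nval_true=30, nval_false=70)
print(round(lo_flipped, 3))
```

A positive value means the tag is over-represented in the cluster, a negative value means it is under-represented, and zero means no association. Note that in this sketch a zero count in any cell would cause a division by zero or a log of zero.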

In [36]:
bk_log_odds = []
for tag in tag_subset:
    entry = compute_log_odds(df['bk_clustering.cluster'],df['has_' + tag], ['commentary'])[0]
    entry['tag'] = tag
    bk_log_odds.append(entry)
bk_log_odds_df = pd.DataFrame(bk_log_odds).set_index('tag').sort_values('log_odds')[['log_odds']]

In [37]:
fw_log_odds = []
for tag in tag_subset:
    entry = compute_log_odds(df['fw_clustering.cluster'],df['has_' + tag], ['commentary'])[0]
    entry['tag'] = tag
    fw_log_odds.append(entry)
fw_log_odds_df = pd.DataFrame(fw_log_odds).set_index('tag').sort_values('log_odds')[['log_odds']]

In [38]:
print('forwards types vs labels')
display(fw_log_odds_df.T)
print('--------------------------')
print('backwards types vs labels')
display(bk_log_odds_df.T)

forwards types vs labels


tag,qy,ny,qw,sd,ba,b,aa,h,sv
log_odds,-0.491938,-0.485242,-0.472657,-0.380206,-0.304694,-0.159506,0.500705,0.718654,1.242259


--------------------------
backwards types vs labels


tag,ny,qy,sd,ba,qw,b,aa,h,sv
log_odds,-0.56057,-0.428184,-0.416996,-0.344168,-0.32995,-0.142323,0.50554,0.733656,1.249692


Tags further towards the right of the above tables (more positive log-odds) are those that co-occur more with the `commentary` than the `personal` utterance type. We briefly note that both forwards and backwards representations seem to draw a distinction between `sv` (opinion statements) and `sd` (non-opinion statements).

Next, we explore how the orientation and shift statistics relate to these labels. To do this, we compare statistics for utterances with a particular label, to statistics for utterances without that label.

In [39]:
from scipy import stats

In [40]:
def cohend(d1, d2):
    n1, n2 = len(d1), len(d2)
    s1, s2 = np.var(d1, ddof=1), np.var(d2, ddof=1)
    s = np.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    u1, u2 = np.mean(d1), np.mean(d2)
    return (u1 - u2) / s
def get_pstars(p):
    if p  < 0.001:
        return '***'
    elif p < 0.01:
        return '**'
    elif p < 0.05:
        return '*'
    else: return ''
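As a quick check on what `cohend` measures, the sketch below recomputes the same quantity with the standard library on made-up values: the difference in means divided by the pooled sample standard deviation.

```python
import math
import statistics

def cohens_d(d1, d2):
    """Difference in means divided by the pooled sample standard deviation."""
    n1, n2 = len(d1), len(d2)
    s1, s2 = statistics.variance(d1), statistics.variance(d2)  # sample variance
    pooled = math.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    return (statistics.mean(d1) - statistics.mean(d2)) / pooled

# Hypothetical shift values for utterances with and without some tag.
with_tag = [0.9, 1.1, 1.0, 1.2]
without_tag = [0.7, 0.8, 0.9, 0.8]

d = cohens_d(with_tag, without_tag)
print(round(d, 2))
```

The sign follows the order of the arguments, so a positive value here means the first group has the higher mean.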

In [41]:
stat_col = 'orn'
entries = []
for tag in tag_subset:
    has = df[df['has_' + tag]][stat_col]
    hasnt = df[~df['has_' + tag]][stat_col]
    entry = {'tag': tag, 'pval': stats.mannwhitneyu(has, hasnt)[1],
            'cd': cohend(has, hasnt)}
    entry['ps'] = get_pstars(entry['pval'] * len(tag_subset))
    entries.append(entry)
orn_stat_df = pd.DataFrame(entries).set_index('tag').sort_values('cd')
orn_stat_df = orn_stat_df[np.abs(orn_stat_df.cd) >= .1]

In [42]:
stat_col = 'shift'
entries = []
for tag in tag_subset:
    has = df[df['has_' + tag]][stat_col]
    hasnt = df[~df['has_' + tag]][stat_col]
    entry = {'tag': tag, 'pval': stats.mannwhitneyu(has, hasnt)[1],
            'cd': cohend(has, hasnt)}
    entry['ps'] = get_pstars(entry['pval'] * len(tag_subset))
    entries.append(entry)
shift_stat_df = pd.DataFrame(entries).set_index('tag').sort_values('cd')
shift_stat_df = shift_stat_df[np.abs(shift_stat_df.cd) >= .1]

(We'll only show labels for which there's a sufficiently large difference, in Cohen's d, between utterances with and without the label.)
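Note that the two cells above multiply each p-value by the number of tags before assigning significance stars, which amounts to a Bonferroni correction for testing several tags at once. A minimal sketch with made-up p-values follows; the cap at 1 is our addition and does not affect the stars.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)

def stars(p):
    """Same thresholds as get_pstars above."""
    if p < 0.001:
        return '***'
    if p < 0.01:
        return '**'
    if p < 0.05:
        return '*'
    return ''

n_tags = 9
for raw_p in (0.00005, 0.004, 0.02):  # hypothetical raw p-values
    adj = bonferroni(raw_p, n_tags)
    print(raw_p, round(adj, 5), repr(stars(raw_p)), repr(stars(adj)))
```

The correction makes each individual comparison more conservative: a raw p-value of 0.02 would earn one star on its own, but none once nine tags are being tested.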

In [43]:
print('orientation vs labels')
display(orn_stat_df.T)
print('--------------------------')
print('shift vs labels')
display(shift_stat_df.T)

orientation vs labels


tag,qw,sv,ba
cd,-0.464487,0.137032,0.325546
ps,***,***,***
pval,4.57282e-58,3.04165e-41,1.04554e-35


--------------------------
shift vs labels


tag,sd,h,sv,ny,qy,ba,qw
cd,-0.514413,-0.457883,-0.245104,-0.162611,0.199682,0.239983,0.493643
ps,***,***,***,***,***,***,***
pval,0,4.83333e-52,6.94752e-99,1.35491e-11,6.66404e-36,5.29252e-12,3.5594e-71


We note that utterances containing questions (`qw`, `qy`) have higher shifts than utterances which do not. If you're familiar with the DAMSL designations for forwards- and backwards-looking communicative functions, the output for orientation might look a little puzzling, or informative in that it suggests our view of what counts as forwards/backwards differs from the view espoused by the annotation scheme. We discuss this further in [this dissertation](https://tisjune.github.io/research/dissertation).

## 4. Model persistence

Finally, we briefly demonstrate how the expected context models can be saved and loaded for later use. Here, we focus on `ec_fw`.

In [44]:
FW_MODEL_PATH = os.path.join(SW_CORPUS_PATH, 'fw')

In [45]:
ec_fw.dump(FW_MODEL_PATH)

In short, `ec_fw.dump` outputs latent context representations, clustering information, and various input parameters:

In [46]:
ls $FW_MODEL_PATH

clustering_context_terms.tsv  cluster_names.npy  meta.json
clustering_context_utts.tsv   context_s.npy      term_ranges.npy
clustering_terms.tsv          context_terms.npy  term_reprs.npy
clustering_utts.tsv           context_U.npy      terms.npy
cluster_km_df.tsv             context_V.npy      train_utt_reprs.npy
cluster_meta.json             km_model.joblib


To load the learned model, we start by initializing a new expected context model:

In [47]:
ec_fw_new = ExpectedContextModelTransformer('next_id', 'fw_new', 'col_normed_tfidf', 'col_normed_tfidf', 
                                      n_svd_dims=15, n_clusters=2,
                                     random_state=1000, cluster_random_state=1000)

In [48]:
ec_fw_new.load(FW_MODEL_PATH)

We see that using the re-loaded model to transform the corpus results in the same representations as `ec_fw`. 

In [49]:
_ = ec_fw_new.transform(sw_corpus, selector=lambda x: x.meta.get('col_normed_tfidf__n_feats',0)>=5)

In [50]:
np.allclose(sw_corpus.get_vectors('fw_repr'), sw_corpus.get_vectors('fw_new_repr'))

True

## 5. Pipeline usage

We also implement a pipeline that handles the following:
* processes text (via a pipeline supplied by the user)
* transforms text to input representation (via `ColNormedTfidfTransformer`)
* derives framework output (via `ExpectedContextModelTransformer`)

In [44]:
from convokit.expected_context_framework import ExpectedContextModelPipeline

In [45]:
# see `demo_text_pipelines.py` in this demo's directory for details
# in short, this pipeline will either output the `alpha_text` metadata  field
# of an utterance, or write the utterance's `text` attribute into the `alpha_text` 
# metadata field
from demo_text_pipelines import switchboard_text_pipeline

We initialize the pipeline with the following arguments:
* `text_field` specifies which utterance metadata field to use as text input
* `text_pipe` specifies the pipeline used to compute the contents of `text_field`
* `tfidf_params` specifies the parameters to be passed into the underlying `ColNormedTfidfTransformer` object
* `min_terms` specifies the minimum number of terms in the vocabulary that an utterance must contain for it to be considered in fitting and transforming the underlying `ExpectedContextModelTransformer` object (see the `selector` argument passed into `ec_fw.fit` above)

All other arguments are inherited from `ExpectedContextModelTransformer`.

In [46]:
fw_pipe = ExpectedContextModelPipeline(context_field='next_id', output_prefix='fw',
        text_field='alpha_text',
        text_pipe=switchboard_text_pipeline(), 
        tfidf_params={'binary': True, 'vocabulary': vocab}, 
        min_terms=5,
        n_svd_dims=15, n_clusters=2, cluster_on='utts',
        random_state=1000, cluster_random_state=1000)

In [None]:
# note this might output a warning that `col_normed_tfidf` already exists;
# that's okay: the pipeline is just recomputing this matrix
fw_pipe.fit(sw_corpus)

As with `ec_bk`, we can initialize a second pipeline to compute backwards characterizations, passing in the argument `ec_model=fw_pipe` to ensure the derived representations in either direction are comparable.

In [48]:
bk_pipe = ExpectedContextModelPipeline(context_field='reply_to', output_prefix='bk',
        text_field='alpha_text',
        text_pipe=switchboard_text_pipeline(), 
        tfidf_params={'binary': True, 'vocabulary': vocab}, 
        min_terms=5,
        ec_model=fw_pipe,
        n_svd_dims=15, n_clusters=2, cluster_on='utts',
        random_state=1000, cluster_random_state=1000)

In [None]:
bk_pipe.fit(sw_corpus)

The pipeline class inherits several other methods of `ExpectedContextModelTransformer`, e.g.:

In [51]:
fw_pipe.set_cluster_names(['commentary','personal'])
bk_pipe.set_cluster_names(['personal', 'commentary'])

Note that the pipeline enables us to transform ad-hoc string input: 

In [52]:
eg_ut_new = fw_pipe.transform_utterance('How old were you when you left ?')
eg_ut_new = bk_pipe.transform_utterance(eg_ut_new)

Here, instead of storing vector representations with a corpus, the pipeline writes these representations to a field in the utterance metadata itself (since the utterance is not attached to a corpus):

In [57]:
eg_ut_new.meta['fw_repr']

[0.1269145021474405,
 -0.18177058434721982,
 0.501745581507668,
 0.3060047621669689,
 0.18772232872850586,
 0.30812601365393016,
 -0.17012829411583585,
 -0.5461342547631567,
 0.24163519678100423,
 -0.07626502435693563,
 -0.23943785779562357,
 0.10367101546307968,
 -0.11019044952480171,
 -0.06044023890156628]

In [53]:
# note these attributes match those of eg_ut, computed above (up to floating-point error)
print('Forwards range:', eg_ut_new.meta['fw_range'])
print('Backwards range:', eg_ut_new.meta['bk_range'])
print('Forwards cluster:', eg_ut_new.meta['fw_clustering.cluster'])
print('Backwards cluster:', eg_ut_new.meta['bk_clustering.cluster'])

Forwards range: 0.8370251853798226
Backwards range: 0.8160712383960359
Forwards cluster: personal
Backwards cluster: personal
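The ranges printed here differ from those printed earlier for `eg_ut` only in the last digit. When comparing such values programmatically, a tolerance-based check is safer than exact equality, as in this small sketch (the two numbers are copied from the forwards-range outputs in this notebook; the tolerance is an arbitrary choice).

```python
import math

fw_range_corpus = 0.8370251853798225  # eg_ut, transformed as part of the corpus
fw_range_adhoc = 0.8370251853798226   # eg_ut_new, transformed from a string

exactly_equal = fw_range_corpus == fw_range_adhoc
close_enough = math.isclose(fw_range_corpus, fw_range_adhoc, rel_tol=1e-9)
print(exactly_equal, close_enough)
```

This is the same idea as the `np.allclose` check used in the model persistence section.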
