# Inferring prompt types

This notebook demos two transformers, which broadly aim at producing abstract representations of an utterance in terms of its phrasing and its rhetorical intent: 

* The `PhrasingMotifs` transformer extracts representations of utterances in terms of how they are phrased;
* The `PromptTypes` transformer computes latent representations of utterances in terms of their rhetorical intention -- the _responses_ they aim at prompting -- and assigns utterances to different (automatically-inferred) types of intentions.

It also demos some additional transformers used in preprocessing steps.



Together, these transformers implement the methodology detailed in the [paper](http://www.cs.cornell.edu/~cristian/Asking_too_much.html):

```
Asking Too Much? The Rhetorical Role of Questions in Political Discourse 
Justine Zhang, Arthur Spirling, Cristian Danescu-Niculescu-Mizil
Proceedings of EMNLP 2017
```

ConvoKit also includes an end-to-end implementation, `PromptTypesWrapper`, that runs the transformers one after another, and handles the particular pre-processing steps found in the paper. See TODO LINK for a demonstration of this end-to-end transformer.

This is a really clear example of a method which reflects both good (we think) ideas and somewhat ad-hoc implementation decisions. As such, there are lots of options and potential variations to consider (beyond the deeper question of what phrasings and intentions even are) -- I'll detail these as I go along.

Note that due to small methodological tweaks and changes in the random seed, the particular output of the transformers as presently implemented may not totally match the output from the paper, but the broad types of questions returned are comparable.

## Preliminaries

First we load the corpus. We will examine a dataset of questions from question periods that take place in the British House of Commons (also detailed in the paper). 

In [1]:
import convokit

In [63]:
from convokit import download

We'll load the corpus, plus some pre-computed dependency parses (see TODO LINK for a demonstration of how to get these parses on your own; for this dataset they should be TODO included with our release).

In [3]:
# OPTION 1: DOWNLOAD CORPUS 
# UNCOMMENT THESE LINES TO DOWNLOAD CORPUS
# DATA_DIR = '<YOUR DIRECTORY>'
# ROOT_DIR = download('parliament-corpus', data_dir=DATA_DIR)

# OPTION 2: READ PREVIOUSLY-DOWNLOADED CORPUS FROM DISK
# UNCOMMENT THIS LINE AND REPLACE WITH THE DIRECTORY WHERE THE PARLIAMENT-CORPUS IS LOCATED
# ROOT_DIR = '<YOUR DIRECTORY>'

corpus = convokit.Corpus(ROOT_DIR)
corpus.load_info('utterance',['parsed'])

In [4]:
VERBOSITY = 10000

Our specific goal, which we'll use ConvoKit to accomplish, is to produce an abstract representation of questions asked by members of parliament, in terms of:

* how they are phrased: what phrasing, or lexico-syntactic "motif", does a question have? 
* their rhetorical intention: what's the intent of the asker -- which we take to mean the response the asker aims to prompt? 

In other words, what are the different types of questions people ask in parliament?

Here's an example of an utterance:

In [5]:
test_utt_id = '1997-01-27a.4.0'
utt = corpus.get_utterance(test_utt_id)

In [6]:
utt.text

"Does my right hon Friend agree that last week 's statement about a replacement royal yacht has been widely welcomed ? Does he agree also that , ideally , Britannia should become the centrepiece of the millennium project in Portsmouth harbour , spanning Gosport and Portsmouth ? I am sure that that idea would prove very popular . As to plans for a new yacht , does my right hon Friend share my distaste for the Opposition 's tactics ? They had every opportunity to express their grudging and negative attitude during the past two years when the project was under discussion ."

To state our goals more precisely:

* For each _sentence_ that has a question (here, the first, second and fourth sentences), we want to come up with a representation of the sentence's phrasing. Intuitively, for instance, the first two sentences sound like they could both be thought of as a "Does X agree that Y?" question -- whether Y is about a yacht or a harbour. 
* For each utterance, we want to come up with a representation of the utterance's rhetorical intent. Intuitively, all the questions could be construed as asking if the answerer is in agreement with the asker -- whether they "agree" with the opinion or "share" the opinion. We might think of this as being an example of an "agreeing" type of question.

Intuitively, if we want to get at this higher level of abstraction, we have to look beyond the particular n-grams: it doesn't seem plausible that there is a meaningful type of question about yachts (unless our specific context is the parliamentary subcommittee on yachts). 



## Preprocessing step: Arcs

One place to start is to look at the structural "skeleton" of a sentence -- i.e., its dependency parse. Thus, we are first going to provide a representation of questions in terms of their dependency parses by extracting all the parent-to-child token edges, or "arcs". We will use the `TextToArcs` class to do this:

In [10]:
from convokit.text_processing import TextToArcs

`get_arcs` is a transformer (actually a `TextProcessor`) that will read the dependency parse of an utterance and write the resultant arcs to a field called `'arcs'`:

In [9]:
get_arcs = TextToArcs('arcs', verbosity=VERBOSITY)
corpus = get_arcs.transform(corpus)

10000/433787 utterances processed
20000/433787 utterances processed
30000/433787 utterances processed
40000/433787 utterances processed
50000/433787 utterances processed
60000/433787 utterances processed
70000/433787 utterances processed
80000/433787 utterances processed
90000/433787 utterances processed
100000/433787 utterances processed
110000/433787 utterances processed
120000/433787 utterances processed
130000/433787 utterances processed
140000/433787 utterances processed
150000/433787 utterances processed
160000/433787 utterances processed
170000/433787 utterances processed
180000/433787 utterances processed
190000/433787 utterances processed
200000/433787 utterances processed
210000/433787 utterances processed
220000/433787 utterances processed
230000/433787 utterances processed
240000/433787 utterances processed
250000/433787 utterances processed
260000/433787 utterances processed
270000/433787 utterances processed
280000/433787 utterances processed
290000/433787 utterances proc

`'arcs'` is a list where each element corresponds to a sentence in the utterance. Each sentence is represented in terms of its arcs, in a space-separated string. 

Each arc, in turn, can be read as follows:

* `x_y` means that `x` is the parent and `y` is the child token (e.g., `agree_does` = `agree --> does`)
* `x_*` means that `x` is a token with at least one descendant, which we do not resolve (this is roughly like bigrams backing off to unigrams)
* `x>y` means that `x` and `y` are the first two tokens in the sentence (the decision here was that how the sentence starts is a signal of "phrasing structure" on par with the dependency tree structure)
* `x>*` means that `x` is the first token in the sentence. 
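To make the notation concrete, here is a minimal sketch of how one might split an arc string back into its parts. The helper `parse_arc` and its labels (`"dependency"`, `"token"`, `"start_pair"`, `"start"`) are hypothetical names made up for this illustration, not part of ConvoKit, and the sketch assumes that tokens do not themselves contain `>`.

```python
def parse_arc(arc):
    """Split one arc string into (kind, x, y); y is None for wildcard arcs.

    Hypothetical helper for illustration only -- not a ConvoKit function.
    """
    if ">" in arc:
        # Sentence-start arcs: `x>y` (first two tokens) or `x>*` (first token).
        x, _, y = arc.partition(">")
        return ("start", x, None) if y == "*" else ("start_pair", x, y)
    # Dependency arcs: `x_y` (parent, child) or `x_*` (token on its own).
    x, _, y = arc.rpartition("_")
    return ("token", x, None) if y == "*" else ("dependency", x, y)


sentence_arcs = "agree_* agree_does does>* does>my"
parsed = [parse_arc(arc) for arc in sentence_arcs.split()]
for item in parsed:
    print(item)
```

Reading arcs this way is only meant as an aid for inspecting the output; the transformers further down consume the space-separated strings directly.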

In [10]:
utt.get_info('arcs')

["'s_* a_* about_* about_yacht agree_* agree_does agree_hon agree_welcomed been_* does>* does>my does_* friend_* has_* hon_* hon_friend hon_my hon_right last_* my_* replacement_* right_* royal_* statement_* statement_about statement_week that_* week_'s week_* week_last welcomed_* welcomed_been welcomed_has welcomed_statement welcomed_that welcomed_widely widely_* yacht_* yacht_a yacht_replacement yacht_royal",
 'agree_* agree_also agree_become agree_does agree_he also_* become_* become_britannia become_centrepiece become_ideally become_should become_spanning become_that britannia_* centrepiece_* centrepiece_of centrepiece_the does>* does>he does_* gosport_* harbour_* harbour_portsmouth he_* ideally_* in_* in_harbour millennium_* of_* of_project portsmouth_* project_* project_in project_millennium project_the should_* spanning_* spanning_gosport that_* the_*',
 'am_* am_i am_sure i>* i_* idea_* idea_that popular_* popular_very prove_* prove_idea prove_popular prove_that prove_would sure

### Further preprocessing: cleaned-up arcs

At this point, while we've got the methodology to start making sense of the dependency tree, we arguably haven't progressed beyond producing fancy bigram representations of sentences. One problem is perhaps that the default arc extraction is a bit too permissive -- it gives us _all_ of the arcs. We might not want this for a few reasons:

* We only want to learn about question phrasings; we don't actually care about non-question sentences.
* The structure of a question might be best encapsulated by the arcs that go out of the _root_ of the tree; as you get further down we might end up with less structural and more content-specific representations.
* Likewise, the particular _nouns_ used (e.g., `yacht`) might not be good descriptions of the more abstract phrasing pattern.

All of these points are debatable, and the resultant modules I'll show below hopefully allow you to play around with them. Taking these points as given for now, though, we'll do the following.

In [7]:
from convokit.phrasing_motifs import CensorNouns, QuestionSentences
from convokit.convokitPipeline import ConvokitPipeline

We will actually create a pipeline to extract the arcs we want. This pipeline has the following components, in order:

* `CensorNouns`: a transformer that removes all the nouns and pronouns from a dependency parse. This transformer also collapses constructions like `What time [is it]` into `What [is it]`.
* `TextToArcs`: calling the arc extractor from above with an extra parameter, `root_only=True`, which will only extract arcs attached to the root (in addition to the first two tokens, though this is also tunable by passing in parameter `use_start=True`).
* `QuestionSentences`: a transformer that, given utterance fields consisting of a list of sentences, keeps only the sentences which contain question marks and removes all the others. Here, we pass an extra parameter `input_filter=question_filter`, telling it to ignore utterances which aren't listed in the Corpus as questions (i.e., if an utterance that isn't labeled as a question happens to contain a question mark, we'll discount it). 
    * (you may wonder how this transformer can tell whether a sentence has a question mark in it, given that the output of `TextToArcs` doesn't have any punctuation. Under the hood, `QuestionSentences` looks at the dependency parse of the sentence and checks whether the last token is a question mark.)
    * `QuestionSentences` also omits any sentences which don't begin with a capital letter. To turn this off, pass parameter `use_caps=False`.
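As a rough illustration of the filtering step, the sketch below keeps the arcs of sentences that end in a question mark and (optionally) start with a capital letter. It is a toy approximation that works on plain sentence strings rather than dependency parses; the function `keep_question_sentences` is a hypothetical name, and the arcs given for the non-question sentence are made up for the example.

```python
def keep_question_sentences(sentences, arcs_per_sentence, use_caps=True):
    """Keep the arc strings of sentences that look like questions.

    `sentences` holds raw sentence strings and `arcs_per_sentence` the
    matching arc strings; both are simplified stand-ins for the parsed
    fields that ConvoKit actually reads.
    """
    kept = []
    for text, arcs in zip(sentences, arcs_per_sentence):
        text = text.strip()
        if not text.endswith("?"):
            continue  # not a question: drop this sentence's arcs
        if use_caps and not text[:1].isupper():
            continue  # mimic the capital-letter check described above
        kept.append(arcs)
    return kept


sentences = [
    "Does my right hon Friend agree that last week 's statement has been widely welcomed ?",
    "I am sure that that idea would prove very popular .",
]
arcs = [
    "agree_* agree_does agree_welcomed does>*",
    "am_* am_sure i>*",  # illustrative arcs for the non-question sentence
]
question_arcs = keep_question_sentences(sentences, arcs)
print(question_arcs)
```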

In [8]:
def question_filter(utt, aux_input={}):
    return utt.meta['is_question']

In [11]:
q_arc_pipe = ConvokitPipeline([
    ('censor_nouns', CensorNouns('parsed_censored', verbosity=VERBOSITY)),
    ('shallow_arcs', TextToArcs('arcs_censored', input_field='parsed_censored', 
                               root_only=True, verbosity=VERBOSITY)),
    ('question_sentence_filter', QuestionSentences('question_arcs', input_field='arcs_censored',
                                         input_filter=question_filter, verbosity=VERBOSITY))
])

In [12]:
corpus = q_arc_pipe.transform(corpus)

10000/433787 utterances processed
20000/433787 utterances processed
30000/433787 utterances processed
40000/433787 utterances processed
50000/433787 utterances processed
60000/433787 utterances processed
70000/433787 utterances processed
80000/433787 utterances processed
90000/433787 utterances processed
100000/433787 utterances processed
110000/433787 utterances processed
120000/433787 utterances processed
130000/433787 utterances processed
140000/433787 utterances processed
150000/433787 utterances processed
160000/433787 utterances processed
170000/433787 utterances processed
180000/433787 utterances processed
190000/433787 utterances processed
200000/433787 utterances processed
210000/433787 utterances processed
220000/433787 utterances processed
230000/433787 utterances processed
240000/433787 utterances processed
250000/433787 utterances processed
260000/433787 utterances processed
270000/433787 utterances processed
280000/433787 utterances processed
290000/433787 utterances proc

This pipeline results in a more minimalistic representation of utterances, in terms of just the arcs at the root of dependency trees, just the questions, and no nouns:

In [13]:
utt.get_info('question_arcs')

['agree_* agree_does agree_welcomed does>*',
 'agree_* agree_also agree_become agree_does does>*',
 'as>* as>to share_* share_does']

Here's another example:

In [14]:
test_utt_id_1 = '2015-06-09c.1041.5'
utt1 = corpus.get_utterance(test_utt_id_1)

In [15]:
utt1.text

'Given what the Foreign Secretary has said about the importance of the Iran discussions on the nuclear agreement , what is he doing to ensure greater clarity about the baselines , the extent of the inspection regime and the consequences of infringement ? Given that the agreement will allow advanced centrifuge , the infringements might arrive a little earlier than anticipated .'

In [16]:
utt1.get_info('question_arcs')

['doing_* doing_ensure doing_given doing_is doing_what given>* given>what']

## Phrasing motifs

Finally, to arrive at our representation of phrasings, we can go one further level of abstraction. In short, some of these arcs feel less fully-specified than others. While `agree_does` sounds like it hints at a coherent question, `doing_is` seems like it's not meaningful until you consider that it occurs in the same sentence as `doing_ensure` (i.e., "_what is the Government doing to ensure...?_")

Our intuition is to think of phrasings as frequently-cooccurring sets of multiple arcs. To extract these frequent arc-sets (which may remind you of the data mining idea of extracting frequent itemsets) we will use the `PhrasingMotifs` class.
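To give a feel for what "frequently-cooccurring sets of arcs" means, here is a brute-force sketch that counts every subset of arcs in each sentence and keeps the subsets seen often enough. This is not how `PhrasingMotifs` is implemented (a real frequent-itemset miner prunes the search far more cleverly); `frequent_arc_sets`, `min_support` and `max_size` as used here are names chosen for this toy, and the sentences are invented.

```python
from collections import Counter
from itertools import combinations


def frequent_arc_sets(sentences, min_support, max_size=3):
    """Count every arc subset (up to max_size arcs) and keep the frequent ones."""
    counts = Counter()
    for arcs in sentences:
        unique_arcs = sorted(set(arcs.split()))
        for size in range(1, max_size + 1):
            for subset in combinations(unique_arcs, size):
                counts[subset] += 1
    # A subset is "frequent" if it occurs in at least min_support sentences.
    return {subset: n for subset, n in counts.items() if n >= min_support}


toy_sentences = [
    "agree_* agree_does does>*",
    "agree_* agree_does does>*",
    "agree_* agree_will will>*",
    "doing_* doing_is doing_what what>*",
]
frequent = frequent_arc_sets(toy_sentences, min_support=2)
for subset, n in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(subset, n)
```

In this toy, `agree_*` is frequent on its own, while `agree_will` and the `doing_*` arcs fall below the threshold and drop out of the vocabulary.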

In [17]:
from convokit.phrasing_motifs import PhrasingMotifs

In [18]:
pm_model = PhrasingMotifs('motifs','question_arcs',min_support=100,fit_filter=question_filter,
                          verbosity=VERBOSITY)

Here, `pm_model` will:

* extract all sets of arcs, as read from the `question_arcs` field, which occur at least 100 times in a corpus (the `min_support` argument). These frequently-occurring arc sets will constitute the set, or "vocabulary", of phrasings.
* write the resultant output -- the phrasings that an utterance contains -- to a field called `motifs`. 

On the latter point, `pm_model` will only transform (i.e., label phrasings for) utterances which are questions, i.e., those for which `question_filter(utterance)` returns `True`. That is, in both the fit and transform steps, we totally ignore non-questions.

Note that the phrasings learned by `pm_model` are therefore _corpus-specific_ -- different corpora may have different frequently-occurring sets, resulting in different vocabularies of phrasings. For instance, you wouldn't expect people in the British House of Commons to ask questions that sound like questions asked to tennis players. In this respect, think of `PhrasingMotifs` like models from scikit-learn (e.g., `LogisticRegression`) -- it is fit to a particular dataset:

In [19]:
pm_model.fit(corpus)

counting frequent itemsets for 325339 sets
	first pass: counting itemsets up to and including 5 items large
	first pass: 10000/325339 sets processed
	first pass: 20000/325339 sets processed
	first pass: 30000/325339 sets processed
	first pass: 40000/325339 sets processed
	first pass: 50000/325339 sets processed
	first pass: 60000/325339 sets processed
	first pass: 70000/325339 sets processed
	first pass: 80000/325339 sets processed
	first pass: 90000/325339 sets processed
	first pass: 100000/325339 sets processed
	first pass: 110000/325339 sets processed
	first pass: 120000/325339 sets processed
	first pass: 130000/325339 sets processed
	first pass: 140000/325339 sets processed
	first pass: 150000/325339 sets processed
	first pass: 160000/325339 sets processed
	first pass: 170000/325339 sets processed
	first pass: 180000/325339 sets processed
	first pass: 190000/325339 sets processed
	first pass: 200000/325339 sets processed
	first pass: 210000/325339 sets processed
	first pass: 220000

Here are the most common phrasings and how often they occur in the data (in # of sentences). Note that `('*',)` denotes the null phrasing -- i.e., it encapsulates sentences with _any_ root word. 

In [20]:
pm_model.print_top_phrasings(25)

('*',) 325339
('will>*',) 67920
('does>*',) 59959
('is_*',) 57904
('is>*',) 45238
('is>*', 'is_*') 42850
('agree_*',) 36085
('agree_does',) 33685
('agree_*', 'agree_does') 33685
('agree_*', 'does>*') 30009
('agree_does', 'does>*') 29984
('agree_*', 'agree_does', 'does>*') 29984
('is_aware',) 22049
('is_*', 'is_aware') 22049
('is>*', 'is_aware') 20704
('is>*', 'is_*', 'is_aware') 20704
('what>*',) 20518
('is_not',) 15977
('is_*', 'is_not') 15977
('is>*', 'is_not') 13408
('is>*', 'is_*', 'is_not') 13408
('accept_*',) 11867
('agree_is',) 11059
('agree_*', 'agree_is') 11059
('agree_does', 'agree_is') 10613


Having "trained", or fitted our model, we can then use it to annotate each (question) utterance in the corpus with the phrasings this utterance contains.

In [21]:
corpus = pm_model.transform(corpus)

10000/433787 utterances processed
20000/433787 utterances processed
30000/433787 utterances processed
40000/433787 utterances processed
50000/433787 utterances processed
60000/433787 utterances processed
70000/433787 utterances processed
80000/433787 utterances processed
90000/433787 utterances processed
100000/433787 utterances processed
110000/433787 utterances processed
120000/433787 utterances processed
130000/433787 utterances processed
140000/433787 utterances processed
150000/433787 utterances processed
160000/433787 utterances processed
170000/433787 utterances processed
180000/433787 utterances processed
190000/433787 utterances processed
200000/433787 utterances processed
210000/433787 utterances processed
220000/433787 utterances processed
230000/433787 utterances processed
240000/433787 utterances processed
250000/433787 utterances processed
260000/433787 utterances processed
270000/433787 utterances processed
280000/433787 utterances processed
290000/433787 utterances proc

One thing to note here is that each sentence can and probably will have multiple phrasings it embodies. For instance, two sentences with phrasings `agree_does` and `agree_will` will both also have phrasing `agree_*`. Intuitively, more finely-specified phrasings (e.g., `agree_does`) more closely specify the phrasing embodied by a sentence (we could imagine "Do you agree..." and "Will you agree..." being very different, but perhaps also more similar to each other than "Can you explain..."). 

We want to keep track of both the complete set of phrasings and the most finely-specified phrasing you can have for each utterance. Therefore, `PhrasingMotifs` actually annotates utterances with _two_ fields.

`motifs` lists all the phrasings (arcs in a phrasing motif are separated by two underscores, `'__'`):

In [22]:
utt.get_info('motifs')

['agree_* agree_*__does>* does>*',
 'agree_* agree_*__agree_also agree_*__does>* does>*',
 'as>* share_* share_*__share_does']

and `motifs__sink` lists the most finely-specified _sink phrasings_ (they are "sinks" in the sense that if you think of phrasings as a directed graph where A-->B when B is a more finely-specified version of A, these sinks have no child phrasings which are contained in the utterance):

In [23]:
utt.get_info('motifs__sink')

['agree_*__does>*', 'agree_*__agree_also', 'as>* share_*__share_does']
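One simple way to think about sinks is as the motifs in a sentence that are not strictly contained in any other motif of that sentence. The sketch below implements just that rule on the string format shown above; `sink_motifs` is a hypothetical helper, not ConvoKit's implementation.

```python
def sink_motifs(motif_string):
    """Return the motifs that have no strictly larger motif in the same sentence.

    Motifs are space-separated; arcs within a motif are joined by '__'.
    """
    motifs = [frozenset(m.split("__")) for m in motif_string.split()]
    # `m < other` is a strict-subset test on the arc sets.
    sinks = [m for m in motifs if not any(m < other for other in motifs)]
    return ["__".join(sorted(m)) for m in sinks]


first = sink_motifs("agree_* agree_*__does>* does>*")
third = sink_motifs("as>* share_* share_*__share_does")
print(first)
print(third)
```

For the first and third question sentences of our example this reproduces the sinks listed above. ConvoKit keeps additional bookkeeping about how motifs relate to each other, so its output can differ from this rule: for the second sentence, the simple superset check would also keep `agree_*__does>*`, which the actual output omits.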

We'll save a subset of our output to disk -- the filtered arcs, and the motifs, potentially for use in a later transformer.

In [23]:
corpus.dump_info('utterance', ['motifs', 'motifs__sink', 'arcs_censored'])

### model persistence

We can save `pm_model` to disk and later reload it, thus caching the trained model (i.e., the motifs in a corpus and the internal representation of these motifs). Here, we save the model to a `pm_model` subfolder in the corpus directory via `dump_model()`:

In [24]:
import os

In [25]:
pm_model.dump_model(os.path.join(ROOT_DIR, 'pm_model'))

writing itemset counts
writing downlinks
writing itemset to ids
writing meta information


This subfolder then stores the motifs, as well as relations between the motifs that facilitate transforming new utterances.

In [26]:
pm_model_dir = os.path.join(ROOT_DIR, 'pm_model')
!ls $pm_model_dir

downlinks.json	itemset_counts.json  itemset_to_ids.json  meta.json


Suppose we later initialize a new `PhrasingMotifs` model, `new_pm_model`.

In [27]:
new_pm_model = PhrasingMotifs('motifs_new','question_arcs',min_support=100,fit_filter=question_filter,
                          verbosity=VERBOSITY)

Calling `load_model()` then reloads the stored model from our earlier run into this new model:

In [28]:
new_pm_model.load_model(os.path.join(ROOT_DIR, 'pm_model'))

reading itemset counts
reading downlinks
reading itemset to ids
reading meta information


Just to check that we've loaded the same thing that we previously saved, we'll get the motifs in our test utterance using `new_pm_model`:

In [29]:
utt = new_pm_model.transform_utterance(utt)

This is the output from the original run:

In [30]:
utt.get_info('motifs__sink')

['agree_*__does>*', 'agree_*__agree_also', 'as>* share_*__share_does']

And we see the new output matches.

In [31]:
utt.get_info('motifs_new__sink')

['agree_*__does>*', 'agree_*__agree_also', 'as>* share_*__share_does']

### example variation: not removing the nouns

**note** this takes a while to run, and is somewhat of an extension -- you can safely skip these cells.

There are other ways to use `PhrasingMotifs` that might be more or less suited to your own application. For instance, you may wonder what happens if we do not remove the nouns (as we did with `CensorNouns` above). To try this out, we can create an alternate pipeline that uses `TextToArcs` to generate root arcs (setting argument `root_only=True`) on the original parses, not the noun-censored ones.

In [32]:
q_arc_pipe_full = ConvokitPipeline([
    ('shallow_arcs_full', TextToArcs('root_arcs', input_field='parsed', 
                               root_only=True, verbosity=VERBOSITY)),
    ('question_sentence_filter', QuestionSentences('question_arcs_full', input_field='root_arcs',
                                         input_filter=question_filter, verbosity=VERBOSITY)),

])
corpus = q_arc_pipe_full.transform(corpus)


10000/433787 utterances processed
20000/433787 utterances processed
30000/433787 utterances processed
40000/433787 utterances processed
50000/433787 utterances processed
60000/433787 utterances processed
70000/433787 utterances processed
80000/433787 utterances processed
90000/433787 utterances processed
100000/433787 utterances processed
110000/433787 utterances processed
120000/433787 utterances processed
130000/433787 utterances processed
140000/433787 utterances processed
150000/433787 utterances processed
160000/433787 utterances processed
170000/433787 utterances processed
180000/433787 utterances processed
190000/433787 utterances processed
200000/433787 utterances processed
210000/433787 utterances processed
220000/433787 utterances processed
230000/433787 utterances processed
240000/433787 utterances processed
250000/433787 utterances processed
260000/433787 utterances processed
270000/433787 utterances processed
280000/433787 utterances processed
290000/433787 utterances proc

We can then train a new `PhrasingMotifs` model that finds phrasings with the nouns still included.

In [33]:
noun_pm_model = PhrasingMotifs('motifs_full','question_arcs_full',min_support=100,
                               fit_filter=question_filter, 
                          verbosity=VERBOSITY)
noun_pm_model.fit(corpus)

counting frequent itemsets for 325339 sets
	first pass: counting itemsets up to and including 5 items large
	first pass: 10000/325339 sets processed
	first pass: 20000/325339 sets processed
	first pass: 30000/325339 sets processed
	first pass: 40000/325339 sets processed
	first pass: 50000/325339 sets processed
	first pass: 60000/325339 sets processed
	first pass: 70000/325339 sets processed
	first pass: 80000/325339 sets processed
	first pass: 90000/325339 sets processed
	first pass: 100000/325339 sets processed
	first pass: 110000/325339 sets processed
	first pass: 120000/325339 sets processed
	first pass: 130000/325339 sets processed
	first pass: 140000/325339 sets processed
	first pass: 150000/325339 sets processed
	first pass: 160000/325339 sets processed
	first pass: 170000/325339 sets processed
	first pass: 180000/325339 sets processed
	first pass: 190000/325339 sets processed
	first pass: 200000/325339 sets processed
	first pass: 210000/325339 sets processed
	first pass: 220000

The most common phrasings, of course, won't be very topic-specific (unless people talk about yachts very very frequently in parliament). However, we do see that phrasings now reflect the pronoun used (which may be troublesome if we believe that "Does _he_ agree" and "Does _she_ agree" are getting at similar things).

In [34]:
noun_pm_model.print_top_phrasings(25)

('*',) 325339
('will>*',) 70226
('does>*',) 61032
('is_*',) 57964
('is>*',) 45268
('is>*', 'is_*') 42850
('agree_*',) 36108
('agree_does',) 33704
('agree_*', 'agree_does') 33704
('agree_*', 'does>*') 30012
('agree_does', 'does>*') 29987
('agree_*', 'agree_does', 'does>*') 29987
('will>the',) 26218
('will>*', 'will>the') 26218
('will>he',) 23049
('will>*', 'will>he') 23049
('is_aware',) 22063
('is_*', 'is_aware') 22063
('does>the',) 20932
('does>*', 'does>the') 20932
('what>*',) 20791
('is>*', 'is_aware') 20707
('is>*', 'is_*', 'is_aware') 20707
('does>he',) 16417
('does>*', 'does>he') 16417


Here are the sink phrasings for our example utterance from earlier, comparing against the noun-less run:

In [35]:
utt = noun_pm_model.transform_utterance(utt)

In [88]:
utt.get_info('motifs__sink')

['agree_*__does>*', 'agree_*__agree_also', 'as>* share_*__share_does']

In [36]:
utt.get_info('motifs_full__sink')

['agree_*__agree_hon',
 'agree_*__agree_also__agree_he',
 'as>* share_*__share_hon']

We see that we get this extra "hon" -- which actually stands for "honourable [member]" -- an artefact of parliamentary etiquette. 

For our particular dataset, removing nouns has the benefit of removing most of these etiquette-related words. However, you may also imagine cases where nouns actually carry a lot of useful information about rhetorical intent (including in this domain -- one could argue that asking about a person versus asking about a department is a strong signal of trying to get at different things, for instance). As such, noun-removal is something that you may want to play around with, and/or try to improve upon. 

## PromptTypes

As we intuited above, "do you agree" and "do you share my opinion" are both getting at similar intentions. However, extracting these phrasings alone won't allow us to make this association. Rather, our strategy will be to produce vector representations of them which encode this similarity. Clustering these representations then gives us different "types of question".

Our key intuition here is that questions with similar intentions will tend to be answered in similar ways. Thus, "do you agree" and "do you share" may both often be answered with "yes, I agree"; if tomorrow I asked a new question of this ilk ("do you agree that we should invest in planes, instead of yachts"), I might be expecting a similar sort of answer. 
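The intuition can be sketched in a few lines: describe each question phrasing by the response arcs it tends to co-occur with, then compare phrasings by the similarity of those response profiles. This toy uses raw counts and cosine similarity only; it leaves out the weighting, dimensionality reduction and clustering that the actual method uses, and the question-response pairs below are invented for illustration.

```python
from collections import Counter
from math import sqrt


def response_profiles(pairs):
    """Map each question phrasing to counts of the response arcs it co-occurs with."""
    profiles = {}
    for phrasing, response_arcs in pairs:
        profiles.setdefault(phrasing, Counter()).update(response_arcs.split())
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented (question phrasing, response arcs) pairs.
toy_pairs = [
    ("agree_*__agree_does", "agree_* agree_i"),
    ("agree_*__agree_does", "agree_* agree_i share_*"),
    ("share_*__share_does", "agree_* agree_i"),
    ("doing_*__doing_what", "taking_* taking_are"),
]
profiles = response_profiles(toy_pairs)
agree_vs_share = cosine(profiles["agree_*__agree_does"], profiles["share_*__share_does"])
agree_vs_doing = cosine(profiles["agree_*__agree_does"], profiles["doing_*__doing_what"])
print(agree_vs_share, agree_vs_doing)
```

Here the "agree" and "share" phrasings end up close together because they draw similar responses, even though they share no arcs themselves, while the "what is X doing" phrasing is far from both.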

For a full explanation of this idea, and how we operationalized it, you can read our paper. In ConvoKit, we implement this methodology of producing vector representations and clusterings via the `PromptTypes` transformer:

In [24]:
from convokit.prompt_types import PromptTypes

`PromptTypes` will train a model -- a low-dimensional embedding, along with a k-means clustering -- by using question-answer pairs as input. 

In [25]:
def question_filter(utt, aux_input={}):
    return utt.meta['is_question']
def response_filter(utt, aux_input={}):
    return (not utt.meta['is_question']) and (utt.reply_to is not None)
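To see what these two filters select, here is a small sketch that pairs each accepted response with the accepted question it replies to. `ToyUtterance` and `question_answer_pairs` are hypothetical stand-ins written for this example; how `PromptTypes` assembles its input pairs internally may differ in details.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyUtterance:
    """Hypothetical stand-in exposing only the attributes the filters read."""
    id: str
    reply_to: Optional[str] = None
    meta: dict = field(default_factory=dict)


def question_filter(utt, aux_input={}):
    return utt.meta['is_question']


def response_filter(utt, aux_input={}):
    return (not utt.meta['is_question']) and (utt.reply_to is not None)


def question_answer_pairs(utterances):
    """Pair each accepted response with the accepted question it replies to."""
    by_id = {utt.id: utt for utt in utterances}
    pairs = []
    for utt in utterances:
        if not response_filter(utt):
            continue
        parent = by_id.get(utt.reply_to)
        if parent is not None and question_filter(parent):
            pairs.append((parent.id, utt.id))
    return pairs


toy = [
    ToyUtterance('q1', meta={'is_question': True}),
    ToyUtterance('a1', reply_to='q1', meta={'is_question': False}),
    ToyUtterance('s1', meta={'is_question': False}),                 # no reply_to: not a response
    ToyUtterance('a2', reply_to='s1', meta={'is_question': False}),  # parent is not a question
]
pairs = question_answer_pairs(toy)
print(pairs)
```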

We initialize `pt` with the following arguments:

* `n_types=8`: we want to infer 8 types of questions.
* `prompt_field='motifs'`: we want to encode questions in terms of the phrasing motifs we extracted above. Thus, `pt` will produce representations of these motifs (rather than, e.g., the raw tokens in a question).
* `ref_field='arcs_censored'`: we will encode responses in terms of the noun-less arcs we extracted above (in practice, this appears to work better than using phrasings of responses as well, perhaps because responses are noisier)
* `prompt_filter=question_filter` and `ref_filter=response_filter`: To tell the transformer what counts as a question and an answer, we will pass the constructor the above filters (i.e., boolean functions). Note that in a less questions-heavy dataset, we could omit these filters and hence infer types of "prompts" beyond questions.
* `prompt_transform_field='motifs__sink'`: while we want to come up with a representation of _all_ phrasing motifs, when we produce a vector representation of a _particular_ utterance we want to use the most finely-specified phrasing.

There are some other arguments you can set, which are listed in the docstring. 



In [26]:
pt = PromptTypes(n_types=8, prompt_field='motifs', ref_field='arcs_censored', 
                 prompt_filter=question_filter, ref_filter=response_filter,
                 prompt_transform_field='motifs__sink',
                 output_field='prompt_types',
    random_state=1000, verbosity=1)

We can fit `pt` to the corpus -- that is, learn the associations between question phrasings and response dependency arcs that allow us to produce our vector representations, as well as a clustering of these representations that gives us our different question types.

In [27]:
pt.fit(corpus)

fitting 195441 input pairs
fitting ref tfidf model
fitting prompt tfidf model
fitting svd model
fitting 8 prompt types


Calling `display_type()` as below will print the question phrasings, response arcs, and prototypical questions and responses that are associated with each inferred type of question. We will examine some of these types more closely by way of examples, below.

In [28]:
for i in range(8):
    print(i)
    pt.display_type(i, corpus=corpus, k=15)
    print('\n\n')

0
top prompt:


| phrasing | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | type_id |
|---|---|---|---|---|---|---|---|---|---|
| `made_*` | 0.627821 | 1.112966 | 1.253935 | 1.080853 | 1.263088 | 1.081064 | 1.085615 | 1.120296 | 0.0 |
| `made_*__made_in` | 0.670131 | 1.092172 | 1.296672 | 1.044795 | 1.16574 | 1.096866 | 1.117699 | 1.102888 | 0.0 |
| `made_*__made_to` | 0.677337 | 1.226368 | 1.219402 | 1.145383 | 1.388485 | 1.110612 | 1.212821 | 1.180075 | 0.0 |
| `in>*__tell_*` | 0.681406 | 1.176855 | 0.986156 | 1.121633 | 1.332023 | 0.847615 | 0.987093 | 1.26624 | 0.0 |
| `made_*__made_what` | 0.683845 | 1.124455 | 1.353602 | 1.139765 | 1.248646 | 1.112601 | 1.209657 | 0.959844 | 0.0 |
| `made_*__made_been` | 0.689245 | 1.12298 | 1.277784 | 1.117208 | 1.263149 | 1.144714 | 1.17727 | 1.178932 | 0.0 |
| `happen_*__happen_will` | 0.697615 | 1.202465 | 1.101319 | 1.157835 | 1.233026 | 0.868512 | 1.052773 | 1.120683 | 0.0 |
| `made_*__what>*` | 0.698273 | 1.148946 | 1.360852 | 1.140067 | 1.24789 | 1.156193 | 1.226739 | 0.99763 | 0.0 |
| `made_*__made_been__made_what` | 0.706422 | 1.105542 | 1.336376 | 1.133772 | 1.224077 | 1.158618 | 1.23057 | 1.052753 | 0.0 |
| `made_*__made_has` | 0.707376 | 1.123585 | 1.303389 | 1.18135 | 1.30712 | 1.138856 | 1.224905 | 1.109336 | 0.0 |


top response:


| response arc | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | type_id |
|---|---|---|---|---|---|---|---|---|---|
| `am_at` | 0.744773 | 0.935439 | 1.20501 | 1.062946 | 1.220152 | 1.02565 | 1.073168 | 1.238502 | 0.0 |
| `known_*` | 0.746823 | 1.227288 | 1.121113 | 1.15773 | 1.273498 | 1.041451 | 1.067706 | 1.242575 | 0.0 |
| `can_*` | 0.785983 | 1.107881 | 1.200609 | 1.058894 | 1.083504 | 0.993568 | 1.07024 | 0.982096 | 0.0 |
| `place_*` | 0.789185 | 1.049869 | 1.162727 | 1.044718 | 1.15699 | 1.036115 | 1.094712 | 1.125281 | 0.0 |
| `assure_have` | 0.796057 | 0.944208 | 1.301825 | 0.915244 | 0.966593 | 1.071265 | 1.028048 | 0.994812 | 0.0 |
| `was_made` | 0.797752 | 1.188686 | 0.977958 | 1.147366 | 1.179836 | 0.915078 | 0.944907 | 1.195879 | 0.0 |
| `make_shall` | 0.80267 | 0.873247 | 1.176298 | 0.898143 | 0.952539 | 0.89805 | 0.83441 | 1.12915 | 0.0 |
| `give_can` | 0.804878 | 1.036402 | 1.154003 | 1.068707 | 1.223465 | 0.858061 | 1.07292 | 1.131898 | 0.0 |
| `have_made` | 0.81164 | 1.045384 | 1.182388 | 0.922776 | 1.063783 | 1.034568 | 0.824843 | 1.211438 | 0.0 |
| `write_shall` | 0.815864 | 1.160921 | 1.133396 | 1.17003 | 1.271814 | 1.098569 | 1.200611 | 1.185541 | 0.0 |


top prompts:
1996-02-06a.122.5 As the House will be interested today in the change of personnel at British Gas , what direct contacts have been made with British Gas on the laying of the pipeline ? What issues were discussed ? On what dates were meetings held and what recommendations , if any , were made by the Ministry of Defence ?
['as>* made_*__made_be made_*__made_been__made_have made_*__made_been__made_what made_*__made_have__made_what made_*__made_on__made_what', 'discussed_* what>*', 'made_*__made_on on>*']

2011-05-05c.764.8 The Secretary of State made a point about less prescriptive service requirements , but will he give a guarantee that stations such as Runcorn mainline station and Widnes station in my constituency , which have seen a significant increase in passengers in the past five years , will not , as a result of his reform of franchising , have a reduction in the number of stopping trains ?
['give_*__give_will made_*']

2005-02-07a.1170.4 My understanding is that if t

Unnamed: 0,0,1,2,3,4,5,6,7,type_id
agree_*__agree_will__will>*,1.081874,0.497942,1.273505,0.848749,0.933643,0.990091,0.997196,1.086475,1.0
agree_*__agree_will,1.05683,0.499644,1.266302,0.850806,0.947449,0.949093,0.966158,1.089017,1.0
agree_*__will>*,1.104583,0.515868,1.262447,0.846606,0.875943,1.016219,0.976856,1.101176,1.0
meet_*,1.123487,0.546317,1.25353,0.877287,1.00936,0.917613,1.023773,1.032337,1.0
agree_*__agree_meet__will>*,1.143766,0.55996,1.303826,0.99068,1.077422,1.059494,1.132881,1.088021,1.0
agree_*__agree_meet,1.113804,0.560347,1.303108,0.929585,1.034495,1.034661,1.067001,1.109925,1.0
undertake_*,1.008052,0.573649,1.260006,0.80237,1.021824,1.003247,1.016716,1.056516,1.0
meet_*__meet_will,1.135676,0.579419,1.234611,0.877325,0.985532,0.909151,1.012462,1.042688,1.0
raise_*__raise_will,1.039347,0.579697,1.310012,0.890007,0.994179,1.090194,1.039332,1.094929,1.0
press_*__press_may,1.113677,0.582473,1.234204,0.882112,1.106788,0.879551,1.003774,1.120163,1.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
am_always,1.154629,0.584922,1.21927,0.79042,0.961395,0.994641,0.96633,1.204803,1.0
am_aware,0.926102,0.610505,1.265185,0.805867,1.135296,1.001407,1.023496,1.168976,1.0
was_aware,1.096868,0.636389,1.215653,0.992701,1.159254,1.10333,1.106659,1.264721,1.0
want_obviously,1.196034,0.641195,1.270706,0.792399,1.009466,1.039959,1.110401,1.072885,1.0
know_been,1.049947,0.647338,1.222541,0.887264,0.876106,1.018087,0.997834,1.105618,1.0
know_takes,1.078992,0.653681,1.300367,0.852636,0.846953,1.086783,1.048624,0.949327,1.0
get_back,1.174138,0.675439,1.244518,1.085933,1.144474,1.130576,1.202466,1.202862,1.0
am_interested,1.162955,0.678403,1.114928,1.018159,1.209881,0.948988,1.01752,1.261251,1.0
suspect_is,1.090133,0.67904,1.093455,1.013447,1.047982,0.866533,0.952681,1.208949,1.0
be_happy,0.98805,0.685855,1.180272,0.787081,0.853065,0.855227,0.803427,1.110432,1.0


top prompts:
2015-10-26c.5.5 Redbridge , like many other parts of London , faces an acute shortage of places in primary and secondary provision over the course of this Parliament . Will the Secretary of State or a relevant Minister agree to meet me and representatives from the local authority to discuss this ? Will she consider allowing local authorities such as Redbridge with a good track record of local authority maintained schools not only to expand existing local authority schools but to build new ones ?
['agree_*__agree_meet__will>*', 'consider_*__will>*']

2015-02-24c.195.0 When I asked the Minister last June what guarantees he would give to GP practices at risk because of the withdrawal of the minimum practice income guarantee , I was told that NHS England would ensure threatened practices “ get to the right place.”—[ Official Report , 10 June 2014 ; Vol . 582 , c. 400 . ] Over the past seven months , those discussions have not alleviated the threat to two highly regarded practi

Unnamed: 0,0,1,2,3,4,5,6,7,type_id
admit_*,1.201171,1.292763,0.570578,1.214884,1.266494,0.947761,0.940144,1.397194,2.0
why>*,1.111746,1.310107,0.57465,1.22539,1.281744,0.860374,0.900415,1.388247,2.0
admit_*__will>*,1.231918,1.313895,0.576193,1.257695,1.240869,0.998804,1.011262,1.359635,2.0
explain_*,1.081285,1.253725,0.576442,1.206437,1.269414,0.843906,0.912722,1.365633,2.0
explain_*__explain_will,1.084332,1.294334,0.591024,1.183642,1.188385,0.920798,0.869349,1.373634,2.0
is>*__is_*__is_true,1.16878,1.315689,0.596213,1.103997,1.143709,1.017692,0.846551,1.483289,2.0
is_*__why>*,1.184021,1.282762,0.601567,1.21484,1.174174,0.91112,0.845809,1.364674,2.0
justify_*,1.203301,1.322534,0.609482,1.251149,1.275439,0.976639,1.037237,1.358217,2.0
admit_*__admit_will__will>*,1.239174,1.311162,0.610798,1.289524,1.272271,0.984968,1.058717,1.349525,2.0
is_*__is_true,1.171478,1.337066,0.616361,1.159053,1.183389,1.032628,0.898227,1.49126,2.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
wonder_*,1.1772,1.279819,0.594013,1.185855,1.109752,0.87208,0.870528,1.31476,2.0
failed_*,1.208959,1.330511,0.634805,1.256673,1.129173,0.963978,0.931793,1.338335,2.0
were_*,1.210541,1.366612,0.654166,1.19862,1.070178,1.06855,0.916717,1.350363,2.0
is_wrong,1.171259,1.387414,0.662125,1.253352,1.153526,0.974954,0.946674,1.315054,2.0
instead>*,1.178271,1.26101,0.675487,1.211907,1.236813,0.851639,0.999561,1.264442,2.0
talks_*,1.238792,1.229554,0.695368,1.236562,1.260476,0.900432,1.032589,1.339003,2.0
am_surprised,1.172448,1.23272,0.698857,1.211564,1.223285,1.007399,0.960399,1.332919,2.0
were_there,1.212256,1.391393,0.702025,1.248951,1.151671,1.088265,0.991262,1.355348,2.0
talks_about,1.231265,1.22522,0.706294,1.24026,1.297678,0.887904,1.04609,1.339956,2.0
was_*,1.178669,1.17664,0.713316,1.100452,0.869292,0.922654,0.726262,1.286168,2.0


top prompts:
1990-01-11a.1074.6 Does the Minister not understand that while he is slashing research into food safety , sacking scientists by the thousand and delaying the introduction of vital regulations , the general public will have little confidence in a food safety directorate within his Department that is responsible directly to him ? Why does he not show that he takes the issue seriously by establishing a food standards agency , independent of the Government , as advocated by the Labour party and many other information organisations ?
['does>*__understand_*__understand_not', 'show_*__show_does__show_not why>*__why>does']

1987-03-04a.857.5 Will the Secretary of State stop giving us what is called in the pop record industry a remix of alibis , excuses and gimmicks ? Will he admit that the number of homes built to rent last year by local authorities was the lowest in 62 years , that the housing investment programme net of capital receipts was the lowest in real terms since HIPs we

1985-07-18a.474.0 I am surprised that the right hon Gentleman thinks it surprising that the Chancellor should think it important to be at a party conference . I wonder whether the right hon Gentleman would not put off certain engagements to be at his own party conference .
['am_* am_surprised', 'wonder_* wonder_put']

1991-10-16a.298.7 I wonder whether the hon Gentleman is aware that one of the largest ever single export orders to Japan was won recently by British Aerospace . The order was worth £ 445 million and was to provide search and rescue aircraft to the Japanese self - defence force .
['wonder_* wonder_is', 'was_* was_provide was_worth']

1994-06-21a.111.9 The hon Gentleman has got it wrong . The job losses that the management announced last week were the result of efficiency measures that they had introduced . Therefore , my right hon and learned Friend has kept his word entirely . I should have thought that an accusation such as that on his birthday was below the belt .
['got

Unnamed: 0,0,1,2,3,4,5,6,7,type_id
learned_*__will_*,1.035791,0.903941,1.167507,0.538998,0.842047,1.151453,0.80045,1.30125,3.0
learned_*__will>*,1.031033,0.887743,1.17639,0.542707,0.852538,1.149004,0.822925,1.297674,3.0
draw_*__will>*,1.02198,0.905078,1.149718,0.546633,0.942019,1.051063,0.876365,1.174293,3.0
bear_*__bear_in__in>*,1.078134,0.951693,1.231007,0.552075,0.980356,1.15239,0.980947,1.223885,3.0
draw_*__draw_will,1.037764,0.907857,1.155981,0.555476,0.906422,1.068294,0.878078,1.179685,3.0
convey_*__convey_to,1.080879,0.956686,1.17974,0.566019,1.013484,1.081067,0.986655,1.232593,3.0
will_*,0.998802,0.842949,1.137713,0.568387,0.834968,1.096154,0.724175,1.314792,3.0
convey_*__convey_to__convey_will,1.105769,1.021487,1.16184,0.589949,0.98565,1.122533,0.987816,1.217223,3.0
will>*__will_*,0.998499,0.822044,1.170187,0.597136,0.865728,1.121626,0.765781,1.322978,3.0
does_*__learned_*__learned_accept,1.100867,0.996306,1.083552,0.606117,0.873094,1.116297,0.733854,1.325595,3.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
note_says,0.998016,0.95833,1.122652,0.602703,0.978719,1.028584,0.810145,1.288853,3.0
emphasise_*,1.038182,0.84825,1.206117,0.608093,0.922879,1.111263,0.814011,1.260635,3.0
learned_*,0.992413,0.936676,1.066827,0.616777,0.775857,0.982881,0.672981,1.247093,3.0
note_*,1.045313,0.916748,1.076491,0.621081,0.976873,1.074091,0.762412,1.35107,3.0
be_important,1.080727,0.864054,1.180871,0.627343,0.856606,1.047932,0.829196,1.143138,3.0
is_consider,0.97022,0.82766,1.24024,0.638418,0.972265,1.01127,0.938706,1.145201,3.0
are_always,1.076425,0.917767,1.173078,0.641689,1.026539,1.13699,0.8933,1.269326,3.0
consider_is,0.960686,0.810932,1.203336,0.643591,1.01007,1.000808,0.825824,1.243528,3.0
convey_*,1.11556,0.936023,1.205373,0.64489,1.076424,1.138973,1.018776,1.274255,3.0
consider_must,1.032756,0.89067,1.197095,0.647674,0.965585,1.106802,0.850974,1.284388,3.0


top prompts:
1980-05-15a.1743.0 Will my right hon Friend take time to study the difference in pay settlements between the private sector and the public services and public monopolies , especially the water authorities ? Will she bear in mind that our constituents , especially mine , are increasingly unable to pay for the enormous cost of water and sewage treatment ? Will she draw the appropriate conclusions ?
['take_*__will>*', 'bear_*__will>*', 'draw_*__will>*']

2002-07-15.7.3 Bearing in mind the fact that we have a national defence service and the shortage of such facilities in my area , will the Minister exercise caution in making decisions to move a facility from the north of England to a site further south ? In particular , will he bear in mind the history of Welbeck army college and the importance to the local economy of the jobs that it creates as the 10th largest employer in my constituency ?
['bearing>*', 'bear_*__bear_in__in>*']

1982-07-19a.8.9 Is the Minister aware that I 

Unnamed: 0,0,1,2,3,4,5,6,7,type_id
agree_*__agree_is,1.188158,1.063,1.140767,0.952855,0.390807,1.232684,0.862497,1.120674,4.0
agree_*__agree_be__does>*,1.14527,1.019666,1.15276,0.885044,0.397438,1.20073,0.797764,1.133618,4.0
agree_*__agree_be,1.140607,1.022338,1.151803,0.878635,0.398083,1.19152,0.78868,1.138992,4.0
agree_*__agree_is__does>*,1.185667,1.067486,1.143729,0.954654,0.398762,1.238654,0.870336,1.11485,4.0
agree_*__agree_have,1.160346,1.060732,1.162782,0.954536,0.439447,1.232611,0.864388,1.130915,4.0
agree_*__agree_are,1.198654,1.093265,1.142069,0.940229,0.446017,1.247844,0.859772,1.162318,4.0
agree_*__agree_does__agree_have__does>*,1.145078,1.092075,1.144758,0.956562,0.454053,1.221288,0.848933,1.117233,4.0
agree_*__agree_are__agree_does__does>*,1.199917,1.099268,1.139137,0.94469,0.45824,1.253012,0.868788,1.158872,4.0
agree_*__agree_also,1.184892,1.135958,1.127462,1.026366,0.468385,1.265777,0.892345,1.176072,4.0
continue_*__will>*,1.153943,1.039539,1.207467,0.966704,0.474253,1.195661,0.962171,0.987554,4.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
agree_certainly,1.180155,1.071745,1.201017,0.971265,0.461342,1.293082,0.952085,1.10175,4.0
agree_is,1.174426,1.071817,1.205062,0.98537,0.468024,1.291348,0.950482,1.095343,4.0
agree_however,1.171552,1.076513,1.193439,0.995855,0.468574,1.28644,0.93341,1.117073,4.0
agree_will,1.18198,1.04299,1.209447,1.001755,0.475267,1.291913,0.946098,1.121801,4.0
agree_also,1.194347,1.078747,1.19395,0.990467,0.476035,1.294584,0.947017,1.096503,4.0
agree_wholeheartedly,1.184639,1.092625,1.190386,1.018646,0.476682,1.280336,0.964095,1.089767,4.0
agree_absolutely,1.194103,1.062569,1.219094,0.982575,0.476712,1.294999,0.986945,1.067689,4.0
is_also,1.19596,1.051437,1.11008,0.98655,0.478089,1.101184,0.882075,1.002128,4.0
agree_be,1.164383,1.079607,1.202961,0.994257,0.481263,1.288859,0.94401,1.104167,4.0
agree_completely,1.186661,1.08971,1.208564,1.017079,0.481393,1.2983,0.976094,1.076271,4.0


top prompts:
1992-07-02a.953.5 Does the Minister agree that in the sugar beet sector , as in others , the proposed changes in inheritance tax , which have been discussed in Standing Committee , are likely to have some contradictory effects ? I welcome the changes in inheritance tax because the industry obviously needs them , but does the Minister agree that over a range of agricultural changes introduced as a result of the common agricultural policy , it will be necessary to avoid preventing the achievement of major environmental access and recreational agreements ? The good work that the Minister has done in the past could be undone . Does the Minister agree that some steps will have to be taken to mitigate those effects ?
['agree_*__agree_are__agree_does__does>*', 'welcome_*', 'agree_*__agree_does__agree_have__does>*']

1998-06-16a.132.0 Does my hon Friend agree that economic stability is fundamental to Israel and the Palestinians—a fact recognised by both sides ? Does he also agree 

Unnamed: 0,0,1,2,3,4,5,6,7,type_id
say_*,0.844708,1.083265,0.971673,1.117433,1.294797,0.619143,0.997052,1.194575,5.0
mean_*,0.996559,1.12255,0.864944,1.110938,1.171208,0.624402,0.817696,1.199592,5.0
have_*,0.94339,0.849485,0.994944,0.846488,1.098475,0.637829,0.825819,1.123806,5.0
mean_*__mean_does,0.959681,1.149855,0.872096,1.146608,1.217319,0.664998,0.853081,1.239925,5.0
given>*,1.009578,0.821352,1.145473,0.999191,1.153424,0.670679,1.043621,0.946857,5.0
explain_*__explain_can__explain_is,1.093834,1.091398,0.82147,1.144547,1.207012,0.686644,0.885649,1.186571,5.0
have_*__have_for__have_what,1.036565,0.960059,1.103153,1.146903,1.288828,0.692222,1.095189,1.137421,5.0
said_*,1.065214,0.86955,1.033804,1.078371,1.205563,0.693865,0.960979,1.095316,5.0
make_*__make_what,1.027176,1.022561,0.892963,1.045675,1.077412,0.698897,0.79764,1.201476,5.0
go_*,1.082529,0.951352,0.940076,0.935363,1.076293,0.703714,0.727434,1.257622,5.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
said_in,1.072858,1.111422,0.879407,1.156333,1.227892,0.625378,0.913928,1.223817,5.0
said_to,1.043832,1.108993,0.961855,1.15544,1.265747,0.630438,0.996852,1.199537,5.0
said_as,1.084223,1.054055,0.985156,1.135903,1.195105,0.65378,0.959594,1.193883,5.0
secondly>*,1.158341,1.166617,0.829746,1.223155,1.198225,0.664336,0.997184,1.150355,5.0
first>*,1.166789,1.093003,0.914265,1.215943,1.216189,0.66769,1.050758,1.111849,5.0
said_*,1.063014,1.126901,0.883348,1.143466,1.183556,0.669222,0.864938,1.256876,5.0
is_say,1.062904,0.999097,0.937645,1.073971,1.065091,0.671739,0.887503,1.140267,5.0
said_was,1.081109,1.141609,0.856563,1.15772,1.216481,0.67285,0.903487,1.238201,5.0
on>*,0.908192,1.054282,0.890554,0.992212,1.082866,0.673259,0.748206,1.243729,5.0
expect_do,0.962934,1.002189,0.962702,1.017961,1.225525,0.67797,0.861367,1.300114,5.0


top prompts:
1996-03-13a.979.7 Given that that must mean a considerable loss of export revenues , what does the Minister plan by way of incentives to British companies to ensure that they train their employees more effectively in languages ?
['given>* mean_*']

2011-09-05a.12.3 Under the strategic housing land assessment process started by the previous Government , developers can nominate potential sites to go on a list in a way that does not seem to engage heritage organisations or heritage issues . Given the presumption in favour of development , does that mean that heritage issues can not be brought to bear as reasons for refusing applications on sites on that list ?
['given>* mean_*__mean_does']

2009-06-01b.3.6 I would like to express the condolences of the Opposition to the family of Private Smith , and also to the families of those who have fallen since we last met . They gave their young lives in the service of our country , and their sacrifice must never be forgotten . The Min

['saying_* saying_is saying_rigged', 'bet_* bet_be bet_will report_* report_on report_will', 'whatever>* whatever_* whatever_to', 'say_* say_does say_is say_not what>*', 'have_* have_do have_not have_shall is_* is_have is_legitimate', 'accepted_*']

1989-04-20a.446.6 Can my right hon Friend say why the discussions on extradition and extra - territorial jurisdiction have been so drawn out ? When does he expect a settlement ?
['can>* say_* say_can say_drawn', 'expect_* expect_does expect_when when>* when>does']

2002-03-11.624.3 First , I am not seeking centralised control . I am not seeking operational control at any level , and it is a simple lie by those who have said the opposite . Secondly , I do not expect—I say this to the shadow Home Secretary—any member of his party to ask the Home Secretary to take responsibility for the level and quality of policing in this country , unless the Home Secretary has the power as well as the responsibility to do something about it . In the words o

Unnamed: 0,0,1,2,3,4,5,6,7,type_id
accept_*__accept_is,1.090317,1.063181,0.936988,0.832105,0.683805,1.040056,0.523513,1.280517,6.0
be_*__be_not,0.986601,0.985865,0.982842,0.7024,0.802755,0.961942,0.528468,1.317027,6.0
accept_*__accept_does__accept_is,1.089942,1.083021,0.939604,0.864796,0.684333,1.056191,0.530239,1.283107,6.0
accept_*__accept_will,1.114377,1.055778,0.825666,0.847018,0.793701,0.940009,0.531709,1.344388,6.0
be_*,0.909678,0.935754,1.026899,0.716806,0.892652,0.853703,0.539811,1.293616,6.0
accept_*,1.115738,1.085082,0.865772,0.850017,0.750077,1.000034,0.540087,1.328635,6.0
be_*__be_would,0.981651,0.920813,1.010986,0.676931,0.830448,0.941249,0.546333,1.328901,6.0
accept_*__accept_is__does>*,1.071313,1.112336,0.935831,0.892122,0.706264,1.053141,0.549273,1.28359,6.0
does>*__recognise_*,1.137781,0.991348,0.99109,0.756819,0.807313,1.028037,0.555275,1.320357,6.0
accept_*__accept_does,1.120551,1.108024,0.871339,0.861576,0.743819,1.024401,0.558752,1.320551,6.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
realise_*,1.051353,1.040419,0.870504,0.830392,0.898416,0.891873,0.510753,1.333201,6.0
therefore>*,1.073179,1.107272,0.791517,0.893782,0.895427,0.865649,0.533603,1.356722,6.0
realise_is,1.076871,1.040312,0.900864,0.878094,0.820099,0.953101,0.537008,1.288549,6.0
be_right,0.983154,0.999825,1.01152,0.82418,0.882781,0.827222,0.597474,1.218149,6.0
be_however,1.001768,0.831704,1.087172,0.702314,0.746173,0.884701,0.59945,1.203639,6.0
remind_is,1.095861,0.955227,0.928311,0.856989,0.948804,0.962759,0.601668,1.334839,6.0
be_might,1.103935,0.896718,1.006554,0.767808,0.674502,0.895251,0.602006,1.173772,6.0
believe_however,1.055552,0.951726,1.031753,0.781682,0.73825,0.912921,0.602504,1.207242,6.0
be_decide,1.006482,1.026823,0.967803,0.891903,0.886792,0.87345,0.603566,1.237801,6.0
realise_will,1.032995,1.115808,0.867645,0.958823,1.015834,0.951978,0.605154,1.362984,6.0


top prompts:
1980-02-06a.482.4 As the Minister does not propose to find out how many people are waiting on transfer lists , how can he presume to order local authorities to sell off every house that a tenant applies to buy ? Does not that illustrate once again the unanswerable case for giving local authorities discretion over the number and type of houses that they put on the market ?
['as>*', 'does>*__does>not']

1987-11-02a.647.10 When that time comes , will the Minister and the Lord President bear in mind that schedule 4(5 ) to the Medical Act 1983 gives them powers to require further review ? Is the Minister aware that , in a recent note to Members of this House , the chairman of the General Medical Council said that the council 's principal task always has been , that of informing and protecting the public " . Would it not be a good thing to review the rules in the light of those excellent precepts , especially paragraph 60 of the current statutory instrument relating to the avail

1986-11-05a.947.2 My hon Friend will realise from the list that I read to the House of the parts of the Rover Group that are being disposed of that an energetic programme is proceeding on that front . I am sure he welcomes that , and I endorse what he said .
['realise_* realise_from realise_will', 'am_* am_sure endorse_* endorse_said']

1981-06-22a.16.8 My hon Friend will be aware of the need to have equality of treatment , especially when considering inward investment , for mobile projects . However , he will realise that there is a substantial opportunity to replace the sizeable import trade in motor cars .
['be_* be_aware be_considering be_for be_will', 'however>* realise_* realise_however realise_is realise_will']

1992-03-02a.10.7 I am sure that the House needs no reminding of that . None the less , I am grateful to my hon Friend for doing so . He will realise the concern that I have for disabled people , who would be hit by such an increase in petrol tax .
['am_* am_sure', 'am_* 

Unnamed: 0,0,1,2,3,4,5,6,7,type_id
doing_*__what>*,1.188908,1.179227,1.296108,1.284275,1.161041,1.174012,1.380595,0.487743,7.0
doing_*,1.196526,1.177769,1.272685,1.269288,1.144117,1.162529,1.344962,0.501558,7.0
taking_*__taking_is__what>*,1.126557,1.190315,1.332255,1.244363,1.160392,1.187807,1.405707,0.508016,7.0
doing_*__doing_is__what>*,1.19491,1.201274,1.301613,1.311563,1.213732,1.199367,1.419399,0.529425,7.0
take_*__take_what,1.156225,1.012067,1.335862,1.188823,1.090698,1.169019,1.304218,0.532164,7.0
taking_*__taking_are,1.087618,1.218423,1.353556,1.237473,1.19186,1.198082,1.402136,0.533785,7.0
taking_*,1.134108,1.230914,1.339015,1.254403,1.189327,1.215009,1.418624,0.534918,7.0
will>*__work_*__work_with,1.055845,0.969658,1.376481,1.118873,1.019038,1.158587,1.261221,0.535301,7.0
taking_*__what>*,1.084641,1.223158,1.355838,1.253558,1.19629,1.20394,1.413722,0.540402,7.0
doing_*__doing_is,1.204739,1.21992,1.278352,1.299628,1.208292,1.188003,1.396126,0.541201,7.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
through>*,1.173298,1.242311,1.328201,1.279822,1.05414,1.258116,1.369618,0.642623,7.0
is_working,1.119604,1.152739,1.275667,1.203138,0.976927,1.155934,1.216139,0.645541,7.0
ensuring_is,1.195901,1.089027,1.243815,1.11077,0.954979,1.149191,1.22936,0.648673,7.0
supporting_are,1.214342,1.245588,1.238312,1.291103,1.178745,1.182181,1.357689,0.65338,7.0
working_on,1.134486,1.224916,1.291671,1.280997,1.308231,1.218774,1.380054,0.664101,7.0
supporting_*,1.220306,1.258811,1.250189,1.318995,1.196743,1.203403,1.381304,0.666979,7.0
ensuring_*,1.175195,1.123449,1.22512,1.152106,0.926351,1.104832,1.186094,0.669527,7.0
working_are,1.136606,1.22965,1.284925,1.282427,1.311925,1.214676,1.377935,0.67017,7.0
working_with,1.136615,1.228151,1.286453,1.28158,1.314418,1.215308,1.379528,0.672401,7.0
working_*,1.138597,1.230087,1.284754,1.282966,1.312211,1.21649,1.379105,0.67273,7.0


top prompts:
1982-06-21a.18.8 Following our discussions with our European partners last October about breast food substitutes , what concerted action are we taking with our European partners to combat the problem ? What is being done about illiteracy in Third world countries ? Does the right hon Gentleman agree that that is one of the reasons why the substitutes are being abused ?
['following>* taking_*__taking_are taking_*__taking_what', 'done_*__done_is__done_what__what>* what>*__what>is', 'agree_*__agree_is__does>*']

2008-07-07d.1148.1 My constituent is seeking repayment of benefits that she was entitled to but did not receive as a result of an inaccurate assessment last autumn—undertaken without interviewing her . Inverness special payments team received her case in March , but the team tell me that , as of 27 June , they were processing claims received in October 2007 . My constituent faces a nine to 12-month delay in having her case processed . What is the Minister doing to tack
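
A note on reading these tables: each row is a phrasing (or response arc), the columns `0` through `7` appear to hold its distance to each of the 8 type centroids, and `type_id` is the type it is listed under. The sketch below is only a sanity check on that reading and is not part of ConvoKit. It parses three rows copied from the output above with the standard library and checks that `type_id` matches the column holding the smallest value. The helper name `closest_type` and the column label `phrasing` (in place of the exported `Unnamed: 0`) are made up for illustration.

```python
import csv
import io

# Three rows copied from the "top prompt" tables above; the first column is
# relabelled "phrasing" here (it is "Unnamed: 0" in the exported output).
rows_csv = """phrasing,0,1,2,3,4,5,6,7,type_id
made_*,0.627821,1.112966,1.253935,1.080853,1.263088,1.081064,1.085615,1.120296,0.0
agree_*__agree_is,1.188158,1.063,1.140767,0.952855,0.390807,1.232684,0.862497,1.120674,4.0
doing_*__what>*,1.188908,1.179227,1.296108,1.284275,1.161041,1.174012,1.380595,0.487743,7.0
"""


def closest_type(row, n_types=8):
    """Return the index of the column (0..n_types-1) with the smallest value."""
    dists = [float(row[str(i)]) for i in range(n_types)]
    return min(range(n_types), key=lambda i: dists[i])


parsed = list(csv.DictReader(io.StringIO(rows_csv)))
for row in parsed:
    # Print the phrasing, the column with the smallest value, and the listed type.
    print(row["phrasing"], closest_type(row), int(float(row["type_id"])))
```

For these three rows the smallest value sits in the column named by `type_id`, which is what we would expect if a phrasing is listed under the type whose centroid it is closest to.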

When this trained model is used to transform a corpus, it will output several representations or features associated with each utterance.

In [29]:
utt = pt.transform_utterance(utt)

A vector representation encapsulating the utterance's rhetorical intent (in short, an embedding of the utterance based on the responses associated with questions containing its constituent phrasings):

In [30]:
utt.get_info('prompt_types__prompt_repr')

[-0.17102469270226672,
 0.030668340328548496,
 -0.14361175504485915,
 0.11035667766433878,
 -0.3149225651493796,
 -0.032341849264302266,
 -0.22282059450017552,
 -0.12806097960998153,
 0.1771197712150501,
 0.02081981426950371,
 -0.3536289362524308,
 -0.2403882732149338,
 -0.06125687797720545,
 -0.19491483622907677,
 -0.05056529746639132,
 -0.03309523304250256,
 -0.41507809747180324,
 -0.06014879946683283,
 -0.11378928189631009,
 -0.01750400434622636,
 -0.04641636672373653,
 -0.5430246645922729,
 0.13134581675643944,
 -0.0851526519739321]

The distance between the vector of that utterance and the centroid of each cluster, i.e., type of question:

In [31]:
utt.get_info('prompt_types__prompt_dists.8')

[1.1430428918990696,
 0.9502324189972602,
 1.1337070117657726,
 0.8403522127994898,
 0.38907971018922277,
 1.1182399779615004,
 0.7633996580496394,
 1.1170915977638758]

The particular type of question this utterance is, as well as how close it is to the centroid of that particular cluster (roughly, how well it fits that question type):

In [32]:
utt.get_info('prompt_types__prompt_type.8')

4.0

In [33]:
utt.get_info('prompt_types__prompt_type_dist.8')

0.38907971018922277
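
These last two features are consistent with being read directly off the list of distances: the assigned type is the index of the smallest distance, and `prompt_type_dist` is that smallest distance. A minimal check in plain Python, using the values printed above (the variable names are ours, not ConvoKit's):

```python
# Distances from the running example to each of the 8 type centroids,
# copied from the `prompt_types__prompt_dists.8` output above.
dists = [
    1.1430428918990696,
    0.9502324189972602,
    1.1337070117657726,
    0.8403522127994898,
    0.38907971018922277,
    1.1182399779615004,
    0.7633996580496394,
    1.1170915977638758,
]

# Index of the nearest centroid, and the distance to it.
type_id = min(range(len(dists)), key=lambda i: dists[i])
type_dist = dists[type_id]

print(type_id, type_dist)
```

The index and distance obtained this way agree with the `4.0` and `0.389...` reported by the transformer.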

Here, we see that our running example is of question type 4, which is exemplified by phrasings like `does [the Minister] agree...` -- we may characterize the entire cluster as encapsulating "agreeing" questions which are perhaps asked helpfully to bolster the answerer's reputation.

In [64]:
pt.display_type(utt.get_info('prompt_types__prompt_type.8'), k=15)

top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
agree_*__agree_is,1.188158,1.063,1.140767,0.952855,0.390807,1.232684,0.862497,1.120674,4.0
agree_*__agree_be__does>*,1.14527,1.019666,1.15276,0.885044,0.397438,1.20073,0.797764,1.133618,4.0
agree_*__agree_be,1.140607,1.022338,1.151803,0.878635,0.398083,1.19152,0.78868,1.138992,4.0
agree_*__agree_is__does>*,1.185667,1.067486,1.143729,0.954654,0.398762,1.238654,0.870336,1.11485,4.0
agree_*__agree_have,1.160346,1.060732,1.162782,0.954536,0.439447,1.232611,0.864388,1.130915,4.0
agree_*__agree_are,1.198654,1.093265,1.142069,0.940229,0.446017,1.247844,0.859772,1.162318,4.0
agree_*__agree_does__agree_have__does>*,1.145078,1.092075,1.144758,0.956562,0.454053,1.221288,0.848933,1.117233,4.0
agree_*__agree_are__agree_does__does>*,1.199917,1.099268,1.139137,0.94469,0.45824,1.253012,0.868788,1.158872,4.0
agree_*__agree_also,1.184892,1.135958,1.127462,1.026366,0.468385,1.265777,0.892345,1.176072,4.0
continue_*__will>*,1.153943,1.039539,1.207467,0.966704,0.474253,1.195661,0.962171,0.987554,4.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
agree_certainly,1.180155,1.071745,1.201017,0.971265,0.461342,1.293082,0.952085,1.10175,4.0
agree_is,1.174426,1.071817,1.205062,0.98537,0.468024,1.291348,0.950482,1.095343,4.0
agree_however,1.171552,1.076513,1.193439,0.995855,0.468574,1.28644,0.93341,1.117073,4.0
agree_will,1.18198,1.04299,1.209447,1.001755,0.475267,1.291913,0.946098,1.121801,4.0
agree_also,1.194347,1.078747,1.19395,0.990467,0.476035,1.294584,0.947017,1.096503,4.0
agree_wholeheartedly,1.184639,1.092625,1.190386,1.018646,0.476682,1.280336,0.964095,1.089767,4.0
agree_absolutely,1.194103,1.062569,1.219094,0.982575,0.476712,1.294999,0.986945,1.067689,4.0
is_also,1.19596,1.051437,1.11008,0.98655,0.478089,1.101184,0.882075,1.002128,4.0
agree_be,1.164383,1.079607,1.202961,0.994257,0.481263,1.288859,0.94401,1.104167,4.0
agree_completely,1.186661,1.08971,1.208564,1.017079,0.481393,1.2983,0.976094,1.076271,4.0


We can transform the other utterances in the corpus in the same way:

In [35]:
corpus = pt.transform(corpus)
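
Once every utterance carries a type, one natural follow-up is to tally how many utterances fall into each type. The helper below is a hypothetical sketch, not a ConvoKit function: it only assumes that each utterance exposes `get_info(field)` as used in this notebook, and that the call returns `None` when the field is absent. The `FakeUtterance` class is a stand-in so the snippet runs without a corpus.

```python
from collections import Counter


def count_prompt_types(utterances, field="prompt_types__prompt_type.8"):
    """Tally assigned prompt types; utterances without the field are skipped."""
    counts = Counter()
    for utt in utterances:
        type_id = utt.get_info(field)
        if type_id is not None:
            counts[int(type_id)] += 1
    return counts


class FakeUtterance:
    """Stand-in for an utterance, for illustration only."""

    def __init__(self, meta):
        self.meta = meta

    def get_info(self, key):
        # Assumption: a missing field yields None.
        return self.meta.get(key)


demo = [
    FakeUtterance({"prompt_types__prompt_type.8": 4.0}),
    FakeUtterance({"prompt_types__prompt_type.8": 7.0}),
    FakeUtterance({"prompt_types__prompt_type.8": 4.0}),
    FakeUtterance({}),  # e.g. an utterance that was never assigned a type
]
demo_counts = count_prompt_types(demo)
print(sorted(demo_counts.items()))
```

On the real corpus, the same function could presumably be applied to `corpus.iter_utterances()`.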

This utterance, `utt1`, is of type 7: perhaps more information-seeking, querying for an update ("what steps is the Government taking", "what are they doing to ensure", etc.).

In [36]:
utt1.text

'Given what the Foreign Secretary has said about the importance of the Iran discussions on the nuclear agreement , what is he doing to ensure greater clarity about the baselines , the extent of the inspection regime and the consequences of infringement ? Given that the agreement will allow advanced centrifuge , the infringements might arrive a little earlier than anticipated .'

In [37]:
utt1.get_info('motifs__sink')

['doing_*__doing_ensure__doing_is doing_*__doing_is__doing_what given>*']

In [38]:
utt1.get_info('prompt_types__prompt_type.8')

7.0

In [67]:
pt.display_type(utt1.get_info('prompt_types__prompt_type.8'), k=15)

top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
doing_*__what>*,1.188908,1.179227,1.296108,1.284275,1.161041,1.174012,1.380595,0.487743,7.0
doing_*,1.196526,1.177769,1.272685,1.269288,1.144117,1.162529,1.344962,0.501558,7.0
taking_*__taking_is__what>*,1.126557,1.190315,1.332255,1.244363,1.160392,1.187807,1.405707,0.508016,7.0
doing_*__doing_is__what>*,1.19491,1.201274,1.301613,1.311563,1.213732,1.199367,1.419399,0.529425,7.0
take_*__take_what,1.156225,1.012067,1.335862,1.188823,1.090698,1.169019,1.304218,0.532164,7.0
taking_*__taking_are,1.087618,1.218423,1.353556,1.237473,1.19186,1.198082,1.402136,0.533785,7.0
taking_*,1.134108,1.230914,1.339015,1.254403,1.189327,1.215009,1.418624,0.534918,7.0
will>*__work_*__work_with,1.055845,0.969658,1.376481,1.118873,1.019038,1.158587,1.261221,0.535301,7.0
taking_*__what>*,1.084641,1.223158,1.355838,1.253558,1.19629,1.20394,1.413722,0.540402,7.0
doing_*__doing_is,1.204739,1.21992,1.278352,1.299628,1.208292,1.188003,1.396126,0.541201,7.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
through>*,1.173298,1.242311,1.328201,1.279822,1.05414,1.258116,1.369618,0.642623,7.0
is_working,1.119604,1.152739,1.275667,1.203138,0.976927,1.155934,1.216139,0.645541,7.0
ensuring_is,1.195901,1.089027,1.243815,1.11077,0.954979,1.149191,1.22936,0.648673,7.0
supporting_are,1.214342,1.245588,1.238312,1.291103,1.178745,1.182181,1.357689,0.65338,7.0
working_on,1.134486,1.224916,1.291671,1.280997,1.308231,1.218774,1.380054,0.664101,7.0
supporting_*,1.220306,1.258811,1.250189,1.318995,1.196743,1.203403,1.381304,0.666979,7.0
ensuring_*,1.175195,1.123449,1.22512,1.152106,0.926351,1.104832,1.186094,0.669527,7.0
working_are,1.136606,1.22965,1.284925,1.282427,1.311925,1.214676,1.377935,0.67017,7.0
working_with,1.136615,1.228151,1.286453,1.28158,1.314418,1.215308,1.379528,0.672401,7.0
working_*,1.138597,1.230087,1.284754,1.282966,1.312211,1.21649,1.379105,0.67273,7.0


This utterance, on the other hand, is a lot more aggressive -- perhaps _accusatory_, with the aim of putting the answerer on the spot ("will the secretary admit that the policy is a failure?").

In [40]:
utt2 = corpus.get_utterance('1987-03-04a.857.5')

In [41]:
utt2.text

'Will the Secretary of State stop giving us what is called in the pop record industry a remix of alibis , excuses and gimmicks ? Will he admit that the number of homes built to rent last year by local authorities was the lowest in 62 years , that the housing investment programme net of capital receipts was the lowest in real terms since HIPs were invented and that , even during the past three years the number of repair and improvement grants , which would bring some private homes back into use , have dropped by 100,000 ? Does not the right hon Gentleman understand that , if the private owner and the local authority are starved of resources , we are left with lengthy queues , homelessness and all the other scandals of poor housing that exist today ?'

In [42]:
utt2.get_info('motifs__sink')

['stop_*__stop_will__will>*',
 'admit_*__admit_will__will>*',
 'does>*__does>not does>*__understand_*']

In [43]:
utt2.get_info('prompt_types__prompt_type.8')

2.0

In [68]:
pt.display_type(utt2.get_info('prompt_types__prompt_type.8'), k=15)

top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
admit_*,1.201171,1.292763,0.570578,1.214884,1.266494,0.947761,0.940144,1.397194,2.0
why>*,1.111746,1.310107,0.57465,1.22539,1.281744,0.860374,0.900415,1.388247,2.0
admit_*__will>*,1.231918,1.313895,0.576193,1.257695,1.240869,0.998804,1.011262,1.359635,2.0
explain_*,1.081285,1.253725,0.576442,1.206437,1.269414,0.843906,0.912722,1.365633,2.0
explain_*__explain_will,1.084332,1.294334,0.591024,1.183642,1.188385,0.920798,0.869349,1.373634,2.0
is>*__is_*__is_true,1.16878,1.315689,0.596213,1.103997,1.143709,1.017692,0.846551,1.483289,2.0
is_*__why>*,1.184021,1.282762,0.601567,1.21484,1.174174,0.91112,0.845809,1.364674,2.0
justify_*,1.203301,1.322534,0.609482,1.251149,1.275439,0.976639,1.037237,1.358217,2.0
admit_*__admit_will__will>*,1.239174,1.311162,0.610798,1.289524,1.272271,0.984968,1.058717,1.349525,2.0
is_*__is_true,1.171478,1.337066,0.616361,1.159053,1.183389,1.032628,0.898227,1.49126,2.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
wonder_*,1.1772,1.279819,0.594013,1.185855,1.109752,0.87208,0.870528,1.31476,2.0
failed_*,1.208959,1.330511,0.634805,1.256673,1.129173,0.963978,0.931793,1.338335,2.0
were_*,1.210541,1.366612,0.654166,1.19862,1.070178,1.06855,0.916717,1.350363,2.0
is_wrong,1.171259,1.387414,0.662125,1.253352,1.153526,0.974954,0.946674,1.315054,2.0
instead>*,1.178271,1.26101,0.675487,1.211907,1.236813,0.851639,0.999561,1.264442,2.0
talks_*,1.238792,1.229554,0.695368,1.236562,1.260476,0.900432,1.032589,1.339003,2.0
am_surprised,1.172448,1.23272,0.698857,1.211564,1.223285,1.007399,0.960399,1.332919,2.0
were_there,1.212256,1.391393,0.702025,1.248951,1.151671,1.088265,0.991262,1.355348,2.0
talks_about,1.231265,1.22522,0.706294,1.24026,1.297678,0.887904,1.04609,1.339956,2.0
was_*,1.178669,1.17664,0.713316,1.100452,0.869292,0.922654,0.726262,1.286168,2.0


Inspecting the other types should give you an intuition for the range of questions that tend to be asked in parliament, as well as for the coherence of these types (which align fairly well with the types reported in the paper, even under different random seeds and small implementation tweaks). 
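
A note on reading these tables: each row is a phrasing (or response arc), columns 0 through 7 appear to hold its distance to each of the inferred type centroids, and `type_id` is the type it is closest to. The minimal sketch below checks that reading on three rows copied from the type 2 output above. The variable names are illustrative and not part of ConvoKit.

```python
# Three rows copied from the "top prompt" table for type 2 above.
# Each value is read here as the distance from the motif to one of the
# 8 type centroids (an interpretation of the printed output, not a
# documented guarantee).
rows = {
    'admit_*':   [1.201171, 1.292763, 0.570578, 1.214884,
                  1.266494, 0.947761, 0.940144, 1.397194],
    'why>*':     [1.111746, 1.310107, 0.57465, 1.22539,
                  1.281744, 0.860374, 0.900415, 1.388247],
    'justify_*': [1.203301, 1.322534, 0.609482, 1.251149,
                  1.275439, 0.976639, 1.037237, 1.358217],
}

# The index of the smallest distance is the closest type.
assigned = {
    motif: min(range(len(dists)), key=dists.__getitem__)
    for motif, dists in rows.items()
}
print(assigned)
```

All three motifs come out as type 2, matching the `type_id` column in the table.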

### a few caveats and potential modifications

One thorn in our side is that the model occasionally gets caught up on very generic motifs such as `'is>*'`, and as a result fits many questions to the type containing `'is>*'` instead of going with a better signal; various optional parameters detailed in the documentation may provide partial solutions to this. Another caveat is that, while this model allows us to group lexically divergent phrasings (e.g., "will the Minister admit" and "does the Minister not realise" are both accusatory towards the Minister), we are ultimately relying on our domain having a sufficient amount of lexical regularity (e.g., the institutional norms of how people talk in parliament). We might need to be cleverer when dealing with noisier settings where this regularity isn't guaranteed (like social media data). 

Finally, as a data-specific note, cluster 7 may be a result of the parser assuming that "Will the learned Gentleman please answer my question?" has "learned" as the root verb -- an artefact of parliamentary discourse we haven't handled. You may wish to play around with this by modifying how the data is preprocessed.

### model persistence

We can save our trained model, `pt`, to disk for later use (here, into a `pt_model` directory):

In [45]:
import os

In [46]:
pt.dump_model(os.path.join(ROOT_DIR, 'pt_model'))

dumping embedding model
dumping training embeddings
dumping type model 8


In broad strokes, what's written to disk is:

* TfIdf models that store the distribution of phrasings and arcs in the training data;
* SVD models that allow us to map raw phrasing/arc counts to vector representations;
* a KMeans model to cluster vector representations.
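
To make the three components concrete, here is a rough sketch of the same TfIdf, SVD, KMeans sequence using scikit-learn on a handful of made-up motif strings. It is not the ConvoKit implementation: the real model keeps separate TfIdf models for prompts and responses and relates the two, whereas this sketch handles only one side. The toy data and all variable names are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy "documents": each string is a space-separated bag of phrasing motifs.
# These examples are made up for illustration.
toy_prompts = [
    'admit_* will>* admit_will',
    'explain_* why>* explain_will',
    'why>* is_* is_true',
    'doing_* what>* doing_is',
    'taking_* what>* taking_are',
    'doing_* doing_are what>*',
]

# 1. TfIdf: weight each motif by how distinctive it is in the training data.
#    The token pattern keeps every whitespace-separated motif intact.
tfidf = TfidfVectorizer(token_pattern=r'\S+', lowercase=False)
X = tfidf.fit_transform(toy_prompts)

# 2. SVD: map the sparse weighted counts to low-dimensional vectors.
svd = TruncatedSVD(n_components=2, random_state=1000)
vects = svd.fit_transform(X)

# 3. KMeans: cluster the vectors into types.
km = KMeans(n_clusters=2, n_init=10, random_state=1000)
labels = km.fit_predict(vects)
print(labels)
```

Persisting a model like this amounts to saving the three fitted objects, which is roughly what the files listed below correspond to.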

In [47]:
pt_model_dir = os.path.join(ROOT_DIR, 'pt_model')
!ls $pt_model_dir

km_model.8.joblib	   svd_model.joblib	   train_ref_ids.npy
prompt_df.8.tsv		   train_prompt_df.8.tsv   train_ref_vects.npy
prompt_tfidf_model.joblib  train_prompt_ids.npy    U_prompt.npy
ref_df.8.tsv		   train_prompt_vects.npy  U_ref.npy
ref_tfidf_model.joblib	   train_ref_df.8.tsv


Initializing a new `PromptTypes` model and loading our saved model then allows us to use it again:

In [48]:
new_pt = PromptTypes(prompt_field='motifs', ref_field='arcs_censored',
                     prompt_filter=question_filter, ref_filter=response_filter,
                     prompt_transform_field='motifs__sink',
                     output_field='prompt_types_new', prompt__tfidf_min_df=100,
                     ref__tfidf_min_df=100,
                     random_state=1000, verbosity=1)

In [49]:
new_pt.load_model(pt_model_dir)

loading embedding model
loading training embeddings
loading type model 8


In [50]:
utt = new_pt.transform_utterance(utt)

In [51]:
utt.get_info('prompt_types_new__prompt_type.8')

4.0

## examples of potential variations

### trying other numbers of prompt types:

Calling `refit_types(n)` will retrain the clustering component of the `PromptTypes` model to infer a different number of types. Suppose we only wanted 4 types of questions:

In [52]:
pt.refit_types(4)

fitting 4 prompt types
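
Conceptually, refitting leaves the embeddings alone and only swaps the clustering model. The sketch below illustrates that idea with scikit-learn's `KMeans` on random stand-in vectors. It assumes nothing about how `refit_types` is implemented internally, and the data and names are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for already-computed low-dimensional embeddings
# (random data here; in the real model these come from the SVD step).
rng = np.random.RandomState(1000)
embeddings = rng.normal(size=(200, 24))

# Original clustering with 8 types.
km8 = KMeans(n_clusters=8, n_init=10, random_state=1000).fit(embeddings)

# "Refitting" only trains a new clustering model on the same embeddings.
km4 = KMeans(n_clusters=4, n_init=10, random_state=1000).fit(embeddings)

print(km8.cluster_centers_.shape, km4.cluster_centers_.shape)
```

Because the expensive embedding step is reused, trying several values for the number of types is cheap.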


In [53]:
for i in range(4):
    print(i)
    pt.display_type(i, type_key=4, k=15)
    print('\n\n')

0
top prompt:


Unnamed: 0,0,1,2,3,type_id
why>*,0.601818,1.085396,1.179613,1.312615,0.0
explain_*,0.605775,1.057703,1.166116,1.279778,0.0
why>*__why>does,0.616713,1.016271,1.038717,1.310166,0.0
explain_*__explain_will,0.633035,1.080472,1.096639,1.29712,0.0
is_*__why>*,0.633817,1.107558,1.088035,1.305727,0.0
admit_*,0.647199,1.123084,1.171259,1.336647,0.0
has>*__has>not,0.649402,0.98777,0.933776,1.316079,0.0
explain_*__explain_is,0.649678,1.024644,1.100717,1.233558,0.0
explain_*__will>*,0.650367,1.077112,1.064484,1.263428,0.0
is>*__is_*__is_true,0.66102,1.113861,1.041213,1.407218,0.0


top response:


Unnamed: 0,0,1,2,3,type_id
wonder_*,0.630674,1.102998,1.044466,1.25532,0.0
is>*,0.675474,0.987292,1.008256,1.176453,0.0
failed_*,0.681234,1.168922,1.087177,1.286135,0.0
suggest_*,0.681979,0.963971,1.015263,1.236774,0.0
says_*,0.692276,1.000248,1.06652,1.175726,0.0
instead>*,0.695701,1.095288,1.174475,1.213796,0.0
remind_*,0.701194,0.94906,0.932849,1.299634,0.0
is_what,0.704836,1.072153,1.041561,1.167645,0.0
was_*,0.70622,1.05427,0.828728,1.237861,0.0
is_wrong,0.707512,1.172941,1.105634,1.273054,0.0





1
top prompt:


Unnamed: 0,0,1,2,3,type_id
give_*__will>*,1.043248,0.657432,1.031485,1.07433,1.0
give_*__give_will,1.013476,0.660178,1.063695,1.080967,1.0
give_*,0.988746,0.675054,1.042047,1.025939,1.0
make_*,1.047946,0.685494,0.806577,1.161395,1.0
in>*,0.950518,0.69312,0.762574,1.20244,1.0
ask_*,1.045331,0.696865,0.985456,1.112932,1.0
press_*,1.131766,0.705377,0.896082,1.052293,1.0
have_*,0.874843,0.70548,0.974819,1.035824,1.0
raise_*,1.152602,0.707095,0.994261,1.040403,1.0
may>*__press_*,1.112269,0.708634,1.014641,1.088584,1.0


top response:


Unnamed: 0,0,1,2,3,type_id
understand_*,0.966634,0.662215,0.883908,1.104828,1.0
understand_will,0.991441,0.687805,1.001792,1.087629,1.0
appreciate_will,1.041817,0.693472,1.002399,1.130071,1.0
appreciate_*,1.030921,0.698451,0.917643,1.159681,1.0
consider_be,1.129064,0.698777,0.938326,1.137771,1.0
be_shall,1.006798,0.708291,0.756346,1.055697,1.0
am_aware,1.167042,0.713787,1.026887,1.069327,1.0
however>*,0.955174,0.714456,0.769619,1.147455,1.0
be_happy,1.058102,0.718625,0.7674,1.024599,1.0
consider_is,1.105853,0.72313,0.856749,1.162859,1.0





2
top prompt:


Unnamed: 0,0,1,2,3,type_id
agree_*__agree_be,1.11096,1.007206,0.479745,1.112463,2.0
agree_*__agree_be__does>*,1.11454,1.012929,0.483359,1.110787,2.0
agree_*__as>*,1.144315,0.966375,0.525692,1.110044,2.0
does>*__does_*,1.077883,1.053004,0.528685,1.221586,2.0
agree_*__agree_is,1.118272,1.072922,0.532932,1.106996,2.0
does_*,1.060453,1.050294,0.539875,1.197938,2.0
agree_*__agree_is__does>*,1.122454,1.076461,0.540153,1.102009,2.0
agree_*__agree_does__as>*,1.150785,0.995876,0.549916,1.13548,2.0
learned_*,1.076814,0.904834,0.549981,1.227574,2.0
agree_*__agree_have,1.133961,1.061377,0.552106,1.108647,2.0


top response:


Unnamed: 0,0,1,2,3,type_id
is_be,0.983326,0.887748,0.518452,1.075414,2.0
is_reduce,1.055853,1.003526,0.519157,1.109797,2.0
be_interested,1.014612,0.916946,0.547527,1.035521,2.0
be_of,1.001621,0.822517,0.553019,1.100756,2.0
be_for,0.987128,0.886889,0.559593,1.10987,2.0
be_indeed,0.972489,0.847795,0.563212,1.096901,2.0
be_have,1.048217,0.840767,0.5637,1.011078,2.0
be_better,0.903746,0.90384,0.565646,1.119409,2.0
is_necessary,1.033036,0.975829,0.568678,1.079672,2.0
be_also,1.049832,0.860241,0.570089,1.007618,2.0





3
top prompt:


Unnamed: 0,0,1,2,3,type_id
taking_*__taking_is__what>*,1.295492,1.185034,1.230208,0.607057,3.0
will>*__work_*__work_with,1.309957,1.037493,1.070822,0.610824,3.0
doing_*__what>*,1.261882,1.205985,1.229969,0.61392,3.0
taking_*__taking_are,1.313213,1.179124,1.242876,0.614792,3.0
taking_*__what>*,1.316787,1.185156,1.254286,0.616616,3.0
done_*__done_being,1.258333,1.110325,1.144343,0.621616,3.0
do_*__do_help,1.258489,1.105808,1.196712,0.624412,3.0
doing_*,1.238402,1.199368,1.205838,0.627153,3.0
taking_*,1.307366,1.209135,1.249842,0.630915,3.0
will>*__work_*,1.271969,1.021488,0.995575,0.631464,3.0


top response:


Unnamed: 0,0,1,2,3,type_id
is_working,1.222887,1.148835,1.047432,0.692729,3.0
ensure_to,1.216149,1.045685,0.940809,0.698138,3.0
raises_*,1.289359,0.891337,1.076154,0.703652,3.0
through>*,1.30467,1.232791,1.162545,0.728547,3.0
taking_in,1.186808,1.101722,1.01671,0.731985,3.0
met_discuss,1.330599,0.964782,1.130157,0.732484,3.0
working_on,1.266305,1.197235,1.315516,0.735116,3.0
supporting_are,1.217897,1.233306,1.237249,0.738121,3.0
ensuring_*,1.182124,1.12159,1.002762,0.739539,3.0
ensuring_is,1.209579,1.122381,1.023778,0.740001,3.0







### trying other input formats

We may also experiment with different representations of the input text -- for instance, in lieu of using phrasing motifs we may instead pass questions into the model as just the raw arcs, similar to the responses. This can be modified by changing the `prompt_field` argument:

In [54]:
pt_arcs = PromptTypes(prompt_field='arcs_censored', ref_field='arcs_censored',
                      prompt_filter=question_filter, ref_filter=response_filter,
                      prompt_transform_field='arcs_censored',
                      output_field='prompt_types_arcs', prompt__tfidf_min_df=100,
                      ref__tfidf_min_df=100, n_types=8,
                      random_state=1000, verbosity=1)

In [55]:
pt_arcs.fit(corpus)

fitting 214798 input pairs
fitting ref tfidf model
fitting prompt tfidf model
fitting svd model
fitting 8 prompt types


In [56]:
for i in range(8):
    print(i)
    pt_arcs.display_type(i,  k=10)
    print('\n\n')

0
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
be_would,0.529891,0.964818,1.22692,0.939504,0.918249,0.771973,0.884318,1.145255,0.0
be_not,0.53037,0.895,1.250734,0.939678,1.017548,0.823587,0.879274,1.134778,0.0
asked_for,0.566036,1.002276,1.153131,1.015477,1.030582,0.796301,0.908066,1.02775,0.0
would>*,0.568386,0.897959,1.227223,0.969823,0.937112,0.7373,0.911911,1.049858,0.0
hope_*,0.568698,1.140511,1.104456,1.029072,0.761381,0.735195,0.936335,1.202869,0.0
be_*,0.572036,1.050015,1.079135,0.87228,0.767881,0.79704,0.771462,1.20842,0.0
will_*,0.575144,1.111116,1.269291,0.946087,0.945891,0.869441,1.058066,1.16981,0.0
bearing>*,0.581458,0.884393,1.251779,0.913755,0.963513,0.881427,0.835626,1.0882,0.0
take_will,0.589443,1.073734,1.161073,1.001249,0.827194,0.747144,0.967276,1.209549,0.0
in>*,0.590339,0.875245,1.181718,0.887059,0.91628,0.806805,0.761068,1.023823,0.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
is_aware,0.567507,0.894577,1.163757,0.935506,0.94425,0.789681,0.778036,1.061885,0.0
be_however,0.568026,1.009577,1.122763,0.919661,0.871915,0.739309,0.764948,1.138676,0.0
be_may,0.578518,1.02178,1.082892,0.955665,0.879389,0.714347,0.789071,1.157237,0.0
be_possible,0.587127,0.963179,1.131451,0.988681,0.934369,0.788926,0.821702,1.181489,0.0
be_appropriate,0.596184,0.951954,1.12341,0.91831,0.931754,0.828037,0.745659,1.160878,0.0
note_*,0.603744,1.025815,1.289602,0.98894,1.001015,0.971495,1.066057,1.150926,0.0
however>*,0.606946,0.989311,1.167113,0.842493,0.870752,0.879818,0.777442,1.158154,0.0
be_indeed,0.610845,0.975784,1.095285,0.977176,0.989496,0.702595,0.786408,1.133642,0.0
be_would,0.612413,0.960792,1.121824,0.936766,0.900465,0.706843,0.731975,1.142114,0.0
realise_*,0.612891,0.827765,1.210775,0.976124,1.041444,0.820925,0.786484,0.996628,0.0





1
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
why>*,1.082362,0.498804,1.318543,0.969625,1.264774,1.29907,0.868052,0.863166,1.0
admit_*,1.085746,0.536929,1.331248,1.064418,1.255712,1.245212,0.935513,0.779047,1.0
explain_*,1.070559,0.549831,1.281412,0.902235,1.213101,1.263579,0.865572,0.84678,1.0
why>does,0.953765,0.571123,1.301964,1.009925,1.193766,1.188209,0.805041,0.894401,1.0
admit_will,1.157588,0.58415,1.299476,1.08395,1.236599,1.282549,0.946186,0.751614,1.0
explain_is,1.016322,0.585054,1.263605,0.901728,1.16353,1.199006,0.854056,0.949275,1.0
notice_*,0.973048,0.597462,1.239572,1.050174,1.154711,1.08987,0.823853,0.987293,1.0
admit_is,1.108795,0.601369,1.266044,1.052958,1.201181,1.19743,0.931544,0.776833,1.0
think_does,0.892586,0.602116,1.251281,1.06233,1.087503,0.959157,0.813694,0.876168,1.0
stop_will,1.072618,0.605782,1.297606,0.981082,1.247246,1.254974,0.891875,0.925826,1.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
understand_does,0.913371,0.531497,1.24744,0.924697,1.106914,1.118065,0.761419,1.01306,1.0
wonder_*,0.985299,0.601215,1.271933,1.051939,1.219788,1.157743,0.856984,0.863895,1.0
was_said,1.034886,0.60709,1.236062,0.992814,1.153608,1.237125,0.702315,0.988737,1.0
yet>*,1.038464,0.621445,1.25929,1.014306,1.173961,1.168001,0.851038,0.88602,1.0
understand_not,0.86874,0.621586,1.250042,0.92295,1.088858,1.084783,0.790922,1.029038,1.0
notice_*,1.043284,0.627223,1.263403,1.051341,1.190089,1.076326,0.941298,0.939591,1.0
am_surprised,1.091235,0.636753,1.285019,1.088831,1.209057,1.206532,1.039974,0.972947,1.0
seems_*,0.94113,0.644071,1.236035,0.941503,1.076378,1.10071,0.738261,0.932921,1.0
is_perhaps,1.044557,0.647682,1.240201,1.052324,1.065071,1.122222,0.899066,0.972781,1.0
surely>*,0.884225,0.649113,1.248093,1.046117,1.112814,0.949976,0.88706,0.986269,1.0





2
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
what>*,1.271861,1.242112,0.53322,1.002562,1.067609,1.145927,1.017996,1.17192,2.0
doing_*,1.300796,1.300474,0.556178,1.207347,1.097463,1.056035,1.200966,1.135389,2.0
work_will,1.158445,1.373427,0.570707,1.134883,0.925985,0.908346,1.144729,1.246823,2.0
taking_*,1.291463,1.389562,0.580095,1.194019,1.116571,1.101368,1.228923,1.229688,2.0
work_with,1.159453,1.379212,0.582198,1.174117,0.90043,0.910684,1.157689,1.255679,2.0
take_what,1.202233,1.339566,0.587073,1.202726,0.95275,1.003263,1.168627,1.226476,2.0
taking_are,1.312085,1.394881,0.591514,1.18639,1.147198,1.127962,1.251814,1.233307,2.0
doing_ensure,1.294153,1.391903,0.593092,1.223746,1.0684,1.074916,1.264096,1.224071,2.0
doing_is,1.34585,1.308252,0.597871,1.222234,1.134033,1.117,1.226352,1.141137,2.0
doing_are,1.24244,1.267446,0.609311,1.175242,1.068299,1.014913,1.155183,1.113011,2.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
ensure_to,1.108805,1.30556,0.639163,1.037371,0.950964,0.822511,1.063489,1.12183,2.0
is_working,1.209536,1.287185,0.647905,1.122864,1.090199,0.9517,1.1217,1.183024,2.0
supporting_are,1.331493,1.298213,0.664975,1.245794,1.174139,1.134312,1.190326,1.007073,2.0
leading_*,1.31666,1.348148,0.674289,1.233034,1.196772,1.02581,1.299496,1.100104,2.0
supporting_*,1.356756,1.310165,0.675227,1.25819,1.184234,1.153445,1.207287,1.011168,2.0
working_on,1.323279,1.291686,0.677293,1.179521,1.160974,1.235469,1.279383,1.212451,2.0
working_be,1.328495,1.292396,0.678438,1.184233,1.150985,1.246078,1.276297,1.210041,2.0
working_are,1.324384,1.285862,0.680286,1.178522,1.164219,1.239386,1.275411,1.207694,2.0
working_with,1.324358,1.287256,0.682205,1.178325,1.162823,1.241009,1.276747,1.209794,2.0
working_make,1.327221,1.287644,0.682783,1.179309,1.158044,1.24197,1.282125,1.209357,2.0





3
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
can>*,1.019797,1.023083,1.043756,0.623323,0.996365,1.169639,0.816799,1.054005,3.0
tell_*,1.126664,0.914395,1.130509,0.661783,1.134465,1.305937,0.879383,1.004925,3.0
say_can,1.058891,1.051673,1.126372,0.663397,1.084024,1.277586,0.857412,1.090996,3.0
tell_will,1.121881,0.895332,1.122838,0.663844,1.143115,1.32849,0.875881,1.036053,3.0
give_*,0.819948,1.026659,1.111919,0.677585,0.935771,1.146042,0.748185,1.18762,3.0
made_for,0.884407,0.999102,1.166185,0.682912,0.995943,1.059871,0.895331,1.120417,3.0
give_can,0.979174,1.075025,0.947453,0.684947,0.972619,1.104161,0.714017,1.130779,3.0
tell_can,1.198588,1.025032,1.153155,0.696808,1.138763,1.322276,0.997398,1.011026,3.0
give_on,1.068256,1.079455,0.988985,0.698475,0.996005,1.176986,0.836238,1.177126,3.0
give_is,1.062422,1.044347,1.006488,0.699055,0.937592,1.178587,0.907751,1.179706,3.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
am_able,0.918847,0.966668,1.098808,0.651006,0.977138,1.152355,0.858369,1.136925,3.0
answer_*,1.126042,0.986408,1.116936,0.671781,1.077578,1.296715,0.892877,1.066896,3.0
answer_not,1.137393,1.056228,1.11202,0.67247,1.056949,1.308798,0.965991,1.105968,3.0
answer_can,1.140176,1.062829,1.113727,0.676979,1.069435,1.302939,0.985692,1.099774,3.0
give_not,1.076947,1.090306,1.062359,0.679999,0.980701,1.206549,0.924597,1.082816,3.0
have_not,1.027469,1.033207,1.082738,0.693263,1.016492,1.079209,0.957782,1.076511,3.0
tell_not,1.106588,1.009616,1.028193,0.693676,1.053718,1.217617,0.915619,1.039259,3.0
undertake_*,0.972703,1.157062,1.056157,0.700159,0.867534,1.125391,0.954017,1.179408,3.0
answer_will,1.064103,0.904178,1.132988,0.702738,1.06756,1.245533,0.72581,1.021981,3.0
give_can,1.059715,1.140321,1.02673,0.704234,0.996021,1.167507,0.895503,1.093899,3.0





4
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
be_may,0.850559,1.187245,1.050705,0.98395,0.576201,0.897858,0.996053,1.199375,4.0
meet_*,0.841114,1.163689,0.991501,1.018547,0.594362,0.8865,0.870258,1.20345,4.0
agree_meet,0.973283,1.228591,1.044659,1.044455,0.597063,1.01515,1.027403,1.276255,4.0
meet_will,0.857941,1.158296,0.992003,1.03281,0.599742,0.915414,0.858315,1.221652,4.0
agree_will,0.812673,1.194345,1.052964,1.024859,0.605186,0.801279,0.978952,1.256917,4.0
may>*,0.768218,1.105548,0.97128,0.975572,0.608737,0.842346,0.779835,1.205788,4.0
bring_will,0.868066,1.191334,0.982289,1.031564,0.616371,0.868448,1.041258,1.229437,4.0
know_*,0.9789,1.191033,0.809219,1.003779,0.621997,0.925623,0.874437,1.21199,4.0
support_*,0.902916,1.136739,1.019087,0.99348,0.626198,0.945538,0.879832,1.228323,4.0
press_may,0.927536,1.159521,0.972477,1.043491,0.633587,1.064623,0.899393,1.269466,4.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
want_obviously,0.908459,1.251168,1.022802,1.127755,0.645155,0.924761,1.01767,1.176661,4.0
am_always,0.853934,1.144588,1.104828,1.089213,0.654422,0.932764,0.945959,1.227052,4.0
am_happy,1.011255,1.259328,0.990452,1.036424,0.663829,0.845702,0.981557,1.162549,4.0
raises_*,1.075678,1.343499,0.718234,1.036033,0.672282,0.997519,1.106982,1.27269,4.0
want_make,1.037984,1.217069,0.9177,1.085187,0.673089,0.820741,0.93244,1.159628,4.0
suspect_is,0.930814,0.997454,1.124729,0.903413,0.680733,1.013299,0.812977,1.110388,4.0
was_aware,1.011461,1.144945,1.186061,0.992019,0.684654,1.120405,1.107825,1.182254,4.0
want_give,1.009165,1.204382,0.872774,1.059542,0.690989,0.790994,0.912356,1.131974,4.0
am_aware,0.868493,1.212875,1.110015,0.862806,0.69366,1.096991,1.07339,1.264212,4.0
get_back,1.126453,1.193048,1.142249,1.099109,0.697986,1.1248,1.154655,1.175543,4.0





5
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
continue_*,0.823746,1.133221,1.037373,1.124218,1.053768,0.497383,1.006627,1.087384,5.0
share_*,0.846982,1.186381,0.977901,1.144325,0.883004,0.500044,1.019908,1.125189,5.0
agree_be,0.795171,1.158267,1.120676,1.155932,1.065201,0.500295,1.108687,1.123609,5.0
agree_make,0.866441,1.223336,0.972183,1.173972,1.004782,0.500519,1.063271,1.114853,5.0
agree_is,0.868513,1.177604,1.121923,1.195612,1.10071,0.513065,1.172806,1.060426,5.0
agree_provide,0.95137,1.260131,0.997164,1.203544,1.076106,0.514803,1.111756,1.095859,5.0
share_does,0.904929,1.214812,0.961775,1.188495,0.90294,0.537523,1.074094,1.119788,5.0
thank_for,0.892096,1.305375,0.866227,1.162818,0.904054,0.540597,1.088999,1.138932,5.0
am_*,0.756617,1.179449,0.88593,1.053578,0.778443,0.54226,0.898805,1.145352,5.0
thank_*,0.89372,1.309549,0.870667,1.164372,0.898669,0.542434,1.099034,1.144545,5.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
is_important,0.823582,1.202855,0.923133,1.12832,0.836411,0.464095,0.962275,1.141894,5.0
is_vital,0.914321,1.235208,0.928898,1.179347,1.004727,0.478204,1.058123,1.077268,5.0
is_also,0.906917,1.142997,0.971569,1.166803,1.043958,0.484042,1.012372,1.013617,5.0
does>*,0.806278,1.12108,1.114497,1.155176,1.062973,0.491217,1.118634,1.069222,5.0
encourage_*,0.938895,1.223624,0.940524,1.156032,0.930172,0.532868,1.013765,1.09437,5.0
ensure_certainly,0.853935,1.264268,0.924873,1.064658,0.847869,0.536431,1.012206,1.182717,5.0
agree_does,0.874049,1.195855,1.114803,1.191259,1.087699,0.53725,1.203224,1.106249,5.0
yes>*,0.900167,1.291535,1.034949,1.204444,1.081923,0.538276,1.133827,1.101594,5.0
is_maintain,0.814725,1.186004,1.095803,1.119872,0.966628,0.543527,1.033521,1.071165,5.0
is_essential,0.901653,1.286665,0.963225,1.167597,1.01016,0.548391,1.080061,1.188932,5.0





6
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
now>*,0.930031,0.697456,1.145914,0.942316,1.022478,1.094657,0.583116,0.986717,6.0
make_what,0.941779,0.861849,1.076216,0.90361,1.00061,1.093359,0.584666,1.0178,6.0
reassure_will,0.889137,1.02882,1.015252,0.92961,0.938334,0.986552,0.593643,1.126473,6.0
reassure_*,0.935077,1.059802,0.902859,0.909477,0.93969,1.000905,0.621374,1.068123,6.0
said_*,1.076542,0.654712,1.107009,0.932066,1.076808,1.215053,0.623786,0.961058,6.0
confirm_be,0.766303,0.956913,1.138238,0.921158,1.001438,1.01675,0.628933,1.137817,6.0
let_*,0.938394,0.848472,1.018708,0.797646,0.91052,1.083071,0.630709,1.082544,6.0
mean_does,1.001289,0.72643,1.132221,0.857832,1.077739,1.184051,0.635455,0.990977,6.0
have_*,0.71624,0.900181,1.015166,0.911907,0.780393,0.865034,0.637254,1.056921,6.0
said_was,1.019365,0.719546,1.226942,1.02403,1.05302,1.139955,0.643761,1.01268,6.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
said_*,1.007771,0.749372,1.146647,0.95086,1.060434,1.171081,0.535647,1.032958,6.0
said_have,1.015406,0.80327,1.13134,0.944497,1.052225,1.18814,0.545052,1.079588,6.0
said_be,0.955965,0.751642,1.155431,0.94589,1.069057,1.129259,0.549912,1.041251,6.0
said_in,1.03806,0.76021,1.123795,0.952098,1.042656,1.202765,0.550703,1.025735,6.0
said_has,0.964048,0.762851,1.152738,0.966716,1.060588,1.126601,0.551059,1.040329,6.0
said_as,1.026702,0.864824,1.096014,0.974081,0.988505,1.167021,0.557469,1.076606,6.0
said_already,1.00958,0.811801,1.131008,0.939459,1.038228,1.19517,0.559988,1.0851,6.0
said_are,1.028092,0.812897,1.123739,0.975124,1.021659,1.154675,0.561285,1.052133,6.0
said_is,0.999456,0.744576,1.14281,0.977048,1.055449,1.119307,0.564511,1.023917,6.0
having>*,0.865355,0.847961,1.085954,0.957158,0.903422,1.022245,0.576302,1.006352,6.0





7
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
since_*,1.213373,0.984793,1.228459,1.200303,1.248047,1.184441,1.178557,0.563933,7.0
show_*,1.214324,0.884084,1.260694,1.225959,1.273321,1.153741,1.165774,0.584431,7.0
higher_*,1.144362,0.948267,1.257864,1.188304,1.241403,1.106863,1.146803,0.597594,7.0
fell_*,1.287662,0.917713,1.203311,1.185937,1.257328,1.301727,1.095045,0.602782,7.0
higher>*,1.200829,0.947739,1.221994,1.199018,1.225261,1.153603,1.162826,0.607491,7.0
risen_*,1.268897,0.970374,1.243428,1.181472,1.265682,1.240808,1.219532,0.616034,7.0
show_not,1.174909,0.877384,1.321176,1.215424,1.30369,1.140005,1.127767,0.62548,7.0
fallen_*,1.332391,1.071796,1.172275,1.219411,1.285289,1.256513,1.26116,0.633623,7.0
of>*,0.969567,0.864659,1.248692,1.170299,1.226273,0.956756,1.04089,0.640232,7.0
rising_*,1.258914,1.038468,1.173008,1.229713,1.243672,1.209732,1.190037,0.641301,7.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
was_in,1.128448,0.854666,1.248275,1.168917,1.277705,1.065654,1.079326,0.588419,7.0
rising_*,1.22208,1.035094,1.196016,1.235545,1.292166,1.179795,1.225766,0.589343,7.0
is_higher,1.202293,0.988935,1.257062,1.253654,1.307331,1.140626,1.252858,0.59839,7.0
rising_is,1.210114,1.018229,1.21028,1.220725,1.276912,1.191285,1.212772,0.602744,7.0
rose_by,1.26471,1.042443,1.22361,1.217631,1.283751,1.230157,1.26858,0.604503,7.0
show_*,1.167111,0.909912,1.241077,1.244021,1.285517,1.061855,1.189429,0.608059,7.0
is_lower,1.238669,0.943992,1.267997,1.241343,1.320411,1.175372,1.208685,0.608498,7.0
are_now,1.28522,0.983058,1.158601,1.209898,1.306664,1.191377,1.18356,0.612259,7.0
rose_*,1.271496,1.053888,1.230944,1.21654,1.282343,1.244295,1.271104,0.61256,7.0
is_high,1.163276,0.957979,1.21,1.13352,1.224124,1.115457,1.129165,0.61326,7.0







### going beyond root arcs

If we initialize the `TextToArcs` transformer with `root_only=False`, we will use arcs beyond those attached to the root of the dependency parse. This may produce neater output, especially in domains where utterances are less well-structured (see TODO AWRY LINK for a demo of this on Wikipedia talk page data).

## storing vector representations

As mentioned above, `PromptTypes` produces a few vector representations of utterances. For efficiency, rather than storing these representations attached to the utterance (as values in `utterance.meta`), we store them in a corpus-wide matrix.

`get_vect_repr(utterance_id, matrix name)` allows us to access the representation of a particular utterance:

In [57]:
corpus.get_vect_repr(test_utt_id, 'prompt_types__prompt_repr')

array([-0.17102469,  0.03066834, -0.14361176,  0.11035668, -0.31492257,
       -0.03234185, -0.22282059, -0.12806098,  0.17711977,  0.02081981,
       -0.35362894, -0.24038827, -0.06125688, -0.19491484, -0.0505653 ,
       -0.03309523, -0.4150781 , -0.0601488 , -0.11378928, -0.017504  ,
       -0.04641637, -0.54302466,  0.13134582, -0.08515265])

To save all of these representations to disk, we can call the following:

In [58]:
corpus.dump_vector_reprs('prompt_types__prompt_repr')

This stores the representations (`vect_info.<FIELD NAME>.npy`) as a matrix, and the utterance IDs corresponding to each of the rows (`vect_info.<FIELD NAME>.keys`) -- both in the corpus directory.
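
The matrix-plus-keys layout can be sketched with plain NumPy as follows. This is only an illustration of the idea: the one-ID-per-line key format, the file names, and the toy data are assumptions, and ConvoKit's actual on-disk format may differ.

```python
import os
import tempfile

import numpy as np

# Made-up utterance IDs and vectors, for illustration only.
ids = ['utt_a', 'utt_b', 'utt_c']
matrix = np.arange(12, dtype=float).reshape(3, 4)  # one row per utterance

with tempfile.TemporaryDirectory() as tmp_dir:
    mat_path = os.path.join(tmp_dir, 'vects.npy')
    key_path = os.path.join(tmp_dir, 'vects.keys')

    # Save the matrix and, separately, the order of its rows.
    np.save(mat_path, matrix)
    with open(key_path, 'w') as f:
        f.write('\n'.join(ids))

    # Reload both files.
    loaded = np.load(mat_path)
    with open(key_path) as f:
        loaded_ids = f.read().split('\n')

# Rebuild the ID -> row index lookup and fetch one utterance's vector.
id_to_row = {utt_id: i for i, utt_id in enumerate(loaded_ids)}
vect_b = loaded[id_to_row['utt_b']]
print(vect_b)
```

Keeping all vectors in one array avoids per-utterance overhead, at the cost of maintaining the ID-to-row mapping alongside it.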

In [74]:
ls $ROOT_DIR

conversations.json        pm_model/
corpus.json               pt_model/
index.json                users.json
info.arcs_censored.jsonl  utterances.jsonl
info.motifs.jsonl         vect_info.prompt_types__prompt_repr.keys
info.motifs__sink.jsonl   vect_info.prompt_types__prompt_repr.npy
info.parsed.jsonl


These vector representations can later be re-loaded:

In [59]:
new_corpus = convokit.Corpus(ROOT_DIR)

In [60]:
new_corpus.vector_reprs.keys()

dict_keys([])

In [61]:
new_corpus.load_vector_reprs('prompt_types__prompt_repr')

In [62]:
new_corpus.get_vect_repr(test_utt_id, 'prompt_types__prompt_repr')

array([-0.17102469,  0.03066834, -0.14361176,  0.11035668, -0.31492257,
       -0.03234185, -0.22282059, -0.12806098,  0.17711977,  0.02081981,
       -0.35362894, -0.24038827, -0.06125688, -0.19491484, -0.0505653 ,
       -0.03309523, -0.4150781 , -0.0601488 , -0.11378928, -0.017504  ,
       -0.04641637, -0.54302466,  0.13134582, -0.08515265])