This notebook demos `PromptTypeWrapper`, a transformer that produces abstract representations of an utterance in terms of its phrasing and its rhetorical intent. 

The transformer, with some minor modifications, implements the methodology detailed in the [paper](http://www.cs.cornell.edu/~cristian/Asking_too_much.html), 

```
Asking Too Much? The Rhetorical Role of Questions in Political Discourse 
Justine Zhang, Arthur Spirling, Cristian Danescu-Niculescu-Mizil
Proceedings of EMNLP 2017
```

and by default analyzes _questions_ and their responses (though this can be modified on initialization). 

Under the surface, the transformer implements two key modules, `PhrasingMotifs` and `PromptTypes`, as well as a suite of preprocessing steps. For a more detailed description of each of these steps, and examples of calling the component modules separately, see demo notebook TODO LINK.

First we load the corpus. We will examine a dataset of questions from question periods that take place in the British House of Commons (also detailed in the paper). 

In [24]:
from convokit import Corpus
from convokit import download
from convokit.prompt_types import PromptTypeWrapper

For expedience, we load pre-computed dependency parses, which should come with the data release (see TODO LINK for a demonstration of how to get these parses for yourself).

In [3]:
# OPTION 1: DOWNLOAD CORPUS 
# UNCOMMENT THESE LINES TO DOWNLOAD CORPUS
# DATA_DIR = '<YOUR DIRECTORY>'
# ROOT_DIR = download('parliament-corpus', data_dir=DATA_DIR)

# OPTION 2: READ PREVIOUSLY-DOWNLOADED CORPUS FROM DISK
# UNCOMMENT THIS LINE AND REPLACE WITH THE DIRECTORY WHERE THE PARLIAMENT-CORPUS IS LOCATED
# ROOT_DIR = '<YOUR DIRECTORY>'

corpus = Corpus(ROOT_DIR)
corpus.load_info('utterance',['parsed'])

In [4]:
VERBOSITY = 10000

Inspecting an example utterance:

In [5]:
test_utt_id = '1997-01-27a.4.0'
utt = corpus.get_utterance(test_utt_id)

In [6]:
utt.text

"Does my right hon Friend agree that last week 's statement about a replacement royal yacht has been widely welcomed ? Does he agree also that , ideally , Britannia should become the centrepiece of the millennium project in Portsmouth harbour , spanning Gosport and Portsmouth ? I am sure that that idea would prove very popular . As to plans for a new yacht , does my right hon Friend share my distaste for the Opposition 's tactics ? They had every opportunity to express their grudging and negative attitude during the past two years when the project was under discussion ."

Initializing a `PromptTypeWrapper` model that will infer 8 types of questions (see the docstring for other arguments):

In [7]:
pt = PromptTypeWrapper(n_types=8, random_state=1000)

In [8]:
pt.fit(corpus)

10000/433787 utterances processed
20000/433787 utterances processed
30000/433787 utterances processed
40000/433787 utterances processed
50000/433787 utterances processed
60000/433787 utterances processed
70000/433787 utterances processed
80000/433787 utterances processed
90000/433787 utterances processed
100000/433787 utterances processed
110000/433787 utterances processed
120000/433787 utterances processed
130000/433787 utterances processed
140000/433787 utterances processed
150000/433787 utterances processed
160000/433787 utterances processed
170000/433787 utterances processed
180000/433787 utterances processed
190000/433787 utterances processed
200000/433787 utterances processed
210000/433787 utterances processed
220000/433787 utterances processed
230000/433787 utterances processed
240000/433787 utterances processed
250000/433787 utterances processed
260000/433787 utterances processed
270000/433787 utterances processed
280000/433787 utterances processed
290000/433787 utterances proc

	counting itemset cooccurrences for 90000/318345 collections
	counting itemset cooccurrences for 100000/318345 collections
	counting itemset cooccurrences for 110000/318345 collections
	counting itemset cooccurrences for 120000/318345 collections
	counting itemset cooccurrences for 130000/318345 collections
	counting itemset cooccurrences for 140000/318345 collections
	counting itemset cooccurrences for 150000/318345 collections
	counting itemset cooccurrences for 160000/318345 collections
	counting itemset cooccurrences for 170000/318345 collections
	counting itemset cooccurrences for 180000/318345 collections
	counting itemset cooccurrences for 190000/318345 collections
	counting itemset cooccurrences for 200000/318345 collections
	counting itemset cooccurrences for 210000/318345 collections
	counting itemset cooccurrences for 220000/318345 collections
	counting itemset cooccurrences for 230000/318345 collections
	counting itemset cooccurrences for 240000/318345 collections
	counting

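Inspecting the inferred types. For each type, `display_type` prints the top phrasing motifs for prompts and for responses. Note that this should produce the same output as calling the component transformers separately, as detailed in this notebook TODO LINK: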

In [9]:
for i in range(8):
    print(i)
    pt.display_type(i,  k=15)
    print('\n\n')

0
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
made_*,0.627821,1.112966,1.253935,1.080853,1.263088,1.081064,1.085615,1.120296,0.0
made_*__made_in,0.670131,1.092172,1.296672,1.044795,1.16574,1.096866,1.117699,1.102888,0.0
made_*__made_to,0.677337,1.226368,1.219402,1.145383,1.388485,1.110612,1.212821,1.180075,0.0
in>*__tell_*,0.681406,1.176855,0.986156,1.121633,1.332023,0.847615,0.987093,1.26624,0.0
made_*__made_what,0.683845,1.124455,1.353602,1.139765,1.248646,1.112601,1.209657,0.959844,0.0
made_*__made_been,0.689245,1.12298,1.277784,1.117208,1.263149,1.144714,1.17727,1.178932,0.0
happen_*__happen_will,0.697615,1.202465,1.101319,1.157835,1.233026,0.868512,1.052773,1.120683,0.0
made_*__what>*,0.698273,1.148946,1.360852,1.140067,1.24789,1.156193,1.226739,0.99763,0.0
made_*__made_been__made_what,0.706422,1.105542,1.336376,1.133772,1.224077,1.158618,1.23057,1.052753,0.0
made_*__made_has,0.707376,1.123585,1.303389,1.18135,1.30712,1.138856,1.224905,1.109336,0.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
am_at,0.744773,0.935439,1.20501,1.062946,1.220152,1.02565,1.073168,1.238502,0.0
known_*,0.746823,1.227288,1.121113,1.15773,1.273498,1.041451,1.067706,1.242575,0.0
can_*,0.785983,1.107881,1.200609,1.058894,1.083504,0.993568,1.07024,0.982096,0.0
place_*,0.789185,1.049869,1.162727,1.044718,1.15699,1.036115,1.094712,1.125281,0.0
assure_have,0.796057,0.944208,1.301825,0.915244,0.966593,1.071265,1.028048,0.994812,0.0
was_made,0.797752,1.188686,0.977958,1.147366,1.179836,0.915078,0.944907,1.195879,0.0
make_shall,0.80267,0.873247,1.176298,0.898143,0.952539,0.89805,0.83441,1.12915,0.0
give_can,0.804878,1.036402,1.154003,1.068707,1.223465,0.858061,1.07292,1.131898,0.0
have_made,0.81164,1.045384,1.182388,0.922776,1.063783,1.034568,0.824843,1.211438,0.0
write_shall,0.815864,1.160921,1.133396,1.17003,1.271814,1.098569,1.200611,1.185541,0.0





1
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
agree_*__agree_will__will>*,1.081874,0.497942,1.273505,0.848749,0.933643,0.990091,0.997196,1.086475,1.0
agree_*__agree_will,1.05683,0.499644,1.266302,0.850806,0.947449,0.949093,0.966158,1.089017,1.0
agree_*__will>*,1.104583,0.515868,1.262447,0.846606,0.875943,1.016219,0.976856,1.101176,1.0
meet_*,1.123487,0.546317,1.25353,0.877287,1.00936,0.917613,1.023773,1.032337,1.0
agree_*__agree_meet__will>*,1.143766,0.55996,1.303826,0.99068,1.077422,1.059494,1.132881,1.088021,1.0
agree_*__agree_meet,1.113804,0.560347,1.303108,0.929585,1.034495,1.034661,1.067001,1.109925,1.0
undertake_*,1.008052,0.573649,1.260006,0.80237,1.021824,1.003247,1.016716,1.056516,1.0
meet_*__meet_will,1.135676,0.579419,1.234611,0.877325,0.985532,0.909151,1.012462,1.042688,1.0
raise_*__raise_will,1.039347,0.579697,1.310012,0.890007,0.994179,1.090194,1.039332,1.094929,1.0
press_*__press_may,1.113677,0.582473,1.234204,0.882112,1.106788,0.879551,1.003774,1.120163,1.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
am_always,1.154629,0.584922,1.21927,0.79042,0.961395,0.994641,0.96633,1.204803,1.0
am_aware,0.926102,0.610505,1.265185,0.805867,1.135296,1.001407,1.023496,1.168976,1.0
was_aware,1.096868,0.636389,1.215653,0.992701,1.159254,1.10333,1.106659,1.264721,1.0
want_obviously,1.196034,0.641195,1.270706,0.792399,1.009466,1.039959,1.110401,1.072885,1.0
know_been,1.049947,0.647338,1.222541,0.887264,0.876106,1.018087,0.997834,1.105618,1.0
know_takes,1.078992,0.653681,1.300367,0.852636,0.846953,1.086783,1.048624,0.949327,1.0
get_back,1.174138,0.675439,1.244518,1.085933,1.144474,1.130576,1.202466,1.202862,1.0
am_interested,1.162955,0.678403,1.114928,1.018159,1.209881,0.948988,1.01752,1.261251,1.0
suspect_is,1.090133,0.67904,1.093455,1.013447,1.047982,0.866533,0.952681,1.208949,1.0
be_happy,0.98805,0.685855,1.180272,0.787081,0.853065,0.855227,0.803427,1.110432,1.0





2
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
admit_*,1.201171,1.292763,0.570578,1.214884,1.266494,0.947761,0.940144,1.397194,2.0
why>*,1.111746,1.310107,0.57465,1.22539,1.281744,0.860374,0.900415,1.388247,2.0
admit_*__will>*,1.231918,1.313895,0.576193,1.257695,1.240869,0.998804,1.011262,1.359635,2.0
explain_*,1.081285,1.253725,0.576442,1.206437,1.269414,0.843906,0.912722,1.365633,2.0
explain_*__explain_will,1.084332,1.294334,0.591024,1.183642,1.188385,0.920798,0.869349,1.373634,2.0
is>*__is_*__is_true,1.16878,1.315689,0.596213,1.103997,1.143709,1.017692,0.846551,1.483289,2.0
is_*__why>*,1.184021,1.282762,0.601567,1.21484,1.174174,0.91112,0.845809,1.364674,2.0
justify_*,1.203301,1.322534,0.609482,1.251149,1.275439,0.976639,1.037237,1.358217,2.0
admit_*__admit_will__will>*,1.239174,1.311162,0.610798,1.289524,1.272271,0.984968,1.058717,1.349525,2.0
is_*__is_true,1.171478,1.337066,0.616361,1.159053,1.183389,1.032628,0.898227,1.49126,2.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
wonder_*,1.1772,1.279819,0.594013,1.185855,1.109752,0.87208,0.870528,1.31476,2.0
failed_*,1.208959,1.330511,0.634805,1.256673,1.129173,0.963978,0.931793,1.338335,2.0
were_*,1.210541,1.366612,0.654166,1.19862,1.070178,1.06855,0.916717,1.350363,2.0
is_wrong,1.171259,1.387414,0.662125,1.253352,1.153526,0.974954,0.946674,1.315054,2.0
instead>*,1.178271,1.26101,0.675487,1.211907,1.236813,0.851639,0.999561,1.264442,2.0
talks_*,1.238792,1.229554,0.695368,1.236562,1.260476,0.900432,1.032589,1.339003,2.0
am_surprised,1.172448,1.23272,0.698857,1.211564,1.223285,1.007399,0.960399,1.332919,2.0
were_there,1.212256,1.391393,0.702025,1.248951,1.151671,1.088265,0.991262,1.355348,2.0
talks_about,1.231265,1.22522,0.706294,1.24026,1.297678,0.887904,1.04609,1.339956,2.0
was_*,1.178669,1.17664,0.713316,1.100452,0.869292,0.922654,0.726262,1.286168,2.0





3
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
learned_*__will_*,1.035791,0.903941,1.167507,0.538998,0.842047,1.151453,0.80045,1.30125,3.0
learned_*__will>*,1.031033,0.887743,1.17639,0.542707,0.852538,1.149004,0.822925,1.297674,3.0
draw_*__will>*,1.02198,0.905078,1.149718,0.546633,0.942019,1.051063,0.876365,1.174293,3.0
bear_*__bear_in__in>*,1.078134,0.951693,1.231007,0.552075,0.980356,1.15239,0.980947,1.223885,3.0
draw_*__draw_will,1.037764,0.907857,1.155981,0.555476,0.906422,1.068294,0.878078,1.179685,3.0
convey_*__convey_to,1.080879,0.956686,1.17974,0.566019,1.013484,1.081067,0.986655,1.232593,3.0
will_*,0.998802,0.842949,1.137713,0.568387,0.834968,1.096154,0.724175,1.314792,3.0
convey_*__convey_to__convey_will,1.105769,1.021487,1.16184,0.589949,0.98565,1.122533,0.987816,1.217223,3.0
will>*__will_*,0.998499,0.822044,1.170187,0.597136,0.865728,1.121626,0.765781,1.322978,3.0
does_*__learned_*__learned_accept,1.100867,0.996306,1.083552,0.606117,0.873094,1.116297,0.733854,1.325595,3.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
note_says,0.998016,0.95833,1.122652,0.602703,0.978719,1.028584,0.810145,1.288853,3.0
emphasise_*,1.038182,0.84825,1.206117,0.608093,0.922879,1.111263,0.814011,1.260635,3.0
learned_*,0.992413,0.936676,1.066827,0.616777,0.775857,0.982881,0.672981,1.247093,3.0
note_*,1.045313,0.916748,1.076491,0.621081,0.976873,1.074091,0.762412,1.35107,3.0
be_important,1.080727,0.864054,1.180871,0.627343,0.856606,1.047932,0.829196,1.143138,3.0
is_consider,0.97022,0.82766,1.24024,0.638418,0.972265,1.01127,0.938706,1.145201,3.0
are_always,1.076425,0.917767,1.173078,0.641689,1.026539,1.13699,0.8933,1.269326,3.0
consider_is,0.960686,0.810932,1.203336,0.643591,1.01007,1.000808,0.825824,1.243528,3.0
convey_*,1.11556,0.936023,1.205373,0.64489,1.076424,1.138973,1.018776,1.274255,3.0
consider_must,1.032756,0.89067,1.197095,0.647674,0.965585,1.106802,0.850974,1.284388,3.0





4
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
agree_*__agree_is,1.188158,1.063,1.140767,0.952855,0.390807,1.232684,0.862497,1.120674,4.0
agree_*__agree_be__does>*,1.14527,1.019666,1.15276,0.885044,0.397438,1.20073,0.797764,1.133618,4.0
agree_*__agree_be,1.140607,1.022338,1.151803,0.878635,0.398083,1.19152,0.78868,1.138992,4.0
agree_*__agree_is__does>*,1.185667,1.067486,1.143729,0.954654,0.398762,1.238654,0.870336,1.11485,4.0
agree_*__agree_have,1.160346,1.060732,1.162782,0.954536,0.439447,1.232611,0.864388,1.130915,4.0
agree_*__agree_are,1.198654,1.093265,1.142069,0.940229,0.446017,1.247844,0.859772,1.162318,4.0
agree_*__agree_does__agree_have__does>*,1.145078,1.092075,1.144758,0.956562,0.454053,1.221288,0.848933,1.117233,4.0
agree_*__agree_are__agree_does__does>*,1.199917,1.099268,1.139137,0.94469,0.45824,1.253012,0.868788,1.158872,4.0
agree_*__agree_also,1.184892,1.135958,1.127462,1.026366,0.468385,1.265777,0.892345,1.176072,4.0
continue_*__will>*,1.153943,1.039539,1.207467,0.966704,0.474253,1.195661,0.962171,0.987554,4.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
agree_certainly,1.180155,1.071745,1.201017,0.971265,0.461342,1.293082,0.952085,1.10175,4.0
agree_is,1.174426,1.071817,1.205062,0.98537,0.468024,1.291348,0.950482,1.095343,4.0
agree_however,1.171552,1.076513,1.193439,0.995855,0.468574,1.28644,0.93341,1.117073,4.0
agree_will,1.18198,1.04299,1.209447,1.001755,0.475267,1.291913,0.946098,1.121801,4.0
agree_also,1.194347,1.078747,1.19395,0.990467,0.476035,1.294584,0.947017,1.096503,4.0
agree_wholeheartedly,1.184639,1.092625,1.190386,1.018646,0.476682,1.280336,0.964095,1.089767,4.0
agree_absolutely,1.194103,1.062569,1.219094,0.982575,0.476712,1.294999,0.986945,1.067689,4.0
is_also,1.19596,1.051437,1.11008,0.98655,0.478089,1.101184,0.882075,1.002128,4.0
agree_be,1.164383,1.079607,1.202961,0.994257,0.481263,1.288859,0.94401,1.104167,4.0
agree_completely,1.186661,1.08971,1.208564,1.017079,0.481393,1.2983,0.976094,1.076271,4.0





5
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
say_*,0.844708,1.083265,0.971673,1.117433,1.294797,0.619143,0.997052,1.194575,5.0
mean_*,0.996559,1.12255,0.864944,1.110938,1.171208,0.624402,0.817696,1.199592,5.0
have_*,0.94339,0.849485,0.994944,0.846488,1.098475,0.637829,0.825819,1.123806,5.0
mean_*__mean_does,0.959681,1.149855,0.872096,1.146608,1.217319,0.664998,0.853081,1.239925,5.0
given>*,1.009578,0.821352,1.145473,0.999191,1.153424,0.670679,1.043621,0.946857,5.0
explain_*__explain_can__explain_is,1.093834,1.091398,0.82147,1.144547,1.207012,0.686644,0.885649,1.186571,5.0
have_*__have_for__have_what,1.036565,0.960059,1.103153,1.146903,1.288828,0.692222,1.095189,1.137421,5.0
said_*,1.065214,0.86955,1.033804,1.078371,1.205563,0.693865,0.960979,1.095316,5.0
make_*__make_what,1.027176,1.022561,0.892963,1.045675,1.077412,0.698897,0.79764,1.201476,5.0
go_*,1.082529,0.951352,0.940076,0.935363,1.076293,0.703714,0.727434,1.257622,5.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
said_in,1.072858,1.111422,0.879407,1.156333,1.227892,0.625378,0.913928,1.223817,5.0
said_to,1.043832,1.108993,0.961855,1.15544,1.265747,0.630438,0.996852,1.199537,5.0
said_as,1.084223,1.054055,0.985156,1.135903,1.195105,0.65378,0.959594,1.193883,5.0
secondly>*,1.158341,1.166617,0.829746,1.223155,1.198225,0.664336,0.997184,1.150355,5.0
first>*,1.166789,1.093003,0.914265,1.215943,1.216189,0.66769,1.050758,1.111849,5.0
said_*,1.063014,1.126901,0.883348,1.143466,1.183556,0.669222,0.864938,1.256876,5.0
is_say,1.062904,0.999097,0.937645,1.073971,1.065091,0.671739,0.887503,1.140267,5.0
said_was,1.081109,1.141609,0.856563,1.15772,1.216481,0.67285,0.903487,1.238201,5.0
on>*,0.908192,1.054282,0.890554,0.992212,1.082866,0.673259,0.748206,1.243729,5.0
expect_do,0.962934,1.002189,0.962702,1.017961,1.225525,0.67797,0.861367,1.300114,5.0





6
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
accept_*__accept_is,1.090317,1.063181,0.936988,0.832105,0.683805,1.040056,0.523513,1.280517,6.0
be_*__be_not,0.986601,0.985865,0.982842,0.7024,0.802755,0.961942,0.528468,1.317027,6.0
accept_*__accept_does__accept_is,1.089942,1.083021,0.939604,0.864796,0.684333,1.056191,0.530239,1.283107,6.0
accept_*__accept_will,1.114377,1.055778,0.825666,0.847018,0.793701,0.940009,0.531709,1.344388,6.0
be_*,0.909678,0.935754,1.026899,0.716806,0.892652,0.853703,0.539811,1.293616,6.0
accept_*,1.115738,1.085082,0.865772,0.850017,0.750077,1.000034,0.540087,1.328635,6.0
be_*__be_would,0.981651,0.920813,1.010986,0.676931,0.830448,0.941249,0.546333,1.328901,6.0
accept_*__accept_is__does>*,1.071313,1.112336,0.935831,0.892122,0.706264,1.053141,0.549273,1.28359,6.0
does>*__recognise_*,1.137781,0.991348,0.99109,0.756819,0.807313,1.028037,0.555275,1.320357,6.0
accept_*__accept_does,1.120551,1.108024,0.871339,0.861576,0.743819,1.024401,0.558752,1.320551,6.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
realise_*,1.051353,1.040419,0.870504,0.830392,0.898416,0.891873,0.510753,1.333201,6.0
therefore>*,1.073179,1.107272,0.791517,0.893782,0.895427,0.865649,0.533603,1.356722,6.0
realise_is,1.076871,1.040312,0.900864,0.878094,0.820099,0.953101,0.537008,1.288549,6.0
be_right,0.983154,0.999825,1.01152,0.82418,0.882781,0.827222,0.597474,1.218149,6.0
be_however,1.001768,0.831704,1.087172,0.702314,0.746173,0.884701,0.59945,1.203639,6.0
remind_is,1.095861,0.955227,0.928311,0.856989,0.948804,0.962759,0.601668,1.334839,6.0
be_might,1.103935,0.896718,1.006554,0.767808,0.674502,0.895251,0.602006,1.173772,6.0
believe_however,1.055552,0.951726,1.031753,0.781682,0.73825,0.912921,0.602504,1.207242,6.0
be_decide,1.006482,1.026823,0.967803,0.891903,0.886792,0.87345,0.603566,1.237801,6.0
realise_will,1.032995,1.115808,0.867645,0.958823,1.015834,0.951978,0.605154,1.362984,6.0





7
top prompt:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
doing_*__what>*,1.188908,1.179227,1.296108,1.284275,1.161041,1.174012,1.380595,0.487743,7.0
doing_*,1.196526,1.177769,1.272685,1.269288,1.144117,1.162529,1.344962,0.501558,7.0
taking_*__taking_is__what>*,1.126557,1.190315,1.332255,1.244363,1.160392,1.187807,1.405707,0.508016,7.0
doing_*__doing_is__what>*,1.19491,1.201274,1.301613,1.311563,1.213732,1.199367,1.419399,0.529425,7.0
take_*__take_what,1.156225,1.012067,1.335862,1.188823,1.090698,1.169019,1.304218,0.532164,7.0
taking_*__taking_are,1.087618,1.218423,1.353556,1.237473,1.19186,1.198082,1.402136,0.533785,7.0
taking_*,1.134108,1.230914,1.339015,1.254403,1.189327,1.215009,1.418624,0.534918,7.0
will>*__work_*__work_with,1.055845,0.969658,1.376481,1.118873,1.019038,1.158587,1.261221,0.535301,7.0
taking_*__what>*,1.084641,1.223158,1.355838,1.253558,1.19629,1.20394,1.413722,0.540402,7.0
doing_*__doing_is,1.204739,1.21992,1.278352,1.299628,1.208292,1.188003,1.396126,0.541201,7.0


top response:


Unnamed: 0,0,1,2,3,4,5,6,7,type_id
through>*,1.173298,1.242311,1.328201,1.279822,1.05414,1.258116,1.369618,0.642623,7.0
is_working,1.119604,1.152739,1.275667,1.203138,0.976927,1.155934,1.216139,0.645541,7.0
ensuring_is,1.195901,1.089027,1.243815,1.11077,0.954979,1.149191,1.22936,0.648673,7.0
supporting_are,1.214342,1.245588,1.238312,1.291103,1.178745,1.182181,1.357689,0.65338,7.0
working_on,1.134486,1.224916,1.291671,1.280997,1.308231,1.218774,1.380054,0.664101,7.0
supporting_*,1.220306,1.258811,1.250189,1.318995,1.196743,1.203403,1.381304,0.666979,7.0
ensuring_*,1.175195,1.123449,1.22512,1.152106,0.926351,1.104832,1.186094,0.669527,7.0
working_are,1.136606,1.22965,1.284925,1.282427,1.311925,1.214676,1.377935,0.67017,7.0
working_with,1.136615,1.228151,1.286453,1.28158,1.314418,1.215308,1.379528,0.672401,7.0
working_*,1.138597,1.230087,1.284754,1.282966,1.312211,1.21649,1.379105,0.67273,7.0
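
One way to read the tables above: each row appears to be a phrasing motif, columns `0` through `7` appear to hold its distance to each cluster centroid, and `type_id` is the cluster with the smallest distance. The sketch below checks that reading on three rows copied from the type 4 prompt table. The column name `motif` is a label chosen here for illustration (the printed header is `Unnamed: 0`), and the interpretation is inferred from the output, not taken from the ConvoKit documentation.

```python
import csv
import io

# Three rows copied from the "top prompt" table for type 4 above.
# The first column is named "motif" here purely for readability.
rows_csv = """motif,0,1,2,3,4,5,6,7,type_id
agree_*__agree_is,1.188158,1.063,1.140767,0.952855,0.390807,1.232684,0.862497,1.120674,4.0
agree_*__agree_be,1.140607,1.022338,1.151803,0.878635,0.398083,1.19152,0.78868,1.138992,4.0
continue_*__will>*,1.153943,1.039539,1.207467,0.966704,0.474253,1.195661,0.962171,0.987554,4.0
"""

closest_types = []
for row in csv.DictReader(io.StringIO(rows_csv)):
    # Distances from this motif to each of the 8 cluster centroids.
    dists = [float(row[str(i)]) for i in range(8)]
    # Index of the nearest centroid.
    closest = dists.index(min(dists))
    closest_types.append(closest)
    print(row['motif'], closest, row['type_id'])
```

For these rows the nearest centroid matches the reported `type_id`, which is consistent with the tables being sorted by distance to the displayed type.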







Transforming a single utterance. The model will annotate each utterance with a set of representations or features.

In [10]:
utt = pt.transform_utterance(utt)

The phrasing motifs, i.e., a representation of how each sentence in the utterance is phrased:

In [11]:
utt.get_info('motifs')

['agree_* agree_*__does>* does>*',
 'agree_* agree_*__agree_also agree_*__does>* does>*',
 'as>* share_* share_*__share_does']
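
The output above contains one string per question in the utterance (three here, matching the three questions in the text). As a rough sketch of how to unpack it, the snippet below assumes that motifs within a sentence are separated by spaces and that the arcs making up a single motif are joined by `__`. That reading is inferred from the output itself, and `split_motifs` is a hypothetical helper, not part of ConvoKit.

```python
# The motif strings shown above, one per question sentence.
motif_strs = [
    'agree_* agree_*__does>* does>*',
    'agree_* agree_*__agree_also agree_*__does>* does>*',
    'as>* share_* share_*__share_does',
]

def split_motifs(motif_str):
    """Split one sentence's motif string into a list of motifs,
    each represented as a tuple of its constituent arcs.

    Assumes space-separated motifs and '__'-joined arcs."""
    return [tuple(motif.split('__')) for motif in motif_str.split()]

parsed = [split_motifs(s) for s in motif_strs]
for sentence_motifs in parsed:
    print(sentence_motifs)
```

Under this reading, the first question is described by the single arcs `agree_*` and `does>*` as well as by their combination.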

A vector representation encapsulating the utterance's rhetorical intent (in short, an embedding of the utterance based on the responses associated with questions containing its constituent phrasings; see the paper for details):

In [12]:
utt.get_info('prompt_types__prompt_repr')

[-0.17102469270226672,
 0.030668340328548496,
 -0.14361175504485915,
 0.11035667766433878,
 -0.3149225651493796,
 -0.032341849264302266,
 -0.22282059450017552,
 -0.12806097960998153,
 0.1771197712150501,
 0.02081981426950371,
 -0.3536289362524308,
 -0.2403882732149338,
 -0.06125687797720545,
 -0.19491483622907677,
 -0.05056529746639132,
 -0.03309523304250256,
 -0.41507809747180324,
 -0.06014879946683283,
 -0.11378928189631009,
 -0.01750400434622636,
 -0.04641636672373653,
 -0.5430246645922729,
 0.13134581675643944,
 -0.0851526519739321]

Distances between that vector and the centroid of each inferred cluster:

In [13]:
utt.get_info('prompt_types__prompt_dists.8')

[1.1430428918990696,
 0.9502324189972602,
 1.1337070117657726,
 0.8403522127994898,
 0.38907971018922277,
 1.1182399779615004,
 0.7633996580496394,
 1.1170915977638758]

The particular type of question, and how close it is to the centroid of that particular cluster:

In [14]:
utt.get_info('prompt_types__prompt_type.8')

4.0

In [15]:
utt.get_info('prompt_types__prompt_type_dist.8')

0.38907971018922277
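
The three outputs above are consistent with one another: the assigned type is the index of the smallest centroid distance, and the type distance is that smallest value. A minimal check, using the distances copied from the `prompt_types__prompt_dists.8` output (plain Python, no ConvoKit calls):

```python
# Distances to the 8 cluster centroids, copied from the output above.
dists = [
    1.1430428918990696,
    0.9502324189972602,
    1.1337070117657726,
    0.8403522127994898,
    0.38907971018922277,
    1.1182399779615004,
    0.7633996580496394,
    1.1170915977638758,
]

# The assigned type is the index of the nearest centroid.
prompt_type = min(range(len(dists)), key=dists.__getitem__)
# The type distance is the distance to that nearest centroid.
prompt_type_dist = dists[prompt_type]

print(prompt_type, prompt_type_dist)
```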

Transforming the entire corpus:

In [16]:
corpus = pt.transform(corpus)

10000/433787 utterances processed
20000/433787 utterances processed
30000/433787 utterances processed
40000/433787 utterances processed
50000/433787 utterances processed
60000/433787 utterances processed
70000/433787 utterances processed
80000/433787 utterances processed
90000/433787 utterances processed
100000/433787 utterances processed
110000/433787 utterances processed
120000/433787 utterances processed
130000/433787 utterances processed
140000/433787 utterances processed
150000/433787 utterances processed
160000/433787 utterances processed
170000/433787 utterances processed
180000/433787 utterances processed
190000/433787 utterances processed
200000/433787 utterances processed
210000/433787 utterances processed
220000/433787 utterances processed
230000/433787 utterances processed
240000/433787 utterances processed
250000/433787 utterances processed
260000/433787 utterances processed
270000/433787 utterances processed
280000/433787 utterances processed
290000/433787 utterances proc
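
Once the corpus is annotated, a natural next step is to tally how many utterances fall into each type. The sketch below uses a small stand-in list of type values rather than the real corpus; in practice the values would be collected by calling `get_info('prompt_types__prompt_type.8')` on each utterance. It assumes that utterances without an assigned type (for example, non-questions) yield `None`, which is an assumption and not confirmed by the output shown here.

```python
from collections import Counter

# Stand-in values for illustration only; not real corpus output.
type_values = [4.0, 2.0, 4.0, None, 7.0, None, 4.0]

# Count utterances per type, skipping those with no assigned type.
type_counts = Counter(int(t) for t in type_values if t is not None)

for type_id, count in sorted(type_counts.items()):
    print(type_id, count)
```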

Other examples:

In [17]:
utt1 = corpus.get_utterance('1987-03-04a.857.5')

In [18]:
utt1.get_info('motifs')

['stop_* stop_*__stop_will stop_*__stop_will__will>* stop_*__will>* will>*',
 'admit_* admit_*__admit_will admit_*__admit_will__will>* admit_*__will>* will>*',
 'does>* does>*__does>not does>*__understand_* understand_* understand_*__understand_does']

In [19]:
utt1.text

'Will the Secretary of State stop giving us what is called in the pop record industry a remix of alibis , excuses and gimmicks ? Will he admit that the number of homes built to rent last year by local authorities was the lowest in 62 years , that the housing investment programme net of capital receipts was the lowest in real terms since HIPs were invented and that , even during the past three years the number of repair and improvement grants , which would bring some private homes back into use , have dropped by 100,000 ? Does not the right hon Gentleman understand that , if the private owner and the local authority are starved of resources , we are left with lengthy queues , homelessness and all the other scandals of poor housing that exist today ?'

In [20]:
utt1.get_info('prompt_types__prompt_type.8')

2.0

We can also try out the model on arbitrary input. For instance, we see that the following question is also of type 4 -- that is, similar to other questions which voice agreement or support.

In [21]:
str_utt = pt.transform_utterance('Do you share my distaste for cockroaches?')

In [22]:
str_utt.get_info('motifs')

['do>* share_*']

In [23]:
str_utt.get_info('prompt_types__prompt_type.8')

4.0

Serializing the model. This dumps both the underlying `PhrasingMotifs` and `PromptTypes` models to disk:

In [21]:
import os

In [23]:
pt.dump_models(os.path.join(ROOT_DIR, 'full_pipe_models'))

writing itemset counts
writing downlinks
writing itemset to ids
writing meta information
dumping embedding model
dumping training embeddings
dumping type model 8


The entire pipeline can later be loaded back from disk and used to transform new data:

In [26]:
new_pt = PromptTypeWrapper(output_field='prompt_types_new',
                           min_support=100, svd__n_components=25, random_state=1000)

In [27]:
new_pt.load_models(os.path.join(ROOT_DIR, 'full_pipe_models'))

reading itemset counts
reading downlinks
reading itemset to ids
reading meta information
loading embedding model
loading training embeddings
loading type model 8


In [29]:
pt_model_dir = os.path.join(ROOT_DIR, 'full_pipe_models')
!ls $pt_model_dir

pm_model  pt_model


In [39]:
new_str_utt = new_pt.transform_utterance('Do you share my distaste for cockroaches?')

In [40]:
new_str_utt.get_info('motifs')

['do>* share_*']

In [41]:
new_str_utt.get_info('prompt_types__prompt_type.8')