This example extracts question types from UK Parliament question-and-answer sessions,
reproducing the "Asking too much" paper (http://www.cs.cornell.edu/~cristian/Asking_too_much.html).
Because clustering is non-deterministic, the order of the clusters and some cluster assignments will vary between runs.<br> This version uses precomputed motifs for speed.

In [1]:
import os
import pkg_resources
import numpy as np

from convokit import Corpus, QuestionTypology, download

Initializing QuestionTypology Class

In [2]:
num_clusters = 8

# Get precomputed motifs. data_dir contains the downloaded data. 
# motifs_dir is the specific path within data_dir that contains the precomputed motifs
data_dir = os.path.join(pkg_resources.resource_filename("convokit", ""), 'downloads', 'parliament')
motifs_dir = os.path.join(data_dir, 'parliament-motifs')

#Load the corpus
corpus = Corpus(filename=os.path.join(data_dir, 'parliament-corpus'))

#Extract clusters of the motifs and assign questions to these clusters
questionTypology = QuestionTypology(corpus, data_dir, motifs_dir=motifs_dir, num_dims=25, 
  num_clusters=num_clusters, verbose=False, random_seed=164)



`questionTypology.types_to_data` contains the data computed in the step above. Its keys are the cluster indices (here 0-7), and each value is a dictionary with the following keys:<br>
    <br>`"motifs"`: the motifs, as a list of tuples of motif terms
    <br>`"motif_dists"`: the distance of each motif from the centroid of its cluster
    <br>`"fragments"`: the answer fragments, as a list of tuples of answer terms
    <br>`"fragment_dists"`: the distance of each fragment from the centroid of its cluster
    <br>`"questions"`: the IDs of the questions in this cluster. You can get the corresponding question text with the `get_question_text_from_pair_idx(pair_idx)` method.
    <br>`"question_dists"`: the distance of each question from the centroid of its cluster
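
The layout described above can be explored with ordinary dictionary and list operations. The following is a minimal sketch, not part of ConvoKit: `toy_types_to_data` is a hand-written stand-in with the same keys, its distance values are invented for illustration, and `closest_items` is a hypothetical helper. It assumes that each `*_dists` list is parallel to its item list and that a smaller distance means the item is closer to the cluster centroid.

```python
# Hand-written stand-in mirroring the shape of questionTypology.types_to_data.
# The motif and fragment strings are taken from the sample output of this
# notebook; all distance values are invented for illustration only.
toy_types_to_data = {
    0: {
        "motifs": [("made_*",), ("made_*", "made_what"), ("made_*", "made_in")],
        "motif_dists": [0.42, 0.17, 0.66],
        "fragments": [("discussed_*",), ("sent_to",)],
        "fragment_dists": [0.30, 0.25],
        "questions": ["q1", "q2"],        # hypothetical question IDs
        "question_dists": [0.54, 0.61],
    },
}


def closest_items(types_to_data, cluster, items_key, dists_key, k=2):
    """Return the k (item, distance) pairs nearest the centroid of `cluster`.

    Hypothetical helper: pairs each item with the distance at the same
    position in the parallel list, then sorts by ascending distance.
    """
    entry = types_to_data[cluster]
    paired = zip(entry[items_key], entry[dists_key])
    ranked = sorted(paired, key=lambda pair: pair[1])
    return ranked[:k]


for motif, dist in closest_items(toy_types_to_data, 0, "motifs", "motif_dists"):
    print(motif, dist)
```

If the real object follows the same layout, passing `questionTypology.types_to_data` in place of the toy dictionary should work the same way, with `"fragments"`/`"fragment_dists"` or `"questions"`/`"question_dists"` as the key pair.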

Display Outputs

In [3]:
questionTypology.display_totals()
print('10 examples for type 1-8:')
for i in range(num_clusters):
    questionTypology.display_motifs_for_type(i, num_egs=10)
    questionTypology.display_answer_fragments_for_type(i, num_egs=10)
    questionTypology.display_questions_for_type(i, num_egs=10)

Total Motifs: 2255
Total Questions: 199861
Total Fragments: 2756
Number of Motifs in each cluster:  [197, 379, 236, 265, 277, 310, 244, 347]
Number of Questions of each type:  [11310, 25616, 34548, 29001, 17672, 21754, 29862, 30098]
10 examples for type 1-8:
	10 sample question motifs for type 0 (197 total motifs):
		1. ('made_*',)
		2. ('made_*', 'made_what')
		3. ('made_*', 'made_been', 'made_what')
		4. ('made_*', 'what>*', 'what>what')
		5. ('made_*', 'made_is', 'made_what')
		6. ('made_*', 'made_being', 'made_what')
		7. ('made_*', 'made_in')
		8. ('made_*', 'made_been')
		9. ('made_*', 'made_has', 'made_what')
		10. ('made_*', 'made_has', 'made_in')
	10 sample answer fragments for type 0 (365 total fragments) :
		1. discussed_*
		2. discussed_with
		3. discussed_have
		4. discussed_in
		5. reassure_can
		6. will>continue
		7. sent_to
		8. sent_*
		9. as>soon
		10. assure_however
	10 sample questions that were assigned type 0 (11310 total questions with this type) :
		1. 0.5426088