Dataset and source code for LearningQ: A Large-scale Dataset for Educational Question Generation (ICWSM 2018).
Please download the full dataset via DATAVERSE or Google Drive.
LearningQ is a dataset for educational question generation. Specifically, it contains:
- 7K instructor-designed questions collected from TED-Ed;
- 223K learner-generated questions collected from Khan Academy.
The source documents (i.e., lecture videos and articles) from which the questions are generated are also included. We release the crawling code as part of LearningQ.
We open-source not only i) the filtered data, which can be directly used to train educational question generators, but also ii) the originally-collected data from both TED-Ed and Khan Academy. The data files are organized as follows.
+ LearningQ
+---- README.txt
+---- code [crawling code for TED-Ed and Khan Academy]
+---- data [the originally-collected data and the filtered data]
+----+---- khan
+----+----+---- crawled_data [the originally-collected data from Khan Academy]
+----+----+----+---- topictree.json [the full hierarchical listing of Khan Academy's topic tree]
+----+----+----+---- topics [the information about each topic node in the full topic tree, each file is named by the topic's slug in Khan Academy and stored in the JSON format]
+----+----+----+---- topic_videos [the list of all videos for each topic node, each file is named by the topic's slug in Khan Academy and stored in the JSON format]
+----+----+----+---- all_video_links [the links of all lecture videos in Khan Academy, the file is stored in the JSON format]
+----+----+----+---- transcripts [the transcripts of all lecture videos]
+----+----+----+---- video_discussions [the originally-collected questions generated by learners for each lecture video, each file is named by the video's YouTube ID and stored in the JSON format]
+----+----+----+---- all_article_links [the links of all articles in Khan Academy, the file is stored in the JSON format]
+----+----+----+---- articles [the content of each article, each file is named by the article's ID in Khan Academy and stored in the JSON format]
+----+----+----+---- article_discussions [the originally-collected questions generated by learners for each article, each file is named by the article ID and stored in the JSON format]
+----+----+---- khan_labeled_data [the manually-labelled questions (whether a question is useful for learning or not) we used to build the question classifier; each line in a file is a data sample, i.e., a manually-assigned label (1 for useful and 0 for non-useful) and the corresponding question]
+----+----+---- predicted_article_questions [the list of useful learning questions on articles, the file is stored in the JSON format]
+----+----+---- predicted_video_questions [the list of useful learning questions on lecture videos, the file is stored in the JSON format]
+----+---- teded
+----+----+---- crawled_data [the originally-collected data from TED-Ed]
+----+----+----+---- transcripts [the transcripts for lecture videos, each file is named by the video's YouTube ID]
+----+----+----+---- videos [the instructor-generated questions for each lecture video, each file is named by the video's title in TED-Ed and stored in the JSON format]
+----+---- experiments [the filtered data (i.e., predicted useful learning questions) which can be directly used as input for question generators, each file is named as {para/src/tgt}_{train/dev/test}, which denotes its data type, i.e., source document (para), source sentences (src) and target questions (tgt), and its usage, i.e., whether it is used for training (train), validation (dev) or testing (test).]
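To illustrate the layout described above, here is a minimal loading sketch. It assumes (verify against the actual files) that each `khan_labeled_data` line is tab-separated as `<label>\t<question>`, and that the `src_*`/`tgt_*` files in `experiments` are line-aligned; the function names and the `data_dir` default are hypothetical.

```python
# Hypothetical loader sketch for the LearningQ file layout described above.
# Assumption: labeled lines look like "1\tWhat causes tides?" (tab-separated).

def load_labeled_questions(path):
    """Parse manually-labelled questions: 1 = useful, 0 = non-useful."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, _, question = line.rstrip("\n").partition("\t")
            samples.append((int(label), question))
    return samples

def load_parallel(split, data_dir="data/experiments"):
    """Load aligned (source sentence, target question) pairs for a split.

    Assumption: src_{split} and tgt_{split} have one example per line,
    with line i of src corresponding to line i of tgt.
    """
    with open(f"{data_dir}/src_{split}", encoding="utf-8") as src, \
         open(f"{data_dir}/tgt_{split}", encoding="utf-8") as tgt:
        return list(zip((l.strip() for l in src), (l.strip() for l in tgt)))
```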
We implemented our question classifier as well as the question generators based on the following code repositories:
- Sentence Convolution Code in Torch
- Question Generation via Overgenerating Transformations and Ranking
- Neural Question Generation
Dataset | Method | Bleu 1 | Bleu 2 | Bleu 3 | Bleu 4 | Meteor | Rouge_L
---|---|---|---|---|---|---|---
Khan Academy | H&S | 0.28 | 0.17 | 0.13 | 0.10 | 3.24 | 6.61
Khan Academy | Seq2Seq | 19.84 | 7.68 | 4.02 | 2.29 | 6.44 | 23.11
Khan Academy | Attention Seq2Seq | 24.70 | 11.68 | 6.36 | 3.63 | 8.73 | 27.36
TED-Ed | H&S | 0.38 | 0.22 | 0.17 | 0.15 | 3.00 | 6.52
TED-Ed | Seq2Seq | 12.96 | 3.95 | 1.82 | 0.73 | 4.34 | 16.09
TED-Ed | Attention Seq2Seq | 15.83 | 5.63 | 2.63 | 1.15 | 5.32 | 17.69
The best Bleu 4 score achieved by the state-of-the-art method (i.e., Attention Seq2Seq) on SQuAD exceeds 12, while on LearningQ it is below 4, which indicates substantial room for improvement in educational question generation.
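For readers unfamiliar with the Bleu scores reported above, the sketch below shows the modified n-gram precision at their core (Bleu n combines precisions up to order n with a brevity penalty). This is an illustration only; the table's numbers come from standard evaluation scripts, not this snippet.

```python
# Illustrative sketch of the modified n-gram precision underlying Bleu.
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate sentence vs. one reference."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not cand_ngrams:
        return 0.0
    # Each candidate n-gram is credited at most as often as it appears
    # in the reference ("clipping"), so repeating a word cannot inflate score.
    clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return clipped / sum(cand_ngrams.values())
```

For example, a candidate that just repeats one reference word gets a low unigram precision because clipping caps its credit at the reference count.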
For any questions about the dataset, please contact Guanliang Chen at angus.glchen@gmail.com.
If you are using LearningQ in your work, please cite:
@inproceedings{ICWSM18LearningQ,
  author    = {Guanliang Chen and Jie Yang and Claudia Hauff and Geert-Jan Houben},
  title     = {LearningQ: A Large-scale Dataset for Educational Question Generation},
  booktitle = {International AAAI Conference on Web and Social Media},
  year      = {2018}
}