Dataset and source code for LearningQ: A Large-scale Dataset for Educational Question Generation (ICWSM 2018).

Please download the full dataset via DATAVERSE or SURFSARA.

1. How the Dataset was Collected

LearningQ is a dataset for educational question generation. Specifically, it contains:

  • 7K instructor-designed questions collected from TED-Ed;
  • 223K learner-generated questions collected from Khan Academy.

The source documents (i.e., lecture videos and articles) from which the questions were generated are also provided. We include the crawling code as part of LearningQ.

2. Files in the Dataset

We open-source not only i) the filtered data, which can be directly used to train educational question generators, but also ii) the originally collected data from both TED-Ed and Khan Academy. The data files are listed below.

+ LearningQ
+---- README.txt
+---- code [crawling code for TED-Ed and Khan Academy]
+---- data [the originally-collected data and the filtered data]
+----+---- khan
+----+----+---- crawled_data [the originally-collected data from Khan Academy]
+----+----+----+---- topictree.json [the full hierarchical listing of Khan Academy's topic tree]
+----+----+----+---- topics [the information about each topic node in the full topic tree, each file is named by the topic's slug in Khan Academy and stored in the JSON format]
+----+----+----+---- topic_videos [the list of all videos for each topic node, each file is named by the topic's slug in Khan Academy and stored in the JSON format]
+----+----+----+---- all_video_links [the links of all lecture videos in Khan Academy, the file is stored in the JSON format]
+----+----+----+---- transcripts [the transcripts of all lecture videos]
+----+----+----+---- video_discussions [the originally-collected questions generated by learners for each lecture video, each file is named by the video's YouTube ID and stored in the JSON format]
+----+----+----+---- all_article_links [the links of all articles in Khan Academy, the file is stored in the JSON format]
+----+----+----+---- articles [the content of each article, each file is named by the article's ID in Khan Academy and stored in the JSON format]
+----+----+----+---- article_discussions [the originally-collected questions generated by learners for each article, each file is named by the article ID and stored in the JSON format]
+----+----+---- khan_labeled_data [the manually-labelled questions (whether a question is useful for learning or not) we used to build the question classifier; each line in a file is a data sample, i.e., a manually-assigned label (1 for useful and 0 for non-useful) and the corresponding question]
+----+----+---- predicted_article_questions [the list of useful learning questions on articles, the file is stored in the JSON format]
+----+----+---- predicted_video_questions [the list of useful learning questions on lecture videos, the file is stored in the JSON format]
+----+---- teded
+----+----+---- crawled_data [the originally-collected data from TED-Ed]
+----+----+----+---- transcripts [the transcripts for lecture videos, each file is named by the video's YouTube ID]
+----+----+----+---- videos [the instructor-generated questions for each lecture video, each file is named by the video's title in TED-Ed and stored in the JSON format]
+----+---- experiments [the filtered data (i.e., predicted useful learning questions) which can be directly used as input for question generators, each file is named as {para/src/tgt}_{train/dev/test}, which denotes its data type, i.e., source document (para), source sentences (src) and target questions (tgt), and its usage, i.e., whether it is used for training (train), validation (dev) or testing (test).]
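As an illustration, the parallel `{para/src/tgt}_{train/dev/test}` files in `experiments` can be read into (source, target) pairs. The sketch below assumes the common one-example-per-line convention for such parallel files; please verify this against the released data before relying on it.

```python
import os

def load_split(data_dir, split):
    """Load aligned source sentences and target questions for one split.

    Assumes src_{split} and tgt_{split} hold one example per line
    (a common seq2seq convention; verify against the actual files).
    """
    with open(os.path.join(data_dir, "src_" + split), encoding="utf-8") as f:
        sources = [line.strip() for line in f]
    with open(os.path.join(data_dir, "tgt_" + split), encoding="utf-8") as f:
        targets = [line.strip() for line in f]
    # The two files are expected to be line-aligned.
    assert len(sources) == len(targets), "src/tgt files must be aligned"
    return list(zip(sources, targets))
```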

3. Implementation of the Question Generators

We implemented our question classifier as well as the question generators based on the following code repositories:

4. Baseline Results

| Dataset | Method | Bleu 1 | Bleu 2 | Bleu 3 | Bleu 4 | Meteor | Rouge_L |
|---|---|---|---|---|---|---|---|
| Khan Academy | H&S | 0.28 | 0.17 | 0.13 | 0.10 | 3.24 | 6.61 |
| Khan Academy | Seq2Seq | 19.84 | 7.68 | 4.02 | 2.29 | 6.44 | 23.11 |
| Khan Academy | Attention Seq2Seq | 24.70 | 11.68 | 6.36 | 3.63 | 8.73 | 27.36 |
| TED-Ed | H&S | 0.38 | 0.22 | 0.17 | 0.15 | 3.00 | 6.52 |
| TED-Ed | Seq2Seq | 12.96 | 3.95 | 1.82 | 0.73 | 4.34 | 16.09 |
| TED-Ed | Attention Seq2Seq | 15.83 | 5.63 | 2.63 | 1.15 | 5.32 | 17.69 |

The best Bleu 4 score achieved by state-of-the-art methods (i.e., Attention Seq2Seq) on SQuAD is above 12, while on LearningQ it is below 4, which indicates substantial room for improvement in educational question generation.
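For readers unfamiliar with the metric, Bleu 4 is the geometric mean of 1- to 4-gram precisions scaled by a brevity penalty. The snippet below is a minimal self-contained sketch of sentence-level Bleu for illustration only; it is not the evaluation script used to produce the numbers above, and published scores are typically computed at corpus level with smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level Bleu with uniform weights and brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped n-gram matches between candidate and reference.
        overlap = sum((cand_counts & ref_counts).values())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any zero precision zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```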

5. Contact

For any questions about the dataset, please contact Guanliang Chen via angus.glchen@gmail.com or guanliang.chen@tudelft.nl

6. Citation

If you are using LearningQ in your work, please cite:

@inproceedings{ICWSM18LearningQ,
	author = {Guanliang Chen and Jie Yang and Claudia Hauff and Geert-Jan Houben},
	title = {LearningQ: A Large-scale Dataset for Educational Question Generation},
	booktitle = {International AAAI Conference on Web and Social Media},
	year = {2018}
}