Dataset and source code for LearningQ: A Large-scale Dataset for Educational Question Generation (ICWSM 2018).
Please download the full dataset via DATAVERSE or Google Drive.
LearningQ is a dataset for educational question generation. Specifically, it contains:
- 7K instructor-designed questions collected from TED-Ed;
- 223K learner-generated questions collected from Khan Academy.
The source documents (i.e., lecture videos and articles) from which the questions are generated are also included. We release the crawling code as part of LearningQ.
We open-source not only i) the filtered data, which can be directly used to train educational question generators, but also ii) the originally-collected data from both TED-Ed and Khan Academy. The data files are organized as follows.
+ LearningQ
+---- README.txt
+---- code [crawling code for TED-Ed and Khan Academy]
+---- data [the originally-collected data and the filtered data]
+----+---- khan
+----+----+---- crawled_data [the originally-collected data from Khan Academy]
+----+----+----+---- topictree.json [the full hierarchical listing of Khan Academy's topic tree]
+----+----+----+---- topics [the information about each topic node in the full topic tree, each file is named by the topic's slug in Khan Academy and stored in the JSON format]
+----+----+----+---- topic_videos [the list of all videos for each topic node, each file is named by the topic's slug in Khan Academy and stored in the JSON format]
+----+----+----+---- all_video_links [the links of all lecture videos in Khan Academy, the file is stored in the JSON format]
+----+----+----+---- transcripts [the transcripts of all lecture videos]
+----+----+----+---- video_discussions [the originally-collected questions generated by learners for each lecture video, each file is named by the video's YouTube ID and stored in the JSON format]
+----+----+----+---- all_article_links [the links of all articles in Khan Academy, the file is stored in the JSON format]
+----+----+----+---- articles [the content of each article, each file is named by the article's ID in Khan Academy and stored in the JSON format]
+----+----+----+---- article_discussions [the originally-collected questions generated by learners for each article, each file is named by the article ID and stored in the JSON format]
+----+----+---- khan_labeled_data [the manually-labelled questions (whether a question is useful for learning or not) we used to build the question classifier; each line in a file is a data sample, i.e., a manually-assigned label (1 for useful and 0 for non-useful) and the corresponding question]
+----+----+---- predicted_article_questions [the list of useful learning questions on articles, the file is stored in the JSON format]
+----+----+---- predicted_video_questions [the list of useful learning questions on lecture videos, the file is stored in the JSON format]
+----+---- teded
+----+----+---- crawled_data [the originally-collected data from TED-Ed]
+----+----+----+---- transcripts [the transcripts for lecture videos, each file is named by the video's YouTube ID]
+----+----+----+---- videos [the instructor-generated questions for each lecture video, each file is named by the video's title in TED-Ed and stored in the JSON format]
+----+---- experiments [the filtered data (i.e., predicted useful learning questions) which can be directly used as input for question generators, each file is named as {para/src/tgt}_{train/dev/test}, which denotes its data type, i.e., source document (para), source sentences (src) and target questions (tgt), and its usage, i.e., whether it is used for training (train), validation (dev) or testing (test).]
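To illustrate the layout described above, here is a minimal loading sketch. It assumes (verify against the actual files) that each `khan_labeled_data` line is tab-separated as `<label>\t<question>`, and that the `src_*`/`tgt_*` files in `experiments` are line-aligned; the function names and the `data_dir` default are hypothetical.

```python
# Hypothetical loader sketch for the LearningQ file layout described above.
# Assumption: labeled lines look like "1\tWhat causes tides?" (tab-separated).

def load_labeled_questions(path):
    """Parse manually-labelled questions: 1 = useful, 0 = non-useful."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, _, question = line.rstrip("\n").partition("\t")
            samples.append((int(label), question))
    return samples

def load_parallel(split, data_dir="data/experiments"):
    """Load aligned (source sentence, target question) pairs for a split.

    Assumption: src_{split} and tgt_{split} have one example per line,
    with line i of src corresponding to line i of tgt.
    """
    with open(f"{data_dir}/src_{split}", encoding="utf-8") as src, \
         open(f"{data_dir}/tgt_{split}", encoding="utf-8") as tgt:
        return list(zip((l.strip() for l in src), (l.strip() for l in tgt)))
```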
We implemented our question classifier as well as the question generators based on the following code repositories:
- Sentence Convolution Code in Torch
- Question Generation via Overgenerating Transformations and Ranking
- Neural Question Generation
Dataset | Method | Bleu 1 | Bleu 2 | Bleu 3 | Bleu 4 | Meteor | Rouge_L
---|---|---|---|---|---|---|---
Khan Academy | H&S | 0.28 | 0.17 | 0.13 | 0.10 | 3.24 | 6.61
Khan Academy | Seq2Seq | 19.84 | 7.68 | 4.02 | 2.29 | 6.44 | 23.11
Khan Academy | Attention Seq2Seq | 24.70 | 11.68 | 6.36 | 3.63 | 8.73 | 27.36
TED-Ed | H&S | 0.38 | 0.22 | 0.17 | 0.15 | 3.00 | 6.52
TED-Ed | Seq2Seq | 12.96 | 3.95 | 1.82 | 0.73 | 4.34 | 16.09
TED-Ed | Attention Seq2Seq | 15.83 | 5.63 | 2.63 | 1.15 | 5.32 | 17.69
The best Bleu 4 score achieved by the state-of-the-art method (i.e., Attention Seq2Seq) on SQuAD exceeds 12, while on LearningQ it is below 4, which indicates substantial room for improvement in educational question generation.
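For readers unfamiliar with the Bleu scores reported above, the sketch below shows the modified n-gram precision at their core (Bleu n combines precisions up to order n with a brevity penalty). This is an illustration only; the table's numbers come from standard evaluation scripts, not this snippet.

```python
# Illustrative sketch of the modified n-gram precision underlying Bleu.
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate sentence vs. one reference."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not cand_ngrams:
        return 0.0
    # Each candidate n-gram is credited at most as often as it appears
    # in the reference ("clipping"), so repeating a word cannot inflate score.
    clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return clipped / sum(cand_ngrams.values())
```

For example, a candidate that just repeats one reference word gets a low unigram precision because clipping caps its credit at the reference count.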
For any questions about the dataset, please contact Guanliang Chen at angus.glchen@gmail.com.
If you are using LearningQ in your work, please cite:
@inproceedings{ICWSM18LearningQ,
  author    = {Guanliang Chen and Jie Yang and Claudia Hauff and Geert-Jan Houben},
  title     = {LearningQ: A Large-scale Dataset for Educational Question Generation},
  booktitle = {International AAAI Conference on Web and Social Media},
  year      = {2018}
}