Skip to content
master
Switch branches/tags
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 

Reddit-QA-corpus-

A question and answer corpus from the /r/askreddit subreddit. Designed for training seq2seq neural networks. The total corpus is 4,976,760 question and answer pairs.

Creation

Data was gathered from http://pushshift.io reddit data dump on the 15/06/2018.

All submission to the subreddit where taken as questions, and the first comment on each submission was taken as the answer.

Cleaning

  • A submission must have a comment to be included
  • Comments which where [deleted] , [removed] or posted by the subreddit auto bot where removed.
  • Submission which didn't contain questions where removed (Subreddit bot automatically inforces that all submissions must have a question mark)

Data

Each file is just under half a GB, hence why it's stored on a remote server.

Numbers

7,102,717 Questions 12,039,795 Answers

  • 2,548,641 Autobots
  • 831,108 Deleted
  • 73,881 =8,586,165

4,976,759 Questions and Answers pairs

Referencing

If used for academic purposes please contact fionnd [at] pm.me for full refencing infomation.

@misc{redditqa,
    author = "{Fionn Delahunty}",
    title = {{Reddit QA Corpus}},
    howpublished = {\url{https://github.com/FionnD/Reddit-QA-Corpus}},
    note = {Online; accessed XXX} ,
    year=2018,
}

About

A question and answer corpus from the /r/askreddit subreddit. Designed for training seq2seq neural networks.

Resources

License

Releases

No releases published

Packages

No packages published