An open clone of the GPT-2 WebText dataset by OpenAI. Still WIP.

Skylion007/openwebtext

 
 


OpenWebText

This project is an open clone of the GPT-2 WebText dataset as outlined in the OpenAI paper. It is still very much a work in progress.

Dependencies

Pipenv and Python 3.

To install the Python dependencies:

pipenv install

System libraries required by Newspaper:

On Ubuntu:

sudo apt-get install libxml2-dev libxslt-dev

On OS X:

brew install libxml2 libxslt

Usage

  1. Get the list of URLs from Reddit:
pipenv run python get_urls.py
  2. Download the data from those URLs:
pipenv run python download.py

Resulting files will be deposited in data/ with format {domain}-{sha256 hash of url}.txt.
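
As a rough illustration of the naming scheme above, here is a minimal sketch of how such a filename could be built. It is not taken from download.py: the helper name output_filename is hypothetical, and it assumes that "domain" means the URL's host and that the hash is the hex SHA-256 digest of the UTF-8 encoded URL. The actual script may differ on both points.

```python
import hashlib
from urllib.parse import urlparse


def output_filename(url: str) -> str:
    """Build a '{domain}-{sha256 hash of url}.txt' style name (hypothetical helper)."""
    # Assumption: "domain" is the network location (host) part of the URL.
    domain = urlparse(url).netloc
    # Assumption: the hash is the hex SHA-256 digest of the UTF-8 encoded URL.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{domain}-{digest}.txt"


if __name__ == "__main__":
    # Prints the host, a dash, 64 hex characters, and the .txt suffix.
    print(output_filename("https://example.com/some/article"))
```

Hashing the full URL keeps names unique per page while the domain prefix keeps files from the same site grouped together when the directory is sorted.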

Enjoy!
