Scripts to convert some datasets to SQLite format.


These scripts convert several large datasets from their native formats to SQLite. They are designed with minimal dependencies so that each one can be copied and run independently of the others.

Each script prints usage help when run with insufficient arguments, and each can read its input data in compressed form.
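As a rough illustration of how such a converter can handle compressed input (the actual scripts are not reproduced here, and all function names below are hypothetical), the sketch streams gzip- or bz2-compressed JSON-lines data into a SQLite table:

```python
import bz2
import gzip
import json
import sqlite3


def open_maybe_compressed(path, mode="rt"):
    # Choose a decompressor based on the file extension;
    # fall back to a plain open for uncompressed files.
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    if path.endswith(".bz2"):
        return bz2.open(path, mode)
    return open(path, mode)


def load_json_lines(path, db_path, table):
    # Stream one JSON object per line into a single-column table,
    # so the whole file never has to fit in memory.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS %s (doc TEXT)" % table)
    with open_maybe_compressed(path) as fh:
        for line in fh:
            record = json.loads(line)
            conn.execute("INSERT INTO %s (doc) VALUES (?)" % table,
                         (json.dumps(record),))
    conn.commit()
    conn.close()
```

Using only the standard library (gzip, bz2, json, sqlite3) keeps the dependency footprint minimal, which matches the design goal stated above.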


  • Amazon Reviews

    • Source:
    • python --gzip aggressive_dedup.json.gz amazon.sqlite reviews
    • python --gzip metadata.json.gz amazon.sqlite
  • Wikipedia Metadata

    • Source: (NOT the complete wikipedia history)
    • python --bz2 enwiki-20080103.main.bz2 wikipedia_2008.sqlite main
    • python --bz2 enwiki-20080103.users.bz2 wikipedia_2008.sqlite users
    • etc.
  • Memetracker data

    • Source:
    • python --gzip quotes_2008-08.txt.gz memetracker2.sqlite meme
    • python --gzip quotes_2008-09.txt.gz memetracker2.sqlite meme
    • python --gzip quotes_2008-10.txt.gz memetracker2.sqlite meme
    • etc.
  • Reddit data

    • Source:
    • From 2015-04 onward, the comments contain one extra field, removal_reason, so the headers must be supplied explicitly.
    • python --bz2 RC_2015-01.bz2 --headers reddit_headers.txt reddit.sqlite comments
    • python --bz2 RC_2015-02.bz2 --headers reddit_headers.txt reddit.sqlite comments
    • python --bz2 RC_2015-03.bz2 --headers reddit_headers.txt reddit.sqlite comments
    • etc.
  • StackExchange data


I use code from rgrp/csv2sqlite for guessing column types. The code for converting the StackExchange dataset is taken (with minor changes) from testlnord/sedumpy.