StackOverflow data to postgres
This is a quick script to move the Stackoverflow data from the StackExchange data dump (Sept '14) to a Postgres SQL database.
- Create the database
stackoverflowin your database:
CREATE DATABASE stackoverflow;
- You can use a custom database name as well. Make sure to explicitly give it while executing the script later.
- Move the following files to the folder from where the program is executed:
- In some old dumps, the cases in the filenames are different.
- Execute in the current folder (in parallel, if desired):
python load_into_pg.py Badges
python load_into_pg.py Posts
python load_into_pg.py Tags(not present in earliest dumps)
python load_into_pg.py Users
python load_into_pg.py Votes
python load_into_pg.py PostLinks
python load_into_pg.py PostHistory
python load_into_pg.py Comments
- Finally, after all the initial tables have been created:
psql stackoverflow < ./sql/final_post.sql
- If you used a different database name, make sure to use that instead of
stackoverflowwhile executing this step.
- For some additional indexes and tables, you can also execute the the following;
psql stackoverflow < ./sql/optional_post.sql
- Again, remember to user the correct database name here, if not
Caveats and TODOs
- It prepares some indexes and views which may not be necessary for your analysis.
Poststable is NOT populated by default. You have to use
--with-post-bodyargument to include it.
Userstable is NOT populated.
Sept 2011 data dump
tags.xmlis missing from the data dump. Hence, the
UserTagQAtables will be empty after
Postsis sometimes equal to an
emptyvalue. It is replaced by
NULLin those cases.