# StackOverflow data to postgres

This is a quick script to move the Stackoverflow data from the [StackExchange
data dump (Sept '14)](https://archive.org/details/stackexchange) to a
PostgreSQL database.

Schema hints are taken from [a post on
Meta.StackExchange](http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede)
and from [StackExchange Data Explorer](http://data.stackexchange.com).

## Quickstart

Install the requirements, create a `stackoverflow` database, and use the
`load_into_pg.py` script:

``` console
$ pip install -r requirements.txt
...
Successfully installed argparse-1.2.1 libarchive-c-2.9 lxml-4.5.2 psycopg2-binary-2.8.4 six-1.10.0 wsgiref-0.1.2
$ createdb stackoverflow
$ python load_into_pg.py -s beer
```

This will download compressed files from
[archive.org](https://ia800107.us.archive.org/27/items/stackexchange/) and load
all the tables at once.
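
As a quick sanity check, you can list the tables that were created (this
assumes the default `stackoverflow` database from the Quickstart):

``` console
$ psql stackoverflow -c '\dt'   # lists the freshly loaded tables
```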

## Advanced Usage

You can use a custom database name as well. Make sure to pass it explicitly
when executing the script later.
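
For example, to load the `beer` site into a database called `beer`, something
like the following should work. This is only a sketch: the `-d` database
option is an assumption here, so check `python load_into_pg.py --help` for the
exact flag your version exposes.

``` console
$ createdb beer
$ python load_into_pg.py -s beer -d beer   # "-d" is assumed; see --help for the real option
```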

Each table's data is archived in an XML file. The available tables vary across
dump versions. `load_into_pg.py` knows how to handle the following tables:

- `Badges`.
- `Posts`.
- `Tags` (not present in earliest dumps).
- `Users`.
- `Votes`.
- `PostLinks`.
- `PostHistory`.
- `Comments`.

You can manually download the files to the folder from which the program is
executed: `Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`. In
some old dumps, the filenames use different capitalization.

Then load each file with e.g. `python load_into_pg.py -t Badges` (the tables
can be loaded in parallel, if desired); see the sketch below.
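
For instance, a small shell loop loads every table in one go. This is a sketch
that assumes a recent dump whose XML files, including `Tags.xml`, are already
in the current folder:

``` console
$ for table in Badges Posts Tags Users Votes PostLinks PostHistory Comments; do
>     python load_into_pg.py -t "$table" &   # "&" loads in parallel; drop it to load sequentially
> done
$ wait
```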

After all the initial tables have been created:

``` console
$ psql stackoverflow < ./sql/final_post.sql
```

For some additional indexes and tables, you can also execute the following:

``` console
$ psql stackoverflow < ./sql/optional_post.sql
```

If you used a custom database name, use it instead of `stackoverflow` in both
commands above.

If you give a schema name using the `-n` switch, all the tables will be moved
to the given schema. This schema will be created by the script.

To load the _dba.stackexchange.com_ project into the `dba` schema, you would
execute: `./load_into_pg.py -s dba -n dba`

The paths are not changed in the final scripts `sql/final_post.sql` and
`sql/optional_post.sql`. To run them, first set the _search_path_ to your
schema name: `SET search_path TO <myschema>;`
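
For example, a minimal sketch of running the final script against the `dba`
schema from the example above (substitute whatever schema name you passed to
`-n`):

``` console
$ psql stackoverflow
stackoverflow=# SET search_path TO dba;  -- use the schema name you gave to -n
stackoverflow=# \i sql/final_post.sql
```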

## Caveats and TODOs

- It prepares some indexes and views which may not be necessary for your analysis.