download and load a complete stackexchange project #9

Merged
merged 9 commits into from May 2, 2019

@madtibo
Contributor

madtibo commented Aug 16, 2018

This commit adds a '-s' switch that downloads the project's compressed archive from archive.org, uncompresses it, and loads all the files into the database.
It also adds a '-n' switch to move the tables to a given schema.

WARNING: because the script now uses the urllib.request module, it must be run with Python 3!
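
A minimal sketch of the download step described above, for illustration only. It assumes the archive is named `<project>.stackexchange.com.7z` (the pattern seen in the `emacs.stackexchange.com.7z` example in this thread) and uses the archive.org directory quoted in the commit message as the default base URL. The helper names `build_archive_url` and `download_archive` are hypothetical and not taken from the PR.

```python
import os
import shutil
import urllib.request

# Directory quoted in the commit message; mirrors can change, so treat it as an assumption.
ARCHIVE_URL = "https://ia800107.us.archive.org/27/items/stackexchange/"


def build_archive_url(project, base_url=ARCHIVE_URL):
    """Return the URL of the 7z dump for a project such as 'emacs' (hypothetical helper)."""
    return base_url.rstrip("/") + "/" + project + ".stackexchange.com.7z"


def download_archive(project, dest_dir, base_url=ARCHIVE_URL):
    """Stream the project archive into dest_dir and return the local file path."""
    url = build_archive_url(project, base_url)
    dest = os.path.join(dest_dir, os.path.basename(url))
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Copy in chunks so a large dump is never held fully in memory.
        shutil.copyfileobj(response, out)
    return dest
```

Calling `download_archive("emacs", "/tmp")` would need network access; `build_archive_url` can be checked offline.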

Member

musically-ut left a comment

Thanks for the PR!

This is a rather big change, so I'll have to run it to verify that it works. However, I don't quite see why we have to make the upgrade to Python 3 mandatory, especially because you are already using libarchive instead of the built-in lzma module, and urllib is available via six.moves, which offers the same interface as urllib from Python 3.
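
To illustrate the six.moves suggestion, here is a hedged sketch of an import that works on both Python 2 and Python 3. The fallback branch and the `fetch_bytes` helper are assumptions added for this example and are not part of the PR.

```python
try:
    # six.moves mirrors the Python 3 urllib layout on both Python 2 and Python 3.
    from six.moves.urllib.request import urlopen
except ImportError:
    # Assumption: six is not installed, so fall back to the Python 3 standard library.
    from urllib.request import urlopen


def fetch_bytes(url):
    """Return the body of url as bytes (hypothetical helper)."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()
```

Because the call site only uses `urlopen`, the rest of the script does not need to know which interpreter it runs on.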

I can make these minor changes myself when I do the merging.

Thanks again!

@madtibo

Contributor Author

madtibo commented Aug 16, 2018

Great! I did not know that six could be used for that.
It would be splendid if you could make the change :-)

@madtibo

Contributor Author

madtibo commented Aug 16, 2018

Sorry, I was lazy and did not create a separate PR for this feature.
We could work on it once PR #8 about foreign keys is done.

madtibo added 2 commits Aug 16, 2018
Using the '-s' switch, download the compressed file from https://ia800107.us.archive.org/27/items/stackexchange/ and then uncompress it and load all the files into the database. Add a '-n' switch to move the tables to a given schema.

WARNING: because the script now uses the urllib.request module, set it to run with python3.
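
One way the "move the tables to a given schema" step could be expressed in PostgreSQL is with `ALTER TABLE ... SET SCHEMA`. The sketch below only builds the SQL statements. The helper names and the example table names are hypothetical, and the PR may implement this differently.

```python
def quote_ident(name):
    """Double-quote a PostgreSQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def move_tables_sql(tables, schema):
    """Build statements that create the schema and move each table into it."""
    statements = ["CREATE SCHEMA IF NOT EXISTS " + quote_ident(schema) + ";"]
    for table in tables:
        # Quoted identifiers are case-sensitive, so pass names exactly as stored.
        statements.append(
            "ALTER TABLE " + quote_ident(table) + " SET SCHEMA " + quote_ident(schema) + ";"
        )
    return statements


for statement in move_tables_sql(["posts", "users"], "emacs"):
    print(statement)
```

The statements would then be executed through whatever database connection the script already holds.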
@madtibo madtibo force-pushed the madtibo:load_full_project branch from 619cadd to fc4dc26 Feb 12, 2019
@madtibo

Contributor Author

madtibo commented Apr 1, 2019

Hello @musically-ut,

The "load complete project" MR is ready. I added a few options:

  • '-t' for the table name
  • '--archive-url' to specify a given archive directory
  • '-s' for the SO project name
  • '-k' to keep the downloaded project archive
  • '-f' can then be used to specify the archive file name
  • '-n' to specify a database schema
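
The options listed above could be declared with argparse roughly as follows. This is a sketch, not the script's actual parser: apart from `--archive-url`, the long option names are hypothetical (although `args.so_project` does appear in the diff reviewed in this thread), and the help strings are paraphrased from the list.

```python
import argparse


def build_parser():
    """Build a parser covering the options listed in this comment (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Load a StackExchange dump into PostgreSQL"
    )
    parser.add_argument("-t", "--table", help="name of the table to load")
    parser.add_argument("--archive-url", help="archive directory to download from")
    parser.add_argument("-s", "--so-project", help="project name, e.g. 'emacs'")
    parser.add_argument("-k", "--keep-archive", action="store_true",
                        help="keep the downloaded project archive")
    parser.add_argument("-f", "--file", help="archive file name")
    parser.add_argument("-n", "--schema-name", help="database schema to use")
    return parser


# Mirrors the first option combination of the fourth test command, minus -f and -d.
args = build_parser().parse_args(["-k", "-s", "emacs", "-n", "emacs"])
```

The sketch leaves out the other switches used in the test commands ('-d', '-j', '--foreign-keys').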

I tested several cases and found no problem:

./load_into_pg.py -k -s emacs
./load_into_pg.py -k -f /tmp/emacs.stackexchange.com.7z -d emacs
./load_into_pg.py -k -f /tmp/emacs.stackexchange.com.7z -s emacs -d emacs
time ./load_into_pg.py -k -f /tmp/emacs.stackexchange.com.7z -s emacs -d emacs -n emacs
./load_into_pg.py -k -f /tmp//emacs.stackexchange.com.7z -s emacs -d emacs -n json -j
./load_into_pg.py -k -f /tmp/emacs.stackexchange.com.7z -s emacs -d emacs -n foreign_keys --foreign-keys

Tell me what you think of it.

@musically-ut

Member

musically-ut commented Apr 11, 2019

Thank you for submitting this!

The code looks good and I don't see any immediate problems with it, but I still have to sit down and test all the options once (essentially the commands you gave in your last comment, thanks for that!).

I'll merge it soon.

@madtibo

Contributor Author

madtibo commented Apr 17, 2019

@musically-ut here is a commit using the tempfile library. I just get the temporary directory and store the file in it. Does that suit you?
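
The approach described here ("get the temporary directory and store the file in it") might look like the following sketch. The helper name `archive_destination` is hypothetical; only `tempfile.gettempdir()` is taken from the standard library as-is.

```python
import os
import tempfile


def archive_destination(file_name):
    """Return a path for file_name inside the system temporary directory."""
    return os.path.join(tempfile.gettempdir(), file_name)


target = archive_destination("emacs.stackexchange.com.7z")
```

On many Linux systems the temporary directory is `/tmp`, which would match the paths used in the test commands earlier in this thread, but the value depends on the platform and on environment variables such as `TMPDIR`.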


# load a project
elif args.so_project:
    import libarchive

@musically-ut

musically-ut May 1, 2019

Member

Can you verify that you are using the libarchive-c library instead of libarchive?

I will add this to the README.

@madtibo

madtibo May 2, 2019

Author Contributor

I am indeed using libarchive-c (in version 2.8).
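
Since both packages are imported as `libarchive`, a script can guard against picking up the wrong one. The sketch below is an assumption-laden illustration: it relies on libarchive-c exposing `file_reader` and `extract_file` at the top level of the module, and the helper names are hypothetical rather than taken from the PR.

```python
import os


def is_libarchive_c(module):
    """Heuristic: libarchive-c exposes file_reader and extract_file at top level."""
    return hasattr(module, "file_reader") and hasattr(module, "extract_file")


def extract_archive(archive_path, dest_dir):
    """Extract archive_path into dest_dir using libarchive-c."""
    import libarchive  # provided by the libarchive-c package on PyPI

    if not is_libarchive_c(libarchive):
        raise ImportError("expected libarchive-c, found a different 'libarchive' module")
    archive_path = os.path.abspath(archive_path)
    previous = os.getcwd()
    os.chdir(dest_dir)  # extract_file writes entries relative to the current directory
    try:
        libarchive.extract_file(archive_path)
    finally:
        os.chdir(previous)
```

The import stays inside the function, mirroring the diff above, so users who never load a full project do not need the library installed.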

@musically-ut musically-ut merged commit b77bfbc into Networks-Learning:master May 2, 2019
@musically-ut

Member

musically-ut commented May 2, 2019

Thanks for all the hard work! Merged! \o/

@madtibo

Contributor Author

madtibo commented May 3, 2019

It was really nice to work on this project.
Thank you for the help and the follow-up!

@madtibo madtibo deleted the madtibo:load_full_project branch May 3, 2019