Adding support for using local storage instead of S3, per config #10

Merged
merged 8 commits into LexPredict:master on Aug 3, 2018

Conversation

2 participants
@LRParser
Contributor

LRParser commented Jul 11, 2018

Long term, I think it would be nicer to give S3Client and LocalClient constructors and have both implement the same interface, but I kept the existing static-method pattern for readability for now.

LRParser added some commits Jul 11, 2018

@LRParser

Contributor Author

LRParser commented Jul 11, 2018

LRParser added some commits Jul 11, 2018

Updating the celery side for running locally. Still processing on my PC. I see FilingIndex objects being created...
@LRParser

Contributor Author

LRParser commented Jul 11, 2018

I actually did the OO refactor after all, to keep things more consistent. I sometimes see errors in Tika (java.io.IOException: Broken pipe) and need to look into this in more depth. I would appreciate it if anyone has a chance to check this branch out! You should just need to set CLIENT_TYPE=LOCAL and DOWNLOAD_PATH=/some/path in your .env file to use the local functionality. S3 should still work, but I have not fully checked it...
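
For readers following along, here is a minimal sketch of how a shared client interface and the CLIENT_TYPE / DOWNLOAD_PATH switch could fit together. It is not the code in this branch: the names StorageClient, make_client, path_exists, put_buffer and get_buffer are assumptions chosen for illustration, and so is the default of "S3" when CLIENT_TYPE is unset. The S3 side is deliberately left out.

```python
import os
from abc import ABC, abstractmethod
from pathlib import Path


class StorageClient(ABC):
    """Hypothetical common interface for storage backends (illustrative only)."""

    @abstractmethod
    def path_exists(self, path: str) -> bool:
        """Return True if an object is stored at the given relative path."""

    @abstractmethod
    def put_buffer(self, path: str, buffer: bytes) -> None:
        """Store raw bytes at the given relative path."""

    @abstractmethod
    def get_buffer(self, path: str) -> bytes:
        """Return the raw bytes stored at the given relative path."""


class LocalClient(StorageClient):
    """Stores objects as files under a root directory on the local disk."""

    def __init__(self, root: str):
        self.root = Path(root)

    def _full(self, path: str) -> Path:
        # Treat remote-style paths such as "/Archives/..." as relative to root.
        return self.root / path.lstrip("/")

    def path_exists(self, path: str) -> bool:
        return self._full(path).exists()

    def put_buffer(self, path: str, buffer: bytes) -> None:
        target = self._full(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(buffer)

    def get_buffer(self, path: str) -> bytes:
        return self._full(path).read_bytes()


def make_client(env=None) -> StorageClient:
    """Pick a backend from CLIENT_TYPE; the "S3" default is an assumption."""
    env = os.environ if env is None else env
    client_type = env.get("CLIENT_TYPE", "S3").upper()
    if client_type == "LOCAL":
        return LocalClient(env["DOWNLOAD_PATH"])
    raise NotImplementedError("S3 backend is omitted from this sketch")
```

With a mapping such as {"CLIENT_TYPE": "LOCAL", "DOWNLOAD_PATH": "/some/path"}, make_client returns a LocalClient, and calling code only depends on the three interface methods.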

@LRParser

Contributor Author

LRParser commented Jul 13, 2018

Can you think of any reason why FilingDocument objects would not be populated while the other models are?

In [9]: process_all_filing_index(year=2018, form_type_list=["10-K"])
openedgar.clients.edgar: INFO Locating form index list for 2018
openedgar.clients.edgar: INFO Retrieving directory listing from /Archives/edgar/daily-index/2018/
openedgar.clients.edgar: INFO Retrieving remote path /Archives/edgar/daily-index/2018/ to memory
openedgar.clients.edgar: INFO Successfully retrieved file /Archives/edgar/daily-index/2018/; 9004 bytes
openedgar.clients.edgar: INFO Successfully retrieved 3 links from /Archives/edgar/daily-index/2018/
['/Archives/edgar/daily-index/2018//QTR1/', '/Archives/edgar/daily-index/2018//QTR2/', '/Archives/edgar/daily-index/2018//QTR3/']
openedgar.clients.edgar: INFO Retrieving directory listing from /Archives/edgar/daily-index/2018//QTR1/
openedgar.clients.edgar: INFO Retrieving remote path /Archives/edgar/daily-index/2018//QTR1/ to memory
openedgar.clients.edgar: INFO Successfully retrieved file /Archives/edgar/daily-index/2018//QTR1/; 63914 bytes
openedgar.clients.edgar: INFO Successfully retrieved 310 links from /Archives/edgar/daily-index/2018//QTR1/
openedgar.clients.edgar: INFO Retrieving directory listing from /Archives/edgar/daily-index/2018//QTR2/
openedgar.clients.edgar: INFO Retrieving remote path /Archives/edgar/daily-index/2018//QTR2/ to memory
openedgar.clients.edgar: INFO Successfully retrieved file /Archives/edgar/daily-index/2018//QTR2/; 65660 bytes
openedgar.clients.edgar: INFO Successfully retrieved 320 links from /Archives/edgar/daily-index/2018//QTR2/
openedgar.clients.edgar: INFO Retrieving directory listing from /Archives/edgar/daily-index/2018//QTR3/
openedgar.clients.edgar: INFO Retrieving remote path /Archives/edgar/daily-index/2018//QTR3/ to memory
openedgar.clients.edgar: INFO Successfully retrieved file /Archives/edgar/daily-index/2018//QTR3/; 13907 bytes
openedgar.clients.edgar: INFO Successfully retrieved 30 links from /Archives/edgar/daily-index/2018//QTR3/
openedgar.clients.edgar: INFO Successfully located 132 form index files for 2018
openedgar.clients.local: INFO Initialized local client

In [10]: Filing.objects.count()
Out[10]: 6128

In [11]: FilingDocument.objects.count()
Out[11]: 0

In [12]: Company.objects.count()
Out[12]: 2987

In [13]: FilingDocument.objects.count()
Out[13]: 0

cc @mjbommar

@mjbommar

Contributor

mjbommar commented Jul 14, 2018

@LRParser, first, awesome stuff!

Second, on your most recent question, is there any chance that celery isn't up and running? I'll dig into your branch over the weekend and see if I can replicate it.

@mjbommar

Contributor

mjbommar commented Jul 14, 2018

@ericlex and @LRParser, how do you feel about a Gitter room for conversations like these?

Need to handle the fact that some local files will be written as text (str) and not as bytestream. Now FilingDocuments are populating
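
As an illustration of the str-versus-bytes problem this commit message describes, the sketch below normalizes a payload before writing it to disk. The helper name write_payload and the UTF-8 default are assumptions for this example, not the code in the commit.

```python
from pathlib import Path
from typing import Union


def write_payload(path: str, payload: Union[str, bytes], encoding: str = "utf-8") -> int:
    """Write payload to path whether it arrives as text (str) or as bytes.

    Returns the number of bytes written.
    """
    if isinstance(payload, str):
        # Text payloads are encoded first, so the file is always opened in
        # binary mode and readers can consistently expect bytes back.
        payload = payload.encode(encoding)
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target.write_bytes(payload)
```

Encoding at the write boundary keeps the rest of the pipeline working with a single type, which avoids a TypeError when a binary-mode file handle receives a str.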
@LRParser

Contributor Author

LRParser commented Jul 15, 2018

@mjbommar - yes, a Gitter room would definitely be cool. FYI, I figured out the FilingDocument issue in my latest commit. This code should make it easy to push all the data to HDFS (once downloaded locally) for parallel querying and to remove any AWS dependency, which should hopefully be helpful for folks. I will work on the pylint errors in a few days.

@LRParser

Contributor Author

LRParser commented Jul 15, 2018

FYI, I do still see a number of these errors. Perhaps they can be ignored, though?

openedgar.tasks: INFO No Filing record found for edgar/data/1118072/0001554795-18-000061.txt, creating...
openedgar.tasks: INFO Raw exception: Filing matching query does not exist.

I now have over 215k FilingDocument objects created.

@mjbommar

Contributor

mjbommar commented Jul 31, 2018

@LRParser pylint is a harsh grader 😄

―――――――― [pylint] lexpredict_openedgar/openedgar/tests/test_process.py ―――――――――
E: 40, 8: No value for argument 'file_path' in function call (no-value-for-parameter)

LGTM anyway, but let me know if you are already mid-commit/push to fix.
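
For context on that pylint message: no-value-for-parameter (E1120) is reported when a call omits a required argument. The call at line 40 of test_process.py is not shown in this thread, so the function below is purely hypothetical; it only illustrates the kind of mismatch that can appear when a function gains a new parameter, such as a client object, and an older call site is not updated.

```python
def process_filing(client, file_path):
    """Hypothetical function that takes a client plus the file to process."""
    return f"{type(client).__name__}:{file_path}"


# An un-updated call such as process_filing("some/file.txt") would bind the
# string to `client` and leave `file_path` without a value. pylint flags that
# as E1120 (no-value-for-parameter), and Python raises TypeError at runtime.

# Passing both arguments satisfies the signature.
result = process_filing(object(), "some/file.txt")
```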

@mjbommar mjbommar merged commit 240729b into LexPredict:master Aug 3, 2018

1 check failed

continuous-integration/travis-ci/pr The Travis CI build failed