Partial CSV file with disabled CSV button when Hydrator is closed too soon #51

Open
rtrad89 opened this issue Jul 8, 2020 · 4 comments

Comments

@rtrad89

rtrad89 commented Jul 8, 2020

I was hydrating half a million tweets, and since it was my first time using Hydrator, I clicked the CSV button as soon as hydration finished. Once I saw the CSV file appear, I innocently closed Hydrator. However, the CSV file was incomplete: it contained around 300k tweets rather than 500k. When I tried to re-save the CSV file from the Datasets tab, I couldn't, because the button was disabled.

I tried to convert the huge JSONL to CSV but my machine couldn't handle it (16GB RAM), so as a quick workaround I re-hydrated all the tweets again and saved the CSV file properly this time.

Just reporting what's happened here because I feel it's tantamount to a bug. I am using Windows 10 by the way.

@rtrad89
Author

rtrad89 commented Oct 21, 2020

Reporting that the same problem happens when the hydrated JSONL file is renamed and the tool can't locate it. The CSV button never reactivates, even after the original name is restored.

@edsu
Member

edsu commented Oct 21, 2020

Yes, the conversion to CSV has to be tracked better internally in the app. For large datasets the conversion can take time and should be fault tolerant.
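One common way to make an export like this fault tolerant is to write to a temporary file and only move it to the final name once every row has been written. The sketch below illustrates that idea in Python; it is not how Hydrator works internally, and the function name `write_csv_atomically` and the `.part` suffix are assumptions made up for this example.

```python
import csv
import os


def write_csv_atomically(dest_path, header, rows):
    """Write rows to dest_path, exposing the file only once it is complete."""
    tmp_path = dest_path + ".part"  # hypothetical naming convention
    try:
        with open(tmp_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            for row in rows:
                writer.writerow(row)
        # The finished file is moved to its final name in a single step.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # If the export fails partway, clean up the temporary file and re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern, a process that is killed mid-export may leave a `.part` file behind, but the file at `dest_path` is never a truncated CSV: it either does not exist yet or is complete.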

@rtrad89
Author

rtrad89 commented Oct 21, 2020

Thank you. Is there any way to run the underlying converter?

@edsu
Member

edsu commented Oct 21, 2020

Alas, no. But the converter that's embedded in the Hydrator is a JavaScript version of a separate Python utility: https://github.com/DocNow/twarc/blob/main/utils/json2csv.py

If you want help installing and running it feel free to drop into our Slack channel. https://bit.ly/docnow-slack
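For a JSONL file too large to load into memory at once, the conversion can also be done one line at a time. The following is a minimal sketch of that approach, not the twarc `json2csv.py` utility itself. The column list and the helper names `tweet_to_row` and `jsonl_to_csv` are assumptions for illustration, and the field names (`id_str`, `created_at`, `user.screen_name`, `full_text`/`text`) assume tweets in the Twitter v1.1 JSON format.

```python
import csv
import json

# Hypothetical column selection; the real json2csv.py exports many more fields.
COLUMNS = ["id", "created_at", "screen_name", "text"]


def tweet_to_row(tweet):
    """Flatten one tweet dict into a list matching COLUMNS."""
    user = tweet.get("user") or {}
    # Extended-mode tweets carry "full_text"; older payloads carry "text".
    text = tweet.get("full_text") or tweet.get("text") or ""
    return [
        tweet.get("id_str", ""),
        tweet.get("created_at", ""),
        user.get("screen_name", ""),
        text,
    ]


def jsonl_to_csv(jsonl_path, csv_path):
    """Convert line by line so only one tweet is parsed in memory at a time."""
    count = 0
    with open(jsonl_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(COLUMNS)
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            writer.writerow(tweet_to_row(json.loads(line)))
            count += 1
    return count  # number of tweets written
```

Calling `jsonl_to_csv("tweets.jsonl", "tweets.csv")` (both file names are placeholders) returns the number of tweets written, which can be compared against the expected count to check that the export is complete.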
