Tweet IDs Read/Hydrated is greater than Total Tweet IDs and Hydrator keeps fetching tweets #61
Oh no, that is not good! 5% is very low. If you are on a Mac I would be interested to know if you could count how many lines are in your JSON file.
Were you hydrating two datasets at the same time? That is something I can test with too. Did you leave the computer on continuously, or did you close the lid of your computer? I'm just trying to understand what might have happened so I can try to test & fix it. I have used Hydrator with datasets of this size before. But it could be that there is some kind of network or authentication problem that it is not handling properly. |
I downloaded the same dataset and started hydrating the
Are you sure this is the dataset you are using? If you can share your tweet id files with me I can test with them. |
Hi Ed! I'm using this txt file of tweet ids: https://github.com/leslie-huang/congress_tweetdata_prelim/blob/master/tweet_ids_to_hydrate.txt |
The number of lines in the jsonl is 2003053 🤔 I started downloading the 115th dataset first, then started downloading the 116th dataset concurrently after 400k+ items in the 115th dataset were done. I haven't closed the lid on my computer and I set it to not sleep for about 4 hours while I was away from my desk. I didn't have to reauthenticate or anything like that at any point. Hope this info helps! I'm writing a quick script to check the ids from the jsonl against the tweet ids that I initially requested... will report results soon! Thanks for your help! |
One thing you can try doing is starting the hydrator again but from a terminal so you can see the log messages. If you are on a Mac I think you can open a terminal and then start the Hydrator like this:
It would be interesting to see if those lines in the file you counted have JSON on them or not. I would be especially interested in the end of the file. The tweet ID file you directed me to looks fine. I thought it was strange that there were old/short tweet ids in there. But I guess they pulled from users' timelines and some of them haven't used Twitter a lot! |
Also, twarc is always an option if Hydrator is giving you trouble. The only thing you will need to have are developer keys. But that is probably the hardest part. I'm happy to help you use twarc if Hydrator continues to be a problem. |
Hi Ed, I've finished scanning through the jsonl file and things are looking a little wonky.
I requested 1671651 tweets.
requested.difference(collected) = 483422 (items requested but not collected)
collected.difference(requested) = 0 (items collected but not requested)
tl;dr: About 1.2 million unique requested tweets were collected, with ~800k duplicates, and there are ~480k tweets that haven't been collected (or were requested but have since been deleted).
It doesn't look like I can restart this specific dataset in Hydrator again, but it's all good! I'll just generate a new list of the tweet ids that I don't already have and delete the duplicates from my json. I just got a developer account with Twitter, so I'll check out twarc if this doesn't work. Thanks a lot for your help! Let me know if I can provide any other info to help with debugging whatever the root problem is here.
There was an exception for just one tweet in the jsonl (it was somewhere in the middle):
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) The line it was trying to parse was
|
here's the tail of the .jsonl:
and yes I noticed the short tweet ids too...I haven't checked yet whether they were successfully fetched but I can follow up if that would be of interest to you. |
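For anyone who wants to run the same kind of check, here is a minimal Python sketch of the comparison described above. It assumes the id file has one tweet id per line and that each line of the jsonl is a tweet object carrying an `id_str` field. The function names are invented for illustration, and this is not the script that was actually used in this thread.

```python
import json
from collections import Counter


def read_requested_ids(lines):
    """Return the set of tweet ids from text lines (one id per line)."""
    return {line.strip() for line in lines if line.strip()}


def read_collected_ids(lines):
    """Count tweet ids in hydrated JSONL lines.

    Returns (Counter of id_str values, number of lines that could not be used).
    Assumes each good line is a JSON object with an "id_str" key.
    """
    counts = Counter()
    bad_lines = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            bad_lines += 1
            continue
        if not isinstance(tweet, dict) or "id_str" not in tweet:
            bad_lines += 1
            continue
        counts[tweet["id_str"]] += 1
    return counts, bad_lines


def compare(requested, counts):
    """Summarise how the collected ids relate to the requested ids."""
    collected = set(counts)
    return {
        "requested": len(requested),
        "unique_collected": len(collected),
        "duplicate_lines": sum(counts.values()) - len(collected),
        "requested_not_collected": len(requested - collected),
        "collected_not_requested": len(collected - requested),
    }
```

Both readers accept any iterable of lines, so open file objects can be passed directly. The set `requested - set(counts)` is also the list of ids that would still need hydrating.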
@leslie-huang interesting! So am I reading your note correctly that the same ids were requested over and over? I didn't see anything wrong with the line of JSON you pasted. But it was printed out as a Python object ("false" was False, null was None, etc). I was able to hydrate both datasets concurrently. But it feels like perhaps your Hydrator ran into a network or authentication problem that wasn't handled properly. This might be related to work that needs to happen on #57 |
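As a small illustration of the difference between a line of JSON and the same tweet printed as a Python object, here is a sketch in Python. The sample tweet and the helper name `is_valid_json` are made up for the example and are not taken from the data in this issue.

```python
import ast
import json

# A made-up, heavily trimmed tweet as it would appear on one line of a jsonl file.
tweet_json = '{"id_str": "1", "truncated": false, "geo": null}'

# Parsing the JSON text gives a Python dict.
tweet = json.loads(tweet_json)

# Printing that dict shows Python syntax: single quotes, False, None.
python_repr = repr(tweet)


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

Here `is_valid_json(tweet_json)` is true while `is_valid_json(python_repr)` is false, so a pasted Python object cannot be used to judge whether the original line in the file was valid JSON. If only the printed form is available, `ast.literal_eval(python_repr)` turns it back into a dict.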
Yes, when I looked at the unique tweet ids in the jsonl file of ~2 million tweets, there were only ~1.2 million unique tweet ids, so about 800k duplicates. I didn't look closely at whether it was 800k copies of one tweet (for example) or a different breakdown, but since then I've hydrated a few more lists of tweets without any issues! This tool has saved me a lot of time in putting together a dataset, thank you for maintaining it! |
Thanks for noticing the repeated tweet identifiers in the jsonl. It will help me diagnose what might be going on here. I'm glad to hear that it is working again! |
Hi, I don't see any solution for the issue. I counted the lines, which are more than the actual number of tweet IDs. Regards, |
Yes this issue is still open. I think that the bug is related to some kind of unhandled network error, or perhaps an API error during hydration. If you can easily share your ids and your jsonl file with me at ehs@pobox.com it might help me diagnose what is going on. |
Okay, thank you. Yes, I can share the ids; actually this error is happening for only a few files. One more point: sometimes the green colour overflows the progress bar, while sometimes it stays in the middle of the progress bar even though it has already crawled all the tweets. |
Thanks. That's good to know it is working sometimes. So does the same tweet ID file repeatedly cause a problem? If so that would be very helpful for me to test with. |
I sent the file; it has around 30 million tweet IDs. Please update if you find the bug. Yes, the same file. I tried 10-12 big files. Only 2 had the issue, and I shared one of them with you. |
Thanks! So just to be clear: you have attempted to hydrate this file more than once and it has created the same problem? |
Welcome :) Yes, I tried 2 times. |
Experimenting with a small set of ids while flipping my wifi connection on and off resulted in the progress bar overflowing (see Short Test 4 in the screenshot below). But this didn't happen reliably: sometimes the hydration finished ok. So this error clearly seems related to a timing issue. I'm guessing it is Promise related, down in utils.twitter. I noticed this error on my console (I was running in development mode).
When I took a look at the hydrated data for Short Test 4 I could see that 157 of the tweet ids were fetched twice. I think there must be some kind of error condition where multiple asynchronous requests are being made for the same set of tweet ids. This results in the fetched tweet ids overrunning the total number of tweet ids in the dataset being hydrated. I'll keep investigating but wanted to drop some notes in here so I remembered them. |
Thanks a lot for your effort. Do you still need the JSON file from my side? |
No, I don't think I need the jsonl file @Gautamshahi. If you are on a Unix system and want to see if your jsonl contains duplicates you can do this (assuming you have jq installed).
If you see lines at the end that start with a number other than 1 that means you got duplicates too. |
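If jq isn't available, the same kind of duplicate check can be sketched in Python. This is only an illustration, not the command referred to above: it assumes every tweet line in the jsonl is a JSON object with an `id_str` field, and `duplicate_ids` is a made-up helper name.

```python
import json
from collections import Counter


def duplicate_ids(lines):
    """Return [(tweet_id, count), ...] for ids seen more than once, most frequent first.

    Lines that are blank, are not valid JSON, or lack an "id_str" key are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(tweet, dict) and "id_str" in tweet:
            counts[tweet["id_str"]] += 1
    return [(tweet_id, n) for tweet_id, n in counts.most_common() if n > 1]
```

The function takes any iterable of lines, so an open jsonl file can be passed straight in. An empty list means no tweet id appeared more than once.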
I can replicate it reliably now by:
This makes me happy, because now it's possible to fix it! |
Hi, Did you find any solution for it? Any lead will be a great help. |
Hi, I'm trying to hydrate a subset of the 115th Congress tweets from https://catalog.docnow.io/datasets/20190222-115th-us-congress-tweet-ids/ (I already have a partly overlapping dataset).
Hydrator is still fetching tweets even after Total Tweet Ids Read has exceeded total tweet IDs. The "Stop" button has been replaced with the "CSV" button (which makes it seem like it's done?) but the number of tweets read keeps going up.
And the dataset looks like this:
(5% was roughly the percentage throughout hydration but doesn't make sense with the other numbers)
Should I hydrate this file again? Or use twarc? Thanks for any advice!