unable to fetch more than 100 tweets per run #12
Unfortunately Twitter's Search API only returns results for the past week or so, so it is quite likely that a query for tweets that previously returned results no longer does. What is the query?
@edsu: #eLezioni, #elezioni, elezioni, eLezioni, hashtags used yesterday for an event.
@edsu not sure if it's related to any caching. I changed the API account to avoid any backend caching, and I also cleaned paths, logs, and files... it looks like it remembers something and doesn't start in a clean state. I tested a different hashtag from today.
If twarc is able to find a data file that matches the query it is running, it will look at the first tweet in the file to determine the minimum id to use, which prevents archiving the same tweets over and over. I will make sure it's still working as expected and get back to you.
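The resume behavior described above can be sketched roughly as follows. This is an illustration only, not twarc's actual implementation: it reads the first tweet of a line-oriented JSON data file and returns its id, which a subsequent run could pass as `since_id` so the same tweets aren't fetched again.

```python
import json

def minimum_id(path):
    """Return the id of the first (newest) tweet in an existing
    line-oriented JSON data file, or None if there is no usable file.
    Sketch of the resume logic described in the thread; the real
    twarc code may differ."""
    try:
        with open(path) as f:
            first = f.readline()
            if first.strip():
                return json.loads(first)["id"]
    except FileNotFoundError:
        pass
    return None
```

If the file exists, the returned id becomes the lower bound for the next search; deleting the file removes that bound, but (as noted above) the Search API still only reaches back about a week.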
Thanks, that's what I remember about it, but twice in a row I tried to scrape other hashtags after deleting the data file and logs. I assumed that after a clean restart it would scrape everything again, but it didn't, as if it were getting its sync point some other way.
Deleting the file and running again does not fetch all the tweets again, because results are only available through the Search API for about 1 week. I will still take a look to make sure twarc is working properly, though.
OK, that's why, when you asked, I ran some tests with the current release over the last 24 hours. That's why I find it strange that, repeatedly, I don't get all tweets from the last 3 hours even after deleting the data and log files.
@edsu question: am I wrong, or was the previous release able to wait and run the query again after a few minutes to collect any new tweets without relaunching twarc.py?
No, you are right. There appears to be a bug in the latest release. Thanks for reporting this. Hopefully I'll have a fix for you shortly. |
It's very strange: I can do a search, determine the max_id based on the last tweet, then do another search using that max_id and get only 1 result. But if I do the same thing while waiting 20 seconds between API requests, I get 100 results. It seems related to this Twitter API issue, which hopefully will get some attention. I'm afraid it could be an intentional change to confuse tools like twarc that walk backwards through search result sets.
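Walking backwards through search results works roughly like this sketch, where `fetch` stands in for a Search API call returning a page of tweets no newer than `max_id` (names here are illustrative, not twarc's actual API):

```python
def search_pages(fetch, query):
    """Page backwards through search results. `fetch(query, max_id)`
    is a hypothetical API wrapper returning a list of {"id": ...}
    dicts; each pass lowers max_id to just below the oldest tweet
    seen so far."""
    max_id = None
    while True:
        tweets = fetch(query, max_id)
        if not tweets:
            break
        for tweet in tweets:
            yield tweet
        # ask for everything strictly older than this page
        max_id = min(t["id"] for t in tweets) - 1
```

The bug described above is that, without a pause between requests, a page that should hold up to 100 tweets comes back with just a single one.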
For now, until the issue is resolved or clarified, I think the only thing twarc can do is sleep between requests when the response from Twitter is only one tweet with an id that's the same as the max_id we used. It could start by waiting 10 seconds and then double that until it gets more results. This is not good because it slows down archiving substantially, but at least it would still work.
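The proposed workaround, sleeping and doubling the wait whenever the API echoes back only the max_id tweet, could look something like this (a sketch, with `fetch` again standing in for a hypothetical API wrapper):

```python
import time

def fetch_with_backoff(fetch, query, max_id, initial=10, cap=320):
    """Retry a search page with exponential backoff when the API
    returns only a single tweet whose id equals the max_id we asked
    for, as proposed in the comment above. Illustration only."""
    wait = initial
    while True:
        tweets = fetch(query, max_id)
        stuck = len(tweets) == 1 and tweets[0]["id"] == max_id
        if not stuck or wait > cap:
            return tweets
        time.sleep(wait)
        wait *= 2
```

A cap on the wait keeps a permanently stuck query from sleeping forever; the cost is, as noted, much slower archiving while the Twitter-side issue persists.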
I see. Anyway, I'm thinking it's not really a bug but an anti-scraping attempt, or an attempt to limit scraping, by Twitter. Don't you think they are introducing an additional rate limit that isn't really related to the API? (last "errors" filed):
@edsu something else is going wrong. I checked with a clean VM on #InternetSlowdown and it doesn't get ANY results. With and without --scrape I get this:
Could it be because the API status of User Streams has been under Total Disruption for 2 weeks?
There is a problem with the Twitter API that they have acknowledged; please see this issue. In theory twarc could try to hack around it, but I'm hopeful that it will get resolved.
I think this problem should be fixed now that https://twittercommunity.com/t/0-status-after-a-next-result-request/22368/5 has been closed. My test seems to be working at least! |
It stops after a limited number of results, no matter whether it runs with or without --scrape.
The debug logs do not contain any errors. twarc.py exits saying something like this:
But compared with previous results it misses a lot of results.
How can I debug this more deeply?