
unable to fetch more than 100 tweets per run #12

Closed

remagio opened this issue Sep 5, 2014 · 15 comments

remagio commented Sep 5, 2014

It stops after a limited number of results, no matter whether I run it with or without --scrape.
The debug logs don't contain any errors; twarc.py exits saying something like this:

2014-09-05 11:44:09,695 INFO no new tweets with id < 505020989466771457

But compared with previous runs it misses a lot of results.
How can I debug this more deeply?

@remagio remagio changed the title Repeating recente query with current release did not complete extractions Repeating recent queries with current release did not complete extractions Sep 5, 2014

edsu commented Sep 5, 2014

Unfortunately Twitter's Search API only returns results for the past week or so, so it is quite likely that a query for tweets that previously returned results no longer does.

What is the query?


remagio commented Sep 5, 2014

@edsu: #eLezioni, #elezioni, elezioni, eLezioni -> hashtags used yesterday for an event.
The queries with '#' end very quickly; without '#' they take longer, but they still produce different results even when executed at nearly the same time.


remagio commented Sep 5, 2014

@edsu I'm not sure if it's related to some caching mechanism. I switched to a different API account to rule out any backend caching, and also cleaned paths, logs, and files, but it still looks like it remembers something and doesn't start cleanly. I tested a different hashtag from today.
Did you change anything about how it manages the starting point (time/ID) since past releases?


edsu commented Sep 6, 2014

If twarc is able to find a data file that matches the query it is doing, it will look at the first tweet in the file to determine what to use for the minimum id, which prevents archiving the same tweets over and over. I will make sure it's still working as expected and get back to you.
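The resume logic described above can be sketched roughly like this. This is a minimal illustration, not twarc's actual code; the file layout (one JSON tweet per line, newest first) and the function name are assumptions:

```python
import json

def resume_since_id(path):
    # Look at the first (newest) tweet in an existing archive file and
    # return its id; a new run can then ask Twitter only for tweets with
    # id > since_id, so the same tweets aren't archived over and over.
    try:
        with open(path) as f:
            line = f.readline()
    except FileNotFoundError:
        return None
    return json.loads(line)["id"] if line.strip() else None
```

A run whose query has no matching data file would get `None` back and start from scratch.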


remagio commented Sep 6, 2014

Thanks, that's what I remember about it, but twice in a row I tried scraping other hashtags after deleting the data file and logs. I assumed that, restarting clean, it would scrape everything again; it didn't. It's as if it were getting its sync point some other way as well.


edsu commented Sep 6, 2014

Deleting the file and running again does not fetch all the tweets again, because results are only available through the search API for one week. I will still take a look to make sure twarc is working properly, though.


remagio commented Sep 6, 2014

OK, that's why, when you asked, I ran some tests with the current release over the last 24 hours. That's also why I find it strange that repeated runs don't get all the tweets from the last 3 hours, even after deleting the data and log files.


remagio commented Sep 6, 2014

@edsu A question: am I wrong, or was the previous release able to wait and run the query again after a few minutes to collect any new tweets, without relaunching twarc.py?


edsu commented Sep 6, 2014

No, you are right. There appears to be a bug in the latest release. Thanks for reporting this. Hopefully I'll have a fix for you shortly.


edsu commented Sep 7, 2014

It's very strange: I can do a search, determine the max_id based on the last tweet, then do another search using that max_id and only get 1 result. But if I do the same thing while waiting 20 seconds between API requests, I get 100 results.

It seems related to this Twitter API issue, which hopefully will get some attention. I'm afraid it could be an intentional change to confuse tools like twarc that walk backwards through search result sets.
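The backwards walk described here (take the oldest id from the last page, subtract one, and use that as the next max_id) looks roughly like this. This is a sketch, not twarc's actual code; `fetch_page` is a hypothetical stand-in for one call to the Search API:

```python
def walk_search(fetch_page, max_id=None):
    # Walk a search result set backwards in time.  fetch_page(max_id)
    # stands in for one search/tweets.json request and returns a list
    # of tweet dicts, newest first; an empty list means we're done.
    while True:
        tweets = fetch_page(max_id)
        if not tweets:
            return
        for tweet in tweets:
            yield tweet
        # Next page: everything strictly older than the oldest tweet
        # seen so far (Twitter's max_id parameter is inclusive).
        max_id = tweets[-1]["id"] - 1
```

The bug described in this comment is exactly the failure mode of this loop: when the API returns a single stale tweet for the new max_id, the walk stalls instead of reaching the end of the result set.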

@edsu edsu changed the title Repeating recent queries with current release did not complete extractions unable to fetch more than 100 tweets per run Sep 7, 2014

edsu commented Sep 7, 2014

For now, until the issue is resolved or clarified, I think the only thing twarc can do is sleep between requests when the response from Twitter is only one tweet with an id that's the same as the max_id we used. It could start by waiting 10 seconds and then double that until it gets more results. This is not good because it slows down archiving substantially, but at least it would still work.


remagio commented Sep 10, 2014

I see. Anyway, I'm starting to think it's not really a bug but an anti-scraping measure, or an attempt by Twitter to limit scraping. Don't you think they are introducing an additional rate limit that isn't really related to the documented API limits? (Last errors logged):

2014-09-10 11:50:07,547 DEBUG rate limit remaining 173
2014-09-10 11:50:07,579 ERROR got error when fetching https://api.twitter.com/1.1/statuses/show.json?id=509670029533913089 sleeping 6 secs: {'x-rate-limit-remaining': '0', 'status': '429', 'content-length': '56', 'set-cookie': 'guest_id=v1%3A141034980755218461; Domain=.twitter.com; Path=/; Expires=Fri, 09-Sep-2016 11:50:07 UTC', 'strict-transport-security': 'max-age=631138519', 'server': 'tfe_b', '-content-encoding': 'gzip', 'x-rate-limit-reset': '1410350021', 'date': 'Wed, 10 Sep 2014 11:50:07 UTC', 'x-rate-limit-limit': '180', 'content-type': 'application/json;charset=utf-8'} - {"errors":[{"message":"Rate limit exceeded","code":88}]}
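The HTTP 429 with error code 88 in the log above is an ordinary rate-limit response (the x-rate-limit-limit header reports a window of 180 requests), and x-rate-limit-reset gives the Unix epoch time when the window ends. A minimal sketch of how a client could compute how long to sleep before retrying; the function name is an assumption, not twarc's API:

```python
import time

def seconds_until_reset(headers, now=None):
    # x-rate-limit-reset is a Unix epoch timestamp for the end of the
    # current rate-limit window; sleep at least 1 second regardless.
    now = time.time() if now is None else now
    reset = int(headers.get("x-rate-limit-reset", 0))
    return max(reset - now, 1)
```

For the log entry above (reset at 1410350021, request at 11:50:07 UTC) this would come out to roughly the 6-second sleep twarc reports.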


remagio commented Sep 11, 2014

@edsu Something else is going wrong. I checked #InternetSlowdown from a clean VM and it doesn't get ANY results. With and without --scrape I get this:

2014-09-11 22:20:23,639 INFO writing tweets to #internetslowdon-20140911222023.json
2014-09-11 22:20:23,639 INFO starting search for #internetslowdon with since_id=None and max_id=None
2014-09-11 22:20:23,640 DEBUG checking for rate limit info
2014-09-11 22:20:23,816 DEBUG new rate limit remaining=178 and reset=1410474849
2014-09-11 22:20:23,817 DEBUG fetching https://api.twitter.com/1.1/search/tweets.json?count=100&q=%23internetslowdon
2014-09-11 22:20:24,818 DEBUG rate limit remaining 178
2014-09-11 22:20:24,878 INFO no new tweets with id < None
2014-09-11 22:20:24,879 DEBUG checking for rate limit info
2014-09-11 22:20:25,032 DEBUG new rate limit remaining=177 and reset=1410474849
2014-09-11 22:20:25,032 INFO scraping tweets with id < None
2014-09-11 22:20:25,032 INFO scraping https://twitter.com/i/search/timeline??last_note_ts=0&f=realtime&include_available_features=1&oldest_unread_id=0&q=%23internetslowdon&include_entities=1
2014-09-11 22:20:25,040 INFO Starting new HTTPS connection (1): twitter.com
2014-09-11 22:20:25,207 DEBUG "GET /i/search/timeline?last_note_ts=0&f=realtime&include_available_features=1&oldest_unread_id=0&q=%23internetslowdon&include_entities=1 HTTP/1.1" 200 144

Could it be because the API status page has shown Userstream under "Total Disruption" for 2 weeks?


edsu commented Sep 18, 2014

There is a problem with the Twitter API that they have acknowledged. Please see this issue. In theory twarc could try to hack around this, but I'm hopeful that it will get resolved.


edsu commented Oct 4, 2014

I think this problem should be fixed now that https://twittercommunity.com/t/0-status-after-a-next-result-request/22368/5 has been closed. My test seems to be working, at least!

@edsu edsu closed this as completed Oct 4, 2014