
unable to fetch more than 100 tweets per run #12

Closed

remagio opened this issue Sep 5, 2014 · 15 comments

remagio commented Sep 5, 2014

It stops after a limited number of results, no matter whether I run it with or without --scrape.
The debug logs don't contain any errors; twarc.py exits saying something like this:

2014-09-05 11:44:09,695 INFO no new tweets with id < 505020989466771457

But compared with previous runs it misses a lot of results.
How can I debug this more deeply?

@remagio remagio changed the title Repeating recente query with current release did not complete extractions Repeating recent queries with current release did not complete extractions Sep 5, 2014

edsu commented Sep 5, 2014

Unfortunately Twitter's Search API only returns results for the past week or so, so it is quite likely that a query for tweets that previously returned results no longer does.

What is the query?


remagio commented Sep 5, 2014

@edsu: #eLezioni, #elezioni, elezioni, eLezioni -> hashtags used yesterday for an event.
The queries with '#' end very quickly; without '#' they take longer, but they still produce different results even when executed at nearly the same time.


remagio commented Sep 5, 2014

@edsu I'm not sure if it's related to some caching mechanism. I switched to a different API account to rule out any backend caching, and also cleaned paths, logs, and files, but it still looks like it remembers something and doesn't start cleanly. I tested a different hashtag from today.
Did you change anything about how it manages the starting point (time/ID) since past releases?


edsu commented Sep 6, 2014

If twarc is able to find a data file that matches the query it is doing, it will look at the first tweet in the file to determine what to use for the minimum id, which prevents archiving the same tweets over and over. I will make sure it's still working as expected and get back to you.
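The resume logic described above can be sketched roughly like this. This is a minimal illustration, not twarc's actual code; the file layout (one JSON tweet per line, newest first) and the function name are assumptions:

```python
import json

def resume_since_id(path):
    # Look at the first (newest) tweet in an existing archive file and
    # return its id; a new run can then ask Twitter only for tweets with
    # id > since_id, so the same tweets aren't archived over and over.
    try:
        with open(path) as f:
            line = f.readline()
    except FileNotFoundError:
        return None
    return json.loads(line)["id"] if line.strip() else None
```

A run whose query has no matching data file would get `None` back and start from scratch.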


remagio commented Sep 6, 2014

Thanks, that's what I remember about it, but twice in a row I tried scraping other hashtags after deleting the data file and logs. I assumed that, restarting clean, it would scrape everything again; it didn't. It's as if it were getting its sync point some other way as well.


edsu commented Sep 6, 2014

Deleting the file and running again does not fetch all the tweets again, because results are only available through the search API for one week. I will still take a look to make sure twarc is working properly, though.


remagio commented Sep 6, 2014

OK, that's why, when you asked, I ran some tests with the current release over the last 24 hours. That's also why I find it strange that repeated runs don't get all the tweets from the last 3 hours, even after deleting the data and log files.


remagio commented Sep 6, 2014

@edsu A question: am I wrong, or was the previous release able to wait and run the query again after a few minutes to collect any new tweets, without relaunching twarc.py?


edsu commented Sep 6, 2014

No, you are right. There appears to be a bug in the latest release. Thanks for reporting this. Hopefully I'll have a fix for you shortly.


edsu commented Sep 7, 2014

It's very strange: I can do a search, determine the max_id based on the last tweet, then do another search using that max_id and only get 1 result. But if I do the same thing while waiting 20 seconds between API requests, I get 100 results.

It seems related to this Twitter API issue, which hopefully will get some attention. I'm afraid it could be an intentional change to confuse tools like twarc that walk backwards through search result sets.
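The backwards walk described here (take the oldest id from the last page, subtract one, and use that as the next max_id) looks roughly like this. This is a sketch, not twarc's actual code; `fetch_page` is a hypothetical stand-in for one call to the Search API:

```python
def walk_search(fetch_page, max_id=None):
    # Walk a search result set backwards in time.  fetch_page(max_id)
    # stands in for one search/tweets.json request and returns a list
    # of tweet dicts, newest first; an empty list means we're done.
    while True:
        tweets = fetch_page(max_id)
        if not tweets:
            return
        for tweet in tweets:
            yield tweet
        # Next page: everything strictly older than the oldest tweet
        # seen so far (Twitter's max_id parameter is inclusive).
        max_id = tweets[-1]["id"] - 1
```

The bug described in this comment is exactly the failure mode of this loop: when the API returns a single stale tweet for the new max_id, the walk stalls instead of reaching the end of the result set.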

@edsu edsu changed the title Repeating recent queries with current release did not complete extractions unable to fetch more than 100 tweets per run Sep 7, 2014

edsu commented Sep 7, 2014

For now, until the issue is resolved or clarified, I think the only thing twarc can do is sleep between requests when the response from Twitter is only one tweet with an id that's the same as the max_id we used. It could start by waiting 10 seconds and then double that until it gets more results. This is not good because it slows down archiving substantially, but at least it would still work.


remagio commented Sep 10, 2014

I see. Anyway, I'm starting to think it's not really a bug but an anti-scraping measure, or an attempt by Twitter to limit scraping. Don't you think they are introducing an additional rate limit that isn't really related to the documented API limits? (Last errors logged):

2014-09-10 11:50:07,547 DEBUG rate limit remaining 173
2014-09-10 11:50:07,579 ERROR got error when fetching https://api.twitter.com/1.1/statuses/show.json?id=509670029533913089 sleeping 6 secs: {'x-rate-limit-remaining': '0', 'status': '429', 'content-length': '56', 'set-cookie': 'guest_id=v1%3A141034980755218461; Domain=.twitter.com; Path=/; Expires=Fri, 09-Sep-2016 11:50:07 UTC', 'strict-transport-security': 'max-age=631138519', 'server': 'tfe_b', '-content-encoding': 'gzip', 'x-rate-limit-reset': '1410350021', 'date': 'Wed, 10 Sep 2014 11:50:07 UTC', 'x-rate-limit-limit': '180', 'content-type': 'application/json;charset=utf-8'} - {"errors":[{"message":"Rate limit exceeded","code":88}]}
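The HTTP 429 with error code 88 in the log above is an ordinary rate-limit response (the x-rate-limit-limit header reports a window of 180 requests), and x-rate-limit-reset gives the Unix epoch time when the window ends. A minimal sketch of how a client could compute how long to sleep before retrying; the function name is an assumption, not twarc's API:

```python
import time

def seconds_until_reset(headers, now=None):
    # x-rate-limit-reset is a Unix epoch timestamp for the end of the
    # current rate-limit window; sleep at least 1 second regardless.
    now = time.time() if now is None else now
    reset = int(headers.get("x-rate-limit-reset", 0))
    return max(reset - now, 1)
```

For the log entry above (reset at 1410350021, request at 11:50:07 UTC) this would come out to roughly the 6-second sleep twarc reports.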


remagio commented Sep 11, 2014

@edsu Something else is going wrong. I checked #InternetSlowdown from a clean VM and it doesn't get ANY results. With and without --scrape I get this:

2014-09-11 22:20:23,639 INFO writing tweets to #internetslowdon-20140911222023.json
2014-09-11 22:20:23,639 INFO starting search for #internetslowdon with since_id=None and max_id=None
2014-09-11 22:20:23,640 DEBUG checking for rate limit info
2014-09-11 22:20:23,816 DEBUG new rate limit remaining=178 and reset=1410474849
2014-09-11 22:20:23,817 DEBUG fetching https://api.twitter.com/1.1/search/tweets.json?count=100&q=%23internetslowdon
2014-09-11 22:20:24,818 DEBUG rate limit remaining 178
2014-09-11 22:20:24,878 INFO no new tweets with id < None
2014-09-11 22:20:24,879 DEBUG checking for rate limit info
2014-09-11 22:20:25,032 DEBUG new rate limit remaining=177 and reset=1410474849
2014-09-11 22:20:25,032 INFO scraping tweets with id < None
2014-09-11 22:20:25,032 INFO scraping https://twitter.com/i/search/timeline??last_note_ts=0&f=realtime&include_available_features=1&oldest_unread_id=0&q=%23internetslowdon&include_entities=1
2014-09-11 22:20:25,040 INFO Starting new HTTPS connection (1): twitter.com
2014-09-11 22:20:25,207 DEBUG "GET /i/search/timeline?last_note_ts=0&f=realtime&include_available_features=1&oldest_unread_id=0&q=%23internetslowdon&include_entities=1 HTTP/1.1" 200 144

Could it be because the API status page has shown Userstream under "Total Disruption" for 2 weeks?


edsu commented Sep 18, 2014

There is a problem with the Twitter API that they have acknowledged. Please see this issue. In theory twarc could try to hack around this, but I'm hopeful that it will get resolved.


edsu commented Oct 4, 2014

I think this problem should be fixed now that https://twittercommunity.com/t/0-status-after-a-next-result-request/22368/5 has been closed. My test seems to be working, at least!

@edsu edsu closed this as completed Oct 4, 2014