Download stops after a lot of tweets #3
What dates did you expect to see in Keep in mind that tweets in |
This is the date of the first tweet: 19/02/2018 0:59. This is the date of the last tweet: 18/02/2018 9:03. Normally it should have finished at 18/02/2018 1:00. |
I haven't downloaded all the tweets since there are too many, but the first row is |
So do you have any ideas about why I have this issue? |
Are you using the generic version of GetOldTweets3, or have you changed some code? I've checked with |
I have only modified the exporter to include the lang parameter. Other than that the code is the same as yours. About the UTC, you are right; I am not sure why, but the script saves the dates in UTC+1. I thought it was normal, since that is the timezone I am in. |
I just tried again with python Exporter.py --lang en --querysearch "bitcoin" --since 2018-02-20 --until 2018-02-21 and the same happened.
The first tweet is 2018-02-20 23:59:53,jmauli,,0,0,"The Bitcoin is dropping going to enjoy short this down to XXXXX",,,,966100124899278848,https://twitter.com/jmauli/status/966100124899278848
The last tweet is 2018-02-20 07:11:38,LibertarianBee,,0,0,"@CoinWarz is not taking in consideration the TX fees that the miners are also receiving. #BCH has less TX than #BTC #Bitcoin.",,@CoinWarz,#BCH #BTC #Bitcoin,965846390088785921,https://twitter.com/LibertarianBee/status/965846390088785921
It downloaded 48757 tweets. |
I've tried several times and reproduced your error. I will look into it more deeply later. |
Twitter gave me 49877 as the total number of tweets within this period. |
Ok, thanks a lot! It is a pity that you cannot submit requests by the hour, since that would solve the issue. Maybe we can set in the script that after an x number of tweets, the script has to sleep for 3 minutes so the requests look more natural. |
I tried it a couple of days earlier. Seems like they removed |
Yeah, I tried yesterday as well, and it doesn't work. |
Btw, I've added |
Thanks a lot that is pretty good! Send me a message if you find the solution to the issue. I will try to test some options as well. |
Hello, I am leaving you a list of queries that stop during the download, consistently at around the same number of downloads. python Exporter.py --lang en --querysearch "bitcoin" --since 2018-02-18 --until 2018-02-19 It is very weird, because there are some other days, with more tweets, that have no problem. Some of these queries stop after 2 or 4 thousand tweets (not a big number). Just in case this helps to see and solve the issue. |
Hello! To add to the list of queries with issues: python Exporter.py --lang en --querysearch "bitcoin" --since 2018-03-02 --until 2018-03-03 | Number of tweets downloaded: 1186. The first query is relatively small, so I ran the Debug option; here is the log. I think this has something to do with the message of the tweet, or something that the script downloads, that makes it stop. The download fails consistently with these queries at the same point, while running other queries for full days (55000 messages) there is no issue. So I do not think Twitter is blocking the request; I think the program reads or scrapes something that makes it think it is finished with the query. |
Ok, thanks. The problem is this: each response has the
Feel free to make a pull request if you fix this issue. Thanks! |
I am going to try! However, it seems complicated (I only started learning Python 4 months ago...) |
Hi Jaime.
If I summarize the logic, this is what it would look like: i) If has_more_items is false, check if the date of the last tweet received is the same as the Since date. @JaimeBadiola, you already know my .NET code, which has all of these condition checks. My old Python version of the code is attached here. I hope this helps. |
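That first check can be sketched in Python. The function and field names below are illustrative stand-ins, not GetOldTweets3's actual internals:

```python
from datetime import date

def should_stop(has_more_items, last_tweet_date, since_date):
    """Condition (i): only trust a has_more_items=False reply when
    the download has actually reached the --since date."""
    if has_more_items:
        return False  # more pages to fetch, keep going
    # If the last tweet is still after --since, the stop is premature
    # and the request should be retried instead of ending the loop.
    return last_tweet_date <= since_date

# Genuine end of the range: stopping is correct.
assert should_stop(False, date(2018, 2, 18), date(2018, 2, 18))
# Premature stop mid-range (the symptom in this issue): keep retrying.
assert not should_stop(False, date(2018, 2, 20), date(2018, 2, 18))
```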
Are you guys sure Twitter doesn't block IPs? On my remote machine, after a certain point (maybe 1 million tweets in 1000 requests) all my responses come back as zero length (empty). But I can still query from my laptop. I'll be honest: I didn't fully understand the discussion on min_position, but I don't see how this could be the source of my problem. |
Please help me understand: what do you mean by 1 million tweets in 1000 requests? Are you talking about the Twitter API? Because there is no concept of requests in scraping. Also, while scraping, every Twitter URL call for a JSON download returns 20 tweets max. If you are using the APIs, then min_position does not apply to you.

About Twitter policies, check this out. There is a gray area when it comes to distinguishing between scraping and crawling; although both might look the same, they are different. But it depends on how Twitter defines it. In the TOS page there is nothing related to blocking of IP addresses. Second, blocking an IP means detecting the IP address, which is against Twitter's privacy policy. Third, IP blocking will not work if you are behind DNS where the IP is refreshed periodically, or on a public network, so IP blocking is basically not a good solution, and companies know it. When I was downloading tweets as part of my free assistance on orgneat.com, I never had an issue with my scraper being blocked.

To understand this, first we need to understand how the Twitter scraper program works. The program mimics a browser, simply scrolling down the web page in order to get the statuses (tweets). If Twitter blocked the program, Twitter would have to block all requests coming from your network/system, which basically means that if you opened Twitter.com you would not be able to see anything. While I was working on scraping requests from all around the world, a little research on how it works helped me a lot.

It makes sense that if you start downloading millions of tweets, there might be some issues, depending on various factors like Internet connection, glitches, Twitter's handling of requests, etc. Please note that a scraper is an extremely fast human scrolling down a page constantly, possibly every second. I do face the same issues now and then, so I have made conditional checks, following the logic I explained above.
I did this only because I was trying to assist many people with a free service, and wanted to provide a seamless request/delivery experience with the scraper running day and night unattended. |
Thanks for addressing the questions. I found that the --username query can get full data (e.g. --since 2018-01-01 --until 2018-12-22), but --querysearch (keyword: China tariff) only got as far as 9/13/2018. In that request, I downloaded 142,396 tweets. I did try multiple combinations but still was unable to reach beyond 9/13/2018. Is that memory related? Or IP address related? Manually scrolling the page does allow me to reach further. Any suggestions will be greatly appreciated!! |
@kho7, the issue is with |
Thanks for your reply; I learnt a great deal. I use the command line method: GetOldTweets3 --querysearch "China tariff" --since 2018-01-01 --until 2018-9-13 --output "tradewar02g.csv" Shall I modify TweetManager.py to change min_position? Thanks again, big time. |
Are you sure you understand what Rahul has written about min_position? |
I am testing this query, which only recovers 24 tweets: "python Exporter.py --lang en --querysearch "bitcoin" --since 2017-08-13 --until 2017-08-14" And the issue seems to be that the while loop stops here (line 67 of TweetManager.py): if len(json['items_html'].strip()) == 0: I tried to get the JSON response 10 times before breaking the while loop, but Twitter doesn't answer accordingly. Any ideas? |
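A retry wrapper along those lines might look like this; fetch_json is a hypothetical stand-in for whatever performs the request around line 67, not the library's actual function:

```python
import time

def get_items_html_with_retry(fetch_json, max_retries=10, delay=1.0):
    """Instead of breaking the while loop on the first empty
    items_html, retry a few times with a growing pause."""
    for attempt in range(max_retries):
        response = fetch_json()
        html = response.get('items_html', '').strip()
        if html:
            return html
        time.sleep(delay * attempt)  # 0s, 1s, 2s, ... between retries
    return ''  # still empty after max_retries attempts

# Fake fetcher: Twitter answers empty twice, then returns items.
replies = iter([{'items_html': ''}, {'items_html': '  '},
                {'items_html': '<li>tweet</li>'}])
print(get_items_html_with_retry(lambda: next(replies), delay=0))  # prints <li>tweet</li>
```

As this thread shows, retries alone do not always help; if the cursor itself is corrupted, Twitter keeps answering empty no matter how often you ask.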
One part I am trying to understand and apply is iv) if min_position starts with cm+, set min_position = "TWEET-" + tweet ID of the last tweet in the result + "-" + tweet ID of the first tweet in the result. Thanks again. |
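As a sketch, rule (iv) could look like the function below. The "cm+" and "TWEET-..." cursor formats are observations reported in this thread, not documented Twitter behavior:

```python
def repair_min_position(min_position, first_tweet_id, last_tweet_id):
    """Rule (iv): when the cursor comes back in the 'cm+...' form,
    rebuild it from the tweet IDs of the current result page."""
    if min_position.startswith('cm+'):
        # Last tweet ID first, then first tweet ID, per the summary.
        return 'TWEET-{}-{}'.format(last_tweet_id, first_tweet_id)
    return min_position  # cursor looks fine, keep it

# Healthy cursors pass through untouched.
assert repair_min_position('TWEET-111-222', '222', '111') == 'TWEET-111-222'
# A corrupted 'cm+' cursor is rebuilt from the page's tweet IDs.
assert repair_min_position('cm+abc', '966100124899278848',
                           '965846390088785921') == \
    'TWEET-965846390088785921-966100124899278848'
```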
What is the other Python script, @giulionf? |
Basically, a test script to check if it was working or not... I just set, manually, the parameters my other script was fetching. On my remote, it's working as well! Really strange...
|
I seemingly triggered this error querying one tweet at a time, <100 times a day, over the course of 2 weeks. From what I gather, that's much less volume, more spread out, than what others have reported here. I'm not grasping most of what is posted here. Will conducting test queries exacerbate the problem? Will switching networks or using a VPN help? |
I have a similar issue. For multiple queries the download stops at a certain number without any errors. Sometimes the number varies, but it always stops before reaching the until date. Thanks in advance! |
No, I wasn't able to correct the bug.
…On Tue, 3 Sep 2019 at 14:43, gghidiu wrote:
@JaimeBadiola have you managed to correct this bug? I am new to Python, so it would be very helpful if you could post the modified code here. |
@JaimeBadiola , have you found a working alternative then? |
What I did was to download all the data day by day, and if one day there was an error I would mark that day as missing data. In total I downloaded about 500 days, and a bit more than 20 were missing.
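That day-by-day approach can be sketched as follows; download_one_day is a placeholder for whatever invokes the exporter for a single day, not part of GetOldTweets3 itself:

```python
from datetime import date, timedelta

def download_by_day(start, end, download_one_day):
    """Walk [start, end) one day at a time. If a day's download
    fails (raises, or reports failure), mark it as missing data
    and move on instead of aborting the whole run."""
    missing = []
    day = start
    while day < end:
        try:
            ok = download_one_day(day, day + timedelta(days=1))
        except Exception:
            ok = False
        if not ok:
            missing.append(day)
        day += timedelta(days=1)
    return missing

# Fake downloader that fails on one known-bad day.
bad_day = date(2018, 2, 18)
missing = download_by_day(date(2018, 2, 17), date(2018, 2, 20),
                          lambda since, until: since != bad_day)
assert missing == [bad_day]  # only the bad day is marked missing
```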
|
Hey, I was able to resolve this problem when writing a crawler (in Scala) for work (so I can't just put it on GitHub). The gist of the problem is that sometimes the scroll cursor, used to compute the index into the stream of items returned by a query, becomes corrupted. I don't think this is volume dependent (i.e. this is not some kind of rate limiting mechanism). I was able to resolve it by saving the previous search cursor after each query and going back to using it if I suspect the search cursor I'm currently using is corrupted. Functionally, I implemented this with a pseudo-BFS where I kept the previous cursor in the explore queue until its child cursor executes a search with no errors. I've been planning to make a PR to this repo porting my solution, but I've just been busy. Let me know if you guys want it and I'll make it a priority. |
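The original was Scala and never published, so the following Python sketch is only a reconstruction of the description above: keep the last good cursor and fall back to it when the current one seems corrupted. search is a hypothetical callable, and a RuntimeError models a corrupted cursor:

```python
def crawl_with_fallback(search, first_cursor, max_fallbacks=3):
    """search(cursor) returns (tweets, next_cursor); next_cursor=None
    means the end. On failure, go back to the previously saved cursor
    and retry instead of ending the crawl."""
    results = []
    prev_cursor, cursor, fallbacks = None, first_cursor, 0
    while cursor is not None:
        try:
            tweets, next_cursor = search(cursor)
        except RuntimeError:
            if prev_cursor is None or fallbacks >= max_fallbacks:
                break  # nothing left to fall back to
            fallbacks += 1
            cursor = prev_cursor  # retry from the last good cursor
            continue
        # Re-fetched pages can repeat tweets, so de-duplicate.
        results.extend(t for t in tweets if t not in results)
        prev_cursor, cursor = cursor, next_cursor
    return results

# Fake search: cursor 'B' is corrupted on the first attempt only.
state = {'b_calls': 0}
def fake_search(cursor):
    if cursor == 'A':
        return ['t1'], 'B'
    state['b_calls'] += 1
    if state['b_calls'] == 1:
        raise RuntimeError('corrupted cursor')
    return ['t2'], None

assert crawl_with_fallback(fake_search, 'A') == ['t1', 't2']
```

The max_fallbacks cap matters: if the same cursor keeps producing a corrupted child, retrying forever would loop indefinitely.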
There is a workaround to download the tweets from a specific time. It partially solves the problem, since you can just continue downloading from where the program stopped. The idea is to convert the
For example, if the download stopped at 2016-08-24 19:38:13,
in my case the final query will look something like this:
since we have the max_id parameter, the --until becomes redundant. |
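One way to build such a max_id is from the timestamp embedded in Twitter's snowflake tweet IDs. The epoch constant below is the publicly known snowflake epoch; that a max_id: clause in the search query is what this workaround used is an assumption based on the description above:

```python
from datetime import datetime, timezone

# Publicly known epoch (in ms) used by Twitter's snowflake tweet IDs.
TWITTER_EPOCH_MS = 1288834974657

def max_id_for(dt):
    """Build a snowflake-style ID whose embedded timestamp is dt,
    suitable for a 'max_id:<id>' clause in the search query."""
    ms = int(dt.timestamp() * 1000)
    return (ms - TWITTER_EPOCH_MS) << 22

def id_timestamp(tweet_id):
    """Recover the UTC timestamp embedded in a snowflake ID."""
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# If the download stopped at 2016-08-24 19:38:13 (UTC assumed here),
# resume just below that instant:
stop = datetime(2016, 8, 24, 19, 38, 13, tzinfo=timezone.utc)
query = 'bitcoin max_id:{}'.format(max_id_for(stop))
assert id_timestamp(max_id_for(stop)) == stop  # round-trips exactly
```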
@aduriseti, it would be great if we could get the working version. |
|
Did you try running the same script multiple times in hopes of getting more tweets? I noticed I was getting fewer and fewer tweets loaded when I did this... I checked my task manager and the CPU was through the roof. There were also 20+ Python processes listed... If anyone has Alteryx, there is a Twitter app you can use to pull the data. |
I'm trying to make this search: But the result does not include all tweets. I'm getting just some of them, and it stops at some date around 2018-12-20. Can somebody help me? |
My workaround was to jump the days that I had issues with. So jump 2018-12-20 and keep downloading after that.
…On Fri, 15 Nov 2019 at 19:07, Rodrigo Borges Machado wrote:
I'm trying to make this search:
tweetCriteria = got.manager.TweetCriteria().setQuerySearch('CVE').setSince("2015-01-01").setUntil("2019-11-15")
|
That was my first idea; I will try it that way then... |
Could you please do this? |
I have not used Scala but I will be happy to try it. Thanks.
|
Hi guys, I've tried this and I only get 1 username/tweet back. How do I fix this? Do I have to add the max id? |
Hi, you set tweet equal to the first row when you selected the row with index 0 (the part that looks like [0]). Delete the [0] and you should be fine. |
On Jun 5, 2020, at 5:50 AM, joshkwannacode wrote:
import GetOldTweets3 as got
tweetCriteria = got.manager.TweetCriteria().setQuerySearch('#detroitrapper')\
.setUntil("2020-05-01")\
.setNear('Detroit,Michigan')\
.setSince("2020-04-03")\
.setMaxTweets(100)
tweet = got.manager.TweetManager.getTweets(tweetCriteria)[0]
print(tweet.username)
|
That does not work; I get an error saying list object has no attribute username. |
That's because you're trying to access a nonexistent attribute of a list of tweets. You should probably learn what a list is. |
Deleted my other post. Thanks man, I guess I didn't understand lists, lol. Edit: made a loop and it works. |
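For anyone else hitting this, the loop fix looks like the sketch below. FakeTweet just stands in for the tweet objects that got.manager.TweetManager.getTweets returns, so the example runs without the library:

```python
# getTweets returns a LIST of tweet objects; indexing with [0]
# keeps only the first result, so iterate over the list instead.
class FakeTweet:
    def __init__(self, username, text):
        self.username = username
        self.text = text

tweets = [FakeTweet('alice', 'first tweet'),
          FakeTweet('bob', 'second tweet')]

# tweets[0].username would show only 'alice'; the loop shows all.
for tweet in tweets:
    print(tweet.username, tweet.text)
```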
I am having trouble with GetOldTweets3 on my Mac. I can install it and run the command: But if I try any other command like then I cannot get it to work. It was working until today; I made no changes, but I am getting this error. If anyone has ideas, please share.
Downloading tweets...
During handling of the above exception, another exception occurred:
Traceback (most recent call last): Document is empty
Done. Output file generated "output_got.csv". |
I tried to download tweets with the query search 'bitcoin' since 2018-02-18 until 2018-02-19. The issue is that the script stopped before the end of the until parameter.
The log was too big to put it all, so I deleted the log of the first 31000 tweets.
You can find the log here
Can this be because Twitter detects a bot downloading a lot of tweets?