deleted.py #373

Closed
ameliameyer opened this issue Feb 1, 2021 · 21 comments

Comments

@ameliameyer

What's the usage command for deleted.py? I've been using the command
python utils/deleted.py election_data.txt > election_deleted.jsonl
where election_data is the dehydration output of tweet ids from an election dataset. I keep getting this error:
Traceback (most recent call last):
  File "utils/deleted.py", line 31, in <module>
    for t in missing(tweets):
  File "utils/deleted.py", line 16, in missing
    tweet_ids = [t['id_str'] for t in tweets]
  File "utils/deleted.py", line 16, in <listcomp>
    tweet_ids = [t['id_str'] for t in tweets]
TypeError: 'int' object is not subscriptable

@edsu
Member

edsu commented Feb 1, 2021

Actually, you need to give deletes.py the JSON data for a tweet, because it uses that data to try to figure out where the tweet lives (or lived) on the web.

This program assumes that you are feeding it tweet JSON data for tweets
that have been deleted. It will use the metadata and the API to
analyze why each tweet appears to have been deleted.

Note that lookups are based on user id, so may give different results than
looking up a user by screen name.

Does that help? I haven't used it in quite a while and wonder if it still works predictably after Twitter's infrastructure changes...
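
The TypeError in the traceback above is what you would expect if each input line were a bare tweet id rather than a tweet JSON object. The following is a minimal sketch of that difference, assuming the script parses every input line as JSON (its source is not shown in this thread). classify_line is a hypothetical helper for illustration, not part of twarc.

```python
import json


def classify_line(line):
    """Report whether a line looks like tweet JSON or a bare id.

    Hypothetical helper: it only illustrates why indexing with
    ['id_str'] fails when the line holds a plain number.
    """
    value = json.loads(line)
    if isinstance(value, dict) and "id_str" in value:
        return "tweet-json"
    if isinstance(value, int):
        # json.loads turns a line of digits into an int, and
        # an int cannot be indexed with ['id_str'].
        return "bare-id"
    return "other"


print(classify_line("1355000000000000000"))
print(classify_line('{"id_str": "1355000000000000000"}'))
```

If the first few lines of a file classify as bare ids, the file is a dehydrated id list and would need to be hydrated back into tweet JSON before a script that reads t['id_str'] can use it.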

@ameliameyer
Author

Wait, sorry, I'm confused. I have a text file of 20,000 or so user ids and I want to extract the tweets or accounts that have been deleted.
"""
This is a little utility that reads in tweets, rehydrates them, and only
outputs the tweets JSON for tweets that are no longer available.
"""
I thought you used deleted.py to extract the tweets/accounts that have been deleted into a jsonl file and then fed that jsonl file into deletes.py to analyze why those tweets were deleted.

@edsu
Member

edsu commented Feb 1, 2021

Ah yes, there is deleted.py too. It's a bit of a mess, isn't it? You can use deleted.py and deletes.py in tandem as you describe. But unfortunately deleted.py also expects to read a file of tweet JSON data. If an account has been deleted, information about it cannot be rehydrated from the Twitter API.

If you are looking to see which user ids in your file have been deleted, you can use the twarc users command to hydrate them, and then see which ones are missing. But you won't be able to tell much about them because all you will have is the ID (unless you still have tweet data for them). Happy to discuss in Slack if it's useful.

@ameliameyer
Author

Would the file of tweet JSON data be the original data or would it be a dehydrated output of the original dataset? It was my understanding that dehydration output a TXT file, not a JSON file.

I would use the twarc users command but I get several HTTP errors.

Sorry for all of the questions, I'm very new to twarc.

@edsu
Member

edsu commented Feb 1, 2021

Actually I misspoke: twarc users will throw an error if it is given a user id that no longer exists. To fit in the maximum number of lookups, it sends them in batches of 100. When one id in a batch fails, the whole batch lookup fails, so there's no real way for it to recover.

In your case you are best off writing a little program to do the lookups one by one. I've added a utility called deleted_users.py which you can use to look up user ids, or users in tweet JSON data. It will only output the ids or tweets that have been deleted (no longer available for hydration from the Twitter API). Does this help at all?
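
The one-by-one approach can be sketched roughly as follows. This is not the code in deleted_users.py; find_deleted and fake_lookup are hypothetical names, and fake_lookup stands in for a real API call that is assumed to raise an error when an account is unavailable.

```python
def find_deleted(user_ids, lookup_user):
    """Yield the ids whose individual lookup fails.

    lookup_user is a hypothetical callable that returns user data,
    or raises LookupError when the account cannot be found.
    """
    for user_id in user_ids:
        try:
            lookup_user(user_id)
        except LookupError:
            # Only this id is affected; the loop carries on,
            # unlike a batch request that fails as a whole.
            yield user_id


# Stand-in data so the sketch runs without network access.
AVAILABLE = {"12": {"screen_name": "a"}, "56": {"screen_name": "c"}}


def fake_lookup(user_id):
    # A missing key raises KeyError, which is a subclass of LookupError.
    return AVAILABLE[user_id]


print(list(find_deleted(["12", "34", "56"], fake_lookup)))
```

Because every id costs its own request, this trades speed for the ability to isolate exactly which ids are gone.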

@edsu edsu closed this as completed Feb 1, 2021
@edsu edsu reopened this Feb 1, 2021
@edsu
Member

edsu commented Feb 1, 2021

Whoops, didn't mean to close!

@ameliameyer
Author

I'm trying it out now. The rate limit has been exceeded a couple of times, so it's taking a while to finish running. I think that's understandable considering I have 20,000 tweets. Thank you for your help!

@edsu
Member

edsu commented Feb 1, 2021

Oh good, I'm glad it is running. It will take some time since it needs to check each one instead of getting 100 at a time. It's basically a hundred times slower...

@ameliameyer
Author

I've run it a couple of times on a dataset of 20,000 tweets and the output is blank, which seems unlikely but I suppose isn't impossible.

@edsu
Member

edsu commented Feb 8, 2021

That is unusual. Let me try it out. Can you share the exact command you are using? I can test with my own dataset.

@ameliameyer
Author

python utils/deleted_users.py 2016.jsonl > 2016_deleted_users.jsonl

@edsu
Member

edsu commented Feb 8, 2021

Hmm, this seems to work fine for me on a file of 100 tweets (it found 5). Do you want to try it too?

@ameliameyer
Author

I've got it. Let me try that and I'll get back to you.

@ameliameyer
Author

Hmm it worked for me with that file. I also got 5.

@edsu
Member

edsu commented Feb 8, 2021

Ok good, that means it's working. Did your job finish running? If it didn't, it's possible that the output was buffered and not written to the output file yet. Do you see any activity in the deleted_users.log file?
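
On the buffering point: when standard output is redirected to a file, Python usually writes it in blocks, so a long-running job can leave the file empty until the buffer fills or the process exits. Below is a minimal sketch of writing each record with an explicit flush. write_record is a hypothetical helper, not code taken from deleted_users.py.

```python
import json
import sys


def write_record(record, out=None):
    """Write one JSON record per line and flush it immediately."""
    if out is None:
        out = sys.stdout
    out.write(json.dumps(record) + "\n")
    # Push the line to the file now instead of waiting for
    # the output buffer to fill.
    out.flush()


write_record({"id_str": "1"})
```

Running the interpreter with the -u option (python -u script.py > out.jsonl) is another way to get unbuffered output without changing the script.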

@ameliameyer
Author

It seems to have finished but I don't have a 'deleted_users.log' file anywhere.

@edsu
Member

edsu commented Feb 8, 2021 via email

@ameliameyer
Author

I used the command python utils/deleted_users.py tweets.jsonl > deleted_users.jsonl and it ran in my command line without issues but I don't have any log files.

@edsu
Member

edsu commented Feb 8, 2021

Maybe you need to download the current version of deleted_users.py ?

@ameliameyer
Author

It's working now and in conjunction with deletes.py. Unfortunately, I still cannot get deleted.py to work. I'd really like to be able to analyze tweets that were deleted, not just users (and consequently their tweets).
I've tried running the command:
python utils/deleted.py tweets.jsonl > tweets_deleted.jsonl
but I receive the error
'File "utils/deleted.py", line 36, in <module>
for t in missing(tweets):
File "utils/deleted.py", line 17, in missing
hydrated = t.hydrate(tweets)
UnboundLocalError: local variable 't' referenced before assignment'.
Looking at the code, I'm a little confused by this error message since t is defined as t = twarc.Twarc() at the beginning of the code.
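
One plausible cause, offered as a guess because the body of missing() is not shown in this thread: if the function assigns to t anywhere, for example as a loop variable, Python treats t as local for the whole function, and the earlier t.hydrate(...) call then fails even though a module-level t exists. The sketch below reproduces that pattern with a stand-in Client class, which is hypothetical and not the twarc API.

```python
class Client:
    """Stand-in for the real client; hypothetical, for illustration only."""

    def hydrate(self, ids):
        return [{"id_str": i} for i in ids]


t = Client()


def missing_broken(ids):
    # Fails here: the loop below assigns to t, so t is a local
    # name throughout this function and is not yet bound.
    hydrated = t.hydrate(ids)
    for t in hydrated:
        yield t


def missing_fixed(ids):
    # A different loop variable leaves t referring to the
    # module-level client.
    hydrated = t.hydrate(ids)
    for tweet in hydrated:
        yield tweet


try:
    list(missing_broken(["1"]))
except UnboundLocalError as error:
    print("broken:", type(error).__name__)

print("fixed:", list(missing_fixed(["1"])))
```

If deleted.py follows this pattern, renaming either the client variable or the loop variable inside the function should clear the error.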

@igorbrigadir
Contributor

Going through closing some old issues:

The same functionality is now available using the new Batch Compliance API, which will process tweet IDs and give you back reasons for deletions.
