tweets.py #429

Closed
ameliameyer opened this issue Apr 8, 2021 · 9 comments

Comments

@ameliameyer

Is there a way to expand tweets.py to work on jsonl as well as json?

@igorbrigadir
Contributor

This one, right? https://github.com/DocNow/twarc/blob/main/utils/tweets.py

It doesn't matter to it what the actual file format is, it will still read it line by line, so it should work with jsonl. Or do you mean something else?
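A minimal sketch of what "line by line" means here: the utility expects one JSON object per line, which is exactly what a .jsonl file contains. The helper name `iter_tweets` and the sample data are hypothetical and not part of twarc.

```python
import io
import json


def iter_tweets(lines):
    """Yield one parsed object per non-empty line (hypothetical helper, not part of twarc)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        yield json.loads(line)


# Two made-up tweets, one per line, as they would appear in a .jsonl file.
sample = io.StringIO(
    '{"id_str": "1", "text": "first"}\n'
    '{"id_str": "2", "full_text": "second"}\n'
)
tweets = list(iter_tweets(sample))
print([t["id_str"] for t in tweets])
```

A file that holds a single JSON array spread over several lines would not parse this way, since no individual line is a complete JSON document.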

@ameliameyer
Author

Yes, that one.

That's what I thought but I receive this error when I run the utility on jsonl files only:

Traceback (most recent call last):
  File "C:\Users\user\Desktop\twarc\utils\tweets.py", line 14, in <module>
    tweet["text"],
KeyError: 'text'

@igorbrigadir
Contributor

Ah, if tweets are in a different format than the default (for example, if they're longer than 140 characters), the text field is full_text, not text.

@ameliameyer
Author

Ah, I see. So how do we account for this without having to manually change the python code from tweet["text"] to tweet["full_text"]? I'm still fairly new to json, but is there a way to do an if statement to check if tweet["full_text"] is in the json and if not, print "text"? Or vice versa?

print(("[%s] @%s: %s (%s)" % (
    created_at.strftime("%Y-%m-%d %H:%M:%S"),
    tweet["user"]["screen_name"],
    tweet["text"],
    tweet["id_str"]
)).encode('utf8'))

@igorbrigadir
Contributor

igorbrigadir commented Apr 12, 2021

Unfortunately this is something that might require a code change. Yes, the change would involve a chain of if statements that extract the text, but instead of rolling your own, I would use tweet_parser: https://github.com/twitterdev/tweet_parser

pip install tweet_parser

And use it like this:

#!/usr/bin/env python
from __future__ import print_function

import json
import fileinput
import dateutil.parser

from tweet_parser.tweet import Tweet
from tweet_parser.tweet_parser_errors import NotATweetError

for line in fileinput.input():
    try:
        tweet_dict = json.loads(line)
        tweet = Tweet(tweet_dict)
    except (ValueError, NotATweetError):  # json.JSONDecodeError subclasses ValueError
        print("Not a valid Tweet:", line)
        continue

    created_at = dateutil.parser.parse(tweet.created_at_string)
    print(("[%s] @%s: %s (%s)" % (
        created_at.strftime("%Y-%m-%d %H:%M:%S"),
        tweet["user"]["screen_name"],
        tweet.all_text,
        tweet["id_str"]
    )).encode('utf8'))

Apologies if there's an error here; I didn't test this.

Also, this will only work with v1.1 data, not with the new v2 data.
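For reference, the "chain of if statements" mentioned above could look roughly like the sketch below. It is an illustration under stated assumptions, not what tweet_parser does internally: `get_text` is a hypothetical helper, and it only looks at the v1.1 fields `full_text`, `extended_tweet.full_text`, and `text`.

```python
def get_text(tweet):
    """Return the text of a v1.1 tweet dict, preferring the untruncated fields.

    Hypothetical helper for illustration; not part of twarc or tweet_parser.
    """
    # Extended mode: the whole text is stored under "full_text".
    if "full_text" in tweet:
        return tweet["full_text"]
    # Streaming format: long tweets keep the untruncated text in "extended_tweet".
    extended = tweet.get("extended_tweet")
    if isinstance(extended, dict) and "full_text" in extended:
        return extended["full_text"]
    # Classic format: fall back to "text", or an empty string if it is absent.
    return tweet.get("text", "")


# Made-up examples of the three shapes handled above.
print(get_text({"full_text": "a long tweet"}))
print(get_text({"text": "cut off", "extended_tweet": {"full_text": "not cut off"}}))
print(get_text({"text": "a short tweet"}))
```

In tweets.py, `tweet["text"]` would then become `get_text(tweet)`. The sketch does not look inside `retweeted_status`, so the text of a retweet may still be truncated.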

@ameliameyer
Author

It outputs "Not a valid Tweet:" followed by the tweet for many, but not all, of the tweets in the dataset, but it doesn't output the user, screen_name, text, or id_str.
Screenshot (351)

@igorbrigadir
Contributor

Ah, it's unfortunate that it doesn't work out of the box like that.

Can you paste those JSON examples in here? I'd like to check it out later.
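One way to narrow this down without tweet_parser is to classify each line by why it would be rejected. This is only a diagnostic sketch: `classify_line`, the `REQUIRED_KEYS` list, and the sample lines are assumptions made for illustration, not part of twarc.

```python
import json
from collections import Counter

# Keys the print statement in tweets.py relies on; this list is an assumption.
REQUIRED_KEYS = ("id_str", "user", "created_at")


def classify_line(line):
    """Return a short label saying whether a line is usable, and if not, why."""
    if not line.strip():
        return "blank"
    try:
        obj = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "invalid json"
    if not isinstance(obj, dict):
        return "not an object"
    missing = [key for key in REQUIRED_KEYS if key not in obj]
    if missing:
        return "missing " + ",".join(missing)
    return "ok"


# Made-up lines covering the cases above.
lines = [
    '{"id_str": "1", "user": {"screen_name": "a"}, "created_at": "x", "full_text": "hi"}',
    "",
    "{not json}",
    '{"id": 2, "text": "a dict without the v1.1 fields"}',
]
counts = Counter(classify_line(line) for line in lines)
print(counts)
```

Running the same function over the real file, for example with `fileinput.input()`, would show whether the rejected lines are blank, malformed, or simply in a different tweet format.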

@ameliameyer
Author

Here are the tweet id txt files for the jsonl and json examples respectively.

nh_dod_ids.txt

election_ids.txt

@igorbrigadir
Contributor

Going over some old issues. In this case, I would now recommend using twarc2 and twarc-csv:

pip install --upgrade twarc twarc-csv
twarc2 hydrate nh_dod_ids.txt nh_dod_ids.jsonl

And to get just id,text,username:

twarc2 csv --output-columns "id,text,author.username" nh_dod_ids.jsonl nh_dod_ids.csv
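If the goal is still the one-line-per-tweet output of tweets.py, the resulting CSV could be read back with the standard library, roughly as sketched here. This assumes the header row uses the same names passed to --output-columns (id, text, author.username); the sample rows are made up, and an in-memory file stands in for nh_dod_ids.csv.

```python
import csv
import io

# Stand-in for: open("nh_dod_ids.csv", newline="", encoding="utf-8")
sample = io.StringIO(
    "id,text,author.username\n"
    '1,"hello, world",alice\n'
    "2,second tweet,bob\n"
)

rows = list(csv.DictReader(sample))

# Same layout as the tweets.py output, minus the timestamp.
formatted = [
    "@%s: %s (%s)" % (row["author.username"], row["text"], row["id"])
    for row in rows
]
for line in formatted:
    print(line)
```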
