tweets.py #429

ameliameyer · 2021-04-08T17:39:09Z

Is there a way to expand tweets.py to work on jsonl as well as json?

igorbrigadir · 2021-04-08T18:17:43Z

This one right? https://github.com/DocNow/twarc/blob/main/utils/tweets.py

The script doesn't care what the file extension is: it reads its input line by line, so it should work with jsonl. Or do you mean something else?
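As a rough sketch of what "line by line" means here: each line is parsed as its own JSON document, so the extension does not matter as long as the file holds one tweet per line. The `iter_tweets` helper and the sample data are made up for illustration and are not part of twarc.

```python
import io
import json


def iter_tweets(lines):
    """Yield one dict per non-empty line; each line must be a complete JSON object."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        yield json.loads(line)


# Two made-up tweets, one per line, the way a line-oriented file stores them.
sample = io.StringIO(
    '{"id_str": "1", "text": "first"}\n'
    '{"id_str": "2", "text": "second"}\n'
)
ids = [tweet["id_str"] for tweet in iter_tweets(sample)]
print(ids)
```

A file that holds a single JSON array spread over many lines would not work with this approach, since no individual line is valid JSON on its own.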

ameliameyer · 2021-04-08T18:41:49Z

Yes, that one.

That's what I thought, but I get this error, and only when I run the utility on jsonl files:

Traceback (most recent call last):
  File "C:\Users\user\Desktop\twarc\utils\tweets.py", line 14, in <module>
    tweet["text"],
KeyError: 'text'

igorbrigadir · 2021-04-08T19:04:33Z

Ah, the tweets are probably in a different format than the default. For example, if they're longer than 140 characters, the text field is full_text, not text.

ameliameyer · 2021-04-08T19:31:05Z

Ah, I see. So how do we account for this without having to manually change the python code from tweet["text"] to tweet["full_text"]? I'm still fairly new to json, but is there a way to do an if statement to check if tweet["full_text"] is in the json and if not, print "text"? Or vice versa?

print(("[%s] @%s: %s (%s)" % (
    created_at.strftime("%Y-%m-%d %H:%M:%S"),
    tweet["user"]["screen_name"],
    tweet["text"],
    tweet["id_str"]
)).encode('utf8'))

igorbrigadir · 2021-04-12T14:37:43Z

Unfortunately this is something that requires a code change. And yes, the change would involve a chain of if statements that extract the text, but instead of rolling your own, I would use tweet_parser: https://github.com/twitterdev/tweet_parser

pip install tweet_parser

And use it like this:

#!/usr/bin/env python
from __future__ import print_function

import json
import fileinput
import dateutil.parser

from tweet_parser.tweet import Tweet
from tweet_parser.tweet_parser_errors import NotATweetError

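for line in fileinput.input():
    # Each input line is expected to hold one v1.1 tweet as a JSON object.
    try:
        tweet_dict = json.loads(line)
        tweet = Tweet(tweet_dict)
    except (json.JSONDecodeError, NotATweetError):
        print("Not a valid Tweet:", line)
        continue

    created_at = dateutil.parser.parse(tweet.created_at_string)
    # The text comes from tweet_parser's all_text, so no tweet["text"] lookup is needed.
    # No .encode('utf8') here: on Python 3 that would print a bytes repr (b'...').
    print("[%s] @%s: %s (%s)" % (
        created_at.strftime("%Y-%m-%d %H:%M:%S"),
        tweet["user"]["screen_name"],
        tweet.all_text,
        tweet["id_str"]
    ))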

apologies if there's an error here - i didn't test this

Also, this will only work with v1.1 data, not with the new v2 data.
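For reference, the "chain of if statements" mentioned above could look roughly like this if you would rather not add a dependency. This is a minimal sketch for v1.1 data only, and `get_text` is a hypothetical helper name, not something in twarc or tweet_parser. It assumes the long text lives either in a top-level full_text field or under extended_tweet, and falls back to text otherwise. It does not handle retweets or quoted tweets.

```python
def get_text(tweet):
    """Return the tweet text from whichever v1.1 field holds it (illustrative only)."""
    # Extended-mode payloads store the text in "full_text".
    if "full_text" in tweet:
        return tweet["full_text"]
    # Some payloads nest the long form under "extended_tweet".
    extended = tweet.get("extended_tweet")
    if extended and "full_text" in extended:
        return extended["full_text"]
    # Default format; empty string if no text field is present at all.
    return tweet.get("text", "")


# Made-up dicts standing in for parsed tweets.
print(get_text({"full_text": "a long tweet"}))
print(get_text({"text": "short", "extended_tweet": {"full_text": "the long form"}}))
print(get_text({"text": "a short tweet"}))
```

In tweets.py, the `tweet["text"]` argument in the print call would then become `get_text(tweet)`.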

ameliameyer · 2021-04-13T16:47:19Z

For many, but not all, of the tweets in the dataset it outputs "Not a valid Tweet:" followed by the tweet. For those tweets it doesn't output the user, screen_name, text, or id_str.

igorbrigadir · 2021-04-15T09:44:48Z

Ah, that's unfortunate that it doesn't work out of the box like that.

Can you paste those JSON examples in here? I'd like to check them out later.
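In the meantime, one way to narrow down why lines get rejected is to count them by failure reason using only the standard library. This is just a sketch: `classify_line`, `summarize`, and the `REQUIRED_KEYS` list are assumptions for illustration (the keys are the ones the script above reads), and the sample lines are made up.

```python
import json
from collections import Counter

# Assumption: these are the top-level fields the script above relies on.
REQUIRED_KEYS = ("id_str", "created_at", "user")


def classify_line(line):
    """Return a short label describing why a line would or would not be usable."""
    if not line.strip():
        return "blank line"
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return "invalid JSON"
    if not isinstance(obj, dict):
        return "not a JSON object"
    missing = [key for key in REQUIRED_KEYS if key not in obj]
    if missing:
        return "missing " + ",".join(missing)
    return "ok"


def summarize(lines):
    """Count how many lines fall into each category."""
    return Counter(classify_line(line) for line in lines)


# Made-up input: one usable line, one blank, one with other keys, one broken.
sample = [
    '{"id_str": "1", "created_at": "Thu Apr 08 17:39:09 +0000 2021", "user": {"screen_name": "a"}}',
    "",
    '{"data": {"id": "2"}}',
    "not json",
]
print(summarize(sample))
```

Running `summarize(open("your_file.jsonl", encoding="utf8"))` on the real file would show whether the rejected lines are blank, malformed, or simply shaped differently.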

ameliameyer · 2021-04-15T16:42:30Z

Here are the tweet id txt files for the jsonl and json examples respectively.

nh_dod_ids.txt

election_ids.txt

igorbrigadir · 2021-09-28T14:08:50Z

going over some old issues - in this case, i would now recommend using twarc2 and twarc-csv:

pip install --upgrade twarc twarc-csv

twarc2 hydrate nh_dod_ids.txt nh_dod_ids.jsonl

And to get just id,text,username:

twarc2 csv --output-columns "id,text,author.username" nh_dod_ids.jsonl nh_dod_ids.csv
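If you still want output similar to what tweets.py printed, the resulting CSV can be read back with the standard library. A small sketch, assuming the CSV header uses exactly the column names requested above (id, text, author.username); `format_rows` is a hypothetical helper and the sample row is made up.

```python
import csv
import io


def format_rows(csv_file):
    """Format each CSV row as '@username: text (id)'."""
    reader = csv.DictReader(csv_file)
    return [
        "@%s: %s (%s)" % (row["author.username"], row["text"], row["id"])
        for row in reader
    ]


# Made-up CSV content standing in for nh_dod_ids.csv.
sample = io.StringIO('id,text,author.username\n1,"hello, world",someone\n')
for line in format_rows(sample):
    print(line)
```

For the real file, pass `open("nh_dod_ids.csv", newline="", encoding="utf8")` instead of the sample.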

igorbrigadir closed this as completed Sep 28, 2021
