Prepare data for network analysis: Is it possible to use network.py with twarc2 downloaded tweets? #461

Closed
numeroteca opened this issue May 17, 2021 · 9 comments

@numeroteca

I've downloaded a set of tweets as .jsonl with twarc2, and I am now trying to create .gexf or other network-ready files (lists of nodes and edges, with the ability to select which relationship to use).

Running utils/network.py throws errors. My guess is that the field names differ in API v2 (id instead of id_str, author instead of user...), so the script is unable to process them.

What do you recommend for transforming the data into files usable for data analysis?
Thanks!

@igorbrigadir
Contributor

v2 API responses are totally different from v1.1, so it's very unlikely that existing scripts for v1.1 data will work for v2 data. However, running:

twarc2 flatten output.jsonl flat_output.jsonl

may help, since that writes out 1 tweet per line and includes all the necessary metadata inline.

Given that format, it should be relatively straightforward to edit utils/network.py to accept the new format with some extra code changes. For example:

from_user = t['user']['screen_name']

should be

from_user = t['author']['username']

to match all the "flattened" fields in v2.
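
A minimal sketch of that rename, assuming each line of flat_output.jsonl holds one flattened tweet with its author object inline, as described above. The helper names get_from_user and iter_tweets are illustrative and do not exist in utils/network.py; the v1.1 fallback is only there so the same helper can read both layouts.

```python
import json


def get_from_user(tweet):
    """Return the author's handle from a tweet dict.

    Assumption: flattened v2 tweets carry an inline "author" object,
    while v1.1 tweets carry a "user" object.
    """
    if "author" in tweet:
        return tweet["author"]["username"]
    return tweet["user"]["screen_name"]


def iter_tweets(path):
    """Yield one parsed tweet per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

With these in place, the loop in the script could call get_from_user(t) for each t in iter_tweets("flat_output.jsonl") instead of indexing t['user'] directly.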

This is the section that would require changes:

https://github.com/DocNow/twarc/blob/main/utils/network.py#L127-L168

It's a good candidate for a twarc2 plugin: https://twarc-project.readthedocs.io/en/latest/plugins/

@numeroteca
Author

I started substituting variable names as suggested, but I ran into problems parsing the date-time format:
created_at_date = time.strftime('%d/%m/%Y %H:%M:%S', time.strptime(t["created_at"],'%a %b %d %H:%M:%S +0000 %Y'))

The current format in the v2 API is: "created_at": "2009-05-10T01:10:07.000Z"

Once I solve it I think it'll work.

@igorbrigadir
Contributor

Other changes are required too, and they may be slightly more awkward, such as handling retweets and quotes, which the v1.1 script reads from the retweeted_status object. As for dates:

time.strptime(t["created_at"],'%a %b %d %H:%M:%S +0000 %Y')

This parses the v1.1 date format, so the v2 equivalent would be:

created_at_date = time.strftime('%d/%m/%Y %H:%M:%S', time.strptime(t["created_at"],'%Y-%m-%dT%H:%M:%S.%fZ'))
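
The same conversion can be sketched with datetime so that one helper accepts either timestamp layout. The two format strings are the ones quoted in this thread; the function name format_created_at is hypothetical, and trying the v2 format first is just a choice made for this example.

```python
from datetime import datetime

V1_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"   # v1.1, e.g. "Sun May 10 01:10:07 +0000 2009"
V2_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"        # v2, e.g. "2009-05-10T01:10:07.000Z"


def format_created_at(created_at):
    """Return created_at as 'dd/mm/YYYY HH:MM:SS', whichever API produced it."""
    try:
        parsed = datetime.strptime(created_at, V2_FORMAT)
    except ValueError:
        # Not a v2 timestamp, so fall back to the v1.1 layout.
        parsed = datetime.strptime(created_at, V1_FORMAT)
    return parsed.strftime("%d/%m/%Y %H:%M:%S")


print(format_created_at("2009-05-10T01:10:07.000Z"))  # 10/05/2009 01:10:07
```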

@numeroteca
Author

Indeed, parsing dates works with your proposal, but the other parts related to RTs and quotes still do not.
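
For reference, a rough sketch of how the retweet and quote handling might be adapted. In the v2 API, relationships are listed under referenced_tweets, each with a type such as retweeted, quoted, or replied_to, rather than under retweeted_status or quoted_status. This sketch assumes that twarc2 flatten inlines each referenced tweet together with its own author object; that assumption should be checked against the actual flattened output. The function name edges_from_tweet is hypothetical.

```python
def edges_from_tweet(tweet):
    """Yield (from_user, to_user, relation) tuples for one flattened v2 tweet.

    Assumption: each entry in "referenced_tweets" has a "type" and, after
    flattening, an inline "author" object. Entries without an author
    (for example deleted or protected tweets) are skipped.
    """
    from_user = tweet["author"]["username"]
    for ref in tweet.get("referenced_tweets", []):
        author = ref.get("author")
        if not author:
            continue
        yield from_user, author["username"], ref["type"]


# Hand-written sample, not real API output.
sample = {
    "author": {"username": "alice"},
    "referenced_tweets": [
        {"type": "retweeted", "id": "1", "author": {"username": "bob"}},
        {"type": "quoted", "id": "2"},
    ],
}
print(list(edges_from_tweet(sample)))  # [('alice', 'bob', 'retweeted')]
```

Filtering the yielded tuples on the relation value would give the "select which relationship to use" behaviour asked for in the original question.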

@edsu
Member

edsu commented May 17, 2021

I'm glad this came up. I can work on creating a twarc-network plugin. In the meantime, keep working on your revised script to solve your immediate need and, if you can, attach it here for reference.

@numeroteca
Author

I just made the minor changes to variable names and date parsing noted above. See the current script: https://gist.github.com/numeroteca/aa040b0488c914d1e4a37e40117ef062

@numeroteca
Author

Hey @edsu, any chance that you can work on this? I haven't been able to make it work with my changes to the original script.

@edsu
Member

edsu commented Jun 20, 2021

Yes, I started on it a few weeks ago and stalled. Thanks for the nudge!

@edsu
Member

edsu commented Jun 27, 2021

Ok I've released a port of the old network.py script as a twarc2 plugin. You should be able to install it with:

pip install twarc-network

and then run it to generate a network as a D3 HTML page:

twarc2 network tweets.jsonl network.html

More details about the various format options are available at https://github.com/docnow/twarc-network

Please ask questions about the plugin over in that issue tracker if you don't mind too much!

@edsu edsu closed this as completed Jun 27, 2021

3 participants