# Analysis of Tweets from a full-archive search

In [12]:
import pandas as pd
from os.path import join
import numpy as np

In [1]:
#!pip install --upgrade twarc

In [13]:
# Define the main folder and files (hashtags are the files).
src = '/media/s/Linux_storage/Analyse_Verkehrswende_Transformation/Data/Updates' # Put the path here. On Windows, use a raw string: r'\...'
hashtags = ['IchBinHanna'] # Put the file name here

Note: if the Tweets have already been downloaded and the Tweet data exists in the folder ```data``` as compressed ```.jsonl``` files, you can skip the "Query tweets" and "Compress data" steps and start processing at "Decompress data".

## Collect Tweets

### Query tweets

The goal is to collect original posts about Verkehrswende AND Nachhaltigkeit (including related terms), without Retweets.
Note: each query is saved in its own file. Storing the exact query parameters for every data file makes the data collection reproducible.

Here is the basic structure of the query:
```(Verkehrswende OR Mobilitätswende ...)``` ```AND``` ```(Nachhaltigkeit OR Klimawandel ...)```
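
The two-group structure can be sketched in Python as follows. This is only an illustration: the actual queries live in the shell scripts, which are not shown here, so the helper name `build_query` and the shortened term lists are assumptions. In the Twitter API v2 query syntax, a space between two groups acts as AND, and `-is:retweet` excludes Retweets.

```python
def build_query(topic_terms, context_terms, exclude_retweets=True):
    """Combine two OR-groups into one query string (hypothetical helper)."""
    topic_group = "(" + " OR ".join(topic_terms) + ")"
    context_group = "(" + " OR ".join(context_terms) + ")"
    # In the Twitter API v2 query syntax, a space between groups means AND.
    query = topic_group + " " + context_group
    if exclude_retweets:
        # Keep original posts only.
        query += " -is:retweet"
    return query


# Shortened term lists; the real lists contain more terms.
query = build_query(
    ["Verkehrswende", "Mobilitätswende"],
    ["Nachhaltigkeit", "Klimawandel"],
)
print(query)
```

Keeping the second group fixed and swapping only the first list gives one query per topic.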

In [13]:
# Change file permissions such that execution is allowed
! chmod +x /media/s/Linux_storage/Analyse_Verkehrswende_Transformation/Code/Queries/Bahnverkehr.sh

# Run the query. Note: this can take a while, depending on the number of Tweets
# that need to be downloaded
! /media/s/Linux_storage/Analyse_Verkehrswende_Transformation/Code/Queries/Bahnverkehr.sh

Each subsequent query changes *only* the first part, for example for Elektromobilität:
```(Elektromobilität OR Elektroauto ...)``` ```AND``` ```(Nachhaltigkeit OR Klimawandel...)```

In [13]:
! chmod +x /media/s/Linux_storage/Analyse_Verkehrswende_Transformation/Code/Queries/Digitale_grüne_Bahn.sh
! /media/s/Linux_storage/Analyse_Verkehrswende_Transformation/Code/Queries/Digitale_grüne_Bahn.sh

### Compress data

Note: on Windows, .xz files can be decompressed with, for example, [WinZIP](https://www.winzip.com/win/en/xz-file.html).
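
As a platform-independent alternative, Python's standard library module `lzma` can read .xz files directly, without decompressing them to disk first. The sketch below assumes the file contains one JSON object per line; the function name `read_jsonl_xz` is hypothetical.

```python
import json
import lzma


def read_jsonl_xz(path):
    """Yield one parsed JSON object per non-empty line of an .xz-compressed JSONL file."""
    # mode="rt" decompresses on the fly and decodes the bytes as text.
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

A call such as `for page in read_jsonl_xz('Verkehrswende.jsonl.xz'): ...` would then iterate over the stored lines one at a time.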

In [27]:
# the parameter "-k" keeps the original file
! xz -k /media/s/Linux_storage/Analyse_Verkehrswende_Transformation/Data/Verkehrswende.jsonl

### Decompress data

In [None]:
! xz -d /media/s/Linux_storage/Analyse_Verkehrswende_Transformation/Data/Verkehrswende.jsonl.xz

### Convert to CSV

The conversion removes duplicate Tweets (by ID) but keeps referenced Tweets.
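
To illustrate what deduplication by ID means, here is a minimal pandas sketch on made-up data. It is not how `twarc2 csv` works internally; it only shows the idea of keeping the first row for every Tweet ID.

```python
import pandas as pd

# Made-up example: the Tweet with ID "2" appears twice.
tweets = pd.DataFrame({
    "id": ["1", "2", "2", "3"],
    "text": ["original", "reply", "reply", "quoted tweet"],
})

# Keep only the first occurrence of every ID.
deduplicated = tweets.drop_duplicates(subset="id", keep="first")
print(len(tweets), "rows before,", len(deduplicated), "rows after")
```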

In [17]:
! twarc2 csv --extra-input-columns "in_reply_to_user.withheld.scope" /media/s/Linux_storage/Analyse_Verkehrswende_Transformation/Data/conversations_data/Hanna_threads.jsonl /media/s/Linux_storage/Analyse_Verkehrswende_Transformation/Data/conversations_data/Hanna_threads.csv

100%|████████████████████████████████████████| 463k/463k [00:00<00:00, 2.88MB/s]

ℹ️
Read 505 tweets from 11 lines. 
254 were referenced tweets, 240 were duplicates.
Wrote 265 rows and output 92 of 92 input columns in the CSV.



## Extract conversation IDs

Make sure that all required folders already exist (folder names are case sensitive). Reduce ```chunksize``` to get smaller .txt files. The ID lists can be quite long, so it may make more sense to work only with the most active conversations. At the very least, it helps to know the number of Tweets per conversation: the more Tweets a conversation contains, the more intense it is.
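
One possible way to find the most active conversations is to count the Tweets per ```conversation_id```. The sketch below uses made-up data; the helper name `tweets_per_conversation` and the parameter `top_n` are assumptions and not part of the original workflow.

```python
import pandas as pd


def tweets_per_conversation(data, top_n=10):
    """Return the top_n conversations with the most Tweets (hypothetical helper)."""
    # value_counts() sorts in descending order, so the most active come first.
    counts = data["conversation_id"].dropna().value_counts()
    return counts.head(top_n)


# Made-up example: conversation "a" has three Tweets, "b" two, "c" one.
example = pd.DataFrame({"conversation_id": ["a", "a", "b", "a", "c", "b"]})
top = tweets_per_conversation(example, top_n=2)
print(top)
```

The index of the returned Series holds the conversation IDs, so `top.index` could be written to a file instead of the full ID list.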

In [31]:
def get_conversation_IDs(src, filename, chunks=False, chunksize=10000):
    data = pd.read_csv(join(src, '{}.csv'.format(filename)), low_memory=False)
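    # Use np.int64 explicitly: Tweet IDs do not fit into a 32-bit integer,
    # which is the default int on some platforms (e.g. NumPy < 2 on Windows).
    conversationIDs = data['conversation_id'].dropna().astype(np.int64).unique()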
    print('{}: There are {} Tweets from {} conversations'\
              .format(filename, len(data), len(conversationIDs)))
    
    dst = join(src, 'conversation_IDs')
    
    if chunks:
        N_chunks = len(conversationIDs) // chunksize
        print(N_chunks)
        for i in range(N_chunks):
            ID_chunk = conversationIDs[i * chunksize : (i + 1) * chunksize]
            np.savetxt(join(dst, '{}_ConversationIDs_{}_to_{}.txt'\
                .format(filename, i * chunksize, (i + 1) * chunksize)),
                ID_chunk, fmt='%d')
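        # Write the remaining IDs that do not fill a whole chunk (skipped if
        # the number of IDs is an exact multiple of chunksize).
        if len(conversationIDs) > N_chunks * chunksize:
            np.savetxt(join(dst, '{}_ConversationIDs_{}_to_{}.txt'\
                    .format(filename, N_chunks * chunksize, len(conversationIDs))),
                    conversationIDs[N_chunks * chunksize : ], fmt='%d')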
            
    else:   
        np.savetxt(join(dst, '{}_ConversationIDs.txt'.format(filename)),
                   conversationIDs, fmt='%d')

In [32]:
get_conversation_IDs(src, 'Verkehrswende', chunks=True)

Verkehrswende: There are 47489 Tweets from 38917 conversations
38


## Extract Tweet IDs

In [14]:
def get_Tweet_IDs(src, filename, chunks=False, chunksize=1000):
    data = pd.read_csv(join(src, '{}.csv'.format(filename)), low_memory=False)
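    # Use np.int64 explicitly: Tweet IDs do not fit into a 32-bit integer,
    # which is the default int on some platforms (e.g. NumPy < 2 on Windows).
    TweetIDs = data['id'].dropna().astype(np.int64).unique()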
    print('{}: There are {} Tweets'\
              .format(filename, len(TweetIDs)))
    
    dst = join(src, 'tweet_IDs')
    
    if chunks:
        N_chunks = len(TweetIDs) // chunksize
        print(N_chunks)
        for i in range(N_chunks):
            ID_chunk = TweetIDs[i * chunksize : (i + 1) * chunksize]
            np.savetxt(join(dst, '{}_TweetIDs_{}_to_{}.txt'\
                .format(filename, i * chunksize, (i + 1) * chunksize)),
                ID_chunk, fmt='%d')
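        # Write the remaining IDs that do not fill a whole chunk (skipped if
        # the number of IDs is an exact multiple of chunksize).
        if len(TweetIDs) > N_chunks * chunksize:
            np.savetxt(join(dst, '{}_TweetIDs_{}_to_{}.txt'\
                    .format(filename, N_chunks * chunksize, len(TweetIDs))),
                    TweetIDs[N_chunks * chunksize : ], fmt='%d')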
            
    else:   
        np.savetxt(join(dst, '{}_TweetIDs.txt'.format(filename)),
                   TweetIDs, fmt='%d')

In [15]:
get_Tweet_IDs(src, 'IchBinHanna', chunks=False)

IchBinHanna: There are 75910 Tweets
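
A caveat about ID precision: if an ID column contains missing values, `pd.read_csv` reads it as `float64`, which cannot represent every integer above 2^53. Tweet IDs are larger than that, so converting back to integers can change the last digits. The sketch below demonstrates the effect on a made-up ID and shows one possible workaround, reading the column as strings. Whether the real data files are affected depends on whether their ID columns contain missing values, which is not checked here.

```python
import io

import numpy as np
import pandas as pd

# Made-up CSV: one valid ID and one row with a missing ID.
csv_text = "id,text\n1400000000000000001,first\n,second\n"

# Default parsing: the missing value forces float64, which loses precision.
lossy_ids = (
    pd.read_csv(io.StringIO(csv_text))["id"].dropna().astype(np.int64).unique()
)

# Workaround: read the column as strings and convert with Python's exact int().
id_column = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})["id"]
exact_ids = [int(value) for value in id_column.dropna().unique()]

print(lossy_ids[0], exact_ids[0])
```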
