COSC2671 Social Media and Network Analytics

# Assignment 1 - Twitter posts downloader

@author Lukas Krodinger, s3961415

Note that this notebook requires the file twitterClient.py, written by Jeffrey Chan, together with a valid Twitter bearerToken whose rate limit has not been exceeded.

In [10]:
import json
import tweepy
from workshop03Code import twitterClient

In [11]:
def load_tweets(filename):
    """
    Loads the tweets from the file with the given name into an array of tweets.

    @param filename: The filename of the file to load the tweets from.

    @returns: An array of tweets.
    """
    tweets = []
    with open(filename, 'r') as f:
        for sLine in f:
            tweet = json.loads(sLine)
            tweets.append(tweet)
    return tweets

In [2]:
client = twitterClient.twitterClient()

Here I define the search query, the fields to download for each tweet, the maximum number of tweets to download, and the name of the output JSON file.

My search targets the phrase "Great Barrier Reef" and is restricted to English posts.
I download all tweet fields that tweepy supports and that require no additional authentication, since the fields needed for the analysis can still be selected later on.
Note that max_tweets might not be reached, because the recent search only returns tweets that are at most one week old.

In [8]:
# Define which tweets to download
search_query = '"Great Barrier Reef" lang:en'

# All non-authenticated tweet fields
all_tweet_fields = ['id', 'text', 'attachments', 'author_id', 'context_annotations', 'conversation_id', 'created_at', 'entities', 'geo', 'in_reply_to_user_id', 'lang', 'possibly_sensitive', 'public_metrics', 'referenced_tweets', 'reply_settings', 'source', 'withheld']

# The maximum amount of tweets to download
max_tweets = 100  # 50000 was used here

# The filename of the file to store the tweets into
all_twitter_fields_filename = "newTweets.json"

I download the tweets through the tweepy client page by page, with 100 tweets per request.

In [4]:
tweets = []

# Download the tweets paginated, 100 at once
for tweet in tweepy.Paginator(client.search_recent_tweets, search_query, max_results=100, tweet_fields=all_tweet_fields).flatten(limit=max_tweets):
    tweets.append(tweet)

print("Number of tweets downloaded: ", len(tweets))

Number of tweets downloaded:  100
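
As a rough sketch of what the pagination implies, the number of requests needed for a given tweet limit can be estimated with a small helper. The name `pages_needed` is hypothetical and not part of tweepy; it only assumes a fixed page size of 100, as in the cell above, and gives an upper bound because the search may return fewer tweets than requested.

```python
import math


def pages_needed(limit, page_size=100):
    """Estimate how many paginated requests are needed to fetch `limit` tweets."""
    if limit <= 0:
        return 0
    # Each request returns at most page_size tweets, so round up
    return math.ceil(limit / page_size)


print(pages_needed(100))    # 1
print(pages_needed(250))    # 3
print(pages_needed(50000))  # 500
```

With the limit of 50000 mentioned above, this would mean up to 500 requests, which is worth keeping in mind with respect to the rate limit of the bearerToken.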


Now I store the downloaded tweets in the specified output file, one JSON object per line.

In [5]:
with open(all_twitter_fields_filename, 'w') as json_file:
    for tweet in tweets:
        json.dump(tweet.data, json_file)
        json_file.write('\n')

print("Tweets successfully stored to: ", all_twitter_fields_filename)

Tweets successfully stored to:  newDownload.json
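
The storage format is one JSON object per line, which is also what load_tweets expects. The following minimal sketch illustrates the round trip on made-up data. The sample records and the helper names `write_json_lines` and `read_json_lines` are assumptions for illustration only; they stand in for the tweet.data dictionaries and are not part of the notebook.

```python
import json
import os
import tempfile

# Hypothetical sample records standing in for tweet.data dictionaries
sample_tweets = [
    {"id": "1", "text": "The Great Barrier Reef is huge."},
    {"id": "2", "text": "Coral bleaching on the Great Barrier Reef"},
]


def write_json_lines(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_json_lines(path):
    """Read the records back, skipping empty lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Use a temporary directory so that no real download file is overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    write_json_lines(sample_tweets, path)
    restored = read_json_lines(path)

print(restored == sample_tweets)  # True
```

Because every line is a complete JSON document, the file can be processed line by line without loading all tweets into memory at once.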


In the next step, I keep only the tweet fields that are of interest. I also limit the result to 5,000 tweets so that the file does not exceed the size limit of 5 MB, and I only keep tweets that contain the words "great barrier reef" in that order.

In [12]:
tweets = load_tweets(all_twitter_fields_filename)

In [24]:
# The filename of the file to store the filtered tweets
filtered_tweet_fields_filename='newFilteredTweets.json'

# The fields of interest
fields_of_interest = ['id', 'text', 'entities', 'created_at']

# What we want our tweets to contain
filter_for = "great barrier reef"

# The maximum number of tweets we want to keep
amount_of_tweets = 5000

# Whether tweets with redundant text should be removed or not
remove_redundant_tweets_texts = True

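I load the tweets from the file containing all fields and drop every field I am not interested in. I then store each tweet with its remaining fields if its text contains filter_for, until amount_of_tweets tweets have been written. If remove_redundant_tweets_texts is set, tweets whose text has already been seen are skipped.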

In [26]:
with open(all_twitter_fields_filename, 'r') as fIn, open(filtered_tweet_fields_filename, 'w') as fOut:
    count = 0
    seen_texts = set()

    for line in fIn:
        # Take no more than amount_of_tweets
        if count >= amount_of_tweets:
            break

        tweet = json.loads(line)

        # Keep only the fields of interest
        tweet = {key: value for key, value in tweet.items() if key in fields_of_interest}

        # Skip tweets which do not contain the filter_for text
        text = tweet.get("text", "").lower()
        if filter_for not in text:
            continue

        # Skip tweets whose text was already seen
        if remove_redundant_tweets_texts:
            if text in seen_texts:
                continue
            seen_texts.add(text)

        # Store the filtered tweet and count it
        fOut.write("{}\n".format(json.dumps(tweet)))
        count = count + 1

print("Filtered tweets successfully stored to: ", filtered_tweet_fields_filename)

Filtered tweets successfully stored to:  filteredTweets.json
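
To make the filtering steps easier to check in isolation, they can be sketched as a function that works on a list of dictionaries instead of files. The function name `filter_tweets`, its parameters, and the sample tweets are hypothetical and only serve as an illustration under the same assumptions as the cell above (lower-cased phrase match, duplicate detection on the lower-cased text).

```python
def filter_tweets(tweets, fields, phrase, limit, drop_duplicates=True):
    """Return at most `limit` tweets containing `phrase`, reduced to `fields`."""
    seen_texts = set()
    kept = []
    for tweet in tweets:
        text = (tweet.get("text") or "").lower()
        # Skip tweets that do not contain the phrase
        if phrase not in text:
            continue
        # Skip tweets whose lower-cased text was already seen
        if drop_duplicates:
            if text in seen_texts:
                continue
            seen_texts.add(text)
        # Keep only the requested fields
        kept.append({k: v for k, v in tweet.items() if k in fields})
        if len(kept) >= limit:
            break
    return kept


# Hypothetical sample data
sample = [
    {"id": "1", "text": "Great Barrier Reef trip!", "lang": "en"},
    {"id": "2", "text": "great barrier reef trip!", "lang": "en"},  # duplicate once lower-cased
    {"id": "3", "text": "A great reef barrier", "lang": "en"},      # words in the wrong order
    {"id": "4", "text": "Saving the Great Barrier Reef", "lang": "en"},
]

result = filter_tweets(sample, ["id", "text"], "great barrier reef", limit=5000)
print([t["id"] for t in result])  # ['1', '4']
```

Using a set for the texts already seen keeps the duplicate check fast even for several thousand tweets.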
