COSC2671 Social Media and Network Analytics

# Assignment 2 - Twitter posts filterer

@author Lukas Krodinger, s3961415

Note that this notebook requires the file twitterClient.py, written by Jeffrey Chan, together with a valid Twitter bearerToken whose limit has not been exceeded.

In [2]:
import json
import math
from datetime import datetime, timezone

In [3]:
def load_tweets(filenames):
    """
    Loads the tweets from the files with the given names into a single list of tweets.

    @param filenames: The names of the files to load the tweets from. Each file holds one JSON-encoded tweet per line.

    @returns: A list of tweets (one dict per tweet).
    """
    tweets = []
    for filename in filenames:
        with open(filename, 'r') as f:
            for sLine in f:
                tweet = json.loads(sLine)
                tweets.append(tweet)
    return tweets

def get_hashtags(tweet):
    """
    Extracts the hashtags associated with a tweet.

    @param tweet: The tweet, in the tweepy JSON format, whose hashtags we want to extract.

    @returns: List of hashtags (in lower case); empty if the tweet has none.
    @author Jeffrey Chan
    """
    entities = tweet.get('entities', {})
    hashtags = entities.get('hashtags', [])

    return [tag['tag'].lower() for tag in hashtags]
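
As a quick illustration of what `get_hashtags` returns, here is a minimal sketch run on made-up data. The `sample_tweet` dict is hypothetical and contains only the keys the function reads; the function body is repeated so the snippet runs on its own.

```python
def get_hashtags(tweet):
    # Same logic as the function above, repeated to keep the sketch self-contained
    entities = tweet.get('entities', {})
    hashtags = entities.get('hashtags', [])
    return [tag['tag'].lower() for tag in hashtags]

# Hypothetical tweet with two hashtags (not taken from the collected data)
sample_tweet = {
    "id": "1",
    "text": "Great rally today! #Tennis #ATP",
    "entities": {"hashtags": [{"tag": "Tennis"}, {"tag": "ATP"}]},
}

print(get_hashtags(sample_tweet))                     # ['tennis', 'atp']
print(get_hashtags({"id": "2", "text": "no tags"}))   # []
```

A tweet without an `entities` field yields an empty list rather than an error, which is what the filtering loop further down relies on.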

In [4]:
# The filenames of the files that contain the collected tweets with all fields
all_twitter_fields_filename = ["tennis_2022_10_12_18_20.json", "tennis_2022_10_13_10_20.json", "tennis_2022_10_13_10_50.json"]

In the next step, I first keep only the tweet fields that are of interest.
I also look at only 5,000 tweets so as not to exceed the file size limit of 5 MB, and I only take tweets into account that ... in that order.

In [5]:
tweets = load_tweets(all_twitter_fields_filename)

In [6]:
# The filename of the file to store the filtered tweets
filtered_tweet_fields_filename = "table_tennis_filtered3_4_days.json"

# The fields of interest
fields_of_interest = ['id', 'text', 'author_id', 'context_annotations', 'conversation_id', 'created_at', 'entities', 'geo', 'in_reply_to_user_id', 'lang', 'referenced_tweets']

# What we want our tweets to contain
filter_for = "tennis"

# What the tweets should NOT contain (as a hashtag or a word in the text)
filter_out = ["etsyseller", "eastvillagebangles", "bracelet", "etsy", "etsysel", "eastvillagebangl"]

# The maximum number of tweets to keep (math.inf means no limit)
amount_of_tweets = math.inf

# Whether tweets with redundant text should be removed or not
remove_redundant_tweets_texts = True

# Date constraints
start_date = datetime.fromisoformat('2022-10-06T00:00:00.000Z'[:-1]).astimezone(timezone.utc)
end_date = datetime.fromisoformat('2022-10-09T23:59:59.000Z'[:-1]).astimezone(timezone.utc)
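
A note on the date handling: calling `astimezone` on a naive datetime makes Python interpret it as local time, so the two bounds above are shifted by the machine's UTC offset. Because `created_at` is parsed the same way later on, the comparison stays largely consistent, but the bounds are not the UTC instants they appear to be. The following is a minimal sketch of an alternative that labels the parsed value as UTC directly; `parse_utc` is a hypothetical helper name, not part of the notebook.

```python
from datetime import datetime, timezone

def parse_utc(timestamp):
    """Parse an ISO 8601 string ending in 'Z' into a timezone-aware UTC datetime."""
    # Strip the trailing 'Z' (older Python versions reject it in fromisoformat)
    # and attach UTC explicitly instead of converting from local time.
    return datetime.fromisoformat(timestamp[:-1]).replace(tzinfo=timezone.utc)

start = parse_utc('2022-10-06T00:00:00.000Z')
end = parse_utc('2022-10-09T23:59:59.000Z')
created = parse_utc('2022-10-07T12:30:00.000Z')   # made-up example timestamp

print(start.isoformat())       # 2022-10-06T00:00:00+00:00
print(start < created < end)   # True
```

If this helper were adopted, it would have to be used for the bounds and for each tweet's `created_at` alike, so that both sides of the comparison are built the same way.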

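I go through the loaded posts and delete all fields I am not interested in. A tweet is then stored with its remaining fields if its text contains filter_for, it matches none of the filter_out terms, its text has not been seen before, it was created within the date range, and amount_of_tweets has not been exceeded.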
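
The field removal can also be written without modifying the loaded tweets. The sketch below builds a new dict instead of deleting keys in place; `keep_fields` and the sample tweet are hypothetical and only illustrate the idea.

```python
def keep_fields(tweet, fields):
    """Return a new dict with only those keys of the tweet that are listed in fields."""
    return {key: value for key, value in tweet.items() if key in fields}

# Hypothetical tweet with one field ('source') that is not of interest
raw_tweet = {"id": "1", "text": "Tennis final tonight", "source": "web", "lang": "en"}
slim_tweet = keep_fields(raw_tweet, {"id", "text", "lang"})

print(slim_tweet)   # {'id': '1', 'text': 'Tennis final tonight', 'lang': 'en'}
print(raw_tweet)    # unchanged, still contains 'source'
```

Leaving the originals untouched means the filtering cell can be re-run with a different list of fields without reloading the files first.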

In [10]:
with open(filtered_tweet_fields_filename, 'w') as fOut:
    count = 0
    tweet_texts = []

    for tweet in tweets:

        # Remove fields that are not of interest
        for key in list(tweet.keys()):
            if key not in fields_of_interest:
                del tweet[key]

        # Skip tweets which do not contain the filter_for text
        # (a missing text field is treated as an empty string)
        text = (tweet.get("text") or "").lower()
        if filter_for not in text:
            continue

        # Skip tweets which contain any of filter_out in text or hashtag
        hashtags = get_hashtags(tweet)
        hashtags_string = " ".join(hashtags)
        # Join with a space so a match cannot span the end of the text and the first hashtag
        if any(word in (text + " " + hashtags_string) for word in filter_out):
            continue

        # Skip tweets whose text has already been seen
        if remove_redundant_tweets_texts:
            if text in tweet_texts:
                continue
            tweet_texts.append(text)

        # Skip tweets created outside the date range
        created_at = tweet.get('created_at', '')
        date = datetime.fromisoformat(created_at[:-1]).astimezone(timezone.utc)
        date_in_range = start_date < date < end_date
        if not date_in_range:
            continue

        # Stop once amount_of_tweets tweets have been stored
        if count >= amount_of_tweets:
            break

        # Store the filtered tweet as one JSON object per line;
        # count only tweets that are actually written
        fOut.write("{}\n".format(json.dumps(tweet)))
        count = count + 1

print("Filtered tweets successfully stored to: ", filtered_tweet_fields_filename)

Filtered tweets successfully stored to:  table_tennis_filtered3_4_days.json
