# Script to create datasets as described in Section 5.1:

Run all cells to generate the dataset file for a single climate event from the CrisisMMD dataset [1] and save it to disk as a CSV file.

##### Note: This step requires the prerequisites specified in the README file of the repository.

## Initialisations:

In [1]:
# Importing python libraries
import tweepy # Python library for accessing the Twitter API, requires pip install tweepy
import pandas as pd
import requests
import math
import numpy as np
import time

In [2]:
# Setting paths to required directories on disk

# Set the following path to the file containing Twitter API keys saved in dictionary format.
# API keys file content format: {"api_key1": "xxxxxxxxx", "api_key2": "xxxxxxxx"}
# The file must include a "BEARER_TOKEN" entry, which is read in a later cell.
api_keys_path = '../../Code/KEYS/api_keys_academic_access.txt' 

# Set following path to annotated tweets present in dataset downloaded from [1]
labelled_data_path = '../../Data/CrisisMMD/CrisisMMD_v2.0/annotations' 

# Set following path to directory to store datasets created by this script
dataset_store_path = '../../Data/TweetCredibilityDatasets' 

The following dataset file names correspond to the files stored in the annotations folder of the CrisisMMD dataset. Set event_name and event_file_name in the next cell for the dataset to be created.


1. 'california_wildfires_final_data.tsv'
2. 'hurricane_harvey_final_data.tsv'
3. 'hurricane_irma_final_data.tsv'
4. 'hurricane_maria_final_data.tsv'
5. 'iraq_iran_earthquake_final_data.tsv'
6. 'mexico_earthquake_final_data.tsv'
7. 'srilanka_floods_final_data.tsv'
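
Each file name above has the form `<event>_final_data.tsv`, so the event name could be derived from the file name instead of being typed separately. The following is a minimal sketch of that idea; the helper `event_name_from_file` and the constant `SUFFIX` are illustrative names and are not part of the original script.

```python
# Illustrative helper (not part of the original notebook): derive the event
# name from an annotation file name of the form '<event>_final_data.tsv'.
SUFFIX = '_final_data.tsv'

def event_name_from_file(file_name):
    if not file_name.endswith(SUFFIX):
        raise ValueError(f'Unexpected annotation file name: {file_name}')
    return file_name[:-len(SUFFIX)]

example_file = 'hurricane_harvey_final_data.tsv'
example_event = event_name_from_file(example_file)
print(example_event)  # hurricane_harvey
```

Deriving the name this way keeps the two variables from drifting apart when switching between events.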

In [3]:
# Set the filename for climate event for which the dataset is to be created
event_file_name = 'california_wildfires_final_data.tsv'
event_name = 'california_wildfires'

In [4]:
# Reading API keys from text file (literal_eval parses the dictionary without executing code)
import ast
with open(api_keys_path) as f:
    keys = ast.literal_eval(f.read())

In [5]:
# Details on initialising a tweepy client to make API requests can be found at [2]

# Setting the bearer token for Twitter API access
bearer_token = keys['BEARER_TOKEN']

# Initialising Tweepy client for API requests
client = tweepy.Client(bearer_token=bearer_token,
                       return_type = requests.Response)
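
The client above only works if the keys file contains a `BEARER_TOKEN` entry. As a sketch, assuming the keys file holds a Python dictionary literal as described earlier, the token lookup could fail early with a clear message. The helper name `load_bearer_token` is hypothetical.

```python
import ast

# Hypothetical helper: read the keys file as a Python dict literal and return
# the bearer token, failing early with a clear message if it is missing.
def load_bearer_token(path, key='BEARER_TOKEN'):
    with open(path) as f:
        keys = ast.literal_eval(f.read())  # parses literals only, runs no code
    if not isinstance(keys, dict) or key not in keys:
        raise KeyError(f'{key} not found in {path}')
    return keys[key]
```

With this helper, the token would be obtained as `bearer_token = load_bearer_token(api_keys_path)` before creating the client.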

## Defining functions for reading tweet ids from CrisisMMD dataset, making API requests, and saving dataset file:

In [11]:
# Reading tweet ids from CrisisMMD Datasets
def read_tweet_ids(file_name):
    # The squeeze argument of read_csv was removed in pandas 2.0; selecting the column returns a Series
    return pd.read_csv(f'{labelled_data_path}/{file_name}', sep='\t', usecols=['tweet_id'])['tweet_id']
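
If an annotation file lists the same tweet more than once (for example, one row per attached image), the same id would be requested repeatedly. The sketch below shows one way to drop repeated ids before making requests. The toy TSV content and the helper name `read_unique_tweet_ids` are assumptions for illustration; the real files have more columns.

```python
import io
import pandas as pd

# Toy stand-in for an annotation file; column values are made up.
toy_tsv = 'tweet_id\timage_id\n111\t111_0\n111\t111_1\n222\t222_0\n'

# Illustrative helper: read the tweet_id column and keep each id once,
# preserving the order of first appearance.
def read_unique_tweet_ids(file_or_buffer):
    ids = pd.read_csv(file_or_buffer, sep='\t', usecols=['tweet_id'])['tweet_id']
    return ids.drop_duplicates().reset_index(drop=True)

unique_ids = read_unique_tweet_ids(io.StringIO(toy_tsv))
print(unique_ids.tolist())  # [111, 222]
```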

In [7]:
# Method to make Tweet Lookup API request
# Following code is based on Tweepy get_tweets example [3] and [4]
def get_tweets(dataset):
    
    # Reading tweet ids from dataset file
    print(f'Reading tweet ids from dataset file for {dataset}...')
    tweet_ids = read_tweet_ids(dataset)
    
    # Splitting tweet ids into multiple parts to limit each list to 100 ids or less
    # The tweet lookup api supports request for only 100 ids in one call
    parts = math.ceil(len(tweet_ids)/100)
    tweet_id_parts = np.array_split(tweet_ids, parts)
    
    # Using get_tweets method of Tweepy to request Tweet Lookup API,
    # which returns a list of tweets using tweet ids specified
    print(f'Requesting tweets...')
    responses = []
    tweet_fields = ['author_id', 'entities', 'public_metrics', 'context_annotations']
    for ids in tweet_id_parts:
        response = client.get_tweets(','.join(str(tweet_id) for tweet_id in ids), tweet_fields=tweet_fields)
        # Decode the JSON response and append its "data" list to the list of responses
        responses.append(response.json()['data'])
        
    # Flattening list of lists to a single list of all tweet responses
    all_tweets = [res for response in responses for res in response]
    
    # Transform to pandas Dataframe
    df = pd.json_normalize(all_tweets)
    return df
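
The batching step above keeps every request at 100 ids or fewer. As an illustration of the same constraint, the sketch below uses fixed-size slices instead of `math.ceil` with `np.array_split`. The helper name `chunk_ids` is hypothetical, and unlike the split-based version it also copes with an empty id list.

```python
# Illustrative alternative to math.ceil + np.array_split: fixed-size slices.
def chunk_ids(ids, size=100):
    ids = list(ids)
    return [ids[i:i + size] for i in range(0, len(ids), size)]

batches = chunk_ids(range(250))
print([len(b) for b in batches])  # [100, 100, 50]

# Each batch can then be joined into a comma-separated string of ids
query = ','.join(str(tweet_id) for tweet_id in batches[2][:3])
print(query)  # 200,201,202
```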

In [8]:
# Method to make User Lookup API request
# Following code is based on Tweepy get_users example [5] and [6].
def get_author_metrics(tweets_data):
    # Reading author ids from the fetched tweets
    print(f'Reading author ids...')
    author_ids = tweets_data['author_id'].values.tolist()
    
    # Splitting author ids into multiple parts to limit each list to 100 ids or less
    # The User lookup api supports request for only 100 ids in one call    
    parts = math.ceil(len(author_ids)/100)    
    author_id_parts = np.array_split(author_ids, parts)
    
    # Using get_users method of Tweepy to use User Lookup API which returns a list of users
    print(f'Requesting user metrics...')
    responses = []
    user_fields = ['public_metrics']
    for ids in author_id_parts:        
        user_response = client.get_users(ids=','.join(str(author_id) for author_id in ids), user_fields=user_fields)
        # Decode the JSON response and append its "data" list to the list of responses
        responses.append(user_response.json()['data'])
        
    # Flattening list of lists to a single list of all user responses
    all_users = [res for response in responses for res in response]
    
    # Transform to pandas Dataframe
    df = pd.json_normalize(all_users)
    
    return df
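
Both lookup functions index `['data']` directly, which raises a `KeyError` if a payload has no `data` key. This can happen, as an assumption about the API's behaviour, when every id in a batch is unavailable and the payload carries only an `errors` list. A minimal sketch of a more tolerant extraction follows; the helper name `extract_data` and the payloads are hand-written for illustration and are not real API output.

```python
# Hypothetical helper: pull the 'data' list out of a decoded JSON payload,
# returning an empty list when the key is absent.
def extract_data(payload):
    return payload.get('data', [])

# Hand-written payloads for illustration only
payloads = [
    {'data': [{'id': '1', 'author_id': '10'}]},
    {'errors': [{'title': 'Not Found Error'}]},
]
records = [item for payload in payloads for item in extract_data(payload)]
print(records)  # [{'id': '1', 'author_id': '10'}]
```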

In [9]:
# Method to save generated dataset files
def save_dataset_files(df_complete, dataset_name):
    print(f'Saving fetched tweets into dataset file for {dataset_name}...')
    # Saving complete data to csv file in disk
    # Note: Filename can be changed. The student id has been added here to file name,
    # to differentiate CrisisMMD files from custom created datasets.  
    df_complete.to_csv(f'{dataset_store_path}/21237189_{dataset_name}.csv', index=False)
    print(f'Save complete for {dataset_name}.\n\n')
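
Writing the csv file fails if the target directory does not exist yet. The sketch below creates the directory first and builds the output path with the same naming scheme used above. The helper name `build_output_path` is hypothetical.

```python
from pathlib import Path

# Hypothetical helper: make sure the output directory exists and return the
# full path of the csv file to write.
def build_output_path(store_dir, dataset_name, prefix='21237189'):
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    return store / f'{prefix}_{dataset_name}.csv'
```

The save step could then call `df_complete.to_csv(build_output_path(dataset_store_path, dataset_name), index=False)`.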

## Calling functions to create dataset for specified event:

In [12]:
print(f'\n{event_name}: Starting process...')
# Making API request to get tweets as per the annotated files
tweets_data = get_tweets(event_file_name)
# Making API request to fetch author information for each tweet
author_data = get_author_metrics(tweets_data)
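# Combining the dataframes to create final dataset
# Joining on author_id rather than by row position, since the user lookup response
# is not guaranteed to contain one row per tweet in the same order as tweets_data
author_followers = (author_data[['id', 'public_metrics.followers_count']]
                    .drop_duplicates('id')
                    .rename(columns={'id': 'author_id'}))
df_complete = tweets_data.merge(author_followers, on='author_id', how='left')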


california_wildfires: Starting process...
Reading tweet ids from dataset file for california_wildfires_final_data.tsv...
Requesting tweets...
Reading author ids...
Requesting user metrics...
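
To illustrate why the follower counts need to be matched to tweets by author rather than by row position, the sketch below joins two toy frames on `author_id`. The frames are shaped like the flattened responses, but every value is made up.

```python
import pandas as pd

# Toy frames shaped like the flattened responses; values are made up
tweets = pd.DataFrame({'id': ['t1', 't2', 't3'], 'author_id': ['a', 'b', 'a']})
authors = pd.DataFrame({'id': ['b', 'a'], 'public_metrics.followers_count': [20, 10]})

# Keep one row per author and rename the key so it matches the tweets frame
followers = (authors[['id', 'public_metrics.followers_count']]
             .drop_duplicates('id')
             .rename(columns={'id': 'author_id'}))
combined = tweets.merge(followers, on='author_id', how='left')
print(combined['public_metrics.followers_count'].tolist())  # [10, 20, 10]
```

A left join keeps every tweet row, and a tweet whose author is missing from the user data gets an empty follower count instead of another author's value.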


In [13]:
# Saving the dataset file to disk
save_dataset_files(df_complete, event_file_name[:-4])

Saving fetched tweets into dataset file for california_wildfires_final_data...
Save complete for california_wildfires_final_data.




# References:

[1] "CrisisMMD: Multimodal crisis dataset". Available: https://crisisnlp.qcri.org/crisismmd

[2] Tweepy. "Examples: API v2: Authentication". Available: https://docs.tweepy.org/en/stable/examples.html

[3] Tweepy. "Examples: API v2: Get Tweets". Available: https://docs.tweepy.org/en/stable/examples.html

[4] Tweepy. "Tweet lookup: get_tweets". Available: https://docs.tweepy.org/en/stable/client.html#tweepy.Client.get_tweets

[5] Tweepy. "Examples: API v2: Get Users". Available: https://docs.tweepy.org/en/stable/examples.html

[6] Tweepy. "User lookup: get_users". Available: https://docs.tweepy.org/en/stable/client.html#tweepy.Client.get_users