# Script for checking whether the tweets in each set of core tweets are labelled informative or not informative in the CrisisMMD dataset, as described in Section 5.3.1.2:

Run all cells to check the ground-truth labels of the core tweets in all network community overlaps for a single climate event from the CrisisMMD dataset [1] and save the results to disk.

##### Note: This step requires the four community overlap files created by the analyse_communities.ipynb script.

## Initialisations:

In [1]:
# Importing python libraries
import pandas as pd
import numpy as np
import itertools

In [2]:
# Initialising directory paths

# Set the following path to the annotated tweets in the dataset downloaded from [1]
labelled_data_path = '../../Data/CrisisMMD/CrisisMMD_v2.0/annotations'

# Path to the directory containing the core tweets for each overlap type, created at the current time
overlap_analysis_path = '../../Data/Overlaps'

# Path to the dataset created at the current time using create_dataset.ipynb
dataset_store_path = '../../Data/TweetCredibilityDatasets'

# Path to save results of the analysis
result_path = '../evaluation/reliability_check'

The following list gives the dataset file names as stored in the annotations folder of the CrisisMMD dataset. Set event_name and event_file_name in the next cell to check the reliability of the tweets in the core tweet sets for a climate event.

1. 'california_wildfires_final_data.tsv'
2. 'hurricane_harvey_final_data.tsv'
3. 'hurricane_irma_final_data.tsv'
4. 'hurricane_maria_final_data.tsv'
5. 'iraq_iran_earthquake_final_data.tsv'
6. 'mexico_earthquake_final_data.tsv'
7. 'srilanka_floods_final_data.tsv'

In [20]:
# Set the event name and file name of the climate event whose core tweets are to be checked
event_name = 'california_wildfires'
event_file_name = 'california_wildfires_final_data.tsv'

In [21]:
# Setting the file names for overlaps containing core tweets for a given event
overlap_files = ['author_url_retweets_tweets', 
                 'followers_url_retweets_tweets', 
                 'author_followers_retweets_tweets', 
                 'author_url_followers_tweets']
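
Because the check depends on the four overlap files mentioned in the note at the top, a pre-flight check along the following lines may help catch a missing file early. This is only a sketch and not part of the original notebook: `missing_overlap_files` is a hypothetical helper, and the `{event_name}_{overlap}.csv` naming is assumed to match the pattern used when the overlap files are read.

```python
from pathlib import Path


def missing_overlap_files(overlap_dir, event, overlap_names):
    """Return the expected overlap CSV paths that are not present on disk.

    Hypothetical helper: assumes files are named '<event>_<overlap>.csv'.
    """
    overlap_dir = Path(overlap_dir)
    expected = [overlap_dir / f"{event}_{name}.csv" for name in overlap_names]
    return [path for path in expected if not path.is_file()]


# Possible usage with the variables defined in the cells above:
# missing = missing_overlap_files(overlap_analysis_path, event_name, overlap_files)
# if missing:
#     print("Run analyse_communities.ipynb first; missing:", missing)
```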

## Defining functions to read core tweets and annotated tweets, and to check the labels of tweets:

In [22]:
# Method to read tweets in the set of core tweets for a given overlap type
def read_overlap_tweets(filename):
    return pd.read_csv(f'{overlap_analysis_path}/{event_name}_{filename}.csv')

In [23]:
# Method to read annotated files
def read_labelled_tweets(file_name, event_name):
    # Reading tweets data from the CSV file created using create_dataset.ipynb
    tweets_data = pd.read_csv(f'{dataset_store_path}/21237189_{event_name}_final_data.csv')
    # Removing duplicate rows
    tweets_data = tweets_data.drop_duplicates(subset=['id']).reset_index()
    # Reading only the annotation columns needed for the check from the tab-separated file
    annotated_tweets = pd.read_csv(f'{labelled_data_path}/{file_name}', sep='\t',
                                   usecols=['tweet_id', 'text_info', 'tweet_text'])
    # Keeping only the annotated tweets that are present in the created dataset
    return annotated_tweets[annotated_tweets['tweet_id'].isin(tweets_data['id'].values)].copy()
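
To illustrate the filtering step in `read_labelled_tweets`, the following sketch applies the same de-duplication and `isin` filter to two small frames. The data is invented for illustration only; the column names follow the ones used above.

```python
import pandas as pd

# Hypothetical toy frames standing in for the created dataset and the annotation file
tweets_data = pd.DataFrame({"id": [101, 102, 102, 104]})
annotated = pd.DataFrame({
    "tweet_id": [101, 102, 103],
    "text_info": ["informative", "not_informative", "informative"],
    "tweet_text": ["a", "b", "c"],
})

# Drop repeated ids, then keep only annotations whose tweet_id is in the dataset
deduplicated = tweets_data.drop_duplicates(subset=["id"])
kept = annotated[annotated["tweet_id"].isin(deduplicated["id"])].copy()
kept_ids = kept["tweet_id"].tolist()
```

Here tweet 103 is annotated but absent from the toy dataset, so it is filtered out, while tweet 104 has no annotation and never enters the result.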

In [24]:
# Method to check the labels of core tweets and 
# extract the number of informative and not informative tweets captured in each core tweets set
def check_informativeness(overlap_file_name):
    informativeness_results = {}
    overlap_tweets = read_overlap_tweets(overlap_file_name).copy()
    overlap_tweet_ids = overlap_tweets['id'].values
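    # informative_clusters is a global defined in the run section below; groupby sorts its rows
    # by label, so row 0 is assumed to hold the 'informative' ids and row 1 the 'not_informative' ids
    info_matches = set(overlap_tweet_ids) & set(informative_clusters['tweet_id'].values[0])
    not_info_matches = set(overlap_tweet_ids) & set(informative_clusters['tweet_id'].values[1])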
    informativeness_results['community overlap name'] = overlap_file_name    
    informativeness_results['total tweets'] = len(info_matches) + len(not_info_matches)
    informativeness_results['informative tweets'] = len(info_matches)
    informativeness_results['not informative tweets'] = len(not_info_matches)
    return informativeness_results
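
The counting logic in `check_informativeness` comes down to two set intersections. A minimal pure-Python sketch with invented ids is shown below; `count_labels` is a hypothetical name used only for this illustration.

```python
def count_labels(core_ids, informative_ids, not_informative_ids):
    """Count how many core tweets fall under each annotation label."""
    core = set(core_ids)
    informative = core & set(informative_ids)
    not_informative = core & set(not_informative_ids)
    return {
        # Only core tweets that carry an annotation contribute to the total
        "total tweets": len(informative) + len(not_informative),
        "informative tweets": len(informative),
        "not informative tweets": len(not_informative),
    }


# Tweet 4 is a core tweet without an annotation, so it is not counted
example = count_labels([1, 2, 3, 4], [1, 2, 9], [3])
```

Note that 'total tweets' is therefore the number of labelled core tweets, which can be smaller than the size of the core tweet set.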

## Running the tweet reliability check for a given climate event:

In [25]:
# Reading all tweets in dataset along with their informative/not informative labels 
labelled_tweets = read_labelled_tweets(event_file_name, event_name)
labelled_tweets = labelled_tweets.drop_duplicates(subset=['tweet_id'])

# Grouping tweet ids by their informativeness label (labels that occur only once are dropped)
repeated_labels = labelled_tweets.duplicated('text_info', keep=False)
informative_clusters = (labelled_tweets[repeated_labels]
                        .groupby('text_info')['tweet_id'].apply(list).reset_index())

# Running the reliability check for all core tweet sets for a given climate event
results = []
for overlap in overlap_files:
    results.append(check_informativeness(overlap))
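
The positional lookups `values[0]` and `values[1]` in `check_informativeness` depend on the row order of `informative_clusters`. One possible alternative, sketched here on invented data, is to key the grouped ids by label so the lookup no longer depends on ordering. The label strings `'informative'` and `'not_informative'` are an assumption about the `text_info` column and should be checked against the annotation files.

```python
import pandas as pd

# Hypothetical toy annotations; the real ones come from read_labelled_tweets
labelled = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "text_info": ["not_informative", "informative", "informative", "not_informative"],
})

# Map each label to the list of tweet ids that carry it
ids_by_label = labelled.groupby("text_info")["tweet_id"].apply(list).to_dict()

# Assumed label strings; .get() returns an empty list if a label is absent
informative_ids = ids_by_label.get("informative", [])
not_informative_ids = ids_by_label.get("not_informative", [])
```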

In [26]:
# Saving the results
results_df = pd.DataFrame(results)
print(results_df.to_markdown())
results_df.to_csv(f'{result_path}/{event_name}_informativeness_table.csv')

|    | community overlap name           |   total tweets |   informative tweets |   not informative tweets |
|---:|:---------------------------------|---------------:|---------------------:|-------------------------:|
|  0 | author_url_retweets_tweets       |              3 |                    3 |                        0 |
|  1 | followers_url_retweets_tweets    |              4 |                    4 |                        0 |
|  2 | author_followers_retweets_tweets |             69 |                   59 |                       10 |
|  3 | author_url_followers_tweets      |             41 |                   27 |                       14 |
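
To compare the overlaps on a common scale, the counts in the table can be converted into the share of informative tweets per overlap. The sketch below hard-codes the four rows of the table above; in the notebook the same values could be read from `results_df` instead.

```python
# (overlap name, total tweets, informative tweets), copied from the table above
rows = [
    ("author_url_retweets_tweets", 3, 3),
    ("followers_url_retweets_tweets", 4, 4),
    ("author_followers_retweets_tweets", 69, 59),
    ("author_url_followers_tweets", 41, 27),
]

# Percentage of labelled core tweets that are informative, rounded to one decimal
informative_share = {
    name: round(100 * informative / total, 1)
    for name, total, informative in rows
}

for name, share in informative_share.items():
    print(f"{name}: {share}% informative")
```

The two smallest overlaps contain only informative tweets, but with 3 and 4 labelled tweets respectively these shares rest on very few samples.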


# References:

[1] "CrisisMMD: Multimodal Crisis Dataset," [Online]. Available: https://crisisnlp.qcri.org/crisismmd