# Script to create tweet networks for different relationships and save graphs as described in Section 5.2:

Run all cells to generate the four network files for a single climate event from the CrisisMMD dataset [1] and save them to disk as GML files.

##### Note: This step requires the dataset created by the create_dataset.ipynb script.

## Initialisations:

In [1]:
# Importing Python libraries
import networkx as nx # Python library for creating networks, requires pip install networkx
import pandas as pd
import numpy as np
import itertools
import requests

In [2]:
# Setting paths to required directories on disk

# Set following path to directory that stores datasets created using create_dataset.ipynb
dataset_store_path = '../../Data/TweetCredibilityDatasets'

# Set following path to directory to store the networks created for a climate event specified
network_store_path = '../../Data/Networks' 

The following climate event names match the files stored in the annotations folder of the CrisisMMD dataset. Set event_name in the next cell to select the dataset for which the networks are created.

1. 'california_wildfires'
2. 'hurricane_harvey'
3. 'hurricane_irma'
4. 'hurricane_maria'
5. 'iraq_iran_earthquake'
6. 'mexico_earthquake'
7. 'srilanka_floods'

In [3]:
# Set the event name of climate event for which the networks are to be generated
event_name = 'california_wildfires'

## Defining functions to get list of nodes, list of edges for different relationships, creating and storing all four networks:

### Network Nodes:

In [4]:
# Method that reads the tweet ids from dataframe of the specified climate event dataset,
# and returns a list of tweet ids stored as strings to be used as network nodes
def get_network_nodes(tweets_df):
    return [str(tweet_id) for tweet_id in tweets_df['id'].values]

### Network Edges: Relationship - Same Author

In [5]:
# Method to generate list of edges for the same author relationship
def get_author_edges(tweets_df):
    # Grouping tweet ids by same author
    same_author_tweets = tweets_df[tweets_df.duplicated('author_id', keep=False
                                         )].groupby('author_id')['id'].apply(list).reset_index()
    
    # Generating edges between each node that has same author
    edge_list = []
    for tweets in same_author_tweets['id']:
        edge_combinations = list(itertools.combinations(tweets, 2))
        edge_combinations = [(str(x), str(y)) for x, y in edge_combinations]
        edge_list.append(edge_combinations)
    
    # Returning list of edges without duplicates stored as list of tuples,
    # each tuple contains two tweet nodes to be connected
    return list(set(list(itertools.chain(*edge_list))))
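
The grouping step above can be illustrated on a toy table. This is a minimal sketch, not part of the original pipeline: the tweet ids and author ids are invented for illustration, and only the duplicated/groupby/combinations pattern is taken from get_author_edges.

```python
import itertools
import pandas as pd

# Hypothetical toy data: three tweets by author 'a', one by author 'b'
toy_df = pd.DataFrame({
    'id': [101, 102, 103, 104],
    'author_id': ['a', 'a', 'b', 'a'],
})

# Keep only authors with more than one tweet, then collect their tweet ids
grouped = (toy_df[toy_df.duplicated('author_id', keep=False)]
           .groupby('author_id')['id'].apply(list))

# Every pair of tweets by the same author becomes one undirected edge
toy_edges = [(str(x), str(y))
             for ids in grouped
             for x, y in itertools.combinations(ids, 2)]
print(toy_edges)
# [('101', '102'), ('101', '104'), ('102', '104')]
```

Tweet 103 has a unique author, so it produces no edge and would remain an isolated node in the network.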

### Network Edges: Relationship - Same URL

In [6]:
# Method to generate list of edges for the same URL relationship
import ast  # Standard library; parses stored list literals without executing code

def get_url_edges(tweets_data):
    # Fetching all urls contained in the tweet stored in entities.urls,
    # and saving them as list of urls for a given tweet
    expanded_urls = []
    for index, row in tweets_data.iterrows():
        if isinstance(row['entities.urls'], str):
            # literal_eval only accepts Python literals, unlike eval
            urls_col = ast.literal_eval(row['entities.urls'])
            urls = []
            if isinstance(urls_col, list):
                for item in urls_col:
                    urls.append(item['expanded_url'])
            expanded_urls.append(urls)
        else:
            expanded_urls.append([])
    tweets_data['expanded_urls'] = expanded_urls
    
    # Comparing each tweet's urls list to generate 
    # edge between all tweets that share an url
    edge_list = []
    for index, row in tweets_data.iterrows():
        for url in row['expanded_urls']:
            for idx, _row in tweets_data.iterrows():
                if index != idx and url in _row['expanded_urls']:
                    edge_list.append((str(row['id']), str(_row['id'])))
    
    # Returning list of edges without duplicates stored as list of tuples,
    # each tuple contains two tweet nodes to be connected
    return list(set(edge_list))
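
The pairwise comparison above scans every tweet once per URL, which can become slow on larger events. As a possible alternative, the sketch below first indexes tweets by URL and then pairs the tweets within each bucket. The helper name url_edges_from_lists and the example URLs are hypothetical; unlike get_url_edges, it emits each pair once rather than in both directions, which is equivalent for an undirected nx.Graph.

```python
import itertools
from collections import defaultdict

def url_edges_from_lists(tweet_ids, url_lists):
    """Hypothetical helper: pair up tweets that share at least one URL."""
    # Map each URL to the set of tweets that contain it
    tweets_by_url = defaultdict(set)
    for tweet_id, urls in zip(tweet_ids, url_lists):
        for url in urls:
            tweets_by_url[url].add(str(tweet_id))

    # Connect every pair of tweets listed under the same URL
    edges = set()
    for ids in tweets_by_url.values():
        edges.update(itertools.combinations(sorted(ids), 2))
    return sorted(edges)

# Toy input: tweets 1 and 2 share a URL, tweet 3 has none
toy_url_edges = url_edges_from_lists(
    [1, 2, 3],
    [['https://example.com/a'],
     ['https://example.com/a', 'https://example.com/b'],
     []],
)
print(toy_url_edges)
# [('1', '2')]
```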

### Network Edges: Relationship - Similar Retweet Count

In [7]:
# Method to generate list of edges for the Similar Retweet Count relationship
def get_retweet_edges(tweets_df):
    # Creating frequency table from retweet column
    count_dict = dict(tweets_df['public_metrics.retweet_count'].value_counts())
    
    # Creating groups of similar retweet counts based on frequency table
    splits = np.array_split(sorted(count_dict), len(set(count_dict.values())))
    
    # Grouping tweet ids with similar retweet counts
    similar_retweets = []
    for split in splits: 
        similar_counts = []
        for index, row in tweets_df.iterrows():
            if row['public_metrics.retweet_count'] in split:
                similar_counts.append(str(row['id']))
        similar_retweets.append(similar_counts)
        
    # Creating edge list based on similar retweet counts
    edge_list = []
    for tweets in similar_retweets:
        edge_list.append(list(itertools.combinations(tweets, 2)))
    
    # Returning list of edges without duplicates stored as list of tuples,
    # each tuple contains two tweet nodes to be connected
    return list(set(list(itertools.chain(*edge_list))))
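
To make the grouping rule concrete, the sketch below runs the same value_counts and np.array_split steps on a made-up series of retweet counts. The numbers are invented for illustration; the point is that the number of groups equals the number of distinct frequencies in the frequency table, and the sorted distinct counts are split into that many nearly equal chunks.

```python
import numpy as np
import pandas as pd

# Hypothetical retweet counts for nine tweets
toy_counts = pd.Series([1, 1, 1, 2, 2, 5, 9, 9, 20])

# Frequency table: retweet count -> number of tweets with that count
count_dict = dict(toy_counts.value_counts())

# Distinct frequencies here are {3, 2, 1}, so three groups are created
n_groups = len(set(count_dict.values()))

# Sorted distinct counts [1, 2, 5, 9, 20] are split into three chunks
splits = np.array_split(sorted(count_dict), n_groups)
groups = [split.tolist() for split in splits]
print(groups)
# [[1, 2], [5, 9], [20]]
```

Tweets whose retweet count falls in the same chunk are then fully connected, so a tweet with 5 retweets and one with 9 retweets would share an edge in this toy case.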

### Network Edges: Relationship - Similar Author Followers Count

In [8]:
# Method to generate list of edges for the Similar Tweet Author Followers Count relationship
def get_followers_edges(tweets_df):    
    # Creating frequency table from followers count column
    count_dict = dict(tweets_df['public_metrics.followers_count'].value_counts())
    
    # Creating groups of similar followers count based on frequency table
    splits = np.array_split(sorted(count_dict), len(set(count_dict.values())))
    
    # Grouping tweet ids with similar author follower counts
    similar_followers = []
    for split in splits:
        similar_counts = []
        for index, row in tweets_df.iterrows():
            if row['public_metrics.followers_count'] in split:
                similar_counts.append(str(row['id']))
        similar_followers.append(similar_counts)

    # Creating edge list based on similar author follower counts
    edge_list = []
    for tweets in similar_followers:
        edge_list.append(list(itertools.combinations(tweets, 2)))
        
    # Returning list of edges without duplicates stored as list of tuples,
    # each tuple contains two tweet nodes to be connected
    return list(set(list(itertools.chain(*edge_list))))

### Networks: 

In [9]:
# Method to create specified network and save as a gml file
def create_network(relationship, climate_event, event_data):    
    # Initialising empty networkx graph
    Tweets_G = nx.Graph()
    
    # Adding tweet nodes to the graph
    print(f'\nGetting and adding nodes to {relationship} network for {climate_event}..')
    Tweets_G.add_nodes_from(get_network_nodes(event_data))  
    
    # Adding edges based on selected relationship to the graph
    print(f'Getting and adding edges to {relationship} network for {climate_event}..')
    if relationship == 'author':
        Tweets_G.add_edges_from(get_author_edges(event_data))
    elif relationship == 'url':
        Tweets_G.add_edges_from(get_url_edges(event_data))    
    elif relationship == 'retweet_count':
        Tweets_G.add_edges_from(get_retweet_edges(event_data))
    elif relationship == 'followers':
        Tweets_G.add_edges_from(get_followers_edges(event_data))
    else:
        print("Invalid Relationship. Accepted relationships are: 'author'/'url'/'retweet_count'/'followers'")
        return
    
    # Saving the graph in gml format on disk    
    nx.write_gml(Tweets_G, f"{network_store_path}/{climate_event}_{relationship}.gml")
    print("Graph saved.")
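
A saved network can be checked by reading it back with nx.read_gml. The following is a minimal sketch on a toy graph written to a temporary directory; the file name and node ids are hypothetical and stand in for a file produced by create_network.

```python
import os
import tempfile
import networkx as nx

# Toy graph: three tweet nodes, one edge, one isolated node
toy_graph = nx.Graph()
toy_graph.add_nodes_from(['101', '102', '103'])
toy_graph.add_edges_from([('101', '102')])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name following the {event}_{relationship}.gml pattern
    path = os.path.join(tmp_dir, 'toy_event_author.gml')
    nx.write_gml(toy_graph, path)

    # Reading the file back restores nodes (by label) and edges
    loaded = nx.read_gml(path)

print(loaded.number_of_nodes(), loaded.number_of_edges())
```

Isolated nodes are kept in the GML file, so the node count of a loaded network should match the number of tweets passed to create_network.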

## Reading dataset for climate event specified and generating the four networks:

In [11]:
# Reading tweets data from csv files created using create_dataset.ipynb
tweets_data = pd.read_csv(f'{dataset_store_path}/21237189_{event_name}_final_data.csv')

# Removing duplicate tweet id rows
tweets_data = tweets_data.drop_duplicates(subset=['id']).reset_index(drop=True)

In [12]:
# Filtering tweets to remove tweets with zero retweet counts as described in Section 5.2.3
tweets_data_retweets = tweets_data.copy()[tweets_data['public_metrics.retweet_count'].values != 0]

# Filtering tweets to remove tweets with zero author followers counts as described in Sections 5.2.1 and 5.2.4
tweets_data_followers = tweets_data.copy()[tweets_data['public_metrics.followers_count'].values != 0]

In [13]:
# Creating tweet network with same author relationship
create_network('author', event_name, tweets_data_followers)

# Creating tweet network with same url relationship
create_network('url', event_name, tweets_data)

# Creating tweet network with similar retweet count relationship
create_network('retweet_count', event_name, tweets_data_retweets)

# Creating tweet network with author followers relationship
create_network('followers', event_name, tweets_data_followers)


Getting and adding nodes to author network for california_wildfires..
Getting and adding edges to author network for california_wildfires..

Getting and adding nodes to url network for california_wildfires..
Getting and adding edges to url network for california_wildfires..

Getting and adding nodes to retweet_count network for california_wildfires..
Getting and adding edges to retweet_count network for california_wildfires..

Getting and adding nodes to followers network for california_wildfires..
Getting and adding edges to followers network for california_wildfires..


# References:

[1] "CrisisMMD: Multimodal crisis dataset," [Online]. Available: https://crisisnlp.qcri.org/crisismmd
