# Script for calculating similarity between the core tweets and other tweet texts as described in Section 5.3.2:

Run all cells to generate the similarity scores and evaluation scores for a given network community overlap and a single climate event from the CrisisMMD dataset [1], and to save them to disk.

##### Note: This step requires the four community overlap files created by the analyse_communities.ipynb script.


## Initialisations:

In [1]:
# Importing Python libraries
import pandas as pd
import numpy as np
import itertools
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Initialising directory paths

# Set the following path to the annotated tweets in the dataset downloaded from [1]
labelled_data_path = '../../Data/CrisisMMD/CrisisMMD_v2.0/annotations'

# Path to the directory containing the core tweets for each overlap type, created using analyse_communities.ipynb
overlap_analysis_path = '../../Data/Overlaps'

# Path to the dataset created using create_dataset.ipynb
dataset_store_path = '../../Data/TweetCredibilityDatasets'

# Set the following path to the directory in which to store the evaluation results
evaluation_path = '../evaluation/thesis_results'

The following dataset file names match the files stored in the annotations folder of the CrisisMMD dataset. Set event_name and event_file_name in the next cell to run similarity score generation and evaluation for a climate event.

1. 'california_wildfires_final_data.tsv'
2. 'hurricane_harvey_final_data.tsv'
3. 'hurricane_irma_final_data.tsv'
4. 'hurricane_maria_final_data.tsv'
5. 'iraq_iran_earthquake_final_data.tsv'
6. 'mexico_earthquake_final_data.tsv'
7. 'srilanka_floods_final_data.tsv'

In [3]:
# Set the event name and file name of the climate event for which the similarity scores are to be calculated

event_name = 'california_wildfires'
event_file_name = 'california_wildfires_final_data.tsv'

# event_name = 'hurricane_harvey'
# event_file_name = 'hurricane_harvey_final_data.tsv'

# event_name = 'hurricane_irma'
# event_file_name = 'hurricane_irma_final_data.tsv'

# event_name = 'hurricane_maria'
# event_file_name = 'hurricane_maria_final_data.tsv'

# event_name = 'iraq_iran_earthquake'
# event_file_name = 'iraq_iran_earthquake_final_data.tsv'

# event_name = 'mexico_earthquake'
# event_file_name = 'mexico_earthquake_final_data.tsv'

# event_name = 'srilanka_floods'
# event_file_name = 'srilanka_floods_final_data.tsv'

#### The following code calculates the credibility scores and evaluation scores for one tweet overlap at a time, for the event set in the previous cell. Enter the name of the overlap whose core tweets are loaded as the gold-standard collection, against which all other tweets are scored:

**Possible overlap file names:**
1. author_url_retweets_tweets
2. followers_url_retweets_tweets
3. author_followers_retweets_tweets
4. author_url_followers_tweets

**Note:** For the Iraq Iran Earthquake dataset, only the author_followers_retweets overlap contains core tweets; the others are empty, as explained in Section 7.1. The only overlap that produces results is therefore author_followers_retweets_tweets, and the other overlaps raise an error.

In [4]:
overlap_file_name = input("Enter required overlap name:")

Enter required overlap name:author_url_retweets_tweets
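
Because a mistyped overlap name only fails later, when the CSV file is read, it can help to check the input straight away. The following is a minimal sketch, not part of the original notebook: `VALID_OVERLAPS` and `validate_overlap_name` are hypothetical names introduced here, and the allowed values are simply the four overlap names listed above.

```python
# Overlap names as listed above; the constant and helper names are hypothetical.
VALID_OVERLAPS = (
    "author_url_retweets_tweets",
    "followers_url_retweets_tweets",
    "author_followers_retweets_tweets",
    "author_url_followers_tweets",
)


def validate_overlap_name(name):
    """Return the stripped overlap name, or raise ValueError if it is not listed."""
    cleaned = name.strip()
    if cleaned not in VALID_OVERLAPS:
        raise ValueError(
            f"Unknown overlap name {cleaned!r}; expected one of: {', '.join(VALID_OVERLAPS)}"
        )
    return cleaned


# Surrounding whitespace from interactive input is tolerated
print(validate_overlap_name(" author_url_retweets_tweets "))
```

In the notebook this could wrap the `input(...)` call, so that an invalid name stops execution before any file is read.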


## Similarity Score Calculation:

### Read overlap tweets for given overlap file name:

In [5]:
# Method to read the set of overlapping (core) tweets obtained by experimenting
# with different relationship networks in analyse_communities.ipynb.
def read_overlap_tweets(filename):
    return pd.read_csv(f'{overlap_analysis_path}/{event_name}_{filename}.csv', index_col=0)

In [6]:
overlap_tweets = read_overlap_tweets(overlap_file_name).copy()
# Setting the core tweets corpus to tweet texts
corpus = overlap_tweets['text'].values

### Calculating tf-idf scores for the tweet texts in the set of core tweets given by the community overlap file:

In [7]:
# The following use of the tf-idf vectorizer is based on [2] and [3]
# The TfidfVectorizer has parameters that provide text pre-processing such as stop word removal,
# use of different tokenizers, and using either word or character n-grams.
# After experimenting with different parameters, the following setting
# worked best for all the given datasets

# Initialising the vectorizer for generating character n-gram features
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 4))
n_grams = '2_4'
# Generating character n-gram features as character bigram, trigram and four-gram tf-idf values
# using the core tweets obtained in the overlap as the vocabulary
corpus_tfidf = vectorizer.fit_transform(corpus)

### Calculating Similarity of all Tweets with respect to core tweets using COSINE Similarity:

In [8]:
# Method to read labelled data from the original CrisisMMD files for the tweets in the dataset created using create_dataset.ipynb
def read_labelled_tweets(file_name):
    # Reading tweets data from csv file created using create_dataset.ipynb
    tweets_data = pd.read_csv(f'{dataset_store_path}/21237189_{event_name}_final_data.csv')
    # Removing duplicate tweet id rows
    tweets_data = tweets_data.drop_duplicates(subset=['id']).reset_index()
    # Reading annotated tweets from CrisisMMD annotations file for given climate event
    # text_info contains the informative/not informative labels
    annotated_tweets = pd.read_csv(f'{labelled_data_path}/{file_name}', sep='\t',
                                   usecols=['tweet_id', 'text_info', 'tweet_text'])
    # Keeping only the annotated tweets that are present in the created dataset
    return annotated_tweets[annotated_tweets['tweet_id'].isin(tweets_data['id'].values)]

In [9]:
# Reading all tweets in dataset along with their informative/not informative labels 
# for calculating credibility scores for all tweets in the dataset
test_tweets = read_labelled_tweets(event_file_name).copy()
# Removing Duplicates
test_tweets = test_tweets.drop_duplicates(subset=['tweet_id'])
test_corpus = test_tweets['tweet_text'].values

In [10]:
# Getting Tf-Idf representation of all tweets based on overlapping tweet corpus
test_tfidf = vectorizer.transform(test_corpus) 

In [11]:
# Getting cosine similarity scores of all tweets in dataset with all tweets in the core tweet set
similarities = cosine_similarity(test_tfidf, corpus_tfidf)

In [12]:
# Taking the mean similarity of each tweet over all core tweets (row-wise mean of the similarity matrix)
mean_similarities = similarities.mean(axis=1)
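
The scoring step can be checked by hand on a tiny example. The sketch below uses made-up two-dimensional vectors in place of the tf-idf matrices (the names `core`, `test`, and `sims` are illustrative). Each row of the similarity matrix belongs to one scored tweet and each column to one core tweet, so the per-tweet score is a row-wise mean.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two "core" vectors and three "test" vectors (made-up values)
core = np.array([[1.0, 0.0], [0.0, 1.0]])
test = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])

# One row per test vector, one column per core vector
sims = cosine_similarity(test, core)
print(sims.shape)

# Row-wise mean; equivalent to averaging each row in a Python loop
row_means = sims.mean(axis=1)
loop_means = [np.mean(row) for row in sims]
print(np.allclose(row_means, loop_means))

# The all-zero test vector has similarity 0 to every core vector
print(row_means)
```

Here the first test vector matches one core vector exactly and is orthogonal to the other, so its mean similarity is 0.5.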

In [13]:
# Mapping each tweet id to its mean similarity score
tweet_scores = dict(zip(test_tweets['tweet_id'].values, mean_similarities))

In [14]:
# Saving similarity scores to disk
open(f"{evaluation_path}/{event_name}/{overlap_file_name[:-7]}_similarity_scores_{n_grams}.txt", "w"
        ).write(repr(tweet_scores))

43730
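
The cell above stores `repr(tweet_scores)`, and the number shown is the count of characters written. If the scores need to be read back later, note that from NumPy 2.0 onwards the `repr` of a NumPy scalar includes its type name (for example `np.float64(0.5)`), which is not a plain Python literal. As an alternative sketch, not the notebook's method, the scores could be written as JSON with plain Python types. The function names and the `.json` file name below are hypothetical.

```python
import json
import os
import tempfile


def save_scores_json(scores, path):
    """Write {tweet_id: score} as JSON, casting keys to str and values to float."""
    plain = {str(tweet_id): float(score) for tweet_id, score in scores.items()}
    with open(path, "w") as handle:
        json.dump(plain, handle)


def load_scores_json(path):
    """Read the mapping back, restoring integer tweet ids."""
    with open(path) as handle:
        return {int(tweet_id): score for tweet_id, score in json.load(handle).items()}


# Round trip with made-up ids and scores, written to a temporary directory
example_scores = {101: 0.25, 102: 0.5}
example_path = os.path.join(tempfile.mkdtemp(), "similarity_scores.json")
save_scores_json(example_scores, example_path)
print(load_scores_json(example_path) == example_scores)
```

The `str` and `float` casts also accept NumPy integer and floating-point scalars, so the same helper would work on a dictionary built from NumPy values.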

## EVALUATION:

### Obtaining predictions based on threshold:

In [15]:
# Setting threshold to mean of similarities
threshold = np.mean(mean_similarities)

In [16]:
# Setting the predictions based on threshold along with the actual labels for each tweet
predictions = {}
for key, value in tweet_scores.items():
    actual_label = test_tweets[test_tweets['tweet_id']==key]['text_info'].values[0]
    if value > threshold:        
        predictions[key] = ['informative', actual_label]
    else:
        predictions[key] = ['not_informative', actual_label]    
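
The thresholding rule can be illustrated on a handful of made-up scores. This sketch uses `np.where` as a vectorised equivalent of the loop above; the variable names are illustrative and the scores are not taken from the dataset.

```python
import numpy as np

# Made-up mean similarity scores for four tweets
toy_scores = np.array([0.10, 0.40, 0.25, 0.05])

# Threshold is the mean of all scores, as in the notebook (0.2 here)
toy_threshold = toy_scores.mean()

# Strictly greater than the threshold means 'informative'
toy_predictions = np.where(toy_scores > toy_threshold, 'informative', 'not_informative')
print(toy_predictions.tolist())
# → ['not_informative', 'informative', 'informative', 'not_informative']
```

Because the comparison is strict, a tweet whose score equals the threshold exactly is labelled `not_informative`.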

### Getting the precision, recall, and f1 scores:

In [17]:
# Importing Scikit Learn Metrics functions
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix

# Method to run evaluation metrics and save results on disk
def evaluation_metrics(predictions):    
    
    # Separating actual labels from predicted labels    
    predicted_label = np.array(list(predictions.values()))[:, 0]
    actual_label = np.array(list(predictions.values()))[:, 1]
    
    # Getting the confusion matrix for the predictions
    conf_mat = confusion_matrix(actual_label, predicted_label, labels=["informative", "not_informative"])
    confusion_mat = {}
    confusion_mat['informative'] = conf_mat[0]
    confusion_mat['not_informative'] = conf_mat[1]
    confusion_mat['True Positives'] = conf_mat.ravel()[0]
    confusion_mat['False Negatives'] = conf_mat.ravel()[1]
    confusion_mat['False Positives'] = conf_mat.ravel()[2]
    confusion_mat['True Negatives'] = conf_mat.ravel()[3]
    
    print("Confusion Matrix:")
    print(pd.json_normalize(confusion_mat))
    
    # Saving the confusion matrix to file on disk
    open(f"{evaluation_path}/{event_name}/{overlap_file_name[:-7]}_confusion_matrix_{n_grams}.txt", "w"
        ).write(repr(confusion_mat))
    
    # Getting Classification Report   
    class_labels = ["informative", "not_informative"]
    classification_rep = classification_report(actual_label, predicted_label, labels=class_labels, target_names=class_labels)
    print("\n Classification Report")
    print(classification_rep)
    
    # Getting individual precision, recall and f1 score for relevant class
    all_metrics = {}
    all_metrics['precision'] = round(precision_score(actual_label, predicted_label, pos_label='informative'), 4)
    all_metrics['recall'] = round(recall_score(actual_label, predicted_label, pos_label='informative'), 4)
    all_metrics['f1'] = round(f1_score(actual_label, predicted_label, pos_label='informative'), 4)
    
    print("\nEvaluation Scores for Relevant Tweets Retrieved:")
    print(pd.json_normalize(all_metrics))
    
    # Saving the evaluation scores to file on disk
    open(f"{evaluation_path}/{event_name}/{overlap_file_name[:-7]}_evaluation_scores_{n_grams}.txt", "w"
        ).write(repr(all_metrics))

### Running evaluation for current set of predictions obtained for a given community overlap, for a given climate event:

In [18]:
evaluation_metrics(predictions)

Confusion Matrix:
  informative not_informative  True Positives  False Negatives  \
0  [577, 268]      [106, 133]             577              268   

   False Positives  True Negatives  
0              106             133  

 Classification Report
                 precision    recall  f1-score   support

    informative       0.84      0.68      0.76       845
not_informative       0.33      0.56      0.42       239

       accuracy                           0.65      1084
      macro avg       0.59      0.62      0.59      1084
   weighted avg       0.73      0.65      0.68      1084


Evaluation Scores for Relevant Tweets Retrieved:
   precision  recall      f1
0     0.8448  0.6828  0.7552
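
As a cross-check, the reported scores can be recomputed directly from the four confusion matrix counts printed above, with `informative` as the positive class. This is a plain-Python sketch using the standard definitions of precision, recall, F1, and accuracy.

```python
# Counts taken from the confusion matrix output above
tp, fn, fp, tn = 577, 268, 106, 133

precision = tp / (tp + fp)                          # 577 / 683
recall = tp / (tp + fn)                             # 577 / 845
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)          # 710 / 1084

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 2))
# → 0.8448 0.6828 0.7552 0.65
```

The values agree with the `informative` row and the accuracy line of the classification report.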


# References:

[1] "CrisisMMD: Multimodal Crisis Dataset," [Online]. Available: https://crisisnlp.qcri.org/crisismmd

[2] U. Malik, "Python for NLP: Sentiment Analysis with Scikit-Learn," 2022. [Online]. Available: https://stackabuse.com/python-for-nlp-sentiment-analysis-with-scikit-learn/

[3] Scikit-Learn, "sklearn.feature_extraction.text.TfidfVectorizer," [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html