# Gold-Standard Event Summaries for CrisisFACTS

__Author__: Cody Buntain (cbuntain@umd.edu)

## Description

For automated evaluation in CrisisFACTS, we compare participant-system summaries to three additional sources of event summaries:

1. Wikipedia - A simple summary of each event, though we expect these summaries to be of limited use for situational awareness, attention support, or decision making.

2. ICS 209 Archive - A dataset of real daily hazard reports, gathered from Lise St. Denis. This data comes from a pre-release version of their updated NIMS database.

    - St. Denis, L., Short, K., McConnell, K., Cook, M., Mietkiewicz, N., Buckland, M. & Balch, J. All-hazards dataset mined from the US National Incident Management System 1999–2012. Sci Data (2022).
 
    - St. Denis, L., Short, K., McConnell, K., Cook, M., Mietkiewicz, N., Buckland, M. & Balch, J. All-hazards dataset mined from the US National Incident Management System 1999-2020. figshare. https://doi.org/10.6084/m9.figshare.19858927 (2022).
    
3. NIST Assessor Summaries - A dataset of event summaries generated by NIST assessors, where CrisisFACTS coordinators asked NIST assessors to identify and timestamp important facts from each event.

## Notes

You _cannot_ run this code to generate the ICS 209 summaries yet, as the underlying dataset has not been publicly released. As soon as it is available, we will update this code appropriately.

In [1]:
import json
import gzip
import requests

import pandas as pd

In [None]:
# Install the wikipedia package, so we can get Wikipedia summaries
!pip install wikipedia

In [3]:
import wikipedia

## Pre-built data

We have selected topics for CrisisFACTS 2022 and matched them to the ICS 209 dataset from L. St. Denis et al., which we read in here.

In [4]:
topics = None
with open("CrisisFACTs-2022.topics.json", "r") as in_file:
    topics = json.load(in_file)

In [5]:
topic_to_ics = None
with open("CrisisFACTs-2022.topics.to.ic209.json", "r") as in_file:
    topic_to_ics = json.load(in_file)
topic_to_ics_map = {row["eventID"]:row["icsID"] for row in topic_to_ics}

In [7]:
# Add the ICS 209 report ID set (can be multiple) for each event

for topic in topics:
    e_id = topic["eventID"]
    topic["icsID"] = topic_to_ics_map[e_id]

## Get Wikipedia Summaries

For each topic, we have a Wikipedia URL already provided. Use that page to get Wikipedia's own summary.

In [8]:
# Every topic has a URL to a Wiki page.
for topic in topics:
    print(topic["url"])
    page_title = topic["url"].rpartition("/")[-1]
    
    print("\t", page_title)
    page = wikipedia.page(title=page_title, auto_suggest=False)
    
    topic["wiki.content"] = page.content
    topic["wiki.summary"] = page.summary

https://en.wikipedia.org/wiki/Lilac_Fire
	 Lilac_Fire
https://en.wikipedia.org/wiki/Cranston_Fire
	 Cranston_Fire
https://en.wikipedia.org/wiki/Holy_Fire_(2018)
	 Holy_Fire_(2018)
https://en.wikipedia.org/wiki/Hurricane_Florence
	 Hurricane_Florence
https://en.wikipedia.org/wiki/2018_Maryland_flood
	 2018_Maryland_flood
https://en.wikipedia.org/wiki/Saddleridge_Fire
	 Saddleridge_Fire
https://en.wikipedia.org/wiki/Hurricane_Laura
	 Hurricane_Laura
https://en.wikipedia.org/wiki/Hurricane_Sally
	 Hurricane_Sally
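
The title extraction above relies on `rpartition("/")`, which is sufficient for the eight URLs listed. As a minimal sketch, assuming a URL might instead arrive percent-encoded or with a query string (not the case for these topics), a hypothetical helper `title_from_url` could normalize it first. The helper name is an illustration and is not part of the notebook.

```python
from urllib.parse import unquote, urlparse


def title_from_url(url: str) -> str:
    """Return the page title encoded in a Wikipedia article URL."""
    path = urlparse(url).path             # drops any query string or fragment
    raw_title = path.rpartition("/")[-1]  # same split the notebook uses
    return unquote(raw_title)             # decodes sequences such as %28 and %29


print(title_from_url("https://en.wikipedia.org/wiki/Holy_Fire_(2018)"))
print(title_from_url("https://en.wikipedia.org/wiki/Holy_Fire_%282018%29"))
```

Both calls should yield the same title, which can then be passed to `wikipedia.page` as in the cell above.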


In [9]:
# Example summary from Hurricane Sally
page.summary



## Extract Summaries from ICS 209 Data

Much of the ICS 209 data isn't in a reasonable format for summaries, and many fields aren't directly useful. Here, we read in the 1999-2020 dataset and convert a subset of these fields to sentences for summarization.

In [10]:
df = pd.read_csv(
    "ics-209-plus-2.0-pre-release-v1/ics209-plus_sitreps_1999to2020.csv", 
    low_memory=False, 
    index_col=0
)

## NOTE: THE ABOVE CODE WILL CURRENTLY FAIL

We cannot release the ICS 209 dataset yet, as it is still in pre-release. Once St. Denis et al. publicly release it, we will update this code accordingly.

<hr>

Each key in the `narr_prompts` map is a field in an ICS 209 entry, and each value is a sentence template for that field. Some fields hold only numeric data, e.g., total evacuations, so we embed that number in a sentence using the templates below. Other fields, such as `INCIDENT_DESCRIPTION`, are narrative fields (i.e., free-text entries), and we use their text directly.

In [11]:
narr_prompts = {
    "CAUSE":    "The cause of the incident is _PLACEHOLDER_.",
    "TOTAL_EVACUATIONS":    "_PLACEHOLDER_ evacuations to date.",
    "TOTAL_P_FATALITIES":    "_PLACEHOLDER_ fatalities to date.",
    "TOTAL_P_INJURIES":    "_PLACEHOLDER_ injuries to date.",
    "TOTAL_PERSONNEL":    "_PLACEHOLDER_ personnel deployed.",
    "TOTAL_R_FATALITIES":    "_PLACEHOLDER_ responder fatalities.",
    "TOTAL_R_INJURIES":    "_PLACEHOLDER_ responder injuries.",
    "ACRES":    "_PLACEHOLDER_ acres have been affected.",
    "EDAMAGE":    "Estimated damage at _PLACEHOLDER_.",
    "FATALITIES":    "_PLACEHOLDER_ fatalities.",
    "INJURIES":    "_PLACEHOLDER_ injuries this period.",
    "INJURIES_TO_DATE":    "_PLACEHOLDER_ injuries to date.",
    "STR_DAMAGED":    "_PLACEHOLDER_ structures damaged.",
    "STR_DAMAGED_COMM":    "_PLACEHOLDER_ commercial structures damaged.",
    "STR_DAMAGED_RES":    "_PLACEHOLDER_ residential structures damaged.",
    "STR_DESTROYED":    "_PLACEHOLDER_ structures destroyed.",
    "STR_DESTROYED_COMM":    "_PLACEHOLDER_ commercial structures destroyed.",
    "STR_DESTROYED_RES":    "_PLACEHOLDER_ residential structures destroyed.",
    "STR_THREATENED":    "_PLACEHOLDER_ structures currently threatened.",
    "STR_THREATENED_COMM":    "_PLACEHOLDER_ commercial structures currently threatened.",
    "STR_THREATENED_RES":    "_PLACEHOLDER_ residential structures currently threatened.",
    "UNIT_OR_OTHER_NARR":    "_PLACEHOLDER_",
    "WEATHER_CONCERNS_NARR":    "_PLACEHOLDER_",
    "COMPLEXITY_LEVEL_NARR":    "_PLACEHOLDER_",
    "CRIT_RES_NEEDS_NARR":    "_PLACEHOLDER_",
    "CURRENT_THREAT_NARR":    "_PLACEHOLDER_",
    "HAZARDS_MATLS_INVOLVMENT_NARR":    "_PLACEHOLDER_",
    "INCIDENT_COMMANDERS_NARR":    "_PLACEHOLDER_",
    "INCIDENT_DESCRIPTION":    "_PLACEHOLDER_",
    "INCIDENT_JURISDICTION":    "_PLACEHOLDER_",
    "INCTYP_DESC":    "_PLACEHOLDER_",
    "LIFE_SAFETY_HEALTH_STATUS_NARR":    "_PLACEHOLDER_",
    "MAJOR_PROBLEMS":    "_PLACEHOLDER_",
    "OBS_FIRE_BEHAVE":    "_PLACEHOLDER_",
    "PLANNED_ACTIONS":    "_PLACEHOLDER_",
    "POO_SHORT_LOCATION_DESC":    "_PLACEHOLDER_",
    "PROJECTED_ACTIVITY_NARR":    "_PLACEHOLDER_",
    "REMARKS":    "_PLACEHOLDER_",
    "RES_BENEFITS":    "_PLACEHOLDER_",
    "RISK_ASSESSMENT":    "_PLACEHOLDER_",
    "SIGNIF_EVENTS_SUMMARY":    "_PLACEHOLDER_",
    "STATUS":    "_PLACEHOLDER_",
    "STRATEGIC_NARR":    "_PLACEHOLDER_",
}
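
To illustrate how such templates turn a report row into sentences, here is a minimal sketch using plain dictionaries. The three templates are copied from `narr_prompts`; the row values, `is_missing`, and `render_row` are invented for illustration and do not come from the ICS 209 data or the notebook.

```python
import math

# Three templates copied from narr_prompts
sample_prompts = {
    "TOTAL_PERSONNEL": "_PLACEHOLDER_ personnel deployed.",
    "INCIDENT_DESCRIPTION": "_PLACEHOLDER_",
    "ACRES": "_PLACEHOLDER_ acres have been affected.",
}


def is_missing(value) -> bool:
    """True for None or a float NaN, the usual markers of an empty cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def render_row(row: dict, prompts: dict) -> list:
    """Fill each template whose field is present and non-zero in the row."""
    sentences = []
    for field, prompt in prompts.items():
        value = row.get(field)
        if is_missing(value) or value == 0:
            continue
        sentences.append(prompt.replace("_PLACEHOLDER_", str(value)))
    return sentences


# Hypothetical row: one numeric field, one narrative field, one empty field
hypothetical_row = {
    "TOTAL_PERSONNEL": 120,
    "INCIDENT_DESCRIPTION": "Brush fire near the highway.",
    "ACRES": float("nan"),
}
print(render_row(hypothetical_row, sample_prompts))
```

Narrative fields pass through unchanged because their template is the bare placeholder, while empty and zero-valued fields produce no sentence.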

For every topic, we pull the set of ICS 209 IDs and find the matching rows in the ICS data frame we read above.

Then we pull out each numeric and narrative field from above and add them to our running set of summaries. We use a set here to ensure we have unique sentences in the summary field, which is useful since some of the 209 reports duplicate content.

In [12]:
for topic in topics:
    print(topic["title"])
    ics_id = topic["icsID"]
    indices = []
    
    # Generate the set of Incident IDs, which are stored as a comma-separated string
    if "," in ics_id:
        multi_ics_ids = ics_id.split(",")
        
        for sub_ics_id in multi_ics_ids:
            relevant_ics_df = df[df["INCIDENT_ID"] == sub_ics_id.strip()]
            indices.extend(relevant_ics_df.index)
    else:
        relevant_ics_df = df[df["INCIDENT_ID"] == ics_id]
        indices.extend(relevant_ics_df.index)

        
    print("\t", "Relevant Rows:", len(indices))
    
    # Some events don't have a corresponding ICS 209 report,
    # such as the Maryland Floods, which we included for
    # geographic relevance to NIST assessors
    if len(indices) == 0:
        print("\t", "No rows...")
        continue
        
    relevant_ics_df = df.loc[indices]
    
    summaries = set()
    for idx,row in relevant_ics_df.iterrows():
        lrow = row.dropna()
        
        for field,prompt in narr_prompts.items():
            
            if field not in lrow:
                continue
            
            if row[field] == 0:
                continue
                
            new_str = prompt.replace("_PLACEHOLDER_", str(row[field]))
            summaries.add(new_str)
            
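        # Add summaries for binary flag fields; read from lrow so that
        # missing (NaN) flags, which are truthy in Python, count as unset
        if lrow.get("ROAD_CLOSURE_FLAG", False):
            summaries.add("Roads closed.")
        if lrow.get("EVACUATION_IN_PROGRESS", False):
            summaries.add("Evacuations underway.")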


    topic["ics.summary"] = " ".join(summaries)

Lilac Wildfire 2017
	 Relevant Rows: 18
Cranston Wildfire 2018
	 Relevant Rows: 23
Holy Wildfire 2018
	 Relevant Rows: 42
Hurricane Florence 2018
	 Relevant Rows: 47
2018 Maryland Flood
	 Relevant Rows: 0
	 No rows...
Saddleridge Wildfire 2019
	 Relevant Rows: 27
Hurricane Laura 2020
	 Relevant Rows: 10
Hurricane Sally 2020
	 Relevant Rows: 3
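
The cell above handles single and comma-separated incident IDs in two separate branches. As a hedged sketch, a hypothetical helper `parse_ics_ids` could cover both cases in one place. The ID strings below are invented and are not real ICS 209 incident IDs.

```python
def parse_ics_ids(ics_id: str) -> list:
    """Split a comma-separated ID string into clean, non-empty IDs."""
    return [part.strip() for part in ics_id.split(",") if part.strip()]


# Hypothetical ID strings, not taken from the ICS 209 dataset
print(parse_ics_ids("ID-A"))
print(parse_ics_ids("ID-A, ID-B ,ID-C"))
```

With such a helper, the row selection could be written as `df[df["INCIDENT_ID"].isin(parse_ics_ids(ics_id))]`, assuming the `INCIDENT_ID` column holds strings as the comparisons above imply.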


## Extract Summaries from NIST Assessors

We also have facts from NIST assessors, and we use those to create summaries as well.

In [13]:
with open("CrisisFACTs-2022.facts.json", "r") as in_file:
    facts = json.load(in_file)

In [14]:
collection_files = []
collection_dfs = []

total_fact_count = 0
for event in facts:
    event_name = event["event"]
    event_id = event["eventID"]
    event_requests = event["summaryRequests"]
    event_factsXrequests = event["factsByRequest"]

    print(event_id, event_name)
    for event_request in event_requests:
        req_id = event_request["requestID"]        
        this_req_facts = event_factsXrequests[req_id]
        fact_count = len(this_req_facts)
        fact_collection = [fact["fact"] for fact in this_req_facts]
        
        this_event_req_df = pd.DataFrame(fact_collection, columns=["fact"])
        this_event_req_df["title"] = ["%s-%d" % (req_id,i) for i in this_event_req_df.index]
        
        print("\t", req_id, fact_count)
        
        collection_dfs.append(this_event_req_df)
        total_fact_count += fact_count
        
all_nist_facts_df = pd.concat(collection_dfs)

CrisisFACTS-001 Lilac Wildfire 2017
	 CrisisFACTS-001-r3 267
	 CrisisFACTS-001-r4 75
	 CrisisFACTS-001-r5 14
	 CrisisFACTS-001-r6 29
	 CrisisFACTS-001-r7 19
	 CrisisFACTS-001-r8 5
	 CrisisFACTS-001-r9 3
	 CrisisFACTS-001-r10 3
	 CrisisFACTS-001-r11 2
CrisisFACTS-002 Cranston Wildfire 2018
	 CrisisFACTS-002-r1 27
	 CrisisFACTS-002-r2 10
	 CrisisFACTS-002-r3 4
	 CrisisFACTS-002-r4 13
	 CrisisFACTS-002-r5 7
	 CrisisFACTS-002-r6 1
CrisisFACTS-003 Holy Wildfire 2018
	 CrisisFACTS-003-r5 37
	 CrisisFACTS-003-r6 42
	 CrisisFACTS-003-r7 39
	 CrisisFACTS-003-r8 37
	 CrisisFACTS-003-r9 9
	 CrisisFACTS-003-r10 17
	 CrisisFACTS-003-r11 4
CrisisFACTS-004 Hurricane Florence 2018
	 CrisisFACTS-004-r8 5
	 CrisisFACTS-004-r9 5
	 CrisisFACTS-004-r10 2
	 CrisisFACTS-004-r11 4
	 CrisisFACTS-004-r12 8
	 CrisisFACTS-004-r13 15
	 CrisisFACTS-004-r14 55
	 CrisisFACTS-004-r15 26
	 CrisisFACTS-004-r16 14
	 CrisisFACTS-004-r17 37
	 CrisisFACTS-004-r18 46
	 CrisisFACTS-004-r19 3
	 CrisisFACTS-004-r20 6
	 CrisisFA

In [18]:
for topic in topics:
    print(topic["eventID"])
    
    this_topic_nist_facts = all_nist_facts_df[all_nist_facts_df["title"].str.startswith(topic["eventID"])]
    print("\t", this_topic_nist_facts.shape[0])

    topic["nist.fact_depth"] = this_topic_nist_facts.shape[0]
    topic["nist.summary"] = " ".join(this_topic_nist_facts["fact"])

CrisisFACTS-001
	 417
CrisisFACTS-002
	 62
CrisisFACTS-003
	 185
CrisisFACTS-004
	 232
CrisisFACTS-005
	 74
CrisisFACTS-006
	 137
CrisisFACTS-007
	 199
CrisisFACTS-008
	 176


## Save Final Summaries

We now have three summaries for every event except the Maryland Floods, which has no matching ICS 209 reports and therefore no `ics.summary` entry.

In [19]:
with gzip.open("gold.summaries.json.gz", "wb") as out_file:
    topics_json_str = json.dumps(topics)
    out_file.write(topics_json_str.encode("utf8"))
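
For readers consuming the saved file, the following is a minimal round-trip sketch. It writes a small stand-in list to a temporary path (the real file is `gold.summaries.json.gz`) and reads it back. The sample values are invented, and the `.get` default is an assumption about how a consumer might handle events that lack an `ics.summary` key.

```python
import gzip
import json
import os
import tempfile

# Stand-in for the real topics list; keys mirror those set above, values are invented
sample_topics = [{"eventID": "Example-001", "wiki.summary": "A short summary."}]

path = os.path.join(tempfile.mkdtemp(), "example.summaries.json.gz")
with gzip.open(path, "wt", encoding="utf8") as out_file:
    json.dump(sample_topics, out_file)

with gzip.open(path, "rt", encoding="utf8") as in_file:
    loaded = json.load(in_file)

# ics.summary is absent for events with no matching reports, so fall back to ""
for topic in loaded:
    print(topic["eventID"], repr(topic.get("ics.summary", "")))
```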