# Forum 1 - Mystery Dataset ↔ Metadata Power
Author: Kyle Metta, PhD – Weather Program Office, September 2025


This tutorial is designed for social scientists to experience firsthand why metadata matters. We’ll start by working with a “mystery dataset” and see how far we can get without metadata, and then demonstrate how better metadata unlocks integration potential.


### Goals of the notebook:
- Understand the limitations of working with data that lacks metadata.
- Explore how basic metadata helps answer critical questions.
- See how a structured metadata schema (InSPIRE-aligned) provides the foundation for confident integration.
- Practice thinking about your own data in terms of context and metadata.
- Build awareness of community-driven approaches to data sharing and reuse.


**Note:** The datasets in this notebook are **synthetic** — created for exploration and demonstration purposes only. No real respondent data is used here.


---


### The Steps We Will Follow
**Step 1:** Describe your own dataset in plain language  
**Step 2:** Open and assess the “mystery dataset”  
**Step 3:** Explore feasibility of integration (geographic, temporal, variables, methods)  
**Step 4:** Introduce metadata and see how it changes the assessment  
**Step 5:** Compare basic vs. comprehensive metadata using the InSPIRE schema  
**Step 6:** Reflect on what metadata enables and why it matters  


### Before We Get Started – Packages and Libraries

In Jupyter notebooks, we often use “packages” (also called libraries).  
These are pre-written collections of code that give us tools for data loading, cleaning, and display — so we don’t have to reinvent the wheel.

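For this notebook, the key ones are:

- **pandas**: For loading and working with tabular data (like spreadsheets).  
- **IPython.display**: For showing tables and formatted Markdown text in the notebook output.  
- **json**: Python's built-in module for reading metadata stored in JSON files. It is not imported below because the metadata in this notebook is written inline as Python dictionaries.  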




In [8]:
import pandas as pd
from IPython.display import display, Markdown


## Exercise 1: Describe Your Dataset

Think of a dataset you’ve collected (recent or past).  
Imagine someone emails you asking: *“What’s in this dataset?”*  
They are smart, but they don’t know your work.

**Your task (3–5 minutes):**
- Write what you would tell them.  
- Focus on context: geography, time, methods, sample size, variables.  
- Keep it simple — imagine it’s a README for a future collaborator.

💬 We’ll share a few responses in chat.


---
### Reflection: What You Just Wrote Is Metadata
When you described your dataset in plain words, you were already creating **metadata**: context about your data.

- Right now, it’s **unstructured metadata**: free text, useful for humans but hard for machines to parse.  
- Later in this session, we’ll see how to make this same information **structured** by using a **schema**.  
- A **metadata schema** is simply a *standardized set of fields* (like title, creator, date, location) that ensures everyone describes data in a consistent way.  
- Structured metadata is both **human-readable** and **machine-readable**, which makes it possible to share, integrate, and reuse data consistently.
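
To make the contrast concrete, here is a minimal sketch of the same description in unstructured and structured form. The field names follow the examples in the list above (title, creator, date, location); the `methods` field, the `REQUIRED_FIELDS` list, and all values are hypothetical placeholders, not part of any real schema.

```python
# Unstructured metadata: free text, like the description written in Exercise 1.
free_text = (
    "Household survey about hurricane evacuation, collected in one Florida "
    "county in late summer, a few hundred responses."
)

# Structured metadata: one value per named field.
# Field names and values are illustrative placeholders only.
structured = {
    "title": "Example evacuation survey",
    "creator": "Example Research Team",
    "date": "2023-08",
    "location": "Example County, Florida",
}

# In this sketch, a "schema" is simply an agreed list of fields to fill in.
REQUIRED_FIELDS = ["title", "creator", "date", "location", "methods"]


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in required if not record.get(field)]


print(missing_fields(structured))
```

Because the fields are named, a program can spot the gap (here, `methods`) automatically, which is not possible with the free-text version.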

#### Think of this step as capturing the story of your dataset.
---



## From Your Dataset to the Mystery Dataset
In Exercise 1, you described *your own dataset* in plain language.  
Now we flip perspectives:

- Imagine someone else shares *their dataset* with you; no context, just a file.  
- Can you tell if it’s usable?

We’ll use a **mystery evacuation dataset** to walk through this scenario.


# Exercise 2: Integration Feasibility Assessment

**Your Research Context**  
- **Current Study:** Hurricane Idalia evacuation survey — Sarasota County, August 2023 (n = 892)  
- **Integration Goal:** Combine datasets for more statistical power (need n > 2,000)  
- **Research Question:** Can we compare evacuation timing across counties and storm events?  

**Scenario**  
You get an email from a colleague.  
Attached is a CSV file. 
The note says: *“This might work with your evacuation research.”*  

**What you need to figure out:**  
- Do the **variables** line up? (evacuation timing, demographics)  
- Can you establish **geographic linkage**? (Florida coastal counties)  
- Was it collected with a **similar methodology**? (household survey data)  
- Is there **temporal alignment**? (comparable hurricane events)  


## Step 1: Open the Mystery Dataset

Before we can assess integration, let’s actually look at the file.  
 

**What to do:**  
1. Run the next cell to open the dataset.  
2. Look at the basic overview it prints:  
   - How many rows (potential new observations)?  
   - How many variables?  
   - What are the variable names?  
   - What do the first few rows look like?  
3. Ask yourself: **Does this look usable with your Sarasota/Idalia study? What’s still unclear?**  


In [15]:


url = "https://raw.githubusercontent.com/jmote-noaa/Data-Forums/main/data/mystery_evacuation_dataset.csv"
df = pd.read_csv(url)

print("=== DATASET OVERVIEW ===")
print(f"Potential additional observations: {len(df)}")  # number of rows
print(f"Combined sample size would be: {len(df) + 892}")  # rows plus the Sarasota/Idalia n of 892
print(f"Variables available: {len(df.columns)}")  # number of columns
print("\nVariable names:")  # list the column names
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

print("\n=== FIRST 8 OBSERVATIONS ===")
display(df.head(8))


=== DATASET OVERVIEW ===
Potential additional observations: 1247
Combined sample size would be: 2139
Variables available: 15

Variable names:
 1. resp_id
 2. zip_code
 3. survey_date
 4. evac_timing
 5. dest_type
 6. network_size
 7. info_source
 8. prev_hurricane
 9. income_level
10. age_group
11. household_comp
12. evac_decision
13. prep_score
14. risk_perception
15. social_media_use

=== FIRST 8 OBSERVATIONS ===


| | resp_id | zip_code | survey_date | evac_timing | dest_type | network_size | info_source | prev_hurricane | income_level | age_group | household_comp | evac_decision | prep_score | risk_perception | social_media_use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 33902 | 2022-09-29 | 2 | A | 8 | Official Warning | 0 | 3.0 | B | SA | Y | 4.4 | 5 | Other |
| 1 | 1002 | 33919 | 2022-09-29 | 4 | D | 3 | TV,Family | 1 | 2.0 | C | SF | Y | 3.1 | 3 | Other |
| 2 | 1003 | 33907 | 2022-09-29 | 3 | C | 5 | Weather App | 0 | 6.0 | D | CF | Y | 5.0 | 3 | IG |
| 3 | 1004 | 33906 | 2022-09-29 | 3 | C | 3 | News Website | 1 | 2.0 | D | SF | Y | 3.2 | 4 | Instagram,Facebook |
| 4 | 1005 | 33903 | 2022-09-28 | 2 | B | 5 | TV,Radio,Social | 1 | 3.0 | C | CF | N | 4.5 | 4 | Twitter,Instagram |
| 5 | 1006 | UNK | 2022-09-28 | 2 | A | 3 | TV,Family | 0 | 4.0 | C | SF | Y | 3.5 | 4 | FB,TW |
| 6 | 1007 | 33967-4611 | 2022-09-28 | 1 | A | 1 | Social Media | 1 | 3.0 | B | CA | Y | 3.8 | 4 | IG |
| 7 | 1008 | 33901 | 2022-09-30 | 4 | C | 4 | News Website | 0 | 2.0 | C | SF | Y | 1.4 | 3 | Facebook,Twitter |


---
### 💬 Chat Prompt: First Impressions
Now that you’ve opened the **mystery dataset** and seen the variables and first few rows:  

- Does anything look familiar or usable for your Sarasota/Idalia study?  
- What’s still unclear or confusing?  
- If someone handed you just this file, what would you need to know before deciding to integrate it?  

#### Type a quick response in the chat! We just want to surface what people notice.
---


## Step 2: Geographic Integration

One of the first questions when deciding if datasets can be combined is:  
**“Were they collected in the same places?”**

For evacuation research, **geographic coverage is critical**:  
- If the survey locations overlap with Sarasota County, integration might be possible.  
- If not, the datasets may not be directly comparable.

**What to do:**  
1. **Run the code cell below** to see the zip codes and most common locations.  
2. **Look closely at the results** — do these zip codes line up with Sarasota, or do they point somewhere else?  
3. **Share in the chat:**  
   - What geographic area do you think this dataset represents?  
   - Would it integrate with your Sarasota study?




In [19]:
print("=== GEOGRAPHIC INTEGRATION ASSESSMENT ===")
print("Can this integrate with your Sarasota County study?")

# Examine zip codes
print(f"\nZip code analysis:")
print(f"- Unique zip codes: {df['zip_code'].nunique()}")
print(f"- Zip code format issues: {df['zip_code'].astype(str).str.len().value_counts().to_dict()}")

print(f"\nMost common locations:")
display(df['zip_code'].value_counts().head(8))

print(f"\nSample zip codes: {df['zip_code'].unique()[:10]}")



=== GEOGRAPHIC INTEGRATION ASSESSMENT ===
Can this integrate with your Sarasota County study?

Zip code analysis:
- Unique zip codes: 78
- Zip code format issues: {5: 1082, 3: 121, 10: 44}

Most common locations:


zip_code
33919    44
33906    42
33909    42
33976    42
339      41
33921    40
33904    39
33931    39
Name: count, dtype: int64


Sample zip codes: ['33902' '33919' '33907' '33906' '33903' 'UNK' '33967-4611' '33901'
 '33912' '33931']
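
The output above mixes several formats: plain 5-digit codes, a ZIP+4 code (`33967-4611`), a placeholder (`UNK`), and a truncated value (`339`). The sketch below shows one possible way to standardize them before any geographic comparison. The `normalize_zip` helper is hypothetical, and treating truncated or unknown values as missing is an assumption; without metadata we cannot know how the original collectors intended them to be handled.

```python
import re


def normalize_zip(raw):
    """Return a 5-digit ZIP string, or None if the value cannot be trusted."""
    text = str(raw).strip()
    # Accept plain 5-digit ZIPs and ZIP+4 codes such as "33967-4611".
    match = re.fullmatch(r"(\d{5})(?:-\d{4})?", text)
    return match.group(1) if match else None


# Sample values copied from the output above.
samples = ["33902", "33967-4611", "UNK", "339"]
cleaned = [normalize_zip(z) for z in samples]
print(cleaned)
```

On the full dataset, the same helper could be applied with `df["zip_code"].map(normalize_zip)`.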


### 💬 Chat Prompt: Geography
Looking at the zip codes:  

- Where do you think this dataset was collected?  
- Do these locations overlap with Sarasota County, or somewhere else?  
- Would you feel confident integrating based on this info alone? Why or why not?  



## Step 3: Temporal Integration

The next big question when evaluating datasets is:  
**“Were they collected during the same time period?”**

Why this matters:  
- If two surveys measure evacuation behavior during the *same storm* or season, they’re more directly comparable.  
- If they are from *different storms or years*, integration may introduce new complexities.

**What to do:**  
1. **Run the code cell below** to see when the mystery dataset was collected.  
2. Compare the results to your Sarasota study (Hurricane Idalia, August 2023).  
3. **Share in the chat:**  
   - What storm or event do you think this dataset captures?  
   - Could it still be combined with your Idalia study, or is the time gap too large?



In [23]:
print("=== TEMPORAL INTEGRATION ASSESSMENT ===")
print("Does this align temporally with your Hurricane Idalia study (Aug 2023)?")

# Look at survey dates
print(f"Survey date range:")
print(f"- Earliest: {df['survey_date'].min()}")  
print(f"- Latest: {df['survey_date'].max()}")
print(f"- Collection span: {(pd.to_datetime(df['survey_date'].max()) - pd.to_datetime(df['survey_date'].min())).days} days")

print(f"\nSurvey timing distribution:")
df['survey_date'] = pd.to_datetime(df['survey_date'])
date_counts = df['survey_date'].dt.date.value_counts().head(5)
display(date_counts)


=== TEMPORAL INTEGRATION ASSESSMENT ===
Does this align temporally with your Hurricane Idalia study (Aug 2023)?
Survey date range:
- Earliest: 2022-09-28
- Latest: 2022-09-30
- Collection span: 2 days

Survey timing distribution:


survey_date
2022-09-30    423
2022-09-28    414
2022-09-29    410
Name: count, dtype: int64
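
To put a number on the time gap, the sketch below compares the survey dates printed above with the Idalia study. The source only gives "August 2023" for that study, so the reference date of 2023-08-30 is an assumption used for illustration.

```python
from datetime import date

# Survey date range taken from the output above.
mystery_start = date(2022, 9, 28)
mystery_end = date(2022, 9, 30)

# Assumption: the Idalia study is only described as "August 2023";
# 2023-08-30 is a stand-in reference date, not a documented value.
idalia_reference = date(2023, 8, 30)

span_days = (mystery_end - mystery_start).days
gap_days = (idalia_reference - mystery_end).days

print(f"Mystery dataset collection span: {span_days} days")
print(f"Gap to the Idalia reference date: {gap_days} days")
```

A gap of this size tells us the two surveys cannot describe the same storm, but it does not tell us which storm the mystery dataset describes.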

___
### 💬 Chat Prompt: Time
Now that you’ve seen the survey dates:  

- What storm or event do you think this dataset captures?  
- How does that compare to your Hurricane Idalia study (Aug 2023)?  
- Could you still justify integrating, or would the timing be a blocker?
---





### Reflection Before Moving On
We’ve now looked at the dataset from three angles:  
- The variables and rows,  
- The geographic coverage,  
- The survey dates.  

**Takeaway:** Even after running some analysis, we’re left guessing.  
This is the core problem: **without metadata, we can’t answer the basic questions needed to decide if integration is feasible.**

Each step gave us a little more information, but notice what’s missing:  
- We still don’t know **which storm** this really represents.  
- We don’t know the **methods** or **who collected it**.  
- We don’t know if it’s safe to compare with Sarasota/Idalia.  

This exercise shows the limits of working with *just a file* and no structured context.  

---




## A Way Forward: Metadata

We just saw how far we could get **without metadata**:  
- 20+ minutes exploring  
- Still no clear decision about feasibility  

This is where **metadata makes the difference**.  
Let’s look at a small set of essential fields and see if it helps us move forward.

**Scenario:**  
You notice in your colleague’s email that alongside the dataset, they’ve also attached a small file with a **`.json` extension**.

### Why JSON?
- **What it is:** JSON stands for *JavaScript Object Notation*. It’s a simple text format for storing structured information as “key–value” pairs.  
- **Why use it:**  
  - It’s human-readable — you can open it and understand it at a glance.  
  - It’s machine-readable — Python, R, and most software can parse JSON directly.  
  - It’s widely used for storing and sharing metadata across platforms, alongside formats such as XML.  
- **In practice:** Many metadata standards (like DataCite and schema.org) can be expressed in JSON or JSON-LD.  
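
As a small illustration of the "key–value" idea, the sketch below parses a JSON string with Python's built-in `json` module. The field names and values are made up for this example.

```python
import json

# A tiny JSON document: keys on the left, values on the right.
# The content is a made-up example, not a real metadata record.
json_text = """
{
  "title": "Example survey",
  "data_type": "Survey Data",
  "keywords": ["evacuation", "hurricane"]
}
"""

# json.loads turns JSON text into ordinary Python objects (dict, list, str).
metadata = json.loads(json_text)
print(metadata["title"])
print(metadata["keywords"])

# json.dumps goes the other way: Python objects back to JSON text.
round_trip = json.dumps(metadata, indent=2)
print(round_trip)
```

For a file on disk, `json.load` does the same job with an open file object instead of a string.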

Let’s open that JSON file and see what it tells us about the dataset.



In [30]:
basic_metadata = {
    "title": "Hurricane Ian evacuation behavior survey",
    "creator": "Florida Emergency Management Example Research Consortium",
    "subject_category": "Emergency Management",
    "geographic_coverage_specific": "Lee County, Florida",
    "temporal_coverage_start": "2022-09-28",
    "temporal_coverage_end": "2022-10-02",
    "data_type": "Survey Data",
    "abstract": "Household survey on evacuation decisions during Hurricane Ian"
}


lines = ["### Basic Metadata\n"]
for k, v in basic_metadata.items():
    pretty_key = k.replace("_", " ").title()
    lines.append(f"**{pretty_key}:** {v}  ")  # two spaces = line break in Markdown

display(Markdown("\n".join(lines)))


### Basic Metadata

**Title:** Hurricane Ian evacuation behavior survey  
**Creator:** Florida Emergency Management Example Research Consortium  
**Subject Category:** Emergency Management  
**Geographic Coverage Specific:** Lee County, Florida  
**Temporal Coverage Start:** 2022-09-28  
**Temporal Coverage End:** 2022-10-02  
**Data Type:** Survey Data  
**Abstract:** Household survey on evacuation decisions during Hurricane Ian  

## 📝 What Does This Metadata Change?

With just these few fields, we’ve gained critical context:  
- We now know the **storm** → Hurricane Ian (Sept 2022)  
- We know the **location** → Lee County (on the same Gulf Coast as Sarasota County, with Charlotte County in between)  
- We know the **data type & timeframe** → Household survey, Sept 28 – Oct 2, 2022  

This gets us much closer to deciding if integration is possible.
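
Structured fields also make simple consistency checks possible. The sketch below, with a hypothetical `within_coverage` helper, tests whether the survey dates observed in the file fall inside the temporal coverage declared in the metadata. The dates are copied from the outputs above.

```python
from datetime import date

# Declared in the basic metadata.
declared_start = date.fromisoformat("2022-09-28")
declared_end = date.fromisoformat("2022-10-02")

# Observed in the survey_date column of the mystery dataset.
observed_first = date.fromisoformat("2022-09-28")
observed_last = date.fromisoformat("2022-09-30")


def within_coverage(first, last, start, end):
    """True if the observed date range lies inside the declared range."""
    return start <= first and last <= end


consistent = within_coverage(observed_first, observed_last,
                             declared_start, declared_end)
print(f"Observed dates fall within declared coverage: {consistent}")
```

The check passes here, although the declared window runs two days past the last observed survey date, which is the kind of detail worth asking the data creator about.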

---

### 💬 Chat Engagement – Integration Assessment
Now that we have this context, how do you feel about integration potential?  
**High • Medium • Low**

---

### 💬 Chat Engagement – What’s Still Missing?
In the chat, share:  
- What information do you *still* wish you had to make a confident decision?  



## From Gaps → Comprehensive Metadata

In the last step, we saw that **basic metadata** gave us some answers (storm, county, dates)…  
…but you told us there are still critical pieces missing:  
- Who collected it?  
- What methodology was used?  
- How do we technically integrate it?  
- Who do we contact?  

This is where **comprehensive metadata** comes in.  
Instead of a few fields, let’s look at a structured schema that captures all the details needed for confident integration.



In [46]:

# Comprehensive metadata schema documentation (illustrative slice)
# Note: this is again synthetic data created for this example only.
comprehensive_metadata = {
  "identification_overview": {
    "title": "Hurricane Ian Evacuation Decision-Making Survey: Lee County, Florida",
    "abstract": "This is a synthetic dataset of a household survey examining evacuation timing, transportation choices, and destination decisions during Hurricane Ian. Not for real use.",
    "purpose": "Quantify drivers of evacuation to inform emergency management.",
    "creator": [{"name": "Dr. Jennifer A. Example", "affiliation": "Florida Example University"}],
    "unique_identifier": "10.0000/INSP/IAN2022-EX"
  },
  "geographic_temporal": {
    "geographic_coverage_specific": "Lee County, Florida — FEMA Evacuation Zones A–E",
    "coordinate_system": "NAD83 / Florida West (EPSG:26959)",
    "spatial_resolution": "Census tract level (anonymized)",
    "temporal_coverage_start": "2022-09-28",
    "temporal_coverage_end": "2022-10-02",
    "temporal_resolution": "Event-based: 72-hour window"
  },
  "data_characteristics_methods": {
    "collection_methodology": "Stratified random sample by evacuation zone; CATI + door-to-door follow-up",
    "sampling_weights": "Design-based weights provided",
    "processing": "Cleaning and recodes documented",
    "uncertainty": "Standard errors and response rates reported"
  },
  "decision_support_attributes": {
    "use_cases": ["Evacuation Planning", "Infrastructure Risk Assessment", "Public Risk Communication"],
    "integration_with_other_data": ["NOAA storm surge models via FEMA zones", "US Census via FIPS"]
  },
  "technical_details_access": {
    "data_access_url": "https://dataverse.example.org/dataset/IAN2022",
    "access_restrictions": "Open access (post-embargo), Not a real dataset",
    "contact": {"name": "Dr. Jennifer A. Example", "email": "jennifer.example@example.edu"}
  }
}



In [48]:
# Some code to print the metadata a little cleaner

lines = ["### Comprehensive Metadata\n"]

for section, content in comprehensive_metadata.items():
    # Section header
    lines.append(f"\n**{section.replace('_', ' ').title()}**\n")
    
    # Fields in each section
    for k, v in content.items():
        if k == "creator" and isinstance(v, list):
            creators = "; ".join(f"{c.get('name')} — {c.get('affiliation')}" for c in v)
            lines.append(f"- **Creator(s):** {creators}  ")
        elif isinstance(v, list):
            v = ", ".join(str(x) for x in v)
            lines.append(f"- **{k.replace('_', ' ').title()}:** {v}  ")
        elif isinstance(v, dict):
            v = ", ".join(f"{subk.title()}: {subv}" for subk, subv in v.items())
            lines.append(f"- **{k.replace('_', ' ').title()}:** {v}  ")
        else:
            lines.append(f"- **{k.replace('_', ' ').title()}:** {v}  ")

display(Markdown("\n".join(lines)))

### Comprehensive Metadata


**Identification Overview**

- **Title:** Hurricane Ian Evacuation Decision-Making Survey: Lee County, Florida  
- **Abstract:** This is a synthetic dataset of a household survey examining evacuation timing, transportation choices, and destination decisions during Hurricane Ian. Not for real use.  
- **Purpose:** Quantify drivers of evacuation to inform emergency management.  
- **Creator(s):** Dr. Jennifer A. Example — Florida Example University  
- **Unique Identifier:** 10.0000/INSP/IAN2022-EX  

**Geographic Temporal**

- **Geographic Coverage Specific:** Lee County, Florida — FEMA Evacuation Zones A–E  
- **Coordinate System:** NAD83 / Florida West (EPSG:26959)  
- **Spatial Resolution:** Census tract level (anonymized)  
- **Temporal Coverage Start:** 2022-09-28  
- **Temporal Coverage End:** 2022-10-02  
- **Temporal Resolution:** Event-based: 72-hour window  

**Data Characteristics Methods**

- **Collection Methodology:** Stratified random sample by evacuation zone; CATI + door-to-door follow-up  
- **Sampling Weights:** Design-based weights provided  
- **Processing:** Cleaning and recodes documented  
- **Uncertainty:** Standard errors and response rates reported  

**Decision Support Attributes**

- **Use Cases:** Evacuation Planning, Infrastructure Risk Assessment, Public Risk Communication  
- **Integration With Other Data:** NOAA storm surge models via FEMA zones, US Census via FIPS  

**Technical Details Access**

- **Data Access Url:** https://dataverse.example.org/dataset/IAN2022  
- **Access Restrictions:** Open access (post-embargo), Not a real dataset  
- **Contact:** Name: Dr. Jennifer A. Example, Email: jennifer.example@example.edu  

## Reflection – What Did We Gain?

With the **comprehensive metadata** now revealed, we can see fields that answer critical questions:
- Identification (title, abstract, purpose, creator, unique identifier)  
- Geographic & temporal coverage (county, FEMA zones, coordinate system, dates, resolution)  
- Methodology (collection design, weights, processing, uncertainty)  
- Integration guidance (NOAA storm surge models via FEMA zones, US Census via FIPS)  
- Access & contact information (URL, restrictions, who to reach out to)  

### 💬 Chat Prompt
- Which of these fields feel *most valuable* for your research?  
- Which ones would you want from colleagues if you were trying to reuse their dataset?  





Instead of guessing, we now have a structured description that directly answers the critical integration questions.  
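
One way to see what "directly answers" means in practice: with a nested, structured record, each integration question can be mapped to a field and looked up by a program. The sketch below uses a trimmed copy of the synthetic metadata shown earlier; the `get_path` helper, the dotted-path notation, and the `funding` field are hypothetical and only serve the illustration.

```python
def get_path(metadata, dotted_path, default=None):
    """Follow a dotted path such as 'a.b.c' through nested dictionaries."""
    node = metadata
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Trimmed, abbreviated copy of the synthetic metadata from this notebook.
example = {
    "geographic_temporal": {
        "geographic_coverage_specific": "Lee County, Florida",
    },
    "data_characteristics_methods": {
        "collection_methodology": "Stratified random sample by evacuation zone",
    },
    "technical_details_access": {
        "contact": {"email": "jennifer.example@example.edu"},
    },
}

# Each question points at the field expected to answer it.
questions = {
    "Where was it collected?": "geographic_temporal.geographic_coverage_specific",
    "How was it collected?": "data_characteristics_methods.collection_methodology",
    "Who do we contact?": "technical_details_access.contact.email",
    "Who funded it?": "identification_overview.funding",  # not in the record
}

answers = {q: get_path(example, path, "NOT DOCUMENTED")
           for q, path in questions.items()}
for question, answer in answers.items():
    print(f"{question} {answer}")
```

The last question comes back as "NOT DOCUMENTED", which shows that a schema also makes remaining gaps visible instead of hiding them.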

---

Up next: we’ll bring it all together in a side-by-side **integration matrix** to see how your Sarasota study and this dataset line up.




In [64]:

# Create comprehensive integration assessment
idalia_n = 892
mystery_n = len(df)
total_n = idalia_n + mystery_n

matrix = pd.DataFrame([
    ["Geography", "Sarasota County, FL", "Lee County, FL", "Nearby Gulf Coast counties ✔"],
    ["Storm Event", "Hurricane Idalia (Aug 2023)", "Hurricane Ian (Sept 2022)", "Comparable major hurricanes ✔"],
    ["Time", "Aug 2023", "Sept–Oct 2022", "Different years but comparable ✔"],
    ["Methodology", "Household survey (n=892)", f"Stratified household survey (n≈{mystery_n})", "Design & content broadly aligned ✔"],
    ["Sample Size", "892", f"{mystery_n}", f"Combined n ≈ {total_n} (meets >2,000 threshold ✔)"],
    ["Technical Linkages", "FIPS, county", "FEMA zones, FIPS", "Join via county/FIPS; compare zones ✔"],
    ["Access & Contact", "You own this dataset", "Dataverse link + PI contact", "Collaboration pathway enabled ✔"],
    ["Integration Potential", "", "", "High confidence ✔"]
], columns=["Dimension", "Idalia Study", "Mystery Dataset", "Assessment"])

matrix


| | Dimension | Idalia Study | Mystery Dataset | Assessment |
|---|---|---|---|---|
| 0 | Geography | Sarasota County, FL | Lee County, FL | Nearby Gulf Coast counties ✔ |
| 1 | Storm Event | Hurricane Idalia (Aug 2023) | Hurricane Ian (Sept 2022) | Comparable major hurricanes ✔ |
| 2 | Time | Aug 2023 | Sept–Oct 2022 | Different years but comparable ✔ |
| 3 | Methodology | Household survey (n=892) | Stratified household survey (n≈1247) | Design & content broadly aligned ✔ |
| 4 | Sample Size | 892 | 1247 | Combined n ≈ 2139 (meets >2,000 threshold ✔) |
| 5 | Technical Linkages | FIPS, county | FEMA zones, FIPS | Join via county/FIPS; compare zones ✔ |
| 6 | Access & Contact | You own this dataset | Dataverse link + PI contact | Collaboration pathway enabled ✔ |
| 7 | Integration Potential | | | High confidence ✔ |
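
Once the assessment looks favorable, the mechanical step of combining the two datasets is short. The sketch below stacks two tiny stand-in tables with `pd.concat` and adds a `study` label so each row stays traceable to its source. The column names, values, and labels are hypothetical, and real integration would first require harmonizing variable codings using the metadata.

```python
import pandas as pd

# Tiny stand-in tables; columns and values are hypothetical.
idalia = pd.DataFrame({"resp_id": [1, 2],
                       "evac_decision": ["Y", "N"]})
ian = pd.DataFrame({"resp_id": [1001, 1002, 1003],
                    "evac_decision": ["Y", "Y", "N"]})

# Label each row with its source study before stacking.
idalia["study"] = "Idalia 2023 (Sarasota)"
ian["study"] = "Ian 2022 (Lee)"

# Stack the rows; ignore_index gives the result a fresh 0..n-1 index.
combined = pd.concat([idalia, ian], ignore_index=True)

print(combined.groupby("study").size())
print(f"Combined n: {len(combined)}")
```

Keeping the `study` column makes it possible to compare evacuation behavior across the two events rather than only pooling them.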


# Wrapping Up: What Did We Learn?

Without metadata ❌  
- 20+ minutes exploring, still uncertain about geography, time, and methods.  

With **basic metadata** ⚠️  
- We could identify the storm (Hurricane Ian) and location (Lee County).  
- Helpful, but still missing key details.  

With **comprehensive metadata** ✅  
- We could confidently assess feasibility.  
- Geography, time, methods, access, and contact info all in place.  
- Integration matrix showed compatibility and enough statistical power (n > 2,000).  

---

## Key Takeaway
**Metadata is not overhead — it’s the bridge between datasets.**  
It turns raw files into reusable research resources.  

---

## Next: Break & Back to the Forum
In the slides, we’ll:  
- Walk through the InSPIRE metadata schema design.  
- Show how these fields map directly to the problems we just experienced.  
- Discuss how we, as a community, can refine and use these practices.
