
# Forum 1: Mystery Dataset ↔ Metadata Power
**Goal:** Experience why metadata matters, then see how better metadata unlocks integration.

**How to use this notebook**
- You can just watch (rendered HTML) or **run cells** (Binder/Colab/local).
- Run cells **top → bottom**. If something fails, use `Kernel → Restart & Run All`.
- Today’s data are **synthetic** for exploration and demonstration (no real respondents).

**What is a notebook?** A notebook is an interactive document: it mixes **text cells** (explanations), **code cells** (runnable steps), and **outputs** (tables/plots). All code cells share one “**kernel**” (a common memory), so **run order matters**.



## Exercise 1: Describe Your Dataset

Think of a dataset you’ve collected (recent or past).  
Imagine someone emails you asking: *“What’s in this dataset?”*  
They are smart, but they don’t know your work.

**Your task (3–5 minutes):**
- Write what you would tell them.  
- Focus on context: geography, time, methods, sample size, variables.  
- Keep it simple — imagine it’s a README for a future collaborator.

💬 We’ll share a few responses in chat.


In [4]:

# (Optional) Type your dataset description here as plain text.
# Nothing will be saved — just a space to capture your thoughts.
my_dataset_description = " Feel free to type here"




print("Thanks for writing! You can also copy/paste this into chat if you’d like to share.")


Thanks for writing! You can also copy/paste this into chat if you’d like to share.
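
If you prefer a bit more structure, the description can also be captured as labeled fields and rendered as a short README. The sketch below is only an illustration: the field names and placeholder values are hypothetical, and `to_readme` is a helper invented for this example, not part of any library.

```python
# Hypothetical field names and placeholder values; replace them with your own.
description = {
    "Title": "Example household survey",
    "Geography": "County or region covered",
    "Time period": "Start and end dates of data collection",
    "Methods": "How respondents were sampled and contacted",
    "Sample size": "Number of completed responses",
    "Variables": "Short list of key variables with units or codes",
}

def to_readme(fields):
    """Render a dict of context fields as a short plain-text README."""
    lines = ["README", "======"]
    for name, value in fields.items():
        lines.append(f"{name}: {value}")
    return "\n".join(lines)

readme_text = to_readme(description)
print(readme_text)
```

Keeping the fields labeled makes it easy to spot which piece of context (geography, time, methods, sample size, variables) you forgot to mention.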



## From Your Dataset to the Mystery Dataset
In Exercise 1, you described *your own dataset* in plain language.  
Now we flip perspectives:

- Imagine someone else shares *their dataset* with you — no context, just a file.  
- Can you tell if it’s usable?

We’ll use a **mystery evacuation dataset** to walk through this scenario.


## Exercise 2: Integration Feasibility Assessment

**Your Research Context**  
- **Current Study:** Hurricane Idalia evacuation survey — Sarasota County, August 2023 (n = 892)  
- **Integration Goal:** Combine datasets for more statistical power (need n > 2,000)  
- **Research Question:** Can we compare evacuation timing across counties and storm events?  

**Scenario**  
You get an email from a colleague.  
Attached is a CSV file. 
The note says: *“This might work with your evacuation research.”*  

**What you need to figure out:**  
- Do the **variables** line up? (evacuation timing, demographics)  
- Can you establish **geographic linkage**? (Florida coastal counties)  
- Was it collected with a **similar methodology**? (household survey data)  
- Is there **temporal alignment**? (comparable hurricane events)  


## Step 1 — Open the Mystery Dataset

Before we can assess integration, let’s actually look at the file.  
 

**What to do:**  
1. Run the next cell to open the dataset.  
2. Look at the basic overview it prints:  
   - How many rows (potential new observations)?  
   - How many variables?  
   - What are the variable names?  
   - What do the first few rows look like?  
3. Ask yourself: **Does this look usable with your Sarasota/Idalia study? What’s still unclear?**  


In [9]:
import pandas as pd

url = "https://raw.githubusercontent.com/jmote-noaa/Data-Forums/main/data/mystery_evacuation_dataset.csv"
df = pd.read_csv(url)

print("=== DATASET OVERVIEW ===")
print(f"Potential additional observations: {len(df)}")
print(f"Combined sample size would be: {len(df) + 892}")
print(f"Variables available: {len(df.columns)}")
print(f"\nVariable names:")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

print("\n=== FIRST 8 OBSERVATIONS ===")
display(df.head(8))


=== DATASET OVERVIEW ===
Potential additional observations: 1247
Combined sample size would be: 2139
Variables available: 15

Variable names:
 1. resp_id
 2. zip_code
 3. survey_date
 4. evac_timing
 5. dest_type
 6. network_size
 7. info_source
 8. prev_hurricane
 9. income_level
10. age_group
11. household_comp
12. evac_decision
13. prep_score
14. risk_perception
15. social_media_use

=== FIRST 8 OBSERVATIONS ===


| | resp_id | zip_code | survey_date | evac_timing | dest_type | network_size | info_source | prev_hurricane | income_level | age_group | household_comp | evac_decision | prep_score | risk_perception | social_media_use |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 33902 | 2022-09-29 | 2 | A | 8 | Official Warning | 0 | 3.0 | B | SA | Y | 4.4 | 5 | Other |
| 1 | 1002 | 33919 | 2022-09-29 | 4 | D | 3 | TV,Family | 1 | 2.0 | C | SF | Y | 3.1 | 3 | Other |
| 2 | 1003 | 33907 | 2022-09-29 | 3 | C | 5 | Weather App | 0 | 6.0 | D | CF | Y | 5.0 | 3 | IG |
| 3 | 1004 | 33906 | 2022-09-29 | 3 | C | 3 | News Website | 1 | 2.0 | D | SF | Y | 3.2 | 4 | Instagram,Facebook |
| 4 | 1005 | 33903 | 2022-09-28 | 2 | B | 5 | TV,Radio,Social | 1 | 3.0 | C | CF | N | 4.5 | 4 | Twitter,Instagram |
| 5 | 1006 | UNK | 2022-09-28 | 2 | A | 3 | TV,Family | 0 | 4.0 | C | SF | Y | 3.5 | 4 | FB,TW |
| 6 | 1007 | 33967-4611 | 2022-09-28 | 1 | A | 1 | Social Media | 1 | 3.0 | B | CA | Y | 3.8 | 4 | IG |
| 7 | 1008 | 33901 | 2022-09-30 | 4 | C | 4 | News Website | 0 | 2.0 | C | SF | Y | 1.4 | 3 | Facebook,Twitter |
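
One thing the first rows already hint at: `social_media_use` mixes spellings such as `IG` and `Instagram`, or `FB,TW` and `Facebook,Twitter`. The sketch below shows how such labels could be harmonized. The abbreviation mapping is a guess made for this example; without a codebook there is no way to confirm what the abbreviations mean.

```python
import pandas as pd

# Values copied from the first eight rows shown above.
sample = pd.Series(
    ["Other", "Other", "IG", "Instagram,Facebook",
     "Twitter,Instagram", "FB,TW", "IG", "Facebook,Twitter"],
    name="social_media_use",
)

# ASSUMPTION: these expansions are guesses; the file ships with no codebook.
ABBREVIATIONS = {"IG": "Instagram", "FB": "Facebook", "TW": "Twitter"}

def normalize_platforms(cell):
    """Split a comma-separated cell, expand abbreviations, and sort the labels."""
    parts = [p.strip() for p in str(cell).split(",")]
    expanded = [ABBREVIATIONS.get(p, p) for p in parts]
    return ",".join(sorted(expanded))

normalized = sample.map(normalize_platforms)
print(normalized.value_counts())
```

Sorting the labels means `Twitter,Instagram` and `Instagram,Twitter` count as the same answer. Every such choice is a decision that documented value codes would have made for you.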


## What Are We Looking At?

Now you’ve run the first cell and seen:  
- The dataset size (rows = potential new cases).  
- The variable names.  
- The first few observations.

So let’s pause and ask: **Is this data worth considering to integrate with your own study?**
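
A quick first pass is to check whether the column names even hint at the concepts your study needs. The sketch below does that with simple keyword matching. The `needed` dictionary and its keywords are hypothetical choices made for this example, and a matching name says nothing about how a variable is coded (for example, what `evac_timing` values 1 to 4 mean).

```python
# Column names copied from the overview printed in Step 1.
columns = [
    "resp_id", "zip_code", "survey_date", "evac_timing", "dest_type",
    "network_size", "info_source", "prev_hurricane", "income_level",
    "age_group", "household_comp", "evac_decision", "prep_score",
    "risk_perception", "social_media_use",
]

# ASSUMPTION: hypothetical keywords for what the Sarasota study would need.
needed = {
    "evacuation timing": ["evac_timing"],
    "demographics": ["age", "income", "household"],
    "geography": ["zip", "county", "fips"],
    "sampling weight": ["weight", "wt"],
}

def candidate_columns(columns, keywords):
    """Return the columns whose name contains any of the keywords."""
    return [c for c in columns if any(k in c for k in keywords)]

matches = {concept: candidate_columns(columns, kws) for concept, kws in needed.items()}
for concept, cols in matches.items():
    status = ", ".join(cols) if cols else "no candidate column"
    print(f"{concept}: {status}")
```

In this sketch nothing matches a sampling weight, which is one early sign that method details will have to come from somewhere other than the file itself.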


## Step 2: Geographic Integration

One of the first questions when deciding if datasets can be combined is:  
**“Were they collected in the same places?”**

For evacuation research, **geographic coverage is critical**:  
- If the survey locations overlap with Sarasota County, integration might be possible.  
- If not, the datasets may not be directly comparable.

**What to do:**  
1. **Run the code cell below** to see the zip codes and most common locations.  
2. **Look closely at the results** — do these zip codes line up with Sarasota, or do they point somewhere else?  
3. **Share in the chat:**  
   - What geographic area do you think this dataset represents?  
   - Would it integrate with your Sarasota study?




In [11]:
print("=== GEOGRAPHIC INTEGRATION ASSESSMENT ===")
print("Can this integrate with your Sarasota County study?")

# Examine zip codes
print(f"\nZip code analysis:")
print(f"- Unique zip codes: {df['zip_code'].nunique()}")
print(f"- Zip code format issues: {df['zip_code'].astype(str).str.len().value_counts().to_dict()}")

print(f"\nMost common locations:")
display(df['zip_code'].value_counts().head(8))

print(f"\nSample zip codes: {df['zip_code'].unique()[:10]}")



=== GEOGRAPHIC INTEGRATION ASSESSMENT ===
Can this integrate with your Sarasota County study?

Zip code analysis:
- Unique zip codes: 78
- Zip code format issues: {5: 1082, 3: 121, 10: 44}

Most common locations:


zip_code
33919    44
33906    42
33909    42
33976    42
339      41
33921    40
33904    39
33931    39
Name: count, dtype: int64


Sample zip codes: ['33902' '33919' '33907' '33906' '33903' 'UNK' '33967-4611' '33901'
 '33912' '33931']
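
The string lengths above (5, 3, and 10) suggest at least three formats: plain 5-digit ZIPs, ZIP+4 values such as `33967-4611`, and 3-character entries that include both `UNK` and the truncated `339`. The sketch below shows one way to label and clean them. `classify_zip` and `to_zip5` are helper names invented for this example, and the sketch does not map ZIP codes to counties; that would need a ZIP-to-county crosswalk, which is not part of this notebook.

```python
import re
from collections import Counter

# Values copied from the sample output above, plus the truncated "339" entry.
sample_zips = ["33902", "33919", "33907", "33906", "33903", "UNK",
               "33967-4611", "33901", "33912", "33931", "339"]

def classify_zip(value):
    """Label a raw ZIP string by its format."""
    text = str(value).strip()
    if re.fullmatch(r"\d{5}", text):
        return "zip5"
    if re.fullmatch(r"\d{5}-\d{4}", text):
        return "zip+4"
    if re.fullmatch(r"\d{1,4}", text):
        return "truncated"
    return "missing_or_other"

def to_zip5(value):
    """Return the 5-digit ZIP when it can be recovered, else None."""
    if classify_zip(value) in ("zip5", "zip+4"):
        return str(value).strip()[:5]
    return None

formats = Counter(classify_zip(z) for z in sample_zips)
clean = [to_zip5(z) for z in sample_zips]
print(formats)
print(clean)
```

Truncated and missing values are returned as `None` rather than guessed, so the number of records that cannot be placed geographically stays visible.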



## Step 3: Temporal Integration

The next big question when evaluating datasets is:  
**“Were they collected during the same time period?”**

Why this matters:  
- If two surveys measure evacuation behavior during the *same storm* or season, they’re more directly comparable.  
- If they are from *different storms or years*, integration may introduce new complexities.

**What to do:**  
1. **Run the code cell below** to see when the mystery dataset was collected.  
2. Compare the results to your Sarasota study (Hurricane Idalia, August 2023).  
3. **Share in the chat:**  
   - What storm or event do you think this dataset captures?  
   - Could it still be combined with your Idalia study, or is the time gap too large?



In [14]:
print("=== TEMPORAL INTEGRATION ASSESSMENT ===")
print("Does this align temporally with your Hurricane Idalia study (Aug 2023)?")

# Look at survey dates
print(f"Survey date range:")
print(f"- Earliest: {df['survey_date'].min()}")  
print(f"- Latest: {df['survey_date'].max()}")
print(f"- Collection span: {(pd.to_datetime(df['survey_date'].max()) - pd.to_datetime(df['survey_date'].min())).days} days")

print(f"\nSurvey timing distribution:")
df['survey_date'] = pd.to_datetime(df['survey_date'])
date_counts = df['survey_date'].dt.date.value_counts().head(5)
display(date_counts)


=== TEMPORAL INTEGRATION ASSESSMENT ===
Does this align temporally with your Hurricane Idalia study (Aug 2023)?
Survey date range:
- Earliest: 2022-09-28
- Latest: 2022-09-30
- Collection span: 2 days

Survey timing distribution:


survey_date
2022-09-30    423
2022-09-28    414
2022-09-29    410
Name: count, dtype: int64
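
To put a number on the time gap, the sketch below compares the two collection windows. The Idalia study is described only as "August 2023", so the first and last day of that month are used as stand-ins for the unknown fieldwork dates; treat them as assumptions.

```python
from datetime import date

# Dates taken from the output above.
mystery_start = date(2022, 9, 28)
mystery_end = date(2022, 9, 30)

# ASSUMPTION: placeholder dates, because the source only says "August 2023".
idalia_start = date(2023, 8, 1)
idalia_end = date(2023, 8, 31)

def windows_overlap(start_a, end_a, start_b, end_b):
    """True when two closed date ranges share at least one day."""
    return start_a <= end_b and start_b <= end_a

span_days = (mystery_end - mystery_start).days
gap_days = (idalia_start - mystery_end).days
overlap = windows_overlap(mystery_start, mystery_end, idalia_start, idalia_end)

print(f"Mystery collection span: {span_days} days")
print(f"Gap before the Idalia window: {gap_days} days")
print(f"Windows overlap: {overlap}")
```

The arithmetic is easy. What it cannot tell you is which event the respondents were answering about, and that is the part only metadata can supply.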


## Stepping Back: What Did We Learn?
So far we’ve checked:
- **Geography:** We saw locations, but couldn’t clearly tie them to Sarasota.
- **Time:** We saw late Sept 2022, but without context we’re still guessing.

**Takeaway:** Even after running some analysis, we’re left guessing.  
This is the core problem: **without metadata, we can’t answer the basic questions needed to decide if integration is feasible.**



## ❓ Integration Questions Still Unanswered

**Critical Information Missing:**

**Geographic Integration**
- Exact counties/regions represented
- Coordinate system used
- How to link with your Sarasota study geographically

**Temporal Integration**
- What specific storm event (Ian? Fiona? General preparedness?)  
- Timeline relative to storm impact  
- Comparability with your Idalia study timing  

**Methodological Integration**
- Who collected this data and how?  
- Sampling methodology / weights  
- Target population  

**Technical & Legal Integration**
- Who do I contact for collaboration?  
- Usage restrictions / license  
- File formats and coordinate systems  

**Result:** Integration assessment **blocked** ❌  
*Time spent: ~20 minutes exploring with no clear conclusion about feasibility.*



## Introducing Metadata: A Way Forward

We just saw how far we could get **without metadata**:  
- 20+ minutes exploring  
- Still no clear decision about feasibility  

This is where **metadata makes the difference**.  
Let’s look at a small set of essential fields, following the InSPIRE schema, and see if it helps us move forward.

**Scenario:**  
You notice in your colleague’s email that alongside the dataset, they’ve also attached a small file with a **`.json` extension**.

### Why JSON?
- **What it is:** JSON stands for *JavaScript Object Notation*. It’s a simple text format for storing structured information as “key–value” pairs.  
- **Why use it:**  
  - It’s human-readable — you can open it and understand it at a glance.  
  - It’s machine-readable — Python, R, and most software can parse JSON directly.  
  - It’s a widely used format for storing and sharing metadata across platforms.  
- **In practice:** Many metadata standards (such as DataCite and schema.org) provide JSON representations.  

Let’s open that JSON file and see what it tells us about the dataset.




## Basic Metadata Solution
The dataset provider shares **just a handful of fields**:
- Title, creator, subject area  
- Geographic coverage (Lee County, Florida)  
- Temporal coverage (Sept 28 – Oct 2, 2022)  
- Abstract describing a Hurricane Ian evacuation survey


In [39]:

basic_metadata = {
    "title": "Hurricane Ian evacuation behavior survey",
    "creator": "Florida Emergency Management Research Consortium",
    "subject_category": "Emergency Management",
    "geographic_coverage_specific": "Lee County, Florida",
    "temporal_coverage_start": "2022-09-28",
    "temporal_coverage_end": "2022-10-02",
    "data_type": "Survey Data",
    "abstract": "Household survey on evacuation decisions during Hurricane Ian"
}
print("=== BASIC METADATA (Essential Fields) ===\n")
for k,v in basic_metadata.items():
    print(f"{k}: {v}")


=== BASIC METADATA (Essential Fields) ===

title: Hurricane Ian evacuation behavior survey
creator: Florida Emergency Management Research Consortium
subject_category: Emergency Management
geographic_coverage_specific: Lee County, Florida
temporal_coverage_start: 2022-09-28
temporal_coverage_end: 2022-10-02
data_type: Survey Data
abstract: Household survey on evacuation decisions during Hurricane Ian
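
The cell above defines the metadata as a Python dictionary for convenience. If the same fields arrived as JSON text, reading them could look like the sketch below, which also checks that the survey dates we observed in Step 3 fall inside the declared temporal coverage. The JSON text is a hand-written copy of the fields above, not the actual attachment.

```python
import json
from datetime import date

# The same fields as above, written out as JSON text for illustration.
json_text = """
{
  "title": "Hurricane Ian evacuation behavior survey",
  "creator": "Florida Emergency Management Research Consortium",
  "subject_category": "Emergency Management",
  "geographic_coverage_specific": "Lee County, Florida",
  "temporal_coverage_start": "2022-09-28",
  "temporal_coverage_end": "2022-10-02",
  "data_type": "Survey Data",
  "abstract": "Household survey on evacuation decisions during Hurricane Ian"
}
"""

metadata = json.loads(json_text)
cover_start = date.fromisoformat(metadata["temporal_coverage_start"])
cover_end = date.fromisoformat(metadata["temporal_coverage_end"])

# Earliest and latest survey dates seen in Step 3.
observed_first = date(2022, 9, 28)
observed_last = date(2022, 9, 30)

dates_consistent = cover_start <= observed_first and observed_last <= cover_end
print(f"Fields parsed: {len(metadata)}")
print(f"Observed dates fall inside declared coverage: {dates_consistent}")
```

A check like this is a cheap way to confirm that a metadata file really describes the data file it came with.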



## What Does This Metadata Change?
With this small addition:
- We now know the **storm** (Hurricane Ian, Sept 2022).
- We know the **location** (Lee County, on the same southwest Florida coast as Sarasota County, with Charlotte County in between).
- We know the **data type** and timeframe.

This gets us closer...

### Poll Engagement #3 – Integration Assessment
Now that we have this context, how do you feel about integration potential?  
**High • Medium • Low**

### Chat Engagement #4 – What’s Still Missing?
In the chat, tell us:  
- What information do you *still* wish you had to make a confident decision?  




## From Gaps → Comprehensive Metadata

In the last step, we saw that **basic metadata** gave us some answers (storm, county, dates)…  
…but you told us there are still critical pieces missing:  
- Who collected it?  
- What methodology was used?  
- How do we technically integrate it?  
- Who do we contact?  

This is where **comprehensive metadata** comes in.  
Instead of a few fields, let’s look at a structured schema that captures all the details needed for confident integration.



In [25]:

# Comprehensive InSPIRE metadata schema documentation (illustrative slice)
comprehensive_metadata = {
  "identification_overview": {
    "title": "Hurricane Ian Evacuation Decision-Making Survey: Lee County, Florida",
    "abstract": "Household survey examining evacuation timing, transportation choices, and destination decisions during Hurricane Ian.",
    "purpose": "Quantify drivers of evacuation to inform emergency management.",
    "creator": [{"name": "Dr. Maria Rodriguez", "affiliation": "Florida International University"}],
    "unique_identifier": "10.7910/DVN/IAN2022"
  },
  "geographic_temporal": {
    "geographic_coverage_specific": "Lee County, Florida — FEMA Evacuation Zones A–E",
    "coordinate_system": "NAD83 / Florida West (EPSG:26959)",
    "spatial_resolution": "Census tract level (anonymized)",
    "temporal_coverage_start": "2022-09-28",
    "temporal_coverage_end": "2022-10-02",
    "temporal_resolution": "Event-based: 72-hour window"
  },
  "data_characteristics_methods": {
    "collection_methodology": "Stratified random sample by evacuation zone; CATI + door-to-door follow-up",
    "sampling_weights": "Design-based weights provided",
    "processing": "Cleaning and recodes documented",
    "uncertainty": "Standard errors and response rates reported"
  },
  "decision_support_attributes": {
    "use_cases": ["Evacuation Planning", "Infrastructure Risk Assessment", "Public Risk Communication"],
    "integration_with_other_data": ["NOAA storm surge models via FEMA zones", "US Census via FIPS"]
  },
  "technical_details_access": {
    "data_access_url": "https://dataverse.example.org/dataset/IAN2022",
    "access_restrictions": "Open access (post-embargo)",
    "contact": {"name": "Dr. Maria Rodriguez", "email": "maria.rodriguez@example.edu"}
  }
}
print("Loaded comprehensive metadata slice (InSPIRE-aligned).")


Loaded comprehensive metadata slice (InSPIRE-aligned).


In [27]:
from IPython.display import display, Markdown

def show_metadata(md):
    sections = []
    sections.append("###  Identification & Overview")
    sections.append(f"**Title:** {md['identification_overview']['title']}")
    sections.append(f"**Creator:** {md['identification_overview']['creator'][0]['name']} – {md['identification_overview']['creator'][0]['affiliation']}")
    sections.append(f"**Abstract:** {md['identification_overview']['abstract']}")

    sections.append("\n###  Geographic & Temporal")
    g = md['geographic_temporal']
    sections.append(f"**Coverage:** {g['geographic_coverage_specific']}")
    sections.append(f"**Coordinate system:** {g['coordinate_system']}")
    sections.append(f"**Dates:** {g['temporal_coverage_start']} → {g['temporal_coverage_end']}")

    sections.append("\n###  Methods & Quality")
    m = md['data_characteristics_methods']
    sections.append(f"**Methodology:** {m['collection_methodology']}")
    sections.append(f"**Weights:** {m['sampling_weights']}")

    sections.append("\n###  Decision Support & Integration")
    d = md['decision_support_attributes']
    sections.append(f"**Use cases:** {', '.join(d['use_cases'])}")
    sections.append(f"**Integration links:** {', '.join(d['integration_with_other_data'])}")

    sections.append("\n### 🔗 Access & Contact")
    t = md['technical_details_access']
    sections.append(f"**Access URL:** {t['data_access_url']}")
    sections.append(f"**Restrictions:** {t['access_restrictions']}")
    sections.append(f"**Contact:** {t['contact']['name']} ({t['contact']['email']})")

    # Two trailing spaces force a Markdown line break, so each field renders on its own line.
    display(Markdown("  \n".join(sections)))

show_metadata(comprehensive_metadata)


###  Identification & Overview
**Title:** Hurricane Ian Evacuation Decision-Making Survey: Lee County, Florida
**Creator:** Dr. Maria Rodriguez – Florida International University
**Abstract:** Household survey examining evacuation timing, transportation choices, and destination decisions during Hurricane Ian.

###  Geographic & Temporal
**Coverage:** Lee County, Florida — FEMA Evacuation Zones A–E
**Coordinate system:** NAD83 / Florida West (EPSG:26959)
**Dates:** 2022-09-28 → 2022-10-02

###  Methods & Quality
**Methodology:** Stratified random sample by evacuation zone; CATI + door-to-door follow-up
**Weights:** Design-based weights provided

###  Decision Support & Integration
**Use cases:** Evacuation Planning, Infrastructure Risk Assessment, Public Risk Communication
**Integration links:** NOAA storm surge models via FEMA zones, US Census via FIPS

### 🔗 Access & Contact
**Access URL:** https://dataverse.example.org/dataset/IAN2022
**Restrictions:** Open access (post-embargo)
**Contact:** Dr. Maria Rodriguez (maria.rodriguez@example.edu)
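
With nested metadata like this, you can check mechanically whether each open question has a field that answers it. The sketch below uses a reduced copy of the metadata above. The question-to-field mapping and the `license` path are illustrative assumptions made for this example, not part of the schema as shown.

```python
# Reduced copy of the comprehensive metadata shown above.
metadata = {
    "identification_overview": {
        "title": "Hurricane Ian Evacuation Decision-Making Survey: Lee County, Florida",
        "creator": [{"name": "Dr. Maria Rodriguez",
                     "affiliation": "Florida International University"}],
    },
    "data_characteristics_methods": {
        "collection_methodology": "Stratified random sample by evacuation zone; CATI + door-to-door follow-up",
        "sampling_weights": "Design-based weights provided",
    },
    "technical_details_access": {
        "access_restrictions": "Open access (post-embargo)",
        "contact": {"name": "Dr. Maria Rodriguez",
                    "email": "maria.rodriguez@example.edu"},
    },
}

# ASSUMPTION: illustrative mapping from questions to dotted field paths.
questions = {
    "Who collected it?": "identification_overview.creator",
    "What methodology was used?": "data_characteristics_methods.collection_methodology",
    "Who do we contact?": "technical_details_access.contact.email",
    "What license applies?": "technical_details_access.license",
}

def lookup(md, dotted_path):
    """Follow a dotted path through nested dicts; return None if a key is absent."""
    node = md
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

answers = {q: lookup(metadata, path) is not None for q, path in questions.items()}
for question, answered in answers.items():
    print(f"{question} {'answered' if answered else 'not found'}")
```

In this slice the access restrictions are recorded but no explicit license field is, so that question would still need a follow-up with the contact person.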

## Reflection – What Did We Gain?

With the **comprehensive metadata** now revealed, we can see fields that answer critical questions:
- Identification (title, creator, abstract, purpose, identifier)  
- Geographic & temporal coverage (county, FEMA zones, dates, resolution)  
- Methodology (collection design, weights, uncertainty)  
- Integration guidance (links to FEMA, Census, NOAA)  
- Access & contact information  

## 💬 **Chat Prompt:**  
- Which of these fields feel *most valuable* for your research?  
- Which ones would you want from colleagues if you were trying to reuse their dataset?  

👉 Next, we’ll pull it all together in a side-by-side **integration matrix**.



## Putting It All Together: Integration Feasibility

Now that we have **comprehensive metadata**, we can actually make an informed integration decision.

Instead of guessing, we can check requirements side by side:
- Geography  
- Storm event comparability  
- Methodology  
- Sample size (statistical power)  
- Technical linking & access  

Here’s what that assessment looks like when we line up your Sarasota/Idalia study with the Ian dataset:



In [64]:

# Create comprehensive integration assessment
idalia_n = 892
mystery_n = len(df)
total_n = idalia_n + mystery_n

matrix = pd.DataFrame([
    ["Geography", "Sarasota County, FL", "Lee County, FL", "Nearby Gulf Coast counties ✔"],
    ["Storm Event", "Hurricane Idalia (Aug 2023)", "Hurricane Ian (Sept 2022)", "Comparable major hurricanes ✔"],
    ["Time", "Aug 2023", "Sept–Oct 2022", "Different years but comparable ✔"],
    ["Methodology", "Household survey (n=892)", f"Stratified household survey (n≈{mystery_n})", "Design & content broadly aligned ✔"],
    ["Sample Size", "892", f"{mystery_n}", f"Combined n ≈ {total_n} (meets >2,000 threshold ✔)"],
    ["Technical Linkages", "FIPS, county", "FEMA zones, FIPS", "Join via county/FIPS; compare zones ✔"],
    ["Access & Contact", "You own this dataset", "Dataverse link + PI contact", "Collaboration pathway enabled ✔"],
    ["Integration Potential", "", "", "High confidence ✔"]
], columns=["Dimension", "Idalia Study", "Mystery Dataset", "Assessment"])

matrix


| | Dimension | Idalia Study | Mystery Dataset | Assessment |
|---|---|---|---|---|
| 0 | Geography | Sarasota County, FL | Lee County, FL | Nearby Gulf Coast counties ✔ |
| 1 | Storm Event | Hurricane Idalia (Aug 2023) | Hurricane Ian (Sept 2022) | Comparable major hurricanes ✔ |
| 2 | Time | Aug 2023 | Sept–Oct 2022 | Different years but comparable ✔ |
| 3 | Methodology | Household survey (n=892) | Stratified household survey (n≈1247) | Design & content broadly aligned ✔ |
| 4 | Sample Size | 892 | 1247 | Combined n ≈ 2139 (meets >2,000 threshold ✔) |
| 5 | Technical Linkages | FIPS, county | FEMA zones, FIPS | Join via county/FIPS; compare zones ✔ |
| 6 | Access & Contact | You own this dataset | Dataverse link + PI contact | Collaboration pathway enabled ✔ |
| 7 | Integration Potential | | | High confidence ✔ |
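
Once the assessment looks favorable, the actual pooling step could resemble the sketch below. Both frames are toy rows invented for this example, and `county_fips` is a hypothetical column name (neither dataset is shown with such a column). The FIPS codes used are 12115 for Sarasota County and 12071 for Lee County.

```python
import pandas as pd

# ASSUMPTION: toy rows with hypothetical column names; neither frame is real data.
idalia = pd.DataFrame({
    "resp_id": [1, 2],
    "county_fips": ["12115", "12115"],  # Sarasota County, FL
    "evac_timing": [1, 3],
})
ian = pd.DataFrame({
    "resp_id": [1001, 1002],
    "county_fips": ["12071", "12071"],  # Lee County, FL
    "evac_timing": [2, 4],
})

# Tag each row with its storm so the source stays traceable after pooling.
combined = pd.concat(
    [idalia.assign(storm="Idalia"), ian.assign(storm="Ian")],
    ignore_index=True,
)

per_storm = combined.groupby("storm")["evac_timing"].agg(["count", "mean"])
print(per_storm)

# Sample sizes reported in this notebook.
planned_n = 892 + 1247
print(f"Planned combined n: {planned_n}")
```

Stacking rows is only meaningful if `evac_timing` is coded the same way in both surveys and the sampling weights are handled consistently. The methods section of the metadata is where you would confirm both.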


# Wrapping Up: What Did We Learn?

Without metadata ❌  
- 20+ minutes exploring, still uncertain about geography, time, and methods.  

With **basic metadata** ⚠️  
- We could identify the storm (Hurricane Ian) and location (Lee County).  
- Helpful, but still missing key details.  

With **comprehensive metadata** ✅  
- We could confidently assess feasibility.  
- Geography, time, methods, access, and contact info all in place.  
- Integration matrix showed compatibility and a combined sample (n = 2,139) above the n > 2,000 target.  

---

## Key Takeaway
**Metadata is not overhead — it’s the bridge between datasets.**  
It turns raw files into reusable research resources.  

---

## Next: Back to the Forum
In the slides, we’ll:  
- Walk through the InSPIRE metadata schema design.  
- Show how these fields map directly to the problems we just experienced.  
- Discuss how we, as a community, can refine and adopt these practices.
