# Forum 2 - From Metadata to Insights

Author: Jonathon Mote, PhD - Weather Program Office
September 2025

This tutorial is designed for social scientists who want to explore how their own data might begin to interface with weather and hazard datasets.  The tutorial will provide a quick overview of Jupyter notebooks and tools, some geospatial tools, and the use of APIs to access data.

### What we'll do in the notebook:
1.  **Work with APIs** to search for, access, and download data programmatically
2.  Organize, explore, and download datasets interactively using Python libraries like **pandas** and **requests**.
3.  Merge survey data with external data from the **Iowa Environmental Mesonet**
4.  Apply **geospatial tools** to handle location-based data
5.  Create clear, reproducible **visualizations** and **statistical analyses** directly alongside your analysis
6.  Document our process in a way that combines code, results, and explanation all in one place  

### The Steps We Will Follow

<div style="display:flex; flex-direction:column; align-items:center; gap:14px; margin:14px 0 6px 0;">

  <!-- Step 1 -->
  <div style="width:100%; max-width:900px; height:120px; background:#d0eff8; color:#ffffff; border-radius:12px;
              display:flex; align-items:center; justify-content:center; text-align:center;
              padding:0 20px; box-sizing:border-box; font-weight:700; line-height:1.3; font-size:20px;">
    Step 1: Use an API to explore a repository, view metadata, and "pull" social data
  </div>
  <div style="width:2px; height:28px; background:#bfbfbf; border-radius:1px;"></div>

  <!-- Step 2 -->
  <div style="width:100%; max-width:900px; height:120px; background:#0069af; color:#ffffff; border-radius:12px;
              display:flex; align-items:center; justify-content:center; text-align:center;
              padding:0 20px; box-sizing:border-box; font-weight:700; line-height:1.3; font-size:20px;">
    Step 2: Use an API to "pull" weather data
  </div>
  <div style="width:2px; height:28px; background:#bfbfbf; border-radius:1px;"></div>

  <!-- Step 3 -->
  <div style="width:100%; max-width:900px; height:120px; background:#004b98; color:#ffffff; border-radius:12px;
              display:flex; align-items:center; justify-content:center; text-align:center;
              padding:0 20px; box-sizing:border-box; font-weight:700; line-height:1.3; font-size:20px;">
    Step 3: Merge weather data and social data
  </div>
  <div style="width:2px; height:28px; background:#bfbfbf; border-radius:1px;"></div>

  <!-- Step 4 -->
  <div style="width:100%; max-width:900px; height:120px; background:#003087; color:#ffffff; border-radius:12px;
              display:flex; align-items:center; justify-content:center; text-align:center;
              padding:0 20px; box-sizing:border-box; font-weight:700; line-height:1.3; font-size:20px;">
    Step 4: Visualize and analyze the combined dataset
  </div>

</div>



## Stop!  Think About Your Own Data 

Thinking about your data, what types of weather-related data might bring additional insights?  What are some questions that you're interested in? 
#### **>>Respond in the chat<<**

<h2><span style="color:red">Our Research Question Today</span></h2>

For the purposes of this tutorial, we will explore the following question: does exposure to watches, warnings, and advisories (WWAs) affect survey responses about weather risk perception?  We will eventually narrow the focus to flood warnings, watches, and advisories.

### Before We Get Started - Workplace Setup (Imports)

In Jupyter, there are a large number of Python-based "libraries" or "packages" that help with data loading, transformation, and analysis.  These libraries provide tools that help us do things like make graphs, work with data, or do more complex calculations, like regression.  It's good practice to import all libraries at the beginning.  You can always add (or even install) libraries as you go along; you just have to rerun the cells (or restart the *kernel* after a new install).  

To run a cell, you can go to **"Run"** in the Jupyter menu and select **"Run selected cell"**.  However, it is easier to click on the chevron (▶️) in the editing menu.  There are also keyboard shortcuts like **Shift+Enter** or **Ctrl+Enter**.

In this tutorial, we are going to use a variety of libraries that can be grouped in the following categories:

*Data Handling*

- **pandas**: For working with tabular data in DataFrames.  It is commonly imported with an alias (pd), so we don't constantly have to type out pandas.
- **requests**: For easily fetching data from web APIs and URLs (using GET, POST, etc).
- **timedelta**: Imported from the datetime module, for representing time intervals.
- **BytesIO**: Imported from the io module, for treating in-memory bytes like a file for reading.
- **Counter**: Imported from the collections module, for counting how often each item appears.
- **ast**: A Python module for safely evaluating strings that contain Python literals, such as lists.

*Geospatial*

- **geopandas**: To work with geospatial data, allowing us to perform spatial operations and handle geometries such as points, polygons, and lines.
- **Point**: Imported from Shapely, for creating geometric points for mapping.

*Visualization*

- **pyplot**: Imported from Matplotlib using the alias "plt", for making simple, customizable plots and charts.
- **seaborn**: Imported using the alias "sns", for making better looking visualizations.
  
*Utilities*

- **time**: Provides functions for working with time, such as measuring durations, pausing executions, and accessing system time.
- **tqdm**: Adds progress bars to loops for tracking execution.
- **IPython.display**: Renders Python results as HTML in the notebook.

*Statistical Modeling/Inference*

- **scipy.stats.chi2_contingency**: Imported from SciPy, to run a chi-square test of independence to check if two categorical variables are related.
- **statsmodels.api as sm**: Imported with the alias "sm", it provides tools for statistical models, including logistic regression.
- **OrderedModel**: Imported from statsmodels, used for ordinal logistic regression models when outcomes are ordered categories.

In [None]:
#import libraries

# data handling
import pandas as pd
import requests
from datetime import timedelta
from io import BytesIO
from collections import Counter
import ast

# geospatial
import geopandas as gpd
from shapely.geometry import Point

#visualization
import matplotlib.pyplot as plt
import seaborn as sns

#utilities
import time
from tqdm import tqdm
from IPython.display import display, HTML

#statistical modeling/inference
from scipy.stats import chi2_contingency        
import statsmodels.api as sm                    
from statsmodels.miscmodels.ordinal_model import OrderedModel 

<div style="background-color:#0085ca66; color:white; padding:20px; border-radius:10px; text-align:center; font-size:28px; font-weight:bold;">
  Step 1
</div>

## Step 1: Use an API to explore a repository, view metadata, and download data

In this step, we will explore API access to a data repository, the Harvard Dataverse.  An API (Application Programming Interface) is just a set of rules and tools that allows different software and servers to communicate and interact with each other.  In this case, we want our Jupyter notebook to interact with the Harvard Dataverse server for information on datasets.  We use the Python library "requests" to simplify and automate our requests, and Dataverse returns what we requested (hopefully), typically in a format called JSON.  We then use pandas to transform the JSON into a dataframe, making the results easier to read and manipulate.  

- **Note**: Not all APIs are the same, and there may be differences across repositories and data servers.  Be sure to check each API's documentation for how to get started, authentication, search and data access, and more.  For Harvard's Dataverse, the [Dataverse API Guide](https://guides.dataverse.org/en/latest/api/index.html) provides comprehensive, up-to-date documentation for all operations in Harvard’s Dataverse.

- **Another Note**: Check whether the API is open access or whether you need an **API key** for restricted access.  An API key is a unique code generated by the data provider that allows users to authenticate and access the data.  For this tutorial, the APIs do not require an API key.
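
As a small illustration of how such a request is put together, the sketch below builds a Dataverse search URL from its parts using only the standard library.  The helper name `build_search_url` is hypothetical; the parameters (`q`, `type`, `start`, `per_page`) are the ones used in the search cells of this notebook.  Encoding the query this way should also keep searches that contain spaces or special characters valid.

```python
from urllib.parse import urlencode

BASE_URL = "https://dataverse.harvard.edu/api/search"

def build_search_url(query, start=0, per_page=10):
    """Return a search URL with the query string safely encoded (hypothetical helper)."""
    params = {"q": query, "type": "dataset", "start": start, "per_page": per_page}
    return f"{BASE_URL}?{urlencode(params)}"

# Nothing is sent over the network here; we only build and print the URL
print(build_search_url("extreme weather"))
```

The resulting string can be passed to `requests.get` exactly like the hand-written URLs in the cells that follow.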


#### Simple search for 10 results

By default, the Harvard Dataverse only returns 10 results per search request.

In [None]:
# Define search query
query = "ripberger"
search_url = f"https://dataverse.harvard.edu/api/search?q={query}&type=dataset"

# Perform search and show JSON
response = requests.get(search_url)
results = response.json()
results  # Display raw JSON output

#### Transform the JSON results

We can easily transform the JSON results into a more readable format, a Pandas dataframe.  This bit of code we can keep separate or build into any cells where the results are returned in JSON.

In [None]:
# Extract items and load into DataFrame
items = results['data']['items']
df_results = pd.DataFrame(items)

# Preview first 5 rows
df_results.head()

#### Let's examine that 'description' a little more

In [None]:
pd.set_option("display.max_colwidth", None)

display(HTML("""
<style>
.dataframe td {
  white-space: normal !important;
  word-wrap: break-word;
  max-width: 400px;
}
</style>
"""))

df_results[["name", "description"]].head()

#### Search for more than 10 results

We can go beyond the default 10 results by requesting pages in a **loop**.  The "while True" statement keeps running (here 20 results at a time, as set by `per_page`) until there are no results remaining.  We collect all of the results in one list using the "extend" method.  Finally, we can limit the dataframe to only view a subset of columns.

In [None]:
query = "ripberger"
start = 0
per_page = 20  # Results per request (default is 10; the Dataverse API guide lists a max of 1000)
all_items = []

while True:
    search_url = (
        f"https://dataverse.harvard.edu/api/search?"
        f"q={query}&type=dataset&start={start}&per_page={per_page}"
    )
    
    response = requests.get(search_url)
    data = response.json()
    
    items = data['data']['items']
    all_items.extend(items)
    
    # Break if fewer than per_page results are returned (i.e., last page)
    if len(items) < per_page:
        break
    start += per_page

# Convert to DataFrame
df_results = pd.DataFrame(all_items)
df_results[['name', 'global_id', 'published_at', 'citation']].head(10)
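
To see the stopping rule of the loop above without touching the network, here is a minimal offline sketch.  The function `fetch_page` and its 45 pretend results are made up for illustration; only the loop logic mirrors the cell above.

```python
def fetch_page(start, per_page):
    """Stand-in for one API request: returns a slice of a pretend result set."""
    fake_results = [f"dataset_{i}" for i in range(45)]  # hypothetical data
    return fake_results[start:start + per_page]

def fetch_all(per_page=20):
    """Collect every page until a short (or empty) page signals the end."""
    all_items, start = [], 0
    while True:
        items = fetch_page(start, per_page)
        all_items.extend(items)
        if len(items) < per_page:  # fewer results than requested: last page
            break
        start += per_page
    return all_items

print(len(fetch_all()))
```

With 45 results and 20 per page, the loop makes three requests (20, 20, and 5 items) and then stops.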

In [None]:
#How many datasets?  Each row (first number) represents a dataset.
df_results.shape

### Subsetting Our Results

Let's say we don't want all of these, but only a subset of related surveys.  For this step, we will focus on the yearly waves of the Extreme Weather and Society Survey (named WXYY, where YY is the two-digit year). So let's subset the dataframe from the earlier API call.

In [None]:
# Define the dataset names you want to filter on
target_names = ["WX17", "WX18", "WX19", "WX20", "WX21", "WX22", "WX23", "WX24"]

# Subset the dataframe
df_subset = df_results[df_results['name'].isin(target_names)]

#Reset colwidth
pd.reset_option("display.max_colwidth")

# Display the result
df_subset.head(5)
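
As a side note, an explicit list of names is not the only way to subset.  The sketch below, which uses a made-up `name` column, compares `isin` with a pattern match that keeps any name of the form WX plus two digits.  Whether a pattern is appropriate depends on how the datasets are actually named in the repository, so treat it as an assumption to verify.

```python
import pandas as pd

# Hypothetical search results; only the 'name' column matters here
df_demo = pd.DataFrame({"name": ["WX17", "WX18", "TC20", "WX19 codebook", "WX24"]})

# Option 1: keep rows whose name appears in an explicit list
exact = df_demo[df_demo["name"].isin(["WX17", "WX18", "WX24"])]

# Option 2: keep rows whose entire name matches "WX" followed by two digits
pattern = df_demo[df_demo["name"].str.fullmatch(r"WX\d{2}")]

print(pattern["name"].tolist())
```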

## Step 1a: Get Dataset Metadata and Files

### Getting metadata for a single dataset

Let's examine the metadata for one of the datasets in the subset, WX18.  There will often be two sets of metadata: file-level (specific to the repository) and data-level (which describes the data, the topic of yesterday's forum).  In this example, we will first look at the file-level metadata that the Dataverse uses.  Next, we will pull the full dataset metadata.  While the results will be in JSON, we will be adding a step that converts them to a pandas dataframe.

### File level metadata

In [None]:
# Extract persistent ID (DOI) from the first row [0] by position
persistent_id = df_subset.iloc[0]['global_id']

# Get dataset metadata
metadata_url = f"https://dataverse.harvard.edu/api/datasets/:persistentId/?persistentId={persistent_id}"
metadata_response = requests.get(metadata_url).json()

# Get list of files
files = metadata_response['data']['latestVersion']['files']

# Convert list of files to DataFrame
df_files = pd.DataFrame(files)

# Preview first few rows
df_files.head()

### Full dataset metadata

This is actually a case where we want to see the full JSON, to view those aspects of the dataset that are most useful.  In this case, the most relevant information can be found in a description of the dataset.

In [None]:
# Extract persistent ID (DOI) from the first row
persistent_id = df_subset.iloc[0]['global_id']

# Get full dataset metadata (latest version)
metadata_url = f"https://dataverse.harvard.edu/api/datasets/:persistentId/versions/:latest?persistentId={persistent_id}"
metadata_response = requests.get(metadata_url).json()

# Extract list of files from the JSON
files = metadata_response['data']['files']

metadata_response
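
Scrolling through the full JSON works, but the description can also be pulled out with a few lines of code.  The sketch below runs on a small made-up dictionary; the nesting (`metadataBlocks` → `citation` → `fields`, with a `dsDescription` entry) is an assumption about the response layout that you should confirm against the JSON printed above.  The function name `get_descriptions` is hypothetical.

```python
# Made-up response that imitates the assumed layout of the dataset metadata
sample_response = {
    "data": {
        "metadataBlocks": {
            "citation": {
                "fields": [
                    {"typeName": "title", "value": "Example survey"},
                    {"typeName": "dsDescription",
                     "value": [{"dsDescriptionValue": {"value": "A made-up description."}}]},
                ]
            }
        }
    }
}

def get_descriptions(response_json):
    """Return every description string found in the citation block (hypothetical helper)."""
    fields = (response_json.get("data", {})
              .get("metadataBlocks", {})
              .get("citation", {})
              .get("fields", []))
    descriptions = []
    for field in fields:
        if field.get("typeName") == "dsDescription":
            for entry in field.get("value", []):
                descriptions.append(entry["dsDescriptionValue"]["value"])
    return descriptions

print(get_descriptions(sample_response))
```

Using `.get` with defaults means a response with a different shape returns an empty list instead of raising an error.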

## Step 1b: Download a File

From the file-level metadata, we see that the dataset file (.tab) is accompanied by PDFs of the instrument and a reference report.

Let's download the dataset file (.tab) for the first year of the survey, WX18, which has a file_id of "3657710".  If you wanted to work with this data for real, you would also want to download the survey instrument and reference document to fully understand the data.

In [None]:
# File ID for WxEM_Wave1.tab
file_id = 3657710

# API call and download directly to memory
file_url = f"https://dataverse.harvard.edu/api/access/datafile/{file_id}?format=original"
response = requests.get(file_url)

# Load into pandas directly from memory, assuming comma-delimited content
df_18 = pd.read_csv(BytesIO(response.content), sep=',', encoding='ISO-8859-1', engine='python', on_bad_lines='skip')
df_18.head()
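
The cell above assumes the downloaded content is comma-delimited.  If you are unsure, a quick check of the header line can help.  The sketch below uses made-up bytes in place of `response.content` and a deliberately simple rule (count tabs versus commas in the first line), so it is a rough heuristic rather than a general-purpose detector.

```python
from io import BytesIO
import pandas as pd

# Made-up bytes standing in for response.content
raw_bytes = b"p_id\tlat\tlon\n1\t35.2\t-97.4\n2\t41.6\t-93.6\n"

# Look only at the header line and pick whichever separator is more common
header = raw_bytes.split(b"\n", 1)[0]
sep = "\t" if header.count(b"\t") > header.count(b",") else ","

# BytesIO lets pandas read the in-memory bytes as if they were a file
df_demo = pd.read_csv(BytesIO(raw_bytes), sep=sep)
print(df_demo.shape)
```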

### Examine the dataset

Pandas has a number of attributes and methods that can be used to quickly examine characteristics of the dataset, things like size, shape, value distributions, basic stats, and missing values...anything you would normally do to check a dataset.  Below, I am just going to run a few of them.

In [None]:
# See basic shape of the data (rows, columns).  Each row represents a respondent.

df_18.shape

In [None]:
# See quick summary statistics

df_18.describe()

In [None]:
# Get a quick count of missing data by column

df_18.isnull().sum()

#### Variable names

Next, we will list the columns to show all the variable names contained in the data.  To integrate with weather data, we are most interested in locating possible ways to join the data.  Typically, geographic variables are a good start. 

<h2><span style="color:red">Stop!  After we run the next cell, do you see any geographic variables that might be useful?</span></h2>

#### **>>Respond in the chat<<**

In [None]:
# List all column names
df_18.columns.tolist()

In this survey, some good possible variables are state, zip, and lat/lon. These are pretty straightforward, but I'm also curious about "nws_region". What does it contain? Let's take a look!

In [None]:
# Is nws_region useful at all?
df_18['nws_region'].unique()

##### Unfortunately, "nws_region" only has four regions.  Nonetheless, it might be useful for a different research question.  

### For our purposes, we will focus on 'lat'/'lon' for location.  And we'll use begin_date to define a window covering the three days before each respondent began the survey.

<div style="background-color:#0069afff; color:white; padding:20px; border-radius:10px; text-align:center; font-size:28px; font-weight:bold;">
  Step 2
</div>

## Step 2: Use an API to download weather data

First, we will demonstrate how the Iowa Mesonet API can be used to collect weather alerts for each survey respondent.  Since calling an API can be slow and depends on internet access, we have already pre-downloaded the weather data needed for this analysis.  

Next, we will take this pre-loaded weather data and merge it with the survey responses so that each person’s record includes both their answers and the relevant weather alerts.  


### Iowa Mesonet API - DO NOT RUN AT THIS TIME

The code below is doing a lot of work, going through the survey data **one person/row at a time** and asking the Iowa Mesonet API:  

“Given this person’s location and survey date, what watches, warnings, or advisories (WWAs) were active in the few days leading up to that date?”  

**Analogy:** This is like calling a weather hotline for each person’s hometown and writing down any recent alerts next to their name in the survey spreadsheet.  

Here’s what happens step by step:  

1. **Make sure the date is in the right format.**  
   The `begin_date` column is converted into a standard date format so the computer can work with it (it's a precaution). 

2. **Prepare a place to store the results.**  
   A new column called `wwa_names` is added to the survey data. This will eventually hold a *list* of weather alerts for each person.  In Python, you can think of a list as a flexible container that holds a collection of items.

3. **Go through each respondent one by one.**  
   For each person, we:  
   - Look up their latitude, longitude, and survey date.  
   - Define a **3-day window** before their survey date (so we catch recent alerts).  
   - Build a request to the Iowa Mesonet API using their location and dates.  


4. **Ask the API for weather alerts.**  
   - If the API responds successfully, we pull out the names of any alerts and save them in that person’s row.  
   - If something goes wrong (bad connection, no data, etc.), we save an *empty list* for that person.
   - The API call will continue until we run out of survey respondents.  

The resulting dataset (`df_18`) has a new column called `wwa_names` that tells us which WWAs (if any) each respondent experienced around the time of their survey.  

Please note the API endpoint (REST-like) and the structure of the request for those variables:

https://mesonet.agron.iastate.edu/json/vtec_events_bypoint.py?lat={lat}&lon={lon}&sdate={start}&edate={end}&buffer=0.1

Remember, you can always check the API documentation for guidance: the [Iowa Mesonet API Guide](https://mesonet.agron.iastate.edu/api/).
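
To make the request structure concrete, here is a small sketch that builds the URL for a single respondent without sending it.  The endpoint and parameter names are taken from the code cell that follows; the helper name `build_wwa_url`, the coordinates, and the date are made up for illustration.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

ENDPOINT = "https://mesonet.agron.iastate.edu/json/vtec_events_bypoint.py"

def build_wwa_url(lat, lon, survey_date, days_before=3, buffer=0.1):
    """Build the request URL for one location and a window ending on survey_date."""
    start_date = survey_date - timedelta(days=days_before)
    params = {
        "lat": lat,
        "lon": lon,
        "sdate": start_date.strftime("%Y-%m-%d"),
        "edate": survey_date.strftime("%Y-%m-%d"),
        "buffer": buffer,
    }
    return f"{ENDPOINT}?{urlencode(params)}"

# Hypothetical respondent who began the survey on 15 June 2018
print(build_wwa_url(35.22, -97.44, date(2018, 6, 15)))
```

Pasting a URL like this into a browser is a quick way to inspect the raw JSON before looping over thousands of rows.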

**Note:  We will not run this API call during the webinar because it takes about 20-30 minutes to download the data. If you want to try this after the webinar, just remove the first line ('''python) and run the cell.**

In [None]:
'''python

# Make sure begin_date is in datetime format
df_18['begin_date'] = pd.to_datetime(df_18['begin_date'])

# New column to store list of WWA names
df_18['wwa_names'] = None

# Loop through each respondent
for idx, row in tqdm(df_18.iterrows(), total=len(df_18)):
    lat = row['lat']
    lon = row['lon']
    end_date = row['begin_date']
    start_date = end_date - timedelta(days=3)

    # Build API URL with small buffer
    url = (
        f"https://mesonet.agron.iastate.edu/json/vtec_events_bypoint.py"
        f"?lat={lat}&lon={lon}"
        f"&sdate={start_date.strftime('%Y-%m-%d')}&edate={end_date.strftime('%Y-%m-%d')}"
        f"&buffer=0.1"
    )

    try:
        response = requests.get(url, timeout=30)  # give up after 30 seconds instead of hanging
        if response.status_code == 200:
            data = response.json()
            names = [event['name'] for event in data.get('events', [])]
            df_18.at[idx, 'wwa_names'] = names
        else:
            df_18.at[idx, 'wwa_names'] = []
    except Exception as e:
        print(f"Failed for idx={idx}, lat={lat}, lon={lon}: {e}")
        df_18.at[idx, 'wwa_names'] = []
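
One possible way to shorten a long run like this is to avoid repeating identical requests.  If several respondents share the same coordinates and survey date (an assumption worth checking in your own data), the answer can be remembered and reused.  The sketch below simulates this with a made-up `fetch_alerts` function; a real version would call `requests.get` inside it.

```python
from functools import lru_cache

call_log = []  # records each simulated request so we can count them

@lru_cache(maxsize=None)
def fetch_alerts(lat, lon, sdate, edate):
    """Stand-in for one API request; results are cached by their arguments."""
    call_log.append((lat, lon, sdate, edate))
    return ("Flood Warning",) if lat > 40 else ()  # made-up rule, not real data

respondents = [
    (41.6, -93.6, "2018-06-12", "2018-06-15"),
    (41.6, -93.6, "2018-06-12", "2018-06-15"),  # same place and dates as the first
    (35.2, -97.4, "2018-06-12", "2018-06-15"),
]

results = [list(fetch_alerts(*r)) for r in respondents]
print(len(respondents), "respondents,", len(call_log), "requests")
```

The function returns a tuple rather than a list so that the cached value cannot be changed by accident; each respondent then gets their own list copy.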


<div style="background-color:#004b98ff; color:white; padding:20px; border-radius:10px; text-align:center; font-size:28px; font-weight:bold;">
  Step 3
</div>


### Step 3 - Joining the Survey Data with Weather Alerts

Prior to the webinar, the data we needed was downloaded to a CSV file and stored on GitHub.  This dataset ('wwa_by_pid') includes the respondent identifier ('p_id') and a column with watches, warnings, and advisories ('wwa_names') in a list.  At this point, we have two different tables of data that we need to merge:

- **Survey data** (df_18): this has all the survey responses, including their p_id (a unique identifier for each person).

- **Weather alerts data** (lk): this has just two columns — the same p_id, and the list of watches, warnings, or advisories (wwa_names) that each person experienced.

We're going to do a left merge on both tables using p_id, which is basically doing the following: 

**“For each person (p_id) in the survey data, look up their matching p_id in the weather file and bring in the weather alerts column (wwa_names).”**
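
Here is a minimal sketch of a left merge on two tiny made-up tables.  The `validate` and `indicator` arguments are optional extras that are not used in the tutorial's own merge: `validate="one_to_one"` raises an error if a `p_id` is duplicated in either table, and `indicator=True` adds a `_merge` column showing where each row found a match.

```python
import pandas as pd

# Made-up survey and alert tables that share the key 'p_id'
survey = pd.DataFrame({"p_id": [1, 2, 3], "risk_flood": [2, 4, 3]})
alerts = pd.DataFrame({"p_id": [1, 3], "wwa_names": ["['Flood Warning']", "[]"]})

merged = survey.merge(
    alerts,
    on="p_id",
    how="left",             # keep every survey row, matched or not
    validate="one_to_one",  # fail loudly if p_id is duplicated on either side
    indicator=True,         # add a '_merge' column: 'both' or 'left_only'
)

print(merged[["p_id", "wwa_names", "_merge"]])
```

Respondent 2 has no row in the alerts table, so their `wwa_names` value is missing and `_merge` reads `left_only`.  The number of rows stays the same as in the survey table.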

##### If you ran the API call in Step 2, you should skip the next two cells. 

In [None]:
# If you have run the API, you should skip this cell

url = "https://raw.githubusercontent.com/jmote-noaa/Data-Forums/refs/heads/main/data/wwa_by_pid.csv"
lk = pd.read_csv(url)

# Merge with the original df_18 on 'p_id'
df_18 = df_18.merge(lk, on="p_id", how="left")

In [None]:
# We use `ast` to convert the `wwa_names` column from text back into real Python lists so we can work with them.  
df_18['wwa_names'] = df_18['wwa_names'].apply(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else x
)

##### If you ran the API call in Step 2, you can resume here.

In [None]:
# Let's take a quick look at the new, combined dataset
df_18.head(5)

In [None]:
# We can subset the dataframe to focus only on the variables we want to see.  The "dropna" method removes any rows with missing values from this view, so we only preview complete rows.
df_18[['p_id','lat', 'lon', 'begin_date', 'wwa_names']].dropna().head(10)

In [None]:
# Let's make sure we have the same number of rows that we started with.  It should be 3,000.
df_18.shape

In [None]:
#how many respondents experienced watches/warnings/advisories?
df_18['wwa_names'].apply(lambda x: isinstance(x, list) and len(x) > 0).sum()

In [None]:
#who received more than two?
df_18[df_18['wwa_names'].apply(lambda x: isinstance(x, list) and len(x) > 2)
][['p_id','lat', 'lon', 'begin_date', 'wwa_names']]

### Step 3a: Exporting the Data for Use Outside Jupyter

Now that we have a merged dataframe (df_18) containing both survey responses and weather alerts, we can save and download it in different formats for use in other tools and workflows. Jupyter is excellent for exploring, cleaning, and manipulating data, but once you’ve shaped the dataset you want, you may prefer to bring it back into your regular workflow — whether that’s statistical software, spreadsheets, or visualization platforms. We won’t actually run the save commands here, but I’ll provide the code snippets so you can see how it’s done.

**CSV (general use--Excel, R, SAS, SPSS, Stata)**

In [None]:
#df_18.to_csv("survey_weather.csv", index=False)

**Parquet (general use--generally for larger datasets)**

In [None]:
#df_18.to_parquet("survey_weather.parquet")
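
One thing to keep in mind with CSV: a column of Python lists, such as `wwa_names`, is written out as text.  The sketch below demonstrates the round trip on a made-up two-row table, writing to an in-memory buffer instead of a file, and then converting the text back into lists with `ast.literal_eval` (the same approach used earlier in Step 3).

```python
import ast
from io import StringIO
import pandas as pd

# Made-up table with a list-valued column
df_demo = pd.DataFrame({
    "p_id": [1, 2],
    "wwa_names": [["Flood Warning", "Flood Watch"], []],
})

# Write to an in-memory buffer instead of a file, then read it back
buffer = StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip the lists are plain strings
print(type(reloaded.loc[0, "wwa_names"]))

# Convert the strings back into real Python lists
reloaded["wwa_names"] = reloaded["wwa_names"].apply(ast.literal_eval)
```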

<div style="background-color:#fb1e1eff; color:white; padding:20px; border-radius:10px; text-align:center; font-size:28px; font-weight:bold;">
  Stop!  Take a break! 
</div>

<div style="background-color:#003087ff; color:white; padding:20px; border-radius:10px; text-align:center; font-size:28px; font-weight:bold;">
  Step 4
</div>



### Step 4: Exploring the Combined Dataset

After you download the new dataset, you could easily just jump back into your regular workflow.  But let's say we want to continue in Jupyter to quickly visualize and analyze the data for any insights.  Below we will look at the following:

- ***Quick Visualizations***
- ***Visualizing WWAs by Survey Responses***
- ***Quick Statistical Analyses***
- ***A Bit More Involved Analysis***

#### Step 4a: Quick visualizations

In this step, I'll go over the first visualization and show the code for the remaining two:

1. A geomap of survey respondents by number of watches and warnings 3 days prior to the survey.
2. A bar chart showing frequency of types of watches and warnings.
3. A line graph showing frequency of types of watches and warnings over time.

Jupyter (and Python) provide access to a wide range of powerful visualization libraries — from simple plotting with **matplotlib** and **seaborn** to interactive mapping with **folium** or **plotly** — giving you many options for exploring and presenting your data.  

In [None]:
# Step 1: Ensure lat/lon are float (just in case)
df_18['lat'] = pd.to_numeric(df_18['lat'], errors='coerce')
df_18['lon'] = pd.to_numeric(df_18['lon'], errors='coerce')

# Step 2: Create geometry from lat/lon
geometry = [Point(xy) for xy in zip(df_18['lon'], df_18['lat'])]
gdf_18 = gpd.GeoDataFrame(df_18, geometry=geometry, crs="EPSG:4326")

# Step 3: Count WWAs
gdf_18['wwa_count'] = gdf_18['wwa_names'].apply(lambda x: len(x) if isinstance(x, list) else 0)

# Step 4: Plot
fig, ax = plt.subplots(figsize=(10, 6))
gdf_18.plot(column='wwa_count', cmap='plasma_r', legend=True, ax=ax, markersize=10)
ax.set_title("Survey Respondents by Number of WWAs (3 Days Prior)", fontsize=14)
plt.axis('off')
plt.tight_layout()
plt.show()

In [None]:


# Flatten and count (the second dropna removes the NaN that explode() leaves for empty lists)
all_names = df_18['wwa_names'].dropna().explode().dropna()
top_names = Counter(all_names).most_common(10)

# Plot
names, counts = zip(*top_names)
plt.figure(figsize=(10, 5))
plt.barh(names[::-1], counts[::-1])
plt.title("Top 10 Most Frequent WWAs")
plt.xlabel("Number of Respondents Exposed")
plt.tight_layout()
plt.show()
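
The "flatten and count" idea can also be seen in isolation with a few made-up rows and only the standard library.  Empty lists and missing values contribute nothing to the counts.

```python
from collections import Counter
from itertools import chain

# Made-up rows: each entry is one respondent's list of alerts (or None if missing)
wwa_lists = [["Flood Warning", "Heat Advisory"], [], ["Flood Warning"], None]

# Flatten the lists into one stream of alert names, skipping non-lists
flat = chain.from_iterable(w for w in wwa_lists if isinstance(w, list))

counts = Counter(flat)
print(counts.most_common(2))
```

Note that this counts alert occurrences, so a respondent who received the same alert twice would be counted twice.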

In [None]:
''' python

# Count WWAs per date
timeline = (
    df_18[['begin_date', 'wwa_names']]
    .dropna()
    .assign(wwa_count=lambda df: df['wwa_names'].apply(len))
    .groupby('begin_date')['wwa_count']
    .sum()
)

# Plot
timeline.plot(marker='o', figsize=(10, 4), title='Total WWAs by Survey Date')
plt.ylabel('Total Warnings/Watch Events')
plt.xlabel('Survey Date')
plt.grid(True)
plt.tight_layout()
plt.show()

### Step 4b: Visualizing WWAs by Survey Responses

For a next step, we might undertake a visualization that is a bit more involved.

Let's say we want to examine how the presence of recent weather alerts (3 days prior to taking the survey) correlates with survey respondents' perceived risk of that hazard.  Again, we're going to focus on flood-related WWAs.  How can we do that?  

1. Create two groups of respondents: those who experienced a WWA prior to the survey and those who did not.
2. Extract risk perception scores for a hazard (for risk_flood: 1-No risk, 2-Low risk ... 5-Extreme risk).
3. Create comparative visualizations.

For the first two steps, the cell below defines a function **(def)** that flags flood-related WWA exposure and then converts the numeric risk scores into ordered labels.  We could easily do this without creating a function, but it would take several more steps.  Here we get it done in one cell.  When you run this cell, the only output is a preview of the new columns...we're just setting things up for the visualization.

In [None]:
# --- Prep: robust WWA exposure + ordered risk labels ---

# 1) robust flood_wwa_exposure (handles NaN/non-lists)
flood_wwa_keywords = ['Flood Advisory', 'Flood Warning', 'Flash Flood Warning', 'Flash Flood Watch', 'Flood Watch']

def has_flood_wwa(wwas):
    if isinstance(wwas, (list, tuple, set)):
        return any(k in wwas for k in flood_wwa_keywords)
    return False

df_18['flood_wwa_exposure'] = df_18['wwa_names'].apply(has_flood_wwa)

# 2) numeric -> labeled flood risk (ordered categorical)
label_map = {1: 'No risk', 2: 'Low risk', 3: 'Moderate risk', 4: 'High risk', 5: 'Extreme risk'}
ordered_risk_labels = list(label_map.values())

# Coerce to numeric, map to labels, and set categorical order
df_18['risk_flood_num'] = pd.to_numeric(df_18.get('risk_flood'), errors='coerce')
df_18['risk_flood_label'] = pd.Categorical(
    df_18['risk_flood_num'].map(label_map),
    categories=ordered_risk_labels,
    ordered=True
)

#Let's take a look at what we just did

df_18[['wwa_names', 'flood_wwa_exposure', 'risk_flood_num', 'risk_flood_label']].head(10)
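
Why bother with an *ordered* categorical?  The short sketch below, using three made-up responses, shows that pandas then sorts and compares the labels by their position on the risk scale rather than alphabetically.

```python
import pandas as pd

labels = ["No risk", "Low risk", "Moderate risk", "High risk", "Extreme risk"]

# Three made-up responses stored as an ordered categorical
risk = pd.Series(pd.Categorical(
    ["High risk", "No risk", "Moderate risk"],
    categories=labels,
    ordered=True,
))

# Sorting follows the scale, not the alphabet
print(risk.sort_values().tolist())

# Comparisons against a label on the scale also work
print((risk >= "Moderate risk").tolist())
```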

In [None]:
#let's visualize what we just did
plt.figure(figsize=(10, 6))
ax = sns.countplot(
    data=df_18,
    x='risk_flood_label',
    hue='flood_wwa_exposure',
    order=ordered_risk_labels,
    hue_order=[False, True]
)

plt.xlabel("Perceived Flood Risk")
plt.ylabel("Number of Respondents")
plt.title("Perceived Flood Risk by Exposure to Flood-Related WWAs")
plt.legend(title="WWA Exposure", labels=["No", "Yes"])
plt.tight_layout()
plt.show()

In [None]:
''' python

# crosstab -> proportions by exposure
crosstab = pd.crosstab(
    df_18['flood_wwa_exposure'],
    df_18['risk_flood_label'],
    normalize='index'
)

# ensure consistent order of rows/columns even if some levels are missing
crosstab = crosstab.reindex(index=[False, True], columns=ordered_risk_labels)

ax = crosstab.T.plot(
    kind='bar',
    stacked=True,
    figsize=(10, 6)
)

plt.title('Flood Risk Perception by WWA Exposure (Proportions)')
plt.xlabel('Perceived Flood Risk')
plt.ylabel('Proportion of Respondents')
plt.legend(title='Exposed to Flood WWA', labels=['No', 'Yes'])
plt.tight_layout()
plt.show()

In [None]:
# Violin plots are interesting, let's take a look.  It looks a bit like a population pyramid.
sns.violinplot(data=df_18, x='flood_wwa_exposure', y='risk_flood_num', inner='quartile')
plt.xticks([0, 1], ['No WWA Exposure', 'WWA Exposure'])
plt.xlabel("Exposure to Flood WWA")
plt.ylabel("Perceived Flood Risk (1-5)")
plt.title("Distribution of Flood Risk Perception by WWA Exposure")
plt.tight_layout()
plt.show()

## Step 4c: Quick Statistical Analyses

We're not limited to visualizations; we can also use statistical procedures to begin testing whether exposure to warnings and watches has an impact on survey responses.  For this tutorial, we will run the following test:

1. Chi-Square: To evaluate whether exposure to flood-related warnings and watches and perceived flood risk categories are independent, with the p-value indicating if the observed relationship is statistically significant.

I will provide the code for two additional analyses, so you can see how they are set up.  If a cell begins with ''' python, remove that first line or put a hash (#) in front of it (#'''python) to run it on your own.

2. Logistic Regression: To evaluate whether exposure to flood-related warnings and watches increases the odds of respondents reporting high or extreme flood risk compared to lower risk levels.
3. Ordinal Logistic Regression: To evaluate whether exposure to flood-related warnings and watches shifts respondents toward higher categories of perceived flood risk.

### 1. Chi-Square

In [None]:
# Let's create a contingency table

contingency = pd.crosstab(df_18['flood_wwa_exposure'], df_18['risk_flood_label'])
contingency

In [None]:
#let's test it

chi2, p, dof, expected = chi2_contingency(contingency)

print("Chi-square test")
print(f"Chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")

With a small p-value (conventionally below 0.05), this suggests that the distribution of flood risk perceptions is not independent of WWA exposure. In other words, people who were exposed to flood-related weather alerts responded differently (in terms of risk levels) than those who were not exposed.
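
To see what the test is doing under the hood, here is a hand calculation on a small made-up table (two exposure groups by three risk levels, not the survey data).  For each cell, the expected count under independence is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells.

```python
# Made-up contingency table: rows = exposure (no/yes), columns = three risk levels
observed = [[20, 30, 50],
            [10, 30, 60]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2_stat = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Count we would expect in this cell if exposure and risk were independent
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2_stat += (obs - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"chi2 = {chi2_stat:.2f}, df = {dof}")
```

For this table the statistic works out to about 4.24 with 2 degrees of freedom.  The p-value is then read from the chi-square distribution, which is the part `chi2_contingency` handles for us.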

### 2. Logistic Regression

We're going to fit a logistic regression model to test whether exposure to flood alerts predicts whether a survey respondent reports being at “high” or “extreme” flood risk.

In short: the regression asks, “How much more likely are people to report high or extreme flood risk if they were exposed to flood-related alerts?”

In [None]:
'''python

# Binary outcome: high risk (≥4) vs. lower
# Note: rows with a missing risk score compare as False, so they are counted as "not high risk"
df_18['high_risk'] = df_18['risk_flood_num'] >= 4

# Predictor: WWA exposure (cast to int)
X = sm.add_constant(df_18['flood_wwa_exposure'].astype(int))
y = df_18['high_risk'].astype(int)

logit_model = sm.Logit(y, X).fit()
print(logit_model.summary())

This suggests that respondents exposed to a flood WWA had significantly higher odds (about 39% greater, exp(0.3326) ≈ 1.39) of reporting high or extreme flood risk compared to those not exposed, though the overall model explains only a small share of variation in responses.
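
The conversion from a logit coefficient to an odds ratio is a one-line calculation.  The sketch below uses the coefficient quoted above (0.3326) and reproduces the "about 39% greater" figure.

```python
import math

coef = 0.3326  # logit coefficient for WWA exposure, as quoted above

# Exponentiating a logit coefficient gives the odds ratio
odds_ratio = math.exp(coef)

# Percent change in the odds relative to the unexposed group
pct_change = (odds_ratio - 1) * 100

print(f"odds ratio = {odds_ratio:.2f}, change in odds = {pct_change:.0f}%")
```

Keep in mind that this is a change in *odds*, not in probability; the two are close only when the outcome is rare.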

### 3. Ordinal Logistic Regression

The previous regression was a binary logistic regression, and the outcome was simplified into two categories: high risk (4–5) vs. not high (1–3).  The model only asks: “Does exposure change the odds of being high risk or not?”

With this model, we use the full ordered scale (1-5), which accounts for the fact that reporting risk = 5 is “higher” than risk = 4, which is higher than risk = 3, etc.

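Instead of focusing on a single cutoff, the model estimates the effect of exposure across all possible thresholds in the risk scale.  The output contains a set of cut-points (the thresholds between adjacent categories) plus one slope coefficient. The slope tells you whether exposure consistently increases the likelihood of reporting a higher category of risk. Using a single slope assumes that the effect of exposure is the same at every threshold (the proportional odds assumption).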
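To see how cut-points and a single slope turn into category probabilities, the sketch below applies the cumulative logit formula, P(Y ≤ k) = logistic(cut-point_k − slope × x). The cut-point values are hypothetical (not fitted values), and the slope is the exposure coefficient quoted in this section's results.

```python
import math

def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(cutpoints, slope, x):
    """Category probabilities under a cumulative logit (proportional odds) model."""
    # Cumulative probabilities P(Y <= k); the last category always reaches 1.0
    cumulative = [logistic(c - slope * x) for c in cutpoints] + [1.0]
    probs = []
    previous = 0.0
    for cum in cumulative:
        # Each category's probability is the gap between adjacent cumulatives
        probs.append(cum - previous)
        previous = cum
    return probs

# Hypothetical, increasing cut-points (NOT fitted values); four cut-points
# separate five ordered categories.
cutpoints = [-2.0, -0.5, 1.0, 2.2]
slope = 0.2810  # exposure coefficient quoted for the ordinal model

for x, label in [(0, "No exposure"), (1, "Exposure")]:
    probs = category_probs(cutpoints, slope, x)
    print(label, [round(p, 3) for p in probs])
```

A positive slope moves probability out of the lowest categories and into the highest ones, which is the "upward shift" the model is testing for.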

In [None]:
# Drop NaNs to avoid issues
df_ord = df_18.dropna(subset=['risk_flood_num', 'flood_wwa_exposure'])

# Predictor must be numeric (int)
X = df_ord[['flood_wwa_exposure']].astype(int)

# Outcome is ordered categories (numeric risk levels already ordered 1–5)
y = df_ord['risk_flood_num']

# Fit ordinal logistic regression
mod = OrderedModel(y, X, distr='logit')
res = mod.fit(method='bfgs')

print(res.summary())

This suggests that respondents exposed to flood-related warnings and watches had a significantly higher likelihood of placing themselves in higher flood risk perception categories (coef = 0.2810, p < 0.001), indicating a consistent upward shift in perceived risk.

## A Bit More Involved - Did WWAs *Really* Affect Risk Perception?

So it appears that exposure to watches and warnings has a statistically significant impact on survey responses.  But by how much (substantive significance)? To suggest an answer, we might look at the **predicted probabilities** of survey responses.  This shows us how much exposure to a WWA changes the likelihood of a respondent reporting “High” or “Extreme” flood risk.  

To do this, the code below creates two simple scenarios: one where a person had no flood alert (0) and one where they did (1).  It then uses the ordinal logit model fitted above to predict the probability of each risk category under those two scenarios.  The results are labeled clearly as “No Exposure” and “Exposure,” giving us a side-by-side view.  

**Analogy:** It’s like asking, “What would the risk look like if nobody had a flood alert?” and then,  “What would the risk look like if everyone had a flood alert?” — and comparing the two answers side by side.  


In [None]:
# Make two scenarios: no exposure (0) and exposure (1)
scenarios = pd.DataFrame({
    'flood_wwa_exposure': [0, 1]
})

# Predict probabilities for each risk category
pred_probs = res.predict(scenarios)

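# Attach labels. The predicted columns are positional (0-4), so map them
# back to the sorted risk levels (1-5) the model was fit on.
pred_probs.index = ['No Exposure', 'Exposure']
risk_levels = sorted(df_ord['risk_flood_num'].unique())
pred_probs.columns = [f"Risk {int(level)}" for level in risk_levels]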

pred_probs


## A Possible Interpretation

Respondents who were exposed to a flood WWA were less likely to report no risk (10% → 8%) or low risk (32% → 28%), and more likely to report higher levels of risk perception, particularly in the “High” (15% → 18%) and “Extreme” (9% → 11%) categories. While the percentage-point changes may look modest, they indicate a clear upward shift in perceived flood risk among those who received WWAs.  In other words, I think it is possible to say that exposure to a flood WWA nudged people away from saying “no risk” and toward saying “high or extreme risk.”  However, with only 3,000 responses, I probably wouldn't.  As the academics say, this requires further study. 😉
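One way to judge substantive significance is to put the absolute (percentage-point) and relative changes side by side. The sketch below uses only the rounded percentages quoted in the paragraph above; the dictionaries are a hand-typed stand-in for the model output, so treat the result as approximate.

```python
# Rounded percentages quoted in the paragraph above. The middle category
# is not quoted there, so it is left out.
no_exposure = {"No risk": 10, "Low": 32, "High": 15, "Extreme": 9}
exposure = {"No risk": 8, "Low": 28, "High": 18, "Extreme": 11}

# Absolute change in percentage points for each quoted category
change = {level: exposure[level] - no_exposure[level] for level in no_exposure}

# Combined share reporting "High" or "Extreme" risk under each scenario
high_before = no_exposure["High"] + no_exposure["Extreme"]
high_after = exposure["High"] + exposure["Extreme"]
relative_change = (high_after - high_before) / high_before * 100

for level, delta in change.items():
    print(f"{level}: {delta:+d} percentage points")
print(f"High or Extreme: {high_before}% -> {high_after}% "
      f"({relative_change:+.0f}% relative change)")
```

A five-point absolute shift can sound small while the relative change sounds large, so it helps to report both. With the notebook's own table, `pred_probs.loc['Exposure'] - pred_probs.loc['No Exposure']` gives the unrounded differences for all five categories.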

Now let's take a look at a visualization of these results.

In [None]:
# Let's see HOW the probability distribution shifts
ax = pred_probs.T.plot(kind='bar', figsize=(10,6))
plt.title("Predicted Flood Risk Perception by WWA Exposure")
plt.ylabel("Probability")
plt.xlabel("Perceived Flood Risk Level")
plt.legend(title="WWA Exposure")
plt.tight_layout()
plt.show()

## Wrapping Up

In this tutorial, we explored how to bring together **social survey data** and **weather warning data** to better understand how hazard information might influence perceptions of risk. Using the Jupyter Notebook environment, you saw how to:  

- **Work with APIs** to search for, access, and download data programmatically  
- Organize and explore datasets interactively using Python libraries like **pandas** and **requests**  
- Merge survey data with external data from the **Iowa Environmental Mesonet**  
- Apply **geospatial tools** to handle location-based data  
- Create clear, reproducible **visualizations** and **statistical analyses** directly alongside your analysis  
- Document your process in a way that combines code, results, and explanation all in one place  

With this walkthrough, you’ve seen how Jupyter can serve as both a **research lab and a communication tool**: a space where you can work with data, analyze it, visualize the results, and explain what you found.  

### Moving Forward  
- Try adapting this workflow to other survey topics (e.g., heat, drought, tornado risk)  
- Explore additional APIs to enrich your analysis with different kinds of data  
- Use Jupyter notebooks to build **reproducible reports**, where readers can see not just your conclusions but also the steps you took to get there  
- Share your notebooks with collaborators as a way to make your analysis **transparent and interactive** 

Ultimately, the key takeaway is that with just a few tools—**APIs, pandas, geospatial libraries, and Jupyter notebooks**—you can connect diverse datasets, analyze them in context, and tell meaningful data stories about risk and society **in one place**!

---

**Thank you for following along!**  
We encourage you to take this workflow and apply it to your own research questions about weather, risk, and society—the more you explore, the more insights you’ll uncover.  
