### Assignment 04 - SEC Universities Impacted by Severe Winter Weather (January 2026)

**Grouping:** SEC (Southeastern Conference) - 16 universities
**Data Sources:** Wikipedia (HTML) + Open-Meteo (API)
**Platform:** Databricks

Pipeline:
1. Scrape university enrollment from Wikipedia HTML table
2. Pull daily weather data from Open-Meteo REST API for each campus
3. Classify "severe weather" days based on thresholds I defined
4. Calculate student-days impacted per university


In [0]:
%pip install beautifulsoup4


In [0]:
%restart_python


In [0]:
import requests
import pandas as pd
import time
import matplotlib.pyplot as plt
import numpy as np
from bs4 import BeautifulSoup
from io import StringIO


#### HTML Data Ingestion - Enrollment

Scraping the SEC Wikipedia page for Fall 2023 enrollment numbers for all 16 schools.

**Source:** https://en.wikipedia.org/wiki/Southeastern_Conference

**Databricks note:** Ran into network issues with this step. The serverless compute on our
Databricks cluster blocks outbound requests to most websites. Open-Meteo works fine but
Wikipedia kept failing with a DNS resolution error. Emailed Paul about it - he said it should
be whitelisted but it still doesn't resolve on serverless. Built a placeholder table so I
can keep working on the rest of the pipeline while that gets sorted out. The Wikipedia scrape
cell is below the placeholder - once access works I'll run that instead and delete the
hardcoded data.
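
One way to keep the placeholder and the live scrape in the same code path is a small fallback wrapper, sketched below. `load_enrollment_rows` and `failing_fetch` are hypothetical names, not part of this notebook; the failing fetch only simulates the blocked request so the example runs without network access.

```python
def load_enrollment_rows(fetch_live, placeholder_rows):
    """Return (rows, source_label), preferring live data when it is reachable.

    fetch_live is a hypothetical zero-argument callable that returns a list
    of dicts or raises (for example a ConnectionError from a blocked host).
    """
    try:
        return fetch_live(), "wikipedia"
    except Exception as exc:  # broad on purpose: any failure means fall back
        print(f"live scrape failed ({type(exc).__name__}), using placeholder")
        return list(placeholder_rows), "placeholder"


def failing_fetch():
    # stand-in for the blocked Wikipedia request on serverless compute
    raise ConnectionError("DNS resolution failed")


# two rows copied from the hardcoded table, enough to show the fallback
placeholder = [
    {"university_name": "University of Alabama", "state": "AL", "enrollment": 39622},
    {"university_name": "Auburn University", "state": "AL", "enrollment": 33015},
]

rows, source = load_enrollment_rows(failing_fetch, placeholder)
print(source, len(rows))
```

Passing the returned rows to `pd.DataFrame` would give the same shape as the placeholder cell, and the `source` label makes it obvious in the output which path actually ran.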


In [0]:
# hardcoded enrollment data from Wikipedia SEC page (Fall 2023)
# using this until en.wikipedia.org works on the cluster
# source: https://en.wikipedia.org/wiki/Southeastern_Conference

sec_schools = [
    {"university_name": "University of Alabama", "state": "AL", "enrollment": 39622, "lat": 33.2098, "lon": -87.5692},
    {"university_name": "Auburn University", "state": "AL", "enrollment": 33015, "lat": 32.6099, "lon": -85.4808},
    {"university_name": "University of Arkansas", "state": "AR", "enrollment": 32140, "lat": 36.0686, "lon": -94.1748},
    {"university_name": "University of Florida", "state": "FL", "enrollment": 54814, "lat": 29.6436, "lon": -82.3549},
    {"university_name": "University of Georgia", "state": "GA", "enrollment": 41615, "lat": 33.948, "lon": -83.3773},
    {"university_name": "University of Kentucky", "state": "KY", "enrollment": 32703, "lat": 38.0317, "lon": -84.504},
    {"university_name": "Louisiana State University", "state": "LA", "enrollment": 39418, "lat": 30.4133, "lon": -91.18},
    {"university_name": "University of Mississippi", "state": "MS", "enrollment": 24043, "lat": 34.3665, "lon": -89.5385},
    {"university_name": "Mississippi State University", "state": "MS", "enrollment": 22657, "lat": 33.4552, "lon": -88.7902},
    {"university_name": "University of Missouri", "state": "MO", "enrollment": 31013, "lat": 38.9404, "lon": -92.3277},
    {"university_name": "University of Oklahoma", "state": "OK", "enrollment": 29145, "lat": 35.2058, "lon": -97.4457},
    {"university_name": "University of South Carolina", "state": "SC", "enrollment": 36579, "lat": 33.994, "lon": -81.0299},
    {"university_name": "University of Tennessee", "state": "TN", "enrollment": 36304, "lat": 35.9544, "lon": -83.9295},
    {"university_name": "Vanderbilt University", "state": "TN", "enrollment": 13456, "lat": 36.1447, "lon": -86.8027},
    {"university_name": "Texas A&M University", "state": "TX", "enrollment": 76633, "lat": 30.6187, "lon": -96.3365},
    {"university_name": "University of Texas at Austin", "state": "TX", "enrollment": 53082, "lat": 30.2849, "lon": -97.7341},
]

df_enrollment = pd.DataFrame(sec_schools)
print(df_enrollment)
print(f"\n{df_enrollment['enrollment'].sum():,} total students")


#### Wikipedia scrape (not working yet on this cluster)

Leaving this cell here so I can swap it in once the access issue is resolved.
Right now it throws a ConnectionError because serverless compute can't reach Wikipedia.
Tried restarting the kernel and the cluster, still the same DNS error.
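
The footnote cleanup in the scrape cell can be checked without network access. A minimal sketch, assuming the enrollment cells look like the made-up strings below (digits with thousands separators, optionally followed by bracketed footnote markers); these strings are illustrative and not copied from the live page.

```python
import pandas as pd

# made-up cell contents in the assumed format
raw = pd.Series(["39,622[8]", "13,456", "76,633[a][9]"])

cleaned = (
    raw.astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)  # drop footnote markers like [8]
    .str.replace(",", "", regex=False)        # drop thousands separators
    .astype(int)
)
print(cleaned.tolist())  # [39622, 13456, 76633]
```

The non-greedy `.*?` matters here: a greedy match would swallow everything between the first `[` and the last `]` in a cell with two markers.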


In [0]:
# scrape enrollment from wikipedia SEC page
# NOT WORKING YET - serverless compute blocks en.wikipedia.org
# uncomment and run once access is fixed, then delete the placeholder cell above

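# url = "https://en.wikipedia.org/wiki/Southeastern_Conference"
# response = requests.get(url, timeout=30)
# response.raise_for_status()  # stop here on an HTTP error instead of parsing an error page
# html = response.text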
# print(f"got {len(html):,} characters")
#
# tables = pd.read_html(StringIO(html))
# print(f"found {len(tables)} tables")
#
# for i, t in enumerate(tables):
#     print(f"  table {i}: {len(t)} rows, columns: {list(t.columns[:4])}")
#
# # table 2 has the enrollment - clean up the footnotes
# df_members = tables[2]
# enrollment_col = df_members["Enrollment (fall 2023)[8]"].astype(str)
# enrollment_col = enrollment_col.str.replace(r"\[.*?\]", "", regex=True)
# enrollment_col = enrollment_col.str.replace(",", "")
# df_members["enrollment"] = enrollment_col.astype(int)
#
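# df_enrollment = df_members[["Institution", "Location", "Type", "enrollment"]].copy()
# df_enrollment.columns = ["university_name", "location", "type", "enrollment"]
# # NOTE: no state/lat/lon columns here - the weather pull and the merge below need them,
# # so join them in from the placeholder table before swapping this cell in
# print(df_enrollment)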


#### Weather API - Open-Meteo

Pulling daily weather for each school's campus location for all of January 2026.
Open-Meteo is free and doesn't need an API key, just pass lat/lon and a date range.

This part works fine on serverless - no network issues with api.open-meteo.com.

**Endpoint:** https://api.open-meteo.com/v1/forecast
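
Open-Meteo returns the `daily` block as parallel lists keyed by variable name, so each campus response has to be flattened into one record per day. A small sketch of that reshaping, using made-up values rather than a real API response; `daily_to_rows` and `field_map` are hypothetical helpers, not part of this notebook.

```python
def daily_to_rows(university_name, daily, field_map):
    """Flatten a column-oriented 'daily' block into one dict per day."""
    rows = []
    for i, day in enumerate(daily["time"]):
        row = {"university_name": university_name, "date": day}
        # copy the i-th value of each requested variable under a friendlier name
        for api_name, column_name in field_map.items():
            row[column_name] = daily[api_name][i]
        rows.append(row)
    return rows


# made-up values in the shape the API returns
sample_daily = {
    "time": ["2026-01-24", "2026-01-25"],
    "temperature_2m_min": [18.5, 12.0],
    "snowfall_sum": [0.0, 2.4],
}
field_map = {"temperature_2m_min": "min_temp_f", "snowfall_sum": "snowfall_in"}

rows = daily_to_rows("Example University", sample_daily, field_map)
print(rows[1])
```

Keeping the API-name-to-column mapping in one dict means adding a variable only touches the mapping, not the loop.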


In [0]:
# pull weather data from open-meteo for each school
base_url = "https://api.open-meteo.com/v1/forecast"
weather_vars = "temperature_2m_min,temperature_2m_max,precipitation_sum,snowfall_sum,rain_sum,windspeed_10m_max,windgusts_10m_max,precipitation_hours,weathercode"

all_rows = []

for idx, school in df_enrollment.iterrows():
    params = {
        "latitude": school["lat"],
        "longitude": school["lon"],
        "daily": weather_vars,
        "temperature_unit": "fahrenheit",
        "precipitation_unit": "inch",
        "wind_speed_unit": "mph",
        "start_date": "2026-01-01",
        "end_date": "2026-01-31",
        "timezone": "America/Chicago"
    }
    
    resp = requests.get(base_url, params=params, timeout=30)
    resp.raise_for_status()  # fail loudly instead of hitting a KeyError on "daily"
    result = resp.json()
    daily = result["daily"]
    
    for i in range(len(daily["time"])):
        all_rows.append({
            "university_name": school["university_name"],
            "date": daily["time"][i],
            "min_temp_f": daily["temperature_2m_min"][i],
            "max_temp_f": daily["temperature_2m_max"][i],
            "precipitation_in": daily["precipitation_sum"][i],
            "snowfall_in": daily["snowfall_sum"][i],
            "rain_in": daily["rain_sum"][i],
            "wind_max_mph": daily["windspeed_10m_max"][i],
            "wind_gust_mph": daily["windgusts_10m_max"][i],
            "precip_hours": daily["precipitation_hours"][i],
            "weathercode": daily["weathercode"][i],
        })
    
    print(f"  confirmed {school['university_name']}")
    time.sleep(0.3)

df_weather = pd.DataFrame(all_rows)
df_weather["date"] = pd.to_datetime(df_weather["date"])
print(f"\n{len(df_weather)} total weather records")


#### Exploring the weather data

Taking a look at what January 2026 actually looked like before deciding on thresholds.
The big event was Jan 23-27 but I want to see the full picture.


In [0]:
# quick look at the weather data
print("MIN TEMPS:")
print(f"  lowest: {df_weather['min_temp_f'].min():.1f} F")
print(f"  median: {df_weather['min_temp_f'].median():.1f} F")

print("\nSNOW:")
print(f"  biggest day: {df_weather['snowfall_in'].max():.2f} in")
snow = df_weather[df_weather['snowfall_in'] > 0]
print(f"  {len(snow)} days had some snow out of {len(df_weather)}")

print("\nWIND GUSTS:")
print(f"  max: {df_weather['wind_gust_mph'].max():.1f} mph")

# zoom in on the big storm
print("\n--- storm window jan 23-27 ---")
storm = df_weather[(df_weather["date"] >= "2026-01-23") & (df_weather["date"] <= "2026-01-27")]
for name in ["University of Mississippi", "University of Tennessee", "University of Florida"]:
    print(f"\n{name}:")
    chunk = storm[storm["university_name"] == name]
    print(chunk[["date", "min_temp_f", "max_temp_f", "snowfall_in", "wind_gust_mph"]].to_string(index=False))


#### Defining "severe weather"

A day counts as severe if ANY of these are true:

| Condition | Threshold | Why |
|---|---|---|
| Min temperature | <= 20 F | Way too cold for the south. Pipes burst, roads ice over, nothing is built for it |
| Snowfall | >= 1.0 inch | Enough to shut down roads and campuses. These cities don't have plows |
| Wind gusts | >= 40 mph | Downed trees and power lines, dangerous wind chill |

These numbers are calibrated for where SEC schools are. A 20 degree day in Tuscaloosa
is a completely different situation than 20 degrees in Minneapolis. I lived in Texas for
12 years and saw firsthand how even a little ice or snow shuts everything down. In a
real-world analysis I'd probably also factor in actual campus closings and adjust thresholds
by region, since Missouri and Kentucky handle cold way better than Florida or south Texas.
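
The same any-of rule can also be written as a vectorized boolean mask instead of a row-wise function. This is a sketch on made-up readings, not the notebook's real data. Note that a missing value compares as False, so a day with a missing reading is only flagged if one of the other conditions is met.

```python
import pandas as pd

# made-up readings: one row per day, last row has a missing min temperature
sample = pd.DataFrame({
    "min_temp_f": [25.0, 19.0, 30.0, 35.0, None],
    "snowfall_in": [0.0, 0.0, 1.5, 0.0, 0.0],
    "wind_gust_mph": [10.0, 12.0, 15.0, 45.0, 20.0],
})

# severe if ANY threshold is crossed
is_severe = (
    (sample["min_temp_f"] <= 20)
    | (sample["snowfall_in"] >= 1.0)
    | (sample["wind_gust_mph"] >= 40)
)
print(is_severe.tolist())  # [False, True, True, True, False]
```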


In [0]:
# flag each day as severe or not
def check_severe(row):
    if row["min_temp_f"] <= 20:
        return True
    if row["snowfall_in"] >= 1.0:
        return True
    if row["wind_gust_mph"] >= 40:
        return True
    return False

df_weather["is_severe"] = df_weather.apply(check_severe, axis=1)

severe_days = df_weather[df_weather["is_severe"]]
print(f"{len(severe_days)} severe days out of {len(df_weather)} total")
print(f"{severe_days['university_name'].nunique()} universities affected\n")

counts = severe_days.groupby("university_name").size().sort_values(ascending=False)
print(counts)


#### Final output table

Merging weather with enrollment and calculating student-days impacted.
One student-day = one enrolled student x one severe weather day.
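
As a quick arithmetic check of that definition, here is a sketch using Vanderbilt's enrollment from the table above and a made-up five-day run of severe flags (the flags are illustrative, not results).

```python
enrollment = 13456  # Vanderbilt, from the enrollment table
is_severe = [False, True, True, False, True]  # hypothetical flags for five days

# each severe day contributes one student-day per enrolled student
student_days = sum(enrollment for flag in is_severe if flag)
print(f"{student_days:,}")  # 40,368
```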


In [0]:
# merge weather with enrollment to get student-days
df_combined = df_weather.merge(
    df_enrollment[["university_name", "state", "enrollment"]],
    on="university_name"
)

df_combined["student_days"] = df_combined["enrollment"] * df_combined["is_severe"].astype(int)

# list out the severe dates for each school
date_lists = (
    df_combined[df_combined["is_severe"]]
    .groupby("university_name")["date"]
    .apply(lambda dates: ", ".join(dates.dt.strftime("%b %d")))
    .reset_index()
)
date_lists.columns = ["university_name", "severe_dates"]

# one row per school
summary = df_combined.groupby(["university_name", "state", "enrollment"]).agg(
    severe_days=("is_severe", "sum"),
    total_student_days=("student_days", "sum"),
).reset_index()

summary = summary.merge(date_lists, on="university_name", how="left")
summary["severe_dates"] = summary["severe_dates"].fillna("None")
summary = summary.sort_values("total_student_days", ascending=False)

# html table so it actually looks decent in the notebook
html = """
<style>
  .results { border-collapse: collapse; width: 100%; font-family: Arial; font-size: 13px; }
  .results th { background: #2c3e50; color: white; padding: 10px; text-align: left; }
  .results td { padding: 8px 10px; border-bottom: 1px solid #ddd; }
  .results tr:nth-child(even) { background: #f8f9fa; }
  .results .r { text-align: right; }
  .results .sm { font-size: 11px; color: #555; }
  .totals { background: #2c3e50; color: white; padding: 15px; border-radius: 5px; margin-top: 12px; font-family: Arial; }
</style>
<h3 style="font-family:Arial;">SEC Universities - Student-Days Impacted (January 2026)</h3>
<table class="results">
<tr><th>#</th><th>University</th><th>State</th><th class="r">Enrollment</th><th class="r">Severe Days</th><th class="r">Student-Days</th><th>Dates</th></tr>
"""

for i, (_, r) in enumerate(summary.iterrows(), 1):
    html += f'<tr><td>{i}</td><td><b>{r["university_name"]}</b></td><td>{r["state"]}</td>'
    html += f'<td class="r">{r["enrollment"]:,}</td><td class="r">{r["severe_days"]}</td>'
    html += f'<td class="r"><b>{r["total_student_days"]:,}</b></td>'
    html += f'<td class="sm">{r["severe_dates"]}</td></tr>'

html += "</table>"

total = summary["total_student_days"].sum()
affected = (summary["severe_days"] > 0).sum()

html += f"""
<div class="totals">
  <b style="font-size:20px">{total:,}</b> student-days impacted |
  <b>{affected}</b> of {len(summary)} universities affected |
  <b>{summary["severe_days"].sum()}</b> severe university-days
</div>"""

displayHTML(html)


#### Visualizations


In [0]:
# bar chart - student days by school
fig, ax = plt.subplots(figsize=(12, 8))

plot_data = summary.sort_values("total_student_days")
state_colors = {
    "KY": "#1f77b4", "MO": "#ff7f0e", "AR": "#2ca02c", "OK": "#d62728",
    "TN": "#9467bd", "TX": "#8c564b", "SC": "#e377c2", "MS": "#7f7f7f",
    "GA": "#bcbd22", "AL": "#17becf", "LA": "#aec7e8", "FL": "#ffbb78",
}
colors = [state_colors.get(s, "#999") for s in plot_data["state"]]

bars = ax.barh(plot_data["university_name"], plot_data["total_student_days"], color=colors)
ax.set_xlabel("Student-Days Impacted")
ax.set_title("SEC Universities: Student-Days Impacted by Severe Winter Weather (Jan 2026)")
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'{x:,.0f}'))

for bar, val in zip(bars, plot_data["total_student_days"]):
    if val > 0:
        ax.text(val + 5000, bar.get_y() + bar.get_height()/2, f'{val:,.0f}',
                va='center', fontsize=8)

plt.tight_layout()
plt.show()


In [0]:
# heatmap - which schools got hit on which days
fig, ax = plt.subplots(figsize=(16, 8))

universities = summary.sort_values("total_student_days", ascending=False)["university_name"].tolist()
dates = sorted(df_weather["date"].unique())
date_labels = [d.strftime("%b %d") for d in pd.to_datetime(dates)]

heatmap_data = []
for uni in universities:
    row = []
    for d in dates:
        match = df_weather[(df_weather["university_name"] == uni) & (df_weather["date"] == d)]
        row.append(1 if match["is_severe"].values[0] else 0)
    heatmap_data.append(row)

im = ax.imshow(heatmap_data, cmap="RdYlGn_r", aspect="auto", interpolation="nearest")
ax.set_yticks(range(len(universities)))
ax.set_yticklabels(universities, fontsize=8)
ax.set_xticks(range(len(dates)))
ax.set_xticklabels(date_labels, rotation=90, fontsize=7)
ax.set_title("Severe Weather Days by University - January 2026 (Red = Severe)")

# box around the storm window
storm_start = [i for i, d in enumerate(dates) if pd.Timestamp(d).day == 23][0]
storm_end = [i for i, d in enumerate(dates) if pd.Timestamp(d).day == 27][0]
rect = plt.Rectangle((storm_start - 0.5, -0.5), storm_end - storm_start + 1,
                       len(universities), linewidth=2, edgecolor='blue',
                       facecolor='none', linestyle='--')
ax.add_patch(rect)
ax.text(storm_start + 1, -1.2, "Storm Fern", color="blue", fontsize=9, fontweight="bold")

plt.tight_layout()
plt.show()


In [0]:
# temperature trend lines - all schools overlaid
fig, ax = plt.subplots(figsize=(14, 6))

# highlight a few interesting schools, fade the rest
highlight = ["University of Missouri", "University of Kentucky",
             "University of Florida", "University of Mississippi"]

for uni in df_weather["university_name"].unique():
    subset = df_weather[df_weather["university_name"] == uni].sort_values("date")
    if uni in highlight:
        ax.plot(subset["date"], subset["min_temp_f"], alpha=0.7, linewidth=2, label=uni)
    else:
        ax.plot(subset["date"], subset["min_temp_f"], alpha=0.2, linewidth=0.8)

ax.axhline(y=20, color="red", linestyle="--", linewidth=1.5, label="Severe threshold (20 F)")
ax.axvspan(pd.Timestamp("2026-01-23"), pd.Timestamp("2026-01-27"),
           alpha=0.15, color="blue", label="Storm Fern")

ax.set_ylabel("Daily Min Temperature (F)")
ax.set_title("January 2026 Min Temps - All SEC Schools")
ax.legend(loc="lower left", fontsize=8)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


#### Takeaways

Kentucky got hit the hardest with 18 severe days, which makes sense since it's one of the
northernmost campuses in the SEC (only Missouri is farther north). Florida was completely
fine with zero severe days. The big storm window (Jan 23-27) drove most of the impact but
the cold hung around at the northern schools for a while after that. Missouri hit -6 F on
the 26th, which is pretty extreme even for them.

Overall about 3.2 million student-days were impacted across 15 of the 16 schools. The only
school that dodged it entirely was Florida.

If I were doing this for real I would want to cross-reference with actual campus closure
announcements and probably adjust the thresholds based on how far north each school is.
A 20 degree day means something very different in Lexington KY vs Baton Rouge LA.
