# NOAA Climate Data – Data quality and cleaning

## Reading and Merging NOAA Climate Data (Following Official Documentation)

The NOAA ClimDiv statewide climate files use a fixed-width text format in which each row holds one year of monthly values for one state and one climate element. According to the official ClimDiv documentation, each row starts with four fixed-position fields (state, division, element, year; 3, 1, 2, and 4 characters wide), followed by twelve 7-character fields for the monthly values (Jan–Dec). Our cleaning procedure follows this structure:

1. **Read raw fixed-width data using the official column specifications**  
   The tavg (average temperature), cdd (cooling degree days), and hdd (heating degree days) files are parsed with `read_fwf()` using column boundaries taken directly from the NOAA specification. This ensures that all fields are split and assigned correctly.

2. **Filter to valid ClimDiv state codes (1–48)**  
   In the statewide ClimDiv dataset, state codes 1–48 correspond to the contiguous United States. Rows with any other code are removed so that the table covers only these 48 states.

3. **Reshape monthly columns into long format (wide → long)**  
   NOAA stores the twelve monthly values (`jan`–`dec`) as separate fixed-width columns.  
   For easier analysis and later schema alignment with the EIA dataset, these columns are converted into a tidy long format where each row represents a single observation: (state, year, month, value).

4. **Load each file separately with a unified cleaning procedure**  
   Each of the three datasets (tavg, cdd, hdd) is processed with the same function. Its generic `value` column is then renamed to `tavg`, `cdd`, or `hdd` so that the three tables can be combined without name clashes.

5. **Merge the three climate indicators**  
   Because all three datasets share the same keys (state, year, month), they are joined on those keys into a single climate table containing the three variables. The default inner join is used, so only key combinations present in all three files are kept.

6. **Add state names (for quality checks)**  
   Using the NOAA state-code mapping, a human-readable `state_name` column is added to support visual inspection and profiling.

The resulting `climate_subset` dataset contains:  
`state_name`, `year`, `month`, `tavg`, `cdd`, `hdd`  
and serves as the input for the subsequent data-quality assessment.
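
As a quick sanity check on the fixed-width layout described above, the sketch below parses a single synthetic record with the same column boundaries. The record is made up for illustration (it is not a line from the NOAA files), and `monthly`, `sample_line`, and `sample` are hypothetical names.

```python
import io

import pandas as pd

# Column boundaries and names as used in this notebook
colspecs = [(0, 3), (3, 4), (4, 6), (6, 10)] + [
    (10 + 7 * i, 17 + 7 * i) for i in range(12)
]
names = ["state", "division", "element", "year",
         "jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"]

# The boundaries should tile the record with no gaps or overlaps
assert all(a[1] == b[0] for a, b in zip(colspecs, colspecs[1:]))

# Synthetic record (illustrative values only, not real NOAA data):
# state 001, division 0, element 02, year 1895, then twelve 7-character values
monthly = [43.1, 45.0, 52.3, 61.0, 69.4, 76.8, 79.9, 79.1, 74.2, 63.5, 52.0, 44.7]
sample_line = "001" + "0" + "02" + "1895" + "".join(f"{v:7.2f}" for v in monthly)

sample = pd.read_fwf(
    io.StringIO(sample_line), colspecs=colspecs, names=names, header=None
)
print(sample[["state", "year", "jan", "dec"]])
```

If a boundary were off by one character, the year or the first monthly value would come out truncated or shifted, which makes this kind of one-line test a cheap way to catch layout mistakes before loading the full files.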


In [None]:
import pandas as pd
import numpy as np

RAW_DIR = "../data/raw"

NOAA_FILES = {
    "tavg": f"{RAW_DIR}/climdiv-tmpcst-v1.0.0-20250905",
    "cdd":  f"{RAW_DIR}/climdiv-cddcst-v1.0.0-20250905",
    "hdd":  f"{RAW_DIR}/climdiv-hddcst-v1.0.0-20250905",
}

# NOAA fixed-width spec (from NOAA documentation)
colspecs = [
    (0, 3), (3, 4), (4, 6), (6, 10),
    (10, 17), (17, 24), (24, 31), (31, 38),
    (38, 45), (45, 52), (52, 59), (59, 66),
    (66, 73), (73, 80), (80, 87), (87, 94)
]

names = [
    "state", "division", "element", "year",
    "jan","feb","mar","apr","may","jun",
    "jul","aug","sep","oct","nov","dec"
]

# state code → full name (no abbrev yet)
state_name_map = {
    1: "Alabama", 2: "Arizona", 3: "Arkansas", 4: "California",
    5: "Colorado", 6: "Connecticut", 7: "Delaware", 8: "Florida",
    9: "Georgia", 10: "Idaho", 11: "Illinois", 12: "Indiana",
    13: "Iowa", 14: "Kansas", 15: "Kentucky", 16: "Louisiana",
    17: "Maine", 18: "Maryland", 19: "Massachusetts", 20: "Michigan",
    21: "Minnesota", 22: "Mississippi", 23: "Missouri", 24: "Montana",
    25: "Nebraska", 26: "Nevada", 27: "New Hampshire", 28: "New Jersey",
    29: "New Mexico", 30: "New York", 31: "North Carolina", 32: "North Dakota",
    33: "Ohio", 34: "Oklahoma", 35: "Oregon", 36: "Pennsylvania",
    37: "Rhode Island", 38: "South Carolina", 39: "South Dakota",
    40: "Tennessee", 41: "Texas", 42: "Utah", 43: "Vermont", 44: "Virginia",
    45: "Washington", 46: "West Virginia", 47: "Wisconsin", 48: "Wyoming"
}

In [22]:
def load_climdiv(path):
    """Read one ClimDiv statewide file and return it in long format."""
    df = pd.read_fwf(path, colspecs=colspecs, names=names)

    # keep only contiguous U.S. states (codes 1–48, inclusive)
    df = df[df["state"].between(1, 48)].copy()

    # reshape to long format (month stays a string label, "jan"–"dec")
    df = df.melt(
        id_vars=["state", "year"],
        value_vars=names[4:],  # the twelve monthly columns
        var_name="month",
        value_name="value"
    )

    # non-numeric entries become NaN; NOAA sentinel codes (-99.9, -9999)
    # are still numeric at this point and are converted in a later cell
    df["value"] = pd.to_numeric(df["value"], errors="coerce")

    return df

In [23]:
tavg = load_climdiv(NOAA_FILES["tavg"]).rename(columns={"value": "tavg"})
cdd  = load_climdiv(NOAA_FILES["cdd"]).rename(columns={"value": "cdd"})
hdd  = load_climdiv(NOAA_FILES["hdd"]).rename(columns={"value": "hdd"})

In [None]:
climate = (
    tavg.merge(cdd, on=["state", "year", "month"])
        .merge(hdd, on=["state", "year", "month"])
)
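
A default `merge` is an inner join, so rows whose keys are missing from any one table would be dropped silently. One way to guard against that, sketched here on toy frames with made-up values, is to pass `validate="one_to_one"` and compare row counts. The frames `a` and `b` are hypothetical stand-ins for the per-indicator tables.

```python
import pandas as pd

keys = ["state", "year", "month"]

# Toy stand-ins for two per-indicator tables (made-up values)
a = pd.DataFrame({"state": [1, 1], "year": [1895, 1895],
                  "month": ["jan", "feb"], "tavg": [43.1, 45.0]})
b = pd.DataFrame({"state": [1, 1], "year": [1895, 1895],
                  "month": ["jan", "feb"], "cdd": [5.0, 7.0]})

# validate raises pandas.errors.MergeError if a key combination is duplicated
merged = a.merge(b, on=keys, how="inner", validate="one_to_one")

# With identical key sets, the inner join keeps every row
assert len(merged) == len(a) == len(b)
print(merged)
```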

In [25]:
climate["state_name"] = climate["state"].map(state_name_map)

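# For quality check; .copy() makes this an independent frame so that the
# later in-place sentinel replacement does not trigger SettingWithCopyWarning
climate_subset = climate[[
    "state_name", "year", "month", "tavg", "cdd", "hdd"
]].copy()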

climate_subset.head()

Unnamed: 0,state_name,year,month,tavg,cdd,hdd
0,Alabama,1895,jan,43.1,5.0,717.0
1,Alabama,1896,jan,43.5,4.0,693.0
2,Alabama,1897,jan,41.8,3.0,752.0
3,Alabama,1898,jan,49.0,19.0,545.0
4,Alabama,1899,jan,43.8,5.0,690.0


### Data Profiling
- Column and type check: All six fields have the expected types: `state_name` and `month` are strings (`object`), `year` is `int64`, and `tavg`, `cdd`, `hdd` are `float64`.
- Unique values:
  - 48 contiguous U.S. states  
  - Years 1895–2025  
  - 12 month labels (`jan`–`dec`)
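- Missingness: NOAA marks missing values with sentinel codes (`-99.9` for `tavg`, `-9999` for `cdd` and `hdd`), so a plain `isna()` check initially reports zero missing values. After converting these codes to `NaN`, 192 rows have missing values (48 states × 4 months).
- Pattern: All missing values occur in **2025 Sep–Dec**, covering all 48 states. This is consistent with the release date in the file names (`20250905`): these months had not yet been published.
- Range checks: After the conversion, no implausible values remain (no `tavg < -40`, no `tavg > 120`, no negative CDD/HDD).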
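
The counts in this profile can be cross-checked with simple arithmetic: 48 states × 131 years (1895–2025) × 12 months gives the total row count, and 48 states × 4 months gives the number of missing rows. A minimal sketch (variable names are illustrative):

```python
n_states = 48
n_years = 2025 - 1895 + 1   # 131 years, inclusive of both endpoints
n_months = 12

expected_rows = n_states * n_years * n_months
expected_missing = n_states * 4            # Sep–Dec 2025 for every state

print(expected_rows)                       # 75456
print(expected_missing)                    # 192
print(expected_rows - expected_missing)    # 75264
```

The first figure matches the `RangeIndex` size reported by `info()` in this notebook, which suggests that no state-year-month combination was lost in the merge.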

### Handling Missing Values
- Columns with missing values: `tavg`, `cdd`, `hdd`
- Reason: NOAA had not yet published values for Sep–Dec 2025 when these files were released, so there is nothing to recover.
- Cleaning action: Dropped all rows containing missing climate values to ensure complete, consistent observations.
- Result: The final dataset contains complete monthly climate measurements for all 48 states from Jan 1895 through Aug 2025 (75,264 rows).
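
The sentinel handling described above can be sketched on a toy frame. The values below are made up; only the sentinel codes (`-99.9` for `tavg`, `-9999` for `cdd` and `hdd`) come from this notebook, and `SENTINELS`, `toy`, and `toy_clean` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Sentinel codes per column, as used in this notebook
SENTINELS = {"tavg": -99.9, "cdd": -9999, "hdd": -9999}

# Toy data (made-up values); the second row mimics an unpublished month
toy = pd.DataFrame({
    "state_name": ["Alabama", "Alabama"],
    "year": [2025, 2025],
    "month": ["aug", "sep"],
    "tavg": [79.0, -99.9],
    "cdd": [430.0, -9999.0],
    "hdd": [0.0, -9999.0],
})

# Replace each column's sentinel with NaN, then drop incomplete rows
for col, code in SENTINELS.items():
    toy[col] = toy[col].replace(code, np.nan)

toy_clean = toy.dropna(subset=list(SENTINELS)).reset_index(drop=True)
print(toy_clean)
```

Converting sentinels before computing summary statistics matters: left in place, values such as `-9999` pull the mean down and inflate the standard deviation of `cdd` and `hdd`.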

In [27]:
# Column info
print("=== Column Info ===")
print(climate_subset.info())

display(climate_subset.head())

=== Column Info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75456 entries, 0 to 75455
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   state_name  75456 non-null  object 
 1   year        75456 non-null  int64  
 2   month       75456 non-null  object 
 3   tavg        75456 non-null  float64
 4   cdd         75456 non-null  float64
 5   hdd         75456 non-null  float64
dtypes: float64(3), int64(1), object(2)
memory usage: 3.5+ MB
None


Unnamed: 0,state_name,year,month,tavg,cdd,hdd
0,Alabama,1895,jan,43.1,5.0,717.0
1,Alabama,1896,jan,43.5,4.0,693.0
2,Alabama,1897,jan,41.8,3.0,752.0
3,Alabama,1898,jan,49.0,19.0,545.0
4,Alabama,1899,jan,43.8,5.0,690.0


In [28]:
# Missing value check
print("=== Missing Values (Count and %) ===")
missing_count = climate_subset.isna().sum()
missing_percent = climate_subset.isna().mean() * 100

missing_df = pd.DataFrame({
    "missing_count": missing_count,
    "missing_percent": missing_percent.round(2)
})

display(missing_df)

=== Missing Values (Count and %) ===


Unnamed: 0,missing_count,missing_percent
state_name,0,0.0
year,0,0.0
month,0,0.0
tavg,0,0.0
cdd,0,0.0
hdd,0,0.0


In [31]:
# Basic descriptive statistics
print("=== Summary Statistics ===")
display(climate_subset.describe())

=== Summary Statistics ===


Unnamed: 0,year,tavg,cdd,hdd
count,75456.0,75456.0,75456.0,75456.0
mean,1960.0,51.338359,59.732851,432.20661
std,37.815591,19.457149,526.688032,687.060961
min,1895.0,-99.9,-9999.0,-9999.0
25%,1927.0,38.0,0.0,47.0
50%,1960.0,53.0,8.0,342.0
75%,1993.0,66.5,120.0,774.0
max,2025.0,89.2,844.0,2389.0


In [35]:
climate_subset.loc[:, "tavg"] = climate_subset["tavg"].replace(-99.9, np.nan)
climate_subset.loc[:, "cdd"]  = climate_subset["cdd"].replace(-9999, np.nan)
climate_subset.loc[:, "hdd"]  = climate_subset["hdd"].replace(-9999, np.nan)

In [37]:
print("=== Missing Value Summary ===")
print(climate_subset.isna().sum())

=== Missing Value Summary ===
state_name      0
year            0
month           0
tavg          192
cdd           192
hdd           192
dtype: int64


In [52]:
missing_rows = climate_subset[
    climate_subset[["tavg", "cdd", "hdd"]].isna().any(axis=1)
]
missing_rows

Unnamed: 0,state_name,year,month,tavg,cdd,hdd
50434,Alabama,2025,sep,,,
50565,Arizona,2025,sep,,,
50696,Arkansas,2025,sep,,,
50827,California,2025,sep,,,
50958,Colorado,2025,sep,,,
...,...,...,...,...,...,...
74931,Virginia,2025,dec,,,
75062,Washington,2025,dec,,,
75193,West Virginia,2025,dec,,,
75324,Wisconsin,2025,dec,,,


In [46]:
missing_rows["year"].value_counts().sort_index()

year
2025    192
Name: count, dtype: int64

In [49]:
missing_rows.groupby(["year", "month"]).size()

year  month
2025  dec      48
      nov      48
      oct      48
      sep      48
dtype: int64

In [38]:
print("tavg < -40:", (climate_subset["tavg"] < -40).sum())
print("tavg > 120:", (climate_subset["tavg"] > 120).sum())

print("cdd < 0:", (climate_subset["cdd"] < 0).sum())
print("hdd < 0:", (climate_subset["hdd"] < 0).sum())


tavg < -40: 0
tavg > 120: 0
cdd < 0: 0
hdd < 0: 0


In [50]:
climate_subset_clean = climate_subset.dropna(subset=["tavg", "cdd", "hdd"]).copy()


In [51]:
climate_subset_clean 

Unnamed: 0,state_name,year,month,tavg,cdd,hdd
0,Alabama,1895,jan,43.1,5.0,717.0
1,Alabama,1896,jan,43.5,4.0,693.0
2,Alabama,1897,jan,41.8,3.0,752.0
3,Alabama,1898,jan,49.0,19.0,545.0
4,Alabama,1899,jan,43.8,5.0,690.0
...,...,...,...,...,...,...
75450,Wyoming,2020,dec,24.5,0.0,1221.0
75451,Wyoming,2021,dec,26.2,0.0,1159.0
75452,Wyoming,2022,dec,19.6,0.0,1369.0
75453,Wyoming,2023,dec,27.7,0.0,1119.0
