# 🌫️ Atmospheric CO₂ Data Cleaning & Preparation

This notebook processes and enriches the **Mauna Loa daily atmospheric CO₂ dataset** from NOAA and Scripps. The goal is to prepare the data for visualization in Tableau and for deeper analysis.

---

## 🧹 Data Cleaning Objectives

- Load the raw daily CO₂ data from the NOAA `.txt` file
- Parse and format the date fields correctly
- Remove flagged or invalid entries (e.g., `-99.99` values)
- Handle missing values using appropriate techniques (e.g., interpolation, rolling averages)
- Standardize column names for clarity

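The objectives above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the cleaning actually applied to the dataset: the column name `co2`, the sample values, and the 7-day interpolation limit are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real file's column names and values may differ.
raw = pd.DataFrame({
    "year": [2020, 2020, 2020, 2020],
    "month": [1, 1, 1, 1],
    "day": [1, 2, 3, 4],
    "co2": [412.0, -99.99, 414.0, 413.0],
})

# Treat the -99.99 sentinel as missing rather than as a measurement.
clean = raw.copy()
clean["co2"] = clean["co2"].replace(-99.99, np.nan)

# Build a datetime index so interpolation can weight gaps by elapsed time.
clean["date"] = pd.to_datetime(clean[["year", "month", "day"]])
clean = clean.set_index("date")

# Fill short gaps by time-weighted linear interpolation (one option among several).
# limit=7 is an assumed cap so that long outages stay missing.
clean["co2"] = clean["co2"].interpolate(method="time", limit=7)
```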
---

## 🧪 Feature Engineering Plan

We’ll create a variety of new features to support time-series analysis and storytelling:

| Feature | Description |
|--------|-------------|
| `date`, `year`, `month`, `day`, `day_of_year` | Temporal breakdown |
| `co2_30d_avg`, `co2_365d_avg` | Rolling averages to smooth trends |
| `daily_diff`, `monthly_diff`, `pct_change` | Growth and rate-of-change metrics |
| `anomaly_flag` | Outlier detection based on z-scores or thresholds |
| `sin_day`, `cos_day` | Cyclical encodings for seasonality and radial plots |
| `season` | Categorical: Winter, Spring, Summer, Fall |
| `forecast` | Modeled prediction of future CO₂ levels |

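As a rough sketch of how several of the features in the table could be computed with pandas, the example below uses a synthetic series. The column name `co2`, the `min_periods=1` setting, and the 3-standard-deviation threshold are assumptions chosen for illustration, not settings taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: a linear trend plus one injected spike.
dates = pd.date_range("2020-01-01", periods=400, freq="D")
co2 = 410.0 + 0.01 * np.arange(400)
co2[200] += 5.0
feat = pd.DataFrame({"date": dates, "co2": co2})

# Trailing rolling means. min_periods=1 avoids leading NaNs,
# at the cost of averaging over fewer points at the start.
feat["co2_30d_avg"] = feat["co2"].rolling(window=30, min_periods=1).mean()
feat["co2_365d_avg"] = feat["co2"].rolling(window=365, min_periods=1).mean()

# Day-over-day change, absolute and relative (in percent).
feat["daily_diff"] = feat["co2"].diff()
feat["pct_change"] = feat["co2"].pct_change() * 100

# Z-score of the residual from the 30-day mean; flag |z| > 3 (assumed threshold).
resid = feat["co2"] - feat["co2_30d_avg"]
z = (resid - resid.mean()) / resid.std()
feat["anomaly_flag"] = z.abs() > 3
```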
---

## 📦 Output Files

- `co2_with_features.csv` – Cleaned data plus engineered features for visualization

---

## 🛠️ Tools Used

- `pandas` – Data manipulation
- `numpy` – Numerical operations
- `matplotlib` / `seaborn` – (Optional) Data exploration
- `datetime` – Date parsing and manipulation
- `scipy` – (Optional) Anomaly detection / z-score

---

In [14]:
# 📦 Import standard libraries
import pandas as pd
import numpy as np

# 📊 (Optional) For quick visual checks
import matplotlib.pyplot as plt
import seaborn as sns

# 📁 For handling Excel files (only if needed; the data below is read from CSV)
import openpyxl  # the engine pandas uses for .xlsx files in pd.read_excel()

# 🔧 Display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

In [15]:
# Read in the pre-cleaned .csv
# Note: The dataset was cleaned in Excel beforehand, since it needed little cleaning and that was quicker.


df = pd.read_csv('../data/co2_daily_cleaned.csv')

In [16]:
# Combine the year, month, and day columns into a single datetime column

df['date'] = pd.to_datetime(df[['year', 'month', 'day']])

In [26]:
# Create day_of_year column

df['day_of_year'] = df['date'].dt.dayofyear

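With `day_of_year` available, the planned `sin_day`, `cos_day`, and `season` features could be derived along these lines. This is a sketch on a small hypothetical frame named `demo`. Dividing by the true year length and using meteorological Northern Hemisphere season boundaries are assumptions, since the plan does not specify either.

```python
import numpy as np
import pandas as pd

# Hypothetical dates standing in for the notebook's `date` column.
demo = pd.DataFrame(
    {"date": pd.to_datetime(["2024-01-01", "2024-04-01", "2024-07-01", "2024-12-31"])}
)
demo["day_of_year"] = demo["date"].dt.dayofyear

# Assumption: divide by each year's true length (365 or 366)
# so that Dec 31 lands next to Jan 1 on the circle.
days_in_year = np.where(demo["date"].dt.is_leap_year, 366, 365)
angle = 2 * np.pi * (demo["day_of_year"] - 1) / days_in_year
demo["sin_day"] = np.sin(angle)
demo["cos_day"] = np.cos(angle)

# Assumption: meteorological seasons for the Northern Hemisphere.
season_by_month = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}
demo["season"] = demo["date"].dt.month.map(season_by_month)
```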
In [27]:
df

Unnamed: 0,year,month,day,decimal,co2 molfrac,date,day_of_year
0,1974,5,19,1974.3781,333.46,1974-05-19,139
1,1974,5,20,1974.3808,333.64,1974-05-20,140
2,1974,5,21,1974.3836,333.50,1974-05-21,141
3,1974,5,22,1974.3863,333.21,1974-05-22,142
4,1974,5,23,1974.3890,333.05,1974-05-23,143
...,...,...,...,...,...,...,...
15639,2025,5,22,2025.3863,429.89,2025-05-22,142
15640,2025,5,23,2025.3890,429.52,2025-05-23,143
15641,2025,5,24,2025.3918,430.30,2025-05-24,144
15642,2025,5,25,2025.3945,430.43,2025-05-25,145
