# COGS 108 - Data Checkpoint

## Authors

- Jin Choi: Methodology, Project administration
- Sujin Kim: Conceptualization, Visualization
- Rowoon Lee: Background research, Software
- Idhant Kuma: Data curation, Writing original draft 
- Yechan Park: Analysis, experimental investigation, Writing review & editing

## Research Question

Is there a statistically significant correlation between the global production rate of plastics and key indicators of global warming, such as atmospheric CO₂ concentration, fossil fuel consumption, and global average temperature anomalies?

## Background and Prior Work

The rapid growth of global plastic production over recent decades has raised increasing concern due to its environmental and climate impacts. Since the 1950s, plastic production has increased from almost zero to hundreds of millions of metric tons per year, largely driven by industrialization and rising consumer demand [1]. Because most plastics are produced from fossil fuels, their manufacturing and disposal require large amounts of energy and result in greenhouse gas emissions. As a result, plastic production is closely connected to broader patterns of fossil fuel use and industrial activity.
Long-term observations show that atmospheric carbon dioxide (CO₂) concentrations have risen steadily since the late 1950s. Measurements collected by the National Oceanic and Atmospheric Administration (NOAA) indicate that current CO₂ levels are significantly higher than pre-industrial values and continue to increase each year [2]. This rise is mainly caused by the widespread burning of fossil fuels and other human activities, and it is strongly linked to global warming. Increasing CO₂ concentrations are associated with other major climate indicators, including rising global temperature anomalies, reflecting an enhanced greenhouse effect in Earth’s atmosphere.
Scientific assessments by the Intergovernmental Panel on Climate Change (IPCC) provide strong evidence that human-driven increases in greenhouse gases have caused widespread warming across the climate system. The IPCC Sixth Assessment Report explains that excess heat trapped by greenhouse gases has been absorbed largely by the oceans since the mid-20th century, leading to accelerating glacier and ice-sheet melt and rising global sea levels [3]. These observed changes closely follow long-term increases in fossil fuel consumption and industrial production, suggesting that other fossil-fuel-intensive activities may exhibit similar relationships with climate indicators.
Global temperature records further support this warming trend. Data from NASA’s Goddard Institute for Space Studies show that recent decades are significantly warmer than the mid-20th century average, indicating a clear and persistent rise in global temperature anomalies [4]. Because plastics are derived from fossil fuels and contribute to greenhouse gas emissions throughout their lifecycle, examining the statistical relationship between global plastic production and key climate indicators, such as atmospheric CO₂ concentrations and global temperature anomalies, can help clarify how industrial production aligns with observed global warming trends.

References
1. Our World in Data. Global Plastics Production. https://ourworldindata.org/grapher/global-plastics-production
2. NOAA Global Monitoring Laboratory. Trends in Atmospheric Carbon Dioxide. https://gml.noaa.gov/ccgg/trends/
3. IPCC. (2021). Climate Change 2021: The Physical Science Basis, Chapter 9. https://www.ipcc.ch/report/ar6/wg1/chapter/chapter-9/
4. NASA Goddard Institute for Space Studies. GISTEMP Surface Temperature Analysis. https://data.giss.nasa.gov/gistemp/

## Hypothesis


We hypothesize that there is a statistically significant positive correlation between global plastic production levels and indicators of global warming, including atmospheric CO₂ concentration, fossil fuel consumption, and global average temperature anomalies. This relationship is expected because plastic production is highly dependent on fossil fuels and contributes to greenhouse gas emissions throughout its lifecycle. As plastic production increases steadily over time, we anticipate upward trends in the climate variables as well.
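
One common way to make "statistically significant positive correlation" concrete is Pearson's correlation coefficient. This is an assumption about the eventual analysis, since the test is not fixed at this checkpoint. For $n$ paired annual values $(x_i, y_i)$, the sample coefficient $r$ and its associated $t$ statistic are:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
t = r\,\sqrt{\frac{n - 2}{1 - r^2}}
$$

Under the null hypothesis of no linear association, and assuming independent and roughly normal observations, $t$ follows a Student's $t$ distribution with $n - 2$ degrees of freedom. Annual time series are usually autocorrelated, so the independence assumption is unlikely to hold exactly for these data.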

## Data

### Data overview

#### Dataset 1: Global Plastic Production

The global plastic production dataset is an annual time series of plastic production, measured in tonnes, covering 1950 to 2019. It provides one global value per year, drawn mainly from Geyer et al., and its values for 2016 to 2018 reflect a 5% annual growth rate.

**Link to the dataset**: https://ourworldindata.org/grapher/global-plastics-production

**Number of observations**: 69

**Number of variables**: 4: `Entity`, `Code`, `Year`, and the measured value column for plastics production in tonnes

**Description of the variables most relevant to this project**:
- `Year`: the observation year
- `Entity`: the geographic unit
- `Plastic production`: “annual production of polymer resin and fibers in tonnes” (stored in the column `Annual plastic production between 1950 and 2019`)

**Descriptions of any shortcomings this dataset has with respect to the project**:
- The dataset provides only global totals, with no country or regional breakdown.
- The values come from mixed sources, and later-year values are not directly measured by a single consistent system.
- Production volume does not directly reflect waste or pollution outcomes.
- The dataset lacks confidence intervals or other measures of measurement uncertainty.


#### Dataset 2: Global Monthly Atmospheric CO₂ Data

**Link to the Dataset**: https://gml.noaa.gov/ccgg/trends/gl_data.html

**Number of Observations**: 563 monthly global observations (January 1979 through November 2025 in the downloaded file)

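**Number of Variables**: 7: `year`, `month`, `decimal` (decimal date), `average` (monthly mean CO₂, ppm), `average_unc` (uncertainty of the monthly mean), `trend` (deseasonalized trend), `trend_unc` (uncertainty of the trend)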

**Variables most relevant to this project**: year — calendar year, month — month of observation, average — monthly global CO₂ concentration (ppm), trend — seasonally adjusted global CO₂ concentration

**Shortcomings**: 
- Still observational (no causation)
- Aggregated global average may hide regional variation

#### Dataset 3: Global Land–Ocean Temperature Anomalies (NASA GISTEMP) 

**Source:** NASA Goddard Institute for Space Studies (GISS), GISTEMP Surface Temperature Analysis.  
**File:** `GLB.Ts+dSST.csv` (downloaded from [data.giss.nasa.gov/gistemp](https://data.giss.nasa.gov/gistemp/)).

**Important metrics and units:**  
Each row is one year (1880–present). All temperature values are **anomalies** in degrees Celsius (°C): the deviation from the **1951–1980 global mean** (the baseline). For example, `0.5` means 0.5°C warmer than that baseline; `-0.2` means 0.2°C cooler. In practice, early decades (e.g. 1880s–1970s) often show negative or near-zero anomalies (cooler or similar to the baseline), while recent decades (e.g. 1990s onward) consistently show positive anomalies—recent years often range from about +0.5°C to over +1.2°C, reflecting well-documented global warming. The column **`J-D`** (January–December) is the annual mean land–ocean temperature anomaly and is the main variable we use to correlate with plastic production and other climate indicators. Monthly columns (Jan–Dec) and seasonal columns (e.g. DJF, MAM, JJA, SON) are also present; we focus on `Year` and `J-D` for consistency with the project’s research question.
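
To illustrate what "anomaly relative to a baseline" means arithmetically, here is a minimal sketch on made-up absolute temperatures. It is not how GISTEMP builds its product, which works from station and gridded data. The helper name `to_anomaly` and the toy values are hypothetical.

```python
import pandas as pd


def to_anomaly(temps: pd.Series, base_start: int, base_end: int) -> pd.Series:
    """Subtract the mean over [base_start, base_end] from a year-indexed series."""
    # .loc slicing on a sorted integer year index is label-based and inclusive.
    baseline = temps.loc[base_start:base_end].mean()
    return temps - baseline


# Made-up absolute temperatures in deg C, indexed by year (not GISTEMP values).
toy_temps = pd.Series([14.0, 14.2, 13.8, 14.5], index=[1951, 1952, 1953, 1954])
toy_anomaly = to_anomaly(toy_temps, base_start=1951, base_end=1953)
print(toy_anomaly.round(2))
```

By construction, the anomalies average to zero over the chosen baseline years, and a year warmer than the baseline mean gets a positive value.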

**Concerns and limitations:**  
The data are global aggregates from a well-established scientific product, so there is no sample bias in the usual sense. Missing values are encoded as `***`, mainly for (1) the most recent partial year (e.g. 2026) when the year is incomplete, and (2) some derived columns (e.g. D-N, DJF) in early years where the seasonal definition does not apply. We restrict to years with a non-missing `J-D` so that analyses use only complete annual values. Because all time series in the project share a common upward trend over time, we will later need to be careful not to overinterpret correlation as causation. The checkpoint focuses on obtaining and cleaning this dataset.
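
The concern about shared upward trends can be sketched with synthetic data: two series generated independently, each with its own linear trend, correlate strongly in levels, while their year-over-year changes need not. The series, seed, and noise levels below are invented for illustration, and first-differencing is only one possible way to remove a trend, not a method this project has committed to.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
years = np.arange(1960, 2020)
t = years - years[0]

# Two synthetic, independently generated series that both trend upward.
a = pd.Series(2.0 * t + rng.normal(0, 3, t.size), index=years)
b = pd.Series(0.5 * t + rng.normal(0, 1, t.size), index=years)

r_levels = a.corr(b)              # Pearson r on the raw levels
r_diff = a.diff().corr(b.diff())  # Pearson r on year-over-year changes

print(f"r (levels):      {r_levels:.2f}")
print(f"r (differences): {r_diff:.2f}")
```

Comparing the two coefficients gives a rough sense of how much of a levels correlation is attributable to the common trend alone.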

#### Dataset 4: Global Annual Mean CO₂ Data

**Link to the Dataset**: https://gml.noaa.gov/ccgg/trends/gl_data.html

**Number of Observations**: 46 annual observations (1979 to 2024)

**Number of Variables**: 3: `year`, `mean` (annual mean CO₂, ppm), `unc` (uncertainty of the annual mean)

**Variables most relevant to this project**: `year`: calendar year; `mean`: annual global mean CO₂ concentration (ppm); `unc`: uncertainty of the annual mean

**Shortcomings**: 
- Annual averaging removes seasonal detail

In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [2]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# URLs and local filenames for the raw datafiles
# NOTE: the NASA GISTEMP file (GLB.Ts+dSST.csv, Dataset 3) is not in this list, so it
# must already be in data/00-raw/ before the Dataset 3 cells below are run
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://ourworldindata.org/grapher/global-plastics-production.csv', 'filename':'global-plastics-production.csv'},
    {
        'url': 'https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_mm_gl.csv',
        'filename': 'co2_mm_gl.csv'
    },
    {
        'url': 'https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_annmean_gl.csv',
        'filename': 'co2_annmean_gl.csv'
    }
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Overall Download Progress:   0%|          | 0/3 [00:00<?, ?it/s]

Overall Download Progress:  33%|███▎      | 1/3 [00:00<00:00,  8.03it/s]

Successfully downloaded: global-plastics-production.csv


Overall Download Progress:  67%|██████▋   | 2/3 [00:00<00:00,  4.84it/s]

Successfully downloaded: co2_mm_gl.csv


Overall Download Progress: 100%|██████████| 3/3 [00:00<00:00,  5.38it/s]

Successfully downloaded: co2_annmean_gl.csv





### 1. Global Plastic Production Dataset (Our World in Data)

In [3]:
from pathlib import Path
import pandas as pd
import numpy as np

RAW_DIR = Path("data/00-raw")
INTERIM_DIR = Path("data/01-interim")
PROCESSED_DIR = Path("data/02-processed")
for d in [RAW_DIR, INTERIM_DIR, PROCESSED_DIR]:
    d.mkdir(parents=True, exist_ok=True)

RAW_PATH = RAW_DIR / "global-plastics-production.csv"

In [4]:
# 1. Load the dataset

df_raw = pd.read_csv(RAW_PATH)
print("Loaded raw data from:", RAW_PATH)
display(df_raw.head())
print("Raw shape (rows, cols):", df_raw.shape)

Loaded raw data from: data/00-raw/global-plastics-production.csv


Unnamed: 0,Entity,Code,Year,Annual plastic production between 1950 and 2019
0,World,OWID_WRL,1950,2000000
1,World,OWID_WRL,1951,2000000
2,World,OWID_WRL,1952,2000000
3,World,OWID_WRL,1953,3000000
4,World,OWID_WRL,1954,3000000


Raw shape (rows, cols): (69, 4)


In [5]:
# 2) Tidy check

# Tidy criteria: each variable is a column; each observation is a row.
print("\nColumns:", df_raw.columns.tolist())
print("\nDtypes:\n", df_raw.dtypes)

# Expect OWID style: Entity, Year, and one value column
required = {"Entity", "Year"}
if not required.issubset(df_raw.columns):
    raise ValueError(f"Expected at least {required}, but got {df_raw.columns.tolist()}")

# Identify numeric value columns besides Year
numeric_cols = df_raw.select_dtypes(include=[np.number]).columns.tolist()
value_candidates = [c for c in numeric_cols if c != "Year"]
if len(value_candidates) == 0:
    raise ValueError("No numeric value column found besides Year.")
value_col = value_candidates[0]  # main metric column
print("Main production column:", value_col)

# Check if one row per Entity-Year
dup_entity_year = df_raw.duplicated(subset=["Entity", "Year"]).sum()
print("Duplicate Entity-Year rows:", dup_entity_year)
print("✅ Data is already in tidy long format (Entity-Year observations). No reshape needed.")


Columns: ['Entity', 'Code', 'Year', 'Annual plastic production between 1950 and 2019']

Dtypes:
 Entity                                             object
Code                                               object
Year                                                int64
Annual plastic production between 1950 and 2019     int64
dtype: object
Main production column: Annual plastic production between 1950 and 2019
Duplicate Entity-Year rows: 0
✅ Data is already in tidy long format (Entity-Year observations). No reshape needed.


In [6]:
# 3) Dataset size

print("\nDataset size:")
print("Rows:", len(df_raw))
print("Columns:", df_raw.shape[1])
print("Unique entities:", df_raw["Entity"].nunique())
print("Year range:", int(df_raw["Year"].min()), "to", int(df_raw["Year"].max()))


Dataset size:
Rows: 69
Columns: 4
Unique entities: 1
Year range: 1950 to 2019
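
The size report above lists 69 rows for a year range of 1950 to 2019, which would span 70 calendar years if complete, so one year appears to be absent from the raw file. A minimal sketch for locating such gaps follows, using a small made-up `Year` column rather than the real data. The helper name `missing_years` is hypothetical.

```python
import pandas as pd


def missing_years(years: pd.Series) -> list[int]:
    """Return the calendar years absent between the min and max of `years`."""
    observed = set(years.astype(int).tolist())
    expected = range(min(observed), max(observed) + 1)
    return sorted(set(expected) - observed)


# Made-up example column (not the real dataset): 1952 is deliberately left out.
toy_years = pd.Series([1950, 1951, 1953, 1954], name="Year")
print("Missing years:", missing_years(toy_years))
```

On the notebook's data, the equivalent call would be `missing_years(df_raw["Year"])`. Knowing which year is absent matters later, because an inner join with the other annual tables would silently drop it.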


In [7]:
# 4) Missingness

missing_count = df_raw.isna().sum()
missing_pct = (missing_count / len(df_raw) * 100).round(2)
missing_report = pd.DataFrame({"missing_count": missing_count, "missing_pct": missing_pct}).sort_values(
    "missing_count", ascending=False
)
print("\nMissingness by column:")
display(missing_report)

rows_with_missing = df_raw[df_raw.isna().any(axis=1)]
print("Rows with any missing values:", rows_with_missing.shape[0])

if rows_with_missing.shape[0] > 0:
    print("\nWhere missingness happens (by Year):")
    display(rows_with_missing["Year"].value_counts().sort_index())

    print("\nWhere missingness happens (by Entity):")
    display(rows_with_missing["Entity"].value_counts().head(15))

    print("\nInterpretation: If missingness clusters in specific years/entities, it is likely systematic (not random-looking).")
else:
    print("✅ No missing values detected → missingness does not appear to be an issue for this dataset.")


Missingness by column:


Unnamed: 0,missing_count,missing_pct
Entity,0,0.0
Code,0,0.0
Year,0,0.0
Annual plastic production between 1950 and 2019,0,0.0


Rows with any missing values: 0
✅ No missing values detected → missingness does not appear to be an issue for this dataset.


In [8]:
# 5) Flag outliers / suspicious entries
# ----------------------------
df_flagged = df_raw.copy()

# Suspicious: negative production values
df_flagged["flag_negative"] = df_flagged[value_col] < 0

# Suspicious: duplicate Entity-Year rows
df_flagged["flag_duplicate_entity_year"] = df_flagged.duplicated(subset=["Entity", "Year"], keep=False)

# Outlier review flag via IQR 
q1, q3 = df_flagged[value_col].quantile([0.25, 0.75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df_flagged["flag_iqr_outlier"] = (df_flagged[value_col] < lo) | (df_flagged[value_col] > hi)

suspicious = df_flagged[df_flagged[["flag_negative","flag_duplicate_entity_year","flag_iqr_outlier"]].any(axis=1)]
print("\nSuspicious/outlier rows flagged:", suspicious.shape[0])
display(suspicious.head(20))

# Save interim (optional)
INTERIM_PATH = INTERIM_DIR / "global_plastics_production_flagged.csv"
df_flagged.to_csv(INTERIM_PATH, index=False)
print("Saved interim flagged dataset to:", INTERIM_PATH)


Suspicious/outlier rows flagged: 0


Unnamed: 0,Entity,Code,Year,Annual plastic production between 1950 and 2019,flag_negative,flag_duplicate_entity_year,flag_iqr_outlier


Saved interim flagged dataset to: data/01-interim/global_plastics_production_flagged.csv


In [9]:
# 6) Clean data 

# We drop rows missing key fields (Entity/Year/value) because imputing aggregate annual production would be speculative and could bias trends.
df_clean = df_flagged.copy()

# Remove duplicate Entity-Year (keep first)
df_clean = df_clean.drop_duplicates(subset=["Entity", "Year"], keep="first")

# Drop rows missing key fields
before = len(df_clean)
df_clean = df_clean.dropna(subset=["Entity", "Year", value_col])
print("Dropped rows missing key fields:", before - len(df_clean))

# Remove negative values (set to NaN then drop)
neg_before = (df_clean[value_col] < 0).sum()
df_clean.loc[df_clean[value_col] < 0, value_col] = np.nan
df_clean = df_clean.dropna(subset=[value_col])
print("Removed negative rows:", int(neg_before))

# Ensure Year is int
df_clean["Year"] = df_clean["Year"].astype(int)

# Keep core columns + flags (clean output)
df_clean = df_clean[["Entity", "Year", value_col, "flag_negative", "flag_duplicate_entity_year", "flag_iqr_outlier"]]

print("\nCleaned shape:", df_clean.shape)
display(df_clean.head())

Dropped rows missing key fields: 0
Removed negative rows: 0

Cleaned shape: (69, 6)


Unnamed: 0,Entity,Year,Annual plastic production between 1950 and 2019,flag_negative,flag_duplicate_entity_year,flag_iqr_outlier
0,World,1950,2000000.0,False,False,False
1,World,1951,2000000.0,False,False,False
2,World,1952,2000000.0,False,False,False
3,World,1953,3000000.0,False,False,False
4,World,1954,3000000.0,False,False,False


In [10]:
# 7) Write final processed dataset

PROCESSED_PATH = PROCESSED_DIR / "global_plastics_production_clean.csv"
df_clean.to_csv(PROCESSED_PATH, index=False)
print("Saved processed clean dataset to:", PROCESSED_PATH)

Saved processed clean dataset to: data/02-processed/global_plastics_production_clean.csv


### 2. Global Monthly Atmospheric CO₂ Data
This dataset contains monthly global atmospheric carbon dioxide (CO₂) measurements provided by NOAA’s Global Monitoring Laboratory. Each row represents one month of globally averaged CO₂ concentration. The dataset includes the variables year, month, decimal date, average CO₂ concentration, deseasonalized trend, and uncertainty estimates for the average and the trend.

CO₂ concentration is measured in parts per million (ppm), which represents how many CO₂ molecules exist per one million molecules of air. Monthly data allows us to observe both long-term upward trends and seasonal fluctuations in global atmospheric CO₂ levels. The seasonal variation reflects natural plant growth cycles, while the long-term increase reflects human-related greenhouse gas emissions.

A limitation of this dataset is that, although it represents global averages, it does not capture regional differences in CO₂ levels. Additionally, this dataset shows trends and correlations but does not establish causation.

In [11]:
import pandas as pd

# Load dataset
co2_monthly = pd.read_csv(
    "data/00-raw/co2_mm_gl.csv",
    comment="#"
)

print("Shape (rows, columns):", co2_monthly.shape)

print("\nColumn names and dtypes:")
print(co2_monthly.dtypes)

print("\nFirst few rows:")
display(co2_monthly.head())

print("\nMissing values per column:")
print(co2_monthly.isna().sum())

print("\nSummary statistics:")
display(co2_monthly.describe())

print("\nDuplicate rows:", co2_monthly.duplicated().sum())

Shape (rows, columns): (563, 7)

Column names and dtypes:
year             int64
month            int64
decimal        float64
average        float64
average_unc    float64
trend          float64
trend_unc      float64
dtype: object

First few rows:


Unnamed: 0,year,month,decimal,average,average_unc,trend,trend_unc
0,1979,1,1979.042,336.56,0.11,335.92,0.09
1,1979,2,1979.125,337.29,0.09,336.26,0.09
2,1979,3,1979.208,337.88,0.11,336.51,0.09
3,1979,4,1979.292,338.32,0.13,336.72,0.1
4,1979,5,1979.375,338.26,0.04,336.71,0.1



Missing values per column:
year           0
month          0
decimal        0
average        0
average_unc    0
trend          0
trend_unc      0
dtype: int64

Summary statistics:


Unnamed: 0,year,month,decimal,average,average_unc,trend,trend_unc
count,563.0,563.0,563.0,563.0,563.0,563.0,563.0
mean,2001.959147,6.490231,2002.458334,375.844725,0.104636,375.84611,0.064938
std,13.554035,3.450385,13.555698,25.761809,0.031044,25.737807,0.017569
min,1979.0,1.0,1979.042,334.37,0.03,335.92,0.03
25%,1990.0,3.5,1990.75,354.355,0.08,354.45,0.05
50%,2002.0,6.0,2002.458,372.84,0.1,372.3,0.06
75%,2014.0,9.0,2014.1665,396.745,0.12,396.57,0.08
max,2025.0,12.0,2025.875,426.94,0.26,426.63,0.12



Duplicate rows: 0
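
The summary above shows 563 monthly rows ending at decimal date 2025.875, so the final year in the file is incomplete. If the monthly series is ever aggregated to yearly values for comparison with the annual datasets, partial years should probably be excluded. A minimal sketch on made-up numbers follows. The helper `annual_from_monthly` and the output column names are hypothetical, and NOAA's own annual file already provides yearly means, so this would serve mainly as a cross-check.

```python
import pandas as pd


def annual_from_monthly(monthly: pd.DataFrame, min_months: int = 12) -> pd.DataFrame:
    """Average monthly CO2 by year, keeping only years with at least `min_months` rows."""
    grouped = monthly.groupby("year")["average"].agg(co2_mean="mean", n_months="count")
    complete = grouped[grouped["n_months"] >= min_months]
    return complete.reset_index()


# Made-up values for illustration only: a full year followed by an 11-month year.
toy = pd.DataFrame({
    "year": [2000] * 12 + [2001] * 11,
    "month": list(range(1, 13)) + list(range(1, 12)),
    "average": [370.0] * 12 + [372.0] * 11,
})
annual = annual_from_monthly(toy)
print(annual)
```

Only the 12-month year survives the filter; the 11-month year is dropped instead of being averaged over an incomplete seasonal cycle.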


### 3. Global Land–Ocean Temperature Anomalies (NASA GISTEMP)


In [12]:
import pandas as pd
import os

# --- 3.1 Load the dataset from data/00-raw/ ---
# First row is a title line, so we skip it. Missing values are encoded as '***'.
raw_path = 'data/00-raw/GLB.Ts+dSST.csv'
temp_raw = pd.read_csv(raw_path, skiprows=1, na_values='***')

print("Shape (rows, columns):", temp_raw.shape)
print("\nColumn names and dtypes:")
print(temp_raw.dtypes)
print("\nFirst few rows:")
display(temp_raw.head())

# --- 3.2 Tidy format: one row per year, one column per variable → already tidy ---
print("\nTidy check: one row per observation (year), one column per variable. No need to reshape.")


Shape (rows, columns): (147, 19)

Column names and dtypes:
Year      int64
Jan     float64
Feb     float64
Mar     float64
Apr     float64
May     float64
Jun     float64
Jul     float64
Aug     float64
Sep     float64
Oct     float64
Nov     float64
Dec     float64
J-D     float64
D-N     float64
DJF     float64
MAM     float64
JJA     float64
SON     float64
dtype: object

First few rows:


Unnamed: 0,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,J-D,D-N,DJF,MAM,JJA,SON
0,1880,-0.19,-0.25,-0.1,-0.17,-0.11,-0.22,-0.19,-0.11,-0.15,-0.24,-0.23,-0.18,-0.18,,,-0.13,-0.17,-0.21
1,1881,-0.21,-0.15,0.02,0.04,0.05,-0.2,-0.01,-0.04,-0.16,-0.22,-0.19,-0.08,-0.1,-0.1,-0.18,0.04,-0.08,-0.19
2,1882,0.15,0.13,0.04,-0.18,-0.15,-0.24,-0.17,-0.08,-0.15,-0.24,-0.17,-0.37,-0.12,-0.09,0.07,-0.1,-0.16,-0.19
3,1883,-0.3,-0.37,-0.13,-0.19,-0.18,-0.07,-0.08,-0.14,-0.22,-0.11,-0.25,-0.12,-0.18,-0.2,-0.35,-0.17,-0.1,-0.19
4,1884,-0.14,-0.09,-0.37,-0.4,-0.34,-0.35,-0.31,-0.28,-0.27,-0.25,-0.34,-0.31,-0.29,-0.27,-0.11,-0.37,-0.31,-0.29



Tidy check: one row per observation (year), one column per variable. No need to reshape.


**3.2 (tidy) & 3.3 (size):** The dataset is already in tidy form: one row per year, one column per variable (year, monthly anomalies, annual J-D, seasonal indices). Below we verify the size (number of observations and variables) and then assess missing values.

In [13]:
# --- 3.3 Demonstrate the size of the dataset ---
print("Number of observations (years):", len(temp_raw))
print("Number of variables:", len(temp_raw.columns))

# --- 3.4 How much data is missing, where, and is it missing at random? ---
print("\nMissing values per column (only columns with any missing):")
missing_counts = temp_raw.isna().sum()
print(missing_counts[missing_counts > 0])

# Which years have missing J-D? (critical for our analysis)
jd_missing = temp_raw[temp_raw['J-D'].isna()]
print("\nRows (years) with missing J-D:", len(jd_missing))
if len(jd_missing) > 0:
    display(jd_missing[['Year', 'J-D']])

Number of observations (years): 147
Number of variables: 19

Missing values per column (only columns with any missing):
Feb    1
Mar    1
Apr    1
May    1
Jun    1
Jul    1
Aug    1
Sep    1
Oct    1
Nov    1
Dec    1
J-D    1
D-N    2
DJF    2
MAM    1
JJA    1
SON    1
dtype: int64

Rows (years) with missing J-D: 1


Unnamed: 0,Year,J-D
146,2026,


**Missingness:** Missing values are not missing at random. They appear (1) in the most recent year when only partial months exist (e.g. 2026), and (2) in derived columns (D-N, DJF, etc.) for early years where the seasonal definition does not apply. For our analysis we only need `Year` and `J-D`; we will drop rows where `J-D` is missing so that every retained row has a valid annual anomaly. We do not fill missing values, since we need complete years for correlation with other annual datasets.

**3.5 Find and flag outliers or suspicious entries:** We check for extreme values in our key variable `J-D`. Temperature anomalies are physical measurements; unusually warm or cool years are real climate events, not data-entry errors. We flag them for review but do not remove them.

In [14]:
# Ensure J-D is numeric for this check (in case we ran cells out of order)
temp_raw['J-D'] = pd.to_numeric(temp_raw['J-D'], errors='coerce')
jd_valid = temp_raw.dropna(subset=['J-D'])

# Flag extremes: years with lowest and highest annual anomaly
print("Years with lowest J-D (coldest relative to 1951-1980 baseline):")
display(jd_valid.nsmallest(5, 'J-D')[['Year', 'J-D']])
print("\nYears with highest J-D (warmest relative to baseline):")
display(jd_valid.nlargest(5, 'J-D')[['Year', 'J-D']])
print("\nConclusion: All flagged values are plausible (historical cold spells and recent warming). We keep all rows; no outliers removed.")

Years with lowest J-D (coldest relative to 1951-1980 baseline):


Unnamed: 0,Year,J-D
29,1909,-0.5
24,1904,-0.48
37,1917,-0.47
30,1910,-0.45
31,1911,-0.45



Years with highest J-D (warmest relative to baseline):


Unnamed: 0,Year,J-D
144,2024,1.28
145,2025,1.19
143,2023,1.17
136,2016,1.01
140,2020,1.01



Conclusion: All flagged values are plausible (historical cold spells and recent warming). We keep all rows; no outliers removed.


In [15]:
# --- 3.6 Clean the data: deal with missingness ---
# Choice: we use dropna(subset=['J-D']) (not fillna). Justification: we need complete annual
# values to merge with other yearly datasets; filling would impute values we cannot verify.
temp_raw['J-D'] = pd.to_numeric(temp_raw['J-D'], errors='coerce')
temp_clean = temp_raw.dropna(subset=['J-D']).copy()
temp_clean = temp_clean.astype({'Year': 'int64'})

print("Rows after dropping rows with missing J-D:", len(temp_clean))
print("Year range in cleaned data:", temp_clean['Year'].min(), "–", temp_clean['Year'].max())
temp_clean[['Year', 'J-D']].tail(10)

Rows after dropping rows with missing J-D: 146
Year range in cleaned data: 1880 – 2025


Unnamed: 0,Year,J-D
136,2016,1.01
137,2017,0.92
138,2018,0.85
139,2019,0.98
140,2020,1.01
141,2021,0.85
142,2022,0.9
143,2023,1.17
144,2024,1.28
145,2025,1.19


**3.7 Save final wrangled data:** We load from `data/00-raw/` (done above). We do not write an intermediate stage to `data/01-interim/` because the only change is dropping one incomplete year. The final, fully wrangled table (Year + annual anomaly) is written to `data/02-processed/`.

In [16]:
# Subset to columns needed for the project; write final data to data/02-processed/
processed = temp_clean[['Year', 'J-D']].rename(columns={'J-D': 'temp_anomaly_C'})
os.makedirs('data/02-processed', exist_ok=True)
out_path = 'data/02-processed/global_temp_anomaly_giss.csv'
processed.to_csv(out_path, index=False)

print("Saved to", out_path)
print("Shape:", processed.shape)

# --- 4. Summary statistics for important variables ---
print("\nSummary statistics for key variables (Year and temp_anomaly_C):")
display(processed.describe())

Saved to data/02-processed/global_temp_anomaly_giss.csv
Shape: (146, 2)

Summary statistics for key variables (Year and temp_anomaly_C):


Unnamed: 0,Year,temp_anomaly_C
count,146.0,146.0
mean,1952.5,0.080274
std,42.290661,0.403674
min,1880.0,-0.5
25%,1916.25,-0.2
50%,1952.5,-0.03
75%,1988.75,0.3175
max,2025.0,1.28


**Checkpoint summary — Data cleaning (Dataset #3)**  
- **How clean is the data?** The raw file is well-structured and mostly clean. The only issues are (1) a title row we skip on read, (2) missing values encoded as `***` (incomplete or inapplicable values), and (3) one incomplete year (2026) with missing `J-D`.  
- **What did we do to get it usable?** We loaded from `data/00-raw/` (3.1), confirmed the data are already tidy (3.2), reported size (3.3), assessed missingness and showed it is systematic not random (3.4), flagged and reviewed extremes and kept them (3.5), dropped only rows with missing `J-D` and justified not filling (3.6), and wrote the final wrangled table to `data/02-processed/` (3.7). We did not fill missing values because we need complete annual values for merging.  
- **Pre-processing for analysis:** We restricted to `Year` and annual mean anomaly (`temp_anomaly_C`) and saved that table for merging. No distributional transformations were required for this checkpoint; if we use parametric methods later, we may check normality of the anomaly series.

### 4. Global Annual Mean CO₂ Data
This dataset contains annual global mean atmospheric CO₂ concentrations provided by NOAA. Each row represents one year of globally averaged CO₂ levels. The dataset includes the calendar year, the annual mean CO₂ concentration (ppm), and an uncertainty estimate for that mean.

Annual data removes seasonal variation and highlights long-term trends more clearly. This makes it especially useful for comparing CO₂ levels with other yearly indicators, such as global plastic production or global temperature anomalies.

A limitation of this dataset is that annual averaging removes seasonal detail, which may hide short-term fluctuations. Additionally, like the monthly dataset, it provides observational data and does not directly explain the causes of CO₂ increases.

In [17]:
import pandas as pd

# Load dataset
co2_annual = pd.read_csv(
    "data/00-raw/co2_annmean_gl.csv",
    comment="#"
)

print("Shape (rows, columns):", co2_annual.shape)
print("\nColumn names and dtypes:")
print(co2_annual.dtypes)

print("\nFirst few rows:")
display(co2_annual.head())

print("\nMissing values per column:")
print(co2_annual.isna().sum())

print("\nSummary statistics:")
display(co2_annual.describe())

print("\nDuplicate rows:", co2_annual.duplicated().sum())

Shape (rows, columns): (46, 3)

Column names and dtypes:
year      int64
mean    float64
unc     float64
dtype: object

First few rows:


Unnamed: 0,year,mean,unc
0,1979,336.86,0.1
1,1980,338.91,0.07
2,1981,340.11,0.08
3,1982,340.85,0.03
4,1983,342.53,0.05



Missing values per column:
year    0
mean    0
unc     0
dtype: int64

Summary statistics:


Unnamed: 0,year,mean,unc
count,46.0,46.0,46.0
mean,2001.5,374.855435,0.065
std,13.422618,25.253197,0.016964
min,1979.0,336.86,0.03
25%,1990.25,354.395,0.05
50%,2001.5,371.575,0.07
75%,2012.75,394.7125,0.07
max,2024.0,422.8,0.1



Duplicate rows: 0
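
Because the annual CO₂ series starts in 1979 and the plastics series ends in 2019, any year-by-year comparison across the datasets is limited to the years they share. A minimal sketch of aligning the annual tables with an inner join follows, using made-up rows. The column names `plastic_tonnes` and `co2_ppm` are hypothetical renames, and the inner join is an assumption about how the tables will eventually be combined.

```python
import pandas as pd

# Made-up rows for illustration only; key column names mirror the tables in this notebook.
plastics = pd.DataFrame({"Year": [1978, 1979, 1980], "plastic_tonnes": [1.0, 2.0, 3.0]})
temp = pd.DataFrame({"Year": [1979, 1980, 1981], "temp_anomaly_C": [0.1, 0.2, 0.3]})
co2 = pd.DataFrame({"year": [1979, 1980, 1981], "mean": [336.0, 338.0, 340.0]})

# Align column names, then inner-join so only years present in all three tables remain.
co2 = co2.rename(columns={"year": "Year", "mean": "co2_ppm"})
merged = plastics.merge(temp, on="Year", how="inner").merge(co2, on="Year", how="inner")
print(merged)
```

An inner join keeps the merged table free of missing values at the cost of discarding years outside the overlap; an outer join would keep every year but introduce NaNs that must be handled before computing correlations.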


## Ethics


[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


The data used in this project are public, aggregated datasets from sources such as Our World in Data, NOAA, NASA, and the IPCC. Because they contain no personal or individual-level data, they raise no privacy concerns. The data are free to use for research and follow open data terms. However, there may be some bias in how the data were collected. For example, plastics production data rely on country and industry reports, which may be less accurate in regions with weaker reporting systems. Climate data may also reflect more measurements from developed countries, which could slightly bias global averages.
Another issue is that all datasets show strong upward trends over time, which could make correlations look stronger than they really are. To handle this, the data will be checked for missing values and inconsistencies, and trends will be visualized before running any statistical tests. Results will be explained carefully, making it clear that correlation does not imply causation. When sharing results, the limits of the data will be clearly stated to avoid misleading conclusions or unfairly blaming specific groups or regions.
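
The concern about shared upward trends can be illustrated with a minimal sketch. It uses two synthetic series invented for this example (not the project data) and the hypothetical helpers `pearson_r` and `first_difference`. It compares the Pearson correlation of the raw levels with the correlation of the year-over-year changes. First-differencing is only one possible way to reduce the influence of a common trend and is shown here as an assumption, not as the method this project commits to.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def first_difference(series):
    """Change between consecutive observations: series[t+1] - series[t]."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Synthetic, illustrative series (NOT the project data): both rise over time,
# but their fluctuations around the trend are unrelated by construction.
steps = range(41)
wiggle_a = [1 if t % 2 == 0 else -1 for t in steps]  # period-2 pattern
wiggle_b = [1 if t % 4 < 2 else -1 for t in steps]   # period-4 pattern
series_a = [2 * t + wiggle_a[t] for t in steps]
series_b = [5 * t + wiggle_b[t] for t in steps]

# Correlation of the raw levels is dominated by the shared upward trend.
r_levels = pearson_r(series_a, series_b)

# Correlation of the changes removes the linear trend from both series.
r_changes = pearson_r(first_difference(series_a), first_difference(series_b))

print(f"r on raw levels:           {r_levels:.3f}")
print(f"r on consecutive changes:  {r_changes:.3f}")
```

In this constructed example the level correlation comes out close to 1 even though the two series share nothing except a trend, while the correlation of the changes is essentially zero. With the project's pandas DataFrames, the same comparison could be made with `Series.diff()` followed by `Series.corr()`.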

## Team Expectations 

Team expectation 1: Communication
* Primary communication method: Discord chat and calls
* Response time is usually within a day, and everyone should join the weekly group meeting call since we are all contributing to the proposal
* If a deadline is within 48 hours, we aim to start at least 12 hours before the deadline
* If someone is unavailable to join the call or do the work, they must notify the group as soon as possible

Team expectation 2: Weekly Meeting Schedule 
* Meetings will be held every week, usually on Wednesday around 3-5 PM, since everyone is available during that time
* At each meeting, we discuss what needs to be done before the deadline and what to expect, and plan for the next meeting
* We use Google Docs to work on the assignments, and whoever is available submits them

Team expectation 3: Decision-making
* During team meetings, we go with the majority
* Whoever is available can create the assignment document or submit the assignment
* If a quick decision needs to be made, whoever responds first makes the call

Team expectation 4: Equal Contribution
* Everyone puts in an equal amount of time and effort to finish the assignment
* We will use our GitHub page and Google Docs to work on most of our project
* Everyone must contribute to the weekly meeting and must let the group know on the Discord chat if something comes up
* Respect every member and keep appropriate boundaries


## Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/30  |  1 PM | Confirmed group members and set up a group chat. Each member read the previous proposals and prepared a review of them for the assignment.   | Discussed who will answer which portions and shared thoughts on two previous proposal examples so that we can improve on them.  | 
| 2/4  |  3 PM |  Think about which research question we want to work on. | Discuss which portion in the proposal each person wants to work on, and what to answer. | 
| 2/11  | 3 PM  | Prepare data we need for our research and keep collecting them. | Share the data we collected and the approaches we should take. Every member works on their assigned parts. |
| 2/18  | 3 PM  | Combine the data and prepare it for use in the project; begin EDA.  | Review our work and the data collected. Check for any problems with the EDA. |
| 2/25  | 3 PM  | Finalize the project generally and start making the analysis for the project.  | Discuss what to include in the analysis and what to highlight. Everyone should participate and share their thoughts and ideas. Complete project check-in. |
| 3/4  | 3 PM  | Complete the analysis and make the final draft of the project to see if there’s anything causing problems. | Discuss and make edits on the analysis and the contents generally. |
| 3/18  | Before 11:59 PM  | Finish up and fix last minute errors and submit the project  | Submit the project and the surveys |