# Data matching
---

This notebook experiments with matching data from three sources:
- Global Energy Monitor (GEM)'s [Global Coal Plant Tracker](https://www.globalenergymonitor.org/coal.html)
- The US EPA's [CAMPD emissions data](https://campd.epa.gov/data)
- OpenStreetMap (OSM)'s [cooling_tower](https://wiki.openstreetmap.org/wiki/Tag:man_made%3Dcooling_tower) tag

## Setup

### Imports

In [None]:
import pandas as pd
import geopandas as gpd
import plotly.express as px

In [None]:
from coal_emissions_monitoring.data_cleaning import (
    load_clean_gcpt_gdf,
    load_clean_campd_facilities_gdf,
    load_clean_campd_emissions_df,
    load_osm_data,
)

### Parameters

In [None]:
# show all columns in pandas
pd.set_option("display.max_columns", None)

## Load data

### GEM Global Coal Plant Tracker

In [None]:
gcpt_df = load_clean_gcpt_gdf("/Users/adminuser/Downloads/Global-Coal-Plant-Tracker-January-2023.xlsx")
gcpt_df

### CAMPD facilities metadata

In [None]:
campd_facilities_df = load_clean_campd_facilities_gdf("/Users/adminuser/GitHub/ccai-ss23-ai-monitoring-tutorial/data/facility_attributes.csv")
campd_facilities_df

In [None]:
campd_facilities_df.capacity_mw.describe()

In [None]:
# Find the distance from each facility to its nearest neighboring facility.
# sjoin_nearest reports distances in the units of the GeoDataFrame's CRS
# (degrees if the CRS is geographic, e.g. EPSG:4326).
# Iterating over unique IDs avoids recomputing for repeated rows of a facility.
for facility_id in campd_facilities_df.facility_id.unique():
    is_current = campd_facilities_df.facility_id == facility_id
    campd_facilities_df.loc[is_current, "dist_to_nearest_facility"] = gpd.sjoin_nearest(
        campd_facilities_df.loc[is_current],
        campd_facilities_df.loc[~is_current],
        distance_col="dist",
    ).dist.min()
campd_facilities_df.groupby("facility_id").dist_to_nearest_facility.min().sort_values()
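
If the facilities GeoDataFrame uses a geographic CRS (an assumption; the notebook does not state the CRS), the distances above are in degrees, which are hard to interpret. As a rough cross-check, the nearest-neighbor distance can be computed in kilometers with the haversine formula. This is a minimal brute-force sketch in plain Python, not the method used by `sjoin_nearest`; the function names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_neighbor_km(points):
    """Map each id to the distance (km) to its closest other point.

    `points` is a dict of id -> (lat, lon). Brute force, O(n^2).
    """
    return {
        pid: min(
            haversine_km(lat, lon, *other)
            for oid, other in points.items()
            if oid != pid
        )
        for pid, (lat, lon) in points.items()
    }


# Hypothetical coordinates, for illustration only (not real facilities)
facilities = {
    "A": (40.0, -80.0),
    "B": (40.0, -80.1),
    "C": (41.0, -80.0),
}
nearest = nearest_neighbor_km(facilities)
```

The quadratic loop is fine for a few thousand facilities; for larger tables a spatial index or a projected CRS with a single `sjoin_nearest` call would scale better.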

In [None]:
campd_facilities_df[campd_facilities_df.year == 2023].explore()

### CAMPD emissions data

In [None]:
campd_emissions_df = load_clean_campd_emissions_df("/Users/adminuser/GitHub/ccai-ss23-ai-monitoring-tutorial/data/daily_emissions_facility_aggregation.csv")
campd_emissions_df

In [None]:
# mean CO2 mass per record (in short tons) for each calendar year
campd_emissions_df["year"] = campd_emissions_df["date"].dt.year
yearly_emissions = campd_emissions_df.groupby("year").co2_mass_short_tons.mean()
yearly_emissions

In [None]:
px.line(campd_emissions_df, x="date", y="co2_mass_short_tons", color="facility_name")

### OSM cooling_tower tag

In [None]:
osm_gdf = load_osm_data()
osm_gdf

## Match data

### CAMPD facilities metadata and emissions

In [None]:
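# merge keys must share a dtype: convert the year to a datetime (1 January of that year),
# assuming campd_facilities_df.year is stored the same way
campd_emissions_df["year"] = pd.to_datetime(campd_emissions_df["date"].dt.year, format="%Y")
# inner join on facility and year; columns present in both tables get a "_delete"
# suffix on the facilities side, so the emissions copy is kept
campd_gdf = pd.merge(
    campd_facilities_df,
    campd_emissions_df,
    on=["facility_id", "year"],
    how="inner",
    suffixes=("_delete", ""),
)
campd_gdf = campd_gdf.drop(columns=[col for col in campd_gdf.columns if col.endswith("_delete")])
campd_gdf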
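
The merge above relies on two details: both `year` columns must have the same dtype, and the `suffixes=("_delete", "")` trick keeps only one copy of any overlapping column. The toy example below sketches the same pattern with plain pandas DataFrames; the table contents are hypothetical and only the column names mirror the notebook.

```python
import pandas as pd

# Hypothetical toy tables standing in for the facilities and emissions data
facilities = pd.DataFrame(
    {
        "facility_id": [1, 1, 2],
        "year": pd.to_datetime(["2022", "2023", "2023"], format="%Y"),
        "facility_name": ["Plant A", "Plant A", "Plant B"],
    }
)
emissions = pd.DataFrame(
    {
        "facility_id": [1, 1, 2],
        "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-01"]),
        "facility_name": ["Plant A", "Plant A", "Plant B"],
        "co2_mass_short_tons": [10.0, 12.0, 7.5],
    }
)

# Build a datetime "year" key (1 January of each year) matching `facilities.year`
emissions["year"] = pd.to_datetime(emissions["date"].dt.year, format="%Y")

# Overlapping non-key columns from the left table get the "_delete" suffix
merged = pd.merge(
    facilities,
    emissions,
    on=["facility_id", "year"],
    how="inner",
    suffixes=("_delete", ""),
)
# Drop the left-hand duplicates, keeping the right-hand (emissions) copies
merged = merged.drop(columns=[c for c in merged.columns if c.endswith("_delete")])
```

Here the 2022 facilities row has no emissions and drops out of the inner join, while each daily emissions row picks up the facility metadata for its year.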

### CAMPD data and OSM cooling_tower tag

In [None]:
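# match each CAMPD record to the nearest OSM cooling tower within max_distance;
# distances are in CRS units (degrees if both layers use a geographic CRS)
campd_ndt_gdf = gpd.sjoin_nearest(
    campd_gdf,
    osm_gdf,
    how="inner",
    distance_col="distances",
    max_distance=0.01,
)
campd_ndt_gdf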

In [None]:
campd_ndt_gdf.distances.describe()
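
To interpret the `max_distance=0.01` threshold and the distance summary above, it helps to convert degrees to kilometers. The sketch below assumes both layers are in a geographic CRS such as EPSG:4326 (not confirmed by the notebook) and uses a spherical-Earth approximation; the helper name and the 40° latitude are illustrative choices.

```python
import math

# Approximate km per degree of latitude on a spherical Earth (R = 6371 km)
KM_PER_DEGREE_LAT = 6371.0 * math.pi / 180.0


def degrees_to_km(threshold_deg, latitude_deg):
    """Approximate north-south and east-west extent (km) of a distance in degrees."""
    north_south = threshold_deg * KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# Illustrative latitude of 40 degrees north, roughly mid-latitude USA
ns_km, ew_km = degrees_to_km(0.01, latitude_deg=40.0)
```

Under these assumptions, a 0.01° threshold is roughly a kilometer north-south and somewhat less east-west, so the effective search radius is not uniform across latitudes.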

In [None]:
# number of distinct CAMPD facilities matched to at least one OSM cooling tower
ndt_plants = campd_ndt_gdf.facility_id.nunique()
ndt_plants