## <u>About this kernel</u>

In this kernel, I propose the following two methods. First, I will outline them.

**1. Calculating the Emissions Factor**

- I use NO2_column_number_density from the Sentinel-5P dataset.
- I treat one file (such as s5p_no2_20180701T161259_20180707T175356.tif) as one day's worth of data. If it contains NaN values, I fill them with the mean of that file.
- To calculate the amount of NOx, I sum the density over the area of interest.
- To calculate the total estimated generation in the area of interest, I sum the estimated generation from the "Global Power Plant Database" for the plants in that area.
- To calculate the emissions factor, I divide the total NOx amount by the total estimated generation.

**2. Calculating the Marginal Emissions Factor**

- Create a dataframe of NO2_column_number_density, longitude, latitude, and wind properties.
- Using k-means on that dataframe, divide the NO2_column_number_density data into several regional groups.
- Using method 1 and some additional assumptions, calculate the emissions factor of each power plant. 
  Here, the NOx amount and the estimated power generation are summed only within the group to which the power plant belongs.
- Using these emissions factors, calculate the marginal emissions factor.

I'll explain the details of the methods using NO2 as the example.

## Why pay attention to NO2?

The simplest way to calculate an emissions factor is to divide the actually measured amount of NO2 emitted by the power plant by the amount of power generated per hour. This value seems to be called the "actual emissions factor". (I referred to p. 30 of [1] for this actual emissions factor.)

But since NOx reacts with chemicals in the atmosphere and changes into various forms, the amount measured at the time of generation is not always the same as the amount of NO2 actually present in the environment. 

Therefore, if the NO2 emissions factor can be calculated accurately, it will be a reference for other gases.

## About calculation

I aimed to calculate the Emissions Factor as a mass of NO2 per unit of generated energy (such as g/GWh or T/GWh). This unit seems to be standard in the documents I read.

I used the following formulas.

### Emissions(mol) = NO2_column_number_density(mol/m^2) \* area(m^2)
### Emissions(T) = Emissions(mol) \* 46.0055(g/mol) \* 1e-6
### Emissions Factor(T/GWh) = Emissions(T) / Estimated Generation(GWh)
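As a minimal sketch of the three formulas above, the conversion can be written as two small functions. The function names and the sample numbers are illustrative assumptions, not values from the dataset; the molar mass is the 46.0055 g/mol used in the calculation cells later in this kernel.

```python
NO2_MOLAR_MASS_G_PER_MOL = 46.0055  # same value as in the calculation cells below

def emissions_tonnes(column_density_mol_per_m2, area_m2):
    """Convert a column density over a given area into tonnes of NO2."""
    emissions_mol = column_density_mol_per_m2 * area_m2
    return emissions_mol * NO2_MOLAR_MASS_G_PER_MOL * 1e-6  # g -> T

def emissions_factor(emissions_t, estimated_generation_gwh):
    """Emissions factor in T/GWh."""
    return emissions_t / estimated_generation_gwh

# Hypothetical numbers, for illustration only
example_emissions = emissions_tonnes(column_density_mol_per_m2=5e-5, area_m2=1e9)
example_ef = emissions_factor(example_emissions, estimated_generation_gwh=100.0)
print(example_emissions, example_ef)
```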

## Load libraries

In [None]:
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN
from datetime import datetime, timedelta
import folium
import glob
import math
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import rasterio as rio
import seaborn as sns
from sklearn.cluster import KMeans
import tifffile as tiff 

## Utilities

Here I gather the tools that are required but are not related to the main subject.

In [None]:
#If you want to analyze other area, please change here!
LAT_MAX = 18.563112930177304
LAT_MIN = 17.903121359128956
LON_MAX = -65.19437297127342
LON_MIN = -67.32297404549217

To analyze another place, change the constants above. They represent the latitude and longitude range of the data.

In [None]:
def overlay_image_on_puerto_rico_df(df, img, zoom):
    """
    Show an image on a folium map with markers for the power plants.
    """
    lat_map=df.iloc[[0]].loc[:,["latitude"]].iat[0,0]
    lon_map=df.iloc[[0]].loc[:,["longitude"]].iat[0,0]
    m = folium.Map([lat_map, lon_map], zoom_start=zoom, tiles= 'Stamen Terrain')
    color={ 'Hydro' : 'lightblue', 'Solar' : 'orange', 'Oil' : 'darkblue', 'Coal' : 'black', 'Gas' : 'lightgray', 'Wind' : 'green' }
    folium.raster_layers.ImageOverlay(
        image=img,
        bounds = [[LAT_MAX,LON_MIN,],[LAT_MIN,LON_MAX]],
        #bounds = [[18.56,-67.32,],[17.90,-65.194]],
        colormap=lambda x: (1, 0, 0, x),
    ).add_to(m)
    
    for i in range(0,len(df)):
        popup = folium.Popup(str(df.primary_fuel.iloc[i]))  # show the fuel type when a marker is clicked
        folium.Marker([df["latitude"].iloc[i],df["longitude"].iloc[i]],
                     popup=popup,
                     icon=folium.Icon(icon_color='red',icon ='bolt',prefix='fa',color=color[df.primary_fuel.iloc[i]])).add_to(m)
        
    return m

In [None]:
def split_column_into_new_columns(dataframe,column_to_split,new_column_one,begin_column_one,end_column_one):
    """
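    Copy the characters [begin_column_one:end_column_one] of column_to_split into new_column_one
    (used here to extract latitude and longitude from the '.geo' string).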
    """
    for i in range(0, len(dataframe)):
        dataframe.loc[i, new_column_one] = dataframe.loc[i, column_to_split][begin_column_one:end_column_one]
    return dataframe
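As an alternative to fixed character positions, the coordinates could be parsed as JSON. This is only a sketch under an assumption: that the `.geo` column holds a GeoJSON Point string with coordinates in [longitude, latitude] order, which is consistent with the slice offsets used in this kernel. The helper name and the sample string are hypothetical.

```python
import json

def lon_lat_from_geo(geo_string):
    """Return (longitude, latitude) from a GeoJSON Point string."""
    coordinates = json.loads(geo_string)["coordinates"]
    return float(coordinates[0]), float(coordinates[1])

# Hypothetical example value in the assumed '.geo' format
sample_geo = '{"type":"Point","coordinates":[-66.1045,18.4271]}'
sample_lon, sample_lat = lon_lat_from_geo(sample_geo)
print(sample_lon, sample_lat)
```

Parsing the whole value avoids cutting off a digit when the longitude string is shorter or longer than the slice expects.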

# Data overview

First, we get the power plant data for the area from the "Global Power Plant Database".

To calculate the emissions factor for another place, get the power plant data for the corresponding area and replace it here.

In [None]:
#import Global Power Plant Database data
power_plants = pd.read_csv('/kaggle/input/ds4g-environmental-insights-explorer/eie_data/gppd/gppd_120_pr.csv')

#Make latitude and longitude data easier to use
power_plants = split_column_into_new_columns(power_plants,'.geo','latitude',50,66)
power_plants = split_column_into_new_columns(power_plants,'.geo','longitude',31,48)
power_plants['latitude'] = power_plants['latitude'].astype(float)
a = np.array(power_plants['latitude'].values.tolist()) # 18 instead of 8
power_plants['latitude'] = np.where(a < 10, a+10, a).tolist() 

#sort data by their capacity
power_plants_df = power_plants.sort_values('capacity_mw',ascending=False).reset_index()

In [None]:
power_plants_df

There are 35 power plants in Puerto Rico.

Hydro, solar, and wind are NOx-free power sources. 

Oil, coal, and gas emit NOx. 

About 45% of the power plants in Puerto Rico emit NOx.

In [None]:
#Check NO2_column_number_density of Sentinel-5P Data
image = tiff.imread('/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/s5p_no2_20180701T161259_20180707T175356.tif')
overlay_image_on_puerto_rico_df(power_plants_df,image[:,:,0],8)

##From https://www.kaggle.com/paultimothymooney/explore-image-metadata-s5p-gfs-gldas
##You can check which band corresponds to which property. 
##Here, band 1 is NO2_column_number_density.

The marker colors correspond to fuel types as follows.

- 'Hydro' : 'lightblue' 
- 'Solar' : 'orange'
- 'Oil' : 'darkblue'
- 'Coal' : 'black'
- 'Gas' : 'lightgray' 
- 'Wind' : 'green'

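The overlaid data are NumPy arrays of shape (148, 475). The calculations below use 0.25 as the area of one pixel※.

※This value is derived as follows. The original data covers latitudes 17.9 to 18.6 degrees and longitudes -67.3 to -65.2 degrees, and 1 degree is taken to be about 111 km. Each pixel is therefore about 0.5 km on a side (0.7 × 111 / 148 ≈ 0.53 km and 2.1 × 111 / 475 ≈ 0.49 km), which gives about 0.25 km^2 per pixel. Note that 0.25 km^2 is 2.5 × 10^5 m^2, so amounts computed with the factor 0.25 and a density in mol/m^2 are smaller than the physical amounts by a constant factor of 10^6.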
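The footnote's arithmetic can be reproduced with a few lines. This is a rough sketch that works in kilometres, since 111 km per degree is the input; treating one degree of longitude as 111 km is the approximation used in the text, and the constant names are illustrative.

```python
# Approximate extent and grid shape quoted above
LAT_RANGE_DEG = 18.6 - 17.9
LON_RANGE_DEG = -65.2 - (-67.3)
N_ROWS, N_COLS = 148, 475
KM_PER_DEGREE = 111.0  # rough figure used in the text; assumed equal for both axes

pixel_height_km = LAT_RANGE_DEG * KM_PER_DEGREE / N_ROWS
pixel_width_km = LON_RANGE_DEG * KM_PER_DEGREE / N_COLS
pixel_area_km2 = pixel_height_km * pixel_width_km
print(round(pixel_height_km, 3), round(pixel_width_km, 3), round(pixel_area_km2, 3))
```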

# 1. Calculating the Emissions Factor

Now I calculate the emissions factor for the whole of Puerto Rico.

First, I calculate the total estimated generation, then import and pre-process the NO2_column_number_density data.

## Preparation 

In [None]:
#Calculate total estimated electricity generation(GWh)
quantity_of_electricity_generated = np.sum(power_plants_df['estimated_generation_gwh'])
print('Quantity of Electricity Generated: ', quantity_of_electricity_generated)

In [None]:
#import path of Sentinel-5P Data
s5p_no2_timeseries = glob.glob('../input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/*')

NO2_column_number_density is band 1 of the Sentinel-5P dataset. To calculate the emissions factor for another place, take the data for that region from the Sentinel-5P dataset and replace it here.

If you look at s5p_no2_timeseries, you will find files that start on the same date. They affect the result (values become abnormally large because of the duplication), so I simply use the earliest file for each day.

In [None]:
s5p_no2_timeseries_no_duplication = []
checked_date = []

for data in sorted(s5p_no2_timeseries):
     
    data_date =  datetime.strptime(data[:79], '../input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/s5p_no2_%Y%m%d')
    data_date = data_date.strftime("%Y/%m/%d")
    
    if data_date not in checked_date:
        checked_date.append(data_date)
        s5p_no2_timeseries_no_duplication.append(data)

#Data path without duplicates
s5p_no2_timeseries = s5p_no2_timeseries_no_duplication
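The de-duplication above depends on the exact length of the directory prefix. A variant that reads the start date from the file name itself could look like the sketch below; the helper name and the sample paths are hypothetical and only follow the naming pattern of the dataset.

```python
import os
from datetime import datetime

def first_file_per_start_date(paths):
    """Keep the earliest file for each start date encoded in the file name."""
    kept = {}
    for path in sorted(paths):
        name = os.path.basename(path)            # e.g. s5p_no2_20180701T161259_...tif
        start = datetime.strptime(name.split('_')[2], '%Y%m%dT%H%M%S')
        kept.setdefault(start.date(), path)      # first (earliest) file wins
    return [kept[day] for day in sorted(kept)]

# Hypothetical file names following the dataset's naming pattern
sample_paths = [
    'eie_data/s5p_no2/s5p_no2_20180701T161259_20180707T175356.tif',
    'eie_data/s5p_no2/s5p_no2_20180701T174400_20180707T192000.tif',
    'eie_data/s5p_no2/s5p_no2_20180702T155336_20180708T173000.tif',
]
deduplicated = first_file_per_start_date(sample_paths)
print(deduplicated)
```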

## Calculate monthly emissions factor

Second, I calculate the emissions factor for each month. This gives the emissions factor at monthly resolution.

In [None]:
#Divide the data by month.
data_monthly_divided = {}
for data in s5p_no2_timeseries:
     
    data_date =  datetime.strptime(data[:77], '../input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/s5p_no2_%Y%m')
    data_date = data_date.strftime("%Y/%m")
    
    if not data_date in data_monthly_divided.keys():
        data_monthly_divided[data_date] = []
        
    data_monthly_divided[data_date].append(data)

In [None]:
month = []
emissions = []

for key in sorted(data_monthly_divided.keys()):
    total_emissions = []
    datas = data_monthly_divided[key]
    
    for data in datas:        
        img = tiff.imread(data)[:,:,0] #import data here
        img = np.nan_to_num(img, nan=np.nanmean(img))  #fill nan by average  
        total_emission = np.nansum(img)  #take total NO2 density of the data
        
        total_emissions.append(total_emission)
    
    #take monthly total density.
    month.append(key)
    emissions.append(np.nansum(total_emissions))

The data is imported here.

> img = tiff.imread(data)[:,:,0] #import data here

By changing the imported data, the same code can be used for another place or scale. 
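The per-file step used in the loop above (fill NaN with the mean of the valid pixels, then sum) can be checked on a toy array. The helper name and the values are made up for illustration.

```python
import numpy as np

def total_density(img):
    """Fill NaN pixels with the mean of the valid pixels, then sum all pixels."""
    filled = np.nan_to_num(img, nan=np.nanmean(img))
    return float(np.sum(filled))

# Toy 2x2 "image" with one missing pixel (values are made up)
toy = np.array([[1.0, 2.0], [3.0, np.nan]])
toy_total = total_density(toy)
print(toy_total)
```

Filling with the mean makes the total equal to the mean of the valid pixels times the number of pixels, so files with many missing pixels are scaled up rather than undercounted.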

In [None]:
#calculate amount of NO2.
#amount[T] = density[mol/m^2] * 0.25m^2 * number of whole pixels * 46.0055[g/mol] * 1e-6

emissions = np.array(emissions) * ((0.25 * img.shape[0]*img.shape[1]) * 46.0055 *1e-6)

In [None]:
#Divide emissions by the estimated generation to get the emissions factor
results_monthly = pd.DataFrame({'emission':emissions,
                       'emission factor':emissions/(quantity_of_electricity_generated)},
                    index=month)

In [None]:
results_monthly.head()

In [None]:
fig = plt.figure(figsize=(30, 4))
ax = results_monthly["emission factor"].plot()
plt.title('Monthly Emissions Factor in Puerto Rico')
ax.set(xlabel='YYYY/mm', ylabel='Emission factor [T/GWh]')

## Calculate minimal time span (Appendix)

Sentinel-5P starts a new measurement every day, so with Sentinel-5P data we can calculate values at a finer time resolution.

Third, as an appendix, I calculate the emissions factor for this minimal time span.

In [None]:
#Divide the data by minimal time span.
data_minimal_divided = {}
for data in s5p_no2_timeseries:
     
    data_date =  datetime.strptime(data[:79], '../input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/s5p_no2_%Y%m%d')
    data_date = data_date.strftime("%Y/%m/%d")
    
    if not data_date in data_minimal_divided.keys():
        data_minimal_divided[data_date] = [data]

In [None]:
span = []
emissions = []

for key in sorted(data_minimal_divided.keys()):
    total_emissions = []
    datas = data_minimal_divided[key]
    for data in datas:
        
        img = tiff.imread(data)[:,:,0]
        img = np.nan_to_num(img, nan=np.nanmean(img))
        total_emission = np.nansum(img)
        
        total_emissions.append(total_emission)
    
    span.append(key)
    emissions.append(np.nansum(total_emissions))

In [None]:
#calculate amount of NO2.
#amount[T] = density[mol/m^2] * 0.25m^2 * number of whole pixels * 46.0055[g/mol] * 1e-6
emissions = np.array(emissions) * ((0.25 * img.shape[0]*img.shape[1]) * 46.0055 *1e-6)

In [None]:
results_minimal = pd.DataFrame({'emission':emissions,
                       'emission factor':emissions/(quantity_of_electricity_generated)},
                    index=span)

In [None]:
results_minimal.head()

In [None]:
fig = plt.figure(figsize=(30, 4))
ax = results_minimal["emission factor"].plot()
plt.title('Emissions Factor per Observation in Puerto Rico')
ax.set(xlabel='YYYY/mm/dd', ylabel='Emission factor [T/GWh]')

## Are these estimates correct?

In [None]:
print( "Total NO2 emissions in Puerto Rico is", int(np.sum(emissions) * 12), "T/year")

I think this value is not far from the real value. According to page 21 of [2], total NO2 emissions from factories and business sites in Chiba Prefecture, Japan, are 41944 T/year. Puerto Rico is about 2.5 times larger than Chiba, but Chiba has a very large industrial zone.

# 2. Calculating the Marginal Emissions Factor

## Overview of objectives and methods
There are 35 power plants and 6 fuel types throughout Puerto Rico. The capacity for each fuel type can be summarized in the graph below.

In [None]:
#From https://www.kaggle.com/ajulian/capacity-factor-in-power-plants

total_capacity_mw = power_plants_df['capacity_mw'].sum()
print('Total Installed Capacity: '+'{:.2f}'.format(total_capacity_mw) + ' MW')
capacity = (power_plants_df.groupby(['primary_fuel'])['capacity_mw'].sum()).to_frame()
capacity = capacity.sort_values('capacity_mw',ascending=False)
capacity['percentage_of_total'] = (capacity['capacity_mw']/total_capacity_mw)*100
ax = capacity.sort_values(by='percentage_of_total', ascending=True)['percentage_of_total'].plot(kind='bar',color=['lightblue', 'green', 'orange', 'black','lightgray','darkblue'])
ax.set(ylabel='percentage')


Since the marginal emissions factor is the emissions factor of the power plant that supplies power at a certain time, it is necessary to estimate the NOx emissions of each power plant. The sample data covers the whole of Puerto Rico, but since Puerto Rico is large, it is not reasonable to simply distribute the total emissions in proportion to power generation. In addition, Google Earth Engine can collect data in a fixed shape such as a rectangle, but the distribution of NOx emitted from a power plant changes dynamically depending on factors such as terrain and weather. Therefore, a more flexible method is needed to attribute NOx to the power plant that emitted it. 

I thought it would be a good idea to divide the whole data into geographical groups with a clustering method such as k-means. 

Ideally, I would divide the data into as many groups as there are power plants in Puerto Rico. But this is difficult because some power plants are very close to each other. 

When a group contains several power plants, I split the group's emissions by multiplying them by each plant's capacity_mw rate※ in the area. I think capacity reflects the scale of a power plant's NOx emissions.

※capacity_mw rate = (capacity_mw of specific plant) / (capacity_mw of all plants in the area)
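The capacity_mw rate defined above can be sketched with a pandas group-wise sum. The three plants here are hypothetical; the column names `kmean_group` and `cap_rate_group` mirror the ones used later in this kernel.

```python
import pandas as pd

# Hypothetical plants: two share group 0, one is alone in group 1
plants = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'kmean_group': [0, 0, 1],
    'capacity_mw': [300.0, 100.0, 50.0],
})

# capacity_mw rate = plant capacity / total capacity of its group
group_total = plants.groupby('kmean_group')['capacity_mw'].transform('sum')
plants['cap_rate_group'] = plants['capacity_mw'] / group_total
print(plants)
```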

## Can we divide the data using only NO2 density information?

First, I check the NO2 distribution and whether the data should be divided using not only NO2 density but also other properties.

I roughly classified the NO2 distribution by k-means using only the density itself.

In [None]:
image = tiff.imread('/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/s5p_no2_20180701T161259_20180707T175356.tif')
monotonous = KMeans(n_clusters=3, random_state=6).fit_predict(image[:,:,0].reshape(-1, 1))
overlay_image_on_puerto_rico_df(power_plants_df, monotonous.reshape((148, 475)), 8)

I classified the data into three levels so that the darker the color, the higher the NOx concentration.

As you can see from this figure, the distribution of the NO2 concentration is not easy to interpret.

The gas is not distributed in a circle or a band around a power plant. It looks mottled.

So, to divide the data into groups robustly and accurately, it seems necessary to include not only density but also other effects.

## Divide into groups with wind feature

I tried to divide the data into groups using both NO2 density and wind features.

Since NO2 is carried by the wind and accumulates where there is no wind, I considered that these properties should be included to get a robust and accurate result.

The following are available from the Global Forecast System 384-Hour Predicted Atmosphere Data,

- u_component_of_wind_10m_above_ground

- v_component_of_wind_10m_above_ground

and from GLDAS-2.1: Global Land Data Assimilation System,

- Wind_f_inst

If you want to run the calculation for another region, please replace the data here.

In [None]:
gldas_files = glob.glob('/kaggle/input/ds4g-environmental-insights-explorer/eie_data/gldas/*')
gldas_files = sorted(gldas_files)
gfs_files = glob.glob('/kaggle/input/ds4g-environmental-insights-explorer/eie_data/gfs/*')
gfs_files = sorted(gfs_files)

For simplicity, I focus on the data from 7/1 to 7/7.

In [None]:
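#Group the GLDAS files by day (8 files per day at 3-hour intervals).
#Grouping by day is not strictly necessary, but it is kept for future use.
gldas_files_selected = gldas_files[6:54]
gldas_files_par_day = []
for i in range(0,len(gldas_files_selected),8):
    gldas_files_par_day.append(gldas_files_selected[i:i+8])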

In [None]:
#Group the GFS files by day (4 files per day at 6-hour intervals).
#Grouping by day is not strictly necessary, but it is kept for future use.
gfs_files_selected = gfs_files[3:27]
gfs_files_par_day = []
for i in range(0,len(gfs_files_selected),4):
    gfs_files_par_day.append(gfs_files_selected[i:i+4])

## Create input for clustering model

GLDAS data are taken at 3-hour intervals and GFS data at 6-hour intervals, so we have to align their time spans with the Sentinel-5P OFFL NO2 data.

In [None]:
ave_wind_u = []
ave_wind_v = []
ave_wind_speed = []

#Get data of a day
for i in range(len(gfs_files_par_day)):
    gfs_tmp = gfs_files_par_day[i]
    gldas_tmp = gldas_files_par_day[i]
    array_wind_u = []
    array_wind_v = []
    array_wind_speed = []
    
    #Get datas in the day
    for j in range(len(gfs_tmp)):
        gfs_image_u = tiff.imread(gfs_tmp[j])[:,:,3]
        gfs_image_v = tiff.imread(gfs_tmp[j])[:,:,4]
        gldas_image1 = tiff.imread(gldas_tmp[2*j])[:,:,11]
        gldas_image2 = tiff.imread(gldas_tmp[2*j + 1])[:,:,11]

        #fill na by mean
        gfs_image_u = np.nan_to_num(gfs_image_u, nan=np.nanmean(gfs_image_u))
        gfs_image_v = np.nan_to_num(gfs_image_v, nan=np.nanmean(gfs_image_v))
        gldas_image1 = np.nan_to_num(gldas_image1, nan=np.nanmean(gldas_image1))
        gldas_image2 = np.nan_to_num(gldas_image2, nan=np.nanmean(gldas_image2))
        
        #GLDAS has twice detailed time span than GFS
        gldas_image = (gldas_image1 + gldas_image2)/2
        
        array_wind_u.append(gfs_image_u)
        array_wind_v.append(gfs_image_v)
        array_wind_speed.append(gldas_image)
    
#Per-pixel average over the frames collected for the last day processed in the loop above
ave_wind_u = np.nanmean(np.array(array_wind_u), axis=0)
ave_wind_v = np.nanmean(np.array(array_wind_v), axis=0)
ave_wind_speed = np.nanmean(np.array(array_wind_speed), axis=0)
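The time alignment used in the cell above (two 3-hourly GLDAS frames per 6-hourly GFS frame, then a per-pixel mean) can be checked on toy arrays. The array values and variable names here are stand-ins, not real wind data.

```python
import numpy as np

# Toy stand-ins: 8 three-hourly frames and 4 six-hourly frames for one day
three_hourly = [np.full((2, 2), float(k)) for k in range(8)]
six_hourly = [np.full((2, 2), 10.0 * k) for k in range(4)]

# Average consecutive pairs of 3-hourly frames so both series have 4 steps
paired = [(three_hourly[2 * j] + three_hourly[2 * j + 1]) / 2 for j in range(len(six_hourly))]

# Per-pixel daily means of each series
daily_mean_3h = np.mean(paired, axis=0)
daily_mean_6h = np.mean(six_hourly, axis=0)
print(daily_mean_3h[0, 0], daily_mean_6h[0, 0])
```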

In [None]:
lon = []
lat = []
NO2 = []
wind_u = []
wind_v = []
wind_speed = []

for i in range(image[:,:,0].shape[0]):
    for j in range(image[:,:,0].shape[1]):
        #print(image[:,:,0][i,j])
        NO2.append(image[:,:,0][i,j])
        lat.append(i)  #row index runs along latitude
        lon.append(j)  #column index runs along longitude
        wind_u.append(ave_wind_u.reshape((148, 475))[i,j])
        wind_v.append(ave_wind_v.reshape((148, 475))[i,j])
        wind_speed.append(ave_wind_speed.reshape((148, 475))[i,j])
        
NO2 = np.array(NO2)
lon = np.array(lon)
lat = np.array(lat)
wind_u = np.array(wind_u)
wind_v = np.array(wind_v)
wind_speed = np.array(wind_speed)
        
features_df = pd.DataFrame({
                    'NO2': NO2/max(NO2),
                    'lat': lat/max(lat),
                    'lon': lon/max(lon),
                    'wind_u' : wind_u/(- min(wind_u)),
                    'wind_v' : wind_v/(- min(wind_v)),
                    'wind_speed': wind_speed/max(wind_speed)})
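The nested loops above can also be written without explicit iteration. The following is a sketch on a toy grid, assuming the same row-major pixel order; `no2_grid` stands in for `image[:,:,0]` and the column names `row` and `col` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy NO2 grid standing in for image[:, :, 0] (values are made up)
no2_grid = np.array([[1.0, 2.0, 4.0],
                     [2.0, 4.0, 8.0]])

# Row and column index of every pixel, flattened in the same row-major order
row_idx, col_idx = np.indices(no2_grid.shape)
toy_features = pd.DataFrame({
    'NO2': no2_grid.ravel() / no2_grid.max(),
    'row': row_idx.ravel() / row_idx.max(),
    'col': col_idx.ravel() / col_idx.max(),
})
print(toy_features)
```

Because every feature is scaled to a comparable range, no single column dominates the Euclidean distance that k-means uses.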

## Let's divide into groups

I'll divide the whole NO2 data into several groups by k-means using the data created above. With k-means, we have to decide the number of clusters. To decide this, I check the map of Puerto Rico again.

In [None]:
overlay_image_on_puerto_rico_df(power_plants_df, np.zeros((148, 475)), 8)

Power plants using oil, coal, and gas emit NO2, so I focus on these power plants. From the perspective of geographical features and distance, I decided on the following 7 groups.
  1. Upper left of main island
  2. Upper right of main island
  3. Left of main island
  4. Right of main island
  5. Lower left of main island
  6. Lower right of main island
  7. Vieques island

As you can see, we have to decide the number of clusters manually. If the data covers too large a geographic area, this may be a problem. However, it is not a problem at the scale of a sub-national region.

In [None]:
group_pred = KMeans(n_clusters=7, random_state=3).fit_predict(features_df)
plt.figure()
sns.heatmap(group_pred.reshape((148, 475)))

In [None]:
overlay_image_on_puerto_rico_df(power_plants_df, group_pred.reshape((148, 475)), 8)

Unfortunately, the central mountain area has become one group.😅 However, the grouping seems to reflect the geographic features well. I think that the wind properties implicitly incorporate geographic features into the clustering.

## Calculate emissions factor of each power plant

Next, I calculate what share of the electricity in its group each power plant can generate. I want to estimate each power plant's NO2 emissions by multiplying this share by the total emissions in the group. Since capacity reflects the scale of each power plant, I measure this share by capacity. This is just an assumption, so a better index should be devised.

In [None]:
def which_pixel_and_group(df,pred,img):
    """
    Add information (which pixel and class label) to input DataFrame
    
    Parameters
    ----------
    df : pandas.DataFrame
        This dataframe must have latitude and longitude.
    pred : numpy.array
        Label classified by k-means.
    img : numpy.array
        NO2 density.

    Returns
    -------
    df : pandas.DataFrame
        Information of powerplant added pixel and class label.
    """
    
    lat_pixel = []
    lon_pixel = []
    kmean_groups = []
    
    for i in range(len(df)):
        lat = float(df.iloc[[i]].loc[:,["latitude"]].iat[0,0])
        lon = float(df.iloc[[i]].loc[:,["longitude"]].iat[0,0])

    
        #Fractional position in the image: rows count down from LAT_MAX, columns count up from LON_MIN
        f_lat = (LAT_MAX - lat)*img.shape[0]/(LAT_MAX - LAT_MIN)
        f_lon = (lon - LON_MIN)*img.shape[1]/(LON_MAX - LON_MIN)
        f_lat_int = min(max(int(f_lat), 0), img.shape[0] - 1)  #clamp to a valid row
        f_lon_int = min(max(int(f_lon), 0), img.shape[1] - 1)  #clamp to a valid column
        pixel = f_lat_int * img.shape[1] + f_lon_int  #index into the flattened (row-major) labels
        
        lat_pixel.append(f_lat_int)
        lon_pixel.append(f_lon_int)
        kmean_groups.append(pred[pixel])
 
        
    df["lat_pixel"] = lat_pixel
    df["lon_pixel"] = lon_pixel
    df["kmean_group"] = kmean_groups
    
    return df
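A minimal sketch of the latitude/longitude to pixel mapping is shown below, assuming row 0 lies on the northern edge of the image, column 0 on the western edge, and that the labels are flattened in row-major order. The function name and the 4x4 test grid are hypothetical.

```python
def latlon_to_pixel(lat, lon, n_rows, n_cols, lat_min, lat_max, lon_min, lon_max):
    """Return (row, col, flat_index) of the pixel containing (lat, lon)."""
    row = int((lat_max - lat) / (lat_max - lat_min) * n_rows)  # row 0 is the northern edge
    col = int((lon - lon_min) / (lon_max - lon_min) * n_cols)  # column 0 is the western edge
    row = min(max(row, 0), n_rows - 1)  # clamp points on the boundary
    col = min(max(col, 0), n_cols - 1)
    return row, col, row * n_cols + col

# Hypothetical 4x4 grid covering 0..4 degrees north and 0..4 degrees east
north_west = latlon_to_pixel(3.5, 0.5, 4, 4, 0.0, 4.0, 0.0, 4.0)
south_east = latlon_to_pixel(0.5, 3.5, 4, 4, 0.0, 4.0, 0.0, 4.0)
print(north_west, south_east)
```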

In [None]:
def calc_gwh_in_group(df):
    """
    Add each plant's share of the NO2-emitting capacity in its k-means group to the input DataFrame
    
    Parameters
    ----------
    df : pandas.DataFrame
        This dataframe must have primary_fuel, kmean_group and capacity_mw.

    Returns
    -------
    df : pandas.DataFrame
        Information of powerplant added cap_rate_group.
    """
    
    pplant_gwh_rate_ingroup = []
    
    for i in range(len(df)):
        
        #Exclude power plants that do not emit NO2
        if not df.iloc[[i]].loc[:,["primary_fuel"]].iat[0,0] in ["Oil","Gas","Coal"]:
            pplant_gwh_rate_ingroup.append(0)
            
        else:      
            pplant_cap = df.iloc[[i]].loc[:,["capacity_mw"]].iat[0,0]
            pplant_group = df.iloc[[i]].loc[:,["kmean_group"]].iat[0,0]
        
            #NO2-emitting plants that belong to the same group as this plant
            pplants_emitsno2_group = df[(df["kmean_group"]==pplant_group) & \
                                         df["primary_fuel"].isin(["Oil","Gas","Coal"])]
        
            total_cap_ingroup = sum(pplants_emitsno2_group["capacity_mw"])
            pplant_gwh_rate_ingroup.append(pplant_cap/total_cap_ingroup)
        
    df["cap_rate_group"] = pplant_gwh_rate_ingroup
    
    return df

In [None]:
power_plants_df = which_pixel_and_group(power_plants_df, group_pred, image)
power_plants_df = calc_gwh_in_group(power_plants_df)

In [None]:
power_plants_df.loc[:,["name", "primary_fuel", "kmean_group","cap_rate_group"]].head()

## Calculate the emission factor of each power plant

Last, I calculate the emissions factor of each power plant (own_EF in the following code). To do so, I calculate the total NO2 amount in each group (no2amount_dict_group) and multiply it by "cap_rate_group".

In [None]:
def calc_no2amount_each_group(img, pred):
    """
    Calculate amount of substance of NO2 for each group.
    
    Parameters
    ----------
    img : numpy.array
        NO2 density.
    pred : numpy.array
        Label classified by k-means.

    Returns
    -------
    no2amount : dictionary
        key: group number, value: amount of NO2 (g)
    """
    no2amount = dict()
    
    for i in set(pred):
        no2amount[i] = 0
    
    pred = pred.reshape(img[:,:,0].shape)
    
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            no2amount[pred[i,j]] += img[:,:,0][i,j] * 0.25 * 46.0055 #Convert density to mass (pixel area 0.25, NO2 molar mass 46.0055 g/mol) and add it to the group's total
            
    return no2amount
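The per-group summation can also be done without the double loop. This sketch uses `np.bincount` with weights and returns the raw per-group sums only; multiplying by the pixel area and the molar mass would follow as in the function above. The function name and the toy values are illustrative.

```python
import numpy as np

def no2_sum_per_group(no2_grid, labels):
    """Sum the NO2 values of all pixels that share a cluster label."""
    flat_labels = np.asarray(labels).ravel()
    sums = np.bincount(flat_labels, weights=np.asarray(no2_grid, dtype=float).ravel())
    return {int(group): float(total) for group, total in enumerate(sums)}

# Toy grid and labels (made-up values)
toy_grid = np.array([[1.0, 2.0], [3.0, 4.0]])
toy_labels = np.array([0, 1, 1, 1])
toy_sums = no2_sum_per_group(toy_grid, toy_labels)
print(toy_sums)
```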

In [None]:
def calc_own_ef(df, no2amount):
    """
    Calculate own emission factor of each power plant.
    
    Parameters
    ----------
    df : pandas.DataFrame
        This dataframe must have cap_rate_group, estimated_generation_gwh and kmean_group.
    no2amount : dictionary
        key: group number, value: amount of NO2

    Returns
    -------
    df : pandas.DataFrame
        Information of powerplant added own_EF and own_emission.
    """
    own_efs = []
    own_emission = []
    
    for i in range(len(df)):
        pplant_cap_rate_group = df.iloc[[i]].loc[:,["cap_rate_group"]].iat[0,0]
        pplant_est_gen_gwh = df.iloc[[i]].loc[:,["estimated_generation_gwh"]].iat[0,0]
        pplant_kmean_group = df.iloc[[i]].loc[:,["kmean_group"]].iat[0,0]
        
        #Calculate the emissions factor of each power plant here.
        #I assume that one Sentinel-5P file corresponds to one day of data.
        own_efs.append(no2amount[pplant_kmean_group] * pplant_cap_rate_group / pplant_est_gen_gwh) 
        own_emission.append(no2amount[pplant_kmean_group] * pplant_cap_rate_group) 
        
    df["own_EF"] = own_efs
    df["own_emission"] = own_emission
    
    return df

In [None]:
no2amount_dict_group = calc_no2amount_each_group(image, group_pred)

print("Correspondence of 'group: Total NO2 amount of each group(g)' is as follows:")
no2amount_dict_group #Total NO2 amount of each group

In [None]:
power_plants_df = calc_own_ef(power_plants_df, no2amount_dict_group)
power_plants_df.loc[:,["name", "primary_fuel", 'estimated_generation_gwh','capacity_mw' , "kmean_group", "own_emission", "own_EF"]]

Now we have succeeded in getting the emissions factor of each power plant (own_EF)!

Note that the unit of own_EF is [g/GWh] and that of own_emission is [g].

Also, plants with the same group and the same fuel have the same value. This may be because capacity is used to calculate estimated_generation_gwh. I think this would be solved by using last year's actual power generation instead of capacity.

## Find marginal emissions factor

We now have the emission factor of each power plant. Using these values, I'll calculate the marginal emission factor.

To calculate the marginal emission factor, we need the following information, but we don't have it. 

- Power generation cost of each power plant

- Which regions receive power from which power plants 

So I only calculate the marginal emission factor of the Group 0 area, assuming that the region of Group 0 depends on the power plants within that region.

In [None]:
power_plants_group0_df = power_plants_df[power_plants_df["kmean_group"]==0]
power_plants_group0_df.loc[:,["name", "primary_fuel", "kmean_group", "own_emission", "own_EF"]]

In [None]:
total_capacity_mw_group0 = power_plants_group0_df['capacity_mw'].sum()
print('Total Installed Capacity: '+'{:.2f}'.format(total_capacity_mw_group0) + ' MW')
capacity = (power_plants_group0_df.groupby(['primary_fuel'])['capacity_mw'].sum()).to_frame()
capacity = capacity.sort_values('capacity_mw',ascending=False)
capacity['percentage_of_total'] = (capacity['capacity_mw']/total_capacity_mw_group0)*100
ax = capacity.sort_values(by='percentage_of_total', ascending=True)['percentage_of_total'].plot(kind='bar',color=['lightblue', 'green', 'orange', 'black','lightgray','darkblue'])
ax.set(ylabel='percentage')

In [None]:
capacity

Up to about 9% of capacity, the marginal emission factor is 0.

From 9% to 41.5% of capacity, the marginal emission factor is 0.071325 g/GWh.

Above 41.5% of capacity, the marginal emission factor is 0.089964 g/GWh.
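The step structure above can be sketched as a cumulative capacity stack. Dispatching plants in order of increasing emission factor is an assumption of this sketch (the generation costs are not known here), and the plants and numbers are hypothetical rather than the Group 0 values.

```python
def marginal_ef_steps(plants):
    """
    plants: list of (capacity_mw, emission_factor) tuples.
    Returns (upper_share_percent, emission_factor) steps, assuming plants are
    dispatched in order of increasing emission factor (an assumption of this sketch).
    """
    total = sum(capacity for capacity, _ in plants)
    steps, cumulative = [], 0.0
    for capacity, ef in sorted(plants, key=lambda plant: plant[1]):
        cumulative += capacity
        steps.append((100.0 * cumulative / total, ef))
    return steps

# Hypothetical group: one NOx-free plant and two emitting plants
example_steps = marginal_ef_steps([(10.0, 0.0), (40.0, 0.07), (50.0, 0.09)])
print(example_steps)
```

Each tuple gives the upper bound of a capacity band and the emission factor of the plant that is marginal within that band.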

# What is good about this method?

## Pros and cons

**<u>Pros</u>**

- We can calculate the emissions factor very simply, and the value still seems reasonable. (method 1 and 2)
- By feeding in weather and geographical data together, it is possible to attribute NOx to individual power plants. (method 2)
- Other useful data can be added to the clustering simply by standardizing it and adding it as a feature. (method 2)

**<u>Cons</u>**

- We need to decide the number of clusters ourselves, but the result of the clustering doesn't always reflect our intention.
- When we divide the data into clusters, if a cluster contains no power plants, the NO2 in that area will not be attributed to any power plant.
  (However, this disadvantage can be covered by distributing the NO2 in these areas to all power plants.)
- Data observed by remote sensing includes NOx from sources other than power plants, such as vehicles and business sites. With this method, this extra NOx is mixed into the emissions factor.

## Why does this approach improve the emissions factor?

- We can calculate the NO2 emission factor directly from the actual amount present in the atmosphere. In "Why pay attention to NO2?", I explained that NOx reacts with chemicals in the atmosphere and changes into various forms. Without such measurements, knowing the NO2 density would require separately investigating the concentration of ozone and the like and estimating the concentration reached at equilibrium after the chemical changes.

- To estimate which NO2 comes from which power plant, we can use weather data, geographical data and so on. This time I used only weather data, but if we find more useful data, we can include it simply by feeding it to the clustering model together with the rest. 

## Reference

[1] - [2] are documents by Japanese public institutions. (Sorry, they are in Japanese.)

[1] https://ghg-santeikohyo.env.go.jp/files/calc/cm_ec_R01/full.pdf

[2] https://www.pref.chiba.lg.jp/taiki/shingikai/kankyou-taiki/documents/20110601shiryou04.pdf

[3] - [6] are my kernels published earlier.

[3] https://www.kaggle.com/nayuts/can-we-attribute-emissions-to-power-plants

[4] https://www.kaggle.com/nayuts/focus-on-specific-power-plants

[5] https://www.kaggle.com/nayuts/exploration-of-tiff-bands

[6] https://www.kaggle.com/nayuts/calculate-ef-in-smaller-time-slices