# Introduction

This notebook explores the PRIMAP-hist datasets. Links to the data are available in our **[GitHub repository](https://github.com/OpenGeoScales/ogs-data-exploration/tree/main/data/ghg-emissions/primap-hist)**.

These datasets combine several published datasets and are provided by the Potsdam Institute for Climate Impact Research (PIK).

They provide emission pathways for each country and each Kyoto greenhouse gas (GHG) from 1850 to 2018.

Here are the datasets from the latest version (2.2, February 2021):
- PRIMAP-hist_v2.2_19-Jan-2021.csv contains the numerical extrapolations of all time series to 2018
- PRIMAP-hist_v2.2_no_extrapolation_19-Jan-2021.csv has no numerical extrapolations of missing values and does not include the country groups "EARTH", "ANNEXI", "NONANNEXI", "AOSIS", "BASIC", "EU28", "LDC", "UMBRELLA"

Because the first dataset has been numerically extrapolated to fill missing values, the exploration below focuses on it.

# Library imports and dataset loading

In [3]:
import os

import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
folder = "../../../../PRIMAP-hist_v2.2/"
dataset_name = "PRIMAP-hist_v2.2_19-Jan-2021.csv"
primap_hist_data = pd.read_csv(os.path.join(folder, dataset_name))

In [5]:
primap_hist_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28622 entries, 0 to 28621
Columns: 174 entries, scenario to 2018
dtypes: float64(169), object(5)
memory usage: 38.0+ MB


In [6]:
primap_hist_data.head()

Unnamed: 0,scenario,country,category,entity,unit,1850,1851,1852,1853,1854,...,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,HISTCR,ABW,IPC1A,CH4,Gg,0.000153,0.000158,0.000164,0.000169,0.000174,...,0.0355,0.0385,0.0317,0.0517,0.0684,0.0594,0.0595,0.0539,0.0552,0.0565
1,HISTCR,AFG,IPC1A,CH4,Gg,0.0237,0.0238,0.0239,0.0241,0.0242,...,4.07,4.49,5.22,7.82,11.7,5.42,5.38,7.59,7.96,8.33
2,HISTCR,AGO,IPC1A,CH4,Gg,1.75,1.77,1.8,1.82,1.84,...,57.7,59.1,60.5,61.7,63.0,64.3,65.6,66.9,68.3,69.6
3,HISTCR,AIA,IPC1A,CH4,Gg,1e-05,1e-05,1e-05,1e-05,1e-05,...,0.00205,0.0024,0.00256,0.00256,0.00275,0.00275,0.00276,0.00286,0.00298,0.0031
4,HISTCR,ALB,IPC1A,CH4,Gg,0.0602,0.0606,0.0615,0.0631,0.0652,...,5.14,5.04,4.98,4.87,4.79,4.66,4.89,5.0,5.02,5.04


# Variables and their description

The variable descriptions below are taken from **[the Zenodo record](https://zenodo.org/record/4479172#data-format-description-columns)**.

## Scenario

- HISTCR: in this scenario country-reported data (CRF, BUR, UNFCCC) is prioritized over third-party data (CDIAC, FAO, Andrew, EDGAR, BP).
- HISTTP: in this scenario third-party data (CDIAC, FAO, Andrew, EDGAR, BP) is prioritized over country-reported data (CRF, BUR, UNFCCC).

CRF, BUR, UNFCCC, CDIAC, FAO, Andrew, EDGAR and BP represent the data sources of the PIK dataset.

*How is one data source prioritized over another?*

According to the article, country-reported data (UNFCCC) have the highest priority. When these data are unavailable or do not meet the minimal requirements, country-reported data provided by third-party sources such as research institutions or international organizations are used instead.
I haven't found how the scenarios are built or how the prioritization is done.

*Does prioritization mean that data are taken from one category of sources and not the other?*
*For all the years of a time series, or only some of them?*

I don't know. **Some exploration of the data could help answer these last two questions.**

In [13]:
primap_hist_data.scenario.value_counts()

HISTTP    14311
HISTCR    14311
Name: scenario, dtype: int64

Both scenarios contain the same number of emission pathways (time series): 14311 each.
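The two open questions above could be explored by comparing, for each (country, category, entity) key, the HISTCR and HISTTP time series year by year. The sketch below is only an illustration on a made-up toy table (the `toy` frame, its country codes and its values are assumptions, not PRIMAP-hist data); the same steps should apply to `primap_hist_data`.

```python
import pandas as pd

# Toy stand-in for primap_hist_data (made-up codes and values, for illustration only).
toy = pd.DataFrame({
    "scenario": ["HISTCR", "HISTTP", "HISTCR", "HISTTP"],
    "country":  ["AAA", "AAA", "BBB", "BBB"],
    "category": ["IPC1A"] * 4,
    "entity":   ["CH4"] * 4,
    "unit":     ["Gg"] * 4,
    "1850":     [1.0, 1.0, 2.0, 2.5],
    "1851":     [1.1, 1.1, 2.1, 2.1],
})

key_columns = ["country", "category", "entity"]
year_columns = [c for c in toy.columns if c.isdigit()]

# One row per time series, indexed by its identifying key.
histcr = toy[toy.scenario == "HISTCR"].set_index(key_columns)[year_columns]
histtp = toy[toy.scenario == "HISTTP"].set_index(key_columns)[year_columns]

# Keep only the keys present in both scenarios, in the same order.
histcr, histtp = histcr.align(histtp, join="inner")

# Number of years in which the two scenarios disagree, per time series.
# Caveat: NaN != NaN evaluates to True, so missing years count as differences.
nb_differing_years = (histcr != histtp).sum(axis=1)
print(nb_differing_years)
```

A series with 0 differing years is identical in both scenarios, while a series with only a few differing years would suggest that prioritization is applied year by year and not to the whole series.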

## Country

Each value is either one of 207 ISO 3166 three-letter country codes or one of 8 custom codes for country groups.

In [14]:
primap_hist_data.country.unique()

array(['ABW', 'AFG', 'AGO', 'AIA', 'ALB', 'AND', 'ANNEXI', 'ANT', 'AOSIS',
       'ARE', 'ARG', 'ARM', 'ATG', 'AUS', 'AUT', 'AZE', 'BASIC', 'BDI',
       'BEL', 'BEN', 'BFA', 'BGD', 'BGR', 'BHR', 'BHS', 'BIH', 'BLR',
       'BLZ', 'BOL', 'BRA', 'BRB', 'BRN', 'BTN', 'BWA', 'CAF', 'CAN',
       'CHE', 'CHL', 'CHN', 'CIV', 'CMR', 'COD', 'COG', 'COK', 'COL',
       'COM', 'CPV', 'CRI', 'CUB', 'CYP', 'CZE', 'DEU', 'DJI', 'DMA',
       'DNK', 'DOM', 'DZA', 'EARTH', 'ECU', 'EGY', 'ERI', 'ESP', 'EST',
       'ETH', 'EU28', 'FIN', 'FJI', 'FRA', 'GAB', 'GBR', 'GEO', 'GHA',
       'GIN', 'GMB', 'GNB', 'GNQ', 'GRC', 'GRD', 'GTM', 'GUY', 'HKG',
       'HND', 'HRV', 'HTI', 'HUN', 'IDN', 'IND', 'IRL', 'IRN', 'IRQ',
       'ISL', 'ISR', 'ITA', 'JAM', 'JOR', 'JPN', 'KAZ', 'KEN', 'KGZ',
       'KHM', 'KIR', 'KNA', 'KOR', 'KWT', 'LAO', 'LBN', 'LBR', 'LBY',
       'LCA', 'LDC', 'LIE', 'LKA', 'LSO', 'LTU', 'LUX', 'LVA', 'MAC',
       'MAR', 'MCO', 'MDA', 'MDG', 'MDV', 'MEX', 'MHL', 'MKD', 'MLI',
       'ML

See the **[source](https://zenodo.org/record/4479172#data-format-description-columns)** for more details about the additional custom codes for country groups.

In [15]:
len(primap_hist_data.country.unique())

215
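To separate individual countries from country groups, one option is to filter on the group codes. The sketch below assumes that the 8 custom codes are the 8 group names listed in the introduction for the no-extrapolation file; the `countries` series is a small made-up sample standing in for `primap_hist_data.country`.

```python
import pandas as pd

# Assumption: the 8 custom group codes are those named in the introduction.
COUNTRY_GROUPS = {"EARTH", "ANNEXI", "NONANNEXI", "AOSIS",
                  "BASIC", "EU28", "LDC", "UMBRELLA"}

# Small sample standing in for primap_hist_data.country.
countries = pd.Series(["ABW", "AFG", "ANNEXI", "EARTH", "FRA", "EU28", "FRA"])

is_group = countries.isin(COUNTRY_GROUPS)
group_codes = sorted(countries[is_group].unique())
iso_codes = sorted(countries[~is_group].unique())

print(group_codes)  # codes treated as country groups
print(iso_codes)    # codes treated as individual countries
```

Applied to the full column, the two lists should contain 8 and 207 codes respectively if the assumption holds.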

## Category

IPCC (Intergovernmental Panel on Climate Change) 2006 categories for emissions. Some aggregate sectors have been added to the hierarchy. These begin with the prefix IPCM instead of IPC.

In [16]:
primap_hist_data.category.unique()

array(['IPC1A', 'IPC1B1', 'IPC1B2', 'IPC1B3', 'IPC1B', 'IPC1', 'IPC2B',
       'IPC2C', 'IPC2D', 'IPC2G', 'IPC2H', 'IPC2', 'IPC3A', 'IPC4',
       'IPC5', 'IPCM0EL', 'IPCMAGELV', 'IPCMAG', 'IPC2A'], dtype=object)

See the **[source](https://zenodo.org/record/4479172#data-format-description-columns)** for more details about the category description.
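Since the added aggregate sectors are recognizable by their IPCM prefix, they can be separated from the IPCC 2006 categories with a simple string test. This is a minimal sketch on a subset of the codes shown above; the variable names are mine.

```python
import pandas as pd

# Subset of the category codes returned by primap_hist_data.category.unique().
categories = pd.Series(["IPC1A", "IPC1", "IPC2", "IPCM0EL", "IPCMAGELV", "IPCMAG"])

# Aggregate sectors added to the hierarchy start with "IPCM" instead of "IPC".
is_aggregate = categories.str.startswith("IPCM")

aggregate_categories = categories[is_aggregate].tolist()
ipcc_categories = categories[~is_aggregate].tolist()

print(aggregate_categories)
print(ipcc_categories)
```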

## Entity

Gas categories using global warming potentials (GWP) from either the Second Assessment Report (SAR) or the Fourth Assessment Report (AR4).


In [17]:
primap_hist_data.entity.unique()

array(['CH4', 'CO2', 'FGASESAR4', 'FGASES', 'HFCSAR4', 'HFCS',
       'KYOTOGHGAR4', 'KYOTOGHG', 'N2O', 'NF3', 'PFCSAR4', 'PFCS', 'SF6'],
      dtype=object)

See the **[source](https://zenodo.org/record/4479172#data-format-description-columns)** for more details about the gas description.
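The entity names suggest a naming pattern: gas baskets appear twice, once with an AR4 suffix and once without. The sketch below groups the entities on that basis. It assumes that the unsuffixed basket is the SAR version of the suffixed one, which should be checked against the source description.

```python
# Entity codes as returned by primap_hist_data.entity.unique().
entities = ["CH4", "CO2", "FGASESAR4", "FGASES", "HFCSAR4", "HFCS",
            "KYOTOGHGAR4", "KYOTOGHG", "N2O", "NF3", "PFCSAR4", "PFCS", "SF6"]

suffix = "AR4"

# Baskets expressed with AR4 global warming potentials.
ar4_baskets = [e for e in entities if e.endswith(suffix)]

# Assumption: the same name without the suffix is the SAR counterpart.
sar_baskets = [e[:-len(suffix)] for e in ar4_baskets if e[:-len(suffix)] in entities]

# Everything else is treated as an individual gas.
individual_gases = [e for e in entities
                    if e not in ar4_baskets and e not in sar_baskets]

print(ar4_baskets)
print(sar_baskets)
print(individual_gases)
```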

## Unit

Unit is either Gg or GgCO2eq (CO2-equivalent according to the global warming potential used).
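Because the two units are not comparable, it may be worth checking which unit each entity uses before summing anything. Here is a hedged sketch with a cross-tabulation on a made-up sample; the CH4 rows in Gg match the `head()` output above, while the KYOTOGHG rows in GgCO2eq are an assumption for illustration.

```python
import pandas as pd

# Made-up sample standing in for primap_hist_data[["entity", "unit"]].
toy = pd.DataFrame({
    "entity": ["CH4", "CH4", "KYOTOGHG", "KYOTOGHG"],
    "unit":   ["Gg", "Gg", "GgCO2eq", "GgCO2eq"],  # KYOTOGHG unit is assumed
})

# Number of time series per (entity, unit) pair.
units_per_entity = pd.crosstab(toy["entity"], toy["unit"])

# Entities that appear with more than one unit would need special care.
nb_units = (units_per_entity > 0).sum(axis=1)
mixed_unit_entities = units_per_entity.index[nb_units > 1].tolist()

print(units_per_entity)
print(mixed_unit_entities)
```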

## Remaining columns

One column per year from 1850 to 2018 inclusive, so 169 years are covered (matching the 169 float64 columns reported by `info()`).
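With one column per year, the table is in wide format. For plotting or grouping by year, a long format with a single `year` column is often easier to handle. The sketch below shows one way to reshape it with `melt` on a made-up three-year sample; the names `id_columns`, `long_format` and `emissions` are my own.

```python
import pandas as pd

# Made-up sample with the same layout as primap_hist_data, limited to 3 years.
toy = pd.DataFrame({
    "scenario": ["HISTCR"], "country": ["AAA"], "category": ["IPC1A"],
    "entity": ["CH4"], "unit": ["Gg"],
    "1850": [1.0], "1851": [1.1], "1852": [1.2],
})

id_columns = ["scenario", "country", "category", "entity", "unit"]
year_columns = [c for c in toy.columns if c not in id_columns]

# Wide to long: one row per (time series, year).
long_format = toy.melt(id_vars=id_columns, value_vars=year_columns,
                       var_name="year", value_name="emissions")
long_format["year"] = long_format["year"].astype(int)

# Inclusive count of years covered: last - first + 1.
nb_years = long_format["year"].max() - long_format["year"].min() + 1
print(long_format)
print(nb_years)
```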

# Missing values

This section identifies the variables with missing values in the dataset, counts the missing values and computes their frequency. Further analysis follows if any missing values are found.

In [57]:
def count_missing_values_per_variable(dataset):
    return pd.DataFrame(dataset.isnull().sum(), columns=['nb_missing_values'])

In [58]:
def count_frequency_missing_values_per_variable(recap_table, number_total_values):
    return recap_table.assign(freq_missing_value=lambda df:
                                  round(df.nb_missing_values/number_total_values*100, 1))

In [59]:
def filter_variables_with_missing_values(table):
    return table[table.nb_missing_values > 0]

In [60]:
def create_recap_table_for_no_missing_value():
    # Wrap the message in a list: a dict of scalars alone raises a ValueError.
    return pd.DataFrame({'Results': ['No missing values in the dataset']})

In [61]:
def sort_descending_number_missing_values(recap_table):
    return recap_table.sort_values(by='nb_missing_values', ascending=False)

In [62]:
def identify_variables_with_missing_values(dataset):
    
    recap_table = count_missing_values_per_variable(dataset)
    recap_table = filter_variables_with_missing_values(recap_table)
    nb_variables_with_missing_values = len(recap_table)

    if nb_variables_with_missing_values > 0:
        recap_table = count_frequency_missing_values_per_variable(recap_table, len(dataset))
        recap_table = sort_descending_number_missing_values(recap_table)
    else:
        recap_table = create_recap_table_for_no_missing_value()
    
    return recap_table

In [63]:
identify_variables_with_missing_values(primap_hist_data)

Unnamed: 0,nb_missing_values,freq_missing_value
1850,104,0.4
1956,104,0.4
1949,104,0.4
1950,104,0.4
1951,104,0.4
...,...,...
2006,94,0.3
2007,72,0.3
2008,70,0.2
2009,34,0.1
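A possible next step is to find out which time series carry these missing values, by counting the missing years per row instead of per column. The sketch below uses a made-up sample (the codes and values are assumptions, not PRIMAP-hist data).

```python
import numpy as np
import pandas as pd

# Made-up sample with the same layout as primap_hist_data.
toy = pd.DataFrame({
    "scenario": ["HISTCR", "HISTCR", "HISTTP"],
    "country":  ["AAA", "BBB", "AAA"],
    "category": ["IPC1A", "IPC2", "IPC1A"],
    "entity":   ["CH4", "NF3", "CH4"],
    "unit":     ["Gg"] * 3,
    "1850":     [1.0, np.nan, 1.0],
    "1851":     [1.1, np.nan, 1.1],
    "1852":     [1.2, 0.5, 1.2],
})

year_columns = [c for c in toy.columns if c.isdigit()]

# Number of missing years for each time series (row).
nb_missing_years = toy[year_columns].isnull().sum(axis=1)

# Keep only the incomplete series, with their identifying columns.
incomplete_series = (
    toy.loc[nb_missing_years > 0, ["scenario", "country", "category", "entity"]]
       .assign(nb_missing_years=nb_missing_years)
)

# Total missing years per entity, to see which gases are affected.
missing_by_entity = incomplete_series.groupby("entity")["nb_missing_years"].sum()

print(incomplete_series)
print(missing_by_entity)
```

Grouping by `country` or `category` instead of `entity` would show whether the gaps are concentrated in specific countries or sectors.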
