# Simon UN Data
This notebook attempts to:
- Read in raw data (currently split into multiple files)
- Assemble the raw data into a single .csv file
- Save the combined file to a local directory
- Assess the data quality of the combined .csv
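
The read, assemble and save steps above can be sketched as follows. The real work is done by `load_un_data`, whose implementation is not shown in this notebook, so this is only a guess at the general shape. The helper name `combine_csv_files`, the assumption that every split file is a `.csv` with the same columns, and the reuse of an existing combined file are all hypothetical.

```python
from pathlib import Path

import pandas as pd


def combine_csv_files(split_dir: Path, output_file: Path) -> pd.DataFrame:
    """Hypothetical helper: combine every .csv in split_dir into one file.

    If output_file already exists it is loaded instead of being rebuilt
    (an assumption, not necessarily what load_un_data does).
    """
    output_file = Path(output_file)
    if output_file.exists():
        return pd.read_csv(output_file)

    csv_paths = sorted(Path(split_dir).glob("*.csv"))
    if not csv_paths:
        raise FileNotFoundError(f"No .csv files found in {split_dir}")

    # Read each split file and stack them with a fresh 0..n-1 index
    frames = [pd.read_csv(path) for path in csv_paths]
    combined = pd.concat(frames, ignore_index=True)

    # Create the target directory if needed, then save without the index
    output_file.parent.mkdir(parents=True, exist_ok=True)
    combined.to_csv(output_file, index=False)
    return combined
```

Saving with `index=False` avoids writing the index as an extra unnamed column when the combined file is read back in.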

In [1]:
%load_ext lab_black
%load_ext autoreload
%load_ext watermark

In [2]:
%autoreload 2

In [3]:
%watermark -ntz -p pandas -a Simon-Lee-UK -u -d -t -z

Author: Simon-Lee-UK

Last updated: 2021-04-21 22:01:14BST

pandas: 1.2.1



In [4]:
import sys
from pathlib import Path
import pandas as pd
from pandas_profiling import ProfileReport
from pyprojroot import here

sys.path.append(
    str(here())
)  # add the project root to sys.path so that local modules can be imported

from modules.simon_get_data import load_un_data

split_data_path = (
    here() / "raw_data"
)  # here() returns the root of the repository as a pathlib object
interim_data_path = here() / "data" / "interim"

In [5]:
un_data = load_un_data(split_data_path, interim_data_path)

Loading combined data from /Users/Simon/Documents/Python_Projects/see-you-data/UNInternationalEnergyData/data/interim...


In [6]:
un_data.head(5)

|   | country_or_area | commodity_transaction | year | unit | quantity | quantity_footnotes | category |
|---|---|---|---|---|---|---|---|
| 0 | Austria | Additives and Oxygenates - Exports | 1996 | Metric tons, thousand | 5.0 |  | additives_and_oxygenates |
| 1 | Austria | Additives and Oxygenates - Exports | 1995 | Metric tons, thousand | 17.0 |  | additives_and_oxygenates |
| 2 | Belgium | Additives and Oxygenates - Exports | 2014 | Metric tons, thousand | 0.0 |  | additives_and_oxygenates |
| 3 | Belgium | Additives and Oxygenates - Exports | 2013 | Metric tons, thousand | 0.0 |  | additives_and_oxygenates |
| 4 | Belgium | Additives and Oxygenates - Exports | 2012 | Metric tons, thousand | 35.0 |  | additives_and_oxygenates |


In [7]:
un_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1189471 entries, 0 to 1189470
Data columns (total 7 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   country_or_area        1189471 non-null  object 
 1   commodity_transaction  1189471 non-null  object 
 2   year                   1189471 non-null  int64  
 3   unit                   1189471 non-null  object 
 4   quantity               1189471 non-null  float64
 5   quantity_footnotes     163946 non-null   float64
 6   category               1189471 non-null  object 
dtypes: float64(2), int64(1), object(4)
memory usage: 63.5+ MB
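
The `info()` output above shows that `quantity_footnotes` has only 163,946 non-null values out of 1,189,471 rows (about 13.8% populated), while every other column is complete. A minimal sketch of how the missing fraction per column could be computed is shown below. The helper name `missing_fraction` and the toy values are illustrative only and are not taken from the UN data.

```python
import pandas as pd


def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for un_data (values are illustrative only)
toy = pd.DataFrame(
    {
        "quantity": [5.0, 17.0, 0.0, 35.0],
        "quantity_footnotes": [None, 1.0, None, None],
    }
)

fractions = missing_fraction(toy)
print(fractions)
```

On the real data the same call would be `missing_fraction(un_data)`.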


In [8]:
report = ProfileReport(un_data, title="Pandas Profiling Overview")

In [9]:
report




## Processing Steps
- Delete the 'quantity_footnotes' column
- Understand the relationship between 'category' and 'commodity_transaction'
    - If the relationship can be mapped, create a dictionary summarising the links
    - Pretty-print the dictionary to help find columns relevant to different investigations
- Reshape the long-format DataFrame to a wide format
    - Try to keep the unit values in a column associated with each of the new wide columns
    - Pretty sure I've done something similar to this before
- Add another function that converts the DataFrame in the other direction (wide -> long)
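
A rough sketch of the column deletion and the two reshaping steps listed above, under some stated assumptions. The function names `long_to_wide` and `wide_to_long` are hypothetical. Each `commodity_transaction` is assumed to have a single unit, and each (country, year, transaction) combination is assumed to appear at most once. Units are kept in a separate lookup Series instead of in paired columns, which is a simplification of the plan above, and the `category` column is not carried through the wide format. The toy quantities are illustrative only.

```python
import pandas as pd

ID_COLS = ["country_or_area", "year"]


def long_to_wide(long_df: pd.DataFrame):
    """Pivot to one quantity column per commodity_transaction.

    Returns (wide, units), where units maps each commodity_transaction
    to its unit (assumes one unit per transaction).
    """
    # Drop the footnotes column if it is present
    trimmed = long_df.drop(columns=["quantity_footnotes"], errors="ignore")
    units = trimmed.drop_duplicates("commodity_transaction").set_index(
        "commodity_transaction"
    )["unit"]
    # pivot raises ValueError on duplicate (country, year, transaction) rows
    wide = trimmed.pivot(
        index=ID_COLS, columns="commodity_transaction", values="quantity"
    )
    wide.columns.name = None
    return wide.reset_index(), units


def wide_to_long(wide_df: pd.DataFrame, units: pd.Series) -> pd.DataFrame:
    """Melt the wide frame back to long format and re-attach the units."""
    return (
        wide_df.melt(
            id_vars=ID_COLS, var_name="commodity_transaction", value_name="quantity"
        )
        .dropna(subset=["quantity"])  # combinations absent from the long data
        .assign(unit=lambda d: d["commodity_transaction"].map(units))
        .reset_index(drop=True)
    )


# Toy frame standing in for un_data (quantities are illustrative only)
toy = pd.DataFrame(
    {
        "country_or_area": ["Austria", "Austria", "Belgium", "Belgium"],
        "commodity_transaction": [
            "Additives and Oxygenates - Exports",
            "Additives and Oxygenates - Imports",
            "Additives and Oxygenates - Exports",
            "Additives and Oxygenates - Imports",
        ],
        "year": [1996, 1996, 2014, 2014],
        "unit": ["Metric tons, thousand"] * 4,
        "quantity": [5.0, 2.0, 0.0, 1.0],
        "quantity_footnotes": [None, None, None, 1.0],
        "category": ["additives_and_oxygenates"] * 4,
    }
)

wide, units = long_to_wide(toy)
round_trip = wide_to_long(wide, units)
```

If the real data contains duplicate rows for a (country, year, transaction) combination, `pivot` will fail, which doubles as a data-quality check. `pivot_table` with an explicit `aggfunc` would be the alternative if duplicates need to be aggregated.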

In [10]:
un_data.columns.to_list()

['country_or_area',
 'commodity_transaction',
 'year',
 'unit',
 'quantity',
 'quantity_footnotes',
 'category']

In [11]:
un_data["category"].unique()

array(['additives_and_oxygenates', 'animal_waste', 'anthracite',
       'aviation_gasoline', 'bagasse', 'biodiesel', 'biogases',
       'biogasoline', 'bitumen', 'black_liquor', 'blast_furnace_gas',
       'brown_coal_briquettes', 'brown_coal', 'charcoal', 'coal_tar',
       'coke_oven_coke', 'coking_coal', 'conventional_crude_oil',
       'direct_use_of_geothermal_heat',
       'direct_use_of_solar_thermal_heat',
       'electricity_net_installed_capacity_of_electric_power_plants',
       'ethane', 'falling_water', 'fuel_oil', 'fuelwood', 'gas_coke',
       'gas_oil_diesel_oil', 'gasoline_type_jet_fuel', 'gasworks_gas',
       'geothermal', 'hard_coal', 'heat', 'hydro', 'industrial_waste',
       'kerosene_type_jet_fuel', 'lignite', 'liquified_petroleum_gas',
       'lubricants', 'motor_gasoline', 'municipal_wastes', 'naphtha',
       'natural_gas_including_lng', 'natural_gas_liquids',
       'nuclear_electricity', 'of_which_biodiesel',
       'of_which_biogasoline', 'oil_shale_oil_sa

In [12]:
un_data["commodity_transaction"].unique()

array(['Additives and Oxygenates - Exports',
       'Additives and Oxygenates - Imports',
       'Additives and Oxygenates - Production', ...,
       'White spirit and special boiling point industrial spirits - Transformation',
       'White spirit and special boiling point industrial spirits - Transformation in petrochemical plants',
       'Electricity - total wind production'], dtype=object)

In [13]:
unique_categories = un_data["category"].unique()
unique_transactions = un_data["commodity_transaction"].unique()
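
With both sets of unique values available, the link between 'category' and 'commodity_transaction' could be summarised as a dictionary along these lines. This is a sketch only: the helper names `map_category_to_transactions` and `transactions_in_multiple_categories` are hypothetical, and the toy rows stand in for `un_data`.

```python
from pprint import pprint

import pandas as pd


def map_category_to_transactions(df: pd.DataFrame) -> dict:
    """Return {category: sorted list of its commodity_transaction values}."""
    grouped = df.groupby("category")["commodity_transaction"].unique()
    return {category: sorted(values) for category, values in grouped.items()}


def transactions_in_multiple_categories(df: pd.DataFrame) -> list:
    """List transactions that appear under more than one category."""
    counts = df.groupby("commodity_transaction")["category"].nunique()
    return sorted(counts[counts > 1].index)


# Toy rows standing in for un_data
toy = pd.DataFrame(
    {
        "category": [
            "additives_and_oxygenates",
            "additives_and_oxygenates",
            "animal_waste",
        ],
        "commodity_transaction": [
            "Additives and Oxygenates - Exports",
            "Additives and Oxygenates - Imports",
            "Animal waste - Consumption by households",
        ],
    }
)

mapping = map_category_to_transactions(toy)
pprint(mapping)
shared = transactions_in_multiple_categories(toy)
```

An empty `shared` list would mean every transaction belongs to exactly one category, so the dictionary fully describes the relationship.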

In [14]:
for category in unique_categories:
    print(category)

additives_and_oxygenates
animal_waste
anthracite
aviation_gasoline
bagasse
biodiesel
biogases
biogasoline
bitumen
black_liquor
blast_furnace_gas
brown_coal_briquettes
brown_coal
charcoal
coal_tar
coke_oven_coke
coking_coal
conventional_crude_oil
direct_use_of_geothermal_heat
direct_use_of_solar_thermal_heat
electricity_net_installed_capacity_of_electric_power_plants
ethane
falling_water
fuel_oil
fuelwood
gas_coke
gas_oil_diesel_oil
gasoline_type_jet_fuel
gasworks_gas
geothermal
hard_coal
heat
hydro
industrial_waste
kerosene_type_jet_fuel
lignite
liquified_petroleum_gas
lubricants
motor_gasoline
municipal_wastes
naphtha
natural_gas_including_lng
natural_gas_liquids
nuclear_electricity
of_which_biodiesel
of_which_biogasoline
oil_shale_oil_sands
other_bituminous_coal
other_coal_products
other_hydrocarbons
other_kerosene
other_liquid_biofuels
other_oil_products_n_e_c
other_recovered_gases
other_vegetal_material_and_residues
paraffin_waxes
patent_fuel
peat
peat_products
petroleum_coke
refin

In [15]:
for transaction in unique_transactions:
    print(transaction)

Additives and Oxygenates - Exports
Additives and Oxygenates - Imports
Additives and Oxygenates - Production
Additives and Oxygenates - Receipts from other sources
Additives and Oxygenates - Stock changes
Additives and Oxygenates - Total energy supply
Additives and Oxygenates - transfers and recycled products
Additives and Oxygenates - Transformation
Additives and Oxygenates - Transformation in oil refineries
Animal waste - Consumption by commerce and public services
Animal waste - Consumption by food and tobacco 
Animal waste - Consumption by households
Animal waste - Consumption by manufacturing, construction and non-fuel industry
Animal waste - Consumption by non-metallic minerals 
Animal waste - Consumption by other
Animal waste - Consumption by other manuf., const. and non-fuel min. ind.
Animal waste - Consumption in agriculture, forestry and fishing
Animal waste - Consumption not elsewhere specified (industry)
Animal waste - Consumption not elsewhere specified (other)
Animal waste