# Food Explorer
Produced using garden-level FAOstat datasets. 

So far the following datasets have been processed:

- [x] QCL
- [x] FBSC (FBS, FBSH)


We process both datasets in parallel, until the _Final Processing_ section, where we actually merge the datasets.

## 0. Parameters

In [1]:
dest_dir = "/tmp/food_explorer"

## 1. Imports & paths
Import the required libraries and define paths to load files (including data files and standardisation mappings for item and element names).

In [2]:
from pathlib import Path
import pandas as pd
import numpy as np
from owid import catalog
from etl.paths import BASE_DIR as base_path

In [3]:
PATH_DATASET_QCL = base_path / "data/garden/faostat/2021-03-18/faostat_qcl"
PATH_DATASET_FBSC = base_path / "data/garden/faostat/2021-04-09/faostat_fbsc"
PATH_MAP_ITEM = (
    base_path / "etl/steps/data/garden/explorers/2021/food_explorer.items.std.csv"
)
PATH_MAP_ELEM = (
    base_path / "etl/steps/data/garden/explorers/2021/food_explorer.elements.std.csv"
)

## 2. Load garden dataset
In this step we load the required datasets from Garden: QCL and FBSC (FBS + FBSH).

In [4]:
qcl_garden = catalog.Dataset(PATH_DATASET_QCL)
fbsc_garden = catalog.Dataset(PATH_DATASET_FBSC)

We obtain table `bulk` from the dataset, which contains the data itself.

In [19]:
# Bulk data and items metadata
qcl_bulk = qcl_garden["bulk"]
fbsc_bulk = fbsc_garden["bulk"]

In the following step we discard the column `Variable Name`, which is useful for clarity but not needed in this process. We also reset the index, as this is required by the operations that follow.

In [21]:
# QCL
qcl_bulk = qcl_bulk.reset_index()
qcl_bulk = qcl_bulk.drop(columns=["Variable Name"])
# FBSC
fbsc_bulk = fbsc_bulk.reset_index()
fbsc_bulk = fbsc_bulk.drop(columns=["Variable Name"])

Brief overview of the data.

In [22]:
# QCL
print(qcl_bulk.shape)
qcl_bulk.head()

(2796737, 6)


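|   | Country | Item Code | Element Code | Year | Flag | Value |
|---|---------|-----------|--------------|------|------|-------|
| 0 | Armenia | 221 | 5312 | 1992 | M | NaN |
| 1 | Armenia | 221 | 5312 | 1993 | M | NaN |
| 2 | Armenia | 221 | 5312 | 1994 | M | NaN |
| 3 | Armenia | 221 | 5312 | 1995 | M | NaN |
| 4 | Armenia | 221 | 5312 | 1996 | M | NaN |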


In [23]:
# FBSC
print(fbsc_bulk.shape)
fbsc_bulk.head()

(10319823, 6)


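|   | Country | Item Code | Element Code | Year | Flag | Value |
|---|---------|-----------|--------------|------|------|-------|
| 0 | Armenia | 2901 | 664 | 2014 | Fc | 3069.0 |
| 1 | Armenia | 2901 | 664 | 2015 | Fc | 3090.0 |
| 2 | Armenia | 2901 | 664 | 2016 | Fc | 3051.0 |
| 3 | Armenia | 2901 | 664 | 2017 | Fc | 3072.0 |
| 4 | Armenia | 2901 | 664 | 2018 | Fc | 2997.0 |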


## 3. Select Flags
There are cases where we have more than just one entry for a `Country`, `Item Code`, `Element Code` and `Year`. This is due to the fact that there are multiple ways of reporting the data. All these different methodologies are identified by the field `Flag`, which tells us how a data point was obtained (see table below). This is given by FAOstat.

|Flag   |Description                                                                        |
|-------|-----------------------------------------------------------------------------------|
|`*`      |       Unofficial figure                                                           |
|`NaN`    | Official data                                                                     |
|`A`      |       Aggregate; may include official; semi-official; estimated or calculated data|
|`F`      |       FAO estimate                                                                |
|`Fc`     |      Calculated data                                                              |
|`Im`     |      FAO data based on imputation methodology                                     |
|`M`      |       Data not available                                                          |
|`S`      |       Standardised                                                                |
|`SD`     |       Statistical Discrepancy                                                     |
|`R`      |       Estimated data using trading partners database                              |


The following cell examines how many datapoints would be removed if we did _flag-prioritisation_. As per the output, we would eliminate 30,688 rows from QCL (~1.1% of the data) and 73,393 rows from FBSC (~0.7% of the data).
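
As a minimal sketch of what this count means, the toy frame below (made-up values, not FAOstat data) contains two rows that share the same `Country`, `Item Code`, `Element Code` and `Year` but carry different flags. The names `toy` and `KEYS` are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical values): the first two rows share the same key
# and differ only in Flag and Value.
toy = pd.DataFrame(
    {
        "Country": ["Afghanistan", "Afghanistan", "Albania"],
        "Item Code": [515, 515, 515],
        "Element Code": [5510, 5510, 5510],
        "Year": [1993, 1993, 1993],
        "Flag": ["F", "A", "F"],
        "Value": [100.0, 120.0, 80.0],
    }
)
KEYS = ["Country", "Item Code", "Element Code", "Year"]

# Rows that repeat an already-seen key are the ones flag prioritisation removes.
n_removed = int(toy.duplicated(subset=KEYS).sum())
share_removed = n_removed / len(toy)
print(f"{n_removed} of {len(toy)} rows would be removed ({share_removed:.0%})")
```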

In [24]:
def check_flags_1(df):
    i_og = df.index.tolist()
    i_ne = df.drop_duplicates(
        subset=["Country", "Item Code", "Element Code", "Year"]
    ).index.tolist()
    print(
        f"Number of datapoints: {len(i_og)}\nNumber of datapoints (after dropping duplicates): {len(i_ne)}\nTotal datapoints removed: {len(i_og)-len(i_ne)}"
    )
    check_flags_2(df, i_og, i_ne)


def check_flags_2(df, i_og, i_ne):
    """Prints `[number of datapoints eliminated], True`"""
    df = df.set_index(["Country", "Item Code", "Element Code", "Year"])
    dups = df.index.duplicated()
    print(f"{dups.sum()}, {len(i_ne) == len(i_og)-dups.sum()}")
    # dups = qcl_bulk.index.duplicated(keep=False)
    df = df.reset_index()


check_flags_1(qcl_bulk)
print()
check_flags_1(fbsc_bulk)

Number of datapoints: 2796737
Number of datapoints (after dropping duplicates): 2766049
Total datapoints removed: 30688
30688, True

Number of datapoints: 10319823
Number of datapoints (after dropping duplicates): 10246430
Total datapoints removed: 73393
73393, True


### Flag prioritisation

In this step we define a flag prioritisation rank, which allows us to discard duplicate entries based on which flag we "prefer". We do this by assigning a weight to each datapoint based on its `Flag` value (the higher the weight, the higher the priority). On top of flag prioritisation, we always prefer non-`NaN` values regardless of their associated `Flag` value: datapoints whose `Value` is `NaN` get weight -1, which is below every flag weight. The weighting was shared and discussed with authors.

The weight is added to the dataframe as a new column `Flag_priority`.

#### Example 1

    Country, Year, Product, Value, Flag 
    Afghanistan, 1993, Apple, 100, F
    Afghanistan, 1993, Apple, 120, A

We would choose the first row, as flag `F` has a higher priority than flag `A`.

#### Example 2

    Country, Year, Product, Value, Flag 
    Afghanistan, 1993, Apple, NaN, F
    Afghanistan, 1993, Apple, 120, A

We would choose the second row, as the value in the first row is `NaN`.
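
The two examples can be reproduced with a small sketch. It uses only the `F` and `A` weights and a hypothetical helper `pick_preferred` that follows the same sort-then-drop-duplicates idea; it is an illustration on toy data, not the function applied to the real datasets.

```python
import numpy as np
import pandas as pd

# Subset of the flag weights, enough for the two examples.
TOY_PRIORITIES = {"A": 70, "F": 90}
KEYS = ["Country", "Year", "Product"]


def pick_preferred(df):
    """Keep, per key, the row with the highest weight (rows with NaN values lose)."""
    df = df.copy()
    df["Flag_priority"] = df["Flag"].map(TOY_PRIORITIES)
    # NaN values always get the lowest weight, regardless of their flag
    df.loc[df["Value"].isna(), "Flag_priority"] = -1
    df = df.sort_values("Flag_priority").drop_duplicates(subset=KEYS, keep="last")
    return df.drop(columns="Flag_priority")


example_1 = pd.DataFrame(
    {
        "Country": ["Afghanistan", "Afghanistan"],
        "Year": [1993, 1993],
        "Product": ["Apple", "Apple"],
        "Value": [100.0, 120.0],
        "Flag": ["F", "A"],
    }
)
example_2 = example_1.assign(Value=[np.nan, 120.0])

print(pick_preferred(example_1))  # keeps the row with flag F
print(pick_preferred(example_2))  # keeps the row with flag A (the F row is NaN)
```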


In the following cell we filter rows based on `FLAG_PRIORITIES`.

In [25]:
# Create flag priority (add to df) More info at https://www.fao.org/faostat/en/#definitions
FLAG_PRIORITIES = {
    "M": 0,  # Data not available
    "SD": 10,  # Statistical Discrepancy
    "*": 20,  # Unofficial figure
    "R": 30,  # Estimated data using trading partners database
    "Fc": 40,  # Calculated data
    "S": 60,  # Standardized data
    "A": 70,  # Aggregate; may include official; semi-official; estimated or calculated data
    "Im": 80,  # FAO data based on imputation methodology
    "F": 90,  # FAO estimate
    np.nan: 100,  # Official data
}


def filter_by_flag_priority(df):
    # Add Flag priority column
    df.loc[:, "Flag_priority"] = df.Flag.replace(FLAG_PRIORITIES).tolist()
    df.loc[df.Value.isna(), "Flag_priority"] = -1
    # Remove duplicates based on Flag value
    df = df.sort_values("Flag_priority")
    df = df.drop_duplicates(
        subset=["Country", "Item Code", "Element Code", "Year"], keep="last"
    )
    return df.drop(columns=["Flag_priority", "Flag"])

In [26]:
# QCL
qcl_bulk = filter_by_flag_priority(qcl_bulk)
print(qcl_bulk.shape)

(2766049, 5)


In [27]:
# FBSC
fbsc_bulk = filter_by_flag_priority(fbsc_bulk)
print(fbsc_bulk.shape)

(10246430, 5)


## 4. Element Overview
This serves as an initial check on the meaning of `Element Code` values. In particular, we note that each `Element Code` value corresponds to a unique pair of _element name_ and _element unit_. Note, for instance, that the _element name_ "Production" can come in different flavours (i.e. units): "Production -- tonnes" and "Production -- 1000 No".

Based on the number of occurrences of each element code, we may want to keep only those that rank high.

**Note: This step uses file `PATH_MAP_ELEM`, which is a file that was generated using the code in a later cell.**
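
To make the code/name/unit relationship concrete, here is a small sketch on a hand-written excerpt. The three codes, names and units are copied from the QCL output in this notebook; the frame `elem` itself is an illustrative stand-in for the real mapping file.

```python
import pandas as pd

# Hand-written excerpt: one row per element code.
elem = pd.DataFrame(
    {
        "code": [5510, 5513, 5312],
        "name": ["Production", "Production", "Area harvested"],
        "unit": ["tonnes", "1000 No", "ha"],
    }
).set_index("code")

# Each code identifies exactly one (name, unit) pair...
assert elem.index.is_unique
labels = elem["name"] + " -- " + elem["unit"]

# ...but the same name may come with several units (and hence several codes).
units_per_name = elem.groupby("name")["unit"].nunique()
print(labels.to_dict())
print(units_per_name)
```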

In [28]:
# Where does each element appear?
def get_stats_elements(df):
    res = df.reset_index().groupby("Element Code")["Item Code"].nunique()
    df_elem = pd.read_csv(PATH_MAP_ELEM, index_col="code")
    elem_map = (
        df_elem["name"] + " -- " + df_elem["unit"] + " -- " + df_elem.index.astype(str)
    )
    res = res.rename(index=elem_map.to_dict()).sort_values(ascending=False)
    return res

In [29]:
# QCL
get_stats_elements(qcl_bulk)

Element Code
Production -- tonnes -- 5510                          281
Area harvested -- ha -- 5312                          172
Yield -- hg/ha -- 5419                                171
Producing Animals/Slaughtered -- Head -- 5320          31
Yield/Carcass Weight -- hg/An -- 5417                  14
Stocks -- Head -- 5111                                 12
Yield -- hg/An -- 5420                                 10
Producing Animals/Slaughtered -- 1000 Head -- 5321      8
Yield/Carcass Weight -- 0.1g/An -- 5424                 8
Stocks -- 1000 Head -- 5112                             7
Laying -- 1000 Head -- 5313                             3
Yield -- 100mg/An -- 5410                               3
Yield -- hg -- 5422                                     2
Production -- 1000 No -- 5513                           2
Stocks -- No -- 5114                                    1
Name: Item Code, dtype: int64

In [30]:
# FBSC
get_stats_elements(fbsc_bulk)

Element Code
Protein supply quantity (g/capita/day) -- g/capita/day -- 674    123
Fat supply quantity (g/capita/day) -- g/capita/day -- 684        123
Food supply (kcal/capita/day) -- kcal/capita/day -- 664          123
Food supply quantity (kg/capita/yr) -- kg -- 645                 121
Food -- 1000 tonnes -- 5142                                      121
Domestic supply quantity -- 1000 tonnes -- 5301                  121
Other uses (non-food) -- 1000 tonnes -- 5154                     121
Export Quantity -- 1000 tonnes -- 5911                           121
Stock Variation -- 1000 tonnes -- 5072                           120
Production -- 1000 tonnes -- 5511                                120
Import Quantity -- 1000 tonnes -- 5611                           120
Residuals -- 1000 tonnes -- 5170                                 105
Tourist consumption -- 1000 tonnes -- 5171                       104
Losses -- 1000 tonnes -- 5123                                    103
Feed -- 1000 tonnes -

## 5. Reshape dataset
This step is simple and brief. It attempts to pivot the dataset in order to have three identifying columns (i.e. "keys") and several "value" columns based on the `Element Code` and `Value` columns.

This format is more Grapher/Explorer friendly, as it clearly divides the dataset columns into: Entities, Year, [Values].
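
A minimal sketch of the long-to-wide reshape on toy data (made-up values). Note that `DataFrame.pivot` raises an error if the same (index, column) pair appears twice, which is one reason duplicates are resolved by flag priority before this step.

```python
import pandas as pd

# Long format: one row per (Country, Item Code, Year, Element Code).
long_df = pd.DataFrame(
    {
        "Country": ["Armenia"] * 4,
        "Item Code": [221] * 4,
        "Year": [1992, 1992, 1993, 1993],
        "Element Code": [5312, 5510, 5312, 5510],
        "Value": [10.0, 20.0, 11.0, 22.0],
    }
)

# Wide format: one row per (Country, Item Code, Year), one column per element code.
wide_df = long_df.pivot(
    index=["Country", "Item Code", "Year"], columns="Element Code", values="Value"
)
print(wide_df)
```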

In [31]:
def reshape_df(df):
    df = df.reset_index()
    df = df.pivot(
        index=["Country", "Item Code", "Year"], columns="Element Code", values="Value"
    )
    return df

In [32]:
# QCL
qcl_bulk = reshape_df(qcl_bulk)
# FBSC
fbsc_bulk = reshape_df(fbsc_bulk)

In [39]:
print("QCL:", qcl_bulk.shape)
print("FBSC:", fbsc_bulk.shape)

QCL: (1214535, 15)
FBSC: (1083147, 17)


## 6. Standardise Element and Item names (OPTIONAL)
In the following cells we obtain tables with the code, current name and number of occurrences of all the Items and Elements present in our dataset.

Based on these tables, Hannah (or another researcher) will review them and:
- Select those Items and Elements that we are interested in.
- Standardise naming proposals of Items and Elements.

Notes:
- We obtain the number of occurrences as this can assist the researcher in prioritising Items or Elements. 

### Elements
Here we obtain a table with the current namings for Elements (plus other variables). Note that we also propagate the unit names, as these may also be standardised (or even changed).

In [40]:
# Load table from dataset containing Element information
qcl_elem = qcl_garden["meta_element"]
fbsc_elem = fbsc_garden["meta_element"]

In [41]:
def get_elements_to_standardize(df, df_elem):
    # Obtain number of occurrences for each Element Code (each column is an element)
    elements = pd.DataFrame(df.notna().sum()).reset_index()
    elements = elements.sort_values(0, ascending=False)
    # Add names and unit info to the table
    elements = elements.merge(
        df_elem[["Element", "Unit", "Unit Description"]],
        left_on="Element Code",
        right_index=True,
    )
    # Rename column names
    elements = elements.rename(
        columns={
            "Element Code": "code",
            0: "number_occurrences",
            "Element": "name",
            "Unit": "unit",
            "Unit Description": "unit_description",
        }
    )[["code", "name", "unit", "unit_description", "number_occurrences"]]
    return elements

In [42]:
elements_qcl = get_elements_to_standardize(qcl_bulk, qcl_elem).assign(dataset="QCL")
elements_fbsc = get_elements_to_standardize(fbsc_bulk, fbsc_elem).assign(dataset="FBSC")

assert elements_qcl.merge(elements_fbsc, on="code").empty

Once the table is obtained, we take a look at it and export it. Note that we use a filename starting with `ign.`, as these files are not git-tracked.

In [43]:
elements = pd.concat([elements_qcl, elements_fbsc])
elements.head()

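|    | code | name | unit | unit_description | number_occurrences | dataset |
|----|------|------|------|------------------|--------------------|---------|
| 13 | 5510 | Production | tonnes | tonnes | 996973 | QCL |
| 3 | 5312 | Area harvested | ha | hectares | 539828 | QCL |
| 9 | 5419 | Yield | hg/ha | hectograms per hectare | 534847 | QCL |
| 5 | 5320 | Producing Animals/Slaughtered | Head | head | 149439 | QCL |
| 0 | 5111 | Stocks | Head | head | 86112 | QCL |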


In [44]:
# elements.to_csv("ign.food.elements.csv", index=False)

### Items
Here we obtain a table with the current namings for Items (plus other variables).

In [45]:
# Load table from dataset containing Item information
qcl_item = qcl_garden["meta_item"]
fbsc_item = fbsc_garden["meta_item"]

As the following cell shows, this table comes with a multi-index, as codes may actually be referring to "Item Groups" or "Items".

In [46]:
qcl_item.head()

| Item Group Code | Item Code | Item Group | Item |
|-----------------|-----------|------------|------|
| 1806 | 947 | Beef and Buffalo Meat | Meat, buffalo |
| 1806 | 867 | Beef and Buffalo Meat | Meat, cattle |
| 1811 | 983 | Butter and Ghee | Butter and ghee, sheep milk |
| 1811 | 952 | Butter and Ghee | Butter, buffalo milk |
| 1811 | 886 | Butter and Ghee | Butter, cow milk |


Therefore, in the next cell we attempt to flatten code to name mappings.

To this end:
- We first create two separate dictionaries, mapping `Item Group Code --> Item Group` and `Item Code --> Item`, respectively.
- We note, however, that some codes appear both as "Items" and as "Item Groups". This might be because there is more than one level of items. That is, an Item can "belong" to an Item Group, which in turn belongs to a higher-level Item Group. Therefore, we remove these codes from the item dictionary so that they only appear in the item group dictionary.
- Next, we create a table with all items, their occurrences, whether they are Item Groups, and their FAO original namings.
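
The steps above can be sketched with plain dictionaries. The mappings below are hand-written for illustration; in particular, treating code `"1806"` as both an item and an item group is an assumption made for the example, not something read from the FAO metadata.

```python
# Hypothetical code -> name mappings; "1806" appears in both of them.
map_item_g = {"1806": "Beef and Buffalo Meat", "1811": "Butter and Ghee"}
map_item = {
    "867": "Meat, cattle",
    "947": "Meat, buffalo",
    "1806": "Beef and Buffalo Meat",
}

# Codes that are also groups are removed from the item mapping...
map_item = {code: name for code, name in map_item.items() if code not in map_item_g}

# ...so the flattened mapping has exactly one entry per code.
map_item_all = {**map_item, **map_item_g}
is_group = {code: code in map_item_g for code in map_item_all}

print(map_item_all)
print(is_group)
```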

In [47]:
def get_items_to_standardize(df, df_item):
    # Group
    map_item_g = dict(
        zip(
            df_item.index.get_level_values("Item Group Code").astype(str),
            df_item["Item Group"],
        )
    )
    # Item
    map_item = dict(
        zip(df_item.index.get_level_values("Item Code").astype(str), df_item["Item"])
    )

    # Correct
    map_item = {k: v for k, v in map_item.items() if k not in map_item_g}

    # Load item occurrences
    items = (
        pd.DataFrame(df.reset_index()["Item Code"].value_counts())
        .reset_index()
        .astype(str)
        .rename(
            columns={
                "index": "code",
                "Item Code": "number_occurences",
            }
        )
    )
    # Add flag for groups
    items["type"] = (
        items["code"].isin(map_item_g).apply(lambda x: "Group" if x else None)
    )
    # Add name
    map_item_all = {**map_item, **map_item_g}
    items["name"] = items.code.replace(map_item_all)
    # Order columns
    items = items[["code", "name", "type", "number_occurences"]]
    return items

In [48]:
items_qcl = get_items_to_standardize(qcl_bulk, qcl_item).assign(dataset="QCL")
items_fbsc = get_items_to_standardize(fbsc_bulk, fbsc_item).assign(dataset="FBSC")
items = pd.concat([items_qcl, items_fbsc])

Once the table is obtained, we take a look at it and export it. Note that we use a filename starting with `ign.`, as these files are not git-tracked.

In [49]:
items.head()

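|   | code | name | type | number_occurences | dataset |
|---|------|------|------|-------------------|---------|
| 0 | 1765 | Meat, Total | Group | 11055 | QCL |
| 1 | 1738 | Fruit Primary | Group | 10909 | QCL |
| 2 | 1057 | Chickens | None | 10893 | QCL |
| 3 | 1808 | Meat, Poultry | Group | 10883 | QCL |
| 4 | 1058 | Meat, chicken | None | 10883 | QCL |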


In [50]:
# items.to_csv("ign.food.items.csv", index=False)

## 7. Renaming Items and Elements
After the previous step, where we shared files `ign.food.items.csv` and `ign.food.elements.csv` with a researcher, they will review them and add the standardisation namings for all items and elements that we intend to use. Note that if no standardised name is provided, the item or element will be discarded.

Their proposals come in two files: `food_explorer.items.std.csv` and `food_explorer.elements.std.csv`. Note that we prefer working with the mapping `"item/element code" ---> "new standardised item/element name"`.

### Element

First of all, we load the standardisation table and remove NaN values (these belong to to-be-discarded elements).

In [51]:
# Get standardised values
df = pd.read_csv(PATH_MAP_ELEM, index_col="code")
df = df.dropna(subset=["name_standardised"])

If we display the content of the standardisation element file we observe that:
- Only some elements are preserved.
- The columns `unit_name_standardised_with_conversion` and `unit_factor` provide the new unit and the factor to convert the old unit into the new one.
- Multiple codes are assigned to the same `name_standardised` and `unit_name_standardised_with_conversion`, which means that we will have to merge them. In particular, element "Yield" with unit "kg/animal" appears with four different codes!

In [52]:
# Show
df

| code | name | unit | unit_description | number_occurrences | Dataset | name_standardised | unit_name_standardised_with_conversion | unit_factor |
|------|------|------|------------------|--------------------|---------|-------------------|----------------------------------------|-------------|
| 5312 | Area harvested | ha | hectares | 539828 | QCL | Area harvested | ha | 1.0 |
| 5301 | Domestic supply quantity | 1000 tonnes | thousand tonnes | 1043347 | FBSC | Domestic supply | 1000_tonnes | 1.0 |
| 5911 | Export Quantity | 1000 tonnes | thousand tonnes | 842139 | FBSC | Exports | 1000_tonnes | 1.0 |
| 684 | Fat supply quantity (g/capita/day) | g/capita/day | grams per capita per day | 866317 | FBSC | Food available for consumption | fat_g_per_day_per_capita | 1.0 |
| 5521 | Feed | 1000 tonnes | thousand tonnes | 219816 | FBSC | Feed | 1000_tonnes | 1.0 |
| 5142 | Food | 1000 tonnes | thousand tonnes | 966295 | FBSC | Food | 1000_tonnes | 1.0 |
| 664 | Food supply (kcal/capita/day) | kcal/capita/day | kilocalorie per capita per day | 1005702 | FBSC | Food available for consumption | kcal_per_day_per_capita | 1.0 |
| 645 | Food supply quantity (kg/capita/yr) | kg | kilograms | 966185 | FBSC | Food available for consumption | kg | 1.0 |
| 5611 | Import Quantity | 1000 tonnes | thousand tonnes | 1008063 | FBSC | Imports | 1000_tonnes | 1.0 |
| 5123 | Losses | 1000 tonnes | thousand tonnes | 339465 | FBSC | Waste | 1000_tonnes | 1.0 |


We keep only the columns of the data tables that belong to the "elements of interest" (those with a standardised name).

In [53]:
# Filter elements of interest
qcl_bulk = qcl_bulk[[col for col in df.index if col in qcl_bulk.columns]]
fbsc_bulk = fbsc_bulk[[col for col in df.index if col in fbsc_bulk.columns]]

We modify the values of some elements, based on the new units and `unit_factor` values.
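
A minimal sketch of this conversion on toy data: multiplying a wide table by a Series of factors aligns the Series index with the table columns, so each element column is scaled by its own factor. The factor `0.0001` for code 5419 (hg/ha to tonnes/ha) is an assumption made for this example; the real factors come from the standardisation file.

```python
import pandas as pd

# Toy wide table: one column per element code (made-up values).
wide_df = pd.DataFrame({5312: [100.0, 200.0], 5419: [10000.0, 20000.0]})

# Hypothetical factors indexed by element code (5510 is not in the table).
unit_factor = pd.Series({5312: 1.0, 5419: 0.0001, 5510: 1.0})

# Select the factors for the columns present, then scale column by column.
converted = wide_df.multiply(unit_factor.loc[wide_df.columns])
print(converted)
```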

In [54]:
# Factor
qcl_bulk = qcl_bulk.multiply(df.loc[qcl_bulk.columns, "unit_factor"])
fbsc_bulk = fbsc_bulk.multiply(df.loc[fbsc_bulk.columns, "unit_factor"])

Next, we merge codes 5417, 5420, 5424 and 5410 into a single one. As previously highlighted, all of them are mapped to the same (name, unit) tuple.
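
The merge takes, for each row, the first non-`NaN` value across the four columns, in priority order. The sketch below checks on toy data (made-up values) that a chain of `fillna` calls and a row-wise backfill give the same result; the backfill variant is shown only as an alternative way to express the same idea.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        5417: [1.0, np.nan, np.nan, np.nan],
        5420: [np.nan, 2.0, np.nan, np.nan],
        5424: [np.nan, np.nan, 3.0, np.nan],
        5410: [9.0, np.nan, np.nan, 4.0],
    }
)
codes = [5417, 5420, 5424, 5410]

# Chained fillna: 5417 first, then 5420, then 5424, then 5410.
nested = toy[5417].fillna(toy[5420].fillna(toy[5424].fillna(toy[5410])))

# Same idea: backfill along each row and keep the first column.
merged = toy[codes].bfill(axis=1)[codes[0]]

print(merged.tolist())
```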

In [55]:
# Merge 5417,5420,5424,5410 --> 5417
qcl_bulk[5417] = qcl_bulk[5417].fillna(
    qcl_bulk[5420].fillna(qcl_bulk[5424].fillna(qcl_bulk[5410]))
)
qcl_bulk = qcl_bulk.drop(columns=[5420, 5424, 5410])

Finally, we rename the columns (so far element codes) to more readable element identifiers (`[element-name]__[unit]`).
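
As a small illustration of the naming rule, the hypothetical helper `to_element_name` below lowercases both parts, replaces spaces with underscores and joins them with a double underscore. The inputs are standardised names and units that appear in the element table above.

```python
def to_element_name(name: str, unit: str) -> str:
    """Build an identifier of the form `[element-name]__[unit]`."""

    def slug(text: str) -> str:
        return text.lower().replace(" ", "_")

    return f"{slug(name)}__{slug(unit)}"


print(to_element_name("Area harvested", "ha"))
print(to_element_name("Domestic supply", "1000_tonnes"))
```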

In [56]:
# Build element name
a = df["name_standardised"].apply(lambda x: x.lower().replace(" ", "_")).astype(str)
b = (
    df["unit_name_standardised_with_conversion"]
    .apply(lambda x: x.lower().replace(" ", "_"))
    .astype(str)
)
df["element_name"] = (a + "__" + b).tolist()
# Obtain dict Element Code -> element name
map_elem = df["element_name"].to_dict()

In [57]:
# Change columns names
qcl_bulk = qcl_bulk.rename(columns=map_elem)
fbsc_bulk = fbsc_bulk.rename(columns=map_elem)

In [58]:
# Show dataframe with standardised element names
qcl_bulk.head()

| Country | Item Code | Year | area_harvested__ha | production__tonnes | yield__tonnes_per_ha | yield__kg_per_animal |
|---------|-----------|------|--------------------|--------------------|----------------------|----------------------|
| Afghanistan | 15 | 1961 | 2230000.0 | 2279000.0 | 1.022 | NaN |
| Afghanistan | 15 | 1962 | 2341000.0 | 2279000.0 | 0.9735 | NaN |
| Afghanistan | 15 | 1963 | 2341000.0 | 1947000.0 | 0.8317 | NaN |
| Afghanistan | 15 | 1964 | 2345000.0 | 2230000.0 | 0.951 | NaN |
| Afghanistan | 15 | 1965 | 2347000.0 | 2282000.0 | 0.9723 | NaN |


### Item
We now load the standardisation item table and remove `NaN` values (these belong to to-be-discarded items).

In [59]:
# Get standardised values
df = pd.read_csv(PATH_MAP_ITEM, index_col="code")
map_item_std = df.dropna(subset=["name_standardised"])["name_standardised"].to_dict()

Briefly display first 10 mappings.

In [60]:
{k: v for (k, v) in list(map_item_std.items())[:10]}

{221: 'Almonds',
 711: 'Herbs (e.g. fennel)',
 515: 'Apples',
 526: 'Apricots',
 226: 'Areca nuts',
 366: 'Artichokes',
 367: 'Asparagus',
 1107: 'Asses',
 572: 'Avocados',
 486: 'Bananas'}

Next, we do a simple check of item name uniqueness. Note that we can have multiple codes assigned to the same `name_standardised`, as part of the standardisation process, BUT these should be in different datasets so we don't have any element conflicts.
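
A sketch of both checks on a hand-written excerpt. The codes and names are taken from the mappings shown in this notebook; the `dataset` assignment of each code is an assumption made for the example.

```python
import pandas as pd

# Hand-written excerpt of a standardisation table (dataset column assumed).
std = pd.DataFrame(
    {
        "code": [486, 2615, 515],
        "dataset": ["QCL", "FBSC", "QCL"],
        "name_standardised": ["Bananas", "Bananas", "Apples"],
    }
)

# Names shared by several codes ("fused" products).
fused = std.groupby("name_standardised")["code"].apply(list)
fused = fused[fused.apply(len) > 1]

# Within a single dataset, each name should map to exactly one code.
codes_per_name = std.groupby(["dataset", "name_standardised"])["code"].nunique()

print(fused.to_dict())
print(codes_per_name.max())
```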

In [61]:
# Show "fused" products from QCL and FBSC
x = pd.DataFrame.from_dict(map_item_std, orient="index", columns=["name"]).reset_index()
x = x.groupby("name").index.unique().apply(list)
x = x[x.apply(len) > 1]
print("There are", len(x), "fused products:\n", x)

There are 29 fused products:
 name
Bananas                       [486, 2615]
Beans, dry                    [2546, 176]
Chillies and peppers, dry     [689, 2641]
Coconut oil                   [2578, 252]
Cottonseed                    [329, 2559]
Cottonseed oil                [2575, 331]
Cream                         [2743, 885]
Dates                         [577, 2619]
Groundnut oil                 [2572, 244]
Honey                        [2745, 1182]
Maize oil                      [2582, 60]
Meat, Poultry                [1808, 2734]
Oilcrops                     [1731, 2913]
Onions                        [2602, 403]
Palm kernel oil               [258, 2576]
Peas, dry                     [2547, 187]
Pepper                        [2640, 687]
Plantains                     [2616, 489]
Pulses                       [2911, 1726]
Sesame oil                    [290, 2579]
Sesame seed                   [289, 2561]
Soybean oil                   [237, 2571]
Sugar beet                    [157, 2537]

In [62]:
# Check `code` --> `name_standardised` is unique in each dataset
assert (
    df.dropna(subset=["name_standardised"])
    .reset_index()
    .groupby(["dataset", "name_standardised"])
    .code.nunique()
    .max()
    == 1
)

Next, we filter out items that we are not interested in and add a new column (`Product`) with the standardised item names.

In [63]:
def standardise_product_names(df):
    df = df.reset_index()
    df = df[df["Item Code"].isin(map_item_std)]
    df.loc[:, "Product"] = df["Item Code"].replace(map_item_std).tolist()
    df = df.drop(columns=["Item Code"])
    # Set back index
    df = df.set_index(["Product", "Country", "Year"])
    return df

In [64]:
qcl_bulk = standardise_product_names(qcl_bulk)
fbsc_bulk = standardise_product_names(fbsc_bulk)

## 8. Final processing
Here we add the final processing steps:
- Merge datasets `QCL` + `FBSC`
- Discard products (former items) that do not contain any value for the "elements of interest".
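
Both steps can be sketched on two toy tables (made-up values) that share the `(Product, Country, Year)` index. The outer merge keeps every index entry from either table, and `dropna(how="all")` then removes rows where every element column is empty.

```python
import numpy as np
import pandas as pd

idx_names = ["Product", "Country", "Year"]
qcl_toy = pd.DataFrame(
    {"production__tonnes": [9800.0, np.nan]},
    index=pd.MultiIndex.from_tuples(
        [("Almonds", "Afghanistan", 1976), ("Almonds", "Afghanistan", 1975)],
        names=idx_names,
    ),
)
fbsc_toy = pd.DataFrame(
    {"imports__1000_tonnes": [5.0]},
    index=pd.MultiIndex.from_tuples([("Bananas", "Albania", 2000)], names=idx_names),
)

# Outer merge on the index: the union of rows, with columns from both tables.
merged = pd.merge(qcl_toy, fbsc_toy, how="outer", left_index=True, right_index=True)

# Drop rows without a single value (here: Almonds, Afghanistan, 1975).
cleaned = merged.dropna(how="all")
print(merged.shape, cleaned.shape)
```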

In [65]:
# Merge datasets
fe_bulk = pd.merge(qcl_bulk, fbsc_bulk, how="outer", left_index=True, right_index=True)

In [73]:
print("QCL // shape:", qcl_bulk.shape, "/ not-NaN:", qcl_bulk.notna().sum().sum())
print("FBSC // shape:", fbsc_bulk.shape, "/ not-NaN:", fbsc_bulk.notna().sum().sum())
print("FE // shape:", fe_bulk.shape, "/ not-NaN:", fe_bulk.notna().sum().sum())

QCL // shape: (1021572, 4) / not-NaN: 1907993
FBSC // shape: (246696, 11) / not-NaN: 1770081
FE // shape: (1131176, 15) / not-NaN: 3678074


In [74]:
# Drop nulls (some products don't have any value for the elements of interest)
fe_bulk = fe_bulk.dropna(how="all")
print("FE (after NaN-drop):", fe_bulk.shape)

FE (after NaN-drop): (943890, 15)


In [75]:
print(fe_bulk.shape)
fe_bulk.head()

(943890, 15)


Unnamed: 0_level_0,Unnamed: 1_level_0,Element Code,area_harvested__ha,production__tonnes,yield__tonnes_per_ha,yield__kg_per_animal,domestic_supply__1000_tonnes,exports__1000_tonnes,food_available_for_consumption__fat_g_per_day_per_capita,feed__1000_tonnes,food__1000_tonnes,food_available_for_consumption__kcal_per_day_per_capita,food_available_for_consumption__kg,imports__1000_tonnes,waste__1000_tonnes,other_uses__1000_tonnes,food_available_for_consumption__protein_g_per_day_per_capita
Product,Country,Year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
Almonds,Afghanistan,1975,0.0,0.0,,,,,,,,,,,,,
Almonds,Afghanistan,1976,5900.0,9800.0,1.661,,,,,,,,,,,,
Almonds,Afghanistan,1977,6000.0,9000.0,1.5,,,,,,,,,,,,
Almonds,Afghanistan,1978,6000.0,12000.0,2.0,,,,,,,,,,,,
Almonds,Afghanistan,1979,6000.0,10500.0,1.75,,,,,,,,,,,,


## Export
Time to export the shining brand new dataset!

We export it in two flavours: bulk and file-per-product formats. The former is the standard format, while the latter is intended to power OWID tools such as explorers.

### Define metadata
Prior to export, we need to create the metadata content for this dataset. It basically propagates the metadata (sources and licenses) from its building pieces (QCL and FBSC).

For this dataset, we use namespace `explorers`, which is intended for datasets aimed at powering explorers (this may change).

In [76]:
from owid.catalog.meta import DatasetMeta

In [82]:
# Initialize dataset
fe_garden = catalog.Dataset.create_empty(dest_dir)
fe_garden.metadata = DatasetMeta(
    namespace="explorers",
    short_name="food_explorer",
    sources=qcl_garden.metadata.sources + fbsc_garden.metadata.sources,
    licenses=qcl_garden.metadata.licenses + fbsc_garden.metadata.licenses,
)
fe_garden.save()

### In bulk

Preserve the bulk file for QA or manual analysis.

In [84]:
t = catalog.Table(fe_bulk)
t.metadata.short_name = "bulk"
fe_garden.add(t)

### One file per product

To work in an explorer, we need to add the table in CSV format. To make it more scalable for use, we want
to split that dataset into many small files, one per product.

In [85]:
def to_short_name(raw):
    return (
        raw.lower()
        .replace(" ", "_")
        .replace(",", "")
        .replace("(", "")
        .replace(")", "")
        .replace(".", "")
    )


# the index contains values like "Asses" which have already been filtered out from the data,
# let's remove them
fe_bulk.index = fe_bulk.index.remove_unused_levels()

for product in sorted(fe_bulk.index.levels[0]):
    short_name = to_short_name(product)
    print(f"{product} --> {short_name}.csv")

    t = catalog.Table(fe_bulk.loc[[product]])
    t.metadata.short_name = short_name
    fe_garden.add(t, format="csv")  # <-- note we choose CSV format here

Almonds --> almonds.csv
Apples --> apples.csv
Apricots --> apricots.csv
Areca nuts --> areca_nuts.csv
Artichokes --> artichokes.csv
Asparagus --> asparagus.csv
Avocados --> avocados.csv
Bananas --> bananas.csv
Barley --> barley.csv
Beans, dry --> beans_dry.csv
Beans, green --> beans_green.csv
Beef and Buffalo Meat --> beef_and_buffalo_meat.csv
Beeswax --> beeswax.csv
Blueberries --> blueberries.csv
Brazil nuts, with shell --> brazil_nuts_with_shell.csv
Broad beans --> broad_beans.csv
Buckwheat --> buckwheat.csv
Buffalo hides --> buffalo_hides.csv
Butter and Ghee --> butter_and_ghee.csv
Cabbages --> cabbages.csv
Canary seed --> canary_seed.csv
Carrots and turnips --> carrots_and_turnips.csv
Cashew nuts --> cashew_nuts.csv
Cassava --> cassava.csv
Castor oil seed --> castor_oil_seed.csv
Cattle hides --> cattle_hides.csv
Cauliflowers and broccoli --> cauliflowers_and_broccoli.csv
Cereals --> cereals.csv
Cheese --> cheese.csv
Cherries --> cherries.csv
Chestnut --> chestnut.csv
Chickpeas -->
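
The per-product split can also be sketched without the catalog library, as a quick sanity check of the file naming. The example below redefines a copy of `to_short_name` so that it is self-contained, uses toy data (made-up values) and keeps the CSV content in memory instead of writing files.

```python
import pandas as pd


def to_short_name(raw):
    # Copy of the helper above, repeated to keep this example self-contained
    return (
        raw.lower()
        .replace(" ", "_")
        .replace(",", "")
        .replace("(", "")
        .replace(")", "")
        .replace(".", "")
    )


toy = pd.DataFrame(
    {"production__tonnes": [1.0, 2.0, 3.0]},
    index=pd.MultiIndex.from_tuples(
        [
            ("Beans, dry", "Albania", 2000),
            ("Beans, dry", "Albania", 2001),
            ("Herbs (e.g. fennel)", "Albania", 2000),
        ],
        names=["Product", "Country", "Year"],
    ),
)

# One CSV document per product, keyed by the short name used for the file.
csv_by_name = {
    to_short_name(product): frame.to_csv()
    for product, frame in toy.groupby(level="Product")
}
print(sorted(csv_by_name))
```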

Let's check that the biggest files are still an ok size for an explorer.

In [86]:
!du -hs {dest_dir}/*.csv | sort -hr | head -n 10

1.4M	/tmp/food_explorer/oilcrops.csv
1.3M	/tmp/food_explorer/meat_poultry.csv
1.2M	/tmp/food_explorer/pulses.csv
1.1M	/tmp/food_explorer/onions.csv
1.1M	/tmp/food_explorer/beans_dry.csv
1.1M	/tmp/food_explorer/bananas.csv
1.0M	/tmp/food_explorer/chillies_and_peppers_dry.csv
1012K	/tmp/food_explorer/sweet_potatoes.csv
944K	/tmp/food_explorer/peas_dry.csv
888K	/tmp/food_explorer/soybean_oil.csv


The biggest file is 1.4 MB (CSV), so we should be OK ✓