# Environmental Impact of Agricultural Practices in the World

**ADA Project Milestone 2**

This notebook contains our initial data analysis of the FAOSTAT dataset on food and agriculture. We first study the contents and structure of the data, then restructure it so that we can start our analysis. Some of the research questions we initially asked are also answered by the end of this notebook.

## A. Initial Analysis

The dataset initially contained 78 CSV files. We discarded those that are not useful for our analysis and kept the 43 that are.

In [1]:
from glob import glob
import pandas as pd
%load_ext autoreload
%autoreload 2

In [2]:
csv_files = glob('data/**/**.csv')
len(csv_files)

43

We split those 43 CSV files into directories, one per category:
```
.
+-- data/
|   +-- emissions_agriculture/
|      +-- ...
|   +-- emissions_land/
|      +-- ...
|   +-- environment/
|      +-- ...
|   +-- forestry/
|      +-- ...
|   +-- inputs/
|      +-- ...
|   +-- population/
|      +-- ...
|   +-- production/
|      +-- ...
```

### A.1 Schema consistency

#### 1. Checking column names across whole dataset

Now, let's scan all the csv files and check their schemas.

In [3]:
from data_processing import scan_columns
all_columns = scan_columns(csv_files)
print("The found columns, grouped, are:\n")
for cols, f in all_columns:
    print(list(sorted(cols)), f"Num files {len(f)}")

The found columns, grouped, are:

['area', 'areacode', 'element', 'elementcode', 'flag', 'item', 'itemcode', 'unit', 'value', 'year', 'yearcode'] Num files 33
['country', 'countrycode', 'element', 'elementcode', 'elementgroup', 'flag', 'item', 'itemcode', 'unit', 'value', 'year'] Num files 4
['area', 'areacode', 'element', 'elementcode', 'flag', 'months', 'monthscode', 'unit', 'value', 'year', 'yearcode'] Num files 1
['area', 'areacode', 'element', 'elementcode', 'flag', 'item', 'itemcode', 'note', 'unit', 'value', 'year', 'yearcode'] Num files 4
['country', 'countrycode', 'element', 'elementcode', 'flag', 'item', 'itemcode', 'unit', 'value', 'year', 'yearcode'] Num files 1


As we can see, sometimes the columns `area` and `areacode` are named `country` and `countrycode`. Also, some files have the `note`, `elementgroup` and `months` columns. We will look into those in subsequent steps as we are now simply checking whether column naming is consistent.
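
The helper `scan_columns` lives in our `data_processing` module and is not shown in this notebook. As a rough sketch only, a function of this kind could read just the header row of each file and group files by their set of column names. The name `scan_columns_sketch`, the `encoding` default and the header normalisation (lower-casing and removing spaces) are assumptions made for illustration, not the actual implementation.

```python
from collections import defaultdict

import pandas as pd


def scan_columns_sketch(files, column_rename=None, encoding="latin-1"):
    """Group CSV files by their set of (normalised) column names.

    Hypothetical stand-in for `data_processing.scan_columns`.
    """
    column_rename = column_rename or {}
    groups = defaultdict(list)
    for path in files:
        # nrows=0 parses only the header row, so large files stay cheap to scan
        header = pd.read_csv(path, nrows=0, encoding=encoding).columns
        # Assumed normalisation: "Area Code" -> "areacode"
        cols = (c.strip().lower().replace(" ", "") for c in header)
        key = frozenset(column_rename.get(c, c) for c in cols)
        groups[key].append(path)
    # Same shape as the loops in this notebook: (columns, files) pairs
    return list(groups.items())
```

Using a `frozenset` as the grouping key makes the comparison independent of column order, which matches the sorted printing used in the cells of this section.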

In order to obtain a more consistent column naming, we will rename `country` to `area` and `countrycode` to `areacode`.

In [4]:
column_rename = {'country': 'area', 'countrycode': 'areacode'}

In [5]:
all_columns_2 = scan_columns(csv_files, column_rename)
print(f"After renaming, we obtain the following columns:\n")
for cols, f in all_columns_2:
    print(list(sorted(cols)), f"Num files {len(f)}")

After renaming, we obtain the following columns:

['area', 'areacode', 'element', 'elementcode', 'flag', 'item', 'itemcode', 'unit', 'value', 'year', 'yearcode'] Num files 34
['area', 'areacode', 'element', 'elementcode', 'elementgroup', 'flag', 'item', 'itemcode', 'unit', 'value', 'year'] Num files 4
['area', 'areacode', 'element', 'elementcode', 'flag', 'months', 'monthscode', 'unit', 'value', 'year', 'yearcode'] Num files 1
['area', 'areacode', 'element', 'elementcode', 'flag', 'item', 'itemcode', 'note', 'unit', 'value', 'year', 'yearcode'] Num files 4


#### 2. Checking which columns to drop

Now we have a few files that have different schemas. One column that we should look into before continuing is `note`, as it is in 4 files.

In [7]:
from data_processing import load_dataframe
def get_column_unique_values(files, col):
    # Function that returns the unique values of a column across multiple files
    values = []
    for f in files:
        df = load_dataframe(f, column_rename)
        vals = df[col].unique()
        values.append(vals)
    return values

files_with_note = all_columns_2[-1][1]
note_values = get_column_unique_values(files_with_note, 'note')
note_values

[array([nan]), array([nan]), array([nan]), array([nan])]

As we can see, all values for this column are NaN, so we can safely drop the column.

#### 3. Checking duplicate columns

We figured it would be useful to scan each dataframe for duplicate columns (i.e. columns with different names but the same values).

In [34]:
from data_processing import scan_column_duplicates
duplicates = scan_column_duplicates(csv_files, column_rename)
for c, f in duplicates:
    print(f"Duplicates for {c} in {len(f)} files")

Duplicates for [('yearcode', 'year')] in 39 files
Duplicates for [('elementgroup', 'elementcode')] in 3 files


As we can see, most files have `year` and `yearcode` columns that are equal, so we can safely drop `yearcode`. The `elementgroup` and `elementcode` columns are equal in 3 of the 4 CSV files where `elementgroup` appears, but not in all of them, so we cannot drop `elementgroup` without checking. We keep only `elementcode` when the two are equal, and keep both when they are not.
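
To illustrate the kind of check involved, here is a minimal sketch that finds pairs of differently named columns with identical values in a single dataframe. The function name `duplicate_column_pairs` and the toy data are hypothetical; the real `scan_column_duplicates` helper may work differently.

```python
from itertools import combinations

import pandas as pd


def duplicate_column_pairs(df):
    """Return pairs of differently named columns that hold identical values."""
    pairs = []
    for a, b in combinations(df.columns, 2):
        # Series.equals compares values position by position and treats
        # NaNs in the same position as equal
        if df[a].equals(df[b]):
            pairs.append((a, b))
    return pairs


# Toy frame (made-up values) mimicking the year / yearcode situation
toy = pd.DataFrame({
    "year": [1961, 1962],
    "yearcode": [1961, 1962],
    "value": [1.0, 2.0],
})
pairs = duplicate_column_pairs(toy)
```

Note that `Series.equals` also requires matching dtypes, so an integer column and a float column with the same numbers would not be reported as duplicates by this sketch.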

Hence, we can define a list of candidate columns to drop. A candidate is only dropped from a file if it fulfils at least one of the following conditions:
 - NaN in all rows
 - Duplicate with another column
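
The two conditions above can be sketched as follows. This is an illustration under assumptions: `apply_drop_rules` is a hypothetical name, the toy values are made up, and the actual logic inside `load_dataframe` may differ.

```python
import numpy as np
import pandas as pd


def apply_drop_rules(df, column_rename, drop_columns):
    """Rename columns, then drop each candidate only if it is empty or redundant."""
    df = df.rename(columns=column_rename)
    for col in drop_columns:
        if col not in df.columns:
            continue
        others = df.drop(columns=col)
        # Condition 1: NaN in all rows
        is_empty = df[col].isna().all()
        # Condition 2: identical to at least one other column
        is_duplicate = any(df[col].equals(others[o]) for o in others.columns)
        if is_empty or is_duplicate:
            df = others
    return df


# Toy frame: `note` is empty, `yearcode` duplicates `year`,
# while `elementgroup` differs from `elementcode` and must survive
toy = pd.DataFrame({
    "country": ["Algeria", "Angola"],
    "year": [2000, 2001],
    "yearcode": [2000, 2001],
    "elementcode": [5111, 5112],
    "elementgroup": [51, 51],
    "note": [np.nan, np.nan],
})
cleaned = apply_drop_rules(toy, {"country": "area"}, ["note", "yearcode", "elementgroup"])
```

In this toy case `cleaned` keeps `elementgroup`, which mirrors the single file that still carries that column after dropping.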

In [49]:
drop_columns = ["note", "yearcode", "elementgroup"]
all_columns_3 = scan_columns(csv_files, column_rename, drop_columns)
print(f"After renaming and dropping columns, we obtain the following columns:\n")
for cols, f in all_columns_3:
    print(list(sorted(cols)), f"Num files {len(f)}")

After renaming and dropping columns, we obtain the following columns:

['area', 'areacode', 'element', 'elementcode', 'flag', 'item', 'itemcode', 'unit', 'value', 'year'] Num files 41
['area', 'areacode', 'element', 'elementcode', 'elementgroup', 'flag', 'item', 'itemcode', 'unit', 'value', 'year'] Num files 1
['area', 'areacode', 'element', 'elementcode', 'flag', 'months', 'monthscode', 'unit', 'value', 'year'] Num files 1


We now have 41 files with identical schemas, and 2 files that have a different one:
 - The file containing the `elementgroup` additional column 
 - The file with monthly data and no `item` and `itemcode` columns
 
To obtain the desired format, we can now call `load_dataframe(<file>, column_rename, drop_columns)` with `column_rename = {'country': 'area', 'countrycode': 'areacode'}` and `drop_columns = ["note", "yearcode", "elementgroup"]`

### A.2 Schema description
Now that we have a unified schema for (almost) all CSV files, we can look into the meaning of each column and its possible values.

#### 1. Area columns

We will first look into the columns `area` and `areacode`. According to FAOSTAT's website, each area is identified by a unique `areacode`; however, some areas contain others, i.e. the datasets include grouped areas. We expect a one-to-one mapping between those two columns. Let's see what this looks like.

To verify that it is indeed a one-to-one mapping, we collect the (`area`, `areacode`) pairs from all CSV files and drop duplicates. Then we group by `area` and check that each group has length 1.

In [372]:
from data_processing import get_column_unique_values
from utils import is_unique_mapping

area_values = get_column_unique_values(csv_files, column_rename, drop_columns, ['area', 'areacode'])
if is_unique_mapping(area_values, 'area', 'areacode'):
    print("Area and Areacode is a one-to-one mapping")

Area and Areacode is a one-to-one mapping
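
For reference, the check described above (drop duplicates, group by the first column, verify that every group has size 1) could look like the sketch below. `is_unique_mapping_sketch` is a hypothetical stand-in for `utils.is_unique_mapping`, whose code is not shown here, and the toy values are taken from the outputs in this notebook.

```python
import pandas as pd


def is_unique_mapping_sketch(df, key, value):
    """Return True if every `key` is associated with exactly one `value`.

    `value` may be a column name or a list of column names.
    """
    value = [value] if isinstance(value, str) else list(value)
    # Keep each distinct (key, value) combination once
    pairs = df[[key] + value].drop_duplicates()
    # A key that appears more than once maps to several values
    return bool((pairs.groupby(key).size() == 1).all())


areas = pd.DataFrame({
    "area": ["Afghanistan", "Albania", "Afghanistan"],
    "areacode": [2, 3, 2],
})
items = pd.DataFrame({
    "item": ["Cattle", "Cattle", "Chickens"],
    "itemcode": [866, 1757, 1057],
})
areas_ok = is_unique_mapping_sketch(areas, "area", "areacode")
items_ok = is_unique_mapping_sketch(items, "item", "itemcode")
```

As written, this only checks one direction (from `key` to `value`). A strict one-to-one test would also call the function with the two arguments swapped.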


In [373]:
area_values

Unnamed: 0,area,areacode
0,Afghanistan,2
627,Albania,3
1539,Algeria,4
2508,American Samoa,5
2793,Angola,7
...,...,...
422,Polynesia + (Total),5504
166548,Svalbard and Jan Mayen Islands,260
14537,"Bonaire, Sint Eustatius and Saba",278
105102,Saint Barthélemy,282


As we can see, some unusual `area` values such as "Polynesia + (Total)" represent a group of areas rather than a country. In order to understand how those groups are formed, we downloaded an additional CSV file from the FAOSTAT website that describes which countries belong to each grouped area.

In [95]:
country_groups = load_dataframe('data/country_groups.csv')
country_groups.head()

Unnamed: 0,countrygroupcode,countrygroup,countrycode,country,m49code,iso2code,iso3code
0,5100,Africa,4,Algeria,12.0,DZ,DZA
1,5100,Africa,7,Angola,24.0,AO,AGO
2,5100,Africa,53,Benin,204.0,BJ,BEN
3,5100,Africa,20,Botswana,72.0,BW,BWA
4,5100,Africa,24,British Indian Ocean Territory,86.0,IO,IOT


Here, countries are assigned to one or more `countrygroup` values, so we know exactly which countries make up each group. These country groups are present in the dataset as `area` values, meaning the dataset contains aggregated values. For example, we can find the emissions for "Algeria" and for "Africa", where the latter is an aggregate over the whole group. We will need to be careful when aggregating values in the future, as we could otherwise count a country multiple times.
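
A small sketch of the double-counting risk, using made-up values: summing all rows counts Algeria and Angola twice (once directly, once inside "Africa"), while filtering out rows whose `areacode` is a known group code does not. It assumes that the `countrygroupcode` values are the same codes that appear in `areacode` for aggregated rows.

```python
import pandas as pd

# Toy excerpt of the country group file
country_groups = pd.DataFrame({
    "countrygroupcode": [5100, 5100],
    "countrygroup": ["Africa", "Africa"],
    "countrycode": [4, 7],
    "country": ["Algeria", "Angola"],
})

# Toy emissions table; the values are invented for illustration
emissions = pd.DataFrame({
    "area": ["Algeria", "Angola", "Africa"],
    "areacode": [4, 7, 5100],
    "value": [10.0, 5.0, 15.0],
})

# Rows whose areacode is a group code are aggregates, not countries
group_codes = set(country_groups["countrygroupcode"])
countries_only = emissions[~emissions["areacode"].isin(group_codes)]

naive_total = emissions["value"].sum()      # includes the "Africa" aggregate
safe_total = countries_only["value"].sum()  # countries only
```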

We will add a `countrygroupcodes` column to each CSV, holding the list of `countrygroupcode` values the country belongs to (left as NaN for rows that are themselves groups).
The `iso3code` column will also help us plot maps with `geopandas`.
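
One possible way to build such a column, shown on toy data: collect the group codes per country, then map them onto `areacode`. The second group code (5103) and all values are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy excerpt: Algeria belongs to two groups, Angola to one
country_groups = pd.DataFrame({
    "countrygroupcode": [5100, 5103, 5100],
    "countrycode": [4, 4, 7],
})

df = pd.DataFrame({
    "area": ["Algeria", "Angola", "Africa"],
    "areacode": [4, 7, 5100],
})

# One list of group codes per country code
codes_per_country = country_groups.groupby("countrycode")["countrygroupcode"].agg(list)

# Codes absent from the lookup (i.e. the groups themselves) become NaN
df["countrygroupcodes"] = df["areacode"].map(codes_per_country)
```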

#### 2. Element Columns

The `element` and `elementcode` columns represent the measured quantity for a given `item`. A quantity has a name and a unit, which is why we believe `elementcode` should map one-to-one to (`element`, `unit`) pairs across the whole dataset. If an `elementcode` does uniquely identify an (`element`, `unit`) pair, we can drop `element` and `unit` to make the CSV files smaller and easier to manipulate.

First let's check if indeed this mapping is one-to-one:

In [358]:
element_values = get_column_unique_values(csv_files, column_rename, drop_columns, ['elementcode', 'element', 'unit'])
is_unique_mapping(element_values, 'elementcode', ['element', 'unit'])

True

In [359]:
element_values.head()

Unnamed: 0,elementcode,element,unit
0,5111,Stocks,Head
171,5112,Stocks,1000 Head
684,5114,Stocks,No
0,5510,Production,tonnes
0,5313,Laying,1000 Head


As we can see, `elementcode` uniquely identifies (`element`, `unit`) pairs, so we can safely drop those two columns and only use `elementcode`. We will later pivot each CSV so that the `elementcode` values become columns, which reduces the number of rows significantly. A dictionary mapping each code back to its (`element`, `unit`) pair will of course be necessary in order to have a nice GUI where users can select the pair instead of the code.
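
A minimal sketch of that pivot and of the code-to-label dictionary, on toy data. The values are made up, the choice of index columns is an assumption, and the `flag` column is left out for simplicity. The element codes and labels are those shown in the output above.

```python
import pandas as pd

# Toy long-format data: one row per (area, item, year, elementcode)
long_df = pd.DataFrame({
    "area": ["Algeria", "Algeria", "Angola", "Angola"],
    "areacode": [4, 4, 7, 7],
    "item": ["Cattle"] * 4,
    "itemcode": [866] * 4,
    "year": [2000] * 4,
    "elementcode": [5111, 5510, 5111, 5510],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# One row per (area, item, year); one column per elementcode
index_cols = ["area", "areacode", "item", "itemcode", "year"]
wide_df = long_df.pivot_table(
    index=index_cols, columns="elementcode", values="value", aggfunc="first"
).reset_index()
wide_df.columns.name = None

# Lookup from code to (element, unit), e.g. for display in a GUI
element_values = pd.DataFrame({
    "elementcode": [5111, 5510],
    "element": ["Stocks", "Production"],
    "unit": ["Head", "tonnes"],
})
element_labels = {
    code: (element, unit)
    for code, element, unit in element_values.itertuples(index=False)
}
```

Here four long rows collapse into two wide rows, and `element_labels[5111]` recovers the readable pair for a column header.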

#### 3. Item columns

According to FAOSTAT, the `item` and `itemcode` columns represent the item on which measurements were made. For example, an item can be `cattle` and the measurement can be "CH4 emissions in gigagrams".
As above, we expect `item` and `itemcode` to have a one-to-one relationship. Let's verify this using the same functions.

In [374]:
item_values = get_column_unique_values(csv_files, column_rename, drop_columns, ['item', 'itemcode'])
is_unique_mapping(item_values, 'item', 'itemcode')

False

It seems that the mapping from `item` to `itemcode` is not unique for a few items. Let's look at those and try to understand why.

In [375]:
grouped = item_values.groupby('item').agg({'itemcode': set})
grouped[grouped.itemcode.apply(len) > 1]

item,itemcode
Ammonium nitrate (AN),"{1362, 4003}"
Ammonium sulphate,"{1361, 4002}"
Burning - all categories,"{6795, 6798}"
Cattle,"{866, 1757}"
Chickens,"{1057, 1054}"
Cropland,"{6620, 5070}"
Disinfectants,"{1358, 1351}"
Forest land,"{5065, 6749, 6646}"
Grassland,"{6794, 6983}"
Mineral Oils,"{1354, 1316}"


Some items have multiple (up to 3) different item codes, which is unexpected. Some of those items correspond to nutrients provided through fertilizers. Let's see in which files they appear. The items related to nutrients are the following:

In [414]:
nutrient_items = ["Ammonium nitrate (AN)", "Ammonium sulphate", "Other nitrogenous fertilizers, n.e.c.", "Other potassic fertilizers, n.e.c.", "Potassium sulphate (sulphate of potash) (SOP)", "Urea"]
item_values[item_values.item.isin(nutrient_items)].file.unique()

array(['data/inputs/Inputs_FertilizersArchive_E_All_Data_(Normalized).csv',
       'data/inputs/Inputs_FertilizersProduct_E_All_Data_(Normalized).csv'],
      dtype=object)

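These items appear in two fertilizer files. Judging by the file names, `Inputs_FertilizersArchive` is an archived dataset and `Inputs_FertilizersProduct` is the newer one, which may explain why the same item has two different codes.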

In [424]:
pesticide_items = ["Disinfectants", "Mineral Oils", "Other Pesticides nes", "Plant Growth Regulators"]
item_values[item_values.item.isin(pesticide_items)].file.unique()

array(['data/inputs/Inputs_Pesticides_Use_E_All_Data_(Normalized).csv'],
      dtype=object)

In [430]:
t = load_dataframe('data/inputs/Inputs_Pesticides_Use_E_All_Data_(Normalized).csv', column_rename, drop_columns)
for i in pesticide_items:
    item_codes = t[t.item == i].itemcode.unique()
    # Compare the rows of the first two item codes, ignoring the `itemcode` column itself
    first = t[t.itemcode == item_codes[0]].drop('itemcode', axis=1).values
    second = t[t.itemcode == item_codes[1]].drop('itemcode', axis=1).values
    if (first == second).all():
        print(f"Duplicate item for {i}")

Duplicate item for Disinfectants
Duplicate item for Mineral Oils
Duplicate item for Other Pesticides nes
Duplicate item for Plant Growth Regulators


In [438]:
livestock_items = ["Cattle", "Chickens"]
item_values[item_values.item.isin(livestock_items)].file.unique()

array(['data/production/Production_Livestock_E_All_Data_(Normalized).csv',
       'data/environment/Environment_LivestockPatterns_E_All_Data_(Normalized).csv',
       'data/environment/Environment_LivestockManure_E_All_Data_(Normalized).csv',
       'data/emissions_agriculture/Emissions_Agriculture_Enteric_Fermentation_E_All_Data_(Normalized).csv',
       'data/emissions_agriculture/Emissions_Agriculture_Manure_Management_E_All_Data_(Normalized).csv',
       'data/emissions_agriculture/Emissions_Agriculture_Manure_left_on_pasture_E_All_Data_(Normalized).csv',
       'data/emissions_agriculture/Emissions_Agriculture_Manure_applied_to_soils_E_All_Data_(Normalized).csv'],
      dtype=object)

In [444]:
land_items = ["Burning - all crops", "Cropland", "Forest Land", "Grassland"]
item_values[item_values.item.isin(land_items)].file.unique()

array(['data/environment/Environment_LandCover_E_All_Data_(Normalized).csv',
       'data/environment/Environment_LandUse_E_All_Data_(Normalized).csv',
       'data/emissions_land/Emissions_Land_Use_Land_Use_Total_E_All_Data_(Normalized).csv',
       'data/emissions_agriculture/Emissions_Agriculture_Burning_Savanna_E_All_Data_(Normalized).csv',
       'data/inputs/Inputs_LandUse_E_All_Data_(Normalized).csv'],
      dtype=object)