# Descriptive Analysis of the forests dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

First, we load the dataset:

In [110]:
datapath = "Data/global-food-agriculture-statistics/"
current_fao = "current_FAO/raw_files/"

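# Note: "ANSI" is an alias of the Windows-only "mbcs" codec; on other platforms a named codec such as "cp1252" may be needed.
forests = pd.read_csv(datapath + current_fao + "Emissions_Land_Use_Forest_Land_E_All_Data_(Norm).csv", sep=",", encoding="ANSI")  # Less forests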
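
Because the `"ANSI"` encoding name only resolves on Windows, loading may fail on other systems. The following is a minimal sketch of a loader that tries several encodings in turn. The helper name `read_csv_with_fallback` and the candidate encodings are assumptions made for illustration; the actual encoding of the FAO file is not confirmed here.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252"), **kwargs):
    """Hypothetical helper: return the first DataFrame that parses without a decoding error."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc, **kwargs)
        except (UnicodeDecodeError, LookupError) as err:
            # UnicodeDecodeError: wrong encoding; LookupError: codec unknown on this platform
            last_error = err
    raise last_error


# Demonstration on a small temporary file written as cp1252 (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="cp1252") as fh:
        fh.write("Country,Value\nCôte d'Ivoire,1.0\n")
    demo = read_csv_with_fallback(demo_path)
```

Here the UTF-8 attempt fails on the accented character and the loader falls back to `cp1252`.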

This dataset is structured very much like the Savanna dataset:

In [111]:
forests.dtypes

Country Code      int64
Country          object
Item Code         int64
Item             object
Element Code      int64
Element          object
Year Code         int64
Year              int64
Unit             object
Value           float64
Flag             object
dtype: object

In [112]:
forests.shape

(73067, 11)

In [113]:
forests.head(10)

Unnamed: 0,Country Code,Country,Item Code,Item,Element Code,Element,Year Code,Year,Unit,Value,Flag
0,2,Afghanistan,6661,Forest,5110,Area,1990,1990,1000 Ha,1350.0,F
1,2,Afghanistan,6661,Forest,5110,Area,1991,1991,1000 Ha,1350.0,F
2,2,Afghanistan,6661,Forest,5110,Area,1992,1992,1000 Ha,1350.0,F
3,2,Afghanistan,6661,Forest,5110,Area,1993,1993,1000 Ha,1350.0,F
4,2,Afghanistan,6661,Forest,5110,Area,1994,1994,1000 Ha,1350.0,F
5,2,Afghanistan,6661,Forest,5110,Area,1995,1995,1000 Ha,1350.0,F
6,2,Afghanistan,6661,Forest,5110,Area,1996,1996,1000 Ha,1350.0,F
7,2,Afghanistan,6661,Forest,5110,Area,1997,1997,1000 Ha,1350.0,F
8,2,Afghanistan,6661,Forest,5110,Area,1998,1998,1000 Ha,1350.0,F
9,2,Afghanistan,6661,Forest,5110,Area,1999,1999,1000 Ha,1350.0,F


Now we need to see how many different items and elements we have, and remove those that are not of interest to us:

In [114]:
pd.Series.unique(forests.Item)

array(['Forest', 'Net Forest conversion', 'Forest land'], dtype=object)

We have three types of items, all related to forest land: "Forest", "Forest land" and, most importantly, "Net Forest conversion", which represents the area of forest that was cleared for another use (mostly agriculture, but also mining, infrastructure or urbanisation). We keep only net forest conversion, as we are not interested in the area of intact forest.

In [115]:
forests = forests.query("Item == 'Net Forest conversion'")
forests.shape

(26308, 11)
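
Before filtering, it can help to see how many rows each item has. The sketch below uses a made-up toy frame (the values are illustrative, not taken from the dataset) to show `value_counts` and a boolean mask that selects the same rows as the `query` call above.

```python
import pandas as pd

# Toy data with the three item labels seen above; values are invented
toy = pd.DataFrame({
    "Item": ["Forest", "Net Forest conversion", "Forest land", "Net Forest conversion"],
    "Value": [1350.0, 0.0, 12.5, 3.2],
})

# Number of rows per item
item_counts = toy["Item"].value_counts()

# Boolean mask selecting the same rows as toy.query("Item == 'Net Forest conversion'")
conversion = toy[toy["Item"] == "Net Forest conversion"]
```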

What about the Element column?

In [67]:
pd.Series.unique(forests.Element)

array(['Area', 'Implied emission factor for CO2',
       'Net emissions/removals (CO2) (Forest land)',
       'Net emissions/removal (CO2eq) (Forest land)'], dtype=object)

We are only interested in the area, so we drop the other elements. Note that `size` returns the number of cells (rows × columns), not the number of rows:

In [89]:
forests = forests.query("Element == 'Area'")
forests.size

74987
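
The difference between `size` and `shape` is easy to overlook. A small sketch on a toy frame, followed by the arithmetic for the figure above (assuming the frame still has the 11 columns reported earlier):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy.shape  # rows and columns separately
n_cells = toy.size          # rows * columns

# With 11 columns, a size of 74987 corresponds to this many rows
rows_after_filter = 74987 // 11
```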

In [59]:
forests 

Unnamed: 0,Country Code,Country,Item Code,Item,Element Code,Element,Year Code,Year,Unit,Value,Flag
104,2,Afghanistan,6750,Net Forest conversion,5110,Area,1990,1990,1000 Ha,0.00,Fc
105,2,Afghanistan,6750,Net Forest conversion,5110,Area,1991,1991,1000 Ha,0.00,Fc
106,2,Afghanistan,6750,Net Forest conversion,5110,Area,1992,1992,1000 Ha,0.00,Fc
107,2,Afghanistan,6750,Net Forest conversion,5110,Area,1993,1993,1000 Ha,0.00,Fc
108,2,Afghanistan,6750,Net Forest conversion,5110,Area,1994,1994,1000 Ha,0.00,Fc
...,...,...,...,...,...,...,...,...,...,...,...
72932,5873,OECD,6750,Net Forest conversion,5110,Area,2011,2011,1000 Ha,822.52,A
72933,5873,OECD,6750,Net Forest conversion,5110,Area,2012,2012,1000 Ha,822.52,A
72934,5873,OECD,6750,Net Forest conversion,5110,Area,2013,2013,1000 Ha,822.52,A
72935,5873,OECD,6750,Net Forest conversion,5110,Area,2014,2014,1000 Ha,822.52,A


In [10]:
pd.Series.unique(forests.Flag)

array(['Fc', 'A'], dtype=object)

We have two kinds of flags:
- "Fc", which indicates that the value is calculated
- "A", which indicates that the value is an aggregate and may include official, semi-official, estimated or calculated data

We don't need to know how the data was collected, so we will remove this column.

Now for the countries. Judging by the country codes, it looks as if there were 5873 different countries. How many distinct country names are there actually?

In [116]:
pd.Series.nunique(forests.Country)

272
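
The largest code says nothing about how many entities there are. A minimal sketch on toy data (the rows are illustrative) compares the number of distinct names, the number of distinct codes and the maximum code, and checks that each name maps to exactly one code:

```python
import pandas as pd

toy = pd.DataFrame({
    "Country Code": [2, 2, 3, 5873],
    "Country": ["Afghanistan", "Afghanistan", "Albania", "OECD"],
})

n_names = toy["Country"].nunique()
n_codes = toy["Country Code"].nunique()
max_code = toy["Country Code"].max()

# Each name should correspond to exactly one code
codes_per_name = toy.groupby("Country")["Country Code"].nunique()
one_to_one = bool((codes_per_name == 1).all())
```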

There are only 272, which makes sense. Let's look at the country codes:

In [118]:
pd.Series.unique(forests["Country Code"])

array([   2,    3,    4,    5,    6,    7,  258,    8,    9,    1,   22,
         10,   11,   52,   12,   13,   16,   14,   57,  255,   15,   23,
         53,   17,   18,   19,   80,   20,   21,  239,   26,   27,  233,
         29,   35,  115,   32,   33,   36,   37,   39,  259,   40,  351,
         41,   44,   45,   46,   47,   48,  107,   98,   49,   50,  167,
         51,  116,  250,   54,   72,   55,   56,   58,   59,   60,   61,
        178,   63,  238,   62,   65,   64,   66,   67,   68,   69,   70,
         74,   75,   73,   79,   81,   82,   84,   85,   86,   87,   88,
         89,   90,  175,   91,   93,   95,   97,   99,  100,  101,  102,
        103,  104,  264,  105,  106,  109,  110,  112,  108,  114,   83,
        118,  113,  120,  119,  121,  122,  123,  124,  125,  126,  256,
        129,  130,  131,  132,  133,  134,  127,  135,  136,  137,  270,
        138,  145,  141,  273,  142,  143,  144,   28,  147,  148,  149,
        150,  151,  153,  156,  157,  158,  159,  1

The country codes have two peculiarities: some numbers are missing (232, for example) and, more importantly, the codes jump to around 5000. What data do these 5000+ country codes hold?

Apparently, these "countries" are in fact regions, continents or groupings of countries. We will keep them for now, but we must be careful: their values already include those of the individual countries they contain. The country code is therefore a convenient way to tell actual countries apart from regions.
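
To illustrate the double-counting risk, here is a small sketch with invented numbers: a naive sum over all rows counts the country values twice, once directly and once inside the aggregate row. The 5000 threshold follows the observation above; the codes and values in the toy frame are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country Code": [2, 3, 5000],
    "Country": ["Afghanistan", "Albania", "World"],
    "Value": [1.0, 2.0, 3.0],  # the aggregate row equals the sum of the two countries
})

# Naive total: the aggregate row is added on top of its own members
naive_total = toy["Value"].sum()

# Restricting to codes below 5000 keeps only individual countries
is_aggregate = toy["Country Code"] >= 5000
country_total = toy.loc[~is_aggregate, "Value"].sum()
```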

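We drop the Item Code, Element Code, Year Code and Flag columns, keeping Country Code so that countries can still be told apart from regions: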

In [90]:
forests = forests[["Country Code", "Country", "Item", "Element", "Year", "Unit", "Value"]]

In [91]:
forests

Unnamed: 0,Country Code,Country,Item,Element,Year,Unit,Value
104,2,Afghanistan,Net Forest conversion,Area,1990,1000 Ha,0.00
105,2,Afghanistan,Net Forest conversion,Area,1991,1000 Ha,0.00
106,2,Afghanistan,Net Forest conversion,Area,1992,1000 Ha,0.00
107,2,Afghanistan,Net Forest conversion,Area,1993,1000 Ha,0.00
108,2,Afghanistan,Net Forest conversion,Area,1994,1000 Ha,0.00
...,...,...,...,...,...,...,...
72932,5873,OECD,Net Forest conversion,Area,2011,1000 Ha,822.52
72933,5873,OECD,Net Forest conversion,Area,2012,1000 Ha,822.52
72934,5873,OECD,Net Forest conversion,Area,2013,1000 Ha,822.52
72935,5873,OECD,Net Forest conversion,Area,2014,1000 Ha,822.52


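We need to multiply Value by 1000 to express it in Ha instead of 1000 Ha (so that it can be joined with the Savanna data). The Unit column still reads "1000 Ha" in the output below; it is dropped in the next cell: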

In [92]:
forests = forests.assign(Value=forests.Value * 1000)  # returns a new frame instead of assigning into a slice
forests

Unnamed: 0,Country Code,Country,Item,Element,Year,Unit,Value
104,2,Afghanistan,Net Forest conversion,Area,1990,1000 Ha,0.0
105,2,Afghanistan,Net Forest conversion,Area,1991,1000 Ha,0.0
106,2,Afghanistan,Net Forest conversion,Area,1992,1000 Ha,0.0
107,2,Afghanistan,Net Forest conversion,Area,1993,1000 Ha,0.0
108,2,Afghanistan,Net Forest conversion,Area,1994,1000 Ha,0.0
...,...,...,...,...,...,...,...
72932,5873,OECD,Net Forest conversion,Area,2011,1000 Ha,822520.0
72933,5873,OECD,Net Forest conversion,Area,2012,1000 Ha,822520.0
72934,5873,OECD,Net Forest conversion,Area,2013,1000 Ha,822520.0
72935,5873,OECD,Net Forest conversion,Area,2014,1000 Ha,822520.0
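
As a sketch of the same conversion that keeps the unit label consistent, `DataFrame.assign` can update both columns at once and leaves the input frame untouched. The toy frame below is illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({"Unit": ["1000 Ha", "1000 Ha"], "Value": [0.0, 822.52]})

# Scale the values and relabel the unit in one step; toy itself is not modified
converted = toy.assign(Value=toy["Value"] * 1000, Unit="Ha")
```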


In [93]:
forests = forests.drop(columns=["Unit", "Element"])
forests = forests.rename(columns={"Item":"Ecosystem"})
forests.Ecosystem = "Forest"
# Based on results from other datasets, we only keep data from 1995 and later :
forests = forests[forests.Year >= 1995]
forests

Unnamed: 0,Country Code,Country,Ecosystem,Year,Value
109,2,Afghanistan,Forest,1995,0.0
110,2,Afghanistan,Forest,1996,0.0
111,2,Afghanistan,Forest,1997,0.0
112,2,Afghanistan,Forest,1998,0.0
113,2,Afghanistan,Forest,1999,0.0
...,...,...,...,...,...
72932,5873,OECD,Forest,2011,822520.0
72933,5873,OECD,Forest,2012,822520.0
72934,5873,OECD,Forest,2013,822520.0
72935,5873,OECD,Forest,2014,822520.0
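
With the columns reduced to this common layout, the forest data can be stacked with another ecosystem table. The sketch below assumes a hypothetical `savanna_toy` frame with the same columns; how the Savanna data is actually combined is not shown in this section, so `pd.concat` is only one plausible option.

```python
import pandas as pd

columns = ["Country Code", "Country", "Ecosystem", "Year", "Value"]

forests_toy = pd.DataFrame([[2, "Afghanistan", "Forest", 1995, 0.0]], columns=columns)
# Hypothetical Savanna row with an invented value
savanna_toy = pd.DataFrame([[2, "Afghanistan", "Savanna", 1995, 10.0]], columns=columns)

# Stack both tables; the Ecosystem column identifies the origin of each row
combined = pd.concat([forests_toy, savanna_toy], ignore_index=True)
```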


Now we split the dataframe into three based on "Country": countries, regions and economic segments (filtered using the country code):

In [83]:
pd.Series.unique(forests[forests["Country Code"] >= 5000].Country)

array(['World', 'Africa', 'Eastern Africa', 'Middle Africa',
       'Northern Africa', 'Southern Africa', 'Western Africa', 'Americas',
       'Northern America', 'Central America', 'Caribbean',
       'South America', 'Asia', 'Central Asia', 'Eastern Asia',
       'Southern Asia', 'South-Eastern Asia', 'Western Asia', 'Europe',
       'Eastern Europe', 'Northern Europe', 'Southern Europe',
       'Western Europe', 'Oceania', 'Australia & New Zealand',
       'Melanesia', 'Micronesia', 'Polynesia', 'European Union',
       'Least Developed Countries', 'Land Locked Developing Countries',
       'Small Island Developing States',
       'Low Income Food Deficit Countries',
       'Net Food Importing Developing Countries', 'Annex I countries',
       'Non-Annex I countries', 'OECD'], dtype=object)

In [104]:
countries = forests[forests["Country Code"] < 5000].drop(columns="Country Code")
regions_eco = forests[forests["Country Code"] >= 5000]
regions = regions_eco[regions_eco["Country Code"] < 5800].rename(columns={"Country":"Region"}).drop(columns="Country Code")
economical_segment = regions_eco[regions_eco["Country Code"] >= 5800].rename(columns={"Country":"Segment"}).drop(columns="Country Code")
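
The same split can be wrapped in a small function so that the thresholds are stated once. This is a sketch: the function name `split_by_code` is hypothetical, the 5000 and 5800 thresholds are taken from the cell above, and the toy rows (including the code used for "World") are illustrative.

```python
import pandas as pd


def split_by_code(df, region_min=5000, segment_min=5800):
    """Split a frame into countries, regions and segments using the country code."""
    code = df["Country Code"]
    countries = df[code < region_min].drop(columns="Country Code")
    regions = (
        df[(code >= region_min) & (code < segment_min)]
        .rename(columns={"Country": "Region"})
        .drop(columns="Country Code")
    )
    segments = (
        df[code >= segment_min]
        .rename(columns={"Country": "Segment"})
        .drop(columns="Country Code")
    )
    return countries, regions, segments


toy = pd.DataFrame({
    "Country Code": [2, 5000, 5873],
    "Country": ["Afghanistan", "World", "OECD"],
    "Value": [0.0, 13029740.8, 822520.0],
})
countries_toy, regions_toy, segments_toy = split_by_code(toy)
```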

In [105]:
countries

Unnamed: 0,Country,Ecosystem,Year,Value
109,Afghanistan,Forest,1995,0.0
110,Afghanistan,Forest,1996,0.0
111,Afghanistan,Forest,1997,0.0
112,Afghanistan,Forest,1998,0.0
113,Afghanistan,Forest,1999,0.0
...,...,...,...,...
64266,Zimbabwe,Forest,2011,312400.0
64267,Zimbabwe,Forest,2012,312400.0
64268,Zimbabwe,Forest,2013,312400.0
64269,Zimbabwe,Forest,2014,312400.0


In [106]:
regions

Unnamed: 0,Region,Ecosystem,Year,Value
64510,World,Forest,1995,13029740.8
64511,World,Forest,1996,13029740.8
64512,World,Forest,1997,13029740.8
64513,World,Forest,1998,13029740.8
64514,World,Forest,1999,13029740.8
...,...,...,...,...
71060,European Union,Forest,2011,287412.4
71061,European Union,Forest,2012,287412.4
71062,European Union,Forest,2013,287412.4
71063,European Union,Forest,2014,287412.4


In [107]:
economical_segment

Unnamed: 0,Segment,Ecosystem,Year,Value
71278,Least Developed Countries,Forest,1995,3225282.0
71279,Least Developed Countries,Forest,1996,3225282.0
71280,Least Developed Countries,Forest,1997,3225282.0
71281,Least Developed Countries,Forest,1998,3225282.0
71282,Least Developed Countries,Forest,1999,3225282.0
...,...,...,...,...
72932,OECD,Forest,2011,822520.0
72933,OECD,Forest,2012,822520.0
72934,OECD,Forest,2013,822520.0
72935,OECD,Forest,2014,822520.0
