# Agri-Food CO2 Emissions: Exploratory Data Analysis with Python

Emissions, particularly greenhouse gas emissions, have become one of the most critical global challenges of our time. The burning of fossil fuels, deforestation, industrial processes, and agriculture are among the primary sources of emissions, releasing vast amounts of carbon dioxide (CO2), methane (CH4), nitrous oxide (N2O), and other greenhouse gases into the atmosphere. As these gases accumulate, they trap heat, leading to global warming and climate change.

The importance of addressing emissions lies in their far-reaching and detrimental impacts on the environment, human health, and economies worldwide. Climate change is already causing more frequent and severe extreme weather events, such as hurricanes, heatwaves, and droughts. Rising sea levels are threatening coastal communities and island nations, while changing weather patterns are disrupting agriculture and food security. Moreover, increasing temperatures contribute to the spread of infectious diseases and exacerbate air pollution, resulting in health problems for millions of people. In particular, analysing agri-food CO2 emissions data is essential to understanding and addressing the environmental impact of the agriculture and food production sectors.

## Flash Description

Although global emissions and the average temperature increase are highly correlated in aggregate, the correlation vanishes at the local level, because climate change is inherently a global phenomenon.

## Features available in the Dataset

- Savanna fires: Emissions from fires in savanna ecosystems.
- Forest fires: Emissions from fires in forested areas.
- Crop Residues: Emissions from burning or decomposing leftover plant material after crop harvesting.
- Rice Cultivation: Emissions from methane released during rice cultivation.
- Drained organic soils (CO2): Emissions from carbon dioxide released when draining organic soils.
- Pesticides Manufacturing: Emissions from the production of pesticides.
- Food Transport: Emissions from transporting food products.
- Forestland: Land covered by forests.
- Net Forest conversion: Change in forest area due to deforestation and afforestation.
- Food Household Consumption: Emissions from food consumption at the household level.
- Food Retail: Emissions from the operation of retail establishments selling food.
- On-farm Electricity Use: Electricity consumption on farms.
- Food Packaging: Emissions from the production and disposal of food packaging materials.
- Agrifood Systems Waste Disposal: Emissions from waste disposal in the agrifood system.
- Food Processing: Emissions from processing food products.
- Fertilizers Manufacturing: Emissions from the production of fertilizers.
- IPPU: Emissions from industrial processes and product use.
- Manure applied to Soils: Emissions from applying animal manure to agricultural soils.
- Manure left on Pasture: Emissions from animal manure on pasture or grazing land.
- Manure Management: Emissions from managing and treating animal manure.
- Fires in organic soils: Emissions from fires in organic soils.
- Fires in humid tropical forests: Emissions from fires in humid tropical forests.
- On-farm energy use: Energy consumption on farms.
- Rural population: Number of people living in rural areas.
- Urban population: Number of people living in urban areas.
- Total Population - Male: Total number of male individuals in the population.
- Total Population - Female: Total number of female individuals in the population.
- total_emission: Total greenhouse gas emissions from various sources.
- Average Temperature °C: The average temperature increase (by year) in degrees Celsius.
- Total_population: Total number of people living in a specific country, calculated by adding the rural and urban populations.
- Emissions_pro_capita: Country emissions divided by the total population.

## Importing
Importing the data and adding two more useful columns for further analysis.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import pycountry_convert as pc

In [2]:
df = pd.read_csv('../Datasets/Agrofood_co2_emission.csv')
df.shape

(6965, 31)

In [3]:
df['Total_population'] = df['Urban population'] + df['Rural population']
df['Emissions_pro_capita'] = df['total_emission'] / df['Total_population']

# Cleaning Data
To maintain temporal cohesion, and because some data are inconsistent across countries, I kept only the Areas with at least 29 records and dropped the first two years of the dataset (1990 and 1991). I also removed the Area 'China, mainland', since it is a near copy of the 'China' Area. I then checked for NaN values and found that the main features were complete, while some of the more specific ones were not; this will be taken into account during the EDA. As an example of the latter, I searched for the Areas where the 'Forestland' value is missing.

In [4]:
state_counts = df['Area'].value_counts()
states_to_keep = state_counts[state_counts >= 29].index.tolist()

df = df[df['Area'].isin(states_to_keep)].query('Year != 1990 & Year != 1991').query('Area != "China, mainland"')
df.shape, df['Area'].unique().size, df.isna().sum()

((6264, 33),
 216,
 Area                                  0
 Year                                  0
 Savanna fires                        29
 Forest fires                         87
 Crop Residues                      1286
 Rice Cultivation                      0
 Drained organic soils (CO2)           0
 Pesticides Manufacturing              0
 Food Transport                        0
 Forestland                          464
 Net Forest conversion               464
 Food Household Consumption          415
 Food Retail                           0
 On-farm Electricity Use               0
 Food Packaging                        0
 Agrifood Systems Waste Disposal       0
 Food Processing                       0
 Fertilizers Manufacturing             0
 IPPU                                696
 Manure applied to Soils             848
 Manure left on Pasture                0
 Manure Management                   848
 Fires in organic soils                0
 Fires in humid tropical forests     1

In [5]:
df.loc[df['Forestland'].isna(), 'Area'].value_counts()

Anguilla                     29
Antigua and Barbuda          29
Bermuda                      29
British Virgin Islands       29
Channel Islands              29
China, Hong Kong SAR         29
China, Macao SAR             29
China, Taiwan Province of    29
Isle of Man                  29
Kiribati                     29
North Macedonia              29
Palau                        29
Palestine                    29
Saint Kitts and Nevis        29
United Arab Emirates         29
Vanuatu                      29
Name: Area, dtype: int64
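
The same check can be run on every column at once. The following is a minimal sketch, not part of the original notebook: `missing_by_area` is a hypothetical helper, and the small `demo` frame is made-up data standing in for `df`.

```python
import numpy as np
import pandas as pd


def missing_by_area(frame: pd.DataFrame, area_col: str = "Area") -> pd.DataFrame:
    """Count NaN values per area for every other column (hypothetical helper)."""
    # isna() gives a boolean frame; summing it within each area counts the gaps.
    counts = frame.drop(columns=area_col).isna().groupby(frame[area_col]).sum()
    # Keep only the areas and columns that have at least one missing value.
    return counts.loc[counts.any(axis=1), counts.any(axis=0)]


# Made-up data standing in for the real dataset.
demo = pd.DataFrame({
    "Area": ["A", "A", "B", "B"],
    "Forestland": [np.nan, np.nan, 1.0, 2.0],
    "IPPU": [1.0, np.nan, 3.0, 4.0],
    "Food Retail": [1.0, 2.0, 3.0, 4.0],
})
summary = missing_by_area(demo)
print(summary)
```

On the real data, `missing_by_area(df)` would presumably show at a glance which Areas lack which features, instead of checking one column at a time.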

# Exploratory Data Analysis
The EDA has been divided into steps for simplicity and to make the values under consideration easier to follow.
1. Initial view of the Areas, observing the movement of emissions over time, max emissions, max emissions per capita and average temperature in Celsius.
2. Group by continent with a new column and plot the results with some considerations.
3. Create a DataFrame with year-over-year differences in order to compare the changing environment, and plot the results.
4. Box plot and outlier analysis to identify Areas which were consistently outside the standard behaviour.
5. Correlation matrices and heatmaps, in particular one with the highest types of emission per Area, with colours corresponding to the position in the ranking.
6. Tips for further analysis and model predictions.

### First Glance

In [67]:
df.head(5)

Unnamed: 0,Area,Year,Savanna fires,Forest fires,Crop Residues,Rice Cultivation,Drained organic soils (CO2),Pesticides Manufacturing,Food Transport,Forestland,...,On-farm energy use,Rural population,Urban population,Total Population - Male,Total Population - Female,total_emission,Average Temperature °C,Total_population,Emissions_pro_capita,Continent
2,Afghanistan,1992,14.7237,0.0557,196.5341,686.0,0.0,11.712073,53.317,-2388.803,...,,10995568.0,2985663.0,6028494.0,6028939.0,2356.304229,-0.259583,13981231.0,0.000169,Asia
3,Afghanistan,1993,14.7237,0.0557,230.8175,686.0,0.0,11.712073,54.3617,-2388.803,...,,11858090.0,3237009.0,7003641.0,7000119.0,2368.470529,0.101917,15095099.0,0.000157,Asia
4,Afghanistan,1994,14.7237,0.0557,242.0494,705.6,0.0,11.712073,53.9874,-2388.803,...,,12690115.0,3482604.0,7733458.0,7722096.0,2500.768729,0.37225,16172719.0,0.000155,Asia
5,Afghanistan,1995,14.7237,0.0557,243.8152,666.4,0.0,11.712073,54.6445,-2388.803,...,,13401971.0,3697570.0,8219467.0,8199445.0,2624.612529,0.285583,17099541.0,0.000153,Asia
6,Afghanistan,1996,38.9302,0.2014,249.0364,686.0,0.0,11.712073,53.1637,-2388.803,...,,13952791.0,3870093.0,8569175.0,8537421.0,2838.921329,0.036583,17822884.0,0.000159,Asia


In [58]:
grouped_df = df.groupby('Year')['total_emission'].sum().reset_index()

fig = px.line(grouped_df, x='Year', y='total_emission', title='Total CO2 Emissions between 1992 and 2020', markers=True, template='plotly_dark')
fig.show()

In [6]:
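def barchart(df, year: int, num: int, x: str, y: str):
    # Top `num` Areas by `y` in the given year, restricted to populations above one million
    fig = px.bar(df.query(f'Year == {year} & Total_population > 1000000').nlargest(num, y),
                 x=x, y=y,
                 template='plotly_dark',
                 color=y,
                 # `y` is numeric, so the bars are coloured with a continuous scale
                 color_continuous_scale=px.colors.sequential.Oranges)
    fig.update_coloraxes(showscale=False)
    fig.show()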

In [7]:
barchart(df, 2020, 20, 'Area', 'total_emission')
barchart(df, 2020, 20, 'Area', 'Emissions_pro_capita')

In [8]:
px.scatter(df, x='Average Temperature °C',
           y='total_emission',
           size="Total_population",
           title="Total CO2 Emission & Temperature - population",
           template="plotly_dark",
           color='Year')

### Country Level Analysis
The plots below show that, although Asia contributes the largest share of total global emissions, Europe is the continent most affected by changes in temperature. This underlines the lack of correlation between an Area's emissions and its temperature increase: the phenomenon is global, not local.

In [9]:
def country_to_continent(country_name):
    # Returns None when the name or its continent cannot be resolved
    try:
        alpha2 = pc.country_name_to_country_alpha2(country_name)
        continent_code = pc.country_alpha2_to_continent_code(alpha2)
        continent_name = pc.convert_continent_code_to_continent_name(continent_code)
        return continent_name
    except (KeyError, TypeError):
        return None

df['Continent'] = df['Area'].apply(country_to_continent)
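
Because the lookup returns `None` for any name it cannot resolve, some Areas may silently drop out of the continent-level plots. The sketch below is an illustration under stated assumptions, not the notebook's method: `add_continent` and its `overrides` argument are hypothetical, and `toy_lookup` is a stand-in for `country_to_continent` so that the example runs without `pycountry_convert`.

```python
import pandas as pd


def add_continent(frame, lookup, overrides=None):
    """Return a copy with a 'Continent' column plus the list of unmapped areas."""
    overrides = overrides or {}
    out = frame.copy()
    # Resolve each distinct area once instead of once per row.
    mapping = {area: overrides.get(area) or lookup(area) for area in out["Area"].unique()}
    out["Continent"] = out["Area"].map(mapping)
    unmapped = sorted(area for area, continent in mapping.items() if continent is None)
    return out, unmapped


def toy_lookup(name):
    """Stand-in for country_to_continent; it only knows two entries."""
    return {"Italy": "Europe", "Japan": "Asia"}.get(name)


demo = pd.DataFrame({"Area": ["Italy", "Japan", "Channel Islands", "Italy"]})
_, missing = add_continent(demo, toy_lookup)
patched, still_missing = add_continent(demo, toy_lookup, overrides={"Channel Islands": "Europe"})
print(missing, still_missing)
```

Listing the unmapped names first makes it possible to decide, case by case, whether to patch them by hand or to leave them out of the continent plots.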

In [10]:
px.scatter(df, x='Average Temperature °C',
           y='total_emission',
           size="Total_population",
           title="Total CO2 Emission & Temperature - continent",
           template="plotly_dark",
           color='Continent')

In [11]:
fig = (px.pie(df.query('Year == 2020 & Continent.isnull() == False'), 
                values='total_emission', 
                names='Continent', 
                title='Total Emissions per Continent in 2020',
                template="plotly_dark",
                hole=.2,
                color_discrete_sequence=px.colors.qualitative.Plotly))
fig.show()

In [48]:
fig = (px.violin(df, x="Average Temperature °C", 
                    color="Continent", template="plotly_dark",
                    color_discrete_sequence=px.colors.qualitative.Plotly))
fig.show()

In [13]:
fig = (px.violin(df, x="total_emission", 
                    color="Continent", template="plotly_dark",
                    color_discrete_sequence=px.colors.qualitative.Plotly))
fig.show()

### Differences and Population view

In [14]:
groups = df.groupby('Area')
areas = df.Area.unique()
diff_df = pd.DataFrame()

for a in areas:
    area = groups.get_group(a).copy()
    area['diff_emission'] = area['total_emission'].diff()
    area['diff_pop'] = area['Total_population'].diff()
    area['diff_temp'] = area['Average Temperature °C'].diff()
    diff_df = pd.concat([diff_df, area[['Year', 'Total_population', 'diff_pop', 'diff_emission', 'diff_temp', 'Average Temperature °C']]], ignore_index=True)

diff_df = diff_df.dropna().reset_index(drop=True)
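
The per-Area loop above can also be written without explicit iteration. This is a minimal sketch of that alternative, assuming the rows should be ordered by year within each Area before differencing; `yearly_differences` is a hypothetical helper and the `demo` values are made up.

```python
import pandas as pd


def yearly_differences(frame, columns, area_col="Area", year_col="Year"):
    """Year-over-year change of each column, computed within each area."""
    # Sort first so that diff() always subtracts the previous year.
    ordered = frame.sort_values([area_col, year_col])
    diffs = ordered.groupby(area_col)[columns].diff().add_prefix("diff_")
    # The first year of every area has no predecessor, so its diffs are NaN.
    combined = pd.concat([ordered[[area_col, year_col]], diffs], axis=1)
    return combined.dropna().reset_index(drop=True)


demo = pd.DataFrame({
    "Area": ["A", "A", "A", "B", "B"],
    "Year": [1992, 1993, 1994, 1992, 1993],
    "total_emission": [10.0, 12.0, 11.0, 100.0, 90.0],
})
result = yearly_differences(demo, ["total_emission"])
print(result)
```

Unlike the loop, this version keeps the `Area` column in the result, which makes it easier to trace a given difference back to its country.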


In [15]:
px.scatter(diff_df, diff_df["diff_temp"],
           diff_df["diff_emission"], 
           size= "Total_population", 
           title = "Change in CO2 Emission & Temperature - population", 
           template="plotly_dark",
           color='Year')

In [16]:
px.scatter(df, x='Total_population',
           y='total_emission',
           size="Total_population",
           title="CO2 Emission & Population",
           template="plotly_dark",
           color='Year',
           color_continuous_scale=px.colors.sequential.Agsunset)

### Outliers Analysis

In [17]:
fig = px.box(diff_df, x="Year",
             y="Average Temperature °C",
             color="Year",
             color_discrete_sequence=px.colors.diverging.BrBG,
             template="plotly_dark",
             title='Average temperature distribution by years')
fig.show()

In [18]:
def area_outliers(fig):
    """Share of outlier observations per Area, based on the box plot traces."""
    outliers_low = []
    outliers_high = []
    # Each trace of the box plot holds the temperatures of one year
    for trace in fig.data:
        data = trace.y
        q1 = np.percentile(data, 25)
        q3 = np.percentile(data, 75)
        min_dist = q1 - 1.5 * (q3 - q1)
        max_dist = q3 + 1.5 * (q3 - q1)

        for value in data:
            # The Area is recovered by matching the temperature value in df;
            # if two Areas share the same value, the first match is used
            if value < min_dist:
                outliers_low.append(df.loc[df['Average Temperature °C'] == value, 'Area'].iloc[0])
            elif value > max_dist:
                outliers_high.append(df.loc[df['Average Temperature °C'] == value, 'Area'].iloc[0])

    # ANSI colours: green for low outliers, red for high outliers
    outliers_low = pd.Series(['\033[92m' + o + '\033[0m' for o in outliers_low])
    outliers_high = pd.Series(['\033[91m' + o + '\033[0m' for o in outliers_high])
    outliers = pd.concat([outliers_low, outliers_high], axis=0)

    # Note: the denominator is the number of points in the last trace
    return round(outliers.value_counts() / len(data) * 100, 2)

area_outliers(fig).head(10)


Finland (high)                      4.63
Estonia (high)                      3.70
Austria (high)                      3.24
Latvia (high)                       3.24
Belarus (high)                      2.78
Zimbabwe (low)                      2.78
Saint Pierre and Miquelon (low)     2.78
Mongolia (high)                     2.78
Botswana (low)                      2.78
Russian Federation (high)           2.78
dtype: float64
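
The outlier search can also be done directly on the DataFrame, which keeps the Area label attached to each observation instead of recovering it from the temperature value. The sketch below is one possible way to do this, not the notebook's method: `iqr_outliers` is a hypothetical helper, it uses the same 1.5 × IQR rule, and the `demo` values are made up.

```python
import numpy as np
import pandas as pd


def iqr_outliers(frame, value_col, group_col="Year", label_col="Area"):
    """Return the rows lying outside the 1.5 * IQR whiskers of their own group."""
    grouped = frame.groupby(group_col)[value_col]
    # Broadcast each group's quartiles back to its rows.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    low = frame[value_col] < q1 - 1.5 * iqr
    high = frame[value_col] > q3 + 1.5 * iqr
    out = frame[[group_col, label_col, value_col]].copy()
    out["side"] = np.select([low, high], ["low", "high"], default="inside")
    return out[out["side"] != "inside"]


demo = pd.DataFrame({
    "Year": [2000] * 5 + [2001] * 5,
    "Area": list("ABCDE") * 2,
    "Average Temperature °C": [1.0, 2.0, 3.0, 4.0, 100.0, 1.0, 2.0, 3.0, 4.0, -50.0],
})
flags = iqr_outliers(demo, "Average Temperature °C")
# Share of years in which each area is an outlier, in percent.
share = flags.groupby("Area").size() / demo["Year"].nunique() * 100
print(flags)
print(share)
```

Dividing by the number of years expresses the result as "percentage of years in which the Area was an outlier", which is a different denominator from the one used in `area_outliers`.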

### Correlations and Hints for further Predictions
The first heatmap suggests that the correlation between emissions and temperature is very strong at the aggregate level. The second one shows, for each Area, its rank in each type of emission, with colours corresponding to the position in the ranking. The last one shows the correlation between all the variables in the DataFrame; it should be read as a hint for future models, since it highlights possible multicollinearity and omitted-variable problems that would distort a model.

In [19]:
correlation = df.groupby(["Year"]).agg({"total_emission":"sum", "Average Temperature °C":"mean", "Total_population":"sum"}).corr()

fig = px.imshow(correlation, color_continuous_scale=px.colors.sequential.Emrld, zmin=0)
fig.show()
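
To contrast the aggregate view with the local one, the correlation can be computed both on yearly totals and within each Area. The following is a minimal sketch on made-up numbers, built so that the two views disagree; `aggregate_vs_local_corr` is a hypothetical helper, and the demo says nothing about the real dataset.

```python
import pandas as pd


def aggregate_vs_local_corr(frame, x, y, area_col="Area", year_col="Year"):
    """Correlation of x and y on yearly aggregates versus within each area."""
    # Aggregate view: sum x and average y per year.
    yearly = frame.groupby(year_col).agg({x: "sum", y: "mean"})
    aggregate = yearly[x].corr(yearly[y])
    # Local view: one Pearson correlation per area, over its own years.
    local = pd.Series({area: g[x].corr(g[y]) for area, g in frame.groupby(area_col)})
    return aggregate, local


demo = pd.DataFrame({
    "Area": ["A"] * 3 + ["B"] * 3,
    "Year": [1992, 1993, 1994] * 2,
    "total_emission": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "Average Temperature °C": [0.3, 0.2, 0.1, 0.1, 0.6, 1.1],
})
aggregate, local = aggregate_vs_local_corr(demo, "total_emission", "Average Temperature °C")
print(round(aggregate, 3))
print(local.round(3))
```

In this toy case the aggregate correlation is positive while Area `A` on its own shows a negative one, which is the kind of gap between global and local behaviour described earlier.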

In [64]:
emission_kind = ('Forest fires',
                 #'Food Transport', 'Food Household Consumption', 'Food Retail','Food Packaging','Food Processing',
                 'Pesticides Manufacturing', 'Fertilizers Manufacturing', 'IPPU', 'On-farm Electricity Use', 'On-farm energy use',
                 'Manure applied to Soils', 'Manure left on Pasture', 'Manure Management',
                 'total_emission')

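heat_df = pd.DataFrame(columns=['Area'])

for e in emission_kind:
    # Sum each emission type per Area, using only rows with more than 50 million people
    grouped_df = df.query('Total_population > 50000000').groupby('Area')[e].sum().reset_index()
    # Keep the 20 largest emitters and replace the value with the rank (1 = largest)
    kind = grouped_df.nlargest(20, e).reset_index(drop=True)
    kind[e] = kind.index + 1
    heat_df = pd.merge(heat_df, kind, on='Area', how='outer')
heat_df.set_index('Area', inplace=True)
heat_df.sort_index(inplace=True)

# Areas outside the top 20 for a given emission type get a filler rank of 30
heat_df.fillna(30, inplace=True)
# Order the rows by the sum of their ranks, lowest sum (largest emitters) first
heat_df['Row_Sum'] = heat_df.sum(axis=1)
heat_df = heat_df.sort_values('Row_Sum', ascending=True)
heat_df.drop('Row_Sum', axis=1, inplace=True)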

fig = px.imshow(heat_df, color_continuous_scale=px.colors.sequential.Blackbody, aspect="auto")
fig.update_layout(width=1000, height=1000)
fig.update_xaxes(side="top")
fig.update_coloraxes(showscale=False)
fig.show()
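
The ranking table behind this heatmap can also be built with `DataFrame.rank` instead of a merge loop. This is a rough sketch under assumptions: `rank_table` is a hypothetical helper, `totals` stands for a frame of per-Area sums (one column per emission type), and the demo numbers are invented.

```python
import pandas as pd


def rank_table(totals, top=20, filler=30):
    """Rank areas within each column (1 = largest); ranks beyond `top` become `filler`."""
    # method="first" breaks ties by order of appearance.
    ranks = totals.rank(ascending=False, method="first")
    ranks = ranks.where(ranks <= top, filler)
    # Areas that never reach the top list in any column are dropped.
    ranks = ranks[(ranks != filler).any(axis=1)]
    # Order rows by the sum of their ranks, lowest sum first.
    return ranks.loc[ranks.sum(axis=1).sort_values(kind="stable").index]


# Invented per-area totals for two emission types.
totals = pd.DataFrame({"x": [3.0, 1.0, 2.0], "y": [10.0, 30.0, 20.0]}, index=["A", "B", "C"])
table = rank_table(totals, top=2, filler=5)
print(table)
```

With the real data, `totals` would presumably come from something like `df.query('Total_population > 50000000').groupby('Area')[list(emission_kind)].sum()`.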

In [66]:
# numeric_only=True skips text columns such as 'Area' and 'Continent'
fig = px.imshow(df.corr(numeric_only=True).round(2), color_continuous_scale=px.colors.sequential.Agsunset, zmin=-1, zmax=1, text_auto=True, aspect='auto')
fig.update_layout(width=1000, height=1000)
fig.show()
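
As a first screen for multicollinearity, the strongly correlated pairs can be pulled out of the matrix rather than read off the heatmap. The sketch below is only an illustration: `correlated_pairs` is a hypothetical helper, the 0.9 threshold is an arbitrary assumption, and the `demo` frame is made up.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.9):
    """List column pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = frame.corr(numeric_only=True)
    # Keep the upper triangle only, so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() > threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


demo = pd.DataFrame({
    "Area": ["w", "x", "y", "z"],
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
    "d": [1.0, -1.0, -1.0, 1.0],
})
pairs = correlated_pairs(demo)
print(pairs)
```

Pairs flagged this way are candidates for dropping or combining before fitting a regression model; a high pairwise correlation is a warning sign, not a full multicollinearity diagnosis.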

### Thanks for Reading!