# AKONA CIKO: Agricultural Dataset Project

## Table of Contents

1. Project Overview
2. Dataset details
3. Packages & Libraries
4. Data Pre-processing & Processing
5. Exploratory Data Analysis (EDA) & Statistics
6. Data Visualization
7. Results & Insights 
8. Project Manager / Contributor

#### 1. Project Overview:

The "Carbon Dioxide Emission from Agricultural Sector" project aims to analyse and quantify the CO2 emissions generated by various agricultural activities. The dataset provides insight into the environmental impact of agriculture on climate change, focusing on key emission sources such as crop production, livestock, and energy use in the agri-food sector. The analysis is intended to support environmental consultants, policymakers, and agricultural businesses in developing sustainable strategies for emission reduction. By identifying trends and evaluating emission patterns, the project aims to produce actionable recommendations to mitigate the agricultural sector's carbon footprint while balancing food production demands.

#### 2. Dataset details:

The dataset used in this analysis (co2_emissions_from_agri.csv) was collected from an authorized source. It is a valuable resource for environmental consultants, analysts, policymakers, agricultural businesses, and enthusiasts interested in understanding the effect of CO2 emissions from the agri-food sector on climate change.

#### 3. Packages & Libraries:

##### Import relevant libraries

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm
%matplotlib inline 

#### 4. Data Pre-processing & Processing:

##### Ignore warning(s) to run a clean Notebook

In [2]:
# Suppress warnings, in case any pop up:

import warnings 
warnings.filterwarnings('ignore')

In [3]:
agric_set = pd.read_csv("co2_emissions_from_agri.csv")

In [4]:
agric_set.head(5)

Unnamed: 0,Area,Year,Savanna fires,Forest fires,Crop Residues,Rice Cultivation,Drained organic soils (CO2),Pesticides Manufacturing,Food Transport,Forestland,...,Manure Management,Fires in organic soils,Fires in humid tropical forests,On-farm energy use,Rural population,Urban population,Total Population - Male,Total Population - Female,total_emission,Average Temperature °C
0,Afghanistan,1990,14.7237,0.0557,205.6077,686.0,0.0,11.807483,63.1152,-2388.803,...,319.1763,0.0,0.0,,9655167.0,2593947.0,5348387.0,5346409.0,2198.963539,0.536167
1,Afghanistan,1991,14.7237,0.0557,209.4971,678.16,0.0,11.712073,61.2125,-2388.803,...,342.3079,0.0,0.0,,10230490.0,2763167.0,5372959.0,5372208.0,2323.876629,0.020667
2,Afghanistan,1992,14.7237,0.0557,196.5341,686.0,0.0,11.712073,53.317,-2388.803,...,349.1224,0.0,0.0,,10995568.0,2985663.0,6028494.0,6028939.0,2356.304229,-0.259583
3,Afghanistan,1993,14.7237,0.0557,230.8175,686.0,0.0,11.712073,54.3617,-2388.803,...,352.2947,0.0,0.0,,11858090.0,3237009.0,7003641.0,7000119.0,2368.470529,0.101917
4,Afghanistan,1994,14.7237,0.0557,242.0494,705.6,0.0,11.712073,53.9874,-2388.803,...,367.6784,0.0,0.0,,12690115.0,3482604.0,7733458.0,7722096.0,2500.768729,0.37225


In [5]:
agric_set.tail(7)

Unnamed: 0,Area,Year,Savanna fires,Forest fires,Crop Residues,Rice Cultivation,Drained organic soils (CO2),Pesticides Manufacturing,Food Transport,Forestland,...,Manure Management,Fires in organic soils,Fires in humid tropical forests,On-farm energy use,Rural population,Urban population,Total Population - Male,Total Population - Female,total_emission,Average Temperature °C
6958,Zimbabwe,2014,1706.5216,166.6056,94.2907,7.3598,0.0,65.0,306.0107,648.0808,...,240.2096,0.0,0.0,520.7873,10402274.0,5009401.0,6508226.0,7347527.0,21653.810678,0.096333
6959,Zimbabwe,2015,2185.5313,287.166,72.4948,9.067,0.0,68.0,296.7483,648.0808,...,273.1156,0.0,0.0,485.3018,10667966.0,5109485.0,6652836.0,7502101.0,23932.067699,1.153417
6960,Zimbabwe,2016,1190.0089,232.5068,70.9451,7.4088,0.0,75.0,251.1465,76500.2982,...,282.5994,0.0,0.0,417.315,10934468.0,5215894.0,6796658.0,7656047.0,98491.026347,1.12025
6961,Zimbabwe,2017,1431.1407,131.1324,108.6262,7.9458,0.0,67.0,255.7975,76500.2982,...,255.59,0.0,0.0,398.1644,11201138.0,5328766.0,6940631.0,7810471.0,97159.311553,0.0465
6962,Zimbabwe,2018,1557.583,221.6222,109.9835,8.1399,0.0,66.0,327.0897,76500.2982,...,257.2735,0.0,0.0,465.7735,11465748.0,5447513.0,7086002.0,7966181.0,97668.308205,0.516333
6963,Zimbabwe,2019,1591.6049,171.0262,45.4574,7.8322,0.0,73.0,290.1893,76500.2982,...,267.5224,0.0,0.0,444.2335,11725970.0,5571525.0,7231989.0,8122618.0,98988.062799,0.985667
6964,Zimbabwe,2020,481.9027,48.4197,108.3022,7.9733,0.0,73.0,238.7639,76500.2982,...,266.7316,0.0,0.0,444.2335,11980005.0,5700460.0,7385220.0,8284447.0,96505.221853,0.189


In [7]:
# Trim leading and/or trailing spaces from the column names:
agric_set.columns = agric_set.columns.str.strip()
print(agric_set.columns)

Index(['Area', 'Year', 'Savanna fires', 'Forest fires', 'Crop Residues',
       'Rice Cultivation', 'Drained organic soils (CO2)',
       'Pesticides Manufacturing', 'Food Transport', 'Forestland',
       'Net Forest conversion', 'Food Household Consumption', 'Food Retail',
       'On-farm Electricity Use', 'Food Packaging',
       'Agrifood Systems Waste Disposal', 'Food Processing',
       'Fertilizers Manufacturing', 'IPPU', 'Manure applied to Soils',
       'Manure left on Pasture', 'Manure Management', 'Fires in organic soils',
       'Fires in humid tropical forests', 'On-farm energy use',
       'Rural population', 'Urban population', 'Total Population - Male',
       'Total Population - Female', 'total_emission',
       'Average Temperature °C'],
      dtype='object')
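
To illustrate what the strip step does, here is a minimal sketch on a made-up frame. The `demo` DataFrame and its padded column names are hypothetical and are not part of the dataset.

```python
import pandas as pd

# Hypothetical frame whose headers carry stray whitespace
demo = pd.DataFrame({" Area ": ["A"], "Year ": [1990], " total_emission": [1.0]})

# Strip leading/trailing whitespace from every column label
demo.columns = demo.columns.str.strip()

cleaned_columns = list(demo.columns)
print(cleaned_columns)
```

Stripping the labels up front means later lookups such as `agric_set['Year']` do not fail because of an invisible trailing space.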


In [8]:
# More info of the dataset:
agric_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6965 entries, 0 to 6964
Data columns (total 31 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Area                             6965 non-null   object 
 1   Year                             6965 non-null   int64  
 2   Savanna fires                    6934 non-null   float64
 3   Forest fires                     6872 non-null   float64
 4   Crop Residues                    5576 non-null   float64
 5   Rice Cultivation                 6965 non-null   float64
 6   Drained organic soils (CO2)      6965 non-null   float64
 7   Pesticides Manufacturing         6965 non-null   float64
 8   Food Transport                   6965 non-null   float64
 9   Forestland                       6472 non-null   float64
 10  Net Forest conversion            6472 non-null   float64
 11  Food Household Consumption       6492 non-null   float64
 12  Food Retail         

In [9]:
# Looking for Null values that appear on each feature or column:
agric_set.isnull().sum()

Area                                  0
Year                                  0
Savanna fires                        31
Forest fires                         93
Crop Residues                      1389
Rice Cultivation                      0
Drained organic soils (CO2)           0
Pesticides Manufacturing              0
Food Transport                        0
Forestland                          493
Net Forest conversion               493
Food Household Consumption          473
Food Retail                           0
On-farm Electricity Use               0
Food Packaging                        0
Agrifood Systems Waste Disposal       0
Food Processing                       0
Fertilizers Manufacturing             0
IPPU                                743
Manure applied to Soils             928
Manure left on Pasture                0
Manure Management                   928
Fires in organic soils                0
Fires in humid tropical forests     155
On-farm energy use                  956


In [14]:
# Percentage of features with null values:
null_counts = agric_set.isnull().sum()

total_columns = agric_set.shape[1]
null_features_count = (null_counts > 0).sum()
null_features_percentage = (null_features_count / total_columns) * 100

print(f"Percentage of features with Null values: {null_features_percentage:.2f}%")

Percentage of features with Null values: 35.48%
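
The figure above counts how many columns have gaps. It can also help to see how large each gap is: "Crop Residues", for example, is missing in 1389 of 6965 rows, roughly 19.9%. The following is a minimal sketch of a per-column percentage, using a hypothetical `toy` frame in place of `agric_set`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for agric_set
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, 3.0, 4.0],
})

# Share of missing rows per column, in percent
null_pct = toy.isnull().mean() * 100

# Keep only the columns that actually have gaps, largest first
null_pct = null_pct[null_pct > 0].sort_values(ascending=False)
print(null_pct.round(2))
```

Applied to the real data, the same two lines would rank the columns by how much imputation each one needs.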


##### Handling of missing values or blanks on identified features / columns:

In [15]:
# Skewness and Kurtosis technique for "Savanna fires" column:
skewness = agric_set['Savanna fires'].skew()
kurtosis = agric_set['Savanna fires'].kurt()

# Skewness near 0 indicates a roughly symmetric distribution (as with a normal distribution)
# Skewness > 0: right-skewed
# Skewness < 0: left-skewed
print(f"Skewness: {skewness}")
print(f"Kurtosis: {kurtosis}")

Skewness: 10.347116822388163
Kurtosis: 157.12347335007266
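
For reading these numbers: pandas `kurt()` reports excess kurtosis (Fisher's definition), so a normal sample scores close to 0 on both measures. The sketch below compares a symmetric and a right-skewed sample side by side. The `samples` frame is synthetic and only illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical samples: one roughly symmetric, one heavily right-skewed
samples = pd.DataFrame({
    "normal_like": rng.normal(loc=0.0, scale=1.0, size=5000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.5, size=5000),
})

# One row per column: skewness and excess kurtosis
# (both are near 0 for a normal sample)
shape_summary = pd.DataFrame({
    "skewness": samples.skew(),
    "kurtosis": samples.kurt(),
})
print(shape_summary)
```

The same pattern, `agric_set[cols].skew()` and `agric_set[cols].kurt()`, could summarise all columns with missing values in one table instead of one cell per column.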


In [20]:
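# The data is not normally distributed and is extremely right-skewed, so the median is a more robust choice for imputing missing values.
# Assign the result back rather than calling fillna(..., inplace=True) on a column selection, which may not update agric_set in recent pandas versions:
agric_set['Savanna fires'] = agric_set['Savanna fires'].fillna(agric_set['Savanna fires'].median())

# Verify that the imputation worked:
agric_set.isnull().sum()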

Area                                  0
Year                                  0
Savanna fires                         0
Forest fires                         93
Crop Residues                      1389
Rice Cultivation                      0
Drained organic soils (CO2)           0
Pesticides Manufacturing              0
Food Transport                        0
Forestland                          493
Net Forest conversion               493
Food Household Consumption          473
Food Retail                           0
On-farm Electricity Use               0
Food Packaging                        0
Agrifood Systems Waste Disposal       0
Food Processing                       0
Fertilizers Manufacturing             0
IPPU                                743
Manure applied to Soils             928
Manure left on Pasture                0
Manure Management                   928
Fires in organic soils                0
Fires in humid tropical forests     155
On-farm energy use                  956
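
The median imputation is repeated for several columns in this notebook. A small helper can express it once. This is a sketch only: `impute_median` is a hypothetical name introduced here, and the `toy` data is made up to show why the median resists extreme values.

```python
import numpy as np
import pandas as pd

def impute_median(df, columns):
    """Return a copy of df with NaNs in the given columns replaced by each column's median."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median())
    return out

# Hypothetical toy data: one extreme value drags the mean but not the median
toy = pd.DataFrame({"fires": [1.0, 2.0, 3.0, 1000.0, np.nan]})
filled = impute_median(toy, ["fires"])

print(filled)
print("mean of observed values:", toy["fires"].mean())
print("median of observed values:", toy["fires"].median())
```

Here the mean of the observed values is 251.5 while the median is 2.5, which is why the median is the safer fill value for heavily right-skewed columns.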


In [25]:
# Rounding to nearest integer:
agric_set['Savanna fires'] = agric_set['Savanna fires'].round()
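
One thing to keep in mind with this step: rounding to whole numbers discards small values, so a reading such as 0.0557 becomes 0. pandas also rounds exact halves to the nearest even number. The sketch below shows both effects on a hypothetical `values` series, alongside rounding to two decimals as a possible alternative.

```python
import pandas as pd

# Hypothetical values, including two taken from the head() output above
values = pd.Series([0.0557, 14.7237, 0.5, 1.5, 2.5])

# Rounding to the nearest integer (exact halves go to the nearest even number)
rounded_int = values.round()

# Rounding to two decimals keeps small magnitudes visible
rounded_2dp = values.round(2)

print(pd.DataFrame({"original": values, "integer": rounded_int, "two_dp": rounded_2dp}))
```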

In [28]:
from scipy.stats import shapiro

# Shapiro-Wilk test on the non-missing "Forest fires" values.
# Note: SciPy documents that for N > 5000 the p-value may not be accurate.
stat, p = shapiro(agric_set['Forest fires'].dropna())

# Interpret the result:
if p > 0.05:
    print('The data is normally distributed (Fail to reject H0)')
else:
    print('The data is not normally distributed (Reject H0)')

The data is not normally distributed (Reject H0)
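
SciPy's documentation notes that the Shapiro-Wilk p-value may not be accurate for samples larger than 5000, and "Forest fires" has 6872 non-null values. One possible cross-check, not used in the original analysis, is D'Agostino and Pearson's test via `scipy.stats.normaltest`. The sketch below runs it on synthetic right-skewed data. The `skewed` array is an assumption for illustration only.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(42)

# Hypothetical right-skewed sample, larger than 5000 observations
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=6000)

# D'Agostino-Pearson test: combines skewness and kurtosis into one statistic
stat, p = normaltest(skewed)
is_normal = p > 0.05

if is_normal:
    print("Fail to reject H0: no evidence against normality")
else:
    print("Reject H0: the sample does not look normally distributed")
```

With `agric_set['Forest fires'].dropna()` in place of `skewed`, this would give a second opinion alongside the Shapiro-Wilk result.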


In [35]:
# The data is NOT normally distributed, so median imputation is generally more appropriate (assigned back instead of using inplace=True):
agric_set['Forest fires'] = agric_set['Forest fires'].fillna(agric_set['Forest fires'].median())

In [38]:
# Rounding to nearest integer:
agric_set['Forest fires'] = agric_set['Forest fires'].round()

In [41]:
# Skewness and Kurtosis technique for "Crop Residues" column:
skewness = agric_set['Crop Residues'].skew()
kurtosis = agric_set['Crop Residues'].kurt()

# Skewness near 0 indicates a roughly symmetric distribution (as with a normal distribution)
# Skewness > 0: right-skewed
# Skewness < 0: left-skewed
print(f"Skewness: {skewness}")
print(f"Kurtosis: {kurtosis}")

Skewness: 6.105082438285656
Kurtosis: 39.239780020520875


In [44]:
# The data is NOT normally distributed and is right-skewed, so the median is a more robust choice for imputing missing values (assigned back instead of using inplace=True):
agric_set['Crop Residues'] = agric_set['Crop Residues'].fillna(agric_set['Crop Residues'].median())

In [46]:
# Rounding to nearest integer:
agric_set['Crop Residues'] = agric_set['Crop Residues'].round()

In [None]:
# Regression imputation for the "Forestland" column:

from sklearn.linear_model import LinearRegression

# Separate rows with missing values:
missing_forestland = agric_set['Forestland'].isnull()

# Regression model training on rows without missing values:
reg = LinearRegression()
reg.fit(agric_set.loc[~missing_forestland, ['Food Transport', 'Food Retail']], agric_set.loc[~missing_forestland, 'Forestland'])

# Predict missing values:
agric_set.loc[missing_forestland, 'Forestland'] = reg.predict(agric_set.loc[missing_forestland, ['Food Transport', 'Food Retail']])
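
Regression imputation is only as good as the fit behind it, so it can be worth checking how well the predictors explain the target before filling gaps. The sketch below does this on synthetic data. The `toy` frame, its columns `x1`, `x2`, `y`, and the coefficients are all hypothetical. In the notebook the predictors would be "Food Transport" and "Food Retail" and the target "Forestland".

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical data: y depends linearly on x1 and x2 plus small noise
toy = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
toy["y"] = 3.0 * toy["x1"] - 2.0 * toy["x2"] + rng.normal(scale=0.1, size=200)

# Blank out the first 20 targets to mimic missing values
toy.loc[toy.index[:20], "y"] = np.nan
missing = toy["y"].isnull()
predictors = ["x1", "x2"]

# Train on the rows where the target is observed
model = LinearRegression()
model.fit(toy.loc[~missing, predictors], toy.loc[~missing, "y"])

# R^2 on the observed rows: a low value suggests the predictors explain little
r2 = model.score(toy.loc[~missing, predictors], toy.loc[~missing, "y"])
print(f"R^2 on observed rows: {r2:.3f}")

# Fill only the rows that were missing
toy.loc[missing, "y"] = model.predict(toy.loc[missing, predictors])
```

If the R² on the real columns turned out to be low, a simpler method such as the median imputation used elsewhere in this notebook might be the more defensible choice.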

In [53]:
# Rounding off for consistency:
agric_set['Forestland'] = agric_set['Forestland'].round()

In [55]:
# Skewness and kurtosis for the "Net Forest conversion" column:
skewness = agric_set['Net Forest conversion'].skew()
kurtosis = agric_set['Net Forest conversion'].kurt()

# Skewness near 0 indicates a roughly symmetric distribution (as with a normal distribution)
# Skewness > 0: right-skewed
# Skewness < 0: left-skewed
print(f"Skewness: {skewness}")
print(f"Kurtosis: {kurtosis}")

Skewness: 11.77165033927154
Kurtosis: 157.78022563919293


In [58]:
# The data is NOT normally distributed and is right-skewed, so the median is a more robust choice for imputing missing values (assigned back instead of using inplace=True):
agric_set['Net Forest conversion'] = agric_set['Net Forest conversion'].fillna(agric_set['Net Forest conversion'].median())

# Rounding off for consistency:
agric_set['Net Forest conversion'] = agric_set['Net Forest conversion'].round()

In [60]:
# Skewness and kurtosis for the "Food Household Consumption" column:
skewness = agric_set['Food Household Consumption'].skew()
kurtosis = agric_set['Food Household Consumption'].kurt()

# Skewness near 0 indicates a roughly symmetric distribution (as with a normal distribution)
# Skewness > 0: right-skewed
# Skewness < 0: left-skewed
print(f"Skewness: {skewness}")
print(f"Kurtosis: {kurtosis}")

Skewness: 11.356973350891558
Kurtosis: 152.4137544204516


In [62]:
# The data is NOT normally distributed and is right-skewed, so the median is a more robust choice for imputing missing values (assigned back instead of using inplace=True):
agric_set['Food Household Consumption'] = agric_set['Food Household Consumption'].fillna(agric_set['Food Household Consumption'].median())

# Rounding off for consistency:
agric_set['Food Household Consumption'] = agric_set['Food Household Consumption'].round()

In [63]:
# index=False avoids writing the row index as an extra unnamed column:
agric_set.to_csv("Agric-sector_cleaned_1.csv", index=False)
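
When a frame is saved with the default settings, pandas writes the row index as an extra first column, and reading the file back produces a column named `Unnamed: 0`, like the one visible in the `head()` output earlier. The sketch below shows the round trip both ways using an in-memory buffer. The `toy` frame is hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned dataset
toy = pd.DataFrame({"Area": ["A", "B"], "Year": [1990, 1991]})

# Default to_csv writes the index as an extra, unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False leaves the index out of the file
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Reading the saved file back and comparing its shape and columns with `agric_set` is a quick way to confirm the export is what the later EDA steps expect.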

#### 5. Exploratory Data Analysis (EDA) & Statistics:

#### 6. Data Visualization:

#### 7. Results & Insights:

#### 8. Project Manager / Contributor:

Akona Ciko | Akona.Ciko@fnb.co.za