## Introduction

- **Significant Contribution of Agri-food Sector to Global Emissions:** The dataset emphasizes the substantial role of the agri-food sector in global CO2 emissions, highlighting the need for sustainable practices within this industry.
- **Interplay of Emissions, Climate Change, and Geography:** The analysis in the notebook examines the relationship between these factors, giving insight into regional variations and global trends.
- **Predictive Power of Machine Learning:** The notebook applies machine learning techniques to predicting temperature variations, showing how data-driven approaches can inform climate change mitigation strategies.


### Potential Areas of Further Exploration:

**Identifying High-Emission Regions and Practices:** Pinpoint specific geographic regions or agricultural practices that contribute disproportionately to CO2 emissions. Analyze the impact of different farming techniques (e.g., conventional vs. organic) on emissions.

**Quantifying the Impact of Climate Change on Agriculture:** Assess how climate change-induced factors like extreme weather events, temperature fluctuations, and altered precipitation patterns affect agricultural productivity and emissions. Model future scenarios to predict the potential impacts of climate change on different agricultural regions.

**Developing Sustainable Agricultural Strategies:** Explore the potential of innovative technologies (e.g., precision agriculture, carbon sequestration techniques) to reduce emissions and enhance agricultural sustainability. Analyze the economic and environmental implications of various sustainable practices to inform policy decisions.

**Policy Recommendations and International Cooperation:** Propose policy measures to incentivize sustainable agriculture and reduce emissions. Advocate for international cooperation to address global challenges related to agriculture and climate change.

### Leveraging Machine Learning for Deeper Insights:

**Time Series Analysis:** Analyze historical emission trends to identify patterns and seasonal variations. Forecast future emissions based on historical data and relevant factors.

**Clustering Analysis:** Group countries or regions with similar emission profiles to identify common characteristics and potential solutions.

**Anomaly Detection:** Identify outliers or unusual emission patterns that may indicate underlying issues or potential risks.
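
As one possible starting point for the anomaly-detection idea, the sketch below flags values that fall outside the interquartile-range (IQR) fences. The emission values, the helper name `iqr_outliers`, and the 1.5 multiplier are illustrative assumptions, not results from this dataset.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up yearly emissions for one hypothetical region
toy = pd.Series([100.0, 102.0, 98.0, 101.0, 99.0, 250.0],
                index=[2000, 2001, 2002, 2003, 2004, 2005])
mask = iqr_outliers(toy)
print(toy[mask])
```

A flagged value is only a candidate for review; whether it is a data error or a real event has to be checked against the source.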

By delving deeper into this dataset and employing advanced data analysis techniques, we can gain valuable insights that will inform evidence-based decision-making and contribute to the development of effective strategies to mitigate climate change.


## Data Collection and Description

***Dataset features***

- **Area:** Name of the country being analyzed (Text).
- **Year:** Year of the observation (Int).
- **Savanna fires:** Emissions from fires in savanna ecosystems.
- **Forest fires:** Emissions from fires in forested areas.
- **Crop Residues:** Emissions from burning or decomposing leftover plant material after crop harvesting.
- **Rice Cultivation:** Emissions from methane released during rice cultivation.
- **Drained organic soils (CO2):** Emissions from carbon dioxide released when draining organic soils.
- **Pesticides Manufacturing:** Emissions from the production of pesticides.
- **Food Transport:** Emissions from transporting food products.
- **Forestland:** Land covered by forests.
- **Net Forest conversion:** Change in forest area due to deforestation and afforestation.
- **Food Household Consumption:** Emissions from food consumption at the household level.
- **Food Retail:** Emissions from the operation of retail establishments selling food.
- **Manure Management:** Emissions from managing and treating animal manure.
- **On-farm Electricity Use:** Electricity consumption on farms.
- **Food Packaging:** Emissions from the production and disposal of food packaging materials.
- **Agrifood Systems Waste Disposal:** Emissions from waste disposal in the agrifood system.
- **Food Processing:** Emissions from processing food products.
- **Fertilizers Manufacturing:** Emissions from the production of fertilizers.
- **IPPU:** Emissions from industrial processes and product use.
- **Manure applied to Soils:** Emissions from applying animal manure to agricultural soils.
- **Manure left on Pasture:** Emissions from animal manure on pasture or grazing land.
- **Fires in organic soils:** Emissions from fires in organic soils.
- **Fires in humid tropical forests:** Emissions from fires in humid tropical forests.
- **On-farm energy use:** Energy consumption on farms.
- **Rural population:** Number of people living in rural areas.
- **Urban population:** Number of people living in urban areas.
- **Total Population - Male:** Total number of male individuals in the population.
- **Total Population - Female:** Total number of female individuals in the population.
- **total_emission:** Total greenhouse gas emissions from various sources.
- **Average Temperature °C:** Average yearly temperature increase, in degrees Celsius.

## Data Loading


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pycountry
import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import json
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_predict

#Data loading
df = pd.read_csv('co2_emissions_from_agri.csv')

#Display and explore data
df.head()

## Data Cleaning

In [None]:
# Show all columns when displaying the DataFrame
pd.set_option('display.max_columns', None)
df


In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.shape

In [None]:
# Count missing values per column
df.isnull().sum()

In [None]:
# dropna() returns a new DataFrame, so reassign to keep the result
df = df.dropna()

In [None]:
# drop_duplicates() also returns a new DataFrame, so reassign
df = df.drop_duplicates()
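
By default, `dropna()` and `drop_duplicates()` return new DataFrames rather than modifying the original, so the result has to be assigned for the cleaning to take effect. A small sketch on a made-up frame (the values and the name `toy` are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Area": ["A", "A", "B", "C"],
    "Year": [1990, 1990, 1990, 1990],
    "total_emission": [1.0, 1.0, np.nan, 3.0],
})

toy.dropna()               # result discarded: toy itself is unchanged
rows_before = len(toy)

# Chain the cleaning steps and keep the result
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)
print(rows_before, len(cleaned))
```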

## Exploratory Data Analysis

In [None]:
rice_by_year = df.groupby(['Year'], as_index=False)['Rice Cultivation'].sum().sort_values(by='Rice Cultivation', ascending=False)

sns.set(rc={'figure.figsize':(20,5)})
sns.barplot(data = rice_by_year, x = 'Year', y= 'Rice Cultivation')

In [None]:

manure_by_year = df.groupby(['Year'], as_index=False)['Manure applied to Soils'].sum().sort_values(by='Manure applied to Soils', ascending=False)
palette = sns.color_palette("hsv", len(manure_by_year))
sns.set(rc={'figure.figsize':(20,5)})
sns.barplot(data = manure_by_year, x = 'Year', y= 'Manure applied to Soils', hue='Year', palette=palette)

We also add a column for the total population, calculated as the sum of the male and female populations for each country and year.

In [None]:
df['total_population'] = df.loc[:,'Total Population - Male'] + df.loc[:,'Total Population - Female']

In [None]:
def normalize(frame):
    # Min-max scaling: map each column to the [0, 1] range
    return (frame - frame.min()) / (frame.max() - frame.min())

df_year = df.groupby('Year')[['total_emission','Average Temperature °C']].mean()
df_pop = df.groupby('Year')['total_population'].sum()
df_year['total_population'] = df_pop
df_year_norm = normalize(df_year)

df_year_norm.plot(figsize=(20, 12))
plt.title('Average CO2 Emission and Temperature rise yearly from 1990-2020', fontsize = 18)
plt.xlabel('Date from 1990 to 2020', fontsize = 18)
plt.ylabel('Normalized change of CO2 emission, temperature and population.', fontsize = 18)
plt.show()

The graph above suggests that agri-food CO2 emissions, temperature rise, and population growth move together over the period. These emissions account for only about one fifth of total CO2 emissions worldwide, but the same upward trend is visible in this dataset.
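
To make the normalization and the correlation reading concrete, here is a minimal sketch of min-max scaling, x' = (x - min) / (max - min), followed by a Pearson correlation. The yearly values and the names `min_max` and `toy_year` are invented for illustration; with the real data the coefficients would come from calling `.corr()` on the yearly aggregates.

```python
import pandas as pd

def min_max(frame: pd.DataFrame) -> pd.DataFrame:
    """Scale every column to the [0, 1] range."""
    return (frame - frame.min()) / (frame.max() - frame.min())

toy_year = pd.DataFrame(
    {"total_emission": [10.0, 12.0, 15.0, 20.0],
     "temperature": [0.4, 0.5, 0.9, 1.2]},
    index=[1990, 2000, 2010, 2020],
)
toy_norm = min_max(toy_year)

# Pearson correlation is unchanged by min-max scaling
r_raw = toy_year["total_emission"].corr(toy_year["temperature"])
r_norm = toy_norm["total_emission"].corr(toy_norm["temperature"])
print(toy_norm)
print(round(r_raw, 3), round(r_norm, 3))
```

Scaling only makes the curves comparable on one axis; a high coefficient between two trending series does not by itself establish a causal link.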

In [None]:
# Mean temperature change per year (a sum would depend on how many countries report in each year)
temp_by_year = df.groupby(['Year'], as_index=False)['Average Temperature °C'].mean()

sns.set(rc={'figure.figsize':(20,5)})
sns.barplot(data = temp_by_year, x = 'Year', y= 'Average Temperature °C')

In [None]:
df_emitter = df.iloc[:,1:24].groupby('Year').sum()
sns.set_style('whitegrid')
df_emitter_tot = df_emitter.sum(axis = 0).sort_values()

colors = ['green' if (x < 0) else 'red' for x in df_emitter_tot]
g = df_emitter_tot.plot(kind = 'barh', 
                        figsize = (25, 20), 
                        color = colors, 
                        rot = 0) 
plt.title('CO2 emission by sector over the 30-year period', fontsize = 18)
plt.xlabel('Emission of CO2', fontsize = 18)
plt.ylabel('Sector', fontsize = 18)

for p in g.patches:
    width = p.get_width()
    plt.text(p.get_width(), p.get_y()+ 1.3* p.get_height(),
             '{:1.2f}'.format(width),
             ha='center', va='center')
plt.show()

In [None]:
plt.figure(figsize = (10, 6))
df3 = df.groupby('Area')['Food Household Consumption'].sum().sort_values(ascending=False)
df3

In [None]:
sns.histplot(df['Fertilizers Manufacturing']);

In [None]:
# distplot is deprecated in seaborn; histplot with a KDE overlay is the replacement
sns.histplot(df['On-farm Electricity Use'], kde=True)
plt.show()

In [None]:
# Plot each value against its row index, with x and y passed explicitly
sns.scatterplot(x=df.index, y=df['On-farm Electricity Use'])

In [None]:
df_totalemi = df.groupby('Area')['total_emission'].sum()
df_totalemi = pd.DataFrame(df_totalemi).sort_values('total_emission', ascending = False)

df_top10 = df_totalemi.head(10).reset_index()
df_top10['Proportion_(%)'] = (df_top10['total_emission'] / df_top10['total_emission'].sum() )* 100
df_top10
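
Note that `Proportion_(%)` above is each country's share of the top-10 total, not of the global total. The sketch below, on made-up numbers, computes both so the difference is visible. The frame `toy_tot` and the column `share_of_all_(%)` are hypothetical additions for illustration.

```python
import pandas as pd

toy_tot = pd.DataFrame({
    "Area": ["A", "B", "C", "D"],
    "total_emission": [50.0, 30.0, 15.0, 5.0],
}).sort_values("total_emission", ascending=False)

top2 = toy_tot.head(2).copy()

# Share within the selected subset (the same calculation as the cell above)
top2["Proportion_(%)"] = top2["total_emission"] / top2["total_emission"].sum() * 100

# Share of all areas (hypothetical extra column)
top2["share_of_all_(%)"] = top2["total_emission"] / toy_tot["total_emission"].sum() * 100
print(top2)
```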

In [None]:
g = sns.catplot(x = 'total_emission',
            y = 'Area',
            data = df_top10,
            kind = 'bar',
            ci = None,
            height = 6,
            aspect = 2)
g.fig.suptitle('Top 10 agricultural CO2 emitters', y = 1.02, fontsize = 18)
g.set(xlabel = 'Total emission (bar labels: share of top-10 total, %)',
      ylabel = 'Top 10 countries')
ax = g.facet_axis(0, 0)
for c in ax.containers:
    # Label each bar with the country's share of the top-10 total
    ax.bar_label(c, labels=round(df_top10['Proportion_(%)'], 2), label_type='edge')
plt.show()

In [None]:
df_bottom10 = df_totalemi.tail(10).reset_index()
df_bottom10['Proportion_(%)'] = (df_bottom10['total_emission'] / df_bottom10['total_emission'].sum() )* 100
df_bottom10

In [None]:
g = sns.catplot(x = 'total_emission',
            y = 'Area',
            data = df_bottom10,
            kind = 'bar',
            ci = None,
            height = 6,
            aspect = 2)
g.fig.suptitle('Bottom 10 agricultural CO2 emitters', y = 1.02, fontsize = 18)
g.set(xlabel = 'Total emission (bar labels: share of bottom-10 total, %)',
      ylabel = 'Bottom 10 countries')
ax = g.facet_axis(0, 0)
for c in ax.containers:
    # Label each bar with the country's share of the bottom-10 total
    ax.bar_label(c, labels=round(df_bottom10['Proportion_(%)'], 2), label_type='edge')
plt.show()

In [None]:
sns.set(style="whitegrid")
ax = sns.violinplot(x=df["Savanna fires"])

In [None]:
sns.set(style="whitegrid")
ax = sns.violinplot(x=df["Forest fires"])

In [None]:
sns.set(style="whitegrid")
ax = sns.violinplot(x=df["Average Temperature °C"])

The code below assigns each Area to its continent. It uses a continents JSON file obtained online, which maps continents to countries, and looks up each Area from our dataset in that mapping.

In [None]:
# Load the continent -> countries mapping from the JSON file
with open('continents.json', 'r') as f:
    continents = json.load(f)

# Assign a continent label to each country in the emissions dataset
def assign_continent(area):
    for continent, countries in continents.items():
        if area in countries:
            return continent
    return None  # area not found in the mapping

df['Continents'] = df['Area'].apply(assign_continent)
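
The lookup above scans every continent's list for each row. A sketch of an alternative, assuming `continents.json` maps each continent name to a list of country names (the structure the loop above implies), is to invert the mapping once and to label unmatched areas explicitly. The sample entries, the function name `assign_continent_fast`, and the `'Unknown'` label are hypothetical.

```python
# Hypothetical excerpt of the continent -> countries mapping
continents_sample = {
    "Europe": ["France", "Germany"],
    "Asia": ["India", "Japan"],
}

# Invert once: country -> continent
country_to_continent = {
    country: continent
    for continent, countries in continents_sample.items()
    for country in countries
}

def assign_continent_fast(area: str) -> str:
    """Return the continent for an area, or 'Unknown' when it is not in the mapping."""
    return country_to_continent.get(area, "Unknown")

print(assign_continent_fast("Japan"))
print(assign_continent_fast("Atlantis"))
```

With pandas, `df['Area'].map(country_to_continent)` would perform the same lookup in one call and leave unmatched areas as NaN, which makes it easy to count how many areas the mapping misses.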

Next we look at the average temperature change associated with carbon emissions, year by year, in the different regions of the world. The next three plots suggest that Europe is the most affected region, followed by Asia.

In [None]:
df_1990_1999 = df.query('Year >= 1990 and Year <= 1999')
df_2000_2009 = df.query('Year >= 2000 and Year <= 2009')
df_2010_2020 = df.query('Year >= 2010 and Year <= 2020')

def plotEmissionBoxPlot(dataframe, title):
      sns.set_style('whitegrid')
      g = sns.catplot(x = 'Year',
                  y = 'Average Temperature °C',
                  data = dataframe,
                  kind = 'box',
                  hue = 'Continents',
                  height=10, 
                  aspect=2.5
                  )
      g.fig.suptitle('Average temperature change by year from ' + title, fontsize = 18)
      g.set(xlabel = 'Year', ylabel = 'Average change of temperature.')
      plt.show()

plotEmissionBoxPlot(df_1990_1999, '1990 to 1999.')

In [None]:
plotEmissionBoxPlot(df_2000_2009, '2000 to 2009.')

In [None]:
plotEmissionBoxPlot(df_2010_2020, '2010 to 2020.')

The next cell aggregates the data by year and continent (mean emission and temperature, summed population), normalizes the values, and plots the temperature trend for each continent.

In [None]:
data_mean_tmp_emi = df.groupby(['Year','Continents'])[['total_emission','Average Temperature °C']].agg('mean').reset_index()
data_sum_pop = df.groupby(['Year','Continents'])['total_population'].agg('sum').reset_index()
data_joined = pd.merge(data_mean_tmp_emi, data_sum_pop, how='left', on=['Year','Continents'])
data_joined2 = normalize(data_joined[['total_emission','Average Temperature °C','total_population']])
data_joined2 = pd.concat((data_joined[['Year','Continents']], data_joined2[['total_emission','Average Temperature °C','total_population']]), axis = 1)

sns.lineplot(data = data_joined2, x = 'Year', y = 'Average Temperature °C', hue = 'Continents', palette = 'dark', style = 'Continents', markers = True)
plt.title('Temperature rise yearly from 1990-2020', fontsize = 18)
plt.xlabel('Date from 1990 to 2020', fontsize = 18)
plt.ylabel('Normalized change of temperature.', fontsize = 18)
plt.show()

Analyze several variables against the average temperature, colored by continent.

In [None]:
# When `vars` is given it overrides `y_vars`, so pass the predictors as `x_vars` instead
sns.pairplot(df, x_vars=[
     'Savanna fires', 'Forest fires',
     "total_emission", 'Crop Residues',
     'Rice Cultivation','Pesticides Manufacturing',
     'Food Transport'
],
y_vars=["Average Temperature °C"],
hue = "Continents")
plt.show()

## Hypothesis Testing


In [None]:
dataplot = df.drop(['Area', 'Continents'], axis = 1)
sns.heatmap(dataplot.corr(), cmap="YlGnBu", annot=True)
plt.show()

In [None]:

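# Work on a copy so the new column is not written to a view of df
df_emiPerCapita = df[['Area', 'Year', 'Forestland', 'total_emission', 'total_population','Continents']].copy()
df_emiPerCapita['Emission_Per_Capita'] = df_emiPerCapita['total_emission'] / df_emiPerCapita['total_population']
# Assign directly so re-running the cell does not create duplicate columns
df['Emission_Per_Capita'] = df_emiPerCapita['Emission_Per_Capita']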
df_emiPerCapita_mean = df_emiPerCapita.groupby('Area')['Emission_Per_Capita'].mean()
df_emiPerCapita_mean.sort_values(ascending = False).head(n=10).reset_index()

correlation = df.groupby('Year').agg({'total_emission':'sum','Average Temperature °C':'mean','total_population':'sum', 'Forestland':'sum', 'Emission_Per_Capita':'mean'})
correlation.corr()

In [None]:
correlation['Year'] = correlation.index
sns.lmplot(data = correlation,
            x = 'total_emission',
            y = 'Average Temperature °C',
            height = 10,
            aspect = 2,
            fit_reg = True)
sns.lmplot(data = correlation,
            x = 'total_emission',
            y = 'Emission_Per_Capita',
            height = 10,
            aspect = 2,
            fit_reg = True)
sns.lmplot(data = correlation,
            x = 'total_emission',
            y = 'total_population',
            height = 10,
            aspect = 2,
            fit_reg = True)
plt.show()


## Perform train-test splits

## Perform train-test splits - Model 1

In [None]:
# Return the slope and intercept of a fitted single-feature linear model
def get_slope_intercept(model):
    slope = model.coef_[0]
    intercept = model.intercept_
    return slope, intercept

In [None]:
def calculate_evaluation_metrics(predictions, y_values):
    """Return R-squared, MAE, MSE and RMSE for a set of predictions."""
    mse = mean_squared_error(y_values, predictions)
    return {
        'R-squared': r2_score(y_values, predictions),
        'MAE': mean_absolute_error(y_values, predictions),
        'MSE': mse,
        # Square root taken directly; the `squared` argument is deprecated in recent scikit-learn releases
        'RMSE': mse ** 0.5
    }
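
For reference, the four metrics can be worked by hand on a tiny example. This sketch uses plain Python and made-up values; it is meant to show the formulas, not to replace the scikit-learn functions.

```python
import math

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n                # mean absolute error
mse = sum(e ** 2 for e in errors) / n                # mean squared error
rmse = math.sqrt(mse)                                # root mean squared error

mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)                 # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
r2 = 1 - ss_res / ss_tot                             # coefficient of determination

print(mae, mse, rmse, r2)
```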

In [None]:
#Define values to be used to train the model
X = df[['Food Processing']]
y = df['Average Temperature °C']

# Calculate and get train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Plotting the training and testing sets

In [None]:
# Plot the results using scatter plot 
plt.scatter(X_train, y_train, color='green', label='Training')  # plot the training data in green
plt.scatter(X_test, y_test, color='darkblue', label='Testing')  # plot the testing data in blue
plt.legend()
plt.show()

### Training a linear model

In [None]:
# Create an instance of the LinearRegression class
lm = LinearRegression()

# Train the linear regression model
lm.fit(X_train, y_train)

# Get slope and intercept values
a, b = get_slope_intercept(lm)

print("Slope:\t\t", a)
print("Intercept:\t", float(b))
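
For a single feature, the least-squares slope and intercept have a closed form: slope = S_xy / S_xx and intercept = mean(y) - slope * mean(x), where S_xy is the sum of cross-deviations and S_xx the sum of squared deviations of x. The sketch below checks this on invented points that lie exactly on y = 2x + 1.

```python
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sum of cross-deviations and sum of squared deviations of x
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)

slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
print(slope, intercept)
```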

### Assessing model on the training data

In [None]:
# Generate predictions on the training set
y_train_pred = lm.predict(X_train)

# Plot the results
plt.scatter(X_train, y_train, color='green', label='Training data')  # Plot the training data in green
plt.plot(X_train, y_train_pred, color='red', label='Regression line')  # Plot the line connecting the generated y-values
plt.legend()
plt.show()


In [None]:
# Print the training MSE and R-squared score

train_mse = metrics.mean_squared_error(y_train, y_train_pred)
train_r2 = metrics.r2_score(y_train, y_train_pred)
train_rmse = train_mse ** 0.5  # RMSE as the square root of MSE (avoids the deprecated squared=False argument)
train_mae = metrics.mean_absolute_error(y_train, y_train_pred)

print("Training MSE:", train_mse)
print("Training R-squared:", train_r2)
print("Training RMSE:", train_rmse)
print("Training MAE:", train_mae)
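
The cells above report metrics on the training set only. A common next step is to score the held-out test split as well, since training metrics alone can overstate how well a model generalizes. The sketch below uses synthetic data (an assumption, not this dataset) and hypothetical variable names to show the pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3x + 2 plus small noise (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3 * X_demo[:, 0] + 2 + rng.normal(0, 0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)

# Score both splits with the same metrics
train_r2_demo = r2_score(y_tr, model.predict(X_tr))
test_r2_demo = r2_score(y_te, model.predict(X_te))
test_rmse_demo = mean_squared_error(y_te, model.predict(X_te)) ** 0.5

print(train_r2_demo, test_r2_demo, test_rmse_demo)
```

A test score far below the training score would point to overfitting; scores that are low on both splits suggest the single feature explains little of the target.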

## Perform train-test splits - Model 2

In [None]:
# Define values to be used to train the model
X = df[['Food Transport']]
y = df['Average Temperature °C']

# Calculate and get train test split
X_one_train, X_one_test, y_one_train, y_one_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Plotting the training and testing sets

In [None]:
# Plot the results using scatter plot 
plt.scatter(X_one_train, y_one_train, color='green', label='Training')  # plot the training data in green
plt.scatter(X_one_test, y_one_test, color='darkblue', label='Testing')  # plot the testing data in blue
plt.legend()
plt.show()

### Training a linear model

In [None]:
# Create an instance of the LinearRegression class
lm = LinearRegression()

# Train the linear regression model
lm.fit(X_one_train, y_one_train)

# Get slope and intercept values
a, b = get_slope_intercept(lm)

print("Slope:\t\t", a)
print("Intercept:\t", float(b))

### Assessing model on the training data

In [None]:
y_one_train_pred = lm.predict(X_one_train)

# Plot the results
plt.scatter(X_one_train, y_one_train, color='green', label='Training data')  # Plot the training data in green
plt.plot(X_one_train, y_one_train_pred, color='red', label='Regression line')  # Plot the line connecting the generated y-values
plt.legend()
plt.show()

In [None]:
train_mse = metrics.mean_squared_error(y_one_train, y_one_train_pred)
train_r2 = metrics.r2_score(y_one_train, y_one_train_pred)
train_rmse = train_mse ** 0.5  # RMSE as the square root of MSE (avoids the deprecated squared=False argument)
train_mae = metrics.mean_absolute_error(y_one_train, y_one_train_pred)

print("Training MSE:", train_mse)
print("Training R-squared:", train_r2)
print("Training RMSE:", train_rmse)
print("Training MAE:", train_mae)

## Perform train-test splits - Model 3

In [None]:
# Define values to be used to train the model
X = df[['total_emission']]
y = df['Agrifood Systems Waste Disposal']

# Calculate and get train test split
X_two_train, X_two_test, y_two_train, y_two_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Plotting the training and testing sets

In [None]:
#Plot the results using scatter plot 
plt.scatter(X_two_train, y_two_train, color='green', label='Training')  # plot the training data in green
plt.scatter(X_two_test, y_two_test, color='darkblue', label='Testing')  # plot the testing data in blue
plt.legend()
plt.show()

### Training a linear model

In [None]:
# Create an instance of the LinearRegression class
lm = LinearRegression()

# Train the linear regression model
lm.fit(X_two_train, y_two_train)

# Get slope and intercept values
a, b = get_slope_intercept(lm)

print("Slope:\t\t", a)
print("Intercept:\t", float(b))

### Assessing model on the training data

In [None]:
y_two_train_pred = lm.predict(X_two_train)

# Plot the results
plt.scatter(X_two_train, y_two_train, color='green', label='Training data')  # Plot the training data in green
plt.plot(X_two_train, y_two_train_pred, color='red', label='Regression line')  # Plot the line connecting the generated y-values
plt.legend()
plt.show()

In [None]:
train_mse = metrics.mean_squared_error(y_two_train, y_two_train_pred)
train_r2 = metrics.r2_score(y_two_train, y_two_train_pred)
train_rmse = train_mse ** 0.5  # RMSE as the square root of MSE (avoids the deprecated squared=False argument)
train_mae = metrics.mean_absolute_error(y_two_train, y_two_train_pred)

print("Training MSE:", train_mse)
print("Training R-squared:", train_r2)
print("Training RMSE:", train_rmse)
print("Training MAE:", train_mae)

## Perform train-test splits - Model 4

In [None]:
# Define values to be used to train the model
X = df[['On-farm Electricity Use']]
y = df['Pesticides Manufacturing']

# Calculate and get train test split
X_three_train, X_three_test, y_three_train, y_three_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Plotting the training and testing sets

In [None]:
# Plot the results using scatter plot 
plt.scatter(X_three_train, y_three_train, color='green', label='Training')  # plot the training data in green
plt.scatter(X_three_test, y_three_test, color='darkblue', label='Testing')  # plot the testing data in blue
plt.legend()
plt.show()

### Training a linear model

In [None]:
# Create an instance of the LinearRegression class
lm = LinearRegression()

# Train the linear regression model
lm.fit(X_three_train, y_three_train)

# Get slope and intercept values
a, b = get_slope_intercept(lm)

print("Slope:\t\t", a)
print("Intercept:\t", float(b))

### Assessing model on the training data

In [None]:
y_three_train_pred = lm.predict(X_three_train)

# Plot the results
plt.scatter(X_three_train, y_three_train, color='green', label='Training data')  # Plot the training data in green
plt.plot(X_three_train, y_three_train_pred, color='red', label='Regression line')  # Plot the line connecting the generated y-values
plt.legend()
plt.show()

In [None]:
train_mse = metrics.mean_squared_error(y_three_train, y_three_train_pred)
train_r2 = metrics.r2_score(y_three_train, y_three_train_pred)
train_rmse = train_mse ** 0.5  # RMSE as the square root of MSE (avoids the deprecated squared=False argument)
train_mae = metrics.mean_absolute_error(y_three_train, y_three_train_pred)

print("Training MSE:", train_mse)
print("Training R-squared:", train_r2)
print("Training RMSE:", train_rmse)
print("Training MAE:", train_mae)