# 2401PTDS_Regression_Project_FM2

## Table of contents
* [1. Project Overview and Objectives](#1-project-overview-and-objectives)
* [2. Dataset](#dataset)
* [3. Packages](#packages)
* [4. Environment](#environment)
* [5. Team Members](#team-members)

## 1. Project Overview and Objectives

Your team of environmental consultants and data scientists has been tasked by a coalition of agricultural stakeholders to analyse and predict the effect of CO2 emissions from the agri-food sector on climate change. The stakeholders include policymakers, agricultural businesses, and environmental organisations. The project aims to understand the impact of agricultural activities on climate change and to develop strategies for sustainable practices. Using a comprehensive dataset compiled from the Food and Agriculture Organization (FAO) and the Intergovernmental Panel on Climate Change (IPCC), you will explore the various emission sources, perform regression analysis to predict temperature variations, and offer actionable insights to the stakeholders.

By the end of this project, you will have a thorough understanding of the impact of agricultural activities on CO2 emissions and climate change. Your findings and recommendations will contribute to the ongoing efforts to promote sustainability within the agri-food sector, providing valuable insights for the stakeholders involved in this initiative.

### 2. Import Libraries

In [1]:
import pandas                     as     pd
import seaborn                    as     sns
import matplotlib.pyplot          as     plt
from sklearn.preprocessing        import MinMaxScaler
from sklearn.model_selection      import train_test_split
from sklearn.linear_model         import LinearRegression
from sklearn.tree                 import DecisionTreeRegressor
from sklearn.ensemble             import RandomForestRegressor
from sklearn.ensemble             import StackingRegressor
from sklearn.metrics              import mean_squared_error, r2_score


In [2]:
#!pip install jupyter_contrib_nbextensions

## Dataset

Emissions from the agri-food sector play a crucial role in climate change, as they represent a significant share of global annual emissions. The dataset highlights the substantial contribution of the various sources of emissions. Therefore, it is essential to understand and address the environmental impact of the agri-food industry to mitigate climate change and promote sustainable practices within this sector.

Dataset Features:

- Savanna fires: Emissions from fires in savanna ecosystems.
- Forest fires: Emissions from fires in forested areas.
- Crop Residues: Emissions from burning or decomposing leftover plant material after crop harvesting.
- Rice Cultivation: Emissions from methane released during rice cultivation.
- Drained organic soils (CO2): Emissions from carbon dioxide released when draining organic soils.
- Pesticides Manufacturing: Emissions from the production of pesticides.
- Food Transport: Emissions from transporting food products.
- Forestland: Land covered by forests.
- Net Forest conversion: Change in forest area due to deforestation and afforestation.
- Food Household Consumption: Emissions from food consumption at the household level.
- Food Retail: Emissions from the operation of retail establishments selling food.
- On-farm Electricity Use: Electricity consumption on farms.
- Food Packaging: Emissions from the production and disposal of food packaging materials.
- Agrifood Systems Waste Disposal: Emissions from waste disposal in the agrifood system.
- Food Processing: Emissions from processing food products.
- Fertilizers Manufacturing: Emissions from the production of fertilizers.
- IPPU: Emissions from industrial processes and product use.
- Manure applied to Soils: Emissions from applying animal manure to agricultural soils.
- Manure left on Pasture: Emissions from animal manure on pasture or grazing land.
- Manure Management: Emissions from managing and treating animal manure.
- Fires in organic soils: Emissions from fires in organic soils.
- Fires in humid tropical forests: Emissions from fires in humid tropical forests.
- On-farm energy use: Energy consumption on farms.
- Rural population: Number of people living in rural areas.
- Urban population: Number of people living in urban areas.
- Total Population - Male: Total number of male individuals in the population.
- Total Population - Female: Total number of female individuals in the population.
- total_emission: Total greenhouse gas emissions from various sources.
- Average Temperature °C: The average yearly temperature increase, in degrees Celsius.
- CO2 is recorded in kilotonnes (kt): 1 kt represents 1,000 tonnes (1,000,000 kg) of CO2.

The feature "Average Temperature °C" represents the average yearly temperature increase. For example, a value of 0.12 means that the temperature in that location increased by 0.12 degrees Celsius in that year.

Forestland is the only feature that exhibits negative emissions due to its role as a carbon sink. Through photosynthesis, forests absorb and store carbon dioxide, effectively removing it from the atmosphere. Sustainable forest management, along with afforestation and reforestation efforts, further contribute to negative emissions by increasing carbon sequestration capacity.
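
Because the emission columns are recorded in kilotonnes, it can help to convert a value into more familiar units when reporting results. The sketch below is illustrative only: the helper names `kt_to_tonnes` and `kt_to_kg` are assumptions and not part of the project code, and the sample value is the `total_emission` figure for Afghanistan in 1990 from the first row of the dataset.

```python
# Unit conversion sketch: 1 kt = 1,000 t = 1,000,000 kg.
TONNES_PER_KILOTONNE = 1_000
KG_PER_TONNE = 1_000


def kt_to_tonnes(kilotonnes: float) -> float:
    """Convert kilotonnes to tonnes."""
    return kilotonnes * TONNES_PER_KILOTONNE


def kt_to_kg(kilotonnes: float) -> float:
    """Convert kilotonnes to kilograms."""
    return kt_to_tonnes(kilotonnes) * KG_PER_TONNE


# total_emission for Afghanistan in 1990, taken from the first row of the dataset
afghanistan_1990_kt = 2198.963539
print(f"{afghanistan_1990_kt} kt = {kt_to_tonnes(afghanistan_1990_kt):,.0f} t")
print(f"{afghanistan_1990_kt} kt = {kt_to_kg(afghanistan_1990_kt):,.0f} kg")
```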

### Load Dataset

Download the CSV file into the same folder as this Jupyter notebook.

In [3]:
# The CSV file is expected in the same folder as this notebook,
# so a relative path works on every team member's machine
df = pd.read_csv("co2_emissions_from_agri.csv")
df.head()


| | Area | Year | Savanna fires | Forest fires | Crop Residues | Rice Cultivation | Drained organic soils (CO2) | Pesticides Manufacturing | Food Transport | Forestland | ... | Manure Management | Fires in organic soils | Fires in humid tropical forests | On-farm energy use | Rural population | Urban population | Total Population - Male | Total Population - Female | total_emission | Average Temperature °C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1990 | 14.7237 | 0.0557 | 205.6077 | 686.0 | 0.0 | 11.807483 | 63.1152 | -2388.803 | ... | 319.1763 | 0.0 | 0.0 | NaN | 9655167.0 | 2593947.0 | 5348387.0 | 5346409.0 | 2198.963539 | 0.536167 |
| 1 | Afghanistan | 1991 | 14.7237 | 0.0557 | 209.4971 | 678.16 | 0.0 | 11.712073 | 61.2125 | -2388.803 | ... | 342.3079 | 0.0 | 0.0 | NaN | 10230490.0 | 2763167.0 | 5372959.0 | 5372208.0 | 2323.876629 | 0.020667 |
| 2 | Afghanistan | 1992 | 14.7237 | 0.0557 | 196.5341 | 686.0 | 0.0 | 11.712073 | 53.317 | -2388.803 | ... | 349.1224 | 0.0 | 0.0 | NaN | 10995568.0 | 2985663.0 | 6028494.0 | 6028939.0 | 2356.304229 | -0.259583 |
| 3 | Afghanistan | 1993 | 14.7237 | 0.0557 | 230.8175 | 686.0 | 0.0 | 11.712073 | 54.3617 | -2388.803 | ... | 352.2947 | 0.0 | 0.0 | NaN | 11858090.0 | 3237009.0 | 7003641.0 | 7000119.0 | 2368.470529 | 0.101917 |
| 4 | Afghanistan | 1994 | 14.7237 | 0.0557 | 242.0494 | 705.6 | 0.0 | 11.712073 | 53.9874 | -2388.803 | ... | 367.6784 | 0.0 | 0.0 | NaN | 12690115.0 | 3482604.0 | 7733458.0 | 7722096.0 | 2500.768729 | 0.37225 |


### Data Discovery

Here we will perform a preliminary exploration of our dataset to get a better understanding of the data we are working with.

This initial exploration of our data will help uncover underlying patterns and relationships that can inform our choice of models.

In [4]:
df.shape

(6965, 31)

In [5]:
# Rows whose Area name contains 'mainland' (for example 'China, mainland');
# Year is an integer column, so it cannot be compared with this string
df[df['Area'].str.contains('mainland')]


In [6]:
df['Area'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra',
       'Angola', 'Anguilla', 'Antigua and Barbuda', 'Argentina',
       'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan',
       'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus',
       'Belgium', 'Belgium-Luxembourg', 'Belize', 'Benin', 'Bermuda',
       'Bhutan', 'Bolivia (Plurinational State of)',
       'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'British Virgin Islands', 'Brunei Darussalam', 'Bulgaria',
       'Burkina Faso', 'Burundi', 'Cabo Verde', 'Cambodia', 'Cameroon',
       'Canada', 'Cayman Islands', 'Central African Republic', 'Chad',
       'Channel Islands', 'Chile', 'China', 'China, Hong Kong SAR',
       'China, Macao SAR', 'China, mainland', 'China, Taiwan Province of',
       'Colombia', 'Comoros', 'Congo', 'Cook Islands', 'Costa Rica',
       'Croatia', 'Cuba', 'Cyprus', 'Czechia', 'Czechoslovakia',
       "Democratic People's Republic of Korea",
       'Democratic 

In [7]:
df.tail()

| | Area | Year | Savanna fires | Forest fires | Crop Residues | Rice Cultivation | Drained organic soils (CO2) | Pesticides Manufacturing | Food Transport | Forestland | ... | Manure Management | Fires in organic soils | Fires in humid tropical forests | On-farm energy use | Rural population | Urban population | Total Population - Male | Total Population - Female | total_emission | Average Temperature °C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6960 | Zimbabwe | 2016 | 1190.0089 | 232.5068 | 70.9451 | 7.4088 | 0.0 | 75.0 | 251.1465 | 76500.2982 | ... | 282.5994 | 0.0 | 0.0 | 417.315 | 10934468.0 | 5215894.0 | 6796658.0 | 7656047.0 | 98491.026347 | 1.12025 |
| 6961 | Zimbabwe | 2017 | 1431.1407 | 131.1324 | 108.6262 | 7.9458 | 0.0 | 67.0 | 255.7975 | 76500.2982 | ... | 255.59 | 0.0 | 0.0 | 398.1644 | 11201138.0 | 5328766.0 | 6940631.0 | 7810471.0 | 97159.311553 | 0.0465 |
| 6962 | Zimbabwe | 2018 | 1557.583 | 221.6222 | 109.9835 | 8.1399 | 0.0 | 66.0 | 327.0897 | 76500.2982 | ... | 257.2735 | 0.0 | 0.0 | 465.7735 | 11465748.0 | 5447513.0 | 7086002.0 | 7966181.0 | 97668.308205 | 0.516333 |
| 6963 | Zimbabwe | 2019 | 1591.6049 | 171.0262 | 45.4574 | 7.8322 | 0.0 | 73.0 | 290.1893 | 76500.2982 | ... | 267.5224 | 0.0 | 0.0 | 444.2335 | 11725970.0 | 5571525.0 | 7231989.0 | 8122618.0 | 98988.062799 | 0.985667 |
| 6964 | Zimbabwe | 2020 | 481.9027 | 48.4197 | 108.3022 | 7.9733 | 0.0 | 73.0 | 238.7639 | 76500.2982 | ... | 266.7316 | 0.0 | 0.0 | 444.2335 | 11980005.0 | 5700460.0 | 7385220.0 | 8284447.0 | 96505.221853 | 0.189 |


In [8]:
df.dtypes

Area                                object
Year                                 int64
Savanna fires                      float64
Forest fires                       float64
Crop Residues                      float64
Rice Cultivation                   float64
Drained organic soils (CO2)        float64
Pesticides Manufacturing           float64
Food Transport                     float64
Forestland                         float64
Net Forest conversion              float64
Food Household Consumption         float64
Food Retail                        float64
On-farm Electricity Use            float64
Food Packaging                     float64
Agrifood Systems Waste Disposal    float64
Food Processing                    float64
Fertilizers Manufacturing          float64
IPPU                               float64
Manure applied to Soils            float64
Manure left on Pasture             float64
Manure Management                  float64
Fires in organic soils             float64
Fires in hu

In [9]:
df.describe()

| | Year | Savanna fires | Forest fires | Crop Residues | Rice Cultivation | Drained organic soils (CO2) | Pesticides Manufacturing | Food Transport | Forestland | Net Forest conversion | ... | Manure Management | Fires in organic soils | Fires in humid tropical forests | On-farm energy use | Rural population | Urban population | Total Population - Male | Total Population - Female | total_emission | Average Temperature °C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6965.0 | 6934.0 | 6872.0 | 5576.0 | 6965.0 | 6965.0 | 6965.0 | 6965.0 | 6472.0 | 6472.0 | ... | 6037.0 | 6965.0 | 6810.0 | 6009.0 | 6965.0 | 6965.0 | 6965.0 | 6965.0 | 6965.0 | 6965.0 |
| mean | 2005.12491 | 1188.390893 | 919.302167 | 998.706309 | 4259.666673 | 3503.228636 | 333.418393 | 1939.58176 | -17828.285678 | 17605.64 | ... | 2263.344946 | 1210.315532 | 668.452931 | 3008.982252 | 17857740.0 | 16932300.0 | 17619630.0 | 17324470.0 | 64091.24 | 0.872989 |
| std | 8.894665 | 5246.287783 | 3720.078752 | 3700.34533 | 17613.825187 | 15861.445678 | 1429.159367 | 5616.748808 | 81832.210543 | 101157.5 | ... | 7980.542461 | 22669.84776 | 3264.879486 | 12637.86443 | 89015210.0 | 65743620.0 | 76039930.0 | 72517110.0 | 228313.0 | 0.55593 |
| min | 1990.0 | 0.0 | 0.0 | 0.0002 | 0.0 | 0.0 | 0.0 | 0.0001 | -797183.079 | 0.0 | ... | 0.4329 | 0.0 | 0.0 | 0.0319 | 0.0 | 0.0 | 250.0 | 270.0 | -391884.1 | -1.415833 |
| 25% | 1997.0 | 0.0 | 0.0 | 11.006525 | 181.2608 | 0.0 | 6.0 | 27.9586 | -2848.35 | 0.0 | ... | 37.6321 | 0.0 | 0.0 | 13.2919 | 97311.0 | 217386.0 | 201326.0 | 207890.0 | 5221.244 | 0.511333 |
| 50% | 2005.0 | 1.65185 | 0.5179 | 103.6982 | 534.8174 | 0.0 | 13.0 | 204.9628 | -62.92 | 44.44 | ... | 269.8563 | 0.0 | 0.0 | 141.0963 | 1595322.0 | 2357581.0 | 2469660.0 | 2444135.0 | 12147.65 | 0.8343 |
| 75% | 2013.0 | 111.0814 | 64.950775 | 377.640975 | 1536.64 | 690.4088 | 116.325487 | 1207.0009 | 0.0 | 4701.746 | ... | 1126.8189 | 0.0 | 9.577875 | 1136.9254 | 8177340.0 | 8277123.0 | 9075924.0 | 9112588.0 | 35139.73 | 1.20675 |
| max | 2020.0 | 114616.4011 | 52227.6306 | 33490.0741 | 164915.2556 | 241025.0696 | 16459.0 | 67945.765 | 171121.076 | 1605106.0 | ... | 70592.6465 | 991717.5431 | 51771.2568 | 248879.1769 | 900099100.0 | 902077800.0 | 743586600.0 | 713341900.0 | 3115114.0 | 3.558083 |


In [10]:
df.describe().loc[['min', '25%', '50%', '75%', 'max']]

| | Year | Savanna fires | Forest fires | Crop Residues | Rice Cultivation | Drained organic soils (CO2) | Pesticides Manufacturing | Food Transport | Forestland | Net Forest conversion | ... | Manure Management | Fires in organic soils | Fires in humid tropical forests | On-farm energy use | Rural population | Urban population | Total Population - Male | Total Population - Female | total_emission | Average Temperature °C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| min | 1990.0 | 0.0 | 0.0 | 0.0002 | 0.0 | 0.0 | 0.0 | 0.0001 | -797183.079 | 0.0 | ... | 0.4329 | 0.0 | 0.0 | 0.0319 | 0.0 | 0.0 | 250.0 | 270.0 | -391884.1 | -1.415833 |
| 25% | 1997.0 | 0.0 | 0.0 | 11.006525 | 181.2608 | 0.0 | 6.0 | 27.9586 | -2848.35 | 0.0 | ... | 37.6321 | 0.0 | 0.0 | 13.2919 | 97311.0 | 217386.0 | 201326.0 | 207890.0 | 5221.244 | 0.511333 |
| 50% | 2005.0 | 1.65185 | 0.5179 | 103.6982 | 534.8174 | 0.0 | 13.0 | 204.9628 | -62.92 | 44.44 | ... | 269.8563 | 0.0 | 0.0 | 141.0963 | 1595322.0 | 2357581.0 | 2469660.0 | 2444135.0 | 12147.65 | 0.8343 |
| 75% | 2013.0 | 111.0814 | 64.950775 | 377.640975 | 1536.64 | 690.4088 | 116.325487 | 1207.0009 | 0.0 | 4701.746 | ... | 1126.8189 | 0.0 | 9.577875 | 1136.9254 | 8177340.0 | 8277123.0 | 9075924.0 | 9112588.0 | 35139.73 | 1.20675 |
| max | 2020.0 | 114616.4011 | 52227.6306 | 33490.0741 | 164915.2556 | 241025.0696 | 16459.0 | 67945.765 | 171121.076 | 1605106.0 | ... | 70592.6465 | 991717.5431 | 51771.2568 | 248879.1769 | 900099113.0 | 902077760.0 | 743586579.0 | 713341908.0 | 3115114.0 | 3.558083 |


In [11]:
df.isnull().any()

Area                               False
Year                               False
Savanna fires                       True
Forest fires                        True
Crop Residues                       True
Rice Cultivation                   False
Drained organic soils (CO2)        False
Pesticides Manufacturing           False
Food Transport                     False
Forestland                          True
Net Forest conversion               True
Food Household Consumption          True
Food Retail                        False
On-farm Electricity Use            False
Food Packaging                     False
Agrifood Systems Waste Disposal    False
Food Processing                    False
Fertilizers Manufacturing          False
IPPU                                True
Manure applied to Soils             True
Manure left on Pasture             False
Manure Management                   True
Fires in organic soils             False
Fires in humid tropical forests     True
On-farm energy u
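
`df.isnull().any()` only reports whether a column contains missing values, not how many. The sketch below shows one way to count them per column and one possible way to fill them. It runs on a small made-up frame named `toy` (a hypothetical stand-in for `df`, with column names borrowed from the dataset), and the median fill is an assumption for illustration, not the team's chosen cleaning method.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; column names come from the dataset, values are made up
toy = pd.DataFrame({
    "Area": ["A", "A", "B", "B"],
    "Savanna fires": [1.0, np.nan, 3.0, 5.0],
    "Rice Cultivation": [10.0, 20.0, 30.0, 40.0],
    "On-farm energy use": [np.nan, np.nan, 2.0, 4.0],
})

# Count and share of missing values per column, largest first
missing = toy.isnull().sum().sort_values(ascending=False)
missing_share = (missing / len(toy)).round(2)
print(pd.DataFrame({"missing": missing, "share": missing_share}))

# One possible treatment (an assumption): fill numeric gaps with the column median
numeric_cols = toy.select_dtypes(include="number").columns
toy_filled = toy.copy()
toy_filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())
print(toy_filled)
```

On the real data the same pattern would apply to `df` directly; whether a median fill, a per-country fill, or dropping rows is most appropriate depends on the column.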

### Data Preprocessing

In [12]:
from sklearn.preprocessing import StandardScaler

# Separate the target from the numeric features (the text column Area cannot be scaled)
y = df['Average Temperature °C']
X = df.drop(columns=['Average Temperature °C']).select_dtypes(include='number')

# Standardise features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train test split
x_train, x_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=6)
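
A common refinement of the scaling step is to fit the scaler on the training rows only, so that no information from the test rows leaks into the preprocessing. The sketch below illustrates this with a scikit-learn pipeline on synthetic data. The arrays `X_demo` and `y_demo` are made up for the example, and `LinearRegression` is used only as a placeholder model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 numeric features on very different scales
rng = np.random.default_rng(6)
X_demo = rng.normal(loc=[0.0, 100.0, 1e6], scale=[1.0, 50.0, 1e5], size=(200, 3))
y_demo = 0.5 * X_demo[:, 0] + 0.01 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=6)

# The pipeline fits the scaler on the training rows only, then reuses
# those training means and standard deviations to transform the test rows
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

# After scaling, each training feature has mean close to 0 and standard deviation close to 1
scaled_train = model.named_steps["standardscaler"].transform(X_tr)
print(scaled_train.mean(axis=0).round(6))
print(scaled_train.std(axis=0).round(6))
print("Test R^2:", model.score(X_te, y_te))
```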

In [None]:
1. Years
2. Average_Temperature
3. Populations
4. 

In [None]:
# Target variable: Average Temperature °C
# Eliminate other population columns


## Train-test split

The data is split into two parts: training data and test data. With about 7,000 records the dataset is relatively small, so a larger test share is used: 30% of the data is held out for testing. The random state is set to 50 so that the same split is produced every time the code is run.

In [None]:
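y = df['Average Temperature °C']
# Keep numeric predictors only; the text column Area cannot be passed to the models directly
X = df.drop('Average Temperature °C', axis=1).select_dtypes(include='number')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)
print(len(X_train), len(X_test), len(y_train), len(y_test))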
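
To make the split sizes and the role of `random_state` concrete, the sketch below repeats the split on stand-in arrays with the same number of rows as the dataset (6965). The arrays `X_demo` and `y_demo` are placeholders, not the real features. With `test_size=0.3`, scikit-learn rounds the test share up, giving 2090 test rows and 4875 training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the dataset (6965)
X_demo = np.arange(6965).reshape(-1, 1)
y_demo = np.arange(6965)

# Two splits with the same random_state
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=50)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=50)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(len(X_train_a), len(X_test_a))

# The same random_state reproduces exactly the same test rows
same_split = np.array_equal(split_a[1], split_b[1])
print("Identical test sets:", same_split)
```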


# Model Training, Evaluation and Selections

Data Cleaning - Theo

EDA - Marcus

Variable Selection - Mfana

Train-test splits - Manqoba 

Model 1 - Decision Tree Regressor/Bagging - Mfana

### Model 2 - Random Forest Regressor - Manqoba

In [None]:
Model2_Random_Forest = RandomForestRegressor(n_estimators=100, max_depth=5)
Model2_Random_Forest.fit(X_train,y_train)

##### Model 2 Testing

In [None]:
Model2_Avg_Temp = Model2_Random_Forest.predict(X_test)
# RMSE is the square root of the mean squared error
print("RMSE:", mean_squared_error(y_test, Model2_Avg_Temp) ** 0.5)
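
An RMSE value is easier to interpret next to a reference point. The sketch below, run on synthetic data rather than the project dataset, reports RMSE together with R² and compares the forest against a naive baseline that always predicts the training mean. The data-generating formula and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the real features and target
rng = np.random.default_rng(50)
X_demo = rng.uniform(0, 10, size=(500, 4))
y_demo = np.sin(X_demo[:, 0]) + 0.3 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=50)

# Same hyperparameters as Model 2, with a fixed seed for reproducibility
forest = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=50)
forest.fit(X_tr, y_tr)
pred = forest.predict(X_te)

rmse = np.sqrt(mean_squared_error(y_te, pred))
r2 = r2_score(y_te, pred)

# Baseline: always predict the mean of the training target
baseline_pred = np.full_like(y_te, y_tr.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_te, baseline_pred))

print(f"RMSE: {rmse:.3f}  (baseline {baseline_rmse:.3f})")
print(f"R^2:  {r2:.3f}")
```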

Model 3 - Support Vector Regression - Mfana

Model Selection - Manqoba
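
One simple way to approach model selection is to fit each candidate on the same training split and keep the one with the lowest test RMSE. The sketch below shows that loop on synthetic data. The candidate list, the hyperparameters, and the data are assumptions for illustration and do not represent the team's final models or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the real features and target
rng = np.random.default_rng(50)
X_demo = rng.uniform(0, 10, size=(500, 4))
y_demo = np.sin(X_demo[:, 0]) + 0.3 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=50)

# Hypothetical candidate models, each evaluated on the same split
candidates = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=50),
    "random_forest": RandomForestRegressor(n_estimators=100, max_depth=5, random_state=50),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    scores[name] = float(np.sqrt(mean_squared_error(y_te, estimator.predict(X_te))))

# Pick the candidate with the lowest test RMSE
best_name = min(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {score:.3f}")
print("Selected:", best_name)
```

Cross-validation (for example `sklearn.model_selection.cross_val_score`) would give a more stable comparison than a single split, at the cost of extra training time.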

# Team Members

|Full Name                        | Email Address                  |
|---------------------------------|--------------------------------|
|Manqoba Mnguni                   | manqoba.mnguni@vodacom.co.za   | 
|Marcus Van Staden                | marcus.vanstaden@vodacom.co.za | 


## Appendix