#### Explore AI Academy: Final Regression Project

<div style="font-size: 35px">
    <font color='Black'> <b>Let's come up with a good name for our project???</b></font> 
<br>
</div>

<a id="cont"></a>

## Table of Contents
* <b>[1. Project Overview](#chapter1)</b>
    * [1.1 Project background](#section_1_1)
    * [1.2 Problem statement and objective of the study](#section_1_2)
    * [1.3 Data source](#section_1_3)
    * [1.4 Importance of the study](#section_1_4)
    * [1.5 Research questions and Hypotheses](#section_1_5)
    * [1.6 Methodology overview](#section_1_6)
    * [1.7 Structure of the notebook](#section_1_7)
* <b>[2. Importing Packages](#chapter2)</b>
* <b>[3. Loading Data](#chapter3)</b>
* <b>[4. Data Cleaning](#chapter4)</b>
* <b>[5. Exploratory Data Analysis (EDA)](#chapter5)</b>
* <b>[6. Feature Engineering](#chapter6)</b>
* <b>[7. Modeling](#chapter7)</b>
* <b>[8. Model Performance](#chapter8)</b>
* <b>[9. Conclusion](#chapter9)</b>
* <b>[10. References](#chapter10)</b>

## 1. Project Overview <a class="anchor" id="chapter1"></a>

### 1.1 Project background <a class="anchor" id="section_1_1"></a>

### 1.2 Problem statement and objective of the study <a class="anchor" id="section_1_2"></a>

### 1.3 Data source <a class="anchor" id="section_1_3"></a>

### 1.4 Importance of the study <a class="anchor" id="section_1_4"></a>

### 1.5 Research questions and Hypotheses <a class="anchor" id="section_1_5"></a>

#### Research Questions:

    
#### Hypotheses

    
### 1.6 Methodology overview <a class="anchor" id="section_1_6"></a>

### 1.7 Structure of the notebook <a class="anchor" id="section_1_7"></a>

The project notebook used for this study will be structured as follows:
- [Table of contents](#cont)
- [Chapter 1](#chapter1)
- [Chapter 2](#chapter2) 
- [Chapter 3](#chapter3) 
- [Chapter 4](#chapter4) 
- [Chapter 5](#chapter5)
- [Chapter 6](#chapter6)
- [Chapter 7](#chapter7)
- [Chapter 8](#chapter8) 
- [Chapter 9](#chapter9)
- [Chapter 10](#chapter10)
  
[Back to Table of contents](#cont)

## 2. Importing Packages <a class="anchor" id="chapter2"></a>

<div class="alert alert-block alert-danger">
<b>DO NOT</b> run this line of code unless you want to export your own environment into a new requirements.txt file, as it will
overwrite the one currently in your directory!
<br><br>
Uncomment the line of code below if you wish to run it. The requirements.txt file needed for this project already exists in the repo,
but you are welcome to uncomment and run it to see it work.
</div>

In [678]:
#pip freeze > requirements.txt

<div class="alert alert-block alert-danger">
It is critical to run the line of code below to ensure you are using the correct packages to run our notebook code.

The line of code below installs the correct version of each package used in this notebook, as listed in the
requirements.txt document.

<b>NOTE:</b> Right now the requirements.txt file does not yet exist, so do not attempt to run the code below! Just note that it has been commented out for now.
</div>

In [679]:
#pip install -r requirements.txt

<div class="alert alert-block alert-info">
The libraries below have been identified as necessary so far. More will be added as we go along.
</div>

In [74]:
# Libraries for data loading, manipulation and analysis

import re # No need to add to requirements.txt as this comes pre-installed with python when creating the environment
import numpy as np
import pandas as pd
import csv # No need to add to requirements.txt as this comes pre-installed with python when creating the environment
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Displays output inline
%matplotlib inline

# Library for handling warnings
import warnings
warnings.filterwarnings('ignore')

[Back to Table of contents](#cont)

## 3. Loading data <a class="anchor" id="chapter3"></a>

This chapter focuses on loading the dataset into our notebook for further manipulation, analysis and exploration later on.

### Creating dataframe and loading dataset

The data used for this project is located in the `co2_emissions_from_agri.csv` file. To make it easier to manipulate and analyse, the file is loaded into a `Pandas` DataFrame using the `Pandas` function `read_csv()`. If `sep=None`, the C parsing engine cannot automatically detect the separator, but the Python parsing engine can, so the latter is used and detects the separator from only the first valid row of the file using Python's built-in sniffer tool, `csv.Sniffer`. `Pandas` calls the sniffer internally, so the `csv` import earlier is not strictly required for this step. We pass `engine="python"` explicitly so that the fallback to the Python engine does not raise a `ParserWarning`. We store the result in a new variable called `co2_emissions_df`, which keeps the dataset in memory for any future use.

In [75]:
# Load the dataset; sep=None lets the Python engine detect the separator
co2_emissions_df = pd.read_csv("co2_emissions_from_agri.csv", sep=None, engine="python")
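
The separator detection described above can be tried on a small in-memory file. This is a minimal sketch: the semicolon-delimited `sample_csv` text and the name `sniffed_df` are made up for illustration and are not taken from the project dataset.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited sample (illustration only, not project data)
sample_csv = "Area;Year;total_emission\nA;1990;1.5\nB;1991;2.5\n"

# sep=None asks the Python engine to detect the separator from the first row
sniffed_df = pd.read_csv(io.StringIO(sample_csv), sep=None, engine="python")

print(list(sniffed_df.columns))
print(sniffed_df.shape)
```

If the separator had not been detected, the frame would have a single column holding each whole line, so checking the column list is a quick sanity test after loading.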

To control how many columns are shown when displaying the dataframe, we use the `pd.set_option()` function with the `display.max_columns` option. The value is set to `None`, which removes the column limit. This suits the dataset because it does not have an excessive number of columns.

In [76]:
pd.set_option("display.max_columns", None)
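
As a possible alternative when the unlimited column display is only wanted for a single cell, `pd.option_context` applies the setting temporarily. The sketch below only reads the option back to show the scoping; the variable names are illustrative.

```python
import pandas as pd

# Remember the current setting so it can be compared afterwards
before = pd.get_option("display.max_columns")

# The option is changed only inside the with-block
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")  # None means "no limit"

# Outside the block the previous setting is restored
after = pd.get_option("display.max_columns")
print(before, inside, after)
```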

### Creating a copy of the dataset

<div class="alert alert-block alert-danger">
<b>To prevent any major unnecessary changes occurring to the original data</b>, a copy of the dataframe was made using the <b>.copy()</b> method and referred to as <b>df</b> for ease of reference.
</div>

In [77]:
# Add the code to make a copy of the dataset
df = co2_emissions_df.copy()
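
The small sketch below illustrates why the copy protects the original data. The frames `original` and `working` are hypothetical stand-ins for `co2_emissions_df` and `df`.

```python
import pandas as pd

# Hypothetical stand-ins for co2_emissions_df and df
original = pd.DataFrame({"Area": ["A", "B"], "total_emission": [1.0, 2.0]})
working = original.copy()  # .copy() makes a deep copy by default

# Changes to the copy, including its column labels, stay in the copy
working.loc[0, "total_emission"] = 99.0
working.columns = working.columns.str.lower()

print(original)
print(working)
```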

[Back to Table of contents](#cont)

## 4. Data Cleaning <a class="anchor" id="chapter4"></a>

<div class="alert alert-block alert-info">
<b>Data cleaning</b> is a crucial step in the data analysis process, involving the correction or removal of incorrect, corrupted, duplicate, or incomplete data within a dataset. Through various techniques such as filling missing values, removing outliers, and standardizing data formats, it ensures the accuracy and reliability of subsequent analyses and decision-making.
</div>


### Initial feature investigation using `df.head()`

With the working copy `df` in place, we use the `Pandas` method `df.head()` to take a quick look at the first five rows of the dataset. This gives us a feel for the columns we can expect to see and a glimpse of the values they contain. We can see a combination of years, numbers and strings represented in the dataset. We will explore this data in more detail as we progress through this chapter; for now, let us get a quick feel for what we will be working with.

In [78]:
# Testing if data frame has been loaded successfully and generating an overview

df.head()

Unnamed: 0,Area,Year,Savanna fires,Forest fires,Crop Residues,Rice Cultivation,Drained organic soils (CO2),Pesticides Manufacturing,Food Transport,Forestland,Net Forest conversion,Food Household Consumption,Food Retail,On-farm Electricity Use,Food Packaging,Agrifood Systems Waste Disposal,Food Processing,Fertilizers Manufacturing,IPPU,Manure applied to Soils,Manure left on Pasture,Manure Management,Fires in organic soils,Fires in humid tropical forests,On-farm energy use,Rural population,Urban population,Total Population - Male,Total Population - Female,total_emission,Average Temperature °C
0,Afghanistan,1990,14.7237,0.0557,205.6077,686.0,0.0,11.807483,63.1152,-2388.803,0.0,79.0851,109.6446,14.2666,67.631366,691.7888,252.21419,11.997,209.9778,260.1431,1590.5319,319.1763,0.0,0.0,,9655167.0,2593947.0,5348387.0,5346409.0,2198.963539,0.536167
1,Afghanistan,1991,14.7237,0.0557,209.4971,678.16,0.0,11.712073,61.2125,-2388.803,0.0,80.4885,116.6789,11.4182,67.631366,710.8212,252.21419,12.8539,217.0388,268.6292,1657.2364,342.3079,0.0,0.0,,10230490.0,2763167.0,5372959.0,5372208.0,2323.876629,0.020667
2,Afghanistan,1992,14.7237,0.0557,196.5341,686.0,0.0,11.712073,53.317,-2388.803,0.0,80.7692,126.1721,9.2752,67.631366,743.6751,252.21419,13.4929,222.1156,264.7898,1653.5068,349.1224,0.0,0.0,,10995568.0,2985663.0,6028494.0,6028939.0,2356.304229,-0.259583
3,Afghanistan,1993,14.7237,0.0557,230.8175,686.0,0.0,11.712073,54.3617,-2388.803,0.0,85.0678,81.4607,9.0635,67.631366,791.9246,252.21419,14.0559,201.2057,261.7221,1642.9623,352.2947,0.0,0.0,,11858090.0,3237009.0,7003641.0,7000119.0,2368.470529,0.101917
4,Afghanistan,1994,14.7237,0.0557,242.0494,705.6,0.0,11.712073,53.9874,-2388.803,0.0,88.8058,90.4008,8.3962,67.631366,831.9181,252.21419,15.1269,182.2905,267.6219,1689.3593,367.6784,0.0,0.0,,12690115.0,3482604.0,7733458.0,7722096.0,2500.768729,0.37225


**Observations:**

- The first noticeable observation is that the column names follow no consistent naming convention. Some columns capitalize only the first word (for example `Savanna fires`), while others capitalize every word (for example `Crop Residues`). Neither form matches the lowercase snake_case style that `PEP8` recommends for names.
- Hyphens are also present, as in the columns `On-farm Electricity Use` and `Total Population - Male`. Hyphenated names once again do not conform to `PEP8` naming conventions.
- Only one column, `total_emission`, already follows the `PEP8` naming convention.
- Some columns contain parentheses, `(` and `)`, which also do not align with `PEP8` standards.
- One column contains the degree symbol `°`, which once again does not align with `PEP8` standards and interferes with uniformity.
- It is therefore clear that the column names in the dataset **(1)** lack uniformity and **(2)** do not adhere to `PEP8` standards. It is worth cleaning them up so that all column names are uniform and conform to `PEP8` standards.

### Standardizing column names for ease of reference, ensuring uniformity and compliance

The initial feature investigation identified that certain columns do not comply with `PEP8` naming conventions and that the names lack uniformity. It is therefore necessary to transform all column names into a standardized naming convention for ease of reference, uniformity across column names and compliance with `PEP8` standards.

We start by converting all characters in the column names to lowercase using the `df.columns.str.lower()` method.

We then apply a `lambda` function to each column name using the `df.columns.map()` method. The `lambda` function takes each column name and, using the `.replace()` method, replaces all spaces ` ` and hyphens `-` with an underscore `_`, and removes the opening `(` and closing `)` parentheses as well as the degree symbol `°`.

Names such as `Total Population - Male` contain a hyphen surrounded by spaces, so the replacements above produce three consecutive underscores (`total_population___male`). We rename these columns using the `df.rename()` method so that they follow the same naming convention as the other columns and comply with `PEP8`.

Finally, we call the `df.head()` method once again to confirm that all the changes mentioned above have taken effect.

In [79]:
# Standardizing column names for ease of reference

# string method for lower case - converts all headers to lower
df.columns = df.columns.str.lower()

# Lambda method to replace special characters with an underscore and remove other unwanted special characters to ensure uniformity
df.columns = df.columns.map(lambda x : x.replace(' ', '_').replace('-', '_').replace('(', '').replace(')', '').replace('°', ''))

# Manually rename columns that had a discrepancy using the above methods 
df = df.rename(columns={"total_population___male": "total_population_male",
                        "total_population___female": "total_population_female",
                        })
# Display data frame with new headers 
df.head()

Unnamed: 0,area,year,savanna_fires,forest_fires,crop_residues,rice_cultivation,drained_organic_soils_co2,pesticides_manufacturing,food_transport,forestland,net_forest_conversion,food_household_consumption,food_retail,on_farm_electricity_use,food_packaging,agrifood_systems_waste_disposal,food_processing,fertilizers_manufacturing,ippu,manure_applied_to_soils,manure_left_on_pasture,manure_management,fires_in_organic_soils,fires_in_humid_tropical_forests,on_farm_energy_use,rural_population,urban_population,total_population_male,total_population_female,total_emission,average_temperature_c
0,Afghanistan,1990,14.7237,0.0557,205.6077,686.0,0.0,11.807483,63.1152,-2388.803,0.0,79.0851,109.6446,14.2666,67.631366,691.7888,252.21419,11.997,209.9778,260.1431,1590.5319,319.1763,0.0,0.0,,9655167.0,2593947.0,5348387.0,5346409.0,2198.963539,0.536167
1,Afghanistan,1991,14.7237,0.0557,209.4971,678.16,0.0,11.712073,61.2125,-2388.803,0.0,80.4885,116.6789,11.4182,67.631366,710.8212,252.21419,12.8539,217.0388,268.6292,1657.2364,342.3079,0.0,0.0,,10230490.0,2763167.0,5372959.0,5372208.0,2323.876629,0.020667
2,Afghanistan,1992,14.7237,0.0557,196.5341,686.0,0.0,11.712073,53.317,-2388.803,0.0,80.7692,126.1721,9.2752,67.631366,743.6751,252.21419,13.4929,222.1156,264.7898,1653.5068,349.1224,0.0,0.0,,10995568.0,2985663.0,6028494.0,6028939.0,2356.304229,-0.259583
3,Afghanistan,1993,14.7237,0.0557,230.8175,686.0,0.0,11.712073,54.3617,-2388.803,0.0,85.0678,81.4607,9.0635,67.631366,791.9246,252.21419,14.0559,201.2057,261.7221,1642.9623,352.2947,0.0,0.0,,11858090.0,3237009.0,7003641.0,7000119.0,2368.470529,0.101917
4,Afghanistan,1994,14.7237,0.0557,242.0494,705.6,0.0,11.712073,53.9874,-2388.803,0.0,88.8058,90.4008,8.3962,67.631366,831.9181,252.21419,15.1269,182.2905,267.6219,1689.3593,367.6784,0.0,0.0,,12690115.0,3482604.0,7733458.0,7722096.0,2500.768729,0.37225


**Observations:**

- Our first transformation called the `df.columns.str.lower()` method, which converted all column names to lowercase to ensure `PEP8` compliance.
- Secondly, we used a `lambda` function to replace all spaces ` ` and hyphens `-` with underscores and to remove all parentheses `(` and `)` as well as the degree symbol `°`, and applied it to every column name using the `df.columns.map()` method.
- Finally, we used the `df.rename()` method to rename the columns that ended up with three consecutive underscores, so that they follow the same naming convention as the other columns and comply with `PEP8`.
- Together, these steps make our features easy to use, reference and call across our program. There is no need to remember other naming conventions since the features now all align with `PEP8` standards.
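
For comparison, the same standardization could be sketched in a single pass with the `re` module imported earlier. This is an alternative sketch and not the method used in this notebook: the helper name `to_snake_case` and the small `demo` frame are hypothetical. Because runs of non-alphanumeric characters collapse into one underscore, no manual rename is needed in this variant.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lower-case a label and collapse every run of other characters into one underscore."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9]+", "_", name)
    # Drop underscores left at either end, e.g. by a closing parenthesis
    return name.strip("_")


# Hypothetical empty frame that reuses a few of the original column labels
demo = pd.DataFrame(columns=[
    "Total Population - Male",
    "Drained organic soils (CO2)",
    "Average Temperature °C",
    "On-farm Electricity Use",
    "total_emission",
])
demo.columns = demo.columns.map(to_snake_case)
print(list(demo.columns))
```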

### Exploring columns within the dataset

Use `.shape` to understand the size of the dataframe and make sure it has been loaded correctly by checking the dimensions.

In [80]:
# Exploring shape of dataset

df.shape

(6965, 31)

**Observation** : The dataset consists of 6965 rows (observations) and 31 columns (features).

We use `.info()` to provide an overview of the data frame to give a quick summary of the rows and columns and make sure that the data types are in line with the columns.

In [81]:
# Exploring columns within data frame and determine data types 

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6965 entries, 0 to 6964
Data columns (total 31 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   area                             6965 non-null   object 
 1   year                             6965 non-null   int64  
 2   savanna_fires                    6934 non-null   float64
 3   forest_fires                     6872 non-null   float64
 4   crop_residues                    5576 non-null   float64
 5   rice_cultivation                 6965 non-null   float64
 6   drained_organic_soils_co2        6965 non-null   float64
 7   pesticides_manufacturing         6965 non-null   float64
 8   food_transport                   6965 non-null   float64
 9   forestland                       6472 non-null   float64
 10  net_forest_conversion            6472 non-null   float64
 11  food_household_consumption       6492 non-null   float64
 12  food_retail         

**Observations:**
- First of all, there are no notable discrepancies in the columns of the dataset. The features all appear to be in order.
- The dataset has 6 965 entries, indexed from 0 to 6964.
- The dataset has a total of 31 columns.
- The datatypes (`dtypes`) consist of 29 `float64`, 1 `int64` and 1 `object`.
- The dataframe object uses around 1.6 MB of memory. This is a very small amount, which suggests that functions applied to the dataset for analysis purposes should run efficiently without any noticeable slowdown.
- A few columns return `Non-Null Count` values that are not equal to 6 965 (our total number of entries). This suggests those columns contain null values that require further investigation as we move along in this chapter.
- The single `object` datatype belongs to the `area` feature, which is the name of the country. `object` is the datatype associated with strings, so it makes sense for this feature.
- The single `int64` datatype belongs to the `year` feature, which is the year for which emissions were measured. `int64` makes sense for a year feature since years never have decimal values.
- All the remaining features are of datatype `float64`, implying that they can hold decimal values. Since almost all of those features measure emissions, the `float64` datatype once again makes sense.
- The only `float64` features that might require further investigation are `rural_population`, `urban_population`, `total_population_male` and `total_population_female`. Float is generally reserved for numerical values with decimals, and a population count is a whole number of people. This is investigated further in subsequent sections of this chapter.
- Reconsidering some features' datatypes could also help the performance of the dataframe as it is used in other parts of the notebook, especially during EDA and modelling.

### Typecasting some columns to more appropriate datatypes

As identified earlier, the population columns, namely `rural_population`, `urban_population`, `total_population_male` and `total_population_female`, are of datatype `float64`, which is generally reserved for numerical values with decimal places. A population count is always a whole number of people, so `float64` is not the most appropriate datatype for these columns. We therefore typecast them to `int64`, which is reserved for numerical values with no decimal places.

In [82]:
# Typecasting population columns to more appropriate int64 datatype

df['rural_population'] = df['rural_population'].astype('int64')
df['urban_population'] = df['urban_population'].astype('int64')
df['total_population_male'] = df['total_population_male'].astype('int64')
df['total_population_female'] = df['total_population_female'].astype('int64')
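
Casting to `int64` raises an error if a column contains NaN, and it silently truncates genuine decimals. A cautious variant could check both conditions first. The sketch below is illustrative only: the helper `cast_if_whole` and the `demo` values (including the missing urban value) are hypothetical. It falls back to the nullable `Int64` dtype when values are missing.

```python
import numpy as np
import pandas as pd


def cast_if_whole(series: pd.Series) -> pd.Series:
    """Cast a float column to an integer dtype only when every value is a whole number."""
    non_null = series.dropna()
    if not (non_null % 1 == 0).all():
        return series  # genuine decimals: keep float64
    if series.isna().any():
        return series.astype("Int64")  # nullable integer keeps missing values
    return series.astype("int64")


# Hypothetical values; the NaN is added only to show the nullable branch
demo = pd.DataFrame({
    "rural_population": [9655167.0, 10230490.0],
    "urban_population": [2593947.0, np.nan],
    "average_temperature_c": [0.536167, 0.020667],
})
converted = pd.DataFrame({col: cast_if_whole(demo[col]) for col in demo.columns})
print(converted.dtypes)
```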

We use `.info()` again to confirm that the typecasting of those columns has taken place and has had no adverse effect on the data frame.

In [83]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6965 entries, 0 to 6964
Data columns (total 31 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   area                             6965 non-null   object 
 1   year                             6965 non-null   int64  
 2   savanna_fires                    6934 non-null   float64
 3   forest_fires                     6872 non-null   float64
 4   crop_residues                    5576 non-null   float64
 5   rice_cultivation                 6965 non-null   float64
 6   drained_organic_soils_co2        6965 non-null   float64
 7   pesticides_manufacturing         6965 non-null   float64
 8   food_transport                   6965 non-null   float64
 9   forestland                       6472 non-null   float64
 10  net_forest_conversion            6472 non-null   float64
 11  food_household_consumption       6492 non-null   float64
 12  food_retail         

**Observations:**
- We can now see that `rural_population`, `urban_population`, `total_population_male` and `total_population_female` are of the datatype `int64` as desired.
- We see no significant improvement in memory usage, as the memory usage is still 1.6+ MB.
- There is also no change in the number of entries in the dataframe, which remains 6965.
- It would appear our transformation to typecast the mentioned columns to `int64` was successful and has had no adverse effect on our dataframe.

### Checking for duplicated rows or values

Duplicates can skew statistical data, leading to inaccurate insights and conclusions. By identifying and removing duplicates, we preserve the integrity of our dataset and ensure that duplicates do not affect any statistical inferences or influence the conclusions and insights drawn from our study.

We do this by calling the `df.duplicated().any()` method and interpreting its result. If it returns `True`, some rows have been duplicated and further data cleaning is required to determine the best course of action for dealing with them. If it returns `False`, no further action is needed in our initial investigation of duplicate rows.

In [84]:
# checking for duplicate rows
print(df.duplicated().any())

False


**Result:**
- The initial investigation suggests that the dataset does not contain any duplicate rows.
- **Note** this method only tells us whether an entire row has been duplicated, meaning every value in every feature is the same. Duplicates might still exist in a subset of the columns even though no full row is repeated.
- It might be worth investigating subsets of columns to determine whether duplication exists elsewhere in the dataset.
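
One way such a follow-up check could look is to test for repeated key combinations using the `subset` parameter of `duplicated()`. This is a sketch on hypothetical data, and it assumes that each `area` and `year` pair is expected to appear only once, which the notebook has not confirmed for the real dataset.

```python
import pandas as pd

# Hypothetical rows: the same area and year appear twice with different values
demo = pd.DataFrame({
    "area": ["A", "A", "B"],
    "year": [1990, 1990, 1990],
    "total_emission": [1.0, 2.0, 3.0],
})

# Full-row check, as used in the cell above: no row is repeated entirely
full_row_dupes = demo.duplicated().any()

# Key-level check: keep=False flags every row that shares an area/year pair
key_dupes = demo.duplicated(subset=["area", "year"], keep=False)

print(full_row_dupes)
print(demo[key_dupes])
```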

### Checking for Null or NaN values and applying an appropriate strategy to deal with the same

To check for Null or NaN values, we invoke the `df.isnull().sum()` method. This returns a `Pandas` Series indexed by the feature/column names, with the number of Null or NaN values found in each feature. The summary lets us determine which features contain Null or NaN values and directs our strategy towards the specific columns that need to be treated.

In [85]:
# checking for NaN values
df.isnull().sum()

area                                  0
year                                  0
savanna_fires                        31
forest_fires                         93
crop_residues                      1389
rice_cultivation                      0
drained_organic_soils_co2             0
pesticides_manufacturing              0
food_transport                        0
forestland                          493
net_forest_conversion               493
food_household_consumption          473
food_retail                           0
on_farm_electricity_use               0
food_packaging                        0
agrifood_systems_waste_disposal       0
food_processing                       0
fertilizers_manufacturing             0
ippu                                743
manure_applied_to_soils             928
manure_left_on_pasture                0
manure_management                   928
fires_in_organic_soils                0
fires_in_humid_tropical_forests     155
on_farm_energy_use                  956


**Result:** Some columns contain Null or NaN values. We replace these values with 0 where appropriate, so as not to skew the analysis further down when EDA and modelling techniques are performed.
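
To judge how much of each column is affected, the counts can be paired with the share of missing values. The sketch below uses a small hypothetical frame; `missing_summary` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing values in one column
demo = pd.DataFrame({
    "savanna_fires": [14.7, np.nan, 3.2, np.nan],
    "year": [1990, 1991, 1992, 1993],
})

# isnull().mean() gives the fraction of missing values per column
missing_summary = pd.DataFrame({
    "missing_count": demo.isnull().sum(),
    "missing_pct": demo.isnull().mean() * 100,
})

# Keep only the columns that actually contain missing values
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
print(missing_summary)
```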

#### Investigating Null values in certain columns and replacing them with 0

We replace all the Null or NaN values in the `savanna_fires`, `forest_fires`, and `fires_in_humid_tropical_forests` columns with 0. First, we filter the dataframe for the rows where each column is null using the `.isnull()` method, and assign the filtered rows to a new dataframe object for that feature so that we can see which areas are affected.

We then replace the null values in each of these features with 0 using the `.fillna()` method, passing `0` as the fill value and assigning the result back to the column. Assigning back is used instead of calling `.fillna(0, inplace=True)` on a selected column, because that chained in-place pattern is deprecated in recent `Pandas` versions and may not update the original dataframe.

In [86]:
# Investigating Null Values in savanna_fires
null_savanna_fires_df = df[df['savanna_fires'].isnull()]
print(f"Areas with NAN values in savanna_fires: {null_savanna_fires_df['area'].unique()}")

# Replace NaN values in savanna_fires with 0
df['savanna_fires'] = df['savanna_fires'].fillna(0)

# Investigating Null Values in forest_fires
null_forest_fires_df = df[df['forest_fires'].isnull()]
print(f"Areas with NAN values in forest_fires: {null_forest_fires_df['area'].unique()}")

# Replace NaN values in forest_fires with 0
df['forest_fires'] = df['forest_fires'].fillna(0)

# Investigating Null Values in fires_in_humid_tropical_forests
null_fires_in_humid_tropical_forests = df[df['fires_in_humid_tropical_forests'].isnull()]
print(f"Areas with NAN values in fires_in_humid_tropical_forests: {null_fires_in_humid_tropical_forests['area'].unique()}")

# Replace NaN values in fires_in_humid_tropical_forests with 0
df['fires_in_humid_tropical_forests'] = df['fires_in_humid_tropical_forests'].fillna(0)


Areas with NAN values in savanna_fires: ['Holy See']
Areas with NAN values in forest_fires: ['Holy See' 'Monaco' 'San Marino']
Areas with NAN values in fires_in_humid_tropical_forests: ['Channel Islands' 'Holy See' 'Liechtenstein' 'Monaco' 'San Marino']


**Result:** The NaN values in the columns `savanna_fires`, `forest_fires`, and `fires_in_humid_tropical_forests` correspond to areas that do not have those types of environments. These NaN values have been replaced with 0.
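
The zero-fill for several columns can also be sketched as a single call by passing a dictionary to `DataFrame.fillna()`. The `demo` frame and the list name `zero_fill_columns` are hypothetical; columns not named in the dictionary are left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the columns treated in this section
zero_fill_columns = ["savanna_fires", "forest_fires"]

demo = pd.DataFrame({
    "area": ["Holy See", "Monaco"],
    "savanna_fires": [np.nan, 1.5],
    "forest_fires": [np.nan, np.nan],
    "ippu": [np.nan, 2.0],
})

# A dict maps each column to its fill value; "ippu" is not listed, so it keeps its NaN
demo = demo.fillna({col: 0 for col in zero_fill_columns})
print(demo)
```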

We then call the `df.head()` method again to display the dataframe and confirm that the mentioned null values were correctly replaced with 0.

In [87]:
df.head()

Unnamed: 0,area,year,savanna_fires,forest_fires,crop_residues,rice_cultivation,drained_organic_soils_co2,pesticides_manufacturing,food_transport,forestland,net_forest_conversion,food_household_consumption,food_retail,on_farm_electricity_use,food_packaging,agrifood_systems_waste_disposal,food_processing,fertilizers_manufacturing,ippu,manure_applied_to_soils,manure_left_on_pasture,manure_management,fires_in_organic_soils,fires_in_humid_tropical_forests,on_farm_energy_use,rural_population,urban_population,total_population_male,total_population_female,total_emission,average_temperature_c
0,Afghanistan,1990,14.7237,0.0557,205.6077,686.0,0.0,11.807483,63.1152,-2388.803,0.0,79.0851,109.6446,14.2666,67.631366,691.7888,252.21419,11.997,209.9778,260.1431,1590.5319,319.1763,0.0,0.0,,9655167,2593947,5348387,5346409,2198.963539,0.536167
1,Afghanistan,1991,14.7237,0.0557,209.4971,678.16,0.0,11.712073,61.2125,-2388.803,0.0,80.4885,116.6789,11.4182,67.631366,710.8212,252.21419,12.8539,217.0388,268.6292,1657.2364,342.3079,0.0,0.0,,10230490,2763167,5372959,5372208,2323.876629,0.020667
2,Afghanistan,1992,14.7237,0.0557,196.5341,686.0,0.0,11.712073,53.317,-2388.803,0.0,80.7692,126.1721,9.2752,67.631366,743.6751,252.21419,13.4929,222.1156,264.7898,1653.5068,349.1224,0.0,0.0,,10995568,2985663,6028494,6028939,2356.304229,-0.259583
3,Afghanistan,1993,14.7237,0.0557,230.8175,686.0,0.0,11.712073,54.3617,-2388.803,0.0,85.0678,81.4607,9.0635,67.631366,791.9246,252.21419,14.0559,201.2057,261.7221,1642.9623,352.2947,0.0,0.0,,11858090,3237009,7003641,7000119,2368.470529,0.101917
4,Afghanistan,1994,14.7237,0.0557,242.0494,705.6,0.0,11.712073,53.9874,-2388.803,0.0,88.8058,90.4008,8.3962,67.631366,831.9181,252.21419,15.1269,182.2905,267.6219,1689.3593,367.6784,0.0,0.0,,12690115,3482604,7733458,7722096,2500.768729,0.37225


#### Imputing remaining Null or NaN values with Mean values

For the remaining Null or NaN values, which are not in the columns mentioned in the previous section, we impute the mean of the column so as to not skew further analysis on the dataset. The imputation is only applied to the columns that have Null or NaN values, namely:
- `crop_residues`
- `forestland`
- `net_forest_conversion`
- `food_household_consumption`
- `ippu`
- `manure_applied_to_soils`
- `manure_management`
- `on_farm_energy_use`

The missing, Null or NaN values in each column are imputed using the `.fillna()` method. The fill value passed to the method is the mean of that same column, calculated with the `.mean()` method, and the result is assigned back to the column so that the dataframe is updated. Assigning back is used instead of `inplace=True` on a selected column, because that chained in-place pattern is deprecated in recent `Pandas` versions and may not update the original dataframe.

In [88]:
# Replacing the other NaN values with the column mean so as not to skew further analysis

df["crop_residues"] = df["crop_residues"].fillna(df["crop_residues"].mean())
df["forestland"] = df["forestland"].fillna(df["forestland"].mean())
df["net_forest_conversion"] = df["net_forest_conversion"].fillna(df["net_forest_conversion"].mean())
df["food_household_consumption"] = df["food_household_consumption"].fillna(df["food_household_consumption"].mean())
df["ippu"] = df["ippu"].fillna(df["ippu"].mean())
df["manure_applied_to_soils"] = df["manure_applied_to_soils"].fillna(df["manure_applied_to_soils"].mean())
df["manure_management"] = df["manure_management"].fillna(df["manure_management"].mean())
df["on_farm_energy_use"] = df["on_farm_energy_use"].fillna(df["on_farm_energy_use"].mean())

**Result:** The Null or NaN values in the remaining columns were replaced with the mean of each column to avoid skewing the analysis. This was performed using the code above.

Once we have finished handling the missing, Null or NaN values with the strategies implemented above, we check for Null or NaN values once more to confirm that no further missing value handling is necessary. We again invoke the `df.isnull().sum()` method to do this. It returns a `Pandas` Series with all the feature/column names and the number of Null or NaN values remaining in each feature.
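
The same column-wise mean imputation can be sketched for a list of columns at once, since `DataFrame.fillna()` accepts a Series of fill values keyed by column name. The `demo` frame and the list `mean_fill_columns` below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the columns that receive mean imputation
mean_fill_columns = ["crop_residues", "ippu"]

demo = pd.DataFrame({
    "crop_residues": [1.0, np.nan, 3.0],
    "ippu": [np.nan, 4.0, 6.0],
    "year": [1990, 1991, 1992],
})

# .mean() skips NaN by default and returns one mean per column
column_means = demo[mean_fill_columns].mean()

# Each column is filled with its own mean, matched by column name
demo[mean_fill_columns] = demo[mean_fill_columns].fillna(column_means)
print(demo)
```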

In [89]:
# Check if all NaN values have been removed 
df.isnull().sum()

area                               0
year                               0
savanna_fires                      0
forest_fires                       0
crop_residues                      0
rice_cultivation                   0
drained_organic_soils_co2          0
pesticides_manufacturing           0
food_transport                     0
forestland                         0
net_forest_conversion              0
food_household_consumption         0
food_retail                        0
on_farm_electricity_use            0
food_packaging                     0
agrifood_systems_waste_disposal    0
food_processing                    0
fertilizers_manufacturing          0
ippu                               0
manure_applied_to_soils            0
manure_left_on_pasture             0
manure_management                  0
fires_in_organic_soils             0
fires_in_humid_tropical_forests    0
on_farm_energy_use                 0
rural_population                   0
urban_population                   0
t

**Result:** We can clearly see that there are no longer any missing, Null or NaN values remaining in any of the features of the dataset. We have, therefore, successfully determined and applied our strategies for dealing with the previously identified missing, Null or NaN values. This gives us a cleaner basis for analyzing the data moving forward and removes an obvious source of skew from that analysis.
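When a dataframe has many columns, it can be easier to list only the columns that still contain missing values than to scan the full output. The sketch below illustrates this on a made-up dataframe. The names `toy`, `missing` and `remaining` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value left in column "b" (values are made up)
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, np.nan]})

# Per-column count of missing values
missing = toy.isnull().sum()

# Keep only the columns that still have missing values
remaining = missing[missing > 0]
```

An empty `remaining` Series means that no further missing value handling is needed.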

### Exploring and understanding the overall distribution of the data.

Next up, we want a quick overview of our data to get a better sense of the distribution and spread of values in our dataset at a glance. This is particularly useful for:
- Statistical comparison of data across different columns in a summarized view.
- Quickly identifying any noticeable potential issues with the data and/or evidence of unusual values in any of the columns.
- Understanding the general spread and distribution of each column in the dataset.

We will do so using the `df.describe()` method, which generates descriptive statistics for the numerical columns in the dataframe. This will provide us with insight into the following values for each column:
* `count`: Count of total values,
* `mean`: Mean of each column,
* `std`: Standard Deviation of each column,
* `min`: Minimum value,
* `max`: Maximum value,
* `25%`: Q1 or 1st Quartile,
* `50%`: Q2 or median,
* `75%`: Q3 or 3rd Quartile.

We will also apply the `.T` attribute to the result of `df.describe()`, which transposes the dataframe so that rows become columns and columns become rows. This makes it easier to read and compare statistics across the different columns of the dataframe.

In [90]:
# Summary of statistics transposed for easier viewing

df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| year | 6965.0 | 2005.125 | 8.894665 | 1990.0 | 1997.0 | 2005.0 | 2013.0 | 2020.0 |
| savanna_fires | 6965.0 | 1183.102 | 5235.196 | 0.0 | 0.0 | 1.587 | 108.3617 | 114616.4 |
| forest_fires | 6965.0 | 907.0272 | 3696.662 | 0.0 | 0.0 | 0.4164 | 61.2372 | 52227.63 |
| crop_residues | 6965.0 | 998.7063 | 3310.818 | 0.0002 | 25.3601 | 193.0831 | 998.7063 | 33490.07 |
| rice_cultivation | 6965.0 | 4259.667 | 17613.83 | 0.0 | 181.2608 | 534.8174 | 1536.64 | 164915.3 |
| drained_organic_soils_co2 | 6965.0 | 3503.229 | 15861.45 | 0.0 | 0.0 | 0.0 | 690.4088 | 241025.1 |
| pesticides_manufacturing | 6965.0 | 333.4184 | 1429.159 | 0.0 | 6.0 | 13.0 | 116.3255 | 16459.0 |
| food_transport | 6965.0 | 1939.582 | 5616.749 | 0.0001 | 27.9586 | 204.9628 | 1207.001 | 67945.76 |
| forestland | 6965.0 | -17828.29 | 78882.49 | -797183.079 | -5960.8296 | -128.4116 | 0.0 | 171121.1 |
| net_forest_conversion | 6965.0 | 17605.64 | 97511.21 | 0.0 | 0.0 | 125.994 | 9877.472 | 1605106.0 |
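One way to use a transposed summary like this is to compare each column's mean with its median (the `50%` column). The sketch below does this on a small made-up dataframe. The `mean_minus_median` column is a hypothetical helper added for illustration and is not part of the `describe()` output.

```python
import pandas as pd

# Toy data: one heavily right-skewed column and one evenly spread column
toy = pd.DataFrame({
    "savanna_fires": [0.0, 0.0, 2.0, 10.0, 988.0],
    "year": [1990, 1991, 1992, 1993, 1994],
})

# One row per original column, one column per statistic
summary = toy.describe().T

# A mean far above the median hints at right skew or large outliers
summary["mean_minus_median"] = summary["mean"] - summary["50%"]
```

A large positive gap, as in the toy `savanna_fires` column, suggests that a few very large values are pulling the mean upwards.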


### Simplifying Area names in the dataframe

Upon further investigation into the `area` feature of the dataset, it was found that some `areas` have long, unwieldy names. Using these names as-is would make any future analysis and manipulation of the data more cumbersome than necessary, because every reference to such an `area` would have to spell out the full name. An example is `United Kingdom of Great Britain and Northern Ireland`, which is long-winded to type out every time that specific `area` is referenced. It therefore makes more sense to convert such names to something easier to reference while still maintaining a clear description of the `area`. For instance, `United Kingdom of Great Britain and Northern Ireland` can be renamed to `United Kingdom` while keeping the same semantic meaning as before.

We will start by identifying all the unique names in the `area` feature, so that we can focus on those `areas` that have cumbersome names. We will then simplify those names to make them easier to reference and reduce the naming complexity while still retaining their semantic meaning. To list the names, we will use the `df[column_name].unique()` method and convert the result to a `list` object using Python's built-in `list()` constructor. Finally, the resulting list will be displayed so that we can interpret the results and find those `areas` whose names require simplification.

In [91]:
# Look at all the unique areas 
list(df['area'].unique())

['Afghanistan',
 'Albania',
 'Algeria',
 'American Samoa',
 'Andorra',
 'Angola',
 'Anguilla',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belgium-Luxembourg',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia (Plurinational State of)',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'British Virgin Islands',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cabo Verde',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Cayman Islands',
 'Central African Republic',
 'Chad',
 'Channel Islands',
 'Chile',
 'China',
 'China, Hong Kong SAR',
 'China, Macao SAR',
 'China, mainland',
 'China, Taiwan Province of',
 'Colombia',
 'Comoros',
 'Congo',
 'Cook Islands',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czechia',
 'Czechoslovakia',
 "Democratic People's Republic of Korea",
 'Democratic Republic of the Congo',
 'Denmark',
 'Djibouti',
 'D

The `area` names below have been identified as requiring simplification, following the logic described above. The identified `area` names are read into a dictionary, with the dictionary `keys` being the original `area` names and the dictionary `values` being the new simplified names. To replace the original `area` names with the simplified ones, the `df['area'].replace()` method will be used. This replaces every value in the column that matches one of the dictionary's `keys` with the corresponding simplified name in the dictionary's `values`. The additional parameter `inplace=True` is also passed so that the dataframe is updated during the call.

In [92]:
# Simplifying Area Names 
df['area'].replace({'United States of America': 'United States',
                    'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
                    'Bolivia (Plurinational State of)': 'Bolivia',
                    'Republic of Korea': 'South Korea',
                    "Democratic People's Republic of Korea": 'North Korea',
                    'Iran (Islamic Republic of)': 'Iran',
                    'Venezuela (Bolivarian Republic of)': 'Venezuela',
                    "Lao People's Democratic Republic": 'Laos',
                    'Micronesia (Federated States of)': 'Micronesia'
                    }, inplace=True)
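The same renaming can also be sketched by assigning the result of `replace()` back to the column, which avoids relying on `inplace=True` on a selected column. The toy dataframe and the two-entry `area_names` mapping below are illustrative assumptions. The mapping is only a subset of the dictionary above.

```python
import pandas as pd

# Toy data standing in for the project dataframe
toy = pd.DataFrame({
    "area": ["Bolivia (Plurinational State of)", "Chile", "Republic of Korea"],
})

# Hypothetical subset of the mapping: original name -> simplified name
area_names = {
    "Bolivia (Plurinational State of)": "Bolivia",
    "Republic of Korea": "South Korea",
}

# Assign the result back to the column so the update is explicit
toy["area"] = toy["area"].replace(area_names)

# Names that are not keys in the mapping are left untouched
simplified = list(toy["area"].unique())
```

Listing the unique names again after the replacement is a quick way to confirm that every intended name was matched.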

### Identifying and treating outliers in each column

The next step in cleaning our data is to identify outliers and determine an appropriate strategy for dealing with them in each column. We will identify outliers in each of the numerical columns in the dataframe by first determining the z-score of every value in the column. We will then count the number of observations in each column whose z-score exceeds the specified threshold. This puts us in a position to analyze the outlier observations in each column and determine an appropriate strategy for dealing with them.

A Z-score is a measure that describes how many standard deviations a data point lies away from the mean of the data set. It is a useful tool for standardizing values onto an easily interpretable scale that allows for comparison across different distributions or data points. Z-scores are commonly used to identify outliers in a data set, by flagging data points whose Z-score exceeds a certain threshold. Standardizing in this way also puts features on a uniform scale, which is helpful for predictive analyses, although it does not by itself change the shape of a distribution.
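Written as a formula, for a value $x$ in a column with mean $\mu$ and standard deviation $\sigma$:

$$
z = \frac{x - \mu}{\sigma}
$$

A value equal to the mean has $z = 0$, and a value two standard deviations above the mean has $z = 2$. Note that `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default.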

We will proceed to identify outliers by declaring a new `outliers_summary` dictionary. We will then run through every column in our dataset using a `for` loop. For every column that has a numeric datatype, we will calculate the Z-score using the `stats.zscore()` function, passing in the column we are currently iterating over and storing the result in a temporary `z_scores` variable, so that no helper column is added to the dataframe. We will then filter the dataframe to the observations whose Z-score is greater than (`>`) 3 or less than (`<`) -3. Inside the `outliers_summary` dictionary, we will add the column name as the `key` and the total number of observations outside the specified Z-score threshold as the `value`. Once the loop through the columns has completed, we will perform another `for` loop through the `outliers_summary` dictionary and print each feature of the dataset along with its outlier count.

In [93]:
# Identify outliers in each numeric column

outliers_summary = {}

for column in df.columns:
    # Skip non-numeric columns
    if pd.api.types.is_numeric_dtype(df[column]):
        # Calculate Z-scores without adding a helper column to df
        z_scores = stats.zscore(df[column])

        # Identify outliers: more than 3 standard deviations from the mean
        outliers = df[(z_scores > 3) | (z_scores < -3)]

        # Store the number of outliers for the column
        outliers_summary[column] = len(outliers)

# Print the number of outliers found for each column
for column, count in outliers_summary.items():
    print(f"Number of outliers in column '{column}': {count}")

Number of outliers in column 'year': 0
Number of outliers in column 'savanna_fires': 122
Number of outliers in column 'forest_fires': 184
Number of outliers in column 'crop_residues': 131
Number of outliers in column 'rice_cultivation': 124
Number of outliers in column 'drained_organic_soils_co2': 48
Number of outliers in column 'pesticides_manufacturing': 123
Number of outliers in column 'food_transport': 137
Number of outliers in column 'forestland': 171
Number of outliers in column 'net_forest_conversion': 78
Number of outliers in column 'food_household_consumption': 92
Number of outliers in column 'food_retail': 96
Number of outliers in column 'on_farm_electricity_use': 94
Number of outliers in column 'food_packaging': 70
Number of outliers in column 'agrifood_systems_waste_disposal': 124
Number of outliers in column 'food_processing': 90
Number of outliers in column 'fertilizers_manufacturing': 66
Number of outliers in column 'ippu': 77
Number of outliers in column 'manure_applied
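The per-column outlier count can also be sketched without an explicit loop, using pandas alone. The toy dataframe below is made up for illustration, and the names `numeric`, `z_scores` and `outlier_counts` are assumptions rather than part of the notebook.

```python
import pandas as pd

# Toy data: "spiky" has one extreme value, "steady" has none, "area" is text
toy = pd.DataFrame({
    "spiky": [1.0] * 19 + [100.0],
    "steady": [float(i) for i in range(20)],
    "area": ["A"] * 20,
})

# Keep only the numeric columns
numeric = toy.select_dtypes(include="number")

# Population standard deviation (ddof=0) matches the scipy.stats.zscore default
z_scores = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Count the observations more than 3 standard deviations from the mean
outlier_counts = (z_scores.abs() > 3).sum()
```

Here `outlier_counts` is a Series with one entry per numeric column, playing the same role as the `outliers_summary` dictionary.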

[Back to Table of contents](#cont)

## 5. Exploratory Data Analysis (EDA) <a class="anchor" id="chapter5"></a>

To give a better understanding of the variables and the relationships between them, we set out to do an **Exploratory Data Analysis (EDA)** of our dataset. The main tasks include investigating the dataset's key features and summarizing its central characteristics, using both data visualisation techniques and statistical analyses to draw meaningful insights that can guide further research and data-driven decision making.


[Back to Table of contents](#cont)

## 6. Feature Engineering <a class="anchor" id="chapter6"></a>
[Back to Table of contents](#cont)

## 7. Modeling <a class="anchor" id="chapter7"></a>

[Back to Table of contents](#cont)

## 8. Model Performance <a class="anchor" id="chapter8"></a>

[Back to Table of contents](#cont)

## 9. Conclusion <a class="anchor" id="chapter9"></a>

This Chapter provides a final summation of the insights and conclusions gathered in the previous Chapters of this project. It offers a summarized retelling of the narrative carried through those Chapters in order to reflect on and review the analyses contained within this study.

### Summary of Key Findings 

### Evaluation of the Methodology

### Implications of the Findings

### Suggestions for Future Work

### Reflection on the Data Source and Quality

### Concluding Thoughts

[Back to Table of contents](#cont)

## 10. References <a class="anchor" id="chapter10"></a>


[Back to Table of contents](#cont)
