# Table of Contents
- <b> [1. Project Overview](#chapter1)</b>
    - [1.1. Introduction](#section_1_1)
    - [1.2. Objective](#section_1_2)
- <b> [2. Importing Packages](#chapter2)</b>
- <b> [3. Data Loading](#chapter3)</b>
- <b> [4. Data Cleaning](#chapter4)</b>
- <b> [5. Exploratory Data Analysis (EDA)](#chapter5)</b>
- <b> [6. Conclusion and Insights](#chapter6)</b>

# 1. Project Overview <a class="anchor" id="chapter1"></a>

## 1.1. Introduction<a class="anchor" id="section_1_1"></a>

Deforestation is one of the most significant modern challenges, with extensive implications for the environment, the economy and society. It is the reduction of forest area through the removal of trees and other vegetation, leaving the land barren or significantly altered. It is typically driven by factors such as the expansion of agricultural land, urban development and some industrial activities. The ongoing loss of forest area has serious consequences, including habitat destruction, biodiversity loss and the displacement of indigenous communities. Understanding deforestation trends is therefore essential for devising conservation strategies, identifying the most affected countries and implementing sustainable practices. (references)

This project explores and analyses a world forest area dataset to identify deforestation trends and address key questions related to forest area. The data is sourced from Kaggle, uploaded by Takumi Watanabe and last updated 10 months ago. (reference) It comprises the following features:
* Country_Name: country names, regions and income statuses, with each element (country, region, income status) appearing as a distinct entry.
* Country_Code: a three-letter code that uniquely represents a country.
* 1990 - 2021 year columns: the annual total forest area, expressed in square kilometres.
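Because each year is stored as its own column, the data is in a wide layout. Many plotting and grouping operations are easier on a long layout with one row per country and year. The following is a minimal sketch of that reshaping with `DataFrame.melt`, assuming the column names described above; the toy values, the placeholder country names and the `Forest_Area_km2` column name are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy frame mimicking the dataset layout (illustrative values only)
wide_df = pd.DataFrame({
    "Country_Name": ["Country A", "Country B"],  # placeholder names
    "Country_Code": ["AAA", "BBB"],              # placeholder codes
    "1990": [100.0, 50.0],
    "1991": [98.0, 51.0],
})

# Reshape to one row per (country, year)
long_df = wide_df.melt(
    id_vars=["Country_Name", "Country_Code"],
    var_name="Year",
    value_name="Forest_Area_km2",  # hypothetical column name
)

# Year labels arrive as strings; convert them for numeric plotting
long_df["Year"] = long_df["Year"].astype(int)
```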
 

The key questions to be explored in this project are as follows:

* What are the global trends in forest area from 1990 to 2021?
* How does the rate of deforestation or afforestation compare across different countries?
* Which countries, regions and income levels have seen the largest decrease or increase in forest area between 1990 and 2021?
* What is the relationship between the rate of deforestation and income status?
* What is the relationship between the rate of deforestation and region?
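Several of these questions depend on how "change" and "rate" are measured. One possible definition, sketched below on toy data, is the net change in square kilometres and the percentage change relative to the 1990 value. This is an assumption made for illustration rather than the method used later in the notebook; the `net_change_km2` and `pct_change` column names are hypothetical.

```python
import pandas as pd

# Toy data: forest area in square kilometres for two placeholder countries
toy = pd.DataFrame({
    "Country_Name": ["Country A", "Country B"],
    "1990": [200.0, 80.0],
    "2021": [150.0, 100.0],
})

# Net change: negative values indicate a loss of forest area
toy["net_change_km2"] = toy["2021"] - toy["1990"]

# Percentage change relative to the 1990 baseline
toy["pct_change"] = toy["net_change_km2"] / toy["1990"] * 100
```

A percentage makes countries of very different sizes comparable, while the net change shows where the largest absolute losses or gains occurred.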


This notebook is organised into several sections to give the analysis a clear structure. The approach relies on Python’s data analysis and visualisation libraries: pandas, numpy, matplotlib and seaborn. The Importing Packages section loads these libraries. The Data Loading section loads the data into a DataFrame and performs an initial inspection. The Data Cleaning section prepares the data for analysis by handling missing or inconsistent values. The Exploratory Data Analysis (EDA) section applies data visualisations and statistical methods to uncover trends and insights relevant to the project’s objective. The results are intended to give an overview of global deforestation and afforestation trends, highlighting which countries have experienced the greatest losses or gains in forest area. The final section, Conclusion and Insights, summarises the findings and gives recommendations based on them.

### 1.1.1 Problem Statement<a class="anchor" id="section_1_1_1"></a>

How has forest area changed globally over the years, and which countries, regions and income levels are most affected?


## 1.2. Objective<a class="anchor" id="section_1_2"></a>

* To perform exploratory data analysis of the global deforestation dataset.
* To analyse the change in global forest area between 1990 and 2021, identifying trends of deforestation or afforestation for each country.
* To identify countries, regions and income levels that have the largest decrease and increase in forest area between 1990 and 2021.
* To determine the relationship between income status and rate of deforestation.
* To determine the relationship between various regions and rate of deforestation.


# 2. Importing Packages<a class="anchor" id="chapter2"></a>

The libraries used for data loading, manipulation and visualisation are imported below.

In [1]:
# Libraries for data loading, manipulation and analysis

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Displays output inline
%matplotlib inline

# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')

# 3. Data Loading<a class="anchor" id="chapter3"></a>

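The dataset is read from a CSV file into a pandas DataFrame, and the first rows are displayed for an initial inspection.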

In [2]:
# Display all columns
pd.set_option("display.max_columns", None)

In [3]:
# loading dataset
original_df = pd.read_csv("forest_area_km.csv", index_col=False)

# Display the first few rows
original_df.head()

Unnamed: 0,Country Name,Country Code,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Afghanistan,AFG,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4
1,Albania,ALB,7888.0,7868.5,7849.0,7829.5,7810.0,7790.5,7771.0,7751.5,7732.0,7712.5,7693.0,7705.77,7718.54,7731.31,7744.08,7756.85,7769.62,7782.39,7795.16,7807.93,7820.7,7834.935,7849.17,7863.405,7877.64,7891.875,7891.8,7889.025,7889.0,7889.0,7889.0,7889.0
2,Algeria,DZA,16670.0,16582.0,16494.0,16406.0,16318.0,16230.0,16142.0,16054.0,15966.0,15878.0,15790.0,16129.0,16468.0,16807.0,17146.0,17485.0,17824.0,18163.0,18502.0,18841.0,19180.0,19256.0,19332.0,19408.0,19484.0,19560.0,19560.0,19430.0,19300.0,19390.0,19490.0,19583.333
3,American Samoa,ASM,180.7,180.36,180.02,179.68,179.34,179.0,178.66,178.32,177.98,177.64,177.3,177.0,176.7,176.4,176.1,175.8,175.5,175.2,174.9,174.6,174.3,174.0,173.7,173.4,173.1,172.8,172.5,172.2,171.9,171.6,171.3,171.0
4,Andorra,AND,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0,160.0


In [4]:
# Create copy of dataset
df = original_df.copy()

In [5]:
# Replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df.columns]

In [6]:
# Displays number of rows and columns
df.shape

(259, 34)

**Results**: The dataset consists of 259 rows and 34 columns.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259 entries, 0 to 258
Data columns (total 34 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country_Name  259 non-null    object 
 1   Country_Code  259 non-null    object 
 2   1990          215 non-null    float64
 3   1991          219 non-null    float64
 4   1992          248 non-null    float64
 5   1993          251 non-null    float64
 6   1994          251 non-null    float64
 7   1995          251 non-null    float64
 8   1996          251 non-null    float64
 9   1997          251 non-null    float64
 10  1998          251 non-null    float64
 11  1999          251 non-null    float64
 12  2000          253 non-null    float64
 13  2001          253 non-null    float64
 14  2002          253 non-null    float64
 15  2003          253 non-null    float64
 16  2004          253 non-null    float64
 17  2005          253 non-null    float64
 18  2006          255 non-null    

**Results**: `Country_Name` and `Country_Code` are object columns with no null values; the year columns 1990 to 2011 contain some null values.

# 4. Data Cleaning<a class="anchor" id="chapter4"></a>

Before cleaning, the data is inspected for issues: null values, duplicates, and non-essential columns or rows.

The `print_null_values` function reports the number of null values in each column that has at least one.

In [8]:
def print_null_values(df):
    """
    Prints the count of null (missing) values for each column that has more than 0 null values.
    
    Parameters:
    df (pandas.DataFrame): The input DataFrame to check for null values.

    Returns:
    None
    """
    null_counts = df.isnull().sum()
    
    # Filter out columns with no null values
    null_counts = null_counts[null_counts > 0]
    
    if null_counts.empty:
        print("No columns with null values.")
    else:
        print("Columns with null values:")
        print(null_counts)
        

print_null_values(df)

Columns with null values:
1990    44
1991    40
1992    11
1993     8
1994     8
1995     8
1996     8
1997     8
1998     8
1999     8
2000     6
2001     6
2002     6
2003     6
2004     6
2005     6
2006     4
2007     4
2008     4
2009     4
2010     4
2011     1
dtype: int64


**Results**: The columns from 1990 to 2011 contain missing values, with the earliest years (1990-1992) having the most missing values.
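Column totals show which years are incomplete, but the row-dropping steps later in this section depend on how many values are missing within each row. A minimal sketch of that per-row view, using toy data and placeholder names rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame: three placeholder countries with gaps in the early years
toy = pd.DataFrame({
    "Country_Name": ["Country A", "Country B", "Country C"],
    "1990": [np.nan, np.nan, 10.0],
    "1991": [np.nan, 5.0, 10.0],
    "1992": [3.0, 5.0, 10.0],
})

# Count missing values across each row instead of down each column
nulls_per_row = toy.isnull().sum(axis=1)

# Names of the rows that have at least one missing value
rows_with_nulls = toy.loc[nulls_per_row > 0, "Country_Name"].tolist()
```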

`trim_spaces` function to remove leading and trailing spaces before checking for duplicates

In [9]:
def trim_spaces(df, columns):
    """
    Trims leading and trailing spaces from the specified columns in a DataFrame.
    
    Parameters:
    df (pandas.DataFrame): The input DataFrame to trim.
    columns (list): List of object columns to trim spaces from.

    Returns:
    pandas.DataFrame: The same DataFrame, modified in place, with the columns trimmed.
    """
    for column in columns:
        df[column] = df[column].str.strip()
    
    return df

trim_columns = ['Country_Name', 'Country_Code']
df = trim_spaces(df, trim_columns)

The `count_duplicate_rows` function counts rows that are exact duplicates of an earlier row.

In [10]:
def count_duplicate_rows(df):
    """
    Counts the number of duplicate rows in a DataFrame.
    
    Parameters:
    df (pandas.DataFrame): The input DataFrame to check duplicates.
    
    Returns:
    int: The count of duplicate rows.
    """
    return df.duplicated().sum()

print("Number of duplicate rows:", count_duplicate_rows(df))

Number of duplicate rows: 0
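`DataFrame.duplicated()` with no arguments only flags rows that match an earlier row in every column. An entry repeated under the same code but with different values would not be counted. As a complementary check, one could restrict the comparison to the key column with the `subset` parameter; the sketch below uses toy data and assumes `Country_Code` is meant to be unique.

```python
import pandas as pd

# Toy frame: the same placeholder code appears twice with different values
toy = pd.DataFrame({
    "Country_Name": ["Country A", "Country A (alt)", "Country B"],
    "Country_Code": ["AAA", "AAA", "BBB"],
    "1990": [1.0, 2.0, 3.0],
})

# Full-row comparison finds nothing, because no two rows are identical
full_row_duplicates = int(toy.duplicated().sum())

# Key-based comparison flags the repeated code
code_duplicates = int(toy.duplicated(subset=["Country_Code"]).sum())
```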


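Rows that contain only zeroes in every year column carry no information for this analysis, so they are identified and removed.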

In [11]:
# Check for rows with all zeroes
zero_rows = df.loc[df.iloc[:, 2:].eq(0).all(axis=1)]
zero_rows

Unnamed: 0,Country_Name,Country_Code,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
75,Gibraltar,GIB,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
# Remove rows that contain only zeroes
df = df.drop(zero_rows.index)
df.shape

(258, 34)

**Results**: 1 row removed, dataset now consists of 258 rows and 34 columns

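The `check_column_data_types` function confirms that the two country columns are of type object and that every year column is of type float64.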

In [13]:
def check_column_data_types(df, object_columns):
    """
    Checks if specified columns are of type object and the rest are of type float.
    
    Parameters:
    df (pandas.DataFrame): The input DataFrame to check data types.
    object_columns (list): List of columns expected to be of type object.
    """
    mismatches = []
    
    for column in df.columns:
        expected_type = 'object' if column in object_columns else 'float64'
        actual_type = df[column].dtype
        
        if actual_type != expected_type:
            mismatches.append((column, actual_type, expected_type))
    
    if mismatches:
        for column, actual, expected in mismatches:
            print(f"Column '{column}' has data type '{actual}', expected '{expected}'.")
    else:
        print("All columns have the correct data types.")

object_columns = ['Country_Name', 'Country_Code']
check_column_data_types(df, object_columns)

All columns have the correct data types.


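Null values are handled in stages. Rows with too many null values are dropped first, starting with rows in which more than 50% of the values are null.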

In [14]:
def drop_rows_with_too_many_nulls(df, threshold=0.5):
    """
    Drops rows with more than a specified percentage of null values.
    
    Parameters:
    df (pandas.DataFrame): The input DataFrame.
    threshold (float): The maximum percentage of null values allowed in a row (default 0.5 = 50%).
    
    Returns:
    pandas.DataFrame: The DataFrame with rows dropped based on the threshold.
    """
    # Calculate the percentage of null values per row
    null_percentage = df.isnull().mean(axis=1)
    
    # Drop rows where the percentage of null values is greater than the threshold
    return df[null_percentage <= threshold]

df_nulls_preprocessed = drop_rows_with_too_many_nulls(df)
df_nulls_preprocessed.shape

(254, 34)

**Results**: Dropped 4 rows with more than 50% null values; 254 rows remain.
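Note that `isnull().mean(axis=1)` divides by the total number of columns, including the two country columns, which are never null. With 34 columns, a row is therefore dropped only when it has 18 or more missing years (17/34 is exactly 0.5 and is kept). The toy sketch below, with six columns and placeholder names, shows the same boundary behaviour.

```python
import numpy as np
import pandas as pd

# Toy frame with 6 columns: 2 identifier columns + 4 year columns
toy = pd.DataFrame({
    "Country_Name": ["Country A", "Country B", "Country C"],
    "Country_Code": ["AAA", "BBB", "CCC"],
    "1990": [np.nan, np.nan, np.nan],
    "1991": [np.nan, np.nan, 1.0],
    "1992": [np.nan, np.nan, 1.0],
    "1993": [np.nan, 1.0, 1.0],
})

# Fraction of null cells per row, measured over all 6 columns
null_fraction = toy.isnull().mean(axis=1)

# Same rule as the function above: keep rows at or below the threshold
kept = toy[null_fraction <= 0.5]
```

Here the first row (4 of 6 cells null) is dropped, while the second row sits exactly on the 50% boundary and is kept.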

To limit the inaccuracy introduced by filling long runs of missing years, rows with more than 9 null values are also dropped.

In [15]:
def drop_rows_with_excessive_nulls(df, max_nulls=9):
    """
    Drops rows with more than a specified number of null values.
    
    Parameters:
    df (pandas.DataFrame): The input DataFrame.
    max_nulls (int): The maximum number of null values allowed in a row.
    
    Returns:
    pandas.DataFrame: The DataFrame with rows dropped based on the maximum null count.
    """
    # Keep only rows where the number of nulls is less than or equal to max_nulls
    return df[df.isnull().sum(axis=1) <= max_nulls]

df_nulls_processed = drop_rows_with_excessive_nulls(df_nulls_preprocessed)
df_nulls_processed.shape

(250, 34)

**Results**: Dropped 4 rows with more than 9 null values; 250 rows remain.
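The rule "keep rows with at most `max_nulls` missing values" can also be written with the built-in `DataFrame.dropna`, whose `thresh` parameter is the minimum number of non-null values a row must have. The sketch below compares the two forms on toy data; it is an alternative formulation for illustration, not the notebook's own implementation.

```python
import numpy as np
import pandas as pd

# Toy frame with 5 columns and placeholder names
toy = pd.DataFrame({
    "Country_Name": ["Country A", "Country B"],
    "1990": [np.nan, np.nan],
    "1991": [np.nan, 2.0],
    "1992": [np.nan, 2.0],
    "1993": [1.0, 2.0],
})

max_nulls = 2

# Boolean-mask form: keep rows with at most max_nulls nulls
by_mask = toy[toy.isnull().sum(axis=1) <= max_nulls]

# dropna form: require at least (number of columns - max_nulls) non-null values
by_thresh = toy.dropna(thresh=toy.shape[1] - max_nulls)
```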

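Fill the remaining missing values with the `bfill_nulls` (Fill Backward) function, which fills along each row so that a missing year takes the value of the next available year.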

In [16]:
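def bfill_nulls(df):
    """
    Returns a copy of df in which nulls in the year columns are backward-filled
    along each row, so a missing year takes the next available year's value.
    """
    # Work on a copy so the DataFrame passed in is not modified
    df = df.copy()
    # Apply bfill to all columns except the Country columns
    df.iloc[:, 2:] = df.iloc[:, 2:].bfill(axis=1)
    
    return df

df_nulls_filled = bfill_nulls(df_nulls_processed)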

In [17]:
# Confirm that bfill_nulls filled all nulls
print_null_values(df_nulls_filled)

No columns with null values.
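Backward fill along `axis=1` copies the next available value to the left, so it can only fill a gap that is followed by a known value. In this dataset the null counts above show gaps only in 1990 to 2011, so every gap has a later year to draw from. The toy sketch below shows both the filled case and the case that backward fill alone cannot handle.

```python
import numpy as np
import pandas as pd

# Toy year columns: row 0 has leading gaps, row 1 has trailing gaps
years = pd.DataFrame({
    "1990": [np.nan, 5.0],
    "1991": [np.nan, np.nan],
    "1992": [7.0, np.nan],
})

# Fill each gap with the next known value to its right
filled = years.bfill(axis=1)

# Row 0 is fully filled from 1992; row 1 keeps its trailing nulls
remaining_nulls = filled.isnull().sum(axis=1)
```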


`standardise_float_columns` function used to standardise numeric columns

In [18]:
def standardise_float_columns(df, start_col):
    """
    Converts the specified columns to numeric values and rounds them to 2 decimal places.
    
    Parameters:
    df (pandas.DataFrame): The DataFrame containing columns to format.
    start_col (int): The index of the first column to format.
    
    Returns:
    pandas.DataFrame: DataFrame with specified columns formatted as floats.
    """
    # Format each column in the sliced list to float and round
    columns_to_format = df.columns[start_col:]
    for col in columns_to_format:
        df[col] = pd.to_numeric(df[col], errors='coerce')  # Convert to numeric
        df[col] = df[col].round(2)  # Round to 2 decimals
    
    return df

# Format columns starting from the third column onward
df_clean = standardise_float_columns(df_nulls_filled, start_col=2)
df_clean.head(2)

Unnamed: 0,Country_Name,Country_Code,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Afghanistan,AFG,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4,12084.4
1,Albania,ALB,7888.0,7868.5,7849.0,7829.5,7810.0,7790.5,7771.0,7751.5,7732.0,7712.5,7693.0,7705.77,7718.54,7731.31,7744.08,7756.85,7769.62,7782.39,7795.16,7807.93,7820.7,7834.94,7849.17,7863.4,7877.64,7891.88,7891.8,7889.02,7889.0,7889.0,7889.0,7889.0
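Rounding changes the stored values. If the aim is only to avoid scientific notation when values are displayed, an alternative is to leave the data untouched and set pandas' `display.float_format` option instead. The sketch below is a hedged illustration on a toy Series (the `area_km2` name is hypothetical), using `pd.option_context` so the setting applies only inside the `with` block.

```python
import pandas as pd

# Toy values; the Series name is a placeholder
values = pd.Series([1234567.891, 0.5], name="area_km2")

# Apply a two-decimal display format only within this block
with pd.option_context("display.float_format", "{:.2f}".format):
    shown = values.to_string()

print(shown)
```

The underlying values keep their full precision, which can matter when small year-to-year differences are computed later.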


# 5. Exploratory Data Analysis (EDA)<a class="anchor" id="chapter5"></a>

# 6. Conclusion and Insights <a class="anchor" id="chapter6"></a>