
# **Ecological Footprint Classification: Assessing Environmental Impact Using Random Forest Classifier**

Link to dataset: https://www.kaggle.com/datasets/footprintnetwork/ecological-footprint/code

Dataset information: The ecological footprint measures the ecological assets that a given population requires to produce the natural resources it consumes (including plant-based food and fiber products, livestock and fish products, timber and other forest products, space for urban infrastructure) and to absorb its waste, especially carbon emissions. The footprint tracks the use of six categories of productive surface areas: cropland, grazing land, fishing grounds, built-up (or urban) land, forest area, and carbon demand on land.

A nation’s biocapacity represents the productivity of its ecological assets, including cropland, grazing land, forest land, fishing grounds, and built-up land. These areas, especially if left unharvested, can also absorb much of the waste we generate, especially our carbon emissions.

Both the ecological footprint and biocapacity are expressed in global hectares — globally comparable, standardized hectares with world average productivity.

If a population’s ecological footprint exceeds the region’s biocapacity, that region runs an ecological deficit. Its demand for the goods and services that its land and seas can provide — fruits and vegetables, meat, fish, wood, cotton for clothing, and carbon dioxide absorption — exceeds what the region’s ecosystems can renew. A region in ecological deficit meets demand by importing, liquidating its own ecological assets (such as overfishing), and/or emitting carbon dioxide into the atmosphere. If a region’s biocapacity exceeds its ecological footprint, it has an ecological reserve.

Acknowledgements
The ecological footprint measure was conceived by Mathis Wackernagel and William Rees at the University of British Columbia. Ecological footprint data was provided by the Global Footprint Network.

Inspiration
Is your country running an ecological deficit, consuming more resources than it can produce per year? Which countries have the greatest ecological deficits or reserves? Do they consume less or produce more than the average country? When will Earth Overshoot Day, the day on the calendar when humanity has used one year of natural resources, occur in 2017?

1) Defining the question:
a) Objectives

Main Objective: Classify countries by their ecological footprint status (ecological deficit or reserve) using quantitative data, and establish what actionable insights this classification can offer for environmental sustainability.

Secondary Objective: 

Assessing the environmental impact of ecological deficits using data visualization and statistical analysis.

Specific objectives:
1. Perform data exploration to understand distributions and correlations of ecological footprint data

2. Conduct data preprocessing to prepare the dataset for classification modelling.

3. Build a random forest classifier to classify ecological footprints across different countries.

4. Compare ecological footprints between countries and derive insights into environmental impact.

5. Identify actionable insights for mitigating global environmental impact based on classification results.

b) Metrics for success
- Successful classification of ecological footprints using the Random Forest Classifier.
- Identification of significant ecological factors contributing to classification outcomes.
- Clear and actionable insights derived from comparative analysis between countries.

c) Context

Comparative analysis of ecological footprints across countries is essential for understanding global environmental impact and guiding policy decisions. This project uses a Random Forest Classifier to classify and compare ecological footprints, aiming to provide actionable insights for environmental sustainability and policy making.
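
As a rough illustration of the kind of model this project aims for, the sketch below trains a Random Forest on synthetic data. Everything in it is an assumption made for illustration: the feature names, the random data, the train/test split and the hyperparameters are not taken from this notebook or from the Kaggle dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Kaggle dataset): per-capita footprint and
# biocapacity in global hectares, plus an unrelated noise feature.
rng = np.random.default_rng(0)
n_samples = 400
footprint = rng.uniform(0.4, 16.0, n_samples)
biocapacity = rng.uniform(0.05, 16.0, n_samples)
noise = rng.normal(size=n_samples)

X = np.column_stack([footprint, biocapacity, noise])
feature_names = ["footprint", "biocapacity", "noise"]
# Label: 1 = ecological reserve, 0 = ecological deficit
y = (biocapacity > footprint).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hypothetical hyperparameters, chosen only to make the example reproducible
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
importances = dict(zip(feature_names, model.feature_importances_))
print(f"accuracy: {accuracy:.2f}")
print(importances)
```

The `feature_importances_` attribute is one way to address the second success metric, since it ranks the features by how much they contribute to the forest's splits.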



2) Reading Data

In [2]:
#Import necessary libraries

import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px


In [3]:
#Load data into a dataframe
data = pd.read_csv('countries.csv')
data.head()

Unnamed: 0,Country,Region,Population (millions),HDI,GDP per Capita,Cropland Footprint,Grazing Footprint,Forest Footprint,Carbon Footprint,Fish Footprint,...,Cropland,Grazing Land,Forest Land,Fishing Water,Urban Land,Total Biocapacity,Biocapacity Deficit or Reserve,Earths Required,Countries Required,Data Quality
0,Afghanistan,Middle East/Central Asia,29.82,0.46,$614.66,0.3,0.2,0.08,0.18,0.0,...,0.24,0.2,0.02,0.0,0.04,0.5,-0.3,0.46,1.6,6
1,Albania,Northern/Eastern Europe,3.16,0.73,"$4,534.37",0.78,0.22,0.25,0.87,0.02,...,0.55,0.21,0.29,0.07,0.06,1.18,-1.03,1.27,1.87,6
2,Algeria,Africa,38.48,0.73,"$5,430.57",0.6,0.16,0.17,1.14,0.01,...,0.24,0.27,0.03,0.01,0.03,0.59,-1.53,1.22,3.61,5
3,Angola,Africa,20.82,0.52,"$4,665.91",0.33,0.15,0.12,0.2,0.09,...,0.2,1.42,0.64,0.26,0.04,2.55,1.61,0.54,0.37,6
4,Antigua and Barbuda,Latin America,0.09,0.78,"$13,205.10",,,,,,...,,,,,,0.94,-4.44,3.11,5.7,2


3) Data Exploration
Attribute information:
Ecological footprints and biocapacity by country
The data set contains the following attributes:
- Country: Name of the country.
- Region: Geographic region to which the country belongs.
- Population (millions): Total population of the country in millions.
- HDI (Human Development Index): Measure of a country's average achievements in key dimensions of human development (e.g., education, health).
- GDP per Capita: Gross Domestic Product per capita, indicating the economic productivity per person in the country.
- Cropland Footprint: Ecological footprint associated with cropland usage.
- Grazing Footprint: Ecological footprint associated with grazing land usage.
- Forest Footprint: Ecological footprint associated with forest land usage.
- Carbon Footprint: Ecological footprint associated with carbon emissions.
- Fish Footprint: Ecological footprint associated with fishing activities.
- Total Ecological Footprint: Overall ecological footprint per capita, considering all resource usage.
- Cropland: Biocapacity of cropland, in global hectares per person.
- Grazing Land: Biocapacity of grazing land, in global hectares per person.
- Forest Land: Biocapacity of forest land, in global hectares per person.
- Fishing Water: Biocapacity of fishing grounds, in global hectares per person.
- Urban Land: Biocapacity of built-up (urban) land, in global hectares per person.
- Total Biocapacity: Total biocapacity available in global hectares per person.
- Biocapacity Deficit or Reserve: Difference between total biocapacity and ecological footprint; deficit indicates resource consumption exceeds availability.
- Earths Required: Number of Earths required if everyone lived like the average person in the country.
- Countries Required: Number of copies of the country itself that would be needed to meet its residents' demand (in the sample rows this matches Total Ecological Footprint divided by Total Biocapacity).
- Data Quality: Indicates the quality or reliability of the data used for each attribute.
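
The relationship between the footprint, biocapacity and "Countries Required" columns can be checked by hand. The sketch below uses four rows copied from the previews in this notebook. Reading "Countries Required" as footprint divided by biocapacity is an assumption tested against those rows, not a definition taken from the dataset documentation.

```python
# Rows copied from the previews in this notebook:
# (country, total footprint, total biocapacity, reported "Countries Required")
rows = [
    ("Afghanistan", 0.79, 0.50, 1.60),
    ("Albania", 2.21, 1.18, 1.87),
    ("Algeria", 2.12, 0.59, 3.61),
    ("Angola", 0.93, 2.55, 0.37),
]

ratios = {}
for country, footprint, biocapacity, reported in rows:
    # Assumed relationship: demand divided by the country's own supply
    ratio = footprint / biocapacity
    ratios[country] = ratio
    status = "deficit" if biocapacity < footprint else "reserve"
    print(f"{country}: ratio={ratio:.2f}, reported={reported}, {status}")
```

Small differences between the computed ratio and the reported value are expected, because the inputs shown in the previews are rounded to two decimals.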





Preview the top of our dataset

In [4]:
data.head()

Unnamed: 0,Country,Region,Population (millions),HDI,GDP per Capita,Cropland Footprint,Grazing Footprint,Forest Footprint,Carbon Footprint,Fish Footprint,...,Cropland,Grazing Land,Forest Land,Fishing Water,Urban Land,Total Biocapacity,Biocapacity Deficit or Reserve,Earths Required,Countries Required,Data Quality
0,Afghanistan,Middle East/Central Asia,29.82,0.46,$614.66,0.3,0.2,0.08,0.18,0.0,...,0.24,0.2,0.02,0.0,0.04,0.5,-0.3,0.46,1.6,6
1,Albania,Northern/Eastern Europe,3.16,0.73,"$4,534.37",0.78,0.22,0.25,0.87,0.02,...,0.55,0.21,0.29,0.07,0.06,1.18,-1.03,1.27,1.87,6
2,Algeria,Africa,38.48,0.73,"$5,430.57",0.6,0.16,0.17,1.14,0.01,...,0.24,0.27,0.03,0.01,0.03,0.59,-1.53,1.22,3.61,5
3,Angola,Africa,20.82,0.52,"$4,665.91",0.33,0.15,0.12,0.2,0.09,...,0.2,1.42,0.64,0.26,0.04,2.55,1.61,0.54,0.37,6
4,Antigua and Barbuda,Latin America,0.09,0.78,"$13,205.10",,,,,,...,,,,,,0.94,-4.44,3.11,5.7,2


Preview the bottom of our dataset

In [5]:
data.tail()

Unnamed: 0,Country,Region,Population (millions),HDI,GDP per Capita,Cropland Footprint,Grazing Footprint,Forest Footprint,Carbon Footprint,Fish Footprint,...,Cropland,Grazing Land,Forest Land,Fishing Water,Urban Land,Total Biocapacity,Biocapacity Deficit or Reserve,Earths Required,Countries Required,Data Quality
183,Viet Nam,Asia-Pacific,90.8,0.66,"$1,532.31",0.5,0.01,0.19,0.79,0.05,...,0.55,0.01,0.17,0.16,0.1,1.0,-0.65,0.95,1.66,6
184,Wallis and Futuna Islands,Asia-Pacific,0.01,,,,,,,,...,,,,,,1.51,-0.56,1.19,1.37,3T
185,Yemen,Middle East/Central Asia,23.85,0.5,"$1,302.30",0.34,0.14,0.04,0.42,0.04,...,0.09,0.12,0.04,0.2,0.04,0.5,-0.53,0.59,2.06,5
186,Zambia,Africa,14.08,0.58,"$1,740.64",0.19,0.18,0.33,0.24,0.01,...,0.24,0.94,0.99,0.02,0.04,2.23,1.24,0.57,0.44,6
187,Zimbabwe,Africa,13.72,0.49,$865.91,0.2,0.32,0.29,0.53,0.01,...,0.15,0.32,0.12,0.01,0.02,0.62,-0.75,0.79,2.2,6


In [6]:
#Generate the summary of descriptive statistics on the dataset
data.describe()

Unnamed: 0,Population (millions),HDI,Cropland Footprint,Grazing Footprint,Forest Footprint,Carbon Footprint,Fish Footprint,Total Ecological Footprint,Cropland,Grazing Land,Forest Land,Fishing Water,Urban Land,Total Biocapacity,Biocapacity Deficit or Reserve,Earths Required,Countries Required
count,188.0,172.0,173.0,173.0,173.0,173.0,173.0,188.0,173.0,173.0,173.0,173.0,173.0,188.0,188.0,188.0,188.0
mean,37.342372,0.68636,0.578208,0.263179,0.373815,1.804913,0.122486,3.317606,0.53185,0.45659,2.459191,0.595145,0.06711,4.019681,0.702074,1.915745,4.037397
std,140.756836,0.15604,0.355691,0.352067,0.359349,1.898283,0.158427,2.370931,0.672567,1.014738,10.593956,1.661872,0.054844,11.689075,11.771339,1.369624,12.444616
min,0.0,0.34,0.07,0.0,0.01,0.0,0.0,0.42,0.0,0.0,0.0,0.0,0.0,0.05,-14.14,0.24,0.02
25%,2.0375,0.5575,0.35,0.08,0.17,0.42,0.02,1.4825,0.18,0.03,0.06,0.03,0.03,0.675,-1.935,0.855,0.9425
50%,7.97,0.72,0.52,0.18,0.26,1.14,0.07,2.74,0.35,0.12,0.34,0.11,0.05,1.31,-0.73,1.58,1.705
75%,24.87,0.8025,0.7,0.32,0.46,2.6,0.15,4.64,0.59,0.34,1.17,0.37,0.09,2.815,0.2125,2.6775,2.8475
max,1408.04,0.94,2.68,3.47,3.03,12.65,0.82,15.82,5.42,8.23,95.16,16.07,0.27,111.35,109.01,9.14,159.47


Checking whether each column has an appropriate data type

In [7]:
data.dtypes

Country                            object
Region                             object
Population (millions)             float64
HDI                               float64
GDP per Capita                     object
Cropland Footprint                float64
Grazing Footprint                 float64
Forest Footprint                  float64
Carbon Footprint                  float64
Fish Footprint                    float64
Total Ecological Footprint        float64
Cropland                          float64
Grazing Land                      float64
Forest Land                       float64
Fishing Water                     float64
Urban Land                        float64
Total Biocapacity                 float64
Biocapacity Deficit or Reserve    float64
Earths Required                   float64
Countries Required                float64
Data Quality                       object
dtype: object

In [8]:
#Convert 'GDP per Capita' from strings such as "$4,534.37" to floats
#(raw string so that '\$' is not treated as an invalid escape sequence)
data['GDP per Capita'] = data['GDP per Capita'].str.replace(r'[\$,]', '', regex=True).astype(float)

#The column keeps the name 'GDP per Capita' because later cells refer to it by that name
data.head()


Unnamed: 0,Country,Region,Population (millions),HDI,GDP per Capita,Cropland Footprint,Grazing Footprint,Forest Footprint,Carbon Footprint,Fish Footprint,...,Cropland,Grazing Land,Forest Land,Fishing Water,Urban Land,Total Biocapacity,Biocapacity Deficit or Reserve,Earths Required,Countries Required,Data Quality
0,Afghanistan,Middle East/Central Asia,29.82,0.46,614.66,0.3,0.2,0.08,0.18,0.0,...,0.24,0.2,0.02,0.0,0.04,0.5,-0.3,0.46,1.6,6
1,Albania,Northern/Eastern Europe,3.16,0.73,4534.37,0.78,0.22,0.25,0.87,0.02,...,0.55,0.21,0.29,0.07,0.06,1.18,-1.03,1.27,1.87,6
2,Algeria,Africa,38.48,0.73,5430.57,0.6,0.16,0.17,1.14,0.01,...,0.24,0.27,0.03,0.01,0.03,0.59,-1.53,1.22,3.61,5
3,Angola,Africa,20.82,0.52,4665.91,0.33,0.15,0.12,0.2,0.09,...,0.2,1.42,0.64,0.26,0.04,2.55,1.61,0.54,0.37,6
4,Antigua and Barbuda,Latin America,0.09,0.78,13205.1,,,,,,...,,,,,,0.94,-4.44,3.11,5.7,2


In [9]:
#data['Data Quality'] = data['Data Quality'].astype(float)
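
The cast above is left commented out, presumably because not every 'Data Quality' value is purely numeric (the tail preview shows `3T`), so `astype(float)` would raise an error. One possible workaround is sketched below. It assumes the leading digits carry the score and that a letter suffix can be dropped, which is a guess about the encoding rather than something the dataset documents here.

```python
import pandas as pd

# Sample values taken from the previews above; '3T' is what breaks astype(float)
quality = pd.Series(["6", "5", "2", "3T"], name="Data Quality")

# Assumption: the leading digits carry the score and any letter suffix is a flag
quality_score = quality.str.extract(r"(\d+)", expand=False).astype(float)
print(quality_score.tolist())
```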


In [10]:
data.dtypes

Country                            object
Region                             object
Population (millions)             float64
HDI                               float64
GDP per Capita                    float64
Cropland Footprint                float64
Grazing Footprint                 float64
Forest Footprint                  float64
Carbon Footprint                  float64
Fish Footprint                    float64
Total Ecological Footprint        float64
Cropland                          float64
Grazing Land                      float64
Forest Land                       float64
Fishing Water                     float64
Urban Land                        float64
Total Biocapacity                 float64
Biocapacity Deficit or Reserve    float64
Earths Required                   float64
Countries Required                float64
Data Quality                       object
dtype: object

#Determining the number of records in our dataset

In [11]:
data.shape

(188, 21)

In [12]:
#Get info about the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 21 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Country                         188 non-null    object 
 1   Region                          188 non-null    object 
 2   Population (millions)           188 non-null    float64
 3   HDI                             172 non-null    float64
 4   GDP per Capita                  173 non-null    float64
 5   Cropland Footprint              173 non-null    float64
 6   Grazing Footprint               173 non-null    float64
 7   Forest Footprint                173 non-null    float64
 8   Carbon Footprint                173 non-null    float64
 9   Fish Footprint                  173 non-null    float64
 10  Total Ecological Footprint      188 non-null    float64
 11  Cropland                        173 non-null    float64
 12  Grazing Land                    173 

4) Tidying the dataset

Check for missing values

In [13]:
#Check for null values
data.isnull().sum()

Country                            0
Region                             0
Population (millions)              0
HDI                               16
GDP per Capita                    15
Cropland Footprint                15
Grazing Footprint                 15
Forest Footprint                  15
Carbon Footprint                  15
Fish Footprint                    15
Total Ecological Footprint         0
Cropland                          15
Grazing Land                      15
Forest Land                       15
Fishing Water                     15
Urban Land                        15
Total Biocapacity                  0
Biocapacity Deficit or Reserve     0
Earths Required                    0
Countries Required                 0
Data Quality                       0
dtype: int64

Checking for duplicated values

In [14]:
data.duplicated().sum()

0

Dealing with missing values
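
The notebook does not show which strategy is used for the 15 to 16 missing values per component column. Two common options are sketched below on a tiny illustrative frame. The choice between dropping and median imputation is an assumption for illustration, not the author's method.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (values copied from the previews above, with gaps)
sample = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Antigua and Barbuda"],
    "HDI": [0.46, 0.73, 0.78],
    "Cropland Footprint": [0.30, 0.78, np.nan],
    "Carbon Footprint": [0.18, 0.87, np.nan],
})

# Option 1: drop rows with any missing component value
dropped = sample.dropna()

# Option 2: fill numeric gaps with each column's median
numeric_cols = sample.select_dtypes(include="number").columns
imputed = sample.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())

print(len(dropped), int(imputed.isnull().sum().sum()))
```

Dropping keeps only fully observed countries, while imputation keeps all rows at the cost of inventing plausible values for the gaps.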

5) Data Visualization

In [15]:
import plotly.express as px
#Define specific columns to include in the pie chart
columns_to_plot = ['Cropland', 'Grazing Land', 'Forest Land','Fishing Water', 'Urban Land']

#Extract the specific columns for plotting from the dataframe
plot_data = data[columns_to_plot]

#Calculate sum of each column
sum_values = plot_data.sum()

#prepare data for plotly pie chart
pie_data = pd.DataFrame({'Category': sum_values.index, 'Value': sum_values.values})

#Display the summed values
print(pie_data)
#Plot an interactive pie chart with plotly express
fig = px.pie(pie_data, values='Value', names='Category', title='Percentage of land available')
fig.show()
#Customize layout
#fig.update_traces(textposition='inside', textinfo='percent+label')

        Category   Value
0       Cropland   92.01
1   Grazing Land   78.99
2    Forest Land  425.44
3  Fishing Water  102.96
4     Urban Land   11.61


In [16]:
#create a choropleth map for population(millions)
fig = px.choropleth(data,
                    locations='Country',
                    locationmode='country names',
                    color='Population (millions)',
                    hover_name='Country',
                    title='Population Across Countries')
#Show the plot
fig.show()



In [17]:
#Create a choropleth map for GDP per Capita
fig = px.choropleth(data,
                    locations='Country',
                    locationmode='country names',
                    color='GDP per Capita',
                    hover_name='Country',
                    title='GDP per Capita Across Countries')
#Show the plot
fig.show()

In [18]:
#Create a choropleth map for HDI
fig = px.choropleth(data,
                    locations='Country',
                    locationmode='country names',
                    color='HDI',
                    hover_name='Country',
                    title='HDI Across Countries')
#Show the plot
fig.show()

In [19]:
#Create a stacked bar chart
fig = px.bar(data, x='Country',y=['Cropland Footprint', 'Grazing Footprint','Forest Footprint', 'Carbon Footprint', 'Fish Footprint'],
             title='Ecological Footprints across Countries',
             labels={'value': 'Ecological Footprint'},
             barmode='stack')
#Customize layout
fig.update_layout(
    title='Ecological Footprints Across Countries',
    xaxis_title='Country', 
    yaxis_title='Ecological Footprint',
    height=600,  # Adjust height of the plot
    width=1810,  # Adjust width of the plot
    margin=dict(l=50, r=50, t=50, b=50),  # Adjust margins
)
#Show the plot
fig.show()

In [20]:
import plotly.express as px

# Assuming 'data' is your DataFrame with ecological footprint data

# Example: Plot line chart for Total Ecological Footprint across different countries
fig = px.line(data, x='Country', y='Total Ecological Footprint',
              title='Total Ecological Footprint Across Countries',
              labels={'Country': 'Country', 'Total Ecological Footprint': 'Total Ecological Footprint'})

# Customize layout
fig.update_layout(
    xaxis_title='Country',
    yaxis_title='Total Ecological Footprint',
    height=500,  # Adjust height of the plot
    width=800,   # Adjust width of the plot
    margin=dict(l=50, r=50, t=50, b=50),  # Adjust margins
)

# Show the plot
fig.show()


In [21]:
import plotly.express as px

# Assuming 'data' is your DataFrame with various metrics across countries or regions

# Example: Plot line chart for Total Biocapacity and Biocapacity Deficit or Reserve across different countries
fig = px.line(data, x='Country', y=['Total Biocapacity', 'Biocapacity Deficit or Reserve'],
              title='Biocapacity Trends Across Countries',
              labels={'Country': 'Country', 'value': 'Value', 'variable': 'Metric'})

# Customize layout
fig.update_layout(
    xaxis_title='Country',
    yaxis_title='Value',
    height=500,  # Adjust height of the plot
    width=800,   # Adjust width of the plot
    margin=dict(l=50, r=50, t=50, b=50),  # Adjust margins
)

# Show the plot
fig.show()

In [22]:
import plotly.express as px

# Assuming 'data' is your DataFrame with ecological footprint and biocapacity metrics per country

# Example: Plot line chart for Ecological Footprint vs Biocapacity across different countries
fig = px.line(data, x='Country', y=['Total Ecological Footprint', 'Total Biocapacity'],
              title='Ecological Footprint vs Biocapacity Across Countries',
              labels={'Country': 'Country', 'value': 'Value', 'variable': 'Metric'})

# Customize layout
fig.update_layout(
    xaxis_title='Country',
    yaxis_title='Value',
    height=500,  # Adjust height of the plot
    width=800,   # Adjust width of the plot
    margin=dict(l=50, r=50, t=50, b=50),  # Adjust margins
)

# Show the plot
fig.show()


Feature Engineering

a) Target encoding:

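Compute a new column, 'Ecological R/D' (reserve or deficit), as Total Biocapacity minus Total Ecological Footprint. Negative values indicate a deficit and positive values a reserve; this column is the basis for classifying each country as being in deficit or reserve.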

In [23]:
#Create a new column for the ecological reserve (+) or deficit (-)
data['Ecological R/D'] = data['Total Biocapacity'] - data['Total Ecological Footprint']

In [24]:
data.columns

Index(['Country', 'Region', 'Population (millions)', 'HDI', 'GDP per Capita',
       'Cropland Footprint', 'Grazing Footprint', 'Forest Footprint',
       'Carbon Footprint', 'Fish Footprint', 'Total Ecological Footprint',
       'Cropland', 'Grazing Land', 'Forest Land', 'Fishing Water',
       'Urban Land', 'Total Biocapacity', 'Biocapacity Deficit or Reserve',
       'Earths Required', 'Countries Required', 'Data Quality',
       'Ecological R/D'],
      dtype='object')

In [25]:
data[['Total Ecological Footprint', 'Total Biocapacity', 'Biocapacity Deficit or Reserve', 'Ecological R/D']].head(10)

Unnamed: 0,Total Ecological Footprint,Total Biocapacity,Biocapacity Deficit or Reserve,Ecological R/D
0,0.79,0.5,-0.3,-0.29
1,2.21,1.18,-1.03,-1.03
2,2.12,0.59,-1.53,-1.53
3,0.93,2.55,1.61,1.62
4,5.38,0.94,-4.44,-4.44
5,3.14,6.92,3.78,3.78
6,2.23,0.89,-1.35,-1.34
7,11.88,0.57,-11.31,-11.31
8,9.31,16.57,7.26,7.26
9,6.06,3.07,-3.0,-2.99
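
A classifier needs a categorical target, whereas 'Ecological R/D' is continuous. One way to derive a label from its sign is sketched below. The column name `Status` and the label strings are hypothetical, and ties at exactly zero are counted as a reserve by assumption.

```python
import numpy as np
import pandas as pd

# 'Ecological R/D' values for the first five countries shown above
rd = pd.Series([-0.29, -1.03, -1.53, 1.62, -4.44], name="Ecological R/D")

# Hypothetical label column: negative values mean footprint exceeds biocapacity
status = pd.Series(np.where(rd < 0, "Deficit", "Reserve"), name="Status")
print(status.value_counts().to_dict())
```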


In [26]:
data = data.drop('Biocapacity Deficit or Reserve', axis =1)


In [27]:
data.columns

Index(['Country', 'Region', 'Population (millions)', 'HDI', 'GDP per Capita',
       'Cropland Footprint', 'Grazing Footprint', 'Forest Footprint',
       'Carbon Footprint', 'Fish Footprint', 'Total Ecological Footprint',
       'Cropland', 'Grazing Land', 'Forest Land', 'Fishing Water',
       'Urban Land', 'Total Biocapacity', 'Earths Required',
       'Countries Required', 'Data Quality', 'Ecological R/D'],
      dtype='object')

In [28]:
data.describe()

Unnamed: 0,Population (millions),HDI,GDP per Capita,Cropland Footprint,Grazing Footprint,Forest Footprint,Carbon Footprint,Fish Footprint,Total Ecological Footprint,Cropland,Grazing Land,Forest Land,Fishing Water,Urban Land,Total Biocapacity,Earths Required,Countries Required,Ecological R/D
count,188.0,172.0,173.0,173.0,173.0,173.0,173.0,173.0,188.0,173.0,173.0,173.0,173.0,173.0,188.0,188.0,188.0,188.0
mean,37.342372,0.68636,14238.324913,0.578208,0.263179,0.373815,1.804913,0.122486,3.317606,0.53185,0.45659,2.459191,0.595145,0.06711,4.019681,1.915745,4.037397,0.702074
std,140.756836,0.15604,20927.249796,0.355691,0.352067,0.359349,1.898283,0.158427,2.370931,0.672567,1.014738,10.593956,1.661872,0.054844,11.689075,1.369624,12.444616,11.77138
min,0.0,0.34,276.69,0.07,0.0,0.01,0.0,0.0,0.42,0.0,0.0,0.0,0.0,0.0,0.05,0.24,0.02,-14.14
25%,2.0375,0.5575,1524.39,0.35,0.08,0.17,0.42,0.02,1.4825,0.18,0.03,0.06,0.03,0.03,0.675,0.855,0.9425,-1.9325
50%,7.97,0.72,5430.57,0.52,0.18,0.26,1.14,0.07,2.74,0.35,0.12,0.34,0.11,0.05,1.31,1.58,1.705,-0.73
75%,24.87,0.8025,14522.8,0.7,0.32,0.46,2.6,0.15,4.64,0.59,0.34,1.17,0.37,0.09,2.815,2.6775,2.8475,0.2125
max,1408.04,0.94,114665.0,2.68,3.47,3.03,12.65,0.82,15.82,5.42,8.23,95.16,16.07,0.27,111.35,9.14,159.47,109.01


Feature scaling 

1) Standardization

In [29]:
#Dropping the non numerical data
data1 = data.select_dtypes(include=['float64', 'int64'])
data1.head(5)

Unnamed: 0,Population (millions),HDI,GDP per Capita,Cropland Footprint,Grazing Footprint,Forest Footprint,Carbon Footprint,Fish Footprint,Total Ecological Footprint,Cropland,Grazing Land,Forest Land,Fishing Water,Urban Land,Total Biocapacity,Earths Required,Countries Required,Ecological R/D
0,29.82,0.46,614.66,0.3,0.2,0.08,0.18,0.0,0.79,0.24,0.2,0.02,0.0,0.04,0.5,0.46,1.6,-0.29
1,3.16,0.73,4534.37,0.78,0.22,0.25,0.87,0.02,2.21,0.55,0.21,0.29,0.07,0.06,1.18,1.27,1.87,-1.03
2,38.48,0.73,5430.57,0.6,0.16,0.17,1.14,0.01,2.12,0.24,0.27,0.03,0.01,0.03,0.59,1.22,3.61,-1.53
3,20.82,0.52,4665.91,0.33,0.15,0.12,0.2,0.09,0.93,0.2,1.42,0.64,0.26,0.04,2.55,0.54,0.37,1.62
4,0.09,0.78,13205.1,,,,,,5.38,,,,,,0.94,3.11,5.7,-4.44


In [30]:
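X1 = data1.iloc[:, 0:19]  #data1 has 18 columns, so this slice selects all of them

In [31]:
X2 = data1.iloc[:, 0:19]  #Same selection, kept separately for normalization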

In [32]:
#Feature scaling brings all features to a similar range so that no one feature dominates the others simply because its values are larger
#Import the standard scaler
from sklearn.preprocessing import StandardScaler
#Instantiate the scaler
scaler = StandardScaler()


In [33]:
#fit your data
X1 = scaler.fit_transform(X1)
#Turn X1 back into a dataframe
X1 = pd.DataFrame(X1, columns = data1.columns)
#print(scaler.fit(data1))
#.fit_transform
X1.head()

Unnamed: 0,Population (millions),HDI,GDP per Capita,Cropland Footprint,Grazing Footprint,Forest Footprint,Carbon Footprint,Fish Footprint,Total Ecological Footprint,Cropland,Grazing Land,Forest Land,Fishing Water,Urban Land,Total Biocapacity,Earths Required,Countries Required,Ecological R/D
0,-0.053585,-1.454889,-0.652891,-0.784433,-0.179973,-0.820004,-0.858476,-0.775381,-1.068928,-0.435194,-0.253597,-0.230912,-0.359157,-0.495742,-0.301913,-1.065717,-0.196383,-0.084504
1,-0.243495,0.280492,-0.465045,0.568971,-0.123001,-0.345554,-0.493934,-0.648773,-0.468408,0.027065,-0.243714,-0.205352,-0.316913,-0.130013,-0.243583,-0.472735,-0.174629,-0.147536
2,0.008104,0.280492,-0.422097,0.061444,-0.293918,-0.568825,-0.351288,-0.712077,-0.506469,-0.435194,-0.184413,-0.229965,-0.353122,-0.678606,-0.294193,-0.509339,-0.034436,-0.190125
3,-0.117696,-1.069249,-0.458742,-0.699845,-0.322404,-0.708369,-0.847909,-0.205646,-1.009722,-0.49484,0.952174,-0.172218,-0.202252,-0.495742,-0.126067,-1.007151,-0.295484,0.078188
4,-0.265364,0.601859,-0.049516,,,,,,0.872189,,,,,,-0.26417,0.874287,0.133957,-0.437995


#In standardization, the goal is for each feature to have a mean of 0 and a standard deviation of 1

In [34]:
X1.describe().round(3)


Unnamed: 0,Population (millions),HDI,GDP per Capita,Cropland Footprint,Grazing Footprint,Forest Footprint,Carbon Footprint,Fish Footprint,Total Ecological Footprint,Cropland,Grazing Land,Forest Land,Fishing Water,Urban Land,Total Biocapacity,Earths Required,Countries Required,Ecological R/D
count,188.0,172.0,173.0,173.0,173.0,173.0,173.0,173.0,188.0,173.0,173.0,173.0,173.0,173.0,188.0,188.0,188.0,188.0
mean,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0
std,1.003,1.003,1.003,1.003,1.003,1.003,1.003,1.003,1.003,1.003,1.003,1.003,1.003,1.003,1.003,1.003,1.003,1.003
min,-0.266,-2.226,-0.669,-1.433,-0.75,-1.015,-0.954,-0.775,-1.225,-0.793,-0.451,-0.233,-0.359,-1.227,-0.341,-1.227,-0.324,-1.264
25%,-0.251,-0.828,-0.609,-0.643,-0.522,-0.569,-0.732,-0.649,-0.776,-0.525,-0.422,-0.227,-0.341,-0.679,-0.287,-0.777,-0.249,-0.224
50%,-0.209,0.216,-0.422,-0.164,-0.237,-0.318,-0.351,-0.332,-0.244,-0.271,-0.333,-0.201,-0.293,-0.313,-0.232,-0.246,-0.188,-0.122
75%,-0.089,0.746,0.014,0.343,0.162,0.241,0.42,0.174,0.559,0.087,-0.115,-0.122,-0.136,0.419,-0.103,0.558,-0.096,-0.042
max,9.764,1.63,4.813,5.926,9.135,7.413,5.73,4.416,5.287,7.289,7.683,8.776,9.339,3.71,9.207,5.289,12.523,9.226
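
The standard deviations above read 1.003 rather than exactly 1. This is most likely because `StandardScaler` divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1). The arithmetic can be reproduced from the summary statistics printed earlier. The sketch below uses the 'Population (millions)' column and plain Python only.

```python
import math

# Summary statistics for 'Population (millions)' from data.describe() above
n = 188
mean = 37.342372
sample_std = 140.756836  # pandas reports the sample std (ddof=1)

# Convert to the population standard deviation (ddof=0)
population_std = sample_std * math.sqrt((n - 1) / n)

# z-score for Afghanistan's population of 29.82 million
z_afghanistan = (29.82 - mean) / population_std
print(round(z_afghanistan, 6))

# Sample std of a column that was standardized with the population std
std_after_scaling = math.sqrt(n / (n - 1))
print(round(std_after_scaling, 3))
```

The z-score should agree with the first entry of the scaled table, and the second value accounts for the small excess over 1 in the `std` row.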


2) Normalization
Here the goal is to rescale each feature so that its minimum maps to 0 and its maximum maps to 1.

In [35]:
from sklearn.preprocessing import MinMaxScaler

In [36]:
scaleminmax = MinMaxScaler(feature_range=(0,1))
X2 = scaleminmax.fit_transform(X2)

In [37]:
#Turn X2 back into a dataframe
X2 =pd.DataFrame(X2, columns = data1.columns)
X2.head()

Unnamed: 0,Population (millions),HDI,GDP per Capita,Cropland Footprint,Grazing Footprint,Forest Footprint,Carbon Footprint,Fish Footprint,Total Ecological Footprint,Cropland,Grazing Land,Forest Land,Fishing Water,Urban Land,Total Biocapacity,Earths Required,Countries Required,Ecological R/D
0,0.021178,0.2,0.002955,0.088123,0.057637,0.023179,0.014229,0.0,0.024026,0.04428,0.024301,0.00021,0.0,0.148148,0.004043,0.024719,0.009909,0.112464
1,0.002244,0.65,0.037221,0.272031,0.063401,0.07947,0.068775,0.02439,0.116234,0.101476,0.025516,0.003047,0.004356,0.222222,0.010153,0.11573,0.011602,0.106456
2,0.027329,0.65,0.045056,0.203065,0.04611,0.05298,0.090119,0.012195,0.11039,0.04428,0.032807,0.000315,0.000622,0.111111,0.004852,0.110112,0.022515,0.102395
3,0.014787,0.3,0.038371,0.099617,0.043228,0.036424,0.01581,0.109756,0.033117,0.0369,0.172539,0.006726,0.016179,0.148148,0.022462,0.033708,0.002195,0.127974
4,6.4e-05,0.733333,0.113022,,,,,,0.322078,,,,,,0.007996,0.322472,0.035622,0.078766


In [38]:
X2.describe().round(3)

Unnamed: 0,Population (millions),HDI,GDP per Capita,Cropland Footprint,Grazing Footprint,Forest Footprint,Carbon Footprint,Fish Footprint,Total Ecological Footprint,Cropland,Grazing Land,Forest Land,Fishing Water,Urban Land,Total Biocapacity,Earths Required,Countries Required,Ecological R/D
count,188.0,172.0,173.0,173.0,173.0,173.0,173.0,173.0,188.0,173.0,173.0,173.0,173.0,173.0,188.0,188.0,188.0,188.0
mean,0.027,0.577,0.122,0.195,0.076,0.12,0.143,0.149,0.188,0.098,0.055,0.026,0.037,0.249,0.036,0.188,0.025,0.121
std,0.1,0.26,0.183,0.136,0.101,0.119,0.15,0.193,0.154,0.124,0.123,0.111,0.103,0.203,0.105,0.154,0.078,0.096
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.001,0.363,0.011,0.107,0.023,0.053,0.033,0.024,0.069,0.033,0.004,0.001,0.002,0.111,0.006,0.069,0.006,0.099
50%,0.006,0.633,0.045,0.172,0.052,0.083,0.09,0.085,0.151,0.065,0.015,0.004,0.007,0.185,0.011,0.151,0.011,0.109
75%,0.018,0.771,0.125,0.241,0.092,0.149,0.206,0.183,0.274,0.109,0.041,0.012,0.023,0.333,0.025,0.274,0.018,0.117
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
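
The min-max rescaling can be checked by hand with the formula (x - min) / (max - min). The sketch below applies it to Afghanistan's population and HDI, using the column minima and maxima from the earlier `describe()` output. The helper name `min_max` is hypothetical.

```python
def min_max(value, col_min, col_max):
    """Rescale value to [0, 1] given the column's minimum and maximum."""
    return (value - col_min) / (col_max - col_min)


# Afghanistan, using the min and max from data.describe() above
population_scaled = min_max(29.82, 0.0, 1408.04)
hdi_scaled = min_max(0.46, 0.34, 0.94)
print(round(population_scaled, 6), round(hdi_scaled, 6))
```

Both results should line up with the first row of the normalized table.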


In [39]:
#test