# <font color = Blue> National Footprint and Biocapacity Accounts 2019 Edition</font>
    
<font color = Blue>The dataset provides Ecological Footprint per capita data for the years 1961-2016 in global hectares (gha).
Ecological Footprint is a measure of how much area of biologically productive land and water an individual, population, or activity requires to produce all the resources it consumes and to absorb the waste it generates, using prevailing technology and resource management practices. The Ecological Footprint is measured in global hectares. Since trade is global, an individual's or country's Footprint tracks area from all over the world. Without further specification, Ecological Footprint generally refers to the Ecological Footprint of consumption (rather than only production or export). Ecological Footprint is often referred to in short form as Footprint.</font>

## <font color =Green> About this Dataset</font>
<font color =Green>This data includes total and per capita national biocapacity, ecological footprint of consumption, ecological footprint of production and total area in hectares. This dataset does not, however, include any of our yield factors (national or world) or equivalence factors.</font>

In [1]:
import pandas as pd
import numpy as np
import sklearn.utils

In [2]:
df = pd.read_csv( r'D:\Hamoye stage C\footprint-nfa-2019-edition\footprint-nfa-2019-edition\NFA 2019 public_data.csv',low_memory=False )

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72186 entries, 0 to 72185
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country         72186 non-null  object 
 1   year            72186 non-null  int64  
 2   country_code    72186 non-null  int64  
 3   record          72186 non-null  object 
 4   crop_land       51714 non-null  float64
 5   grazing_land    51714 non-null  float64
 6   forest_land     51714 non-null  object 
 7   fishing_ground  51713 non-null  float64
 8   built_up_land   51713 non-null  float64
 9   carbon          51713 non-null  float64
 10  total           72177 non-null  float64
 11  QScore          72185 non-null  object 
dtypes: float64(6), int64(2), object(4)
memory usage: 6.6+ MB


In [4]:
df.shape


(72186, 12)

In [5]:
df.isna().sum() 


country               0
year                  0
country_code          0
record                0
crop_land         20472
grazing_land      20472
forest_land       20472
fishing_ground    20473
built_up_land     20473
carbon            20473
total                 9
QScore                1
dtype: int64

In [6]:
#percentage of missing values 

df.isna().mean().round(3)*100

country            0.0
year               0.0
country_code       0.0
record             0.0
crop_land         28.4
grazing_land      28.4
forest_land       28.4
fishing_ground    28.4
built_up_land     28.4
carbon            28.4
total              0.0
QScore             0.0
dtype: float64

In [7]:
df[ 'QScore' ].value_counts()

3A    51481
2A    10576
2B    10096
1A       16
1B       16
Name: QScore, dtype: int64

In [8]:
#for simplicity, we will drop the rows with missing values. 

df = df.dropna()

In [9]:
df.isna().sum()

country           0
year              0
country_code      0
record            0
crop_land         0
grazing_land      0
forest_land       0
fishing_ground    0
built_up_land     0
carbon            0
total             0
QScore            0
dtype: int64

In [10]:
df[ 'QScore' ].value_counts()

3A    51473
2A      224
1A       16
Name: QScore, dtype: int64

### <font color =Red>An obvious change in our target variable after removing the missing values is that there are only three classes left, and their distribution is clearly imbalanced (51473 rows of 3A against 224 of 2A and 16 of 1A).<br>There are methods for handling this imbalance, such as oversampling and undersampling.</font>

### Oversampling increases the number of instances in the class with fewer instances, while undersampling reduces the number of data points in the class with more instances. For now, we will convert this to a binary classification problem by combining classes '2A' and '1A'.
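A minimal sketch of the two resampling ideas just described, using only the standard library on hypothetical toy labels (not the NFA data). The names `rows`, `majority` and `minority` are illustrative only.

```python
import random
from collections import Counter

# Toy labels mimicking an imbalanced target (hypothetical counts, not the NFA data).
labels = ["3A"] * 20 + ["2A"] * 5
rows = list(enumerate(labels))  # (row_id, label) pairs stand in for DataFrame rows

rng = random.Random(0)
majority = [r for r in rows if r[1] == "3A"]
minority = [r for r in rows if r[1] == "2A"]

# Undersampling: keep only a random subset of the majority class.
undersampled = minority + rng.sample(majority, k=len(minority))

# Oversampling: draw minority rows with replacement until both classes match.
oversampled = majority + rng.choices(minority, k=len(majority))

print(Counter(label for _, label in undersampled))
print(Counter(label for _, label in oversampled))
```

Undersampling discards majority rows, while oversampling repeats minority rows; both leave the two classes with equal counts in this toy case.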

In [11]:
df[ 'QScore' ] = df[ 'QScore' ].replace([ '1A' ], '2A' )

In [12]:
df[ 'QScore' ].value_counts()

3A    51473
2A      240
Name: QScore, dtype: int64

In [13]:
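#creating different dataframes: keep every '2A' row and undersample '3A' to 350 rows
df_2A = df[df.QScore == '2A']
df_3A = df[df.QScore == '3A'].sample(350)
#DataFrame.append was removed in pandas 2.0; pd.concat does the same job
data_df = pd.concat([df_2A, df_3A])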

In [14]:
data_df = sklearn.utils.shuffle(data_df)
data_df = data_df.reset_index(drop= True )
data_df.shape
data_df.QScore.value_counts() 


3A    350
2A    240
Name: QScore, dtype: int64

In [68]:
#drop identifier columns that are not used as features
#errors='ignore' avoids a KeyError if this cell is re-run after the columns are already gone
data_df = data_df.drop(columns=['country_code', 'country', 'year'], errors='ignore')

In [69]:
X = data_df.drop(columns= 'QScore' )
y = data_df[ 'QScore' ] 


In [70]:
#split the data into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size= 0.3 , random_state= 0 )
y_train.value_counts() 

3A    236
2A    177
Name: QScore, dtype: int64

### There is still an imbalance in the class distribution. To handle it, we apply SMOTE to the training data only.
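As a rough illustration of the idea behind SMOTE, the sketch below creates one synthetic minority point by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified stand-in, not the imbalanced-learn implementation; `smote_like_point` and the toy points are hypothetical.

```python
import math
import random

def smote_like_point(sample, minority, k, rng):
    """Return one synthetic point between `sample` and one of its k nearest minority neighbours."""
    neighbours = sorted(
        (p for p in minority if p is not sample),
        key=lambda p: math.dist(sample, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(1)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]  # hypothetical toy points
synthetic = smote_like_point(minority[0], minority, k=2, rng=rng)
print(synthetic)
```

Because the new points are derived from existing ones, resampling is kept to the training split so that the test split contains only real observations.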


In [71]:
#encode categorical variable
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
x_train.record = encoder.fit_transform(x_train.record)
x_test.record = encoder.transform(x_test.record) 



In [72]:
import imblearn
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state= 1 )
#fit_sample was removed from imbalanced-learn; fit_resample is the current name
x_train_balanced, y_balanced = smote.fit_resample(x_train, y_train)

In [73]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
normalised_train_df = scaler.fit_transform(x_train_balanced.drop(columns=[ 'record' ]))
normalised_train_df = pd.DataFrame(normalised_train_df,
columns=x_train_balanced.drop(columns=[ 'record' ]).columns)
normalised_train_df[ 'record' ] = x_train_balanced[ 'record' ] 


In [74]:
x_test = x_test.reset_index(drop= True )
normalised_test_df = scaler.transform(x_test.drop(columns=[ 'record' ]))
normalised_test_df = pd.DataFrame(normalised_test_df,
columns=x_test.drop(columns=[ 'record' ]).columns)
normalised_test_df[ 'record' ] = x_test[ 'record' ]
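The two cells above fit the scaler on the training features and then reuse it on the test features. A minimal plain-Python sketch of that pattern follows; `fit_min_max` and `apply_min_max` are hypothetical helpers, the numbers are made up, and columns are assumed to be non-constant so that `max - min` is never zero.

```python
def fit_min_max(columns):
    """Record per-column (min, max) from the training data only."""
    return [(min(col), max(col)) for col in columns]

def apply_min_max(columns, stats):
    """Scale each column with previously fitted statistics: (x - min) / (max - min)."""
    return [[(x - lo) / (hi - lo) for x in col] for col, (lo, hi) in zip(columns, stats)]

train = [[10.0, 20.0, 30.0]]  # one hypothetical feature column
test = [[15.0, 35.0]]

stats = fit_min_max(train)
train_scaled = apply_min_max(train, stats)
test_scaled = apply_min_max(test, stats)  # reuses the training min and max
print(train_scaled)
print(test_scaled)
```

Test values that fall outside the training range end up outside [0, 1], which is expected when the statistics come from the training data alone.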

In [75]:
#Logistic Regression
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(normalised_train_df, y_balanced)
#returns
#LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
# intercept_scaling=1, l1_ratio=None, max_iter=100,
# multi_class='auto', n_jobs=None, penalty='l2',
# random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
# warm_start=False)

LogisticRegression()

## <font color= Blue> Measuring Classification Performance</font>
### *Cross-validation and accuracy*



In [61]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(log_reg, normalised_train_df, y_balanced, cv= 5 , scoring= 'f1_macro' )
scores* 100 


array([50.50437867, 50.32817889, 53.10657596, 59.50113379, 59.50113379])

### *K-Fold Cross Validation*

In [56]:
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score
kf = KFold(n_splits=5)
f1_scores = []
#run for every split; fold-local names avoid overwriting the held-out x_test and y_test
for train_index, val_index in kf.split(normalised_train_df):
    x_tr, x_val = normalised_train_df.iloc[train_index], normalised_train_df.iloc[val_index]
    y_tr, y_val = y_balanced[train_index], y_balanced[val_index]
    model = LogisticRegression().fit(x_tr, y_tr)
    f1_scores.append(f1_score(y_true=y_val, y_pred=model.predict(x_val), pos_label='2A') * 100)
f1_scores 

[50.943396226415096, 54.0, 52.52525252525253, 55.00000000000001, 0.0]
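The unshuffled split above hands out rows in their existing order, so the class mix of each fold depends on how the rows happen to be arranged. The sketch below mimics that contiguous slicing in plain Python on hypothetical labels. `contiguous_folds` is an illustrative helper, not a scikit-learn function, and the minority rows sitting at the end of the table is an assumption made for the example.

```python
from collections import Counter

def contiguous_folds(n_samples, n_splits):
    """Yield (start, stop) index ranges the way an unshuffled k-fold split walks the rows."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for i in range(n_splits):
        # The first `extra` folds receive one additional row each.
        stop = start + base + (1 if i < extra else 0)
        yield start, stop
        start = stop

# Hypothetical labels: 12 mixed rows followed by 8 appended minority rows.
labels = ["3A", "2A", "3A", "3A", "2A", "3A", "3A", "2A", "3A", "3A", "2A", "3A"] + ["2A"] * 8
fold_counts = []
for start, stop in contiguous_folds(len(labels), 5):
    counts = dict(Counter(labels[start:stop]))
    fold_counts.append(counts)
    print((start, stop), counts)
```

In this toy case the last two folds contain a single class, so any score computed on them says little about the model. A stratified split, which keeps class proportions similar in every fold, avoids that situation.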

### *Stratified K-Fold Cross Validation*


In [62]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
skf = StratifiedKFold(n_splits= 5 , shuffle= True , random_state= 1 )
f1_scores = []
#run for every split; fold-local names avoid overwriting the held-out x_test and y_test
for train_index, val_index in skf.split(normalised_train_df, y_balanced):
    x_tr, x_val = np.array(normalised_train_df)[train_index], np.array(normalised_train_df)[val_index]
    y_tr, y_val = y_balanced[train_index], y_balanced[val_index]
    model = LogisticRegression().fit(x_tr, y_tr)
    #save result to list
    f1_scores.append(f1_score(y_true=y_val, y_pred=model.predict(x_val), pos_label='2A') * 100)
f1_scores 

[55.769230769230774,
 57.99999999999999,
 60.60606060606061,
 58.58585858585859,
 48.484848484848484]

### *Leave One Out Cross Validation (LOOCV)* 

In [60]:
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(), normalised_train_df, y_balanced, cv=loo,
 scoring= 'f1_macro' )
average_score = scores.mean() * 100
average_score

52.96610169491526
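With a single held-out sample per fold, a macro F1 score can only be 1 (prediction correct) or 0 (prediction wrong), so the LOOCV average above behaves like plain accuracy. The check below uses `f1_score` on toy labels. `zero_division=0` is passed only to silence the undefined-metric warnings.

```python
from sklearn.metrics import f1_score

# One held-out sample per fold: the prediction is either right or wrong.
correct = f1_score(["2A"], ["2A"], average="macro", zero_division=0)
wrong = f1_score(["2A"], ["3A"], average="macro", zero_division=0)
print(correct, wrong)

# Averaging such 0/1 fold scores gives the share of correct predictions.
fold_scores = [correct, wrong, correct, correct]
loo_average = sum(fold_scores) / len(fold_scores)
print(loo_average)
```

Read this way, and assuming SMOTE balanced the training set to 236 rows per class, the 52.97% above is consistent with 250 of the 472 training rows being classified correctly when held out.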

## *Confusion Matrix*

In [76]:
from sklearn.metrics import recall_score, accuracy_score, precision_score, f1_score,confusion_matrix

new_predictions = log_reg.predict(normalised_test_df)

new_predictions


array(['2A', '2A', '2A', '3A', '2A', '2A', '2A', '2A', '3A', '2A', '3A',
       '3A', '3A', '2A', '2A', '2A', '3A', '2A', '2A', '3A', '2A', '3A',
       '2A', '2A', '3A', '2A', '2A', '2A', '2A', '3A', '2A', '2A', '2A',
       '2A', '3A', '3A', '3A', '2A', '2A', '2A', '3A', '2A', '3A', '3A',
       '2A', '2A', '2A', '3A', '2A', '2A', '3A', '2A', '3A', '2A', '2A',
       '2A', '3A', '2A', '3A', '2A', '2A', '3A', '2A', '2A', '2A', '3A',
       '2A', '3A', '3A', '3A', '2A', '2A', '2A', '3A', '2A', '2A', '3A',
       '2A', '2A', '3A', '3A', '2A', '3A', '2A', '3A', '3A', '2A', '2A',
       '3A', '2A', '2A', '2A', '2A', '3A', '2A', '2A', '3A', '2A', '2A',
       '2A', '2A', '2A', '2A', '3A', '2A', '3A', '3A', '2A', '2A', '2A',
       '3A', '3A', '3A', '2A', '2A', '2A', '2A', '3A', '2A', '2A', '3A',
       '2A', '3A', '2A', '2A', '2A', '3A', '2A', '3A', '3A', '2A', '3A',
       '2A', '2A', '3A', '3A', '3A', '3A', '2A', '3A', '2A', '3A', '2A',
       '3A', '2A', '3A', '3A', '3A', '2A', '3A', '2

In [77]:
y_test.shape


(177,)

In [78]:
new_predictions.shape

(177,)

In [79]:
cnf_mat = confusion_matrix(y_true=y_test, y_pred=new_predictions, labels=[ '2A' , '3A' ])

cnf_mat 

array([[34, 29],
       [70, 44]], dtype=int64)

In [80]:
from sklearn.metrics import classification_report
print(classification_report(y_true=y_test, y_pred=new_predictions, labels=[ '2A' , '3A' ]))

              precision    recall  f1-score   support

          2A       0.33      0.54      0.41        63
          3A       0.60      0.39      0.47       114

    accuracy                           0.44       177
   macro avg       0.46      0.46      0.44       177
weighted avg       0.50      0.44      0.45       177
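The figures in the report above can be reproduced by hand from the confusion matrix, which makes the definitions concrete. The sketch below uses plain Python and the matrix values printed earlier; the variable names are illustrative.

```python
# Confusion matrix of the logistic regression: rows = true class, columns = predicted class
labels = ["2A", "3A"]
cnf_mat = [[34, 29],
           [70, 44]]

total = sum(sum(row) for row in cnf_mat)
accuracy = sum(cnf_mat[i][i] for i in range(len(labels))) / total

per_class = {}
for i, label in enumerate(labels):
    tp = cnf_mat[i][i]
    predicted = sum(row[i] for row in cnf_mat)  # column sum: everything predicted as this class
    actual = sum(cnf_mat[i])                    # row sum: true members of this class (support)
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    per_class[label] = (precision, recall, f1, actual)

# Macro average weights classes equally; weighted average weights them by support.
macro_f1 = sum(m[2] for m in per_class.values()) / len(labels)
weighted_f1 = sum(m[2] * m[3] for m in per_class.values()) / total

print(f"accuracy {accuracy:.2f}")
for label, (p, r, f, n) in per_class.items():
    print(f"{label}: precision {p:.2f} recall {r:.2f} f1 {f:.2f} support {n}")
print(f"macro f1 {macro_f1:.2f}, weighted f1 {weighted_f1:.2f}")
```

For class 2A, for example, precision is 34 / (34 + 70) and recall is 34 / (34 + 29), which round to the 0.33 and 0.54 shown in the report.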



In [82]:
log_reg.score(normalised_test_df,y_test)

0.4406779661016949

## *Accuracy*

In [51]:
accuracy = accuracy_score(y_true=y_test, y_pred=new_predictions)
#round to two decimals (the 2 was previously passed to format instead of round)
print('Accuracy: {}'.format(round(accuracy * 100, 2)))

Accuracy: 44.07


## *Precision* 

In [52]:
precision = precision_score(y_true=y_test, y_pred=new_predictions, pos_label='2A')
#round to two decimals (the 2 was previously passed to format instead of round)
print('Precision: {}'.format(round(precision * 100, 2)))

Precision: 32.69


## *Recall*

In [54]:
recall = recall_score(y_true=y_test, y_pred=new_predictions, pos_label='2A')
#round to two decimals (the 2 was previously passed to format instead of round)
print('Recall: {}'.format(round(recall * 100, 2)))

Recall: 53.97


## *F1-Score*

In [31]:
f1 = f1_score(y_true=y_test, y_pred=new_predictions, pos_label='2A')
#round to two decimals (the 2 was previously passed to format instead of round)
print('F1: {}'.format(round(f1 * 100, 2)))

F1: 40.72


## *Tree-Based Methods and The Support Vector Machine*

In [64]:
from sklearn.tree import DecisionTreeClassifier
dec_tree = DecisionTreeClassifier()
dec_tree.fit(normalised_train_df, y_balanced)

DecisionTreeClassifier()

In [83]:
new_predictions = dec_tree.predict(normalised_test_df)

In [84]:
dec_tree.score(normalised_test_df,y_test)

0.6440677966101694

In [85]:
print(classification_report(y_true=y_test, y_pred=new_predictions, labels=[ '2A' , '3A' ]))

              precision    recall  f1-score   support

          2A       0.50      0.54      0.52        63
          3A       0.73      0.70      0.72       114

    accuracy                           0.64       177
   macro avg       0.62      0.62      0.62       177
weighted avg       0.65      0.64      0.65       177

