**Introduction**

Every year, people demand more from nature than it can regenerate. Individuals, communities and government leaders use Ecological Footprint data to better manage limited resources, reduce economic risk, and improve well-being. The dataset provides Ecological Footprint per capita data for the years 1961-2016 in global hectares (gha). The Ecological Footprint measures how much biologically productive land and water area an individual, population, or activity requires to produce all the resources it consumes and to absorb the waste it generates, using prevailing technology and resource management practices. Since trade is global, an individual's or country's Footprint includes area from all over the world.

Apart from predicting numeric values, another important supervised machine learning task is classification, which involves predicting classes (binary or multiclass). In this section, we will cover how to measure the performance of class predictions, linear classification methods, and non-linear/tree-based methods. We'll also focus on strategies for applying a classification model successfully, such as the interpretability-accuracy trade-off and handling class imbalance.

The National Footprint and Biocapacity Accounts (NFAs) measure the ecological resource use and resource capacity of nations from 1961 to 2016. The calculations in the National Footprint and Biocapacity Accounts are primarily based on United Nations data sets, including those published by the Food and Agriculture Organization, United Nations Commodity Trade Statistics Database, and the UN Statistics Division, as well as the International Energy Agency. In this project, we will use this data to classify and predict the quality score (QScore) of the ecological footprint data for the different countries. The data includes total and per capita national biocapacity, the ecological footprint of consumption, the ecological footprint of production and total area in hectares.

Data Source: https://data.world/footprint/nfa-2019-edition

**Linear Classification and Logistic Regression**

In machine learning, classification is a supervised method of assigning data points to labels or classes. Unlike regression, the target variable in a classification problem is discrete. Each data point used in training a classification model must have a corresponding label so that the characteristics and patterns of each class can be learnt appropriately. Classification can be binary (for example, identifying whether a given email is spam or not) or multi-class (for example, classifying a fruit as an orange, mango or banana).

**Linear classifiers and the importance of class probabilities**

For simplicity, we define a linear classifier as a binary classifier that separates two classes (positive and negative class) using a linear separator by computing a linear combination of the features and comparing against a set threshold.
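
As a minimal sketch of this definition, the snippet below scores a data point with a linear combination of its features and compares the score against a threshold. The weights, bias and threshold are made-up illustrative values, not parameters learned from the footprint data.

```python
def linear_classify(features, weights, bias=0.0, threshold=0.0):
    """Return 1 (positive class) if w.x + b exceeds the threshold, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > threshold else 0

# hypothetical weights and bias, chosen only for illustration
weights = [2.0, -1.0]
bias = -0.5

pred_a = linear_classify([1.0, 0.5], weights, bias)  # score = 2.0 - 0.5 - 0.5 = 1.0
pred_b = linear_classify([0.0, 1.0], weights, bias)  # score = -1.0 - 0.5 = -1.5
print(pred_a, pred_b)
```

The separator is linear because the decision depends only on which side of the line (or hyperplane) defined by the weights and bias a point falls on.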

**Logistic Regression: Sigmoid, logit and the log-likelihood**

Logistic regression is a linear algorithm that can be used for binary or multiclass classification. It is a discriminative classifier that estimates the probability that an instance belongs to a class using an S-shaped curve called the sigmoid function. The values obtained by applying a linear equation to the predictors can fall anywhere between negative infinity and positive infinity. The sigmoid squashes these values into the range between 0 and 1 so that they can be read as probabilities. In this sense, the sigmoid function turns a linear regression model into a logistic regression model.



The sigmoid function can be applied to a linear equation,

$$z = \beta_0 + \beta_1 x$$

to obtain values h between 0 and 1 such that

$$h = \frac{1}{1 + e^{-z}}$$

For a binary classification task with classes A and B, if the threshold is set to 0.5 and p is the probability that an instance belongs to class B, then the instance is assigned to class A if p < 0.5 and to class B if p > 0.5.

Also known as the log-odds, the logit is the logarithm of the odds, where the odds are the probability that an event occurs divided by the probability that the event does not occur. The logit is the inverse of the sigmoid: it maps probabilities between 0 and 1 back to values between negative infinity and positive infinity.
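
To make the relationship between the sigmoid and the logit concrete, here is a small sketch using only the standard library. The coefficients `b0` and `b1` are hypothetical values picked for illustration; they are not fitted to any data.

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log of odds: map a probability in (0, 1) back to the real line."""
    return math.log(p / (1.0 - p))

# hypothetical coefficients for z = b0 + b1 * x
b0, b1 = -1.0, 2.0
x = 1.5
z = b0 + b1 * x   # linear part of the model
p = sigmoid(z)    # probability between 0 and 1
print(round(p, 4))
print(round(logit(p), 4))  # applying the logit recovers z
```

Because the two functions are inverses, a coefficient in logistic regression can be read as the change in log-odds per unit change in the corresponding feature.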


Recall that in linear regression, we minimized the sum of squared errors SSE; in logistic regression, the log-likelihood is maximized.
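
The log-likelihood can be sketched as follows for 0/1 labels: each instance contributes the log of the probability the model assigned to its true class. The labels and probabilities below are toy values chosen for illustration, not output from a fitted model.

```python
import math

def log_likelihood(y_true, probs):
    """Sum of y*log(p) + (1-y)*log(1-p) over all instances (labels are 0 or 1)."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, probs)
    )

y_true = [1, 0, 1, 1]               # toy labels
good_probs = [0.9, 0.2, 0.8, 0.7]   # predicted P(y=1), close to the labels
poor_probs = [0.6, 0.5, 0.5, 0.4]   # predicted P(y=1), much less confident

ll_good = log_likelihood(y_true, good_probs)
ll_poor = log_likelihood(y_true, poor_probs)
print(ll_good, ll_poor)
```

The log-likelihood is never positive, and predictions that agree more closely with the labels give a value closer to zero, which is why fitting the model amounts to maximizing it.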





In [2]:
%pip install -U imbalanced-learn

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

In [5]:
df = pd.read_csv('https://query.data.world/s/wh6j7rxy2hvrn4ml75ci62apk5hgae')


In [6]:
df

Unnamed: 0,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
0,Armenia,1992,1,AreaPerCap,1.402924e-01,1.995463e-01,0.097188051,3.688847e-02,2.931995e-02,0.000000e+00,5.032351e-01,3A
1,Armenia,1992,1,AreaTotHA,4.830000e+05,6.870000e+05,334600,1.270000e+05,1.009430e+05,0.000000e+00,1.732543e+06,3A
2,Armenia,1992,1,BiocapPerCap,1.598044e-01,1.352610e-01,0.084003213,1.374213e-02,3.339780e-02,0.000000e+00,4.262086e-01,3A
3,Armenia,1992,1,BiocapTotGHA,5.501762e+05,4.656780e+05,289207.1078,4.731155e+04,1.149823e+05,0.000000e+00,1.467355e+06,3A
4,Armenia,1992,1,EFConsPerCap,3.875102e-01,1.894622e-01,1.26E-06,4.164833e-03,3.339780e-02,1.114093e+00,1.728629e+00,3A
...,...,...,...,...,...,...,...,...,...,...,...,...
72181,World,2016,5001,BiocapTotGHA,3.984702e+09,1.504757e+09,5.11176e+09,1.095445e+09,4.726163e+08,0.000000e+00,1.216928e+10,3A
72182,World,2016,5001,EFConsPerCap,5.336445e-01,1.402092e-01,0.273495,8.974253e-02,6.329435e-02,1.646235e+00,2.746619e+00,3A
72183,World,2016,5001,EFConsTotGHA,3.984702e+09,1.046937e+09,2.04218e+09,6.701039e+08,4.726163e+08,1.229237e+10,2.050891e+10,3A
72184,World,2016,5001,EFProdPerCap,5.336445e-01,1.402092e-01,0.273495,8.974253e-02,6.329435e-02,1.646235e+00,2.746619e+00,3A


In [7]:
#check distribution of target variable
df['QScore'].value_counts()

3A    51481
2A    10576
2B    10096
1A       16
1B       16
Name: QScore, dtype: int64

In [8]:
df.isna().sum()

country               0
year                  0
country_code          0
record                0
crop_land         20472
grazing_land      20472
forest_land       20472
fishing_ground    20473
built_up_land     20473
carbon            20473
total                 9
QScore                1
dtype: int64

In [9]:
df = df.dropna()

In [10]:
df.isna().sum()

country           0
year              0
country_code      0
record            0
crop_land         0
grazing_land      0
forest_land       0
fishing_ground    0
built_up_land     0
carbon            0
total             0
QScore            0
dtype: int64

In [11]:
df

Unnamed: 0,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
0,Armenia,1992,1,AreaPerCap,1.402924e-01,1.995463e-01,0.097188051,3.688847e-02,2.931995e-02,0.000000e+00,5.032351e-01,3A
1,Armenia,1992,1,AreaTotHA,4.830000e+05,6.870000e+05,334600,1.270000e+05,1.009430e+05,0.000000e+00,1.732543e+06,3A
2,Armenia,1992,1,BiocapPerCap,1.598044e-01,1.352610e-01,0.084003213,1.374213e-02,3.339780e-02,0.000000e+00,4.262086e-01,3A
3,Armenia,1992,1,BiocapTotGHA,5.501762e+05,4.656780e+05,289207.1078,4.731155e+04,1.149823e+05,0.000000e+00,1.467355e+06,3A
4,Armenia,1992,1,EFConsPerCap,3.875102e-01,1.894622e-01,1.26E-06,4.164833e-03,3.339780e-02,1.114093e+00,1.728629e+00,3A
...,...,...,...,...,...,...,...,...,...,...,...,...
72181,World,2016,5001,BiocapTotGHA,3.984702e+09,1.504757e+09,5.11176e+09,1.095445e+09,4.726163e+08,0.000000e+00,1.216928e+10,3A
72182,World,2016,5001,EFConsPerCap,5.336445e-01,1.402092e-01,0.273495,8.974253e-02,6.329435e-02,1.646235e+00,2.746619e+00,3A
72183,World,2016,5001,EFConsTotGHA,3.984702e+09,1.046937e+09,2.04218e+09,6.701039e+08,4.726163e+08,1.229237e+10,2.050891e+10,3A
72184,World,2016,5001,EFProdPerCap,5.336445e-01,1.402092e-01,0.273495,8.974253e-02,6.329435e-02,1.646235e+00,2.746619e+00,3A


In [12]:
df['QScore'].value_counts()

3A    51473
2A      224
1A       16
Name: QScore, dtype: int64

After removing the rows with missing values, only three classes remain in the target variable, and their distribution is clearly imbalanced: class '3A' accounts for almost all of the rows. Methods such as *oversampling* and *undersampling* can be applied to handle this imbalance.

*Oversampling* involves increasing the number of instances in the class with fewer instances while *undersampling* involves reducing the data points in the class with more instances.
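
The two ideas can be sketched with plain random sampling. The rows below are toy `(row_id, label)` pairs with a made-up 9:1 imbalance, not the real QScore counts, and random duplication is only the simplest form of oversampling.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so repeated runs draw the same samples

# toy rows as (row_id, label) pairs with a 9:1 imbalance
rows = [(i, '3A') for i in range(90)] + [(i, '2A') for i in range(90, 100)]
majority = [r for r in rows if r[1] == '3A']
minority = [r for r in rows if r[1] == '2A']

# undersampling: keep only a random subset of the majority class
undersampled = random.sample(majority, len(minority)) + minority

# oversampling: keep every row and duplicate random minority rows until the classes match
extra = random.choices(minority, k=len(majority) - len(minority))
oversampled = majority + minority + extra

under_counts = Counter(label for _, label in undersampled)
over_counts = Counter(label for _, label in oversampled)
print(under_counts)
print(over_counts)
```

Undersampling discards information from the majority class, while oversampling by duplication repeats the same minority rows; SMOTE, used later in this notebook, instead creates synthetic minority points.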

For now, we will convert this to a binary classification problem by merging class '1A' into class '2A'.

In [13]:
#work on an explicit copy so the assignment does not raise SettingWithCopyWarning
df = df.copy()
df['QScore'] = df['QScore'].replace('1A', '2A')


In [14]:
df['QScore'].value_counts()

3A    51473
2A      240
Name: QScore, dtype: int64

In [15]:
df_2A = df[df.QScore=='2A'] #select the rows whose QScore is '2A'

In [16]:
df_2A

Unnamed: 0,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
1536,Algeria,2016,4,AreaPerCap,2.072989e-01,8.112722e-01,0.048357265,2.258528e-02,2.998367e-02,0.000000e+00,1.119497e+00,2A
1537,Algeria,2016,4,AreaTotHA,8.417600e+06,3.294260e+07,1963600,9.171000e+05,1.217520e+06,0.000000e+00,4.545842e+07,2A
1538,Algeria,2016,4,BiocapPerCap,2.021916e-01,2.636077e-01,0.027166736,7.947991e-03,2.924496e-02,0.000000e+00,5.301590e-01,2A
1539,Algeria,2016,4,BiocapTotGHA,8.210214e+06,1.070408e+07,1103135.245,3.227369e+05,1.187524e+06,0.000000e+00,2.152769e+07,2A
1540,Algeria,2016,4,EFConsPerCap,6.280528e-01,1.810332e-01,0.162800822,1.472910e-02,2.924496e-02,1.391455e+00,2.407316e+00,2A
...,...,...,...,...,...,...,...,...,...,...,...,...
65469,Ukraine,2016,230,BiocapTotGHA,8.966961e+07,6.457412e+06,19268322.04,6.920056e+06,3.952345e+06,0.000000e+00,1.262677e+08,2A
65470,Ukraine,2016,230,EFConsPerCap,1.155004e+00,1.081601e-02,0.149119494,5.150234e-02,8.893945e-02,1.453137e+00,2.908519e+00,2A
65471,Ukraine,2016,230,EFConsTotGHA,5.132678e+07,4.806484e+05,6626661.768,2.288692e+06,3.952345e+06,6.457538e+07,1.292505e+08,2A
65472,Ukraine,2016,230,EFProdPerCap,2.017831e+00,5.768143e-03,0.216720935,3.223480e-03,8.893945e-02,1.665767e+00,3.998251e+00,2A


In [17]:
df_3A = df[df.QScore=='3A'].sample(350) #randomly undersample the majority class to 350 rows

In [18]:
df_3A

Unnamed: 0,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
16193,Denmark,1993,54,AreaTotHA,3.270139e+06,1.320868e+05,466174.9,9.962370e+06,3.063625e+05,0.000000e+00,1.413713e+07,3A
36566,Lesotho,2011,122,EFProdPerCap,6.124386e-02,4.772367e-01,0.394973198,3.070000e-05,1.062451e-02,4.021286e-01,1.346238e+00,3A
24056,Ghana,2009,81,AreaPerCap,3.053908e-01,3.472251e-01,0.383478765,1.401493e-01,3.157548e-02,0.000000e+00,1.207819e+00,3A
25897,Guinea,1996,90,AreaTotHA,3.217000e+06,1.070000e+07,7048000,4.826200e+06,3.000590e+05,0.000000e+00,2.609126e+07,3A
3143,Australia,2011,10,EFProdTotGHA,5.686145e+07,6.452131e+07,20843532.02,9.997755e+05,1.430103e+06,1.434571e+08,2.881133e+08,3A
...,...,...,...,...,...,...,...,...,...,...,...,...
22705,Gambia,1977,75,AreaTotHA,1.740000e+05,4.000000e+05,451020.4082,6.274000e+05,1.567860e+04,0.000000e+00,1.668099e+06,3A
67928,Ethiopia,1994,238,EFProdPerCap,1.735797e-01,1.957231e-01,0.585161,1.950000e-05,3.036301e-02,1.310008e-02,9.979467e-01,3A
33203,Kenya,1965,114,BiocapTotGHA,3.166004e+06,1.068060e+07,815614.2532,8.395668e+05,2.050294e+05,0.000000e+00,1.570682e+07,3A
35528,Lao People's Democratic Republic,1989,120,AreaPerCap,2.058762e-01,1.933110e-01,4.267751141,1.449832e-01,4.972513e-02,0.000000e+00,4.861647e+00,3A


In [19]:
df['QScore'].value_counts()

3A    51473
2A      240
Name: QScore, dtype: int64

In [20]:
#DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
data_df = pd.concat([df_2A, df_3A])

In [21]:
data_df

Unnamed: 0,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
1536,Algeria,2016,4,AreaPerCap,2.072989e-01,8.112722e-01,0.048357265,0.022585,2.998367e-02,0.000000,1.119497e+00,2A
1537,Algeria,2016,4,AreaTotHA,8.417600e+06,3.294260e+07,1963600,917100.000000,1.217520e+06,0.000000,4.545842e+07,2A
1538,Algeria,2016,4,BiocapPerCap,2.021916e-01,2.636077e-01,0.027166736,0.007948,2.924496e-02,0.000000,5.301590e-01,2A
1539,Algeria,2016,4,BiocapTotGHA,8.210214e+06,1.070408e+07,1103135.245,322736.916200,1.187524e+06,0.000000,2.152769e+07,2A
1540,Algeria,2016,4,EFConsPerCap,6.280528e-01,1.810332e-01,0.162800822,0.014729,2.924496e-02,1.391455,2.407316e+00,2A
...,...,...,...,...,...,...,...,...,...,...,...,...
22705,Gambia,1977,75,AreaTotHA,1.740000e+05,4.000000e+05,451020.4082,627400.000000,1.567860e+04,0.000000,1.668099e+06,3A
67928,Ethiopia,1994,238,EFProdPerCap,1.735797e-01,1.957231e-01,0.585161,0.000019,3.036301e-02,0.013100,9.979467e-01,3A
33203,Kenya,1965,114,BiocapTotGHA,3.166004e+06,1.068060e+07,815614.2532,839566.802600,2.050294e+05,0.000000,1.570682e+07,3A
35528,Lao People's Democratic Republic,1989,120,AreaPerCap,2.058762e-01,1.933110e-01,4.267751141,0.144983,4.972513e-02,0.000000,4.861647e+00,3A


In [22]:
data_df.QScore.value_counts()

3A    350
2A    240
Name: QScore, dtype: int64

In [23]:
data_df['QScore'].value_counts()

3A    350
2A    240
Name: QScore, dtype: int64

In [24]:
import sklearn.utils

In [25]:
data_df = sklearn.utils.shuffle(data_df)

In [26]:
data_df

Unnamed: 0,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
26419,Guyana,2005,91,BiocapTotGHA,4.765815e+05,1.207970e+06,47530807.69,4.100384e+06,4.135844e+04,0.000000,5.335710e+07,3A
72108,World,2007,5001,BiocapPerCap,5.100363e-01,2.290718e-01,0.784827,1.614105e-01,5.598818e-02,0.000000,1.741334e+00,3A
55605,Sierra Leone,2005,197,BiocapTotGHA,1.655174e+06,2.129777e+06,1131898.574,1.350716e+06,1.871156e+05,0.000000,6.454682e+06,3A
8288,Myanmar,1976,28,AreaPerCap,3.286447e-01,1.189694e-02,1.279848279,7.616505e-01,3.013317e-02,0.000000,2.412174e+00,3A
1540,Algeria,2016,4,EFConsPerCap,6.280528e-01,1.810332e-01,0.162800822,1.472910e-02,2.924496e-02,1.391455,2.407316e+00,2A
...,...,...,...,...,...,...,...,...,...,...,...,...
32065,Jamaica,2016,109,AreaTotHA,2.150000e+05,2.290000e+05,334820,1.356100e+06,6.290070e+04,0.000000,2.197821e+06,2A
62670,Trinidad and Tobago,2016,220,EFConsPerCap,4.275409e-01,1.918018e-01,0.302139636,1.047238e-01,1.222536e-03,7.350750,8.378179e+00,2A
19888,Fiji,1993,66,AreaPerCap,3.443590e-01,2.317801e-01,1.273015762,3.963572e+00,4.314474e-02,0.000000,5.855872e+00,3A
1537,Algeria,2016,4,AreaTotHA,8.417600e+06,3.294260e+07,1963600,9.171000e+05,1.217520e+06,0.000000,4.545842e+07,2A


In [27]:
data_df = data_df.reset_index(drop=True)

In [28]:
data_df

Unnamed: 0,country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
0,Guyana,2005,91,BiocapTotGHA,4.765815e+05,1.207970e+06,47530807.69,4.100384e+06,4.135844e+04,0.000000,5.335710e+07,3A
1,World,2007,5001,BiocapPerCap,5.100363e-01,2.290718e-01,0.784827,1.614105e-01,5.598818e-02,0.000000,1.741334e+00,3A
2,Sierra Leone,2005,197,BiocapTotGHA,1.655174e+06,2.129777e+06,1131898.574,1.350716e+06,1.871156e+05,0.000000,6.454682e+06,3A
3,Myanmar,1976,28,AreaPerCap,3.286447e-01,1.189694e-02,1.279848279,7.616505e-01,3.013317e-02,0.000000,2.412174e+00,3A
4,Algeria,2016,4,EFConsPerCap,6.280528e-01,1.810332e-01,0.162800822,1.472910e-02,2.924496e-02,1.391455,2.407316e+00,2A
...,...,...,...,...,...,...,...,...,...,...,...,...
585,Jamaica,2016,109,AreaTotHA,2.150000e+05,2.290000e+05,334820,1.356100e+06,6.290070e+04,0.000000,2.197821e+06,2A
586,Trinidad and Tobago,2016,220,EFConsPerCap,4.275409e-01,1.918018e-01,0.302139636,1.047238e-01,1.222536e-03,7.350750,8.378179e+00,2A
587,Fiji,1993,66,AreaPerCap,3.443590e-01,2.317801e-01,1.273015762,3.963572e+00,4.314474e-02,0.000000,5.855872e+00,3A
588,Algeria,2016,4,AreaTotHA,8.417600e+06,3.294260e+07,1963600,9.171000e+05,1.217520e+06,0.000000,4.545842e+07,2A


In [29]:
data_df.shape

(590, 12)

In [30]:
data_df.QScore.value_counts()

3A    350
2A    240
Name: QScore, dtype: int64

In [31]:
#more preprocessing
data_df = data_df.drop(columns=['country_code', 'country', 'year'])

In [32]:
data_df

Unnamed: 0,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
0,BiocapTotGHA,4.765815e+05,1.207970e+06,47530807.69,4.100384e+06,4.135844e+04,0.000000,5.335710e+07,3A
1,BiocapPerCap,5.100363e-01,2.290718e-01,0.784827,1.614105e-01,5.598818e-02,0.000000,1.741334e+00,3A
2,BiocapTotGHA,1.655174e+06,2.129777e+06,1131898.574,1.350716e+06,1.871156e+05,0.000000,6.454682e+06,3A
3,AreaPerCap,3.286447e-01,1.189694e-02,1.279848279,7.616505e-01,3.013317e-02,0.000000,2.412174e+00,3A
4,EFConsPerCap,6.280528e-01,1.810332e-01,0.162800822,1.472910e-02,2.924496e-02,1.391455,2.407316e+00,2A
...,...,...,...,...,...,...,...,...,...
585,AreaTotHA,2.150000e+05,2.290000e+05,334820,1.356100e+06,6.290070e+04,0.000000,2.197821e+06,2A
586,EFConsPerCap,4.275409e-01,1.918018e-01,0.302139636,1.047238e-01,1.222536e-03,7.350750,8.378179e+00,2A
587,AreaPerCap,3.443590e-01,2.317801e-01,1.273015762,3.963572e+00,4.314474e-02,0.000000,5.855872e+00,3A
588,AreaTotHA,8.417600e+06,3.294260e+07,1963600,9.171000e+05,1.217520e+06,0.000000,4.545842e+07,2A


In [33]:
X = data_df.drop(columns='QScore')

In [34]:
X

Unnamed: 0,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total
0,BiocapTotGHA,4.765815e+05,1.207970e+06,47530807.69,4.100384e+06,4.135844e+04,0.000000,5.335710e+07
1,BiocapPerCap,5.100363e-01,2.290718e-01,0.784827,1.614105e-01,5.598818e-02,0.000000,1.741334e+00
2,BiocapTotGHA,1.655174e+06,2.129777e+06,1131898.574,1.350716e+06,1.871156e+05,0.000000,6.454682e+06
3,AreaPerCap,3.286447e-01,1.189694e-02,1.279848279,7.616505e-01,3.013317e-02,0.000000,2.412174e+00
4,EFConsPerCap,6.280528e-01,1.810332e-01,0.162800822,1.472910e-02,2.924496e-02,1.391455,2.407316e+00
...,...,...,...,...,...,...,...,...
585,AreaTotHA,2.150000e+05,2.290000e+05,334820,1.356100e+06,6.290070e+04,0.000000,2.197821e+06
586,EFConsPerCap,4.275409e-01,1.918018e-01,0.302139636,1.047238e-01,1.222536e-03,7.350750,8.378179e+00
587,AreaPerCap,3.443590e-01,2.317801e-01,1.273015762,3.963572e+00,4.314474e-02,0.000000,5.855872e+00
588,AreaTotHA,8.417600e+06,3.294260e+07,1963600,9.171000e+05,1.217520e+06,0.000000,4.545842e+07


In [35]:
y = data_df['QScore']

In [36]:
y

0      3A
1      3A
2      3A
3      3A
4      2A
       ..
585    2A
586    2A
587    3A
588    2A
589    2A
Name: QScore, Length: 590, dtype: object

In [37]:
#split the data into training and testing sets
from sklearn.model_selection import train_test_split

In [38]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [39]:
y_train.value_counts() 

3A    245
2A    168
Name: QScore, dtype: int64

In [40]:
y_test.value_counts()

3A    105
2A     72
Name: QScore, dtype: int64

In [41]:
x_train

Unnamed: 0,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total
285,EFConsTotGHA,3.320799e+06,6.345800e+05,2765794.631,3.345071e+05,2.689011e+05,1.264427e+07,1.996886e+07
113,AreaTotHA,2.580000e+05,4.000000e+05,137300,1.241000e+05,1.121330e+05,0.000000e+00,1.031533e+06
18,BiocapTotGHA,7.517044e+06,2.431725e+06,4907595.398,7.145312e+05,1.496289e+06,0.000000e+00,1.706718e+07
76,BiocapTotGHA,8.210214e+06,1.070408e+07,1103135.245,3.227369e+05,1.187524e+06,0.000000e+00,2.152769e+07
206,EFProdTotGHA,2.396295e+05,1.424887e+05,17777.27342,2.866636e+06,8.282097e+05,2.381253e+07,2.790727e+07
...,...,...,...,...,...,...,...,...
277,AreaTotHA,3.400000e+06,1.820000e+07,4.70118e+07,1.298890e+07,8.827310e+05,0.000000e+00,8.248343e+07
9,AreaPerCap,4.692302e-01,1.269614e-01,0.324916104,8.808966e-02,6.363007e-02,0.000000e+00,1.072827e+00
359,AreaPerCap,3.053908e-01,3.472251e-01,0.383478765,1.401493e-01,3.157548e-02,0.000000e+00,1.207819e+00
192,AreaTotHA,3.217000e+06,1.070000e+07,7048000,4.826200e+06,3.000590e+05,0.000000e+00,2.609126e+07


In [42]:
x_test

Unnamed: 0,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total
225,EFProdTotGHA,1.266348e+06,5.556791e+05,1144569.949,1.418238e+04,7.218740e+04,2.193404e+06,5.246371e+06
14,EFConsPerCap,3.536519e-01,3.016104e-02,0.017141929,9.666484e-03,2.813797e-02,1.306050e+00,1.744809e+00
85,EFProdTotGHA,1.089608e+07,3.485538e+07,5971153.158,1.183855e+06,4.228640e+06,2.287240e+07,8.000750e+07
418,EFProdPerCap,2.829950e-01,1.286783e-01,0.003938652,1.999000e-04,9.446197e-02,2.141098e-01,7.243837e-01
132,EFConsPerCap,4.001329e-01,6.284914e-01,0.298353855,2.135064e-02,2.907724e-02,5.087959e-01,1.886202e+00
...,...,...,...,...,...,...,...,...
346,EFProdTotGHA,1.552119e+06,1.011414e+06,3671804.755,6.591122e+03,6.590495e+04,2.911589e+05,6.598994e+06
369,EFProdPerCap,7.533517e-01,4.936446e-02,7.437320853,1.714518e-01,1.353134e-01,2.931224e+00,1.147803e+01
140,AreaPerCap,1.189882e-01,1.592605e-01,0.171111899,2.244472e-02,3.764945e-02,0.000000e+00,5.094548e-01
533,BiocapPerCap,2.944724e-01,4.759346e-01,0.679956291,1.426055e-01,1.591488e-02,0.000000e+00,1.608884e+00


In [43]:
y_train

285    3A
113    3A
18     2A
76     2A
206    3A
       ..
277    3A
9      3A
359    3A
192    3A
559    3A
Name: QScore, Length: 413, dtype: object

In [44]:
y_test

225    3A
14     2A
85     3A
418    2A
132    3A
       ..
346    3A
369    2A
140    3A
533    3A
171    3A
Name: QScore, Length: 177, dtype: object

There is still an imbalance in the class distribution. To handle it, we apply SMOTE to the training data only, so that the test set keeps its original distribution.

The 'record' column contains record names, of which there are 8 distinct values. You can confirm this for yourself by running the following line of code: df.record.value_counts(). The label encoder takes these 8 distinct names and encodes them as integers from 0 to 7. After encoding, you can run value_counts() on the encoded column to check the change for yourself.

In [45]:
df.record.value_counts()

BiocapTotGHA    6466
EFConsTotGHA    6465
EFProdTotGHA    6465
AreaTotHA       6465
BiocapPerCap    6463
EFConsPerCap    6463
EFProdPerCap    6463
AreaPerCap      6463
Name: record, dtype: int64

In [46]:
x_train['record'].value_counts()

AreaTotHA       61
AreaPerCap      58
EFProdTotGHA    54
BiocapTotGHA    53
EFConsTotGHA    50
EFProdPerCap    48
BiocapPerCap    45
EFConsPerCap    44
Name: record, dtype: int64

Label encoding is important because most machine learning algorithms work only with numbers, not with strings (names).

As for imblearn.over_sampling and SMOTE: we are working with an imbalanced dataset, where one output category has more instances than the other. Oversampling with imblearn balances the classes to prevent a bias toward the majority class, which would harm performance.

In [47]:
#encode categorical variable
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
#work on explicit copies so the assignments do not raise SettingWithCopyWarning
x_train = x_train.copy()
x_test = x_test.copy()
x_train['record'] = encoder.fit_transform(x_train['record'])
x_test['record'] = encoder.transform(x_test['record'])


In [48]:
x_train.record.value_counts()

1    61
0    58
7    54
3    53
5    50
6    48
2    45
4    44
Name: record, dtype: int64

In [49]:
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE

In [50]:
import imblearn
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=1)
#fit_sample was removed from imbalanced-learn; fit_resample is the current method name
x_train_balanced, y_balanced = smote.fit_resample(x_train, y_train)

In [51]:
y_balanced.value_counts()

3A    245
2A    245
Name: QScore, dtype: int64

In [52]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()   

In [53]:
normalised_train_df = scaler.fit_transform(x_train_balanced.drop(columns=['record']))

In [54]:
normalised_train_df

array([[1.19295679e-03, 6.05805010e-04, 1.55488797e-03, ...,
        1.04794127e-03, 1.61677836e-03, 1.40590832e-03],
       [9.26833596e-05, 3.81862033e-04, 7.71879864e-05, ...,
        4.36996382e-04, 0.00000000e+00, 7.26251187e-05],
       [2.70040642e-03, 2.32145851e-03, 2.75897602e-03, ...,
        5.83122446e-03, 0.00000000e+00, 1.20161592e-03],
       ...,
       [1.08488107e-10, 1.20378469e-10, 1.27223163e-10, ...,
        1.13208289e-10, 3.06693637e-10, 2.07844921e-10],
       [1.26836347e-02, 7.34401144e-03, 4.22529302e-03, ...,
        1.78598119e-02, 2.19575649e-02, 1.66551834e-02],
       [9.54371028e-04, 1.07723778e-03, 6.43514334e-05, ...,
        3.21559843e-03, 2.39350203e-04, 4.64682571e-04]])

In [55]:
normalised_train_df = pd.DataFrame(normalised_train_df, columns=x_train_balanced.drop(columns=['record']).columns)
normalised_train_df

Unnamed: 0,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total
0,1.192957e-03,6.058050e-04,1.554888e-03,6.478642e-04,1.047941e-03,1.616778e-03,1.405908e-03
1,9.268336e-05,3.818620e-04,7.718799e-05,2.403535e-04,4.369964e-04,0.000000e+00,7.262512e-05
2,2.700406e-03,2.321459e-03,2.758976e-03,1.383884e-03,5.831224e-03,0.000000e+00,1.201616e-03
3,2.949419e-03,1.021871e-02,6.201660e-04,6.250680e-04,4.627929e-03,0.000000e+00,1.515658e-03
4,8.608397e-05,1.360276e-04,9.994115e-06,5.552021e-03,3.227637e-03,3.044824e-03,1.964813e-03
...,...,...,...,...,...,...,...
485,2.753680e-03,2.287009e-03,4.588866e-03,8.434836e-04,5.836131e-03,7.440702e-04,1.828821e-03
486,1.117575e-10,4.815526e-09,8.418429e-11,5.744612e-12,2.337164e-10,2.608898e-10,5.208654e-10
487,1.084881e-10,1.203785e-10,1.272232e-10,1.556821e-10,1.132083e-10,3.066936e-10,2.078449e-10
488,1.268363e-02,7.344011e-03,4.225293e-03,1.886633e-02,1.785981e-02,2.195756e-02,1.665518e-02


In [56]:
normalised_train_df['record'] = x_train_balanced['record']
normalised_train_df

Unnamed: 0,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,record
0,1.192957e-03,6.058050e-04,1.554888e-03,6.478642e-04,1.047941e-03,1.616778e-03,1.405908e-03,5
1,9.268336e-05,3.818620e-04,7.718799e-05,2.403535e-04,4.369964e-04,0.000000e+00,7.262512e-05,1
2,2.700406e-03,2.321459e-03,2.758976e-03,1.383884e-03,5.831224e-03,0.000000e+00,1.201616e-03,3
3,2.949419e-03,1.021871e-02,6.201660e-04,6.250680e-04,4.627929e-03,0.000000e+00,1.515658e-03,3
4,8.608397e-05,1.360276e-04,9.994115e-06,5.552021e-03,3.227637e-03,3.044824e-03,1.964813e-03,7
...,...,...,...,...,...,...,...,...
485,2.753680e-03,2.287009e-03,4.588866e-03,8.434836e-04,5.836131e-03,7.440702e-04,1.828821e-03,6
486,1.117575e-10,4.815526e-09,8.418429e-11,5.744612e-12,2.337164e-10,2.608898e-10,5.208654e-10,4
487,1.084881e-10,1.203785e-10,1.272232e-10,1.556821e-10,1.132083e-10,3.066936e-10,2.078449e-10,4
488,1.268363e-02,7.344011e-03,4.225293e-03,1.886633e-02,1.785981e-02,2.195756e-02,1.665518e-02,5


In [57]:
x_test = x_test.reset_index(drop=True)
x_test

Unnamed: 0,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total
0,7,1.266348e+06,5.556791e+05,1144569.949,1.418238e+04,7.218740e+04,2.193404e+06,5.246371e+06
1,4,3.536519e-01,3.016104e-02,0.017141929,9.666484e-03,2.813797e-02,1.306050e+00,1.744809e+00
2,7,1.089608e+07,3.485538e+07,5971153.158,1.183855e+06,4.228640e+06,2.287240e+07,8.000750e+07
3,6,2.829950e-01,1.286783e-01,0.003938652,1.999000e-04,9.446197e-02,2.141098e-01,7.243837e-01
4,4,4.001329e-01,6.284914e-01,0.298353855,2.135064e-02,2.907724e-02,5.087959e-01,1.886202e+00
...,...,...,...,...,...,...,...,...
172,7,1.552119e+06,1.011414e+06,3671804.755,6.591122e+03,6.590495e+04,2.911589e+05,6.598994e+06
173,6,7.533517e-01,4.936446e-02,7.437320853,1.714518e-01,1.353134e-01,2.931224e+00,1.147803e+01
174,0,1.189882e-01,1.592605e-01,0.171111899,2.244472e-02,3.764945e-02,0.000000e+00,5.094548e-01
175,2,2.944724e-01,4.759346e-01,0.679956291,1.426055e-01,1.591488e-02,0.000000e+00,1.608884e+00


In [58]:
normalised_test_df = scaler.transform(x_test.drop(columns=['record']))
normalised_test_df = pd.DataFrame(normalised_test_df, columns=x_test.drop(columns=['record']).columns)
normalised_test_df['record'] = x_test['record']
normalised_test_df

Unnamed: 0,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,record
0,4.549203e-04,5.304819e-04,6.434599e-04,2.746804e-05,2.813233e-04,2.804628e-04,3.693710e-04,7
1,1.262827e-10,2.879339e-11,9.636934e-12,1.872178e-11,1.096572e-10,1.670000e-10,1.078920e-10,4
2,3.914283e-03,3.327486e-02,3.356892e-03,2.292858e-03,1.647954e-02,2.924612e-03,5.632932e-03,7
3,1.009001e-10,1.228434e-10,2.214251e-12,3.871608e-13,3.681301e-10,2.737746e-11,3.604888e-11,6
4,1.429804e-10,5.999925e-10,1.677300e-10,4.135133e-11,1.133177e-10,6.505792e-11,1.178468e-10,4
...,...,...,...,...,...,...,...,...
172,5.575800e-04,9.655520e-04,2.064233e-03,1.276550e-05,2.568399e-04,3.722945e-05,4.646025e-04,7
173,2.698700e-10,4.712603e-11,4.181149e-09,3.320631e-10,5.273333e-10,3.748052e-10,7.931596e-10,6
174,4.198262e-11,1.520389e-10,9.619653e-11,4.347033e-11,1.467246e-10,0.000000e+00,2.091681e-11,0
175,1.050232e-10,4.543533e-10,3.822612e-10,2.761943e-10,6.202228e-11,0.000000e+00,9.832215e-11,2


In [59]:
#Logistic Regression
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(normalised_train_df, y_balanced)

LogisticRegression()

**Measuring Classification Performance**

**Cross-validation and accuracy**

From the previous module, we now understand why data scientists and machine learning engineers avoid models that overfit or underfit. Cross validation (CV) is a well-known and trusted method for detecting overfitting and estimating how well a model generalizes. Although there are different techniques for performing cross validation, the fundamental concept involves partitioning the dataset into a number of subsets, holding out one subset for evaluation, and training the model on the others. Averaging the score across the different training samples gives a more reliable estimate of how the model performs. The main drawback of cross validation is that it takes more time and computational resources; however, a more reliable estimate of model performance is usually well worth this cost. ***K-Fold cross validation, Stratified K-Fold cross validation and Leave One Out Cross Validation (LOOCV)*** are some cross validation techniques.


**K-Fold Cross Validation**

This technique is called K-Fold because the data is split into K groups of roughly equal size. If K = 5, a 5-fold cross validation can be performed such that the data is split into folds k1, k2, k3, k4 and k5. The model is trained on k2 to k5 and evaluated on k1; this is then repeated K times so that every fold is used exactly once for evaluation and K-1 times for training.
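
The splitting step described above can be sketched in plain Python. This is an illustrative index splitter, not the implementation used by any library; in practice scikit-learn's `KFold` class in `sklearn.model_selection` plays this role. The function name `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    # the first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train:", train_idx)
```

Each sample lands in exactly one test fold, so averaging the K evaluation scores uses every data point for testing once. Shuffling the rows before splitting is usually advisable when the data is ordered.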



**Stratified K-Fold Cross Validation**

Although similar to the technique described above, Stratified K-Fold cross validation ensures that every fold has approximately the same proportion of each target class as the full dataset, which gives a good representation of the data and avoids imbalanced folds and biased results. For example, if two target classes t1 and t2 are equally distributed in the data, each fold should also contain them in equal proportion; if the split is 80/20, each fold should be roughly 80/20 as well.
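
To illustrate, the sketch below uses hypothetical toy labels with an 80/20 class split (`y_toy` and `skf_demo` are assumptions made for this example). With four folds, each evaluation fold should receive one of the four minority-class samples:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with an 80/20 class split (illustrative only)
y_toy = np.array(['a'] * 16 + ['b'] * 4)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

skf_demo = StratifiedKFold(n_splits=4)

# Count how many minority-class samples land in each evaluation fold
minority_counts = [int((y_toy[test_idx] == 'b').sum())
                   for _, test_idx in skf_demo.split(X_toy, y_toy)]
print(minority_counts)
```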


**Leave One Out Cross Validation (LOOCV)**

In this method, one instance is left out and used as the test set while the model is trained on the remaining N-1 data points, where N is the number of data points. This is repeated for every instance, so the number of folds equals the number of instances.
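
A small sketch, assuming six toy samples (`X_toy` is illustrative), that checks the number of folds matches the number of instances and that every test set holds a single instance:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Six toy samples (illustrative only)
X_toy = np.arange(6).reshape(-1, 1)

loo_demo = LeaveOneOut()
n_folds = loo_demo.get_n_splits(X_toy)

# Every split trains on N-1 points and tests on the one left out
test_sizes = [len(test_idx) for _, test_idx in loo_demo.split(X_toy)]
print(n_folds, test_sizes)
```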


**Confusion Matrix, Precision-Recall, ROC curve and the F1-score**

Accuracy, precision, recall, F1-score and many others are evaluation metrics used in measuring the performance of classification models. In this section, we discuss these metrics. 

**Confusion Matrix**

A confusion matrix is an N x N matrix that summarises the correct and incorrect predictions for the N target classes. The values on the diagonal of the matrix are the numbers of correctly predicted instances, while every other cell counts misclassified instances. This means that the more predictions that fall on the diagonal, the better the model. True positive, false positive, true negative and false negative are terms used when interpreting a confusion matrix.


***True Positive (TP): This is a correct classification where the predicted value is the same as the actual value: the actual value was positive and the predicted value was also positive.***

***True Negative (TN): The predicted value also matches the actual value, this time for the negative class. The actual value is negative and the predicted value is negative.***

***False Positive (FP): Also called a Type I error, this is a misclassification in which the model predicted a positive class while the actual class is negative. Telling a man that he is pregnant is definitely a false positive.***

***False Negative (FN): Another misclassification, where the predicted value is negative and the actual value is positive. An example is telling a pregnant woman that she is not pregnant. FN is known as a Type II error.***
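
The four terms can be counted directly from paired labels. The helper below is a plain-Python sketch for the binary case; `confusion_counts` and the toy labels are hypothetical names introduced for illustration:

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative (Type I error)
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive (Type II error)
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn

# Toy labels (illustrative only)
y_true_toy = ['pos', 'pos', 'neg', 'neg', 'pos', 'neg']
y_pred_toy = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']
tp, fp, tn, fn = confusion_counts(y_true_toy, y_pred_toy, positive='pos')
print(tp, fp, tn, fn)
```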


**Accuracy**

This is the ratio of the number of correctly predicted instances to the total number of instances; for a binary problem, (TP + TN) / (TP + TN + FP + FN). It is a commonly used metric, suitable when the target classes are not imbalanced. A high accuracy does not necessarily mean that the model has high predictive power: on imbalanced data, a model can score well simply by predicting the majority class. Hence, depending on the task, accuracy should not be the only metric used because it does not provide enough information about the model.
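
As a hedged illustration of why accuracy alone can mislead, consider hypothetical labels with 95 negatives and 5 positives and a "model" that always predicts the negative class. The accuracy looks high even though no positive instance is found:

```python
# Toy imbalanced labels (illustrative only): 95 negatives, 5 positives
y_true_toy = ['neg'] * 95 + ['pos'] * 5
# A trivial "model" that always predicts the majority class
y_pred_toy = ['neg'] * 100

correct = sum(a == p for a, p in zip(y_true_toy, y_pred_toy))
accuracy_toy = correct / len(y_true_toy)

# Number of positive instances the model actually identified
positives_found = sum(a == 'pos' and p == 'pos'
                      for a, p in zip(y_true_toy, y_pred_toy))
print(accuracy_toy, positives_found)
```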



**Precision**

Precision, also known as the Positive Predictive Value (PPV), is the ratio of correctly predicted instances of a class to the total number of instances the model predicted to be in that class. It answers the question: of the results returned, what percentage are relevant? For the positive class, it is the ratio of true positives to the sum of true positives and false positives: TP / (TP + FP).



**Recall**

Also known as the sensitivity of the model, recall is the percentage of all relevant results that the model correctly predicted. It is the ratio of true positives to the actual number of positives (true positives plus false negatives): TP / (TP + FN).


As with the bias-variance trade-off discussed in the previous module, there is a trade-off between precision and recall. It is usually not possible to maximise both at once: for a given model, changing the decision threshold to increase recall typically lowers precision, and vice versa. Identify which metric matters more for your task and optimise for it.
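
The trade-off can be seen by moving the decision threshold over a set of predicted scores. The sketch below uses hypothetical scores and labels (1 is the positive class); `precision_recall` is an illustrative helper, not a scikit-learn function:

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall for the positive class (label 1) at a threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels and predicted scores (illustrative only)
y_true_toy = [0, 0, 1, 0, 1, 1]
scores_toy = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

# A strict threshold favours precision; a lenient one favours recall
strict = precision_recall(y_true_toy, scores_toy, threshold=0.65)
lenient = precision_recall(y_true_toy, scores_toy, threshold=0.35)
print('strict:', strict)
print('lenient:', lenient)
```

On this toy data the strict threshold gives perfect precision but misses one positive, while the lenient threshold recovers every positive at the cost of one false positive.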


**F1-Score**

This metric is the harmonic mean of precision and recall, F1 = 2 × (precision × recall) / (precision + recall), and rewards models that balance the two. Because it is a single number, the F1-score is convenient to optimise instead of tracking precision and recall separately.
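
A short sketch of the harmonic mean, using made-up precision and recall values (`f1_from` is a hypothetical helper). It shows that a lopsided pair scores well below its arithmetic mean:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_from(0.5, 0.5)    # precision and recall agree
lopsided = f1_from(0.9, 0.1)    # high precision, poor recall
arithmetic_mean = (0.9 + 0.1) / 2

print(balanced, lopsided, arithmetic_mean)
```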



**ROC Curve**

The Receiver Operating Characteristic (ROC) curve shows the performance of a classification model across different decision thresholds. Recall, also known as the True Positive Rate (TPR), is plotted on the y-axis against the False Positive Rate (FPR) on the x-axis.
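
The points of a ROC curve can be computed by sweeping the threshold from high to low. This is a plain-Python sketch on hypothetical scores; `roc_points` is an illustrative helper and not how any particular library implements it:

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, from high to low."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # A threshold above every score predicts nothing as positive
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Toy labels (1 = positive) and predicted scores (illustrative only)
y_true_toy = [0, 0, 1, 0, 1, 1]
scores_toy = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
roc = roc_points(y_true_toy, scores_toy)

for fpr, tpr in roc:
    print('FPR={:.2f} TPR={:.2f}'.format(fpr, tpr))
```

The curve starts at (0, 0), ends at (1, 1), and both rates only grow as the threshold is lowered.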

The code examples below do not represent the best results obtainable with the model; hyperparameter tuning can be performed to improve it.

In [60]:
#confusion matrix
from sklearn.metrics import recall_score, accuracy_score, precision_score, f1_score, confusion_matrix
new_predictions = log_reg.predict(normalised_test_df)
cnf_mat = confusion_matrix(y_true=y_test, y_pred=new_predictions, labels=['2A', '3A'])
cnf_mat

array([[64,  8],
       [92, 13]], dtype=int64)

In [61]:
#accuracy
accuracy = accuracy_score(y_true=y_test, y_pred=new_predictions)
print('Accuracy: {}'.format(round(accuracy*100, 2)))

Accuracy: 43.5


In [62]:
#precision
precision = precision_score(y_true=y_test, y_pred=new_predictions, pos_label='2A')
print('Precision: {}'.format(round(precision*100, 2)))

Precision: 41.03


In [63]:
#recall
recall = recall_score(y_true=y_test, y_pred=new_predictions, pos_label='2A')
print('Recall: {}'.format(round(recall*100, 2)))

Recall: 88.89


In [64]:
#f1-score
f1 = f1_score(y_true=y_test, y_pred=new_predictions, pos_label='2A')
print('F1: {}'.format(round(f1*100, 2)))

F1: 56.14


In [65]:
#cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(log_reg, normalised_train_df, y_balanced, cv=5, scoring='f1_macro')
scores

array([0.48442761, 0.4791037 , 0.51702509, 0.46126937, 0.55027117])

In [66]:
#K-Fold cross validation
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
f1_scores = []
#run for every split
for train_index, val_index in kf.split(normalised_train_df):
    #fold-specific names so the held-out y_test used in In [60] is not overwritten
    x_fold_train, x_fold_val = normalised_train_df.iloc[train_index], normalised_train_df.iloc[val_index]
    y_fold_train, y_fold_val = y_balanced[train_index], y_balanced[val_index]
    #fit and score inside the loop so that every fold contributes a result
    model = LogisticRegression().fit(x_fold_train, y_fold_train)
    f1_scores.append(f1_score(y_true=y_fold_val, y_pred=model.predict(x_fold_val), pos_label='2A')*100)

In [67]:
#StratifiedKFold cross validation
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
f1_scores = []
#run for every split
for train_index, val_index in skf.split(normalised_train_df, y_balanced):
    #fold-specific names so the held-out y_test used in In [60] is not overwritten
    x_fold_train, x_fold_val = np.array(normalised_train_df)[train_index], np.array(normalised_train_df)[val_index]
    y_fold_train, y_fold_val = y_balanced[train_index], y_balanced[val_index]
    #fit and score inside the loop so that every fold contributes a result
    model = LogisticRegression().fit(x_fold_train, y_fold_train)
    #save result to list
    f1_scores.append(f1_score(y_true=y_fold_val, y_pred=model.predict(x_fold_val), pos_label='2A'))

In [68]:
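#leave one out cross validation
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
#each fold holds a single instance, so every per-fold score is either 0 (wrong) or 1 (right)
scores = cross_val_score(LogisticRegression(), normalised_train_df, y_balanced, cv=loo, 
                         scoring='f1_macro')
#the mean of the 0/1 scores is the percentage of instances predicted correctly
average_score = scores.mean() * 100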

In [69]:
scores

array([0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 1., 1., 0.,
       0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 0., 1.,
       1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 1., 1., 0., 0., 1., 0., 0., 1., 0., 1., 0., 1.,
       0., 0., 1., 1., 0., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0.,
       0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 1., 0., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 1., 1., 0.,
       1., 1., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 1., 1., 1., 0., 0.,
       0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 1., 0., 1., 1., 0., 1., 1., 0., 0., 0., 1., 0., 1., 1., 0., 1.,
       0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 0., 1., 1., 0.,
       0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 1., 0., 0., 0.

In [70]:
average_score

38.36734693877551

**Multiclass Classification**

**Multilabel and Multiclass classification**

Multiclass classification deals with more than two classes, where each instance is assigned to exactly one class. For example, given a dataset whose features describe the weather and whose classes are sunny, rainy and windy, a multiclass classifier returns a single class for each instance. In contrast, multilabel classification assigns an instance a set of target labels. Articles and movies are typical examples: an article can cover a single topic, but it can also be about politics, religion and education at once, while movies are commonly tagged with multiple genres such as comedy, adventure and action.

**The Sigmoid and the Softmax function**

The softmax function is closely related to the sigmoid function explained earlier. It is used for multiclass classification because it produces a probability for every class such that the probabilities sum to 1. This means that an increase in the probability of one class causes a decrease in the probability of at least one other class. Softmax can be seen as a generalization of the sigmoid function (and softmax regression as a generalization of logistic regression) to more than two classes. The sigmoid function, applied independently to each label, is used in multilabel classification, so its outputs need not sum to 1. The softmax function is popularly used in the output layers of neural networks.
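
The difference can be checked numerically. The sketch below applies both functions to the same hypothetical scores (`logits` is an illustrative name): the softmax outputs sum to 1, the independent sigmoid outputs do not, and for two classes softmax reduces to the sigmoid:

```python
import math

def sigmoid(z):
    """Map a single score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Map a list of scores to probabilities that sum to 1."""
    shifted = [z - max(zs) for z in zs]  # subtract the max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Toy scores for three classes or labels (illustrative only)
logits = [2.0, 1.0, 0.1]
softmax_probs = softmax(logits)                 # multiclass: one class wins
sigmoid_probs = [sigmoid(z) for z in logits]    # multilabel: labels are independent

print(sum(softmax_probs), sum(sigmoid_probs))

# With two classes and scores [z, 0], softmax gives the same value as sigmoid(z)
two_class = softmax([1.5, 0.0])[0]
```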

**Tree-Based Methods and The Support Vector Machine**

**Linear and non-linear Support Vector Machine**

***Support Vector Machine (SVM)*** is a supervised machine learning algorithm that is used to solve both classification and regression tasks. In classification, the algorithm separates classes with a line or hyperplane, positioned using the data points of each class that lie closest to the boundary (the support vectors). For clarity, a hyperplane is the generalization of a straight line to higher dimensions: a line in two dimensions, a plane in three, and so on. Although several hyperplanes may separate the classes, the optimal one, which has the maximum distance (margin) between itself and the support vectors, is chosen.


As we know, data is not always linearly separable, so a straight line might not adequately separate the classes. Although SVM is a linear classifier, it can classify a non-linear dataset by working in a higher-dimensional feature space where the data becomes linearly separable. This is done using the kernel trick: a kernel function computes the similarity between pairs of data points as if they had been mapped to the higher-dimensional space, without computing that mapping explicitly.
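
As a hedged illustration, the sketch below compares a linear and an RBF-kernel SVM on synthetic concentric rings generated with scikit-learn's `make_circles`. The dataset, its parameters and the variable names are assumptions chosen for this example; they are unrelated to the footprint data:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate them (synthetic data)
X_rings, y_rings = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_rings, y_rings, test_size=0.25, random_state=0, stratify=y_rings)

# Linear kernel: the decision boundary is a straight line in the original space
linear_acc = SVC(kernel='linear').fit(X_tr, y_tr).score(X_te, y_te)
# RBF kernel: the boundary is linear only in the implicit feature space
rbf_acc = SVC(kernel='rbf').fit(X_tr, y_tr).score(X_te, y_te)

print('linear: {:.2f}  rbf: {:.2f}'.format(linear_acc, rbf_acc))
```

On data like this, the RBF kernel should separate the rings almost perfectly, while the linear kernel is expected to perform close to chance.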

**Decision Trees and CART algorithm**

The decision tree is a widely used non-parametric supervised machine learning approach that splits instances in a dataset based on decision rules inferred from the features in the dataset. It is a tree-based algorithm with nodes that each represent a specific attribute or decision rule: for an instance, a question is asked at a node, and the possible answers lead along the edges to its child nodes. This is a sequential process that recursively partitions the data on different features until a leaf of the tree provides the final output or class for that instance. Decision trees can also be used to solve regression problems.


**ID3 - Iterative Dichotomiser 3, CART - Classification and Regression Trees, and C4.5** are some examples of decision tree algorithms. In this section, we only discuss the CART algorithm. The CART predictive model generates decision rules that have a binary tree representation, so each non-terminal node has exactly two child nodes, as opposed to some other tree-based methods that allow more. It supports both categorical target variables (classification) and numerical target variables (regression). At every node, the split that best satisfies the splitting criterion is chosen. For classification, CART uses the Gini impurity index as the splitting criterion and picks the split that reduces impurity the most.

**Gini Impurity**: this is a measure of the chance that a randomly selected instance would be wrongly classified if it were labelled at random according to the class distribution. For C classes in a dataset, with p(i) as the probability that the chosen instance belongs to class i, the gini impurity index G can be calculated as:

$$G = \sum_{i=1}^{C} p(i)\,\bigl(1 - p(i)\bigr) = 1 - \sum_{i=1}^{C} p(i)^2$$

***Gini impurity*** index values range from 0 up to (but not reaching) 1: 0 means a pure node where all instances belong to the same class, while the maximum for C classes, 1 - 1/C (0.5 for two classes), occurs when the instances are evenly distributed across the classes. To select the best split, the gini gain is calculated by subtracting the weighted sum of the child nodes' gini impurity from the impurity of the parent node. A higher gini gain means a better split; simply put, the lower the weighted gini impurity after the split, the better.
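
The calculation can be sketched in a few lines, assuming the standard definition G = 1 - Σ p(i)². The helpers `gini_impurity` and `gini_gain` and the toy node contents are hypothetical, introduced only for illustration:

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 for an empty or pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

# Toy node with two evenly mixed classes (illustrative only)
parent = ['2A'] * 5 + ['3A'] * 5

# A split that separates the classes completely
perfect_split = gini_gain(parent, ['2A'] * 5, ['3A'] * 5)
# A split that leaves both children almost as mixed as the parent
poor_split = gini_gain(parent, ['2A'] * 3 + ['3A'] * 2, ['2A'] * 2 + ['3A'] * 3)

print(gini_impurity(parent), perfect_split, poor_split)
```

The perfect split recovers the full parent impurity as gain, whereas the poor split gains almost nothing, so CART would prefer the former.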

**Overfitting in Decision Trees, Early Stopping and Pruning**

The recursive partitioning of nodes until the final subsets are obtained makes decision trees prone to overfitting. The deeper the tree, the higher the chance of overfitting. This can be prevented using early stopping or pruning. Early stopping, or pre-pruning, involves stopping the tree-building process before the tree becomes too complex and the training data is perfectly classified. An early stopping condition such as a maximum depth can be set so that the tree stops growing once it reaches that depth. Another early stopping criterion is the classification error: the error is checked at every splitting stage and, if it does not decrease significantly, there is no need to make the tree more complex. Early stopping can also take place when a node contains fewer data points than a set threshold. Note that early stopping may produce underfit models if it stops too early. Post-pruning, on the other hand, allows the tree to be fully built and then simplifies it by removing sections of the tree at different levels, based on the error rate.
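
A minimal sketch of both ideas with scikit-learn, on synthetic data from `make_classification` (the data and the parameter values are assumptions picked for illustration). Pre-pruning is expressed through `max_depth` and `min_samples_leaf`. For post-pruning, scikit-learn offers cost-complexity pruning via `ccp_alpha`, which is a different technique from the error-rate pruning described above but serves the same purpose:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data so that an unrestricted tree grows deep (illustrative only)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, n_informative=4,
                                   flip_y=0.1, random_state=0)

# No restrictions: the tree grows until the training data is perfectly classified
full_tree = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)

# Early stopping (pre-pruning): cap the depth and require a minimum leaf size
shallow_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                      random_state=0).fit(X_syn, y_syn)

# Post-pruning: grow the tree fully, then prune with cost-complexity pruning
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_syn, y_syn)

for name, tree in [('full', full_tree), ('shallow', shallow_tree), ('pruned', pruned_tree)]:
    print('{}: depth={} leaves={}'.format(name, tree.get_depth(), tree.get_n_leaves()))
```

Suitable values for `max_depth`, `min_samples_leaf` and `ccp_alpha` depend on the data and are usually chosen with cross-validation.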



In [71]:
from sklearn.tree import DecisionTreeClassifier
dec_tree = DecisionTreeClassifier()
dec_tree.fit(normalised_train_df, y_balanced)

DecisionTreeClassifier()