# FEAUTRES SELECTION

In this section we are going to learn how to select and extract important characteristics of a dataset. The difference of extraction and selection is that when you extract characteristics you reduce the input characteristics and the characteristics that remain are different than the original characteristics, you change them. Characteristics extraction is also known as "dimensional reduction algorithms" and the most common method is PCA.To reduce the characteristics is really useful and important because it improves the performance of the model and the quality of the dataset.
We are going to use the Pumpkin seed dataset (https://www.kaggle.com/datasets/muratkokludataset/pumpkin-seeds-dataset) as an example. 

In [2]:
import pandas as pd
import os
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import f1_score

In [3]:
def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat)
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat)
    return (train_set, val_set, test_set)

In [4]:
def remove_labels(df, label_name):
    X = df.drop(label_name, axis=1)
    y = df[label_name].copy()
    return (X, y)

In [5]:
path = os.getcwd() + '\data\Pumpkin_Seeds_Dataset.xlsx'
df = pd.read_excel(path, header=0, names=None)
df.head()

Unnamed: 0,Area,Perimeter,Major_Axis_Length,Minor_Axis_Length,Convex_Area,Equiv_Diameter,Eccentricity,Solidity,Extent,Roundness,Aspect_Ration,Compactness,Class
0,56276,888.242,326.1485,220.2388,56831,267.6805,0.7376,0.9902,0.7453,0.8963,1.4809,0.8207,Çerçevelik
1,76631,1068.146,417.1932,234.2289,77280,312.3614,0.8275,0.9916,0.7151,0.844,1.7811,0.7487,Çerçevelik
2,71623,1082.987,435.8328,211.0457,72663,301.9822,0.8749,0.9857,0.74,0.7674,2.0651,0.6929,Çerçevelik
3,66458,992.051,381.5638,222.5322,67118,290.8899,0.8123,0.9902,0.7396,0.8486,1.7146,0.7624,Çerçevelik
4,66107,998.146,383.8883,220.4545,67117,290.1207,0.8187,0.985,0.6752,0.8338,1.7413,0.7557,Çerçevelik


In [6]:
train_set, val_set, test_set = train_val_test_split(df) #Let's divide the dataset

In [7]:
X_train, y_train = remove_labels(train_set, 'Class')
X_val, y_val = remove_labels(val_set, 'Class')
X_test, y_test = remove_labels(test_set, 'Class')

In [9]:
from sklearn.ensemble import RandomForestClassifier  #We apply RANDOM FOREST

clf_rnd = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
clf_rnd.fit(X_train, y_train)

RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=42)

In [11]:
y_pred = clf_rnd.predict(X_val) #we predict with the validation set
print("F1 score:", f1_score(y_pred, y_val, average='weighted'))

F1 score: 0.8843157690064165


In [12]:
#Now we are going to see what characteristics are the most important in the dataset.
clf_rnd.feature_importances_

array([0.0309045 , 0.03681118, 0.07160577, 0.0379822 , 0.02830559,
       0.02774625, 0.11750758, 0.04649365, 0.04254706, 0.15791249,
       0.22390359, 0.17828014])

In [13]:
feature_importances = {name: score for name, score in zip(list(df), clf_rnd.feature_importances_)} #We extract those characteristics

In [14]:
feature_importances_sorted = pd.Series(feature_importances).sort_values(ascending=False)
feature_importances_sorted.head(20) #We sort the characteristics from the most important to the least

Aspect_Ration        0.223904
Compactness          0.178280
Roundness            0.157912
Eccentricity         0.117508
Major_Axis_Length    0.071606
Solidity             0.046494
Extent               0.042547
Minor_Axis_Length    0.037982
Perimeter            0.036811
Area                 0.030905
Convex_Area          0.028306
Equiv_Diameter       0.027746
dtype: float64

In [19]:
# We extract the 5 characteristics with more relevance for the algorithm 
columns = list(feature_importances_sorted.head(5).index)
columns

['Aspect_Ration',
 'Compactness',
 'Roundness',
 'Eccentricity',
 'Major_Axis_Length']

In [20]:
X_train_reduced = X_train[columns].copy()
X_val_reduced = X_val[columns].copy()

In [21]:
X_train_reduced.head(5)

Unnamed: 0,Aspect_Ration,Compactness,Roundness,Eccentricity,Major_Axis_Length
348,1.7612,0.7504,0.832,0.8232,396.6914
1089,1.8641,0.7306,0.8125,0.8439,414.4705
1850,2.9151,0.5828,0.6593,0.9393,505.1026
300,2.1326,0.6824,0.7727,0.8832,465.2679
1658,2.5135,0.6291,0.717,0.9174,517.9383


In [22]:
from sklearn.ensemble import RandomForestClassifier #We apply random forest to the reduced data!

clf_rnd = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
clf_rnd.fit(X_train_reduced, y_train)

RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=42)

In [23]:
y_pred = clf_rnd.predict(X_val_reduced) #We predict with the validation set

In [24]:
print("F1 score:", f1_score(y_pred, y_val, average='weighted'))

F1 score: 0.8507297860838673


**Conclusion:** The f1 score of the prediction with the reduced dataset is a bit less than with all the characteristics, meaning that the algorithm is a bit worse. However, with less characteristics the process is computationally faster, so we enter a debate of what is better, to loose a bit of precision but win time or the other way round. In my opinion, this will depend on the situation and the difference between both f1.