## Digital Agriculture Homework 3

### Step Forward Feature Selection: A Practical Example in Python
#### Implementing Feature Selection and Building a Model
https://github.com/ChrisKuet/Digital_Ag

##### 1. Importing libraries and renaming them

In [1]:
# Importing libraries and renaming them
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score as acc
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
import os

##### 2. Specifying the data path and reading the data

In [2]:
# Read data
# os.path.join avoids the invalid "\w" escape and works on any OS
npath = os.path.join(os.path.abspath(os.pardir), "Data", "winequality-white.csv")
df1=pd.read_csv(npath,sep=';')

##### 3. Computing summary statistics and saving them as a CSV file

In [3]:
df2=df1.describe()
# os.path.join avoids the invalid "\C" escape and works on any OS
dpath = os.path.join(os.path.abspath(os.pardir), "Results", "CK_Description_02_16_2023.csv")
df2.to_csv(dpath)

##### 4. Splitting the data into Training and Test data (75%/25%)

In [4]:
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df1.values[:,:-1],
    df1.values[:,-1:],
    test_size=0.25,
    random_state=42)

y_train = y_train.ravel()
y_test = y_test.ravel()

##### 5. Printing the dimensions of the Training and Test data

In [5]:
print('Training dataset shape:', X_train.shape, y_train.shape)
print('Testing dataset shape:', X_test.shape, y_test.shape)

Training dataset shape: (3673, 11) (3673,)
Testing dataset shape: (1225, 11) (1225,)
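
The shapes follow from how `train_test_split` handles a fractional `test_size`: scikit-learn takes the ceiling of `test_size * n_samples` for the test set and gives the remaining rows to the training set. The short check below reproduces that arithmetic; the total row count is inferred from the printed shapes rather than read from the file.

```python
import math

n_samples = 3673 + 1225   # total rows, inferred from the printed shapes
test_size = 0.25

n_test = math.ceil(test_size * n_samples)   # 1224.5 is rounded up
n_train = n_samples - n_test                # the rest goes to training

print(n_samples, n_train, n_test)   # 4898 3673 1225
```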


##### 6. Building a Random Forest classifier to use in feature selection (Best 5 features)

In [6]:
# Build RF classifier to use in feature selection
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)

# Build step forward feature selection for 5 features
sfs1 = sfs(clf,
           k_features=5, #subset of features to be selected
           forward=True,
           floating=False,
           verbose=2,
           scoring='accuracy', #metric for selecting best subset
           cv=5) #5-fold cross validation

# Perform SFS (floating=False, so this is plain step forward selection, not SFFS)
sfs1 = sfs1.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   23.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  11 out of  11 | elapsed:  1.3min finished

[2023-02-17 14:32:58] Features: 1/5 -- score: 0.4935958034439934[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    7.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  1.4min finished

[2023-02-17 14:34:22] Features: 2/5 -- score: 0.5417972529611299[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    9.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  1.5min finished

[2023-02-17 14:35:51] Features: 3/5 -- score: 0.6082196148214054[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 
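
The log shows 11 candidates scored in the first round, 10 in the second and 9 in the third, which is the signature of a greedy search: each round tries every remaining feature and keeps the one that improves the score most. The sketch below illustrates that loop in plain Python. It is not mlxtend's implementation, and `toy_score`, `GAIN` and `REDUNDANT_PAIR` are made-up stand-ins for the cross-validated accuracy used in the notebook.

```python
def forward_select(n_features, k, score):
    """Greedy step forward selection: add the single best feature each round."""
    selected, remaining = [], list(range(n_features))
    evaluations = []   # number of candidate subsets scored in each round
    while len(selected) < k:
        # Score every subset formed by adding one remaining feature.
        best = max(remaining, key=lambda f: score(selected + [f]))
        evaluations.append(len(remaining))
        selected.append(best)
        remaining.remove(best)
    return sorted(selected), evaluations


# Hypothetical scoring function standing in for cross-validated accuracy.
# Scores are integers (percentage points) so the arithmetic stays exact.
GAIN = [10, 50, 20, 40, 5]   # made-up individual usefulness of 5 features
REDUNDANT_PAIR = {1, 3}      # made-up pair carrying overlapping information


def toy_score(subset):
    total = sum(GAIN[f] for f in subset)
    if REDUNDANT_PAIR <= set(subset):
        total -= 35          # penalty when both redundant features are present
    return total


chosen, evaluations = forward_select(n_features=5, k=3, score=toy_score)
print(chosen)        # [0, 1, 2]
print(evaluations)   # [5, 4, 3]
```

In this toy setup feature 3 has the second-highest individual gain but is never chosen, because it adds little once feature 1 is in the subset. Scoring subsets rather than single features is what lets the search account for that kind of redundancy.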

##### 7. Checking the 5 features selected

In [7]:
# Which features?
feat_cols_5 = list(sfs1.k_feature_idx_)
print(feat_cols_5)

[1, 3, 6, 7, 10]


The output lists the column indices of the 5 features selected.
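
Indices are hard to interpret on their own, so it can help to map them back to column names. The sketch below assumes the standard column order of the UCI white wine quality file; that list is not shown in the notebook and should be checked against `df1.columns`.

```python
# Assumed column order of winequality-white.csv (verify against df1.columns).
columns = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]

feat_cols_5 = [1, 3, 6, 7, 10]   # indices printed above
feat_names_5 = [columns[i] for i in feat_cols_5]
print(feat_names_5)
```

Inside the notebook, `df1.columns[feat_cols_5]` should give the same names without hard-coding the list.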

##### 8. Building a Random Forest model using the 5 selected features

In [8]:
# Build full model with the 5 selected features
clf = RandomForestClassifier(n_estimators=1000, random_state=42, max_depth=4)
clf.fit(X_train[:, feat_cols_5], y_train)

y_train_pred = clf.predict(X_train[:, feat_cols_5])
print('Training accuracy on the 5 selected features: %.3f' % acc(y_train, y_train_pred))

y_test_pred = clf.predict(X_test[:, feat_cols_5])
print('Testing accuracy on the 5 selected features: %.3f' % acc(y_test, y_test_pred))

Training accuracy on the 5 selected features: 0.559
Testing accuracy on the 5 selected features: 0.512


##### 9. Building a Random Forest classifier to use in feature selection (Best 6 features)

In [9]:
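# Build step forward feature selection for 6 features
# Note: clf here is the 1000-tree, max_depth=4 forest defined in step 8,
# not the 100-tree forest used for the 5-feature search in step 6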
sfs2 = sfs(clf,
           k_features=6, #subset of features to be selected
           forward=True,
           floating=False,
           verbose=2,
           scoring='accuracy', #metric for selecting best subset
           cv=5) #5-fold cross validation

# Perform SFS (floating=False, so this is plain step forward selection, not SFFS)
sfs2 = sfs2.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   56.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  11 out of  11 | elapsed:  7.6min finished

[2023-02-17 14:46:34] Features: 1/6 -- score: 0.5004051975013438[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   28.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  6.6min finished

[2023-02-17 14:53:10] Features: 2/6 -- score: 0.53879293406736[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   33.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  5.1min finished

[2023-02-17 14:58:14] Features: 3/6 -- score: 0.5393408589593875[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 ou

##### 10. Checking the 6 selected features

In [10]:
# Which features?
feat_cols_6 = list(sfs2.k_feature_idx_)
print(feat_cols_6)

[0, 1, 3, 4, 5, 10]


The output lists the column indices of the 6 features selected.
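
The 6-feature subset is not simply the 5-feature subset plus one more column. A likely explanation, offered here as an assumption rather than something the notebook verifies, is that the two searches scored candidates with different estimators: step 6 builds an unseeded 100-tree forest, while step 9 reuses the `clf` defined in step 8. The comparison below only makes the overlap explicit.

```python
feat_cols_5 = [1, 3, 6, 7, 10]       # from step 7
feat_cols_6 = [0, 1, 3, 4, 5, 10]    # from step 10

shared = sorted(set(feat_cols_5) & set(feat_cols_6))
only_in_5 = sorted(set(feat_cols_5) - set(feat_cols_6))
only_in_6 = sorted(set(feat_cols_6) - set(feat_cols_5))

print("Shared:", shared)          # Shared: [1, 3, 10]
print("Only in 5:", only_in_5)    # Only in 5: [6, 7]
print("Only in 6:", only_in_6)    # Only in 6: [0, 4, 5]
```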

##### 11. Building a Random Forest model using the 6 selected features

In [11]:
# Build full model with the 6 selected features
clf = RandomForestClassifier(n_estimators=1000, random_state=42, max_depth=4)
clf.fit(X_train[:, feat_cols_6], y_train)

y_train_pred = clf.predict(X_train[:, feat_cols_6])
print('Training accuracy on the 6 selected features: %.3f' % acc(y_train, y_train_pred))

y_test_pred = clf.predict(X_test[:, feat_cols_6])
print('Testing accuracy on the 6 selected features: %.3f' % acc(y_test, y_test_pred))

Training accuracy on the 6 selected features: 0.563
Testing accuracy on the 6 selected features: 0.515


##### 12. Building a Random Forest model with all the features

In [12]:
# Build full model on ALL features, for comparison
clf = RandomForestClassifier(n_estimators=1000, random_state=42, max_depth=4)
clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)
print('Training accuracy on all features: %.3f' % acc(y_train, y_train_pred))

y_test_pred = clf.predict(X_test)
print('Testing accuracy on all features: %.3f' % acc(y_test, y_test_pred))

Training accuracy on all features: 0.566
Testing accuracy on all features: 0.509


##### 13. Summary of Results

In [13]:
# Accuracies copied by hand from the outputs of steps 8, 11 and 12
SummaryMetrics = pd.DataFrame(
    {'Train Accuracy': [0.559, 0.563, 0.566],
     'Test Accuracy': [0.512, 0.515, 0.509]},
    index=['Best 5 features', 'Best 6 features', 'All the features'])

print(SummaryMetrics)

                  Train Accuracy  Test Accuracy
Best 5 features            0.559          0.512
Best 6 features            0.563          0.515
All the features           0.566          0.509
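
To put the table in perspective, the sketch below computes the train/test gap for each model and converts the spread in test accuracy into an approximate number of test samples. The accuracies are the rounded values from the table and the test-set size comes from step 5, so the sample count is only approximate.

```python
results = {
    "Best 5 features": (0.559, 0.512),    # (train accuracy, test accuracy)
    "Best 6 features": (0.563, 0.515),
    "All the features": (0.566, 0.509),
}
n_test = 1225   # test-set size printed in step 5

# Gap between training and test accuracy for each model.
gaps = {name: round(train - test, 3) for name, (train, test) in results.items()}
for name, gap in gaps.items():
    print(f"{name}: train/test gap = {gap:.3f}")

# Spread between the best and worst test accuracy, expressed in samples.
test_accs = [test for _, test in results.values()]
spread = round(max(test_accs) - min(test_accs), 3)
spread_samples = round(spread * n_test)
print(f"Test accuracy spread: {spread:.3f} (about {spread_samples} samples)")
```

The three test accuracies differ by roughly 7 of the 1,225 test samples, so on this single split the models are hard to tell apart, although the all-features model shows the largest train/test gap.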
