# Lab 7: Feature Selection

Feature selection is the process of choosing which features (aka 'variables', 'attributes', 'predictors',
'columns', 'independent variables') to include in our models. This is useful in situations where there
are many variables to choose from, which can run into a problem known as the "curse of dimensionality".
Including too many predictors when training can lead to overfitting. Selecting a smaller set of features
can also make models computationally cheaper to train and to predict with because there is less input.

There are many ways to manage this process. We could do it manually (as we have thus far this semester).
In this lab we will cover common automated techniques instead.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split 
from sklearn import tree
from sklearn import ensemble
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

# Data
We will use the KDDCup Network Attack dataset from Labs 2 and 3.

In [3]:
# load txt file
names = pd.read_csv('data/kddcup.names', header=None, delimiter=':',skiprows=1)

# make column 0 into a list
name_list = names[0].tolist()

# add the last column with type
name_list.append('type')

In [4]:
netattacks = pd.read_csv('data/kddcup.data_10_percent_corrected', names=name_list, header=None, index_col=None)

# use a 0 (normal) or 1 (malicious) to code bad traffic
netattacks['label'] = np.where(netattacks['type'] == 'normal.', 0, 1)

netattacks = netattacks.select_dtypes(include=np.number)

# train-test split
train, test = train_test_split(netattacks, test_size=0.25)


## Start with all predictors

In [5]:
# get columns not label
pred_vars = list(netattacks.columns)

# remove 'label' because it is what we are trying to predict
pred_vars.remove('label')

I will also split the predictors and label column. This makes some of the 
later tasks a bit easier. 

In [6]:
train_X = train[pred_vars]
train_y = train['label']

# Get importance from a classifier
Here we will fit a Random Forest and a Decision Tree with all of the predictors. We can then view the
feature importance scores that each model assigns to these predictors.

In [9]:
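from sklearn.ensemble import RandomForestClassifier
from sklearn import tree

# fit a random forest with default settings
clf = RandomForestClassifier()
clf.fit(train_X, train_y)

# fit a single decision tree for comparison
dtree = tree.DecisionTreeClassifier(criterion='entropy', max_depth=10)
dtree.fit(train_X, train_y)

# one importance score per predictor, in the same order as train_X.columns
print(clf.feature_importances_)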

[6.50654905e-03 7.18201449e-02 1.37746503e-01 5.00904049e-06
 3.43298124e-03 7.59809640e-06 6.76054617e-03 8.91519566e-05
 9.26911296e-02 6.67938382e-03 7.41321529e-05 5.96063753e-06
 1.48605335e-04 6.22543729e-05 2.11783064e-05 5.44338407e-05
 0.00000000e+00 0.00000000e+00 6.95411256e-04 2.87819167e-01
 5.46755974e-02 1.51299541e-02 3.45867298e-03 1.73837340e-03
 3.38186564e-03 3.70607196e-02 2.14457430e-02 1.81897944e-02
 7.82233685e-02 1.27080088e-02 1.86218909e-02 1.62729057e-02
 3.58510813e-02 4.45475866e-02 7.24452220e-03 3.52635845e-03
 5.48956340e-03 7.81385389e-03]


In [10]:
print(dtree.feature_importances_)


[1.02065500e-03 3.80799500e-02 3.97530487e-02 0.00000000e+00
 1.49961625e-03 3.13057654e-05 3.40805814e-02 0.00000000e+00
 1.07292019e-03 0.00000000e+00 0.00000000e+00 0.00000000e+00
 3.87652753e-05 3.62216714e-04 0.00000000e+00 6.76828085e-05
 0.00000000e+00 0.00000000e+00 0.00000000e+00 8.46168868e-01
 2.77575716e-05 8.14449005e-05 8.47681195e-05 7.72429956e-05
 0.00000000e+00 0.00000000e+00 5.69658657e-05 0.00000000e+00
 2.06883934e-02 1.61337387e-03 3.66627040e-03 1.25142039e-04
 1.45280587e-03 1.28452691e-03 5.26213619e-04 7.11784796e-05
 2.22222983e-04 7.84608323e-03]


In [12]:
dfeature_importances = pd.DataFrame(dtree.feature_importances_, index =train_X.columns,  columns=['importance']).sort_values('importance', ascending=False)
display(dfeature_importances)

feature,importance
count,0.846169
dst_bytes,0.039753
src_bytes,0.03808
hot,0.034081
dst_host_count,0.020688
dst_host_srv_rerror_rate,0.007846
dst_host_same_srv_rate,0.003666
dst_host_srv_count,0.001613
wrong_fragment,0.0015
dst_host_same_src_port_rate,0.001453


In [13]:
feature_importances = pd.DataFrame(clf.feature_importances_, index =train_X.columns,  columns=['importance']).sort_values('importance', ascending=False)
display(feature_importances)



feature,importance
count,0.287819
dst_bytes,0.137747
logged_in,0.092691
dst_host_count,0.078223
src_bytes,0.07182
srv_count,0.054676
dst_host_srv_diff_host_rate,0.044548
same_srv_rate,0.037061
dst_host_same_src_port_rate,0.035851
diff_srv_rate,0.021446


## Select Best Features
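`SelectFromModel` uses the feature importance scores shown above to select a subset of features. Here we
set the cutoff threshold to 0.01, so features with an importance of at least 0.01 are kept. You can raise or
lower the threshold to make the selection more or less permissive. If no threshold is given, scikit-learn
uses the mean importance for models like this one.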

In [None]:
from sklearn.feature_selection import SelectFromModel
model = SelectFromModel(clf, prefit=True, threshold=0.01)

## Get dataframe with reduced columns

In [None]:
support = model.get_support()
X_reduced = train_X.iloc[:, support]
X_reduced.shape
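
To make the thresholding concrete, here is a minimal sketch on synthetic data (not the KDD Cup data). It suggests that, for a prefit forest, the mask returned by `get_support()` matches a direct comparison of `feature_importances_` against the threshold. The names `demo_X`, `demo_y`, `demo_forest`, and the 0.05 threshold are illustrative assumptions only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# synthetic stand-in data: 10 features, only 3 of them informative
demo_X, demo_y = make_classification(n_samples=500, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

demo_forest = RandomForestClassifier(n_estimators=50, random_state=0)
demo_forest.fit(demo_X, demo_y)

# keep features whose importance is at least 0.05 (arbitrary cutoff)
selector = SelectFromModel(demo_forest, prefit=True, threshold=0.05)
mask = selector.get_support()

# the same mask computed by hand from the importance scores
manual_mask = demo_forest.feature_importances_ >= 0.05

print(mask)
print(np.array_equal(mask, manual_mask))
```

Raising the threshold keeps fewer columns and lowering it keeps more, which is what "more or less permissive" means in practice.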

# Training

In [None]:
# fit with all columns
rf = RandomForestClassifier()
rf.fit(train_X, train_y)


We will also train a model with the data selected during feature selection.

In [None]:
# fit with reduced set of columns
rf_reduced = RandomForestClassifier()
rf_reduced.fit(X_reduced, train_y)

# Evaluation

The evaluation script has been modified slightly from prior weeks. It now contains two lists: one
has the classifiers and the other has the matching test sets. This is required because 
feature selection removes columns. The `sklearn` models require the data passed to the `predict` functions 
to contain exactly the same columns as the data used to train the model. 

In [None]:
test_X = test[pred_vars]
test_X_reduced = test_X.iloc[:, support]
test_y = test['label']
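
The column requirement can be seen on a small synthetic example. This is only a sketch: `toy_X`, `toy_y`, `toy_model`, and the choice of kept columns are made up for illustration and are not part of the lab data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data with six columns
toy_X, toy_y = make_classification(n_samples=200, n_features=6,
                                   n_informative=3, random_state=0)

# pretend feature selection kept columns 0, 2 and 4
keep = [0, 2, 4]
toy_model = RandomForestClassifier(n_estimators=20, random_state=0)
toy_model.fit(toy_X[:, keep], toy_y)

# predicting on the same three columns works
preds = toy_model.predict(toy_X[:, keep])

# predicting on all six columns fails the feature-count check
try:
    toy_model.predict(toy_X)
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

print(preds.shape)
print(type(mismatch_error).__name__)
```

This is why the reduced model above is paired with `test_X_reduced` rather than `test_X`.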

## Use test data to get evaluation statistics

In [None]:
# list of our models
fitted = [rf, rf_reduced]

# list of test sets for each
test_sets = [test_X, test_X_reduced]

# collect one dictionary of results per model
rows = []

for fitted_clf, test_ in zip(fitted, test_sets):
    # both models share a class name, so add the column count to tell them apart
    name = '{} ({} columns)'.format(fitted_clf.__class__.__name__, test_.shape[1])
    print(name)

    # get predictions
    yproba = fitted_clf.predict_proba(test_)
    yclass = fitted_clf.predict(test_)

    # auc information
    fpr, tpr, _ = metrics.roc_curve(test_y, yproba[:, 1])
    auc = metrics.roc_auc_score(test_y, yproba[:, 1])

    # log loss
    log_loss = metrics.log_loss(test_y, yproba[:, 1])

    # add some other stats based on confusion matrix
    clf_report = metrics.classification_report(test_y, yclass, digits=5)

    # store the results for this model
    rows.append({'classifier_name': name,
                 'fpr': fpr,
                 'tpr': tpr,
                 'auc': auc,
                 'log_loss': log_loss,
                 'clf_report': clf_report})

# build the table in one step (DataFrame.append was removed in pandas 2.0)
result_table = pd.DataFrame(rows, columns=['classifier_name', 'fpr', 'tpr', 'auc',
                                           'log_loss', 'clf_report'])
#result_table.set_index('classifier_name', inplace=True)

In [None]:
for i in result_table.index:
    print('\n---- statistics for', result_table.loc[i, 'classifier_name'], "----\n")
    print(result_table.loc[i, 'clf_report'])
    print("Model AUC:", result_table.loc[i, 'auc'])
    print("Model log loss:", result_table.loc[i, 'log_loss'])

In [None]:
fig = plt.figure(figsize=(14,12))

for i in result_table.index:
    plt.plot(result_table.loc[i]['fpr'], 
             result_table.loc[i]['tpr'], 
             label="{}, AUC={:.3f}".format(result_table.loc[i]['classifier_name'], result_table.loc[i]['auc']))
    
plt.plot([0,1], [0,1], color='orange', linestyle='--')

plt.xticks(np.arange(0.0, 1.1, step=0.1))
plt.xlabel("False Positive Rate", fontsize=15)

plt.yticks(np.arange(0.0, 1.1, step=0.1))
plt.ylabel("True Positive Rate", fontsize=15)

plt.title('ROC Curve Analysis', fontweight='bold', fontsize=15)
plt.legend(prop={'size':13}, loc='lower right')

plt.show()

# Exercises
1. Use a Decision Tree as the initial classifier (before the `SelectFromModel` cell). How many important features are there?
   Answer: 37
2. Try feature selection using the `SelectKBest()` method
   [documented here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest). Re-run the training and evaluation process. Do the features chosen with this method work better or worse than using
   all variables or those chosen with the `SelectFromModel()` method? 
   Answer: It ran better because SelectFromModel selected the best two features.
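
For exercise 2, a minimal sketch of `SelectKBest` on synthetic data might look like the following. The scoring function `f_classif`, the value `k=5`, and the names `kb_X`, `kb_y`, and `k_best` are assumptions chosen for illustration; the lab does not prescribe them.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# synthetic stand-in data: 12 features, 4 of them informative
kb_X, kb_y = make_classification(n_samples=300, n_features=12,
                                 n_informative=4, random_state=0)

# score each feature with the ANOVA F-test and keep the top k
k_best = SelectKBest(score_func=f_classif, k=5)
kb_X_reduced = k_best.fit_transform(kb_X, kb_y)

# boolean mask, usable with .iloc[:, mask] on a DataFrame as in the lab
kb_support = k_best.get_support()

print(kb_X_reduced.shape)
print(kb_support)
```

In the lab itself the selector would be fit on `train_X` and `train_y`, and the resulting mask applied to both the training and test columns.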

# Extra
The evaluation script is somewhat cumbersome, especially with two lists (one for models and one for test data subsets). Simplify the evaluation script and put it into a function. 
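
One possible shape for such a function is sketched below. It is not the intended solution, just an illustration: the name `evaluate_models`, the dictionary layout, and the synthetic demonstration data are all assumptions. Pairing each model with its own test set in a single dictionary removes the need for two parallel lists.

```python
import pandas as pd
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def evaluate_models(experiments, y_true):
    """Score each (fitted model, test features) pair against y_true.

    `experiments` maps a display name to a (model, test_X) tuple, so each
    model stays paired with the columns it was trained on.
    """
    rows = []
    for name, (model, X) in experiments.items():
        # probability of the positive class (label 1)
        proba = model.predict_proba(X)[:, 1]
        fpr, tpr, _ = metrics.roc_curve(y_true, proba)
        rows.append({
            'classifier_name': name,
            'fpr': fpr,
            'tpr': tpr,
            'auc': metrics.roc_auc_score(y_true, proba),
            'log_loss': metrics.log_loss(y_true, proba),
            'clf_report': metrics.classification_report(
                y_true, model.predict(X), digits=5),
        })
    return pd.DataFrame(rows)


# synthetic demonstration (not the KDD Cup data)
demo_X, demo_y = make_classification(n_samples=400, n_features=8,
                                     n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y,
                                          test_size=0.25, random_state=0)

cols = [0, 1, 2, 3]  # arbitrary subset standing in for selected features
full = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)
reduced = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr[:, cols], y_tr)

demo_results = evaluate_models({'rf_all': (full, X_te),
                                'rf_reduced': (reduced, X_te[:, cols])}, y_te)
print(demo_results[['classifier_name', 'auc', 'log_loss']])
```

The returned table has the same columns as `result_table`, so the printing and ROC plotting cells should work with it unchanged.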