### Feature selection is a data-preprocessing step that keeps a subset of relevant features (variables) from a larger set. Here we eliminate features that are noisy or carry misleading data. This reduces the complexity of the machine learning model and can improve its performance.

### importing necessary libraries:

In [113]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,accuracy_score,recall_score,precision_score,confusion_matrix
import numpy as np

In [114]:
import warnings
warnings.filterwarnings('ignore')

### importing training and validation datasets:

In [115]:
dfTrain = pd.read_csv("data/balancedAppDarknet.csv")
dfValidate = pd.read_csv("data/validation_dataset.csv")

# Dropping useless or duplicate columns:
### Based on domain knowledge, some columns are clearly useless. For example, Timestamp records the time at which the internet traffic occurred, which has nothing to do with the application type of that traffic, so we drop it. The same goes for Flow ID: it contains the source (IP and port) and the destination (IP and port), each of which already has its own column. Since it is duplicate data, we drop it as well.


In [116]:
dfValidate = dfValidate.drop(['Flow ID','Timestamp'],axis=1)

In [117]:
print(dfTrain.shape,dfValidate.shape)

(154109, 82) (23405, 82)


In [118]:
dfTrain=dfTrain.drop_duplicates()


In [119]:
dfTrain["application"].value_counts()

6    20114
4    19590
1    19523
0    19300
5    19179
7    19043
3    18716
2    18644
Name: application, dtype: int64

In [120]:
dfTrain.columns.to_list()

['Src IP',
 'Src Port',
 'Dst IP',
 'Dst Port',
 'Protocol',
 'Flow Duration',
 'Total Fwd Packet',
 'Total Bwd packets',
 'Total Length of Fwd Packet',
 'Total Length of Bwd Packet',
 'Fwd Packet Length Max',
 'Fwd Packet Length Min',
 'Fwd Packet Length Mean',
 'Fwd Packet Length Std',
 'Bwd Packet Length Max',
 'Bwd Packet Length Min',
 'Bwd Packet Length Mean',
 'Bwd Packet Length Std',
 'Flow Bytes/s',
 'Flow Packets/s',
 'Flow IAT Mean',
 'Flow IAT Std',
 'Flow IAT Max',
 'Flow IAT Min',
 'Fwd IAT Total',
 'Fwd IAT Mean',
 'Fwd IAT Std',
 'Fwd IAT Max',
 'Fwd IAT Min',
 'Bwd IAT Total',
 'Bwd IAT Mean',
 'Bwd IAT Std',
 'Bwd IAT Max',
 'Bwd IAT Min',
 'Fwd PSH Flags',
 'Bwd PSH Flags',
 'Fwd URG Flags',
 'Bwd URG Flags',
 'Fwd Header Length',
 'Bwd Header Length',
 'Fwd Packets/s',
 'Bwd Packets/s',
 'Packet Length Min',
 'Packet Length Max',
 'Packet Length Mean',
 'Packet Length Std',
 'Packet Length Variance',
 'FIN Flag Count',
 'SYN Flag Count',
 'RST Flag Count',
 'PSH Flag

### inspecting performance so far:

In [121]:
X_train = dfTrain.drop("application",axis=1)
# Keep the target as a 1-D Series, the shape scikit-learn estimators expect for y
y_train = dfTrain["application"]
X_validate = dfValidate.drop("application",axis=1)
y_validate = dfValidate["application"]

In [122]:
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_validate)
accuracy = accuracy_score(y_validate, y_pred)
print(f"Accuracy before applying feature selection: {accuracy}")
F1_Score = f1_score(y_validate,y_pred,average='weighted')
print(f"F1-score before applying feature selection: {F1_Score}")
Precision_Score = precision_score(y_validate,y_pred,average='weighted')
print(f"precision-score before applying feature selection: {Precision_Score}")
Recall_Score = recall_score(y_validate,y_pred,average='weighted')
print(f"recall-score before applying feature selection: {Recall_Score}")
Confusion_Matrix = confusion_matrix(y_validate,y_pred)
print(f"confusion-matrix before applying feature selection:\n {Confusion_Matrix}")

Accuracy before applying feature selection: 0.896432386242256
F1-score before applying feature selection: 0.8967646886391679
precision-score before applying feature selection: 0.8982620682787179
recall-score before applying feature selection: 0.896432386242256
confusion-matrix before applying feature selection:
 [[3172   24    5    1   38    7  295    7]
 [  13 6396    9    1  163    4  100    0]
 [   5   14 1637  342   31    1   18  177]
 [   5    4  207  940    5    2    5   62]
 [  23  155   18    7 1927    4   63   14]
 [   3   20    0    0    3 4806    1    0]
 [ 252  100    7    1   55    4 1500   12]
 [   1    0   71   45    7    0   13  603]]


# Dropping constant features:
### A constant feature takes the same value in every row, so it cannot help the model distinguish between the classes. Dropping such features reduces computation time and makes the model easier to interpret.
### We count the unique values of every column and drop the columns that have at most one:
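
As a side note, the same filter can be sketched with scikit-learn's `VarianceThreshold`, fitted on the training data only and then applied to both sets. The toy frames and column names below are hypothetical stand-ins, not the notebook's data.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frames standing in for dfTrain / dfValidate.
train = pd.DataFrame({
    "flow_duration": [10, 250, 31, 47],
    "urg_flags": [0, 0, 0, 0],   # constant in the training data
    "packets": [3, 9, 4, 4],
})
validate = pd.DataFrame({
    "flow_duration": [12, 80],
    "urg_flags": [0, 1],
    "packets": [5, 2],
})

# threshold=0.0 removes features whose training variance is zero.
selector = VarianceThreshold(threshold=0.0)
selector.fit(train)

# Keep the same surviving columns in both sets.
kept = train.columns[selector.get_support()].tolist()
train_reduced = train[kept]
validate_reduced = validate[kept]
print(kept)
```

The decision is made from the training set alone, so a column that varies only in the validation set is still removed, which matches what the `nunique` cell does.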

In [123]:
num_unique = dfTrain.nunique()
num_unique

Src IP         11003
Src Port       44518
Dst IP         22984
Dst Port       20486
Protocol          18
               ...  
Idle Mean       5519
Idle Std       39297
Idle Max        4277
Idle Min       13108
application        8
Length: 82, dtype: int64

In [124]:
columns_to_drop = num_unique[num_unique <= 1].index

dfTrain=dfTrain.drop(columns_to_drop, axis=1)
dfValidate=dfValidate.drop(columns_to_drop, axis=1)


In [125]:
print(dfTrain.shape,dfValidate.shape)

(154109, 67) (23405, 67)


### Inspecting performance after dropping constant features :

In [126]:
X_train = dfTrain.drop("application",axis=1)
X_validate = dfValidate.drop("application",axis=1)
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_validate)
accuracy = accuracy_score(y_validate, y_pred)
print(f"Accuracy after dropping constant columns: {accuracy}")
F1_Score = f1_score(y_validate,y_pred,average='weighted')
print(f"F1-score after dropping constant columns: {F1_Score}")
Precision_Score = precision_score(y_validate,y_pred,average='weighted')
print(f"precision-score after dropping constant columns: {Precision_Score}")
Recall_Score = recall_score(y_validate,y_pred,average='weighted')
print(f"recall-score after dropping constant columns: {Recall_Score}")
Confusion_Matrix = confusion_matrix(y_validate,y_pred)
print(f"confusion-matrix after dropping constant columns:\n {Confusion_Matrix}")

Accuracy after dropping constant columns: 0.896218756675924
F1-score after dropping constant columns: 0.8965857330136279
precision-score after dropping constant columns: 0.898143752409758
recall-score after dropping constant columns: 0.896218756675924
confusion-matrix after dropping constant columns:
 [[3175   25    8    1   31    7  297    5]
 [  13 6390    8    2  162    6  105    0]
 [   6   16 1639  341   29    1   16  177]
 [   5    4  194  952    6    3    4   62]
 [  22  153   17    9 1925    5   67   13]
 [   3   21    0    0    3 4805    1    0]
 [ 255  102    9    2   53    4 1493   13]
 [   1    0   75   48    5    0   14  597]]


### Dropping one feature from each pair of highly correlated features to avoid issues with multicollinearity:

In [127]:
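# Correlate the features only; the target column is excluded so that it can
# never end up in a "highly correlated" pair.
corr_matrix = dfTrain.drop(columns=["application"]).corr().abs()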

high_corr_mask = corr_matrix > 0.8

high_corr_features = []
for i in range(len(high_corr_mask.columns)):
    for j in range(i):
        if high_corr_mask.iloc[i, j]:
            colname1 = high_corr_mask.columns[i]
            colname2 = high_corr_mask.columns[j]
            high_corr_features.append((colname1, colname2))

print(high_corr_features)

[('Fwd Packet Length Std', 'Fwd Packet Length Max'), ('Bwd Packet Length Std', 'Bwd Packet Length Max'), ('Flow IAT Max', 'Flow IAT Std'), ('Flow IAT Min', 'Flow IAT Mean'), ('Fwd IAT Total', 'Flow Duration'), ('Fwd IAT Mean', 'Flow IAT Mean'), ('Fwd IAT Max', 'Flow IAT Std'), ('Fwd IAT Max', 'Flow IAT Max'), ('Fwd IAT Min', 'Flow IAT Mean'), ('Fwd IAT Min', 'Fwd IAT Mean'), ('Bwd IAT Total', 'Flow Duration'), ('Bwd IAT Total', 'Fwd IAT Total'), ('Bwd IAT Max', 'Flow IAT Max'), ('Bwd IAT Min', 'Bwd IAT Mean'), ('Fwd Header Length', 'Total Fwd Packet'), ('Bwd Header Length', 'Total Bwd packets'), ('Fwd Packets/s', 'Flow Packets/s'), ('Packet Length Std', 'Packet Length Max'), ('Packet Length Std', 'Packet Length Mean'), ('ACK Flag Count', 'Total Fwd Packet'), ('ACK Flag Count', 'Total Bwd packets'), ('ACK Flag Count', 'Fwd Header Length'), ('ACK Flag Count', 'Bwd Header Length'), ('Average Packet Size', 'Packet Length Mean'), ('Average Packet Size', 'Packet Length Std'), ('Fwd Segment S

### To choose which column of each pair to drop, we use the feature importance provided by a random forest classifier: the feature with the lower importance is dropped.

In [128]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

feature_importance = dict(zip(X_train.columns, rf.feature_importances_))

In [129]:
for corr_feature in high_corr_features:
    feature1, feature2 = corr_feature
    if feature_importance[feature1] > feature_importance[feature2]:
        drop_feature = feature2
    else:
        drop_feature = feature1
        
    if drop_feature in dfTrain.columns.to_list():
        dfTrain.drop(columns=[drop_feature], inplace=True)
        dfValidate.drop(columns=[drop_feature], inplace=True)
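
The pair search and the drop decision can also be sketched as two small functions. This is a variant under stated assumptions, not the notebook's exact procedure: `correlated_pairs` and `choose_drops` are hypothetical helper names, and `choose_drops` skips a pair once one of its members has already been dropped, whereas the loop above compares every pair regardless.

```python
import numpy as np
import pandas as pd

def correlated_pairs(features: pd.DataFrame, threshold: float = 0.8):
    """Return each (col_a, col_b) pair whose absolute Pearson correlation exceeds threshold."""
    corr = features.corr().abs()
    # Strict upper triangle: every unordered pair is reported exactly once.
    mask = np.triu(corr.to_numpy() > threshold, k=1)
    rows, cols = np.where(mask)
    return [(corr.columns[i], corr.columns[j]) for i, j in zip(rows, cols)]

def choose_drops(pairs, importance):
    """Drop the less important member of each pair, skipping pairs already resolved."""
    dropped = set()
    for a, b in pairs:
        if a in dropped or b in dropped:
            continue  # one member is already gone, so the pair is no longer redundant
        dropped.add(b if importance[a] > importance[b] else a)
    return dropped

# Hypothetical toy data: "b" is almost a multiple of "a", "c" is unrelated.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})
toy_importance = {"a": 0.5, "b": 0.3, "c": 0.2}

pairs = correlated_pairs(toy)
drops = choose_drops(pairs, toy_importance)
print(pairs, drops)

# A chain of pairs: once "b" is dropped, the ("b", "c") pair is skipped and "c" survives.
chain_drops = choose_drops([("a", "b"), ("b", "c")], toy_importance)
print(chain_drops)
```

Skipping resolved pairs tends to keep more features than the pairwise loop, so the two approaches can end with different column counts.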


In [130]:
print(dfTrain.shape,dfValidate.shape)

(154109, 41) (23405, 41)


### Inspecting performance after dropping one of all mutually highly correlated columns :

In [131]:
X_train = dfTrain.drop("application",axis=1)
X_validate = dfValidate.drop("application",axis=1)
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_validate)
accuracy = accuracy_score(y_validate, y_pred)
print(f"Accuracy after Dropping one of all mutually highly correlated features: {accuracy}")
F1_Score = f1_score(y_validate,y_pred,average='weighted')
print(f"F1-score after Dropping one of all mutually highly correlated features: {F1_Score}")
Precision_Score = precision_score(y_validate,y_pred,average='weighted')
print(f"precision-score after Dropping one of all mutually highly correlated features: {Precision_Score}")
Recall_Score = recall_score(y_validate,y_pred,average='weighted')
print(f"recall-score after Dropping one of all mutually highly correlated features: {Recall_Score}")
Confusion_Matrix = confusion_matrix(y_validate,y_pred)
print(f"confusion-matrix after Dropping one of all mutually highly correlated features:\n {Confusion_Matrix}")

Accuracy after Dropping one of all mutually highly correlated features: 0.8973296304208502
F1-score after Dropping one of all mutually highly correlated features: 0.8977920397591332
precision-score after Dropping one of all mutually highly correlated features: 0.8993615387749816
recall-score after Dropping one of all mutually highly correlated features: 0.8973296304208502
confusion-matrix after Dropping one of all mutually highly correlated features:
 [[3170   26    7    3   31    8  300    4]
 [  18 6375    8    1  165    5  114    0]
 [   5   14 1668  313   28    1   23  173]
 [   4    3  190  958    8    2    5   60]
 [  21  145   21    9 1920    7   75   13]
 [   2   19    1    0    3 4808    0    0]
 [ 259  100    8    2   53    2 1495   12]
 [   2    0   65   47    5    0   13  608]]


# Recursive feature elimination (RFE):
### RFE is a computationally expensive wrapper technique. At every step it fits the model passed to the "estimator" argument (in our case the random forest classifier), ranks the features by the importance the fitted model assigns to them, and drops the least important ones. The number of features dropped at each step is passed to the "step" argument (in our case 1), and the procedure repeats on the remaining features until only the number of features passed to the "n_features_to_select" argument is left.
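
To make the elimination order visible, here is a minimal sketch on synthetic data (generated with `make_classification`, not the darknet dataset). After fitting, `support_` marks the kept features and `ranking_` gives 1 to every kept feature and increasing ranks to the features that were eliminated earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data: 8 features, 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    n_features_to_select=3,
    step=1,  # remove one feature per refit
)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the 3 features that were kept
print(rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```

With `step=1` and 8 features, the model is refitted once for each of the 5 eliminations, which is why RFE becomes slow on wide datasets.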

In [132]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

rfe = RFE(estimator=clf, n_features_to_select=25, step=1)

rfe.fit(X_train, y_train)

RFE_selected_features = X_train.columns[rfe.support_]

In [133]:
RFE_selected_features= RFE_selected_features.to_list()
RFE_selected_features

['Src IP',
 'Src Port',
 'Dst IP',
 'Dst Port',
 'Flow Duration',
 'Total Length of Fwd Packet',
 'Total Length of Bwd Packet',
 'Fwd Packet Length Min',
 'Bwd Packet Length Min',
 'Bwd Packet Length Mean',
 'Flow Bytes/s',
 'Flow Packets/s',
 'Flow IAT Mean',
 'Flow IAT Max',
 'Fwd Header Length',
 'Bwd Header Length',
 'Bwd Packets/s',
 'Packet Length Min',
 'Packet Length Variance',
 'Average Packet Size',
 'Fwd Segment Size Avg',
 'FWD Init Win Bytes',
 'Bwd Init Win Bytes',
 'Fwd Seg Size Min',
 'Idle Max']

In [134]:
dfTrainRFE = dfTrain[RFE_selected_features]
dfRFE = dfValidate[RFE_selected_features]

print(dfTrainRFE.shape,dfRFE.shape)

(154109, 25) (23405, 25)


In [135]:
def inspect_perf_gain(X_train,X_validate,target,technique):
    # Drop the target column if it is still present. The labels y_train and
    # y_validate are read from the notebook's global scope.
    X_train = X_train.drop(columns=[target], errors="ignore")
    X_validate = X_validate.drop(columns=[target], errors="ignore")
    rfc = RandomForestClassifier(n_estimators=100, random_state=42)
    rfc.fit(X_train, y_train)

    y_pred = rfc.predict(X_validate)
    accuracy = accuracy_score(y_validate, y_pred)
    print(f"Accuracy with using {technique} technique: {accuracy}")
    F1_Score = f1_score(y_validate,y_pred,average='weighted')
    print(f"F1-score with using {technique} technique: {F1_Score}")
    Precision_Score = precision_score(y_validate,y_pred,average='weighted')
    print(f"precision-score with using {technique} technique: {Precision_Score}")
    Recall_Score = recall_score(y_validate,y_pred,average='weighted')
    print(f"recall-score with using {technique} technique: {Recall_Score}")
    Confusion_Matrix = confusion_matrix(y_validate,y_pred)
    print(f"confusion-matrix with using {technique} technique:\n {Confusion_Matrix}")

### Inspecting performance with features selected by the RFE :

In [136]:
inspect_perf_gain(dfTrainRFE,dfRFE,"application","RFE")

Accuracy with using RFE technique: 0.8983550523392437
F1-score with using RFE technique: 0.8987659567370357
precision-score with using RFE technique: 0.9003590271178014
recall-score with using RFE technique: 0.8983550523392437
confusion-matrix with using RFE technique:
 [[3175   25    7    1   31    7  298    5]
 [  17 6378    8    0  168    4  111    0]
 [   6   11 1656  328   30    1   15  178]
 [   3    2  191  968    7    0    3   56]
 [  20  146   21    6 1936    3   68   11]
 [   5   18    2    0    3 4805    0    0]
 [ 264   98    8    2   52    2 1493   12]
 [   2    0   65   39    5    0   14  615]]


# Random forest feature importance:
### Here we keep only the 25 features to which a random forest classifier assigns the highest importance.
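
The top-k selection can be sketched compactly with a pandas Series. The data below is synthetic and the column names `f0`, `f1`, ... are hypothetical; only the pattern of ranking `feature_importances_` is the point.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names.
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(X_arr.shape[1])])

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, labelled by column name; they sum to 1.
importances = pd.Series(rf.feature_importances_, index=X.columns)
top_k = importances.nlargest(3).index.tolist()
print(importances.sort_values(ascending=False))
print(top_k)
```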

In [137]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

feature_importance = dict(zip(X_train.columns, rf.feature_importances_))

In [138]:
RFC_selected_features = [i[0] for i in sorted(feature_importance.items(), key=lambda x:x[1])[::-1]][:25]

In [139]:
RFC_selected_features

['Src IP',
 'Idle Max',
 'Src Port',
 'Dst Port',
 'Flow IAT Mean',
 'Flow Packets/s',
 'Flow IAT Max',
 'Flow Bytes/s',
 'Flow Duration',
 'Bwd Packets/s',
 'Dst IP',
 'FWD Init Win Bytes',
 'Average Packet Size',
 'Bwd Init Win Bytes',
 'Fwd Segment Size Avg',
 'Fwd Header Length',
 'Packet Length Variance',
 'Bwd Packet Length Mean',
 'Bwd Packet Length Min',
 'Packet Length Min',
 'Fwd Seg Size Min',
 'Bwd Header Length',
 'Total Length of Fwd Packet',
 'Subflow Fwd Packets',
 'Fwd Packet Length Max']

In [140]:
dfTrainRFC = dfTrain[RFC_selected_features]
dfRFC = dfValidate[RFC_selected_features]

print(dfTrainRFC.shape,dfRFC.shape)

(154109, 25) (23405, 25)


In [141]:
inspect_perf_gain(dfTrainRFC,dfRFC,"application","feature importance")

Accuracy with using feature importance technique: 0.8993804742576372
F1-score with using feature importance technique: 0.8998763512630937
precision-score with using feature importance technique: 0.9014145351138501
recall-score with using feature importance technique: 0.8993804742576372
confusion-matrix with using feature importance technique:
 [[3184   25    7    2   29    7  291    4]
 [  15 6376    8    0  167    4  116    0]
 [   6   10 1671  323   26    1   13  175]
 [   3    2  195  964    7    1    1   57]
 [  20  144   20    7 1932    3   74   11]
 [   2   19    1    0    4 4807    0    0]
 [ 258   92    8    2   48    2 1511   10]
 [   2    0   68   46    5    0   14  605]]


# Mutual information gain:
### Mutual information gain (MIG) measures how much information one random variable X provides about another random variable Y. It is high when knowing the value of X provides a lot of information about the value of Y, and low when knowing the value of X provides little or no information about the value of Y. Here X is a feature and Y is the application label; the scores come from scikit-learn's mutual_info_classif, abbreviated MIC below.
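
A small synthetic check illustrates the behaviour: one feature that is the label plus a little noise, and one feature that is pure noise. The arrays are made up for the illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)

# Feature 0 nearly determines the label; feature 1 is independent of it.
informative = y + rng.normal(scale=0.1, size=1000)
noise = rng.normal(size=1000)
X = np.column_stack([informative, noise])

# Scores are estimated in nats; for a binary label the ceiling is ln(2).
scores = mutual_info_classif(X, y, random_state=0)
print(scores)
```

The estimator is based on nearest neighbours, so the scores are estimates and vary slightly with `random_state`.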

In [142]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif

kbest = SelectKBest(score_func=mutual_info_classif, k=25)

kbest.fit(X_train, y_train)

MIC_selected_features = X_train.columns[kbest.get_support()]

In [143]:
MIC_selected_features=MIC_selected_features.to_list()
MIC_selected_features

['Src IP',
 'Src Port',
 'Dst IP',
 'Dst Port',
 'Flow Duration',
 'Total Length of Fwd Packet',
 'Total Length of Bwd Packet',
 'Fwd Packet Length Max',
 'Fwd Packet Length Min',
 'Bwd Packet Length Max',
 'Bwd Packet Length Min',
 'Bwd Packet Length Mean',
 'Flow Bytes/s',
 'Flow Packets/s',
 'Flow IAT Mean',
 'Flow IAT Max',
 'Fwd Header Length',
 'Bwd Header Length',
 'Bwd Packets/s',
 'Packet Length Min',
 'Packet Length Variance',
 'Average Packet Size',
 'Fwd Segment Size Avg',
 'FWD Init Win Bytes',
 'Idle Max']

In [144]:
dfTrainMIC = dfTrain[MIC_selected_features]
dfMIC = dfValidate[MIC_selected_features]

print(dfTrainMIC.shape,dfMIC.shape)

(154109, 25) (23405, 25)


### Inspecting performance with features selected by the MIC :

In [145]:
inspect_perf_gain(dfTrainMIC,dfMIC,"application","MIC")

Accuracy with using MIC technique: 0.8989959410382397
F1-score with using MIC technique: 0.8993236312996562
precision-score with using MIC technique: 0.9007486613684079
recall-score with using MIC technique: 0.8989959410382397
confusion-matrix with using MIC technique:
 [[3183   24    7    1   32   11  286    5]
 [  20 6389    7    0  161    4  105    0]
 [   5   12 1656  330   30    2   17  173]
 [   2    3  195  964    7    2    1   56]
 [  17  149   25    6 1928    5   72    9]
 [   2   18    2    0    4 4807    0    0]
 [ 262   96    8    2   49    1 1500   13]
 [   2    0   66   41    4    0   13  614]]


# CHI2 feature dependency:
### The basic idea behind chi-square-based feature selection is to test whether each feature is independent of the target variable. The intuition is that if a feature is independent of the target variable, it's unlikely to be useful for predicting the target. Conversely, if a feature is dependent on the target variable, it may be a good predictor. The chi-square score is only defined for non-negative feature values, which is why the absolute value of the features is passed to the selector.
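
scikit-learn's `chi2` rejects negative inputs. The notebook handles this with `np.abs`; the sketch below swaps in a different technique, `MinMaxScaler`, which maps each feature to [0, 1] while preserving the order of its values. The tiny array is hypothetical.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: column 0 grows with the class, column 1 does not.
X = np.array([[-1.0, 3.0], [0.0, 1.0], [2.0, 3.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])

# chi2 refuses negative feature values.
try:
    chi2(X, y)
    raised = False
except ValueError:
    raised = True

# Rescale each column to [0, 1] before scoring.
X_scaled = MinMaxScaler().fit_transform(X)
scores, p_values = chi2(X_scaled, y)
print(raised, scores)
```

Taking the absolute value folds negative values onto positive ones, while min-max scaling keeps them apart, so the two choices can rank features differently.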

In [146]:
from sklearn.feature_selection import chi2

kbest = SelectKBest(score_func=chi2, k=25)

kbest.fit(np.abs(X_train), y_train)

CHI2_selected_features = X_train.columns[kbest.get_support()]

In [147]:
CHI2_selected_features= CHI2_selected_features.to_list()
CHI2_selected_features

['Src IP',
 'Src Port',
 'Dst IP',
 'Dst Port',
 'Flow Duration',
 'Total Length of Fwd Packet',
 'Total Length of Bwd Packet',
 'Bwd Packet Length Max',
 'Flow Bytes/s',
 'Flow Packets/s',
 'Flow IAT Mean',
 'Flow IAT Max',
 'Fwd IAT Std',
 'Bwd IAT Mean',
 'Bwd IAT Std',
 'Fwd Header Length',
 'Bwd Header Length',
 'Bwd Packets/s',
 'Packet Length Variance',
 'Bwd Bulk Rate Avg',
 'FWD Init Win Bytes',
 'Bwd Init Win Bytes',
 'Fwd Act Data Pkts',
 'Idle Std',
 'Idle Max']

In [148]:
dfTrainCHI2 = dfTrain[CHI2_selected_features]
dfCHI2 = dfValidate[CHI2_selected_features]

print(dfTrainCHI2.shape,dfCHI2.shape)

(154109, 25) (23405, 25)


### Inspecting performance with features selected by CHI2 :

In [149]:
inspect_perf_gain(dfTrainCHI2,dfCHI2,"application","CHI2")

Accuracy with using CHI2 technique: 0.8963469344157231
F1-score with using CHI2 technique: 0.8972462576382563
precision-score with using CHI2 technique: 0.8993403904349225
recall-score with using CHI2 technique: 0.8963469344157231
confusion-matrix with using CHI2 technique:
 [[3170   21    6    1   33    9  303    6]
 [  13 6286    9    0  218    4  156    0]
 [   4   12 1689  306   29    1   11  173]
 [   3    2  192  966    7    1    3   56]
 [  19  124   21    8 1938    7   83   11]
 [   4   22    2    0    4 4801    0    0]
 [ 251   82    7    1   57    2 1521   10]
 [   2    0   69   40    6    0   15  608]]


# Analysis of Variance (ANOVA):
### For each feature, ANOVA compares the amount of variation between the class groups to the amount of variation within the groups, and uses the F-test to determine whether the differences between the group means are statistically significant. Features with a larger F-statistic separate the classes better.
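
The F-statistic behind this ranking can be worked out by hand for one feature and compared with `f_classif`. The nine values and three classes below are made up for the illustration, and `one_way_f` is a hypothetical helper name.

```python
import numpy as np
from sklearn.feature_selection import f_classif

# One hypothetical feature observed in three classes of three samples each.
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

def one_way_f(values, labels):
    """F = (between-group mean square) / (within-group mean square)."""
    groups = [values[labels == label] for label in np.unique(labels)]
    grand_mean = values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

f_manual = one_way_f(x, y)
f_sklearn, p_value = f_classif(x.reshape(-1, 1), y)
print(f_manual, f_sklearn[0])
```

Here the group means are 2, 7 and 5, the between-group sum of squares is 38 and the within-group sum of squares is 6, so F = (38 / 2) / (6 / 6) = 19.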

In [150]:
from sklearn.feature_selection import f_classif

kbest = SelectKBest(score_func=f_classif, k=25)


kbest.fit(X_train, y_train)


ANOVA_selected_features = X_train.columns[kbest.get_support()]

In [151]:
ANOVA_selected_features= ANOVA_selected_features.to_list()
ANOVA_selected_features

['Src IP',
 'Src Port',
 'Dst IP',
 'Dst Port',
 'Fwd Packet Length Min',
 'Bwd Packet Length Max',
 'Bwd Packet Length Min',
 'Bwd Packet Length Mean',
 'Flow Packets/s',
 'Flow IAT Max',
 'Fwd IAT Std',
 'Bwd IAT Mean',
 'Bwd IAT Std',
 'Fwd PSH Flags',
 'Bwd Packets/s',
 'Packet Length Min',
 'FIN Flag Count',
 'SYN Flag Count',
 'Average Packet Size',
 'Fwd Segment Size Avg',
 'Subflow Fwd Packets',
 'FWD Init Win Bytes',
 'Fwd Seg Size Min',
 'Idle Std',
 'Idle Max']

In [152]:
dfTrainANOVA = dfTrain[ANOVA_selected_features]
dfANOVA = dfValidate[ANOVA_selected_features]

print(dfTrainANOVA.shape,dfANOVA.shape)

(154109, 25) (23405, 25)


### Inspecting performance with features selected by ANOVA :

In [153]:
inspect_perf_gain(dfTrainANOVA,dfANOVA,"application","ANOVA")

Accuracy with using ANOVA technique: 0.9063020722067934
F1-score with using ANOVA technique: 0.9065364978405293
precision-score with using ANOVA technique: 0.9073635568380084
recall-score with using ANOVA technique: 0.9063020722067934
confusion-matrix with using ANOVA technique:
 [[3186   25    6    1   24    9  294    4]
 [  15 6425    5    0  139    4   98    0]
 [   5   12 1757  254   24    2   12  159]
 [   2    1  175  991    7    2    3   49]
 [  22  151   21    5 1933    6   63   10]
 [   1   19    1    1    3 4808    0    0]
 [ 254  104    6    1   47    2 1504   13]
 [   1    0   75   37    5    0   14  608]]


# Intersection of the feature sets selected by the previous techniques:
### We keep only the features that RFE, mutual information, chi-square, and ANOVA all selected.

In [154]:
INTERSECTION_features = set(RFE_selected_features).intersection(MIC_selected_features, CHI2_selected_features, ANOVA_selected_features)

print(INTERSECTION_features)  

{'Dst IP', 'Flow Packets/s', 'Src IP', 'Dst Port', 'Flow IAT Max', 'FWD Init Win Bytes', 'Idle Max', 'Bwd Packets/s', 'Src Port'}
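
A set has no defined order, and recent pandas versions reject a set as a column indexer, so it can help to turn the intersection into a sorted list before selecting columns. A minimal sketch with hypothetical short lists:

```python
# Hypothetical selections; the real lists hold 25 features each.
selections = {
    "RFE": ["Src IP", "Dst Port", "Idle Max", "Flow Duration"],
    "MIC": ["Src IP", "Dst Port", "Idle Max", "Flow Bytes/s"],
    "ANOVA": ["Idle Max", "Src IP", "Dst Port", "Fwd IAT Std"],
}

# Features present in every list.
common = set.intersection(*(set(features) for features in selections.values()))

# Sorting gives a stable column order from run to run.
common_sorted = sorted(common)
print(common_sorted)
```

A fixed column order also makes repeated runs of the random forest comparable, since the feature order can influence how the trees are built.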


In [155]:
intersection_features = list(INTERSECTION_features)

In [156]:
len(intersection_features)

9

### Inspecting efficiency on the dataset with the intersected features of all techniques:

In [157]:
# Index with the list: recent pandas versions reject a set as a column indexer.
dfTrainINTER = dfTrain[intersection_features]
dfINTER = dfValidate[intersection_features]
print(dfTrainINTER.shape,dfINTER.shape)

(154109, 9) (23405, 9)


In [158]:
inspect_perf_gain(dfTrainINTER,dfINTER,'application',"all")

Accuracy with using all technique: 0.8936979277932066
F1-score with using all technique: 0.8948765982441874
precision-score with using all technique: 0.8970691569792666
recall-score with using all technique: 0.8936979277932066
confusion-matrix with using all technique:
 [[3135   18    5    2   38    9  336    6]
 [  18 6260   12    1  202    4  189    0]
 [   7    9 1723  288   29    2   12  155]
 [   2    2  185  991    3    0    2   45]
 [  41  114   39    8 1910    4   85   10]
 [   2   21    1    1    8 4800    0    0]
 [ 246   75   21    1   84    1 1494    9]
 [   4    1   72   38    4    0   17  604]]


In [159]:
dfTrain.shape

(154109, 41)

In [160]:
dfTrain["application"].value_counts()

6    20114
4    19590
1    19523
0    19300
5    19179
7    19043
3    18716
2    18644
Name: application, dtype: int64

In [161]:
best_features_so_far = ['Src IP',
 'Src Port',
 'Dst IP',
 'Dst Port',
 'Fwd Packet Length Min',
 'Bwd Packet Length Max',
 'Bwd Packet Length Min',
 'Bwd Packet Length Mean',
 'Flow Packets/s',
 'Flow IAT Max',
 'Fwd IAT Std',
 'Bwd IAT Mean',
 'Bwd IAT Std',
 'Fwd PSH Flags',
 'Bwd Packets/s',
 'Packet Length Min',
 'FIN Flag Count',
 'SYN Flag Count',
 'Average Packet Size',
 'Fwd Segment Size Avg',
 'Subflow Fwd Packets',
 'FWD Init Win Bytes',
 'Fwd Seg Size Min',
 'Idle Std',
 'Idle Max']

## To summarize, in this section we used different techniques to select features:
* Domain knowledge &rarr; F1-score : 0.8967646886391679

* Brute-force experimentation:

    - Recursive feature elimination (RFE) &rarr; F1-score : 0.8987659567370357
    
* Model-based techniques:

    - Random forest feature importance &rarr; F1-score : 0.8998763512630937
    
* Statistical methods:

    - Mutual information gain (MIG) &rarr; F1-score : 0.8993236312996562
    
    - Correlation &rarr; F1-score : 0.8977920397591332
    
    - Dropping constant features &rarr; F1-score : 0.8965857330136279
    
    - The chi-square feature dependency (CHI2) &rarr; F1-score : 0.8972462576382563
    
    - Intersection set of features &rarr; F1-score : 0.8948765982441874
    
    - Analysis of variance (ANOVA) &rarr; F1-score : 0.9065364978405293


## The best performance so far is obtained with the features selected by ANOVA, so those are the features we will work with in the rest of this journey.
