<a href="https://colab.research.google.com/github/Kaiziferr/machine_learning/blob/main/feature_selection/02_RFE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [63]:
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# **Info**
---
@By: **Steven Bernal**

@Nickname: **Kaiziferr**

@Git: https://github.com/Kaiziferr

# **Config**
---

In [2]:
warnings.filterwarnings("ignore", category=ConvergenceWarning)
#sns.set(style="darkgrid")
pd.set_option('display.float_format', '{:,.2f}'.format)
#paleta = sns.color_palette('Set2').as_hex()
random_seed=73

The objective of this project is to demonstrate recursive feature elimination (RFE).
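As a minimal sketch of the idea, the example below runs RFE on a synthetic dataset built with `make_classification` rather than on the notebook's data. The names `X_demo`, `y_demo`, and `rfe_demo`, the dataset sizes, and the choice of a decision tree are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative assumption): 10 features, 4 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    n_redundant=0, random_state=73)

# RFE fits the estimator, drops the least important feature (step=1),
# and repeats until n_features_to_select features remain.
rfe_demo = RFE(
    DecisionTreeClassifier(random_state=73),
    n_features_to_select=4,
    step=1)
rfe_demo.fit(X_demo, y_demo)

# transform() keeps only the columns that survived the elimination.
X_reduced = rfe_demo.transform(X_demo)
print(X_demo.shape, '->', X_reduced.shape)
```

The fitted selector behaves like any other scikit-learn transformer, so it can also be placed inside a pipeline ahead of a final model.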

# **Data Dictionary**
---
Data from a cryptocurrency mining network traffic dataset is used.


- `Name`: time window name.
- `Netflows`: number of netflows in the time window.
- `First_Protocol`: most used protocol in the time window.
- `Second_Protocol`: second most used protocol in the time window.
- `Third_Protocol`: third most used protocol in the time window.
- `p1_d`: 25th percentile of all durations in the time window.
- `p2_d`: 50th percentile of all durations in the time window.
- `p3_d`: 75th percentile of all durations in the time window.
- `duration`: total duration of the time window.
- `max_d`: maximum value of all durations in the time window.
- `min_d`: minimum value of all durations in the time window.
- `#packets`: total number of packets in the time window.
- `Avg_bps`: average bits per second in the time window.
- `Avg_pps`: average packets per second in the time window.
- `Avg_bpp`: average bytes per packet in the time window.
- `#Bytes`: total number of bytes in the time window.
- `#sp`: total number of source ports used in the time window.
- `#dp`: total number of destination ports used in the time window.
- `first_sp`: most used source port in the time window.
- `second_sp`: second most used source port in the time window.
- `third_sp`: third most used source port in the time window.
- `first_dp`: most used destination port in the time window.
- `second_dp`: second most used destination port in the time window.
- `third_dp`: third most used destination port in the time window.
- `p1_ip`: 25th percentile of all packet inputs in the time window.
- `p2_ip`: 50th percentile of all packet inputs in the time window.
- `p3_ip`: 75th percentile of all packet inputs in the time window.
- `p1_ib`: 25th percentile of all byte inputs in the time window.
- `p2_ib`: 50th percentile of all byte inputs in the time window.
- `p3_ib`: 75th percentile of all byte inputs in the time window.
- `Type`: mining time window type
  - `benignas`: 0
  - `bitcash`: 1
  - `bitcoin`: 2
  - `ethereum`: 3
  - `monero`: 4
  - `litecoin`: 5
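The percentile fields in the dictionary above can be illustrated with a small sketch. The `durations` array is made-up data, and NumPy's default linear interpolation is an assumption: the source does not state how the dataset computed its percentiles.

```python
import numpy as np

# Hypothetical flow durations (seconds) observed in one time window.
durations = np.array([0.0, 2.0, 4.0, 6.0, 8.0])

# 25th, 50th and 75th percentiles, in the spirit of p1_d, p2_d and p3_d.
p1_d, p2_d, p3_d = np.percentile(durations, [25, 50, 75])

# Extremes, in the spirit of min_d and max_d.
min_d, max_d = durations.min(), durations.max()

print(p1_d, p2_d, p3_d, min_d, max_d)
```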

# **Data**

---



In [3]:
url = 'https://raw.githubusercontent.com/Kaiziferr/datasets/main/cryptojacking.csv'
dta = pd.read_csv(url, dtype=str).drop([
    'Unnamed: 0',
    'Name',
    'Second_Protocol',
    'Third_Protocol'], axis=1)
dta.head(5)

Unnamed: 0,Netflows,First_Protocol,p1_d,p2_d,p3_d,duration,max_d,min_d,#packets,Avg_bps,...,first_dp,second_dp,third_dp,p1_ip,p2_ip,p3_ip,p1_ib,p2_ib,p3_ib,Type
0,65,TCP,18.939,168.173,194.287,7845.125999999999,244.362,0.0,5546,125708,...,443,80.0,123.0,3.0,7.0,22.0,127.0,255.0,1888.0,0
1,18,UDP,0.0,0.0,0.0,0.086,0.044,0.0,20,148,...,443,53.0,53195.0,1.0,1.0,1.0,37.0,47.0,64.0,0
2,10,UDP,0.0,0.0,0.0,0.0,0.0,0.0,10,236,...,53,39308.0,54454.0,1.0,1.0,1.0,34.0,43.0,61.75,0
3,2771,UDP,0.0,0.0,0.0,8548.902,149.034,0.0,8711,129626,...,53,5355.0,443.0,1.0,1.0,1.0,39.0,49.0,54.0,0
4,2,UDP,0.0,0.0,0.0,0.0,0.0,0.0,2,328000,...,48871,53.0,,1.0,1.0,1.0,37.0,41.0,45.0,0


# **Preprocessing**
---

The file is read with every column as a string, so the numeric columns are cast to their correct data types.

In [4]:
# Columns holding continuous measurements
float_cols = [
    'p1_d', 'p2_d', 'p3_d',
    'duration', 'max_d', 'min_d',
    'Avg_bps', 'Avg_pps', 'Avg_bpp',
    'p1_ip', 'p2_ip', 'p3_ip',
    'p1_ib', 'p2_ib', 'p3_ib']
dta[float_cols] = dta[float_cols].astype('float64')

In [5]:
# Columns holding counts
int_cols = ['Netflows', '#packets', '#Bytes', '#sp', '#dp']
dta[int_cols] = dta[int_cols].astype('int64')
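`astype` raises an error as soon as one value cannot be parsed. If that is a concern, a hedged alternative is `pd.to_numeric` with `errors='coerce'`, which turns unparseable strings into `NaN` so they can be inspected. The `raw` series below is made-up data, not taken from the dataset.

```python
import pandas as pd

# Hypothetical string column with one malformed entry.
raw = pd.Series(['18.939', '0.0', 'n/a'], name='p1_d')

# Unparseable values become NaN instead of raising an exception.
converted = pd.to_numeric(raw, errors='coerce')

print(converted)
print('rows that failed to parse:', int(converted.isna().sum()))
```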

In [6]:
dta['First_Protocol'].unique()

array(['TCP', 'UDP'], dtype=object)

The categorical variable `First_Protocol` is encoded as integers.

In [7]:
dta['First_Protocol'] = dta['First_Protocol'].replace({'TCP': 0, 'UDP':1})
dta.head(2)

Unnamed: 0,Netflows,First_Protocol,p1_d,p2_d,p3_d,duration,max_d,min_d,#packets,Avg_bps,...,first_dp,second_dp,third_dp,p1_ip,p2_ip,p3_ip,p1_ib,p2_ib,p3_ib,Type
0,65,0,18.94,168.17,194.29,7845.13,244.36,0.0,5546,125708.0,...,443,80.0,123.0,3.0,7.0,22.0,127.0,255.0,1888.0,0
1,18,1,0.0,0.0,0.0,0.09,0.04,0.0,20,148.0,...,443,53.0,53195.0,1.0,1.0,1.0,37.0,47.0,64.0,0
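An equivalent way to express this encoding is `Series.map`, sketched below on a made-up series. One difference worth knowing: `map` yields `NaN` for any value missing from the dictionary, which makes unexpected protocols easy to spot, whereas `replace` leaves them unchanged.

```python
import pandas as pd

# Hypothetical sample of the First_Protocol column.
protocols = pd.Series(['TCP', 'UDP', 'UDP', 'TCP'], name='First_Protocol')

# Same mapping as in the notebook: TCP -> 0, UDP -> 1.
encoded = protocols.map({'TCP': 0, 'UDP': 1})

print(encoded.tolist())
```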


In [8]:
dta.columns

Index(['Netflows', 'First_Protocol', 'p1_d', 'p2_d', 'p3_d', 'duration',
       'max_d', 'min_d', '#packets', 'Avg_bps', 'Avg_pps', 'Avg_bpp', '#Bytes',
       '#sp', '#dp', 'first_sp', 'second_sp', 'third_sp', 'first_dp',
       'second_dp', 'third_dp', 'p1_ip', 'p2_ip', 'p3_ip', 'p1_ib', 'p2_ib',
       'p3_ib', 'Type'],
      dtype='object')

The data contains null values. For the purposes of this exercise, they are replaced with zero to keep preprocessing short.

In [9]:
dta = dta.fillna(0)
dta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2837 entries, 0 to 2836
Data columns (total 28 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Netflows        2837 non-null   int64  
 1   First_Protocol  2837 non-null   int64  
 2   p1_d            2837 non-null   float64
 3   p2_d            2837 non-null   float64
 4   p3_d            2837 non-null   float64
 5   duration        2837 non-null   float64
 6   max_d           2837 non-null   float64
 7   min_d           2837 non-null   float64
 8   #packets        2837 non-null   int64  
 9   Avg_bps         2837 non-null   float64
 10  Avg_pps         2837 non-null   float64
 11  Avg_bpp         2837 non-null   float64
 12  #Bytes          2837 non-null   int64  
 13  #sp             2837 non-null   int64  
 14  #dp             2837 non-null   int64  
 15  first_sp        2837 non-null   object 
 16  second_sp       2837 non-null   object 
 17  third_sp        2837 non-null   o
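Before filling, it can help to see how many values are missing in each column. The sketch below uses a small made-up frame with numeric columns. The column names mirror the dataset, but the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where some port columns are missing.
demo = pd.DataFrame({
    'second_dp': [80.0, 53.0, np.nan],
    'third_dp': [123.0, np.nan, np.nan],
})

# Number of nulls per column before any imputation.
null_counts = demo.isna().sum()
print(null_counts)

# Same strategy as the notebook: replace every null with zero.
filled = demo.fillna(0)
print(filled)
```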

# **Data Split**
---

The predictors and the class to be predicted are extracted.

In [10]:
X = dta.drop(['Type'], axis=1)
y = dta.iloc[:, -1]

The data is split into training (85%) and test sets, stratified by class. Only the training split is used for feature selection.

In [11]:
X_train, _, y_train, _ = train_test_split(X, y, train_size=0.85, stratify=y, random_state=random_seed)
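The effect of `stratify` can be checked on a toy example: the class proportions of the full data are preserved in both splits. The imbalanced labels below (80 of one class, 20 of another) are an assumption for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80% class 0, 20% class 1.
y_demo = pd.Series([0] * 80 + [1] * 20)
X_demo = pd.DataFrame({'x': range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.85, stratify=y_demo, random_state=73)

# Both splits keep the 80/20 class ratio.
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```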

## **Model**
---

A dictionary is defined with instances of three models: Decision Tree Classifier, Lasso Regression, and Random Forest. Note that Lasso is a regressor, so it treats the class codes as numeric targets.

**Auto**

In [12]:
model_tree = DecisionTreeClassifier(random_state=random_seed)
model_lasso = Lasso(random_state=random_seed)
model_random = RandomForestClassifier(random_state=random_seed)
models = {
    'Tree':model_tree,
    'Lasso': model_lasso,
    'Random': model_random
}

In [14]:
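def select_feature_RFE(
    models: dict,
    columns: list,
    X: pd.DataFrame,
    y: pd.Series,
    **kwargs) -> tuple:
  """Run RFE once per model in `models`.

  Returns the RFE ranking of every feature per model (1 = selected) and
  the features selected by each model. `kwargs` are forwarded to RFE.
  """
  feature_rfe_score = pd.DataFrame()
  feature_rfe_score["Features"] = columns
  select_features = {}
  for k, m in models.items():
    selector = RFE(m, **kwargs)
    selector.fit(X, y)
    # support_ is a boolean mask over the columns of X
    select_features[f'best_features_{k}'] = X.columns[selector.support_]
    # ranking_ is 1 for selected features, higher for earlier eliminations
    feature_rfe_score[k] = selector.ranking_
  # Mean ranking of each feature across all models
  feature_rfe_score['acumulative_average'] = feature_rfe_score.iloc[:, 1:].mean(axis=1)
  return feature_rfe_score, pd.DataFrame(select_features)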

Two data frames are returned: the first contains the RFE ranking of each feature for each model (1 for selected features, with higher values for features eliminated earlier), and the second lists the features selected by each model.
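The meaning of the ranking values can be seen on a small synthetic example. With `step=1`, exactly one feature is dropped per iteration, so the eliminated features receive the rankings 2, 3, 4 and so on, with the highest number going to the feature dropped first. The data and the names `X_small`, `y_small`, and `rfe_small` are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative): 6 features, keep 3.
X_arr, y_small = make_classification(
    n_samples=200, n_features=6, n_informative=3,
    n_redundant=0, random_state=73)
X_small = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(6)])

rfe_small = RFE(
    DecisionTreeClassifier(random_state=73),
    n_features_to_select=3, step=1).fit(X_small, y_small)

# Ranking per feature: 1 = kept, larger = eliminated earlier.
ranking = pd.Series(rfe_small.ranking_, index=X_small.columns).sort_values()
print(ranking)

# support_ is simply the mask of features whose ranking is 1.
kept = X_small.columns[rfe_small.support_]
print(list(kept))
```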

In [15]:
feature_rfe_score, select_features = select_feature_RFE(
    models,
    X_train.columns,
    X_train,
    y_train,
    **{
        'n_features_to_select':8,
        'step': 1
    }
)

The first table reports, for each feature, the mean of its RFE rankings across the three models (a ranking of 1 means the feature was kept, and higher values mean it was eliminated earlier). Filtering on this mean surfaces features that not every model selected but that may still be worth keeping when they align with the business objectives.

In [16]:
feature_rfe_score[feature_rfe_score['acumulative_average']<4]

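| | Features | Tree | Lasso | Random | acumulative_average |
|---|---|---|---|---|---|
| 10 | Avg_pps | 9 | 1 | 1 | 3.67 |
| 11 | Avg_bpp | 1 | 1 | 1 | 1.00 |
| 15 | first_sp | 1 | 9 | 1 | 3.67 |
| 18 | first_dp | 1 | 1 | 1 | 1.00 |
| 19 | second_dp | 1 | 1 | 1 | 1.00 |
| 24 | p1_ib | 1 | 1 | 6 | 2.67 |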
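The averaging can be reproduced by hand from the rankings shown in the output above. In this sketch the values are copied from that output, while the column name `mean_rank` and the variable `unanimous` are introduced here for illustration. A mean of exactly 1 identifies the features that all three models kept.

```python
import pandas as pd

# Rankings copied from the filtered table above.
scores = pd.DataFrame({
    'Features': ['Avg_pps', 'Avg_bpp', 'first_sp',
                 'first_dp', 'second_dp', 'p1_ib'],
    'Tree':   [9, 1, 1, 1, 1, 1],
    'Lasso':  [1, 1, 9, 1, 1, 1],
    'Random': [1, 1, 1, 1, 1, 6],
})

# Row-wise mean of the three model rankings.
scores['mean_rank'] = scores[['Tree', 'Lasso', 'Random']].mean(axis=1)

# Features selected by every model have a mean ranking of exactly 1.
unanimous = scores.loc[scores['mean_rank'] == 1, 'Features'].tolist()
print(scores.round(2))
print(unanimous)
```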


In [17]:
select_features

| | best_features_Tree | best_features_Lasso | best_features_Random |
|---|---|---|---|
| 0 | duration | max_d | Avg_bps |
| 1 | Avg_bps | Avg_pps | Avg_pps |
| 2 | Avg_bpp | Avg_bpp | Avg_bpp |
| 3 | first_sp | first_dp | #sp |
| 4 | first_dp | second_dp | first_sp |
| 5 | second_dp | p3_ip | first_dp |
| 6 | p3_ip | p1_ib | second_dp |
| 7 | p1_ib | p2_ib | p2_ib |


**str**

A logistic regression model is instantiated, and `importance_getter='coef_'` is passed as a string so that RFE reads the model's `coef_` attribute as the importance criterion in each iteration.

In [18]:
model_logistic_regression = LogisticRegression(random_state=random_seed)
selector = RFE(
    estimator=model_logistic_regression,
    n_features_to_select=8,
    step = 1,
    importance_getter='coef_')
selector.fit(X_train, y_train)

In [19]:
X.columns[selector.support_]

Index(['duration', 'Avg_bps', 'Avg_bpp', '#Bytes', 'third_sp', 'p1_ib',
       'p2_ib', 'p3_ib'],
      dtype='object')
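The string form of `importance_getter` also accepts a dotted attribute path, which is useful when the estimator is wrapped in a pipeline. The sketch below is a variation, not the notebook's method: it scales synthetic features before the logistic regression and points RFE at the coefficients of the final step. The step names `scale` and `clf`, the data, and `max_iter=1000` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative): 8 features, 3 informative.
X_lr, y_lr = make_classification(
    n_samples=300, n_features=8, n_informative=3,
    n_redundant=0, random_state=73)

# Scaling puts the coefficients on a comparable footing across features.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, random_state=73)),
])

# The dotted path tells RFE where to find the importances in the pipeline.
rfe_pipe = RFE(
    pipe,
    n_features_to_select=3,
    step=1,
    importance_getter='named_steps.clf.coef_')
rfe_pipe.fit(X_lr, y_lr)

print(rfe_pipe.support_)
```

Coefficient magnitudes depend on the scale of each feature, so standardizing first is one common way to make a `coef_`-based ranking easier to interpret.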

**Callback**

A function is defined and passed to RFE as a callback through `importance_getter`: it receives the fitted estimator and returns one importance value per feature.

In [68]:
def select_best_feature(estimator):
  return estimator.feature_importances_

In [69]:
model_decision_tree = DecisionTreeClassifier(random_state=random_seed)

In [70]:
selector = RFE(
    estimator=model_decision_tree,
    n_features_to_select=8,
    step = 1,
    importance_getter=select_best_feature)

In [71]:
selector.fit(X_train, y_train)

In [72]:
X.columns[selector.support_]

Index(['duration', 'Avg_bps', 'Avg_bpp', 'first_sp', 'first_dp', 'second_dp',
       'p3_ip', 'p1_ib'],
      dtype='object')
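A callback becomes more useful when it does something the string form cannot. As a hedged sketch, the hypothetical helper `importance_from_any` below works with both tree-based and linear estimators: it returns `feature_importances_` when available and otherwise sums the absolute coefficients over classes. The synthetic data and the `supports` dictionary are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def importance_from_any(estimator):
    """Hypothetical helper: one importance value per feature."""
    if hasattr(estimator, 'feature_importances_'):
        return estimator.feature_importances_
    # Linear models: aggregate absolute coefficients across classes.
    return np.abs(np.atleast_2d(estimator.coef_)).sum(axis=0)

# Synthetic data (illustrative): 8 features, 3 informative.
X_cb, y_cb = make_classification(
    n_samples=300, n_features=8, n_informative=3,
    n_redundant=0, random_state=73)

supports = {}
for est in (DecisionTreeClassifier(random_state=73),
            LogisticRegression(max_iter=1000, random_state=73)):
    rfe_cb = RFE(est, n_features_to_select=3,
                 importance_getter=importance_from_any)
    rfe_cb.fit(X_cb, y_cb)
    supports[type(est).__name__] = rfe_cb.support_

for name, mask in supports.items():
    print(name, mask)
```

For the decision tree, this callback returns the same values that the default `importance_getter='auto'` would read, so the selection should match the default behavior.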
