# Feature Engineering

The main objectives for this notebook are:
* Develop a set of features that have the potential to improve our model's performance
* Investigate the relationships between our new features and the target


## Important steps
1. Engineer well-argued features (double bonus points if the argument is backed by sources)
2. Validate features after engineering
3. Don't use blind (automated) feature engineering; it is usually a waste of time:
    - Irrelevant features can reduce model performance
    - Generated features are harder to interpret and explain
    - Generated features may not align with business goals
4. Design a feature engineering pipeline at the end of the notebook

# Imports

In [2]:
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.io as pio
import seaborn as sns
from feature_engine.selection import SmartCorrelatedSelection

import polars as pl

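# The utils folder has to be added to sys.path manually so its modules can be imported.
# Note: "__file__" is a plain string here (notebooks do not define __file__), so dirname()
# returns "" and the path resolves to ../utils relative to the current working directory.
path2add = os.path.normpath(
    os.path.abspath(os.path.join(os.path.dirname("__file__"), os.path.pardir, "utils"))
)
if path2add not in sys.path:
    sys.path.append(path2add)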

from feature_engineering import (
    aggregate_node_features
    # feature_predictive_power,
    # get_graph_features,
)

pio.renderers.default = "notebook"

In [3]:
data = pl.read_parquet('../data/supervised_clean_data.parquet')
calls = pl.read_json('../data/supervised_call_graphs.json')

In [4]:
data.head(1)

Unnamed: 0_level_0,_id,inter_api_access_duration(sec),api_access_uniqueness,sequence_length(count),vsession_duration(min),ip_type,num_sessions,num_users,num_unique_apis,source,classification,is_anomaly
i64,str,f64,f64,f64,i64,str,f64,f64,f64,str,str,bool
0,"""1f2c32d8-2d6e-3b68-bc46-789469…",0.000812,0.004066,85.643243,5405,"""default""",1460.0,1295.0,451.0,"""E""","""normal""",False


In [5]:
calls.head(1)

_id,call_graph
str,list[struct[2]]
"""1f2c32d8-2d6e-3b68-bc46-789469…","[{""1f873432-6944-3df9-8300-8a3cf9f95b35"",""5862055b-35a6-316a-8e20-3ae20c1763c2""}, {""8955faa9-0e33-37ad-a1dc-f0e640a114c2"",""a4fd6415-1fd4-303e-aa33-bb1830b5d9d4""}, … {""016099ea-6f20-3fec-94cf-f7afa239f398"",""6fa8ad53-2f0d-3f44-8863-139092bfeda9""}]"


Since the main dataset already contains engineered features, there is little room for further feature engineering there. Additional features will therefore be created from the graph data in `supervised_call_graphs.json`.

## Process Graph Data
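
The call graphs arrive as one row per `_id`, each holding a list of two-field structs (see the `calls.head(1)` output above). The processing cell in this section renames the struct fields to `from` and `to`, then uses `explode` and `unnest` to produce one row per edge. As a rough plain-Python illustration of that reshaping (the toy records and node names are invented for the example and are not taken from the dataset):

```python
# Toy input: one record per graph, each holding a list of (source, destination) pairs.
records = [
    {"_id": "g1", "call_graph": [("A", "B"), ("B", "C")]},
    {"_id": "g2", "call_graph": [("A", "A")]},  # a self-loop, which also occurs in the real data
]

# One output row per edge, carrying the graph id along with it.
edge_rows = [
    {"_id": rec["_id"], "from": src, "to": dst}
    for rec in records
    for src, dst in rec["call_graph"]
]

for row in edge_rows:
    print(row)
```

The long format makes per-edge and per-node aggregations straightforward, because every edge is an ordinary row that can be grouped by `_id`, `from`, or `to`.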

In [6]:
calls_processed = (
    calls.with_columns(
        pl.col("call_graph").list.eval(
            pl.element().struct.rename_fields(["from", "to"])
        )
    )
    .explode("call_graph")
    .unnest("call_graph")
)

calls_processed.head()

_id,from,to
str,str,str
"""1f2c32d8-2d6e-3b68-bc46-789469…","""1f873432-6944-3df9-8300-8a3cf9…","""5862055b-35a6-316a-8e20-3ae20c…"
"""1f2c32d8-2d6e-3b68-bc46-789469…","""8955faa9-0e33-37ad-a1dc-f0e640…","""a4fd6415-1fd4-303e-aa33-bb1830…"
"""1f2c32d8-2d6e-3b68-bc46-789469…","""85754db8-6a55-30b7-8558-dec75f…","""85754db8-6a55-30b7-8558-dec75f…"
"""1f2c32d8-2d6e-3b68-bc46-789469…","""9f08fee1-953c-3801-b254-c0256f…","""876b4958-7df1-3b2b-9def-1a22f1…"
"""1f2c32d8-2d6e-3b68-bc46-789469…","""857c4b20-3057-30e0-9ca3-d6f5c3…","""857c4b20-3057-30e0-9ca3-d6f5c3…"


## Feature Engineering

We can see that each graph has its own `_id`, which can later be used to join back to the main dataset. A graph consists of source (`from`) and destination (`to`) nodes, which refer to the available API calls.

### Basic Graph Level Features

The most basic graph-level features that we can engineer are:
* Number of edges (connections)
* Number of nodes (APIs)

These features can be useful since most behaviours are going to have a "normal" range of APIs that they contact. If this number is too large or too small, this might be an indication of anomalous activity.
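
To make the two counts concrete, here is a small plain-Python sketch on a made-up edge list (the node names are hypothetical). It assumes, as in the Polars cell in this section, that every row counts as a connection, so repeated calls between the same pair of APIs are counted more than once, while nodes are deduplicated across both endpoints.

```python
# Toy edge list for a single graph; ("A", "B") appears twice on purpose.
edges = [("A", "B"), ("A", "B"), ("B", "C"), ("C", "C")]

# Number of connections: every edge row counts, duplicates included.
n_connections = len(edges)

# Number of unique nodes: union of all sources and destinations.
n_unique_nodes = len({node for edge in edges for node in edge})

print(n_connections, n_unique_nodes)
```

Because duplicates are kept, `n_connections` can be much larger than `n_unique_nodes` for behaviours that call the same APIs repeatedly.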

In [7]:
graph_features = calls_processed.group_by('_id').agg(
    pl.len().alias('n_connections'),
    pl.col('from'),
    pl.col('to')
).with_columns(
    pl.concat_list('from', 'to').list.unique().list.len().alias('n_unique_nodes')
).select([
    '_id',
    'n_connections',
    'n_unique_nodes'
])

graph_features.sample(3)

_id,n_connections,n_unique_nodes
str,u32,u32
"""3da8d060-73e4-3e9c-82da-a6f974…",65,40
"""e549089a-5a4e-33b4-9769-4c432b…",13,8
"""17045307-0bed-326f-a6f9-2304de…",17,14


### Node Level Features

Since graphs consist of nodes, we can engineer a set of features around specific nodes (APIs). We can calculate:

* Node degrees - the number of edges that leave or enter a node. Very highly connected nodes can look anomalous.
* Node centrality - there are various centrality measures (e.g. PageRank), but they all try to estimate how important a specific node is to the whole graph. This feature could be useful because a behaviour pattern that doesn't touch any of the "central" APIs would look anomalous.
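
The cells in this section compute degree features only; centrality is mentioned but not calculated here. As an illustration of what a centrality measure does, below is a minimal PageRank sketch using power iteration in plain Python. The function name, the damping factor of 0.85, and the toy edge list are assumptions made for the example and are not part of the project's `utils` code.

```python
def pagerank(edges, damping=0.85, n_iter=100, tol=1e-10):
    """Approximate PageRank scores for a directed edge list via power iteration."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)  # repeated edges act as extra weight

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(n_iter):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for u in nodes:
            if out_links[u]:
                share = damping * rank[u] / len(out_links[u])
                for v in out_links[u]:
                    new_rank[v] += share
        delta = sum(abs(new_rank[x] - rank[x]) for x in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy graph: three nodes call "hub", and "hub" calls back into "a".
toy_edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
scores = pagerank(toy_edges)
most_central = max(scores, key=scores.get)
print(most_central, scores)
```

Scores like these could be aggregated per graph in the same way as the degree features, for example by taking the mean or maximum centrality of the nodes a graph touches.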


These features can be broken down into:
* **global** features - measure node attributes across all the graphs
* **local** features - measure node attributes within a single graph
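
The difference between the two scopes can be sketched with counters on a made-up set of edges (graph ids and node names are hypothetical). A global degree counts how often a node appears as a source or destination across all graphs, while a local degree counts appearances within one graph only.

```python
from collections import Counter

# Toy edges as (graph_id, source, destination).
edges = [
    ("g1", "A", "B"),
    ("g1", "A", "C"),
    ("g2", "A", "B"),
    ("g2", "D", "B"),
]

# Global degrees: keyed by node only, so all graphs contribute.
global_source = Counter(src for _, src, _ in edges)
global_dest = Counter(dst for _, _, dst in edges)

# Local degrees: keyed by (graph, node), so each graph is counted separately.
local_source = Counter((gid, src) for gid, src, _ in edges)
local_dest = Counter((gid, dst) for gid, _, dst in edges)

print(global_source["A"], local_source[("g1", "A")], local_source[("g2", "A")])
print(global_dest["B"], local_dest[("g1", "B")], local_dest[("g2", "B")])
```

Node `A` is a common source overall, yet within `g2` it appears as a source only once; this is the kind of contrast the global and local features are meant to capture.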


In [8]:
calls_processed = calls_processed.with_columns(
    global_source_degrees = pl.len().over(pl.col('from')),
    global_dest_degrees = pl.len().over(pl.col('to')),
    local_source_degrees = pl.len().over(pl.col('from'), pl.col('_id')),
    local_dest_degrees = pl.len().over(pl.col('to'), pl.col('_id'))
)

calls_processed.sample(3)

_id,from,to,global_source_degrees,global_dest_degrees,local_source_degrees,local_dest_degrees
str,str,str,u32,u32,u32,u32
"""1b26b358-f7e4-3cf0-b1aa-b9c1a1…","""5cd474e2-1b47-3155-aac9-fdf91e…","""756ab2fe-a386-32dd-9a4e-18785c…",6983,22416,10,11
"""ac6e2c42-460a-30c1-ba92-a5fd21…","""16ada448-65e7-3cc9-b6ed-b30199…","""07d688eb-153b-3aa7-9151-f72f66…",3247,1395,19,5
"""cb8e40ee-a935-30ba-8071-9ced09…","""f854a0b4-c9c5-3b0f-b122-66bc71…","""756ab2fe-a386-32dd-9a4e-18785c…",4772,22416,2,25


Now that the node-level features are calculated, we need to aggregate them per graph (`_id`). When aggregating, we can calculate the mean, standard deviation, minimum, and maximum of every feature to capture its distribution.

In [9]:
node_features_agg = aggregate_node_features(
    calls_processed,
    node_features=[
        "global_source_degrees",
        "global_dest_degrees",
        "local_source_degrees",
        "local_dest_degrees",
    ],
    by="_id",
)

graph_features = graph_features.join(node_features_agg, on="_id")
graph_features.head()


_id,n_connections,n_unique_nodes,avg_global_source_degrees,min_global_source_degrees,max_global_source_degrees,std_global_source_degrees,avg_global_dest_degrees,min_global_dest_degrees,max_global_dest_degrees,std_global_dest_degrees,avg_local_source_degrees,min_local_source_degrees,max_local_source_degrees,std_local_source_degrees,avg_local_dest_degrees,min_local_dest_degrees,max_local_dest_degrees,std_local_dest_degrees
str,u32,u32,f64,u32,u32,f64,f64,u32,u32,f64,f64,u32,u32,f64,f64,u32,u32,f64
"""bf723452-e3e5-364d-8c55-1a3382…",58,21,9424.603448,626,32071,10773.873149,9273.551724,474,22416,8702.489041,4.37931,1,10,2.870464,4.448276,1,10,2.890835
"""f2413b79-b532-3263-9275-e2dec9…",380,106,6691.034211,33,32071,9113.158074,7615.252632,2,22416,8308.400452,9.531579,1,39,10.788763,12.015789,1,44,13.487794
"""cadba9de-1519-3988-bb26-96e72b…",261,73,7229.781609,165,32071,8261.57181,7790.45977,252,22416,7593.418844,8.210728,1,23,6.310982,8.448276,1,22,6.462481
"""e6853c15-cd13-3415-a971-eaafa9…",64,31,8439.03125,390,32071,9526.112115,8278.171875,544,22416,7609.341551,3.0625,1,8,2.107244,3.25,1,7,1.927248
"""79f6ac06-681f-342b-89e7-b5e795…",34,21,9643.558824,281,32071,10878.3489,8391.382353,295,22013,7225.648764,2.411765,1,6,1.794227,2.058824,1,4,1.013281


## Feature Selection
Feature selection will be done in two steps:
1. Quality checks - if a feature is constant or has too many missing values (>= 95%), it will be dropped
2. Correlation analysis - if features are very highly correlated with each other (absolute correlation >= 0.95), the redundant ones can be dropped as well

In [10]:
engineered_features = graph_features.columns[1:]
engineered_features

['n_connections',
 'n_unique_nodes',
 'avg_global_source_degrees',
 'min_global_source_degrees',
 'max_global_source_degrees',
 'std_global_source_degrees',
 'avg_global_dest_degrees',
 'min_global_dest_degrees',
 'max_global_dest_degrees',
 'std_global_dest_degrees',
 'avg_local_source_degrees',
 'min_local_source_degrees',
 'max_local_source_degrees',
 'std_local_source_degrees',
 'avg_local_dest_degrees',
 'min_local_dest_degrees',
 'max_local_dest_degrees',
 'std_local_dest_degrees']

### Quality Checks

In [11]:
null_counts = graph_features.null_count().transpose(include_header=True, header_name='col', column_names=['null_count'])
null_counts.filter(pl.col('null_count') > 0)

col,null_count
str,u32
"""std_global_source_degrees""",42
"""std_global_dest_degrees""",42
"""std_local_source_degrees""",42
"""std_local_dest_degrees""",42


In [12]:
static_features = graph_features.select(engineered_features).std().transpose(include_header=True, header_name='col', column_names=['std'])
static_features.filter(pl.col('std') == 0)

col,std
str,f64


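**Observations:**
* 4 columns have missing values (42 each). All of them are standard deviation features; the sample standard deviation is undefined for a single observation, so these are most likely graphs with only one edge
* No feature is constant (none has a standard deviation of 0)

**Impact:**
* No features will be dropped for quality reasons, since the share of missing values stays below the 95% threshold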
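
A plausible explanation for the missing standard deviations is that the corresponding graphs contain a single edge row. Assuming the aggregation uses the sample standard deviation (`ddof=1`, which is the Polars default for `std`), the denominator `n - 1` becomes zero for one observation and the result is null. The plain-Python sketch below mirrors that behaviour; the helper name `sample_std` is hypothetical.

```python
import math


def sample_std(values):
    """Sample standard deviation (ddof=1); None when fewer than two values."""
    n = len(values)
    if n < 2:
        return None  # undefined: the n - 1 denominator would be zero
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


single_edge = sample_std([7])   # one observation, so no spread can be estimated
two_edges = sample_std([1, 3])  # two observations are enough
print(single_edge, two_edges)
```

If this explanation holds, the nulls could be filled with 0 in the final pipeline, since a single-edge graph has no spread in its node degrees; checking `n_connections` for the affected rows would confirm it.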