# Feature Engineering

The main objectives for this notebook are:
* Develop a set of features that have a potential to improve our model's performance
* Investigate the relationships between our new features and the target


## Important Steps
1. Engineer well-argued features (backing them with sources earns double bonus points)
2. Validate features after engineering
3. Don't use blind (automated) feature engineering. It wastes time and brings further problems:
    - Irrelevant features can reduce model performance
    - Models become harder to interpret and explain
    - Features may not align with business goals
4. Design a feature engineering pipeline at the end of the notebook

# Imports

In [5]:
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.io as pio
import seaborn as sns
from feature_engine.selection import SmartCorrelatedSelection

import polars as pl

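# The utils folder must be added to sys.path manually to import from it.
# "__file__" is a literal string here (notebooks do not define __file__), so
# dirname() returns "" and the path resolves relative to the working directory.
path2add = os.path.normpath(
    os.path.abspath(os.path.join(os.path.dirname("__file__"), os.path.pardir, "utils"))
)
if path2add not in sys.path:
    sys.path.append(path2add)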

from feature_engineering import (
    aggregate_node_features
    # feature_predictive_power,
    # get_graph_features,
)

pio.renderers.default = "notebook"

In [6]:
data = pl.read_parquet('../data/supervised_clean_data.parquet')
calls = pl.read_json('../data/supervised_call_graphs.json')

In [8]:
data.head(1)

Unnamed: 0_level_0,_id,inter_api_access_duration(sec),api_access_uniqueness,sequence_length(count),vsession_duration(min),ip_type,num_sessions,num_users,num_unique_apis,source,classification,is_anomaly
i64,str,f64,f64,f64,i64,str,f64,f64,f64,str,str,bool
0,"""1f2c32d8-2d6e-3b68-bc46-789469…",0.000812,0.004066,85.643243,5405,"""default""",1460.0,1295.0,451.0,"""E""","""normal""",False


In [9]:
calls.head(1)

_id,call_graph
str,list[struct[2]]
"""1f2c32d8-2d6e-3b68-bc46-789469…","[{""1f873432-6944-3df9-8300-8a3cf9f95b35"",""5862055b-35a6-316a-8e20-3ae20c1763c2""}, {""8955faa9-0e33-37ad-a1dc-f0e640a114c2"",""a4fd6415-1fd4-303e-aa33-bb1830b5d9d4""}, … {""016099ea-6f20-3fec-94cf-f7afa239f398"",""6fa8ad53-2f0d-3f44-8863-139092bfeda9""}]"


Since the main dataset already contains engineered features, it leaves little room for further feature engineering. Additional features will therefore be created from the graph data in `supervised_call_graphs.json`.

## Process Graph Data

In [13]:
calls_processed = (
    calls.with_columns(
        pl.col("call_graph").list.eval(
            pl.element().struct.rename_fields(["from", "to"])
        )
    )
    .explode("call_graph")
    .unnest("call_graph")
)

calls_processed.head()

_id,from,to
str,str,str
"""1f2c32d8-2d6e-3b68-bc46-789469…","""1f873432-6944-3df9-8300-8a3cf9…","""5862055b-35a6-316a-8e20-3ae20c…"
"""1f2c32d8-2d6e-3b68-bc46-789469…","""8955faa9-0e33-37ad-a1dc-f0e640…","""a4fd6415-1fd4-303e-aa33-bb1830…"
"""1f2c32d8-2d6e-3b68-bc46-789469…","""85754db8-6a55-30b7-8558-dec75f…","""85754db8-6a55-30b7-8558-dec75f…"
"""1f2c32d8-2d6e-3b68-bc46-789469…","""9f08fee1-953c-3801-b254-c0256f…","""876b4958-7df1-3b2b-9def-1a22f1…"
"""1f2c32d8-2d6e-3b68-bc46-789469…","""857c4b20-3057-30e0-9ca3-d6f5c3…","""857c4b20-3057-30e0-9ca3-d6f5c3…"


## Feature Engineering

We can see that each graph has its own `_id`, which can later be used to join back to the main dataset. Each row is one edge of a graph, made up of a `from` (source) node and a `to` (destination) node, both of which refer to the available API calls.

### Basic Graph Level Features

The most basic graph-level features that we can engineer are:
* Number of edges (connections)
* Number of nodes (APIs)

These features can be useful since most behaviours are going to have a "normal" range of APIs that they contact. If either count falls far above or below that range, this might be an indication of anomalous activity.
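
To make the two definitions concrete, here is a minimal plain-Python sketch on a made-up edge list. The graph and API identifiers (`g1`, `api_a`, and so on) are hypothetical, and this is only a reference for what the counts mean, not the notebook's implementation. It assumes that repeated edges count as separate connections and that a self-loop contributes a single node.

```python
from collections import defaultdict

# Toy edge list in the same (_id, from, to) shape as `calls_processed`.
# All identifiers are invented for illustration.
edges = [
    ("g1", "api_a", "api_b"),
    ("g1", "api_b", "api_c"),
    ("g1", "api_a", "api_b"),  # repeated edge: counted again as a connection
    ("g1", "api_c", "api_c"),  # self-loop: one connection, no new node
    ("g2", "api_x", "api_y"),
]

n_connections = defaultdict(int)  # edges per graph
nodes = defaultdict(set)          # distinct nodes per graph

for graph_id, src, dst in edges:
    n_connections[graph_id] += 1
    nodes[graph_id].update((src, dst))

n_unique_nodes = {graph_id: len(members) for graph_id, members in nodes.items()}

print(dict(n_connections))  # {'g1': 4, 'g2': 1}
print(n_unique_nodes)       # {'g1': 3, 'g2': 2}
```

A small hand-checkable case like this can also serve as a sanity check when validating the engineered columns.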

In [14]:
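graph_features = calls_processed.group_by('_id').agg(
    pl.len().alias('n_connections'),  # one row per edge, so repeated edges are counted
    pl.col('from'),  # collect source nodes into a list per graph
    pl.col('to')  # collect destination nodes into a list per graph
).with_columns(
    # Distinct nodes that appear as either source or destination
    pl.concat_list('from', 'to').list.unique().list.len().alias('n_unique_nodes')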
).select([
    '_id',
    'n_connections',
    'n_unique_nodes'
])

graph_features.sample(3)

_id,n_connections,n_unique_nodes
str,u32,u32
"""6f217ee9-b57b-38f0-976c-1deb9f…",97,52
"""a3e0801a-a235-35b9-9587-01bb16…",8,7
"""9d8636d5-582f-32d7-a92a-55ba8c…",17,11

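
One possible way to act on the "normal range" idea is an interquartile-range (IQR) fence. The sketch below is an assumption made for illustration, not a method used in this notebook: the per-graph counts are hypothetical, and the 1.5 × IQR multiplier is a common rule of thumb rather than a tuned threshold.

```python
import statistics

# Hypothetical per-graph node counts; in the notebook these would come
# from the `n_unique_nodes` column of `graph_features`.
n_unique_nodes = {
    "g1": 11, "g2": 9, "g3": 12, "g4": 10,
    "g5": 7, "g6": 52, "g7": 8, "g8": 11,
}

# Quartiles of the observed counts (n=4 returns the three cut points).
q1, _, q3 = statistics.quantiles(sorted(n_unique_nodes.values()), n=4)
iqr = q3 - q1

# Rule-of-thumb fences: anything outside is treated as "unusual".
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outside_range = {
    graph_id: count
    for graph_id, count in n_unique_nodes.items()
    if count < lower or count > upper
}

print(outside_range)  # {'g6': 52}
```

Whether such a flag carries any signal should still be checked against `is_anomaly` after joining the graph features to the main dataset on `_id`.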