# The Human in the Loop

In the previous chapter, we covered the standard supervised learning workflows. In this chapter, we critically examine how expert knowledge is incorporated into supervised learning. This happens in three places: identifying the appropriate unit of analysis, which may require feature engineering across multiple data sources; labeling examples, a process that is sometimes imperfect; and specifying a loss function that captures the true business value of the errors made by our machine learning model.

In [10]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score

## Data fusion

Originally, we used the _destination_ computer (from the `attack` dataset) as our entity of interest and searched for that _destination_ in the `flows` dataset. However, our cybersecurity analyst just told us that it is the infected machines that generate the bad traffic, so they will appear as a _source_, not a _destination_, in the `flows` dataset.
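To see why the choice of column matters, here is a minimal sketch on made-up data. The computer names and the tiny `toy_flows` frame are hypothetical and are not taken from the LANL files. It counts how many flow records involve a compromised machine when that machine is looked up as a source versus as a destination.

```python
import pandas as pd

# Hypothetical compromised machines (illustrative names only)
toy_bads = {"C2", "C3"}

# Hypothetical flow records in which compromised machines mostly send traffic out
toy_flows = pd.DataFrame({
    "source_computer":      ["C2", "C2", "C3", "C1", "C4"],
    "destination_computer": ["C9", "C8", "C9", "C2", "C7"],
})

# Number of flow records in which a compromised machine appears in each role
as_source = toy_flows["source_computer"].isin(toy_bads).sum()
as_destination = toy_flows["destination_computer"].isin(toy_bads).sum()

print(as_source, as_destination)
```

In this toy setting, most of the traffic tied to the compromised machines is only visible when they are treated as sources, which is the situation the analyst describes.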


In [2]:
def featurize(df):
    return {
        'unique_ports': len(set(df['destination_port'])),
        'average_packet': np.mean(df['packet_count']),
        'average_duration': np.mean(df['duration'])
    }
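As a quick sanity check, the sketch below applies the same `featurize` function to a small hand-made frame. The `toy` data (port names, packet counts, durations) is hypothetical and only meant to show the shape of the returned dictionary for the flows of one computer.

```python
import numpy as np
import pandas as pd

def featurize(df):
    return {
        'unique_ports': len(set(df['destination_port'])),
        'average_packet': np.mean(df['packet_count']),
        'average_duration': np.mean(df['duration'])
    }

# Hypothetical flows for a single source computer
toy = pd.DataFrame({
    'destination_port': ['N80', 'N80', 'N443'],
    'packet_count': [1, 3, 5],
    'duration': [0, 2, 4],
})

# One dictionary of summary features per group of flows
features = featurize(toy)
print(features)
```

Each group of flows collapses into a single row of three numeric features, which is what lets us build one training example per computer.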

In [6]:
attack = pd.read_csv('data/redteam.csv')
display(attack.head())
print("#"*40)

flows = pd.read_csv('data/lanl_flows.csv')
display(flows.head())
print("#"*40)

Unnamed: 0,time,user_domain,source_computer,destination_computer
0,151036,U748@DOM1,C17693,C305
1,151648,U748@DOM1,C17693,C728
2,151993,U6115@DOM1,C17693,C1173
3,153792,U636@DOM1,C17693,C294
4,155219,U748@DOM1,C17693,C5693


########################################


Unnamed: 0,time,duration,source_computer,source_port,destination_computer,destination_port,protocol,packet_count,byte_count
0,471692,0,C5808,N24128,C26871,N17023,6,1,60
1,471692,0,C5808,N2414,C26871,N19148,6,1,60
2,471692,0,C5808,N24156,C26871,N8001,6,1,60
3,471692,0,C5808,N24161,C26871,N18502,6,1,60
4,471692,0,C5808,N24162,C26871,N11309,6,1,60


########################################


In [11]:
# Compromised machines are the destinations of the red-team events in `attack`
bads = attack.destination_computer.values

# Group by source computer, and apply the feature extractor
out = flows.groupby('source_computer').apply(featurize)

# `out` is a Series of feature dicts; convert it to a dataframe with one column per feature
X = pd.DataFrame(list(out), index=out.index)

# Check which sources in X.index are bad to create labels
y = [x in bads for x in X.index]

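# Report the average accuracy of AdaBoost over 3-fold CV
# (cv is set explicitly because the default number of folds depends on the scikit-learn version)
print(np.mean(cross_val_score(AdaBoostClassifier(), X, y, cv=3)))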

0.9445378151260504


We have successfully incorporated our analyst's feedback. Let's now try to add some more features. 

