### The Human in the loop

Now that you have mastered the standard supervised learning workflows, you will critically examine the ways in which expert knowledge is incorporated into supervised learning. This happens in three places: in identifying the appropriate unit of analysis, which may require feature engineering across multiple data sources; in the sometimes imperfect process of labeling examples; and in specifying a loss function that captures the true business value of the errors made by your machine learning model.

#### Importing Modules

In [10]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectKBest

### Is the source or the destination bad?

Previously, you used the destination computer as your entity of interest. However, your cybersecurity analyst just told you that it is the infected machines that generate the bad traffic, and that they will therefore appear as a source, not a destination, in the flows dataset.

The data flows has been preloaded, as well as the set bads of infected IDs and the feature extractor featurize() from the previous lesson. You also have numpy available as np, AdaBoostClassifier, and cross_val_score().

In [2]:
flows = pd.read_csv('lanl_flows.csv')

In [4]:
flows.head()

Unnamed: 0,time,duration,source_computer,source_port,destination_computer,destination_port,protocol,packet_count,byte_count
0,471692,0,C5808,N24128,C26871,N17023,6,1,60
1,471692,0,C5808,N2414,C26871,N19148,6,1,60
2,471692,0,C5808,N24156,C26871,N8001,6,1,60
3,471692,0,C5808,N24161,C26871,N18502,6,1,60
4,471692,0,C5808,N24162,C26871,N11309,6,1,60


In [7]:
bads = {'C1',
 'C10',
 'C10005',
 'C1003',
 'C1006',
 'C1014',
 'C1015',
 'C102',
 'C1022',
 'C1028',
 'C10405',
 'C1042',
 'C1046',
 'C10577',
 'C1065',
 'C108',
 'C10817',
 'C1085',
 'C1089',
 'C1096',
 'C11039',
 'C11178',
 'C1119',
 'C11194',
 'C1124',
 'C1125',
 'C113',
 'C115',
 'C11727',
 'C1173',
 'C1183',
 'C1191',
 'C12116',
 'C1215',
 'C1222',
 'C1224',
 'C12320',
 'C12448',
 'C12512',
 'C126',
 'C1268',
 'C12682',
 'C1269',
 'C1275',
 'C1302',
 'C1319',
 'C13713',
 'C1382',
 'C1415',
 'C143',
 'C1432',
 'C1438',
 'C1448',
 'C1461',
 'C1477',
 'C1479',
 'C148',
 'C1482',
 'C1484',
 'C1493',
 'C15',
 'C1500',
 'C1503',
 'C1506',
 'C1509',
 'C15197',
 'C152',
 'C15232',
 'C1549',
 'C155',
 'C1555',
 'C1567',
 'C1570',
 'C1581',
 'C16088',
 'C1610',
 'C1611',
 'C1616',
 'C1626',
 'C1632',
 'C16401',
 'C16467',
 'C16563',
 'C1710',
 'C1732',
 'C1737',
 'C17425',
 'C17600',
 'C17636',
 'C17640',
 'C17693',
 'C177',
 'C1776',
 'C17776',
 'C17806',
 'C1784',
 'C17860',
 'C1797',
 'C18025',
 'C1810',
 'C18113',
 'C18190',
 'C1823',
 'C18464',
 'C18626',
 'C1887',
 'C18872',
 'C19038',
 'C1906',
 'C19156',
 'C19356',
 'C1936',
 'C1944',
 'C19444',
 'C1952',
 'C1961',
 'C1964',
 'C1966',
 'C1980',
 'C19803',
 'C19932',
 'C2012',
 'C2013',
 'C20203',
 'C20455',
 'C2057',
 'C2058',
 'C20677',
 'C2079',
 'C20819',
 'C2085',
 'C2091',
 'C20966',
 'C21349',
 'C21664',
 'C21814',
 'C21919',
 'C21946',
 'C2196',
 'C21963',
 'C22174',
 'C22176',
 'C22275',
 'C22409',
 'C2254',
 'C22766',
 'C231',
 'C2341',
 'C2378',
 'C2388',
 'C243',
 'C246',
 'C2519',
 'C2578',
 'C2597',
 'C2604',
 'C2609',
 'C2648',
 'C2669',
 'C2725',
 'C2816',
 'C2844',
 'C2846',
 'C2849',
 'C2877',
 'C2914',
 'C294',
 'C2944',
 'C3019',
 'C302',
 'C3037',
 'C305',
 'C306',
 'C307',
 'C313',
 'C3153',
 'C3170',
 'C3173',
 'C3199',
 'C3249',
 'C3288',
 'C3292',
 'C3303',
 'C3305',
 'C332',
 'C338',
 'C3380',
 'C3388',
 'C3422',
 'C3435',
 'C3437',
 'C3455',
 'C346',
 'C3491',
 'C3521',
 'C353',
 'C3586',
 'C359',
 'C3597',
 'C3601',
 'C3610',
 'C3629',
 'C3635',
 'C366',
 'C368',
 'C3699',
 'C370',
 'C3755',
 'C3758',
 'C3813',
 'C385',
 'C3888',
 'C395',
 'C398',
 'C400',
 'C4106',
 'C4159',
 'C4161',
 'C42',
 'C423',
 'C4280',
 'C429',
 'C430',
 'C4403',
 'C452',
 'C4554',
 'C457',
 'C458',
 'C46',
 'C4610',
 'C464',
 'C467',
 'C477',
 'C4773',
 'C4845',
 'C486',
 'C492',
 'C4934',
 'C5030',
 'C504',
 'C506',
 'C5111',
 'C513',
 'C52',
 'C528',
 'C529',
 'C5343',
 'C5439',
 'C5453',
 'C553',
 'C5618',
 'C5653',
 'C5693',
 'C583',
 'C586',
 'C61',
 'C612',
 'C625',
 'C626',
 'C633',
 'C636',
 'C6487',
 'C6513',
 'C685',
 'C687',
 'C706',
 'C7131',
 'C721',
 'C728',
 'C742',
 'C7464',
 'C7503',
 'C754',
 'C7597',
 'C765',
 'C7782',
 'C779',
 'C78',
 'C791',
 'C798',
 'C801',
 'C8172',
 'C8209',
 'C828',
 'C849',
 'C8490',
 'C853',
 'C8585',
 'C8751',
 'C881',
 'C882',
 'C883',
 'C886',
 'C89',
 'C90',
 'C9006',
 'C917',
 'C92',
 'C923',
 'C96',
 'C965',
 'C9692',
 'C9723',
 'C977',
 'C9945'}

In [8]:
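def featurize(df):
    # Compute features from the group passed in (df), not from the
    # global flows table, so that each computer gets its own values
    return {
        'unique_ports': len(set(df['destination_port'])),
        'average_packet': np.mean(df['packet_count']),
        'average_duration': np.mean(df['duration'])
    }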

In [11]:
# Group by source computer, and apply the feature extractor 
out = flows.groupby('source_computer').apply(featurize)

# Convert the Series of feature dicts to a dataframe by calling list on it
X = pd.DataFrame(list(out), index=out.index)

# Check which sources in X.index are bad to create labels
y = [x in bads for x in X.index]

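# Report the average accuracy of AdaBoost over 3-fold CV
# (cv=3 is set explicitly because the default number of folds varies by scikit-learn version)
print(np.mean(cross_val_score(AdaBoostClassifier(), X, y, cv=3)))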

0.939495457083
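
To see why the choice of grouping key matters, here is a minimal sketch on made-up data. The names toy_flows and toy_bads are hypothetical and are not part of the exercise. The single infected machine only ever appears as a source, so grouping by destination never produces a positive label.

```python
import pandas as pd

# Hypothetical toy flows: C1 is infected and only ever appears as a source
toy_flows = pd.DataFrame({
    'source_computer':      ['C1', 'C1', 'C2', 'C3'],
    'destination_computer': ['C2', 'C3', 'C3', 'C2'],
    'packet_count':         [10, 12, 1, 2],
})
toy_bads = {'C1'}

# Average packet count per entity under each choice of grouping key
by_source = toy_flows.groupby('source_computer')['packet_count'].mean()
by_destination = toy_flows.groupby('destination_computer')['packet_count'].mean()

# Label each entity by membership in the infected set
y_source = [c in toy_bads for c in by_source.index]
y_destination = [c in toy_bads for c in by_destination.index]

# Number of positive labels under each unit of analysis
print(sum(y_source), sum(y_destination))
```

With the source as the unit of analysis the infected machine is present and labeled; with the destination it is absent from the index altogether, so a classifier would have nothing to learn from.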


### Feature engineering on grouped data

You will now build on the previous exercise by considering one additional feature: the number of unique protocols used by each source computer. Note that with grouped data, it is always possible to construct features in this manner: as a starting point, you can take the number of unique elements of each categorical column and the mean of each numeric column. As before, you have flows preloaded, cross_val_score() for measuring accuracy, AdaBoostClassifier(), pandas as pd and numpy as np.

In [12]:
# Create a feature counting unique protocols per source
protocols = flows.groupby('source_computer').apply(
  lambda df: len(set(df['protocol'])))

# Convert this feature into a dataframe, naming the column
protocols_DF = pd.DataFrame(
  protocols, index=protocols.index, columns=['protocol'])

# Now concatenate this feature with the previous dataset, X
X_more = pd.concat([X, protocols_DF], axis=1)

# Refit the classifier and report its accuracy with the same 3-fold CV
print(np.mean(cross_val_score(
  AdaBoostClassifier(), X_more, y, cv=3)))

0.939495457083
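
The general recipe mentioned above (unique counts for categorical columns, means for numeric columns) could be sketched as follows. The helper generic_featurize() and the toy dataframe are hypothetical and are not part of the exercise. The sketch assumes that categorical columns are stored as strings; a column such as protocol, which is encoded as an integer, has to be cast first or it will be averaged instead of counted.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def generic_featurize(df, key):
    """Hypothetical helper: per-group mean of every numeric column and
    per-group unique count of every other column."""
    grouped = df.groupby(key)
    features = {}
    for col in df.columns:
        if col == key:
            continue
        if is_numeric_dtype(df[col]):
            features['mean_' + col] = grouped[col].mean()
        else:
            features['nunique_' + col] = grouped[col].nunique()
    # The per-column Series share the group index, so they align row by row
    return pd.DataFrame(features)

# Made-up data with the same kinds of columns as flows
toy = pd.DataFrame({
    'source_computer': ['C1', 'C1', 'C2'],
    'destination_port': ['N1', 'N2', 'N1'],
    'protocol': [6, 17, 6],
    'packet_count': [1, 3, 5],
})

# protocol is a category encoded as an integer, so cast it to string
toy['protocol'] = toy['protocol'].astype(str)

X_toy = generic_featurize(toy, 'source_computer')
print(X_toy)
```

The result has one row per source computer and one column per original column, which makes it a convenient baseline before adding hand-crafted features.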
