# Evolving Whitehawk from an Expert-Based to a Data-Driven System

Currently, the Whitehawk Recommendation System (WRS) creates product selections based on discussions with cybersecurity Subject Matter Experts.  Specifically, these talks lead to new _matching concepts_ that relate users to items.  While this helps in gaining direction and contextual understanding, it is a poor method for reaching objective, quantitative results.  Putting data behind the WRS offers a variety of benefits, including:

* reducing the system's dependency on manual methods
* providing users with specific and actionable information
* justifying recommendations objectively
* creating tangible business value

This notebook discusses the process for moving from an expert-based system to one that is automated and data-driven.

### Setup

In [1]:
%%classpath add mvn
tech.tablesaw tablesaw-plot 0.23.0
tech.tablesaw tablesaw-beakerx 0.23.0
com.jimmoores quandl-tablesaw 2.0.0
com.github.haifengl smile-core 1.5.1

Added jars: [jackson-core-2.8.9.jar, error_prone_annotations-2.1.3.jar, jersey-guava-2.25.1.jar, commons-math3-3.6.1.jar, threetenbp-1.3.6.jar, animal-sniffer-annotations-1.14.jar, aopalliance-repackaged-2.5.0-b32.jar, opencsv-4.2.jar, jersey-common-2.25.1.jar, swing-worker-1.1.jar, commons-lang3-3.7.jar, j2objc-annotations-1.1.jar, commons-text-1.3.jar, xchart-3.5.2.jar, swingx-1.6.1.jar, tablesaw-beakerx-0.23.0.jar, smile-math-1.5.1.jar, javax.inject-2.5.0-b32.jar, jackson-databind-2.8.9.jar, quandl-tablesaw-2.0.0.jar, commons-beanutils-1.9.3.jar, commons-collections-3.2.2.jar, json-20090211_1.jar, RoaringBitmap-0.7.14.jar, smile-data-1.5.1.jar, smile-plot-1.5.1.jar, jsr305-3.0.2.jar, VectorGraphics2D-0.13.jar, smile-graph-1.5.1.jar, quandl-core-2.0.0.jar, slf4j-api-1.7.25.jar, commons-collections4-4.1.jar, javax.ws.rs-api-2.0.1.jar, filters-2.0.235.jar, opencsv-2.3.jar, javax.annotation-api-1.2.jar, osgi-resource-locator-1.0.1.jar, commons-logging-1.2.jar, jsoup-1.11.3.jar, tablesaw

In [2]:
%import static tech.tablesaw.api.DoubleColumn.*

In [3]:
//%import static tech.tablesaw.api.BarTrace.*

input is incomplete: input is incomplete

In [4]:
%import static tech.tablesaw.aggregate.AggregateFunctions.*
%import tech.tablesaw.api.*
%import tech.tablesaw.columns.*
//%import smile.clustering.*
//%import smile.regression.*

// display Tablesaw tables with BeakerX table display widget
tech.tablesaw.beakerx.TablesawDisplayer.register()

null

### External Incident Data

Notice the fields *"Cause_Breach"* and *"Cause_Cost"*.  Products address an attack vector both by how the vector breaches and by how it inflicts damage / loss (cost to the firm).  In later versions, product-categories should note specifically how products defend against vectors.

In [5]:
// data obtained from verizon breach database 2014
val incident_file = Table.read().csv("./resources/dataIncidents.csv")
incident_file.first(3)

In [6]:
// only keep fields of interest
val incident = incident_file.select("Industry", "Sample_Count", "Firm_Count_US", "Cause_Pct", "Cause_Ref_Specific", "Exploit_Category")

<br>
<br>
<br>
# Processing: Security Score

### Chance of security incident

Assume there exists a probability distribution for discrete cybersecurity events.  For all matching variables, within a specific year, industry, business-type:

* obtain a sample of the number of security incidents and causes
* categorize causes by: 
  + internal personnel,
  + internal IT infrastructure, or 
  + external attack
* order these (as percentages) from highest to lowest, regardless of causal category (not necessary, but helpful graphically)
* ensure percentages sum to one
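
The steps above can be sketched in plain Scala, independent of Tablesaw. This is a minimal sketch under stated assumptions: the counts below are hypothetical (they are not taken from the incident file), and `toDistribution` is a helper name introduced here for illustration.

```scala
// Hypothetical raw incident counts per cause (illustrative values only)
val counts: Map[String, Int] = Map(
  "Web App Attack"    -> 54,
  "Denial Of Service" -> 52,
  "Insider Misuse"    -> 14
)

// Convert raw counts to shares that sum to one, ordered from highest to lowest
def toDistribution(counts: Map[String, Int]): Seq[(String, Double)] = {
  val total = counts.values.sum.toDouble
  require(total > 0, "at least one incident is needed")
  counts.toSeq
    .map { case (cause, n) => cause -> n / total }
    .sortBy { case (_, share) => -share }
}

val distribution = toDistribution(counts)
```

Dividing by the total is what guarantees the shares sum to one; the sort only affects how the distribution reads on a plot.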

<img src="images/incidents.jpg" width=400 height=100>

In [7]:
var Y = incident.numberColumn("Cause_Pct").asDoubleArray.toSeq

[[0.27, 0.26, 0.22, 0.07, 0.06, 0.05, 0.04, 0.03, 0.001, 0.001, 0.0]]

In [8]:
var X = incident.stringColumn("Cause_Ref_Specific").asList

[Web App Attack, Denial Of Service, Payment Card Skimmer, Insider Misuse, Other, Misc Error, Crime-ware, Theft / Loss, POS Intrusion, Cyber Espionage, (no incident)]

In [9]:
var incident_sort = incident.sortDescendingOn("Cause_Pct")

var Percent = incident_sort.numberColumn("Cause_Pct").asDoubleArray()
var Cause = incident_sort.stringColumn("Cause_Ref_Specific").asDoubleArray()

var Plot1 = new Plot {
    title="Probability Distribution for Security Incidents in the Finance Industry"
    initHeight = 300
    xLabel = "Cause Categories"
    yLabel = "Probability of Incident"
}
Plot1.add( new Bars {
    x=Cause
    y=Percent
    outlineColor = Color.black
    width = 0.75
    })
Plot1.add( new Line {
    displayName = "Probability"
    x=Cause
    y=Percent
    width = 3
})

### Define the security score

The security score is defined on the interval [0, 1] and is evaluated as 1 minus the probability of incurring a successful (damaging) security incident.
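
As a worked instance of this definition, write $p_i$ for the probability of cause $i$ (its `Cause_Pct` value) and $S_i$ for the corresponding security score. Both symbols are notation introduced here for illustration.

$$
S_i = 1 - p_i, \qquad 0 \le p_i \le 1 \;\Longrightarrow\; 0 \le S_i \le 1
$$

For example, Web App Attack has $p = 0.27$ in the incident data shown earlier, so $S = 1 - 0.27 = 0.73$, while the "(no incident)" row has $p = 0$ and therefore $S = 1$.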

<img src="images/incidents_security.jpg" width=400 height=100>

In [11]:
var security = incident_sort.numberColumn("Cause_Pct").subtract(1).multiply(-1).setName("Security_Score");
incident_sort.addColumns(security)
incident_sort = incident_sort.sortDescendingOn("Cause_Pct")

var Percent = incident_sort.numberColumn("Cause_Pct").asDoubleArray()
var Cause = incident_sort.stringColumn("Cause_Ref_Specific").asDoubleArray()
var Security = incident_sort.numberColumn("Security_Score").asDoubleArray()


var Plot2 = new Plot {
    title="Security Score across Incidents in the Finance Industry"
    initHeight = 300
    xLabel = "Cause Categories"
    yLabel = "Probability of Incident"
}
Plot2.add( new Bars {
    x=Cause
    y=Percent
    outlineColor = Color.black
    width = 0.75
    })
Plot2.add(new Line {
    displayName = "Probability"
    x=Cause
    y=Percent
    width = 3
})
Plot2.add(new Line {
    displayName = "Security"
    x = Cause
    y = Security 
})
//Plot.setYBound(0.75, 1.0)
//Plot.yAxes(1).bound = (0.75, 1.0)

<br>
<br>
<br>
# Processing: Security Curve

### Matching problems to solutions

We must match the following variables:

* _incident causes_ are associated with
* _firm vulnerabilities (exploit-categories)_, which are addressed by
* _product-categories_

From the distribution, we can obtain the relative importance of causes and categories of causes.  Each of these will have associated vulnerabilities.  Vulnerabilities should be analyzed, at least at a general level, so that they can be matched to the product-categories that address them.

Each vulnerability must be addressed by a product-category.  Each product-category is covered by a single product recommended by WRS.

Product-categories are grouped by package (basic, balanced, advanced).  Packages are ordered, and each package aggregates all lower-ordered products.  By implementing all products from each successive package, we increase the security score.  By implementing all products (advanced package), we receive a security score of 1.
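
The nesting of packages described above can be sketched as a running union. This is only an illustration: the category names appear in the product data used later in this notebook, but their assignment to tiers here is hypothetical, as are the names `tiers` and `cumulative`.

```scala
// Product-categories introduced at each tier (hypothetical assignment, illustrative only)
val tiers: Seq[(String, Set[String])] = Seq(
  "BASIC"    -> Set("Antimalware", "Backup"),
  "BALANCED" -> Set("Email Filter", "Virtual Private Network"),
  "ADVANCED" -> Set("Forensics", "Incident Response")
)

// Each package covers its own categories plus those of every lower-ordered package.
// scanLeft keeps its seed as the first element, so .tail drops it.
val cumulative: Seq[(String, Set[String])] =
  tiers.scanLeft("" -> Set.empty[String]) {
    case ((_, acc), (name, own)) => name -> (acc ++ own)
  }.tail
```

With this layout, the last entry (the advanced package) contains every category, which is why it is the only package that can reach the maximum score.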

<img src="images/security_prodcat.jpg" width=400 height=300>

In [12]:
var Plot3 = new Plot {
    title="Security Score across Incident Causes in the Finance Industry"
    initHeight = 300
    initWidth = 700
    xLabel = "Cause Categories / Vulnerabilities"
    yLabel = "Security Score"
}
Plot3.add(new Line {
    displayName = "Security"
    x = Cause
    y = Security 
})
Plot3.setYBound(.70, 1.05)

<div align="left">
<img src="./resources/tbl_Vul_Prod-Cat.png" width=600 align="left">
<img src="./resources/tbl_Vulnerability.png" width=600 align="right">
</div>

### Calculation of the Security Curve for bundle packages

The product-categories do not align cleanly with the incident causes / vulnerabilities.  So, I cannot simply read off the graph: _"with the Basic package, you receive a security score of X%"_.

Method:

* consider only the cause category, not its components (breach and cost)
* begin with the _Advanced_ package, which has all products assigned; it must receive a score of 99%
* divide the score at each vulnerability by the number of products that address it, then credit each package with the share covered by its products (assumption: the products share the risk evenly)
* instead of one risk curve, create a separate risk curve for each package; the aggregated curve is that package's score
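
The method above can be sketched in plain Scala before applying it to the Tablesaw tables. This is a minimal sketch under stated assumptions: the rows and numbers are hypothetical, `CauseRow`, `coveredShare`, and `curve` are names introduced here for illustration, and the per-package counts are assumed to be the number of products that each package adds for a cause.

```scala
// One row per incident cause. All numbers are hypothetical, for illustration only.
final case class CauseRow(cause: String, causePct: Double, basic: Int, balanced: Int, advanced: Int) {
  val total: Int = basic + balanced + advanced // all products addressing this cause
}

val rows = Seq(
  CauseRow("Web App Attack",    0.50, basic = 1, balanced = 1, advanced = 2),
  CauseRow("Denial Of Service", 0.30, basic = 0, balanced = 2, advanced = 1),
  CauseRow("Insider Misuse",    0.20, basic = 1, balanced = 0, advanced = 0)
)

// Share of a cause's probability covered when `covered` of its products are installed,
// assuming every product carries an equal share of that cause's risk.
def coveredShare(row: CauseRow, covered: Int): Double =
  if (row.total == 0) 0.0 else row.causePct * covered / row.total

// Running total across causes; .tail drops the 0.0 seed so there is one point per cause.
def curve(shares: Seq[Double]): Seq[Double] = shares.scanLeft(0.0)(_ + _).tail

val basicCurve    = curve(rows.map(r => coveredShare(r, r.basic)))
val balancedCurve = curve(rows.map(r => coveredShare(r, r.basic + r.balanced)))
val advancedCurve = curve(rows.map(r => coveredShare(r, r.total)))
```

The last point of each curve is that package's aggregate score. In this toy data the advanced curve ends at the sum of all cause probabilities, because the advanced package covers every product for every cause.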


Match incidents to vulnerabilities to product-categories.  Specific products are irrelevant; only the product-category that each product addresses matters.  This allows scoring to be de-coupled from product selection.

Header of `OUTPUT.jsonl` for an arbitrary finance firm:
```
"resultPackages": [
    {"packagePrice": 78770, 
     "productRefs": ["Hemisphere's Proprietary Assessment less than 100", "Phish Threat", "Visiontek Universal SSD Cloning and Transfer Kit", "Log and Event Manager", "Legal Defense Support", "Adaptive Defense 360 and Advanced Reporting Tool 1 Year", "SOHO Network Security Firewall", "Maximum Security for Home", "Sophos Mobile Security", "Deep Discovery Inspector", "VyprVPN Premium Monthly", "Data Loss Prevention v.5 2 5.2", "TippingPoint", "DameWare Patch Manager 1 year", "Sophos Email Protection", "Backup Server"], 
     "resultPackageType": "ADVANCED"}, 
    {"packagePrice": 20000, 
     "productRefs": ["Phish Threat", "SOHO Network Security Firewall", "Log and Event Manager", "Maximum Security for Home", "Sophos Mobile Security", "VyprVPN Premium Monthly", "Data Loss Prevention v.5 2 5.2", "TippingPoint", "DameWare Patch Manager 1 year", "Sophos Email Protection", "Backup Server"], 
     "resultPackageType": "BALANCED"}, 
    {"packagePrice": 135, 
     "productRefs": ["OneGuard Plus 1 Year", "Sophos Mobile Security", "Data Loss Prevention v.5 2 5.2", "OneConnect Plus 1 Year", "Identity Tracking for Identity Manager", "Archiver 1 Year"], 
     "resultPackageType": "BASIC"}]}
```

In [24]:
// this is a flat file of the above nested OUTPUT.jsonl (serviceBundleTemplate-py)
val product_file = Table.read().csv("./resources/dataProducts.csv")
val product = product_file.select("Package", "ProductName", "Category", "VendorName", "Type", "supplierPrice", "SellingPrice")
println("The number of unique products selected for the advanced-package: " + product.shape)
product.first(3)

The number of unique products selected for the advanced-package: 21 rows X 7 cols


In [19]:
// counts of product-categories
var lCategory = product.stringColumn("Category").asList
product.xTabCounts("Category").sortDescendingOn("Count")

21


In [18]:
// this file matches product-categories with exploit-categories, or all of them (any)
val category = Table.read().csv("./resources/dataCategory.csv")
println(category.shape)
category.first(3)

21 rows X 2 cols


In [25]:
product.stringColumn("Category").asSet

[Forensics, Network Intrusion Prevention System, Mobile Data Security, Patch Management, Traffic Analysis, Access Control, Data Leak Prevention, Host-Based Intrusion Prevention System, Incident Response, Email Filter, Training, Security Information and Event Management, Backup, Vulnerability Assessment, Threat Intelligence, Virtual Private Network, Antimalware]

In [28]:
// match products to exploit-categories
var tblMrg1 = product.join("Category").leftOuter(category, "Product_Cat")
tblMrg1 = tblMrg1.select("Package","ProductName","Category","Exploit_Cat")
tblMrg1.first(3)
//tblMrg1.write().csv("tblMrg1.csv");
//tblMrg1.select("Package","ProductName","Category","Exploit_Cat").where(tblMrg1.stringColumn("Exploit_Cat").isEqualTo(""))

In [30]:
var tblMrg2 = tblMrg1.xTabCounts("Exploit_Cat","Package").sortAscendingOn("total")

In [39]:
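// aggregate function: running (cumulative) sum
// note: scan(0.0) keeps the 0.0 seed as the first element, so the result has one more entry than the input
def getSummedList(list: List[Double]) = list.scan(0.0)((a, b) => a + b)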

getSummedList: (list: List[Double])List[Double]


In [40]:
// Merge
var tblMrg3 = incident_sort.join("Exploit_Category").leftOuter(tblMrg2, "[labels]")

// create product fractional support
var CausPct_part = tblMrg3.numberColumn("Cause_Pct").divide(tblMrg3.numberColumn("total")).setName("CausPct_part")
tblMrg3.addColumns(CausPct_part)
tblMrg3.shape

11 rows X 12 cols

In [33]:
// aggregate for basic
var CausPct_BASIC = tblMrg3.numberColumn("BASIC").multiply(tblMrg3.numberColumn("CausPct_part")).setName("CausPct_BASIC")
tblMrg3.addColumns(CausPct_BASIC)
var values_1 = tblMrg3.numberColumn("CausPct_BASIC").asDoubleArray
var values_2 = getSummedList(values_1.toList)
tblMrg3.addColumns(  create("CausAgg_BASIC", values_2.toArray)  )
tblMrg3.numberColumn("CausAgg_BASIC")

Number column: CausAgg_BASIC

In [34]:
// aggregate for balanced
var values_1 = tblMrg3.numberColumn("BALANCED").add(tblMrg3.numberColumn("BASIC")).multiply(tblMrg3.numberColumn("CausPct_part")).setName("CausPct_BALANCED").asDoubleArray
var values_2 = getSummedList(values_1.toList)
tblMrg3.addColumns( create("CausAgg_BALANCED", values_2.toArray))
tblMrg3.numberColumn("CausAgg_BALANCED")

Number column: CausAgg_BALANCED

In [35]:
// aggregate for advanced
var values_1 = tblMrg3.numberColumn("total").multiply(tblMrg3.numberColumn("CausPct_part")).setName("CausPct_ADVANCED").asDoubleArray
var values_2 = getSummedList(values_1.toList)
tblMrg3.addColumns( create("CausAgg_ADVANCED", values_2.toArray))
tblMrg3.numberColumn("CausAgg_ADVANCED")

Number column: CausAgg_ADVANCED

In [36]:
tblMrg3

In [37]:
var Cause = tblMrg3.stringColumn("Cause_Ref_Specific").asDoubleArray()
var CausAgg_BASIC = tblMrg3.numberColumn("CausAgg_BASIC").asDoubleArray()
var CausAgg_BALANCED = tblMrg3.numberColumn("CausAgg_BALANCED").asDoubleArray()
var CausAgg_ADVANCED = tblMrg3.numberColumn("CausAgg_ADVANCED").asDoubleArray()

var Plot4 = new Plot {
    title = "Security Score Curves across Packages"
    initHeight = 300
    initWidth = 700
    xLabel = "Cause Categories / Vulnerabilities"
    yLabel = "Security Score"
}
Plot4.add(new Line {
    displayName = "Basic Security Score"
    x = Cause
    y = CausAgg_BASIC 
})
Plot4.add(new Line {
    displayName = "Balanced Security Score"
    x = Cause
    y = CausAgg_BALANCED 
})
Plot4.add(new Line {
    displayName = "Advanced Security Score"
    x = Cause
    y = CausAgg_ADVANCED 
})
//Plot4.setYBound(.70, 1.05)

<br>
<br>
<br>
# Processing: Product Selection

### Product table: context data and matching variables

The product table has products as rows and, as columns, their features / attributes plus the matching concepts calculated from these.  We remove products that are not in the domain of possible solutions.  This includes removal based on contextual information (industry and product category) and on matching variables (company size).  The resulting product table is then ready for filtering.

### Product table: matching constraints

Matching constraints are not as definitive as matching variables.  Instead of removing products that fall outside the domain, we can sort and filter to find the best-fit products for the user's traits.
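
The difference between the two stages can be sketched as a hard filter followed by a soft ranking. Everything in this sketch is hypothetical: the `Product` and `Firm` fields, the sample catalog, and the choice of "closeness of price to budget" as the ranking criterion are assumptions made for illustration, not the WRS schema.

```scala
// Hypothetical shapes for a product and a user firm
final case class Product(name: String, category: String, industries: Set[String],
                         maxEmployees: Int, price: Double)
final case class Firm(industry: String, employees: Int, budget: Double)

// Stage 1, context data and matching variables: products outside the domain are removed
def inDomain(p: Product, f: Firm, neededCategories: Set[String]): Boolean =
  p.industries.contains(f.industry) &&
    neededCategories.contains(p.category) &&
    f.employees <= p.maxEmployees

// Stage 2, matching constraints: remaining products are ranked, not removed
// (assumed criterion: price closest to the firm's budget comes first)
def shortlist(catalog: Seq[Product], f: Firm, neededCategories: Set[String]): Seq[Product] =
  catalog
    .filter(p => inDomain(p, f, neededCategories))
    .sortBy(p => math.abs(p.price - f.budget))

val firm = Firm(industry = "Finance", employees = 50, budget = 1000.0)
val catalog = Seq(
  Product("A", "Backup", Set("Finance"), maxEmployees = 100, price = 1500.0),
  Product("B", "Backup", Set("Finance"), maxEmployees = 100, price = 900.0),
  Product("C", "Backup", Set("Retail"),  maxEmployees = 100, price = 1000.0),
  Product("D", "Backup", Set("Finance"), maxEmployees = 10,  price = 1000.0)
)
val result = shortlist(catalog, firm, Set("Backup"))
```

Here products C and D are dropped by the hard filter (wrong industry, firm too large), while A and B both survive and are only reordered by fit.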

<img src="images/product_table.jpg" width=500 height=300>

<br>
<br>
<br>
# Obtaining Data

Getting the data to determine the 'chance of a security incident' for a firm with particular characteristics is not easy, but the data is available.  Multiple sources publish reports, articles, and blog posts built on their statistics in order to market their acumen.  While this data is not in an appealing raw form, it can be aggregated to obtain what we need.

For instance, the following graph appears in an annual federal report.  Such graphs can be aggregated over time to display interesting patterns.  Different reports derive their statistics from different data sets, which provides overlapping coverage of samples and an opportunity for cross-validation.

In addition, similar plots are available for large corporations and small-/medium-sized businesses.  No single data source has information across all domains.  However, by aggregating across sources we can assemble that coverage.
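
One simple way to aggregate several published sources is to pool their counts, which weights each source by its sample size. The sketch below is an assumption-laden illustration, not a method taken from any of the reports: the numbers are hypothetical, `SourceEstimate` and `pooledRates` are names introduced here, and the samples are assumed not to overlap and to report the same set of causes.

```scala
// One published estimate: incident counts per cause from a sample of firms (hypothetical)
final case class SourceEstimate(source: String, sampleCount: Int, incidents: Map[String, Int])

// Pool sources by summing counts, so larger samples carry proportionally more weight.
// Assumes non-overlapping samples; a cause missing from a source is treated as zero incidents.
def pooledRates(sources: Seq[SourceEstimate]): Map[String, Double] = {
  val totalSample = sources.map(_.sampleCount).sum.toDouble
  require(totalSample > 0, "need at least one sampled firm")
  sources
    .flatMap(_.incidents.toSeq)
    .groupBy { case (cause, _) => cause }
    .map { case (cause, entries) => cause -> entries.map(_._2).sum / totalSample }
}

val sources = Seq(
  SourceEstimate("Report A", 100, Map("Web App Attack" -> 20, "Denial Of Service" -> 10)),
  SourceEstimate("Report B", 300, Map("Web App Attack" -> 40))
)
val pooled = pooledRates(sources)
```

Where samples do overlap, the overlap is better used for cross-validation (comparing the per-source rates) than for pooling, since summing would double-count firms.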

<img src="images/network_security-incidents.jpg" width=500 height=300>

<br>
<br>
<br>
# References: External Data

* [Verizon incident database 2018](https://www.verizonenterprise.com/verizon-insights-lab/dbir/#2018DBIR)
* [Common Weakness Enumeration](https://en.wikipedia.org/wiki/Common_Weakness_Enumeration)
* [Common Vulnerabilities and Exposures](https://en.wikipedia.org/wiki/Common_Vulnerabilities_and_Exposures)
* [Common Vulnerability Scoring System](https://en.wikipedia.org/wiki/Common_Vulnerability_Scoring_System)
* [Exploit Databases](https://www.exploit-db.com/exploits/44938/)


# References: Tablesaw

* [User guide](https://jtablesaw.github.io/tablesaw/userguide/toc)
* [Example notebook](https://github.com/twosigma/beakerx/blob/master/doc/groovy/Tablesaw.ipynb)
* [Tests for determining output](https://github.com/jtablesaw/tablesaw/tree/master/core/src/test/java/tech/tablesaw)

In [165]:
//EXAMPLE: Create a column
var values = List(0.135, 0.135, 0.245, 0.28, 0.28, 0.30500000000000005, 0.31170000000000003, 0.32670000000000005, 0.32720000000000005, 0.3272167)
create("doubles", values.toArray)

Column: doubles
0.135
0.135
0.245
0.28
0.28
0.30500000000000005
0.31170000000000003
0.32670000000000005
0.32720000000000005
0.3272167


In [248]:
//EXAMPLE: Collection scan to accumulate
def getSummedList(list: List[Double]) = list.scan(0.0)((a, b) => a + b)
getSummedList(values)

[[0.0, 0.135, 0.27, 0.515, 0.795, 1.0750000000000002, 1.3800000000000003, 1.6917000000000004, 2.0184000000000006, 2.3456000000000006, 2.6728167000000007, 3.000033400000001]]