# Implementing the Whitehawk Recommendation System

The Whitehawk Recommendation System is a knowledge-based hybrid system. Its development begins with a naive, constraint-based approach. As more external data is integrated, the system evolves to include additional approaches, including utility-based and case-based recommendation. See the notebook [Recommender Systems and Knowledge-based Selections](https://github.com/WhitehawkCEC/cyber/blob/feature/DSR-77-calculate-questionnaire-indicators/core/decision-engine/src/main/scala/com/whitehawk/decisionengine/notes_RecommenderSystems.ipynb) for a short introduction to these systems and an explanation of why the knowledge-based, constraint-based approach is the appropriate starting point for Whitehawk.

In this notebook, we apply theory from the recommendation-system literature and add two concepts that extend the theory to meet Whitehawk Exchange needs:

First, we will use the term __matching concepts__ for the variables and constraints that are calculated from raw data (questionnaire or otherwise) and used to match users to items. These are _concepts_ because they are defined abstractions that are subject to change. _Matching concepts_ differ from the variables typically used in recommendation systems, which are tangible item characteristics such as price.

Second, the term __security score__ denotes the coverage an organization gains by implementing the package selections of the Whitehawk Recommendation System. It is the probability of successfully defending against any security incident within a particular context (industry, firm size, etc.). The _security score_ is defined on the interval [0,1] and is evaluated as 1 minus the probability of incurring a successful security incident (whether it originates from internal personnel, internal IT infrastructure, or an external attack). This should be a data-driven statistic that accounts for products, services, policies / guidelines, and any other recommendations that Whitehawk identifies as necessary for providing protection. Importantly, it should serve as a proxy for external scores, such as the BitSight Score, that may be used for assessing an organization's security.

# Requirements

### Questionnaire requirements

There are two requirements for the questionnaire to be flexible: matching concepts must be used for selection, and the questionnaire must support multiple types of users. The first requirement naturally enables the second.

In the first year of Whitehawk, it was determined that customers should not be matched to products using raw metrics; instead, calculated metrics (matching concepts) should be used to infer abstract variables and constraints. For instance, instead of using the number of employees as a proxy for the general concept of Size, a calculated metric (a polynomial) is used that combines four questionnaire variables:

* C (Int): number of computers the customer owns.
* E (Int): number of employees the customer has.
* L (Int): number of locations the customer has.
* U (Int): number of users the customer has.

_The specifics are outlined in [documentation](https://whitehawkcec.atlassian.net/wiki/spaces/WDP/pages/360316939/Customer+Metrics+Indicators)_
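As a rough illustration of how such a calculated metric could look, the sketch below combines C, E, L, and U into a single Size score. The coefficients, the degree-1 form of the polynomial, the category cut-offs, and all names (`SizeInputs`, `Weights`, `sizeScore`, `sizeCategory`) are assumptions made for this example; the actual definition lives in the linked documentation.

```scala
object SizeConcept {
  // Raw questionnaire answers used by the Size matching concept.
  final case class SizeInputs(computers: Int, employees: Int, locations: Int, users: Int)

  // Hypothetical coefficients; the real ones are defined in the linked documentation.
  final case class Weights(c: Double, e: Double, l: Double, u: Double)
  val assumedWeights: Weights = Weights(c = 0.3, e = 0.4, l = 0.2, u = 0.1)

  // Degree-1 polynomial in C, E, L, U (an assumed form, for illustration only).
  def sizeScore(in: SizeInputs, w: Weights = assumedWeights): Double = {
    require(
      in.computers >= 0 && in.employees >= 0 && in.locations >= 0 && in.users >= 0,
      "questionnaire counts must be non-negative")
    w.c * in.computers + w.e * in.employees + w.l * in.locations + w.u * in.users
  }

  // Assumed cut-offs that turn the numeric score into a categorical Size value.
  def sizeCategory(score: Double): String =
    if (score < 25) "Small" else if (score < 250) "Medium" else "Large"
}
```

Because callers only see `sizeScore` and `sizeCategory`, the weights or even the polynomial form can change without affecting the questionnaire.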

Matching concepts can evolve without the change being noticeable to the user. They are also useful for supporting an _adaptable_ questionnaire for multiple types of users, the second requirement: because the questionnaire is decoupled from the general concepts, it may be changed in response to the needs of different user types. An adaptable questionnaire may be more difficult to implement because it requires different questions for each type of user. However, a separate matching concept can be defined for each user type. In the Size example above, one calculation may be used for the finance industry and a different one for healthcare. Also, as new data sources are integrated, matching concepts can incorporate data from them.
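One way to picture a separate matching concept per user type is a lookup from user type to a calculation over the raw answers. The following is a minimal sketch under that assumption; the industries, weights, fallback rule, and names (`SizeByUserType`, `sizeConcepts`, `defaultSize`) are hypothetical.

```scala
object SizeByUserType {
  // Raw questionnaire answers keyed by variable name (C, E, L, U).
  type Answers = Map[String, Int]
  // A matching concept maps raw answers to a calculated value.
  type Concept = Answers => Double

  // Builds a weighted-sum concept; variables missing from the answers count as 0.
  private def weighted(weights: Map[String, Double]): Concept =
    answers => weights.map { case (k, w) => w * answers.getOrElse(k, 0) }.sum

  // Hypothetical per-industry definitions of the Size concept.
  val sizeConcepts: Map[String, Concept] = Map(
    "Finance"    -> weighted(Map("E" -> 0.5, "U" -> 0.5)),
    "Healthcare" -> weighted(Map("C" -> 0.25, "E" -> 0.25, "L" -> 0.5))
  )

  // Assumed fallback for user types without a dedicated definition.
  val defaultSize: Concept = weighted(Map("E" -> 1.0))

  def size(userType: String, answers: Answers): Double =
    sizeConcepts.getOrElse(userType, defaultSize)(answers)
}
```

Adding a user type, or a new data source, then amounts to registering another entry rather than changing the selection logic.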

### Product selection requirements

Product selection is currently performed using a static template. Multiple tests are run against this template to ensure that the results of the questionnaire meet the acceptance criteria described by Subject Matter Experts in Advisory Services. These requirements will evolve as more users interact with the system. Current requirements include the following, taken from the [solution bundle output tests](https://github.com/WhitehawkCEC/service-bundle-template/blob/master/tests/test_output.py):

_within Template_ 

* [X] the number of rows should equal the number of combinations of headers => check_template_rec()
* [X] all products should be used => check_template_product_use()

_within Bundle (template row)_

* [X] number of products should increase (strictly monotonic): Basic, Balanced, Advanced => check_bundle_product_count()
* should they be the same?

_within Package_

* there must be one, and only one, product for each product-category
* products cannot be duplicated across product-categories?
* ~~product name must be related to scale~~

_across Prices_

* [X] within bundle, prices should increase: Basic, Balanced, Advanced => check_price_incr()
* [X] globally, prices should be reasonable (limited by upper and lower bounds) => check_price_limits()
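To make the bundle-level and price-level requirements concrete, here is a small sketch of how they could be expressed as predicates. It is not the code in `test_output.py`; the `Pkg` type and the function names are assumptions chosen for this example.

```scala
object BundleChecks {
  // One package (tier) inside a bundle, i.e. inside one template row.
  final case class Pkg(tier: String, products: Seq[String], price: Double)

  val tierOrder: Seq[String] = Seq("Basic", "Balanced", "Advanced")

  // True when every element is strictly smaller than its successor.
  private def strictlyIncreasing[A](xs: Seq[A])(implicit ord: Ordering[A]): Boolean =
    xs.zip(xs.drop(1)).forall { case (a, b) => ord.lt(a, b) }

  // Orders packages as Basic, Balanced, Advanced.
  private def ordered(bundle: Seq[Pkg]): Seq[Pkg] =
    bundle.sortBy(p => tierOrder.indexOf(p.tier))

  // Requirement: the number of products increases strictly across tiers.
  def productCountIncreases(bundle: Seq[Pkg]): Boolean =
    strictlyIncreasing(ordered(bundle).map(_.products.size))

  // Requirement: within a bundle, prices increase across tiers.
  def priceIncreases(bundle: Seq[Pkg]): Boolean =
    strictlyIncreasing(ordered(bundle).map(_.price))

  // Requirement: every price lies between a lower and an upper bound.
  def pricesWithinLimits(bundle: Seq[Pkg], lower: Double, upper: Double): Boolean =
    bundle.forall(p => p.price >= lower && p.price <= upper)
}
```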

# Product Selection

The basic implementation steps are addressed in the notebook [Recommender Systems and Knowledge-based Selections](https://github.com/WhitehawkCEC/cyber/blob/feature/DSR-77-calculate-questionnaire-indicators/core/decision-engine/src/main/scala/com/whitehawk/decisionengine/notes_RecommenderSystems.ipynb#Implementation).

### Input

The input will consist of:

* questionnaire results that map to arbitrary matching concepts
* product table with rows (products), columns (features / attributes, and matching concepts calculated from these)
* scenario of user needs that are chosen by an advisor (basic, balanced, advanced)

### Matching concepts and security score

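The matching concepts include all the variables and constraints necessary to find a consistent recommendation for the Constraint Satisfaction Problem (CSP) (V = Vc ∪ Vprod, D, C = Cr ∪ Cf ∪ Cprod ∪ REQ). In the notation of the constraint-based recommender literature, Vc are customer properties, Vprod are product properties, D are the variable domains, Cr are compatibility constraints, Cf are filter conditions, Cprod are product constraints, and REQ are the customer requirements. Each matching concept must be categorized as one of the following: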

* variable - defines a domain of possible values; within Vc (industry, size), the security scores of all items in the recommended packages sum to one
* constraint - depends on context and takes the form `if <context>, then <result>`; context information includes package (basic, balanced, advanced), maturity, ...

The security scores of all items in the recommended packages, within Vc, must sum to one because, by taking the steps recommended by the Whitehawk Recommendation System, the user has reduced their probability of incurring a successful security incident to a trivial amount.
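The two properties of the security score described in this notebook (it equals 1 minus the incident probability, and item scores within a context sum to one) can be sketched as simple checks. The function names and the numeric tolerance are assumptions for illustration.

```scala
object SecurityScore {
  // Security score: 1 minus the probability of a successful security incident.
  def fromIncidentProbability(p: Double): Double = {
    require(p >= 0.0 && p <= 1.0, "probability must lie in [0, 1]")
    1.0 - p
  }

  // Checks that the per-item scores of the recommended packages within one
  // context (e.g. one industry and size) sum to one, up to a small tolerance.
  def sumsToOne(itemScores: Seq[Double], tol: Double = 1e-9): Boolean =
    math.abs(itemScores.sum - 1.0) <= tol
}
```

A check like `sumsToOne` could be run per context whenever item scores are recalculated from new data.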

### Selection Process

These are the steps in the recommendation process:

_(before selection)_

* define matching concept rules for product table
* implement matching concepts in product table

_(selection)_

* import matching concepts from questionnaire indicators
* transform raw questionnaire to input data
* remove products that are not within variable domains
* remove products that do not meet contextual constraints
* sort and filter products based on scenario needs
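The selection steps above can be sketched with plain Scala collections, independent of Tablesaw. Everything here is an assumption made for illustration: the `Product` and `Customer` fields, the "Any" industry wildcard, the maturity constraint, and the rule that each scenario picks one product per category by price.

```scala
object SelectionSketch {
  final case class Product(name: String, category: String, industry: String,
                           minMaturity: Int, sellingPrice: Double)
  final case class Customer(industry: String, maturity: Int)

  // Remove products that are not within the variable domains (here: industry).
  def withinDomain(c: Customer)(p: Product): Boolean =
    p.industry == c.industry || p.industry == "Any"

  // Remove products that do not meet contextual constraints:
  // if the customer's maturity is below the product's minimum, exclude it.
  def meetsConstraints(c: Customer)(p: Product): Boolean =
    p.minMaturity <= c.maturity

  // Sort and filter by scenario: cheapest for basic, most expensive for
  // advanced, the middle candidate otherwise (assumed rule).
  def pick(scenario: String, candidates: Seq[Product]): Product = {
    val sorted = candidates.sortBy(_.sellingPrice)
    scenario match {
      case "basic"    => sorted.head
      case "advanced" => sorted.last
      case _          => sorted(sorted.size / 2)
    }
  }

  // Returns exactly one product per category that survives the filters.
  def select(c: Customer, scenario: String, products: Seq[Product]): Map[String, Product] =
    products
      .filter(withinDomain(c))
      .filter(meetsConstraints(c))
      .groupBy(_.category)
      .map { case (category, ps) => category -> pick(scenario, ps) }
}
```

Grouping by category before picking keeps the package-level requirement of one, and only one, product per product-category.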

In [None]:
/*
(before selection)
define matching concept rules for product table
implement matching concepts in product table

(selection code)
import matching concepts from questionnaire indicators
transform raw questionnaire to input data
perform selection for: basic, balanced, advanced

*/

_Note:_ the section 'Reference: Tablesaw DataFrame' must be run before the code below.

In [43]:
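// Example selection: the most expensive product in the Antimalware category.
//products.selectWhere(column("Header-Industry").isEqualTo("Healthcare"))
val res_1 = products.selectWhere(column("Category").isEqualTo("Antimalware"))
// A leading "-" sorts the column in descending order.
val res_2 = res_1.sortOn("-SellingPrice")
res_2.first(1)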

# Reference: Tablesaw DataFrame 

[example notebook](http://127.0.0.1:7777/notebooks/doc/groovy/Tablesaw.ipynb)

In [1]:
%%classpath add mvn
tech.tablesaw tablesaw-plot 0.11.4
tech.tablesaw tablesaw-smile 0.11.4
tech.tablesaw tablesaw-beakerx 0.11.4

Added jars: [commons-collections-3.2.2.jar, fastutil-8.1.1.jar, jsr305-1.3.9.jar, xchart-3.5.0.jar, opencsv-4.1.jar, filters-2.0.235.jar, commons-text-1.1.jar, gson-2.8.2.jar, smile-core-1.4.0.jar, tablesaw-plot-0.11.4.jar, guava-23.0.jar, commons-math3-3.6.1.jar, error_prone_annotations-2.0.18.jar, animal-sniffer-annotations-1.14.jar, tablesaw-beakerx-0.11.4.jar, VectorGraphics2D-0.11.jar, swing-worker-1.1.jar, smile-math-1.4.0.jar, tablesaw-core-0.11.4.jar, snappy-0.4.jar, tablesaw-smile-0.11.4.jar, commons-beanutils-1.9.3.jar, swingx-1.6.1.jar, commons-logging-1.2.jar, smile-graph-1.4.0.jar, jsoup-1.11.2.jar, RoaringBitmap-0.6.51.jar, smile-plot-1.4.0.jar, slf4j-api-1.7.21.jar, commons-lang3-3.6.jar, smile-data-1.4.0.jar, j2objc-annotations-1.1.jar]


In [2]:
%import tech.tablesaw.aggregate.*
%import tech.tablesaw.api.*
%import tech.tablesaw.api.ml.clustering.*
%import tech.tablesaw.api.ml.regression.*
%import tech.tablesaw.columns.*

// display Tablesaw tables with BeakerX table display widget
tech.tablesaw.beakerx.TablesawDisplayer.register()

null

In [8]:
val products = Table.read().csv("./dataCombine.csv")
products.first(5)

In [7]:
products.structure()
products.columnNames()
products.shape()

6597 rows X 23 cols

In [11]:
import tech.tablesaw.api.QueryHelper.column
products.structure().selectWhere(column("Column Type").isEqualTo("FLOAT"))

In [None]:
/*Mapping operations
def month = tornadoes.dateColumn("Type").month()
tornadoes.addColumn(month);
tornadoes.columnNames()
*/

In [16]:
//Sorting by column
products.sortOn("-SellingPrice").first(5)
products.column("SellingPrice").summary()

In [18]:
//Performing totals and sub-totals
// val evaluates the aggregation once, so the name set below is kept
val priceByType = products.median("SellingPrice").by("Type")
priceByType.setName("Median price by Type")
priceByType

In [27]:
//Cross Tabs
CrossTab.xCount(products, products.column("Category"), products.column("Type"))