# Machine-based Text Analytics of CyberSecurity Strategies
Uses machine learning to classify sentences from CyberSecurity documents

**These labels come from the headers in the cyberwellness profiles linked above**

| Category               | Sub category |
|------------------------| -------------|
|LEGAL MEASURES          | CRIMINAL LEGISLATION, REGULATION AND COMPLIANCE|
|TECHNICAL MEASURES      | CIRT, STANDARDS, CERTIFICATION|
|ORGANIZATION MEASURES   | POLICY, ROADMAP FOR GOVERNANCE, RESPONSIBLE AGENCY, NATIONAL BENCHMARKING|
|CAPACITY BUILDING       | STANDARDISATION DEVELOPMENT, MANPOWER DEVELOPMENT, PROFESSIONAL CERTIFICATION, AGENCY CERTIFICATION|
|COOPERATION             | INTRA-STATE COOPERATION, INTRA-AGENCY COOPERATION, PUBLIC SECTOR PARTNERSHIP, INTERNATIONAL COOPERATION|
|CHILD ONLINE PROTECTION | NATIONAL LEGISLATION, UN CONVENTION AND PROTOCOL, INSTITUTIONAL SUPPORT, REPORTING MECHANISM|

In [7]:
# When using nltk for the first time, run this cell to download the required corpora.
# The data only needs to be downloaded once; comment out nltk.download() afterwards.
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [12]:
%matplotlib inline
import numpy as np
import tensorflow as tf
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
import pickle
import random
import json
from collections import Counter
from pprint import pprint

## Preprocessing Data
The following cells read training samples from a JSON file, build a lexicon from them, create arrays that store the number of occurrences of each lexicon word in each sample, and serialize the generated features.

In [15]:
# Opens training data stored as JSON and converts it to a Python list
with open('results_concatenated.json') as f:
    data = json.load(f)

pprint(data[:3])

[{'Country': 'Jordan',
  'sentence': 'However, these approaches: are generally basic; not systematic; '
              'subjective; have no clear definition or boundaries, are not '
              'thorough; do not meet international standards; and do not deal '
              'effectively with threats emerging from cyberspace.',
  'sentence_id,': 'ff30d97ab4',
  'tag': [{'category': 'technical measures', 'subcategory': ['standards']}]},
 {'Country': 'Jordan',
  'sentence': 'Strategies and policies developed by the private sector should '
              'augment, comply, and be consistent with this strategy.',
  'sentence_id,': 'e50e3676b6',
  'tag': [{'category': 'organization measures', 'subcategory': ['policy']}]},
 {'Country': 'Jordan',
  'sentence': 'security policy and role-based security responsibilities will '
              'have a higher rate of success in protecting critical '
              'information.',
  'sentence_id,': 'ddd832b614',
  'tag': [{'category': 'organization measu

In [17]:
# Splits data into 3 parts: IDs, sentences, and tags

# For testing purposes
sentence_ids = []

# Lexicons created from sentences will be inputs
sentences  = []

# Tags will be outputs
tags = []

for input_val in data:
    # The ID key in the source JSON really does end with a comma ('sentence_id,')
    sentence_ids.append(input_val[u'sentence_id,'])
    sentences.append(input_val[u'sentence'])
    tags.append(input_val[u'tag'])

print("Number of training examples is {} \n".format(len(sentences)))
print("First example is \nX: {} \n\n y: {}".format(sentences[0], tags[0]))

Number of training examples is 2045 

First example is 
X: However, these approaches: are generally basic; not systematic; subjective; have no clear definition or boundaries, are not thorough; do not meet international standards; and do not deal effectively with threats emerging from cyberspace. 

 y: [{'category': 'technical measures', 'subcategory': ['standards']}]


In [69]:
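# Creates lexicon (list of unique words) from all training samples
def create_lexicon(sentences):
    lexicon = []
    for sentence in sentences:
        for word in word_tokenize(sentence):
            # Lowercase and lemmatize each token, then encode it to bytes
            root = lemmatizer.lemmatize(word.lower()).encode('utf-8')
            # Caveat: on Python 3, `stop` holds str while `root` is bytes, so the
            # stopword test never matches; drop .encode('utf-8') to filter stopwords.
            if len(root) > 1 and root not in stop and root not in lexicon:
                lexicon.append(root)

    return lexicon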

In [70]:
# Creates 2D array containing the number of occurrences of each lexicon word in each sample
def produce_X(sentences, lexicon):
    X = []
    for sentence in sentences:
        X_sample = [0 for _ in lexicon]
        for word in word_tokenize(sentence):
            root = lemmatizer.lemmatize(word.lower()).encode('utf-8')
            if root in lexicon:
                X_sample[lexicon.index(root)] += 1
        
        X.append(X_sample)
    
    return np.array(X)
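
The counting step above calls `lexicon.index(root)` for every token, which scans the list each time. A minimal sketch of the same idea with a dictionary lookup follows. The names `build_index` and `count_vector` are hypothetical, and whitespace splitting on toy data stands in for `word_tokenize` and the lemmatizer so that the snippet runs without NLTK.

```python
from collections import Counter

import numpy as np


def build_index(lexicon):
    """Map each lexicon entry to its column position for constant-time lookup."""
    return {word: i for i, word in enumerate(lexicon)}


def count_vector(tokens, index):
    """Return per-word occurrence counts for one tokenized sentence."""
    row = np.zeros(len(index), dtype=int)
    for word, n in Counter(tokens).items():
        if word in index:
            row[index[word]] = n
    return row


# Toy data (not from the training set); str.split replaces the NLTK tokenizer here.
toy_lexicon = ["security", "policy", "standard"]
toy_index = build_index(toy_lexicon)
toy_row = count_vector("security policy and security standard".split(), toy_index)
print(toy_row)  # [2 1 1]
```

Words outside the lexicon ("and" in the toy sentence) are ignored, which matches the behaviour of `produce_X`.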

In [71]:
sample_lexicon = create_lexicon(sentences)
pprint(sample_lexicon[:6])

[b'however', b'these', b'approach', b'are', b'generally', b'basic']


In [72]:
X = produce_X(sentences, sample_lexicon)

print(X)

[[1 1 1 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 1 1 1]]


In [73]:
# Saves the generated features in NumPy's .npy format for reuse

with open('sample_X.npy', 'wb') as f:
    np.save(f, X)

### Categories (index 0-5)
0. LEGAL MEASURES
1. TECHNICAL MEASURES
2. ORGANIZATION MEASURES
3. CAPACITY BUILDING
4. COOPERATION
5. CHILD ONLINE PROTECTION

> Categories will be stored as a 1D array with each number corresponding to a category listed above. 

In [74]:
# Dictionary stores label names and corresponding index to be turned on in the one hot vector.
category_dict = {
    u'LEGAL MEASURES' : 0,
    u'TECHNICAL MEASURES' : 1,
    u'ORGANIZATION MEASURES' : 2,
    u'CAPACITY BUILDING' : 3,
    u'COOPERATION' : 4,
    u'CHILD ONLINE PROTECTION' : 5
}

In [75]:
def produce_y(tags):
    return np.array([category_dict[tag[0][u'category'].upper()] for tag in tags])
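
Note that `produce_y` reads only `tag[0]`, so a sentence carrying several tags is labelled with its first category alone. The sketch below illustrates this on made-up tags. `CATEGORY_INDEX` is a hypothetical copy of the mapping so the snippet runs on its own, and the toy tag values are illustrative rather than taken from the training file.

```python
import numpy as np

# Hypothetical stand-alone copy of the category-to-index mapping.
CATEGORY_INDEX = {
    'LEGAL MEASURES': 0,
    'TECHNICAL MEASURES': 1,
    'ORGANIZATION MEASURES': 2,
    'CAPACITY BUILDING': 3,
    'COOPERATION': 4,
    'CHILD ONLINE PROTECTION': 5,
}


def first_category_index(tag_list):
    """Return the class index of the first tag; any further tags are ignored."""
    return CATEGORY_INDEX[tag_list[0]['category'].upper()]


# Toy tags: the second sentence has two tags, but only 'cooperation' is used.
toy_tags = [
    [{'category': 'technical measures', 'subcategory': ['standards']}],
    [{'category': 'cooperation', 'subcategory': ['international cooperation']},
     {'category': 'legal measures', 'subcategory': ['criminal legislation']}],
]
toy_y = np.array([first_category_index(t) for t in toy_tags])
print(toy_y)  # [1 4]
```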

In [76]:
y = produce_y(tags)

print(y)

[1 2 2 ..., 2 1 2]


In [77]:
with open('tags.npy', 'wb') as f:
    np.save(f, y)

## TensorFlow Boilerplate
To simplify the TensorFlow code, we will define a set of functions to declare variables.

In [78]:
def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

In [1]:
def fc_layer(X, W, b, name='fc'):
    with tf.name_scope(name):
        return tf.nn.relu(tf.matmul(X, W) + b)

## Loading the Data
Now that the data has been processed, we can load it and fit a model to it.

In [79]:
with open("sample_X.npy","rb") as f:
    X = np.load(f)

print(X)

[[1 1 1 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 1 1 1]]


In [81]:
with open("tags.npy","rb") as f:
    y = np.load(f)

print(y)

[1 2 2 ..., 2 1 2]


In [None]:
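lexicon_size = len(X[0])
# One output per category; max(y) would give the largest index (5), one short of the 6 classes
number_of_options = len(category_dict)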

In [5]:
def one_hot(vector):
    # Returns one row per label, with a 1 at the label's index and 0 elsewhere
    def hot_or_not(i, j):
        return 1 if i == j else 0
    return [[hot_or_not(i, j) for j in range(number_of_options)] for i in vector]
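
The same encoding can be sketched in vectorized form by indexing an identity matrix with the label array. The helper name `one_hot_np` is hypothetical, and the number of classes is passed explicitly rather than read from a global.

```python
import numpy as np


def one_hot_np(labels, num_classes):
    """Row i of the result is all zeros except for a 1 at column labels[i]."""
    labels = np.asarray(labels)
    return np.eye(num_classes, dtype=np.float32)[labels]


# Toy labels using the six category indices (0-5).
toy_one_hot = one_hot_np([1, 2, 5], num_classes=6)
print(toy_one_hot.shape)  # (3, 6)
```

Each row sums to 1, and `toy_one_hot.argmax(axis=1)` recovers the original labels.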

## Constructing the Model
Now we can create a neural network to fit the data.

In [None]:
X_placeholder = tf.placeholder(tf.float32, [None, lexicon_size])

In [None]:
w1 = weight_variable([lexicon_size, 10])
b1 = bias_variable([10])

model = fc_layer(X_placeholder, w1, b1, name='fc1')

In [None]:
w2 = weight_variable([10, 10])
b2 = bias_variable([10])

# Feed the output of fc1 (shape [None, 10]) into fc2, not the raw input
model = fc_layer(model, w2, b2, name='fc2')

In [None]:
w3 = weight_variable([10, number_of_options])
b3 = bias_variable([number_of_options])

y_predicted = tf.matmul(model, w3) + b3
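
To check the shapes flowing through this stack without building a TensorFlow graph, the forward pass can be traced in plain NumPy. This is a rough sketch, assuming the layers are chained as fc1, then fc2, then the linear output layer. The toy sizes, the `forward` helper, and the ordinary (not truncated) normal initialization are illustrative assumptions, not the notebook's actual values.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(z):
    """Element-wise max(z, 0), the activation used by fc_layer."""
    return np.maximum(z, 0.0)


def forward(X, params):
    """Two ReLU layers followed by a linear layer that returns raw logits."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(X @ w1 + b1)
    h2 = relu(h1 @ w2 + b2)
    return h2 @ w3 + b3


# Toy sizes: 8 lexicon words, 10 hidden units per layer, 6 categories.
toy_shapes = [(8, 10), (10, 10), (10, 6)]
# Weights ~ N(0, 0.1) and biases fixed at 0.1, loosely mirroring the helpers above.
toy_params = [(rng.normal(0.0, 0.1, size=s), np.full(s[1], 0.1)) for s in toy_shapes]

toy_X = rng.integers(0, 3, size=(4, 8)).astype(np.float32)  # 4 fake count vectors
toy_logits = forward(toy_X, toy_params)
print(toy_logits.shape)  # (4, 6)
```

Each row of the result holds one unnormalized score per category, which is what `y_predicted` represents in the graph.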