# Machine-based Text Analytics of CyberSecurity Strategies
Uses machine learning to classify sentences from CyberSecurity documents.

**These labels come from the headers in the cyberwellness profiles linked above**

| Category               | Sub category |
|------------------------| -------------|
|LEGAL MEASURES          | CRIMINAL LEGISLATION, REGULATION AND COMPLIANCE|
|TECHNICAL MEASURES      | CIRT, STANDARDS, CERTIFICATION|
|ORGANIZATION MEASURES   | POLICY, ROADMAP FOR GOVERNANCE, RESPONSIBLE AGENCY, NATIONAL BENCHMARKING|
|CAPACITY BUILDING       | STANDARDISATION DEVELOPMENT, MANPOWER DEVELOPMENT, PROFESSIONAL CERTIFICATION, AGENCY CERTIFICATION|
|COOPERATION             | INTRA-STATE COOPERATION, INTRA-AGENCY COOPERATION, PUBLIC SECTOR PARTNERSHIP,  INTERNATIONAL COOPERATION|
|CHILD ONLINE PROTECTION | NATIONAL LEGISLATION,  UN CONVENTION AND PROTOCOL, INSTITUTIONAL SUPPORT, REPORTING MECHANISM|
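
For use in code, the label table above can be written as a plain dictionary. The following is a minimal sketch, not part of the original notebook: the names `TAXONOMY` and `is_valid_tag` are hypothetical, and the labels are lower-cased on the assumption that tags in the training data are stored in lower case.

```python
# Hypothetical dictionary form of the category / sub category table.
# Labels are lower-cased (assumption: the training data uses lower case).
TAXONOMY = {
    "legal measures": ["criminal legislation", "regulation and compliance"],
    "technical measures": ["cirt", "standards", "certification"],
    "organization measures": ["policy", "roadmap for governance",
                              "responsible agency", "national benchmarking"],
    "capacity building": ["standardisation development", "manpower development",
                          "professional certification", "agency certification"],
    "cooperation": ["intra-state cooperation", "intra-agency cooperation",
                    "public sector partnership", "international cooperation"],
    "child online protection": ["national legislation", "un convention and protocol",
                                "institutional support", "reporting mechanism"],
}


def is_valid_tag(category, subcategory):
    """Return True if the (category, subcategory) pair appears in TAXONOMY."""
    return subcategory.lower() in TAXONOMY.get(category.lower(), [])


print(is_valid_tag("technical measures", "standards"))
print(is_valid_tag("cooperation", "standards"))
```

A check like this can catch misspelled or misplaced labels before they reach the training step.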

In [12]:
# When using nltk for the first time, uncomment the following lines and run the cell.
# The nltk data only needs to be downloaded once.
# import nltk
# nltk.download()

In [11]:
%matplotlib inline
import numpy as np
import tensorflow as tf
# word_tokenize lives in nltk.tokenize (there is no nltk.token module)
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
import pickle
import random
import json
from collections import Counter
from pprint import pprint

In [3]:
# Opens training data stored as JSON and converts it to a Python list
with open('results_concatenated.json') as f:
    data = json.load(f)

pprint(data[:4])

[{u'Country': u'Jordan',
  u'sentence': u'However, these approaches: are generally basic; not systematic; subjective; have no clear definition or boundaries, are not thorough; do not meet international standards; and do not deal effectively with threats emerging from cyberspace.',
  u'sentence_id,': u'ff30d97ab4',
  u'tag': [{u'category': u'technical measures',
            u'subcategory': [u'standards']}]},
 {u'Country': u'Jordan',
  u'sentence': u'Strategies and policies developed by the private sector should augment, comply, and be consistent with this strategy.',
  u'sentence_id,': u'e50e3676b6',
  u'tag': [{u'category': u'organization measures',
            u'subcategory': [u'policy']}]},
 {u'Country': u'Jordan',
  u'sentence': u'security policy and role-based security responsibilities will have a higher rate of success in protecting critical information.',
  u'sentence_id,': u'ddd832b614',
  u'tag': [{u'category': u'organization measures',
            u'subcategory': [u'policy']}]
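
A quick way to see how the labels are distributed is to count the categories across records. The sketch below is illustrative only: `count_categories` is a hypothetical helper, and `records` is a small hand-written list that mirrors the structure of the three records printed above rather than the full data set.

```python
from collections import Counter

# Hand-written records mirroring the structure of the printed sample
# (illustration only, not the full training data).
records = [
    {"tag": [{"category": "technical measures", "subcategory": ["standards"]}]},
    {"tag": [{"category": "organization measures", "subcategory": ["policy"]}]},
    {"tag": [{"category": "organization measures", "subcategory": ["policy"]}]},
]


def count_categories(records):
    """Count how often each category appears across all records' tags."""
    counts = Counter()
    for record in records:
        for tag in record["tag"]:
            counts[tag["category"]] += 1
    return counts


print(count_categories(records).most_common())
# [('organization measures', 2), ('technical measures', 1)]
```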

In [10]:
# Splits data into 3 parts: IDs, sentences, and tags

# Optional
sentence_ids = []

# Lexicons created from sentences will be inputs
sentences  = []

# Tags will be outputs
tags = []

for input_val in data:
    sentence_ids.append(input_val[u'sentence_id,'])
    sentences.append(input_val[u'sentence'].encode("utf-8"))
    tags.append(input_val[u'tag'])

# Parenthesized single-argument print gives the same output under Python 2
print("Number of training examples is {} \n".format(len(sentences)))
print("First example is \nX: {} \n\n y: {}".format(sentences[0], tags[0]))

Number of training examples is 2045 

First example is 
X: However, these approaches: are generally basic; not systematic; subjective; have no clear definition or boundaries, are not thorough; do not meet international standards; and do not deal effectively with threats emerging from cyberspace. 

 y: [{u'category': u'technical measures', u'subcategory': [u'standards']}]
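
Since the tags are the outputs, each sample's tag list needs to become a numeric vector at some point. One possible encoding, shown here as a sketch and not necessarily the approach this notebook takes, is a multi-hot vector over the six categories. `CATEGORIES` and `encode_tags` are hypothetical names, and the category order is an assumption taken from the label table.

```python
# Assumed category order, taken from the label table and lower-cased.
CATEGORIES = [
    "legal measures", "technical measures", "organization measures",
    "capacity building", "cooperation", "child online protection",
]


def encode_tags(tag_list, categories=CATEGORIES):
    """Turn one sample's tag list into a multi-hot vector over categories."""
    vector = [0] * len(categories)
    for tag in tag_list:
        category = tag["category"]
        if category in categories:
            vector[categories.index(category)] = 1
    return vector


example = [{"category": "technical measures", "subcategory": ["standards"]}]
print(encode_tags(example))  # [0, 1, 0, 0, 0, 0]
```

A multi-hot vector allows a sentence to carry more than one category, which the list-valued `tag` field suggests is possible.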


In [13]:
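# Creates lexicon (list of unique lemmatized words) from all training samples
def create_lexicon(sentences):
    lexicon = []
    seen = set()  # fast membership checks; the list keeps first-seen order
    for sentence in sentences:
        # Tokenize the lower-cased sentence into individual words
        for word in word_tokenize(sentence.lower()):
            root = lemmatizer.lemmatize(word)
            if root not in stop and root not in seen:
                seen.add(root)
                lexicon.append(root)

    return lexicon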
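
The lexicon-building idea can be tried on toy input without downloading any NLTK data. The sketch below uses stand-ins that are assumptions, not the notebook's components: a regex tokenizer replaces `word_tokenize`, `TOY_STOP` is a tiny hand-picked stop-word set, and no lemmatization is applied.

```python
import re

# Stand-ins (assumptions): a tiny stop-word set and a regex tokenizer
# in place of nltk's stopwords and word_tokenize; no lemmatization.
TOY_STOP = {"the", "and", "with", "be", "should"}


def toy_tokenize(sentence):
    """Split a sentence into lower-cased alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def toy_lexicon(sentences):
    """Collect unique non-stop words in first-seen order."""
    lexicon, seen = [], set()
    for sentence in sentences:
        for word in toy_tokenize(sentence):
            if word not in TOY_STOP and word not in seen:
                seen.add(word)
                lexicon.append(word)
    return lexicon


samples = ["Policy should comply with the strategy.",
           "The strategy and the policy."]
print(toy_lexicon(samples))  # ['policy', 'comply', 'strategy']
```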

In [14]:
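# Creates 2D array containing number of occurrences of each lexicon word in each sample
def produce_X(sentences, lexicon):
    # Map each lexicon word to its column index for fast lookup
    index = {word: i for i, word in enumerate(lexicon)}
    X = []
    for sentence in sentences:
        X_sample = [0] * len(lexicon)
        # Tokenize and lemmatize the same way as when the lexicon was built
        for word in word_tokenize(sentence.lower()):
            root = lemmatizer.lemmatize(word)
            if root in index:
                X_sample[index[root]] += 1

        X.append(X_sample)

    return np.array(X)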
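
To make the shape of the resulting feature matrix concrete, here is a small bag-of-words sketch on toy input. It is an illustration under stated assumptions: `toy_counts` is a hypothetical helper, a regex tokenizer stands in for `word_tokenize`, lemmatization is skipped, and plain lists are returned instead of a NumPy array.

```python
import re


def toy_counts(sentences, lexicon):
    """Return one row of word counts per sentence, one column per lexicon word."""
    index = {word: i for i, word in enumerate(lexicon)}
    rows = []
    for sentence in sentences:
        row = [0] * len(lexicon)
        # Regex tokenizer used as a stand-in for nltk's word_tokenize
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word in index:
                row[index[word]] += 1
        rows.append(row)
    return rows


lexicon = ["policy", "strategy", "security"]
sentences = ["Security policy and security roles.", "This strategy."]
print(toy_counts(sentences, lexicon))  # [[1, 0, 2], [0, 1, 0]]
```

Each row has as many entries as the lexicon has words, so the matrix has one row per sentence and one column per lexicon word.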