#Automating Information Security 
##-- Developing a Log Analyzer

---

###Understanding Threat and Attack

* Threat: A potential for violation of security, which exists when there is a circumstance, capability, action, or event that could breach security and cause harm. That is, a threat is a possible danger that might exploit a vulnerability.

* Attack: An assault on system security that derives from an intelligent threat. That is, an intelligent act that is a deliberate attempt (especially in the sense of a method or technique) to evade security services and violate the security policy of a system.

###What is a security log?

* A security log is a log that contains records of login/logout activity or other security-related events specified by the system's audit policy. 
* The Security Log is one of the primary tools used by Administrators to detect and investigate attempted and successful unauthorized activity and to troubleshoot problems; Microsoft describes it as "**Your Best and Last Defense**".
* Types of data logged: web, Linux, OpenStack, and VMware access; DNS queries; etc.


##Apache web access log: Common Log Format
* The Common Log Format is a standardized text file format used by web servers when generating server log files.
* Each line in a file stored in the Common Log Format has the following syntax: 
```host ident authuser date request status bytes```

For example: ```127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326```

* **127.0.0.1** is the IP address of the client (remote host) which made the request to the server.
* **user-identifier** is the RFC 1413 identity of the client.
* **frank** is the userid of the person requesting the document.
* **[10/Oct/2000:13:55:36 -0700]** is the date, time, and time zone that the request was received, by default in strftime format %d/%b/%Y:%H:%M:%S %z.
* **"GET /apache_pb.gif HTTP/1.0"** is the request line from the client. The method is GET, the requested resource is /apache_pb.gif, and the protocol is HTTP/1.0.
* **200** is the HTTP status code returned to the client. 2xx is a successful response, 3xx a redirection, 4xx a client error, and 5xx a server error.
* **2326** is the size of the object returned to the client, measured in bytes.
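
The field breakdown above can be turned into a small parser. The following is a minimal sketch, not a complete Common Log Format implementation: the names `CLF_PATTERN` and `parse_clf_line` are hypothetical, and the pattern assumes single spaces between fields, as in the example line.

```python
import re
from datetime import datetime

# One named group per Common Log Format field, in the order listed above.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    m = CLF_PATTERN.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # Convert the timestamp using the strftime format quoted above.
    fields['date'] = datetime.strptime(fields['date'], '%d/%b/%Y:%H:%M:%S %z')
    fields['status'] = int(fields['status'])
    # A "-" in the bytes field means no content was returned.
    fields['bytes'] = 0 if fields['bytes'] == '-' else int(fields['bytes'])
    return fields

sample = ('127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_clf_line(sample)
print(record['host'], record['status'], record['bytes'])
```

Returning `None` for a non-matching line lets the caller count malformed lines instead of crashing on them.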



There are many other types of logs, such as the DNS query log, which records information such as the querying client's address and the type and name of the resource record queried.


###The challenges in developing a log analyzer

* Handling large volumes of data
    * Good data structures; big data processing engine and platform
* Selecting/searching info/symptoms from the raw data
    * Understanding systems and networking technologies/protocols behind the data
* Correlating symptoms to their root causes
    * Graphical models; AI/ML
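
As a small illustration of the first challenge, one common approach is to stream a log one line at a time and keep only a compact summary in memory. This is a sketch under assumptions: `count_first_field` is a hypothetical helper, the three log lines are made up, and `io.StringIO` stands in for a large file opened with `open(...)`.

```python
from collections import Counter
import io

def count_first_field(lines):
    """Count the first whitespace-separated field of each non-empty line."""
    counts = Counter()
    for line in lines:              # only one line is held in memory at a time
        parts = line.split(maxsplit=1)
        if parts:
            counts[parts[0]] += 1
    return counts

# Made-up sample data; a real run would pass an open file object instead.
fake_log = io.StringIO(
    '10.0.0.1 - - [12/Dec/2015:18:25:11 +0100] "GET / HTTP/1.1" 200 512\n'
    '10.0.0.2 - - [12/Dec/2015:18:25:12 +0100] "GET /a HTTP/1.1" 404 0\n'
    '10.0.0.1 - - [12/Dec/2015:18:25:13 +0100] "POST /b HTTP/1.1" 200 64\n'
)
host_counts = count_first_field(fake_log)
print(host_counts.most_common(1))
```

Because the function accepts any iterable of lines, the same code works for a file, a list, or a generator.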


##The skills required to implement it!

* Design: 
* Regular expression
* Dictionary (Counter)
* File access
* OOP design


###Specialized Dictionaries

* The "collections" module has several special-purpose dictionaries with modified behavior:
    * OrderedDict: a dictionary that remembers the order keys were inserted
    * defaultdict: a dictionary that supplies a default value for missing keys
    * Counter: a dictionary that counts how many times each key has been seen
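
A brief sketch of how the three classes differ in practice. The variable names and sample values are illustrative only.

```python
from collections import OrderedDict, defaultdict, Counter

# OrderedDict: keys come back in insertion order and can be reordered.
fields = OrderedDict()
fields['host'] = '127.0.0.1'
fields['status'] = 200
fields.move_to_end('host')              # 'host' is now the last key
print(list(fields))

# defaultdict: a missing key is created on first access by calling list().
requests_by_host = defaultdict(list)
requests_by_host['127.0.0.1'].append('/apache_pb.gif')
print(requests_by_host['10.0.0.9'])     # empty list, created on access

# Counter: a missing key reads as 0, so counts can be incremented directly.
status_counts = Counter()
for status in [200, 404, 200]:
    status_counts[status] += 1
print(status_counts[200], status_counts[500])
```

Note that reading a missing key from a Counter returns 0 without storing the key, whereas a defaultdict stores the default it creates.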


###Counter

Counter is a dict subclass that counts the instances of keys and adds a few useful methods: ```most_common()```, ```update()```, ```elements()```, ```subtract()```
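
The four methods can be tried on a tiny example. This is a sketch with made-up sample data (a short list of HTTP method names).

```python
from collections import Counter

methods = Counter(['GET', 'GET', 'POST'])

methods.update(['GET', 'HEAD'])         # add counts from another iterable
print(methods.most_common(1))           # highest count first

methods.subtract(['POST', 'POST'])      # counts may drop to zero or below
print(methods['POST'])

# elements() repeats each key by its count and skips counts below one.
print(sorted(methods.elements()))
```

Unlike the `-` operator on two Counters, `subtract()` keeps zero and negative counts, which is why `POST` still has an entry after the call.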


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os
os.chdir("/content/drive/My Drive/Colab Notebooks/IT170")
currentDirectoryPath = os.getcwd()
print(currentDirectoryPath)

/content/drive/My Drive/Colab Notebooks/IT170


In [None]:
anycontentStr = "Zero trust (ZT) is the term for an evolving set of cybersecurity paradigms that move network defenses from static, network-based perimeters to focus on users, assets, and resources. A zero trust architecture (ZTA) uses zero trust principles to plan enterprise infrastructure and workflows. Zero trust assumes there is no implicit trust granted to assets or user accounts based solely on their physical or network location (i.e., local area networks versus the internet). Authentication and authorization (both user and device) are discrete functions performed before a session to an enterprise resource is established. Zero trust is a response to enterprise network trends that include remote users and cloud-based assets that are not located within an enterprise-owned network boundary. Zero trust focus on protecting resources, not network segments, as the network location is no longer seen as the prime component to the security posture of the resource. This document contains an abstract definition of zero trust architecture (ZTA) and gives general deployment models and use cases where zero trust could improve an enterprise’s overall information technology security posture."

In [None]:
from collections import Counter

wordcount = Counter()
#wordcount.update( open("anycontent.txt").read().lower().split() )
wordcount.update( anycontentStr.lower().split() )
wordcount.most_common(10)


[('trust', 9),
 ('zero', 8),
 ('and', 7),
 ('the', 6),
 ('network', 6),
 ('to', 6),
 ('is', 5),
 ('an', 5),
 ('of', 3),
 ('that', 3)]

In [None]:
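wordcount['trust']                      # 9 at this point (not displayed)
wordcount.update(['trust', 'trust'])    # add two occurrences: 11
wordcount['trust']
wordcount.subtract(['trust'])           # remove one occurrence: 10
wordcount['trust']                      # only this last expression is displayed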

10

Some common regular expression rules (https://pythex.org/ is handy for testing them):

* clientIP
* timestamp
* action


In [None]:
import re
from collections import Counter

#logstr = '109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"'

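# Raw strings keep the backslashes intact for the regex engine.
# The non-greedy .+? stops at the first '[' and the first '"' that fit.
regexp = re.compile(
    r'(?P<clientIP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).+?\['
    r'(?P<timestamp>\d{2}/[A-Z][a-z]{2}/\d{4}).+?"'
    r'(?P<action>[A-Z]+)'   # full method name, e.g. GET, POST, DELETE
    )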

cnt_clientIPs = Counter()

matched = 0
failed = 0
# The with-block closes the file even if an error occurs.
with open('access-sample.log', 'r') as f:
    for line in f:
        m = regexp.match(line)
        if m:
            cnt_clientIPs.update([m.group('clientIP')])
            matched += 1
            # Print the fields only for matching lines; m is None otherwise.
            print("""\
client .........: %s
timestamp ......: %s
action .........: %s
""" % ( m.group('clientIP'),
        m.group('timestamp'),
        m.group('action'),
    ))
        else:
            failed += 1

print('[*] %d lines matched the regular expression' % (matched))
print('[*] %d lines failed to match the regular expression' % (failed), end='\n\n')
print('[*] ============================================')
print('[*] 10 Most Frequently Occurring Clients Queried')
print('[*] ============================================')

for clientIP, count in cnt_clientIPs.most_common(10):
    print('[*] %30s: %d' % (clientIP, count))
print('[*] ============================================')
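
The skills list above mentions OOP design. One possible way to package the same logic as a class is sketched below. `AccessLogAnalyzer` and its methods are hypothetical names, the pattern is a variant of the one used in the cell above, and the sample lines are for illustration only (the second and third are made up).

```python
import re
from collections import Counter

class AccessLogAnalyzer:
    """Hypothetical wrapper: matches lines against a regex and tallies fields."""

    PATTERN = re.compile(
        r'(?P<clientIP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).+?\['
        r'(?P<timestamp>\d{2}/[A-Z][a-z]{2}/\d{4}).+?"'
        r'(?P<action>[A-Z]+)'
    )

    def __init__(self):
        self.matched = 0
        self.failed = 0
        self.clients = Counter()
        self.actions = Counter()

    def feed(self, line):
        """Process one log line; return True if it matched the pattern."""
        m = self.PATTERN.match(line)
        if m is None:
            self.failed += 1
            return False
        self.matched += 1
        self.clients[m.group('clientIP')] += 1
        self.actions[m.group('action')] += 1
        return True

    def feed_all(self, lines):
        """Process any iterable of lines (a list or an open file)."""
        for line in lines:
            self.feed(line)
        return self

lines = [
    '109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1" 200 4263',
    '109.169.248.247 - - [12/Dec/2015:18:25:12 +0100] "POST /administrator/index.php HTTP/1.1" 200 4494',
    'not a log line',
]
analyzer = AccessLogAnalyzer().feed_all(lines)
print(analyzer.matched, analyzer.failed)
print(analyzer.clients.most_common(1))
```

Keeping the counters on the instance means several log files can be fed to the same analyzer, and reporting can be written separately from parsing.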
