# Automating Information Security - Log Analyzer

## Threats vs. Attacks

Threat
*   A potential for violation of security, which exists when there is a circumstance, capability, action, or event that could breach security and cause harm. That is, a threat is a possible danger that might exploit a vulnerability.

Attack
*   An assault on system security that derives from an intelligent threat. That is, an intelligent act that is a deliberate attempt (especially in the sense of a method or technique) to evade security services and violate the security policy of a system.

## What is a security log?

A security log is a log that contains records of login/logout activity or other security-related events specified by the system's audit policy. 

The security log is one of the primary tools administrators use to detect and investigate attempted and successful unauthorized activity and to troubleshoot problems; Microsoft describes it as **Your Best and Last Defense**.

Types of data logged
*   Web, Linux, OpenStack, and VMware access logs; DNS query logs; etc.





## Apache web access log: Common Log Format

The Common Log Format is a standardized text file format used by web servers when generating server log files.

Each line in a file stored in the Common Log Format has the following syntax:

*   Syntax: host ident authuser date request status bytes
*   Example: 127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

  *   127.0.0.1 is the IP address of the client (remote host) which made the request to the server.
  *   user-identifier is the RFC 1413 identity of the client.
  *   frank is the userid of the person requesting the document.
  *   [10/Oct/2000:13:55:36 -0700] is the date, time, and time zone at which the request was received, by default in strftime format %d/%b/%Y:%H:%M:%S %z.
  *   "GET /apache_pb.gif HTTP/1.0" is the request line from the client: GET is the method, /apache_pb.gif is the resource requested, and HTTP/1.0 is the protocol version.
  *   200 is the HTTP status code returned to the client. 2xx is a successful response, 3xx a redirection, 4xx a client error, and 5xx a server error.
  *   2326 is the size of the object returned to the client, measured in bytes.
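
As a rough illustration of the field breakdown above, the sketch below parses the example line with one named group per field. The pattern and the names `CLF_PATTERN` and `STATUS_CLASSES` are assumptions made for this example, not an official grammar, and `%b` relies on an English locale for month names.

```python
import re
from datetime import datetime

# Example line taken from the section above.
line = ('127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326')

# One named group per Common Log Format field (illustrative pattern).
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

m = CLF_PATTERN.match(line)
fields = m.groupdict()

# Convert the timestamp using the strftime format quoted above.
when = datetime.strptime(fields['date'], '%d/%b/%Y:%H:%M:%S %z')

# Split the request line into method, resource, and protocol.
method, resource, protocol = fields['request'].split()

# The first digit of the status code gives its class.
STATUS_CLASSES = {'2': 'success', '3': 'redirection',
                  '4': 'client error', '5': 'server error'}
status_class = STATUS_CLASSES.get(fields['status'][0], 'other')

print(fields['host'], when.isoformat(), method, resource, status_class)
```

Named groups keep the code readable when the number of fields grows, since each value is looked up by field name rather than by position.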










## Other types of logs: DNS query log

Data we might be interested in:
*    Querying client address
*    Resource record type queried
*    Resource record name queried


## Developing a log analyzer: the challenges

*    Handling large volumes of data
  *    Good data structures; big data processing engines and platforms
*    Selecting/searching information and symptoms in the raw data
  *    Understanding the systems and networking technologies/protocols behind the data
*    Correlating symptoms with their root causes
  *    Graphical models; AI/ML


## Tackling the challenges

*    Handling large volumes of data
  *    Good data structures: Counters
*    Selecting/searching information and symptoms in the raw data
  *    Understanding the systems and networking technologies/protocols behind the data: regular expressions
*    Correlating symptoms with their root causes
  *    Graphical models; AI/ML (future learning)


## Our web log (non-standard format)

*   A sample line:
  *   109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"
*   A log file can easily reach hundreds of millions of lines
*   One of our log files is ~600 MB; we take the first 10,000 lines for testing
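
One possible way to take only the first 10,000 lines without loading a whole file into memory is `itertools.islice`. The helper name `head` is hypothetical, and the in-memory `fake_log` stands in for a real file so the sketch runs anywhere.

```python
import io
from itertools import islice


def head(lines, limit=10_000):
    """Return an iterator over at most `limit` lines of the input."""
    return islice(lines, limit)


# Stand-in for a large log file. With a real file, pass the open file
# object instead, e.g. `with open('access-sample.log') as fh: head(fh)`.
fake_log = io.StringIO(''.join('line %d\n' % i for i in range(25_000)))

sample = list(head(fake_log))
print(len(sample), sample[0].strip(), sample[-1].strip())
```

Because a file object yields lines lazily, the remaining lines are never read, which matters when the file is hundreds of megabytes.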



## Recap: regular expression

* A regular expression is a string that defines a pattern to match other strings
* They are concise and efficient but can be difficult to read
* Implemented in Python's re module: import re
* Example: re.findall('regular expression', 'data to search')

### Python re functions:

* re.match(pattern, string): match only at the beginning of the string
* re.search(pattern, string): find the first match anywhere in the string
* re.match() and re.search() return a match object (or None if nothing matches); retrieve the matched text with its .group() method
* re.findall(pattern, string): find all non-overlapping occurrences in the string; return them in a list
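
The difference between the three functions is easiest to see on one string. The text and the `ip` pattern below are made up for this sketch.

```python
import re

text = 'GET /index.html from 10.0.0.5, then POST /login from 10.0.0.9'
ip = r'\d{1,3}(?:\.\d{1,3}){3}'

# match() only succeeds if the pattern matches at position 0.
at_start = re.match(ip, text)        # None: the text starts with "GET"
verb = re.match(r'[A-Z]+', text)     # matches "GET"

# search() scans forward and returns the first match anywhere.
first_ip = re.search(ip, text)

# findall() returns every non-overlapping match as a list of strings.
all_ips = re.findall(ip, text)

print(at_start, verb.group(), first_ip.group(), all_ips)
```

Note the check for `None` that real code needs before calling `.group()`: `at_start.group()` would raise `AttributeError` here.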

### Some common regular expression rules

Use https://pythex.org/ to test patterns for the fields we need:

* clientIP
* timestamp
* action
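
As a starting point, here is one candidate pattern per field, tried against the sample line from our web log. The patterns and the constant names are assumptions for this sketch and should be checked against more of the data before being relied on.

```python
import re

line = ('109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] '
        '"GET /administrator/ HTTP/1.1" 200 4263 "-" '
        '"Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"')

# Candidate patterns; treat them as starting points, not validated rules.
CLIENT_IP = r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'  # dotted quad at line start
TIMESTAMP = r'\[(\d{2}/[A-Z][a-z]{2}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]'
ACTION = r'"([A-Z]+) '                               # method after a quote

client_ip = re.search(CLIENT_IP, line).group()
timestamp = re.search(TIMESTAMP, line).group(1)
action = re.search(ACTION, line).group(1)

print(client_ip, timestamp, action, sep=' | ')
```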




## Recap: Dictionary

* Lists are automatically indexed with an integer
* With dictionaries, you specify a "key" as the index to a "value"
* A mapping in which a given key produces its matching value; since Python 3.7, keys keep their insertion order
* A key can be any immutable (hashable) data type, e.g., integer, string, tuple
* A value can be an integer, string, list, another dictionary, etc.
* Dictionaries are very fast at storing and retrieving data: lookups take constant time on average
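
A small sketch of these points, counting requests per client with a plain dictionary. The IP addresses and variable names are invented for the example.

```python
# Count requests per client with a plain dictionary.
requests = ['10.0.0.5', '10.0.0.9', '10.0.0.5', '10.0.0.5']

hits = {}
for ip in requests:
    # get() supplies 0 the first time a key is seen.
    hits[ip] = hits.get(ip, 0) + 1

# A tuple is immutable, so it can serve as a key too.
by_ip_and_method = {('10.0.0.5', 'GET'): 3, ('10.0.0.9', 'POST'): 1}

print(hits)
print(by_ip_and_method[('10.0.0.5', 'GET')])
```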

### Specialized Dictionaries

* The collections module has several special-purpose dictionaries with modified behavior
  * OrderedDict: a dictionary that remembers the order in which keys were inserted
  * defaultdict: a dictionary that lets you specify a default value for missing keys
  * Counter: a dictionary that counts how many times each key has been seen
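
The three classes can be compared side by side on the same small data set. The `events` list is made up for this sketch.

```python
from collections import Counter, OrderedDict, defaultdict

events = [('10.0.0.5', 'GET'), ('10.0.0.9', 'POST'), ('10.0.0.5', 'HEAD')]

# defaultdict: a missing key starts out as an empty list.
methods_by_ip = defaultdict(list)
for ip, method in events:
    methods_by_ip[ip].append(method)

# Counter: tallies how often each item appears.
hits = Counter(ip for ip, _ in events)

# OrderedDict: keeps insertion order and can reorder keys explicitly.
seen = OrderedDict((ip, True) for ip, _ in events)
seen.move_to_end('10.0.0.5')

print(dict(methods_by_ip), hits['10.0.0.5'], list(seen))
```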


## Counter

Counter is a dict subclass that behaves much like defaultdict(int): it counts the instances of keys, treats missing keys as 0, and adds a few useful methods: most_common(n), update(), elements(), subtract()
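
A short in-memory sketch of the four methods named above, using a made-up list of HTTP methods so that it runs without any input file.

```python
from collections import Counter

c = Counter(['GET', 'GET', 'POST', 'GET', 'HEAD'])

top = c.most_common(2)           # the 2 most frequent keys with their counts
c.update(['POST', 'POST'])       # add counts from another iterable
c.subtract(['HEAD'])             # remove counts; a key may drop to zero
expanded = sorted(c.elements())  # each key repeated count times, zeros skipped

print(top)
print(c['HEAD'], 'HEAD' in c)
print(expanded)
```

After `subtract()`, the `HEAD` key is still present with a count of 0, which is why `elements()` skips it while `'HEAD' in c` remains true.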




In [1]:
from google.colab import drive
import os
drive.mount('/content/drive')
os.chdir("/content/drive/My Drive/data")

Mounted at /content/drive


In [2]:
from collections import Counter

wordcount = Counter()
# Open the file in a with-block so it is closed once it has been read.
with open('anycontent.txt') as fh:
    wordcount.update(fh.read().lower().split())
wordcount.most_common(5)

[('network', 6), ('and', 5), ('security', 4), ('how', 3), ('the', 3)]

In [3]:
wordcount['security']

4

In [4]:
wordcount.update(['security'])
wordcount['security']

5

In [5]:
wordcount.subtract(['security'])
wordcount['security']

4

## Applying OOP

* Fields
* Methods

In [0]:
import re
from collections import Counter

In [0]:
#logstr = '109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] "GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0 (Windows NT 6.0; rv:34.0) Gecko/20100101 Firefox/34.0" "-"'

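# Every piece is a raw string so that \d and \[ reach the regex engine unchanged.
# .+? is non-greedy, so the action is taken from the first quoted field
# (the request line) rather than from a later one such as the user agent.
regexp = re.compile(
    r'(?P<clientIP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).+?\['
    r'(?P<timestamp>\d{2}/[A-Z][a-z]{2}/\d{4}).+?"'
    r'(?P<action>[A-Z]+)'   # whole method name, e.g. GET, POST, DELETE
    )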

cnt_clientIPs = Counter()

In [0]:
f = open('access-sample.log', 'r')

In [0]:
matched = 0
failed = 0
for line in f:
    m = regexp.match(line)
    if m:
        cnt_clientIPs.update([m.group('clientIP')])
        matched += 1
        # Print the parsed fields only for matched lines; m is None
        # otherwise and m.group() would raise AttributeError.
        print("""\
client .........: %s
timestamp ......: %s
action .........: %s
""" % ( m.group('clientIP'),
        m.group('timestamp'),
        m.group('action'),
    ))
    else:
        failed += 1
f.close()

print('[*] %d lines matched the regular expression' % (matched))
print('[*] %d lines failed to match the regular expression' % (failed), end='\n\n')
print('[*] ============================================')
print('[*] 10 Most Frequently Occurring Clients Queried')
print('[*] ============================================')
for clientIP, count in cnt_clientIPs.most_common(10):
    print('[*] %30s: %d' % (clientIP, count))
print('[*] ============================================')
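
In the spirit of the "Applying OOP" heading, the procedural cells above could be wrapped in a class whose fields hold the counters and whose methods parse lines and report results. The class name `LogAnalyzer`, its method names, and the extra `actions` counter are hypothetical choices for this sketch. The first sample line is the one from our web log with the user agent shortened; the second is made up.

```python
import re
from collections import Counter


class LogAnalyzer:
    """Hypothetical wrapper around the regex and counters used above."""

    PATTERN = re.compile(
        r'(?P<clientIP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).+?\['
        r'(?P<timestamp>\d{2}/[A-Z][a-z]{2}/\d{4}).+?"'
        r'(?P<action>[A-Z]+)'
    )

    def __init__(self):
        # Fields: the state the analyzer accumulates.
        self.clients = Counter()
        self.actions = Counter()
        self.matched = 0
        self.failed = 0

    def feed(self, line):
        """Parse one line and update the counters; return True on a match."""
        m = self.PATTERN.match(line)
        if m is None:
            self.failed += 1
            return False
        self.clients[m.group('clientIP')] += 1
        self.actions[m.group('action')] += 1
        self.matched += 1
        return True

    def top_clients(self, n=10):
        """Return the n most frequent client IPs with their counts."""
        return self.clients.most_common(n)


analyzer = LogAnalyzer()
for line in [
    '109.169.248.247 - - [12/Dec/2015:18:25:11 +0100] '
    '"GET /administrator/ HTTP/1.1" 200 4263 "-" "Mozilla/5.0" "-"',
    # Made-up second request from the same client.
    '109.169.248.247 - - [12/Dec/2015:18:25:12 +0100] '
    '"POST /administrator/index.php HTTP/1.1" 200 4494 "-" "Mozilla/5.0" "-"',
    'not a log line',
]:
    analyzer.feed(line)

print(analyzer.matched, analyzer.failed)
print(analyzer.top_clients(1), dict(analyzer.actions))
```

Keeping the counters as fields means several analyzers can run side by side, for example one per log file, without sharing global state.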
