# Anomaly Detection: Acquire Data

## About the data

https://csr.lanl.gov/data/cyber1/  

This data set represents 58 consecutive days of de-identified event data collected from five sources within Los Alamos National Laboratory’s corporate, internal computer network.

The data sources include Windows-based authentication events from both individual computers and centralized Active Directory domain controller servers; process start and stop events from individual Windows computers; Domain Name Service (DNS) lookups as collected on internal DNS servers; network flow data as collected at several key router locations; and a set of well-defined red teaming events that present bad behavior within the 58 days. In total, the data set is approximately 12 gigabytes compressed across the five data elements and presents 1,648,275,307 events for 12,425 users, 17,684 computers, and 62,974 processes.

To the extent possible under law, Los Alamos National Laboratory has waived all copyright and related or neighboring rights to Comprehensive, Multi-Source Cyber-Security Events. This work is published from: United States.


## Skills
1. Create a dictionary
2. Access keys and values from a dictionary
3. Acquire multiple files from a URL
4. Unzip files
5. Read a sample of each file
6. Use StringIO to parse the file contents into a delimited file
7. Create multiple data frames in a loop using a dictionary
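
The skills above can be sketched end to end without touching the network. This is a minimal sketch under stated assumptions: the sample rows are invented, the names `columns`, `samples`, and `frames` are illustrative only, and the text is compressed in memory to stand in for a downloaded `.txt.gz` file.

```python
import gzip
from io import StringIO

import pandas as pd

# Column names per file, keyed by file name (a two-entry subset for illustration).
columns = {
    'dns': ['time', 'source_computer', 'computer_resolved'],
    'redteam': ['time', 'user_domain', 'source_computer', 'destination_computer'],
}

# Invented sample rows, gzip-compressed in memory to mimic downloaded files.
samples = {
    'dns': gzip.compress(b'1,C100,C200\n2,C101,C201\n'),
    'redteam': gzip.compress(b'5,U1@DOM1,C100,C200\n'),
}

frames = {}
for name, cols in columns.items():
    # Unzip the bytes and decode them to a string.
    text = gzip.decompress(samples[name]).decode('utf-8')
    # StringIO lets read_csv treat the string like a file.
    frames[name] = pd.read_csv(StringIO(text), header=None, names=cols)

print(frames['dns'])
```

Keeping the column names and the resulting data frames in dictionaries with matching keys means one loop can handle any number of files.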

In [43]:
import gzip
from io import StringIO
import pandas as pd
from urllib.request import urlopen

# Column names for each file, keyed by file name
names_dict = {
    'auth': ['time', 'source_user_domain', 'destination_user_domain', 'source_computer',
             'destination_computer', 'authentication_type', 'logon_type',
             'authentication_orientation', 'success_failure'],
    'proc': ['time', 'user_domain', 'computer', 'process_name', 'start_end'],
    'flows': ['time', 'duration', 'source_computer', 'source_port', 'destination_computer',
              'destination_port', 'protocol', 'packet_count', 'byte_count'],
    'dns': ['time', 'source_computer', 'computer_resolved'],
    'redteam': ['time', 'user_domain', 'source_computer', 'destination_computer'],
}

In [49]:
print('Auth Values:')
print(names_dict['auth'])

Auth Values:
['time', 'source_user_domain', 'destination_user_domain', 'source_computer', 'destination_computer', 'authentication_type', 'logon_type', 'authentication_orientation', 'success_failure']


In [1]:
print('Dict Keys:')
print(names_dict.keys())

Dict Keys:
dict_keys(['auth', 'proc', 'flows', 'dns', 'redteam'])


In [51]:
print('Dict Values:')
print(names_dict.values())

Dict Values:
dict_values([['time', 'source_user_domain', 'destination_user_domain', 'source_computer', 'destination_computer', 'authentication_type', 'logon_type', 'authentication_orientation', 'success_failure'], ['time', 'user_domain', 'computer', 'process_name', 'start_end'], ['time', 'duration', 'source_computer', 'source_port', 'destination_computer', 'destination_port', 'protocol', 'packet_count', 'byte_count'], ['time', 'source_computer', 'computer_resolved'], ['time', 'user_domain', 'source_computer', 'destination_computer']])


In [52]:
print('First Key:')
print(list(names_dict.keys())[0])

First Key:
auth
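
Converting the keys to a list just to take the first one works, but a dictionary can also be walked directly. The following is a small sketch using a shortened stand-in dictionary; `names_small` is a hypothetical name and not part of the notebook.

```python
# A shortened stand-in for the notebook's dictionary of column names.
names_small = {
    'proc': ['time', 'user_domain', 'computer', 'process_name', 'start_end'],
    'dns': ['time', 'source_computer', 'computer_resolved'],
}

# First key without building a list: dicts preserve insertion order (Python 3.7+).
first_key = next(iter(names_small))
print(first_key)

# Keys and values together, one pair per iteration.
for name, cols in names_small.items():
    print(name, len(cols))

# .get() returns a default instead of raising KeyError for a missing key.
missing = names_small.get('auth', [])
```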


In [61]:
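# Base URL of the data files (assumed to be the dataset page listed above)
path = 'https://csr.lanl.gov/data/cyber1/'
d = {}
for f in names_dict:
    resp = urlopen(path + f + '.txt.gz')
    # Read only the first 200 characters of the decompressed text;
    # the cut can fall mid-line, leaving a partial final row.
    with gzip.open(resp, 'rt') as r:
        file_content = r.read(200)
    # Parse the sample as CSV, naming columns from the dictionary
    d[f] = pd.read_csv(StringIO(file_content), header=None, names=names_dict[f])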
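
The loop above samples each file by character count. A possible variant, sketched here under assumptions, samples by line count instead so that every row in the sample is complete. The compressed bytes, `fake_response`, and `n_lines` are invented stand-ins for a real network response and are not part of the notebook.

```python
import gzip
from io import BytesIO, StringIO
from itertools import islice

import pandas as pd

# Stand-in for a network response: gzip-compressed bytes in a file-like object.
payload = gzip.compress(b'1,C1,C2\n2,C3,C4\n3,C5,C6\n4,C7,C8\n')
fake_response = BytesIO(payload)

n_lines = 3  # sample size in lines (an arbitrary choice for this sketch)
with gzip.open(fake_response, 'rt') as r:
    # Iterating a text-mode gzip stream yields whole lines;
    # islice stops after the first n_lines of them.
    sample_text = ''.join(islice(r, n_lines))

df = pd.read_csv(StringIO(sample_text), header=None,
                 names=['time', 'source_computer', 'computer_resolved'])
print(df)
```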

In [62]:
d

{'auth':    time    source_user_domain destination_user_domain source_computer  \
 0     1  ANONYMOUS LOGON@C586    ANONYMOUS LOGON@C586           C1250   
 1     1  ANONYMOUS LOGON@C586    ANONYMOUS LOGON@C586            C586   
 2     1            C101$@DOM1              C101$@DOM1            C988   
 
   destination_computer authentication_type logon_type  \
 0                 C586                NTLM    Network   
 1                 C586                   ?    Network   
 2                 C988                   ?        Net   
 
   authentication_orientation success_failure  
 0                      LogOn         Success  
 1                     LogOff         Success  
 2                        NaN             NaN  ,
 'proc':    time  user_domain computer process_name start_end
 0     1     C1$@DOM1       C1          P16     Start
 1     1  C1001$@DOM1    C1001           P4     Start
 2     1  C1002$@DOM1    C1002           P4     Start
 3     1  C1004$@DOM1    C1004           P4

In [63]:
d['proc']

   time  user_domain computer process_name start_end
0     1     C1$@DOM1       C1          P16     Start
1     1  C1001$@DOM1    C1001           P4     Start
2     1  C1002$@DOM1    C1002           P4     Start
3     1  C1004$@DOM1    C1004           P4     Start
4     1  C1017$@DOM1    C1017           P4     Start
5     1  C1018$@DOM1    C1018           P4     Start
6     1  C1020$@DOM1    C1020           P3     Start
7     1          NaN      NaN          NaN       NaN
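
The final row of the `proc` sample is mostly empty because `read(200)` stops after 200 characters, which can land in the middle of a line. One possible way to handle this, sketched here with a short string built from the rows shown above rather than the real download, is to cut the text at its last newline before parsing. The helper name `trim_partial_line` is hypothetical.

```python
from io import StringIO

import pandas as pd


def trim_partial_line(text):
    """Return text up to and including its last newline (empty if there is none)."""
    return text[:text.rfind('\n') + 1]


# Sample that ends mid-line, as a fixed-size read might.
sample = '1,C1$@DOM1,C1,P16,Start\n1,C1001$@DOM1,C1001,P4,Start\n1,'
cols = ['time', 'user_domain', 'computer', 'process_name', 'start_end']

# Parsed as-is, the dangling "1," becomes a row padded with NaN.
raw = pd.read_csv(StringIO(sample), header=None, names=cols)
# Trimmed first, only the complete rows are parsed.
clean = pd.read_csv(StringIO(trim_partial_line(sample)), header=None, names=cols)

print(len(raw), len(clean))
```

Trimming the text keeps the fixed-size read from the original loop while ensuring every parsed row is complete.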
