# Preliminary EDA

An Advanced Persistent Threat (APT) is a type of cyber attack in which an attacker gains unauthorized access to a network and remains undetected for an extended period of time. The goal of an APT is usually to steal sensitive information or to disrupt critical systems. APT attacks are typically carried out by well-funded and highly skilled attackers, such as nation-state actors or organized criminal groups.



## Part 1: Load and analyze ecarbro data (data to join computer data to network activity)

In [22]:
import json
import pandas as pd
import matplotlib.pyplot as plt


# Load the file, which stores one JSON record per line (JSON Lines)
with open('ecarbro.json', 'r') as file:
    # Parse each line into a dict and collect the records in a Python list
    data_list = [json.loads(line) for line in file]

# Create a pandas DataFrame
ecarbro_df = pd.DataFrame(data_list)

# Print the shape of the DataFrame
print("Number of rows:", ecarbro_df.shape[0])
print("Number of columns:", ecarbro_df.shape[1])


Number of rows: 1183297
Number of columns: 12


In [21]:
# Print the first 5 records
ecarbro_df.head()


Unnamed: 0,timestamp,id,hostname,objectID,object,action,actorID,pid,ppid,tid,principal,properties
0,2019-09-17T09:26:35.03-04:00,e504a981-2536-4b12-a44c-f4b2ae723110,SysClient0073.systemia.com,3636319d-d1e7-493d-a688-482c23beb4c6,FLOW,INFO,d4d5df43-c149-4ca1-be0c-2baee5bab45d,4164,1092,-1,NT AUTHORITY\SYSTEM,"{'acuity_level': '1', 'bro_uid': 'CaHP0p2A2tAD..."
1,2019-09-17T09:26:35.152-04:00,5d8b62a0-396c-456a-af0f-064040cfc40d,SysClient0073.systemia.com,b25a9b16-5c75-4861-a037-af9db6afd72d,FLOW,INFO,d4d5df43-c149-4ca1-be0c-2baee5bab45d,4164,1092,-1,NT AUTHORITY\SYSTEM,"{'acuity_level': '1', 'bro_uid': 'CijQwi3KjteE..."
2,2019-09-17T09:26:35.002-04:00,ebfe4bf6-b43f-44a9-8efd-208e5192c85e,SysClient0073.systemia.com,85d2081c-6ee6-4265-b6a0-0f68ff5a04e4,FLOW,INFO,d4d5df43-c149-4ca1-be0c-2baee5bab45d,4164,1092,-1,NT AUTHORITY\SYSTEM,"{'acuity_level': '1', 'bro_uid': 'Cf3eUg4f14sM..."
3,2019-09-17T09:26:35.191-04:00,11b93dcc-3331-4663-ab3d-4392239a9767,SysClient0073.systemia.com,b21d18d5-c78f-419c-a33c-239fa64d9618,FLOW,INFO,d4d5df43-c149-4ca1-be0c-2baee5bab45d,4164,1092,-1,NT AUTHORITY\SYSTEM,"{'acuity_level': '1', 'bro_uid': 'C7STqe1ZJtPo..."
4,2019-09-17T09:26:35.199-04:00,e9f2889f-c880-42d9-92d7-6ac686d1fde5,SysClient0073.systemia.com,05050754-9fa2-4946-b014-e12b220b72af,FLOW,INFO,d4d5df43-c149-4ca1-be0c-2baee5bab45d,4164,1092,-1,NT AUTHORITY\SYSTEM,"{'acuity_level': '1', 'bro_uid': 'CcetpP3lA1NM..."


### Group by Object and Action type

In [4]:
# Group the dataframe by the "Object" and "Action" columns and display the count of each group
grouped = ecarbro_df.groupby(["object", "action"]).size().reset_index(name='Count')
print(grouped)

  object action    Count
0   FLOW   INFO  1183297


## Part 2: Load and analyze connection log data (network activity)

In [23]:
import pandas as pd


# Define the column names
col_names = ['ts', 'uid', 'id.orig_h', 'id.orig_p', 'id.resp_h', 'id.resp_p', 'proto', 'service', 'duration',
             'orig_bytes', 'resp_bytes', 'conn_state', 'local_orig', 'local_resp', 'missed_bytes', 'history',
             'orig_pkts', 'orig_ip_bytes', 'resp_pkts', 'resp_ip_bytes', 'tunnel_parents']


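# Read the tab-separated Zeek (Bro) connection log, treating '(empty)' and '-' as missing.
# Note: skiprows=7 leaves the '#types' metadata line in the data (it is the first row of
# the output below), which shifts every value one column to the left of its name.
# A standard Zeek header is 8 lines long, so skiprows=8 should remove that line.
conn_df = pd.read_csv('conn.09_00_00-10_00_00.log', sep='\t', header=None, names=col_names,
                 na_values=['(empty)', '-'], keep_default_na=False, skiprows=7)

# Check the first 5 rows of the dataframe
conn_df.head()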




Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,proto,service,duration,orig_bytes,...,conn_state,local_orig,local_resp,missed_bytes,history,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,tunnel_parents
#types,time,string,addr,port,addr,port,enum,string,interval,count,...,string,bool,bool,count,string,count,count,count,count,set[string]
1569229194.008118,CXjvkk4Cn2Z9gtezOk,142.20.58.111,52559,165.101.35.5,443,tcp,ssl,0.009248,917,752,...,F,F,0,ShADadfF,8,1249,8,1084,,
1569229194.209692,CJtLsd15U6DBxokNYk,142.20.57.118,52465,153.129.45.5,443,tcp,ssl,0.010520,933,7408,...,F,F,0,ShADadfF,7,1225,13,7940,,
1569229194.229933,CadbW9GsF5vphwxGl,142.20.57.62,50181,86.62.223.5,443,tcp,ssl,0.012368,1109,7760,...,F,F,0,ShADadfF,7,1401,12,8252,,
1569229194.239066,C2GVgr1qQ7xf9uE2Aj,142.20.57.62,50185,86.62.223.5,443,tcp,ssl,0.009342,1109,12560,...,F,F,0,ShADadfF,7,1401,15,13172,,
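
The `#types` row at the top of the preview indicates that one header line survived `skiprows=7`. A minimal sketch of a more defensive loader follows, assuming the file uses the usual Zeek ASCII layout in which metadata lines start with `#` and the column names are listed on the `#fields` line. The helper `read_zeek_log` and the shortened sample log are hypothetical and exist only for illustration.

```python
import io
import pandas as pd

def read_zeek_log(handle):
    """Read a Zeek/Bro TSV log, taking the column names from its '#fields' line."""
    names = None
    data_lines = []
    for line in handle:
        if line.startswith("#fields"):
            # '#fields<TAB>ts<TAB>uid...' -> drop the tag, keep the names
            names = line.rstrip("\n").split("\t")[1:]
        elif not line.startswith("#"):
            # Keep data rows; other '#...' metadata lines (header, '#close') are skipped
            data_lines.append(line)
    if names is None:
        raise ValueError("no '#fields' line found")
    return pd.read_csv(
        io.StringIO("".join(data_lines)), sep="\t", header=None, names=names,
        na_values=["(empty)", "-"], keep_default_na=False,
    )

# Made-up, shortened log with only five of the real columns
sample_log = "\n".join([
    "#separator \\x09",
    "#set_separator\t,",
    "#empty_field\t(empty)",
    "#unset_field\t-",
    "#path\tconn",
    "#open\t2019-09-23-09-00-00",
    "#fields\tts\tuid\tid.orig_h\tproto\tservice",
    "#types\ttime\tstring\taddr\tenum\tstring",
    "1569229194.008118\tCXjvkk4Cn2Z9gtezOk\t142.20.58.111\ttcp\tssl",
    "1569229194.209692\tCJtLsd15U6DBxokNYk\t142.20.57.118\ttcp\t-",
    "#close\t2019-09-23-10-00-00",
]) + "\n"

conn_sample = read_zeek_log(io.StringIO(sample_log))
print(conn_sample)
```

Reading the names from the file avoids hard-coding both the header length and the column list. The sketch holds all data lines in memory before parsing, which may be too costly for very large logs.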


### Group by Protocol type

In [26]:
# Group the dataframe by the "proto" column and display the count of each group, largest first
grouped = conn_df.groupby(["proto"]).size().reset_index(name='Count').sort_values(by='Count', ascending=False)
print(grouped)

      proto    Count
5       ssl  1543302
3      http    82258
1       dns     6339
6  ssl,smtp      515
4  smtp,ssl      419
0                  1
2      enum        1


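**Commentary:** A protocol is a set of rules that computers follow to exchange information over a network. The counts above are dominated by `ssl`, followed by `http` and `dns`. These values are application-layer services rather than transport protocols such as TCP or UDP, and the `#types` row in the preview suggests that the column names are shifted by one position, so the column labeled `proto` is most likely holding the `service` field. Once the alignment is confirmed, we will investigate whether any protocol or service is more conducive to malicious activity.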


## Part 3: Load and analyze ecar data (computer data)

In [12]:
# Read the first 100K records of the JSON file
ecar_df = pd.read_json('AIA-1-25.ecar.json', lines=True, nrows=100000)

# Check the first 5 rows of the dataframe
print(ecar_df.head())

          action                               actorID  \
0         CREATE  89f91b70-9613-4e70-9473-23bb029e3889   
1         CREATE  89f91b70-9613-4e70-9473-23bb029e3889   
2         CREATE  89f91b70-9613-4e70-9473-23bb029e3889   
3  REMOTE_CREATE  ff8a4d62-eb4a-4cfc-aea2-1870d4dba3f1   
4         CREATE  88984aa1-d004-41d9-a498-8e9eab111f62   

                     hostname                                    id  object  \
0  SysClient0004.systemia.com  97e0d110-cc53-4baf-a293-4646e6d967a9  THREAD   
1  SysClient0004.systemia.com  9adb7fc5-73f3-4567-8f6b-adbbc4a9ba3d  THREAD   
2  SysClient0004.systemia.com  4fc209b1-c5e4-49a2-94ec-d2310f211a02  THREAD   
3  SysClient0004.systemia.com  a7624d42-57d8-4c31-bc71-191f8be09a49  THREAD   
4  SysClient0022.systemia.com  77bb2643-59a8-45e0-ad50-b198a4d8e5f0  THREAD   

                               objectID   pid  ppid principal  \
0  9e902e20-df97-4120-8ed5-01b61980bab8   312    -1             
1  bafe59cd-d8af-4be1-96b1-e7c727c1aec8   312 

In [17]:
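# Count the records for each combination of "object" and "action" in the ecar sample
grouped = ecar_df.groupby(["object", "action"]).size().reset_index(name='Count')

# Calculate the total number of records in the dataframe
total_records = ecar_df.shape[0]

# Calculate the percent of total count for each combination of "action" and "object"
grouped['Percent'] = grouped['Count'] / total_records * 100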

# Sort the data by "Percent" in descending order
grouped.sort_values(by='Percent', ascending=False, inplace=True)

# Reset the index of the dataframe
grouped.reset_index(drop=True, inplace=True)

# Display the results
print(grouped)

          object         action  Count  Percent
0           FLOW          START  55801   55.801
1           FLOW        MESSAGE  20921   20.921
2         MODULE           LOAD   5430    5.430
3           FILE         MODIFY   4525    4.525
4           FILE           READ   2962    2.962
5           FILE          WRITE   2823    2.823
6         THREAD      TERMINATE   1619    1.619
7         THREAD         CREATE   1533    1.533
8        PROCESS           OPEN   1282    1.282
9           FILE         CREATE    984    0.984
10          FILE         RENAME    368    0.368
11         SHELL        COMMAND    335    0.335
12        THREAD  REMOTE_CREATE    318    0.318
13          FILE         DELETE    273    0.273
14      REGISTRY           EDIT    239    0.239
15          FLOW           OPEN    189    0.189
16       PROCESS         CREATE    164    0.164
17       PROCESS      TERMINATE    151    0.151
18          TASK          START     28    0.028
19  USER_SESSION          GRANT     16  

**Commentary:** We next looked at the host event data, which records actions such as opening or creating a file. The full dataset contains over 17 million events, and we will perform the full EDA in our PySpark pipeline. As a first step, we loaded the first 100K records to identify considerations to raise with other subject matter experts.

We observed that the `FLOW`/`START` and `FLOW`/`MESSAGE` object and action pairs make up roughly 77% of the sampled events (55.8% and 20.9%, respectively). A flow is a communication between two hosts or machines over a network.

We will perform further research and analysis to determine whether malicious activity takes place within these events and, if so, how to distinguish malicious events from regular ones.
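
Before the PySpark pipeline is in place, the same object and action counts can be extended beyond the first 100K records by streaming the file in chunks. The sketch below is one possible approach in pandas, not the pipeline described above; the helper `count_object_action` and the four-record sample are hypothetical and made up for illustration.

```python
import io
from collections import Counter

import pandas as pd

def count_object_action(source, chunksize=100_000):
    """Count (object, action) pairs in a JSON Lines source, one chunk at a time."""
    counts = Counter()
    # With lines=True and chunksize set, read_json yields DataFrames lazily
    for chunk in pd.read_json(source, lines=True, chunksize=chunksize):
        counts.update(zip(chunk["object"], chunk["action"]))
    result = pd.DataFrame(
        [(obj, act, n) for (obj, act), n in counts.items()],
        columns=["object", "action", "Count"],
    )
    # Share of all events seen, in percent
    result["Percent"] = result["Count"] / result["Count"].sum() * 100
    return result.sort_values("Count", ascending=False).reset_index(drop=True)

# Made-up sample records; a real call would pass the path to the ecar file
sample = io.StringIO(
    '{"object": "FLOW", "action": "START"}\n'
    '{"object": "FLOW", "action": "START"}\n'
    '{"object": "FILE", "action": "READ"}\n'
    '{"object": "FLOW", "action": "START"}\n'
)
summary = count_object_action(sample, chunksize=2)
print(summary)
```

Only the running counts are kept in memory, so the peak footprint is set by `chunksize` rather than by the size of the file.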