# HACS202 Final Project
## Data Analysis Section
## Group A, Spring 2020

Akilesh Praveen

This notebook walks through scraping data from honeypot logs, tidying that data, and interpreting it.

This repository also contains the raw scripts used to perform the data scraping, with additional debug features included. They are in the `scripts` directory.

### Section 0: Imports

Libraries and initialization for this project.

In [1]:
import re
import pandas as pd

### Section 1: Data Collection

First, let's extract the data from the `mitm_files` folder. The script is a little involved, but it essentially separates each log into 'attacks'. An attack begins when the script detects that a connection was initiated, and it ends when the script detects that the connection was closed. From each attack, we extract the date and time, the number of login attempts, whether a password was used in each authentication attempt, the IP address of the attacker, and the client that the attacker was using.

This data was then transferred from a list of lists to a dataframe.

_Connections that were established but never closed produced ambiguous data and were therefore excluded._
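To make the connection pattern concrete, here is a small sketch that applies the same opening-line regular expression to a single log line. The sample line is synthetic: it was constructed to fit the pattern used in `analyze_attack`, not copied from the honeypot logs, and the name `CONNECT_RE` is introduced here for illustration only.

```python
import re

# Same pattern as in analyze_attack, written as raw strings
CONNECT_RE = re.compile(
    r"([0-9]{4}-[0-9]{2}-[0-9]{2})\s([0-9]{2}:[0-9]{2}:[0-9]{2})\.[0-9]{3}\s-\s"
    r"\[Debug\]\s\[Connection\]\sAttacker\sconnected:\s(.*)\s\|\s"
    r"Client\sIdentification:\s(.*)"
)

# Synthetic line built to match the pattern (assumption: not a real log entry)
sample = (
    "2020-05-04 15:14:00.123 - [Debug] [Connection] Attacker connected: "
    "195.3.147.47 | Client Identification: SSH-2.0-paramiko_1.8.1"
)

match = CONNECT_RE.search(sample)

# The four capture groups map to date, time, IP address, and client string
date, time_of_day, ip, client = match.groups()
print(date, time_of_day, ip, client)
```

The milliseconds after the time are matched but not captured, which is why the `time` column holds whole seconds only.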

In [2]:
def analyze_attack(fp):
    line = fp.readline()

    if not line:
        return None

    # read the first line of an attack
    initial = re.search(r"([0-9]{4}-[0-9]{2}-[0-9]{2})\s([0-9]{2}:[0-9]{2}:[0-9]{2})\.[0-9]{3}\s-\s\[Debug\]\s\[Connection\]\sAttacker\sconnected:\s(.*)\s\|\sClient\sIdentification:\s(.*)", line)

    if initial:
        my_date = initial.group(1)
        my_time = initial.group(2)
        my_ip = initial.group(3)
        my_client = initial.group(4)

        # number of attempts + whether a password was used or not
        my_attempts = 0
        my_passwords = []

        # this attack continues until we find the 'Attacker closed the connection' line
        while True:
            nextline = fp.readline()

            if not nextline:
                # reached end of file without a close line: exclude this attack
                return None

            found_end = re.search(r"Attacker\sclosed\sthe\sconnection$", nextline)

            if found_end:
                return [my_date, my_time, my_ip, my_client, my_attempts, my_passwords]

            # this line is not the end, we can get data from it
            
            # check to see if password attempts have increased
            attempt_add = re.search(r"has\sso\sfar\smade", nextline)

            if attempt_add:
                my_attempts += 1

            password_add = re.search(r'trying\sto\sauthenticate\swith\s"(.*)"', nextline)

            if password_add:
                # "none" is recorded as no password used
                my_passwords.append(password_add.group(1) != "none")

    else:
        # not a distinct 'attack'- move to next event
        read_until_next(fp)
        return ["", "", "", "", "", []]


def read_until_next(fp):
    line = fp.readline()

    while (line):
        nextgroup = re.search(r"Attacker\sclosed\sthe\sconnection$", line)

        if nextgroup:
            break
        else:
            line = fp.readline()
            
# use analyze_attack to store all attacks in a list for now


mitm_data_102 = []
mitm_data_103 = []

# open the log for 102 and analyze it (103 is handled in the next cell)

with open('../raw_data/mitm_files/mitm_file102', 'r') as fp:
    # fp is now our file pointer to the mitm data file.

    # skip the five lines at the top
    for _ in range(5):
        line = fp.readline()

        
    while line:
        curr = analyze_attack(fp)
        if curr is None:
            break
        else:
            mitm_data_102.append(curr)


            
# move the contents of data to a dataframe
mitm_df_102 = pd.DataFrame(mitm_data_102, columns = ['date', 'time', 'ip_address', 'client_type', 'attempts', 'password_used' ])   
mitm_df_102

| | date | time | ip_address | client_type | attempts | password_used |
|---|---|---|---|---|---|---|
| 0 | 2020-05-04 | 15:14:00 | 195.3.147.47 | SSH-2.0-paramiko_1.8.1 | 1 | [False, True] |
| 1 | 2020-05-04 | 15:14:08 | 195.3.147.47 | SSH-2.0-OpenSSH_5.3 | 1 | [False, True] |
| 2 | 2020-05-04 | 15:14:20 | 195.3.147.47 | SSH-2.0-WinSCP_release_5.1.5 | 1 | [False, True] |
| 3 | 2020-05-04 | 15:16:08 | 193.105.134.45 | SSH-2.0-WinSCP_release_5.7.5 | 1 | [False, True] |
| 4 | 2020-05-04 | 15:16:19 | 103.89.89.242 | SSH-2.0-paramiko_2.7.1 | 2 | [True, False, True] |
| 5 | | | | | | [] |
| 6 | 2020-05-04 | 15:19:23 | 193.105.134.45 | SSH-2.0-OpenSSH_5.2 | 1 | [False, True] |
| 7 | 2020-05-04 | 15:22:59 | 195.3.147.47 | SSH-2.0-WinSCP_release_5.1.5 | 1 | [False, True] |
| 8 | 2020-05-04 | 15:24:47 | 195.3.147.47 | SSH-2.0-OpenSSH_3.9p1 | 1 | [False, True] |
| 9 | 2020-05-04 | 15:27:16 | 195.3.147.47 | SSH-2.0-paramiko_1.8.1 | 1 | [False, True] |


In [3]:
# make a similar dataframe for log 103

with open('../raw_data/mitm_files/mitm_file103', 'r') as fp:
    # fp is now our file pointer to the mitm data file.

    # skip the five lines at the top
    for _ in range(5):
        line = fp.readline()

        
    while line:
        curr = analyze_attack(fp)
        if curr is None:
            break
        else:
            mitm_data_103.append(curr)
            
mitm_df_103 = pd.DataFrame(mitm_data_103, columns = ['date', 'time', 'ip_address', 'client_type', 'attempts', 'password_used' ])   
mitm_df_103

| | date | time | ip_address | client_type | attempts | password_used |
|---|---|---|---|---|---|---|
| 0 | 2020-05-04 | 15:29:40 | 193.105.134.45 | SSH-2.0-Granados-1.0 | 1 | [False, True] |
| 1 | 2020-05-04 | 15:45:47 | 195.3.147.47 | SSH-2.0-paramiko_1.7.7.1 | 1 | [False, True] |
| 2 | 2020-05-04 | 15:56:02 | 180.214.238.55 | SSH-2.0-Go | 1 | [False, True] |
| 3 | 2020-05-04 | 16:01:08 | 195.3.147.47 | SSH-2.0-OpenSSH_5.2 | 1 | [False, True] |
| 4 | 2020-05-04 | 16:08:14 | 193.105.134.45 | SSH-2.0-libssh_0.5.5 | 1 | [False, True] |
| 5 | 2020-05-04 | 16:31:13 | 195.3.147.47 | SSH-2.0-OpenSSH_6.2 | 1 | [False, True] |
| 6 | 2020-05-04 | 16:45:28 | 195.3.147.47 | SSH-2.0-libssh_0.5.5 | 1 | [False, True] |
| 7 | 2020-05-04 | 16:53:34 | 195.3.147.47 | SSH-2.0-OpenSSH_5.9 | 1 | [False, True] |
| 8 | 2020-05-04 | 16:55:27 | 193.105.134.45 | SSH-2.0-PuTTY_Release_0.64 | 1 | [False, True] |
| 9 | 2020-05-04 | 17:12:14 | 195.3.147.47 | SSH-2.0-Nmap_SSH2_Hostkey | 1 | [False, True] |


#### Container 104

Interestingly, container 104 had such an erratic connect and disconnect pattern that separate events were nearly indistinguishable. Its data could not easily be converted into the same format as the data from containers 102 and 103.

Although we cannot quantify it at this time, the observation is still worth recording: it is already a clear difference between container 104 and containers 102 and 103.
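One rough way to put a number on this would be to count connect and close events in a log and compare the totals. The sketch below is an assumption about how that could be done, not something this notebook ran. `count_connection_events` is a hypothetical helper, and the example lines are synthetic placeholders shaped like the patterns used in `analyze_attack`.

```python
import re

# Patterns mirroring the ones used in analyze_attack
CONNECT_RE = re.compile(r"Attacker\sconnected:")
CLOSE_RE = re.compile(r"Attacker\sclosed\sthe\sconnection$")


def count_connection_events(lines):
    """Return (connects, closes) for an iterable of log lines."""
    connects = closes = 0
    for line in lines:
        line = line.rstrip("\n")
        if CONNECT_RE.search(line):
            connects += 1
        elif CLOSE_RE.search(line):
            closes += 1
    return connects, closes


# Synthetic lines (assumption: real log lines carry timestamps and more detail)
example = [
    "[Connection] Attacker connected: 192.0.2.1 | Client Identification: SSH-2.0-Go\n",
    "[Connection] Attacker connected: 192.0.2.2 | Client Identification: SSH-2.0-Go\n",
    "[Connection] Attacker closed the connection\n",
]
connects, closes = count_connection_events(example)
print(connects, closes)
```

Because the function accepts any iterable of lines, an open file object could be passed in directly. A large gap between the two counts would be one possible sign of the interleaved sessions described above.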

#### Login attempts

We will now collect the login attempts, and their details, for all of our containers. These can be found in the `raw_data/logins` folder.

Like before, we'll read the data into lists of lists, then proceed to represent them as dataframes.

### Section 2: Tidying Data

Here we use Python and pandas features to tidy the data we have placed in dataframes.
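As a sketch of what this tidying could involve, the example below drops the blank placeholder rows that `analyze_attack` emits for non-attack events, merges `date` and `time` into one timestamp, and collapses the `password_used` list into a single flag. The column names `timestamp` and `any_password` are assumptions made for illustration. The two populated rows and the blank row are copied from the 102 output in Section 1.

```python
import pandas as pd

# Rows as produced by analyze_attack, including one blank placeholder row
rows = [
    ["2020-05-04", "15:14:00", "195.3.147.47", "SSH-2.0-paramiko_1.8.1", 1, [False, True]],
    ["", "", "", "", "", []],
    ["2020-05-04", "15:16:19", "103.89.89.242", "SSH-2.0-paramiko_2.7.1", 2, [True, False, True]],
]
columns = ["date", "time", "ip_address", "client_type", "attempts", "password_used"]
df = pd.DataFrame(rows, columns=columns)

# Drop placeholder rows (empty date) and renumber the index
tidy = df.loc[df["date"] != ""].copy().reset_index(drop=True)

# Combine date and time into a single datetime column
tidy["timestamp"] = pd.to_datetime(tidy["date"] + " " + tidy["time"])

# attempts shared a column with "" placeholders, so convert it to integers
tidy["attempts"] = tidy["attempts"].astype(int)

# True if any authentication attempt in the attack used a password
tidy["any_password"] = tidy["password_used"].apply(any)

tidy = tidy.drop(columns=["date", "time"])
print(tidy)
```

Keeping the original `password_used` list alongside the derived flag preserves the per-attempt detail in case it is needed later.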

### Section 3: Data Analysis

Here we draw conclusions from the data we have collected.
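A first pass at such an analysis might count attacks per source IP and per client family. The sketch below is illustrative only: it uses five rows copied from the 102 output in Section 1 rather than the full dataset, and the `client_family` column is a hypothetical derived field, so the counts say nothing about the complete logs.

```python
import pandas as pd

# Subset of rows copied from the 102 dataframe shown in Section 1
df = pd.DataFrame(
    {
        "ip_address": [
            "195.3.147.47", "195.3.147.47", "195.3.147.47",
            "193.105.134.45", "103.89.89.242",
        ],
        "client_type": [
            "SSH-2.0-paramiko_1.8.1", "SSH-2.0-OpenSSH_5.3",
            "SSH-2.0-WinSCP_release_5.1.5", "SSH-2.0-WinSCP_release_5.7.5",
            "SSH-2.0-paramiko_2.7.1",
        ],
    }
)

# Attacks per source IP, most frequent first
attacks_per_ip = df["ip_address"].value_counts()

# Client family: the text after "SSH-2.0-" up to the first underscore, if any
df["client_family"] = df["client_type"].str.extract(r"^SSH-2\.0-([^_]+)", expand=False)
attacks_per_family = df["client_family"].value_counts()

print(attacks_per_ip)
print(attacks_per_family)
```

Grouping by family rather than by the full client string keeps different versions of the same client from being counted as separate clients.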