IoC Types to Extract:
- IP (v4 and v6, public and private - remove private later)
   - "dumb parse" (no validation - easier regex)?
   - https://docs.python.org/3/library/ipaddress.html
- URL / domain
- hashes: MD5, SHA1, SHA256

Accept these file types:
- txt 
- Word doc / docx
- maybe CSV?
- maybe PDF?

Thoughts:
- use IP library to determine IP type 
- find a hash library to validate hashes / determine type
- need good regexes for IoCs
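
A minimal sketch of the "use IP library to determine IP type" idea, using the standard-library `ipaddress` module linked above. The helper name `classify_ip` and its label strings are hypothetical, not part of the notebook. Note that `is_private` is also true for loopback and other reserved ranges, so "private" here is a simplification.

```python
import ipaddress

def classify_ip(candidate: str) -> str:
    """Return 'invalid', or a label such as 'ipv4-private' / 'ipv6-public'."""
    try:
        ip = ipaddress.ip_address(candidate)
    except ValueError:
        # Not a syntactically valid IPv4 or IPv6 address (e.g. an octet > 255).
        return "invalid"
    # Simplification: treats everything ipaddress flags as is_private as "private".
    scope = "private" if ip.is_private else "public"
    return f"ipv{ip.version}-{scope}"

for candidate in ["8.8.8.8", "192.168.1.10", "999.1.1.1", "fd00::1"]:
    print(candidate, classify_ip(candidate))
```

This would let the regex stay "dumb" and push validation and private-address removal to a later step.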

https://github.com/eset/malware-ioc/tree/master/nukesped_lazarus

In [2]:
import os
import re
import pandas as pd
# from checkHash import *

Maltego IoC Regexs

https://www.maltego.com/blog/advanced-iocs-collection-with-osint-and-threat-intelligence-feeds/

**CVEs**

CVE-\d{4}-\d*
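
Because `\d*` allows zero digits, the pattern above also matches a truncated ID such as `CVE-2020-`. A possible stricter variant is sketched below; the constant name `CVE_REGEX_PATTERN` and the sample text are made up for illustration, and the "at least four sequence digits" rule is an assumption about the CVE ID format.

```python
import re

# Assumption: four-digit year, then a sequence number of at least four digits.
CVE_REGEX_PATTERN = r"\bCVE-\d{4}-\d{4,}\b"

text = "Patched CVE-2021-44228 and cve-2017-0144; ignore CVE-2020- and CVE-20-1."
matches = re.findall(CVE_REGEX_PATTERN, text, re.IGNORECASE)
print(matches)
```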

**Common hashes (MD5, SHA1 and SHA256)**

[A-Fa-f0-9]{64}|[a-fA-F0-9]{40}|[a-fA-F0-9]{32}

**URL even when defanged (ex: hxxps://google[.]com)**

(?:h..ps?|f.p)://(?:www(?:.|[.])|(?!www))[a-zA-Z0-9]+a-zA-Z0-9[^\s]{2,}|www(?:.|[.])[a-zA-Z0-9]+a-zA-Z0-9[^\s]{2,}|https?://(?:www(?:.|[.])|(?!www))[a-zA-Z0-9]+(?:.|[.])[^\s]{2,}|www(?:.|[.])[a-zA-Z0-9]+(?:.|[.])[^\s]{2,}

**IPv4 even when defanged (ex: 8[.]8[.]8[.]8)**

\b[\d]{1,3}[?.]?[\d]{1,3}[?.]?[\d]{1,3}[?.]?[\d]{1,3}(:\d+)?\b
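
An alternative to matching defanged forms directly is to "refang" the text first and then run plain patterns over it. The sketch below only handles the two defanging styles shown in the examples above (`[.]` and `hxxp`); the function name `refang` is hypothetical and other styles would need extra rules.

```python
import re

def refang(text: str) -> str:
    """Undo two common defanging styles: [.] -> . and hxxp(s):// -> http(s)://"""
    text = text.replace("[.]", ".")
    # Keep the optional "s" so hxxps becomes https.
    text = re.sub(r"\bhxxp(s?)://", r"http\1://", text, flags=re.IGNORECASE)
    return text

print(refang("hxxps://google[.]com"))
print(refang("8[.]8[.]8[.]8"))
```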

**IPv6**

[0-9a-fA-F]{1,4}(?::[0-9a-fA-F]{1,4}){7}|(?:(?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4}){0,5})?)::(?:(?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4}){0,5})?)

**BTC address**

(?:[13]|bc1)[a-zA-HJ-NP-Z0-9]{26,62}

In [5]:
# Constants

ACCEPTED_FILETYPES = (
    # "PDF",
    "TXT",
    "CSV",
    )

MD5_REGEX_PATTERN = r"(\b[0-9a-f]{32}\b)"
SHA1_REGEX_PATTERN = r"(\b[0-9a-f]{40}\b)"
SHA256_REGEX_PATTERN = r"(\b[0-9a-f]{64}\b)"

IPV4_REGEX_PATTERN = r"(\b([0-9]{1,3}\.{0,1}){4}(\:{1}[0-9]{1,5}){0,1}\b)" # "dumb" (no validation), also handles 1-5 digit port suffixes



In [37]:
# ----- Determine accepted files -----

sample_data_path = "sample-data"
sample_data = os.listdir(sample_data_path)

accepted_files = []
for filename in sample_data:
    if filename.upper().endswith(ACCEPTED_FILETYPES):
        print("accepted filetype: ", filename)
        accepted_files.append(filename)

print(accepted_files)

accepted filetype:  IPC2s.csv
accepted filetype:  MalwareBazaarRecentMalwareSamples_MD5_20240901.txt
accepted filetype:  MalwareIoCRepoNukespedLazarus_SHA1.txt
accepted filetype:  MalwareTrafficAnalysisExerciseAnswers_IPDomainSHA256_20240815.txt
accepted filetype:  VariousHashes.txt
accepted filetype:  VariousIPs.txt
['IPC2s.csv', 'MalwareBazaarRecentMalwareSamples_MD5_20240901.txt', 'MalwareIoCRepoNukespedLazarus_SHA1.txt', 'MalwareTrafficAnalysisExerciseAnswers_IPDomainSHA256_20240815.txt', 'VariousHashes.txt', 'VariousIPs.txt']


In [38]:
# ----- Determine file type and read files -----

def read_text(path: str, filename: str) -> str:
    # The with block closes the file on exit, so no explicit close() is needed.
    with open(os.path.join(path, filename), "r") as file:
        return file.read()

def read_csv(path: str, filename: str) -> pd.DataFrame:
    df = pd.read_csv(f"{path}/{filename}")
    return df

read_files = []
for filename in accepted_files:
    print("filename: ", filename)
    if filename.upper().endswith("TXT"):
        file_type = "text"
        data = read_text(sample_data_path, filename)
    elif filename.upper().endswith("CSV"):
        file_type = "csv"
        data = read_csv(sample_data_path, filename)
    else:
        # Skip anything else so file_type/data from a previous iteration is never reused.
        continue

    read_files.append({
        "filename": filename,
        "file_type": file_type,
        "data_type": type(data),
        "data": data,
    })

# print(read_files)

filename:  IPC2s.csv
filename:  MalwareBazaarRecentMalwareSamples_MD5_20240901.txt
filename:  MalwareIoCRepoNukespedLazarus_SHA1.txt
filename:  MalwareTrafficAnalysisExerciseAnswers_IPDomainSHA256_20240815.txt
filename:  VariousHashes.txt
filename:  VariousIPs.txt


In [62]:
df = pd.read_csv("sample-data/IPC2s.csv")
# Same pattern as IPV4_REGEX_PATTERN; digits and punctuation need no IGNORECASE flag.
out = df["#ip"].str.findall(IPV4_REGEX_PATTERN)
out

0        [(1.117.232.76, 76, )]
1         [(1.117.60.10, 10, )]
2       [(1.12.232.192, 192, )]
3       [(1.12.242.190, 190, )]
4         [(1.14.206.72, 72, )]
                 ...           
555       [(94.20.88.63, 63, )]
556      [(94.232.46.54, 54, )]
557    [(94.242.61.116, 116, )]
558    [(94.247.42.107, 107, )]
559    [(95.174.67.234, 234, )]
Name: #ip, Length: 560, dtype: object
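
The tuples in the output above come from the capture groups in the pattern: with more than one group, `findall` returns one tuple of groups per match. A sketch of a variant with non-capturing groups `(?:...)` is below. It is also slightly stricter than the "dumb" pattern because it requires the dots (the original makes them optional, so a bare number like `1234` would match). The constant name `IPV4_NONCAPTURING_PATTERN` and the sample string are made up.

```python
import re

# Non-capturing groups, dots required, optional 1-5 digit port suffix. Still no octet validation.
IPV4_NONCAPTURING_PATTERN = r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}(?::[0-9]{1,5})?\b"

sample = "C2 at 1.117.232.76 and 104.21.55.70:80, not 1.2.3"
ipv4_strings = re.findall(IPV4_NONCAPTURING_PATTERN, sample)
print(ipv4_strings)
```

The same pattern should work with `Series.str.findall`, which would then return lists of plain strings per row.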

In [39]:
# ----- Determine file data type and parse with regex -----

def parse_string_data_for_iocs(string: str) -> None:
    pass

def parse_dataframe_for_iocs(df: pd.DataFrame) -> None:
    display(df)

    for col in df.columns:
        # display(df[col])
        col_matches = df[col].str.extractall(IPV4_REGEX_PATTERN)
        # print("col_matches: ", col_matches)
        display(col_matches)

for file in read_files[0:1]:
    print(file["filename"])
    # isinstance is the idiomatic type check and also accepts subclasses.
    if isinstance(file["data"], str):
        print("this file's data is string data")

    elif isinstance(file["data"], pd.DataFrame):
        print("this file's data is df")
        parse_dataframe_for_iocs(file["data"])

IPC2s.csv
this file's data is df


Unnamed: 0,#ip,ioc
0,1.117.232.76,Possible Cobaltstrike C2 IP
1,1.117.60.10,Possible Cobaltstrike C2 IP
2,1.12.232.192,Possible Cobaltstrike C2 IP
3,1.12.242.190,Possible Cobaltstrike C2 IP
4,1.14.206.72,Possible Cobaltstrike C2 IP
...,...,...
555,94.20.88.63,Possible Cobaltstrike C2 IP
556,94.232.46.54,Possible Cobaltstrike C2 IP
557,94.242.61.116,Possible Metasploit C2 IP
558,94.247.42.107,Possible Metasploit C2 IP


,match,0,1,2
0,0,1.117.232.76,76,
1,0,1.117.60.10,10,
2,0,1.12.232.192,192,
3,0,1.12.242.190,190,
4,0,1.14.206.72,72,
...,...,...,...,...
555,0,94.20.88.63,63,
556,0,94.232.46.54,54,
557,0,94.242.61.116,116,
558,0,94.247.42.107,107,


,match,0,1,2
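
One way the `parse_string_data_for_iocs` stub might be filled in for hashes is sketched here. The function name `parse_string_for_hashes`, the `HASH_PATTERNS` dict, and the sample text are hypothetical; the patterns are the ones from the constants cell, minus the outer capture group.

```python
import hashlib
import re

# Patterns from the constants cell; \b keeps a 64-char hash from also matching as a 32-char one.
HASH_PATTERNS = {
    "md5": r"\b[0-9a-f]{32}\b",
    "sha1": r"\b[0-9a-f]{40}\b",
    "sha256": r"\b[0-9a-f]{64}\b",
}

def parse_string_for_hashes(text: str) -> dict:
    """Return {hash_type: sorted unique lowercase matches} for one blob of text."""
    found = {}
    for name, pattern in HASH_PATTERNS.items():
        matches = re.findall(pattern, text, re.IGNORECASE)
        found[name] = sorted({m.lower() for m in matches})
    return found

# Made-up sample text built from real digests of b"hello".
md5_hex = hashlib.md5(b"hello").hexdigest()
sha256_hex = hashlib.sha256(b"hello").hexdigest().upper()
sample_text = f"dropper md5: {md5_hex}\npayload sha256: {sha256_hex}\n"
result = parse_string_for_hashes(sample_text)
print(result)
```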


**MD5 Format**

The format of an MD5 hash is a 32-character hexadecimal string that represents 128 bits of data. The hexadecimal digits used are 0–9 and a–f. For example, the MD5 hash for the word "hello" is "5d41402abc4b2a76b9719d911017c592". 

**SHA1 Format**

SHA-1 (Secure Hash Algorithm 1) is a cryptographic hash function that takes arbitrary input data and produces a 160-bit (20-byte) hash value, usually written as 40 hexadecimal digits.

**SHA256 Format**

A SHA-256 hash is 256 bits (32 bytes), usually written as 64 hexadecimal digits.
https://stackoverflow.com/questions/3064133/will-a-sha256-hash-always-have-64-characters
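
The three lengths above can be checked with the standard-library `hashlib`, and the length alone gives a rough type guess. This is only a sketch: `guess_hash_type` and `HASH_TYPES_BY_LENGTH` are hypothetical names, and a length match cannot prove which algorithm produced a value.

```python
import hashlib
import re

HASH_TYPES_BY_LENGTH = {32: "MD5", 40: "SHA1", 64: "SHA256"}

def guess_hash_type(candidate: str):
    """Guess a hash type from hex length alone; returns None if nothing fits."""
    candidate = candidate.strip()
    if not re.fullmatch(r"[0-9a-fA-F]+", candidate):
        return None
    return HASH_TYPES_BY_LENGTH.get(len(candidate))

for name in ("md5", "sha1", "sha256"):
    digest = hashlib.new(name, b"hello").hexdigest()
    print(name, len(digest), guess_hash_type(digest))
```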

**IPv4 Format**

https://www.regextutorial.org/regex-for-numbers-and-ranges.php

**IPv6 Format** 

https://stackoverflow.com/questions/53497/regular-expression-that-matches-valid-ipv6-addresses


In [21]:
md5_matches = re.findall(MD5_REGEX_PATTERN, read_in_data, re.IGNORECASE)
print(len(md5_matches))

sha1_matches = re.findall(SHA1_REGEX_PATTERN, read_in_data, re.IGNORECASE)
print(len(sha1_matches))

sha256_matches = re.findall(SHA256_REGEX_PATTERN, read_in_data, re.IGNORECASE)
print(len(sha256_matches))

0
0
0


In [23]:
ipv4_matches = re.findall(IPV4_REGEX_PATTERN, read_in_data, re.IGNORECASE)
print(len(ipv4_matches))
ipv4_matches

6


[('104.21.55.70:80', '70', ':80'),
 ('104.21.55.70', '70', ''),
 ('172.67.170.169:443', '169', ':443'),
 ('172.67.170.169', '169', ''),
 ('72.5.43.29:80', '29', ':80'),
 ('72.5.43.29', '29', '')]
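
Each address shows up twice in the output above, once with a port and once without. A possible post-processing step is sketched below: strip the port, validate with `ipaddress`, and de-duplicate while keeping order. The helper name `normalize_ipv4_matches` is hypothetical; the input list is copied from the output above.

```python
import ipaddress

def normalize_ipv4_matches(matches):
    """Strip optional :port suffixes, drop invalid addresses, de-duplicate in order."""
    unique_hosts = []
    for match in matches:
        # findall returns tuples when the pattern has capture groups; group 0 is the full match.
        full = match[0] if isinstance(match, tuple) else match
        host = full.split(":", 1)[0]
        try:
            ipaddress.IPv4Address(host)
        except ValueError:
            continue  # e.g. an octet above 255 that the "dumb" regex let through
        if host not in unique_hosts:
            unique_hosts.append(host)
    return unique_hosts

ipv4_matches = [
    ("104.21.55.70:80", "70", ":80"),
    ("104.21.55.70", "70", ""),
    ("172.67.170.169:443", "169", ":443"),
    ("172.67.170.169", "169", ""),
    ("72.5.43.29:80", "29", ":80"),
    ("72.5.43.29", "29", ""),
]
print(normalize_ipv4_matches(ipv4_matches))
```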