# Forensic Analysis Lab Task

Forensic analysis in the context of cybersecurity refers to the investigation and examination of digital evidence to understand how an incident occurred, identify malicious activity, and determine its impact. When applied to process information on a device, forensic analysis involves reviewing data about running and previously executed processes—such as their names, execution paths, user accounts, start times, parent-child relationships, and resource usage.

In this task we will look at the system information of a device and label each process as either malicious or non-malicious. The accompanying CSV file, `processes_data.csv`, contains the process data, and we want to identify any suspicious processes that could have existed on this device.


In [1]:
# This section of code will display the contents of the CSV and count any repeating fields under each column
import pandas as pd

# Load the CSV file
df = pd.read_csv("processes_data.csv")

# 1. Display the DataFrame (column names and a preview of the rows)
print("Fields in the CSV:")
print(df)

# 2. For each column, print values that appear more than once
for column in df.columns:
    print(f"\nValues appearing more than once in column: '{column}'")
    duplicates = df[column].value_counts()
    repeated = duplicates[duplicates > 1]
    if repeated.empty:
        print("  No repeated values.")
    else:
        print(repeated.head(10))  # Show top 10 repeated values


Fields in the CSV:
         PID   ProcessName   CPU%  MemMB  ParentPID         StartTime  \
0          1   notepad.exe  13.80  130.8          0  2025-04-17 00:58   
1          2   notepad.exe  19.68   69.4          1  2025-04-18 15:51   
2          3      code.exe  13.86  100.1          1  2025-04-24 17:26   
3          4  explorer.exe  14.51   37.4          3  2025-04-18 20:58   
4          5       cmd.exe   9.19   72.9          3  2025-04-25 13:37   
...      ...           ...    ...    ...        ...               ...   
49995  49996    python.exe   4.79   86.7      18928  2025-04-15 07:40   
49996  49997    chrome.exe  15.75   21.6      16488  2025-04-29 13:41   
49997  49998   svchost.exe   2.32  185.3      14062  2025-04-07 14:56   
49998  49999       cmd.exe   8.25  159.9      12517  2025-04-30 13:47   
49999  50000  explorer.exe  18.65   60.3      42194  2025-04-10 07:52   

                              CmdLine     User Signed  NetworkConnections  \
0         C:\Users\Public\n

### Task 1: Identifying Suspicious Information

Using the data presented above, come up with a list of fields and values (for a numeric field, a range of values) that could indicate that a process on this device is malicious.

`Class` is the field that indicates whether a process is malicious: 1 indicates a malicious process and 0 indicates a normal process.

Some of the information presented in the previous block of code may be useful for this task.


#### SAMPLE ANSWERS:

High CPU (CPU% > 80)
Malicious processes, especially those involved in data exfiltration or cryptomining, often consume a large share of system resources such as CPU. A sudden spike in CPU usage could be a sign that a process is running hidden tasks, such as complex computations (e.g., cryptocurrency mining) or scanning the system for sensitive information. Legitimate applications rarely use more than 80% of the CPU unless they are performing resource-heavy tasks.

Suspicious Process Name (ProcessName in suspicious_process_names)
Certain process names are commonly associated with known malicious software or attack tools. For example, processes like mimikatz.exe or trojan.exe are often used in attacks related to credential theft, privilege escalation, or remote access. If a process name matches any of these suspicious or known malware-associated names, it could be a sign of malicious activity on the system.

Many Network Connections (NetworkConnections > 50)
A high number of network connections could indicate that a process is attempting to communicate with many external servers or other machines, possibly for command-and-control (C&C) activities or exfiltrating data. Malicious software, such as botnets, ransomware, or worms, often tries to establish multiple outbound connections to a variety of IP addresses to spread or receive commands. This behavior is typically abnormal for a legitimate process, especially in terms of the sheer number of connections it opens.
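Before committing to a threshold, it can help to check how well a candidate rule lines up with the `Class` label. The following is a minimal sketch under stated assumptions: `rule_summary` is a hypothetical helper (not part of the lab), and the rows are made up for illustration rather than taken from the CSV.

```python
def rule_summary(rows, rule):
    """Count rows matched by `rule` and how many of those have Class == 1."""
    flagged = [r for r in rows if rule(r)]
    malicious = sum(1 for r in flagged if r["Class"] == 1)
    share = malicious / len(flagged) if flagged else 0.0
    return {"flagged": len(flagged), "malicious": malicious, "share": share}

# Made-up rows for illustration only; not taken from the lab CSV.
sample_rows = [
    {"CPU%": 95.0, "NetworkConnections": 3, "Class": 1},
    {"CPU%": 85.5, "NetworkConnections": 2, "Class": 0},
    {"CPU%": 12.0, "NetworkConnections": 70, "Class": 1},
    {"CPU%": 10.0, "NetworkConnections": 1, "Class": 0},
]

# Candidate rules written as predicates over a single row.
high_cpu = rule_summary(sample_rows, lambda r: r["CPU%"] > 80)
many_conns = rule_summary(sample_rows, lambda r: r["NetworkConnections"] > 50)
print(high_cpu)
print(many_conns)
```

On the lab data, the same helper could be called with `df.to_dict("records")` as `rows`. A rule whose `share` is close to 1 flags mostly malicious processes, while a low `share` suggests the threshold would produce many false positives.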

### Task 2: Write the code to label each process as malicious or normal based on the flags you identified in the previous task

In [None]:
# Write your code here
suspicious_process_names = {'mimikatz.exe', 'evil.exe', 'trojan.exe', 'backdoor123.exe', 'pwstealer.exe'}

# Analyze function
def analyze_row(row):
    reasons = []

    # Rules
    if row.get('CPU%', 0) > 80:
        reasons.append('High CPU')

    if str(row.get('ProcessName')).lower() in suspicious_process_names:
        reasons.append('Suspicious Process Name')

    if row.get('NetworkConnections', 0) > 50:
        reasons.append('Many Network Connections')
    
    return reasons

# Apply detection rules
df['MaliciousFieldReasons'] = df.apply(analyze_row, axis=1)
df['IsMalicious'] = df['MaliciousFieldReasons'].apply(lambda x: 1 if len(x) > 0 else 0)

# Quick look at how many were flagged
print("Malicious flagged:", df['IsMalicious'].sum())

Malicious flagged: 938
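To see which rule is responsible for most of the flags, the per-row reason lists can be tallied. This is a minimal sketch: `count_reasons` is a hypothetical helper, and the example lists are made up rather than taken from the lab output.

```python
from collections import Counter

def count_reasons(reason_lists):
    """Tally how often each reason appears across all rows."""
    counts = Counter()
    for reasons in reason_lists:
        counts.update(reasons)  # an empty list adds nothing
    return counts

# Made-up reason lists for illustration; each inner list is one row.
example = [["High CPU"], [], ["High CPU", "Many Network Connections"], []]
reason_counts = count_reasons(example)
print(reason_counts.most_common())
```

On the lab data this could be called as `count_reasons(df['MaliciousFieldReasons'])`. Because one row can trigger several rules, the per-reason counts may add up to more than the number of flagged rows.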


### Task 3: Calculating the accuracy of our model

By comparing our predicted label for each row with its actual label, we can test the accuracy of our model.

Below, write the code to calculate the number of true positives, true negatives, false positives and false negatives, and use them to calculate the accuracy of your model.

In [41]:
tp = ((df['IsMalicious'] == 1) & (df['Class'] == 1)).sum()
tn = ((df['IsMalicious'] == 0) & (df['Class'] == 0)).sum()
fp = ((df['IsMalicious'] == 1) & (df['Class'] == 0)).sum()
fn = ((df['IsMalicious'] == 0) & (df['Class'] == 1)).sum()

# Print confusion matrix values
print(f"True Positives: {tp}")
print(f"True Negatives: {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")

# Calculate metrics
total = tp + tn + fp + fn
accuracy = (tp + tn) / total if total else 0
precision = tp / (tp + fp) if (tp + fp) else 0
recall = tp / (tp + fn) if (tp + fn) else 0
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) else 0

# Print metrics
print(f"\nAccuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1_score:.4f}")


True Positives: 938
True Negatives: 49002
False Positives: 0
False Negatives: 60

Accuracy:  0.9988
Precision: 1.0000
Recall:    0.9399
F1 Score:  0.9690
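
As a check, the metrics follow directly from the four counts printed above (TP = 938, TN = 49002, FP = 0, FN = 60):

$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{938 + 49002}{50000} = 0.9988 \\
\text{Precision} &= \frac{TP}{TP + FP} = \frac{938}{938 + 0} = 1.0000 \\
\text{Recall} &= \frac{TP}{TP + FN} = \frac{938}{938 + 60} \approx 0.9399 \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \approx \frac{2 \cdot 1.0000 \cdot 0.9399}{1.0000 + 0.9399} \approx 0.9690
\end{aligned}
$$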


### Task 3: Discussion

How did your model perform?

How could you improve the accuracy of this model?

What other fields (not listed in the data) could help identify malicious processes more accurately?

#### SAMPLE ANSWERS:

The model performed well on this dataset, but it is not yet optimal. Accuracy was 0.9988 and precision was 1.0000 (no false positives), but recall was 0.9399: 60 of the 998 malicious processes were missed. The rules caught processes with known indicators such as high CPU usage and many network connections, but they missed more nuanced cases that stayed below both thresholds and whose process names did not match the predefined suspicious names. This highlights the need for more sophisticated analysis and additional context-specific features to refine the predictions.
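One way to study the missed cases is to pull out the false negatives and inspect their field values. The sketch below assumes rows carrying the `IsMalicious` and `Class` columns used earlier; `false_negatives` is a hypothetical helper and the sample rows are made up.

```python
def false_negatives(rows, predicted_key="IsMalicious", actual_key="Class"):
    """Return rows predicted normal (0) that are actually malicious (1)."""
    return [r for r in rows if r[predicted_key] == 0 and r[actual_key] == 1]

# Made-up rows for illustration only; not taken from the lab CSV.
sample = [
    {"PID": 10, "CPU%": 91.0, "NetworkConnections": 4, "IsMalicious": 1, "Class": 1},
    {"PID": 11, "CPU%": 76.0, "NetworkConnections": 48, "IsMalicious": 0, "Class": 1},
    {"PID": 12, "CPU%": 5.0, "NetworkConnections": 2, "IsMalicious": 0, "Class": 0},
]

missed = false_negatives(sample)
missed_pids = [r["PID"] for r in missed]
print(missed_pids)
```

On the lab data, `false_negatives(df.to_dict("records"))` would return the missed rows, whose values could then suggest new rules or adjusted thresholds.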

To improve the accuracy, I would first incorporate additional data points such as the process's execution time, parent-child relationships, and historical behavior of the process. This would allow the model to identify anomalies that are not immediately apparent from the features used. I would also experiment with machine learning algorithms to allow the model to learn from patterns in the data rather than relying on hardcoded rules. Additionally, integrating more comprehensive datasets with labeled malicious activity could provide more diverse examples to train the model, improving its ability to generalize.
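
As a minimal sketch of the parent-child idea, a rule could look up each process's parent by `ParentPID` and flag unusual pairings. The pairs in `SUSPICIOUS_PAIRS` are hypothetical placeholders chosen for illustration, not a vetted detection list, and the rows are made up.

```python
# Hypothetical (parent name, child name) pairs treated as unusual; illustrative only.
SUSPICIOUS_PAIRS = {("notepad.exe", "cmd.exe"), ("chrome.exe", "cmd.exe")}

def suspicious_parent_child(rows, pairs=SUSPICIOUS_PAIRS):
    """Return PIDs whose (parent name, own name) pair appears in `pairs`."""
    # Map each PID to its lower-cased process name for parent lookups.
    name_by_pid = {r["PID"]: str(r["ProcessName"]).lower() for r in rows}
    flagged = []
    for r in rows:
        parent_name = name_by_pid.get(r["ParentPID"])  # None if the parent is not in the data
        child_name = name_by_pid[r["PID"]]
        if (parent_name, child_name) in pairs:
            flagged.append(r["PID"])
    return flagged

# Made-up rows for illustration only; not taken from the lab CSV.
rows = [
    {"PID": 1, "ProcessName": "explorer.exe", "ParentPID": 0},
    {"PID": 2, "ProcessName": "notepad.exe", "ParentPID": 1},
    {"PID": 3, "ProcessName": "cmd.exe", "ParentPID": 2},
    {"PID": 4, "ProcessName": "cmd.exe", "ParentPID": 1},
]

flagged_pids = suspicious_parent_child(rows)
print(flagged_pids)
```

Here only PID 3 is flagged, because its parent is `notepad.exe`; the `cmd.exe` started by `explorer.exe` is left alone. A rule like this uses context that the single-field thresholds cannot see.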
