# Graphistry AI Cybersecurity Analysis Challenge
**Author**: Jesse Hines

**Date**: December 2024

This notebook traces the spread of a potential security compromise through a network by analyzing the user-computer interactions recorded in the accompanying access_logs.csv and activity_logs.csv files.


In [14]:
# Imports
import pandas as pd
from collections import deque, defaultdict
from datetime import datetime

In [15]:
# Load the interaction logs
access_logs = pd.read_csv("access_logs.csv")
activity_logs = pd.read_csv("activity_logs.csv")


## Solution Approach

The solution uses breadth-first search (BFS) to trace the spread of a potential compromise through the network. The SecurityAnalyzer class below implements this in its analyze_compromise method, along with a save_outputs helper that writes the results to disk. The traversal works as follows:

1. Start at the suspicious user (U1 in this case).
2. Find all computers they accessed.
3. Find all users who accessed those computers.
4. Repeat until no new connections are found.

One important caveat: in a real-world scenario, users who accessed a computer before the current suspicious user did (according to the timestamps in access_logs.csv and activity_logs.csv) would likely not be counted as affected. For simplicity, this algorithm tracks all users regardless of timing, but it could be modified to respect the order of events.
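As a rough illustration of that timestamp-aware variant, the sketch below propagates compromise only forward in time. It is not what the SecurityAnalyzer class does: it swaps plain BFS for an earliest-arrival search driven by a priority queue (heapq), so each node is finalized at the earliest moment it could have been compromised. The record layout, the toy rows, and the function name trace_in_time_order are all assumptions made for the example; the real logs would need their timestamp column parsed into comparable values first (for instance with pd.to_datetime).

```python
import heapq

# Hypothetical toy records, NOT taken from the real CSV files.
# activity rows: (user_id, computer_id, timestamp)  -> user touched computer
# access rows:   (computer_id, affected_user_id, timestamp) -> computer affected user
activity_rows = [("U1", "C1", 10), ("U2", "C2", 30), ("U3", "C1", 5)]
access_rows = [("C1", "U2", 20), ("C1", "U3", 5), ("C2", "U4", 40)]

def trace_in_time_order(start, start_time=0):
    """Return {node: earliest time it could have been compromised}."""
    # Adjacency list of (timestamp, neighbor) pairs; assumes user and
    # computer IDs never collide (true for U*/C* style identifiers).
    out_edges = {}
    for user, computer, ts in activity_rows:
        out_edges.setdefault(user, []).append((ts, computer))
    for computer, user, ts in access_rows:
        out_edges.setdefault(computer, []).append((ts, user))

    compromised_at = {}
    heap = [(start_time, start)]
    while heap:
        t, node = heapq.heappop(heap)
        if node in compromised_at:
            continue  # already finalized at an earlier (or equal) time
        compromised_at[node] = t
        for ts, neighbor in out_edges.get(node, []):
            # Only interactions at or after the compromise can spread it.
            if ts >= t and neighbor not in compromised_at:
                heapq.heappush(heap, (ts, neighbor))
    return compromised_at

result = trace_in_time_order("U1")
print(result)
# {'U1': 0, 'C1': 10, 'U2': 20, 'C2': 30, 'U4': 40}
```

In this toy data U3 is left out because its only contact with C1 (time 5) happened before U1 reached that computer (time 10), whereas the untimed traversal would have flagged it.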

In code terms, analyze_compromise starts with a deque containing the initial user and uses sets to track visited users and computers. While the queue is not empty, we remove and process the next user in line. For each user, we look up every computer they accessed, using a dictionary built in advance from the activity logs. For each of those computers that has not been visited yet, we look up every user recorded against it in the access logs. Any newly discovered users are added to the queue, which ensures we traverse the entire connected network.
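To make the traversal concrete before the full class, here is a minimal sketch on a hand-made toy network. The two dictionaries and the trace function are hypothetical stand-ins for the lookups the class builds from the logs; the data is invented for illustration only.

```python
from collections import deque

# Hypothetical toy lookups (not derived from the real logs).
user_to_computers = {"U1": ["C1"], "U2": ["C1", "C2"], "U3": ["C2"], "U4": ["C3"]}
computer_to_users = {"C1": ["U1", "U2"], "C2": ["U2", "U3"], "C3": ["U4"]}

def trace(start):
    """Return (users, computers) reachable from start via shared computers."""
    users, computers = {start}, set()
    queue = deque([start])
    while queue:
        user = queue.popleft()
        for computer in user_to_computers.get(user, []):
            if computer in computers:
                continue  # each computer is expanded only once
            computers.add(computer)
            for other in computer_to_users.get(computer, []):
                if other not in users:
                    users.add(other)
                    queue.append(other)
    return users, computers

users, computers = trace("U1")
print(sorted(users), sorted(computers))
# ['U1', 'U2', 'U3'] ['C1', 'C2']
```

U4 and C3 form a separate component, so they are never reached from U1.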

### Time Complexity
- O(V + E) where:
  - V = number of vertices (users + computers)
  - E = number of edges (interactions between users and computers)

### Space Complexity
- O(V + E) overall: O(V) for the visited sets and the queue, plus O(E) for the lookup dictionaries and the event log
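These bounds rely on constant-time lookups from a user to their computers and from a computer to its users. As a sketch of one alternative way to build those lookups, the snippet below uses pandas groupby instead of row-by-row iteration. The toy frames (toy_activity, toy_access) are invented for the example and only reuse the column names the class reads; this is not how the class below builds its dictionaries.

```python
import pandas as pd

# Hypothetical toy frames that mirror the column names used in this notebook.
toy_activity = pd.DataFrame({
    "user_id": ["U1", "U1", "U2"],
    "computer_id": ["C1", "C2", "C1"],
})
toy_access = pd.DataFrame({
    "computer_id": ["C1", "C1", "C2"],
    "affected_user_id": ["U2", "U3", "U1"],
    "activity_type": ["LOGIN_FAILURE", "FILE_DOWNLOAD", "UNAUTHORIZED_ACCESS"],
})

# Group rows by key and collect the matching values into plain lists.
user_to_computers = (
    toy_activity.groupby("user_id")["computer_id"].agg(list).to_dict()
)
computer_to_users = (
    toy_access.groupby("computer_id")["affected_user_id"].agg(list).to_dict()
)

print(user_to_computers)  # {'U1': ['C1', 'C2'], 'U2': ['C1']}
print(computer_to_users)  # {'C1': ['U2', 'U3'], 'C2': ['U1']}
```

The results are ordinary dicts, so a lookup such as user_to_computers.get("U9", []) still returns an empty list for unknown users.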

In [16]:
class SecurityAnalyzer:
    def __init__(self, access_logs_df, activity_logs_df):
        self.access_logs = access_logs_df
        self.activity_logs = activity_logs_df
        self.visited_users = set()
        self.visited_computers = set()
        self.event_log = []
        self.affected_users = set()
        self.affected_computers = set()
        
        # Preprocessing: Build dictionaries for fast lookups
        self.user_to_computers = defaultdict(list)
        self.computer_activities = defaultdict(list)
        self.computer_to_users = defaultdict(list)
        
        for _, row in self.activity_logs.iterrows():
            self.user_to_computers[row['user_id']].append(row['computer_id'])
    
        for _, row in self.access_logs.iterrows():
            self.computer_activities[row['computer_id']].append((row['affected_user_id'], row['activity_type']))
            self.computer_to_users[row['computer_id']].append(row['affected_user_id'])
        
    # performs BFS to trace users affected by the suspicious user’s activities
    def analyze_compromise(self, initial_user):
        """Perform breadth-first analysis of compromise spread"""
        queue = deque([initial_user])
        self.visited_users.add(initial_user)
        self.affected_users.add(initial_user)
        
        self.event_log.append(f"Analysis started with suspicious user {initial_user}")
        
        while queue:
            current_user = queue.popleft()
            self.event_log.append(f"Processing user: {current_user}")
            
            user_computers = self.user_to_computers.get(current_user, [])
            
            for computer in user_computers:
                if computer not in self.visited_computers:
                    self.visited_computers.add(computer)
                    self.affected_computers.add(computer)
                    self.event_log.append(f"User {current_user} accessed computer {computer}")
                    
                    self.event_log.append(f"Activities on computer {computer}:")
                    for (affected_user_id, activity_type) in self.computer_activities.get(computer, []):
                        self.event_log.append(f"  - {activity_type} by user {affected_user_id}")
                    
                    users_on_computer = self.computer_to_users.get(computer, [])
                    for user in users_on_computer:
                        if user not in self.visited_users:
                            self.affected_users.add(user)
                            self.visited_users.add(user)
                            queue.append(user)
                            self.event_log.append(f"Computer {computer} affected user {user}")
   
    # saves the event log and summary of the analysis to files.
    def save_outputs(self):
        """Save analysis results to files"""
        # Save event log
        with open("investigation_log.txt", "w") as f:
            f.write("\n".join(self.event_log))
        
        # Save summary
        summary_data = {
            "Metric": [
                "Total Affected Users",
                "Total Affected Computers",
            ],
            "Value": [
                len(self.affected_users),
                len(self.affected_computers),
            ]
        }
        pd.DataFrame(summary_data).to_csv("analysis_summary.csv", index=False)


Using this class, we can now run the analysis.

Note: While the access logs contain different activity types (LOGIN_FAILURE, FILE_DOWNLOAD, UNAUTHORIZED_ACCESS, etc.), we use these only for logging. A user counts as affected if they are the initial user or had any interaction with a computer that an affected user accessed.

In [17]:
# Initialize analyzer and run analysis
analyzer = SecurityAnalyzer(access_logs, activity_logs)
analyzer.analyze_compromise("U1")

# Save results
analyzer.save_outputs()

# Display summary
print("\nFinal Summary:")
print(f"Total Affected Users: {len(analyzer.affected_users)}")
print(f"Total Affected Computers: {len(analyzer.affected_computers)}")
print("\nResults have been saved to:")
print("- investigation_log.txt (detailed event log)")
print("- analysis_summary.csv (summary statistics)")




Final Summary:
Total Affected Users: 100
Total Affected Computers: 50

Results have been saved to:
- investigation_log.txt (detailed event log)
- analysis_summary.csv (summary statistics)
