# Download and analyse dataportal log events

## Downloading log events

We are going to analyse the log events of user `admin@example.com` (an admin/manager user).

To perform this task, you must have aws console access, because the keys are needed to access aws console via the cli.


In [34]:
user = "admin@example.com"

In [1]:
%%file download_dataportal_logs.sh

#!/bin/bash
LOG_GROUP_NAME="/aws/lambda/sbeacon-backend-dataPortal"
REGION="ap-southeast-2"

# Get all log stream names
log_streams=$(aws logs describe-log-streams \
  --log-group-name "$LOG_GROUP_NAME" \
  --query 'logStreams[*].logStreamName' \
  --output text \
  --region $REGION)

for stream in $log_streams; do
  echo "Downloading logs for stream: $stream"
  safe_stream_name=$(echo "$stream" | sed 's/\//_/g')
  output_file="dataportal_${safe_stream_name}.json"
  > "$output_file"  # Clear/create file

  next_token=""
  first_request=true

  while : ; do
    if [ "$first_request" = true ]; then
      response=$(aws logs get-log-events \
        --log-group-name "$LOG_GROUP_NAME" \
        --log-stream-name "$stream" \
        --start-from-head \
        --region $REGION \
        --output json)
      first_request=false
    else
      response=$(aws logs get-log-events \
        --log-group-name "$LOG_GROUP_NAME" \
        --log-stream-name "$stream" \
        --next-token "$next_token" \
        --region $REGION \
        --output json)
    fi

    # Save events (append only the "events" array)
    echo "$response" | jq '.events' >> "$output_file"

    # Get the nextForwardToken for the next page
    new_token=$(echo "$response" | jq -r '.nextForwardToken')

    # If the next token is the same as the previous, we're done
    if [ "$next_token" == "$new_token" ]; then
      break
    fi
    next_token=$new_token
  done

  echo "Finished downloading $stream"
done

echo "All log streams downloaded."


Overwriting download_dataportal_logs.sh


In [15]:
# Run follwing command with keys in the terminal
# bash download_dataportal_logs.sh

## Loading the events


In [18]:
from glob import glob
import json

def iterate_log_entries():
    entries = []
    for file in glob("dataportal_*.json"):
        with open(file, "r") as f:
            data = f.read()
            data = data.replace("[]\n", "")
            entries +=  json.loads(data)
    
    log_entry = []
    for entry in entries:
        log_entry.append(entry)
        if entry["message"].startswith("REPORT"):
            yield log_entry
            log_entry = []


## Dataportal notebook events for the user admin@example.com


In [91]:
from textwrap import indent
import re
from urllib.parse import unquote

re_notebook_start = re.compile(r"^/dportal/notebooks/.*?/start$")
re_notebook_stop = re.compile(r"^/dportal/notebooks/.*?/stop$")
re_notebook = re.compile(r"^/dportal/notebooks/[a-zA-Z0-9-]+$")

for log_entry in iterate_log_entries():
    log_event = list(filter(lambda x: x["message"].startswith("Event Received"), log_entry))[0]
    event = log_event["message"]
    event = event.replace("Event Received: ", "")
    event = json.loads(event)
    

    if not event["requestContext"]["authorizer"]["claims"]["email"] == user:
        continue

    if event["httpMethod"] == "POST" and event["path"] == "/dportal/notebooks":
        print(f"User {user} created a notebook at {log_event['timestamp']}")
        print("\tNotebook properties:")
        print(indent(json.dumps(json.loads(event["body"]), indent=4), "\t"))

    elif re_notebook_start.match(event["path"]):
        print(f"User {user} started notebook: {event['path'].split('/')[-2]}, at {log_event["timestamp"]}")
    
    elif re_notebook_stop.match(event["path"]):
        print(f"User {user} stopped notebook: {event['path'].split('/')[-2]}, at {log_event["timestamp"]}")

    elif re_notebook.match(event["path"]):
        print(f"User {user} listed details of notebook: {event['path'].split('/')[-1]}, at {log_event["timestamp"]}")

    elif "/dportal/notebooks" == event["path"]:
        print(f"User {user} listed notebooks at {log_event['timestamp']}")


User admin@example.com listed details of notebook: new-test, at 1747798600351
User admin@example.com started notebook: new-test, at 1747798604247
User admin@example.com listed details of notebook: new-test, at 1747798604446
User admin@example.com listed details of notebook: new-test, at 1747798606108
User admin@example.com created a notebook at 1747798620851
	Notebook properties:
	{
	    "instanceName": "My-test-notebook",
	    "instanceType": "ml.t3.medium",
	    "volumeSize": 5,
	    "identityId": "ap-southeast-2:099e873d-80b5-cb64-b9b4-0f64c663bd46"
	}
User admin@example.com listed details of notebook: new-test, at 1747798621244
User admin@example.com listed notebooks at 1747798600351
User admin@example.com listed details of notebook: testNotebook, at 1747798621244
User admin@example.com listed details of notebook: testNotebook, at 1747798600341
User admin@example.com listed notebooks at 1747798621256
User admin@example.com listed details of notebook: My-test-notebook, at 1747798621

## Dataportal manager tasks performed by user admin@example.com


In [None]:
from textwrap import indent
import re
from urllib.parse import unquote

re_projects = re.compile(r"^/dportal/admin/projects$")
re_project = re.compile(r"^/dportal/admin/projects/[a-zA-Z%0-9]+$")
re_projects_ingest = re.compile(r"^/dportal/admin/projects/[a-zA-Z%0-9]+/ingest/[a-zA-Z%0-9-]+$")
re_notebook_delete = re.compile(r"^/dportal/admin/notebooks/[a-zA-Z-0-9]+/delete$")
re_notebook = re.compile(r"^/dportal/admin/notebooks/[a-zA-Z-0-9]+$")

for log_entry in iterate_log_entries():
    log_event = list(filter(lambda x: x["message"].startswith("Event Received"), log_entry))[0]
    event = log_event["message"]
    event = event.replace("Event Received: ", "")
    event = json.loads(event)
    


    if not event["requestContext"]["authorizer"]["claims"]["email"] == user or "/dportal/admin" not in event["path"]:
        continue

    #
    # Projects
    #
    
    if event["httpMethod"] == "POST" and re_projects.match(event["path"]):
        print(f"User {user} created a project at {log_event['timestamp']}")
        print("\tProject properties:")
        print(indent(json.dumps(json.loads(event["body"]), indent=4), "\t"))

    elif event["httpMethod"] == "GET" and re_projects.match(event["path"]):
        print(f"User {user} listed projects at {log_event['timestamp']}")

    elif event["httpMethod"] == "GET" and re_project.match(event["path"]):
        print(f"User {user} listed details of project: {event['path'].split('/')[-1]}, at {log_event['timestamp']}")


    elif event["httpMethod"] == "PUT" and re_project.match(event["path"]):
        print(f"User {user} updated details of project: {unquote(event['path'].split('/')[-1])}, at {log_event['timestamp']}")
        print("\tProject properties:")
        print(indent(json.dumps(json.loads(event["body"]), indent=4), "\t"))

    elif event["httpMethod"] == "POST" and re_projects_ingest.match(event["path"]):
        print(f"User {user} ingested data into project: {unquote(event['path'].split('/')[-3])}, at {log_event['timestamp']}")
        print("\tIngest properties:")
        print(indent(json.dumps(json.loads(event["body"]), indent=4), "\t"))

    # 
    # sBeacon 
    #

    elif event["httpMethod"] == "POST" and event["path"] == "/dportal/admin/sbeacon/index":
        print(f"User {user} indexed data into sBeacon at {log_event['timestamp']}")

    # 
    # notebooks
    # 

    elif event["httpMethod"] == "GET" and event["path"] == "/dportal/admin/notebooks":
        print(f"User {user} listed notebooks at {log_event['timestamp']}")

    elif event["httpMethod"] == "GET" and re_notebook.match(event["path"]):
        print(f"User {user} listed details of notebook: {unquote(event['path'].split('/')[-1])}, at {log_event['timestamp']}")

    elif event["httpMethod"] == "POST" and re_notebook_delete.match(event["path"]):
        print(f"User {user} deleted notebook: {unquote(event['path'].split('/')[-1])}, at {log_event['timestamp']}")

    elif event["httpMethod"] == "GET" and event["path"] == "/dportal/admin/folders":
        print(f"User {user} listed folders at {log_event['timestamp']}")

    else:
        print("MISSED EVENT", event["httpMethod"], event["path"])



User admin@example.com deleted notebook: delete, at 1747804819990
User admin@example.com listed projects at 1747804373824
User admin@example.com listed projects at 1747804378229
User admin@example.com updated details of project: My test project, at 1747804386554
	Project properties:
	{
	    "description": "This is a test project - updated",
	    "files": [
	        "minimal.vcf.gz",
	        "minimal.vcf.gz.csi",
	        "minimal.vcf.gz.tbi"
	    ]
	}
User admin@example.com listed projects at 1747798660520
User admin@example.com listed projects at 1747798676185
User admin@example.com listed projects at 1747798689686
User admin@example.com ingested data into project: Example Query Project, at 1747798819096
	Ingest properties:
	{
	    "s3Payload": "s3://gasi-dataportal-20241120071209060300000001/projects/Example Query Project/project-files/chr1-metadata.json",
	    "vcfLocations": [
	        "s3://gasi-dataportal-20241120071209060300000001/projects/Example Query Project/project-files/ch