# Inference Outputs Inspection Notebook

This notebook checks that all inference outputs are as expected. This involves two steps:

1. Checking that every image has been processed.
2. Inspecting the log files to determine what caused any errors.

In [1]:
import json
import os
import pandas as pd
import tqdm
import shutil

In [2]:
region = "japan_job"
code = "jpn"
inferences_folder = "japan_inferences"

In [3]:
logs_dir = f"/home/users/dylcar/amber-inferences/logs/{code}/"

In [4]:
keys_files = os.listdir(f"/home/users/dylcar/amber-inferences/keys/{region}")
keys_files = [os.path.join(f"/home/users/dylcar/amber-inferences/keys/{region}/", x) for x in keys_files]

for i in keys_files:
    print(i)

/home/users/dylcar/amber-inferences/keys/japan_job/dep000105.json
/home/users/dylcar/amber-inferences/keys/japan_job/dep000106.json


## Check Image Processing

Here we read in the keys files and check that every image is included in the output inference CSV files.
Where images are missing, the `write_missing` option writes them to a new JSON file for future inference calls.
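
The comparison performed by `check_file` can be sketched on toy data. This is a minimal illustration, assuming a keys file maps each session date to a list of image paths and that the CSV has an `image_path` column, as the function implies. The date string, file names, and the helper `find_missing` are hypothetical.

```python
import os

# Hypothetical keys mapping: session date -> list of image paths (assumed shape).
keys = {
    "2024-06-01": [
        "dep000105/snapshot_images/img_0001.jpg",
        "dep000105/snapshot_images/img_0002.jpg",
        "dep000105/snapshot_images/img_0003.jpg",
    ],
}

# Hypothetical values of the CSV's `image_path` column for the same date.
analysed_paths = [
    "/some/dir/img_0001.jpg",
    "/some/dir/img_0003.jpg",
    "/some/dir/img_0003.jpg",  # repeated rows for one image collapse in the set
]


def find_missing(expected_paths, analysed_paths):
    """Return basenames from expected_paths that never appear in analysed_paths."""
    analysed = {os.path.basename(p) for p in analysed_paths if isinstance(p, str)}
    return [
        os.path.basename(p)
        for p in expected_paths
        if os.path.basename(p) not in analysed
    ]


# Keep only the dates that actually have missing images.
missing = {}
for date, paths in keys.items():
    names = find_missing(paths, analysed_paths)
    if names:
        missing[date] = names

print(missing)
```

Comparing basenames rather than full paths means the check still works when the keys file and the CSV record the same image under different directories.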

In [5]:
def check_file(keys_file, region, write_missing=False):
    # Load the keys file
    with open(keys_file, "r") as f:
        keys = json.load(f)

    dep = os.path.basename(keys_file).split(".")[0]
    print(f"🎥 Checking {dep}...")

    # Directory holding this deployment's inference CSVs
    # (relies on the notebook-level `inferences_folder` variable)
    csv_dir = f"/home/users/dylcar/amber-inferences/data/{inferences_folder}/{dep}"

    all_missing_keys = {}

    for date, image_paths in tqdm.tqdm(keys.items()):
        jpgs = [os.path.basename(p) for p in image_paths]

        csv_path = os.path.join(csv_dir, f"{dep}_{date}.csv")
        if not os.path.exists(csv_path):
            print(f" ⚠️ CSV for {date} not found: {csv_path}")
            continue

        # Load the CSV
        try:
            df = pd.read_csv(csv_path, low_memory=False, encoding='utf-8')

            # Collect the basename of every image that appears in the CSV
            analysed_images = {os.path.basename(x) for x in df["image_path"] if isinstance(x, str)}
            # `jpgs` already holds basenames, so compare them directly
            missing = [jpg for jpg in jpgs if jpg not in analysed_images]

            if missing:
                print(f"- ❌ {len(missing)}/{len(jpgs)} missing jpgs for {os.path.basename(csv_path)}")

                # create a subset of the format keys_file made up of the missing jpgs
                missing_keys = {date: [os.path.join(dep, "snapshot_images", x) for x in missing]}
                all_missing_keys.update(missing_keys)

        except Exception as e:
            print(f" ⚠️ Error processing {csv_path}: {e}")

    if all_missing_keys:
        # write all_missing_keys to a file
        missing_keys_file = f"/home/users/dylcar/amber-inferences/keys/{region}_final_missing_keys/{dep}.json"

        if write_missing:
            print(f"❗️ Writing missing keys for {dep} to {missing_keys_file}")
            os.makedirs(os.path.dirname(missing_keys_file), exist_ok=True)

            with open(missing_keys_file, "w") as f:
                json.dump(all_missing_keys, f, indent=4)


In [28]:
for keys_file in keys_files:
    check_file(keys_file, region, write_missing=True)

🎥 Checking dep000105...


100%|██████████| 28/28 [00:00<00:00, 36.87it/s] 


🎥 Checking dep000106...


100%|██████████| 28/28 [00:01<00:00, 26.43it/s]


## Check Log Messages

This section inspects every log file and records its last line. This helps identify errors that occurred during processing and shows which deployments/sessions should be rerun.
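
The cell in this section looks back at most one line when a log ends in whitespace. As a hedged alternative, the sketch below scans the whole file and keeps the last line that contains any text. The helper name `last_nonempty_line` and the demo log content are hypothetical, and the approach assumes the logs are small enough to read line by line.

```python
import os
import tempfile


def last_nonempty_line(path):
    """Return the last line with non-whitespace text (newline removed), or None."""
    last = None
    with open(path, "r", errors="replace") as f:
        for line in f:
            if line.strip():
                last = line.rstrip("\n")
    return last


# Demo on a temporary file that ends with several blank lines.
with tempfile.TemporaryDirectory() as tmp:
    demo_log = os.path.join(tmp, "demo.out")
    with open(demo_log, "w") as f:
        f.write("starting job\n    df = pd.read_csv(csv_file)\n\n   \n")
    demo_result = last_nonempty_line(demo_log)

print(repr(demo_result))
```

The returned line keeps its leading whitespace, so call `.strip()` on it before matching against known messages.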

In [6]:
# Read the last non-empty line of each log file and record it in a DataFrame
tail_lines = pd.DataFrame(columns=['last_line', 'last_full_line', 'error_or_pass'])

# Logs classed as plain 'ERROR' are copied here for closer inspection
investigate_dir = "/home/users/dylcar/amber-inferences/logs/logs_to_investigate"
os.makedirs(investigate_dir, exist_ok=True)

log_files = [os.path.join(logs_dir, f) for f in os.listdir(logs_dir) if f.endswith('.out')]
for log_file in log_files:
    with open(log_file, 'r') as f:
        lines = f.readlines()
    if not lines:
        continue  # empty log file: nothing to record

    # If the final line is blank, fall back to the line before it
    last_full_line = lines[-1].rstrip("\n")
    if not last_full_line.strip() and len(lines) > 1:
        last_full_line = lines[-2].rstrip("\n")
    last_line = last_full_line.strip()

    # Default to ERROR; the known messages below override the status
    error_or_pass = 'ERROR'

    if 'All images already processed in' in last_line:
        last_line = 'All images already processed in ...'
        error_or_pass = 'PASS'
    if 'Error submitting job for chunk ' in last_line:
        last_line = 'Error submitting job for chunk ...'
    if 'CANCELLED AT' in last_line and 'DUE TO TIME LIMIT ***' in last_line:
        last_line = 'CANCELLED AT ... DUE TO TIME LIMIT'
    if 'YOLOv8m-seg summary (fused)' in last_line:
        error_or_pass = 'PASS'
    if 'df = pd.read_csv(csv_file)' in last_line:
        error_or_pass = 'ERROR ON TRACKING ONLY'
    if 'Cosine similarity score out of bounds' in last_line:
        error_or_pass = 'PASS, but worth checking not for all images'
    if 'No previous image embedding found for None' in last_line:
        error_or_pass = 'PASS'

    if error_or_pass == 'ERROR':
        shutil.copyfile(log_file, os.path.join(investigate_dir, os.path.basename(log_file)))

    tail_lines.loc[os.path.basename(log_file)] = [last_line, last_full_line, error_or_pass]

In [7]:
# Count how often each (normalised) last line occurs
tail_lines['last_line'].value_counts()

last_line
YOLOv8m-seg summary (fused): 111 layers, 24,586,035 parameters, 0 gradients, 98.7 GFLOPs    49
Error submitting job for chunk ...                                                           4
All images already processed in ...                                                          3
Name: count, dtype: int64

In [31]:
tail_lines['error_or_pass'].value_counts()

error_or_pass
PASS     52
ERROR     4
Name: count, dtype: int64
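
To turn these counts into a rerun list, one option is to select every log whose status does not start with `PASS`. The sketch below uses a small stand-in DataFrame with the same columns as `tail_lines`. The log file names are hypothetical, and treating `'PASS, but worth checking not for all images'` as a pass is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for `tail_lines`; the index holds made-up log file names.
demo = pd.DataFrame(
    {
        "last_line": [
            "All images already processed in ...",
            "Error submitting job for chunk ...",
            "Cosine similarity score out of bounds",
        ],
        "last_full_line": [
            "All images already processed in ...",
            "Error submitting job for chunk ...",
            "Cosine similarity score out of bounds",
        ],
        "error_or_pass": [
            "PASS",
            "ERROR",
            "PASS, but worth checking not for all images",
        ],
    },
    index=["job_a.out", "job_b.out", "job_c.out"],
)

# Anything whose status does not start with "PASS" needs attention.
needs_attention = demo[~demo["error_or_pass"].str.startswith("PASS")]
to_rerun = sorted(needs_attention.index)
print(to_rerun)
```

Applied to the real `tail_lines`, the same filter would also pick up rows marked `ERROR ON TRACKING ONLY`.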

In [13]:
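# Use a substring match: `last_full_line` keeps leading whitespace, so `==` can miss indented traceback lines
tail_lines[tail_lines['last_full_line'].str.contains('df = pd.read_csv(csv_file)', regex=False)]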

Unnamed: 0,last_line,last_full_line,error_or_pass
