### Combine and Filter Detections (CNN + AST)

This script combines the detection results from two separate models:
1.  A self-attention CNN (`mosquito_path_CNN`)
2.  An Audio Spectrogram Transformer (AST) (`mosquito_path_AST`)

The goal is to filter the CNN's results using the AST's results, producing a cleaner, higher-precision final dataset. The two outputs are compared by file name: each name encodes the source recording (with its family and species labels), the channel, and the segment start time in milliseconds.
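To make the file-name comparison concrete, here is a minimal sketch of how one segment name can be split into its parts. The regular expression is the one used later in this notebook; the helper `parse_segment_name` and the example file name are hypothetical and serve only as illustration.

```python
import re

# Same pattern as the one defined later in this notebook.
pattern = re.compile(r'^(Test\d+)_([^_]+)_([^\.]+)\.wav_(\d)_(\d+)\.wav$')

def parse_segment_name(filename):
    """Return ((test_id, family, species, channel), start_ms), or None if no match.

    Hypothetical helper, not part of the original script.
    """
    m = pattern.match(filename)
    if m is None:
        return None
    key = (m.group(1), m.group(2), m.group(3), m.group(4))
    return key, int(m.group(5))

# Made-up file name, assumed to follow the naming scheme.
example = "Test0123_Culicidae_Aedes.wav_1_1500.wav"
parsed = parse_segment_name(example)
print(parsed)
```

Two segments belong to the same recording and channel exactly when their keys are equal, which is what the neighbor search further down relies on.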

**Filtering Heuristic:**

* **Rationale:** If the self-attention CNN classified a segment as a mosquito sound but the AST did not, the segment is treated as suspect. It could be a weak mosquito sound, but it is more likely noise (clicks, pops, etc.).
* **Goal:** Remove noise, even at the cost of losing some weaker positive samples.
* **Action 1 (Blacklist):** All segments from recordings listed in `blacklist_fn` are removed first.
* **Action 2 (CNN-only removal):** A segment found *only* in the CNN results (and not in the AST results) is added to an `exclude_set`.
* **Action 3 (Neighborhood removal):** The CNN-detected neighbors of each excluded segment, at ±500 ms and ±1000 ms in the same recording and channel, are also added to the `exclude_set`, so that segments adjacent to suspected noise are dropped too.
* **Final Output:** The script copies only the files that are present in **both** the CNN and AST detection sets *and* are **not** in the final `exclude_set`. Because CNN-only segments are never in both sets, the `exclude_set` in practice removes their neighbors that both models detected.
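The steps above can be sketched on toy data as follows. This is a simplified illustration, not the script itself: segments are represented as hypothetical `(recording, start_ms)` pairs instead of file names, and the function name `filter_segments` is an assumption.

```python
def filter_segments(cnn, ast, blacklist, deltas=(-1000, -500, 500, 1000)):
    """Return the segments to keep; cnn and ast are sets of (recording, start_ms)."""
    # Action 1: drop CNN detections from blacklisted recordings.
    cnn = {seg for seg in cnn if seg[0] not in blacklist}

    # Action 2: segments detected only by the CNN are suspect.
    cnn_only = cnn - ast
    exclude = set(cnn_only)

    # Action 3: also exclude CNN-detected neighbors of each suspect segment.
    for rec, ms in cnn_only:
        for delta in deltas:
            neighbor = (rec, ms + delta)
            if neighbor in cnn:
                exclude.add(neighbor)

    # Final output: detected by both models and not excluded.
    return (cnn & ast) - exclude

# Toy data (made up): recording "B" is blacklisted, ("A", 1000) is CNN-only.
cnn = {("A", 0), ("A", 500), ("A", 1000), ("A", 3000), ("B", 0)}
ast = {("A", 0), ("A", 500), ("A", 3000), ("B", 0)}
kept = filter_segments(cnn, ast, blacklist={"B"})
print(kept)
```

In this toy run the CNN-only segment at 1000 ms also removes its neighbors at 0 ms and 500 ms, even though both models detected them, so only the isolated segment at 3000 ms survives. This shows how aggressively the heuristic trades recall for precision.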

In [18]:

mosquito_path_CNN="szunyog_hangok_osztalyozáshoz_25_04_03-best99"
mosquito_path_AST="szunyog_hangok_osztalyozáshoz_25_04_03_AST"
blacklist_fn="wav_blacklist_for_train_25_04_14.txt"

output_path="szunyog_hangok_osztalyozáshoz_25_04_14-filtered"


In [19]:
import os
import shutil
import re


In [20]:
black_list = []

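# Read the blacklist: one recording ID per line, spaces removed, blank lines skipped
try:
    with open(blacklist_fn, "r", encoding="utf-8") as file:
        black_list = [line.strip().replace(" ", "") for line in file if line.strip()]
except FileNotFoundError:
    # No blacklist file: continue with an empty blacklist, but say so
    print(f"Blacklist file not found: {blacklist_fn}")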
    
# Print result
print(black_list)


['Test0558', 'Test0553', 'Test0551', 'Test0550', 'Test0547', 'Test0537', 'Test0528', 'Test0525', 'Test0519', 'Test0516', 'Test0506', 'Test0501', 'Test0500', 'Test0497', 'Test0491', 'Test0488', 'Test0486', 'Test0480', 'Test0468', 'Test0461', 'Test0428', 'Test0402', 'Test0378', 'Test0376', 'Test0344', 'Test0425', 'Test0451', 'Test0463', 'Test0469', 'Test0475', 'Test0478', 'Test0481', 'Test0484', 'Test0487', 'Test0490', 'Test0490', 'Test0521', 'Test0566']


In [21]:

os.makedirs(output_path, exist_ok=True)

# Regex for parsing the filename
pattern = re.compile(r'^(Test\d+)_([^_]+)_([^\.]+)\.wav_(\d)_(\d+)\.wav$')

# AST files in a set for fast membership tests
ast_files = set(os.listdir(mosquito_path_AST))

# CNN files: keep only names that match the pattern and are not blacklisted
cnn_files = [
    f for f in os.listdir(mosquito_path_CNN)
    if (m := pattern.match(f)) and m.group(1) not in black_list
]

# Check what is in both folders
cnn_and_ast = [f for f in cnn_files if f in ast_files]

# And those only in CNN
cnn_only = [f for f in cnn_files if f not in ast_files]

# Index CNN segments by source for fast neighbor lookup
# key: (TestID, family, species, channel), value: list of (filename, milliseconds)
cnn_base_dict = {}

for filename in cnn_files:
    # cnn_files already contains only matching, non-blacklisted names
    match = pattern.match(filename)
    key = (match.group(1), match.group(2), match.group(3), match.group(4))
    millisec = int(match.group(5))
    cnn_base_dict.setdefault(key, []).append((filename, millisec))



In [22]:
#cnn_base_dict


In [23]:
# Collect what must NOT be copied: CNN-only segments and their CNN-detected neighbors
exclude_set = set()
for filename in cnn_only:
    match = pattern.match(filename)
    if not match:
        continue
    key = (match.group(1), match.group(2), match.group(3), match.group(4))
    millisec = int(match.group(5))
    exclude_set.add(filename)
    
    # neighboring files
    for delta in [-500, +500, -1000, +1000]:
        neighbor_millisec = millisec + delta
        for candidate, cand_ms in cnn_base_dict.get(key, []):
            if cand_ms == neighbor_millisec:
                exclude_set.add(candidate)

# Copy everything that is in both CNN and AST and is not in exclude_set
for f in cnn_and_ast:
    if f in exclude_set:
        continue  # this one was excluded

    # copy the original
    shutil.copy2(os.path.join(mosquito_path_CNN, f), os.path.join(output_path, f))

    