### Assessment of H5N1 annotation quality

I now use the HA/NA subtypes that I annotated with nextclade sort to filter influenza A for the H5N1 subtype, instead of relying on finding H5N1 in the FASTA header. The downside of this approach is that I can only identify sequences as H5N1 if my grouping is successful, i.e. if the HA and NA segments end up in the same group.

I compare:
- the "old" approach: the current ingest pipeline ([commit](https://github.com/loculus-project/loculus/commit/9323a187bd24fcd123199c0fc723af509a431ef7)) using the current H5N1 [config](https://github.com/GenSpectrum/servers/blob/main/ansibleSetup/roles/loculus/templates/organismValues.yml#L676)
- the "new" approach: the pipeline from this [pull request](https://github.com/loculus-project/loculus/pull/3407), which uses nextclade sort to annotate the subtype, using this [config](https://github.com/loculus-project/private_deployments/blob/main/deploy/virus5/values.yaml).

I test the accuracy of the subtype annotation by dropping a random subsample of the filtered HA (segment 4) sequences into nextclade:
```
grep -A 1 '_seg4$' results/submit_sequences.fasta  | grep -v '^--$' > results/filtered_ha.fasta
seqkit sample -n 1000 -2 --out-file results/subsample.fasta results/filtered_ha.fasta 
```
Using the "all H5 clades" dataset, I see that every sequence in my subsample has been annotated and that the sequences are well distributed across the tree.

In [16]:
from Bio import SeqIO

# Record IDs have the form "<acc>/<acc>/..._<segment>": the part before "_" is the
# group (accessions joined by "/"), the part after it is the segment of this record.
nextclade_sort_grouped_accessions = []
nextclade_sort_grouped_accessions_sets = []
sort_number_of_each_seg = {}
nextclade_sort_groups = []
with open("ingest/results_nextclade_sort/submit_sequences.fasta", encoding="utf-8") as f_in:
    for record in SeqIO.parse(f_in, "fasta"):
        # RefSeq accessions contain "_" (NC_...), which would break the split below
        record_id = record.id.replace("NC_", "NC")
        group, seg = record_id.split("_")[:2]
        nextclade_sort_grouped_accessions.extend(group.split("/"))
        nextclade_sort_grouped_accessions_sets.append(frozenset(group.split("/")))
        nextclade_sort_groups.append(group)
        sort_number_of_each_seg[seg] = sort_number_of_each_seg.get(seg, 0) + 1

print(len(set(nextclade_sort_grouped_accessions)))  # unique accessions
print(len(set(nextclade_sort_groups)))  # unique groups
print(sort_number_of_each_seg)  # records per segment

71506
9860
{'seg8': 8640, 'seg5': 8710, 'seg4': 9860, 'seg7': 8759, 'seg1': 8502, 'seg6': 9860, 'seg3': 8596, 'seg2': 8579}


In [17]:
# Same parsing as above, applied to the output of the old approach
old_grouped_accessions = []
old_grouped_accessions_sets = []
old_number_of_each_seg = {}
old_groups = []
with open("ingest/results/submit_sequences.fasta", encoding="utf-8") as f_in:
    for record in SeqIO.parse(f_in, "fasta"):
        # RefSeq accessions contain "_" (NC_...), which would break the split below
        record_id = record.id.replace("NC_", "NC")
        group, seg = record_id.split("_")[:2]
        old_grouped_accessions.extend(group.split("/"))
        old_grouped_accessions_sets.append(frozenset(group.split("/")))
        old_groups.append(group)
        old_number_of_each_seg[seg] = old_number_of_each_seg.get(seg, 0) + 1

print(len(set(old_grouped_accessions)))  # unique accessions
print(len(set(old_groups)))  # unique groups
print(old_number_of_each_seg)  # records per segment

86437
21131
{'seg8': 10084, 'seg1': 10050, 'seg5': 10140, 'seg4': 13799, 'seg7': 10453, 'seg6': 11849, 'seg3': 10023, 'seg2': 10039}
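The two cells above apply the same parsing to two files. A minimal sketch of a shared helper follows. It assumes the ID layout seen above (`<acc>/<acc>/..._<segment>`). The name `summarize_record_ids` and the example IDs are hypothetical and only for illustration. It takes plain ID strings, so it could be fed `record.id` values from `SeqIO.parse`.

```python
from collections import Counter


def summarize_record_ids(record_ids):
    """Collect accessions, groups, group sets and per-segment counts.

    Assumes IDs of the form "<acc>/<acc>/..._<segment>" (hypothetical helper).
    """
    accessions = set()
    groups = set()
    group_sets = set()
    segment_counts = Counter()
    for raw_id in record_ids:
        # RefSeq accessions contain "_" (NC_...), which would break the split
        record_id = raw_id.replace("NC_", "NC")
        group, seg = record_id.split("_")[:2]
        members = group.split("/")
        accessions.update(members)
        groups.add(group)
        group_sets.add(frozenset(members))
        segment_counts[seg] += 1
    return accessions, groups, group_sets, segment_counts


# Made-up IDs, only to show the expected input shape
example_ids = [
    "AB000001.1.seg4/AB000002.1.seg6_seg4",
    "AB000001.1.seg4/AB000002.1.seg6_seg6",
    "NC_000003.1.seg8_seg8",
]
accessions, groups, group_sets, segment_counts = summarize_record_ids(example_ids)
print(len(accessions), len(groups), dict(segment_counts))
```

Returning sets directly would also make the later `set(...)` conversions unnecessary.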


In [18]:
print(f"Number of accessions only in new approach:{len(set(nextclade_sort_grouped_accessions) - set(old_grouped_accessions))}")
print(f"Number of accessions only in old approach:{len(set(old_grouped_accessions) - set(nextclade_sort_grouped_accessions))}")

print(f"Number of groups only in new approach:{len(set(nextclade_sort_groups) - set(old_groups))}")
print(f"Number of groups only in old approach:{len(set(old_groups) - set(nextclade_sort_groups))}")

Number of accessions only in new approach:315
Number of accessions only in old approach:15246
Number of groups only in new approach:171
Number of groups only in old approach:11442


In [19]:
nextclade_sort_grouped_accessions_sets = set(nextclade_sort_grouped_accessions_sets)
old_grouped_accessions_sets = set(old_grouped_accessions_sets)

In [27]:
count = sum(any(set1 == set2 for set2 in old_grouped_accessions_sets) for set1 in nextclade_sort_grouped_accessions_sets)
print("Number of identical groupings:", count)

count = sum(any(set2.issuperset(set1) and set1 != set2 for set2 in old_grouped_accessions_sets) for set1 in nextclade_sort_grouped_accessions_sets)
print("Number of sets in nextclade_sort_grouped_accessions_sets with a superset in old_grouped_accessions_sets:", count)

count = sum(any(set2.issuperset(set1) and set1 != set2 for set2 in nextclade_sort_grouped_accessions_sets) for set1 in old_grouped_accessions_sets)
print("Number of sets in old_grouped_accessions_sets with a superset in nextclade_sort_grouped_accessions_sets:", count)

Number of identical groupings: 9689
Number of sets in nextclade_sort_grouped_accessions_sets with a superset in old_grouped_accessions_sets: 0
Number of sets in old_grouped_accessions_sets with a superset in nextclade_sort_grouped_accessions_sets: 141
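The comparisons above check every pair of groups, which gets slow with tens of thousands of sets. As a sketch of a faster alternative (not what was run here), identical groupings are a plain set intersection, and strict supersets can be found through an index from accession to the groups containing it. The function names and toy data are hypothetical, and groups are assumed to be non-empty.

```python
def count_identical(new_sets, old_sets):
    """Number of groups that appear unchanged in both collections."""
    return len(set(new_sets) & set(old_sets))


def count_with_strict_superset(candidates, others):
    """Count groups in `candidates` that are a proper subset of a group in `others`."""
    # Index each accession to the groups in `others` that contain it
    index = {}
    for other in others:
        for accession in other:
            index.setdefault(accession, []).append(other)
    count = 0
    for group in candidates:
        if not group:
            continue  # assumption: groups are never empty
        # Any superset must contain every member, so one member is enough to look it up
        any_member = next(iter(group))
        if any(group < other for other in index.get(any_member, [])):
            count += 1
    return count


# Toy data, only for illustration
new_sets = {frozenset({"a", "b"}), frozenset({"c"}), frozenset({"d", "e", "f"})}
old_sets = {frozenset({"a", "b"}), frozenset({"c", "x"}), frozenset({"d", "e"})}
identical = count_identical(new_sets, old_sets)
new_in_old = count_with_strict_superset(new_sets, old_sets)
old_in_new = count_with_strict_superset(old_sets, new_sets)
print(identical, new_in_old, old_in_new)
```

For frozensets, `group < other` tests for a proper subset, which matches the `issuperset(...) and set1 != set2` condition used above.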


In every segment group where the new approach adds segments (the 141 old groups that have a superset in the new grouping), the added segment is segment 8. This makes sense: these same sequences are then rejected by the preprocessing pipeline because they do not align to the segment 8 reference in the H5N1 reference assembly.

In [41]:
sort_supersets_diff = [
    set2 - set1
    for set1 in old_grouped_accessions_sets
    for set2 in nextclade_sort_grouped_accessions_sets
    if set2.issuperset(set1) and set1 != set2
]

sort_supersets_diff

[frozenset({'OR420953.1.seg8'}),
 frozenset({'AB716341.1.seg8'}),
 frozenset({'OR783388.1.seg8'}),
 frozenset({'OQ683478.1.seg8'}),
 frozenset({'OR421034.1.seg8'}),
 frozenset({'OP377610.1.seg8'}),
 frozenset({'OP270000.1.seg8'}),
 frozenset({'OQ584540.1.seg8'}),
 frozenset({'OP377449.1.seg8'}),
 frozenset({'DQ997273.1.seg8'}),
 frozenset({'OQ734941.1.seg8'}),
 frozenset({'OP377626.1.seg8'}),
 frozenset({'OR818691.1.seg8'}),
 frozenset({'OQ737757.1.seg8'}),
 frozenset({'GU182186.1.seg8'}),
 frozenset({'OQ584634.1.seg8'}),
 frozenset({'OR818568.1.seg8'}),
 frozenset({'OP377522.1.seg8'}),
 frozenset({'OR136576.1.seg8'}),
 frozenset({'OQ734930.1.seg8'}),
 frozenset({'OP377424.1.seg8'}),
 frozenset({'OP377490.1.seg8'}),
 frozenset({'OR136608.1.seg8'}),
 frozenset({'PP853100.1.seg8'}),
 frozenset({'DQ997114.1.seg8'}),
 frozenset({'OR421119.1.seg8'}),
 frozenset({'OQ734882.1.seg8'}),
 frozenset({'OP377384.1.seg8'}),
 frozenset({'OR421148.1.seg8'}),
 frozenset({'OR136584.1.seg8'}),
 frozenset
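Rather than reading through the list, the claim that every added accession is a segment 8 can be checked by tallying the segment suffixes. The sketch below assumes accessions end in `.<segment>` as in the output above. `added_segment_counts` is a hypothetical helper, and `example_diffs` holds just two entries copied from that output.

```python
from collections import Counter


def added_segment_counts(diffs):
    """Tally the segment suffix (text after the last ".") of every added accession."""
    return Counter(acc.rsplit(".", 1)[-1] for diff in diffs for acc in diff)


# Two entries from the output above; in the notebook, pass sort_supersets_diff instead
example_diffs = [frozenset({"OR420953.1.seg8"}), frozenset({"AB716341.1.seg8"})]
counts = added_segment_counts(example_diffs)
print(counts)
```

If only `seg8` appears as a key, all additions are segment 8.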

In [52]:
left_over_sort_sets = set()
for set1 in nextclade_sort_grouped_accessions_sets:
    if not any(set2.issuperset(set1) or set1 == set2 or set1.issuperset(set2) for set2 in old_grouped_accessions_sets):
        left_over_sort_sets.add(set1)
left_over_sets = list(left_over_sort_sets)
left_over_sets.sort(key=lambda x: len(x))
print(f"{len(left_over_sort_sets)} Sets in nextclade_sort_grouped_accessions_sets without an overlapping set in old_grouped_accessions_sets: {left_over_sets}")

30 Sets in nextclade_sort_grouped_accessions_sets without a superset in old_grouped_accessions_sets: [frozenset({'KY635774.1.seg4', 'KY635588.1.seg6'}), frozenset({'KY635495.1.seg6', 'KY635652.1.seg4'}), frozenset({'LC831696.1.seg4', 'LC831697.1.seg6'}), frozenset({'KY635700.1.seg6', 'KY635740.1.seg4'}), frozenset({'KP762497.1.seg4', 'KP762498.1.seg6'}), frozenset({'PQ468756.1.seg6', 'PQ468755.1.seg4'}), frozenset({'KY635616.1.seg4', 'KY635764.1.seg6'}), frozenset({'KY635743.1.seg6', 'KY635822.1.seg4'}), frozenset({'KP762500.1.seg6', 'KP762499.1.seg4'}), frozenset({'LC106085.1.seg6', 'LC106094.1.seg7', 'LC106076.1.seg5', 'LC106103.1.seg8', 'LC106067.1.seg4', 'LC106058.1.seg3'}), frozenset({'LC106105.1.seg8', 'LC106087.1.seg6', 'LC106060.1.seg3', 'LC106096.1.seg7', 'LC106051.1.seg2', 'LC106078.1.seg5', 'LC106069.1.seg4'}), frozenset({'LC106059.1.seg3', 'LC106068.1.seg4', 'LC106086.1.seg6', 'LC106077.1.seg5', 'LC106104.1.seg8', 'LC106095.1.seg7', 'LC106050.1.seg2'}), frozenset({'LC106052

In [55]:
left_over_old_sets = set()
for set1 in old_grouped_accessions_sets:
    if not any(set2.issuperset(set1) or set1 == set2 or set1.issuperset(set2) for set2 in nextclade_sort_grouped_accessions_sets):
        left_over_old_sets.add(set1)
left_over_sets = list(left_over_old_sets)
left_over_sets.sort(key=lambda x: len(x))
print(f"{len(left_over_old_sets)} Sets in old_grouped_accessions_sets without an overlapping set in nextclade_sort_grouped_accessions_sets: {left_over_sets}")

11301 Sets in old_grouped_accessions_sets without a superset in nextclade_sort_grouped_accessions_sets: [frozenset({'KP638546.1.seg8'}), frozenset({'PP998401.1.seg4'}), frozenset({'JN055388.1.seg4'}), frozenset({'ON974725.1.seg4'}), frozenset({'JQ906606.1.seg4'}), frozenset({'MZ976838.1.seg4'}), frozenset({'GU811714.1.seg4'}), frozenset({'AY741216.1.seg6'}), frozenset({'MG668932.1.seg5'}), frozenset({'PQ002141.1.seg7'}), frozenset({'JQ858472.1.seg4'}), frozenset({'EF631181.1.seg4'}), frozenset({'PP761580.1.seg3'}), frozenset({'PQ098643.1.seg7'}), frozenset({'MH791626.1.seg6'}), frozenset({'OQ584659.1.seg3'}), frozenset({'OP221293.1.seg4'}), frozenset({'MN880474.1.seg4'}), frozenset({'KJ522726.1.seg4'}), frozenset({'KR088159.1.seg4'}), frozenset({'EF631166.1.seg4'}), frozenset({'MH791543.1.seg1'}), frozenset({'MN994157.1.seg8'}), frozenset({'EU542785.1.seg4'}), frozenset({'OR165029.1.seg1'}), frozenset({'PQ162249.1.seg4'}), frozenset({'OQ730497.1.seg7'}), frozenset({'KC791653.1.seg6'}),

I need to check whether the remaining groups all lack HA and NA segments, which would explain why I am unable to annotate them using nextclade sort. This is only partially the case. For the 29 remaining sequences that exist in the current H5N1 database but were lost in the new approach, the cause is that the HA and NA segments were not grouped together in the new approach. Most likely this is because I perform grouping on all influenza A sequences and do not group segments if the resulting group would contain multiple segments of the same type. However, all of these issues could be solved if I had access to the assembly information.
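To make the two conditions in this paragraph concrete, here is a minimal sketch of a duplicate-segment check and an HA/NA presence check on a single group. This is not the ingest pipeline's actual grouping code. Both function names are hypothetical, the accessions are made up, and segment 4 is taken to be HA and segment 6 to be NA, as in the rest of this analysis.

```python
from collections import Counter


def segment_of(accession):
    """Return the segment suffix, e.g. "seg4" for "AB000001.1.seg4"."""
    return accession.rsplit(".", 1)[-1]


def duplicated_segments(group):
    """Segments that occur more than once in a group (such a group would be rejected)."""
    counts = Counter(segment_of(acc) for acc in group)
    return sorted(seg for seg, n in counts.items() if n > 1)


def has_ha_and_na(group):
    """True if the group contains both an HA (seg4) and an NA (seg6) segment."""
    return {"seg4", "seg6"} <= {segment_of(acc) for acc in group}


# Made-up groups, only for illustration
clean_group = frozenset({"AB000001.1.seg4", "AB000002.1.seg6"})
ambiguous_group = frozenset({"AB000001.1.seg4", "AB000009.1.seg4", "AB000002.1.seg6"})
ha_only_group = frozenset({"AB000001.1.seg4"})

print(duplicated_segments(clean_group), has_ha_and_na(clean_group))
print(duplicated_segments(ambiguous_group), has_ha_and_na(ambiguous_group))
print(duplicated_segments(ha_only_group), has_ha_and_na(ha_only_group))
```

Comparing the exact suffix, rather than testing for a substring such as `"seg4" in acc`, keeps the check independent of the accession text.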

In [56]:
count_has_na_ha = 0
for group in left_over_old_sets:
    count = 0
    for segs in group:
        if "seg4" in segs:
            count += 1
        if "seg6" in segs:
            count += 1
    if count == 2:
        count_has_na_ha += 1
        print(group)

print(count_has_na_ha)

frozenset({'OQ546777.1.seg4', 'OQ546773.1.seg1', 'OQ546775.1.seg5', 'OQ546776.1.seg6', 'OQ546772.1.seg3', 'OQ546774.1.seg2'})
frozenset({'PP829962.1.seg5', 'PP829960.1.seg6', 'PP829958.1.seg4', 'PP829961.1.seg1', 'PP829963.1.seg7', 'PP829957.1.seg3', 'PP829959.1.seg8'})
frozenset({'JF758718.1.seg1', 'JF758719.1.seg8', 'JF758723.1.seg4', 'JF758721.1.seg5', 'JF758724.1.seg2', 'JF758720.1.seg6', 'JF758717.1.seg7', 'JF758722.1.seg3'})
frozenset({'MK964005.1.seg4', 'MK964006.1.seg6', 'MK964007.1.seg5', 'MK964009.1.seg2', 'MK964008.1.seg3'})
frozenset({'JF758744.1.seg3', 'JF758745.1.seg2', 'JF758741.1.seg4', 'JF758743.1.seg5', 'JF758748.1.seg1', 'JF758742.1.seg7', 'JF758747.1.seg8', 'JF758746.1.seg6'})
frozenset({'MN209561.1.seg2', 'MN209565.1.seg4', 'MN209563.1.seg1', 'MN209562.1.seg8', 'MN209560.1.seg7', 'MN209564.1.seg6', 'MN209559.1.seg3', 'MN209558.1.seg5'})
frozenset({'OM938323.1.seg6', 'OM938324.1.seg7', 'OM938320.1.seg2', 'OM938325.1.seg8', 'OM938321.1.seg4', 'OM938319.1.seg1', 'OM93