In [3]:
import pandas as pd


## Description of ST1

% TO CHANGE

### Column Descriptions

| Column              | Description                                                                 |
| ------------------- | --------------------------------------------------------------------------- |
| `title`             | Title of the article                                                        |
| `abstract`          | Abstract of the article                                                     |
| `source`            | Source(s) where the article was found; duplicates may have multiple sources |
| `review`            | Indicates review article: `1` = review, `0` = not review, empty = unknown   |
| `relevance`         | Overall relevance of the article (based on content)                         |
| `code`              | (Reserved / unused)                                                         |
| `what section used` | Main section(s) where the article is used (see mapping below)               |
| `subgroup`          | Specific task within the section (see article for details)                  |
| `ref`               | Reference ID (for BibTeX or internal lookup)                                |
| `ai_topic`          | AI-related notes (optional)                                                 |
| `medicine_topic`    | Medicine-related notes (optional)                                           |
| `notes`             | General notes                                                               |
| `not_relevant`      | Boolean flag: article is irrelevant (`TRUE` / `FALSE`)                      |
| `partly_relevant`   | Boolean flag: article is partially relevant                                 |
| `relevant`          | Boolean flag: article is relevant                                           |
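
The three Boolean flags look mutually exclusive: in the sample rows printed in this notebook, each article has exactly one of them set. A minimal sketch of a consistency check under that assumption follows. The helper names `flags_as_bool` and `rows_with_inconsistent_flags` are hypothetical, and the code accepts both real booleans and the strings `TRUE` / `FALSE`, since it is not certain how the CSV reader parses them.

```python
import pandas as pd

# Flag columns from the table above.
FLAG_COLUMNS = ["not_relevant", "partly_relevant", "relevant"]


def flags_as_bool(df):
    """Normalise the flag columns to booleans (accepts True/False or 'TRUE'/'FALSE')."""
    return df[FLAG_COLUMNS].apply(lambda col: col.astype(str).str.upper().eq("TRUE"))


def rows_with_inconsistent_flags(df):
    """Return the rows where the number of flags set is not exactly one."""
    n_set = flags_as_bool(df).sum(axis=1)
    return df[n_set != 1]


# Toy data, not taken from the real table.
demo = pd.DataFrame(
    {
        "title": ["a", "b", "c"],
        "not_relevant": ["FALSE", "FALSE", "TRUE"],
        "partly_relevant": ["TRUE", "FALSE", "TRUE"],
        "relevant": ["FALSE", "FALSE", "FALSE"],
    }
)
bad = rows_with_inconsistent_flags(demo)
print(bad["title"].tolist())
```

Rows with no flag set or with several flags set are both reported, so the result can be reviewed by hand.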

---

### `what section used` Mapping

| Code          | Phase          | Section Name                                      |
| ------------- | -------------- | ------------------------------------------------- |
| `intro: rev`  | Introduction   | Review                                            |
| `pre: KNLR`   | Pre-Analytics  | Knowledge Navigation & Literature Review          |
| `pre: RS`     | Pre-Analytics  | Risk Stratification                               |
| `ana: MIA`    | Analytics      | Medical Imaging Analysis                          |
| `ana: AVE`    | Analytics      | Analysis of Variant Effects                       |
| `ana: CVI`    | Analytics      | Clinical Variant Interpretation                   |
| `post: PCS`   | Post-Analytics | Patient Clustering & Concept Typing               |
| `post: DRA`   | Post-Analytics | Data & Results Aggregation                        |
| `post: CRGDS` | Post-Analytics | Clinical Report Generation & Decision Support     |
| `edu`         | Education      | Educational Use                                   |
| `disc`        | Discussion     | General LLM Use or Non-categorized Medical Topics |
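
The mapping above can be kept as a plain lookup table in code. This is a minimal sketch, assuming each cell holds a single code; cells that list several sections would need to be split first, and the separator is not specified here. The names `SECTION_MAP` and `describe_section` are hypothetical.

```python
# Code -> (phase, section name), copied from the mapping table.
SECTION_MAP = {
    "intro: rev": ("Introduction", "Review"),
    "pre: KNLR": ("Pre-Analytics", "Knowledge Navigation & Literature Review"),
    "pre: RS": ("Pre-Analytics", "Risk Stratification"),
    "ana: MIA": ("Analytics", "Medical Imaging Analysis"),
    "ana: AVE": ("Analytics", "Analysis of Variant Effects"),
    "ana: CVI": ("Analytics", "Clinical Variant Interpretation"),
    "post: PCS": ("Post-Analytics", "Patient Clustering & Concept Typing"),
    "post: DRA": ("Post-Analytics", "Data & Results Aggregation"),
    "post: CRGDS": ("Post-Analytics", "Clinical Report Generation & Decision Support"),
    "edu": ("Education", "Educational Use"),
    "disc": ("Discussion", "General LLM Use or Non-categorized Medical Topics"),
}


def describe_section(code):
    """Return (phase, section name) for a code, or None if the code is unknown."""
    return SECTION_MAP.get(code.strip())


print(describe_section("ana: AVE"))
```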

---

### `subgroup` Notes

The `subgroup` field provides fine-grained task labels within a section (e.g., "Named Entity Recognition", "Phenotype Extraction", "Variant Prioritization").
Refer to the article directly for subgroup meaning and examples.


## Now to work

In [4]:
data = pd.read_csv('./data/ST2_v2.csv', index_col=0).reset_index(drop=True)
len_before_deletion = data.shape[0]
data = data.dropna(subset=["title", "abstract"], how="any")  # drop service rows that lack a title or an abstract
print(f"Initially df with {len_before_deletion} rows, after cleaning -- {data.shape[0]} rows")
data

Initially df with 319 rows, after cleaning -- 318 rows


Unnamed: 0,title,abstract,ref,source,review,relevance,final_category,subcategory,final_code,not_relevant,partly_relevant,relevant,code,what section used,subgroup,notes,ai_topic,medicine_topic
0,A systematic review and meta-analysis of diagn...,While generative artificial intelligence (AI) ...,Takita2025,Pubmed,1.0,1.0,intro,none,0,FALSE,TRUE,FALSE,0,intro: rev,,review+check,,
1,Addressing the Gaps in Early Dementia Detectio...,The rapid global aging trend has led to an inc...,moya2024addressinggapsearlydementia,arXiv,1.0,2.0,intro,none,0,FALSE,FALSE,TRUE,0,intro: rev,specific,,,early demencia detection
2,Artificial intelligence in clinical genetics,Artificial intelligence (AI) has been growing ...,Duong2025-xi,PubMed,1.0,2.0,intro,none,0,FALSE,FALSE,TRUE,0,intro: rev,,,,clinical genetics
3,Bioinformatics and Biomedical Informatics with...,The year 2023 marked a significant surge in th...,wang2024bioinformaticsbiomedicalinformaticscha...,"arXiv,PubMed",1.0,2.0,intro,none,0,FALSE,FALSE,TRUE,0,intro: rev,chatgpt,Reviews ChatGPT's applications in bioinformati...,ChatGPT; systematic review,Broad applications of LLMs in biomedical domai...
4,Chatbot Artificial Intelligence for Genetic Ca...,Most individuals with a hereditary cancer synd...,Webster2023-of,PubMed,1.0,2.0,intro,none,0,FALSE,FALSE,TRUE,0,intro: rev,specific,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
313,Using a Natural Language Processing Approach t...,Implementing artificial intelligence to extrac...,,PubMed,,1.0,,,,FALSE,TRUE,FALSE,,,,,,
314,Using LLMs to label medical papers according t...,We introduce the sequence classification probl...,,arXiv,,1.0,,,,FALSE,TRUE,FALSE,,,,Fine-tuned BERT-based models outperform GPT-4 ...,"BERT, RoBERTa, BioMedBERT, BioLinkBERT, GPT-4;...",Classification of clinical evidence related to...
315,Virtual Labs and Designer Bugs - Generative AI...,AI technologies can pose a major national secu...,,PubMed,,1.0,,,,FALSE,TRUE,FALSE,,,,AI's role in genetic research and biological s...,No specific transformer-based models like GPT ...,Genetic research with a focus on synthetic bio...
316,ViTally Consistent: Scaling Biological Represe...,Large-scale cell microscopy screens are used i...,,arXiv,,1.0,,,,FALSE,TRUE,FALSE,,,,,,


Check the unique (`final_category`, `subcategory`) pairs, then use them to validate the mapper.

In [5]:
pairs = (
    data.groupby(["final_category", "subcategory"])
    .size()
    .reset_index(name="count")
    .sort_values(["final_category", "subcategory"])
)
pairs.sort_values(by='count', ascending=False)

Unnamed: 0,final_category,subcategory,count
29,areas,none,41
33,intro,none,14
3,CDA,MDP,13
21,KN,NER/RE,12
14,GDA,AVE,11
0,CDA,CDN,8
34,techniques,none,8
1,CDA,CDP,7
11,COM,PCC,7
17,GDA,PP,6
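
Using the pairs as a check can be sketched as a set difference between the pairs observed in the data and the keys of a code mapping. This is only an illustration: `unmapped_pairs` is a hypothetical helper, and the demo frame and mapping are toy data.

```python
import pandas as pd


def unmapped_pairs(df, mapping):
    """Return observed (final_category, subcategory) pairs that have no key in mapping."""
    observed = (
        df[["final_category", "subcategory"]]
        .dropna(how="all")  # fully unannotated rows are handled separately
        .drop_duplicates()
        .itertuples(index=False, name=None)
    )
    # key=repr keeps the sort safe if a pair mixes a string with a missing value
    return sorted(set(observed) - set(mapping), key=repr)


# Toy data, not taken from the real table.
demo = pd.DataFrame(
    {
        "final_category": ["KN", "KN", "CDA", None],
        "subcategory": ["NER/RE", "NER/RE", "XYZ", None],
    }
)
demo_mapping = {("KN", "NER/RE"): "111"}
missing = unmapped_pairs(demo, demo_mapping)
print(missing)
```

An empty result means every annotated pair has a code.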


In [6]:
# print("{")
# for _, row in pairs.sort_values(by='count', ascending=False).iterrows():
#     print(f"    ({row['final_category']!r}, {row['subcategory']!r}): ,")
# print("}")

In [53]:
# Results section:
# KN (knowledge navigation), includes NER/RE and RD (relation discovery)
# CDA (clinical data analysis), includes CDN (clinical data normalization), CDP (clinical diagnosis prediction), MDP (molecular diagnosis prediction), COP (outcome prediction)
# GDA (genetic data analysis), includes AVE (analysis of variant effects), GVI (genetic variant interpretation) and PP (phenotype prediction)
# COM (communication), includes MPC (medical professional communication: literature QA and the like) and PCC (patient communication, including counselling and the like)
# areas 
# 
# Discussion section:

mapper = {
    ('intro', 'none'): "000",
    ('KN', 'NER/RE'): "111",
    ('KN', 'RD'): "112",
    ('KN, COM', 'NER/RE, MPC'): "111; 141",
    ('KN', 'NER/RE, RD'): "111; 112",
    ('KN', 'RD, NER/RE'): "112; 111",
    ('KN, GDA', 'NER/RE, AVE'): "111; 131",
    ('KN, CDA', 'NER/RE, CDN'): "111; 121",
    ('KN, GDA, COM', 'NER/RE, PP, PCC'): "111; 133; 142",
    ('CDA', 'CDN'): "121",
    ('CDA', 'CDP'): "122",
    ('CDA', 'MDP'): "123",
    ('CDA', 'COP'): "124",
    ('CDA', 'MDP, COP'): "123; 124",
    ('CDA, GDA', 'COP, PP'): "124; 133",
    ('CDA, COM', 'COP, MPC'): "124; 141",
    ('CDA, COM', 'CDP, MPC'): "122; 141",
    ('CDA, KN', 'CDN, NER/RE'): "121; 111",
    ('GDA', 'AVE'): "131",
    ('GDA', 'GVI'): "132",
    ('GDA', 'PP'): "133",
    ('GDA', 'AVE, GVI'): "131; 132",
    ('GDA, CDA', 'AVE, CDN'): "131; 121",
    ('GDA, COM', 'GVI, MPC'): "132; 141",
    ('GDA, areas', 'AVE, none'): "131; 151",
    ('COM', 'MPC'): "141",
    ('COM', 'PCC'): "142",
    ('COM, KN', 'MPC, NER/RE'): "141; 111",
    ('COM', 'MPC, PCC'): "141; 142",
    ('COM, GDA', 'MPC, PP'): "141; 133",
    ('areas', 'none'): "151",
    ('areas, KN', 'none, NER/RE'): "151; 111",
    ('techniques', 'none'): "211",
    ('data', 'none'): "221",
    ('biases', 'none'): "231",
}
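
The multi-label entries in the mapper follow a regular pattern: each composite code is the `"; "`-joined sequence of the single-label codes, in the order the labels are written. A sketch of deriving them instead of enumerating every combination follows. It assumes labels are separated by `", "`, and that a single category applies to all listed subcategories; `BASE_CODES` and `compose_code` are hypothetical names, not part of the notebook.

```python
# Single-label codes, copied from the mapper.
BASE_CODES = {
    ("intro", "none"): "000",
    ("KN", "NER/RE"): "111",
    ("KN", "RD"): "112",
    ("CDA", "CDN"): "121",
    ("CDA", "CDP"): "122",
    ("CDA", "MDP"): "123",
    ("CDA", "COP"): "124",
    ("GDA", "AVE"): "131",
    ("GDA", "GVI"): "132",
    ("GDA", "PP"): "133",
    ("COM", "MPC"): "141",
    ("COM", "PCC"): "142",
    ("areas", "none"): "151",
    ("techniques", "none"): "211",
    ("data", "none"): "221",
    ("biases", "none"): "231",
}


def compose_code(final_category, subcategory, base=BASE_CODES):
    """Build a composite code such as '111; 141' from comma-separated labels."""
    cats = [c.strip() for c in final_category.split(",")]
    subs = [s.strip() for s in subcategory.split(",")]
    if len(cats) == 1:
        # One category with several subcategories, e.g. ('KN', 'NER/RE, RD').
        cats = cats * len(subs)
    if len(cats) != len(subs):
        raise ValueError(f"Cannot align {final_category!r} with {subcategory!r}")
    return "; ".join(base[(c, s)] for c, s in zip(cats, subs))


print(compose_code("KN, GDA, COM", "NER/RE, PP, PCC"))
```

An unknown single-label pair raises `KeyError`, which surfaces annotation typos early instead of silently producing a missing code.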

In [54]:
def get_final_code(row, show_title=False):
    """Look up the code for a row's (final_category, subcategory) pair."""
    key = (row["final_category"], row["subcategory"])
    if key not in mapper:
        message = f"[⚠️] Not found in mapper: {key}"
        if show_title:
            # Strip line breaks so the warning stays on one line.
            cur_title = row["title"].replace("\n", "")
            message += f" for {cur_title}"
        print(message)
        return None
    return str(mapper[key])

# Unmapped rows return None, which astype(str) stores as the string "None".
data["final_code_2"] = data.apply(get_final_code, axis=1).astype(str)
data.to_csv('./data/coded_st2.csv')

[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[⚠️] Not found in mapper: (nan, nan)
[

Check the NaN cases: confirm that these rows are entirely unannotated (both `final_category` and `subcategory` are NaN), so that no annotation has been missed.
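
One way to run this check is to look for rows where some, but not all, of the annotation columns are missing. This is a sketch only: the choice of `final_category`, `subcategory` and `final_code` as the annotation columns is an assumption, and `partially_annotated` is a hypothetical helper.

```python
import pandas as pd

# Assumed annotation columns; adjust to the columns that are actually annotated.
ANNOTATION_COLUMNS = ["final_category", "subcategory", "final_code"]


def partially_annotated(df, columns=ANNOTATION_COLUMNS):
    """Return rows where some, but not all, of the annotation columns are missing."""
    missing = df[columns].isna()
    return df[missing.any(axis=1) & ~missing.all(axis=1)]


# Toy data, not taken from the real table.
demo = pd.DataFrame(
    {
        "title": ["a", "b", "c"],
        "final_category": ["KN", None, "CDA"],
        "subcategory": ["RD", None, None],
        "final_code": ["112", None, "121"],
    }
)
partial = partially_annotated(demo)
print(partial["title"].tolist())
```

An empty result would indicate that every `(nan, nan)` warning comes from a row with no annotation at all.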

## Draw the tables and graphs relevant to the article

See `usage_hist`
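
As an illustration of the kind of table such a figure could be built from, the sketch below counts how often each code occurs, splitting multi-code cells on `"; "`. It is not the `usage_hist` implementation, which is not shown here; `code_usage_counts` is a hypothetical helper that assumes a `final_code_2` column as written by `get_final_code`, where unmapped rows hold the string `"None"`.

```python
import pandas as pd


def code_usage_counts(df, column="final_code_2"):
    """Count articles per code, treating '111; 141' as one use of each code."""
    codes = df[column].dropna().astype(str)
    codes = codes[~codes.isin(["None", "nan"])]  # skip unmapped rows
    return codes.str.split("; ").explode().value_counts().sort_index()


# Toy data, not taken from the real table.
demo = pd.DataFrame({"final_code_2": ["111; 141", "111", "None", "131; 151"]})
counts = code_usage_counts(demo)
print(counts)
```

The resulting Series can be passed to a plotting call such as `counts.plot.bar()` if matplotlib is installed.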