In [64]:
import pandas as pd

## Description of ST1

### Column Descriptions

| Column              | Description                                                                 |
| ------------------- | --------------------------------------------------------------------------- |
| `title`             | Title of the article                                                        |
| `abstract`          | Abstract of the article                                                     |
| `source`            | Source(s) where the article was found; duplicates may have multiple sources |
| `review`            | Indicates review article: `1` = review, `0` = not review, empty = unknown   |
| `relevance`         | Overall relevance of the article (based on content)                         |
| `code`              | (Reserved / unused)                                                         |
| `what section used` | Main section(s) where the article is used (see mapping below)               |
| `subgroup`          | Specific task within the section (see article for details)                  |
| `ref`               | Reference ID (for BibTeX or internal lookup)                                |
| `ai_topic`          | AI-related notes (optional)                                                 |
| `medicine_topic`    | Medicine-related notes (optional)                                           |
| `notes`             | General notes                                                               |
| `not_relevant`      | Boolean flag: article is irrelevant (`TRUE` / `FALSE`)                      |
| `partly_relevant`   | Boolean flag: article is partially relevant                                 |
| `relevant`          | Boolean flag: article is relevant                                           |

---
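One quick sanity check for the three relevance flags is to confirm that each row sets exactly one of them. The sketch below runs on a small hypothetical frame, not the real table. It assumes the flags are stored as the strings `TRUE` / `FALSE`, as in the column description, and that the flags are meant to be mutually exclusive. `demo`, `flag_cols` and `n_set` are illustrative names.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real table.
demo = pd.DataFrame({
    'not_relevant':    ['FALSE', 'TRUE',  'FALSE'],
    'partly_relevant': ['FALSE', 'FALSE', 'TRUE'],
    'relevant':        ['TRUE',  'FALSE', 'TRUE'],
})

flag_cols = ['not_relevant', 'partly_relevant', 'relevant']

# Assumption: flags are the strings 'TRUE' / 'FALSE'. Comparing upper-cased text
# also handles real booleans, which stringify as 'True' / 'False'.
as_bool = demo[flag_cols].astype(str).apply(lambda s: s.str.upper() == 'TRUE')

# Number of flags set per row; exactly one is expected under the assumption above.
n_set = as_bool.sum(axis=1)
inconsistent = demo[n_set != 1]
print(inconsistent)
```

On the real table, an empty `inconsistent` frame would mean the three flags agree for every article.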

### `what section used` Mapping

| Code          | Phase          | Section Name                                      |
| ------------- | -------------- | ------------------------------------------------- |
| `intro: rev`  | Introduction   | Review                                            |
| `pre: KNLR`   | Pre-Analytics  | Knowledge Navigation & Literature Review          |
| `pre: RS`     | Pre-Analytics  | Risk Stratification                               |
| `ana: MIA`    | Analytics      | Medical Imaging Analysis                          |
| `ana: AVE`    | Analytics      | Analysis of Variant Effects                       |
| `ana: CVI`    | Analytics      | Clinical Variant Interpretation                   |
| `post: PCS`   | Post-Analytics | Patient Clustering & Concept Typing               |
| `post: DRA`   | Post-Analytics | Data & Results Aggregation                        |
| `post: CRGDS` | Post-Analytics | Clinical Report Generation & Decision Support     |
| `edu`         | Education      | Educational Use                                   |
| `disc`        | Discussion     | General LLM Use or Non-categorized Medical Topics |

---

### `subgroup` Notes

The `subgroup` field provides fine-grained task labels within a section (e.g., "Named Entity Recognition", "Phenotype Extraction", "Variant Prioritization").
Refer to the article directly for subgroup meaning and examples.


## Analysis

In [76]:
data = pd.read_csv('./data/ST2.csv', index_col=0).reset_index(drop=True)
data

Unnamed: 0,title,abstract,ref,source,review,relevance,code,what section used,subgroup,ai_topic,medicine_topic,notes,not_relevant,partly_relevant,relevant
0,Addressing the Gaps in Early Dementia Detectio...,The rapid global aging trend has led to an inc...,moya2024addressinggapsearlydementia,arXiv,1.0,2.0,0.0,intro: rev,specific,,early demencia detection,,FALSE,FALSE,TRUE
1,Artificial intelligence in clinical genetics,Artificial intelligence (AI) has been growing ...,Duong2025-xi,PubMed,1.0,2.0,0.0,intro: rev,,,clinical genetics,,FALSE,FALSE,TRUE
2,Attention mechanism models for precision medicine,The development of deep learning models plays ...,10.1093/bib/bbae156,PubMed,1.0,2.0,0.0,intro: rev,general,"SAN, GAT, transformers, other",Attention Mechanism Models for Precision Medic...,,FALSE,FALSE,TRUE
3,Bioinformatics and Biomedical Informatics with...,The year 2023 marked a significant surge in th...,wang2024bioinformaticsbiomedicalinformaticscha...,"arXiv,PubMed",1.0,2.0,0.0,intro: rev,chatgpt,ChatGPT; systematic review,Broad applications of LLMs in biomedical domai...,Reviews ChatGPT's applications in bioinformati...,FALSE,FALSE,TRUE
4,Chatbot Artificial Intelligence for Genetic Ca...,Most individuals with a hereditary cancer synd...,Webster2023-of,PubMed,1.0,2.0,0.0,intro: rev,specific,,,,FALSE,FALSE,TRUE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
316,,,,,,,,,,post: DRA,6,8,,,
317,,,,,,,,,,post: CRGDS,9,15,,,
318,,,,,,,,,,,,,,,
319,,,,,,,,,,edu,5,9,,,


Check the NaNs: rows with a missing annotation should be entirely empty, which confirms that we have not missed annotating anything.

In [77]:
# For each annotation column, list the value combinations found in rows where it is NA.
for col in ['code', 'what section used', 'subgroup']:
    na_col = data[data[col].isna()]
    print(f'If column `{col}` is NA, then other values are (correct if na, and relevance is 1, not 2):')
    display(
        na_col.fillna('NA')
        .groupby(['relevance', 'code', 'what section used', 'subgroup'])
        .size()
        .reset_index(name='count')
    )

If column `code` is NA, then other values are (correct if na, and relevance is 1, not 2):


Unnamed: 0,relevance,code,what section used,subgroup,count
0,1.0,,,,132
1,,,,,17


If column `what section used` is NA, then other values are (correct if na, and relevance is 1, not 2):


Unnamed: 0,relevance,code,what section used,subgroup,count
0,1.0,,,,132
1,,,,,17


If column `subgroup` is NA, then other values are (correct if na, and relevance is 1, not 2):


Unnamed: 0,relevance,code,what section used,subgroup,count
0,1.0,,,,132
1,2.0,0.0,intro: rev,,1
2,,,,,17

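The same check can be phrased as assertions, so that a partly annotated row fails loudly instead of needing a visual scan. This is a minimal sketch on a hypothetical frame (`demo` is an illustrative name). It assumes `code` and `what section used` are always filled in together, and it treats `subgroup` more loosely because the output above shows one annotated row without a subgroup.

```python
import pandas as pd

# Hypothetical rows: two annotated articles and two unannotated ones.
demo = pd.DataFrame({
    'code': [0.0, 111.0, None, None],
    'what section used': ['intro: rev', 'pre: KNLR', None, None],
    'subgroup': [None, 'extraction', None, None],
})

code_na = demo['code'].isna()
section_na = demo['what section used'].isna()

# Assumption: `code` and `what section used` are missing together or not at all.
assert (code_na == section_na).all(), 'partly annotated rows found'

# A row without a section should not carry a subgroup either.
assert demo.loc[section_na, 'subgroup'].isna().all(), 'subgroup set without a section'

print(f'{int(section_na.sum())} unannotated rows, all fully empty')
```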

Drop the rows with missing annotations, so that the rest of the analysis uses only the articles we selected.

In [78]:
# Keep only annotated rows; .copy() avoids chained-assignment warnings on later edits.
data = data[data['what section used'].notna()].copy()
data.shape

(172, 15)

## Draw tables and graphs for the article

In [79]:
print(f'All used articles: {data.shape[0]}\n------------------------------')

print('Relevances count:')
display(data.groupby('relevance').size().reset_index(name='count'))
print('------------------------------')

All used articles: 172
------------------------------
Relevances count:


Unnamed: 0,relevance,count
0,1.0,49
1,2.0,123


------------------------------


In [80]:
# Step 1: Split the string by commas or semicolons, strip whitespace, and explode into rows
sections = (
    data['what section used']
    .dropna()
    .str.split(r'[;,]')
    .explode()
    .str.strip()
)

# Step 2: Count frequency of each section entry
section_counts = sections.value_counts().reset_index()
section_counts.columns = ['section_code', 'count']

# Step 3: Define the mapping from section code to (phase, section name)
section_mapping = {
    'intro: rev': ('Introduction', 'review'),
    'pre: KNLR': ('Pre-Analytics', 'knowledge navigation & literature review'),
    'pre: RS': ('Pre-Analytics', 'risk stratification'),
    'ana: MIA': ('Analytics', 'medical imaging analysis'),
    'ana: AVE': ('Analytics', 'analysis of variant effects'),
    'ana: CVI': ('Analytics', 'clinical variant interpretation'),
    'post: PCS': ('Post-Analytics', 'patient clustering & concept typing'),
    'post: DRA': ('Post-Analytics', 'data & results aggregation'),
    'post: CRGDS': ('Post-Analytics', 'clinical report generation & decision support'),
    'edu': ('Education', 'education'),
    'disc': ('Discussion', 'LLM usage, other medical topics')
}


# Step 4: Map section code to phase and name
section_counts['analytics_phase'] = section_counts['section_code'].map(lambda x: section_mapping.get(x, ('Unknown', 'Unknown'))[0])
section_counts['section_name'] = section_counts['section_code'].map(lambda x: section_mapping.get(x, ('Unknown', 'Unknown'))[1])

# Step 5: Sort by mapping order
mapping_order = list(section_mapping.keys())
section_counts['section_code'] = pd.Categorical(section_counts['section_code'], categories=mapping_order, ordered=True)
section_counts = section_counts.sort_values('section_code')

# Step 6: Reorder columns
section_counts = section_counts[['analytics_phase', 'section_name', 'count']].rename(
    columns={
        'analytics_phase': 'article section',
        'section_name': 'research/application area',
    }
)

# Display result
display(section_counts)

Unnamed: 0,article section,research/application area,count
5,Introduction,review,12
1,Pre-Analytics,knowledge navigation & literature review,40
6,Pre-Analytics,risk stratification,11
2,Analytics,medical imaging analysis,24
3,Analytics,analysis of variant effects,23
7,Analytics,clinical variant interpretation,9
10,Post-Analytics,patient clustering & concept typing,7
9,Post-Analytics,data & results aggregation,8
4,Post-Analytics,clinical report generation & decision support,15
8,Education,education,9

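Summing these counts per phase gives the number of section mentions, not the number of distinct articles, because one article can list several sections. The sketch below retypes the table above by hand (`table` is an illustrative name; the numbers are copied from the displayed output, not recomputed from the data).

```python
import pandas as pd

# Counts copied from the table displayed above.
table = pd.DataFrame({
    'article section': [
        'Introduction',
        'Pre-Analytics', 'Pre-Analytics',
        'Analytics', 'Analytics', 'Analytics',
        'Post-Analytics', 'Post-Analytics', 'Post-Analytics',
        'Education',
    ],
    'count': [12, 40, 11, 24, 23, 9, 7, 8, 15, 9],
})

# sort=False keeps the phases in order of first appearance.
phase_totals = table.groupby('article section', sort=False)['count'].sum()
print(phase_totals)
```

These totals are upper bounds on the number of articles per phase; counting each article once per phase requires going back to the per-article `what section used` values.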

In [81]:
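# Count articles that touch each diagnostic phase; an article may appear in several.
flags = []
print('Parts of the diagnostics:')
for phase in ['pre', 'ana', 'post']:
    flag = data['what section used'].apply(lambda x: phase in x)
    print(phase, flag.sum())
    flags.append(flag)
# An article counts once here if it belongs to at least one of the three phases.
print(f'\nAll diagnostics: {(flags[0] | flags[1] | flags[2]).sum()}')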

Parts of the diagnostics:
pre 51
ana 53
post 29

All diagnostics: 122

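The substring test works here because none of the prefixes `pre`, `ana` and `post` occurs inside another code listed in the mapping. A slightly stricter variant, sketched below on hypothetical values, splits each entry and compares only the text before the colon. `phases_of` is an illustrative helper and not part of the notebook.

```python
import re
import pandas as pd

def phases_of(entry: str) -> set:
    """Return the set of phase prefixes found in one `what section used` value."""
    codes = [c.strip() for c in re.split(r'[;,]', entry) if c.strip()]
    return {c.split(':', 1)[0].strip() for c in codes}

# Hypothetical entries, including multi-section ones.
demo = pd.Series([
    'pre: KNLR',
    'ana: MIA, ana: AVE',
    'post: DRA; pre: RS',
    'edu',
    'intro: rev',
])
phase_sets = demo.map(phases_of)

for phase in ['pre', 'ana', 'post']:
    print(phase, int(phase_sets.map(lambda s: phase in s).sum()))

# An entry is diagnostic if it has at least one of the three phases.
diagnostic = phase_sets.map(lambda s: bool(s & {'pre', 'ana', 'post'}))
print('All diagnostics:', int(diagnostic.sum()))
```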

In [84]:
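# Cast the section codes to integers, then count articles per (code, subgroup).
# Assigning through .loc can keep the column's float dtype (the output below still
# shows 0.0, 111.0, ...); `data['code'] = ...` replaces the column so the cast sticks.
data.loc[:, 'code'] = data['code'].astype(int)
data.groupby(['code', 'subgroup']).size().reset_index(name='count')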

Unnamed: 0,code,subgroup,count
0,0.0,chatgpt,3
1,0.0,general,2
2,0.0,not_llm,2
3,0.0,specific,4
4,111.0,extraction,15
5,112.0,qa,11
6,113.0,creation,11
7,121.0,predisposition,4
8,122.0,patient,5
9,211.0,sequence,4


```
0* - intro
1* - pre (11* - KNLR, 12* - RS)
2* - ana
3* - post
41* - edu
51* - disc
```
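The legend can be turned into a small lookup that decodes a numeric `code` by its leading digits. This is a sketch under the assumption that `*` stands for any trailing digits and that the longest matching prefix wins. `CODE_PREFIXES` and `section_of` are illustrative names, not part of the notebook.

```python
# Prefixes copied from the legend above.
CODE_PREFIXES = {
    '11': 'pre: KNLR',
    '12': 'pre: RS',
    '41': 'edu',
    '51': 'disc',
    '0': 'intro',
    '1': 'pre',
    '2': 'ana',
    '3': 'post',
}

def section_of(code) -> str:
    """Decode a numeric code (int or float such as 111.0) into its section label."""
    text = str(int(code))
    # Assumption: the longest matching prefix wins, so '11' is tried before '1'.
    for prefix in sorted(CODE_PREFIXES, key=len, reverse=True):
        if text.startswith(prefix):
            return CODE_PREFIXES[prefix]
    return 'unknown'

for code in [0.0, 111.0, 121.0, 211.0]:
    print(code, '->', section_of(code))
```

Applied to the table, `data['code'].map(section_of)` would give a label to cross-check against `what section used`.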