# Label Generation

This notebook outlines how topic labels were obtained from several LLMs, presents their results, and includes the implementation of Llama-3.2-3B-Instruct used for this experiment.

**Note**: A Hugging Face user access token is needed for this project. See [the Hugging Face documentation](https://huggingface.co/docs/hub/security-tokens) for how to create one; a free Hugging Face account is required.

# Models Chosen

A range of LLMs was tested. To ensure a fair comparison, only free versions of models were chosen; however, an account of some kind is needed to access each one. The models chosen are as follows:
* **[ChatGPT 4o](https://chatgpt.com/)**: The free version was used; a free OpenAI account is needed.
* **[Claude 3.7 Sonnet](https://claude.ai/login?returnTo=%2F%3F)**: Free, but a Claude account is needed.
* **[Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)**: Open-source; however, a free Hugging Face account and access key are needed to use this model.
* **[DeepSeek DeepThink R1](https://chat.deepseek.com/sign_in)**: Used for evaluation; a free DeepSeek account is also needed.

## Aim of LLM Usage
As topic modeling is an unsupervised task, the results obtained from both LDA and BERTopic do not contain any actual labels that could be used to categorize the articles. Each result is a list of keywords per topic; these lists are included below. The aim of using LLMs is to generate fluent and practical labels from them.

## Results from LDA & BERTopic Topic Modeling Approaches
Below are the unsupervised results for which we would like to create labels.

### LDA
```
Topic_1: student, participant, survey, mental, online, anxiety, education, child, self, psychological
Topic_2: vaccine, vaccination, antibody, variant, transmission, influenza, testing, assay, detection, detect
Topic_3: mask, food, air, concentration, surface, energy, water, temperature, material, environmental
Topic_4: policy, article, service, community, crisis, economic, global, management, paper, world
Topic_5: ci, mortality, p, death, trial, child, admission, severity, cohort, covid
Topic_6: protein, drug, immune, target, gene, expression, receptor, mechanism, induce, therapeutic
Topic_7: cancer, surgery, lung, complication, surgical, pain, procedure, undergo, diagnosis, pulmonary
Topic_8: propose, network, performance, learning, image, application, accuracy, user, predict, tool
```

### BERTopic
```
Topic_1: ['teacher', 'teaching', 'learning', 'learn', 'student', 'online', 'education', 'university', 'educational', 'learner']
Topic_2: ['child', 'pediatric', 'multisystem', 'inflammatory', 'pim', 'syndrome', 'infant', 'adult', 'fever', 'ts']
Topic_3: ['replication', 'glycan', 'protein', 'rna', 'virus', 'viral', 'membrane', 'glycosylation', 'virion', 'host']
Topic_4: ['image', 'ray', 'deep', 'segmentation', 'chest', 'classification', 'dataset', 'convolutional', 'ct', 'learning']
Topic_5: ['government', 'governance', 'law', 'policy', 'country', 'public', 'global', 'right', 'political', 'international']
Topic_6: ['protease', 'docking', 'compound', 'inhibitor', 'mpro', 'drug', 'molecular', 'bind', 'ligand', 'pro']
Topic_7: ['vte', 'thrombosis', 'anticoagulation', 'embolism', 'venous', 'coagulation', 'thrombotic', 'thromboembolism', 'dimer', 'heparin']
Topic_8: ['hospital', 'hcw', 'bed', 'ambulance', 'staff', 'emergency', 'healthcare', 'preparedness', 'care', 'facility']
```
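The two approaches print their keywords in different shapes: LDA as comma-separated words, BERTopic as a Python list literal. A minimal sketch of how both could be read into one structure before prompting is shown here; the helper name `parse_topic_line` is hypothetical and not part of the original notebook.

```python
import ast


def parse_topic_line(line):
    """Split a 'Topic_N: ...' line into (topic_id, keywords) for either output style."""
    topic_id, _, rest = line.partition(":")
    rest = rest.strip()
    if rest.startswith("["):
        # BERTopic style: the keywords are printed as a Python list literal
        keywords = ast.literal_eval(rest)
    else:
        # LDA style: the keywords are bare, comma-separated words
        keywords = [word.strip() for word in rest.split(",") if word.strip()]
    return topic_id.strip(), keywords


# One line of each style, copied from the results above
lda_line = "Topic_5: ci, mortality, p, death, trial, child, admission, severity, cohort, covid"
bertopic_line = (
    "Topic_7: ['vte', 'thrombosis', 'anticoagulation', 'embolism', 'venous', "
    "'coagulation', 'thrombotic', 'thromboembolism', 'dimer', 'heparin']"
)

lda_topic = parse_topic_line(lda_line)
bertopic_topic = parse_topic_line(bertopic_line)
print(lda_topic)
print(bertopic_topic)
```

Normalising both outputs to the same `(topic_id, keywords)` shape means the same prompt-building step can be reused for LDA and BERTopic.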

## Prompt Design
The following prompt was used for ChatGPT, Llama 3.2 and Claude. The example labels were taken from an article in [the New England Journal of Medicine](https://www.nejm.org/doi/10.1056/NEJMsb2408466). It was chosen because it covers a related topic but, having been released last year, is not connected to the CORD-19 dataset.
```
Please analyze these keyword groups and create a specific, descriptive label for each topic that:
Captures the central theme with precision in the style of the example topics given.
Uses domain-specific terminology appropriate for medical/scientific research
Is concise (preferably 2-5 words).
Format your response as a numbered list corresponding to the topic numbers provided.
Example Labels: Cardiology General, Clinical Medicine General, Coronavirus, Critical Care, Diagnostics, Global Health, Hospital-Based Clinical Medicine, Infectious Disease General, Neurology/Neurosurgery General, Outpatient-Based Clinical Medicine, Public Health General, Viral Infections, Pulmonary/Critical Care General

Input: [Topic Lists from LDA/BERTopic]

Labels:
```
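As an illustration, the prompt above could be assembled programmatically as follows. This is only a sketch: `PROMPT_TEMPLATE`, `build_label_prompt` and the `Topic_N: word, word` layout substituted for the `Input:` placeholder are assumptions, since the notebook does not state exactly how the topic lists were pasted in.

```python
# Prompt text copied from the notebook; only the Input placeholder is parameterised.
PROMPT_TEMPLATE = """Please analyze these keyword groups and create a specific, descriptive label for each topic that:
Captures the central theme with precision in the style of the example topics given.
Uses domain-specific terminology appropriate for medical/scientific research
Is concise (preferably 2-5 words).
Format your response as a numbered list corresponding to the topic numbers provided.
Example Labels: Cardiology General, Clinical Medicine General, Coronavirus, Critical Care, Diagnostics, Global Health, Hospital-Based Clinical Medicine, Infectious Disease General, Neurology/Neurosurgery General, Outpatient-Based Clinical Medicine, Public Health General, Viral Infections, Pulmonary/Critical Care General

Input:
{topics}

Labels:"""


def build_label_prompt(topics):
    """Fill the template with one 'Topic_N: kw1, kw2, ...' line per topic (assumed layout)."""
    topic_block = "\n".join(
        f"{topic_id}: {', '.join(keywords)}" for topic_id, keywords in topics.items()
    )
    return PROMPT_TEMPLATE.format(topics=topic_block)


# Two LDA topics from the results above, truncated to three keywords for brevity
example_topics = {
    "Topic_1": ["student", "participant", "survey"],
    "Topic_2": ["vaccine", "vaccination", "antibody"],
}
prompt = build_label_prompt(example_topics)
print(prompt)
```

Keeping the template fixed and varying only the topic block is what allows the same prompt to be sent unchanged to each model.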


# Results
The generated labels from each model are included below.

## ChatGPT 4o

### LDA

```
Mental Health in Educational Settings
Vaccine Immunology and Detection
Environmental Exposure and Contamination
Global Health Policy and Systems
Pediatric COVID-19 Outcomes
Molecular Therapeutics and Immunomodulation
Thoracic Oncology and Surgical Outcomes
AI Applications in Medical Imaging
```
### BERTopic
```
Health Professions Education
Pediatric Inflammatory Syndromes
Viral Molecular Biology
Medical Imaging and AI
Global Health Policy
Antiviral Drug Discovery
Venous Thromboembolism Management
Hospital-Based Emergency Care
```


## Claude
### LDA
```
Pediatric Mental Health Assessment
Vaccine Immunology & Surveillance
Environmental Exposure Determinants
Global Health Policy Management
Clinical Outcomes Epidemiology
Molecular Immunotherapeutics
Thoracic Surgical Oncology
Predictive Diagnostic Algorithms
```

### BERTopic

```
Medical Education Pedagogy
Pediatric Inflammatory Syndromes
Viral Molecular Biology
Medical Imaging Diagnostics
Health Policy Governance
Drug Discovery Proteomics
Thrombotic Disease Management
Healthcare Facility Operations
```

## Llama3.2
### LDA

```
Child Mental Health Education Survey
Influenza Vaccine Detection Assay
Environmental Air Quality Assessment
Global Economic Crisis Management Policy
COVID-19 Child Mortality Severity Cohort
Immune Targeted Therapeutic Protein Expression
Pulmonary Cancer Surgery Diagnosis Complication
Image-Based Learning Performance Prediction Tool
```
### BERTopic

```
Educational Technology
Pediatric Multisystem Inflammatory Syndrome
Viral Glycan Interactions
Deep Learning Chest X-ray Segmentation
Global Governance and International Law
Protease Inhibitor Compound Design
Venous Thrombosis and Coagulation Management
Emergency Healthcare Preparedness and Staffing
```


# DeepSeek Evaluation Using Reasoning Model
As human evaluation was outside the scope of this project, the following prompt was given to DeepSeek's R1 reasoning model. The limitations of this approach are discussed further in the report.

## Prompt
```
We asked three LLMs to generate topic labels based on lists of words for a collection of medical/scientific articles. Each model created 8 topic labels based on 8 lists of words. Evaluate the performance of each model using the following metrics on a 1-5 scale, where 1 is poor performance and 5 high performance:
Fluency/ Coherence, i.e. how fluent and grammatical is the language of the topic label?
Suitability, i.e. how well does the label describe the words associated with it?
Usability, i.e. is the topic label usable as a topic tag?
Please also choose the best topic label for each list of words and mention if there are any topic labels you would change.
Please explain all your answers.

Lists of Words:
[Topic Lists]

Model 1:
[Labels]

Model 2:
[Labels]

Model 3:
[Labels]
```
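The `[Labels]` placeholders hide which system produced which list. A small sketch of how the three label sets could be rendered under neutral `Model N` headings follows; `format_model_sections` is a hypothetical helper, and the ordering shown (ChatGPT 4o, Claude, Llama 3.2) is inferred from the labels DeepSeek quotes in its responses rather than stated in the notebook.

```python
def format_model_sections(labels_by_model):
    """Render each model's labels under an anonymous 'Model N:' heading."""
    sections = []
    # dicts preserve insertion order, so the first entry becomes Model 1, and so on
    for position, labels in enumerate(labels_by_model.values(), start=1):
        sections.append(f"Model {position}:\n" + "\n".join(labels))
    return "\n\n".join(sections)


# First two LDA labels per model, copied from the results above
labels_by_model = {
    "ChatGPT 4o": ["Mental Health in Educational Settings", "Vaccine Immunology and Detection"],
    "Claude": ["Pediatric Mental Health Assessment", "Vaccine Immunology & Surveillance"],
    "Llama 3.2": ["Child Mental Health Education Survey", "Influenza Vaccine Detection Assay"],
}
model_sections = format_model_sections(labels_by_model)
print(model_sections)
```

Only the label text reaches the evaluator; the model names are used as dictionary keys and never appear in the rendered prompt.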

## Results
### LDA
```
Pediatric Mental Health Assessment
Vaccine Immunology & Surveillance
Environmental Exposure and Contamination
Global Economic Crisis Policy
Pediatric COVID-19 Outcomes
Molecular Therapeutics and Immunomodulation
Thoracic Oncology and Surgical Outcomes
AI-Driven Medical Imaging Tools
```


### BERTopic
```
Online Education and Learning
Pediatric Multisystem Inflammatory Syndrome
Viral Glycan Interactions
Deep Learning Chest X-ray Segmentation
Global Governance and International Law
Protease Inhibitor Compound Design
Venous Thromboembolism Management
Emergency Healthcare Preparedness and Staffing
```

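The next cells include DeepSeek's full responses and reasoning. Judging by the labels quoted in those responses, Model 1 is ChatGPT 4o, Model 2 is Claude and Model 3 is Llama 3.2.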

# DeepSeek Response - LDA
### Model Evaluations

#### **Model 1**  
**Fluency/Coherence**: 4.8/5  
**Suitability**: 4.4/5  
**Usability**: 4.8/5  
**Strengths**: Highly fluent and concise labels. Struggled only with Topic_4 (misinterpreted "health").  
**Weaknesses**: Topic_4 inaccurately included "health," reducing suitability.  

#### **Model 2**  
**Fluency/Coherence**: 4.6/5  
**Suitability**: 4.5/5  
**Usability**: 4.5/5  
**Strengths**: Balanced specificity and fluency (e.g., "Pediatric Mental Health Assessment").  
**Weaknesses**: Topic_4 still included "health," and Topic_5 was overly vague.  

#### **Model 3**  
**Fluency/Coherence**: 3.1/5  
**Suitability**: 4.1/5  
**Usability**: 3.0/5  
**Strengths**: High specificity (e.g., Topic_4, Topic_5).  
**Weaknesses**: Clunky phrasing (e.g., "COVID-19 Child Mortality Severity Cohort").  

---

### **Best Labels per Topic**  
1. **Topic_1**: *Pediatric Mental Health Assessment* (Model 2)  
   - Covers "child," "mental health," and "survey" (as "assessment").  
2. **Topic_2**: *Vaccine Immunology & Surveillance* (Model 2)  
   - "Surveillance" captures variant tracking and testing.  
3. **Topic_3**: *Environmental Exposure and Contamination* (Model 1)  
   - Broadly covers air, water, and materials.  
4. **Topic_4**: *Global Economic Crisis Management Policy* (Model 3)  
   - Accurately reflects "economic," "crisis," and "policy."  
5. **Topic_5**: *Pediatric COVID-19 Outcomes* (Model 1)  
   - Concise and directly links COVID-19 to pediatric outcomes.  
6. **Topic_6**: *Molecular Therapeutics and Immunomodulation* (Model 1)  
   - Balances molecular mechanisms and immune targeting.  
7. **Topic_7**: *Thoracic Oncology and Surgical Outcomes* (Model 1)  
   - Combines cancer focus with surgical complications.  
8. **Topic_8**: *AI Applications in Medical Imaging* (Model 1)  
   - Clear and aligns with "learning," "image," and "tool."  

---

### **Suggested Label Revisions**  
- **Topic_4**: All models initially included "health," which isn’t in the word list. Model 3’s label is correct but could be shortened to *Global Economic Crisis Policy*.  
- **Topic_6**: Model 3’s label (*Immune Targeted Therapeutic Protein Expression*) is too jargon-heavy; Model 1’s version is preferable.  
- **Topic_8**: Model 3’s label (*Image-Based Learning Performance Prediction Tool*) could be simplified to *AI-Driven Medical Imaging Tools*.  

---

### **Final Scores**  
| Metric          | Model 1 | Model 2 | Model 3 |  
|-----------------|---------|---------|---------|  
| Fluency/Coherence | 4.8     | 4.6     | 3.1     |  
| Suitability      | 4.4     | 4.5     | 4.1     |  
| Usability        | 4.8     | 4.5     | 3.0     |  

**Overall Best Model**: **Model 1** for fluency and usability, though Model 2 is close in suitability. Model 3 excels in specificity but suffers in readability.


## Reasoning:
```
Okay, let's tackle this evaluation step by step. First, I need to understand the task. We have three models that generated topic labels for eight lists of words related to medical/scientific articles. The goal is to evaluate each model's labels based on three metrics: Fluency/Coherence, Suitability, and Usability, each on a 1-5 scale. Then, choose the best label for each topic and note any changes needed.

Starting with Topic_1 words: student, participant, survey, mental, online, anxiety, education, child, self, psychological. The key themes here are mental health in an educational context involving children. Model 1's label is "Mental Health in Educational Settings" – that's fluent and covers education and mental health, but maybe misses the survey/online aspects. Model 2's "Pediatric Mental Health Assessment" includes children and assessment (which ties to survey), more specific. Model 3's "Child Mental Health Education Survey" is specific but a bit clunky. Fluency for Model 3 might be lower because it's a longer phrase. Suitability: Model 2 might be better as it includes assessment (survey) and pediatric (child). Usability: Model 1 is more concise but maybe less specific. So maybe Model 2's label is best here.

Moving to Topic_2: vaccine, vaccination, antibody, variant, transmission, influenza, testing, assay, detection, detect. The focus is on vaccines, testing, detection, variants, influenza. Model 1: "Vaccine Immunology and Detection" – covers immunology (antibodies) and detection, but maybe not testing/assay specifically. Model 2: "Vaccine Immunology & Surveillance" – surveillance could relate to monitoring variants and transmission. Model 3: "Influenza Vaccine Detection Assay" – very specific to influenza and detection/assay, but maybe too narrow since the words include variant and transmission. Model 1 and 2 are broader. Model 2's "Surveillance" might cover transmission tracking. Suitability: Model 2 better? But maybe "Detection" from Model 1 is also good. The word "assay" is in Model 3's label, which is part of the words. However, Model 3's label is less fluent and too specific. So Model 2 might be better, but I need to check if surveillance fits. Alternatively, Model 1's label might miss testing/assay. Maybe Model 2's "Surveillance" includes testing and detection. So Model 2 could be better. But I need to see if "Surveillance" is the right term here. Alternatively, Model 1's "Detection" is more accurate. Hmm.

For Topic_3: mask, food, air, concentration, surface, energy, water, temperature, material, environmental. The words are about environmental factors and possibly pollution, air/water quality. Model 1: "Environmental Exposure and Contamination" – contamination relates to air/water, exposure to environmental factors. Model 2: "Environmental Exposure Determinants" – determinants might refer to factors affecting exposure, but "determinants" is a bit abstract. Model 3: "Environmental Air Quality Assessment" – focuses on air quality, but the words include water, food, etc. So Model 3's label is too narrow. Model 1's "Contamination" covers air, water, surfaces. But maybe "Exposure" is about contact with these elements. So Model 1 is better here. Model 2's "Determinants" is less clear. So best label is Model 1.

Topic_4: policy, article, service, community, crisis, economic, global, management, paper, world. Themes are global policy, crisis management, economic aspects. Model 1: "Global Health Policy and Systems" – but the words don't specifically mention health. Words include economic, crisis, management. Model 2: "Global Health Policy Management" – again, health isn't in the words. Model 3: "Global Economic Crisis Management Policy" – this includes economic crisis, management, policy, global. That fits better. The original words have economic and crisis, so Model 3's label is more suitable. Model 1 and 2 incorrectly assume health, which isn't in the words. So Model 3 is best here.

Topic_5: ci, mortality, p, death, trial, child, admission, severity, cohort, covid. This seems to be about COVID-19 outcomes in children, mortality, severity, clinical trials. Model 1: "Pediatric COVID-19 Outcomes" – concise and covers child (pediatric), COVID, outcomes (mortality, severity). Model 2: "Clinical Outcomes Epidemiology" – too vague, doesn't mention COVID or pediatric. Model 3: "COVID-19 Child Mortality Severity Cohort" – accurate but wordy and not very fluent. Model 1 is better here, more fluent and suitable.

Topic_6: protein, drug, immune, target, gene, expression, receptor, mechanism, induce, therapeutic. Focus on drug development, immune system targeting, gene expression. Model 1: "Molecular Therapeutics and Immunomodulation" – covers therapeutics and immune modulation. Model 2: "Molecular Immunotherapeutics" – similar but shorter. Model 3: "Immune Targeted Therapeutic Protein Expression" – a bit jumbled but includes target, immune, therapeutic, protein. Model 1 and 2 are more fluent. Suitability: Maybe Model 2's "Immunotherapeutics" captures the immune and drug aspects. But "Molecular" might cover gene/protein. Model 1's "Immunomodulation" relates to mechanisms. Both are good, but Model 2 is more concise. However, the words include gene expression and receptors, which might be part of molecular mechanisms. So maybe Model 1 is better. Alternatively, Model 3 includes "protein expression" which is part of the words, but the label is less fluent. So Model 1 or 2. Model 1's "Immunomodulation" might be more specific to immune mechanisms. So maybe Model 1.

Topic_7: cancer, surgery, lung, complication, surgical, pain, procedure, undergo, diagnosis, pulmonary. Thoracic surgery related to lung cancer, complications. Model 1: "Thoracic Oncology and Surgical Outcomes" – oncology covers cancer, surgical outcomes include complications. Model 2: "Thoracic Surgical Oncology" – combines surgery and oncology. Model 3: "Pulmonary Cancer Surgery Diagnosis Complication" – includes all elements but is a string of terms. Model 1 and 2 are more fluent. Model 1's "Surgical Outcomes" covers complications, pain. Model 2 is more concise. Both are good, but maybe Model 1 is slightly better as it includes outcomes.

Topic_8: propose, network, performance, learning, image, application, accuracy, user, predict, tool. This is about AI/ML applications, particularly imaging. Model 1: "AI Applications in Medical Imaging" – concise, covers AI, imaging. Model 2: "Predictive Diagnostic Algorithms" – predictive relates to predict, diagnostic maybe not directly in words. Model 3: "Image-Based Learning Performance Prediction Tool" – includes image, learning, predict, tool, but is a bit wordy. Model 1's label is more usable and fluent. Model 2's label is narrower. Model 1 better suits the words, which include learning (as in machine learning), image, application, etc.

Now, going back to each model's labels and scoring them.

For Model 1:

- Topic_1: Fluency 5, Suitable 4 (misses survey/online), Usability 5.

- Topic_2: Fluency 5, Suitable 4 (maybe missing testing/assay), Usability 5.

- Topic_3: Fluency 5, Suitable 5, Usability 5.

- Topic_4: Fluency 5, Suitable 3 (wrongly includes health), Usability 3.

- Topic_5: Fluency 5, Suitable 5, Usability 5.

- Topic_6: Fluency 5, Suitable 5, Usability 5.

- Topic_7: Fluency 5, Suitable 5, Usability 5.

- Topic_8: Fluency 5, Suitable 5, Usability 5.

Average scores would be high except for Topic_4.

Model 2:

- Topic_1: Fluency 5, Suitable 5 (includes pediatric and assessment), Usability 5.

- Topic_2: Fluency 5 (though "&" might be less formal), Suitable 5 (surveillance covers testing/detection), Usability 5.

- Topic_3: Fluency 4 ("Determinants" is a bit academic), Suitable 4 (less direct than Model 1), Usability 4.

- Topic_4: Fluency 5, Suitable 3 (still includes health), Usability 3.

- Topic_5: Fluency 5, Suitable 2 (too vague), Usability 2.

- Topic_6: Fluency 5, Suitable 5, Usability 5.

- Topic_7: Fluency 5, Suitable 5, Usability 5.

- Topic_8: Fluency 5, Suitable 4 (predictive and diagnostic, but maybe narrow), Usability 4.

Average might be higher than Model 1 except for Topic_4 and 5.

Model 3:

- Topic_1: Fluency 3 (clunky), Suitable 5 (covers child, mental health, education, survey), Usability 3.

- Topic_2: Fluency 3 (awkward), Suitable 3 (too narrow to influenza), Usability 2.

- Topic_3: Fluency 4, Suitable 3 (narrow to air quality), Usability 3.

- Topic_4: Fluency 4 (long but correct), Suitable 5, Usability 4 (long but usable).

- Topic_5: Fluency 2 (too wordy), Suitable 5, Usability 2.

- Topic_6: Fluency 3 (jumbled terms), Suitable 4 (covers aspects but not fluent), Usability 3.

- Topic_7: Fluency 2 (string of terms), Suitable 4, Usability 2.

- Topic_8: Fluency 3 (wordy), Suitable 4, Usability 3.

Scores vary, with some high suitability but lower fluency and usability.

Now, compiling all this into scores and explanations, choosing best labels per topic.
```
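The reasoning trace above lists a fluency, suitability and usability score for every topic, so the per-model summary can be sanity-checked with simple arithmetic. The sketch below takes a plain mean per metric; that aggregation rule is an assumption, since DeepSeek does not say how it derived its final figures, and the names `per_topic_scores` and `average_scores` are illustrative.

```python
from statistics import mean

# (fluency, suitability, usability) for Topic_1..Topic_8, transcribed from the trace above
per_topic_scores = {
    "Model 1": [(5, 4, 5), (5, 4, 5), (5, 5, 5), (5, 3, 3), (5, 5, 5), (5, 5, 5), (5, 5, 5), (5, 5, 5)],
    "Model 2": [(5, 5, 5), (5, 5, 5), (4, 4, 4), (5, 3, 3), (5, 2, 2), (5, 5, 5), (5, 5, 5), (5, 4, 4)],
    "Model 3": [(3, 5, 3), (3, 3, 2), (4, 3, 3), (4, 5, 4), (2, 5, 2), (3, 4, 3), (2, 4, 2), (3, 4, 3)],
}


def average_scores(rows):
    """Return the plain mean of each metric across all topics."""
    # zip(*rows) regroups the per-topic triples into one column per metric
    return tuple(mean(column) for column in zip(*rows))


averages = {model: average_scores(rows) for model, rows in per_topic_scores.items()}
for model, (fluency, suitability, usability) in averages.items():
    print(f"{model}: fluency={fluency:.2f} suitability={suitability:.2f} usability={usability:.2f}")
```

Under a plain mean, the per-topic scores do not reproduce the reported summary exactly (for Model 1 the means are 5.0, 4.5 and 4.75 against a reported 4.8, 4.4 and 4.8), which suggests the final table may reflect a holistic judgment rather than a strict average.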

# DeepSeek Response - BERTopic
### Evaluation of Models  
**Scoring Metrics (1-5):**  
1. **Fluency/Coherence:** Grammar, readability, and linguistic clarity.  
2. **Suitability:** Relevance to the associated word list.  
3. **Usability:** Practicality as a concise, informative topic tag.  

---

#### **Model 1**  
- **Fluency/Coherence:** 5/5  
  - All labels are grammatically flawless and concise.  
- **Suitability:** 4/5  
  - Generally accurate but occasionally overbroad (e.g., "Global Health Policy" for Topic 5 ignores non-health governance terms).  
- **Usability:** 5/5  
  - Labels are short and intuitive for topic tagging.  

**Best Labels:**  
- **Topic 7:** *Venous Thromboembolism Management* (specific and precise).  

**Suggested Changes:**  
- **Topic 5:** Replace "Global Health Policy" with "Global Governance and Law" (original words lack a health focus).  

---

#### **Model 2**  
- **Fluency/Coherence:** 5/5  
  - Professionally structured and error-free.  
- **Suitability:** 3/5  
  - Some labels misalign with word lists (e.g., "Drug Discovery Proteomics" for Topic 6 introduces "proteomics," which is absent from the original words).  
- **Usability:** 5/5  
  - Clear and functional for tagging.  

**Best Labels:**  
- None outperformed other models.  

**Suggested Changes:**  
- **Topic 6:** Replace "Drug Discovery Proteomics" with "Protease Inhibitor Design" (aligns with molecular terms like *mpro*, *ligand*).  

---

#### **Model 3**  
- **Fluency/Coherence:** 5/5  
  - Technical but grammatically sound.  
- **Suitability:** 5/5  
  - Most precise alignment with word lists (e.g., *Pediatric Multisystem Inflammatory Syndrome* for Topic 2 includes "multisystem," a key term).  
- **Usability:** 4/5  
  - Slightly verbose in some cases (e.g., *Deep Learning Chest X-ray Segmentation* could omit "Chest X-ray" for brevity).  

**Best Labels (6/8 Topics):**  
- **Topic 2:** *Pediatric Multisystem Inflammatory Syndrome*  
- **Topic 3:** *Viral Glycan Interactions*  
- **Topic 4:** *Deep Learning Chest X-ray Segmentation*  
- **Topic 5:** *Global Governance and International Law*  
- **Topic 6:** *Protease Inhibitor Compound Design*  
- **Topic 8:** *Emergency Healthcare Preparedness and Staffing*  

**Suggested Changes:**  
- **Topic 1:** Replace *Educational Technology* with *Online Education and Learning* (directly reflects "online" and "student").  
- **Topic 7:** *Venous Thrombosis and Coagulation Management* could merge with Model 1’s *Venous Thromboembolism Management* for conciseness.  

---

### **Final Scores**  
| Model | Fluency/Coherence | Suitability | Usability |  
|-------|--------------------|-------------|-----------|  
| 1     | 5                  | 4           | 5         |  
| 2     | 5                  | 3           | 5         |  
| 3     | 5                  | 5           | 4         |  

**Best Overall Performer:** **Model 3** (highest suitability despite minor verbosity).  

**Key Takeaways:**  
- **Model 3** excels in specificity but occasionally sacrifices brevity.  
- **Model 1** is strong in usability but less precise.  
- **Model 2** struggles with relevance in niche topics.

## Reasoning
```
Okay, let's tackle this evaluation. So, the user wants me to assess three models that generated topic labels for eight medical/scientific article topics based on word lists. The metrics are Fluency/Coherence, Suitability, and Usability, each on a 1-5 scale. Plus, I need to pick the best label for each topic and suggest any changes.

First, I should go through each topic one by one. Let me start with Topic_1. The words are about teaching, learning, students, online, education, etc. Model 1 says "Health Professions Education." Hmm, that's specific to health professions, but the original words don't mention health. Model 2: "Medical Education Pedagogy." Again, medical? The original list is general education. Model 3: "Educational Technology." The word 'online' is there, so maybe that's better. Suitability-wise, Models 1 and 2 might be off. Model 3's label fits better with online and educational tech. So Fluency for Model 3 is good, Suitability too. Maybe Model 3 is best here, but could it be better? Maybe "Online Education and Learning" to include more keywords.

Next, Topic_2: child, pediatric, multisystem, inflammatory, PIM, syndrome, infant, etc. All models have similar labels. Model 1: "Pediatric Inflammatory Syndromes." Model 2 same. Model 3: "Pediatric Multisystem Inflammatory Syndrome." The word 'multisystem' is in the list, so Model 3 is more precise. "Syndrome" is singular in the list (PIM, TS), so plural vs. singular. But maybe PIMS is a known term. So Model 3's label is more accurate. Fluency is good, maybe a 5. Suitability higher. So Model 3's label is best here.

Topic_3: replication, glycan, protein, RNA, virus, etc. Model 1: "Viral Molecular Biology." That's broad. Model 2 same. Model 3: "Viral Glycan Interactions." Glycan is in the words, so that's specific. Suitability for Model 3 is better because it mentions glycan. But does it cover replication, RNA? Maybe not. So trade-off between specificity and breadth. Maybe Model 3 is better but could be adjusted. Alternatively, "Viral Replication and Glycosylation Processes" to include more terms. But as per the given labels, Model 3's is more specific. So higher suitability.

Topic_4: image, ray, deep, segmentation, chest, etc. Model 1: "Medical Imaging and AI." AI is part of deep learning. Model 2: "Medical Imaging Diagnostics." Model 3: "Deep Learning Chest X-ray Segmentation." Very specific. The words include 'deep', 'segmentation', 'chest', 'convolutional', 'CT', 'learning'. So Model 3's label is very precise. Suitability high. Fluency is a bit long but still correct. So Model 3's label is best here.

Topic_5: government, governance, law, policy, country, etc. Models 1 and 2 mention health or health policy. But the words are general: government, law, public, global, international. Model 3: "Global Governance and International Law." That fits better. Models 1 and 2 incorrectly include 'health' which isn't in the words. So Model 3's label is more suitable here.

Topic_6: protease, docking, compound, inhibitor, mpro, drug, etc. Model 1: "Antiviral Drug Discovery." Model 2: "Drug Discovery Proteomics." Model 3: "Protease Inhibitor Compound Design." The words are about protease inhibitors, molecular docking, ligands. Model 3's label is very specific. Suitability is higher. Model 1 and 2 are broader. So Model 3's label is best here.

Topic_7: VTE, thrombosis, anticoagulation, embolism, etc. Model 1: "Venous Thromboembolism Management." Model 2: "Thrombotic Disease Management." Model 3: "Venous Thrombosis and Coagulation Management." The original terms include VTE, venous, coagulation. Model 1 is precise with VTE management. Model 3 includes coagulation. Both are good. Maybe Model 1 is more concise and accurate. So best label is Model 1 here. But Model 3's is also good. Need to check which terms are covered.

Topic_8: hospital, hcw, bed, ambulance, staff, emergency, etc. Model 1: "Hospital-Based Emergency Care." Model 2: "Healthcare Facility Operations." Model 3: "Emergency Healthcare Preparedness and Staffing." The words include preparedness, staff, ambulance, emergency. Model 3's label includes preparedness and staffing, which are in the list. Model 1 focuses on emergency care. Model 2 is broader. So Model 3's label might be more suitable. But "Emergency Healthcare Preparedness and Staffing" is a bit long. Maybe "Emergency Healthcare Preparedness and Operations" to cover more. But as per given options, Model 3's label is better.

Now, for each model, scoring each metric. Let's go model by model.

Model 1:

- Topic 1: Health Professions Education. Fluency: 5, Suitability: 3 (since it's health-specific when words aren't), Usability: 4.

- Topic 2: Pediatric Inflammatory Syndromes. Fluency 5, Suitability 4 (missing multisystem), Usability 5.

- Topic 3: Viral Molecular Biology. Fluency 5, Suitability 4 (broad), Usability 5.

- Topic 4: Medical Imaging and AI. Fluency 5, Suitability 4 (covers AI but maybe not all terms), Usability 5.

- Topic 5: Global Health Policy. Suitability 3 (health not in words), others 5.

- Topic 6: Antiviral Drug Discovery. Suitability 4 (antiviral is inferred), Usability 5.

- Topic 7: Venous Thromboembolism Management. All 5s.

- Topic 8: Hospital-Based Emergency Care. Suitability 4 (misses preparedness, staffing), Usability 5.

Model 1 average would be around 4.5 for Fluency, maybe 4 for Suitability, 4.8 Usability.

Model 2:

- Topic 1: Medical Education Pedagogy. Again health/medical, same issue. Suitability 3.

- Topic 2: Same as Model 1, same scores.

- Topic 3: Same as Model 1.

- Topic 4: Medical Imaging Diagnostics. Suitable, but missing AI/deep learning. Suitability 4.

- Topic 5: Health Policy Governance. Again health, suitability 3.

- Topic 6: Drug Discovery Proteomics. Proteomics isn't in the words. The words are about protease inhibitors and molecular docking. So suitability lower, maybe 3.

- Topic 7: Thrombotic Disease Management. Broad, but VTE is specific. Suitability 4.

- Topic 8: Healthcare Facility Operations. Suitability 5? It covers hospital, facility, staff, etc. Maybe 4.5.

Model 2 might have lower suitability in some topics.

Model 3:

- Topic1: Educational Technology. Suitable, but maybe missing some aspects. Suitability 4.

- Topic2: Pediatric Multisystem Inflammatory Syndrome. Suitability 5.

- Topic3: Viral Glycan Interactions. Specific but misses replication. Suitability 4.

- Topic4: Deep Learning Chest X-ray Segmentation. Very specific, maybe misses CT and other imaging. Suitability 4.

- Topic5: Global Governance and International Law. Suitability 5.

- Topic6: Protease Inhibitor Compound Design. Suitability 5.

- Topic7: Venous Thrombosis and Coagulation Management. Suitability 5.

- Topic8: Emergency Healthcare Preparedness and Staffing. Suitability 5.

Fluency for Model3: Topic3's label is a bit technical but still fluent. All should be 5. Suitability varies but higher overall. Usability might be high except for longer labels, but still usable.

Comparing all, Model3 likely has higher scores across the board, especially in suitability.

Now, picking the best label for each topic:

Topic1: Model3's "Educational Technology" but maybe better as "Online Education and Learning Technology"? Since the user allows suggesting changes. But given the options, Model3 is best.

Topic2: Model3's label.

Topic3: Model3's.

Topic4: Model3's.

Topic5: Model3's.

Topic6: Model3's.

Topic7: Model1's "Venous Thromboembolism Management" is concise, but Model3's includes coagulation. However, VTE is a specific term encompassing DVT and PE, and the words include coagulation, so maybe Model3's is better. Or maybe Model1 is better. Hmm. The terms include VTE, thrombosis, embolism, coagulation. So Model3's label covers more terms. So best is Model3.

Topic8: Model3's label.

So overall, Model3 has the best labels for most topics.

For changes: Topic1 could be more precise. Maybe "Online Education and Learning" instead of "Educational Technology." Topic3 could include replication or RNA aspects. Topic4 could mention CT or other imaging modalities. But given the labels, Model3's are best.

Now, scoring each model's metrics:

Model1:

Fluency: All labels are fluent. 5.

Suitability: Some labels are off (Topics1,5). Average around 4.

Usability: Concise, usable. 5.

Model2:

Fluency: 5.

Suitability: Issues in Topics1,5,6. Lower, maybe 3.5.

Usability: Still good, 5.

Model3:

Fluency: 5.

Suitability: Mostly 5, some 4. Average 4.5-5.

Usability: Some labels are longer but still usable. 5.

So final scores:

Model1: F:5, S:4, U:5

Model2: F:5, S:3.5, U:5

Model3: F:5, S:5, U:5

But since the user wants per-model scores on 1-5, probably:

Model1: Fluency 5, Suitability 4, Usability 5

Model2: Fluency 5, Suitability 3, Usability 5

Model3: Fluency 5, Suitability 5, Usability 5

Wait, but for some topics, Model3's suitability may not be perfect. For example, Topic3 misses replication, Topic4 misses CT. So maybe Suitability is 4.5, but since we need to give integer scores, maybe 5 for Model3 as it's the best available. Alternatively, 4 for some. Need to check each model's labels.

But according to the user's instruction, evaluate each model's performance using the metrics on 1-5 scale. So per-model, not per-topic. So average across all topics.

But maybe the user wants an overall score per model for each metric. So for each model, across all 8 topics, average their Fluency, Suitability, Usability.

But perhaps it's better to give an overall score for each model's metrics. Let me think.

Alternatively, maybe the user wants each model to be scored on each metric as a whole. For example, Model1's Fluency: how fluent are all its labels? If all are grammatically correct, Fluency is 5. Suitability: average how well each label fits. If some are off, maybe 4. Usability: all are usable tags, maybe 5.

Similarly for other models.

So:

Model1 Fluency: 5 (all labels are fluent)

Suitability: Let's see. Topic1 is off (health focus), Topic5 is off (health), others are good. So 2/8 topics have lower suitability. That's 75% good, so maybe 4.

Usability: All labels are concise and usable. 5.

Model2 Fluency: 5. Suitability: Topic1 (medical), Topic5 (health), Topic6 (proteomics not relevant), and maybe others. 3 topics problematic. So suitability around 3. Usability 5.

Model3 Fluency: 5. Suitability: Only perhaps Topic3 and Topic4 could be slightly off, but still much better. Maybe 5. Usability: Some labels longer but still ok. 5.

So the scores would be:

Model1: F5, S4, U5

Model2: F5, S3, U5

Model3: F5, S5, U5

Then, the best labels per topic are mostly from Model3, except maybe Topic7.

But need to check each topic's best label.

After finalizing, also mention if any labels should be changed. For example, Topic1: Model3's "Educational Technology" could be improved to include online learning. Topic3: Maybe mention replication or viral structure. But given the options, stick with the models' labels.
```
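
The evaluator's trace above reasons about Suitability from the number of off-target labels per model (for example, "2/8 topics have lower suitability. That's 75% good, so maybe 4"). The sketch below makes that arithmetic explicit. The linear mapping onto the 1-5 scale and the function name `suitability_score` are assumptions added for illustration and are not stated by the evaluator; the off-target counts are the ones mentioned in the trace.

```python
def suitability_score(n_topics: int, n_off_target: int) -> float:
    """Map the share of well-fitting labels linearly onto a 1-5 scale (assumed rule)."""
    share_good = (n_topics - n_off_target) / n_topics
    return 1 + 4 * share_good


# Off-target label counts as described in the trace (8 topics per model).
off_target = {"Model1": 2, "Model2": 3, "Model3": 0}

for model, n_off in off_target.items():
    print(model, suitability_score(8, n_off))
```

Under this assumed rule, Model1 and Model3 land on the scores the evaluator reported, while Model2 lands on the intermediate 3.5 that the evaluator later reduced to an integer 3.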

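# Llama-3.2-3B-Instruct
Below is the implementation for Llama-3.2-3B-Instruct, loaded through the Hugging Face `transformers` `pipeline`. The same prompt is run twice, once for each set of topic keywords.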

In [None]:
#imports
from transformers import pipeline, logging
import torch
from huggingface_hub import login

# Disable warnings about model loading
logging.set_verbosity_error()

# Log in to your Hugging Face account. NOTE: replace the placeholder with your own access token.
login(token="YOUR_ACCESS_TOKEN_HERE")
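
To avoid pasting the access token into the notebook, it can be read from an environment variable instead. This is a minimal sketch: the helper `get_hf_token` is hypothetical and not part of the original notebook, and it assumes the token has been stored in a variable named `HF_TOKEN` before the notebook is started.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in an environment variable (hypothetical helper)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face access token."
        )
    return token


# Usage in the notebook (needs a real token, so it is left commented out here):
# from huggingface_hub import login
# login(token=get_hf_token())
```

Keeping the token out of the source means the notebook can be shared or committed without exposing the credential.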

In [None]:
# Load the model using pipeline
model_id = "meta-llama/Llama-3.2-3B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Create your prompt
prompt = """Please analyze these keyword groups and create a specific, descriptive label for each topic that:
Captures the central theme with precision in the style of the example topics given.
Uses domain-specific terminology appropriate for medical/scientific research
Is concise (preferably 2-5 words).
Format your response as a numbered list corresponding to the topic numbers provided.
Example Labels: Cardiology General, Clinical Medicine General, Coronavirus, Critical Care, Diagnostics, Global Health, Hospital-Based Clinical Medicine, Infectious Disease General, Neurology/Neurosurgery General, Outpatient-Based Clinical Medicine, Public Health General, Viral Infections, Pulmonary/Critical Care General


Input:

Topic_1: student, participant, survey, mental, online, anxiety, education, child, self, psychological
Topic_2: vaccine, vaccination, antibody, variant, transmission, influenza, testing, assay, detection, detect
Topic_3: mask, food, air, concentration, surface, energy, water, temperature, material, environmental
Topic_4: policy, article, service, community, crisis, economic, global, management, paper, world
Topic_5: ci, mortality, p, death, trial, child, admission, severity, cohort, covid
Topic_6: protein, drug, immune, target, gene, expression, receptor, mechanism, induce, therapeutic
Topic_7: cancer, surgery, lung, complication, surgical, pain, procedure, undergo, diagnosis, pulmonary
Topic_8: propose, network, performance, learning, image, application, accuracy, user, predict, tool

Labels:"""

# Format the prompt for chat models
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Generate response
response = pipe(
    messages,
    max_new_tokens=500,
    temperature=0.5,  # Tested at 0.5 and 1.0; 0.5 was chosen for more controlled output.
    do_sample=True,
    return_full_text=False
)[0]["generated_text"]

print(response)

Here are the analyzed keyword groups with specific, descriptive labels for each topic:

1. Topic_1: Child Mental Health Education Survey
2. Topic_2: Influenza Vaccine Detection Assay
3. Topic_3: Environmental Air Quality Assessment
4. Topic_4: Global Economic Crisis Management Policy
5. Topic_5: COVID-19 Child Mortality Severity Cohort
6. Topic_6: Immune Targeted Therapeutic Protein Expression
7. Topic_7: Pulmonary Cancer Surgery Diagnosis Complication
8. Topic_8: Image-Based Learning Performance Prediction Tool
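
Both prompt cells in this section share the same instructions and differ only in the keyword lists, so the prompt could be assembled by a small helper rather than copied. The sketch below is illustrative only: `build_prompt` and `INSTRUCTIONS` are hypothetical names, the instruction text is a shortened stand-in for the full prompt, and the example keyword lists are truncated to five words each. It writes keywords as comma-separated text, which matches the first prompt but not the Python-list formatting used in the second.

```python
# Hypothetical helper: not part of the original notebook.
INSTRUCTIONS = (
    "Please analyze these keyword groups and create a specific, descriptive "
    "label for each topic."  # shortened stand-in for the full instruction text
)


def build_prompt(topics, instructions=INSTRUCTIONS):
    """Join the instructions and one 'Topic_n: kw1, kw2, ...' line per topic."""
    topic_lines = "\n".join(
        f"{name}: {', '.join(keywords)}" for name, keywords in topics.items()
    )
    return f"{instructions}\n\nInput:\n\n{topic_lines}\n\nLabels:"


# Keyword lists truncated to five words each for brevity.
example_topics = {
    "Topic_1": ["student", "participant", "survey", "mental", "online"],
    "Topic_2": ["vaccine", "vaccination", "antibody", "variant", "transmission"],
}

print(build_prompt(example_topics))
```

The returned string can be passed as the user message content in the same way as the hand-written `prompt` variable.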


In [None]:
# Create the prompt for the second set of topic keywords
prompt = """Please analyze these keyword groups and create a specific, descriptive label for each topic that:
Captures the central theme with precision in the style of the example topics given.
Uses domain-specific terminology appropriate for medical/scientific research
Is concise (preferably 2-5 words).
Format your response as a numbered list corresponding to the topic numbers provided.
Example Labels: Cardiology General, Clinical Medicine General, Coronavirus, Critical Care, Diagnostics, Global Health, Hospital-Based Clinical Medicine, Infectious Disease General, Neurology/Neurosurgery General, Outpatient-Based Clinical Medicine, Public Health General, Viral Infections, Pulmonary/Critical Care General


Input:

Topic_1: ['teacher', 'teaching', 'learning', 'learn', 'student', 'online', 'education', 'university', 'educational', 'learner']
Topic_2: ['child', 'pediatric', 'multisystem', 'inflammatory', 'pim', 'syndrome', 'infant', 'adult', 'fever', 'ts']
Topic_3: ['replication', 'glycan', 'protein', 'rna', 'virus', 'viral', 'membrane', 'glycosylation', 'virion', 'host']
Topic_4: ['image', 'ray', 'deep', 'segmentation', 'chest', 'classification', 'dataset', 'convolutional', 'ct', 'learning']
Topic_5: ['government', 'governance', 'law', 'policy', 'country', 'public', 'global', 'right', 'political', 'international']
Topic_6: ['protease', 'docking', 'compound', 'inhibitor', 'mpro', 'drug', 'molecular', 'bind', 'ligand', 'pro']
Topic_7: ['vte', 'thrombosis', 'anticoagulation', 'embolism', 'venous', 'coagulation', 'thrombotic', 'thromboembolism', 'dimer', 'heparin']
Topic_8: ['hospital', 'hcw', 'bed', 'ambulance', 'staff', 'emergency', 'healthcare', 'preparedness', 'care', 'facility']

Labels:"""

# Format the prompt for chat models
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Generate response
response = pipe(
    messages,
    max_new_tokens=500,
    temperature=0.5,
    do_sample=True,
    return_full_text=False
)[0]["generated_text"]

print(response)

Here are the specific, descriptive labels for each topic:

1. Topic_1: 'Educational Technology'
2. Topic_2: 'Pediatric Multisystem Inflammatory Syndrome'
3. Topic_3: 'Viral Glycan Interactions'
4. Topic_4: 'Deep Learning Chest X-ray Segmentation'
5. Topic_5: 'Global Governance and International Law'
6. Topic_6: 'Protease Inhibitor Compound Design'
7. Topic_7: 'Venous Thrombosis and Coagulation Management'
8. Topic_8: 'Emergency Healthcare Preparedness and Staffing'
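
For downstream comparison it may help to turn the numbered list returned by the model into a dictionary. The following is a minimal sketch, assuming responses keep the `N. Topic_N: label` shape seen in the two outputs above; `parse_labels` is a hypothetical helper and not part of the original notebook. Lines that do not match, such as the introductory sentence, are skipped, and surrounding quotes are stripped.

```python
import re

# Matches lines such as "1. Topic_1: 'Educational Technology'".
LINE_PATTERN = re.compile(r"^\s*(\d+)\.\s*(Topic_\d+)\s*:\s*(.+?)\s*$")


def parse_labels(response: str) -> dict:
    """Extract {topic_name: label} pairs from a numbered-list response."""
    labels = {}
    for line in response.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            _, topic, label = match.groups()
            labels[topic] = label.strip("'\"")
    return labels


sample = """Here are the specific, descriptive labels for each topic:

1. Topic_1: 'Educational Technology'
2. Topic_2: 'Pediatric Multisystem Inflammatory Syndrome'
"""

print(parse_labels(sample))
```

Because generation uses sampling, the output format is not guaranteed, so it is worth checking that the parsed dictionary contains one entry per topic before using it.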
