In [1]:
import datasets
from datasets import load_dataset, load_from_disk
import pandas as pd
import torch
import numpy as np
import random
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from tqdm import tqdm
import seaborn as sns

# Med Textbook + MDPI

In [114]:
data_path = "/mnt/home/al2644/storage/hf_dataset/OpenMedTextDataset"
medqa_text = load_from_disk(data_path)

In [125]:
medqa_text

DatasetDict({
    Med-Textbooks: Dataset({
        features: ['book-name', 'license', 'text'],
        num_rows: 29
    })
    Med-MDPI: Dataset({
        features: ['field', 'file-name', 'text'],
        num_rows: 127707
    })
})

In [130]:
example = medqa_text['Med-MDPI'][10]
field, filename, text = example['field'], example['file-name'], example['text']

In [138]:
field_list = list(set(medqa_text['Med-MDPI']['field']))

In [139]:
field_list

['cancers_3',
 'ijerph_9',
 'ijerph_8',
 'allergies',
 'behavsci',
 'cancers_1',
 'diabetology',
 'hearts',
 'dermatopathology',
 'oral',
 'jcm_2',
 'ijerph_5',
 'medicines',
 'ijerph_1',
 'ijerph_2',
 'epidemiologia',
 'gastrointestdisord',
 'medsci',
 'ijerph_6',
 'jcm_4',
 'antibiotics',
 'biomed',
 'biomedinformatics',
 'cardiogenetics',
 'jcm_1',
 'ctn',
 'livers',
 'biomedicines',
 'pharmacy',
 'endocrines',
 'cancers_4',
 'antibodies',
 'ijerph_7',
 'healthcare',
 'immuno',
 'viruses',
 'diseases',
 'gastroent',
 'curroncol',
 'ijerph_4',
 'ijerph_3',
 'cancers_2',
 'brainsci',
 'biomolecules',
 'diagnostics',
 'vaccines',
 'biologics',
 'biotech',
 'uro',
 'clinpract',
 'jcm_3']
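
Several field names above differ only by a numeric suffix (for example `ijerph_1` through `ijerph_9`). The following is a minimal sketch for collapsing them into journal-level counts. It assumes the trailing `_N` is only a shard index, which the dataset itself does not confirm. `journal_of` and `count_by_journal` are hypothetical helper names.

```python
import re
from collections import Counter

# Assumed convention: a trailing "_<digits>" marks a shard, not a distinct journal.
_SHARD_SUFFIX = re.compile(r"_\d+$")

def journal_of(field: str) -> str:
    """Strip a trailing '_<digits>' shard suffix from a field name."""
    return _SHARD_SUFFIX.sub("", field)

def count_by_journal(fields):
    """Count rows per journal after collapsing shard suffixes."""
    return Counter(journal_of(f) for f in fields)

# Small sample taken from the field list above.
sample = ["cancers_3", "ijerph_9", "ijerph_8", "allergies", "cancers_1"]
counts = count_by_journal(sample)
print(counts.most_common())
```

On the real data, `count_by_journal(medqa_text['Med-MDPI']['field'])` would give the number of rows per journal under the same assumption.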

In [132]:
print(text)

Unverified beta-lactam allergies are a substantial public health problem, as the majority of patients labeled as beta-lactam allergic do not have clinically significant allergies that may hinder the use beta-lactam therapy when indicated. Outdated or inaccurate beta-lactam or penicillin allergies can result in serious consequences, including suboptimal antibiotic therapy, increased risk of adverse effects, and use of broader spectrum antibiotics than indicated, which may contribute to antimicrobial resistance. The purpose of this review is to provide an overview of beta-lactam allergy and highlight the role of pharmacists in managing beta-lactam allergies. Studies have shown that pharmacists can play a vital role in allergy assessment, penicillin skin testing, beta-lactam desensitization, evaluation of beta-lactam cross-reactivity and recommending appropriate antibiotic therapy in patients with beta-lactam allergies.Beta-lactam antibiotics are considered a first-line therapy in many ba

# PubMed -- Research Publication

## Data Pipeline
1. Parse front matter, body, and references
2. Remove non-English papers
3. Filter out low-quality papers
4. Fetch references, abstract, keywords, etc.

### Step 1

In [79]:
from datasets import load_from_disk
pubmed_split = load_from_disk('/mnt/home/al2644/.cache/huggingface/datasets/pmc___open_access_split')

In [110]:
idx = 100
example = pubmed_split['main'][idx]

In [107]:
example.keys()

dict_keys(['text', 'pmid', 'accession_id', 'license', 'last_updated', 'retracted', 'citation'])

In [111]:
print(example['text'])

I am a clone. That is, I am a colony of cells that developed from a single fertilized egg cell. Most animals are clones like me. It is a slight oversimplification to say that all of an animal's cells are genetically identical to each other. Some cells have mutations. In mammals, some cells (red blood cells) lack a nuclear genome entirely. Some cells have viruses—and when it's in a cell, a virus is basically a gene—that other cells lack. But a typical animal is a clone in the sense that all its cells arise from that single fertilized egg cell.

Not all animals, however, are clones. Sometimes two tiny embryos developing inside their mother will fuse together into a single embryo and continue developing. The resulting animal is not a clone, but a chimera: a conglomeration of two different cell lineages into a single organism. Some species of monkeys (marmosets) typically have chimeric blood, from having shared a blood supply with a twin in utero (Haig 1999), and rare cases of accidental c

In [112]:
print(example['pmid'])

15024412


### Note:
- PubMed has around 85B tokens
- Not all publications are in English

In [99]:
# `conn` is assumed to be an open connection to the metadata database (created outside this cell).
query = "SELECT * FROM pubmedmetadata WHERE pmid = 15024412;"
df = pd.read_sql_query(query, conn)     # -> DataFrame

In [100]:
df

Unnamed: 0,pmid,article_title,abstract,language,journal_title,journal_issn,journal_volume,journal_issue,pub_year,pub_month,pub_day,article_date,article_ids,authors,mesh_headings,keywords,article_references,citation_count,source_file,processed_at
0,15024412,The strange case of the armored scale insect a...,,eng,PLoS biology,1545-7885,2,3,2004,Mar,,2004-03-16,"{""pubmed"": ""15024412"", ""doi"": ""10.1371/journal...","[{""last"": ""Normark"", ""fore"": ""Benjamin B"", ""in...","[{""descriptor"": {""name"": ""Animals"", ""ui"": ""D00...",[],"[{""citation"": ""Anim Behav. 2000 Mar;59(3):629-...",11,pubmed25n0491.xml,2025-11-11T10:56:18.018403


In [103]:
df.loc[0]['article_title']

'The strange case of the armored scale insect and its bacteriome.'
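
The `language` column in the metadata row above (`eng` for this paper) is one possible basis for step 2 of the pipeline, removing non-English papers. The sketch below assumes the metadata has already been reduced to a PMID-to-language mapping and that records without metadata are dropped. Both are assumptions rather than the notebook's actual rule, and `keep_english` is a hypothetical helper.

```python
def keep_english(records, language_by_pmid, english_code="eng"):
    """Return the records whose PMID maps to the English language code.

    records: iterable of dicts with a 'pmid' key (as in the dataset rows).
    language_by_pmid: dict mapping an integer PMID to a language code.
    Records with no PMID or no metadata entry are dropped (assumed policy).
    """
    kept = []
    for record in records:
        pmid = record.get("pmid")
        if pmid is None:
            continue
        if language_by_pmid.get(int(pmid)) == english_code:
            kept.append(record)
    return kept

# Toy input: only the first PMID and its 'eng' code come from the notebook;
# the other two rows are made-up placeholders for illustration.
records = [
    {"pmid": 15024412, "text": "I am a clone. ..."},
    {"pmid": 900000001, "text": "..."},
    {"pmid": 900000002, "text": "..."},  # no metadata entry
]
language_by_pmid = {15024412: "eng", 900000001: "fre"}
english = keep_english(records, language_by_pmid)
print([r["pmid"] for r in english])
```

Casting with `int(pmid)` lets the same function handle PMIDs stored as strings or integers.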

### PubMed Metadata

In [2]:
filepath = '/mnt/home/al2644/storage/pubmed/baseline/pubmed25n0001.xml'

In [8]:
import xml.etree.ElementTree as ET

# filepath points at the uncompressed .xml file, so ET can open it directly
# and honor the encoding declared in the XML header.
tree = ET.parse(filepath)
root = tree.getroot()

In [11]:
for article in root.findall(".//PubmedArticle"):
    pmid = article.findtext(".//PMID")
    title = article.findtext(".//ArticleTitle")
    print(pmid, title)

1 Formate assay in body fluids: application in methanol poisoning.
2 Delineation of the intimate details of the backbone conformation of pyridine nucleotide coenzymes in aqueous solution.
3 Metal substitutions incarbonic anhydrase: a halide ion probe study.
4 Effect of chloroquine on cultured fibroblasts: release of lysosomal hydrolases and inhibition of their uptake.
5 Atomic models for the polypeptide backbones of myohemerythrin and hemerythrin.
6 Studies of oxygen binding energy to hemoglobin molecule.
7 Maturation of the adrenal medulla--IV. Effects of morphine.
9 Radiochemical assay of glutathione S-epoxide transferase and its enhancement by phenobarbital in rat liver in vivo.
8 Comparison between procaine and isocarboxazid metabolism in vitro by a liver microsomal amidase-esterase.
10 Digitoxin metabolism by rat liver microsomes.
11 Identification of adenylate cyclase-coupled beta-adrenergic receptors with radiolabeled beta-adrenergic antagonists.
12 The effect of adrenaline and 

17636 Diet and dental caries-a review.
17637 Interactions of C-reactive protein with the first component of human complement.
17638 Hydrolysis of a synthetic amide substrate by human C1 esterase (C1s).
17639 Immunofixation after electrofocusing: improved method for specific detection of serum proteins with determination of isoelectric points. I. Immunofixation print technique for detection of alpha-1-protease inhibitor.
17641 The Haarscheibe.
17640 Autonomic neuroeffector junctions--reflex vasodilatation of the skin.
17642 Factors which influence blood platelet migration.
17643 Intestinal absorption of alpha-tocopherol in the unanesthetized rat. The influence of luminal constituents on the absorptive process.
17644 The defect of uric acid metabolism in Eck-fistula rats.
17645 Conference on inflammation. Rationale of the conference on inflammation.
17646 Conference on inflammation. Introductory remarks.
17647 Induction of specific tissue transplantation tolerance using fractionated tota

22948 Analysis of the major histocompatibility complex in Syrian hamsters. II. Linkage studies.
22949 Secretion of various antimicrobial substances in dogs with experimental bacterial prostatitis.
22951 Urinary pH effects of diet additives.
22950 [The passage of nitrogenous compounds through the wall of perfused sheep rumen (author's transl)].
22953 [Role of adrenoreceptors in regulating aspartate aminotransferase isoenzyme activity in albino rat hearts].
22957 The aetiology of depression and anxiety.
22958 [Long-term treatment with the new antiarrhythmic drug propafenone in correlation to plasma levels (author's transl)].
22954 [Anion-sensitive nuclear ATPase of the rat heart].
22952 [HCO3-stimulated adenosine triphosphatase in rat ovarian tumor cells].
22959 The concentration of 2,3 diphosphoglycerate and adenosine triphosphate in the red blood cells of newborn infants with respiratory distress syndrome.
22955 [Depression of rat liver acetyl-CoA-carboxylase activity by salicylate and

In [None]:
# Baseline file URL (compressed): https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed25n0001.xml.gz
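
Since the baseline files are distributed as `.xml.gz`, the same fields can also be read without decompressing to disk or loading the whole tree. The sketch below streams `PubmedArticle` elements with `gzip.open` and `ET.iterparse`. `iter_pubmed_articles` is a hypothetical helper, and the tiny XML document is toy data written only to make the example self-contained.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET

def iter_pubmed_articles(gz_path):
    """Yield (pmid, title, language) for each PubmedArticle in a gzipped XML file."""
    with gzip.open(gz_path, "rb") as handle:
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag != "PubmedArticle":
                continue
            yield (
                elem.findtext(".//PMID"),
                elem.findtext(".//ArticleTitle"),
                elem.findtext(".//Language"),
            )
            elem.clear()  # free the finished article's subtree

# Toy file mimicking the structure of a baseline file (illustrative values).
sample_xml = (
    b"<PubmedArticleSet>"
    b"<PubmedArticle><MedlineCitation><PMID>1</PMID><Article>"
    b"<ArticleTitle>Formate assay in body fluids.</ArticleTitle>"
    b"<Language>eng</Language>"
    b"</Article></MedlineCitation></PubmedArticle>"
    b"</PubmedArticleSet>"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml.gz")
    with gzip.open(path, "wb") as f:
        f.write(sample_xml)
    rows = list(iter_pubmed_articles(path))
print(rows)
```

Note that `findtext` returns only the text before the first child element, so titles containing inline markup would come back truncated; joining `itertext()` on the title element is one way around that.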

# MegaMathWeb-Pro
### Statistics:
- 15M documents
- 15B tokens

Identifying Equivalent Fractions
Equivalent fractions are fractions that have the same value or represent the same part of an object. For example, if a pie is cut into two pieces, each piece is one-half of the pie. If the same pie is cut into 4 pieces, then two pieces represent the same amount of pie that 1/2 did, making 1/2 equivalent to 2/4.

To determine equivalent fractions, multiply the numerator and denominator of one fraction by the same number, ensuring the numerators are equal after multiplication. Comparing 1/2 and 2/4 involves multiplying 1/2 by 2/2, resulting in 2/4, confirming they are equivalent. In contrast, comparing 1/2 and 3/7 by multiplying 1/2 by 3/3 yields 3/6, which is not the same as 3/7, indicating the fractions are not equivalent.

Examples of equivalent fractions include:
- Fractions equivalent to 1/2: 2/4, 3/6, 4/8, 5/10, 6/12
- Fractions equivalent to 1/3: 2/6, 3/9, 4/12, 5/15
- Fractions equivalent to 1/4: 2/8, 3/12, 4/16, 5/20
- Fractions equivalent to 1/5: 2/10, 3/15, 4/20, 5/25
- Fractions equivalent to 2/5: 4/10, 6/15, 8/20, 10/25

This method allows for the identification of equivalent fractions by adjusting the numerator and denominator through multiplication, providing a systematic approach to comparing fractions.

### Knowledge Links

You will be provided with a chunk of knowledge-intensive passages. Act as a scholar: understand the knowledge by treating it as a paper and contextualizing it within prior related work. Follow the steps below:
1. Summarize the key motivations, problem statement, highlights, and takeaways, along with a set of keywords. Put the summary between <abstract> </abstract> tags and the keywords between <keywords> </keywords> tags.
2. Write a list of references that can be considered peer work and that help in understanding this knowledge. For each reference, you do not need to formally write the author names or title. Summarize each related work as a query, a summary of the key ideas, or keywords that can be used to search for the actual work. Include only closely related knowledge, and put the references in a JSON object.
<examples>
</examples>

<knowledge>
Identifying Equivalent Fractions
Equivalent fractions are fractions that have the same value or represent the same part of an object. For example, if a pie is cut into two pieces, each piece is one-half of the pie. If the same pie is cut into 4 pieces, then two pieces represent the same amount of pie that 1/2 did, making 1/2 equivalent to 2/4.

To determine equivalent fractions, multiply the numerator and denominator of one fraction by the same number, ensuring the numerators are equal after multiplication. Comparing 1/2 and 2/4 involves multiplying 1/2 by 2/2, resulting in 2/4, confirming they are equivalent. In contrast, comparing 1/2 and 3/7 by multiplying 1/2 by 3/3 yields 3/6, which is not the same as 3/7, indicating the fractions are not equivalent.

Examples of equivalent fractions include:
- Fractions equivalent to 1/2: 2/4, 3/6, 4/8, 5/10, 6/12
- Fractions equivalent to 1/3: 2/6, 3/9, 4/12, 5/15
- Fractions equivalent to 1/4: 2/8, 3/12, 4/16, 5/20
- Fractions equivalent to 1/5: 2/10, 3/15, 4/20, 5/25
- Fractions equivalent to 2/5: 4/10, 6/15, 8/20, 10/25

This method allows for the identification of equivalent fractions by adjusting the numerator and denominator through multiplication, providing a systematic approach to comparing fractions.  
</knowledge>
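
The prompt above asks for `<abstract>` and `<keywords>` tags plus a JSON object of related-work queries. A minimal sketch of parsing such a response follows. The prompt does not fix the JSON layout or the keyword separator, so the code assumes comma-separated keywords and takes the outermost `{...}` span as the JSON object. `parse_response` and the sample response are hypothetical.

```python
import json
import re

def _between(tag, text):
    """Return the stripped text inside <tag>...</tag>, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

def parse_response(response):
    """Extract the abstract, keywords, and references JSON from a model response."""
    abstract = _between("abstract", response)
    keywords_raw = _between("keywords", response)
    # Assumption: keywords are comma-separated.
    keywords = [k.strip() for k in keywords_raw.split(",") if k.strip()] if keywords_raw else []
    # Assumption: the JSON object is the outermost {...} span in the response.
    start, end = response.find("{"), response.rfind("}")
    references = None
    if 0 <= start < end:
        try:
            references = json.loads(response[start:end + 1])
        except json.JSONDecodeError:
            references = None  # malformed JSON is reported as missing
    return {"abstract": abstract, "keywords": keywords, "references": references}

# Made-up response, used only to exercise the parser.
response = """<abstract>Equivalent fractions represent the same value.</abstract>
<keywords>equivalent fractions, numerator, denominator</keywords>
{"references": ["simplifying fractions to lowest terms"]}"""
parsed = parse_response(response)
print(parsed)
```

Returning `None` for missing or malformed parts keeps the parser from raising on incomplete generations, so failed samples can be counted or retried downstream.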

# Prompt