In [1]:
!nvidia-smi

Tue May  6 21:05:23 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GH200 120GB             On  | 00000009:01:00.0 Off |                    0 |
| N/A   26C    P0              92W / 900W |  13549MiB / 97871MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GH200 120GB             On  | 00000039:01:00.0 Off |  

In [2]:
import json
from pprint import pprint

import torch
import transformers
from environs import env
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    QuantoConfig,
    AutoProcessor,
    Llama4ForConditionalGeneration,
)

from local_funcs import chat_funcs, prompt_funcs
from yiutils.project_utils import find_project_root

proj_root = find_project_root("justfile")
data_dir = proj_root / "data"

print(transformers.__version__)
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.version.cuda)

env.read_env(proj_root / ".env")
access_token = env("HUGGINGFACE_TOKEN")

path_to_mr_pubmed_data = (
    data_dir / "intermediate" / "mr-pubmed-data" / "mr-pubmed-data.json"
)
assert path_to_mr_pubmed_data.exists(), (
    f"Data file {path_to_mr_pubmed_data} does not exist."
)

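# Load the full dataset and use the first article as a smoke-test example.
with open(path_to_mr_pubmed_data, "r", encoding="utf-8") as f:
    mr_pubmed_data = json.load(f)

article_data = mr_pubmed_data[0]

# Build two chat prompts from the abstract (the "ab" field):
# one for metadata extraction, one for results extraction.
message_metadata = prompt_funcs.make_message_metadata(article_data["ab"])
message_results = prompt_funcs.make_message_results(article_data["ab"])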


4.51.3
2.6.0
True
12.6
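
`prompt_funcs.make_message_metadata` is a project-local helper whose source is not shown in this notebook. Since `apply_chat_template` expects a list of role/content dictionaries, a hypothetical stand-in might look like the sketch below. The name `make_messages_sketch`, the two-message layout, and the closing instruction line are assumptions for illustration; only the system prompt and the opening user sentence are taken from the decoded output further down.

```python
from typing import Dict, List

# System prompt text as it appears in the decoded output of this notebook.
SYSTEM_PROMPT = (
    "You are a data scientist responsible for extracting accurate "
    "information from research papers. You answer each question with "
    "a single JSON string."
)


def make_messages_sketch(abstract: str) -> List[Dict[str, str]]:
    """Hypothetical stand-in for prompt_funcs.make_message_metadata."""
    user_prompt = (
        "This is an abstract from a Mendelian randomization study.\n"
        f'"{abstract}"\n'
        # Assumption: the real helper's instructions are not shown in the notebook.
        "List the exposures and outcomes as a single JSON string."
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]


messages_sketch = make_messages_sketch("Alcohol consumption ...")
print([m["role"] for m in messages_sketch])
```

A list in this shape can be passed as the `conversation` argument of `tokenizer.apply_chat_template`.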


In [3]:
MODEL_ID = "nvidia/Llama-3_3-Nemotron-Super-49B-v1"

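# "auto" lets the loader place the weights across the available devices.
device = "auto"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=access_token)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=dtype,
    device_map=device,
    token=access_token,
    # The model class is loaded from code in the model repository, not from transformers.
    trust_remote_code=True,
)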

Loading checkpoint shards: 100%|██████████| 21/21 [00:18<00:00,  1.11it/s]


In [4]:
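messages = message_metadata
# messages = message_results

# Render the messages with the model's chat template and append the assistant
# header so that generation starts at the model's reply.
input_ids = tokenizer.apply_chat_template(
    conversation=messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)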
print(input_ids.shape)
input_ids

torch.Size([1, 1056])


tensor([[128000, 128006,   9125,  ...,  78191, 128007,    271]],
       device='cuda:0')

In [5]:
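# Stop on the tokenizer's EOS token; the explicit <|eot_id|> terminator is left disabled.
terminators = [
    tokenizer.eos_token_id,
    # tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]
# With transformers 4.51.3 (printed above), this call raises the TypeError shown below.
outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.5,
    top_p=0.95,
)
print(outputs.shape)
outputs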

TypeError: DeciLMPreTrainedModel._prepare_generation_config() takes 2 positional arguments but 3 were given
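
The message points to a signature mismatch: the caller passes one more positional argument to `_prepare_generation_config` than the override in the model's remote code accepts. The sketch below is a toy reproduction of that kind of failure in plain Python. The classes `ToyGenerationMixin` and `ToyRemoteModel` are hypothetical and do not reproduce the transformers or DeciLM source; the extra `use_model_defaults` parameter is an assumption about what changed in the base class.

```python
class ToyGenerationMixin:
    """Toy stand-in for a library base class (not the transformers source)."""

    def _prepare_generation_config(
        self, generation_config, use_model_defaults=None, **kwargs
    ):
        return generation_config, kwargs

    def generate(self, generation_config=None, **kwargs):
        # The base class passes an extra positional argument to the hook.
        return self._prepare_generation_config(generation_config, None, **kwargs)


class ToyRemoteModel(ToyGenerationMixin):
    """Toy stand-in for remote code written against the older hook signature."""

    def _prepare_generation_config(self, generation_config, **kwargs):
        return generation_config, kwargs


error_message = ""
try:
    ToyRemoteModel().generate(max_new_tokens=8)
except TypeError as err:
    error_message = str(err)
    print(error_message)
```

If this reading is right, the usual ways out are to install a transformers version that the model's remote code was written against (the model card is the place to check which one) or to patch the override so that it accepts the extra argument.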

In [10]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

'You are a data scientist responsible for extracting accurate information from research papers. You answer each question with a single JSON string.<｜User｜>\n                This is an abstract from a Mendelian randomization study.\n                    "Alcohol consumption significantly impacts disease burden and has been linked to various diseases in observational studies. However, comprehensive meta-analyses using Mendelian randomization (MR) to examine drinking patterns are limited. We aimed to evaluate the health risks of alcohol use by integrating findings from MR studies. A thorough search was conducted for MR studies focused on alcohol exposure. We utilized two sets of instrumental variables-alcohol consumption and problematic alcohol use-and summary statistics from the FinnGen consortium R9 release to perform de novo MR analyses. Our meta-analysis encompassed 64 published and 151 de novo MR analyses across 76 distinct primary outcomes. Results show that a genetic predisposition 

In [11]:
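# generate() returns the prompt followed by the new tokens;
# slice off the prompt so that only the model's reply is decoded.
response = outputs[0][input_ids.shape[-1] :]
result = tokenizer.decode(response, skip_special_tokens=True)
print(result)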

Alright, so I need to figure out the exposures and outcomes from the given abstract. The abstract talks about alcohol consumption and problematic alcohol use as exposures. For the outcomes, it mentions several diseases like Parkinson's disease, prostate hyperplasia, rheumatoid arthritis, chronic pancreatitis, colorectal cancer, head and neck cancers, alcoholic liver disease, cirrhosis, pneumonia, and more.

Next, I have to categorize these. Alcohol consumption and problematic alcohol use are both behavioral, so they fall under the 'behavioural' category. The outcomes are various diseases, so I'll categorize each under the appropriate disease group. For example, Parkinson's disease is a nervous system disease, prostate hyperplasia is a neoplasm, and so on.

Now, looking at the methods used in the abstract, it mentions Mendelian randomization (MR) with two sets of instrumental variables and a meta-analysis. The methods listed in the example include things like two-sample MR, multivariabl
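
The decoded reply opens with free-text reasoning, so calling `json.loads(result)` on the whole string would fail even if a JSON answer follows. A minimal sketch for pulling the last JSON value out of such a reply is given below. The helper name `extract_last_json` and the sample string are made up for illustration; the sample is not actual model output.

```python
import json
from typing import Any, Optional


def extract_last_json(text: str) -> Optional[Any]:
    """Return the last decodable JSON object or array in text, else None."""
    decoder = json.JSONDecoder()
    found = None
    pos = 0
    while pos < len(text):
        # Jump to the next character that could open a JSON object or array.
        starts = [i for i in (text.find("{", pos), text.find("[", pos)) if i != -1]
        if not starts:
            break
        start = min(starts)
        try:
            # raw_decode parses one JSON value and reports where it ended.
            found, pos = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # Not valid JSON at this position; keep scanning.
            pos = start + 1
    return found


# Made-up sample: reasoning text followed by a JSON answer.
sample = (
    "Alright, so I need to figure out the exposures and outcomes.\n"
    '{"exposures": ["alcohol consumption"], "outcomes": ["cirrhosis"]}'
)
parsed = extract_last_json(sample)
print(parsed)
```

Taking the last decodable value rather than the first is a design choice: bracketed fragments inside the reasoning text can parse as JSON, while the final answer is expected at the end of the reply.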