# IPAI
## Individual Challenge - Extracting Knowledge from free form text with an LLM
### Helton Mendonça nº56870

Model taken from https://huggingface.co/AstroMLab/AstroSage-8B
---

As a backup to what I'm doing in this script, I could use https://huggingface.co/spaces/AstroMLab/AstroSage, a chatbot interface similar to ChatGPT, but it is less automated and requires pasting the text manually, which I tried to avoid.
---

The Hugging Face model card says:

"AstroSage-Llama-3.1-8B is a domain-specialized natural-language AI assistant tailored for research in astronomy, astrophysics, and cosmology. Trained on the complete collection of astronomy-related arXiv papers from 2007-2024 along with millions of synthetically-generated question-answer pairs and other astronomical literature, AstroSage-Llama-3.1-8B demonstrates excellent proficiency on a wide range of questions. "

## Setup

In [None]:
!pip install bitsandbytes transformers torch accelerate huggingface_hub pandas

In [None]:
!pip install arxiv

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import arxiv

def collect_abstracts(n_results):
    try:
        client = arxiv.Client()
        # Search for the n most recent articles matching the query
        search = arxiv.Search(
            query="TESS OR JWST OR ALMA OR HAWK-I",
            max_results=n_results,
            sort_by=arxiv.SortCriterion.SubmittedDate
        )
        papers = list(client.results(search))
        if not papers:
            print("No papers found for the given query.")
            return []
        return [paper.summary for paper in papers]
    except Exception as e:
        print(f"An error occurred: {e}")
        return []

For the free text I will take arXiv abstracts that mention any of our main instruments, with the aim of extracting knowledge that is relevant or similar to our original dataset schema and merging it with that data.

In [None]:
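import json

def extract_json(response):
    """Parse the text between the first "{" and the last "}" of `response` as JSON.

    Returns the parsed object, or {} when no braces are found or parsing fails.
    """
    json_start = response.find("{")
    json_end = response.rfind("}")
    if json_start != -1 and json_end != -1:
        json_str = response[json_start:json_end+1]
        try:
            # Printed before parsing, so this line also appears ahead of a decode error
            print('Response created Json')
            return json.loads(json_str)
        except json.JSONDecodeError as e:
            print(f"JSON decode error: {e}")
            return {}
    else:
        print("Response created empty JSON")
        return {}

# I used AI to help refine this function so the JSON output is easier to process afterwards.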
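Many responses in this notebook's run log fail with `Expecting property name enclosed in double quotes`. One possible cause (an assumption, since the raw responses were not saved) is that the span from the first `{` to the last `}` contains extra text or more than one object. The sketch below is a hypothetical alternative, `extract_first_json_object`, that uses the standard library's `json.JSONDecoder.raw_decode` to try each `{` in turn and return the first object that parses. It is not the function used for the results in this notebook.

```python
import json

def extract_first_json_object(response: str) -> dict:
    """Return the first valid JSON object found in `response`, or {} if none parses."""
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores any text that follows it
            obj, _end = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # This "{" did not start a valid object, so try the next one
        start = response.find("{", start + 1)
    return {}

example = 'Sure! {"Target Name": "CY Tau", "Instrument Used": null}\nNote: {not json}'
print(extract_first_json_object(example))
```

It will not rescue responses whose only object uses single quotes or is cut off by `max_new_tokens`; those still come back as `{}`.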

In [None]:
from tqdm import tqdm

def extract_info(abstracts, prompts):
    results = []
    for k, abstract in enumerate(tqdm(abstracts, total=len(abstracts), desc="Processing abstracts")):
        abstract_data = {}
        for i, prompt in enumerate(tqdm(prompts, total=len(prompts), desc=f"Prompts for abstract {k+1}", leave=False)):
            print(f"Processing prompt {i+1} of {len(prompts)}")
            full_prompt = prompt + f"\nAbstract: {abstract}"
            response = generate_response(full_prompt)
            prompt_data = extract_json(response)
            abstract_data.update(prompt_data)
        results.append(abstract_data)
    return results

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import json
import pandas as pd

# Load the model and tokenizer
quant_config1 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained("AstroMLab/AstroSage-8b",
                                             device_map="auto",
                                             quantization_config=quant_config1)

tokenizer = AutoTokenizer.from_pretrained("AstroMLab/AstroSage-8b")

# Function to generate a response
def generate_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    response = outputs[0][inputs['input_ids'].shape[-1]:]
    decoded = tokenizer.decode(response, skip_special_tokens=True)

    return decoded





In [None]:
# Example usage
prompt = """
You are an expert in general astrophysics. Your task is to answer the following question:
What are the main components of a galaxy?
"""
response = generate_response(prompt)
print(response)

The main components of a galaxy include stars, gas, dust, and dark matter. Stars are the luminous objects that emit light, gas is the fuel for star formation, dust is the material that can obscure light, and dark matter is the invisible matter that helps to hold the galaxy together. Additionally, galaxies can also contain black holes and other exotic objects. The specific composition of a galaxy can vary depending on its type and size. For example, spiral galaxies like the Milky Way have a central bulge, a disk, and spiral arms, while elliptical galaxies are more concentrated and lack distinct features. Dwarf galaxies, on the other hand


## Prompts and Model Outputs


With a single larger prompt I was getting no results:

(e.g. Consider yourself a astronomy expert, specially in the following instruments:
    TESS, JWST, ALMA, HAWK-I.
    From the following abstract, extract if possible:
    - Target Name
    - Object Type (e.g., star, galaxy, quasar, nebula)
    - Dataset or Observation ID
    - Purpose of Observation
    - Related Observations
    - Key Findings
    - Instrument Used (The telescope, camera, or spectrograph involved)
    - Right Ascension and Declination (RA and DEC)
    - Observation Date (any format)
    - Exposure Time (any format)
    - Electromagnetic Range (e.g., optical, infrared)
    - Distance from Earth
    - Other possibly relevant info I may know) ...



So my strategy is to partition this larger prompt into smaller ones, each combining a few of the fields I would like to extract from the abstracts. I then combine the answers into one JSON per paper, so the data can be filtered and merged with our original dataset.
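As a sketch of this partitioning, the sub-prompts could be generated from a list of field groups instead of being written out by hand. The names `FIELD_GROUPS`, `HEADER`, `build_prompt` and `generated_prompts` are hypothetical, and the extra instruction to use exact key names is an assumption about what might make the returned JSON keys more consistent. It has not been tested with AstroSage.

```python
# Field groups mirroring the six hand-written prompts in this notebook
FIELD_GROUPS = [
    ["Target Name", "Object Type", "Instrument Used"],
    ["Dataset or Observation ID", "Observation Date"],
    ["Right Ascension (RA)", "Declination (DEC)", "Distance from Earth"],
    ["Purpose of Observation", "Key Findings"],
    ["Exposure Time", "Electromagnetic Range"],
    ["Related Observations", "Other Relevant Information"],
]

HEADER = (
    "Consider yourself an astronomy expert, especially in the following "
    "instruments: TESS, JWST, ALMA, HAWK-I."
)

def build_prompt(fields):
    """Build one extraction prompt that asks for exactly the given fields."""
    bullet_list = "\n".join(f"- {name}" for name in fields)
    keys = ", ".join(f'"{name}"' for name in fields)
    return (
        f"{HEADER}\n"
        "From the following abstract, extract if it exists:\n"
        f"{bullet_list}\n\n"
        f"Return a single JSON object with exactly these keys: {keys}. "
        "Use null for missing fields.\n"
    )

generated_prompts = [build_prompt(group) for group in FIELD_GROUPS]
print(generated_prompts[0])
```

Keeping the groups in one list makes it easy to try different partitions (for example two fields per prompt) without editing six strings.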




In [None]:
prompts = []

prompt1 = f"""Consider yourself a astronomy expert, specially in the following instruments:TESS, JWST, ALMA, HAWK-I.
From the following abstract, extract if exists:
- Target Name
- Object Type (e.g., star, galaxy, quasar, nebula)
- Instrument Used (e.g., TESS, JWST, ALMA, HAWK-I)

Return the result in JSON format. Use null for missing fields.
"""
prompts.append(prompt1)
#####################################
prompt2 = f""" Consider yourself a astronomy expert, specially in the following instruments:TESS, JWST, ALMA, HAWK-I.
From the following abstract, extract if exists:
- Dataset or Observation ID
- Observation Date (any format)

Return the result in JSON format. Use null for missing fields.
"""
prompts.append(prompt2)
#######################################
prompt3 = f""" Consider yourself a astronomy expert, specially in the following instruments:TESS, JWST, ALMA, HAWK-I.
From the following abstract, extract if exists:
- Right Ascension (RA)
- Declination (DEC)
- Distance from Earth

Return the result in JSON format. Use null for missing fields."""
prompts.append(prompt3)
########################################
prompt4 = f""" Consider yourself a astronomy expert, specially in the following instruments:TESS, JWST, ALMA, HAWK-I.
From the following abstract, extract if exists:
- Purpose of Observation
- Key Findings


Return the result in JSON format. Use null for missing fields."""

prompts.append(prompt4)
###################################
prompt5 = f""" Consider yourself a astronomy expert, specially in the following instruments:TESS, JWST, ALMA, HAWK-I.
From the following abstract, extract if exists:
- Exposure Time (any format)
- Electromagnetic Range (e.g., optical, infrared)

Return the result in JSON format. Use null for missing fields."""

prompts.append(prompt5)
##########################################
prompt6 = f""" Consider yourself a astronomy expert, specially in the following instruments:TESS, JWST, ALMA, HAWK-I.
From the following abstract, extract if exists:
- Related Observations
- Any other relevant information

Return the result in JSON format. Use null for missing fields."""

prompts.append(prompt6)

In [None]:
# I tried 100, but that exceeded the Colab GPU limit, so I use a small sample (10 to 20 papers) just to show that it works.
# Ideally the number of papers should be as high as possible, for more data diversity.
abstracts = collect_abstracts(n_results=10)

In [None]:
print(len(abstracts))

20


An example abstract:

In [None]:
print(abstracts[0])

In this study we incorporate a new grid of kilonova simulations produced by
the Monte Carlo radiative transfer code SuperNu in an inference pipeline for
astrophysical transients, and evaluate their performance. These simulations
contain four different two-component ejecta morphology classes. We analyze
follow-up observational strategies by Vera Rubin Observatory in optical, and
James Webb Space Telescope (JWST) in mid-infrared (MIR). Our analysis suggests
that, within these strategies, it is possible to discriminate between different
morphologies only when late-time JWST observations in MIR are available. We
conclude that follow-ups by the new Vera Rubin Observatory alone are not
sufficient to determine ejecta morphology. Additionally, we make comparisons
between surrogate models based on radiative transfer simulation grids by
SuperNu and POSSIS, by analyzing the historic kilonova AT2017gfo that
accompanied the gravitational wave event GW170817. We show that both SuperNu
and POSSIS mod

In [None]:
results = extract_info(abstracts=abstracts, prompts=prompts)

Processing abstracts: 100%|██████████| 10/10 [14:58<00:00, 89.81s/it]

Status printed for each prompt (P1 to P6), with the repeated progress-bar frames omitted. "Json" is `Response created Json` with no error, "empty" is `Response created empty JSON`, and "error" is `Response created Json` followed by `JSON decode error: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)`. "Time" is the elapsed time shown by the inner progress bar for that abstract.

| Abstract | P1 | P2 | P3 | P4 | P5 | P6 | Time |
|---|---|---|---|---|---|---|---|
| 1 | Json | empty | empty | empty | empty | empty | 01:33 |
| 2 | empty | error | empty | empty | Json | empty | 01:39 |
| 3 | Json | empty | empty | empty | Json | empty | 01:12 |
| 4 | Json | Json | empty | empty | empty | empty | 01:18 |
| 5 | empty | empty | empty | error | Json | error | 01:20 |
| 6 | Json | Json | empty | empty | Json | empty | 01:32 |
| 7 | Json | Json | empty | empty | empty | empty | 01:30 |
| 8 | error | error | error | error | error | error | 01:41 |
| 9 | error | error | error | error | error | error | 01:38 |
| 10 | error | error | error | error | error | Json | 01:30 |


In [None]:
results

[{'Target Name': None, 'Object Type': None, 'Instrument Used': None},
 {'Exposure Time': '1 hour', 'Electromagnetic Range': 'Infrared'},
 {'Target Name': None,
  'Object Type': None,
  'Instrument Used': 'JWST',
  'Exposure Time': None,
  'Electromagnetic Range': 'infrared'},
 {'Target Name': None,
  'Object Type': None,
  'Instrument Used': 'TESS',
  'Observation Date': '2023-01-01',
  'Dataset or Observation ID': '10.17909/t9-7s6-5k62'},
 {'Exposure Time': '0.85 - 2.5 μm', 'Electromagnetic Range': 'Infrared'},
 {'Target Name': 'CY Tau',
  'Object Type': 'star',
  'Instrument Used': 'JWST',
  'dataset_or_observation_id': 'JWST-ERS-1324-10601',
  'observation_date': '2023-08-26',
  'Exposure Time': '1.5 hours',
  'Electromagnetic Range': 'infrared'},
 {'Target Name': None,
  'Object Type': None,
  'Instrument Used': ['ALMA', 'SMA', 'MeerKAT'],
  'dataset_or_observation_id': None,
  'observation_date': '2017-10-26'},
 {},
 {},
 {'RelatedObservations': None, 'OtherRelevantInformation': N

## Integration

In [None]:
import os
from google.colab import files
def save_json_files(results, output_dir="/content/drive/MyDrive/coisas da faculdade/Mestrado Ciência de Dados/2ºsemestre/IPAI/json_files"):
    """Save each JSON object in results as a separate .json file."""
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)

    # Loop through results and save each JSON
    for i, result in enumerate(tqdm(results, total=len(results), desc="Saving JSON files")):
        filename = os.path.join(output_dir, f"abstract_{i+1}.json")
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(result, f, indent=4)
        print(f"Saved: {filename}")
        # Download the file in Colab
        files.download(filename)

In [None]:
save_json_files(results)

Saved: /content/drive/MyDrive/coisas da faculdade/Mestrado Ciência de Dados/2ºsemestre/IPAI/json_files/abstract_1.json
Saved: /content/drive/MyDrive/coisas da faculdade/Mestrado Ciência de Dados/2ºsemestre/IPAI/json_files/abstract_2.json
Saved: /content/drive/MyDrive/coisas da faculdade/Mestrado Ciência de Dados/2ºsemestre/IPAI/json_files/abstract_3.json
Saved: /content/drive/MyDrive/coisas da faculdade/Mestrado Ciência de Dados/2ºsemestre/IPAI/json_files/abstract_4.json
Saved: /content/drive/MyDrive/coisas da faculdade/Mestrado Ciência de Dados/2ºsemestre/IPAI/json_files/abstract_5.json
Saved: /content/drive/MyDrive/coisas da faculdade/Mestrado Ciência de Dados/2ºsemestre/IPAI/json_files/abstract_6.json
Saved: /content/drive/MyDrive/coisas da faculdade/Mestrado Ciência de Dados/2ºsemestre/IPAI/json_files/abstract_7.json
Saved: /content/drive/MyDrive/coisas da faculdade/Mestrado Ciência de Dados/2ºsemestre/IPAI/json_files/abstract_8.json
Saved: /content/drive/MyDrive/coisas da faculdade/Mestrado Ciência de Dados/2ºsemestre/IPAI/json_files/abstract_9.json
Saved: /content/drive/MyDrive/coisas da faculdade/Mestrado Ciência de Dados/2ºsemestre/IPAI/json_files/abstract_10.json

Saving JSON files: 100%|██████████| 10/10 [00:00<00:00, 23.72it/s]


Now that I have the JSON files, I want to convert them to CSVs and merge the matching fields into our "final dataset". Since I could not extract much new knowledge, it is probably not worth adding to our dataset. Ideally, though, merging would be best, because it would give us more data, possibly relevant to other questions that we did not explore beforehand.

In [5]:
import json
import pandas as pd
import os

base_dir = '/content/drive/MyDrive/coisas da faculdade/Mestrado Ciência de Dados/2ºsemestre/IPAI'
json_dir = os.path.join(base_dir, 'json_files')

current_dataset = pd.read_csv(os.path.join(base_dir, 'merged_df.csv'))

# Load each saved JSON file into a one-row DataFrame
results = []
for filename in os.listdir(json_dir):
    if filename.endswith('.json'):
        with open(os.path.join(json_dir, filename), 'r', encoding='utf-8') as f:
            data = json.load(f)
            print(data)
            df = pd.DataFrame([data])
            results.append(df)

{'Target Name': None, 'Object Type': None, 'Instrument Used': None}
{'Exposure Time': '1 hour', 'Electromagnetic Range': 'Infrared'}
{'Target Name': None, 'Object Type': None, 'Instrument Used': 'JWST', 'Exposure Time': None, 'Electromagnetic Range': 'infrared'}
{'Target Name': None, 'Object Type': None, 'Instrument Used': 'TESS', 'Observation Date': '2023-01-01', 'Dataset or Observation ID': '10.17909/t9-7s6-5k62'}
{'Exposure Time': '0.85 - 2.5 μm', 'Electromagnetic Range': 'Infrared'}
{'Target Name': 'CY Tau', 'Object Type': 'star', 'Instrument Used': 'JWST', 'dataset_or_observation_id': 'JWST-ERS-1324-10601', 'observation_date': '2023-08-26', 'Exposure Time': '1.5 hours', 'Electromagnetic Range': 'infrared'}
{'Target Name': None, 'Object Type': None, 'Instrument Used': ['ALMA', 'SMA', 'MeerKAT'], 'dataset_or_observation_id': None, 'observation_date': '2017-10-26'}
{}
{}
{'RelatedObservations': None, 'OtherRelevantInformation': None}


In [10]:
current_dataset.head()

| | target_name | s_ra | s_dec | dataset | MJD-OBS | em | coord_block | temp_block | cluster |
|---|---|---|---|---|---|---|---|---|---|
| 0 | IRAS-05248-7007 | 81.086467 | -70.083778 | JWST | 59699.184242 | 3221.5 | (810, -701) | 8528 | 0 |
| 1 | J1120+0641 | 170.006167 | 6.690083 | JWST | 59945.244998 | 1147.5 | (1700, 66) | 8563 | 0 |
| 2 | CEERS-FULL-V2 | 214.909666 | 52.872408 | JWST | 60389.455668 | 2950.0 | (2149, 528) | 8627 | 0 |
| 3 | GOODSS2009 | 53.052791 | -27.731936 | JWST | 60231.322527 | 2950.0 | (530, -278) | 8604 | 0 |
| 4 | ABELL2744 | 3.594322 | -30.395694 | JWST | 60132.288682 | 3560.0 | (35, -304) | 8590 | 0 |


In [13]:
results[5]

| | Target Name | Object Type | Instrument Used | dataset_or_observation_id | observation_date | Exposure Time | Electromagnetic Range |
|---|---|---|---|---|---|---|---|
| 0 | CY Tau | star | JWST | JWST-ERS-1324-10601 | 2023-08-26 | 1.5 hours | infrared |


In [14]:
results[6]

| | Target Name | Object Type | Instrument Used | dataset_or_observation_id | observation_date |
|---|---|---|---|---|---|
| 0 | | | [ALMA, SMA, MeerKAT] | | 2017-10-26 |


In [15]:
results[4]  # As you can see, the LLM does not always get things right (here a wavelength range is returned as the exposure time), so we need to take that into account. It also shows the importance of running inference on more data, and on more carefully curated data, to avoid these hallucinations.

| | Exposure Time | Electromagnetic Range |
|---|---|---|
| 0 | 0.85 - 2.5 μm | Infrared |
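The row above puts a wavelength range (`0.85 - 2.5 μm`) in `Exposure Time`. A cheap sanity check can catch this kind of mismatch before merging. The sketch below is a minimal, hypothetical filter (`plausible_exposure_time` and `drop_implausible_exposure` are names I made up) that only accepts values containing a number followed by a time unit. The list of units is an assumption and is far from complete.

```python
import re

# A number followed by a time unit, e.g. "1 hour", "1.5 hours", "300 s", "20 min"
TIME_UNIT = re.compile(
    r"\d+(?:\.\d+)?\s*(?:s|sec|secs|seconds?|min|mins|minutes?|h|hr|hrs|hours?|ks)\b",
    re.IGNORECASE,
)

def plausible_exposure_time(value) -> bool:
    """True if `value` is a string containing a number followed by a time unit."""
    return isinstance(value, str) and TIME_UNIT.search(value) is not None

def drop_implausible_exposure(record: dict) -> dict:
    """Return a copy of `record` with an implausible 'Exposure Time' set to None."""
    cleaned = dict(record)
    if "Exposure Time" in cleaned and not plausible_exposure_time(cleaned["Exposure Time"]):
        cleaned["Exposure Time"] = None
    return cleaned

bad_row = {"Exposure Time": "0.85 - 2.5 μm", "Electromagnetic Range": "Infrared"}
print(drop_implausible_exposure(bad_row))
```

A check like this only verifies the form of the value. It cannot tell whether a well-formed exposure time such as `1 hour` actually appears in the abstract.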


Ideally, after building the DataFrames from the JSON files, we would concatenate them to collect everything, and finally merge the result with our current dataset using an outer join to keep all the data fields, which could expand our question pool, for example.

In [19]:
final_df = pd.concat(results, ignore_index=True)
final_df.head()

| | Target Name | Object Type | Instrument Used | Exposure Time | Electromagnetic Range | Observation Date | Dataset or Observation ID | dataset_or_observation_id | observation_date | RelatedObservations | OtherRelevantInformation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | | | | | | | | |
| 1 | | | | 1 hour | Infrared | | | | | | |
| 2 | | | JWST | | infrared | | | | | | |
| 3 | | | TESS | | | 2023-01-01 | 10.17909/t9-7s6-5k62 | | | | |
| 4 | | | | 0.85 - 2.5 μm | Infrared | | | | | | |
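The concatenated frame has both `Observation Date` and `observation_date` columns (and likewise for the dataset ID), because the model did not name its keys consistently. A possible fix is to normalize the keys of each JSON record before building the DataFrames. The sketch below is an illustration only: `normalize_key`, `normalize_record` and the `COLUMN_ALIASES` mapping are hypothetical, and mapping `Instrument Used` onto the `dataset` column is my assumption about how the two schemas line up.

```python
import re

# Hypothetical mapping from normalized LLM keys to column names in merged_df.csv
COLUMN_ALIASES = {
    "target_name": "target_name",
    "instrument_used": "dataset",
}

def normalize_key(key: str) -> str:
    """Map 'Observation Date', 'observation_date', 'ObservationDate' to 'observation_date'."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", key)  # split CamelCase words
    snake = re.sub(r"[^0-9a-zA-Z]+", "_", spaced).strip("_").lower()
    return COLUMN_ALIASES.get(snake, snake)

def normalize_record(record: dict) -> dict:
    """Return a copy of `record` with normalized keys."""
    out = {}
    for key, value in record.items():
        name = normalize_key(key)
        # If two spellings collapse to one key, keep the first non-null value
        if out.get(name) is None:
            out[name] = value
    return out

print(normalize_record({"Target Name": "CY Tau", "Instrument Used": "JWST",
                        "observation_date": "2023-08-26"}))
```

Applying `normalize_record` to each loaded dict before `pd.DataFrame([data])` would collapse the duplicated columns and give a `target_name` column that lines up with the existing dataset.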


In [23]:
final_df = current_dataset.merge(final_df, how='outer', left_on='target_name', right_on='Target Name')
final_df.head()

| | target_name_x | s_ra_x | s_dec_x | dataset_x | MJD-OBS_x | em_x | coord_block_x | temp_block_x | cluster_x | target_name_y | ... | Object Type | Instrument Used | Exposure Time | Electromagnetic Range | Observation Date | Dataset or Observation ID | dataset_or_observation_id | observation_date | RelatedObservations | OtherRelevantInformation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -14-HER | 242.602552 | 43.815628 | JWST | 60448.92317 | 1990.5 | (2426, 438) | 8635.0 | 0.0 | | ... | | | | | | | | | | |
| 1 | -48-Cet-PSF-CALIBRATOR | 22.40097 | -21.629309 | JWST | 60302.910535 | 2950.0 | (224, -217) | 8614.0 | 0.0 | | ... | | | | | | | | | | |
| 2 | -49-CET | 23.658054 | -15.676382 | JWST | 60179.502764 | 4433.0 | (236, -157) | 8597.0 | 0.0 | | ... | | | | | | | | | | |
| 3 | -49-CET | 23.658063 | -15.676382 | JWST | 60302.799933 | 2950.0 | (236, -157) | 8614.0 | 0.0 | | ... | | | | | | | | | | |
| 4 | -49-CET | 23.658054 | -15.676382 | JWST | 60169.959282 | 12800.0 | (236, -157) | 8595.0 | 0.0 | | ... | | | | | | | | | | |
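The existing dataset stores observation times as Modified Julian Dates (`MJD-OBS`), while the LLM returns calendar dates such as `2023-08-26`. To compare or merge on time, both need the same unit. The Modified Julian Date is defined as MJD = JD - 2400000.5, which places MJD 0 at 1858-11-17 00:00 UTC. The helper below (`iso_date_to_mjd` is a hypothetical name) is a minimal sketch using only the standard library. It assumes the date is already in `YYYY-MM-DD` form and ignores the time of day.

```python
from datetime import date

MJD_EPOCH = date(1858, 11, 17)  # MJD 0 corresponds to 1858-11-17 00:00 UTC

def iso_date_to_mjd(iso_date: str) -> int:
    """Convert 'YYYY-MM-DD' to the integer MJD of that day (at 00:00 UTC)."""
    return (date.fromisoformat(iso_date) - MJD_EPOCH).days

print(iso_date_to_mjd("2023-08-26"))  # date extracted for CY Tau
```

Since `MJD-OBS` carries a fractional day, a match would compare the integer part (for example with `floor`) rather than test for equality.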


## Further Improvements in my opinion

- More specific queries for the papers, meaning a more detailed query to find papers focused on each instrument or on the problem itself. Some domain knowledge in this area would greatly improve this.

- Although I think the queries were clear, more prompt efficiency wouldn't hurt.

- Some cleaning strategies like the ones we used in the first phase: entity matching, field normalization, deduplication, etc.

- And obviously more computational resources, as the free version of Colab is quite unstable and unpredictable, which didn't help my case, but it was all I had, since the LLM was too heavy for my PC.

All of these could refine the use of an LLM in a workflow to produce trustworthy data on this specific topic, and I'm sure on many others, thanks to APIs like the arXiv one that let us get more trustworthy free text without manual copy-pasting.
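As an illustration of the first point, the arXiv API query syntax supports field prefixes such as `abs:` (abstract) and `cat:` (subject category), so a query can be built per instrument instead of one broad `OR` over all of them. The sketch below only builds the query strings. `build_query` and `INSTRUMENTS` are hypothetical names, and the choice of category for each instrument is an assumption that would need domain knowledge.

```python
from typing import Optional

INSTRUMENTS = ["TESS", "JWST", "ALMA", "HAWK-I"]

def build_query(instrument: str, category: Optional[str] = None) -> str:
    """Build an arXiv search string for abstracts that mention `instrument`."""
    query = f'abs:"{instrument}"'
    if category is not None:
        # Restrict to one arXiv subject category, e.g. "astro-ph.EP"
        query += f" AND cat:{category}"
    return query

queries = {name: build_query(name) for name in INSTRUMENTS}
print(queries["JWST"])
print(build_query("TESS", "astro-ph.EP"))
```

Each string could be passed as the `query` argument of `arxiv.Search` in `collect_abstracts`, one instrument at a time, which would also record which instrument each abstract was retrieved for.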