# [Star Name Database Walkthrough](https://github.com/HypatiaOrg/HySite/wiki/Star-Name-Database-Walkthrough)

See our wiki page for database setup and configuration instructions at: https://github.com/HypatiaOrg/HySite/wiki/Star-Name-Database-Walkthrough

In [None]:
import os

from hypatia.sources.simbad.ops import star_collection
from hypatia.sources.simbad.batch import get_star_data_batch
from hypatia.sources.nea.ops import needs_micro_lense_name_change
from hypatia.sources.nea.query import query_nea, set_data_by_host, hypatia_host_name_rank_order
from hypatia.configs.source_settings import nea_names_the_cause_wrong_simbad_references


star_list = [
    '2MASSJ04545692-6231205',
    '2MASSJ09423526-6228346',
    '2MASSJ09442986-4546351',
    '2MASSJ15120519-2006307',
    'LTT 1445 A',
    'L 168-9',
    'G162-44',
    'HD 6434', 
    'G50-16',
    'GJ1252',
    'GJ4102',
    'GJ436',
    'Gliese486',
    'K2-129',
    'K2-137',
    'K2-239',
    'K2-54',
    'K2-72',
    'L181-1',
    'L231-32',
    'GJ 1132',
    'HD 27859',
    'LHS 1678',
    'GJ 357',
    'L98-59',
    'LHS1140',
    '* 7 Com',
    'LHS3844',
    'LP961-53',
    'TOI-269',
    'Wolf437',
    'Furuhjelm I-1002',
]

# Batch star names - `get_star_data_batch()`
The `get_star_data_batch()` function is the primary tool for querying multiple stars from SIMBAD at once. The keyword argument `search_ids` should be a list, and the returned data, `star_docs`, is a list of the same length. The `search_ids` list can contain repeated values; `get_star_data_batch()` is smart enough to send only one query to SIMBAD for a given name string.

In [None]:
star_docs = get_star_data_batch(search_ids=star_list, test_origin='star names example notebook')

The first time this function is called, you will notice some printed output regarding the number of stars being queried from the SIMBAD database's API. However, every subsequent call of this function for the same stars will find the matching names already in our local MongoDB database. The SIMBAD API is only accessed when the data is not available locally. To see this, we can call the same function again.

In [None]:
star_docs = get_star_data_batch(search_ids=star_list, test_origin='star names example notebook')
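
The two behaviors described in this section (one query per repeated name, and a local lookup before any remote call) can be illustrated with a minimal sketch. Everything here is a stand-in: `local_cache` plays the role of the local database, and `fake_remote_lookup` and `lookup_batch` are hypothetical names, not Hypatia code or the SIMBAD API.

```python
from typing import Callable

# Hypothetical stand-ins: a dict plays the role of the local database and
# fake_remote_lookup plays the role of a remote API. Neither is Hypatia code.
local_cache: dict[str, dict] = {}
remote_calls: list[str] = []


def fake_remote_lookup(name: str) -> dict:
    """Pretend to query a remote service and record that the call happened."""
    remote_calls.append(name)
    return {'_id': name.replace(' ', '').lower()}


def lookup_batch(names: list[str], remote: Callable[[str], dict]) -> list[dict]:
    """Return one document per input name, in input order."""
    # dict.fromkeys removes repeats while preserving first-seen order
    for name in dict.fromkeys(names):
        if name not in local_cache:
            local_cache[name] = remote(name)
    return [local_cache[name] for name in names]


names = ['GJ 436', 'HD 6434', 'GJ 436']
first = lookup_batch(names, fake_remote_lookup)
calls_after_first = len(remote_calls)   # two unique names, so two remote calls
second = lookup_batch(names, fake_remote_lookup)
calls_after_second = len(remote_calls)  # unchanged: everything was cached
print(len(first), calls_after_first, calls_after_second)
```

The returned list keeps the length and order of the input even though only the unique names trigger a lookup.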

## The returned data: `star_docs`
Let's take a look at the returned data in a single data document of the list `star_docs`.

In [None]:
star_doc = star_docs[1]
print(f"star_doc's keys:\n    {star_doc.keys()}\n")
simbad_name = star_doc['_id']
print(f'The SIMBAD name:\n    {simbad_name}\n')
all_names = star_doc['aliases']
print(f'All known names:\n    {all_names}\n')
lower_case_space_removed_names_for_matching = star_doc['match_names']
print(f'Names formatted to make name-matching simpler by removing spaces and using only lowercase letters:\n    {lower_case_space_removed_names_for_matching}\n')
print(f'The whole star doc:\n   {star_doc}\n')

Most of the fields for a star document may not be relevant to your astronomy application or you may want to harvest even more information from SIMBAD's API. The fields were chosen specifically because we use this data in the Hypatia Catalog pipeline. You may notice some star docs have a `params` key, which is used to store spectral types for stars (when available); however, we may expand the usage to include other parameters available on SIMBAD. 
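
As a rough illustration of why `match_names` helps, the sketch below normalizes names by removing whitespace and lowercasing, as described above. The helper `to_match_name` is hypothetical; the exact rules used by the Hypatia pipeline may differ.

```python
def to_match_name(name: str) -> str:
    """Lowercase a name and remove all whitespace (hypothetical helper)."""
    return ''.join(name.split()).lower()


# Three spellings of the same star collapse to a single match name
queried = ['GJ 436', 'gj436', 'Gj  436']
match_names = {to_match_name(name) for name in queried}
print(match_names)
```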

## Using the star_docs to determine unique star names 
The most crucial part of `star_docs` is that it provides a **unique** ID for a given star. From a data-science perspective, we now have a way to identify duplicates and cross-match lists of stars using different naming conventions. This is the core of the Hypatia Catalog data association process, which is how we can associate data from hundreds of sources into a single dataset with one entry per star and every star unique.


Let's look at the returned `star_docs` and find any duplicates: 

In [None]:
unique_ids = {}
for list_index, (star_doc, star_name) in enumerate(zip(star_docs, star_list)):
    simbad_name = star_doc['_id']
    if simbad_name in unique_ids:
        unique_ids[simbad_name]['queried_names'].add(star_name)
        unique_ids[simbad_name]['list_indexes'].append(list_index)
    else:
        unique_ids[simbad_name] = {'queried_names': {star_name}, 'list_indexes': [list_index]}
    print(f'Queried name: "{star_name}" - SIMBAD name "{simbad_name}"')

for simbad_name, queried_info in unique_ids.items():
    queried_names = queried_info['queried_names']
    list_indexes = queried_info['list_indexes']
    if len(list_indexes) > 1:
        print(f'\nDuplicate name found for {simbad_name}:\n    {sorted(queried_names)}\n    at list indexes {list_indexes}')

# Using Multiple Names to Look Up a Single Star

The star-name problem is so widespread in astronomy that many databases help by providing several name references for a given star. One of the Hypatia Catalog's favorite databases, the [NASA Exoplanet Archive](https://exoplanetarchive.ipac.caltech.edu/) (NEA), provides us with several names for many of the exoplanet host stars. For this part of the example, we will download some NEA data and run it through the `get_star_data_batch()` function to find the SIMBAD names.

## Fetch and format NEA data

We will use some of our Hypatia Catalog tools for this, but we'll also write/read the data to/from a text file in case you want to examine the data yourself. 

In [None]:
nea_data = query_nea()

In [None]:
nea_host_ids = [
    'hostname',
    'gaia_id',
    'tic_id',
    'hip_name',
    'hd_name',
]
nea_hosts_file = os.path.join(os.getcwd(), 'nea_host.csv')
planets_by_host_name = set_data_by_host(nea_data)

In [None]:
with open(nea_hosts_file, 'w') as f:
    f.write(f'{",".join(nea_host_ids)}\n')
    for hostname, host_and_planet_data in planets_by_host_name.items():
        gaia_id = host_and_planet_data.get('gaia dr2', '')
        tic_id = host_and_planet_data.get('tic', '')
        hip_name = host_and_planet_data.get('hip', '')
        hd_name = host_and_planet_data.get('hd', '')
        f.write(f'{hostname},{gaia_id},{tic_id},{hip_name},{hd_name}\n')
        

In [None]:
nea_tuple_host_names = []
nea_names_dicts = []
with open(nea_hosts_file, 'r') as f:
    header = None
    for line in f.readlines():
        line = line.strip()
        if header is None:
            header = line.split(',')
        else:
            names = line.split(',')
            nea_tuple_host_names.append(tuple(name for name in names if name !=''))
            nea_names_dicts.append({column: name for column, name in zip(header, names) if name !=''})
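# join outside the f-string: backslashes inside f-string expressions fail before Python 3.12
joined_names = "\n    ".join(str(names) for names in nea_tuple_host_names[1000:1020])
print(f'20 names formatted for get_star_data_batch():\n    {joined_names}')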
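
Splitting on commas by hand works as long as no name contains a comma. As an alternative sketch, the standard library's `csv` module can build the same two structures. The sample text below is illustrative only: the first row reuses the `gam1 Leo` names that appear later in this notebook, and `Hypothetical-1` is a made-up host with no other identifiers.

```python
import csv
import io

# In-memory sample in the same column layout as nea_host.csv
sample = (
    'hostname,gaia_id,tic_id,hip_name,hd_name\n'
    'gam1 Leo,,TIC 95431294,HIP 50583 A,HD 89484\n'
    'Hypothetical-1,,,,\n'
)

name_tuples = []
name_dicts = []
for row in csv.DictReader(io.StringIO(sample)):
    # keep only the columns that actually hold a name
    non_empty = {column: name for column, name in row.items() if name != ''}
    name_dicts.append(non_empty)
    name_tuples.append(tuple(non_empty.values()))
print(name_tuples)
```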


## The `get_star_data_batch()` interactive mode

When we run the cell below, it raises the exception shown here. This is because some NEA names for the star `gam1 Leo` point to different SIMBAD records. By default, we want `get_star_data_batch()` to raise errors when this happens, so we can stop and check that our input data is correct.
```
ValueError                                Traceback (most recent call last)
Cell In[48], line 1
----> 1 star_docs_nea = get_star_data_batch(search_ids=nea_tuple_host_names, test_origin='star names example notebook - nea')

File /backend/hypatia/sources/simbad/batch.py:93, in get_star_data_batch(search_ids, test_origin, has_micro_lens_names, all_ids, override_interactive_mode)
     91         for found_oid, found_names in oid_map.items():
     92             error_msg += f'  Found SIMBAD database id:({found_oid}) for names: {found_names}\n'
---> 93     raise ValueError(error_msg)
     94 # Update the not-found indexed for processing after the batching loop
     95 simbad_not_found_indexes.update(indexes_this_batch - set(index_to_oid.keys()))

ValueError: List indexes {196} have more than one oid from the SIMBAD API
 List index 196 has more than one oid from the SIMBAD API for names ('gam1 Leo', 'TIC 95431294', 'HIP 50583 A', 'HD 89484')
  Found SIMBAD database id:(1710354) for names: {'gam1 Leo', 'HD 89484'}
  Found SIMBAD database id:(1710355) for names: {'TIC 95431294'}
```

In [None]:
star_docs_nea = get_star_data_batch(search_ids=nea_tuple_host_names, test_origin='star names example notebook - nea')
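
The kind of check that produces this error can be sketched in a few lines. This is not the code in `batch.py`; `name_to_oid` and `check_single_oid` are hypothetical, and the table holds only the name-to-oid pairs printed in the traceback above (`HIP 50583 A` is absent because the traceback lists no oid for it).

```python
from collections import defaultdict
from typing import Optional

# Illustrative table copied from the traceback; a real lookup would ask SIMBAD
name_to_oid = {
    'gam1 Leo': 1710354,
    'HD 89484': 1710354,
    'TIC 95431294': 1710355,
}


def check_single_oid(names: tuple) -> Optional[int]:
    """Return the one oid shared by names, None if no name is known, else raise."""
    oid_map = defaultdict(set)
    for name in names:
        if name in name_to_oid:
            oid_map[name_to_oid[name]].add(name)
    if len(oid_map) > 1:
        details = '; '.join(f'{oid}: {sorted(found)}' for oid, found in sorted(oid_map.items()))
        raise ValueError(f'Names {names} map to more than one oid: {details}')
    return next(iter(oid_map), None)


conflicting = ('gam1 Leo', 'TIC 95431294', 'HIP 50583 A', 'HD 89484')
try:
    check_single_oid(conflicting)
    conflict_raised = False
except ValueError as error:
    conflict_raised = True
    print(error)

# Without the conflicting TIC name, the remaining names agree on one oid
consistent_oid = check_single_oid(('gam1 Leo', 'HD 89484'))
```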

## Handling Conflicting Name Records
Resolving the above conflict is something we have checked and solved for the Hypatia Catalog. Let's reformat the names and try again, but this time we are going to:
1. Remove names from the `search_ids` keyword argument that will cause conflicts.
2. Keep the conflicting names and pass them in a new keyword argument, `all_ids`, which is not used to search SIMBAD but is saved in the local database as a record of names that point to the correct SIMBAD record (this is how we allow custom, non-SIMBAD star names to be used as identifiers).
3. Detect specific microlensing names. When these are not found in SIMBAD, we will automatically make a *no-SIMBAD local database record* that skips any other similar inputs.
4. Allow input for any star names that SIMBAD does not know by opening an interactive menu prompt. 

In [None]:
all_ids = []
search_ids = []
has_micro_lens_names = []
for host_name, host_data in planets_by_host_name.items():
    # every star must have a nea_name that is not empty
    nea_name = host_data['nea_name']
    if not nea_name:
        raise ValueError(
            f'No valid name found for host, this is not supposed to happen, see host_data: {host_data}')
    nea_ids = {host_data[param] for param in hypatia_host_name_rank_order if param in host_data.keys()}
    all_ids.append(tuple(nea_ids))
    names_to_try = {nea_id for nea_id in nea_ids if nea_id not in nea_names_the_cause_wrong_simbad_references}
    micro_name_for_simbad = needs_micro_lense_name_change(nea_name)
    has_micro_lens_name = micro_name_for_simbad is not None
    if has_micro_lens_name:
        names_to_try = set([micro_name_for_simbad] + list(names_to_try))
    search_ids.append(tuple(names_to_try))
    has_micro_lens_names.append(has_micro_lens_name)
# update or get all the name data for these stars from SIMBAD
star_docs_nea = get_star_data_batch(search_ids=search_ids, test_origin='star names example notebook - nea',
                                    has_micro_lens_names=has_micro_lens_names, all_ids=all_ids)
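
To see the split between `search_ids` and `all_ids` without a database, here is a toy sketch. The set `names_to_skip` is a hypothetical stand-in for `nea_names_the_cause_wrong_simbad_references`; `TIC 95431294` is used only because it is the conflicting name in the traceback above, and the real list may hold different entries.

```python
# Hypothetical exclusion set, standing in for
# nea_names_the_cause_wrong_simbad_references
names_to_skip = {'TIC 95431294'}

hosts = [
    ('gam1 Leo', 'TIC 95431294', 'HD 89484'),
    ('GJ 436',),
]

all_ids_demo = []     # every known name, kept for the local record
search_ids_demo = []  # only the names that are safe to send to the remote service
for host_names in hosts:
    all_ids_demo.append(host_names)
    search_ids_demo.append(tuple(name for name in host_names if name not in names_to_skip))

print(search_ids_demo)
```

Both lists stay the same length as the list of hosts, so each position still refers to the same star. Unlike the cell above, which builds the names with sets, this sketch keeps the original name order.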

## Disable interactive mode
By default, we open an interactive menu to allow us to view any names not recognized by SIMBAD. Sometimes we can correct the names to something SIMBAD understands, but other times we have to accept that there is no SIMBAD record and instead rely on the name data we supply to create a local copy.

To disable interactive mode and automatically accept a no-SIMBAD record for any stars not found, we can use the overriding keyword argument `override_interactive_mode`.

In [None]:
star_docs_nea = get_star_data_batch(search_ids=search_ids, test_origin='star names example notebook - nea',
                                    has_micro_lens_names=has_micro_lens_names, all_ids=all_ids, override_interactive_mode=True)

## Static data that is more likely to work
There is a possibility that you may need to add more names to `nea_names_the_cause_wrong_simbad_references` and rerun the above code to get it to finish without an exception. The nature of live databases is that they are constantly changing.

Let's load some static example data from the NEA that was known to be working on May 18, 2025. But note that there can still be changes in SIMBAD that cause this process to fail.

In [None]:
nea_static_hosts_file = os.path.join(os.getcwd(), 'nea_static_host.psv')
with open(nea_static_hosts_file, 'r') as f:
    search_ids = [tuple(line.strip().split('|')) for line in f.readlines()]
star_docs_nea = get_star_data_batch(search_ids=search_ids, test_origin='star names example notebook - nea', override_interactive_mode=True)

# Cross-Matching Names
We now cross-match names in the two lists of star_docs used in this example.

In [None]:
our_list_lookup = {star_doc['_id']:  star_doc for star_doc in star_docs}
nea_match_lookup = {star_doc['_id']:  star_doc for star_doc in star_docs_nea}

for simbad_id in our_list_lookup.keys():
    original_list_info = unique_ids[simbad_id]
    if simbad_id in nea_match_lookup.keys():
        print(f'Found: Original data {original_list_info} is found in NEA dataset, SIMBAD ID: {simbad_id}')
    else:
        print(f'    Nope: Original data {original_list_info} is not found in NEA dataset')
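
Because every document carries a unique `_id`, the same cross-match can also be expressed with set operations. The documents below are made up and mimic only the `_id` field; they are not real SIMBAD names.

```python
# Made-up documents that mimic only the '_id' field of a star doc
list_a = [{'_id': 'star-1'}, {'_id': 'star-2'}, {'_id': 'star-3'}]
list_b = [{'_id': 'star-2'}, {'_id': 'star-3'}, {'_id': 'star-4'}]

ids_a = {doc['_id'] for doc in list_a}
ids_b = {doc['_id'] for doc in list_b}

in_both = sorted(ids_a & ids_b)    # stars present in both lists
only_in_a = sorted(ids_a - ids_b)  # stars missing from the second list
print(in_both, only_in_a)
```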

# Database Reset
Adding items to the database changes the behavior of `get_star_data_batch()`. To start over and reset the star-name database, uncomment and run the command below.

In [None]:
# star_collection.reset()