# Add external catalog for source matching: allWISE catalog

This notebook creates a database containing the allWISE all-sky mid-infrared catalog. As catalogs grow (the allWISE catalog we are inserting contains on the order of hundreds of millions of sources), using an index on the geoJSON coordinate type to support the queries becomes impractical, as such an index does not compress well. In this case, a healpix-based index offers a good compromise. We will use a healpix grid of order 16, which has a resolution of ~3 arcseconds, similar to the FWHM of ZTF images.
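The quoted resolution can be checked with a few lines of arithmetic. The sketch below assumes the usual definition of healpix resolution as the square root of the pixel area, with 12 * nside^2 equal-area pixels and nside = 2^order; `healpix_resolution_arcsec` is a hypothetical helper name, not part of extcats or healpy.

```python
import math

def healpix_resolution_arcsec(order):
    """Approximate pixel size (square root of the pixel area) in arcseconds."""
    nside = 2 ** order
    npix = 12 * nside ** 2                  # total number of pixels on the sphere
    pixel_area_sr = 4.0 * math.pi / npix    # all healpix pixels have the same area
    return math.degrees(math.sqrt(pixel_area_sr)) * 3600.0

# compare a few orders around the one used in this notebook
for order in (14, 16, 18):
    print(order, round(healpix_resolution_arcsec(order), 2))
```

For order 16 this gives roughly 3.2 arcseconds; each additional order halves the pixel size and quadruples the number of pixels.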

References, data access, and documentation on the catalog can be found at:

http://wise2.ipac.caltech.edu/docs/release/allwise/

http://irsa.ipac.caltech.edu/data/download/wise-allwise/

This notebook is straight to the point, more like an actual piece of code than a demo. For an explanation of the various steps, see the 'insert_example' notebook in this same folder.

## 1) Inserting:

In [1]:
import numpy as np
from healpy import ang2pix
from extcats import CatalogPusher

# build the pusher object and point it to the raw files.
wisep = CatalogPusher.CatalogPusher(
    catalog_name = 'wise',
    data_source = '../testdata/AllWISE/',
    file_type = ".bz2")


# read column names and types from schema file
schema_file = "../testdata/AllWISE/wise-allwise-cat-schema.txt"
names, types = [], {}
with open(schema_file) as schema:
    for l in schema:
        if "#" in l or (not l.strip()):
            continue
        # each schema line holds a column name followed by its data type
        name, dtype = l.split()
        #print (name, dtype)
        names.append(name)
        # convert the data type
        if "char" in dtype:
            types[name] = str
        elif "decimal" in dtype:
            types[name] = np.float64
        elif "serial" in dtype or "integer" in dtype:
            types[name] = int
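        elif "smallfloat" in dtype:
            # assuming Informix-style types: SMALLFLOAT is a 4-byte float
            types[name] = np.float32
        elif "smallint" in dtype:
            types[name] = np.int16
        elif dtype == "int8":
            # assuming Informix-style types: INT8 is an 8-byte integer, not 8-bit
            types[name] = np.int64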
        else:
            print("unknown data type: %s"%dtype)

# select the columns you want to use.
use_cols = []
select = ["Basic Position and Identification Information", 
         "Primary Photometric Information", 
         "Measurement Quality and Source Reliability Information",
         "2MASS PSC Association Information"]
with open(schema_file) as schema:
    blocks = schema.read().split("#")
    for block in blocks:
        if any([k in block for k in select]):
            for l in block.split("\n")[1:]:
                if "#" in l or (not l.strip()):
                    continue
                # each schema line holds a column name followed by its data type
                name, dtype = l.split()
                use_cols.append(name)
print("we will be using %d columns out of %d"%(len(use_cols), len(names)))

# now assign the reader to the catalog pusher object
import pandas as pd
wisep.assign_file_reader(
        reader_func = pd.read_csv, 
        read_chunks = True,
        names = names,
        usecols = lambda x : x in use_cols,
        #dtype = types,    # this messes up NaN values
        chunksize=5000,
        header=None,
        engine='c',
        sep='|',
        na_values = 'nnnn')


# define the dictionary modifier that will act on the single entries
def modifier(srcdict):
    srcdict['hpxid_16'] = int(
        ang2pix(2**16, srcdict['ra'], srcdict['dec'], lonlat = True, nest = True))
    # srcdict['_id'] = srcdict.pop('source_id')  # doesn't work: source_id does not seem to be unique
    return srcdict
wisep.assign_dict_modifier(modifier)


# finally push it into the database
wisep.push_to_db(
    coll_name = 'srcs', 
    index_on = "hpxid_16",
    overwrite_coll = True, 
    append_to_coll = False)


# if needed print extensive info on database
#wisep.info()

INFO:extcats.CatalogPusher:found 1 files for catalog wise in data source: ['../testdata/AllWISE/']
INFO:extcats.CatalogPusher:checking raw files for existence and consistency..
INFO:extcats.CatalogPusher:all files exists and have consistent type.
INFO:extcats.CatalogPusher:file reader read_csv assigned to pusher.
INFO:extcats.CatalogPusher:source document modifer modifier assigned to pusher.
INFO:extcats.CatalogPusher:using mongo client at localhost:27017
INFO:extcats.CatalogPusher:connecting to database wise. Here some stats:
INFO:extcats.CatalogPusher:{
  "db": "wise",
  "collections": 1,
  "views": 0,
  "objects": 15575416,
  "avgObjSize": 1284.6681279652498,
  "dataSize": 20009240515.0,
  "storageSize": 8750858240.0,
  "numExtents": 0,
  "indexes": 2,
  "indexSize": 469798912.0,
  "fsUsedSize": 220678574080.0,
  "fsTotalSize": 231446335488.0,
  "ok": 1.0
}
INFO:extcats.CatalogPusher:collection has the following indexes: _id_, hpxid_16_1
INFO:extcats.CatalogPusher:inserting ../testd

we will be using 79 columns out of 298


INFO:extcats.CatalogPusher:inserted 15575416 documents in 4.39e+03 seconds
INFO:extcats.CatalogPusher:done inserting catalog wise in collection wise.srcs. Took 4.39e+03 seconds
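The column selection in the cell above can be tried out without the real schema file. The following is a minimal sketch on an in-memory sample: the schema excerpt, its column names, and the `parse_schema` helper are made up for illustration, and it assumes the layout implied by the parsing code above (a `#` line naming each block, followed by `name type` lines).

```python
# Hypothetical schema excerpt; names and types are illustrative only.
sample_schema = """\
# Basic Position and Identification Information
designation   char(20)
ra            decimal(10,7)
dec           decimal(9,7)
# Primary Photometric Information
w1mpro        decimal(5,3)
w1sigmpro     decimal(4,3)
"""

def parse_schema(text):
    """Return {block title: [(column name, type string), ...]}."""
    blocks = {}
    title = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # a comment line opens a new block of columns
            title = line.lstrip("#").strip()
            blocks[title] = []
            continue
        name, dtype = line.split()[:2]
        blocks.setdefault(title, []).append((name, dtype))
    return blocks

blocks = parse_schema(sample_schema)
select = ["Basic Position and Identification Information"]
use_cols = [name for title in select for name, _ in blocks[title]]
print(use_cols)  # ['designation', 'ra', 'dec']
```

Grouping the columns by block title first keeps the selection step a simple lookup, instead of splitting and re-parsing the file a second time.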


## 2) Testing the catalog

At this stage, a simple test is run on the database: crossmatching with a set of randomly distributed points.
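Randomly distributed targets of this kind can be generated, for instance, as sketched below. This is one common approach and not necessarily what `run_test` does internally; `random_sky_points` is a hypothetical helper. The key point is that declination must be drawn uniformly in sin(dec), otherwise the points pile up at the poles.

```python
import math
import random

def random_sky_points(npoints, seed=None):
    """Return a list of (ra, dec) pairs in degrees, uniform over the sphere."""
    rng = random.Random(seed)
    points = []
    for _ in range(npoints):
        ra = rng.uniform(0.0, 360.0)
        # sampling sin(dec) uniformly gives equal probability per unit area
        dec = math.degrees(math.asin(rng.uniform(-1.0, 1.0)))
        points.append((ra, dec))
    return points

points = random_sky_points(10000, seed=42)
print(len(points), points[0])
```

With this sampling, about half of the points fall within 30 degrees of the equator, as expected for a uniform distribution on the sphere.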

In [2]:
# now test the database for query performance. We use
# a sample of randomly distributed points on a sphere
# as targets.

# define the function to test coordinate based queries:
from healpy import ang2pix, get_all_neighbours
from astropy.table import Table
from astropy.coordinates import SkyCoord

return_fields = ['designation', 'ra', 'dec']
project = {}
for field in return_fields: project[field] = 1
print (project)


hp_order, rs_arcsec = 16, 30.
def test_query(ra, dec, coll):
    """query collection for the closest source within rs_arcsec of target ra, dec.
    The match is returned as an astropy Table row, or None if nothing is found."""
    
    # find the index of the target pixel and its neighbours 
    target_pix = int( ang2pix(2**hp_order, ra, dec, nest = True, lonlat = True) )
    neighbs = get_all_neighbours(2**hp_order, ra, dec, nest = True, lonlat = True)

    # remove non-existing neighbours (in case of E/W/N/S) and add center pixel
    pix_group = [int(pix_id) for pix_id in neighbs if pix_id != -1] + [target_pix]
    
    # query the database for sources in these pixels
    qfilter = { 'hpxid_%d'%hp_order: { '$in': pix_group } }
    qresults = [o for o in coll.find(qfilter)]
    if len(qresults)==0:
        return None
    
    # then use astropy to find the closest match
    tab = Table(qresults)
    target = SkyCoord(ra, dec, unit = 'deg')
    matches_pos = SkyCoord(tab['ra'], tab['dec'], unit = 'deg')
    d2t = target.separation(matches_pos).arcsecond
    match_id = np.argmin(d2t)

    # if it's too far away don't use it
    if d2t[match_id]>rs_arcsec:
        return None
    return tab[match_id]

# run the test
wisep.run_test(test_query, npoints = 10000)


INFO:extcats.CatalogPusher:running test queries using 10000 random points
  1%|          | 54/10000 [00:00<00:18, 536.21it/s]

{'designation': 1, 'ra': 1, 'dec': 1}


100%|██████████| 10000/10000 [00:15<00:00, 632.00it/s]
INFO:extcats.CatalogPusher:Total document found for query: 1
INFO:extcats.CatalogPusher:Took 1.58e+01 sec for 10000 random queries. Average query time: 1.583e-03 sec
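The last step of `test_query`, picking the closest candidate and rejecting it if it lies beyond the search radius, can be illustrated without astropy or a database. The sketch below uses the haversine formula in place of `SkyCoord.separation`; the helper names and the candidate sources are hypothetical.

```python
import math

def separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0

def closest_match(ra, dec, candidates, rs_arcsec=30.0):
    """Return the candidate dict closest to (ra, dec), or None if too far."""
    if not candidates:
        return None
    best = min(candidates,
               key=lambda c: separation_arcsec(ra, dec, c['ra'], c['dec']))
    # if it's too far away don't use it
    if separation_arcsec(ra, dec, best['ra'], best['dec']) > rs_arcsec:
        return None
    return best

# made-up candidates, standing in for the documents returned by the pixel query
candidates = [
    {'designation': 'A', 'ra': 10.001, 'dec': 20.0},
    {'designation': 'B', 'ra': 10.0, 'dec': 20.0005},
    {'designation': 'C', 'ra': 11.0, 'dec': 20.0},
]
match = closest_match(10.0, 20.0, candidates)
print(match['designation'])  # B
```

Source B is 1.8 arcseconds away in declination, while A is offset by 0.001 degrees in right ascension, which at this declination corresponds to about 3.4 arcseconds.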


## 3) Adding metadata

Once the database is set up and the query performance is satisfactory, metadata describing the catalog content, contact person, and query strategies have to be added to the catalog database. If present, the keys and parameters for the healpix partitioning of the sources are also to be given, as well as the name of the compound geoJSON/legacy pair entry in the documents.

This information will be added to the 'meta' collection of the database, which will be accessed by the CatalogQuery. The metadata is stored in a dedicated collection, so the database containing a given catalog will have two collections:
    - db['srcs'] : contains the sources.
    - db['meta'] : describes the catalog.

In [3]:
# add the metadata using the pusher object built in section 1
wisep.healpix_meta(healpix_id_key = 'hpxid_16', order = 16, is_indexed = True, nest = True)
wisep.science_meta(
    contact =  'C. Norris', 
    email = 'chuck.norris@desy.de', 
    description = 'allWISE infrared catalog',
    reference = 'http://wise2.ipac.caltech.edu/docs/release/allwise/')