In [1]:
from pathlib import Path

import pandas as pd

from pdr_tests.utilz.new_coverage_utilz import (
    MetricLoader, pathtable_to_treeframe, tf_pathcounts, measured_manifest_names
)

#### Let's prepare to load in the summary products from our manifest(s) of interest.

**Note**: These summary products are snapshots of the data that existed when they were generated; changes in pdr-tests do not update them automatically. Regenerate them occasionally so your metrics stay current.

To update them, or to create summary products for the first time from the coverage manifests themselves, see the documentation on the coverage pipeline [here](https://docs.google.com/document/d/17VdNITxeL3uWpw5WAMbwIsJoN6UROQPbglPjdugoRaY/edit#heading=h.kmvz5h4az3u). If you already have existing coverage manifests (from this pipeline; you can get them in Google Drive [here](https://drive.google.com/drive/u/0/folders/1wn7fQhyQlZXzINv1SDpi_VZEDeGZqpkw)), you should only need to execute step 2 of the coverage analysis pipeline, not the blank manifest pipeline. The code for these pipelines lives in `pdrdev/manifest_analysis`.

In [2]:
manifests_folder = Path().absolute() / "node_manifests/"

loader = MetricLoader(manifests_folder)

Now you can load products of whichever type, pivot, and filter you'd like. This can be done for one or more manifests at once. The value assigned to `manifest` can be a complete filename, a partial filename, or a list of (partial) names. This allows for implicit aggregation of data from multiple manifests that meet your partial criteria.

To see all the names of all available manifests that have gone through the metrics pipeline:
```
measured_manifest_names(manifests_folder / "coverage_metrics")
```
Or just look at the folder and see what metrics manifests are there. That's fine too.
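
How `MetricLoader` resolves partial names internally is not shown in this notebook. As a rough mental model only, assuming plain substring matching (an assumption, not confirmed behavior), selecting manifests by partial name could look like the sketch below. `select_manifests` is a hypothetical helper, not part of pdr-tests, and the last name in the list is made up for contrast.

```python
def select_manifests(available, manifest):
    """Return the names in `available` that contain any requested (partial) name.

    Hypothetical helper: assumes substring matching, which may differ from
    what MetricLoader actually does.
    """
    wanted = [manifest] if isinstance(manifest, str) else list(manifest)
    return [name for name in available if any(part in name for part in wanted)]


available = [
    "img_usgs_magellan",
    "img_usgs_messenger",
    "img_usgs_cassini",
    "hypothetical_other_node",  # made-up name, for contrast only
]

# A partial name picks up every manifest that contains it.
print(select_manifests(available, "img_usgs"))
# A list of partial names aggregates across several manifests.
print(select_manifests(available, ["magellan", "cassini"]))
# A complete name selects just that manifest.
print(select_manifests(available, "img_usgs_messenger"))
```

Under this model, a short partial name like `'img_usgs'` is what produces the implicit aggregation described above.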

The pivot is the field by which the tables are primarily organized. See [pivot table](https://en.wikipedia.org/wiki/Pivot_table). Options are: `pivot = one of ("dataset_pds", "dataset_ix", "ptype", "volume")`. (More info on these is in the coverage pipeline documentation.)

Manifest type can be stats or paths. Paths files are much more detailed; stats files are better for primary overviews. Stats files might wrinkle your brow and make you think there is a problem; paths files will help you figure out where that problem is coming from and how to fix it. `mtype = one of ("paths", "stats")`.

Filters can be specified for covered (**Note**: covered includes "excluded" file types, uncovered does not), uncovered, included, and all. Filters can only be used with paths manifest types. `filt = one of ("cov", "ucov", "inc", "all")`

The default settings for `loader` are `pivot='dataset_pds'` and `mtype='stats'` (which is probably the most useful product type for this kind of analysis).
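
The option rules above can be summarized in a few lines of plain Python. This is only a sketch of the rules as stated in this notebook; `check_load_options` is a hypothetical helper, not part of pdr-tests, and whether `MetricLoader` performs its own validation is not something this notebook shows.

```python
PIVOTS = ("dataset_pds", "dataset_ix", "ptype", "volume")
MTYPES = ("paths", "stats")
FILTERS = ("cov", "ucov", "inc", "all")


def check_load_options(pivot="dataset_pds", mtype="stats", filt=None):
    """Raise ValueError for option combinations the text above rules out.

    Hypothetical helper; the defaults mirror the loader defaults named above.
    """
    if pivot not in PIVOTS:
        raise ValueError(f"pivot must be one of {PIVOTS}, got {pivot!r}")
    if mtype not in MTYPES:
        raise ValueError(f"mtype must be one of {MTYPES}, got {mtype!r}")
    if filt is not None:
        if filt not in FILTERS:
            raise ValueError(f"filt must be one of {FILTERS}, got {filt!r}")
        if mtype != "paths":
            raise ValueError("filters can only be used with mtype='paths'")
    return {"pivot": pivot, "mtype": mtype, "filt": filt}


print(check_load_options())                            # the defaults
print(check_load_options(mtype="paths", filt="ucov"))  # a filtered paths product
```

Calling `check_load_options(mtype="stats", filt="cov")` raises, because filters only apply to paths products.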

In [3]:
manifest = 'img_usgs'

statstable = loader.load(manifest)

#### The cell below shows information (including the partial urls in the volume column) for PDS datasets with no coverage.

**Note**: You can change the `== 0` to `<= 0.3` to get anything with 30% coverage or less (for example). You'll see that for this particular example there is no change in the number of rows. This means any PDS datasets we have coverage for are covered at a higher rate than 30%, which is good: we should have very high coverage in any volume we've already added definitions for!

In [4]:
no_coverage_df = statstable.loc[statstable['coverage'] == 0]
no_coverage_df

Unnamed: 0,dataset_pds,n_inc,n_cov,n_ucov,coverage,n,volume,ptype,dataset_ix,manifest
0,lro-l-lamp-2-edr,111095,0,111095,0.0,182649,[Missions/Lunar_Reconnaissance_Orbiter/LAMP/LR...,,,img_usgs_lunar_reconnaissance_orbiter
1,lro-l-lamp-3-rdr,106356,0,106356,0.0,174485,[Missions/Lunar_Reconnaissance_Orbiter/LAMP/LR...,,,img_usgs_lunar_reconnaissance_orbiter
5,vl1/vl2-m-lcs-2-edr,6585,0,6585,0.0,19924,"[Missions/Viking_Lander/vl_0001, Missions/Viki...",,,img_usgs_viking_lander
6,lo-l-lo3/4/5-4-cdr,27312,0,27312,0.0,29123,"[Missions/Lunar_Orbiter/LO3_0001, Missions/Lun...",,,img_usgs_lunar_orbiter
7,lo3-l-80mm_flc/610mm_flc-4-cdr,2404,0,2404,0.0,2600,[Missions/Lunar_Orbiter/LO3_0001],,,img_usgs_lunar_orbiter
...,...,...,...,...,...,...,...,...,...,...
93,mgn-v-rdrs-5-midr-c1;mgn-v-rdrs-5-bidr-full-re...,63401,0,63401,0.0,64462,"[Missions/Magellan/mg_0002, Missions/Magellan/...",,,img_usgs_magellan
94,mgn-v-rdrs-5-midr-n-polar-stereogr,3606,0,3606,0.0,3666,"[Missions/Magellan/mg_0019, Missions/Magellan/...",,,img_usgs_magellan
95,mgn-v-rdrs-5-midr-polar-stereogr,897,0,897,0.0,917,[Missions/Magellan/mg_0127],,,img_usgs_magellan
97,mgn-v-rdrs-5-gdr-slope,776,0,776,0.0,801,[Missions/Magellan/mg_3001],,,img_usgs_magellan
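
To see what changing the threshold does without loading a real manifest, here is a small sketch on a toy table. The dataset names and numbers are made up, and computing `coverage` as `n_cov / n_inc` is an assumption made for the toy data, not a statement about how the pipeline derives that column.

```python
import pandas as pd

# Toy stats table; all names and counts are invented for illustration.
toy_stats = pd.DataFrame(
    {
        "dataset_pds": ["set-a", "set-b", "set-c", "set-d"],
        "n_inc": [100, 200, 50, 80],
        "n_cov": [0, 40, 45, 0],
    }
)
toy_stats["n_ucov"] = toy_stats["n_inc"] - toy_stats["n_cov"]
# Assumption for this toy example only: coverage is covered / included.
toy_stats["coverage"] = toy_stats["n_cov"] / toy_stats["n_inc"]

# Datasets with no coverage at all.
no_cov = toy_stats.loc[toy_stats["coverage"] == 0]
# Datasets with 30% coverage or less (includes the zero-coverage rows).
low_cov = toy_stats.loc[toy_stats["coverage"] <= 0.3]

print(no_cov["dataset_pds"].tolist())
print(low_cov["dataset_pds"].tolist())
```

If both filters return the same rows on your real table, every dataset with any coverage is above the threshold you chose.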


The coverage pipeline automatically attempts to identify whether a volume is PDS4. Datasets are not assigned to files it flags as PDS4, so those files won't appear in `dataset_pds` (the default) pivot tables. They will only appear in stats/dataset_ix files if we have already added coverage for them, because they can then be assigned `dataset_ix` values (if we made selection rules for them).

#### Let's sort this by the count of data files in the dataset so we can target higher data volumes first.

In [7]:
sorted_df = no_coverage_df.sort_values(by=['n_inc', 'n_ucov'], ascending=False)
sorted_df

Unnamed: 0,dataset_pds,n_inc,n_cov,n_ucov,coverage,n,volume,ptype,dataset_ix,manifest
26,ody-m-thm-3-visrdr;ody-m-thm-2-visedr;ody-m-th...,2549613,0,2549613,0.0,3716092,[Missions/Mars_Odyssey/THEMIS/USA_NASA_PDS_ODT...,,,img_usgs_mars_odyssey
58,clem1-/l/e/y-a/b/u/h/l/n-2-edr;clem1-l/e/y-a/b...,1900846,0,1900846,0.0,1905427,"[Missions/Clementine/cl_0001, Missions/Clement...",,,img_usgs_clementine
27,ody-m-thm-5-irpbt;ody-m-thm-5-irgeo;ody-m-thm-...,1498354,0,1498354,0.0,2261987,[Missions/Mars_Odyssey/THEMIS/USA_NASA_PDS_ODT...,,,img_usgs_mars_odyssey
80,co-e/v/j/s-vims-2-qube,1300899,0,1300899,0.0,2259617,"[Missions/Cassini/VIMS/covims_0001, Missions/C...",,,img_usgs_cassini
42,mess-e/v/h-mdis-4-cdr-caldata,694232,0,694232,0.0,694509,[Missions/MESSENGER/MSGRMDS_2001],,,img_usgs_messenger
...,...,...,...,...,...,...,...,...,...,...
45,mess-h-mdis-5-rdr-mdr,127,0,127,0.0,326,[Missions/MESSENGER/MSGRMDS_5001],,,img_usgs_messenger
46,mess-h-mdis-5-rdr-md3,102,0,102,0.0,261,[Missions/MESSENGER/MSGRMDS_6001],,,img_usgs_messenger
85,go-e-nims-4-tube;go-e-nims-4-mosaic;go-e-nims-...,79,0,79,0.0,354,"[Missions/Galileo/NIMS/go_1102, Missions/Galil...",,,img_usgs_galileo
84,go-v-nims-4-mosaic;go-v-nims-4-tube;go-v-nims-...,42,0,42,0.0,176,[Missions/Galileo/NIMS/go_1101],,,img_usgs_galileo


If you want to be able to explore different rows of this table (so you can look at more than the top and bottom 5 rows), you can use the following syntax:

In [8]:
sorted_df[0:10] # Shows the first ten rows

Unnamed: 0,dataset_pds,n_inc,n_cov,n_ucov,coverage,n,volume,ptype,dataset_ix,manifest
26,ody-m-thm-3-visrdr;ody-m-thm-2-visedr;ody-m-th...,2549613,0,2549613,0.0,3716092,[Missions/Mars_Odyssey/THEMIS/USA_NASA_PDS_ODT...,,,img_usgs_mars_odyssey
58,clem1-/l/e/y-a/b/u/h/l/n-2-edr;clem1-l/e/y-a/b...,1900846,0,1900846,0.0,1905427,"[Missions/Clementine/cl_0001, Missions/Clement...",,,img_usgs_clementine
27,ody-m-thm-5-irpbt;ody-m-thm-5-irgeo;ody-m-thm-...,1498354,0,1498354,0.0,2261987,[Missions/Mars_Odyssey/THEMIS/USA_NASA_PDS_ODT...,,,img_usgs_mars_odyssey
80,co-e/v/j/s-vims-2-qube,1300899,0,1300899,0.0,2259617,"[Missions/Cassini/VIMS/covims_0001, Missions/C...",,,img_usgs_cassini
42,mess-e/v/h-mdis-4-cdr-caldata,694232,0,694232,0.0,694509,[Missions/MESSENGER/MSGRMDS_2001],,,img_usgs_messenger
43,mess-e/v/h-mdis-6-ddr-geomdata,320560,0,320560,0.0,320740,[Missions/MESSENGER/MSGRMDS_3001],,,img_usgs_messenger
41,mess-e/v/h-mdis-2-edr-rawdata,291008,0,291008,0.0,291331,[Missions/MESSENGER/MSGRMDS_1001],,,img_usgs_messenger
18,lcross-e/l-mir1-2-raw;lcross-e/l-mir1-3-cal;lc...,285860,0,285860,0.0,286088,[Missions/LCROSS/lcro_0001],,,img_usgs_lcross
37,dawn-a-fc2-2-edr-ceres-images,251074,0,251074,0.0,423170,"[Missions/Dawn/Ceres/DWNCAFC2_1A, Missions/Daw...",,,img_usgs_dawn
34,mgs-m-moc-na/wa-2-sdp-l0,241478,0,241478,0.0,258241,"[Missions/Mars_Global_Surveyor/MOC/mgsc_1001, ...",,,img_usgs_mars_global_surveyor
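
A few other standard pandas ways to look at more of a table are sketched below on a toy frame (the names and counts are made up). `head`, `iloc`, and `pd.option_context` are ordinary pandas features; nothing here is specific to pdr-tests.

```python
import pandas as pd

# Toy table standing in for sorted_df; values are invented.
toy = pd.DataFrame(
    {
        "dataset_pds": [f"set-{i}" for i in range(25)],
        "n_inc": list(range(25, 0, -1)),
    }
)

first_ten = toy.head(10)   # same rows as toy[0:10]
middle = toy.iloc[10:15]   # rows 10 through 14, by position

# Temporarily lift the row display limit to print the whole table.
with pd.option_context("display.max_rows", None):
    print(toy)
```

The `option_context` form is handy when you want to see everything once without changing the display settings for the rest of the notebook.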


Partial URLs for volumes are expressed as lists in `dataset_pds` stats products, which can be hard to read. To get a clearer look, the cell below flattens them into a single array of unique values.

You can then get the full URL by going to the pdrdev spider that corresponds to this coverage (you can find those here: https://github.com/MillionConcepts/pdrdev/tree/main/pdrtestsuite/spiders) and combining the value in the volume column with the URL in `start_url` for the spider. That takes you right to the volume that will make the most impact on coverage for this manifest!

In [16]:
vols_for_uncov_datasets = no_coverage_df['volume'].explode().unique()
vols_for_uncov_datasets  # you can also slice this using the notation above, `vols_for_uncov_datasets[:10]`, to see just the first ten entries

array(['Missions/Lunar_Reconnaissance_Orbiter/LAMP/LROLAM_0001',
       'Missions/Lunar_Reconnaissance_Orbiter/LAMP/LROLAM_0002',
       'Missions/Lunar_Reconnaissance_Orbiter/LAMP/LROLAM_0003', ...,
       'Missions/Magellan/mg_0127', 'Missions/Magellan/mg_3001',
       'Missions/Magellan/mg_3002'], dtype=object)
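
Combining a spider's `start_url` with a partial volume path can be sketched as below. The `START_URL` value is a placeholder, not a real spider URL, and `volume_url` is a hypothetical helper; the sketch assumes the partial path is relative to `start_url`, as the text above describes.

```python
def volume_url(start_url, partial):
    """Join a base URL and a partial volume path with exactly one slash."""
    return start_url.rstrip("/") + "/" + partial.lstrip("/")


# Placeholder only; look up the real start_url in the matching pdrdev spider.
START_URL = "https://example.org/archive/"

volumes = ["Missions/Magellan/mg_0127", "Missions/Magellan/mg_3001"]
urls = [volume_url(START_URL, vol) for vol in volumes]

for url in urls:
    print(url)
```

Stripping the slashes on both sides before joining avoids doubled or missing separators regardless of how the base URL is written.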

Of course, some PDS volumes are really, really large and contain many different datasets and product types. 
We can drill down using the paths tables. 

Here we load paths products for uncovered datasets from the same manifest that we were working with above.
This is probably going to make a big table!

In [21]:
ucov_paths = loader.load(manifest, mtype="paths", filt="ucov")
ucov_paths

Unnamed: 0,field,name,path,extension,inc,n,manifest
0,dataset_pds,lo-l-lo3/4/5-4-cdr,Missions/Lunar_Orbiter/LO3_0001/constructed_fr...,img,3,3,img_usgs_lunar_orbiter
1,dataset_pds,lo-l-lo3/4/5-4-cdr,Missions/Lunar_Orbiter/LO3_0001/constructed_fr...,img,3,3,img_usgs_lunar_orbiter
2,dataset_pds,lo-l-lo3/4/5-4-cdr,Missions/Lunar_Orbiter/LO3_0001/constructed_fr...,img,3,3,img_usgs_lunar_orbiter
3,dataset_pds,lo-l-lo3/4/5-4-cdr,Missions/Lunar_Orbiter/LO3_0001/constructed_fr...,img,3,3,img_usgs_lunar_orbiter
4,dataset_pds,lo-l-lo3/4/5-4-cdr,Missions/Lunar_Orbiter/LO3_0001/constructed_fr...,img,4,4,img_usgs_lunar_orbiter
...,...,...,...,...,...,...,...
129757,dataset_pds,mess-h-mdis-5-rdr-rtm,Missions/MESSENGER/MSGRMDS_8001/RTM/MDIS_RTM_W...,LBL,1,1,img_usgs_messenger
129758,dataset_pds,mess-h-mdis-5-rdr-rtm,Missions/MESSENGER/MSGRMDS_8001/RTM/MDIS_RTM_W...,IMG,1,1,img_usgs_messenger
129759,dataset_pds,mess-h-mdis-5-rdr-rtm,Missions/MESSENGER/MSGRMDS_8001/RTM/MDIS_RTM_W...,LBL,1,1,img_usgs_messenger
129760,dataset_pds,mess-h-mdis-5-rdr-rtm,Missions/MESSENGER/MSGRMDS_8001/RTM/MDIS_RTM_W...,IMG,1,1,img_usgs_messenger


Then, if you'd like to drill down to a specific dataset, you can. Here we'll use `.sample()` to pick a random set, but you can set `uncovered_set` to the name of whichever one you're interested in from the `dataset_pds` column in the `no_coverage_df` table.

The `name` field of the paths table always contains the pivot value (`dataset_pds` is the pivot in this case). So we can filter to the rows where the `name` field is the same as the dataset name we are interested in.

In [22]:
uncovered_set = no_coverage_df['dataset_pds'].sample().iloc[0]
print(uncovered_set)

set_paths = ucov_paths.loc[ucov_paths['name'] == uncovered_set]
set_paths

mgn-v-rdrs-5-gdr-topographic;mgn-v-rdrs-5-gdr-emissivity


Unnamed: 0,field,name,path,extension,inc,n,manifest
74619,dataset_pds,mgn-v-rdrs-5-gdr-topographic;mgn-v-rdrs-5-gdr-...,Missions/Magellan/mg_3001/gedr/merc,img,33,33,img_usgs_magellan
74620,dataset_pds,mgn-v-rdrs-5-gdr-topographic;mgn-v-rdrs-5-gdr-...,Missions/Magellan/mg_3001/gedr/merc,lbl,36,36,img_usgs_magellan
74621,dataset_pds,mgn-v-rdrs-5-gdr-topographic;mgn-v-rdrs-5-gdr-...,Missions/Magellan/mg_3001/gedr/merc,tab,3,3,img_usgs_magellan
74622,dataset_pds,mgn-v-rdrs-5-gdr-topographic;mgn-v-rdrs-5-gdr-...,Missions/Magellan/mg_3001/gedr/north,img,5,5,img_usgs_magellan
74623,dataset_pds,mgn-v-rdrs-5-gdr-topographic;mgn-v-rdrs-5-gdr-...,Missions/Magellan/mg_3001/gedr/north,lbl,8,8,img_usgs_magellan
...,...,...,...,...,...,...,...
74716,dataset_pds,mgn-v-rdrs-5-gdr-topographic;mgn-v-rdrs-5-gdr-...,Missions/Magellan/mg_3002/gtdr/sinus,lbl,36,36,img_usgs_magellan
74717,dataset_pds,mgn-v-rdrs-5-gdr-topographic;mgn-v-rdrs-5-gdr-...,Missions/Magellan/mg_3002/gtdr/sinus,tab,3,3,img_usgs_magellan
74718,dataset_pds,mgn-v-rdrs-5-gdr-topographic;mgn-v-rdrs-5-gdr-...,Missions/Magellan/mg_3002/gtdr/south,img,5,5,img_usgs_magellan
74719,dataset_pds,mgn-v-rdrs-5-gdr-topographic;mgn-v-rdrs-5-gdr-...,Missions/Magellan/mg_3002/gtdr/south,lbl,8,8,img_usgs_magellan


You can use this to find paths to explore, to help see what we should exclude that we're not excluding (or include that we're not including), etc.

For example, here is a summary of the file extensions:

In [24]:
set_paths['extension'].value_counts()

extension
img    34
lbl    34
tab    34
Name: count, dtype: int64
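
Two things are worth keeping in mind with this kind of summary. Extensions in paths tables can differ in case (the larger table above has both `img` and `IMG`), and `value_counts` counts table rows rather than files. The sketch below, on made-up data, normalizes the case and sums the `n` column instead; it assumes `n` is the number of files each row represents, which is an interpretation of the column rather than something this notebook states.

```python
import pandas as pd

# Toy paths table; values are invented for illustration.
toy_paths = pd.DataFrame(
    {
        "path": ["vol1/a", "vol1/a", "vol2/b", "vol2/b"],
        "extension": ["img", "lbl", "IMG", "LBL"],
        "n": [33, 36, 1, 1],
    }
)

# Lowercase the extensions so 'img' and 'IMG' land in the same bucket.
normalized = toy_paths.assign(extension=toy_paths["extension"].str.lower())

# Number of table rows per extension.
row_counts = normalized["extension"].value_counts()
# Total of the n column per extension (assumed to be file counts).
file_totals = normalized.groupby("extension")["n"].sum()

print(row_counts)
print(file_totals)
```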

`pdr-tests` also includes convenience functions to help you slice up the paths and get a look at the directory structure. `pathtable_to_treeframe` takes a path table (like the one we made above) and outputs a pandas DataFrame of paths split by depth level, along with file names and extensions. It is especially useful to pass that DataFrame to `tf_pathcounts`, which outputs a pandas Series containing the file counts in each of the paths. See below:

In [29]:
tf = pathtable_to_treeframe(set_paths)
tf_pathcounts(tf)

0         1         2        3    
Missions  Magellan  mg_3001  gtdr     15
                    mg_3002  gtdr     15
                    mg_3001  gredr    12
                             gedr     12
                             gsdr     12
                    mg_3002  gedr     12
                             gredr    12
                             gsdr     12
Name: count, dtype: int64
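
To get a feel for what splitting paths by depth level means, here is a rough pandas-only sketch. It is not the implementation of `pathtable_to_treeframe` or `tf_pathcounts`; it only mimics the general idea on a handful of made-up rows, and it sidesteps details such as paths of uneven depth.

```python
import pandas as pd

# Toy rows; the paths are shaped like the ones above but the rows are invented.
toy_paths = pd.DataFrame(
    {
        "path": [
            "Missions/Magellan/mg_3001/gedr",
            "Missions/Magellan/mg_3001/gedr",
            "Missions/Magellan/mg_3001/gtdr",
            "Missions/Magellan/mg_3002/gtdr",
        ],
        "extension": ["img", "lbl", "img", "img"],
    }
)

# One column per directory level: 0, 1, 2, 3.
levels = toy_paths["path"].str.split("/", expand=True)

# Count how many rows fall under each full directory path.
level_counts = levels.value_counts()
print(level_counts)
```

The integer column labels are why the real output above has a header row of `0 1 2 3`.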

You can also combine these two to see the number of files of each extension at a given path level, like so:

In [31]:
tf[[2, 'extension']].value_counts()

2        extension
mg_3001  img          17
         lbl          17
         tab          17
mg_3002  img          17
         lbl          17
         tab          17
Name: count, dtype: int64
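
If the long, stacked form is hard to scan, the same counts can be turned into a grid with one row per volume and one column per extension. The sketch below uses a made-up stand-in for `tf`; `unstack` is standard pandas.

```python
import pandas as pd

# Toy stand-in for a treeframe; column 2 holds the volume-level directory.
toy_tf = pd.DataFrame(
    {
        2: ["mg_3001", "mg_3001", "mg_3001", "mg_3002", "mg_3002"],
        "extension": ["img", "lbl", "tab", "img", "lbl"],
    }
)

counts = toy_tf[[2, "extension"]].value_counts()

# Move the extension level into columns; missing combinations become 0.
table = counts.unstack(fill_value=0)
print(table)
```

A zero in the grid makes it easy to spot a volume that lacks a file type the others have.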