# SPASE Record Analysis - How to Use (Advanced)
Author: Zach Boquet

## Introduction 
For documentation on how to add to this project, view the related notebook named "HowToAdd.ipynb" <br>
<br>
This project provides a method to analyze the FAIRness (findability, accessibility, interoperability, and reusability) of the SPASE records in the NumericalData and DisplayData categories.<br>

This notebook shows you how to convert desired SPASE record fields into a SQLite database. The desired fields correspond to metadata extracted from XML files using ElementTree. Also shown is how to query data from that database. <br> 
- The tutorial I used to implement ElementTree is <a href="https://realpython.com/python-xml-parser/" target="_blank">https://realpython.com/python-xml-parser/</a>. 
- If more context is needed for the SQLite code than is provided by the comments, I recommend visiting <a href="https://www.sqlitetutorial.net/" target="_blank">https://www.sqlitetutorial.net/</a>.<br>
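
As a rough illustration of the idea (not the project's actual code), the sketch below parses a small, made-up SPASE-like XML string with ElementTree and stores two fields in an in-memory SQLite table. The XML snippet, the `extract_fields` helper, and the `Demo` table are assumptions for illustration only; real SPASE records contain many more elements, and the namespace URI is assumed to match the one declared in the records you scrape.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed record; real SPASE records are much larger.
SAMPLE_XML = """<Spase xmlns="http://www.spase-group.org/data/schema">
  <NumericalData>
    <ResourceID>spase://Example/NumericalData/Demo/PT1M</ResourceID>
    <ResourceHeader>
      <ResourceName>Demo Dataset</ResourceName>
    </ResourceHeader>
  </NumericalData>
</Spase>"""

# Namespace prefix map used in the ElementTree path expressions (assumed URI).
NS = {"sp": "http://www.spase-group.org/data/schema"}


def extract_fields(xml_text):
    """Return (ResourceID, ResourceName) from a SPASE-like XML string."""
    root = ET.fromstring(xml_text)
    resource_id = root.findtext(".//sp:ResourceID", default="", namespaces=NS)
    name = root.findtext(".//sp:ResourceName", default="", namespaces=NS)
    return resource_id, name


# Store the extracted fields in a throwaway in-memory database.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Demo (SPASE_id TEXT, datasetName TEXT)")
demo_conn.execute("INSERT INTO Demo VALUES (?, ?)", extract_fields(SAMPLE_XML))
stored = demo_conn.execute("SELECT SPASE_id, datasetName FROM Demo").fetchall()
print(stored)
```

Missing elements come back as empty strings here, which mirrors the empty-string checks used by the queries later in this notebook.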

*Note that this code was extensively tested on the NASA SPASE GitHub repo. However, results are not guaranteed when running on other SPASE GitHub repos. This code was tested in Summer 2024 on SPASE version 2.6.1.*

This program takes ~3.88 minutes (233 seconds) to fully run from scratch on over 3000 records.

This program takes ~1.28 minutes (77 seconds) to update over 3000 records if using the built-in database.

You can also run the program using older database files found in the repo history. Simply change the argument passed to the create_sqlite_database function in the first cell to the name of the .db file you wish to use.

In [None]:
# clone NASA SPASE Github Repo into the directory above this tutorial.
! git clone -b master --single-branch --depth=1 https://github.com/hpde/NASA ../../NASA

In [None]:
# show your current directory
! pwd

## Scraping the SPASE records and populating our tables

This code block performs the following: <br>

- takes the absolute path of the SPASE directory you wish to scrape as an argument<br>
- finds all desired metadata <br>
- creates all needed tables <br>
- adds an entry for each record found into the MetadataEntries table <br>
- stores the location where each metadata field was found in the MetadataSources table <br>
- adds entries into the Records table with general info and info needed for database maintenance<br>
- populates the TestResults table with default values <br>
- updates the columns associated with a given analysis test (e.g., the has_author column for records that have authors) to have a 'True' value of 1.<br>
<br>
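
The last two steps can be pictured with a small sketch. It assumes simplified two-column versions of the MetadataEntries and TestResults tables (the real schemas have more columns) and treats an empty string as a missing field, the same convention the queries later in this notebook rely on. It is not the code inside Create.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE MetadataEntries (SPASE_id TEXT, author TEXT);
    CREATE TABLE TestResults (SPASE_id TEXT, has_author INTEGER DEFAULT 0);
    INSERT INTO MetadataEntries VALUES ('spase://Example/A', 'Doe, Jane');
    INSERT INTO MetadataEntries VALUES ('spase://Example/B', '');
""")

# Step 1: one TestResults row per record, with default values (0 = not passed).
demo_conn.execute(
    "INSERT INTO TestResults (SPASE_id) SELECT DISTINCT SPASE_id FROM MetadataEntries"
)

# Step 2: flip has_author to 1 for records whose author field is non-empty.
demo_conn.execute("""
    UPDATE TestResults SET has_author = 1
    WHERE SPASE_id IN (SELECT SPASE_id FROM MetadataEntries WHERE author != '')
""")

flags = dict(demo_conn.execute("SELECT SPASE_id, has_author FROM TestResults"))
print(flags)
```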

If you want a full printout of everything being done, pass True as the printFlag argument to Create.<br>

Examples are also found as comments at the bottom of the code block to test smaller, yet complex directories.<br>
Note: This code was designed to work for the NumericalData and DisplayData directories, so inputting directories besides those two may cause logical errors and produce unintended results.

### Starting from an Existing Database
A pre-built database is included in this repository. You can use this database as a starting point if you would like to see how the project updates the tables.

If you wish to start completely from scratch, skip this section and refer to the next section.

*Note that running the project from the built-in database is much faster than creating your own from scratch*

In [None]:
from Scripts import create_sqlite_database

help(create_sqlite_database)

In [None]:
# input abs path of database file you wish to load from, located one directory above this notebook
#conn = create_sqlite_database("../SPASE_Data_20240716.db")

### Creating or Updating a Database
If updating the built-in database, skip the first cell which overrides the conn variable and run the other cells. Otherwise, if you wish to start completely from scratch, run all cells.

#### Example directories
 
Overall paths (>3000 records): "../../NASA/NumericalData" and "../../NASA/DisplayData"  
Smaller subdirectory = "../../NASA/NumericalData/DE2"   
Bigger subdirectory = "../../NASA/NumericalData/ACE"  
Complex author examples: "../../NASA/NumericalData/Cassini/MAG/PT60S.xml" and "../../NASA/NumericalData/ACE/Attitude/Definitive/PT1H.xml"  
Complex URL example: "../../NASA/NumericalData/ACE/CRIS/L2/P1D.xml"  
#### Code

In [None]:
#from Scripts import create_sqlite_database

# if you wish to start a fresh db file from scratch
conn = create_sqlite_database("../SPASE_Data_new.db")

In [None]:
# import main Python function
from Scripts import Create

help(Create)

In [None]:
# This block updates current records from the indicated directories in the MetadataEntries table
# This step can take a while if you start with a new database file.
# Starting with an existing database file significantly speeds this up.
Create('../../NASA/NumericalData', conn)
Create('../../NASA/DisplayData', conn)

These few lines simply connect to a new database file and update the database using the SPASE records in the desired directory.

## Executing Analysis Tests and Viewing the Results 
In this code block, we call the View function in main to get the counts and IDs of the SPASE records that pass each analysis test. 
The analysis tests check for records that have: <br> 

- authors <br>
- publishers <br>
- publication years <br>
- dataset names <br>
- licenses <br>
- URLs <br>
- NASA URLs <br>
- persistent identifiers <br>
- descriptions <br>
- citation info <br>
- DCAT-3 compliance info.<br>
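
Conceptually, each test reduces to counting the records whose flag column is 1. The sketch below is a stand-in, not the View function itself: it builds a toy TestResults table with two of the flag columns used later in this notebook (has_author, has_pub) and counts the passing records. The `count_passing` helper and the sample rows are hypothetical.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE TestResults (SPASE_id TEXT, has_author INTEGER, has_pub INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO TestResults VALUES (?, ?, ?)",
    [
        ("spase://Example/A", 1, 1),
        ("spase://Example/B", 1, 0),
        ("spase://Example/C", 0, 0),
    ],
)

# Column names cannot be bound as SQL parameters, so check them against a whitelist.
ALLOWED = {"has_author", "has_pub"}


def count_passing(connection, column):
    """Count distinct records whose flag column equals 1."""
    if column not in ALLOWED:
        raise ValueError(f"unknown test column: {column}")
    row = connection.execute(
        f"SELECT COUNT(DISTINCT SPASE_id) FROM TestResults WHERE {column} = 1"
    ).fetchone()
    return row[0]


counts = {col: count_passing(demo_conn, col) for col in sorted(ALLOWED)}
print(counts)
```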

### Executing the Analysis Tests

In [None]:
# this function returns all records that pass associated tests 
# and prints the counts of those that pass the test specified in the argument
from Scripts import View

help(View)

In [None]:
# example that returns values for one test
records = View(conn, desired = ['Citation'])

In [None]:
# example that returns values for 4 tests, one of which doesn't match
records = View(conn, desired = ['Author', 'Publisher', 'NASA URL', 'Compliance'])

In [None]:
# example that returns values for all tests
records = View(conn)

### Plotting the Analysis Results

In [None]:
# This function creates a bar chart for the metadata fields checked.
from Scripts import MetadataBarChart

help(MetadataBarChart)

In [None]:
# Plotting the bar chart for all records 
fig = MetadataBarChart(conn)

In [None]:
# Plotting the percent version of the same bar chart
fig = MetadataBarChart(conn, percent = True)

In [None]:
# Plotting the bar chart for only records with a NASA URL
fig = MetadataBarChart(conn, All = False)

In [None]:
# Plotting the percent version of the same bar chart
fig = MetadataBarChart(conn, percent = True, All = False)

## Calculating and plotting the FAIR Score Distributions
This code overwrites the default values placed in the TestResults table to have the actual FAIR Scores that are calculated according to the following algorithm:<br>

- +1 for author
- +1 for dataset name
- +1 for publication year
- +1 for publisher
- +1 for all citation info
- +1 for description
- +1 for PID
- +1 for DCAT3-US compliance
- +1 for license
- +1 for NASA URL <br>
======================= <br>
- Total Possible Points of 10

*Note that this algorithm is expected to change.*
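
A minimal sketch of this additive scoring, assuming each test result is available as a 0/1 flag. The `fair_score` helper and the dictionary keys are illustrative names, not the FAIRScorer implementation.

```python
# Illustrative keys, one per bullet in the list above.
CRITERIA = (
    "author", "dataset_name", "publication_year", "publisher", "citation",
    "description", "pid", "compliance", "license", "nasa_url",
)


def fair_score(flags):
    """Add one point for every criterion whose flag is truthy (maximum 10)."""
    return sum(1 for name in CRITERIA if flags.get(name))


# A record with full citation info and a description, but nothing else.
example = {
    "author": 1, "dataset_name": 1, "publication_year": 1, "publisher": 1,
    "citation": 1, "description": 1,
}
print(fair_score(example))  # 6
```

Keeping the criteria in one tuple means that a change to the scoring rules only touches that tuple.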

If interested in viewing the FAIR Score for a particular record, refer to the column-specific queries section further below.  

### Calculating the FAIR Scores
These code blocks run much faster when updating an already populated database than when scoring a freshly created one.

In [None]:
from Scripts import FAIRScorer

help(FAIRScorer)

In [None]:
# calculate FAIR scores for all records
FAIRScorer(conn)

### Plotting the FAIR Score Distributions

In [None]:
from Scripts import FAIR_Chart

help(FAIR_Chart)

In [None]:
# for all records
fig = FAIR_Chart(conn)

In [None]:
# only for records with NASA URLs
fig = FAIR_Chart(conn, All = False)

## How to do your own queries 
This section shows how to query the database with record-specific (row) and column-specific queries. Brief explanations of some of the SQLite syntax are included, along with a complex query example for each category. <br>

If more context is needed for the SQLite code than is provided by the comments, I recommend visiting <a href="https://www.sqlitetutorial.net/" target="_blank">https://www.sqlitetutorial.net/</a>.<br>

*Disclaimer: Not all authors are provided, as checks were only done to find if an allowed author exists.* 
<br>
**Also note that when a SPASE record has multiple product keys for one URL, or multiple URLs in general, each URL/product key gets its own entry in the table. This is why there may be 'duplicate entries' in the database.**
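
To see why this matters when counting, the sketch below uses a toy MetadataEntries table in which one record has two URLs. The two-column schema and the sample rows are simplified assumptions. A plain COUNT counts rows, while COUNT(DISTINCT SPASE_id) counts records.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE MetadataEntries (SPASE_id TEXT, URL TEXT)")
demo_conn.executemany(
    "INSERT INTO MetadataEntries VALUES (?, ?)",
    [
        ("spase://Example/A", "https://example.org/a1"),
        ("spase://Example/A", "https://example.org/a2"),  # same record, second URL
        ("spase://Example/B", "https://example.org/b"),
    ],
)

row_count, record_count = demo_conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT SPASE_id) FROM MetadataEntries"
).fetchone()
print(row_count, record_count)
```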

### Record Specific Queries

This section gives a more complex example of how to get data with queries based on records/rows.

In [None]:
from Scripts import execution

help(execution)

> Complex example: Selecting multiple items from multiple tables by using the ResourceID

> - Notice we use commas for multiple items and INNER JOINS when it is over multiple tables.

In [None]:
ID = "spase://NASA/NumericalData/Interball-2/IMAP3/PT120S"
rows = execution(f""" SELECT author, MetadataSources.author_source, Records.SPASE_URL 
            FROM MetadataEntries
                INNER JOIN MetadataSources USING (SPASE_id)
                INNER JOIN Records USING (SPASE_id)
            WHERE SPASE_id = '{ID}';""", conn, "multiple")
rows[0]

### Column Specific Queries
This section describes how to get data with queries based on the column values.

> Ex: How many records have at least 3 of the 4 fields needed for citation?
> - Use AND and OR operators just like in programming languages.

In [None]:
stmt = """SELECT COUNT(DISTINCT SPASE_id) FROM TestResults 
                WHERE (has_author = 1 
                AND has_datasetName = 1
                AND has_pubYr = 1)
                OR (has_author = 1 
                AND has_datasetName = 1
                AND has_pub = 1)
                OR (has_author = 1 
                AND has_pub = 1
                AND has_pubYr = 1)
                OR (has_datasetName = 1 
                AND has_pub = 1
                AND has_pubYr = 1)"""
items = execution(stmt, conn)
items
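
Because the flags are stored as 0 or 1, the same "at least 3 of 4" condition can also be written by adding the columns together. The sketch below checks on a toy TestResults table (made-up rows) that the long form and the summed form agree. It assumes the flag columns are never NULL, and it is an alternative phrasing rather than a replacement for the query above.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("""CREATE TABLE TestResults (SPASE_id TEXT, has_author INTEGER,
                     has_datasetName INTEGER, has_pub INTEGER, has_pubYr INTEGER)""")
demo_conn.executemany(
    "INSERT INTO TestResults VALUES (?, ?, ?, ?, ?)",
    [
        ("spase://Example/A", 1, 1, 1, 1),  # 4 of 4
        ("spase://Example/B", 1, 1, 0, 1),  # 3 of 4
        ("spase://Example/C", 1, 0, 0, 1),  # 2 of 4
    ],
)

# Every combination of three fields, spelled out with AND/OR.
long_form = """SELECT COUNT(DISTINCT SPASE_id) FROM TestResults
    WHERE (has_author = 1 AND has_datasetName = 1 AND has_pubYr = 1)
       OR (has_author = 1 AND has_datasetName = 1 AND has_pub = 1)
       OR (has_author = 1 AND has_pub = 1 AND has_pubYr = 1)
       OR (has_datasetName = 1 AND has_pub = 1 AND has_pubYr = 1)"""

# Same condition expressed as a sum of the 0/1 flags.
summed_form = """SELECT COUNT(DISTINCT SPASE_id) FROM TestResults
    WHERE has_author + has_datasetName + has_pub + has_pubYr >= 3"""

long_count = demo_conn.execute(long_form).fetchone()[0]
summed_count = demo_conn.execute(summed_form).fetchone()[0]
print(long_count, summed_count)
```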

> Complex Ex: What records have at least 2 of the desired fields?
> - Notice we use f-strings to concatenate strings instead of retyping text.

> *Note that you can find more complex SQLite queries such as AL1Stmt, AL3Stmt, and allStmt in the RecordGrabber.py script. That script also contains work on publisher-specific queries.*

In [None]:
has_citation = """author NOT LIKE ""
                    AND datasetName NOT LIKE ""
                    AND publicationYr NOT LIKE ""
                    AND publisher NOT LIKE "" """
citationStmt = f"""SELECT DISTINCT SPASE_id FROM MetadataEntries 
                WHERE {has_citation};"""
has_compliance = """ description NOT LIKE ""
                AND datasetName NOT LIKE ""
                AND PID NOT LIKE "" """
complianceStmt = f"""SELECT DISTINCT SPASE_id FROM MetadataEntries 
                WHERE {has_compliance};"""

# license test wrapped in parentheses: AND binds tighter than OR in SQL,
# so an ungrouped "... AND a OR b" would match on the license alone
has_license = """(license LIKE "%cc0%" OR license LIKE "%Creative Commons Zero v1.0 Universal%")"""

# at least 2 fields
AL2Stmt = f"""SELECT DISTINCT SPASE_id FROM MetadataEntries 
                WHERE ({has_citation}
                AND
                    {has_compliance}) 
                OR
                    ({has_citation}
                AND
                    PID NOT LIKE "")
                OR
                    ({has_citation}
                AND
                    {has_license})
                OR
                    ({has_compliance}
                AND
                    {has_license})
                OR
                    ({has_compliance}
                AND 
                    PID NOT LIKE "")
                OR  
                    (PID NOT LIKE ""
                AND
                    {has_license}) LIMIT 10;"""

items = execution(AL2Stmt, conn)
items
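
When a query mixes AND and OR, as the one above does, keep in mind that SQLite evaluates AND before OR, so parentheses can change which rows match. The sketch below uses a toy MetadataEntries table (made-up rows, and only the PID and license columns) to compare an ungrouped condition with a grouped one.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE MetadataEntries (SPASE_id TEXT, PID TEXT, license TEXT)")
demo_conn.executemany(
    "INSERT INTO MetadataEntries VALUES (?, ?, ?)",
    [
        ("spase://Example/A", "demo-pid", "cc0"),
        ("spase://Example/B", "", "Creative Commons Zero v1.0 Universal"),  # no PID
    ],
)

# Ungrouped: parsed as (PID AND license-1) OR license-2, so record B matches too.
ungrouped = demo_conn.execute("""
    SELECT SPASE_id FROM MetadataEntries
    WHERE PID != '' AND license LIKE '%cc0%'
       OR license LIKE '%Creative Commons Zero v1.0 Universal%'
    ORDER BY SPASE_id""").fetchall()

# Grouped: PID AND (license-1 OR license-2), so only record A matches.
grouped = demo_conn.execute("""
    SELECT SPASE_id FROM MetadataEntries
    WHERE PID != '' AND (license LIKE '%cc0%'
       OR license LIKE '%Creative Commons Zero v1.0 Universal%')
    ORDER BY SPASE_id""").fetchall()

print(ungrouped, grouped)
```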

## How to Back Up the Database
This code generates a backup copy of the live database into a file specified by the filename variable.

*Note that this cannot be run if there is a pending SQL statement or open transaction running.*

In [None]:
from Scripts import executionALL

help(executionALL)

In [None]:
filename = "../SPASE_Data_new_backup.db"
stmt = f"VACUUM main INTO '{filename}'"
executionALL(stmt, conn)
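
As an alternative sketch (not the method used above), Python's sqlite3 module also provides `Connection.backup`, which copies a live database into another connection. The example uses an in-memory source and a temporary file so it can run anywhere; the table contents and the file name are made up.

```python
import os
import sqlite3
import tempfile

# Small stand-in for the live database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE Records (SPASE_id TEXT)")
source.execute("INSERT INTO Records VALUES ('spase://Example/A')")
source.commit()

# Copy the whole database into a new file.
backup_path = os.path.join(tempfile.mkdtemp(), "demo_backup.db")
target = sqlite3.connect(backup_path)
source.backup(target)
target.close()

# Reopen the backup file to confirm the data arrived.
check = sqlite3.connect(backup_path)
copied = check.execute("SELECT SPASE_id FROM Records").fetchall()
check.close()
print(copied)
```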