# Data Access Check Tutorial
This notebook shows how to use the DataAccessCheck script to test all the HAPI links in the database for data accessibility. This script was not tested on non-HAPI data access links.

*Note: running the script on all 1632 HAPI links in the database takes a LONG time. Only do so if you can wait overnight, or even a full day or more.*

## Select the number and type of datasets you wish to test.

Note that you may change the database file that is loaded in the call to create_sqlite_database. This is especially important if you created your own from scratch in the HowToUse_Advanced tutorial.

In [None]:
from DataAccessCheck import DataChecker
from Scripts.SQLiteFun import create_sqlite_database, execution
from contextlib import closing, redirect_stdout
from IPython.utils.io import Tee

# input abs path of database file you wish to query from
conn = create_sqlite_database("/home/jovyan/HDRL-Internship-2024/SPASE_Data_20240716.db")
# create list to hold all dataset names acquired from db
prodKeys = [] 

The SQLite statements shown in the HowToUse notebooks work the same way here. For instance, you can tailor the datasets you wish to test however you like, such as by mission, author, publisher, publication year, etc. This is done by adding conditions to the WHERE clause.

A basic example query would be to select the first 10 HAPI datasets stored in the database, as shown below.

In [None]:
# to run on ALL 1632 datasets, remove the 'LIMIT 10' text from the query
HapiStmt = """SELECT prodKey FROM MetadataEntries WHERE url LIKE '%/hapi' LIMIT 10"""
prodKeys = execution(HapiStmt, conn)
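As a rough illustration of adding WHERE conditions, the sketch below builds a small in-memory table and filters it by publisher and publication year. The `publisher` and `publicationYr` column names, the sample rows, and the `conn_demo` connection are assumptions made for this example only; check the actual schema of your database before reusing the query. It also calls `sqlite3` directly rather than `execution`, so that the filter values can be passed as parameters.

```python
import sqlite3

# Hypothetical stand-in for the real database: the `publisher` and
# `publicationYr` columns and all rows below are assumptions for illustration.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE MetadataEntries "
    "(prodKey TEXT, url TEXT, publisher TEXT, publicationYr INTEGER)"
)
conn_demo.executemany(
    "INSERT INTO MetadataEntries VALUES (?, ?, ?, ?)",
    [
        ("DS_A", "https://example.org/hapi", "Publisher One", 2019),
        ("DS_B", "https://example.org/hapi", "Publisher Two", 2021),
        ("DS_C", "https://example.org/other", "Publisher One", 2021),
        ("DS_D", "https://example.org/hapi", "Publisher One", 2022),
    ],
)

# Same HAPI filter as in the tutorial, plus two extra WHERE conditions.
filtered_stmt = """SELECT prodKey FROM MetadataEntries
                   WHERE url LIKE '%/hapi'
                     AND publisher = ?
                     AND publicationYr >= ?
                   ORDER BY prodKey"""
filtered_keys = [
    row[0] for row in conn_demo.execute(filtered_stmt, ("Publisher One", 2020))
]
print(filtered_keys)
conn_demo.close()
```

Only rows that satisfy every condition are returned, so each added condition narrows the set of datasets passed on to the checker.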

Note that you can test a later batch of datasets by offsetting where the query starts in the database. For example, if you wish to test the datasets that are in positions 60-69 in the database (`OFFSET 59` skips the first 59 rows and `LIMIT 10` returns the next 10), you would perform the following query instead.

``` python
HapiStmt = """SELECT prodKey FROM MetadataEntries WHERE url LIKE '%/hapi' LIMIT 10 OFFSET 59"""
```
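To see which rows a given LIMIT/OFFSET pair selects, here is a minimal sketch against an in-memory table of 100 made-up entries. The table contents and the `conn_demo` name are hypothetical. The `ORDER BY rowid` clause is an addition not present in the tutorial query; without an explicit ordering, SQL does not guarantee the order in which rows come back.

```python
import sqlite3

# Hypothetical table with 100 placeholder entries named DS_001 ... DS_100.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE MetadataEntries (prodKey TEXT, url TEXT)")
conn_demo.executemany(
    "INSERT INTO MetadataEntries VALUES (?, ?)",
    [(f"DS_{i:03d}", "https://example.org/hapi") for i in range(1, 101)],
)

# Skip the first 59 matching rows, then return the next 10.
page_stmt = """SELECT prodKey FROM MetadataEntries
               WHERE url LIKE '%/hapi'
               ORDER BY rowid
               LIMIT 10 OFFSET 59"""
page = [row[0] for row in conn_demo.execute(page_stmt)]
print(page[0], page[-1], len(page))
conn_demo.close()
```

The first key returned is the 60th entry and the last is the 69th, so stepping the OFFSET by 10 each time walks through the database in non-overlapping batches.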

## Execute the script and wait for the results!

The following code blocks test the datasets selected by the query above, with options to print the output to both the console and a text file or to a text file only.

*If testing all datasets in the database, it is advised to only output to a text file.*

In [None]:
help(DataChecker)

In [None]:
# to send the output only to a file
with open("../DatalinkCheckOutputTest.txt", "w") as file:
    with redirect_stdout(file):
        lines = DataChecker(prodKeys, conn)

# to send the output to both the file and the console, use this instead
#with closing(Tee("../DatalinkCheckOutputTest.txt", "w", channel="stdout")) as outputstream:
#    lines = DataChecker(prodKeys, conn)
print("The program is done!")
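The file-only pattern can be tried out without touching the real script. In the sketch below, `fake_checker` is a hypothetical stand-in for `DataChecker` that simply prints one line per key and returns a list, and the output goes to a temporary file. It shows that everything printed inside the `redirect_stdout` block lands in the file, while the return value stays available and later prints go back to the console.

```python
import os
import tempfile
from contextlib import redirect_stdout


def fake_checker(keys):
    """Hypothetical stand-in for DataChecker: prints a line per key."""
    for key in keys:
        print(f"Checking {key}")
    # Pretend that keys ending in "_slow" timed out.
    return [key for key in keys if key.endswith("_slow")]


out_path = os.path.join(tempfile.mkdtemp(), "output.txt")
with open(out_path, "w") as file:
    with redirect_stdout(file):
        slow = fake_checker(["DS_A", "DS_B_slow"])

print("The program is done!")  # printed to the console, not the file

with open(out_path) as file:
    captured = file.read()
print(captured)
```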

## Analyzing the results

Before analyzing the results, the code creates another text file containing just those datasets that timed out at some or all time intervals and failed to retrieve data. Afterwards, the code prints out the results.

> For a more detailed explanation of the results from the script, use the following key to query the TestResults table in the database, specifically the dataAccess and Errors columns.
>
> Possible outcomes for each dataset and their corresponding message recorded in the database:
> - Data successfully accessed --> "Passed" value in dataAccess
>   - Data was successfully accessed BUT some intervals timed out --> Also gets "Passed after some intervals timed out" in Errors
> - No data was accessed --> "Failed" in dataAccess and "HAPI data check failed after _ attempts" in Errors
>   - No data was accessed but some intervals timed out --> "Failed data check but some intervals timed out" in Errors
>   - Initial data info check failed --> "HAPI info check failed" in Errors
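A query against the TestResults table might look like the sketch below. Only the table name and the `dataAccess` and `Errors` columns come from the key above; the `prodKey` column, the sample rows, the use of NULL for a clean pass, and the `conn_demo` connection are assumptions made for illustration.

```python
import sqlite3

# Hypothetical copy of TestResults; the prodKey column and rows are made up.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE TestResults (prodKey TEXT, dataAccess TEXT, Errors TEXT)"
)
conn_demo.executemany(
    "INSERT INTO TestResults VALUES (?, ?, ?)",
    [
        ("DS_A", "Passed", None),
        ("DS_B", "Passed", "Passed after some intervals timed out"),
        ("DS_C", "Failed", "HAPI info check failed"),
        ("DS_D", "Failed", "Failed data check but some intervals timed out"),
    ],
)

# Count datasets per dataAccess outcome.
counts = conn_demo.execute(
    "SELECT dataAccess, COUNT(*) FROM TestResults "
    "GROUP BY dataAccess ORDER BY dataAccess"
).fetchall()
print(counts)

# List the failed datasets together with their recorded error message.
failed = conn_demo.execute(
    "SELECT prodKey, Errors FROM TestResults "
    "WHERE dataAccess = 'Failed' ORDER BY prodKey"
).fetchall()
print(failed)
conn_demo.close()
```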

In [None]:
# export all datasets that took too long to a text file for further investigation
#   this subset can be used as the prodKeys argument for another run of the script
with open("../HAPI_TakeTooLongTest.txt", "w") as textFile:
    for line in lines:
        textFile.write(line)
        textFile.write("\n")

Datasets = []
dataFails = 0
tookTooLong = 0
HAPIErrors = 0
Unavailable = 0
with open("../DatalinkCheckOutputTest.txt") as file:
    for line in file:
        if "No data was found." in line:
            dataFails += 1
        elif "Bad request - unknown dataset id" in line:
            Unavailable += 1
        elif "Problem with https://cdaweb.gsfc.nasa.gov/hapi/info?" in line:
            HAPIErrors += 1
            before, sep, after = line.partition("Problem with https://cdaweb.gsfc.nasa.gov/hapi/info?id=")
            dataset, sep, after = after.partition(".")
            Datasets.append(dataset)
            
with open("../HAPI_TakeTooLongTest.txt") as file:
    tookTooLong = len(file.readlines())

dataSuccesses = len(prodKeys) - (tookTooLong + HAPIErrors + dataFails + Unavailable)
print("The number of links that successfully retrieved data is " + str(dataSuccesses))
print("The number of broken links is " + str(dataFails))
print("The number of links that are not actual datasets in CDAWeb is " + str(Unavailable))
print("The number of links that encountered another HAPIError is " + str(HAPIErrors))
print("The number of links that timed out is " + str(tookTooLong))
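
The comment in the export cell notes that the timed-out subset can be used as the prodKeys argument for another run. A minimal sketch of reading such a file back into a list follows, assuming one dataset entry per line. The exact format of the entries returned by DataChecker is not shown in this notebook, so the temporary file, the sample entries, and the `retry_keys` name are placeholders.

```python
import os
import tempfile

# Placeholder file standing in for ../HAPI_TakeTooLongTest.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "HAPI_TakeTooLongTest.txt")
with open(demo_path, "w") as f:
    f.write("DS_A\nDS_B\n\n")

# Read one entry per line, dropping newlines and blank lines.
with open(demo_path) as f:
    retry_keys = [line.strip() for line in f if line.strip()]
print(retry_keys)

# In the notebook, the list could then be passed to the checker again
# (assumes DataChecker accepts it in place of prodKeys):
# lines = DataChecker(retry_keys, conn)
```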