# SPASE Record Analysis - How to Add New Extracted Items or New Tests

If you have not viewed the related notebook, "How to Use", do so before going through this notebook. <br>

This notebook runs through how to add to this project, specifically:
1. how to add additional fields to extract from SPASE records
2. how to add to the SQLite database 
3. how to add new database queries to report the results in the tables <br>

Also covered is a brief description of how to test the results.

## Adding additional extraction fields

For this example, we will show how to add the ORCID ID.

Note that if your field is not ORCID ID, your code would differ in the following ways:
- It may be found in a different spot in the SPASE_Scraper_Script
  - This means the value of the variable that holds the field's location would also be different
- If your field cannot hold multiple values, its variable does not need to be a list and you can use a String instead.
- Your variable names will be different, since the field is not the ORCID_ID
 
The rest of the example code should be accurate as those steps must happen no matter what field is added.

The code introduced can be placed where it says to in the SPASE_Scraper_Script comments. Find them easily using Ctrl-F and searching for 'Code X'.

> First, because a record can list multiple authors, the variable for ORCID needs to be a list. We also need a second variable to temporarily hold each ORCID ID, since we only return the IDs of authors that fall within our priority rules. Additionally, we need a String variable to hold the location in the XML SPASE record where the ORCID_ID was found. To cover the case where no authors are provided, we give these ORCID variables default values: <br>

We will call this Code A.
```python
ORCID = []
ORCID_ID = ""
ORCID_ID_Field = ""
```

> Next, we need to know where the value would be found. ORCID ID would likely be found in the Contact section. With this in mind, we need to find in the SPASE_Scraper_Script where we iterate through that section: <br>
```python
elif child.tag.endswith("Contact"):
    C_Child = child
# iterate thru Contact to find PersonID and Role
for child in C_Child:
```

> After that, we need to add another elif statement to check the child nodes within Contact for the tag we are seeking, which in this case may be something like "ORCID". Then we just save the text tagged by ORCID into our temporary variable, ORCID_ID. We also need to give ORCID_ID_Field a value since it has been found. This is done by concatenating the predefined String, parent, with where we found the ORCID_ID, which is ResourceHeader/Contact/ORCID. This would look similar to what is needed: <br>

This is Code B.

```python
# find ORCID
elif child.tag.endswith("ORCID"):
    # store ORCID
    ORCID_ID = child.text
    ORCID_ID_Field = (parent + "/ResourceHeader/Contact/ORCID")
```

> Then, if an author is found that fits our priority rules, we assign this temporary value to the list at the same time we add the author name and author roles to theirs. This keeps the ordering the same so that the ORCID ID stays with the author it belongs to. There are 2 places the author can be collected outside of the Publication Info section, so both of these assignments would need to be added to each of these areas. <br>

This is Code C.
```python 
ORCID = [ORCID_ID]
```
And this is Code D.
```python
ORCID.append(ORCID_ID)
```

> Lastly, add the ORCID list and the ORCID_ID_Field to the values returned by the 'SPASE_Scraper' script, and edit the calls to it in the main.py file to reflect the added returns. We will continue by assuming that the variables in main.py that hold these returns are named ORCID_ID and ORCID_ID_Field.
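
To see how Codes A through D fit together, here is a minimal, self-contained sketch that runs outside the scraper. It is not the SPASE_Scraper_Script itself: the sample record has no namespaces, the `ORCID` tag is the same guess made above, and the function name `scrape_contacts`, the `PRIORITY_ROLES` set, the default value of `parent`, and the person IDs are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified record. Real SPASE records are namespaced and the
# "ORCID" tag is an assumption carried over from the example above.
SAMPLE = """
<Spase>
  <NumericalData>
    <ResourceHeader>
      <Contact>
        <PersonID>spase://SMWG/Person/Jane.Doe</PersonID>
        <Role>PrincipalInvestigator</Role>
        <ORCID>0000-0000-0000-0001</ORCID>
      </Contact>
      <Contact>
        <PersonID>spase://SMWG/Person/John.Roe</PersonID>
        <Role>MetadataContact</Role>
        <ORCID>0000-0000-0000-0002</ORCID>
      </Contact>
    </ResourceHeader>
  </NumericalData>
</Spase>
"""

PRIORITY_ROLES = {"PrincipalInvestigator"}  # illustrative priority rule


def scrape_contacts(xml_text, parent="NumericalData"):
    # Code A: defaults that survive when no author is found
    author, ORCID, ORCID_ID_Field = [], [], ""
    root = ET.fromstring(xml_text)
    for contact in root.iter("Contact"):
        person, role, ORCID_ID = "", "", ""
        for child in contact:
            if child.tag.endswith("PersonID"):
                person = child.text
            elif child.tag.endswith("Role"):
                role = child.text
            # Code B: store the ORCID and the location it came from
            elif child.tag.endswith("ORCID"):
                ORCID_ID = child.text
                ORCID_ID_Field = parent + "/ResourceHeader/Contact/ORCID"
        # Codes C and D: keep the ORCID only for authors that pass the rule,
        # appending to both lists together so the positions stay matched
        if role in PRIORITY_ROLES:
            author.append(person)
            ORCID.append(ORCID_ID)
    return author, ORCID, ORCID_ID_Field


authors, orcids, field = scrape_contacts(SAMPLE)
print(authors, orcids, field)
```

Because the author and the ORCID are appended in the same step, index `i` of one list always refers to the same person as index `i` of the other.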

## Adding new field to the database

This section will continue with our previous example of the ORCID ID to show you how to add it to the SPASE_Data.db database.

### Editing SQLite database structure

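> First, we need to add a column to each of the MetadataEntries, MetadataSources, and TestResults tables. This can be done with the ALTER TABLE command in SQLite. These statements can be executed from anywhere that can import SQLiteFun, and they only need to be run once, since SQLite reports a duplicate column error if the same column is added twice.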

```python
from SQLiteFun import executionALL

executionALL("""ALTER TABLE MetadataEntries ADD COLUMN
                    ORCID_ID TEXT""")
executionALL("""ALTER TABLE TestResults ADD COLUMN
                    has_ORCID INTEGER""")
executionALL("""ALTER TABLE MetadataSources ADD COLUMN
                    ORCID_ID_Source TEXT""")
```

> Also helpful (but not required) would be to manually add these columns to the create_tables() function in SQLiteFun.
We will call these lines Code T.

Add the ORCID_ID column into the MetadataEntries table before the "UNIQUE..." line.
```python
PID TEXT,
ORCID_ID TEXT,
UNIQUE(SPASE_id, URL, prodKey)
```
Add the ORCID_ID_Source column into the MetadataSources table at the end after the PID_source column.
```python
description_source TEXT,
PID_source TEXT,
ORCID_ID_Source TEXT
```

And add the has_ORCID column to the TestResults table, just before the Errors column.
```python
has_compliance INTEGER,
has_ORCID INTEGER,
Errors TEXT
```
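
If both the ALTER TABLE statements and the edited create_tables() end up being run against the same database, the second attempt to add a column will fail with a duplicate column error. One way to make the step safe to repeat is to check the table first. The sketch below uses Python's built-in `sqlite3` module directly rather than the project's `executionALL` helper, works on in-memory stand-in tables with most columns omitted, and the helper name `add_column_if_missing` is made up for this example.

```python
import sqlite3


def add_column_if_missing(conn, table, column, col_type):
    """Add a column only when the table does not already have it."""
    # PRAGMA table_info returns one row per column; index 1 is the column name
    existing = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {col_type}")
        return True
    return False


# In-memory stand-ins for the three tables (columns heavily trimmed)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MetadataEntries (SPASE_id TEXT, PID TEXT)")
conn.execute("CREATE TABLE MetadataSources (SPASE_id TEXT, PID_source TEXT)")
conn.execute("CREATE TABLE TestResults (SPASE_id TEXT, Errors TEXT)")

new_columns = [
    ("MetadataEntries", "ORCID_ID", "TEXT"),
    ("TestResults", "has_ORCID", "INTEGER"),
    ("MetadataSources", "ORCID_ID_Source", "TEXT"),
]
first_run = [add_column_if_missing(conn, *c) for c in new_columns]
second_run = [add_column_if_missing(conn, *c) for c in new_columns]
print(first_run, second_run)
```

The first pass adds all three columns and the second pass finds them already present, so nothing is altered and no error is raised.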

### Edit the main.py file.
The code introduced can be placed where it says to in the main.py comments. Find them easily using Ctrl-F and searching for 'Code X'.

> Optional: Add print statements to occur when troubleshooting with the printFlag argument set to True.

We will call this Code E.
This joins the ORCID_ID(s) into one comma separated string.
```python
ORCID_ID = ", ".join(ORCID_ID)
```
And this is Code F.
```python
print("The ORCID_ID(s) are " + ORCID_ID)
```

> Next, edit the SQLite UPSERT statement for MetadataEntries

We will call this Code G.
Add the column name used in the table (ORCID_ID) to the tuple found after "INSERT INTO MetadataEntries", in the same position where you added it to the table in the previous step, which should be right after the last column, PID.
```python
(SPASE_id,author,authorRole,publisher,publicationYr,datasetName,
    license,URL,prodKey,description,PID,ORCID_ID)
```
Now we add the actual values to be inserted into the table in the section after "VALUES". Again, this is added after PID.

This is Code H.
```python
VALUES ("{ResourceID}","{author}","{authorRole}","{pub}","{pubYear}",
        "{datasetName}","{license}","{url[i]}","{prodKey[i]}","description found","{PID}","{ORCID_ID}")
```
Lastly, we need to ensure the UPDATE part works correctly by adding the assignment using the excluded keyword to the 'SET' section. This code updates the ORCID_ID of the entry that is already in the table with the value recently scraped, in case it may have changed. Again this is right after the "PID = excluded.PID" line.

And this is Code I.
```python
description = excluded.description,
PID = excluded.PID,
ORCID_ID = excluded.ORCID_ID; '''
```

> Next, we need to repeat the past 3 steps to edit the UPSERT statement for the MetadataSources table. This includes adding the ```ORCID_ID_Source``` column to the tuple, the ```"{ORCID_ID_Field}"``` to the VALUES section, and the ```ORCID_ID_Source = excluded.ORCID_ID_Source``` assignment to the SET section (note that `excluded` refers to the column name, not the Python variable). These will be called Code U, V, and W.
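
To check that an edited UPSERT behaves as intended, it can help to try the same pattern on a throwaway database. The sketch below is not the statement from main.py: it uses a trimmed-down MetadataEntries table, assumes the conflict target is the `UNIQUE(SPASE_id, URL, prodKey)` constraint, and binds values with `?` placeholders instead of the f-string formatting shown above. The record ID, URL, and ORCID values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE MetadataEntries (
                    SPASE_id TEXT, URL TEXT, prodKey TEXT,
                    PID TEXT, ORCID_ID TEXT,
                    UNIQUE(SPASE_id, URL, prodKey))""")

# Same three parts as Codes G, H, and I: column tuple, VALUES, and SET
UPSERT = """INSERT INTO MetadataEntries (SPASE_id, URL, prodKey, PID, ORCID_ID)
            VALUES (?, ?, ?, ?, ?)
            ON CONFLICT(SPASE_id, URL, prodKey) DO UPDATE SET
                PID = excluded.PID,
                ORCID_ID = excluded.ORCID_ID;"""

# First scrape: no ORCID found yet, so a new row is inserted
row = ("spase://example/record", "https://example.org/data", "KEY1", "", "")
conn.execute(UPSERT, row)

# Second scrape of the same record: the existing row is updated in place
updated = row[:4] + ("0000-0000-0000-0001",)
conn.execute(UPSERT, updated)

rows = conn.execute("SELECT SPASE_id, ORCID_ID FROM MetadataEntries").fetchall()
print(rows)
```

Only one row exists after both statements, and it carries the ORCID from the second scrape. A missing comma in the SET list, by contrast, makes SQLite reject the whole statement with a syntax error.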

> Finally, edit the new entries' default values into TestResults.

Add a default value of 0 for the has_ORCID column whenever new records are added. Entries that already existed in the table were given a value of NULL in this column when we added it earlier.

We will call this Code J. This is added with the other zeros as part of the assignment statement to Test.
```python
Test = (record,0,"","",0,0,0,0,0,0,0,0,0,0,0,0,"")
```
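
The difference between NULL (rows that existed before the column was added) and 0 (rows added afterwards) can be seen in a small in-memory example. The table here is a two-column stand-in for TestResults, and the final UPDATE is an optional backfill that is not part of the project code. It is shown only in case you prefer every row to use 0 for "not found".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TestResults (SPASE_id TEXT, Errors TEXT)")
conn.execute("INSERT INTO TestResults VALUES ('spase://example/old', '')")

# Adding the column leaves the existing row's has_ORCID as NULL
conn.execute("ALTER TABLE TestResults ADD COLUMN has_ORCID INTEGER")
# A row inserted afterwards gets an explicit 0, as Code J does
conn.execute("INSERT INTO TestResults VALUES ('spase://example/new', '', 0)")
before = conn.execute(
    "SELECT has_ORCID FROM TestResults ORDER BY rowid").fetchall()

# Optional backfill so that every row uses 0 for "not found"
conn.execute("UPDATE TestResults SET has_ORCID = 0 WHERE has_ORCID IS NULL")
after = conn.execute(
    "SELECT has_ORCID FROM TestResults ORDER BY rowid").fetchall()
print(before, after)
```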

## Adding new database queries to report the results
This section will further continue with our ORCID_ID example and show you how to add a new query to have ORCID_ID updated in the TestResults table.

### Edit RecordGrabber.py
The code introduced can be placed where it says to in the RecordGrabber.py comments. Find them easily using Ctrl-F and searching for 'Code X'.

> Add a SQLite SELECT statement in RecordGrabber to return the SPASE_ids of all records that have ORCID_IDs.

We will call this Code K. This can be added with the other SQLite SELECT statements.
```python
ORCID_Stmt = """SELECT DISTINCT SPASE_id FROM MetadataEntries WHERE ORCID_ID NOT LIKE "" ;"""
```

*Note that if you also want to query based on specific publishers, you will need to add similar statements labeled SPDF_ORCID_Stmt and SDAC_ORCID_Stmt, which just concatenate SPDF_Intersect and SDAC_Intersect with the ORCID_Stmt, respectively.*

> Execute the newly added statement and add it as a return.

We will call this Code L. Since we are not specifying a publisher for this example, we will add this to the allRecords function with the other statements.
```python
ORCIDs = execution(self.ORCID_Stmt, conn)
```
Then, add it as a return in whatever position you like. Just make sure to unpack the returns in the same order when you call the function externally, so that each list of records is assigned to the matching variable. This location will be called Code M.

*Note that if you want to specify publisher, you will also need to follow these steps with the SPDF_ORCIDs and SDAC_ORCIDs labels inside their respective functions of SPDF_Records and SDAC_Records.*
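
To see which rows a statement like Code K selects, the query can be tried against a small in-memory table. This sketch uses `sqlite3` directly instead of the project's `execution` helper (whose return format is not shown here), writes the empty string with single quotes, and uses invented record IDs and URLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MetadataEntries (SPASE_id TEXT, URL TEXT, ORCID_ID TEXT)")
conn.executemany(
    "INSERT INTO MetadataEntries VALUES (?, ?, ?)",
    [
        # Record A appears twice (two URLs) and has an ORCID both times
        ("spase://example/A", "https://example.org/1", "0000-0000-0000-0001"),
        ("spase://example/A", "https://example.org/2", "0000-0000-0000-0001"),
        # Record B was scraped after the column existed but had no ORCID
        ("spase://example/B", "https://example.org/3", ""),
        # Record C predates the column, so its ORCID_ID is NULL
        ("spase://example/C", "https://example.org/4", None),
    ],
)

ORCID_Stmt = ("SELECT DISTINCT SPASE_id FROM MetadataEntries "
              "WHERE ORCID_ID NOT LIKE '';")
# Flatten the one-column result rows into a plain list of IDs
ORCIDs = [row[0] for row in conn.execute(ORCID_Stmt)]
print(ORCIDs)
```

Only record A is returned, and only once: DISTINCT collapses the duplicate, the empty string fails the NOT LIKE test, and a NULL compared with NOT LIKE yields NULL, which the WHERE clause treats as not matching.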

### Edit main.py
The code introduced can be placed where it says to in the main.py comments. Find them easily using Ctrl-F and searching for 'Code X'.

1. Edit Create()

If not done already, add a variable to hold the newly returned list of records with ORCID_IDs in main.py.

> - We will mark this as Code N. For reference purposes, we will continue by assuming you assigned the list to a variable named ORCID_Records.

Next, we need to add a new call to the TestUpdate function to update the records that have ORCID_IDs to have a 1 in the has_ORCID column.

> - We will call this Code O.

```python
TestUpdate(ORCID_Records, "has_ORCID", conn)
```
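
The body of TestUpdate is not shown in this notebook, so the following is only a guess at the kind of update such a call performs, written to illustrate the intended effect on the table. The function name `flag_records`, the assumption that TestResults is keyed by a `SPASE_id` column, and the sample IDs are all hypothetical.

```python
import sqlite3

ALLOWED_COLUMNS = {"has_ORCID"}  # column names cannot be bound with "?"


def flag_records(records, column, conn):
    """Hypothetical stand-in for TestUpdate: set `column` to 1 per record."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unexpected column: {column}")
    conn.executemany(
        f"UPDATE TestResults SET {column} = 1 WHERE SPASE_id = ?",
        [(record,) for record in records],
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TestResults (SPASE_id TEXT, has_ORCID INTEGER)")
conn.executemany(
    "INSERT INTO TestResults VALUES (?, 0)",
    [("spase://example/A",), ("spase://example/B",)],
)

ORCID_Records = ["spase://example/A"]
flag_records(ORCID_Records, "has_ORCID", conn)

# Map each record ID to its flag so the result is easy to inspect
flags = dict(conn.execute("SELECT SPASE_id, has_ORCID FROM TestResults"))
print(flags)
```

Records in the list end up with a 1, while every other record keeps the default 0 assigned by Code J.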

2. Edit View()

We again need to add ORCID_Records as a variable that receives the new return from the call to 'testObj.allRecords()'. We also need to add "ORCID" to the list of default values for the 'desired' parameter in the View() definition. Lastly, we need to add "ORCID" as a new key in the 'desiredRecords' dictionary to be returned.
> - We will mark these locations as Code P, Code Q, and Code R, respectively.

Then we just need to add a check to return the ORCID_Records if it is included in the 'desired' parameter.
> - We will call this Code S

```python
elif record == "ORCID":
    print("There are " + str(len(ORCID_Records)) + " records that have ORCID_IDs.")
    desiredRecords["ORCID"] = ORCID_Records
```
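
The pattern behind Codes Q, R, and S can be sketched with a stripped-down function. This is not the real View(): the name `view`, its extra parameters, and the sample record IDs are invented so that the example runs without a database, and only the "all" and "ORCID" keywords are handled.

```python
def view(desired=("all", "ORCID"), all_records=(), orcid_records=()):
    """Hypothetical, trimmed-down View(): return only the requested lists."""
    desiredRecords = {}
    for record in desired:
        if record == "all":
            desiredRecords["all"] = list(all_records)
        # Code S: return the ORCID records only when they were asked for
        elif record == "ORCID":
            print("There are " + str(len(orcid_records))
                  + " records that have ORCID_IDs.")
            desiredRecords["ORCID"] = list(orcid_records)
    return desiredRecords


result = view(
    desired=["ORCID"],
    all_records=["spase://example/A", "spase://example/B"],
    orcid_records=["spase://example/A"],
)
ORCID_Records = result["ORCID"]
print(ORCID_Records)
```

Since "all" was not requested, the returned dictionary contains only the "ORCID" key.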

## Test in the HowToUse notebook
Now you can test what you have added to get the results!

Running the first section should automatically do everything needed to add your new field to the tables and populate them for all records. 

After that, to verify it has worked properly, you can query the results in the "Executing Analysis Tests and Viewing the Results" section by using the keyword "ORCID" as an argument to View(). Then assign the records to a variable such as ORCID_Records by slicing the records list that was returned and run the code.

This should now give you a variable holding all records that contain ORCID_IDs and should also print out how many records that is.