# GLANSIS REFERENCE CLEANER & BULK UPLOADER
**Description:** The following scripts help you clean and bulk upload references.

**Installing Libraries:** This script depends on several packages. The cell below is the quickest way to install them. You only need to run it once, the first time you use this script; after that, the packages stay installed on your system. For that reason the install command is commented out. If this is your first time, remove the '#' from the second line ('pip install -r requirements.txt') to uncomment it, and make sure requirements.txt is in the main folder.

*Be aware: this is not the 'proper' way to set up a script. If you run multiple scripts for different projects and frequently install new packages, you should create a virtual environment. There are plenty of resources online explaining how to do so.*
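
If you do want an isolated environment, one possible approach is Python's built-in `venv` module. The sketch below is only an illustration: the helper name `create_project_env` and the folder name `glansis-env` are made up for this example and are not part of this notebook.

```python
import venv
from pathlib import Path


def create_project_env(env_dir, with_pip=True):
    """Create a virtual environment in env_dir and return its path.

    Hypothetical helper for illustration only; not part of this notebook.
    """
    env_path = Path(env_dir)
    # venv.create builds the environment folder; with_pip=True also installs pip into it
    venv.create(env_path, with_pip=with_pip)
    return env_path


# Example call (commented out so nothing is created by accident):
# create_project_env("glansis-env")
```

After creating and activating the environment, you would run `pip install -r requirements.txt` inside it instead of installing the packages system-wide.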

In [1]:
# Install libraries - keep this commented out after first use
# pip install -r requirements.txt


# PART 2: BULK UPLOADER


## 1. Import libraries

In [1]:
# DO NOT EDIT
import pandas as pd              # Managing dataframes
import os                        # Setting working directories
import tkinter                   # Creates GUI
from tkinter import filedialog   # Creates file dialog pop-up
from tkinter import messagebox   # Creates message box for errors
from bs4 import BeautifulSoup    # HTML parsing
import time                      # Adds pauses to let the webpage load
from selenium import webdriver                                     # automate web browser interaction
from selenium.webdriver.common.keys import Keys                    # automate keyboard actions
from selenium.webdriver.common.by import By                        # find elements by html id on webpage
from selenium.webdriver.support.ui import Select                   # automate dropdown selection
from selenium.webdriver.support.ui import WebDriverWait            # command driver to wait
from selenium.webdriver.support import expected_conditions as EC   # command driver to wait until loaded
from selenium.webdriver.chrome.options import Options              # option to make webdriver not visible
from selenium.common.exceptions import WebDriverException          # Deal with exceptions in webdriver


## 2. Set up for Bulk Upload:
Running the cell below opens a file dialog box showing your computer's directories. You may have to minimize the current window to see it (sometimes it likes to hide). Navigate to the folder containing the Excel sheet with your references, select the file, and hit 'Open'. If the file loaded correctly, the first five rows of the reference sheet will appear below. Check to make sure everything looks correct.

In [13]:
# DO NOT EDIT

# Open a file dialog to select an Excel file
file_path = filedialog.askopenfilename(filetypes=[("Excel files", "*.xlsx *.xls")])

# Read these columns as text so values such as years, page ranges, and DOIs are not converted to numbers
dtype_mapping = {
    'Author': str,
    'Year': str,
    'Title': str,
    'Journal Name': str,
    'Volume': str,
    'Issue': str,
    'Pages': str,
    'Keywords': str,
    'Abstract': str,
    'DOI': str
    }


# Read the Excel file into a DataFrame using pandas, applying the specified data types
ref = pd.read_excel(file_path, dtype = dtype_mapping)

# Replace NaN with blanks
ref.fillna('', inplace=True)

# Show first five rows of references
ref.head()

Unnamed: 0,Duplicate,Type,Author,Year,Title,Journal Name,Volume,Issue,Pages,URL,Location,Specimen Data Entered,Impacts Data Entered,Keywords,Abstract,DOI,PDF Name
0,No,Journal Article,"Schofield, K. A., C. M. Pringle, J. L. Meyer, ...",2001,The importance of crayfish in the breakdown of...,Freshwater Biology,46,9,1191-1204,,NAS,N,N,"Cambarus bartonii, Appalachian brook crayfish,...",1. Rhododendron ( Rhododendron maximum ) is a ...,10.1046/j.1365-2427.2001.00739.x,Schofield-2001-The importance of crayfish in t...
1,No,Journal Article,"Seiler, S. M., and A. M. Turner",2004,Growth and population size of crayfish in head...,Freshwater Biology,49,7,870-881,,NAS,N,N,"Cambarus bartonii, headwater streams, acidiﬁca...",1. Environmental stress may have indirect posi...,10.1111/j.1365-2427.2004.01231.x,Seiler-2004-Growth and population size of cray...
2,No,Journal Article,"Sherba, M., D. W. Dunham, and H. H. Harvey",2000,Sublethal copper toxicity and food response in...,Ecotoxicology and Environmental Safety,46,3,329-333,,NAS,N,N,"Cambarus bartonii, Appalachian brook crayfish,...",Food response preceding and following exposure...,10.1006/eesa.1999.1910,Sherba-2000-Sublethal copper toxicity and food...
3,No,Journal Article,"Zachary, J. L., P. S. Thomas, and A. W. Stuart",2009,West Virginia crayfishes (Decapoda: Cambaridae...,Northeastern Naturalist,16,2,225-238,,NAS,N,N,"Cambarus bartonii, Appalachian brook crayfish,...",West Virginia's crayfishes have received moder...,10.1656/045.016.0205,Zachary-2009-West Virginia Crayfishes (Decapod...
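
Beyond eyeballing the first five rows, it can help to confirm that the sheet has every column the upload loop reads. The sketch below is one possible check, not part of the original notebook: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the column list is taken from the `row[...]` lookups in the upload cell.

```python
# Column names the upload loop reads from each row of the sheet
REQUIRED_COLUMNS = [
    "Type", "Author", "Year", "Title", "Journal Name", "Volume", "Issue",
    "Pages", "URL", "Location", "Specimen Data Entered",
    "Impacts Data Entered", "Keywords", "Abstract", "DOI", "PDF Name",
]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from columns, in order."""
    # Exact matching on purpose: a header such as 'Author ' (trailing space)
    # would also break the row["Author"] lookup in the upload loop
    present = set(columns)
    return [name for name in required if name not in present]
```

Calling `missing_columns(ref.columns)` should give an empty list when the sheet is ready; any names it returns point to headers that are missing or misspelled.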


Before running the cell below, make sure that your PDF files are in a separate folder. After you run this cell, another file dialog box will open. Select the folder where you have stored the PDFs.

In [3]:
# DO NOT EDIT

# Open a file dialog to select PDF file location
pdfs_folder_path = filedialog.askdirectory(title="Select a Folder")
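
Since the upload loop renames each file listed in the 'PDF Name' column, a missing or misnamed PDF will raise an error partway through a row. A minimal sketch of a pre-check, assuming the names in the sheet are complete file names (including the extension); `find_missing_pdfs` is a hypothetical helper, not part of the original notebook.

```python
import os


def find_missing_pdfs(pdf_names, folder):
    """Return the names in pdf_names that are not files inside folder."""
    missing = []
    for name in pdf_names:
        # Blank cells become '' after fillna(''), so report them as missing too
        if not name or not os.path.isfile(os.path.join(folder, name)):
            missing.append(name)
    return missing
```

For example, `find_missing_pdfs(ref["PDF Name"], pdfs_folder_path)` lists the entries to fix before starting the upload.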

## 3. Open a window to the NAS data entry page
Next, open a window to log in to the NAS data entry page. This takes a couple of steps:
1. Once the web driver has started, you should see the USGS two-factor authentication login screen. You should be able to interact with the web driver as you would a regular browser. Log in using your regular credentials and preferred method of authentication.
2. Once you are logged in, click on the "References" link.


In [4]:
# DO NOT EDIT

# Open web page using Selenium webdriver
driver = webdriver.Chrome()
driver.get('https://nas.er.usgs.gov/DataEntry/References/Default.aspx')

## 4. Enter References: 
The following code will bulk upload references and PDFs. You may want to minimize the web driver's browser window, since it is a little hard to look at while it enters data.

*Only run this once! Be aware that running the bulk upload code cell more than once will create duplicates! If you are having issues getting it to run, it would be best to close the web driver window altogether.*

Once it is done, check whether any errors were printed. Errors are most likely the result of formatting issues or typos in the sheet. The affected references probably were not uploaded, or were uploaded without their PDFs. You will need to double-check them and upload or fix them manually.
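
Because a rerun repeats every row, one way to recover from a partial run is to record the reference number of each finished row and upload only the rows that lack one. The sketch below illustrates that bookkeeping only; `rows_still_to_upload` and the `refnums_by_row` dictionary are hypothetical names that do not appear in the upload cell, which would need to be adapted to use them.

```python
def rows_still_to_upload(row_positions, refnums_by_row):
    """Return the row positions that have no reference number recorded yet.

    refnums_by_row maps a row position to the RefNum saved for it;
    a missing entry, '' or 'NA' all count as not uploaded.
    """
    return [
        position
        for position in row_positions
        if refnums_by_row.get(position, "NA") in ("", "NA")
    ]
```

With this kind of record, a second pass would loop over only the returned positions instead of every row in `ref`.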



In [14]:
# Create an empty list to hold the new RefNums
results = []


# Input references and pdfs into NAS database - the following code will iterate 
# through each line of your excel sheet

for index, row in ref.iterrows():  
    
    try:
    
        # Click on 'New' Button to create new reference
        new_button = driver.find_element(By.ID, 'ContentPlaceHolder1_New')
        new_button.click()

        time.sleep(0.5)

        # Find dropdown elements on the web page and select information - TYPE
        type_dropdown = Select(driver.find_element(By.ID, 'ContentPlaceHolder1_type'))
        type_dropdown.select_by_visible_text(str(row["Type"]))

        time.sleep(0.5)

        # Find input elements on the web page and fill them with data - AUTHOR
        author_input = driver.find_element(By.ID, 'ContentPlaceHolder1_author')
        author_input.send_keys(str(row["Author"]))

        # Find input elements on the web page and fill them with data - YEAR
        year_input = driver.find_element(By.ID, 'ContentPlaceHolder1_date')
        year_input.send_keys(str(row["Year"]))

        # Find the iframe editor on the web page and fill it with data - TITLE
        time.sleep(0.5)

        driver.switch_to.frame(0)   # switch focus to the title iframe (first iframe on the page)
        title_input = driver.find_element(By.CLASS_NAME, 'cke_show_borders')   # locate the editor body
        title_input.click()
        title_input.send_keys(str(row["Title"]))   # type the title into the editor
        driver.switch_to.default_content()   # switch back to the main content

        # Find input elements on the web page and fill them with data - JOURNAL NAME
        journal_input = driver.find_element(By.ID, 'ContentPlaceHolder1_journal')
        journal_input.send_keys(str(row["Journal Name"]))

        # Find input elements on the web page and fill them with data - VOLUME
        vol_input = driver.find_element(By.ID, 'ContentPlaceHolder1_vol')
        vol_input.send_keys(str(row["Volume"]))

        # Find input elements on the web page and fill them with data - ISSUE
        issue_input = driver.find_element(By.ID, 'ContentPlaceHolder1_issue')
        issue_input.send_keys(str(row["Issue"]))

        # Find input elements on the web page and fill them with data - PAGES
        page_input = driver.find_element(By.ID, 'ContentPlaceHolder1_pages')
        page_input.send_keys(str(row["Pages"]))

        # Find input elements on the web page and fill them with data - URL
        url_input = driver.find_element(By.ID, 'ContentPlaceHolder1_URL')
        url_input.send_keys(str(row["URL"]))

        # Find dropdown elements on the web page and select information - SPECIMEN DATA
        specimen_dropdown = Select(driver.find_element(By.ID, 'ContentPlaceHolder1_entered'))
        specimen_dropdown.select_by_visible_text(str(row["Specimen Data Entered"]))

        # Find dropdown elements on the web page and select information - IMPACT DATA
        impact_dropdown = Select(driver.find_element(By.ID, 'ContentPlaceHolder1_impacts'))
        impact_dropdown.select_by_visible_text(str(row["Impacts Data Entered"]))

        time.sleep(0.5)

        # Find dropdown elements on the web page and select information - LOCATION
        location_dropdown = Select(driver.find_element(By.ID, 'ContentPlaceHolder1_LocationDDL'))
        location_dropdown.select_by_visible_text(str(row["Location"]))

        # Find input elements on the web page and fill them with data - KEYWORDS
        keyword_input = driver.find_element(By.ID, 'ContentPlaceHolder1_key_words')
        keyword_input.send_keys(str(row["Keywords"]))

        # Find the iframe editor on the web page and fill it with data - ABSTRACT
        time.sleep(0.5)

        driver.switch_to.frame(1)   # switch focus to the abstract iframe (second iframe on the page)
        abstract_input = driver.find_element(By.CLASS_NAME, 'cke_show_borders')   # locate the editor body
        abstract_input.click()
        abstract_input.send_keys(str(row["Abstract"]))   # type the abstract into the editor
        driver.switch_to.default_content()   # switch back to the main content

        # Find input elements on the web page and fill them with data - DOI
        doi_input = driver.find_element(By.ID, 'ContentPlaceHolder1_DOI')
        doi_input.send_keys(str(row["DOI"]))
    
        # Submit the form
        submit_button = driver.find_element(By.ID, 'ContentPlaceHolder1_Submit')
        submit_button.click()

        # Accept alert to add PDF
        driver.switch_to.alert.accept()

        # Switch back to default content
        driver.switch_to.default_content()


        # Get html script from web page and find RefNum
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, "ContentPlaceHolder1_refnum")))
        html = driver.page_source
        soup = BeautifulSoup(html,'html.parser')
        refnum = soup.find("span", {"id": "ContentPlaceHolder1_refnum"}).get_text()


        # append refnum to result list for later
        results.append(refnum)


        # Rename File
        old_name = row["PDF Name"]
        new_name = refnum + '.pdf'
        os.rename(os.path.join(pdfs_folder_path, old_name), os.path.join(pdfs_folder_path, new_name))

        # Choose file
        time.sleep(1)
        choose_button = driver.find_element(By.ID, 'ContentPlaceHolder1_FileInput')
        choose_button.send_keys(os.path.join(pdfs_folder_path, new_name))


        # Submit button
        submit_button = driver.find_element(By.ID, 'ContentPlaceHolder1_Submit')
        submit_button.click()


        # Continue button
        continue_button = driver.find_element(By.ID, 'ContentPlaceHolder1_Button1')
        continue_button.click()
        

    except Exception as e:
        
        # Print the error message
        error_message = str(e).splitlines()[0]
        print(f"Error row {index + 1}: {error_message}\nAuthors: {row['Author']} \nYear: {row['Year']} \nTitle: {row['Title']}\n")
        
        # Return to starting data entry page
        driver.get('https://nas.er.usgs.gov/DataEntry/References/Default.aspx')
        
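        # Append 'NA' only if no RefNum was recorded for this row; an error after
        # the RefNum was saved would otherwise add a second entry and misalign results
        if len(results) <= ref.index.get_loc(index):
            results.append('NA')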


# Add Reference Number to excel sheet
ref['RefNum'] = results


## 5. Quality Control:
Double-check that everything ran properly.
1. Check your PDFs folder: you should see the PDFs renamed with their new reference numbers.
2. Run the code below to make sure the new reference numbers have been added to the sheet. Note: references that had errors will show 'NA' as their RefNum.
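
Both checks can also be scripted. The following is a minimal sketch rather than part of the original notebook: `summarize_upload` is a hypothetical helper, and it assumes each uploaded PDF was renamed to `<RefNum>.pdf` as the upload loop does.

```python
import os


def summarize_upload(refnums, folder):
    """Return (failed_rows, missing_pdfs) for a finished upload.

    failed_rows: 1-based row numbers whose reference number is 'NA'.
    missing_pdfs: expected '<RefNum>.pdf' names that are not in folder.
    """
    failed_rows = []
    missing_pdfs = []
    for position, refnum in enumerate(refnums, start=1):
        if refnum == "NA":
            failed_rows.append(position)
            continue
        pdf_name = f"{refnum}.pdf"
        if not os.path.isfile(os.path.join(folder, pdf_name)):
            missing_pdfs.append(pdf_name)
    return failed_rows, missing_pdfs
```

For example, `summarize_upload(ref["RefNum"], pdfs_folder_path)` returns the row numbers to redo by hand and the renamed PDFs that could not be found. The row numbers are 1-based so they line up with the 'Error row' messages printed by the upload cell.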

In [15]:
# DO NOT EDIT

# Reorder columns
new_column_order = ['RefNum', "Type", "Author", "Year", "Title", "Journal Name", "Volume", "Issue", "Pages", "URL", "Keywords", "Abstract", "DOI", "PDF Name"]
ref = ref[new_column_order]

ref.head()

Unnamed: 0,RefNum,Type,Author,Year,Title,Journal Name,Volume,Issue,Pages,URL,Keywords,Abstract,DOI,PDF Name
0,42975,Journal Article,"Schofield, K. A., C. M. Pringle, J. L. Meyer, ...",2001,The importance of crayfish in the breakdown of...,Freshwater Biology,46,9,1191-1204,,"Cambarus bartonii, Appalachian brook crayfish,...",1. Rhododendron ( Rhododendron maximum ) is a ...,10.1046/j.1365-2427.2001.00739.x,Schofield-2001-The importance of crayfish in t...
1,42976,Journal Article,"Seiler, S. M., and A. M. Turner",2004,Growth and population size of crayfish in head...,Freshwater Biology,49,7,870-881,,"Cambarus bartonii, headwater streams, acidiﬁca...",1. Environmental stress may have indirect posi...,10.1111/j.1365-2427.2004.01231.x,Seiler-2004-Growth and population size of cray...
2,42977,Journal Article,"Sherba, M., D. W. Dunham, and H. H. Harvey",2000,Sublethal copper toxicity and food response in...,Ecotoxicology and Environmental Safety,46,3,329-333,,"Cambarus bartonii, Appalachian brook crayfish,...",Food response preceding and following exposure...,10.1006/eesa.1999.1910,Sherba-2000-Sublethal copper toxicity and food...
3,42978,Journal Article,"Zachary, J. L., P. S. Thomas, and A. W. Stuart",2009,West Virginia crayfishes (Decapoda: Cambaridae...,Northeastern Naturalist,16,2,225-238,,"Cambarus bartonii, Appalachian brook crayfish,...",West Virginia's crayfishes have received moder...,10.1656/045.016.0205,Zachary-2009-West Virginia Crayfishes (Decapod...


## 6. Export new excel sheet:
A new column holding the new reference number for each reference was added to the reference sheet. The following cell exports it as a new Excel sheet.

In [16]:
# DO NOT EDIT

file_path = filedialog.asksaveasfilename(defaultextension=".xlsx",
                                            filetypes=[("Excel files", "*.xlsx"),
                                            ("All files", "*.*")])
ref.to_excel(file_path, index=False)

# Open the Excel file with the default application (os.startfile exists only on Windows)
if hasattr(os, 'startfile'):
    os.startfile(file_path)


All Done!