# GLANSIS REFERENCE CLEANER & BULK UPLOADER
**Description:** The following scripts help you clean and bulk upload references.


# PART 1: Reference Cleaner

## 1. Exporting from EndNote

To export your Journal Article references from EndNote into a text file that Jupyter Notebook can read, follow these steps:
1. Download and save the GLANSIS_refManagerExport output style.
2. Select and open the output style. It will open in an EndNote output style editor.
3. Click File > Save As.
4. Close the output style editor.
5. Go to the task bar and select the 'Bibliographic Output Style' dropdown bar. Click 'Select Another Style.' Scroll to and select 'GLANSIS_refManagerExport.' Click 'Choose.'
6. Select all the journal articles you want to enter into NAS.
7. Click File > Export. In the pop-up menu, make sure to change the preferences to:<br>
    a. File name: Enter your file name of choice<br>
    b. Save as type: Text File (.txt)<br>
    c. Output style: GLANSIS_refManagerExport<br>
    d. Click Save<br>
8. To make it easier for the code to find a PDF, extract all PDFs from the EndNote library and put them into a new folder.<br>
    a. Go to the directory where the EndNote Library is stored. Click on the .Data folder.<br>
    b. There will be a PDF and an sdb folder inside. Click on the PDF folder.<br>
    c. In the search bar type '.pdf'. Click on the first PDF and hit Ctrl + A.<br>
    d. Copy and paste the PDFs into a new folder. I recommend placing it in the same folder as your EndNote Library, outside the .Data folder.<br>
9. Once this is done, you should be ready to start editing your
    

*This will only work for journal articles, which are the bulk of what we handle. Any reports, websites, or other references will need to be entered by hand.
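
Before loading the export into the widget, it can help to confirm that the text file has the shape the cleaner expects. The sketch below is not part of the GLANSIS scripts. It assumes the export is tab-delimited with the 13 fields the cleaner names (Type through PDF Name) and that the file is UTF-8 encoded; `check_export` is a hypothetical helper name.

```python
import csv

# Column order assumed from the column list used by the cleaner
EXPECTED_COLUMNS = ["Type", "Author", "Year", "Title", "Journal Name", "Volume",
                    "Issue", "Pages", "URL", "Keywords", "Abstract", "DOI", "PDF Name"]

def check_export(path, encoding="utf-8"):
    """Return (row_number, field_count) for every row that does not have 13 fields."""
    problems = []
    with open(path, newline="", encoding=encoding) as handle:
        # Quoted fields may span several physical lines, so rows are counted, not lines
        reader = csv.reader(handle, delimiter="\t", quotechar='"')
        for row_number, fields in enumerate(reader, start=1):
            if len(fields) != len(EXPECTED_COLUMNS):
                problems.append((row_number, len(fields)))
    return problems
```

An empty list means every row has the expected number of fields. Any reported rows are worth opening in EndNote before running the cleaner.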


## 2. Run Widget to Clean References
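
One of the cleaning steps wraps scientific names in titles with `<em>` tags. As a rough illustration of that step outside the widget, the sketch below applies the same regular-expression pattern to a sample title. The name list and the title are made-up examples; in the widget the names are read from Excel files.

```python
import re

def surround_with_em(text, words_to_surround):
    """Wrap each listed name in <em> tags unless it is already wrapped."""
    if isinstance(text, str):
        for word in words_to_surround:
            # Lookbehind/lookahead skip names that already sit directly inside <em>...</em>
            pattern = r'(?<!<em>)\b' + re.escape(word.strip()) + r'\b(?!</em>)'
            text = re.sub(pattern, r'<em>\g<0></em>', text)
    return text

names = ["Dreissena polymorpha", "Dreissena"]   # example names only
title = "Filtration rates of Dreissena polymorpha in Lake Erie"
print(surround_with_em(title, names))
```

Order matters here: the full species name is wrapped first, so the genus on its own is skipped because it already follows an opening `<em>` tag.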

In [1]:
# DO NOT EDIT

import os                            # Manages directories
import numpy as np                   # Used to manage NA values
import pandas as pd                  # Manages dataframes
import re                            # Edits text strings
import tkinter as tk                 # Constructs GUI
from tkinter import Tk, filedialog   # Creates file dialog box
import requests                      # Pulls HTML code from webpage
from bs4 import BeautifulSoup        # HTML parsing
import fitz                          # Opens and manipulates PDFs - import name for PyMuPDF
from docx import Document            # Create and edit Word Document
from collections import OrderedDict  # Use to remove duplicates from keywords
import unicodedata                   # Convert extracted keywords to unicode - removes issues with duplication
import nltk                          # Library of natural language processing tools              
import nltk.corpus                   # Access corpora
import string                        # Format text strings
import pickle                        # Serializes and deserializes models
import math                          # Math functions for numerical computations
import nltk.data                     # Retrieves data files and resources
from nltk.tokenize.treebank import TreebankWordDetokenizer   # Concatenates token sequences
from nltk.tokenize import word_tokenize   # Splits words into tokens
nltk.download('punkt')                    # Pre-trained tokenizer
from openpyxl import load_workbook        # Edit Excel sheets
from openpyxl.styles import PatternFill   # Modify Excel formatting

# Command to save species names
def save_species_names():
    
    global species_names
    
    # Save species names entries, dropping blanks and surrounding spaces
    species_names = [name.strip() for name in species_names_entry.get().split(',') if name.strip()]
    
    # Check if any species names were entered
    if not species_names:
        
        # If species_names is empty, show an error message
        save_success_label.config(text = "Error - Species names required.", fg="red")
        
    else:
        
        # If species_names has entries, confirm the save
        save_success_label.config(text = "Success!", fg="green")

# Command to save references
def open_reference_file():

    global df
    
    # Open a file dialog to select the text file exported from EndNote
    file_path = filedialog.askopenfilename(title="Select a File", filetypes=[("Text Document", "*.txt")])

    # Stop if the dialog was cancelled
    if not file_path:
        df_success_label.config(text = "Error: Select .txt file with references.", fg="red")
        return

    # Create new column names because text file has no header
    col_names = ["Type", "Author", "Year", "Title", "Journal Name", "Volume", "Issue", "Pages", "URL", "Keywords", "Abstract", "DOI", "PDF Name"]

    # Convert text file into a dataframe
    df = pd.read_csv(file_path, sep = '\t', header = None, dtype = str, names = col_names, quotechar = '"')
    
    # Check if df is empty
    if df.empty:
        
        # If df is empty, show an error message
        df_success_label.config(text = "Error: Select .txt file with references.", fg="red")
        
    else:
        
        # If df has rows, confirm the import
        df_success_label.config(text = "Success!", fg="green")

# Command to save PDF folder
def open_pdf_folder():
    
    global pdf_folder

    # Open a file dialog to select PDF file location
    pdf_folder = filedialog.askdirectory(title="Select a Folder")
    
    # Check if a folder was selected
    if not pdf_folder:
        
        # If no folder was selected, show an error message
        pdf_folder_success_label.config(text = "Error: Select PDF folder.", fg="red")
        
    else:
        
        # If a folder was selected, confirm the selection
        pdf_folder_success_label.config(text = "Success!", fg="green")

# Command to clean references
def clean_references():
    
    global styled_df
    global df
    global doc
    global species_names
    
    # Duplications 

    #url creation
    url = 'https://nas.er.usgs.gov/queries/references/ReferenceList.aspx'

    results = []

    for index, row in df.iterrows():

        # extract search parameters references
        lead_author = str(row['Author']).split(',', 1)[0] + ','
        date = row['Year']
        title_start = ' '.join(str(row['Title']).split()[:5])

        # set query parameters
        search_params = {
            "refnum": '',
            "author": lead_author,
            "date": date,
            "title":title_start,
            "journal": '',
            "publisher": '',
            "vol": '',
            "issue":'',
            "pages":'',
            "URL":'',
            "key_words":'',
            "type":'' 
        }

        # call url
        response = requests.post(url, params = search_params)

        # Default used when the request fails, so every row still gets a value
        refnum = 'Request failed'

        # scrape the RefNum 
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")

            # find specific tag using id attribute
            desired_a_tag = soup.find("a", {"id": "ContentPlaceHolder1_GridView1_HyperLink1_0"})

            if desired_a_tag:
                refnum = desired_a_tag.get_text(strip = True)
            else:
                refnum = 'No'

        results.append(refnum)

    # add column
    df['Duplicate'] = results 


    # DO NOT EDIT

    # Add for location, specimen data, and impact data
    df['Location'] = 'NAS'
    df['Specimen Data Entered'] = 'N'
    df['Impacts Data Entered'] = 'N'


    # DO NOT EDIT

    # Clean DOI
    def clean_doi(text):
        if pd.notna(text):
            if 'doi.org' in text:
                return text.replace('https://doi.org/', '')
            else:
                return text
        else:
            return text

    df['DOI'] = df['DOI'].apply(clean_doi)


    # Define a function to remove illegal characters
    def remove_illegal_chars(text):
        if pd.isnull(text):  # Check if the cell is empty
            return ''
        # Define the pattern for illegal characters
        illegal_chars_pattern = re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]')

        # Replace illegal characters with an empty string
        cleaned_text = illegal_chars_pattern.sub('', text)

        return cleaned_text

    df['Abstract'] = df['Abstract'].apply(remove_illegal_chars) 

    # URL Cleaning
    def clear_cell(row):
        if not pd.isna(row['DOI']):
            return np.nan
        else:
            return row['URL']

    df['URL'] = df.apply(clear_cell, axis = 1)

    # Remove bad URLs
    def remove_non_url(text):
        if pd.notna(text):
            if '<Go to ISI>:' in text:
                return None
            else:
                return text

    df['URL'] = df['URL'].apply(remove_non_url)

    # Clean PDF file name
    def clean_pdf_name(text):
        if isinstance(text, str):
            match = re.search(r'[^/]+$', text)
            # Keep the original text if no file name follows the last slash
            return match.group() if match else text
        else:
            return text


    df['PDF Name'] = df['PDF Name'].apply(clean_pdf_name)


    # DO NOT EDIT
    
    # Create new word document for errors
    doc = Document()
    style = doc.styles['Normal']
    style.paragraph_format.space_after = 1

    # Add keywords
    def keyword_find(filename):

        # Open file
        file = fitz.open(filename)

        # Read and block only the first two pages of PDF
        text = []
        for i, page in enumerate(file):
            if i > 1:
                break
            text += page.get_text("blocks")

        # Close file
        file.close()

        # Handle PDFs whose text could not be read in
        if text:

            # Find block containing keywords - if no keywords in PDF, common & scientific names used
            for block in text:
                if block[4].lower().startswith('key-words:'):
                    pdf_keywords = block[4][10:].strip()
                    break
                elif block[4].lower().startswith('key words:'):
                    pdf_keywords = block[4][10:].strip()
                    break
                elif block[4].lower().startswith('keywords:'):
                    pdf_keywords = block[4][9:].strip()
                    break
                elif block[4].lower().startswith('key-words'):
                    pdf_keywords = block[4][9:].strip()
                    break
                elif block[4].lower().startswith('key words'):
                    pdf_keywords = block[4][9:].strip()
                    break
                elif block[4].lower().startswith('keywords'):
                    pdf_keywords = block[4][8:].strip()
                    break
                else:
                    pdf_keywords = '' 

        else:
            pdf_keywords = '' 

        # Replace intermediate characters - this list is not exhaustive
        keywords_replace = re.sub(r'[\n;�\xa0·./]', ', ', pdf_keywords).replace(', ,', ',').strip(',')

        # Split keywords into a list, dropping empty entries
        clean_keywords = [keyword.strip() for keyword in keywords_replace.split(',') if keyword.strip()]

        return clean_keywords
        
    # Remove breaks created by EndNote
    df['Keywords'] = df['Keywords'].str.replace('\r\n', ', ')

    # Convert each row into a list of individual keywords
    df['Keywords'] = df['Keywords'].apply(lambda x: [k.strip() for k in x.split(',') if k.strip()] if pd.notnull(x) else [])

    # Create empty list for error messages
    error_messages = []

    # Iterate through DataFrame rows
    for index, row in df.iterrows():
        try:
            # Create file path
            file_path = os.path.join(pdf_folder, row['PDF Name'])

            # Combine keywords entered by user, from EndNote, and from PDF - need to be lists not strings
            combined_keywords = species_names + row['Keywords'] + keyword_find(file_path)

            # Remove extra spaces around words in list and drop empty entries
            stripped_keywords = [word.strip() for word in combined_keywords if word.strip()]

            # Remove duplicate keywords while keeping their order
            unique_keywords = list(OrderedDict.fromkeys(stripped_keywords))

            # Combine list into text string
            keywords = ', '.join(unique_keywords)

            # Update 'Keyword' column
            df.at[index, 'Keywords'] = keywords

        except Exception as e:
            
            # Create error message
            error_text = f"Error row {index + 1}: {e} \nAuthors: {row['Author']} \nYear: {row['Year']} \nTitle: {row['Title']} \nPDF: {row['PDF Name']}\n"
            
            # Append to error message list
            error_messages.append(error_text)
            
            # Write error message to the Word document
            doc.add_paragraph(f"Error row {index + 1}: {e}")
            doc.add_paragraph(f"Authors: {row['Author']}")
            doc.add_paragraph(f"Year: {row['Year']}")
            doc.add_paragraph(f"Title: {row['Title']}")
            doc.add_paragraph("")  # Add an empty line between errors
            
            # Fall back to the entered and EndNote keywords so the cell holds text, not a list
            df.at[index, 'Keywords'] = ', '.join(species_names + row['Keywords'])
            
            # Skip to the next iteration of the loop
            continue

    # Remove illegal characters
    df['Keywords'] = df['Keywords'].apply(remove_illegal_chars) 


    # Create Truecaser to edit titles
    class TrueCaser(object):
        def __init__(self, dist_file_path=None):

            """ Initialize module with the model from Google Drive """
            if dist_file_path is None:
                dist_file_path = 'models/truecaserTest.dist'

            with open(dist_file_path, "rb") as distributions_file:
                pickle_dict = pickle.load(distributions_file)
                self.uni_dist = pickle_dict["uni_dist"]
                self.backward_bi_dist = pickle_dict["backward_bi_dist"]
                self.forward_bi_dist = pickle_dict["forward_bi_dist"]
                self.trigram_dist = pickle_dict["trigram_dist"]
                self.word_casing_lookup = pickle_dict["word_casing_lookup"]
            self.detknzr = TreebankWordDetokenizer()

        def get_score(self, prev_token, possible_token, next_token):
            pseudo_count = 5.0

            # Get Unigram Score
            numerator = self.uni_dist[possible_token] + pseudo_count
            denominator = 0
            for alternativeToken in self.word_casing_lookup[
                    possible_token.lower()]:
                denominator += self.uni_dist[alternativeToken] + pseudo_count

            unigram_score = numerator / denominator

            # Get Backward Score
            bigram_backward_score = 1
            if prev_token is not None:
                numerator = (
                    self.backward_bi_dist[prev_token + "_" + possible_token] +
                    pseudo_count)
                denominator = 0
                for alternativeToken in self.word_casing_lookup[
                        possible_token.lower()]:
                    denominator += (self.backward_bi_dist[prev_token + "_" +
                                                          alternativeToken] +
                                    pseudo_count)

                bigram_backward_score = numerator / denominator

            # Get Forward Score
            bigram_forward_score = 1
            if next_token is not None:
                next_token = next_token.lower()  # Ensure it is lower case
                numerator = (
                    self.forward_bi_dist[possible_token + "_" + next_token] +
                    pseudo_count)
                denominator = 0
                for alternativeToken in self.word_casing_lookup[
                        possible_token.lower()]:
                    denominator += (
                        self.forward_bi_dist[alternativeToken + "_" + next_token] +
                        pseudo_count)

                bigram_forward_score = numerator / denominator

            # Get Trigram Score
            trigram_score = 1
            if prev_token is not None and next_token is not None:
                next_token = next_token.lower()  # Ensure it is lower case
                numerator = (self.trigram_dist[prev_token + "_" + possible_token +
                                               "_" + next_token] + pseudo_count)
                denominator = 0
                for alternativeToken in self.word_casing_lookup[
                        possible_token.lower()]:
                    denominator += (
                        self.trigram_dist[prev_token + "_" + alternativeToken +
                                          "_" + next_token] + pseudo_count)

                trigram_score = numerator / denominator

            result = (math.log(unigram_score) + math.log(bigram_backward_score) +
                      math.log(bigram_forward_score) + math.log(trigram_score))

            return result

        def first_token_case(self, raw):
            return raw.capitalize()

        def get_true_case(self, sentence, out_of_vocabulary_token_option="title"):
            """ Wrapper function for handling untokenized input.

            @param sentence: a sentence string to be tokenized
            @param out_of_vocabulary_token_option:
                title: Returns out of vocabulary (OOV) tokens in 'title' format
                capitalize: Returns OOV tokens with only the first letter capitalized
                lower: Returns OOV tokens in lower case
                as-is: Returns OOV tokens as is

            Returns (str): detokenized, truecased version of input sentence
            """
            tokens = word_tokenize(sentence)
            tokens_true_case = self.get_true_case_from_tokens(tokens, out_of_vocabulary_token_option)
            return self.detknzr.detokenize(tokens_true_case)

        def get_true_case_from_tokens(self, tokens, out_of_vocabulary_token_option="title"):
            """ Returns the true case for the passed tokens.

            @param tokens: List of tokens in a single sentence
            @param out_of_vocabulary_token_option:
                title: Returns out of vocabulary (OOV) tokens in 'title' format
                capitalize: Returns OOV tokens with only the first letter capitalized
                lower: Returns OOV tokens in lower case
                as-is: Returns OOV tokens as is

            Returns (list[str]): truecased version of input list
            of tokens
            """
            tokens_true_case = []
            for token_idx, token in enumerate(tokens):

                if token in string.punctuation or token.isdigit():
                    tokens_true_case.append(token)
                else:
                    token = token.lower()
                    if token in self.word_casing_lookup:
                        if len(self.word_casing_lookup[token]) == 1:
                            tokens_true_case.append(
                                list(self.word_casing_lookup[token])[0])
                        else:
                            prev_token = (tokens_true_case[token_idx - 1]
                                          if token_idx > 0 else None)
                            next_token = (tokens[token_idx + 1]
                                          if token_idx < len(tokens) - 1 else None)

                            best_token = None
                            highest_score = float("-inf")

                            for possible_token in self.word_casing_lookup[token]:
                                score = self.get_score(prev_token, possible_token,
                                                       next_token)

                                if score > highest_score:
                                    best_token = possible_token
                                    highest_score = score

                            tokens_true_case.append(best_token)

                        if token_idx == 0:
                            tokens_true_case[0] = self.first_token_case(
                                tokens_true_case[0])

                    else:  # Token out of vocabulary
                        if out_of_vocabulary_token_option == "title":
                            tokens_true_case.append(token.title())
                        elif out_of_vocabulary_token_option == "capitalize":
                            tokens_true_case.append(token.capitalize())
                        elif out_of_vocabulary_token_option == "lower":
                            tokens_true_case.append(token.lower())
                        else:
                            tokens_true_case.append(token)

            return tokens_true_case

    # Upload Truecaser model
    caser = TrueCaser('models/truecaserTest.dist')

    # Function to apply get_true_case to a sentence
    def apply_true_case(sentence):
        if not isinstance(sentence, str):  # Leave missing titles untouched
            return sentence
        return caser.get_true_case(sentence, "lower")


    # Apply the function to the 'Title' column
    df['Title'] = df['Title'].apply(apply_true_case)

    # Function to capitalize the first letter of the first word
    def capitalize_first_word(text):
        if not isinstance(text, str):  # Leave missing titles untouched
            return text
        return text[:1].capitalize() + text[1:]

    # Apply the function to the 'Title' column
    df['Title'] = df['Title'].apply(capitalize_first_word)


    # Function to remove <i> and </i> tags from the text
    def remove_italic_tags(text):
        if not isinstance(text, str):  # Leave missing titles untouched
            return text
        soup = BeautifulSoup(text, "html.parser")
        return soup.get_text()

    # Apply the function to the 'Title' column and store the result in the same column
    df['Title'] = df['Title'].apply(remove_italic_tags)

    # Upload Excel file with species names with subspecies list
    species_names_with_subspecies = pd.read_excel('textFiles/subspecies.xlsx', header = None)
    species_names_with_subspecies = species_names_with_subspecies.iloc[:, 0].to_list()

    # Upload Excel file with species names list
    # (kept separate from the global species_names entered in the widget, which it used to overwrite)
    scientific_names = pd.read_excel('textFiles/scientific_names.xlsx', header = None)
    scientific_names = scientific_names.iloc[:, 0].to_list()

    # Upload Excel file with genus list
    genus = pd.read_excel('textFiles/genus.xlsx', header = None)
    genus = genus.iloc[:, 0].to_list()

    # Formula to surround text with <em>
    def surround_with_em(text, words_to_surround):
        if isinstance(text, str):
            for word in words_to_surround:
                pattern = r'(?<!<em>)\b' + re.escape(word.strip()) + r'\b(?!<\/em>)'
                replacement = r'<em>\g<0></em>'
                text = re.sub(pattern, replacement, text)
        return text


    # Apply the replacement to species names with subspecies
    df['Title'] = df['Title'].apply(lambda x: surround_with_em(x, species_names_with_subspecies))
    # df['Abstract'] = df['Abstract'].apply(lambda x: surround_with_em(x, species_names_with_subspecies))

    # Apply the replacement to species names
    df['Title'] = df['Title'].apply(lambda x: surround_with_em(x, scientific_names))
    # df['Abstract'] = df['Abstract'].apply(lambda x: surround_with_em(x, scientific_names))

    # Apply the replacement to genus
    df['Title'] = df['Title'].apply(lambda x: surround_with_em(x, genus))
    
    # Variation in genus usage makes this more work in the abstract
    # df['Abstract'] = df['Abstract'].apply(lambda x: surround_with_em(x, genus))


    # Formula to abbreviate scientific names
    def abbreviate_scientific_name(names):

        abbreviated_names = []

        for name in names:
            words = name.split()
            genus_initial = words[0][0] + "."
            species = " ".join(words[1:])
            abbreviated_names.append(genus_initial + " " + species)

        return abbreviated_names

    abbreviated_names = abbreviate_scientific_name(scientific_names)

    # Apply the replacement to abbreviated scientific names
    df['Title'] = df['Title'].apply(lambda x: surround_with_em(x, abbreviated_names))
    # df['Abstract'] = df['Abstract'].apply(lambda x: surround_with_em(x, abbreviated_names))
    
    # Reorder columns
    new_column_order = ["Duplicate", "Type", "Author", "Year", "Title", "Journal Name", "Volume", "Issue", "Pages", "URL", "Location", "Specimen Data Entered", "Impacts Data Entered", "Keywords", "Abstract", "DOI", "PDF Name"]
    df = df[new_column_order]

    def highlight(x):
        # Define columns to highlight for missing values and non-PDF file names
        columns_to_highlight_na = ["Type", "Author", "Year", "Title", "Journal Name", "Volume", "Pages", "Location", "Specimen Data Entered", "Impacts Data Entered", "PDF Name"]
        column_to_check_pdf = "PDF Name"

        # Create an empty DataFrame with the same index and columns as the input DataFrame
        styled_df = pd.DataFrame('', index=x.index, columns=x.columns)

        # Highlight missing values
        for col in columns_to_highlight_na:
            mask_na = pd.isna(x[col])
            styled_df.loc[mask_na, col] = 'background-color: #FFFF00;'

        # Highlight "PDF Name" cells that do not end in .pdf (missing values included)
        mask_pdf_invalid = ~x[column_to_check_pdf].astype(str).str.endswith('.pdf', na=False)
        styled_df.loc[mask_pdf_invalid, column_to_check_pdf] = 'background-color: #FFFF00;'

        # Return the styled DataFrame
        return styled_df

    # Apply styling to the original DataFrame
    styled_df = df.style.apply(highlight, axis=None)
    
    
    # Check if styled_df was created
    if styled_df is None:
        
        # If styling did not produce a result, show an error message
        styled_df_success_label.config(text = "Error: Cleaning not completed!", fg="red")
        
    else:
        
        # If styled_df exists, report completion
        styled_df_success_label.config(text = "Completed!", fg="green")
    
    # Create message box
    def show_error_messages(messages):
        
        # Create a new top-level window
        error_window = tk.Toplevel()
        error_window.title("Error in Keywords")

        # Add a scrollbar
        scrollbar = tk.Scrollbar(error_window)
        scrollbar.pack(side=tk.RIGHT, fill=tk.Y)

        # Create a text widget to display the error messages
        text = tk.Text(error_window, wrap=tk.WORD, yscrollcommand=scrollbar.set, font=("Arial", 9))
        text.pack(expand=True, fill=tk.BOTH)
        scrollbar.config(command=text.yview)

        # Insert error messages into the text widget
        for msg in messages:
            text.insert(tk.END, msg + "\n")

        # Disable editing in the text widget
        text.config(state=tk.DISABLED)

        # Set the dimensions of the window
        error_window.geometry("400x400")  # Change dimensions as needed

        # Make the window modal (disables interaction with other windows)
        error_window.transient(window)
        error_window.grab_set()

        # Wait until the error window is closed instead of starting a second event loop
        error_window.wait_window()
        
    # Keyword error message     
    if error_messages:
            
        # Run function for error messages
        show_error_messages(error_messages)
            

# Command to export Excel with references
def export_excel():

    # Raised by openpyxl when a cell contains characters Excel cannot store
    from openpyxl.utils.exceptions import IllegalCharacterError

    # Select where Excel should be saved
    file_path = filedialog.asksaveasfilename(defaultextension=".xlsx", filetypes=[("Excel files", "*.xlsx"), ("All files", "*.*")])

    # Stop if the dialog was cancelled
    if not file_path:
        export_error_label.config(text = "Error: Select where to save the Excel file.", fg="red")
        return

    # Export Excel
    try:
        
        # Export Excel with references
        styled_df.to_excel(file_path, engine='openpyxl', index=False)
        
    except IllegalCharacterError:
        
        # If illegal characters, send message
        export_error_label.config(text = "Error: Illegal characters found in the data.", fg="red")
        return
        
    except Exception as e:
        
        # For any other error, show its message
        export_error_label.config(text = f"Error: {e}", fg="red")
        return

    # Save Error Document
    directory = os.path.dirname(file_path)
    doc.save(os.path.join(directory, "Keywords_Error_Doc.docx"))
    
    # Open the Excel file (os.startfile is only available on Windows)
    if hasattr(os, "startfile"):
        os.startfile(file_path)
    

# Create a tkinter window
window = tk.Tk()
window.attributes("-topmost", True)
window.title("Reference Editor")

# Header paragraph
header_text = "Reference Editor: The following widget will edit references from EndNote and create an Excel sheet that can be bulk uploaded."
header_label = tk.Label(window, text = header_text, wraplength = 500, justify = "left")
header_label.grid(row = 0, padx = 10, pady = (10, 20), sticky = 'w')


# Species names entry field
species_names_label = tk.Label(window, text = "Scientific & common names:")
species_names_label.grid(row = 1, padx = 10, pady = 5, sticky = "w")

species_names_entry = tk.Entry(window, width = 45)
species_names_entry.grid(row = 1, padx = (180, 0), pady = 5, sticky = "w")

# Save button
save_button = tk.Button(window, text = "Save", command = save_species_names, width = 15, height = 1)
save_button.grid(row = 3, padx = 10, pady = 10, sticky = "w")
save_success_label = tk.Label(window, text = "")
save_success_label.grid(row = 3, padx = 150, pady = 10, sticky = "w")


# Description of buttons
button_text = "Click buttons below to import reference information from EndNote"
button_label = tk.Label(window, text = button_text,  wraplength = 500, justify = "left")
button_label.grid(row = 4, padx = 10, pady = (10, 0), sticky = "w")

# Import references
ref_import_button = tk.Button(window, text = "Select .txt File", command = open_reference_file, width = 15, height = 1)
ref_import_button.grid(row = 5, padx = 10, pady = 10, sticky = "w")
df_success_label = tk.Label(window, text = "")
df_success_label.grid(row = 5, padx = 150, pady = 10, sticky = "w")

# Select PDF folder
pdf_button = tk.Button(window, text = "Select PDF Folder", command = open_pdf_folder, width = 15, height = 1)
pdf_button.grid(row = 6, padx = 10, pady = 10, sticky = "w")
pdf_folder_success_label = tk.Label(window, text = "")
pdf_folder_success_label.grid(row = 6, padx = 150, pady = 10, sticky = "w")


# Description of button
clean_button_text = "Click button below to clean references - will take time to load"
clean_button_label = tk.Label(window, text = clean_button_text,  wraplength = 450, justify = "left")
clean_button_label.grid(row = 7, padx = 10, pady = (10, 0), sticky = "w")

# Clean reference button
clean_ref_button = tk.Button(window, text = "Clean References", command = clean_references, width = 15, height = 1)
clean_ref_button.grid(row = 8, padx = 10, pady = 10, sticky = "w")
styled_df_success_label = tk.Label(window, text = "")
styled_df_success_label.grid(row = 8, padx = 150, pady = 10, sticky = "w")

# Warning Note
warning_text = """** If you get a pop-up window with error messages, there was at least one error when pulling keywords. Most likely, there was an issue with the PDF file name. """
warning_label = tk.Label(window, text = warning_text, wraplength = 500, justify = "left")
warning_label.grid(row = 9, padx = 10, pady = (10, 20), sticky = 'w')

# Description of button
export_button_text = "Click button below to export the cleaned references to an Excel file"
export_button_label = tk.Label(window, text = export_button_text,  wraplength = 450, justify = "left")
export_button_label.grid(row = 10, padx = 10, pady = (10, 0), sticky = "w")

# Excel export button
excel_export_button = tk.Button(window, text = "Export Excel", command = export_excel, width = 15, height = 1)
excel_export_button.grid(row = 11, padx = 10, pady = (10, 20), sticky = "w")
export_error_label = tk.Label(window, text = "")
export_error_label.grid(row = 11, padx = 150, pady = 10, sticky = "w")

# Quality Assurance Note
qa_text = """** There is still some work that needs to be done before bulk uploading these references. The following is a checklist to follow:

1. Certain columns cannot be blank. Cells that cannot be blank are highlighted in yellow. Fix those cells.
2. Check the format of reference columns in case EndNote missed them, particularly Type, Year, Journal Name, Pages, PDF Name. For example, make sure the Year column only has years.
3. Check Reference Numbers in the Duplicate column to confirm each one is a true duplicate. Remove duplicate rows from the sheet.
4. Double-check capitalization in titles - the truecasing model is not perfect.
5. Add HTML tags (<em>) around any extra scientific names in titles.
6. Add subscript (<sub>) and superscript (<sup>) tags around numbers that are part of chemical formulas or units. This can also be done in the NAS reference editor.
7. Check the Issue column: it's not a required field, but sometimes EndNote puts in strange formats that need to be corrected.
8. Check any remaining URLs: NAS requires the URL to link directly to the journal's website with the PDF/HTML version of the article, if available. NAS does not want URLs that lead to indexing services (ProQuest, Web of Knowledge, EBSCOHost).
9. Double-check blank DOI cells: on rare occasions the PDF has a DOI but EndNote does not report it."""

qa_label = tk.Label(window, text = qa_text, wraplength = 500, justify = "left")
qa_label.grid(row = 12, padx = 10, pady = (10, 20), sticky = 'w')

# Run the tkinter event loop
window.mainloop()


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\redinger\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
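
Parts of the checklist shown in the widget can be pre-screened with a few lines of code before the manual review. The following is a minimal sketch rather than part of the GLANSIS scripts: the function names are hypothetical, the four-digit rule for Year is an assumption about the expected format, and the fragments used to spot indexing services are illustrative guesses that should be replaced with the ones you actually encounter.

```python
import re

# Illustrative fragments only; adjust to the indexing services seen in practice
INDEXING_HINTS = ("proquest", "webofknowledge", "ebscohost")

def year_looks_valid(value):
    """True when the cell holds exactly a four-digit year (checklist item 2)."""
    return isinstance(value, str) and re.fullmatch(r"\d{4}", value.strip()) is not None

def url_needs_review(value):
    """True when a URL appears to point at an indexing service (checklist item 8)."""
    if not isinstance(value, str):
        return False  # Blank URL cells are not flagged
    return any(hint in value.lower() for hint in INDEXING_HINTS)
```

With the cleaned DataFrame loaded, something like `df[~df['Year'].apply(year_looks_valid)]` would list the rows whose Year needs attention, and `df[df['URL'].apply(url_needs_review)]` would list URLs to replace. These checks narrow down the manual review; they do not replace it.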
