## Trophic Information Pipeline Readme

**Description**
The cells in this Jupyter notebook work together as a trophic information pipeline. Running them produces a file containing the final classifications for the scientific names found in research articles.

**Getting Started**
To run this notebook, every library imported in Part 2 must be installed on your machine. Download the Ollie tool from http://knowitall.cs.washington.edu/ollie/ollie-app-latest.jar and the English MaltParser model (engmalt.linear-1.7.mco) from http://www.maltparser.org/mco/english_parser/engmalt.html, following the instructions at https://github.com/knowitall/ollie. Place both downloads in the same folder as the Jupyter notebook file.

Download the scientific names file, the common names file, the abbreviated scientific names file, the English words file, the Random Forest training file and the trophic keywords file from https://github.com/JSRaffing/Trophic-Information-Extraction-Pipeline, and keep them in that same folder.

Once that is complete, fill in the Part 1 cell with the names of the files to be analyzed (PDFs) and the name of the result file. All other variables default to the downloaded file names; if you replace a file or rename one, update the corresponding variable.
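
Before filling in Part 1, it can help to confirm that every download is actually in the notebook's folder. The sketch below is not part of the pipeline: `find_missing_files` is a hypothetical helper, and the file names are simply the defaults listed in Part 1 plus the MaltParser model named above.

```python
import os

# Default file names from Part 1; adjust the list if you renamed anything.
REQUIRED_FILES = [
    'words.txt',
    'scinames-final.txt',
    'comnames-final.txt',
    'acronamesflipped-final.txt',
    'trophickeywords-final.txt',
    'mixedtrain2.csv',
    'ollie-app-latest.jar',
    'engmalt.linear-1.7.mco',
]

def find_missing_files(file_names, folder='.'):
    """Return the names in file_names that are not present in folder."""
    return [name for name in file_names
            if not os.path.isfile(os.path.join(folder, name))]

missing = find_missing_files(REQUIRED_FILES)
if missing:
    print('Missing files:', ', '.join(missing))
else:
    print('All required files found')
```

Running this in the notebook's folder lists anything that still needs to be downloaded or renamed.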

## Part 1: Filling in the Variables

In [None]:
# Files to be analyzed: replace the example file names with the names of your actual pdfs
files_to_be_analyzed = ['example_file1.pdf', 'example_file2.pdf', 'example_file3.pdf']
# Result file name: replace the example result file name with the name of the file you want
output_file = 'Name_of_Result_File.txt'
# English words File: To change the file, replace 'words.txt' with the name of your file that is structured with one word on each line.
english_dict_file = 'words.txt'
# Scientific Names File: To change the file, replace 'scinames-final.txt' with the name of your file that has each scientific name on a row in the first column and the corresponding kingdom on the same row in the second column
sci_names_file = 'scinames-final.txt'
# Common Names File: To change the file, replace 'comnames-final.txt' with the name of your file that has each common name on a row in the first column and the corresponding kingdom on the same row in the second column
common_names_file = 'comnames-final.txt'
# Abbreviated Scientific Names File: To change the file, replace 'acronamesflipped-final.txt' with the name of your file that has each abbreviated name on a row in the first column and the corresponding expanded name on the same row in the second column
abbreviated_names_file = 'acronamesflipped-final.txt'
# Trophic Keywords File: To add keywords, open the file and add them to the phrases list as well as to the specific category they fall in
trophic_keywords_file = 'trophickeywords-final.txt'
# Random Forest training file
random_forest_training_file = 'mixedtrain2.csv'
# Ollie File
ollie_file = 'ollie-app-latest.jar'



## Part 2: Importing Libraries

In [None]:
# Each imported library is used in a later step.
import csv
from itertools import zip_longest
import nltk
from itertools import groupby
import sys
import re
import itertools
import operator
import argparse
import subprocess
from subprocess import PIPE, Popen
import fitz
import Levenshtein
from Levenshtein import distance
from more_itertools import unique_everseen
from sklearn.ensemble import RandomForestClassifier
import os
import pandas as pd
import ast

print('Libraries imported')

## Part 3: Loading Necessary Collections

**English Word Collection**

In [None]:
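# The collection of English words is read from a file (one word per line) and saved as a list.
english_list = []

with open(english_dict_file, "r") as file:
    for line in file:
        # Strip the trailing newline so later membership checks such as `word in english_list` can match
        english_list.append(line.strip())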

print('English dictionary loaded')

**Scientific Names Collection**

In [None]:
# The scientific names and their kingdoms are read from a two-column file and saved as a dictionary
sci_names = []
kingdom_names = []

with open(sci_names_file, 'r') as data: 
    for line in data.readlines():
        # Change the '\t' delimiter to your file's delimiter if it isn't tab separated.
        line = line.rstrip('\n')
        entity = line.split('\t') 
        sci_names.append(entity[0])
        kingdom_names.append(entity[1])
# Lists are zipped together to create a dictionary.
sci_name_dict = dict(zip(sci_names, kingdom_names))

print('Scientific names loaded')

**Common Names Collection**

In [None]:
# The common names and their kingdoms are read from a two-column file and saved as a dictionary
common_names = []
kingdom_names = []

with open(common_names_file, 'r') as data: 
    for line in data.readlines():
        # Change the '\t' delimiter to your file's delimiter if it isn't tab separated.
        line = line.rstrip('\n')
        entity = line.split('\t') 
        common_names.append(entity[0])
        kingdom_names.append(entity[1])
# Lists are zipped together to create a dictionary.
common_names_dict = dict(zip(common_names, kingdom_names))

print('Common names loaded')

**Abbreviated Scientific Names**

In [None]:
# The abbreviated names and their expanded names are read from a two-column file and saved as two parallel lists
abbreviated_names = []
expanded_names = []

with open(abbreviated_names_file, 'r') as data: 
    for line in data.readlines():
        # Change the '\t' delimiter to your file's delimiter if it isn't tab separated.
        line = line.rstrip('\n')
        entity = line.split('\t') 
        abbreviated_names.append(entity[0])
        expanded_names.append(entity[1])
        
print('Abbreviated names loaded')
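
The three loaders above share the same shape: strip the newline, split on the delimiter, keep the first two columns. As a sketch of how that shared step could be factored out, the hypothetical helper `parse_two_column_lines` below builds the dictionary directly. It is an illustration under the assumption that the files are tab separated, not a replacement the notebook depends on.

```python
def parse_two_column_lines(lines, delimiter='\t'):
    """Map the first column of each line to its second column.

    Blank lines and rows without a second column are skipped.
    If a key appears more than once, the last row wins, as with dict(zip(...)).
    """
    mapping = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        fields = line.split(delimiter)
        if len(fields) < 2:
            continue
        mapping[fields[0]] = fields[1]
    return mapping

# Invented example rows in the same layout as the scientific names file.
example = ['Apis mellifera\tAnimalia\n', '\n', 'Quercus robur\tPlantae\n']
print(parse_two_column_lines(example))
```

With a real file, the same function can be fed an open file handle, for example `with open(sci_names_file) as data: sci_name_dict = parse_two_column_lines(data)`.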

**Trophic Keywords and Categories**

In [None]:
# The trophic keywords and their categories are read from a two-column file; the second column of each row holds a Python list literal that is parsed below
total_keywords_and_categories = []

with open(trophic_keywords_file, 'r') as data: 
    for line in data.readlines():
        # Change the '\t' delimiter to your file's delimiter if it isn't tab separated.
        line = line.rstrip('\n')
        category = line.split('\t')
        total_keywords_and_categories.append(category[1])

total_keywords = ast.literal_eval(total_keywords_and_categories[0])
lefttoright = ast.literal_eval(total_keywords_and_categories[1])
righttoleft = ast.literal_eval(total_keywords_and_categories[2])
reflexive = ast.literal_eval(total_keywords_and_categories[3])
parasite = ast.literal_eval(total_keywords_and_categories[4])
detritivore = ast.literal_eval(total_keywords_and_categories[5])

print('Trophic categories loaded')
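
Judging from the loading code above, each row of the trophic keywords file is expected to hold a label, a tab, and a Python list literal, with the rows in the order: all keywords, left-to-right, right-to-left, reflexive, parasite, detritivore. The sketch below shows that parsing step on invented rows; the labels and phrases are assumptions for illustration, not the contents of the real file.

```python
import ast

def parse_keyword_lines(lines, delimiter='\t'):
    """Parse 'label<delimiter>list literal' rows into Python lists, in file order."""
    parsed = []
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        # Only the second column is evaluated; the label is ignored.
        _label, literal = line.split(delimiter, 1)
        parsed.append(ast.literal_eval(literal))
    return parsed

# Hypothetical file content.
example_lines = [
    "all\t['feeds on', 'eaten by']\n",
    "lefttoright\t['feeds on']\n",
    "righttoleft\t['eaten by']\n",
]
all_keywords, left_to_right, right_to_left = parse_keyword_lines(example_lines)
print(all_keywords)
```

`ast.literal_eval` only accepts literals, so a malformed list in the file raises an error at load time instead of silently producing a string.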

## Part 4: Random Forest Model

**Training Classifier**

In [None]:
# The file to train the classifier is read and its data is saved into an object
mixedtraining3 = pd.read_csv(random_forest_training_file)
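# The data is cleaned by dropping the 'name' column, which is not a feature
del mixedtraining3['name']
# The first 676 columns are the features: one count per lowercase letter bigram (26 x 26), matching sci_check below
X = mixedtraining3.iloc[:, 0:676].values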
y = mixedtraining3['type']
# The classifier is trained
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X, y)

print('Random Forest model loaded')

**Creating Function with Classifier**

In [None]:
# The function sci_check uses the trained Random Forest model from above to predict whether or not a term is a scientific word
def sci_check(term):
    entry = {}
    all_columns = []
    entry_frequencies = []
    # Creating list of all 676 possible lowercase bigrams ('aa', 'ab', ..., 'zz')
    for char in 'abcdefghijklmnopqrstuvwxyz':
        all_possible_bigrams = [char+b for b in 'abcdefghijklmnopqrstuvwxyz']
        all_columns.extend(all_possible_bigrams)
    # Creating list of bigrams in term (computed once, outside the loop)
    chars = [term[i:i+2] for i in range(0, len(term))]
    all_columns_2 = all_columns
    # Count the frequency of the bigram in the term
    for bigram in all_columns_2:
        frequency = chars.count(bigram)
        entry[bigram] = frequency
        entry_frequencies.append(frequency)
    # Use random forest model to check if it's a possible scientific name
    y_pred = classifier.predict([entry_frequencies])
    
    return y_pred[0]
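
To make the feature vector concrete, the sketch below rebuilds the same 676 bigram counts without the classifier. `bigram_counts` is a hypothetical name used only for this illustration; the column order ('aa', 'ab', ..., 'zz') follows the loops in `sci_check`.

```python
import string

ALPHABET = string.ascii_lowercase
# 26 * 26 = 676 columns, ordered 'aa', 'ab', ..., 'zz'
ALL_BIGRAMS = [a + b for a in ALPHABET for b in ALPHABET]

def bigram_counts(term):
    """Return the 676 bigram counts for term, in the same order sci_check uses."""
    bigrams_in_term = [term[i:i + 2] for i in range(len(term))]
    return [bigrams_in_term.count(bigram) for bigram in ALL_BIGRAMS]

features = bigram_counts('apis')
print(len(features))                       # 676
print(features[ALL_BIGRAMS.index('ap')])   # 1
print(sum(features))                       # 3 ('ap', 'pi', 'is')
```

Note that the columns are lowercase only and `sci_check` does not lowercase its input, so a capitalized bigram such as the 'Ap' in 'Apis' matches no column and is not counted.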

## Part 5: Function to Search Ollie Results

In [None]:
# The function "searching_ollie_results" searches through Ollie results for relations that contain a keyword
def searching_ollie_results(list_of_lists):
    relevant_ollie_results = []
    relevant_ollie_results_dict = {}
    # Iterate through the Ollie relations and find those that contain one of the trophic keywords
    keep = set('abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ.1234567890')
    for ollie_relation in list_of_lists:
        ollie_relation = ''.join(filter(keep.__contains__, str(ollie_relation)))
        for keyword in total_keywords:
            # Check if the keyword is in the relation, lowercasing both
            if re.search(keyword.lower(), ollie_relation.lower()):
                # Locate the keyword as a whole word or phrase in the relation
                keyword_index = re.search(r'\b({})\b'.format(keyword.lower()), ollie_relation.lower())
                # Continue only if a whole-word match was found
                if keyword_index is not None:
                    # Check if sentence already in results
                    if ollie_relation not in relevant_ollie_results_dict.keys():
                        # Add sentence and phrase into dictionary
                        relevant_ollie_results_dict[ollie_relation] = keyword
                        # Add sentence, phrase and indices to final results
                        relevant_ollie_results.extend([[ollie_relation, keyword, [keyword_index.start(), keyword_index.end()]]])
                        
                    else:
                        # If sentence already in result dictionary, save the previous entry
                        smallest = min((relevant_ollie_results_dict[ollie_relation], keyword), key=len)
                        # Check if smaller phrase is in longer phrase
                        if smallest in keyword:
                            # Save both phrases
                            both = [relevant_ollie_results_dict[ollie_relation], keyword]
                            # Keep the phrase that is the longest
                            longest = max(both, key=len)
                            # Find the index of the phrase
                            keyword_index_2 = re.search(r'\b({})\b'.format(relevant_ollie_results_dict[ollie_relation].lower()), ollie_relation.lower())
                            # Get the index of the previous entry from the results
                            if [ollie_relation, relevant_ollie_results_dict[ollie_relation], [keyword_index_2.start(), keyword_index_2.end()]] in relevant_ollie_results:
                                previous_index = relevant_ollie_results.index([ollie_relation, relevant_ollie_results_dict[ollie_relation], [keyword_index_2.start(), keyword_index_2.end()]])
                            # Replace the previous results with the newer phrase entry
                                relevant_ollie_results[previous_index] = [ollie_relation, longest, [keyword_index.start(), keyword_index.end()]]
                        else:
                            # If the phrases don't intersect, then extend the results with the new entry
                            relevant_ollie_results.extend([[ollie_relation, keyword, [keyword_index.start(), keyword_index.end()]]])
                            
    # Remove consecutive duplicates from results (groupby only merges adjacent equal entries)
    relevant_ollie_results = list(relevant_ollie_results for relevant_ollie_results,_ in itertools.groupby(relevant_ollie_results))
    
    return relevant_ollie_results
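
The core of the function above is a whole-word keyword search that prefers the longer keyword when one keyword contains another (for example 'feeds on' over 'feeds'). The sketch below isolates that idea in a simplified form. `find_longest_keyword` is a hypothetical helper; unlike the original it returns only one keyword per relation, and it adds `re.escape` on the assumption that keywords are plain text, not regular expressions.

```python
import re

def find_longest_keyword(relation, keywords):
    """Return (keyword, [start, end]) for the longest whole-word match, or None."""
    best = None
    text = relation.lower()
    for keyword in keywords:
        # \b anchors the match to word boundaries; re.escape treats the keyword literally.
        match = re.search(r'\b{}\b'.format(re.escape(keyword.lower())), text)
        if match and (best is None or len(keyword) > len(best[0])):
            best = (keyword, [match.start(), match.end()])
    return best

result = find_longest_keyword('Lynx canadensis feeds on Lepus americanus',
                              ['feeds', 'feeds on', 'eat'])
print(result)   # ('feeds on', [16, 24])
```

The returned span plays the same role as the `[keyword_index.start(), keyword_index.end()]` pair stored in each result entry.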

## Part 6: Function to Identify Scientific Names and Add a Final Classification
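
The function in this part repeatedly looks for the closest entry in `sci_names` and accepts it only when the edit distance is below 2, that is, at most one inserted, deleted, or substituted character. The sketch below shows that search on its own. It uses a small pure-Python edit distance as a stand-in for `Levenshtein.distance` so that it runs without the library, and `closest_name` is a hypothetical helper name.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def closest_name(word, names, threshold=2):
    """Return (distance, name) for the closest name below threshold.

    When nothing is close enough, the sentinel (threshold, 'noname') is
    returned, mirroring the `metric = (2, 'noname')` starting value.
    """
    best = (threshold, 'noname')
    for name in names:
        d = edit_distance(word, name)
        if d < best[0]:
            best = (d, name)
    return best

known = ['Apis mellifera', 'Bombus terrestris']
print(closest_name('Apis melifera', known))   # (1, 'Apis mellifera')
print(closest_name('honey bee', known))       # (2, 'noname')
```

Because every candidate word is compared against every known name, this step grows with the size of the names list; that cost is inherent to the linear scan, not to this sketch.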

In [None]:
# This function "identify_and_classify" takes the relevant trophic relations found by the "searching_ollie_results" function and locates scientific names while giving them a final classification
def identify_and_classify(relevant_ollie_relations):
    all_final_classifications = []
    # Iterate through sentences of previous output
    for i in range(0,len(relevant_ollie_relations)):
        current_relation = []
        words_in_official_scientific_name = []
        words_added_to_current_relation = []
        # Saving part of each result as a variable
        sentence = relevant_ollie_relations[i][0]
        phrase = relevant_ollie_relations[i][1]
        words_added_to_current_relation.append(phrase)
        current_relation.append(relevant_ollie_relations[i][1])
        keyword_string_index = relevant_ollie_relations[i][2]
        # Split by spaces
        before_space = sentence[:keyword_string_index[0]].split()
        after_space = sentence[keyword_string_index[0]:].split()
        # Add a rule to check if word is a noun
        is_noun = lambda pos: pos[:2] == 'NN'
        # Finding nouns in sentence using the previously made rule
        nouns = [word for (word, pos) in nltk.pos_tag(before_space+after_space) if is_noun(pos)]
        nouns2 = nouns
        words_in_english_dictionary = []
        words_in_sci_name_list = []
        # Checking if any of the nouns land in the dictionary
        for word in nouns:
            if word not in detritivore:
                if word in english_list:
                    words_in_english_dictionary.append(word)
            # Check if word closely matches a word in list of scientific names
            metric = (2, 'noname')
            for master_name in sci_names:
                new_metric = distance(word, master_name)
                if (new_metric < metric[0]):
                    metric = (new_metric, master_name)
            if metric[1] == 'noname':
                # Check if word is in abbreviated scientific names or common names
                matching = [s for s in abbreviated_names + common_names if word in s]
                if len(matching) > 0:
                    metric = (0, metric[1])
            if metric[0] < 2:
                words_in_sci_name_list.append(word)
        # Regular expressions of different forms of scientific names (raw strings so that '\.' is a literal period)
        patterns = [r'[A-Za-z]\. [a-z]{3,}', r'[A-Z][a-z]{4,}: [A-Z][a-z]{4,}', r'[A-Z][a-z]{2,} [A-Za-z]{3,}']
                #'[A-Z][a-z]{4,} [A-z]{4,}'
        # Finding all regular expression matches in a sentence
        located_patterns = []
        for pattern in patterns:
            searching_for_patterns = re.findall(pattern, relevant_ollie_relations[i][0])
            located_patterns.extend(searching_for_patterns)
        patterns_in_data_list = []
        words_from_model = []
        # Check if regular expression scientific name is closely related to a scientific name in the list
        for located_word in located_patterns:
            # Find the closest scientific name; only an edit distance below 2 counts as a match
            metric = (2, 'noname')
            for master_name in sci_names:
                new_metric = distance(located_word, master_name)
                if (new_metric < metric[0]):
                    metric = (new_metric, master_name)
            if metric[1] == 'noname':
                # Check if word is in abbreviated scientific names or common names
                matching = [s for s in abbreviated_names + common_names if located_word in s]
                if len(matching) > 0 or '.' in located_word:
                    metric = (0, metric[1])
            if metric[0] < 2:
                patterns_in_data_list.append(located_word)
                input_1 = (located_word, 'scientificname')
                # Check if word comes before or after the relational phrase in the sentence
                if sentence.index(located_word) < sentence.lower().index(phrase):
                    if current_relation.index(phrase) == 0: 
                        phrase_index = current_relation.index(phrase)
                        current_relation.insert(0,input_1)
                        words_added_to_current_relation.append(located_word)
                    else:
                        phrase_index = current_relation.index(phrase)
                        current_relation.insert((phrase_index),input_1)
                        words_added_to_current_relation.append(located_word)
                else:
                    phrase_index = current_relation.index(phrase)
                    current_relation.append(input_1)
                    words_added_to_current_relation.append(located_word)
            else:
                # Test the located word against the model
                test = sci_check(located_word)
                input_test = [located_word, test]
                if test.item() == 0:
                    words_from_model.append(located_word)
                    input_1 = (located_word, 'scientificname')
                # If model says it is a scientific word, check if word comes before or after the relational phrase in the sentence
                    if sentence.index(located_word) < sentence.lower().index(phrase):
                        if current_relation.index(phrase) == 0: 
                            phrase_index = current_relation.index(phrase)
                            current_relation.insert(0,input_1)
                            words_added_to_current_relation.append(located_word)
                        else:
                            phrase_index = current_relation.index(phrase)
                            current_relation.insert((phrase_index),input_1)
                            words_added_to_current_relation.append(located_word)
                    else:
                        phrase_index = current_relation.index(phrase)
                        current_relation.append(input_1)
                        words_added_to_current_relation.append(located_word)
                
        # Checking through nouns to see if any of them combined is a scientific name
        if len(nouns) > 0:
            for i in range(0, (len(nouns)-1)):
                possible_scientific_word = nouns[i]+' '+nouns[i+1]
                metric = (2, 'noname')
                for master_name in sci_names:
                    new_metric = distance(possible_scientific_word, master_name)
                    if (new_metric < metric[0]):
                        metric = (new_metric, master_name)
                if metric[0] < 2 and metric[1] != 'noname':
                    if possible_scientific_word in sentence:
                        # Check if the combination of the words is already in one of the lists of scientific words that have already been found
                        if any((possible_scientific_word in s for s in patterns_in_data_list + [', '.join(words_added_to_current_relation)])) == False:
                            for each in possible_scientific_word.split():
                                words_in_official_scientific_name.append(each)
                            # Check if word came before or after phrase  
                            if sentence.index(possible_scientific_word) < sentence.lower().index(phrase):
                                phrase_index = current_relation.index(phrase)
                                if phrase_index == 0:
                                    input_1 = (possible_scientific_word, 'possiblescientificname')
                                    current_relation.insert(0, input_1)
                                    words_added_to_current_relation.append(possible_scientific_word)
                                else:
                                    phrase_index = current_relation.index(phrase)
                                    input_1 = (possible_scientific_word, 'possiblescientificname')
                                    current_relation.insert(phrase_index, input_1)
                                    words_added_to_current_relation.append(possible_scientific_word)
                            # Add word to the end because it comes after the phrase
                            else:
                                input_1 = (possible_scientific_word, 'possiblescientificname')
                                current_relation.append(input_1)
                                words_added_to_current_relation.append(possible_scientific_word)
                    
                else:
                    # Split the possible scientific word into its parts and see if its in a list
                    splitted = possible_scientific_word.split()
                    # Check if substring is in the list of scientific names
                    for substring in possible_scientific_word.split():
                        metric = (2, 'noname')
                        for master_name in sci_names:
                            new_metric = distance(substring, master_name)
                            if (new_metric < metric[0]):
                                metric = (new_metric, substring)
                        commonnamesfull = [x for x in detritivore]
                        # Check if the substring has already been added or checked
                        if metric[1] != 'noname':
                            if substring not in ', '.join(patterns_in_data_list + words_from_model).split() + (' '.join(words_added_to_current_relation)).split():
                                input_1 = (substring, 'scientificname')
                                # If the substring has not already been added, check if it comes before or after the phrase
                                if substring.lower() not in english_list and len(substring) > 2:
                                    if sentence.index(substring) < sentence.lower().index(phrase):
                                        phrase_index = current_relation.index(phrase)
                                        if phrase_index == 0:
                                            current_relation.insert(0, input_1)
                                            words_added_to_current_relation.append(substring)
                                        else:
                                            phrase_index = current_relation.index(phrase)
                                            current_relation.insert(phrase_index, input_1)
                                            words_added_to_current_relation.append(substring)
                                    # Add the substring to the end if the word comes after the phrase      
                                    else:
                                        words_added_to_current_relation.append(substring)
                                        current_relation.append(input_1)
                                else:
                                    dictionaryterm = substring
                        # Check if it is in the list of common names in the detritivore category
                        elif any(substring in s for s in commonnamesfull):
                            if substring not in ', '.join(words_added_to_current_relation):
                                input_1 = (substring, 'common scientificname')
                                # Check if word comes before or after the phrase
                                if sentence.index(substring) < sentence.lower().index(phrase):
                                    phrase_index = current_relation.index(phrase)
                                    if phrase_index == 0:
                                        current_relation.insert(0, input_1)
                                        words_added_to_current_relation.append(substring)
                                    else:
                                        phrase_index = current_relation.index(phrase)
                                        current_relation.insert(phrase_index, input_1)
                                        words_added_to_current_relation.append(substring)
                                else:
                                    current_relation.append(input_1)
                                    words_added_to_current_relation.append(substring)
                            
                        else:
                            # Check if word is a scientific word based on the model
                            test = sci_check(substring)
                            input_test = [substring, test]
                            if test.item() == 0:
                                words_from_model.append(substring)
                                # Add possible scientific name if not in previously added lists
                                if substring not in (' '.join(patterns_in_data_list)).split():
                                    if substring not in ', '.join(words_added_to_current_relation):
                                        if substring not in english_list:
                                            input_1 = (substring, 'possiblescientificname')
                                            if sentence.index(substring) < sentence.lower().index(phrase):
                                                phrase_index = current_relation.index(phrase)
                                                if phrase_index == 0:
                                                    current_relation.insert(0, input_1)
                                                    words_added_to_current_relation.append(substring)
                                                else:
                                                    phrase_index = current_relation.index(phrase)
                                                    current_relation.insert(phrase_index, input_1)
                                                    words_added_to_current_relation.append(substring)
                                            else:
                                                phrase_index = current_relation.index(phrase)
                                                current_relation.append(input_1)
                                                words_added_to_current_relation.append(substring)
                                
                            
        # Using categories of directionality to figure out sentence classification                       
        current_relation = [x[0] for x in groupby(current_relation)]
        phrase_index = current_relation.index(phrase)
        classifications_and_relation = []
        if any('scientificname' in part or 'possiblescientificname' in part or 'common scientificname' in part for part in current_relation):
            if len(current_relation) > 2:
                # Separate sentences based on before and after phrase
                classifications_and_relation.append(sentence)
                words_after_phrase = [nm for nm in current_relation[phrase_index:] if 'scientificname' in nm[1]]
                words_before_phrase = [nm for nm in current_relation[:phrase_index] if 'scientificname' in nm[1]]
                # Check if phrase in lefttoright and if there is a word after the phrase
                if phrase in lefttoright and (phrase_index+1) < len(current_relation):
                    # Check if word in scientific name list or common name list or abbreviated name list
                    for k in range(0, len(words_after_phrase)):
                        final_result = 'default'
                        word = words_after_phrase[k][0]
                        if word not in detritivore:
                            metric = (2, 'noname')
                            for master_name in sci_names:
                                new_metric = distance(word, master_name)
                                if (new_metric < metric[0]):
                                    metric = (new_metric, master_name)
                            if metric[1] == 'noname':
                                # Check if word in common name list
                                matching = [s for s in common_names if word in s]
                                if len(matching) > 0:
                                    metric = (0, metric[1]) 
                                    kept = min(matching, key=len)
                                    final_result = common_names_dict[min(matching, key=len)]
                                    metric = list(metric)
                                    metric[1] = min(matching, key=len)
                                else:
                                    # Check if word in abbreviated name list
                                    matching = [s for s in abbreviated_names if word in s]
                                    if len(matching) > 0:
                                        final_result = 'not available'
                            else:
                                if metric[0] < 2:
                                    final_result = sci_name_dict[metric[1]]
                                else:
                                    final_result = 'none'
                        # Check if words in sentence are in the parasite list of words
                        sentencesplit = sentence.split()
                        parasite_testing = [x for x in parasite if x in sentence]
                        # If parasite words are in sentence, add parasite classification
                        if len(parasite_testing)>0:
                            for i in range(0, len(words_before_phrase)):
                                classifications_and_relation.append(str(words_before_phrase[i][0] + " is a parasite" + " - " + ', '.join(parasite_testing)))
                        # Check if Animalia is in the kingdom designation 
                        elif 'Animalia' in str(final_result):
                            for i in range(0, len(words_before_phrase)):
                                classifications_and_relation.append( str(words_before_phrase[i][0] + " is a carnivore" + ' - ' + metric[1]))
                                # If a herbivore classification was already added for this word, also add an omnivore classification.
                                # Entries carry a ' - ...' suffix, so match on the prefix instead of the exact string.
                                if any(c.startswith(str(words_before_phrase[i][0]) + ' is a herbivore') for c in classifications_and_relation):
                                    classifications_and_relation.append( str(words_before_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
                        # Check if Plantae is in the kingdom designation
                        elif 'Plantae' in str(final_result):
                            for i in range(0, len(words_before_phrase)):
                                classifications_and_relation.append(str(words_before_phrase[i][0] + " is a herbivore" + ' - ' + metric[1]))
                                # If a carnivore classification was already added for this word, also add an omnivore classification.
                                # Entries carry a ' - ...' suffix, so match on the prefix instead of the exact string.
                                if any(c.startswith(str(words_before_phrase[i][0]) + ' is a carnivore') for c in classifications_and_relation):
                                    classifications_and_relation.append( str(words_before_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
                        # Check if word is in detritivore list
                        elif word in detritivore:
                            for i in range(0, len(words_before_phrase)):
                                # Add the detritivore classification
                                classifications_and_relation.append(str(words_before_phrase[i][0] + " is a detritivore" + ' - ' + word))
                        else:
                            tag = 'sentence of interest'
                # If phrase in detritivore and right to left, the word after the phrase is classified
                elif phrase in detritivore and phrase in righttoleft:
                    for k in range(0, len(words_after_phrase)):
                        # Add detritivore classification
                        classifications_and_relation.append(str(words_after_phrase[k][0] + " is a detritivore" + ' - ' + phrase))
                # Check if phrase in righttoleft and if there is a word after the phrase
                elif phrase in righttoleft and (phrase_index+1) < len(current_relation):
                    # Check if word in scientific name list or common name list or abbreviated name list
                    for k in range(0, len(words_before_phrase)):   
                        final_result = 'default'
                        word = words_before_phrase[k][0]
                        if word not in detritivore:
                            metric = (2, 'noname')
                            for master_name in sci_names:
                                new_metric = distance(word, master_name)
                                if (new_metric < metric[0]):
                                    metric = (new_metric, master_name)
                            # Check if word in common name list
                            if metric[1] == 'noname':
                                matching = [s for s in common_names if word in s]
                                if len(matching) > 0:
                                    metric = (0, metric[1])
                                    kept = (min(matching, key=len))
                                    final_result = common_names_dict[min(matching, key=len)]
                                    metric = list(metric)
                                    metric[1] = kept
                                # Check if word in abbreviated name list
                                else:
                                    matching = [s for s in abbreviated_names if word in s]
                                    if len(matching) > 0:
                                        final_result = 'not available'
                            else:
                                if metric[0] < 2:
                                    final_result = sci_name_dict[metric[1]]
                                else:
                                    final_result = 'none'
                        # Check if words in sentence are in the parasite list of words
                        sentencesplit = sentence.split()
                        parasite_testing = [x for x in parasite if x in sentence]
                        # If parasite words are in sentence, add parasite classification
                        if len(parasite_testing)>0:
                            for i in range(0, len(words_after_phrase)):
                                classifications_and_relation.append(str(words_after_phrase[i][0] + " is a parasite" + ' - ' + ', '.join(parasite_testing)))
                        # Check if Animalia is in the kingdom designation
                        elif 'Animalia' in str(final_result):
                            for i in range(0, len(words_after_phrase)):
                                classifications_and_relation.append( str(words_after_phrase[i][0] + " is a carnivore" + ' - ' + metric[1]))
                                # If the noun after the phrase is already classified as a herbivore, also add the omnivore classification
                                # (the check uses words_after_phrase, the list this loop iterates over)
                                if any(c.startswith(str(words_after_phrase[i][0]) + ' is a herbivore') for c in classifications_and_relation):
                                    classifications_and_relation.append( str(words_after_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
                        # Check if Plantae is in the kingdom designation
                        elif 'Plantae' in str(final_result):
                            for i in range(0, len(words_after_phrase)):
                                classifications_and_relation.append(str(words_after_phrase[i][0] + " is a herbivore" + ' - ' + metric[1]))
                                # If the noun is already classified as a carnivore, also add the omnivore classification
                                # (entries end with ' - <decider>', so a prefix test is needed instead of an exact list match)
                                if any(c.startswith(str(words_after_phrase[i][0]) + ' is a carnivore') for c in classifications_and_relation):
                                    classifications_and_relation.append( str(words_after_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
                        # Check if word is in detritivore list
                        elif word in detritivore:
                            for i in range(0, len(words_after_phrase)):
                                # Add the detritivore classification
                                classifications_and_relation.append(str(words_after_phrase[i][0] + " is a detritivore" + ' - ' + word))
                        else:
                            tag = 'sentence of interest'
                # Check if phrase is in list of reflexive keywords
                elif phrase in reflexive.keys():
                    words_before_phrase = [nm for nm in current_relation[:phrase_index] if 'scientificname' in nm[1]]
                    words_after_phrase = [nm for nm in current_relation[phrase_index:] if 'scientificname' in nm[1]]
                    # Check whether more scientific names come after the keyword than before it
                    if len(words_after_phrase) > len(words_before_phrase):
                        #Iterate through words after phrase and classify
                        for i in range(0, len(words_after_phrase)):
                            # Add omnivore classification
                            if reflexive[phrase] == 'omnivore':
                                classifications_and_relation.append(str(words_after_phrase[i][0] + ' is an ' + reflexive[phrase] + ' - ' + phrase))
                            # Add herbivore classification
                            elif reflexive[phrase] == 'herbivore':
                                classifications_and_relation.append(str(words_after_phrase[i][0] + ' is a ' + reflexive[phrase] + ' - ' + phrase))
                                # Add omnivore classification if carnivore and herbivore classification are present
                                if any(c.startswith(str(words_after_phrase[i][0]) + ' is a carnivore') for c in classifications_and_relation):
                                    classifications_and_relation.append( str(words_after_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
                            # Add carnivore classification
                            elif reflexive[phrase] == 'carnivore':
                                classifications_and_relation.append(str(words_after_phrase[i][0] + ' is a ' + reflexive[phrase] +  ' - ' + phrase))
                                # Add omnivore classification if carnivore and herbivore classification are present
                                if any(c.startswith(str(words_after_phrase[i][0]) + ' is a herbivore') for c in classifications_and_relation):
                                    classifications_and_relation.append( str(words_after_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
                            else:
                                classifications_and_relation.append(str(words_after_phrase[i][0] + ' is a ' + reflexive[phrase] + ' - ' + phrase))
                    else:
                        # Iterate through list of words before phrase
                        for i in range(0, len(words_before_phrase)):
                            # Add omnivore classification
                            if reflexive[phrase] == 'omnivore':
                                classifications_and_relation.append(str(words_before_phrase[i][0] + ' is an ' + reflexive[phrase] + ' - ' + phrase))
                            # Add herbivore classification
                            elif reflexive[phrase] == 'herbivore':
                                classifications_and_relation.append(str(words_before_phrase[i][0] + ' is a ' + reflexive[phrase] + ' - ' + phrase))
                                # Add omnivore classification if carnivore and herbivore classification are present
                                if any(c.startswith(str(words_before_phrase[i][0]) + ' is a carnivore') for c in classifications_and_relation):
                                    classifications_and_relation.append( str(words_before_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
                            # Add carnivore classification
                            elif reflexive[phrase] == 'carnivore':
                                classifications_and_relation.append(str(words_before_phrase[i][0] + ' is a ' + reflexive[phrase] + ' - ' + phrase))
                                # Add omnivore classification if carnivore and herbivore classification are present
                                if any(c.startswith(str(words_before_phrase[i][0]) + ' is a herbivore') for c in classifications_and_relation):
                                    classifications_and_relation.append( str(words_before_phrase[i][0] + " is an omnivore" + ' - ' + 'Herb+Carn'))
                            else:
                                # Keep the ' - <decider>' suffix so every entry has the same shape
                                classifications_and_relation.append(str(words_before_phrase[i][0] + ' is a ' + reflexive[phrase] + ' - ' + phrase))
                    
            # If length of relation and scientific name is equal to two
            elif len(current_relation) == 2:
                words_before_phrase = [nm for nm in current_relation[:phrase_index] if 'scientificname' in nm[1]]
                words_after_phrase = [nm for nm in current_relation[phrase_index:] if 'scientificname' in nm[1]]
                classifications_and_relation.append(sentence)
                # Add final classification depending on if phrase comes before or after scientific name
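                if phrase in reflexive.keys() and current_relation.index(phrase) == 1:
                    # Use 'an' only for omnivore so the output reads 'is a herbivore' / 'is an omnivore'
                    article = ' is an ' if reflexive[phrase] == 'omnivore' else ' is a '
                    for i in range(0, len(words_before_phrase)):
                        classifications_and_relation.append(str(words_before_phrase[i][0] + article + reflexive[phrase] + ' - ' + phrase))
                elif phrase in reflexive.keys() and current_relation.index(phrase) == 0:
                    # Use 'an' only for omnivore so the output reads 'is a herbivore' / 'is an omnivore'
                    article = ' is an ' if reflexive[phrase] == 'omnivore' else ' is a '
                    for i in range(0, len(words_after_phrase)):
                        classifications_and_relation.append(str(words_after_phrase[i][0] + article + reflexive[phrase] + ' - ' + phrase))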
                elif phrase in detritivore and phrase in righttoleft:
                    for k in range(0, len(words_after_phrase)):
                        classifications_and_relation.append(current_relation[1][0] + " is a detritivore" + ' - ' + phrase)
                else:
                    tag = 'Sentence of interest'   
            else:
                tag = 'Sentence of interest'
        
        # Check if length of input is greater than one
        if len(classifications_and_relation)>1:
            # Remove duplicates
            classifications_and_relation = list(unique_everseen(classifications_and_relation))
            # Add relation and classifications to overall list
            all_final_classifications.append(classifications_and_relation)

    return all_final_classifications
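
The name lookup inside `identify_and_classify` keeps the entry of `sci_names` that is closest to the candidate word, provided its distance is below 2. The sketch below isolates that step so it can be tried on its own. It assumes that `distance` is a Levenshtein edit distance (the import is not shown in this part of the notebook), so a small pure-Python version is used instead. The helper names `levenshtein` and `closest_name` and the toy name list are hypothetical and not part of the pipeline.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_name(word, names, threshold=2):
    """Return (distance, name) for the best match strictly below threshold,
    or (threshold, 'noname') when no name is close enough."""
    best = (threshold, 'noname')
    for name in names:
        d = levenshtein(word, name)
        if d < best[0]:
            best = (d, name)
    return best


# Toy stand-in for the scientific names list (hypothetical data)
toy_sci_names = ['Canis lupus', 'Quercus alba']
print(closest_name('Canis lupis', toy_sci_names))  # one substitution away
print(closest_name('Felis catus', toy_sci_names))  # no match below the threshold
```

With a threshold of 2, only exact matches and single-character differences are accepted, which tolerates small extraction errors while keeping unrelated names apart.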


## Part 7: Function to Clean Results and Output to a Text File

In [None]:
# Function that cleans results and outputs it to a file
def send_classifications_to_file(classifications, output_file_name, analyzed_file_name):
    # Creating list for triplets to be written to file
    final_classification_triplets = []
    # Adding the name of the analyzed file
    final_classification_triplets.append((analyzed_file_name, '--', '--'))
    # Iterating through the final classification for each relation
    for classification in classifications:
        # Creating a list of the final classifications for each relation
        # (split on the padded ' - ' separator so hyphenated names stay intact)
        categories = [x.split(' - ', 1)[0] for x in classification[1:]]
        # Iterating through each final classification
        for classified_noun in list(set(categories)):
            # Creating a list of the decider(s) for each classification, skipping entries without a decider
            duplicate_classifications = [y.split(' - ', 1)[1].strip() for x in classifications for y in x if classified_noun in y and ' - ' in y]
            # Adding a triplet of the relation, the noun classified and the decider to the list of triplets
            final_classification_triplets.append((classification[0].strip(), classified_noun.strip(), list(set(duplicate_classifications))))
    # Write to the output file passed as an argument, adding the header row only when the file does not exist yet
    if not os.path.isfile(output_file_name):
        # Adding a triplet that represents the header to the triplet list
        final_classification_triplets.insert(0, ('Ollie Relation', 'Final Classification', 'Classification Decider'))
        with open(output_file_name, 'w', newline='') as f:
            writer = csv.writer(f, delimiter='\t')
            writer.writerows(final_classification_triplets)
    else:
        # Appending to the file if it already exists
        with open(output_file_name, 'a', newline='') as f:
            writer = csv.writer(f, delimiter='\t')
            writer.writerows(final_classification_triplets)
    
    return 'Outputs written to file'
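
The result file is tab-separated and is shared by all analyzed PDFs, so the header row should appear only once, when the file is first created. The following is a minimal sketch of that write-or-append pattern, run against a temporary folder so that no real result file is touched. The helper `append_rows` and the sample rows are hypothetical and only illustrate the idea.

```python
import csv
import os
import tempfile


def append_rows(path, rows, header):
    """Append rows to a tab-separated file, writing the header only for a new file."""
    is_new = not os.path.isfile(path)
    # newline='' lets the csv module control line endings on every platform
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)


header = ('Ollie Relation', 'Final Classification', 'Classification Decider')
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'results.txt')
    # Two calls simulate two analyzed PDFs writing to the same result file
    append_rows(path, [('paper1.pdf', '--', '--')], header)
    append_rows(path, [('paper2.pdf', '--', '--')], header)
    with open(path, newline='') as f:
        lines = list(csv.reader(f, delimiter='\t'))

for line in lines:
    print(line)
```

Reading the file back with `csv.reader` and the same delimiter is also a convenient way to load the results for later analysis.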

## Part 8: Analyzing Files

In [None]:
# Iterating through list of files to be analyzed
print('Analyzing files')
for file in files_to_be_analyzed:
    print(file)
    extracted_text = ''
    # Open the file
    document = fitz.open(file) 
    # Iterate through pages in the file
    for page in document:
        # Extract the text from each page in the document
        # (get_text replaces the deprecated getText name in recent PyMuPDF releases)
        texts = page.get_text('text')
        # Add the extracted text to the text object
        extracted_text = extracted_text + texts
    # Replace dashes in the extracted text
    extracted_text = extracted_text.replace(" -", "")
    extracted_text = extracted_text.replace(" - ", "")
    extracted_text = extracted_text.replace("- ", "")
    # Split the extracted text where it says References and take all the text above
    extracted_text = extracted_text.split("References",maxsplit=1)[0]
    # Extracted text file name (os.path.splitext removes only the extension, so dots inside the name are kept)
    extracted_text_file_name = os.path.splitext(file)[0] + '-extracted-text.txt'
    # Ollie extractions file name
    relation_file_name = os.path.splitext(file)[0] + '-relations.txt'
    # Adding a new line between every sentence
    text_with_new_lines = re.sub(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', '\n', extracted_text)
    # Replacing newlines between lowercase characters with a single space
    replace_new_lines_between_lowercase_characters = re.sub(r'(\n+)(?=[a-z])', " ", text_with_new_lines)
    # Save the extracted text to a file
    with open(extracted_text_file_name, 'w') as out:
        out.write(replace_new_lines_between_lowercase_characters)
    # Save the Ollie extractions to a file
    with open(relation_file_name, 'w') as f:
        # Running the Ollie tool on the extracted text file
        subprocess.run(["java", "-Xmx512m", "-jar", ollie_file, extracted_text_file_name], stdout=f)
    # Opening the files with relations
    with open(relation_file_name) as f:
        relations = []
        for line in f:
            # Iterating through relations and extracting lines that start with a confidence score
            if re.match(r"^\d+.*$",line):
                # Adding the found lines to the relations list
                relations.append(line)
        # Replace characters and spaces added by Ollie
        relations = [relation.replace("\n", "") for relation in relations]
        relations = [relation.replace(")[enabler", ", ") for relation in relations]
        relations = [relation.replace(")[", ", ") for relation in relations]
        relations = [relation.replace("attrib=", " attribute ") for relation in relations]
        # Search through relations for keywords
        relations_with_keyword = searching_ollie_results(relations)
        print(relations_with_keyword)
        # Analyze relations and add a final classification
        classifications = identify_and_classify(relations_with_keyword)
        print(classifications)
        # Writing the classifications to a file
        send_classifications_to_file(classifications, output_file, file)
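
Two text-handling steps in the cell above are easier to follow on a small input: the regular expressions that put one sentence on each line, and the filter that keeps only the Ollie output lines starting with a confidence score. The sketch below applies the same patterns to made-up strings. The sample text and the sample Ollie lines are hypothetical and only approximate the real output format.

```python
import re

# Made-up text standing in for the text extracted from a PDF
raw_text = "Dr. Smith saw wolves. They eat\ndeer? Yes."

# Break after '.' or '?' when a capital letter follows, unless the character
# two places before the punctuation is a capital (as in "Dr.")
with_breaks = re.sub(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', '\n', raw_text)
# Rejoin lines that were broken mid-sentence (newline followed by lowercase)
cleaned = re.sub(r'(\n+)(?=[a-z])', ' ', with_breaks)
sentences = cleaned.split('\n')
print(sentences)

# Made-up lines standing in for the Ollie relations file
ollie_output = [
    "Wolves eat deer.",
    "0.85: (Wolves; eat; deer)",
    "No extractions found.",
    "",
]
# Keep only lines that begin with a digit, i.e. a confidence score
relations = [line for line in ollie_output if re.match(r"^\d+.*$", line)]
print(relations)
```

One limitation of the digit test is that a source sentence which itself begins with a number would also be kept, so a stricter pattern may be worth considering if that occurs in your PDFs.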
    