# The Ripple Effect-Table Creator

# Notes

## Inputs

Impact Statements
- Great Lakes Foodweb
- NOAA's Current Impact Statements
- NOAA's Reference List
- Nonindigenous aquatic species (what is invasive)
- Waterlife Data
- Technical Memorandum
- Species ID List

Networks
- [NHDPlus Great Lakes Data (Vector Processing Unit 04) | US EPA](https://www.epa.gov/waterdata/nhdplus-great-lakes-data-vector-processing-unit-04)
- USGS geojson data


## Outputs

Impact Statements For NOAA
- impact_statements.xlsx


Tables for our Database
- Network Graph
  - impact_rel.csv
  - invasive_species.csv
  - species.csv
- Waterways Map
  - species_observed.csv
- Ripple Plot
  - target_data_rel.csv
  - target_dropdown_impacted.csv
  - target_dropdown_impacter.csv


## Imports

In [1]:
!apt-get install build-essential libpoppler-cpp-dev pkg-config python3-dev
!pip install pdftotext

import pdftotext
import re
import os
import pandas as pd
import numpy as np
import csv
import urllib.request
import networkx as nx
import datetime as dt
import json
import pickle

from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from scipy.spatial import cKDTree
from sklearn.neighbors import BallTree, KDTree


from google.colab import drive
drive.mount('assets')


## File Paths

In [2]:
main_path = '/content/assets/Shared drives/ermiasb-rjbowman-tobyk/'
data_path = main_path + 'data/'
assets_path = main_path + 'assets/'
results_path = main_path + 'results/'

## Import Files

In [3]:
# Impact Statement Inputs
foodweb_file = data_path + "FoodWeb-Fact-Sheet-Matrices.xlsx"
waterlife_data_excel_file = data_path + 'Waterlife_data_5272021.xlsx'
old_impact_statements_file = data_path + "2021-07-16-impacts.csv"
noaa_invasive_species_file = data_path + 'glansis-species.csv'
noaa_watchlist_species_file = data_path + 'glansis-watchlist-species.csv'
impact_references_file = data_path + 'impact_statement_references_existing.csv'
tsn_file = data_path + 'species_to_tsn.csv'
species_to_species_id_file = data_path + 'species_to_species_id.csv'
reference_match_file = data_path +'references_match.xlsx'
existing_reference_path = data_path + 'existing_references.xlsx'
species_id_to_scientific_file = data_path + 'species_id_to_scientific.csv'

# Network Inputs
rivers_file = data_path + 'rivers.geojson'
lakes_file = data_path + 'lake.geojson'
invasion_file = data_path + 'NAS-Data-Download-06-09-2021-59463.csv'
waterways_edges_file = data_path + "waterway_edges.csv"

## Asset Files

In [4]:
# Impact Statement Asset Outputs
impacted_impacter_file = assets_path + 'impacter_impacted.csv'
index_to_species_file = assets_path + 'index_to_species.csv'
impact_line_file = assets_path + 'impact_lines.txt'
impact_statement_file = assets_path + 'impact_statements.txt'
pred_prey_file = assets_path + 'pred_prey.csv'
impact_relationships_file = assets_path + 'impact_relationships.txt'
references_file = assets_path + 'references.txt'
database_upload_file = assets_path + 'database_upload.txt'
scientific_to_common_file = assets_path + 'scientific_to_common.csv'
impacter_impacted_distance_named_file = assets_path + 'impacter_impacted_distance_named.csv'

# Network Asset Outputs
observations_waterways_distances_file = assets_path + 'ovservations_waterways_distance.csv'
species_to_index_file = assets_path + "species_to_index.csv"
index_to_species_file = assets_path + "index_to_species.csv"
relationships_named_file = assets_path +'relationships_file_named.csv'
relationships_keyed_file = assets_path + 'relationships_file_keyed.csv'
waterways_dataframe_file = assets_path + 'waterways.csv'
waterways_file = assets_path + 'waterways.p'
specimens_locations_file = assets_path + "specimen_locations.csv"


## Export Files

In [5]:
# Impact Statement Outputs for NOAA
impact_statements_file = results_path + 'impact_statements.xlsx'

# Table Outputs for Database

#Network Graph
species_file = results_path + 'species.csv'
impact_rel_file =  results_path + 'impact_rel.csv'
invasive_species_file = results_path + 'invasive_species.csv'

# Waterways Map
species_observed_file = results_path + 'species_observed.csv'

# Ripple Outputs
invasive_impacters_dropdown_file = results_path + 'target_dropdown_impacter.csv'
impacted_species_dropdown_file = results_path + 'target_dropdown_impacted.csv'
impacter_impacted_distance_file = results_path + 'target_data_rel.csv'

## Global Variables

In [6]:
NOAA_url = 'https://www.glerl.noaa.gov/pubs/tech_reports/glerl-'
technical_files = ['tm-161.pdf', 'tm-161b.pdf', 'tm-161c.pdf', 'tm-169.pdf', 
                   'tm-169b.pdf', 'tm-169c.pdf']

sheet_names = ['Lake Michigan', 'Lake Huron', 'Lake Superior', 
               'Lake Ontario', 'Lake Erie', 'Lake St. Clair']

# Creating Impact Statements

## Creating Tools

In [7]:
def dictionary_to_csv(dictionary, filepath):
  """
  inputs a dictionary and a filepath
  outputs (saves) a csv file 
  """
  with open(filepath, 'w', newline='') as file:
    writer = csv.writer(file)
    for key, value in dictionary.items():
      writer.writerow([key, value])
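
A minimal round-trip sketch of the helper above, assuming a throwaway temporary directory and made-up sample data (neither comes from the project files). It passes `newline=''` when opening the file, which the `csv` module documentation recommends so that no extra blank rows appear on some platforms.

```python
import csv
import os
import tempfile

def dictionary_to_csv(dictionary, filepath):
  """Write each key/value pair of a dictionary as one CSV row."""
  with open(filepath, 'w', newline='') as file:
    writer = csv.writer(file)
    for key, value in dictionary.items():
      writer.writerow([key, value])

# Hypothetical sample data, used only for illustration
sample = {'cyprinus carpio': 'common carp',
          'dreissena polymorpha': 'zebra mussel'}

# Write to a temporary location rather than the shared drive
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
dictionary_to_csv(sample, sample_path)

# Read the file back to confirm one row per dictionary entry
with open(sample_path, newline='') as file:
  rows = list(csv.reader(file))
```

Each row holds exactly two columns, so a dictionary whose values are lists would be written with the list's string representation in the second column.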

## Downloading Files

In [8]:
def download_NOAA_technical_memorandums(base_url, download_list):
  """
  inputs a base URL for NOAA's technical memorandums and a list of files to download
  downloads the files and saves them to the asset path
  returns None
  """
  
  print('Downloading NOAA technical memorandum files....')

  existing_files = os.listdir(assets_path)
  
  for file_name in download_list:
    if file_name not in existing_files:
      url_portion = file_name.split('-')[-1]
      url_portion = url_portion.split('.')[0]
      urllib.request.urlretrieve(base_url + url_portion + '/' + file_name, assets_path + file_name)
  
  return None
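
The URL construction inside the loop above can be checked without touching the network. The sketch below isolates that string logic in a hypothetical helper, `memorandum_url` (not part of the notebook), and shows only what the code builds, not whether the resulting address is still served.

```python
def memorandum_url(base_url, file_name):
  """Rebuild the download URL the same way the loop above does."""
  # 'tm-161b.pdf' -> '161b.pdf' -> '161b'
  report_number = file_name.split('-')[-1].split('.')[0]
  return base_url + report_number + '/' + file_name

NOAA_url = 'https://www.glerl.noaa.gov/pubs/tech_reports/glerl-'
url = memorandum_url(NOAA_url, 'tm-161b.pdf')
```

The report number taken from the file name becomes the folder name, so `tm-161b.pdf` is requested from the `glerl-161b/` folder.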

## Cleaning Data

In [9]:
def fix_document_errors(pdf_file, page_idx, lines):
  """
  inputs a pdf file, a page from that file, and all the lines from that page
  changes data to fix errors in the original document
  returns corrected lines
  """

  if pdf_file == 'tm-161.pdf':
    if page_idx == 69:
      lines[19] = ''
      lines[20] = '• Anecdotal observations in the early 20th century spurred the reputation of G. affinis as a successful control' # fixes parsing for 20th resulting on 2 lines
    elif page_idx == 437:
      lines[2] = 'Common Name: Redtop' # fixes the typo 'Retop' in document
    elif page_idx == 727:
      lines.insert(26, 'Does the species have some medicinal or research value (outside of research geared towards its control)?') # fixes missing line in document
    elif page_idx == 939:
      lines[0] = 'Scientific Name: Viral hemorrhagic septocemia Virus (Family Novirhabdoviridae, Order Mononegavirales) Genotype IV sublineage b' # fixes scientific name on multiple lines
    elif page_idx == 1018:
      lines[25] = 'Van Overdijk, C.D.A., I.A. Grigorovich, T. Mabee, W.J. Ray, J.J.H. Ciborowski, and H.J. MacIsaac.' # capitalizes van
  
  elif pdf_file == 'tm-161b.pdf':
    if page_idx == 6:
      lines[27] = '●   Could potentially compete with other cladocerans for algal food sources, but this has not been documented' # Fixes missing \n in document

  elif pdf_file == 'tm-169.pdf':
    if page_idx == 328:
      lines[2] = 'Common Name: Blue Catfish, White Cat, White Fulton, Fulton, Humpback Blue, Forktail Cat, Blue Channel Catfish' # fixes common names on multiple lines
    elif page_idx == 1201:
      lines[3] = 'Common Name: Harris Mud Crab, Estuarine Mud Crab, Dwarf Mud Crab, White-fingered (or white-tipped) Mud Crab' # fixes common names on multiple lines
    elif page_idx == 1537:
      lines[31] = 'Bij de Vaate, A., K. Jażdżewski, H. A. M. Ketelaars, S. Gollasch, and G. van der Velde. Geographical' # capitalizes bij
    elif page_idx == 1552:
      lines[0] = 'De Kluijver, M.J., and S.S. Ingalsuo. Macrobenthos of the North Sea- Crustacea. Corophium curvispinum' # capitalize D
      lines[3] = 'De la Cruz, A. Va CNA contra la plaga del ‘repollito’. Conexion Total August 12, 2014 (2014).' # capitalize D
      lines[20] = 'Den Hartog, C., and G. Van der Velde. Invasions by plants and animals into coastal, brackish, and fresh' # capitalize D
      lines[22] = 'Den Hartog, C., F. van den Brink, and G. van der Velde. Why was the invasion of the river Rhine by' # capitalize D
    elif page_idx == 1619:
      lines[3] = 'Scipiloti, D. Fish community in the Stagnone di Marsala: Distribution and resource partitioning as a' # move initial after last name
    elif page_idx == 1630:
      lines[29] = 'Van den Brink F.W.B, G. van der Velde, and A. bij de Vaate. 1991. Amphipod invasion on the Rhine.' # capitalizes van
      lines[31] = 'Van den Brink, F.W.B., G. van der Velde, and A. bij de Vaate. Ecological aspects, explosive range' # capitalizes van
      lines[34] = 'Van Densen, W.L.T. Piscivory and the development of bimodality in the size distribution of 0+ pikeperch' # capitalizes van
    elif page_idx == 1631:
      lines[7] = 'Van der Velde, G., S. Rajagopal, B. Kelleher, I. Musko, B. Vaate, and F. Schram. Ecological impact of' # capitalize van
      lines[10] = 'Van der Velde, G., R.S.E.W. Leuven, D. Platvoet, K. Bacela, M.A.J. Huijbregts, H.W.M. Hendriks, and' # capitalize van
      lines[14] = 'Van Dijk, G.M., and B. van Zanten. Seasonal changes in zooplankton abundance in the lower Rhine' # capitalize van
      lines[18] = 'Van Haaren, T., and J. Soors. Sinelobus stanfordi (Richardson, 1901): A new crustacean invader in' # capitalize van
      lines[20] = 'Van Kessel, N., M. Dorenbosch, M.R.M. De Boer, R.S.E.W. Leuven, and G. Van der Velde. Competition' # capitalize van
      lines[23] = 'Van Overdijk, C.D.A., I.A. Grigorovich, T. Mabee, W.J. Ray, J.J.H. Ciborowski, and H.I. MacIsaac.' # capitalize van
      lines[26] = 'Van Riel, M.C., G. van der Velde, and A. bij de Vaate. To conquer and persist: colonization and' # capitalize van

  elif pdf_file == 'tm-169b.pdf':
    if page_idx == 69:
      lines.insert(4, 'Unknown') # fixes missing line in document
    elif page_idx == 52:
      lines.insert(-3, 'Does it diminish the perceived aesthetic or natural value of the areas it inhabits') # fixes missing questions
    elif page_idx == 222:      
      lines.insert(23, 'Does it diminish the perceived aesthetic or natural value of the areas it inhabits') # fixes missing questions
      lines.insert(23, 'Does it inhibit recreational activities and/or associated tourism') # fixes missing questions
      lines.insert(21, 'Does it negatively affect water quality') # fixes missing questions
      lines.insert(21, 'Does it cause damage to infrastructure') # fixes missing questions
      lines.insert(21, 'Does the species pose some hazard or threat to human health') # fixes missing questions
    elif page_idx == 247:
      lines[6] = 'Bij de Vaate, A. 2003. Degradation and recovery of the freshwater fauna in the lower sections of the' # capitalize bij
      lines[8] = 'Bij de Vaate, A., K. Jazdzewski, H. A. M. Ketelaars, S. Gollasch, and G. van der Velde. 2002.' # capitalize bij
    elif page_idx == 249:
      lines[32] = 'De Kluijver, M.J., and S.S. Ingalsuo. 1999. Macrobenthos of the North Sea- Crustacea. Corophium' # capitalize de
    elif page_idx == 250:
      lines[0] = 'Den Hartog, C., F. van den Brink, and G. van der Velde. 1992. Why was the invasion of the river' # capitalize den
    elif page_idx == 260:
      lines[1] = 'molitrix) planktivory in a floodplain lake of the lower Mississippi River basin. J Fresh Ecol 25:85–93.' # adds period to end reference
    elif page_idx == 263:
      lines[37] = 'Van den Brink, F., G. van der Velde, and A. bij de Vaate. 1989. A note on the immigration of' # capitalize van
    elif page_idx == 264:
      lines[0] = 'Van den Brink, F.W.B., G. van der Velde, and A. bij de Vaate. 1993. Ecological aspects, explosive' # capitalize van
      lines[3] = 'Van der Velde, G., B.G.P. Paffen, and F.W.B van den Brink. 1994. Decline of zebra mussel' # capitalize van
      lines[6] = 'Van der Velde, G., S. Rajagopal, B. Kelleher, I. Musko, B. Vaate, and F. Schram. 2000. Ecological' # capitalize van
      lines[9] = 'Van der Velde, G., S. Rajagopal, F. van den Brink, B. Kelleher, B. Paffen, A. Kempers, and A. bij de' # capitalize van
      lines[13] = 'Van Riel, M.C., G. van der Velde, and A. bij de Vaate. 2006. To conquer and persist: colonization and' # capitalize van

  elif pdf_file == 'tm-169c.pdf':
    if page_idx == 21:
      lines.insert(11, 'Does it outcompete native species for available resources') # fixes missing questions
    elif page_idx == 94:
      lines.pop(27) # fixes checkbox occurring on wrong line
    elif page_idx == 147:
      lines.pop(24) # fixes checkbox occurring on wrong line
    elif page_idx == 151:
      lines.pop(17) # fixes checkbox occurring on wrong line
    elif page_idx == 173:
      lines[22] = 'signal crayfish and juvenile Atlantic salmon. Journal of Fish Biology. 65(2):437-44.' # adds period to end reference
      lines[34] = 'cause massive population decline of an invasive crayfish. Freshwater Biology. 52(6): 1134-1146.' # adds period to end reference
    elif page_idx == 176:
      lines[31] = 'Ludwigia (Onagraceae) on the middle Loire River, France. Aquatic Botany. 90: 143-148.' # adds period to end reference
    elif page_idx == 177:
      lines[20] = 'affects pollinator visitants to a native plant at high abundances. Aquatic Invasions. 3: 357-367.' # adds period to end reference
      lines[22] = 'plants and macroinvertebrates in temperate ponds. Biological Invasions. 13: 2715-2726.' # adds period to end reference, adds space between plants and 

  return lines

In [10]:
def clean_page(page):
  """
  inputs a pdf page as string
  removes formatting characters
  returns a string
  """

  page = page.replace('\t', '')
  page = page.replace('\r', '')
  page = page.replace('\xa0', '')          

  return page

In [11]:
def clean_lines(page):
  """
  inputs a pdf page as string
  breaks lines of the pdf
    removes empty lines, which is necessary for page number detection
  returns a list of lines
  """
  cleaned_lines = []  # renamed so the local list no longer shadows the function name
  lines = page.split('\n')
  for line in lines:
    line = line.strip()
    if line:
      cleaned_lines.append(line)
  return cleaned_lines
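
A short sketch of the two cleaning steps applied in sequence, using compact re-implementations and a made-up page string (the sample text is an assumption, not an excerpt from the memorandums).

```python
def clean_page(page):
  """Remove tabs, carriage returns, and non-breaking spaces."""
  for char in ('\t', '\r', '\xa0'):
    page = page.replace(char, '')
  return page

def clean_lines(page):
  """Split a page into stripped, non-empty lines."""
  return [line.strip() for line in page.split('\n') if line.strip()]

# Hypothetical raw page text with formatting debris and a trailing page number
raw_page = ('Scientific Name: Cyprinus carpio\xa0\r\n\n'
            '\tCommon Name: Common Carp  \n\n12\n')

lines = clean_lines(clean_page(raw_page))
```

Because empty lines are dropped, the page number ends up as the last element of `lines`, which is the position the page-number detection relies on.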

## Detecting Document Sections

In [12]:
def detect_page_number(line, page_number):
  """
  inputs a line from page and the current page number
  detects if the line is an integer and is the next expected page number
  returns the updated page number, or the current one if no match is found
  """
  
  if line.isnumeric():
    if int(line) == page_number + 1:
      page_number = int(line)
  return page_number
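
To illustrate why the check requires the *next expected* number, the sketch below feeds a sequence of hypothetical final lines through a compact copy of the function. Only lines that continue the page sequence advance the counter; a stray year or a caption is ignored.

```python
def detect_page_number(line, page_number):
  """Advance the page number only when the line is the next integer."""
  if line.isnumeric() and int(line) == page_number + 1:
    page_number = int(line)
  return page_number

page_number = 0
history = []
# Hypothetical last lines of five consecutive PDF pages
for last_line in ['1', '2', '2014', 'Figure 3', '3']:
  page_number = detect_page_number(last_line, page_number)
  history.append(page_number)
```

The counter stays at 2 for the pages ending in `2014` and `Figure 3`, then resumes at 3.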

In [13]:
def detect_taxonimic_group(line, taxonomic_group):
  """
  inputs a line and the current taxonomic_group
  detects if the line contains a new taxonomic group
  returns the taxonomic group
  """
  
  if line[:2] == 'A.' and len(line) > 2 and line[2].isnumeric():  # length check avoids an IndexError on a bare 'A.'
    if not line[-1].isnumeric():
      line_list = line.split()
      taxonomic_group = ' '.join(line_list[1:]).strip()

  return taxonomic_group
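
A minimal sketch of the heading test, assuming headings of the form `A.<number> <group name>`. The sample headings are hypothetical and the function is spelled `detect_taxonomic_group` here for readability. A line that ends in a digit, such as a table-of-contents entry with a page number, is skipped.

```python
def detect_taxonomic_group(line, taxonomic_group):
  """Return the group named in an 'A.<n> ...' heading, else the current group."""
  if line[:2] == 'A.' and len(line) > 2 and line[2].isnumeric():
    if not line[-1].isnumeric():
      # Drop the 'A.<n>' token and keep the rest as the group name
      taxonomic_group = ' '.join(line.split()[1:])
  return taxonomic_group

group = ''
# Hypothetical contents entry: ends with a page number, so it is ignored
group = detect_taxonomic_group('A.1 Fishes 12', group)
group_after_contents = group
# Hypothetical section heading: updates the group
group = detect_taxonomic_group('A.1 Fishes', group)
```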

In [14]:
def detect_scientific_name(line, scientific_name, page_number):
  """
  inputs line of text as a string and the current scientific name and the page_number
  detects if there is a new scientific name
  returns the current scientific name
  """

  if 'Scientific Name:' in line and page_number > 0:
    scientific_name = line.split('Scientific Name:')[-1].strip()
  return scientific_name

In [15]:
def detect_common_name(line, common_name):  
  """
  inputs a line of text as a string and the current common name
  detects if a new common name is listed
  returns the common name as a list
  """

  clean_common_names = common_name
  if 'Common Name' in line:
    line = line.replace('Common Name(s):','Common Name:')
    line = line.replace('none','')
    line = line.replace('None','')
    clean_common_names = []
    common_names = line.split('Common Name:')[-1].strip()
    common_names = common_names.split(',')
    for common_name in common_names:
      common_name = common_name.strip().lower()
      if common_name:
        clean_common_names.append(common_name)
  return clean_common_names
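
The parsing steps above can be sketched in isolation. The helper name `parse_common_names` is hypothetical; it mirrors the replacements and splitting in `detect_common_name` for a single line that is already known to contain a common-name field.

```python
def parse_common_names(line):
  """Split a 'Common Name:' line into lowercase names, dropping 'none' entries."""
  line = line.replace('Common Name(s):', 'Common Name:')
  line = line.replace('none', '').replace('None', '')
  names = line.split('Common Name:')[-1].split(',')
  return [name.strip().lower() for name in names if name.strip()]

names = parse_common_names('Common Name(s): Blue Catfish, White Cat, Fulton')
no_names = parse_common_names('Common Name: None')
```

One caveat follows directly from the code: the `'none'` replacement is a plain substring replacement, so it would also remove those letters from inside a longer name.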

In [16]:
def detect_section(line, section, section_triggers, section_count, scientific_name, impact_trigger):
  """
  inputs line of text, current section, list of section triggers, 
    current scientific name, the section count and the impact trigger
  detects if we are in a new section
    calls the missing section function
    changes the section
    updates the section count
  returns the current section and impact trigger
  """
  
  for section_idx, trigger_statements in enumerate(section_triggers):
    for trigger_statement in trigger_statements:
      if trigger_statement in line and scientific_name:
        detect_missing_sections(section, section_idx, scientific_name)
        section = section_idx
        section_count[section] += 1
        impact_trigger = False
  return section, impact_trigger

In [17]:
def detect_missing_sections(current_section, new_section, scientific_name):
  """
  inputs current section, new section index, and scientific_name
  prints a warning if sections are missing 
  returns None
  """

  if (new_section == 0 and (current_section == 1 or current_section == 2)) or \
  (new_section == 2 and current_section != 1) or \
  (new_section == 3 and current_section != 2):
    print('Warning: Missing Section for', scientific_name)

  return None

In [18]:
def detect_question(line, section, question, question_triggers, question_count, impact_trigger):
  """
  receives line, current section, current question, list of question_triggers
    question_count, and impact_trigger
  detects if there is a new question
  returns the current question and the impact_trigger
  """

  section_question_triggers = question_triggers[section]
  for question_idx, section_questions in enumerate(section_question_triggers):
    for section_question in section_questions:      
      if section_question in line:
        question = question_idx
        impact_trigger = False
        question_count[section][question_idx] += 1
  return question, impact_trigger

In [19]:
def get_section_triggers():
  """
  no inputs
  returns a list of strings that denote new sections in the document
  """

  section_triggers = [
    ['Scientific Name:',],
    ['ENVIRONMENTAL IMPACT',],
    ['SOCIO-ECONOMIC IMPACT',],
    ['BENEFICIAL EFFECT', 'BENEFICIAL IMPACT',]                                  
  ]

  return section_triggers

In [20]:
def get_question_triggers():
  """
  no inputs
  returns a nested list of strings that denote new questions, grouped by section
  """

  question_triggers = [
    [],
    [['Environmental Impact Total','Environmental Impacts Total',],
      ['Does the species pose some hazard or threat to the health of native species',],
      ['Does it outcompete native species for available resources',
      'Does it out-compete native species for available resources',],
      ['Does it alter predator-prey relationships?',],
      ['Has it affected any native populations genetically',],
      ['Does it negatively affect water quality',],
      ['Does it alter the physical ecosystem in some way',
      'Does it alter physical components of the ecosystem in some way',]],
    [['Economic Impact Total',],
      ['Does the species pose some hazard or threat to human health',],
      ['Does it cause damage to infrastructure',],
      ['Does it negatively affect water quality',],
      ['Does it harm any markets or economic sectors',
      'Does it negatively affect any markets or economic sectors',],
      ['Does it inhibit recreational activities and/or associated tourism',],
      ['Does it diminish the perceived aesthetic or natural value of the areas it inhabits',]],
    [['Beneficial Effect Total','Positive Impact Total', 'Beneficial Impact Total',],
      ['Does it act as a biological',
      'Does it ac as a biological',],
      ['Is it commercially valuable',],
      ['Is it recreationally valuable',],
      ['Does the species have some medicinal or research value',],
      ['Does the species remove toxins or pollutants from the water or otherwise increase water quality',],
      ['Does the species have a positive ecological impact outside of biological control',
      'Does the species have a positive ecological effect outside of biological control']]
  ]

  return question_triggers
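
The trigger lists are nested three levels deep: section, then question, then alternative wordings of that question. The sketch below shows how a line is matched against one section's triggers; `match_question` is a hypothetical helper, and the trigger list is an abbreviated copy of the Environmental Impact entries above.

```python
def match_question(line, section_question_triggers):
  """Return the index of the last question whose trigger occurs in the line."""
  match = None
  for question_idx, triggers in enumerate(section_question_triggers):
    # Any alternative wording counts as a hit for that question
    if any(trigger in line for trigger in triggers):
      match = question_idx
  return match

# Abbreviated copy of the section 1 (Environmental Impact) triggers
environmental_triggers = [
  ['Environmental Impact Total', 'Environmental Impacts Total'],
  ['Does the species pose some hazard or threat to the health of native species'],
  ['Does it outcompete native species for available resources',
   'Does it out-compete native species for available resources'],
]

line = 'Does it out-compete native species for available resources?'
question = match_question(line, environmental_triggers)
no_question = match_question('Unknown', environmental_triggers)
```

Listing spelling variants side by side lets differently worded copies of the same question map to the same question index.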

In [21]:
def first_char_bullet_point(impact_statement_line):
  """
  inputs an impact statement line
  checks if the first character of the impact statement line is a bullet point
    if this is true, it indicates there is a new impact statement and 
    that this section is using bullet points to delineate impact statements
  returns true or false of the condition
  """

  if impact_statement_line[0] in ['•','●','']:
    return True
  else:
    return False


In [22]:
def new_question(prev_impact_data, impact_data):
  """
  inputs the previous impact data and the current impact data
  checks if the current impact data relates to a new file,
    new species, new impact section, or a new question as compared
    to the previous impact data
    if true, this indicates a new impact statement
  returns true or false of the condition
  """

  if impact_data[0] != prev_impact_data[0] or \
    impact_data[2] != prev_impact_data[2] or \
    impact_data[3] != prev_impact_data[3] or \
    impact_data[4] != prev_impact_data[4]:
    return True
  else:
    return False

In [23]:
def prev_line_reference(prev_impact_statement_line, impact_statement_line, uses_bullets):
  """
  inputs previous impact statement line, current impact statement line and 
    whether the current impact statement is using bullets or not
  if the current statement is not using bullets to delineate new statements
    the function checks for the previous line ending in a reference and the 
    current line starting with a capital letter, indicating that the current
    line is the beginning of a new impact statement
  returns true or false of the condition
  """
  
  if not uses_bullets and \
    prev_impact_statement_line[-2:] == ').' and \
    impact_statement_line[0].isupper():
    return True
  else:
    return False
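
A small sketch of the boundary rule for sections without bullets: a new statement starts when the previous line ends with `).` (a closing citation) and the current line begins with a capital letter. The helper name `starts_new_statement` and the sample lines, including the placeholder citation, are assumptions for illustration.

```python
def starts_new_statement(prev_line, line, uses_bullets):
  """Apply the citation-then-capital rule used when bullets are absent."""
  return (not uses_bullets
          and prev_line.endswith(').')
          and line[:1].isupper())

# Hypothetical lines; '(Author 2000)' is a placeholder, not a real reference
prev_line = 'reduce the abundance of native mussels (Author 2000).'

new_statement = starts_new_statement(prev_line, 'Dense colonies foul intakes', False)
continuation = starts_new_statement(prev_line, 'and foul water intakes', False)
bulleted = starts_new_statement(prev_line, 'Dense colonies foul intakes', True)
```

Slicing with `line[:1]` instead of indexing with `line[0]` keeps the check from raising on an empty line, a small difference from the function above.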

In [24]:
def last_character_hyphen(impact_line):
  """
  inputs an impact line
  checks if the last character of the impact line is a hyphen
  returns true or false of the condition
  """

  ends_with_hyphen = False
  if len(impact_line) > 0:
    if impact_line[-1] == '-':
      ends_with_hyphen = True

  return ends_with_hyphen

## Extracting Data

In [25]:
def exctract_impact_lines(pdf_file, data_path, output_file):
  """
  inputs a pdf_file, the path of the file, and file object to output results
  extracts all lines between impact questions for the impact sections
    (Environmental, Socio-Economic, and Beneficial)
  returns None
  """

  print('Extracting impact lines from ' + pdf_file)

  with open(data_path + pdf_file, 'rb') as f:
      pdf = pdftotext.PDF(f)

      # Set variables for impact extractions
      taxonomic_group, scientific_name, common_name = '', '', []
      page_number, section, question = 0, 0, 0
      section_triggers = get_section_triggers()
      question_triggers = get_question_triggers()
      impact_trigger, impact_statement = False, ['','','','','','']

      # Set counters for monitoring section and question count
      section_count = [0,0,0,0]
      question_count = []
      for i in range(4):
        question_count.append([0,0,0,0,0,0,0])

      for page_idx, page in enumerate(pdf):
        page = clean_page(page)
        lines = clean_lines(page)
        lines = fix_document_errors(pdf_file, page_idx, lines)

        for line_idx, line in enumerate(lines):
          
          if line_idx == len(lines)-1:
            page_number = detect_page_number(line, page_number)

          else:
            taxonomic_group = detect_taxonimic_group(line, taxonomic_group)
            scientific_name = detect_scientific_name(line, scientific_name, page_number)
            common_name = detect_common_name(line, common_name)
            section, impact_trigger = detect_section(line, section, section_triggers, section_count, scientific_name, impact_trigger)
            question, impact_trigger = detect_question(line, section, question, question_triggers, question_count, impact_trigger)
                                
            if impact_trigger and len(line) > 0:
              ordinal_number_warning(line)
              output_file.write('{}|{}|{}|{}|{}|{}|{}\n'.format(pdf_file, str(page_idx+1), section, question, scientific_name, common_name, line))

            if question > 0 and scientific_name and 'Unknown' in line:
              impact_trigger = True
            
  print(section_count)
  print(question_count)
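
Each extracted line is written as a pipe-delimited record with seven fields: file, page, section, question, scientific name, common names, and text. The sketch below shows one way such a record could be read back. `parse_impact_line` is a hypothetical helper, and the sample record is made up. Because the common names are written with the list's string representation, `ast.literal_eval` is used to recover the list.

```python
import ast

def parse_impact_line(record):
  """Split one pipe-delimited record into a dictionary of typed fields."""
  # maxsplit=6 keeps any '|' inside the statement text intact
  fields = record.rstrip('\n').split('|', 6)
  pdf_file, page, section, question, scientific_name, common_names, text = fields
  return {
    'pdf_file': pdf_file,
    'page': int(page),
    'section': int(section),
    'question': int(question),
    'scientific_name': scientific_name,
    'common_names': ast.literal_eval(common_names),
    'text': text,
  }

# Hypothetical record in the format written by output_file.write above
record = ("tm-161.pdf|70|1|2|Gambusia affinis|"
          "['western mosquitofish']|Hypothetical impact text.\n")
parsed = parse_impact_line(record)
```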

In [26]:
def locate_names_in_statements(scientific_names, common_names, common_to_scientific_dict, scientific_to_common_dict, common_name_occurrences_dict):
  """
  inputs scientific name set, common name set, and names dictionary
  searches impact statements for unique impacts on any species in the sets
    (ignoring itself)
  returns a set of impact relationship tuples
  """

  impact_relationships = set()

  with open(impact_statement_file, 'r') as impact_statements:
    for impact_statement in impact_statements:
      impact_statement = impact_statement.strip().lower()
      if len(impact_statement) > 0: # ignore blank lines
        impact_statement = impact_statement.split('|')
        if impact_statement[-1] != 'impact_statement': #ignore headers
          invasive_species = impact_statement[4].strip().lower()
          statement = impact_statement[-1]

          # search for scientific names
          for scientific_name in scientific_names:
            if scientific_name in statement and scientific_name != invasive_species:
              frequent_invasive_name = frequent_common_from_scientific(invasive_species, scientific_to_common_dict, common_name_occurrences_dict)
              frequent_impacted_name = frequent_common_from_scientific(scientific_name, scientific_to_common_dict, common_name_occurrences_dict)
              if frequent_invasive_name != frequent_impacted_name:
                impact_relationships.add((invasive_species, scientific_name, frequent_invasive_name,frequent_impacted_name))

          # search for common names
          potential_impacts = []
          for common_name in common_names:
            variations = [' '+common_name+' ', '|'+common_name+' ', ' '+common_name+'.', ' '+common_name+',']
            for variation in variations:
              if variation in statement:
                if invasive_species not in common_to_scientific_dict[common_name]:
                  frequent_invasive_name = frequent_common_from_scientific(invasive_species, scientific_to_common_dict, common_name_occurrences_dict)
                  if frequent_invasive_name != common_name:
                    potential_impacts.append(common_name)

          # remove overlapping names (ex: "brown bullhead" would yield "brown bullhead" and "bullhead")
          potential_impact_list = potential_impacts.copy()
          for potential_impact in potential_impact_list:
            for check_impact in potential_impact_list:
              if (potential_impact in check_impact) and (potential_impact != check_impact):
                if potential_impact in potential_impacts:
                  potential_impacts.remove(potential_impact)

          # add non-overlapping names to relationship set
          for potential_impact in potential_impacts:
            impact_relationships.add((invasive_species, potential_impact, frequent_invasive_name, potential_impact))

  print('{} relationships extracted from impact statements'.format(len(impact_relationships)))

  return impact_relationships
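
The overlap-removal step in the function above can be sketched on its own. The helper name `remove_overlapping_names` and the sample matches are assumptions; the logic drops any name that is a strict substring of another matched name, so only the most specific match survives.

```python
def remove_overlapping_names(names):
  """Keep only names that are not contained in another, longer match."""
  return [name for name in names
          if not any(name in other and name != other for other in names)]

# Hypothetical matches found in one impact statement
matches = ['brown bullhead', 'bullhead', 'common carp']
kept = remove_overlapping_names(matches)
```

The test is a plain substring check, so it cannot tell a shorter species name from an unrelated longer name that happens to contain the same letters.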

In [27]:
def locate_related_names_in_statements(invasive_species, statement, scientific_to_common_dict, related_species_names_dict, common_name_occurrences_dict, impact_relationships):
  """
  inputs invasive species, statement and dictionaries: scientific to common name, related species names, common_name_occurrences
  searches impact statements for relationships and adds them to the impact relationships set
  returns a set of impact relationship tuples
  """

  related_names = set(related_species_names_dict.keys())
  for related_name in related_names:
    if related_name[:-1] in statement:
      if invasive_species in related_names:
        frequent_invasive_name = related_species_names_dict[invasive_species]
      else:
        frequent_invasive_name = frequent_common_from_scientific(invasive_species, scientific_to_common_dict, common_name_occurrences_dict)
      if frequent_invasive_name + 's' in related_names:
        frequent_invasive_name += 's'
        
      noaa_impacted_name = related_species_names_dict[related_name]
      if frequent_invasive_name != noaa_impacted_name:
        impact_relationships.add((invasive_species, related_name, frequent_invasive_name, noaa_impacted_name))

  return impact_relationships

In [28]:
def extract_impacted_names_from_statements(impact_statement, invasive_species, name_search, related_species_names_dict, common_to_scientific_dict):
  """
  inputs impact statement, invasive species, name search and related species name dict, common_to_scientific_dict
  searches impact statements for impacted species
  returns a list of impacted species
  """

  impacted_names = []
  impact_statement = impact_statement.lower().strip()

  for name in name_search:
    if len(name) > 4:
      if name in impact_statement:
        impacted_names.append(name)
      elif name[-1] == 's' and name[:-1] in impact_statement:
        impacted_names.append(name)
  
  #filter out sub names (ex: carp and grass carp)
  temp_names = impacted_names
  impacted_names = []
  for name_idx, name in enumerate(temp_names):
    sub_name = False
    for other_idx, other_name in enumerate(temp_names):
      if name in other_name and name_idx != other_idx:
        sub_name = True
    if not sub_name:
      related_name = related_species_names_dict.get(invasive_species, invasive_species) 
      if related_name != related_species_names_dict.get(name, name):
        if invasive_species not in common_to_scientific_dict.get(name, name):
          impacted_names.append(name)

  return impacted_names

In [29]:
def extract_impact_relationships():
  """
  no inputs
  loads the scientific name and common name dictionary from a file,
    adds the names from the impact statements,
    searches through impact statements to find relationships between species
    writes relationships to file
  returns None
  """

  print('Extracting impact relationships between species...')
  related_species_names_dict = create_related_species_names_dict()
  common_to_scientific_dict = create_common_to_scientific_dict(waterlife_data_excel_file)
  common_to_scientific_dict = add_impact_statement_names_to_dict(common_to_scientific_dict)  
  common_name_occurrences_dict = count_occurences(common_to_scientific_dict)
  scientific_to_common_dict = invert_dict(common_to_scientific_dict)
  invasive_species_ids = get_species_ids()
  species_id_dict = create_species_id_dict(related_species_names_dict, scientific_to_common_dict, common_name_occurrences_dict, invasive_species_ids)
  impact_relationships = create_impact_relationships(scientific_to_common_dict, related_species_names_dict, common_name_occurrences_dict, species_id_dict)
  write_impact_relationships_to_file(impact_relationships)

In [30]:
def extract_journal_references():
  """
  no inputs
  extracts the references at the end of each technical memorandum pdf file
    and writes them to the references file
  returns None
  """

  print('Extracting references...')

  output_file = open(references_file, 'w')
  output_file.write('pdf_file|reference\n')  
  
  ref_trigger_dict = {
    'tm-161.pdf': 'APPENDIX B. OIA REFERENCES',
    'tm-161b.pdf': '3.0 LITERATURE CITED',
    'tm-161c.pdf': '4.0 REFERENCES',
    'tm-169.pdf': 'APPENDIX B: Literature Cited in Assessments',
    'tm-169b.pdf': '4.0 REFERENCES',
    'tm-169c.pdf': '5.0 REFERENCES',
  }

  pdf_files = os.listdir(assets_path)
  for pdf_file in pdf_files:
    if pdf_file[-4:] == '.pdf':
      if pdf_file in ref_trigger_dict.keys():

        in_ref_section = False
        ref_title =  ref_trigger_dict[pdf_file]
        ref_counter = 0
        page_number = 0
        ref = ''

        with open(assets_path + pdf_file, 'rb') as f:
            pdf = pdftotext.PDF(f)

            for page_idx, page in enumerate(pdf):
              page = clean_page(page)
              lines = clean_lines(page)
              lines = fix_document_errors(pdf_file, page_idx, lines)

              for line_idx, line in enumerate(lines):
                
                if line_idx == len(lines)-1:
                  page_number = detect_page_number(line, page_number)
                else:
                  if in_ref_section:

                    new_line_ord = ord(line[0])
                    old_line_ord = old_line_ordinal(ref)

                    if line[0].isupper() and\
                      ('pp.' in ref or\
                        'http:' in ref or\
                        'www.' in ref or\
                        'fact sheet' in ref.lower() or\
                        'in press' in ref.lower() or\
                        'booklet' in ref.lower() or\
                        'available' in ref.lower() or\
                        'accessed' in ref.lower() or\
                        'annual report.' in ref.lower() or\
                        'vienna.' in ref.lower() or\
                        'report.' in ref.lower() or\
                        'izdatel' in ref.lower() or\
                        'doi:' in ref.lower() or\
                        'hokkaido university press.' in ref.lower() or\
                        'synopsis 135.' in ref.lower() or\
                        ref[-7:] == '(eds.).' or\
                        ref[-9:] == '(GLMRIS).' or\
                        len(prev_line) <= 80 or\
                        bool(re.search(r'\(\d\d\d\d[a-z]?\).?$', prev_line)) or\
                        (bool(re.search(r'[A-Z]{2}.$', prev_line)) and prev_line[-4:] != 'USA.') or\
                        ((bool(re.search(r'\d(-|–)\d', ref)) or\
                          bool(re.search(r'\):\d', ref)) or\
                          bool(re.search(r'\d:\d', ref))) and\
                          ref[-1] == '.')) and\
                      (new_line_ord >= old_line_ord) and\
                      prev_line[-4:] != 'with' and\
                      prev_line[-4:] != 'Nauk' and\
                      prev_line[-4:] != 'tate' and\
                      prev_line[-4:] != 'Lake' and\
                      prev_line[-4:] != '. B.' and\
                      prev_line[-4:] != '. R.' and\
                      prev_line[-5:] != 'le at' and\
                      prev_line[-4:] != 'NOAA' and\
                      prev_line[-4:] != 'USGS' and\
                      prev_line[-4:] != '(ECI' and\
                      prev_line[-11:] != '08/18/2017.' and\
                      prev_line[-5:] != '1990.' and\
                      prev_line[-2:] != 'of' and\
                      prev_line[-12:] != 'DFO Canadian' and\
                      prev_line[-12:] != 'Case report.' and\
                      prev_line[-8:] != '50_7.pdf' and\
                      prev_line[-9:] != 'phoxinus.' and\
                      prev_line[-8:] != 'European' and\
                      prev_line[-8:] != 'Y, 2009.' and\
                      prev_line[-10:] != '(Accessed:' and\
                      prev_line[-39:] != 'bodies. Qi, J., and K.T. Evered (eds.).' and\
                      prev_line[-6:] != '3. St.':

                      if len(ref) > 0:
                        output_file.write('{}|{}\n'.format(pdf_file, ref))
                        ref_counter += 1
                      
                      ref = line.strip()

                    else:
                      if len(ref) > 0:
                        if ref[-1] == '-':
                          ref = ref.strip() + line.strip()
                        else:
                          ref = ref.strip() + ' ' + line.strip()  
                      else:
                        ref = ref.strip() + ' ' + line.strip()

                  
                  if not in_ref_section and line == ref_title:
                    in_ref_section = True
                  prev_line = line

        output_file.write('{}|{}\n'.format(pdf_file, ref))
        ref_counter += 1
        print('{} references extracted from {}'.format(ref_counter, pdf_file))

  output_file.close()

  return None
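
The long condition above decides whether the reference being built is finished before a new one starts. Below is a minimal sketch of three of its regular expression checks, assuming invented sample lines; `looks_like_reference_end` is a hypothetical helper and covers only a small part of the full rule set.

```python
import re

# patterns copied from the condition above
YEAR_AT_END = re.compile(r'\(\d\d\d\d[a-z]?\).?$')   # e.g. "(2004a)."
UPPER_PAIR_AT_END = re.compile(r'[A-Z]{2}.$')         # e.g. "MI."
PAGE_RANGE = re.compile(r'\d(-|–)\d')                 # e.g. "12-34"


def looks_like_reference_end(prev_line, ref):
    """Hypothetical helper: True if the reference built so far looks complete."""
    if YEAR_AT_END.search(prev_line):
        return True
    # two capital letters at the end usually mean a state code,
    # but 'USA.' is excluded as in the original condition
    if UPPER_PAIR_AT_END.search(prev_line) and prev_line[-4:] != 'USA.':
        return True
    # a page range somewhere in the reference plus a closing period
    if PAGE_RANGE.search(ref) and ref.endswith('.'):
        return True
    return False


# invented sample lines, not taken from the memoranda
print(looks_like_reference_end('Example Press, Ann Arbor, MI.', ''))
print(looks_like_reference_end('Example Society, Washington, USA.', ''))
```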

In [31]:
def extract_impact_references():
  """
  inputs None
  finds citations within the document and matches them to the journal reference
  returns None
  """
  print('Extracting references...')

  related_species_names_dict = create_related_species_names_dict()
  common_to_scientific_dict = create_common_to_scientific_dict(waterlife_data_excel_file)
  common_to_scientific_dict = add_impact_statement_names_to_dict(common_to_scientific_dict)  

  name_search = set()
  for name in related_species_names_dict.keys():
    name_search.add(name)
  for name in common_to_scientific_dict.keys():
    name_search.add(name)

  counter = 0
  impact_type_dict = make_impact_type_dict()
  output_file = open(database_upload_file, 'w')
  output_file.write('ID|species_ID|impact_type|study_type|study_location|impact_desc|refnum|notes|greatlakes_region|cost|location|impacted_TSNs\n')  
  
  # read through impact statement file to extract references
  with open(impact_statement_file, 'r') as impact_statements:
    for impact_statement in impact_statements:
      impact_statement = impact_statement.strip()
      if len(impact_statement) > 0: # ignore blank lines
        impact_statement = impact_statement.split('|')
        if impact_statement[-1] != 'impact_statement': #ignore headers
          impact_file = impact_statement[0].strip().lower()
          impact_section = impact_statement[2].strip()
          impact_question = impact_statement[3].strip()
          impact_section_question = impact_section + '|' + impact_question
          invasive_species = impact_statement[4].strip()     
          statement = impact_statement[-1].strip().lower()
          statement = statement.replace('.','')
          statement = statement.replace(',','')
          statement = statement.replace('(','')
          statement = statement.replace(')','')
          words = statement.split(' ')
          for word_index, word in enumerate(words):
            if bool(re.search(r'^\d\d\d\d', word)):
              year = word[:4]
              author = words[word_index-1]
              if author == 'al':
                author = words[word_index-3]

              if author not in ['in','and','of', 'to', 'a', 'the', 'from']:

                with open(references_file, 'r') as references:
                  potential_references = []
                  for reference in references:
                    reference_file, clean_reference = reference.split('|')
                    clean_reference = clean_reference.strip().lower()
                    if year in clean_reference and author in clean_reference and reference_file == impact_file:
                      potential_references.append((clean_reference.find(author), reference))
                
                if len(potential_references) > 0:
                  for potential_reference_index, potential_reference in enumerate(potential_references):
                    if potential_reference_index == 0:
                      min_index = 0
                      min_location = potential_reference[0]
                    if potential_reference[0] < min_location:
                      min_index = potential_reference_index
                      min_location = potential_reference[0]
                
                  #print('selected {}'.format(potential_references[min_index][1]))
                  id = counter
                  species_id = invasive_species
                  impact_type = impact_type_dict[impact_section_question]
                  study_type = ''
                  study_location = ''
                  impact_desc = impact_statement[-1]
                  refnum = potential_references[min_index][1].split('|')[1].strip()
                  notes = ''
                  greatlakes_region = 1
                  cost = ''
                  location = ''
                  impacted_TSNs = extract_impacted_names_from_statements(impact_desc, invasive_species.lower().strip(), name_search, related_species_names_dict, common_to_scientific_dict)

                  output_file.write('{}|{}|{}|{}|{}|{}|{}|{}|{}|{}|{}|{}\n'.format(
                      id,species_id,impact_type,study_type,study_location,impact_desc, \
                      refnum,notes,greatlakes_region,cost,location,impacted_TSNs))  

                  counter+=1
  
  print('{} impact statements matched to references.'.format(counter))

  output_file.close()

  return None
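
The citation detection inside the loop above can be tried on its own. This is a sketch under assumptions: `find_citations` is a hypothetical helper, the sample statement uses invented author names, and the `idx > 0` guard is an addition that the original loop does not have.

```python
import re

STOP_WORDS = ['in', 'and', 'of', 'to', 'a', 'the', 'from']


def find_citations(statement):
    """Return (author, year) pairs for every token that starts with four digits."""
    cleaned = statement.lower()
    for char in '.,()':
        cleaned = cleaned.replace(char, '')
    words = cleaned.split(' ')

    citations = []
    for idx, word in enumerate(words):
        # idx > 0 is an added guard so a leading year cannot wrap around to the last word
        if idx > 0 and re.search(r'^\d\d\d\d', word):
            author = words[idx - 1]
            if author == 'al' and idx >= 3:
                author = words[idx - 3]  # skip back over "et al"
            if author not in STOP_WORDS:
                citations.append((author, word[:4]))
    return citations


# invented statement and authors, for illustration only
example = 'Example species competes with native fish (Smith and Jones 2001; Lee et al. 1996a).'
print(find_citations(example))
```

Note that for a two-author citation only the word directly before the year is used, so the lookup in the references file is keyed on the last author listed.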

## Processing Data

In [32]:
def ordinal_number_warning(line):
  """
  inputs an impact statement line
  detects ordinal numbers (1st, 2nd, 3rd, 4th etc.) that did not parse correctly
    prints a warning to the screen if an ordinal number suffix is detected
    note: scientific_name and page_idx are not parameters, so they must already
    exist as global variables when this function is called
  returns None
  """

  if line in ('st', 'nd', 'rd', 'th'):
    print('Warning possible ordinal number detected for scientific_name {} on page {}'.format(scientific_name, str(page_idx+1)))

In [33]:
def build_statement_break_classifier():
  """
  no inputs
  opens the impact statement line file and finds all known statement break lines
    and non statement break lines based on bullet points, new questions and 
    references at the end of the line, then creates and trains a classifier on
    this data (the only feature is the length of the previous line) to be used
    for impact statement lines where it is unknown if
    the impact statement line represents a new impact statement
  returns the trained classifier
  """

  print('Training statement break classifier')
  X_train = []
  y_train = []
  counter = 0
  prev_impact_data = ['','','','','','']
  uses_bullets = False
  statement_break_classifier = GaussianNB()

  with open(impact_line_file, 'r') as impact_lines:
    for impact_line in impact_lines:

      impact_line = impact_line.strip()
      impact_data = impact_line.split('|')
      new_statement = False
      impact_statement_line = impact_data[-1].strip()
      prev_impact_statement_line = prev_impact_data[-1].strip()
      last_line_len = len(prev_impact_statement_line)

      if first_char_bullet_point(impact_statement_line): # first character is a bullet point
        new_statement, uses_bullets = True, True
      elif new_question(prev_impact_data, impact_data): # new question in file
        new_statement, uses_bullets = True, False
      elif prev_line_reference(prev_impact_statement_line, impact_statement_line, uses_bullets):
        new_statement = True

      if new_statement and prev_impact_data != ['','','','','','']:
        X_train.append(last_line_len)
        y_train.append(1)
      else:
        if uses_bullets:
          X_train.append(last_line_len)
          y_train.append(0)

      prev_impact_data = impact_data
  
  statement_break_classifier.fit(np.array(X_train).reshape(-1, 1), y_train)
  
  return statement_break_classifier

In [34]:
def prev_line_prediction(prev_impact_statement_line, impact_statement_line, uses_bullets, statement_break_classifier):
  """
  inputs the previous impact statement line, the current impact statement line,
    whether the current impact statement is using bullets or not and the
    statement break classifier that was trained on known data
  if the current statement is not using bullets to delineate new statements,
    the function checks the classifier's prediction of whether the previous line
    is the end of an impact statement and whether the current line starts with
    a capital letter, thus indicating that the current line is the beginning 
    of a new impact statement
  returns True or False for the condition
  """
  
  if not uses_bullets and \
    impact_statement_line[0].isupper() and \
    statement_break_classifier.predict(np.array(len(prev_impact_statement_line)).
      reshape(-1,1)) == 1:
    return True
  else:
    return False

In [35]:
def detect_statement_breaks(prev_impact_data, impact_data, uses_bullets, statement_break_classifier):
  """
  inputs the previous impact data and the current impact data as lists, whether the
    current question uses bullets, and the trained statement break classifier
  determines if the impact statement line starts a new impact statement
  returns a boolean value regarding whether the current line starts a new statement
    (rather than continuing the previous line), and a boolean value regarding whether
    the current question uses bullet points to delineate impact statements
  """

  new_statement = False
  impact_statement_line = impact_data[-1].strip()
  prev_impact_statement_line = prev_impact_data[-1].strip()

  if first_char_bullet_point(impact_statement_line): # first character is a bullet point
    new_statement, uses_bullets = True, True
  elif new_question(prev_impact_data, impact_data): # new question in file
    new_statement, uses_bullets = True, False
  elif prev_line_reference(prev_impact_statement_line, impact_statement_line, uses_bullets):
    new_statement = True
  elif prev_line_prediction(prev_impact_statement_line, impact_statement_line, uses_bullets, statement_break_classifier):
    new_statement = True

  return new_statement, uses_bullets

In [36]:
def remove_bullet_point(impact_line):
  """
  inputs an impact line
  removes the leading bullet point if any
  returns the impact line without the leading bullet point
  """

  impact_line = impact_line.strip()
  if first_char_bullet_point(impact_line):
    impact_line = impact_line[1:].strip()

  return impact_line

In [37]:
def aggregate_impact_statements(statement_break_classifier):
  """
  inputs the trained statement break classifier
  reads the impact lines txt file and aggregates lines into statements
    removes lines that are not statements
  writes data to impact statements txt file
  """

  counter = 0
  prev_impact_data = ['','','','','','']
  uses_bullets = False
  ends_with_hyphen = False

  output_file = open(impact_statement_file, 'w')

  with open(impact_line_file, 'r') as impact_lines:
    for impact_line in impact_lines:

      impact_line = impact_line.strip()
      impact_data = impact_line.split('|')
      new_statement, uses_bullets = detect_statement_breaks(prev_impact_data, impact_data, uses_bullets, statement_break_classifier)
      impact_data[-1] = remove_bullet_point(impact_data[-1])

      if new_statement:
        output_file.write('\n'+ '|'.join(impact_data))
      else:
        if not ends_with_hyphen:
          output_file.write(' ')
        output_file.write(impact_data[-1])

      ends_with_hyphen = last_character_hyphen(impact_data[-1])
      prev_impact_data = impact_data

      if new_statement:
        counter += 1
  
  print('{} impact statements extracted'.format(counter))
  print('Impact statements file created')
  
  output_file.close()

  return None

In [38]:
def create_impact_relationships(scientific_to_common_dict, related_species_names_dict, common_name_occurrences_dict, species_id_dict):
  """
  inputs the scientific to common, related species names, common name occurrences and species_id dictionaries
  loads the noaa impact statement and scraped impact statement files and creates the relationships between species
  returns the impact relationship set
  """

  impact_relationships = set()

  # extract relationships from noaa statement file
  df = pd.DataFrame(pd.read_csv(old_impact_statements_file))
  for i, row in df.iterrows():
    if row[8] == 1: # great lakes region only
      try:
        idx = int(str(row[1]).strip())
        if idx in species_id_dict.keys():
          invasive_species = species_id_dict[idx]
          statement = row[5].strip().lower()
          impact_relationships = locate_related_names_in_statements(invasive_species, statement, scientific_to_common_dict, related_species_names_dict, common_name_occurrences_dict, impact_relationships)
      except:
        pass

  # extract relationships from impact statement file
  with open(impact_statement_file, 'r') as impact_statements:
    for impact_statement in impact_statements:
      impact_statement = impact_statement.strip().lower()
      if len(impact_statement) > 0: # ignore blank lines
        impact_statement = impact_statement.split('|')
        if impact_statement[-1] != 'impact_statement': #ignore headers
          invasive_species = impact_statement[4].strip().lower()
          statement = impact_statement[-1]
          impact_relationships = locate_related_names_in_statements(invasive_species, statement, scientific_to_common_dict, related_species_names_dict, common_name_occurrences_dict, impact_relationships)

  print('{} relationships extracted from impact statements'.format(len(impact_relationships)))

  return impact_relationships

In [39]:
def old_line_ordinal(ref):
  """
  inputs the reference statement that is being built
  computes the ordinal value of the first character (used for finding new references assuming alphabetical order)
    adjusts for letters with accents and references that are not in alphabetical order
  returns the ordinal value
  """

  try:
    old_line_ord = ord(ref[0])
  except IndexError: # ref is still empty
    old_line_ord = 0

  if old_line_ord == 70:
    if ref[:5] == 'Fogel': # out of order in biblio
      old_line_ord = 67
  elif old_line_ord == 72: 
    if ref[:5] == 'Horns': # out of order in biblio
      old_line_ord = 69
  elif old_line_ord == 84: 
    if ref[:11] == 'Tinca tinca': # out of order in biblio
      old_line_ord = 65
  elif old_line_ord == 193: # fix A with accent
    old_line_ord = 65
  elif old_line_ord == 199: # fix C with accent
    old_line_ord = 67
  elif old_line_ord == 214: # fix O with accent
    old_line_ord = 79
  elif old_line_ord == 216: # fix O with accent
    old_line_ord = 79
  elif old_line_ord == 262: # fix C with accent
    old_line_ord = 67
  elif old_line_ord == 352: # fix S with accent
    old_line_ord = 83
  elif old_line_ord == 381: # fix Z with accent
    old_line_ord = 90

  return old_line_ord
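
The accent handling above lists each code point by hand. A possible alternative, shown here only as a sketch, is to let the standard library's `unicodedata` module strip the accent. `base_letter_ordinal` and the `NO_DECOMPOSITION` table are hypothetical names, and the out-of-order bibliography exceptions ('Fogel', 'Horns', 'Tinca tinca') are not covered.

```python
import unicodedata

# letters without a Unicode decomposition need an explicit fallback
# (assumed list, extend as needed)
NO_DECOMPOSITION = {'Ø': 'O'}


def base_letter_ordinal(text):
    """Return the ordinal of the first character with any accent removed, 0 if empty."""
    if not text:
        return 0
    first = NO_DECOMPOSITION.get(text[0], text[0])
    # NFD splits an accented letter into its base letter plus combining marks
    decomposed = unicodedata.normalize('NFD', first)
    return ord(decomposed[0])


# invented sample strings
for sample in ['Álvarez', 'Šample', 'Žample', 'Øample', 'Plain']:
    print(sample, base_letter_ordinal(sample))
```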

## Writing Output Data

In [40]:
def print_header(output_file):
  """
  inputs the output file object
  writes the header line to the output file
  returns None
  """

  output_file.write('pdf_file|page|section|question|scientific_name|common_name|impact_statement\n')
  return None

In [41]:
def make_impact_type_dict():
  """
  no inputs
  creates a dictionary based on NOAA's impact types from their prior impact statements and
    relates them to the section and question in the new documents
  returns the impact type dictionary
  """
  impact_type_dict = {
      '1|1' : 'Disease/Parasites/Toxicity',
      '1|2' : 'Competition',
      '1|3' : 'Predation/Herbivory',
      '1|4' : 'Genetic',
      '1|5' : 'Water Quality',
      '1|6' : 'Habitat Alteration',
      '2|1' : 'Human Health',
      '2|2' : 'Infrastructure',
      '2|3' : 'Water Quality',
      '2|4' : 'Commerce',
      '2|5' : 'Recreation',
      '2|6' : 'Property Value',
      '3|1' : 'Other',
      '3|2' : 'Aquaculture/Agriculture',
      '3|3' : 'Recreation',
      '3|4' : 'Other',
      '3|5' : 'Water Quality',
      '3|6' : 'Other',
  }
  return impact_type_dict

In [42]:
def create_impact_line_file():
  """
  No inputs
  Creates a file output object
    Iterates through all files in the data folder
    Sends each file to the extraction function
  Returns None
  """

  output_file = open(impact_line_file, 'w')

  print_header(output_file)

  pdf_files = os.listdir(assets_path)
  for pdf_file in pdf_files:
    if pdf_file[-4:] == '.pdf':
      exctract_impact_lines(pdf_file, assets_path, output_file)

  output_file.close()

  print('Impact line file created')

In [43]:
def write_impact_relationships_to_file(impact_relationships):
  """
  inputs the set of impact relationship tuples
  writes each tuple to a line in the impact relationship file
  returns None
  """

  # the with block closes the file so that all buffered rows are flushed to disk
  with open(impact_relationships_file, 'w') as output_file:
    output_file.write('tex_invasive|text_impacted|translated_invasive|translated_impacted\n')
    for (text_invasive, text_impacted, translated_invasive, translated_impacted) in impact_relationships:
      output_file.write('{}|{}|{}|{}\n'.format(text_invasive, text_impacted, translated_invasive, translated_impacted))


## Creating Taxonomy

In [44]:
def create_common_to_scientific_dict(waterlife_data_excel_file):
  """
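  inputs the waterlife data excel file
  creates a dictionary that maps each common name to a list of scientific names,
    using the first two columns (scientific name, '/'-separated common names)
  returns the dictionary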
  """
  
  common_to_scientific_dict = dict()
  df = pd.DataFrame(pd.read_excel(waterlife_data_excel_file))
  df = df.iloc[:, 0:2].dropna()
  for i, row in df.iterrows():
    scientific_name = row[0].strip().lower()
    common_names = row[1].strip().lower().split('/')
    if len(scientific_name) > 0:
      for i, common_name in enumerate(common_names):
        if i >= 0: #if i == 0:
          common_name = common_name.strip()
          if common_name in common_to_scientific_dict.keys():
            common_to_scientific_dict[common_name].append(scientific_name)
          else:
            common_to_scientific_dict[common_name] = [scientific_name]
            
  return common_to_scientific_dict

In [45]:
def clean_common_name(common_name):
  """
  inputs a common name
  lowers, strips, removes apostrophes and removes leading 'a ' and leading 'an ' from common name
  returns cleaned common name
  """

  common_name = common_name.lower().strip()
  common_name = common_name.replace("'",'')    
  if common_name[0:2] == 'a ':
    common_name = common_name[2:]    
  if common_name[0:3] == 'an ':
    common_name = common_name[3:]
  common_name = common_name.lower().strip()
  
  return common_name

In [46]:
def add_impact_statement_names_to_dict(common_to_scientific_dict):
  """
  inputs the common_to_scientific_dict dictionary
  iterates through the impact statements to extract all scientific names
    and their related common names
    adds these mappings to the dictionary
  returns the updated common_to_scientific_dict
  """

  with open(impact_statement_file, 'r') as impact_statements:    
    for impact_statement in impact_statements:
      impact_statement = impact_statement.strip()      
      if len(impact_statement) > 0: # ignore blank lines
        impact_statement = impact_statement.split('|')        
        if impact_statement[-1] != 'impact_statement': #ignore headers
          invasive_species = impact_statement[4].strip().lower()
          common_name = impact_statement[5]
          common_name = common_name[1:-1]
          common_name_list = common_name.split(',')          
          for common_name in common_name_list:
            common_name = clean_common_name(common_name)            
            if len(common_name) > 0:
              if common_name not in common_to_scientific_dict.keys():
                common_to_scientific_dict[common_name] = [invasive_species]            
              else:
                if invasive_species not in common_to_scientific_dict[common_name]:
                  common_to_scientific_dict[common_name].append(invasive_species)
  
  return common_to_scientific_dict

In [47]:
def count_occurences(common_to_scientific_dict):
  """
  inputs common to scientific dictionary
  goes through all the text extracted from NOAA's technical memorandum and counts
    which names are used most often in the text and stores that in a dictionary
  outputs common name occurrences dictionary
  """
  common_names = common_to_scientific_dict.keys()
  common_name_occurrences_dict = dict()
  for common_name in common_names:
    common_name_occurrences_dict[common_name] = 0
  with open(impact_statement_file, 'r') as impact_statements:
    for impact_statement in impact_statements:
      impact_statement = impact_statement.strip().lower()
      if len(impact_statement) > 0: # ignore blank lines
        impact_statement = impact_statement.split('|')
        if impact_statement[-1] != 'impact_statement': #ignore headers
          statement = impact_statement[-1]
          for common_name in common_names:
            if common_name not in ['pink']: # too common
              variations = [' '+common_name+' ', '|'+common_name+' ', ' '+common_name+'.', ' '+common_name+',']
              for variation in variations:
                if variation in statement:
                    common_name_occurrences_dict[common_name] += 1
  

  return common_name_occurrences_dict

In [48]:
def invert_dict(common_to_scientific_dict):
  """
  inputs the common_name to scientific_name dictionary
  inverts the dictionary which is a many to many relationship
  returns the inverted dict
  """

  scientific_to_common_dict = dict()

  common_names = common_to_scientific_dict.keys()

  for common_name in common_names:
    scientific_names = common_to_scientific_dict[common_name]
    for scientific_name in scientific_names:
      if len(scientific_name) > 0:
        if scientific_name in scientific_to_common_dict.keys():
          scientific_to_common_dict[scientific_name].append(common_name)
        else:
          scientific_to_common_dict[scientific_name] = [common_name]

  return scientific_to_common_dict
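
The inversion above can also be written with `collections.defaultdict`. The sketch below assumes a small hand-made dictionary; `invert_many_to_many` is a hypothetical name and 'dreissenid' is used only as an illustrative shared common name.

```python
from collections import defaultdict


def invert_many_to_many(common_to_scientific):
    """Invert a common name -> [scientific names] mapping into scientific name -> [common names]."""
    scientific_to_common = defaultdict(list)
    for common_name, scientific_names in common_to_scientific.items():
        for scientific_name in scientific_names:
            if scientific_name:  # skip empty names, as above
                scientific_to_common[scientific_name].append(common_name)
    return dict(scientific_to_common)


sample = {
    'zebra mussel': ['dreissena polymorpha'],
    'quagga mussel': ['dreissena bugensis', 'dreissena rostriformis'],
    'dreissenid': ['dreissena polymorpha', 'dreissena bugensis'],
}
print(invert_many_to_many(sample))
```

Because the relationship is many to many, a scientific name can end up with several common names, which is why the next step picks the most frequently used one.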

In [49]:
def frequent_common_from_scientific(scientific_name, scientific_to_common_dict, common_name_occurrences_dict):
  """
  inputs a scientific name, the scientific name to common name dictionary and the common name occurrences dictionary
  computes the most often used common name for the scientific name
  returns the most often used name, or the scientific name itself if it has no common name
  """

  if scientific_name in scientific_to_common_dict.keys():    
    translated_names = scientific_to_common_dict[scientific_name]
    occurrences = [common_name_occurrences_dict[x] for x in translated_names]
    most_frequent_name = translated_names[occurrences.index(max(occurrences))]
  else:
    most_frequent_name = scientific_name

  # exception to match noaa food web
  if most_frequent_name == 'green alga':
    most_frequent_name = 'green algae'

  return most_frequent_name
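
The selection step above relies on `list.index(max(...))`, which returns the first name when counts are tied. A minimal sketch with invented counts follows; `most_frequent` is a hypothetical helper, and treating a missing count as 0 with `dict.get` is an assumption, since the function above indexes the dictionary directly.

```python
def most_frequent(names, occurrences):
    """Return the name with the highest count; the first name wins a tie."""
    counts = [occurrences.get(name, 0) for name in names]
    return names[counts.index(max(counts))]


# invented counts, for illustration only
occurrences = {'zebra mussel': 12, 'dreissenid': 3}
print(most_frequent(['dreissenid', 'zebra mussel'], occurrences))
print(most_frequent(['dreissenid', 'unlisted name'], {'dreissenid': 0}))
```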

In [50]:
def load_food_web_common_names():
  """
  no inputs
  loads the pred_prey_relationship csv file and extracts the names of the species
  returns a set of species names
  """
  food_web_common_names = set()
  
  with open(pred_prey_file, 'r') as pred_prey_relationships:
    for relationship in pred_prey_relationships:      
      common_names = relationship.split(',')
      for common_name in common_names:
        common_name = common_name.strip().lower()
        if common_name != 'predators' and common_name != 'prey': # ignore title
          food_web_common_names.add(common_name)

  return food_web_common_names

In [51]:
def find_related_species_names_from_common_names(food_web_common_names):
  """
  inputs the food_web_common_names
  reads through the waterlife name file
    creates the related_names_dict and adds names if they are present in the common names column
  returns related_names_dict
  """

  related_names_dict = dict()
  df = pd.DataFrame(pd.read_excel(waterlife_data_excel_file))

  for idx, row in df.iterrows():
    for common_name in food_web_common_names:
      if common_name[:-1] in str(row[1]):
        species_names = row[1].strip().lower().split('/')
        for i, species_name in enumerate(species_names):
          species_name = species_name.strip()
          species_name = species_name.replace('(','')
          species_name = species_name.replace(')','')
          scientific_name = row[0].lower().strip()
          add_to_dict = False
          if i == 0:
            exact_match = False
            false_match = False
            if species_name == common_name:
              exact_match = True
            else:
              if species_name in food_web_common_names:
                false_match = True
          if exact_match:
            if i == 0:
              add_to_dict = True
            else:
              if species_name not in food_web_common_names:
                add_to_dict = True
          else:
            if false_match == False:
              add_to_dict = True
          if i < 3 and add_to_dict: # tradeoff between getting more names and accuracy
              if common_name not in related_names_dict.keys():
                related_names_dict[common_name] = common_name
              if species_name not in related_names_dict.keys():
                related_names_dict[species_name] = common_name
              if scientific_name not in related_names_dict.keys():
                related_names_dict[scientific_name] = common_name
                    
  return related_names_dict

In [52]:
def find_related_species_names_from_groupings(food_web_common_names, related_species_names, column_index):
  """
  inputs the food_web_common_names, the related species names dictionary and a column index for the waterlife file
  if the name of a food web organism could not be located by common name of an individual species,
    the function attempts to detect based on order, class, family, or phylum and adds to the dict
  returns the updated related_species_name_dict
  """
  unmatched_names = set()
  for common_name in food_web_common_names:
    if common_name not in related_species_names.keys():
      unmatched_names.add(common_name)

  df = pd.DataFrame(pd.read_excel(waterlife_data_excel_file))
  for i, row in df.iterrows():
    for common_name in unmatched_names:
      matching_name = common_name

      # fixes inconsistency in the original data
      if matching_name == 'blue-green algae':
        matching_name = 'blue-green'
      if matching_name == 'cyclopoid copepods':
        matching_name = 'cyclopoida'
      if matching_name == 'calanoid copepods':
        matching_name = 'calanoida'
      if matching_name == 'chironomids':
        matching_name = 'chironomidae'
      if matching_name == 'oligochaetes':
        matching_name = 'annelida'
      if matching_name == 'protozoans':
        matching_name = 'flagellate'

      grouping_name = str(row[column_index]).lower().strip()

      scientific_name = str(row[0]).lower().strip()
      if matching_name[:-1] in grouping_name:
        if common_name not in related_species_names.keys():
          related_species_names[common_name] = common_name
        if matching_name not in related_species_names.keys():
          related_species_names[matching_name] = common_name
        if grouping_name not in related_species_names.keys():
          related_species_names[grouping_name] = common_name
        if scientific_name not in related_species_names.keys():
          related_species_names[scientific_name] = common_name
    
  return related_species_names

In [53]:
def load_name_exceptions(related_species_names):
  """
  inputs related_species_names dictionary
  adds hard coded exceptions due to naming inconsistencies in the original data
  returns an updated related_species_names dictionary
  """
  
  missing_mappings = [
                      ('cyclopoid', 'cyclopoid copepods'),
                      ('calanoid', 'calanoid copepods'),
                      ('zebra mussel', 'zebra/quagga mussels'),
                      ('quagga mussel', 'zebra/quagga mussels'),
                      ('zebra/quagga mussels', 'zebra/quagga mussels'),
                      ('dreissena polymorpha', 'zebra/quagga mussels'),
                      ('dreissena bugensis', 'zebra/quagga mussels'),
                      ('dreissena rostriformis','zebra/quagga mussels'),
                      ('bythotrephes longimanus', 'invasive waterfleas'),
                      ('cercopagis pengoi', 'invasive waterfleas'),
                      ('daphnia galeata galeata', 'invasive waterfleas'),
                      ('daphnia lumholtzi', 'invasive waterfleas'),
                      ('eubosmina coregoni', 'invasive waterfleas'),
                      ('invasive waterfleas', 'invasive waterfleas'),
                      ('raptorial waterfleas', 'raptorial waterfleas'),
                      ('native waterfleas', 'native waterfleas'),
                      ('elimia livescens', 'mollusks'),
                      ]

  for key, value in missing_mappings:
    if key not in related_species_names.keys():
      related_species_names[key] = value

  return related_species_names

In [54]:
def create_related_species_names_dict():
  """
  no inputs
  creates the related species names dict from the food web matrix and the waterlife file
  returns related_species_names
  """

  food_web_common_names = load_food_web_common_names()
  related_species_names = find_related_species_names_from_common_names(food_web_common_names)
  related_species_names = load_name_exceptions(related_species_names)
  related_species_names = find_related_species_names_from_groupings(food_web_common_names, related_species_names, 19) # Class
  related_species_names = find_related_species_names_from_groupings(food_web_common_names, related_species_names, 20) # Family
  related_species_names = find_related_species_names_from_groupings(food_web_common_names, related_species_names, 21) # Coarse Grouping
  related_species_names = find_related_species_names_from_groupings(food_web_common_names, related_species_names, 28) # Fine Grouping
  related_species_names = find_related_species_names_from_groupings(food_web_common_names, related_species_names, 30) # Order  

  return related_species_names

In [55]:
def get_species_ids():
  """
  no inputs
  loads the noaa_impact_file and extracts unique species ids
  returns a set of species ids
  """

  df = pd.DataFrame(pd.read_csv(old_impact_statements_file))
  unique_species_id = set()

  for i, row in df.iterrows():

    if row[8] == 1: # great lakes region only
      unique_species_id.add(row[1])
      
  return unique_species_id

In [56]:
def create_species_id_dict(related_species_names_dict, scientific_to_common_dict, common_name_occurrences_dict, invasive_species_ids):
  """
  inputs the related species names, scientific to common and common name occurrences dictionaries, and invasive species id set
  loads species id files from noaa and finds the invasive species id and extracts scientific and common names
  returns a dictionary with species id as key and common name as value
  """

  file_list = [noaa_invasive_species_file, noaa_watchlist_species_file]
  species_id_dict = dict()
  related_names = set(related_species_names_dict.keys())

  for file in file_list:
    df = pd.DataFrame(pd.read_csv(file))
    for i, row in df.iterrows():
      try:
        idx = int(str(row[0]).strip())
        if idx in invasive_species_ids:
          invasive_species = str(row[4]).strip().lower() + ' '+ str(row[5]).strip().lower()
          if invasive_species in related_names:
            frequent_invasive_name = related_species_names_dict[invasive_species]
          else:
            frequent_invasive_name = frequent_common_from_scientific(invasive_species, scientific_to_common_dict, common_name_occurrences_dict)
          if frequent_invasive_name + 's' in related_names:
            frequent_invasive_name += 's'

          if invasive_species == frequent_invasive_name:
            frequent_invasive_name = row[7]
          
          species_id_dict[idx] = frequent_invasive_name
      except Exception:
        # skip rows whose id or names cannot be parsed
        pass
  
  return species_id_dict

## Creating Predator-Prey Relationships

In [57]:
def excel_to_pandas(file, sheet):
  """
  inputs an excel file and the sheet of the excel file
  converts the sheet to a pandas dataframe
  returns the dataframe
  """
  if sheet is None:
    df = pd.read_excel(file, index_col=0)
  else:
    df = pd.read_excel(file, index_col=0, sheet_name=sheet)
  return df.dropna(axis=0, how="all")

In [58]:
def get_foodweb_species(df):
  """
  inputs a dataframe
  cleans the species data in the data frame
  returns the cleaned index names and column names as two lists
  """
  df.index, df.columns = df.index.str.lower(), df.columns.str.lower()
  
  df.columns = [re.sub(r" [\(\[].*?[\)\]]", "", x) for x in df.columns]
  df.index = [re.sub(r" [\(\[].*?[\)\]]", "", x) for x in df.index]
  
  df.columns = [re.sub("waterflea$", "waterfleas", x) for x in df.columns]
  df.index = [re.sub("waterflea$", "waterfleas", x) for x in df.index]
  df.index = [re.sub("waterlfea$", "waterfleas", x) for x in df.index]
  
  df.columns = [re.sub("poc", "organic detritus", x) for x in df.columns]
  df.index = [re.sub("poc", "organic detritus", x) for x in df.index]
  
  return df.index.tolist(), df.columns.values.tolist()
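
The three substitutions in `get_foodweb_species` can be tried on single strings. This is a small sketch that applies them in the same order; `clean_foodweb_name` and the sample labels are assumptions made for illustration.

```python
import re

def clean_foodweb_name(name):
    """Apply the same clean-up steps to one label (illustrative helper)."""
    name = name.lower()
    # drop a trailing qualifier in parentheses or brackets, including the leading space
    name = re.sub(r" [\(\[].*?[\)\]]", "", name)
    # pluralise a label that ends in 'waterflea'
    name = re.sub(r"waterflea$", "waterfleas", name)
    # expand the 'poc' abbreviation; note this pattern is not anchored
    name = re.sub(r"poc", "organic detritus", name)
    return name

# sample labels, assumed for the example
cleaned = [clean_foodweb_name(n) for n in ['Alewife (adult)', 'Spiny waterflea', 'POC']]
```

Since the `poc` pattern is unanchored, any label that merely contains those three letters would also be rewritten.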

In [59]:
def get_waterway_pred_prey_pairs(file, sheet):
  """
  inputs an excel file and a sheet from that excel file
  finds all predator / prey relationships in the sheet
  returns a list of predator prey relationships
  """
  species = excel_to_pandas(file, sheet).T
  predators, preys = get_foodweb_species(species)
  pred_prey_pairs = []
  
  for pred in predators:
    for prey in preys:
      if species.loc[pred,prey] == 'x':
        pred_prey_pairs.append((pred, prey))
  return pred_prey_pairs

In [60]:
def checkpath_save(filename, pair_list):
  """
  inputs a file name as a string and a list of predator prey relationships
  saves the list to a csv file
  returns None
  """
  with open(filename, 'w') as f:
    write = csv.writer(f)
    write.writerow(['Predators', 'Prey'])
    write.writerows(pair_list)
  return None

In [61]:
def get_all_pred_prey_pairs(file, sheets):
  """
  inputs a filename as a string and a list of sheets
  calls the function to get all predator prey relationships
  saves all predator prey relationships to a single file
  """
  pred_prey_pairs, fields = [], ['Predator','Prey']
  for sheet in sheets:
    waterway_pairs = get_waterway_pred_prey_pairs(file, sheet)

    for pair in waterway_pairs:
      pred_prey_pairs.append(pair)
  pred_prey_list = list(set(pred_prey_pairs))
  display(len(pred_prey_list))
  checkpath_save(pred_prey_file, pred_prey_list)

  return pred_prey_pairs

## Creating Scientific to Common Dictionary

In [62]:
def create_scientific_to_common_dictionary():
  """
  inputs nothing
  outputs scientific to foodweb name dictionary as dataframe
  """
  related_species_names_dict = create_related_species_names_dict()
  common_to_scientific_dict = create_common_to_scientific_dict(
      waterlife_data_excel_file)
  common_to_scientific_dict = add_impact_statement_names_to_dict(
      common_to_scientific_dict)  
  common_name_occurrences_dict = count_occurences(common_to_scientific_dict)
  scientific_to_common_dict = invert_dict(common_to_scientific_dict)

  df = pd.DataFrame([
    [key,value] for key,value in scientific_to_common_dict.items()],
    columns=["Name","Value"])
  df['Value'] = df['Value'].apply(lambda x: x[0].strip())
  
  return df

## Loading Impact References

In [63]:
def get_html_marks():
  """
  inputs nothing
  returns a list of html tags
  """
  return ['<br />', '&quot;', '&nbsp;', '- ', '<em>']

In [64]:
def get_impact_references(file_path):
  """
  inputs a filename and path as string
  removes html tags from journal abstracts
  returns a dataframe
  """
  print('Loading impact references...')
  df = pd.read_csv(file_path)
  for mark in get_html_marks():
    df['abstract'] = df['abstract'].str.replace(mark, '', regex=False)

  return df.fillna('')

## Loading Impact Statements

In [65]:
def get_impact_statements(file_path):
  """
  inputs a filename and path as string
  loads the file into a df and removes the em html tags
  returns a dataframe
  """
  df = pd.read_csv(file_path)
  df['impact_desc'] = df['impact_desc'].str.replace('<em>', '', regex=False
    ).str.replace('</em>', '', regex=False)
  return df

## Combining Dataframes and Get Train-Test Split

In [66]:
def selected_columns():
  """
  returns a list of selected columns for training
  """
  return ['ID', 'study_type', 'study_location', 'title', 'abstract', 'refnum', 
          'ref_type', 'Year', 'journal',  'impact_type', 'cost', 'location']

In [67]:
def boolean_columns():
  """
  returns a list of boolean columns
  """
  return ['cost', 'location']

In [68]:
def categorical_columns():
  """
  returns a list of categorical columns
  """
  return ['impact_type', 'ref_type'] 

In [69]:
def categorical_features():
  """
  returns a list of categorical training columns
  """
  return ['impact_type', 'ref_type']

In [70]:
def text_columns():
  """
  returns a list of text columns
  """
  return ['title', 'abstract', 'journal']

In [71]:
def label_columns():
  """
  returns a list of label (y) columns
  """
  return ['study_type', 'study_location']

In [72]:
def make_dict(values_list):
  """
  inputs a list of values
  returns a single two-way dictionary: each value maps to a unique integer
    and each integer maps back to its value (0 maps to 'Unknown')
  """
  dictionary, counter = {}, 1
  for value in values_list:
    dictionary[value], dictionary[counter] = counter, value
    counter += 1
  dictionary[0] = 'Unknown'
  
  return dictionary
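
`make_dict` stores both directions of the mapping in one dictionary, so the same object encodes a label and decodes a prediction. Below is a minimal sketch of that behaviour; `make_lookup` and the two category labels are hypothetical.

```python
def make_lookup(values):
    """Build one dict that maps value -> code and code -> value (codes start at 1)."""
    lookup = {}
    for code, value in enumerate(values, start=1):
        lookup[value] = code
        lookup[code] = value
    lookup[0] = 'Unknown'  # reserved code for unmatched rows
    return lookup

lookup = make_lookup(['Field', 'Laboratory'])  # sample labels, assumed
encoded = lookup['Laboratory']
decoded = lookup[encoded]
```

One consequence of sharing a dictionary is that integer-valued categories would collide with the generated codes, so this pattern is only safe for string labels.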

In [73]:
def merge_reference_impacts(references_df, impacts_df):
  """
  inputs reference and impact dataframes
  returns a dataframe merged with selected columns
  """
  return pd.merge(references_df, impacts_df, on='refnum'
                  )[selected_columns()]

In [74]:
def convert_to_boolean(df, columns):
  """
  inputs a dataframe and target columns
  returns the dataframe with 1 where a selected column is missing and 0 otherwise
  """
  for column in columns:
    df[column] = df[column].isnull().astype('int')
  return df

In [75]:
def convert_category(df,column):
  """
  inputs a dataframe and column
  returns a dataframe with integers in selected column and integer dictionary
    for reverse lookup
  """
  dictionary = make_dict(list(df[column].unique()))
  df[column] = df[column].map(dictionary)
  return df, dictionary

In [76]:
def clean_text(text):
  """
  inputs a string
  returns a clean lower-cased string without punctuation
  """
  text = str(text)
  text = re.sub(r'[0-9]+', '', text)
  text = re.sub(r'[^\w\s]', '', text).lower()
  return text.strip()
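
The effect of `clean_text` is easiest to see on a couple of inputs. The block repeats the function so that it runs on its own; the sample strings are assumptions.

```python
import re

def clean_text(text):
    """Copy of the notebook function: drop digits and punctuation, lower-case, trim."""
    text = str(text)
    text = re.sub(r'[0-9]+', '', text)
    text = re.sub(r'[^\w\s]', '', text).lower()
    return text.strip()

cleaned_title = clean_text('Dreissena polymorpha, 1998!')
# str(None) is 'None', so a missing value becomes the literal token 'none'
cleaned_missing = clean_text(None)
```

The second case suggests filling missing text with an empty string before cleaning, otherwise `none` would end up as a term in the TF-IDF vocabulary.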

In [77]:
def get_clean_text(df, columns):
  """
  inputs a dataframe and a list of selected text columns
  returns a dataframe with the selected columns cleaned
  """
  for column in columns:
    df[column] = df[column].apply(clean_text)
  return df

In [78]:
def prepare_features(df):
  """
  inputs a dataframe 
  returns the dataframe with integer values in the selected columns
  """
  df = convert_to_boolean(df, boolean_columns())
  df = get_clean_text(df, text_columns())
  df, study_type_dictionary = convert_category(df, label_columns()[0])
  df, study_location_dictionary = convert_category(df, label_columns()[1])

  return df, study_type_dictionary, study_location_dictionary

In [79]:
def get_features(references_df, impacts_df):
  """
  inputs references df and impacts df
  calls the prepare features function
  returns the result of the prepare features function
  """
  return prepare_features(merge_reference_impacts(references_df,impacts_df))

In [80]:
def check_balance(df, predict_columns):
  """
  inputs a df and prediction columns
  performs an integrity check on the balance of values 
  returns none
  """
  print("Checking the balance of values in selected columns...")
  for predict_column in predict_columns:
    display(predict_column)
    for value in df[predict_column].unique():
      display("{}: {}".format(value, len(df[df[predict_column]==value])))
  return

## Processing New Impact Statements

In [81]:
def load_extracted_impact_statements(file_path):
  """
  inputs a file of impact statements
  processes the file to add study type and location
  returns a df
  """
  print('Processing new impact statements...')
  # df = pd.read_excel(file_path).fillna('')
  df = pd.read_csv(file_path, delimiter="|").fillna('')
  df['greatlakes_region'] = 1
  df.rename(columns={'refnum':'reference', 'species_ID':'species'}, inplace=True)
  df = df.drop(columns=['study_type', 'study_location'])
  df = df[df['ID'].notna()]
  df.ID = df.ID.astype(int)
  
  return df

In [82]:
def load_references_match(file_path):
  """
  inputs a file
  drops the duplicate references from the file
  returns a df of references and reference numbers
  """
  df = pd.read_excel(file_path)
  df= df.drop_duplicates(subset="reference") 
  return df[['reference', 'refnum']]

In [83]:
def load_existing_references(filepath):
  """
  inputs an excel file of journal references
  returns a dataframe of the excel file
  """
  df = pd.read_excel(filepath)
  return df[['refnum', 'ref_type', 'author', 'Year', 'title', 
             'journal',  'abstract', 'impacts_entered']]

In [84]:
def match_references(unmatched_statements, reference_match, references):
  """
  inputs statements and references
  matches the statements with the reference
  returns a merged dataframe
  """
  df = pd.merge(unmatched_statements.drop(columns=['notes', 'impacted_TSNs']), 
                reference_match, how='left', on=['reference'])
  df = df.fillna({'refnum': 'NEW'})
  df = df[df['refnum'] != 'NEW']
  df.refnum = df.refnum.astype(int)
  df = df.drop(['reference'], axis=1)

  return pd.merge(df, references, on=["refnum"], how="left").fillna('')

In [85]:
def prepare_prediction_features(df):
  """
  inputs a dataframe of impact statements matched to references
  returns the model input columns, with placeholder values (10) in the
    study_type and study_location label columns
  """
  df = convert_to_boolean(df, boolean_columns())
  df = get_clean_text(df, text_columns())
  df['study_type'], df['study_location'] = 10,10

  return df[['ID', 'study_type', 'study_location', 'title', 'abstract', 'refnum',
       'ref_type', 'Year', 'journal', 'impact_type', 'cost', 'location']]

## Training Model

In [86]:
def get_train_test_split(df, label_column, test_percentage):
  """
  inputs dataframe, column of labels, and test set percentage
  outputs X_train, X_test, y_train, and y_test
  """
  X, y = df.drop([label_column], axis=1), df[label_column].astype('int')
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=test_percentage, random_state=42)
  return X_train, X_test, y_train, y_test

In [87]:
def get_column_transformer():
  """
  outputs a column transformer
  """
  return ColumnTransformer(transformers = [
      ('boolean', Normalizer(norm="l1"), boolean_columns()),
      ('categorical', OneHotEncoder(handle_unknown = 'ignore'), 
                                    categorical_features()),
      ('tfidf-1', TfidfVectorizer(max_features=5000), text_columns()[0]),
      ('tfidf-2', TfidfVectorizer(max_features=10000), text_columns()[1]),
        ],remainder='drop')

In [88]:
def get_trained_model(df, label_column, test_percentage):
  """
  inputs a dataframe, a column of labels and the percentage of the test set
  outputs a column transformer and classifier
  """
  print('Training the model...')
  X_train, X_test, y_train, y_test = get_train_test_split(
      df, label_column, test_percentage)
  column_transformer = get_column_transformer()
  X_train = column_transformer.fit_transform(X_train.fillna(''))
  X_test = column_transformer.transform(X_test.fillna(''))
  clf = RandomForestClassifier(n_estimators=1000, warm_start=True,
      random_state=42).fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  print("Model Accuracy: {}".format(accuracy_score(y_test, y_pred)))

  return column_transformer, clf

## Getting Predictions

In [89]:
def get_predictions(df, column, column_transformer, clf):
  """
  inputs df, column, column transformer, and classifier
  outputs a dataframe with labeled values
  """
  print('Getting predictions...')
  X_pred = df.copy().drop([column], axis=1)
  X_pred = column_transformer.transform(X_pred.fillna(''))
  y_pred = clf.predict(X_pred)
  df[column] = y_pred

  return df

## Converting for Database

In [90]:
def column_dictionary(stype_dict, slocation_dict):
  """
  outputs a dictionary of dictionaries of keys and values for columns
  """
  return {'study_type': stype_dict,
          'study_location': slocation_dict,
          'species_ID': pd.read_csv(species_to_species_id_file, index_col=0).squeeze("columns").to_dict()}

In [91]:
def tsn_dictionary(filepath):
  """
  outputs a tsn dictionary
  """
  return pd.read_csv(filepath, index_col=0).squeeze("columns").to_dict()

In [92]:
def convert_to_tsn(species_list):
  """
  inputs a list of species written as a string, e.g. "['a', 'b']"
  returns the matching tsns as a comma-separated string
  """
  tsn_dict, count, tsns = tsn_dictionary(tsn_file), 0, ""

  species_list = species_list.replace('[', '').replace(']',''
    ).replace("'", '').split(', ')

  for species in species_list:
    if species == '': continue
    if count == 0: 
      tsns += str(tsn_dict[species.lower()])
    else: 
      tsns += ", " + str(tsn_dict[species.lower()])
    count += 1
    
  return tsns
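
`convert_to_tsn` receives the species list as text (for example `"['a', 'b']"`), strips the list punctuation, and joins the looked-up TSNs with commas. Below is a compact sketch of the same parsing. The helper name, the lookup table, and its numbers are placeholders, not real TSNs.

```python
def species_string_to_tsns(species_list, tsn_lookup):
    """Turn a stringified list of names into a comma-separated string of ids."""
    names = species_list.replace('[', '').replace(']', '').replace("'", '').split(', ')
    # skip the empty entry produced by an empty list; unknown names raise KeyError
    return ', '.join(str(tsn_lookup[name.lower()]) for name in names if name != '')

placeholder_tsns = {'alewife': 1001, 'round goby': 1002}  # made-up ids for illustration
tsns = species_string_to_tsns("['Alewife', 'Round Goby']", placeholder_tsns)
empty = species_string_to_tsns('[]', placeholder_tsns)
```

As in the notebook function, a name that is missing from the lookup raises a `KeyError` rather than being skipped.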

In [93]:
def process_impact_statements(untransformed_df, predicted_df, column_dicts,
                              impacts):
  """
  imports a dataframe of untransformed impact statements, a dataframe of 
    predicted values, and a dictionary of dictionaries of numbers to values
  outputs a dataframe with complete values
  """
  print('Converting data for database...')
  df = pd.merge(untransformed_df, predicted_df[
    ['ID', 'study_type', 'study_location', 'refnum']], on=["ID"], how="left").fillna(0)
  df.refnum = df.refnum.astype(int)

  df['impacted_species'] = df['impacted_TSNs']
  df['impacted_TSNs'] = df['impacted_TSNs'].apply(lambda x: convert_to_tsn(x))
  df['species_ID'] = df['species']

  for column in column_dicts:
    df[column] = df[column].map(column_dicts[column])

  columns = list(impacts.columns.values)
  columns.extend(['species', 'impacted_species'])

  return df[columns]

## Creating Database Upload for NOAA

In [94]:
def create_database_upload_for_NOAA():
  """
  no inputs
  runs all the functions to extract impact statements from start to finish
    - download pdf files
    - extract data from pdf files
    - run machine learning models
    - write full statements to excel file to handoff to NOAA
  returns None
  """
  download_NOAA_technical_memorandums(NOAA_url, technical_files)
  predator_prey_pairs = get_all_pred_prey_pairs(foodweb_file, ['Lake Michigan'])
  create_impact_line_file()
  statement_break_classifier = build_statement_break_classifier()
  aggregate_impact_statements(statement_break_classifier)
  extract_impact_relationships()
  extract_journal_references()
  extract_impact_references()
  create_scientific_to_common_dictionary(
      ).to_csv(scientific_to_common_file, index=False)

  impact_references = get_impact_references(impact_references_file)
  impact_statements = get_impact_statements(old_impact_statements_file)
  features, study_type_dict, study_location_dict = get_features(
    impact_references, impact_statements)
  
  unmatched_impacts = load_extracted_impact_statements(database_upload_file)
  unfinished_impact_statements = prepare_prediction_features(
    match_references(unmatched_impacts, 
      load_references_match(reference_match_file), 
      load_existing_references(existing_reference_path)))
  study_type_transformer, study_type_classifer = get_trained_model(
      features, 'study_type', 0.1)
  study_location_transformer, study_location_classifer = get_trained_model(
      features, 'study_location', 0.1)
  new_impact_statements = get_predictions(
      unfinished_impact_statements, 'study_type', study_type_transformer, 
      study_type_classifer)
  new_impact_statements = get_predictions(
      unfinished_impact_statements, 'study_location', study_location_transformer, 
      study_location_classifer)
  print('Done predicting...')
  processed_impact_statements = process_impact_statements(
      unmatched_impacts, new_impact_statements, 
      column_dictionary(study_type_dict, study_location_dict),
      impact_statements)
  processed_impact_statements.to_excel(impact_statements_file)

  return None

# Impact Statements to Networks

## Creating tools

In [95]:
def dictionary_to_csv(dictionary, filepath):
  """
  inputs a dictionary and a filepath
  outputs (saves) a csv file 
  """
  with open(filepath, 'w') as file:
    writer = csv.writer(file)
    for key, value in dictionary.items():
      writer.writerow([key, value])

In [96]:
def csv_to_dataframe(filepath):
  """
  inputs a csv filepath
  outputs a dataframe
  """
  return pd.read_csv(filepath)

In [97]:
def csv_to_dict(filepath):
  """
  inputs a csv filepath
  outputs a dictionary
  """
  return pd.read_csv(filepath, header=None, index_col=0).squeeze("columns").to_dict()

## Creating Species-Key Dictionary

In [98]:
def combined_relationships(invasive_filepath, foodweb_filepath):
  """
  inputs a filepath of invasive impact filepath and a predator prey filepath
  outputs a combined impacter-impacted dataframe
  """
  print('Creating species-key dictionary')
  invasive_df = pd.read_csv(invasive_filepath, delimiter='|')
  foodweb_df = csv_to_dataframe(foodweb_filepath)

  df = invasive_df.drop(columns=['tex_invasive', 'text_impacted'])
  df.rename(columns={'translated_invasive': 'impacter', 
                     'translated_impacted':'impacted'}, inplace=True)
  
  foodweb_df.rename(columns={'Predators': 'impacter', 
                             'Prey':'impacted'}, inplace=True)
  
  return pd.concat([df, foodweb_df])

In [99]:
def species_index_dicts(species_pairs_df):
  """
  inputs a file with value-to-value relationships
  outputs a value to index dictionary and an index to value dictionary
  """

  index_species_dictionary, species_index_dictionary = {}, {}
  species = list(set(species_pairs_df.impacter) | set(species_pairs_df.impacted))
  
  for index in range(0, len(species)):
    species_index_dictionary[species[index]] = index
    index_species_dictionary[index] = species[index]

  return species_index_dictionary, index_species_dictionary
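
`species_index_dicts` numbers the species in whatever order the `set` yields them, and for strings that order can change between Python sessions. If the keys need to stay stable across runs, sorting before numbering is one option. This is a sketch under that assumption; `stable_species_index` and the sample names are hypothetical.

```python
def stable_species_index(impacters, impacteds):
    """Assign indices in sorted order so repeated runs give the same keys."""
    species = sorted(set(impacters) | set(impacteds))
    species_to_index = {name: index for index, name in enumerate(species)}
    index_to_species = {index: name for name, index in species_to_index.items()}
    return species_to_index, index_to_species

species_to_index, index_to_species = stable_species_index(
    ['zebra/quagga mussels', 'alewife'], ['alewife', 'mollusks'])
```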

## Getting Impacter-Impacted Distance

In [100]:
def combined_relationships(invasive_filepath, foodweb_filepath):
  """
  inputs a filepath of invasive impact filepath and a predator prey filepath
  outputs a combined impacter-impacted dataframe
  """
  print('Getting impacter-impacted distance...')
  invasive_df = pd.read_csv(invasive_filepath, delimiter='|')
  foodweb_df = csv_to_dataframe(foodweb_filepath)

  df = invasive_df.drop(columns=['tex_invasive', 'text_impacted'])
  df.rename(columns={'translated_invasive': 'impacter', 
                     'translated_impacted':'impacted'}, inplace=True)
  
  foodweb_df.rename(columns={'Predators': 'impacter', 
                             'Prey':'impacted'}, inplace=True)
  
  return pd.concat([df, foodweb_df])

In [101]:
def add_distance_between_nodes(species_pairs_df):
  """
  inputs a dataframe of value pairs
  outputs a dataframe of value pairs with the shortest distance between them
  """
  species_pairs_distance, relationships = [], species_pairs_df.values.tolist()
  G = nx.DiGraph()
  G.add_edges_from(relationships)
  shortest_path = dict(nx.all_pairs_shortest_path_length(G))

  species = list(set(species_pairs_df.impacter) | set(species_pairs_df.impacted))

  for species1 in species:
    for species2 in species:
      if species1 == species2: continue
      if nx.has_path(G, species1, species2):
        species_pairs_distance.append(
            [species1, species2, shortest_path[species1][species2]])

  return pd.DataFrame(species_pairs_distance, 
                      columns=['impacter', 'impacted', 'distance'])
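
The `distance` column is a hop count: the number of directed edges on the shortest path from impacter to impacted. The standard-library sketch below reproduces that count for one source with a breadth-first search. It illustrates the idea only and is not the networkx implementation; the edges are invented for the example.

```python
from collections import deque

def hop_counts(edges, source):
    """Breadth-first search over directed edges; returns {node: hops from source}."""
    successors = {}
    for impacter, impacted in edges:
        successors.setdefault(impacter, []).append(impacted)
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in successors.get(node, []):
            if neighbor not in distances:  # first visit is the shortest route
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances

# illustrative edges, assumed for the example
edges = [('sea lamprey', 'lake trout'), ('lake trout', 'alewife'), ('alewife', 'zooplankton')]
from_lamprey = hop_counts(edges, 'sea lamprey')
from_alewife = hop_counts(edges, 'alewife')
```

Nodes that cannot be reached from the source are simply absent from the result, which corresponds to the pairs the notebook function skips.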

In [102]:
def key_relationships(species_species_df, species_index_dictionary):
  """
  inputs a named impacter-impacted pair dataframe and species-to-key dictionary
  outputs a keyed impacter-impacted pair dataframe
  """
  species_species_df['impacter'] = species_species_df['impacter'
      ].map(species_index_dictionary)
  species_species_df['impacted'] = species_species_df['impacted'
      ].map(species_index_dictionary)

  return species_species_df

## Getting Waterways and Coordinates

In [103]:
def json_to_df(file_name):
  """
  inputs a file containing geojson data
  converts the dict object to a dataframe
    removes rows with no waterway name
    creates a row for each coordinate in the data frame (from the list of 
    coordinates)
  returns the expanded dataframe
  """
  print('Getting waterways and coordinates...')
  # clean data frame
  with open(file_name) as f:
    data = json.load(f)
  df = pd.json_normalize(data["features"])
  df = df.rename(columns={'properties.GNIS_NAME': 'properties.name'}) #rename USGS data vs. purchased data
  df.columns= df.columns.str.lower()
  df = df[df['properties.name'].notna()]
  df = df[['properties.name','geometry.type','geometry.coordinates']]    

  # flatten data frame based on geometry (line or polygon)
  coordinate_list = []
  for index, row in df.iterrows():
    water_body_name = row[0]
    if row[1] == 'LineString':
      for coordinates in row[2]:
        coordinate_list.append([water_body_name, coordinates[1], coordinates[0]])
    elif row[1] == 'Polygon':
      for shape in row[2]:
        for coordinates in shape:
          coordinate_list.append([water_body_name, coordinates[1], coordinates[0]])
    else:
      print('Error {} is an unknown geometry type'.format(row[1]))

  df = pd.DataFrame(coordinate_list, columns=['name','latitude', 'longitude'])

  print('{} coordinates loaded from {}.'.format(len(df), file_name))
  
  return df

In [104]:
def create_water_bodies_df(file_name_list):
  """
  inputs a list of file names
  creates data frames from each of the water source files and concatenates them 
  returns a data frame with name and coordinates of all water_bodies
  """

  water_bodies_df = pd.DataFrame()

  for file_name in file_name_list:
    
    water_body_df = json_to_df(file_name)
    water_bodies_df = pd.concat([water_bodies_df, water_body_df]
      ).reset_index(drop=True)

  return water_bodies_df

## Getting Invasion List

In [105]:
def simplified_invasions(invasions_file, scientific_to_name_filepath):
  """
  inputs path to invasive species specimen file
  returns simplified invasive species specimen dataframe filtered by
    monitored species
  """
  print('Getting invasion list...')
  scientific_to_name_dict = pd.read_csv(scientific_to_name_filepath)
  scientific_to_name_dict = dict(zip(
      scientific_to_name_dict.Name, scientific_to_name_dict.Value))
  scientific_names = list(scientific_to_name_dict.keys())
  invasions_df = pd.read_csv(invasions_file, low_memory=False)
  invasions_df['Date'] = pd.to_datetime(
      invasions_df[['Year', 'Month', 'Day']]).dt.date
  invasions_df['Scientific Name'] = invasions_df['Scientific Name'
      ].apply(lambda x: x.lower().strip())
  invasions_df = invasions_df[invasions_df['Scientific Name'
      ].isin(scientific_names)]

  return invasions_df[[
      'Specimen Number', 'Species ID', 'Scientific Name', 'Common Name', 
      'Latitude', 'Longitude', 'Date']].reindex()

## Locating Closest Waterway Features

In [106]:
def get_waterway_pickle(waterways_df):
  """
  inputs waterways dataframe
  builds a cKDTree from the latitude and longitude columns and pickles it
    to waterways_file
  returns the tree
  """
  print('Locating closest waterway features...')
  kd = cKDTree(waterways_df[['latitude', 'longitude']].values)
  with open(waterways_file, 'wb') as f:
    pickle.dump(kd, f)

  return kd

In [107]:
def get_closest_waterway(observed_df, waterways_df, waterways_tree):
  """
  inputs observed specimen dataframe, waterways dataframe, and waterways tree
  finds the index of the closest waterways and maps the name to closest waterway
  outputs observed_df with closest waterway
  """

  observed_df = observed_df.dropna().reset_index().drop(['index'],  axis=1)
  
  distances, indices = waterways_tree.query(
      observed_df[["Latitude", "Longitude"]], k = 1)

  observed_df['closest waterway'] = pd.Series(indices).map(waterways_df['name']) 

  return observed_df
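
A `k = 1` tree query returns, for each observation, the row index of the nearest waterway coordinate under Euclidean distance. Because the tree is built on raw latitude and longitude, "nearest" is measured in degrees rather than kilometres. Below is a brute-force sketch of the same lookup using only the standard library; the coordinates and names are invented.

```python
import math

def nearest_index(point, candidates):
    """Index of the candidate with the smallest Euclidean distance to point."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))

# (latitude, longitude) pairs, assumed for the example
waterway_points = [(43.0, -86.0), (44.5, -83.2), (42.3, -87.9)]
waterway_names = ['waterway a', 'waterway b', 'waterway c']
observation = (44.4, -83.0)
closest = waterway_names[nearest_index(observation, waterway_points)]
```

The tree gives the same answer as this linear scan but avoids comparing every observation against every coordinate.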

## Calculating Waterway Distance

In [108]:
def observation_locations(filepath):    
  """
  inputs a filepath
  outputs a dataframe with species ID and closest waterway
  """     
  return pd.read_csv(filepath)[['Species ID', 'closest waterway']
                               ].drop_duplicates()

In [109]:
def waterway_edges(filepath):
  """
  inputs a filepath
  outputs a dataframe with to and from columns
  """
  return pd.read_csv(filepath).drop(['Unnamed: 0'], axis=1)

In [110]:
def get_species_water_distance(specimens_waterways_df, waterway_edges_df):
  """
  inputs dataframes of specimens-waterways and waterway edges (currently
    unused: the data is read from specimens_locations_file and
    waterways_edges_file)
  outputs a dataframe with 'Species ID', 'waterway', and 'distance'
  """
  print('Calculating waterway distance...')
  presence = []
  observations = observation_locations(specimens_locations_file)
  flow_edges = waterway_edges(waterways_edges_file)
  waterways = list(set(flow_edges['from']) | set(flow_edges['to']))
  species_list = list(set(observations['Species ID']))

  G = nx.Graph()
  G.add_edges_from(flow_edges.values.tolist())
  spl = dict(nx.all_pairs_shortest_path_length(G))

  for species in species_list:
    df = observations[observations['Species ID'] == species]
    for waterway1 in df['closest waterway'].unique():
      if waterway1 not in waterways: continue
      for waterway2 in waterways:
        if waterway1 == waterway2: continue
        if nx.has_path(G, waterway1, waterway2):
          presence.append([species, waterway2, spl[waterway1][waterway2]])
  return  pd.DataFrame(presence, columns=['Species ID', 'waterway', 'distance'])

In [111]:
def species_id_to_foodweb_name(scientific_to_name_filepath, 
                               species_id_filepath):
  """
  inputs filepaths for the scientific name and species id dictionaries
  outputs a species id to common (food web) name dictionary
  """
  scientific_to_name_dict = pd.read_csv(scientific_to_name_filepath)
  scientific_to_name_dict = dict(zip(
      scientific_to_name_dict.Name, scientific_to_name_dict.Value))
  df = pd.read_csv(species_id_filepath).drop(columns='Unnamed: 0')
  df['species'] = df['species'].apply(lambda x: x.lower().strip())
  df['name'] = df['species'].map(scientific_to_name_dict)
  df = df.drop(columns='species')

  return dict(zip(df['Species_ID'], df['name']))

In [112]:
def foodweb_name_to_index(foodweb_to_index_filepath):
  """
  inputs the food_web file
  returns a dictionary of the food web
  """
  return csv_to_dict(foodweb_to_index_filepath)

In [113]:
def add_species_index(species_water_df, speciesid_name_dict, name_index_dict):
  """
  inputs the observations of invasive species in michigan waterways
  maps species names to related data
  returns a data frame with the observation and their related data
  """
  name_index_dict['zebra mussel'] = 28
  species_water_df['name'] = species_water_df['Species ID'].map(speciesid_name_dict)
  species_water_df = species_water_df[species_water_df['name'].notna()].reset_index()
  species_water_df['key'] = species_water_df['name'].map(name_index_dict).fillna(10000).astype(int)

  return species_water_df[['key', 'waterway', 'distance', 'name', 'Species ID']]
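
The mapping step in `add_species_index` has two outcomes worth separating: observations whose species ID has no common name are dropped, while observations whose name is missing from the food web index are kept with the sentinel key 10000. A minimal plain-Python sketch of that logic follows; the helper `attach_keys` and the sample rows are hypothetical, and the notebook itself does this with pandas `map`, `notna` and `fillna`.

```python
UNMAPPED_KEY = 10000  # same sentinel value the notebook uses

def attach_keys(rows, species_id_to_name, name_to_index):
    """Add 'name' and 'key' to each row; drop rows without a known name."""
    keyed = []
    for row in rows:
        name = species_id_to_name.get(row['Species ID'])
        if name is None:
            continue  # no common name for this species ID: drop the observation
        key = name_to_index.get(name, UNMAPPED_KEY)  # off-network names get the sentinel
        keyed.append({**row, 'name': name, 'key': key})
    return keyed

# Hypothetical sample data
rows = [
    {'Species ID': 5, 'waterway': 'lake b', 'distance': 1},
    {'Species ID': 7, 'waterway': 'river c', 'distance': 2},
    {'Species ID': 9, 'waterway': 'river a', 'distance': 3},
]
species_id_to_name = {5: 'zebra mussel', 7: 'carp'}
name_to_index = {'zebra mussel': 28}

keyed = attach_keys(rows, species_id_to_name, name_to_index)
```

Here species ID 9 disappears because it has no name, and 'carp' survives with key 10000 because it is named but not in the index.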

In [114]:
def write_network_files():
  """
  inputs none
  writes all the network data files to tables for database uploads
    keeping species IDs consistent across all tables
  returns none
  """

  print('writing network files...')

  # get all species and assign to list
  invasives = list(pd.read_csv(impact_relationships_file, delimiter='|')['translated_invasive'].unique())
  invasives = [x.strip().lower() for x in invasives]
  invasives = set(invasives)
  relationships_file = impacter_impacted_distance_named_file
  relationships_df = pd.read_csv(relationships_file)
  relationships_df['impacter'] = relationships_df['impacter'].str.lower()
  relationships_df['impacter'] = relationships_df['impacter'].str.strip()
  relationships_df['impacted'] = relationships_df['impacted'].str.lower()
  relationships_df['impacted'] = relationships_df['impacted'].str.strip()
  relationships_df = relationships_df.drop_duplicates()
  relationships_df = relationships_df[['impacter', 'impacted', 'distance']]
  obs_df = pd.read_csv(observations_waterways_distances_file)
  obs_df = obs_df.replace('zebra mussel', 'zebra/quagga mussels')
  obs_df = obs_df.replace('cyclopoid copepod', 'cyclopoid copepods')
  obs_df = obs_df.replace('chinook salmon', 'chinook')
  obs_df = obs_df.replace('spiny waterflea', 'invasive waterfleas')
  obs_df = obs_df.replace('carp', 'bighead carp') # this is an assumption, as the other types of invasive carp are already listed
  species=invasives.copy()
  species.update(set(relationships_df['impacter'].unique()))
  species.update(set(relationships_df['impacted'].unique()))
  not_in_network_list = []
  for name in obs_df.name.unique():
    if name not in species:
      not_in_network_list.append(name)
  species.update(set(not_in_network_list))

  # write species table, write invasives table
  species_id_list = []
  invasives_id_list = []
  species_loop = list(sorted(species))
  for idx, specie in enumerate(species_loop):
    species_id_list.append([idx, specie])
    if specie in invasives:
      invasives_id_list.append([idx, specie])
  pd.DataFrame(species_id_list,columns=['id', 'name']).to_csv(species_file, index=False)
  pd.DataFrame(invasives_id_list,columns=['id', 'name']).to_csv(invasive_species_file, index=False)
  species_id_dict = {x[1] : x[0] for x in species_id_list}

  # write impacter impacted relation file 
  impacter_impacted_list= []
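  for idx, row in relationships_df.iterrows():
    # keep only direct (distance 1) relationships; use column labels rather than positions
    if row['distance'] == 1:
      impacter_impacted_list.append([species_id_dict[row['impacter']], species_id_dict[row['impacted']]])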
  impacter_impacted_df = pd.DataFrame(impacter_impacted_list, columns=['impacter_id', 'impacted_id'])
  impacter_impacted_df.to_csv(impact_rel_file, index=False)

  # write species observed file
  obs_list = []
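  for idx, row in obs_df.iterrows():
    # on_network is 0 for species seen in the observations but absent from the food web
    on_network = 0 if row['name'] in not_in_network_list else 1
    obs_list.append([idx, species_id_dict[row['name']], row['waterway'], row['distance'], on_network])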
  pd.DataFrame(obs_list,columns=['uid', 'species_id', 'waterbody_name', 'distance', 'on_network']).to_csv(species_observed_file, index=False)

  return None
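
The ID consistency that `write_network_files` aims for comes from numbering one sorted master set of names and reusing that single dictionary for every table. A minimal sketch of the idea, with a hypothetical helper `assign_ids` and made-up species names:

```python
def assign_ids(all_species, invasives):
    """Number the sorted species names once and reuse those IDs for the invasives."""
    species_ids = {name: idx for idx, name in enumerate(sorted(all_species))}
    # The invasive table is a subset that keeps the IDs from the master table.
    invasive_ids = {name: species_ids[name]
                    for name in sorted(invasives) if name in species_ids}
    return species_ids, invasive_ids

# Hypothetical names, for illustration only
all_species = {'species c', 'species a', 'species b'}
invasives = {'species c', 'species a'}
species_ids, invasive_ids = assign_ids(all_species, invasives)

# Relationship rows are written with the same IDs.
edges = [('species a', 'species b')]
edge_rows = [(species_ids[src], species_ids[dst]) for src, dst in edges]
```

Because the numbering depends on the sorted order of the full set, adding or removing a species shifts the IDs, so all tables need to be regenerated together.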

## Create Networks

In [115]:
def create_networks():
  """
  no inputs
  runs the network creation from start to finish
    - creates the food web networks
    - creates the waterways networks
    - writes all the data to files for database upload
  returns none
  """
  impacter_impacted_df = combined_relationships(
    impact_relationships_file, pred_prey_file)
  species_to_index, index_to_species = species_index_dicts(impacter_impacted_df)
  dictionary_to_csv(species_to_index, species_to_index_file)
  dictionary_to_csv(index_to_species, index_to_species_file)

  relationships_named = combined_relationships(
    impact_relationships_file, pred_prey_file)
  relationships_named.to_csv(relationships_named_file, index=False)

  species_species_distance = add_distance_between_nodes(relationships_named)
  species_species_distance.to_csv(impacter_impacted_distance_named_file, 
                                  index=False)
  relationships_keyed = key_relationships(
    relationships_named, csv_to_dict(species_to_index_file))
  relationships_keyed.to_csv(relationships_keyed_file, index=False)

  water_bodies_df = create_water_bodies_df([rivers_file, lakes_file])
  water_bodies_df.to_csv(waterways_dataframe_file, index=False)

  invasion_df = simplified_invasions(invasion_file, scientific_to_common_file)

  waterways = get_waterway_pickle(water_bodies_df)
  closest_waterways = get_closest_waterway(
      invasion_df, water_bodies_df, waterways)
  closest_waterways.to_csv(specimens_locations_file, index=False)

  species_id_to_name_dict = species_id_to_foodweb_name(
      scientific_to_common_file, species_id_to_scientific_file)
  
  foodweb_name_to_index_dict = foodweb_name_to_index(species_to_index_file)
  
  species_water_distance = get_species_water_distance(
      specimens_locations_file, waterways_edges_file)
  
  observations_plot = add_species_index(species_water_distance, 
      species_id_to_name_dict, foodweb_name_to_index_dict)
  observations_plot.to_csv(observations_waterways_distances_file, index=False)
  print('Networks created.')
  write_network_files()

  return

# Networks to Visualizations

In [116]:
def create_tables_for_ripple_plot():
  """
  inputs None
  reads the named impacter/impacted distance file
  builds the impacter and impacted lists, computes polar coordinates for each
    relationship and writes the csv files for db insertion
  returns None
  """

  print('writing tables for ripple plot')

  #load and clean data
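  invasives = list(pd.read_csv(impact_relationships_file, delimiter='|')['translated_invasive'].unique())
  # normalize like the impacter/impacted columns below so the intersection with impacters matches
  invasives = [x.strip().lower() for x in invasives]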
  relationships_file = impacter_impacted_distance_named_file
  relationships_df = pd.read_csv(relationships_file)
  relationships_df['impacter'] = relationships_df['impacter'].str.lower()
  relationships_df['impacter'] = relationships_df['impacter'].str.strip()
  relationships_df['impacted'] = relationships_df['impacted'].str.lower()
  relationships_df['impacted'] = relationships_df['impacted'].str.strip()
  relationships_df = relationships_df.drop_duplicates()
  relationships_df = relationships_df[['impacter', 'impacted', 'distance']]

  #create a list of impacters
  impacters = list(relationships_df['impacter'].unique())
  impacters = list(set(invasives).intersection(set(impacters)))
  impacters = sorted(impacters)
  impacters_dict = [{'label': impacter, 'value': impacter} for impacter in impacters]
  impacters_2 = list(relationships_df['impacter'].unique())

  #compute polar coordinate theta for impacters
  relationships_list = []
  for impacter in impacters_2:
    relationships_list.append([impacter, impacter, 0, 0])
    for distance in range(1,4):
      sub_df = relationships_df[(relationships_df['impacter']==impacter) & (relationships_df['distance']==distance)].reset_index(drop=True)
      thetas = np.linspace(0, 360, len(sub_df) + 1)
      for idx, row in sub_df.iterrows():
        relationships_list.append([impacter, row['impacted'], distance, thetas[idx]])
  impacters_df = pd.DataFrame(relationships_list, columns=['impacter', 'impacted', 'radius', 'theta'])

  #create a list of impacteds
  impacteds = list(relationships_df['impacted'].unique())
  impacteds = sorted(impacteds)
  impacted_dict = [{'label': impacted, 'value': impacted} for impacted in impacteds]

  #compute polar coordinate theta for impacteds
  relationships_list = []
  for impacted in impacteds:
    relationships_list.append([impacted, impacted, 0, 0])
    for distance in range(1,4):
      sub_df = relationships_df[(relationships_df['impacted']==impacted) & (relationships_df['distance']==distance)].reset_index(drop=True)
      sub_df = sub_df[sub_df['impacter'].isin(impacters)].reset_index(drop=True)
      thetas = np.linspace(0, 360, len(sub_df) + 1)
      for idx, row in sub_df.iterrows():
        relationships_list.append([row['impacter'], impacted, distance, thetas[idx]])
  impacteds_df = pd.DataFrame(relationships_list, columns=['impacter', 'impacted', 'radius', 'theta2'])
  foodweb_df = pd.merge(impacters_df, impacteds_df,  how='outer', on=['impacter','impacted','radius']).fillna(0)

  # write database table for impacter_impacted distance
  relationships_list = []
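  for idx, row in foodweb_df.iterrows():
    # flag rows whose impacter is invasive, plus each species' own center row
    invasive_impacter = 0
    if row['impacter'] in impacters or row['impacter'] == row['impacted']:
      invasive_impacter = 1
    relationships_list.append([idx, row['impacter'], row['impacted'], row['radius'], int(row['theta']), int(row['theta2']), invasive_impacter])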
  foodweb_df = pd.DataFrame(relationships_list, columns=['uid', 'impacter', 'impacted', 'radius', 'theta', 'theta_two', 'invasive_impacter'])
  foodweb_df.to_csv(impacter_impacted_distance_file, index=False)

  # write database table for impacter dropdown
  impacter_list = []
  for idx, impacter in enumerate(impacters):
    impacter_list.append([idx, impacter])
  pd.DataFrame(impacter_list,columns=['uid', 'impacter']).to_csv(invasive_impacters_dropdown_file, index=False)

  # write database table for impacteds dropdown
  impacted_list = []
  for idx, impacted in enumerate(impacteds):
    impacted_list.append([idx, impacted])
  pd.DataFrame(impacted_list,columns=['uid', 'impacted']).to_csv(impacted_species_dropdown_file, index=False)

  print('all scripts completed successfully - done')
  return None
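
The ripple plot places each related species at a polar coordinate: the radius is the network distance (1 to 3) and theta spreads the species on a ring evenly around the circle. The function above gets the angles from `np.linspace(0, 360, len(sub_df) + 1)` and only ever indexes the first `len(sub_df)` values, so the 360 endpoint is never assigned and no point lands on top of the one at 0. The sketch below reproduces that spacing in plain Python; `ring_angles` is a hypothetical helper, not a function from the notebook.

```python
def ring_angles(count):
    """Return `count` evenly spaced angles in degrees, starting at 0 and excluding 360."""
    if count == 0:
        return []
    return [360.0 * i / count for i in range(count)]

# Four species on one ring sit a quarter turn apart.
quarter_turns = ring_angles(4)

# Three species on one ring sit a third of a turn apart.
third_turns = ring_angles(3)
```

Note that the notebook later casts the angles with `int(...)` before writing them, so non-integer angles are truncated in the output table.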

# Main

In [117]:
create_database_upload_for_NOAA()
create_networks()
create_tables_for_ripple_plot()


Downloading NOAA technical memorandum files....


247

Extracting impact lines from tm-161.pdf
[182, 182, 181, 181]
[[0, 0, 0, 0, 0, 0, 0], [182, 182, 182, 182, 182, 182, 182], [181, 181, 181, 181, 181, 181, 181], [181, 181, 181, 181, 181, 181, 181]]
Extracting impact lines from tm-161b.pdf
[6, 6, 6, 6]
[[0, 0, 0, 0, 0, 0, 0], [6, 6, 6, 6, 6, 6, 6], [6, 6, 6, 6, 6, 6, 6], [6, 6, 6, 6, 6, 6, 6]]
Extracting impact lines from tm-161c.pdf
[8, 6, 2, 2]
[[0, 0, 0, 0, 0, 0, 0], [6, 6, 6, 6, 6, 6, 6], [2, 2, 2, 2, 2, 2, 2], [2, 2, 2, 2, 2, 2, 2]]
Extracting impact lines from tm-169.pdf
[67, 67, 67, 67]
[[0, 0, 0, 0, 0, 0, 0], [67, 67, 67, 67, 67, 67, 67], [67, 67, 67, 67, 67, 67, 67], [67, 67, 67, 67, 67, 67, 67]]
Extracting impact lines from tm-169b.pdf
[28, 20, 11, 9]
[[0, 0, 0, 0, 0, 0, 0], [20, 20, 20, 20, 20, 20, 20], [11, 11, 11, 11, 11, 11, 11], [9, 9, 9, 9, 9, 9, 9]]
Extracting impact lines from tm-169c.pdf
[10, 8, 8, 10]
[[0, 0, 0, 0, 0, 0, 0], [8, 8, 8, 8, 8, 8, 8], [8, 8, 8, 8, 8, 8, 8], [10, 10, 10, 10, 10, 10, 10]]
Impact line file cr