Biodiversity

Parses taxonomic scientific names and breaks them into semantic elements.

Important: Biodiversity parser >= 4.0.0 uses a binding to https://gitlab.com/gogna/gnparser and is not backward compatible with older versions. It is, however, much faster and better than previous versions.

This gem no longer provides a remote server or a command-line executable. For those features, use https://gitlab.com/gogna/gnparser.

Installation

sudo gem install biodiversity

The gem should work on Linux, Mac, and Windows (64-bit) machines.

Benchmarks

The fastest way to process a large number of names is the Biodiversity::Parser.parse_ary([big array], simple = true) function.

For example, to parse a large file with one name per line:

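#!/usr/bin/env ruby

require 'biodiversity'

P = Biodiversity::Parser
BATCH_SIZE = 50_000
count = 0
# Stream the file line by line (the block form is not used, so no file
# handle is left open) and parse it in batches of BATCH_SIZE names.
File.foreach('all_names.txt', chomp: true).each_slice(BATCH_SIZE) do |sl|
  res = P.parse_ary(sl, true) # true selects the simple output
  count += sl.size
  puts count  # running total of names parsed so far
  puts res[0] # first result of the batch, as a progress sample
end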

Here are comparative results of running the parsers against a file with 24 million names on a 4-CPU hyperthreaded laptop:

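| Program      | Version | Full/Simple | Names/min |
| ------------ | ------- | ----------- | --------- |
| gnparser     | 0.12.0  | Simple      | 3,000,000 |
| biodiversity | 4.0.1   | Simple      | 2,000,000 |
| biodiversity | 4.0.1   | Full JSON   | 800,000   |
| biodiversity | 3.5.1   | n/a         | 40,000    |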

Example usage

You can use it as a library in Ruby:

require 'biodiversity'

# to find the gem version number
Biodiversity.version

# Note that the version in parsed output will correspond to the version of
# gnparser.

# to parse a scientific name into a simple Ruby hash
# (the second positional argument, simple, is set to true)
Biodiversity::Parser.parse("Plantago major", true)

# to parse many scientific names using all computer CPUs
Biodiversity::Parser.parse_ary(["Plantago major", ... ], true)

# to parse a scientific name into a very detailed Ruby hash
Biodiversity::Parser.parse("Plantago major")

# to parse many scientific names with all details using all computer CPUs
Biodiversity::Parser.parse_ary(["Plantago major", ... ])

# to get a JSON representation
Biodiversity::Parser.parse("Plantago").to_json

# to clean a name up
Biodiversity::Parser.parse("      Plantago       major    ")[:normalized]


# to get canonical form with or without infraspecies ranks, as well as
# stemmed version.
parsed = Biodiversity::Parser.parse("Seddera latifolia H. & S. var. latifolia")
parsed[:canonicalName][:full]
parsed[:canonicalName][:simple]
parsed[:canonicalName][:stem]

# to get detailed information about elements of the name
Biodiversity::Parser.parse("Pseudocercospora dendrobii (H.C. Burnett 1883) U. \
Braun & Crous 2003")[:details]

'Surrogate' is a broad group that includes 'Barcode of Life' names and various undetermined names containing cf., sp., spp., or nr.:

Biodiversity::Parser.parse("Coleoptera BOLD:1234567")[:surrogate]

What is "nameStringID" in the parsed results?

The ID field contains a UUID v5 hexadecimal string. The ID is generated from the bytes of the name string itself, so an identical ID can be generated in any popular programming language. You can read more about UUID version 5 in a blog post.

For example, "Homo sapiens" should generate the UUID "16f235a0-e4a3-529c-9b83-bd15fe722110".
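
To illustrate how such an ID can be reproduced outside the parser, here is a minimal sketch of UUID v5 generation in plain Ruby, following RFC 4122 (SHA-1 over the namespace bytes plus the name bytes). The helper `uuid_v5` and the constant names are hypothetical and not part of this gem. The namespace is an assumption: this README does not state it, and the sketch supposes it is the UUID v5 of the DNS domain "globalnames.org".

```ruby
require 'digest/sha1'

# Standard RFC 4122 namespace UUID for DNS names.
DNS_NAMESPACE = '6ba7b810-9dad-11d1-80b4-00c04fd430c8'

# Hypothetical helper: builds a UUID v5 string from a namespace UUID
# and a name, as described in RFC 4122.
def uuid_v5(namespace, name)
  ns_bytes = [namespace.delete('-')].pack('H*') # 16 raw namespace bytes
  hash = Digest::SHA1.digest(ns_bytes + name.b) # SHA-1 of namespace + name
  bytes = hash.bytes.first(16)                  # keep the first 128 bits
  bytes[6] = (bytes[6] & 0x0f) | 0x50           # set version to 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80           # set the RFC 4122 variant
  hex = bytes.pack('C*').unpack1('H*')
  [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join('-')
end

# ASSUMPTION: the name-string namespace is derived from the DNS domain
# "globalnames.org". Verify this against the gnparser documentation.
GN_NAMESPACE = uuid_v5(DNS_NAMESPACE, 'globalnames.org')

puts uuid_v5(GN_NAMESPACE, 'Homo sapiens')
```

If the namespace assumption holds, the last line should print the UUID quoted above for "Homo sapiens"; a different value would mean the parser uses another namespace.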

Copyright

Authors: Dmitry Mozzherin

Contributors: Patrick Leary, Hernán Lucas Pereira

Copyright (c) 2008-2020 Dmitry Mozzherin. See LICENSE for further details.
