Matching IUCN <-> Checklist.
For each plant in IUCN, we want to find the accepted name in IUCN and the set of its synonyms.

In [276]:
import numpy as np
import pandas as pd
import re

# Loading IUCN (Red List)

In [277]:
iucn = pd.read_csv("/home/joon/Téléchargements/iucn/taxonomy.csv",sep=',')

In [278]:
iucn.columns

Index(['kingdomName', 'phylumName', 'orderName', 'className', 'familyName',
       'genusName', 'speciesName', 'infraType', 'infraName', 'infraAuthority',
       'authority', 'internalTaxonId', 'taxonomicNotes'],
      dtype='object')

In [279]:
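# .copy() so that later column assignments act on an independent frame, not a view of the original
iucn = iucn[['internalTaxonId','familyName','genusName','speciesName','authority']].copy()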

The "&" character is oddly encoded (as "&amp;") in iucn. We fix that:

In [280]:
iucn['authority'] = iucn['authority'].str.replace('&amp;',"&",regex=False)
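
If other HTML entities turn up in the authority column, a more general fix is possible. The sketch below uses `html.unescape` from the standard library; the helper name `unescape_authority` is hypothetical, and whether IUCN contains entities other than `&amp;` is an assumption that has not been checked here.

```python
import html

def unescape_authority(value):
    """Decode HTML entities in one authority value; leave non-strings (e.g. NaN) untouched."""
    return html.unescape(value) if isinstance(value, str) else value

# Possible use on the notebook's column (not run here):
# iucn['authority'] = iucn['authority'].map(unescape_authority)
print(unescape_authority("Humb. &amp; Bonpl."))
```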

When running tests, we subsample IUCN:

In [281]:
#iucn = iucn.sample(n=1000)

# Loading the checklist

In [282]:
checklist1 = pd.read_csv('/home/joon/Téléchargements/wcs_feb_19.csv',sep='|')

In [283]:
checklist2 = pd.read_csv('/home/joon/Téléchargements/atoz_feb_19.csv',sep='|')

In [284]:
checklist = pd.concat([checklist1,checklist2],ignore_index=True)
del checklist1, checklist2

In [285]:
checklist = checklist.drop(columns=['checklist_db', 'ipni_id', 'taxon_rank',
       'taxon_status', 'genus_hybrid', 'species_hybrid',
       'infraspecific_rank', 'infraspecies',  'place_of_publication',
       'volume_and_page', 'first_published', 'nomenclatural_remarks',
       'geographic_area', 'lifeform_description', 'climate_description',
       'taxon_name',  'basionym_plant_name_id', 'homotypic_synonym'])

In [286]:
checklist = checklist.drop(columns=['parenthetical_author', 'publication_author','primary_author'])

In [287]:
checklist.columns

Index(['plant_name_id', 'family', 'genus', 'species', 'taxon_authors',
       'accepted_plant_name_id'],
      dtype='object')

In [288]:
checklist.accepted_plant_name_id = checklist.accepted_plant_name_id.replace(np.nan,0)

In [289]:
checklist.plant_name_id = checklist.plant_name_id.astype(int)
checklist.accepted_plant_name_id = checklist.accepted_plant_name_id.astype(int)

Some rows in checklist have no accepted_plant_name_id.
We fill those in with plant_name_id.

In [290]:
checklist.loc[checklist.accepted_plant_name_id==0,'accepted_plant_name_id'] = checklist.loc[checklist.accepted_plant_name_id==0,'plant_name_id']
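
The same fill can be sketched on a toy frame, without going through the 0 placeholder. This is only an illustration with made-up ids, assuming missing values are still NaN at that point; it is not the code that was run above.

```python
import numpy as np
import pandas as pd

# Toy data (made-up ids): the second row has no accepted id.
toy = pd.DataFrame({
    "plant_name_id": [10, 11, 12],
    "accepted_plant_name_id": [10.0, np.nan, 10.0],
})

# Fill missing accepted ids with the row's own id, then cast to int.
toy["accepted_plant_name_id"] = (
    toy["accepted_plant_name_id"].fillna(toy["plant_name_id"]).astype(int)
)
print(toy)
```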

# Matching

In [291]:
def filter_in_checklist_by_gs(genus,species):
    return checklist[(checklist['genus']==genus) & (checklist['species']==species)]

In [292]:
def filter_in_checklist_by_author(string,df,regex=False):
    # na=False treats rows with a missing taxon_authors as non-matching
    return df[df.taxon_authors.str.contains(string,regex=regex,case=True,na=False)]

The following function returns:
* the accepted_plant_name_id from checklist if matching is successful
* -1 if the genus is not found in checklist
* -2 if the species is not found in checklist
* -3 if the author could not be matched
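
The return codes listed above can be tallied with a small helper. This is a minimal sketch: the name `summarize_matches` and the label strings are hypothetical and are not part of the notebook.

```python
from collections import Counter

# Negative codes as described in the list above; anything else counts as a match.
FAILURE_CODES = {
    -1: "genus not found",
    -2: "species not found",
    -3: "author not matched",
}

def summarize_matches(results):
    """Count how many results fall under each failure code or are matched."""
    counts = Counter()
    for value in results:
        counts[FAILURE_CODES.get(int(value), "matched")] += 1
    return dict(counts)

print(summarize_matches([123, -1, -2, 456, -1]))
```

It could be applied to the Series returned by `iucn.apply(...)`, assuming every entry is an integer id or one of the codes above.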

In [293]:
from fuzzywuzzy import process

In [294]:
def matching_accepted_plant_name_from_checklist(iucn_row):
    global n_nonunique
    genus = iucn_row['genusName']
    species = iucn_row['speciesName']
    g_s = genus + ' ' + species
    search_res_gs = filter_in_checklist_by_gs(genus,species)
    if search_res_gs.shape[0] == 0: # if match fails with genus+species
        search_res_g = checklist[checklist['genus']==genus]
        if search_res_g.shape[0] == 0:
            print('Genus',genus,'not found in checklist')
            return -1
        else:
            fuzzy_extraction = process.extractOne(species,search_res_g.species.values)
            species_guess = fuzzy_extraction[0]
            guess_score = fuzzy_extraction[1]
            # print('Exact match for species',species, 'not found. Closest match:',species_guess,guess_score) 
            if guess_score >= 80:
                species = species_guess
                search_res_gs = search_res_g[search_res_g.species == species]
            else:
                print(g_s,'↳ Not close enough! Species not found in checklist')
                return -2
    if search_res_gs.accepted_plant_name_id.nunique() == 1:
        return search_res_gs.accepted_plant_name_id.iloc[0]
    else:
        authority = iucn_row.authority
        search_res_matching_authority = search_res_gs[search_res_gs.taxon_authors==authority]
        if search_res_matching_authority.accepted_plant_name_id.nunique() == 1:
            return search_res_matching_authority.accepted_plant_name_id.iloc[0]
        else:
            search_res_containing_authority = filter_in_checklist_by_author(authority,search_res_gs)
            n_res_containing_authority = search_res_containing_authority.accepted_plant_name_id.nunique()
            if  n_res_containing_authority == 1:
                return search_res_containing_authority.accepted_plant_name_id.iloc[0]
            elif n_res_containing_authority >= 2:
                n_nonunique = n_nonunique + 1
                print('Couln\'t find a unique match :', g_s,authority)
                return np.max(search_res_containing_authority.accepted_plant_name_id)
            else:
                # raw string for the regex; keep only tokens longer than one character
                words_from_authority = [w for w in re.sub(r'[^\w\s]',' ',authority).split() if len(w) > 1]
                search_res = search_res_gs
                while (len(words_from_authority) > 0) and (search_res.accepted_plant_name_id.nunique() > 1):
                    word =  words_from_authority.pop(0)
                    search_res_tmp = filter_in_checklist_by_author(word,search_res)
                    if search_res_tmp.shape[0] > 0:
                        search_res = search_res_tmp
                if search_res.shape[0] == 0:
                    print('Author matching failed :',g_s)
                    return -3
                else:
                    if search_res.accepted_plant_name_id.nunique() > 1:
                        n_nonunique = n_nonunique + 1
                        print('Couln\'t find a unique match :', g_s,authority)
                    return np.max(search_res.accepted_plant_name_id)
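
The fuzzy fallback on the species epithet can be looked at in isolation. The sketch below uses `difflib` from the standard library as a stand-in for `process.extractOne`; its similarity ratio (0 to 1) is not on the same scale as the fuzzywuzzy score, so the 0.8 cutoff is only illustrative. The helper name `closest_species` and the epithets are made up.

```python
import difflib

def closest_species(species, candidates, cutoff=0.8):
    """Return the candidate epithet closest to `species`, or None if none reaches the cutoff."""
    matches = difflib.get_close_matches(species, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A spelling variant is accepted, an unrelated epithet is rejected.
print(closest_species("sylvatica", ["silvatica", "alba"]))
print(closest_species("alba", ["nigra", "rubra"]))
```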

In [295]:
n_nonunique = 0
matching_results = iucn.apply(matching_accepted_plant_name_from_checklist,axis=1)

Couln't find a unique match : Eriosyce crispa (F.Ritter) Katt.
Couln't find a unique match : Cleistocactus winteri D.R.Hunt
Couln't find a unique match : Discocactus diersianus Esteves
Couln't find a unique match : Pygmaeocereus bieblii Diers
Couln't find a unique match : Melocactus stramineus Suringar
Couln't find a unique match : Coffea sessiliflora Bridson
Echinochloa colona ↳ Not close enough! Species not found in checklist
Couln't find a unique match : Epilobium parviflorum Schreber
Persicaria glabrum ↳ Not close enough! Species not found in checklist
Couln't find a unique match : Plantago major L. 
Couln't find a unique match : Persicaria barbata (L. ) H.Hara
Couln't find a unique match : Paspalum distichum L. 
Couln't find a unique match : Vigna hosei Backer ex K.Heyne
Couln't find a unique match : Oryza eichingeri Peter
Dalbergia gloveri ↳ Not close enough! Species not found in checklist
Chassalia sp. nov. ↳ Not close enough! Species not found in checklist
Couln't find a unique

Genus Aporusa not found in checklist
Couln't find a unique match : Shorea stenoptera Burck.
Genus Engelhardtia not found in checklist
Couln't find a unique match : Plathymenia foliolosa Benth.
Couln't find a unique match : Prosopis alba Grisebach
Couln't find a unique match : Prosopis nigra (Grisebach) Hieronymus
Couln't find a unique match : Cyanea arborea Hillebr.
Couln't find a unique match : Cyanea solenocalyx Hillebr.
Couln't find a unique match : Clermontia hawaiiensis (Hbd.) Rock
Couln't find a unique match : Clermontia grandiflora Gaud.
Stryphnodendron harbesonii ↳ Not close enough! Species not found in checklist
Couln't find a unique match : Acacia permixta Burtt Davy
Couln't find a unique match : Polyalthia elmeri Merr.
Genus Aporusa not found in checklist
Couln't find a unique match : Ternstroemia wallichiana (Griffith) Engl.
Couln't find a unique match : Durio testudinarum Becc.
Couln't find a unique match : Schinus engleri Barkley
Couln't find a unique match : Ruprechtia a

Couln't find a unique match : Cycas arnhemica K.D.Hill
Couln't find a unique match : Dioon tomasellii De Luca, Sabato & Vázq.Torres
Couln't find a unique match : Stenandrium harlingii Wassh.
Couln't find a unique match : Neriacanthus harlingii Wassh.
Couln't find a unique match : Draba violacea Humb. & Bonpl.
Couln't find a unique match : Pitcairnia simulans H.Luther
Genus Dasphyllum not found in checklist
Genus Dasphyllum not found in checklist
Couln't find a unique match : Tillandsia cyanea Linden ex K.Koch
Couln't find a unique match : Hieracium sprucei Arv.-Touv.
Couln't find a unique match : Centropogon saltuum E.Wimm.
Couln't find a unique match : Selaginella sericea A.Braun
Couln't find a unique match : Pappobolus ecuadoriensis Panero
Genus Vanvoorstia not found in checklist
Piper seychellarum ↳ Not close enough! Species not found in checklist
Couln't find a unique match : Clermontia multiflora Hillebr.
Couln't find a unique match : Euphorbia cap-saintemariensis Rauh
Couln't fin

Couln't find a unique match : Psoralea reverchonii S. Watson
Couln't find a unique match : Senna pendula (Willd.) H.S.Irwin & Barneby
Couln't find a unique match : Sophora gypsophila B.L.Turner & A.M.Powell
Couln't find a unique match : Tephrosia angustissima Chapman
Couln't find a unique match : Tephrosia barbatala Bosman & de Haas
Couln't find a unique match : Tephrosia brachyodon Domin
Couln't find a unique match : Tephrosia cephalantha Welw. ex Baker
Couln't find a unique match : Thermopsis montana Torr. & A.Gray
Couln't find a unique match : Zornia pardina Mohlenbr.
Couln't find a unique match : Allium oliganthum Kar. & Kir.
Couln't find a unique match : Nepenthes gracillima Ridley
Couln't find a unique match : Acacia aemula Maslin
Couln't find a unique match : Acacia phlebopetala Maslin
Couln't find a unique match : Dendrobium poneroides Schltr.
Couln't find a unique match : Cyperus glaber L.     
Couln't find a unique match : Mentha longifolia (L.) Huds
Couln't find a unique mat

Isoetes tiguliana ↳ Not close enough! Species not found in checklist
Couln't find a unique match : Carduus nyassanus (S.Moore) R.E.Fr.
Couln't find a unique match : Alisma plantago-aquatica L.      
Couln't find a unique match : Potamogeton pusillus L.      
Couln't find a unique match : Strobilanthes accrescens J.R.I.Wood
Couln't find a unique match : Pedicularis longipedicellata P.C.Tsoong
Couln't find a unique match : Pedicularis mucronulata P.C.Tsoong
Couln't find a unique match : Pedicularis perpusilla P.C.Tsoong
Couln't find a unique match : Pedicularis porriginosa P.C.Tsoong
Couln't find a unique match : Pedicularis xylopoda P.C.Tsoong
Couln't find a unique match : Potentilla bryoides Sojak
Couln't find a unique match : Pteridium pinetorum C.N.Page & R.R.Mill
Couln't find a unique match : Pedicularis imbricata P.C.Tsoong
Couln't find a unique match : Pedicularis inconspicua P.C.Tsoong
Couln't find a unique match : Lepidagathis sericea Benoist
Couln't find a unique match : Ulex a

Proportion of unsuccessful matches

In [298]:
np.sum(matching_results<=0)/len(matching_results)

0.01217978940751218

Proportion of non-unique matches

In [299]:
n_nonunique/len(matching_results)

0.01438000942951438
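
Once an accepted id is known, the synonyms mentioned at the start can be looked up by filtering the checklist on accepted_plant_name_id. This is a sketch on a toy frame with made-up names. It assumes the accepted name is the row whose plant_name_id equals its accepted_plant_name_id, which is consistent with the fill step above but has not been verified on the real data. The helper name `names_for_accepted_id` is hypothetical.

```python
import pandas as pd

# Toy checklist (made-up names): ids 2 and 3 are synonyms of id 1.
toy_checklist = pd.DataFrame({
    "plant_name_id": [1, 2, 3, 4],
    "genus": ["Aus", "Aus", "Bus", "Cus"],
    "species": ["alba", "candida", "alba", "nigra"],
    "accepted_plant_name_id": [1, 1, 1, 4],
})

def names_for_accepted_id(checklist_df, accepted_id):
    """Split the rows sharing an accepted id into the accepted name and its synonyms."""
    rows = checklist_df[checklist_df["accepted_plant_name_id"] == accepted_id]
    accepted = rows[rows["plant_name_id"] == accepted_id]
    synonyms = rows[rows["plant_name_id"] != accepted_id]
    return accepted, synonyms

accepted, synonyms = names_for_accepted_id(toy_checklist, 1)
print(accepted)
print(synonyms)
```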