## Comparing Algorithm to Handwritten Transcription of Egg Cards

## Scoring

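We will compare a random sample of results from our algorithm to the handwritten transcriptions. To create the sample, we flag entries with a low score. The score counts the categories that are not empty: each single field earns one point, and the four taxonomy fields together earn one point when at least two of them are filled. Entries scoring below 6 out of a possible 8 are marked as not kept.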

#### Imports

In [1]:
import pandas as pd
import os
import re
import numpy as np

#### Categories to test

In [41]:
categories_to_score = ['cardSpecies', ['order', 'family', 'genus', 'species'], 'registrationNumber', 
 'locality', 'collector', 'date', 'setMark', 'noOfEggs']

#### Functions

In [2]:
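def is_nan_(w):
    # np.isnan raises TypeError for strings, which are never NaN.
    # bool() turns NumPy's np.bool_ into a plain Python bool so that
    # the `is False` checks used below also work for numeric values.
    try:
        return bool(np.isnan(w))
    except (TypeError, ValueError):
        return False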

In [44]:
def get_score(sample):

    score = 0
    for category in categories_to_score:
        if type(category) is str:
            text = sample[category]
            if is_nan_(text) is False:
                score += 1
        else:
            pre_score = 0
            for cat in category:
                text = sample[cat]
                if is_nan_(text) is False:
                    pre_score += 1
            if pre_score >= 2:
                score+=1

    return score        

In [59]:
def get_list_of_index_to_keep(df):
    # One True/False flag per row, in row order: keep rows scoring at least 6.
    # Building a list (rather than a dict keyed by ID) keeps one flag per row
    # even if an ID appears more than once.
    return [get_score(df.iloc[i]) >= 6 for i in range(len(df))]
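
The row-by-row scoring above can also be expressed with column-wise pandas operations. The following is a minimal sketch under the same rules (one point per non-empty single field, one point when at least two taxonomy fields are filled). The names `single_fields`, `taxonomy_fields`, `keep_mask` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same categories as above, split into single fields and the taxonomy group.
single_fields = ['cardSpecies', 'registrationNumber', 'locality',
                 'collector', 'date', 'setMark', 'noOfEggs']
taxonomy_fields = ['order', 'family', 'genus', 'species']

def keep_mask(df, min_score=6):
    """Return a boolean Series: True where the row scores at least min_score."""
    # One point per non-empty single field.
    score = df[single_fields].notna().sum(axis=1)
    # One extra point if at least two taxonomy fields are non-empty.
    score = score + (df[taxonomy_fields].notna().sum(axis=1) >= 2).astype(int)
    return score >= min_score

# Toy data (made up): row 0 is fully filled, row 1 has only two fields.
toy = pd.DataFrame({c: ['x', np.nan] for c in single_fields + taxonomy_fields})
toy.loc[1, ['cardSpecies', 'locality']] = 'y'
print(keep_mask(toy).tolist())  # [True, False]
```

With a mask like this, `df['keep?'] = keep_mask(df)` would fill the same column as the loop-based function.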

#### Test

In [2]:
path = 'corrected/sample'

In [3]:
files = os.listdir(path)

In [5]:
df = pd.read_csv(path+'/'+files[0])

In [36]:
sample = df.iloc[0]

In [45]:
get_score(sample)

4

#### Filter / Save Results

In [65]:
for file in files:
    df = pd.read_csv(path+'/'+file)
    keep = get_list_of_index_to_keep(df)
    df['keep?'] = keep
    df.to_csv('corrected/filtered/'+file[:-4]+'_filtered.csv',index=False)

## Renaming IDs

Our IDs were based on the image name. We create a new column (`newID` in the code below) holding just the individual card number, so that entries can be matched with the handwritten transcription more easily.

Example ID (from our results): 058-0595a. This will become 595.

In [66]:
df = pd.read_csv('/home/arias1/Documents/GitHub/egg_cards/corrected/filtered/edited/drawer_58_results_nonbin_v2_filtered.csv')

In [70]:
ids_ = list(df['id'])

In [80]:
numbers = []

for id_ in ids_:
    # The last run of digits is the card number; the raw string avoids an invalid-escape warning.
    numbers.append(int(re.findall(r'\d+',id_)[-1]))

In [84]:
for file in os.listdir('corrected/filtered/edited'):
    df = pd.read_csv('corrected/filtered/edited/'+file)
    numbers = []
    ids_ = list(df['id'])
    for id_ in ids_:
        # The last run of digits is the card number; the raw string avoids an invalid-escape warning.
        numbers.append(int(re.findall(r'\d+',id_)[-1]))
    df['newID'] = numbers
    df.to_csv('corrected/filtered/edited2/'+file,index=False)
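
The same extraction can be done on a whole column at once with pandas string methods. This is a small sketch on toy values; apart from `058-0595a`, which is the example given above, the IDs are made up for illustration.

```python
import pandas as pd

# Toy IDs in the style of the image names.
ids = pd.Series(['058-0595a', '058-0003', '058-0945b'])

# Find every run of digits, keep the last run, and convert it to an integer.
new_ids = ids.str.findall(r'\d+').str[-1].astype(int)
print(new_ids.tolist())  # [595, 3, 945]
```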

## Comparing Results

We make a direct comparison between individual entries per ID, by category. For now, we focus on five categories:
1. Registration year
1. Family
1. Genus
1. Species
1. Collector / Collection

(Scroll to the bottom for general results)

#### Imports

In [32]:
from fuzzywuzzy import fuzz
import re
import pandas as pd
import numpy as np

#### Original data

In [24]:
df_act = pd.read_csv('58_real_test.csv')
df_pred = pd.read_csv('corrected/filtered/edited2/drawer_58_results_nonbin_v2_filtered.csv')

In [94]:
df_act[df_act['ID']==3]

Unnamed: 0,ID,Drawer number,Card number,RegisterNumber(Year),RegisterNumber(Month/Batch),RegisterNumber(Day),RegisterNumber(Start),RegisterNumber(End),RegisterNumber(Suffix),Hybrid,...,Subspecies1,NameUncertain,Family2,Genus2,Subgenus2,Species2,Subspecies2,Host(Common Name),Digitising Notes,Collection
2,3,58,3,1901.0,11.0,20.0,126.0,,,0,...,picturata,0,,,,,,,,Crowley Bequest


In [93]:
df_pred[df_pred['ID']==3]

Unnamed: 0,imageID,ID,cardSpecies,order,family,genus,species,registrationNumber,locality,collector,date,setMark,noOfEggs
3,058-0003,3,STREPTOPELIA PICTURATA PICTURATA,,Columbidae,Streptopelia,Streptopelia picturata,8.901.11.20.126,Madagascar,but the Deans Cowal Foottit Collection Crowley...,28°,214A,_ |


#### Refined data (w. relevant categories)

In [25]:
cols_act = ['ID','Family1','RegisterNumber(Year)','Genus1','Species1','Subspecies1','Collection']

In [26]:
cols_pred = ['ID','family','registrationNumber','genus','species','collector']

In [27]:
df_pred_ = df_pred[cols_pred]
df_act_ = df_act[cols_act]

In [28]:
IDs_to_check = list(df_pred_['ID'])
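
Instead of filtering both tables once per ID, the two tables could be joined on `ID` so that each row holds the transcribed and the predicted values side by side. The sketch below uses toy stand-ins for `df_act_` and `df_pred_`. The genus values for IDs 3 and 945 are taken from the examples in this notebook, while ID 7 and the column name `genus_match` are assumptions added for illustration.

```python
import pandas as pd

# Toy stand-ins for the refined tables.
act = pd.DataFrame({'ID': [3, 945, 7],
                    'Genus1': ['Streptopelia', 'Scardafella', 'Columba']})
pred = pd.DataFrame({'ID': [3, 945],
                     'genus': ['Streptopelia', 'Columbina']})

# An inner join keeps only the IDs present in both tables.
pairs = pd.merge(act, pred, on='ID', how='inner')

# Exact comparison, one flag per card.
pairs['genus_match'] = pairs['Genus1'] == pairs['genus']
print(pairs['ID'].tolist())           # [3, 945]
print(pairs['genus_match'].tolist())  # [True, False]
```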

### 1) Registration year

In [46]:
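def check_reg_year(tst_a,tst_p):
    # Returns 1 if the transcribed year appears among the four-digit
    # numbers found in the predicted registration number, else 0.
    same = 0
    try:
        p = [int(y) for y in re.findall(r'\d\d\d\d',tst_p['registrationNumber'].iloc[0])]
        a = int(tst_a['RegisterNumber(Year)'].iloc[0])
        if a in p:
            same = 1
    except (TypeError, ValueError, IndexError):
        # Missing row, empty year or non-string registration number: no match.
        pass
    return same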

In [47]:
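reg_rate = {}
for id_ in IDs_to_check:
    # Select the transcribed (actual) and predicted rows for this card.
    tst_a = df_act_[df_act_['ID']==id_]
    tst_p = df_pred_[df_pred_['ID']==id_]
    # Score only cards for which a registration number was predicted.
    if is_nan_(tst_p['registrationNumber'].iloc[0]) is False:
        reg = check_reg_year(tst_a,tst_p)
        reg_rate[str(id_)] = reg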

In [48]:
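# Caution: numerator and denominator are the same sum, so this evaluates to
# 100 whenever at least one year matched. An accuracy would divide by
# len(reg_rate), the number of cards scored.
np.ceil((sum(list(reg_rate.values()))/sum(list(reg_rate.values())))*100)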

100.0
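
The percentage above has the same sum in its numerator and denominator, so it does not depend on how many years actually matched. A minimal sketch of the intended ratio, matches divided by the number of cards scored, is given below. The helper name `accuracy_percent` and the toy flags are assumptions for illustration and do not reproduce the notebook's data.

```python
import math

def accuracy_percent(match_flags):
    """Percentage of 1s in a dict of {id: 0 or 1}; None if the dict is empty."""
    if not match_flags:
        return None
    return 100 * sum(match_flags.values()) / len(match_flags)

# Toy flags (made up): three of four years matched.
toy_flags = {'3': 1, '4': 1, '5': 0, '6': 1}
print(accuracy_percent(toy_flags))             # 75.0
print(math.ceil(accuracy_percent(toy_flags)))  # 75
```

Applied to `reg_rate`, this would give the share of scored cards whose registration year was recovered.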

### 2) Family

In [49]:
k = 0
n = 0
inds = []

for id_ in IDs_to_check:
    tst_a = df_act_[df_act_['ID']==id_]
    tst_p = df_pred_[df_pred_['ID']==id_]
    
    fam_a = tst_a['Family1'].iloc[0]
    fam_p = tst_p['family'].iloc[0]

    if (is_nan_(fam_a) == False) and (is_nan_(fam_p) == False):
        if fam_a == fam_p:
            k = k+1
        n +=1
        inds.append(id_)

    
print([k,n])

[24, 24]


In [50]:
np.ceil((k/n)*100)

100.0

### 3) Genus

In [51]:
k2 = 0    # cards whose genus matches
n2 = 0    # cards with a genus in both sources
inds = []
near_matches = []    # pairs accepted by the fuzzy comparison

for id_ in IDs_to_check:
    tst_a = df_act_[df_act_['ID']==id_]
    tst_p = df_pred_[df_pred_['ID']==id_]
    
    gen_a = tst_a['Genus1'].iloc[0]
    gen_p = tst_p['genus'].iloc[0]

    if (is_nan_(gen_a) == False) and (is_nan_(gen_p) == False):
        if gen_a == gen_p:
            k2 = k2+1
        else:
            # Accept near-identical spellings (similarity above 90).
            r = fuzz.ratio(gen_a, gen_p)
            if r > 90:
                k2 = k2+1
                near_matches.append([gen_a,gen_p])

        n2 +=1
        inds.append(id_)

    
print([k2,n2])

[165, 179]


In [52]:
np.ceil((k2/n2)*100)

93.0

### 4) Species

In [53]:
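k2 = 0    # cards whose species matches
n2 = 0    # cards with a species in both sources
inds = []
near_matches = []    # pairs accepted by the fuzzy comparison

for id_ in IDs_to_check:
    tst_a = df_act_[df_act_['ID']==id_]
    tst_p = df_pred_[df_pred_['ID']==id_]
    
    sp_a = tst_a['Species1'].iloc[0]
    sp_p = tst_p['species'].iloc[0]

    if (is_nan_(sp_a) == False) and (is_nan_(sp_p) == False):
        # The predicted species is a binomial (e.g. "Streptopelia picturata"),
        # so the transcribed epithet counts as a match when it is contained in it.
        if (sp_a == sp_p) or (sp_a.lower() in sp_p.lower()):
            k2 = k2+1
        else:
            r = fuzz.ratio(sp_a.lower(), sp_p.lower())
            if r > 80:
                k2 = k2+1
                near_matches.append([sp_a.lower(),sp_p.lower()])

        n2 +=1
        inds.append(id_)

    
print([k2,n2])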

[75, 79]


In [54]:
np.ceil((k2/n2)*100)

95.0

### 5) Collector

#### Test

In [208]:
df_act_[df_act_['ID'] == 945]

Unnamed: 0,ID,Family1,RegisterNumber(Year),Genus1,Species1,Subspecies1,Collection
944,945,,1909.0,Scardafella,inca,,Godman-Salvin Coll.


In [209]:
df_pred_[df_pred_['ID'] == 945]

Unnamed: 0,ID,family,registrationNumber,genus,species,collector
128,945,Columbidae,1909.10.1.16.18,Columbina,Columbina inca,W B Richardson Godman Salvin Colin


In [212]:
a = df_act_[df_act_['ID'] == 945]['Collection'].iloc[0].lower()
b = df_pred_[df_pred_['ID'] == 945]['collector'].iloc[0].lower()

In [213]:
fuzz.ratio(a,b)

60
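
`fuzz.ratio` scores the similarity of two whole strings on a 0 to 100 scale, which is why the pair above only reaches 60: the predicted text also contains the collector's name, so the strings differ in length even though the collection name is largely present. The sketch below illustrates combining a containment check with a similarity threshold. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, so scores may differ slightly from fuzzywuzzy's, and the helper names `similarity` and `collector_match` are assumptions for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity of two strings on a 0-100 scale, rounded to an integer."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def collector_match(actual, predicted, threshold):
    """True if actual is contained in predicted or is similar enough to it."""
    a, p = actual.lower(), predicted.lower()
    return a in p or similarity(a, p) >= threshold

# Identical strings score 100.
print(similarity('crowley bequest', 'crowley bequest'))  # 100

# Containment counts as a match even when extra text surrounds the name.
print(collector_match('Crowley Bequest', 'xx crowley bequest yy', threshold=80))  # True

# Strings with nothing in common do not match.
print(collector_match('abc', 'xyz', threshold=80))  # False
```

To keep runs at different thresholds comparable, every non-empty pair would be counted in the denominator, whether or not it matches.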

#### Fuzzy Threshold = 60

In [57]:
k2 = 0
n2 = 0
inds = []
bla = []

for id_ in IDs_to_check:
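    tst_a = df_act_[df_act_['ID']==id_]
    tst_p = df_pred_[df_pred_['ID']==id_]
    
    a = tst_a['Collection'].iloc[0]
    p = tst_p['collector'].iloc[0]
    
    if (is_nan_(a) == False) and (is_nan_(p) == False):

        # Note: n2 is incremented only inside the containment branch below, so
        # this run scores just the entries whose transcribed collection name
        # already appears (case-insensitively) in the predicted text. The
        # threshold-80 run further down counts every non-empty pair instead.
        if (a.lower() == p.lower()) or (a.lower() in p.lower()):

            if (a == p) or (a in p):
                k2 = k2+1
            else:
                r = fuzz.ratio(a.lower(), p.lower())
                if r >= 60:
                    k2 = k2+1
                    bla.append([a.lower(),p.lower()])

            n2 +=1
            inds.append(id_)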

    
print([k2,n2])

[119, 119]


In [58]:
np.ceil((k2/n2)*100)

100.0

#### Fuzzy Threshold = 80

In [55]:
k2 = 0
n2 = 0
inds = []
bla = []

for id_ in IDs_to_check:
    tst_a = df_act_[df_act_['ID']==id_]
    tst_p = df_pred_[df_pred_['ID']==id_]
    
    a = tst_a['Collection'].iloc[0]
    p = tst_p['collector'].iloc[0]
    

    if (is_nan_(a) == False) and (is_nan_(p) == False):
        
        if (a.lower() == p.lower()) or (a.lower() in p.lower()):
            k2 = k2+1
        else:
            r = fuzz.ratio(a.lower(), p.lower())
            if r >= 80:
                k2 = k2+1
                bla.append([a.lower(),p.lower()])

        n2 +=1
        inds.append(id_)

    
print([k2,n2])

[139, 218]


In [56]:
np.ceil((k2/n2)*100)

64.0

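__TL;DR__ - Best accuracies (excluding empty entries, rounded up with `np.ceil`) per category:
- Registration year: 100% reported, but the cell divides the number of matches by itself, so this figure does not measure accuracy
- Family: 100% (24/24)
- Genus: 93% (165/179, fuzz threshold: 90)
- Species: 95% (75/79, fuzz threshold: 80)
- Collector: 100% (119/119, fuzz threshold: 60), counted only over entries where the transcribed name is contained in the prediction; over all 218 non-empty pairs, the threshold-80 run gives 64% (139/218)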