# PISCO - fish transect data - dealing with the notes column

I would like to make some effort to extract useful information and tidy the notes column. **A few of the notes are pretty inappropriate and/or use names. However, since PISCO has already shared them publicly, I assume it continues to be OK to do so.** Many, many of the notes contain potentially useful information.

A large number of the notes contain sex information. I can probably extract this. I need to look for and pull:
- M
- F
- M;
- F;
- MALE
- FEMALE
- FEAMLE
- MALE;
- FEMALE;
- TRANSITIONAL;
- TRANSITION
- MALE,

There is also some life stage information:
- JUVENILE
- JUVENILE;
- JEVENILE

Sometimes, sex is uncertain (e.g. 'M?'). I'll leave these in the notes.
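
One way to sketch this extraction, assuming a note is split on `;`, `,` or `/` and a piece counts only when it exactly matches one of the variants listed above. The names `SEX_MAP` and `extract_sex` are hypothetical and this is not the exact logic used in the cells further down. Uncertain entries such as 'M?' fail the exact match, so they are left in the notes.

```python
import re

# Hypothetical lookup table built from the variants listed above.
SEX_MAP = {
    'M': 'male', 'MALE': 'male',
    'F': 'female', 'FEMALE': 'female', 'FEAMLE': 'female',
    'TRANSITIONAL': 'transitional', 'TRANSITION': 'transitional',
}

def extract_sex(note):
    """Return a standardized sex term, or None if there is no single clear match."""
    if not isinstance(note, str):
        return None  # missing notes are read by pandas as float NaN
    tokens = [t.strip().upper() for t in re.split(r'[;,/]', note)]
    matches = {SEX_MAP[t] for t in tokens if t in SEX_MAP}
    # Accept only an unambiguous result (e.g. 'M/F' yields two terms and is rejected)
    return matches.pop() if len(matches) == 1 else None

print(extract_sex('MALE; NO CANOPY KELP'))
print(extract_sex('M?'))
print(extract_sex('NO SEX'))
```

Requiring an exact match keeps free-text notes such as 'NO SEX' or 'MED/HIGH RELIEF BEDROCK' from being misread as a sex value.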

Note cleaning:
- Explore cleaning lowercase versus capitals
- Some notes are preceded by '. '
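
Both cleaning ideas can be tried with pandas string methods. This is a minimal sketch on made-up notes, not the real column; the variable names `raw` and `cleaned` are illustrative only.

```python
import pandas as pd

# Toy notes; the real column is not reproduced here.
raw = pd.Series(['. Female', 'male; no canopy kelp', None, '.  NO SEX'])

# Strip a leading '.' plus any following whitespace, then uppercase.
# The .str methods propagate missing values, so empty notes stay missing.
cleaned = raw.str.replace(r'^\.\s*', '', regex=True).str.strip().str.upper()

print(cleaned.tolist())
```

Uppercasing before matching means variants such as 'Female' and 'FEMALE' only need one entry in any lookup list.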

In [107]:
## Imports

import pandas as pd
import numpy as np
from random import randint, seed, sample

In [129]:
## Load data

fish_occ = pd.read_csv('fish_occ.csv', dtype={'sex':str, 'lifeStage':str})
fish = pd.read_csv('fish.csv', dtype={'transect':str, 'sex':str, 'site_name_old':str})

print(fish_occ.shape)
fish.shape

(369262, 14)


(381693, 29)

**Note** that the two record counts above differ because fish still includes records where classcode = NO_ORG and a single record where count = NaN.
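
To confirm that explanation, the rows matching either condition can be counted and compared with the gap between the two tables (381693 - 369262 = 12431 here). A minimal sketch on toy data: the column names `classcode` and `count` come from the data, but the values and the name `toy` are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the fish table
toy = pd.DataFrame({
    'classcode': ['SPUL', 'NO_ORG', 'HDEC', 'NO_ORG', 'PCLA'],
    'count': [1.0, 0.0, np.nan, 0.0, 3.0],
})

no_org = toy['classcode'] == 'NO_ORG'
missing_count = toy['count'].isna()

# Rows matching either condition are the ones expected to be absent from fish_occ
n_extra = int((no_org | missing_count).sum())
print(n_extra, len(toy) - n_extra)
```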

In [130]:
## Drop the record where count is missing and the records where classcode is NO_ORG

print(fish.shape)
fish.dropna(subset=['count'], inplace=True)
fish = fish[fish['classcode'] != 'NO_ORG']
fish.reset_index(drop=True, inplace=True)
fish.shape

(381693, 29)


(369262, 29)
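
The later cells copy columns between `fish` and `fish_occ` by row index, which assumes the two tables are in the same row order after this filtering. Below is a hedged sketch of a cheap guard; `check_same_index` is a hypothetical helper, and matching indexes do not by themselves prove that the rows describe the same observations.

```python
import pandas as pd

def check_same_index(left, right):
    """Raise if two frames cannot be safely combined by index label."""
    if len(left) != len(right):
        raise ValueError(f'row counts differ: {len(left)} vs {len(right)}')
    if not left.index.equals(right.index):
        raise ValueError('indexes differ; reset or realign before assigning')

# Toy frames with matching default indexes
a = pd.DataFrame({'sex': ['female', None, 'male']})
b = pd.DataFrame({'notes': ['FEMALE', 'NO SEX', None]})
check_same_index(a, b)  # passes silently
```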

In [133]:
## Obtain relevant columns from fish (the results will be written back to fish_occ)

notes = fish[['site', 'survey_date', 'classcode', 'count', 'sex', 'notes']].copy()
notes.shape

(369262, 6)

In [134]:
## Grab a random subset of records to make troubleshooting easier

seed(42)
idx = sample(range(notes.shape[0]), 1000)
subset = notes.iloc[idx, :].copy()

In [135]:
## Extract sex information from notes

sex_notes = []
sex_options = ['M', 'F', 'MALE', 'FEMALE', 'FEAMLE', 'MALES', 'FEMALES', 'TRANSITIONAL', 'TRANSITION', 'TRANNY']

for note in subset['notes']:
    
    semicolon_overlap = []
    comma_overlap = []
    slash_overlap = []
    
    if note == note:  # NaN != NaN, so this skips missing notes
        
        # Split on each delimiter; only use splits that produce no empty pieces
        semicolon_split = list(map(str.strip, note.split(';')))
        if (len(semicolon_split) > 1) and ('' not in semicolon_split):
            semicolon_overlap = list(set(sex_options) & set(semicolon_split))
            
        comma_split = list(map(str.strip, note.split(',')))
        if (len(comma_split) > 1) and ('' not in comma_split):
            comma_overlap = list(set(sex_options) & set(comma_split))
            
        slash_split = list(map(str.strip, note.split('/')))
        if (len(slash_split) > 1) and ('' not in slash_split):
            slash_overlap = list(set(sex_options) & set(slash_split))
          
        
        # Append exactly one value per note so sex_notes stays aligned with the rows
        if note in sex_options:
            sex_notes.append(note)
        elif len(semicolon_overlap) == 1:
            sex_notes.append(semicolon_overlap[0])
        elif len(comma_overlap) == 1:
            sex_notes.append(comma_overlap[0])
        elif len(slash_overlap) == 1:
            sex_notes.append(slash_overlap[0])
        
        else:
            sex_notes.append(np.nan)
            
    else:
        sex_notes.append(np.nan)

In [136]:
## Inspect outcome

subset['sex_notes'] = sex_notes

pd.set_option('display.max_rows', 60)
subset[subset['sex'].notna() | subset['notes'].notna()]

Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes
340726,PYRAMID_POINT_3,20180728,HDEC,1.0,FEMALE,,
221571,ANACAPA_WEST_ISLE_E,20090928,SPUL,1.0,JUVENILE,JUVENILE,
178389,NAPLES_CEN,20060913,HCAR,1.0,,HIGH RELIEF BEDROCK (NAPLES PROPER/ 3 FINGERS)...,
176472,SCI_COCHE_POINT_E,20060906,RTOX,1.0,,"MED/HIGH RELIEF BEDROCK, BOULDERS W/ SAND ON O...",
199191,SCI_SAN_PEDRO_POINT_E,20080801,SPUL,1.0,FEMALE,FEMALE,FEMALE
...,...,...,...,...,...,...,...
319622,SRI_JOHNSONS_LEE_SOUTH_E,20170925,SPUL,1.0,FEMALE,FEMALE,FEMALE
327710,SRI_JOHNSONS_LEE_SOUTH_E,20180822,SPUL,1.0,FEMALE,FEMALE,FEMALE
309219,SCI_FORNEY_W,20161214,SPUL,1.0,FEMALE,FEMALE,FEMALE
117864,SAUNDERS_MPA_1,20160912,HDEC,1.0,,NO SEX,


In [137]:
## Clean sex_notes

print(subset['sex_notes'].unique())
# Assign the result back; an in-place replace on a selected column may not update the frame
subset['sex_notes'] = subset['sex_notes'].replace({'F':'female',
                                                   'M':'male',
                                                   'FEMALE':'female',
                                                   'MALE':'male',
                                                   'MALES':'male',
                                                   'FEMALES':'female',
                                                   'FEAMLE':'female',
                                                   'TRANSITIONAL':'transitional',
                                                   'TRANNY':'transitional',
                                                   'TRANSITION':'transitional'})
print(subset['sex_notes'].unique())

pd.set_option('display.max_rows', 60)
subset[subset['sex'].notna() | subset['notes'].notna()]

[nan 'FEMALE' 'MALE' 'F' 'M']
[nan 'female' 'male']


Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes
340726,PYRAMID_POINT_3,20180728,HDEC,1.0,FEMALE,,
221571,ANACAPA_WEST_ISLE_E,20090928,SPUL,1.0,JUVENILE,JUVENILE,
178389,NAPLES_CEN,20060913,HCAR,1.0,,HIGH RELIEF BEDROCK (NAPLES PROPER/ 3 FINGERS)...,
176472,SCI_COCHE_POINT_E,20060906,RTOX,1.0,,"MED/HIGH RELIEF BEDROCK, BOULDERS W/ SAND ON O...",
199191,SCI_SAN_PEDRO_POINT_E,20080801,SPUL,1.0,FEMALE,FEMALE,female
...,...,...,...,...,...,...,...
319622,SRI_JOHNSONS_LEE_SOUTH_E,20170925,SPUL,1.0,FEMALE,FEMALE,female
327710,SRI_JOHNSONS_LEE_SOUTH_E,20180822,SPUL,1.0,FEMALE,FEMALE,female
309219,SCI_FORNEY_W,20161214,SPUL,1.0,FEMALE,FEMALE,female
117864,SAUNDERS_MPA_1,20160912,HDEC,1.0,,NO SEX,


In [141]:
## Add in cleaned sex column from fish_occ

# Get occ_subset
occ_subset = fish_occ.iloc[idx, :].copy()

# Add
subset['occ_sex'] = occ_subset['sex']

# Check
pd.set_option('display.max_rows', 60)
subset[subset['sex'].notna() | subset['notes'].notna()]

Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes,occ_sex
340726,PYRAMID_POINT_3,20180728,HDEC,1.0,FEMALE,,,female
221571,ANACAPA_WEST_ISLE_E,20090928,SPUL,1.0,JUVENILE,JUVENILE,,
178389,NAPLES_CEN,20060913,HCAR,1.0,,HIGH RELIEF BEDROCK (NAPLES PROPER/ 3 FINGERS)...,,
176472,SCI_COCHE_POINT_E,20060906,RTOX,1.0,,"MED/HIGH RELIEF BEDROCK, BOULDERS W/ SAND ON O...",,
199191,SCI_SAN_PEDRO_POINT_E,20080801,SPUL,1.0,FEMALE,FEMALE,female,female
...,...,...,...,...,...,...,...,...
319622,SRI_JOHNSONS_LEE_SOUTH_E,20170925,SPUL,1.0,FEMALE,FEMALE,female,female
327710,SRI_JOHNSONS_LEE_SOUTH_E,20180822,SPUL,1.0,FEMALE,FEMALE,female,female
309219,SCI_FORNEY_W,20161214,SPUL,1.0,FEMALE,FEMALE,female,female
117864,SAUNDERS_MPA_1,20160912,HDEC,1.0,,NO SEX,,


In [147]:
## Create new column merging information from occ_sex and sex_notes

# Keep occ_sex where present; otherwise fall back to the value extracted from the notes
subset['new_sex'] = subset['occ_sex'].fillna(subset['sex_notes'])

# Check
pd.set_option('display.max_rows', 60)
# Rows where both sources have a value but disagree (expected to be empty)
subset[subset['occ_sex'].notna() & subset['sex_notes'].notna() & (subset['occ_sex'] != subset['sex_notes'])]

Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes,occ_sex,new_sex


In [151]:
## Replace sex column in occ_subset with new_sex

occ_subset['sex'] = subset['new_sex']
occ_subset.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,sex,lifeStage,organismQuantity,organismQuantityType
335243,SBI-Southeast_Sealion_20041009_INNER_BOT_1,SBI-Southeast_Sealion_20041009_INNER_BOT_1_occ7,"Kelp Bass, Calico Bass",Paralabrax clathratus,urn:lsid:marinespecies.org:taxname:282054,282054,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3
58369,PARANOIDS_20080925_INMID_BOT_2,PARANOIDS_20080925_INMID_BOT_2_occ2,Painted Greenling,Oxylebius pictus,urn:lsid:marinespecies.org:taxname:240743,240743,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3
13112,HOPKINS_DC_20020828_INNER_BOT_1,HOPKINS_DC_20020828_INNER_BOT_1_occ1,Black Surfperch,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3
144194,SCI_FORNEY_E_20030911_OUTMID_BOT_2,SCI_FORNEY_E_20030911_OUTMID_BOT_2_occ3,Black Surfperch,Embiotoca jacksoni,urn:lsid:marinespecies.org:taxname:240746,240746,WoRMS,present,HumanObservation,,,,8.0,number of individuals per 120 m3
128393,HOPKINS_UC_20180720_OUTER_BOT_3,HOPKINS_UC_20180720_OUTER_BOT_3_occ5,Blackeye Goby,Rhinogobiops nicholsii,urn:lsid:marinespecies.org:taxname:282580,282580,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3


In [156]:
## Check

subset[subset['occ_sex'] == 'female'].shape[0] # 49
subset[subset['new_sex'] == 'female'].shape[0] # 57
occ_subset[occ_subset['sex'] == 'female'].shape[0]

57

OK, that seems to have worked. Let's try it on the whole dataset now.

In [157]:
## Extract sex information from notes

sex_notes = []
sex_options = ['M', 'F', 'MALE', 'FEMALE', 'FEAMLE', 'MALES', 'FEMALES', 'TRANSITIONAL', 'TRANSITION', 'TRANNY']

for note in notes['notes']:
    
    semicolon_overlap = []
    comma_overlap = []
    slash_overlap = []
    
    if note == note:  # NaN != NaN, so this skips missing notes
        
        # Split on each delimiter; only use splits that produce no empty pieces
        semicolon_split = list(map(str.strip, note.split(';')))
        if (len(semicolon_split) > 1) and ('' not in semicolon_split):
            semicolon_overlap = list(set(sex_options) & set(semicolon_split))
            
        comma_split = list(map(str.strip, note.split(',')))
        if (len(comma_split) > 1) and ('' not in comma_split):
            comma_overlap = list(set(sex_options) & set(comma_split))
            
        slash_split = list(map(str.strip, note.split('/')))
        if (len(slash_split) > 1) and ('' not in slash_split):
            slash_overlap = list(set(sex_options) & set(slash_split))
          
        
        # Append exactly one value per note so sex_notes stays aligned with the rows
        if note in sex_options:
            sex_notes.append(note)
        elif len(semicolon_overlap) == 1:
            sex_notes.append(semicolon_overlap[0])
        elif len(comma_overlap) == 1:
            sex_notes.append(comma_overlap[0])
        elif len(slash_overlap) == 1:
            sex_notes.append(slash_overlap[0])
        
        else:
            sex_notes.append(np.nan)
            
    else:
        sex_notes.append(np.nan)

In [165]:
## Inspect outcome

notes['sex_notes'] = sex_notes

pd.set_option('display.max_rows', 60)
notes[notes['sex'].notna() | notes['notes'].notna()].iloc[10000:10010, :]

Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes
168024,SCI_YELLOWBANKS_W,20051006,EJAC,1.0,,NO CANOPY KELP,
168025,SCI_YELLOWBANKS_W,20051006,HRUB,1.0,,NO CANOPY KELP,
168026,SCI_YELLOWBANKS_W,20051006,HSEM,1.0,MALE,MALE; NO CANOPY KELP,MALE
168027,SCI_YELLOWBANKS_W,20051006,PCLA,3.0,,NO CANOPY KELP,
168028,SCI_YELLOWBANKS_W,20051006,PCLA,2.0,,NO CANOPY KELP,
168029,SCI_YELLOWBANKS_W,20051006,PCLA,2.0,,NO CANOPY KELP,
168030,SCI_YELLOWBANKS_W,20051006,PCLA,2.0,,NO CANOPY KELP,
168031,SCI_YELLOWBANKS_W,20051006,SATR,1.0,,NO CANOPY KELP,
168032,SCI_YELLOWBANKS_W,20051006,SPUL,1.0,FEMALE,FEMALE; NO CANOPY KELP,FEMALE
168033,SCI_YELLOWBANKS_W,20051006,BFRE,1.0,,NO CANOPY KELP,


In [166]:
## Clean sex_notes

print(notes['sex_notes'].unique())
# Assign the result back; an in-place replace on a selected column may not update the frame
notes['sex_notes'] = notes['sex_notes'].replace({'F':'female',
                                                 'M':'male',
                                                 'FEMALE':'female',
                                                 'MALE':'male',
                                                 'MALES':'male',
                                                 'FEMALES':'female',
                                                 'FEAMLE':'female',
                                                 'TRANSITIONAL':'transitional',
                                                 'TRANNY':'transitional',
                                                 'TRANSITION':'transitional'})
print(notes['sex_notes'].unique())

notes.head()

[nan 'F' 'M' 'FEMALE' 'MALE' 'TRANSITIONAL' 'MALES' 'FEMALES' 'TRANNY'
 'TRANSITION' 'FEAMLE']
[nan 'female' 'male' 'transitional']


Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes
0,HOPKINS_DC,19990907,ELAT,1.0,,,
1,HOPKINS_DC,19990907,ELAT,1.0,,,
2,HOPKINS_DC,19990907,HDEC,1.0,,,
3,HOPKINS_DC,19990907,OCAL,85.0,,,
4,HOPKINS_DC,19990907,OYT,100.0,,,


In [171]:
## Add in cleaned sex column from fish_occ

# Add
notes['occ_sex'] = fish_occ['sex']

# Check
pd.set_option('display.max_rows', 60)
notes[notes['sex'].notna() | notes['notes'].notna()].iloc[5000:5100, :]

Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes,occ_sex
115567,HOPKINS_UC,20160818,HDEC,1.0,,FEMALE,female,
115635,HOPKINS_UC,20160818,HDEC,1.0,,NO SEX GIVEN,,
115645,HOPKINS_UC,20160818,HDEC,1.0,,NO SEX GIVEN,,
115646,HOPKINS_UC,20160818,HDEC,1.0,,NO SEX GIVEN,,
115904,MONASTERY_DC,20160824,HDEC,1.0,,MALE,male,
...,...,...,...,...,...,...,...,...
119901,DEL_MAR_REFERENCE_2,20170802,HDEC,1.0,,MALE,male,
119921,DEL_MAR_REFERENCE_2,20170802,HDEC,1.0,,MALE,male,
119928,DEL_MAR_REFERENCE_2,20170802,HDEC,1.0,,MALE,male,
119929,DEL_MAR_REFERENCE_2,20170802,HDEC,1.0,,FEMALE,female,


In [174]:
## Create new column merging information from occ_sex and sex_notes

# Keep occ_sex where present; otherwise fall back to the value extracted from the notes
notes['new_sex'] = notes['occ_sex'].fillna(notes['sex_notes'])

In [178]:
# Check
pd.set_option('display.max_rows', 600)
# Rows where both sources have a value but disagree (expected to be empty)
notes[notes['occ_sex'].notna() & notes['sex_notes'].notna() & (notes['occ_sex'] != notes['sex_notes'])]

Unnamed: 0,site,survey_date,classcode,count,sex,notes,sex_notes,occ_sex,new_sex


In [180]:
## Replace sex column in fish_occ with new_sex

fish_occ['sex'] = notes['new_sex']
fish_occ.head()

Unnamed: 0,eventID,occurrenceID,vernacularName,scientificName,scientificNameID,taxonID,nameAccordingTo,occurrenceStatus,basisOfRecord,identificationQualifier,sex,lifeStage,organismQuantity,organismQuantityType
0,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ1,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3
1,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ2,Striped Surfperch,Embiotoca lateralis,urn:lsid:marinespecies.org:taxname:240740,240740,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3
2,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ3,Kelp Greenling,Hexagrammos decagrammus,urn:lsid:marinespecies.org:taxname:240732,240732,WoRMS,present,HumanObservation,,,,1.0,number of individuals per 120 m3
3,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ4,Senorita,Oxyjulis californica,urn:lsid:marinespecies.org:taxname:240727,240727,WoRMS,present,HumanObservation,,,,85.0,number of individuals per 120 m3
4,HOPKINS_DC_19990907_INNER_BOT_1,HOPKINS_DC_19990907_INNER_BOT_1_occ5,Olive Or Yellowtail Rockfish,Sebastes,urn:lsid:marinespecies.org:taxname:126175,126175,WoRMS,present,HumanObservation,Sebastes serranoides or Sebastes flavidus,,,100.0,number of individuals per 120 m3


In [183]:
## Check

notes[notes['occ_sex'] == 'female'].shape[0] # 16528
notes[notes['new_sex'] == 'female'].shape[0] # 18031
fish_occ[fish_occ['sex'] == 'female'].shape[0]

18031

In [184]:
## Save for visual inspection

fish_occ.to_csv('occ.csv', index=False, na_rep='NaN')