# Cleaning CC data

This Python notebook operates on a CSV file produced by earlier editing in OpenRefine. It finishes cleaning the columns of interest that were easier to clean in Python.

# Table of Contents

1. [Outstanding Problems](#Outstanding-Problems)

2. [Setting up Python](#Setting-up-Python)

    1. [Setting the Location](#Setting-the-Location)

    2. [Importing Data](#Importing-Data)

    3. [Preparing for a Save](#Preparing-for-a-Save)

3. [Functions](#Functions)
    1. [appendstr](#appendstr)
    2. [typeordrop](#typeordrop)
    3. [myint](#myint)
    4. [testint](#testint)
    5. [rom2arab](#rom2arab)
    6. [exportliz](#exportliz)

4. [Inspecting the Data](#Inspecting-the-Data)
5. [Cleaning Data](#CleaningData)
    1. [Column-by-Column Cleaning](#Column-by-Column-Cleaning)
        1. [rtl](#rtl)
        2. [tl](#tl)
        3. [svl](#svl)
        4. [autotomized](#autotomized)
        5. [toes](#toes)
        6. [sex](#sex)
        7. [species](#species)
        8. [new.recap](#new.recap)
        9. [date](#date)
    2. [Correcting Class of Columns](#Correcting-Class-of-Columns)

6. [Adding New Columns](#Adding-New-Columns)

    1. [TL_SVL](#TL_SVL)

    2. [Mass_SVL](#Mass_SVL)

    3. [Lizard Number](#Lizard-Number)
         - [Assign Lizard Numbers](#Assign-Lizard-Numbers)
         - [QC the Numbers](#QC-the-Numbers)

    4. [Days Since Capture](#Days-Since-Capture)

    5. [Capture Number](#Capture-Number)

7. [Export Cleaned Data](#Export-Cleaned-Data)

# Outstanding Problems

1. [outstanding1](#outstanding1)
2. [outstanding2](#outstanding2)
3. [outstanding3](#outstanding3)
4. [outstanding4](#outstanding4)

# Setting up Python

[Top](#Table-of-Contents)

Here we import necessary packages. 
This chunk may take a while.

In [1]:
import pandas as pd
import numpy as np
import os,glob,logging
from liz_number import lizsort,mindate,smallest,validate
from liz_toes import make_str,label_pattern,replace_pattern,report_pattern

import plotly
import plotly.plotly as py
import plotly.graph_objs as go

plotly.tools.set_config_file(world_readable=True)
logging.basicConfig(filename='S:\\Chris\\TailDemography\\TailDemography\\Scripts and notes\\Cleaning CC (Part 2).log'
                    , filemode='a',
                    format='%(funcName)s - %(levelname)s - %(message)s - %(asctime)s', level=logging.DEBUG)
# increase print limit
pd.options.display.max_rows = 99999
pd.options.display.max_columns = 50

## Setting the Location
[Top](#Table-of-Contents)

These chunks identify the locations from which we can get data and to which we can save data.

### Source Data
Source files can be found in the following locations:

In [2]:
sourceDataPers = 'C:/Users/Christopher/Google Drive/TailDemography/Cleaned Combined Data'
sourceDataBig = 'S:/Chris/TailDemography/TailDemography/Cleaned Combined Data'
# sourceBlack = 'C:/Users/test/Desktop'

### Intermediate Source Data
Intermediate files can be found in the following locations:

In [3]:
sourceInterDataPers = 'C:/Users/Christopher/Google Drive/TailDemography/Intermediate Files/DeepCleaning'
sourceinterDataBig = 'S:/Chris/TailDemography/TailDemography/Intermediate Files/DeepCleaning'
# sourceBlack = 'C:/Users/test/Desktop'

### Output Data
Output files can be found in the following locations:

In [4]:
outputPers = 'C:/Users/Christopher/Google Drive/TailDemography/outputFiles'
outputBig = 'S:/Chris/TailDemography/TailDemography/outputFiles'
# outputBlack = 'C:/Users/test/Desktop'

<a id='ImportingData'></a>

## Importing Data
[Top](#Table-of-Contents)

Here we import data from one of the available locations.

In [5]:
os.chdir(sourceDataBig)
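# glob returns names in arbitrary order; sort so the most recent timestamped filename comes last
files = sorted(glob.glob('*.csv'))
latest = files[-1]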
latest


'Appended and Trimmed CC Data 2000-2017_2019-03-10 13hrs38min.csv'

In [6]:
df=pd.read_csv(latest)
df.head()

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
0,sj,3-7-11-19,m,2002-07-14 00:00:00,63.0,92.0,0.0,,10.0,halfway up to site,-200,NEW,,,b101t,toes in vial 58-02,
1,sj,3-7-11-18,m,2002-07-14 00:00:00,66.0,92.0,0.0,,10.8,left downstream 100m v 1 falls,-100,NEW,,,b102t,toes in vial 59-02,
2,sj,3-7-12-16,m,2002-07-14 00:00:00,68.0,103.0,0.0,,10.3,90m v 1 falls,-90,NEW,,,b103t,toes in vial 60-02,
3,sv,,,2002-07-05 00:00:00,,,,,,sb at intersection with trail v 1 falls,-30,sighting,,,w2a,,
4,sj,10,m,2002-07-14 00:00:00,85.0,118.0,0.0,,19.5,sb - trail intersection v 1 falls,-20,recap,toe loss may be natural,,b104t,,


## Preparing for a Save
[Top](#Table-of-Contents)

Now we change the working directory so that intermediate files are saved to our preferred location.

In [7]:
os.chdir(sourceinterDataBig)

# Functions
This section collects the functions written for this notebook.  We will need to consider whether or not to add these to the code library.

[Back to: Top](#Table-of-Contents)

1. [appendstr](#appendstr)
2. [typeordrop](#typeordrop)
3. [myint](#myint)
4. [testint](#testint)
5. [rom2arab](#rom2arab)
6. [exportliz](#exportliz)

## appendstr
[Back to Top](#Table-of-Contents)

[Back to Functions](#Functions)

In [8]:
def appendstr(x, value, connector = '', position=0):
    """
    inserts *value* into the string *x* at index *position*, placing *connector* on both sides of *value*
    :param x: str, None or NaN; None, NaN and '' are treated as an empty string
    :param value: str to insert
    :param connector: str placed immediately before and after *value*
    :param position: int index in *x* at which *value* is inserted
    """
    assert((isinstance(x,str)|(x is None)|(x!=x))),"x must be str type, NoneType or NaN: x is {} type."\
    .format(type(x))
    assert(isinstance(value,str)),"value must be str type: value is {} type.".format(type(value))
    assert(isinstance(connector,str))\
    , "connector must be str type, not {} type.".format(type(connector))
    assert(isinstance(position,(int))), "position must be int type, not {}."\
           .format(type(position)) 
    if isinstance(position,int):
        if ((x!=x)|(x is None)|(x =='')):
            x=''
            position=0
#             assert(position == 0),"If x is NaN or len(x)==0, position must be 0 not {}.".format(position)
        else:
            x = x
        assert(position in [0]+[i for i in range(-1,len(x)-1)])\
        , "position must be a value in the range -1 through {}.".format(len(x)-1)
    try:
        prefix = x[:position]
    except Exception as e:
        logging.exception("Exception occurred in generating prefix.")
    try:
        suffix = x[position:]
    except Exception as e:
        logging.exception("Exception occurred in generating suffix.")
        
    # when x is empty, prefix and suffix are both '', so the result is connector+value+connector
    res = prefix+connector+value+connector+suffix

    return res
    

Here's an example of how *appendstr* works.

In [9]:
foo='bar'
appendstr(foo,'test',connector='_',position=1)

'b_test_ar'

In [10]:
foo='bar'
appendstr(foo,'test',connector='_',position=0)

'_test_bar'

In [11]:
appendstr('','test',position=0)

'test'

[back to Functions](#Functions)

## typeordrop
[Back to Top](#Table-of-Contents)

[Back to Functions](#Functions)

In [12]:
def typeordrop(x,typ,replace=None, verbose=True):
    """this function attempts to force an object, *x*, to a particular type, *typ*. If this is not possible, 
    it reports the value of the object that could not be forced and replaces the object with the value 
    supplied to the *replace* argument"""
    if not isinstance(x,typ):
        try:
            x = typ(x)
            logging.info("Working as expected")
        except Exception as e:
            logging.exception(f"Could not force value supplied to 'x' argument to {typ} type. x is {type(x)} type.")
            if verbose:
                print("Could not force value supplied to 'x' argument to {} type. x is {} type:\n\n x = {}"\
                      .format(typ,type(x),x))
            x = replace
    else:
        logging.info(f"{x} is already of type {typ}.")
    return x
         

Here is an example of how *typeordrop* works.

In [13]:
x=['foo','bar']
typeordrop(x,int)

Could not force value supplied to 'x' argument to <class 'int'> type. x is <class 'list'> type:

 x = ['foo', 'bar']

Because a list cannot be converted to int, the function prints the message above and returns the *replace* value (None by default), so the cell displays no result.

[back to Functions](#Functions)

## myint
[Back to Top](#Table-of-Contents)

[Back to Functions](#Functions)

In [14]:
def myint(x, verbose = False):
    """returns the part of str(x) that precedes the first '.', e.g. 2001.0 --> '2001'.
    Note that the result is a str, not an int."""
    try:
        x = str(x).split('.')[0]
    except Exception as e:
        typ = type(x)
        logging.exception(f"{x} is of type {typ} and cannot be forced to int.")
        if verbose == True:
            print('{} is of type {} and cannot be forced to int.'.format(x,type(x)))
    return x


Here are a few examples of how [*myint*](#myint) works.

In [15]:
bar = [None, 1.0, "f"]
print([type(x) for x in bar])
[myint(x) for x in bar]

[<class 'NoneType'>, <class 'float'>, <class 'str'>]


['None', '1', 'f']

In [16]:
bar = [None, 2001.0, "2001.0"]
print([type(x) for x in bar])
[myint(x,True) for x in bar]

[<class 'NoneType'>, <class 'float'>, <class 'str'>]


['None', '2001', '2001']

[back to Functions](#Functions)

## testint

[Back to Top](#Table-of-Contents)

[Back to Functions](#Functions)

In [17]:
def testint(x):
    try:
        res = int(x)
    except Exception as e:
        logging.error(f"{x} could not be forced to int.")
        res = x
    return res

In [18]:
foo = ['9ix2010', '2ii2010', '10vii2011','30viii2009','30-9-2011']
foo = pd.DataFrame(foo).rename(columns={0:'date'})
dict_rom2arab = {'i':'1','ii':'2','iii':'3','iv':'4','v':'5','vi':'6','vii':'7','viii':'8','ix':'9','x':'10'
        ,'xi':'11','xii':'12'}
print(dict_rom2arab.keys())
foo

dict_keys(['i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x', 'xi', 'xii'])


Unnamed: 0,date
0,9ix2010
1,2ii2010
2,10vii2011
3,30viii2009
4,30-9-2011


In [19]:
testint(3)

3

In [20]:
testint('r')

'r'

In [21]:
[isinstance(testint(char),int) for char in foo.date[0]]

[True, False, False, True, True, True, True]

In [22]:
[isinstance(testint(char),int) for char in foo.date[4]]

[True, True, False, True, False, True, True, True, True]
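For plain ASCII strings like these, the same per-character mask can be obtained without passing every character through `testint`. The sketch below uses the built-in `str.isdigit`; `digit_mask` is a hypothetical helper name and is not part of this notebook.

```python
def digit_mask(text):
    """Return one boolean per character: True where the character is a digit."""
    return [ch.isdigit() for ch in text]


# The two example dates used above.
print(digit_mask('9ix2010'))
print(digit_mask('30-9-2011'))
```

One caveat: `str.isdigit` also returns True for a few non-ASCII characters (superscript digits, for example) that `int` rejects, so the two approaches agree only on ordinary keyboard input.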

[back to Functions](#Functions)

## rom2arab
[Back to Top](#Table-of-Contents)

[Back to Functions](#Functions)

In [23]:
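def rom2arab(x,new_dict= None, verbose = False, replace = False):
    """finds the contiguous non-numeric portion of the string *x* (e.g. a Roman-numeral month), looks it up in
    *new_dict* and replaces it with the corresponding value wrapped in '-'. If that portion is not a key in
    *new_dict*, *x* is returned unchanged. *verbose* and *replace* are currently unused."""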
    dict_rom2arab = {'i':'01','ii':'02','iii':'03','iv':'04','v':'05','vi':'06','vii':'07','viii':'08'
                     ,'ix':'09','x':'10','xi':'11','xii':'12'}
    if new_dict is None:
        new_dict = dict_rom2arab 
    idx_str = [isinstance(testint(char),int) for char in x]
    idx_tmp = range(0,len(x))
    str_df = pd.DataFrame({'numeric':idx_str,'index':idx_tmp})
    idx_strings = str_df.loc[str_df.numeric==False,'index']
    idx_numeric = str_df.loc[str_df.numeric==True,'index']
    val2replace = x[idx_strings.min():idx_strings.max()+1] # This assumes that all string values are contiguous.
    try:
        newval = '-'+new_dict[val2replace]+'-'
        logging.info(["replaced value: \'{}\'".format(val2replace),x.replace(val2replace,newval)])
        res = x.replace(val2replace,newval)
    except Exception as e:
        res = x
        logging.error(f"x has no string values contained in 'new_dict'. \nx:{x}")
    return res


In [24]:
rom2arab(foo.date[0],verbose=False)

'9-09-2010'

In [25]:
rom2arab(foo.date[0],verbose=True)

'9-09-2010'

In [26]:
rom2arab(foo.date[4],verbose=True)

'30-9-2011'

In [27]:
foo.date.apply(rom2arab)

0     9-09-2010
1     2-02-2010
2    10-07-2011
3    30-08-2009
4     30-9-2011
Name: date, dtype: object
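As a point of comparison, the same conversion can be sketched with a regular expression that captures the day, the Roman-numeral month and the year in one step. Everything below (`ROMAN_MONTHS`, `ROMAN_DATE`, `roman_date_to_dashed`) is a hypothetical illustration that assumes dates follow the day/Roman month/four-digit year pattern seen in the examples; it is not a replacement for *rom2arab*.

```python
import re
from datetime import datetime

# Hypothetical lookup table mirroring dict_rom2arab.
ROMAN_MONTHS = {'i': '01', 'ii': '02', 'iii': '03', 'iv': '04', 'v': '05', 'vi': '06',
                'vii': '07', 'viii': '08', 'ix': '09', 'x': '10', 'xi': '11', 'xii': '12'}

# One or two day digits, a Roman-numeral month, a four-digit year, e.g. '10vii2011'.
ROMAN_DATE = re.compile(r'^(\d{1,2})([ivx]+)(\d{4})$')


def roman_date_to_dashed(text):
    """Return 'day-month-year' for dates like '9ix2010'; return other strings unchanged."""
    match = ROMAN_DATE.match(text.strip().lower())
    if match is None:
        return text
    day, roman, year = match.groups()
    month = ROMAN_MONTHS.get(roman)
    if month is None:  # looks Roman but is not a valid month, e.g. 'xiii'
        return text
    return '{}-{}-{}'.format(day, month, year)


examples = ['9ix2010', '2ii2010', '10vii2011', '30viii2009', '30-9-2011']
converted = [roman_date_to_dashed(d) for d in examples]
print(converted)

# The dashed strings can then be parsed as day-month-year dates.
parsed = [datetime.strptime(d, '%d-%m-%Y').date() for d in converted]
print(parsed)
```

Anchoring the pattern at both ends means strings that do not match the expected layout pass through untouched rather than being partially rewritten.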

[back to Functions](#Functions)

## exportliz
[Back to Top](#Table-of-Contents)

[Back to Functions](#Functions)

In [28]:
def exportliz(df,iterator, lizard = None, prefix = None, verbose = False):
    """creates a filename for each lizard and then saves that lizard's data to that filename. 
    Can take a list of lizards or iterate over the entire dataframe"""
    assert isinstance(df,pd.DataFrame), "df must be pandas DataFrame, not {}.".format(type(df))
    assert ((lizard is None) |(isinstance(lizard,list))), "lizard must be NoneType or list, not {}."\
    .format(type(lizard))
    assert ((prefix is None) |(isinstance(prefix,str))), "prefix must be NoneType or str, not {}."\
    .format(type(prefix))
    assert iterator in df.columns, "iterator must be in df.columns:\n {}".format(list(df.columns))
    if prefix is None:
        prefix = "File for lizard " 
    suffix = ".csv"
    if lizard is None:
        # no list supplied: export every lizard in the dataframe
        lizard = pd.unique(df[iterator])
    else:
        known = set(df[iterator].unique())
        assert all(liz in known for liz in lizard), "every entry in lizard must be contained in df[iterator]."
    for liz in lizard:
        filename = prefix + str(liz) + suffix
        logging.info(f"liz type:{type(liz)}\nliz:{liz}\ndf[iterator]:{type(df[iterator])}")
        data = df.loc[df[iterator] == liz,:]
        data.to_csv(filename)
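The per-lizard export can also be sketched with `groupby`, which hands back each group's rows directly instead of filtering the dataframe once per lizard. This is only an illustration on toy data: the column name `liz_id` and the temporary output folder are assumptions, not names used elsewhere in this notebook.

```python
import os
import tempfile

import pandas as pd

# Toy data; 'liz_id' is a made-up identifier column.
toy = pd.DataFrame({'liz_id': [1, 1, 2], 'svl': [63, 64, 70]})

out_dir = tempfile.mkdtemp()  # throwaway folder so nothing real is overwritten
written = []
for liz, rows in toy.groupby('liz_id'):
    path = os.path.join(out_dir, 'File for lizard {}.csv'.format(liz))
    rows.to_csv(path, index=False)
    written.append(path)

print(written)
```

Note that `groupby` skips rows whose key is NaN by default, so captures without an identifier would not be exported by this variant.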

[back to Functions](#Functions)

## Inspecting the Data
[Top](#Table-of-Contents)

Let's take a look at the data.

In [29]:
print("\nThere are {} data points in our data set.".format(df.shape[0]))


There are 6299 data points in our data set.


In [30]:
print("\nThe columns in the data have the following data types:\n{}".format(df.dtypes))


The columns in the data have the following data types:
species         object
toes            object
sex             object
date            object
svl             object
tl              object
rtl             object
autotomized    float64
mass            object
location        object
meters          object
new.recap       object
painted         object
sighting       float64
paint.mark      object
vial            object
misc            object
dtype: object


<a id= 'CleaningData'></a>

# Cleaning the Data
[Back to: Top](#Table-of-Contents)

Now we get to the actual cleaning of the data.  We will inspect the data and take the appropriate cleaning steps:
1. [Column-by-Column Cleaning](#Column-by-Column-Cleaning)

2. [Correcting Class of Columns](#Correcting-Class-of-Columns)

## Column-by-Column Cleaning
[Back to: Top](#Table-of-Contents)

We will handle the cleaning for each column in this section.
1. [rtl](#rtl)
2. [tl](#tl)
3. [svl](#svl)
4. [autotomized](#autotomized)
    1. [creating 'rtl_orig' and relabeling 'rtl' and 'autotomized'](#creating-'rtl_orig'-and-relabeling-'rtl'-and-'autotomized')
        - [copy the values in rtl to a new column, *rtl_orig*](#copy-the-values-in-rtl-to-a-new-column)
        - [relabel entries in the autotomized column based on the values in the rtl_orig column](#relabel-entries-in-the-autotomized-column-based-on-the-values-in-the-rtl_orig-column) 
        - [relabel entries in the rtl column](#relabel-entries-in-the-rtl-column)
5. [toes](#toes)
6. [sex](#sex)
7. [species](#species)
8. [new.recap](#new.recap)
9. [date](#date)

## rtl
[Back to: Top](#Table-of-Contents)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#Column-by-Column-Cleaning)

Here we investigate and clean values in the column 'rtl'. These should be int type values that are greater than or equal to -1.  First, we test to see if all of the values are of type int.

In [31]:
badtypes = []
for val in df.rtl:
    try:
        int(val)
    except (TypeError, ValueError, OverflowError):
        badtypes.append(val)
print("'badtypes' represents {} entries in the df:".format(len(badtypes)))
if len(badtypes)==0:
    print("\nAll values in df.rtl can be successfully converted to int.\n\n")
#     df['rtl'] = df.rtl.apply(int)
else:
    print("\nNot all values in df.rtl could be converted to int.  The following values could not be \
converted and should be investigated:\n\n{}\n\nbadtypes values are distributed as follows in the df:\n\n{}"\
          .format(list(set(badtypes)),df.loc[df.rtl.isin(badtypes),'rtl'].value_counts(dropna=False)))

'badtypes' represents 3596 entries in the df:

Not all values in df.rtl could be converted to int.  The following values could not be converted and should be investigated:

[nan, 'o', '?', '10(kink)', '32 -12', '-']

badtypes values are distributed as follows in the df:

NaN         3590
?              2
10(kink)       1
32 -12         1
o              1
-              1
Name: rtl, dtype: int64
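A vectorised way to find the same kind of entries is sketched below with `pd.to_numeric(..., errors='coerce')`, which turns unparseable values into NaN instead of raising. The series here holds toy values modelled on the entries reported above, not the real column.

```python
import pandas as pd

# Toy values modelled on the kinds of entries reported above (not the real data).
rtl = pd.Series(['12', '0', '-1', None, 'o', '?', '10(kink)', '32 -12', '-'])

# errors='coerce' turns anything that is not a number into NaN instead of raising.
as_number = pd.to_numeric(rtl, errors='coerce')

# Entries that were present in the raw column but did not survive the conversion.
non_numeric = rtl[as_number.isna() & rtl.notna()]
print(non_numeric.value_counts())
```

Unlike the `int` test used in this notebook, `pd.to_numeric` accepts decimal strings such as `'12.5'`, so the two checks are not interchangeable if fractional values matter.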


The non-NaN values are few, so we will inspect these first.

In [32]:
pd.set_option('max_colwidth',100000)
df.loc[(df.rtl.isin(badtypes))&(df.rtl.notna()),:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
1754,uo,4-6-18,m,2004-07-12 00:00:00,52,75,32 -12,,4.7,sb opp fallen juniper -> flat R,208.0,new,painted,,w^c,04-63,blue throat and blue belly; accidentally cut toe 6
1850,sv,,f,2004-07-21 00:00:00,-,-,-,,6.0,sb @ cc/ccc,240.0,recap,painted,,w148b,,escaped
1877,sj,2-9-12-18,f,2004-07-22 00:00:00,65,94,10(kink),,9.4,wall rt side v wall v cave tr,,recap,painted,,w154b,,hurt toes 11-13 in capture; Bss Tss
5127,sj,,m,2003-04-19 00:00:00,56,32,?,,,talus 326,326.0,NEW,painted,,b7c,,
5150,sj,4-10-14-18,m,2003-04-30 00:00:00,76,19,?,,,wall 15m,15.0,recap,painted,,b9a,,9 looks like a backwards P and t combined
5280,sv,1-6-11-20,m,2003-06-27 00:00:00,41,60,o,,4.0,sb 5m ^ cave trail,50.0,NEW,painted,,sMb,,"lost toes for vial, accidently cut off toe 11"


Based on review discussions, we will make the changes below:
- ‘?’ --> 0; misc: “unsure if tail was recently broken at very tip”
- ‘o’ --> 0
- ‘32 -12’ --> 32; misc: “potential double-break at 12 \[george to check before use\]”
- ‘-’ --> NaN
- ‘10(kink)’ --> 0; misc: “kink at 10mm”

We will use the function [*appendstr*](#appendstr) to do this.

‘?’ --> 0; misc: “unsure if tail was recently broken at very tip”

In [33]:
idx_ques = (df.rtl.isin(badtypes))&(df.rtl=='?')
df[idx_ques]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
5127,sj,,m,2003-04-19 00:00:00,56,32,?,,,talus 326,326,NEW,painted,,b7c,,
5150,sj,4-10-14-18,m,2003-04-30 00:00:00,76,19,?,,,wall 15m,15,recap,painted,,b9a,,9 looks like a backwards P and t combined


In [34]:
df.loc[idx_ques,'misc']= df.loc[idx_ques,:].misc\
.apply(lambda x: appendstr(x,"unsure if tail was recently broken at very tip",';'))
df.loc[idx_ques,'rtl']= '0'

These entries now look like this:

In [35]:
df.loc[idx_ques,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
5127,sj,,m,2003-04-19 00:00:00,56,32,0,,,talus 326,326,NEW,painted,,b7c,,;unsure if tail was recently broken at very tip;
5150,sj,4-10-14-18,m,2003-04-30 00:00:00,76,19,0,,,wall 15m,15,recap,painted,,b9a,,;unsure if tail was recently broken at very tip;9 looks like a backwards P and t combined
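The output above shows the connector on both sides of the inserted note, including when misc was empty. If that is not wanted, a note can be added with a separator only where there is existing text to separate. The helper below, `add_note`, is a hypothetical sketch; it appends the note at the end rather than inserting it at the front as *appendstr* does with `position=0`.

```python
import math


def add_note(existing, note, sep='; '):
    """Append *note* to *existing*, using *sep* only when there is text to separate."""
    is_missing = existing is None or (isinstance(existing, float) and math.isnan(existing))
    if is_missing or existing.strip() == '':
        return note
    return existing + sep + note


print(add_note(float('nan'), 'unsure if tail was recently broken at very tip'))
print(add_note('9 looks like a backwards P and t combined',
               'unsure if tail was recently broken at very tip'))
```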


‘o’ --> 0

In [36]:
idx_o = (df.rtl.isin(badtypes))&(df.rtl=='o')
df[idx_o]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
5280,sv,1-6-11-20,m,2003-06-27 00:00:00,41,60,o,,4,sb 5m ^ cave trail,50,NEW,painted,,sMb,,"lost toes for vial, accidently cut off toe 11"


In [37]:
df.loc[idx_o,'rtl']= '0'

These entries now look like this:

In [38]:
df.loc[idx_o,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
5280,sv,1-6-11-20,m,2003-06-27 00:00:00,41,60,0,,4,sb 5m ^ cave trail,50,NEW,painted,,sMb,,"lost toes for vial, accidently cut off toe 11"


‘32 -12’ --> 32; misc: “potential double-break at 12 \[george to check before use\]”

In [39]:
idx_32 = (df.rtl.isin(badtypes))&(df.rtl=='32 -12')
df.loc[idx_32]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
1754,uo,4-6-18,m,2004-07-12 00:00:00,52,75,32 -12,,4.7,sb opp fallen juniper -> flat R,208,new,painted,,w^c,04-63,blue throat and blue belly; accidentally cut toe 6


In [40]:
df.loc[idx_32,'misc']= df.loc[idx_32,:].misc\
.apply(lambda x: appendstr(x,"potential double-break at 12 [george to check before use]",';'))

df.loc[idx_32,'rtl']= '32'

These entries now look like this:

In [41]:
df.loc[idx_32,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
1754,uo,4-6-18,m,2004-07-12 00:00:00,52,75,32,,4.7,sb opp fallen juniper -> flat R,208,new,painted,,w^c,04-63,;potential double-break at 12 [george to check before use];blue throat and blue belly; accidentally cut toe 6


‘-’ --> NaN

In [42]:
idx_minus = (df.rtl.isin(badtypes))&(df.rtl=='-')
df.loc[idx_minus,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
1850,sv,,f,2004-07-21 00:00:00,-,-,-,,6,sb @ cc/ccc,240,recap,painted,,w148b,,escaped


We will also address the values for svl and tl in this row.

In [43]:
df.loc[idx_minus,['rtl','tl','svl']]= np.nan

These entries now look like this:

In [44]:
df.loc[idx_minus,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
1850,sv,,f,2004-07-21 00:00:00,,,,,6,sb @ cc/ccc,240,recap,painted,,w148b,,escaped


‘10(kink)’ --> 0; misc: “kink at 10mm”

In [45]:
idx_10kink = (df.rtl.isin(badtypes))&(df.rtl=='10(kink)')
df.loc[idx_10kink,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
1877,sj,2-9-12-18,f,2004-07-22 00:00:00,65,94,10(kink),,9.4,wall rt side v wall v cave tr,,recap,painted,,w154b,,hurt toes 11-13 in capture; Bss Tss


In [46]:
df.loc[idx_10kink,'misc']= df.loc[idx_10kink,:].misc.apply(lambda x: appendstr(x,"kink at 10mm",';'))
df.loc[idx_10kink,'rtl']= '0'

These entries now look like this:

In [47]:
df.loc[idx_10kink,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
1877,sj,2-9-12-18,f,2004-07-22 00:00:00,65,94,0,,9.4,wall rt side v wall v cave tr,,recap,painted,,w154b,,;kink at 10mm;hurt toes 11-13 in capture; Bss Tss


Now we will inspect the rows with a missing rtl that have exactly one of the other length measurements (svl or tl).

In [48]:
pd.reset_option('max_colwidth')
idx_rtlnaplus1 = (df.rtl.isna())&(((df.svl.isna())&~(df.tl.isna()))|(~(df.svl.isna())&(df.tl.isna())))
df.loc[idx_rtlnaplus1]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
299,sj,,m,2002-03-16 00:00:00,large,,,,,active in crevice in wall 3m v juniper xing,112.0,sighting,,,?,,
471,sj,,,2002-03-17 00:00:00,large,,,,,H4a,194.0,sighting,,,w85a??,,"probably w85a but could only see the ""5"""
832,sj,,f,2002-03-20 00:00:00,large,,,,,L across from wall,318.0,sighting,,,w||t,,
1297,sj,,m,2002-03-19 00:00:00,large,,,,,up rt wall @ pool,,sighting,,,???,,~25mm original T; rest regrown
1441,sv,,?,2002-03-19 00:00:00,small,,,,,sb 4m ^ flatR,,sighting,,,,,had moth so didn'tcatch
2526,sj,2-6-13-20,f,2001-03-23 00:00:00,~70,,,,,bottom R wall v cave trail,30.0,sighting,,,?,"could read toes 6,13 for certain; toe 2 uncert...",


All of the rows shown above are sightings.  The next cell looks for recaptures that have an svl or tl value but no rtl; we will have to look at the field notes to confirm whether or not data were actually missing for the entry it returns.

In [49]:
df.loc[(df.rtl.isna())&((df.svl.notna())|(df.tl.notna()))&df['new.recap'].str.contains('recap'),:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
1564,sv,1-6-16-17-20,m,2004-07-04 00:00:00,52,53,,,3.6,bottom chute,355,recap,painted,,w.t,,few mites


Once we have addressed these, we will force rtl to an int type.

Now we check for out-of-range rtl values, *i.e.* rtl values less than -1 or suspiciously high.

We will exclude 0 and -1 values for rtl in these figures because of the large proportion of in range values they account for.

In [50]:
dfnobadtypes0neg1 = (~df.rtl.isin(badtypes))&(~df.rtl.isin(['0','-1']))
dfother = ~(df.species.dropna().str.contains('v|j'))&(df.species.notna())&(dfnobadtypes0neg1)
jarrovii = go.Histogram(x = df.loc[(df.species.str.contains('j'))&(dfnobadtypes0neg1)
                                   ,'rtl'].astype(int, errors = 'ignore'),name = 'S. jarrovii',xbins =dict(size=1)
                        #,histnorm='probability'
                        , cumulative=dict(enabled = False, direction = 'increasing'))
virgatus = go.Histogram(x = df.loc[(df.species.str.contains('v'))&(dfnobadtypes0neg1)
                                   ,'rtl'].astype(int, errors = 'ignore'), name = 'S. virgatus',xbins =dict(size=1)
                       #,histnorm='probability'
                        , cumulative=dict(enabled = False, direction = 'increasing'))
other = go.Histogram(x = df.loc[dfother,'rtl'].astype(int, errors = 'ignore'), name = 'other',xbins =dict(size=1)
                                  #,histnorm='probability'
                     , cumulative=dict(enabled = False
                                                                           , direction = 'increasing'))
data = [jarrovii, virgatus,other]
layout = go.Layout(
    title = 'Histogram of rtl by species',
    titlefont = dict(
        size = 20),
    xaxis = dict(
        dtick = 1,
        title = 'rtl (mm)',
        titlefont = dict(
            size = 18)),
    yaxis = dict(
        title = 'Number of Lizards',
        titlefont = dict(
            size = 18))
)
fig = go.Figure(
        data = data,
        layout = layout)
py.iplot(fig, filename = 'Histogram of rtl by species (new)')

Perhaps it's worth inspecting values of 50 or more.

In [51]:
idx_dfabove50 = (df.species.str.contains('j|v'))&(~df.rtl.isin(badtypes))\
&(df.loc[(~df.rtl.isin(badtypes)),'rtl'].astype(int, errors = 'ignore')>=50)
df[idx_dfabove50]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
90,sj,2-12-17,m,2002-07-03 00:00:00,87.0,93,54,,17.6,wall 20m ^ 1 falls,20.0,recap,painted; Rnuchal 5; Radj scales 4; R eye 4; Ln...,,b4c,,"[note: on 8VII02 - recorded w?a, but search of..."
92,sj,4-10-14-18,m,2002-07-03 00:00:00,66.0,69,50,,8.6,sb 20m ^ 1 falls,20.0,NEW,painted; shed recently; Rnuchal 5(3y2r); R adj...,,b3c,toes in vial 33-02 (4-10-14-18),
1489,sj,4-9-12-20,m,2004-07-02 00:00:00,85.0,66,52,,22.0,leaning juniper 7m v top,452.0,recap,painted,,wCa,,B soon to shed
1722,sj,4-9-12-20,m,2004-07-10 00:00:00,80.0,68,52,,21.2,5m v rock with oak,440.0,recap,painted,,wTa,,no mites
2147,sj,2 - 8 - 13,F,2011-06-21 00:00:00,70.0,73,54,,10.4,stream bed 2m v H5,198.0,recap,yes,,g26b,,
2149,sj,3 - 9 - 15,F,2011-06-23 00:00:00,75.0,82,56,,7.7,right Rs bottom bowl,-7.0,recap,yes,,g35b,,was w45c; recently dropped; TSS
2156,sj,1 - 7 - 14 - 19,M,2011-06-19 00:00:00,80.0,73,52,,20.0,opp oak R,418.0,recap,yes,,g10b.t,,BSS T-shed HSS
2170,sj,b 1 - 7 - 11,F,2011-06-20 00:00:00,7.5,70,90,,0.0,3m right side ^ Juniper Xing,118.0,recap,yes,,,,"Break at 50, tail still attached w48c -> g18b ..."
2176,sj,b 2 - 9 - 15 - 17,F,2011-06-20 00:00:00,7.6,68,86,,0.0,10m up CCC on slab,250.0,recap,yes,,g19b,,
2293,sj,5-11-18,M,2014-07-03 00:00:00,85.0,73,56,,19.5,black r,171.0,new,yes,,o5c,14-04,


<a id='outstanding1'></a>

Some of these values are reasonable, but there are a few from 2011 for which we will need to go back to the field notes.  Those rows in which rtl > tl need to be investigated.

[Back to Outstanding Problems](#Outstanding-Problems)

In [52]:
idx_rtltlbig = (idx_dfabove50)&(df.rtl.astype(int,errors = 'ignore')>df.tl.astype(int,errors ='ignore'))
df.loc[idx_rtltlbig]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
2170,sj,b 1 - 7 - 11,F,2011-06-20 00:00:00,7.5,70,90,,0,3m right side ^ Juniper Xing,118,recap,yes,,,,"Break at 50, tail still attached w48c -> g18b ..."
2176,sj,b 2 - 9 - 15 - 17,F,2011-06-20 00:00:00,7.6,68,86,,0,10m up CCC on slab,250,recap,yes,,g19b,,
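When both columns may hold strings, comparing them directly can give misleading results, because strings compare character by character (`'9' > '10'` is True). A minimal sketch of a numeric comparison on toy data is shown below; the values are made up, apart from the two rtl/tl pairs echoed from the rows above.

```python
import pandas as pd

# Toy frame with string-typed measurements, including one unparseable entry.
toy = pd.DataFrame({'tl': ['93', '70', '68', '-'],
                    'rtl': ['54', '90', '86', '0']})

tl_num = pd.to_numeric(toy.tl, errors='coerce')
rtl_num = pd.to_numeric(toy.rtl, errors='coerce')

# Comparisons involving NaN evaluate to False, so unparseable rows are not flagged here.
suspect = rtl_num > tl_num
print(toy[suspect])
```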


These appear to be cases where svl, tl, rtl and mass were entered into the wrong columns.  The value currently in each column should probably be moved as follows (current column --> correct column):
- svl-->mass
- tl-->svl
- rtl-->tl
- mass-->rtl

We will correct these now.

In [53]:
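import copy
def swap(df):
    """moves values that were entered one column out of place back to the correct column:
    svl-->mass, tl-->svl, rtl-->tl, mass-->rtl. Returns a corrected copy of df."""
    df = df.copy()
    tmp = {
        'rtl':copy.copy(df['rtl']),
        'tl':copy.copy(df['tl']),
        'svl':copy.copy(df['svl']),
        'mass':copy.copy(df['mass'])
    }
    df['rtl'] = tmp['mass']
    df['tl'] = tmp['rtl']
    df['svl'] = tmp['tl']
    df['mass'] = tmp['svl']
    return df
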


In [54]:
df.loc[idx_rtltlbig,['svl','rtl','tl','mass']] = swap(df.loc[idx_rtltlbig,['svl','rtl','tl','mass']])
df.loc[idx_rtltlbig,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
2170,sj,b 1 - 7 - 11,F,2011-06-20 00:00:00,70,90,0,,7.5,3m right side ^ Juniper Xing,118,recap,yes,,,,"Break at 50, tail still attached w48c -> g18b ..."
2176,sj,b 2 - 9 - 15 - 17,F,2011-06-20 00:00:00,68,86,0,,7.6,10m up CCC on slab,250,recap,yes,,g19b,,
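The same column rotation can be sketched without a helper function by assigning a NumPy array, which pandas places by position rather than aligning on column names. The frame and the mask `shifted` below are toy stand-ins, not the notebook's data.

```python
import pandas as pd

# Toy frame: row 0 has its values shifted one column out of place, row 1 is fine.
toy = pd.DataFrame({'svl': [7.5, 87.0], 'tl': [70.0, 93.0],
                    'rtl': [90.0, 54.0], 'mass': [0.0, 17.6]})
shifted = pd.Series([True, False])

source_cols = ['svl', 'tl', 'rtl', 'mass']   # where the values currently sit
target_cols = ['mass', 'svl', 'tl', 'rtl']   # where each of them belongs

# .to_numpy() drops the column labels, so values land in target_cols by position.
toy.loc[shifted, target_cols] = toy.loc[shifted, source_cols].to_numpy()
print(toy)
```

Keeping the source and target lists side by side makes the intended mapping easy to check against the list given in the text.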


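Now we force rtl to int type, ignoring errors.  Note that with `errors = 'ignore'` the column is returned unchanged if any value, such as NaN, cannot be converted, so rtl keeps its object dtype while missing values remain.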

In [55]:
df['rtl'] = df.rtl.astype(int,errors = 'ignore')
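If an integer column that tolerates missing values is wanted, one option is pandas' nullable `Int64` dtype. The sketch below assumes a pandas version that provides this dtype and uses toy values rather than the real column.

```python
import pandas as pd

# Toy column of string measurements with one missing value.
raw = pd.Series(['0', '32', None, '-1'], dtype=object)

# Parse to numbers first, then convert to the nullable integer dtype.
as_int = pd.to_numeric(raw, errors='coerce').astype('Int64')
print(as_int)
print(as_int.dtype)
```

The conversion to `Int64` would fail on fractional values, which acts as a useful check that the column really holds whole numbers.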

## tl
[Back to: Top](#Table-of-Contents)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#Column-by-Column-Cleaning)

Here we investigate and clean values in the column 'tl'. These should be int type values that are positive.  First, we test to see if all of the values are of type int.

In [56]:
df.tl.astype(int,errors='ignore').apply(lambda x: type(x)).value_counts(dropna=False)

<class 'float'>    3590
<class 'str'>      2709
Name: tl, dtype: int64

Let us inspect the entries for which attempting to convert 'tl' results in a float type.

In [57]:
idx_floatNaNtl = df.tl.astype(int,errors='ignore').apply(lambda x: type(x) is float)
df.loc[idx_floatNaNtl,'tl'].value_counts(dropna=False)

NaN    3590
Name: tl, dtype: int64

These are all NaN entries and can be ignored for the time being.

Let's inspect the non-NaN entries now.

In [58]:
idx_strtl = df.tl.astype(int,errors='ignore').apply(lambda x: type(x) is str)
df.loc[idx_strtl,'tl'].value_counts(dropna=False)

70         79
73         70
75         69
68         69
69         68
65         62
72         60
71         57
78         55
66         55
67         54
63         52
76         50
60         48
74         47
85         45
64         44
90         44
80         43
61         42
100        41
88         40
79         38
55         37
62         36
59         35
57         35
86         34
52         34
93         33
50         33
58         32
53         30
98         30
91         30
54         29
81         29
102        29
89         28
77         28
47         27
95         26
87         26
97         25
99         25
82         25
84         25
83         24
49         24
92         24
51         24
46         24
103        24
94         23
40         23
101        22
56         22
48         21
96         20
105        19
106        19
43         19
45         19
104        18
35         16
110        15
120        15
109        14
111        13
42         13
44         13
41    

With the exception of the value '56 (42)', the tl values that are not NaN could be converted to int types.  Let's inspect this entry.

In [59]:
pd.set_option('max_colwidth',1000)
idx_5642tl = df.loc[(idx_strtl) & (df.tl=='56 (42)'),:].index
df.loc[df.index.isin(idx_5642tl)]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
1989,sj,1-2-3-4-5,f,2009-07-13 00:00:00,69,56 (42),-1,,9.2,T opp mid wall v juniper xing,85,new,painted,,y7a,,missing LFF (left front foot); open break in tail at 42


Based on the notes in the misc column, tl should be recorded as 56.  We will do this now.

In [60]:
df.loc[df.index.isin(idx_5642tl),'tl']='56'

Now the entry looks like this.

In [61]:
df.loc[df.index.isin(idx_5642tl),:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
1989,sj,1-2-3-4-5,f,2009-07-13 00:00:00,69,56,-1,,9.2,T opp mid wall v juniper xing,85,new,painted,,y7a,,missing LFF (left front foot); open break in tail at 42


We will use a histogram to try and identify abnormalities among the other tl values.

In [62]:
# dfnobadtypes0neg1 = (~df.tl.isin(badtypes))&(~df.tl.isin(['0','-1']))
# rows whose species is neither S. jarrovii nor S. virgatus and whose tl is not null
dfother = ~(df.species.dropna().str.contains('v|j'))&(df.species.notna())&(df.tl.notna())
# astype's second positional argument is `copy`, not `errors`, so astype(int, 'ignore')
# behaves like a plain astype(int); the non-null tl values are all integer strings by now
jarrovii = go.Histogram(x = df.loc[(df.species.str.contains('j'))&(df.tl.notna()),'tl'].astype(int),
                        name = 'S. jarrovii', xbins = dict(size=1),
                        #histnorm='probability',
                        cumulative = dict(enabled = False, direction = 'increasing'))
virgatus = go.Histogram(x = df.loc[(df.species.str.contains('v'))&(df.tl.notna()),'tl'].astype(int),
                        name = 'S. virgatus', xbins = dict(size=1),
                        #histnorm='probability',
                        cumulative = dict(enabled = False, direction = 'increasing'))
other = go.Histogram(x = df.loc[dfother,'tl'].astype(int),
                     name = 'other', xbins = dict(size=1),
                     #histnorm='probability',
                     cumulative = dict(enabled = False, direction = 'increasing'))
data = [jarrovii, virgatus, other]
layout = go.Layout(
    title = 'Histogram of tl by species',
    titlefont = dict(
        size = 20),
    xaxis = dict(
        dtick = 1,
        title = 'tl (mm)',
#         tickfont = dict(
#         size = 8),
#         tickangle = 85,
        titlefont = dict(
            size = 18)),
    yaxis = dict(
        title = 'Number of Lizards',
        titlefont = dict(
            size = 18))
)
fig = go.Figure(
        data = data,
        layout = layout)
py.iplot(fig, filename = 'Histogram of tl by species (new)')

There is not much we can identify graphically at this point, so we will revisit this later.  For now we will force tl to int.

In [63]:
df['tl'] = df.tl.astype(int, errors = 'ignore')
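
One caveat worth keeping in mind: `astype(int, errors='ignore')` returns the column unchanged when any value (such as NaN) cannot be cast, so the values may remain strings; the later cell that filters on `df.tl=='53'` is consistent with that. Below is a minimal sketch of an alternative, assuming a nullable integer column is acceptable downstream. The toy series is hypothetical, and note that `errors='coerce'` silently turns any non-numeric text into a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df.tl: integer-like strings mixed with NaN.
tl_toy = pd.Series(['70', '56', np.nan, '102'], dtype=object)

# to_numeric parses the strings (NaN stays NaN); 'Int64' is pandas' nullable
# integer dtype, which can hold missing values alongside integers.
tl_int = pd.to_numeric(tl_toy, errors='coerce').astype('Int64')

print(tl_int.dtype)
print(tl_int.isna().sum())
```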

## svl
[Back to: Top](#Table-of-Contents)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#Column-by-Column-Cleaning)



We will take a similar approach for svl.

In [64]:
df.svl.astype(int,errors='ignore').apply(lambda x: type(x)).value_counts(dropna=False)

<class 'float'>    3584
<class 'str'>      2715
Name: svl, dtype: int64

Let us inspect the entries for which attempting to convert 'svl' results in a float type.

In [65]:
idx_floatNaNsvl = df.svl.astype(int,errors='ignore').apply(lambda x: type(x) is float)
df.loc[idx_floatNaNsvl,'svl'].value_counts(dropna=False)

NaN    3584
Name: svl, dtype: int64

These are all NaN entries and can be ignored for the time being.

Let's inspect the non-NaN entries now.

In [66]:
idx_strsvl = df.svl.astype(int,errors='ignore').apply(lambda x: type(x) is str)
df.loc[idx_strsvl,'svl'].value_counts(dropna=False)

50       108
52        85
53        82
56        81
55        80
51        77
49        75
60        74
48        74
75        71
70        71
54        70
58        61
65        60
47        60
61        58
46        57
57        52
45        51
73        50
63        50
68        49
72        47
76        46
59        45
66        44
62        44
64        43
80        43
78        40
43        38
82        37
40        36
74        36
42        35
77        35
71        34
69        33
67        32
85        31
79        30
44        30
81        28
83        26
39        26
84        26
87        23
38        23
89        21
32        20
35        20
41        20
37        20
31        20
36        20
34        19
86        19
88        18
90        16
33        14
30        13
91        12
93         8
92         7
29         7
28         6
large      4
27         4
22         2
26         2
95         2
13         2
96         2
98         2
105        2
112        1
24         1

The values 'large', 'small', and '~70' require closer inspection.

In [67]:
idx_txtvals = df.svl.isin(['small','large','~70'])
df.loc[idx_txtvals]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
299,sj,,m,2002-03-16 00:00:00,large,,,,,active in crevice in wall 3m v juniper xing,112.0,sighting,,,?,,
471,sj,,,2002-03-17 00:00:00,large,,,,,H4a,194.0,sighting,,,w85a??,,"probably w85a but could only see the ""5"""
832,sj,,f,2002-03-20 00:00:00,large,,,,,L across from wall,318.0,sighting,,,w||t,,
1297,sj,,m,2002-03-19 00:00:00,large,,,,,up rt wall @ pool,,sighting,,,???,,~25mm original T; rest regrown
1441,sv,,?,2002-03-19 00:00:00,small,,,,,sb 4m ^ flatR,,sighting,,,,,had moth so didn'tcatch
2526,sj,2-6-13-20,f,2001-03-23 00:00:00,~70,,,,,bottom R wall v cave trail,30.0,sighting,,,?,"could read toes 6,13 for certain; toe 2 uncertain and didn't see 20 but this is the only right-sized female who could possible fit! Originally caught in July 1998, sb 20m ^ cave trail.",


All of these values for svl should be set to NaN since these are estimates, not measured values.  For the entry with the svl value of '~70', we can add the estimated value to the misc column. We will use the [appendstr](#appendstr) function here again.

In [68]:
idx_apprx70svl = (idx_txtvals)&(df.svl=='~70')
df.loc[idx_apprx70svl,'misc'] = df.loc[idx_apprx70svl,'misc'].apply(lambda x: appendstr(x,connector=';'
                                                        ,position=-1
                                                        ,value='svl estimated to be ~70mm'))
df.loc[idx_apprx70svl,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
2526,sj,2-6-13-20,f,2001-03-23 00:00:00,~70,,,,,bottom R wall v cave trail,30,sighting,,,?,"could read toes 6,13 for certain; toe 2 uncertain and didn't see 20 but this is the only right-sized female who could possible fit! Originally caught in July 1998, sb 20m ^ cave trail.",;svl estimated to be ~70mm;


In [69]:
df.loc[idx_txtvals,'svl']=np.nan
df.loc[idx_txtvals]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
299,sj,,m,2002-03-16 00:00:00,,,,,,active in crevice in wall 3m v juniper xing,112.0,sighting,,,?,,
471,sj,,,2002-03-17 00:00:00,,,,,,H4a,194.0,sighting,,,w85a??,,"probably w85a but could only see the ""5"""
832,sj,,f,2002-03-20 00:00:00,,,,,,L across from wall,318.0,sighting,,,w||t,,
1297,sj,,m,2002-03-19 00:00:00,,,,,,up rt wall @ pool,,sighting,,,???,,~25mm original T; rest regrown
1441,sv,,?,2002-03-19 00:00:00,,,,,,sb 4m ^ flatR,,sighting,,,,,had moth so didn'tcatch
2526,sj,2-6-13-20,f,2001-03-23 00:00:00,,,,,,bottom R wall v cave trail,30.0,sighting,,,?,"could read toes 6,13 for certain; toe 2 uncertain and didn't see 20 but this is the only right-sized female who could possible fit! Originally caught in July 1998, sb 20m ^ cave trail.",;svl estimated to be ~70mm;


Now we force the remaining svl values to int type.

In [70]:
df['svl'] = df.svl.astype(int, errors= 'ignore')

### autotomized 
[Back to: Top](#Table-of-Contents)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#Column-by-Column-Cleaning)

[creating 'rtl_orig' and relabeling 'rtl' and 'autotomized'](#creating-'rtl_orig'-and-relabeling-'rtl'-and-'autotomized')
- [copy the values in rtl to a new column, *rtl_orig*](#copy-the-values-in-rtl-to-a-new-column)
- [relabel entries in the autotomized column based on the values in the rtl_orig column](#relabel-entries-in-the-autotomized-column-based-on-the-values-in-the-rtl_orig-column) 
- [relabel entries in the rtl column](#relabel-entries-in-the-rtl-column)

Here we populate the 'autotomized' column based on the values in 'rtl'.  Most of the source files did not record this category and have NaN values; the others have float values of 1.0, 2.0, or 3.0 for intact, autotomized with no regrowth, or autotomized with regrowth, respectively.  The cleaned autotomized column will contain bool values: True for having experienced autotomy (irrespective of regrowth) and False for no evidence of having experienced autotomy.

In [71]:
df.autotomized.value_counts(dropna=False)

NaN     6212
 1.0      61
 3.0      17
 2.0       9
Name: autotomized, dtype: int64

We will inspect the rtl values for entries with non NaN values for autotomized to determine if we can depend on rtl values to determine autotomy status.  In order to rely on rtl values, the following conditions must be met:
- all entries in which autotomized equals 1.0 must have 0 for rtl
- all entries in which autotomized equals 2.0 or 3.0 must have -1 or some value >0 for rtl
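
The two conditions above can be expressed as boolean masks. The following is a minimal sketch on hypothetical toy data (the `toy` frame and the `violations` name are illustrative), assuming rtl is stored as strings as it is in this data set:

```python
import pandas as pd

# Hypothetical rows: autotomized codes next to their recorded rtl values.
toy = pd.DataFrame({
    'autotomized': [1.0, 1.0, 2.0, 3.0, 2.0],
    'rtl':         ['0', '21', '-1', '15', '0'],
})

rtl_num = pd.to_numeric(toy['rtl'], errors='coerce')

# Condition 1: intact (1.0) entries must have rtl == 0.
bad_intact = (toy['autotomized'] == 1) & (rtl_num != 0)
# Condition 2: autotomized (2.0 or 3.0) entries must have rtl == -1 or rtl > 0.
bad_aut = toy['autotomized'].isin([2, 3]) & ~((rtl_num == -1) | (rtl_num > 0))

violations = toy[bad_intact | bad_aut]
print(violations)
```

Any row that appears in `violations` breaks one of the two conditions and would need a closer look.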

In [72]:
intact = df.loc[(df.autotomized==1),'rtl'].astype(int,errors = 'ignore').value_counts(dropna=False)
values2check = [x for x in intact.index[intact.index!=0]]
if len(values2check)>0:
    print("The rtl values associated with {} need a closer look.".format(values2check))
else:
    print("Values for 'intact' entries are as expected.  Continue.")
pd.set_option('max_colwidth',1000)
# df.loc[(df.autotomized==1)&(df.rtl.isin(['21'])),:]
df.loc[(df.autotomized==1)&(df.rtl.astype(int, errors = 'ignore').isin([str(x) for x in values2check])),:]
# need to see what broke this line

The rtl values associated with [21] need a closer look.


Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
2211,sv,3-7-11-16,f,19vi2010,60,42,21,1.0,8.5,6m^bottom site on R ^fallen T in sb,-14,NEW,yes,,y1a,CA-01-cc,gravid


<a id ='outstanding2'></a>

[Back to Outstanding Problems](#Outstanding-Problems)

This lizard appears to have been misrecorded and should be listed as autotomized given the amount of regrowth. This should be confirmed in the field notes. If we trust the data as recorded and depend on the rtl values to label autotomized this will be corrected, so for now we will leave this as is.

In [73]:
autotomized = df.loc[(df.autotomized==2),'rtl'].value_counts(dropna=False)
aut_values2check = [x for x in autotomized.index[autotomized.index!='-1']]# change to 'isin' argument with 0 and -1
if len(aut_values2check)>0:
    print("{} values associated with an rtl value of {} need a closer look."\
          .format(df.loc[(df.autotomized==2)&(df.rtl.isin(aut_values2check)),:].shape[0],aut_values2check))
else:
    print("Values for 'autotomized' entries are as expected.  Continue.")
pd.set_option('max_colwidth',1000)
idx_aut_entries2check = (df.autotomized==2)&(df.rtl.isin([str(x) for x in aut_values2check]))
df.loc[idx_aut_entries2check,:]

8 values associated with an rtl value of ['0'] need a closer look.


Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
2187,sj,3-6-15-17,m,18viii2010,52,70,0,2.0,4.5,r outcrop ^ oak R,425.0,new,yes,,y68c,67-10-cc,Tss
2202,sj,3-6-12-18,f,26vii2010,66,79,0,2.0,9.0,339m; rt side 1m up,339.0,NEW,yes,,y54c,48-10-cc,Bss
2208,sj,3-6-13-20,f,5viii2010,48,69,0,2.0,3.5,5m up ccc,,new,yes,,>c,59-10-cc,
2235,sj,3-10-13-(14)-16,m,12vi2010,92,112,0,2.0,18.0,talus^Rwall v talus left side 4m up,,recap,yes,,w1c,,"toe 14 looks like natural toe loss; skinny [pics], spine visible [pics], old puncture wounds on back [pics], possible injury of back rt leg at knee [pics]"
2259,sj,4-7-8-9-11-18,m,26vii2010,96,97,0,2.0,24.5,2m left of stump,364.0,recap,yes,,y51c,,old mark wXc still visible; not the same as wXc at chute
2262,sj,3-10-13-14-16,m,26vii2010,91,111,0,2.0,18.5,1m v top R wall v talus in sb,320.0,recap,yes,,y55c,,1m v top R wall in sb; last mark still visible; injuries to dorsum [pics]; toe 14 may be natural toe loss; lizard is skinny: spine and hip bones visible [pics]
2265,sj,,f,5viii2010,70,86,0,2.0,11.0,bottom site,,recap,,,y61c.t,,BSS; toe 19 could be natural toe loss
2267,sv,3-8-12-16,m,20vi2010,52,58,0,2.0,3.5,5m v wall v wall v juniper xing left side sb,80.0,recap,yes,,y6a,,


<a id = 'outstanding3'></a>

[Back to Outstanding Problems](#Outstanding-Problems)

Some of these cases are very straightforward given that the ratio of svl to tl is very close to 1, but for others it would be worth checking the original data to confirm. Another option is to use the svl to tl ratio of animals that we are sure are intact to decide how to classify these.  For now we will trust the system of recording used in 2010 and update the rtl values to '-1'.
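
The ratio-based option mentioned above could look roughly like this. It is a sketch on hypothetical numbers, not a rule adopted in this notebook: the `intact` flag, the use of the smallest intact ratio as the cut-off, and all column values are assumptions made for illustration (on real data a low quantile would likely be more robust than the minimum).

```python
import pandas as pd

# Hypothetical animals of one species; 'intact' marks tails known to be original.
toy = pd.DataFrame({
    'svl':    [60, 70, 80, 96, 52],
    'tl':     [84, 98, 112, 97, 80],
    'intact': [True, True, True, False, False],  # False = status in question
})

toy['tl_svl'] = toy['tl'] / toy['svl']

# Reference: smallest tl/svl ratio seen among animals known to be intact.
min_intact_ratio = toy.loc[toy['intact'], 'tl_svl'].min()

# Questionable entries whose ratio falls below the intact reference are
# candidates for 'autotomized'; the rest still need the field notes.
questionable = toy.loc[~toy['intact']].copy()
questionable['likely_autotomized'] = questionable['tl_svl'] < min_intact_ratio
print(questionable[['svl', 'tl', 'tl_svl', 'likely_autotomized']])
```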

In [74]:
df.loc[idx_aut_entries2check,'rtl'] = '-1'

In [75]:
df.loc[idx_aut_entries2check]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
2187,sj,3-6-15-17,m,18viii2010,52,70,-1,2.0,4.5,r outcrop ^ oak R,425.0,new,yes,,y68c,67-10-cc,Tss
2202,sj,3-6-12-18,f,26vii2010,66,79,-1,2.0,9.0,339m; rt side 1m up,339.0,NEW,yes,,y54c,48-10-cc,Bss
2208,sj,3-6-13-20,f,5viii2010,48,69,-1,2.0,3.5,5m up ccc,,new,yes,,>c,59-10-cc,
2235,sj,3-10-13-(14)-16,m,12vi2010,92,112,-1,2.0,18.0,talus^Rwall v talus left side 4m up,,recap,yes,,w1c,,"toe 14 looks like natural toe loss; skinny [pics], spine visible [pics], old puncture wounds on back [pics], possible injury of back rt leg at knee [pics]"
2259,sj,4-7-8-9-11-18,m,26vii2010,96,97,-1,2.0,24.5,2m left of stump,364.0,recap,yes,,y51c,,old mark wXc still visible; not the same as wXc at chute
2262,sj,3-10-13-14-16,m,26vii2010,91,111,-1,2.0,18.5,1m v top R wall v talus in sb,320.0,recap,yes,,y55c,,1m v top R wall in sb; last mark still visible; injuries to dorsum [pics]; toe 14 may be natural toe loss; lizard is skinny: spine and hip bones visible [pics]
2265,sj,,f,5viii2010,70,86,-1,2.0,11.0,bottom site,,recap,,,y61c.t,,BSS; toe 19 could be natural toe loss
2267,sv,3-8-12-16,m,20vi2010,52,58,-1,2.0,3.5,5m v wall v wall v juniper xing left side sb,80.0,recap,yes,,y6a,,


In [76]:
regrown = df.loc[(df.autotomized==3),'rtl'].value_counts(dropna=False).reset_index()['index']\
.astype(int, errors = 'ignore')
# keep the offending rtl values themselves rather than a list of booleans,
# so that they can be reported and used to select rows
values2check = [x for x in regrown[regrown<=0]]
if len(values2check)>0:
    print("The values associated with {} need a closer look.".format(values2check))
else:
    print("Values for 'regrown' entries are as expected.  Continue.")
pd.set_option('max_colwidth',1000)
df.loc[(df.autotomized==3)&(df.rtl.isin([str(x) for x in values2check])),:]

Values for 'regrown' entries are as expected.  Continue.


Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc


<a id='rtlRTL_ORIGautotomized'></a>

### creating 'rtl_orig' and relabeling 'rtl' and 'autotomized'
[Back to: Top](#Table-of-Contents)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#Column-by-Column-Cleaning)

[Back to: 'autotomized'](#autotomized)

Now we will:
- [copy the values in rtl to a new column, *rtl_orig*](#copy-the-values-in-rtl-to-a-new-column)
- [relabel entries in the autotomized column based on the values in the rtl_orig column](#relabel-entries-in-the-autotomized-column-based-on-the-values-in-the-rtl_orig-column) 
- [relabel entries in the rtl column](#relabel-entries-in-the-rtl-column)

<a id='copyrtl'></a>

#### copy the values in rtl to a new column
[Back to: 'autotomized'](#autotomized)

In [77]:
df['rtl_orig'] = df.rtl

#### relabel entries in the autotomized column based on the values in the rtl_orig column
[Back to: 'autotomized'](#autotomized)

We will do this using the following logic:
- if rtl_orig != 0 and rtl_orig is not NaN, autotomized = True
- if rtl_orig == 0, autotomized = False
- if rtl_orig is NaN, autotomized = np.nan

In [78]:
idx_auttrue = (~df.rtl_orig.isin(['0']))&(df.rtl_orig.notna())
df.loc[idx_auttrue,'autotomized'] = True

In [79]:
idx_autfalse = (df.rtl_orig.isin(['0']))&(df.rtl_orig.notna())
df.loc[idx_autfalse,'autotomized'] = False

In [80]:
idx_autnan = df.rtl_orig.isna()
df.loc[idx_autnan,'autotomized'] = np.nan
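
The three rules can also be checked on a small example. This is a sketch that assumes rtl_orig holds strings such as '0' and '-1' (as the `isin(['0'])` tests imply) and that pandas' nullable 'boolean' dtype is available; the toy series is hypothetical.

```python
import numpy as np
import pandas as pd

rtl_orig = pd.Series(['0', '-1', '38', np.nan], dtype=object)

# Start from all-missing, then fill in the two decidable cases.
autotomized = pd.Series(pd.NA, index=rtl_orig.index, dtype='boolean')
autotomized[rtl_orig.notna() & (rtl_orig != '0')] = True
autotomized[rtl_orig == '0'] = False

print(autotomized.tolist())
```

Rows with a missing rtl_orig stay missing, which mirrors the third rule.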

#### relabel entries in the rtl column
[Back to: 'autotomized'](#autotomized)

We will do this using the following logic:
- if rtl_orig == -1, rtl = 0

In [81]:
idx_rtlneg1 = df.rtl_orig=='-1'
df.loc[idx_rtlneg1,'rtl'] = 0

In [82]:
df.autotomized.value_counts(dropna=False)

NaN      3591
False    1989
True      719
Name: autotomized, dtype: int64

## toes 
[Top](#Table-of-Contents)

[Top Cleaning](#CleaningData)

Here we make changes to toes based on comments regarding a 2004 male Sv with toes recorded as '1-6-16-17-20'.

In [83]:
idx_sv2004m16161720 = (df.species=='sv') \
& (df.sex=='m') \
& (df.date.str.contains('2004-07-04')) \
&(df.svl=='52')&(df.tl=='53')

df.loc[idx_sv2004m16161720]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig
1564,sv,1-6-16-17-20,m,2004-07-04 00:00:00,52,53,,,3.6,bottom chute,355,recap,painted,,w.t,,few mites,


In [84]:
df.loc[idx_sv2004m16161720,'toes'] = '1-7-16-17-20'
df.loc[idx_sv2004m16161720,'rtl'] = 0

Now this entry looks like this.

In [85]:
df.loc[idx_sv2004m16161720]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig
1564,sv,1-7-16-17-20,m,2004-07-04 00:00:00,52,53,0,,3.6,bottom chute,355,recap,painted,,w.t,,few mites,


### Correcting toes

First we will rename "toes" to "toes_orig".

In [86]:
df = df.rename(columns = {'toes':'toes_orig'},index = str)

Next we create a new column, "toes", initialized as a copy of "toes_orig", to hold the corrected values.

In [87]:
df['toes'] = df.toes_orig

Now we attempt to identify problem _toes_ values and either correct them or export them for review.

In [88]:
# Raw strings keep regex escapes such as \w, \d and \? from being read as Python string escapes.
pattern1 = r".( {1,}-.|.- {1,}.)" # toes entries with one or more spaces on either side of a hyphen
pattern2 = r".( {,}\w{,} {1,})." # toes entries with space around or between numbers <- the spaces here should be deleted
pattern3 = r".(')." # entries with an apostrophe between two characters
pattern4 = r"./."  # entries with '/' <-- need to replace these with '-'
pattern5 = r"(\?{1,})" # entries with one or more '?' <-- these need to be investigated
pattern6 = r"^\d{3,}$" # entries consisting of only a single number of at least three digits
#<-- these need to be investigated by checking raw field notes
pattern7 = r".(-{2,})." # entries which have at least 2 consecutive '-' <- these should be investigated
pattern8 = r"^0" # entries that begin with "0" <-- check raw field notes on this too
pattern9 = r"a\w" # an 'a' followed by any character in the set [a-zA-Z0-9_]
#<-- handled: a hyphen should be inserted between the 'a' and the \w
pattern10 = r"b\w" # a 'b' followed by any character in the set [a-zA-Z0-9_] <-- handled in the same way
pattern11 = r"\wa" # an 'a' preceded by any character in the set [a-zA-Z0-9_]
pattern12 = r"\wb" # a 'b' preceded by any character in the set [a-zA-Z0-9_]
pattern13 = r"[()]" # entries containing a parenthesis
# remove space before 'a' at end of toes
#investigate '\d-', 
#'-(*)-', 
#' (16) ', 
#'---', <- may not exist in raw data
#'\d- ', 
#'- \d', 
#transcription errors from excel (toes in date format),
#'-\d\d\d\d' <- may not be in the data set

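The next block hard-codes the number of patterns (`range(1,14)` and the explicit list of pattern variables), so it must be edited whenever a toe pattern is added or removed.
This is not ideal and needs to be fixed.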

In [89]:
#Label the toe patterns
toe_pattern = pd.Series([*range(1,14)]) 
toe_pattern = make_str(toe_pattern)

#create the descriptions of the toe patterns
toe_pattern_descr = pd.Series([pattern1,pattern2,pattern3,pattern4
                               ,pattern5,pattern6,pattern7,pattern8
                               ,pattern9,pattern10,pattern11,pattern12,pattern13])
#ensure that the descriptions are str values.  Do we need this?
# toe_pattern_descr = toe_pattern_descr.astype(str)
# print(toe_pattern_descr,"\n\n")
# Place toe pattern descriptions and labels into a dataframe
toe_pattern_reference = pd.DataFrame({'toe_pattern': toe_pattern,'description':toe_pattern_descr})
toe_pattern_reference

Unnamed: 0,toe_pattern,description
0,1,".( {1,}-.|.- {1,}.)"
1,2,".( {,}\w{,} {1,})."
2,3,.(').
3,4,./.
4,5,"(\?{1,})"
5,6,"^\d{3,}$"
6,7,".(-{2,})."
7,8,^0
8,9,a\w
9,10,b\w
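
The dependence on the hard-coded `range(1,14)` could be removed by deriving the labels from the pattern list itself. A sketch follows; it assumes `make_str` produces zero-padded two-character labels such as '01' (as the later outputs suggest) and uses `str.zfill` in its place. The shortened `patterns` list is hypothetical.

```python
import pandas as pd

# Shortened, hypothetical pattern list; in the notebook this would hold pattern1..pattern13.
patterns = [r".( {1,}-.|.- {1,}.)", r"./.", r"[()]"]

toe_pattern_reference = pd.DataFrame({
    # Labels follow the list length, so adding or removing a pattern needs no other change.
    'toe_pattern': [str(i).zfill(2) for i in range(1, len(patterns) + 1)],
    'description': patterns,
})
print(toe_pattern_reference)
```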


Next, we replace the string 'nan' in the data set with a null value.

In [90]:
df.loc[df.toes=='nan','toes'] = np.nan

Let's see how many of these patterns we need to correct.  First we create a column in the df called *toe_pattern* and initialize it with NaN values.

In [91]:
df['toe_pattern'] = np.nan

Here we use a for-loop to populate *toe_pattern*.
There is probably a better way to do this with pandas `map` or `apply`. The loop is fast enough for now, but the difference could matter with a larger data set or with more patterns.

In [92]:
for i in range(0,toe_pattern_reference.shape[0]):
    tmp_pat_num = toe_pattern_reference.loc[i,'toe_pattern']
    tmp_pattern = toe_pattern_reference.loc[i,'description']
    df = label_pattern(x=df,pat_num=tmp_pat_num,pattern = tmp_pattern,pat_col = 'toe_pattern',col = 'toes')
print(df.toe_pattern.value_counts(dropna=False))

NaN    5998
02      258
01       39
05        2
09        1
13        1
Name: toe_pattern, dtype: int64
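
A version of this labelling that relies on `Series.str.contains` directly is sketched below. It is not the `label_pattern` implementation: it assumes that the first matching pattern wins (the counts above are consistent with that, since an entry matching pattern 01 would also match pattern 02), and the toy data and the two-pattern `reference` frame are hypothetical.

```python
import numpy as np
import pandas as pd

toes = pd.Series(['4 - 6 - 11', '3/13/16', '3 - 6/12', '8-11-20', np.nan], dtype=object)
reference = pd.DataFrame({
    'toe_pattern': ['01', '04'],
    'description': [r".( {1,}-.|.- {1,}.)", r"./."],
})

labels = pd.Series(np.nan, index=toes.index, dtype=object)
for pat_num, pattern in zip(reference['toe_pattern'], reference['description']):
    # na=False treats missing toes as non-matching instead of propagating NaN.
    hits = toes.str.contains(pattern, regex=True, na=False)
    # Only label rows that no earlier pattern has claimed (first match wins).
    labels[hits & labels.isna()] = pat_num

print(labels.tolist())
```

pandas may warn that the first pattern contains a match group; the warning does not affect the result of `str.contains`.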


#### Summarizing toe patterns
Here we produce a quick summary of the number of observations for each pattern in the data set.

In [93]:
toe_errors =df.toe_pattern.value_counts(dropna=False).reset_index()\
.rename(columns = {'index':'toe_pattern','toe_pattern':'observations'})
toe_errors.loc[toe_errors.toe_pattern.isnull(),'toe_pattern'] = 'Not covered by current patterns'
toe_errors_desc = toe_errors.merge(toe_pattern_reference,'left',on='toe_pattern')
toe_errors_desc

Unnamed: 0,toe_pattern,observations,description
0,Not covered by current patterns,5998,
1,02,258,".( {,}\w{,} {1,})."
2,01,39,".( {1,}-.|.- {1,}.)"
3,05,2,"(\?{1,})"
4,09,1,a\w
5,13,1,[()]


Now let's make sure we've accounted for every row in the data set.

In [94]:
accountedRows = toe_errors.observations.sum()
totalRows = df.shape[0]
notAccountedRows = df.shape[0] - toe_errors.observations.sum()
print("\nThere are {} rows accounted for in the patterns (including null values) and there are {} rows in the full data set.\
  There are {} rows unaccounted for.".format(accountedRows,totalRows,notAccountedRows))


There are 6299 rows accounted for in the patterns (including null values) and there are 6299 rows in the full data set.  There are 0 rows unaccounted for.


Now we correct these patterns. We'll preserve the original toe data in a column called "toes_orig" just in case.  We can drop this later, if we are comfortable with the changes.  The new toes will be labeled "toes".

In [95]:
corrections_config = {'01':{'action':'replace','pattern_b':" - ",'replacement':"-"},
            '02':{'action':'replace','pattern_b':" ",'replacement':"-"},
            '03':{'action':'replace','pattern_b':"\'",'replacement':"\"\""},
            '04':{'action':'replace','pattern_b':"/",'replacement':"-"},
            '05':{'action':'save','pattern_b':np.nan,'replacement':np.nan},
            '06':{'action':'save','pattern_b':np.nan,'replacement':np.nan},
            '07':{'action':'save','pattern_b':np.nan,'replacement':np.nan},
            '08':{'action':'replace','pattern_b':"^0",'replacement':"\"\""},
            '09':{'action':'replace','pattern_b':'a','replacement':'-a'},
            '10':{'action':'replace','pattern_b':'b','replacement':'-b'},          
            '11':{'action':'replace','pattern_b':"a",'replacement':"a-"},
            '12':{'action':'replace','pattern_b':"b",'replacement':"b-"},
            '13':{'action':'replace','pattern_b':"()",'replacement':"\"\""}}

In [96]:
toe_errors_desc['action'] = toe_errors_desc.loc[toe_errors_desc.toe_pattern.str.len()==2].toe_pattern\
.map(lambda x: corrections_config[x]['action'])

toe_errors_desc['pattern_b'] = toe_errors_desc.loc[toe_errors_desc.toe_pattern.str.len()==2].toe_pattern\
.map(lambda x: corrections_config[x]['pattern_b'])

toe_errors_desc['replacement'] = toe_errors_desc.loc[toe_errors_desc.toe_pattern.str.len()==2].toe_pattern\
.map(lambda x: corrections_config[x]['replacement'])

toe_errors_desc = toe_errors_desc.sort_values('toe_pattern').reset_index(drop=True)
toe_errors_desc.loc[toe_errors_desc['toe_pattern'] == 'Not covered by current patterns',
                                    ['action']] = 'ignore'
toe_errors_desc

Unnamed: 0,toe_pattern,observations,description,action,pattern_b,replacement
0,01,39,".( {1,}-.|.- {1,}.)",replace,-,-
1,02,258,".( {,}\w{,} {1,}).",replace,,-
2,05,2,"(\?{1,})",save,,
3,09,1,a\w,replace,a,-a
4,13,1,[()],replace,(),""""""
5,Not covered by current patterns,5998,,ignore,,


#### Replacing Toe Patterns
Here we actually replace offending patterns.

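Merge toe_errors_desc into df. Note that `merge` performs an inner join by default, so only rows whose toe_pattern appears in toe_errors_desc are kept. Rows with a NaN toe_pattern find no match there (the summary table labels them 'Not covered by current patterns') and are absent from the merged df, which is why the counts in the following sections are much smaller than 6299.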

In [97]:
df = df.merge(toe_errors_desc[['description','toe_pattern','action','pattern_b','replacement']], on='toe_pattern')
df.head()

Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,description,action,pattern_b,replacement
0,sv,8-11-20,m,2002-03-19 00:00:00,52,74,0,False,5.0,5m ^ CC/CCC on rt side,245,recap,"old paint, repainted over",,y20b,,"not legible, however",0,8-11-20,2,".( {,}\w{,} {1,}).",replace,,-
1,sv,8-12-20,f,2002-07-19 00:00:00,58,71,0,False,5.7,sb at R outcrop at 425 on rt side,425,recap,gravid,,w43a,,,0,8-12-20,2,".( {,}\w{,} {1,}).",replace,,-
2,sv,3 13 16,m,2004-07-02 00:00:00,54,51,38,True,5.2,sb 10m v falls,-10,recap,painted,,wTb,,not shed since- paint still visible,38,3 13 16,2,".( {,}\w{,} {1,}).",replace,,-
3,sv,4 6 11,m,2004-07-02 00:00:00,49,66,0,False,4.5,see wAa,45,new,painted,,wAc,04 03,,0,4 6 11,2,".( {,}\w{,} {1,}).",replace,,-
4,sv,4 6 12,m,2004-07-02 00:00:00,45,62,0,False,3.3,3 m above cave trail on rt,48,new,painted,,wAb,04 04,,0,4 6 12,2,".( {,}\w{,} {1,}).",replace,,-
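
Because `DataFrame.merge` defaults to `how='inner'`, rows whose toe_pattern has no counterpart in the right-hand table are dropped. The sketch below uses hypothetical toy frames to show the difference, in case keeping every row is the intent:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'toes': ['3 13 16', '8-11-20', '1-7-16'],
                     'toe_pattern': ['02', np.nan, np.nan]})
right = pd.DataFrame({'toe_pattern': ['02', 'Not covered by current patterns'],
                      'action': ['replace', 'ignore']})

inner = left.merge(right, on='toe_pattern')             # default: inner join
kept = left.merge(right, on='toe_pattern', how='left')  # keeps unmatched rows

print(len(left), len(inner), len(kept))
```

With `how='left'` the unmatched rows survive with NaN in the merged columns, so any later step that reads `action` would need to tolerate missing values.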


In [98]:
df['toes'] = df.apply(lambda x: replace_pattern(x=x,source_col='toes',pattern_b='pattern_b',
                                         replacement='replacement',action='action'), axis = 1)
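
For readers without the `liz_toes` module, the idea behind this step can be sketched as follows. `apply_action` is a hypothetical stand-in, not the actual `replace_pattern`; it assumes 'replace' means a literal substring substitution and that any other action leaves the value untouched.

```python
import numpy as np
import pandas as pd

def apply_action(row, source_col='toes'):
    """Return the corrected value for one row (hypothetical helper)."""
    value = row[source_col]
    if row['action'] == 'replace' and isinstance(value, str):
        return value.replace(row['pattern_b'], row['replacement'])
    return value  # 'save', 'ignore', or missing values are left as they are

toy = pd.DataFrame({
    'toes':        ['3 13 16', '4 - 6 - 11', '2-6-13-20?'],
    'action':      ['replace', 'replace', 'save'],
    'pattern_b':   [' ', ' - ', np.nan],
    'replacement': ['-', '-', np.nan],
})
toy['toes'] = toy.apply(apply_action, axis=1)
print(toy['toes'].tolist())
```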

Now we confirm that the patterns we expect to have eliminated have indeed been eliminated from the data set.

In [99]:
report_pattern(df,'description','toe_pattern','toes','Post-Correction')

['.( {,}\\w{,} {1,}).' '.( {1,}-.|.- {1,}.)' 'a\\w' '(\\?{1,})' '[()]']


<a id='sex'></a>

### sex
[Back to: Top](#Table-of-Contents)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#Column-by-Column-Cleaning)

Next we move on to cleaning the "sex" column.

First we want to get an idea of the types of problems in the sex column.  We start by stripping leading and trailing whitespace.  You can see here that there was none in the data set.

In [100]:
print(df.sex.str.len().unique())# returns unique lengths of sex
df.sex=df.sex.str.strip()
print(df.sex.str.len().unique())

[ 1. nan  4.  2.]
[ 1. nan  4.  2.]


#### Identify non "m" or "f" values and their frequencies

In [101]:
patterns_sex="m|f|NA"
non_matches=df.sex.loc[df.sex.str.match(patterns_sex)!=True]
print("\nThere are {} entries for sex which do not match the patterns {}:"\
      .format(non_matches.shape[0],patterns_sex.split("|")))
non_matches.value_counts()


There are 100 entries for sex which do not match the patterns ['m', 'f', 'NA']:


F    50
M    47
Name: sex, dtype: int64

#### Identify values to convert to NA, m, or f

In [102]:
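# isin() matches whole values exactly (it does not use regular expressions), so the escaped
# entries below only match strings containing a literal backslash; values containing '?'
# are handled by the str.contains step at the end of this section
sex2NA = ['adult','juv','nan',r'\?\?\?',r'\?']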
sex2m = ['unm','M']
sex2f = ['F']
df.loc[df.sex.isin(sex2NA)==True]
print("There are {} entries that should be converted to 'NaN'".format(df.sex.isin(sex2NA).sum()))
print("There are {} entries that should be converted to 'm'".format(df.sex.isin(sex2m).sum()))
print("There are {} entries that should be converted to 'f'".format(df.sex.isin(sex2f).sum()))

There are 0 entries that should be converted to 'NaN'
There are 47 entries that should be converted to 'm'
There are 50 entries that should be converted to 'f'


#### Convert the values to NA, f, or m, respectively.

In [103]:
df.loc[df.sex.isin(sex2NA),'sex']=np.nan
df.loc[df.sex.isin(sex2m),'sex']='m'
df.loc[df.sex.isin(sex2f),'sex']='f'
print("Now there are {} entries that should be converted to 'NaN'".format(df.sex.isin(sex2NA).sum()))
print("Now there are {} entries that should be converted to 'm'".format(df.sex.isin(sex2m).sum()))
print("Now there are {} entries that should be converted to 'f'".format(df.sex.isin(sex2f).sum()))

Now there are 0 entries that should be converted to 'NaN'
Now there are 0 entries that should be converted to 'm'
Now there are 0 entries that should be converted to 'f'


#### Set all remaining sex with "?" to NaN

In [104]:
df.loc[(df.sex.str.contains(r'\?')) & (df.sex.notnull()),'sex'] = np.nan
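
A more compact route to a similar result is to lower-case the values and keep only recognised codes. This is a sketch on hypothetical values; it assumes anything other than 'm' or 'f' after lower-casing should become NaN, which is stricter than the lists above (for example, 'unm' would become NaN rather than 'm').

```python
import numpy as np
import pandas as pd

sex = pd.Series(['m', 'F', ' M', 'f?', '?', np.nan], dtype=object)

cleaned = sex.str.strip().str.lower()
# where() keeps values satisfying the condition and replaces the rest with NaN.
cleaned = cleaned.where(cleaned.isin(['m', 'f']))
print(cleaned.tolist())
```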

<a id = 'species'></a>

### Species
[Back to: Top](#Table-of-Contents)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#Column-by-Column-Cleaning)

In [105]:
print(df.species.str.len().unique())# returns unique lengths of species
df.species=df.species.str.strip()
print(df.species.str.len().unique())

[2 3 1]
[2 1]


In [106]:
df.species.value_counts(dropna = False)

sj    134
sv    129
uo     28
Sj      5
Uo      3
j       1
sc      1
Name: species, dtype: int64

In [107]:
patterns_species="j|v|sj|sv|NA"
idx_notsjsv = (df.species.str.match(patterns_species)!=True)&(df.species.str.contains('j|v',case=False)!=True)
non_matches=df.species.loc[idx_notsjsv]
print("\nThere are {} entries for species which do not match the patterns {} and are unlikely to be definitely \
'sv', 'sj':".format(non_matches.shape[0],patterns_species.split("|")))
non_matches.value_counts()


There are 32 entries for species which do not match the patterns ['j', 'v', 'sj', 'sv', 'NA'] and are unlikely to be definitely 'sv', 'sj':


uo    28
Uo     3
sc     1
Name: species, dtype: int64

We will set species for these entries to  'other'.

In [108]:
df.loc[df.species.isin(non_matches.unique()),'species'] = 'other'

<a id ='outstanding4'></a>
#### *Sceloporus jarrovii*
[Back to Outstanding Problems](#Outstanding-Problems)

In [109]:
df.loc[df.species.str.contains('j',case=False),'species'].value_counts(dropna=False)

sj    134
Sj      5
j       1
Name: species, dtype: int64

The values with '?' should be investigated.

In [110]:
idx_sjrev = df.species.str.contains('j',case=False)&(df.species.str.contains(r'\?'))
df.loc[idx_sjrev,'species'].value_counts(dropna=False)
df.loc[idx_sjrev,:]

Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,description,action,pattern_b,replacement


In [111]:
idx_sj = df.species.str.contains('j',case=False)&(~df.species.str.contains(r'\?'))
df.loc[idx_sj,'species'].value_counts(dropna=False)

sj    134
Sj      5
j       1
Name: species, dtype: int64

We will convert all of these to 'j'.

In [112]:
df.loc[idx_sj,'species'] = 'j'

#### *Sceloporus virgatus*

In [113]:
idx_sv = df.species.str.contains('v',case=False)
df.loc[idx_sv,'species'].value_counts(dropna=False)

sv    129
Name: species, dtype: int64

We will convert these to 'v'.

In [114]:
df.loc[idx_sv,'species'] = 'v'

### new.recap
[Back to: Top](#Table-of-Contents)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#Column-by-Column-Cleaning)

1. [potential new and recap](#newandrecap)

In [115]:
newKeep = ['new','n','NEW','N','New']
recapKeep = ['recap','r','Recap','R']
sighting = ['sighting','sighted','heard','sighted ', 'missed']

Now let's identify other values to include in each list.

<a id = 'newandrecap'></a>
#### potential new and recap

[Back to: Top](#Table-of-Contents)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#Column-by-Column-Cleaning)

[Back to: new.recap](#new.recap)

Let's identify cases that may be classified as recaptures or new captures.  We'll start with new captures.

In [116]:
idx_potnew = df['new.recap'].str.contains('n',case=False)==True
print([x for x in df.loc[idx_potnew,'new.recap'].value_counts(dropna=False).index])
df.loc[idx_potnew,'new.recap'].value_counts(dropna=False)

['NEW', 'new', 'sighting', 'N', 'not caught', 'new ']


NEW           76
new           73
sighting       4
N              1
not caught     1
new            1
Name: new.recap, dtype: int64

'NEW', 'New', 'new ', 'new', and 'N' are certainly new captures and should be converted to 'N'. For now we will add them to the *newKeep* list.

In [117]:
newKeep = list(set(newKeep + ['NEW', 'New', 'new ', 'new', 'N']))
newKeep

['new ', 'new', 'N', 'New', 'NEW', 'n']

'sighting', '?sighting', "didn't catch", 'sighing', and 'not caught' should be sightings, so we will add them to the *sighting* list now.

In [118]:
sighting = list(set(sighting + ['sighting', '?sighting', 'didn\'t catch', 'sighing', 'not caught']))
sighting

['sighing',
 'sighted',
 '?sighting',
 'sighted ',
 'sighting',
 "didn't catch",
 'missed',
 'heard',
 'not caught']

Now we take a similar approach for recaptures.

In [119]:
idx_potrec = df['new.recap'].str.contains('r',case=False)==True
print([x for x in df.loc[idx_potrec,'new.recap'].value_counts(dropna=False).index])
df.loc[idx_potrec,'new.recap'].value_counts(dropna=False)

['recap', 'reecap']


recap     137
reecap      1
Name: new.recap, dtype: int64

'recap', 'r', 'recap ', 'reecap', and 'recapq' are certainly recaptures and should be converted to 'R'. For now we will add them to the *recapKeep* list.

In [120]:
recapKeep = list(set(recapKeep + ['recap', 'r', 'recap ',  'reecap', 'recapq']))
recapKeep

['recapq', 'recap', 'r', 'R', 'recap ', 'reecap', 'Recap']

'recap?', 'recap/new', 'recap? - toes suggest a NEW mark','recap ?', 'r?' require inspection.

In [121]:
idx_recaprev = df['new.recap']\
.isin(['recap?','recap/new', 'recap ?', 'recap? - toes suggest a NEW mark', 'r?'])
df.loc[idx_recaprev]

Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,description,action,pattern_b,replacement


'visual recapture' and 'heard' should be reclassified as 'sighting', so we will add them to the *sighting* list.

In [122]:
sighting = list(set(sighting + ['visual recapture', 'heard']))
sighting

['sighing',
 'heard',
 'sighted',
 'visual recapture',
 '?sighting',
 'sighted ',
 'sighting',
 'missed',
 "didn't catch",
 'not caught']

Now we assign 'N' to confirmed new captures.

In [123]:
idx_new = df['new.recap'].isin(newKeep)
df.loc[idx_new,'new.recap'] = 'N'
df.loc[idx_new,'new.recap'].value_counts(dropna = False)

N    151
Name: new.recap, dtype: int64

Now we assign 'R' to confirmed recaptures.

In [124]:
idx_recap = df['new.recap'].isin(recapKeep)
df.loc[idx_recap,'new.recap'] = 'R'
df.loc[idx_recap,'new.recap'].value_counts(dropna = False)

R    138
Name: new.recap, dtype: int64

Now we assign 'S' to confirmed sightings.

In [125]:
idx_sighting = df['new.recap'].isin(sighting)
df.loc[idx_sighting,'new.recap'] = 'S'
df.loc[idx_sighting,'new.recap'].value_counts(dropna = False)

S    6
Name: new.recap, dtype: int64

The remaining *new.recap* values need inspection.

In [126]:
print(df.loc[~df['new.recap'].isin(['R','N','S']),'new.recap'].unique())
idx_2check = ~df['new.recap'].isin(['R','N','S'])
df.loc[idx_2check,'new.recap'].value_counts()

[nan]


Series([], Name: new.recap, dtype: int64)

In [127]:
print("There are {} entries which need to be checked and classfied."\
      .format(df.loc[(idx_2check) & (df['new.recap'].notna()) ].shape[0]))
df.loc[(idx_2check) & (df['new.recap'].notna())]

There are 0 entries which need to be checked and classfied.


Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,description,action,pattern_b,replacement
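
The list-based recoding above can be summarized as one lookup table applied after trimming whitespace and lowercasing. This is a minimal sketch on toy values with an illustrative subset of spellings. The names `groups`, `lookup`, and `needs_review` are our own and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Raw spellings grouped by the code they should receive (illustrative subset).
groups = {
    'N': ['new', 'n'],
    'R': ['recap', 'r', 'reecap'],
    'S': ['sighting', 'sighted', 'heard', 'missed', 'not caught'],
}
# Invert to a lookup table: raw spelling -> code.
lookup = {raw: code for code, raws in groups.items() for raw in raws}

# Toy values only; these are not rows from the data set.
toy = pd.Series(['NEW', 'new ', 'Recap', 'reecap', 'not caught', 'recap?', None])
normalized = toy.str.strip().str.lower()
recoded = normalized.map(lookup)

# Values that are present but matched no list still need manual review.
needs_review = toy[recoded.isna() & toy.notna()]
print(recoded.tolist())
print(needs_review.tolist())
```

Normalizing case and whitespace first keeps the lists short, since 'NEW', 'New', and 'new ' all collapse to 'new'.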


### date
[Back to: Top](#Table-of-Contents)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#Column-by-Column-Cleaning)

We need to handle the date data which do not conform to a typical date format (e.g., 2010 data which contain Roman numerals).

In [128]:
idx_romdate = (df.date.notna())&(df.date.str.contains('i|v|x'))
df.loc[idx_romdate,'date'].value_counts(dropna = False)

27vii2010    1
Name: date, dtype: int64

Now we need to replace the Roman-numeral month in each of these date strings with the corresponding month number surrounded by hyphens, and then parse the result as a date.

In [129]:
df.loc[idx_romdate,'date'] = pd.to_datetime(df.loc[idx_romdate,'date'].apply(lambda x: rom2arab(x))).dt.date
# .apply(lambda x: pd.to_datetime(rom2arab(x),errors='ignore'))
df.loc[idx_romdate,'date'].head()

297    2010-07-27
Name: date, dtype: object
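
For readers without the Functions section at hand, the conversion can be sketched as follows. `roman_date_to_hyphenated` is a hypothetical stand-in written for illustration; it is not the notebook's *rom2arab*, and it assumes dates of the form day, Roman month, four-digit year (e.g. '27vii2010').

```python
import re
import pandas as pd

# Roman numerals for the twelve months (lowercase).
ROMAN_MONTHS = {'i': 1, 'ii': 2, 'iii': 3, 'iv': 4, 'v': 5, 'vi': 6,
                'vii': 7, 'viii': 8, 'ix': 9, 'x': 10, 'xi': 11, 'xii': 12}

def roman_date_to_hyphenated(text):
    """Turn '27vii2010' into '27-07-2010'; return other strings unchanged."""
    match = re.fullmatch(r'(\d{1,2})([ivx]+)(\d{4})', text.strip().lower())
    if match is None or match.group(2) not in ROMAN_MONTHS:
        return text  # leave anything unexpected untouched
    day, year = match.group(1), match.group(3)
    month = ROMAN_MONTHS[match.group(2)]
    return '{}-{:02d}-{}'.format(day, month, year)

converted = roman_date_to_hyphenated('27vii2010')
parsed = pd.to_datetime(converted, format='%d-%m-%Y').date()
print(converted, parsed)
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first parsing.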

This is what the other values look like now.

In [130]:
df.loc[~idx_romdate,'date'].head()

0    2002-03-19 00:00:00
1    2002-07-19 00:00:00
2    2004-07-02 00:00:00
3    2004-07-02 00:00:00
4    2004-07-02 00:00:00
Name: date, dtype: object

<a id= 'resumehere'></a>

[Top](#Table-of-Contents)

## Correcting Class of Columns
[Top](#Table-of-Contents)

[Top Cleaning](#CleaningData)

### Convert integer columns to int

In [131]:
intCols = ['meters']
df[intCols]=df[intCols].astype(int,errors='ignore')

### Convert numeric columns to numeric

In [132]:
numCols = ['svl','tl','rtl','mass']
df[numCols]=df[numCols].apply(pd.to_numeric,errors='coerce')
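
With `errors='coerce'`, any value that cannot be parsed as a number becomes NaN instead of raising an error. A small sketch on toy values (not rows from the data set) shows the effect:

```python
import pandas as pd

# Toy values only; 'NA' and '?' stand in for unparseable entries.
toy = pd.DataFrame({'svl': ['52', '58.5', 'NA'], 'mass': ['5.0', '?', '4.6']})
numeric = toy.apply(pd.to_numeric, errors='coerce')
print(numeric.dtypes)
print(numeric.isna().sum())
```

Counting NaN values before and after the conversion is a quick way to see how many entries were coerced.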

### Convert string columns to str

In [133]:
strCols = ['toes','sex','species','vial']
df[strCols]=df[strCols].astype(str, errors='ignore')

### Convert date to datetime

In [134]:
df.loc[df.date=="NA"]=np.nan
df.date = pd.to_datetime(df.date,errors='ignore')

In [135]:
print("\nAfter applying the above changes, the data types are as follows:\n{}".format(df.dtypes))


After applying the above changes, the data types are as follows:
species                object
toes_orig              object
sex                    object
date           datetime64[ns]
svl                   float64
tl                    float64
rtl                   float64
autotomized            object
mass                  float64
location               object
meters                 object
new.recap              object
painted                object
sighting              float64
paint.mark             object
vial                   object
misc                   object
rtl_orig               object
toes                   object
toe_pattern            object
description            object
action                 object
pattern_b              object
replacement            object
dtype: object


<a id='AddVar1'></a>

## Adding variables [*year*](#year) and [*rtl_orig*](#rtlorig)

<a id='year'></a>

### Year
[Back to: Top](#Table-of-Contents)

[Back to: Adding variables](#AddVar1)

We will use data contained in the *date* column to create the variable *year*. To do this we will use a small function, [*myint*](#myint), to convert year to an int type.

Now we apply [*myint*](#myint) to the 'date' column to create the variable year and inspect the unique values.

In [136]:
df['year'] = df.date.dt.year.apply(myint,verbose=False)
df.year.value_counts(dropna=False)

2006    156
2011     62
2014     36
2004     36
2007      5
2002      2
2015      1
2003      1
2010      1
2008      1
Name: year, dtype: int64

Let's inspect the entries with 'nan' values.  Note these 'nan' values are string values and not NaN.

In [137]:
df.loc[df.year=='nan',:]

Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,description,action,pattern_b,replacement,year


In [138]:
idx_nntoes = (df.year=='nan')&(df.toes_orig.notna())&(df.toes_orig!='nan')
print("{} of these entries have non-null values in the originl toes column."\
      .format(df.loc[idx_nntoes].shape[0]))

0 of these entries have non-null values in the originl toes column.


In [139]:
df.loc[idx_nntoes]

Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,description,action,pattern_b,replacement,year


Many of these appear to be from 2016/17, based on the toe vial numbers. One may be from 2018, but we will need to confirm this and determine whether any of these have sufficient information to keep them in the data set. For now we will drop them.

In [140]:
df.loc[idx_nntoes] = None
df.loc[idx_nntoes]

Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,description,action,pattern_b,replacement,year


# Adding New Columns
[Top](#Table-of-Contents)

We need to add new columns which we will use later in analyses:
- [TL_SVL](#TL_SVL)
- [Mass_SVL](#Mass_SVL)
- [Lizard Number](#Lizard-Number)
     - [assign lizard numbers](#Assign) 
     - [QC the lizard numbers](#QcLizNum) 
- [Days Since Capture](#daysSinceCapture)
- [Number of Captures](#capture)

## TL_SVL 
[Top](#Table-of-Contents)

[Top Adding New Columns](#Adding-New-Columns)



In [141]:
df['tl_svl']=(df.tl/df.svl)

## Mass_SVL
[Top](#Table-of-Contents)

[Top Adding New Columns](#Adding-New-Columns)



In [142]:
df['mass_svl']=(df.mass/df.svl)

## Lizard Number
[Top](#Table-of-Contents)

[Top Adding New Columns](#Adding-New-Columns)

Here we use a set of functions to:
 - [Assign Lizard Numbers](#Assign-Lizard-Numbers) to unique individuals (we repeat this step to ensure we have assigned all animals a number) and 
 - [QC the Numbers](#QC-the-Numbers) assigned.

<a id='Assign'></a>

### Assign Lizard Numbers
[Top](#Table-of-Contents)

[Top Adding New Columns](#Adding-New-Columns)

We make a first attempt at assigning lizard numbers.  We use the *lizsort* function to identify the subset of rows from the original dataset which have sufficient information to allow us to make an automated decision about the uniqueness of the individuals identified in those rows.  We name that df *sortable*.  The unsortable data are saved to a path as a file, *unsortable.csv*.  

In [143]:
sortable = lizsort(df, path = sourceinterDataBig)  


There were 8 entries for which values for one of the critical criteria, (['species', 'toes', 'sex', 'date', 'svl']), were null.      These entries could not be evaluated and were written out to the file unsortable.csv for evaluation.


Next we call the *mindate* function on *sortable*. This identifies the earliest date on which each unique combination of *sortCriteria* was recorded and stores it in a new column, *initialCaptureDate*. The default *sortCriteria* are the variables *species*, *toes*, and *sex*. The function also calculates and adds a column for *year_diff*, the difference in years between the initial capture date and the date value in a given row.

In [144]:
sortable = mindate(sortable)
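
The idea behind this step can be sketched with a grouped transform. This is not the code of *mindate* itself, which lives in the `liz_number` module; the toy rows are made up, and we assume *year_diff* is a difference of calendar years.

```python
import pandas as pd

# Toy rows only; these are not rows from the data set.
toy = pd.DataFrame({
    'species': ['v', 'v', 'j'],
    'toes': ['3-13-16', '3-13-16', '12-13'],
    'sex': ['m', 'm', 'f'],
    'date': pd.to_datetime(['2006-05-20', '2004-07-02', '2004-07-02']),
})
sort_criteria = ['species', 'toes', 'sex']

# Earliest date per group, broadcast back to every row of that group.
toy['initialCaptureDate'] = toy.groupby(sort_criteria)['date'].transform('min')
# Calendar-year difference between this capture and the first one (assumed definition).
toy['year_diff'] = toy['date'].dt.year - toy['initialCaptureDate'].dt.year
print(toy[['date', 'initialCaptureDate', 'year_diff']])
```

Using `transform` rather than an aggregate keeps one value per original row, so the new column can be assigned directly.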

Next we call the function *smallest*, which is analogous to *mindate*, but groups data in *sortable* into unique combinations of *species*, *toes*, *sex*, and *initialCaptureDate* before assigning the smallest SVL value recorded for each group to a new column for that group, *smallest_svl*. *smallest* then calculates a new column, *svl_diff*, which is analogous to *year_diff*.

In [145]:
sortable = smallest(sortable)

Next we call the *validate* function on *sortable*, which applies a series of validation tests to the data, sequentially numbers unique combinations of *sortCriteria* and returns a dict containing uniquely numbered individuals and summary data.

In [146]:
tmp_sort = validate(sortable)
df_numbered1 = tmp_sort['val_data']


Of those entries we can handle, there are 270 individuals as defined by ['species', 'toes', 'sex', 'initialCaptureDate', 'smallest_svl'] which pass validation based    on ['year_diff <= 7', 'svl_diff >= -2'] and 6 rows which do not pass validation.
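
The rules quoted in the message above (`year_diff <= 7`, `svl_diff >= -2`) can be illustrated with a small sketch. This is our own approximation on toy rows, not the implementation of *validate*; in particular, the grouping columns and the numbering order used here are assumptions.

```python
import pandas as pd

# Toy rows only; these are not rows from the data set.
toy = pd.DataFrame({
    'species': ['v', 'v', 'v', 'j'],
    'toes': ['4-6-11', '4-6-11', '4-6-11', '2-8-13'],
    'sex': ['m', 'm', 'm', 'f'],
    'year_diff': [0, 2, 10, 0],
    'svl_diff': [0.0, 1.0, 2.0, 0.0],
})

# A row passes when both rules hold.
passes = (toy['year_diff'] <= 7) & (toy['svl_diff'] >= -2)
validated = toy[passes].copy()
unvalidated = toy[~passes].copy()

# Number each unique combination of the grouping columns, starting at 1,
# in order of first appearance (sort=False).
validated['liznumber'] = (validated
                          .groupby(['species', 'toes', 'sex'], sort=False)
                          .ngroup() + 1)
print(validated)
print(unvalidated)
```

Rows that fail a rule are set aside rather than dropped, which is what makes a second pass over them possible.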


### Second attempt to assign lizard numbers

[Top](#Table-of-Contents)

[Top Adding New Columns](#Adding-New-Columns)

Here we make a second attempt at assigning lizard numbers to ensure that all lizards have been assigned. This second attempt is focused on those rows which were left unvalidated during the first attempt (*n_val_data*). Since these are already a subset of those data which were sortable, we need only call the *mindate*, *smallest*, and *validate* functions.

In [147]:
n_val = mindate(tmp_sort['n_val_data'])
n_val = smallest(n_val)
df_numbered2 = validate(n_val)['val_data']


Of those entries we can handle, there are 6 individuals as defined by ['species', 'toes', 'sex', 'initialCaptureDate', 'smallest_svl'] which pass validation based    on ['year_diff <= 7', 'svl_diff >= -2'] and 0 rows which do not pass validation.


Since no rows remain unvalidated, we will not attempt a third validation. We will simply append *df_numbered2* to *df_numbered1* to create *df_numbered*, our full numbered dataset.

In [148]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df_numbered = pd.concat([df_numbered1,df_numbered2],ignore_index=True,sort=False)
print("df:{}\ndf_numbered1:{}\ndf_numbered2:{}\ndf_numbered:{}".format(df.shape,df_numbered1.shape,df_numbered2.shape,
                                                               df_numbered.shape))
df_numbered.head()

df:(301, 27)
df_numbered1:(293, 32)
df_numbered2:(6, 32)
df_numbered:(299, 32)


Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,description,action,pattern_b,replacement,year,tl_svl,mass_svl,initialCaptureDate,year_diff,smallest_svl,svl_diff,liznumber
0,v,8-11-20,m,2002-03-19,52.0,74.0,0.0,False,5.0,5m ^ CC/CCC on rt side,245,R,"old paint, repainted over",,y20b,,"not legible, however",0,8-11-20,2,".( {,}\w{,} {1,}).",replace,,-,2002,1.423077,0.096154,2002-03-19,0,52.0,0.0,243
1,v,8-12-20,f,2002-07-19,58.0,71.0,0.0,False,5.7,sb at R outcrop at 425 on rt side,425,R,gravid,,w43a,,,0,8-12-20,2,".( {,}\w{,} {1,}).",replace,,-,2002,1.224138,0.098276,2002-07-19,0,58.0,0.0,249
2,v,3 13 16,m,2004-07-02,54.0,51.0,38.0,True,5.2,sb 10m v falls,-10,R,painted,,wTb,,not shed since- paint still visible,38,3-13-16,2,".( {,}\w{,} {1,}).",replace,,-,2004,0.944444,0.096296,2004-07-02,0,54.0,0.0,178
3,v,4 6 11,m,2004-07-02,49.0,66.0,0.0,False,4.5,see wAa,45,N,painted,,wAc,04 03,,0,4-6-11,2,".( {,}\w{,} {1,}).",replace,,-,2004,1.346939,0.091837,2004-07-02,0,49.0,0.0,194
4,v,4 6 12,m,2004-07-02,45.0,62.0,0.0,False,3.3,3 m above cave trail on rt,48,N,painted,,wAb,04 04,,0,4-6-12,2,".( {,}\w{,} {1,}).",replace,,-,2004,1.377778,0.073333,2004-07-02,0,45.0,0.0,195


<a id='QcLizNum'></a>

### QC Lizard Numbers
[Top](#Table-of-Contents)

[Top Adding New Columns](#Adding-New-Columns)

First we display the output data frame.

In [149]:
df_numbered

Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,description,action,pattern_b,replacement,year,tl_svl,mass_svl,initialCaptureDate,year_diff,smallest_svl,svl_diff,liznumber
0,v,8-11-20,m,2002-03-19,52.0,74.0,0.0,False,5.0,5m ^ CC/CCC on rt side,245.0,R,"old paint, repainted over",,y20b,,"not legible, however",0,8-11-20,2,".( {,}\w{,} {1,}).",replace,,-,2002,1.423077,0.096154,2002-03-19,0,52.0,0.0,243
1,v,8-12-20,f,2002-07-19,58.0,71.0,0.0,False,5.7,sb at R outcrop at 425 on rt side,425.0,R,gravid,,w43a,,,0,8-12-20,2,".( {,}\w{,} {1,}).",replace,,-,2002,1.224138,0.098276,2002-07-19,0,58.0,0.0,249
2,v,3 13 16,m,2004-07-02,54.0,51.0,38.0,True,5.2,sb 10m v falls,-10.0,R,painted,,wTb,,not shed since- paint still visible,38,3-13-16,2,".( {,}\w{,} {1,}).",replace,,-,2004,0.944444,0.096296,2004-07-02,0,54.0,0.0,178
3,v,4 6 11,m,2004-07-02,49.0,66.0,0.0,False,4.5,see wAa,45.0,N,painted,,wAc,04 03,,0,4-6-11,2,".( {,}\w{,} {1,}).",replace,,-,2004,1.346939,0.091837,2004-07-02,0,49.0,0.0,194
4,v,4 6 12,m,2004-07-02,45.0,62.0,0.0,False,3.3,3 m above cave trail on rt,48.0,N,painted,,wAb,04 04,,0,4-6-12,2,".( {,}\w{,} {1,}).",replace,,-,2004,1.377778,0.073333,2004-07-02,0,45.0,0.0,195
5,v,4 6 13,f,2004-07-02,57.0,68.0,0.0,False,7.6,on pine at top of site,455.0,N,painted,,w1a,04 08,,0,4-6-13,2,".( {,}\w{,} {1,}).",replace,,-,2004,1.192982,0.133333,2004-07-02,0,57.0,0.0,196
6,j,12 13,f,2004-07-02,66.0,68.0,11.0,True,10.2,wall @ pool,380.0,R,painted,,wCb,,toes may be natural loss; T slight ss,11,12-13,2,".( {,}\w{,} {1,}).",replace,,-,2004,1.030303,0.154545,2004-07-02,0,66.0,0.0,10
7,other,4 11,f,2004-07-02,52.0,46.0,37.0,True,4.6,wall left Lizard rock,133.0,R,painted,,w=a,,gravid; shed since last capture,37,4-11,2,".( {,}\w{,} {1,}).",replace,,-,2004,0.884615,0.088462,2004-07-02,0,52.0,0.0,133
8,other,1 6,f,2004-07-03,54.0,77.0,0.0,False,5.5,tree opposite left curve wall,27.0,N,painted,,wTb,04 13,few parasites,0,1-6,2,".( {,}\w{,} {1,}).",replace,,-,2004,1.425926,0.101852,2004-07-03,0,54.0,0.0,130
9,v,10 17,f,2004-07-03,56.0,67.0,0.0,False,6.7,sb at rt curved wall,27.0,R,painted,,w1b,,gravid; few mites,0,10-17,2,".( {,}\w{,} {1,}).",replace,,-,2004,1.196429,0.119643,2004-07-03,0,56.0,0.0,172


Identify individuals that are missing any of the critical values, but have a lizard number assigned.

In [150]:
idx_critical_na = (df_numbered.toes_orig.isna())|(df_numbered.species.isna())|(df_numbered.date.isna())
print("There are {} rows that fit have na values in any category critical for liznumber generation.\n{}"\
      .format(df_numbered.loc[(idx_critical_na)].shape[0],df_numbered.loc[(idx_critical_na),
                                                                          ['toes_orig','toes','species','date']]))

There are 0 rows that fit have na values in any category critical for liznumber generation.
Empty DataFrame
Columns: [toes_orig, toes, species, date]
Index: []


Identify individuals that have the same species and toes but different sex, and flag them for review.

In [151]:
df_numbered = df_numbered.merge(df_numbered.groupby(['species','toes']).sex.nunique().reset_index()\
                       .rename(columns = {'sex':'sex_count'}),how = 'inner', on = ['species','toes'])
df_numbered.loc[df_numbered.sex_count>1,:].to_csv('entries flagged with same species and toes diff sex.csv')
print("{} rows have the same species and toes but different values for sex"\
      .format(df_numbered.loc[df_numbered.sex_count>1,:].shape[0]))
df_numbered.head()

19 rows have the same species and toes but different values for sex


Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,description,action,pattern_b,replacement,year,tl_svl,mass_svl,initialCaptureDate,year_diff,smallest_svl,svl_diff,liznumber,sex_count
0,v,8-11-20,m,2002-03-19,52.0,74.0,0.0,False,5.0,5m ^ CC/CCC on rt side,245,R,"old paint, repainted over",,y20b,,"not legible, however",0,8-11-20,2,".( {,}\w{,} {1,}).",replace,,-,2002,1.423077,0.096154,2002-03-19,0,52.0,0.0,243,1
1,v,8-12-20,f,2002-07-19,58.0,71.0,0.0,False,5.7,sb at R outcrop at 425 on rt side,425,R,gravid,,w43a,,,0,8-12-20,2,".( {,}\w{,} {1,}).",replace,,-,2002,1.224138,0.098276,2002-07-19,0,58.0,0.0,249,1
2,v,3 13 16,m,2004-07-02,54.0,51.0,38.0,True,5.2,sb 10m v falls,-10,R,painted,,wTb,,not shed since- paint still visible,38,3-13-16,2,".( {,}\w{,} {1,}).",replace,,-,2004,0.944444,0.096296,2004-07-02,0,54.0,0.0,178,1
3,v,3 13 16,m,2006-05-20,55.0,49.0,38.0,True,5.5,left sb 5m v 1falls,-5,R,painted,,w1b,,wound mid dorsal rt side,38,3-13-16,2,".( {,}\w{,} {1,}).",replace,,-,2006,0.890909,0.1,2004-07-02,2,54.0,1.0,178,1
4,v,4 6 11,m,2004-07-02,49.0,66.0,0.0,False,4.5,see wAa,45,N,painted,,wAc,04 03,,0,4-6-11,2,".( {,}\w{,} {1,}).",replace,,-,2004,1.346939,0.091837,2004-07-02,0,49.0,0.0,194,2
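
The same flag can be computed without a merge by broadcasting the per-group count back onto the rows. This is a minimal sketch on toy rows, offered as an alternative formulation rather than a replacement for the cell above.

```python
import pandas as pd

# Toy rows only; these are not rows from the data set.
toy = pd.DataFrame({
    'species': ['v', 'v', 'v', 'j'],
    'toes': ['4-6-11', '4-6-11', '3-13-16', '12-13'],
    'sex': ['m', 'f', 'm', 'f'],
})
# Count distinct sex values per (species, toes) and broadcast to each row.
toy['sex_count'] = toy.groupby(['species', 'toes'])['sex'].transform('nunique')
flagged = toy[toy['sex_count'] > 1]
print(flagged)
```

Because `transform` preserves the original index, the row order of the data frame is unchanged by this step.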


In [152]:
print("Lizard Numbers in the sample range from {} to {}."\
      .format(df_numbered.liznumber.min(),df_numbered.liznumber.max()))

Lizard Numbers in the sample range from 1 to 270.


In [153]:
# range() excludes its stop value, so add 1 to include the largest liznumber
possibleLizNum = set(range(int(df_numbered.liznumber.min()),int(df_numbered.liznumber.max())+1))
actualLizNum = set(pd.Series(df_numbered.liznumber.unique()).dropna().apply(int))
print("\nThere are {} entries.  There are {} unique lizard numbers.\
\n\nThe liznumber ranges from {} to {}."\
  .format(df_numbered.shape[0],len(df_numbered.liznumber.unique())\
          ,df_numbered.liznumber.min(),df_numbered.liznumber.max()))

missingLizNum = possibleLizNum - actualLizNum
if len(missingLizNum)>0:
    print("\n\nThe following numbers are not assigned to a lizard:\n{}"\
      .format(missingLizNum))
else:
    print("\n\nThere are no numbers which were not assigned.")


There are 299 entries.  There are 270 unique lizard numbers.

The liznumber ranges from 1 to 270.


There are no numbers which were not assigned.
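
The gap check above boils down to a set difference between the expected and the assigned numbers. A small standalone sketch, where `missing_numbers` is a hypothetical helper name of our own:

```python
def missing_numbers(assigned):
    """Return the integers between min and max (inclusive) that are not in `assigned`."""
    assigned = set(int(n) for n in assigned)
    if not assigned:
        return set()
    # range() excludes its stop value, so add 1 to include the maximum.
    expected = set(range(min(assigned), max(assigned) + 1))
    return expected - assigned

gaps = missing_numbers([1, 2, 4, 7])
print(sorted(gaps))
```

An empty result means the numbering is contiguous between its smallest and largest value.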


<a id='daysSinceCapture'></a>

### Days Since Capture
[Top](#Table-of-Contents)

[Top Adding New Columns](#Adding-New-Columns)

*daysSinceCapture* identifies the number of days between an entry's date and the date on which the animal was first captured (*initialCaptureDate*).

In [154]:
df_numbered.loc[:,'daysSinceCapture'] = (df_numbered.date - df_numbered.initialCaptureDate).dt.days
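
As a quick check of the arithmetic, subtracting two datetime columns gives a timedelta column, and `.dt.days` extracts whole days. The two dates below are toy inputs chosen to match a first capture on 2004-07-02 and a later capture on 2014-07-05.

```python
import pandas as pd

toy = pd.DataFrame({
    'date': pd.to_datetime(['2004-07-02', '2014-07-05']),
    'initialCaptureDate': pd.to_datetime(['2004-07-02', '2004-07-02']),
})
# Timedelta between each capture and the first capture, in whole days.
toy['daysSinceCapture'] = (toy['date'] - toy['initialCaptureDate']).dt.days
print(toy)
```

The first capture of an animal therefore always has a value of 0.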

<a id='capture'></a>

### Capture Number
[Top](#Table-of-Contents)

[Top Adding New Columns](#Adding-New-Columns)

*capture* identifies the number of times an animal has been captured up to and including a given entry.
We will also need to perform a [QC of Capture Number and Recap Status](#QC-of-Capture-Number-and-Recap-Status).

In [155]:
# Needs QC: this seems to produce several cases in which individuals recorded as
# recaptures have only one capture
df_numbered['capture'] = df_numbered.sort_values(['liznumber','date'])\
.groupby(['liznumber']).daysSinceCapture.cumcount()+1

In [156]:
print(df_numbered.loc[df_numbered.species.isin(['j','v'])].groupby('capture').capture.count())

capture
1    238
2     26
3      3
Name: capture, dtype: int64
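
To see how the capture number is built, here is a minimal sketch on toy rows that are deliberately out of date order. The lizard numbers and dates are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    'liznumber': [194, 178, 194, 178, 10],
    'date': pd.to_datetime(['2014-07-05', '2006-05-20', '2004-07-02',
                            '2004-07-02', '2004-07-02']),
})
# Sort so each lizard's rows are in date order, then count within each lizard.
# The result keeps the original index, so assignment realigns it to the rows.
toy['capture'] = (toy.sort_values(['liznumber', 'date'])
                     .groupby('liznumber')
                     .cumcount() + 1)
print(toy)
```

`cumcount` starts at 0, which is why 1 is added to make the first capture number 1.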


Let's QC these numbers by looking at the distribution of the number of captures.

In [157]:
Sj = go.Histogram(x = df_numbered.loc[(df_numbered.species=='j')].groupby('liznumber').capture.max(),name='Sj')
Sv = go.Histogram(x = df_numbered.loc[(df_numbered.species=='v')].groupby('liznumber').capture.max(),name='Sv')

data = [Sj,Sv]
layout = go.Layout(
    title = 'Histogram of Maximum Number of Captures for CC 2000-2017',
    titlefont = dict(
        size = 20),
    xaxis = dict(
        dtick = 1,
        title = 'Maximum Number of Captures',
        titlefont = dict(
            size = 18)),
    yaxis = dict(
        title = 'Number of Unique Lizards',
        titlefont = dict(
            size = 18)))

fig = go.Figure(
        data = data,
        layout = layout)
py.iplot(fig, filename = 'Histogram of Maximum Number of Captures for CC 2000-2017')

Now we'll display all data points associated with lizards who have more than 8 captures reported.

In [158]:
threshold = 8
immortals = df_numbered.loc[df_numbered.capture>threshold].liznumber.unique()
df_immortals = df_numbered.loc[df_numbered.liznumber.isin(immortals)]
print("There are {} lizards with greater than {} captures.  They account for {} rows of data.\n"\
      .format(df_immortals.liznumber.nunique(),threshold,df_immortals.shape[0]))
if df_immortals.shape[0]<threshold:
    print("We will continue our analysis.")
else:
    df_immortals = df_immortals.loc[:,[ 'liznumber','capture','toe_pattern','species', 'toes_orig', 'toes',
                                       'sex', 'date', 'svl_diff', 'tl', 'rtl_orig',
                                       'meters', 'new.recap','sighting', 'vial', 'misc',
                                       'initialCaptureDate','year_diff', 'sex_count',
                                       'daysSinceCapture','year']].sort_values(['liznumber','year'])
    print("\nTheir toe_patterns are distributed as follows:\n{}"\
          .format(df_immortals.toe_pattern.value_counts(dropna=False)))
    print("\nThe new toes for those lizards are distributed as follows:\n{}"
          .format(df_immortals.groupby('toe_pattern').toes.value_counts(dropna=False)))
    print("\nThose with NaN values assigned as toe_pattern have the following original toes:\n{}"\
          .format(df_immortals.loc[df_immortals.toe_pattern.isna()].toes_orig.value_counts(dropna=False)))
    # print(df_immortals.loc[df_immortals.toe_pattern.isna()].liznumber.value_counts(dropna=False))
    df_immortals.loc[df_immortals.toe_pattern.isna()]

There are 0 lizards with greater than 8 captures.  They account for 0 rows of data.

We will continue our analysis.


## We can see that the liz number generation failed since lizards with no toes have been assigned a number

We will export these for further inspection using the *[exportliz](#exportliz)* function.

In [159]:
exportliz(df_immortals,iterator='liznumber')

<a id='yearstoolarge'></a>

### years too large
[Top](#Table-of-Contents)

In [160]:
yeartoomuch = df_numbered.loc[df_numbered.year_diff>=5,'liznumber']
checkyears = df_numbered.loc[df_numbered.liznumber.isin(yeartoomuch)].sort_values(['liznumber'])
checkyears.to_csv('check years.csv')
checkyears

Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,description,action,pattern_b,replacement,year,tl_svl,mass_svl,initialCaptureDate,year_diff,smallest_svl,svl_diff,liznumber,sex_count,daysSinceCapture,capture
204,j,2 8 13,f,2006-05-24,80.0,106.0,0.0,False,19.3,wall btwn H3/H4,185,R,painted,,w43b,,gravid; only three toes cut nothing missing on LHL!,0,2-8-13,2,".( {,}\w{,} {1,}).",replace,,-,2006,1.325,0.24125,2006-05-24,0,70.0,10.0,15,1,0,1
205,j,2 - 8 - 13,f,2011-06-21,70.0,73.0,54.0,True,10.4,stream bed 2m v H5,198,R,yes,,g26b,,,54,2-8-13,1,".( {1,}-.|.- {1,}.)",replace,-,-,2011,1.042857,0.148571,2006-05-24,5,70.0,0.0,15,1,1854,2
4,v,4 6 11,m,2004-07-02,49.0,66.0,0.0,False,4.5,see wAa,45,N,painted,,wAc,04 03,,0,4-6-11,2,".( {,}\w{,} {1,}).",replace,,-,2004,1.346939,0.091837,2004-07-02,0,49.0,0.0,194,2,0,1
6,v,4 6 11,m,2014-07-05,51.0,61.0,0.0,False,4.2,mid wall below juniper crossing opp,237,N,yes,,o24a,14-35,,0,4-6-11,2,".( {,}\w{,} {1,}).",replace,,-,2014,1.196078,0.082353,2004-07-02,10,49.0,2.0,194,2,3655,2
8,v,4 6 12,m,2004-07-02,45.0,62.0,0.0,False,3.3,3 m above cave trail on rt,48,N,painted,,wAb,04 04,,0,4-6-12,2,".( {,}\w{,} {1,}).",replace,,-,2004,1.377778,0.073333,2004-07-02,0,45.0,0.0,195,1,0,1
9,v,4 6 12,m,2014-07-05,47.0,67.0,0.0,False,3.5,sb @ 15,15,N,yes,,o25a,14-36,,0,4-6-12,2,".( {,}\w{,} {1,}).",replace,,-,2014,1.425532,0.074468,2004-07-02,10,45.0,2.0,195,1,3655,2
22,v,4 6 16,m,2004-07-03,44.0,64.0,0.0,False,3.6,sb 1m v H3,176,N,painted,,w2c,04 17,few mites,0,4-6-16,2,".( {,}\w{,} {1,}).",replace,,-,2004,1.454545,0.081818,2004-07-03,0,44.0,0.0,202,1,0,1
23,v,4 6 16,m,2014-07-06,47.0,64.0,0.0,False,3.4,left SB 20 m ^ 1 falls,20,N,yes,,o29a,14-41,,0,4-6-16,2,".( {,}\w{,} {1,}).",replace,,-,2014,1.361702,0.07234,2004-07-03,10,44.0,3.0,202,1,3655,2
28,v,4 6 18,f,2004-07-03,52.0,68.0,0.0,False,4.6,1m v flat rock on left side,207,N,painted,,w3b,04 19,,0,4-6-18,2,".( {,}\w{,} {1,}).",replace,,-,2004,1.307692,0.088462,2004-07-03,0,52.0,0.0,204,1,0,1
29,v,4 6 18,f,2014-07-06,57.0,69.0,0.0,False,6.3,rt sb 5 m below L cc/ccc,235,N,yes,,o30a,14-43,orange badge,0,4-6-18,2,".( {,}\w{,} {1,}).",replace,,-,2014,1.210526,0.110526,2004-07-03,10,52.0,5.0,204,1,3655,2


In [161]:
jarrovii = go.Histogram(x = df_numbered.loc[df_numbered.species.isin(['j'])].groupby('liznumber')\
                     .year_diff.max(),name = 'S. jarrovii')
virgatus = go.Histogram(x = df_numbered.loc[df_numbered.species.isin(['v'])].groupby('liznumber')\
                     .year_diff.max(), name = 'S. virgatus')
data = [jarrovii, virgatus]
layout = go.Layout(
    title = 'Number of Individuals by Years Between First and Last Capture 2000-2017',
    titlefont = dict(
        size = 20),
    xaxis = dict(
        dtick = 1,
        title = 'Maximum Number of Years Since Initial Capture',
        titlefont = dict(
            size = 18)),
    yaxis = dict(
        title = 'Number of Lizards',
        titlefont = dict(
            size = 18))
)
fig = go.Figure(
        data = data,
        layout = layout)
py.iplot(fig, filename = 'Frequency of Captures in Crystal Creek 2000 - 2017 (by species)')

Should we drop this figure entirely? We are not sure it is useful.

In [162]:
# Freeze work on this figure until we've resolved issues with calculation based on year
# ADD HORIZONTAL LINES FOR EACH YEAR
j_lizards = go.Scatter(x = df_numbered.loc[df_numbered.species.isin(['j'])].liznumber,
                   y = df_numbered.loc[df_numbered.species.isin(['j'])]\
                      .groupby('liznumber').daysSinceCapture.max(), 
                     mode = 'markers', name='S. jarrovii')
v_lizards = go.Scatter(x = df_numbered.loc[df_numbered.species.isin(['v'])].liznumber,
                   y = df_numbered.loc[df_numbered.species.isin(['v'])]\
                      .groupby('liznumber').daysSinceCapture.max(), 
                     mode = 'markers', name='S. virgatus')
# year1 = go.Scatter(x=[df_numbered.liznumber.min(),df_numbered.liznumber.max()],y = (365))
# year2 = go.Scatter(y = 365*2)
# year3 = go.Scatter(y = 365*3)
# year4 = go.Scatter(y = 365*4)
# year5 = go.Scatter(y = 365*5)
# year6 = go.Scatter(y = 365*6)
# year7 = go.Scatter(y = 365*7)
# year8 = go.Scatter(y = 365*8)

# data = [j_lizards, v_lizards, year1, year2, year3, year4, year5, year6, year7, year8]
data = [j_lizards, v_lizards]
layout = go.Layout(
    title = 'Days Since Initial Capture in Crystal Creek 2000 - 2017',
        titlefont = dict(
            size = 20),
    xaxis = dict(
            title='Lizard Number',
            titlefont=dict(
                size=18)),
    yaxis = dict(
            title='Greatest Number of Days Since<br> Initial Capture',
            titlefont=dict(
                size=18)))

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename = 'Days Since Initial Capture in Crystal Creek 2000 - 2017')

Drop this figure as well?

In [163]:
dfF = df_numbered.loc[(df_numbered.sex =='f' )& (df_numbered.species.isin(['j','v']))]
dfM = df_numbered.loc[(df_numbered.sex =='m') & (df_numbered.species.isin(['j','v']))]

In [164]:
# Freeze work on this figure until we've resolved issues with calculation based on year
females = go.Scatter(
    x = dfF.liznumber,
    y = dfF.groupby('liznumber').daysSinceCapture.max(),
    name = 'females',
    mode = 'markers',
    marker = dict(
        color = 'rgba(152, 0, 0, .8)',
        opacity = 0.75,
        line = dict(
            width = 2,
            color = 'rgb(0, 0, 0)'
        )
    )
)

males = go.Scatter(
    x = dfM.liznumber,
    y = dfM.groupby('liznumber').daysSinceCapture.max(),
    name = 'males',
    mode = 'markers',
    marker = dict(
        color = 'rgba(255, 182, 193, .9)',
        opacity = 0.75,
        line = dict(
            width = 2,
        )
    )
)

data = [females, males]

layout = dict(title = 'Days Since Initial Capture in Crystal Creek 2000 - 2017 By Sex',
              yaxis = dict(
                  title='Greatest Number of Days Since<br> Initial Capture',
                  titlefont=dict(
                      size=18)
              ),
              xaxis = dict(zeroline = False)
             )

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='Days Since Initial Capture in Crystal Creek 2000 - 2017 By Sex')

In [165]:

# Restrict to the two focal species, then plot the number of captures per year for each sex
focal = df_numbered.species.isin(['j', 'v'])
males = go.Histogram(x=df_numbered.loc[(df_numbered.sex == 'm') & focal, 'year'],
                     opacity=0.75, name='males')
females = go.Histogram(x=df_numbered.loc[(df_numbered.sex == 'f') & focal, 'year'],
                       opacity=0.75, name='females')
data = [males, females]
py.iplot(data, filename='Distribution of Sex by Year in Crystal Creek 2000 - 2017')

In [166]:
column_order = ['liznumber','date','initialCaptureDate',]

In [167]:
df.year.value_counts(dropna=False).reset_index()

Unnamed: 0,index,year
0,2006,156
1,2011,62
2,2014,36
3,2004,36
4,2007,5
5,2002,2
6,2015,1
7,2003,1
8,2010,1
9,2008,1


<a id='QcCapture'></a>

### QC of Capture Number and Recap Status
[Top](#Table-of-Contents)

[Top Add Columns](#AddCol)

[Top Capture Number](#capture)

In [168]:
# Rows of the focal species that are labelled as recaptures although capture number is 1
recapQuestion = df_numbered.loc[(df_numbered.capture == 1)
                                & (df_numbered['new.recap'] == 'recap')
                                & (df_numbered.species.isin(['j', 'v'])), :]
print("There are {} instances in rows for which a lizard appears to have only one capture, "
      "but is listed as a recap. "
      "The distribution of these across years in the sample is as follows:\n{}."
      .format(recapQuestion.shape[0], recapQuestion.year.value_counts()))
# These individuals need to be rechecked in the raw notes
recapQuestion.to_csv("Questionable recaptures.csv")
recapQuestion.head()

There are 0 instances in rows for which a lizard appears to have only one capture, but is listed as a recap. The distribution of these across years in the sample is as follows:
Series([], Name: year, dtype: int64).


Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,description,action,pattern_b,replacement,year,tl_svl,mass_svl,initialCaptureDate,year_diff,smallest_svl,svl_diff,liznumber,sex_count,daysSinceCapture,capture
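
Because the real check returns no rows, here is a minimal sketch of the same filter on a toy frame that does contain one inconsistent record. The values are made up for illustration; only the column names `capture`, `new.recap`, `species`, `year` and `liznumber` come from the notebook, and `toy`, `mask`, `questionable` and `flagged_ids` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data; lizard 2 is marked 'recap' on its first capture
toy = pd.DataFrame({
    'liznumber': [1, 1, 2, 3],
    'capture': [1, 2, 1, 1],
    'new.recap': ['new', 'recap', 'recap', 'new'],
    'species': ['j', 'j', 'v', 'j'],
    'year': [2004, 2006, 2006, 2011],
})

# Same three conditions as the QC cell: first capture, labelled recap, focal species
mask = ((toy.capture == 1)
        & (toy['new.recap'] == 'recap')
        & toy.species.isin(['j', 'v']))
questionable = toy.loc[mask]

flagged_ids = questionable.liznumber.tolist()
print(flagged_ids)
print(questionable.year.value_counts())
```

A non-empty result like this one lists the lizards whose records would need rechecking against the raw notes.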


<a id='exportFinal'></a>

# Export Cleaned Data
[Top](#Table-of-Contents)

Now we export the cleaned data to a csv.

In [169]:
df_numbered = df_numbered.rename(index = str, columns = {'new.recap':'newRecap'})
qc_drop_cols = df_numbered.columns[df_numbered.columns.str.contains('force|drop')]
# Pass axis by keyword: the positional form drop(labels, 1) is not accepted by newer pandas
df_full = df_numbered.drop(qc_drop_cols, axis=1)
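
The pattern-based column drop can be tried on a small frame first. This is a minimal sketch, assuming hypothetical column names (`force_sex`, `drop_row`) that only illustrate the `'force|drop'` pattern; the real QC columns in `df_numbered` may be named differently.

```python
import pandas as pd

# Hypothetical toy frame with two QC helper columns and two data columns
toy = pd.DataFrame({'svl': [60], 'force_sex': ['f'], 'drop_row': [False], 'mass': [9.5]})

# Select every column whose name contains 'force' or 'drop' (regex alternation)
qc_cols = toy.columns[toy.columns.str.contains('force|drop')]

# Drop the selected columns by name
kept = toy.drop(columns=qc_cols)
kept_names = list(kept.columns)
print(kept_names)
```

Note that the pattern matches anywhere in the name, so a data column that happens to contain "drop" or "force" would be removed as well.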

In [170]:
os.chdir(outputBig)
# Current time shifted back by a fixed 4 hours, formatted like '2019-04-25 01hrs00min'
timestamp = (pd.to_datetime('now') - pd.Timedelta(hours=4)).strftime('%Y-%m-%d %Hhrs%Mmin')
filename = 'cleaned CC data 2000-2017_' + timestamp+ '.csv'
df_full.to_csv(filename,index = False)
filename

'cleaned CC data 2000-2017_2019-04-25 01hrs00min.csv'
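
The file name pattern above can be reproduced with the standard library alone. This is a minimal sketch; `timestamped_name` and its `offset_hours` parameter are hypothetical and not part of the notebook, and the fixed 4-hour shift simply mirrors the offset used in the export cell.

```python
from datetime import datetime, timedelta

def timestamped_name(prefix, now, offset_hours=4):
    """Hypothetical helper: build '<prefix><YYYY-MM-DD> <HH>hrs<MM>min.csv'."""
    # Shift the supplied time back by a fixed number of hours
    shifted = now - timedelta(hours=offset_hours)
    # strftime avoids slicing the string form of the timestamp
    return prefix + shifted.strftime('%Y-%m-%d %Hhrs%Mmin') + '.csv'

# A fixed input time makes the result reproducible
example = timestamped_name('cleaned CC data 2000-2017_', datetime(2019, 4, 25, 5, 0, 12))
print(example)
```

Passing the time in as an argument, rather than reading the clock inside the function, makes the naming easy to test.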