# Cleaning CC data

This python notebook operates on a csv created after editing in open refine and is designed to finish cleaning columns of interest which were easier to clean in python.

# [Resume Here](#resumehere)

<a id='TOC'></a>

# Table of Contents

1. [Outstanding Problems](#outstanding)

1. [Setting up Python](#SettingUp)
    
    1. [Setting the Location](#SettingLoc)
    
    2. [Importing Data](#ImportingData)
    
    3. [Preparing for a Save](#PreparingSave)
    
4. [Functions](#Functions)
    
2. [Inspecting the Data](#InspectingData)
3. [Cleaning Data](#CleaningData)
    1. [Column-by-Column Cleaning](#ColbyCol)
        1. [rtl](#rtl)
        2. [tl](#tl)
        3. [svl](#svl)
        4. [autotomized](#autotomized)
        5. [toes](#toes)
        6. [sex](#sex)
        7. [species](#species)
        7. [new.recap](#newrecap)
    2. [Correcting class of columns](#CorrectingClass)
    
4. [Adding Columns](#AddCol)

    1. [TL_SVL](#TlSvl)
    
    2. [Mass_SVL](#MassSvl)
    
    3. [Lizard Number](#LizardNumber)

5. [Export Cleaned Data](#exportFinal)

<a id='outstanding'></a>

# Outstanding Problems

1. [outstanding1](#outstanding1)
2. [outstanding2](#outstanding2)
3. [outstanding3](#outstanding3)
4. [outstanding4](#outstanding4)

<a id='SettingUp'></a>

# Setting up Python

[Top](#TOC)

Here we import necessary packages. 
This chunk may take a while.

In [1]:
import pandas as pd
import numpy as np
import os
from liz_number import lizsort,mindate,smallest,validate
from liz_toes import make_str,label_pattern, replace_pattern,report_pattern

import plotly
import plotly.plotly as py
import plotly.graph_objs as go

plotly.tools.set_config_file(world_readable=True)

# increase print limit
pd.options.display.max_rows = 99999
pd.options.display.max_columns = 50

<a id='SettingLoc'></a>

## Setting the location
[Top](#TOC)

These chunks identify the locations from which we can get data and to which we can save data.

## Source Data
Source files can be found in the following locations:

In [2]:
sourceDataPers = 'C:/Users/Christopher/Google Drive/TailDemography/Cleaned Combined Data'
sourceDataBig = 'S:/Chris/TailDemography/TailDemography/Cleaned Combined Data'
# sourceBlack = 'C:/Users/test/Desktop'

### Intermediate Source Data
Intermediate files can be found in the following locations:

In [3]:
sourceInterDataPers = 'C:/Users/Christopher/Google Drive/TailDemography/Intermediate Files/DeepCleaning'
sourceinterDataBig = 'S:/Chris/TailDemography/TailDemography/Intermediate Files/DeepCleaning'
# sourceBlack = 'C:/Users/test/Desktop'

## Output Data paths
Outputfiles can be found in the following locations:

In [4]:
outputPers = 'C:/Users/Christopher/Google Drive/TailDemography/outputFiles'
outputBig = 'S:/Chris/TailDemography/TailDemography/outputFiles'
# outputBlack = 'C:/Users/test/Desktop'

<a id='ImportingData'></a>

## Importing data
[Top](#TOC)

Here we import data from one of the available locations

In [5]:
os.chdir(sourceDataBig)
df=pd.read_csv('Appended and Trimmed CC Data 2000-2017_2019-01-07 19hrs22min.csv')
df.head()

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
0,sj,2-6-12-15,m,19vi2010,80,110,29,3.0,20.0,10m v bottom bowl,-15,?,yes,,y2c,03-10-cc,toe 15 missing at capture; possible recap
1,sj,2-9-15-17,f,13viii2010,56,77,0,1.0,5.5,20m up CCC,240,NEW,yes,,y62c,61-10-cc,Tss
2,sj,3-6-11-17,m,18viii2010,50,68,0,1.0,4.0,1m vT at top R island,157,NEW,yes,,y<c.t,,Bss; lost toes
3,sj,3-6-15-16,f,18viii2010,72,62,46,3.0,11.0,halfway between pool and 2 falls 2m up rt side,385,NEW,yes,,y65c,65-10-cc,
4,sj,3-6-12-17,m,18viii2010,57,82,0,1.0,6.0,R outcrop ^ oak R,425,new,yes,,y67c,66-10-cc,


<a id='PreparingSave'></a>

## Preparing for a save
[Top](#TOC)

Now we change the working directory so that inermediate files are saved to our preferred location.

In [6]:
os.chdir(sourceDataBig)
df.head()

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
0,sj,2-6-12-15,m,19vi2010,80,110,29,3.0,20.0,10m v bottom bowl,-15,?,yes,,y2c,03-10-cc,toe 15 missing at capture; possible recap
1,sj,2-9-15-17,f,13viii2010,56,77,0,1.0,5.5,20m up CCC,240,NEW,yes,,y62c,61-10-cc,Tss
2,sj,3-6-11-17,m,18viii2010,50,68,0,1.0,4.0,1m vT at top R island,157,NEW,yes,,y<c.t,,Bss; lost toes
3,sj,3-6-15-16,f,18viii2010,72,62,46,3.0,11.0,halfway between pool and 2 falls 2m up rt side,385,NEW,yes,,y65c,65-10-cc,
4,sj,3-6-12-17,m,18viii2010,57,82,0,1.0,6.0,R outcrop ^ oak R,425,new,yes,,y67c,66-10-cc,


<a id='Functions'></a>

# Functions
[Back to: Top](#TOC)

1. [appendstr](#appenstr)
2. [typeordrop](#typeordrop)

<a id = 'appendstr'></a>

## appendstr

In [7]:
def appendstr(x, value, connector = '', position='end'):
    """
    appends *value* and *x* separated by a *connector* with the position of *val* determined by *position*
    :param x:
    :param value:
    :param connector:
    :param position:
    """
    assert((isinstance(x,str)|(x is None)|(x!=x))),"x must be str type, NoneType or NaN: x is {} type."\
    .format(type(x))
    if ((x!=x)|(x is None)):
        x=''
    assert(isinstance(value,str)),"value must be str type: value is {} type.".format(type(value))
    assert(isinstance(connector,str))\
    , "connector must be str or None type, not {} type.".format(type(connector))
    assert(isinstance(position,(str,int))), "position must be either str or int type, not {}."\
           .format(type(position))           
    if isinstance(position,str):
        assert(position in ['start','end']), "If position is str type, it must be either 'start' or 'end'."
        positiondict = {'start':0,'end':len(x)}
        position = positiondict[position]
    if isinstance(position,int):
        assert(position in range(0,1+len(x)))\
        , "If position is int type, it must be a value in the range 0 through {}.".format(len(x))
    prefix = x[:position]
    suffix = x[position:]
    if len(x)==0:
        res = value
    else:
        if position == 0:
            res = prefix+value+connector+suffix
        if position == len(x):
            res = prefix+connector+value+suffix
        if (position>0&position<1):
            res = prefix+connector+value+connector+suffix

    return res
    

Here's an example of how *appendstr* works.

In [8]:
foo='bar'
appendstr(foo,'test',connector='_',position=1)

'b_test_ar'

In [9]:
appendstr(foo,'test',connector='_',position=1)

'b_test_ar'

In [10]:
appendstr(None,'test',connector='_',position='end')

'test'

In [11]:
appendstr(None,'test',position=0)

'test'

<a id='typeordrop'></a>

## typeordrop
[Back to Top](#TOC)

[Back to Functions](#Functions)

In [12]:
def typeordrop(x,typ,replace=None, verbose=True):
    """this function attempts to force an object, *x*, to a particular type,*typ*. If this is not possible, 
    it reports the value of the object that could not be forced and replaces the object with the value 
    supplied to the *replace* argument"""
    if not isinstance(x,typ)==True:
        while False:
            try:
                x=typ(x)
                print("Working as expected")
                break
            except TypeError:
                if verbose==True:
                    print("Could not force value supplied to 'x' argument to {} type. x is {} type:\n\n x = {}"\
                          .format(typ,type(x),x))
                x = replace
    else:
        print("{} is already of type {}.".format(x,typ))
    return x
         

Here are a few examples of how *typeordrop* works.

In [13]:
x=['foo','bar']
typeordrop(x,int)

['foo', 'bar']

<a id= 'InspectingData'></a>

## Inspecting the Data
[Top](#TOC)

Let's take a look at the data.

In [14]:
print("\nThere are {} data points in our data set.".format(df.shape[0]))


There are 6299 data points in our data set.


In [15]:
print("\nThe columns in the data have the following data types:\n{}".format(df.dtypes))


The columns in the data have the following data types:
species         object
toes            object
sex             object
date            object
svl             object
tl              object
rtl             object
autotomized    float64
mass            object
location        object
meters          object
new.recap       object
painted         object
sighting       float64
paint.mark      object
vial            object
misc            object
dtype: object


<a id= 'CleaningData'></a>

# Cleaning the Data
[Back to: Top](#TOC)

Now we get to the actual cleaning of the data.  We will inspect the data and take the appropriate cleaning steps:
1. [Column-by-Column Cleaning](#ColbyCol)

2. [Correcting class of columns](#CorrectingClass)

<a id='ColbyCol'></a>

## Column-by-Column Cleaning
[Back to: Top](#TOC)

We will handle the cleaning for each column in this section.
1. [rtl](#rtl)
2. [tl](#tl)
3. [svl](#svl)
4. [autotomized](#autotomized)
    1. [creating 'rtl_orig'and relabeling 'rtl' and 'autotomized](#rtlRTL_ORIGautotomized)
    2. [copy the values in rtl to a new column, *rtl_orig*](#copyrtl)
    3. [relabel entries in the autotomized column based on the values in the rtl_orig column](#relabelaut) 
    4. [relabel entries in the rtl column](#relabelrtl)
5. [toes](#toes)
6. [sex](#sex)
7. [species](#species)
8. [new.recap](#newrecap)

<a id='rtl'></a>

## 'rtl' 
[Back to: Top](#TOC)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#ColbyCol)

Here we investigate and clean values in the column 'rtl'. These should be int type values that are greater than or equal to -1.  First, we test to see if all of the values are of type int.

In [16]:
badtypes = []
for val in df.rtl:
    try:
        x = isinstance(type(int(val)),int)
    except:
        badtypes=badtypes+[val]
print("'badtypes' represents {} entries in the df:".format(len(badtypes)))
if len(badtypes)==0:
    print("\nAll values in df.rtl can be successfuly converted to int.\n\n")
#     df['rtl'] = df.rtl.apply(int)
else:
    print("\nAll values in df.rtl could not be converted to int.  The following values could not be \
converted and should be investigated:\n\n{}\n\nbadtypes values are distributed as follows in the df:\n\n{}"\
          .format(list(set(badtypes)),df.loc[df.rtl.isin(badtypes),'rtl'].value_counts(dropna=False)))

'badtypes' represents 3596 entries in the df:

All values in df.rtl could not be converted to int.  The following values could not be converted and should be investigated:

[nan, '10(kink)', '?', '-', 'o', '32 -12']

badtypes values are distributed as follows in the df:

NaN         3590
?              2
o              1
10(kink)       1
-              1
32 -12         1
Name: rtl, dtype: int64


The non-NaN values are few, so we will inspect these first.

In [17]:
pd.set_option('max_colwidth',100000)
df.loc[(df.rtl.isin(badtypes))&(df.rtl.notna()),:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
884,sj,,m,2003-04-19 00:00:00,56,32,?,,,talus 326,326.0,NEW,painted,,b7c,,
907,sj,4-10-14-18,m,2003-04-30 00:00:00,76,19,?,,,wall 15m,15.0,recap,painted,,b9a,,9 looks like a backwards P and t combined
1037,sv,1-6-11-20,m,2003-06-27 00:00:00,41,60,o,,4.0,sb 5m ^ cave trail,50.0,NEW,painted,,sMb,,"lost toes for vial, accidently cut off toe 11"
6097,uo,4-6-18,m,2004-07-12 00:00:00,52,75,32 -12,,4.7,sb opp fallen juniper -> flat R,208.0,new,painted,,w^c,04-63,blue throat and blue belly; accidentally cut toe 6
6193,sv,,f,2004-07-21 00:00:00,-,-,-,,6.0,sb @ cc/ccc,240.0,recap,painted,,w148b,,escaped
6220,sj,2-9-12-18,f,2004-07-22 00:00:00,65,94,10(kink),,9.4,wall rt side v wall v cave tr,,recap,painted,,w154b,,hurt toes 11-13 in capture; Bss Tss


Based on review discussions, we will make the changes below:
- ‘?’--> 0; misc: “unsure if tail was recently broken at very tip”
- ‘o’--> 0
- ‘32 -12’ -->32; misc: “potential double-break at 12 \[george to check before use\]” 
- ‘-’--> NaN
- ‘10(kink)’-->0; misc:”kink at 10mm”
We will use the function [*appendstr*](#appendstr) to do this.

"‘?’--> 0; misc: “unsure if tail was recently broken at very tip”

In [18]:
idx_ques = (df.rtl.isin(badtypes))&(df.rtl=='?')
df[idx_ques]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
884,sj,,m,2003-04-19 00:00:00,56,32,?,,,talus 326,326,NEW,painted,,b7c,,
907,sj,4-10-14-18,m,2003-04-30 00:00:00,76,19,?,,,wall 15m,15,recap,painted,,b9a,,9 looks like a backwards P and t combined


In [19]:
df.loc[idx_ques,'misc']= df.loc[idx_ques,:].misc\
.apply(lambda x: appendstr(x,"unsure if tail was recently broken at very tip",';'))
df.loc[idx_ques,'rtl']= '0'

These entries now look like this:

In [20]:
df.loc[idx_ques,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
884,sj,,m,2003-04-19 00:00:00,56,32,0,,,talus 326,326,NEW,painted,,b7c,,unsure if tail was recently broken at very tip
907,sj,4-10-14-18,m,2003-04-30 00:00:00,76,19,0,,,wall 15m,15,recap,painted,,b9a,,9 looks like a backwards P and t combined;unsure if tail was recently broken at very tip;


"‘o’--> 0"

In [21]:
idx_o = (df.rtl.isin(badtypes))&(df.rtl=='o')
df[idx_o]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
1037,sv,1-6-11-20,m,2003-06-27 00:00:00,41,60,o,,4,sb 5m ^ cave trail,50,NEW,painted,,sMb,,"lost toes for vial, accidently cut off toe 11"


In [22]:
df.loc[idx_o,'rtl']= '0'

These entries now look like this:

In [23]:
df.loc[idx_o,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
1037,sv,1-6-11-20,m,2003-06-27 00:00:00,41,60,0,,4,sb 5m ^ cave trail,50,NEW,painted,,sMb,,"lost toes for vial, accidently cut off toe 11"


"‘32-12’ -->32; misc: “potential double-break at 12 \[george to check before use\]"

In [24]:
idx_32 = (df.rtl.isin(badtypes))&(df.rtl=='32 -12')
df.loc[idx_32]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
6097,uo,4-6-18,m,2004-07-12 00:00:00,52,75,32 -12,,4.7,sb opp fallen juniper -> flat R,208,new,painted,,w^c,04-63,blue throat and blue belly; accidentally cut toe 6


In [25]:
df.loc[idx_32,'misc']= df.loc[idx_32,:].misc\
.apply(lambda x: appendstr(x,"potential double-break at 12 [george to check before use]",';'))

df.loc[idx_32,'rtl']= '32'

These entries now look like this:

In [26]:
df.loc[idx_32,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
6097,uo,4-6-18,m,2004-07-12 00:00:00,52,75,32,,4.7,sb opp fallen juniper -> flat R,208,new,painted,,w^c,04-63,blue throat and blue belly; accidentally cut toe 6;potential double-break at 12 [george to check before use];


"‘-’-->'NaN'"

In [27]:
idx_minus = (df.rtl.isin(badtypes))&(df.rtl=='-')
df.loc[idx_minus,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
6193,sv,,f,2004-07-21 00:00:00,-,-,-,,6,sb @ cc/ccc,240,recap,painted,,w148b,,escaped


We will also address the values for svl and tl in this row.

In [28]:
df.loc[idx_minus,['rtl','tl','svl']]= np.nan

These entries now look like this:

In [29]:
df.loc[idx_minus,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
6193,sv,,f,2004-07-21 00:00:00,,,,,6,sb @ cc/ccc,240,recap,painted,,w148b,,escaped


‘10(kink)’-->0; misc:”kink at 10mm” We will use the function appendstr to do this."

In [30]:
idx_10kink = (df.rtl.isin(badtypes))&(df.rtl=='10(kink)')
df.loc[idx_10kink,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
6220,sj,2-9-12-18,f,2004-07-22 00:00:00,65,94,10(kink),,9.4,wall rt side v wall v cave tr,,recap,painted,,w154b,,hurt toes 11-13 in capture; Bss Tss


In [31]:
df.loc[idx_10kink,'misc']= df.loc[idx_10kink,:].misc.apply(lambda x: appendstr(x,"kink at 10mm",';'))
df.loc[idx_10kink,'rtl']= '0'

These entries now look like this:

In [32]:
df.loc[idx_10kink,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
6220,sj,2-9-12-18,f,2004-07-22 00:00:00,65,94,0,,9.4,wall rt side v wall v cave tr,,recap,painted,,w154b,,hurt toes 11-13 in capture; Bss Tss;kink at 10mm;


Now we will inspect those that had at least one other length measurement (svl or tl).

In [33]:
pd.reset_option('max_colwidth')
idx_rtlnaplus1 = (df.rtl.isna())&(((df.svl.isna())&~(df.tl.isna()))|(~(df.svl.isna())&(df.tl.isna())))
df.loc[idx_rtlnaplus1]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
2027,sj,2-6-13-20,f,2001-03-23 00:00:00,~70,,,,,bottom R wall v cave trail,30.0,sighting,,,?,"could read toes 6,13 for certain; toe 2 uncert...",
4642,sj,,m,2002-03-16 00:00:00,large,,,,,active in crevice in wall 3m v juniper xing,112.0,sighting,,,?,,
4814,sj,,,2002-03-17 00:00:00,large,,,,,H4a,194.0,sighting,,,w85a??,,"probably w85a but could only see the ""5"""
5175,sj,,f,2002-03-20 00:00:00,large,,,,,L across from wall,318.0,sighting,,,w||t,,
5640,sj,,m,2002-03-19 00:00:00,large,,,,,up rt wall @ pool,,sighting,,,???,,~25mm original T; rest regrown
5784,sv,,?,2002-03-19 00:00:00,small,,,,,sb 4m ^ flatR,,sighting,,,,,had moth so didn'tcatch


All but one of these was a sighting.  We will have to look at the field notes to confirm whether or not data were actually missing for the remaining entry.

In [34]:
df.loc[(df.rtl.isna())&((df.svl.notna())|(df.tl.notna()))&df['new.recap'].str.contains('recap'),:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
5907,sv,1-6-16-17-20,m,2004-07-04 00:00:00,52,53,,,3.6,bottom chute,355,recap,painted,,w.t,,few mites


Once we have addressed these, we will force rtl to an int type.

Now we check to see for out of range rtl values, *i.e.* rtl values less than -1 or suspiciously high.

We will exclude 0 and -1 values for rtl in these figures because of the large proportion of in range values they account for.

In [35]:
dfnobadtypes0neg1 = (~df.rtl.isin(badtypes))&(~df.rtl.isin(['0','-1']))
dfother = ~(df.species.dropna().str.contains('v|j'))&(df.species.notna())&(dfnobadtypes0neg1)
jarrovii = go.Histogram(x = df.loc[(df.species.str.contains('j'))&(dfnobadtypes0neg1)
                                   ,'rtl'].astype(int, 'ignore'),name = 'S. jarrovii',xbins =dict(size=1)
                        #,histnorm='probability'
                        , cumulative=dict(enabled = False, direction = 'increasing'))
virgatus = go.Histogram(x = df.loc[(df.species.str.contains('v'))&(dfnobadtypes0neg1)
                                   ,'rtl'].astype(int, 'ignore'), name = 'S. virgatus',xbins =dict(size=1)
                       #,histnorm='probability'
                        , cumulative=dict(enabled = False, direction = 'increasing'))
other = go.Histogram(x = df.loc[dfother,'rtl'].astype(int, 'ignore'), name = 'other',xbins =dict(size=1)
                                  #,histnorm='probability'
                     , cumulative=dict(enabled = False
                                                                           , direction = 'increasing'))
data = [jarrovii, virgatus,other]
layout = go.Layout(
    title = 'Histogram of rtl by species',
    titlefont = dict(
        size = 20),
    xaxis = dict(
        dtick = 1,
        title = 'rtl (mm)',
        titlefont = dict(
            size = 18)),
    yaxis = dict(
        title = 'Number of Lizards',
        titlefont = dict(
            size = 18))
)
fig = go.Figure(
        data = data,
        layout = layout)
py.iplot(fig, filename = 'Histogram of rtl by species (new)')

Perhaps it's worth inspecting values greater than 50. 

In [36]:
idx_dfabove50 = (df.species.str.contains('j|v'))&(~df.rtl.isin(badtypes))\
&(df.loc[(~df.rtl.isin(badtypes)),'rtl'].astype(int, 'ignore')>=50)
df[idx_dfabove50]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
419,sj,5-11-18,M,2014-07-03 00:00:00,85.0,73,56,,19.5,black r,171.0,new,yes,,o5c,14-04,
634,sj,5-10-11-16,m,2000-03-20 00:00:00,81.0,93,55,,15.5,2falls,,,,,r76c,,
773,sj,2-8-13,F,2012-05-25 00:00:00,82.0,79,57,,21.5,loose on T; H5,,recap,yes,,w18c.b,,pics of site
779,sj,3-9-15,f,2012-05-28 00:00:00,79.0,85,58,,19.3,1falls,,recap,yes,,w34c,,salmon tail; Tss
901,sj,2-12-17,m,2003-04-30 00:00:00,89.0,97,56,,,1falls,1.0,recap,painted,,bYc,,shedding
1193,sj,4-9-12-20,m,2003-07-02 00:00:00,81.0,62,50,,18.5,alligator juniper @top of site,452.0,recap,painted,,oSa,,Tss; Bpss
1582,sj,4-9-12-20,m,2003-07-18 00:00:00,78.0,65,50,,17.5,Rs across from oak R,418.0,recap,painted,,oTc,,looked like it was previously marked with orange
1844,sj,2 - 8 - 13,F,2011-06-21 00:00:00,70.0,73,54,,10.4,stream bed 2m v H5,198.0,recap,yes,,g26b,,
1846,sj,3 - 9 - 15,F,2011-06-23 00:00:00,75.0,82,56,,7.7,right Rs bottom bowl,-7.0,recap,yes,,g35b,,was w45c; recently dropped; TSS
1853,sj,1 - 7 - 14 - 19,M,2011-06-19 00:00:00,80.0,73,52,,20.0,opp oak R,418.0,recap,yes,,g10b.t,,BSS T-shed HSS


<a id='outstanding1'></a>

Some of these values are reasonable, but there are few for which we will need to go back to the field notes in 2011.  Those rows in which rtl > tl need to be investigated.

[Back to Outstanding Problems](#outstanding)

In [37]:
idx_rtltlbig = (idx_dfabove50)&(df.rtl.astype(int,errors = 'ignore')>df.tl.astype(int,errors ='ignore'))
df.loc[idx_rtltlbig]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
1867,sj,b 1 - 7 - 11,F,2011-06-20 00:00:00,7.5,70,90,,0,3m right side ^ Juniper Xing,118,recap,yes,,,,"Break at 50, tail still attached w48c -> g18b ..."
1873,sj,b 2 - 9 - 15 - 17,F,2011-06-20 00:00:00,7.6,68,86,,0,10m up CCC on slab,250,recap,yes,,g19b,,


These appear to be cases where svl,tl,rtl and mass may have been entered into the wrong columns, i.e. the correct placement of current values-->correct column should probably be:
- svl-->mass
- tl-->svl
- rtl-->tl
- mass-->rtl

We will correct these now.

In [38]:
import copy
def swap(df):
    tmp = {
        'rtl':copy.copy(df['rtl']),
        'tl':copy.copy(df['tl']),
        'svl':copy.copy(df['svl']),
        'mass':copy.copy(df['mass'])
    }
#     print(tmp)
    df['rtl'] = tmp['mass']
    df['tl'] = tmp['rtl']
    df['svl'] = tmp['tl']
    df['mass'] = tmp['svl']
#     print(tmp)
    return df


In [39]:
df.loc[idx_rtltlbig,['svl','rtl','tl','mass']] = swap(df.loc[idx_rtltlbig,['svl','rtl','tl','mass']])
df.loc[idx_rtltlbig,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
1867,sj,b 1 - 7 - 11,F,2011-06-20 00:00:00,70,90,0,,7.5,3m right side ^ Juniper Xing,118,recap,yes,,,,"Break at 50, tail still attached w48c -> g18b ..."
1873,sj,b 2 - 9 - 15 - 17,F,2011-06-20 00:00:00,68,86,0,,7.6,10m up CCC on slab,250,recap,yes,,g19b,,


Now we force rtl to int type, ignoring errors.

In [40]:
df['rtl'] = df.rtl.astype(int,errors = 'ignore')

<a id='tl'></a>

## 'tl' 
[Back to: Top](#TOC)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#ColbyCol)

Here we investigate and clean values in the column 'tl'. These should be int type values that are positive.  First, we test to see if all of the values are of type int.

In [41]:
df.tl.astype(int,errors='ignore').apply(lambda x: type(x)).value_counts(dropna=False)

<class 'float'>    3590
<class 'str'>      2709
Name: tl, dtype: int64

Let us inspect the entries for which attempting to convert 'tl' results in a float type.

In [42]:
idx_floatNaNtl = df.tl.astype(int,errors='ignore').apply(lambda x: type(x) is float)
df.loc[idx_floatNaNtl,'tl'].value_counts(dropna=False)

NaN    3590
Name: tl, dtype: int64

These are all NaN entries and can be ignored for the time being.

Let's inspect the non NaN entries now.

In [43]:
idx_strtl = df.tl.astype(int,errors='ignore').apply(lambda x: type(x) is str)
df.loc[idx_strtl,'tl'].value_counts(dropna=False)

70         79
73         70
68         69
75         69
69         68
65         62
72         60
71         57
66         55
78         55
67         54
63         52
76         50
60         48
74         47
85         45
64         44
90         44
80         43
61         42
100        41
88         40
79         38
55         37
62         36
57         35
59         35
86         34
52         34
50         33
93         33
58         32
98         30
91         30
53         30
81         29
54         29
102        29
77         28
89         28
47         27
87         26
95         26
97         25
84         25
82         25
99         25
103        24
51         24
46         24
83         24
49         24
92         24
40         23
94         23
56         22
101        22
48         21
96         20
105        19
106        19
43         19
45         19
104        18
35         16
120        15
110        15
109        14
42         13
111        13
44         13
112   

With the exception of the value '56 (42)', the tl values that are not NaN could be converted to int types.  Let's inspect this entry.

In [44]:
pd.set_option('max_colwidth',1000)
idx_5642tl = df.loc[(idx_strtl) & (df.tl=='56 (42)'),:].index
df.loc[df.index.isin(idx_5642tl)]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
4111,sj,1-2-3-4-5,f,2009-07-13 00:00:00,69,56 (42),-1,,9.2,T opp mid wall v juniper xing,85,new,painted,,y7a,,missing LFF (left front foot); open break in tail at 42


Based on the notes in the misc column, tl should be recorded as 56.  We will do this now.

In [45]:
df.loc[df.index.isin(idx_5642tl),'tl']='56'

Now the entry looks like this.

In [46]:
df.loc[df.index.isin(idx_5642tl),:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
4111,sj,1-2-3-4-5,f,2009-07-13 00:00:00,69,56,-1,,9.2,T opp mid wall v juniper xing,85,new,painted,,y7a,,missing LFF (left front foot); open break in tail at 42


We will use a histogram to try and identify abnormalities among the other tl values.

In [47]:
# dfnobadtypes0neg1 = (~df.tl.isin(badtypes))&(~df.tl.isin(['0','-1']))
dfother = ~(df.species.dropna().str.contains('v|j'))&(df.species.notna())&(df.tl.notna())
jarrovii = go.Histogram(x = df.loc[(df.species.str.contains('j'))&(df.tl.notna())
                                   ,'tl'].astype(int, 'ignore'),name = 'S. jarrovii',xbins =dict(size=1)
                        #,histnorm='probability'
                        , cumulative=dict(enabled = False, direction = 'increasing'))
virgatus = go.Histogram(x = df.loc[(df.species.str.contains('v'))&(df.tl.notna())
                                   ,'tl'].astype(int, 'ignore'), name = 'S. virgatus',xbins =dict(size=1)
                       #,histnorm='probability'
                        , cumulative=dict(enabled = False, direction = 'increasing'))
other = go.Histogram(x = df.loc[dfother,'tl'].astype(int, 'ignore'), name = 'other',xbins =dict(size=1)
                                  #,histnorm='probability'
                     , cumulative=dict(enabled = False
                                                                           , direction = 'increasing'))
data = [jarrovii, virgatus, other]
layout = go.Layout(
    title = 'Histogram of tl by species',
    titlefont = dict(
        size = 20),
    xaxis = dict(
        dtick = 1,
        title = 'tl (mm)',
#         tickfont = dict(
#         size = 8),
#         tickangle = 85,
        titlefont = dict(
            size = 18)),
    yaxis = dict(
        title = 'Number of Lizards',
        titlefont = dict(
            size = 18))
)
fig = go.Figure(
        data = data,
        layout = layout)
py.iplot(fig, filename = 'Histogram of tl by species (new)')

For now there is not much we can identify graphically.  We will revist this later.  For now we will force tl to int.

In [48]:
df['tl'] = df.tl.astype(int, errors = 'ignore')

<a id='svl'></a>

## 'svl' 
[Back to: Top](#TOC)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#ColbyCol)



We will take a similar approach for svl.

In [49]:
df.svl.astype(int,errors='ignore').apply(lambda x: type(x)).value_counts(dropna=False)

<class 'float'>    3584
<class 'str'>      2715
Name: svl, dtype: int64

Let us inspect the entries for which attempting to convert 'svl' results in a float type.

In [50]:
idx_floatNaNsvl = df.svl.astype(int,errors='ignore').apply(lambda x: type(x) is float)
df.loc[idx_floatNaNsvl,'svl'].value_counts(dropna=False)

NaN    3584
Name: svl, dtype: int64

These are all NaN entries and can be ignored for the time being.

Let's inspect the non NaN entries now.

In [51]:
idx_strsvl = df.svl.astype(int,errors='ignore').apply(lambda x: type(x) is str)
df.loc[idx_strsvl,'svl'].value_counts(dropna=False)

50       108
52        85
53        82
56        81
55        80
51        77
49        75
48        74
60        74
75        71
70        71
54        70
58        61
47        60
65        60
61        58
46        57
57        52
45        51
73        50
63        50
68        49
72        47
76        46
59        45
66        44
62        44
64        43
80        43
78        40
43        38
82        37
74        36
40        36
77        35
42        35
71        34
69        33
67        32
85        31
44        30
79        30
81        28
84        26
39        26
83        26
87        23
38        23
89        21
32        20
31        20
41        20
36        20
37        20
35        20
34        19
86        19
88        18
90        16
33        14
30        13
91        12
93         8
92         7
29         7
28         6
large      4
27         4
98         2
96         2
26         2
105        2
13         2
22         2
95         2
112        1
small      1

The values 'large', 'small', and '~70' require closer inspection.

In [52]:
idx_txtvals = df.svl.isin(['small','large','~70'])
df.loc[idx_txtvals]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
2027,sj,2-6-13-20,f,2001-03-23 00:00:00,~70,,,,,bottom R wall v cave trail,30.0,sighting,,,?,"could read toes 6,13 for certain; toe 2 uncertain and didn't see 20 but this is the only right-sized female who could possible fit! Originally caught in July 1998, sb 20m ^ cave trail.",
4642,sj,,m,2002-03-16 00:00:00,large,,,,,active in crevice in wall 3m v juniper xing,112.0,sighting,,,?,,
4814,sj,,,2002-03-17 00:00:00,large,,,,,H4a,194.0,sighting,,,w85a??,,"probably w85a but could only see the ""5"""
5175,sj,,f,2002-03-20 00:00:00,large,,,,,L across from wall,318.0,sighting,,,w||t,,
5640,sj,,m,2002-03-19 00:00:00,large,,,,,up rt wall @ pool,,sighting,,,???,,~25mm original T; rest regrown
5784,sv,,?,2002-03-19 00:00:00,small,,,,,sb 4m ^ flatR,,sighting,,,,,had moth so didn'tcatch


All of these values for svl should be set to NaN since these are estimates, not measured values.  For the entry with the svl value of '~70', we can add the estimated value to the misc column. We will use the [appendstr](#appendstr) function here again.

In [53]:
idx_apprx70svl = (idx_txtvals)&(df.svl=='~70')
df.loc[idx_apprx70svl,'misc'] = df.loc[idx_apprx70svl,'misc'].apply(lambda x: appendstr(x,connector=';'
                                                        ,position='end'
                                                        ,value='svl extimated to be ~70mm'))
df.loc[idx_apprx70svl,:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
2027,sj,2-6-13-20,f,2001-03-23 00:00:00,~70,,,,,bottom R wall v cave trail,30,sighting,,,?,"could read toes 6,13 for certain; toe 2 uncertain and didn't see 20 but this is the only right-sized female who could possible fit! Originally caught in July 1998, sb 20m ^ cave trail.",svl extimated to be ~70mm


In [54]:
df.loc[idx_txtvals,'svl']=np.nan
df.loc[idx_txtvals]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
2027,sj,2-6-13-20,f,2001-03-23 00:00:00,,,,,,bottom R wall v cave trail,30.0,sighting,,,?,"could read toes 6,13 for certain; toe 2 uncertain and didn't see 20 but this is the only right-sized female who could possible fit! Originally caught in July 1998, sb 20m ^ cave trail.",svl extimated to be ~70mm
4642,sj,,m,2002-03-16 00:00:00,,,,,,active in crevice in wall 3m v juniper xing,112.0,sighting,,,?,,
4814,sj,,,2002-03-17 00:00:00,,,,,,H4a,194.0,sighting,,,w85a??,,"probably w85a but could only see the ""5"""
5175,sj,,f,2002-03-20 00:00:00,,,,,,L across from wall,318.0,sighting,,,w||t,,
5640,sj,,m,2002-03-19 00:00:00,,,,,,up rt wall @ pool,,sighting,,,???,,~25mm original T; rest regrown
5784,sv,,?,2002-03-19 00:00:00,,,,,,sb 4m ^ flatR,,sighting,,,,,had moth so didn'tcatch


Now we force the remaining svl values to int type.

In [55]:
df['svl'] = df.svl.astype(int, errors= 'ignore')

<a id='autotomized'></a>

### 'autotomized' 
[Back to: Top](#TOC)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#ColbyCol)

[creating 'rtl_orig'and relabeling 'rtl' and 'autotomized](#rtlRTL_ORIGautotomized)
- [copy the values in rtl to a new column, *rtl_orig*](#copyrtl)
- [relabel entries in the autotomized column based on the values in the rtl_orig column](#relabelaut) 
- [relabel entries in the rtl column](#relabelrtl)

Here we populate the 'autotomized' column based on the values in 'rtl'.  Most of the source files did not have this category and have NaN values others have float values of 1.0, 2.0 or 3.0 for intact, autotomized with no regrowth or autotomized with regrowth, respectively.  The cleaned data for autotomized will contain  bool type values True, for having experienced auttomy (irrespective of regrowth) and False for having no evidence of havign experienced autotomy.

In [56]:
df.autotomized.value_counts(dropna=False)

NaN     6212
 1.0      61
 3.0      17
 2.0       9
Name: autotomized, dtype: int64

We will inspect the rtl values for entries with non NaN values for autotomized to determine if we can depend on rtl values to determine autotomy status.  In order to rely on rtl values, the following conditions must be met:
- all entries in which autotomized equals 1.0 must have 0 for rtl
- all entries in which autotomized equals 2.0 or 3.0 must have -1 or some value >0 for rtl

In [57]:
intact = df.loc[(df.autotomized==1),'rtl'].astype(int,errors = 'ignore').value_counts(dropna=False)
values2check = [x for x in intact.index[intact.index!=0]]
if len(values2check)>0:
    print("The rtl values associated with {} need a closer look.".format(values2check))
else:
    print("Values for 'intact' entries are as expected.  Continue.")
pd.set_option('max_colwidth',1000)
# df.loc[(df.autotomized==1)&(df.rtl.isin(['21'])),:]
df.loc[(df.autotomized==1)&(df.rtl.astype(int, errors = 'ignore').isin([str(x) for x in values2check])),:]
# need to see what broke this line

The rtl values associated with [21] need a closer look.


Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
29,sv,3-7-11-16,f,19vi2010,60,42,21,1.0,8.5,6m^bottom site on R ^fallen T in sb,-14,NEW,yes,,y1a,CA-01-cc,gravid


<a id ='outstanding2'></a>

[Back to Outstanding Problems](#outstanding)

This lizard appears to have been misrecorded and should be listed as autotomized given the amount of regrowth. This should be confirmed in the field notes. If we trust the data as recorded and depend on the rtl values to label autotomized this will be corrected, so for now we will leave this as is.

In [58]:
autotomized = df.loc[(df.autotomized==2),'rtl'].value_counts(dropna=False)
aut_values2check = [x for x in autotomized.index[autotomized.index!='-1']]# change to 'isin' aregument with 0 and -1
if len(aut_values2check)>0:
    print("{} values associated with an rtl value of {} need a closer look."\
          .format(df.loc[(df.autotomized==2)&(df.rtl.isin(aut_values2check)),:].shape[0],aut_values2check))
else:
    print("Values for 'autotomized' entries are as expected.  Continue.")
pd.set_option('max_colwidth',1000)
idx_aut_entries2check = (df.autotomized==2)&(df.rtl.isin([str(x) for x in aut_values2check]))
df.loc[idx_aut_entries2check,:]

8 values associated with an rtl value of ['0'] need a closer look.


Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
5,sj,3-6-15-17,m,18viii2010,52,70,0,2.0,4.5,r outcrop ^ oak R,425.0,new,yes,,y68c,67-10-cc,Tss
20,sj,3-6-12-18,f,26vii2010,66,79,0,2.0,9.0,339m; rt side 1m up,339.0,NEW,yes,,y54c,48-10-cc,Bss
26,sj,3-6-13-20,f,5viii2010,48,69,0,2.0,3.5,5m up ccc,,new,yes,,>c,59-10-cc,
53,sj,3-10-13-(14)-16,m,12vi2010,92,112,0,2.0,18.0,talus^Rwall v talus left side 4m up,,recap,yes,,w1c,,"toe 14 looks like natural toe loss; skinny [pics], spine visible [pics], old puncture wounds on back [pics], possible injury of back rt leg at knee [pics]"
77,sj,4-7-8-9-11-18,m,26vii2010,96,97,0,2.0,24.5,2m left of stump,364.0,recap,yes,,y51c,,old mark wXc still visible; not the same as wXc at chute
80,sj,3-10-13-14-16,m,26vii2010,91,111,0,2.0,18.5,1m v top R wall v talus in sb,320.0,recap,yes,,y55c,,1m v top R wall in sb; last mark still visible; injuries to dorsum [pics]; toe 14 may be natural toe loss; lizard is skinny: spine and hip bones visible [pics]
83,sj,,f,5viii2010,70,86,0,2.0,11.0,bottom site,,recap,,,y61c.t,,BSS; toe 19 could be natural toe loss
85,sv,3-8-12-16,m,20vi2010,52,58,0,2.0,3.5,5m v wall v wall v juniper xing left side sb,80.0,recap,yes,,y6a,,


<a id = 'outstanding3'></a>

[Back to Outstanding Problems](#outstanding)

Some of these cases are very straight forward given that the ratio of svl to tl is very close to 1, but others would be worth checking the original data to confirm. Another option is to use the svl to tl ratio of animals that we are sure are intact to decide how to classify these.  For now we will trust the system of recording used in 2010 and update the rtl values to '-1'.

In [59]:
df.loc[idx_aut_entries2check,'rtl'] = '-1'

In [60]:
df.loc[idx_aut_entries2check]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
5,sj,3-6-15-17,m,18viii2010,52,70,-1,2.0,4.5,r outcrop ^ oak R,425.0,new,yes,,y68c,67-10-cc,Tss
20,sj,3-6-12-18,f,26vii2010,66,79,-1,2.0,9.0,339m; rt side 1m up,339.0,NEW,yes,,y54c,48-10-cc,Bss
26,sj,3-6-13-20,f,5viii2010,48,69,-1,2.0,3.5,5m up ccc,,new,yes,,>c,59-10-cc,
53,sj,3-10-13-(14)-16,m,12vi2010,92,112,-1,2.0,18.0,talus^Rwall v talus left side 4m up,,recap,yes,,w1c,,"toe 14 looks like natural toe loss; skinny [pics], spine visible [pics], old puncture wounds on back [pics], possible injury of back rt leg at knee [pics]"
77,sj,4-7-8-9-11-18,m,26vii2010,96,97,-1,2.0,24.5,2m left of stump,364.0,recap,yes,,y51c,,old mark wXc still visible; not the same as wXc at chute
80,sj,3-10-13-14-16,m,26vii2010,91,111,-1,2.0,18.5,1m v top R wall v talus in sb,320.0,recap,yes,,y55c,,1m v top R wall in sb; last mark still visible; injuries to dorsum [pics]; toe 14 may be natural toe loss; lizard is skinny: spine and hip bones visible [pics]
83,sj,,f,5viii2010,70,86,-1,2.0,11.0,bottom site,,recap,,,y61c.t,,BSS; toe 19 could be natural toe loss
85,sv,3-8-12-16,m,20vi2010,52,58,-1,2.0,3.5,5m v wall v wall v juniper xing left side sb,80.0,recap,yes,,y6a,,


In [61]:
regrown = df.loc[(df.autotomized==3),'rtl'].value_counts(dropna=False).reset_index()['index']\
.astype(int, errors = 'ignore')
values2check = [x for x in regrown<=0]
if sum(values2check)>0:
    print("The values associated with {} need a closer look.".format(values2check))
else:
    print("Values for 'regrown' entries are as expected.  Continue.")
pd.set_option('max_colwidth',1000)
df.loc[(df.autotomized==3)&(df.rtl.isin([str(x) for x in values2check])),:]

Values for 'regrown' entries are as expected.  Continue.


Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc


The entries labeled as a 3.0 in the autotomized column do not appear as though their rtl values will present an issue for calculating new autotomized values.  We will leave these as they are.

<a id='rtlRTL_ORIGautotomized'></a>

### creating 'rtl_orig' and relabeling 'rtl' and 'autotomized'
[Back to: Top](#TOC)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#ColByCol)

[Back to: 'autotomized'](#autotomized)

Now we will:
- [copy the values in rtl to a new column, *rtl_orig*](#copyrtl)
- [relabel entries in the autotomized column based on the values in the rtl_orig column](#relabelaut) 
- [relabel entries in the rtl column](#relabelrtl)

<a id='copyrtl'></a>

#### copy the values in rtl to a new column, *rtl_orig*
[Back to: 'autotomized'](#autotomized)

In [62]:
df['rtl_orig'] = df.rtl

<a id='relabelaut'></a>

#### relabel entries in the autotomized column based on the values in the rtl_orig column
[Back to: 'autotomized'](#autotomized)

We will do this using the following logic:
    - if rtl_orig !=0 & rtl_orig.notna(), autotomized = True
    - if rtl_orig ==0, automized = False
    - if rtl_orig.isna(), autotomized = np.nan

In [63]:
idx_auttrue = (~df.rtl_orig.isin(['0']))&(df.rtl_orig.notna())
df.loc[idx_auttrue,'autotomized'] = True

In [64]:
idx_autfalse = (df.rtl_orig.isin(['0']))&(df.rtl_orig.notna())
df.loc[idx_autfalse,'autotomized'] = False

In [65]:
idx_autnan = df.rtl_orig.isna()
df.loc[idx_autnan,'autotomized'] = np.nan

<a id='relabelrtl'></a>

#### relabel entries in the rtl column
[Back to: 'autotomized'](#autotomized)
We will do this using the following logic:
    - if rtl_orig == -1, rtl = 0

In [66]:
idx_rtlneg1 = df.rtl_orig=='-1'
df.loc[idx_rtlneg1,'rtl'] = 0

In [67]:
df.autotomized.value_counts(dropna=False)

NaN      3591
False    1989
True      719
Name: autotomized, dtype: int64

<a id = 'toes'></a>

## toes 
[Top](#TOC)

[Top Cleaning](#CleaningData)

Here we make changes to toes based on comments regarding a 2004 male Sv with toes recorded as '1-6-16-17-20'

In [68]:
idx_sv2004m16161720 = (df.species=='sv') \
& (df.sex=='m') \
& (df.date.str.contains('2004-07-04')) \
&(df.svl=='52')&(df.tl=='53')

df.loc[idx_sv2004m16161720]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig
5907,sv,1-6-16-17-20,m,2004-07-04 00:00:00,52,53,,,3.6,bottom chute,355,recap,painted,,w.t,,few mites,


In [69]:
df.loc[idx_sv2004m16161720,'toes'] = '1-7-16-17-20'
df.loc[idx_sv2004m16161720,'rtl'] = 0

Now this entry looks like this.

In [70]:
df.loc[idx_sv2004m16161720]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig
5907,sv,1-7-16-17-20,m,2004-07-04 00:00:00,52,53,0,,3.6,bottom chute,355,recap,painted,,w.t,,few mites,


First we will rename "toes" to "toes_orig"

In [71]:
df = df.rename(columns = {'toes':'toes_orig'},index = str)

Next we create a new column, "toes"  for the renamed toes

In [72]:
df['toes'] = df.toes_orig

Now we attempt to identify problem toes name and correct or export for review.

In [73]:
pattern1 = ".( {1,}-.|.- {1,}.)" # toes entries with any number of spaces on either side of a hyphen
pattern2 = ".( {,}\w{,} {1,})." # toes entries with space around or between numbers <- the spaces here should be deleted
pattern3 = ".(')."
pattern4 = "./."  # entries with '/' <-- need to replace these with '-'
pattern5 = "(\?{1,})"#<-- these needs to be investigated
pattern6 = "^\d{3,}$" # entries consist of only a single number comprised of at least three digits 
#<-- these needs to be investigated by checking raw field notes
pattern7 = ".(-{2,})." # entries which have at least 2 consecutive '-' <- these should be investigated
pattern8 = "^0" # entries in which single digit numbers have a leading "0" <-- Check raw field notes on this too
pattern9 = "a\w" #<--handled hyphens should be inserted  between the [ab] and \w 
# entries that contain an 'a' or 'b' followed by any character in the set [a-zA-Z0-9_]
pattern10 = "b\w" #<--handled hyphens should be inserted  between the [ab] and \w 
pattern11 = "\wa" # entries that contain an 'a' or 'b' preceded by any character in the set [a-zA-Z0-9_]
pattern12 = "\wb" # entries that contain an 'a' or 'b' preceded by any character in the set [a-zA-Z0-9_]
pattern13 = "[()]"
# remove space before 'a' at end of toes
#investigate '\d-', 
#'-(*)-', 
#' (16) ', 
#'---', <- may not exist in raw data
#'\d- ', 
#'- \d', 
#transcription errors from excel (toes in date format,
#'-\d\d\d\d' <- may not be in the data set

We'll have to change this block if we add or remove toe patterns.
This is not ideal and needs to be fixed

In [74]:
toe_pattern = pd.Series([*range(1,14)]) 
toe_pattern = make_str(toe_pattern)
print(toe_pattern)

toe_pattern_descr = pd.Series([pattern1,pattern2,pattern3,pattern4
                               ,pattern5,pattern6,pattern7,pattern8
                               ,pattern9,pattern10,pattern11,pattern12,pattern13])
toe_pattern_descr = toe_pattern_descr.astype(str)
print(toe_pattern_descr)

toe_pattern_reference = pd.DataFrame({'toe_pattern': toe_pattern,'description':toe_pattern_descr})
toe_pattern_reference

0     01
1     02
2     03
3     04
4     05
5     06
6     07
7     08
8     09
9     10
10    11
11    12
12    13
dtype: object
0     .( {1,}-.|.- {1,}.)
1      .( {,}\w{,} {1,}).
2                   .(').
3                     ./.
4                (\?{1,})
5                ^\d{3,}$
6               .(-{2,}).
7                      ^0
8                     a\w
9                     b\w
10                    \wa
11                    \wb
12                   [()]
dtype: object


Unnamed: 0,toe_pattern,description
0,1,".( {1,}-.|.- {1,}.)"
1,2,".( {,}\w{,} {1,})."
2,3,.(').
3,4,./.
4,5,"(\?{1,})"
5,6,"^\d{3,}$"
6,7,".(-{2,})."
7,8,^0
8,9,a\w
9,10,b\w


We first replace the string 'nan' with a null value

In [75]:
df.loc[df.toes=='nan','toes'] = np.nan

Let's see how many of these patterns we need to correct

In [76]:
df['toe_pattern'] = np.nan

Here we use a for-loop to label the patterns 
(there's probably a better way to do this with pandas map or apply, but I'll have to figure this out, for now this is fast enough, but it could make a difference with a larger data set or with more patterns)

In [77]:
for i in range(0,toe_pattern_reference.shape[0]):
    tmp_pat_num = toe_pattern_reference.iloc[i,0]
    tmp_pattern = toe_pattern_reference.iloc[i,1]
    df = label_pattern(df,tmp_pat_num,tmp_pattern,'toe_pattern','toes')

A quick summary of the number of observations for each pattern in the data set

In [78]:
toe_errors =df.toe_pattern.value_counts(dropna=False).reset_index()\
.rename(columns = {'index':'toe_pattern','toe_pattern':'observations'})
toe_errors.loc[toe_errors.toe_pattern.isnull(),'toe_pattern'] = 'Not covered by current patterns'
toe_errors_desc = toe_errors.merge(toe_pattern_reference,'left',on='toe_pattern')
toe_errors_desc

Unnamed: 0,toe_pattern,observations,description
0,Not covered by current patterns,5998,
1,02,258,".( {,}\w{,} {1,})."
2,01,39,".( {1,}-.|.- {1,}.)"
3,05,2,"(\?{1,})"
4,13,1,[()]
5,09,1,a\w


Now let's make sure we've accounted for every row in the data set

In [79]:
accountedRows = toe_errors.observations.sum()
totalRows = df.shape[0]
notAccountedRows = df.shape[0] - toe_errors.observations.sum()
print("\nThere are {} rows accounted for in the patterns (including null values) and there {} rows in the full data set.\
  There are {} rows unaccounted for.".format(accountedRows,totalRows,notAccountedRows))


There are 6299 rows accounted for in the patterns (including null values) and there 6299 rows in the full data set.  There are 0 rows unaccounted for.


Now we correct these patterns. We'll preserve the original toe data in a column called "toes_orig" just in case.  We can drop this later, if we are comfortable with the changes.  The new toes will be labeled "toes".

In [80]:
corrections_config = {'01':{'action':'replace','pattern_b':" ",'replacement':"\"\""},
            '02':{'action':'replace','pattern_b':" ",'replacement':"-"},
            '03':{'action':'replace','pattern_b':"\'",'replacement':"\"\""},
            '04':{'action':'replace','pattern_b':"/",'replacement':"-"},
            '05':{'action':'save','pattern_b':np.nan,'replacement':np.nan},
            '06':{'action':'save','pattern_b':np.nan,'replacement':np.nan},
            '07':{'action':'save','pattern_b':np.nan,'replacement':np.nan},
            '08':{'action':'replace','pattern_b':"^0",'replacement':"\"\""},
            '09':{'action':'replace','pattern_b':'a','replacement':'-a'},
            '10':{'action':'replace','pattern_b':'b','replacement':'-b'},          
            '11':{'action':'replace','pattern_b':"a",'replacement':"a-"},
            '12':{'action':'replace','pattern_b':"b",'replacement':"b-"},
            '13':{'action':'replace','pattern_b':"[()]",'replacement':"\"\""}}

In [81]:
toe_errors_desc['action'] = toe_errors_desc.loc[toe_errors_desc.toe_pattern.str.len()==2].toe_pattern\
.map(lambda x: corrections_config[x]['action'],na_action='ignore')

toe_errors_desc['replacement'] = toe_errors_desc.loc[toe_errors_desc.toe_pattern.str.len()==2].toe_pattern\
.map(lambda x: corrections_config[x]['replacement'],na_action='ignore')

toe_errors_desc = toe_errors_desc.sort_values('toe_pattern').reset_index(drop=True)
toe_errors_desc

Unnamed: 0,toe_pattern,observations,description,action,replacement
0,01,39,".( {1,}-.|.- {1,}.)",replace,""""""
1,02,258,".( {,}\w{,} {1,}).",replace,-
2,05,2,"(\?{1,})",save,
3,09,1,a\w,replace,-a
4,13,1,[()],replace,""""""
5,Not covered by current patterns,5998,,,


In [82]:
for i in range(0,toe_errors_desc.shape[0]):
    tmp_pat_num = toe_errors_desc.iloc[i,0]
    tmp_pattern = toe_errors_desc.iloc[i,2]
    action = toe_errors_desc.iloc[i,3]
    tmp_replacement = toe_errors_desc.iloc[i,4]
    tmp_x = df.loc[df.toe_pattern==tmp_pat_num,:]
    
    if action =='save':
        tmp_filename = 'pattern'+tmp_pat_num+'.csv'
        tmp_x.to_csv(tmp_filename)
        print("Pattern {} successfully saved to {}.".format(tmp_pattern,tmp_filename))
    if action =='replace':
        df.loc[df.toe_pattern==tmp_pat_num,'toes'] = replace_pattern(x=df.loc[df.toe_pattern==tmp_pat_num]
                                                                     ,pattern = tmp_pat_num
                                                                     ,pattern_b = tmp_pattern
                                                                     ,source_col = 'toes'
                                                                    ,replacement = tmp_replacement)
        print("Pattern {} successfully replaced with {}.".format(tmp_pattern,tmp_replacement))
    else:
        print("No direction provided for pattern {}.  No action was taken.".format(tmp_pattern))

Pattern .( {1,}-.|.- {1,}.) successfully replaced with "".
Pattern .( {,}\w{,} {1,}). successfully replaced with -.
Pattern (\?{1,}) successfully saved to pattern05.csv.
No direction provided for pattern (\?{1,}).  No action was taken.
Pattern a\w successfully replaced with -a.
Pattern [()] successfully replaced with "".
No direction provided for pattern nan.  No action was taken.


Now we confirm that the patterns we expect to have eliminated have indeed been eliminated from the data set

In [83]:
for i in range(0,toe_pattern_reference.shape[0]):
    tmp_pattern = str(toe_pattern_reference.iloc[i,1])
    print(report_pattern(df,tmp_pattern,'toes','Post-Correction'))

Post-Correction:
toe pattern .( {1,}-.|.- {1,}.):0
Post-Correction:
toe pattern .( {,}\w{,} {1,}).:0
Post-Correction:
toe pattern .(').:0
Post-Correction:
toe pattern ./.:0
Post-Correction:
toe pattern (\?{1,}):2
Post-Correction:
toe pattern ^\d{3,}$:0
Post-Correction:
toe pattern .(-{2,}).:0
Post-Correction:
toe pattern ^0:0
Post-Correction:
toe pattern a\w:0
Post-Correction:
toe pattern b\w:0
Post-Correction:
toe pattern \wa:0
Post-Correction:
toe pattern \wb:0
Post-Correction:
toe pattern [()]:0


This is as expected since we left toe pattern 4 uncorrected.

<a id='sex'></a>

### sex
[Back to: Top](#TOC)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#ColbyCol)

Next we move on to cleaning the "sex" column.

First we want to get an idea of the types of problems in the sex column.  We start by striping leading and trailing whitespaces.  You can see here that there were none in the data set.

In [84]:
print(df.sex.str.len().unique())# returns unique lengths of sex
df.sex=df.sex.str.strip()
print(df.sex.str.len().unique())

[ 1. nan  2.  3.  4.  5.]
[ 1. nan  2.  3.  0.  4.  5.]


#### Identify non "m" or "f" values and their frequencies

In [85]:
patterns_sex="m|f|NA"
non_matches=df.sex.loc[df.sex.str.match(patterns_sex)!=True]
print("\nThere are {} entries for sex which do not match the patterns {}:"\
      .format(non_matches.shape[0],patterns_sex.split("|")))
non_matches.value_counts()


There are 3707 entries for sex which do not match the patterns ['m', 'f', 'NA']:


juv      122
F        112
M         88
?         15
?f         6
n          2
?m         1
unm        1
[m]        1
adult      1
???        1
           1
Name: sex, dtype: int64

#### Identify values to convert to NA, m, or f

In [86]:
sex2NA = ['adult','juv','nan','\?\?\?','\?']
sex2m = ['unm','M']
sex2f = ['F']
df.loc[df.sex.isin(sex2NA)==True]
print("There are {} entries that should be converted to 'NaN'".format(df.sex.isin(sex2NA).sum()))
print("There are {} entries that should be converted to 'm'".format(df.sex.isin(sex2m).sum()))
print("There are {} entries that should be converted to 'f'".format(df.sex.isin(sex2f).sum()))

There are 123 entries that should be converted to 'NaN'
There are 89 entries that should be converted to 'm'
There are 112 entries that should be converted to 'f'


#### Convert the values to NA, f, or m, respectively.

In [87]:
df.loc[df.sex.isin(sex2NA),'sex']=np.nan
df.loc[df.sex.isin(sex2m),'sex']='m'
df.loc[df.sex.isin(sex2f),'sex']='f'
print("Now there are {} entries that should be converted to 'NaN'".format(df.sex.isin(sex2NA).sum()))
print("Now there are {} entries that should be converted to 'm'".format(df.sex.isin(sex2m).sum()))
print("Now there are {} entries that should be converted to 'f'".format(df.sex.isin(sex2f).sum()))

Now there are 0 entries that should be converted to 'NaN'
Now there are 0 entries that should be converted to 'm'
Now there are 0 entries that should be converted to 'f'


#### Set all remaining sex with "?" to NaN

In [88]:
df.loc[(df.sex.str.contains('\?')) & (df.sex.notnull()),'sex'] = np.nan

<a id = 'species'></a>

### Species
[Back to: Top](#TOC)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#ColbyCol)

In [89]:
print(df.species.str.len().unique())# returns unique lengths of species
df.species=df.species.str.strip()
print(df.species.str.len().unique())

[ 2.  3.  5.  1. nan  4. 30.]
[ 2.  3.  5.  1. nan 30.]


In [90]:
df.species.value_counts(dropna = False)

sj                                3494
sv                                1161
j                                  720
uo                                 226
NaN                                204
v                                  180
Sj                                 119
cn                                  59
Sv                                  52
Uo                                  27
sc                                  22
Ae                                  11
as                                   8
?                                    3
??                                   2
cn ex                                2
ek                                   2
c                                    1
ce                                   1
sj?                                  1
sc?                                  1
Snake sighted at root crossing       1
cn2                                  1
up                                   1
Name: species, dtype: int64

In [91]:
patterns_species="j|v|sj|sv|NA"
idx_notsjsv = (df.species.str.match(patterns_species)!=True)&(df.species.str.contains('j|v',case=False)!=True)
non_matches=df.species.loc[idx_notsjsv]
print("\nThere are {} entries for species which do not match the patterns {} and are unlikely to be definitely \
'sv', 'sj':".format(non_matches.shape[0],patterns_species.split("|")))
non_matches.value_counts()


There are 572 entries for species which do not match the patterns ['j', 'v', 'sj', 'sv', 'NA'] and are unlikely to be definitely 'sv', 'sj':


uo                                226
cn                                 59
Uo                                 27
sc                                 22
Ae                                 11
as                                  8
?                                   3
ek                                  2
??                                  2
cn ex                               2
c                                   1
sc?                                 1
Snake sighted at root crossing      1
cn2                                 1
ce                                  1
up                                  1
Name: species, dtype: int64

We will set species for these entries to  'other'.

In [92]:
df.loc[df.species.isin(non_matches.unique()),'species'] = 'other'

<a id ='outstanding4'></a>
#### *Sceloporus jarrovii*
[Back to Outstanding Problems](#outstanding)

In [93]:
df.loc[df.species.str.contains('j',case=False),'species'].value_counts(dropna=False)

sj     3494
j       720
Sj      119
sj?       1
Name: species, dtype: int64

The values with '?' should be investigated.

In [94]:
idx_sjrev = df.species.str.contains('j',case=False)&(df.species.str.contains('\?'))
df.loc[idx_sjrev,'species'].value_counts(dropna=False)
df.loc[idx_sjrev,:]

Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern
270,sj?,4-10-19,m,2005-07-11 00:00:00,54,61,23,True,6.8,sb,-45,recap,painted,,w27a,,paint almost gone --> repainted,23,4-10-19,


In [95]:
idx_sj = df.species.str.contains('j',case=False)&(~df.species.str.contains('\?'))
df.loc[idx_sj,'species'].value_counts(dropna=False)

sj    3494
j      720
Sj     119
Name: species, dtype: int64

We will convert the others should be converted to 'j'.

In [96]:
df.loc[idx_sj,'species'] = 'j'

#### *Sceloporus virgatus*

In [97]:
idx_sv = df.species.str.contains('v',case=False)
df.loc[idx_sv,'species'].value_counts(dropna=False)

sv    1161
v      180
Sv      52
Name: species, dtype: int64

We will convert these to 'v'.

In [98]:
df.loc[idx_sv,'species'] = 'v'

<a id='newrecap'></a>

### new.recap
[Back to: Top](#TOC)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#ColByCol)

1. [potential new and recap](#newandrecap)

In [99]:
newRecapKeep = ['recap', 'new', 'r', 'n']
df.loc[~df['new.recap'].isin(newRecapKeep),'new.recap'].value_counts(dropna=False)

sighting                            2250
NaN                                 1247
NEW                                  625
sighted                              185
new                                  107
N                                     71
missed                                36
recap?                                 9
?                                      5
new?                                   3
recap                                  2
???                                    2
Dead                                   1
recap/new                              1
collected                              1
sighted                                1
sighing                                1
visual recapture                       1
recap ?                                1
didn't catch                           1
recapq                                 1
not caught                             1
heard                                  1
?sighting                              1
recap? - toes su

<a id = 'newandrecap'></a>
#### potential new and recap

[Back to: Top](#TOC)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#ColByCol)

[Back to: new.recap](#newrecap)

Let's identify cases that may be classified as recaptures or new captures 

In [100]:
idx_potnew = df['new.recap'].str.contains('n',case=False)==True
print([x for x in df.loc[idx_potnew,'new.recap'].value_counts(dropna=False).index])
df.loc[idx_potnew,'new.recap'].value_counts(dropna=False)

['sighting', 'NEW', 'new', 'new ', 'N', 'new?', 'New', 'recap? - toes suggest a NEW mark', 'not caught', '?sighting', 'recap/new', "didn't catch", 'sighing']


sighting                            2250
NEW                                  625
new                                  554
new                                  107
N                                     71
new?                                   3
New                                    1
recap? - toes suggest a NEW mark       1
not caught                             1
?sighting                              1
recap/new                              1
didn't catch                           1
sighing                                1
Name: new.recap, dtype: int64

'NEW', 'New', 'new ', 'new', and 'N' are certainly new captures and should be converted to 'N'.

In [101]:
idx_new = df['new.recap'].isin(['NEW', 'New', 'new ', 'new', 'N'])
df.loc[idx_new,'new.recap'] = 'N'
df.loc[idx_new,'new.recap'].value_counts(dropna = False)

N    1358
Name: new.recap, dtype: int64

<a id = 'resumehere'></a>

'new?', 'recap/new', and  'recap? - toes suggest a NEW mark' require closer inspection

In [102]:
idx_new2check = df['new.recap'].isin(['new?', 'recap/new', 'recap? - toes suggest a NEW mark'])
df.loc[idx_new2check,'new.recap'].value_counts(dropna=False)
df.loc[idx_new2check]

Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern
95,j,3-6-11-19,m,19viii2010,,,,,,1m v bottom rt curved wall,24.0,recap? - toes suggest a NEW mark,yes,,y:t,,B shedding; T shed; remarked,,3-6-11-19,
304,j,16,m,2013-07-11 00:00:00,73.0,30.0,24.0,True,8.4,downed juniper,220.0,recap/new,yes,,o46c,13-50,"toe loss natural?;Bss;Tss Head partially shed;nematodes excited body, on in vial",24.0,16,
418,j,15,f,2014-07-03 00:00:00,67.0,56.0,30.0,True,8.0,wall v bottom wall,83.0,new?,yes,,o4c,,Natural toe loss?; B shed Tss,30.0,15,
2661,v,13-14,m,2015-07-13 00:00:00,47.0,38.0,15.0,True,3.5,,,new?,yes,,w30c,,toes 13 & 14 appear natural loss; no new toes cut; no tissue,15.0,13-14,
6253,j,7,m,2004-07-22 00:00:00,39.0,52.0,0.0,False,2.0,sb 4m v bottom s-curve,276.0,new?,painted,,w162b,,did not cut any new toes,0.0,7,


In [103]:
#try using a dict to do thing more efficiently

new = ['new','n']
recap = ['recap','r']
df.loc[~df['new.recap'].isin(newRecapKeep),'new.recap'] = np.nan
df.loc[df['new.recap'].isin(new),'new.recap'] = 'new'
df.loc[df['new.recap'].isin(recap),'new.recap'] = 'recap'

<a id='CorrectingClass'></a>

## Correcting class of columns
[Top](#TOC)

[Top Cleaning](#CleaningData)

In [104]:
#We need to add real error handling into these conversion chunks

##Convert integer columns to int
intCols = ['meters']
df[intCols]=df[intCols].astype(int,errors='ignore')

##Convert numeric columns to numeric
numCols = ['svl','tl','rtl','mass']
df[numCols]=df[numCols].apply(pd.to_numeric,errors='coerce')

##Convert string columns to str
strCols = ['toes','sex','species','vial']
df[strCols]=df[strCols].astype(str, errors='ignore')

#Convert date to datetime
df.loc[df.date=="NA"]=np.nan
df.date = pd.to_datetime(df.date,errors='coerce')

##Convert bool columns to bool
# boolCols = ['review_sex','review_species','review_painted','review_new.recap',\
#             'review_rtl','forceMale','forceFemale','forceRecap','forceNew',\
#             'forceSighting','drop_species','drop_morphometrics','autotomized']
# df[boolCols]=df[boolCols].astype(bool, errors='ignore')

In [105]:
print("\nAfter applying the above changes, the data types are as follows:\n{}".format(df.dtypes))


After applying the above changes, the data types are as follows:
species                object
toes_orig              object
sex                    object
date           datetime64[ns]
svl                   float64
tl                    float64
rtl                   float64
autotomized            object
mass                  float64
location               object
meters                 object
new.recap              object
painted                object
sighting              float64
paint.mark             object
vial                   object
misc                   object
rtl_orig               object
toes                   object
toe_pattern            object
dtype: object


<a id='AddVar1'></a>

## Adding variables [*year*](#year) and [*rtl_orig*](#rtlorig)

<a id='year'></a>

### Year
[Back to: Top](#TOC)

[Back to: Adding variables](#AddVar1)

We will use data contained in the *date* column to create the variable *year*.  TO do this we will define a small function, *myint*, to convert year to an int type.

<a id='myint'></a>

In [106]:
def myint(x, verbose = False):
    try:
        x = str(x).split('.')[0]
    except:
        x = x
        if verbose == True:
            print('{} is of type {} and cannot be forced to int.'.format(x,type(x)))
    return x


Here is are a few examples of how [*myint*](#myint) works.

In [107]:
bar = [None, 1.0, "f"]
print([type(x) for x in bar])
[myint(x) for x in bar]

[<class 'NoneType'>, <class 'float'>, <class 'str'>]


['None', '1', 'f']

In [108]:
bar = [None, 2001.0, "2001.0"]
print([type(x) for x in bar])
[myint(x,True) for x in bar]

[<class 'NoneType'>, <class 'float'>, <class 'str'>]


['None', '2001', '2001']

Now we apply [*myint*](#myint) to the 'date' column to create the variable year and inspect the unique values.

In [109]:
df['year'] = df.date.dt.year.apply(myint,verbose=False)
df.year.value_counts(dropna=False)

2002    1477
2003    1017
2017     759
2001     681
2004     478
nan      235
2000     209
2005     202
2007     182
2006     163
2009     162
2015     147
2008     134
2013     106
2016     101
2014      97
2012      85
2011      64
Name: year, dtype: int64

Let's inspect the entries with 'nan' values.  Note these 'nan' values are string values and not NaN.

In [110]:
df.loc[df.year=='nan',:]

Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,year
0,j,2-6-12-15,m,NaT,80.0,110.0,29.0,True,20.0,10m v bottom bowl,-15.0,,yes,,y2c,03-10-cc,toe 15 missing at capture; possible recap,29.0,2-6-12-15,,
1,j,2-9-15-17,f,NaT,56.0,77.0,0.0,False,5.5,20m up CCC,240.0,,yes,,y62c,61-10-cc,Tss,0.0,2-9-15-17,,
2,j,3-6-11-17,m,NaT,50.0,68.0,0.0,False,4.0,1m vT at top R island,157.0,,yes,,y<c.t,,Bss; lost toes,0.0,3-6-11-17,,
3,j,3-6-15-16,f,NaT,72.0,62.0,46.0,True,11.0,halfway between pool and 2 falls 2m up rt side,385.0,,yes,,y65c,65-10-cc,,46.0,3-6-15-16,,
4,j,3-6-12-17,m,NaT,57.0,82.0,0.0,False,6.0,R outcrop ^ oak R,425.0,,yes,,y67c,66-10-cc,,0.0,3-6-12-17,,
5,j,3-6-15-17,m,NaT,52.0,70.0,0.0,True,4.5,r outcrop ^ oak R,425.0,,yes,,y68c,67-10-cc,Tss,-1.0,3-6-15-17,,
6,j,3-6-15-18,f,NaT,53.0,72.0,0.0,False,5.0,pine R,408.0,,yes,,y69c,68-10-cc,,0.0,3-6-15-18,,
7,j,3-6-15-19,f,NaT,74.0,100.0,0.0,False,12.5,5m v wall v wall v juniper xing,80.0,,yes,,y70c,69-10-cc,too easy to pass up,0.0,3-6-15-19,,
8,j,2-6-13-16,m,NaT,55.0,74.0,0.0,False,5.0,3m ^ 1 falls,3.0,,yes,,y71c,70-10-cc,,0.0,2-6-13-16,,
9,j,2-6-14-17,m,NaT,47.0,64.0,0.0,False,3.5,3m ^ 1 falls,3.0,,yes,,y÷c.t,71-10-cc,"possible Bss; mark is a yellow division symbol in the ""c""-position with a yellow dot in the ""t""-position",0.0,2-6-14-17,,


<a id='AddCol'></a>

# Adding New Columns
[Top](#TOC)

We need to add new columns which we will use later in analyses:
- [TL_SVL](#TlSvl)
- [Mass_SVL](#MassSvl)
- [Lizard Number](#LizardNumber)
     - [assign lizard numbers](#Assign) 
     - [QC the lizard numbers](#QcLizNum) 
- [Days Since Capture](#daysSinceCapture)
- [Number of Captures](#capture)

<a id= 'TlSvl'></a>

## TL_SVL 
[Top](#TOC)

[Top Add Columns](#AddCol)



In [111]:
df['tl_svl']=(df.tl/df.svl)

<a id='MassSvl'></a>

## Mass_SVL
[Top](#TOC)

[Top Add Columns](#AddCol)



In [112]:
df['mass_svl']=(df.mass/df.svl)

<a id= 'LizardNumber'></a>

## Lizard Number
[Top](#TOC)

[Top Add Columns](#AddCol)

Here we use a set of functions to:
 - [assign lizard numbers](#Assign) to unique individuals (we repeat this step to ensure we have assigned all animals a number) and 
 - [QC the numbers](#QcLizNum) assigned.

<a id='Assign'></a>

### Assign lizard numbers
[Top](#TOC)

[Top Add Columns](#AddCol)

We make a first attempt at assigning lizard numbers.  We use the *lizsort* function to identify the subset of rows from the original dataset which have sufficient information to allow us to make an automated decision about the uniqueness of the individuals identified in those rows.  We name that df *sortable*.  The unsortable data are saved to a path as a file, *unsortable.csv*.  

In [113]:
sortable = lizsort(df, path = sourceDataBig)  


There were 3683 entries for which values for one of the critical criteria, (['species', 'toes', 'sex', 'date', 'svl']), were null.      These entries could not be evaluated and were written out to the file unsortable.csv for evaluation.


Next we call the *mindate* function on *sortable*.  This identifies the earliest date at which each unique combination of *sortCriteria* are recorded in a new column, *initialCaptureDate*.  The default sortCriteria are of the variables *species*, *toes*, and *sex*.  This also calculates and adds a column for *year_diff*, the difference in years between the initial capture date and the date value in a given row. 

In [114]:
sortable = mindate(sortable)

Next we call a the function *smallest*, which is analogous to *mindate*, but groups data in *sortable* into unique combinations of *species*, *toes*, *sex*, and *initialCaptureDate* before assigning the smallest SVL value recorded for each group to a new column for that group, *smallest_svl*.  *smallest* then calculates a new column *svl_diff* which is analogous to *year_diff*.

In [115]:
sortable = smallest(sortable)

Next we call the *validate* function on *sortable*, which applies a series of validation tests to the data, sequentially numbers unique combinations of *sortCriteria* and returns a dict containing uniquely numbered individuals and summary data.

In [116]:
tmp_sort = validate(sortable)
df_numbered1 = tmp_sort['val_data']


Of those entries we can handle, there are 1568 individuals as defined by ['species', 'toes', 'sex'] which pass validation based    on ['year_diff <= 7', 'svl_diff >= -2'] and 154 rows which do not pass validation.


### Second attempt to assign lizard numbers

[Top](#TOC)

[Top Add Columns](#AddCol)

Here we make a second attempt at assigning lizard numbers to ensure that all lizards have been assigned.  This second attempt is focused on those rows which were unvalidated during the first attempt *n_val_data*.  Since these are already a subset fo those data which were sortabel, we need only call the *mindate*, *smallest*, and *validate* functions.

In [117]:
n_val = mindate(tmp_sort['n_val_data'])
n_val = smallest(n_val)
df_numbered2 = validate(n_val)['val_data']


Of those entries we can handle, there are 45 individuals as defined by ['species', 'toes', 'sex'] which pass validation based    on ['year_diff <= 7', 'svl_diff >= -2'] and 0 rows which do not pass validation.


Since no rows remain unvalidated, we will not attempt a third validation.  We will simply append *df_numbered1* and *df_numbered2* to create *df_numbered* to create our full numbered dataset.

In [118]:
df_numbered = df_numbered1.append(df_numbered2,ignore_index=True,sort=False)
print("df:{}\ndf_numbered1:{}\ndf_numbered2:{}\ndf_numbered:{}".format(df.shape,df_numbered1.shape,df_numbered2.shape,
                                                               df_numbered.shape))
df_numbered.head()

df:(6299, 23)
df_numbered1:(2616, 27)
df_numbered2:(154, 27)
df_numbered:(2770, 27)


Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,year,tl_svl,mass_svl,initialCaptureDate,year_diff,svl_diff,liznumber
0,v,05-12-16,m,2005-07-08,50.0,71.0,0.0,False,4.0,Rs in sb @ downed juniper,220,,painted,,w30a,05-36,,0,05-12-16,,2005,1.42,0.08,2005-07-08,0,0.0,1154
1,j,1-2,f,2005-07-22,38.0,51.0,0.0,False,1.4,sb at CC/CCC,240,,painted,,w^avc,05-81,,0,1-2,,2005,1.342105,0.036842,2005-07-22,0,0.0,74
2,v,1-4-9-16,m,2005-07-04,46.0,62.0,0.0,False,3.3,sb bottom wall below juniper xing,87,,painted,,w6a,05-11,no blue patches,0,1-4-9-16,,2005,1.347826,0.071739,2005-07-04,0,0.0,1159
3,other,2-3,m,2005-07-15,50.0,88.0,0.0,False,3.8,T at upper curved wall,40,,painted,,wLa,,check to see if really NEW; toes look to be natural loss and enough so that I didn't slip further; blue th and belly,0,2-3,,2005,1.76,0.076,2005-07-15,0,0.0,1054
4,other,2-6-11,m,2005-07-08,112.0,115.0,0.0,False,35.5,lft side sb 3m^ lizard R,136,,painted,,w.c,05-34,ran to dead double oak 12 m up wall during capture,0,2-6-11,,2005,1.026786,0.316964,2005-07-08,0,0.0,1056


<a id='QcLizNum'></a>

### QC of lizard numbers
[Top](#TOC)

[Top Add Columns](#AddCol)

First we display the output data frame.

In [119]:
df_numbered

Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,year,tl_svl,mass_svl,initialCaptureDate,year_diff,svl_diff,liznumber
0,v,05-12-16,m,2005-07-08,50.0,71.0,0.0,False,4.0,Rs in sb @ downed juniper,220,,painted,,w30a,05-36,,0.0,05-12-16,,2005,1.42,0.08,2005-07-08,0,0.0,1154
1,j,1-2,f,2005-07-22,38.0,51.0,0.0,False,1.4,sb at CC/CCC,240,,painted,,w^avc,05-81,,0.0,1-2,,2005,1.342105,0.036842,2005-07-22,0,0.0,74
2,v,1-4-9-16,m,2005-07-04,46.0,62.0,0.0,False,3.3,sb bottom wall below juniper xing,87,,painted,,w6a,05-11,no blue patches,0.0,1-4-9-16,,2005,1.347826,0.071739,2005-07-04,0,0.0,1159
3,other,2-3,m,2005-07-15,50.0,88.0,0.0,False,3.8,T at upper curved wall,40,,painted,,wLa,,check to see if really NEW; toes look to be natural loss and enough so that I didn't slip further; blue th and belly,0.0,2-3,,2005,1.76,0.076,2005-07-15,0,0.0,1054
4,other,2-6-11,m,2005-07-08,112.0,115.0,0.0,False,35.5,lft side sb 3m^ lizard R,136,,painted,,w.c,05-34,ran to dead double oak 12 m up wall during capture,0.0,2-6-11,,2005,1.026786,0.316964,2005-07-08,0,0.0,1056
5,other,2-6-12,m,2005-07-15,69.0,94.0,0.0,False,11.0,oakT & sb rt side,164,,painted,,w.t,05-58,eating a large moth [photos],0.0,2-6-12,,2005,1.362319,0.15942,2005-07-15,0,0.0,1057
6,j,3-10-11-16,f,2005-07-09,70.0,108.0,0.0,False,12.4,sb,-45,,painted,,w47c,05-37,Bss; Tss; Pt on venter,0.0,3-10-11-16,,2005,1.542857,0.177143,2005-07-09,0,0.0,319
7,j,3-10-11-18,f,2005-07-11,30.0,42.0,0.0,False,0.8,sb v wall v wall v juniper xing,83,,painted,,w.c,05-48,,0.0,3-10-11-18,,2005,1.4,0.026667,2005-07-11,0,0.0,322
8,j,3-10-11-19,f,2005-07-11,35.0,46.0,0.0,False,1.2,sb 3m v H0,111,,painted,,w|t,05-49,T recently shed; shed cloacal skin still attached; Bss,0.0,3-10-11-19,,2005,1.314286,0.034286,2005-07-11,0,0.0,324
9,j,3-10-11-20,m,2005-07-11,34.0,46.0,0.0,False,1.1,top s-curve,280,,painted,,w-t,05-51,"prefemoral mites on both sides; T rec shed; B?, probably ss",0.0,3-10-11-20,,2005,1.352941,0.032353,2005-07-11,0,0.0,325


Identify individuals that have same species and toes, but different sex for review

In [120]:
df_numbered = df_numbered.merge(df_numbered.groupby(['species','toes']).sex.nunique().reset_index()\
                       .rename(columns = {'sex':'sex_count'}),how = 'inner', on = ['species','toes'])
df_numbered.loc[df_numbered.sex_count>1,:].to_csv('entries flagged with same species and toes diff sex.csv')
print("{} rows have the same species and toes but different values for sex"\
      .format(df_numbered.loc[df_numbered.sex_count>1,:].shape[0]))
df_numbered.head()

786 rows have the same species and toes but different values for sex


Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,year,tl_svl,mass_svl,initialCaptureDate,year_diff,svl_diff,liznumber,sex_count
0,v,05-12-16,m,2005-07-08,50.0,71.0,0.0,False,4.0,Rs in sb @ downed juniper,220,,painted,,w30a,05-36,,0,05-12-16,,2005,1.42,0.08,2005-07-08,0,0.0,1154,1
1,j,1-2,f,2005-07-22,38.0,51.0,0.0,False,1.4,sb at CC/CCC,240,,painted,,w^avc,05-81,,0,1-2,,2005,1.342105,0.036842,2005-07-22,0,0.0,74,2
2,j,1-2,m,2001-03-18,61.0,66.0,42.0,True,7.6,r wall at juniper xing,113,,,,w..a,toes in vial 01-13,,42,1-2,,2001,1.081967,0.12459,2001-03-18,0,0.0,75,2
3,v,1-4-9-16,m,2005-07-04,46.0,62.0,0.0,False,3.3,sb bottom wall below juniper xing,87,,painted,,w6a,05-11,no blue patches,0,1-4-9-16,,2005,1.347826,0.071739,2005-07-04,0,0.0,1159,1
4,other,2-3,m,2005-07-15,50.0,88.0,0.0,False,3.8,T at upper curved wall,40,,painted,,wLa,,check to see if really NEW; toes look to be natural loss and enough so that I didn't slip further; blue th and belly,0,2-3,,2005,1.76,0.076,2005-07-15,0,0.0,1054,1


In [121]:
print("Lizard Numbers in the sample range from {} to {}."\
      .format(df_numbered.liznumber.min(),df_numbered.liznumber.max()))

Lizard Numbers in the sample range from 1 to 1568.


In [122]:
possibleLizNum = set(range(int(df_numbered.liznumber.min()),int(df_numbered.liznumber.max())))
actualLizNum = set(pd.Series(df_numbered.liznumber.unique()).dropna().apply(int))
print("\nThere are {} entries.  There are {} unique lizard numbers.\
\n\nThe liznumber ranges from {} to {}."\
  .format(df_numbered.shape[0],len(df_numbered.liznumber.unique())\
          ,df_numbered.liznumber.min(),df_numbered.liznumber.max()))

missingLizNum = possibleLizNum - actualLizNum
if len(missingLizNum)>0:
    print("\n\nThe following numbers are not assigned to a lizard:\n{}"\
      .format(missingLizNum))
else:
    print("\n\nThere are no numbers which were not assigned.")


There are 2770 entries.  There are 1568 unique lizard numbers.

The liznumber ranges from 1 to 1568.


There are no numbers which were not assigned.


<a id='daysSinceCapture'></a>

### Days Since Capture
[Top](#TOC)

[Top Add Columns](#AddCol)

*daysSinceCapture* identifies the number of days since the animal was captured

<a id='capture'></a>

In [123]:
df_numbered.loc[:,'daysSinceCapture'] = (df_numbered.date - df_numbered.initialCaptureDate).dt.days

### Capture Number
[Top](#TOC)

[Top Add Columns](#AddCol)

*capture* identifies the number of times an animal has been captured prior to an entry.
We will need to [QC capture](#QcCapture) as well.

In [124]:
# need to QC this this seems to be leading to several cases in which recap individuals that 
# only have one capture
df_numbered['capture'] = df_numbered.sort_values(['liznumber','date'])\
.groupby(['liznumber']).daysSinceCapture.cumcount()+1

In [125]:
print(df_numbered.loc[df_numbered.species.isin(['j','v'])].groupby('capture').capture.count())

capture
1     1438
2      429
3      182
4       81
5       41
6       20
7       13
8       10
9        8
10       8
11       8
12       8
13       8
14       8
15       8
16       8
17       8
18       7
19       7
20       7
21       7
22       7
23       6
24       6
25       6
26       6
27       5
28       5
29       5
30       5
31       5
32       4
33       4
34       4
35       4
36       4
37       4
38       4
39       4
40       4
41       4
42       4
43       4
44       4
45       4
46       4
47       4
48       4
49       4
50       4
51       4
52       4
53       4
54       4
55       4
56       4
57       4
58       3
59       3
60       3
61       3
62       3
63       3
64       3
65       3
66       3
67       3
68       3
69       2
70       2
71       2
72       2
73       2
74       2
75       2
76       2
77       2
78       2
79       2
80       2
81       2
82       2
83       1
84       1
85       1
86       1
Name: capture, dtype: int64


<a id='yearstoolarge'></a>

### years too large
[Top](#TOC)

In [126]:
yeartoomuch = df_numbered.loc[df_numbered.year_diff>=5,'liznumber']
checkyears = df_numbered.loc[df_numbered.liznumber.isin(yeartoomuch)].sort_values(['liznumber'])
checkyears.to_csv('check years.csv')
checkyears

Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,year,tl_svl,mass_svl,initialCaptureDate,year_diff,svl_diff,liznumber,sex_count,daysSinceCapture,capture
1072,j,1-13-19,f,2000-03-17,52.0,74.0,0.0,False,4.2,1falls,,,,,r1c,,,0,1-13-19,,2000,1.423077,0.080769,2000-03-17,0,0.0,51,1,0,1
1073,j,1-13-19,f,2000-03-17,53.0,69.0,0.0,False,5.0,Rs opp slab,,recap,,,r18c,,shed since last recapture,0,1-13-19,,2000,1.301887,0.09434,2000-03-17,0,1.0,51,1,0,2
1074,j,1-13-19,f,2000-06-24,63.0,93.0,0.0,False,6.7,halfway between 1 falls and cave trail,,,,,o11a,,,0,1-13-19,,2000,1.47619,0.106349,2000-03-17,0,11.0,51,1,99,3
1075,j,1-13-19,f,2001-07-13,79.0,108.0,0.0,False,14.6,15m ^ 1falls,16.0,recap,,,r31a,shed since; Tss,,0,1-13-19,,2001,1.367089,0.18481,2000-03-17,1,27.0,51,1,483,4
1076,j,1-13-19,f,2008-07-18,68.0,86.0,0.0,False,9.3,H3/H4,186.0,recap,painted,,y11a.t,,y11 paintmark v faded; painted on top of faded mark; slight Bss,0,1-13-19,,2008,1.264706,0.136765,2000-03-17,8,16.0,51,1,3045,5
1418,j,2-9-15-17,f,2012-05-27,81.0,108.0,0.0,False,22.0,top CCC,,recap,yes,,w32c,,salmon T,0,2-9-15-17,,2012,1.333333,0.271605,2007-07-05,5,15.0,311,1,1788,2
1419,j,2-9-15-17,f,2007-07-05,66.0,90.0,0.0,False,8.2,@ H3,177.0,,yes,,y13c,07-15,BSS,0,2-9-15-17,,2007,1.363636,0.124242,2007-07-05,0,0.0,311,1,0,1
535,j,2-9-15-19,f,2007-07-06,65.0,87.0,0.0,False,9.8,left wall stream bed @ 12,12.0,,yes,,y18c,07-17,,0,2-9-15-19,,2007,1.338462,0.150769,2007-07-06,0,0.0,313,1,0,1
533,j,2-9-15-19,f,2013-07-04,77.0,101.0,0.0,False,14.3,large R left sb (-45),-45.0,recap,yes,,o17c,,pale white tail,0,2-9-15-19,,2013,1.311688,0.185714,2007-07-06,6,12.0,313,1,2190,3
534,j,2-9-15-19,f,2012-05-27,75.0,102.0,0.0,False,10.7,2m^bottom site,,recap,yes,,w26c,,just recently dropped babies,0,2-9-15-19,,2012,1.36,0.142667,2007-07-06,5,10.0,313,1,1787,2


In [127]:
jarrovii = go.Histogram(x = df_numbered.loc[df_numbered.species.isin(['j'])].groupby('liznumber')\
                     .year_diff.max(),name = 'S. jarrovii')
virgatus = go.Histogram(x = df_numbered.loc[df_numbered.species.isin(['v'])].groupby('liznumber')\
                     .year_diff.max(), name = 'S. virgatus')
data = [jarrovii, virgatus]
layout = go.Layout(
    title = 'Number of Individuals by Years Between First and Last Capture 2000-2017',
    titlefont = dict(
        size = 20),
    xaxis = dict(
        dtick = 1,
        title = 'Maximum Number of Years Since Initial Capture',
        titlefont = dict(
            size = 18)),
    yaxis = dict(
        title = 'Number of Lizards',
        titlefont = dict(
            size = 18))
)
fig = go.Figure(
        data = data,
        layout = layout)
py.iplot(fig, filename = 'Frequency of Captures in Crystal Creek 2000 - 2017 (by species)')

In [128]:
# Freeze work on this figure until we've resolved issues with calculation based on year
# ADD HORIZONTAL LINES FOR EACH YEAR
j_lizards = go.Scatter(x = df_numbered.loc[df_numbered.species.isin(['j'])].liznumber,
                   y = df_numbered.loc[df_numbered.species.isin(['j'])]\
                      .groupby('liznumber').daysSinceCapture.max(), 
                     mode = 'markers', name='S. jarrovii')
v_lizards = go.Scatter(x = df_numbered.loc[df_numbered.species.isin(['v'])].liznumber,
                   y = df_numbered.loc[df_numbered.species.isin(['v'])]\
                      .groupby('liznumber').daysSinceCapture.max(), 
                     mode = 'markers', name='S. virgatus')
# year1 = go.Scatter(x=[df_numbered.liznumber.min(),df_numbered.liznumber.max()],y = (365))
# year2 = go.Scatter(y = 365*2)
# year3 = go.Scatter(y = 365*3)
# year4 = go.Scatter(y = 365*4)
# year5 = go.Scatter(y = 365*5)
# year6 = go.Scatter(y = 365*6)
# year7 = go.Scatter(y = 365*7)
# year8 = go.Scatter(y = 365*8)

# data = [j_lizards, v_lizards, year1, year2, year3, year4, year5, year6, year7, year8]
data = [j_lizards, v_lizards]
layout = go.Layout(
    title = 'Days Since Initial Capture in Crystal Creek 2000 - 2017',
        titlefont = dict(
            size = 20),
    xaxis = dict(
            title='Lizard Number',
            titlefont=dict(
                size=18)),
    yaxis = dict(
            title='Greatest Number of Days Since<br> Initial Capture',
            titlefont=dict(
                size=18)))

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename = 'Days Since Initial Capture in Crystal Creek 2000 - 2017')

In [129]:
dfF = df_numbered.loc[(df_numbered.sex =='f' )& (df_numbered.species.isin(['j','v']))]
dfM = df_numbered.loc[(df_numbered.sex =='m') & (df_numbered.species.isin(['j','v']))]

In [130]:
# Freeze work on this figure until we've resolved issues with calculation based on year
females = go.Scatter(
    x = dfF.liznumber,
    y = dfF.groupby('liznumber').daysSinceCapture.max(),
    name = 'females',
    mode = 'markers',
    marker = dict(
        color = 'rgba(152, 0, 0, .8)',
        opacity = 0.75,
        line = dict(
            width = 2,
            color = 'rgb(0, 0, 0)'
        )
    )
)

males = go.Scatter(
    x = dfM.liznumber,
    y = dfM.groupby('liznumber').daysSinceCapture.max(),
    name = 'males',
    mode = 'markers',
    marker = dict(
        color = 'rgba(255, 182, 193, .9)',
        opacity = 0.75,
        line = dict(
            width = 2,
        )
    )
)

data = [females, males]

layout = dict(title = 'Days Since Initial Capture in Crystal Creek 2000 - 2017 By Sex',
              yaxis = dict(
                  title='Greatest Number of Days Since<br> Initial Capture',
                  titlefont=dict(
                      size=18)
              ),
              xaxis = dict(zeroline = False)
             )

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='Days Since Initial Capture in Crystal Creek 2000 - 2017 By Sex')

In [131]:
# Something is wrong with 2006 and 2011 data.  Try grouping data by lizard numbers to address high numbers.
# Freeze work on this figure until we've resolved issues with calculation based on year
# Capture rate between Males and Females does not appear to be significantly different even before
# statistical analysis
males = go.Histogram(x = df_numbered.loc[(df_numbered.sex == 'm')& (df_numbered.species.isin(['j','v']))
                                                                    ,'year']
                     ,opacity= 0.75,name='males')
females = go.Histogram(x = df_numbered.loc[(df_numbered.sex == 'f')& (df_numbered.species.isin(['j','v']))
                                                                      ,'year']
                       , opacity= 0.75, name = 'females')
data = [males,females]
py.iplot(data, filename = 'Distribution of Sex by Year in Crystal Creek 2000 - 2017')

In [132]:
df_numbered.head()

Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,year,tl_svl,mass_svl,initialCaptureDate,year_diff,svl_diff,liznumber,sex_count,daysSinceCapture,capture
0,v,05-12-16,m,2005-07-08,50.0,71.0,0.0,False,4.0,Rs in sb @ downed juniper,220,,painted,,w30a,05-36,,0,05-12-16,,2005,1.42,0.08,2005-07-08,0,0.0,1154,1,0,1
1,j,1-2,f,2005-07-22,38.0,51.0,0.0,False,1.4,sb at CC/CCC,240,,painted,,w^avc,05-81,,0,1-2,,2005,1.342105,0.036842,2005-07-22,0,0.0,74,2,0,1
2,j,1-2,m,2001-03-18,61.0,66.0,42.0,True,7.6,r wall at juniper xing,113,,,,w..a,toes in vial 01-13,,42,1-2,,2001,1.081967,0.12459,2001-03-18,0,0.0,75,2,0,1
3,v,1-4-9-16,m,2005-07-04,46.0,62.0,0.0,False,3.3,sb bottom wall below juniper xing,87,,painted,,w6a,05-11,no blue patches,0,1-4-9-16,,2005,1.347826,0.071739,2005-07-04,0,0.0,1159,1,0,1
4,other,2-3,m,2005-07-15,50.0,88.0,0.0,False,3.8,T at upper curved wall,40,,painted,,wLa,,check to see if really NEW; toes look to be natural loss and enough so that I didn't slip further; blue th and belly,0,2-3,,2005,1.76,0.076,2005-07-15,0,0.0,1054,1,0,1


In [133]:
column_order = ['liznumber','date','initialCaptureDate',]

In [134]:
df.year.value_counts(dropna=False).reset_index()

Unnamed: 0,index,year
0,2002.0,1477
1,2003.0,1017
2,2017.0,759
3,2001.0,681
4,2004.0,478
5,,235
6,2000.0,209
7,2005.0,202
8,2007.0,182
9,2006.0,163


<a id='QcCapture'></a>

### QC of Capture number and Recap status
[Top](#TOC)

[Top Add Columns](#AddCol)

[Top Capture Number](#capture)

In [135]:
recapQuestion=df_numbered\
.loc[(df_numbered.capture==1 )&(df_numbered['new.recap']=='recap')&(df_numbered.species.isin(['j','v'])),:]
print("There are {} instances in rows for which a lizard appears to have only one capture, \
but is listed as a recap.\
The distribution of these across years in the sample is as follows:\n{}."\
      .format(recapQuestion.shape[0],recapQuestion.year.value_counts()))
recapQuestion.to_csv("Questionable recaptures.csv")#These individuals need to be rechecked in the raw notes
recapQuestion.head()

There are 385 instances in rows for which a lizard appears to have only one capture, but is listed as a recap.The distribution of these across years in the sample is as follows:
2002    52
2009    49
2003    40
2004    37
2000    33
2005    32
2017    22
2007    22
2001    20
2012    19
2015    17
2013    15
2008    14
2016     7
2014     5
2011     1
Name: year, dtype: int64.


Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,year,tl_svl,mass_svl,initialCaptureDate,year_diff,svl_diff,liznumber,sex_count,daysSinceCapture,capture
47,j,3-10-15-20,f,2007-07-14,73.0,82.0,20.0,True,13.3,R left side SB 4m v pine R,408,recap,yes,,y50c,,rec shed; lose scales,20,3-10-15-20,,2007,1.123288,0.182192,2007-07-14,0,0.0,350,2,0,1
120,v,5-12-19,m,2007-07-07,55.0,59.0,0.0,False,6.2,right side SB S curve,280,recap,yes,,y10a,,B and TSS,0,5-12-19,,2007,1.072727,0.112727,2007-07-07,0,0.0,1395,2,0,1
256,v,8-11-20,m,2002-03-19,52.0,74.0,0.0,False,5.0,5m ^ CC/CCC on rt side,245,recap,"old paint, repainted over",,y20b,,"not legible, however",0,,2.0,2002,1.423077,0.096154,2002-03-19,0,20.0,1567,3,0,1
257,v,8-12-20,f,2002-07-19,58.0,71.0,0.0,False,5.7,sb at R outcrop at 425 on rt side,425,recap,gravid,,w43a,,,0,,2.0,2002,1.224138,0.098276,2002-07-19,0,19.0,1566,3,0,1
325,v,7,f,2003-07-23,58.0,76.0,0.0,False,7.2,sb 1m ^ 1falls,1,recap,painted,,wUb,,B and T ss,0,7,,2003,1.310345,0.124138,2003-07-23,0,0.0,1546,1,0,1


In [136]:
recapQuestion.loc[recapQuestion.svl<54,:]

Unnamed: 0,species,toes_orig,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc,rtl_orig,toes,toe_pattern,year,tl_svl,mass_svl,initialCaptureDate,year_diff,svl_diff,liznumber,sex_count,daysSinceCapture,capture
256,v,8-11-20,m,2002-03-19,52.0,74.0,0.0,False,5.0,5m ^ CC/CCC on rt side,245.0,recap,"old paint, repainted over",,y20b,,"not legible, however",0.0,,2.0,2002,1.423077,0.096154,2002-03-19,0,20.0,1567,3,0,1
350,v,1-7-16-(18)-19-20,m,2005-07-05,52.0,53.0,38.0,True,4.2,R v stump @ top of chute,362.0,recap,painted,,w15a.c,,posterior 1/2 of body and tail shed; anterior Bss; sighted before lunch but captured after,38.0,1-7-16-(18)-19-20,,2005,1.019231,0.080769,2005-07-05,0,0.0,1198,1,0,1
363,v,13-14,m,2005-07-06,53.0,63.0,0.0,False,4.6,sb just opp top pyramidR,377.0,recap,painted,,w19a,,recapture questionable; toes look like predation loss rather than being cut,0.0,13-14,,2005,1.188679,0.086792,2005-07-06,0,6.0,1249,1,0,1
477,v,4-6-17,f,2005-07-13,53.0,69.0,0.0,False,3.7,sb left side at H3,177.0,recap,painted,,w51a,,dropped recently; empty abdomen; blue spots on throat with faint orangering,0.0,4-6-17,,2005,1.301887,0.069811,2005-07-13,0,0.0,1361,1,0,1
480,v,4-7-13,f,2004-07-12,53.0,72.0,0.0,False,5.2,left sb @ flat R,208.0,recap,painted,,w6c,,shed since last capture; loose scales T & H,0.0,4-7-13,,2004,1.358491,0.098113,2004-07-12,0,0.0,1366,1,0,1
498,j,4-9-19,f,2005-07-20,51.0,58.0,0.0,False,3.9,sb rt side,-27.0,recap,painted,,w61a,,kink in T at 21; blue spots on throat; no orange; compare with w27a wrt last years toes,0.0,4-9-19,,2005,1.137255,0.076471,2005-07-20,0,0.0,517,1,0,1
514,v,7-9,m,2005-07-09,52.0,61.0,0.0,False,5.2,Rwall rt at H5,202.0,recap,painted,,w35a,,Bss; Tshed,0.0,7-9,,2005,1.173077,0.1,2005-07-09,0,0.0,1461,1,0,1
983,v,10-11-19,m,2014-07-04,51.0,71.0,0.0,False,5.0,juniper xing,115.0,recap,yes,,o5a,,Bss,0.0,10-11-19,,2014,1.392157,0.098039,2014-07-04,0,0.0,1225,1,0,1
1067,j,1-12-18,f,2000-03-18,50.0,68.0,0.0,False,3.7,3m^chute on Rs,,recap,,,r36c,,shed since last recapture,0.0,1-12-18,,2000,1.36,0.074,2000-03-18,0,0.0,40,1,0,1
1140,j,1-9-13-16,f,2000-03-18,50.0,76.0,0.0,False,6.5,wall at H5,,recap,,,r27c,"marked originally as 1-13-16 on 9VII99, remarked as 1-9-13-16 on 19III00",shed since last recapture,0.0,1-9-13-16,,2000,1.52,0.13,2000-03-18,0,0.0,112,2,0,1


<a id='exportFinal'></a>

# Export Cleaned data
[Top](#TOC)

Now we export the cleaned data to a csv.

In [137]:
df_numbered = df_numbered.rename(index = str, columns = {'new.recap':'newRecap'})
qc_drop_cols = df_numbered.columns[df_numbered.columns.str.contains('force|drop')]
df_full = df_numbered.drop(qc_drop_cols,1)

In [138]:
timestamp = (pd.to_datetime('now')-pd.Timedelta(hours=4))
timestamp = str(timestamp)[:-10].replace(':','hrs')+'min'
#path=''C:\\Users\\Christopher\\Google Drive\\TailDemography\\outputFiles\\''
# path=outputBig
filename = 'cleaned CC data 2000-2017_' + timestamp+ '.csv'
# filename = path + '/cleaned CC data 2000-2017' + '.csv'
df_full.to_csv(filename,index = False)
filename

'cleaned CC data 2000-2017_2019-01-07 21hrs27min.csv'