# Cleaning CC data

This python notebook operates on a csv created after editing in open refine and is designed to finish cleaning columns of interest which were easier to clean in python.

# [Resume Here](#resumehere)

<a id='TOC'></a>

# Table of Contents

1. [Setting up Python](#SettingUp)
    
    1. [Setting the Location](#SettingLoc)
    
    2. [Importing Data](#ImportingData)
    
    3. [Preparing for a Save](#PreparingSave)
    
4. [Functions](#Functions)
    
2. [Inspecting the Data](#InspectingData)
3. [Cleaning Data](#CleaningData)
    1. [Column-by-Column Cleaning](#ColbyCol)
        1. [rtl](#rtl)
        2. [tl](#tl)
        3. [svl](#svl)
        4. [autotomized](#autotomized)
        5. [toes](#toes)
        6. [sex](#sex)
        7. [new.recap](#newrecap)
    2. [Correcting class of columns](#CorrectingClass)
    
4. [Adding Columns](#AddCol)

    1. [TL_SVL](#TlSvl)
    
    2. [Mass_SVL](#MassSvl)
    
    3. [Lizard Number](#LizardNumber)

5. [Export Cleaned Data](#exportFinal)

<a id='SettingUp'></a>

# Setting up Python

[Top](#TOC)

Here we import necessary packages. 
This chunk may take a while.

In [1]:
import pandas as pd
import numpy as np
import os
from liz_number import lizsort,mindate,smallest,validate
from liz_toes import make_str,label_pattern, replace_pattern,report_pattern

import plotly
import plotly.plotly as py
import plotly.graph_objs as go

plotly.tools.set_config_file(world_readable=True)

# increase print limit
pd.options.display.max_rows = 99999
pd.options.display.max_columns = 50

<a id='SettingLoc'></a>

## Setting the location
[Top](#TOC)

These chunks identify the locations from which we can get data and to which we can save data.

## Source Data
Source files can be found in the following locations:

In [2]:
sourceDataPers = 'C:/Users/Christopher/Google Drive/TailDemography/Cleaned Combined Data'
sourceDataBig = 'S:/Chris/TailDemography/TailDemography/Cleaned Combined Data'
# sourceBlack = 'C:/Users/test/Desktop'

### Intermediate Source Data
Intermediate files can be found in the following locations:

In [3]:
sourceInterDataPers = 'C:/Users/Christopher/Google Drive/TailDemography/Intermediate Files/DeepCleaning'
sourceinterDataBig = 'S:/Chris/TailDemography/TailDemography/Intermediate Files/DeepCleaning'
# sourceBlack = 'C:/Users/test/Desktop'

## Output Data paths
Outputfiles can be found in the following locations:

In [4]:
outputPers = 'C:/Users/Christopher/Google Drive/TailDemography/outputFiles'
outputBig = 'S:/Chris/TailDemography/TailDemography/outputFiles'
# outputBlack = 'C:/Users/test/Desktop'

<a id='ImportingData'></a>

## Importing data
[Top](#TOC)

Here we import data from one of the available locations

In [5]:
os.chdir(sourceDataBig)
df=pd.read_csv('Appended and Trimmed CC Data 2000-2017_2019-01-01 14hrs11min.csv')
df.head()

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
0,sj,2-6-11-16,m,2001-03-18 00:00:00,76,80,1,,13.7,r wall at juniper xing,113,NEW,,,w.c,toes in vial 01-1,
1,sj,2-6-11-18,f,2001-03-18 00:00:00,82,109,0,,17.5,r wall at juniper xing,113,NEW,,,w.b,toes in vial 01-2,
2,sj,2-6-11-19,f,2001-03-18 00:00:00,58,69,-1,,8.5,r wall at juniper xing,113,NEW,,,w.a,toes in vial 01-3,
3,sj,2-6-11-20,m,2001-03-18 00:00:00,65,91,0,,9.2,r wall at juniper xing,113,NEW,,,w-a,toes in vial 01-4,
4,sj,2-6-11-19-20,f,2001-03-18 00:00:00,58,76,0,,7.8,r wall at juniper xing,113,NEW,,,w-b,toes in vial 01-5,


<a id='PreparingSave'></a>

## Preparing for a save
[Top](#TOC)

Now we change the working directory so that inermediate files are saved to our preferred location.

In [6]:
os.chdir(sourceDataBig)
df.head()

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
0,sj,2-6-11-16,m,2001-03-18 00:00:00,76,80,1,,13.7,r wall at juniper xing,113,NEW,,,w.c,toes in vial 01-1,
1,sj,2-6-11-18,f,2001-03-18 00:00:00,82,109,0,,17.5,r wall at juniper xing,113,NEW,,,w.b,toes in vial 01-2,
2,sj,2-6-11-19,f,2001-03-18 00:00:00,58,69,-1,,8.5,r wall at juniper xing,113,NEW,,,w.a,toes in vial 01-3,
3,sj,2-6-11-20,m,2001-03-18 00:00:00,65,91,0,,9.2,r wall at juniper xing,113,NEW,,,w-a,toes in vial 01-4,
4,sj,2-6-11-19-20,f,2001-03-18 00:00:00,58,76,0,,7.8,r wall at juniper xing,113,NEW,,,w-b,toes in vial 01-5,


<a id= 'InspectingData'></a>

<a id='Functions'></a>

# Functions
[Back to: Top](#TOC)

1. [appendstr](#appenstr)

<a id = 'appendstr'></a>

In [7]:
def appendstr(x, value, connector = '', position='end'):
    """
    appends *value* and *x* separated by a *connector* with the position of *val* determined by *position*
    :param x:
    :param value:
    :param connector:
    :param position:
    """
    assert((isinstance(x,str)|(x is None)|(x!=x))),"x must be str type, NoneType or NaN: x is {} type."\
    .format(type(x))
    if ((x!=x)|(x is None)):
        x=''
    assert(isinstance(value,str)),"value must be str type: value is {} type.".format(type(value))
    assert(isinstance(connector,str))\
    , "connector must be str or None type, not {} type.".format(type(connector))
    assert(isinstance(position,(str,int))), "position must be either str or int type, not {}."\
           .format(type(position))           
    if isinstance(position,str):
        assert(position in ['start','end']), "If position is str type, it must be either 'start' or 'end'."
        positiondict = {'start':0,'end':len(x)}
        position = positiondict[position]
    if isinstance(position,int):
        assert(position in range(0,1+len(x)))\
        , "If position is int type, it must be a value in the range 0 through {}.".format(len(x))
    prefix = x[:position]
    suffix = x[position:]
    if len(x)==0:
        res = value
    else:
        if position == 0:
            res = prefix+value+connector+suffix
        if position == len(x):
            res = prefix+connector+value+suffix
        if (position>0&position<1):
            res = prefix+connector+value+connector+suffix

    return res
    

Here's an example of how *appendstr* works.

In [8]:
foo='bar'
appendstr(foo,'test',connector='_',position=1)

'b_test_ar'

In [9]:
appendstr(foo,'test',connector='_',position=1)

'b_test_ar'

In [10]:
appendstr(None,'test',connector='_',position='end')

'test'

In [11]:
appendstr(None,'test',position=0)

'test'

## Inspecting the Data
[Top](#TOC)

Let's take a look at the data.

In [12]:
print("\nThere are {} data points in our data set.".format(df.shape[0]))


There are 6597 data points in our data set.


In [13]:
print("\nThe columns in the data have the following data types:\n{}".format(df.dtypes))


The columns in the data have the following data types:
species         object
toes            object
sex             object
date            object
svl             object
tl              object
rtl             object
autotomized    float64
mass            object
location        object
meters          object
new.recap       object
painted         object
sighting       float64
paint.mark      object
vial            object
misc            object
dtype: object


<a id= 'CleaningData'></a>

# Cleaning the Data
[Back to: Top](#TOC)

Now we get to the actual cleaning of the data.  We will inspect the data and take the appropriate cleaning steps:
1. [Column-by-Column Cleaning](#ColbyCol)

2. [Correcting class of columns](#CorrectingClass)

<a id='ColbyCol'></a>

## Column-by-Column Cleaning
[Back to: Top](#TOC)

We will handle the cleaning for each column in this section.
1. [rtl](#rtl)
2. [tl](#tl)
3. [svl](#svl)
4. [autotomized](#autotomized)
5. [toes](#toes)
6. [sex](#sex)
7. [new.recap](#newrecap)

<a id='rtl'></a>

## 'rtl' 
[Back to: Top](#TOC)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#ColByCol)

Here we investigate and clean values in the column 'rtl'. These should be int type values that are greater than or equal to -1.  First, we test to see if all of the values are of type int.

In [14]:
badtypes = []
for val in df.rtl:
    try:
        x = isinstance(type(int(val)),int)
    except:
        badtypes=badtypes+[val]
print("'badtypes' represents {} entries in the df:".format(len(badtypes)))
if len(badtypes)==0:
    print("\nAll values in df.rtl can be successfuly converted to int.\n\n")
#     df['rtl'] = df.rtl.apply(int)
else:
    print("\nAll values in df.rtl could not be converted to int.  The following values could not be \
converted and should be investigated:\n\n{}\n\nbadtypes values are distributed as follows in the df:\n\n{}"\
          .format(list(set(badtypes)),df.loc[df.rtl.isin(badtypes),'rtl'].value_counts(dropna=False)))

'badtypes' represents 3937 entries in the df:

All values in df.rtl could not be converted to int.  The following values could not be converted and should be investigated:

[nan, '?', '10(kink)', 'o', '-', '32 -12']

badtypes values are distributed as follows in the df:

NaN         3931
?              2
-              1
10(kink)       1
32 -12         1
o              1
Name: rtl, dtype: int64


The non-NaN values are few, so we will inspect these first.

In [15]:
pd.set_option('max_colwidth',100000)
df.loc[(df.rtl.isin(badtypes))&(df.rtl.notna()),:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
2244,sj,,m,2003-04-19 00:00:00,56,32,?,,,talus 326,326.0,NEW,painted,,b7c,,
2267,sj,4-10-14-18,m,2003-04-30 00:00:00,76,19,?,,,wall 15m,15.0,recap,painted,,b9a,,9 looks like a backwards P and t combined
2397,sv,1-6-11-20,m,2003-06-27 00:00:00,41,60,o,,4.0,sb 5m ^ cave trail,50.0,NEW,painted,,sMb,,"lost toes for vial, accidently cut off toe 11"
3452,uo,4-6-18,m,2004-07-12 00:00:00,52,75,32 -12,,4.7,sb opp fallen juniper -> flat R,208.0,new,painted,,w^c,04-63,blue throat and blue belly; accidentally cut toe 6
3548,sv,,f,2004-07-21 00:00:00,-,-,-,,6.0,sb @ cc/ccc,240.0,recap,painted,,w148b,,escaped
3575,sj,2-9-12-18,f,2004-07-22 00:00:00,65,94,10(kink),,9.4,wall rt side v wall v cave tr,,recap,painted,,w154b,,hurt toes 11-13 in capture; Bss Tss


Based on review discussions, we will make the changes below:
- ‘?’--> 0; misc: “unsure if tail was recently broken at very tip”
- ‘o’--> 0
- ‘32 -12’ -->32; misc: “potential double-break at 12 \[george to check before use\]” 
- ‘-’-->NaN
- ‘10(kink)’-->0; misc:”kink at 10mm”
We will use the function [*appendstr*](#appendstr) to do this.

"‘?’--> 0; misc: “unsure if tail was recently broken at very tip”

In [16]:
df.loc[(df.rtl.isin(badtypes))&(df.rtl.notna())&(df.rtl=='?'),'misc']= df\
.loc[(df.rtl.isin(badtypes))&(df.rtl.notna())&(df.rtl=='?'),:].misc\
.apply(lambda x: appendstr(x,"unsure if tail was recently broken at very tip",';'))
df.loc[(df.rtl.isin(badtypes))&(df.rtl.notna())&(df.rtl=='?'),'rtl']= '0'

These entries now look like this:

In [17]:
df.loc[(df.date.isin(['2003-04-19 00:00:00','2003-04-30 00:00:00']))\
       &(df.svl.isin(['56','76']))&(df.tl.isin(['32','19'])),:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
2244,sj,,m,2003-04-19 00:00:00,56,32,0,,,talus 326,326,NEW,painted,,b7c,,unsure if tail was recently broken at very tip
2267,sj,4-10-14-18,m,2003-04-30 00:00:00,76,19,0,,,wall 15m,15,recap,painted,,b9a,,9 looks like a backwards P and t combined;unsure if tail was recently broken at very tip;


"‘o’--> 0"

In [18]:
df.loc[(df.rtl.isin(badtypes))&(df.rtl.notna())&(df.rtl=='o'),'rtl']= '0'

These entries now look like this:

In [19]:
df.loc[(df.date.isin(['2003-06-27 00:00:00']))\
       &(df.svl.isin(['41']))&(df.tl.isin(['60'])),:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
2397,sv,1-6-11-20,m,2003-06-27 00:00:00,41,60,0,,4,sb 5m ^ cave trail,50,NEW,painted,,sMb,,"lost toes for vial, accidently cut off toe 11"


"‘32-12’ -->32; misc: “potential double-break at 12 \[george to check before use\]"

In [20]:
df.loc[(df.rtl.isin(badtypes))&(df.rtl.notna())&(df.rtl=='32 -12'),:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
3452,uo,4-6-18,m,2004-07-12 00:00:00,52,75,32 -12,,4.7,sb opp fallen juniper -> flat R,208,new,painted,,w^c,04-63,blue throat and blue belly; accidentally cut toe 6


In [21]:
df.loc[(df.rtl.isin(badtypes))&(df.rtl.notna())&(df.rtl=='32 -12'),'misc']= df\
.loc[(df.rtl.isin(badtypes))&(df.rtl.notna())&(df.rtl=='32 -12'),:].misc\
.apply(lambda x: appendstr(x,"potential double-break at 12 [george to check before use]",';'))

df.loc[(df.rtl.isin(badtypes))&(df.rtl.notna())&(df.rtl=='32 -12'),'rtl']= '32'

These entries now look like this:

In [22]:
df.loc[(df.date.isin(['2004-07-12 00:00:00']))\
       &(df.svl.isin(['52']))&(df.tl.isin(['75'])),:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
3452,uo,4-6-18,m,2004-07-12 00:00:00,52,75,32,,4.7,sb opp fallen juniper -> flat R,208,new,painted,,w^c,04-63,blue throat and blue belly; accidentally cut toe 6;potential double-break at 12 [george to check before use];


"‘-’-->NaN"

In [23]:
df.loc[(df.rtl.isin(badtypes))&(df.rtl.notna())&(df.rtl=='-'),:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
3548,sv,,f,2004-07-21 00:00:00,-,-,-,,6,sb @ cc/ccc,240,recap,painted,,w148b,,escaped


In [21]:
df.loc[(df.rtl.isin(badtypes))&(df.rtl.notna())&(df.rtl=='-'),'rtl']= '32'

These entries now look like this:

In [28]:
df.loc[(df.date.isin(['2004-07-21 00:00:00']))\
       &(df.svl.isin(['-']))&(df.tl.isin(['-'])),:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
3548,sv,,f,2004-07-21 00:00:00,-,-,-,,6,sb @ cc/ccc,240,recap,painted,,w148b,,escaped


‘10(kink)’-->0; misc:”kink at 10mm” We will use the function appendstr to do this."

In [24]:
df.loc[(df.rtl.isin(badtypes))&(df.rtl.notna())&(df.rtl=='10(kink)'),:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
3575,sj,2-9-12-18,f,2004-07-22 00:00:00,65,94,10(kink),,9.4,wall rt side v wall v cave tr,,recap,painted,,w154b,,hurt toes 11-13 in capture; Bss Tss


In [25]:
df.loc[(df.rtl.isin(badtypes))&(df.rtl.notna())&(df.rtl=='10(kink)'),'misc']= df\
.loc[(df.rtl.isin(badtypes))&(df.rtl.notna())&(df.rtl=='10(kink)'),:].misc\
.apply(lambda x: appendstr(x,"kink at 10mm",';'))

df.loc[(df.rtl.isin(badtypes))&(df.rtl.notna())&(df.rtl=='10(kink)'),'rtl']= '0'

These entries now look like this:

In [27]:
df.loc[(df.date.isin(['2004-07-22 00:00:00']))\
       &(df.svl.isin(['65']))&(df.tl.isin(['94'])),:]

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
3575,sj,2-9-12-18,f,2004-07-22 00:00:00,65,94,0,,9.4,wall rt side v wall v cave tr,,recap,painted,,w154b,,hurt toes 11-13 in capture; Bss Tss;kink at 10mm;



We can try to further classify the NaN values in the rtl column. Those with no other measurements (svl, tl, or mass) will be of little use to us and can probably safely be ignored as they will likely be dropped from any further analysis.  Let's see how many of these there are and we will set there rtl values to -999

In [None]:
idx_nomeasurement = ((df.rtl.isna())&(df.svl.isna())&(df.tl.isna()))
print(df.loc[idx_nomeasurement].shape[0])
df.loc[idx_nomeasurement,['svl','tl','rtl']]=-666
print(df.loc[df.rtl==-666].shape[0])

Almost all of the entries of concern are accounted for here.  We will drop these from the dataset.

In [None]:
df = df.loc[~idx_nomeasurement]
print("After dropping the entries with no measurements at all, the df now has {} entries."\
      .format(df.shape[0]))

Now we will inspect those that had at least one other length measurement (svl or tl).

In [None]:
pd.reset_option('max_colwidth')
df.loc[(df.rtl.isna())&((df.svl.notna())|(df.tl.notna())),:]

All but one of these was a sighting.  We will have to look at the field notes to confirm whether or not data were actually missing for the remaining entry.

In [None]:
df.loc[(df.rtl.isna())&((df.svl.notna())|(df.tl.notna()))&df['new.recap'].str.contains('recap'),:]

Once we have addressed these, we will force rtl to an int type.

Now we check to see for out of range rtl values, *i.e.* rtl values less than -1 or suspiciously high.

We will exclude 0 and -1 values for rtl in these figures because of the large proportion of in range values they account for.

In [None]:
jarrovii = go.Histogram(x = df.loc[(df.species.str.contains('j'))&(~df.rtl.isin(badtypes))\
                                   &(~df.rtl.isin(['0','-1']))
                                   ,'rtl'].astype(int, 'ignore'),name = 'S. jarrovii',xbins =dict(size=1)
                        #,histnorm='probability'
                        , cumulative=dict(enabled = False, direction = 'increasing'))
virgatus = go.Histogram(x = df.loc[(df.species.str.contains('v'))&(~df.rtl.isin(badtypes))\
                                   &(~df.rtl.isin(['0','-1']))
                                   ,'rtl'].astype(int, 'ignore'), name = 'S. virgatus',xbins =dict(size=1)
                       #,histnorm='probability'
                        , cumulative=dict(enabled = False, direction = 'increasing'))
other = go.Histogram(x = df.loc[~(df.species.str.contains('v|j'))&(df.species.notna())\
                                &(~df.rtl.isin(badtypes))&(~df.rtl.isin(['0','-1'])),'rtl']\
                                  .astype(int, 'ignore'), name = 'other',xbins =dict(size=1)
                                  #,histnorm='probability'
                     , cumulative=dict(enabled = False
                                                                           , direction = 'increasing'))
data = [jarrovii, virgatus,other]
layout = go.Layout(
    title = 'Histogram of rtl by species',
    titlefont = dict(
        size = 20),
    xaxis = dict(
        dtick = 1,
        title = 'rtl (mm)',
        titlefont = dict(
            size = 18)),
    yaxis = dict(
        title = 'Number of Lizards',
        titlefont = dict(
            size = 18))
)
fig = go.Figure(
        data = data,
        layout = layout)
py.iplot(fig, filename = 'Histogram of rtl by species (new)')

Perhaps it's worth inspecting values greater than 58. 

In [None]:
df.loc[(df.species.str.contains('j'))&\
       (~df.rtl.isin(badtypes)\
       )&(df.loc[(~df.rtl.isin(badtypes)),'rtl'].astype(int, 'ignore')>=50),:]


<a id = 'resumehere'></a>

Some of these values are reasonable, but there are few for which we will need to go back to the field notes in 2011.

In [29]:
df.loc[(df.species.str.contains('j'))&\
       (~df.rtl.isin(badtypes)\
       )&(df.loc[(~df.rtl.isin(badtypes))&(df.loc[df.svl.isin(['~70','large']),:]\
                                           .svl.dropna().astype(int,'ignore')<10),'rtl']\
          .astype(int, 'ignore')>=50),:]

ValueError: invalid literal for int() with base 10: '~70'

<a id='tl'></a>

## 'tl' 
[Back to: Top](#TOC)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#ColByCol)



<a id='svl'></a>

## 'svl' 
[Back to: Top](#TOC)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#ColByCol)



<a id='autotomized'></a>

## 'autotomized' 
[Back to: Top](#TOC)

[Back to: Cleaning](#CleaningData)

[Back to: Column-by-Column Cleaning](#ColByCol)

Here we populate the 'autotomized' column based on the values in 'rtl'.  Most of the source files did not have this category and have NaN values others have float values of 1.0, 2.0 or 3.0 for intact, autotomized with no regrowth or autotomized with regrowth, respectively.  The cleaned data for autotomized will contain  bool type values True, for having experienced auttomy (irrespective fo regrowth) and False for having no evidence of havign experienced autotomy.

In [None]:
df.autotomized.value_counts(dropna=False)

We will inspect the rtl values for entries with non NaN values for autotomized to determine if we can depend on rtl values to determine autotomy status.  In order to rely on rtl values, the following conditions must be met:
- all entries in which autotomized equals 1.0 must have 0 for rtl
- all entries in which autotomized equals 2.0 or 3.0 must have -1 or some value >0 for rtl

In [None]:
intact = df.loc[(df.autotomized==1),'rtl'].astype(int).value_counts(dropna=False)
values2check = [x for x in intact.index[intact.index!=0]]
if len(values2check)>0:
    print("The values associated with {} need a closer look.".format(values2check))
else:
    print("Values for 'intact' entries are as expected.  Continue.")
pd.set_option('max_colwidth',1000)
# df.loc[(df.autotomized==1)&(df.rtl.isin(['21'])),:]
df.loc[(df.autotomized==1)&(df.rtl.astype(int).isin(values2check)),:] # need to see what broke this line

This lizard appears to have been misrecorded and should be listed as autotomized given the amount of regrowth.  If we depend on the rtl values to label autotomized this will be corrected, so for now we will leave this as is.

In [None]:
autotomized = df.loc[(df.autotomized==2),'rtl'].value_counts(dropna=False)
values2check = [x for x in autotomized.index[autotomized.index!='-1']]# change to 'isin' aregument with 0 and -1
if len(values2check)>0:
    print("{} values associated with {} need a closer look."\
          .format(df.loc[(df.autotomized==2)&(df.rtl.isin(values2check)),:].shape[0],values2check))
else:
    print("Values for 'autotomized' entries are as expected.  Continue.")
pd.set_option('max_colwidth',1000)
df.loc[(df.autotomized==2)&(df.rtl.isin(values2check)),:]

Some of these cases are very straight forward given that the ratio of svl to tl is very close to 1, but others would be worth checking the original data to confirm. Another option is to use the svl to tl ratio of animals that we are sure are intact to decide how to classify these.

In [None]:
regrown = df.loc[(df.autotomized==3),'rtl'].value_counts(dropna=False)
values2check = [x for x in regrown.index[regrown.index.astype(int,'ignore')<=0]]
if len(values2check)>0:
    print("The values associated with {} need a closer look.".format(values2check))
else:
    print("Values for 'regrown' entries are as expected.  Continue.")
pd.set_option('max_colwidth',1000)
df.loc[(df.autotomized==3)&(df.rtl.isin(values2check)),:]

The entries labeled as a 3.0 in the autotomized column do not appear as though their rtl values will present an issue for calculating new autotomized values.  We will leave these as they are.

In [None]:
df.loc[((df.autotomized==2)|(df.autotomized==3)),'rtl'].value_counts(dropna=False)

## toes 
[Top](#TOC)

[Top Cleaning](#CleaningData)

First we will rename "toes" to "toes_orig"

In [None]:
df = df.rename(columns = {'toes':'toes_orig'},index = str)

Next we create a new column, "toes"  for the renamed toes

In [None]:
df['toes'] = df.toes_orig

Now we attempt to identify problem toes name and correct or export for review.

In [None]:
pattern1 = ".( {1,}-.|.- {1,}.)" # toes entries with any number of spaces on either side of a hyphen
pattern2 = ".( {,}\w{,} {1,})." # toes entries with space around or between numbers <- the spaces here should be deleted
pattern3 = ".(')."
pattern4 = "./."  # entries with '/' <-- need to replace these with '-'
pattern5 = "(\?{1,})"#<-- these needs to be investigated
pattern6 = "^\d{3,}$" # entries consist of only a single number comprised of at least three digits 
#<-- these needs to be investigated by checking raw field notes
pattern7 = ".(-{2,})." # entries which have at least 2 consecutive '-' <- these should be investigated
pattern8 = "^0" # entries in which single digit numbers have a leading "0" <-- Check raw field notes on this too
pattern9 = "a\w" #<--handled hyphens should be inserted  between the [ab] and \w 
# entries that contain an 'a' or 'b' followed by any character in the set [a-zA-Z0-9_]
pattern10 = "b\w" #<--handled hyphens should be inserted  between the [ab] and \w 
pattern11 = "\wa" # entries that contain an 'a' or 'b' preceded by any character in the set [a-zA-Z0-9_]
pattern12 = "\wb" # entries that contain an 'a' or 'b' preceded by any character in the set [a-zA-Z0-9_]
pattern13 = "[()]"
# remove space before 'a' at end of toes
#investigate '\d-', 
#'-(*)-', 
#' (16) ', 
#'---', <- may not exist in raw data
#'\d- ', 
#'- \d', 
#transcription errors from excel (toes in date format,
#'-\d\d\d\d' <- may not be in the data set

We'll have to change this block if we add or remove toe patterns.
This is not ideal and needs to be fixed

In [None]:
toe_pattern = pd.Series([*range(1,14)]) 
toe_pattern = make_str(toe_pattern)
print(toe_pattern)

toe_pattern_descr = pd.Series([pattern1,pattern2,pattern3,pattern4
                               ,pattern5,pattern6,pattern7,pattern8
                               ,pattern9,pattern10,pattern11,pattern12,pattern13])
toe_pattern_descr = toe_pattern_descr.astype(str)
print(toe_pattern_descr)

toe_pattern_reference = pd.DataFrame({'toe_pattern': toe_pattern,'description':toe_pattern_descr})
toe_pattern_reference

We first replace the string 'nan' with a null value

In [None]:
df.loc[df.toes=='nan','toes'] = np.nan

Let's see how many of these patterns we need to correct

In [None]:
df['toe_pattern'] = np.nan

Here we use a for-loop to label the patterns 
(there's probably a better way to do this with pandas map or apply, but I'll have to figure this out, for now this is fast enough, but it could make a difference with a larger data set or with more patterns)

In [None]:
for i in range(0,toe_pattern_reference.shape[0]):
    tmp_pat_num = toe_pattern_reference.iloc[i,0]
    tmp_pattern = toe_pattern_reference.iloc[i,1]
    df = label_pattern(df,tmp_pat_num,tmp_pattern,'toe_pattern','toes')

A quick summary of the number of observations for each pattern in the data set

In [None]:
toe_errors =df.toe_pattern.value_counts(dropna=False).reset_index()\
.rename(columns = {'index':'toe_pattern','toe_pattern':'observations'})
toe_errors.loc[toe_errors.toe_pattern.isnull(),'toe_pattern'] = 'Not covered by current patterns'
toe_errors_desc = toe_errors.merge(toe_pattern_reference,'left',on='toe_pattern')
toe_errors_desc

Now let's make sure we've accounted for every row in the data set

In [None]:
accountedRows = toe_errors.observations.sum()
totalRows = df.shape[0]
notAccountedRows = df.shape[0] - toe_errors.observations.sum()
print("\nThere are {} rows accounted for in the patterns (including null values) and there {} rows in the full data set.\
  There are {} rows unaccounted for.".format(accountedRows,totalRows,notAccountedRows))

Now we correct these patterns. We'll preserve the original toe data in a column called "toes_orig" just in case.  We can drop this later, if we are comfortable with the changes.  The new toes will be labeled "toes".

In [None]:
corrections_config = {'01':{'action':'replace','pattern_b':" ",'replacement':"\"\""},
            '02':{'action':'replace','pattern_b':" ",'replacement':"-"},
            '03':{'action':'replace','pattern_b':"\'",'replacement':"\"\""},
            '04':{'action':'replace','pattern_b':"/",'replacement':"-"},
            '05':{'action':'save','pattern_b':np.nan,'replacement':np.nan},
            '06':{'action':'save','pattern_b':np.nan,'replacement':np.nan},
            '07':{'action':'save','pattern_b':np.nan,'replacement':np.nan},
            '08':{'action':'replace','pattern_b':"^0",'replacement':"\"\""},
            '09':{'action':'replace','pattern_b':'a','replacement':'-a'},
            '10':{'action':'replace','pattern_b':'b','replacement':'-b'},          
            '11':{'action':'replace','pattern_b':"a",'replacement':"a-"},
            '12':{'action':'replace','pattern_b':"b",'replacement':"b-"},
            '13':{'action':'replace','pattern_b':"[()]",'replacement':"\"\""}}

In [None]:
toe_errors_desc['action'] = toe_errors_desc.loc[toe_errors_desc.toe_pattern.str.len()==2].toe_pattern\
.map(lambda x: corrections_config[x]['action'],na_action='ignore')

toe_errors_desc['replacement'] = toe_errors_desc.loc[toe_errors_desc.toe_pattern.str.len()==2].toe_pattern\
.map(lambda x: corrections_config[x]['replacement'],na_action='ignore')

toe_errors_desc = toe_errors_desc.sort_values('toe_pattern').reset_index(drop=True)
toe_errors_desc

In [None]:
for i in range(0,toe_errors_desc.shape[0]):
    tmp_pat_num = toe_errors_desc.iloc[i,0]
    tmp_pattern = toe_errors_desc.iloc[i,2]
    action = toe_errors_desc.iloc[i,3]
    tmp_replacement = toe_errors_desc.iloc[i,4]
    tmp_x = df.loc[df.toe_pattern==tmp_pat_num,:]
    
    if action =='save':
        tmp_filename = 'pattern'+tmp_pat_num+'.csv'
        tmp_x.to_csv(tmp_filename)
        print("Pattern {} successfully saved to {}.".format(tmp_pattern,tmp_filename))
    if action =='replace':
        df.loc[df.toe_pattern==tmp_pat_num,'toes'] = replace_pattern(x=df.loc[df.toe_pattern==tmp_pat_num]
                                                                     ,pattern = tmp_pat_num
                                                                     ,pattern_b = tmp_pattern
                                                                     ,source_col = 'toes'
                                                                    ,replacement = tmp_replacement)
        print("Pattern {} successfully replaced with {}.".format(tmp_pattern,tmp_replacement))
    else:
        print("No direction provided for pattern {}.  No action was taken.".format(tmp_pattern))

Now we confirm that the patterns we expect to have eliminated have indeed been eliminated from the data set

In [None]:
for i in range(0,toe_pattern_reference.shape[0]):
    tmp_pattern = str(toe_pattern_reference.iloc[i,1])
    report_pattern(df,tmp_pattern,'toes','Post-Correction')

<a id='sex'></a>

### sex
[Top](#TOC)

[Top Cleaning](#CleaningData)

Next we move on to cleaning the "sex" column.

First we want to get an idea of the types of problems in the sex column.  We start by striping leading and trailing whitespaces.  You can see here that there were none in the data set.

In [None]:
print(df.sex.str.len().unique())# returns unique lengths of sex
df.sex=df.sex.str.strip()
print(df.sex.str.len().unique())

#### Identify non "m" or "f" values and their frequencies

In [None]:
patterns_sex="m|f|NA"
non_matches=df.sex.loc[df.sex.str.match(patterns_sex)!=True]
print("\nThere are {} entries for sex which do not match the patterns {}:"\
      .format(non_matches.shape[0],patterns_sex.split("|")))
non_matches.value_counts()

#### Identify values to convert to NA, m, or f

In [None]:
sex2NA=['adult','juv','nan']
sex2m=['unm']
df.loc[df.sex.isin(sex2NA)==True]
print(df.sex.loc[df.sex.isin(sex2NA)==True].count())
print(df.sex.loc[df.sex.isin(sex2m)==True].count())

#### Convert the values to NA or m, respectively.

In [None]:
df.loc[df.sex.isin(sex2m)]

In [None]:
df.loc[df.sex.isin(sex2NA),'sex']=np.nan
df.loc[df.sex.isin(sex2m),'sex']='m'
print(df.sex.loc[df.sex.isin(sex2NA)==True].count())
print(df.sex.loc[df.sex.isin(sex2m)==True].count())

#### Set all remaining species and sex with "?" to NaN

In [None]:
df.loc[(df.species.str.contains('\?')) & (df.species.notnull()),'species'] = np.nan
df.loc[(df.sex.str.contains('\?')) & (df.sex.notnull()),'sex'] = np.nan

<a id='newrecap'></a>

### new.recap
[Top](#TOC)

[Top Cleaning](#CleaningData)

In [None]:
df.columns

In [None]:
df.head()

In [None]:
#try using a dict to do thing more efficiently
newRecapKeep = ['recap', 'new', 'r', 'n']
new = ['new','n']
recap = ['recap','r']
df.loc[~df['new.recap'].isin(newRecapKeep),'new.recap'] = np.nan
df.loc[df['new.recap'].isin(new),'new.recap'] = 'new'
df.loc[df['new.recap'].isin(recap),'new.recap'] = 'recap'

<a id='CorrectingClass'></a>

## Correcting class of columns
[Top](#TOC)

[Top Cleaning](#CleaningData)

In [None]:
#We need to add real error handling into these conversion chunks

##Convert integer columns to int
intCols = ['meters']
df[intCols]=df[intCols].astype(int,errors='ignore')

##Convert numeric columns to numeric
numCols = ['svl','tl','rtl','mass']
df[numCols]=df[numCols].apply(pd.to_numeric,errors='coerce')

##Convert string columns to str
strCols = ['toes','sex','species','vial']
df[strCols]=df[strCols].astype(str, errors='ignore')

#Convert date to datetime
df.loc[df.date=="NA"]=np.nan
df.date = pd.to_datetime(df.date,errors='coerce')

##Convert bool columns to bool
# boolCols = ['review_sex','review_species','review_painted','review_new.recap',\
#             'review_rtl','forceMale','forceFemale','forceRecap','forceNew',\
#             'forceSighting','drop_species','drop_morphometrics','autotomized']
# df[boolCols]=df[boolCols].astype(bool, errors='ignore')

In [None]:
print("\nAfter applying the above changes, the data types are as follows:\n{}".format(df.dtypes))

<a id='AddVar1'></a>

## Adding variables [*year*](#year) and [*rtl_orig*](#rtlorig)

<a id='year'></a>

### Year
[Back to: Top](#TOC)

[Back to: Adding variables](#AddVar1)

We will use data contained in the *date* column to create the variable *year*.  TO do this we will define a small function, *myint*, to convert year to an int type.

<a id='myint'></a>

In [None]:
def myint(x, verbose = False):
    try:
        x = str(x).split('.')[0]
    except:
        x = x
        if verbose == True:
            print('{} is of type {} and cannot be forced to int.'.format(x,type(x)))
    return x


Here is are a few examples of how [*myint*](#myint) works.

In [None]:
bar = [None, 1.0, "f"]
print([type(x) for x in bar])
[myint(x) for x in bar]

In [None]:
bar = [None, 2001.0, "2001.0"]
print([type(x) for x in bar])
[myint(x,True) for x in bar]

Now we apply [*myint*](#myint) to the 'date' column to create the variable year and inspect the unique values.

In [None]:
df['year'] = df.date.dt.year.apply(myint,verbose=False)
df.year.value_counts(dropna=False)

Let's inspect the entries with 'nan' values.  Note these 'nan' values are string values and not NaN.

In [None]:
df.loc[df.year=='nan',:]

<a id='AddCol'></a>

# Adding New Columns
[Top](#TOC)

We need to add new columns which we will use later in analyses:
- [TL_SVL](#TlSvl)
- [Mass_SVL](#MassSvl)
- [Lizard Number](#LizardNumber)
     - [assign lizard numbers](#Assign) 
     - [QC the lizard numbers](#QcLizNum) 
- [Days Since Capture](#daysSinceCapture)
- [Number of Captures](#capture)

<a id= 'TlSvl'></a>

## TL_SVL 
[Top](#TOC)

[Top Add Columns](#AddCol)



In [None]:
df['tl_svl']=(df.tl/df.svl)

<a id='MassSvl'></a>

## Mass_SVL
[Top](#TOC)

[Top Add Columns](#AddCol)



In [None]:
df['mass_svl']=(df.mass/df.svl)

<a id= 'LizardNumber'></a>

## Lizard Number
[Top](#TOC)

[Top Add Columns](#AddCol)

Here we use a set of functions to:
 - [assign lizard numbers](#Assign) to unique individuals (we repeat this step to ensure we have assigned all animals a number) and 
 - [QC the numbers](#QcLizNum) assigned.

<a id='Assign'></a>

### Assign lizard numbers
[Top](#TOC)

[Top Add Columns](#AddCol)

We make a first attempt at assigning lizard numbers.  We use the *lizsort* function to identify the subset of rows from the original dataset which have sufficient information to allow us to make an automated decision about the uniqueness of the individuals identified in those rows.  We name that df *sortable*.  The unsortable data are saved to a path as a file, *unsortable.csv*.  

In [None]:
sortable = lizsort(df, path = sourceDataBig)  

Next we call the *mindate* function on *sortable*.  This identifies the earliest date at which each unique combination of *sortCriteria* are recorded in a new column, *initialCaptureDate*.  The default sortCriteria are of the variables *species*, *toes*, and *sex*.  This also calculates and adds a column for *year_diff*, the difference in years between the initial capture date and the date value in a given row. 

In [None]:
sortable = mindate(sortable)

Next we call a the function *smallest*, which is analogous to *mindate*, but groups data in *sortable* into unique combinations of *species*, *toes*, *sex*, and *initialCaptureDate* before assigning the smallest SVL value recorded for each group to a new column for that group, *smallest_svl*.  *smallest* then calculates a new column *svl_diff* which is analogous to *year_diff*.

In [None]:
sortable = smallest(sortable)

Next we call the *validate* function on *sortable*, which applies a series of validation tests to the data, sequentially numbers unique combinations of *sortCriteria* and returns a dict containing uniquely numbered individuals and summary data.

In [None]:
tmp_sort = validate(sortable)
df_numbered1 = tmp_sort['val_data']

### Second attempt to assign lizard numbers

[Top](#TOC)

[Top Add Columns](#AddCol)

Here we make a second attempt at assigning lizard numbers to ensure that all lizards have been assigned.  This second attempt is focused on those rows which were unvalidated during the first attempt *n_val_data*.  Since these are already a subset fo those data which were sortabel, we need only call the *mindate*, *smallest*, and *validate* functions.

In [None]:
n_val = mindate(tmp_sort['n_val_data'])
n_val = smallest(n_val)
df_numbered2 = validate(n_val)['val_data']

Since no rows remain unvalidated, we will not attempt a third validation.  We will simply append *df_numbered1* and *df_numbered2* to create *df_numbered* to create our full numbered dataset.

In [None]:
df_numbered = df_numbered1.append(df_numbered2,ignore_index=True,sort=False)
print("df:{}\ndf_numbered1:{}\ndf_numbered2:{}\ndf_numbered:{}".format(df.shape,df_numbered1.shape,df_numbered2.shape,
                                                               df_numbered.shape))
df_numbered.head()

<a id='QcLizNum'></a>

### QC of lizard numbers
[Top](#TOC)

[Top Add Columns](#AddCol)

First we display the output data frame.

In [None]:
df_numbered

Identify individuals that have same species and toes, but different sex for review

In [None]:
df_numbered = df_numbered.merge(df_numbered.groupby(['species','toes']).sex.nunique().reset_index()\
                       .rename(columns = {'sex':'sex_count'}),how = 'inner', on = ['species','toes'])
df_numbered.loc[df_numbered.sex_count>1,:].to_csv('entries flagged with same species and toes diff sex.csv')
print("{} rows have the same species and toes but different values for sex"\
      .format(df_numbered.loc[df_numbered.sex_count>1,:].shape[0]))
df_numbered.head()

In [None]:
print("Lizard Numbers in the sample range from {} to {}."\
      .format(df_numbered.liznumber.min(),df_numbered.liznumber.max()))

In [None]:
possibleLizNum = set(range(int(df_numbered.liznumber.min()),int(df_numbered.liznumber.max())))
actualLizNum = set(pd.Series(df_numbered.liznumber.unique()).dropna().apply(int))
print("\nThere are {} entries.  There are {} unique lizard numbers.\
\n\nThe liznumber ranges from {} to {}."\
  .format(df_numbered.shape[0],len(df_numbered.liznumber.unique())\
          ,df_numbered.liznumber.min(),df_numbered.liznumber.max()))

missingLizNum = possibleLizNum - actualLizNum
if len(missingLizNum)>0:
    print("\n\nThe following numbers are not assigned to a lizard:\n{}"\
      .format(missingLizNum))
else:
    print("\n\nThere are no numbers which were not assigned.")

<a id='daysSinceCapture'></a>

### Days Since Capture
[Top](#TOC)

[Top Add Columns](#AddCol)

*daysSinceCapture* identifies the number of days since the animal was captured

<a id='capture'></a>

In [None]:
df_numbered.loc[:,'daysSinceCapture'] = (df_numbered.date - df_numbered.initialCaptureDate).dt.days

### Capture Number
[Top](#TOC)

[Top Add Columns](#AddCol)

*capture* identifies the number of times an animal has been captured prior to an entry.
We will need to [QC capture](#QcCapture) as well.

In [None]:
# need to QC this this seems to be leading to several cases in which recap individuals that 
# only have one capture
df_numbered['capture'] = df_numbered.sort_values(['liznumber','date'])\
.groupby(['liznumber']).daysSinceCapture.cumcount()+1

In [None]:
print(df_numbered.loc[df_numbered.species.isin(['j','v'])].groupby('capture').capture.count())

<a id='yearstoolarge'></a>

### years too large
[Top](#TOC)

In [None]:
yeartoomuch = df_numbered.loc[df_numbered.year_diff>=5,'liznumber']
checkyears = df_numbered.loc[df_numbered.liznumber.isin(yeartoomuch)].sort_values(['liznumber'])
checkyears.to_csv('check years.csv')
checkyears

In [None]:
jarrovii = go.Histogram(x = df_numbered.loc[df_numbered.species.isin(['j'])].groupby('liznumber')\
                     .year_diff.max(),name = 'S. jarrovii')
virgatus = go.Histogram(x = df_numbered.loc[df_numbered.species.isin(['v'])].groupby('liznumber')\
                     .year_diff.max(), name = 'S. virgatus')
data = [jarrovii, virgatus]
layout = go.Layout(
    title = 'Number of Individuals by Years Between First and Last Capture 2000-2017',
    titlefont = dict(
        size = 20),
    xaxis = dict(
        dtick = 1,
        title = 'Maximum Number of Years Since Initial Capture',
        titlefont = dict(
            size = 18)),
    yaxis = dict(
        title = 'Number of Lizards',
        titlefont = dict(
            size = 18))
)
fig = go.Figure(
        data = data,
        layout = layout)
py.iplot(fig, filename = 'Frequency of Captures in Crystal Creek 2000 - 2017 (by species)')

In [None]:
# Freeze work on this figure until we've resolved issues with calculation based on year
# ADD HORIZONTAL LINES FOR EACH YEAR
j_lizards = go.Scatter(x = df_numbered.loc[df_numbered.species.isin(['j'])].liznumber,
                   y = df_numbered.loc[df_numbered.species.isin(['j'])]\
                      .groupby('liznumber').daysSinceCapture.max(), 
                     mode = 'markers', name='S. jarrovii')
v_lizards = go.Scatter(x = df_numbered.loc[df_numbered.species.isin(['v'])].liznumber,
                   y = df_numbered.loc[df_numbered.species.isin(['v'])]\
                      .groupby('liznumber').daysSinceCapture.max(), 
                     mode = 'markers', name='S. virgatus')
# year1 = go.Scatter(x=[df_numbered.liznumber.min(),df_numbered.liznumber.max()],y = (365))
# year2 = go.Scatter(y = 365*2)
# year3 = go.Scatter(y = 365*3)
# year4 = go.Scatter(y = 365*4)
# year5 = go.Scatter(y = 365*5)
# year6 = go.Scatter(y = 365*6)
# year7 = go.Scatter(y = 365*7)
# year8 = go.Scatter(y = 365*8)

# data = [j_lizards, v_lizards, year1, year2, year3, year4, year5, year6, year7, year8]
data = [j_lizards, v_lizards]
layout = go.Layout(
    title = 'Days Since Initial Capture in Crystal Creek 2000 - 2017',
        titlefont = dict(
            size = 20),
    xaxis = dict(
            title='Lizard Number',
            titlefont=dict(
                size=18)),
    yaxis = dict(
            title='Greatest Number of Days Since<br> Initial Capture',
            titlefont=dict(
                size=18)))

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename = 'Days Since Initial Capture in Crystal Creek 2000 - 2017')

In [None]:
dfF = df_numbered.loc[(df_numbered.sex =='f' )& (df_numbered.species.isin(['j','v']))]
dfM = df_numbered.loc[(df_numbered.sex =='m') & (df_numbered.species.isin(['j','v']))]

In [None]:
# Freeze work on this figure until we've resolved issues with calculation based on year
females = go.Scatter(
    x = dfF.liznumber,
    y = dfF.groupby('liznumber').daysSinceCapture.max(),
    name = 'females',
    mode = 'markers',
    marker = dict(
        color = 'rgba(152, 0, 0, .8)',
        opacity = 0.75,
        line = dict(
            width = 2,
            color = 'rgb(0, 0, 0)'
        )
    )
)

males = go.Scatter(
    x = dfM.liznumber,
    y = dfM.groupby('liznumber').daysSinceCapture.max(),
    name = 'males',
    mode = 'markers',
    marker = dict(
        color = 'rgba(255, 182, 193, .9)',
        opacity = 0.75,
        line = dict(
            width = 2,
        )
    )
)

data = [females, males]

layout = dict(title = 'Days Since Initial Capture in Crystal Creek 2000 - 2017 By Sex',
              yaxis = dict(
                  title='Greatest Number of Days Since<br> Initial Capture',
                  titlefont=dict(
                      size=18)
              ),
              xaxis = dict(zeroline = False)
             )

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='Days Since Initial Capture in Crystal Creek 2000 - 2017 By Sex')

In [None]:
# Something is wrong with 2006 and 2011 data.  Try grouping data by lizard numbers to address high numbers.
# Freeze work on this figure until we've resolved issues with calculation based on year
# Capture rate between Males and Females does not appear to be significantly different even before
# statistical analysis
males = go.Histogram(x = df_numbered.loc[(df_numbered.sex == 'm')& (df_numbered.species.isin(['j','v']))
                                                                    ,'year']
                     ,opacity= 0.75,name='males')
females = go.Histogram(x = df_numbered.loc[(df_numbered.sex == 'f')& (df_numbered.species.isin(['j','v']))
                                                                      ,'year']
                       , opacity= 0.75, name = 'females')
data = [males,females]
py.iplot(data, filename = 'Distribution of Sex by Year in Crystal Creek 2000 - 2017')

In [None]:
df_numbered.head()

In [None]:
column_order = ['liznumber','date','initialCaptureDate',]

In [None]:
df.year.value_counts(dropna=False).reset_index()

<a id='QcCapture'></a>

### QC of Capture number and Recap status
[Top](#TOC)

[Top Add Columns](#AddCol)

[Top Capture Number](#capture)

In [None]:
recapQuestion=df_numbered\
.loc[(df_numbered.capture==1 )&(df_numbered['new.recap']=='recap')&(df_numbered.species.isin(['j','v'])),:]
print("There are {} instances in rows for which a lizard appears to have only one capture, \
but is listed as a recap.\
The distribution of these across years in the sample is as follows:\n{}."\
      .format(recapQuestion.shape[0],recapQuestion.year.value_counts()))
recapQuestion.to_csv("Questionable recaptures.csv")#These individuals need to be rechecked in the raw notes
recapQuestion.head()

In [None]:
recapQuestion.loc[recapQuestion.svl<54,:]

<a id='exportFinal'></a>

# Export Cleaned data
[Top](#TOC)

Now we export the cleaned data to a csv.

In [None]:
df_numbered = df_numbered.rename(index = str, columns = {'new.recap':'newRecap'})
qc_drop_cols = df_numbered.columns[df_numbered.columns.str.contains('force|drop')]
df_full = df_numbered.drop(qc_drop_cols,1)

In [None]:
timestamp = (pd.to_datetime('now')-pd.Timedelta(hours=4))
timestamp = str(timestamp)[:-10].replace(':','hrs')+'min'
#path=''C:\\Users\\Christopher\\Google Drive\\TailDemography\\outputFiles\\''
# path=outputBig
filename = 'cleaned CC data 2000-2017_' + timestamp+ '.csv'
# filename = path + '/cleaned CC data 2000-2017' + '.csv'
df_full.to_csv(filename,index = False)
filename