# Cleaning CC data

This python notebook operates on a csv created after editing in open refine and is designed to finish cleaning columns of interest which were easier to clean in python.

## Setting up Python

Here we import necessary packages. 
This chunk may take a while.

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib
%matplotlib notebook

# increase print limit
pd.options.display.max_rows = 99999

### Use this chunk to read data from local folder on Chris' machine

In [2]:
# Source Data
sourceDataPers = 'C:/Users/Christopher/Google Drive/TailDemography/outputFiles'
sourceDataBig = 'S:/Chris/TailDemography/combined data'

#Output Data paths
outputPers = 'C:/Users/Christopher/Google Drive/TailDemography/outputFiles'
outputBig = 'S:/Chris/TailDemography/data'


In [3]:
os.chdir(sourceDataBig)
# os.listdir()
df=pd.read_csv('mapped-data-all_18-01-08_post_openrefine.csv')
df.head()


Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,...,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics
0,j,1-13-19,2000-03-17T00:00:00Z,f,52.0,74.0,0,4.2,r1c,1falls,...,False,False,False,False,False,False,False,False,False,False
1,j,1-13-20,2000-03-17T00:00:00Z,m,56.0,77.0,0,5.6,r2c,1falls,...,False,False,False,False,False,False,False,False,False,False
2,j,1-14-19,2000-03-17T00:00:00Z,f,57.0,81.0,0,6.6,r3c,wall on rt side v wall at pine xing,...,False,False,False,False,False,False,False,False,False,False
3,j,1-14-20,2000-03-17T00:00:00Z,f,57.0,79.0,0,5.5,r4c,wall on rt side v wall at pine xing,...,False,False,False,False,False,False,False,False,False,False
4,j,3-8,2000-03-17T00:00:00Z,f,82.0,89.0,27,17.0,r5c,oak across from bottom wall at pine xing,...,False,False,False,False,False,False,False,False,False,False


Let's take a look at the data

In [4]:
print("\nThere are {} data points in our data set.".format(df.shape[0]))


There are 8197 data points in our data set.


In [5]:
print("\nThe columns in the data have the following data types:\n{}".format(df.dtypes))


The columns in the data have the following data types:
species                object
toes                   object
date                   object
sex                    object
svl                   float64
tl                    float64
rtl_orig               object
mass                   object
paint.mark             object
location               object
meters                 object
new.recap              object
painted                object
misc                   object
vial                   object
year                    int64
rtl                   float64
autotomized              bool
new.recap_orig         object
sighting               object
review_sex               bool
review_species           bool
review_painted           bool
review_new.recap         bool
review_rtl               bool
forceMale                bool
forceFemale              bool
forceRecap               bool
forceNew                 bool
forceSighting            bool
drop_species             bool
drop_morphomet

## Correcting class of columns

In [6]:
#We need to add real error handling into these conversion chunks

##Convert integer columns to int
intCols = ['meters','year']
df[intCols]=df[intCols].astype(int,errors='ignore')

##Convert numeric columns to numeric
numCols = ['svl','tl','rtl','rtl_orig','mass']
df[numCols]=df[numCols].apply(pd.to_numeric,errors='coerce')

##Convert string columns to str
strCols = ['toes','sex','species','vial']
df[strCols]=df[strCols].astype(str, errors='ignore')

#Convert date to datetime
df.loc[df.date=="NA"]=np.nan
df.date = pd.to_datetime(df.date,errors='coerce')

##Convert bool columns to bool
boolCols = ['review_sex','review_species','review_painted','review_new.recap',\
            'review_rtl','forceMale','forceFemale','forceRecap','forceNew',\
            'forceSighting','drop_species','drop_morphometrics','autotomized']
df[boolCols]=df[boolCols].astype(bool, errors='ignore')

In [7]:
print("\nAfter applying the above changes, the data types re as follows:\n{}".format(df.dtypes))


After applying the above changes, the data types re as follows:
species                       object
toes                          object
date                  datetime64[ns]
sex                           object
svl                          float64
tl                           float64
rtl_orig                     float64
mass                         float64
paint.mark                    object
location                      object
meters                        object
new.recap                     object
painted                       object
misc                          object
vial                          object
year                         float64
rtl                          float64
autotomized                     bool
new.recap_orig                object
sighting                      object
review_sex                      bool
review_species                  bool
review_painted                  bool
review_new.recap                bool
review_rtl                      bool
forceMale 

## Remove leading and trailing whitespaces

for col in df:
    print(len(col))# returns unique lengths of sex
    col=col.strip()

for col in df:
    col=col.strip()

## Cleaning toes column

First we will rename "toes" to "toes_orig"

In [21]:
df = df.rename(columns = {'toes':'toes_orig'},index = str)

Index(['species', 'toes_orig', 'date', 'sex', 'svl', 'tl', 'rtl_orig', 'mass',
       'paint.mark', 'location', 'meters', 'new.recap', 'painted', 'misc',
       'vial', 'year', 'rtl', 'autotomized', 'new.recap_orig', 'sighting',
       'review_sex', 'review_species', 'review_painted', 'review_new.recap',
       'review_rtl', 'forceMale', 'forceFemale', 'forceRecap', 'forceNew',
       'forceSighting', 'drop_species', 'drop_morphometrics', 'toe_pattern'],
      dtype='object')

Next we create a new column, "toes"  for the renamed toes

In [23]:
df['toes'] = df.toes_orig

Index(['species', 'toes_orig', 'date', 'sex', 'svl', 'tl', 'rtl_orig', 'mass',
       'paint.mark', 'location', 'meters', 'new.recap', 'painted', 'misc',
       'vial', 'year', 'rtl', 'autotomized', 'new.recap_orig', 'sighting',
       'review_sex', 'review_species', 'review_painted', 'review_new.recap',
       'review_rtl', 'forceMale', 'forceFemale', 'forceRecap', 'forceNew',
       'forceSighting', 'drop_species', 'drop_morphometrics', 'toe_pattern',
       'toes'],
      dtype='object')

Now we attempt to identify problem toes name and correct or export for review.

In [9]:
pattern1 = ".( {1,}-.|.- {1,}.)" # toes entries with any number of spaces on either side of a hyphen
pattern2 = ".( {,}\w{,} {1,})." # toes entries with space around or between numbers <- the spaces here should be deleted
pattern3 = ".(')."
pattern4 = "./."  # entries with '/' <-- need to replace these with '-'
pattern5 = "(\?{1,})"#<-- these needs to be investigated
pattern6 = "(\?{1,}])" #<--These need to be excluded from analyses
pattern7 = "^\d{3,}$" # entries consist of only a single number comprised of at least three digits 
#<-- these needs to be investigated by checking raw field notes
pattern8 = ".(-{2,})." # entries which have at least 2 consecutive '-' <- these should be investigated
pattern9 = "(^0)" # entries in which single digit numbers have a leading "0" <-- Check raw field notes on this too
pattern10 = " {1,}[ab]"
pattern11 = "[ab] {1,}" #<--handled spaces should be replaced by "-"
pattern12 = "[ab]\w" #<--handled hyphens should be inserted  between the [ab] and \w 
# entries that contain an 'a' or 'b' followed by any character in the set [a-zA-Z0-9_]
pattern13 = "\w[ab]" # entries that contain an 'a' or 'b' preceded by any character in the set [a-zA-Z0-9_]
pattern14 = "[()]"
pattern15 = ".(\d a)"
# remove space before 'a' at end of toes
#investigate '\d-', 
#'-(*)-', 
#' (16) ', 
#'---', <- may not exist in raw data
#'\d- ', 
#'- \d', 
#transcription errors from excel (toes in date format,
#'-\d\d\d\d' <- may not be in the data set

We first replace the string 'nan' with a null value

In [10]:
df.loc[df.toes=='nan','toes'] = np.nan

Let's see how many of these patterns we need to correct

df.loc[df.toes.str.match(pattern1)==True,'toe_pattern'] = '01'
df.loc[df.toes.str.match(pattern2)==True,'toe_pattern'] = '02'
df.loc[df.toes.str.match(pattern3)==True,'toe_pattern'] = '03'
df.loc[df.toes.str.match(pattern4)==True,'toe_pattern'] = '04'
df.loc[df.toes.str.match(pattern5)==True,'toe_pattern'] = '05'
df.loc[df.toes.str.match(pattern6)==True,'toe_pattern'] = '06'
df.loc[df.toes.str.match(pattern7)==True,'toe_pattern'] = '07'
df.loc[df.toes.str.match(pattern8)==True,'toe_pattern'] = '08'
df.loc[df.toes.str.match(pattern9)==True,'toe_pattern'] = '09'
df.loc[df.toes.str.match(pattern10)==True,'toe_pattern'] = '10'
df.loc[df.toes.str.match(pattern11)==True,'toe_pattern'] = '11'
df.loc[df.toes.str.match(pattern12)==True,'toe_pattern'] = '12'
df.loc[df.toes.str.match(pattern13)==True,'toe_pattern'] = '13'
df.loc[df.toes.str.match(pattern14)==True,'toe_pattern'] = '14'
df.loc[df.toes.str.match(pattern15)==True,'toe_pattern'] = '15'
df.toe_pattern.value_counts(dropna = False).reset_index()\
.rename({'index':'toe_pattern','toe_pattern':'Nmatches'},axis = 'columns').sort_values('toe_pattern')

### Correcting these patterns

In [11]:
import pandas as pd
def label_pattern (series , pat_num , pattern , pattern_b , replacement , pat_col = 'toe_pattern' , col = 'toes'):
    """searches a pandas series for a regex expression, pattern, and replaces with replacement"""
#     label the pattern
    df.loc[df[col].str.match(pattern)==True,pat_col] = str(pat_num)
    print('pre:\ntoe pattern {}:{}'.format(str(pat_num),df.loc[df[pat_col] ==str(pat_num),:].shape[0]))
    
    df.loc[df[col].str.match(pattern)==True,col] = df.loc[df[col].str.match(pattern)==True,col]\
    .replace(pattern_b,replacement)
    
    res = df
    print('post:\ntoe pattern {}:{}'.format(str(pat_num),df.loc[df[pat_col] ==str(pat_num),:].shape[0]))
    return res

In [43]:
df.loc[df.toes.str.match(pattern1)==True,'toes'].replace(" ","")


6055                           4 - 6 - 12
6056                           4 - 6 - 13
6057                               6 - 15
6058                               6 - 16
6059                               6 - 17
6060                      3 - 9 - 13 - 20
6061                      3 - 9 - 14 - 16
6062                      3 - 9 - 14 - 17
6063                      3 - 9 - 14 - 18
6064                      4 - 6 - 11 - 20
6065                      4 - 6 - 12 - 16
6066                      4 - 6 - 12 - 18
6067                      4 - 6 - 12 - 19
6068                      4 - 6 - 13 - 16
6069                      4 - 6 - 13 - 17
6070                      4 - 6 - 13 - 19
6082                      3 - 9 - 13 - 16
6083                           1 - 8 - 11
6084                           2 - 8 - 13
6085                           3 - 7 - 15
6086                           3 - 9 - 15
6087                               1 - 12
6088                               8 - 16
6089                          1 - 

In [11]:
import pandas as pd
def replace_pattern (x , pattern , pattern_b , replacement):
    """searches a pandas series for a regex expression, pattern, and replaces with replacement"""
    print('pre:\ntoe pattern {}:{}'.format(str(pattern),x.str.match(pattern)==True.sum()))
    
    x = x.loc[x.str.match(pattern)==True].replace(pattern_b,replacement)
    
    res = df
    print('post:\ntoe pattern {}:{}'.format(str(pattern),x.str.match(pattern)==True.sum()))
    return res

In [12]:
tmp  = replace_pattern(df=df,pat_num='01',pattern = pattern1,pattern_b = " ",replacement="") 

pre:
toe pattern 01:40
post:
toe pattern 01:40


In [15]:
tmp.loc[tmp.toe_pattern=='01']

Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,...,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics,toe_pattern
6055,j,4 - 6 - 12,2011-06-20,m,68.0,101.0,0.0,9.2,g20b,15m up CCC,...,False,False,False,False,False,False,False,False,False,1
6056,j,4 - 6 - 13,2011-06-28,m,77.0,106.0,0.0,13.8,g40b,big juniper right 8m ^ juniper Xing,...,False,False,False,False,False,False,False,False,False,1
6057,uo,6 - 15,2011-06-20,m,49.0,83.0,0.0,4.0,g.t,opp oak R,...,False,False,False,False,False,False,False,True,False,1
6058,uo,6 - 16,2011-06-28,f,49.0,72.0,0.0,4.2,g.b,CC/CCC,...,False,False,False,False,False,False,False,True,False,1
6059,uo,6 - 17,2011-07-02,f,48.0,69.0,0.0,4.0,g.c,H4b,...,False,False,False,False,False,False,False,True,False,1
6060,v,3 - 9 - 13 - 20,2011-06-22,m,46.0,60.0,0.0,3.7,g6c,left Rs opp Pine R,...,False,False,False,False,False,False,False,False,False,1
6061,v,3 - 9 - 14 - 16,2011-06-27,m,49.0,64.0,0.0,4.0,g8c,left Rs 3m ^ oak R,...,False,False,False,False,False,False,False,False,False,1
6062,v,3 - 9 - 14 - 17,2011-06-30,m,49.0,46.0,12.0,4.3,g9c,2m ^ cave trail opp sb,...,False,False,False,False,False,False,False,False,False,1
6063,v,3 - 9 - 14 - 18,2011-07-03,f,49.0,62.0,0.0,4.0,g14c,2m v pool sb,...,False,False,False,False,False,False,False,False,False,1
6064,j,4 - 6 - 11 - 20,2011-06-20,f,61.0,81.0,0.0,6.6,g17b,4m ^ left side SB Juniper Xing,...,False,False,False,False,False,False,False,False,False,1


Next, for "toes" entries with an "a" or "b" followed or preceeded by a space, we replace the space with a hyphen and add a column, toe_pattern, that indicates which pattern was found in the source data.

In [None]:
df.loc[df.toes.str.match(pattern14)==True,'toe_pattern'] = 14
print('pre:\ntoe pattern 14:{}'.format(df.loc[df.toe_pattern ==14,:].shape[0]))
#for 'b'
df.loc[df.toes.str.match(pattern14)==True,'toes']=df.loc[df.toes.str.match(pattern14)==True,'toes']\
.replace("\(|\)","",regex=True)
#for 'a'
print('post:\ntoe pattern 14:{}'.format(df.loc[df.toe_pattern ==14,:].shape[0]))
df.loc[df.toe_pattern ==14,:]

Next, for "toes" entries with an "a" or "b" followed or preceeded by a space, we replace the space with a hyphen and add a column, toe_pattern, that indicates which pattern was found in the source data.

In [None]:
df.loc[df.toes.str.match(pattern1)==True,'toe_pattern'] = 1
print('pre:\ntoe pattern 1:{}'.format(df.loc[df.toe_pattern ==1,:].shape[0]))
#for 'b'
df.loc[df.toes.str.match(pattern1)==True,'toes']=df.loc[df.toes.str.match(pattern1)==True,'toes'].replace(" ","",regex=True)
#for 'a'
print('post:\ntoe pattern 1:{}'.format(df.loc[df.toe_pattern ==1,:].shape[0]))
df.loc[df.toe_pattern ==1,:]

pattern2 = ".( {,}\w{,} {1,})." # toes entries with space around or between numbers

In [None]:
df.loc[df.toes.str.match(pattern2)==True,'toe_pattern'] = 2
print('pre:\ntoe pattern 2:{}'.format(df.loc[df.toe_pattern ==2,:].shape[0]))
df.loc[df.toes.str.match(pattern2)==True,'toes']=df.loc[df.toes.str.match(pattern2)==True,'toes']\
.replace(" {1,}","-",regex=True)
print('post:\ntoe pattern 2:{}'.format(df.loc[df.toe_pattern ==2,:].shape[0]))
df.loc[df.toe_pattern ==2,:]

Next, for "toes" entries with an "a" or "b" followed or preceeded by a space, we replace the space with a hyphen and add a column, toe_pattern, that indicates which pattern was found in the source data.

In [None]:
df.loc[df.toes.str.match(pattern11)==True,'toe_pattern'] = 11
print('pre:\ntoe pattern 11:{}'.format(df.loc[df.toe_pattern ==11,:].shape[0]))
#for 'b'
df.loc[df.toes.str.match(pattern11)==True,'toes']=df.loc[df.toes.str.match(pattern11)==True,'toes'].replace("b {1,}","b-",regex=True)
#for 'a'
df.loc[df.toes.str.match(pattern11)==True,'toes']=df.loc[df.toes.str.match(pattern11)==True,'toes'].replace("a {1,}","a-",regex=True)
print('post:\ntoe pattern 11:{}'.format(df.loc[df.toe_pattern ==11,:].shape[0]))
df.loc[df.toe_pattern ==11,:]

"[ab]\w" # entries that contain an 'a' or 'b' followed by any charcter in the set [a-zA-Z0-9_] pattern12

In [None]:
df.loc[df.toes.str.match(pattern12)==True,'toe_pattern'] = 12
print('pre:\ntoe pattern 12:{}'.format(df.loc[df.toe_pattern ==12,:].shape[0]))
#for 'b'
df.loc[df.toes.str.match(pattern12)==True,'toes']=df.loc[df.toes.str.match(pattern12)==True,'toes'].replace("b","b-",regex=True)
#for 'a'
df.loc[df.toes.str.match(pattern12)==True,'toes']=df.loc[df.toes.str.match(pattern12)==True,'toes'].replace("a","a-",regex=True)
print('post:\ntoe pattern 12:{}'.format(df.loc[df.toe_pattern ==12,:].shape[0]))
df.loc[df.toe_pattern ==12,:]

pattern4 "./."  # entries with '/' need to replace these with '-'

In [None]:
df.loc[df.toes.str.match(pattern4)==True,'toe_pattern'] = 4
print('pre:\ntoe pattern 4:{}'.format(df.loc[df.toe_pattern ==4,:].shape[0]))
df.loc[df.toes.str.match(pattern4)==True,'toes']=df.loc[df.toes.str.match(pattern4)==True,'toes'].replace("/","-",regex=True)
print('post:\ntoe pattern 4:{}'.format(df.loc[df.toe_pattern ==4,:].shape[0]))
df.loc[df.toe_pattern ==4,:]

For toes entries with any number of spaces on either side of a hyphen, but no spaes surrounding other characters we delete those spaces

df.loc[df.toes.str.match(pattern1)==True]=df.loc[df.toes.str.match(pattern1)==True].replace(" ","",regex=True)
df.loc[df.toes.str.match(pattern2)==True]=df.loc[df.toes.str.match(pattern2)==True].replace(" ","-",regex=True)
df.loc[df.toes.str.match(pattern3)==True]=df.loc[df.toes.str.match(pattern3)==True].replace("'","",regex=True)

Confirm that these patterns have been corrected

In [None]:
print("\nThere are {} toe marks with spaces surrounding hypens.".format(df.loc[df.toes.str.match(pattern1)==True].shape[0]))
print("\nThere are {} toe marks with spaces surrounding numbers.".format(df.loc[df.toes.str.match(pattern2)==True].shape[0]))
print("\nThere are {} toe marks with an ' preceeding the entry.".format(df.loc[df.toes.str.match(pattern3)==True].shape[0]))

## Cleaning Sex column

In [None]:
print(df.sex.str.len().unique())# returns unique lengths of sex
df.sex=df.sex.str.strip()
print(df.sex.str.len().unique())

### Identify non "m" or "f" values and their frequencies

In [None]:
patterns_sex="m|f|NA"
non_matches=df.sex.loc[df.sex.str.match(patterns_sex)!=True]
print("\nThere are {} entries for sex which do not match the patterns {}:"\
      .format(non_matches.shape[0],patterns_sex.split("|")))
non_matches.value_counts()

### Identify values to convert to NA, m, or f

In [None]:
sex2NA=['adult','juv','nan']
sex2m=['unm']
df.loc[df.sex.isin(sex2NA)==True]
print(df.sex.loc[df.sex.isin(sex2NA)==True].count())
print(df.sex.loc[df.sex.isin(sex2m)==True].count())

### Convert the values to NA or m, respectively.

In [None]:
df.loc[df.sex.isin(sex2m)]

In [None]:
df.loc[df.sex.isin(sex2NA),'sex']=np.nan
df.loc[df.sex.isin(sex2m),'sex']='m'
print(df.sex.loc[df.sex.isin(sex2NA)==True].count())
print(df.sex.loc[df.sex.isin(sex2m)==True].count())

# Set all remaining species and sex with "?" to NaN

In [None]:
df.loc[(df.species.str.contains('\?')) & (df.species.notnull()),'species'] = np.nan
df.loc[(df.sex.str.contains('\?')) & (df.sex.notnull()),'sex'] = np.nan

# Cleaning autotmized column

In [None]:
autotomyDict = {False:'intact',True:'autotomized'}

df.loc[:,'autotomized'] = df.loc[:,'autotomized'].map(autotomyDict)
df.autotomized.unique()

# Cleaning new.recap column

In [None]:
df.head()

In [None]:
#try using a dict to do thing more efficiently
newRecapKeep = ['recap', 'new', 'r', 'n']
new = ['new','n']
recap = ['recap','r']
df.loc[~df['new.recap'].isin(newRecapKeep),'new.recap'] = np.nan
df.loc[df['new.recap'].isin(new),'new.recap'] = 'new'
df.loc[df['new.recap'].isin(recap),'new.recap'] = 'recap'

## Add Columns

In [None]:
# tl_svl and mass_svl
df['tl_svl']=(df.tl/df.svl)
df['mass_svl']=(df.mass/df.svl)

## Create function to generate lizardNumber 
 lizard number is a numeric identifier of unique animals in the data set
function takes the following arguments:
    - *x*: series object on which function acts on
    - *sortCriteria*: list of strings of column names on which to sort data.  data are sorted by columns from left to right
    - *validationCriteria*: dictionary of dictionaries that identify columns to validate and validation expression of the form {{'column':'column_2 >= column_1'},{'otherColumn':'column_2 >= column_1'}}
    - *result*: dictionary of of dictionaries detailing the value *x* takes if validations are True or False of the form: {{'True':x=x[i]},{'False':x=x[i]+1},{errors: 'raise'}}, errors may be 'raise' *default* (terminates function and returns an error) or 'ignore' (returns 'NA')
Function action:
- first sort data by species, toes, then date

- for time points 1 and , with 2 being later: 
    - toes2 == toes1
    - svl2-svl1 >=-2
    - year2-year1 <=7
    - for species ==j:
        - if svl >=56:
            - if sex2==sex1:
                lizardNumber[i+1]=lizardNumber[i]
          else:
            - lizardNumber[i+1]=lizardNumber[i]+1

In [None]:
import pandas as pd
import os

sortCriteria = ['species','toes', 'sex']
validation = ['date','svl']


def lizsort(x,path:str,sortCriteria = ['species','toes', 'sex'], validation = ['date','svl'],\
            unsortablefile ='unsortable.csv'):
    """
    takes a pandas data frame and returns a pandas dataframe with only those values which 
    can be evaluated according to given criteria and prints a summaryof the files evaluated
    :param path:
    :param sortCriteria:
    :param validation:
    :param unsortablefile:
    """
    #identify lizards with sufficient data to evaluate
    #report on those without sufficient data and save them to a file for later evaluation
    critical = sortCriteria +validation
    unsortable = x.loc[x.loc[:,critical].isnull().any(axis=1)]
    sortable = x.loc[x.loc[:,critical].notnull().all(axis=1)]
    os.chdir(path)
    unsortable.to_csv(unsortablefile)
    print("\nThere were {} entries for which values for one of the critical criteria, ({}), were null.  \
    These entries could not be evaluated and were written out to the file {} for evaluation."\
          .format(unsortable.shape[0],critical,unsortablefile))
    return sortable

def mindate(x, sortCriteria = ['species','toes', 'sex']): # finds date of the initial capture of an animal
    """
    takes a pandas data frame and returns a dataframe with sorting criteria adds a column containing the earliest date 
    at which each unique combination of the sort criteria was sighted. [Requires that the source dataframe,x, has a 
    column labeled 'date'.]
    """
    if any(x.columns=='initialCaptureDate'):
        x = tmp_sort['n_val_data'].drop('initialCaptureDate',1)
    sortable_min_date =pd.DataFrame(x.groupby(sortCriteria).date.min()).\
    rename(index = str, columns= {'date':'initialCaptureDate'}).reset_index()
    x = x.merge(sortable_min_date,how = 'left', on = sortCriteria)
    x['year_diff'] = x.date.dt.year - x.initialCaptureDate.dt.year
    return x

def smallest(x, svlGroup = ['species','toes', 'sex','initialCaptureDate']):#finds svl of animal at date of the initial capture.  needs to be moved out of function
    if any(x.columns=='smallest_svl'):
        x = x.drop('smallest_svl',1)
    sortable_smallest_svl =x.groupby(svlGroup).svl.min().reset_index()\
    .rename(index = str, columns= {'svl':'smallest_svl'})
    #sortable_smallest_svl
    x = x.merge(sortable_smallest_svl,how = 'left', on = svlGroup)
    x['svl_diff'] = x.svl - x.smallest_svl
    return x

def validate (x, sortCriteria = ['species','toes', 'sex'],validation = ['date','svl']):
    x['tmp'] = 1 
    numbers = x.loc[(x.year_diff<=7) & (x.svl_diff>=-2),:].\
    groupby(['species','sex','toes']).tmp.min().cumsum().reset_index()
    validated = x.loc[(x.year_diff<=7) & (x.svl_diff>=-2),:].shape[0]
    not_val_data = x.loc[(x.year_diff<=7) & (x.svl_diff>=-2),:]
    not_validated = x.loc[~((x.year_diff<=7) & (x.svl_diff>=-2)),:].shape[0]
    numbers = numbers.rename(columns={'tmp':'liznumber'}) # rename last column to liznumber
    #the next line merges the numbers to the original data frame to assign the lizard number to the full record
    #of an animal.  It then drop 'tmp'and 'smallest_svl, sinc ewe won't be using these again
    x = x.merge(numbers,'left', on = ['species','sex','toes']).drop(['tmp','smallest_svl'],1)
    print("\nOf those entries we can handle, there are {} individuals as defined by {} which pass validataion based\
    on {} and {} which do not pass validation."\
          .format(validated,sortCriteria,validation,not_validated))
    return {'val_data':x,'n_val_data':not_val_data,'n_validated':not_validated}

def genliznum2(df, path:str, errors:str= 'raise'):
    """
    calls functions to generate a unique identifier for each lizard
    
    Lizard number is a numeric identifier of unique animals in the data set function takes the following arguments:
    :param df:  series object on which function acts on
    :param sortCriteria: list of strings of column names on which to sort data.  data are sorted by columns from left \
    to right
    :param validation: dictionary of dictionaries that identify columns to validate and validation expression of the form:\
     {{'column':'column_2 >= column_1'},{'otherColumn':'column_2 >= column_1'}}
    :param errors: str , errors may be 'raise' *default* (terminates function and returns an error) or 'ignore' (returns 'NA')
    :return: dataframe
    #dictionary  of dictionaries detailing the value *x* takes if validations are True or False of the form: \
    #{{'True':x=x[i]},{'False':x=x[i]+1},{errors: 'raise'}}
    """
    sortable = lizsort(df, path = path)
    sortable = mindate(sortable)
    sortable = smallest(sortable)
    tmp_sort = validate(sortable)
    sortable = tmp_sort['val_data']
    n_val = mindate(tmp_sort['n_val_data'])
    n_val = smallest(n_val)
    n_val = validate(n_val)['val_data']
 
    res = n_val
    return res


genliznum2(df, path = 'C:\\Users\\Christopher\\Documents\\GitHub\\tailDemography\\data')

### Initial attempt to assign lizard numbers

In [None]:
sortable = lizsort(df, path = 'S:\\Chris\\TailDemography\\data')
    
sortable = mindate(sortable)
sortable = smallest(sortable)
tmp_sort = validate(sortable)
sortable = tmp_sort['val_data']

### Second attempt to assign lizard numbers

In [None]:
n_val = mindate(tmp_sort['n_val_data'])
n_val = smallest(n_val)
df_numbered = validate(n_val)['val_data']

### Displaying the output data frame

In [None]:
df_numbered

### QC of lizard numbers

Identify individuals that have same species and toes, but different sex for review

In [None]:
df = df.merge(df.groupby(['species','toes']).sex.nunique().reset_index().rename(columns = {'sex':'sex_count'})\
         ,how = 'inner', on = ['species','toes'])
print(df.loc[df.sex_count>1,:].shape[0])
df.loc[df.sex_count>1,:].to_csv('entries flagged with same species and toes diff sex.csv')
df.head()

In [None]:
df.groupby(['species','toes']).sex.nunique()

In [None]:
print("Lizard Numbers in the sample range from {} to {}."\
      .format(df_numbered.liznumber.min(),df_numbered.liznumber.max()))

In [None]:
possibleLizNum = set(range(int(df_numbered.liznumber.min()),int(df_numbered.liznumber.max())))
actualLizNum = set(pd.Series(df_numbered.liznumber.unique()).dropna().apply(int))
print("\nThere are {} entries.  There are {} unique lizard numbers.\
\n\nThe liznumber ranges from {} to {}."\
  .format(df_numbered.shape[0],len(df_numbered.liznumber.unique())\
          ,df_numbered.liznumber.min(),df_numbered.liznumber.max()))

missingLizNum = possibleLizNum - actualLizNum
if len(missingLizNum)>0:
    print("\n\nThe following numbers are not assigned to a lizard:\n{}"\
      .format(missingLizNum))
else:
    print("\n\nThere are no numbers which were not assigned.")

## Add additional columns
- *daysSinceCapture* [int]:identifies the number of days since the animal was captured
- *capture* [int]: identifies the number of times an animal has been captured prior to an entry

In [None]:
df_numbered.loc[:,'daysSinceCapture'] = (df_numbered.date - df_numbered.initialCaptureDate).dt.days


In [None]:
# need to QC this
df_numbered['capture'] = df_numbered.sort_values(['liznumber','date'])\
.groupby(['liznumber']).daysSinceCapture.cumcount()+1

In [None]:
print(df_numbered.groupby('capture').capture.count())

In [None]:
df_numbered.groupby('capture').capture.hist() # move this figure to R ggplot2

In [None]:
df_numbered.groupby('liznumber').capture.max().plot.hist()

## QC of Capture number and Recap status

In [None]:
df_numbered.columns

In [None]:
recapQuestion=df_numbered.loc[(df_numbered.capture==1 )& (df_numbered["new.recap"]=='recap'),:]
print("There are {} instances in rows for which a lizard appears to have only one capture, but is listed as a recap.\
The distribution of these across years in the sample is as follows:\n{}."\
      .format(recapQuestion.shape[0],recapQuestion.year.value_counts()))
recapQuestion.to_csv("Questionable recaptures.csv")#These individuals need to be rechecked in the raw notes
recapQuestion.head()

In [None]:
recapQuestion.loc[recapQuestion.svl<54,:]

Now we export the cleaned data to a csv

In [None]:
df_numbered = df_numbered.rename(index = str, columns = {'new.recap':'newRecap'})
qc_drop_cols = df_numbered.columns[df_numbered.columns.str.contains('force|drop')]
df_full = df_numbered.drop(qc_drop_cols,1)

In [None]:
timestamp = pd.to_datetime('now')-pd.Timedelta(hours=5)
#path=''C:\\Users\\Christopher\\Google Drive\\TailDemography\\outputFiles\\''
path=outputBig
#filename = path + 'cleaned CC data 2000-2017_' + str(timestamp)+ '.csv'
filename = path + '/cleaned CC data 2000-2017' + '.csv'
df_full.to_csv(filename,index = False)
filename