# Cleaning CC data

This Python notebook operates on a CSV file that was first edited in OpenRefine. It finishes cleaning the columns of interest that were easier to clean in Python.

<a id='TOC'></a>

# Table of Contents

1. [Setting up Python](#SettingUp)
    
    a. [Setting the Location](#SettingLoc)
    
    b. [Importing Data](#ImportingData)
    
    c. [Preparing for a Save](#PreparingSave)

    
2. [Inspecting the Data](#InspectingData)
3. [Cleaning Data](#CleaningData)
4. [Adding Columns](#AddCol)


<a id='SettingUp'></a>

# Setting up Python

[Top](#TOC)

Here we import necessary packages. 
This chunk may take a while.

In [1]:
import pandas as pd
import numpy as np
import os
from liz_number import lizsort,mindate,smallest,validate
from liz_toes import make_str,label_pattern, replace_pattern,report_pattern

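import plotly
# NOTE: plotly.plotly and plotly.tools.set_config_file were moved to the separate
# chart_studio package in plotly 4; this cell assumes an older plotly release.
import plotly.plotly as py
import plotly.graph_objs as go

plotly.tools.set_config_file(world_readable=True)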

# increase print limit
pd.options.display.max_rows = 99999
pd.options.display.max_columns = 50

<a id='SettingLoc'></a>

## Setting the location
[Top](#TOC)

These chunks identify the locations from which we can get data and to which we can save data.

In [2]:
# Source Data
sourceDataPers = 'C:/Users/Christopher/Google Drive/TailDemography/outputFiles'
sourceDataBig = 'S:/Chris/TailDemography/combined data'
sourceBlack = 'C:/Users/test/Desktop'

#Output Data paths
outputPers = 'C:/Users/Christopher/Google Drive/TailDemography/outputFiles'
outputBig = 'S:/Chris/TailDemography/data'
outputBlack = 'C:/Users/test/Desktop'


<a id='ImportingData'></a>

## Importing data
[Top](#TOC)

Here we import data from one of the available locations.

In [None]:
os.chdir(sourceDataBig)
df=pd.read_csv('mapped-data-all_18-01-08_post_openrefine.csv')
df.head()
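
The source directory above is switched by hand depending on the machine. A minimal sketch of picking the first location that actually exists is shown below; `first_existing` is a hypothetical helper, not part of this notebook, and the demonstration uses a temporary directory in place of the real data paths.

```python
import os
import tempfile

def first_existing(candidates):
    """Return the first path in `candidates` that is an existing directory, else None."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None

# Demonstration: a temporary directory stands in for the real data locations.
with tempfile.TemporaryDirectory() as real_dir:
    missing_dir = os.path.join(real_dir, 'does_not_exist')
    chosen = first_existing([missing_dir, real_dir])
    print(chosen == real_dir)
```

In this notebook the call might look like `first_existing([sourceDataBig, sourceDataPers, sourceBlack])`, with the result passed to `os.chdir` after checking that it is not `None`.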

<a id='PreparingSave'></a>

## Preparing for a save
[Top](#TOC)

Now we change the working directory so that output files are saved to our preferred location.

In [3]:
os.chdir(outputBig)
df.head()

Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,meters,new.recap,painted,misc,vial,year,rtl,autotomized,new.recap_orig,sighting,review_sex,review_species,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics
0,j,1-13-19,2000-03-17T00:00:00Z,f,52.0,74.0,0,4.2,r1c,1falls,,new,,,,2000,0.0,False,new,,True,False,False,False,False,False,False,False,False,False,False,False
1,j,1-13-20,2000-03-17T00:00:00Z,m,56.0,77.0,0,5.6,r2c,1falls,,new,,,,2000,0.0,False,new,,True,False,False,False,False,False,False,False,False,False,False,False
2,j,1-14-19,2000-03-17T00:00:00Z,f,57.0,81.0,0,6.6,r3c,wall on rt side v wall at pine xing,,new,,,,2000,0.0,False,new,,True,False,False,False,False,False,False,False,False,False,False,False
3,j,1-14-20,2000-03-17T00:00:00Z,f,57.0,79.0,0,5.5,r4c,wall on rt side v wall at pine xing,,new,,,,2000,0.0,False,new,,True,False,False,False,False,False,False,False,False,False,False,False
4,j,3-8,2000-03-17T00:00:00Z,f,82.0,89.0,27,17.0,r5c,oak across from bottom wall at pine xing,,recap,,shed since last recapture,,2000,27.0,True,recap,,True,False,False,False,False,False,False,False,False,False,False,False


<a id= 'CleaningData'></a>

# Cleaning the Data
[Top](#TOC)

Now we get to the actual cleaning of the data. We will inspect the data and take the appropriate cleaning steps:
- [Inspecting the Data](#InspectingData)
- [Correcting class of columns](#CorrectingClass)
- [Cleaning Toes](#CleaningToes)
- [Cleaning Sex](#Sex)
- [Cleaning autotomized](#Autotomized)
- [Cleaning new.recap](#NewRecap)

<a id= 'InspectingData'></a>

## Inspecting the Data
[Top](#TOC)

[Top Cleaning](#CleaningData)

Let's take a look at the data.

In [4]:
print("\nThere are {} data points in our data set.".format(df.shape[0]))


There are 8197 data points in our data set.


In [5]:
print("\nThe columns in the data have the following data types:\n{}".format(df.dtypes))


The columns in the data have the following data types:
species                object
toes                   object
date                   object
sex                    object
svl                   float64
tl                    float64
rtl_orig               object
mass                   object
paint.mark             object
location               object
meters                 object
new.recap              object
painted                object
misc                   object
vial                   object
year                    int64
rtl                   float64
autotomized              bool
new.recap_orig         object
sighting               object
review_sex               bool
review_species           bool
review_painted           bool
review_new.recap         bool
review_rtl               bool
forceMale                bool
forceFemale              bool
forceRecap               bool
forceNew                 bool
forceSighting            bool
drop_species             bool
drop_morphomet

<a id='CorrectingClass'></a>

## Correcting class of columns
[Top](#TOC)

[Top Cleaning](#CleaningData)

In [6]:
#We need to add real error handling into these conversion chunks

##Convert integer columns to int
intCols = ['meters']
df[intCols]=df[intCols].astype(int,errors='ignore')

##Convert numeric columns to numeric
numCols = ['svl','tl','rtl','rtl_orig','mass']
df[numCols]=df[numCols].apply(pd.to_numeric,errors='coerce')

##Convert string columns to str
strCols = ['toes','sex','species','vial']
df[strCols]=df[strCols].astype(str, errors='ignore')

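#Convert date to datetime
# NOTE: without a column label this assigns NaN to every column of the matching rows;
# df.loc[df.date=="NA",'date'] would limit the change to the date column.
df.loc[df.date=="NA"]=np.nan
df.date = pd.to_datetime(df.date,errors='coerce')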

##Convert bool columns to bool
boolCols = ['review_sex','review_species','review_painted','review_new.recap',\
            'review_rtl','forceMale','forceFemale','forceRecap','forceNew',\
            'forceSighting','drop_species','drop_morphometrics','autotomized']
df[boolCols]=df[boolCols].astype(bool, errors='ignore')
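
The comment at the top of this cell asks for real error handling. One possible approach, sketched here on a toy frame, is to coerce as above and then report which original values were lost in the process. `report_coerced` is a hypothetical helper, not part of this notebook.

```python
import pandas as pd

def report_coerced(frame, columns):
    """Coerce `columns` to numeric; return the converted frame and the values that failed to parse."""
    converted = frame.copy()
    failures = {}
    for col in columns:
        numeric = pd.to_numeric(frame[col], errors='coerce')
        # A failure is a value that was present before coercion but is NaN afterwards.
        lost = frame.loc[frame[col].notna() & numeric.isna(), col]
        if not lost.empty:
            failures[col] = lost.tolist()
        converted[col] = numeric
    return converted, failures

toy = pd.DataFrame({'mass': ['4.2', '5.6', '?'], 'svl': ['52', '56', '57']})
clean, failures = report_coerced(toy, ['mass', 'svl'])
print(failures)  # {'mass': ['?']}
```

Reviewing `failures` before continuing makes it visible when `errors='coerce'` has silently turned a typo into a missing value.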

In [7]:
print("\nAfter applying the above changes, the data types are as follows:\n{}".format(df.dtypes))


After applying the above changes, the data types are as follows:
species                       object
toes                          object
date                  datetime64[ns]
sex                           object
svl                          float64
tl                           float64
rtl_orig                     float64
mass                         float64
paint.mark                    object
location                      object
meters                        object
new.recap                     object
painted                       object
misc                          object
vial                          object
year                         float64
rtl                          float64
autotomized                     bool
new.recap_orig                object
sighting                      object
review_sex                      bool
review_species                  bool
review_painted                  bool
review_new.recap                bool
review_rtl                      bool
forceMale

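'year' was converted to float, most likely because the row-wide NaN assignment in the date step above upcasts integer columns (NaN cannot be stored in an int64 column), so we convert it back to int here.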

In [8]:
df.year = df.year.astype(int)
df.dtypes

species                       object
toes                          object
date                  datetime64[ns]
sex                           object
svl                          float64
tl                           float64
rtl_orig                     float64
mass                         float64
paint.mark                    object
location                      object
meters                        object
new.recap                     object
painted                       object
misc                          object
vial                          object
year                           int32
rtl                          float64
autotomized                     bool
new.recap_orig                object
sighting                      object
review_sex                      bool
review_species                  bool
review_painted                  bool
review_new.recap                bool
review_rtl                      bool
forceMale                       bool
forceFemale                     bool
f

<a id='CleaningToes'></a>

## Cleaning toes column
[Top](#TOC)

[Top Cleaning](#CleaningData)

First we will rename "toes" to "toes_orig"

In [9]:
df = df.rename(columns = {'toes':'toes_orig'},index = str)

Next we create a new column, "toes", which starts as a copy of "toes_orig" and will hold the corrected values.

In [10]:
df['toes'] = df.toes_orig

Now we attempt to identify problematic toe entries and either correct them or export them for review.

In [11]:
# raw strings keep the regex backslashes literal
pattern1 = r".( {1,}-.|.- {1,}.)" # entries with one or more spaces on either side of a hyphen
pattern2 = r".( {,}\w{,} {1,})." # entries with spaces around or between numbers <- the spaces here should be deleted
pattern3 = r".(')." # entries with an apostrophe between two characters
pattern4 = r"./."  # entries with '/' <- need to replace these with '-'
pattern5 = r"(\?{1,})" # entries with one or more '?' <- these need to be investigated
pattern6 = r"^\d{3,}$" # entries consisting of a single number of at least three digits
# <- these need to be investigated by checking raw field notes
pattern7 = r".(-{2,})." # entries with at least 2 consecutive '-' <- these should be investigated
pattern8 = r"^0" # entries in which single digit numbers have a leading "0" <- check raw field notes on this too
pattern9 = r"a\w" # entries with an 'a' followed by any character in [a-zA-Z0-9_] <- insert a hyphen between the 'a' and \w
pattern10 = r"b\w" # entries with a 'b' followed by any character in [a-zA-Z0-9_] <- insert a hyphen between the 'b' and \w
pattern11 = r"\wa" # entries with an 'a' preceded by any character in [a-zA-Z0-9_]
pattern12 = r"\wb" # entries with a 'b' preceded by any character in [a-zA-Z0-9_]
pattern13 = r"[()]" # entries containing parentheses
# TODO:
# remove space before 'a' at end of toes
# investigate '\d-',
# '-(*)-',
# ' (16) ',
# '---', <- may not exist in raw data
# '\d- ',
# '- \d',
# transcription errors from Excel (toes in date format),
# '-\d\d\d\d' <- may not be in the data set

This block must be edited by hand whenever toe patterns are added or removed.
This is not ideal and needs to be fixed.

In [12]:
toe_pattern = pd.Series([*range(1,14)]) 
toe_pattern = make_str(toe_pattern)
print(toe_pattern)

toe_pattern_descr = pd.Series([pattern1,pattern2,pattern3,pattern4
                               ,pattern5,pattern6,pattern7,pattern8
                               ,pattern9,pattern10,pattern11,pattern12,pattern13])
toe_pattern_descr = toe_pattern_descr.astype(str)
print(toe_pattern_descr)

toe_pattern_reference = pd.DataFrame({'toe_pattern': toe_pattern,'description':toe_pattern_descr})
toe_pattern_reference

0     01
1     02
2     03
3     04
4     05
5     06
6     07
7     08
8     09
9     10
10    11
11    12
12    13
dtype: object
0     .( {1,}-.|.- {1,}.)
1      .( {,}\w{,} {1,}).
2                   .(').
3                     ./.
4                (\?{1,})
5                ^\d{3,}$
6               .(-{2,}).
7                      ^0
8                     a\w
9                     b\w
10                    \wa
11                    \wb
12                   [()]
dtype: object


Unnamed: 0,toe_pattern,description
0,1,".( {1,}-.|.- {1,}.)"
1,2,".( {,}\w{,} {1,})."
2,3,.(').
3,4,./.
4,5,"(\?{1,})"
5,6,"^\d{3,}$"
6,7,".(-{2,})."
7,8,^0
8,9,a\w
9,10,b\w
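
One way to avoid editing this block whenever a pattern is added or removed is to keep the patterns in a single list and derive the labels from their positions. The sketch below uses only three of the patterns for brevity. It assumes that `make_str` simply zero-pads the numbers to two characters, as the printed output suggests; the name `reference` is illustrative.

```python
import pandas as pd

# Keep the patterns in one ordered list; labels are derived, not typed by hand.
patterns = [
    r".( {1,}-.|.- {1,}.)",
    r".( {,}\w{,} {1,}).",
    r".(').",
]

reference = pd.DataFrame({
    # Zero-padded labels '01', '02', ... (assumed to mirror make_str's output).
    'toe_pattern': [str(i).zfill(2) for i in range(1, len(patterns) + 1)],
    'description': patterns,
})
print(reference)
```

With this layout, adding a pattern means appending one string to `patterns`; the label column grows automatically.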


We first replace the string 'nan' (produced when `astype(str)` converted missing values to text) with a true null value.

In [13]:
df.loc[df.toes=='nan','toes'] = np.nan

Let's see how many of these patterns we need to correct

In [14]:
df['toe_pattern'] = np.nan

Here we use a for-loop to label the patterns.
(There is probably a better way to do this with pandas `map` or `apply`. For now the loop is fast enough, but the difference could matter with a larger data set or with more patterns.)

In [15]:
for i in range(0,toe_pattern_reference.shape[0]):
    tmp_pat_num = toe_pattern_reference.iloc[i,0]
    tmp_pattern = toe_pattern_reference.iloc[i,1]
    df = label_pattern(df,tmp_pat_num,tmp_pattern,'toe_pattern','toes')
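
As a rough illustration of what this labelling step does, the sketch below tags toy entries using `Series.str.contains`. It is an assumption about the behaviour of `label_pattern`, whose source is not shown here, not a copy of it. Note that with this approach a later pattern overwrites the label of an earlier one when both match the same entry.

```python
import pandas as pd

toy = pd.DataFrame({'toes': ['1-13-19', '1 -13', '3/8', None]})
toy['toe_pattern'] = None

# label -> regular expression, in the order the labels should be applied
toy_patterns = {'01': r" -|- ", '04': r"/"}

for label, regex in toy_patterns.items():
    # na=False treats missing toes as non-matching instead of propagating NaN
    mask = toy['toes'].str.contains(regex, na=False)
    toy.loc[mask, 'toe_pattern'] = label

print(toy)
```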

A quick summary of the number of observations for each pattern in the data set

In [16]:
toe_errors =df.toe_pattern.value_counts(dropna=False).reset_index()\
.rename(columns = {'index':'toe_pattern','toe_pattern':'observations'})
toe_errors.loc[toe_errors.toe_pattern.isnull(),'toe_pattern'] = 'Not covered by current patterns'
toe_errors_desc = toe_errors.merge(toe_pattern_reference,'left',on='toe_pattern')
toe_errors_desc

Unnamed: 0,toe_pattern,observations,description
0,Not covered by current patterns,7891,
1,02,252,".( {,}\w{,} {1,})."
2,01,40,".( {1,}-.|.- {1,}.)"
3,05,7,"(\?{1,})"
4,06,3,"^\d{3,}$"
5,13,1,[()]
6,04,1,./.
7,09,1,a\w
8,08,1,^0


Now let's make sure we've accounted for every row in the data set

In [17]:
accountedRows = toe_errors.observations.sum()
totalRows = df.shape[0]
notAccountedRows = df.shape[0] - toe_errors.observations.sum()
print("\nThere are {} rows accounted for in the patterns (including null values) and there are {} rows in the full data set.\
  There are {} rows unaccounted for.".format(accountedRows,totalRows,notAccountedRows))


There are 8197 rows accounted for in the patterns (including null values) and there are 8197 rows in the full data set.  There are 0 rows unaccounted for.


### And now we correct these patterns
We'll preserve the original toe data in a column called "toes_orig" just in case.  We can drop this later, if we are comfortable with the changes.  The new toes will be labeled "toes".

In [18]:
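# NOTE: "\"\"" is the two-character string of double quotes, not an empty string.
# Whether the matched text ends up deleted depends on how replace_pattern treats it.
corrections_config = {'01':{'action':'replace','pattern_b':" ",'replacement':"\"\""},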
            '02':{'action':'replace','pattern_b':" ",'replacement':"-"},
            '03':{'action':'replace','pattern_b':"\'",'replacement':"\"\""},
            '04':{'action':'replace','pattern_b':"/",'replacement':"-"},
            '05':{'action':'save','pattern_b':np.nan,'replacement':np.nan},
            '06':{'action':'save','pattern_b':np.nan,'replacement':np.nan},
            '07':{'action':'save','pattern_b':np.nan,'replacement':np.nan},
            '08':{'action':'replace','pattern_b':"^0",'replacement':"\"\""},
            '09':{'action':'replace','pattern_b':'a','replacement':'-a'},
            '10':{'action':'replace','pattern_b':'b','replacement':'-b'},          
            '11':{'action':'replace','pattern_b':"a",'replacement':"a-"},
            '12':{'action':'replace','pattern_b':"b",'replacement':"b-"},
            '13':{'action':'replace','pattern_b':"[()]",'replacement':"\"\""}}

In [19]:
toe_errors_desc['action'] = toe_errors_desc.loc[toe_errors_desc.toe_pattern.str.len()==2].toe_pattern\
.map(lambda x: corrections_config[x]['action'],na_action='ignore')

toe_errors_desc['replacement'] = toe_errors_desc.loc[toe_errors_desc.toe_pattern.str.len()==2].toe_pattern\
.map(lambda x: corrections_config[x]['replacement'],na_action='ignore')

toe_errors_desc = toe_errors_desc.sort_values('toe_pattern').reset_index(drop=True)
toe_errors_desc

Unnamed: 0,toe_pattern,observations,description,action,replacement
0,01,40,".( {1,}-.|.- {1,}.)",replace,""""""
1,02,252,".( {,}\w{,} {1,}).",replace,-
2,04,1,./.,replace,-
3,05,7,"(\?{1,})",save,
4,06,3,"^\d{3,}$",save,
5,08,1,^0,replace,""""""
6,09,1,a\w,replace,-a
7,13,1,[()],replace,""""""
8,Not covered by current patterns,7891,,,


In [20]:
for i in range(0,toe_errors_desc.shape[0]):
    tmp_pat_num = toe_errors_desc.iloc[i,0]
    tmp_pattern = toe_errors_desc.iloc[i,2]
    action = toe_errors_desc.iloc[i,3]
    tmp_replacement = toe_errors_desc.iloc[i,4]
    tmp_x = df.loc[df.toe_pattern==tmp_pat_num,:]
    
    if action =='save':
        # export the rows that need manual review
        tmp_filename = 'pattern'+tmp_pat_num+'.csv'
        tmp_x.to_csv(tmp_filename)
        print("Pattern {} successfully saved to {}.".format(tmp_pattern,tmp_filename))
    # elif (not a second if) so that saved patterns do not also fall through to the else branch
    elif action =='replace':
        # NOTE: the 'pattern_b' entries of corrections_config are not used here;
        # the full description regex is passed as pattern_b instead
        df.loc[df.toe_pattern==tmp_pat_num,'toes'] = replace_pattern(x=df.loc[df.toe_pattern==tmp_pat_num]
                                                                     ,pattern = tmp_pat_num
                                                                     ,pattern_b = tmp_pattern
                                                                     ,source_col = 'toes'
                                                                    ,replacement = tmp_replacement)
        print("Pattern {} successfully replaced with {}.".format(tmp_pattern,tmp_replacement))
    else:
        print("No direction provided for pattern {}.  No action was taken.".format(tmp_pattern))

Pattern .( {1,}-.|.- {1,}.) successfully replaced with "".
Pattern .( {,}\w{,} {1,}). successfully replaced with -.
Pattern ./. successfully replaced with -.
Pattern (\?{1,}) successfully saved to pattern05.csv.
Pattern ^\d{3,}$ successfully saved to pattern06.csv.
Pattern ^0 successfully replaced with "".
Pattern a\w successfully replaced with -a.
Pattern [()] successfully replaced with "".
No direction provided for pattern nan.  No action was taken.
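
For readers without access to `replace_pattern`, the replace action can be approximated with `Series.str.replace`. This is a sketch under the assumption that each correction is a regex substitution restricted to the rows carrying a given label; the real helper may behave differently. The names `toy` and `toy_corrections` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'toes': ['1 -13', '3/8', '1-14-20'],
    'toe_pattern': ['01', '04', None],
})

# label -> (regex to substitute, replacement); '' deletes the matched text
toy_corrections = {'01': (r" ", ''), '04': (r"/", '-')}

for label, (regex, replacement) in toy_corrections.items():
    rows = toy['toe_pattern'] == label
    # only the rows labelled with this pattern are rewritten
    toy.loc[rows, 'toes'] = toy.loc[rows, 'toes'].str.replace(regex, replacement, regex=True)

print(toy['toes'].tolist())  # ['1-13', '3-8', '1-14-20']
```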


### Now we confirm that the patterns we expect to have eliminated have indeed been eliminated from the data set

In [21]:
for i in range(0,toe_pattern_reference.shape[0]):
    tmp_pattern = str(toe_pattern_reference.iloc[i,1])
    report_pattern(df,tmp_pattern,'toes','Post-Correction')

<a id='Sex'></a>

## Cleaning Sex column
[Top](#TOC)

[Top Cleaning](#CleaningData)

Next we move on to cleaning the "sex" column.

First we want to get an idea of the types of problems in the sex column.  We start by stripping leading and trailing whitespace.  The unique lengths are the same before and after, so there was none in the data set.

In [22]:
print(df.sex.str.len().unique())# returns unique lengths of sex
df.sex=df.sex.str.strip()
print(df.sex.str.len().unique())

[1 3 2 5]
[1 3 2 5]


### Identify non "m" or "f" values and their frequencies

In [23]:
patterns_sex="m|f|NA"
non_matches=df.sex.loc[df.sex.str.match(patterns_sex)!=True]
print("\nThere are {} entries for sex which do not match the patterns {}:"\
      .format(non_matches.shape[0],patterns_sex.split("|")))
non_matches.value_counts()


There are 5412 entries for sex which do not match the patterns ['m', 'f', 'NA']:


nan      5255
juv       128
?          16
?f          6
n           2
[m]         1
unm         1
adult       1
?m          1
???         1
Name: sex, dtype: int64
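
`Series.str.match` only anchors the pattern at the start of each string, so a value such as `f?` would count as a match for `m|f|NA`. Whether that matters here depends on the data. The sketch below, on toy values, shows the difference from `Series.str.fullmatch` (available in pandas 1.1 and later).

```python
import pandas as pd

sex = pd.Series(['m', 'f', 'f?', 'juv', 'NA'])
pattern = "m|f|NA"

starts_with = sex.str.match(pattern)      # anchored at the start only
whole_value = sex.str.fullmatch(pattern)  # the entire string must match

print(starts_with.tolist())  # [True, True, True, False, True]
print(whole_value.tolist())  # [True, True, False, False, True]
```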

### Identify values to convert to NA, m, or f

In [24]:
sex2NA=['adult','juv','nan']
sex2m=['unm']
df.loc[df.sex.isin(sex2NA)==True]
print(df.sex.loc[df.sex.isin(sex2NA)==True].count())
print(df.sex.loc[df.sex.isin(sex2m)==True].count())

5384
1


### Convert the values to NA or m, respectively.

In [25]:
df.loc[df.sex.isin(sex2m)]

Unnamed: 0,species,toes_orig,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,meters,new.recap,painted,misc,vial,year,rtl,autotomized,new.recap_orig,sighting,review_sex,review_species,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics,toes,toe_pattern
7726,up,,2017-07-20,unm,,,,,,3m above sb on rt side 4m ^ CC/CCC,249,,,climbing; couldn't catch,,2017,,True,,,True,False,False,True,False,False,False,False,False,False,True,True,,


In [26]:
df.loc[df.sex.isin(sex2NA),'sex']=np.nan
df.loc[df.sex.isin(sex2m),'sex']='m'
print(df.sex.loc[df.sex.isin(sex2NA)==True].count())
print(df.sex.loc[df.sex.isin(sex2m)==True].count())

0
0


### Set all remaining species and sex with "?" to NaN

In [27]:
df.loc[(df.species.str.contains(r'\?')) & (df.species.notnull()),'species'] = np.nan
df.loc[(df.sex.str.contains(r'\?')) & (df.sex.notnull()),'sex'] = np.nan

<a id='Autotomized'></a>

## Cleaning autotomized column
[Top](#TOC)

[Top Cleaning](#CleaningData)

Here we inspect and clean the autotomized column.

In [28]:
autotomyDict = {False:'intact',True:'autotomized'}

df.loc[:,'autotomized'] = df.loc[:,'autotomized'].map(autotomyDict)
df.autotomized.unique()

array(['intact', 'autotomized'], dtype=object)

<a id='NewRecap'></a>

## Cleaning new.recap column
[Top](#TOC)

[Top Cleaning](#CleaningData)

In [29]:
df.head()

Unnamed: 0,species,toes_orig,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,meters,new.recap,painted,misc,vial,year,rtl,autotomized,new.recap_orig,sighting,review_sex,review_species,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics,toes,toe_pattern
0,j,1-13-19,2000-03-17,f,52.0,74.0,0.0,4.2,r1c,1falls,,new,,,,2000,0.0,intact,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-13-19,
1,j,1-13-20,2000-03-17,m,56.0,77.0,0.0,5.6,r2c,1falls,,new,,,,2000,0.0,intact,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-13-20,
2,j,1-14-19,2000-03-17,f,57.0,81.0,0.0,6.6,r3c,wall on rt side v wall at pine xing,,new,,,,2000,0.0,intact,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-14-19,
3,j,1-14-20,2000-03-17,f,57.0,79.0,0.0,5.5,r4c,wall on rt side v wall at pine xing,,new,,,,2000,0.0,intact,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-14-20,
4,j,3-8,2000-03-17,f,82.0,89.0,27.0,17.0,r5c,oak across from bottom wall at pine xing,,recap,,shed since last recapture,,2000,27.0,autotomized,recap,,True,False,False,False,False,False,False,False,False,False,False,False,3-8,


In [30]:
# TODO: try using a dict to do this more efficiently
newRecapKeep = ['recap', 'new', 'r', 'n']
new = ['new','n']
recap = ['recap','r']
df.loc[~df['new.recap'].isin(newRecapKeep),'new.recap'] = np.nan
df.loc[df['new.recap'].isin(new),'new.recap'] = 'new'
df.loc[df['new.recap'].isin(recap),'new.recap'] = 'recap'
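
The dict-based approach mentioned in the comment could look like the sketch below, shown on a toy Series. `Series.map` with a dict returns NaN for any value that is not a key, so the "keep" list and the two recoding steps collapse into one mapping. The name `recode` is illustrative.

```python
import pandas as pd

new_recap = pd.Series(['new', 'n', 'recap', 'r', '?', None])

# Canonical value for every accepted spelling; anything else becomes NaN.
recode = {'new': 'new', 'n': 'new', 'recap': 'recap', 'r': 'recap'}
cleaned = new_recap.map(recode)

print(cleaned.tolist())
```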

<a id='AddCol'></a>

# Adding New Columns
[Top](#TOC)

We need to add new columns which we will use later in analyses:
- [TL_SVL](#TlSvl)
- [Mass_SVL](#MassSvl)
- [Lizard Number](#LizardNumber)

<a id= 'TlSvl'></a>

## TL_SVL 
[Top](#TOC)

[Top Add Columns](#AddCol)



In [None]:
df['tl_svl']=(df.tl/df.svl)

<a id='MassSvl'></a>

## Mass_SVL
[Top](#TOC)

[Top Add Columns](#AddCol)



In [31]:
df['mass_svl']=(df.mass/df.svl)

<a id= 'LizardNumber'></a>

## Lizard Number
[Top](#TOC)

[Top Add Columns](#AddCol)

### Initial attempt to assign lizard numbers

In [32]:
sortable = lizsort(df, path = 'S:\\Chris\\TailDemography\\data')
    
sortable = mindate(sortable)
sortable = smallest(sortable)
tmp_sort = validate(sortable)
sortable = tmp_sort['val_data']


There were 5877 entries for which values for one of the critical criteria, (['species', 'toes', 'sex', 'date', 'svl']), were null.      These entries could not be evaluated and were written out to the file unsortable.csv for evaluation.

Of those entries we can handle, there are 2240 individuals as defined by ['species', 'toes', 'sex'] which pass validation based    on ['date', 'svl'] and 80 which do not pass validation.


### Second attempt to assign lizard numbers

In [33]:
help(mindate)

Help on function mindate in module liz_number:

mindate(x, sort_criteria=['species', 'toes', 'sex'])
    takes a pandas data frame and returns a dataframe with sorting criteria adds a column containing the earliest date
    at which each unique combination of the sort criteria was sighted. [Requires that the source dataframe,x, has a
    column labeled 'date'.]



In [34]:
tmp_sort['n_val_data'].head()

Unnamed: 0,species,toes_orig,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,meters,new.recap,painted,misc,vial,year,rtl,autotomized,new.recap_orig,sighting,review_sex,review_species,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics,toes,toe_pattern,tl_svl,mass_svl,initialCaptureDate,year_diff,smallest_svl,svl_diff,tmp
0,j,1-13-19,2000-03-17,f,52.0,74.0,0.0,4.2,r1c,1falls,,new,,,,2000,0.0,intact,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-13-19,,1.423077,0.080769,2000-03-17,0,52.0,0.0,1
1,j,1-13-20,2000-03-17,m,56.0,77.0,0.0,5.6,r2c,1falls,,new,,,,2000,0.0,intact,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-13-20,,1.375,0.1,2000-03-17,0,56.0,0.0,1
2,j,1-14-19,2000-03-17,f,57.0,81.0,0.0,6.6,r3c,wall on rt side v wall at pine xing,,new,,,,2000,0.0,intact,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-14-19,,1.421053,0.115789,2000-03-17,0,57.0,0.0,1
3,j,1-14-20,2000-03-17,f,57.0,79.0,0.0,5.5,r4c,wall on rt side v wall at pine xing,,new,,,,2000,0.0,intact,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-14-20,,1.385965,0.096491,2000-03-17,0,57.0,0.0,1
4,j,3-8,2000-03-17,f,82.0,89.0,27.0,17.0,r5c,oak across from bottom wall at pine xing,,recap,,shed since last recapture,,2000,27.0,autotomized,recap,,True,False,False,False,False,False,False,False,False,False,False,False,3-8,,1.085366,0.207317,2000-03-17,0,82.0,0.0,1


In [35]:
n_val = mindate(tmp_sort['n_val_data'])
n_val = smallest(n_val)
df_numbered = validate(n_val)['val_data']


Of those entries we can handle, there are 2240 individuals as defined by ['species', 'toes', 'sex'] which pass validation based    on ['date', 'svl'] and 0 which do not pass validation.


### Displaying the output data frame

In [36]:
df_numbered

Unnamed: 0,species,toes_orig,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,meters,new.recap,painted,misc,vial,year,rtl,autotomized,new.recap_orig,sighting,review_sex,review_species,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics,toes,toe_pattern,tl_svl,mass_svl,year_diff,svl_diff,initialCaptureDate,liznumber
0,j,1-13-19,2000-03-17,f,52.0,74.0,0.0,4.2,r1c,1falls,,new,,,,2000,0.0,intact,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-13-19,,1.423077,0.080769,0,0.0,2000-03-17,37
1,j,1-13-20,2000-03-17,m,56.0,77.0,0.0,5.6,r2c,1falls,,new,,,,2000,0.0,intact,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-13-20,,1.375,0.1,0,0.0,2000-03-17,512
2,j,1-14-19,2000-03-17,f,57.0,81.0,0.0,6.6,r3c,wall on rt side v wall at pine xing,,new,,,,2000,0.0,intact,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-14-19,,1.421053,0.115789,0,0.0,2000-03-17,44
3,j,1-14-20,2000-03-17,f,57.0,79.0,0.0,5.5,r4c,wall on rt side v wall at pine xing,,new,,,,2000,0.0,intact,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-14-20,,1.385965,0.096491,0,0.0,2000-03-17,45
4,j,3-8,2000-03-17,f,82.0,89.0,27.0,17.0,r5c,oak across from bottom wall at pine xing,,recap,,shed since last recapture,,2000,27.0,autotomized,recap,,True,False,False,False,False,False,False,False,False,False,False,False,3-8,,1.085366,0.207317,0,0.0,2000-03-17,273
5,j,1-15-16,2000-03-17,m,58.0,64.0,24.0,5.5,r6c,sb half way up from bottom wall to pine xing,,new,,,,2000,24.0,autotomized,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-15-16,,1.103448,0.094828,0,0.0,2000-03-17,518
6,j,1-18,2000-03-17,f,58.0,62.0,20.0,7.0,r7c,sb 3/4 way up from bottom wall to pine xing,,new,,,,2000,20.0,autotomized,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-18,,1.068966,0.12069,0,0.0,2000-03-17,49
7,j,1-13-18,2000-03-17,f,54.0,75.0,0.0,5.5,r8c,sb at pine xing,,recap,,shed since last recapture,,2000,0.0,intact,recap,,True,False,False,False,False,False,False,False,False,False,False,False,1-13-18,,1.388889,0.101852,0,0.0,2000-03-17,35
8,j,1-19,2000-03-17,m,62.0,84.0,0.0,7.5,r9c,sb 10m ^ root xing,,new,,,,2000,0.0,intact,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-19,,1.354839,0.120968,0,0.0,2000-03-17,524
9,j,1-20,2000-03-17,f,60.0,80.0,0.0,8.0,r10c,sb at H3,,new,,,,2000,0.0,intact,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-20,,1.333333,0.133333,0,33.0,2000-03-17,53


### QC of lizard numbers

Identify individuals that have the same species and toes but different sex, and export them for review.

In [37]:
df = df.merge(df.groupby(['species','toes']).sex.nunique().reset_index().rename(columns = {'sex':'sex_count'})\
         ,how = 'inner', on = ['species','toes'])
print(df.loc[df.sex_count>1,:].shape[0])
df.loc[df.sex_count>1,:].to_csv('entries flagged with same species and toes diff sex.csv')
df.head()

456


Unnamed: 0,species,toes_orig,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,meters,new.recap,painted,misc,vial,year,rtl,autotomized,new.recap_orig,sighting,review_sex,review_species,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics,toes,toe_pattern,tl_svl,mass_svl,sex_count
0,j,1-13-19,2000-03-17,f,52.0,74.0,0.0,4.2,r1c,1falls,,new,,,,2000,0.0,intact,new,,True,False,False,False,False,False,False,False,False,False,False,False,1-13-19,,1.423077,0.080769,1
1,j,1-13-19,2000-03-17,f,53.0,69.0,0.0,5.0,r18c,Rs opp slab,,recap,,shed since last recapture,,2000,0.0,intact,recap,,True,False,False,False,False,False,False,False,False,False,False,False,1-13-19,,1.301887,0.09434,1
2,j,1-13-19,2000-06-24,f,63.0,93.0,0.0,6.7,o11a,halfway between 1 falls and cave trail,,,,,,2000,0.0,intact,,,True,False,False,True,False,False,False,False,False,False,False,False,1-13-19,,1.47619,0.106349,1
3,j,1-13-19,2001-07-13,f,79.0,108.0,0.0,14.6,r31a,15m ^ 1falls,16.0,recap,,,shed since; Tss,2001,0.0,intact,recap,,True,False,False,False,False,False,False,False,False,False,False,False,1-13-19,,1.367089,0.18481,1
4,j,1-13-19,2008-07-18,f,68.0,86.0,0.0,9.3,y11a.t,H3/H4,186.0,recap,painted,y11 paintmark v faded; painted on top of faded...,,2008,0.0,intact,r,,True,False,False,True,False,False,False,True,False,False,False,False,1-13-19,,1.264706,0.136765,1
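
The same per-group count can be obtained without a merge by using `groupby(...).transform('nunique')`, which returns one value per row. The sketch below uses toy data. Behaviour for rows with missing `species` or `toes` may differ from the inner merge above, so row counts should be compared before swapping one for the other.

```python
import pandas as pd

toy = pd.DataFrame({
    'species': ['j', 'j', 'j', 'v'],
    'toes':    ['1-13', '1-13', '3-8', '2-7'],
    'sex':     ['f', 'm', 'f', 'm'],
})

# Number of distinct sexes recorded for each (species, toes) combination,
# broadcast back to every row of that combination.
toy['sex_count'] = toy.groupby(['species', 'toes'])['sex'].transform('nunique')

# Combinations recorded with more than one sex are candidates for review.
flagged = toy.loc[toy['sex_count'] > 1]
print(flagged)
```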


In [38]:
df.groupby(['species','toes']).sex.nunique()

species  toes                 
as       15                       0
         16                       0
cn ex    1-7                      1
         7                        1
j        1- 6 -14                 1
         1-10-11-16               1
         1-10-11-17               1
         1-10-11-18               1
         1-10-11-19               1
         1-10-11-20               1
         1-10-12-16               1
         1-10-12-17               1
         1-10-12-18               1
         1-10-12-19               1
         1-10-12-20               1
         1-10-13-15               1
         1-10-13-16               1
         1-10-13-18               2
         1-10-13-19               1
         1-10-13-20               2
         1-10-14-16               1
         1-10-14-17               1
         1-10-14-18               2
         1-10-14-19               1
         1-10-14-20               1
         1-10-15-16               1
         1-10-15-17              

In [39]:
print("Lizard Numbers in the sample range from {} to {}."\
      .format(df_numbered.liznumber.min(),df_numbered.liznumber.max()))

Lizard Numbers in the sample range from 1 to 1419.


In [40]:
# range() excludes its upper bound, so add 1 to include the largest lizard number
possibleLizNum = set(range(int(df_numbered.liznumber.min()),int(df_numbered.liznumber.max())+1))
actualLizNum = set(pd.Series(df_numbered.liznumber.unique()).dropna().apply(int))
print("\nThere are {} entries.  There are {} unique lizard numbers.\
\n\nThe liznumber ranges from {} to {}."\
  .format(df_numbered.shape[0],len(df_numbered.liznumber.unique())\
          ,df_numbered.liznumber.min(),df_numbered.liznumber.max()))

missingLizNum = possibleLizNum - actualLizNum
if len(missingLizNum)>0:
    print("\n\nThe following numbers are not assigned to a lizard:\n{}"\
      .format(missingLizNum))
else:
    print("\n\nThere are no numbers which were not assigned.")


There are 2240 entries.  There are 1419 unique lizard numbers.

The liznumber ranges from 1 to 1419.


There are no numbers which were not assigned.
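
The gap check is a set difference between every number in the observed range and the numbers actually assigned. A small sketch with made-up lizard numbers shows what a non-empty result looks like:

```python
# Made-up lizard numbers with two gaps (3 and 6).
liz_numbers = [1, 2, 4, 5, 7]

# range() excludes its upper bound, hence the + 1.
possible = set(range(min(liz_numbers), max(liz_numbers) + 1))
missing = sorted(possible - set(liz_numbers))

print(missing)  # [3, 6]
```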


## Add additional columns
- *daysSinceCapture* [int]: the number of days since the animal's initial capture
- *capture* [int]: the capture number of the entry for that animal (1 for its first capture, 2 for its second, and so on)

In [41]:
df_numbered.loc[:,'daysSinceCapture'] = (df_numbered.date - df_numbered.initialCaptureDate).dt.days


In [42]:
# Needs QC: this seems to produce several cases in which individuals listed as recaps
# have only one capture
df_numbered['capture'] = df_numbered.sort_values(['liznumber','date'])\
.groupby(['liznumber']).daysSinceCapture.cumcount()+1

In [43]:
df_numbered.species.unique()

array(['j', 'uo', 'v', 'sc', 'cn ex'], dtype=object)

In [44]:
print(df_numbered.loc[df_numbered.species.isin(['j','v'])].groupby('capture').capture.count())

capture
1    1294
2     449
3     194
4      84
5      39
6      16
7       7
8       3
Name: capture, dtype: int64


In [45]:
jarrovii = go.Histogram(x = df_numbered.loc[df_numbered.species.isin(['j'])].groupby('liznumber')\
                     .capture.max(),name = 'S. jarrovii')
virgatus = go.Histogram(x = df_numbered.loc[df_numbered.species.isin(['v'])].groupby('liznumber')\
                     .capture.max(), name = 'S. virgatus')
data = [jarrovii, virgatus]
layout = go.Layout(
    title = 'Maximum Number of Captures per Individual 2000-2017',
    titlefont = dict(
        size = 20),
    xaxis = dict(
        dtick = 1,
        title = 'Maximum Number of Captures',
        titlefont = dict(
            size = 18)),
    yaxis = dict(
        title = 'Number of Lizards',
        titlefont = dict(
            size = 18))
)
fig = go.Figure(
        data = data,
        layout = layout)
py.iplot(fig, filename = 'Frequency of Captures in Crystal Creek 2000 - 2017 (by species)')

In [46]:
# ADD HORIZONTAL LINES FOR EACH YEAR
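# One point per lizard: x and y must both come from the per-lizard aggregate,
# otherwise the row-level liznumber column and the grouped maxima differ in length
j_lizards = go.Scatter(x = df_numbered.loc[df_numbered.species.isin(['j'])]\
                      .groupby('liznumber').daysSinceCapture.max().index,
                   y = df_numbered.loc[df_numbered.species.isin(['j'])]\
                      .groupby('liznumber').daysSinceCapture.max(), 
                     mode = 'markers')
v_lizards = go.Scatter(x = df_numbered.loc[df_numbered.species.isin(['v'])]\
                      .groupby('liznumber').daysSinceCapture.max().index,
                   y = df_numbered.loc[df_numbered.species.isin(['v'])]\
                      .groupby('liznumber').daysSinceCapture.max(), 
                     mode = 'markers')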
# year1 = go.Scatter(x=[df_numbered.liznumber.min(),df_numbered.liznumber.max()],y = (365))
# year2 = go.Scatter(y = 365*2)
# year3 = go.Scatter(y = 365*3)
# year4 = go.Scatter(y = 365*4)
# year5 = go.Scatter(y = 365*5)
# year6 = go.Scatter(y = 365*6)
# year7 = go.Scatter(y = 365*7)
# year8 = go.Scatter(y = 365*8)

# data = [j_lizards, v_lizards, year1, year2, year3, year4, year5, year6, year7, year8]
data = [j_lizards, v_lizards]
layout = go.Layout(
    title = 'Days Since Initial Capture in Crystal Creek 2000 - 2017',
        titlefont = dict(
            size = 20),
    xaxis = dict(
            title='Lizard Number',
            titlefont=dict(
                size=18)),
    yaxis = dict(
            title='Greatest Number of Days Since<br> Initial Capture',
            titlefont=dict(
                size=18)))

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename = 'Days Since Initial Capture in Crystal Creek 2000 - 2017')

In [56]:
dfF = df_numbered.loc[(df_numbered.sex =='f' )& (df_numbered.species.isin(['j','v']))]
dfM = df_numbered.loc[(df_numbered.sex =='m') & (df_numbered.species.isin(['j','v']))]

In [57]:
females = go.Scatter(
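    # one point per lizard: take x from the index of the per-lizard maximum
    x = dfF.groupby('liznumber').daysSinceCapture.max().index,
    y = dfF.groupby('liznumber').daysSinceCapture.max(),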
    name = 'females',
    mode = 'markers',
    marker = dict(
        color = 'rgba(152, 0, 0, .8)',
        opacity = 0.75,
        line = dict(
            width = 2,
            color = 'rgb(0, 0, 0)'
        )
    )
)

males = go.Scatter(
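    # one point per lizard: take x from the index of the per-lizard maximum
    x = dfM.groupby('liznumber').daysSinceCapture.max().index,
    y = dfM.groupby('liznumber').daysSinceCapture.max(),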
    name = 'males',
    mode = 'markers',
    marker = dict(
        color = 'rgba(255, 182, 193, .9)',
        opacity = 0.75,
        line = dict(
            width = 2,
        )
    )
)

data = [females, males]

layout = dict(title = 'Days Since Initial Capture in Crystal Creek 2000 - 2017 By Sex',
              yaxis = dict(
                  title='Greatest Number of Days Since<br> Initial Capture',
                  titlefont=dict(
                      size=18)
              ),
              xaxis = dict(zeroline = False)
             )

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='Days Since Initial Capture in Crystal Creek 2000 - 2017 By Sex')

In [58]:
males = go.Histogram(x = df_numbered.loc[(df_numbered.sex == 'm')& (df_numbered.species.isin(['j','v']))
                                                                    ,'capture']
                     ,opacity= 0.75,name='males')
females = go.Histogram(x = df_numbered.loc[(df_numbered.sex == 'f')& (df_numbered.species.isin(['j','v']))
                                                                      ,'capture']
                       , opacity= 0.75, name = 'females')
data = [males,females]
py.iplot(data, filename = 'Frequency of Captures by Sex in Crystal Creek 2000 - 2017')

## QC of Capture number and Recap status

In [60]:
df_numbered.columns

Index(['species', 'toes_orig', 'date', 'sex', 'svl', 'tl', 'rtl_orig', 'mass',
       'paint.mark', 'location', 'meters', 'newRecap', 'painted', 'misc',
       'vial', 'year', 'rtl', 'autotomized', 'new.recap_orig', 'sighting',
       'review_sex', 'review_species', 'review_painted', 'review_new.recap',
       'review_rtl', 'forceMale', 'forceFemale', 'forceRecap', 'forceNew',
       'forceSighting', 'drop_species', 'drop_morphometrics', 'toes',
       'toe_pattern', 'tl_svl', 'mass_svl', 'year_diff', 'svl_diff',
       'initialCaptureDate', 'liznumber', 'daysSinceCapture', 'capture'],
      dtype='object')

In [61]:
recapQuestion=df_numbered\
.loc[(df_numbered.capture==1 )&(df_numbered['newRecap']=='recap')&(df_numbered.species.isin(['j','v'])),:]
print("There are {} rows in which a lizard appears to have only one capture \
but is listed as a recap.\n\
The distribution of these across years in the sample is as follows:\n{}."\
      .format(recapQuestion.shape[0],recapQuestion.year.value_counts()))
recapQuestion.to_csv("Questionable recaptures.csv")  # These individuals need to be rechecked in the raw notes
recapQuestion.head()

There are 280 rows in which a lizard appears to have only one capture but is listed as a recap.
The distribution of these across years in the sample is as follows:
2002    50
2000    33
2003    28
2007    22
2009    19
2005    18
2012    17
2001    17
2015    16
2008    14
2004    12
2013    10
2010    10
2016     7
2014     4
2017     3
Name: year, dtype: int64.


Unnamed: 0,species,toes_orig,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,meters,newRecap,painted,misc,vial,year,rtl,autotomized,new.recap_orig,sighting,review_sex,review_species,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics,toes,toe_pattern,tl_svl,mass_svl,year_diff,svl_diff,initialCaptureDate,liznumber,daysSinceCapture,capture
4,j,3-8,2000-03-17,f,82.0,89.0,27.0,17.0,r5c,oak across from bottom wall at pine xing,,recap,,shed since last recapture,,2000,27.0,autotomized,recap,,True,False,False,False,False,False,False,False,False,False,False,False,3-8,,1.085366,0.207317,0,0.0,2000-03-17,273,0,1
7,j,1-13-18,2000-03-17,f,54.0,75.0,0.0,5.5,r8c,sb at pine xing,,recap,,shed since last recapture,,2000,0.0,intact,recap,,True,False,False,False,False,False,False,False,False,False,False,False,1-13-18,,1.388889,0.101852,0,0.0,2000-03-17,35,0,1
10,j,3-14-17,2000-03-17,f,74.0,97.0,0.0,15.6,r11c,1falls,,recap,,shed since last recapture,,2000,0.0,intact,recap,,True,False,False,False,False,False,False,False,False,False,False,False,3-14-17,,1.310811,0.210811,0,0.0,2000-03-17,254,0,1
13,j,5-7-15-20,2000-03-17,f,65.0,86.0,18.0,8.4,r14c,sb half way up from bottom wall to pine xing,,recap,,shed since last recapture,,2000,18.0,autotomized,recap,,True,False,False,False,False,False,False,False,False,False,False,False,5-7-15-20,,1.323077,0.129231,0,0.0,2000-03-17,402,0,1
14,j,1-11-19,2000-03-17,f,60.0,76.0,0.0,7.0,r15c,T crossing sb at CC/CCC,,recap,,shed since last recapture,,2000,0.0,intact,recap,,True,False,False,False,False,False,False,False,False,False,False,False,1-11-19,,1.266667,0.116667,0,0.0,2000-03-17,22,0,1
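
One way to triage the flagged rows is to attach the total number of records for each animal, which separates single-record animals from animals that were caught again later. The following is only a sketch on hypothetical toy data; the `n_records` column is a name introduced for illustration and is not part of the dataset.

```python
import pandas as pd

# Toy records (hypothetical): lizard 1 has two records, lizard 2 has one
toy = pd.DataFrame({
    'liznumber': [1, 1, 2, 3, 3],
    'capture':   [1, 2, 1, 1, 2],
    'newRecap':  ['recap', 'recap', 'recap', 'new', 'recap'],
})

# Rows that are the first record for an animal yet are labelled as a recapture
flagged = toy.loc[(toy.capture == 1) & (toy.newRecap == 'recap')].copy()

# Total number of records per lizard (the highest capture number reached)
n_records = toy.groupby('liznumber').capture.max()

# Attach the per-lizard total to each flagged row
flagged['n_records'] = flagged.liznumber.map(n_records)

print(flagged)
```

Flagged rows with `n_records == 1` are animals recorded only once; rows with a larger value are first records of animals that do reappear, which may point to a first capture made before the sampled period or to a mislabelled entry.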


In [52]:
recapQuestion.loc[recapQuestion.svl<54,:]

Unnamed: 0,species,toes_orig,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,meters,new.recap,painted,misc,vial,year,rtl,autotomized,new.recap_orig,sighting,review_sex,review_species,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics,toes,toe_pattern,tl_svl,mass_svl,year_diff,svl_diff,initialCaptureDate,liznumber,daysSinceCapture,capture
26,j,1-9-13-16,2000-03-18,f,50.0,76.0,0.0,6.5,r27c,wall at H5,,recap,,shed since last recapture; marked originally a...,,2000,0.0,intact,recap,,True,False,False,False,False,False,False,False,False,False,False,False,1-9-13-16,,1.52,0.13,0,0.0,2000-03-18,94,0,1
35,j,1-12-18,2000-03-18,f,50.0,68.0,0.0,3.7,r36c,3m^chute on Rs,,recap,,shed since last recapture,,2000,0.0,intact,recap,,True,False,False,False,False,False,False,False,False,False,False,False,1-12-18,,1.36,0.074,0,0.0,2000-03-18,26,0,1
48,j,4-6-15-20,2000-03-18,f,51.0,67.0,0.0,3.2,r49c,Rs opp 2tripleR,,recap,,shed since last recapture [slight break on T a...,,2000,0.0,intact,recap,,True,False,False,False,False,False,False,False,False,False,False,False,4-6-15-20,,1.313725,0.062745,0,0.0,2000-03-18,329,0,1
67,j,10-15-18,2000-03-19,f,46.0,53.0,23.0,3.2,r68c,bottom S surve,,recap,,shed since last recapture,,2000,23.0,autotomized,recap,,True,False,False,False,False,False,False,False,False,False,False,False,10-15-18,,1.152174,0.069565,0,0.0,2000-03-19,118,0,1
239,j,1-9-14-19,2001-03-19,m,52.0,70.0,0.0,3.6,wOa,L opp 2tripleR,335.0,recap,,,wOa and w+b displaying to one another,2001,0.0,intact,recap,,True,False,False,False,False,False,False,False,False,False,False,False,1-9-14-19,,1.346154,0.069231,0,0.0,2001-03-19,568,0,1
380,uo,1-2,2001-03-18,m,50.0,77.0,0.0,5.4,w.a,Rin sb 5m^lizardR,142.0,recap,,,,2001,0.0,intact,recap,,True,False,False,False,False,False,False,False,False,False,True,False,1-2,,1.54,0.108,0,0.0,2001-03-18,991,0,1
388,v,7-,2002-03-25,f,51.0,72.0,0.0,,y41b,v 1 falls on right,-5.0,recap,toe loss probably natural - short,,,2002,0.0,intact,recap,,True,False,True,False,False,False,False,False,False,False,False,False,7-,,1.411765,,0,0.0,2002-03-25,1186,0,1
403,v,1-7-18,2002-07-08,m,50.0,70.0,0.0,4.3,w25a,5m v curved wall on right,22.0,recap,,,,2002,0.0,intact,recap,,True,False,False,False,False,False,False,False,False,False,False,False,1-7-18,,1.4,0.086,0,4.0,2002-07-08,1239,0,1
422,v,9-19,2002-03-31,m,49.0,69.0,0.0,,y50b,9m ^ cave trail in sb,54.0,recap,,,,2002,0.0,intact,recap,,True,False,False,False,False,False,False,False,False,False,False,False,9-19,,1.408163,,0,0.0,2002-03-31,1417,0,1
427,v,2-8-15-19,2002-07-06,f,48.0,62.0,0.0,3.2,w19a,stacked wall - 75,75.0,recap,painted,,,2002,0.0,intact,recap,,True,False,False,False,False,False,False,False,False,False,False,False,2-8-15-19,,1.291667,0.066667,0,0.0,2002-07-06,1108,0,1


Now we export the cleaned data to a CSV file.

In [53]:
df_numbered = df_numbered.rename(index = str, columns = {'new.recap':'newRecap'})
qc_drop_cols = df_numbered.columns[df_numbered.columns.str.contains('force|drop')]
df_full = df_numbered.drop(qc_drop_cols, axis=1)

In [54]:
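# pd.to_datetime('now') is assumed to return UTC here; subtracting 4 hours approximates local time
timestamp = (pd.to_datetime('now')-pd.Timedelta(hours=4))
# colons are not allowed in Windows file names, so replace them
timestamp = str(timestamp).replace(':','_')
#path='C:\\Users\\Christopher\\Google Drive\\TailDemography\\outputFiles\\'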
# path=outputBig
filename = 'cleaned CC data 2000-2017_' + timestamp+ '.csv'
# filename = path + '/cleaned CC data 2000-2017' + '.csv'
df_full.to_csv(filename,index = False)
filename

'cleaned CC data 2000-2017_2018-12-04 19_39_29.623717.csv'
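
An alternative way to build the timestamped file name is to format the time explicitly with `strftime`, which avoids colons, spaces, and fractional seconds in the stamp. This is a sketch only; `timestamped_name` is a hypothetical helper introduced here, not a function used elsewhere in this notebook.

```python
from datetime import datetime

def timestamped_name(stem, when, ext='.csv'):
    """Join a file stem, a formatted timestamp, and an extension.

    The stamp uses only digits, hyphens, and an underscore,
    so it is safe in Windows file names.
    """
    return '{}_{}{}'.format(stem, when.strftime('%Y-%m-%d_%H-%M-%S'), ext)

# Fixed datetime so the result is reproducible
example = timestamped_name('cleaned CC data 2000-2017',
                           datetime(2018, 12, 4, 19, 39, 29))
print(example)
```

In practice the second argument could be `datetime.now()`, which returns local time directly, and the result could be passed to `df_full.to_csv(..., index = False)`.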