# Cleaning CC data

This Python notebook operates on a CSV file produced after editing in OpenRefine. It finishes cleaning the columns of interest that were easier to clean in Python.

## Setting up Python

Here we import necessary packages. 
This chunk may take a while.

In [1]:
import pandas as pd
import numpy as np

### Use this chunk to import from google sheets

import gspread
from oauth2client.service_account import ServiceAccountCredentials
####use creds to create a client to interact with the Google Drive API
scope = ['https://spreadsheets.google.com/feeds']
creds = ServiceAccountCredentials.from_json_keyfile_name('TD_client.json', scope)
client = gspread.authorize(creds)

data = client.open("mapped-data-all_18-01-08_post_openrefine.csv").sheet1
df=pd.DataFrame(data.get_all_records())

### Use this chunk to read data from local folder on Chris' machine

In [2]:
df=pd.read_csv('C:/Users/Christopher/Google Drive/TailDemography/outputFiles/mapped-data-all_18-01-08_post_openrefine.csv')

Let's take a look at the data

In [3]:
print("\nThere are {} data points in our data set.".format(df.shape[0]))


There are 8197 data points in our data set.


In [4]:
print("\nThe columns in the data have the following data types:\n{}".format(df.dtypes))


The columns in the data have the following data types:
species                object
toes                   object
date                   object
sex                    object
svl                   float64
tl                    float64
rtl_orig               object
mass                   object
paint.mark             object
location               object
meters                 object
new.recap              object
painted                object
misc                   object
vial                   object
year                    int64
rtl                   float64
autotomized              bool
new.recap_orig         object
sighting               object
review_sex               bool
review_species           bool
review_painted           bool
review_new.recap         bool
review_rtl               bool
forceMale                bool
forceFemale              bool
forceRecap               bool
forceNew                 bool
forceSighting            bool
drop_species             bool
drop_morphomet

## Correcting class of columns

In [5]:
#We need to add real error handling into these conversion chunks

##Convert integer columns to int
intCols = ['meters','year']
df[intCols]=df[intCols].astype(int,errors='ignore')

##Convert numeric columns to numeric
numCols = ['svl','tl','rtl','rtl_orig','mass']
df[numCols]=df[numCols].apply(pd.to_numeric,errors='coerce')

##Convert string columns to str
strCols = ['toes','sex','species','vial']
df[strCols]=df[strCols].astype(str, errors='ignore')

#Convert date to datetime
#Note: the next line sets every column to NaN in rows where date is "NA", not only the date column
df.loc[df.date=="NA"]=np.nan
df.date = pd.to_datetime(df.date,errors='coerce')

##Convert bool columns to bool
boolCols = ['review_sex','review_species','review_painted','review_new.recap',\
            'review_rtl','forceMale','forceFemale','forceRecap','forceNew',\
            'forceSighting','drop_species','drop_morphometrics','autotomized']
df[boolCols]=df[boolCols].astype(bool, errors='ignore')
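
The comment at the top of the cell above asks for real error handling in the conversions. One possible approach, sketched here on a toy frame, is to coerce as before but also collect the raw values that failed to convert so they can be reviewed. The helper name `coerce_numeric_with_report` and the toy data are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def coerce_numeric_with_report(frame, cols):
    """Coerce cols to numeric and collect the raw values that could not be parsed.

    Hypothetical helper: returns (converted_frame, failures), where failures maps
    each column name to the list of original values that became NaN.
    """
    out = frame.copy()
    failures = {}
    for col in cols:
        converted = pd.to_numeric(out[col], errors='coerce')
        # A failure is a value that was present before conversion but is NaN after
        bad = out[col].notna() & converted.isna()
        if bad.any():
            failures[col] = out.loc[bad, col].tolist()
        out[col] = converted
    return out, failures

# Toy data (assumed), with one unparseable mass value
toy = pd.DataFrame({'mass': ['4.2', '5.6', 'n/a'], 'svl': [52, 56, 57]})
clean, failed = coerce_numeric_with_report(toy, ['mass', 'svl'])
print(failed)
```

The same idea could be applied to `numCols` before deciding whether the coerced values should be dropped or corrected at the source.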

In [6]:
print("\nAfter applying the above changes, the data types are as follows:\n{}".format(df.dtypes))


After applying the above changes, the data types are as follows:
species                       object
toes                          object
date                  datetime64[ns]
sex                           object
svl                          float64
tl                           float64
rtl_orig                     float64
mass                         float64
paint.mark                    object
location                      object
meters                        object
new.recap                     object
painted                       object
misc                          object
vial                          object
year                         float64
rtl                          float64
autotomized                     bool
new.recap_orig                object
sighting                      object
review_sex                      bool
review_species                  bool
review_painted                  bool
review_new.recap                bool
review_rtl                      bool
forceMale 

## Remove leading and trailing whitespace

#Strip leading and trailing whitespace from the string values in every object column
for col in df.select_dtypes(include='object'):
    df[col]=df[col].apply(lambda v: v.strip() if isinstance(v,str) else v)

## Cleaning toes column

In [7]:
#df.toes.astype(str)
#df.toes=df.toes.str.strip()#remove white spaces before and after the toes
pattern1=".( - )." #toes entries with space around hyphen
pattern2=r".( \d+ )." #toes entries with space around numbers
pattern3=".(')."
print("\nThere are {} entries in the data set.\
        \nOf these, there are {} toe marks with spaces surrounding hyphens,\
        \n{} with spaces surrounding numbers,\
        \nand {} with an apostrophe preceding the entry."\
      .format(df.shape[0]\
              ,df.loc[df.toes.str.match(pattern1)==True].shape[0]\
              ,df.loc[df.toes.str.match(pattern2)==True].shape[0],\
              df.loc[df.toes.str.match(pattern3)==True].shape[0]))


There are 8197 entries in the data set.        
Of these, there are 38 toe marks with spaces surrounding hyphens,        
213 with spaces surrounding numbers,        
and 0 with an apostrophe preceding the entry.


In [8]:
df.loc[df.toes=='nan','toes'] = np.nan

In [9]:
df.loc[df.toes.str.match(pattern1)==True]

Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,...,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics
6055,j,4 - 6 - 12,2011-06-20,m,68.0,101.0,0.0,9.2,g20b,15m up CCC,...,False,False,False,False,False,False,False,False,False,False
6056,j,4 - 6 - 13,2011-06-28,m,77.0,106.0,0.0,13.8,g40b,big juniper right 8m ^ juniper Xing,...,False,False,False,False,False,False,False,False,False,False
6057,uo,6 - 15,2011-06-20,m,49.0,83.0,0.0,4.0,g.t,opp oak R,...,False,False,False,False,False,False,False,False,True,False
6058,uo,6 - 16,2011-06-28,f,49.0,72.0,0.0,4.2,g.b,CC/CCC,...,False,False,False,False,False,False,False,False,True,False
6059,uo,6 - 17,2011-07-02,f,48.0,69.0,0.0,4.0,g.c,H4b,...,False,False,False,False,False,False,False,False,True,False
6060,v,3 - 9 - 13 - 20,2011-06-22,m,46.0,60.0,0.0,3.7,g6c,left Rs opp Pine R,...,False,False,False,False,False,False,False,False,False,False
6061,v,3 - 9 - 14 - 16,2011-06-27,m,49.0,64.0,0.0,4.0,g8c,left Rs 3m ^ oak R,...,False,False,False,False,False,False,False,False,False,False
6062,v,3 - 9 - 14 - 17,2011-06-30,m,49.0,46.0,12.0,4.3,g9c,2m ^ cave trail opp sb,...,False,False,False,False,False,False,False,False,False,False
6063,v,3 - 9 - 14 - 18,2011-07-03,f,49.0,62.0,0.0,4.0,g14c,2m v pool sb,...,False,False,False,False,False,False,False,False,False,False
6064,j,4 - 6 - 11 - 20,2011-06-20,f,61.0,81.0,0.0,6.6,g17b,4m ^ left side SB Juniper Xing,...,False,False,False,False,False,False,False,False,False,False


In [10]:
df.loc[df.toes.str.match(pattern2)==True]

Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,...,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics
3387,v,3 13 16,2004-07-02,m,54.0,51.0,38.0,5.2,wTb,sb 10m v falls,...,False,False,False,False,False,False,False,False,False,False
3389,v,4 6 11,2004-07-02,m,49.0,66.0,0.0,4.5,wAc,see wAa,...,False,False,False,False,False,False,False,False,False,False
3390,v,4 6 12,2004-07-02,m,45.0,62.0,0.0,3.3,wAb,3 m above cave trail on rt,...,False,False,False,False,False,False,False,False,False,False
3395,v,4 6 13,2004-07-02,f,57.0,68.0,0.0,7.6,w1a,on pine at top of site,...,False,False,False,False,False,False,False,False,False,False
3413,v,4 6 14,2004-07-03,f,50.0,69.0,0.0,4.1,w2a,8m^ cave trail,...,False,False,False,False,False,False,False,False,False,False
3417,v,4 6 15,2004-07-03,f,49.0,70.0,0.0,7.0,w2b,sb 1m v top of rock wall,...,False,False,False,False,False,False,False,False,False,False
3421,v,4 6 16,2004-07-03,m,44.0,64.0,0.0,3.6,w2c,sb 1m v H3,...,False,False,False,False,False,False,False,False,False,False
3422,v,4 6 17,2004-07-03,f,39.0,59.0,0.0,3.2,w3a,sb at H3,...,False,False,False,False,False,False,False,False,False,False
3424,v,4 6 18,2004-07-03,f,52.0,68.0,0.0,4.6,w3b,1m v flat rock on left side,...,False,False,False,False,False,False,False,False,False,False
3425,v,1 9 16,2004-07-03,m,55.0,73.0,0.0,5.3,w3c,1m v opposite fallen juniper,...,False,False,False,False,False,False,False,False,False,False


In [11]:
df.loc[df.toes.str.match(pattern3)==True]

Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,...,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics


### Correcting these patterns

In [12]:
#Apply each correction to the toes column only, so that spaces in other columns (e.g. location) are preserved
df.loc[df.toes.str.match(pattern1)==True,'toes']=df.loc[df.toes.str.match(pattern1)==True,'toes'].replace(" ","",regex=True)
df.loc[df.toes.str.match(pattern2)==True,'toes']=df.loc[df.toes.str.match(pattern2)==True,'toes'].replace(" ","-",regex=True)
df.loc[df.toes.str.match(pattern3)==True,'toes']=df.loc[df.toes.str.match(pattern3)==True,'toes'].replace("'","",regex=True)
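
The three corrections in this cell could also be expressed as a single function applied to the `toes` column. This is a minimal sketch: the name `normalize_toes` and the sample values are assumptions, and the rules cover only the three patterns described above (spaces around hyphens, spaces between numbers, and apostrophes).

```python
import re
import pandas as pd

def normalize_toes(value):
    """Return a toe mark with hyphens as the only separator (hypothetical helper)."""
    if not isinstance(value, str):
        return value  # leave NaN and other non-strings untouched
    value = value.strip().replace("'", "")
    # '4 - 6 - 12' -> '4-6-12'
    value = re.sub(r"\s*-\s*", "-", value)
    # '3 13 16' -> '3-13-16'; only spaces that sit between two digits are replaced
    return re.sub(r"(?<=\d)\s+(?=\d)", "-", value)

# Sample values (assumed) covering the three patterns
toes = pd.Series(["4 - 6 - 12", "3 13 16", "'1-13-19"])
print(toes.map(normalize_toes).tolist())
```

Entries such as `2-13-17 a`, where a space precedes a letter, are left unchanged by this sketch and would still need manual review.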

Confirm that these patterns have been corrected

In [13]:
print("\nThere are {} toe marks with spaces surrounding hyphens.".format(df.loc[df.toes.str.match(pattern1)==True].shape[0]))
print("\nThere are {} toe marks with spaces surrounding numbers.".format(df.loc[df.toes.str.match(pattern2)==True].shape[0]))
print("\nThere are {} toe marks with an ' preceding the entry.".format(df.loc[df.toes.str.match(pattern3)==True].shape[0]))


There are 0 toe marks with spaces surrounding hyphens.

There are 0 toe marks with spaces surrounding numbers.

There are 0 toe marks with an ' preceding the entry.


## Cleaning Sex column

In [14]:
print(df.sex.str.len().unique())# returns unique lengths of sex
df.sex=df.sex.str.strip()
print(df.sex.str.len().unique())

[1 3 2 5]
[1 3 2 0 5]


### Identify non "m" or "f" values and their frequencies

In [15]:
patterns_sex="m|f|NA"
non_matches=df.sex.loc[df.sex.str.match(patterns_sex)!=True]
print("\nThere are {} entries for sex which do not match the patterns {}:"\
      .format(non_matches.shape[0],patterns_sex.split("|")))
non_matches.value_counts()


There are 5412 entries for sex which do not match the patterns ['m', 'f', 'NA']:


nan      5254
juv       128
?          16
?f          6
n           2
adult       1
[m]         1
?m          1
unm         1
???         1
            1
Name: sex, dtype: int64

### Identify values to convert to NA, m, or f

In [16]:
sex2NA=['adult','juv','nan']
sex2m=['unm']
df.loc[df.sex.isin(sex2NA)==True]
print(df.sex.loc[df.sex.isin(sex2NA)==True].count())
print(df.sex.loc[df.sex.isin(sex2m)==True].count())

5383
1


### Convert the values to NA or m, respectively.

In [17]:
df.loc[df.sex.isin(sex2m)]

Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,...,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics
7726,up,,2017-07-20,unm,,,,,,3m above sb on rt side 4m ^ CC/CCC,...,False,True,False,False,False,False,False,False,True,True


In [18]:
df.loc[df.sex.isin(sex2NA),'sex']=np.nan
df.loc[df.sex.isin(sex2m),'sex']='m'
print(df.sex.loc[df.sex.isin(sex2NA)==True].count())
print(df.sex.loc[df.sex.isin(sex2m)==True].count())

0
0


# Cleaning autotomized column

In [19]:
autotomyDict = {False:'intact',True:'autotomized'}

df.loc[:,'autotomized'] = df.loc[:,'autotomized'].map(autotomyDict)
df.autotomized.unique()

array(['intact', 'autotomized'], dtype=object)

# Cleaning new.recap column

In [20]:
df.head()

Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,...,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics
0,j,1-13-19,2000-03-17,f,52.0,74.0,0.0,4.2,r1c,1falls,...,False,False,False,False,False,False,False,False,False,False
1,j,1-13-20,2000-03-17,m,56.0,77.0,0.0,5.6,r2c,1falls,...,False,False,False,False,False,False,False,False,False,False
2,j,1-14-19,2000-03-17,f,57.0,81.0,0.0,6.6,r3c,wall on rt side v wall at pine xing,...,False,False,False,False,False,False,False,False,False,False
3,j,1-14-20,2000-03-17,f,57.0,79.0,0.0,5.5,r4c,wall on rt side v wall at pine xing,...,False,False,False,False,False,False,False,False,False,False
4,j,3-8,2000-03-17,f,82.0,89.0,27.0,17.0,r5c,oak across from bottom wall at pine xing,...,False,False,False,False,False,False,False,False,False,False


In [21]:
#try using a dict to do things more efficiently
newRecapKeep = ['recap', 'new', 'r', 'n']
new = ['new','n']
recap = ['recap','r']
df.loc[~df['new.recap'].isin(newRecapKeep),'new.recap'] = np.nan
df.loc[df['new.recap'].isin(new),'new.recap'] = 'new'
df.loc[df['new.recap'].isin(recap),'new.recap'] = 'recap'
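
The dict-based approach mentioned in the comment of this cell might look like the sketch below. The mapping `new_recap_map` and the toy frame are assumptions that mirror the lists above. `Series.map` returns NaN for any value that is not a key of the dict, so the separate "keep" step is not needed.

```python
import numpy as np
import pandas as pd

# Assumed mapping, mirroring the new / recap lists in the cell above
new_recap_map = {'new': 'new', 'n': 'new', 'recap': 'recap', 'r': 'recap'}

# Toy data (assumed), including one unrecognised code and one missing value
toy = pd.DataFrame({'new.recap': ['n', 'recap', 'r', 'new', '?', np.nan]})

# Values that are not keys of the dict become NaN
toy['new.recap'] = toy['new.recap'].map(new_recap_map)
print(toy['new.recap'].tolist())
```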

## Add Columns

In [22]:
# tl_svl and mass_svl
df['tl_svl']=(df.tl/df.svl)
df['mass_svl']=(df.mass/df.svl)

## Create function to generate lizardNumber 
The lizard number is a numeric identifier of unique animals in the data set.

The function takes the following arguments:
- *x*: series object on which the function acts
- *sortCriteria*: list of strings of column names on which to sort the data; data are sorted by columns from left to right
- *validationCriteria*: dictionary of dictionaries that identify columns to validate and the validation expression, of the form {{'column':'column_2 >= column_1'},{'otherColumn':'column_2 >= column_1'}}
- *result*: dictionary of dictionaries detailing the value *x* takes if validations are True or False, of the form {{'True':x=x[i]},{'False':x=x[i]+1},{errors: 'raise'}}; errors may be 'raise' (the default; terminates the function and returns an error) or 'ignore' (returns 'NA')

Function action:
- first sort data by species, toes, then date

- for time points 1 and 2, with 2 being later: 
    - toes2 == toes1
    - svl2-svl1 >=-2
    - year2-year1 <=7
    - for species ==j:
        - if svl >=56:
            - if sex2==sex1:
                lizardNumber[i+1]=lizardNumber[i]
          else:
            - lizardNumber[i+1]=lizardNumber[i]+1
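
The matching rules listed above can be sketched as a pairwise check. This is one possible reading, not the notebook's implementation: the helper name `same_lizard`, the dict-based records, the requirement that species match, and the choice to test `svl >= 56` on the earlier record are all assumptions.

```python
def same_lizard(first, second):
    """Return True if two capture records plausibly belong to one animal.

    Hypothetical helper. `first` is the earlier record and `second` the later
    one; each is a dict with keys 'species', 'toes', 'sex', 'svl' and 'year'.
    """
    # Assumed: records must share species as well as toe mark
    if first['species'] != second['species'] or first['toes'] != second['toes']:
        return False
    if second['svl'] - first['svl'] < -2:   # svl2 - svl1 >= -2 is violated
        return False
    if second['year'] - first['year'] > 7:  # year2 - year1 <= 7 is violated
        return False
    # Assumed reading: for species 'j', sex must agree once the earlier svl is >= 56
    if first['species'] == 'j' and first['svl'] >= 56:
        return first['sex'] == second['sex']
    return True

# Toy records (assumed)
a = {'species': 'j', 'toes': '1-13-20', 'sex': 'm', 'svl': 56.0, 'year': 2000}
b = {'species': 'j', 'toes': '1-13-20', 'sex': 'm', 'svl': 60.0, 'year': 2002}
c = {'species': 'j', 'toes': '1-13-20', 'sex': 'f', 'svl': 61.0, 'year': 2003}
print(same_lizard(a, b), same_lizard(a, c))
```

When the check fails for consecutive sorted records, the later record would start a new lizard number (`lizardNumber[i]+1`).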

In [23]:
import pandas as pd
import os
def genliznum2(df, path:str, errors:str= 'raise'):
    """
    Prepare the data for assigning lizard numbers, which are numeric identifiers of unique animals in the data set.
    The sort criteria (species, toes, sex) and the validation columns (date, svl) are currently hard-coded below.
    :param df: dataframe of capture records
    :param path: str, directory to which the file of unsortable entries is written
    :param errors: str, may be 'raise' *default* (terminates function and returns an error) or 'ignore' (returns 'NA');
        not yet used in the function body
    :return: dictionary with keys 'data' (sortable entries with the added columns earliestSighting, year_diff,
        smallest_svl and svl_diff) and 'numbers' (dataframe of candidate lizard numbers, all marked 'unassigned')
    """
    sortCriteria = ['species','toes', 'sex']
    validation = ['date','svl']
    critical = sortCriteria +validation
    #identify lizards with sufficient data to evaluate
    #report on those without sufficient data and save them to a file for later evaluation
    unsortable = df.loc[df.loc[:,critical].isnull().any(axis=1)]
    sortable = df.loc[df.loc[:,critical].notnull().all(axis=1)]
    os.chdir(path)
    unsortablefile ='unsortable.csv'
    unsortable.to_csv(unsortablefile)
    print("\nThere were {} entries for which values for one of the critical criteria, ({}), were null.  These entries could \
not be evaluated and were written out to the file {} for evaluation."\
          .format(unsortable.shape[0],critical,unsortablefile))
    
    sortable_min_date =pd.DataFrame(sortable.groupby(sortCriteria).date.min()).\
    rename(index = str, columns= {'date':'earliestSighting'}).reset_index()
    sortable = sortable.merge(sortable_min_date,how = 'left', on = sortCriteria)
    sortable['year_diff'] = sortable.date.dt.year - sortable.earliestSighting.dt.year
    
    svlGroup = ['species','toes', 'sex','earliestSighting']
    

    sortable_smallest_svl =sortable.groupby(svlGroup).svl.min().reset_index()\
    .rename(index = str, columns= {'svl':'smallest_svl'})
    sortable = sortable.merge(sortable_smallest_svl,how = 'left', on = svlGroup)
    sortable['svl_diff'] = sortable.svl - sortable.smallest_svl
    
    #Now we create a dataframe containing potential lizard numbers, all of which are designated as unassigned
    dfLizNumber = pd.DataFrame(df.index +1).reset_index().rename(index = str, columns = {0:'lizard_number'})
    dfLizNumber['assignment_status'] = 'unassigned'
    dfLizNumber = dfLizNumber.loc[:, ['lizard_number','assignment_status']]

    # Return the sortable data and the table of candidate lizard numbers
    
    res = {'data':sortable,'numbers':dfLizNumber}
    return res


In [76]:
test_tmp = genliznum2(df,'C:\\Users\\Christopher\\Documents\\GitHub\\tailDemography\\data')

test = test_tmp['data']
test['tmp'] = 1

#test ['liznumber'] = np.nan

test



There were 5568 entries for which values for one of the critical criteria, (['species', 'toes', 'sex', 'date', 'svl']), were null.  These entries could not be evaluated and were written out to the file unsortable.csv for evaluation.


Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,...,forceSighting,drop_species,drop_morphometrics,tl_svl,mass_svl,earliestSighting,year_diff,smallest_svl,svl_diff,tmp
0,j,1-13-19,2000-03-17,f,52.0,74.0,0.0,4.2,r1c,1falls,...,False,False,False,1.423077,0.080769,2000-03-17,0,52.0,0.0,1
1,j,1-13-20,2000-03-17,m,56.0,77.0,0.0,5.6,r2c,1falls,...,False,False,False,1.375000,0.100000,2000-03-17,0,56.0,0.0,1
2,j,1-14-19,2000-03-17,f,57.0,81.0,0.0,6.6,r3c,wall on rt side v wall at pine xing,...,False,False,False,1.421053,0.115789,2000-03-17,0,57.0,0.0,1
3,j,1-14-20,2000-03-17,f,57.0,79.0,0.0,5.5,r4c,wall on rt side v wall at pine xing,...,False,False,False,1.385965,0.096491,2000-03-17,0,57.0,0.0,1
4,j,3-8,2000-03-17,f,82.0,89.0,27.0,17.0,r5c,oak across from bottom wall at pine xing,...,False,False,False,1.085366,0.207317,2000-03-17,0,82.0,0.0,1
5,j,1-15-16,2000-03-17,m,58.0,64.0,24.0,5.5,r6c,sb half way up from bottom wall to pine xing,...,False,False,False,1.103448,0.094828,2000-03-17,0,58.0,0.0,1
6,j,1-18,2000-03-17,f,58.0,62.0,20.0,7.0,r7c,sb 3/4 way up from bottom wall to pine xing,...,False,False,False,1.068966,0.120690,2000-03-17,0,58.0,0.0,1
7,j,1-13-18,2000-03-17,f,54.0,75.0,0.0,5.5,r8c,sb at pine xing,...,False,False,False,1.388889,0.101852,2000-03-17,0,54.0,0.0,1
8,j,1-19,2000-03-17,m,62.0,84.0,0.0,7.5,r9c,sb 10m ^ root xing,...,False,False,False,1.354839,0.120968,2000-03-17,0,62.0,0.0,1
9,j,1-20,2000-03-17,f,60.0,80.0,0.0,8.0,r10c,sb at H3,...,False,False,False,1.333333,0.133333,2000-03-17,0,27.0,33.0,1


In [77]:
print(test.loc[(test.year_diff<=7) & (test.svl_diff>=-2),:].tmp.unique())
print(len(test.loc[(test.year_diff<=7) & (test.svl_diff>=-2),:].groupby(['species','sex','toes'])))
numbers = test.loc[(test.year_diff<=7) & (test.svl_diff>=-2),:].groupby(['species','sex','toes']).tmp.min().cumsum()\
.reset_index()
numbers = numbers.rename(columns={'tmp':'liznumber'})
numbers


[1]
1593


Unnamed: 0,species,sex,toes,liznumber
0,cn ex,f,1-7,1
1,cn ex,f,7,2
2,j,,3-6-13-16,3
3,j,?,1-9-13-19,4
4,j,?,11-18,5
5,j,?,12-16,6
6,j,?,14-18,7
7,j,?,2-13-17 a,8
8,j,?,2-8-13,9
9,j,?,3-7-12-20,10
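
The cell above numbers the groups by taking the cumulative sum of a constant column and then merging the result back onto the data. As a hedged alternative, `groupby(...).ngroup()` writes a group label directly onto each row, which avoids the helper column and the merge. The toy frame below is an assumption for illustration.

```python
import pandas as pd

# Toy data (assumed): the first and last rows share species, sex and toes
toy = pd.DataFrame({
    'species': ['j', 'j', 'v', 'j'],
    'sex':     ['f', 'm', 'f', 'f'],
    'toes':    ['1-13-19', '1-13-20', '4-6-11', '1-13-19'],
})

# ngroup() labels groups 0, 1, 2, ... in sorted key order; add 1 to start at 1
toy['liznumber'] = toy.groupby(['species', 'sex', 'toes']).ngroup() + 1
print(toy['liznumber'].tolist())
```

Because the label is assigned in place, re-running the cell does not create the duplicated `liznumber_x` / `liznumber_y` columns that repeated merges produce.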


In [83]:
test = test.merge(numbers,'left', on = ['species','sex','toes'])
test

Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,...,earliestSighting,year_diff,smallest_svl,svl_diff,tmp,liznumber_x,liznumber_y,liznumber_x.1,liznumber_y.1,liznumber
0,j,1-13-19,2000-03-17,f,52.0,74.0,0.0,4.2,r1c,1falls,...,2000-03-17,0,52.0,0.0,1,45,45,45,45,45
1,j,1-13-20,2000-03-17,m,56.0,77.0,0.0,5.6,r2c,1falls,...,2000-03-17,0,56.0,0.0,1,562,562,562,562,562
2,j,1-14-19,2000-03-17,f,57.0,81.0,0.0,6.6,r3c,wall on rt side v wall at pine xing,...,2000-03-17,0,57.0,0.0,1,52,52,52,52,52
3,j,1-14-20,2000-03-17,f,57.0,79.0,0.0,5.5,r4c,wall on rt side v wall at pine xing,...,2000-03-17,0,57.0,0.0,1,53,53,53,53,53
4,j,3-8,2000-03-17,f,82.0,89.0,27.0,17.0,r5c,oak across from bottom wall at pine xing,...,2000-03-17,0,82.0,0.0,1,285,285,285,285,285
5,j,1-15-16,2000-03-17,m,58.0,64.0,24.0,5.5,r6c,sb half way up from bottom wall to pine xing,...,2000-03-17,0,58.0,0.0,1,568,568,568,568,568
6,j,1-18,2000-03-17,f,58.0,62.0,20.0,7.0,r7c,sb 3/4 way up from bottom wall to pine xing,...,2000-03-17,0,58.0,0.0,1,57,57,57,57,57
7,j,1-13-18,2000-03-17,f,54.0,75.0,0.0,5.5,r8c,sb at pine xing,...,2000-03-17,0,54.0,0.0,1,43,43,43,43,43
8,j,1-19,2000-03-17,m,62.0,84.0,0.0,7.5,r9c,sb 10m ^ root xing,...,2000-03-17,0,62.0,0.0,1,575,575,575,575,575
9,j,1-20,2000-03-17,f,60.0,80.0,0.0,8.0,r10c,sb at H3,...,2000-03-17,0,27.0,33.0,1,61,61,61,61,61


In [None]:
#test['earliestSighting'] = np.nan
svlGroup = ['species','toes', 'sex','earliestSighting']
test_smallest_svl = pd.DataFrame(test.groupby(svlGroup).svl.min()).rename(index = str, columns= {'svl':'smallest_svl'})\
.reset_index()
test_smallest_svl
#test

test.merge(test_smallest_svl,how = 'left', on = svlGroup).smallest_svl.unique()
#test.head()

In [None]:
#transform('min') returns one value per row, aligned with test's index
test['dateMin'] = test.groupby(['species','toes','sex']).date.transform('min')


In [None]:
print('The qc output of gen_liz_num(df) shows that {} rows in the df dataframe could not be processed. \nOf these, \
{} did not contain information on species, toes, and sex. \n{} did not contain information on either toes or sex.\n\
{} did not contain information on both toes and sex. \n{} did not contain information on sex, but contained\
 information on toes. \n{} did not contain information on toes, but contained\
 information on sex.'\
      .format(qc.shape[0],qc.loc[(qc.species.isnull()) & (qc.toes.isnull()) & (qc.sex.isnull()),:].shape[0]\
                                                             ,qc.loc[(qc.toes.isnull())|(qc.sex.isnull()),:].shape[0]\
                                                             ,qc.loc[(qc.toes.isnull()) & (qc.sex.isnull()),:].shape[0]\
             ,qc.loc[(qc.toes.notnull()) & (qc.sex.isnull()),:].shape[0]\
             ,qc.loc[(qc.toes.isnull()) & (qc.sex.notnull()),:].shape[0]))
print('\n Of the above, {} had paintmarks which might be used to attempt to recover data.\
\nOf those in the dataframe without a lizNumber, {} had paintmarks which might be used to attempt to recover data.'\
      .format(df.loc[(df['paint.mark'].notnull()) & (df.lizMark.isnull()),:].shape[0],\
             df.loc[(df['paint.mark'].notnull()) & (df.lizNumber.isnull()),:].shape[0]))
qc.head(3)

We can see from the output of the last chunk that the function is operating as expected.  The data that weren't assigned a lizard number cannot be assigned one and should probably be removed from analyses that cross years.  We can try to

# NOTE: 
- _unprocessed_ still needs to be manually entered

In [None]:
df.lizNumber.min()
df.lizNumber.max()

In [None]:
possibleLizNum = set(range(int(df.lizNumber.min()),int(df.lizNumber.max())))
actualLizNum = set(pd.Series(df.lizNumber.unique()).dropna().apply(int))
missingLizNum = possibleLizNum - actualLizNum

print("\nThere are {} entries.  There are {} unique lizard numbers.\
\n\nThe lizNumber ranges from {} to {}.\
\n\nThe following numbers are not assigned to a lizard:\n{}"\
      .format(df.shape[0],len(df.lizNumber.unique())\
              ,df.lizNumber.min(),df.lizNumber.max(),missingLizNum))

Now we export the cleaned data to a csv

In [None]:
df = df.rename(index = str, columns = {'new.recap':'newRecap'})

In [None]:
timestamp = pd.to_datetime('now')-pd.Timedelta(hours=5)
#path=''C:\\Users\\Christopher\\Google Drive\\TailDemography\\outputFiles\\''
path='C:\\Users\\Christopher\\Google Drive\\TailDemography\\outputFiles\\'
#filename = path + 'cleaned CC data 2000-2017_' + str(timestamp)+ '.csv'
filename = path + 'cleaned CC data 2000-2017_' + '.csv'
df.to_csv(filename,index = False)
filename
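
The commented-out line in this cell appends `str(timestamp)` to the file name. That string contains colons, which Windows does not allow in file names. A minimal sketch of one workaround is to format the timestamp explicitly; the helper name `stamped_filename`, the format string, and the example folder are assumptions.

```python
import os
import pandas as pd

def stamped_filename(folder, stem, when):
    """Build '<folder>/<stem>_<YYYY-MM-DD_HHMMSS>.csv' (hypothetical helper)."""
    stamp = when.strftime('%Y-%m-%d_%H%M%S')  # no colons in the result
    return os.path.join(folder, '{}_{}.csv'.format(stem, stamp))

# Fixed example timestamp so the result is reproducible
example = stamped_filename('outputFiles', 'cleaned CC data 2000-2017',
                           pd.Timestamp('2018-01-08 14:30:00'))
print(example)
```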

In [None]:
df.groupby('lizNumber').species.count()