# Cleaning CC data

This Python notebook operates on a CSV file that was first edited in OpenRefine. It finishes cleaning the columns of interest that were easier to clean in Python.

## Setting up Python

Here we import necessary packages. 
This chunk may take a while.

In [1]:
import pandas as pd
import numpy as np

### Use this chunk to import from Google Sheets

import gspread
from oauth2client.service_account import ServiceAccountCredentials
# use creds to create a client to interact with the Google Drive API
scope = ['https://spreadsheets.google.com/feeds']
creds = ServiceAccountCredentials.from_json_keyfile_name('TD_client.json', scope)
client = gspread.authorize(creds)

data = client.open("mapped-data-all_18-01-08_post_openrefine.csv").sheet1
df=pd.DataFrame(data.get_all_records())

### Use this chunk to read data from local folder on Chris' machine

In [2]:
df=pd.read_csv('C:\\Users\\Christopher\\Google Drive\\TailDemography\\outputFiles\\mapped-data-all_18-01-08_post_openrefine.csv')

In [3]:
df2=pd.read_csv('C:\\Users\\Christopher\\Google Drive\\TailDemography\\csvFiles by year\\xCC2017x.csv')
#df2.head()
df2.loc[df2.Toes=='43085',]

Unnamed: 0,1,Species,Toes,Date,Sex,SVL,TL,RTL,Mass,Paint Mark,...,Meters,New/Recap,Painted,Misc.,Vial,Time,Unnamed: 17,Spotted,Mark,2015 or earlier


Let's take a look at the data

In [4]:
print("\nThere are {} data points in our data set.".format(df.shape[0]))


There are 8197 data points in our data set.


In [5]:
print("\nThe columns in the data have the following data types:\n{}".format(df.dtypes))


The columns in the data have the following data types:
species                object
toes                   object
date                   object
sex                    object
svl                   float64
tl                    float64
rtl_orig               object
mass                   object
paint.mark             object
location               object
meters                 object
new.recap              object
painted                object
misc                   object
vial                   object
year                    int64
rtl                   float64
autotomized              bool
new.recap_orig         object
sighting               object
review_sex               bool
review_species           bool
review_painted           bool
review_new.recap         bool
review_rtl               bool
forceMale                bool
forceFemale              bool
forceRecap               bool
forceNew                 bool
forceSighting            bool
drop_species             bool
drop_morphomet

## Correcting class of columns

In [6]:
#We need to add real error handling into these conversion chunks

##Convert integer columns to int
intCols = ['meters','year']
df[intCols]=df[intCols].astype(int,errors='ignore')

##Convert numeric columns to numeric
numCols = ['svl','tl','rtl','rtl_orig','mass']
df[numCols]=df[numCols].apply(pd.to_numeric,errors='coerce')

##Convert string columns to str
strCols = ['toes','sex','species','vial']
df[strCols]=df[strCols].astype(str, errors='ignore')

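#Convert date to datetime
#Name the date column explicitly: df.loc[mask]=np.nan would blank every column in the matching rows
df.loc[df.date=="NA",'date']=np.nan
df.date = pd.to_datetime(df.date,errors='coerce')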

##Convert bool columns to bool
boolCols = ['review_sex','review_species','review_painted','review_new.recap',\
            'review_rtl','forceMale','forceFemale','forceRecap','forceNew',\
            'forceSighting','drop_species','drop_morphometrics','autotomized']
df[boolCols]=df[boolCols].astype(bool, errors='ignore')
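The comment at the top of this cell asks for real error handling. Both `errors='ignore'` and `errors='coerce'` hide the values that failed to convert, so one possible approach is to record them during conversion. The sketch below uses plain Python rather than pandas, and `to_float_or_none` is a hypothetical helper, not part of the notebook.

```python
def to_float_or_none(values):
    """Convert each value to float and keep a record of every value that fails."""
    converted, failures = [], []
    for position, value in enumerate(values):
        try:
            converted.append(float(value))
        except (TypeError, ValueError):
            # Keep the slot so positions still line up with the input
            converted.append(None)
            failures.append((position, value))
    return converted, failures


# Illustrative values only, not taken from the data set
converted, failures = to_float_or_none(['4.2', '13.8', 'NA', '?', 5])
print(converted)
print(failures)
```

The `failures` list can then be reviewed or exported before the coerced column replaces the original.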

In [7]:
print("\nAfter applying the above changes, the data types are as follows:\n{}".format(df.dtypes))


After applying the above changes, the data types are as follows:
species                       object
toes                          object
date                  datetime64[ns]
sex                           object
svl                          float64
tl                           float64
rtl_orig                     float64
mass                         float64
paint.mark                    object
location                      object
meters                        object
new.recap                     object
painted                       object
misc                          object
vial                          object
year                         float64
rtl                          float64
autotomized                     bool
new.recap_orig                object
sighting                      object
review_sex                      bool
review_species                  bool
review_painted                  bool
review_new.recap                bool
review_rtl                      bool
forceMale 

## Remove leading and trailing whitespaces

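#Iterating over df yields column names, so index into df to reach the values
#Strip leading and trailing whitespace from every text (object) column
for col in df.select_dtypes(include='object'):
    df[col]=df[col].apply(lambda v: v.strip() if isinstance(v,str) else v)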

## Cleaning toes column

In [8]:
#df.toes.astype(str)
#df.toes=df.toes.str.strip()#remove white spaces before and after the toes
pattern1=r".( - )." #toes entries with space around hyphen
pattern2=r".( \d+ )." #toes entries with space around numbers
pattern3=r".(')."
print("\nThere are {} entries in the data set.\
        \nOf these, there are {} toe marks with spaces surrounding hyphens,\
        \n{} with spaces surrounding numbers,\
        \nand {} with an apostrophe preceding the entry."\
      .format(df.shape[0]\
              ,df.loc[df.toes.str.match(pattern1)==True].shape[0]\
              ,df.loc[df.toes.str.match(pattern2)==True].shape[0],\
              df.loc[df.toes.str.match(pattern3)==True].shape[0]))


There are 8197 entries in the data set.
Of these, there are 38 toe marks with spaces surrounding hyphens,
213 with spaces surrounding numbers,
and 0 with an apostrophe preceding the entry.
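`Series.str.match` anchors each pattern at the start of the string, as `re.match` does, and the leading `.` consumes exactly one character. The sketch below applies the first two patterns to a few marks with the standard `re` module. The first three marks appear in the outputs of this notebook. `"10 - 15"` is a made-up mark, included to show that an entry whose first toe has two digits is not caught.

```python
import re

pattern1 = r".( - )."    # one character, then space-hyphen-space
pattern2 = r".( \d+ )."  # one character, then a number with a space on each side

samples = ["4 - 6 - 12", "3 13 16", "1-13-19", "10 - 15"]

# For each mark, record whether pattern1 and pattern2 match at the start of the string
results = {}
for mark in samples:
    results[mark] = (re.match(pattern1, mark) is not None,
                     re.match(pattern2, mark) is not None)
    print(repr(mark), results[mark])
```

If marks that begin with a two-digit toe can occur with stray spaces, searching anywhere in the string (for example with `Series.str.contains`) may be a safer check.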


In [9]:
df.loc[df.toes=='nan','toes'] = np.nan

In [10]:
df.loc[df.toes.str.match(pattern1)==True]

Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,...,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics
6055,j,4 - 6 - 12,2011-06-20,m,68.0,101.0,0.0,9.2,g20b,15m up CCC,...,False,False,False,False,False,False,False,False,False,False
6056,j,4 - 6 - 13,2011-06-28,m,77.0,106.0,0.0,13.8,g40b,big juniper right 8m ^ juniper Xing,...,False,False,False,False,False,False,False,False,False,False
6057,uo,6 - 15,2011-06-20,m,49.0,83.0,0.0,4.0,g.t,opp oak R,...,False,False,False,False,False,False,False,False,True,False
6058,uo,6 - 16,2011-06-28,f,49.0,72.0,0.0,4.2,g.b,CC/CCC,...,False,False,False,False,False,False,False,False,True,False
6059,uo,6 - 17,2011-07-02,f,48.0,69.0,0.0,4.0,g.c,H4b,...,False,False,False,False,False,False,False,False,True,False
6060,v,3 - 9 - 13 - 20,2011-06-22,m,46.0,60.0,0.0,3.7,g6c,left Rs opp Pine R,...,False,False,False,False,False,False,False,False,False,False
6061,v,3 - 9 - 14 - 16,2011-06-27,m,49.0,64.0,0.0,4.0,g8c,left Rs 3m ^ oak R,...,False,False,False,False,False,False,False,False,False,False
6062,v,3 - 9 - 14 - 17,2011-06-30,m,49.0,46.0,12.0,4.3,g9c,2m ^ cave trail opp sb,...,False,False,False,False,False,False,False,False,False,False
6063,v,3 - 9 - 14 - 18,2011-07-03,f,49.0,62.0,0.0,4.0,g14c,2m v pool sb,...,False,False,False,False,False,False,False,False,False,False
6064,j,4 - 6 - 11 - 20,2011-06-20,f,61.0,81.0,0.0,6.6,g17b,4m ^ left side SB Juniper Xing,...,False,False,False,False,False,False,False,False,False,False


In [11]:
df.loc[df.toes.str.match(pattern2)==True]

Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,...,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics
3387,v,3 13 16,2004-07-02,m,54.0,51.0,38.0,5.2,wTb,sb 10m v falls,...,False,False,False,False,False,False,False,False,False,False
3389,v,4 6 11,2004-07-02,m,49.0,66.0,0.0,4.5,wAc,see wAa,...,False,False,False,False,False,False,False,False,False,False
3390,v,4 6 12,2004-07-02,m,45.0,62.0,0.0,3.3,wAb,3 m above cave trail on rt,...,False,False,False,False,False,False,False,False,False,False
3395,v,4 6 13,2004-07-02,f,57.0,68.0,0.0,7.6,w1a,on pine at top of site,...,False,False,False,False,False,False,False,False,False,False
3413,v,4 6 14,2004-07-03,f,50.0,69.0,0.0,4.1,w2a,8m^ cave trail,...,False,False,False,False,False,False,False,False,False,False
3417,v,4 6 15,2004-07-03,f,49.0,70.0,0.0,7.0,w2b,sb 1m v top of rock wall,...,False,False,False,False,False,False,False,False,False,False
3421,v,4 6 16,2004-07-03,m,44.0,64.0,0.0,3.6,w2c,sb 1m v H3,...,False,False,False,False,False,False,False,False,False,False
3422,v,4 6 17,2004-07-03,f,39.0,59.0,0.0,3.2,w3a,sb at H3,...,False,False,False,False,False,False,False,False,False,False
3424,v,4 6 18,2004-07-03,f,52.0,68.0,0.0,4.6,w3b,1m v flat rock on left side,...,False,False,False,False,False,False,False,False,False,False
3425,v,1 9 16,2004-07-03,m,55.0,73.0,0.0,5.3,w3c,1m v opposite fallen juniper,...,False,False,False,False,False,False,False,False,False,False


In [12]:
df.loc[df.toes.str.match(pattern3)==True]

Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,...,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics


### Correcting these patterns

In [13]:
#Restrict each fix to the toes column so that other text columns in the same rows (e.g. location) keep their spaces
for pattern,old,new in [(pattern1," ",""),(pattern2," ","-"),(pattern3,"'","")]:
    mask=df.toes.str.match(pattern)==True
    df.loc[mask,'toes']=df.loc[mask,'toes'].str.replace(old,new,regex=False)
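The three corrections can also be written as one function that is applied to each toe mark. This is a minimal sketch, assuming hyphens should be the only separators and that a leading apostrophe is never meaningful. `normalize_toes` is a hypothetical name, not part of the notebook.

```python
import re


def normalize_toes(mark):
    """Return a toe mark whose numbers are separated by single hyphens."""
    # Drop surrounding whitespace and any apostrophe in front of the entry
    mark = mark.strip().lstrip("'")
    # Collapse every run of spaces and/or hyphens into one hyphen
    return re.sub(r"[-\s]+", "-", mark)


examples = ["4 - 6 - 12", "3 13 16", "1-13-19", "'4-6-11"]
normalized = [normalize_toes(mark) for mark in examples]
print(normalized)
```

On a pandas column, the same function could be applied only to non-missing entries, for example with `df.toes.dropna().map(normalize_toes)`.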

Confirm that these patterns have been corrected

In [14]:
print("\nThere are {} toe marks with spaces surrounding hyphens.".format(df.loc[df.toes.str.match(pattern1)==True].shape[0]))
print("\nThere are {} toe marks with spaces surrounding numbers.".format(df.loc[df.toes.str.match(pattern2)==True].shape[0]))
print("\nThere are {} toe marks with an ' preceding the entry.".format(df.loc[df.toes.str.match(pattern3)==True].shape[0]))


There are 0 toe marks with spaces surrounding hyphens.

There are 0 toe marks with spaces surrounding numbers.

There are 0 toe marks with an ' preceding the entry.


## Cleaning Sex column

In [15]:
print(df.sex.str.len().unique())# returns unique lengths of sex
df.sex=df.sex.str.strip()
print(df.sex.str.len().unique())

[1 3 2 5]
[1 3 2 0 5]


### Identify non "m" or "f" values and their frequencies

In [16]:
patterns_sex="m|f|NA"
non_matches=df.sex.loc[df.sex.str.match(patterns_sex)!=True]
print("\nThere are {} entries for sex which do not match the patterns {}:"\
      .format(non_matches.shape[0],patterns_sex.split("|")))
non_matches.value_counts()


There are 5412 entries for sex which do not match the patterns ['m', 'f', 'NA']:


nan      5254
juv       128
?          16
?f          6
n           2
unm         1
?m          1
[m]         1
adult       1
???         1
            1
Name: sex, dtype: int64
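Because `str.match` only anchors at the start of the string, the pattern `m|f|NA` also accepts any value that merely begins with `m` or `f`. The sketch below contrasts prefix matching with whole-string matching using the standard `re` module. `"male"` and `"fem"` are hypothetical values that do not appear in the counts above.

```python
import re

patterns_sex = "m|f|NA"
values = ["m", "f", "NA", "male", "fem", "juv", "?f", "unm"]

# re.match succeeds when the pattern matches at the start of the value
prefix_ok = [v for v in values if re.match(patterns_sex, v)]
# re.fullmatch succeeds only when the whole value matches
exact_ok = [v for v in values if re.fullmatch(patterns_sex, v)]

print(prefix_ok)
print(exact_ok)
```

Recent pandas versions provide `Series.str.fullmatch` for the whole-string check. In older versions, anchoring the pattern as `^(m|f|NA)$` gives the same result with `str.match`.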

### Identify values to convert to NA, m, or f

In [17]:
sex2NA=['adult','juv']
sex2m=['unm']
df.loc[df.sex.isin(sex2NA)==True]
print(df.sex.loc[df.sex.isin(sex2NA)==True].count())
print(df.sex.loc[df.sex.isin(sex2m)==True].count())

129
1


### Convert the values to NA or m, respectively.

In [18]:
df.loc[df.sex.isin(sex2m)]

Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,...,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics
7726,up,,2017-07-20,unm,,,,,,3m above sb on rt side 4m ^ CC/CCC,...,False,True,False,False,False,False,False,False,True,True


In [19]:
df.loc[df.sex.isin(sex2NA),'sex']=np.nan
df.loc[df.sex.isin(sex2m),'sex']='m'
print(df.sex.loc[df.sex.isin(sex2NA)==True].count())
print(df.sex.loc[df.sex.isin(sex2m)==True].count())

0
0


## Cleaning new.recap column

In [20]:
df.head()

Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,...,review_painted,review_new.recap,review_rtl,forceMale,forceFemale,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics
0,j,1-13-19,2000-03-17,f,52.0,74.0,0.0,4.2,r1c,1falls,...,False,False,False,False,False,False,False,False,False,False
1,j,1-13-20,2000-03-17,m,56.0,77.0,0.0,5.6,r2c,1falls,...,False,False,False,False,False,False,False,False,False,False
2,j,1-14-19,2000-03-17,f,57.0,81.0,0.0,6.6,r3c,wall on rt side v wall at pine xing,...,False,False,False,False,False,False,False,False,False,False
3,j,1-14-20,2000-03-17,f,57.0,79.0,0.0,5.5,r4c,wall on rt side v wall at pine xing,...,False,False,False,False,False,False,False,False,False,False
4,j,3-8,2000-03-17,f,82.0,89.0,27.0,17.0,r5c,oak across from bottom wall at pine xing,...,False,False,False,False,False,False,False,False,False,False


In [21]:
#try using a dict to do things more efficiently
newRecapKeep = ['recap', 'new', 'r', 'n']
new = ['new','n']
recap = ['recap','r']
df.loc[~df['new.recap'].isin(newRecapKeep),'new.recap'] = np.nan
df.loc[df['new.recap'].isin(new),'new.recap'] = 'new'
df.loc[df['new.recap'].isin(recap),'new.recap'] = 'recap'
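The comment in this cell suggests using a dict. One way to do that is a single lookup table that maps every accepted spelling to its clean value, with anything else treated as missing. The sketch below shows the idea in plain Python. `recap_map` is a hypothetical name, and the raw values are illustrative.

```python
# Each accepted spelling maps to its clean value
recap_map = {'new': 'new', 'n': 'new', 'recap': 'recap', 'r': 'recap'}

raw = ['n', 'recap', 'r', 'new', 'sighting', '?']
# dict.get returns None for spellings that are not in the table
cleaned = [recap_map.get(value) for value in raw]
print(cleaned)
```

In pandas, the same table could be passed to `df['new.recap'].map(recap_map)`, which leaves NaN wherever a value has no entry in the dict.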

## Add Columns

In [22]:
# tl_svl and mass_svl
df['tl_svl']=(df.tl/df.svl)
df['mass_svl']=(df.mass/df.svl)

## Create function to generate lizardNumber

lizardNumber is a numeric identifier of unique animals in the data set.

The function takes the following arguments:
- *x*: Series object on which the function acts
- *sortCriteria*: list of strings of column names on which to sort the data. Data are sorted by columns from left to right.
- *validationCriteria*: dictionary of dictionaries that identifies the columns to validate and the validation expression, of the form {{'column':'column_2 >= column_1'},{'otherColumn':'column_2 >= column_1'}}
- *result*: dictionary of dictionaries detailing the value *x* takes if validations are True or False, of the form {{'True':x=x[i]},{'False':x=x[i]+1},{errors: 'raise'}}. errors may be 'raise' (*default*; terminates the function and returns an error) or 'ignore' (returns 'NA').

Function action:
- first, sort data by species, toes, then date
- for time points 1 and 2, with 2 being later:
    - toes2 == toes1
    - svl2 - svl1 >= -2
    - year2 - year1 <= 7
    - for species == j:
        - if svl >= 56:
            - if sex2 == sex1:
                - lizardNumber[i+1] = lizardNumber[i]
        - else:
            - lizardNumber[i+1] = lizardNumber[i] + 1
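The core of this rule can be sketched in plain Python as a single pass over records that are already sorted by species, toes, then date. The sketch assumes that consecutive records must also share a species, and it leaves out the sex check for species j. `assign_lizard_numbers` is a hypothetical name. The first five records use the svl and year values shown for toes 1-13-19 elsewhere in this notebook, and the last is an additional mark for contrast.

```python
def assign_lizard_numbers(records):
    """Number consecutive records, starting a new number when the rule fails.

    records: list of dicts with keys species, toes, svl and year,
    already sorted by species, toes, then date.
    """
    numbers = []
    current = 0
    previous = None
    for rec in records:
        same_animal = (
            previous is not None
            and rec['species'] == previous['species']
            and rec['toes'] == previous['toes']
            and rec['svl'] - previous['svl'] >= -2   # little or no shrinkage
            and rec['year'] - previous['year'] <= 7  # plausible time gap
        )
        if not same_animal:
            current += 1
        numbers.append(current)
        previous = rec
    return numbers


records = [
    {'species': 'j', 'toes': '1-13-19', 'svl': 52.0, 'year': 2000},
    {'species': 'j', 'toes': '1-13-19', 'svl': 53.0, 'year': 2000},
    {'species': 'j', 'toes': '1-13-19', 'svl': 63.0, 'year': 2000},
    {'species': 'j', 'toes': '1-13-19', 'svl': 79.0, 'year': 2001},
    {'species': 'j', 'toes': '1-13-19', 'svl': 68.0, 'year': 2008},
    {'species': 'j', 'toes': '1-13-20', 'svl': 56.0, 'year': 2000},
]
numbers = assign_lizard_numbers(records)
print(numbers)
```

The fifth record starts a new number because its svl is 11 mm smaller than the previous capture, which violates the svl2 - svl1 >= -2 condition.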

In [23]:
#create a variable that uniquely identifies a lizard by toes, species and sex
print("\nThe original dataframe has {} observations.".format(df.shape[0]))
df['lizMark'] = df.toes + df.species + df.sex
tmpToes=pd.DataFrame(df.lizMark.value_counts().reset_index())
tmpToes.columns= ['lizMark','count']
tmpToes=tmpToes.loc[~tmpToes.lizMark.str.contains(r"\?")]
tmpToes['lizardNumber']=(tmpToes.index +1)


The original dataframe has 8197 observations.


In [24]:
df=df.merge(tmpToes,how='outer',on='lizMark')
print("\nThe dataframe resulting from this merge has {} observations.".format(df.shape[0]))
df.head()


The dataframe resulting from this merge has 8197 observations.


Unnamed: 0,species,toes,date,sex,svl,tl,rtl_orig,mass,paint.mark,location,...,forceRecap,forceNew,forceSighting,drop_species,drop_morphometrics,tl_svl,mass_svl,lizMark,count,lizardNumber
0,j,1-13-19,2000-03-17,f,52.0,74.0,0.0,4.2,r1c,1falls,...,False,False,False,False,False,1.423077,0.080769,1-13-19jf,5.0,44.0
1,j,1-13-19,2000-03-17,f,53.0,69.0,0.0,5.0,r18c,Rs opp slab,...,False,False,False,False,False,1.301887,0.09434,1-13-19jf,5.0,44.0
2,j,1-13-19,2000-06-24,f,63.0,93.0,0.0,6.7,o11a,halfway between 1 falls and cave trail,...,False,False,False,False,False,1.47619,0.106349,1-13-19jf,5.0,44.0
3,j,1-13-19,2001-07-13,f,79.0,108.0,0.0,14.6,r31a,15m ^ 1falls,...,False,False,False,False,False,1.367089,0.18481,1-13-19jf,5.0,44.0
4,j,1-13-19,2008-07-18,f,68.0,86.0,0.0,9.3,y11a.t,H3/H4,...,True,False,False,False,False,1.264706,0.136765,1-13-19jf,5.0,44.0


Now that we have a base set of lizard numbers, we can edit this column with a function that finds duplicate lizardNumbers and changes the lizardNumber of those that don't meet certain criteria:
Function action:
- first, sort data by lizardNumber then date
- second, for a given lizardNumber:
    - compare each lizard to the earliest data point and determine which ones meet the following criteria:
        - [time points 1 and 2, with 2 being later (?at least 1 year later?): [GEORGE]]
        - svl_i - svl_1 >= -2
        - year_i - year_1 <=7
    - If the above conditions are met, do nothing
    - Where they are not met:
        - change the lizardNumber of lizards which didn't meet criteria to new lizardNumber.max() + 1
        - repeat until lizards within each lizardNumber meet criteria
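The procedure described above can be sketched in plain Python, independently of the pandas implementation in this notebook. The sketch assumes ISO-formatted date strings and picks the first record when several share the earliest date. `refine_numbers` is a hypothetical name. The last record is made up to show that a second pass can be needed.

```python
def refine_numbers(records):
    """Split lizardNumbers until every group is consistent with its earliest record."""
    numbers = [rec['lizardNumber'] for rec in records]
    next_number = max(numbers) + 1
    changed = True
    while changed:
        changed = False
        for liz in sorted(set(numbers)):
            idx = [i for i, n in enumerate(numbers) if n == liz]
            # The earliest capture is the reference point for the group
            first = min(idx, key=lambda i: records[i]['date'])
            failing = [i for i in idx
                       if records[i]['svl'] - records[first]['svl'] < -2
                       or records[i]['year'] - records[first]['year'] > 7]
            if failing:
                # All failing records share one new number and are re-checked on the next pass
                for i in failing:
                    numbers[i] = next_number
                next_number += 1
                changed = True
    return numbers


records = [
    {'lizardNumber': 44, 'date': '2000-03-17', 'svl': 52.0, 'year': 2000},
    {'lizardNumber': 44, 'date': '2000-06-24', 'svl': 63.0, 'year': 2000},
    {'lizardNumber': 44, 'date': '2001-07-13', 'svl': 79.0, 'year': 2001},
    {'lizardNumber': 44, 'date': '2008-07-18', 'svl': 68.0, 'year': 2008},
    {'lizardNumber': 44, 'date': '2009-07-01', 'svl': 50.0, 'year': 2009},  # made-up record
]
refined = refine_numbers(records)
print(refined)
```

The loop terminates because the earliest record of a group always passes its own check, so every split produces a strictly smaller group.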

In [25]:
def refineLizardNumber (data,lizCol='lizardNumber',svlCol='svl',yrCol='year',dateCol='date',markCol='lizMark'):
    #assert isinstance(data,pd.DataFrame),'data must be a pandas DataFrame'
    #assert set([lizCol,svlCol,yrCol,dateCol]).issubset(data.columns)\
    #, 'lizCol, svlCol, yrCol, and dateCol must be columns of data'
    #sort data by lizard number then date.
    tmp=pd.DataFrame(data.sort_values(by=[lizCol,dateCol]).reset_index())
    #Create an identifier to link individuals back to the full data set
    tmp['lizMarkFull']=tmp[markCol]+tmp[svlCol].astype(str)+tmp[yrCol].astype(str)
    #process the dataframe one lizardNumber at a time
    for liz in tmp[lizCol].unique():
        #NaN never compares equal to itself, so test for it with pd.isna and skip unnumbered rows
        if pd.isna(liz):
            continue
        lizidx=tmp[tmp[lizCol]==liz]
        try:
            #the earliest capture is the reference point for this lizardNumber
            first=lizidx[dateCol]==lizidx[dateCol].min()
            firstSVL=(lizidx.loc[first,[svlCol]])
            firstYear=lizidx.loc[first,[yrCol]]
            same=((lizidx[svlCol]-firstSVL[svlCol].values >=-2) & (lizidx[yrCol]-firstYear[yrCol].values <= 7))==True
            #Change lizardNumber for values not meeting criteria
            lizidx.loc[~same,lizCol]=tmp[lizCol].max()+1
            tmp.loc[tmp['lizMarkFull'].isin(lizidx.loc[~same,'lizMarkFull']),lizCol]=lizidx.loc[~same,lizCol]
        except Exception:
            print("\nUnable to process lizardNumber {}.".format(liz))
    return(tmp.drop('lizMarkFull',axis=1))

In [26]:
def iterLiz (function, data):
    df = data
    print('\nOriginal max lizardNumber is {}.'.format(df.lizardNumber.max()))
    #df2 is only used to test whether the function created new lizardNumbers
    df2 = function(df)
    #the break ends the loop after one pass, so the returned frame is a single application of function
    while df2.lizardNumber.max() > df.lizardNumber.max():
        df = function(df)
        print('\nMax lizardNumber is {}.'.format(df.lizardNumber.max()))
        break
    print("\nMax lizardNumber is {}.".format(df.lizardNumber.max()))
    return df

In [27]:
df=iterLiz(refineLizardNumber,df)
print("\nApplying the refineLizardNumber function to the df dataframe returns a dataframe with {} observations."\
      .format(df.shape[0]))


Original max lizardNumber is 1636.0.


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s



Unable to process lizardNumber 44.0.

Unable to process lizardNumber 613.0.

Unable to process lizardNumber 777.0.

Unable to process lizardNumber 965.0.

Unable to process lizardNumber 1499.0.

Unable to process lizardNumber 44.0.

Unable to process lizardNumber 613.0.

Unable to process lizardNumber 777.0.

Unable to process lizardNumber 965.0.

Unable to process lizardNumber 1499.0.

Max lizardNumber is 1769.0.

Max lizardNumber is 1769.0.

Applying the refineLizardNumber function to the df dataframe returns a dataframe with 8197 observations.


### NOTE:
- The _unprocessed_ list still needs to be entered manually.

In [28]:
unprocessed = [32,562,954,1383,1475]

In [29]:
# These are the problematic rows
problemData=df.loc[df.lizardNumber\
                   .isin(unprocessed)\
                   ,['date','year','svl','sex','meters','new.recap','location','toes','lizardNumber','species']]
print("\nSo far {} rows could not be processed.".format(problemData.shape[0]))
problemData


So far 9 rows could not be processed.


Unnamed: 0,date,year,svl,sex,meters,new.recap,location,toes,lizardNumber,species
208,2002-03-16,2002.0,43.0,m,447.0,new,8 m v top of site,2-11-18,32.0,v
209,2002-07-11,2002.0,49.0,m,442.0,new,"3m v R with oak, juniper T 5m from sb",2-11-18,32.0,v
210,2003-06-29,2003.0,51.0,m,,recap,R left 10m v top of site,2-11-18,32.0,v
211,2004-07-12,2004.0,55.0,m,418.0,recap,Rs opp oak R,2-11-18,32.0,v
212,2005-07-08,2005.0,56.0,m,421.0,recap,Rs opp side and 3m^ oak R,2-11-18,32.0,v
1616,2002-03-20,2002.0,53.0,f,262.0,new,slab,4-10-15-17,562.0,j
2000,2000-03-19,2000.0,55.0,f,,new,pyramid R,6-14-19,954.0,j
2420,2005-07-15,2005.0,69.0,m,164.0,new,oakT & sb rt side,2-6-12,1383.0,sc
2510,2001-07-28,2001.0,40.0,m,395.0,new,sb 5m ^ 2falls,2-8-15-19,1475.0,j


In [30]:
possibleLizNum = set(range(int(df.lizardNumber.min()),int(df.lizardNumber.max())))
actualLizNum = set(pd.Series(df.lizardNumber.unique()).dropna().apply(int))
missingLizNum = possibleLizNum - actualLizNum

print("\nThere are {} entries.  There are {} unique lizard numbers.\
\n\nThe lizardNumber ranges from {} to {}.\
\n\nThe following numbers are not assigned to a lizard:\n{}"\
      .format(df.shape[0],len(df.lizardNumber.unique())\
              ,df.lizardNumber.min(),df.lizardNumber.max(),missingLizNum))


There are 8197 entries.  There are 1715 unique lizard numbers.

The lizardNumber ranges from 1.0 to 1769.0.

The following numbers are not assigned to a lizard:
{774, 1031, 1159, 9, 1545, 1550, 1295, 784, 1556, 665, 1434, 924, 1056, 1312, 680, 1577, 811, 1328, 561, 1585, 1331, 1082, 1597, 961, 1218, 579, 1092, 1601, 1222, 1479, 1480, 1097, 1225, 1602, 718, 1486, 593, 1493, 1622, 1114, 991, 1632, 1379, 1381, 1513, 1263, 880, 753, 626, 1395, 1522, 885, 1526, 632, 1403}


### Export the rows that could not be processed

In [31]:
# Write the rows that refineLizardNumber could not process to a csv
path='C:\\Users\\Christopher\\Google Drive\\TailDemography\\outputFiles\\'
filename = path+"data that could not be processed by refineLizardNumber.csv"
df.loc[df.lizardNumber.isin(unprocessed)].to_csv(filename)
print(filename)

C:\Users\Christopher\Google Drive\TailDemography\outputFiles\data that could not be processed by refineLizardNumber.csv


Now we export the cleaned data to a csv

In [33]:
df = df.rename(index = str, columns = {'new.recap':'newRecap'})

In [34]:
timestamp = pd.to_datetime('now')-pd.Timedelta(hours=5)
#path=''C:\\Users\\Christopher\\Google Drive\\TailDemography\\outputFiles\\''
path='C:\\Users\\Christopher\\Google Drive\\TailDemography\\outputFiles\\'
#filename = path + 'cleaned CC data 2000-2017_' + str(timestamp)+ '.csv'
filename = path + 'cleaned CC data 2000-2017_' + '.csv'
df.to_csv(filename,index = False)
filename

'C:\\Users\\Christopher\\Google Drive\\TailDemography\\outputFiles\\cleaned CC data 2000-2017_.csv'
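The export cell computes `timestamp` but does not use it in the file name. Converting a pandas timestamp with `str()` produces text that contains colons, which Windows does not allow in file names. The sketch below builds a file-name-safe stamp with `strftime` instead. `timestamped_name` is a hypothetical helper, and the date passed to it is illustrative.

```python
from datetime import datetime


def timestamped_name(prefix, when, extension='.csv'):
    """Return prefix + a stamp without colons or spaces + extension."""
    stamp = when.strftime('%Y-%m-%d_%H%M%S')
    return prefix + stamp + extension


# Illustrative date and time; in the notebook this would be the current time
name = timestamped_name('cleaned CC data 2000-2017_', datetime(2018, 1, 8, 14, 30, 0))
print(name)
```

Timestamped names keep earlier exports from being overwritten, at the cost of leaving older files to be cleaned up by hand.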