<a id='TOC'></a>

# Cleaning Data (Part 1)
The purpose of this notebook is to read in raw Excel data for multiple years, rename and trim columns, append the cleaned files into a single dataframe, and export this dataframe as a CSV file.

# Table of Contents
1. [Setting up Python](#SettingUp)
    
    1. [Setting the Location](#SettingLoc)
    
    2. [Importing Necessary Packages](#ImportingPackages)
    
    3. [Functions](#functions)
    
    4. [Preparing for a Save](#PreparingSave)  

2. [Handling Columns](#HandlingColumns)
    
    1. [Find Unique Column Names](#FindUniqueCol)
    
    2. [Eliminate Unnecessary Columns](#DropCol)
    
    3. [Combine Synonyms](#CombineCol)

3. [Reading and Appending Data](#ReadingAppendingData)

4. [Exporting Data](#ExportingData)

<a id='SettingUp'></a>

# Setting up Python
[Top](#TOC)

[Setting the Location](#SettingLoc)
    
[Importing Necessary Packages](#ImportingPackages)
    
[Functions](#functions)
    
[Preparing for a Save](#PreparingSave)

<a id='ImportingPackages'></a>

## Importing Necessary Packages

[Top](#TOC)

[Setting Up Python](#SettingUp)

Here we import necessary packages. 
This chunk may take a while.

In [1]:
import pandas as pd
# import numpy as np
import glob,os

# increase print limit
pd.options.display.max_rows = 99999
pd.options.display.max_colwidth = 50
pd.set_option('mode.sim_interactive', True)

<a id='functions'></a>

## Functions
[Back to: Top](#TOC)

[Back to: Setting Up Python](#SettingUp)

1. [xlcolshape](#xlcolshape)

2. [xluniquecol2](#xluniquecol2)

3. [colmatchtodict](#colmatchtodict)

4. [findsyn](#findsyn)

5. [readnsplit](#readnsplit)

6. [mapndrop](#mapndrop)

7. [namefile](#namefile)

<a id='xlcolshape'></a>

[Back to: Functions](#functions)

In [3]:
def xlcolshape(file, verbose = True):
    """xlcolshape takes an excel file name as a string and returns a dict that maps
    'file_sheet' to the (rows, columns) shape of each sheet in that file"""
    assert isinstance(verbose,bool),"'verbose' must be bool, not {}".format(type(verbose))
    dictionary = {}
    for sheet in pd.ExcelFile(file).sheet_names:
        try:
            tmp = pd.read_excel(file,sheet_name =sheet).shape
            dictentry = file+'_'+sheet
            dictionary[dictentry] = tmp
            if verbose:
                print("Doing stuff you asked me to do for file \'{}\',sheet \'{}\' programmer person."\
                      .format(file, sheet))            
        except Exception:
            print("This didn't work for file {}, sheet {}".format(file, sheet))
            
    return dictionary

<a id='xluniquecol2'></a>

[Back to: Functions](#functions)

In [4]:
def xluniquecol2(file, header = 0, verbose=True):
    """xluniquecol2 takes an excel file name as a string and returns a list of the unique column
    names across its sheets.  Each sheet is expected to have a 'species' or 'Species' column; if a
    sheet does not, a warning is printed and the result is reset to None."""
    tmp = []
    res = None
    for sheet in pd.ExcelFile(file).sheet_names:
        # read each sheet once, using the same header row for the check and for the column list
        cols = list(pd.read_excel(file,sheet_name=sheet, header = header).columns)
        if ('species' in cols) or ('Species' in cols):
            tmp = list(set(tmp+cols))
            if verbose:
                print("Doing stuff you asked me to do for file \'{}\',sheet \'{}\' programmer person."\
                      .format(file,sheet))
            res = tmp
        else:
            print("Check columns for file {}.".format(file))
            res = None
    return res
            

<a id='colmatchtodict'></a>

[Back to: Functions](#functions)

In [5]:
def colmatchtodict(x,series, dictsource, key= None):
    """This takes a string, x, and looks for values in a series that contain that string.
    The match ignores case and treats x as a regular expression, so '.' matches any character.
    The matching values are stored as a list in the dict, dictsource, under the key, key
    (x by default), and the updated dict is returned.""" 
    assert isinstance(series,pd.Series)
    if key is None:
        key = x
    tmp = series[series.astype(str).str.contains(x,case = False)].tolist()
    dictsource[key] = tmp
    return dictsource
    

<a id='findsyn'></a>

[Back to: Functions](#functions)

In [6]:
def findsyn (name,dictionary, verbose = True):
    """
    *findsyn* searches the values of the dict *dictionary* for the string *name* and returns 
    the key of the first key,value pair whose value contains *name*, or None if there is no match.
    """
    tmp = pd.DataFrame({'preferredcol':list(dictionary.keys()),'synonyms':list(dictionary.values())})
    try:
        res = list(tmp.preferredcol[tmp.synonyms.apply(lambda x:name in x)])[0]
    except Exception:
        res = None
        if verbose:
            print("No value matching \"{}\" was found in the dictionary.".format(name))
    return res


<a id='readnsplit'></a>

[Back to: Functions](#functions)

In [7]:
def readnsplit(file,newsourcefolder,dtype=None,verbose=True):
    """
    This function reads an excel file, splits its sheets into separate files and saves them to folder
    *newsourcefolder*.
    """
    # os.path.splitext splits at the last dot only, so file names containing other dots still work
    prefix, suffix = os.path.splitext(file)
    for sheet in pd.ExcelFile(file).sheet_names:
        try:
            splitfile = newsourcefolder+'/'+prefix+'_'+sheet+suffix
            pd.read_excel(file,dtype=dtype, sheet_name=sheet).to_excel(splitfile,index=False)
            if verbose:
                print("Success!  \'{}\',sheet \'{}\' has been saved to {} and the corresponding\
                google drive file as {}.".format(file,sheet,newsourcefolder,splitfile))
        except Exception:
            print("Unable to save \'{}\',sheet \'{}\' as a separate file.".format(file,sheet))         
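
File names that contain more than one dot need care when the extension is separated from the rest of the name. The following is a minimal sketch of how the output path for one sheet could be built; `split_target` is a hypothetical helper name and is not part of this notebook.

```python
import os

def split_target(file, sheet, folder):
    """Build the output path for one sheet of an Excel file (hypothetical helper)."""
    # splitext splits at the last dot only, so dots earlier in the name are kept
    prefix, suffix = os.path.splitext(os.path.basename(file))
    return folder + '/' + prefix + '_' + sheet + suffix

print(split_target('CC 2004.xlsx', '2004 ', 'Source'))
print(split_target('CC v1.2 data.xls', 'Sheet1', 'Source'))
```

The second call uses a made-up file name to show that a dot inside the name does not truncate the prefix.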


<a id='mapndrop'></a>

[Back to: Functions](#functions)

In [8]:
def mapndrop(df,dictionary,verbose=True):
    """
    This function renames columns in *df* deemed synonymous according to a dict,
    *dictionary*, and drops unnecessary columns (those mapped to None) before returning
    the cleaner dataframe.  It returns None if the mapping fails, for example when a
    column label is missing from *dictionary*.
    """
    try:
        df.columns = pd.Series(df.columns).map(lambda x:dictionary[x])
        tmp = df
        if verbose:
            print("Successfully mapped columns for df.")
        dropidx =[col is None for col in list(tmp.columns)]
        tmp=tmp.drop(columns=df.columns[dropidx])
        if verbose:
            print("Successfully dropped unnecessary columns for df.")
    except Exception:
        tmp = None
        print("Skipping mapndrop call for df.")
    return tmp


<a id='namefile'></a>

[Back to: Functions](#functions)

In [9]:
def namefile(name, tzadjust=5,tzdirection = '-', adjprecision='minutes', filetype = 'csv'):
    """takes a filename and filetype, adds a timestamp that is shifted tzadjust hours from gmt
    (behind gmt when tzdirection is '-', ahead otherwise) and trimmed to the precision adjprecision,
    and returns a string that concatenates them."""
    # number of trailing characters to trim from the timestamp string for each precision
    precision= {'max':None,'seconds':-7,'minutes':-9, 'hours':-14,'date':-20}
    assert isinstance(name,str),"'name' must be of type str."
    assert isinstance(tzadjust,int),"'tzadjust' must be of type int"
    assert adjprecision in precision, ("'adjprecision' must be either "
                                       "'date', 'hours', 'minutes', 'seconds', or 'max'")
    # this assumes pd.to_datetime('now') returns the current time in gmt
    if tzdirection== '-':
        timestamp = (pd.to_datetime('now')-pd.Timedelta(hours=tzadjust))
    else:
        # tzadjust is an int, so it can be passed to Timedelta directly
        timestamp = (pd.to_datetime('now')+pd.Timedelta(hours=tzadjust))
    timestamp = str(timestamp).replace(':','hrs',1).replace(':','min',1)
    timestamp = timestamp[:precision[adjprecision]]
    filename = name+'_' + timestamp+ '.' +filetype
    return filename
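
Trimming a timestamp string by character count depends on the exact string layout. As an alternative, here is a minimal sketch that builds the same style of name with explicit format strings from the standard library. The name `stamped_name`, its parameters, and the `FORMATS` table are assumptions made for illustration; they are not part of this notebook.

```python
from datetime import datetime, timedelta, timezone

# one explicit format per precision, mirroring the 'hrs'/'min' style used by namefile
FORMATS = {
    'date': '%Y-%m-%d',
    'hours': '%Y-%m-%d %Hhrs',
    'minutes': '%Y-%m-%d %Hhrs%Mmin',
    'seconds': '%Y-%m-%d %Hhrs%Mmin%S',
}

def stamped_name(name, utc_offset_hours=-5, precision='minutes', filetype='csv', now=None):
    """Return name + '_' + timestamp + '.' + filetype (hypothetical helper)."""
    if precision not in FORMATS:
        raise ValueError("precision must be one of {}".format(sorted(FORMATS)))
    if now is None:
        now = datetime.now(timezone.utc)  # current time in UTC
    shifted = now + timedelta(hours=utc_offset_hours)
    return '{}_{}.{}'.format(name, shifted.strftime(FORMATS[precision]), filetype)

# a fixed time is passed in so the result is reproducible
fixed = datetime(2019, 7, 26, 0, 23, 45, tzinfo=timezone.utc)
print(stamped_name('Appended and Trimmed CC Data 2000-2017', now=fixed))
```

Passing `now` explicitly makes the function easy to check against a known time.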


<a id='PreparingSave'></a>

## Preparing for a Save
[Top](#TOC)

[Setting up Python](#SettingUp)

<a id='SettingLoc'></a>

## Setting the Location
[Top](#TOC)

[Setting Up Python](#SettingUp)

These chunks identify the locations from which we can get data and to which we can save data.

### Source Data
Raw data can be found in the following locations:

In [12]:
# sourceDataPers = 'C:/Users/Christopher/Google Drive/TailDemography/outputFiles'
# sourceDataBig = 'S:/Chris/TailDemography/TailDemography/Raw Data'
# sourceBlack = 'C:/Users/test/Desktop'
sourceGandolf = 'C:/Users/craga/Google Drive/TailDemography/Raw Data'


### Intermediate Source Data
Intermediate files can be found in the following locations:

In [11]:
# sourceInterDataPers = 'C:/Users/Christopher/Google Drive/TailDemography/Intermediate Files/Source'
# sourceinterDataBig = 'S:/Chris/TailDemography/TailDemography/Intermediate Files/Source'
# sourceBlack = 'C:/Users/test/Desktop'
sourceInterGandolf = 'C:/Users/craga/Google Drive/TailDemography/Intermediate Files/Source'

Now we change the working directory to the source path.

In [13]:
os.chdir(sourceGandolf)

### Output Data
The cleaned data will be saved to one of these locations:

In [14]:
# outputPers = 'C:/Users/Christopher/Google Drive/TailDemography/outputFiles'
# outputBig = 'S:/Chris/TailDemography/TailDemography/Cleaned Combined Data'
# outputBlack = 'C:/Users/test/Desktop'
outputGandolf = 'C:/Users/craga/Google Drive/TailDemography/Cleaned Combined Data'

### Review files

In [15]:
# outputPers = 'C:/Users/Christopher/Google Drive/TailDemography/Files for review/Source files'
# reviewfolderBig = 'S:/Chris/TailDemography/TailDemography/Files for review/Source files'
reviewGandolf = 'C:/Users/craga/Google Drive/TailDemography/Files for review/Source files'

<a id='HandlingColumns'></a>

# Handling Columns
[Top](#TOC)

Even without looking inside the multiple-sheet file, it's clear that we'll have to identify a common set of columns before combining these files.  The helpers defined in [Functions](#functions) will help us do this.

We will want to do the following:
1. [Find Unique Column Names](#FindUniqueCol)
2. [Eliminate Unnecessary Columns](#DropCol)
3. [Combine Synonyms](#CombineCol)

Here we search the source path to locate the raw data that we will eventually read into our notebook.

In [16]:
rawfiles = glob.glob('*.xls*')
rawfiles

['CC 2000-03-modified from CC-SJ 00-03 final-modified w headers-3Jan19.xlsx',
 'CC 2004.xlsx',
 'CC 2015 - captures.xls',
 'CC 2016 - captures.xls',
 'CC 2017 Lizards - 3viii17 captures and obs.xls',
 'xCC2005x.xls',
 'xCC2006x.xls',
 'xCC2007x.xls',
 'xCC2008x.xls',
 'xCC2009x.xls',
 'xCC2010x.xlsx',
 'xCC2011x.xls',
 'xCC2012x.xls',
 'xCC2013x.xls',
 'xCC2014x.xlsx']

We'll separate these into files with single or multiple sheets.

In [17]:
rawfiles_ms = [rawfiles[0],rawfiles[7]]
rawfiles_ss = list(set(rawfiles)- set(rawfiles_ms))
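
Picking the multiple-sheet files by position depends on the order of the file list. A minimal sketch of how the same split could be derived from the sheet counts follows; `split_by_sheet_count` and its `get_sheet_names` argument are hypothetical names. In this notebook, `get_sheet_names` could be `lambda f: pd.ExcelFile(f).sheet_names`; the example uses a plain dict so that it runs without any files.

```python
def split_by_sheet_count(files, get_sheet_names):
    """Return (single_sheet_files, multi_sheet_files), preserving the input order."""
    multi = [f for f in files if len(get_sheet_names(f)) > 1]
    single = [f for f in files if f not in multi]
    return single, multi

# stand-in sheet names for illustration only
fake_sheets = {
    'CC 2004.xlsx': ['2004 '],
    'xCC2007x.xls': ['Sheet1', '2007'],
    'xCC2008x.xls': ['2008'],
}
single, multi = split_by_sheet_count(list(fake_sheets), fake_sheets.get)
print(single)
print(multi)
```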

The names of files with multiple sheets are now in the variable *rawfiles_ms*.

In [18]:
rawfiles_ms

['CC 2000-03-modified from CC-SJ 00-03 final-modified w headers-3Jan19.xlsx',
 'xCC2007x.xls']

The names of files with a single sheet are now in the variable *rawfiles_ss*.

In [19]:
rawfiles_ss

['CC 2004.xlsx',
 'xCC2010x.xlsx',
 'CC 2016 - captures.xls',
 'xCC2011x.xls',
 'xCC2013x.xls',
 'xCC2006x.xls',
 'CC 2017 Lizards - 3viii17 captures and obs.xls',
 'xCC2008x.xls',
 'xCC2009x.xls',
 'xCC2014x.xlsx',
 'xCC2005x.xls',
 'CC 2015 - captures.xls',
 'xCC2012x.xls']

Now let's take a look at the number of columns in each file. We'll start with the single-sheet files, since these are the easiest.  We will use the function *xlcolshape* to do this. 
When we call this function on the first of the single-sheet files, we can see that it returns a dict whose value is a tuple in the format (number of rows, number of columns). The code for *xlcolshape* can be found in [Functions](#functions).

In [20]:
xlcolshape(rawfiles_ss[0])

Doing stuff you asked me to do for file 'CC 2004.xlsx',sheet '2004 ' programmer person.


{'CC 2004.xlsx_2004 ': (479, 16)}

We will apply this function to the list of files for our inspection.

In [21]:
pd.Series(rawfiles).apply(lambda x: xlcolshape(x,verbose=False))

0     {'CC 2000-03-modified from CC-SJ 00-03 final-m...
1                     {'CC 2004.xlsx_2004 ': (479, 16)}
2            {'CC 2015 - captures.xls_2015': (241, 19)}
3            {'CC 2016 - captures.xls_2016': (103, 21)}
4     {'CC 2017 Lizards - 3viii17 captures and obs.x...
5                      {'xCC2005x.xls_2005': (202, 17)}
6                      {'xCC2006x.xls_2006': (163, 17)}
7     {'xCC2007x.xls_Sheet1': (507, 16), 'xCC2007x.x...
8                      {'xCC2008x.xls_2008': (134, 20)}
9                      {'xCC2009x.xls_2009': (162, 16)}
10                   {'xCC2010x.xlsx_Sheet1': (99, 41)}
11                    {'xCC2011x.xls_Sheet1': (64, 19)}
12                      {'xCC2012x.xls_data': (85, 19)}
13             {'xCC2013x.xls_CC 2013 data': (106, 20)}
14                     {'xCC2014x.xlsx_2014': (97, 19)}
dtype: object

<a id='FindUniqueCol'></a>

## Finding Unique Columns
[Top](#TOC)

[Handling Columns](#HandlingColumns)

We'll use the function *xluniquecol2* to extract the column names from each file and keep only the unique names. 

Here is an example of how xluniquecol2 works for a file with one sheet.  You can find the code for *xluniquecol2* in [Functions](#functions).

In [22]:
xluniquecol2(rawfiles_ss[0],verbose=False)

['SVL',
 'paint mark',
 'VIAL',
 'meters',
 'painted or not',
 'TOES',
 'mass',
 'RTL',
 'NEW/recap',
 'TIME',
 'misc',
 'TL',
 'sex',
 'location',
 'date',
 'species']

Here is an example of how xluniquecol2 works for a file with multiple sheets.

In [23]:
xluniquecol2(rawfiles_ms[0],verbose=False)

['SVL',
 'paint mark',
 'VIAL',
 'meters',
 'painted or not',
 'TOES',
 'mass',
 'RTL',
 'NEW/recap',
 'TIME',
 'misc',
 'TL',
 'sex',
 'location',
 'Unnamed: 16',
 'date',
 'species']

Now we will create an empty list, *uniquecols2*, that will eventually contain the unique column names from all of the files.

We will add the column names from each file to *uniquecols2* and then remove the duplicates.

In [24]:
tmp = pd.Series(rawfiles).apply(xluniquecol2,verbose=False)
uniquecols2 = list()
for u in tmp:
    uniquecols2 = uniquecols2+u
uniquecols2 = list(set(uniquecols2))
uniquecols2

['Mark',
 'paint mark',
 1,
 'Toe 9',
 'VIAL',
 'Unnamed: 17',
 'Toe 4',
 'Mass',
 '1st Capture (year)',
 'mass (g)',
 'Species',
 'RTL',
 'Painted',
 'Toe 13',
 'Toe 15',
 'sex',
 'Vial',
 'Toe 14',
 'TL (mm)',
 'species',
 'Toe 17',
 'Toes',
 'Meters',
 'painted or not',
 'Toe 6',
 'Sex',
 'Unnamed: 0',
 'Misc.',
 'NEW/recap',
 'TIME',
 'misc',
 'TL',
 'Date',
 ' painted or not',
 'location',
 'RTL (mm)',
 'Toe 11',
 'Time',
 'Toe 16',
 'Spotted',
 'Toe 3',
 'painted',
 'Toe 5',
 'Toe 7',
 'Location',
 'Tail condition (1=intact; 2=autotomized; 3=regrown)',
 'meters',
 'TOES',
 'mass',
 'Toe 1',
 'Toe 18',
 '2015 or earlier',
 'SVL (mm)',
 'Unnamed: 16',
 'Toe 12',
 'Year',
 'Toe 8',
 'Marked',
 'SVL',
 'Collectors',
 'Toe 2',
 'New/Recap',
 'misc/notes',
 'Toe 10',
 'Toe 19',
 'Toe 20',
 'Years Alive (known)',
 'Paint Mark',
 'Unnamed: 19',
 'Painted or Not',
 'date']

<a id='DropCol'></a>

## Eliminate Unnecessary Columns
[Top](#TOC)

[Cleaning Data](#CleaningData)

[Handling Columns](#HandlingColumns)

Now we will try to identify unnecessary columns and eliminate them. Much of this will be done manually.

In [25]:
keepCol = ['species', 'date', 'sex', 'svl', 'tl', 'rtl', 'mass',
       'paint.mark', 'location', 'meters', 'new.recap', 'painted', 'misc',
       'vial', 'autotomized', 'sighting', 'toes','filename']

In [26]:
set(pd.Series(keepCol).str.lower())-set(pd.Series(uniquecols2).str.lower())

{'autotomized', 'filename', 'new.recap', 'paint.mark', 'sighting'}

In [27]:
set(pd.Series(uniquecols2).str.lower())-set(pd.Series(keepCol).str.lower())

{' painted or not',
 '1st capture (year)',
 '2015 or earlier',
 'collectors',
 'mark',
 'marked',
 'mass (g)',
 'misc.',
 'misc/notes',
 nan,
 'new/recap',
 'paint mark',
 'painted or not',
 'rtl (mm)',
 'spotted',
 'svl (mm)',
 'tail condition (1=intact; 2=autotomized; 3=regrown)',
 'time',
 'tl (mm)',
 'toe 1',
 'toe 10',
 'toe 11',
 'toe 12',
 'toe 13',
 'toe 14',
 'toe 15',
 'toe 16',
 'toe 17',
 'toe 18',
 'toe 19',
 'toe 2',
 'toe 20',
 'toe 3',
 'toe 4',
 'toe 5',
 'toe 6',
 'toe 7',
 'toe 8',
 'toe 9',
 'unnamed: 0',
 'unnamed: 16',
 'unnamed: 17',
 'unnamed: 19',
 'year',
 'years alive (known)'}
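
The `nan` entry above appears because one of the column labels is the integer 1, which `.str.lower()` cannot handle, and ' painted or not' survives as a separate entry because of its leading space. A minimal sketch of a label normalizer that handles both cases is shown here; `normalize_label` is a hypothetical helper, not part of this notebook.

```python
def normalize_label(label):
    """Convert any column label to a trimmed, lower-case string."""
    return str(label).strip().lower()

raw_labels = [' painted or not', 'Painted or Not', 1, 'TOES', 'Toes']
print(sorted({normalize_label(label) for label in raw_labels}))
```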

Since the data for the years 2000-2003 are contained in the same Excel file, we will have to treat this file differently from the others.

<a id='CombineCol'></a>

## Combining Synonymous Columns
[Top](#TOC)

[Cleaning Data](#CleaningData)

[Handling Columns](#HandlingColumns)

Once we have identified the columns we need to keep, we'll need to apply this list to the files as they are read into Python. To do that, we first map each raw column name to its preferred name.

We will use a function, *colmatchtodict*,  to identify potential synonyms. Here's an example of how *colmatchtodict* works.  The code for this function can be found in [Functions](#functions).

In [28]:
coldict = {}

In [29]:
colmatchtodict('toes',pd.Series(uniquecols2),coldict, key = 'toes')

{'toes': ['Toes', 'TOES']}

Now let's see what happens when we apply this function to each entry in *keepCol*.

In [30]:
coldict = {}

In [31]:
pd.Series(keepCol).apply(lambda x: colmatchtodict(x=x,series=pd.Series(uniquecols2),dictsource=coldict))
coldict

{'species': ['Species', 'species'],
 'date': ['Date', 'date'],
 'sex': ['sex', 'Sex'],
 'svl': ['SVL (mm)', 'SVL'],
 'tl': ['RTL', 'TL (mm)', 'TL', 'RTL (mm)'],
 'rtl': ['RTL', 'RTL (mm)'],
 'mass': ['Mass', 'mass (g)', 'mass'],
 'paint.mark': ['paint mark', 'Paint Mark'],
 'location': ['location', 'Location'],
 'meters': ['Meters', 'meters'],
 'new.recap': ['NEW/recap', 'New/Recap'],
 'painted': ['Painted',
  'painted or not',
  ' painted or not',
  'painted',
  'Painted or Not'],
 'misc': ['Misc.', 'misc', 'misc/notes'],
 'vial': ['VIAL', 'Vial'],
 'autotomized': ['Tail condition (1=intact; 2=autotomized; 3=regrown)'],
 'sighting': [],
 'toes': ['Toes', 'TOES'],
 'filename': []}

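We will manually adjust the value for 'tl', which also picked up the 'RTL' columns because 'rtl' contains 'tl'. The entries for 'sighting' and 'filename' have no matches and stay empty.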

In [32]:
coldict['tl']=['TL (mm)', 'TL', 'tl']
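
The over-matching for 'tl' comes from substring matching. The sketch below reproduces the effect with the standard `re` module, on the assumption that a case-insensitive `str.contains` behaves like `re.search` with `re.IGNORECASE`. The helper name `contains` and the label list are for illustration only.

```python
import re

labels = ['RTL', 'TL (mm)', 'TL', 'RTL (mm)', 'paint mark', 'NEW/recap']

def contains(pattern, labels):
    """Return the labels in which the pattern is found, ignoring case."""
    return [lab for lab in labels if re.search(pattern, str(lab), flags=re.IGNORECASE)]

print(contains('tl', labels))                     # substring match also finds the RTL labels
print(contains(r'^tl', labels))                   # anchoring at the start excludes them
print(contains('new.recap', labels))              # '.' matches any character, including '/'
print(contains(re.escape('new.recap'), labels))   # a literal '.' matches nothing here
```

Note that the keys 'paint.mark' and 'new.recap' only find their synonyms because the dot is treated as a wildcard.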

Now we need to use this dict to relabel the columns we wish to keep.

We will use the function *findsyn* to find, for each column label in *uniquecols2*, the matching preferred label from *keepCol*. 

Here are a few examples of how *findsyn* works.  The code can be found in [Functions](#functions).

In [33]:
findsyn('RTi',coldict,verbose=False)

In [34]:
findsyn('RTi',coldict,verbose=True)

No value matching "RTi" was found in the dictionary.


In [35]:
findsyn('RTL',coldict,verbose=True)

'rtl'

Now we apply *findsyn* to *uniquecols2* and create a column of preferred names.

In [36]:
uniquecols2df = pd.DataFrame({'uniquecols2':uniquecols2})
uniquecols2df['preferredcol'] = uniquecols2df.uniquecols2.apply(lambda x: findsyn(x,coldict,False))
uniquecols2df

Unnamed: 0,uniquecols2,preferredcol
0,Mark,
1,paint mark,paint.mark
2,1,
3,Toe 9,
4,VIAL,vial
5,Unnamed: 17,
6,Toe 4,
7,Mass,mass
8,1st Capture (year),
9,mass (g),mass


Now we will turn this dataframe back into a dict so that we can easily use it to rename columns.

In [37]:
uniquecols2df.index = uniquecols2df.uniquecols2
uniquecols2dict = pd.Series(uniquecols2df.preferredcol).to_dict()
uniquecols2dict

{'Mark': None,
 'paint mark': 'paint.mark',
 1: None,
 'Toe 9': None,
 'VIAL': 'vial',
 'Unnamed: 17': None,
 'Toe 4': None,
 'Mass': 'mass',
 '1st Capture (year)': None,
 'mass (g)': 'mass',
 'Species': 'species',
 'RTL': 'rtl',
 'Painted': 'painted',
 'Toe 13': None,
 'Toe 15': None,
 'sex': 'sex',
 'Vial': 'vial',
 'Toe 14': None,
 'TL (mm)': 'tl',
 'species': 'species',
 'Toe 17': None,
 'Toes': 'toes',
 'Meters': 'meters',
 'painted or not': 'painted',
 'Toe 6': None,
 'Sex': 'sex',
 'Unnamed: 0': None,
 'Misc.': 'misc',
 'NEW/recap': 'new.recap',
 'TIME': None,
 'misc': 'misc',
 'TL': 'tl',
 'Date': 'date',
 ' painted or not': 'painted',
 'location': 'location',
 'RTL (mm)': 'rtl',
 'Toe 11': None,
 'Time': None,
 'Toe 16': None,
 'Spotted': None,
 'Toe 3': None,
 'painted': 'painted',
 'Toe 5': None,
 'Toe 7': None,
 'Location': 'location',
 'Tail condition (1=intact; 2=autotomized; 3=regrown)': 'autotomized',
 'meters': 'meters',
 'TOES': 'toes',
 'mass': 'mass',
 'Toe 1': None
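
A lookup of this shape can also be built directly from a synonym dict without going through a dataframe. The following is a minimal sketch under that assumption; `invert_synonyms` is a hypothetical helper, and the small input dict is made up for the example.

```python
def invert_synonyms(synonyms, all_labels):
    """Map every raw label to its preferred name, or to None if it has no match."""
    lookup = {}
    for preferred, raw_names in synonyms.items():
        for raw in raw_names:
            # keep the first preferred name found, as findsyn does
            lookup.setdefault(raw, preferred)
    return {label: lookup.get(label) for label in all_labels}

example = {'tl': ['TL (mm)', 'TL'], 'rtl': ['RTL', 'RTL (mm)']}
print(invert_synonyms(example, ['TL', 'RTL', 'Toe 1', 1]))
```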

We'll use the dict *uniquecols2dict* to rename the synonymous columns in our files once we have read them in.

<a id='ReadingAppendingData'></a>

# Reading and Appending Data
[Top](#TOC)

Now we use the function *readnsplit* to read in the source files and save each sheet as a separate intermediate file. 

Here is an example of how *readnsplit* works.  The code can be found in [Functions](#functions).

In [36]:
readnsplit(rawfiles[0],sourceinterDataBig,str)

Succes!  'CC 2000-03-modified from CC-SJ 00-03 final-modified w headers-3Jan19.xlsx',sheet '2000' has been saved to S:/Chris/TailDemography/TailDemography/Intermediate Files/Source and the corresponding                google drive file as S:/Chris/TailDemography/TailDemography/Intermediate Files/Source/CC 2000-03-modified from CC-SJ 00-03 final-modified w headers-3Jan19_2000.xlsx.
Succes!  'CC 2000-03-modified from CC-SJ 00-03 final-modified w headers-3Jan19.xlsx',sheet '2001' has been saved to S:/Chris/TailDemography/TailDemography/Intermediate Files/Source and the corresponding                google drive file as S:/Chris/TailDemography/TailDemography/Intermediate Files/Source/CC 2000-03-modified from CC-SJ 00-03 final-modified w headers-3Jan19_2001.xlsx.
Succes!  'CC 2000-03-modified from CC-SJ 00-03 final-modified w headers-3Jan19.xlsx',sheet '2002' has been saved to S:/Chris/TailDemography/TailDemography/Intermediate Files/Source and the corresponding                google drive f

In [37]:
for file in rawfiles:
    readnsplit(file,sourceInterGandolf,dtype=str, verbose=False)

We need to change the working directory to the location of the intermediate files.  We will also save a list of the file names in that location for convenience.

In [38]:
os.chdir(sourceInterGandolf)
splitfiles = glob.glob('*xls*')
splitfiles

['CC 2000-03-modified from CC-SJ 00-03 final-modified w headers-3Jan19_2000.xlsx',
 'CC 2000-03-modified from CC-SJ 00-03 final-modified w headers-3Jan19_2001.xlsx',
 'CC 2000-03-modified from CC-SJ 00-03 final-modified w headers-3Jan19_2002.xlsx',
 'CC 2000-03-modified from CC-SJ 00-03 final-modified w headers-3Jan19_2003.xlsx',
 'CC 2004_2004 .xlsx',
 'CC 2015 - captures_2015.xls',
 'CC 2016 - captures_2016.xls',
 'CC 2017 Lizards - 3viii17 captures and obs_2017.xls',
 'xCC2005x_2005.xls',
 'xCC2006x_2006.xls',
 'xCC2007x_2007.xls',
 'xCC2007x_Sheet1.xls',
 'xCC2008x_2008.xls',
 'xCC2009x_2009.xls',
 'xCC2010x_Sheet1.xlsx',
 'xCC2011x_Sheet1.xls',
 'xCC2012x_data.xls',
 'xCC2013x_CC 2013 data.xls',
 'xCC2014x_2014.xlsx']

Now we remove 'xCC2007x_Sheet1.xls' from the list of intermediate files we will process, since it is a subset of 'xCC2007x_2007.xls', reordered and with some columns dropped.

In [39]:
splitfiles = list(set(splitfiles)-set(['xCC2007x_Sheet1.xls']))
splitfiles

['xCC2012x_data.xls',
 'CC 2015 - captures_2015.xls',
 'xCC2008x_2008.xls',
 'CC 2000-03-modified from CC-SJ 00-03 final-modified w headers-3Jan19_2003.xlsx',
 'xCC2014x_2014.xlsx',
 'CC 2017 Lizards - 3viii17 captures and obs_2017.xls',
 'CC 2000-03-modified from CC-SJ 00-03 final-modified w headers-3Jan19_2000.xlsx',
 'xCC2005x_2005.xls',
 'xCC2009x_2009.xls',
 'CC 2000-03-modified from CC-SJ 00-03 final-modified w headers-3Jan19_2002.xlsx',
 'xCC2007x_2007.xls',
 'xCC2010x_Sheet1.xlsx',
 'xCC2006x_2006.xls',
 'xCC2011x_Sheet1.xls',
 'CC 2000-03-modified from CC-SJ 00-03 final-modified w headers-3Jan19_2001.xlsx',
 'CC 2016 - captures_2016.xls',
 'xCC2013x_CC 2013 data.xls',
 'CC 2004_2004 .xlsx']
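
Removing a file with a set difference shuffles the list, as the output above shows, and that order later determines the row order of the combined dataframe. A minimal sketch of an order-preserving alternative follows; `drop_files` is a hypothetical helper name.

```python
def drop_files(files, exclude):
    """Return files without the excluded names, keeping the original order."""
    excluded = set(exclude)
    return [f for f in files if f not in excluded]

example = ['xCC2006x_2006.xls', 'xCC2007x_2007.xls', 'xCC2007x_Sheet1.xls', 'xCC2008x_2008.xls']
print(drop_files(example, ['xCC2007x_Sheet1.xls']))
```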

Now we use the function *mapndrop* to rename columns according to a dictionary and drop unnecessary columns.

Here are a few examples of how *mapndrop* works.  The code can be found in [Functions](#functions).

In [40]:
mapndrop(df=pd.read_excel(splitfiles[0],dtype=str),dictionary=uniquecols2dict,verbose = True)

Successfully mapped columns for df.
Successfully dropped unnecessary columns for df.


Unnamed: 0,species,toes,date,sex,svl,tl,rtl,mass,paint.mark,location,meters,new.recap,painted,misc,vial
0,sv,2,2012-06-02 00:00:00,m,53.0,72.0,0.0,4.4,w19b,pine at top of site,,???,yes,actually caught on way down; toe #2 may be nat...,
1,sv,,2012-05-29 00:00:00,,,,,,???,1falls,,didn't catch,,,
2,sc,11,2012-06-01 00:00:00,F,98.0,129.0,0.0,35,.t,bottom site Rt side,,new,yes,BRIA CAUGHT IT!!!!!! :),12-41
3,cn ex,1-7,2012-05-27 00:00:00,f,89.0,165.0,75.0,19,w.a,sb at CCC,,new,yes,,12-27
4,sj,5-11-16,2012-05-24 00:00:00,F,62.0,85.0,0.0,8.4,w1c,5m^bottom site,,new,yes,,12-01
5,sj,5-11-17,2012-05-24 00:00:00,F,62.0,85.0,0.0,5.7,w2c,bottom wall v wall v pine xing,,new,yes,,12-02
6,sj,5-11-18,2012-05-24 00:00:00,F,62.0,86.0,0.0,7.6,w3c,T top R island,,new,yes,TOES CHANGED on 10 June 2012: from 5-11-18 to ...,12-03
7,sj,5-11-18,2012-05-24 00:00:00,F,71.0,99.0,0.0,8.9,w11c,R top 2 falls,,new,yes,NOTE that this animal may have been marked as ...,12-09
8,sj,5-11-19,2012-05-24 00:00:00,F,66.0,87.0,0.0,7.1,w5c,H3/H4,,new,yes,,12-04
9,sj,5-11-20,2012-05-24 00:00:00,M,77.0,105.0,0.0,16.6,w7c,top CCC,,new,yes,,12-05


In [41]:
mapndrop(df=pd.read_excel(splitfiles[4],dtype=str),dictionary=uniquecols2dict, verbose=True)

Successfully mapped columns for df.
Successfully dropped unnecessary columns for df.


Unnamed: 0,species,toes,date,sex,svl,tl,rtl,mass,paint.mark,location,meters,new.recap,painted,misc,vial
0,sj,17-15-20,2014-07-03 00:00:00,M,84,98,24,16.5,o10c,left wall v scree,321,recap,yes,,
1,sj,8-14-19,2014-07-03 00:00:00,M,85,120,0,21.7,o11c,left mid chute,358,recap,yes,Trec shed;loose scales,
2,sj,8-13-20,2014-07-03 00:00:00,F,71,68,-1,11.0,o12c .t,mid chute,358,recap,yes,T shed; Bss,
3,sj,8-15-17,2014-07-03 00:00:00,F,86,121,0,23.5,o13c,right wall @ pool,380,recap,yes,B shed; Tss,
4,sj,5-13-16,2014-07-03 00:00:00,F,75,100,0,11.0,o14c,2 falls,390,new,yes,Btrec shed,14-09
5,sj,8-14-16,2014-07-03 00:00:00,M,82,120,0,22.7,o15c,5m ^ 2 falls,395,recap,yes,,
6,sj,8-15-16,2014-07-03 00:00:00,M,85,116,0,22.0,o16c,opp oak r,418,recap,yes,Brec; Tss,
7,sj,8-12-17,2014-07-03 00:00:00,F,73,88,7,13.0,o17c,slab,262,recap,yes,B shed; Tss,
8,sj,7 -11-17,2014-07-03 00:00:00,F,68,45,30,8.7,o1c,-25,-25,recap,yes,,
9,sj,5-11,2014-07-03 00:00:00,F,60,84,0,7.1,o2c,r wall left sb v 1 falls,-7,new,yes,,14-01


We'll create an empty dataframe, *df*, whose columns are our desired columns, *i.e.* the keys of *coldict*, as a placeholder to which we can append new data.

In [42]:
df = pd.DataFrame(columns=coldict.keys())
df

Unnamed: 0,species,date,sex,svl,tl,rtl,mass,paint.mark,location,meters,new.recap,painted,misc,vial,autotomized,sighting,toes,filename


Now we will read in all of the successfully split files, clean the column names, and concatenate them into one large df.

In [43]:
for file in splitfiles:
    df = pd.concat([df,mapndrop(pd.read_excel(file,dtype=str),uniquecols2dict)],sort=True)
    print(df.shape[0])
print("\n\nFinal df has {} columns and {} rows.".format(df.shape[1],df.shape[0]))
df.head()

Successfully mapped columns for df.
Successfully dropped unnecessary columns for df.
85
Successfully mapped columns for df.
Successfully dropped unnecessary columns for df.
326
Successfully mapped columns for df.
Successfully dropped unnecessary columns for df.
460
Successfully mapped columns for df.
Successfully dropped unnecessary columns for df.
1477
Successfully mapped columns for df.
Successfully dropped unnecessary columns for df.
1574
Successfully mapped columns for df.
Successfully dropped unnecessary columns for df.
2372
Successfully mapped columns for df.
Successfully dropped unnecessary columns for df.
2581
Successfully mapped columns for df.
Successfully dropped unnecessary columns for df.
2783
Successfully mapped columns for df.
Successfully dropped unnecessary columns for df.
2945
Successfully mapped columns for df.
Successfully dropped unnecessary columns for df.
4422
Successfully mapped columns for df.
Successfully dropped unnecessary columns for df.
4604
Successfully m

Unnamed: 0,autotomized,date,filename,location,mass,meters,misc,new.recap,paint.mark,painted,rtl,sex,sighting,species,svl,tl,toes,vial
0,,2012-06-02 00:00:00,,pine at top of site,4.4,,actually caught on way down; toe #2 may be nat...,???,w19b,yes,0.0,m,,sv,53.0,72.0,2,
1,,2012-05-29 00:00:00,,1falls,,,,didn't catch,???,,,,,sv,,,,
2,,2012-06-01 00:00:00,,bottom site Rt side,35.0,,BRIA CAUGHT IT!!!!!! :),new,.t,yes,0.0,F,,sc,98.0,129.0,11,12-41
3,,2012-05-27 00:00:00,,sb at CCC,19.0,,,new,w.a,yes,75.0,f,,cn ex,89.0,165.0,1-7,12-27
4,,2012-05-24 00:00:00,,5m^bottom site,8.4,,,new,w1c,yes,0.0,F,,sj,62.0,85.0,5-11-16,12-01
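
The 'filename' column is empty in the output above because nothing records which file each row came from. A minimal sketch of one way to combine cleaned frames while recording their source is given here. The helper `combine_frames` and its arguments are assumptions for illustration; unlike the loop above, it concatenates once at the end and resets the index.

```python
import pandas as pd

def combine_frames(named_frames, columns):
    """Concatenate (name, dataframe) pairs and record each row's source in 'filename'."""
    pieces = []
    for name, frame in named_frames:
        if frame is None:  # e.g. a file that could not be cleaned
            continue
        piece = frame.copy()
        piece['filename'] = name
        pieces.append(piece)
    if not pieces:
        return pd.DataFrame(columns=list(columns))
    combined = pd.concat(pieces, ignore_index=True, sort=False)
    # keep only the desired columns, in the desired order
    return combined.reindex(columns=list(columns))

a = pd.DataFrame({'species': ['sj'], 'svl': ['62']})
b = pd.DataFrame({'species': ['sv'], 'tl': ['72']})
out = combine_frames([('a.xls', a), ('skipped.xls', None), ('b.xls', b)],
                     ['species', 'svl', 'tl', 'filename'])
print(out)
```

Collecting the pieces in a list and calling `pd.concat` once avoids copying the growing dataframe on every iteration.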


In [44]:
df = df.reindex(['species', 'toes', 'sex', 'date', 'svl', 'tl', 'rtl', 'autotomized', 'mass', 
                 'location', 'meters', 'new.recap', 'painted', 'sighting', 
                 'paint.mark', 'vial', 'misc'], axis=1)
df.head()

Unnamed: 0,species,toes,sex,date,svl,tl,rtl,autotomized,mass,location,meters,new.recap,painted,sighting,paint.mark,vial,misc
0,sv,2,m,2012-06-02 00:00:00,53.0,72.0,0.0,,4.4,pine at top of site,,???,yes,,w19b,,actually caught on way down; toe #2 may be nat...
1,sv,,,2012-05-29 00:00:00,,,,,,1falls,,didn't catch,,,???,,
2,sc,11,F,2012-06-01 00:00:00,98.0,129.0,0.0,,35.0,bottom site Rt side,,new,yes,,.t,12-41,BRIA CAUGHT IT!!!!!! :)
3,cn ex,1-7,f,2012-05-27 00:00:00,89.0,165.0,75.0,,19.0,sb at CCC,,new,yes,,w.a,12-27,
4,sj,5-11-16,F,2012-05-24 00:00:00,62.0,85.0,0.0,,8.4,5m^bottom site,,new,yes,,w1c,12-01,


In [45]:
df.shape

(6299, 17)

<a id='ExportingData'></a>

# Exporting Data
[Top](#TOC)

Here we call the function *namefile* to create a timestamped name for the file to be exported.  You can find the code for *namefile* in [Functions](#functions).

In [47]:
filename = namefile('Appended and Trimmed CC Data 2000-2017')
os.chdir(outputGandolf)
df.to_csv(filename,index = False)
print("\'{}\' has been saved to \'{}\' and the corresponding Google Drive location."\
      .format(filename, outputGandolf))

'Appended and Trimmed CC Data 2000-2017_2019-07-25 19hrs23min.csv' has been saved to 'C:/Users/craga/Google Drive/TailDemography/Cleaned Combined Data' and the corresponding Google Drive location.
