## Create a cleaned and combined file from one or more iDigBio raw occurrence files

###### Begin by going to https://www.idigbio.org/portal/search and searching the portal for a desired species.  For each species, download the comma-separated (CSV) "occurrence" file. Rename the file(s) with the scientific name of the represented species (Ex: Homo_sapiens.csv).  Place the file(s) you want to include in the analyses in a directory with an informative name, inside a sub-directory named Input_Files.  Four example files were created this way and placed in a folder named "Mussel_Example".


The final format should be...

Directory containing this code >
Directory with informative name > 
Directory named Input_Files >
CSV file(s) with species name as their file name
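
As a quick sanity check on this layout, the sketch below confirms that a project directory contains an Input_Files sub-directory holding at least one CSV file. The helper name `check_layout` is hypothetical and not part of this notebook, and the demonstration runs against a throwaway temporary directory with a placeholder file rather than real data.

```python
import tempfile
from pathlib import Path


def check_layout(project_dir):
    """Return the sorted CSV names in <project_dir>/Input_Files (hypothetical helper)."""
    input_dir = Path(project_dir) / "Input_Files"
    if not input_dir.is_dir():
        raise FileNotFoundError(f"Missing sub-directory: {input_dir}")
    csv_names = sorted(p.name for p in input_dir.glob("*.csv"))
    if not csv_names:
        raise ValueError(f"No CSV files found in {input_dir}")
    return csv_names


# Demonstration on a temporary directory, so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_input = Path(tmp) / "Mussel_Example" / "Input_Files"
    demo_input.mkdir(parents=True)
    (demo_input / "Amblema_plicata.csv").write_text("placeholder\n")
    found = check_layout(Path(tmp) / "Mussel_Example")
    print(found)
```

Raising an error early, instead of returning an empty list, makes a misplaced or misnamed sub-directory obvious before any cleaning starts.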

###### The default working directory is the Scott_Quarter_Project directory.

This creates a file containing the species, institution code, catalog number, basis of record, individual count, country, state or province, locality, collection year, decimal latitude, and decimal longitude.  It can also combine multiple species into a single file.

The identity of institution codes can be looked up at http://grbio.org/find-biorepositories

Import necessary packages.

In [1]:
import os
import re
import csv

Name the directory that contains the file(s) you want to include in this single output document.  This directory should contain no other CSV files.  This step allows you to pick and choose the file(s), in case the working directory includes extra CSV files.

**This cell should be customized as needed**

In [2]:
directory_name = 'Mussel_Project'

Extract CSV file names from defined directory as a list.

**Confirm the output of this cell is correct.**

In [3]:
path = directory_name + "/Input_Files"
extension = '.csv'

list_of_files = []

for root, dirs_list, files_list in os.walk(path):
    for file_name in files_list:
        if os.path.splitext(file_name)[-1] == extension:
            list_of_files.append(file_name)
print(list_of_files)

['Amblema_plicata.csv', 'Ligumia_nasuta.csv', 'Quadrula_pustulosa.csv', 'Quadrula_quadrula.csv']


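Move into the project directory and create a directory for the cleaned files.

In [4]:
os.chdir(directory_name)
# os.makedirs works on any operating system; exist_ok=True avoids an
# error when Cleaned_Files is already present from an earlier run
os.makedirs("Cleaned_Files", exist_ok=True)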

Create functions that return the collection year, latitude, longitude or a blank string, depending on the record

In [5]:
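def find_collection_year(date):
    # The first run of four digits in the date string is taken as the year
    try:
        collection_year = re.search(r'(\d{4})', date).group(1)
    except (AttributeError, TypeError):
        # No match (AttributeError) or a missing value (TypeError)
        collection_year = ""
    return collection_year

def find_lat(geopoint):
    # The first number in the geopoint string is taken as the latitude
    try:
        lat = re.search(r'-?[0-9]\d*(\.\d+)?', geopoint).group(0)
    except (AttributeError, TypeError):
        lat = ""
    return lat

def find_long(geopoint):
    # The number just before the closing brace is taken as the longitude
    try:
        long = re.search(r'(-?[0-9]\d*(\.\d+)?)}$', geopoint).group(1)
    except (AttributeError, TypeError):
        long = ""
    return long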
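
The regular expressions above depend on the position of each number within the geopoint string. As an alternative sketch, and assuming the `idigbio:geoPoint` value is a JSON object with `lat` and `lon` keys (an assumption about the download format, not something this notebook confirms), the value could be parsed by key instead. The function name `parse_geopoint` and the sample string are hypothetical.

```python
import json


def parse_geopoint(geopoint):
    """Return (lat, lon) as strings, or ("", "") if the value cannot be parsed."""
    try:
        point = json.loads(geopoint)
        return str(point["lat"]), str(point["lon"])
    except (ValueError, KeyError, TypeError):
        # Blank or malformed text, a missing key, or a non-object value
        return "", ""


sample = '{"lat": 41.5, "lon": -83.25}'  # assumed geoPoint format
lat, lon = parse_geopoint(sample)
blank = parse_geopoint("")
print(lat, lon, repr(blank))
```

Unlike the regex helpers, this approach does not care which key comes first, but converting through a float may change how a number is written (for example, trailing zeros are dropped).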

Create a function that creates an empty CSV, cleans up the data for a given file, and saves the cleaned data row by row as a file in the newly created Cleaned_Files directory.

In [6]:
def clean_file(file):
    filename = "Input_Files/" + file
    outputname = "Cleaned_Files/" + file
    # Column names in output order; the trailing "\n" ends the header line
    categories = ["species", "institution_code", "catalog_number", "basis_of_record",
                  "individual_count", "country", "state_or_province", "locality", "collection_year",
                  "latitude", "longitude", "\n"]
    # Take the species name from this file's own name so it is consistent for its records
    species = re.search(r'(([a-zA-Z]*)_([a-zA-Z]*))\.csv', file)
    species = species.group(2) + " " + species.group(3)
    # Open the output once, in write mode, so a rerun does not append duplicate rows
    with open(filename) as f, open(outputname, 'w') as fd:
        fd.write(",".join(categories))
        reader = csv.DictReader(f, delimiter=',')
        # For each row gather the variables, removing commas that would split a column
        for rec in reader:
            institution_code = rec['dwc:institutionCode'].replace(",", "")
            catalog_number = rec['dwc:catalogNumber'].replace(",", "")
            basis_of_record = rec['dwc:basisOfRecord'].replace(",", "")
            # If the individual count is blank, assume it is 1
            individual_count = rec['dwc:individualCount']
            if individual_count == "":
                individual_count = '1'
            country = rec['dwc:country'].replace(",", "")
            state_or_province = rec['dwc:stateProvince'].replace(",", "")
            locality = rec['dwc:locality'].replace(",", "")
            collection_year = find_collection_year(rec['idigbio:eventDate'])
            latitude = find_lat(rec['idigbio:geoPoint'])
            longitude = find_long(rec['idigbio:geoPoint'])
            row = [species, institution_code, catalog_number, basis_of_record,
                   individual_count, country, state_or_province, locality, collection_year,
                   latitude, longitude, "\n"]
            fd.write(",".join(row))
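
The function above strips commas from text fields so that they cannot split a column. A different technique, sketched here and not used by this notebook, is to let `csv.writer` quote any field that contains a comma, which keeps the original text intact. The sample row is invented for illustration and is not a real record.

```python
import csv
import io

# Invented sample record; the locality deliberately contains a comma.
row = ["Amblema plicata", "XYZ", "12345", "PreservedSpecimen", "1",
       "United States", "Ohio", "River bank, north of bridge", "1998",
       "41.5", "-83.25"]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(row)  # fields containing commas are wrapped in double quotes
line = buffer.getvalue()

# Reading the line back recovers the locality with its comma intact.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed[7])
```

The trade-off is that any downstream script must read the file with a real CSV parser, since splitting lines on commas would no longer be safe.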

Make an individual CSV file for each species.

In [7]:
# Loop over the list built in cell 3, not the leftover os.walk loop variable
for i in list_of_files:
    clean_file(i)
    print("Completed file " + i)

Completed file Amblema_plicata.csv
Completed file Ligumia_nasuta.csv
Completed file Quadrula_pustulosa.csv
Completed file Quadrula_quadrula.csv


Make a single combined CSV file containing all species.

In [8]:
combined_file_name = "Cleaned_Files/Combined_" + directory_name + ".csv"

# Column names in output order; the trailing "\n" ends the header line
categories = ["species", "institution_code", "catalog_number", "basis_of_record",
              "individual_count", "country", "state_or_province", "locality", "collection_year",
              "latitude", "longitude", "\n"]

# Write mode ("w") keeps a rerun from appending a second copy of the data
with open(combined_file_name, "w") as combined_file:
    # Write the header once
    combined_file.write(",".join(categories))
    # Now copy the rows of each cleaned species file
    for file in list_of_files:
        cleaned_name = "Cleaned_Files/" + file
        with open(cleaned_name) as f:
            next(f)  # skip the header
            for line in f:
                combined_file.write(line)
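
One way to confirm that every species made it into the combined file is to count the rows per species. The sketch below is a minimal illustration: the helper name `count_records_by_species` is hypothetical, and the in-memory rows are invented stand-ins for the real combined file.

```python
import csv
import io
from collections import Counter


def count_records_by_species(csv_file):
    """Count data rows per value of the 'species' column (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    return Counter(row["species"] for row in reader)


# Tiny in-memory stand-in for the combined file; the rows are invented.
demo = io.StringIO(
    "species,institution_code,catalog_number\n"
    "Amblema plicata,AAA,1\n"
    "Amblema plicata,BBB,2\n"
    "Ligumia nasuta,AAA,3\n"
)
counts = count_records_by_species(demo)
print(counts)
```

To run it on real output, open the combined file in Cleaned_Files and pass the open file handle to the function in place of `demo`.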

This combined csv file or individual species file can be input into the R script "Species_Record_Analyses" to determine the top institution with these specimens, as well as create lists of specimen details, including catalog numbers, based on institution.