<div style="text-align: right">
    
    Caleb Powell
    calebadampowell@gmail.com
    https://github.com/CapPow
    

    Dakila Ledesma
    bgq527@mocs.utc.edu
    https://github.com/bgq527
    
</div>    

## Python for Biology, Geology, and Environmental Science Majors (BGE).

This notebook is intended for BGE majors who are new to Python. It assumes basic knowledge of Python's syntax and data types. Intermediate Python users may find it useful to skip ahead to one of the concise **recaps**.

### Working with Biodiversity Data

### Table of Contents
 - [The iDigBio Portal](#idigbio)
 - [Morel Hunting Date](#morelExample)
 - [Retrieving Data From iDigBio](#retrievingData)
     - [Querying for specific data](#buildquery)
 - [Analyze the data using Pandas](#pandasanalysis)
     - [Dropping Null values](#dropna)
 - [Using Python's datetime library](#datetime)
 - [Operators reference](#operators)
 
<a id='idigbio'></a>
### The iDigBio Portal

To practice using Python for retrieving and analyzing biodiversity data, we will be using data retrieved from the [Integrated Digitized Biocollections (iDigBio)](https://www.idigbio.org/portal) portal, which aggregates biodiversity data from natural history collections.

<a id='morelExample'></a>
<img src="files/assets/morel.jpg">

### Morel Hunting Date
In this example, we'll use iDigBio's Python library to help determine the best time of year to go hunting for a popular gourmet mushroom, morels. Morels are typically wild harvested and notoriously ephemeral, meaning they are only around for a brief period, which makes timing very important. You'll need to:

- Retrieve data from the iDigBio portal
- Analyze the data using Pandas
- Use Python's datetime library

***

To start with, we will need to install the [`idigbio` python client](https://github.com/iDigBio/idigbio-python-client#installation). This library makes it very easy to interact with iDigBio's web application programming interface (API). Web APIs, or "data services", offer programmatic access to data, making the automation of data gathering much simpler.

After installing the library, we will need to import the necessary libraries:

In [1]:
import pandas as pd
import idigbio

<a id='retrievingData'></a>
### Retrieving Data From iDigBio
Much of this walkthrough was adapted from the example code provided on [iDigBio's GitHub](https://github.com/iDigBio/idigbio-python-client#basic-usage). Since the idigbio library accesses data through iDigBio's web API, additional details on how to use it can be found in the [web API's documentation](https://github.com/idigbio/idigbio-search-api/wiki#records).

The library offers two options for returning the biodiversity data: JSON format or a Pandas DataFrame. In this example, we'll be using the Pandas DataFrame interface, which is specified following the library name:
```{python}
idigbio.pandas()
```
To avoid having to specify this each time, we'll create a variable named `api` to use as a shortcut:

In [2]:
api = idigbio.pandas()

### Querying for specific data
[idigbio's documentation](https://github.com/iDigBio/idigbio-python-client#examples) shows us that the `.search_records()` method of the object returned by `idigbio.pandas()` expects a dictionary in which keys are the field parameters and values are the specific values we would like to query on. This means our `search_records()` call should look something like:
```{python}
my_query = {field_1 : value_1, 
            field_2 : value_2,
            field_3 : value_3}

api.search_records(my_query)
```
A list of available fields for record queries is [available here](https://github.com/idigbio/idigbio-search-api/wiki/Index-Fields#record-query-fields).

iDigBio aggregates data from millions of natural history records. Since we are interested in the best date to find [morels](#morelExample) in Tennessee, we'll need to define a query which specifies a taxon and a region. 

In this case, "True Morels" fall under the genus _Morchella_, and since I'm at the University of Tennessee at Chattanooga, we will define our region as the state of "Tennessee". Using the record [query fields](https://github.com/idigbio/idigbio-search-api/wiki/Index-Fields#record-query-fields) documentation we can see that `genus` and `stateprovince` are options which seem to fit our parameters.

Using this information our `search_records()` should look similar to this:

```{python}
my_query = {'genus' : 'Morchella',
            'stateprovince' : 'Tennessee'}

api.search_records(my_query)
```
<a id='buildquery'></a>
However, if we use variable names in place of the values this code will be easy to modify and reuse later. Run the cell below to make the query.

In [3]:
# set a variable for the genus we want to query
genusOfInterest = 'Morchella'

# set a variable for the states we're interested in
statesOfInterest = ['Tennessee', 'Kentucky', 'Virginia','North Carolina',
                    'South Carolina','Georgia','Alabama', 'Mississippi', 'Arkansas']

# define a dictionary with the query's "key word arguments"
my_query = {'genus':genusOfInterest,
            "stateprovince":statesOfInterest}

# call iDigbio's api, using the query we built. The result is a dataframe.
df = api.search_records(rq=my_query)

# spotcheck what was returned
# notice we are using the "display()" function, which is specific to jupyter notebooks
# This can be replaced with the "print()" in other interpreters.
display(df.sample(2))
# examine the columns available
display(df.columns)
# examine the total quantity of results
display(df.shape)

| uuid | basisofrecord | canonicalname | catalognumber | class | collectioncode | collectionid | collector | continent | coordinateuncertainty | country | ... | recordset | scientificname | specificepithet | startdayofyear | stateprovince | taxonid | taxonomicstatus | taxonrank | typestatus | verbatimeventdate |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 3652eedc-88da-4909-a230-b69371f2f0b2 | preservedspecimen | morchella esculenta | tenn-f-007245 | pezizomycetes | tenn-f | 97e2d271-3744-48a3-92b5-5a86afbfb01d | h.m. jennison | north america | | united states | ... | 04d9b721-259c-4d6b-b48f-2e23edf66c9f | morchella esculenta | esculenta | 77.0 | tennessee | 8574619 | accepted | species | | |
| 64e32ded-fd76-477c-beae-ae29e74503bb | preservedspecimen | morchella | tenn-f-030541 | pezizomycetes | tenn-f | 97e2d271-3744-48a3-92b5-5a86afbfb01d | margaret boarts | north america | | united states | ... | 04d9b721-259c-4d6b-b48f-2e23edf66c9f | morchella | | 77.0 | tennessee | 2594601 | doubtful | genus | | |


Index(['basisofrecord', 'canonicalname', 'catalognumber', 'class',
       'collectioncode', 'collectionid', 'collector', 'continent',
       'coordinateuncertainty', 'country', 'countrycode', 'county',
       'datasetid', 'datecollected', 'datemodified', 'dqs', 'etag',
       'eventdate', 'family', 'flags', 'genus', 'geopoint', 'hasImage',
       'hasMedia', 'highertaxon', 'indexData', 'institutioncode',
       'institutionid', 'kingdom', 'locality', 'mediarecords', 'municipality',
       'occurrenceid', 'order', 'phylum', 'recordids', 'recordnumber',
       'recordset', 'scientificname', 'specificepithet', 'startdayofyear',
       'stateprovince', 'taxonid', 'taxonomicstatus', 'taxonrank',
       'typestatus', 'verbatimeventdate'],
      dtype='object')

(100, 47)
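Since the goal of using variables was to make the query easy to reuse, one possible next step is to wrap the dictionary building in a small function. The sketch below is only an illustration: `build_record_query` is a hypothetical helper written for this notebook and is not part of the idigbio library. It does not contact iDigBio, it only builds the dictionary.

```python
def build_record_query(genus, states):
    """Return a record-query dictionary for one genus and one or more states.

    NOTE: hypothetical helper for illustration, not an idigbio function.
    """
    # accept a single state name as well as a list of names
    if isinstance(states, str):
        states = [states]
    return {'genus': genus, 'stateprovince': list(states)}


# build the same kind of query used above, for a smaller region
example_query = build_record_query('Morchella', ['Tennessee', 'Georgia'])
print(example_query)
```

The resulting dictionary could then be passed along in the same way as before, for example `api.search_records(rq=example_query)`.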

### Analyze the data using Pandas
<a id='pandasanalysis'></a>

As we expected, our data was returned as a Pandas ["DataFrame"](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), so we can now utilize any of the Pandas methods to analyze it. One oddity in this DataFrame is that it does not have a Pandas-generated index but is instead indexed by "uuid", a type of [universal identifier](https://dwc.tdwg.org/rdf/#132-internationalized-resource-identifier-iri-non-normative).
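To see what a uuid index means in practice, here is a minimal sketch using a tiny hand-built DataFrame as a stand-in for the real query result. The two uuid labels are copied from the sample output above; the variable names (`toy`, `one_record`, `renumbered`) are made up for this example.

```python
import pandas as pd

# a tiny hand-built stand-in for the query result (two records, two columns)
toy = pd.DataFrame(
    {'genus': ['morchella', 'morchella'],
     'startdayofyear': [77.0, 77.0]},
    index=pd.Index(['3652eedc-88da-4909-a230-b69371f2f0b2',
                    '64e32ded-fd76-477c-beae-ae29e74503bb'], name='uuid'))

# look up a single record by its uuid label
one_record = toy.loc['3652eedc-88da-4909-a230-b69371f2f0b2']
print(one_record['startdayofyear'])

# move uuid into an ordinary column and get a default 0, 1, 2... index
renumbered = toy.reset_index()
print(renumbered.columns.tolist())
```

Keeping the uuid index is usually fine for this walkthrough; `reset_index()` is only needed if a plain numbered index is easier to work with.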

The result has exactly 100 rows (the first value in `df.shape`), which seems like a suspiciously round number. The [web API's documentation](https://github.com/idigbio/idigbio-search-api/wiki#records) tells us there is an optional `limit` parameter. When an optional parameter is omitted, it is given a default value. Let's test using a higher limit value to see if we get additional results. Since we have already defined `my_query`, we can simply call `api.search_records()` again, with the optional parameter this time:

In [4]:
df = api.search_records(rq=my_query, limit = 1000)

# display the shape of the results
display(df.shape)

(174, 47)

That increased the quantity of resulting records!
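A quick way to spot this kind of silent cut-off is to compare the number of rows returned against the limit that was in effect. The sketch below uses stand-in DataFrames with the same row counts seen above rather than live query results, and `may_be_truncated` is a hypothetical helper name. It also assumes the default limit of the first query was 100, which the round result suggests but this notebook does not confirm.

```python
import pandas as pd


def may_be_truncated(result, limit):
    """Return True when a result has as many rows as the limit allowed."""
    return len(result) >= limit


# stand-in DataFrames with the same row counts seen above
first_result = pd.DataFrame({'genus': ['morchella'] * 100})
second_result = pd.DataFrame({'genus': ['morchella'] * 174})

# assumption: the first query ran with a default limit of 100
print(may_be_truncated(first_result, 100))
# the second query ran with limit=1000 and returned fewer rows than that
print(may_be_truncated(second_result, 1000))
```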

***
The columns in the resulting DataFrame are organized under a biodiversity data standard called [The Darwin Core](https://dwc.tdwg.org/). Being standardized makes it easy to understand (and look up) what we expect to find in each column. For example:
- ['eventdate'](https://terms.tdwg.org/wiki/dwc:eventDate) column stores the date the specimen was collected

- ['startdayofyear'](https://terms.tdwg.org/wiki/dwc:startDayOfYear) column is the day of the year (e.g., 1 is January 1st).

We can use this data to determine the most frequent day of the year morels are found in our region of interest. Since we're interested in the date of collection, we should take a look at the data in the `eventdate` column.

In [5]:
display(df['eventdate'].tail(3))

uuid
743f7634-e87c-43b4-9f96-3c4a726ba012    NaN
9ef764ac-800c-479e-b106-e9e1b4fb74c2    NaN
20005957-6e79-4684-918d-2742663c6d87    NaN
Name: eventdate, dtype: object

### Dropping Null values
<a id='dropna'></a>
The result `NaN` is short for "Not a Number", which is how Pandas marks an empty field. If a record lacks data for the date of collection, it will not be useful in our analysis. We can use Pandas' [`.dropna()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) with the optional `subset` parameter to clear those away:

In [6]:
# drop those rows with empty 'eventdate' and overwrite df with the results.
df = df.dropna(subset=['eventdate'])

# Spot check 'eventdate' for a few rows and display the resulting shape.
display(df['eventdate'].tail(2))
display(df.shape)

uuid
42cf631e-ad7a-415d-86d4-f4a8786c0252    [redacted]
fc42b6ca-e476-4f1f-b88d-fac0e0d2e97e    [redacted]
Name: eventdate, dtype: object

(64, 47)

It looks like that dropped a number of rows, yet some of the remaining values in `eventdate` are not dates but text strings informing us the date was 'redacted'. Morel hunting is a popular hobby in some regions, so it is reasonable that some mycologists have chosen to protect the data surrounding this genus.
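As an aside on why the `subset` parameter matters: many columns in these records (such as `typestatus` in the sample output earlier) are often empty, so calling `.dropna()` with no arguments could discard nearly everything. The sketch below shows the difference on a toy DataFrame. The dates in it are made-up values for illustration only.

```python
import numpy as np
import pandas as pd

# toy data: the dates are invented, typestatus is empty for every row
toy = pd.DataFrame({
    'eventdate': ['1934-03-18', np.nan, '1951-04-02'],
    'typestatus': [np.nan, np.nan, np.nan],
})

# without subset: a row is dropped if ANY column is empty
all_columns = toy.dropna()
print(all_columns.shape)

# with subset: only the listed columns are checked for empty values
only_eventdate = toy.dropna(subset=['eventdate'])
print(only_eventdate.shape)
```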

***
To handle those values which are not dates, we will use Pandas' [`to_datetime()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html), which attempts to convert a value, or an array of values (such as a Series), into a date object. Using the optional `errors` parameter, we can tell Pandas how to handle non-convertible data such as those 'redacted' strings. The "coerce" option for the `errors` parameter turns those non-convertible values into `NaT` ("Not a Time"), the null value for dates. Let's combine this with what [we learned](#dropna) about `dropna()` to further clean this data.

In [7]:
# "Cast" the entire eventdate column into a date object.
# Those which cannot be converted will become "NaT"
df['eventdate'] = pd.to_datetime(df['eventdate'], errors='coerce')

# Drop the resulting "NaT" values.
df = df.dropna(subset=['eventdate'])
display(df.shape)

(10, 47)

There are not many rows left. It may be more informative to return to the previous cells and add states from similar latitudes to the query to increase the number of results. Return to the [initial query](#buildquery), add additional states to `statesOfInterest`, then rerun the following cells to attempt to gather a more robust dataset.
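When rerunning the cleaning cells, it can help to see how many values each step removes. The sketch below applies `errors='coerce'` to a small made-up Series containing one invented date, one '[redacted]' string, and one empty value, then counts what was lost. The variable names are only for this example.

```python
import numpy as np
import pandas as pd

# toy column: one parseable (invented) date, one redacted string, one empty value
raw = pd.Series(['1934-03-18', '[redacted]', np.nan])

# values that cannot be parsed become NaT instead of raising an error
converted = pd.to_datetime(raw, errors='coerce')

# count how many values are now null (redacted or originally empty)
print(converted.isna().sum())

# keep only the values that are real dates
kept = converted.dropna()
print(len(kept))
```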

***
<a id='datetime'></a>

## Using Python's datetime library

Once the dataset is gathered and has been restricted to only those records with known dates of collection, we can determine the average date of collection. Because of leap years, specific dates do not always land on the same day of the year, so rather than averaging calendar dates we will average the day of the year across our records. Luckily, our data contains the column ['startdayofyear'](https://terms.tdwg.org/wiki/dwc:startDayOfYear), which gives us the:
> "ordinal day of the year on which the Event occurred"

To begin, we will use Pandas' `.mean()` method to determine an average `startdayofyear`.

In [8]:
# notice we included the optional parameter "skipna=True" (this is also the default)
avgDayOfYear = df['startdayofyear'].mean(skipna=True)
display(avgDayOfYear)


100.0
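As a side note on `skipna`, the sketch below shows what the parameter changes, using a small made-up Series of day-of-year values with one empty entry. The numbers are illustrative and are not taken from the query results.

```python
import numpy as np
import pandas as pd

# toy day-of-year values (invented), including one empty entry
days = pd.Series([77.0, 95.0, np.nan, 128.0])

# empty values are skipped, so the mean uses the three real numbers
mean_skipping = days.mean(skipna=True)
print(mean_skipping)

# with skipna=False a single empty value makes the whole result NaN
mean_strict = days.mean(skipna=False)
print(mean_strict)
```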

The avgDayOfYear is interesting but how do we make this information more useable? Let's convert this value to a date on this year's calendar by adding `avgDayOfYear` to January 1st of this year.

First we'll import the datetime library, then use its `.now()` function to get the current date:

In [9]:
import datetime
# notice there is a class called datetime within the library called datetime
# this is why the command looks so redundant.
currentDate = datetime.datetime.now()

The result, `currentDate`, is a [datetime object](https://docs.python.org/3/library/datetime.html#datetime-objects), a data type with attributes for each portion of the date. For example:
***
To display the current month, as an integer:
```{python}
display(currentDate.month)
```
***
To display the current year, as an integer:
```{python}
display(currentDate.year)
```
Using the current year, we will build a new datetime object to represent January 1st of this year. 

In [10]:
thisYear = currentDate.year
startOfYear = datetime.datetime(thisYear,1,1)

Now, to get the best date for morel hunting in this calendar year, we will add the `avgDayOfYear` to the `startOfYear`. To do this, we will use datetime's `timedelta` class, which represents a span of time. Its first argument is a number of days, and adding a `timedelta` to a datetime object returns a new datetime object shifted by that span.

In [11]:
bestDate = startOfYear + datetime.timedelta(avgDayOfYear)

# display the results
display(f'The average day for collecting morels is {bestDate}.')

'The average day for collecting morels is 2019-04-11 00:00:00.'
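One detail worth knowing: because January 1st is already day 1 of the year, adding the full `avgDayOfYear` to January 1st lands one day after the ordinal day itself. If that one-day offset matters, a sketch like the following subtracts 1 before adding. `day_of_year_to_date` is a hypothetical helper name, and rounding to a whole day is an assumption made for this example.

```python
import datetime


def day_of_year_to_date(day_of_year, year):
    """Convert an ordinal day of the year (1 = January 1st) to a date."""
    start_of_year = datetime.date(year, 1, 1)
    # subtract 1 because January 1st is already day 1, not day 0
    offset = datetime.timedelta(days=round(day_of_year) - 1)
    return start_of_year + offset


# day 100 of 2019, the year shown in the output above
best = day_of_year_to_date(100.0, 2019)
print(best.isoformat())

# the same ordinal day falls one calendar day earlier in a leap year
print(day_of_year_to_date(100.0, 2020).isoformat())
```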

<a id='operators'></a>
### Operators reference

#### Arithmetic
|Type|Python|
|-----|-----|
|Addition|+|
|Subtraction|-|
|Multiplication|*|
|Division|/|
|Floor Division|//|
|Exponentiation|**|
|Modulo|%|
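A short sketch of the less familiar arithmetic operators, using a day-of-year value like the one from the morel example (the variable names are made up for illustration):

```python
day_of_year = 100

full_weeks = day_of_year // 7   # floor division keeps only the whole weeks
extra_days = day_of_year % 7    # modulo keeps the remainder
weeks_exact = day_of_year / 7   # division always returns a float
squared = day_of_year ** 2      # exponentiation, here raising to the power 2

print(full_weeks, extra_days, weeks_exact, squared)
```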

#### Logic

|Normal|Python|Alternative|
|-----|-----|-----|
|And|and|-|
|Or|or|-|
|Not|not|-|
|More than|>|-|
|Less than|<|-|
|Equal to|==|-|
|Not equal to|!=|-|
|More than or equal to|>=|-|
|Less than or equal to|<=|-|
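The logic operators can be combined to test several conditions at once. The sketch below checks whether a day of the year falls between March 1st and May 31st of a non-leap year (days 60 through 151). The date range is chosen purely for illustration and says nothing about when morels actually appear.

```python
day_of_year = 100
genus = 'Morchella'

# True only when both comparisons are True
in_spring = day_of_year >= 60 and day_of_year <= 151

# combine an equality test with the result above
is_morel_in_spring = genus == 'Morchella' and in_spring

# "not" flips True to False and False to True
outside_spring = not in_spring

print(in_spring, is_morel_in_spring, outside_spring)
```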

#### Assignment
|Type|Python|
|-----|-----|
|Assign|=|
|Add to|+=|
|Subtract from|-=|
|Multiply by|*=|
|Divide by|/=|
|Floor divide by|//=|
|Modulo by|%=|
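Each assignment operator is shorthand for applying the arithmetic operator and storing the result back in the same variable. A minimal sketch with a made-up counter:

```python
count = 10
count += 5    # same as count = count + 5
count -= 3    # same as count = count - 3
count *= 2    # same as count = count * 2
count //= 5   # same as count = count // 5
count %= 3    # same as count = count % 3

print(count)
```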
