<div style="text-align: right">
    
    Caleb Powell
    calebadampowell@gmail.com
    https://github.com/CapPow
    

    Dakila Ledesma
    bgq527@mocs.utc.edu
    https://github.com/bgq527
    
</div>    

## Python for Biology, Geology, and Environmental Science Majors (BGE).

This notebook is intended for BGE majors who are new to Python. It assumes basic knowledge of Python's syntax and data types. Intermediate Python users may find it useful to skip ahead to one of the concise **recaps**.

### Working with Biodiversity Data

### Table of Contents
 - [Pandas for working with tabular data](#pandasIntro)
     - [Pandas' Data Structures](#dataStructures)
     - [Selecting Subsets](#columnSelection)
     - [Pandas recap](#pandasRecap)
 - [Hypothesis Testing](#hypothesisTesting)
     - [Hypothesis recap](#hypothesisRecap)
 - [Visualizing the results](#visualization)
 - [Operators reference](#operators)
 

### The iDigBio Portal

To practice using Python for retrieving and analyzing biodiversity data, we will be using data retrieved from the [Integrated Digitized Biocollections (iDigBio)](https://www.idigbio.org/portal) portal, which aggregates biodiversity data from natural history collections.

<a id='morelExample'></a>
<img src="files/assets/morel.jpg">

### Morel Hunting Date
In this example, we'll use iDigBio's Python library to help determine the best time of year to go hunting for a popular gourmet mushroom, the morel. Morels are typically wild-harvested and notoriously ephemeral: they are only around for a brief period, which makes timing very important. You'll need to:

- Retrieve data from the iDigBio portal
- Analyze the data using Pandas
- Use Python's datetime library

***

To start, we will need to install the [`idigbio` Python client](https://github.com/iDigBio/idigbio-python-client#installation). This library makes it very easy to interact with iDigBio's web application programming interface (API). Web APIs, or "data services", offer programmatic access to data, which makes automating data gathering much simpler.

After installing the library, we will need to import the necessary libraries:

In [6]:
import pandas as pd
import idigbio

<a id='retrievingData'></a>
### Retrieving Data From iDigBio
Much of this walkthrough was adapted from the example code provided on [iDigBio's GitHub](https://github.com/iDigBio/idigbio-python-client#basic-usage). Since the idigbio library accesses data through iDigBio's web API, additional details on how to use it can be found in the [web API's documentation](https://github.com/idigbio/idigbio-search-api/wiki#records).

The library offers two options for returning biodiversity data: JSON format or a Pandas DataFrame. In this example, we'll be using the Pandas DataFrame interface, which is selected by calling `pandas()` on the library:
```{python}
idigbio.pandas()
```
To avoid having to specify this each time, we'll create a variable named `api` to use as a shortcut:

In [5]:
api = idigbio.pandas()
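
To picture the difference between the two return formats without calling the API, the sketch below builds a few JSON-style records by hand and loads them into a DataFrame. The records and their values are made up for illustration; the real response from iDigBio has many more fields and may be structured differently.

```python
import pandas as pd

# Hypothetical records, written by hand to mimic JSON-style output.
# These are NOT real iDigBio results.
json_style_records = [
    {"uuid": "example-1", "genus": "morchella", "stateprovince": "tennessee"},
    {"uuid": "example-2", "genus": "morchella", "stateprovince": "georgia"},
]

# with JSON-style data, values are reached by looping and by key lookups
states_from_json = [record["stateprovince"] for record in json_style_records]

# loading the same records into a DataFrame gives one row per record
example_df = pd.DataFrame(json_style_records).set_index("uuid")

# with a DataFrame, a whole column is selected at once
states_from_df = example_df["stateprovince"].tolist()

print(states_from_json)
print(example_df)
```

Both approaches reach the same values; the DataFrame simply saves writing the loop.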

### Querying for specific data
[idigbio's documentation](https://github.com/iDigBio/idigbio-python-client#examples) shows us that the `.search_records()` method of the object returned by `idigbio.pandas()` expects a dictionary in which the keys are field names and the values are the specific values we would like to query on. This means our `search_records()` call should look something like:
```{python}
my_query = {field_1 : value_1, 
            field_2 : value_2,
            field_3 : value_3}

api.search_records(rq=my_query)
```
A list of available fields for record queries is [available here](https://github.com/idigbio/idigbio-search-api/wiki/Index-Fields#record-query-fields).

iDigBio aggregates data from millions of natural history records. Since we are interested in the best date to find [morels](#morelExample) in Tennessee, we'll need to define a query which specifies a taxon and a region. 

In this case, "True Morels" fall under the genus _Morchella_, and since I'm at the University of Tennessee at Chattanooga, we will define our region as the state of "Tennessee". Using the record [query fields](https://github.com/idigbio/idigbio-search-api/wiki/Index-Fields#record-query-fields) documentation we can see that `genus` and `stateprovince` are options which seem to fit our parameters.

Using this information, our `search_records()` call should look similar to this:

```{python}
my_query = {'genus' : 'Morchella',
            'stateprovince' : 'Tennessee'}

api.search_records(rq=my_query)
```

However, if we use variable names in place of the values this code will be easy to modify and reuse later. Run the cell below to make the query.

In [15]:
# set a variable for the genus we want to query
genusOfInterest = 'Morchella'

# set a variable for the state we're interested in
stateOfInterest = 'Tennessee'  # could also be a list: ['Tennessee','Georgia','North Carolina','Alabama']

# define a dictionary with the query's "key word arguments"
my_query = {'genus':genusOfInterest, "stateprovince":stateOfInterest}

# call iDigbio's api, using the query we built. The result is a dataframe.
df = api.search_records(rq=my_query)

# spotcheck what was returned
display(df.sample(2))
# examine the columns available
display(df.columns)
# examine the total quantity of results
display(df.shape)

|uuid|basisofrecord|canonicalname|catalognumber|class|collectioncode|collectionid|collector|continent|coordinateuncertainty|country|...|recordnumber|recordset|scientificname|specificepithet|startdayofyear|stateprovince|taxonid|taxonomicstatus|taxonrank|verbatimeventdate|
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|70abd314-2ba7-4b72-8286-5bfbce01d0c5|preservedspecimen|morchella crassipes|tenn-f-004143|pezizomycetes|tenn-f|97e2d271-3744-48a3-92b5-5a86afbfb01d|l.r. hesler|north america| |united states|...| |04d9b721-259c-4d6b-b48f-2e23edf66c9f|morchella crassipes|crassipes|77.0|tennessee|2594612|accepted|species| |
|8e38017d-129d-4014-b047-5f09a071d22a|preservedspecimen|morchella conica|tenn-f-003806|pezizomycetes|tenn-f|97e2d271-3744-48a3-92b5-5a86afbfb01d|s.l. wallace, l.r. hesler|north america| |united states|...| |04d9b721-259c-4d6b-b48f-2e23edf66c9f|morchella conica|conica|77.0|tennessee|9014337|accepted|species| |


Index(['basisofrecord', 'canonicalname', 'catalognumber', 'class',
       'collectioncode', 'collectionid', 'collector', 'continent',
       'coordinateuncertainty', 'country', 'countrycode', 'county',
       'datasetid', 'datecollected', 'datemodified', 'dqs', 'etag',
       'eventdate', 'family', 'flags', 'genus', 'geopoint', 'hasImage',
       'hasMedia', 'indexData', 'institutioncode', 'kingdom', 'locality',
       'mediarecords', 'municipality', 'occurrenceid', 'order', 'phylum',
       'recordids', 'recordnumber', 'recordset', 'scientificname',
       'specificepithet', 'startdayofyear', 'stateprovince', 'taxonid',
       'taxonomicstatus', 'taxonrank', 'verbatimeventdate'],
      dtype='object')

(78, 44)
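
Wrapping the query in a small function is one way to make it reusable. The helper below, `build_query`, is a hypothetical name and not part of the `idigbio` library; it only builds the dictionary and does not contact iDigBio. Passing a list of states assumes the API treats a list of values as "match any of these", which should be checked against the query format documentation.

```python
def build_query(genus, states):
    """Return a record-query dictionary for one genus and one or more states."""
    # accept either a single state name or a list of state names
    if isinstance(states, str):
        state_value = states
    else:
        state_value = list(states)
    return {'genus': genus, 'stateprovince': state_value}

single_state_query = build_query('Morchella', 'Tennessee')
multi_state_query = build_query('Morchella', ['Tennessee', 'Georgia'])

print(single_state_query)
print(multi_state_query)
```

The resulting dictionary would then be passed along as `api.search_records(rq=single_state_query)`, as in the cell above.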

### Analyze the data using Pandas

Since the result is a Pandas ["DataFrame"](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), we can utilize any of the Pandas methods to analyze it. For example, `collector` is among the columns in our DataFrame. The cell below briefly reviews how Pandas and list methods can be combined to determine the most prolific collector in our DataFrame.

In [48]:
# Pandas methods:
# find the collector name that appears most often (exact matches only)
display(df['collector'].mode())
# however, the same person may be entered in several different ways;
# list the unique entries to check
display(df['collector'].unique())
# store each unique entry in a list-like object
collectors = df['collector'].unique()

# List methods:
# start with an empty list
collectorTerms = []
# iterate over each unique entry, splitting it into separate words
for collectorName in collectors:  
    collectorName = collectorName.strip()
    collectorTerms.extend(collectorName.split())    

# keep only the longer words (likely surnames), stripped of punctuation
long_collector_terms = []
for singleTerm in collectorTerms:
    singleTerm = singleTerm.strip(" .;,")
    if len(singleTerm) > 3:
        long_collector_terms.append(singleTerm)

# count how many rows of the DataFrame mention each term
collector_count = {}
for term in long_collector_terms:
    row_condition = df['collector'].str.contains(term)
    results = df.loc[row_condition,:]
    rowCount, colCount = results.shape
    collector_count[term] = rowCount

# find the (term, count) pair with the largest count
import operator
prolific = max(collector_count.items(), key=operator.itemgetter(1))
display(prolific)

df.shape

0    l.r. hesler
dtype: object

array(['p. b. matheny', 'r. swenie; a. hobbs', 'b.p. looney',
       'c.c. braaten', 'yie hong ke', 'margaret boarts',
       's.l. wallace, l.r. hesler', 'l.r. hesler', 'l.r. hesler & party',
       'n. rennie & r.h. petersen', 'a.d. wolfenbarger', 'r. petersen',
       'j.w. johnson', 'a.d. wolfenbarger & j. robinson',
       'a.j. sharp, l.r. hesler', 's.a. cain', 'anne watson',
       'r.h. petersen', 'a.j. sharp', 's.l.w.', 'l.r. hesler, a.j. sharp',
       'h.m. jennison', 'l. fuller', 'mrs. a.j. sharp', 'hesler l. r.',
       'l. r. hesler', 'geo. taylor', 'r. swenie', 's.a. cain, duncan',
       'a.j. sharp & l.r. hesler', 'sharp; hesler', 'j.n. mccarroll',
       'e.b. lickey', 'w.n. higgenbottom', 'wilbur duncan, s. cain',
       'cain; duncan', 'duncan; cain', 'tenn; duncan & cain'],
      dtype=object)

('hesler', 33)

(78, 44)
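
The same count can be sketched more compactly with Pandas string methods. The example uses a handful of the collector strings shown in the output above rather than the full DataFrame, and it assumes a surname such as `hesler` is a good enough key for identifying a collector.

```python
import pandas as pd

# a few collector strings copied from the unique values above
collector_examples = pd.Series([
    'l.r. hesler',
    'hesler l. r.',
    'l. r. hesler',
    's.l. wallace, l.r. hesler',
    'a.j. sharp',
])

# value_counts() (like mode()) only groups exact matches,
# so each spelling is counted separately
exact_counts = collector_examples.value_counts()

# str.contains() returns True/False for each row; summing counts the True values
# regex=False treats the term as plain text; na=False treats missing names as no match
hesler_rows = collector_examples.str.contains('hesler', regex=False, na=False)
hesler_count = int(hesler_rows.sum())

print(exact_counts.max())   # largest number of identical spellings
print(hesler_count)         # rows mentioning "hesler" in any spelling
```

Matching on a surname also counts joint collections, which may or may not be what is wanted when ranking collectors.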

<a id='dropna'></a>
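The cells below refer to the query results as `pandas_output` and to the list of states queried as `nearbyStates`. Their output reflects a query over several states, so the row and column counts differ from those of `df` above. Among the many columns in the `pandas_output` DataFrame, two are useful here:

- The ['eventdate'](https://terms.tdwg.org/wiki/dwc:eventDate) column stores the date the specimen was collected.

- The ['startdayofyear'](https://terms.tdwg.org/wiki/dwc:startDayOfYear) column stores the day of the year (e.g., 1 is January 1st).

We can use this data to determine the average day of the year morels are found in this region.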


In [156]:
# start by dropping all records which have no data in 'eventdate'
# notice we save the result of dropna back to pandas_output.
# this means we overwrite pandas_output with the results after dropping the null values
pandas_output = pandas_output.dropna(subset=['eventdate'])

# before we move on we should check how many records are left
# remember the shape attribute is a tuple of (rows, columns)
print(pandas_output.shape)

(21, 46)
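
How `dropna(subset=...)` behaves is easier to see on a small table. The three rows below are invented for illustration, and the sketch assumes the dates are ISO-formatted strings that `pd.to_datetime` can parse.

```python
import pandas as pd

# invented example rows; None marks a record with no collection date
toy_records = pd.DataFrame({
    'collector': ['a', 'b', 'c'],
    'eventdate': ['2019-04-11', None, '2018-03-30'],
})

# keep only the rows where 'eventdate' has a value
dated_records = toy_records.dropna(subset=['eventdate'])

# parse the date strings, then ask each date for its day of the year
parsed_dates = pd.to_datetime(dated_records['eventdate'])
day_of_year = parsed_dates.dt.dayofyear

print(dated_records.shape)
print(day_of_year.tolist())
```

Because `dropna` returns a new DataFrame, `toy_records` still holds all three rows until the result is assigned back to it.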


In [157]:
# Calculate the mean of the 'startdayofyear' column. 
# notice we included the parameter "skipna=True,"
# remember Shift+Tab while the cursor is inside a function call displays that function's options.
avgDayOfYear = pandas_output['startdayofyear'].mean(skipna=True)
print(f'The average day of the year for {genusOfInterest} in {nearbyStates} is: {avgDayOfYear}.')

# the avgDayOfYear is useful, but how do we make this information more usable?
# Let's convert it to a date by adding the avgDayOfYear to January 1st of this year.
# First we'll import the "datetime" library which comes with python.
import datetime

# Using the datetime library's "now()" function, save the current date to a variable
currentDate = datetime.datetime.now()
# display the results of the current date
print(f'The current date & time is: {currentDate}.')
# The currentDate produced has a ".year" attribute
thisYear = currentDate.year
print(f'The current year is {thisYear}')

# save a variable for a date object representing January 1st of this year.
startOfYear = datetime.date(thisYear,1,1)

# add the avgDayOfYear, to get this year's best date
# datetime's timedelta represents a length of time (here, a number of days).
# adding a timedelta to a date returns a new date; only whole days are counted.
bestDate = startOfYear + datetime.timedelta(avgDayOfYear)

# print the results
print(f'The average day for collecting morels is {bestDate}.')

The average day of the year for Morchella in ['Tennessee', 'Georgia', 'North Carolina', 'Alabama'] is: 100.875.
The current date & time is: 2019-06-03 14:29:49.752707.
The current year is 2019
The average day for collecting morels is 2019-04-11.
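
The conversion can also be wrapped in a function. This sketch makes two assumptions that differ slightly from the cell above: it rounds the average to the nearest whole day, and it treats day 1 as January 1st, so it adds `day - 1` days to the start of the year. The function name `day_of_year_to_date` is made up for this example.

```python
import datetime

def day_of_year_to_date(year, day_of_year):
    """Convert a (possibly fractional) day of the year into a calendar date."""
    # round to the nearest whole day
    whole_day = round(day_of_year)
    start_of_year = datetime.date(year, 1, 1)
    # day 1 is January 1st itself, so only (whole_day - 1) days are added
    return start_of_year + datetime.timedelta(days=whole_day - 1)

print(day_of_year_to_date(2019, 1))
print(day_of_year_to_date(2019, 100.875))
```

For the average of 100.875 reported above, this gives the same date as the cell's output.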


<a id='activityMorel'></a>
### _**Activity: Midwest morel hunting**_
Scripts are often written using an example or template as a starting point. In the cell(s) below, modify the morel hunting example by changing the states checked to ones found in the Midwest. To do this, start by referencing [the initial query we built](#query).

<img src="assets/middle3-1.png">

<a id='publicAPIs'></a>
### Public web APIs:

[datetime documentation](https://docs.python.org/3/library/datetime.html)

[Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/)

[list of public APIs](https://github.com/toddmotto/public-apis)

[iDigBio's Python API (examples and documentation)](https://github.com/iDigBio/idigbio-python-client)

<a id='operators'></a>
### Operators reference

#### Arithmetic
|Type|Python|
|-----|-----|
|Addition|+|
|Subtraction|-|
|Multiplication|*|
|Division|/|
|Floor Division|//|
|Exponentiation|**|
|Modulo|%|
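
The less familiar arithmetic operators are easiest to tell apart side by side. The numbers below are arbitrary and only serve as an illustration.

```python
# dividing 17 specimens among 5 boxes
specimens = 17
boxes = 5

true_division = specimens / boxes      # 3.4 (always a float)
floor_division = specimens // boxes    # 3   (whole specimens per box)
remainder = specimens % boxes          # 2   (specimens left over)
power = boxes ** 2                     # 25  (5 raised to the power 2)

print(true_division, floor_division, remainder, power)
```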

#### Logic

|Normal|Python|Element-wise alternative (e.g., Pandas)|
|-----|-----|-----|
|And|and|&|
|Or|or|\||
|Not|not|~|
|More than|>|-|
|Less than|<|-|
|Equal to|==|-|
|Not equal to|!=|-|
|More than or equal to|>=|-|
|Less than or equal to|<=|-|
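
When filtering a Pandas Series or DataFrame, the keywords `and`, `or`, and `not` are replaced by the element-wise operators `&`, `|`, and `~`. The day-of-year values below are invented, and days 91 to 120 are used as April in a non-leap year.

```python
import pandas as pd

# invented day-of-year values for five records
days = pd.Series([77, 95, 101, 120, 150])

# plain Python values use the keywords and / or / not
single_day = 101
in_april = single_day >= 91 and single_day <= 120

# a Pandas Series needs the element-wise operators & | ~
# each comparison is wrapped in parentheses
april_mask = (days >= 91) & (days <= 120)
not_april_mask = ~april_mask

print(in_april)
print(days[april_mask].tolist())
print(days[not_april_mask].tolist())
```

The parentheses matter: `&` and `|` bind more tightly than the comparison operators, so leaving them out changes the meaning of the expression.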

#### Assignment
|Type|Python|
|-----|-----|
|Assign|=|
|Add to|+=|
|Subtract from|-=|
|Multiply by|*=|
|Divide by|/=|
|Floor divide by|//=|
|Modulo by|%=|
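
Each of these operators is shorthand for applying the matching arithmetic operator and saving the result back to the same variable. A short illustration with an arbitrary starting value:

```python
# keep a running tally in a single variable
record_count = 10

record_count += 5    # same as record_count = record_count + 5
record_count -= 3    # same as record_count = record_count - 3
record_count *= 2    # same as record_count = record_count * 2
record_count //= 5   # same as record_count = record_count // 5
record_count %= 3    # same as record_count = record_count % 3

print(record_count)
```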
