# Python coding bootcamp, day 1

    Caleb Powell
    calebadampowell@gmail.com
    https://github.com/CapPow
    

    Dakila Ledesma
    bgq527@mocs.utc.edu
    https://github.com/bgq527
    
Materials heavily modified from: http://swcarpentry.github.io/python-novice-inflammation/  
  
### Objectives
_Today, we'll cover:_

* [Using notebooks for Python development](#notebooks)
    * Jupyter Notebook
* [Understanding and using variables](#variables)
* [Overview of basic datatypes](#dtypes)
    * String
    * Integer
    * Float
* [Overview of basic containers](#containers)
    * Lists
    * Tuples
    * Dictionaries
* [Building and using functions](#functions)
* [Loops for repeated tasks](#usingloops)
* [Using external libraries](#externalLibs)
* [Pandas for working with tabular data](#pandas)
    * Load csv and xlsx files
    * Select specific data (conditional, exact, and 'contains')
* [Real world example: morel hunting date](#morelHunting)
    * iDigbio API
    * datetime library

<a id='notebooks'></a>
## Using notebooks for Python development

Today we'll be writing and running Python in notebook documents using the "Jupyter Notebook App." We'll start by reviewing a portion of Jupyter's introduction materials, available from [the project's GitHub](https://github.com/jupyter/notebook/tree/master/docs/source/examples/Notebook): 

## What is the Jupyter Notebook?

The Jupyter Notebook is an **interactive computing environment** that enables users to author notebook documents that include: 
- Live code
- Interactive widgets
- Plots
- Narrative text
- Equations
- Images
- Video

These documents provide a **complete and self-contained record of a computation** that can be converted to various formats and easily shared with others.

Notebook documents contain the **inputs and outputs** of an interactive session as well as **narrative text** that accompanies the code but is not meant for execution. **Rich output** generated by running code, including HTML, images, video, and plots, is embedded in the notebook, which makes it a complete and self-contained record of a computation. 

When you run the notebook web application on your computer, notebook documents are just **files on your local filesystem with a `.ipynb` extension**. This allows you to use familiar workflows for organizing your notebooks into folders and sharing them with others.

Notebooks consist of a **linear sequence of cells**. There are three basic cell types:

* **Code cells:** Input and output of live code that is run in the kernel
* **Markdown cells:** Narrative text with embedded LaTeX equations
* **Raw cells:** Unformatted text that is included, without modification, when notebooks are converted to different formats using nbconvert

So far, every cell in this notebook has been a **markdown cell**.

**Notebooks can be exported** to different static formats including HTML, reStructuredText, LaTeX, PDF, and slide shows ([reveal.js](https://revealjs.com)) using Jupyter's `nbconvert` utility.

## The Notebook dashboard

When you first start the notebook server, your browser will open to the notebook dashboard. The dashboard serves as a home page for the notebook. Its main purpose is to display the notebooks and files in the current directory. For example, here is a screenshot of the dashboard page for the `examples` directory in the Jupyter repository:

![Jupyter dashboard showing files tab](assets/dashboard_files_tab.png)

The top of the notebook list displays clickable breadcrumbs of the current directory. By clicking on these breadcrumbs or on sub-directories in the notebook list, you can navigate your file system.

To create a new notebook, click on the "New" button at the top of the list and select a kernel from the dropdown (as seen below).  Which kernels are listed depends on what's installed on the server.  Some of the kernels in the screenshot below may not be available to you.

![Jupyter "New" menu](assets/dashboard_files_tab_new.png)

The notebook list shows green "Running" text and a green notebook icon next to running notebooks (as seen below). Notebooks remain running until you explicitly shut them down; closing the notebook's page is not sufficient.


![Jupyter dashboard showing one notebook with a running kernel](assets/dashboard_files_tab_run.png)

To shut down, delete, duplicate, or rename a notebook, check the checkbox next to it and an array of controls will appear at the top of the notebook list (as seen below).  You can also use the same operations on directories and files when applicable.

![Buttons: Duplicate, rename, shutdown, delete, new, refresh](assets/dashboard_files_tab_btns.png)

To see all of your running notebooks along with their directories, click on the "Running" tab:

![Jupyter dashboard running tab](assets/dashboard_running_tab.png)

## Overview of the Notebook UI

If you create a new notebook or open an existing one, you will be taken to the notebook user interface (UI). This UI allows you to run code and author notebook documents interactively. The notebook UI has the following main areas:

* Menu
* Toolbar
* Notebook area and cells

The notebook has an interactive tour of these elements that can be started in the "Help:User Interface Tour" menu item.

## Mouse navigation

All navigation and actions in the Notebook are available using the mouse through the menubar and toolbar, which are both above the main Notebook area:

![Jupyter notebook menus and toolbar](assets/menubar_toolbar.png)

The first idea of mouse based navigation is that **cells can be selected by clicking on them.** The currently selected cell gets a grey or green border depending on whether the notebook is in edit or command mode. If you click inside a cell's editor area, you will enter edit mode. If you click on the prompt or output area of a cell you will enter command mode.

Try selecting different cells and going between edit and command mode. Try typing into a cell.

The second idea of mouse based navigation is that **cell actions usually apply to the currently selected cell**. Thus if you want to run the code in a cell, you would select it and click the <button class='btn btn-default btn-xs'><i class="fa fa-step-forward icon-step-forward"></i></button> button in the toolbar or the "Cell:Run" menu item. Similarly, to copy a cell you would select it and click the <button class='btn btn-default btn-xs'><i class="fa fa-copy icon-copy"></i></button> button in the toolbar or the "Edit:Copy" menu item. With this simple pattern, you should be able to do most everything you need with the mouse.

Markdown cells have one other state that can be modified with the mouse. These cells can either be rendered or unrendered. When they are rendered, you will see a nice formatted representation of the cell's contents. When they are unrendered, you will see the raw text source of the cell. To render the selected cell with the mouse, click the <button class='btn btn-default btn-xs'><i class="fa fa-step-forward icon-step-forward"></i></button> button in the toolbar or the "Cell:Run" menu item. To unrender the selected cell, double click on the cell.

## Keyboard Navigation

The modal user interface of the Jupyter Notebook has been optimized for efficient keyboard usage. This is made possible by having two different sets of keyboard shortcuts: one set that is active in edit mode and another in command mode.

The most important keyboard shortcuts are `Enter`, which enters edit mode, and `Esc`, which enters command mode.

In edit mode, most of the keyboard is dedicated to typing into the cell's editor. Thus, in edit mode there are relatively few shortcuts.  In command mode, the entire keyboard is available for shortcuts, so there are many more.  The `Help`->`Keyboard Shortcuts` dialog lists the available shortcuts.

We recommend learning the command mode shortcuts in the following rough order:

1. Basic navigation: `enter`, `shift-enter`, `up/k`, `down/j`
2. Saving the notebook: `s`
3. Change Cell types: `y`, `m`, `1-6`, `t`
4. Cell creation: `a`, `b`
5. Cell editing: `x`, `c`, `v`, `d`, `z`
6. Kernel operations: `i`, `0` (press twice)

# Python Basics

Now that we've covered the basics of notebooks, let's start using this notebook to learn Python.

<a id='variables'></a>
## Understanding and using variables
Any Python interpreter can be used as a calculator. Let's use the code cell below to demonstrate simple arithmetic using Python. A cell can be re-run at any time: run the code below, then modify it and run it again.

In [4]:
3 + 5 * 4


23

This is great but not very interesting. To do anything useful with data, we need to [assign](https://swcarpentry.github.io/python-novice-inflammation/reference/#assign) its value to a [variable](https://swcarpentry.github.io/python-novice-inflammation/reference/#variable). In Python, we can assign a value to a variable, using the equals sign `=`. For example, to assign value `60` to a variable `weight_kg`, we would execute:

In [2]:
weight_kg = 60


From now on, whenever we use `weight_kg`, Python will substitute the value we assigned to
it. In essence, **a variable is just a name for a value**.

In Python, variable names:

 - can include letters, digits, and underscores
 - cannot start with a digit
 - are [case sensitive](https://swcarpentry.github.io/python-novice-inflammation/reference/#case-sensitive).

This means that, for example:
 - `weight0` is a valid variable name, whereas `0weight` is not
 - `weight` and `Weight` are different variables
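
A minimal sketch of these naming rules in action. The names and values below are made up for illustration and are not used elsewhere in this lesson.

```python
# Two distinct variables: the names differ only in the case of the first letter
weight = 60
Weight = 75

print(weight)   # the lowercase name
print(Weight)   # the capitalized name

# Digits and underscores are allowed, as long as the name does not start with a digit
weight_0 = 61.5
print(weight_0)
```

Trying a name such as `0weight` instead would stop Python with a `SyntaxError` before any code runs.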


<a id='usingvariables'></a>
## Using Variables in Python
To display the value of a variable to the screen in Python, we can use the `print` function:


In [3]:
print(weight_kg)

60


We can display multiple things at once using only one `print` command:

In [4]:
print('weight in kilograms: ', weight_kg)

weight in kilograms:  60


Moreover, we can do arithmetic and use variables right inside the `print` function using an ["f-string"](https://docs.python.org/3/reference/lexical_analysis.html#f-strings), which lets us include Python code inside a string by placing the code inside `{}`.

In [5]:
print(f'{weight_kg} kg in pounds is: {2.2 * weight_kg}lbs.')

60 kg in pounds is: 132.0lbs.


The above command, however, did not change the value of `weight_kg`:

In [6]:
print(weight_kg)

60


<a id='saveVariable'></a>
To change the value of the `weight_kg` variable, we have to
**assign** `weight_kg` a new value using the equals `=` sign:

In [7]:
weight_kg = weight_kg + 5
print(f'weight in kilograms is now: {weight_kg}')

weight in kilograms is now: 65


## The notebook "gotcha"

As we learned in the [Jupyter Notebook introduction](#notebooks), notebooks can make your analysis transparent and reproducible. They are also convenient when learning how to program because you can run and test small portions of your code at a time. This convenience has one "gotcha" that often confuses students:

**Cells can be run out of order**, causing the variables stored in memory to be altered in non-obvious ways. Notice the code from the [cell above](#saveVariable) adds a value to an existing variable and then overwrites that variable with the new one.

```{python}
weight_kg = weight_kg + 5
```

- What happens if you run [that cell](#saveVariable) again?

- What happens if you go all the way to the start of the "[Using Variables in Python](#usingvariables)" section and run those cells again?

## Variables as Sticky Notes

A variable is analogous to a sticky note with a name written on it: assigning a value to a variable is like putting that sticky note on a particular value.
<img src="files/assets/python-sticky-note-variables-01.svg">

This means that assigning a value to one variable does **not** change values of other variables. For example, let's store the subject's weight in pounds in its own variable:

In [8]:
# reset the weight_kg to 65.
weight_kg = 65.0

# There are 2.2 pounds per kilogram
weight_lb = 2.2 * weight_kg
print(f'weight in kilograms: {weight_kg} and in pounds: {weight_lb}')


weight in kilograms: 65.0 and in pounds: 143.0


Let's now change `weight_kg` to a different value:


In [9]:
weight_kg = 100.0
print(f'weight in kilograms is now: {weight_kg} and weight in pounds is still:{weight_lb}')

weight in kilograms is now: 100.0 and weight in pounds is still:143.0


<img src="files/assets/python-sticky-note-variables-03.svg">

Since `weight_lb` doesn't "remember" where its value comes from, it is not updated when we change `weight_kg`.

<a id='dtypes'></a>
## Overview of basic datatypes
Python knows various types of data. Three common ones are:

* integer numbers
* floating point numbers, and
* strings.

When we first created `weight_kg`, we gave it an integer value of `60`.
To create a variable with a floating point value, we can execute:


In [14]:
weight_kg = 60.0

And to create a string we simply have to add single or double quotes around some text, for example:

In [15]:
weight_kg_text = 'weight in kilograms:'

In [16]:
integer_var = 3
float_var = 3.1
complex_var = 3+2j

print(F'Type of integer_var: {type(integer_var)}')
print(F'Type of float_var: {type(float_var)}')
print(F'Type of complex_var: {type(complex_var)}')

Type of integer_var: <class 'int'>
Type of float_var: <class 'float'>
Type of complex_var: <class 'complex'>
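
Values can also be converted from one type to another with the built-in functions `int()`, `float()`, and `str()`. The following is a small sketch; the variable names are hypothetical and chosen only for this illustration.

```python
count_text = '42'          # a string that happens to contain digits
count = int(count_text)    # string -> integer
ratio = float(count)       # integer -> float
label = str(ratio)         # float -> string

print(type(count_text), type(count), type(ratio), type(label))
print(count + 1)           # arithmetic works on the integer, not on the string
```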


#### String

In [16]:
first_name = "Bob"
last_name = "Builder"

print(F'My whole name is {first_name} {last_name}')

My whole name is Bob Builder


What about when you want to include a quotation mark in a string? For example, how would you display the text: `You may call me "Mr. Builder"`? There are multiple ways to "escape" Python's interpretation of characters within a string.

For example, you can interchange the use of double and single quotes:

In [19]:
print(f"You may call me 'Mr. {last_name}'")

You may call me 'Mr. Builder'


Alternatively, you can use the `\` to escape the next character.

In [21]:
print(f'You may call me \'Mr. {last_name}\'')

You may call me 'Mr. Builder'
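
The two cells above print single quotes around the name. As a sketch of how the same two techniques could produce the double-quoted text from the original question, the example below stores each result in a variable (`option_1` and `option_2` are names invented for this illustration).

```python
last_name = "Builder"

# Option 1: wrap the string in single quotes so double quotes can appear inside it
option_1 = f'You may call me "Mr. {last_name}"'

# Option 2: wrap the string in double quotes and escape the inner ones with a backslash
option_2 = f"You may call me \"Mr. {last_name}\""

print(option_1)
print(option_2)
```

Both lines should print the same text, so the choice between them is mostly a matter of readability.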


<a id='containers'></a>
## Overview of basic containers

#### Lists

In Python, lists are a flexible way to store a sequence of values; they can contain numbers, strings, and even functions.

Here, we are creating a list with the following numbers: 8, 6, 7, 14, -3, -12.

In [22]:
numbers_list = [8, 6, 7, 14, -3, -12]
print(numbers_list)

[8, 6, 7, 14, -3, -12]


To add numbers to an already created list, you can use the method `append()`. As you can see here, we are adding the number 0 to the end of the list.

In [23]:
numbers_list.append(0)
print(numbers_list)

[8, 6, 7, 14, -3, -12, 0]


Sorting lists in Python is easy. All you need to do is call the list's `sort()` method, which sorts the list in place.

In [15]:
numbers_list.sort()
print(numbers_list)

[-12, -3, 0, 6, 7, 8, 14]


Reversing a list is also easy, using the `reverse()` method. This could be used, for example, if you want to have a reverse-sorted list.

In [19]:
numbers_list.reverse()
print(numbers_list)

[14, 8, 7, 6, 0, -3, -12]


Removing a specific value from a list is done with the `remove()` method, which removes the first matching item. For example, we are removing the number 7 from the list below.

In [20]:
numbers_list.remove(7)
print(numbers_list)

[14, 8, 6, 0, -3, -12]


If you want to remove an item at a specific index in the list, use `del`. Here, we are deleting the item at index 1, which is the second item in the list.

In [21]:
del numbers_list[1]
print(numbers_list)

[14, 6, 0, -3, -12]


If you want to move an item (e.g. *cut and paste* rather than *copy and paste*) to another list, or to another variable, use `pop()`. As you can see in the example below, `my_number` is set to the value at index 3 (the fourth item), and that item is removed from the list.

In [22]:
my_number = numbers_list.pop(3)
print(F'My number = {my_number}')
print(F'Numbers list = {numbers_list}')

My number = -3
Numbers list = [14, 6, 0, -12]


You can concatenate (append) multiple lists really easily, using the addition operator.

In [23]:
concat_numbers_list = numbers_list + numbers_list
print(concat_numbers_list)

[14, 6, 0, -12, 14, 6, 0, -12]


You can also insert values into a list using `insert()`. You will need to specify the index at which to insert and the value to insert. As you can see, we are inserting the number 29 at index 1, making it the second item.

In [24]:
numbers_list.insert(1, 29)
print(numbers_list)

[14, 29, 6, 0, -12]


Setting a value in the list is also easy, using the index and the equals operator.

In [25]:
numbers_list[0] = 120
print(numbers_list)

[120, 29, 6, 0, -12]


Lastly, you can get the count or length of a list using the built-in `len()` function.

In [26]:
print(len(numbers_list))

5


#### Tuples

Tuples are like lists; however, they are *immutable*, meaning that you cannot change the items within the tuple after it is created (such as removing items, setting items, etc.).

In [27]:
numbers_tuple = (14, 12, 94)
print(numbers_tuple)

(14, 12, 94)
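
To see what immutability means in practice, here is a minimal sketch that tries to change a tuple. It uses `try`/`except`, which this lesson has not introduced, only so that the error is printed instead of stopping the cell.

```python
numbers_tuple = (14, 12, 94)

# Reading an item by index works just like it does for a list
print(numbers_tuple[0])

# Assigning to an index raises a TypeError because tuples are immutable
try:
    numbers_tuple[0] = 120
except TypeError as error:
    print(f'Could not change the tuple: {error}')

# The tuple still holds its original values
print(numbers_tuple)
```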


#### Dictionaries

Dictionaries are like lists, but instead of using indices (e.g. ```numbers_list[0]```, where ```0``` is the index), you use keys.

In [28]:
information = {'name':'Bob Builder', 'age':55, 'gender':'male'}
print(F"Bob's name is {information['name']}")
print(F"Bob's gender is {information['gender']}")

Bob's name is Bob Builder
Bob's gender is male
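
Unlike tuples, dictionaries can be changed after they are created. The sketch below adds and updates entries; the `'occupation'` key and the new age are invented for this example.

```python
information = {'name': 'Bob Builder', 'age': 55, 'gender': 'male'}

# Assigning to a key that does not exist yet adds a new entry
information['occupation'] = 'builder'

# Assigning to an existing key replaces the value stored under it
information['age'] = 56

print(information)
print(f'Number of entries: {len(information)}')
```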


### Operators

#### Arithmetic
|Type|Python|
|-----|-----|
|Addition|+|
|Subtraction|-|
|Multiplication|*|
|Division|/|
|Floor Division|//|
|Exponentiation|**|
|Modulo|%|
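
A short sketch of the less familiar operators in the table, using two arbitrary example values.

```python
a = 7
b = 2

print(a / b)    # division always returns a float: 3.5
print(a // b)   # floor division drops the remainder: 3
print(a % b)    # modulo returns the remainder: 1
print(a ** b)   # exponentiation, 7 to the power of 2: 49
```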

#### Logic

|Normal|Python|Alternative|
|-----|-----|-----|
|And|and|-|
|Or|or|-|
|Not|not|-|
|More than|>|-|
|Less than|<|-|
|Equal to|==|-|
|Not equal to|!=|-|
|More than or equal to|>=|-|
|Less than or equal to|<=|-|
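
Each of these operators produces either `True` or `False`. A minimal sketch, with variable names and values made up for illustration:

```python
age = 55
has_tools = True

print(age > 50)                     # True
print(age == 55 and has_tools)      # True, both sides are True
print(age < 18 or not has_tools)    # False, neither side is True
print(age != 55)                    # False
```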

#### Assignment
|Type|Python|
|-----|-----|
|Assign|=|
|Add to|+=|
|Subtract from|-=|
|Multiply by|*=|
|Divide by|/=|
|Floor divide by|//=|
|Modulo by|%=|
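
These operators are shorthand for updating a variable based on its current value. A small sketch, where `total` is a hypothetical variable:

```python
total = 10
total += 5     # same as total = total + 5, so total is now 15
total -= 3     # 12
total *= 2     # 24
total //= 5    # 4
total %= 3     # 1
print(total)
```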


## Adding Comments to Your Code 

In Python, the `#` symbol tells the interpreter that the remainder of that line should be ignored. This is often used to insert comments into code. It is good practice to comment your code frequently; this can help when you, or someone else, looks back on it. 

In [29]:
# here is one example of an in code comment.
x = 5 * 5 # here is another example of an in code comment
# notice how the variable assigned is unaffected by the inline comments 
print(f'The value of x is still: {x}')

# sometimes, comments are used to remove unnecessary lines of code which might be useful later.
# y = x * 5
# print(y)

The value of x is still: 25


<a id='functions'></a>
## Building and using functions

When learning [how to work with variables](#usingvariables), we converted kilograms to pounds. What if we want to use that code again, on a different dataset or at a different point in our program? Cutting and pasting it is going to make our code get very long and very repetitive, very quickly.

We'd like a way to package our code so that it is easier to reuse, and Python provides for this by letting us define things called 'functions', a shorthand way of re-executing longer pieces of code.

Let's start by defining a function `fahr_to_celsius` that converts temperatures from Fahrenheit to Celsius:

In [30]:
def fahr_to_celsius(temp):
    """ a function which converts fahrenheit to celsius.

        Parameters
        ----------
        temp : integer or float, temperature value in fahrenheit

    """
    return ((temp - 32) * (5/9))

<img src="files/assets/python-function.svg">



The function definition opens with the keyword `def` followed by the name of the function (`fahr_to_celsius`) and a parenthesized list of parameter names (`temp`). Inside a set of ```"""``` is a description of the function and its [parameters](http://swcarpentry.github.io/python-novice-inflammation/reference/#parameters), called a [docstring](http://swcarpentry.github.io/python-novice-inflammation/reference/#docstring). This information can be accessed later in Jupyter Notebook using the ```Shift + Tab``` key combination.

The [body](http://swcarpentry.github.io/python-novice-inflammation/reference/#body) of the function --- the statements that are executed when it runs --- is indented below the definition line. The body concludes with a `return` keyword followed by the return value.

When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a [return statement](http://swcarpentry.github.io/python-novice-inflammation/reference/#return-statement) to send a result back to whoever asked for it.

Let's try running our function.


In [31]:
freezing_point_celsius = fahr_to_celsius(32)
print(f'freezing point of water: {freezing_point_celsius} C')


freezing point of water: 0.0 C


This command calls our function, using `32` as the input, and saves the result to the variable `freezing_point_celsius`.

In fact, calling our own function is no different from calling any other function:

In [32]:
boiling_point_celsius = fahr_to_celsius(212)
print(f'boiling point of water: {boiling_point_celsius} C')

boiling point of water: 100.0 C


## Exercise, write a `celsius_to_kelvin` function

Now that we've seen how to turn Fahrenheit into Celsius, write a function in the code cell below named `celsius_to_kelvin` which converts Celsius into Kelvin.

_hint: temperatures in kelvin are converted using_ `k = c + 273.15`

## Composing Functions

What about converting Fahrenheit to Kelvin? We could write out the formula, but we don't need to. Instead, we can [compose](http://swcarpentry.github.io/python-novice-inflammation/reference/#compose) the two functions we have already created:

In [None]:
def fahr_to_kelvin(temp_f):
    """ a function which converts fahrenheit to kelvin.

        Parameters
        ----------
        temp_f : integer or float, temperature value in fahrenheit

    """
    temp_c = fahr_to_celsius(temp_f)
    temp_k = celsius_to_kelvin(temp_c)
    return temp_k

print(f'boiling point of water in Kelvin: {fahr_to_kelvin(212.0)}')
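
The cell above only runs once `celsius_to_kelvin` has been defined. If you would like to check your work on the exercise, the following is one possible, self-contained version. It is a sketch rather than the only correct answer, and it repeats the unrounded `fahr_to_celsius` from earlier so that it can run on its own.

```python
def fahr_to_celsius(temp):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (temp - 32) * (5 / 9)


def celsius_to_kelvin(temp_c):
    """Convert a temperature from Celsius to Kelvin (one possible exercise solution)."""
    return temp_c + 273.15


def fahr_to_kelvin(temp_f):
    """Convert Fahrenheit to Kelvin by composing the two functions above."""
    return celsius_to_kelvin(fahr_to_celsius(temp_f))


print(f'freezing point of water in Kelvin: {fahr_to_kelvin(32)}')
```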

<a id='usingloops'></a>
## Loops for repeated tasks

One major benefit of programming is the automation of repetitive tasks. Let's apply what we have already learned to automate the conversion of many temperature measurements from Fahrenheit to Celsius. Using a list of temperatures, we'll use a [for loop](http://swcarpentry.github.io/python-novice-inflammation/reference/#for-loop) to repeat an operation --- in this case, converting Fahrenheit to Celsius.

The general form of a loop is:

```python
for item in collection:
    # do things using variable, such as print
```

where each `item` in the `collection` is looped through one after another.

We can call the [loop variable](http://swcarpentry.github.io/python-novice-inflammation/reference/#loop-variable) anything we like, but there must be a colon at the end of the line starting the loop, and we must indent anything we want to run inside the loop. Unlike many other languages, there is no command to signify the end of the loop body (e.g. `end for`); what is indented after the `for` statement belongs to the loop.
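
As a minimal sketch of these rules, the loop below walks through a short list. The list contents and the names `gases`, `gas`, and `lengths` are invented for this illustration.

```python
gases = ['oxygen', 'nitrogen', 'argon']
lengths = []

# 'gas' is the loop variable; any valid name would work here
for gas in gases:
    # both indented lines run once for every item in the list
    print(gas)
    lengths.append(len(gas))

# this line is not indented, so it runs once, after the loop has finished
print(f'done, name lengths: {lengths}')
```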

In [1]:
# create a list of example temperatures
temps_in_fahr = [32, 41, 35, 34, 39, 44, 41, 39, 35]

for temp in temps_in_fahr:
    fahr_to_celsius(temp)

After running the loop above, nothing happens. Why? (If you see a `NameError` saying that `fahr_to_celsius` is not defined, run the cell that defines the function first, then run the loop again.)

The function `fahr_to_celsius` was run on every item in the list `temps_in_fahr`, yet we neglected to do anything with the returned results: we did not save the output anywhere. Let's try that loop again, but this time save the results into a new list called `temps_in_celsius`.

In [35]:
# create a list of example temperatures
temps_in_fahr = [33, 42, 35, 34, 39, 44, 41, 39, 35]

# create an empty list to store the converted temperatures
temps_in_celsius = []

for temp in temps_in_fahr:
    temps_in_celsius.append(fahr_to_celsius(temp))

# after completing the loop, print the list 'temps_in_celsius'
print(temps_in_celsius)

[0.5555555555555556, 5.555555555555555, 1.6666666666666667, 1.1111111111111112, 3.8888888888888893, 6.666666666666667, 5.0, 3.8888888888888893, 1.6666666666666667]


<a id='cleaning'></a>
## Cleaning up the results

The results of the above loop are now being saved to `temps_in_celsius`, but the `5/9` division included in the [fahr_to_celsius function](#functions) is producing too many decimal places. 

There are a few ways to solve this problem. For example, we could add Python's built-in [round() function](https://docs.python.org/3/library/functions.html#round) inside of the loop, or inside of the [fahr_to_celsius function](#functions).

Using what we have learned about Jupyter Notebook cells, let's go back up to the [fahr_to_celsius function](#functions) and add in the round() function, making it look something like this:

```python
def fahr_to_celsius(temp):
    """ a function which converts fahrenheit to celsius.

        Parameters
        ----------
        temp : integer or float, temperature value in fahrenheit

    """
    temp_c = (temp - 32) * (5/9)   
    return round(temp_c, 2)
```

After making your changes, re-run the modified cell (to store those changes) then run the above loop again. If everything is correct, the values in `temps_in_celsius` should have a more appropriate number of decimal places.
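
The other option mentioned above, rounding inside the loop, leaves the function itself untouched. The sketch below is self-contained and assumes the original, unrounded version of `fahr_to_celsius`.

```python
def fahr_to_celsius(temp):
    """Convert Fahrenheit to Celsius without rounding."""
    return (temp - 32) * (5 / 9)


temps_in_fahr = [33, 42, 35, 34, 39, 44, 41, 39, 35]
temps_in_celsius = []

for temp in temps_in_fahr:
    # round each converted value to 2 decimal places before storing it
    temps_in_celsius.append(round(fahr_to_celsius(temp), 2))

print(temps_in_celsius)
```

Rounding in the loop keeps full precision available to any other code that calls the function, while rounding inside the function guarantees every caller gets the same tidy values.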

<a id='externalLibs'></a>
## Using External Libraries

While a lot of powerful, general tools such as round() are built into Python, specialized tools built up from these basic units live in [libraries](http://swcarpentry.github.io/python-novice-inflammation/reference/#library) that can be called upon when needed. We will use one of these libraries, called [Pandas](http://pandas.pydata.org/pandas-docs/stable/), to read and manipulate tabular (spreadsheet style) data. There are many ways to find and install external Python libraries. The most common method is using `pip` to install packages from the popular "Python Package Index" or "PyPI."

To check if you already have pip installed, enter the following command in your terminal:

```
pip --version
```

If you cannot install packages using pip, you may need to install it. Detailed instructions are [available here](https://packaging.python.org/tutorials/installing-packages/#ensure-you-can-run-pip-from-the-command-line).

Once pip is available enter the following command into your terminal to install pandas:

```
pip install pandas
```

<a id='pandas'></a>
## Pandas for working with tabular data


To access Pandas' functions, we need to [import](http://swcarpentry.github.io/python-novice-inflammation/reference/#import) the library into our script. Imports usually look like this:

```python
import library_name
```
The library's functions are then available after the library name. For example:

```python
library_name.function_name()
```

Sometimes you'll see users import pandas and assign it a shorter variable name, like this:
```python
import pandas as pd
```

When this is done, the library would be accessed as:
```python
pd.some_function()
```
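
The same import patterns can be tried with `math`, a library that ships with Python, so nothing needs to be installed. This is only an illustration of the syntax; the alias `m` is an arbitrary choice.

```python
import math          # import under the library's full name
import math as m     # import the same library under a shorter name

print(math.sqrt(16))   # 4.0
print(m.sqrt(16))      # 4.0, the alias points at the same library
print(m is math)       # True
```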

In [36]:
# for simplicity we'll stick to the default name pandas.
import pandas

To practice using tabular data in Python, we'll be using data from Christie Aschwanden's ["You Can't Trust What You Read About Nutrition"](https://fivethirtyeight.com/features/you-cant-trust-what-you-read-about-nutrition/). Pandas can [read and write to many tabular formats](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), including csv files. While it is most common to read a csv which is already on your hard drive, it is also possible for pandas to read a csv directly from the internet. 

Since tabular data loaded into pandas is called a ["DataFrame"](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), users often use "df" as a variable name for a DataFrame loaded into pandas. For clarity, we will use the more descriptive variable name: "nutrition_data" for our dataframe.

In [37]:
# Set the data source URL to a variable
url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/nutrition-studies/raw_anonymized_data.csv'
# Read the csv available at that URL, as a DataFrame
nutrition_data = pandas.read_csv(url)

# Display a small portion of the data to spot check it.
# Two things to notice here: 
    # First, nutrition_data.sample(3) chooses 3 random rows
    # Second, we used "display" instead of "print." 
    # Display is only for jupyter notebooks and prints things with better formatting.
display(nutrition_data.sample(3))

# the "shape" of a dataframe is an attribute which describes the total rows, and columns present.
print(f'the data is shaped as: {nutrition_data.shape}')

| |ID|cancer|diabetes|heart_disease|belly|ever_smoked|currently_smoke|smoke_often|smoke_rarely|never_smoked|...|DT_FIBER_INSOL|DT_FIBER_SOL|DT_PROT_ANIMAL|DT_PROT_VEGETABLE|DT_NITROGEN|PHYTIC_ACID|OXALIC_ACID|COUMESTROL|BIOCHANIN_A|FORMONONETIN|
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|32|1076|No|No|No|Innie|No|No|No|No|Yes|...|13.48|4.9|54.0|30.38|13.59|606.28|331.64|0.0901|0.0834|0.0121|
|51|1044|Yes|No|No|Innie|No|No|No|No|Yes|...|15.49|6.1|72.33|26.23|15.91|659.28|212.16|0.0431|0.0844|0.0203|
|3|1166|No|No|No|Innie|No|No|No|No|Yes|...|26.34|10.85|28.71|44.59|12.15|1570.07|334.08|0.283|0.089|0.0126|


the data is shaped as: (54, 1093)


## Selecting Data in a Pandas DataFrame

Pandas has many ways to select subsets of data, such as by position, by condition, or by a combination of the two. Below is a series of examples selecting specific subsets of data from the dataframe.

In [38]:
# select all the data for a single column, by the name of that column
current_smoker = nutrition_data['currently_smoke']
# display a small sample of the 'currently_smoke' column's data.
display(current_smoker.sample(3))

# select all the data for a single row, by the row number.
# more details at: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
first_respondant = nutrition_data.iloc[0]
# display a sample of the first respondent's answers.
display(first_respondant.sample(3))

# select based on condition. In this case, those patients with "Outie" belly buttons.
outies = nutrition_data[nutrition_data['belly'] == 'Outie']
display(outies.sample(3))

# select based on multiple conditions, such as: 
    # consuming cabbage more than 5 times a week, and having a cat.
condition_1 = nutrition_data['CABBAGEFREQ'] > 5
condition_2 = nutrition_data['cat'] == 'Yes'

cabbage_cats = nutrition_data[condition_1 & condition_2]
display(cabbage_cats.sample(3))

53     No
8      No
23    Yes
Name: currently_smoke, dtype: object

GI                                                  34.74
CARROTSQUAN                                             2
GROUP_CAFE_LECHE_1_PCT_OR_2_PCT_MILK_TOTAL_GRAMS        0
Name: 0, dtype: object

| |ID|cancer|diabetes|heart_disease|belly|ever_smoked|currently_smoke|smoke_often|smoke_rarely|never_smoked|...|DT_FIBER_INSOL|DT_FIBER_SOL|DT_PROT_ANIMAL|DT_PROT_VEGETABLE|DT_NITROGEN|PHYTIC_ACID|OXALIC_ACID|COUMESTROL|BIOCHANIN_A|FORMONONETIN|
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|1|1053|No|Yes|Yes|Outie|Yes|Yes|No|Yes|No|...|9.11|3.37|59.41|18.25|12.51|434.98|112.66|0.0107|0.139|0.00743|
|22|1146|No|No|No|Outie|No|No|No|No|Yes|...|15.36|5.71|47.25|28.81|12.03|629.62|395.82|0.0687|0.113|0.0167|
|34|1177|No|Yes|No|Outie|No|No|No|No|Yes|...|10.72|4.66|28.62|18.4|7.43|558.89|155.55|0.275|0.0549|0.0158|


| |ID|cancer|diabetes|heart_disease|belly|ever_smoked|currently_smoke|smoke_often|smoke_rarely|never_smoked|...|DT_FIBER_INSOL|DT_FIBER_SOL|DT_PROT_ANIMAL|DT_PROT_VEGETABLE|DT_NITROGEN|PHYTIC_ACID|OXALIC_ACID|COUMESTROL|BIOCHANIN_A|FORMONONETIN|
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|15|1105|No|No|Yes|Innie|No|No|No|No|Yes|...|15.68|5.46|54.77|29.75|13.63|633.68|291.9|0.147|0.0746|0.00616|
|43|1019|Yes|No|No|Innie|No|No|No|No|Yes|...|11.19|3.13|53.0|14.89|10.92|225.8|338.43|0.0126|0.0868|0.00728|
|46|1002|No|No|No|Innie|No|No|No|No|Yes|...|30.75|11.35|68.53|45.88|18.41|1108.32|743.68|0.233|0.22|0.0428|
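
It can help to look at what a condition actually is before it is used to select rows. The sketch below uses a tiny, made-up DataFrame (not the nutrition data; the column names and values are invented) to show that each condition is a column of `True`/`False` values, and that `&` keeps only the rows where both are `True`.

```python
import pandas

# A tiny, made-up DataFrame used only to illustrate conditional selection
pets = pandas.DataFrame({
    'name': ['Ann', 'Ben', 'Cal', 'Dee'],
    'cat': ['Yes', 'No', 'Yes', 'Yes'],
    'cabbage_per_week': [7, 9, 2, 6],
})

# A condition produces one True/False value per row
has_cat = pets['cat'] == 'Yes'
eats_cabbage = pets['cabbage_per_week'] > 5
print(has_cat.tolist())

# & keeps only the rows where both conditions are True
both = pets[has_cat & eats_cabbage]
print(both['name'].tolist())
```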


## Hypothesis Testing

Now that we can load tabular data into Python using pandas and select subsets of that data, let's use what we have learned to test a hypothesis. In this example, we will test whether there is a significant difference in coffee consumption between those with and without a history of heart disease. To run this test, we will use the `stats` functions available in the external library `scipy`. 

This starts with importing those functions.

In [39]:
# we're importing the stats functions of the larger library called scipy
from scipy import stats

To run this statistical test, we will make variables for the population subsets we are interested in, by saving the results of conditional selections to descriptive variable names.

In [40]:
heart_disease_yes = nutrition_data[nutrition_data['heart_disease'] == 'Yes']
print(f'heart_disease_yes group drinks {heart_disease_yes["COFFEEDRINKSFREQ"].mean()} cups of coffee a week.')

heart_disease_no = nutrition_data[nutrition_data['heart_disease'] == 'No']
print(f'heart_disease_no group drinks {heart_disease_no["COFFEEDRINKSFREQ"].mean()} cups of coffee a week.')

heart_disease_yes group drinks 2.75 cups of coffee a week.
heart_disease_no group drinks 2.323529411764706 cups of coffee a week.


Now that we have defined our subsets, we need to verify that they meet the assumptions of a t-test. To do this, we will start with a [Levene](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html) test for homogeneity of variances.

In [41]:
levene_results = stats.levene(heart_disease_yes['COFFEEDRINKSFREQ'], heart_disease_no['COFFEEDRINKSFREQ'])
print(levene_results)

LeveneResult(statistic=0.5534421183980399, pvalue=0.46026317254784666)
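
As a rough sketch of how a Levene result is read, the example below uses two made-up samples whose spreads differ on purpose. The sample values, the seed, and the 0.05 threshold are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Made-up samples: same mean, but group_b is deliberately far more spread out.
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=3.0, scale=1.0, size=200)
group_b = rng.normal(loc=3.0, scale=5.0, size=200)

# The Levene result unpacks into (statistic, p-value).
statistic, pvalue = stats.levene(group_a, group_b)

alpha = 0.05  # conventional threshold, an assumption of this sketch
if pvalue < alpha:
    print(f'p = {pvalue:.3g}: the variances differ significantly')
else:
    print(f'p = {pvalue:.3g}: no significant difference in variances')
```

By the same rule, the p-value of about 0.46 printed above is well over 0.05, so the test finds no significant difference in variance between the two heart disease groups.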


The test is not significant (p > 0.05), meaning we have no evidence that the variances of the subsets differ, so we can proceed. Next we must check whether both subsets are normally distributed. Using what we've [learned about loops](#usingloops), we will run a [normal distribution test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html) on each subset. 

In [42]:
for subset in [heart_disease_yes['COFFEEDRINKSFREQ'], heart_disease_no['COFFEEDRINKSFREQ']]:
    # check the shape and the normality of both groups using a loop.
    print(subset.shape)
    normalResults = stats.normaltest(subset)
    print(normalResults)

(20,)
NormaltestResult(statistic=9.049196953467886, pvalue=0.010839065699871767)
(34,)
NormaltestResult(statistic=17.95538580507401, pvalue=0.00012619365297101605)
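
The null hypothesis of `stats.normaltest` is that the sample comes from a normal distribution, so a small p-value is evidence against normality. The sketch below applies that rule to the two p-values printed above and to a deliberately skewed, made-up sample. The helper name `is_consistent_with_normal` and the 0.05 threshold are assumptions introduced for illustration.

```python
import numpy as np
from scipy import stats

def is_consistent_with_normal(pvalue, alpha=0.05):
    """Return True when a normality test does NOT reject normality."""
    return pvalue >= alpha

# p-values copied from the normaltest output above
reported = {
    'heart_disease_yes': 0.010839065699871767,
    'heart_disease_no': 0.00012619365297101605,
}
for name, pvalue in reported.items():
    print(name, is_consistent_with_normal(pvalue))

# A made-up, strongly right-skewed sample for comparison.
rng = np.random.default_rng(seed=0)
skewed = rng.exponential(scale=2.0, size=500)
skewed_p = stats.normaltest(skewed).pvalue
print('skewed sample', is_consistent_with_normal(skewed_p))
```

All three checks print `False`: by this rule, normality is rejected for both coffee consumption subsets as well as for the skewed sample.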


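Both p-values are below 0.05, so the test rejects normality for both subsets. Strictly speaking, the normality assumption of the t-test is not satisfied. For the purposes of this example we will still run the independent t-test, but its result should be interpreted with caution.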

In [43]:
# stats.ttest_ind returns a tuple of the calculated t-statistic and the p-value as (t, p)
# assign both variables in one statement:

t, p = stats.ttest_ind(heart_disease_yes['COFFEEDRINKSFREQ'], heart_disease_no['COFFEEDRINKSFREQ'])
print(f'The t-statistic was {t}, with a p-value of {p}')

The t-statistic was 0.6969408404074194, with a p-value of 0.4889439550183742
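
The p-value of about 0.49 is well above the conventional 0.05 threshold, so this test gives no evidence of a difference in coffee consumption between the two groups. Because the normality test above returned p-values below 0.05, a rank-based test such as Mann-Whitney U, which does not assume normality, can serve as a cross-check. The sketch below runs both tests on made-up count data. The group values, the seed, and the threshold are assumptions, not results from the nutrition dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
# Made-up "cups per week" counts for two hypothetical groups.
group_yes = rng.poisson(lam=2.5, size=20)
group_no = rng.poisson(lam=2.5, size=34)

# Independent t-test (assumes roughly normal data).
t, p_ttest = stats.ttest_ind(group_yes, group_no)

# Mann-Whitney U test (rank-based, no normality assumption).
u, p_rank = stats.mannwhitneyu(group_yes, group_no, alternative='two-sided')

alpha = 0.05  # conventional threshold, an assumption of this sketch
for label, p in [('t-test', p_ttest), ('Mann-Whitney U', p_rank)]:
    verdict = 'significant' if p < alpha else 'not significant'
    print(f'{label}: p = {p:.3f} ({verdict})')
```

To try this on the real data, the two `COFFEEDRINKSFREQ` columns used in the cell above could be passed in place of the made-up groups.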


# Real world example: morel hunting date
<a id='morelHunting'></a>

For the next section, we will answer a real-life question using an external library. [iDigBio](https://www.idigbio.org/portal) aggregates biodiversity data from natural history collections. In this example, we'll use their Python library to help determine the best time of year to go hunting for a popular gourmet mushroom, the morel. Morels are typically wild harvested and notoriously ephemeral: they are only around for a brief period, which makes timing very important.

<img src="files/assets/morel.jpg">

To start, we'll install the new library 'idigbio' using the pip command we used when learning about [external libraries](#externalLibs):

```
pip install idigbio
```

Once we have the library, we'll modify an example from [iDigBio's GitHub](https://github.com/iDigBio/idigbio-python-client#basic-usage) to determine the best date for hunting morels in our region.

<a id='query'></a>

In [45]:
# Example usage available at: https://github.com/iDigBio/idigbio-python-client
import idigbio

# make a variable for the idigbio api's pandas option
api = idigbio.pandas()

# set a variable for the genus Morchella (true morels)
genusOfInterest = 'Morchella'

# set a variable for the states we're interested in
nearbyStates = ['Tennessee','Georgia','North Carolina','Alabama']

# define a dictionary with the query's "key word arguments"
query = {'genus':genusOfInterest, "stateprovince":nearbyStates}

# call iDigbio's api, using the query we built. The result is a dataframe.
pandas_output = api.search_records(rq=query, limit = 500)

# the "shape" of a dataframe is an attribute which returns a tuple containing the row and column count.
# notice, since we know there are exactly 2 items in the tuple we can assign 2 variables to the attribute.
rowQty, columnQty = pandas_output.shape

# print how large the results are
print(f'{rowQty} rows, and {columnQty} columns returned.')

# display a small portion of the results, to spot check
# Two things to notice here: 
    # First, pandas_output.sample(2) chooses 2 random rows
    # Second, we used "display" instead of "print." 
    # Display is only for jupyter notebooks and prints things with better formatting.
display(pandas_output.sample(2))

119 rows, and 46 columns returned.


uuid,basisofrecord,canonicalname,catalognumber,class,collectioncode,collectionid,collector,continent,coordinateuncertainty,country,...,recordnumber,recordset,scientificname,specificepithet,startdayofyear,stateprovince,taxonid,taxonomicstatus,taxonrank,verbatimeventdate
6a440fbc-9d30-434f-b18f-48a4c9abf936,preservedspecimen,morchella conica,tenn-f-004141,pezizomycetes,tenn-f,97e2d271-3744-48a3-92b5-5a86afbfb01d,l.r. hesler,north america,,united states,...,,04d9b721-259c-4d6b-b48f-2e23edf66c9f,morchella conica,conica,77.0,tennessee,9014337,accepted,species,
29f4be1a-ffbf-4bf7-b606-dad9203884b2,preservedspecimen,morchella angusticeps,rms0000938,pezizomycetes,,53b69cf0-a097-4840-91ed-a1eb33f8ded2,l. r. hesler,north america,,united states,...,,9a06cc34-be24-4ebf-b599-cbb1d4b8ac7b,morchella angusticeps,angusticeps,77.0,tennessee,2594624,accepted,species,


<a id='dropna'></a>
The pandas_output DataFrame has many columns. Two of them are useful here:

- ['eventdate'](https://terms.tdwg.org/wiki/dwc:eventDate) column stores the date the specimen was collected

- ['startdayofyear'](https://terms.tdwg.org/wiki/dwc:startDayOfYear) column is the day of the year (e.g., 1 is January 1st).

We can use this data to determine the average day of the year morels are found in this region.


In [46]:
# start by dropping all records which have no data in 'eventdate'
# notice we save the result of dropna back to pandas_output.
# this means we overwrite pandas_output with the rows that remain after dropping the null values
pandas_output = pandas_output.dropna(subset=['eventdate'])

# before we move on we should check how many records are left
# remember the shape attribute is a tuple of (rows, columns)
print(pandas_output.shape)

(21, 46)
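
The effect of `dropna(subset=...)` is easier to see on a few rows. The table below is a made-up stand-in for `pandas_output`: only the two column names come from the query results, and the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical records: some lack an event date, some lack a day of the year.
records = pd.DataFrame({
    'eventdate': ['1934-03-18', None, '1951-04-02', None],
    'startdayofyear': [77.0, 95.0, np.nan, np.nan],
})

# isna().sum() counts the missing values in each column.
missing_counts = records.isna().sum()
print(missing_counts)

# Keep only the rows that have a value in 'eventdate'.
with_date = records.dropna(subset=['eventdate'])
print(with_date.shape)

# dropna returns a new DataFrame; the original is unchanged unless reassigned.
print(records.shape)

# Listing several columns requires a value in every one of them.
with_both = records.dropna(subset=['eventdate', 'startdayofyear'])
print(with_both.shape)
```

In this made-up table, a row can have an event date but no day of the year. Listing both columns in `subset` is one way to guard against that before averaging.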


In [47]:
# Calculate the mean of the 'startdayofyear' column. 
# notice we included the parameter "skipna=True", which ignores missing values.
# remember Shift+Tab while the cursor is inside a function call displays that function's options.
avgDayOfYear = pandas_output['startdayofyear'].mean(skipna=True)
print(f'The average day of the year for {genusOfInterest} in {nearbyStates} is: {avgDayOfYear}.')

# the avgDayOfYear is useful, but how do we make this information more usable?
# Let's convert it to a date by adding the avgDayOfYear to January 1st of this year.
# First we'll import the "datetime" library which comes with python.
import datetime

# Using the datetime library's "now()" function, save the current date to a variable
currentDate = datetime.datetime.now()
# display the results of the current date
print(f'The current date & time is: {currentDate}.')
# The currentDate produced has a ".year" attribute
thisYear = currentDate.year
print(f'The current year is {thisYear}')

# save a variable for a date object representing January 1st of this year.
startOfYear = datetime.date(thisYear,1,1)

# add the avgDayOfYear to get this year's best date
# a timedelta represents a span of time (here, a number of days); adding it to a date returns a new date.
bestDate = startOfYear + datetime.timedelta(avgDayOfYear)

# print the results
print(f'The average day for collecting morels is {bestDate}.')

The average day of the year for Morchella in ['Tennessee', 'Georgia', 'North Carolina', 'Alabama'] is: 100.875.
The current date & time is: 2019-06-01 15:31:02.194480.
The current year is 2019
The average day for collecting morels is 2019-04-11.
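
Since day 1 of the year is January 1st itself, a day-of-year value is one more than the number of days to add to January 1st. Also, when a `timedelta` is added to a `date`, only its whole days are used. The sketch below handles both points by rounding to the nearest whole day and subtracting one. The helper name `day_of_year_to_date` is hypothetical, and rounding is only one reasonable choice.

```python
import datetime

def day_of_year_to_date(day_of_year, year):
    """Convert a 1-based day of the year (1 = January 1st) to a date."""
    # Round a fractional average to the nearest whole day.
    whole_day = int(round(day_of_year))
    start_of_year = datetime.date(year, 1, 1)
    # Day 1 is January 1st, so add one day fewer than the day number.
    return start_of_year + datetime.timedelta(days=whole_day - 1)

print(day_of_year_to_date(1, 2019))        # 2019-01-01
print(day_of_year_to_date(77, 2019))       # 2019-03-18
print(day_of_year_to_date(100.875, 2019))  # 2019-04-11
```

For the average of 100.875 this gives the same date as the cell above, but the two approaches can differ by one day for other values.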


# Exercise, improve upon this work

Recall that few records remained after [we dropped those without an event date](#dropna). Scripts are often written using an example as a starting point. In the cell(s) below, modify the morel hunting example by changing the states checked to ones found in the Midwest. To do this, start by referring to [the initial query we built](#query).

<img src="files/assets/middle3-1.png">

Additional Resources:

[datetime documentation](https://docs.python.org/3/library/datetime.html)

[Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/)

[list of public APIs](https://github.com/toddmotto/public-apis)

[iDigBio's Python API (examples and documentation)](https://github.com/iDigBio/idigbio-python-client)

Todo:

Include a description of the "in RAM" effects of a Jupyter notebook.

