# Python coding bootcamp, day 1

    Caleb Powell
    calebadampowell@gmail.com
    https://github.com/CapPow
    

    Dakila Ledesma
    bgq527@mocs.utc.edu
    https://github.com/bgq527
    
Materials heavily modified from: http://swcarpentry.github.io/python-novice-inflammation/  
  
### Objectives
_After today, you will be able to:_

* Set up a Python programming environment
    * Python 3
    * Jupyter Notebook
* Describe basic data types
    * String
    * Integer
    * Float
* Understand and use variables
* Build and use functions
* Use loops for repeated tasks
* Use Pandas for working with tabular data
    * Load csv and xlsx files
    * Select specific data (conditional, exact, and 'contains')
* Use Python to gather & analyze data
    * iDigBio API
    * datetime library

<a id='variables'></a>
## Variables
Any Python interpreter can be used as a calculator:

In [None]:
3 + 5 * 4


This is great but not very interesting. To do anything useful with data, we need to [assign](https://swcarpentry.github.io/python-novice-inflammation/reference/#assign) its value to a [variable](https://swcarpentry.github.io/python-novice-inflammation/reference/#variable). In Python, we can assign a value to a variable using the equals sign `=`. For example, to assign the value `60` to a variable `weight_kg`, we would execute:

In [None]:
weight_kg = 60


From now on, whenever we use `weight_kg`, Python will substitute the value we assigned to
it. In essence, **a variable is just a name for a value**.

In Python, variable names:

 - can include letters, digits, and underscores
 - cannot start with a digit
 - are [case sensitive](https://swcarpentry.github.io/python-novice-inflammation/reference/#case-sensitive).

This means that, for example:
 - `weight0` is a valid variable name, whereas `0weight` is not
 - `weight` and `Weight` are different variables
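
These rules can be checked directly in Python. The short sketch below uses the built-in string method `isidentifier()` to test the two example names, then shows that names differing only in case are separate variables. The values `60` and `75` are arbitrary examples.

```python
# str.isidentifier() reports whether a string could be used as a variable name
print('weight0'.isidentifier())   # a letter first, then a digit
print('0weight'.isidentifier())   # starts with a digit

# names are case sensitive, so these are two different variables
weight = 60
Weight = 75
print(f'weight is {weight} and Weight is {Weight}')
```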


<a id='usingvariables'></a>
## Using Variables in Python
To display the value of a variable to the screen in Python, we can use the `print` function:


In [None]:
print(weight_kg)

We can display multiple things at once using only one `print` command:

In [None]:
print('weight in kilograms: ', weight_kg)

Moreover, we can do arithmetic and use variables right inside the `print` function using an ["f" string](https://docs.python.org/3/reference/lexical_analysis.html#f-strings), where we can include python code inside of a string by placing the code inside of `{}`.

In [None]:
print(f'{weight_kg} kg in pounds is: {2.2 * weight_kg}lbs.')

The above command, however, did not change the value of `weight_kg`:

In [None]:
print(weight_kg)

To change the value of the `weight_kg` variable, we have to
**assign** `weight_kg` a new value using the equals `=` sign:

In [None]:
weight_kg = 65.0
print(f'weight in kilograms is now: {weight_kg}')

## Variables as Sticky Notes

A variable is analogous to a sticky note with a name written on it: assigning a value to a variable is like putting that sticky note on a particular value.
<img src="files/assets/python-sticky-note-variables-01.svg">

This means that assigning a value to one variable does **not** change values of other variables. For example, let's store the subject's weight in pounds in its own variable:

In [None]:
# There are 2.2 pounds per kilogram
weight_lb = 2.2 * weight_kg
print(f'weight in kilograms: {weight_kg} and in pounds: {weight_lb}')


<img src="files/assets/python-sticky-note-variables-02.svg">

Let's now change `weight_kg`:


In [None]:
weight_kg = 100.0
print(f'weight in kilograms is now: {weight_kg} and weight in pounds is still: {weight_lb}')

<img src="files/assets/python-sticky-note-variables-03.svg">

Since `weight_lb` doesn't "remember" where its value comes from, it is not updated when we change `weight_kg`.

<a id='typesofdata'></a>
## Types Of Data
Python knows various types of data. Three common ones are:

* integer numbers
* floating point numbers, and
* strings.

In the first example above, the variable `weight_kg` was given an integer value of `60`.
To create a variable with a floating point value, we can execute:


In [None]:
weight_kg = 60.0

And to create a string we simply have to add single or double quotes around some text, for example:

In [None]:
weight_kg_text = 'weight in kilograms:'

In [None]:
integer_var = 3
float_var = 3.1
complex_var = 3+2j

print(F'Type of integer_var: {type(integer_var)}')
print(F'Type of float_var: {type(float_var)}')
print(F'Type of complex_var: {type(complex_var)}')

#### String

In [None]:
first_name = "Bob"
last_name = "Builder"

whole_name = first_name + " " + last_name
print(F'My whole name is {whole_name}')

#### Lists

Lists are a way to store an ordered collection of values. Lists in Python are quite flexible; they can contain numbers, strings, and even functions, and they can be changed after they are created.

In [None]:
numbers_list = [8, 6, 7, 14, -3, -12]

numbers_list.append(0)
numbers_list.sort()

print(numbers_list)

In [None]:
numbers_list.reverse()
print(numbers_list)

In [None]:
numbers_list.remove(7)
print(numbers_list)

In [None]:
del numbers_list[1]
print(numbers_list)

In [None]:
my_number = numbers_list.pop(3)
print(F'My number = {my_number}')
print(F'Numbers list = {numbers_list}')

In [None]:
concat_numbers_list = numbers_list + numbers_list
print(concat_numbers_list)

In [None]:
numbers_list.insert(1, 29)
print(numbers_list)

In [None]:
numbers_list[0] = 120
print(numbers_list)

In [None]:
print(len(numbers_list))

#### Tuples

Tuples are like lists; however, they are *immutable*, meaning that you cannot change the values within the tuple after it is created (such as removing values, replacing values, etc.).

In [None]:
numbers_tuple = (14, 12, 94)
print(numbers_tuple)
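
To see what immutable means in practice, here is a minimal sketch that tries to replace a value in the tuple. The `try` / `except` statement is not covered in this lesson; it is used here only to catch the error so the cell keeps running.

```python
numbers_tuple = (14, 12, 94)

# reading a value by index works just like it does for a list
print(numbers_tuple[0])

# trying to change a value raises a TypeError, because tuples are immutable
try:
    numbers_tuple[0] = 99
except TypeError as error:
    print(f'Could not change the tuple: {error}')

# the tuple still holds its original values
print(numbers_tuple)
```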

#### Dictionaries

Dictionaries are like lists, but instead of looking up values by index (e.g. ```numbers_list[0]```, where ```0``` is the index), you look them up by key.

In [None]:
information = {'name':'Bob Builder', 'age':55, 'gender':'male'}
print(F"Bob's name is {information['name']}")
print(F"Bob's gender is {information['gender']}")
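
Keys can also be used to add or replace values. The sketch below is one way this might look; the `'occupation'` key and the new age are made up for illustration, and the `for` loop it uses is explained later in this lesson.

```python
information = {'name': 'Bob Builder', 'age': 55, 'gender': 'male'}

# assigning to a key that does not exist yet adds it to the dictionary
information['occupation'] = 'builder'

# assigning to an existing key replaces its value
information['age'] = 56

# loop over every key and its value
for key, value in information.items():
    print(f'{key}: {value}')
```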

### Operators

#### Arithmetic
|Type|Python|
|-----|-----|
|Addition|+|
|Subtraction|-|
|Multiplication|*|
|Division|/|
|Floor Division|//|
|Exponentiation (power)|**|
|Modulo|%|
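
As a quick illustration of the table above, the following sketch applies each arithmetic operator to the same two numbers. The values `7` and `2` are arbitrary.

```python
a = 7
b = 2

print(f'addition:       {a + b}')
print(f'subtraction:    {a - b}')
print(f'multiplication: {a * b}')
print(f'division:       {a / b}')    # always produces a float
print(f'floor division: {a // b}')   # division, rounded down to a whole number
print(f'exponentiation: {a ** b}')   # a raised to the power of b
print(f'modulo:         {a % b}')    # the remainder left after floor division
```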

#### Logic

|Type|Python|
|-----|-----|
|And|and|
|Or|or|
|Not|not|
|More than|>|
|Less than|<|
|Equal to|==|
|Not equal to|!=|
|More than or equal to|>=|
|Less than or equal to|<=|
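
Comparison and logic operators produce the values `True` or `False`. The sketch below shows a few of them together; the temperature and the thresholds are arbitrary example values.

```python
temperature = 41  # an arbitrary example value, in fahrenheit

# comparisons produce True or False
is_above_freezing = temperature > 32
is_exactly_freezing = temperature == 32

# 'and' is True only when both sides are True
is_mild = temperature >= 40 and temperature <= 60

# 'not' flips True to False, and False to True
is_not_mild = not is_mild

print(f'above freezing: {is_above_freezing}')
print(f'exactly freezing: {is_exactly_freezing}')
print(f'mild: {is_mild}, not mild: {is_not_mild}')
```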

#### Assignment
|Type|Python|
|-----|-----|
|Assign|=|
|Add to|+=|
|Subtract from|-=|
|Multiply by|*=|
|Divide by|/=|
|Floor divide by|//=|
|Modulo by|%=|
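
Each of these operators updates a variable using its current value. A minimal sketch, with an arbitrary starting value:

```python
total = 10

total += 5    # same as: total = total + 5
print(total)

total -= 3    # same as: total = total - 3
print(total)

total *= 2    # same as: total = total * 2
print(total)

total //= 5   # same as: total = total // 5
print(total)

total %= 3    # same as: total = total % 3
print(total)

total /= 2    # same as: total = total / 2
print(total)
```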


## Adding Comments to Your Code 

In Python, the `#` symbol tells the interpreter that the remainder of that line should be ignored. This is often used to insert comments into code. It is good practice to comment your code frequently; this helps when you, or someone else, look back on it. 

In [None]:
# here is one example of an in code comment.
x = 5 * 5 # here is another example of an in code comment
# notice how the variable assigned is unaffected by the inline comments 
print(f'The value of x is still: {x}')

# sometimes, comments are used to remove unnecessary lines of code which might be useful later.
# y = x * 5
# print(y)

<a id='creatingfunctions'></a>
## Creating & Using Functions

When learning [how to work with variables](#usingvariables), we converted kilograms to pounds. What if we want to use that code again, on a different dataset or at a different point in our program? Cutting and pasting it is going to make our code get very long and very repetitive, very quickly.

We'd like a way to package our code so that it is easier to reuse, and Python provides for this by letting us define things called 'functions', a shorthand way of re-executing longer pieces of code.

Let's start by defining a function `fahr_to_celsius` that converts temperatures from Fahrenheit to Celsius:

In [None]:
def fahr_to_celsius(temp):
    """ a function which converts fahrenheit to celsius.

        Parameters
        ----------
        temp : integer or float, temperature value in fahrenheit

    """
    return ((temp - 32) * (5/9))

<img src="files/assets/python-function.svg">



The function definition opens with the keyword `def` followed by the name of the function (`fahr_to_celsius`) and a parenthesized list of parameter names (`temp`). Inside a set of ```"""``` is a description of the function and its [parameters](http://swcarpentry.github.io/python-novice-inflammation/reference/#parameters), called a [docstring](http://swcarpentry.github.io/python-novice-inflammation/reference/#docstring). This information can be accessed later in Jupyter Notebook using the ```Shift + Tab``` key combination.

The [body](http://swcarpentry.github.io/python-novice-inflammation/reference/#body) of the function --- the statements that are executed when it runs --- is indented below the definition line. The body concludes with a `return` keyword followed by the return value.

When we call the function, the values we pass to it are assigned to its parameters so that we can use them inside the function. Inside the function, we use a [return statement](http://swcarpentry.github.io/python-novice-inflammation/reference/#return-statement) to send a result back to whoever asked for it.

Let's try running our function.


In [None]:
freezing_point_celsius = fahr_to_celsius(32)
print(f'freezing point of water: {freezing_point_celsius} C')


This command should call our function, using "32" as the input and save the results to the variable called "freezing_point_celsius."

In fact, calling our own function is no different from calling any other function:

In [None]:
boiling_point_celsius = fahr_to_celsius(212)
print(f'boiling point of water: {boiling_point_celsius} C')
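
The same pattern can package the kilogram-to-pound conversion from earlier in the lesson. This is only a sketch: the function name `kg_to_lb` is an illustrative choice, and it reuses the approximate factor of 2.2 pounds per kilogram from the variables section.

```python
def kg_to_lb(weight_kg):
    """ a function which converts kilograms to pounds.

        Parameters
        ----------
        weight_kg : integer or float, weight value in kilograms

    """
    # there are approximately 2.2 pounds per kilogram
    return 2.2 * weight_kg


weight_lb = kg_to_lb(60)
print(f'60 kg in pounds is: {weight_lb}')
```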

## Exercise, write a `celsius_to_kelvin` function

Now that we've seen how to turn Fahrenheit into Celsius, write a function in the code cell below named `celsius_to_kelvin` which converts Celsius into Kelvin.

_hint: temperatures in kelvin are converted using_ `k = c + 273.15`

## Composing Functions

What about converting Fahrenheit to Kelvin? We could write out the formula, but we don't need to. Instead, we can [compose](http://swcarpentry.github.io/python-novice-inflammation/reference/#compose) the two functions we have already created, `fahr_to_celsius` and the `celsius_to_kelvin` function from the exercise above:

In [None]:
def fahr_to_kelvin(temp_f):
    """ a function which converts fahrenheit to kelvin.

        Parameters
        ----------
        temp_f : integer or float, temperature value in fahrenheit

    """
    temp_c = fahr_to_celsius(temp_f)
    temp_k = celsius_to_kelvin(temp_c)
    return temp_k


print(f'boiling point of water in Kelvin: {fahr_to_kelvin(212.0)}')

<a id='usingloops'></a>
## Use loops for repeated tasks

One major benefit of programming is the automation of repetitive tasks. Let's apply what we have already learned to automate the conversion of many temperature measurements from Fahrenheit to Celsius. Using a list of temperatures, we'll use a [for loop](http://swcarpentry.github.io/python-novice-inflammation/reference/#for-loop) to repeat an operation --- in this case, converting Fahrenheit to Celsius.

The general form of a loop is:

```python
for item in collection:
    # do things using item, such as print(item)
```

where each `item` in the `collection` is looped through one after another.

We can call the [loop variable](http://swcarpentry.github.io/python-novice-inflammation/reference/#loop-variable) anything we like, but there must be a colon at the end of the line starting the loop, and we must indent anything we want to run inside the loop. Unlike many other languages, there is no command to signify the end of the loop body (e.g. `end for`); what is indented after the `for` statement belongs to the loop.

In [None]:
# create a list of example temperatures
temps_in_fahr = [32, 41, 35, 34, 39, 44, 41, 39, 35]

for temp in temps_in_fahr:
    fahr_to_celsius(temp)

After running the loop above, nothing is displayed. Why?

The function `fahr_to_celsius` was run on every item in the list `temps_in_fahr`, yet we neglected to do anything with the results it returned. We did not save the output anywhere. Let's try that loop again, but this time save the results into a new list called `temps_in_celsius`.

In [None]:
# create a list of example temperatures
temps_in_fahr = [33, 42, 35, 34, 39, 44, 41, 39, 35]

# create an empty list to store the converted temperatures
temps_in_celsius = []

for temp in temps_in_fahr:
    temps_in_celsius.append(fahr_to_celsius(temp))

# after completing the loop, print the list 'temps_in_celsius'
print(temps_in_celsius)

## Cleaning up the results

The results of the above loop are now being saved to `temps_in_celsius`, but the `5/9` division included in the [fahr_to_celsius function](#creatingfunctions) is producing too many decimal places. 

There are a few ways to solve this problem. For example, we could add Python's built-in [round() function](https://docs.python.org/3/library/functions.html#round) inside of the loop, or inside of the [fahr_to_celsius function](#creatingfunctions).

Using what we have learned about Jupyter Notebook cells, let's go back up to the [fahr_to_celsius function](#creatingfunctions) and add in the round() function, making it look something like this:

```python
def fahr_to_celsius(temp):
    """ a function which converts fahrenheit to celsius.

        Parameters
        ----------
        temp : integer or float, temperature value in fahrenheit

    """
    temp_c = (temp - 32) * (5/9)   
    return round(temp_c, 2)
```

After making your changes, re-run the modified cell (to store those changes) then run the above loop again. If everything is correct, the values in `temps_in_celsius` should have a more appropriate number of decimal places.
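
The other option mentioned above, rounding inside of the loop, could look something like the sketch below. To avoid replacing the function defined earlier, it uses a separate unrounded helper whose name, `fahr_to_celsius_unrounded`, is made up for this illustration.

```python
def fahr_to_celsius_unrounded(temp):
    """ a function which converts fahrenheit to celsius, without rounding. """
    return (temp - 32) * (5/9)


temps_in_fahr = [32, 41, 35, 34, 39, 44, 41, 39, 35]
temps_in_celsius = []

for temp in temps_in_fahr:
    temp_c = fahr_to_celsius_unrounded(temp)
    # round each result to 2 decimal places before saving it
    temps_in_celsius.append(round(temp_c, 2))

print(temps_in_celsius)
```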

## Using External Libraries

While a lot of powerful, general tools such as round() are built into Python, specialized tools built up from these basic units live in [libraries](http://swcarpentry.github.io/python-novice-inflammation/reference/#library) that can be called upon when needed. We will use one of these libraries, called [Pandas](http://pandas.pydata.org/pandas-docs/stable/), to read and manipulate tabular (spreadsheet style) data. There are many ways to find and install external Python libraries. The most common method is using `pip` to install packages from the popular "Python Package Index" or "PyPI."

To check if you already have pip installed, enter the following command in your terminal:

```
pip --version
```

If the `pip` command is not found, you may need to install it. Detailed instructions are [available here](https://packaging.python.org/tutorials/installing-packages/#ensure-you-can-run-pip-from-the-command-line).

Once pip is available enter the following command into your terminal to install pandas:

```
pip install pandas
```
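
One way to confirm from inside Python that a library can be found is the standard library's `importlib.util.find_spec`. This is only a sketch, not part of the original lesson; the helper name `is_installed` is made up for illustration.

```python
import importlib.util


def is_installed(library_name):
    """ return True if Python can find a library with this name. """
    return importlib.util.find_spec(library_name) is not None


print(f'pandas installed: {is_installed("pandas")}')
```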


## Loading Data Into Python

To access Pandas' functions, we need to [import](http://swcarpentry.github.io/python-novice-inflammation/reference/#import) the library into our script. Imports usually look like this:

```python
import library_name
```
The library's functions are then available after the library name. For example:

```python
library_name.function_name()
```

Sometimes you'll see users import pandas and assign it a shorter variable name, like this:
```python
import pandas as pd
```

When this is done, the library would be accessed as:
```python
pd.some_function()
```

In [None]:
# for simplicity we'll stick to the default name pandas.
import pandas

To practice using tabular data in Python, we'll be using data from Christie Aschwanden's ["You Can't Trust What You Read About Nutrition"](https://fivethirtyeight.com/features/you-cant-trust-what-you-read-about-nutrition/). Pandas can [read and write to many tabular formats](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), including csv files. While it is most common to read a csv which is already on your hard drive, it is also possible for pandas to read a csv directly from the internet. 

Since tabular data loaded into pandas is called a ["DataFrame"](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), users often use "df" as a variable name for a DataFrame loaded into pandas. For clarity, we will use the more descriptive variable name: "nutrition_data" for our dataframe.

In [None]:
# Set the data source URL to a variable
url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/nutrition-studies/raw_anonymized_data.csv'
# Read the csv available at that URL, as a DataFrame
nutrition_data = pandas.read_csv(url)

# Display a small portion of the data to spot check it.
# Two things to notice here: 
    # First, nutrition_data.sample(3) chooses 3 random rows
    # Second, we used "display" instead of "print." 
    # Display is only for jupyter notebooks and prints things with better formatting.
display(nutrition_data.sample(3))

# the "shape" of a dataframe is an attribute which describes the total rows, and columns present.
print(f'the data is shaped as: {nutrition_data.shape}')

## Selecting Data in a Pandas DataFrame

Pandas has many ways to select subsets of data, such as by position, by condition, or by a combination. Below are a series of examples selecting specific subsets of data from the dataframe.

In [None]:
# select all the data for a single column, by the name of that column
current_smoker = nutrition_data['currently_smoke']
# display a small sample of the 'currently_smoke' column's data.
display(current_smoker.sample(3))

# select all the data for a single row, by the row number.
# more details at: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
first_respondant = nutrition_data.iloc[0]
# display a sample of the first respondent's answers.
display(first_respondant.sample(3))

# select based on condition. In this case, those respondents with "Outie" belly buttons.
outies = nutrition_data[nutrition_data['belly'] == 'Outie']
display(outies.sample(3))

# select based on multiple conditions, such as: 
    # consuming cabbage more than 5 times a week, and having a cat.
condition_1 = nutrition_data['CABBAGEFREQ'] > 5
condition_2 = nutrition_data['cat'] == 'Yes'

cabbage_cats = nutrition_data[condition_1 & condition_2]
# head(3) shows up to 3 rows; unlike sample(3), it does not fail when fewer than 3 rows match.
display(cabbage_cats.head(3))
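
The objectives also mention selecting rows where a value "contains" some text. The sketch below shows one way to do that with `.str.contains()`, next to exact and positional selection. The small `survey` table is made up purely for illustration and is unrelated to the nutrition data.

```python
import pandas

# a small, made-up table used only for this illustration
survey = pandas.DataFrame({
    'name': ['Ana', 'Ben', 'Cara', 'Dev'],
    'pet': ['cat', 'dog', 'cat and dog', 'none'],
    'coffee_per_week': [7, 0, 14, 3],
})

# by position: the first row
first_row = survey.iloc[0]

# exact match: rows where 'pet' is exactly 'cat'
exact_cat = survey[survey['pet'] == 'cat']

# 'contains': rows where 'pet' includes the text 'cat' anywhere
contains_cat = survey[survey['pet'].str.contains('cat')]

# combined conditions: mentions a cat and drinks more than 10 coffees a week
has_cat = survey['pet'].str.contains('cat')
drinks_coffee = survey['coffee_per_week'] > 10
cat_and_coffee = survey[has_cat & drinks_coffee]

print(exact_cat)
print(contains_cat)
print(cat_and_coffee)
```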

## Hypothesis Testing

Now that we can load tabular data into Python using pandas, and we can select subsets of that data, let's use what we have learned to test a hypothesis. In this example, we will test if there is a significant difference in coffee consumption between those with and without a history of heart disease. To do this test, we will use the `stats` functions available in the external library `scipy`. 

This starts with importing those functions.

In [None]:
# we're importing the stats functions of the larger library called scipy
from scipy import stats

To run this statistical test, we will make variables for the population subsets we are interested in, by saving the results of conditional selections to descriptive variable names.

In [None]:
heart_disease_yes = nutrition_data[nutrition_data['heart_disease'] == 'Yes']
print(f'heart_disease_yes group drinks {heart_disease_yes["COFFEEDRINKSFREQ"].mean()} cups of coffee a week.')

heart_disease_no = nutrition_data[nutrition_data['heart_disease'] == 'No']
print(f'heart_disease_no group drinks {heart_disease_no["COFFEEDRINKSFREQ"].mean()} cups of coffee a week.')

Now that we have determined our subsets, we need to verify that the subsets meet the assumptions for a t-test. To do this, we will start with a [Levene](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html) test for homogeneity of variances.

In [None]:
levene_results = stats.levene(heart_disease_yes['COFFEEDRINKSFREQ'], heart_disease_no['COFFEEDRINKSFREQ'])
print(levene_results)

The test is not significant, meaning the subsets have similar variances, so we can proceed. Next we must verify that both subsets are normally distributed. Using what we've [learned about loops](#usingloops), we will run a [normal distribution test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html) on each subset. 

In [None]:
for subset in [heart_disease_yes['COFFEEDRINKSFREQ'], heart_disease_no['COFFEEDRINKSFREQ']]:
    #Check the shape, and normal distribution of both groups using a loop.
    print(subset.shape)
    normalResults = stats.normaltest(subset)
    print(normalResults)

Both subsets are normally distributed, meaning we have satisfied the assumptions and can run the independent t-test.

In [None]:
# stats.ttest_ind returns the calculated t-statistic and the p-value, which unpack as (t, p)
# assign both variables in one statement:

t, p = stats.ttest_ind(heart_disease_yes['COFFEEDRINKSFREQ'], heart_disease_no['COFFEEDRINKSFREQ'])
print(f'The t-statistic was {t}, with a p-value of {p}')
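
To decide whether a result counts as significant, the p-value is compared against a threshold chosen in advance. The lesson does not name a threshold, so the sketch below assumes the common convention of 0.05; the helper name `is_significant` and the example p-values are made up for illustration.

```python
def is_significant(p_value, alpha=0.05):
    """ return True when a p-value falls below the chosen threshold.

        Parameters
        ----------
        p_value : float, the p-value returned by a statistical test
        alpha : float, the significance threshold (0.05 is assumed here)

    """
    return p_value < alpha


# example p-values, made up for illustration only
for example_p in [0.003, 0.2]:
    print(f'p = {example_p}, significant: {is_significant(example_p)}')
```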

<a id='query'></a>
## Morel hunting example

<img src="files/assets/morel.jpg">

In [None]:
# API details at: https://github.com/iDigBio/idigbio-python-client
import idigbio

# make a variable for the idigbio api's pandas option
api = idigbio.pandas()

# set a variable for the genus Morchella (true morels)
genusOfInterest = 'Morchella'

# set a variable for the states we're interested in
nearbyStates = ['Tennessee','Georgia','North Carolina','Alabama']

# define a dictionary with the query's "key word arguments"
query = {'genus':genusOfInterest, "stateprovince":nearbyStates}

# call iDigBio's api, using the query we built
pandas_output = api.search_records(rq=query, limit = 500)

# the "shape" of a dataframe is an attribute which returns a tuple containing the row and column count.
# notice, since we know there are exactly 2 items in the tuple we can assign 2 variables to the attribute.
rowQty, columnQty = pandas_output.shape

# print how large the results are
print(f'{rowQty} rows, and {columnQty} columns returned.')

# display a small portion of the results, to spot check
# Two things to notice here: 
    # First, pandas_output.sample(2) chooses 2 random rows
    # Second, we used "display" instead of "print." 
    # Display is only for jupyter notebooks and prints things with better formatting.
display(pandas_output.sample(2))

<a id='dropna'></a>

Remember, `pandas_output` was the result of querying iDigBio's API for all the "True Morels" found in Tennessee, Georgia, North Carolina, and Alabama. 

The ['eventdate'](https://terms.tdwg.org/wiki/dwc:eventDate) column stores the date the specimen was collected, while the ['startdayofyear'](https://terms.tdwg.org/wiki/dwc:startDayOfYear) column is the day of the year (e.g., 1 is January 1st).

We can use this data to determine the average day of the year morels are found in this region.


In [None]:
# start by dropping all records which have no data in 'eventdate'
# notice we save the result of dropna back to the pandas_output.
# this means we overwrite pandas_output with the results after dropping the null values
pandas_output = pandas_output.dropna(subset=['eventdate'])

# before we move on we should check how many records are left
# remember the shape attribute is a tuple of (rows, columns)
print(pandas_output.shape)

In [None]:
# Calculate the mean of the 'startdayofyear' column. 
# notice we included the parameter "skipna=True,"
# remember Shift+Tab while the cursor is inside a function call displays that function's options.
avgDayOfYear = pandas_output['startdayofyear'].mean(skipna=True)
print(f'The average day of the year for {genusOfInterest} in {nearbyStates} is: {avgDayOfYear}.')

# the avgDayOfYear is useful but how do we make this information more useable?
# Let's convert this to a date by adding the avgDayOfYear to January 1st of this year.
# First we'll import the "datetime" library which comes with python.
import datetime

# Using the datetime library's "now()" function, save the current date to a variable
currentDate = datetime.datetime.now()
# display the results of the current date
print(f'The current date & time is: {currentDate}.')
# The currentDate produced has a ".year" attribute
thisYear = currentDate.year
print(f'The current year is {thisYear}')

# save a variable for a date object representing January 1st of this year.
startOfYear = datetime.date(thisYear,1,1)

# add the avgDayOfYear, to get this year's best date
# datetime's timedelta represents a length of time, here a number of days.
# day 1 is January 1st itself, so we add one day fewer than avgDayOfYear.
bestDate = startOfYear + datetime.timedelta(days=avgDayOfYear - 1)

# print the results
print(f'The average day for collecting morels is {bestDate}.')
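
The day-of-year conversion can also be packaged as a reusable function, using what we learned about [creating functions](#creatingfunctions). The sketch below is one possible version; the name `day_of_year_to_date` and the year `2023` are illustrative choices. Because day 1 is January 1st itself, it adds one day fewer than the day number.

```python
import datetime


def day_of_year_to_date(day_of_year, year):
    """ a function which converts a day of the year to a calendar date.

        Parameters
        ----------
        day_of_year : integer or float, where 1 is January 1st
        year : integer, the year the date should fall in

    """
    start_of_year = datetime.date(year, 1, 1)
    # round to a whole day, then count forward from January 1st
    return start_of_year + datetime.timedelta(days=round(day_of_year) - 1)


print(day_of_year_to_date(1, 2023))
print(day_of_year_to_date(100.4, 2023))
```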

## Exercise, improve upon this work

Recall there were not many records after [we dropped those without an event date](#dropna). Often scripts are written using an example as a starting point. In the cell(s) below, improve on the morel hunting example by increasing the sample size. One simple way to do this is to include additional states [in the initial query we built](#query).

<img src="files/assets/southeast.gif">

Additional Resources:

[datetime documentation](https://docs.python.org/3/library/datetime.html)

[Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/)

[list of public APIs](https://github.com/toddmotto/public-apis)

[iDigBio's Python API (examples and documentation)](https://github.com/iDigBio/idigbio-python-client)

Todo:

including description of "in ram" effects of a jupyter notebook.

