# Lab 04 Functions and Visualization

<i>Elements of Data Science</i><br><br>
Welcome to lab 4!
This week, we will focus on functions and visualization. <br>Functions are described in [Chapter 8](https://inferentialthinking.com/chapters/08/Functions_and_Tables.html) of the Inferential Thinking text. <br>Visualization is covered in [Chapter 7](https://inferentialthinking.com/chapters/07/Visualization.html).
First, set up the tests and imports by running the cell below.

In [None]:
# Enter your name as a string
name = ...

In [None]:
import numpy as np
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# These lines load the tests.
from gofer.ok import check

### Let's explore the most recent COVID data from the New York Times
This data is updated and stored at GitHub: https://github.com/nytimes/covid-19-data <br>
US rolling average: https://raw.githubusercontent.com/nytimes/covid-19-data/master/rolling-averages/us.csv <br>
US States rolling average: https://raw.githubusercontent.com/nytimes/covid-19-data/master/rolling-averages/us-states.csv <br>

In [None]:
COVID_data = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/rolling-averages/us.csv'
COVID=Table.read_table(COVID_data)
COVID=COVID.set_format(0, DateFormatter(format='%Y-%m-%d',))

If the read above does not work, we can use the data-handling package *pandas*, first discussed in the introduction to Lab 03. To run the lines below, remove the comment character, #, from the front of each line.

In [None]:
# import pandas as pd
# data_db = pd.read_csv(COVID_data) # Read data with pandas
# COVID = Table.from_df(data_db) # Create datascience Table object
# COVID=COVID.set_format(0, DateFormatter(format='%Y-%m-%d',))

In [None]:
COVID.sort("date", descending=True) # Display most recent first

### Use where to select data from November - December 2021
Here are some of the predicates that can be passed to the <i>where</i> Table method:<br>

|Predicate|Example|Result|
|-|-|-|
|`are.equal_to`|`are.equal_to(50)`|Find rows with values equal to 50|
|`are.not_equal_to`|`are.not_equal_to(50)`|Find rows with values not equal to 50|
|`are.above`|`are.above(50)`|Find rows with values above (and not equal to) 50|
|`are.above_or_equal_to`|`are.above_or_equal_to(50)`|Find rows with values above 50 or equal to 50|
|`are.below`|`are.below(50)`|Find rows with values below 50|
|`are.between`|`are.between(2, 10)`|Find rows with values above or equal to 2 and below 10|
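
As a rough illustration of what each predicate tests, the comparisons can be written in plain Python. This is only a sketch of the behavior described in the table, not how the `datascience` library implements it; the list `values` is made up for the example.

```python
# Hypothetical column of values, used only for illustration
values = [1, 2, 5, 10, 50, 75]

equal_to_50 = [v for v in values if v == 50]           # are.equal_to(50)
not_equal_to_50 = [v for v in values if v != 50]       # are.not_equal_to(50)
above_50 = [v for v in values if v > 50]               # are.above(50)
above_or_equal_50 = [v for v in values if v >= 50]     # are.above_or_equal_to(50)
below_50 = [v for v in values if v < 50]               # are.below(50)
between_2_and_10 = [v for v in values if 2 <= v < 10]  # are.between(2, 10)

# The lower bound 2 is included; the upper bound 10 is not
print(between_2_and_10)
```

Note that `are.between` is the only predicate here that treats its two bounds differently.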

In [None]:
COVID.where("deaths",are.between(0,1))

Dates produce an error, as you will see in the next cell. The sections below show the steps needed to work with dates.

In [None]:
# Dates written as strings produce an error; the steps needed to work with dates follow below
COVID.where("date",are.between("11/01/2021","12/31/2021"))

### Dates and times in Tables
One thing that is more complicated than we would like, but is a common need for a data scientist, is encoding dates and times. Python's `time` module, like Unix-based operating systems such as Linux, uses January 1, 1970 as a reference point, or epoch, and does time computations with the number of seconds elapsed since midnight at the start of that day.<br><br>
The number of seconds from January 1, 1970 to right now is given by `time.time()` after importing the `time` module:
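
To make the epoch concrete, the sketch below converts a few second counts back into calendar dates. It uses `time.gmtime` and `calendar.timegm`, which work in UTC; those two calls are additions for illustration and are not used elsewhere in this lab.

```python
import calendar
import time

SECONDS_PER_DAY = 60 * 60 * 24  # seconds * minutes * hours in one day

# Zero seconds after the epoch is midnight on January 1, 1970 (UTC)
epoch_start = time.gmtime(0)
epoch_label = time.strftime('%Y-%m-%d %H:%M:%S', epoch_start)
print(epoch_label)

# Exactly one day (in seconds) after the epoch
next_day = time.gmtime(SECONDS_PER_DAY)
next_day_label = time.strftime('%Y-%m-%d', next_day)
print(next_day_label)

# calendar.timegm reverses time.gmtime: struct_time (UTC) -> seconds
round_trip = calendar.timegm(next_day)
print(round_trip)
```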

In [None]:
import time                # Python time functions
from time import strptime 
time.time() # Seconds since common epoch

We can also convert a string containing the year, month, and day into a time using the <b><i>strptime</i></b> function. The result is then passed to `time.mktime` to get the seconds since the epoch.
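
A small sketch of what `strptime` returns may help here. The date string below is arbitrary and chosen only for the example. The format codes `%Y`, `%m`, and `%d` tell Python which part of the string is the year, month, and day, and the result is a `struct_time` whose fields can be read by name.

```python
import time
from time import strptime

# Parse an arbitrary example date (not one of the lab's dates)
parsed = strptime('2019-07-04', '%Y-%m-%d')
print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)

# Fields that are not present in the string default to midnight
print(parsed.tm_hour, parsed.tm_min, parsed.tm_sec)

# time.mktime interprets the struct_time as local time and returns seconds
seconds = time.mktime(parsed)

# Converting back with time.localtime recovers the same calendar date
recovered = time.localtime(seconds)
recovered_label = time.strftime('%Y-%m-%d', recovered)
print(recovered_label)
```

Because `time.mktime` uses the computer's local time zone, the exact number of seconds it returns can differ between machines set to different time zones.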

In [None]:
time0 = strptime('2020-01-21', '%Y-%m-%d')
time.mktime(time0)

### Question 1: 
Determine the number of seconds between January 21, 2020 (considered the start of the COVID pandemic in the US) and December 31, 2021 (both at midnight). Use two methods: <br> A) Multiplying 60 seconds * 60 minutes * 24 hours * ... <br> B) Using strptime and time.mktime()

In [None]:
difftimeA = ... # Compute through multiplication: 60 seconds * 60 minutes * 24 hours * ...

time1a = strptime('2020-01-21', '%Y-%m-%d')
time1 = time.mktime(time1a)
time2a = ...
time2 = ...

difftimeB = ...
print(difftimeB, difftimeA)

In [None]:
check('tests/q1a.py')

### Question 2: 
The date column in the COVID table is stored as a time in seconds since the epoch, like the values evaluated above. Now define a subset of the data to examine trends between November and December of 2021.

In [None]:
time1 = ... # Seconds since epoch
time2 = ... # Seconds since epoch
Late2021 = COVID.where(0,are.between(time1,time2))
Late2021

In [None]:
check('tests/q2a.py')

### Plot
If we attempt to plot using the 'date' column, the x-axis shows seconds since the epoch (January 1, 1970). This is hard to read, so we will address it below.

In [None]:
Late2021.plot('date','cases_avg_per_100k')

### Histogram
A histogram is drawn by calling the `hist` method on a Table with a column name, as in `.hist('column name')`.
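
For intuition about what a histogram shows, here is a plain-Python sketch that counts how many values fall in each bin. The values and the bin width are made up, and this is not how `hist` is implemented; it only mirrors the idea of grouping values into equal-width ranges.

```python
from collections import Counter

# Hypothetical daily counts, for illustration only
values = [3, 7, 12, 18, 21, 22, 29, 35]
bin_width = 10

# Map each value to the lower edge of its bin, e.g. 12 -> 10
bin_counts = Counter((v // bin_width) * bin_width for v in values)

# Print each bin as a half-open range with its count
for lower_edge in sorted(bin_counts):
    upper_edge = lower_edge + bin_width
    print(f'[{lower_edge}, {upper_edge}): {bin_counts[lower_edge]}')
```

The bars of a histogram are drawn from counts like these; Chapter 7 describes how the vertical axis is scaled.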

In [None]:
Late2021.hist('deaths')

In [None]:
Late2021.stats()

### Question 3
Construct a histogram and stats for November - December 2020 and compare this to those from November - December 2021 in a markdown cell below the histogram and statistics.

In [None]:
time1 =  ... # Seconds since epoch
time2 =  ... # Seconds since epoch
Late2020 = ...
Late2020

In [None]:
check('tests/q3a.py')

In [None]:
Late2020.hist('deaths')
...

#### Your comparison in this markdown cell (double click to edit)

...

### Plotting with dates
Getting a readable x-axis with dates can also be tricky. This is particularly complicated when time is stored as seconds since the common epoch of midnight, January 1, 1970.
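
One way to see what the conversion step has to accomplish is to turn raw seconds into `datetime` objects with the standard library. The seconds below are made-up values, not rows from the COVID table, and the sketch stops before any plotting. It is an aside under those assumptions; the template used in this lab converts to Matplotlib date numbers instead.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 60 * 60 * 24

# Hypothetical timestamps: three consecutive days in seconds since the epoch
start = 1635724800  # midnight UTC on 2021-11-01
seconds = [start + i * SECONDS_PER_DAY for i in range(3)]

# Convert each number of seconds into a timezone-aware datetime (UTC)
as_datetimes = [datetime.fromtimestamp(s, tz=timezone.utc) for s in seconds]

# Format the datetimes the way an axis label might show them
labels = [d.strftime('%Y-%m-%d') for d in as_datetimes]
print(labels)
```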

In [None]:
# Input Data to plot
dates = Late2021.column('date')  
deaths = Late2021.column('deaths') 
# See: https://matplotlib.org/stable/api/dates_api.html
# Already done: import matplotlib.pyplot as plt
# mdates does the trick!
#
## DATE PLOTTING CODE TEMPLATE TO COPY ##
import matplotlib.dates as mdates
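# Convert our dates (seconds since the epoch) to Matplotlib dates.
# mdates.epoch2num did this in older Matplotlib but is gone from newer releases.
sec = mdates.date2num(np.asarray(dates).astype('int64').astype('datetime64[s]'))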
loc = mdates.AutoDateLocator() # Fancy function for dates
fmt = mdates.AutoDateFormatter(loc)
plt.gca().xaxis.set_major_formatter(fmt)
plt.gca().xaxis.set_major_locator(loc)
## END: DATE PLOTTING CODE TEMPLATE TO COPY ##
#
# Now plot
plt.plot(sec,deaths)
plt.gcf().autofmt_xdate()

### Question 4
Now use the same plotting template (copy it from above) and modify it to plot your Late2020 data. In the markdown cell below, describe the differences between the 2020 and 2021 line graphs.

In [None]:
# Plot of November - December 2020 COVID data
dates = Late2020.column('date')
...

...

...
# Now plot
plt.plot(sec,deaths)
plt.gcf().autofmt_xdate()

#### Your comparison in this markdown cell (double click to edit)

...

In [None]:
check('tests/q4a.py')

## 2. Defining functions

Let's write a very simple function that converts a proportion to a percentage by multiplying it by 100.  For example, the value of `to_percentage(.5)` should be the number 50.  (No percent sign.)

A function definition has a few parts.

##### `def`
It always starts with `def` (short for **def**ine):

    def

##### Name
Next comes the name of the function.  Let's call our function `to_percentage`.
    
    def to_percentage

##### Signature
Next comes something called the *signature* of the function.  This tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code.  `to_percentage` should take one argument, and we'll call that argument `proportion` since it should be a proportion.

    def to_percentage(proportion)

We put a colon after the signature to tell Python it's over.

    def to_percentage(proportion):

##### Documentation
Functions can do complicated things, so you should write an explanation of what your function does.  For small functions, this is less important, but it's a good habit to learn from the start.  Conventionally, Python functions are documented by writing a triple-quoted string:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
    
    
##### Body
Now we start writing code that runs when the function is called.  This is called the *body* of the function.  We can write anything we could write anywhere else.  First let's give a name to the number we multiply a proportion by to get a percentage.

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100

##### `return`
The special instruction `return` in a function's body tells Python to make the value of the function call equal to whatever comes right after `return`.  We want the value of `to_percentage(.5)` to be the proportion .5 times the factor 100, so we write:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        return proportion * factor
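
The same parts appear in any function. As a second illustration, here is a different function built the same way; the name `to_fahrenheit` and its argument `celsius` are invented for this example and are not part of the lab.

```python
def to_fahrenheit(celsius):
    """Converts a temperature from Celsius to Fahrenheit."""
    scale = 9 / 5
    return celsius * scale + 32

# Calling the function: the value of the call is whatever follows `return`
boiling = to_fahrenheit(100)
print(boiling)

# The documentation string is stored on the function itself
print(to_fahrenheit.__doc__)
```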

**Question 5.** Define `to_percentage` in the cell below.  Call your function to convert the proportion .2 to a percentage.  Name that percentage `twenty_percent`.

In [None]:
def ...
    """ ... """
    ... = ...
    return ...

twenty_percent = ...
twenty_percent

In [None]:
check('tests/q5a.py')

**Question 6.** Now define another function that takes the ratio of two numbers and then uses the `to_percentage` function above to convert it into a percentage. One issue arises when the denominator is zero. Depending on the kind of numbers involved, dividing by zero either raises an error (plain Python numbers) or gives a result such as `nan`, meaning not a number (NumPy numbers). Either case can be changed to a zero as a placeholder with one of the two little tricks shown below, each of which can be incorporated as a few lines of your code.

In [None]:
# First approach to deal with dividing by zero
from math import nan
z = nan
print("First: ",z)
# Use this part in your function
if z != z: # if conditional statement
    z = 0
# Up to here
print("Now: ", z)

A second approach uses Python's *try:* and *except:*. The *except:* block is executed if the *try:* block fails with an exception, such as the error Python raises when dividing by zero.
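
The two approaches rely on different behavior, which the sketch below demonstrates using only the standard library. It assumes plain Python numbers; values taken from a Table column are NumPy numbers, which may behave differently (for example, producing `nan` with a warning rather than raising an exception).

```python
import math

# nan is the only float value that is not equal to itself,
# which is why the test `z != z` detects it
z = math.nan
is_nan_by_comparison = (z != z)
is_nan_by_function = math.isnan(z)  # a more readable equivalent check
print(is_nan_by_comparison, is_nan_by_function)

# With plain Python numbers, dividing by zero raises an exception
# instead of returning nan, so try/except can catch it
try:
    result = 1 / 0
except ZeroDivisionError:
    result = 0
print(result)
```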

In [None]:
# Second approach: use this part in your function with z as the ratio
try:
    z = 1/0
except ZeroDivisionError:
    z = 0
print("Now: ", z)

In [None]:
# Now your function...
def ratio(x1,x2):
    """ Computes a ratio of x1 to x2 """
    ...
    r = to_percentage(z)
    return r

In [None]:
check('tests/q6a.py')

### COVID cases leading to bad outcomes
Now we will apply the function to our COVID data. Here we will need to use the *with_columns* method of a Table object to add the result of applying the ratio function with two columns as arguments. These columns will be *deaths* and *cases*. The percentages returned by the function will form a new column.<br>
<br>See Inferential Thinking 8.1.1 for inspiration: <br> https://inferentialthinking.com/chapters/08/1/Applying_a_Function_to_a_Column.html
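
Conceptually, applying a two-argument function to two columns calls the function once per row, pairing the values from each column, and collects the results into a new column. The plain-Python sketch below mimics that idea with lists. The function `difference` and the numbers are invented for illustration, and the Table methods described in the linked chapter are not used here.

```python
def difference(a, b):
    """Returns a minus b."""
    return a - b

# Hypothetical columns of equal length
highs = [70, 65, 80]
lows = [50, 55, 62]

# Call the function once per row, pairing values from the two columns
new_column = [difference(high, low) for high, low in zip(highs, lows)]
print(new_column)
```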

**Question 7.** Now apply your function to create a new column, *deathrate*. Examine the histogram for *deathrate*. Then plot the trend in *deathrate* over the entire time period of the dataset. Remember the special code from above to define the x ('date') and y ('deathrate') data to plot. Discuss the results in the markdown cell below.

In [None]:
COVID = COVID.with_columns("deathrate",...).sort("deathrate")
# Check that there are no nan...
COVID

In [None]:
# Histogram
...

In [None]:
# Plot
# Be sure to re-sort the data by date; the plot connects consecutive data points
...
# Input Data to plot
dates = COVID.column('date')  
deathrate = COVID.column('deathrate') 
# See: https://matplotlib.org/stable/api/dates_api.html
# Already done: import matplotlib.pyplot as plt
# mdates does the trick!
## USE DATE PLOTTING CODE TEMPLATE HERE ###
import matplotlib.dates as mdates
...
...


#### Your discussion of results from question 7 in this markdown cell (double click to edit)

...

In [None]:
check('tests/q7a.py')

**Congratulations**, you're done with lab 4! Be sure to:
- Run all the tests and verify that they all pass (the next cell has a shortcut for that).
- Save and Checkpoint from the File menu.
- Run the last two cells for partial grading. Comments and markdown will be graded separately.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import glob
from gofer.ok import check
correct = 0
for x in range(1, 8):
    print('Testing question {}: '.format(str(x)))
    g = check('tests/q{}a.py'.format(str(x)))
    if g.grade == 1.0:
        print("Passed")
        correct += 1
    else:
        print('Failed')
        display(g)

print('Grade:  {}'.format(str(correct/7)))

In [None]:
print(name," Great work!")