# **C-HACK Tutorial 3: Functions and Pandas Introduction**

**Instructor**: redacted<br>
**Contact**: redacted

Today, we will discuss **_functions_** in more depth.  We've seen them previously and used them, for example the `.append()` **_function_**.  Here, we'll dig into how you can make your own functions to encapsulate code that you will reuse over and over.  

Then we'll jump into the **Pandas** package.  Packages are collections of related functions.  These are the things we `import`.  Pandas provides a two-dimensional data structure, the data frame, which works like a spreadsheet in Excel.

### 3.1 Review from Tutorial on Data Structures and Flow Control

In our last tutorial, we discussed **_lists_**, **_dictionaries_**, and **_flow control_**.

**_Lists_** are **_ordered collections_** of data that can be used to hold multiple pieces of information while preserving their order.  We use `[` and `]` to access elements by their indices, which start with `0`.  All things that operate on **_lists_**, like slices, use the concept of an inclusive lower bound and an exclusive upper bound.  So, the following gets elements from the **_list_** `my_list` with index values of `0`, `1`, and `2`, but **not** `3`!

```
my_list[0:3]
```

What other way of writing this **_slice_** gives the same result?  Hint: think about leaving out one of the numbers in the slice!

**_Dictionaries_** are **_named_** **_collections_** of data that can be used to hold multiple pieces of information as **_values_** that are addressed by **_keys_**, resulting in a **_key_** to **_value_** data structure.  They are accessed with `[` and `]` but initialized with `{` and `}`.  E.g.

```
my_dict = { 'cake' : 'Tasty!', 'toenails' : 'Gross!' }
my_dict['cake']
```

Finally, we talked about **_flow control_** and using the concept of **_conditional execution_** to decide which code statements were executed.  Remember this figure?


<img src="https://docs.oracle.com/cd/B19306_01/appdev.102/b14261/lnpls008.gif">Flow control figure</img>

What are the **_if_** statements?

Where do **_for_** loops fit in?

What was the overarching concept of a **_function_**?

### 3.2 Functions

For loops let you repeat some code for every item in a list.  Functions are similar in that they run the same lines of code for new values of some variable.  They are different in that functions are not limited to looping over items.

Functions are a critical part of writing easy to read, reusable code.

Create a function like:
```
def function_name (parameters):
    """
    optional docstring
    """
    function expressions
    return [variable]
```

_Note:_ Sometimes I use the word argument in place of parameter.

Here is a simple example.  It prints a string that was passed in and returns nothing.

```
def print_string(str):
    """This prints out a string passed as the parameter."""
    print(str)
    return
```

In [None]:
def print_string(str):
    """This prints out a string passed as the parameter."""
    print(str)
    return

To call the function, use:
```
print_string("Dave is awesome!")
```

_Note:_ The function has to be defined before you can call it!

In [None]:
print_string2("Dave is OK")

If you provide no argument, or too many, you get an error.

In [None]:
print_string()
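The two failure modes described above can be seen side by side in a small sketch.  The parameter name `text` and the undefined name `shout_string` are assumptions made for illustration; the errors are caught with `try`/`except` so the cell keeps running.

```python
def print_string(text):
    """Print the string passed as the parameter."""
    print(text)
    return  # a bare return hands back None


result = print_string("hello")
print(result)

errors_seen = []

try:
    print_string()  # missing the one required argument
except TypeError as err:
    errors_seen.append(type(err).__name__)
    print("TypeError:", err)

try:
    shout_string("hello")  # hypothetical name that was never defined
except NameError as err:
    errors_seen.append(type(err).__name__)
    print("NameError:", err)
```

Calling with the wrong number of arguments raises a `TypeError`, while calling a function that has not been defined raises a `NameError`.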

#### 3.2.1 Function Parameters

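Parameters (or arguments) in Python are passed as references to objects.  This means that if a function changes a mutable object such as a **_list_** in place, the change is visible outside of the function.  Reassigning the parameter name inside the function, by contrast, does not affect the caller's variable.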

See the following example:

```
def change_list(my_list):
    """Append an item to the list passed into this function."""
    my_list.append('four')
    print('list inside the function: ', my_list)
    return

my_list = [1, 2, 3]
print('list before the function: ', my_list)
change_list(my_list)
print('list after the function: ', my_list)
```

In [None]:
def change_list(my_list):
    """Append an item to the list passed into this function."""
    my_list.append('four')
    print('list inside the function: ', my_list)
    return

my_list = [1, 2, 3]
print('list before the function: ', my_list)
change_list(my_list)
print('list after the function: ', my_list)
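One subtlety is worth a quick sketch: changing a list in place is not the same as reassigning the parameter name.  The function names `append_in_place` and `rebind_name` below are made up for illustration.

```python
def append_in_place(items):
    """Mutate the list object the caller passed in."""
    items.append('four')


def rebind_name(items):
    """Point the local name at a new list; the caller's list is untouched."""
    items = ['brand', 'new', 'list']
    return items


numbers = [1, 2, 3]
append_in_place(numbers)
print(numbers)      # the appended item is visible here

replacement = rebind_name(numbers)
print(numbers)      # unchanged by the reassignment inside the function
print(replacement)  # the new list only reaches us through the return value
```

So in-place changes to a mutable object show up outside the function, but assigning a new object to the parameter name does not.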

#### 3.2.2 For advanced folks...

Variables have scope: **_global_** and **_local_**

In a function, new variables that you create are not saved when the function returns; these are **_local_** variables.  Variables defined outside of the function are **_global_** variables.  Inside a function they can be read, but assigning to the same name creates a new local variable instead of changing the global.  _Note:_ the **_global_** keyword overrides this and lets a function change a global.  Generally, the use of **_global_** variables is discouraged; use parameters instead.

```
my_global_1 = 'bad idea'
my_global_2 = 'another bad one'
my_global_3 = 'better idea'

def my_function():
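    print(my_global_1)  # reading a global works without any keyword
    my_global_2 = 'broke your global, man!'  # creates a local; the global is unchanged
    global my_global_3  # opt in to changing this global
    my_global_3 = 'still a better idea'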
    return
    
my_function()
print(my_global_2)
print(my_global_3)
```

In general, you want to use parameters to provide data to a function and return a result with `return`. E.g.

```
def add(x, y):  # named `add` so Python's built-in `sum` is not overwritten
    my_sum = x + y
    return my_sum
```

If you are going to return multiple objects, what data structure that we talked about can be used?  Give an example below.

#### 3.2.3 Parameters have different types:

| type | behavior |
|------|----------|
| required | positional, must be present or error, e.g. `my_func(first_name, last_name)` |
| keyword | position independent, e.g. `my_func(first_name, last_name)` can be called `my_func(first_name='Dave', last_name='Beck')` or `my_func(last_name='Beck', first_name='Dave')` |
| default | keyword params that default to a value if not provided |


```
def print_name(first, last='the C-Hacker'):
    print('Your name is %s %s' % (first, last))
    return
```

In [None]:
def print_name(first, last='the C-Hacker'):
    print('Your name is %s %s' % (first, last))
    return

Play around with the above function.

In [None]:
print_name('Dave', last='his Data Science Majesty')
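To make the three calling styles from the table easier to compare, here is a small sketch.  It uses a hypothetical variant, `format_name`, that returns the string instead of printing it so the results can be inspected.

```python
def format_name(first, last='the C-Hacker'):
    """Return the greeting instead of printing it, so it can be inspected."""
    return 'Your name is %s %s' % (first, last)


default_used = format_name('Dave')                 # default fills in `last`
positional = format_name('Dave', 'Beck')           # both passed by position
keywords = format_name(last='Beck', first='Dave')  # keyword order does not matter

print(default_used)
print(positional)
print(keywords)
```

The positional and keyword calls produce the same string; only the way the arguments are matched to parameters differs.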

Functions can contain any code that you put anywhere else including:
* if...elif...else
* for...else
* while
* other function calls

```
def print_name_age(first, last, age):
    print_name(first, last)
    print('Your age is %d' % (age))
    if age > 35:
        print('You are really old.')
    return
```


In [None]:
def print_name_age(first, last, age):
    print_name(first, last)
    print('Your age is %d' % (age))
    if age > 35:
        print('You are really old.')
    return

```
print_name_age(age=46, last='Beck', first='Dave')
```

In [None]:
print_name_age(age=46, last='Beck', first='Dave')

## 3.3 Pandas and the Scientific Python Toolkit

In addition to Python's built-in modules like the ``math`` module, there are also many widely used third-party modules that are core tools for doing data science with Python.
Some of the most important ones are:

#### [``numpy``](http://numpy.org/): Numerical Python

Numpy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data.
If you have used other computational tools like IDL or MATLAB, Numpy should feel very familiar.

#### [``scipy``](http://scipy.org/): Scientific Python

Scipy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more.
We will not look closely at Scipy today, but we will use its functionality later in the course.

#### [``pandas``](http://pandas.pydata.org/): Labeled Data Manipulation in Python

Pandas is short for "Panel Data", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a *Data Frame*.
If you've used the [R](http://rstats.org) statistical language (and in particular the so-called "Hadley Stack"), much of the functionality in Pandas should feel very familiar.

#### [``matplotlib``](http://matplotlib.org): Visualization in Python

Matplotlib started out as a MATLAB plotting clone in Python, and has grown from there in the 15 years since its creation. It is the most popular data visualization tool currently in the Python data world (though other recent packages are starting to encroach on its monopoly).

### 3.3.1 Pandas

We begin by loading the Pandas package.  Packages are collections of functions that share a common utility.  We've seen `import` before.  Let's use it to import Pandas and all the richness that Pandas has.

```
import pandas
```

In [None]:
import pandas

```
df = pandas.DataFrame()
```

In [None]:
df = pandas.DataFrame()

Because we'll use it so much, we often import under a shortened name using the ``import ... as ...`` pattern:

```
import pandas as pd
import numpy as np
```

In [None]:
import pandas as pd
import numpy as np

Let's create an empty _data frame_ and put the result into a variable called `df`.  This is a popular choice for a _data frame_ variable name.

```
df = pd.DataFrame()
```

In [None]:
df = pd.DataFrame()


Let's create some random data as a pandas data frame.  Before we get to the dataframe, let's briefly talk about numpy's `random` function.  If we look at the [`random`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.random.html) documentation, you can see it takes a size argument.  This should be a `list` or a `tuple` that says what the "height" and "width" of the generated data will be.  In our case, we will get 10 rows of data in three columns with the following:

```
np.random.random((10,3))
```


Notice we change the value of the `df` variable to point to a new data frame.

```
df = pd.DataFrame(data=np.random.random((10,3)), columns=['v1', 'v2', 'v3'])
```

In [None]:
df = pd.DataFrame(data=np.random.random((10,3)), columns=['v1', 'v2', 'v3'])
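As a quick check on the "height" and "width" idea, the sketch below builds the array first and inspects its shape before wrapping it in a data frame.  The variable names `values` and `random_df` are assumptions chosen for this example.

```python
import numpy as np
import pandas as pd

values = np.random.random((10, 3))  # 10 rows, 3 columns, floats in [0, 1)
print(values.shape)

random_df = pd.DataFrame(data=values, columns=['v1', 'v2', 'v3'])
print(random_df.shape)
print(list(random_df.columns))
```

The numbers themselves differ on every run because they are random, but the shape stays the same.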

*Note: strings in Python can be defined either with double quotes or single quotes*

### 3.3.2 Viewing Pandas Dataframes

The ``head()`` and ``tail()`` methods show us the first and last rows of the data.

```
df.head()
df.tail()
```

In [None]:
df.head()

In [None]:
df.tail()

### Important!

Prior to the start of C-Hack, you received instructions on how to set up your Google Drive & Colab.  The videos are here:

* [How to load a file into Colab 1: Basics](https://www.youtube.com/watch?v=5rZn-aVNR0A)
* [How to load a file into Colab 2: .csv and Python](https://www.youtube.com/watch?v=_2z3tFPbwjA)

Before moving forward, let's load a more interesting set of data as a DataFrame: the NOAA storm dataset. We can load it from a `csv` (comma separated values) file that we have in Google Drive.  Before we can do this, however, we need to attach this notebook to our Google Drive.

*Note:* when you run this, your browser will pop up a new window asking you to authenticate to Google Drive.  

```
from google.colab import drive
drive.mount('/content/drive')
```

In [None]:
from google.colab import drive
drive.mount('/content/drive')

With our notebook now able to access the files in our Google drive, we can now open the CSV (comma separated value) file that contains the NOAA data we want to use.  To do that, we use the Pandas `read_csv` function and give it a file path that is in our drive.

```
path_to_file = '/content/drive/MyDrive/C-HACK 2022 EVENT/Tutorials/NOAA_storm_data/cleaned_data.csv'
df = pd.read_csv(path_to_file, index_col=0)
df.head()
```

In [None]:
path_to_file = '/content/drive/MyDrive/C-HACK 2022 EVENT/Tutorials/NOAA_storm_data/cleaned_data.csv'
df = pd.read_csv(path_to_file, index_col=0)
df.head()

From this `head` of the dataframe, we can see which columns are present in our dataset.  The table below shows the data types and NOAA's descriptions of the fields.

| Column            | Data type   | Description                                                                                                        |
|-------------------|-------------|--------------------------------------------------------------------------------------------------------------------|
| BEGIN_DATE_TIME   | Date / time | When the adverse weather event started.                                                                            |
| EVENT_TYPE        | String      | Human readable name for the type of adverse weather event; E.g. Hail, Thunderstorm, Wind, Snow, Ice (spelled out)  |
| INJURIES_DIRECT   | Integer     | The number of injuries directly caused by the weather event.                                                       |
| INJURIES_INDIRECT | Integer     | The number of injuries indirectly caused by the weather event.                                                     |
| DEATHS_DIRECT     | Integer     | The number of deaths directly caused by the weather event.                                                         |
| DEATHS_INDIRECT   | Integer     | The number of deaths indirectly caused by the weather event.                                                       |
| DAMAGE_PROPERTY   | Float       | The estimated amount of damage to property incurred by the weather event.                                          |
| DAMAGE_CROPS      | Float       | The estimated amount of damage to crops incurred by the weather event.                                             |
| LATITUDE          | Float       | The latitude in decimal degrees of the begin point of event or damage path.                                        |
| LONGITUDE         | Float       | The longitude in decimal degrees of the begin point of event or damage path.                                       |


Cool!  Now that we know what data we have, let's see how much data we have!

The ``shape`` attribute shows us the number of rows and columns:

```
df.shape
```

Note it doesn't have the `()` because it isn't a **_function_** - it is an **_attribute_** or variable attached to the `df` object.

In [None]:
df.shape
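Since `shape` is a plain tuple, it can be unpacked into separate variables.  The sketch below uses a tiny, made-up table (`small_df` and its values are hypothetical) so the numbers are easy to verify by eye.

```python
import pandas as pd

# Hypothetical three-row table, made up purely for illustration.
small_df = pd.DataFrame({
    'EVENT_TYPE': ['Hail', 'Flood', 'Hail'],
    'DEATHS_DIRECT': [0, 1, 0],
})

n_rows, n_cols = small_df.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(len(small_df))             # len() gives the row count
```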

The ``columns`` attribute gives us the column names

```
df.columns
```


In [None]:
df.columns

The ``index`` attribute gives us the index names.  Note that in this instance, our index column is a unique identifier for each event in the form of an integer from 0 to 33632.

```
df.index
```

In [None]:
df.index

The ``dtypes`` attribute gives the data types of each column.  Remember the data type **_floating point_**?

```
df.dtypes
```

In [None]:
df.dtypes

### 3.3.3 Manipulating data with ``pandas``

Here we'll cover some key features of manipulating data with pandas.

Access columns by name using square-bracket indexing:

```
df['LONGITUDE']
```

In [None]:
df['LONGITUDE']

Mathematical operations on columns happen *element-wise*:

```
df['DAMAGE_PROPERTY'] / 1000.
```

In [None]:
df['DAMAGE_PROPERTY'] / 1000.

Columns can be created (or overwritten) with the assignment operator.

Let's create a `DAMAGE_PROPERTY_THOUSANDS` column which converts the `DAMAGE_PROPERTY` values into units of 1000s of dollars.

```
df['DAMAGE_PROPERTY_THOUSANDS'] = df['DAMAGE_PROPERTY'] / 1000
```

In [None]:
df['DAMAGE_PROPERTY_THOUSANDS'] = df['DAMAGE_PROPERTY'] / 1000
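To see what element-wise means on numbers small enough to check by hand, here is a sketch on a made-up table.  The name `toy_df` and the dollar amounts are hypothetical.

```python
import pandas as pd

# Hypothetical dollar amounts, made up for illustration.
toy_df = pd.DataFrame({'DAMAGE_PROPERTY': [1000.0, 2500.0, 0.0]})

# The division is applied to every row at once; no loop is needed.
toy_df['DAMAGE_PROPERTY_THOUSANDS'] = toy_df['DAMAGE_PROPERTY'] / 1000
print(toy_df)
```

Each row of the new column is the matching row of the old column divided by 1000.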

Let's use the `.head()` **_function_** to see our new data!

```
df.head()
```

In [None]:
df.head()

In preparation for grouping the data, let's bin the events by their property damage. For that, we'll use ``pd.cut``

```
df['DAMAGE_PROPERTY_group'] = pd.cut(df['DAMAGE_PROPERTY'], 1000)
df.head()
df.dtypes
```

In [None]:
df['DAMAGE_PROPERTY_group'] = pd.cut(df['DAMAGE_PROPERTY'], 1000)


In [None]:
df.head()

In [None]:
df.dtypes
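With 1000 bins the result of ``pd.cut`` is hard to read, so here is a sketch with only two bins on a made-up series.  The values in `damage` are hypothetical.

```python
import pandas as pd

# Hypothetical damage values, made up for illustration.
damage = pd.Series([0.0, 10.0, 20.0, 90.0, 100.0])

# Ask for 2 equal-width bins spanning the range of the data.
groups = pd.cut(damage, 2)
print(groups)
print(groups.dtype)  # the result is a categorical column of intervals
```

Each value is replaced by the interval (bin) it falls into, which is what makes the column usable for grouping later.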

### 3.3.4 Simple Grouping of Data

The real power of Pandas comes in its tools for grouping and aggregating data. Here we'll look at *value counts* and the basics of *group-by* operations.

#### 3.3.4.1 Value Counts

Pandas includes an array of useful functionality for manipulating and analyzing tabular data.
We'll take a look at two of these here.

``pandas.value_counts`` returns the number of times each unique value occurs in a column.

We can use it, for example, to break down the storm events by the property damage group that we just created:

```
pd.value_counts(df['DAMAGE_PROPERTY_group'])
```

In [None]:
pd.value_counts(df['DAMAGE_PROPERTY_group'])
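The same count is also available as a method on the column itself.  The sketch below uses a made-up series of event types (`events` is hypothetical); newer pandas releases may warn that the top-level `pd.value_counts` is deprecated, in which case this method form is the one to use.

```python
import pandas as pd

# Hypothetical event types, made up for illustration.
events = pd.Series(['Hail', 'Flood', 'Hail', 'Tornado', 'Hail'])

counts = events.value_counts()  # method form of the same operation
print(counts)
```

The result is sorted so that the most frequent value comes first.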

What happens if we try this on a continuous valued variable like longitude?

```
pd.value_counts(df['LONGITUDE'])
```

In [None]:
pd.value_counts(df['LONGITUDE'])

We can do a little data exploration with this to look at the distribution of data in a column.  Here, let's look at the number of direct deaths in our dataset.

```
pd.value_counts(df['DEATHS_DIRECT'])
```

In [None]:
pd.value_counts(df['DEATHS_DIRECT'])

Thankfully, of our 33633 adverse weather events, 33510 had no deaths!  What percentage is that?

```
(33510/33633) * 100
```

In [None]:
(33510/33633) * 100

### Question:
How can we alter the `pd.value_counts` call to show us the percent of events with the specified number of deaths?

Hint:  Can you do math on the `pd.value_counts` call?  Can you do it without hard coding the number of rows in the dataframe?

#### 3.3.4.2 Group-by Operation

One of the killer features of the Pandas dataframe is the ability to do group-by operations.
You can visualize the group-by like this (image borrowed from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do))

![image](https://swcarpentry.github.io/r-novice-gapminder/fig/12-plyr-fig1.png)

Let's take this in smaller steps.
Recall our ``DAMAGE_PROPERTY_group`` column.

```
pd.value_counts(df['DAMAGE_PROPERTY_group'])
```

In [None]:
pd.value_counts(df['DAMAGE_PROPERTY_group'])

`groupby` lets us count the non-missing values in each column for each group.

```
df.groupby(['DAMAGE_PROPERTY_group']).count()
```

In [None]:
df.groupby(['DAMAGE_PROPERTY_group']).count()

Now, let's find the mean of each of the columns for each ``DAMAGE_PROPERTY_group``.  *Notice* what happens to the non-numeric columns.

```
df.groupby(['DAMAGE_PROPERTY_group']).mean()
```

In [None]:
df.groupby(['DAMAGE_PROPERTY_group']).mean()

*Note* that in some instances, the values returned are `NaN` or [Not a Number](https://en.wikipedia.org/wiki/NaN).  This is used to represent something that cannot be calculated.  Why do you think some of these values cannot be calculated?

Hint: What is 0/0?

You can specify a groupby using the names of table columns and compute other functions, such as the ``sum``, ``count``, ``std``, and ``describe``.

```
df.groupby(['DAMAGE_PROPERTY_group'])['DEATHS_DIRECT'].describe()
```

In [None]:
df.groupby(['DAMAGE_PROPERTY_group'])['DEATHS_DIRECT'].describe()

The simplest version of a groupby looks like this, and you can use almost any aggregation function you wish (mean, median, sum, minimum, maximum, standard deviation, count, etc.)

```
<data object>.groupby(<grouping values>).<aggregate>()
```
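The template above can be filled in as in the following sketch, which runs on a tiny made-up table.  The name `toy_df`, its values, and the `STATE` column are hypothetical.  Passing `numeric_only=True` to `mean` is shown because, depending on your pandas version, averaging a table that still contains text columns may raise an error instead of skipping them.

```python
import pandas as pd

# Hypothetical events, made up for illustration.
toy_df = pd.DataFrame({
    'EVENT_TYPE': ['Hail', 'Flood', 'Hail', 'Flood'],
    'DEATHS_DIRECT': [0, 2, 0, 4],
    'STATE': ['WA', 'WA', 'OR', 'OR'],  # a non-numeric column
})

# <data object>.groupby(<grouping values>).<aggregate>()
deaths_by_type = toy_df.groupby('EVENT_TYPE')['DEATHS_DIRECT'].sum()
print(deaths_by_type)

# numeric_only=True skips text columns such as STATE when averaging.
means = toy_df.groupby('EVENT_TYPE').mean(numeric_only=True)
print(means)
```

Selecting a single column before aggregating, as in the first call, is another way to keep non-numeric columns out of the calculation.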

You can even group by multiple values: for example, we can look at the statistics grouped by `DAMAGE_PROPERTY_group` and `DEATHS_DIRECT`.

In [None]:
df.groupby(['DAMAGE_PROPERTY_group', 'DEATHS_DIRECT']).describe()

## 3.5 Breakout for Functions and Pandas

Write a function that takes a column in Pandas and computes the [arithmetic mean](https://en.wikipedia.org/wiki/Arithmetic_mean) value of the data in it without using Pandas **_aggregate_** functions.

Compare that result to the one from Pandas **_aggregate_** function `.mean()`.  How did your values compare?  Were they exactly equal?  Did you expect them to be given what you know about **_floating point_** numbers?