<html>
<table width="100%" cellspacing="2" cellpadding="2" border="1">
<tbody>
<tr>
<td valign="center" align="center" width="25%"><img src="media/decartes.jpg"
alt="DeCART Icon" width="128" height="171"><br>
</td>
<td valign="center" align="center" width="75%">
<h1 align="center"><font size="+1">DeCART Summer School<br>
for<br>
Biomedical Data Science</font></h1></td>
<td valign="center" align="center" width="25%"><img
src="media/U_Health_stacked_png_red.png" alt="Utah Health
Logo" width="128" height="134"><br>
</td>
</tr>
</tbody>
</table>
<br>
</html>

# Basic Numeric Data Characterization

NumPy provides a number of functions/methods for characterizing numeric data.

In [1]:
import os
import numpy as np
import utils
import numpy.random as ra
import pandas as pd

In [2]:
from quizzes.characterizing_numeric_data import *

In [3]:
pd.__version__

'0.23.1'

In [6]:
DATADIR = os.path.join(os.path.expanduser("~"),"DATA")
HRDIR = os.path.join(DATADIR,"Numerics", "mimic2", "hr", "subjects")
BPDIR = os.path.join(DATADIR,"Numerics", "mimic2", "bp", "subjects")

hr_files = os.listdir(HRDIR)
hr_files[10]

'3587.txt'

## Heart Rate file for patient ``3235``

We will start by using [Pandas](http://pandas.pydata.org/) to read, summarize, and visualize numeric data, beginning with MIMIC2 patient ``#3235``. This is what the first five lines of the file look like (you can explore the file in the Linux shell with `less`, `more`, or `cat`):

```Python
88
84
87
78
85
```

Pandas has two basic functions for reading in tabular data: [``read_table``](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html) and [``read_csv``](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). They are really the same function with different default values for the delimiter, the character used to separate the values on each row: a tab (``\t``) for ``read_table`` and a comma (``,``) for ``read_csv``. We will use ``read_table`` to read in the heart rate values.
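
As a minimal sketch of that relationship, the following reads the same tab-separated text with both functions. The in-memory text and the variable names are made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Two rows of tab-separated values (illustrative data, not read from MIMIC2)
text = "128\t68\n128\t69\n"

# read_table splits on tabs by default
from_table = pd.read_table(io.StringIO(text), header=None)

# read_csv splits on commas by default, so the tab must be given explicitly
from_csv = pd.read_csv(io.StringIO(text), sep="\t", header=None)

# Check whether the two DataFrames hold the same values
print(from_table.equals(from_csv))
```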

Both of these functions will return a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html), which is a
>Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). 

Going back to an earlier analogy, a data structure is a complex molecule of data. 

A Pandas DataFrame is an object (everything is an object), and so it has attributes and methods. Two of the methods we will use right off the bat are

* [``head()``](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html#pandas.DataFrame.head)
* [``tail()``](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html#pandas.DataFrame.tail)

which return the first (last) n (default=5) rows of the DataFrame.
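
A small sketch of ``head()`` and ``tail()`` on a hand-built DataFrame may help; the column name and values here are assumptions chosen for illustration rather than data read from a file.

```python
import pandas as pd

# Illustrative values only
df = pd.DataFrame({"heart_rate": [88, 84, 87, 78, 85, 80, 78, 79]})

first_two = df.head(2)     # first 2 rows
last_three = df.tail(3)    # last 3 rows
default_head = df.head()   # n defaults to 5

print(first_two)
print(last_three)
print(len(default_head))
```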

### Read in the data and look at the head

In [7]:

hr = pd.read_table(os.path.join(HRDIR,'3235.txt'))
hr.head()

Unnamed: 0,88
0,84
1,87
2,78
3,85
4,80


The DataFrame provides an **index** for each row (e.g. $0, 1, 2, \cdots$). It will also display the label (name) for each column.

### Does anything seem wrong?

In [8]:
hr.shape, hr.size

((482, 1), 482)

### Reread the data, specifying what to use for the header

In [9]:
hr = pd.read_table(os.path.join(HRDIR,'3235.txt'), header=None)
hr.head()

Unnamed: 0,0
0,88
1,84
2,87
3,78
4,85


This looks better but zero for a column label/name is not very meaningful. We can provide names to use for the columns with a ``names`` keyword argument.

### Reread the data providing a name for the heart rate column

In [11]:
hr = pd.read_table(os.path.join(HRDIR,'3235.txt'), header=None, names=["heart_rate"])
hr.head()

Unnamed: 0,heart_rate
0,88
1,84
2,87
3,78
4,85
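
The effect of ``header`` and ``names`` can be reproduced without the MIMIC2 files. The sketch below uses a five-line in-memory string (an assumption standing in for a subject file) and compares the three ways of reading it.

```python
import io

import pandas as pd

# Five illustrative values, one per line, with no header row
text = "88\n84\n87\n78\n85\n"

# Default: the first line is consumed as the column label
default = pd.read_table(io.StringIO(text))

# header=None: every line is data; columns are labeled 0, 1, ...
no_header = pd.read_table(io.StringIO(text), header=None)

# names=[...]: every line is data and the column gets a meaningful label
named = pd.read_table(io.StringIO(text), header=None, names=["heart_rate"])

print(default.shape, list(default.columns))
print(no_header.shape, list(no_header.columns))
print(named.shape, list(named.columns))
```

With the default settings the frame comes back one row short, which is the same symptom as the ``(482, 1)`` shape seen earlier.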


### DataFrame Attributes

DataFrames have attributes that describe them, including

* ``shape``: shape of the frame
* ``size``: total number of elements in the frame
* ``dtypes``: the data types for each column in the frame

In [12]:
print(hr.shape, hr.size)
print(hr.dtypes)

(483, 1) 483
heart_rate    int64
dtype: object


## Exercise

Use the Pandas ``read_table`` function to read in the blood pressure data for the same patient (``3235``). Answer the following questions:

1. How many rows are in the blood pressure data frame?
1. What data type (e.g. np.uint8) is used for the first column (systolic) of measurements?

In [None]:

data_shape("replace_me_with_the_number_of_rows")

In [15]:
bp = pd.read_table(os.path.join(BPDIR,'3235.txt'), header=None, names=["systolic", "diastolic"])
bp.head()

Unnamed: 0,systolic,diastolic
0,128,68
1,128,69
2,129,74
3,100,54
4,108,56


#### Compute summary statistics

We can compute summary statistics on a Pandas DataFrame or Series using either a numpy function or a method of the DataFrame or Series.

### How do we know what numpy functions are defined or what methods a DataFrame has?
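
One possible way to answer this question is to ask Python itself. The sketch below uses the built-in ``dir()`` to list names on a small, made-up DataFrame and on the numpy module; in a notebook, ``help(df.max)`` or ``df.max?`` would then show the documentation for a specific method.

```python
import numpy as np
import pandas as pd

# A small illustrative DataFrame
df = pd.DataFrame({"heart_rate": [88, 84, 87]})

# Public attributes and methods of the DataFrame (names not starting with "_")
public_methods = [name for name in dir(df) if not name.startswith("_")]
print(len(public_methods))
print([name for name in public_methods if name.startswith("me")])

# The same approach works for a module such as numpy
numpy_names = [name for name in dir(np) if not name.startswith("_")]
print("median" in numpy_names)
```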

In [16]:
print(hr.max())

heart_rate    303
dtype: int64


In [17]:
np.max(hr)

heart_rate    303
dtype: int64

In [18]:
np.max(hr["heart_rate"])

303

In [19]:
hr["heart_rate"].max()

303

## Exercise

What is the median value of the diastolic blood pressure for patient 3235?

In [23]:
median_diastolic(54)

'You provided the correct median value'

In [21]:
diastolic = bp["diastolic"]
diastolic.median()

54.0

In [27]:
bp.describe()

Unnamed: 0,systolic,diastolic
count,485.0,485.0
mean,112.554639,53.513402
std,15.861865,6.555416
min,36.0,31.0
25%,101.0,50.0
50%,113.0,54.0
75%,122.0,57.0
max,166.0,79.0


In [39]:
hr.head()


Unnamed: 0,heart_rate,one
0,88,1
1,84,1
2,87,1
3,78,1
4,85,1


### Python Aside

It can be hard to keep track of which variables (including functions) are defined in your workspace. Python provides some functions for exploring this question:

* `dir()`
* `globals()` 
* `locals()`

IPython provides magics for this: `%who` and `%whos`.

I find keeping track of variables in Jupyter notebooks more important than in traditional scripting because I end up with a lot of global variables.
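
A brief sketch of the built-in functions follows; the function name ``show_locals`` is hypothetical and exists only for this illustration.

```python
def show_locals():
    inner = 42                 # exists only inside this function call
    return sorted(locals())    # names defined in the function's local scope


# Names local to the function
print(show_locals())

# dir() with no argument lists the names in the current scope
print("show_locals" in dir())

# globals() is a dictionary mapping module-level names to objects
print(type(globals()))
```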

In [28]:
%who

BPDIR	 DATADIR	 HRDIR	 assert_equal	 bp	 data_shape	 data_types	 diastolic	 hr	 
hr_files	 median_diastolic	 np	 numbers	 os	 pd	 ra	 utils	 


In [29]:
%whos

Variable           Type         Data/Info
-----------------------------------------
BPDIR              str          /home/clinicaldatahub/DAT<...>merics/mimic2/bp/subjects
DATADIR            str          /home/clinicaldatahub/DATA
HRDIR              str          /home/clinicaldatahub/DAT<...>merics/mimic2/hr/subjects
assert_equal       method       <bound method TestCase.as<...>al.Dummy testMethod=nop>>
bp                 DataFrame         systolic  diastolic\<...>n\n[485 rows x 2 columns]
data_shape         function     <function data_shape at 0x7f3d9552ed90>
data_types         function     <function data_types at 0x7f3d9552ed08>
diastolic          Series       0      68\n1      69\n2  <...>Length: 485, dtype: int64
hr                 DataFrame         heart_rate\n0       <...>n\n[483 rows x 1 columns]
hr_files           list         n=3903
median_diastolic   function     <function median_diastolic at 0x7f3d94e80bf8>
np                 module       <module 'numpy' from '/op<...>kages/

## Exercise

To explore equality of floating point numbers, perturb your answer by adding ``0.1``, ``0.01``, etc. until your perturbed answer is considered equal.

In [35]:
median_diastolic(54 + 0.000000000000001)

'You provided the correct median value'
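
How the quiz function compares values is not shown here, so the following is only a sketch of the underlying float64 behaviour: a perturbation smaller than the spacing between neighbouring floats near 54 is rounded away, while a larger one is not. The tolerance passed to ``math.isclose`` is an arbitrary choice for illustration.

```python
import math

import numpy as np

# Gap between 54.0 and the next larger representable float64
print(np.spacing(54.0))

# A perturbation well below that gap is rounded away entirely
print(54 + 0.000000000000001 == 54.0)

# A larger perturbation survives, so exact equality fails
print(54 + 0.1 == 54.0)

# A tolerance-based comparison makes the acceptable error explicit
print(math.isclose(54 + 0.1, 54.0, abs_tol=0.5))
```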

## [``describe()``](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html)

Pandas DataFrames (Series) come with a ``describe()`` method that provides summary statistics.

In [37]:
round(hr.describe(),2)

Unnamed: 0,heart_rate
count,483.0
mean,92.86
std,14.4
min,75.0
25%,85.0
50%,90.0
75%,99.0
max,303.0


## Creating a New Column

We can create a new column in the DataFrame with a simple assignment statement:

In [41]:
hr["one"] = 1
hr

Unnamed: 0,heart_rate,one
0,88,1
1,84,1
2,87,1
3,78,1
4,85,1
5,80,1
6,78,1
7,79,1
8,78,1
9,86,1


In [42]:
hr["range"] = range(len(hr))

In [43]:
hr["inverse range"] = range(len(hr),0, -1)
hr.head()

Unnamed: 0,heart_rate,one,range,inverse range
0,88,1,0,483
1,84,1,1,482
2,87,1,2,481
3,78,1,3,480
4,85,1,4,479


We can also create a new column based on a function of the existing columns.

We can do this in two ways

### Method 1

In [44]:
hr["range_diff"] = hr["inverse range"] - hr["range"]
hr.head()

Unnamed: 0,heart_rate,one,range,inverse range,range_diff
0,88,1,0,483,483
1,84,1,1,482,481
2,87,1,2,481,479
3,78,1,3,480,477
4,85,1,4,479,475


### Method 2

Our second method uses the ``apply()`` method and a function. For this, I frequently use an [**anonymous function**](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions).

#### Anonymous Functions

The syntax for anonymous functions is as follows:

```Python
lambda variable(s): some_function_of_the_variable(s)
```

So an anonymous doubling function would be

```Python
lambda x: 2*x
```

Here is an anonymous function that returns the upper case version of a string

```Python
lambda y: y.upper()
```
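
The two expressions above can be tried out by binding them to names (done here only so they can be called; the names ``double`` and ``shout`` are made up for illustration). More typically an anonymous function is passed straight to another function, as in the last line.

```python
double = lambda x: 2 * x
shout = lambda y: y.upper()

print(double(21))
print(shout("heart rate"))

# Passing an anonymous function as an argument: case-insensitive sort
labels = sorted(["bp", "HR", "age"], key=lambda s: s.lower())
print(labels)
```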

#### Pandas ``apply()``

* ``apply()`` applies a function along an axis of a DataFrame (to each column or each row); for a Series, it applies the function to each element
* We specify whether the function is applied to columns or to rows with the keyword argument ``axis`` (the default, ``axis=0``, passes each column to the function; ``axis=1`` passes each row)
    * Remembering to set `axis` is very important and something I frequently forget to do
* When we apply by rows, we get a variable that contains each row; we can access specific columns within the row

In [45]:
hr["range_diff2"] = hr.apply(lambda row: row["range_diff"] - row["range"], 
                             axis=1)
hr.head()

Unnamed: 0,heart_rate,one,range,inverse range,range_diff,range_diff2
0,88,1,0,483,483,483
1,84,1,1,482,481,480
2,87,1,2,481,479,477
3,78,1,3,480,477,474
4,85,1,4,479,475,471
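
To make the role of ``axis`` concrete, here is a minimal sketch on a tiny DataFrame whose column names (``a``, ``b``) and values are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
col_ranges = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_diffs = df.apply(lambda row: row["b"] - row["a"], axis=1)

print(col_ranges)
print(row_diffs)
```

For simple arithmetic like this, the column expression used in Method 1 (``df["b"] - df["a"]``) gives the same result as the row-wise ``apply`` and is usually faster on large frames.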


## Exercise:

[Data standardization](http://scikit-learn.org/stable/modules/preprocessing.html) transforms numeric data ($x$) by subtracting the mean ($\mu_x$) and dividing by the standard deviation ($\sigma_x$):

$$\tilde{x} = \frac{x-\mu_x}{\sigma_x}$$

Read in one of the heart rate data files and compute the standardized form of the data. Assign this normalized heart rate to the DataFrame with a label of **"normalized hr"**.

In [47]:
hr2 = pd.read_table(os.path.join(HRDIR, '10804.txt'), header=None, names=["heart_rate"])
# Standardize: subtract the mean, then divide by the (sample) standard deviation
hr2["normalized hr"] = (hr2["heart_rate"] - hr2["heart_rate"].mean()) / hr2["heart_rate"].std()
hr2.head()
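
The standardization formula can be sanity-checked on a few made-up values without reading any file. This is a sketch under the assumption that the sample standard deviation (pandas' default, ``ddof=1``) is the one intended; the values are illustrative only.

```python
import pandas as pd

# Illustrative values only, not read from a MIMIC2 file
df = pd.DataFrame({"heart_rate": [88, 84, 87, 78, 85, 80, 78, 79]})

mu = df["heart_rate"].mean()
sigma = df["heart_rate"].std()   # pandas uses the sample standard deviation (ddof=1)

df["normalized hr"] = (df["heart_rate"] - mu) / sigma

# After standardization the column should have mean ~0 and standard deviation ~1
print(df["normalized hr"].mean(), df["normalized hr"].std())
```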

## Exercise


The Student's t-test is often used to compare two populations. Using ``len``, ``math.sqrt``, and numpy methods/functions, compute the [t value](https://en.wikipedia.org/wiki/Student%27s_t-test#Independent_two-sample_t-test) for two files.

$$t = \frac{\bar{X_1}-\bar{X_2}}{s_{X_1X_2}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$

where

$$s_{X_1X_2} = \sqrt{\frac{(n_1-1)s_{X_1}^2 + (n_2-1)s_{X_2}^2}{n_1+n_2-2}}.$$

1. Pick two heart rate files and compute the t value for the two files.
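
As a starting point, the two formulas can be translated term by term into a small function. This is one possible sketch rather than a reference solution: the function name ``pooled_t`` and the two short arrays are assumptions for illustration, and reading the actual files is left to the exercise.

```python
import math

import numpy as np


def pooled_t(x1, x2):
    """Two-sample t value using the pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    var1 = np.var(x1, ddof=1)   # sample variance s_{X_1}^2
    var2 = np.var(x2, ddof=1)   # sample variance s_{X_2}^2
    # Pooled standard deviation s_{X_1 X_2}
    s_pooled = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / (s_pooled * math.sqrt(1 / n1 + 1 / n2))


# Illustrative values only
a = np.array([88, 84, 87, 78, 85])
b = np.array([80, 78, 79, 78, 86])

t_value = pooled_t(a, b)
print(t_value)
```

Swapping the two samples flips the sign of ``t``, and comparing a sample with itself gives ``0``, which makes for two quick checks of any implementation.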

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">University of Utah Data Science for Health</span> by <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Brian E. Chapman</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.