[View in Colaboratory](https://colab.research.google.com/github/tompollard/buenosaires2018/blob/master/tableone.ipynb)

# Demonstrating the `tableone` package

In research papers, it is common for the first table ("Table 1") to display summary statistics of the study data. The `tableone` package is used to create this table. For an introduction to basic statistical reporting in biomedical journals, we recommend reading the [SAMPL Guidelines](http://www.equator-network.org/wp-content/uploads/2013/07/SAMPL-Guidelines-6-27-13.pdf). For more reading on accurate reporting in health research, visit the [EQUATOR Network](http://www.equator-network.org/).

## A note for users of `tableone`

While we have tried to use best practices in creating this package, automation of even basic statistical tasks can be unsound if done without supervision. We encourage use of `tableone` alongside other methods of descriptive statistics and, in particular, visualization to ensure appropriate data handling. 

It is beyond the scope of our documentation to provide detailed guidance on summary statistics, but as a primer we provide some considerations for choosing parameters when creating a summary table at: http://tableone.readthedocs.io/en/latest/bestpractice.html. 

*Guidance should be sought from a statistician when using `tableone` for a research study, especially prior to submitting the study for publication*.

## Installation

To install the package with pip, run the following command in your terminal: ``pip install tableone``. To install the package with Conda, run: ``conda install -c conda-forge tableone``. For more detailed installation instructions, refer to the [documentation](http://tableone.readthedocs.io/en/latest/install.html). To install in Colaboratory, use `!pip install tableone`.

In [0]:
# install the tableone package
!pip install tableone

## Importing libraries

Before using the `tableone` package, we need to import it. We will also import `pandas` for loading our sample dataset and `matplotlib` for creating plots.

In [0]:
# import libraries
from tableone import TableOne
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Loading sample data

We begin by loading the data that we would like to summarize into a pandas DataFrame, arranged so that:
- variables are in columns
- encounters/observations are in rows

In [0]:
# load sample data into a pandas dataframe
url = "https://raw.githubusercontent.com/tompollard/tableone/master/data/pn2012_demo.csv"
data = pd.read_csv(url)

In [0]:
data.head()

## Example 1: Simple summary of data with Table 1

In this example we provide summary statistics across all of the data.

In [0]:
# view the tableone docstring
TableOne??

In [0]:
# create an instance of TableOne with the input arguments
# firstly, with no grouping variable
overall_table = TableOne(data)

In [0]:
# display the table
overall_table

**Summary of the table**:
- the first row ('`n`') displays a count of the encounters/observations in the input data.
- the '`isnull`' column displays a count of the null values for the particular variable.
- if categorical variables are not defined in the arguments, they are detected automatically.
- continuous variables (e.g. '`Age`') are summarized by '`mean (std)`'.
- categorical variables (e.g. '`ICU`') are summarized by '`n (% of non-null values)`'.
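To make the entries listed above concrete, here is a minimal sketch that computes the same kinds of quantities by hand with pandas. The toy values are invented for illustration, and the use of the sample standard deviation (the pandas default) is our assumption rather than a statement about how `tableone` computes it.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration only.
toy = pd.DataFrame({
    "Age": [50.0, 60.0, 70.0, None],
    "ICU": ["MICU", "SICU", "SICU", None],
})

n = len(toy)                        # the 'n' row: number of observations
missing = toy.isnull().sum()        # null count for each variable

# Continuous variable: mean and sample standard deviation (nulls are skipped).
age_mean = toy["Age"].mean()
age_std = toy["Age"].std()

# Categorical variable: count and percentage of the non-null values.
counts = toy["ICU"].value_counts()
percent = 100 * counts / counts.sum()

print(f"n: {n}")
print(f"Age, mean (std): {age_mean:.1f} ({age_std:.1f})")
for level in counts.index:
    print(f"ICU={level}, n (%): {counts[level]} ({percent[level]:.1f})")
```

Comparing hand-computed values like these against the generated table is a quick sanity check on how missing data is being handled.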

## Exploring the warning raised by Hartigan's Dip Test

Hartigan's Dip Test is a test for multimodality. The test has suggested that the `Age`, `SysABP`, and `Height` distributions may be multimodal. We'll plot the distributions here.

In [0]:
data[['Age','SysABP','Height']].dropna().plot.kde(figsize=[12,8])
plt.legend(['Age (years)', 'SysABP (mmHg)', 'Height (cm)'])
plt.xlim([-30,250])

## Exploring the warning raised by Tukey's rule

Tukey's rule has found far outliers in `Height`, so we'll look at this variable in a boxplot.

In [0]:
data[['Age','Height','SysABP']].boxplot(whis=3)
plt.show()

In both cases it seems that there are values that may need to be taken into account when calculating the summary statistics. For `SysABP`, a clearly bimodal distribution, the researcher will need to decide how to handle the peak at ~0, perhaps by cleaning the data and/or describing the issue in the summary table. For `Height`, the researcher may choose to report the median rather than the mean.
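As a rough illustration of the rule behind the outlier warning, the sketch below flags values lying more than three interquartile ranges outside the quartiles. This is a common definition of "far" outliers and matches the `whis=3` setting in the boxplot above. The function name `far_outliers`, the toy values, and the reliance on pandas' default quantile interpolation are our assumptions; `tableone` may compute its fences differently.

```python
import pandas as pd

def far_outliers(values: pd.Series, k: float = 3.0) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    clean = values.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return clean[(clean < lower) | (clean > upper)]

# Hypothetical heights in cm, including two implausible entries.
heights = pd.Series([13, 160, 165, 170, 175, 180, 431], dtype=float)
flagged = far_outliers(heights)
print(flagged.tolist())
```

With the real data, `far_outliers(data['Height'])` would list the individual values worth inspecting before deciding whether to clean them or to report a robust statistic instead.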

## Example 2: Table 1 without stratification

In this example we provide summary statistics across all of the data, specifying columns, categorical variables, and non-normal variables.

In [0]:
# columns to be summarized
columns = ['Age', 'SysABP', 'Height', 'Weight', 'ICU', 'death']

# columns containing categorical variables
categorical = ['ICU', 'death']

# non-normal variables
nonnormal = ['Age']

# alternative labels
labels = {'LOS': 'Length of stay', 'death': 'mortality'}

# create tableone with the input arguments
mytable = TableOne(data, columns=columns, categorical=categorical,
                   nonnormal=nonnormal, labels=labels)
mytable

**Summary of the table**:

- as before, except that the variables are explicitly defined in the input arguments.
- continuous variables are now summarized by '`median [IQR]`' if specified as `nonnormal`.
- the `labels` argument means that 'death' is now shown as 'mortality'. The 'LOS' label has no effect here because `LOS` is not among the selected columns.
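The reason for offering '`median [IQR]`' as an alternative is easiest to see on skewed data. The sketch below uses invented values with one extreme observation and computes both summaries with pandas. The output formatting is illustrative and not necessarily identical to what `tableone` prints.

```python
import pandas as pd

# Hypothetical skewed values: four small observations and one extreme one.
values = pd.Series([1, 2, 3, 4, 90], dtype=float)

# Quartiles and median using pandas' default linear interpolation.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])

median_iqr = f"{median:.1f} [{q1:.1f},{q3:.1f}]"
mean_std = f"{values.mean():.1f} ({values.std():.1f})"

print("median [IQR]:", median_iqr)
print("mean (std):  ", mean_std)
```

The single extreme value pulls the mean to 20.0 while the median stays at 3.0, which is why variables like this are candidates for the `nonnormal` list.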

## Example 3: Table 1 with stratification

In this example, we group data across a categorical variable.

In [0]:
# optionally, a categorical variable for stratification
groupby = ['death']

In [0]:
# create an instance of TableOne with the input arguments
grouped_table = TableOne(data, columns=columns, categorical=categorical,
                         groupby=groupby, nonnormal=nonnormal)

In [0]:
# display the table
grouped_table

**Summary of the table**:
- data is now summarized across the groups specified in the `groupby` argument.
- as before, the summary statistics are either '`mean (std)`', '`median [IQR]`', or '`n (% of non-null values)`'.
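For readers who want to cross-check a stratified table, the grouped summaries can be approximated directly in pandas. This is a sketch on invented data that reuses the column names from this notebook. It is not how `tableone` is implemented.

```python
import pandas as pd

# Hypothetical toy data reusing this notebook's column names.
toy = pd.DataFrame({
    "death": [0, 0, 0, 1, 1],
    "Age": [40.0, 50.0, 60.0, 70.0, 80.0],
    "ICU": ["MICU", "SICU", "MICU", "MICU", "MICU"],
})

# Continuous variable: mean and sample std within each group.
age_by_group = toy.groupby("death")["Age"].agg(["mean", "std"])

# Categorical variable: percentage of each level within each group.
icu_pct = pd.crosstab(toy["ICU"], toy["death"], normalize="columns") * 100

print(age_by_group)
print(icu_pct.round(1))
```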

## Computing p values

We can run a test to compute p values by setting the ``pval`` argument to True.

In [0]:
# create grouped_table with p values
grouped_table = TableOne(data, columns=columns, categorical=categorical,
                         groupby=groupby, nonnormal=nonnormal, pval=True)

In [0]:
# display the table
grouped_table

**Summary of the table**:
- the '`ptest`' column displays the name of the test used to compare the groups.
- the '`pval`' column displays the p value generated by the test in the '`ptest`' column, to 3 decimal places.
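To give a sense of what sits behind a p value for a categorical variable, here is a minimal sketch of Pearson's chi-squared test on a 2x2 table of counts, written with the standard library only. The function name `chi2_2x2` and the counts are hypothetical, and this is not `tableone`'s implementation: the test actually applied to each variable is the one named in the '`ptest`' column. Library routines such as `scipy.stats.chi2_contingency` apply a continuity correction to 2x2 tables by default, so their result will differ from this uncorrected version.

```python
import math

def chi2_2x2(table):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, the upper tail is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows are two groups, columns are two outcome levels.
stat, p_value = chi2_2x2([[10, 20], [20, 10]])
print(f"chi2 = {stat:.3f}, p = {p_value:.3f}")
```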

## Exporting the table to file (LaTeX, CSV, etc)

Tables can be exported to file in various formats, including:
- LaTeX
- CSV
- HTML

To export the table, call the relevant `to_<format>()` method on the table object.

In [0]:
# Save table to LaTeX
fn = 'tableone.tex'
grouped_table.to_latex(fn)

In [0]:
# Save table to HTML
fn2 = 'tableone.html'
grouped_table.to_html(fn2)
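CSV is listed above but has no cell of its own. Assuming the table object exposes `to_csv()` in the same way as the `to_latex()` and `to_html()` calls above (an assumption to check against the installed version), the equivalent call would be `grouped_table.to_csv('tableone.csv')`. The sketch below shows the round trip on a stand-in pandas DataFrame with invented values, so that it runs without the package or the dataset.

```python
import os
import tempfile

import pandas as pd

# Stand-in for a summary table; the values are made up for illustration.
summary = pd.DataFrame(
    {"Overall": ["5", "54.0 (23.0)"]},
    index=pd.Index(["n", "Age, mean (std)"], name="variable"),
)

# Write to a temporary directory, then read back to confirm the contents.
csv_path = os.path.join(tempfile.mkdtemp(), "tableone.csv")
summary.to_csv(csv_path)
restored = pd.read_csv(csv_path, index_col="variable")
print(restored)
```

Row labels that contain commas are quoted automatically by `to_csv()`, so they survive the round trip.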