# Computing Descriptive Statistics with Pandas
*Curtis Miller*

Often in data analysis projects we begin with descriptive statistics to get a sense of a dataset's properties. Fortunately it is easy to get these statistics from Pandas `DataFrame`s.

I illustrate by computing various descriptive statistics for the classic [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).

## Loading in `iris`

The following code loads in the packages we will need and also the `iris` dataset.

In [None]:
import pandas as pd
from pandas import DataFrame
from sklearn.datasets import load_iris    # sklearn.datasets includes common example datasets

In [None]:
iris_obj = load_iris()    # A function to load in the iris dataset
iris_obj.data    # Dataset preview

In [None]:
iris_obj.feature_names

In [None]:
iris_obj.target

In [None]:
iris_obj.target_names

`load_iris()` loads in an object containing the iris dataset, which I stored in `iris_obj`. I now turn this into a `DataFrame`.

In [None]:
iris = DataFrame(iris_obj.data, columns=iris_obj.feature_names)
iris["species"] = iris_obj.target    # Integer species codes (0, 1, 2)
iris

In [None]:
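# Assign the result back; calling replace(..., inplace=True) on a selected column may not update iris
iris["species"] = iris["species"].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})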
iris

For this particular dataset, the grouping by species suggests that descriptive statistics should be computed separately for each group. We create the groups like so.

In [None]:
iris_grps = iris.groupby("species")

for name, data in iris_grps:
    print(name)
    print("---------------------\n\n")
    print(data.iloc[:, 0:4])
    print("\n\n\n")

A lot of the methods for getting summary statistics for a `DataFrame` also work for group objects.
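
As a small sketch of what a group object offers, the following uses a made-up three-row table (the column names and values are hypothetical, not taken from `iris`) to show how group sizes and a single group can be pulled out.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"species": ["a", "a", "b"],
                    "length": [1.0, 2.0, 5.0]})

groups = toy.groupby("species")

sizes = groups.size()               # Number of rows in each group
group_a = groups.get_group("a")     # The rows belonging to group "a"

print(sizes)
print(group_a)
```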

## Getting the Basics

Let's compute some basic statistics.

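I use $n$ to denote the sample size, which is the number of rows in the dataset. The method `count()` returns the number of non-missing values in each column; since `iris` has no missing values, this equals $n$ for every column.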

In [None]:
iris.count()
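
When a dataset does have missing values, `count()` and the number of rows disagree. The sketch below uses a hypothetical toy table with one missing entry to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in column "a"
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan],
                    "b": [4.0, 5.0, 6.0]})

n_rows = len(toy)            # Number of rows, whether or not values are missing
non_missing = toy.count()    # Number of non-missing values in each column

print(n_rows)
print(non_missing)
```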

The **sample mean** is the arithmetic mean of the dataset.

$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$$

In [None]:
iris.mean(numeric_only=True)    # Sample mean for every numeric column (skips the text column species)
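
To connect the formula to the method, here is a minimal sketch on a hypothetical toy table that mixes a numeric column with a text column. Passing `numeric_only=True` restricts the computation to the numeric columns, and the result matches the sum divided by $n$.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one text column
toy = pd.DataFrame({"x": [2.0, 4.0, 9.0],
                    "label": ["u", "v", "w"]})

means = toy.mean(numeric_only=True)    # Only the numeric column "x" is summarized

# The same quantity from the formula: sum of the values divided by n
manual_mean = toy["x"].sum() / len(toy["x"])

print(means)
print(manual_mean)
```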

The **sample median** is the "middle" data point, after ordering the dataset. Let $x_{(i)}$ represent ordered data ($x_{(1)}$ is smallest, $x_{(n)}$ largest).

$$\tilde{x} = \begin{cases}
x_{\left(\frac{n+1}{2}\right)} & \text{ if } n \text{ is odd} \\
\frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2} + 1\right)}\right) & \text{ if } n \text{ is even} \\
\end{cases}$$

In [None]:
iris.median(numeric_only=True)
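
The two cases of the median formula can be sketched directly. The helper `manual_median` below is a hypothetical function written for illustration; it is compared against `Series.median()` on two small made-up samples, one of odd and one of even size.

```python
import pandas as pd

def manual_median(values):
    """Median from the piecewise definition, using 0-based list indices."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                           # Middle point when n is odd
    return 0.5 * (ordered[mid - 1] + ordered[mid])    # Mean of the two middle points when n is even

odd_sample = [7.0, 1.0, 3.0]          # Hypothetical data, n = 3
even_sample = [7.0, 1.0, 3.0, 5.0]    # Hypothetical data, n = 4

print(manual_median(odd_sample), pd.Series(odd_sample).median())
print(manual_median(even_sample), pd.Series(even_sample).median())
```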

The **sample variance** is a measure of dispersion, roughly the "average" squared distance of a data point from the mean. The **standard deviation** is the square root of the variance and interpreted as the "average" distance a data point is from the mean.

$$s^2 = \frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2$$
$$s = \sqrt{s^2}$$

In [None]:
iris.var(numeric_only=True)

In [None]:
iris.std(numeric_only=True)
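
The formula above divides by $n - 1$, which is also what `var()` and `std()` do by default (`ddof=1`). The sketch below checks this on a hypothetical sample and shows that `ddof=0` switches to dividing by $n$ instead.

```python
import pandas as pd

# Hypothetical sample with mean 5 and squared deviations summing to 32
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

# Sample variance from the formula, with the n - 1 denominator
manual_var = ((x - x.mean()) ** 2).sum() / (n - 1)

sample_var = x.var()                # Default ddof=1, divides by n - 1
population_var = x.var(ddof=0)      # Divides by n instead
population_std = x.std(ddof=0)

print(manual_var, sample_var, population_var, population_std)
```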

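The **$p$th percentile** is a value such that roughly $p$% of the data falls below it. Percentiles are also referred to as quantiles, which are indexed by a fraction between 0 and 1 rather than by a percentage. By default `quantile()` interpolates linearly between neighboring data points, so the result need not be a value that appears in the dataset.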

In [None]:
iris.quantile(.1, numeric_only=True)   # The 10th percentile

In [None]:
iris.quantile(.95, numeric_only=True)    # The 95th percentile

In [None]:
iris.quantile(.75, numeric_only=True)    # Commonly known as the third quartile

In [None]:
iris.quantile(.25, numeric_only=True)    # Commonly known as the first quartile
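
How a quantile is computed when it falls between two data points depends on the `interpolation` argument of `quantile()`. The following sketch, on a hypothetical four-point sample, compares the default linear interpolation with the `"lower"` and `"higher"` options, which always return an observed value.

```python
import pandas as pd

# Hypothetical sample; the first quartile falls between 1 and 2
x = pd.Series([1.0, 2.0, 3.0, 4.0])

q1_linear = x.quantile(0.25)                           # Default: linear interpolation
q1_lower = x.quantile(0.25, interpolation="lower")     # Nearest observed value below
q1_higher = x.quantile(0.25, interpolation="higher")   # Nearest observed value above

print(q1_linear, q1_lower, q1_higher)
```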

If $Q_i$ denotes the $i$th quartile, the **interquartile range** (**IQR**) is the difference between the third quartile and the first quartile.

$$IQR = Q_3 - Q_1$$

In [None]:
# pandas has no dedicated IQR method, but it is nevertheless easy to obtain
iris.quantile(.75, numeric_only=True) - iris.quantile(.25, numeric_only=True)
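
If the IQR is needed repeatedly, it can be wrapped in a small helper and applied column by column. The function name `iqr` and the toy table below are hypothetical, chosen only to illustrate the idea.

```python
import pandas as pd

def iqr(s):
    """Interquartile range of a Series: third quartile minus first quartile."""
    return s.quantile(0.75) - s.quantile(0.25)

# Hypothetical toy data with two numeric columns
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
                    "b": [10.0, 20.0, 30.0, 40.0, 50.0]})

iqrs = toy.apply(iqr)    # One IQR per column

print(iqrs)
```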

Other interesting quantities include the maximum and minimum values.

In [None]:
iris.max()

In [None]:
iris.min()

Many of these summaries work for grouped data as well.

In [None]:
iris_grps.mean()

In [None]:
iris_grps.std()

In [None]:
iris_grps.quantile(.75)

In [None]:
iris_grps.quantile(.75) - iris_grps.quantile(.25)

## Other Useful Methods

The method `describe()` gets a number of useful summaries for a dataset.

In [None]:
iris.describe()

In [None]:
# This also works well for grouped data.
iris_grps.describe()
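
The percentiles reported by `describe()` can be changed through its `percentiles` argument. Here is a short sketch on a hypothetical ten-row table that requests the 10th and 90th percentiles; the median (50%) is still included.

```python
import pandas as pd

# Hypothetical toy data: the numbers 1 through 10
toy = pd.DataFrame({"x": [float(i) for i in range(1, 11)]})

summary = toy.describe(percentiles=[0.1, 0.9])    # Request the 10th and 90th percentiles

print(summary)
```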

If we want custom numerical summaries, we can write functions to compute them for Pandas `Series` then apply them to the columns of a `DataFrame`.

I demonstrate by writing a function that computes the **range**, which is the difference between the maximum and minimum of a dataset.

$$\text{range} = x_{(n)} - x_{(1)}$$

In [None]:
# Compute the range of a dataset
def range_stat(s):
    return s.max() - s.min()

iris.iloc[:, 0:4].apply(range_stat)

In [None]:
# Use aggregate() for groups
iris_grps.aggregate(range_stat)
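
Several summaries can be requested at once by passing a list to `aggregate()` (or its alias `agg()`), mixing built-in names with custom functions. The sketch below assumes a hypothetical toy table and redefines `range_stat` so that it runs on its own.

```python
import pandas as pd

def range_stat(s):
    """Range of a Series: maximum minus minimum."""
    return s.max() - s.min()

# Hypothetical toy data with two groups
toy = pd.DataFrame({"group": ["a", "a", "b", "b"],
                    "value": [1.0, 4.0, 10.0, 12.0]})

# One row per group, one column per requested summary
summary = toy.groupby("group")["value"].agg(["mean", range_stat])

print(summary)
```

The resulting columns are named after the summaries, so the custom one appears as `range_stat`.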