# **Python Statistics Fundamentals: How to Describe Your Data**

# Understanding Descriptive Statistics
**Descriptive statistics** is about describing and summarizing data. It uses two main approaches:

- **The quantitative approach** describes and summarizes data numerically.
- **The visual approach** illustrates data with charts, plots, histograms, and other graphs.

You can apply descriptive statistics to one or many datasets or variables. When you describe and summarize a single variable, you're performing **univariate analysis**. When you search for statistical relationships among a pair of variables, you're doing a **bivariate analysis**. Similarly, a **multivariate analysis** is concerned with multiple variables at once.

# Types of Measures

In this tutorial, you'll learn about the following types of measures in descriptive statistics:

- **Central tendency** tells you about the centers of the data. Useful measures include the mean, median, and mode.
- **Variability** tells you about the spread of the data. Useful measures include variance and standard deviation.
- **Correlation or joint variability** tells you about the relation between a pair of variables in a dataset. Useful measures include covariance and the correlation coefficient.

# Population and Samples
In statistics, the **population** is a set of all elements or items that you're interested in. Populations are often vast, which makes them inappropriate for collecting and analyzing data. That's why statisticians usually try to make some conclusions about a population by choosing and examining a representative subset of that population.

This subset of a population is called a **sample**. Ideally, the sample should preserve the essential statistical features of the population to a satisfactory extent. That way, you'll be able to use the sample to glean conclusions about the population.

# Outliers
An **outlier** is a data point that differs significantly from the majority of the data taken from a sample or population. There are many possible causes of outliers, but here are a few to start you off:

- Natural variation in data
- Change in the behavior of the observed system
- Errors in data collection

Data collection errors are a particularly prominent cause of outliers. For example, the limitations of measurement instruments or procedures can mean that the correct data is simply not obtainable. Other errors can be caused by miscalculations, data contamination, human error, and more.

There isn't a precise mathematical definition of outliers. You have to rely on experience, knowledge about the subject of interest, and common sense to determine if a data point is an outlier and how to handle it.

# Choosing Python Statistics Libraries

There are many Python statistics libraries out there for you to work with, but in this tutorial, you'll be learning about some of the most popular and widely used ones:

- **Python's statistics** is a built-in Python library for descriptive statistics. You can use it if your datasets are not too large or if you can't rely on importing other libraries.

- **NumPy** is a third-party library for numerical computing, optimized for working with single- and multi-dimensional arrays. Its primary type is the array type called ndarray. This library contains many routines for statistical analysis.

- **SciPy** is a third-party library for scientific computing based on NumPy. It offers additional functionality compared to NumPy, including scipy.stats for statistical analysis.

- **pandas** is a third-party library for numerical computing based on NumPy. It excels in handling labeled one-dimensional (1D) data with Series objects and two-dimensional (2D) data with DataFrame objects.

- **Matplotlib** is a third-party library for data visualization. It works well in combination with NumPy, SciPy, and pandas.

Note that, in many cases, Series and DataFrame objects can be used in place of NumPy arrays. Often, you might just pass them to a NumPy or SciPy statistical function. In addition, you can get the unlabeled data from a Series or DataFrame as a np.ndarray object by calling .values or .to_numpy().

# Getting Started With Python Statistics Libraries

The built-in Python statistics library has a relatively small number of the most important statistics functions. The official documentation is a valuable resource to find the details. If you're limited to pure Python, then the Python statistics library might be the right choice.

A good place to start learning about NumPy is the official User Guide, especially the quickstart and basics sections. The official reference can help you refresh your memory on specific NumPy concepts. While you read this tutorial, you might want to check out the statistics section and the official scipy.stats reference as well.

If you want to learn pandas, then the official Getting Started page is an excellent place to begin. The introduction to data structures can help you learn about the fundamental data types, Series and DataFrame. Likewise, the excellent official introductory tutorial aims to give you enough information to start effectively using pandas in practice.

Matplotlib has a comprehensive official User's Guide that you can use to dive into the details of using the library. Anatomy of Matplotlib is an excellent resource for beginners who want to start working with Matplotlib and its related libraries.

# Calculating Descriptive Statistics
Start by importing all the packages you'll need for Python statistics calculations. Usually, you won't use Python's built-in math package, but it'll be useful in this tutorial. Later, you'll import matplotlib.pyplot for data visualization:

In [1]:
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd

Let's create some data to work with. You'll start with two Python lists, x and x_with_nan, that contain some arbitrary numeric data. They're almost the same, with the difference that x_with_nan contains a nan value. It's important to understand the behavior of the Python statistics routines when they come across a not-a-number value (nan). In data science, missing values are common, and you'll often replace them with nan:

In [4]:
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
print(x)
print(x_with_nan)


[8.0, 1, 2.5, 4, 28.0]
[8.0, 1, 2.5, nan, 4, 28.0]


Now, create np.ndarray and pd.Series objects that correspond to x and x_with_nan. This gives you two NumPy arrays (y and y_with_nan) and two pandas Series (z and z_with_nan), all of which are 1D sequences of values:

In [8]:
y, y_with_nan = np.array(x), np.array(x_with_nan)
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)
print("array",y)
print("array",y_with_nan)
print(z)

print(z_with_nan)


array [ 8.   1.   2.5  4.  28. ]
array [ 8.   1.   2.5  nan  4.  28. ]
0     8.0
1     1.0
2     2.5
3     4.0
4    28.0
dtype: float64
0     8.0
1     1.0
2     2.5
3     NaN
4     4.0
5    28.0
dtype: float64


# Measures of Central Tendency
The **measures of central tendency** show the central or middle values of datasets. There are several definitions of what's considered to be the center of a dataset. In this tutorial, you'll learn how to identify and calculate these measures of central tendency:

- Mean
- Weighted mean
- Geometric mean
- Harmonic mean
- Median
- Mode

# Mean
The **sample mean**, also called the **sample arithmetic mean** or simply the **average**, is the arithmetic average of all the items in a dataset. The mean of a dataset 𝑥 is mathematically expressed as Σᵢ𝑥ᵢ/𝑛, where 𝑖 = 1, 2, …, 𝑛. In other words, it's the sum of all the elements 𝑥ᵢ divided by the number of items in the dataset 𝑥.

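You can calculate the mean with pure Python by combining sum() and len(), without importing any libraries:

In [None]:
x = 1, 2.5, 4, 8, 28
mean_ = sum(x) / len(x)
print(mean_)

8.7

Although this is clean and elegant, you can also apply built-in Python statistics functions: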

In [11]:
mean_ = statistics.mean(x)
print(mean_)
mean_ = statistics.fmean(x)
print(mean_)

8.7
8.7


You've called the functions mean() and fmean() from the built-in Python statistics library and got the same result as you did with pure Python. fmean() was introduced in Python 3.8 as a faster alternative to mean(). It always returns a floating-point number.

However, if there are nan values among your data, then statistics.mean() and statistics.fmean() will return nan as the output:

In [12]:
mean_ = statistics.mean(x_with_nan)
print(mean_)
mean_ = statistics.fmean(x_with_nan)
print(mean_)

nan
nan


This result is consistent with the behavior of sum(), because sum(x_with_nan) also returns nan.

If you use NumPy, then you can get the mean with the function np.mean() or with the corresponding method .mean():

In [13]:
mean_ = np.mean(y)
mean_

np.float64(8.7)

In [14]:
mean_ = y.mean()
mean_

np.float64(8.7)

The function mean() and method .mean() from NumPy return the same result as statistics.mean(). This is also the case when there are nan values among your data:

In [15]:
np.mean(y_with_nan)

y_with_nan.mean()


np.float64(nan)

Often, a nan result isn't what you want. If you prefer to ignore nan values, then you can use np.nanmean():

In [16]:
np.nanmean(y_with_nan)


np.float64(8.7)

nanmean() simply ignores all nan values. It returns the same value as mean() if you were to apply it to the dataset without the nan values.

pd.Series objects also have the method .mean():

In [17]:
mean_ = z.mean()
mean_


np.float64(8.7)

As you can see, it's used similarly as in the case of NumPy. However, .mean() from pandas ignores nan values by default. This behavior is the result of the default value of the optional parameter skipna, which you can change to modify the behavior:

In [18]:
z_with_nan.mean()

np.float64(8.7)
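
If you'd rather have nan propagate, as it does in NumPy, you can pass skipna=False. The following is a minimal sketch that assumes the same values as z_with_nan; the variable names mean_skipping and mean_propagating are only illustrative:

```python
import math
import pandas as pd

z_with_nan = pd.Series([8.0, 1, 2.5, math.nan, 4, 28.0])

# skipna=True is the default, so the nan entry is left out of the mean.
mean_skipping = z_with_nan.mean()

# With skipna=False, a single nan makes the whole result nan.
mean_propagating = z_with_nan.mean(skipna=False)

print(mean_skipping)
print(mean_propagating)
```

Choosing between the two comes down to whether a missing value should be silently dropped or should flag the result as unreliable.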

# Weighted Mean
The **weighted mean**, also called the **weighted arithmetic mean** or **weighted average**, is a generalization of the arithmetic mean that enables you to define the relative contribution of each data point to the result.

You define one **weight 𝑤ᵢ** for each data point 𝑥ᵢ of the dataset 𝑥, where 𝑖 = 1, 2, …, 𝑛 and 𝑛 is the number of items in 𝑥. Then, you multiply each data point with the corresponding weight, sum all the products, and divide the obtained sum with the sum of weights: Σᵢ(𝑤ᵢ𝑥ᵢ) / Σᵢ𝑤ᵢ.

Note: It's convenient (and usually the case) that all weights are nonnegative, 𝑤ᵢ ≥ 0, and that their sum is equal to one, or Σᵢ𝑤ᵢ = 1.

The weighted mean is very handy when you need the mean of a dataset containing items that occur with given relative frequencies. For example, say that you have a set in which 20% of all items are equal to 2, 50% of the items are equal to 4, and the remaining 30% of the items are equal to 8. You can calculate the mean of such a set like this:

In [19]:
0.2 * 2 + 0.5 * 4 + 0.3 * 8


4.8

Here, you take the frequencies into account with the weights. With this method, you don't need to know the total number of items.
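
To see why the total number of items doesn't matter, you can compare the weighted expression with the plain mean of datasets that have the same proportions. This is a small sketch; the lists small and large are hypothetical datasets made up for the comparison:

```python
import math
import statistics

# Shares from the example: 20% twos, 50% fours, 30% eights.
weighted = 0.2 * 2 + 0.5 * 4 + 0.3 * 8

# Two hypothetical datasets with the same proportions but different sizes.
small = [2] * 2 + [4] * 5 + [8] * 3
large = [2] * 200 + [4] * 500 + [8] * 300

mean_small = statistics.fmean(small)
mean_large = statistics.fmean(large)

# Compare with a tolerance because of floating-point rounding.
print(math.isclose(weighted, mean_small))
print(math.isclose(weighted, mean_large))
```

Both datasets give the same mean as the weighted expression, up to floating-point rounding, because only the relative frequencies enter the calculation.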

You can implement the weighted mean in pure Python by combining sum() with either range() or zip():

In [20]:
x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]
wmean = sum(w[i] * x[i] for i in range(len(x))) / sum(w)
print(wmean)

wmean = sum(x_ * w_ for (x_, w_) in zip(x, w)) / sum(w)
print(wmean)


6.95
6.95


Again, this is a clean and elegant implementation where you don't need to import any libraries.

However, if you have large datasets, then NumPy is likely to provide a better solution. You can use np.average() to get the weighted mean of NumPy arrays or pandas Series:

In [21]:
y, z, w = np.array(x), pd.Series(x), np.array(w)
wmean = np.average(y, weights=w)
print(wmean)

wmean = np.average(z, weights=w)
print(wmean)


6.95
6.95


The result is the same as in the case of the pure Python implementation. You can also use this method on ordinary lists and tuples.

Another solution is to use the element-wise product w * y with np.sum() or .sum():

In [22]:
(w * y).sum() / w.sum()


np.float64(6.95)

That's it! You've calculated the weighted mean.

However, be careful if your dataset contains nan values. In this case, np.average() returns nan, which is consistent with np.mean():

In [23]:
w = np.array([0.1, 0.2, 0.3, 0.0, 0.2, 0.1])
print((w * y_with_nan).sum() / w.sum())

print(np.average(y_with_nan, weights=w))

print(np.average(z_with_nan, weights=w))


nan
nan
nan
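
Note that giving the nan entry a weight of 0.0 doesn't help, because nan * 0.0 is still nan. One possible workaround, sketched here under the assumption that you simply want to drop the missing positions, is to filter the data and the weights with the same Boolean mask:

```python
import numpy as np

y_with_nan = np.array([8.0, 1, 2.5, np.nan, 4, 28.0])
w = np.array([0.1, 0.2, 0.3, 0.0, 0.2, 0.1])

# True where the data point is a real number, False where it is nan.
mask = ~np.isnan(y_with_nan)

# Apply the same mask to the data and the weights so they stay aligned.
wmean = np.average(y_with_nan[mask], weights=w[mask])
print(wmean)
```

np.average() divides by the sum of the remaining weights, so the result is renormalized automatically.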


# Harmonic Mean

The **harmonic mean** is the reciprocal of the mean of the reciprocals of all items in the dataset: 𝑛 / Σᵢ(1/𝑥ᵢ), where 𝑖 = 1, 2, …, 𝑛 and 𝑛 is the number of items in the dataset 𝑥. One variant of the pure Python implementation of the harmonic mean is this:

In [24]:
hmean = len(x) / sum(1 / item for item in x)
hmean


2.7613412228796843

It's quite different from the value of the arithmetic mean for the same data x, which you calculated to be 8.7.

You can also calculate this measure with statistics.harmonic_mean():

In [25]:
hmean = statistics.harmonic_mean(x)
hmean


2.7613412228796843

The example above shows one way to use statistics.harmonic_mean(). If you have a nan value in a dataset, then it'll return nan. If there's at least one 0, then it'll return 0. If you provide at least one negative number, then you'll get statistics.StatisticsError:

In [27]:
print(statistics.harmonic_mean(x_with_nan))

print(statistics.harmonic_mean([1, 0, 2]))

statistics.harmonic_mean([1, 2, -2])  # Raises StatisticsError

nan
0


StatisticsError: harmonic mean does not support negative values

Keep these three scenarios in mind when you're using this method!

A third way to calculate the harmonic mean is to use scipy.stats.hmean():

In [28]:
print(scipy.stats.hmean(y))
print(scipy.stats.hmean(z))


2.7613412228796843
2.7613412228796843


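Again, this is a pretty straightforward implementation. However, if your dataset contains nan, 0, a negative number, or anything but positive numbers, then the outcome depends on your SciPy version: older releases raise a ValueError, while newer ones may return nan or 0 instead. Check the scipy.stats.hmean() documentation for the version you have installed.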

# Geometric Mean
The **geometric mean** is the 𝑛-th root of the product of all 𝑛 elements 𝑥ᵢ in a dataset 𝑥: ⁿ√(Πᵢ𝑥ᵢ), where 𝑖 = 1, 2, …, 𝑛. The following figure illustrates the arithmetic, harmonic, and geometric means of a dataset:

Again, the green dots represent the data points 1, 2.5, 4, 8, and 28. The red dashed line is the mean. The blue dashed line is the harmonic mean, and the yellow dashed line is the geometric mean.

You can implement the geometric mean in pure Python like this:

In [31]:
gmean = 1
for item in x:
    gmean *= item

gmean **= 1 / len(x)
gmean


4.677885674856041

As you can see, the value of the geometric mean, in this case, differs significantly from the values of the arithmetic (8.7) and harmonic (2.76) means for the same dataset x.
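
For long datasets, multiplying all the items can overflow or underflow a float. A common alternative, shown here as a sketch rather than as the method used elsewhere in this tutorial, is to average the logarithms and exponentiate the result. The same snippet also checks that, for this positive dataset, the harmonic mean is the smallest of the three means and the arithmetic mean is the largest:

```python
import math
import statistics

x = [8.0, 1, 2.5, 4, 28.0]

# exp(mean(log(x))) equals the n-th root of the product for positive data.
gmean_log = math.exp(math.fsum(math.log(item) for item in x) / len(x))

amean = statistics.fmean(x)
hmean = statistics.harmonic_mean(x)

print(hmean, gmean_log, amean)
print(hmean <= gmean_log <= amean)
```

The logarithm is only defined for positive numbers, so this variant assumes that every item in x is greater than zero.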

Python 3.8 introduced statistics.geometric_mean(), which converts all values to floating-point numbers and returns their geometric mean:

In [32]:
gmean = statistics.geometric_mean(x)
gmean


4.67788567485604

You've got the same result as in the previous example, but with a minimal rounding error.

If you pass data with nan values, then statistics.geometric_mean() will behave like most similar functions and return nan:

In [33]:
gmean = statistics.geometric_mean(x_with_nan)
gmean

nan

Indeed, this is consistent with the behavior of statistics.mean(), statistics.fmean(), and statistics.harmonic_mean(). If there's a zero or negative number among your data, then statistics.geometric_mean() will raise statistics.StatisticsError.

You can also get the geometric mean with scipy.stats.gmean():

In [34]:
scipy.stats.gmean(y)
scipy.stats.gmean(z)

np.float64(4.67788567485604)

You obtained the same result as with the pure Python implementation.

If you have nan values in a dataset, then gmean() will return nan. If there's at least one 0, then it'll return 0.0 and give a warning. If you provide at least one negative number, then you'll get nan and the warning.

# Median
The **sample median** is the middle element of a sorted dataset. The dataset can be sorted in increasing or decreasing order. If the number of elements 𝑛 of the dataset is odd, then the median is the value at the middle position: 0.5(𝑛 + 1). If 𝑛 is even, then the median is the arithmetic mean of the two values in the middle, that is, the items at the positions 0.5𝑛 and 0.5𝑛 + 1.

For example, if you have the data points 2, 4, 1, 8, and 9, then the median value is 4, which is in the middle of the sorted dataset (1, 2, 4, 8, 9). If the data points are 2, 4, 1, and 8, then the median is 3, which is the average of the two middle elements of the sorted sequence (2 and 4). The following figure illustrates this:

The data points are the green dots, and the purple lines show the median for each dataset. The median value for the upper dataset (1, 2.5, 4, 8, and 28) is 4. If you remove the outlier 28 from the lower dataset, then the median becomes the arithmetic average between 2.5 and 4, which is 3.25.

The figure below shows both the mean and median of the data points 1, 2.5, 4, 8, and 28:

Again, the mean is the red dashed line, while the median is the purple line.

The main difference between the behavior of the mean and median is related to dataset outliers or extremes. The mean is heavily affected by outliers, but the median only depends on outliers either slightly or not at all. Consider the following figure:

The upper dataset again has the items 1, 2.5, 4, 8, and 28. Its mean is 8.7, and the median is 4, as you saw earlier. The lower dataset shows what's going on when you move the rightmost point with the value 28:

- If you increase its value (move it to the right), then the mean will rise, but the median value won't ever change.
- If you decrease its value (move it to the left), then the mean will drop, but the median will remain the same as long as the value of the moving point is greater than or equal to 4.

You can compare the mean and median as one way to detect outliers and asymmetry in your data. Whether the mean value or the median value is more useful to you depends on the context of your particular problem.
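
You can check the effect of moving the rightmost point numerically. The following sketch keeps the first four points fixed and tries a few hypothetical values for the fifth one:

```python
import statistics

base = [1, 2.5, 4, 8]

# Replace the fifth point (originally 28) with several hypothetical values.
for moving_point in (100, 28, 4, 3):
    data = base + [moving_point]
    print(moving_point, statistics.fmean(data), statistics.median(data))
```

The mean changes with every value of the moving point, while the median stays at 4 until the point drops below 4.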

Here is one of many possible pure Python implementations of the median:

In [None]:
n = len(x)
if n % 2:
    median_ = sorted(x)[round(0.5*(n-1))]
else:
    x_ord, index = sorted(x), round(0.5 * n)
    median_ = 0.5 * (x_ord[index-1] + x_ord[index])

print(median_)

4


The two most important steps of this implementation are as follows:

- Sorting the elements of the dataset
- Finding the middle element(s) in the sorted dataset

You can get the median with statistics.median():

In [None]:
median_ = statistics.median(x)
print(median_)

median_ = statistics.median(x[:-1])
print(median_)

The sorted version of x is [1, 2.5, 4, 8.0, 28.0], so the element in the middle is 4. The sorted version of x[:-1], which is x without the last item 28.0, is [1, 2.5, 4, 8.0]. Now, there are two middle elements, 2.5 and 4. Their average is 3.25.

median_low() and median_high() are two more functions related to the median in the Python statistics library. They always return an element from the dataset:

- If the number of elements is odd, then there's a single middle value, so these functions behave just like median().
- If the number of elements is even, then there are two middle values. In this case, median_low() returns the lower and median_high() the higher middle value.

You can use these functions just as you'd use median():

In [35]:
print(statistics.median_low(x[:-1]))

print(statistics.median_high(x[:-1]))

2.5
4


Again, the sorted version of x[:-1] is [1, 2.5, 4, 8.0]. The two elements in the middle are 2.5 (low) and 4 (high).

Unlike most other functions from the Python statistics library, median(), median_low(), and median_high() don't return nan when there are nan values among the data points:

In [36]:
print(statistics.median(x_with_nan))

print(statistics.median_low(x_with_nan))

print(statistics.median_high(x_with_nan))

6.0
4
8.0


Beware of this behavior because it might not be what you want!
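
The reason is that these functions sort the data, and every ordering comparison that involves nan evaluates to False, so nan can't be placed consistently. A minimal sketch of one way around this, assuming you want to discard the missing values, is to filter them out with math.isnan() first:

```python
import math
import statistics

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# Both comparisons are False, which is why sorting with nan is unreliable.
print(math.nan < 1.0, math.nan > 1.0)

# Remove the nan entries before computing the median.
x_clean = [item for item in x_with_nan if not math.isnan(item)]
median_clean = statistics.median(x_clean)
print(median_clean)
```

After filtering, the median matches the value you got for the dataset x without nan.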

You can also get the median with np.median():

In [38]:
median_ = np.median(y)
print(median_)

median_ = np.median(y[:-1])
print(median_)

4.0
3.25


You've obtained the same values with statistics.median() and np.median().

However, if there's a nan value in your dataset, then np.median() returns nan and, depending on your NumPy version, may also issue a RuntimeWarning. If this behavior is not what you want, then you can use nanmedian() to ignore all nan values:

In [39]:
np.nanmedian(y_with_nan)

np.nanmedian(y_with_nan[:-1])


np.float64(3.25)

The obtained results are the same as with statistics.median() and np.median() applied to the datasets x and y.

pandas Series objects have the method .median() that ignores nan values by default:

In [40]:
print(z.median())

print(z_with_nan.median())

4.0
4.0


The behavior of .median() is consistent with .mean() in pandas. You can change this behavior with the optional parameter skipna.

# Mode
The **sample mode** is the value in the dataset that occurs most frequently. If there isn't a single such value, then the set is **multimodal** since it has multiple modal values. For example, in the set that contains the points 2, 3, 2, 8, and 12, the number 2 is the mode because it occurs twice, unlike the other items that occur only once.

This is how you can get the mode with pure Python:

In [None]:
u = [2, 3, 2, 8, 12]
mode_ = max((u.count(item), item) for item in set(u))[1]
mode_

2

You use u.count() to get the number of occurrences of each item in u. The item with the maximal number of occurrences is the mode. Note that you don't have to use set(u). Instead, you might replace it with just u and iterate over the entire list.
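
Calling u.count() for every item scans the list repeatedly. As an alternative sketch, collections.Counter from the standard library counts all items in a single pass and also makes it easy to collect ties. The names counts_v, top, and modes_v are only illustrative:

```python
from collections import Counter

u = [2, 3, 2, 8, 12]
v = [12, 15, 12, 15, 21, 15, 12]

# most_common(1) returns a list with one (item, count) pair.
mode_u, count_u = Counter(u).most_common(1)[0]

# Collect every item that reaches the highest count to handle ties.
counts_v = Counter(v)
top = max(counts_v.values())
modes_v = [item for item, count in counts_v.items() if count == top]

print(mode_u, count_u)
print(modes_v)
```

For u there is a single mode, while v has two items that share the highest count.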

You can obtain the mode with statistics.mode() and statistics.multimode():

In [42]:
mode_ = statistics.mode(u)
mode_
mode_ = statistics.multimode(u)
mode_


[2]

As you can see, mode() returned a single value, while multimode() returned a list that contains the result. This isn't the only difference between the two functions, though. If there's more than one modal value, then mode() returns the first one it encounters in Python 3.8 and later (earlier versions raised StatisticsError), while multimode() returns a list with all the modes:

In [43]:
v = [12, 15, 12, 15, 21, 15, 12]
print(statistics.mode(v))  # Returns the first mode encountered (Python 3.8+)
print(statistics.multimode(v))


12
[12, 15]


You should pay special attention to this scenario and be careful when you're choosing between these two functions.

statistics.mode() and statistics.multimode() handle nan values as regular values and can return nan as the modal value:

In [44]:
print(statistics.mode([2, math.nan, 2]))
print(statistics.multimode([2, math.nan, 2]))
print(statistics.mode([2, math.nan, 0, math.nan, 5]))
print(statistics.multimode([2, math.nan, 0, math.nan, 5]))

2
[2]
nan
[nan]


In the first example above, the number 2 occurs twice and is the modal value. In the second example, nan is the modal value since it occurs twice, while the other values occur only once.

You can also get the mode with scipy.stats.mode():

In [45]:
u, v = np.array(u), np.array(v)
mode_ = scipy.stats.mode(u)
print(mode_)
mode_ = scipy.stats.mode(v)
print(mode_)

ModeResult(mode=np.int64(2), count=np.int64(2))
ModeResult(mode=np.int64(12), count=np.int64(3))


This function returns an object with the modal value and the number of times it occurs. If there are multiple modal values in the dataset, then only the smallest one is returned.

You can get the mode and its number of occurrences with dot notation:

In [46]:
print(mode_.mode)
print(mode_.count)

12
3


This code uses .mode to return the smallest mode (12) in the array v and .count to return the number of times it occurs (3). scipy.stats.mode() is also flexible with nan values. It allows you to define desired behavior with the optional parameter nan_policy. This parameter can take on the values 'propagate', 'raise' (an error), or 'omit'.

pandas Series objects have the method .mode() that handles multimodal values well and ignores nan values by default:

In [49]:
u, v, w = pd.Series(u), pd.Series(v), pd.Series([2, 2, math.nan])
print(u.mode())

print(v.mode())

print(w.mode())


0    2
dtype: int64
0    12
1    15
dtype: int64
0    2.0
dtype: float64


As you can see, .mode() returns a new pd.Series that holds all modal values. If you want .mode() to take nan values into account, then just pass the optional argument dropna=False.
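
The effect of dropna is easiest to see when nan is the most frequent entry. Here's a minimal sketch with a hypothetical Series s in which nan occurs twice and 2 occurs once:

```python
import math
import pandas as pd

s = pd.Series([2, math.nan, math.nan])

# By default, nan entries are not counted, so 2.0 is the only candidate.
modes_default = s.mode()

# With dropna=False, nan competes like any other value and wins here.
modes_with_nan = s.mode(dropna=False)

print(modes_default)
print(modes_with_nan)
```

In both cases, the result is a new Series, so you can use it in the same way as the earlier examples.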

# Measures of Variability
The measures of central tendency aren't sufficient to describe data. You'll also need the **measures of variability** that quantify the spread of data points. In this section, you'll learn how to identify and calculate the following variability measures:

- Variance
- Standard deviation
- Skewness
- Percentiles
- Ranges

# Variance
The **sample variance** quantifies the spread of the data. It shows numerically how far the data points are from the mean. You can express the sample variance of the dataset 𝑥 with 𝑛 elements mathematically as 𝑠² = Σᵢ(𝑥ᵢ − mean(𝑥))² / (𝑛 − 1), where 𝑖 = 1, 2, …, 𝑛 and mean(𝑥) is the sample mean of 𝑥. If you want to understand why you divide the sum by 𝑛 − 1 instead of 𝑛, then you can read about Bessel's correction.

The following figure shows you why it's important to consider the variance when describing datasets:

There are two datasets in this figure:

- Green dots: This dataset has a smaller variance or a smaller average difference from the mean. It also has a smaller range or a smaller difference between the largest and smallest item.
- White dots: This dataset has a larger variance or a larger average difference from the mean. It also has a bigger range or a bigger difference between the largest and smallest item.

Note that these two datasets have the same mean and median, even though they appear to differ significantly. Neither the mean nor the median can describe this difference. That's why you need the measures of variability.

Here's how you can calculate the sample variance with pure Python:

In [50]:
n = len(x)
mean_ = sum(x) / n
var_ = sum((item - mean_)**2 for item in x) / (n - 1)
var_


123.2

This approach is sufficient and calculates the sample variance well. However, the shorter and more elegant solution is to call the existing function statistics.variance():

In [51]:
var_ = statistics.variance(x)
var_

123.2

You've obtained the same result for the variance as above. variance() can avoid calculating the mean if you provide the mean explicitly as the second argument: statistics.variance(x, mean_).

If you have nan values among your data, then statistics.variance() will return nan:

In [52]:
statistics.variance(x_with_nan)

nan

This behavior is consistent with mean() and most other functions from the Python statistics library.

You can also calculate the sample variance with NumPy. You should use the function np.var() or the corresponding method .var():

In [None]:
var_ = np.var(y, ddof=1)
print(var_)

var_ = y.var(ddof=1)
print(var_)

123.19999999999999
123.19999999999999


It's very important to specify the parameter ddof=1. That's how you set the delta degrees of freedom to 1. This parameter allows the proper calculation of 𝑠², with (𝑛 − 1) in the denominator instead of 𝑛.

If you have nan values in the dataset, then np.var() and .var() will return nan:

In [54]:
np.var(y_with_nan, ddof=1)
y_with_nan.var(ddof=1)

np.float64(nan)

This is consistent with np.mean() and np.average(). If you want to skip nan values, then you should use np.nanvar():

In [55]:
np.nanvar(y_with_nan, ddof=1)

np.float64(123.19999999999999)

np.nanvar() ignores nan values. It also needs you to specify ddof=1.

pd.Series objects have the method .var() that skips nan values by default:

In [56]:
print(z.var(ddof=1))
print(z_with_nan.var(ddof=1))

123.19999999999999
123.19999999999999


It also has the parameter ddof, but its default value is 1, so you can omit it. If you want a different behavior related to nan values, then use the optional parameter skipna.

You calculate the population variance similarly to the sample variance. However, you have to use 𝑛 in the denominator instead of 𝑛 − 1: Σᵢ(𝑥ᵢ − mean(𝑥))² / 𝑛. In this case, 𝑛 is the number of items in the entire population. You can get the population variance similar to the sample variance, with the following differences:

- Replace (n - 1) with n in the pure Python implementation.
- Use statistics.pvariance() instead of statistics.variance().
- Specify the parameter ddof=0 if you use NumPy or pandas. In NumPy, you can omit ddof because its default value is 0.

Note that you should always be aware of whether you're working with a sample or the entire population whenever you're calculating the variance!
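
To make the difference concrete, here's a short sketch that computes both variances for the same data with the statistics library and confirms that they differ only by the factor (n - 1) / n:

```python
import math
import statistics

x = [8.0, 1, 2.5, 4, 28.0]
n = len(x)

sample_var = statistics.variance(x)       # divides by n - 1
population_var = statistics.pvariance(x)  # divides by n

print(sample_var, population_var)

# The two values are related through the factor (n - 1) / n.
print(math.isclose(population_var, sample_var * (n - 1) / n))
```

For small datasets like this one, the gap between the two values is large, which is why mixing them up can noticeably distort your results.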

# Standard Deviation
The **sample standard deviation** is another measure of data spread. It's connected to the sample variance, as standard deviation, 𝑠, is the positive square root of the sample variance. The standard deviation is often more convenient than the variance because it has the same unit as the data points. Once you get the variance, you can calculate the standard deviation with pure Python:

In [57]:
std_ = var_ ** 0.5
std_


np.float64(11.099549540409285)

Although this solution works, you can also use statistics.stdev():

In [58]:
std_ = statistics.stdev(x)
std_

11.099549540409287

Of course, the result is the same as before. Like variance(), stdev() doesn't calculate the mean if you provide it explicitly as the second argument: statistics.stdev(x, mean_).

You can get the standard deviation with NumPy in almost the same way. You can use the function std() and the corresponding method .std() to calculate the standard deviation. If there are nan values in the dataset, then they'll return nan. To ignore nan values, you should use np.nanstd(). You use std(), .std(), and nanstd() from NumPy as you would use var(), .var(), and nanvar():

In [59]:
print(np.std(y, ddof=1))

print(y.std(ddof=1))

print(np.std(y_with_nan, ddof=1))

print(y_with_nan.std(ddof=1))

print(np.nanstd(y_with_nan, ddof=1))


11.099549540409285
11.099549540409285
nan
nan
11.099549540409285


Don't forget to set the delta degrees of freedom to 1!

pd.Series objects also have the method .std() that skips nan by default:

In [None]:
z.std(ddof=1)

z_with_nan.std(ddof=1)

The parameter ddof defaults to 1, so you can omit it. Again, if you want to treat nan values differently, then apply the parameter skipna.

The **population standard deviation** refers to the entire population. It‚Äôs the positive square root of the population variance. You can calculate it just like the sample standard deviation, with the following differences:

- Find the square root of the population variance in the pure Python implementation.
- Use statistics.pstdev() instead of statistics.stdev().
- Specify the parameter ddof=0 if you use NumPy or pandas. In NumPy, you can omit ddof because its default value is 0.

As you can see, you can determine the standard deviation in Python, NumPy, and pandas in almost the same way as you determine the variance. You use different but analogous functions and methods with the same arguments.

# Skewness
The **sample skewness** measures the asymmetry of a data sample.

There are several mathematical definitions of skewness. One common expression to calculate the skewness of the dataset 𝑥 with 𝑛 elements is (𝑛² / ((𝑛 − 1)(𝑛 − 2))) (Σᵢ(𝑥ᵢ − mean(𝑥))³ / (𝑛𝑠³)). A simpler expression is Σᵢ(𝑥ᵢ − mean(𝑥))³ 𝑛 / ((𝑛 − 1)(𝑛 − 2)𝑠³), where 𝑖 = 1, 2, …, 𝑛 and mean(𝑥) is the sample mean of 𝑥. The skewness defined like this is called the **adjusted Fisher-Pearson standardized moment coefficient.**

The previous figure showed two datasets that were quite symmetrical. In other words, their points had similar distances from the mean. In contrast, the following image illustrates two asymmetrical sets:

The first set is represented by the green dots and the second by the white ones. Usually, **negative skewness** values indicate that there's a dominant tail on the left side, which you can see with the first set. **Positive skewness** values correspond to a longer or fatter tail on the right side, which you can see in the second set. If the skewness is close to 0 (for example, between −0.5 and 0.5), then the dataset is considered quite symmetrical.

Once you've calculated the size of your dataset n, the sample mean mean_, and the standard deviation std_, you can get the sample skewness with pure Python:

In [60]:
x = [8.0, 1, 2.5, 4, 28.0]
n = len(x)
mean_ = sum(x) / n
var_ = sum((item - mean_)**2 for item in x) / (n - 1)
std_ = var_ ** 0.5
skew_ = (sum((item - mean_)**3 for item in x)
        * n / ((n - 1) * (n - 2) * std_**3))
skew_


1.947043227390592

The skewness is positive, so x has a right-side tail.
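To see how the sign of the skewness follows the direction of the tail, you can wrap the calculation above in a small helper and apply it to mirrored data. This is a minimal sketch: the function name sample_skewness() and the three example datasets are hypothetical and aren't part of any library:

```python
def sample_skewness(data):
    """Adjusted Fisher-Pearson coefficient, computed as in the cell above."""
    n = len(data)
    mean_ = sum(data) / n
    var_ = sum((item - mean_) ** 2 for item in data) / (n - 1)
    std_ = var_ ** 0.5
    return (sum((item - mean_) ** 3 for item in data)
            * n / ((n - 1) * (n - 2) * std_ ** 3))

right_tailed = [1, 2, 2, 3, 3, 3, 10]        # one large value stretches the right tail
left_tailed = [-item for item in right_tailed]  # mirror image: the tail is on the left
symmetric = [1, 2, 3, 4, 5]                  # evenly spread around the mean

skew_right = sample_skewness(right_tailed)
skew_left = sample_skewness(left_tailed)
skew_symmetric = sample_skewness(symmetric)

print(skew_right, skew_left, skew_symmetric)
```

Mirroring the data flips the sign of the skewness without changing its magnitude, and the evenly spaced dataset has a skewness of zero.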

You can also calculate the sample skewness with scipy.stats.skew():

In [61]:
y, y_with_nan = np.array(x), np.array(x_with_nan)
print(scipy.stats.skew(y, bias=False))
print(scipy.stats.skew(y_with_nan, bias=False))


1.9470432273905927
nan


The obtained result matches the pure Python implementation up to floating-point rounding. The parameter bias is set to False to enable the corrections for statistical bias. The optional parameter nan_policy can take the values 'propagate', 'raise', or 'omit'. It allows you to control how you'll handle nan values.
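The three nan_policy values can be compared on a small array. The sketch below assumes a hypothetical array with one nan value. The variable names are illustrative only:

```python
import math

import numpy as np
import scipy.stats

# Hypothetical example data with one missing value
y_with_nan = np.array([8.0, 1, 2.5, math.nan, 4, 28.0])

# 'propagate' is the default: a nan in the input gives nan in the output
skew_propagate = scipy.stats.skew(y_with_nan, bias=False)

# 'omit' drops nan values before the calculation
skew_omit = float(scipy.stats.skew(y_with_nan, bias=False, nan_policy='omit'))

# 'raise' refuses to work with nan values and raises a ValueError
raised = False
try:
    scipy.stats.skew(y_with_nan, bias=False, nan_policy='raise')
except ValueError:
    raised = True

print(skew_propagate, skew_omit, raised)
```

With 'omit', the result should match the skewness of the same data without the nan value.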

pandas Series objects have the method .skew(), which also returns the skewness of a dataset. Like other methods, .skew() ignores nan values by default, because of the default value of the optional parameter skipna:

In [62]:
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)
print(z.skew())
print(z_with_nan.skew())


1.9470432273905924
1.9470432273905924


# Percentiles
The sample 𝑝 percentile is the element in the dataset such that 𝑝% of the elements in the dataset are less than or equal to that value. Also, (100 − 𝑝)% of the elements are greater than or equal to that value. If there are two such elements in the dataset, then the sample 𝑝 percentile is their arithmetic mean. Each dataset has three quartiles, which are the percentiles that divide the dataset into four parts:

- The first quartile is the sample 25th percentile. It divides roughly 25% of the smallest items from the rest of the dataset.
- The second quartile is the sample 50th percentile or the median. Approximately 25% of the items lie between the first and second quartiles and another 25% between the second and third quartiles.
- The third quartile is the sample 75th percentile. It divides roughly 25% of the largest items from the rest of the dataset.

Each part has approximately the same number of items. If you want to divide your data into several intervals, then you can use statistics.quantiles():

In [None]:
x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
print(statistics.quantiles(x, n=2))

print(statistics.quantiles(x, n=4, method='inclusive'))

In this example, 8.0 is the median of x, while 0.1 and 21.0 are the sample 25th and 75th percentiles, respectively. The parameter n defines the number of equal-probability intervals, so quantiles() returns n − 1 cut points, and method determines how to calculate them.
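The effect of method is easiest to see by running both variants on the same data. The following sketch reuses the dataset from the cell above. The variable names are made up for the illustration:

```python
import statistics

x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]

# method='exclusive' is the default: it assumes the population may contain
# values more extreme than those in the sample
quartiles_exclusive = statistics.quantiles(x, n=4)

# method='inclusive' treats the smallest and largest items as the 0th and
# 100th percentiles
quartiles_inclusive = statistics.quantiles(x, n=4, method='inclusive')

# n=10 splits the data into ten intervals, which takes nine cut points
deciles = statistics.quantiles(x, n=10, method='inclusive')

print(quartiles_exclusive)
print(quartiles_inclusive)
print(len(deciles))
```

Both methods agree on the median for this dataset, but the exclusive method places the first and third quartiles farther from it.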

You can also use np.percentile() to determine any sample percentile in your dataset. For example, this is how you can find the 5th and 95th percentiles:

In [63]:
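# Use the earlier five-element dataset, which is what the outputs below were computed from
y = np.array([8.0, 1, 2.5, 4, 28.0])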
print(np.percentile(y, 5))

print(np.percentile(y, 95))

1.3
23.999999999999996


percentile() takes several arguments. You have to provide the dataset as the first argument and the percentile value as the second. The dataset can be in the form of a NumPy array, list, tuple, or similar data structure. The percentile can be a number between 0 and 100 like in the example above, but it can also be a sequence of numbers:

In [65]:
np.percentile(y, [25, 50, 75])
np.median(y)


np.float64(4.0)

This code calculates the 25th, 50th, and 75th percentiles all at once. If the percentile value is a sequence, then percentile() returns a NumPy array with the results. The first statement returns the array of quartiles. The second statement returns the median, so you can confirm that it's equal to the 50th percentile. Because only the value of the last expression in a cell is displayed, the output above shows just the median, 4.0.

If you want to ignore nan values, then use np.nanpercentile() instead:

In [66]:
y_with_nan = np.insert(y, 2, np.nan)
print(y_with_nan)
print(np.nanpercentile(y_with_nan, [25, 50, 75]))


[ 8.   1.   nan  2.5  4.  28. ]
[2.5 4.  8. ]


That's how you can avoid nan values.

NumPy also offers you very similar functionality in quantile() and nanquantile(). If you use them, then you'll need to provide the quantile values as the numbers between 0 and 1 instead of percentiles:

In [67]:
np.quantile(y, 0.05)
np.quantile(y, 0.95)
np.quantile(y, [0.25, 0.5, 0.75])
np.nanquantile(y_with_nan, [0.25, 0.5, 0.75])

array([2.5, 4. , 8. ])

The results are the same as in the previous examples, but here your arguments are between 0 and 1. In other words, you passed 0.05 instead of 5 and 0.95 instead of 95.
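If you'd like to check the equivalence yourself, then you can compare both functions directly. The following is a minimal sketch on a small illustrative array. The names percents, from_percentile, and from_quantile are assumptions for this example:

```python
import numpy as np

y = np.array([8.0, 1, 2.5, 4, 28.0])
percents = [5, 25, 50, 75, 95]

# Percentiles are given on a scale from 0 to 100
from_percentile = np.percentile(y, percents)

# Quantiles describe the same points as fractions between 0 and 1
from_quantile = np.quantile(y, np.array(percents) / 100)

print(np.allclose(from_percentile, from_quantile))
```

Dividing the percentile values by 100 is the only change you need to switch from percentile() to quantile().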

pd.Series objects have the method .quantile():

In [68]:
z, z_with_nan = pd.Series(y), pd.Series(y_with_nan)
print(z.quantile(0.05))

print(z.quantile(0.95))

print(z.quantile([0.25, 0.5, 0.75]))

print(z_with_nan.quantile([0.25, 0.5, 0.75]))


1.3
23.999999999999996
0.25    2.5
0.50    4.0
0.75    8.0
dtype: float64
0.25    2.5
0.50    4.0
0.75    8.0
dtype: float64


.quantile() also needs you to provide the quantile value as the argument. This value can be a number between 0 and 1 or a sequence of numbers. In the first case, .quantile() returns a scalar. In the second case, it returns a new Series holding the results.
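The two return types can be inspected with a short sketch like the one below. The Series and the variable names are illustrative assumptions:

```python
import pandas as pd

z = pd.Series([8.0, 1, 2.5, 4, 28.0])

median_ = z.quantile(0.5)                  # a single number in, a scalar out
quartiles = z.quantile([0.25, 0.5, 0.75])  # a sequence in, a Series out

print(type(median_).__name__, type(quartiles).__name__)

# The resulting Series is indexed by the requested quantile values
print(quartiles.loc[0.5] == median_)
```

Because the returned Series uses the quantile values as its index, you can look up an individual result with .loc[] instead of relying on its position.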