### Understanding Descriptive Statistics

Descriptive statistics is about describing and summarising data. It uses two main approaches:

1) Quantitative Approach - describes and summarises data numerically

2) Visual Approach - illustrates data with charts, plots, histograms and other graphs

When you describe and summarise a single variable, you're performing a **univariate analysis**. When you search for statistical relationships between a pair of variables, you're doing a **bivariate analysis**. Similarly, a **multivariate analysis** is concerned with multiple variables at once.

#### Types of Measures
- **Central tendency** tells you about the centres of the data. Useful measures include the mean, median and mode.
- **Variability** tells you about the spread of the data. Useful measures include the variance and standard deviation.
- **Correlation or joint variability** tells you about the relation between a pair of variables in a dataset. Useful measures include the covariance and the **correlation coefficient**.
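
As a rough preview of the three kinds of measures listed above, the sketch below computes one or two of each. The lists `a` and `b` are hypothetical data made up for illustration, and the choice of the `statistics` module and NumPy here is only one of several reasonable options.

```python
import statistics
import numpy as np

# Hypothetical sample data, chosen only for illustration
a = [2, 4, 4, 6, 9]
b = [1, 3, 2, 5, 8]

# Central tendency: where the data is centred
mean_a = statistics.mean(a)
median_a = statistics.median(a)
mode_a = statistics.mode(a)

# Variability: how spread out the data is (sample versions, n - 1 denominator)
var_a = statistics.variance(a)
std_a = statistics.stdev(a)

# Joint variability: how a and b vary together
cov_ab = np.cov(a, b)[0, 1]        # sample covariance
corr_ab = np.corrcoef(a, b)[0, 1]  # Pearson correlation coefficient

print(mean_a, median_a, mode_a)
print(var_a, std_a)
print(cov_ab, corr_ab)
```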

### Calculating Descriptive Statistics

In [1]:
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd

We'll start with Python lists that contain some arbitrary numeric data:

In [2]:
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
x

[8.0, 1, 2.5, 4, 28.0]

In [3]:
x_with_nan

[8.0, 1, 2.5, nan, 4, 28.0]

Now, create `np.ndarray` and `pd.Series` objects that correspond to `x` and `x_with_nan`:

In [6]:
y, y_with_nan = np.array(x), np.array(x_with_nan)
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)

In [7]:
y

array([ 8. ,  1. ,  2.5,  4. , 28. ])

In [8]:
y_with_nan

array([ 8. ,  1. ,  2.5,  nan,  4. , 28. ])

In [9]:
z

0     8.0
1     1.0
2     2.5
3     4.0
4    28.0
dtype: float64

In [10]:
z_with_nan

0     8.0
1     1.0
2     2.5
3     NaN
4     4.0
5    28.0
dtype: float64

### Measures of Central Tendency

The **measures of central tendency** show the central or middle values of datasets. There are several definitions of what's considered to be the centre of a dataset. This tutorial covers the following:
- Mean
- Weighted mean
- Geometric mean
- Harmonic mean
- Median
- Mode

##### Mean
The **sample mean**, also called the **sample arithmetic mean** or simply the **average**, is the arithmetic average of all items in a dataset. You can calculate the mean in pure Python using `sum()` and `len()` without importing libraries:

In [11]:
mean_ = sum(x) / len(x)
mean_

8.7

Although this is clean and elegant, you can also use the functions in Python's standard-library `statistics` module:

In [12]:
mean_ = statistics.mean(x)
mean_

8.7

`fmean()` was added in Python 3.8 as a faster alternative to `mean()`. It always returns a floating-point number:

In [13]:
mean_ = statistics.fmean(x)
mean_

8.7

However, if there are `nan` values among your data, then `statistics.mean()` and `statistics.fmean()` will return `nan` as the output:

In [14]:
mean_ = statistics.mean(x_with_nan)
mean_

nan

In [15]:
mean_ = statistics.fmean(x_with_nan)
mean_

nan

This behaviour is consistent with the behaviour of `sum()`, because `sum(x_with_nan)` also returns `nan`.
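
A quick way to check this yourself is sketched below. It rebuilds `x_with_nan` so that it runs on its own, and it uses `math.isnan()` for the check because a `nan` value never compares equal to anything, including itself.

```python
import math

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# Any arithmetic involving nan produces nan, so the total is nan too
total = sum(x_with_nan)

print(total)              # nan
print(math.isnan(total))  # True
print(total == total)     # False, because nan is not equal to itself
```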

If you use NumPy you can get the mean with `np.mean()`:

In [16]:
mean_ = np.mean(y)
mean_

8.7

In the example above, `np.mean()` is a function, but you can also use the corresponding method `.mean()`:

In [17]:
mean_ = y.mean()
mean_

8.7

Likewise, both the function and the method return `nan` if there are `nan` values among your data:

In [19]:
mean_ = y_with_nan.mean()
mean_

nan

You often don't want a `nan` value as a result. If you prefer to ignore `nan` values, then you can use `np.nanmean()`. It returns the same value that `np.mean()` would return for the dataset with the `nan` values removed:

In [20]:
np.nanmean(y_with_nan)

8.7
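
If you aren't using NumPy, one way to get the same effect in pure Python is to filter out the `nan` values before averaging. This is only a sketch of that idea, and the names `clean` and `mean_without_nan` are made up for illustration:

```python
import math
import statistics

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# Keep only the items that are not nan
clean = [value for value in x_with_nan if not math.isnan(value)]

# Average the remaining five items
mean_without_nan = statistics.fmean(clean)
print(mean_without_nan)
```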

`pd.Series` objects also have the method `.mean()`:

In [21]:
mean_ = z.mean()
mean_

8.7

As you can see, it's used in the same way as in NumPy. However, `.mean()` from Pandas ignores `nan` values by default. This is because the optional parameter `skipna` defaults to `True`. You can change this parameter to modify the behaviour.

In [22]:
z_with_nan.mean()

8.7
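
To see the effect of `skipna`, you can compare the default call with one that passes `skipna=False`. The sketch below rebuilds `z_with_nan` so that it runs on its own, and the names `mean_skipping` and `mean_strict` are made up for illustration:

```python
import math
import pandas as pd

z_with_nan = pd.Series([8.0, 1, 2.5, math.nan, 4, 28.0])

# Default behaviour (skipna=True): nan values are ignored
mean_skipping = z_with_nan.mean()

# With skipna=False, a single nan makes the whole result nan
mean_strict = z_with_nan.mean(skipna=False)

print(mean_skipping)
print(mean_strict)
```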

##### Weighted Mean
The **weighted mean**, also called the **weighted arithmetic mean** or **weighted average**, is a generalisation of the arithmetic mean that enables you to define the relative contribution of each data point to the result.

You define one **weight w<sub>i</sub>** for each data point x<sub>i</sub> of the dataset x, where *i* = 1, 2, ..., *n* and *n* is the number of items in x. Then you multiply each data point by the corresponding weight, sum all the products, and divide the obtained sum by the sum of the weights.
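
The same steps can be written as a formula. The symbol $\bar{x}_w$ for the weighted mean is a notation chosen here for illustration:

$$
\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
$$

When all the weights are equal, this reduces to the ordinary arithmetic mean.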

The weighted mean is very handy when you need the mean of a dataset containing items that occur with given relative frequencies. For example, say that you have a set in which 20% of all items are equal to 2, 50% of the items are equal to 4, and the remaining 30% of the items are equal to 8. You can calculate the mean of such a set like this:

In [23]:
0.2 * 2 + 0.5 * 4 + 0.3 * 8

4.8
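
One way to convince yourself of this result is to build a concrete set with those proportions and take its ordinary mean. The ten-item list below is a hypothetical example with two 2s, five 4s and three 8s:

```python
import math
import statistics

# Hypothetical set of ten items: 20% are 2, 50% are 4, 30% are 8
items = [2] * 2 + [4] * 5 + [8] * 3

# Ordinary mean of the expanded set
plain_mean = statistics.fmean(items)

# Weighted mean computed from the relative frequencies
weighted = 0.2 * 2 + 0.5 * 4 + 0.3 * 8

# Compare with a tolerance, since floating-point results may differ slightly
print(math.isclose(plain_mean, weighted))  # True
```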

You can implement the weighted mean in pure Python by combining `sum()` with either `range()` or `zip()`:

In [25]:
x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]

In [27]:
w_mean = sum(w[i] * x[i] for i in range(len(x))) / sum(w)
w_mean

6.95

In [28]:
w_mean = sum(x_ * w_ for (x_, w_) in zip(x, w)) / sum(w)
w_mean

6.95
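
If your data is already in NumPy arrays, `np.average()` with its `weights` argument is one alternative to the pure Python versions. The sketch below rebuilds `x` and `w` so that it runs on its own, and the names `w_arr`, `w_mean_np` and `w_mean_manual` are made up for illustration:

```python
import math
import numpy as np

x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]
y, w_arr = np.array(x), np.array(w)

# np.average() divides the weighted sum by the sum of the weights
w_mean_np = np.average(y, weights=w_arr)

# The same calculation written out with element-wise array operations
w_mean_manual = (w_arr * y).sum() / w_arr.sum()

print(w_mean_np)
print(math.isclose(w_mean_np, w_mean_manual))  # True
```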