# Understanding descriptive statistics

Descriptive statistics is about describing and summarizing data. It uses two main approaches:

- Quantitative approach, where the data is numeric.
- Visual approach, where the data is illustrated with plots and histograms.

# Types of measures

- Central tendency
- Variability
- Correlation or joint variability

# Population and samples
The population is the set of all elements or items that you are interested in. A subset of the population is called a sample. Ideally, the sample should preserve the essential statistical features of the population to a satisfactory extent.
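As a rough illustration of this idea, the sketch below builds a hypothetical population of 10,000 values with Python's `random` module, draws a sample of 500 items from it, and compares the two means. The population, the seed, and the sample size are all illustrative assumptions, not data from this notebook.

```python
import random
import statistics

# Hypothetical population: 10,000 values drawn from a normal distribution
# (mean 50, standard deviation 10). The seed is fixed only for repeatability.
rng = random.Random(42)
population = [rng.gauss(50, 10) for _ in range(10_000)]

# A sample is a subset of the population, chosen here without replacement.
sample = rng.sample(population, k=500)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

# A representative sample should give a mean close to the population mean.
print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

How close the two means are depends on the sample size and on how the sample is drawn; a random sample like this one is usually a reasonable starting point.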

# Outliers
An outlier is a data point that differs significantly from the majority of the data taken from a sample or population. There are many possible causes of outliers, but here are a few to start with:

- Natural variation in data
- Change in the behavior of the observed system
- Errors in data collection
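The text above does not prescribe a way to find outliers. As one hedged illustration, the sketch below applies a common rule of thumb: flag any point that lies more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. The `readings` values and the 1.5 multiplier are assumptions chosen for the example, and `statistics.quantiles()` requires Python 3.8 or later.

```python
import statistics

# Made-up measurements; the last value is suspiciously large.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0]

# Quartiles split the sorted data into four parts of equal size.
q1, _, q3 = statistics.quantiles(readings, n=4)
iqr = q3 - q1

# Rule-of-thumb fences: anything outside them is flagged as a possible outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [value for value in readings if value < lower_fence or value > upper_fence]

print(outliers)
```

A flagged point is only a candidate: whether it reflects natural variation, a change in the system, or a collection error still has to be judged from context.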


In [6]:
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd

In [2]:
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
x

[8.0, 1, 2.5, 4, 28.0]

In [6]:
y, y_with_nan = np.array(x), np.array(x_with_nan)
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)

z_with_nan

0     8.0
1     1.0
2     2.5
3     NaN
4     4.0
5    28.0
dtype: float64

# Measures of central tendency
The measures of central tendency show the central or middle values of datasets. The most common measures of central tendency are:

- Mean
- Weighted mean
- Geometric mean
- Harmonic mean
- Median
- Mode

# Mean
The sample arithmetic mean, also called the _average_, is the sum of all the items in a dataset divided by the number of items.
For a dataset $x$ with $n$ items, the mean is:
$$
\sum_{i} \frac{x_{i}}{n}
$$


In [11]:
# We can calculate the mean with the built-in sum() and len() functions
mean_ = sum(x) / len(x)

# We can also use the functions from the statistics module
mean_ = statistics.mean(x)
mean_ = statistics.fmean(x)  # fmean() (Python 3.8+) is a faster mean that always returns a float

# If the data contains nan values, mean() and fmean() do not raise an error; they return nan
mean_ = statistics.fmean(x_with_nan)

# To get the mean of a dataset while ignoring nan values, use NumPy's nanmean()
mean_ = np.nanmean(x_with_nan)
mean_

8.7
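NumPy arrays and pandas Series handle `nan` differently from each other, which is easy to miss. The sketch below, which redefines the notebook's `x_with_nan` so that it runs on its own, compares the plain NumPy mean, `np.nanmean()`, and the pandas `Series.mean()` method with its `skipna` parameter.

```python
import math

import numpy as np
import pandas as pd

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# NumPy's plain mean() propagates nan, just like statistics.fmean().
y_with_nan = np.array(x_with_nan)
plain_mean = y_with_nan.mean()
nan_aware_mean = np.nanmean(y_with_nan)

# pandas skips nan by default (skipna=True); skipna=False propagates it instead.
z_with_nan = pd.Series(x_with_nan)
pandas_mean = z_with_nan.mean()
pandas_strict_mean = z_with_nan.mean(skipna=False)

print(plain_mean, nan_aware_mean, pandas_mean, pandas_strict_mean)
```

Silently skipping missing values is convenient, but it can hide data-quality problems, so it is worth checking how many `nan` values a dataset contains before relying on either default.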

# Weighted mean

The weighted mean, also called the weighted arithmetic mean or weighted average, is a generalization of the arithmetic mean that lets us define the relative contribution of each data point to the result.

We need to define a weight $w_{i}$ for each data point $x_{i}$ of the dataset. Mathematically, the weighted mean is:

$$\dfrac{\sum_{i}w_{i}x_{i}}{\sum_{i}w_{i}}$$

The weighted mean is useful when we need the mean of a dataset whose items occur with known relative frequencies.

For example, say that you have a set in which 20% of all items are equal to 2, 50% of the items are equal to 4, and the remaining 30% of the items are equal to 8. You can calculate the mean of such a set like this:

In [1]:
0.2 * 2 + 0.5 * 4 + 0.3 * 8

4.8

In [7]:
x = [8.0, 1, 2.5, 4, 28]
w = [0.1, 0.2, 0.3, 0.25, 0.15]

wmean = sum(w[i] * x[i] for i in range(len(x)))/sum(w)
wmean

6.95
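The same calculation can be written without index arithmetic, or delegated to NumPy. The sketch below repeats the notebook's `x` and `w` so that it is self-contained, then computes the weighted mean with `zip()` and with `np.average()`, which accepts a `weights` argument.

```python
import numpy as np

x = [8.0, 1, 2.5, 4, 28]
w = [0.1, 0.2, 0.3, 0.25, 0.15]

# Pure-Python version: pair each weight with its value using zip().
wmean_py = sum(w_i * x_i for w_i, x_i in zip(w, x)) / sum(w)

# NumPy version: np.average() divides the weighted sum by the sum of the weights.
wmean_np = np.average(x, weights=w)

print(wmean_py, wmean_np)
```

Because both versions divide by the sum of the weights, the weights do not have to add up to 1; raw counts work just as well as relative frequencies.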

# Harmonic mean
The harmonic mean is the reciprocal of the mean of the reciprocals of all items in the dataset:
$$ \frac{n}{\sum_{i=1}^n \frac{1}{x_i}} $$

In [8]:
hmean = len(x) / sum(1/item for item in x)
hmean = statistics.harmonic_mean(x)
hmean

2.7613412228796843
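SciPy offers a third way to get the same result. The sketch below, which repeats the notebook's `x` so that it runs on its own, checks that the manual formula, `statistics.harmonic_mean()`, and `scipy.stats.hmean()` agree.

```python
import statistics

import scipy.stats

x = [8.0, 1, 2.5, 4, 28]

# Manual formula: n divided by the sum of the reciprocals.
hmean_manual = len(x) / sum(1 / item for item in x)

# Library versions from the standard library and from SciPy.
hmean_stats = statistics.harmonic_mean(x)
hmean_scipy = scipy.stats.hmean(x)

print(hmean_manual, hmean_stats, hmean_scipy)
```

Note that the manual formula divides by each item, so it fails with a `ZeroDivisionError` if the dataset contains a zero.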

# Geometric mean
The geometric mean is the $n$-th root of the product of all $n$ elements $x_{i}$ in a dataset:
$$\sqrt[n]{\prod_{i=1}^{n}x_{i}}$$

In [10]:
gmean = 1
for item in x:
    gmean *= item
    
gmean **= 1 / len(x)
gmean

4.677885674856041
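Multiplying many values together can overflow or underflow for long datasets. A common alternative, sketched below for positive values only, uses the identity that the geometric mean equals the exponential of the mean of the logarithms. The sketch repeats the notebook's `x` and also calls `statistics.geometric_mean()`, which is available from Python 3.8.

```python
import math
import statistics

x = [8.0, 1, 2.5, 4, 28]

# Library version from the standard library (Python 3.8+).
gmean_stats = statistics.geometric_mean(x)

# Log-based version: exp(mean(log(x_i))). Valid only when every item is positive.
gmean_logs = math.exp(statistics.fmean([math.log(item) for item in x]))

print(gmean_stats, gmean_logs)
```

Both values should match the result of the loop above up to floating-point rounding.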

# Median
The _sample median_ is the middle element of a sorted dataset. The dataset can be sorted in increasing or decreasing order. If the number of elements is even, the median is the mean of the two middle elements.

The main difference between the behavior of the mean and the median is related to dataset outliers or extremes. The mean is heavily affected by outliers, whereas the median depends on them only slightly or not at all.

We can compare the mean and median as one way to detect outliers and asymmetry in our data.

In [None]:
n = len(x)
x_ord = sorted(x)

if n % 2:
    # Odd number of items: take the single middle element
    median_ = x_ord[n // 2]
else:
    # Even number of items: average the two middle elements
    index = n // 2
    median_ = 0.5 * (x_ord[index - 1] + x_ord[index])

median_
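The standard library and NumPy provide the median directly. The sketch below repeats the notebook's `x` and `x_with_nan` so that it is self-contained, computes the median with `statistics.median()`, `np.median()`, and `np.nanmedian()`, and then compares the median with the mean to show the effect of the large value 28.0.

```python
import math
import statistics

import numpy as np

x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# Odd number of items: the middle element of the sorted data.
median_ = statistics.median(x)

# Even number of items: the mean of the two middle elements.
median_even = statistics.median(x[:-1])

# NumPy equivalents; nanmedian() ignores nan values.
median_np = np.median(x)
median_nan = np.nanmedian(x_with_nan)

# The single large value pulls the mean up but leaves the median unchanged.
mean_ = statistics.fmean(x)
print(f"median: {median_}, mean: {mean_}")
```

A mean that sits well above the median, as it does here, hints at a large outlier or a right-skewed dataset.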