# Understanding Descriptive statistics

Descriptive statistics is about describing and summarizing data. It uses two main approaches:

- Quantitative approach, where the data is described and summarized numerically.
- Visual approach, where the data is illustrated with plots and histograms.

# Types of measures

- Central tendency
- Variability
- Correlation or joint variability

# Population and samples
The population is the set of all elements or items that you are interested in. A subset of the population is called a sample. Ideally, the sample should preserve the essential statistical features of the population to a satisfactory extent.
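
As a rough illustration of a sample preserving a feature of its population, the sketch below draws a simple random sample and compares its mean with the population mean. The population (the integers 1 to 1000), the sample size, and the seed are all hypothetical choices made only for this example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: the integers 1..1000
population = list(range(1, 1001))

# Simple random sample of 100 elements, drawn without replacement
sample = random.sample(population, k=100)

pop_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(pop_mean)     # 500.5
print(sample_mean)  # typically close to, but not equal to, the population mean
```

A larger sample would usually track the population mean more closely, though any single draw can still be off.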

# Outliers
An outlier is a data point that differs significantly from the majority of the data taken from a sample or population. There are many possible causes of outliers, but here are a few to start with:

- Natural variation in data
- Change in the behavior of the observed system
- Errors in data collection


In [2]:
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd

In [3]:
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
x

[8.0, 1, 2.5, 4, 28.0]

In [4]:
y, y_with_nan = np.array(x), np.array(x_with_nan)
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)

z_with_nan

0     8.0
1     1.0
2     2.5
3     NaN
4     4.0
5    28.0
dtype: float64

# Measures of central tendency
The measures of central tendency show the central or middle values of datasets. The most common measures of central tendency are:

- Mean
- Weighted mean
- Geometric mean
- Harmonic mean
- Median
- Mode

## Mean
The sample arithmetic mean, or the _average_, is the sum of all the items in a dataset divided by the number of items.
The mean of a dataset $x$ with $n$ items is mathematically expressed as
$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i}
$$


In [11]:
# We can calculate the mean with the sum() and len() functions
mean_ = sum(x) / len(x)

# We can also use the built-in statistics functions
mean_ = statistics.mean(x)
mean_ = statistics.fmean(x)  # fmean converts the data to floats and is faster than mean

# If the data contains nan values, mean() and fmean() return nan
mean_ = statistics.fmean(x_with_nan)

# To get the mean of a dataset while ignoring nan values, use NumPy's nanmean() function
mean_ = np.nanmean(x_with_nan)
mean_

8.7
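
pandas handles missing values differently from the `statistics` module and plain NumPy. The sketch below, which rebuilds the same `x_with_nan` data so that it runs on its own, shows that `Series.mean()` skips nan values by default, while `np.mean()` propagates them.

```python
import math

import numpy as np
import pandas as pd

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
z_with_nan = pd.Series(x_with_nan)

# pandas ignores missing values by default (skipna=True)
mean_skip = z_with_nan.mean()

# With skipna=False the nan propagates to the result
mean_keep = z_with_nan.mean(skipna=False)

# np.mean() does not skip nan values either
mean_np = np.mean(np.array(x_with_nan))

print(mean_skip, mean_keep, mean_np)
```

Only the first value is a number (the same mean that `np.nanmean()` returns); the other two are nan.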

## Weighted mean

The weighted mean, also called the weighted arithmetic mean or weighted average, is a generalization of the arithmetic mean that lets us define the relative contribution of each data point to the result.

We need to define a weight $w_{i}$ for each data point $x_{i}$ of the dataset. Mathematically, the weighted mean is:

$$\dfrac{\sum_{i}w_{i}x_{i}}{\sum_{i}w_{i}}$$

The weighted mean is useful when we need the mean of a dataset containing items that occur with given relative frequencies.

For example, say that you have a set in which 20% of all items are equal to 2, 50% of the items are equal to 4, and the remaining 30% of the items are equal to 8. You can calculate the mean of such a set like this:

In [1]:
0.2 * 2 + 0.5 * 4 + 0.3 * 8

4.8

In [2]:
x = [8.0, 1, 2.5, 4, 28]
w = [0.1, 0.2, 0.3, 0.25, 0.15]

wmean = sum(w_i * x_i for w_i, x_i in zip(w, x)) / sum(w)
wmean

6.95
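
The same result can be obtained with NumPy. The sketch below uses `np.average()` with its `weights` argument and also illustrates that, because the formula divides by the sum of the weights, the weights do not have to add up to 1. The scaled weights `w_scaled` are a made-up variant used only to show this.

```python
import numpy as np

x = [8.0, 1, 2.5, 4, 28]
w = [0.1, 0.2, 0.3, 0.25, 0.15]

# np.average() divides the weighted sum by the sum of the weights
wmean_np = np.average(x, weights=w)

# Scaling every weight by the same factor leaves the result unchanged
w_scaled = [10 * w_i for w_i in w]
wmean_scaled = np.average(x, weights=w_scaled)

print(wmean_np, wmean_scaled)
```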

## Harmonic mean
The harmonic mean is the reciprocal of the mean of the reciprocals of all items in the dataset:
$$ \frac{n}{\sum_{i=1}^n \frac{1}{x_i}} $$

In [8]:
hmean = len(x) / sum(1/item for item in x)
hmean = statistics.harmonic_mean(x)
hmean

2.7613412228796843
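
The harmonic mean is typically the appropriate average for rates. As a hedged illustration with made-up numbers, suppose a trip covers the same distance twice, once at 40 km/h and once at 60 km/h. The average speed over the whole trip is the harmonic mean of the two speeds, not their arithmetic mean.

```python
import statistics

speeds = [40, 60]   # hypothetical speeds in km/h
distance = 120      # hypothetical length of each leg in km

# Average speed = total distance / total time
total_time = sum(distance / s for s in speeds)     # 3 h + 2 h
avg_speed = len(speeds) * distance / total_time    # 240 km / 5 h

hmean_speed = statistics.harmonic_mean(speeds)
amean_speed = statistics.mean(speeds)

print(avg_speed, hmean_speed, amean_speed)
```

The first two values agree at 48 km/h, while the arithmetic mean of 50 km/h overstates the average speed.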

## Geometric mean
The geometric mean is the n-th root of the product of all n elements $x_{i}$ in a dataset:
$$\sqrt[n]{\prod_{i=1}^{n}x_{i}}$$

In [10]:
gmean = 1
for item in x:
    gmean *= item
    
gmean **= 1 / len(x)
gmean

4.677885674856041
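
The loop above can be cross-checked against the standard library. This sketch assumes Python 3.8 or later, where `math.prod()` and `statistics.geometric_mean()` are available, and also shows the equivalent formulation through logarithms, which avoids building a very large product for long datasets.

```python
import math
import statistics

x = [8.0, 1, 2.5, 4, 28]

# n-th root of the product, written with math.prod()
gmean_prod = math.prod(x) ** (1 / len(x))

# Built-in implementation
gmean_stats = statistics.geometric_mean(x)

# Equivalent form: exponential of the mean of the logarithms
gmean_logs = math.exp(sum(math.log(item) for item in x) / len(x))

print(gmean_prod, gmean_stats, gmean_logs)
```

All three values should agree up to floating-point rounding. Note that the geometric mean is only defined here for positive data.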

## Median
The _sample median_ is the middle element of a sorted dataset. The dataset can be sorted in increasing or decreasing order. If the number of elements is odd, the median is the single middle element; if it is even, the median is the mean of the two middle elements.

The main difference between the behavior of the mean and the median is related to dataset outliers or extremes. The mean is heavily affected by outliers, while the median depends on them only slightly or not at all.

We can compare the mean and median as one way to detect outliers and asymmetry in our data.
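
A minimal sketch of that comparison, using the dataset from earlier in this notebook: the value 28.0 pulls the mean well above the median, and dropping it (purely for illustration, not as a recommended cleaning step) brings the two much closer together.

```python
import statistics

x = [8.0, 1, 2.5, 4, 28.0]

mean_all = statistics.mean(x)      # pulled upward by 28.0
median_all = statistics.median(x)  # 4

# Same data without the suspected outlier
x_trimmed = [item for item in x if item != 28.0]
mean_trimmed = statistics.mean(x_trimmed)      # 3.875
median_trimmed = statistics.median(x_trimmed)  # 3.25

print(mean_all - median_all)          # large gap: hints at an outlier or skew
print(mean_trimmed - median_trimmed)  # much smaller gap
```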

In [11]:
n = len(x)

x_ord = sorted(x)
if n % 2:
    # Odd length: take the single middle element
    median_ = x_ord[n // 2]
else:
    # Even length: average the two middle elements
    median_ = 0.5 * (x_ord[n // 2 - 1] + x_ord[n // 2])
    
median_

4

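Unlike most other functions from the Python statistics library, median(), median_low(), and median_high() don't return nan when there are nan values among the data points. The result is not reliable, though: these functions sort the data, and because ordering comparisons with nan are always False, the value returned can depend on where the nan values sit in the input.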

In [15]:
statistics.median(x_with_nan)
statistics.median_low(x_with_nan)
statistics.median_high(x_with_nan)

8.0
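
A safer approach, sketched below, is to drop the nan values before calling `statistics.median()`, or to use NumPy's `np.nanmedian()`, which ignores them. The data is rebuilt here so that the snippet runs on its own.

```python
import math
import statistics

import numpy as np

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# Remove nan values explicitly, then take the median
x_clean = [item for item in x_with_nan if not math.isnan(item)]
median_clean = statistics.median(x_clean)  # 4

# np.nanmedian() ignores nan values
median_np = np.nanmedian(x_with_nan)       # 4.0

print(median_clean, median_np)
```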

## Mode
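The sample mode is the value in the dataset that occurs most frequently. If no single value is the most frequent, the dataset is multimodal, because it has several modal values.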
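
As a short sketch, the mode can be found with `collections.Counter` or with the `statistics` module. The datasets `u` and `v` are made up for this example, and `statistics.multimode()` assumes Python 3.8 or later.

```python
import statistics
from collections import Counter

u = [2, 3, 2, 8, 12]
v = [12, 15, 12, 15, 21, 15, 12]

# Manual approach: most common value and its count
mode_manual = Counter(u).most_common(1)[0][0]  # 2

# Built-in functions
mode_u = statistics.mode(u)        # 2
modes_v = statistics.multimode(v)  # [12, 15]: both occur three times

print(mode_manual, mode_u, modes_v)
```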

# Measures of variability
The measures of central tendency are not enough to describe a dataset. We also need the measures of variability, which describe the spread of the data.

The measures of variability are:
- Variance
- Standard Deviation
- Skewness
- Percentiles
- Ranges

## Variance
The sample variance quantifies the spread of the data: it shows how far the data points are from the mean. We can express the sample variance of a dataset $x$ with $n$ elements mathematically as:
$$s^{2} =  \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{n-1}$$

In [3]:
n = len(x)
mean_ = sum(x)/n
var_ = sum((item - mean_)**2 for item in x) / (n-1)
var_

123.19999999999999

In [6]:
var_ = statistics.variance(x)
var_

123.2

We can also calculate the *population variance*. It is almost the same as the sample variance; the main difference is that we use the entire population instead of a sample, so we divide by $n$ instead of $(n - 1)$ and measure deviations from the population mean $\mu$:
$$\sigma^{2} = \frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{n}$$
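
The sketch below computes the population variance three ways for the same dataset, treating it as if it were a whole population purely for illustration. Note that `np.var()` uses `ddof=0` by default, so it returns the population variance unless `ddof=1` is passed.

```python
import statistics

import numpy as np

x = [8.0, 1, 2.5, 4, 28]
n = len(x)
mean_ = sum(x) / n

# Divide by n instead of n - 1
pvar_manual = sum((item - mean_) ** 2 for item in x) / n

pvar_stats = statistics.pvariance(x)
pvar_np = np.var(x)           # ddof=0 by default: population variance
svar_np = np.var(x, ddof=1)   # ddof=1: sample variance

print(pvar_manual, pvar_stats, pvar_np, svar_np)
```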


## Standard Deviation
The sample standard deviation is the positive square root of the sample variance. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread over a wide range.

The formula of standard deviation is:
$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{n-1}}$$

In [7]:
std_ = var_ ** 0.5
std_

11.099549540409287

We can also find the *population standard deviation*, which is the positive square root of the population variance.

The formula, where $\mu$ is the population mean, is:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_{i}-\mu)^{2}}{n}}$$
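
Both standard deviations are available in the standard library and in NumPy. A brief sketch follows; as with the variance, `np.std()` defaults to `ddof=0`, which gives the population value.

```python
import math
import statistics

import numpy as np

x = [8.0, 1, 2.5, 4, 28]

# Sample standard deviation (divides by n - 1)
std_sample = statistics.stdev(x)
std_sample_np = np.std(x, ddof=1)

# Population standard deviation (divides by n)
std_pop = statistics.pstdev(x)
std_pop_np = np.std(x)

print(std_sample, std_pop)
```

The sample value is always at least as large as the population value for the same data, since it divides by the smaller number.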

## Skewness
The **Sample Skewness** measures the asymmetry of a data sample.

There are several mathematical definitions of skewness. A common one is the **adjusted Fisher-Pearson standardized moment coefficient**, which is what the code in this section computes:
$$G_{1} = \frac{n}{(n-1)(n-2)} \cdot \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})^{3}}{s^{3}}$$
where $s$ is the sample standard deviation.

Usually, negative skewness indicates a dominant tail on the left, while positive skewness indicates a dominant tail on the right.
If the skewness is between -1/2 and 1/2, the dataset is considered fairly symmetrical.

In [11]:
x = [8.0, 1, 2.5, 4, 28]
n = len(x)

mean_ = sum(x)/n
var_ = sum((item - mean_)**2 for item in x) / (n - 1)
std_ = var_ ** 0.5
skew_ = (sum((item - mean_)**3 for item in x) * n / ((n - 1) * (n - 2) * std_**3))

skew_

1.9470432273905929

In [6]:
y, y_with_nan = np.array(x), np.array(x_with_nan)
scipy.stats.skew(y, bias=False)

1.9470432273905927
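
To make the sign convention concrete, the following sketch compares three small datasets: a symmetric one, the right-tailed dataset used above, and its mirror image. The symmetric and mirrored datasets are made up for this example.

```python
import numpy as np
import scipy.stats

symmetric = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = np.array([8.0, 1, 2.5, 4, 28])
left_tailed = -right_tailed  # mirror image: the long tail is now on the left

skew_sym = scipy.stats.skew(symmetric, bias=False)
skew_right = scipy.stats.skew(right_tailed, bias=False)
skew_left = scipy.stats.skew(left_tailed, bias=False)

print(skew_sym)    # zero for perfectly symmetric data
print(skew_right)  # positive: dominant tail on the right
print(skew_left)   # negative: dominant tail on the left
```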

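## Percentiles

The sample $p$-th percentile is the element in the dataset such that $p\%$ of the elements in the dataset are less than or equal to that value. The 25th, 50th, and 75th percentiles are called the quartiles, and they are what the cell below computes.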

In [8]:
x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
statistics.quantiles(x, n = 4, method='inclusive')

[0.1, 8.0, 21.0]
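
NumPy offers `np.percentile()` for arbitrary percentiles. As a sketch, with its default linear interpolation it should reproduce the quartiles above for this dataset, and it accepts any percentile between 0 and 100.

```python
import statistics

import numpy as np

x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]

# Quartiles: 25th, 50th, and 75th percentiles
quartiles_np = np.percentile(x, [25, 50, 75])

# Any other percentile works the same way; values between data points
# are linearly interpolated
p5 = np.percentile(x, 5)

# The 50th percentile is the median
median_ = statistics.median(x)

print(quartiles_np, p5, median_)
```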

## Ranges
The range of data is the difference between the maximum and minimum element in the dataset.

In [10]:
np.ptp(x)

46.0

# Summary of descriptive statistics

SciPy and pandas offer methods that return several descriptive statistics of a dataset with a single call:

In [5]:
res = scipy.stats.describe(y, ddof=1, bias=False)
print(res.nobs)
print(res.minmax)
print(res.mean)
print(res.variance)
print(res.skewness)
print(res.kurtosis)
res

5
(1.0, 28.0)
8.7
123.19999999999999
1.9470432273905927
3.878019618875446


DescribeResult(nobs=5, minmax=(1.0, 28.0), mean=8.7, variance=123.19999999999999, skewness=1.9470432273905927, kurtosis=3.878019618875446)

In [6]:
res = z.describe()
res

count     5.00000
mean      8.70000
std      11.09955
min       1.00000
25%       2.50000
50%       4.00000
75%       8.00000
max      28.00000
dtype: float64

# Measures of Correlation between pairs of data

Often we need to examine the relationship between the corresponding elements of two variables in a dataset. Say there are two variables, x and y, with an equal number of elements n. Let $x_{1}$ from $x$ correspond to $y_{1}$ from $y$, $x_{2}$ to $y_{2}$, and so on. We can then say that there are n pairs of corresponding elements of x and y.

The relationship between the two variables falls into one of these cases:

- **Positive correlation:** exists when larger values of x correspond to larger values of y
- **Negative correlation:** exists when larger values of x correspond to smaller values of y
- **Weak or no correlation:** exists if there is no such apparent relationship

In [1]:
import numpy as np
import pandas as pd

x = list(range(-10, 11))
y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]

# The same paired data as NumPy arrays and pandas Series
x_, y_ = np.array(x), np.array(y)
x__, y__ = pd.Series(x_), pd.Series(y_)

## Covariance
The sample covariance is a measure that quantifies the strength and direction of a relationship between a pair of variables:

- If the correlation is positive, then the covariance is positive
- If the correlation is negative, then the covariance is negative
- If the correlation is weak, then the covariance is close to zero
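
The three cases above can be checked on synthetic data. In this sketch the variables `y_pos`, `y_neg`, and `y_none` are made up for illustration; `y_none` is symmetric around zero, so it has no linear relationship with `x` even though it clearly depends on it.

```python
import numpy as np

x = np.arange(-10, 11)

y_pos = 2 * x + 1   # grows with x
y_neg = -3 * x      # shrinks as x grows
y_none = x ** 2     # no linear trend over this symmetric range

# np.cov() returns the covariance matrix; [0, 1] is the x-y covariance
cov_pos = np.cov(x, y_pos)[0, 1]
cov_neg = np.cov(x, y_neg)[0, 1]
cov_none = np.cov(x, y_none)[0, 1]

print(cov_pos)   # positive
print(cov_neg)   # negative
print(cov_none)  # zero, up to floating-point rounding
```

The last case is a reminder that covariance only captures linear association.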

The covariance of the variables x and y is mathematically defined as:

$$s_{xy} = \frac{\sum_{i=1}^{n}(x_{i} - \bar{x})(y_{i} - \bar{y})}{n - 1}$$

Note that the covariance of a variable with itself is its variance.
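
This identity is easy to verify numerically. The sketch below compares the covariance of `x` with itself against the sample variance of `x`, using `ddof=1` so that both divide by $n - 1$.

```python
import numpy as np

x = np.arange(-10, 11)

# Off-diagonal entry of the covariance matrix of (x, x)
cov_xx = np.cov(x, x)[0, 1]

# Sample variance of x
var_x = np.var(x, ddof=1)

print(cov_xx, var_x)
```

For the same reason, the diagonal of the matrix returned by `np.cov(x_, y_)` holds the sample variances of the two variables.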

In [12]:
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y)/n
cov_xy = (sum((x[k] - mean_x) * (y[k] - mean_y) for k in range(n)) / (n - 1))

print(np.cov(x_, y_))
cov_xy

[[38.5        19.95      ]
 [19.95       13.91428571]]


19.95