# Statistics basics
***
## Introduction
In this tutorial you are going to learn how to use Python tools to visualize and analyze randomly generated data. The purpose is to get acquainted with statistical concepts such as __probability density functions__ and __cumulative distribution functions__.

As usual, we start by importing Python packages that we will use later on. In this tutorial, we are using _NumPy_ to create random numbers, the set of statistical tools in _SciPy_, and _Matplotlib_ and _Seaborn_ to plot graphs.

In [None]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
sns.set_context('notebook')

## Example: normal distribution
As an example, we create an array of random numbers drawn from a normal distribution with mean (`loc`) 0 and standard deviation (`scale`) 1.
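
A minimal sketch of this step, assuming NumPy's `default_rng` generator; the seed (42), the sample size (1000) and the variable name `data` are choices made here for illustration, not values given in the tutorial.

```python
import numpy as np

# Seeded generator so the draws are reproducible (the seed value is arbitrary)
rng = np.random.default_rng(42)

# 1000 samples from N(0, 1); the sample size is an assumption of this sketch
data = rng.normal(loc=0, scale=1, size=1000)

print(data.shape)               # (1000,)
print(data.mean(), data.std())  # close to 0 and 1, but not exactly
```

The sample mean and standard deviation only approximate `loc` and `scale`, because the sample is finite.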

In [None]:
# create an array of random numbers taken from a normal distribution


### Visualization
Let's plot the histogram of the random data with the _Seaborn_ function `distplot` (deprecated since Seaborn 0.11 in favor of `histplot` and `displot`).

In [None]:
# histogram


The histogram is sensitive to the number of bins into which the variable range is divided. By default, `distplot` estimates a suitable number of bins from the data, but you can set a fixed number of bins and check how the histogram changes.
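
The effect of the bin count can also be checked numerically, without plotting. The sketch below uses `np.histogram` in place of `distplot`; the seed, sample size and bin counts are assumptions. NumPy's `'auto'` rule is shown only as one example of an automatic choice, and it is not necessarily the rule that `distplot` applies.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=1000)

# Same data, three different bin counts: few bins hide detail, many bins add noise
for bins in (5, 20, 100):
    counts, edges = np.histogram(data, bins=bins)
    width = edges[1] - edges[0]
    print(f"bins={bins:3d}  bin width={width:.3f}  tallest bin={counts.max()}")

# Let NumPy pick the number of bins from the data
auto_counts, auto_edges = np.histogram(data, bins='auto')
print("bins chosen by 'auto':", len(auto_counts))
```

Whatever the number of bins, the counts always add up to the sample size; only how the samples are grouped changes.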

In [None]:
# histogram with different number of bins


To avoid this limitation, `distplot` can overlay a kernel density estimate (KDE), a smooth curve that represents the shape of the data distribution independently of the number of bins.
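
To see what a kernel density estimate computes, here is a small sketch that uses `scipy.stats.gaussian_kde` instead of the Seaborn plotting function. The seed, sample size, evaluation grid and the narrower bandwidth factor (0.1) are assumptions. Note that a KDE does not depend on bins, but its smoothness is governed by a bandwidth.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=1000)

# Gaussian KDE with the default bandwidth rule
kde = stats.gaussian_kde(data)

# Evaluate the estimated density on a regular grid
x = np.linspace(-4, 4, 201)
density = kde(x)

# A density should integrate to roughly 1 (simple Riemann sum over the grid)
area = density.sum() * (x[1] - x[0])
print(area)

# A smaller bandwidth factor follows the sample more closely (wigglier curve)
kde_narrow = stats.gaussian_kde(data, bw_method=0.1)
density_narrow = kde_narrow(x)
```

Plotting `density` and `density_narrow` against `x` shows the trade-off between a smooth curve and one that follows the sample noise.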

In [None]:
# histogram + kernel density estimate (default)


The kernel density estimate resembles what we expect our distribution to look like: a Gaussian bell. In fact, if we know which distribution family our data follows, `distplot` can fit that distribution and plot it.

In [None]:
# histogram + fitted normal distribution


### Distribution fitting
Now we are going to use the statistical tools in _SciPy_ (`stats`) to fit a distribution to our data and to learn what the __probability density function__ and the __cumulative distribution function__ are.

To fit a distribution in `scipy.stats`, we only need to call the `fit` method of the specific distribution.
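
For the normal distribution this could look as follows. The sample is regenerated here so the snippet runs on its own; the seed and sample size are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=1000)

# For stats.norm, fit returns the estimated (loc, scale)
loc, scale = stats.norm.fit(data)
print(loc, scale)

# For the normal distribution these estimates coincide with the sample mean
# and the (population) standard deviation of the sample
print(data.mean(), data.std())
```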

In [None]:
# fit a normal distribution


Notice that the fit is not perfect: we generated the random values from a normal distribution $N(0,1)$, i.e., $loc=0$ and $scale=1$, but the fitted parameters differ slightly from these values. This is due to the limited amount of data, but the fit is good enough.

### Probability density function (PDF)
The probability density function represents the _relative likelihood_ that the variable takes a specific value. In practical terms, values with a higher PDF are more likely to happen. In the case of the normal distribution, the probability density function is the well-known Gaussian bell. Let's plot it for the distribution we have fitted above.

In `scipy.stats`, we will use the function `pdf`.
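
A sketch of the PDF evaluation on a grid of values, assuming the fitted parameters come from `stats.norm.fit` on a regenerated sample (seed, sample size and grid limits are choices made here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=1000)
loc, scale = stats.norm.fit(data)

# Linearly spaced values of X covering about four standard deviations
x = np.linspace(-4, 4, 201)

# PDF of the fitted normal distribution at each value of x
pdf = stats.norm.pdf(x, loc=loc, scale=scale)

# The curve peaks at the grid point closest to the fitted mean
print(x[np.argmax(pdf)], loc)

# A density is not a probability: it can exceed 1 for a narrow distribution
print(stats.norm.pdf(0, loc=0, scale=0.1))  # about 3.99
```

Only the area under the PDF over an interval is a probability, which is why individual density values may be larger than 1.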

In [None]:
# linearly spaced values of X


In [None]:
# PDF (probability density function)


In [None]:
# line plot of the PDF


### Cumulative distribution function (CDF)
The cumulative distribution function is the probability that the variable $X$ takes a value less than or equal to $x$; for that reason it is also called the non-exceedance probability. The cumulative distribution function (CDF) is the integral of the probability density function (PDF) up to $x$.
$$CDF(x) = P(X\leq x) = \int_{-\infty}^{x} PDF(t)\,dt$$

The CDF is a monotonically increasing function that ranges between 0 and 1.

In `scipy.stats`, we will use the function `cdf`.
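
The sketch below evaluates the CDF and checks numerically that it behaves like the running integral of the PDF. For simplicity it assumes a standard normal $N(0,1)$ as a stand-in for the fitted distribution; the grid is also an assumption.

```python
import numpy as np
from scipy import stats

# Frozen standard normal, standing in for the fitted distribution
dist = stats.norm(loc=0, scale=1)

x = np.linspace(-4, 4, 201)
cdf = dist.cdf(x)

# Crude numerical integral of the PDF: running Riemann sum over the grid
dx = x[1] - x[0]
cdf_numeric = np.cumsum(dist.pdf(x)) * dx

# Largest disagreement between the exact CDF and the running sum (about 0.01)
max_gap = np.max(np.abs(cdf - cdf_numeric))
print(max_gap)

# Half of the probability mass of a symmetric distribution lies below its mean
print(dist.cdf(0))  # 0.5
```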

In [None]:
# CDF (cumulative distribution function)


In [None]:
# line plot of the CDF


### Percent point function (PPF)
`scipy.stats` includes the function `ppf` (percent point function), which inverts the CDF; that is, it calculates the value of $X$ for a given value of the CDF.
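
That `ppf` inverts `cdf` can be verified with a round trip. As before, a standard normal is assumed in place of the fitted distribution, and the probability grid is a choice made for this sketch.

```python
import numpy as np
from scipy import stats

dist = stats.norm(loc=0, scale=1)

# Linearly spaced CDF values; 0 and 1 are left out because the normal
# ppf maps them to -inf and +inf
q = np.linspace(0.01, 0.99, 99)

# Values of x that correspond to each probability
x_q = dist.ppf(q)

# Applying the CDF to those values recovers the probabilities we started from
roundtrip = dist.cdf(x_q)
print(np.allclose(roundtrip, q))  # True
```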

In [None]:
# linearly spaced values CDF


In [None]:
# x using 'ppf'


In [None]:
# line plot of the CDF


### Use of the CDF
Because the CDF is a monotonically increasing function, it lets us go from a value of the variable $X$ to its non-exceedance probability and vice versa. For instance:
1. What is the probability that the variable $X$ takes a value lower than $x=1$? → __CDF__
2. What is the value $x$ that is not exceeded 99% of the time? → __PPF__
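
Both questions can be answered in one line each. The numbers below assume an exact standard normal $N(0,1)$; with parameters fitted to a finite sample, the results will differ slightly.

```python
from scipy import stats

dist = stats.norm(loc=0, scale=1)

# 1. Probability that X takes a value lower than x = 1
p_below_1 = dist.cdf(1)
print(round(p_below_1, 4))  # 0.8413

# 2. Value x that is not exceeded 99% of the time
x_99 = dist.ppf(0.99)
print(round(x_99, 4))       # 2.3263
```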

In [None]:
# probability that the variable 𝑋 takes a value lower than 𝑥=1


In [None]:
# plot explaining what 'cdf' does


In [None]:
# value 𝑥 that is not exceeded 99% of the time


In [None]:
# plot explaining what 'ppf' does


### Useful link:
[Seaborn: visualizing the distribution of a dataset](https://seaborn.pydata.org/tutorial/distributions.html)