# Programming for Data Science and Artificial Intelligence

## 6 Statistics with Python

### Readings: 
- Statistics (scipy.stats) https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html

### Random variables and probability distribution

A **random variable** is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables, discrete and continuous.

A **discrete random variable** is one which may take on only a countable number of distinct values. For example, you can define a random variable $X$ to be the number which comes up when you roll a fair die. $X$ can take the values <code>[1,2,3,4,5,6]</code> and is therefore a discrete random variable.

The **probability distribution** of a **discrete random variable** is a list of the probabilities associated with each of its possible values. It is also sometimes called the probability function or the probability mass function. Mathematically, suppose a random variable $X$ may take $k$ different values, with the probability that $X=x_i$ defined to be $P(X=x_i)=p_i$. Then the probabilities $p_i$ must satisfy the following:

1. $0 \le p_i \le 1$ for each $i$

2. $p_1 + p_2 + ... + p_k = 1$


For example, $p_1$ may refer to the probability of rolling a 1, which is $1/6$ (about 16.7%) for a fair die but could be some other value, such as 13%, for a loaded one.  Here $k$ is 6 because a die has 6 sides.
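As a minimal sketch of the two conditions above, the fair die can be modeled with `scipy.stats.randint`, a discrete uniform distribution. The variable names (`die`, `faces`, `pmf`) are chosen here for illustration only.

```python
import numpy as np
from scipy import stats

# Fair six-sided die: discrete uniform on {1, ..., 6}.
# stats.randint(low, high) excludes `high`, hence the 7.
die = stats.randint(1, 7)

faces = np.arange(1, 7)
pmf = die.pmf(faces)  # P(X = x_i) for each face

# Condition 1: every probability lies between 0 and 1.
all_valid = bool(np.all((pmf >= 0) & (pmf <= 1)))

# Condition 2: the probabilities sum to 1.
total = pmf.sum()

for face, p in zip(faces, pmf):
    print(f"P(X = {face}) = {p:.4f}")
print("All probabilities in [0, 1]:", all_valid)
print("Sum of probabilities:", total)
```

Each face should come out with probability $1/6$, and the sum should equal 1 up to floating-point rounding.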

Some examples of discrete probability distributions are the **Bernoulli distribution**, the **Binomial distribution**, and the **Poisson distribution**.

A **continuous random variable** is one which can take any of an uncountably infinite number of possible values, typically every value in an interval. For example, you can define a random variable $X$ to be the height of students in a class. Since a continuous random variable is defined over an interval of values, its probabilities are represented by areas under a curve (that is, by integrals).

The **probability distribution** of a **continuous random variable** is described by a probability density function, a function defined over continuous values. The probability of observing any single exact value is equal to 0, since the number of values which the random variable may assume is uncountably infinite. For example, a random variable $X$ may take all values over an interval of real numbers. Then the probability that $X$ is in a set of outcomes $A$, written $P(A)$, is defined to be the area above $A$ and under a curve. The curve, which represents a function $p(x)$, must satisfy the following:

1. The curve has no negative values ($p(x) \ge 0$ for all $x$)

2. The total area under the curve is equal to 1

A curve meeting these requirements is often known as a **density curve**. Some examples of continuous probability distributions are **normal distribution**, **exponential distribution**, **beta distribution**, etc.
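To make the "area under a curve" idea concrete, here is a rough sketch using `scipy.stats.norm` and `scipy.integrate.quad`. The height model (mean 170 cm, standard deviation 10 cm) is a made-up assumption for illustration, not real data.

```python
import numpy as np
from scipy import integrate, stats

# Hypothetical height model in cm; loc and scale are illustrative assumptions.
heights = stats.norm(loc=170, scale=10)

# Condition 1: the density is never negative (checked on a grid of points).
x = np.linspace(120, 220, 1001)
nonneg = bool(np.all(heights.pdf(x) >= 0))

# Condition 2: the total area under the curve is 1.
# We integrate over mean +/- 10 standard deviations, which captures
# essentially all of the area while keeping the integration range finite.
total_area, _ = integrate.quad(heights.pdf, 70, 270)

# P(160 <= X <= 180) is the area under the curve between 160 and 180.
area_ab, _ = integrate.quad(heights.pdf, 160, 180)

# The same probability obtained from the cumulative distribution function.
prob_ab = heights.cdf(180) - heights.cdf(160)

print("Density non-negative on grid:", nonneg)
print("Total area:", round(total_area, 6))
print("P(160 <= X <= 180) by integration:", round(area_ab, 4))
print("P(160 <= X <= 180) by CDF:", round(prob_ab, 4))
```

The two ways of computing $P(160 \le X \le 180)$ should agree, which is just the statement that probability equals area under the density curve.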

There’s another function that often pops up in the literature which you should know about, called the **cumulative distribution function**. All random variables (discrete and continuous) have a cumulative distribution function. It is a function giving the probability that the random variable $X$ is less than or equal to $x$, for every value $x$. For a discrete random variable, the cumulative distribution function is found by summing up the probabilities of all values less than or equal to $x$.
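For the discrete case, the summing can be sketched with the fair die again, modeled as `scipy.stats.randint(1, 7)`. The variable names are illustrative; the cumulative sum of the probabilities is compared against the distribution's own `cdf` method.

```python
import numpy as np
from scipy import stats

# Fair six-sided die (the upper bound of stats.randint is exclusive).
die = stats.randint(1, 7)
faces = np.arange(1, 7)

# CDF by hand: running sum of P(X = 1), P(X = 2), ...
cdf_by_sum = np.cumsum(die.pmf(faces))

# CDF as provided by scipy.
cdf_builtin = die.cdf(faces)

for face, c in zip(faces, cdf_by_sum):
    print(f"P(X <= {face}) = {c:.4f}")
print("Matches scipy's cdf:", bool(np.allclose(cdf_by_sum, cdf_builtin)))
```

The last value, $P(X \le 6)$, should be 1, since the die always shows one of its six faces.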

### How distributions are related to Data Science

You are probably wondering why and how distributions are related to Data Science.  Rolling dice seems a bit far-fetched from data science.

We can see the relationship by first understanding these simple facts:
1. Many things in the world have a distribution.  
     - For example, human heights, if we could collect data from the whole population of the world (which is impossible), are commonly assumed to follow a normal distribution.  That is, statisticians believe that heights cluster around the average.  Of course, this assumption holds even better if we focus on only one country.
     - Every image has a distribution.  Cat photos all share similar distributions.  Dog photos also share similar distributions.  It may be hard to imagine what the numbers are here, but think of pixel brightness or color values, from which we can build a distribution.  By understanding these distributions, we can sometimes generate a fake photo of a cat!  Or a dog!  Or even use these distributions to predict whether a photo shows a cat or a dog.
     - Shopping behavior has a distribution.  Assume the numbers here are the number of dollars spent.  If we understand this distribution, we can do something cool like customer segmentation.
     - Each piece of music in the world has a distribution.  Drums have their own distribution of bass, beats, etc. (I don't really know much about music either!).  A guitar also has its own distribution.  If we know these distributions, we can filter out noise by keeping only the data that fits these distributions.  Isn't it amazing?
     - Customer queuing can be modeled with a distribution.  Let's say the time between customer arrivals follows an exponential distribution.  This means that short gaps between arrivals are the most likely and long gaps become increasingly rare.  Understanding this distribution allows us to arrange the queue in a more efficient way.
     - Perhaps car crashes can be modeled with a Poisson distribution, which describes the number of crashes within a time interval.  Using this, we may be able to predict the number of crashes in the future.
2. Understanding the shape of a distribution can be done by a machine.
    - One difficulty is that the distribution of data is not easy to understand by hand.  Consider an image with 1920 x 1080 pixels: it is really difficult!  However, given a machine, we can ask it to look at a particular batch of pixels and try to learn the distribution.  Do you start to see why machines and distributions are related to Data Science?
    - Of course, that's not the end of the pipeline.  Once the machine gets a rough idea of what the distribution is, it can use it to predict, and then revise its model.  That's exactly how distributions and data science are related.  Of course, not all machine learning uses distributions directly, but almost all models at least carry some assumptions about distributions made by their creators.
    
Ok, enough talking.  Let's jump to Python right away to see some examples.