# Data Mining and Probabilistic Reasoning, WS18/19


Dr. Gjergji Kasneci, The University of Tübingen

-----
## Basics of Probability Theory using Python
-----

###### Date 29/10/2018

Teaching assistants:

 - Vadim Borisov (vadim.borisov@uni-tuebingen.de)

 - Johannes Haug (johannes-christian.haug@uni-tuebingen.de)

In [1]:
# loading python libraries 
import numpy as np
from scipy import stats

import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.figure(figsize=(13, 10))


<matplotlib.figure.Figure at 0x7f2a6828a9b0>

# Bernoulli distribution 


$$P(X=x)= p^{x}(1-p)^{1-x}$$


In the literature:

```X ~ Bernoulli(p)```
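
As a quick sanity check, the formula above can be evaluated by hand. The sketch below uses only the standard library; the helper name `bernoulli_pmf` is made up for illustration, and `p = 0.045` is simply the value used in the next cell.

```python
def bernoulli_pmf(x, p):
    """P(X = x) = p^x (1 - p)^(1 - x) for x in {0, 1}, zero elsewhere."""
    if x not in (0, 1):
        return 0.0
    return p ** x * (1 - p) ** (1 - x)

p = 0.045
pmf0 = bernoulli_pmf(0, p)   # probability of failure, 1 - p
pmf1 = bernoulli_pmf(1, p)   # probability of success, p

# Mean and variance computed directly from the definition of expectation.
mean = 0 * pmf0 + 1 * pmf1
var = (0 - mean) ** 2 * pmf0 + (1 - mean) ** 2 * pmf1

print('P(0) = ', pmf0)
print('P(1) = ', pmf1)
print('E[X] = ', mean)       # equals p
print('Var[X] = ', var)      # equals p * (1 - p)
```

The two probabilities sum to one, and the mean and variance reduce to $p$ and $p(1-p)$.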

In [4]:
X = stats.bernoulli(p=0.045) 
print('P(1) = ', X.pmf(1))
print('P(x<=2) = ', X.cdf(2))
print('E[X] = ', X.mean())
print('Var[X] = ', X.var())
print('Std[X] = ', X.std())
print('Sample from X = ', X.rvs())
print('10 samples from X =', X.rvs(10))

P(1) =  0.045000000000000005
P(x<=2) =  1.0
E[X] =  0.045
Var[X] =  0.042975
Std[X] =  0.2073041244162788
Sample from X =  0
10 samples from X = [0 0 0 0 0 0 0 0 0 0]


In [None]:
# print(max(X.rvs(100000)))

In [None]:
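# A Bernoulli variable only takes the values 0 and 1,
# so the PMF is evaluated at exactly those two points.
x = np.array([0, 1])

plt.bar(x, X.pmf(x))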


# Binomial distribution 


$$P(X=k)=\binom{m}{k} p^{k}(1-p)^{m-k}$$
where:
$$\binom{m}{k}=\frac{m!}{k!(m-k)!}$$

In the literature:

```X ~ Binomial(m, p)```
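
The formula can be checked by evaluating it term by term. This is a minimal sketch using `math.comb` from the standard library (Python 3.8+); the helper name `binomial_pmf` is hypothetical, and $m = 5$, $p = 0.5$ mirror the parameters of the next cell.

```python
from math import comb

def binomial_pmf(k, m, p):
    """P(X = k) = C(m, k) * p^k * (1 - p)^(m - k)."""
    return comb(m, k) * p ** k * (1 - p) ** (m - k)

m, p = 5, 0.5
pmf = [binomial_pmf(k, m, p) for k in range(m + 1)]

total = sum(pmf)                                  # probabilities sum to 1
mean = sum(k * pk for k, pk in enumerate(pmf))    # equals m * p
cdf2 = sum(pmf[:3])                               # P(X <= 2) = P(0) + P(1) + P(2)

print('PMF = ', pmf)
print('sum = ', total)
print('E[X] = ', mean)
print('P(x<=2) = ', cdf2)
```

Summing the PMF up to $k$ gives the CDF at $k$, which is what `X.cdf(k)` returns for a discrete distribution.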

In [None]:
X = stats.binom(n=5,p=0.5) # Declare X to be a binomial random variable
print('P(0) = ', X.pmf(0))
print('P(x<=2) = ', X.cdf(2))
print('E[X] = ', X.mean())            
print('Var[X] = ', X.var())            
print('Std[X] = ', X.std())            
print('Sample from X = ', X.rvs())            
print('10 samples from X =', X.rvs(10))         

In [None]:
x = np.arange(0, 6, 1)

plt.bar(x, X.pmf(x))
#plt.plot(x, X.pmf(x))


#  Uniform Distribution

In [None]:
# Draw 1000 samples from the uniform distribution on [0, 1): arguments are (low, high, size)
X = np.random.uniform(0, 1, 1000)
   

In [None]:
count, bins, ignored = plt.hist(X, 10, density=True)
plt.plot(bins, np.ones_like(bins), linewidth=2, color='b')

#  Normal Distribution

$$ \displaystyle f(x\mid \mu ,\sigma ^{2})={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}} $$

where: 
- $\mu$ mean
- $\sigma$ standard deviation 
- $\sigma ^2$ variance 


In the literature:

 $$X \sim \mathcal{N}(\mu,\,\sigma^{2})\, $$
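
Note that the notation uses the variance $\sigma^2$, while `stats.norm` takes the standard deviation $\sigma$ as its `scale` argument. The sketch below evaluates the density formula by hand and compares it with `statistics.NormalDist` from the standard library (Python 3.8+); the helper name `normal_pdf` and the evaluation point `0.5` are arbitrary choices for illustration.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x; sigma is the standard deviation."""
    variance = sigma ** 2
    return exp(-(x - mu) ** 2 / (2 * variance)) / sqrt(2 * pi * variance)

mu, sigma = 0.1, 1.0
by_hand = normal_pdf(0.5, mu, sigma)
reference = NormalDist(mu=mu, sigma=sigma).pdf(0.5)

print('formula   = ', by_hand)
print('reference = ', reference)
# The density peaks at x = mu with height 1 / sqrt(2 * pi * sigma^2).
print('peak      = ', normal_pdf(mu, mu, sigma))
```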

In [None]:
mu = 0.1
sigma = 1

X = stats.norm(loc=mu, scale=sigma)

# do multiple norms 

In [None]:
print('f(4) = ', X.pdf(4))   # density at x = 4, not a probability
print('P(x<=2) = ', X.cdf(2))
print('E[X] = ', X.mean())            
print('Var[X] = ', X.var())            
print('Std[X] = ', X.std())            
print('Sample from X = ', X.rvs())            
print('10 samples from X =', X.rvs(10))      

In [None]:
#%matplotlib notebook

x = np.linspace(-4,4,100)
pdf = X.pdf(x)
plt.plot(x, pdf)

In [None]:
cdf = X.cdf(x)
plt.plot(x,cdf)

- The cumulative distribution function (CDF) gives the probability that a random variable is less than or equal to a given real number.

- The probability mass function (PMF) gives the probability that a discrete random variable is exactly equal to a given value.

- The probability density function (PDF) of a continuous random variable X, when integrated over a set of real numbers A, gives the probability that X lies in A.
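
The link between the PDF and the CDF can be illustrated numerically: integrating the density over an interval $[a, b]$ should reproduce $F(b) - F(a)$. The sketch below assumes a standard normal variable and uses a simple trapezoidal rule; the interval and the number of steps are arbitrary choices.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)   # standard normal, chosen for illustration

a, b, steps = -1.0, 2.0, 10_000
h = (b - a) / steps

# Trapezoidal rule: approximate the area under the PDF between a and b.
area = sum(
    0.5 * (Z.pdf(a + i * h) + Z.pdf(a + (i + 1) * h)) * h
    for i in range(steps)
)

# The same probability obtained from the CDF.
exact = Z.cdf(b) - Z.cdf(a)

print('integral of PDF = ', area)
print('CDF difference  = ', exact)
```

A single value of the PDF is therefore not a probability; only areas under the curve are.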

# The exponential distribution and the Central Limit Theorem (CLT)

In a nutshell:

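“The central limit theorem (CLT) is a statistical theory that states that given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population.” [Source](https://towardsdatascience.com/understanding-the-central-limit-theorem-642473c63ad8)

More precisely, the CLT describes the distribution of the sample mean: as the sample size grows, it approaches a normal distribution centered at the population mean, with variance equal to the population variance divided by the sample size.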



<center>

![image](./data/IllustrationCentralTheorem.png)
</center>

### Exponential distribution


 $$f(x;\lambda) = \begin{cases}
\lambda e^{-\lambda x} & x \ge 0, \\
0 & x < 0.
\end{cases}$$
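
The density can be evaluated directly from the piecewise definition. Integrating it from $0$ to $x$ gives the CDF $F(x) = 1 - e^{-\lambda x}$ for $x \ge 0$. The sketch below implements both by hand (the helper names are hypothetical) and checks numerically that the derivative of the CDF recovers the PDF; $\lambda = 2$ matches the next cell.

```python
from math import exp

def expon_pdf(x, lam):
    """f(x; lambda) = lambda * exp(-lambda * x) for x >= 0, else 0."""
    return lam * exp(-lam * x) if x >= 0 else 0.0

def expon_cdf(x, lam):
    """F(x; lambda) = 1 - exp(-lambda * x) for x >= 0, else 0."""
    return 1.0 - exp(-lam * x) if x >= 0 else 0.0

lam = 2.0
print('f(4) = ', expon_pdf(4, lam))
print('P(x<=2) = ', expon_cdf(2, lam))

# Central finite difference of the CDF at x = 1 approximates the PDF there.
h = 1e-5
slope = (expon_cdf(1 + h, lam) - expon_cdf(1 - h, lam)) / (2 * h)
print('dF/dx at 1 = ', slope, ' f(1) = ', expon_pdf(1, lam))
```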



In [None]:
# scipy parameterizes the exponential distribution by scale = 1/lambda (see the scipy.stats.expon docs)
lambda_ = 2
X = stats.expon(scale = 1/lambda_)

In [None]:
print('f(4) = ', X.pdf(4))   # density at x = 4, not a probability
print('P(x<=2) = ', X.cdf(2))         
print('E[X] = ', X.mean())            
print('Var[X] = ', X.var())            
print('Std[X] = ', X.std())            
print('Sample from X = ', X.rvs())            
print('10 samples from X =', X.rvs(10))      

In [None]:
x = np.linspace(0,10,100)
plt.plot(x, X.pdf(x))

In [None]:
# sample size: number of draws averaged into each sample mean
n = 2
sample = X.rvs(n)
print('Samples: ', sample)


In [None]:
# 2000 independent samples of size n, one sample per row
sample = X.rvs(size=(2000, n))
sample

In [None]:
sample.shape

In [None]:
means = sample.mean(axis=1)
means

In [None]:
means.shape

In [None]:

plt.hist(means, density=True, bins=20)
plt.show()
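
To see the effect of the sample size, the experiment can be repeated for several values of `n`. The following is a minimal sketch using only the standard library (`random.expovariate` takes the rate $\lambda$ directly); the sample sizes `2, 30, 200`, the seed, and the helper name `sample_means` are arbitrary choices for illustration. For each `n` it compares the empirical mean and standard deviation of the sample means with the values predicted by the CLT, $1/\lambda$ and $1/(\lambda\sqrt{n})$, and reports the fraction of sample means within one predicted standard deviation of the mean (about 0.68 for a normal distribution).

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(0)          # fixed seed so the run is reproducible
lambda_ = 2.0           # rate of the exponential distribution
repeats = 2000          # number of sample means per experiment

def sample_means(n, repeats):
    """Draw `repeats` samples of size n and return their means."""
    return [
        fmean(random.expovariate(lambda_) for _ in range(n))
        for _ in range(repeats)
    ]

results = {}
for n in (2, 30, 200):
    means = sample_means(n, repeats)
    mu_theory = 1 / lambda_                  # population mean
    sd_theory = 1 / (lambda_ * sqrt(n))      # population std / sqrt(n)
    within = sum(abs(m - mu_theory) <= sd_theory for m in means) / repeats
    results[n] = (fmean(means), stdev(means), within)
    print(f'n={n:3d}  mean={results[n][0]:.4f} (theory {mu_theory:.4f})  '
          f'std={results[n][1]:.4f} (theory {sd_theory:.4f})  '
          f'within 1 std={within:.3f}')
```

With `n = 2` the histogram of means is still visibly skewed; for larger `n` it becomes narrower and closer to a bell shape.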