## Derivatives

Derivatives are used to find the slope of a function at a given point. The slope measures the rate of change of the function at that point.

In statistics and machine learning, derivatives are useful in certain algorithms, such as gradient descent. When the slope is zero, we may be at a minimum or a maximum of the output variable.


In [2]:
def derivative_x(f, x, step):
    # Forward-difference approximation: rise over run across a small step.
    derivative = (f(x + step) - f(x)) / ((x + step) - x)
    return derivative


def f(x):
    return x**2


print(derivative_x(f, 2, 0.00001))

4.000010000000827
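As a rough illustration of why a zero slope matters for gradient descent, the sketch below repeatedly steps against the slope until it settles near a minimum. The names `numerical_derivative`, `gradient_descent`, and `parabola`, as well as the learning rate and iteration count, are illustrative assumptions and not part of the original notes.

```python
def numerical_derivative(f, x, step=1e-6):
    # Central-difference approximation of the slope at x.
    return (f(x + step) - f(x - step)) / (2 * step)


def gradient_descent(f, start, learning_rate=0.1, iterations=200):
    # Move against the slope; the updates shrink as the slope approaches zero.
    x = start
    for _ in range(iterations):
        x -= learning_rate * numerical_derivative(f, x)
    return x


def parabola(x):
    # Hypothetical test function with its minimum at x = 3.
    return (x - 3) ** 2 + 1


minimum_x = gradient_descent(parabola, start=0.0)
print(minimum_x)                                  # should be close to 3
print(numerical_derivative(parabola, minimum_x))  # should be close to 0
```

The fixed learning rate works here because the function is a simple parabola; other functions may need a different rate or more iterations.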


#### Partial Derivatives

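These are derivatives of functions that have multiple input variables. Each partial derivative measures the slope with respect to one variable while the others are held constant.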
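A minimal sketch of a numerical partial derivative follows: nudge one input while holding the others fixed. The helper `partial_derivative` and the example function `surface` are hypothetical names chosen for illustration.

```python
def partial_derivative(f, point, index, step=1e-6):
    # Shift only the variable at position `index`; all others stay fixed.
    forward = list(point)
    backward = list(point)
    forward[index] += step
    backward[index] -= step
    return (f(*forward) - f(*backward)) / (2 * step)


def surface(x, y):
    # Hypothetical example: f(x, y) = 2x^3 + 3y^3
    return 2 * x**3 + 3 * y**3


# Analytically, df/dx = 6x^2 and df/dy = 9y^2.
print(partial_derivative(surface, (1.0, 2.0), index=0))  # close to 6
print(partial_derivative(surface, (1.0, 2.0), index=1))  # close to 36
```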

#### Integrals

The integral is the opposite of the derivative: it finds the area under the curve over a given range.

In [7]:
def approximate_integral(a, b, n, f):
    delta_x = (b-a)/n
    total_sum = 0
    for i in range(1, n + 1):
        midpoint = a + (i-1) * delta_x + (delta_x/2)
        total_sum += f(midpoint) * delta_x
    return total_sum

def my_function(x):
    return x ** 2 + 1


area = approximate_integral(a=0, b=1, n=5, f=my_function)
print(area)

area = approximate_integral(a=0, b=1, n=1000, f=my_function)
print(area)

1.33
1.3333332500000004


### Probability

Probability is how strongly we believe an event will happen, often expressed as a percentage. 
Probability is about quantifying predictions of events yet to happen, whereas likelihood is measuring the frequency of events that already occurred. 

In statistics and machine learning, we often use likelihood (the past) in the form of data to predict probability (the future).

**The probability of an event is always between 0.0 and 1.0, inclusive**. That means if $P(X) = 0.7$, then $P(\lnot X) = 1 - P(X) = 0.3$.

This is another distinction between probability and likelihood. Probabilities of all possible mutually exclusive outcomes of an event must sum up to 1.0. Likelihoods, however, are not subject to this rule.

Alternatively, probabilities can be expressed as odds $O(X)$, such as $7:3$, $7/3$, or $2.\overline{3}$.

To turn odds $O(X)$ into a proportional probability $P(X)$, use this formula:

$$
P(X) = \frac{O(X)}{1 + O(X)}
$$
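As a quick check of the conversion, the sketch below applies the formula to odds of $7:3$. The function names are illustrative, and `probability_to_odds` is simply the same formula rearranged as $O(X) = P(X) / (1 - P(X))$.

```python
from fractions import Fraction


def odds_to_probability(odds):
    # P(X) = O(X) / (1 + O(X))
    return odds / (1 + odds)


def probability_to_odds(probability):
    # Rearranged form: O(X) = P(X) / (1 - P(X))
    return probability / (1 - probability)


odds = Fraction(7, 3)                    # odds of 7:3
probability = odds_to_probability(odds)
print(probability)                       # 7/10
print(probability_to_odds(probability))  # 7/3
```

Using `Fraction` keeps the arithmetic exact; plain floats work as well but may show small rounding errors.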

**Probability vs. Statistics**: Probability is a purely theoretical treatment of how likely an event is to happen and does not require data. Statistics, on the other hand, cannot exist without data: it uses data to discover probabilities and provides tools to describe the data.

#### Probability Math
The single probability of an event $P(X)$ is called _marginal probability_.

##### Joint Probability
This is the probability of two or more events occurring simultaneously.
A joint probability is used to find the probability of separate events with separate probabilities occurring together.

If two events are independent:

$$
P(A \text{ and } B) = P(A) \times P(B)
$$

The above formula is also called the product rule.

In [4]:
# Generate all possible outcomes between a coin toss and die toss.
coin = ["H", "T"]
die = ["1", "2", "3", "4", "5", "6"]
outcomes = [one + two for one in coin for two in die]
print(outcomes, len(outcomes)) # the probability of any one outcome is 1/12


['H1', 'H2', 'H3', 'H4', 'H5', 'H6', 'T1', 'T2', 'T3', 'T4', 'T5', 'T6'] 12


##### Union Probability
This deals with the probability of one event or another occurring.

For mutually exclusive events:

$$
P(A \text{ or } B) = P(A) + P(B)
$$

This is also called the addition rule.

In general, whether the events are mutually exclusive or not:

$$
P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)
$$

If $A$ and $B$ are mutually exclusive, $P(A \text{ and } B) = 0$.
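The general addition rule can be checked by counting outcomes for a coin toss combined with a die roll. This is a small sketch; the helpers `probability_of`, `is_heads`, and `is_six` are illustrative names, and all twelve outcomes are assumed to be equally likely.

```python
from fractions import Fraction

coin = ["H", "T"]
die = ["1", "2", "3", "4", "5", "6"]
outcomes = [c + d for c in coin for d in die]


def probability_of(event):
    # Fraction of equally likely outcomes for which the event holds.
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))


def is_heads(outcome):
    return outcome[0] == "H"


def is_six(outcome):
    return outcome[1] == "6"


p_heads = probability_of(is_heads)                            # 1/2
p_six = probability_of(is_six)                                # 1/6
p_both = probability_of(lambda o: is_heads(o) and is_six(o))  # 1/12
p_either = probability_of(lambda o: is_heads(o) or is_six(o))

# Counting directly and applying the addition rule give the same answer.
print(p_either, p_heads + p_six - p_both)  # 7/12 7/12
```

Heads and a six are not mutually exclusive, so simply adding $1/2 + 1/6$ would count the outcome `H6` twice.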
 

#### Conditional Probability and Bayes' Theorem
Conditional probability is the probability of an event $A$ occurring given event $B$ has occurred; $P(A \text{ given } B)$ or $P(A|B)$.

##### Bayes' Formula
$$
P(A|B) = \frac{P(B|A)\times P(A)}{P(B)}
$$

##### Conditional probability and Joint probability
$$
P(A \text{ and } B) = P(B) \times P(A|B) = P(A) \times P(B|A)
$$

If event $A$ has no impact on event $B$, then $P(B|A) = P(B)$.

Substituting the joint probability into the general union formula gives:

$$
P(A \text{ or } B) = P(A) + P(B) - P(A|B) \times P(B)
$$
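The three formulas in this section can be tied together with a small numeric sketch. The probabilities below are made-up values chosen only to keep the arithmetic simple; they do not come from any data.

```python
from fractions import Fraction

# Hypothetical inputs (assumptions for illustration only).
p_a = Fraction(3, 10)         # P(A)
p_b = Fraction(2, 5)          # P(B)
p_b_given_a = Fraction(3, 5)  # P(B|A)

# Bayes' formula: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

# The joint probability can be computed from either conditional.
joint_via_b = p_b * p_a_given_b
joint_via_a = p_a * p_b_given_a

# Union probability using the conditional form.
p_a_or_b = p_a + p_b - p_a_given_b * p_b

print(p_a_given_b)               # 9/20
print(joint_via_a, joint_via_b)  # 9/50 9/50
print(p_a_or_b)                  # 13/25
```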

### Binomial Distribution
The binomial distribution measures how likely it is to get $k$ successes out of $n$ trials, given a success probability $p$ on each trial.

#### Binomial distribution from Scratch
The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes-no question and each with its Boolean-valued outcome: success (with probability p) or failure (with probability q = 1 − p).

#### Probability mass function of Binomial distribution

Given a random variable $X$ such that $X \sim B(n, p)$, the probability of getting exactly $k$ successes is given by the _probability mass function_ $f(k, n, p)$:

$$
f(k, n, p) = P(X=k) =  {n \choose k}  p^{k}(1 -p)^{n-k}
$$

for $k = 0, 1, 2, \cdots, n$, where 

$$
{n \choose k} = \frac{n!}{k!(n - k)!}
$$

is the binomial coefficient.

##### Understanding the formula
$p^k(1-p)^{n -k}$ is the probability of obtaining one particular sequence of $n$ independent Bernoulli trials in which $k$ trials are successes and the remaining $n-k$ trials are failures. Since the trials are independent with a constant probability of success, every such sequence has the same probability, regardless of where the successes fall within the sequence.

$n\choose k$ is the number of possible sequences with $k$ successes and $n -k$ failures.

The binomial distribution is concerned with the probability of obtaining any of these sequences, meaning the probability of obtaining one of them, $p^k(1-p)^{n-k}$, must be added $n\choose k$ times.
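The counting argument can be checked by listing every sequence explicitly for a small case. This is a sketch with assumed values ($n = 4$, $k = 2$, $p = 1/3$) that are not taken from the notes.

```python
from fractions import Fraction
from itertools import product
from math import comb

n, k = 4, 2
p = Fraction(1, 3)  # hypothetical success probability

# Every sequence of n trials ("S" = success, "F" = failure) with exactly k successes.
sequences = [s for s in product("SF", repeat=n) if s.count("S") == k]

# Each such sequence has the same probability p^k (1 - p)^(n - k).
one_sequence = p**k * (1 - p) ** (n - k)
total = sum(one_sequence for _ in sequences)

print(len(sequences), comb(n, k))         # 6 6
print(total, comb(n, k) * one_sequence)   # 8/27 8/27
```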

#### Cumulative Distribution function
$$
F(k, n, p) = P(X\le k) = \sum^{\lfloor{k}\rfloor}_{i=0} {n \choose i} p^i(1 - p)^{n-i}
$$

In [20]:
# PMF
def factorial(n):
    product = 1
    for i in range(2, n + 1):
        product *= i
    return product

def binomial_coefficient(n, k):
    # n! / (k! (n - k)!): number of sequences with exactly k successes
    return factorial(n) / (factorial(k) * factorial(n - k))

def probability(k, n, p):
    # Probability of one specific sequence with k successes and n - k failures
    return (p**k) * (1 - p)**(n - k)

def binomial_pmf(k, n, p):
    return binomial_coefficient(n, k) * probability(k, n, p)

# Cumulative distribution function: P(X <= k)
def binom_cdf(k, n, p):
    cumulative_sum = 0
    for i in range(k + 1):
        cumulative_sum += binomial_pmf(i, n, p)
    return cumulative_sum

n = 10
p = 0.9
for k in range(n + 1):
    print(f"{k} -> {binomial_pmf(k, n, p)}")

# Probability of getting 8 or fewer successes
print("probability of getting 8 successes or less: ", binom_cdf(8, n, p))

0 -> 9.999999999999978e-11
1 -> 8.999999999999983e-09
2 -> 3.644999999999994e-07
3 -> 8.747999999999988e-06
4 -> 0.00013778099999999982
5 -> 0.0014880347999999986
6 -> 0.01116026099999999
7 -> 0.057395627999999976
8 -> 0.19371024449999993
9 -> 0.387420489
10 -> 0.3486784401000001
probability of getting 8 successes or less:  0.2639010708999999


### Beta distribution
The _beta distribution_ allows us to see the likelihood of different underlying probabilities for an event, given _alpha_ successes and _beta_ failures.

#### Beta distribution from scratch
The beta distribution has been applied to model the behaviour of random variables limited to intervals of finite length in a wide variety of disciplines. In particular, we are interested in how likely a given (assumed) underlying probability of success of a binomial distribution is, given $\alpha$ successes and $\beta$ failures.

##### Probability Density function (PDF)
Given $0\le x \le 1$ and the shape parameters $\alpha, \beta >0$ (in our case, the number of successes and the number of failures), the probability density function is:
$$
f(x, \alpha, \beta) = \frac{\Gamma(\alpha + \beta)\times x^{\alpha - 1}\times (1 - x)^{\beta -1}}{\Gamma(\alpha)\times \Gamma(\beta)}
$$

where $\Gamma(z)$ is the _gamma function_, which equals $(z-1)!$ for positive integers $z$.
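As a rough sketch, the density can be written directly with the standard library's `math.gamma`, which also accepts non-integer shape parameters. The function name `beta_pdf_gamma` is an illustrative assumption.

```python
import math


def beta_pdf_gamma(x, alpha, beta):
    # Gamma(alpha + beta) / (Gamma(alpha) * Gamma(beta)) normalizes the density.
    constant = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return constant * x ** (alpha - 1) * (1 - x) ** (beta - 1)


# For positive integers, Gamma(z) matches (z - 1)!
print(math.gamma(5), math.factorial(4))

# Density at x = 0.5 for 8 successes and 2 failures: 72 * 0.5^7 * 0.5 = 0.28125
print(beta_pdf_gamma(0.5, alpha=8, beta=2))
```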

##### Cumulative Distribution Function (CDF)
Suppose we are given the number of successes (8) and the number of failures (2) in 10 Bernoulli trials. We want to determine the probability that the underlying probability of $X\sim \text{Binom}(n, p)$ is greater than or equal to $90$%.

This means we need to calculate the area under the curve of the PDF of the beta distribution.

This requires integrating the PDF, which is difficult. We can, however, **use a Riemann sum to approximate the integral**.

For integer $\alpha$ and $\beta$, we can also use the fact that **the CDF of the beta distribution is the same as the CDF of the binomial distribution, where $k = \beta-1, n = \alpha + \beta - 1, p = 1 - x$**.


In [12]:
# PDF of beta function

# factorial calculation
def factorial(n):
    product = 1
    for i in range(2, n + 1):
        product *= i
    return product

def beta_pdf(x, alpha=8, beta=2):
    if x < 0 or x > 1:
        raise ValueError("x must be between 0 and 1")
    probability = x**(alpha - 1) * (1 - x)**(beta - 1)
    # Gamma(alpha + beta) / (Gamma(alpha) * Gamma(beta)) for integer alpha and beta
    constant = factorial(alpha + beta - 1) / (factorial(alpha - 1) * factorial(beta - 1))
    return constant * probability

# CDF of beta function


def reimann_sum_approximation(a, b, n, f):
    delta_x = (b - a)/n
    total_sum = 0
    for i in range(1, n+1):
        midpoint = a + (delta_x * (i - 1)) + (delta_x/ 2)
        total_sum += f(midpoint) * delta_x
    return total_sum

# 1: Using a Riemann sum approximation
beta_cdf11 = reimann_sum_approximation(a=0, b=0.9, n=1_000, f=beta_pdf)
beta_cdf12 = reimann_sum_approximation(a=0.9, b=1.0, n=1_000, f=beta_pdf)

# 2: Using the binomial distribution's CDF (k = beta - 1, n = alpha + beta - 1, p = 1 - x)
beta_cdf21 = binom_cdf(k=2-1, n=10-1, p=1 - 0.9)
beta_cdf22 = 1 - binom_cdf(k=2-1, n=10-1, p=1 - 0.9)

# Likelihood that the underlying probability of the binomial distribution is 90% or less
print("The likelihood of the underlying probability equals or less than 90% is: ",beta_cdf11)

# Likelihood that the underlying probability of the binomial distribution is 90% or greater
print("The likelihood of the underlying probability equals or greater than 90% is: ",beta_cdf12)
print(beta_cdf21)
print(beta_cdf22)

The likelihood of the underlying probability equals or less than 90% is:  0.7748412362768455
The likelihood of the underlying probability equals or greater than 90% is:  0.22515904881135335
0.7748409780000001
0.22515902199999993


### Exercises
1. You have 137 passengers booked on a flight from Las Vegas to Dallas. However, you estimate each passenger is 40%  likely to not show up. You are trying to figure out how many seats to over-book so that the plane does not fly empty. How likely is it that at least $50$ passengers will not show up?

#### Solution
Let $X$ be the number of successes (passengers not showing up), with $n = 137$ trials and a success probability of $p = 0.4$. The expected number of no-shows is $n \times p = 137 \times 0.4 = 54.8$.

$X\sim \text{Binom}(n, p)$, and we want $P(X \ge 50) = 1 - P(X \le 49)$.

In [21]:
# How likely is it that at least 50 passengers will not show up?
probability_50_or_more = 1 - binom_cdf(k=49, n=137, p=0.4) # 1 - (probability of 49 or fewer passengers not showing up)
print(probability_50_or_more)

0.822095588147425


2. You flipped a coin 10 times and got heads 8 times and tails 2 times. Do you think this coin has any good probability of being fair? Why or why not?

#### Solution
We are interested in how likely it is that the coin is fair, that is, how plausible it is that the underlying probability behind the observed binomial outcome is $50$%.
With an alpha of $8$ and a beta of $2$, we can use the beta distribution to answer.

Looking at the answers below, we can see that the likelihood that the underlying probability is greater than or equal to 50% is about 98%, which leaves only about 2% for values of 50% or less. It is very unlikely that this is a fair coin.

In [25]:
# The likelihood that the underlying probability is greater or equal to 50%
print(reimann_sum_approximation(a=.50, b=1.0, n= 1000, f = beta_pdf))
# The likelihood that the underlying probability is less or equal to 50%
print(reimann_sum_approximation(a=0, b=0.5, n= 1000, f = beta_pdf))

0.9804695351555468
0.019531214843764355


### Descriptive and inferential statistics

Descriptive statistics is used to provide a summary of the data such as the mean, the median, mode, charts, bell curves and other tools used to describe data.

Inferential statistics is used to uncover attributes about a larger population based on a sample. 

**Population** is a particular group of interest we want to study. A **sample** is a subset of the population that is ideally random and unbiased, which we use to infer attributes about the population. 

##### Types of Biases
- _Confirmation bias_ is gathering only data that supports your belief, which can even be done unknowingly.
- _Self-selection bias_ is when certain types of subjects are more likely to include themselves in the experiment.
- _Survival bias_ captures only living and survived subjects, while the deceased ones are never accounted for.

#### Descriptive Statistics
- *Mean* is the average of a set of values. The mean is calculated the same way for both populations and samples.
  - *Sample Mean $\bar{x}$*
  - *Population mean $\mu$*
  - *Weighted mean* gives each item a different weight. This is helpful when we want some values to contribute to the mean more than others.

  $$
  \text{weighted mean } = \frac{(x_1\cdot w_1) + (x_2\cdot w_2) + \ldots + (x_n\cdot w_n)}{w_1 + w_2 + \ldots + w_n}
  $$

- **Median** is the middlemost value in a set of ordered values.


In [2]:
# number of pets
sample = [0, 1, 5, 7, 9, 10, 14]

def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2 - 1 if n % 2 == 0 else n // 2
    if n % 2 == 0:
        return (ordered[mid] + ordered[mid+1])/2.0
    return ordered[mid]
print(median(sample))

7


The median is a good alternative to the mean when the data is skewed by outliers or values that are extremely large or small compared to the rest of the values. 
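To make the weighted mean formula and the effect of outliers concrete, here is a minimal sketch using the standard library's `statistics` module. The `weighted_mean` helper and all of the numbers are illustrative assumptions, not data from the notes.

```python
import statistics


def weighted_mean(values, weights):
    # Sum of value * weight, divided by the sum of the weights.
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical exam scores where the final exam counts twice as much.
exam_scores = [90, 80, 63, 87]
exam_weights = [0.2, 0.2, 0.2, 0.4]
print(weighted_mean(exam_scores, exam_weights))  # roughly 81.4

# Hypothetical incomes with one extreme outlier.
incomes = [30, 35, 40, 45, 1000]
print(statistics.mean(incomes))    # 230, pulled up by the outlier
print(statistics.median(incomes))  # 40, unaffected by the outlier
```

The single large value moves the mean far above most of the data, while the median stays at the middle value.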