## Maximum Likelihood Estimation

Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed.

**Two assumptions we make are used so often in machine learning that together they have a special name, the "i.i.d. assumption": the samples are independent and identically distributed.**
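
A brief sketch of how the i.i.d. assumption is used: independence lets the joint likelihood factorise into a product over samples, and because the logarithm is monotonic, taking the log does not change the maximising $\theta$. The notation $X = \{x_1, \dots, x_n\}$ for the observed data is assumed here for illustration.

\begin{align}
    \underset{\theta}{\operatorname{argmax}} \; P(X \mid \theta)
    &= \underset{\theta}{\operatorname{argmax}} \; \prod_{i} P(x_i \mid \theta) \\
    &= \underset{\theta}{\operatorname{argmax}} \; \log \prod_{i} P(x_i \mid \theta)
\end{align}

The logarithm of a product is a sum of logarithms, which leads to the sum-of-logs form of the estimate.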

\begin{align}
    \theta_{MLE} = \underset{\theta}{\operatorname{argmax}}  \sum_{i}  \log  P(x_i | \theta)
\end{align}
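
As a minimal sketch of this idea, the snippet below maximises the log-likelihood numerically for a Bernoulli model. The coin-flip data and the choice of a Bernoulli model are assumptions made purely for illustration. For this model the estimate should agree with the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data: 10 coin flips (1 = heads), used only for illustration
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def neg_log_likelihood(theta):
    # -sum_i log P(x_i | theta) for a Bernoulli(theta) model
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Maximising the log-likelihood is the same as minimising its negative
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_mle = result.x

print("numerical MLE:", theta_mle)
print("sample mean:  ", x.mean())
```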

## Maximum A Posteriori (MAP)



\begin{align}
   \theta_{MAP} = \underset{\theta}{\operatorname{argmax}} \left[ \sum_{i} \log P(x_i | \theta) + \log P(\theta) \right]
    \text{ - (According to log properties)}
\end{align}

So this is our MAP equation. Comparing this with MLE, the key difference is the prior $P(\theta)$; otherwise they are identical. With a uniform prior, $P(\theta)$ is constant and MAP reduces to MLE.
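
To make the role of the prior concrete, here is a small sketch that computes a MAP estimate numerically for a Bernoulli model. The data and the Beta(2, 2) prior are assumptions chosen for illustration. For this particular model and prior the mode also has a closed form, $(\alpha + k - 1)/(\alpha + \beta + n - 2)$ with $k$ successes in $n$ trials, which the code uses as a check.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import beta

# Hypothetical data: 10 coin flips (1 = heads)
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Assumed Beta(a, b) prior on theta, chosen only for illustration
a, b = 2.0, 2.0

def neg_log_posterior(theta):
    # -(sum_i log P(x_i | theta) + log P(theta)), up to a constant
    log_lik = np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))
    log_prior = beta.logpdf(theta, a, b)
    return -(log_lik + log_prior)

result = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_map = result.x

theta_mle = x.mean()
theta_closed_form = (a + x.sum() - 1) / (a + b + len(x) - 2)

print("MLE:", theta_mle)
print("MAP (numerical):", theta_map)
print("MAP (closed form):", theta_closed_form)
```

The prior here is centred on 0.5, so the MAP estimate is pulled from the MLE towards 0.5.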

## Conjugate Prior Distributions

In Bayesian probability theory, if the posterior distribution p(θ | X) is in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function.

#### Parameterizations
Let C(n, k) denote the binomial coefficient "n choose k".

- The Bernoulli distribution has probability of success $p$.
- The beta distribution has PDF: $f(p) = \Gamma(\alpha + \beta)\, p^{\alpha-1}(1-p)^{\beta-1} / (\Gamma(\alpha)\, \Gamma(\beta))$
- The geometric distribution has only one parameter, p, and has PMF: $f(x) = p (1-p)^x$
- The binomial distribution with parameters n and p has PMF: $f(x) = C(n, x)\, p^x (1-p)^{n-x}$
- The negative binomial distribution with parameters r and p has PMF: $f(x) = C(r + x - 1, x)\, p^r (1-p)^x$
- The exponential distribution parameterized in terms of the rate λ has PDF: $f(x) = \lambda \exp(-\lambda x)$
- The gamma distribution parameterized in terms of the rate has PDF: $f(x) = \beta^\alpha x^{\alpha-1} \exp(-\beta x) / \Gamma(\alpha)$
- The Poisson distribution has one parameter λ and PMF: $f(x) = \exp(-\lambda)\, \lambda^x / x!$
- The normal distribution parameterized in terms of precision $\tau$ ($\tau = 1/\sigma^2$) has PDF:
$$f(x) = (\tau/2\pi)^{1/2} \exp( -\tau(x - \mu)^2/2 )$$
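
These parameterizations do not always match library defaults. The sketch below checks three of the formulas against `scipy.stats`, assuming SciPy is the library in use. The numeric values are arbitrary. SciPy's gamma takes a scale (the reciprocal of the rate), and its geometric distribution counts trials starting at 1, so a shift of `loc=-1` is needed to match the PMF above.

```python
from math import comb, exp, gamma as gamma_fn

import numpy as np
from scipy import stats

# Gamma with shape alpha and rate beta: SciPy expects scale = 1 / rate
alpha, rate, x = 3.0, 2.0, 1.5
gamma_formula = rate**alpha * x**(alpha - 1) * exp(-rate * x) / gamma_fn(alpha)
gamma_scipy = stats.gamma.pdf(x, a=alpha, scale=1.0 / rate)

# Geometric as failures before the first success: shift SciPy's geom by loc=-1
p, k = 0.3, 4
geom_formula = p * (1 - p) ** k
geom_scipy = stats.geom.pmf(k, p, loc=-1)

# Negative binomial as failures before the r-th success: matches stats.nbinom(r, p)
r = 3
nbinom_formula = comb(r + k - 1, k) * p**r * (1 - p) ** k
nbinom_scipy = stats.nbinom.pmf(k, r, p)

print(np.isclose(gamma_formula, gamma_scipy))
print(np.isclose(geom_formula, geom_scipy))
print(np.isclose(nbinom_formula, nbinom_scipy))
```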

#### Posterior parameters
For each sampling distribution, assume we have data $x_1, x_2, \dots, x_n$.

- If the sampling distribution for x is binomial(m, p) with m known, and the prior distribution is beta(α, β), the posterior distribution for p is $\text{beta}(\alpha + \sum x_i,\ \beta + mn - \sum x_i)$. The Bernoulli is the special case of the binomial with m = 1.
- If the sampling distribution for x is negative binomial(r, p) with r known, and the prior distribution is beta(α, β), the posterior distribution for p is $\text{beta}(\alpha + nr,\ \beta + \sum x_i)$. The geometric is the special case of the negative binomial with r = 1.
- If the sampling distribution for x is gamma(α, β) with α known, and the prior distribution on β is gamma(α0, β0), the posterior distribution for β is $\text{gamma}(\alpha_0 + n\alpha,\ \beta_0 + \sum x_i)$. The exponential is a special case of the gamma with α = 1.
- If the sampling distribution for x is Poisson(λ), and the prior distribution on λ is gamma(α0, β0), the posterior on λ is $\text{gamma}(\alpha_0 + \sum x_i,\ \beta_0 + n)$.
- If the sampling distribution for x is normal(μ, τ) with τ known, and the prior distribution on μ is normal(μ0, τ0), the posterior distribution on μ is $\text{normal}((\mu_0 \tau_0 + \tau \sum x_i)/(\tau_0 + n\tau),\ \tau_0 + n\tau)$.
- If the sampling distribution for x is normal(μ, τ) with μ known, and the prior distribution on τ is gamma(α, β), the posterior distribution on τ is $\text{gamma}(\alpha + n/2,\ \beta + \tfrac{1}{2}\sum (x_i - \mu)^2)$.
- If the sampling distribution for x is lognormal(μ, τ) with τ known, and the prior distribution on μ is normal(μ0, τ0), the posterior distribution on μ is $\text{normal}((\mu_0 \tau_0 + \tau \sum \log x_i)/(\tau_0 + n\tau),\ \tau_0 + n\tau)$.
- If the sampling distribution for x is lognormal(μ, τ) with μ known, and the prior distribution on τ is gamma(α, β), the posterior distribution on τ is $\text{gamma}(\alpha + n/2,\ \beta + \tfrac{1}{2}\sum (\log x_i - \mu)^2)$.
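
As an illustration of one of these update rules, the sketch below applies the beta-binomial case and checks it numerically. The data, the Beta(2, 2) prior and the grid resolution are all assumptions made for the example. The check normalises prior times likelihood on a grid and compares it with the closed-form posterior.

```python
import numpy as np
from scipy import stats

# Hypothetical data: n = 5 observations, each the number of successes in m = 10 trials
m = 10
x = np.array([3, 5, 4, 6, 2])
n = len(x)

# Assumed Beta(a0, b0) prior on p
a0, b0 = 2.0, 2.0

# Conjugate update: beta(a0 + sum x_i, b0 + m*n - sum x_i)
a_post = a0 + x.sum()
b_post = b0 + m * n - x.sum()
posterior = stats.beta(a_post, b_post)

# Numerical check: prior * likelihood, normalised on a grid of p values
grid = np.linspace(0.001, 0.999, 999)
likelihood = np.prod(stats.binom.pmf(x[:, None], m, grid[None, :]), axis=0)
unnormalised = stats.beta.pdf(grid, a0, b0) * likelihood
numeric_posterior = unnormalised / (unnormalised.sum() * (grid[1] - grid[0]))

print("posterior parameters:", a_post, b_post)
print("posterior mean:", posterior.mean())
print("matches grid computation:", np.allclose(numeric_posterior, posterior.pdf(grid), atol=1e-4))
```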

## MLE with Normal Distributions

We know the parameters used to describe a normal distribution are $\mu$ and $\sigma^2$, where $\mu$ is the mean and $\sigma^2$ is the variance of the data.

#### MLE in Python

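    # MLE for a normal distribution
    import numpy as np
    from scipy.stats import norm  # for generating sample data and fitting distributions

    # Placeholder data (an assumption for illustration); replace with your own sample
    rng = np.random.default_rng(0)
    sample = rng.normal(loc=5.0, scale=2.0, size=1000)

    # norm.fit returns the maximum likelihood estimates of the mean and std
    param = norm.fit(sample)
    print(param[0], param[1])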

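For the normal distribution the maximum likelihood estimates have a closed form: the sample mean, and the square root of the average squared deviation (dividing by n rather than n - 1). The sketch below compares these with `norm.fit`, using synthetic data as an assumption for illustration.

```python
import numpy as np
from scipy.stats import norm

# Synthetic data, assumed only for illustration
rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=3.0, size=500)

# Closed-form maximum likelihood estimates
mu_closed = data.mean()
sigma_closed = np.sqrt(np.mean((data - mu_closed) ** 2))  # divides by n, not n - 1

# Estimates returned by SciPy's fit
mu_fit, sigma_fit = norm.fit(data)

print(mu_closed, mu_fit)
print(sigma_closed, sigma_fit)
```

The MLE of the standard deviation is slightly smaller than the usual unbiased sample estimate, which is why `np.std(data, ddof=1)` gives a slightly larger value.
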
## Naive Bayes Classification

MAP is the basis of the **Naive Bayes (NB) classifier**. It is a simple algorithm that builds on maximum likelihood estimation techniques for classification.

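The fundamental Naive Bayes assumption is that each feature makes an **independent** and **equal** contribution to the outcome. More precisely, the features are assumed to be conditionally independent given the class. This is an assumption about features, and it is distinct from the **i.i.d. assumption**, which concerns samples.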

### Types of Naive Bayes Algorithm

The Naive Bayes algorithm works with a number of data distributions for classification tasks. Here are three popular distributions that you will routinely come across while doing data analysis.

#### Gaussian Naive Bayes

When feature values are continuous (i.e. real numbers), Naive Bayes makes the assumption that the values associated with each class are distributed according to a Gaussian (normal) distribution.

Suppose an attribute $x$ in our data contains continuous values. We first segment the data by class and then compute the mean $\mu_{y}$ and variance ${\sigma_{y}}^{2}$ of $x$ within each class. The likelihood of a value $x_i$ given class $y$ is then:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$$
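
A minimal sketch of this procedure for a single continuous feature is shown below. The toy data and the helper names `gaussian_likelihood` and `predict` are hypothetical, introduced only for illustration.

```python
import numpy as np

# Toy data (assumed): one continuous feature, two classes
x = np.array([1.0, 1.2, 0.8, 1.1, 3.0, 3.2, 2.9, 3.1])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Segment the data by class, then estimate the mean, variance and prior per class
classes = np.unique(y)
mu = np.array([x[y == c].mean() for c in classes])
var = np.array([x[y == c].var() for c in classes])
prior = np.array([np.mean(y == c) for c in classes])

def gaussian_likelihood(x_new, mu_y, var_y):
    # P(x | y) under a normal distribution with per-class mean and variance
    return np.exp(-((x_new - mu_y) ** 2) / (2 * var_y)) / np.sqrt(2 * np.pi * var_y)

def predict(x_new):
    # Pick the class that maximises P(y) * P(x | y)
    scores = prior * gaussian_likelihood(x_new, mu, var)
    return classes[np.argmax(scores)]

print(predict(1.0), predict(3.0))
```

With several features, the per-feature likelihoods would be multiplied together (or their logs summed) under the independence assumption.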

We shall see this approach in practice in the upcoming labs where we take a deep dive into Gaussian Naive Bayes. 


#### Multinomial Naive Bayes

Multinomial Naive Bayes is the preferred choice for data that is multinomially distributed.


>In probability theory, the **Multinomial distribution** is a generalization of the binomial distribution. For example, it models the probability of counts for rolling a k-sided die n times. For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories. (wiki)

It is one of the standard classic algorithms, used often with text categorization (classification). Each event in text classification represents the occurrence of a word in a document. [Visit here](https://syncedreview.com/2017/07/17/applying-multinomial-naive-bayes-to-nlp-problems-a-practical-explanation/) for an example on this. 
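
The following is a rough sketch of a multinomial model on word counts, not a reference implementation. The vocabulary, the count matrix and the labels are made up for illustration, and add-one (Laplace) smoothing is an assumption added here so that unseen words do not produce zero probabilities.

```python
import numpy as np

# Hypothetical word-count matrix: rows are documents, columns are vocabulary words
vocab = ["free", "win", "meeting", "report"]
X = np.array([
    [3, 2, 0, 0],  # spam
    [2, 3, 0, 1],  # spam
    [0, 0, 2, 3],  # ham
    [0, 1, 3, 2],  # ham
])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

classes = np.unique(y)
log_prior = np.log(np.array([np.mean(y == c) for c in classes]))

# Per-class word probabilities with add-one (Laplace) smoothing
counts = np.array([X[y == c].sum(axis=0) for c in classes])
log_word_prob = np.log((counts + 1) / (counts.sum(axis=1, keepdims=True) + len(vocab)))

def predict(doc_counts):
    # log P(y) + sum_w count_w * log P(w | y); the multinomial coefficient
    # is identical for every class, so it can be dropped from the comparison
    scores = log_prior + np.asarray(doc_counts) @ log_word_prob.T
    return classes[np.argmax(scores)]

print(predict([2, 1, 0, 0]), predict([0, 0, 1, 2]))
```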

#### Bernoulli Naive Bayes

Bernoulli Naive Bayes is used on data that is distributed according to multivariate Bernoulli distributions, i.e. there can be multiple features, but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. It therefore requires features to be binary valued. In the context of text data, one can think of categorizing incoming emails as ham or spam based on whether each word is present. Have a quick look at the detailed slides [HERE](http://www.inf.ed.ac.uk/teaching/courses/inf2b/learnSlides/inf2b13-learnlec07-nup.pdf) to see this in action. We shall be developing a similar experiment towards the end of this section.
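
A small sketch of the Bernoulli model on binary word-presence features follows. The data are invented, and the add-one smoothing is an assumption added for illustration.

```python
import numpy as np

# Hypothetical binary features: 1 if the word appears in the email, 0 otherwise
X = np.array([
    [1, 1, 0, 0],  # spam
    [1, 0, 0, 1],  # spam
    [0, 0, 1, 1],  # ham
    [0, 1, 1, 1],  # ham
])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

classes = np.unique(y)
log_prior = np.log(np.array([np.mean(y == c) for c in classes]))

# P(word present | class), with add-one smoothing over the two outcomes
theta = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in classes])

def predict(x_new):
    x_new = np.asarray(x_new)
    # Present words contribute log(theta); absent words contribute log(1 - theta)
    log_lik = x_new @ np.log(theta).T + (1 - x_new) @ np.log(1 - theta).T
    return classes[np.argmax(log_prior + log_lik)]

print(predict([1, 1, 0, 0]), predict([0, 0, 1, 1]))
```

Here the absence of a word also counts as evidence, whereas a model based on word counts ignores words that do not occur.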

The Bernoulli and multinomial text models built with Naive Bayes follow a "Bag of Words" approach and can reach accuracy comparable to more sophisticated classifiers on many text classification tasks.

## Testing Data for Normality

#### Kolmogorov-Smirnov Test
Null hypothesis: the data follow the reference distribution (here, a normal distribution). If the p-value is less than 0.05, reject the null hypothesis.

https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.kstest.html

    # one-sample test (signature as given in the linked SciPy 0.14 documentation)
    scipy.stats.kstest(data, cdf, args=(), N=20, alternative='two-sided', mode='approx')
    # two-sample test
    scipy.stats.ks_2samp(data1, data2)
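
As a hedged usage sketch, the snippet below runs both tests on synthetic data. The data and the helper name `ks_normality` are assumptions for illustration. The sample is standardised so it can be compared with the standard normal CDF. Because the mean and standard deviation are estimated from the same sample, the resulting p-value tends to be conservative.

```python
import numpy as np
from scipy import stats

# Synthetic samples, assumed only for illustration
rng = np.random.default_rng(0)
normal_data = rng.normal(loc=50.0, scale=5.0, size=500)
skewed_data = rng.exponential(scale=5.0, size=500)

def ks_normality(data):
    # Standardise, then compare against the standard normal CDF
    z = (data - data.mean()) / data.std(ddof=1)
    return stats.kstest(z, "norm")

result_normal = ks_normality(normal_data)
result_skewed = ks_normality(skewed_data)

# Two-sample test: do the two samples come from the same distribution?
result_two_sample = stats.ks_2samp(normal_data, skewed_data)

print("normal sample p-value:", result_normal.pvalue)
print("skewed sample p-value:", result_skewed.pvalue)
print("two-sample p-value:   ", result_two_sample.pvalue)
```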

## Naive Bayes and Information Retrieval

Naive Bayes sits at the intersection of machine learning and information retrieval, and several issues arise when applying the classifier in that setting. David D. Lewis gives an overview of these in the paper ["Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval"](https://link.springer.com/content/pdf/10.1007%2FBFb0026666.pdf).

We would like you to read through the article, without getting lost in the details and the math. Try to focus on the application areas and the advantages/disadvantages of certain models. Good luck!