[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io)

# AI Program

## Introduction to Estimation - Empirical CDF, Histogram and Kernel Density Estimation (KDE)

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 27/01/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0005EstimationNonParametric.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import scipy.stats #<! Makes sure `sp.stats` is loaded on older SciPy versions
import pandas as pd

# Machine Learning
from sklearn.neighbors import KernelDensity

# Miscellaneous
import os
import math
from platform import python_version
import random

# Typing
from typing import Callable, List, Tuple, Union

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

```python
valToFill = ???
```

 - Multi Line to Fill (At least one)

```python
# You need to start writing
?????
```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

?????
#===============================================================#
```

In [None]:
# Configuration
%matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [None]:
# Fixel Algorithms Packages


## Non Parametric Estimation Methods

This notebook introduces:

 - The _histogram_ estimator of the PDF.
 - The _kernel density estimation_ method to estimate the PDF.
 - The _empirical cumulative distribution function_ (ECDF) estimator of the CDF.

In [None]:
# Parameters

# Data
numSamples = 5000

vW = np.array([0.5, 0.2, 0.3]) #<! Weights of the GMM model
vµ = np.array([-3.0, 2.0, 3.0]) #<! Mean of the GMM Model
vσ = np.sqrt(np.array([2, 0.1, 0.1])) #<! Standard Deviation of the GMM model

# Visualizations
minGridVal = -10
maxGridVal = 10
numGridPts = 10001


In [None]:
# Auxiliary Functions

# PDF of the Gaussian N(x; µ, σ²). Note: `mSig` is the variance (covariance), not the standard deviation
def Pz(xx, vMu, mSig):
    return sp.stats.multivariate_normal.pdf(xx, vMu, mSig)


## Generate / Load Data

We'll generate data from a Gaussian Mixture Model (GMM): a distribution whose PDF is a weighted sum of Gaussian PDFs.
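
As a reference, the PDF of a Gaussian Mixture Model with $K$ components can be written as below. The symbols $w_k$, $\mu_k$ and $\sigma_k$ are assumed to correspond to `vW`, `vµ` and `vσ` from the parameters cell:

$$f_{X}\left(x\right)=\sum_{k=1}^{K}w_{k}\mathcal{N}\left(x;\mu_{k},\sigma_{k}^{2}\right),\qquad w_{k}\geq0,\quad\sum_{k=1}^{K}w_{k}=1$$

With the parameters above, $K = 3$ and the weights sum to $0.5 + 0.2 + 0.3 = 1$.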

In [None]:
# Loading / Generating Data

numDims = len(vµ) #<! Number of mixture components (Not the data dimension)

xx  = np.linspace(minGridVal, maxGridVal, numGridPts)
Δx  = xx[1] - xx[0]
mPx = np.stack([Pz(xx, vµ[ii], np.square(vσ[ii])) for ii in range(numDims)])
vPx = vW @ mPx
vFx = np.cumsum(vPx) * Δx #<! Reference CDF

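vIdx = np.random.choice(range(numDims), numSamples, p = vW) #<! Component index per sample, drawn by the weights
mX   = np.stack([vσ[ii] * np.random.randn(numSamples) + vµ[ii] for ii in range(numDims)]) #<! Samples of each component
vX   = mX[vIdx, range(numSamples)] #<! Per sample, keep the value of its chosen component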

## Histogram

$$\boxed{\hat{f}_{X}\left(x\right)=\frac{1}{\left|R_{k}\right|}\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\left\{ x_{i}\in R_{k}\right\} },\qquad x\in R_{k}$$

Where ${R}_{k}$ is the $k$-th bin, $\left|{R}_{k}\right|$ is the bin width (its length in 1D) and $N$ is the number of samples.
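
The estimator can also be computed directly from the bin counts. The following is a minimal sketch on hypothetical toy data (`vS`, `numBins` are illustrative names, not part of the notebook). It compares the manual calculation to `np.histogram()` with `density = True`:

```python
import numpy as np

# Hypothetical toy data (Not the notebook's `vX`)
oRng = np.random.default_rng(0)
vS   = oRng.standard_normal(1000)

numBins = 20
vCount, vEdges = np.histogram(vS, bins = numBins) #<! Number of samples in each bin
vWidth = np.diff(vEdges) #<! Bin widths, |R_k|
vHatF  = vCount / (vS.size * vWidth) #<! Estimator value in each bin

# Reference: NumPy's normalized histogram
vRef, _ = np.histogram(vS, bins = numBins, density = True)

isEqual = np.allclose(vHatF, vRef) #<! The manual calculation matches NumPy
areaSum = np.sum(vHatF * vWidth) #<! The estimated PDF integrates to 1
print(isEqual, areaSum)
```

Dividing by the bin width is what turns the relative frequencies into a density, so the area under the histogram is 1.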

In [None]:
def PlotHist(K: int = 10, N: int = 500):
    plt.figure(figsize = (10, 5))
    plt.hist(vX[:N], bins = K, color = 'b', edgecolor = 'k', density = True, label = 'Histogram')
    plt.plot(xx, vPx, c = 'r', lw = 2, label = '$f_x$')
    plt.title(f'$K = {K}$')
    plt.legend()

In [None]:
kSlider = IntSlider(min = 5, max = 250, step = 5, value = 5, layout = Layout(width = '30%'))
nSlider = IntSlider(min = 5, max = numSamples, step = 5, value = 500, layout = Layout(width = '30%'))
interact(PlotHist, K = kSlider, N = nSlider)
plt.show()

## Kernel Density Estimation (KDE)

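$$\boxed{\hat{f}_{X}\left(x\right)=\frac{1}{N}\sum_{i=1}^{N}h\left(x-x_{i}\right)}$$
$$h\left(x\right)=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{x^{2}}{2\sigma^{2}}\right)$$

Where $h$ is a Gaussian kernel and $\sigma$ is its bandwidth: a small $\sigma$ yields a spiky estimate, a large $\sigma$ yields an over smoothed one.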
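
The formula maps almost one to one into NumPy. The following is a minimal sketch, not the SciKit Learn implementation; the function `GaussianKde()` and the toy data are hypothetical names used for illustration only:

```python
import numpy as np

def GaussianKde(vGrid: np.ndarray, vSamples: np.ndarray, kernelStd: float) -> np.ndarray:
    # Pairwise differences x - x_i, shape: (numGridPts, numSamples)
    mDiff = vGrid[:, None] - vSamples[None, :]
    # Gaussian kernel h(x - x_i) with standard deviation `kernelStd`
    mK = np.exp(-np.square(mDiff) / (2 * kernelStd ** 2)) / np.sqrt(2 * np.pi * kernelStd ** 2)
    # Average over the samples (The 1 / N sum of the formula)
    return np.mean(mK, axis = 1)

# Hypothetical toy data (Not the notebook's `vX`)
oRng     = np.random.default_rng(0)
vSamples = oRng.standard_normal(200)
vGrid    = np.linspace(-10, 10, 2001)

vHatPx  = GaussianKde(vGrid, vSamples, kernelStd = 0.5)
areaEst = np.sum(vHatPx) * (vGrid[1] - vGrid[0]) #<! Riemann sum, should be close to 1
print(areaEst)
```

For the same samples and bandwidth, the result is expected to agree closely with `KernelDensity(kernel = 'gaussian')`. Note the memory cost of the pairwise matrix, which grows with the product of the grid size and the number of samples.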

In [None]:
def PlotKDE(σ: float = 1, N: int = 100):

    σ = max(1e-2, σ)
    
    oKDE = KernelDensity(kernel = 'gaussian', bandwidth = σ)
    oKDE = oKDE.fit(np.reshape(vX[:N], (N, 1)))
    # vHatPx = np.exp(np.squeeze(oKDE.score_samples(np.atleast_2d(xx))))
    vHatPx = np.exp(np.squeeze(oKDE.score_samples(np.reshape(xx, (-1, 1)))))
    
    plt.figure(figsize = (10, 5))
    plt.plot(xx, vHatPx, color = 'b', lw = 2, label = 'KDE')
    plt.plot(xx, vPx, color = 'r', lw = 2, label = '$f_x$')
    plt.title(f'σ = {σ: 0.3f}')
    plt.grid()
    plt.legend() 

In [None]:
σSlider = FloatSlider(min = 0, max = 5, step = 0.01, value = 1, layout = Layout(width = '30%'))
nSlider = IntSlider(min = 5, max = numSamples, step = 5, value = 500, layout = Layout(width = '30%'))
interact(PlotKDE, σ = σSlider, N = nSlider)
plt.show()

## Empirical Cumulative Distribution Function (ECDF)

$$\hat{F}_{X}\left(x\right)=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\left\{ x_{i}\leq x\right\} $$
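
The indicator sum counts how many samples are smaller than or equal to $x$. A minimal sketch of an alternative to the pairwise comparison, assuming sorted samples and `np.searchsorted()` (the function `Ecdf()` and the toy values are hypothetical):

```python
import numpy as np

def Ecdf(vGrid: np.ndarray, vSamples: np.ndarray) -> np.ndarray:
    vSorted = np.sort(vSamples)
    # With `side = 'right'` the index equals the number of samples <= x
    return np.searchsorted(vSorted, vGrid, side = 'right') / vSamples.size

# Hypothetical toy data
vS = np.array([3.0, 1.0, 2.0, 2.0])
vG = np.array([0.0, 1.0, 2.0, 2.5, 3.0, 4.0])

vHatF = Ecdf(vG, vS)
vRef  = np.mean(vS[:, None] <= vG[None, :], axis = 0) #<! The pairwise (Broadcasting) form
print(vHatF)
```

The sorted form avoids building the full samples by grid matrix, which matters when both are large.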

In [None]:
# The ECDF
vHatFx = np.mean((vX[:, None] <= xx[None, :]), axis = 0)

# Equivalent of above with `np.reshape()`
# vHatFx = np.mean((np.reshape(vX, (-1, 1)) <= np.reshape(xx, (1, -1))), axis = 0)

In [None]:
plt.figure(figsize = (10, 5))
plt.plot(xx, vHatFx, color = 'b', lw = 2, label = r'$\hat{F}_x$') #<! Raw string: `\h` is an invalid escape sequence
plt.plot(xx, vFx, 'r--', lw = 2, label = '$F_x$')
plt.title('ECDF')
plt.xlabel('$x$')
plt.legend()
plt.grid()
plt.show()

* <font color='brown'>(**#**)</font> In practice, one could fit a parametric model to the ECDF (for instance, a polynomial regression).
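
As an illustration of the note above, the following is a rough sketch of a polynomial least squares fit to an ECDF. The toy data, the degree `polyDeg` and the use of `numpy.polynomial.Polynomial.fit()` are assumptions made for the example, not choices taken from the notebook:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical toy data: Uniform samples on [0, 1]
oRng = np.random.default_rng(0)
vS   = oRng.uniform(0, 1, 2000)
vG   = np.linspace(0, 1, 201)

# ECDF on the grid (Number of samples <= x, divided by N)
vEcdf = np.searchsorted(np.sort(vS), vG, side = 'right') / vS.size

polyDeg = 5 #<! Assumed degree, a hyper parameter to tune
oPoly   = Polynomial.fit(vG, vEcdf, deg = polyDeg) #<! Least squares polynomial fit
vFit    = oPoly(vG)

maxErr = np.max(np.abs(vFit - vEcdf)) #<! Largest deviation from the ECDF on the grid
print(maxErr)
```

A plain polynomial is neither guaranteed to be monotonic nor bounded to $[0, 1]$, so such a model is a smooth approximation rather than a valid CDF by construction.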