# Density Estimation


## 1.1 Reading the Data
In this section we process an Excel file containing the mortality rate associated with hypertensive heart disease in the United States, broken down by age group and sex (the data were downloaded from the WHO website). We extract the total mortality rate per year for the male (1951 - 2019) and female (1951 - 2019) populations separately.

In [None]:
import numpy as np
import pandas as pd

myData = np.array(pd.read_excel('Mortality_Hypertension_America.xlsx'))

# Extract total mortality for the male population into a two-dimensional array.
# column 0: year, column 1: gender ('Male'), column 2: yearly mortality
M = np.array([row[[1,2,5]] for row in myData if row[2] == 'Male' and row[3] == 'Age_all'])

# Extract total mortality for the female population into a two-dimensional array.
# column 0: year, column 1: gender ('Female'), column 2: yearly mortality
F = np.array([row[[1,2,5]] for row in myData if row[2] == 'Female' and row[3] == 'Age_all'])
F = F[1:,:] # Make sure data starts from year 1951

print(M.shape)
print(F.shape)
print(M)
print(F)

## 1.2 Histograms

We can use histograms to explore the distribution of sample data. A histogram divides the range of the variable into bins, counts the data points in each bin, and shows the bins on the x-axis and the counts on the y-axis. The bin width strongly affects the apparent shape of the sample distribution: wide bins smooth out detail, while narrow bins expose noise.
Generate two histograms using the male mortality data, one with 10 bins and another with 20 bins. Display the bin width for each case.
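
The bin width of an equal-width histogram follows directly from the bin edges. The sketch below is one possible way to compute it with `np.histogram`; the helper name `histogram_with_binwidth` and the randomly generated `mortality` array are assumptions for illustration only and stand in for the male mortality column (`M[:, 2]`), which is not reproduced here.

```python
import numpy as np


def histogram_with_binwidth(values, n_bins):
    """Return (counts, edges, binwidth) for an equal-width histogram."""
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=n_bins)
    binwidth = edges[1] - edges[0]  # all bins share this width
    return counts, edges, binwidth


# Placeholder data, NOT the WHO data: 69 synthetic yearly values.
rng = np.random.default_rng(0)
mortality = rng.uniform(5_000, 30_000, size=69)

for n_bins in (10, 20):
    counts, edges, width = histogram_with_binwidth(mortality, n_bins)
    print(f"{n_bins} bins: bin width = {width:.1f}")
```

With the real data, replacing `mortality` by `M[:, 2].astype(float)` should give the same quantities, and `matplotlib.pyplot.hist(values, bins=n_bins)` draws the corresponding plot.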


## 1.3 Kernel Density Estimation

In statistics, **kernel density estimation (KDE)** estimates the probability density function of a random variable by placing a kernel, which acts as a weighting function, on each sample.

Assuming $(x_{1}, x_{2} , \dots, x_{n})$ are independent and identically distributed samples drawn from some univariate distribution with unknown density, we are interested in estimating the shape of this unknown density function $f$ using the following formula:

$f_K(x) = \frac{1}{n} \sum_{i = 1}^{n} K(x - x_{i};h) = \frac{1}{nh} \sum_{i = 1}^{n} K(\frac{x - x_{i}}{h})$

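Here $K$ is a kernel, which is a non-negative function, and $h \gt 0$ is called the bandwidth and acts as a smoothing parameter. The bandwidth controls the trade-off between bias and variance in the density estimate: a small $h$ gives low bias but high variance (a spiky estimate), while a large $h$ gives a smoother estimate with higher bias.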

A few examples of the kernels are:

- Gaussian: $K(x;h) \propto e^{(-\frac{x^{2}}{2h^{2}})}$
- Tophat: $K(x;h) \propto 1 \text{ if } |x| \lt h$
- Epanechnikov: $K(x;h) \propto 1 - \frac{x^{2}}{h^{2}}$ with $|\frac{x}{h}| \le 1$

Write your own code (DO NOT use available packages) to generate a Gaussian KDE based on the male mortality sample data. Evaluate the kernel density estimate at 300 uniformly spaced points between 0 and the maximum value in your data + 10,000. Use different bandwidths. Show your results by superimposing the KDE on the normalized histogram with 10 bins.
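
As a starting point, the Gaussian estimator $\frac{1}{nh}\sum_i K(\frac{x - x_i}{h})$ can be written with plain NumPy broadcasting. This is a minimal sketch, not a reference solution: the function name `gaussian_kde_manual`, the bandwidth values, and the synthetic `samples` array are all assumptions; the real input would be the male mortality column.

```python
import numpy as np


def gaussian_kde_manual(samples, grid, h):
    """Evaluate a Gaussian KDE with bandwidth h at every point of grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # u[j, i] = (grid_j - x_i) / h, shape (len(grid), n)
    u = (grid[:, None] - samples[None, :]) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (samples.size * h)


# Placeholder data, NOT the WHO data.
rng = np.random.default_rng(0)
samples = rng.normal(20_000, 4_000, size=69)

# 300 uniformly spaced evaluation points from 0 to max + 10,000.
grid = np.linspace(0.0, samples.max() + 10_000, 300)
dx = grid[1] - grid[0]

for h in (500.0, 2_000.0, 5_000.0):  # illustrative bandwidths
    density = gaussian_kde_manual(samples, grid, h)
    area = density.sum() * dx  # Riemann-sum check, should be close to 1
    print(f"h = {h:>7.0f}: approximate area = {area:.3f}")
```

To superimpose the result on the normalized histogram, one option is `plt.hist(samples, bins=10, density=True)` followed by `plt.plot(grid, density)` for each bandwidth.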

Write your own code (DO NOT use available packages) to generate an Epanechnikov KDE based on the male mortality sample data. Evaluate the kernel density estimate at 300 uniformly spaced points between 0 and the maximum value in your data + 10,000. Use different bandwidths. Show your results by superimposing the KDE on the normalized histogram with 10 bins.
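
The Epanechnikov case differs from the Gaussian one only in the kernel, which has compact support. In the sketch below the normalizing constant $\frac{3}{4}$ makes $K(u) = \frac{3}{4}(1 - u^{2})$ integrate to one on $|u| \le 1$. The function name `epanechnikov_kde_manual`, the bandwidths, and the synthetic `samples` are assumptions for illustration.

```python
import numpy as np


def epanechnikov_kde_manual(samples, grid, h):
    """Evaluate an Epanechnikov KDE with bandwidth h at every point of grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    u = (grid[:, None] - samples[None, :]) / h
    # K(u) = 3/4 * (1 - u^2) inside |u| <= 1, and 0 outside.
    kernel = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
    return kernel.sum(axis=1) / (samples.size * h)


# Placeholder data, NOT the WHO data.
rng = np.random.default_rng(0)
samples = rng.normal(20_000, 4_000, size=69)

# 300 uniformly spaced evaluation points from 0 to max + 10,000.
grid = np.linspace(0.0, samples.max() + 10_000, 300)
dx = grid[1] - grid[0]

for h in (1_000.0, 3_000.0, 6_000.0):  # illustrative bandwidths
    density = epanechnikov_kde_manual(samples, grid, h)
    area = density.sum() * dx  # Riemann-sum check, should be close to 1
    print(f"h = {h:>7.0f}: approximate area = {area:.3f}")
```

Because each kernel vanishes beyond a distance $h$ from its sample, the estimate is exactly zero wherever no sample lies within $h$ of the evaluation point, which is visible as flat regions when the bandwidth is small.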

## 1.4 Bivariate Kernel Density Estimation

The concept of univariate KDE can be extended to multivariate data. In this example we generate a bivariate KDE for two variables recorded in Oxford, UK:

- Monthly maximum temperature
- Monthly sunshine duration



Write your own code to implement a bivariate product kernel estimator for the joint distribution of maximum temperature and sunshine duration for the month of June in Oxford, UK. Display a 3D bar histogram (standardized) of these variables. You can use 15 bins. Set the temperature range to 5 - 35 deg C and the sunshine duration range to 100 - 400 hours. The product kernel estimator takes the following form:

$\hat{f}(x,y) = \frac{1}{nh_{x}h_{y}} \sum_{i=1}^{n} K(\frac{x_{i} - x}{h_{x}})K(\frac{y_{i} - y}{h_{y}})$

where 

$K(t) = \frac{1}{\sqrt{2\pi}}e^{-\frac{t^{2}}{2}}$

$h_{x}$ and $h_{y}$ are the bandwidths and $n$ is the number of samples in the paired data $(x_{i}, y_{i})$.

Generate 100 uniformly spaced points for the temperature in the range 5 - 35 and another 100 uniformly spaced points for the sunshine duration in the range 100 - 400 hours. Evaluate the bivariate KDE on the resulting grid using the equation above and display it as a surface plot.
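
The product kernel estimator above can be evaluated on the whole grid with a single matrix product, since the sum over $i$ of $K(\cdot)K(\cdot)$ is an inner product over the sample index. The sketch below is one possible implementation; the names `gauss` and `product_kde`, the bandwidths, and the synthetic `temp` and `sun` arrays are assumptions that stand in for the June observations from Oxford.

```python
import numpy as np


def gauss(t):
    """Standard Gaussian kernel K(t)."""
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)


def product_kde(x_obs, y_obs, x_grid, y_grid, hx, hy):
    """Bivariate product-kernel KDE; result has shape (len(x_grid), len(y_grid))."""
    x_obs = np.asarray(x_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    kx = gauss((x_obs[None, :] - x_grid[:, None]) / hx)  # (gx, n)
    ky = gauss((y_obs[None, :] - y_grid[:, None]) / hy)  # (gy, n)
    # (kx @ ky.T)[a, b] = sum_i K_x(a, i) * K_y(b, i)
    return kx @ ky.T / (x_obs.size * hx * hy)


# Placeholder data, NOT the Oxford observations.
rng = np.random.default_rng(0)
temp = rng.normal(21.0, 2.5, size=90)   # deg C
sun = rng.normal(220.0, 30.0, size=90)  # hours

x_grid = np.linspace(5.0, 35.0, 100)
y_grid = np.linspace(100.0, 400.0, 100)
density = product_kde(temp, sun, x_grid, y_grid, hx=1.5, hy=25.0)

# Riemann-sum check: the estimate should integrate to roughly 1.
mass = density.sum() * (x_grid[1] - x_grid[0]) * (y_grid[1] - y_grid[0])
print(f"approximate total mass = {mass:.3f}")
```

For the surface plot, `X, Y = np.meshgrid(x_grid, y_grid, indexing="ij")` matches the orientation of `density`, so `ax.plot_surface(X, Y, density)` on a 3D Matplotlib axis should work directly. The standardized 2D histogram counts can be obtained in the same spirit with `np.histogram2d(temp, sun, bins=15, range=[[5, 35], [100, 400]], density=True)`.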
