### Estimation
If we are given a random sample from a distribution and asked to find the *mean* of the distribution itself, one approach is to use the *mean* of the sample as an estimate of the *mean* of the distribution. This process is called **estimation**, and the statistic we used, the sample mean, is the **estimator**.

In [1]:
import pandas as pd
import numpy as np

randomSample = pd.Series([-0.441, 1.774, -0.101, -1.138, 2.975, -2.138])
randomSample.mean()

0.15516666666666667

However, what if the random sample of a distribution had outliers? Would taking the mean of the sample to estimate the mean of the distribution be the best choice?

In [2]:
randomSample2 = pd.Series([-0.441, 1.774, -0.101, -1.138, 2.975, -213.8])
randomSample2.mean()

-35.121833333333335

In this case, the sample mean has been skewed by an outlier, which also affects our estimation of the distribution mean.

One option is to identify and discard outliers, then compute the sample mean of the remaining values. Here, we keep all the values within 4 units of 0 and discard the rest:

In [3]:
randomSample3 = randomSample2.where(np.abs(randomSample2) < 4)
randomSample3.mean()

0.6138
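
As a small aside on the cell above: `where` keeps the original length and replaces the discarded values with `NaN`, which `mean` skips by default. Here is a minimal sketch, assuming the same data as `randomSample2`, showing that boolean indexing (which drops the values outright) gives the same estimate. The names `sample`, `masked`, and `filtered` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same values as randomSample2 in the text
sample = pd.Series([-0.441, 1.774, -0.101, -1.138, 2.975, -213.8])

# where() keeps all six positions and puts NaN where the condition fails
masked = sample.where(sample.abs() < 4)

# Boolean indexing drops the failing values instead
filtered = sample[sample.abs() < 4]

print(len(masked), len(filtered))
print(masked.mean(), filtered.mean())
```

Either form works for computing the mean. The boolean-indexing form may be easier to reason about if later steps should not see the `NaN` placeholder.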

Another option is to use the median as an estimator, like so:

In [4]:
randomSample2.median()

-0.271

Picking the best estimator depends on the circumstances, like whether there are outliers, and on what the goal is.

If there are no outliers, the sample mean minimizes the **mean squared error** (MSE). Below, we take the square root of the MSE to undo the squaring, which gives the root mean squared error (RMSE) in the same units as the data. It looks like this:

In [5]:
def RootMeanSqrError(estimates, actual):
    # Squared error between each estimate and the actual value
    sqrErrors = [(estimate - actual)**2 for estimate in estimates]
    # Average the squared errors (avoids shadowing the built-in sum)
    meanSqrError = np.sum(sqrErrors) / len(estimates)
    # Take the root to undo the squaring
    return np.sqrt(meanSqrError)
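
The same calculation can also be written with NumPy array operations. This is only a sketch of an equivalent form; `rmse` is a hypothetical name, not a function from the text or from NumPy.

```python
import numpy as np

def rmse(estimates, actual):
    # Convert to an array so the subtraction applies to every estimate at once
    errors = np.asarray(estimates, dtype=float) - actual
    # Mean of the squared errors, then the root to return to the original units
    return np.sqrt(np.mean(errors**2))

# Errors are -1, 0, and 1, so the mean squared error is 2/3
print(rmse([1.0, 2.0, 3.0], 2.0))
```

The list-comprehension version spells out each step, while the array version is shorter and usually faster for long lists of estimates.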

Now, let's simulate drawing a random sample of size 7, 1000 times. In the code below, *n* is the size of each random sample, and *m* is the number of simulations we'll run. For this example, we draw each sample from a normal (Gaussian) distribution with `mu = 0` and `sigma = 1`, pretending that this is the distribution our data come from. Note that *mu* is the mean of the normal distribution, while *sigma* is its standard deviation.

Between mean and median, let's see which one is the better estimator of the distribution mean, given that there are no outliers:

In [16]:
mu = 0
sigma = 1
n = 7
m = 1000

means = []
medians = []
for i in range(m):
    values = np.random.normal(mu, sigma, n)
    means.append(np.mean(values))
    medians.append(np.median(values))
    
print("RMSE of means:", RootMeanSqrError(means, mu))
print("RMSE of medians:", RootMeanSqrError(medians, mu))

RMSE of means: 0.3748377444328869
RMSE of medians: 0.45378614029744924


For this example, the RMSE when estimating the distribution mean with the sample mean, about `0.37`, is lower than with the sample median, about `0.45`.
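
For contrast, here is a hedged sketch of the same experiment when every sample contains one outlier. The contamination scheme (overwriting the first value of each sample with `50`), the seed, and the names `rng`, `outlier`, and `rmse` are assumptions made for illustration, not part of the original experiment.

```python
import numpy as np

def rmse(estimates, actual):
    # Root mean squared error of the estimates around the actual value
    errors = np.asarray(estimates, dtype=float) - actual
    return np.sqrt(np.mean(errors**2))

rng = np.random.default_rng(0)  # seeded so the run is repeatable
mu, sigma, n, m = 0, 1, 7, 1000
outlier = 50.0  # assumed outlier value, chosen arbitrarily

# m samples of size n, one per row
samples = rng.normal(mu, sigma, size=(m, n))
# Contaminate each sample by overwriting its first value
samples[:, 0] = outlier

means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

rmse_means = rmse(means, mu)
rmse_medians = rmse(medians, mu)
print("RMSE of means:", rmse_means)
print("RMSE of medians:", rmse_medians)
```

Under these assumptions, the sample mean is pulled toward the outlier in every sample (by roughly `50 / 7`), while the median moves much less, so the median has the lower RMSE here.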

For some problems, minimizing the MSE is a sensible goal, but it is not always the best strategy. Whether the MSE is the right criterion depends on the type of problem.

#### Guess the Variance
Given a random sample, the variance of the distribution can be estimated by first calculating the deviation of each value from the sample mean, squaring these deviations, then taking the average of the squares, like so:

In [17]:
def Variance(sample):
    mean = np.mean(sample)
    # Squared deviation of each value from the sample mean
    sqrDiffs = [(value - mean)**2 for value in sample]
    # Divide by n (avoids shadowing the built-in sum)
    return np.sum(sqrDiffs) / len(sqrDiffs)

print("Variance:", Variance(randomSample))
print("Numpy Variance:", np.var(randomSample))

Variance: 2.9873351388888882
Numpy Variance: 2.9873351388888882


For large enough samples, this is a good enough estimator, but for small samples, the value it gives tends to be too low. It is therefore called a **biased estimator**. An estimator is **unbiased** if its expected error is 0, that is, if the mean error over many iterations approaches 0.

There is another estimator of the variance that is unbiased. Here, we also compare it with numpy's unbiased variance calculation using the `ddof` parameter, which uses `n - ddof` when calculating the average:

In [18]:
def VarianceUnbiased(sample):
    mean = np.mean(sample)
    # Squared deviation of each value from the sample mean
    sqrDiffs = [(value - mean)**2 for value in sample]
    # Divide by n - 1 instead of n
    return np.sum(sqrDiffs) / (len(sqrDiffs) - 1)

print("Unbiased Variance:", VarianceUnbiased(randomSample))
print("Numpy Unbiased Variance:", np.var(randomSample, ddof=1))

Unbiased Variance: 3.584802166666666
Numpy Unbiased Variance: 3.584802166666666


See the difference? We simply divide by *n* - 1 instead of *n*, where *n* is the length of the sample.
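
Written as formulas, the two estimators differ only in the divisor. The notation here ($S_n^2$, $S_{n-1}^2$, $\bar{x}$ for the sample mean, $\sigma^2$ for the distribution variance) is introduced for illustration and does not appear in the code above:

$$
S_n^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad S_{n-1}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
$$

For independent draws from a distribution with finite variance, the standard result is

$$
E\left[S_n^2\right] = \frac{n-1}{n}\,\sigma^2, \qquad E\left[S_{n-1}^2\right] = \sigma^2
$$

so the divide-by-$n$ estimator has an expected error of $-\sigma^2/n$. With $n = 7$ and $\sigma^2 = 1$, that is $-1/7 \approx -0.143$.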

Let's simulate a random sample of size 7, 1000 times again. This time, we'll compare the errors of both the biased and unbiased variance estimators:

In [19]:
def MeanError(estimates, actual):
    errors = [estimate - actual for estimate in estimates]
    return np.mean(errors)

mu = 0
sigma = 1
n = 7
m = 1000

biased = []
unbiased = []
for i in range(m):
    values = np.random.normal(mu, sigma, n)
    biased.append(Variance(values)) # or using np.var(values)
    unbiased.append(VarianceUnbiased(values)) # or using np.var(values, ddof=1)
    
actualVariance = sigma**2 # Variance is also standard deviation squared

print("Mean error of biased:", MeanError(biased, actualVariance))
print("Mean error of unbiased:", MeanError(unbiased, actualVariance))

Mean error of biased: -0.16390481348808378
Mean error of unbiased: -0.02455561573609773


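In this case, the mean error of the unbiased estimator is much closer to 0 than that of the biased estimator. As the number of iterations, *m*, increases, we expect the mean error of the unbiased estimator to approach 0, whereas the mean error of the biased estimator stays below 0.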
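
The earlier remark that the biased estimator is good enough for large samples can be checked with a short simulation. This is a sketch under assumptions: the seed, the sample sizes `5` and `100`, and `m = 20000` are arbitrary choices, and the printed comparison value `-sigma**2 / n` is the standard theoretical bias of the divide-by-`n` estimator for independent draws.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded so the run is repeatable
sigma = 1.0
m = 20000  # number of simulated samples per sample size

mean_errors = {}
for n in (5, 100):
    # m samples of size n, one per row
    samples = rng.normal(0.0, sigma, size=(m, n))
    # ddof=0 divides by n (the biased estimator)
    biased = np.var(samples, axis=1, ddof=0)
    # Mean error relative to the true variance sigma**2
    mean_errors[n] = np.mean(biased - sigma**2)
    print(n, mean_errors[n], "theory:", -sigma**2 / n)
```

The mean error should sit near `-sigma**2 / n` for each sample size, so it shrinks toward 0 as `n` grows even though the estimator remains biased.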

Properties like the MSE and bias are long-term expectations based on many iterations of the estimation game. But when you apply the estimator to real data, you only get one estimate. It would not be meaningful to say that the estimate is biased or unbiased, since being biased or unbiased is a property of the estimator, not the estimate. 

After choosing an estimator with appropriate properties and using it to generate an estimate, the next step is to find the uncertainty of the estimate.