# Challenge 3: Parameter Estimation 1

The first step is to import relevant libraries.

In [1]:
import pandas as pd
import numpy as np

The second step is to import data from the given file.

In [2]:
# DataFrame.from_csv was removed from pandas; read_csv with index_col=0 matches its default.
df = pd.read_csv("3challenge-1.csv", index_col=0)
dftraining = df.loc[~np.isnan(df['label'])]
dftesting = df.loc[np.isnan(df['label'])]
print(dftraining.shape)
print(dftesting.shape)

(5000, 9)
(5000, 9)


One can use the data in their original pandas DataFrame format, or transform these objects into NumPy arrays.

In [3]:
# as_matrix was removed from pandas; to_numpy returns the same array.
TrainingData = dftraining.to_numpy()
TestData = dftesting[['Y0', 'Y1', 'Y2', 'Y3', 'Y4', 'Y5', 'Y6', 'Y7']].to_numpy()

After creating an algorithm and generating labels, one should update the original CSV file.

If we know that $Y_0, Y_1, \dots, Y_7$ are i.i.d. $B(40,\theta)$,

we set $\hat{\theta} = \frac{\bar{X}}{40}$, where $\bar{X}$ is the mean of the eight observations. Then $E(\hat{\theta}) = \theta$ and $Var(\hat{\theta}) = \frac{\theta(1-\theta)}{320}$.

If $Var(\hat{\theta})$ equals the Cramer-Rao lower bound (CRLB), then $\hat{\theta}$ is the minimum-variance unbiased (MVU) estimator.
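
The stated mean and variance of $\hat{\theta}$ can be checked numerically. The following is a minimal simulation sketch, not part of the original notebook: the true value `theta = 0.3`, the seed, and the number of simulated rows are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (assumption, for reproducibility)
theta = 0.3                      # hypothetical true parameter
n_trials, n_obs, n_rows = 40, 8, 200_000

# Each row is one sample Y0..Y7 drawn i.i.d. from B(40, theta).
samples = rng.binomial(n_trials, theta, size=(n_rows, n_obs))

# theta_hat = (mean of the 8 observations) / 40 = (row sum) / 320.
theta_hat = samples.mean(axis=1) / n_trials

empirical_mean = theta_hat.mean()
empirical_var = theta_hat.var()
theoretical_var = theta * (1 - theta) / (n_trials * n_obs)

print(empirical_mean, empirical_var, theoretical_var)
```

With this many simulated rows, the empirical mean should sit close to `theta` and the empirical variance close to $\frac{\theta(1-\theta)}{320}$.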

The log-likelihood function is:

$\log f_{X}(x;\theta)=\log\prod_{i=1}^{8}\binom{40}{x_{i}}\theta^{x_{i}}(1-\theta)^{40-x_{i}}=\sum_{i=1}^{8}\log\binom{40}{x_{i}} + \sum_{i=1}^{8}x_{i}\log\theta + \sum_{i=1}^{8}(40-x_{i})\log(1-\theta)$

$I_{X}(\theta) = -E\left(\frac{\partial^{2}\log f_{X}(x;\theta)}{\partial \theta^{2}}\right) = E\left(\sum_{i=1}^{8}\frac{x_{i}}{\theta^{2}}+\sum_{i=1}^{8}\frac{40-x_{i}}{(1-\theta)^{2}}\right) = \frac{320}{\theta(1-\theta)}$

$\mathrm{CRLB} = I_{X}^{-1}(\theta) = \frac{\theta(1-\theta)}{320} = Var(\hat{\theta})$

Therefore, $\hat{\theta}$ attains the bound and is the minimum-variance unbiased estimator.
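
The Fisher information can also be evaluated exactly by summing over the binomial probability mass function, which gives an independent check of the closed form. This is a sketch under assumptions: the helper `fisher_information` and the test value `theta = 0.3` are hypothetical and not part of the original notebook.

```python
from math import comb

def fisher_information(theta, n_trials=40, n_obs=8):
    """Exact Fisher information of n_obs i.i.d. B(n_trials, theta) observations."""
    info_single = 0.0
    for x in range(n_trials + 1):
        pmf = comb(n_trials, x) * theta**x * (1 - theta) ** (n_trials - x)
        # Negative second derivative of the log-pmf with respect to theta.
        neg_second_derivative = x / theta**2 + (n_trials - x) / (1 - theta) ** 2
        info_single += pmf * neg_second_derivative
    # Information adds up across independent observations.
    return n_obs * info_single

theta = 0.3  # hypothetical value, for illustration only
crlb = 1.0 / fisher_information(theta)
closed_form = theta * (1 - theta) / 320

print(crlb, closed_form)
```

The two printed numbers should agree up to floating-point rounding.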


In [6]:
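# Weight vector: summing the eight observations and dividing by 8*40 = 320 gives theta_hat.
A = np.full((8, 1), 1 / 320)
y_est = TrainingData[:, 0:8] @ A
y = np.reshape(TrainingData[:, 8], (5000, 1))
# Sum of squared errors over the training rows.
var = np.sum((y - y_est) ** 2)
print(var)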

2.80450764543


The value above is the sum of squared errors of the MVU estimator over the 5000 training rows; dividing it by 5000 gives the empirical MSE. This estimator does not even need the training data.

Next, I fit a linear regression to the training data to obtain a second estimator.

In [7]:
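from sklearn import linear_model
# Ordinary least squares (with intercept) on the eight observations.
regr = linear_model.LinearRegression()
regr.fit(TrainingData[:, 0:8], TrainingData[:, 8])
y_lr = regr.predict(TrainingData[:, 0:8]).reshape(-1, 1)
# Sum of squared errors over the training rows.
var_lr = np.sum((y - y_lr) ** 2)
print(var_lr)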

2.7456056673


It seems contradictory that the squared error of the linear regression estimator is smaller than that of the estimator that attains the CRLB.

But this is reasonable. The CRLB bounds the expected squared error of any unbiased estimator, i.e. the value that the empirical MSE approaches as the number of observations goes to infinity. There are only 5000 observations in this training set, and the regression was fitted on those same observations.

On the training set, the linear regression estimator has a smaller squared error, but it is not unbiased.
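
One way to see why the training-set comparison is not a contradiction: least squares minimizes the in-sample sum of squared errors over all affine functions of the observations, and $\hat{\theta}$ (weights $1/320$, zero intercept) is one such function. The sketch below illustrates this on synthetic data. It is not the challenge data; the assumption that the labels are drawn uniformly from $[0,1]$ and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed (assumption)
n_rows = 5000

# Assumption: synthetic labels drawn uniformly on [0, 1].
theta = rng.uniform(0.0, 1.0, size=n_rows)
X = rng.binomial(40, theta[:, None], size=(n_rows, 8)).astype(float)

# Fixed estimator: row sum divided by 8 * 40 = 320.
theta_hat = X.sum(axis=1) / 320.0
sse_fixed = np.sum((theta - theta_hat) ** 2)

# Least squares with an intercept column, fitted on the same rows.
design = np.column_stack([X, np.ones(n_rows)])
coef, *_ = np.linalg.lstsq(design, theta, rcond=None)
sse_ols = np.sum((theta - design @ coef) ** 2)

print(sse_fixed, sse_ols)
```

On the rows used for fitting, `sse_ols` cannot exceed `sse_fixed`, whatever the data are, so a smaller training error by itself says nothing about bias or about the CRLB.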

In conclusion, $\hat{\theta}$ should be the best estimator for the test data.

In [8]:
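# Apply the fixed weights to the test observations; keep the result as a plain (5000, 1) array.
y_pre = np.asarray(TestData @ A)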
Array=np.append(TestData,y_pre,axis=1)
dftesting=pd.DataFrame(Array,columns=['Y0', 'Y1', 'Y2', 'Y3', 'Y4', 'Y5', 'Y6', 'Y7','label'])

In [9]:
df = pd.concat([dftraining, dftesting], join='outer', ignore_index=True)
df.to_csv("3challenge-1.csv")