# Challenge 3: Parameter Estimation 1

The first step is to import relevant libraries.

In [1]:
import pandas as pd
import numpy as np

The second step is to import data from the given file.

In [3]:
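df = pd.read_csv("3challenge-1.csv", index_col=0)   # replaces DataFrame.from_csv, which newer pandas versions no longer provide
dftraining = df.loc[df['label'].notna()]            # rows with a known label
dftesting = df.loc[df['label'].isna()]              # rows whose label must be estimated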
print(dftraining.shape)
print(dftesting.shape)

(5000, 9)
(5000, 9)


One can work with the data in its original pandas DataFrame format, or convert it into NumPy arrays.

In [4]:
TrainingData = dftraining.to_numpy()   # all columns: Y0..Y7 followed by label (as_matrix is no longer available)
TestData = dftesting[['Y0', 'Y1', 'Y2', 'Y3', 'Y4', 'Y5', 'Y6', 'Y7']].to_numpy()   # observations only

After creating an algorithm and generating labels, one should update the original CSV file.

$\because$ $X_i \sim \mathrm{Binomial}(40,\theta)$, and $X_i$ and $X_j$ are independent when $i \neq j$,

$\therefore$ $p_{X_i}(x)={40\choose x}\theta^{x}(1-\theta)^{40-x}=\frac{40!}{x!(40-x)!}\theta^{x}(1-\theta)^{40-x}$,    

$\therefore$ $p_{\underline X}(\underline x)=\prod_{i=1}^8p_{X_i}(x_i)=(40!)^8\left(\prod_{i=1}^8x_i!(40-x_i)!\right)^{-1}(1-\theta)^{320}\left(\frac{\theta}{1-\theta}\right)^{\sum_{i=1}^8x_i}=h(\underline x)\,g(T(\underline x)|\theta)$,

$\ \ $ where $h(\underline x)=(40!)^8\left(\prod_{i=1}^8x_i!(40-x_i)!\right)^{-1}$, $g(t|\theta)=(1-\theta)^{320}\left(\frac{\theta}{1-\theta}\right)^t$, and $T(\underline x)=\sum_{i=1}^8x_i$,

$\therefore$ by the factorization theorem, $T(\underline x)=\sum_{i=1}^8x_i$ is a sufficient statistic for $\theta$.
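The factorization can be checked numerically. The following is a minimal sketch, not part of the original solution: the sample row `x` is hypothetical, and the helper names `joint_pmf`, `h`, and `g` are chosen here only to mirror the symbols in the derivation.

```python
from math import comb, isclose, prod

N_TRIALS = 40   # trials per observation, as in Binomial(40, theta)
N_OBS = 8       # observations per row (Y0..Y7)

def joint_pmf(x, theta):
    """Joint pmf of 8 independent Binomial(40, theta) observations."""
    return prod(comb(N_TRIALS, xi) * theta**xi * (1 - theta)**(N_TRIALS - xi) for xi in x)

def h(x):
    """Factor that depends only on the data: (40!)^8 / prod(x_i! (40 - x_i)!)."""
    return prod(comb(N_TRIALS, xi) for xi in x)

def g(t, theta):
    """Factor that depends on the data only through t = sum(x)."""
    return (1 - theta)**(N_TRIALS * N_OBS) * (theta / (1 - theta))**t

x = [12, 9, 15, 11, 10, 13, 8, 14]   # hypothetical row, for illustration only

# The identity p(x) = h(x) * g(T(x) | theta) should hold for every theta in (0, 1).
factorization_holds = all(
    isclose(joint_pmf(x, theta), h(x) * g(sum(x), theta), rel_tol=1e-9)
    for theta in (0.1, 0.3, 0.5, 0.8)
)
print(factorization_holds)
```

Because `g` sees the data only through `sum(x)`, any two rows with the same total carry the same information about $\theta$.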

$\because$ $\theta \sim \mathrm{Beta}(2,5)$, $\therefore$ $\theta \in[0,1]$.

$\because$ $T(\underline X)=\sum_{i=1}^8X_i \sim \mathrm{Binomial}(320,\theta)$,

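$\therefore$ for any function $g$ of $T$ (unrelated to the factor $g(t|\theta)$ above), $E_\theta [g(T)] = \sum_{t=0}^{320}g(t)\frac{320!}{t!(320-t)!}(1-\theta)^{320}\left(\frac{\theta}{1-\theta}\right)^t$, which is $(1-\theta)^{320}$ times a polynomial in $\frac{\theta}{1-\theta}$ when $\theta \in(0,1)$,

$\ \ $ while $P_0(T=0)=1$ gives $E_0[g(T)] = g(0)$ when $\theta = 0$, and $P_1(T=320)=1$ gives $E_1[g(T)] = g(320)$ when $\theta = 1$.

$\therefore$ if $E_\theta [g(T)] = 0$ for all $\theta$, the polynomial vanishes for every $\frac{\theta}{1-\theta} \in (0,\infty)$, so all of its coefficients are zero, i.e. $g(t)=0$ for $t=0,\dots,320$,

$\Rightarrow \ P_{\theta}(g(T)=0) = 1$ when $\theta \in (0,1)$, and at the endpoints $g(0)=0$ gives $P_0(g(T)=0) = P_0(T=0) = 1$ while $g(320)=0$ gives $P_1(g(T)=0) = P_1(T=320) = 1$,

$\Rightarrow \ P_{\theta}(g(T)=0) = 1$ for all $\theta$,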

$\therefore$ $T(\underline x)=\sum_{i=1}^8x_i$ is a complete statistic.

Suppose $\theta '(\underline X) = \frac{1}{320}\sum_{i=1}^8X_i$, 

then $E[\theta '(\underline X)] = \frac{1}{8}\sum_{\underline x \in \mathcal{X}}\left(\sum_{i=1}^8\frac{x_i}{40}\right)p_{\underline X}(\underline x)=\frac{1}{8}\sum_{i=1}^8\left(\sum_{x_i=0}^{40}\frac{x_i}{40}p_{X_i}(x_i)\right)\left(\sum_{\mathcal{X}\setminus\mathcal{X}_i}\prod_{j=1,j\neq i}^8p_{X_j}(x_j)\right)=\frac{1}{8}\sum_{i=1}^8\theta \times 1 = \theta$,

$\therefore\ \theta '(\underline X)$ is an unbiased estimator of $\theta$,

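$\therefore$ by the Lehmann-Scheffé theorem, since $\theta '(\underline X)$ is unbiased and is a function of the complete sufficient statistic $T$, $\theta ''(\underline X) = \frac{1}{320}\sum_{i=1}^8X_i$ is the minimum-variance unbiased (MVU) estimator of $\theta$.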
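As a sanity check on the derivation, a short simulation can confirm that the estimator is unbiased and that its variance is close to $\theta(1-\theta)/320$, the variance of a $\mathrm{Binomial}(320,\theta)$ count divided by 320. This is only a sketch: the function name `simulate_estimator`, the value $\theta = 0.3$, the trial count, and the seed are all assumptions made for illustration.

```python
import numpy as np

def simulate_estimator(theta, n_rows=20000, n_obs=8, n_trials=40, seed=0):
    """Draw n_rows rows of n_obs Binomial(n_trials, theta) samples and
    return the empirical mean and variance of (sum of row) / (n_obs * n_trials)."""
    rng = np.random.default_rng(seed)
    samples = rng.binomial(n_trials, theta, size=(n_rows, n_obs))
    estimates = samples.sum(axis=1) / (n_obs * n_trials)
    return estimates.mean(), estimates.var()

theta_true = 0.3                                   # hypothetical value for illustration
mean_est, var_est = simulate_estimator(theta_true)
theoretical_var = theta_true * (1 - theta_true) / 320

print("empirical mean:      %.4f" % mean_est)
print("empirical variance:  %.6f" % var_est)
print("theoretical variance: %.6f" % theoretical_var)
```

The empirical mean should sit close to `theta_true`, and the empirical variance close to `theoretical_var`, up to sampling noise.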

In [5]:
# Estimate theta for each test row with the estimator above: (Y0 + ... + Y7) / 320 = (row mean) / 40
Test_Label = np.average(TestData, axis=1) / 40
TestData_1 = np.c_[TestData, Test_Label]
dftesting_1 = pd.DataFrame(TestData_1, columns=['Y0', 'Y1', 'Y2', 'Y3', 'Y4', 'Y5', 'Y6', 'Y7', 'label'])
#******************************** Validation on the training set ***************************
# Apply the same estimator to the labeled rows (columns 0-7 hold Y0..Y7, column 8 holds the label)
TrainingData_est = np.average(TrainingData[:, 0:8], axis=1) / 40
# Empirical mean-squared error between the estimates and the known labels
mse = np.average((TrainingData_est - TrainingData[:, 8])**2)
print("The approximation of the mean-squared error is about %.6f." % mse)

The approximation of the mean-squared error is about 0.000561.
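The empirical value can be compared with what the model predicts. The sketch below assumes that the training labels are the true $\theta$ values drawn from $\mathrm{Beta}(2,5)$ and that each observation is $\mathrm{Binomial}(40,\theta)$, as stated in the derivation. Under those assumptions the conditional MSE is $\theta(1-\theta)/320$, and for a $\mathrm{Beta}(a,b)$ prior $E[\theta(1-\theta)] = \frac{ab}{(a+b)(a+b+1)}$. The variable names are chosen here for illustration.

```python
from fractions import Fraction

a, b = 2, 5            # Beta prior parameters from the derivation
n_total = 8 * 40       # total number of Bernoulli trials per row

# E[theta * (1 - theta)] under Beta(a, b)
expected_theta_one_minus_theta = Fraction(a * b, (a + b) * (a + b + 1))

# Average of the conditional MSE theta * (1 - theta) / 320 over the prior
expected_mse = expected_theta_one_minus_theta / n_total

print("Expected mean-squared error: %.6f" % float(expected_mse))
```

The result, $1/1792 \approx 0.000558$, is close to the 0.000561 measured on the training set, which is consistent with the estimator behaving as derived.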


In [6]:
df = pd.concat([dftraining, dftesting_1], join='outer', ignore_index=True)
df.to_csv("3challenge-1.csv")