Homework 2: Statistical experiments
----
- Author: Carson Hanel
- Class : STAT 489 Principles of Datascience and Statistics
- Prof  : Alan Dabney

-1. Variable Explanation:
----
Standard Distribution:
----
- $\sigma$ = The standard deviation
- $\sigma^2$ = The variance
- $\mu$ = the true mean of the sample
- $\bar{x}$ = the estimated mean of the sample
- $n$ = The sample size
- $N$ = The number of values within the sample
- $N$($\mu$ = 0, $\sigma^2$ = 0) = the normal distribution

In [9]:
import numpy as np
import random
import math
import time
from scipy import stats

In [72]:
# Seed random
random.seed(time.time())

# Trials and values per trial
N = 5000
n = 100

# Create the random data for both gamma and standard normals
data_std = np.random.standard_normal((n, N))
data_gam = np.random.gamma(1, 1, (n, N))

# Just showing the data is truly random
print("Gamma:")
print(data_gam[:5,:5])
print("Standard:")
print(data_std[:5,:5])

Gamma:
[[ 1.79722526  1.20137891  0.65172731  0.01697816  0.21273913]
 [ 0.2706963   1.16002715  0.05033513  0.49403742  1.05551058]
 [ 1.94914248  0.56853525  0.51708952  1.36284182  0.06338771]
 [ 0.46702396  0.27705309  0.02588675  0.08280801  1.45206318]
 [ 0.14322237  0.21427689  6.13247993  0.46920567  1.95117935]]
Standard:
[[ 0.01183736 -1.7544278  -0.71253037 -0.7834558   0.06995385]
 [ 0.88423845 -0.5994952  -0.05254781  0.2516199  -1.86709008]
 [ 1.38617501 -0.89413782 -0.7063221   1.44143815  0.46341997]
 [-0.34453516  0.78109501  0.09506905 -0.90234608  0.3869368 ]
 [ 1.28899579  0.85954198  0.48901115  1.7819353  -0.45715746]]


First, I'll begin by defining some important parts of sample tests:
----
- Sample (5000 total simulations):

$Y$ = $\sum_{i=1}^{n}$$y_i$ where n is 100 values 

- Sample mean: 

$\bar{y}$ = $\frac{y_1 + y_2 + ... + y_n}{n}$

- Sample standard deviation:

$\hat{\sigma}$ = $\sqrt{\frac{(y_1-\bar{y})^2 + (y_2-\bar{y})^2+...+(y_n-\bar{y})^2}{n-1}}$

- Test statistic:

$t$ = $\frac{\bar{y}-m_0}{\frac{\hat{\sigma}}{\sqrt{n}}}$ where $m_0$ is the hypothesized value, or rather, $\mu$ = 0

In [23]:
# Create an array for the t, p values
sim_ts = np.zeros((N))
sim_pv = np.zeros((N))

# Save computing power
n_sqrt = np.sqrt(n)

# Iterate our generated simulations and calculate t-statistcs:
for i in range(N):                       
    sim_i     = data_std[:, i]                     # Grab the i'th simualtion value
    x_bar_i   = np.mean(sim_i)                     # Calculate the sample mean
    s_i       = np.std(sim_i)                      # Calculate the sample standard deviation
    sim_ts[i] = (x_bar_i - 0)/(s_i / n_sqrt)       # Test statistic
    sim_pv[i] = stats.t.sf(sim_ts[i], n-1)         # P-value
                                                   # credit: http://docs.scipy.org/doc/scipy/reference/tutorial/stats.

Note: How T and P statistics relate
----

In [20]:
# Peek at the first five trials
for i in range(5):
    print("T-statistic:" + str(sim_ts[i]))
    print("P-value    :" + str(sim_pv[i]))
    print("")

T-statistic:1.23254330145
P-value    :0.110332622635

T-statistic:-0.650494376126
P-value    :0.741560128541

T-statistic:-0.141347191855
P-value    :0.55605855261

T-statistic:0.440036604946
P-value    :0.330434895032

T-statistic:-0.609329435351
P-value    :0.728149023141



In [21]:
# Calculate the percent of p values in N simulations less than thresholds
lt_05 = float(len([x for x in sim_pv if x < .05])) / N
lt_10 = float(len([x for x in sim_pv if x < .10])) / N
lt_15 = float(len([x for x in sim_pv if x < .15])) / N

# As you can see, the p-values are pretty solid!
print(".05 p-value: " + str(lt_05))
print(".10 p-value: " + str(lt_10))
print(".15 p-value: " + str(lt_15))

.05 p-value: 0.0534
.10 p-value: 0.1058
.15 p-value: 0.1536


-2. Variable Explanation
----

Gamma Distribution:
----
- $\alpha$ = shape of the experiment
- $\beta$ = scale of the experiment
- $\Gamma$($\alpha$ = 1, $\beta$ = 1) = the gamma distribution

In [58]:
# Create an array for the CI tuples
sim_ci = np.zeros((10, N, 2))

# Save computing power
n_sqrt = np.sqrt(n)

# Calculate 95% confidence intervals
for i in range(N):    
    for j in np.arange(10, 100, 10):
        sim_i        = data_gam[:j, i]               # Grab the i'th simualtion value
        x_bar_i      = np.mean(sim_i)                # Calculate the sample mean
        s_i          = np.std(sim_i)                 # Calculate the sample standard deviation
        sim_ci[j/10, i, 0] = x_bar_i - 1.96 * s_i / n_sqrt
        sim_ci[j/10, i, 1] = x_bar_i + 1.96 * s_i / n_sqrt

In [69]:
# Calculate which percent of trials include Mu = 0
cvrg = np.zeros((10, N))
for i in range(10):
    for j in range(N):
        if(sim_ci[i, j, 0] < 1 and sim_ci[i, j, 1] > 1):
            cvrg[i, j] = 1
    print(str((i+1)*10) + " sample percent covering Mu: " + str(np.mean(cvrg[i])))

10 sample percent covering Mu: 0.0
20 sample percent covering Mu: 0.4016
30 sample percent covering Mu: 0.5796
40 sample percent covering Mu: 0.6806
50 sample percent covering Mu: 0.7634
60 sample percent covering Mu: 0.8176
70 sample percent covering Mu: 0.8584
80 sample percent covering Mu: 0.8894
90 sample percent covering Mu: 0.9148
100 sample percent covering Mu: 0.9244
