Homework 2: Statistical experiments
----
- Author: Carson Hanel
- Class : STAT 489 Principles of Datascience and Statistics
- Prof  : Alan Dabney

-1. Variable Explanation:
----
Standard Normal Distribution:
----
- $\sigma$ = The standard deviation
- $\sigma^2$ = The variance
- $\mu$ = the true mean of the population
- $\bar{x}$ = the sample mean, an estimate of $\mu$
- $n$ = the sample size (the number of values within each sample)
- $N$ = the number of simulated samples (trials)
- $N$($\mu$ = 0, $\sigma^2$ = 1) = the standard normal distribution

In [1]:
import numpy as np
import random
import math
import time
from scipy import stats

In [2]:
# Seed NumPy's generator (the data below comes from np.random, not the random module)
np.random.seed(int(time.time()))

# Trials and values per trial
N = 5000
n = 100

# Create the random data for both gamma and standard normals
data_std = np.random.standard_normal((n, N))
data_gam = np.random.gamma(1, 1, (n, N))


# Peek at the top-left 5x5 corner of each array
print("Gamma:")
print(data_gam[:5,:5])
print("Standard:")
print(data_std[:5,:5])

Gamma:
[[ 1.39166856  2.95022937  3.2152953   0.30580004  0.078118  ]
 [ 1.8880772   0.56576729  0.31334158  0.55678298  0.24068433]
 [ 1.24905435  0.70264868  2.44709946  0.9003395   0.45689789]
 [ 0.72723502  0.40291857  0.1630274   0.08346973  0.29579881]
 [ 0.06855272  1.91070313  1.22314052  1.3902951   3.47182825]]
Standard:
[[-1.31613006  1.63402001  1.52483261  1.97623751 -0.43204724]
 [-1.58017785  0.92940386 -1.31208601  0.03152515 -1.56648048]
 [-0.21385747 -0.12168759 -0.97892401 -1.08874484 -0.96293625]
 [-0.69737994 -0.13680079  0.54844969 -0.00363674 -0.38587469]
 [ 0.34392427 -1.54948076 -1.36669858 -0.81154387 -0.97039009]]


First, I'll begin by defining some important parts of sample tests:
----
- Sample (one of 5000 simulated samples):

$Y$ = $\{y_1, y_2, ..., y_n\}$ where $n$ is 100 values

- Sample mean: 

$\bar{y}$ = $\frac{y_1 + y_2 + ... + y_n}{n}$

- Sample standard deviation:

$\hat{\sigma}$ = $\sqrt{\frac{(y_1-\bar{y})^2 + (y_2-\bar{y})^2+...+(y_n-\bar{y})^2}{n-1}}$

- Test statistic:

$t$ = $\frac{\bar{y}-m_0}{\frac{\hat{\sigma}}{\sqrt{n}}}$ where $m_0$ is the hypothesized value of $\mu$; here $m_0$ = 0
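
As a quick check of the definitions above, the following sketch computes the test statistic by hand for one small sample and compares it with `scipy.stats.ttest_1samp`. The sample values and the helper name `t_statistic` are illustrative assumptions, not part of the assignment.

```python
import numpy as np
from scipy import stats

def t_statistic(y, m0=0.0):
    """One-sample t statistic: (y_bar - m0) / (sigma_hat / sqrt(n))."""
    y = np.asarray(y, dtype=float)
    n = y.size
    y_bar = y.mean()                    # sample mean
    sigma_hat = y.std(ddof=1)           # sample standard deviation (n - 1 denominator)
    return (y_bar - m0) / (sigma_hat / np.sqrt(n))

# Hypothetical sample, chosen only for illustration
y = [0.3, -1.2, 0.8, 1.5, -0.4, 0.1]

t_manual = t_statistic(y)
t_scipy = stats.ttest_1samp(y, popmean=0.0).statistic
print(t_manual, t_scipy)
```

The two values should agree up to floating-point error, since `ttest_1samp` uses the same $n-1$ denominator for the standard deviation.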

In [3]:
# Create an array for the t, p values
sim_ts = np.zeros((N))
sim_pv = np.zeros((N))

# Save computing power
n_sqrt = np.sqrt(n)

# Iterate over our generated simulations and calculate t-statistics:
for i in range(N):
    sim_i     = data_std[:, i]                     # Grab the i'th simulated sample
    x_bar_i   = np.mean(sim_i)                     # Calculate the sample mean
    s_i       = np.std(sim_i, ddof=1)              # Sample standard deviation (n-1 denominator)
    sim_ts[i] = (x_bar_i - 0)/(s_i / n_sqrt)       # Test statistic
    sim_pv[i] = stats.t.sf(sim_ts[i], n-1)         # One-sided (upper-tail) p-value
                                                   # credit: http://docs.scipy.org/doc/scipy/reference/tutorial/stats.

Note: How T and P statistics relate
----
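
The relationship can be sketched as follows: a p-value is a tail area of the t distribution with $n-1$ degrees of freedom, evaluated at the observed statistic. `stats.t.sf` returns the upper-tail (one-sided) area, so larger t values give smaller p-values; doubling the tail beyond $|t|$ gives the two-sided value. The t values below are arbitrary illustrations, not results from the simulation.

```python
import numpy as np
from scipy import stats

df = 99                                              # n - 1 with n = 100
t_values = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])     # arbitrary example statistics

p_upper = stats.t.sf(t_values, df)                   # P(T >= t), one-sided
p_two_sided = 2 * stats.t.sf(np.abs(t_values), df)   # P(|T| >= |t|), two-sided

for t, p1, p2 in zip(t_values, p_upper, p_two_sided):
    print("t = %5.2f  upper-tail p = %.4f  two-sided p = %.4f" % (t, p1, p2))
```

A negative statistic therefore has an upper-tail p-value above 0.5, while its two-sided p-value depends only on its magnitude.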

In [4]:
# Peek at the first five trials
for i in range(5):
    print("T-statistic:" + str(sim_ts[i]))
    print("P-value    :" + str(sim_pv[i]))
    print("")

T-statistic:-0.89555033342
P-value    :0.813667355585

T-statistic:-0.941714457856
P-value    :0.825684428964

T-statistic:-1.56741513407
P-value    :0.939895701209

T-statistic:0.157791606109
P-value    :0.437471203585

T-statistic:0.624025877801
P-value    :0.267022950253



In [5]:
# Calculate the fraction of p-values in N simulations below each threshold
lt_05 = float(len([x for x in sim_pv if x < .05])) / N
lt_10 = float(len([x for x in sim_pv if x < .10])) / N
lt_15 = float(len([x for x in sim_pv if x < .15])) / N

# Under the null hypothesis the p-values are uniform on [0, 1], so each
# fraction should land close to its threshold, and it does.
# Note: these are one-sided (upper-tail) p-values from stats.t.sf.
print(".05 p-value: " + str(lt_05))
print(".10 p-value: " + str(lt_10))
print(".15 p-value: " + str(lt_15))

.05 p-value: 0.0536
.10 p-value: 0.1064
.15 p-value: 0.1582
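
The same rejection rates can be computed without an explicit loop. The sketch below is one possible vectorized version, not the code that produced the numbers above. It assumes a fixed seed (`12345`, an arbitrary choice) so that a run is repeatable, and it reports both one-sided and two-sided p-values.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(12345)    # arbitrary fixed seed (assumption, for repeatability)
N, n = 5000, 100                      # trials and values per trial
data = rng.standard_normal((n, N))    # one simulated sample per column

y_bar = data.mean(axis=0)                 # sample mean of every column
sigma_hat = data.std(axis=0, ddof=1)      # sample standard deviation (n - 1 denominator)
t = y_bar / (sigma_hat / np.sqrt(n))      # test statistic with m_0 = 0

p_one = stats.t.sf(t, n - 1)              # one-sided (upper-tail) p-values
p_two = 2 * stats.t.sf(np.abs(t), n - 1)  # two-sided p-values

for alpha in (0.05, 0.10, 0.15):
    print(alpha, np.mean(p_one < alpha), np.mean(p_two < alpha))
```

Because both kinds of p-value are uniform under the null hypothesis, each printed fraction should sit near its threshold.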


-2. Variable Explanation
----

Gamma Distribution:
----
- $\alpha$ = the shape parameter
- $\beta$ = the scale parameter
- $\Gamma$($\alpha$ = 1, $\beta$ = 1) = the gamma distribution used here, with true mean $\mu$ = $\alpha\beta$ = 1
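
Since the coverage check needs the true mean, a short sketch can confirm that a gamma distribution with shape 1 and scale 1 has mean shape times scale, which is 1. It uses `scipy.stats.gamma` with the `a` (shape) and `scale` arguments; the variable names are illustrative.

```python
from scipy import stats

shape, scale = 1.0, 1.0
dist = stats.gamma(a=shape, scale=scale)   # frozen Gamma(shape=1, scale=1)

print("mean    :", dist.mean())            # shape * scale
print("variance:", dist.var())             # shape * scale**2
```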

In [6]:
# Create an array for the CI tuples
sim_ci = np.zeros((10, N, 2))

# Calculate 95% confidence intervals for sample sizes 10, 20, ..., 100
for j in np.arange(0, 100, 10):
    for i in range(N):
        sim_i        = data_gam[:(j+10), i]             # First j+10 values of the i'th simulated sample
        x_bar_i      = np.mean(sim_i)                   # Calculate the sample mean
        s_i          = np.std(sim_i)                    # Standard deviation (NumPy default, ddof=0)
        sim_ci[j//10, i, 0] = x_bar_i - 1.96 * s_i / np.sqrt((j+10))
        sim_ci[j//10, i, 1] = x_bar_i + 1.96 * s_i / np.sqrt((j+10))

In [7]:
# Calculate the fraction of intervals that cover the true mean, Mu = 1
cvrg = np.zeros((10, N))
for i in range(10):
    for j in range(N):
        if(sim_ci[i, j, 0] < 1 and sim_ci[i, j, 1] > 1):
            cvrg[i, j] = 1
    print(str((i+1)*10) + " sample percent covering Mu: " + str(np.mean(cvrg[i])))
    
# versus your code
for n in range(10, 101, 10):
    results = np.empty(5000)
    for i in range(0, 5000):
        rvs = stats.gamma.rvs(a=1, scale=1, size = n)
        upper = rvs.mean() + 1.96 * rvs.std() / np.sqrt(n) #Ahhhh! My n wasn't varying.
        lower = rvs.mean() - 1.96 * rvs.std() / np.sqrt(n)
        results[i] = upper > 1 and lower < 1
    print(results.mean())

10 sample percent covering Mu: 0.8624
20 sample percent covering Mu: 0.893
30 sample percent covering Mu: 0.9128
40 sample percent covering Mu: 0.9246
50 sample percent covering Mu: 0.926
60 sample percent covering Mu: 0.9302
70 sample percent covering Mu: 0.93
80 sample percent covering Mu: 0.9318
90 sample percent covering Mu: 0.9364
100 sample percent covering Mu: 0.9366
0.857
0.8992
0.914
0.9234
0.9248
0.938
0.9332
0.9312
0.9346
0.942
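
The coverage experiment can also be sketched in vectorized form. This is an illustrative variant, not the code that produced the numbers above. It assumes a fixed seed (`2024`, arbitrary), draws fresh data for every sample size, uses the $n-1$ standard deviation, and takes the critical value as a parameter so that 1.96 can be compared with the t critical value `stats.t.ppf(0.975, n - 1)`. The helper name `coverage` is hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(2024)    # arbitrary fixed seed (assumption)
N = 5000                             # simulated samples per sample size
true_mean = 1.0                      # mean of Gamma(shape=1, scale=1)

def coverage(n, crit):
    """Fraction of N intervals y_bar +/- crit * s / sqrt(n) that contain true_mean."""
    data = rng.gamma(1.0, 1.0, size=(n, N))   # one sample of size n per column
    y_bar = data.mean(axis=0)
    s = data.std(axis=0, ddof=1)              # sample standard deviation (n - 1 denominator)
    half_width = crit * s / np.sqrt(n)
    return np.mean((y_bar - half_width < true_mean) & (true_mean < y_bar + half_width))

for n in range(10, 101, 10):
    z_cov = coverage(n, 1.96)                          # normal critical value
    t_cov = coverage(n, stats.t.ppf(0.975, n - 1))     # t critical value
    print(n, z_cov, t_cov)
```

The t critical value is larger than 1.96 for small $n$, so those intervals are wider; how much of the shortfall from 95% that recovers for skewed gamma data is something to read off the printed output rather than assume.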
