In [1]:
import pandas as pd
import numpy as np
import scipy as sp
from scipy.stats import norm
import matplotlib
from matplotlib import pyplot as plt
import matplotlib.mlab as mlab
import statsmodels as sm
%matplotlib inline

## Lesson 11: t-Tests, Part 3

In the previous lesson we did dependent-samples tests (repeated measures).  

Measuring the same subjects twice controls for individual differences.

## PS 10b: t-Tests, Part 2

In [None]:
counties = pd.DataFrame({'pre-test':[8, 7, 6, 9, 10, 5, 7, 11, 8, 7],
                       'post-test':[5, 6, 4, 6, 5, 3, 2, 9, 4, 4]})
counties.mean()

In [None]:
diffs = counties['post-test'] - counties['pre-test']
mean = diffs.mean()
sigma = diffs.std(ddof=1)
n = len(diffs)  # n must be defined before it is used in the standard error
se = sigma/np.sqrt(n)
df = n - 1
mean, sigma, se, n, df

#### What is the t_critical value for $\alpha = 0.05$ (one-tailed, left)?

In [None]:
sp.stats.t.ppf(.05, df)

#### Let's find the t_statistic for the data.  

The null hypothesis is that there is no difference.  

The alternative hypothesis is that there is a reduction in accidents due to phone use.  

$$H_{0}\;:\;\mu_{D} \geq 0$$
$$H_{A}\;:\;\mu_{D} < 0$$

Recall that the t_statistic is the number of standard errors separating the sample mean from the hypothesized mean.  Here the sample is the set of differences, so $\hat{x}$ is the mean difference in the equation:

$t_{statistic} = \frac{\hat{x} - \mu}{SE}$

$\mu = 0$ because the null hypothesis says the mean difference is zero.

Therefore:

$t_{statistic} = \frac{\hat{x} - 0}{SE}$

In [None]:
t_statistic = mean/float(se)
t_statistic
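The same calculation can be sketched without pandas. This is only an illustration using the standard library; the function name `paired_t_statistic` is made up for this note, and the data is copied from the `counties` table.

```python
import math
import statistics

# Paired scores copied from the `counties` table in this notebook.
pre_test = [8, 7, 6, 9, 10, 5, 7, 11, 8, 7]
post_test = [5, 6, 4, 6, 5, 3, 2, 9, 4, 4]

def paired_t_statistic(before, after):
    """Return (mean difference, standard error, t statistic) for paired data."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)   # sample std (Bessel's correction)
    se = sd_diff / math.sqrt(n)         # standard error divides by sqrt(n)
    # mu = 0 under the null hypothesis, so t is just mean_diff / se
    return mean_diff, se, mean_diff / se

mean_diff, se_diff, t_paired = paired_t_statistic(pre_test, post_test)
print(mean_diff, round(se_diff, 4), round(t_paired, 3))
```

The t statistic is large and negative, which matches the direction of the alternative hypothesis $\mu_{D} < 0$.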

Calculate the p-value (left tail, since $H_{A}$ is $\mu_{D} < 0$).  

In [None]:
sp.stats.t.cdf(t_statistic, df)

#### Calculate Cohen's D

Recall: Cohen's D measures how many standard deviations ($\sigma$) the intervention mean $\hat{x}$ lies from $\mu$.

$$Cohens\;D = \frac{\hat{x}-\mu}{\sigma}$$

In [None]:
cohens_d = (mean - 0)/float(sigma)
cohens_d

In [None]:
sp.stats.t.cdf(-2.25, df)

The lookup returns about 0.026, so the intervention mean $\hat{x}$ sits near the 2.6th percentile of the untreated population. 

#### Calculate R_squared Coefficient of Determination

Recall: $R^{2} = Coefficient\;of\;Determination$
$$R^{2} = \frac{t_{stat}^{2}}{t_{stat}^{2} + dof}$$

This tells us the proportion of the variation in the differences that is explained by the intervention.  

In [None]:
r_squared = t_statistic**2/(t_statistic**2 + df)
r_squared
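Both effect-size measures are one-line formulas, so they can be wrapped as small helpers. This is a sketch; the function names are illustrative, and the inputs (mean difference of -3, standard deviation of about 1.333, t of about -7.115, 9 degrees of freedom) are the values the counties data produces.

```python
def cohens_d(sample_mean, mu, sd):
    """Distance of the sample mean from mu, in standard deviations."""
    return (sample_mean - mu) / sd

def r_squared(t_stat, dof):
    """Coefficient of determination from a t statistic and degrees of freedom."""
    return t_stat ** 2 / (t_stat ** 2 + dof)

# Values from the counties example in this notebook.
d = cohens_d(-3, 0, 4 / 3)
r2 = r_squared(-7.115, 9)
print(round(d, 2), round(r2, 3))
```

Cohen's D ignores the sample size, while $R^{2}$ depends on it through t and the degrees of freedom, which is why the notebook reports both.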

#### What is the confidence interval?

Recall: a confidence interval is the sample mean plus or minus a margin of error.  The margin of error is the standard error multiplied by a t_score.  The t_score is looked up from the confidence level and the degrees of freedom (n - 1).

I had to manually use the rounded standard deviation 1.33 here because Udacity's answer checker bases its calculation on 1.33.  

In [None]:
confidence = .95
alpha = .05
se = 1.33/np.sqrt(n)
t = sp.stats.t.ppf(.975, df)
margin_95 = t * se
df, se, t, mean, margin_95

In [None]:
upperbound = mean + margin_95
lowerbound = mean - margin_95
lowerbound, upperbound

### Lesson 10b: t-Tests, Part 2

The average consumer in the US spends \$151 per month on food.  

We want to see if a program lowers the cost of food in the US.  

Dependent Variable = Cost of food

Independent Variable = The Program Implemented

If the null hypothesis is true, the program will result in an equal or greater monthly expense. 

$$H_{0}\;:\;\mu_{program} \geq 151$$
$$H_{A}\;:\;\mu_{program} < 151$$

What type of test should we do?  

We want to see if there's a decrease in cost, so we want a one-tailed test in the negative direction.  

Knowns:

In [None]:
mean = 151
n = 25
df = n - 1
sample_sigma = 50
se = sample_sigma/np.sqrt(n)
se

What's the t_critical for $\alpha = .05$?

Since this is one tailed negative direction test make sure you are looking for the value in the left tail region of the t distribution.  

In [None]:
t_critical = sp.stats.t.ppf(.05, df)
t_critical

What is the mean difference if 

$\hat{x} = 126$

mean difference = $\hat{x} - \mu$

In [None]:
x_hat = 126
x_hat - mean

Compute t-statistic, round to two decimal places.  

In [None]:
t_stat = (x_hat - mean)/se
t_stat

Does the t_statistic fall within the critical region?  

Yes it does.  The t_statistic is more negative than t_critical.  For $\alpha = .05$ the critical region is:

$t \leq t_{critical} = -1.71$.  

Our $t_{\hat{x}} = -2.5$

Therefore t is in the critical region.

What is the p-value for this t?  

In [None]:
sp.stats.t.cdf(t_stat, df)

Since this is a left-sided one-tailed test, we can just take the area under the curve up to t_stat.  

So:

$p = .0098$

Since $p \leq .05$, these results are significant. 

Now let's compute Cohen's D, since we're interested in the size of the effect.  

Recall that Cohen's D is the distance of the intervention mean from the population mean, measured in standard deviations.  

$cohens_d = \frac{(\hat{x} - \mu)}{\sigma}$

In [None]:
cohens_d = (x_hat - mean)/float(sample_sigma)
cohens_d

What is the Coefficient of Determination?  

Now we will compute the Coefficient of Determination $r^{2}$.  

It measures how much of the variance in the outcome is explained by the intervention.

Recall $r^{2}$ is the ratio between squared t and squared t plus degrees of freedom.  

What does this mean?  

In [None]:
t_stat**2/(t_stat**2 + df)

This tells us that ~21% of the variation in food cost is explained by the intervention.  

Now let's compute confidence intervals of 95% for our results.  

Confidence interval is a function of: desired confidence, standard error of the sample_mean, sample_size(dof for t-test)

In [None]:
confidence = .95
se, df, confidence

In [None]:
margin_95 = sp.stats.t.ppf((confidence + (1 - confidence)/2), df) * se
margin_95

Don't forget the difference between two-tailed and one-tailed tests when doing 't' or 'z' lookups.  

Confidence intervals are always two-tailed.  A 95% confidence interval is the area under the curve between:

$t_{.025}$ and $t_{.975}$ 

So when looking up the margin of error you use '.975' instead of '.95': the remaining 5% is split evenly, 2.5% in each tail.  

In [None]:
upperbound = x_hat + margin_95
lowerbound = x_hat - margin_95
lowerbound, upperbound
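The interval can also be checked by hand with a table lookup. The sketch below assumes the critical value 2.064 read from a printed t-table (two-tailed 95%, 24 degrees of freedom) rather than computed with SciPy; the helper name is illustrative.

```python
import math

def t_confidence_interval(sample_mean, sample_sd, n, t_critical):
    """Confidence interval: sample mean +/- t_critical * standard error."""
    se = sample_sd / math.sqrt(n)
    margin = t_critical * se
    return sample_mean - margin, sample_mean + margin

# Assumption: 2.064 is the t-table value for df = 24 at 95% (two-tailed).
T_CRIT_95_DF24 = 2.064

# Food-cost example: x_hat = 126, sample_sigma = 50, n = 25.
low, high = t_confidence_interval(126, 50, 25, T_CRIT_95_DF24)
print(round(low, 2), round(high, 2))
```

The upper bound stays below 151, which agrees with the significant one-tailed result above.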

### PS 10a: t-Tests, Part 1

kids vocabularies

In [None]:
mean_before = 3
std_before = 1.2

mean_after = 12
std_after = 2.7
n = 1000
df = n - 1

mean_difference = mean_after - mean_before
std = np.sqrt(std_after**2 + std_before**2)
se = std / np.sqrt(n)
mean_difference, std, se

$$H_{0}:\, \mu_{time1} \leq \mu_{time2} $$
$$H_{a}:\, \mu_{time1} > \mu_{time2} $$

We don't have population parameters so we'll use t-test.  

We're using two samples.  Neither is a population, so we should use differences.  

$$t_{statistic} = \frac{\bar{x} - \mu}{SE} $$

$$\mu = 0\, $$
$$Since\; we\; are\; using\; differences\;$$

$$t_{statistic} = \frac{9 - 0}{.0934} $$

In [None]:
t_statistic = mean_difference / se
t_statistic

For completeness, we'll look up the probability of getting this t_statistic from the t-table

In [None]:
p_value = sp.stats.t.cdf(t_statistic, df)
p_value

:)

Now let's find confidence interval of this data.  

Confidence interval is a bound based on a supplied desired confidence in the data.  Here we'll choose:
$$confidence = 95\%$$

In [None]:
t_score = sp.stats.t.ppf(.975, df)

$$E = SE * t_{score}$$

In [None]:
confidence_margin_95 = se * t_score
upperbound = mean_difference + confidence_margin_95
lowerbound = mean_difference - confidence_margin_95
lowerbound, upperbound

We're 95% confident that mean word usage increases by between 8.817 and 9.183 words per sentence.  

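Based on a p-value of 1.0 and a t_statistic of about 96.3, we fail to reject the null hypothesis as stated above ($\mu_{time1} \leq \mu_{time2}$).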

### Lesson 10a: t-Tests, Part 1

- we don't usually know the population mean and sigma

In this lesson we'll cover:
1. How different a sample mean is from a population
1. How different two sample means are from each other
    1. Dependence
    1. Independence

#### Types of Test Designs

Repeated Measures (e.g. errors on two types of keyboards)
$$H_{0} : \mu_{1} = \mu_{2}$$

#### Dealing with dependence and T-tests

In [None]:
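# DataFrame.from_csv no longer exists in current pandas; read_csv with these arguments matches its defaults
keyboard_performance = pd.read_csv('/Users/ryanlambert/Downloads/Keyboards - Lesson 10 - Sheet1.csv', index_col=0, parse_dates=True)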

In [None]:
keyboard_performance = keyboard_performance.reset_index()

In [None]:
keyboard_performance = keyboard_performance.iloc[:, 0:2]

In [None]:
# .iloc makes the positional lookup explicit (the Series is labeled by column name)
keyboard_performance.mean().iloc[0] - keyboard_performance.mean().iloc[1]

In [None]:
keyboard_performance.std(ddof=1)

In [None]:
difference = keyboard_performance['QWERTY errors'] - keyboard_performance['Alphabetical errors']

In [None]:
difference.std()

In [None]:
n = keyboard_performance.count().iloc[0]  # positional lookup of the first column's count
difference_sigma = difference.std()
qwerty_mean = keyboard_performance['QWERTY errors'].mean()
alphabet_mean = keyboard_performance['Alphabetical errors'].mean()
difference_mean = difference.mean()
se =  difference_sigma/np.sqrt(n)
df = n - 1
t_stat_qwerty = (qwerty_mean - alphabet_mean)/se

In [None]:
(qwerty_mean - alphabet_mean) / se

Always use degrees of freedom (n - 1) when doing t-test lookups.

Standard error is NEVER computed with degrees of freedom: divide by $\sqrt{n}$, not $\sqrt{n - 1}$.

Standard deviation is computed with Bessel's correction (dividing by n - 1) if the data is a sample instead of a population.
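The three rules above can be seen side by side in a small sketch. It uses only the standard library (`statistics.stdev` applies Bessel's correction, `statistics.pstdev` does not); the sample values are just an illustrative set of eight numbers.

```python
import math
import statistics

sample_set = [5, 19, 11, 23, 12, 7, 3, 21]  # illustrative sample
n = len(sample_set)

sample_sd = statistics.stdev(sample_set)       # divides by n - 1 (sample)
population_sd = statistics.pstdev(sample_set)  # divides by n (population)

# Standard error always divides by sqrt(n), never sqrt(n - 1).
se = sample_sd / math.sqrt(n)

# Degrees of freedom are only used for the t lookup.
dof = n - 1
print(round(sample_sd, 4), round(population_sd, 4), round(se, 4), dof)
```

The sample standard deviation is always the larger of the two, by a factor of $\sqrt{n/(n-1)}$.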

In [None]:
sp.stats.t.ppf(.975, 24)

In [2]:
-2.72/float(3.69/ np.sqrt(25))

-3.685636856368564

In [None]:
cohens_d = difference.mean() / float(difference_sigma)
cohens_d

Now let's calculate confidence interval 95% for how much difference we think there is

In [None]:
confidence_margin_95 = sp.stats.t.ppf(.975,df ) * se
confidence_margin_95

In [None]:
lower_bound = difference_mean - confidence_margin_95
upper_bound = difference_mean + confidence_margin_95
lower_bound, upper_bound

#### Dealing with T-tests and confidence intervals: Santa Clara rentals

Let's calculate a confidence interval for this new mean.  

In [None]:
mean = 1700
std = 200
n = 100
df = n - 1
se = std/np.sqrt(n)
# we'll use t because we're using sample_std rather than population_std
# pay attention: the t lookup uses degrees of freedom (n - 1) instead of sample size
confidence_95 = se * sp.stats.t.ppf(.95 + ((1 - .95)/2.0), df)
confidence_95

In [None]:
lower_bound = mean - confidence_95
upper_bound = mean + confidence_95
print(lower_bound, upper_bound)

A confidence interval is a range that we are x% confident contains the true mean.

Let's calculate the p-values

In [None]:
mean_rent = 1830
n = 25
df = n - 1
std_rent = 200
sample_mean_rent = 1700
se = std_rent / np.sqrt(n)
t_stat = (sample_mean_rent - mean_rent) / se
cohens_d = (sample_mean_rent - mean_rent) / float(std_rent)
se, sp.stats.t.cdf(t_stat, df), t_stat, cohens_d


but let's also get the 'critical_values' for alpha of .05 (each tail .025)

In [None]:
sp.stats.t.ppf(.975, 24)

#### random number sample 

In [None]:
mean = 10 
sample_set = [5, 19, 11, 23, 12, 7, 3, 21]
sample_sigma = np.std(sample_set, ddof=1)
sample_mean = np.mean(sample_set)
sample_size = len(sample_set)
df = sample_size - 1
se = sample_sigma / np.sqrt(sample_size)
# t_stat has the same form as a z_score, but uses the sample standard deviation
t_statistic = (sample_mean - mean) / se
t_statistic

In [None]:
(1 - sp.stats.t.cdf(t_statistic, df)) * 2

#### Finch beak widths

In [None]:
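# DataFrame.from_csv no longer exists in current pandas; read_csv with these arguments matches its defaults
finch_beaks = pd.read_csv('/Users/ryanlambert/Downloads/Finches - Lesson 10 - Sheet1.csv', index_col=0, parse_dates=True)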
finch_beaks = finch_beaks.reset_index()

In [None]:
n = 500
df = n - 1
sigma_finch = .4
mean_finch = 6.47
mean_finch_prior = 6.07
se = sigma_finch/np.sqrt(n)

In [None]:
t_statistic = (mean_finch - mean_finch_prior) / se

In [None]:
sp.stats.t.cdf(t_statistic, df)

dof = 23
t - stat > 2.45

Degrees of freedom example

In [None]:
sudoku = np.matrix('1, 1, 1, 0, 0, 0, 0, 0, 0;0, 0, 0, 1, 1, 1, 0, 0, 0;0, 0, 0, 0, 0, 0, 1, 1, 1;1, 0, 0, 1, 0, 0, 1, 0, 0;0, 1, 0, 0, 1, 0, 0, 1, 0;0, 0, 1, 0, 0, 1, 0, 0, 1')

In [None]:
something = pd.DataFrame(sudoku)

In [None]:
import sympy

In [None]:
sudoku = sympy.Matrix(sudoku)

Degrees of freedom correspond to the free variables left after row-reducing the system's constraint matrix.  

In other words, it is a count of the values that can still vary independently once the constraints are applied.

In [None]:
sudoku.rref()
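To make the count explicit without sympy, the rank of the same constraint matrix can be computed by hand. This is a sketch with a hypothetical helper `matrix_rank` that does Gauss-Jordan elimination on exact fractions; free variables are the number of columns minus the rank.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix via Gauss-Jordan elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        pivot_val = m[rank][col]
        m[rank] = [v / pivot_val for v in m[rank]]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank

# Same matrix as `sudoku`: three row-sum and three column-sum constraints
# on the nine cells of a 3x3 grid.
constraints = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 1],
]
rank = matrix_rank(constraints)
free_variables = len(constraints[0]) - rank
print(rank, free_variables)
```

One of the six constraints is redundant (the row sums and column sums add up to the same total), so four cells remain free, which is $(3 - 1)(3 - 1)$.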

### Lesson 9 Hypothesis Testing

PS 9b: Additional Practice 

flower shop

In [None]:
mean = 7895
sigma = 230
n = 5
intervention_mean = 9640
se = sigma/np.sqrt(n)
se

In [None]:
z_score = (intervention_mean - mean) / se
z_score

In [None]:
1 - sp.stats.norm.cdf(z_score)

sprinters

In [None]:
mean = 22.965
sigma = .360
n = 16
se = sigma / np.sqrt(n)
print(mean, sigma, n, se)

In [None]:
z_score = (22.793 - 22.965)/ se
z_score

In [None]:
p_value = sp.stats.norm.cdf(z_score)
p_value
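The z-test steps for the sprinters example can be bundled into one function. This sketch uses `statistics.NormalDist` from the standard library (Python 3.8+) in place of `sp.stats.norm`; the function name is illustrative.

```python
import math
from statistics import NormalDist

def one_sample_z(sample_mean, mu, sigma, n):
    """Return (z score, left-tail p-value) for a sample mean."""
    se = sigma / math.sqrt(n)          # standard error of the mean
    z = (sample_mean - mu) / se
    return z, NormalDist().cdf(z)      # area to the left of z

# Sprinters: population mean 22.965, sigma 0.360, n = 16, sample mean 22.793.
z, p_left = one_sample_z(22.793, 22.965, 0.360, 16)
print(round(z, 3), round(p_left, 3))
```

For a right-tailed question such as the flower shop example, the p-value would be `1 - NormalDist().cdf(z)` instead.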

PS 9a

In [None]:
se = std/np.sqrt(36)
intervention = 28
mean = 25
critical_value = (intervention - mean)/se
sp.stats.norm.cdf(critical_value), sp.stats.norm.ppf(.95), critical_value

In [None]:
mean = 25
std = 6
x = np.linspace(0, 50, 100000)
y = norm.pdf(x, mean, std)  # scipy.stats.norm replaces mlab.normpdf, which newer matplotlib no longer provides
plt.plot(x, y)
plt.plot((25 - 6, 25 - 6), (0, .07))

In [None]:
z_score = (7.8 - 7.47)/(2.41/np.sqrt(50))
z_score

In [None]:
z_score = (8.3 - 7.47)/(2.41/np.sqrt(50))
z_score

In [None]:
sp.stats.norm.cdf(.749) - ((1- sp.stats.norm.cdf(.749))/2.0)

In [None]:
sp.stats.norm.ppf(.975)

In [None]:
1 - sp.stats.norm.cdf((8.3 - 7.47)/(2.41/np.sqrt(50)))

In [None]:
(1 - sp.stats.norm.cdf(2.435))*2

In [None]:
sample_size = 30
mean = 7.47
std = 2.41
intervention = 8.3
se = std/np.sqrt(sample_size)

z_score = (intervention - mean)/ se
z_score

In [None]:
critical_value = sp.stats.norm.ppf(.975)
critical_value

In [None]:
engagement_data.mean()

In [None]:
engagement_data.std(ddof=0)

In [None]:
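# DataFrame.from_csv no longer exists in current pandas; read_csv with these arguments matches its defaults
engagement_data = pd.read_csv('/Users/ryanlambert/Downloads/Engagement and Learning Results - Lesson 9 - Sheet1.csv', index_col=0, parse_dates=True)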

In [None]:
sample = [3, 3, 3, 3, 3, 3, 4, 4, 4, 4]

In [None]:
n = len(sample)
mean = np.mean(sample)
std = np.std(sample)  # standard deviation, not a second mean
se = std/np.sqrt(n)
print(mean, std, se)

In [None]:
sp.stats.norm.ppf(.0005)

In [None]:
sp.stats.norm.cdf((7.13 - 7.5)/(.64/np.sqrt(20)))

In [None]:
critical_value = sp.stats.norm.ppf(.975)
critical_value

In [None]:
z_score = (7.13 - 7.5)/(.64/np.sqrt(20))
z_score

In [None]:
sp.stats.norm.cdf(-z_score)

In [None]:
1 - sp.stats.norm.cdf(z_score)

In [None]:
1 - sp.stats.norm.cdf(2.57)

PS 8b: Additional Practice

In [None]:
(175 - 180)/float(se)

In [None]:
(175 - 180) / se

In [None]:
z_score = (175 - 180) / se

sp.stats.norm.cdf(z_score)

In [None]:
confidence = .99
sample = 9
mean = 175
std = 18
se = std/np.sqrt(sample)
critical_value = sp.stats.norm.ppf(confidence + (1 - confidence)/2)
upper_bound = mean + se * critical_value
lower_bound = mean - se * critical_value
print(lower_bound, upper_bound, critical_value, se * critical_value)

In [None]:
sample = [8, 9, 12, 13, 14, 16]
mean = np.mean(sample)
std = np.std(sample)
se = np.std(sample)/np.sqrt(len(sample))
critical_value_95 = sp.stats.norm.ppf(.95 + (1 - .95)/2)
upper_bound = mean + se * critical_value_95
lower_bound = mean - se * critical_value_95
print(lower_bound, upper_bound)

PS 8a: Estimation

In [None]:
n = 25
mu = 68
std = 10
margin = sp.stats.norm.ppf(.975)

se = std/np.sqrt(n)
mu_new = 75
bound = se * margin

print(mu_new - bound, mu_new + bound)
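The estimation cells all follow the same pattern, so it can be written once as a function. A sketch using `statistics.NormalDist` (Python 3.8+); the function name and the default confidence level are choices made for this example.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Two-tailed z interval: sample_mean +/- z_critical * sigma / sqrt(n)."""
    # Two-tailed lookup: 95% confidence uses the .975 quantile.
    z_critical = NormalDist().inv_cdf(confidence + (1 - confidence) / 2)
    bound = z_critical * sigma / math.sqrt(n)
    return sample_mean - bound, sample_mean + bound

# Values from the cell above: n = 25, std = 10, new mean = 75.
lower, upper = z_confidence_interval(75, 10, 25)
print(round(lower, 2), round(upper, 2))
```

Passing `confidence=0.99` or `confidence=0.98` reproduces the other intervals in these problem sets with the same code.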


Lesson 8: Estimation

In [None]:
sp.stats.norm.cdf(10.06)

In [None]:
sp.stats.norm.cdf(.92)

In [None]:
(8.94 - 7.5)/.14

Engagement Ratio Data 

In [None]:
.73/ np.sqrt(20)

In [None]:
x_bar = .13
se = .107/np.sqrt(20)
bound = se * sp.stats.norm.ppf(.975)
lower_bound, upper_bound = x_bar - bound, x_bar + bound
print(lower_bound, upper_bound)

In [None]:
engagement_data = np.genfromtxt('/Users/ryanlambert/Downloads/Engagement ratio - Lesson 8 - Sheet1.csv')

In [None]:
engagement_data.std()

In [None]:
engagement_data.mean()

Now let's do the same thing as below but for 98% confidence

In [None]:
confidence_interval = .98
upper_bound = mu + se * sp.stats.norm.ppf(.99)
lower_bound = mu - se * sp.stats.norm.ppf(.99)
print(lower_bound, upper_bound)

In [None]:
n = 250
mu = 40
std = 16.04
se = std/np.sqrt(n)

In [None]:
sp.stats.norm.ppf(.99)

In [None]:
sp.stats.norm.ppf(.01)

calculate confidence interval for n = 250, mu = 40, std = 16.04

In [None]:
n = 250
mu = 40
std = 16.04
se = std/np.sqrt(n)

In [None]:
upper_bound = mu + se * 1.96
lower_bound = mu - se * 1.96
print(upper_bound, lower_bound)

In [None]:
print()

What's the lower bound and upper bound?  

In [None]:
bound = 1.96 * 16.04 / np.sqrt(35)
print(40 - bound, 40 + bound)

What are the z-score values that bound 95% of the data?  

In [None]:
sp.stats.norm.ppf(.025) 

In [None]:
sp.stats.norm.ppf(.975)

- I initially used .95 straight from the z lookup.  That gave me a one-sided tail cdf (area under the curve).  

- If the total area of interest under the curve is 95%, then I should do z lookups for .975 and .025, since (1 - .975) + .025 = .05 (the two tails)
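The mistake described in the notes above is easy to check numerically. A sketch using `statistics.NormalDist` (Python 3.8+); `central_area` is a helper name invented here.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def central_area(z):
    """Area under the standard normal curve between -z and +z."""
    return std_normal.cdf(z) - std_normal.cdf(-z)

z_one_sided = std_normal.inv_cdf(0.95)    # leaves 5% in ONE tail
z_two_sided = std_normal.inv_cdf(0.975)   # leaves 2.5% in EACH tail

print(round(z_one_sided, 3), round(central_area(z_one_sided), 3))
print(round(z_two_sided, 3), round(central_area(z_two_sided), 3))
```

Using the .95 lookup for an interval only covers the central 90% of the distribution; the .975 lookup covers the intended 95%.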

confidence interval bounds

In [None]:
40 - (2.71 * 2)

In [None]:
40 + (2.71 * 2)

PS 7: Sampling Distributions

number of facebook friends frequency graph

In [None]:
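# DataFrame.from_csv no longer exists in current pandas; read_csv with these arguments matches its defaults
num_friends = pd.read_csv('/Users/ryanlambert/Downloads/Facebook Friends - Problem Set 7 - Sheet1.csv', index_col=0, parse_dates=True)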

In [None]:
num_friends = num_friends.reset_index()

In [None]:
num_friends.std() / np.sqrt(10)

In [None]:
9 * 25

In [None]:
len(num_friends)

In [None]:
1 - sp.stats.norm.cdf(2.5)

Statistics Lesson 7: "Klout Sampling distributions"

- randomly select 1 klout score, then 5, then 10 

In [None]:
np.random.choice(klout_scores, replace=False, size=1)

In [None]:
np.average(np.random.choice(klout_scores, replace=False, size=5))

In [None]:
np.average(np.random.choice(klout_scores, replace=False, size=10))

- find the probability of randomly selecting a sample (n = 250) with a mean of at least 40

In [None]:
1 - sp.stats.norm.cdf(zscore)

- find zscore for sample mean of 40
- standard_err for n = 250

In [None]:
zscore = (40 - klout_average) / float(standard_err)
zscore

- Find standard error for n = 250

In [None]:
standard_err = population_std / float(np.sqrt(250))
standard_err

- How many standard errors is the klout score of 40 above the mean of the sample means?  

In [None]:
klout_average = np.mean(klout_scores)
klout_average

In [None]:
klout_zscore = (40 - klout_average) / float(standard_err)
klout_zscore

In [None]:
1 - sp.stats.norm.cdf(klout_zscore)

- What will be the standard deviation of the sample means? (standard error)

In [None]:
population_mean = 37.72
population_std = 16.04
sample_size = 35
standard_err = population_std / float(np.sqrt(sample_size))
standard_err
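The Klout steps (standard error, z-score, tail probability) chain together as below. This sketch uses the population mean 37.72 and standard deviation 16.04 quoted in the cell above, with `statistics.NormalDist` (Python 3.8+) standing in for `sp.stats.norm`; the function name is illustrative.

```python
import math
from statistics import NormalDist

def prob_sample_mean_at_least(threshold, pop_mean, pop_std, n):
    """P(sample mean >= threshold) for samples of size n."""
    standard_err = pop_std / math.sqrt(n)        # std of the sampling distribution
    zscore = (threshold - pop_mean) / standard_err
    return 1 - NormalDist().cdf(zscore)          # right-tail area

p_n35 = prob_sample_mean_at_least(40, 37.72, 16.04, 35)
p_n250 = prob_sample_mean_at_least(40, 37.72, 16.04, 250)
print(round(p_n35, 3), round(p_n250, 3))
```

The larger sample has a smaller standard error, so a sample mean of 40 or more becomes much less likely at n = 250 than at n = 35.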

In [None]:
klout_scores = np.genfromtxt('/Users/ryanlambert/Downloads/Klout scores (Lesson 7) - Sheet1.csv', delimiter=',')

In [None]:
np.average(klout_scores)

In [None]:
np.std(klout_scores, ddof=0)

In [None]:
SAT_avg = 1497
SAT_std = 322

In [None]:
2/ -1.08

In [None]:
z_score = lambda x, mean, std: (x - mean)/ float(std)

In [None]:
z_score(95, 80, 10)

In [None]:
z_score(60, 60, 40)

In [None]:
mean = 90
std = 10
sp.stats.norm.ppf(.64) * std + mean
# sp.stats.norm.cdf(z_score(95, mean, std))

from: http://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.norm.html

In [None]:
sp.stats.norm.cdf(-2.33)

In [None]:
sample = [1, 1.5, 2, 2.5, 1.5, 2, 2.5, 3, 2, 2.5, 3, 3.5, 2.5, 3, 3.5, 4]

In [None]:
np.std(sample)

Sample means Standard Deviation "SE"

In [None]:
np.std(sample)

Population

In [None]:
population = [1, 2, 3, 4]

In [None]:
np.std(population)

"What's the ratio of pop_std to SE"

In [None]:
np.std(population)/ float(np.std(sample))
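The ratio above can be reproduced from scratch by enumerating every sample of size two (with replacement) from the population. A standard-library sketch; the variable names are illustrative.

```python
import itertools
import math
import statistics

population = [1, 2, 3, 4]
sample_size = 2

# Every ordered sample of size 2 drawn with replacement (16 in total).
samples = itertools.product(population, repeat=sample_size)
sample_means = [sum(s) / len(s) for s in samples]

pop_std = statistics.pstdev(population)        # population standard deviation
se_observed = statistics.pstdev(sample_means)  # std of the sample means
ratio = pop_std / se_observed

print(round(pop_std, 4), round(se_observed, 4), round(ratio, 4))
```

The ratio comes out as $\sqrt{2}$, the square root of the sample size, which is the relationship $SE = \sigma / \sqrt{n}$.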

"all possible sample combinations of size two from the population"

"What is the standard deviation of the means of those samples?"

"What is the population standard deviation?"

In [None]:
population = [1, 2, 3, 4, 5, 6]

In [None]:
population_array = []

for i in population:
    for j in population:
        population_array.append([i, j])

In [None]:
population_array_means_size_2 = [np.average(i) for i in population_array]

In [None]:
np.mean(population_array_means_size_2)

what is the standard deviation of the sampling distribution with n = 5? 
'standard error'

In [None]:
1.7078 / np.sqrt(5)

What's the standard Error?

In [None]:
3.49/ np.sqrt(5)