# Sampling Statistics

## SWBAT
* State the mean and variance of the sampling distribution of the mean
* Compute the standard error of the mean
* State the central limit theorem


## Introduction

A sampling distribution can be thought of as a relative frequency distribution built from a very large number of samples. As the number of samples increases, the relative frequency distribution of a statistic approaches its sampling distribution.
> When the variable is discrete, the heights of the distribution are probabilities; when it is continuous, they are probability densities.

To learn the population mean, we don't measure the whole population. Instead, we take a random sample and use the sample mean ( $\bar{x}$ ) to estimate the population mean ( μ ). The sample mean depends on which values happen to be sampled, whereas the population mean is fixed. Using the sample mean to estimate the population mean therefore introduces sampling error, whose typical size is measured by the standard deviation of the sampling statistic (its standard error). Let's explore these concepts through an example.

>**Pumpkin Weights**
>The population is the weight of six pumpkins (in pounds) displayed in a carnival "guess the weight" game booth. You are asked to guess the average weight of the six pumpkins by taking a random sample without replacement from the population.

| Pumpkin | Weight (in pounds) |
|---------|--------------------|
| A       | 19                 |
| B       | 14                 |
| C       | 15                 |
| D       | 9                  |
| E       | 10                 |
| F       | 17                 |

### Step 1
Let's calculate the population mean first:

$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

where $x_i$ are the individual weights and $N$ is the number of pumpkins in the population.

In [None]:
from collections import Counter

In [17]:

# Create two lists with pumpkin names and weights
pumpkin = ['A', 'B', 'C', 'D', 'E', 'F']
weights = [19, 14, 15, 9, 10, 17]

# Combine both lists into a dictionary mapping pumpkin name -> weight
pumpkin_dict = dict(zip(pumpkin, weights))

pumpkin_dict


{'A': 19, 'B': 14, 'C': 15, 'D': 9, 'E': 10, 'F': 17}

In [18]:
def calculate_mu(x):
    """Return the mean of the values stored in dictionary x."""
    return sum(x.values()) / len(x)

calculate_mu(pumpkin_dict)

14.0

### Step 2
Obtain the sampling distribution of the sample mean for a sample size of 2 when one samples without replacement.

In [160]:
import itertools

n = 2 # Sample size
# All possible samples of size n drawn without replacement
x = list(itertools.combinations(pumpkin_dict, n))
print(x)
len(x)

[('A', 'B'), ('A', 'C'), ('A', 'D'), ('A', 'E'), ('A', 'F'), ('B', 'C'), ('B', 'D'), ('B', 'E'), ('B', 'F'), ('C', 'D'), ('C', 'E'), ('C', 'F'), ('D', 'E'), ('D', 'F'), ('E', 'F')]


15

In [161]:

# Mean weight of each sample
y_hat_list = []
for sample in x:
    total = 0  # named total so the built-in sum() is not shadowed
    for key in sample:
        total += pumpkin_dict[key]
    y_hat_list.append(total / n)

print(y_hat_list)

# Probability of each sample mean: its frequency over the number of samples
freq = Counter(y_hat_list)
prob = [str(freq[y_hat]) + "/" + str(len(x)) for y_hat in y_hat_list]

print(len(y_hat_list), len(prob))
ttt = list(zip(x, y_hat_list, prob))
ttt


[16.5, 17.0, 14.0, 14.5, 18.0, 14.5, 11.5, 12.0, 15.5, 12.0, 12.5, 16.0, 9.5, 13.0, 13.5]
15 15


[(('A', 'B'), 16.5, '1/15'),
 (('A', 'C'), 17.0, '1/15'),
 (('A', 'D'), 14.0, '1/15'),
 (('A', 'E'), 14.5, '2/15'),
 (('A', 'F'), 18.0, '1/15'),
 (('B', 'C'), 14.5, '2/15'),
 (('B', 'D'), 11.5, '1/15'),
 (('B', 'E'), 12.0, '2/15'),
 (('B', 'F'), 15.5, '1/15'),
 (('C', 'D'), 12.0, '2/15'),
 (('C', 'E'), 12.5, '1/15'),
 (('C', 'F'), 16.0, '1/15'),
 (('D', 'E'), 9.5, '1/15'),
 (('D', 'F'), 13.0, '1/15'),
 (('E', 'F'), 13.5, '1/15')]

We can see that the chance that the sample mean is exactly the population mean (i.e. 14) is only 1 in 15, which is very small. For some populations and sample sizes, the sample mean can never equal the population mean. Whenever we use the sample mean to estimate the population mean, some error is involved, because the sample mean is a random quantity.
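As a cross-check on the 1-in-15 figure, the following minimal sketch tabulates the exact sampling distribution of the mean for samples of size 2. It is self-contained and uses exact fractions instead of floats; the names `pair_means`, `distribution`, and `p_exact` are illustrative and do not appear in the notebook cells above.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Pumpkin weights from the table (population of size 6)
weights = [19, 14, 15, 9, 10, 17]
population_mean = Fraction(sum(weights), len(weights))

# Mean of every possible pair drawn without replacement
pair_means = [Fraction(a + b, 2) for a, b in combinations(weights, 2)]

# Probability of each distinct sample mean = frequency / number of samples
counts = Counter(pair_means)
distribution = {
    mean: Fraction(count, len(pair_means))
    for mean, count in sorted(counts.items())
}

for mean, probability in distribution.items():
    print(float(mean), probability)

# Probability that the sample mean equals the population mean exactly
p_exact = distribution.get(population_mean, Fraction(0))
print("P(sample mean == population mean) =", p_exact)
```

Only the pair (A, D) averages exactly 14, which is where the 1/15 comes from.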

The mean of the sample means when the sample size is 2:

In [162]:
import numpy as np

np.mean(y_hat_list)

14.0

Thus, even though each sample may give you an answer involving some error, the expected value is right at the target: exactly the population mean. In other words, if one does the experiment over and over again, the overall average of the sample mean is exactly the population mean.
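The "over and over again" idea can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the original notebook: it repeatedly draws random samples of size 2 without replacement and averages the resulting sample means. The helper name `average_of_sample_means` and the fixed seed are arbitrary choices made for illustration.

```python
import random

# Pumpkin weights from the table; the population mean is 14
weights = [19, 14, 15, 9, 10, 17]
population_mean = sum(weights) / len(weights)

random.seed(0)  # arbitrary seed so the run is repeatable


def average_of_sample_means(population, n, repetitions):
    """Draw `repetitions` random samples of size n (without replacement)
    and return the average of their sample means."""
    means = [
        sum(random.sample(population, n)) / n
        for _ in range(repetitions)
    ]
    return sum(means) / len(means)


# The average should settle closer to the population mean as repetitions grow
for repetitions in (10, 1_000, 100_000):
    print(repetitions, average_of_sample_means(weights, 2, repetitions))
```

With few repetitions the average can be noticeably off; with many it should sit very close to 14, though any single run only approximates the exact expected value.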

Now let's obtain the sampling distribution for the sample mean when the sample size is 5.

In [None]:
# The mean of the sample means is again 14, the same as for n = 2

Again, we see that using the sample mean to estimate the population mean involves sampling error. However, the error with a sample of size 5 is, on average, smaller than with a sample of size 2.
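One way to make "smaller on average" concrete is to compute the standard error of the mean, i.e. the standard deviation of the sample means over all possible samples, for both sample sizes. The sketch below does this by enumeration and compares the result with the standard finite-population formula for sampling without replacement, $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$. The function names are hypothetical helpers introduced for this example.

```python
import itertools
import math
import statistics

# Pumpkin weights from the table (population of size N = 6)
weights = [19, 14, 15, 9, 10, 17]


def sample_means(population, n):
    """Mean of every possible sample of size n drawn without replacement."""
    return [sum(combo) / n for combo in itertools.combinations(population, n)]


def empirical_se(population, n):
    """Standard deviation of the sample means over all possible samples."""
    return statistics.pstdev(sample_means(population, n))


def formula_se(population, n):
    """sigma / sqrt(n), scaled by the finite population correction."""
    N = len(population)
    sigma = statistics.pstdev(population)  # population standard deviation
    return sigma / math.sqrt(n) * math.sqrt((N - n) / (N - 1))


for n in (2, 5):
    print(n, round(empirical_se(weights, n), 3), round(formula_se(weights, n), 3))
```

For this population the standard error comes out at roughly 2.25 pounds for samples of size 2 and roughly 0.71 pounds for samples of size 5, and the enumerated values agree with the formula.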

**CONTd**