## Sampling Distribution Of The Sample Mean
<br>
<b>Notation</b><br>
The whole idea behind <b>inferential statistics</b> is to take a sample from a population, calculate statistics such as the mean and standard deviation, and use those calculations to infer information about the population<br>

* Population size: N
* Sample size: n
* Population Mean: $\mu$
* Sample Mean: $\displaystyle\bar{x}$

We know how to find parameters that describe a population, such as the mean, variance, and standard deviation. But finding those values can be difficult or impossible, because it is not always easy to collect data on every subject in a large population<br>
<br>
So instead of collecting data on the whole population, we choose a subset of the population, called a sample.<br>
<br>

In the same way we find $\color{dodgerblue}\text{parameters}$ for a population, we can find $\color{dodgerblue}\text{statistics}$ for the sample. Based on those statistics, we can infer that a population parameter is likely to be close to its corresponding sample statistic. This is the basis for $\color{dodgerblue}\text{Inferential Statistics}$ <br>
<br>
<b><i>Sampling Distribution Of The Sample Mean</i></b><br>
It is important to consider the size of the sample of a study.<br>
<br>
Suppose a class has 30 girls whose mean height is 65 inches, and we take a sample of 3 girls with sample mean $\large\color{dodgerblue}\bar{x}_1$. If those happen to be the 3 tallest girls, the mean of that sample would not be a good estimate of the population mean. The same is true if we happen to choose the 3 shortest girls. <br>
<br>

<span style = "color:skyblue;font-size:101%">
So how do we adjust for the fact that individual samples might produce sample statistics that are bad estimates of their corresponding population parameters?

<br>
<br>
Well, instead of taking just one sample from the population, try to take every possible sample of 3 girls from the population of girls in our class.<br>
<br>
</span>

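The total number of possible samples is $N^n$, where N is the size of the population from which we take our samples and n is the sample size. This counts ordered samples drawn with replacement, so the same girl may appear more than once in a sample.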
</span>
<br>
<br>

So if we take samples of 3 girls from a population of 30 girls, the total number of possible samples is:<br>

$$N^n = 30^3 = 27,000$$
<br>
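
The count depends on how a sample is defined. As a quick check, the sketch below uses only the Python standard library to count 3-girl samples from a class of 30 under three common conventions. The variable names are illustrative and are not part of this notebook's `resources` package.

```python
from math import comb, perm

N = 30  # population size
n = 3   # sample size

# Ordered draws with replacement: the same girl may be drawn more than once.
ordered_with_replacement = N ** n

# Ordered draws without replacement (permutations).
ordered_without_replacement = perm(N, n)

# Unordered draws without replacement (combinations).
unordered_without_replacement = comb(N, n)

print(ordered_with_replacement)       # 27000
print(ordered_without_replacement)    # 24360
print(unordered_without_replacement)  # 4060
```

Only the first convention gives $N^n$. The 4,060 figure is the number of distinct groups of 3 girls when order is ignored and nobody is picked twice.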

In this scenario we are sampling with replacement: we pick a random sample of three girls, calculate the sample mean, and put them back into the population, then pick another random sample of three girls, calculate the sample mean, and put them back again. We keep doing this until we have taken every possible 3-girl sample, which gives us the sample means $\large\bar{x}_1,~\bar{x}_2,~\bar{x}_3,\dots,\bar{x}_{27000}$. 
<br>
<br>
These 27,000 sample means form a data set of their own. <br>
<br>
Each sample has a mean, and this set of sample means forms its own distribution centered on the real population mean. If we take large enough samples, this distribution of sample means is approximately normal, whatever the shape of the population (this is the central limit theorem). It is called the sampling distribution of the sample mean (SDSM)<br>
<br>
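
To see why sample size matters for the "approximately normal" claim, here is a minimal simulation sketch. It assumes a strongly skewed exponential population (a hypothetical choice, unrelated to the height data) and measures the skewness of the sample means for a small and a large sample size. A normal distribution has skewness 0. The helper names `skewness` and `sample_means` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric distribution."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.std(values) ** 3


def sample_means(n, reps=20_000):
    """Draw `reps` samples of size n from an exponential population
    and return the mean of each sample."""
    return rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)


skew_small = skewness(sample_means(n=2))
skew_large = skewness(sample_means(n=50))

print(f"skewness of sample means, n=2:  {skew_small:.2f}")
print(f"skewness of sample means, n=50: {skew_large:.2f}")
```

The skewness for n=50 should come out much closer to 0 than the skewness for n=2, which is the sense in which larger samples give a more normal-looking sampling distribution.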
<span style = "color:skyblue;font-size:101%">
The most important fact regarding the SDSM is that the mean of the sampling distribution, $\mu_{\bar{x}}$, is always equal to the mean of the population
</span><br>
<Br>
$\mu_{\bar{x}}$ is the mean of the sample means<br>
<Br>

&emsp;&emsp;$\displaystyle\large\mu_{\bar{x}} = \mu$<br>
<br>
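
The equality of the two means can be checked by brute force on a tiny population. The sketch below assumes a hypothetical population of five heights (not the notebook's generated data) and enumerates every 2-person sample with `itertools.combinations`.

```python
from itertools import combinations

import numpy as np

# Hypothetical population of five heights, in inches.
population = [61, 63, 64, 66, 71]
n = 2  # sample size

# Mean of every possible unordered sample of size n (10 samples here).
sample_means = [np.mean(sample) for sample in combinations(population, n)]

mu = np.mean(population)         # population mean
mu_x_bar = np.mean(sample_means)  # mean of the sample means

print(len(sample_means), mu, mu_x_bar)
```

Both means come out equal even though no individual sample mean has to match the population mean.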

In [1]:

""" 
1. Create a population of female students with a mean height of 65 and N = 30
    a. calculate the mean and std of the population 

2. List every combination of 3 students from the population. These are unordered
samples without replacement, so there are C(30, 3) = 4060 of them, not the
27,000 ordered samples with replacement counted in the text above

3. Build the Sampling Distribution of the Sample Mean (SDSM) by taking the mean
of each of the 4060 samples

4. Calculate the mean and standard deviation of the SDSM 
"""

from IPython.display import display, Math
import numpy as np 
import sys

sys.path.insert(0, '..')
import resources.datum as datum
import resources.combinatorics as cb

comb = cb.Combinatorics()


N = 30
mu = 65
sigma = 3

new_datum = datum.Data()

cnt, min_val, max_val, x_bar, std, heights = new_datum.make_discrete_data(N = N, mu=mu, sigma=sigma)

combinations = comb.get_combination_list(lst=list(heights), size=3)


sdsm = [np.mean(x) for x in combinations]

sdsm_mean = round(np.mean(sdsm), 4)
sdsm_sigma = round(np.std(sdsm), 4)

msg = '\\large \\text{Population Statistics:}\\\\'
msg = msg + '\\text{count: }%s\\\\\\text{minimum value: }%s\\\\\\text{maximum value: }%s\\\\\\mu: %s\\\\\\sigma: %s\\\\~\\\\'
# The mean of the sample means is labelled mu_{x-bar}, matching the notation in the text
msg = msg + '\\text{Sampling Distribution Of The Sample Mean Statistics:}\\\\\\mu_{\\bar{x}}: %s\\\\\\displaystyle SE = \\sigma_{\\bar{x}}: %s'


display(Math(msg%(cnt, min_val, max_val, mu, sigma, sdsm_mean, sdsm_sigma)))


The plot below shows that the sampling distribution of the sample mean is approximately normally distributed. 

In [2]:

from bokeh.plotting import figure, show, output_notebook, curdoc

output_notebook(hide_banner=True)
curdoc().theme = 'dark_minimal'


# create a new plot with a title and axis labels
p = figure(title="Sample Means Of Female Student Heights\n", x_axis_label="sample mean height", y_axis_label="density")


# Histogram
bins = np.linspace(np.min(sdsm), np.max(sdsm), 32)
hist, edges = np.histogram(sdsm, density=True, bins=bins)
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
         fill_color="skyblue", line_color="white",
         legend_label=f"{len(sdsm)} sample means")

# show the results
show(p)


<span style = "color:skyblue;font-size:101%">
The variance of the sampling distribution of the sample mean is equal to the population variance divided by the sample size (n)<br>
</span>
<br>

&emsp;&emsp;$\large \sigma^2_{\bar{x}} = \dfrac{\sigma^2}{n}$<br>
<br>
<span style = "color:skyblue;font-size:101%">
This means the standard deviation of the sampling distribution of the sample mean, also called the standard error (SE), is equal to the population standard deviation divided by the square root of n
</span><br>
<br>

$\large\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$<br>
<br>
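
The formula can be verified by enumeration. The sketch below assumes a hypothetical four-value population and samples of size 2. With replacement (all $N^n = 16$ ordered samples), the spread of the sample means matches $\sigma/\sqrt{n}$. Without replacement from such a small population, the spread is smaller by the standard finite population correction factor $\sqrt{(N-n)/(N-1)}$, which the formula in the text does not include.

```python
from itertools import permutations, product

import numpy as np

population = [150, 156, 158, 164]  # hypothetical population
N, n = len(population), 2

sigma = np.std(population)  # population standard deviation (ddof=0)

# Every ordered sample drawn WITH replacement: N**n = 16 samples.
means_with = [np.mean(s) for s in product(population, repeat=n)]
se_with = np.std(means_with)

# Every ordered sample drawn WITHOUT replacement: 4 * 3 = 12 samples.
means_without = [np.mean(s) for s in permutations(population, n)]
se_without = np.std(means_without)

se_formula = sigma / np.sqrt(n)
se_fpc = se_formula * np.sqrt((N - n) / (N - 1))  # finite population correction

print(round(se_with, 4), round(se_formula, 4))    # 3.5355 3.5355
print(round(se_without, 4), round(se_fpc, 4))     # 2.8868 2.8868
```

When the population is much larger than the sample, the correction factor is close to 1 and the two cases give nearly the same standard error.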

Keep in mind that we don't always know the population variance and population standard deviation.<br>
When we do not know the population variance, we estimate the variance of the sampling distribution of the sample mean with the sample variance $\Bigl(\dfrac{s^2}{n}\Bigr)$ instead.
<br>
<br>
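
As a minimal sketch of that estimate, assume a single hypothetical sample of six heights. The sample standard deviation $s$ uses `ddof=1` (dividing by $n-1$), and the estimated standard error is $s/\sqrt{n}$.

```python
import numpy as np

# Hypothetical sample of six heights, in inches.
sample = np.array([63.0, 66.5, 64.0, 68.0, 62.5, 65.0])
n = sample.size

x_bar = sample.mean()    # sample mean
s = sample.std(ddof=1)   # sample standard deviation (divides by n - 1)

var_x_bar_hat = s ** 2 / n  # estimated variance of the sample mean
se_hat = s / np.sqrt(n)     # estimated standard error

print(f"x_bar = {x_bar:.3f}, s = {s:.3f}, estimated SE = {se_hat:.3f}")
```

Note that NumPy's `std` defaults to `ddof=0`, which gives the population formula, so `ddof=1` has to be passed explicitly when working from a sample.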


In [4]:

""" 
1. Define a small population of four values (N = 4)
    a. calculate the mean and std of the population 

2. List every ordered sample of 2 values drawn without replacement
from the population (4 * 3 = 12 samples)

3. Build the Sampling Distribution of the Sample Mean (SDSM) by taking the mean
of each of the 12 samples

4. Calculate the mean and standard deviation of the SDSM 
"""

from IPython.display import display, Math
import numpy as np 
import sys

sys.path.insert(0, '..')
import resources.datum as datum
import resources.combinatorics as cb

comb_ex1 = cb.Combinatorics()

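pop = [150, 156, 158, 164]

# Population statistics, computed from pop itself rather than reused from the first cell
cnt, min_val, max_val = len(pop), min(pop), max(pop)
mu = round(np.mean(pop), 4)
sigma = round(np.std(pop), 4)

# All ordered samples of size 2, drawn without replacement (4 * 3 = 12)
permutations_1 = comb_ex1.get_permutation_list(lst=list(pop), size=2)
print(len(permutations_1))

# Mean of each sample: the sampling distribution of the sample mean
sdsm = [np.mean(x) for x in permutations_1]

sdsm_mean = round(np.mean(sdsm), 4)
sdsm_sigma = round(np.std(sdsm), 4)

msg = '\\large \\text{Population Statistics:}\\\\'
msg = msg + '\\text{count: }%s\\\\\\text{minimum value: }%s\\\\\\text{maximum value: }%s\\\\\\mu: %s\\\\\\sigma: %s\\\\~\\\\'
msg = msg + '\\text{Sampling Distribution Of The Sample Mean Statistics:}\\\\\\mu_{\\bar{x}}: %s\\\\\\displaystyle SE = \\sigma_{\\bar{x}}: %s'


display(Math(msg%(cnt, min_val, max_val, mu, sigma, sdsm_mean, sdsm_sigma)))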



12


