# Confidence Interval and p-value

Statistics builds on several basic concepts:
- Population
- Sample
- Feature/Attribute of Population or Sample
- Measures of Feature/Attribute:
    - Mean, Min, Max
    - Standard Deviation
    - etc

There are several kinds of Confidence Interval:
- Of Mean
- Of Standard Deviation
- Of Correlation Coefficient

To explain what a **confidence interval** and a **p-value** are, let's take the following scenario:
- Size of Population is 100
- Size of Sample is 36

There are more than 1.977 x 10<sup>27</sup> ways to draw a sample of 36 from a population of 100 (without replacement, ignoring order).

\begin{equation*}
{100 \choose 36} = 1.9772046 \times 10^{27}
\end{equation*}
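
As a quick check of this count, here is a minimal sketch using the standard-library function `math.comb` (available in Python 3.8 and later). The variable name `n_samples` is only illustrative.

```python
import math

# Number of distinct unordered samples of 36 drawn from 100 without replacement.
n_samples = math.comb(100, 36)
print(f"{n_samples:.7e}")
```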

In [1]:
import numpy as np
import scipy.stats

POPULATION_SIZE = 100
SAMPLE_SIZE = 36

population = np.random.randn( POPULATION_SIZE )
population_describe = scipy.stats.describe( population )
population_describe

DescribeResult(nobs=100, minmax=(-2.9104809974965637, 2.6702182266351366), mean=0.043335966503553924, variance=1.136982420021777, skewness=-0.1440277254838043, kurtosis=0.1146119415894522)

Now, we will draw a sample of 36 from the population 500 times and count how often the sample mean falls inside the interval, for different confidence levels.

In [2]:
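TEST_COUNT = 500

def TestGetSample( interval, testCount=TEST_COUNT ):
    """Draw testCount samples and count how many sample means fall inside interval."""
    in_interval = 0
    for _ in range( testCount ):
        # Note: np.random.choice samples WITH replacement by default (replace=True).
        sample = np.random.choice( population, size=SAMPLE_SIZE )
        if interval[0] <= np.mean( sample ) <= interval[1]:
            in_interval += 1
    # Return the count and the fraction of sample means inside the interval.
    return in_interval, in_interval/testCount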

print( "----- confidence level=95% -----" )
interval = scipy.stats.t.interval(0.95, len(population)-1, loc=np.mean(population), scale=scipy.stats.sem(population))
print( "interval: ", interval )
for _ in range( 0, 3 ):
    print( "in interval: ", TestGetSample( interval) )
    
print( "----- confidence level=99% -----" )
interval = scipy.stats.t.interval(0.99, len(population)-1, loc=np.mean(population), scale=scipy.stats.sem(population))
print( "interval: ", interval )
for _ in range( 0, 3 ):
    print( "in interval: ", TestGetSample( interval) )
    

----- confidence level=95% -----
interval:  (-0.1682398523648724, 0.2549117853719802)
in interval:  (383, 0.766)
in interval:  (388, 0.776)
in interval:  (367, 0.734)
----- confidence level=99% -----
interval:  (-0.2367160130406633, 0.32338794604777116)
in interval:  (443, 0.886)
in interval:  (447, 0.894)
in interval:  (446, 0.892)


Here are two extreme, very unlikely samples. Sort the population, then take either:
- the 36 smallest values, or
- the 36 largest values

In [3]:
sample = np.sort( population )[:36]
# sample = np.sort( population )[-36:]

print( "len: ", len( sample ) )
print( "mean: ", np.mean( sample ) )
print( "is in interval: ", ( interval[0] <= np.mean(sample) ) and ( np.mean(sample ) <= interval[1] ))

len:  36
mean:  -1.0562170150267027
is in interval:  False


### As we can see:

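- The higher the confidence level, the wider the interval.
- The wider the interval, the more likely it is that the mean of a drawn sample falls inside it.
- The observed fractions (roughly 73% to 78%, and 89%) are below the nominal 95% and 99%. The interval is built from the standard error of all 100 population values, while each sample mean comes from only 36 values (drawn with replacement) and therefore varies more.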
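
The simulated fractions can be compared with a rough theoretical estimate. The sketch below assumes that the sample mean is approximately normal (central limit theorem) and that samples are drawn with replacement, as `np.random.choice` does by default. The helper name `expected_fraction` is hypothetical. Under those assumptions the interval's half-width is $t^{*} s / \sqrt{N}$ while the sample mean has standard deviation $\sigma / \sqrt{n}$, so the expected fraction depends only on $N$, $n$, and the confidence level.

```python
import math
import scipy.stats

def expected_fraction(confidence, population_size=100, sample_size=36):
    # Critical t value that scipy.stats.t.interval uses for this confidence level.
    t_crit = scipy.stats.t.ppf(0.5 + confidence / 2, population_size - 1)
    # Interval half-width in units of the sample mean's standard deviation:
    # (t_crit * s / sqrt(N)) / (sigma / sqrt(n)), with s^2 = sigma^2 * N / (N - 1).
    z = t_crit * math.sqrt(sample_size / (population_size - 1))
    # Probability that an approximately normal sample mean lands within +/- z.
    return 2 * scipy.stats.norm.cdf(z) - 1

for level in (0.95, 0.99):
    print(level, round(expected_fraction(level), 3))
```

Under these assumptions the estimate comes out at roughly 77% for the 95% interval and roughly 89% for the 99% interval, in line with the simulated fractions.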

### p-value:

Previously, we have:
- Set a confidence level ($1 - \alpha$),
- Calculated the confidence interval,
- Generated a sample, and then
- Checked whether the sample mean is inside the interval.

Now, we could go the other way:
- Generate or obtain a sample,
- Find the lowest confidence level ($1 - \alpha$) whose interval still includes the sample mean,
- The corresponding $\alpha$, that is 100% minus that confidence level, is the (two-sided) **p-value** of this sample.
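
The procedure above can be sketched in code. This is a minimal illustration under assumptions that are not in the original: the sample is synthetic (seeded for reproducibility), `mu0` is a hypothetical reference mean, and the interval is a t-interval centered on `mu0` that uses the sample's own standard error. With those choices the result agrees with the two-sided p-value from `scipy.stats.ttest_1samp`.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)                    # seeded synthetic data (assumption)
sample = rng.normal(loc=0.3, scale=1.0, size=36)
mu0 = 0.0                                         # hypothetical reference mean

df = len(sample) - 1
sem = scipy.stats.sem(sample)
t_stat = (np.mean(sample) - mu0) / sem

# Two-sided p-value: probability of a t statistic at least this extreme.
p_value = 2 * scipy.stats.t.sf(abs(t_stat), df)

# At confidence level (1 - p_value), the sample mean sits exactly on the interval's edge.
low, high = scipy.stats.t.interval(1 - p_value, df, loc=mu0, scale=sem)

print("p-value:", p_value)
print("interval at level 1 - p:", (low, high), "sample mean:", np.mean(sample))
print("ttest_1samp p-value:", scipy.stats.ttest_1samp(sample, mu0).pvalue)
```

Any confidence level above `1 - p_value` gives a wider interval that contains the sample mean, and any lower level gives a narrower one that excludes it.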

## References

- https://en.wikipedia.org/wiki/P-value