<a href="https://colab.research.google.com/github/Daniel-Benson-Poe/365DataScience/blob/main/Confidence_Intervals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What Is a Confidence Interval

A confidence interval is a range of values around the point estimate. It is a more accurate representation of reality than a single point estimate.

The point estimate is at the midpoint of the confidence interval.

A 95% confidence interval is built so that it captures the population parameter 95% of the time.

Of course, that means there is still a 5% chance that the population parameter actually lies outside the range.

## Confidence Level

Confidence level = 1 - alpha

Alpha is a value between 0 and 1.

For a 95% confidence that the population parameter is within the interval, alpha would have to be equal to 0.05 (5%).

If we want a 99% confidence that the population parameter is within the interval, alpha would have to be equal to 0.01 (1%).


### Formula for all confidence levels:

[point estimate - reliability factor * standard error, point estimate + reliability factor * standard error]

## Reliability factor

Let's look at this through an example:

Data Scientist Salary

|Dataset|
|-|
|117,313|
|104,002|
|113,038|
|101,936|
|84,560|
|113,136|
|80,740|
|100,536|
|105,052|
|87,201|
|91,986|
|94,868|
|90,745|
|102,848|
|85,927|
|112,276|
|108,637|
|96,818|
|92,307|
|114,564|

|Statistic|Value|
|-|-|
|Sample Mean|100,200|
|Population STD|15,000|
|Standard Error|2,739|



Common confidence levels = 90%, 95%, 99%

Common alphas = 0.1, 0.05, 0.01

The z-score, or z of alpha / 2, comes from the standard normal distribution table, or z-table.

A common term for this z is 'critical value'.

Using the table, we first calculate alpha / 2.

We know that alpha = 0.05, so

alpha / 2 = 0.05 / 2 = 0.025

This corresponds to the table value 1 - 0.025 = 0.975

Finding 0.975 in the table gives us the column header 0.06 and the row header 1.9. Adding these two numbers gives us z of 0.025:

z of 0.025 = 1.9 + 0.06 = 1.96

Thus, we have found the critical value for this confidence interval.
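The table lookup can also be done in code. This is a minimal sketch using only Python's standard library; the helper name `z_critical` is made up for illustration. It relies on the fact that looking up 1 - alpha / 2 in the z-table is the same as evaluating the inverse CDF of the standard normal distribution at that probability.

```python
from statistics import NormalDist


def z_critical(confidence_level: float) -> float:
    """Two-sided critical value z of alpha / 2 for the standard normal distribution."""
    alpha = 1 - confidence_level
    # Inverse CDF at 1 - alpha / 2 replaces the manual z-table lookup.
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> z = {z_critical(level):.2f}")
```

Rounded to two decimals, the 95% value agrees with the 1.96 found in the table.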

Now we can plug everything into the formula:

[100200 - 1.96*(15000/sqrt(30)), 100200 + 1.96*(15000/sqrt(30))] = [94,833, 105,568]

So, we are 95% confident that the average data scientist salary will be in the interval [\$94,833, \$105,568]

Now say we want to be 99% confident.

Confidence level: 99%

alpha = 0.01

alpha/2 = 0.01/2 = 0.005

Look at the table for the value 1 - 0.005 = 0.995

We can't find this exact value in the table, so we round to the nearest value, which is 0.9951.

So we get the row header 2.5 and the column header 0.08.

z of 0.005 = 2.5 + 0.08 = 2.58

Plugging these into the equation, we get:

[100200 - 2.58*(15000/sqrt(30)), 100200 + 2.58*(15000/sqrt(30))] = [93,135, 107,265]

Comparing the two, we can see that:

The 95% confidence interval is narrower, but we are only 95% confident

The 99% confidence interval is wider, but we have higher confidence
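The trade-off between width and confidence can be checked with a short sketch. It uses the summary statistics from the table and the critical values read from the z-table; `n = 30` is an assumption taken from the sqrt(30) in the formula, and the function name `z_interval` is illustrative only.

```python
import math

sample_mean = 100_200
population_std = 15_000
n = 30  # assumed from the sqrt(30) used in the worked formula

standard_error = population_std / math.sqrt(n)

# Critical values as read from the z-table in the text.
z_table = {"95%": 1.96, "99%": 2.58}


def z_interval(z: float) -> tuple[float, float]:
    """Interval: point estimate -/+ reliability factor * standard error."""
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin


ci_95 = z_interval(z_table["95%"])
ci_99 = z_interval(z_table["99%"])

for label, (low, high) in (("95%", ci_95), ("99%", ci_99)):
    print(f"{label}: [{low:,.0f}, {high:,.0f}], width = {high - low:,.0f}")
```

Small differences from the hand-computed bounds come from rounding the standard error and the bounds themselves.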

# Student's T Distribution

Allows for:

* Inference through small samples

* Unknown population variance

* Huge real-life applications

Looks very similar to the normal distribution but has fatter tails and a lower peak.

The fatter tails allow for higher dispersion of values and more uncertainty.

Related to the Student's t distribution is the t statistic:

$t_{n-1,\,\alpha} = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$

That is, t with n-1 degrees of freedom and a significance level of alpha = (sample mean - population mean)/(standard error of the sample)

Degrees of freedom:

for a sample of n, we have n-1 degrees of freedom

for a sample of 20, we have 20-1 = 19 degrees of freedom

Like the z table, we also have a t table

# Examples

## Population variance unknown, t-score

Data Scientist Salary

|Dataset|
|-|
|78,000|
|90,000|
|75,000|
|117,000|
|105,000|
|96,000|
|89,500|
|102,300|
|80,000|

|Statistic|Value|
|-|-|
|Sample Mean|92,533|
|Sample Standard Deviation|13,932|
|Standard Error|4,644|

We are missing the population variance!

That's ok, we can use the Student's t distribution instead.

If the population variance is known, we use the population standard deviation with the z statistic.

If the population variance is not known, we use the sample standard deviation with the t statistic.

We find the t statistic from the t table.

First we figure out the degrees of freedom:

we have 9 observations, so n-1 = 9-1 = 8 degrees of freedom

Next we have to find alpha divided by 2.

This depends on the confidence level we want to obtain. In this example, we'll go with 95%

So, 95% CI => alpha = 5% = 0.05

0.05 / 2 = 0.025

So, for the first part of our equation:

t(8, 0.025) = 2.31 (the associated t statistic in the t table)

Now we just need to plug in all the numbers:

sample mean +/- t statistic * standard error:

92,533 +/- 2.31 * 4,644 = 92,533 +/- 10,728

Confidence interval (95%, unknown variance):

(\$81,806, \$103,261)
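The whole calculation can be sketched from the raw data with the standard library. Two assumptions are labeled in the code: the third salary is taken as 75,000, the value consistent with the reported sample mean of 92,533, and the t value 2.31 is copied from the t table rather than computed.

```python
import math
from statistics import mean, stdev

# Assumption: third value is 75,000, which reproduces the reported mean of 92,533.
salaries = [78_000, 90_000, 75_000, 117_000, 105_000, 96_000, 89_500, 102_300, 80_000]

n = len(salaries)
sample_mean = mean(salaries)
sample_std = stdev(salaries)          # sample standard deviation (n - 1 in the denominator)
standard_error = sample_std / math.sqrt(n)

degrees_of_freedom = n - 1            # 9 - 1 = 8
t_value = 2.31                        # from the t table: 8 degrees of freedom, alpha / 2 = 0.025

margin = t_value * standard_error
ci = (sample_mean - margin, sample_mean + margin)

print(f"mean = {sample_mean:,.0f}, std = {sample_std:,.0f}, SE = {standard_error:,.0f}")
print(f"95% CI = ({ci[0]:,.0f}, {ci[1]:,.0f})")
```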

Compare this with the result we got from having a known population variance and the z statistic:

(\$94,833, \$105,568)

In that example, knowing the population variance (together with a larger sample) gave us a narrower confidence interval

# Margin of Error

When the population variance is known, we use the formula:

sample mean +/- z statistic * standard error

The margin of error is also equal to the second half of this equation: z statistic * standard error

When the population variance is not known, the margin of error is the second half of the t statistic equation:

t statistic * standard error
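As a small sketch, the margin of error is the same product in both cases; only the critical value changes. The function name `margin_of_error` is illustrative, and the inputs are the rounded critical values and standard errors from the two salary examples.

```python
def margin_of_error(critical_value: float, standard_error: float) -> float:
    """Margin of error = reliability factor (z or t) * standard error."""
    return critical_value * standard_error


# Known population variance: z statistic with the first salary example.
z_margin = margin_of_error(1.96, 2_739)

# Unknown population variance: t statistic with the second salary example.
t_margin = margin_of_error(2.31, 4_644)

print(f"z-based margin of error: {z_margin:,.2f}")
print(f"t-based margin of error: {t_margin:,.2f}")
```

The confidence interval is then simply the sample mean minus and plus this margin.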

# Example 1: Dependent Samples

Dependent samples usually come from two kinds of situations:

* Before and after situation

* Cause and effect

Very often used in medicine.

Magnesium Levels

|Patient|Before|After|Difference|
|-|-|-|-|
1|2.00|1.70|-0.30
2|1.40|1.70|0.30
3|1.30|1.80|0.50
4|1.10|1.30|0.20
5|1.80|1.70|-0.10
6|1.60|1.50|-0.10
7|1.50|1.60|0.10
8|0.70|1.70|1.00
9|0.90|1.70|0.80
10|1.50|2.40|0.90

Mean: 0.33

St. deviation: 0.45

Assuming a 95% confidence:

95% t-stat: 2.26

Plugging all these values into the confidence interval equation for t statistics:

CI = 0.33 +/- 2.26*(0.45/sqrt(10)) = (0.01, 0.65)

What can we determine from these results?

1. We are 95% confident that the true mean difference falls in this interval

1. The whole interval is positive

1. The levels of Mg in the test subjects' blood are higher after the treatment

=> based on our small sample, the pill is effective
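The dependent-samples calculation can be sketched from the table values. The t value 2.26 is copied from the text (9 degrees of freedom, 95% confidence) rather than computed, and the variable names are illustrative.

```python
import math
from statistics import mean, stdev

before = [2.00, 1.40, 1.30, 1.10, 1.80, 1.60, 1.50, 0.70, 0.90, 1.50]
after = [1.70, 1.70, 1.80, 1.30, 1.70, 1.50, 1.60, 1.70, 1.70, 2.40]

# With dependent samples we work on the per-patient differences.
differences = [a - b for a, b in zip(after, before)]

mean_diff = mean(differences)
std_diff = stdev(differences)   # sample standard deviation of the differences
t_value = 2.26                  # from the t table: 9 degrees of freedom, alpha / 2 = 0.025

margin = t_value * std_diff / math.sqrt(len(differences))
ci = (mean_diff - margin, mean_diff + margin)

print(f"mean difference = {mean_diff:.2f}, std = {std_diff:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Because the code keeps the unrounded standard deviation, the lower bound can differ slightly from the hand calculation, but it stays above zero.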

# Example 2: Independent Samples

There are three possible cases:

* Population variance known

* Population variance unknown but assumed to be equal

* Population variance unknown but assumed to be different

## Known population variance

University example

| |Engineering|Management|Difference|
|-|-|-|-|
|Size|100|70|?|
|Sample mean|58|65|-7.00|
|Population std|10|5|1.16|

95% z-stat: 1.96

Considerations:

1. The populations are normally distributed

1. The population variances are known

1. The sample sizes are different

1. Different departments

1. Different teachers

1. Different grades

1. Different exams

The two samples are independent from each other: the Engineering grades have no impact at all on the Management grades, so we know we are dealing with independent samples.

Problem: We want to find a 95% confidence interval for the difference between the grades of the students from engineering and management.

Considerations:

1. Samples are big

1. Population variances are known

1. Populations are assumed to follow the normal distribution

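CI = (mean of x - mean of y) +/- z statistic * sqrt(population variance of x / sample size of x + population variance of y / sample size of y), where x is Engineering and y is Management:

$(\bar{x} - \bar{y}) \pm z_{\alpha/2} \sqrt{\dfrac{\sigma_x^2}{n_x} + \dfrac{\sigma_y^2}{n_y}}$

The square root term is the standard error of the difference, sqrt(10$^2$/100 + 5$^2$/70) = 1.16, which is the value shown in the Difference column of the table.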

We got a result of:

(-9.28, -4.72)

Takeaways:

1. We are 95% confident that the true mean difference between engineering and management grades falls into this interval

1. The whole interval is negative => engineers were, on average, getting lower grades

1. Had we calculated the difference as 'management - engineering', we would get the confidence interval (4.72, 9.28), telling us that management students were getting better grades
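The interval and its mirrored version can be reproduced with a short sketch. The function name `difference_ci` is illustrative; the inputs are the table values and the 95% z statistic of 1.96.

```python
import math

# Summary statistics from the university example.
n_eng, mean_eng, std_eng = 100, 58, 10
n_mgmt, mean_mgmt, std_mgmt = 70, 65, 5
z_value = 1.96  # 95% z statistic


def difference_ci(mean_x, std_x, n_x, mean_y, std_y, n_y, z):
    """CI for mean_x - mean_y with known population standard deviations."""
    standard_error = math.sqrt(std_x**2 / n_x + std_y**2 / n_y)
    margin = z * standard_error
    diff = mean_x - mean_y
    return diff - margin, diff + margin


eng_minus_mgmt = difference_ci(mean_eng, std_eng, n_eng, mean_mgmt, std_mgmt, n_mgmt, z_value)
mgmt_minus_eng = difference_ci(mean_mgmt, std_mgmt, n_mgmt, mean_eng, std_eng, n_eng, z_value)

print(f"engineering - management: ({eng_minus_mgmt[0]:.2f}, {eng_minus_mgmt[1]:.2f})")
print(f"management - engineering: ({mgmt_minus_eng[0]:.2f}, {mgmt_minus_eng[1]:.2f})")
```

Swapping the order of subtraction only flips the signs of the bounds; the width of the interval is unchanged.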

## Population Variance unknown but assumed to be equal

Problem: Estimate the difference of price of apples in NY and LA

You don't know what the population variance of apple prices in NY or LA is, but you assume it should be the same.

|NY apples|LA apples|
|-|-|
3.80|3.02
3.76|3.22
3.87|3.24
3.99|3.02
4.02|3.06
4.25|3.15
4.13|3.81
3.98|3.44
3.99|-
3.62|-

|-|NY|LA|
|-|-|-|
Mean|3.94|3.25
Std. deviation|0.18|0.27
Sample size|10|8

In this case the unbiased estimator of the common variance is called the pooled variance:

$s_p^2 = \dfrac{(n_x - 1) s_x^2 + (n_y - 1) s_y^2}{n_x + n_y - 2}$

where n is the sample size and s is the sample standard deviation of each sample.

Given the information above, we can plug in:

$s_p^2 = \dfrac{(10 - 1) \cdot 0.18^2 + (8 - 1) \cdot 0.27^2}{10 + 8 - 2} = 0.05$

So pooled variance = 0.05

Pooled std = 0.22
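As a sketch, the pooled variance can be computed directly from the two price lists. The function name `pooled_variance` is illustrative; `statistics.variance` returns the sample variance (n - 1 in the denominator), which is what the formula expects.

```python
import math
from statistics import variance

ny = [3.80, 3.76, 3.87, 3.99, 4.02, 4.25, 4.13, 3.98, 3.99, 3.62]
la = [3.02, 3.22, 3.24, 3.02, 3.06, 3.15, 3.81, 3.44]


def pooled_variance(x, y):
    """Weighted average of the two sample variances, weighted by degrees of freedom."""
    n_x, n_y = len(x), len(y)
    return ((n_x - 1) * variance(x) + (n_y - 1) * variance(y)) / (n_x + n_y - 2)


pooled_var = pooled_variance(ny, la)
pooled_std = math.sqrt(pooled_var)

print(f"pooled variance = {pooled_var:.2f}, pooled std = {pooled_std:.2f}")
```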

1. Population variance unknown

1. Small samples

So we are using the t statistic

The confidence interval equation using the t statistic:

(mean of x - mean of y) +/- t statistic * sqrt(pooled variance / sample size x + pooled variance / sample size y)

where the degrees of freedom used to find the t statistic are equal to the total sample size minus the number of samples, or, in this case:

nx + ny - 2 = 10 + 8 - 2 = 16

Our 95% t-stat = 2.12

So our confidence interval is:

(3.94 - 3.25) +/- 2.12 * sqrt(0.05/10 + 0.05/8)

CI = (0.47, 0.92)

What can we infer?

We are 95% confident that the average difference in apple prices between NY and LA is somewhere between 0.47 and 0.92.

The whole interval is positive, so apples in NY are more expensive on average than in LA
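The apple-price interval can be reproduced from the rounded summary values. This is only a sketch: the pooled variance of 0.05 and the t value of 2.12 (16 degrees of freedom) are copied from the text, not computed.

```python
import math

mean_ny, mean_la = 3.94, 3.25
n_ny, n_la = 10, 8
pooled_var = 0.05   # rounded pooled variance from the text
t_value = 2.12      # from the t table: 16 degrees of freedom, alpha / 2 = 0.025

# Standard error of the difference when both samples share one pooled variance.
standard_error = math.sqrt(pooled_var / n_ny + pooled_var / n_la)
margin = t_value * standard_error

diff = mean_ny - mean_la
ci = (diff - margin, diff + margin)

print(f"difference = {diff:.2f}, margin of error = {margin:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```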

## Population variances unknown and assumed to be different

Comparing apples and oranges

The confidence interval formula:

$(\bar{x} - \bar{y}) \pm t_{\nu,\,\alpha/2} \sqrt{\dfrac{s_x^2}{n_x} + \dfrac{s_y^2}{n_y}}$

where s is the sample standard deviation and n is the sample size of each sample, and the t statistic is looked up using degrees of freedom estimated as follows:

$\nu = \dfrac{\left(\dfrac{s_x^2}{n_x} + \dfrac{s_y^2}{n_y}\right)^2}{\dfrac{(s_x^2 / n_x)^2}{n_x - 1} + \dfrac{(s_y^2 / n_y)^2}{n_y - 1}}$
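The degrees-of-freedom estimate for unequal variances (the Welch-Satterthwaite approximation) is easier to follow in code. This sketch uses a made-up function name, and since this section gives no data of its own, the apple-price summary statistics are reused purely as illustrative inputs.

```python
def welch_degrees_of_freedom(std_x: float, n_x: int, std_y: float, n_y: int) -> float:
    """Approximate degrees of freedom when the two population variances differ."""
    # Estimated variance of each sample mean: s^2 / n.
    var_mean_x = std_x**2 / n_x
    var_mean_y = std_y**2 / n_y
    numerator = (var_mean_x + var_mean_y) ** 2
    denominator = var_mean_x**2 / (n_x - 1) + var_mean_y**2 / (n_y - 1)
    return numerator / denominator


# Illustrative inputs only (assumption): std and sample size of the NY and LA apple samples.
df = welch_degrees_of_freedom(0.18, 10, 0.27, 8)
print(f"approximate degrees of freedom: {df:.2f}")
```

The result is usually not a whole number; when using a printed t table it is common to round down to the nearest integer to stay on the conservative side.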