# Inferential Statistics

## Definitions

### Distribution

The set of possible values a variable can take and how frequently each of them occurs.

### Normal Distribution

https://en.wikipedia.org/wiki/Normal_distribution
- No skew
- Mean = Median = Mode
- $X \sim N(\text{Mean}, \text{Variance})$, i.e. $X \sim N(\mu, \sigma^2)$; the special case $X \sim N(0,1)$ is the standard normal distribution
- $\mu$ = Mean
- $\sigma$ = Standard Deviation
- $\sigma^2$ = Variance

### Standard Normal Distribution

- Taking any Normal Distribution and transforming (standardising) it to $\mu=0, \sigma=1$
- $Z=\frac{x-\mu}{\sigma}$ where $Z$ is the "Z-score"
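
As a minimal sketch of the standardisation formula, the snippet below converts a small data set to Z-scores using only the Python standard library. The data values and the helper name `standardise` are assumptions made for illustration.

```python
from statistics import mean, pstdev


def standardise(values):
    """Return the Z-score of every value: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]


# Hypothetical data, chosen only for illustration (mean 5, standard deviation 2).
data = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardise(data)
print(z_scores)
```

After standardising, the Z-scores have mean 0 and standard deviation 1, whatever the original units were.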

### Sampling Distribution
https://en.wikipedia.org/wiki/Sampling_distribution

$$\overline{X}\sim N(\mu,\frac{\sigma^2}{n})\\
\text{Where }n\text{ is the sample size (the number of observations in each sample)}$$


 In statistics, a **sampling distribution** or finite-sample distribution is the probability distribution of a given random-sample-based statistic. If an arbitrarily large number of samples, each involving multiple observations (data points), were separately used in order to compute one value of a statistic (such as, for example, the sample mean or sample variance) for each sample, then the sampling distribution is the probability distribution of the values that the statistic takes on. In many contexts, only one sample is observed, but the sampling distribution can be found theoretically.

Sampling distributions are important in statistics because they provide a major simplification en route to statistical inference. More specifically, they allow analytical considerations to be based on the probability distribution of a statistic, rather than on the joint probability distribution of all the individual sample values. 

Standard Error: $\frac{\sigma}{\sqrt{n}}$

### Central Limit Theorem

https://en.wikipedia.org/wiki/Central_limit_theorem

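No matter the underlying distribution of the population (provided it has a finite variance), the sampling distribution of the mean approximates a normal distribution as the sample size grows.

The mean of the sampling distribution is equal to the population mean, and its variance is the population variance divided by the sample size.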
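
The theorem can be checked with a small simulation. The sketch below assumes a strongly skewed population (an exponential distribution with mean 1 and standard deviation 1) and draws many samples from it; the seed, sample size and number of samples are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 50                   # observations per sample
num_samples = 2000       # how many samples we draw

# Exponential(1) is strongly skewed; its mean and standard deviation are both 1.
sample_means = [
    mean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

print("mean of sample means:  ", mean(sample_means))   # close to mu = 1
print("spread of sample means:", stdev(sample_means))  # close to sigma / sqrt(n)
print("theoretical std error: ", 1 / n ** 0.5)
```

The sample means cluster around the population mean, and their spread is close to the standard error $\frac{\sigma}{\sqrt{n}}$, even though the population itself is far from normal.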



### Estimators/Estimates

- Any estimator has the characteristics:
    - Bias
    - Efficiency
- We always want the least biased and most efficient estimator
    
#### Point Estimate

https://en.wikipedia.org/wiki/Point_estimation#Point_estimators

- Not very efficient
- Generally prefer Confidence Interval Estimate

The point estimate always lies within the confidence interval; for the symmetric intervals used below it is the midpoint.

# Confidence Interval Estimate

https://en.wikipedia.org/wiki/Confidence_interval

A confidence interval is the range within which we expect the population parameter to lie, estimated from the sample data we have.

- Less confidence = narrower interval
- More confidence = wider interval

*Example:*

Rolling two dice:

- 95% confident of getting a total in (7,10) $\rightarrow$ less confidence = narrower interval
    - 5% confident the result is outside the interval.
        - This probability is referred to as $\alpha$, where $0 \le \alpha \le 1$
        - Common values of $\alpha$ are 0.05, 0.01 and 0.1
- 99% confident of getting a total in (4,12) $\rightarrow$ more confidence = wider interval

Two cases for calculating confidence interval for a population:
1. Population variance is **known**
2. Population variance is **unknown**

### If Population Variance is Known:

*$\rightarrow$ Assumption: the data is normally distributed*

- 30 observations $\rightarrow n=30$
- $\overline{x}$ Sample Mean = \$100200
- $\sigma$ Population Standard Deviation = \$15000
- $\frac{\sigma}{\sqrt{n}}$ Standard Error = \$2739
- If 95% confident then $\alpha = 0.05$
- $Z_{\alpha/2} \rightarrow$ Reliability Factor (critical value)
- $Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \rightarrow$ Margin of Error

$$
\left[\overline{x}-Z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \overline{x}+Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right]
$$
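
A minimal sketch of this interval applied to the salary figures listed above, using the standard library's `statistics.NormalDist` for the critical value. The function name `z_confidence_interval` is an assumption made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population std dev is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{alpha/2}
    margin = z * sigma / sqrt(n)             # margin of error
    return sample_mean - margin, sample_mean + margin


low, high = z_confidence_interval(sample_mean=100200, sigma=15000, n=30)
print(f"95% confidence interval: ({low:.0f}, {high:.0f})")
```

Raising `confidence` to 0.99 increases the critical value and therefore widens the interval.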

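To find $Z_{\alpha/2}$ in a cumulative Z table:
- Find the value closest to $1-\alpha/2$ in the body of the table (0.975 for $\alpha = 0.05$)
- Add the row label and the column label of that cell (for example $1.9 + 0.06 = 1.96$)
- The result is the *Critical Value*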
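
The table lookup can also be done in code. This sketch assumes a two-sided interval and uses the inverse cumulative distribution function of the standard normal; `z_critical` is a hypothetical helper name.

```python
from statistics import NormalDist


def z_critical(alpha):
    """Two-sided critical value Z_{alpha/2} of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f} -> critical value = {z_critical(alpha):.3f}")
```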

### If Population Variance is Unknown:

- Use Student's T distribution
- Allows inference from small samples
- Defined by its degrees of freedom
- Use the T statistic instead of the Z statistic
- $t_{n-1,\alpha/2} \frac{s}{\sqrt{n}} \rightarrow$ Margin of Error

$$t_{n-1} = \frac{\overline{x}-\mu}{s/\sqrt{n}}$$

$n\rightarrow$ the number of observations

$(n-1)\rightarrow$ degrees of freedom

$s\rightarrow$ the sample standard deviation

$$\left[\overline{x}-t_{n-1,\alpha/2} \frac{s}{\sqrt{n}},\ \overline{x}+t_{n-1,\alpha/2} \frac{s}{\sqrt{n}}\right]$$


if the margin of error is large, the Confidence Interval will be wider

if the margin of error is small, the Confidence Interval will be narrower
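
A minimal sketch of the T-based interval. The Python standard library has no T distribution, so the critical value is passed in as looked up from a T table; here 2.262 is the table value for 9 degrees of freedom at 95% confidence. The sample values and the function name are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def t_confidence_interval(sample, t_critical):
    """Confidence interval for a mean when the population variance is unknown."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)                  # sample std dev (n - 1 in the denominator)
    margin = t_critical * s / sqrt(n)  # margin of error
    return x_bar - margin, x_bar + margin


# Hypothetical sample of 10 observations -> 9 degrees of freedom.
sample = [2, 4, 4, 4, 5, 5, 7, 9, 5, 5]
low, high = t_confidence_interval(sample, t_critical=2.262)
print(f"95% confidence interval: ({low:.3f}, {high:.3f})")
```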

---

### Confidence Intervals with 2 means: DEPENDENT Samples
- Subject remains the same.
- Often used in a medical context

e.g. before and after measurements in a drug trial from the same person.

|Before|After|Difference
|--|--|:--|
|value|value|dif
|value|value|dif
|value|value|dif

*see: 4.1.Confidence-intervals.Two-means.Dependent-samples-lesson.xlsx*
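
For dependent samples the interval is built on the per-subject differences, which are then treated as a single sample with unknown variance. The sketch below uses hypothetical before/after values (not the spreadsheet's data) and a critical value of 2.262 taken from a T table for 9 degrees of freedom at 95% confidence.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical before/after measurements for the same 10 subjects.
before = [10, 12, 9, 11, 14, 13, 10, 12, 11, 13]
after = [9, 10, 9, 8, 12, 12, 8, 11, 9, 12]

# Work with the per-subject differences, then treat them as one sample.
differences = [a - b for a, b in zip(after, before)]
n = len(differences)
d_bar = mean(differences)
s_d = stdev(differences)   # sample standard deviation of the differences

t_critical = 2.262         # t_{9, 0.025} from a T table (95% confidence)
margin = t_critical * s_d / sqrt(n)
low, high = d_bar - margin, d_bar + margin
print(f"mean difference = {d_bar}, 95% CI = ({low:.3f}, {high:.3f})")
```

If the whole interval lies below zero, as it does for these made-up numbers, the data suggest the "after" values are lower than the "before" values.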

### Confidence Intervals with 2 means: INDEPENDENT Samples

Cases:
1. Known Population Variance
2. Unknown Population Variance but assumed Equal
3. Unknown Population Variance but assumed Different

#### Case 1: Known Population Variance
Considerations:
1. Population data is normally distributed
2. Population variances are known
3. Sample sizes may be different

*Note:* as a rule of thumb, if the sample size is greater than or equal to 30, the Z-statistic can be used in place of the T-statistic.

e.g. scores obtained by ENG (engineering) and MGMT (management) students

| |ENG|MGMT|Difference|
|--|--|--|--|
|Size|100|70| |
|Mean|58|65| |
|STD|10|5| |

$$
(\overline{x}-\overline{y})\pm Z_{\alpha/2} \sqrt{\frac{\sigma_x^2}{n_x}+\frac{\sigma_y^2}{n_y}}
$$
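
A minimal sketch applying this formula to the ENG and MGMT figures in the table, assuming the listed STD values are the known population standard deviations and a 95% confidence level. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_mean_z_interval(mean_x, mean_y, sigma_x, sigma_y, n_x, n_y, confidence=0.95):
    """CI for the difference of two means, independent samples, known variances."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(sigma_x ** 2 / n_x + sigma_y ** 2 / n_y)
    diff = mean_x - mean_y
    return diff - margin, diff + margin


# x = ENG, y = MGMT
low, high = two_mean_z_interval(58, 65, 10, 5, 100, 70)
print(f"95% CI for (ENG - MGMT): ({low:.2f}, {high:.2f})")
```

Both ends of the interval are negative, so under these assumptions the ENG mean is below the MGMT mean.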

#### Case 2: Unknown Population Variance but assumed Equal

*see: 6.1.Confidence-intervals.Two-means.Independent-samples.xlsx*

Pooled Sample Variance:
$$S_p^2=\frac{(n_x-1)S_x^2 + (n_y-1)S_y^2}{n_x + n_y -2}$$
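
A short sketch of the pooled variance, plus the standard error of the difference in means that is built from it. The input numbers are hypothetical. The matching T critical value would use $n_x + n_y - 2$ degrees of freedom.

```python
from math import sqrt


def pooled_variance(var_x, n_x, var_y, n_y):
    """Weighted average of two sample variances, weights = degrees of freedom."""
    return ((n_x - 1) * var_x + (n_y - 1) * var_y) / (n_x + n_y - 2)


def pooled_standard_error(var_x, n_x, var_y, n_y):
    """Standard error of (x_bar - y_bar) when the variances are assumed equal."""
    sp2 = pooled_variance(var_x, n_x, var_y, n_y)
    return sqrt(sp2 / n_x + sp2 / n_y)


# Hypothetical samples: variances 4 and 9, sizes 10 and 8.
print(pooled_variance(4, 10, 9, 8))
print(pooled_standard_error(4, 10, 9, 8))
```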


#### Case 3: Unknown Population Variance but assumed Different
*Note: Rarely used*
$$
(\overline{x}-\overline{y})\pm t_{\nu,\alpha/2} \sqrt{\frac{S_x^2}{n_x}+\frac{S_y^2}{n_y}}
$$
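
The notes do not say how the degrees of freedom of this T statistic are obtained. A common choice, assumed here, is the Welch-Satterthwaite approximation; the sketch below computes it for hypothetical inputs.

```python
def welch_degrees_of_freedom(var_x, n_x, var_y, n_y):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    a = var_x / n_x
    b = var_y / n_y
    return (a + b) ** 2 / (a ** 2 / (n_x - 1) + b ** 2 / (n_y - 1))


# Hypothetical samples: variances 4 and 9, sizes 10 and 8.
print(welch_degrees_of_freedom(4, 10, 9, 8))
```

The result is usually not a whole number; it is commonly rounded down before looking up the critical value in a T table.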

----
# Hypothesis Testing
A hypothesis is an idea that can be **tested**

1. Formulating a hypothesis
2. Find the right tests for the hypothesis
3. Execute the test
4. Decision making

Example:
Average data science salary is \$113000

|Term|Notation|Data|
|--|:--:|:--|
|Null Hypothesis|$H_0$|$\mu_0 = \$113000$|
|Alternative Hypothesis|$H_1$ or $H_A$|$\mu_1 \ne \$113000$|

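If the sample evidence is consistent with $\mu = \mu_0$, accept (strictly speaking, fail to reject) the null hypothesis; otherwise reject it.

This example is a "2 tailed" test because the alternative hypothesis allows the mean to differ from the hypothesised value in either direction (higher or lower).

If the alternative hypothesis covers only one direction (for example $\mu > \mu_0$), it is termed a "1 tailed" test.

The null hypothesis represents the status quo (the generally accepted norm), whereas the alternative hypothesis represents the claim being put forward. The null hypothesis is the one that is tested. Typically the aim is to reject the null hypothesis, i.e. to show that the data support the alternative hypothesis.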

## Significance Level $\alpha$

The significance level $\alpha$ is the probability of rejecting the null hypothesis ($H_0$) when it is in fact true, i.e. the probability of making this error.

Ideally we reject $H_0$ only when it is false, but at times we will reject an $H_0$ that is actually true.

The most common values of $\alpha$ are 0.05 and 0.01, corresponding to confidence levels of 95\% and 99\%.

**Example:**

The dean of a university claims that the average grade of students is 70\%

- In this case the null hypothesis $H_0$ is: $\mu_0=70\%$
- The alternative hypothesis $H_1$ (or $H_A$) would be: $\mu_1 \ne 70\%$

$$\begin{align}
Z&=\frac{\overline{x}-\mu_0}{\frac{S}{\sqrt{n}}}
\end{align}$$

Where:
- $Z$ is the Z-score
- $\overline{x}$ is the Sample Mean
- $\mu_0$ is the Hypothesised Mean
- $\frac{S}{\sqrt{n}}$ is the Standard Error

If $Z=0$, the sample mean equals the hypothesised mean and $H_0$ (the null hypothesis) is accepted.

$Z$ does not have to equal 0 exactly; it only has to fall within an acceptable range.
This range is determined by $\alpha$: for a two-tailed test, $\alpha/2$ is placed in each tail, which gives the critical value $z = z_{\alpha/2}$. If $Z < -z$ or $Z > z$, it is said to be in the rejection region and the null hypothesis is rejected. If $-z \le Z \le z$, it is accepted.

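Errors:
- Type 1 Error (False Positive)
    - Reject a true $H_0$
        - Its probability is $\alpha$, the significance level
        - The responsibility of whoever designs the test, since they choose $\alpha$
- Type 2 Error (False Negative)
    - Accept a false $H_0$
        - Its probability is $\beta$
        - Depends mainly on the sample size and the population variance $\sigma^2$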
        
$(1-\beta)$:
- Probability of rejecting a false $H_0$
- Power of the test

---
**Confusion Matrix**

![confusion%20matrix.png](attachment:confusion%20matrix.png)

----

$$\text{Recall}=\frac{TP}{TP+FN}\\
\text{Precision}=\frac{TP}{TP+FP}\\
\text{F1 Score}=\frac{2\times\text{Recall}\times\text{Precision}}{\text{Recall}+\text{Precision}}$$
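
A minimal sketch of these three metrics computed from confusion-matrix counts. The counts are hypothetical, and the third metric is the harmonic mean of recall and precision, usually called the F1 score.

```python
def classification_metrics(tp, fp, fn):
    """Return (recall, precision, f1) from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1


# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
recall, precision, f1 = classification_metrics(tp=40, fp=10, fn=20)
print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
```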

# Testing

## Single Population

### Testing Single Mean with Known Variance

- Sample Size = 30
- Sample Mean = \$100200
- STD = \$15000
- STD Error = \$2739

$H_0\rightarrow \mu_0 = \$ 113000 $

$H_1\rightarrow \mu_1 \ne \$ 113000 $

$Z \rightarrow$ Z-score, standardised variable associated with the test

$z \rightarrow$ critical value, looked up from the Z table

Stage 1:
$Z\sim N(\overline{x}-\mu_0,1)$

Final Stage:
$z\sim N(0,1)$

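In this case:

$Z = \frac{100200 - 113000}{2739} \approx -4.67$, so $|Z| = 4.67$, and $z=1.96$ (for $\alpha = 0.05$)

If $|Z|>z$ then reject the null hypothesis. Here $4.67 > 1.96$, so $H_0$ is rejected.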
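
The same test as a short sketch, assuming a two-tailed test at $\alpha = 0.05$ and the figures listed at the start of this section. The computed statistic is negative because the sample mean is below the hypothesised mean; the decision compares its absolute value with the critical value. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(sample_mean, mu_0, sigma, n, alpha=0.05):
    """Return (Z statistic, critical value, reject H0?) for a two-tailed Z test."""
    z_stat = (sample_mean - mu_0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z_stat, z_crit, abs(z_stat) > z_crit


z_stat, z_crit, reject = two_tailed_z_test(100200, 113000, 15000, 30)
print(f"Z = {z_stat:.2f}, critical value = {z_crit:.2f}, reject H0: {reject}")
```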

---
**p-value** - the smallest significance level at which we can still reject the null hypothesis, given the observed sample statistic.

If p-value < $\alpha \rightarrow$ reject the null hypothesis
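
A minimal sketch of the p-value for a two-tailed Z test: the probability, under $H_0$, of a Z-score at least as extreme as the one observed. The helper name is an assumption, and the input is the Z-score of the salary example.

```python
from statistics import NormalDist


def two_tailed_p_value(z_stat):
    """Probability of a Z-score at least as extreme as z_stat, in either tail."""
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))


p_value = two_tailed_p_value(-4.67)
alpha = 0.05
print(f"p-value = {p_value:.2e}, reject H0: {p_value < alpha}")
```

A Z-score exactly at the critical value of 1.96 gives a p-value of about 0.05, which is why comparing the p-value with $\alpha$ leads to the same decision as comparing $|Z|$ with $z$.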

---