# Descriptive Statistics
Identify significant elements in a dataset.

1. Univariate
2. Bivariate
3. Multivariate

# Inferential Statistics:
Explain the significant elements in a dataset via their relationships with other elements in the same dataset.

1. Hypothesis Testing
2. Model Fitting

# Rule-based Learning Models: 

After the methods of Inferential Statistics have been applied to the data and the hypotheses have been tested and validated, it may be possible to deduce rules from the data. These rules are treated as a fixed set that is not revised when additional data arrives, and they are applied to make predictions on any data that becomes available in the future.

# Machine Learning Models: 

This methodology extracts patterns from the available data and applies these patterns as rules when predictions must be made on new data. In contrast to rule-based models, though, the set of patterns is revised as new data becomes available and is used to improve the models.



# Hypothesis Testing Process:  

## Features of a Hypothesis:  

- a hypothesis is a proposal for an explanation that is believed to be true
- a hypothesis must be objectively testable

## Hypothesis Testing Process Steps:  

1. state a **null hypothesis**, which is considered to be **true** until the evidence shows otherwise
2. state an **alternative hypothesis**, which asserts a specific relationship in the data
3. select a **suitable** statistical test among those available from the large catalogue of hypothesis tests
4. choose a **significance level**, the threshold against which the **p-value** will be compared
5. run the test to produce a **test statistic**, which is then converted to a **p-value**

## The Null Hypothesis:  

The **null hypothesis** states that there is no relationship or effect, as the following examples illustrate:

- this drug **is not** effective in curing diabetes.
- there is **no significant difference** between the average GPAs of students at different universities.

## The Alternative Hypothesis:  

- this drug **is effective** in curing / combating diabetes.
- students of University A have higher average GPAs than students of University B

## Selection of a suitable statistical test for a hypothesis:

The selection of a suitable hypothesis test from among those available depends on the following main factors:

- the assumptions of the test
- the data that is available to perform the test

## Running of the test and conversion to a p-value:

- the **significance level** is a threshold for the **p-value**, usually expressed as a percentage (for example 5%)
- a **small p-value**, below the threshold, indicates that an outcome at least as extreme as the one observed would be unlikely if the null hypothesis were true, so the data support the alternative hypothesis
- a **large p-value**, above the threshold, indicates that the outcome is consistent with chance alone
- if the p-value is below the threshold, the **null hypothesis** is rejected in favour of the **alternative hypothesis**; otherwise the **null hypothesis** is not rejected
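
The five steps can be walked through end to end with a test that is simple enough to compute exactly. The following is a minimal sketch, assuming an exact two-sided binomial test on a hypothetical coin-flip experiment; the counts, the 5% level, and the function name `binomial_p_value` are illustrative assumptions.

```python
from math import comb

def binomial_p_value(successes: int, trials: int, p0: float = 0.5) -> float:
    """Exact two-sided p-value: total probability, under H0, of every
    outcome that is no more likely than the observed one."""
    pmf = [comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    # Small tolerance so outcomes tied with the observed one are included.
    return sum(p for p in pmf if p <= observed * (1 + 1e-9))

# Steps 1-2: H0 "the coin is fair (p = 0.5)", H1 "the coin is not fair".
# Step 3: an exact binomial test suits a count of successes in fixed trials.
alpha = 0.05                               # Step 4: significance level
heads, flips = 16, 20                      # hypothetical observed data
p_value = binomial_p_value(heads, flips)   # Step 5: run the test

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"p-value = {p_value:.4f} -> {decision}")
```

With these made-up counts the p-value falls below the 5% threshold, so the sketch rejects the null hypothesis of a fair coin.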


# [Type I & Type II Errors](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2996198/#:~:text=A%20type%20I%20error%20(false,actually%20false%20in%20the%20population.)

Two types of errors may occur when applying statistical inference:

A **type I error (false-positive)** occurs if an investigator rejects a null hypothesis when it is true in the population.  
A **type II error (false-negative)** occurs if an investigator fails to reject a null hypothesis when it is false in the population.

Although type I and type II errors can never be avoided entirely, the investigator can reduce their likelihood by increasing the sample size.   
The larger the sample, the lower the likelihood it will differ substantially from the population.

False-positive and false-negative results can also occur because of bias, for example in the observer or the instrument.  
Errors due to bias are not referred to as type I and type II errors; they are difficult to detect and usually cannot be quantified.

# Power of a Statistical Test

The **Power of a Statistical Test (Pw)** is the probability of rejecting H0 when H1, the alternative hypothesis, is true.  
This also means that **1 - Pw** is the probability of committing a **Type II Error** when applying the **Statistical Hypothesis Tests**.  
The goal is to design and apply **Statistical Hypothesis Tests** with **high values of Pw**.
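
The link between power, Type II error, and sample size can be made concrete in a case where power has a closed form. The sketch below assumes a two-sided one-sample z-test with a known standard deviation, a simplification chosen only because its power can be written down exactly; the effect size, `sigma`, and the function name `z_test_power` are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-sided one-sample z-test when the true mean is
    `effect` away from the value stated by H0 (sigma assumed known)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # rejection cutoff for |z|
    shift = effect * sqrt(n) / sigma             # mean of z when H1 is true
    # Probability that z lands in either rejection region when H1 is true.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

for n in (10, 30, 100):
    power = z_test_power(effect=0.5, sigma=1.0, n=n)
    print(f"n={n:>3}  power={power:.3f}  P(Type II error)={1 - power:.3f}")
```

Under these assumptions power rises, and the Type II error probability falls, as the sample grows.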

## Alpha of a Statistical Test

The Alpha of a **Statistical Hypothesis Test** is the probability of rejecting H0 when H0 is true.   
Alpha is therefore the probability of committing a **Type I Error**.

The Alpha of a **Statistical Hypothesis Test** is a threshold value chosen when we apply a statistical test.  
The typical values for alpha are 1% or 5%.

## p-value of a Statistical Test

Every **Statistical Hypothesis Test** produces a **Test Statistic and a corresponding p-value**.  
The p-value of the test statistic is compared to the Alpha value to decide whether to reject H0.  

1. p-value < alpha => reject H0 in favour of H1
2. p-value >= alpha => fail to reject H0 (this does not prove that H0 is true)
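
The decision rule and the meaning of alpha can be checked by simulation. The sketch below assumes a two-sided z-test applied to data generated with H0 true (standard normal, mean 0); the seed, sample size, number of repetitions, and helper names are illustrative assumptions. Because H0 holds in every repetition, every rejection is a Type I error, so the rejection rate should land close to alpha.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value of a one-sample z-test (sigma assumed known)."""
    z = (fmean(sample) - mu0) / (sigma / sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha):
    """Apply the decision rule: compare the p-value with alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

random.seed(0)
alpha, n, repetitions = 0.05, 30, 4000
rejections = 0
for _ in range(repetitions):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]   # H0 is true here
    if decide(z_test_p_value(sample), alpha) == "reject H0":
        rejections += 1

type_1_rate = rejections / repetitions
print(f"alpha = {alpha}, observed Type I error rate = {type_1_rate:.3f}")
```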


# The T-Test

## Resources
[t-Test - Full Course - Everything you need to know](https://www.youtube.com/watch?v=VekJxtk4BYM&t=66s)  
[Types of t-tests](https://app.pluralsight.com/ilx/video-courses/4b4cdb5a-b0b9-4c17-8e5c-9d9e07e522a0/cd386635-4bb8-4317-a87e-f944b673f5f9/3dc25bee-4486-4b4b-b158-79ee0a7e446c)

A t-test is used **to compare averages across two categories**.  
A t-test is used **to verify whether the difference between the averages computed across two categories is statistically significant**.  
A t-test produces as its **test statistic** the **difference in averages across two categories** (of the same population), scaled by the standard error of that difference.  

For any t-test, the hypotheses take the following form:

- H0: the difference between the averages across the categories of a population is NOT statistically significant
- H1: the difference between the averages across the categories of a population IS statistically significant

## Examples

1. You want to test whether the average male birth weight statistically differs from the average female birth weight across a population.
2. You want to test whether the average school results at school A are statistically different from those at school B.

## Conditions for a T-Test to be applicable

1. The sample **means are normally distributed**, which is approximately true for large samples as a consequence of the **CLT (Central Limit Theorem)**
2. The sample **variances** follow a scaled **Chi2 distribution**, which holds when the underlying data are normally distributed
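
The first condition can be illustrated by simulation. The sketch below draws repeated samples from a strongly skewed exponential distribution (an illustrative choice, as are the seed and the sizes) and inspects the sample means; if the CLT applies, their centre should be near the population mean of 1 and their spread near 1/sqrt(n).

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(1)
n, repetitions = 50, 5000

# Each entry is the mean of one sample of size n from Exp(rate=1),
# a skewed distribution with population mean 1 and standard deviation 1.
sample_means = [
    fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(repetitions)
]

centre = fmean(sample_means)
spread = stdev(sample_means)
se = 1 / sqrt(n)   # theoretical standard error of the mean
# Share of sample means within 1.96 standard errors of the population mean;
# for a normal distribution this share is about 95%.
within = sum(abs(m - 1.0) <= 1.96 * se for m in sample_means) / repetitions

print(f"mean of sample means   = {centre:.3f} (population mean 1)")
print(f"spread of sample means = {spread:.3f} (theory {se:.3f})")
print(f"share within 1.96 SE   = {within:.3f} (normal theory 0.95)")
```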

## Types of t-tests

1. One sample location test
2. Two sample location test
3. Paired difference test
4. Regression coefficient test

### One sample location test

It is used to compare the average of a sample with a certain numerical value. 

#### Example:

Is the average weight of babies born in a certain town statistically different from the average weight in the general population?

- H0: the population mean is equal to the given value.
- H1: the population mean is not equal to the given value.
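
The birth-weight question can be sketched with the standard library alone. The weights, the general-population mean of 3.5 kg, and the function name `one_sample_t` are hypothetical values chosen for illustration. The decision uses the tabulated two-sided 5% critical value for 9 degrees of freedom; a statistics library such as SciPy (`scipy.stats.ttest_1samp`) would report a p-value directly.

```python
from math import sqrt
from statistics import fmean, stdev

def one_sample_t(sample, popmean):
    """Return (t statistic, degrees of freedom) for H0: mean == popmean."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)   # stdev uses the n-1 denominator
    return (fmean(sample) - popmean) / standard_error, n - 1

# Hypothetical birth weights in kg for 10 babies born in the town.
town_weights = [3.1, 3.4, 2.9, 3.6, 3.3, 3.0, 3.5, 3.2, 3.7, 3.3]
general_mean = 3.5     # assumed general-population mean, for illustration

t_stat, df = one_sample_t(town_weights, general_mean)
T_CRIT_DF9 = 2.262     # two-sided 5% critical value of t with 9 df (t table)
decision = "reject H0" if abs(t_stat) > T_CRIT_DF9 else "fail to reject H0"
print(f"t = {t_stat:.3f}, df = {df}: {decision}")
```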

## Levene's statistical test for homogeneity of variances

[Levene's test - Test for variance equality](https://www.youtube.com/watch?v=x51GDTiPIfI)  

- H0: the variances of two populations are equal
- H1: the variances of two populations are NOT equal

Choosing between the variants of the two-sample t-test requires determining whether the two populations have equal variances.
This is the use case for Levene's test.
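
The statistic behind the test can be sketched as follows. This version centres each group on its median (the Brown-Forsythe variant), which is one common choice and an assumption here; the data and the function name `levene_statistic` are hypothetical. The sketch stops at the statistic W, which is compared against an F distribution with (k-1, N-k) degrees of freedom; a library such as SciPy (`scipy.stats.levene`) would report the p-value.

```python
from statistics import fmean, median

def levene_statistic(*groups):
    """Levene's W using absolute deviations from each group's median.
    Larger W means stronger evidence against equal variances."""
    k = len(groups)
    total = sum(len(g) for g in groups)
    # Absolute deviation of every observation from its own group's centre.
    deviations = [[abs(x - median(g)) for x in g] for g in groups]
    group_means = [fmean(z) for z in deviations]
    grand_mean = fmean([z for zs in deviations for z in zs])

    between = sum(len(zs) * (m - grand_mean) ** 2
                  for zs, m in zip(deviations, group_means))
    within = sum((z - m) ** 2
                 for zs, m in zip(deviations, group_means) for z in zs)
    return (total - k) / (k - 1) * between / within

# Hypothetical measurements: similar centres, visibly different spreads.
narrow = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
wide = [8.0, 12.5, 9.1, 11.6, 7.4, 13.0, 10.2, 8.9]
w = levene_statistic(narrow, wide)
print(f"Levene W = {w:.2f}")
```

For two groups of 8 observations the 5% critical value of F(1, 14) is about 4.60, so a W well above that would lead to rejecting equal variances.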

### Two sample location test

It is used to compare the averages of two populations.
**The populations are assumed to be independent from each other**.

There are variants of the two sample location tests, whose details depend upon: 

- the relative size of the populations
- the differences in the variances of the two populations

**Welch's t-test** is an often-used variant for the scenario in which the sample sizes may or may not be equal and the variances are not equal.
The variant to apply is chosen as follows:

- Levene's statistical test for homogeneity of variances is applied to the two populations
- if Levene's test rejects the H0 then **Welch's t-test** is applied
- if Levene's test does not reject the H0 then the **two-sample location test for equal variances** is applied

#### Example:

Is the average weight of babies born in town-a statistically different from the average weight of babies born in town-b?

- H0: the population mean of town-a is equal to that of town-b.
- H1: the population mean of town-a is not equal to that of town-b.


### Paired difference test

The paired difference test is commonly used for **before-after** comparisons.
**The two sets of measurements are assumed to be dependent or matched** (for example, the same subjects measured twice), and the averages are compared before and after a particular event.

#### Example:

Is the average cholesterol level in patients after a drug treatment the same as before it?
Did the drug treatment have a significant effect on the average cholesterol level in patients?
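
A paired test reduces to a one-sample t-test on the per-subject differences. The sketch below illustrates this with hypothetical cholesterol readings; the values and the function name `paired_t` are illustrative assumptions.

```python
from math import sqrt
from statistics import fmean, stdev

def paired_t(before, after):
    """Paired t statistic and df: a one-sample t-test on the per-subject
    differences against a mean difference of zero."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [b - a for b, a in zip(before, after)]
    se = stdev(diffs) / sqrt(len(diffs))
    return fmean(diffs) / se, len(diffs) - 1

# Hypothetical cholesterol levels (mg/dL) for the same 6 patients.
before = [220, 245, 210, 260, 235, 250]
after = [205, 230, 212, 240, 225, 236]

t_stat, df = paired_t(before, after)
print(f"t = {t_stat:.3f}, df = {df}")   # positive t: levels fell on average
```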

### Regression coefficient test

Perform a regression analysis using predictors and a target, then ask:
is the coefficient of a given independent variable different from zero?

A t-test determines whether the coefficient of each independent variable is significantly different from 0.
If the p-value of a regression coefficient is below the significance level, the coefficient is considered different from 0, which means that the variable has an effect on the target.

#### Example:
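
A minimal sketch for the simplest case, a regression with a single predictor: the t statistic of the slope is the estimated slope divided by its standard error. The data and the function name `slope_t_test` are hypothetical; a library such as SciPy (`scipy.stats.linregress`) would report the corresponding p-value.

```python
from math import sqrt
from statistics import fmean

def slope_t_test(x, y):
    """Fit y = intercept + slope * x by least squares and return
    (slope, standard error of slope, t statistic, df) for H0: slope == 0."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance uses n - 2 because two parameters were estimated.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = sqrt(sse / (n - 2) / sxx)
    return slope, se_slope, slope / se_slope, n - 2

# Hypothetical predictor and target values.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, se, t_stat, df = slope_t_test(x, y)
print(f"slope = {slope:.2f}, se = {se:.4f}, t = {t_stat:.1f}, df = {df}")
```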

## Limitations of the t-test

- works best for two-group comparisons
- comparing several groups may require many pairwise comparisons
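
How quickly the number of pairwise comparisons grows can be sketched as follows. The second function also estimates the chance of at least one false positive across all the tests, treating them as independent, which is a simplifying assumption; both function names are hypothetical.

```python
from math import comb

def pairwise_comparisons(groups: int) -> int:
    """Number of distinct two-group t-tests needed to compare every pair."""
    return comb(groups, 2)

def familywise_error_rate(tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one Type I error across independent tests,
    each run at level alpha."""
    return 1 - (1 - alpha) ** tests

for k in (2, 3, 5, 10):
    m = pairwise_comparisons(k)
    print(f"{k:>2} groups -> {m:>2} t-tests, "
          f"P(at least one false positive) = {familywise_error_rate(m):.2f}")
```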



