# Hypothesis Testing

### What is Hypothesis Testing in Statistics?

- Hypothesis testing uses sample data from the population to draw useful conclusions regarding the population probability distribution. 
- It tests an assumption about a population (for example, the value of its mean) using a statistical test suited to the data and the question.
- Hypothesis testing results in either rejecting or not rejecting the null hypothesis.


### Null and alternative hypotheses

- A null hypothesis (H0) always predicts no true effect, no relationship between variables, or no difference between groups.
    
- An alternative hypothesis (Ha or H1) states your main prediction of a true effect, a relationship between variables, or a difference between groups.

### Test statistics and p values

  Every statistical test produces:

- A test statistic that indicates how closely your data match the null hypothesis.
- A corresponding p value that tells you the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true.

### What exactly is a p-value?

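- The p-value, or probability value, is the probability of obtaining a test statistic at least as extreme as the one you observed, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true.
- It is calculated from your test statistic, which is the number a statistical test computes from your data, using the distribution that statistic follows under the null hypothesis.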
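
As a minimal sketch of how a test statistic turns into a p-value, the example below runs a two-sided one-sample z-test using only the Python standard library. The function name `z_test_p_value`, the sample values, and the assumption that the population standard deviation `sigma` is known are all illustrative choices, not part of the notes above.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_test_p_value(sample, mu0, sigma):
    """Two-sided one-sample z-test, assuming the population sigma is known."""
    n = len(sample)
    # Test statistic: distance of the sample mean from mu0 in standard errors.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # p-value: probability, under H0, of a statistic at least as extreme as |z|.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.4, 5.7, 5.2]  # made-up data
z, p = z_test_p_value(sample, mu0=5.0, sigma=0.5)
print(f"z = {z:.3f}, p = {p:.4f}")
```

With this made-up sample, z comes out at about 2.12 and the p-value at about 0.034, so the result would count as significant at α = 0.05 but not at α = 0.01.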

### Type I error:

- A Type I error means rejecting the null hypothesis when it’s actually true. 
- It means concluding that results are statistically significant when, in reality, they came about purely by chance or because of unrelated factors.
- The risk of committing this error is the significance level (alpha or α) you choose.
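
A small simulation can make the link between α and the Type I error rate concrete. The sketch below repeatedly draws samples for which the null hypothesis is true (mean 0, standard deviation 1) and counts how often a two-sided z-test rejects it anyway. The function name `type_i_error_rate`, the sample size, and the number of trials are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def type_i_error_rate(alpha=0.05, n=30, trials=5000, seed=0):
    """Fraction of true-null experiments that a two-sided z-test rejects."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # H0 is true by construction: the population mean really is 0.
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = (sum(sample) / n) / (1 / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1  # a false positive, i.e. a Type I error
    return rejections / trials

rate = type_i_error_rate()
print(f"observed Type I error rate: {rate:.3f}")
```

The observed rate should land close to the chosen α, which is the sense in which the significance level is the risk of a Type I error.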

### Type II error:

- A Type II error means not rejecting the null hypothesis when it’s actually false. 
- This is not quite the same as “accepting” the null hypothesis, because hypothesis testing can only tell you whether to reject the null hypothesis.

### Power of test

- Power is the extent to which a test can correctly detect a real effect when there is one. 
- High power in a study indicates a large chance of a test detecting a true effect. 
- Low power means that your test only has a small chance of detecting a true effect or that the results are likely to be distorted by random and systematic error.
- Power is the complement of the Type II error rate: power = 1 - β, where β is the probability of making a Type II error.
- The higher the statistical power, the lower the probability of making a Type II error.
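
To see how power depends on sample size, the sketch below computes the power of a two-sided one-sample z-test analytically. It assumes a known population standard deviation and a specific true effect size; the function name `z_test_power` and the numbers used are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test for a given true mean shift."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # How far the true effect moves the test statistic, in standard errors.
    shift = effect * sqrt(n) / sigma
    # Probability that the statistic falls in either rejection region.
    return nd.cdf(-z_crit + shift) + nd.cdf(-z_crit - shift)

for n in (10, 32, 100):
    power = z_test_power(effect=0.5, sigma=1.0, n=n)
    print(f"n = {n:3d}: power = {power:.3f}, Type II error rate = {1 - power:.3f}")
```

When the true effect is zero the function returns α, because every rejection is then a Type I error. For a fixed nonzero effect, power rises toward 1 as the sample size grows.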

### Statistical assumptions

   Statistical tests make some common assumptions about the data they are testing:

1. Independence of observations (a.k.a. no autocorrelation):
- The observations/variables you include in your test are not related (for example, multiple measurements of a single test subject are not independent, while measurements of multiple different test subjects are independent).

2. Homogeneity of variance: 
- The variance within each group being compared is similar among all groups. If one group has much more variation than others, it will limit the test’s effectiveness.

3. Normality of data: 
- The data follows a normal distribution (a.k.a. a bell curve). This assumption applies only to quantitative data.
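
As a rough first look at the homogeneity-of-variance assumption, you can compare the sample variances of the groups directly. The sketch below is only an informal heuristic with made-up data and a hypothetical helper name, `variance_ratio`; formal checks such as Levene's test are available in statistics libraries and are the better choice for real analyses.

```python
from statistics import variance

def variance_ratio(*groups):
    """Ratio of the largest to the smallest sample variance across groups."""
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)

group_a = [12.1, 11.8, 12.5, 12.0, 11.9]  # made-up measurements
group_b = [12.3, 10.1, 14.8, 9.7, 13.9]   # similar mean, much wider spread

ratio = variance_ratio(group_a, group_b)
# A ratio near 1 suggests similar spread; how large is "too large" is a judgment call.
print(f"largest / smallest variance = {ratio:.1f}")
```

Here the two groups have nearly the same mean but very different spread, so the ratio is far above 1 and a test that assumes equal variances would be a questionable choice.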


![tests.JPG](attachment:tests.JPG)

### Correlation vs. Causation

- Correlation describes an association between variables: when one variable changes, so does the other. 
- A correlation is a statistical indicator of the relationship between variables. 
- These variables change together but this covariation isn’t necessarily due to a direct or indirect causal link.

- Causation means that changes in one variable bring about changes in the other; there is a cause-and-effect relationship between variables. 
- The two variables are correlated with each other and there is also a causal link between them.

### Correlation Coefficient

- A correlation coefficient is a number between -1 and 1 that tells you the strength and direction of a relationship between variables.
- It reflects how similar the measurements of two or more variables are across a dataset.

![corelation.JPG](attachment:corelation.JPG)

### Types of correlation coefficients:

1. Pearson’s r

- Pearson’s r describes the linear relationship between two quantitative variables.
- It is a number between –1 and 1 that measures the strength and direction of the relationship between two variables.
- It’s not a good measure of correlation if your variables have a nonlinear relationship, or if your data have outliers, skewed distributions, or come from categorical variables.

2. Spearman’s rho

- Spearman’s rho, or Spearman’s rank correlation coefficient, is the most common alternative to Pearson’s r. 
- It’s a rank correlation coefficient because it uses the rankings of data from each variable (e.g., from lowest to highest) rather than the raw data itself.

### Pearson’s r vs Spearman’s rho

- While the Pearson correlation coefficient measures the linearity of relationships, the Spearman correlation coefficient measures the monotonicity of relationships.

- In a linear relationship, each variable changes in one direction at the same rate throughout the data range. 
- In a monotonic relationship, each variable also always changes in only one direction but not necessarily at the same rate.
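
The difference between linearity and monotonicity can be seen on a small example. The sketch below implements both coefficients from scratch with the standard library; the helper names (`pearson_r`, `ranks`, `spearman_rho`) are illustrative, and the simple ranking function assumes there are no tied values.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Rank from 1 (lowest) upward; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = list(range(1, 11))
y = [v ** 3 for v in x]  # monotonic but not linear

print(f"Pearson r    = {pearson_r(x, y):.3f}")
print(f"Spearman rho = {spearman_rho(x, y):.3f}")
```

Because y always increases with x, the ranks match exactly and Spearman's rho is 1, while Pearson's r is lower (about 0.93) because the relationship is curved rather than a straight line.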

### Hypothesis Testing Steps

- Step 1: Set up the null hypothesis and identify whether the test is left-tailed, right-tailed, or two-tailed.
- Step 2: Set up the alternative hypothesis.
- Step 3: Choose the significance level, α, and find the critical value.
- Step 4: Calculate the appropriate test statistic (z, t, or χ²) and the p-value.
- Step 5: Compare the test statistic with the critical value or compare the p-value with α to arrive at a conclusion. In other words, decide if the null hypothesis is to be rejected or not.
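
The steps above can be sketched as a single function for one simple case: a right-tailed one-sample z-test with a known population standard deviation. The function name `right_tailed_z_test` and the numbers passed to it are hypothetical, chosen only to show where each step appears in code.

```python
from math import sqrt
from statistics import NormalDist

def right_tailed_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Right-tailed one-sample z-test, assuming sigma is known."""
    nd = NormalDist()
    # Steps 1-2: H0: mu <= mu0 versus Ha: mu > mu0 (right-tailed).
    # Step 3: significance level alpha and the matching critical value.
    z_crit = nd.inv_cdf(1 - alpha)
    # Step 4: test statistic and p-value.
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    p = 1 - nd.cdf(z)
    # Step 5: compare the statistic with the critical value.
    reject = z > z_crit
    return {"z": z, "z_crit": z_crit, "p": p, "reject_h0": reject}

result = right_tailed_z_test(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
print(result)
```

Comparing z with the critical value and comparing the p-value with α are two views of the same decision, so they lead to the same conclusion. In this made-up case z is 2.0, which exceeds the critical value of about 1.645, so the null hypothesis is rejected.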

### References:
- https://www.scribbr.com/statistics/statistical-tests/
- https://www.scribbr.com/statistics/correlation-coefficient/
- https://www.cuemath.com/data/hypothesis-testing/