# Hypothesis testing 
- Hypothesis testing compares two opposite ideas about a group of people or things and uses data from a small part of that group (a sample) to decide which idea is more likely true. We collect and study the sample data to check if the claim is correct.

- Defining Hypotheses
   - Null Hypothesis (H₀): The starting assumption. For example, "The average visits are 50."
   - Alternative Hypothesis (H₁): The opposite, saying there is a difference. For example, "The average visits are not 50."

- Key Terms of Hypothesis Testing
  To understand hypothesis testing, we first need the key terms below:
  - Significance Level (α): The probability of rejecting the null hypothesis when it is actually true that we are willing to accept. Usually, we choose 0.05 (5%).
  - p-value: The probability of seeing data at least as extreme as ours if the null hypothesis is true. If this is less than α, we reject the null hypothesis.
  - Test Statistic: A number computed from the sample that measures how far the data are from what the null hypothesis predicts.
  - Critical Value: The cutoff point to compare with the test statistic; a test statistic beyond it leads to rejecting the null hypothesis.
  - Degrees of freedom: A number that depends on the sample size and sets the shape of the test distribution, which is needed to find the critical value.
- This is one of the most important and practical topics in statistics for data analysts and data scientists.
- Hypothesis testing is a statistical method used to make decisions or inferences about a population using sample data. It helps answer questions like:
  - “Is the average income higher this year?”
  - “Do two groups have different means?”
  - “Are two variables related?”
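
The key terms above can be tied together with a small numeric sketch. The visit counts below are made-up numbers used only for illustration, and the variable names (`visits`, `mu0`) are hypothetical. The block computes the test statistic, degrees of freedom, critical value, and p-value by hand for a one-sample t-test, then cross-checks them against `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: daily visit counts (made-up numbers for illustration)
visits = np.array([52, 48, 55, 51, 49, 53, 50, 54, 47, 56])
mu0 = 50        # mean claimed by H0
alpha = 0.05    # significance level

n = visits.size
df = n - 1                                   # degrees of freedom (one-sample t-test)
std_err = visits.std(ddof=1) / np.sqrt(n)    # standard error of the mean
t_stat = (visits.mean() - mu0) / std_err     # test statistic
t_crit = stats.t.ppf(1 - alpha / 2, df)      # two-sided critical value
p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-sided p-value

# Cross-check against SciPy's built-in one-sample t-test
result = stats.ttest_1samp(visits, popmean=mu0)

print(f"t = {t_stat:.3f}, critical value = ±{t_crit:.3f}, df = {df}")
print(f"p-value = {p_value:.3f} (SciPy: {result.pvalue:.3f})")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Comparing |t| with the critical value and comparing the p-value with α are two views of the same decision, so they always agree.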


### Concept
 -  Null hypothesis (H₀): There is no effect or no difference (the default assumption)
 -  Alternative hypothesis (H₁ or Hₐ): There is an effect or difference

### Steps
 - Collect sample data
 - Choose a test & significance level (usually α = 0.05)
 -  Compute a test statistic and p-value
 -  Reject or fail to reject H₀

## H-Tests in Python

- Hypothesis testing in Python involves using statistical libraries to evaluate claims about a population based on sample data. The core concept is to determine whether observed differences or relationships are statistically significant or likely due to random chance.

### Steps to do in Python
- Define Null (H₀) and Alternative (H₁) Hypotheses:
   - H₀: The statement of no effect or no difference (e.g., means are equal).
   - H₁: The statement that contradicts H₀ (e.g., means are different).
- Choose a Significance Level (α):
   - This is the probability threshold for rejecting the null hypothesis, typically 0.05.
- Collect and Analyze Data:
   - Obtain a sample from the population of interest.
- Select an Appropriate Statistical Test:
   - The choice of test depends on the data type, number of samples, and the nature of the hypothesis (e.g., t-test for comparing means, chi-square for categorical associations, ANOVA for comparing multiple means).
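
The steps above can be sketched end to end as follows. The data are simulated stand-ins for two independent groups, and the rule used to pick the test (a Shapiro-Wilk normality check, then Welch's t-test or Mann-Whitney U) is one common heuristic assumed here for illustration, not the only reasonable choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated measurements for two independent groups (hypothetical data)
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=53, scale=5, size=40)
alpha = 0.05  # significance level chosen before looking at the data

# Shapiro-Wilk: H0 is that the sample comes from a normal distribution
normal_a = stats.shapiro(group_a).pvalue > alpha
normal_b = stats.shapiro(group_b).pvalue > alpha

if normal_a and normal_b:
    # Both samples look normal: compare means with Welch's t-test
    test_name = "Welch's t-test"
    result = stats.ttest_ind(group_a, group_b, equal_var=False)
else:
    # Otherwise fall back to a rank-based, non-parametric test
    test_name = "Mann-Whitney U"
    result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
print(f"{test_name}: p-value = {result.pvalue:.4f} -> {decision}")
```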

### Example
- A company claims the average battery life is 10 hours. You test 30 samples — is the mean really 10 hours?
   - H₀: μ = 10
   - H₁: μ ≠ 10
- If p-value < 0.05 → reject H₀ → significant difference.
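
A minimal sketch of the battery example, assuming the 30 measurements are available as a NumPy array. No real measurements are given here, so the block simulates them; the true mean of 9.6 hours is an arbitrary choice for the simulation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# 30 simulated battery lifetimes in hours (hypothetical; true mean set to 9.6)
battery_hours = rng.normal(loc=9.6, scale=1.0, size=30)

alpha = 0.05
# H0: mu = 10, H1: mu != 10 (two-sided is the default)
result = stats.ttest_1samp(battery_hours, popmean=10)

print(f"sample mean = {battery_hours.mean():.2f} h")
print(f"t = {result.statistic:.3f}, p-value = {result.pvalue:.4f}")
if result.pvalue < alpha:
    print("Reject H0: the mean differs significantly from 10 hours")
else:
    print("Fail to reject H0: no significant difference from 10 hours")
```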

### Tests in Python (scipy library):
  - scipy.stats.ttest_ind (independent samples),
  - scipy.stats.ttest_rel (paired samples),
  - scipy.stats.ttest_1samp (one-sample).
  - scipy.stats.chi2_contingency for testing association between categorical variables.
  - scipy.stats.f_oneway for comparing means of three or more groups.
  - scipy.stats.pearsonr for measuring the strength and direction of linear relationships; scipy.stats.spearmanr for monotonic (rank-based) relationships.
  - scipy.stats.shapiro, scipy.stats.kstest to check if data follows a normal distribution.
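
To illustrate how the two correlation functions differ, the sketch below uses a made-up relationship that always increases but is curved. Pearson's r measures linear association, while Spearman's rho works on ranks and so captures any monotonic trend.

```python
import numpy as np
from scipy import stats

# Made-up data: y always increases with x, but along a curve
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 4)

pearson_r, pearson_p = stats.pearsonr(x, y)      # linear association
spearman_rho, spearman_p = stats.spearmanr(x, y)  # rank-based association

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
```

Because the ranks of x and y match exactly, Spearman's rho reaches 1 here, while Pearson's r stays below 1 because the points do not lie on a straight line.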


#### Types of Hypothesis
| Type | Used for | Data type | Python function |
|---|---|---|---|
| One-sample t-test | Compare sample mean to a known value | Continuous | scipy.stats.ttest_1samp() |
| Two-sample t-test | Compare means of two independent groups | Continuous | scipy.stats.ttest_ind() |
| Paired t-test | Compare means of two related samples | Continuous (paired) | scipy.stats.ttest_rel() |
| Chi-square test | Test association between categorical variables | Categorical | scipy.stats.chi2_contingency() |
| ANOVA (F-test) | Compare means across 3+ groups | Continuous | scipy.stats.f_oneway() |
| Mann–Whitney U / Wilcoxon | Non-parametric alternatives | Continuous (non-normal) | scipy.stats.mannwhitneyu(), scipy.stats.wilcoxon() |
| Z-test | Compare means with known σ (large samples) | Continuous | statsmodels.stats.weightstats.ztest() |
    
![image.png](attachment:c809645f-75ce-40f1-972f-d30ead0bf91c.png)

### One Sample vs Two Sample Test
- A one-sample test compares a single group's mean to a known, fixed value, while a two-sample test compares the means of two independent groups to each other.
  - One Sample
    - H₀ (two-tailed): Avg height of the class = 160 cm; H₁: avg height ≠ 160 cm
    - H₀ (one-tailed): Avg height of the class ≥ 160 cm; H₁: avg height < 160 cm
     
  - Two Sample Tests
    - ttest_ind(data_group1, data_group2, equal_var=True/False)
    - data_group1: first sample
    - data_group2: second sample
    - equal_var=True: equal variances assumed (e.g., the patient groups)
    - equal_var=False: equal variances not assumed (e.g., males and females, online/offline course); SciPy then runs Welch's t-test
     
  - Two Sample (Independent Samples)
    - Avg height of males = avg height of females (two-tailed)
    - Avg height of males > avg height of females (one-tailed)
     
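  - Two Sample (Paired Samples)
    - Paired samples come from the same subjects measured twice (e.g., each patient's BP before and after the medicine) or from subjects matched one-to-one; use ttest_rel for them.
    - The two designs below split subjects into separate groups, so as described they are independent samples (ttest_ind). They count as paired only if each subject in one group is matched to a subject in the other.
    - Students were randomly split into 2 groups: one group was taught Data Analytics ONLINE and the other was taught OFFLINE
    - Avg marks of ONLINE = avg marks of OFFLINE (two-tailed)
    - Avg marks of ONLINE > avg marks of OFFLINE (one-tailed)
    - Patients were divided into 2 groups: one group was given BP medicine and the other was given a plain medicine without any BP drug in it (placebo)
    - BP measured in Group 1 = BP measured in Group 2 (two-tailed)
    - BP measured in Group 1 < BP measured in Group 2 (one-tailed)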
     
- In short, one-sample tests assess if a group's average is different from a specific standard, whereas two-sample tests check if two groups are different from each other.
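
The two-sample variants can be sketched side by side. All numbers are simulated and the variable names are hypothetical: `online` and `offline` stand for two separate groups of students, while `bp_before` and `bp_after` stand for the same patients measured twice, which is what makes them paired.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Independent samples: two separate groups of students (simulated marks)
online = rng.normal(loc=72, scale=8, size=35)
offline = rng.normal(loc=70, scale=12, size=30)
# equal_var=False runs Welch's t-test, which does not assume equal variances
ind = stats.ttest_ind(online, offline, equal_var=False)

# Paired samples: the same 25 patients measured before and after treatment
bp_before = rng.normal(loc=150, scale=10, size=25)
bp_after = bp_before - rng.normal(loc=5, scale=4, size=25)
paired = stats.ttest_rel(bp_before, bp_after)

# A paired t-test is equivalent to a one-sample t-test on the differences
diff = stats.ttest_1samp(bp_before - bp_after, popmean=0)

print(f"Independent (Welch): t = {ind.statistic:.3f}, p = {ind.pvalue:.4f}")
print(f"Paired:              t = {paired.statistic:.3f}, p = {paired.pvalue:.4f}")
print(f"One-sample on diffs: t = {diff.statistic:.3f}, p = {diff.pvalue:.4f}")
```

The last two lines print the same values, which shows why pairing matters: the test works on within-subject differences rather than on two unrelated groups.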

### Multiple Sample - ANOVA
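
When there are three or more independent groups, a one-way ANOVA tests whether all group means are equal. A minimal sketch with scipy.stats.f_oneway, using simulated scores for three hypothetical departments:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Simulated test scores for three departments (hypothetical data)
sales = rng.normal(loc=70, scale=8, size=30)
hr = rng.normal(loc=74, scale=8, size=30)
it = rng.normal(loc=78, scale=8, size=30)

alpha = 0.05
# H0: all three group means are equal; H1: at least one mean differs
f_stat, p_value = stats.f_oneway(sales, hr, it)

print(f"F = {f_stat:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

A significant result says only that at least one mean differs; it does not say which one.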

#### Two Tail or One Tail : Left or Right Side
- In Python, when performing t-tests using the scipy.stats module, you can specify left-tailed or right-tailed tests by using the alternative parameter in functions like ttest_ind (for independent samples) or ttest_rel (for related samples).
- Two-tailed test (default): alternative='two-sided' (or omit the parameter, as this is the default). This tests if the means are different, without specifying a direction.
- Left-tailed test: alternative='less'. This tests if the mean of the first sample is significantly less than the mean of the second sample (or a hypothesized population mean).
- Right-tailed test: alternative='greater'. This tests if the mean of the first sample is significantly greater than the mean of the second sample (or a hypothesized population mean).
```python
import numpy as np
from scipy import stats

# Generate sample data
np.random.seed(42)
sample1 = np.random.normal(loc=5, scale=1, size=100)
sample2 = np.random.normal(loc=4.8, scale=1, size=100)

# Two-tailed t-test
t_stat_two, p_value_two = stats.ttest_ind(sample1, sample2, alternative='two-sided')
print(f"Two-tailed test: t-statistic = {t_stat_two:.3f}, p-value = {p_value_two:.3f}")

# Left-tailed t-test (testing if mean of sample1 is less than mean of sample2)
t_stat_less, p_value_less = stats.ttest_ind(sample1, sample2, alternative='less')
print(f"Left-tailed test: t-statistic = {t_stat_less:.3f}, p-value = {p_value_less:.3f}")

# Right-tailed t-test (testing if mean of sample1 is greater than mean of sample2)
t_stat_greater, p_value_greater = stats.ttest_ind(sample1, sample2, alternative='greater')
print(f"Right-tailed test: t-statistic = {t_stat_greater:.3f}, p-value = {p_value_greater:.3f}")
```

- Key Points: The alternative parameter directly controls the direction of your hypothesis test and the calculation of the p-value. Choosing the correct alternative is crucial and depends on your research question and the directionality of your alternative hypothesis.

### One Tail and Two Tail Test
- Hypothesis tests in Python, using libraries like SciPy, allow for both one-tailed and two-tailed analyses. The choice between a one-tailed and two-tailed test depends on the alternative hypothesis being investigated.
- One-Tailed Test: A one-tailed test (also known as a one-sided test) is used when the alternative hypothesis specifies a direction for the effect. This means you are only interested in detecting a deviation in one specific direction (e.g., greater than, or less than, a certain value).
   - Example: Testing if a new drug increases a patient's recovery rate, or if a new manufacturing process reduces the number of defects.
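   - In Python (using SciPy's ttest_ind for a two-sample t-test), pass alternative='greater' or alternative='less' to match the direction of H₁:
   - stats.ttest_ind(data_group1, data_group2, equal_var=True, alternative='greater')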
- Two-Tailed Test: A two-tailed test (also known as a two-sided test) is used when the alternative hypothesis does not specify a direction. You are interested in detecting a significant difference in either direction (e.g., greater than or less than a certain value).
   - Example: Testing if a new teaching method changes student performance (it could be better or worse), or if there is a difference in average heights between two populations.
   - In Python (using SciPy's ttest_ind): stats.ttest_ind(data_group1, data_group2, equal_var=True, alternative='two-sided'). This is also the default when alternative is omitted.
- The primary distinction lies in how the significance level (alpha) is distributed. In a one-tailed test, the entire alpha is concentrated in one tail of the distribution. In a two-tailed test, the alpha is split between both tails. This means a one-tailed test has more power to detect an effect in the specified direction but cannot detect an effect in the opposite direction.
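
The way α is split can also be seen in the p-values themselves. The sketch below uses simulated data and hypothetical names (`treated`, `control`) and runs the same t-test with all three values of alternative. For the t-test, the two one-tailed p-values add up to 1, and the two-tailed p-value is twice the smaller of them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Simulated outcomes for a treated and a control group (hypothetical data)
treated = rng.normal(loc=5.3, scale=1.0, size=50)
control = rng.normal(loc=5.0, scale=1.0, size=50)

two_sided = stats.ttest_ind(treated, control, alternative="two-sided")
greater = stats.ttest_ind(treated, control, alternative="greater")
less = stats.ttest_ind(treated, control, alternative="less")

# The t-statistic is the same in all three cases; only the p-value changes
print(f"t-statistic         = {two_sided.statistic:.3f}")
print(f"two-sided p-value   = {two_sided.pvalue:.4f}")
print(f"'greater' p-value   = {greater.pvalue:.4f}")
print(f"'less' p-value      = {less.pvalue:.4f}")
```

This is why a one-tailed test has more power in the chosen direction: when the observed effect points that way, its p-value is half the two-tailed one.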

### Which test to do

- Practice each path in Python:
  - One-sample t-test → compare sample mean vs known mean
  - Chi-square test → compare survey responses (gender × preference)
  - ANOVA → compare scores across departments
  - Correlation → analyze mpg vs wt in the mtcars dataset
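
As a starting point for the chi-square path, here is a minimal sketch with scipy.stats.chi2_contingency. The survey counts are invented for illustration (rows are gender, columns are the preferred option).

```python
import numpy as np
from scipy import stats

# Hypothetical survey counts: rows = gender, columns = preferred option
observed = np.array([[30, 20, 10],
                     [20, 25, 15]])

alpha = 0.05
# H0: gender and preference are independent
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.4f}")
print("Expected counts under H0:")
print(expected.round(2))
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The degrees of freedom equal (rows - 1) × (columns - 1), which is 2 for this 2 × 3 table.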

![image.png](attachment:cc53729d-a7cc-4c6b-acc0-ae3d8b2758d4.png)