# **Homogeneity (of variance)**

**Definition:**
Homogeneity (or **homoscedasticity**) means that **the variance (spread) of a variable is the same across all groups or samples**.

* In other words, all groups have **similar variability**.
* It is an important **assumption for many parametric tests**, like **t-tests and ANOVA**.

---

# **Why it matters**

* Many parametric tests (for example the t-test and ANOVA) assume that each group is drawn from a population with **equal variance**.
* If the variances are very different (**heteroscedasticity**), the p-values can be misleading, so the test results may be **invalid** or **less reliable**.

---

# **Example**

Suppose you are comparing test scores of 3 classrooms:

| Classroom | Scores         | Sample variance |
| --------- | -------------- | --------------- |
| A         | 70, 72, 68, 74 | 6.67            |
| B         | 65, 66, 64, 67 | 1.67            |
| C         | 71, 69, 72, 70 | 1.67            |

* Variances are all small and of similar magnitude → treated as **homogeneous** → ANOVA can reasonably be applied.
* If one classroom had variance = 30 → **heterogeneous** → the equal-variance assumption of ANOVA is violated.

---

# **How to Test Homogeneity**
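**Levene’s Test** → most commonly used. Its null hypothesis is that all groups have **equal variance**, so p > α means there is no evidence against homogeneity, while p ≤ α suggests the variances differ.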
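
To make the test concrete, here is a minimal sketch that computes the Levene statistic W by hand for the three classrooms, using absolute deviations from each group's median. The helper name `levene_statistic` and the critical value 4.26 (the upper 5% point of F with 2 and 9 degrees of freedom, taken from a standard F table) are assumptions for illustration. In practice `scipy.stats.levene` returns both the statistic and a p-value.

```python
from statistics import mean, median

def levene_statistic(*groups):
    """Levene's W statistic using absolute deviations from each group's median."""
    # Step 1: replace each score by its absolute deviation from the group median
    deviations = [[abs(x - median(g)) for x in g] for g in groups]
    k = len(groups)                        # number of groups
    n_total = sum(len(g) for g in groups)  # total number of observations
    group_means = [mean(d) for d in deviations]
    grand_mean = mean([x for d in deviations for x in d])
    # Step 2: one-way ANOVA F statistic computed on the deviations
    between = sum(len(d) * (m - grand_mean) ** 2
                  for d, m in zip(deviations, group_means))
    within = sum((x - m) ** 2
                 for d, m in zip(deviations, group_means) for x in d)
    return (n_total - k) / (k - 1) * between / within

classroom_a = [70, 72, 68, 74]
classroom_b = [65, 66, 64, 67]
classroom_c = [71, 69, 72, 70]

w_stat = levene_statistic(classroom_a, classroom_b, classroom_c)
F_CRITICAL = 4.26  # assumed table value: upper 5% point of F(2, 9)
print(f"W = {w_stat:.2f}")
if w_stat < F_CRITICAL:
    print("Fail to reject H0: no evidence that the variances differ")
else:
    print("Reject H0: the variances differ")
```

A W below the critical value corresponds to p > 0.05, which is the "homogeneous" outcome described above.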

In [None]:
'''
Data (CDCR)
1. composition (info - shape - data types - missing values - imputation)
2. distribution (histogram - normal or not?)
3. comparison
4. relationship
'''

# **INFERENTIAL STATISTICS**

## **TO REACH A CONCLUSION FROM DATA - DO HYPOTHESIS TESTING AND PREDICTION**

# **The data is usually a sample from a population; tests are run on the sample to draw conclusions and make predictions about the whole population**

1. independent vs dependent
2. hypothesis testing
3. confidence intervals
4. multiple comparisons of means (regression analysis, ANOVA)


In [None]:
# comparison
# are both groups the same or different?

# relationship (one increases and the other increases - one increases and the other decreases - one increases and the other stays constant)

# **Hypothesis testing**

## **1. Example Question**

**Scenario:** A nutritionist creates a new drink.

**Hypothesis:** Drinking this new drink will **normalize blood pressure** (lower it).
**Opposite statement:** The drink has **no effect**.

### **Null and Alternative Hypotheses**

* **Null hypothesis (H₀):** The drink has **no effect** on blood pressure.
* **Alternative hypothesis (H₁):** The drink **lowers blood pressure**.

A second, two-sided example:

* **Null hypothesis (H₀):** There is `no skill difference` between students of Karachi and Lahore
* **Alternative hypothesis (H₁):** There is `a skill difference` between students of Karachi and Lahore

---

## **2. Gender Comparison**

Comparing blood pressure between **male** and **female** groups.

### **Null and Alternative Hypotheses**

* **Null hypothesis (H₀):** Male blood pressure is the **same as female**.
* **Alternative hypothesis (H₁):** Male blood pressure is **different from female**.

---

## **3. Important Note**

> When running the **Shapiro-Wilk test**, the null hypothesis is that the data **is normally distributed** (a “positive” thing), so p > α means there is no evidence against normality and p ≤ α suggests the data is not normal.
> In most other hypothesis tests, the null hypothesis represents the **status quo or no effect**.

---

## **4. Comparison of Means**

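* The test compares the **mean blood pressure of males vs females**.
* The choice depends on normality, sample size, and variance: use the **t-test** when both groups are roughly normal with similar variances, **Welch’s t-test** when they are roughly normal but the variances differ, and the **Mann-Whitney test** when the data is not normal, especially with small samples (it compares ranks rather than means).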
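
As an illustration of the unequal-variance case, the sketch below computes Welch's t statistic and its approximate (Welch-Satterthwaite) degrees of freedom. The blood pressure readings and the helper name `welch_t` are invented for this example. To obtain a p-value, `scipy.stats.ttest_ind(male, female, equal_var=False)` performs the same test.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical systolic blood pressure readings (mmHg), invented for illustration
male = [128, 132, 125, 140, 135, 130]
female = [118, 122, 121, 125, 119, 127]

def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    # Squared standard error of each group mean (sample variance / n)
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t(male, female)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

Unlike the classic t-test, this version does not pool the two variances, which is why it remains usable when the homogeneity assumption fails.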


| Step                                         | Description                                                            | Example                                                |
| -------------------------------------------- | ---------------------------------------------------------------------- | ------------------------------------------------------ |
| **1. Define hypotheses**                     | Formulate **null (H₀)** and **alternative (H₁)** hypotheses.           | H₀: μ = 100 (no change); H₁: μ ≠ 100 (there is change) |
| **2. Choose significance level (α)**         | Commonly **0.05** or **0.01**.                                         | α = 0.05 → 5% risk of Type I error                     |
| **3. Select the appropriate test**           | Decide **parametric/non-parametric**, number of samples, type of data. | t-test, ANOVA, chi-square, Mann-Whitney, etc.          |
| **4. Collect data & compute test statistic** | Use formula or software to calculate **t, z, F, χ², etc.**             | t = (x̄ - μ) / (s/√n)                                  |
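| **5. Compute p-value or critical value**     | p-value = probability of a result at least as extreme as the one observed, assuming H₀ is true | p = 0.03                                               |
| **6. Make a decision**                       | Compare p-value with α: p ≤ α → reject H₀; p > α → fail to reject H₀   | p = 0.03 ≤ 0.05 → reject H₀                            |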
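
Step 4 of the table can be sketched in a few lines. The sample values below are invented for illustration, and `mu_0` is simply the value of the mean stated in H₀. Only the test statistic is computed here; a p-value needs the t-distribution, which `scipy.stats.ttest_1samp` provides.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample, invented for illustration
sample = [104, 98, 110, 102, 107, 99, 105, 103]
mu_0 = 100  # value of the mean claimed by H0

x_bar = mean(sample)
s = stdev(sample)   # sample standard deviation (n - 1 in the denominator)
n = len(sample)

# t = (x_bar - mu) / (s / sqrt(n)), the formula from step 4
t_stat = (x_bar - mu_0) / (s / sqrt(n))
print(f"x_bar = {x_bar:.2f}, s = {s:.2f}, t = {t_stat:.2f}")
```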


# **Hypothesis Testing & Confidence Intervals Notes**

---

## **Hypothesis Testing**

**Steps:**
1. Define Null (H₀) and Alternative (H₁) hypotheses
    
    **`Quick Memory Tip`**

    `H₀ = boring / nothing happens / status quo`

    `H₁ = interesting / something happens / effect exists`
2. Choose significance level (α), commonly 0.05
3. Select appropriate test (t-test, ANOVA, chi-square, etc.)
4. Collect data and calculate test statistic
5. Compute p-value
6. Decision:
   - If p ≤ α (0.05) → reject H₀  
   - If p > α → fail to reject H₀
7. Draw conclusion

**Example:**
- H₀: New drink has **no effect** on blood pressure  
- H₁: New drink **lowers blood pressure**

---

## **Confidence Interval (CI)**
If you know the sample mean, the CI gives a range around it in which the population mean plausibly lies.

**Definition:**  
A range of values that likely contains the **true population parameter**.

**Example:**  
- Sample mean = 120  
- 95% CI = [115, 125]  
- Interpretation: “We are 95% confident that the true mean lies between 115 and 125.”

**Relation with α:**  
- α = 0.05 → Confidence level = 1 - α = 95%  
- Smaller α → higher confidence level → wider CI

---

## **p-value**

**Definition:**  
Probability of observing your data (or more extreme) **if H₀ is true**.

**Decision rule:**  
- p ≤ α → reject H₀  
- p > α → fail to reject H₀

**Connection with CI** (two-sided test, CI at level 1 - α):  
- H₀ value **outside CI** → p < α → reject H₀  
- H₀ value **inside CI** → p > α → fail to reject H₀
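
The agreement between the two decision rules can be checked numerically. This is a sketch using a z-test on invented summary statistics (a large sample, so the normal approximation is reasonable); the variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary statistics, invented for illustration
x_bar, s, n = 103.0, 12.0, 64
mu_0 = 100.0   # mean under H0
alpha = 0.05

se = s / sqrt(n)                               # standard error
z = (x_bar - mu_0) / se                        # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for a (1 - alpha) CI
ci = (x_bar - z_crit * se, x_bar + z_crit * se)

reject_by_p = p_value <= alpha
reject_by_ci = not (ci[0] <= mu_0 <= ci[1])
print(f"z = {z:.2f}, p = {p_value:.4f}, CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print("Both rules agree:", reject_by_p == reject_by_ci)
```

Here the H₀ value of 100 falls just outside the interval, and the p-value is just under 0.05, so both rules reject H₀.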

---


# **Formula for Confidence Interval (for mean)**

If you have a **sample mean** $\bar{x}$, **sample standard deviation** $s$, and **sample size** $n$:

$$
CI = \bar{x} \pm (\text{critical value}) \times SE
$$

Where:

* **Critical value** depends on the confidence level:

  * 95% → z ≈ 1.96 (large sample or known σ)
  * 99% → z ≈ 2.576
  * For a small sample with unknown σ → use the t-distribution value
* **Standard error (SE)** = $s / \sqrt{n}$

---

# **Steps to Compute the CI**

1. Calculate the **sample mean** $\bar{x}$
2. Calculate the **standard deviation** $s$
3. Compute the **standard error**:

$$
SE = \frac{s}{\sqrt{n}}
$$

4. Find the **critical value** from the z or t table based on the confidence level
5. Compute the **margin of error**:

$$
ME = \text{critical value} \times SE
$$

6. Compute the **CI range**:

$$
\text{Lower bound} = \bar{x} - ME
$$

$$
\text{Upper bound} = \bar{x} + ME
$$
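
The six steps translate directly into a short function. This is a minimal sketch that uses a z critical value from `statistics.NormalDist`; the function name `mean_confidence_interval` and the readings are made up for illustration. For a small sample with unknown σ, the z value in step 4 would be swapped for a t value, for example from `scipy.stats.t.ppf`.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(data, confidence=0.95):
    """z-based confidence interval for the mean, one line per step."""
    x_bar = mean(data)                                      # step 1: sample mean
    s = stdev(data)                                         # step 2: sample SD
    se = s / sqrt(len(data))                                # step 3: standard error
    critical = NormalDist().inv_cdf((1 + confidence) / 2)   # step 4: z critical value
    me = critical * se                                      # step 5: margin of error
    return x_bar - me, x_bar + me                           # step 6: CI range

# Hypothetical measurements, invented for illustration
readings = [118, 122, 125, 119, 121, 124, 120, 123, 117, 121]
low, high = mean_confidence_interval(readings)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```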

---

# **Example**

Suppose:

* Sample mean = 100
* Sample SD = 15
* Sample size = 25
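* Confidence level = 95% → z ≈ 1.96 (used here for simplicity; with n = 25 and unknown population SD, the t value ≈ 2.064 for 24 degrees of freedom would give a slightly wider interval)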

**Step 1: Compute SE**

$$
SE = \frac{15}{\sqrt{25}} = 3
$$

**Step 2: Margin of Error**

$$
ME = 1.96 \times 3 = 5.88
$$

**Step 3: CI range**

$$
\text{Lower bound} = 100 - 5.88 = 94.12
$$

$$
\text{Upper bound} = 100 + 5.88 = 105.88
$$

**95% CI = [94.12, 105.88]**
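
The arithmetic of this example can be reproduced with the standard library. The second interval is a sketch of what the t-distribution would give for the same numbers; `T_CRIT_24 = 2.064` is an assumed table value for 24 degrees of freedom, not something computed here.

```python
from math import sqrt
from statistics import NormalDist

x_bar, s, n = 100, 15, 25

se = s / sqrt(n)                       # 15 / 5 = 3.0
z = NormalDist().inv_cdf(0.975)        # about 1.96 for a 95% interval
me = z * se
lower, upper = x_bar - me, x_bar + me
print(f"z-based 95% CI: [{lower:.2f}, {upper:.2f}]")

# Same data with a t critical value (assumed table value, 24 degrees of freedom)
T_CRIT_24 = 2.064
t_lower, t_upper = x_bar - T_CRIT_24 * se, x_bar + T_CRIT_24 * se
print(f"t-based 95% CI: [{t_lower:.2f}, {t_upper:.2f}]")
```

The t-based interval is slightly wider, reflecting the extra uncertainty from estimating the standard deviation with only 25 observations.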

---

# **Quick Notes**

* Higher confidence → wider CI
* Larger sample → smaller SE → narrower CI
* If **population SD unknown & small sample**, use **t-distribution** instead of z
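
The first two notes can be checked numerically. This sketch reuses the sample SD of 15 from the worked example; the helper name `ci_width` is hypothetical and the interval is z-based.

```python
from math import sqrt
from statistics import NormalDist

def ci_width(s, n, confidence):
    """Full width of a z-based confidence interval for the mean."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return 2 * z * s / sqrt(n)

s = 15  # same sample SD as in the worked example

# Higher confidence -> wider CI (sample size fixed at 25)
for confidence in (0.90, 0.95, 0.99):
    print(f"n = 25, {confidence:.0%} CI width: {ci_width(s, 25, confidence):.2f}")

# Larger sample -> smaller SE -> narrower CI (confidence fixed at 95%)
for n in (25, 100, 400):
    print(f"n = {n}, 95% CI width: {ci_width(s, n, 0.95):.2f}")
```

Because the standard error scales with 1/√n, quadrupling the sample size halves the width of the interval.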