# Statistics Part 2


# Q1. What is hypothesis testing in statistics?
Hypothesis testing is a statistical method used to make decisions or inferences about a population based on sample data. It helps you determine whether a claim or assumption about a population parameter is likely to be true.

Here’s how it works in a nutshell:

1. **Formulate two hypotheses**:
   - **Null hypothesis (H₀)**: This is the default assumption—usually that there is no effect or no difference.  
   - **Alternative hypothesis (H₁ or Ha)**: This is what you want to test for—suggesting there is an effect or a difference.

2. **Choose a significance level (α)**: Commonly set at 0.05, this represents the probability of rejecting the null hypothesis when it’s actually true (Type I error).

3. **Collect and analyze sample data**: You calculate a test statistic (like a z-score or t-score) based on your data.

4. **Compare the p-value to α**:
   - If **p-value ≤ α**, reject the null hypothesis (evidence supports the alternative).
   - If **p-value > α**, fail to reject the null hypothesis (not enough evidence to support the alternative).

5. **Draw a conclusion**: Based on the comparison, you decide whether your data supports the claim you're testing.

There are different types of tests—**one-tailed** (testing for a specific direction of effect) and **two-tailed** (testing for any difference, regardless of direction).
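
The five steps above can be sketched in a few lines. This is only an illustration using a one-sample Z-test: every number (`mu_0`, `sigma`, `n`, `sample_mean`) is a made-up assumption, not data from these notes.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary statistics (assumptions for illustration only)
mu_0 = 100.0         # step 1: mean claimed under H0 (H1: mean differs from 100)
sigma = 15.0         # population standard deviation, assumed known
n = 36               # sample size
sample_mean = 106.0  # observed sample mean
alpha = 0.05         # step 2: significance level

# Step 3: test statistic
z = (sample_mean - mu_0) / (sigma / sqrt(n))

# Step 4: two-tailed p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 5: decision
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"z = {z:.2f}, p = {p_value:.4f} -> {decision}")
```

With these assumed numbers the p-value falls below 0.05, so the sketch ends in "reject H0"; a sample mean closer to 100 would flip the decision.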



# Q2. What is the null hypothesis, and how does it differ from the alternative hypothesis?

Understanding the difference between the **null hypothesis** and the **alternative hypothesis** is key to mastering hypothesis testing.

### Null Hypothesis (H₀)
This is the default assumption or status quo. It states that **there is no effect, no difference, or no relationship** between variables. Think of it as the claim you're trying to test _against_.

Example:  
> “There is no difference in average test scores between students who study with music and those who don’t.”

### Alternative Hypothesis (H₁ or Ha)
This is the claim you’re trying to find evidence for. It suggests that **there is an effect, a difference, or a relationship**.

Example:  
> “Students who study with music score differently (higher or lower) than those who don’t.”

### Key Differences
| Feature | Null Hypothesis (H₀) | Alternative Hypothesis (H₁) |
|--------|----------------------|-----------------------------|
| **Meaning** | No effect or difference | Some effect or difference |
| **Goal** | Try to disprove or reject | Try to support |
| **Default Status** | Assumed true until evidence suggests otherwise | Accepted only when the evidence is strong enough to reject H₀ |
| **Symbol** | H₀ | H₁ or Ha |

In practice, we never “prove” the alternative—we just gather enough evidence to **reject the null**. It’s a bit like a courtroom: the null is “innocent until proven guilty.”



# Q3. What is the significance level in hypothesis testing, and why is it important?

The **significance level**, often denoted by **α (alpha)**, is the threshold you set in hypothesis testing to decide whether to reject the null hypothesis. It represents the **maximum probability of making a Type I error**—that is, rejecting the null hypothesis when it’s actually true.

### Why it matters:
- **Controls false positives**: A lower α (like 0.01) means you're being more cautious about claiming a result is statistically significant.
- **Sets the bar for evidence**: It defines how strong your sample evidence must be to confidently say, “This result probably didn’t happen by chance.”
- **Common choices**: Researchers often use α = 0.05, meaning they’re willing to accept a 5% chance of rejecting the null hypothesis when it is actually true.

### Real-world analogy:
Think of it like a courtroom. The significance level is your standard of proof—how sure you need to be before declaring someone guilty. A stricter α is like requiring more convincing evidence.




# Q4. What does a P-value represent in hypothesis testing?
The **p-value** in hypothesis testing tells you how likely it is to observe your sample results—or something more extreme—**if the null hypothesis were true**.

### In simple terms:
It answers the question: _“Assuming there’s no real effect, how surprising is this data?”_

- A **small p-value** (typically ≤ 0.05) means your data is **unlikely under the null hypothesis**, so you might reject the null.
- A **large p-value** suggests your data is **consistent with the null**, so you don’t have strong evidence to reject it.

### Example:
Let’s say you’re testing whether a new teaching method improves student scores.  
- **H₀**: The method has no effect.  
- You run a test and get a **p-value of 0.02**.  
- That means there’s a **2% chance** of seeing results this extreme if the method truly had no effect.  
- Since 0.02 < 0.05, you’d likely reject H₀ and say the method might be effective.

It’s important to remember: **a p-value doesn’t tell you the probability that the null hypothesis is true**—just how compatible your data is with it.





# Q5. How do you interpret the P-value in hypothesis testing?
Interpreting the **p-value** is all about understanding how surprising your data is—*assuming the null hypothesis is true*.

### Here's how to think about it:

- A **small p-value** (typically ≤ 0.05) means your data is **unlikely under the null hypothesis**. This gives you reason to **reject the null** and consider the alternative.
- A **large p-value** suggests your data is **plausible under the null**, so you **fail to reject** it.

### Example:
Let’s say you’re testing whether a new drug lowers blood pressure:
- **H₀**: The drug has no effect.
- You get a **p-value of 0.03**.
- That means there’s a **3% chance** of observing your results (or more extreme ones) if the drug truly had no effect.

Since 0.03 < 0.05, you’d reject H₀ and say the drug likely has an effect.

### Important notes:
- A **p-value is not** the probability that the null hypothesis is true.
- It also doesn’t measure the size or importance of an effect—just the strength of evidence *against* H₀.



The **P-value** is like a reality check for your hypothesis. It tells you how surprising your data is *if* the null hypothesis were actually true.

Here’s the breakdown:

- A **small P-value** (typically ≤ 0.05) means your data is *unlikely* under the null hypothesis. That gives you reason to **reject** the null — your results are statistically significant.
- A **large P-value** (> 0.05) means your data is *plausible* under the null. So, you **fail to reject** the null — no strong evidence for the alternative.

Think of it like this: if you flip a coin 100 times and get 90 heads, the P-value tells you how likely that is if the coin were fair. A tiny P-value would suggest the coin might be rigged.

It’s important to remember: the P-value doesn’t tell you the probability that the null is true — it tells you how compatible your data is with the null.
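
The coin example can be made concrete with an exact binomial calculation. The sketch below is one way to do it, assuming a two-sided test of a fair coin; the helper name `binomial_two_sided_p` is hypothetical.

```python
from math import comb

def binomial_two_sided_p(heads: int, flips: int) -> float:
    """Exact two-sided p-value for H0: the coin is fair (p = 0.5)."""
    # With p = 0.5 the distribution is symmetric, so "at least as extreme"
    # means at least as far from flips / 2 in either direction.
    distance = abs(heads - flips / 2)
    extreme = sum(comb(flips, k) for k in range(flips + 1)
                  if abs(k - flips / 2) >= distance)
    return extreme / 2 ** flips

p_90 = binomial_two_sided_p(90, 100)
p_55 = binomial_two_sided_p(55, 100)
print(f"90 heads out of 100: p = {p_90:.3g}")
print(f"55 heads out of 100: p = {p_55:.3g}")
```

Ninety heads gives a vanishingly small p-value, while 55 heads gives a large one, which matches the intuition that 55 heads is unremarkable for a fair coin.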




# Q6. What are Type I and Type II errors in hypothesis testing?
In hypothesis testing, **Type I and Type II errors** are the two classic mistakes you can make when drawing conclusions from data. Think of them as the statistical version of “false alarms” and “missed signals.”

---

### 🔹 Type I Error (False Positive)
This happens when you **reject the null hypothesis (H₀) even though it’s actually true**.

- **Analogy**: You think a fire alarm is going off because there’s a fire—but there isn’t.
- **Example**: A medical test says a patient has a disease when they actually don’t.
- **Probability of this error**: Denoted by **α (alpha)**, often set at 0.05.

---

### 🔹 Type II Error (False Negative)
This occurs when you **fail to reject the null hypothesis when it’s actually false**.

- **Analogy**: There’s a fire, but the alarm doesn’t go off.
- **Example**: A test fails to detect a disease that the patient actually has.
- **Probability of this error**: Denoted by **β (beta)**.

---

### 🧠 Quick Comparison

| Error Type | What Happens | Real-World Analogy | Controlled By |
|------------|--------------|--------------------|----------------|
| **Type I** | Rejecting a true H₀ | False alarm | Significance level (α) |
| **Type II** | Not rejecting a false H₀ | Missed detection | Power of the test (1 - β) |

---

Both errors are important to consider when designing experiments. Reducing one often increases the other, so it’s all about finding the right balance.
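
The trade-off between the two errors can be illustrated numerically. The sketch below assumes a right-tailed one-sample Z-test with made-up values (`mu_0 = 50`, a true mean of 52, `sigma = 5`, `n = 25`); the function name `type_ii_error` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu_0, mu_true, sigma, n, alpha):
    """Beta for a right-tailed one-sample Z-test of H0: mu = mu_0."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)        # rejection cutoff under H0
    shift = (mu_true - mu_0) / (sigma / sqrt(n))  # true effect in standard-error units
    return std_normal.cdf(z_crit - shift)         # P(fail to reject H0 | H0 is false)

betas = {}
for alpha in (0.10, 0.05, 0.01):
    betas[alpha] = type_ii_error(mu_0=50, mu_true=52, sigma=5, n=25, alpha=alpha)
    print(f"alpha = {alpha:.2f}  beta = {betas[alpha]:.3f}  power = {1 - betas[alpha]:.3f}")
```

As α shrinks, β grows for the same sample size, which is the balance described above. Increasing `n` is the usual way to reduce both at once.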





# Q7. What is the difference between a one-tailed and a two-tailed test in hypothesis testing?
The difference between **one-tailed** and **two-tailed** tests lies in the direction of the effect you're testing for in your hypothesis.

---

### 🔹 One-Tailed Test
This test checks for an effect in **only one direction**—either greater than or less than a certain value.

- **Alternative Hypothesis (H₁)**: The parameter is either **greater than** or **less than** the null value.
- **Use case**: When you have a strong reason to expect a specific direction.
- **Example**:  
  H₀: μ = 50  
  H₁: μ > 50 (right-tailed) → you're only interested if the mean is *greater* than 50.

---

### 🔸 Two-Tailed Test
This test checks for an effect in **both directions**—whether the parameter is **different** (either higher or lower) from the null value.

- **Alternative Hypothesis (H₁)**: The parameter is **not equal to** the null value.
- **Use case**: When you're open to detecting a difference in **either direction**.
- **Example**:  
  H₀: μ = 50  
  H₁: μ ≠ 50 → you're testing if the mean is *either* higher or lower than 50.

---

### 🧠 Quick Comparison

| Feature | One-Tailed Test | Two-Tailed Test |
|--------|------------------|------------------|
| **Direction** | One (left or right) | Both |
| **H₁ Symbol** | > or < | ≠ |
| **Critical Region** | One tail of the distribution | Both tails |
| **When to Use** | Specific directional claim | Any difference |

---

Choosing the right test depends on your research question. If you're testing whether a new teaching method improves scores (and only care about improvement), a one-tailed test might fit. But if you're checking whether it changes scores in *any* way, go two-tailed.
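
To see how the choice of tail changes the result, the following sketch computes all three p-values for the same test statistic. The value `z = 1.80` is a hypothetical example, and a standard normal reference distribution is assumed.

```python
from statistics import NormalDist

z = 1.80  # hypothetical test statistic
std_normal = NormalDist()

p_right = 1 - std_normal.cdf(z)           # H1: mu > mu_0 (right-tailed)
p_left = std_normal.cdf(z)                # H1: mu < mu_0 (left-tailed)
p_two = 2 * (1 - std_normal.cdf(abs(z)))  # H1: mu != mu_0 (two-tailed)

print(f"right-tailed p = {p_right:.4f}")
print(f"left-tailed  p = {p_left:.4f}")
print(f"two-tailed   p = {p_two:.4f}")
```

Here the right-tailed test is significant at α = 0.05 while the two-tailed test is not, which is why the direction should be fixed before looking at the data.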





# Q8. What is the Z-test, and when is it used in hypothesis testing?
The **Z-test** is a statistical method used in hypothesis testing to determine whether there's a significant difference between sample and population means—or between two sample means—**when the population standard deviation is known** and the sample size is large (typically **n > 30**).

---

### 🧪 When to Use a Z-Test:
- The data is **approximately normally distributed**.
- The **population standard deviation (σ)** is known.
- The **sample size is large** (n > 30).
- You're testing **means** or **proportions**.

---

### 🔹 Common Types of Z-Tests:
1. **One-sample Z-test**:  
   Tests whether the mean of a single sample differs from a known population mean.  
   Example: Is the average battery life of a new phone model different from the advertised 12 hours?

2. **Two-sample Z-test**:  
   Compares the means of two independent samples to see if they differ significantly.  
   Example: Do students from two different schools have different average test scores?

3. **Z-test for proportions**:  
   Used when comparing sample proportions to a population proportion or between two groups.

---

### 🧮 Z-Test Formula:
For a one-sample Z-test:
\[
Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
\]
Where:  
- \(\bar{x}\) = sample mean  
- \(\mu\) = population mean  
- \(\sigma\) = population standard deviation  
- \(n\) = sample size

---

### 🧠 Why it’s useful:
The Z-test leverages the **standard normal distribution** to calculate how far your sample statistic is from the population parameter in terms of standard deviations. If the result falls in the critical region (based on your α level), you reject the null hypothesis.
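
The formula above covers means. For the proportion variant listed among the common types, a minimal sketch could look like the following; it assumes the usual normal approximation with the standard error computed under H₀, and the function name and the numbers (60 successes out of 100, null proportion 0.5) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_proportion_ztest(successes, n, p_0):
    """Two-tailed Z-test of H0: population proportion equals p_0."""
    p_hat = successes / n
    standard_error = sqrt(p_0 * (1 - p_0) / n)  # uses p_0 because H0 is assumed true
    z = (p_hat - p_0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = one_sample_proportion_ztest(successes=60, n=100, p_0=0.5)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```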




# Q9. How do you calculate the Z-score, and what does it represent in hypothesis testing?
The **Z-score** (or **standard score**) tells you how many standard deviations a data point or sample mean is from the population mean. In hypothesis testing, it helps you determine whether your observed result is statistically significant under the assumption that the null hypothesis is true.

---

### 🧮 Z-Score Formula (for a sample mean):
\[
Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
\]

Where:  
- \(\bar{x}\) = sample mean  
- \(\mu\) = population mean  
- \(\sigma\) = population standard deviation  
- \(n\) = sample size

---

### 🔍 What It Represents:
- A **Z-score of 0** means your sample mean is exactly equal to the population mean.
- A **positive Z-score** means the sample mean is above the population mean.
- A **negative Z-score** means it’s below.
- The **larger the absolute value**, the more unusual the result is under the null hypothesis.

---

### 🧠 In Hypothesis Testing:
Once you calculate the Z-score, you compare it to a **critical value** (based on your significance level α) or use it to find a **p-value**. If the Z-score falls in the critical region (e.g., beyond ±1.96 for α = 0.05 in a two-tailed test), you reject the null hypothesis.
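
The critical values mentioned above do not have to come from a table. As a small sketch, they can be derived from the inverse CDF of the standard normal distribution; the Z-score `2.4` at the end is a hypothetical value.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

critical_values = {}
for alpha in (0.10, 0.05, 0.01):
    critical_values[alpha] = {
        "one_tailed": std_normal.inv_cdf(1 - alpha),
        "two_tailed": std_normal.inv_cdf(1 - alpha / 2),
    }
    print(f"alpha = {alpha:.2f}: "
          f"one-tailed {critical_values[alpha]['one_tailed']:.3f}, "
          f"two-tailed {critical_values[alpha]['two_tailed']:.3f}")

z = 2.4  # hypothetical Z-score from a sample
reject_h0 = abs(z) > critical_values[0.05]["two_tailed"]
print("reject H0" if reject_h0 else "fail to reject H0")
```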



# Q10. What is the T-distribution, and when should it be used instead of the normal distribution?
The **t-distribution** (or **Student’s t-distribution**) is a probability distribution that’s similar to the normal distribution but has **heavier tails**—meaning it’s more prone to producing values that fall far from its mean. This makes it especially useful when dealing with **small sample sizes** or **unknown population standard deviations**.

---

### 🧪 When to Use the T-Distribution:
Use the t-distribution **instead of the normal distribution** when:
- The **sample size is small** (typically **n ≤ 30**).
- The **population standard deviation (σ)** is **unknown**.
- The data is **approximately normally distributed**.

---

### 🔍 Why It Matters:
When sample sizes are small, we have more uncertainty about the population parameters. The t-distribution accounts for this by spreading out more (wider tails), which leads to **more conservative estimates**—like wider confidence intervals or larger critical values.

As the sample size increases, the t-distribution **approaches the normal distribution**. So for large samples, the difference becomes negligible.
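
This convergence is easy to check numerically. The sketch below assumes SciPy is available (the practical section of these notes already imports `scipy.stats`) and compares two-tailed critical values at α = 0.05 for a few arbitrarily chosen degrees of freedom.

```python
from scipy.stats import norm, t

z_crit = norm.ppf(0.975)  # two-tailed critical value at alpha = 0.05

t_crits = {}
for df in (2, 5, 10, 30, 100, 1000):
    t_crits[df] = t.ppf(0.975, df)
    print(f"df = {df:>4}: t critical = {t_crits[df]:.3f}   z critical = {z_crit:.3f}")
```

The t critical value is always larger than the Z value, and the gap shrinks steadily as the degrees of freedom grow.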

---

### 🧠 Quick Comparison:

| Feature | Normal Distribution | T-Distribution |
|--------|----------------------|----------------|
| Shape | Bell-shaped, thinner tails | Bell-shaped, heavier tails |
| Use When | σ is known, large n | σ is unknown, small n |
| Critical Values | Smaller | Larger (more conservative) |
| Degrees of Freedom | Not needed | Required (n - 1) |

---



# Q11. What is the difference between a Z-test and a T-test?
The **Z-test** and **T-test** are both used in hypothesis testing to compare means, but they differ in when and how they’re applied. Here's a clear breakdown:

---

### 🔍 **Key Differences**

| Feature | **Z-Test** | **T-Test** |
|--------|------------|------------|
| **When to Use** | Large sample size (**n ≥ 30**) | Small sample size (**n < 30**) |
| **Population Standard Deviation (σ)** | **Known** | **Unknown** |
| **Distribution Used** | Standard **normal distribution** | **t-distribution** (heavier tails) |
| **Test Statistic** | Z-score | t-score |
| **Degrees of Freedom** | Not required | Required (usually **n - 1**) |

---

### 🧠 Why It Matters:
- The **Z-test** is more precise when σ is known and the sample is large—it assumes less uncertainty.
- The **T-test** is more flexible and conservative, especially with small samples or when σ is unknown—it accounts for extra variability.

---

### 🧪 Real-World Analogy:
Imagine you're estimating the average height of students:
- If you know the **exact population standard deviation** and have data from **100 students**, use a **Z-test**.
- If you're working with **just 15 students** and don’t know σ, go with a **T-test**.

---



# Q12. What is the T-test, and how is it used in hypothesis testing?
The **t-test** is a statistical method used in hypothesis testing to determine whether there's a **significant difference between the means of two groups**—especially when the **sample size is small** and the **population standard deviation is unknown**.

---

### 🔍 What It Does:
It tests whether the observed difference between sample means (or between a sample mean and a known value) is likely due to chance, or if it's statistically meaningful.

---

### 🧪 How It’s Used in Hypothesis Testing:

1. **Set up hypotheses**  
   - **Null hypothesis (H₀)**: There is no difference between the means.  
   - **Alternative hypothesis (H₁)**: There is a difference.

2. **Choose the type of t-test**  
   - **One-sample t-test**: Compare a sample mean to a known value.  
   - **Independent two-sample t-test**: Compare means of two independent groups.  
   - **Paired sample t-test**: Compare means from the same group at different times (e.g., before and after treatment).

3. **Calculate the t-statistic**  
   It measures how far your sample result is from the null hypothesis, in terms of standard error.

4. **Determine the p-value**  
   Based on the t-distribution and degrees of freedom, you find the probability of observing such a result if H₀ were true.

5. **Make a decision**  
   - If **p-value ≤ α** (e.g., 0.05), reject H₀ → significant difference.  
   - If **p-value > α**, fail to reject H₀ → no significant difference.

---

### 📌 Example:
Suppose you want to test if a new teaching method improves test scores. You collect scores from 20 students before and after using the method. A **paired t-test** can help you determine if the improvement is statistically significant.
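
A paired t-test like the one described could be sketched with `scipy.stats.ttest_rel`. The scores below are invented for illustration, and only 10 students are used instead of 20 to keep the listing short.

```python
from scipy.stats import ttest_rel

# Hypothetical before/after scores for the same 10 students
before = [62, 70, 55, 68, 74, 60, 66, 71, 58, 65]
after = [66, 73, 59, 67, 80, 64, 70, 72, 63, 69]

alpha = 0.05
result = ttest_rel(after, before)  # two-sided test on the paired differences

print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
if result.pvalue <= alpha:
    print("Reject H0: the mean score changed after the new method.")
else:
    print("Fail to reject H0: no significant change detected.")
```

Because the test is two-sided, a significant result only says the mean changed; the positive sign of the t-statistic is what indicates an improvement in this made-up data.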

---



# Q13. What is the relationship between Z-test and T-test in hypothesis testing?
The **Z-test** and **T-test** are closely related—they’re both used to test hypotheses about population means—but they differ mainly in the assumptions they make and the situations in which they’re used.

---

### 🔗 **Their Relationship at a Glance**

| Aspect | **Z-Test** | **T-Test** |
|--------|------------|------------|
| **Used When** | Sample size is **large** (n ≥ 30) | Sample size is **small** (n < 30) |
| **Population Standard Deviation (σ)** | **Known** | **Unknown** |
| **Distribution** | Standard **normal distribution** | **t-distribution** (with heavier tails) |
| **Precision** | More precise with large samples | More conservative with small samples |
| **Converges to Z** | — | As sample size increases, **t-distribution approaches normal** |

---

### 🧠 How They’re Connected:
- Both tests compare sample data to a population parameter to assess statistical significance.
- The **t-test is essentially a generalization of the Z-test**—it’s used when you don’t know the population standard deviation and must estimate it from the sample.
- As your sample size grows, the **t-distribution becomes nearly identical to the normal distribution**, so the **t-test and Z-test give nearly the same results**.

---

### 🎯 In Practice:
If you know σ and have a large sample, use a **Z-test**.  
If σ is unknown or your sample is small, go with a **T-test**.



# Q14. What is a confidence interval, and how is it used to interpret statistical results?
A **confidence interval (CI)** is a range of values that’s used to estimate an unknown population parameter—like a mean or proportion—based on sample data. Instead of giving a single number (a point estimate), it gives a **range** that likely contains the true value, along with a **confidence level** that quantifies how sure we are.

---

### 🔍 What It Means:
If you calculate a **95% confidence interval** for the average height of students and get **160–170 cm**, it means:

> “We are 95% confident that the true average height of all students lies between 160 and 170 cm.”

This doesn’t mean there’s a 95% chance that the true value lies in this particular range. It means that if you repeated the sampling process many times and built an interval each time, about **95% of those intervals** would contain the true value.

---

### 🧮 Formula (for a mean):
\[
\text{Confidence Interval} = \bar{x} \pm (\text{Critical Value} \times \text{Standard Error})
\]
Where:
- \(\bar{x}\) = sample mean  
- Critical Value = from Z or t-distribution (based on confidence level)  
- Standard Error = \(\frac{\text{Standard Deviation}}{\sqrt{n}}\)
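
As a minimal sketch of the formula, the function below builds a Z-based interval. It assumes the standard deviation is known (for small samples with an estimated standard deviation, a t critical value would be used instead), and the inputs (mean 165, standard deviation 12, n = 36) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, std_dev, n, confidence=0.95):
    """Confidence interval for a mean using a Z critical value."""
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = std_dev / sqrt(n)
    margin = critical * standard_error
    return sample_mean - margin, sample_mean + margin

low, high = z_confidence_interval(sample_mean=165.0, std_dev=12.0, n=36)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```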

---

### 📌 Why It’s Useful:
- **Quantifies uncertainty** in estimates
- Helps assess **reliability** of results
- Widely used in **A/B testing**, **survey analysis**, and **machine learning**

---

# Q15. What is the margin of error, and how does it affect the confidence interval?
The **margin of error** is the amount you're allowing for uncertainty in your estimate—it's like a buffer zone around your sample statistic that accounts for sampling variability.

---

### 🔍 What It Represents:
In a confidence interval, the margin of error defines **how far above or below** the sample estimate the true population parameter might be.

For example, if a survey finds that 60% of people support a policy with a **margin of error of ±3%**, the confidence interval is **57% to 63%**. That means you're reasonably confident the true support lies within that range.

---

### 🧮 How It’s Calculated:
\[
\text{Margin of Error} = \text{Critical Value} \times \text{Standard Error}
\]

- **Critical Value**: Based on your confidence level (e.g., 1.96 for 95% confidence using a Z-distribution).
- **Standard Error**: Reflects the variability in your sample.

---

### 🎯 How It Affects the Confidence Interval:
The **confidence interval** is built around your sample statistic like this:
\[
\text{Confidence Interval} = \text{Sample Estimate} \pm \text{Margin of Error}
\]

So, a **larger margin of error** means a **wider interval**—more uncertainty. A **smaller margin of error** means a **narrower interval**—more precision.

---

### 🧠 What Influences the Margin of Error:
- **Sample size**: Larger samples → smaller margin of error.
- **Confidence level**: Higher confidence (like 99%) → larger margin of error.
- **Population variability**: More variability → larger margin of error.
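
The first two influences can be checked with a short calculation. The sketch below assumes a survey proportion of 60% (as in the example above) and the normal approximation for proportions; the sample sizes are arbitrary choices and `margin_of_error` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(p_hat, n, confidence=0.95):
    """Margin of error for a sample proportion (normal approximation)."""
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    return critical * standard_error

moe_1000 = margin_of_error(0.60, n=1000)
moe_4000 = margin_of_error(0.60, n=4000)
moe_1000_99 = margin_of_error(0.60, n=1000, confidence=0.99)

print(f"n = 1000, 95%: +/- {moe_1000:.3%}")
print(f"n = 4000, 95%: +/- {moe_4000:.3%}")
print(f"n = 1000, 99%: +/- {moe_1000_99:.3%}")
```

Quadrupling the sample size halves the margin of error, because the standard error scales with 1/√n, while moving from 95% to 99% confidence widens it.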

---


# Q16. How is Bayes' Theorem used in statistics, and what is its significance?
Bayes’ Theorem is like a statistical compass—it helps you **update your beliefs** when new evidence comes in. It’s a cornerstone of **Bayesian inference**, which is all about refining probabilities as you learn more.

---

### 🔁 The Formula:
\[
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
\]

Where:
- **P(A)** = Prior probability (your belief before seeing the data)
- **P(B|A)** = Likelihood (how likely the evidence is if A is true)
- **P(B)** = Marginal probability of the evidence
- **P(A|B)** = Posterior probability (your updated belief after seeing the data)

---

### 🧠 Why It’s Significant:
- **Incorporates prior knowledge**: Unlike classical (frequentist) methods, Bayes’ Theorem lets you start with an assumption and refine it.
- **Adapts with new data**: It’s dynamic—perfect for real-world situations where information evolves.
- **Used in many fields**: From **medical diagnostics** (e.g., updating disease probability after a test result) to **machine learning**, **spam filtering**, and **decision-making under uncertainty**.

---

### 📌 Real-World Example:
Suppose a disease affects 1% of a population. A test is 99% accurate (taken here to mean 99% sensitivity and 99% specificity). If someone tests positive, what’s the chance they actually have the disease?  
Bayes’ Theorem helps you combine the **base rate** (1%) with the **test accuracy** to get a more realistic answer, often much lower than you'd expect: about 50% under this reading of "accurate".
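
The calculation can be written out directly. This sketch assumes "99% accurate" means both sensitivity and specificity are 0.99, which the example does not state explicitly; the function name is hypothetical.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' Theorem."""
    p_pos_given_disease = sensitivity        # P(B|A)
    p_pos_given_healthy = 1 - specificity    # false positive rate
    # P(B): total probability of testing positive
    p_positive = (p_pos_given_disease * prevalence
                  + p_pos_given_healthy * (1 - prevalence))
    return p_pos_given_disease * prevalence / p_positive

posterior = posterior_given_positive(prevalence=0.01,
                                     sensitivity=0.99,
                                     specificity=0.99)
print(f"P(disease | positive) = {posterior:.3f}")
```

The true positives (1% × 99%) and false positives (99% × 1%) are equally common here, so only about half of the positive results come from people who actually have the disease.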

---



# Q17. What is the Chi-square distribution, and when is it used?
The **Chi-square (χ²) distribution** is a continuous probability distribution that arises when you sum the squares of independent standard normal variables. In simpler terms, if you take several values from a standard normal distribution, square them, and add them up—you get a Chi-square distributed value.

---

### 🔍 What It Represents:
If \( Z_1, Z_2, ..., Z_k \) are independent standard normal variables, then:
\[
\chi^2 = Z_1^2 + Z_2^2 + \dots + Z_k^2
\]
This sum follows a Chi-square distribution with **k degrees of freedom** (df), where *k* is the number of variables.
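
A quick simulation can make this definition tangible. The sketch below sums the squares of `k = 3` standard normal draws many times; the seed, `k`, and the number of draws are arbitrary choices. It relies on the known facts that a Chi-square distribution with k degrees of freedom has mean k and variance 2k.

```python
import random
from statistics import mean, variance

random.seed(0)    # fixed seed so the run is repeatable
k = 3             # degrees of freedom
n_draws = 20_000

# Each sample is Z_1^2 + ... + Z_k^2 for independent standard normal Z_i
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(k))
           for _ in range(n_draws)]

print(f"simulated mean     = {mean(samples):.3f} (theory: {k})")
print(f"simulated variance = {variance(samples):.3f} (theory: {2 * k})")
print(f"smallest value     = {min(samples):.4f} (never negative)")
```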

---

### 🧪 When It’s Used:
The Chi-square distribution is a workhorse in **hypothesis testing**, especially for **categorical data**. Here are the most common applications:

1. **Chi-square test of independence**  
   - Checks if two categorical variables are related.  
   - Example: Is there a relationship between gender and voting preference?

2. **Chi-square goodness-of-fit test**  
   - Tests whether observed frequencies match expected frequencies.  
   - Example: Do dice rolls follow a uniform distribution?

3. **Test for population variance**  
   - When the underlying population is normal, you can use the Chi-square distribution to test hypotheses about variance.

---

### 📌 Key Properties:
- **Non-negative**: Values are always ≥ 0.
- **Right-skewed**: Especially with fewer degrees of freedom.
- **As df increases**: The distribution becomes more symmetric and approaches a normal distribution.

---




# Q18. What is the Chi-square goodness of fit test, and how is it applied?
The **Chi-square goodness of fit test** is a statistical method used to determine whether the distribution of a **categorical variable** in your sample matches an expected distribution. In other words, it helps answer the question: _“Does what I observed differ significantly from what I expected?”_

---

### 🧪 When to Use It:
- You have **one categorical variable** with two or more levels (e.g., colors, brands, preferences).
- You want to test whether the observed frequencies match a **theoretical or expected distribution**.

---

### 🧮 The Formula:
\[
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
\]
Where:
- \(O_i\) = observed frequency for category *i*  
- \(E_i\) = expected frequency for category *i*

---

### 🧠 Hypotheses:
- **Null hypothesis (H₀)**: The observed distribution matches the expected distribution.
- **Alternative hypothesis (H₁)**: The observed distribution does not match the expected distribution.

---

### 📌 Example:
Suppose a company claims that customers are equally likely to choose one of three new dog food flavors. You test this with 75 dogs and observe:

- Flavor A: 30  
- Flavor B: 20  
- Flavor C: 25  

Expected frequency for each (if equally likely) = 75 ÷ 3 = 25

You’d plug these into the formula to calculate the Chi-square statistic, then compare it to a critical value (or use a p-value) to decide whether to reject H₀.
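
Carrying the example through, a minimal sketch could look like this. It uses only the standard library by relying on a shortcut that holds for 2 degrees of freedom only, where the tail probability is exp(-x/2); for other degrees of freedom a statistics library or table would be needed.

```python
from math import exp, log

observed = {"A": 30, "B": 20, "C": 25}
total = sum(observed.values())
expected = total / len(observed)  # 25 per flavor under H0

chi_square = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1            # 3 categories -> 2 degrees of freedom

# Closed form valid for df = 2 only: P(X > x) = exp(-x / 2)
p_value = exp(-chi_square / 2)
critical_value = -2 * log(0.05)   # about 5.99 at alpha = 0.05, df = 2

print(f"chi-square = {chi_square:.2f}, df = {df}, p = {p_value:.4f}")
print("reject H0" if chi_square > critical_value else "fail to reject H0")
```

With a statistic of 2.0 and a p-value near 0.37, the observed counts are consistent with equal preference, so H₀ is not rejected.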

---


# Q19. What is the F-distribution, and when is it used in hypothesis testing?
The **F-distribution** is a continuous probability distribution that arises frequently in statistics, especially when comparing **variances** or evaluating **multiple group means**. It’s shaped like a skewed bell curve and is defined by two parameters: the **degrees of freedom** for the numerator and denominator.

---

### 🔍 What It Represents:
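If you take two independent chi-square variables, divide each by its degrees of freedom, and form the ratio, that ratio follows an F-distribution. Mathematically:
\[
F = \frac{\chi_1^2 / \text{df}_1}{\chi_2^2 / \text{df}_2}
\]
When comparing two samples from normal populations with equal variances, this reduces to the ratio of the sample variances, \( F = S_1^2 / S_2^2 \), with \( \text{df}_1 = n_1 - 1 \) and \( \text{df}_2 = n_2 - 1 \).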

---

### 🧪 When It’s Used in Hypothesis Testing:

1. **Analysis of Variance (ANOVA)**  
   - To test whether **three or more group means** are significantly different.  
   - Example: Comparing test scores across four different teaching methods.

2. **Comparing Two Variances**  
   - To test if the **variances of two populations** are equal.  
   - Example: Checking if two machines produce items with the same consistency.

3. **Regression Analysis**  
   - To test the **overall significance** of a regression model.  
   - It helps determine if your independent variables explain a significant portion of the variance in the dependent variable.
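
The variance comparison in item 2 could be sketched as follows. The measurements for the two machines are invented, SciPy is assumed to be installed, and the two-sided p-value is formed by doubling the smaller tail, which is one common convention. Note that this test is sensitive to departures from normality.

```python
from statistics import variance
from scipy.stats import f

# Hypothetical part lengths (mm) from two machines
machine_a = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.4, 9.7]
machine_b = [10.0, 10.1, 9.9, 10.0, 10.2, 9.9, 10.1, 10.0]

var_a, var_b = variance(machine_a), variance(machine_b)  # sample variances
f_stat = var_a / var_b
df_a, df_b = len(machine_a) - 1, len(machine_b) - 1

# Two-sided p-value for H0: the population variances are equal
p_two_sided = min(1.0, 2 * min(f.cdf(f_stat, df_a, df_b),
                               f.sf(f_stat, df_a, df_b)))

print(f"F = {f_stat:.2f}, df = ({df_a}, {df_b}), p = {p_two_sided:.4f}")
```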

---

### 📌 Key Characteristics:
- **Always non-negative** (F ≥ 0)
- **Right-skewed**, especially with small degrees of freedom
- As degrees of freedom increase, it becomes more symmetric

---

The F-distribution is like the referee in a statistical match—it helps you decide whether the differences you see are just noise or something meaningful.



# Q20. What is an ANOVA test, and what are its assumptions?
The **ANOVA test** (Analysis of Variance) is a statistical method used to determine whether there are **significant differences between the means of three or more independent groups**. Instead of comparing means pairwise (like multiple t-tests), ANOVA evaluates all groups simultaneously, helping control the risk of Type I errors.

---

### 🧪 When to Use ANOVA:
- You have **one continuous dependent variable** (e.g., test scores).
- You have **one or more categorical independent variables** (e.g., teaching methods, diet types).
- You want to test if **group means differ significantly**.

---

### 🔍 Key Assumptions of ANOVA:

1. **Independence of observations**  
   - Each data point should be collected independently.  
   - For example, one person’s test score shouldn’t influence another’s.

2. **Normality**  
   - The data in each group should be **approximately normally distributed**.  
   - This can be checked using Q-Q plots or tests like Shapiro-Wilk.

3. **Homogeneity of variances (homoscedasticity)**  
   - The **variances across groups should be roughly equal**.  
   - You can test this with **Levene’s test** or **Bartlett’s test**.

---

### 🧠 Why These Assumptions Matter:
Violating them can lead to **misleading results**. For example, if variances are unequal, ANOVA might detect a difference that’s not really there—or miss one that is.

If assumptions are violated, you might consider:
- **Transforming the data**
- Using **Welch’s ANOVA** (for unequal variances)
- Switching to **non-parametric tests** like the Kruskal-Wallis test
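
One possible way to wire these checks together is sketched below using `scipy.stats`. The group scores are invented, and choosing the test from preliminary checks at a fixed 0.05 cutoff is a simplification for illustration rather than a recommended procedure. Welch's ANOVA is left out because it is not assumed to be available in SciPy.

```python
from scipy.stats import f_oneway, kruskal, levene, shapiro

# Hypothetical test scores for three teaching methods
groups = {
    "method_a": [78, 82, 85, 80, 79, 84],
    "method_b": [88, 90, 85, 87, 91, 89],
    "method_c": [75, 72, 78, 74, 77, 73],
}
samples = list(groups.values())

# Assumption checks: normality within each group, equal variances across groups
normal_ok = all(shapiro(g).pvalue > 0.05 for g in samples)
equal_var_ok = levene(*samples).pvalue > 0.05

if normal_ok and equal_var_ok:
    test_name, result = "one-way ANOVA", f_oneway(*samples)
else:
    test_name, result = "Kruskal-Wallis", kruskal(*samples)

print(f"{test_name}: statistic = {result.statistic:.2f}, p = {result.pvalue:.4g}")
```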

---




# Q21. What are the different types of ANOVA tests?
There are several types of **ANOVA (Analysis of Variance)** tests, each designed for different experimental setups and research questions. Here's a breakdown of the most common ones:

---

### 🔹 **One-Way ANOVA**
- **Purpose**: Compares the means of **three or more groups** based on **one independent variable**.
- **Example**: Testing whether different fertilizers affect plant growth differently.

---

### 🔸 **Two-Way ANOVA**
- **Purpose**: Examines the effect of **two independent variables** on a dependent variable, and whether there’s an **interaction** between them.
- **Example**: Studying how both teaching method and class time affect student performance.

---

### 🔁 **Repeated Measures ANOVA**
- **Purpose**: Used when the **same subjects** are measured **multiple times** under different conditions or time points.
- **Example**: Measuring blood pressure of patients before, during, and after treatment.

---

### 🧩 **Factorial ANOVA**
- **Purpose**: A generalization of two-way ANOVA that handles **more than two factors**, each with multiple levels.
- **Example**: Analyzing the effects of diet, exercise, and sleep on weight loss.

---

### 🎯 **MANOVA (Multivariate ANOVA)**
- **Purpose**: Extends ANOVA when there are **multiple dependent variables**.
- **Example**: Testing how different therapies affect both anxiety and depression scores.

---

Each type of ANOVA helps you answer slightly different questions about group differences and interactions.


# Q22. What is the F-test, and how does it relate to hypothesis testing?
The **F-test** is a statistical test used to compare **variances** or assess the **overall significance** of models, especially in **hypothesis testing** involving multiple groups or variables.

---

### 🔍 What It Does:
At its core, the F-test evaluates whether the **variability between groups** is significantly greater than the **variability within groups**. It uses the **F-distribution**, which is right-skewed and depends on two degrees of freedom: one for the numerator and one for the denominator.

---

### 🧪 Common Uses in Hypothesis Testing:

1. **Comparing Two Variances**  
   - Tests if two populations have equal variances.  
   - **H₀**: σ₁² = σ₂²  
   - **H₁**: σ₁² ≠ σ₂²

2. **ANOVA (Analysis of Variance)**  
   - Tests if **three or more group means** are significantly different.  
   - The F-statistic compares **between-group variance** to **within-group variance**.

3. **Regression Analysis**  
   - Tests whether a regression model explains a significant portion of the variance in the dependent variable.  
   - **H₀**: All regression coefficients = 0 (no effect)  
   - **H₁**: At least one coefficient ≠ 0 (model is significant)

---

### 🧮 F-Statistic Formula:
\[
F = \frac{\text{Variance between groups}}{\text{Variance within groups}}
\]
A **larger F-value** suggests that the group means are more spread out than you'd expect by chance—possibly indicating a significant effect.
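
To make the ratio concrete, here is a minimal sketch that computes the one-way ANOVA F-statistic by hand and checks it against `scipy.stats.f_oneway`. The three groups are made-up illustration data, and the variable names are only for this example.

```python
import numpy as np
from scipy.stats import f, f_oneway

# Hypothetical scores for three groups (illustration only)
groups = [
    np.array([85, 88, 90, 87, 86]),
    np.array([78, 82, 80, 79, 81]),
    np.array([92, 94, 91, 93, 95]),
]

k = len(groups)                         # number of groups
n_total = sum(len(g) for g in groups)   # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)        # variance between groups
ms_within = ss_within / (n_total - k)    # variance within groups
f_manual = ms_between / ms_within
p_manual = f.sf(f_manual, k - 1, n_total - k)  # upper-tail probability

f_scipy, p_scipy = f_oneway(*groups)
print(f"Manual F = {f_manual:.4f}, SciPy F = {f_scipy:.4f}")
print(f"Manual p = {p_manual:.6f}, SciPy p = {p_scipy:.6f}")
```

The numerator degrees of freedom are `k - 1` and the denominator degrees of freedom are `n_total - k`, which are the two parameters of the F-distribution mentioned above.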

---

### 📌 In Summary:
The F-test is like a referee in hypothesis testing—it helps you decide whether the differences you observe are **statistically meaningful** or just random noise.



#<<< Practical >>>

# Q1. Write a Python program to perform a Z-test for comparing a sample mean to a known population mean and interpret the results.
Here's a Python program that performs a **one-sample Z-test** to compare a sample mean to a known population mean. It includes calculation, decision-making, and interpretation:

---

### 🧪 Problem Setup:
Suppose we’re testing whether the average height of a new plant variety differs from the known population mean of **50 cm**, assuming we know the population standard deviation (σ = 5). Here's a simple dataset from our sample:



```python
import numpy as np
from scipy.stats import norm

# Sample data
sample_data = [52, 49, 51, 53, 50, 48, 54, 55, 49, 51]
population_mean = 50  # μ
population_std = 5    # σ
alpha = 0.05          # significance level

# Step 1: Calculate sample statistics
sample_mean = np.mean(sample_data)
sample_size = len(sample_data)
standard_error = population_std / np.sqrt(sample_size)

# Step 2: Calculate the Z-score
z_score = (sample_mean - population_mean) / standard_error

# Step 3: Calculate the p-value (two-tailed)
p_value = 2 * (1 - norm.cdf(abs(z_score)))

# Step 4: Print results and interpret
print(f"Sample Mean: {sample_mean:.2f}")
print(f"Z-Score: {z_score:.4f}")
print(f"P-Value: {p_value:.4f}")

if p_value < alpha:
    print("Result: Reject the null hypothesis. The sample mean is significantly different from the population mean.")
else:
    print("Result: Fail to reject the null hypothesis. No significant difference found.")
```

Output:

```
Sample Mean: 51.20
Z-Score: 0.7589
P-Value: 0.4479
Result: Fail to reject the null hypothesis. No significant difference found.
```




---

### 🧠 Interpretation:
Here the **p-value** (0.4479) is well above 0.05, so we fail to reject the null hypothesis: the sample gives **no significant evidence** that the mean height differs from 50 cm. Had the p-value been below 0.05, we would have rejected the null hypothesis and concluded that the sample mean is **statistically significantly different** from 50.

---




# Q2. Simulate random data to perform hypothesis testing and calculate the corresponding P-value using Python.
Let’s simulate a scenario where we generate random data and perform a **one-sample t-test** to see if the sample mean significantly differs from a known population mean. Here's a complete Python example:

---

### 🧪 Scenario:
Suppose the average test score in a population is **75**, and we want to test if a randomly selected sample of students has a different average.

```python
import numpy as np
from scipy import stats

# Step 1: Simulate random sample data
np.random.seed(42)  # for reproducibility
sample = np.random.normal(loc=78, scale=10, size=30)  # mean=78, std=10, n=30

# Step 2: Define population mean
population_mean = 75

# Step 3: Perform one-sample t-test
t_statistic, p_value = stats.ttest_1samp(sample, population_mean)

# Step 4: Print results
print(f"Sample Mean: {np.mean(sample):.2f}")
print(f"T-Statistic: {t_statistic:.4f}")
print(f"P-Value: {p_value:.4f}")

# Step 5: Interpret the result
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis. The sample mean is significantly different from the population mean.")
else:
    print("Result: Fail to reject the null hypothesis. No significant difference found.")
```

---

### 🧠 What’s Happening:
- We simulate a sample of 30 students with a mean around 78.
- We test whether this sample mean is significantly different from the population mean of 75.
- The **p-value** tells us whether the observed difference is statistically significant.
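
Simulation also helps show what the significance level means. The sketch below is an illustration with assumed settings (2,000 repeated experiments, 30 students each, all drawn with the null hypothesis true). It counts how often the t-test rejects anyway, and that share should land close to α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n, mu0, alpha = 2000, 30, 75, 0.05

# Each row is one simulated experiment in which H0 is true (true mean == mu0)
samples = rng.normal(loc=mu0, scale=10, size=(n_experiments, n))

# Run a one-sample t-test on every row at once
_, p_values = stats.ttest_1samp(samples, mu0, axis=1)

# Share of experiments that (wrongly) reject H0: an estimate of the Type I error rate
false_positive_rate = np.mean(p_values < alpha)
print(f"Share of experiments with p < {alpha}: {false_positive_rate:.3f}")
```

Because every rejection here is a false alarm, the printed share is an empirical estimate of the Type I error rate.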



# Q3. Implement a one-sample Z-test using Python to compare the sample mean with the population mean.
Here's a Python implementation of a **one-sample Z-test** using `statsmodels`. Note that `ztest()` estimates the standard deviation from the sample rather than taking a known population value, so it is best suited to **reasonably large samples**; the 10-value sample below is only for illustration:

---

### ✅ One-Sample Z-Test in Python

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

# Simulated sample data (e.g., test scores)
sample = np.array([82, 85, 88, 90, 87, 84, 86, 89, 91, 83])

# Known population mean
population_mean = 85

# Perform one-sample Z-test
z_stat, p_value = ztest(sample, value=population_mean)

# Output results
print(f"Sample Mean: {np.mean(sample):.2f}")
print(f"Z-Statistic: {z_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis. The sample mean is significantly different from the population mean.")
else:
    print("Result: Fail to reject the null hypothesis. No significant difference found.")
```

---

### 🧠 What’s Happening:
- We're testing whether the sample mean differs from the population mean of 85.
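- The `ztest()` function computes the statistic from the sample mean and the sample standard deviation, then uses the standard normal distribution to get the p-value.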
- The **p-value** tells us whether the difference is statistically significant.
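
When the population standard deviation really is known, the textbook Z-test can be written directly. The function below is a small sketch, not part of any library: the name `one_sample_ztest` and the value `sigma=3` are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

def one_sample_ztest(sample, mu0, sigma, alternative="two-sided"):
    """One-sample Z-test with a known population standard deviation (sigma)."""
    sample = np.asarray(sample, dtype=float)
    z = (sample.mean() - mu0) / (sigma / np.sqrt(sample.size))
    if alternative == "two-sided":
        p = 2 * norm.sf(abs(z))   # both tails
    elif alternative == "larger":
        p = norm.sf(z)            # upper tail only
    elif alternative == "smaller":
        p = norm.cdf(z)           # lower tail only
    else:
        raise ValueError("alternative must be 'two-sided', 'larger' or 'smaller'")
    return z, p

scores = [82, 85, 88, 90, 87, 84, 86, 89, 91, 83]
z, p = one_sample_ztest(scores, mu0=85, sigma=3)  # sigma=3 is an assumed value
print(f"Z-Statistic: {z:.4f}")
print(f"P-Value: {p:.4f}")
```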

# Q4. Perform a two-tailed Z-test using Python and visualize the decision region on a plot.

---

### 🧪 Scenario:
Suppose the population mean is 100 with a known standard deviation of 15. You collect a sample of 50 observations with a sample mean of 106. You want to test if the sample mean is significantly different from the population mean at a 5% significance level.

---

### ✅ Python Code with Visualization

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Parameters
population_mean = 100
population_std = 15
sample_mean = 106
sample_size = 50
alpha = 0.05

# Z-test calculation
standard_error = population_std / np.sqrt(sample_size)
z_score = (sample_mean - population_mean) / standard_error
p_value = 2 * (1 - norm.cdf(abs(z_score)))

# Critical z-values for two-tailed test
z_critical = norm.ppf(1 - alpha/2)

# Print results
print(f"Z-Score: {z_score:.4f}")
print(f"P-Value: {p_value:.4f}")
print(f"Z-Critical: ±{z_critical:.4f}")

# Visualization
x = np.linspace(-4, 4, 1000)
y = norm.pdf(x)

plt.figure(figsize=(10, 5))
plt.plot(x, y, label='Standard Normal Distribution', color='blue')

# Shade rejection regions
plt.fill_between(x, y, where=(x <= -z_critical), color='red', alpha=0.5, label='Rejection Region (Left)')
plt.fill_between(x, y, where=(x >= z_critical), color='red', alpha=0.5, label='Rejection Region (Right)')

# Plot z-score
plt.axvline(z_score, color='green', linestyle='--', label=f'Z-Score = {z_score:.2f}')

# Labels and legend
plt.title('Two-Tailed Z-Test Decision Regions')
plt.xlabel('Z')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True)
plt.show()
```

---

### 🧠 Interpretation:
- If the **Z-score** falls outside ±Z-critical (±1.96 at α = 0.05), you **reject the null hypothesis**. Here Z ≈ 2.83, so the null hypothesis is rejected (p ≈ 0.0047).
- The shaded red areas are the **rejection regions** for a 5% significance level.
- The green dashed line shows where your **Z-score** lands.



# Q5. Create a Python function that calculates and visualizes Type 1 and Type 2 errors during hypothesis testing.
Here's a Python function that **visualizes Type I and Type II errors** for a right-tailed test on the sampling distribution of the mean. It's a good way to see how the significance level (α), sample size, and effect size interact.

---

### ✅ Python Function: Visualize Type I & II Errors

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def visualize_type1_type2(mu_null=0, mu_alt=1, sigma=1, alpha=0.05, n=30):
    # Standard error
    se = sigma / np.sqrt(n)

    # Critical value for right-tailed test
    z_critical = norm.ppf(1 - alpha)
    x_crit = mu_null + z_critical * se

    # X range for plotting
    x = np.linspace(mu_null - 4*se, mu_alt + 4*se, 1000)

    # Null and alternative distributions
    y_null = norm.pdf(x, mu_null, se)
    y_alt = norm.pdf(x, mu_alt, se)

    # Plot distributions
    plt.figure(figsize=(10, 5))
    plt.plot(x, y_null, label='Null Hypothesis (H₀)', color='blue')
    plt.plot(x, y_alt, label='Alternative Hypothesis (H₁)', color='green')

    # Shade Type I error region (α)
    plt.fill_between(x, y_null, where=(x >= x_crit), color='red', alpha=0.4, label='Type I Error (α)')

    # Shade Type II error region (β)
    plt.fill_between(x, y_alt, where=(x < x_crit), color='orange', alpha=0.4, label='Type II Error (β)')

    # Decision boundary
    plt.axvline(x_crit, color='black', linestyle='--', label=f'Critical Value = {x_crit:.2f}')

    # Labels and legend
    plt.title('Type I and Type II Errors in Hypothesis Testing')
    plt.xlabel('Sample Mean')
    plt.ylabel('Probability Density')
    plt.legend()
    plt.grid(True)
    plt.show()

# Example usage
visualize_type1_type2(mu_null=0, mu_alt=1, sigma=1, alpha=0.05, n=30)
```

---

### 🧠 What This Shows:
- **Red area**: Type I error (rejecting H₀ when it’s true)
- **Orange area**: Type II error (failing to reject H₀ when H₁ is true)
- You can adjust `mu_alt`, `alpha`, or `n` to see how the errors shift
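
The shaded areas can also be computed as numbers. This is a minimal sketch under the same assumptions as the plot (right-tailed test, known `sigma`, normal sampling distribution); the helper name `error_rates` is made up for this example.

```python
import numpy as np
from scipy.stats import norm

def error_rates(mu_null=0, mu_alt=1, sigma=1, alpha=0.05, n=30):
    """Return (Type I error, Type II error, power) for a right-tailed Z-test."""
    se = sigma / np.sqrt(n)
    x_crit = mu_null + norm.ppf(1 - alpha) * se        # rejection threshold
    type1 = norm.sf(x_crit, loc=mu_null, scale=se)     # area right of x_crit under H0
    type2 = norm.cdf(x_crit, loc=mu_alt, scale=se)     # area left of x_crit under H1
    return type1, type2, 1 - type2

for n in (5, 10, 30):
    a, b, power = error_rates(n=n)
    print(f"n={n:>2}: alpha={a:.3f}, beta={b:.4f}, power={power:.4f}")
```

With α held fixed, increasing `n` narrows both curves, so β shrinks and power grows.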


# Q6. Write a Python program to perform an independent T-test and interpret the results.

 Here's a Python program that performs an **independent two-sample t-test**—used to compare the means of two unrelated groups—and interprets the results step by step.

---

### 🧪 Scenario:
Suppose we want to test whether two different teaching methods lead to different average test scores.

```python
import numpy as np
from scipy.stats import ttest_ind

# Sample data: test scores from two independent groups
group_A = np.array([85, 88, 90, 93, 87, 91, 89, 86, 90, 88])
group_B = np.array([78, 82, 80, 79, 81, 77, 83, 80, 79, 78])

# Perform independent two-sample t-test (assume equal variances)
t_stat, p_value = ttest_ind(group_A, group_B, equal_var=True)

# Output results
print(f"Group A Mean: {np.mean(group_A):.2f}")
print(f"Group B Mean: {np.mean(group_B):.2f}")
print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis. There is a significant difference between the group means.")
else:
    print("Result: Fail to reject the null hypothesis. No significant difference between the group means.")
```

---

### 🧠 Interpretation:
- The **t-statistic** tells you how far apart the group means are in terms of standard error.
- The **p-value** tells you whether that difference is statistically significant.
- If `p < 0.05`, we conclude the teaching methods likely lead to different outcomes.
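
The code above assumes equal variances. If that assumption is doubtful, one common option is Welch's t-test, which SciPy runs when `equal_var=False`. The sketch below reuses the same illustrative scores and adds Levene's test as a rough check on the variances.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

group_A = np.array([85, 88, 90, 93, 87, 91, 89, 86, 90, 88])
group_B = np.array([78, 82, 80, 79, 81, 77, 83, 80, 79, 78])

# Levene's test: H0 is that the two groups have equal variances
lev_stat, lev_p = levene(group_A, group_B)
print(f"Levene p-value: {lev_p:.4f}")

# Welch's t-test does not assume equal variances
t_welch, p_welch = ttest_ind(group_A, group_B, equal_var=False)
print(f"Welch T-Statistic: {t_welch:.4f}")
print(f"Welch P-Value: {p_welch:.6f}")
```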


# Q7. Perform a paired sample T-test using Python and visualize the comparison results
A **paired sample t-test** is perfect when you're comparing two related sets of observations—like before-and-after measurements on the same subjects. Here's a Python program that performs the test and visualizes the results side by side.

---

### 🧪 Scenario:
Let’s say we’re testing whether a training program improved test scores for 10 students.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_rel

# Sample data: scores before and after training
before = np.array([72, 75, 78, 70, 74, 69, 73, 76, 71, 77])
after  = np.array([78, 79, 82, 74, 77, 72, 76, 80, 75, 81])

# Perform paired t-test
t_stat, p_value = ttest_rel(before, after)

# Print results
print(f"Mean Before: {np.mean(before):.2f}")
print(f"Mean After:  {np.mean(after):.2f}")
print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value:     {p_value:.4f}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis. The training had a significant effect.")
else:
    print("Result: Fail to reject the null hypothesis. No significant effect detected.")

# Visualization
x = np.arange(len(before))
width = 0.35

plt.figure(figsize=(10, 5))
plt.bar(x - width/2, before, width, label='Before', color='skyblue')
plt.bar(x + width/2, after, width, label='After', color='lightgreen')
plt.xticks(x, [f'Student {i+1}' for i in x])
plt.ylabel('Score')
plt.title('Before vs After Training Scores')
plt.legend()
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
```

---

### 🧠 What You’ll See:
- A **bar chart** comparing each student’s score before and after training.
- The **t-test result** tells you if the improvement is statistically significant.
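
A paired t-test is equivalent to a one-sample t-test on the per-student differences. The short sketch below, using the same illustrative scores, checks that equivalence.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

before = np.array([72, 75, 78, 70, 74, 69, 73, 76, 71, 77])
after = np.array([78, 79, 82, 74, 77, 72, 76, 80, 75, 81])

diff = after - before                      # improvement per student
t_paired, p_paired = ttest_rel(after, before)
t_diff, p_diff = ttest_1samp(diff, 0.0)    # H0: mean difference is 0

print(f"Mean improvement: {diff.mean():.2f}")
print(f"Paired t-test:        t={t_paired:.4f}, p={p_paired:.6f}")
print(f"t-test on differences: t={t_diff:.4f}, p={p_diff:.6f}")
```

Passing `after` first makes a positive t-statistic correspond to an improvement.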





# Q8. Simulate data and perform both Z-test and T-test, then compare the results using Python.
 Let’s simulate a dataset and perform both a **Z-test** and a **T-test** on it to see how they compare—especially when the population standard deviation is known vs. unknown.

---

### 🧪 Scenario:
We’ll simulate a sample of 40 observations from a normal distribution with a true mean of 100. We'll test whether the sample mean is significantly different from a hypothesized population mean of 105.

---

### ✅ Python Code

```python
import numpy as np
from scipy import stats
from scipy.stats import norm

# Step 1: Simulate sample data
np.random.seed(42)
sigma = 10  # population std, known here because we chose it for the simulation
sample = np.random.normal(loc=100, scale=sigma, size=40)  # mean=100, std=10

# Hypothesized population mean
mu = 105
alpha = 0.05

# Step 2: Perform one-sample T-test (unknown population std)
t_stat, t_p = stats.ttest_1samp(sample, mu)

# Step 3: Perform one-sample Z-test using the known population std.
# (statsmodels' ztest would estimate the std from the sample instead,
# which gives the same statistic as the t-test.)
z_stat = (np.mean(sample) - mu) / (sigma / np.sqrt(len(sample)))
z_p = 2 * norm.sf(abs(z_stat))

# Step 4: Print results
print("Sample Mean:", np.mean(sample))
print("\n--- T-Test ---")
print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value:     {t_p:.4f}")

print("\n--- Z-Test ---")
print(f"Z-Statistic: {z_stat:.4f}")
print(f"P-Value:     {z_p:.4f}")

# Step 5: Interpretation
print("\n--- Interpretation ---")
if t_p < alpha:
    print("T-Test: Reject the null hypothesis.")
else:
    print("T-Test: Fail to reject the null hypothesis.")

if z_p < alpha:
    print("Z-Test: Reject the null hypothesis.")
else:
    print("Z-Test: Fail to reject the null hypothesis.")
```

---

### 🧠 Key Takeaways:
- The **T-test** uses the **sample standard deviation**, making it more conservative for small samples.
- The **Z-test** assumes the **population standard deviation is known**, which can lead to slightly different results.
- As sample size increases, both tests tend to converge.
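
The convergence is easy to see in the critical values. This small sketch compares the two-tailed 5% critical value of the t-distribution with the normal value for a few sample sizes chosen for illustration.

```python
from scipy.stats import norm, t

z_crit = norm.ppf(0.975)  # two-tailed 5% critical value for the Z-test
print(f"Z critical value: {z_crit:.4f}")

for n in (5, 10, 30, 100, 1000):
    t_crit = t.ppf(0.975, df=n - 1)  # same tail area, t-distribution with n-1 df
    print(f"n={n:>4}: t critical = {t_crit:.4f} (difference {t_crit - z_crit:.4f})")
```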


# Q9. Write a Python function to calculate the confidence interval for a sample mean and explain its significance.
 Here's a Python function that calculates the **confidence interval** for a sample mean using the **t-distribution**, which is ideal when the population standard deviation is unknown and the sample size is small.

---

### ✅ Python Function: Confidence Interval for Sample Mean

```python
import numpy as np
from scipy import stats

def confidence_interval(data, confidence=0.95):
    """
    Calculate the confidence interval for a sample mean.

    Parameters:
        data (list or array): Sample data
        confidence (float): Confidence level (default is 0.95)

    Returns:
        tuple: (mean, lower bound, upper bound)
    """
    data = np.array(data)
    n = len(data)
    mean = np.mean(data)
    sem = stats.sem(data)  # Standard error of the mean
    margin = sem * stats.t.ppf((1 + confidence) / 2, df=n-1)
    return mean, mean - margin, mean + margin

# Example usage
sample = [12, 15, 14, 10, 13, 17, 14, 15, 16, 14]
mean, lower, upper = confidence_interval(sample)
print(f"Sample Mean: {mean:.2f}")
print(f"95% Confidence Interval: ({lower:.2f}, {upper:.2f})")
```

---

### 🧠 Why It Matters:
A **confidence interval** gives you a range of plausible values for the population mean based on your sample. For example, a 95% confidence interval means that if you repeated the sampling process many times, about 95% of those intervals would contain the true population mean.

It’s a powerful way to express **uncertainty** and **reliability** in your estimates—especially useful in research, surveys, and A/B testing.
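
The "about 95% of intervals" statement can be checked by simulation. The sketch below uses assumed values (true mean 14, standard deviation 2, samples of size 10, 2,000 repetitions) and counts how often the interval covers the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, true_sd, n, n_repeats = 14.0, 2.0, 10, 2000

# Each row is one independent sample of size n
samples = rng.normal(true_mean, true_sd, size=(n_repeats, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)
margin = stats.t.ppf(0.975, df=n - 1) * sems

# An interval "covers" when the true mean lies between its bounds
covered = (means - margin <= true_mean) & (true_mean <= means + margin)
coverage = covered.mean()
print(f"Share of 95% intervals containing the true mean: {coverage:.3f}")
```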



# Q10. Write a Python program to calculate the margin of error for a given confidence level using sample data.

 Here's a Python program that calculates the **margin of error** for a sample mean using the **t-distribution**—perfect when the population standard deviation is unknown:

---

### ✅ Python Program: Margin of Error for a Given Confidence Level

```python
import numpy as np
from scipy import stats

def margin_of_error(data, confidence=0.95):
    """
    Calculate the margin of error for a sample mean using the t-distribution.

    Parameters:
        data (list or array): Sample data
        confidence (float): Confidence level (default is 0.95)

    Returns:
        float: Margin of error
    """
    data = np.array(data)
    n = len(data)
    sem = stats.sem(data)  # Standard error of the mean
    t_critical = stats.t.ppf((1 + confidence) / 2, df=n-1)
    moe = t_critical * sem
    return moe

# Example usage
sample = [12, 15, 14, 10, 13, 17, 14, 15, 16, 14]
moe = margin_of_error(sample, confidence=0.95)
print(f"Margin of Error (95% confidence): ±{moe:.2f}")
```

---

### 🧠 Why It Matters:
The **margin of error** tells you how much your sample estimate might vary from the true population value. It’s a key part of building **confidence intervals** and understanding the **precision** of your results.
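
The same relationship can be turned around to plan a study: given a target margin of error, how many observations are needed? The sketch below uses the normal approximation n = (z·σ/E)² with an assumed standard deviation; the function name and the input values are hypothetical.

```python
import math
from scipy.stats import norm

def required_sample_size(sigma, target_moe, confidence=0.95):
    """Approximate n so that z * sigma / sqrt(n) <= target_moe (normal approximation)."""
    z = norm.ppf((1 + confidence) / 2)
    return math.ceil((z * sigma / target_moe) ** 2)

# Assumed values for illustration: sigma = 2.0, target margin of error = 0.5
n_needed = required_sample_size(sigma=2.0, target_moe=0.5)
print(f"Observations needed for a margin of error of 0.5: {n_needed}")
```

Halving the target margin of error roughly quadruples the required sample size, because the margin shrinks with the square root of n.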



# Q11. Implement a Bayesian inference method using Bayes' Theorem in Python and explain the process.

---

### 🧠 Step-by-Step: Bayes’ Theorem Refresher

\[
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
\]

Where:
- **P(A)** = Prior probability (e.g. having the disease)
- **P(B|A)** = Likelihood (e.g. testing positive if diseased)
- **P(B)** = Marginal probability of testing positive
- **P(A|B)** = Posterior probability (updated belief)

---

### 🧪 Python Implementation

```python
def bayesian_inference(p_disease, p_pos_given_disease, p_pos_given_no_disease):
    p_no_disease = 1 - p_disease
    p_pos = (p_pos_given_disease * p_disease) + (p_pos_given_no_disease * p_no_disease)
    p_disease_given_pos = (p_pos_given_disease * p_disease) / p_pos
    return p_disease_given_pos

# Example values
p_disease = 0.01                  # 1% of population has the disease
p_pos_given_disease = 0.99        # 99% sensitivity
p_pos_given_no_disease = 0.05     # 5% false positive rate

posterior = bayesian_inference(p_disease, p_pos_given_disease, p_pos_given_no_disease)
print(f"Probability of having the disease given a positive test: {posterior:.4f}")
```

---

### 🔍 Interpretation:
With these numbers the **posterior probability** is only about **16.7%**. Even with a highly accurate test, most positive results come from the much larger healthy group because the disease is rare. This is the power of Bayesian thinking: it forces us to consider **base rates** and not just test accuracy.
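
Bayesian inference is naturally sequential: today's posterior becomes tomorrow's prior. The sketch below repeats the update for several positive results, under the added assumption that repeated tests are independent given the person's true disease status. The helper name `bayes_update` is made up for this example.

```python
def bayes_update(prior, sensitivity, false_positive_rate):
    """Posterior probability of disease after one positive test result."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

belief = 0.01  # start from the 1% base rate
for test_number in (1, 2, 3):
    # Each positive result updates the belief left by the previous one
    belief = bayes_update(belief, sensitivity=0.99, false_positive_rate=0.05)
    print(f"After positive test {test_number}: P(disease) = {belief:.4f}")
```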

---


# Q12. Perform a Chi-square test for independence between two categorical variables in Python.

---

### 🧪 Scenario:
Suppose we surveyed 100 people about their **preferred beverage** (Tea or Coffee) and their **work shift** (Day or Night). Here's the observed data:

|            | Tea | Coffee |
|------------|-----|--------|
| Day Shift  | 20  | 30     |
| Night Shift| 25  | 25     |

---

### ✅ Python Code

```python
import numpy as np
from scipy.stats import chi2_contingency

# Step 1: Create the contingency table
data = np.array([[20, 30],
                 [25, 25]])

# Step 2: Perform the Chi-square test
# (for 2x2 tables, SciPy applies Yates' continuity correction by default)
chi2, p, dof, expected = chi2_contingency(data)

# Step 3: Display results
print(f"Chi-square Statistic: {chi2:.4f}")
print(f"P-value: {p:.4f}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:\n", expected)

# Step 4: Interpret the result
alpha = 0.05
if p < alpha:
    print("Result: Reject the null hypothesis. There is a significant association between the variables.")
else:
    print("Result: Fail to reject the null hypothesis. No significant evidence of an association.")
```

---

### 🧠 Interpretation:
- **Null hypothesis (H₀)**: Beverage preference is independent of work shift.
- **Alternative hypothesis (H₁)**: Beverage preference depends on work shift.
- If the **p-value < 0.05**, we conclude there’s a significant association.
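
For a 2×2 table, `chi2_contingency` applies Yates' continuity correction unless told otherwise, which makes the statistic smaller than the textbook formula gives. This short sketch runs the same table both ways so the difference is visible.

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[20, 30],
                  [25, 25]])

# Default: Yates' continuity correction is applied because the table is 2x2
chi2_yates, p_yates, _, _ = chi2_contingency(table)
# Uncorrected Pearson chi-square, matching the hand formula sum((O - E)^2 / E)
chi2_raw, p_raw, _, _ = chi2_contingency(table, correction=False)

print(f"With correction:    chi2={chi2_yates:.4f}, p={p_yates:.4f}")
print(f"Without correction: chi2={chi2_raw:.4f}, p={p_raw:.4f}")
```

Neither version comes close to significance for this table, so the conclusion is the same either way.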



# Q13. Write a Python program to calculate the expected frequencies for a Chi-square test based on observed data.
Here's a Python program that calculates the **expected frequencies** for a Chi-square test using a contingency table of observed values. This is a key step in performing a **Chi-square test for independence**.

---

### ✅ Python Program: Calculate Expected Frequencies

```python
import numpy as np
from scipy.stats import chi2_contingency

# Step 1: Define the observed frequency table
# Example: Beverage preference by work shift
observed = np.array([[20, 30],
                     [25, 25]])

# Step 2: Use chi2_contingency to get expected frequencies
chi2, p, dof, expected = chi2_contingency(observed)

# Step 3: Display results
print("Observed Frequencies:\n", observed)
print("\nExpected Frequencies:\n", expected.round(2))
```

---

### 🧠 How It Works:
- `chi2_contingency()` automatically computes the expected frequencies under the assumption that the two variables are independent.
- The expected frequency for each cell is calculated as:
  \[
  E_{ij} = \frac{(\text{Row Total}_i) \times (\text{Column Total}_j)}{\text{Grand Total}}
  \]
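
The formula can be applied directly with NumPy, without calling `chi2_contingency`. This sketch uses the same beverage-by-shift table; the variable names are only for illustration.

```python
import numpy as np

observed = np.array([[20, 30],
                     [25, 25]])

row_totals = observed.sum(axis=1)   # totals per work shift
col_totals = observed.sum(axis=0)   # totals per beverage
grand_total = observed.sum()

# E_ij = (row total i * column total j) / grand total, for every cell at once
expected_manual = np.outer(row_totals, col_totals) / grand_total
print("Expected Frequencies (manual):\n", expected_manual)
```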

---


# Q14. Perform a goodness-of-fit test using Python to compare the observed data to an expected distribution.
 Let’s perform a **Chi-square goodness-of-fit test** in Python to compare observed data to an expected distribution. This test helps determine whether your observed frequencies differ significantly from what you'd expect under a theoretical model.

---

### 🧪 Example Scenario:
Suppose you roll a 6-sided die 60 times and get the following observed counts:

```python
observed = [8, 9, 10, 11, 12, 10]  # Frequencies for faces 1–6
```

If the die is fair, you'd expect each face to appear 10 times:

```python
expected = [10, 10, 10, 10, 10, 10]
```

---

### ✅ Python Code

```python
import numpy as np
from scipy.stats import chisquare

# Observed and expected frequencies
observed = np.array([8, 9, 10, 11, 12, 10])
expected = np.array([10, 10, 10, 10, 10, 10])

# Perform Chi-square goodness-of-fit test
chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# Output results
print(f"Chi-square Statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis. The observed distribution differs significantly from the expected.")
else:
    print("Result: Fail to reject the null hypothesis. No significant difference found.")
```

---

### 🧠 Why It Matters:
This test is perfect for checking if your data fits a **uniform**, **binomial**, or any other expected distribution. It’s widely used in quality control, genetics, and survey analysis.
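
For reference, the statistic that `chisquare` reports can be reproduced by hand. The sketch below applies the formula Σ (O − E)² / E to the die-roll counts and reads the p-value from the Chi-square distribution with (number of categories − 1) degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([8, 9, 10, 11, 12, 10])
expected = np.array([10, 10, 10, 10, 10, 10])

# Sum of squared deviations, each scaled by its expected count
chi2_manual = ((observed - expected) ** 2 / expected).sum()
df = len(observed) - 1                  # 6 faces -> 5 degrees of freedom
p_manual = chi2.sf(chi2_manual, df)     # upper-tail probability

print(f"Chi-square Statistic: {chi2_manual:.4f}")
print(f"Degrees of Freedom: {df}")
print(f"P-value: {p_manual:.4f}")
```

The large p-value says these counts are entirely consistent with a fair die.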


# Q15. Create a Python script to simulate and visualize the Chi-square distribution and discuss its characteristics.
Here's a Python script that visualizes the **Chi-square distribution** using `scipy.stats` and `matplotlib`, plotting the theoretical density for several degrees of freedom (df) to show how its shape changes.

---

### ✅ Python Script: Simulate & Visualize Chi-square Distribution

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

# Degrees of freedom to visualize
dfs = [1, 2, 5, 10, 20]
x = np.linspace(0.01, 40, 1000)  # start just above 0: the df=1 density is unbounded at x=0

plt.figure(figsize=(10, 6))
for df in dfs:
    plt.plot(x, chi2.pdf(x, df), label=f'df = {df}')

# Plot formatting
plt.title('Chi-square Distribution for Various Degrees of Freedom')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True)
plt.show()
```

---

### 📊 Characteristics of the Chi-square Distribution:

- **Non-negative**: Values are always ≥ 0 since it's based on squared values.
- **Right-skewed**: Especially for small degrees of freedom.
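- **Long right tail**: The density approaches zero as x grows but never reaches it.
- **Mean and variance**: The mean equals df and the variance equals 2 × df.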
- **Shape depends on degrees of freedom**:
  - For **df = 1 or 2**, the distribution is highly skewed.
  - As **df increases**, it becomes more symmetric and approaches a normal distribution.
- **Applications**:
  - **Goodness-of-fit tests**
  - **Tests of independence** in contingency tables
  - **Tests for population variance**
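
To cover the simulation part of the question, the sketch below builds Chi-square values from their definition: the sum of `df` squared standard normal draws. The sample size and seed are arbitrary choices for illustration. The sample mean and variance should land near `df` and `2 * df`.

```python
import numpy as np

rng = np.random.default_rng(0)
df = 5
n_draws = 100_000

# A chi-square variable with df degrees of freedom is a sum of df squared N(0, 1) draws
z = rng.standard_normal(size=(n_draws, df))
chi2_samples = (z ** 2).sum(axis=1)

sample_mean = chi2_samples.mean()
sample_var = chi2_samples.var(ddof=1)
print(f"Simulated mean:     {sample_mean:.3f} (theory: {df})")
print(f"Simulated variance: {sample_var:.3f} (theory: {2 * df})")
```

A histogram of `chi2_samples` (for example with `plt.hist(chi2_samples, bins=100, density=True)`) can be overlaid on the theoretical curve for a visual comparison.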

---


# Q16. Implement an F-test using Python to compare the variances of two random samples.
 Here's a Python program that performs an **F-test** to compare the variances of two independent samples. This is useful when you want to test if two populations have **equal variability**.

---

### 🧪 Scenario:
Let’s say we have two groups of measurements from different machines, and we want to test if their output variances are significantly different.

---

### ✅ Python Code: F-Test for Equality of Variances

```python
import numpy as np
from scipy.stats import f

# Step 1: Simulate two random samples
np.random.seed(42)
sample1 = np.random.normal(loc=50, scale=5, size=30)   # mean=50, std=5
sample2 = np.random.normal(loc=52, scale=8, size=30)   # mean=52, std=8

# Step 2: Calculate sample variances
var1 = np.var(sample1, ddof=1)
var2 = np.var(sample2, ddof=1)

# Step 3: Compute F-statistic (larger variance / smaller variance)
if var1 > var2:
    F = var1 / var2
    dfn, dfd = len(sample1) - 1, len(sample2) - 1
else:
    F = var2 / var1
    dfn, dfd = len(sample2) - 1, len(sample1) - 1

# Step 4: Calculate p-value (two-tailed)
p_value = 2 * min(f.cdf(F, dfn, dfd), 1 - f.cdf(F, dfn, dfd))

# Step 5: Output results
print(f"Variance 1: {var1:.2f}")
print(f"Variance 2: {var2:.2f}")
print(f"F-Statistic: {F:.4f}")
print(f"P-Value: {p_value:.4f}")

# Step 6: Interpretation
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis. Variances are significantly different.")
else:
    print("Result: Fail to reject the null hypothesis. No significant difference in variances.")
```

---

### 🧠 What’s Happening:
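- The **F-statistic** is the ratio of the two sample variances, with the larger variance in the numerator so that F ≥ 1.
- The **p-value** tells us whether the difference is statistically significant.
- We use a **two-tailed test** to detect inequality in either direction.
- This F-test assumes both samples come from normal distributions and is sensitive to departures from normality.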



# Q17. Write a Python program to perform an ANOVA test to compare means between multiple groups and interpret the results.
Here's a Python program that performs a **one-way ANOVA (Analysis of Variance)** to test whether the means of three or more independent groups are significantly different.

---

### 🧪 Scenario:
Suppose we’re comparing test scores from students taught using three different teaching methods.

---

### ✅ Python Code: One-Way ANOVA

```python
import numpy as np
from scipy.stats import f_oneway

# Sample data: test scores from three teaching methods
method_A = [85, 88, 90, 87, 86]
method_B = [78, 82, 80, 79, 81]
method_C = [92, 94, 91, 93, 95]

# Perform one-way ANOVA
f_stat, p_value = f_oneway(method_A, method_B, method_C)

# Output results
print(f"F-Statistic: {f_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis. At least one group mean is significantly different.")
else:
    print("Result: Fail to reject the null hypothesis. No significant difference between group means.")
```

---

### 🧠 Interpretation:
- **Null hypothesis (H₀)**: All group means are equal.
- **Alternative hypothesis (H₁)**: At least one group mean is different.
- If the **p-value < 0.05**, we conclude that **teaching method affects performance**.
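
A significant ANOVA only says that some mean differs, not which one. One simple follow-up, shown here as a sketch rather than the only option, is to run pairwise t-tests and multiply each p-value by the number of comparisons (a Bonferroni adjustment).

```python
from itertools import combinations
from scipy.stats import ttest_ind

groups = {
    "Method A": [85, 88, 90, 87, 86],
    "Method B": [78, 82, 80, 79, 81],
    "Method C": [92, 94, 91, 93, 95],
}

pairs = list(combinations(groups, 2))   # all pairs of group names
adjusted = {}
for name_1, name_2 in pairs:
    _, p = ttest_ind(groups[name_1], groups[name_2])
    # Bonferroni: scale by the number of comparisons, capped at 1
    adjusted[(name_1, name_2)] = min(1.0, p * len(pairs))

for (name_1, name_2), p_adj in adjusted.items():
    print(f"{name_1} vs {name_2}: adjusted p = {p_adj:.4f}")
```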



# Q18. Perform a one-way ANOVA test using Python to compare the means of different groups and plot the results.
Here's a Python program that performs a **one-way ANOVA test** and visualizes the distribution of each group using a boxplot, which is a convenient way to compare multiple groups at once.

---

### 🧪 Scenario:
Suppose we’re comparing exam scores from students taught using three different teaching methods.

---

### ✅ Python Code: One-Way ANOVA with Visualization

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f_oneway

# Sample data: scores from three teaching methods
method_A = [85, 88, 90, 87, 86]
method_B = [78, 82, 80, 79, 81]
method_C = [92, 94, 91, 93, 95]

# Perform one-way ANOVA
f_stat, p_value = f_oneway(method_A, method_B, method_C)

# Print results
print(f"F-Statistic: {f_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis. At least one group mean is significantly different.")
else:
    print("Result: Fail to reject the null hypothesis. No significant difference between group means.")

# Visualization
data = [method_A, method_B, method_C]
labels = ['Method A', 'Method B', 'Method C']

plt.figure(figsize=(8, 5))
plt.boxplot(data, patch_artist=True)
plt.xticks([1, 2, 3], labels)  # label the boxes; avoids the `labels` argument deprecated in newer Matplotlib
plt.title('Comparison of Teaching Methods')
plt.ylabel('Test Scores')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
```

---

### 🧠 What This Tells You:
- The **F-statistic** measures the ratio of between-group to within-group variance.
- A **low p-value** (typically < 0.05) suggests that **at least one group mean differs** significantly.
- The **boxplot** helps you visually compare the spread and central tendency of each group.



# Q19. Write a Python function to check the assumptions (normality, independence, and equal variance) for ANOVA.
 Before running an ANOVA, it's essential to check its assumptions: **normality**, **independence**, and **homogeneity of variances**. Here's a Python function that helps assess these using statistical tests and visualizations:

---

### ✅ Python Function to Check ANOVA Assumptions

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import shapiro, levene
from statsmodels.graphics.gofplots import qqplot

def check_anova_assumptions(data, group_col, value_col):
    """
    Check ANOVA assumptions: normality and equal variance, with plots as a visual aid.
    Independence cannot be tested here; it has to be judged from the study design.

    Parameters:
        data (DataFrame): Input dataset
        group_col (str): Column name for group labels
        value_col (str): Column name for numeric values
    """
    groups = data[group_col].unique()
    print("🔍 Checking Normality (Shapiro-Wilk Test):")
    for group in groups:
        vals = data[data[group_col] == group][value_col]
        stat, p = shapiro(vals)
        print(f"  {group}: W={stat:.4f}, p={p:.4f} → {'Normal' if p > 0.05 else 'Not normal'}")

    print("\n📊 Checking Homogeneity of Variance (Levene’s Test):")
    samples = [data[data[group_col] == g][value_col] for g in groups]
    stat, p = levene(*samples)
    print(f"  Levene’s W={stat:.4f}, p={p:.4f} → {'Equal variances' if p > 0.05 else 'Unequal variances'}")

    print("\n👁️ Visual Check for Spread and Normality:")
    plt.figure(figsize=(12, 5))

    # Boxplot for spread
    plt.subplot(1, 2, 1)
    sns.boxplot(x=group_col, y=value_col, data=data)
    plt.title("Boxplot by Group")

    # Q-Q plot of residuals
    plt.subplot(1, 2, 2)
    model_resid = data[value_col] - data.groupby(group_col)[value_col].transform('mean')
    qqplot(model_resid, line='s', ax=plt.gca())
    plt.title("Q-Q Plot of Residuals")

    plt.tight_layout()
    plt.show()
```

---

### 🧠 What It Checks:
- **Normality**: Shapiro-Wilk test for each group
- **Equal variance**: Levene’s test across groups
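- **Independence**: Not tested by this function. It depends on how the data were collected (for example, random assignment and no repeated measurements on the same subject); the boxplot and Q-Q plot only show spread and residual normality.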
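
Neither a boxplot nor a Q-Q plot can reveal dependence between observations. If the rows are stored in the order they were collected (an assumption that must hold for this to mean anything), one rough numeric screen is the Durbin-Watson statistic on the residuals. The sketch below computes it by hand with NumPy; the helper name `durbin_watson_stat` and the sample arrays are hypothetical.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the
    sum of squared residuals (the result lies between 0 and 4)."""
    r = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(r) ** 2) / np.sum(r ** 2)

# Hypothetical data: values listed in collection order, with group labels
values = np.array([70, 72, 68, 75, 78, 76, 65, 67, 66], dtype=float)
labels = np.array(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'])

# Residual = observation minus its own group mean
group_means = {g: values[labels == g].mean() for g in np.unique(labels)}
residuals = values - np.array([group_means[g] for g in labels])

dw = durbin_watson_stat(residuals)
print(f"Durbin-Watson: {dw:.3f}")
```

Values near 2 suggest little first-order autocorrelation; values toward 0 or 4 suggest positive or negative serial correlation. This is only a screen and does not replace a sound sampling design.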

---



# Q20. Perform a two-way ANOVA test using Python to study the interaction between two factors and visualize the results.
 Let’s perform a **two-way ANOVA** in Python to analyze the effects of two categorical factors on a continuous outcome—and visualize the interaction.

---

### 🧪 Scenario:
Suppose we’re studying how **teaching method** (`Method A`, `Method B`) and **study time** (`Short`, `Long`) affect **test scores**.

---

### ✅ Python Code: Two-Way ANOVA with Visualization

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Step 1: Create sample data
data = pd.DataFrame({
    'Method': np.repeat(['A', 'B'], 10),
    'Time': np.tile(np.repeat(['Short', 'Long'], 5), 2),
    'Score': [70, 72, 68, 71, 69, 75, 78, 76, 77, 74,
              65, 67, 66, 68, 64, 80, 82, 81, 83, 79]
})

# Step 2: Fit the two-way ANOVA model
model = ols('Score ~ C(Method) + C(Time) + C(Method):C(Time)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# Step 3: Visualize interaction
plt.figure(figsize=(8, 5))
sns.pointplot(data=data, x='Time', y='Score', hue='Method', capsize=0.1, dodge=True, markers='o')
plt.title('Interaction Plot: Method × Time')
plt.ylabel('Mean Test Score')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
```

---

### 🧠 Interpretation:
- The **ANOVA table** shows p-values for:
  - Main effects: `Method`, `Time`
  - **Interaction**: `Method × Time`
- If the interaction p-value is **< 0.05**, it suggests the effect of one factor depends on the level of the other.
- The **interaction plot** shows whether the lines are roughly parallel (no interaction) or clearly non-parallel, crossing or diverging (interaction).
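
To make "the effect of one factor depends on the level of the other" concrete, the sketch below computes the four cell means from the sample scores used above and then the difference of differences. The variable names are illustrative; the numbers are the same scores, regrouped by cell.

```python
import numpy as np

# Same scores as the example above, arranged by (Method, Time) cell
cells = {
    ('A', 'Short'): [70, 72, 68, 71, 69],
    ('A', 'Long'):  [75, 78, 76, 77, 74],
    ('B', 'Short'): [65, 67, 66, 68, 64],
    ('B', 'Long'):  [80, 82, 81, 83, 79],
}
cell_means = {key: float(np.mean(vals)) for key, vals in cells.items()}

# Effect of study time within each method (Long minus Short)
time_effect_a = cell_means[('A', 'Long')] - cell_means[('A', 'Short')]
time_effect_b = cell_means[('B', 'Long')] - cell_means[('B', 'Short')]

# Interaction contrast: zero would mean perfectly parallel lines
interaction_contrast = time_effect_b - time_effect_a

print(cell_means)
print(f"Time effect under Method A: {time_effect_a:.1f}")
print(f"Time effect under Method B: {time_effect_b:.1f}")
print(f"Difference of differences: {interaction_contrast:.1f}")
```

Longer study time adds 6 points under Method A but 15 under Method B. That gap of 9 points is the pattern the `C(Method):C(Time)` term tests.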

---

# Q21. Write a Python program to visualize the F-distribution and discuss its use in hypothesis testing.

 Here's a Python program that visualizes the **F-distribution** for different degrees of freedom and explains how it's used in hypothesis testing—especially in **ANOVA** and **variance comparison**.

---

### ✅ Python Code: Visualize F-Distribution

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f

# Define degrees of freedom for numerator and denominator
df_pairs = [(1, 10), (5, 10), (10, 20), (20, 30)]
x = np.linspace(0, 5, 1000)

plt.figure(figsize=(10, 6))
for dfn, dfd in df_pairs:
    y = f.pdf(x, dfn, dfd)
    plt.plot(x, y, label=f'df1={dfn}, df2={dfd}')

plt.title('F-Distribution for Various Degrees of Freedom')
plt.xlabel('F-value')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
```

---

### 🧠 Why the F-Distribution Matters in Hypothesis Testing:

- **Used in ANOVA**: To test if **three or more group means** are significantly different.
- **Used in variance comparison**: To test if **two populations have equal variances**.
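- **F-statistic**: Ratio of two scaled variances. Under H₀ the ratio should be close to 1; a value far in the right tail (beyond the critical value for the chosen α and degrees of freedom) is evidence of a difference between group means or variances.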
- **Right-skewed**: The distribution is always non-negative and skewed right, especially with small degrees of freedom.
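
As a sketch of how the distribution turns an F-statistic into a decision, the snippet below looks up the upper-tail critical value and the p-value with `scipy.stats.f`. The degrees of freedom and the observed F-value are hypothetical numbers chosen for illustration.

```python
from scipy.stats import f

alpha = 0.05
dfn, dfd = 2, 12      # hypothetical: 3 groups, 15 observations in total
f_observed = 4.5      # hypothetical observed F-statistic

f_critical = f.ppf(1 - alpha, dfn, dfd)   # upper-tail critical value
p_value = f.sf(f_observed, dfn, dfd)      # P(F >= f_observed) under H0

print(f"Critical value: {f_critical:.3f}")
print(f"P-value: {p_value:.4f}")
print("Reject H0" if f_observed > f_critical else "Fail to reject H0")
```

Comparing the statistic with the critical value and comparing the p-value with α always give the same decision; they are two views of the same tail area.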

---



# Q22. Perform a one-way ANOVA test in Python and visualize the results with boxplots to compare group means.
 Here's a complete Python example that performs a **one-way ANOVA** and visualizes the group means using **boxplots**—a great way to compare distributions across categories.

---

### 🧪 Scenario:
Suppose we’re comparing exam scores from students taught using three different teaching methods.

---

### ✅ Python Code: One-Way ANOVA + Boxplot

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from scipy.stats import f_oneway

# Step 1: Create sample data
data = {
    'Score': [85, 88, 90, 87, 86, 78, 82, 80, 79, 81, 92, 94, 91, 93, 95],
    'Method': ['A']*5 + ['B']*5 + ['C']*5
}
df = pd.DataFrame(data)

# Step 2: Perform one-way ANOVA
grouped = [df[df['Method'] == m]['Score'] for m in df['Method'].unique()]
f_stat, p_value = f_oneway(*grouped)

# Step 3: Print results
print(f"F-Statistic: {f_stat:.4f}")
print(f"P-Value: {p_value:.4f}")
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis. At least one group mean is significantly different.")
else:
    print("Result: Fail to reject the null hypothesis. No significant difference between group means.")

# Step 4: Visualize with boxplot
plt.figure(figsize=(8, 5))
sns.boxplot(x='Method', y='Score', data=df, palette='Set2')
plt.title('Comparison of Test Scores by Teaching Method')
plt.ylabel('Score')
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()
```

---

### 🧠 Interpretation:
- The **F-statistic** tells you how much the group means differ relative to within-group variability.
- The **p-value** helps you decide whether the observed differences are statistically significant.
- The **boxplot** visually compares the spread and central tendency of each group.
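
To show what "differences between group means relative to within-group variability" means numerically, here is a sketch that rebuilds the F-statistic from sums of squares and checks it against `f_oneway`. It uses the same scores as the example above; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import f, f_oneway

groups = [
    np.array([85, 88, 90, 87, 86], dtype=float),  # Method A
    np.array([78, 82, 80, 79, 81], dtype=float),  # Method B
    np.array([92, 94, 91, 93, 95], dtype=float),  # Method C
]
k = len(groups)                              # number of groups
n_total = sum(len(g) for g in groups)        # total observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares and the F ratio
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within
p_manual = f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = f_oneway(*groups)
print(f"Manual F = {f_manual:.4f}, SciPy F = {f_scipy:.4f}")
print(f"Manual p = {p_manual:.3g}, SciPy p = {p_scipy:.3g}")
```

The two F-values should agree up to floating-point error, which confirms that the statistic is the between-group mean square divided by the within-group mean square.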



# Q23. Simulate random data from a normal distribution, then perform hypothesis testing to evaluate the means.
Let’s simulate data from a **normal distribution**, then perform a **one-sample t-test** to evaluate whether the sample mean significantly differs from a hypothesized population mean.

---

### ✅ Python Code: Simulate & Test the Mean

```python
import numpy as np
from scipy import stats

# Step 1: Simulate random data
np.random.seed(42)
sample = np.random.normal(loc=102, scale=10, size=50)  # mean=102, std=10, n=50

# Step 2: Define population mean to test against
mu = 100  # hypothesized population mean

# Step 3: Perform one-sample t-test
t_stat, p_value = stats.ttest_1samp(sample, mu)

# Step 4: Output results
print(f"Sample Mean: {np.mean(sample):.2f}")
print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

# Step 5: Interpretation
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis. The sample mean is significantly different from the hypothesized mean.")
else:
    print("Result: Fail to reject the null hypothesis. No significant difference found.")
```

---

### 🧠 What’s Happening:
- We simulate 50 values from a normal distribution with a true mean of 102.
- We test whether this sample could plausibly come from a population with mean 100.
- The **p-value** tells us whether the observed difference is statistically significant.
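
The t-statistic behind `ttest_1samp` is the distance between the sample mean and the hypothesized mean, measured in standard errors. The sketch below computes it by hand and compares it with SciPy. It draws its own sample with `numpy.random.default_rng`, so the numbers will not match the output of the code above; the seed is an arbitrary choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)   # arbitrary seed; results vary with it
sample = rng.normal(loc=102, scale=10, size=50)
mu = 100                          # hypothesized population mean

n = len(sample)
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - mu) / standard_error
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)   # two-sided p-value

t_scipy, p_scipy = stats.ttest_1samp(sample, mu)
print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

Because the true mean (102) is close to the hypothesized mean relative to the spread, a sample of 50 will not always detect the difference; rerunning with other seeds shows how often the test rejects.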





# Q24. Perform a hypothesis test for population variance using a Chi-square distribution and interpret the results.
Let’s walk through how to perform a **hypothesis test for population variance** using the **Chi-square distribution** in Python. This test helps determine whether the variance of a population equals a specified value.

---

### 🧪 Scenario:
Suppose a machine is designed to produce bolts with a variance in length of **4 mm²**. We collect a sample of 25 bolts and want to test if the actual variance differs from this target.

---

### ✅ Python Code: Chi-square Test for Variance

```python
import numpy as np
from scipy.stats import chi2

# Step 1: Sample data
np.random.seed(42)
sample = np.random.normal(loc=50, scale=2.5, size=25)  # std = √6.25 = 2.5 → variance = 6.25

# Step 2: Hypothesized population variance
sigma_squared_0 = 4  # claimed variance

# Step 3: Calculate test statistic
n = len(sample)
sample_variance = np.var(sample, ddof=1)
chi2_stat = (n - 1) * sample_variance / sigma_squared_0

# Step 4: Compute p-value (two-tailed)
alpha = 0.05
df = n - 1
p_lower = chi2.cdf(chi2_stat, df)
p_upper = 1 - chi2.cdf(chi2_stat, df)
p_value = 2 * min(p_lower, p_upper)

# Step 5: Output results
print(f"Sample Variance: {sample_variance:.4f}")
print(f"Chi-square Statistic: {chi2_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

# Step 6: Interpretation
if p_value < alpha:
    print("Result: Reject the null hypothesis. The population variance is significantly different from the hypothesized value.")
else:
    print("Result: Fail to reject the null hypothesis. No significant difference in variance.")
```

---

### 🧠 Interpretation:
- **Null Hypothesis (H₀)**: Population variance = 4 mm²
- **Alternative Hypothesis (H₁)**: Population variance ≠ 4 mm²
- The **Chi-square statistic** measures how far the sample variance deviates from the hypothesized variance.
- A **small p-value** indicates that the observed variance is unlikely under H₀.
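
The same test can be read through critical values instead of a p-value. The sketch below finds the two cut-offs for a two-tailed test at α = 0.05 with 24 degrees of freedom. The sample variance of 6.0 is a hypothetical number for illustration, not the output of the simulation above.

```python
from scipy.stats import chi2

alpha = 0.05
n = 25
df = n - 1
sigma_squared_0 = 4.0     # hypothesized variance (mm^2)
sample_variance = 6.0     # hypothetical sample variance for illustration

chi2_stat = df * sample_variance / sigma_squared_0
lower_crit = chi2.ppf(alpha / 2, df)        # lower-tail cut-off
upper_crit = chi2.ppf(1 - alpha / 2, df)    # upper-tail cut-off

reject = chi2_stat < lower_crit or chi2_stat > upper_crit
print(f"Statistic: {chi2_stat:.2f}")
print(f"Fail-to-reject interval: [{lower_crit:.3f}, {upper_crit:.3f}]")
print("Reject H0" if reject else "Fail to reject H0")
```

With these numbers the statistic is 36, which lies inside the interval, so a sample variance of 6 mm² from 25 bolts would not be enough to reject a claimed variance of 4 mm² at the 5% level.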

# Q25. Write a Python script to perform a Z-test for comparing proportions between two datasets or groups.
Here's a Python script that performs a **two-proportion Z-test**—perfect for comparing success rates between two independent groups, like conversion rates in A/B testing or survey responses.

---

### ✅ Python Script: Two-Proportion Z-Test

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Step 1: Define the number of successes and observations in each group
# Example: Group A had 60 successes out of 150 trials, Group B had 80 out of 200
successes = np.array([60, 80])
samples = np.array([150, 200])

# Step 2: Perform the two-proportion Z-test
z_stat, p_value = proportions_ztest(count=successes, nobs=samples, alternative='two-sided')

# Step 3: Output results
print(f"Z-Statistic: {z_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

# Step 4: Interpretation
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis. The proportions are significantly different.")
else:
    print("Result: Fail to reject the null hypothesis. No significant difference in proportions.")
```

---

### 🧠 When to Use This:
- Comparing **click-through rates** between two ads
- Evaluating **pass rates** between two classes
- Testing **conversion rates** in A/B experiments
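
For reference, the pooled two-proportion z-statistic can be computed by hand as sketched below. The helper name `two_proportion_ztest` is hypothetical. Note that the example counts above (60/150 and 80/200) both equal 0.40, so they give z = 0 and p = 1; the second call uses made-up counts with different proportions.

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-sided z-test for H0: p1 == p2."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * norm.sf(abs(z))
    return z, p_value

# Counts from the script above: identical sample proportions
z_same, p_same = two_proportion_ztest(60, 150, 80, 200)
print(f"Same proportions:      z = {z_same:.4f}, p = {p_same:.4f}")

# Hypothetical counts with different proportions (0.40 vs 0.50)
z_diff, p_diff = two_proportion_ztest(60, 150, 100, 200)
print(f"Different proportions: z = {z_diff:.4f}, p = {p_diff:.4f}")
```

The pooled proportion is used in the standard error because, under H₀, both groups share a single success probability.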



# Q26. Implement an F-test for comparing the variances of two datasets, then interpret and visualize the results.
Let’s implement an **F-test** to compare the variances of two datasets, interpret the result, and visualize the distributions along with the F-statistic.

---

### ✅ Python Code: F-Test with Visualization

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f

# Step 1: Simulate two datasets
np.random.seed(42)
data1 = np.random.normal(loc=50, scale=5, size=30)   # std = 5
data2 = np.random.normal(loc=52, scale=8, size=30)   # std = 8

# Step 2: Calculate sample variances
var1 = np.var(data1, ddof=1)
var2 = np.var(data2, ddof=1)

# Step 3: Compute F-statistic (larger variance / smaller variance)
if var1 > var2:
    F = var1 / var2
    dfn, dfd = len(data1) - 1, len(data2) - 1
else:
    F = var2 / var1
    dfn, dfd = len(data2) - 1, len(data1) - 1

# Step 4: Compute p-value (two-tailed)
p_value = 2 * min(f.cdf(F, dfn, dfd), 1 - f.cdf(F, dfn, dfd))

# Step 5: Print results
print(f"Variance 1: {var1:.2f}")
print(f"Variance 2: {var2:.2f}")
print(f"F-Statistic: {F:.4f}")
print(f"P-Value: {p_value:.4f}")

# Step 6: Interpretation
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis. Variances are significantly different.")
else:
    print("Result: Fail to reject the null hypothesis. No significant difference in variances.")

# Step 7: Visualization
x = np.linspace(0, 5, 1000)
y = f.pdf(x, dfn, dfd)

plt.figure(figsize=(10, 5))
plt.plot(x, y, label=f'F-distribution (df1={dfn}, df2={dfd})', color='blue')
plt.axvline(F, color='red', linestyle='--', label=f'F-statistic = {F:.2f}')
plt.fill_between(x, y, where=(x >= F), color='red', alpha=0.3, label='Upper-tail area beyond observed F')
plt.title('F-Test: Comparing Variances')
plt.xlabel('F-value')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
```

---

### 🧠 Interpretation:
- The **F-statistic** is the ratio of the two sample variances.
- The **p-value** tells us whether the observed difference in variances is statistically significant.
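- The **plot** shows the F-distribution and where the test statistic falls. The shaded area is the upper-tail probability beyond the observed F (half of the two-tailed p-value), not the rejection region, which would start at the critical value.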
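
A rejection region is defined by critical values rather than by the observed statistic. The sketch below computes both cut-offs for a two-tailed F-test with two samples of 30 and applies them to the plain ratio `var1 / var2` without reordering. The helper name `variances_differ` and the sample variances passed to it are hypothetical.

```python
from scipy.stats import f

alpha = 0.05
n1, n2 = 30, 30
dfn, dfd = n1 - 1, n2 - 1

lower_crit = f.ppf(alpha / 2, dfn, dfd)       # reject if ratio falls below this
upper_crit = f.ppf(1 - alpha / 2, dfn, dfd)   # reject if ratio rises above this

def variances_differ(var1, var2):
    """Two-tailed decision using the ratio var1 / var2."""
    ratio = var1 / var2
    return ratio < lower_crit or ratio > upper_crit

print(f"Rejection region: F < {lower_crit:.3f} or F > {upper_crit:.3f}")
print(variances_differ(25.0, 64.0))   # hypothetical sample variances
```

With equal degrees of freedom the two cut-offs are reciprocals of each other, which is why putting the larger variance in the numerator and using only the upper cut-off gives the same decision.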


# Q27. Perform a Chi-square test for goodness of fit with simulated data and analyze the results.
 Let’s simulate some categorical data and perform a **Chi-square goodness-of-fit test** to see if the observed frequencies match an expected distribution.

---

### 🧪 Scenario:
Suppose we roll a 6-sided die 120 times and get the following observed counts:

```python
observed = [15, 22, 18, 20, 25, 20]  # Frequencies for faces 1–6
```

If the die is fair, we expect each face to appear 20 times:

```python
expected = [20, 20, 20, 20, 20, 20]
```

---

### ✅ Python Code: Chi-square Goodness-of-Fit Test

```python
import numpy as np
from scipy.stats import chisquare

# Observed and expected frequencies
observed = np.array([15, 22, 18, 20, 25, 20])
expected = np.array([20] * 6)

# Perform Chi-square test
chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# Output results
print(f"Chi-square Statistic: {chi2_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Result: Reject the null hypothesis. The die may not be fair.")
else:
    print("Result: Fail to reject the null hypothesis. No significant evidence the die is unfair.")
```

---

### 🧠 Interpretation:
- **Null Hypothesis (H₀)**: The die is fair (uniform distribution).
- **Alternative Hypothesis (H₁)**: The die is not fair.
- If the **p-value < 0.05**, we conclude the observed distribution differs significantly from the expected.
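
The counts above are hard-coded. A sketch of the fully simulated version follows: it rolls a fair die 120 times with NumPy, tallies the faces, and computes the statistic both by hand and with `chisquare`. The seed is an arbitrary choice, so the counts will differ from the ones shown earlier.

```python
import numpy as np
from scipy.stats import chi2, chisquare

rng = np.random.default_rng(0)          # arbitrary seed
rolls = rng.integers(1, 7, size=120)    # faces 1..6 (upper bound is exclusive)

# Count how often each face appeared; index 0 is unused and dropped
observed = np.bincount(rolls, minlength=7)[1:]
expected = np.full(6, len(rolls) / 6)   # 20 per face for a fair die

# Chi-square statistic: sum of (O - E)^2 / E over the six faces
chi2_manual = np.sum((observed - expected) ** 2 / expected)
p_manual = chi2.sf(chi2_manual, df=len(observed) - 1)

chi2_scipy, p_scipy = chisquare(f_obs=observed, f_exp=expected)
print(f"Observed counts: {observed}")
print(f"Manual: chi2 = {chi2_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:  chi2 = {chi2_scipy:.4f}, p = {p_scipy:.4f}")
```

For the hard-coded counts above, the same formula gives (25 + 4 + 4 + 0 + 25 + 0) / 20 = 2.9 with 5 degrees of freedom, far from the upper tail, so the test does not reject fairness.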

