# Statistics Advance 3 Assignment

## Q1: What is Estimation Statistics? Explain point estimate and interval estimate.

Estimation statistics involves using sample data to estimate population parameters. A **point estimate** is a single value used to estimate a parameter (e.g., the sample mean as an estimate of the population mean). An **interval estimate** gives a range of plausible values (e.g., a confidence interval) within which the parameter is expected to lie at a stated confidence level.
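
As a minimal sketch of the two kinds of estimate, the snippet below computes both from one small sample. The sample values are hypothetical and chosen only for illustration, and the interval uses the t-distribution on the assumption that the population standard deviation is unknown.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 10 measurements (illustrative values only)
sample = np.array([48.2, 51.5, 49.8, 50.4, 52.1, 47.9, 50.0, 49.3, 51.2, 50.6])

n = len(sample)
point_estimate = sample.mean()                # point estimate of the population mean
std_error = sample.std(ddof=1) / np.sqrt(n)   # standard error from the sample std

# 95% interval estimate based on the t-distribution (population std unknown)
t_crit = stats.t.ppf(0.975, df=n - 1)
interval = (point_estimate - t_crit * std_error,
            point_estimate + t_crit * std_error)

print(f"Point estimate: {point_estimate:.2f}")
print(f"95% interval estimate: ({interval[0]:.2f}, {interval[1]:.2f})")
```

The point estimate is one number, while the interval estimate also conveys how uncertain that number is.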

## Q1. Calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation of 5 using Python. Interpret the results.

In [None]:
import numpy as np
from scipy import stats

# Given values
mean = 50
std = 5
n = 30  # sample size is not given in the question; 30 is an assumption

# 95% confidence interval
confidence = 0.95
z = stats.norm.ppf(1 - (1-confidence)/2)
margin_of_error = z * (std / np.sqrt(n))
ci_lower = mean - margin_of_error
ci_upper = mean + margin_of_error

print(f"95% Confidence Interval: ({ci_lower:.2f}, {ci_upper:.2f})")

**Interpretation:**

With the assumed sample size of 30, the interval is approximately (48.21, 51.79). We are 95% confident that the true population mean lies in this range. More precisely, if we were to take many samples and compute a confidence interval from each, about 95% of those intervals would contain the true mean. It does not mean there is a 95% probability that the mean lies in this one particular interval.
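
The "many samples" reading of a confidence interval can be checked with a small simulation. This is only a sketch: it assumes a normal population with mean 50 and standard deviation 5, a sample size of 30, and a fixed random seed so the run is repeatable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
true_mean, true_std, n, trials = 50, 5, 30, 2000  # assumed population and sample size

z = stats.norm.ppf(0.975)
margin = z * true_std / np.sqrt(n)  # same margin of error for every sample (known std)

# Draw many samples and count how often mean +/- margin contains the true mean
sample_means = rng.normal(true_mean, true_std, size=(trials, n)).mean(axis=1)
covered = np.abs(sample_means - true_mean) <= margin
coverage = covered.mean()

print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should be close to 0.95, which is what the confidence level promises over repeated sampling.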

## Q2. Explain the difference between a one-tailed and a two-tailed hypothesis test. Provide examples of when each would be appropriate.

**Answer:**

- **One-tailed test:** Tests for an effect in one specified direction only. For example, testing if a new drug is *better* than the current drug (H₀: new ≤ current, H₁: new > current).
- **Two-tailed test:** Tests for an effect in either direction. For example, testing if a new drug is *different* (either better or worse) from the current drug (H₀: new = current, H₁: new ≠ current).

**When to use:**
- Use a one-tailed test when you are only interested in deviations in one direction.
- Use a two-tailed test when deviations in both directions are important.
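
To see how the choice of tails changes the result, the sketch below computes both p-values for the same test statistic. The z value of 1.8 is hypothetical.

```python
from scipy.stats import norm

z = 1.8        # hypothetical test statistic
alpha = 0.05

p_right = norm.sf(z)          # one-tailed (H1: new > current): area to the right of z
p_two = 2 * norm.sf(abs(z))   # two-tailed (H1: new != current): area in both tails

print(f"one-tailed p-value: {p_right:.4f} -> reject H0: {p_right < alpha}")
print(f"two-tailed p-value: {p_two:.4f} -> reject H0: {p_two < alpha}")
```

For the same data the two-tailed p-value is twice the one-tailed one, so a result can be significant in a one-tailed test and not in a two-tailed test. This is why the direction should be chosen before looking at the data.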

## Q4. What is a p-value? How is it used in hypothesis testing?

**Answer:**

A **p-value** is the probability of obtaining results at least as extreme as the observed results, assuming that the null hypothesis is true. In hypothesis testing, the p-value helps you decide whether to reject the null hypothesis:
- If the p-value is less than the chosen significance level (e.g., 0.05), you reject the null hypothesis.
- If the p-value is greater, you fail to reject the null hypothesis.

The smaller the p-value, the stronger the evidence against the null hypothesis.

## Q5. Calculate the p-value for a z-score of 2.0 in a right-tailed test using Python.

In [None]:
from scipy.stats import norm

z_score = 2.0
# For a right-tailed test, p-value is the area to the right of z
p_value = 1 - norm.cdf(z_score)
print(f"P-value for z=2.0 (right-tailed): {p_value:.4f}")

**Interpretation:**

A p-value of approximately 0.0228 means there is a 2.28% chance of observing a z-score of 2.0 or higher if the null hypothesis is true. If your significance level is 0.05, you would reject the null hypothesis.

## Q6. What is a Type I error and a Type II error? Provide examples of each.

**Answer:**

- **Type I Error (False Positive):** Rejecting the null hypothesis when it is actually true. Example: Concluding a new drug works when it actually does not.
- **Type II Error (False Negative):** Failing to reject the null hypothesis when it is actually false. Example: Concluding a new drug does not work when it actually does.
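
Both error rates can be estimated by simulation. The sketch below is an illustration under stated assumptions: normal data with standard deviation 1, samples of size 25, a one-sample t-test at alpha = 0.05, and a hypothetical true effect of 0.3 for the Type II case.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable
alpha, n, trials = 0.05, 25, 4000

# Case 1: H0 is true (population mean really is 0); any rejection is a Type I error
samples_null = rng.normal(loc=0.0, scale=1.0, size=(trials, n))
p_null = stats.ttest_1samp(samples_null, popmean=0.0, axis=1).pvalue
type1_rate = np.mean(p_null < alpha)

# Case 2: H0 is false (population mean is 0.3); failing to reject is a Type II error
samples_alt = rng.normal(loc=0.3, scale=1.0, size=(trials, n))
p_alt = stats.ttest_1samp(samples_alt, popmean=0.0, axis=1).pvalue
type2_rate = np.mean(p_alt >= alpha)

print(f"Estimated Type I error rate:  {type1_rate:.3f}")
print(f"Estimated Type II error rate: {type2_rate:.3f}")
```

The Type I rate should land near the chosen alpha, while the Type II rate depends on the effect size and sample size.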

## Q7. What is statistical power? How can it be increased?

**Answer:**

- **Statistical power** is the probability that a test correctly rejects a false null hypothesis (i.e., detects an effect when there is one).
- **How to increase power:**
  - Increase sample size
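  - Increase effect size (where the study design allows it, e.g., a stronger treatment)
  - Increase significance level (alpha), at the cost of a higher Type I error rate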
  - Reduce variability in the data
  - Use a more sensitive test
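
The effect of these factors can be seen in a closed-form power calculation. The function below is a hypothetical helper, not a library API, and it assumes a right-tailed one-sample z-test with known standard deviation.

```python
import numpy as np
from scipy.stats import norm

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a right-tailed one-sample z-test (hypothetical helper).

    effect: true mean minus the mean under H0
    sigma:  known population standard deviation
    """
    z_crit = norm.ppf(1 - alpha)           # rejection threshold under H0
    shift = effect * np.sqrt(n) / sigma    # mean of the z statistic under H1
    return norm.sf(z_crit - shift)         # P(z statistic > z_crit) under H1

for n in (10, 30, 100):
    print(f"n={n:>3}: power = {z_test_power(effect=0.5, sigma=2.0, n=n):.3f}")
```

Raising `n`, `effect`, or `alpha`, or lowering `sigma`, each increases the returned power.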

## Q8. What is the Central Limit Theorem (CLT)? Why is it important in statistics?

**Answer:**

The **Central Limit Theorem (CLT)** states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population's distribution, provided the observations are independent and identically distributed with finite variance.

**Importance:**
- Allows us to use normal probability theory to make inferences about population means, even when the population is not normally distributed.
- Forms the basis for many statistical tests and confidence intervals.
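
A quick simulation illustrates the theorem. This sketch assumes an exponential population (strongly right-skewed) and a fixed seed. It tracks the skewness of the sample means, which should shrink toward 0 (the value for a normal distribution) as the sample size grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

skewness = {}
for n in (2, 30, 200):
    # 5000 sample means, each from n draws of an exponential population
    means = rng.exponential(scale=1.0, size=(5000, n)).mean(axis=1)
    skewness[n] = stats.skew(means)
    print(f"n={n:>3}: mean of means = {means.mean():.3f}, "
          f"std of means = {means.std(ddof=1):.3f}, skewness = {skewness[n]:.3f}")
```

The standard deviation of the sample means also shrinks roughly like 1/sqrt(n), which is the standard error used in the confidence interval calculations.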

## Q11. What is the difference between parametric and non-parametric tests? Give examples of each.

**Answer:**

- **Parametric tests** assume underlying statistical distributions in the data (e.g., normal distribution). They are generally more powerful if assumptions are met.
  - Examples: t-test, ANOVA, Pearson correlation
- **Non-parametric tests** do not assume a specific distribution. They are used when data do not meet parametric assumptions.
  - Examples: Mann-Whitney U test, Wilcoxon signed-rank test, Kruskal-Wallis test, Spearman correlation
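
As an illustration, the sketch below runs one parametric and one non-parametric test on the same two small samples. The values are illustrative only.

```python
from scipy.stats import ttest_ind, mannwhitneyu

# Small illustrative samples for two independent groups
group_a = [80, 85, 78, 90, 88]
group_b = [75, 70, 72, 68, 74]

# Parametric: independent two-sample t-test (assumes roughly normal data)
t_stat, t_p = ttest_ind(group_a, group_b)

# Non-parametric: Mann-Whitney U test (uses ranks, no normality assumption)
u_stat, u_p = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test:         statistic = {t_stat:.3f}, p-value = {t_p:.4f}")
print(f"Mann-Whitney U: statistic = {u_stat:.1f}, p-value = {u_p:.4f}")
```

Both tests compare the two groups, but the t-test works on the values themselves and the Mann-Whitney U test works only on their ranks.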

## Q2. Write a Python function to estimate the population mean using a sample mean and standard deviation.

In [None]:
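import numpy as np
from scipy import stats

def estimate_population_mean(sample_mean, std_dev, n, confidence=0.95):
    # Point estimate is the sample mean; the interval uses the t-distribution (sample std)
    margin = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1) * std_dev / np.sqrt(n)
    return sample_mean, (sample_mean - margin, sample_mean + margin)

# Example usage:
print(estimate_population_mean(100, 15, 30))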

## Q3: What is Hypothesis testing? Why is it used? State the importance of Hypothesis testing.

Hypothesis testing is a statistical method used to make decisions about population parameters based on sample data. It is used to test assumptions or claims and helps in making data-driven decisions.

## Q4. Create a hypothesis that states whether the average weight of male college students is greater than the average weight of female college students.

- Null hypothesis (H0): The average weight of male college students is less than or equal to the average weight of female college students.
- Alternative hypothesis (H1): The average weight of male college students is greater than the average weight of female college students.
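
A hypothesis pair like this maps onto a one-sided two-sample test. The sketch below uses hypothetical weights in kilograms and assumes a SciPy version in which `ttest_ind` accepts the `alternative` argument (1.6 or later).

```python
from scipy.stats import ttest_ind

# Hypothetical weights in kg (illustrative values only)
male_weights = [72, 80, 77, 85, 69, 90, 74, 82]
female_weights = [58, 65, 61, 70, 55, 63, 67, 60]

# H1: mean(male) > mean(female), so use a right-tailed Welch t-test
t_stat, p_value = ttest_ind(male_weights, female_weights,
                            equal_var=False, alternative="greater")

print(f"t-statistic: {t_stat:.3f}, one-sided p-value: {p_value:.5f}")
```

A small p-value here would support H1, that male students are heavier on average. A large one would mean the data do not contradict H0.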

## Q5. Write a Python script to conduct a hypothesis test on the difference between two population means, given a sample from each population.

In [None]:
from scipy.stats import ttest_ind

def test_difference_of_means(sample1, sample2):
    # Welch's t-test (equal_var=False): does not assume equal population variances
    t_stat, p_value = ttest_ind(sample1, sample2, equal_var=False)
    return t_stat, p_value

# Example usage:
sample1 = [80, 85, 78, 90, 88]
sample2 = [75, 70, 72, 68, 74]
print(test_difference_of_means(sample1, sample2))

## Q6: What is a null and alternative hypothesis? Give some examples.

- **Null hypothesis (H0):** The default assumption (e.g., no difference, no effect).
- **Alternative hypothesis (H1):** The claim to be tested (e.g., there is a difference, there is an effect).

**Example:**
- H0: The mean test score is 70.
- H1: The mean test score is not 70.

## Q7: Write down the steps involved in hypothesis testing.

1. State the null and alternative hypotheses.
2. Choose a significance level (alpha).
3. Collect and summarize the data.
4. Calculate the test statistic and p-value.
5. Make a decision: reject or fail to reject the null hypothesis.
6. State the conclusion.
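
The steps above can be sketched in code for a one-sample t-test of whether the mean test score is 70. The scores are hypothetical.

```python
import numpy as np
from scipy import stats

# Step 1: state the hypotheses. H0: mean = 70, H1: mean != 70
mu_0 = 70

# Step 2: choose a significance level
alpha = 0.05

# Step 3: collect and summarize the data (hypothetical scores)
scores = np.array([72, 68, 75, 71, 69, 74, 77, 70, 73, 76])
print(f"n = {len(scores)}, mean = {scores.mean():.2f}, std = {scores.std(ddof=1):.2f}")

# Step 4: calculate the test statistic and p-value (two-tailed by default)
t_stat, p_value = stats.ttest_1samp(scores, popmean=mu_0)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")

# Step 5: make a decision
reject = p_value < alpha

# Step 6: state the conclusion
if reject:
    print("Reject H0: the data suggest the mean score differs from 70.")
else:
    print("Fail to reject H0: no significant evidence that the mean differs from 70.")
```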

## Q8. Define p-value and explain its significance in hypothesis testing.

The **p-value** is the probability of obtaining results at least as extreme as the observed results, assuming the null hypothesis is true. A small p-value (typically < 0.05) indicates strong evidence against the null hypothesis.

## Q9. Generate a Student's t-distribution plot using Python's matplotlib library, with the degrees of freedom parameter set to 10.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t

x = np.linspace(-4, 4, 100)
y = t.pdf(x, df=10)
plt.plot(x, y)
plt.title("Student's t-distribution (df=10)")
plt.xlabel('t')
plt.ylabel('Probability Density')
plt.show()

## Q10. Write a Python program to calculate the two-sample t-test for independent samples, given two random samples of equal size and a null hypothesis that the population means are equal.

In [None]:
from scipy.stats import ttest_ind

sample1 = [80, 85, 78, 90, 88]
sample2 = [75, 70, 72, 68, 74]
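# Student's t-test with pooled variance (default equal_var=True); H0: the population means are equal
t_stat, p_value = ttest_ind(sample1, sample2)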
print(f"t-statistic: {t_stat}, p-value: {p_value}")

## Q11: What is Student’s t distribution? When to use the t-Distribution.

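Student’s t-distribution is a bell-shaped probability distribution with heavier tails than the normal distribution. It is used for hypothesis tests and confidence intervals about a mean when the sample size is small and/or the population standard deviation is unknown and must be estimated from the sample. As the degrees of freedom increase, it approaches the standard normal distribution.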
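
One way to see when the t-distribution matters is to compare its critical values with the normal one. The sketch below does this for a two-sided 95% level. The degrees of freedom chosen are arbitrary examples.

```python
from scipy.stats import norm, t

z_crit = norm.ppf(0.975)  # two-sided 95% critical value for the normal distribution

t_crits = {}
for df in (5, 10, 30, 100):
    t_crits[df] = t.ppf(0.975, df)  # matching critical value for the t-distribution
    print(f"df={df:>3}: t critical = {t_crits[df]:.3f}  (z critical = {z_crit:.3f})")
```

For small degrees of freedom the t critical value is noticeably larger, giving wider intervals. For large samples the two are nearly identical.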

## Q12: What is t-statistic? State the formula for t-statistic.

The **t-statistic** measures the difference between a sample statistic and a population parameter in units of standard error.

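Formula: t = (sample_mean - population_mean) / (sample_std / sqrt(n)), where population_mean is the value hypothesized under H0, sample_std is the sample standard deviation, and n is the sample size.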
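
As a quick check of the formula, the sketch below computes the t-statistic by hand and compares it with `scipy.stats.ttest_1samp`. The sample values and the hypothesized mean are hypothetical.

```python
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5])  # hypothetical measurements
population_mean = 5.0                                    # value hypothesized under H0

n = len(sample)
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)  # ddof=1 gives the sample standard deviation

# t = (sample_mean - population_mean) / (sample_std / sqrt(n))
t_manual = (sample_mean - population_mean) / (sample_std / np.sqrt(n))
t_scipy = stats.ttest_1samp(sample, popmean=population_mean).statistic

print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}")
```

The two values agree, which confirms that the library uses the same standard-error scaling as the formula.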

## Q13. A coffee shop owner wants to estimate the average daily revenue for their shop. They take a random sample of 50 days and find the sample mean revenue to be $500 with a standard deviation of $50. Estimate the population mean revenue with a 95% confidence interval.

In [None]:
import scipy.stats as stats
import numpy as np

sample_mean = 500
std_dev = 50
n = 50
confidence = 0.95
z = stats.norm.ppf(1 - (1-confidence)/2)
margin_error = z * (std_dev / np.sqrt(n))
ci_lower = sample_mean - margin_error
ci_upper = sample_mean + margin_error
print(f"95% Confidence Interval: (${ci_lower:.2f}, ${ci_upper:.2f})")

## Q14. A researcher hypothesizes that a new drug will decrease blood pressure by 10 mmHg. They conduct a clinical trial with 100 patients and find that the sample mean decrease in blood pressure is 8 mmHg with a standard deviation of 3 mmHg. Test the hypothesis with a significance level of 0.05.

In [None]:
import numpy as np
import scipy.stats as stats

sample_mean = 8
population_mean = 10
std_dev = 3
n = 100
alpha = 0.05
z = (sample_mean - population_mean) / (std_dev / np.sqrt(n))
p_value = 2 * stats.norm.cdf(-abs(z))  # two-tailed test: H0 is mean decrease = 10, H1 is mean decrease != 10
print(f"z-score: {z}, p-value: {p_value}")
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

## Q15. An electronics company produces a certain type of product with a mean weight of 5 pounds and a standard deviation of 0.5 pounds. A random sample of 25 products is taken, and the sample mean weight is found to be 4.8 pounds. Test the hypothesis that the true mean weight of the products is less than 5 pounds with a significance level of 0.01.

In [None]:
import numpy as np
import scipy.stats as stats

sample_mean = 4.8
population_mean = 5
std_dev = 0.5
n = 25
alpha = 0.01
z = (sample_mean - population_mean) / (std_dev / np.sqrt(n))
p_value = stats.norm.cdf(z)  # left-tailed test
print(f"z-score: {z}, p-value: {p_value}")
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

## Q16. Two groups of students are given different study materials to prepare for a test. The first group (n1 = 30) has a mean score of 80 with a standard deviation of 10, and the second group (n2 = 40) has a mean score of 75 with a standard deviation of 8. Test the hypothesis that the population means for the two groups are equal with a significance level of 0.01.

In [None]:
import numpy as np
import scipy.stats as stats

mean1, std1, n1 = 80, 10, 30
mean2, std2, n2 = 75, 8, 40
alpha = 0.01
se = np.sqrt((std1**2/n1) + (std2**2/n2))
z = (mean1 - mean2) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-tailed test
print(f"z-score: {z}, p-value: {p_value}")
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

## Q17. A marketing company wants to estimate the average number of ads watched by viewers during a TV program. They take a random sample of 50 viewers and find that the sample mean is 4 with a standard deviation of 1.5. Estimate the population mean with a 99% confidence interval.

In [None]:
import scipy.stats as stats
import numpy as np

sample_mean = 4
std_dev = 1.5
n = 50
confidence = 0.99
z = stats.norm.ppf(1 - (1-confidence)/2)
margin_error = z * (std_dev / np.sqrt(n))
ci_lower = sample_mean - margin_error
ci_upper = sample_mean + margin_error
print(f"99% Confidence Interval: ({ci_lower:.2f}, {ci_upper:.2f})")