## Q1: What is the difference between a t-test and a z-test? Provide an example scenario where you would use each type of test.

The t-test and the z-test are both statistical hypothesis tests used to make inferences about population parameters based on sample data. However, they are applicable in different situations, primarily due to differences in assumptions and the type of data they are suitable for.

**1. Z-test:**
The z-test is used when the population standard deviation is known, or when the sample is large enough (a common rule of thumb is n > 30) that the sample standard deviation is a good stand-in for it. It relies on the sample mean being approximately normally distributed, which holds if the data are normal or, by the Central Limit Theorem, if the sample is large.

**Example scenario for a z-test:**
Let's say you have data on the heights of adult males in a certain country and you want to test whether the average height of adult males in this country is significantly different from a known population mean height (e.g., the global average height of adult males). In this case, if you have a large enough sample size and the standard deviation of the population is known, you can use a z-test to perform the hypothesis test.

**2. T-test:**
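The t-test is used when the population standard deviation is unknown and must be estimated from the sample. It matters most for small samples (typically n < 30), where the heavier tails of the t-distribution account for the extra uncertainty in that estimate, and it assumes the data are approximately normally distributed. For large samples the t-test and z-test give nearly identical results.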

**Example scenario for a t-test:**
Consider a scenario where you are testing a new drug's effectiveness in reducing blood pressure. You divide the participants into two groups: a control group that receives a placebo and a treatment group that receives the new drug. After the study, you want to compare the mean reduction in blood pressure between the two groups. Since the population standard deviation is unknown, and you have a relatively small sample size, you would use a t-test to compare the means of the two groups and assess whether the drug has a statistically significant effect on blood pressure.
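The practical difference between the two tests shows up in their critical values. The sketch below assumes SciPy is installed; the function name `two_sided_critical_values` and the chosen sample sizes are illustrative and not part of the original answer. It compares the two-sided 95% critical values of the standard normal and t-distributions.

```python
from scipy import stats

def two_sided_critical_values(confidence, sample_sizes):
    """Return (n, z_crit, t_crit) rows for a two-sided test at the given confidence."""
    tail = 1 - (1 - confidence) / 2           # e.g. 0.975 for 95% confidence
    z_crit = stats.norm.ppf(tail)             # z critical value (sigma known)
    rows = []
    for n in sample_sizes:
        t_crit = stats.t.ppf(tail, df=n - 1)  # t critical value (sigma estimated)
        rows.append((n, z_crit, t_crit))
    return rows

# Illustrative sample sizes (assumed, not from the text)
rows = two_sided_critical_values(0.95, [5, 10, 30, 100])
for n, z_crit, t_crit in rows:
    print(f"n={n:>3}  z={z_crit:.3f}  t={t_crit:.3f}")
```

The t critical value is always larger than the z value and approaches it as n grows, which is why the choice between the tests matters mainly for small samples.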



## Q2: Differentiate between one-tailed and two-tailed tests.

**1. One-tailed test:**
In a one-tailed test, the alternative hypothesis (H1) is formulated to test for a specific direction of the effect. It checks whether the population parameter is either significantly greater than or significantly less than a certain value, but not both. Therefore, the critical region is concentrated in one tail of the sampling distribution.

Mathematically, for a one-tailed test, the hypotheses are typically formulated as follows:
- Null hypothesis (H0): The population parameter is equal to a specific value.
- Alternative hypothesis (H1): The population parameter is either greater than or less than the specific value.

A one-tailed test has more power (is more likely to reject a false null hypothesis) than a two-tailed test at the same significance level, provided the true effect lies in the predicted direction. The direction must be chosen before looking at the data.

**2. Two-tailed test:**
In a two-tailed test, the alternative hypothesis (H1) is formulated to test for a difference in either direction from the hypothesized value. It checks whether the population parameter is significantly different from a specific value in either the positive or negative direction. Therefore, the critical region is split between both tails of the sampling distribution.

Mathematically, for a two-tailed test, the hypotheses are typically formulated as follows:
- Null hypothesis (H0): The population parameter is equal to a specific value.
- Alternative hypothesis (H1): The population parameter is different (either greater than or less than) from the specific value.

A two-tailed test is more conservative and used when you want to test if there is any significant difference without predicting the direction of the effect.

**Example:**
Suppose you are testing whether a new educational program improves students' test scores. Your null hypothesis is that the mean test scores of students who go through the program are the same as those who don't (no effect). Depending on your research question, the alternative hypotheses would be:

- One-tailed test (directional): The mean test scores of students who go through the program are significantly greater than those who don't.
- Two-tailed test (non-directional): The mean test scores of students who go through the program are significantly different (either greater or less) from those who don't.
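To see how the choice of tails changes the result, here is a minimal sketch using only the Python standard library. The test statistic z = 1.8 and the function name `z_test_p_values` are assumed purely for illustration.

```python
from statistics import NormalDist

def z_test_p_values(z):
    """Return (one_tailed_upper, two_tailed) p-values for a z statistic."""
    std_normal = NormalDist()                      # mean 0, standard deviation 1
    one_tailed = 1 - std_normal.cdf(z)             # H1: parameter > hypothesized value
    two_tailed = 2 * (1 - std_normal.cdf(abs(z)))  # H1: parameter != hypothesized value
    return one_tailed, two_tailed

one_tailed, two_tailed = z_test_p_values(1.8)      # 1.8 is an illustrative value
print(f"one-tailed p = {one_tailed:.4f}, two-tailed p = {two_tailed:.4f}")
```

With this statistic the one-tailed p-value falls below 0.05 while the two-tailed p-value does not, so the same data lead to different decisions depending on how the alternative hypothesis was stated.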


## Q3: Explain the concept of Type 1 and Type 2 errors in hypothesis testing. Provide an example scenario for each type of error.

In hypothesis testing, there are two types of errors that can occur: Type 1 error (false positive) and Type 2 error (false negative). These errors arise when we make decisions about the population based on sample data and the results of a statistical test.

**1. Type 1 Error (False Positive):**
A Type 1 error occurs when we reject the null hypothesis when it is actually true. In other words, we mistakenly conclude that there is a significant effect or difference when there is no such effect or difference in the population. This is akin to a "false alarm" or a "finding" that doesn't really exist.

The probability of committing a Type 1 error is denoted by the symbol alpha (α), and it is typically set as the significance level of the test. Common significance levels include 0.05 (5%) or 0.01 (1%).

**Example Scenario for Type 1 Error:**
Imagine a clinical drug trial where the null hypothesis (H0) states that the new drug has no effect on patients' recovery time. The alternative hypothesis (H1) is that the drug does have a significant effect, shortening the recovery time.

Type 1 Error: The researchers mistakenly reject the null hypothesis and conclude that the new drug has a significant effect on recovery time when, in reality, it does not. As a result, the drug might be approved for use, leading to unnecessary costs and potential side effects for patients.

**2. Type 2 Error (False Negative):**
A Type 2 error occurs when we fail to reject the null hypothesis when it is actually false. In other words, we miss detecting a significant effect or difference that exists in the population. This is akin to a "missed opportunity" to identify a real effect.

The probability of committing a Type 2 error is denoted by the symbol beta (β). The power of a statistical test is equal to (1 - β) and represents the probability of correctly rejecting the null hypothesis when it is false.

**Example Scenario for Type 2 Error:**
Continuing with the clinical drug trial example, suppose the null hypothesis (H0) is the same as before, and the alternative hypothesis (H1) still suggests that the drug has a significant effect, shortening recovery time.

Type 2 Error: The researchers fail to reject the null hypothesis, concluding that the new drug has no significant effect on recovery time, when, in reality, it does. As a result, an effective drug that could save lives and improve patient outcomes goes unnoticed and is not put into use.
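Both error rates can be estimated by simulation. The sketch below is an illustration under assumed settings (a two-sided z-test of H0: μ = 0 with known σ = 1, n = 25, and a hypothetical true effect of 0.5); none of these numbers come from the drug trial example.

```python
import random
from statistics import NormalDist

def rejection_rate(true_mean, n=25, sigma=1.0, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated two-sided z-tests of H0: mu = 0 that reject H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = sample_mean / (sigma / n ** 0.5)   # test statistic under H0: mu = 0
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

# H0 is true: every rejection is a Type 1 error, so the rate should be near alpha.
type_1_rate = rejection_rate(true_mean=0.0)
# H0 is false: every non-rejection is a Type 2 error, so this estimates beta.
type_2_rate = 1 - rejection_rate(true_mean=0.5)
print(f"Type 1 rate ~ {type_1_rate:.3f}, Type 2 rate ~ {type_2_rate:.3f}")
```

The Type 1 rate is controlled directly by α, while the Type 2 rate depends on the effect size, the sample size, and the variability of the data.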


## Q4: Explain Bayes's theorem with an example.

Bayes's theorem is a fundamental concept in probability theory and statistics that allows us to update the probability of an event based on new evidence or information. It provides a way to incorporate prior knowledge or beliefs (prior probability) and new data (likelihood) to obtain a revised or updated probability (posterior probability). The formula for Bayes's theorem is as follows:

$P(A|B)=\frac{P(B|A) \cdot P(A)}{P(B)}$

Where:

- P(A∣B) is the posterior probability of event A given event B has occurred.
- P(B∣A) is the likelihood of observing event B given that event A has occurred.
- P(A) is the prior probability of event A before observing any new evidence.
- P(B) is the probability of event B before observing any new evidence.


Let's go through an example to illustrate how Bayes's theorem works:

**Example: Diagnostic Test for a Disease**

Suppose we have a medical test to diagnose a particular disease, and the prevalence of the disease in the population is 1%. The test is not perfect and can sometimes produce false positive and false negative results.

- Prior Probability: The probability of a randomly selected person having the disease is 1% (or 0.01). So, $P(\text{Disease}) = 0.01$.
- Likelihood: The test correctly detects the disease in 95% of people who have it (its sensitivity). Therefore, the probability of a positive test result given that a person has the disease is 0.95, or $P(\text{Positive Test}|\text{Disease}) = 0.95$.
- False Positive Rate: The test has a 5% false positive rate, meaning it incorrectly indicates the disease in 5% of healthy individuals (equivalently, its specificity is 95%). So, $P(\text{Positive Test}|\text{No Disease}) = 0.05$.

Now, suppose we want the probability that a person has the disease given that they tested positive, i.e., $P(\text{Disease}|\text{Positive Test})$.

Using Bayes's theorem:

$P(\text{Disease}|\text{Positive Test}) = \frac{P(\text{Positive Test}|\text{Disease}) \cdot P(\text{Disease})}{P(\text{Positive Test})}$

We need the denominator $P(\text{Positive Test})$, which follows from the law of total probability:

$P(\text{Positive Test}) = P(\text{Positive Test}|\text{Disease}) \cdot P(\text{Disease}) + P(\text{Positive Test}|\text{No Disease}) \cdot P(\text{No Disease})$

Given that the prevalence of the disease is 1%, the probability of no disease (the complement) is $P(\text{No Disease}) = 1 - P(\text{Disease}) = 0.99$.

Now, we can calculate $P(\text{Positive Test})$:

$P(\text{Positive Test}) = (0.95 \cdot 0.01) + (0.05 \cdot 0.99)$

$P(\text{Positive Test}) = 0.0095 + 0.0495 = 0.059$

Now, we can calculate the posterior probability:

$P(\text{Disease}|\text{Positive Test}) = \frac{0.95 \cdot 0.01}{0.059}$

$P(\text{Disease}|\text{Positive Test}) \approx 0.1610$

So, the probability that a person has the disease given that they tested positive is approximately 0.161, or about 16.1%. Even after a positive result the disease remains unlikely, because it is rare in the population and most positives come from the much larger healthy group.
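The same calculation can be written as a small function. This is a sketch of the diagnostic test example; the function name `posterior_given_positive` and its parameter names are illustrative choices, not part of the original.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(Disease | Positive Test) via Bayes's theorem."""
    # Law of total probability: P(Positive Test)
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    # Bayes's theorem: likelihood * prior / evidence
    return sensitivity * prior / p_positive

posterior = posterior_given_positive(prior=0.01, sensitivity=0.95,
                                     false_positive_rate=0.05)
print(f"P(Disease | Positive Test) = {posterior:.4f}")  # prints 0.1610
```

Re-running the function with a higher prior (for example, for a high-risk group) shows how strongly the posterior depends on prevalence.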

Bayes's theorem is a powerful tool that allows us to update our beliefs or probabilities as we acquire new evidence or data. It is widely used in various fields, including medical diagnosis, machine learning, and artificial intelligence.

## Q5: What is a confidence interval? How to calculate the confidence interval, explain with an example.

A confidence interval is a range of values that is constructed based on sample data and is used to estimate the true population parameter with a certain level of confidence. It provides a measure of uncertainty about the population parameter and is often used in statistics to infer about the population mean, proportion, or other parameters.

Confidence intervals are typically expressed with two values: a lower bound and an upper bound, which form a range. The confidence level describes the long-run reliability of the procedure: if we repeated the sampling many times and built a 95% confidence interval from each sample, about 95% of those intervals would contain the true population parameter. It is not the probability that the parameter lies in one particular computed interval, because the parameter is fixed and any given interval either contains it or does not.

The formula to calculate a confidence interval depends on the parameter being estimated (mean, proportion, etc.) and the sample data's characteristics. The most common type of confidence interval is for estimating the population mean.

**How to Calculate a Confidence Interval for the Population Mean:**

The formula to calculate the confidence interval for the population mean ($\mu$) is:

$\text{Confidence Interval} = \text{Sample Mean} \pm \text{Margin of Error}$

The margin of error is calculated as:

$\text{Margin of Error} = \text{Critical Value} \times \left( \frac{\text{Sample Standard Deviation}}{\sqrt{\text{Sample Size}}} \right)$

The critical value is obtained from the standard normal distribution or the t-distribution based on the desired confidence level and the sample size. For larger sample sizes (usually n > 30), the critical value comes from the standard normal distribution (Z-score). For smaller sample sizes, the t-distribution is used.

**Example of Calculating a Confidence Interval:**

Let's say we have a random sample of 50 students' test scores from a larger population. The sample mean is 78, and the sample standard deviation is 5. We want to calculate a 95% confidence interval for the population mean test score.

1. First, find the critical value associated with a 95% confidence level for a sample size of 50. Since the sample size is relatively large, we can use the standard normal distribution. The critical value for a 95% confidence level is approximately 1.96.

2. Next, calculate the margin of error:
   $ \text{Margin of Error} = 1.96 \times \left( \frac{5}{\sqrt{50}} \right) \approx 1.39 $

3. Finally, construct the confidence interval:
   $ \text{Confidence Interval} = 78 \pm 1.39 = (76.61, 79.39) $

The 95% confidence interval for the population mean test score is (76.61, 79.39). This means that we are 95% confident that the true population mean test score lies within this range.
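The three steps can be sketched in Python using only the standard library. The function name `mean_confidence_interval` is illustrative, and the sketch assumes the normal (z) critical value is appropriate, as in the example.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    """z-based confidence interval for a population mean (large-sample case)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sample_std / sqrt(n)                   # margin of error
    return sample_mean - margin, sample_mean + margin

lower, upper = mean_confidence_interval(78, 5, 50)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")  # prints (76.61, 79.39)
```

For a small sample, the z critical value would be replaced by the corresponding t critical value with n - 1 degrees of freedom.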

## Q6. Use Bayes' Theorem to calculate the probability of an event occurring given prior knowledge of the event's probability and new evidence. Provide a sample problem and solution.


**Sample Problem:**
Suppose we have a box of 100 marbles, containing 30 red marbles and 70 blue marbles. You randomly pick one marble from the box without looking and then roll a fair six-sided die. You want to find the probability that you picked a red marble from the box given that the die shows a 4.

**Solution:**
Let's define the events:
- Event A: Picking a red marble from the box.
- Event B: Rolling a die and getting a 4.

We are asked to find $P(A|B)$, which represents the probability of picking a red marble from the box given that the die shows a 4.

Using Bayes' Theorem:
$ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} $

1. Prior Probability: The prior probability of picking a red marble from the box, $P(A)$, is 30/100 since there are 30 red marbles out of 100.

2. Likelihood: The probability of rolling a 4 given that we picked a red marble, $P(B|A)$, is 1/6 because each side of the fair die has an equal chance of 1/6.

3. Likelihood under the complement: The probability of rolling a 4 given that we picked a blue marble, $P(B|\sim A)$, is also 1/6, as the die's outcome is independent of the color of the marble.

4. Total Probability of B: We need to calculate the total probability of rolling a 4, $P(B)$, using the Law of Total Probability:
$P(B) = P(B|A) \cdot P(A) + P(B|\sim A) \cdot P(\sim A)$

   where $P(\sim A)$ is the probability of not picking a red marble, which is $1 - P(A) = 70/100$.

Now, let's plug the values into Bayes' Theorem:

$ P(A|B) = \frac{\frac{1}{6} \cdot \frac{30}{100}}{\frac{1}{6} \cdot \frac{30}{100} + \frac{1}{6} \cdot \frac{70}{100}} $

Simplifying:
$ P(A|B) = \frac{\frac{1}{20}}{\frac{1}{20} + \frac{7}{60}} = \frac{\frac{1}{20}}{\frac{1}{6}} = \frac{3}{10} $

So, the probability of picking a red marble from the box given that the die shows a 4 is $\frac{3}{10}$, or 0.3. This equals the prior $P(A)$: the die roll is independent of the marble's color, so the evidence does not change the probability.
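The arithmetic can be checked exactly with Python's `fractions` module. The variable names below are illustrative.

```python
from fractions import Fraction

p_red = Fraction(30, 100)            # prior P(A)
p_four_given_red = Fraction(1, 6)    # likelihood P(B|A)
p_four_given_blue = Fraction(1, 6)   # P(B|~A); the die ignores the marble color

# Law of total probability: P(B)
p_four = p_four_given_red * p_red + p_four_given_blue * (1 - p_red)

# Bayes' Theorem: P(A|B)
p_red_given_four = p_four_given_red * p_red / p_four
print(p_four, p_red_given_four)      # prints 1/6 3/10
```

Because the likelihood is the same under A and under its complement, it cancels out of Bayes' Theorem and the posterior equals the prior.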

## Q7. Calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation of 5. Interpret the results.

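The question does not give a sample size, so the calculation below assumes n = 30. The 95% confidence interval for the mean is

$C.I. = \bar{x} \pm z \cdot \frac{s}{\sqrt{n}}$

where $z = 1.96$ is the critical value for a 95% confidence level.

$C.I. = 50 \pm 1.96 \cdot \frac{5}{\sqrt{30}}$

$C.I. = 50 \pm 1.79$

$C.I. = (48.21, 51.79)$

Interpretation: we are 95% confident that the true population mean lies between 48.21 and 51.79. A different sample size would change the width of the interval.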

## Q8. What is the margin of error in a confidence interval? How does sample size affect the margin of error? Provide an example of a scenario where a larger sample size would result in a smaller margin of error.

The margin of error (MOE) is a statistical measure used in estimating population parameters based on a sample. When conducting surveys or collecting data, it is often not feasible to collect data from an entire population, so we gather data from a sample and use it to make inferences about the whole population.

The margin of error is the half-width of a confidence interval: the largest amount by which the sample estimate is expected to differ from the true population parameter at a given confidence level (e.g., 95%). For proportions it is usually quoted in percentage points. It reflects the variability in the sample data, and adding it to and subtracting it from the sample estimate gives the interval within which the population parameter is likely to be found.

The formula for calculating the margin of error in a confidence interval is:

MOE = Z * (σ / √n)

where:
- MOE = Margin of Error
- Z = Z-score, which corresponds to the desired confidence level (e.g., 1.96 for a 95% confidence level)
- σ = Population standard deviation; if it is unknown, the sample standard deviation s is used in its place
- n = Sample size

As the sample size increases, the margin of error decreases. In other words, larger sample sizes lead to more precise estimates. This is because larger samples provide more information about the population, reducing the variability in the estimate.

Example scenario:

Let's say a political pollster wants to estimate the proportion of voters who support a particular candidate within a city. They conduct two surveys with different sample sizes and calculate the margin of error for each:

Survey 1: Sample size (n1) = 400
Survey 2: Sample size (n2) = 1000

Assuming both surveys used the same methodology and had a 95% confidence level (Z = 1.96 for a 95% confidence level), and the estimated proportion of voters who support the candidate is approximately 50% (this is the worst-case scenario, which results in the highest MOE):

For a proportion, the standard deviation of a single response is σ = √(p(1 - p)), which equals 0.5 when p = 0.5, so it is the same in both surveys.

MOE1 = 1.96 * (0.5 / √400) = 0.049 (4.9 percentage points)

MOE2 = 1.96 * (0.5 / √1000) ≈ 0.031 (3.1 percentage points)

Since both surveys are estimating the same proportion (50%), and the standard deviation of the population is assumed to be the same, the only difference is the sample size. In this scenario, MOE2 (the margin of error for Survey 2) would be smaller than MOE1 (the margin of error for Survey 1) because the larger sample size in Survey 2 results in a more precise estimate.

So, a larger sample size (n2 = 1000) would lead to a smaller margin of error compared to a smaller sample size (n1 = 400) when estimating the same population parameter with the same confidence level.
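The effect of sample size can be sketched numerically. The function name `proportion_margin_of_error` is illustrative, and the sketch assumes the normal approximation with Z = 1.96 and p = 0.5 as in the polling scenario.

```python
from math import sqrt

def proportion_margin_of_error(p, n, z=1.96):
    """Margin of error for a sample proportion under the normal approximation."""
    return z * sqrt(p * (1 - p) / n)

moe_400 = proportion_margin_of_error(0.5, 400)    # Survey 1
moe_1000 = proportion_margin_of_error(0.5, 1000)  # Survey 2
print(f"n=400:  MOE = {moe_400:.3f}")   # prints 0.049
print(f"n=1000: MOE = {moe_1000:.3f}")  # prints 0.031
```

Because the margin of error shrinks with √n, halving it requires roughly four times as many respondents.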

## Q9. Calculate the z-score for a data point with a value of 75, a population mean of 70, and a population standard deviation of 5. Interpret the results.

$\mu = 70$

$x = 75$

$\sigma = 5$

$z = \frac{x - \mu}{\sigma}$

$z = \frac{75 - 70}{5}$

$z = 1$

The data point x = 75 lies 1 standard deviation above the population mean of 70.
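As a quick check, the z-score can be computed with the standard library (Python 3.9 or later for `zscore`). The percentile line additionally assumes the population is normally distributed, which the question does not state.

```python
from statistics import NormalDist

population = NormalDist(mu=70, sigma=5)
z = population.zscore(75)          # (75 - 70) / 5
share_below = population.cdf(75)   # valid only if the population is normal (assumed)
print(f"z = {z}, share of values below 75 = {share_below:.4f}")
```

Under that normality assumption, a z-score of 1 places the data point at roughly the 84th percentile.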

## Q10. In a study of the effectiveness of a new weight loss drug, a sample of 50 participants lost an average of 6 pounds with a standard deviation of 2.5 pounds. Conduct a hypothesis test to determine if the drug is significantly effective at a 95% confidence level using a t-test.


Null Hypothesis (H0): The new weight loss drug is not effective. The true population mean weight loss (μ) is 0 pounds (or less).
H0: μ = 0

Alternative Hypothesis (Ha): The new weight loss drug is effective. The true population mean weight loss (μ) is greater than 0 pounds.
Ha: μ > 0

Step 1: Calculate the test statistic t:
The formula for the t-test is given by:
t = (x̄ - μ) / (s / √n)

Where:
x̄ = Sample mean weight loss (6 pounds)
μ = Population mean weight loss under the null hypothesis (0 pounds)
s = Sample standard deviation (2.5 pounds)
n = Sample size (50 participants)

Since we are conducting a one-tailed (upper-tail) test (Ha: μ > 0) at α = 0.05, the critical value is the 95th percentile of the t-distribution with degrees of freedom (df) = n - 1 = 49.


If the test statistic exceeds the critical value (equivalently, if the p-value is below α), we reject the null hypothesis in favor of the alternative hypothesis. Otherwise, we fail to reject the null hypothesis.

Now, let's calculate the t-test and determine if the drug is significantly effective at a 95% confidence level:



Given data  

sample_mean = 6

sample_std = 2.5

sample_size = 50

population_mean_null = 0

confidence_level = 0.95


$t = \frac{6 - 0}{2.5 / \sqrt{50}} \approx 16.97$


Critical value (df = 49, α = 0.05, one-tailed) ≈ 1.677

Since the test statistic (16.97) is greater than the critical value (1.677), we reject the null hypothesis. The new weight loss drug is significantly effective at the 95% confidence level.
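The test can be reproduced from the summary statistics. This sketch assumes SciPy is installed; the function name `one_sample_t_test_upper` is illustrative.

```python
from math import sqrt
from scipy import stats

def one_sample_t_test_upper(sample_mean, sample_std, n, mu0=0.0, alpha=0.05):
    """Upper-tailed one-sample t-test (Ha: mu > mu0) from summary statistics."""
    t_stat = (sample_mean - mu0) / (sample_std / sqrt(n))
    df = n - 1
    critical_value = stats.t.ppf(1 - alpha, df)  # 95th percentile for alpha = 0.05
    p_value = stats.t.sf(t_stat, df)             # P(T > t_stat) under H0
    return t_stat, critical_value, p_value, t_stat > critical_value

t_stat, critical_value, p_value, reject = one_sample_t_test_upper(6, 2.5, 50)
print(f"t = {t_stat:.2f}, critical value = {critical_value:.3f}, reject H0: {reject}")
```

The test statistic is about ten times the critical value, so the null hypothesis is rejected and the p-value is effectively zero.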


## Q11. In a survey of 500 people, 65% reported being satisfied with their current job. Calculate the 95% confidence interval for the true proportion of people who are satisfied with their job.

To calculate the 95% confidence interval for the true proportion of people who are satisfied with their job, we can use the formula for confidence intervals for proportions. The sample proportion follows a binomial model, and the normal approximation is reasonable here because the sample is large enough: both n·p̂ = 325 and n·(1 - p̂) = 175 are well above 10.

The formula for the confidence interval for proportions is:

CI = p̂ ± Z * √((p̂ * (1 - p̂)) / n)

Where:
- CI: Confidence Interval
- p̂: Sample proportion (65% = 0.65 in decimal form)
- Z: Z-score corresponding to the desired confidence level (for a 95% confidence level, Z ≈ 1.96)
- n: Sample size (500)

Let's calculate the confidence interval:

CI = 0.65 $\pm$ 1.96 * $\sqrt{(0.65 * (1 - 0.65)) / 500}$ = 0.65 $\pm$ 0.042

95% Confidence Interval: [0.608, 0.692]

This means that we are 95% confident that the true proportion of people satisfied with their job lies between approximately 60.8% and 69.2%.
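A minimal sketch of this calculation follows. The function name `proportion_confidence_interval` is illustrative, and the threshold of 10 in the validity check is a common rule of thumb rather than a fixed standard.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(p_hat, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    # Rule-of-thumb check that the normal approximation is reasonable (assumed threshold)
    if min(n * p_hat, n * (1 - p_hat)) < 10:
        raise ValueError("sample too small for the normal approximation")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

lower, upper = proportion_confidence_interval(0.65, 500)
print(f"95% Confidence Interval: [{lower:.3f}, {upper:.3f}]")  # prints [0.608, 0.692]
```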

## Q12. A researcher is testing the effectiveness of two different teaching methods on student performance. Sample A has a mean score of 85 with a standard deviation of 6, while sample B has a mean score of 82 with a standard deviation of 5. Conduct a hypothesis test to determine if the two teaching methods have a significant difference in student performance using a t-test with a significance level of 0.01.

To conduct a hypothesis test to determine if there is a significant difference in student performance between two teaching methods (Sample A and Sample B), we will use a two-sample independent t-test. This test is appropriate when comparing the means of two independent groups (in this case, the two teaching methods).

Null Hypothesis (H0): The two teaching methods do not have a significant difference in student performance. The population means of the two groups are equal.
H0: μA - μB = 0

Alternative Hypothesis (Ha): The two teaching methods have a significant difference in student performance. The population means of the two groups are not equal.
Ha: μA - μB ≠ 0

where:
μA: Population mean score for teaching method A
μB: Population mean score for teaching method B

The significance level is given as 0.01, which means we want to test the hypothesis at a 99% confidence level.

To perform the t-test, we need to calculate the test statistic (t) and compare it to the critical value of the t-distribution to determine if we can reject the null hypothesis.

The formula for the two-sample independent t-test is given by:

t = (x̄A - x̄B) / √((sA^2 / nA) + (sB^2 / nB))

Where:
x̄A: Sample mean score for Sample A (85)
x̄B: Sample mean score for Sample B (82)
sA: Sample standard deviation for Sample A (6)
sB: Sample standard deviation for Sample B (5)
nA: Sample size for Sample A
nB: Sample size for Sample B

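Step 1: Calculate the test statistic t. The question does not state the sample sizes, so we assume nA = nB = 10 for illustration.

t = (85 - 82) / √((6^2 / 10) + (5^2 / 10)) = 3 / √6.1 ≈ 1.21

Step 2: Determine the critical value or p-value for a two-tailed test at the 0.01 significance level and degrees of freedom df = nA + nB - 2.

df = 10 + 10 - 2 = 18, which gives a two-tailed critical value of about 2.878.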

Step 3: Compare the test statistic with the critical value or p-value.
- If the absolute value of the test statistic is greater than the critical value or if the p-value is less than the significance level (0.01), we reject the null hypothesis in favor of the alternative hypothesis, indicating a significant difference in student performance between the two teaching methods.
- Otherwise, we fail to reject the null hypothesis.


The test can be run in Python as follows:

    import scipy.stats as stats

    # Summary statistics for Sample A (the sample size is assumed; the question does not give it)
    sample_mean_A = 85
    sample_std_A = 6
    sample_size_A = 10

    # Summary statistics for Sample B (the sample size is assumed; the question does not give it)
    sample_mean_B = 82
    sample_std_B = 5
    sample_size_B = 10

    alpha = 0.01

    # Calculate the test statistic t
    numerator = sample_mean_A - sample_mean_B
    denominator = ((sample_std_A ** 2) / sample_size_A) + ((sample_std_B ** 2) / sample_size_B)
    t_statistic = numerator / (denominator ** 0.5)

    # Calculate the degrees of freedom
    degrees_of_freedom = sample_size_A + sample_size_B - 2

    # Calculate the critical value for a two-tailed test
    critical_value = stats.t.ppf(1 - alpha / 2, df=degrees_of_freedom)

    # Calculate the p-value for a two-tailed test
    p_value = 2 * (1 - stats.t.cdf(abs(t_statistic), df=degrees_of_freedom))

    # Compare the test statistic with the critical value and p-value
    if abs(t_statistic) > critical_value or p_value < alpha:
        print("Reject the null hypothesis. There is a significant difference in student performance between the two teaching methods.")
    else:
        print("Fail to reject the null hypothesis. There is no significant difference in student performance between the two teaching methods.")

Output:

    Fail to reject the null hypothesis. There is no significant difference in student performance between the two teaching methods.


## Q13. A population has a mean of 60 and a standard deviation of 8. A sample of 50 observations has a mean of 65. Calculate the 90% confidence interval for the true population mean.

The formula for the confidence interval for a population mean is:

CI = x̄ ± Z * (σ / √n)

Where:
- CI: Confidence Interval
- x̄: Sample mean (65 in this case)
- Z: Z-score corresponding to the desired confidence level (for a 90% confidence level, Z ≈ 1.645)
- σ: Population standard deviation (8 in this case)
- n: Sample size (50)



The calculation in Python:

    import math

    # Given data
    sample_mean = 65
    population_std = 8
    sample_size = 50
    confidence_level = 0.90  # corresponds to the Z value used below

    # Calculate the standard error
    standard_error = population_std / math.sqrt(sample_size)

    # Calculate the margin of error (using Z ≈ 1.645 for a 90% confidence level)
    margin_of_error = 1.645 * standard_error

    # Calculate the lower and upper bounds of the confidence interval
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error

    # Print the confidence interval
    print("90% Confidence Interval: [{:.2f}, {:.2f}]".format(lower_bound, upper_bound))

Output:

    90% Confidence Interval: [63.14, 66.86]

We are 90% confident that the true population mean lies between 63.14 and 66.86. The stated population mean of 60 does not enter the formula; only the sample mean, the population standard deviation, and the sample size are used.


## Q14. In a study of the effects of caffeine on reaction time, a sample of 30 participants had an average reaction time of 0.25 seconds with a standard deviation of 0.05 seconds. Conduct a hypothesis test to determine if the caffeine has a significant effect on reaction time at a 90% confidence level using a t-test.

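H0: $\mu = 0$

H1: $\mu \ne 0$

$n = 30$

$\mu_0 = 0$

$\bar{x} = 0.25$

$s = 0.05$

$\alpha = 0.1$

The question gives no baseline reaction time, so the null value $\mu_0 = 0$ is used here; with a real baseline, substitute it for $\mu_0$. The test statistic is

t = (x̄ - μ0) / (s / √n)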

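    import math
    import scipy.stats as stats

    # Given data
    sample_size = 30
    sample_mean = 0.25
    population_mean = 0      # null-hypothesis value
    sample_std = 0.05
    confidence_level = 0.90
    alpha = 1 - confidence_level

    # Test statistic
    t_stats = (sample_mean - population_mean) / (sample_std / math.sqrt(sample_size))

    # Two-tailed critical value and p-value (H1: mu != 0)
    critical_value = stats.t.ppf(1 - alpha / 2, df=sample_size - 1)
    p_value = 2 * stats.t.sf(abs(t_stats), df=sample_size - 1)

    if abs(t_stats) > critical_value or p_value < alpha:
        print("Reject the null hypothesis. Caffeine has a significant effect on reaction time.")
    else:
        print("Fail to reject the null hypothesis. Caffeine does not have a significant effect on reaction time.")

Output:

    Reject the null hypothesis. Caffeine has a significant effect on reaction time.

Here t ≈ 27.39, far above the two-tailed critical value of about 1.699 for df = 29, so the null hypothesis is rejected at the 90% confidence level.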
