# Q1: What is the difference between a t-test and a z-test? Provide an example scenario where you would use each type of test.

The t-test and z-test are both statistical hypothesis tests used to compare means of two groups or to determine if a sample mean differs significantly from a population mean. The main difference between the two tests lies in the circumstances under which they are applicable.

1. t-test:
The t-test is used when the population standard deviation is unknown and has to be estimated from the sample. This matters most when the sample is small (a common rule of thumb is fewer than 30 observations), because the estimate is then less reliable. The test assumes the data are approximately normally distributed. There are two main types of t-tests:

- One-sample t-test: Used to determine if the mean of a sample significantly differs from a known population mean.
- Independent samples t-test: Used to compare the means of two independent groups (samples) to see if they are significantly different from each other.

Example scenario for t-test: Let's say you want to compare the average scores of two groups of students, where each group contains fewer than 30 students. You don't know the population standard deviation, but you can assume that the data is approximately normally distributed. In this case, you would use an independent samples t-test to assess whether there is a significant difference between the mean scores of the two groups.

2. z-test:
The z-test is used when the population standard deviation is known and either the sample is large (a common rule of thumb is more than 30 observations) or the data are normally distributed. The z-test is less common in practice than the t-test because the population standard deviation is rarely known in real-world scenarios.

Example scenario for z-test: Imagine you have a large dataset of exam scores from a university, and you know the population standard deviation of scores for that particular exam. You want to determine if the average score of a specific class significantly differs from the overall average score of the university. In this situation, you would use a z-test to compare the class's mean score with the known population mean.

In summary, use the t-test when the sample size is small and/or the population standard deviation is unknown. Use the z-test when the sample size is large and the population standard deviation is known. If the sample size is large, and the population standard deviation is unknown, the t-test is still a suitable choice, but it approaches the z-test as the sample size increases.
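The z-test scenario above can be sketched with the Python standard library alone. The numbers used here (a university mean of 70, a known population standard deviation of 10, and a class of 40 students averaging 73) are hypothetical values chosen for illustration, and `one_sample_z_test` is a helper name introduced for this sketch.

```python
import math
from statistics import NormalDist


def one_sample_z_test(sample_mean, pop_mean, pop_std, n):
    """Return the z statistic and two-sided p-value for a one-sample z-test."""
    standard_error = pop_std / math.sqrt(n)        # known sigma, so no estimation
    z = (sample_mean - pop_mean) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # area in both tails
    return z, p_value


# Hypothetical values, for illustration only.
z, p = one_sample_z_test(sample_mean=73, pop_mean=70, pop_std=10, n=40)
print(f"z = {z:.3f}, two-sided p = {p:.4f}")
```

If the population standard deviation were unknown, the same statistic would be computed with the sample standard deviation and compared against a t distribution with n - 1 degrees of freedom instead.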

# Q2: Differentiate between one-tailed and two-tailed tests.

One-tailed and two-tailed tests are two types of hypothesis tests used in statistics to determine whether there is a significant difference or relationship between groups or variables. The key difference between these tests lies in the directionality of the hypothesis being tested.

1. One-tailed test:
In a one-tailed test, the alternative hypothesis (research hypothesis) specifies a particular direction for the effect or relationship, while the null hypothesis covers no effect (and, implicitly, an effect in the opposite direction). The one-tailed test is used when researchers have a specific expectation about the direction of the effect, decided before looking at the data.

For example, let's consider testing the effect of a new drug on participants' reaction times. The hypotheses would be formulated as follows:

Null hypothesis (H0): The new drug has no effect on reaction times.
Alternative hypothesis (H1): The new drug decreases reaction times.

In this case, you would conduct a one-tailed test to see if there is enough evidence to reject the null hypothesis in favor of the alternative hypothesis, specifically looking for a decrease in reaction times due to the new drug.

The critical region for a one-tailed test is located entirely in one tail of the distribution, making it easier to achieve statistical significance if the data supports the expected direction.

2. Two-tailed test:
In a two-tailed test, the null hypothesis states that there is no difference or relationship. The alternative hypothesis asserts that there is a difference or relationship, but it does not specify the direction.

Continuing with the same example of testing the effect of a new drug on reaction times, the hypotheses for a two-tailed test would be formulated as follows:

Null hypothesis (H0): The new drug has no effect on reaction times.
Alternative hypothesis (H1): The new drug has an effect on reaction times (either increasing or decreasing them).

In a two-tailed test, you are interested in whether there is a significant change in either direction—either an increase or decrease in reaction times.

The critical region for a two-tailed test is split between both tails of the distribution, making it less likely to achieve statistical significance compared to a one-tailed test, as the evidence must support a significant effect regardless of the direction.

In summary, a one-tailed test is used when researchers have a specific directional hypothesis, and they want to determine if the data supports that specific direction. A two-tailed test, on the other hand, is used when researchers are interested in any significant effect, regardless of the direction.
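The difference between the two critical regions can be seen by computing both p-values for the same test statistic. This is a minimal sketch using a standard normal reference distribution; the statistic z = -1.80 is a hypothetical value standing in for a drug trial in which reaction times decreased, and `p_values` is a helper name introduced here.

```python
from statistics import NormalDist


def p_values(z):
    """Return (one-tailed lower, two-tailed) p-values for a z statistic."""
    std_normal = NormalDist()
    p_lower = std_normal.cdf(z)                 # H1: the effect is a decrease
    p_two = 2 * (1 - std_normal.cdf(abs(z)))    # H1: an effect in either direction
    return p_lower, p_two


# Hypothetical test statistic, for illustration only.
p_one, p_two = p_values(-1.80)
print(f"one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}")
```

At a significance level of 0.05, this hypothetical statistic would be significant under the one-tailed test but not under the two-tailed test, because the two-tailed p-value is twice as large.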

# Q3: Explain the concept of Type 1 and Type 2 errors in hypothesis testing. Provide an example scenario for each type of error.

In hypothesis testing, Type 1 and Type 2 errors are two types of mistakes that can occur when making decisions about the null hypothesis (H0) and alternative hypothesis (H1).

1. Type 1 error (False Positive or α-error):
A Type 1 error occurs when we reject the null hypothesis (H0) when it is actually true. In other words, we conclude that there is a significant effect or relationship when, in reality, there is no effect or relationship in the population. The probability of committing a Type 1 error is denoted by the symbol α (alpha) and is also known as the significance level.

Example scenario for Type 1 error:
Imagine a medical test designed to detect a specific disease. The null hypothesis (H0) in this case would be that the person being tested does not have the disease. The alternative hypothesis (H1) would be that the person does have the disease. If the test is overly sensitive or not well-calibrated, it may produce false positives, indicating that a person has the disease when, in fact, they do not.

2. Type 2 error (False Negative or β-error):
A Type 2 error occurs when we fail to reject the null hypothesis (H0) when it is actually false. In other words, we conclude that there is no significant effect or relationship when, in reality, there is an effect or relationship in the population. The probability of committing a Type 2 error is denoted by the symbol β (beta).

Example scenario for Type 2 error:
Consider a clinical trial for a new drug intended to lower blood pressure. The null hypothesis (H0) would be that the drug has no effect on blood pressure, while the alternative hypothesis (H1) would be that the drug does lower blood pressure. If the trial lacks statistical power or the sample size is too small, it may fail to detect the drug's actual effect, leading to a Type 2 error. In this case, the trial would conclude that the drug is ineffective when it actually could be effective.

It's important to note that the probabilities of Type 1 and Type 2 errors are typically related, meaning that decreasing the probability of one type of error often increases the probability of the other. Researchers must choose an appropriate significance level (α) when designing their experiments to control the risk of Type 1 error. Additionally, they should consider factors like sample size and effect size to mitigate the risk of Type 2 error and ensure the statistical power of their study is sufficient to detect meaningful effects if they exist.
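The trade-off described above can be made concrete for a simple case. The sketch below assumes a one-sided z-test of H0: mean = 0 against H1: mean > 0 with a known standard deviation; the effect size of 2 and standard deviation of 8 are hypothetical, and `type2_error_rate` is a helper name introduced for this illustration.

```python
import math
from statistics import NormalDist


def type2_error_rate(effect, sigma, n, alpha=0.05):
    """Beta for a one-sided z-test of H0: mean = 0 vs H1: mean > 0."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)      # rejection threshold under H0
    shift = effect * math.sqrt(n) / sigma       # where the statistic centres under H1
    return std_normal.cdf(z_crit - shift)       # chance of landing below the threshold


# Hypothetical effect size and standard deviation, for illustration only.
for n in (10, 40, 160):
    beta = type2_error_rate(effect=2.0, sigma=8.0, n=n)
    print(f"n = {n:3d}: beta = {beta:.3f}, power = {1 - beta:.3f}")
```

Holding the sample size fixed and lowering `alpha` raises beta, while increasing `n` lowers beta without changing `alpha`.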

# Q4: Explain Bayes's theorem with an example.

Bayes's theorem, named after the Reverend Thomas Bayes, is a fundamental concept in probability theory and statistics. It provides a way to update the probability of a hypothesis based on new evidence. The theorem mathematically describes the relationship between conditional probabilities, allowing us to make more informed decisions in the presence of uncertain or incomplete information.

The formula for Bayes's theorem is as follows:

\[ P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)} \]

Where:
- \( P(H|E) \) is the posterior probability of hypothesis H given evidence E.
- \( P(E|H) \) is the likelihood of evidence E given hypothesis H.
- \( P(H) \) is the prior probability of hypothesis H (the probability of H before considering any evidence).
- \( P(E) \) is the probability of evidence E (the normalizing constant).

Now, let's illustrate Bayes's theorem with a classic example known as the "diagnostic test" scenario:

Example scenario:
Suppose you are a doctor, and you have a patient who is experiencing flu-like symptoms. You want to determine the probability that the patient has the flu (H) based on the results of a diagnostic test (E). You know the following probabilities:

1. Prior probability: \( P(\text{Flu}) = 0.05 \) (The probability that a randomly selected person has the flu, before any test results).
2. Sensitivity: \( P(\text{Positive Test Result} | \text{Flu}) = 0.9 \) (The probability that the test correctly identifies a person with the flu as positive).
3. Specificity: \( P(\text{Negative Test Result} | \text{No Flu}) = 0.95 \) (The probability that the test correctly identifies a person without the flu as negative).

Now, you want to find the probability that the patient has the flu given a positive test result.

Using Bayes's theorem:
Let H be the event "the patient has the flu," and E be the event "the test result is positive."

\[ P(\text{Flu}|\text{Positive Test}) = \frac{P(\text{Positive Test}|\text{Flu}) \cdot P(\text{Flu})}{P(\text{Positive Test})} \]

To compute the denominator, we need to consider all possibilities that could lead to a positive test result:

\[ P(\text{Positive Test}) = P(\text{Positive Test}|\text{Flu}) \cdot P(\text{Flu}) + P(\text{Positive Test}|\text{No Flu}) \cdot P(\text{No Flu}) \]

Since \( P(\text{No Flu}) = 1 - P(\text{Flu}) \) (the probability of not having the flu), we can calculate the denominator and then find the posterior probability of having the flu given a positive test result.

By substituting the known values, you can now calculate \( P(\text{Flu}|\text{Positive Test}) \). The result will give you the probability that the patient actually has the flu, given the positive test result. This updated probability takes into account both the prior probability of having the flu and the test's accuracy in identifying true positive cases.
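The substitution can be carried out in a few lines of Python. This sketch uses only the three probabilities given in the example; `posterior_given_positive` is a helper name introduced here.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """Return P(condition | positive test) using Bayes's theorem."""
    false_positive_rate = 1 - specificity                  # P(Positive | No Flu)
    p_positive = (sensitivity * prior
                  + false_positive_rate * (1 - prior))     # total probability of a positive
    return sensitivity * prior / p_positive


p_flu = posterior_given_positive(prior=0.05, sensitivity=0.9, specificity=0.95)
print(f"P(Flu | Positive Test) = {p_flu:.4f}")
```

With these inputs the denominator is 0.045 + 0.0475 = 0.0925, so the posterior is 0.045 / 0.0925, or about 0.49. Even after a positive result the patient is slightly more likely not to have the flu, because the low prior means false positives are about as common as true positives.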

# Q5: What is a confidence interval? How to calculate the confidence interval, explain with an example.

A confidence interval is a range of values within which we are reasonably confident that the true population parameter (e.g., mean, proportion) lies. It provides a measure of the uncertainty associated with estimating a population parameter from a sample.

When conducting statistical analysis, we often have a sample from a population, and we want to estimate an unknown population parameter. A point estimate (e.g., sample mean, sample proportion) gives us a single value as an estimate, but it doesn't tell us anything about the uncertainty associated with that estimate. A confidence interval, on the other hand, gives us a range of values that is likely to contain the true population parameter with a specified level of confidence.

The confidence interval is often expressed as:

\[ \text{Point Estimate} \pm \text{Margin of Error} \]

The margin of error is calculated based on the variability in the data and the desired level of confidence.

Example of calculating a confidence interval for the population mean:

Suppose you want to estimate the average height of students in a university. You take a random sample of 100 students and measure their heights. Let's assume the sample mean height is 170 cm, and the sample standard deviation is 5 cm.

1. Choose the level of confidence: Let's say you want a 95% confidence interval. This means you want to be 95% confident that the true average height of all students in the university lies within the interval.

2. Find the critical value: The population standard deviation is unknown, so strictly a t critical value applies, but with a sample this large (n = 100) the t and normal distributions are nearly identical and the Z-table is a reasonable approximation. The critical value for a 95% confidence interval is approximately 1.96.

3. Calculate the margin of error: The margin of error (ME) is given by:

\[ ME = \text{Critical Value} \times \frac{\text{Sample Standard Deviation}}{\sqrt{\text{Sample Size}}} \]

\[ ME = 1.96 \times \frac{5}{\sqrt{100}} = 1.96 \times 0.5 = 0.98 \]

4. Calculate the confidence interval: Now, you can construct the confidence interval using the formula:

\[ \text{Confidence Interval} = \text{Sample Mean} \pm \text{Margin of Error} \]

\[ \text{Confidence Interval} = 170 \pm 0.98 \]

The 95% confidence interval for the average height of all students in the university is (169.02, 170.98) cm. This means that you are 95% confident that the true average height lies between these two values.

Remember that increasing the level of confidence will widen the confidence interval, making it more likely to capture the true population parameter, but at the cost of a less precise estimate. Confidence intervals are an essential tool in statistics to help us draw meaningful conclusions and make inferences about population parameters based on sample data.
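The four steps of the height example can be reproduced with the standard library. This is a sketch under the same normal approximation used in the example; `mean_confidence_interval` is a helper name introduced here.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    margin = z_crit * sample_std / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


low, high = mean_confidence_interval(170, 5, 100)
print(f"95% CI: ({low:.2f}, {high:.2f})")

# A higher confidence level gives a wider interval for the same data.
low99, high99 = mean_confidence_interval(170, 5, 100, confidence=0.99)
print(f"99% CI: ({low99:.2f}, {high99:.2f})")
```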

# Q6. Use Bayes' Theorem to calculate the probability of an event occurring given prior knowledge of the event's probability and new evidence. Provide a sample problem and solution.

You have two bags, Bag A and Bag B, containing some marbles.
Bag A has 2 red marbles and 3 blue marbles.
Bag B has 1 red marble and 4 blue marbles.

You randomly pick one bag and then randomly draw one marble from the chosen bag. The marble turns out to be red. What is the probability that you picked Bag A?

A: Picking Bag A.
B: Drawing a red marble.

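P(A) = 0.5
P(¬A) = 0.5
P(B|A) = 2/5 = 0.4
P(B|¬A) = 1/5 = 0.2

P(B) = P(B|A) * P(A) + P(B|¬A) * P(¬A)
P(B) = (0.4 * 0.5) + (0.2 * 0.5)
P(B) = 0.2 + 0.1
P(B) = 0.3

P(A|B) = P(B|A) * P(A) / P(B)
P(A|B) = (0.4 * 0.5) / 0.3
P(A|B) = 0.2 / 0.3
P(A|B) ≈ 0.6667

So, given that the marble is red, there is about a 66.67% chance that it came from Bag A.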
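The same calculation can be checked with exact fractions, which avoids rounding. This is a small sketch using Python's `fractions` module; the variable names are introduced here for illustration.

```python
from fractions import Fraction

prior_a = Fraction(1, 2)        # P(A): Bag A is picked
prior_b = Fraction(1, 2)        # P(not A): Bag B is picked
red_given_a = Fraction(2, 5)    # P(B|A): 2 red out of 5 marbles in Bag A
red_given_b = Fraction(1, 5)    # P(B|not A): 1 red out of 5 marbles in Bag B

# Law of total probability for drawing a red marble.
p_red = red_given_a * prior_a + red_given_b * prior_b

# Bayes' theorem.
posterior_a = red_given_a * prior_a / p_red
print(f"P(Bag A | red) = {posterior_a} = {float(posterior_a):.4f}")
```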

# Q7. Calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation of 5. Interpret the results.

In [2]:
import scipy.stats as stats

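sample_mean = 50
sample_std = 5

# No sample size is given, so the standard deviation is used directly as the scale.
ci = stats.norm.interval(confidence=0.95, loc=sample_mean, scale=sample_std)
print("95% Confidence interval: ",ci)

95% Confidence interval:  (40.200180077299734, 59.799819922700266)

Interpretation: the interval is 50 ± 1.96 × 5, or roughly (40.2, 59.8). Because the standard deviation itself is used as the scale, this is the range expected to contain about 95% of individual values from a normal distribution with mean 50 and standard deviation 5. A confidence interval for the population mean would use the standard error 5/√n and would be narrower.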
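If a sample size were available, the interval for the mean would be built from the standard error rather than the raw standard deviation. The sketch below assumes a hypothetical sample size of n = 100, which is not given in the question, and uses the normal approximation from the standard library.

```python
import math
from statistics import NormalDist

sample_mean = 50
sample_std = 5
n = 100   # hypothetical sample size; the question does not state one

z_crit = NormalDist().inv_cdf(0.975)            # two-sided 95% critical value
standard_error = sample_std / math.sqrt(n)      # spread of the sample mean
ci_mean = (sample_mean - z_crit * standard_error,
           sample_mean + z_crit * standard_error)
print(f"95% CI for the mean: ({ci_mean[0]:.2f}, {ci_mean[1]:.2f})")
```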


# Q8. What is the margin of error in a confidence interval? How does sample size affect the margin of error? Provide an example of a scenario where a larger sample size would result in a smaller margin of error.

The margin of error in a confidence interval is a measure of the uncertainty or precision of the estimate of a population parameter (e.g., mean, proportion) based on a sample. It quantifies the range within which the true population parameter is likely to lie. A smaller margin of error indicates a more precise estimate, while a larger margin of error indicates a less precise estimate.

In general, the margin of error is affected by three main factors:
1. Confidence level: The higher the confidence level (e.g., 95%, 99%), the larger the margin of error, as it requires a wider interval to capture a higher proportion of potential sample estimates.
2. Standard deviation or variability of the population: A larger population standard deviation results in a larger margin of error since the data points are more spread out, leading to greater uncertainty in the estimate.
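3. Sample size: A larger sample size leads to a smaller margin of error. As the sample size increases, the standard error of the estimate (σ / √n) decreases, so the sample mean varies less from sample to sample and the estimate becomes more precise. The spread of the individual data points does not change.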

Example Scenario:

Suppose you want to estimate the average height of students in a university. You can take two different sample sizes: one with 30 students and another with 300 students.

For the sample with 30 students:
- Assume the sample mean height is 170 cm.
- Assume the standard deviation of the height in the population is 8 cm.
- Assume a 95% confidence level.

Using the formula for the margin of error:

Margin of error = (Z * (σ / √n))

where:
- Z is the critical value from the standard normal distribution corresponding to the desired confidence level (for 95% confidence, Z ≈ 1.96).
- σ is the population standard deviation.
- n is the sample size.

For the sample with 30 students:
Margin of error = (1.96 * (8 / √30)) ≈ 2.86 cm

For the sample with 300 students:
Margin of error = (1.96 * (8 / √300)) ≈ 0.91 cm

As you can see, the larger sample size (300 students) resulted in a smaller margin of error (0.91 cm) compared to the smaller sample size (30 students) with a margin of error of 2.86 cm. Because the margin of error shrinks with √n, a tenfold increase in sample size reduces it by a factor of √10 ≈ 3.16. This demonstrates that a larger sample size provides a more precise estimate with a smaller range of uncertainty around the estimated population parameter.
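The two margins can be recomputed directly from the formula above. This is a minimal standard-library sketch; `margin_of_error` is a helper name introduced here, and it takes the population standard deviation of 8 cm from the example.

```python
import math
from statistics import NormalDist


def margin_of_error(sigma, n, confidence=0.95):
    """Margin of error Z * sigma / sqrt(n) for a mean with known sigma."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z_crit * sigma / math.sqrt(n)


for n in (30, 300):
    print(f"n = {n:3d}: margin of error = {margin_of_error(8, n):.2f} cm")

# Ratio of the two margins; equals sqrt(300 / 30) regardless of sigma.
ratio = margin_of_error(8, 30) / margin_of_error(8, 300)
print(f"ratio = {ratio:.3f}")
```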

# Q9. Calculate the z-score for a data point with a value of 75, a population mean of 70, and a population standard deviation of 5. Interpret the results.

In [1]:
Value = 75
Population_mean = 70
Population_std = 5

Z_Score = (Value - Population_mean) / Population_std
print(f"Z-Score: {Z_Score}")

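Z-Score: 1.0

Interpretation: a z-score of 1.0 means the value 75 lies exactly one standard deviation above the population mean of 70.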


# Q10. In a study of the effectiveness of a new weight loss drug, a sample of 50 participants lost an average of 6 pounds with a standard deviation of 2.5 pounds. Conduct a hypothesis test to determine if the drug is significantly effective at a 95% confidence level using a t-test.

In [6]:
import scipy.stats as stats
import numpy as np

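n = 50               # sample size
sample_mean = 6      # mean weight loss in pounds
sample_std = 2.5     # sample standard deviation in pounds
null_mean = 0        # H0: the drug produces no average weight loss
alpha = 1 - 0.95
df = n - 1

# Two-tailed critical value and one-sample t statistic
Decision_rule = stats.t.ppf(1 - alpha / 2 , df)
t_statistics = (sample_mean - null_mean) / (sample_std / np.sqrt(n))

print(f"t-statistics: {t_statistics}")
print(f"Decision_rule: {Decision_rule}")
if abs(t_statistics) > Decision_rule:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")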

t-statistics: 16.970562748477143
Decision_rule: 2.009575234489209
Reject the null hypothesis


# Q11. In a survey of 500 people, 65% reported being satisfied with their current job. Calculate the 95% confidence interval for the true proportion of people who are satisfied with their job.

In [7]:
import numpy as np
import scipy.stats as stats

sample_size = 500
Confidence_level = 0.95
Sample_proportion = 0.65
alpha = 1 - Confidence_level

margin_error = stats.norm.ppf(1 - (alpha) / 2)  *  np.sqrt(Sample_proportion * (1 - Sample_proportion) / sample_size)
confidence_interval = (Sample_proportion - margin_error , Sample_proportion + margin_error)
print(f"Confidence_interval: {confidence_interval}")

Confidence_interval: (0.6081925393809212, 0.6918074606190788)


# Q12. A researcher is testing the effectiveness of two different teaching methods on student performance. Sample A has a mean score of 85 with a standard deviation of 6, while sample B has a mean score of 82 with a standard deviation of 5. Conduct a hypothesis test to determine if the two teaching methods have a significant difference in student performance using a t-test with a significance level of 0.01

In [10]:
import numpy as np
import scipy.stats as stats

sample_a_mean = 85
sample_b_mean = 82
sample_a_std = 6
sample_b_std = 5
# The question gives no sample sizes; 50 per group is an assumption.
n1 = 50
n2 = 50
alpha = 0.01
df = n1 + n2 - 2   # degrees of freedom for an independent two-sample t-test

# Two-tailed critical value
decision_rule = stats.t.ppf(1 - alpha / 2 , df)
# Standard error of the difference: variances add, not standard deviations
standard_error = np.sqrt(sample_a_std**2 / n1 + sample_b_std**2 / n2)
t_statistics = (sample_a_mean - sample_b_mean) / standard_error

print(f"decision_rule: {decision_rule:.3f}")
print(f"t-statistics: {t_statistics:.3f}")
if abs(t_statistics) > decision_rule:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

decision_rule: 2.627
t-statistics: 2.716
Reject the null hypothesis


# Q13. A population has a mean of 60 and a standard deviation of 8. A sample of 50 observations has a mean of 65. Calculate the 90% confidence interval for the true population mean.

In [13]:
import scipy.stats as stats
import numpy as np

n = 50
mean = 60            # stated population mean (not needed for the interval)
std = 8              # population standard deviation
sample_mean = 65
alpha = 1 - 0.90

z_score = stats.norm.ppf(1 - alpha / 2)
margin_of_error = z_score * (std / np.sqrt(n))
c_interval = (sample_mean - margin_of_error , sample_mean + margin_of_error)
print(f"90 % Confidence Interval: ({c_interval[0]:.2f}, {c_interval[1]:.2f})")

90 % Confidence Interval: (63.14, 66.86)


# Q14. In a study of the effects of caffeine on reaction time, a sample of 30 participants had an average reaction time of 0.25 seconds with a standard deviation of 0.05 seconds. Conduct a hypothesis test to determine if the caffeine has a significant effect on reaction time at a 90% confidence level using a t-test.

In [14]:
import numpy as np
import scipy.stats as stats

n = 30
mean = 0.25          # sample mean reaction time in seconds
std = 0.05           # sample standard deviation in seconds
alpha = 1 - 0.90
df = n - 1

# The question gives no baseline reaction time, so the sample mean is
# tested against zero here, as in the original calculation.
decision_rule = stats.t.ppf(1 - alpha / 2 ,df)
t_statistics = (mean) / (std / np.sqrt(n))

print(f"decision_rule: {decision_rule:.3f}")
print(f"T-statistics: {t_statistics:.3f}")

if abs(t_statistics) > decision_rule:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

decision_rule: 1.699
T-statistics: 27.386
Reject the null hypothesis
