# **ASSIGNMENT**

**Q1: What is the difference between a t-test and a z-test? Provide an example scenario where you would
use each type of test.**

The main difference between a t-test and a z-test lies in the underlying assumptions and the situations in which they are applicable.

A t-test is used when the population standard deviation is unknown and has to be estimated from the sample, which matters most when the sample is small (typically fewer than 30 observations). It relies on the t-distribution, which has heavier tails than the standard normal distribution (z-distribution) to account for that extra uncertainty. As the sample size grows, the t-distribution approaches the normal distribution.

Example scenario for using a t-test:
Suppose we want to compare the effectiveness of two different teaching methods in improving students' test scores. We randomly select two groups of students, one for each teaching method, with 20 students in each group. We measure their test scores and want to determine whether the mean scores of the two groups differ significantly. In this case, we would use a two-sample t-test because the population standard deviations are unknown and the samples are small.

A z-test, on the other hand, is used when the population standard deviation is known, or when the sample is large (typically more than 30 observations) so that the sampling distribution of the test statistic is approximately normal. It relies on the standard normal distribution and is used to test hypotheses about a population mean or proportion.

Example scenario for using a z-test:
Suppose we want to investigate whether the proportion of customers who prefer a particular product is significantly higher than a predetermined value of 0.5. We collect a large sample of 500 customers and record whether each customer prefers the product. Since the sample size is large, we can use a one-proportion z-test to determine whether the observed proportion is significantly higher than 0.5.

Therefore, a t-test is suitable for small sample sizes or when the population standard deviation is unknown, while a z-test is appropriate for large sample sizes when the population standard deviation is known or can be approximated.
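The z-test scenario above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the count of 270 customers out of 500 is a hypothetical figure chosen for the example, not data from the scenario.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0):
    """One-sided z-test of H0: p = p0 against H1: p > p0."""
    p_hat = successes / n
    # The standard error is computed under the null hypothesis
    standard_error = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Upper-tail probability of the standard normal distribution
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical survey result: 270 of 500 customers prefer the product
z, p_value = one_proportion_z_test(successes=270, n=500, p0=0.5)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

With a small sample and an unknown population standard deviation, the same comparison would use the t-distribution instead of `NormalDist`.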

**Q2: Differentiate between one-tailed and two-tailed tests.**

In hypothesis testing, one-tailed and two-tailed tests refer to the directionality of the hypothesis being tested and the associated critical region in the statistical distribution.

1. One-Tailed Test:
In a one-tailed test, the alternative hypothesis is defined in a specific direction. The critical region is located entirely in one tail of the distribution. The purpose of a one-tailed test is to determine if the sample data supports the hypothesis that the population parameter is either greater than or less than a specific value.

Example:
Suppose we want to test the hypothesis that a new treatment improves test scores. The alternative hypothesis would state that the new treatment results in a higher mean test score. In this case, a one-tailed test is appropriate because the alternative hypothesis is directional, focusing only on one side of the distribution.

2. Two-Tailed Test:
In a two-tailed test, the alternative hypothesis does not specify a particular direction; it only states that there is a difference or relationship between variables. The critical region is divided into two equal tails of the distribution. The purpose of a two-tailed test is to determine if the sample data supports the hypothesis that the population parameter is different from a specific value.

Example:
Suppose we want to test the hypothesis that the average height of a certain population is different from 65 inches. The alternative hypothesis would state that the average height is not equal to 65 inches, without specifying whether it is greater or less than. In this case, a two-tailed test is appropriate because the alternative hypothesis encompasses both sides of the distribution.

In summary, a one-tailed test is used when the alternative hypothesis is directional, focusing on one side of the distribution, while a two-tailed test is used when the alternative hypothesis is non-directional, allowing for differences in either direction from the specific value being tested. 
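The practical consequence of this choice shows up in the p-value. The sketch below assumes a z-test and a hypothetical test statistic of 1.8, and computes the p-value under each alternative hypothesis.

```python
from statistics import NormalDist


def p_values(z):
    """Return p-values for the three possible alternative hypotheses."""
    std_normal = NormalDist()
    upper_tail = 1 - std_normal.cdf(z)   # H1: parameter > hypothesized value
    lower_tail = std_normal.cdf(z)       # H1: parameter < hypothesized value
    # A two-tailed test counts extreme values in both directions
    two_tailed = 2 * min(upper_tail, lower_tail)
    return {"greater": upper_tail, "less": lower_tail, "two-sided": two_tailed}


# Hypothetical test statistic
result = p_values(1.8)
for alternative, p in result.items():
    print(f"{alternative:>9}: p = {p:.4f}")
```

At a significance level of 0.05, this statistic is significant under the one-tailed "greater" alternative but not under the two-tailed alternative, which is why the direction must be chosen before looking at the data.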

**Q3: Explain the concept of Type 1 and Type 2 errors in hypothesis testing. Provide an example scenario for
each type of error.**

In hypothesis testing, Type 1 and Type 2 errors are two possible errors that can occur when making decisions based on statistical tests.

1. Type 1 Error (False Positive):
A Type 1 error occurs when the null hypothesis (H0) is true, but we reject it in favor of the alternative hypothesis (H1). In other words, we conclude that there is a significant effect or relationship when there isn't one in reality. Type 1 errors represent a false positive result.

Example scenario for a Type 1 error:
Suppose a pharmaceutical company is testing a new drug to treat a specific condition. The null hypothesis is that the drug has no effect, while the alternative hypothesis is that the drug is effective. If the company erroneously rejects the null hypothesis and concludes that the drug is effective when it actually has no effect, this would be a Type 1 error.

2. Type 2 Error (False Negative):
A Type 2 error occurs when the null hypothesis is false, but we fail to reject it. In other words, we fail to detect a significant effect or relationship that actually exists. Type 2 errors represent a false negative result.

Example scenario for a Type 2 error:
Suppose a diagnostic test is conducted to detect a certain disease. The null hypothesis is that the person does not have the disease, while the alternative hypothesis is that the person has the disease. If the test fails to detect the disease in a person who actually has it, leading to the conclusion that the person is disease-free, this would be a Type 2 error.

It is important to note that, for a fixed sample size, Type 1 and Type 2 errors trade off against each other: decreasing the probability of one generally increases the probability of the other. The significance level (alpha) chosen for the hypothesis test sets the Type 1 error rate directly and therefore determines where this trade-off lands; only a larger sample can reduce both.

Therefore, Type 1 error occurs when we incorrectly reject a true null hypothesis, while Type 2 error occurs when we fail to reject a false null hypothesis. 
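The trade-off between the two error types can be made concrete for a simple case. The sketch below assumes a one-sided z-test with a hypothetical true effect of 0.5 standard deviations and a hypothetical sample size of 25, and computes the Type 2 error rate (beta) for several choices of alpha.

```python
from math import sqrt
from statistics import NormalDist


def type2_error_rate(alpha, effect_size, n):
    """Beta for a one-sided z-test (H1: mean greater than the null value).

    effect_size is the true shift in units of the population standard deviation.
    """
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1 - alpha)  # rejection threshold under H0
    # Under H1 the test statistic is centered at effect_size * sqrt(n)
    return std_normal.cdf(z_critical - effect_size * sqrt(n))


alphas = [0.10, 0.05, 0.01]
betas = [type2_error_rate(a, effect_size=0.5, n=25) for a in alphas]
for a, b in zip(alphas, betas):
    print(f"alpha = {a:.2f} -> beta = {b:.3f}")
```

As alpha is lowered, beta rises; increasing `n` is the only way in this setup to lower both at once.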

**Q4: Explain Bayes's theorem with an example.**

Bayes's theorem, named after the mathematician Thomas Bayes, is a fundamental concept in probability theory that describes how to update the probability of a hypothesis based on new evidence. It allows us to revise our initial beliefs (prior probability) about an event or hypothesis in light of new data (likelihood) to obtain a more accurate or updated belief (posterior probability).

Bayes's theorem can be expressed mathematically as:

P(A|B) = (P(B|A) * P(A)) / P(B)

where:
- P(A|B) represents the probability of event A given event B (posterior probability).
- P(B|A) represents the probability of event B given event A (likelihood).
- P(A) represents the probability of event A before considering event B (prior probability).
- P(B) represents the probability of event B.

To understand Bayes's theorem, let's consider an example:

Example:
Suppose we are visiting a foreign country where a rare disease affects 0.1% of the population. We also know that the diagnostic test for this disease is not perfect. It correctly identifies a positive case 95% of the time (sensitivity) and produces a false positive result in 2% of healthy individuals (a specificity of 98%). We decide to take the test and it comes back positive.

Now, we want to determine the probability that we actually have the disease (posterior probability) based on the positive test result.

Let's assign the events as follows:
A: Having the disease
B: Positive test result

We are given the following probabilities:
P(A) = 0.001 (prior probability of having the disease)
P(B|A) = 0.95 (probability of a positive test result given having the disease)
P(B|¬A) = 0.02 (probability of a positive test result given not having the disease)
P(¬A) = 1 - P(A) = 0.999 (prior probability of not having the disease)

We can now apply Bayes's theorem to calculate the posterior probability:

P(A|B) = (P(B|A) * P(A)) / P(B)
         = (P(B|A) * P(A)) / ((P(B|A) * P(A)) + (P(B|¬A) * P(¬A)))
         = (0.95 * 0.001) / ((0.95 * 0.001) + (0.02 * 0.999))
         = 0.00095 / 0.02093
         ≈ 0.0454

The result indicates that the probability of having the disease, given a positive test result, is approximately 0.0454 or 4.54%. Even with a positive test result, the low prevalence of the disease dominates: most positive results come from the much larger group of healthy people, so a positive result is far more likely to be a false positive than a true one.
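The same calculation can be written as a small function, which makes it easy to see how the posterior reacts to the prior. This is a minimal sketch; the function and parameter names are illustrative.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes's theorem."""
    # Law of total probability: P(positive) over sick and healthy people
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence


# Values from the example: 0.1% prevalence, 95% sensitivity, 2% false positives
p = posterior(prior=0.001, sensitivity=0.95, false_positive_rate=0.02)
print(f"P(disease | positive) = {p:.4f}")

# The same test applied where the disease is more common (hypothetical 10% prevalence)
p_common = posterior(prior=0.10, sensitivity=0.95, false_positive_rate=0.02)
print(f"With 10% prevalence:    {p_common:.4f}")
```

Raising the prior while keeping the test unchanged raises the posterior sharply, which shows that the low posterior in the example is driven by the rarity of the disease rather than by a poor test.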



**Q5: What is a confidence interval? How to calculate the confidence interval, explain with an example.**

A confidence interval is a range of values that provides an estimate of the true population parameter (such as the mean, proportion, or difference between means) along with a level of confidence. It indicates the uncertainty associated with estimating the population parameter based on a sample.

The calculation of a confidence interval involves three key components:
1. Sample statistic: This is the value calculated from the sample data, such as the sample mean or proportion.
2. Margin of error: It quantifies the variability or uncertainty associated with the sample statistic. It is typically determined based on the desired level of confidence and the standard deviation (or standard error) of the population or sample.
3. Confidence level: It is the long-run proportion of intervals, built by the same procedure on repeated samples, that would contain the true population parameter. A 95% confidence level means that about 95% of such intervals capture the parameter.

To calculate a confidence interval, you can use the following formula for a population mean (assuming a large sample or known population standard deviation):

Confidence Interval = Sample Mean ± (Critical Value) * (Standard Deviation / √Sample Size)

Now let's illustrate this with an example:

Example:
Suppose we want to estimate the average height of all adults in a certain city. We collect a random sample of 100 adults and measure their heights. The sample mean height is 170 cm, and the sample standard deviation is 5 cm. We want to calculate a 95% confidence interval for the true population mean height.

In this case, since the sample size is relatively large (n > 30), we can use the z-distribution and the critical value associated with a 95% confidence level, which is approximately 1.96.

Standard Error (SE) = Standard Deviation / √Sample Size
SE = 5 / √100 = 0.5

Confidence Interval = 170 ± (1.96) * (0.5)
Confidence Interval ≈ 170 ± 0.98

Therefore, the 95% confidence interval for the average height of all adults in the city is approximately (169.02, 170.98) cm. This means that we can be 95% confident that the true population mean height falls within this range based on the given sample data.
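The calculation above can be reproduced with a short helper. This sketch uses the standard library's `NormalDist` for the critical value and assumes the z-based formula applies (large sample or known standard deviation); the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(mean, sd, n, confidence=0.95):
    """z-based confidence interval for a population mean."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin


low, high = mean_confidence_interval(mean=170, sd=5, n=100)
print(f"95% CI: ({low:.2f}, {high:.2f}) cm")
```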

**Q6. Use Bayes' Theorem to calculate the probability of an event occurring given prior knowledge of the
event's probability and new evidence. Provide a sample problem and solution.**

Problem:
A factory produces two types of products, A and B. Historically, 80% of the products produced are of type A, while the remaining 20% are of type B. It is known that 5% of type A products and 10% of type B products are defective. If a randomly selected product is found to be defective, what is the probability that it is of type B?

Solution:
Let's assign the events as follows:
A: Product of type A
B: Product of type B
D: Defective product

We are given the following probabilities:
P(A) = 0.80 (probability of selecting type A product)
P(B) = 0.20 (probability of selecting type B product)
P(D|A) = 0.05 (probability of a defective product given it is of type A)
P(D|B) = 0.10 (probability of a defective product given it is of type B)

We want to calculate P(B|D), which represents the probability that the product is of type B given that it is defective. We can use Bayes' Theorem to calculate this:

P(B|D) = (P(D|B) * P(B)) / P(D)

To calculate P(D), we can use the law of total probability:

P(D) = P(D|A) * P(A) + P(D|B) * P(B)

Substituting the values into the formula:

P(D) = (0.05 * 0.80) + (0.10 * 0.20)
     = 0.04 + 0.02
     = 0.06

Now we can calculate P(B|D) using Bayes' Theorem:

P(B|D) = (P(D|B) * P(B)) / P(D)
       = (0.10 * 0.20) / 0.06
       = 0.02 / 0.06
       ≈ 0.3333

Therefore, the probability that a defective product is of type B is approximately 0.3333 or 33.33%.

This calculation demonstrates how Bayes' Theorem allows us to update our probability estimates based on new evidence. In this case, given that a product is defective, the probability that it belongs to type B increases from its prior probability of 20% to 33.33%.
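The solution generalizes to any number of product types: multiply each prior by its likelihood and normalize. The sketch below applies that idea to the two types in the problem; the function name and dictionary layout are illustrative choices.

```python
def posteriors(priors, likelihoods):
    """Posterior probability of each hypothesis given the observed evidence."""
    # Joint probability of each hypothesis together with the evidence
    joint = {name: priors[name] * likelihoods[name] for name in priors}
    # Law of total probability gives the overall probability of the evidence
    total = sum(joint.values())
    return {name: value / total for name, value in joint.items()}


result = posteriors(
    priors={"A": 0.80, "B": 0.20},        # share of production
    likelihoods={"A": 0.05, "B": 0.10},   # defect rate per type
)
for product_type, probability in result.items():
    print(f"P({product_type} | defective) = {probability:.4f}")
```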

**Q7. Calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation
of 5. Interpret the results.**

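The question gives the sample mean and standard deviation but not the sample size, so the calculation below assumes a sample size of n = 100. With a sample this large, the sample standard deviation can stand in for σ, and we can use the following z-based formula: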

```
CI = (x - z * (σ / √n), x + z * (σ / √n))
```

where:
- `CI` is the confidence interval.
- `x` is the sample mean.
- `z` is the z-score corresponding to the desired confidence level.
- `σ` is the population standard deviation.
- `n` is the sample size.


In [1]:
import scipy.stats as stats

# Given data
sample_mean = 50
sample_std = 5
sample_size = 100  # assumed: the question does not state a sample size
confidence_level = 0.95

# Calculate the z-score based on the confidence level
z_score = stats.norm.ppf((1 + confidence_level) / 2)

# Calculate the margin of error
margin_of_error = z_score * (sample_std / (sample_size ** 0.5))

# Calculate the confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

# Print the confidence interval
print("Confidence Interval:", confidence_interval)

Confidence Interval: (49.02001800772997, 50.97998199227003)


Interpreting the results:<br>
The 95% confidence interval for the sample data with a mean of 50 and a standard deviation of 5, assuming a sample size of 100, is approximately (49.02, 50.98). This means that we are 95% confident that the true population mean falls within this interval. In other words, if we were to repeat the sampling process and calculate the confidence interval each time, about 95% of the intervals would contain the true population mean.

**Q8. What is the margin of error in a confidence interval? How does sample size affect the margin of error?
Provide an example of a scenario where a larger sample size would result in a smaller margin of error.**

The margin of error in a confidence interval is a measure of the uncertainty or variability associated with estimating the population parameter. It represents the range around the sample estimate within which the true population parameter is likely to fall. A smaller margin of error indicates a more precise estimate.

The margin of error is influenced by several factors, including the desired level of confidence, sample size, and variability of the data. When constructing a confidence interval, a higher level of confidence will result in a larger margin of error because it requires a wider interval to capture the population parameter with higher certainty.

The margin of error shrinks as the sample size grows: it is proportional to 1/√n, so quadrupling the sample size halves the margin of error. This is because a larger sample provides more information and reduces the uncertainty associated with estimating the population parameter.

Here's an example to illustrate how a larger sample size results in a smaller margin of error:

Example:
Suppose a polling organization wants to estimate the proportion of voters in a city who support a particular candidate. They conduct a survey using two different sample sizes, one with 200 respondents and the other with 1000 respondents. The goal is to estimate the true proportion with a 95% confidence level.

For the sample of 200 respondents, let's assume they find that 55% support the candidate. For the sample of 1000 respondents, they find that 52% support the candidate.

Using these sample proportions, the margin of error can be calculated using the formula:

Margin of Error = (Critical Value) * √[(p * (1 - p)) / n]

Assuming a z-distribution with a critical value of approximately 1.96 for a 95% confidence level, let's calculate the margin of error for both samples:

For the sample of 200:
Margin of Error = 1.96 * √[(0.55 * (1 - 0.55)) / 200] ≈ 0.0689

For the sample of 1000:
Margin of Error = 1.96 * √[(0.52 * (1 - 0.52)) / 1000] ≈ 0.0310

As we can see, the margin of error is smaller for the larger sample size (0.0310) than for the smaller sample size (0.0689). This indicates that the estimate based on the larger sample is more precise and provides a narrower range around the true population proportion.

Therefore, a larger sample size reduces the margin of error, resulting in a more precise estimate of the population parameter. Increasing the sample size helps to reduce uncertainty and improve the accuracy of the confidence interval estimation.
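The effect of sample size alone is easier to see when the proportion is held fixed. The sketch below assumes a proportion of 0.5 (the value that gives the widest margin) and a few hypothetical sample sizes.

```python
from math import sqrt
from statistics import NormalDist


def proportion_margin(p, n, confidence=0.95):
    """Margin of error for a sample proportion, using the normal approximation."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return z * sqrt(p * (1 - p) / n)


# Hypothetical sample sizes; p is held at 0.5 to isolate the effect of n
for n in (200, 1000, 4000):
    print(f"n = {n:>4}: margin of error = {proportion_margin(0.5, n):.4f}")
```

Because the margin scales with 1/√n, moving from 1000 to 4000 respondents only halves it, so gains in precision become more expensive as the sample grows.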

**Q9. Calculate the z-score for a data point with a value of 75, a population mean of 70, and a population
standard deviation of 5. Interpret the results.**

To calculate the z-score for a data point, we can use the following formula:

```
z = (x - μ) / σ
```

where:
- `z` is the z-score.
- `x` is the data point.
- `μ` is the population mean.
- `σ` is the population standard deviation.

Let's calculate the z-score using Python:


In [2]:
# Given data
data_point = 75
population_mean = 70
population_std = 5

# Calculate the z-score
z_score = (data_point - population_mean) / population_std

# Print the z-score
print("Z-score:", z_score)

Z-score: 1.0


Interpreting the results:<br>
The calculated z-score is 1.0. Because the z-score is positive, the data point lies above the mean: the value of 75 is exactly one standard deviation (5) above the population mean of 70.
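A z-score can also be turned into a percentile, provided the population is assumed to be normally distributed (the question does not state this, so treat it as an assumption of the sketch below).

```python
from statistics import NormalDist

z_score = (75 - 70) / 5

# Share of a normal population lying below the data point
percentile = NormalDist().cdf(z_score)
print(f"z = {z_score}, share of population below 75: {percentile:.4f}")
```

Under that assumption, a z-score of 1 places the value above roughly 84% of the population.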

**Q10. In a study of the effectiveness of a new weight loss drug, a sample of 50 participants lost an average
of 6 pounds with a standard deviation of 2.5 pounds. Conduct a hypothesis test to determine if the drug is
significantly effective at a 95% confidence level using a t-test.**

In [3]:
import scipy.stats as stats

# Given data
sample_mean = 6
sample_std = 2.5
sample_size = 50

# Null hypothesis: The drug is not significantly effective (mean weight loss = 0)
# Alternative hypothesis: The drug is significantly effective (mean weight loss > 0)
null_mean = 0

# Calculate the t-statistic and p-value
t_statistic = (sample_mean - null_mean) / (sample_std / (sample_size ** 0.5))
p_value = 1 - stats.t.cdf(t_statistic, df=sample_size-1)

# Define the significance level (alpha)
alpha = 0.05

# Compare the p-value with the significance level
if p_value < alpha:
    print("Reject the null hypothesis. The drug is significantly effective.")
else:
    print("Fail to reject the null hypothesis. The drug is not significantly effective.")

print("t-statistic:", t_statistic)
print("p-value:", p_value)



Reject the null hypothesis. The drug is significantly effective.
t-statistic: 16.970562748477143
p-value: 0.0


Interpreting the results:<br>
The t-test was conducted to determine the significance of the weight loss drug's effectiveness. The calculated t-statistic is approximately 16.971, and the associated p-value is printed as 0.0 because it is too small for `1 - stats.t.cdf(...)` to distinguish from zero.

Since the p-value (effectively 0) is less than the significance level of 0.05, we reject the null hypothesis. This indicates that the weight loss drug is significantly effective in promoting weight loss, based on the given data.

In conclusion, at a 95% confidence level, we have sufficient evidence to support the claim that the weight loss drug is significantly effective in helping participants lose weight.
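The p-value of 0.0 printed above is a floating-point artifact of subtracting a number very close to 1 from 1. SciPy's survival function `stats.t.sf` computes the upper tail directly and keeps the precision. The sketch below recomputes the p-value both ways for the same data.

```python
import scipy.stats as stats

# Same data as the test above
t_statistic = (6 - 0) / (2.5 / 50 ** 0.5)
degrees_of_freedom = 50 - 1

# Subtracting from 1 loses precision when the CDF is extremely close to 1
p_via_cdf = 1 - stats.t.cdf(t_statistic, df=degrees_of_freedom)

# The survival function evaluates the upper tail directly
p_via_sf = stats.t.sf(t_statistic, df=degrees_of_freedom)

print("1 - cdf:", p_via_cdf)
print("sf     :", p_via_sf)
```

The conclusion of the test does not change; the survival function simply reports a tiny positive p-value instead of exactly zero.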

**Q11. In a survey of 500 people, 65% reported being satisfied with their current job. Calculate the 95%
confidence interval for the true proportion of people who are satisfied with their job.**

To calculate the 95% confidence interval for the true proportion of people who are satisfied with their job, you can use the following formula:

```
confidence_interval = (sample_proportion - margin_of_error, sample_proportion + margin_of_error)
```

where:
- `sample_proportion` is the proportion of people in the sample who reported being satisfied with their job, which is 0.65 (65% expressed as a decimal).
- `margin_of_error` is the critical value multiplied by the standard error of the proportion. The critical value for a 95% confidence level is approximately 1.96.

In [4]:
import scipy.stats as stats
import math

# Given data
sample_proportion = 0.65
sample_size = 500

# Calculate the standard error
standard_error = math.sqrt((sample_proportion * (1 - sample_proportion)) / sample_size)

# Calculate the margin of error
margin_of_error = stats.norm.ppf(0.975) * standard_error

# Calculate the confidence interval
confidence_interval = (sample_proportion - margin_of_error, sample_proportion + margin_of_error)

# Print the confidence interval
print("Confidence Interval:", confidence_interval)


Confidence Interval: (0.6081925393809212, 0.6918074606190788)


Interpreting the results:<br>
The 95% confidence interval for the true proportion of people who are satisfied with their job is approximately (0.6082, 0.6918). This means that we can be 95% confident that the true proportion of people who are satisfied with their job lies within this interval.

In other words, based on the given sample, we estimate that the proportion of people who are satisfied with their job in the entire population is between 60.82% and 69.18% with 95% confidence.

**Q12. A researcher is testing the effectiveness of two different teaching methods on student performance.
Sample A has a mean score of 85 with a standard deviation of 6, while sample B has a mean score of 82
with a standard deviation of 5. Conduct a hypothesis test to determine if the two teaching methods have a
significant difference in student performance using a t-test with a significance level of 0.01.**

In [5]:
import scipy.stats as stats

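# Sample A data
mean_A = 85
std_A = 6
sample_size_A = 30  # assumed: the question does not state a sample size

# Sample B data
mean_B = 82
std_B = 5
sample_size_B = 30  # assumed: the question does not state a sample size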

# Null hypothesis: The two teaching methods have the same mean scores (mean_A - mean_B = 0)
# Alternative hypothesis: The two teaching methods have different mean scores (mean_A - mean_B ≠ 0)

# Calculate the pooled standard deviation
pooled_std = ((sample_size_A - 1) * std_A ** 2 + (sample_size_B - 1) * std_B ** 2) / (sample_size_A + sample_size_B - 2)
pooled_std = pooled_std ** 0.5

# Calculate the t-statistic
t_statistic = (mean_A - mean_B) / (pooled_std * (1 / sample_size_A + 1 / sample_size_B) ** 0.5)

# Define the significance level (alpha)
alpha = 0.01

# Calculate the degrees of freedom
degrees_of_freedom = sample_size_A + sample_size_B - 2

# Calculate the critical value
critical_value = stats.t.ppf(1 - alpha / 2, degrees_of_freedom)

# Calculate the p-value
p_value = 2 * (1 - stats.t.cdf(abs(t_statistic), degrees_of_freedom))

# Compare the t-statistic with the critical value and the p-value with the significance level
if abs(t_statistic) > critical_value or p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference in student performance between the two teaching methods.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in student performance between the two teaching methods.")

print("t-statistic:", t_statistic)
print("p-value:", p_value)




Fail to reject the null hypothesis. There is no significant difference in student performance between the two teaching methods.
t-statistic: 2.10386061995483
p-value: 0.03973697161571055


Interpreting the results:<br>
The t-test was conducted to determine if there is a significant difference in student performance between the two teaching methods. The question does not give sample sizes, so the code assumes 30 students per group. The calculated t-statistic is approximately 2.104, and the associated p-value is approximately 0.0397.

Since the absolute value of the t-statistic (2.104) does not exceed the critical value (approximately 2.66 for 58 degrees of freedom) and the p-value (0.0397) is greater than the significance level of 0.01, we fail to reject the null hypothesis. This indicates that there is no significant difference in student performance between the two teaching methods at this level, based on the given data.

In conclusion, at a significance level of 0.01, we do not have sufficient evidence to support the claim that the two teaching methods have different effects on student performance. The same result would be significant at the 0.05 level, so the conclusion depends on the stricter threshold required here.
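The pooled calculation in the code cell above can be cross-checked with SciPy's `ttest_ind_from_stats`, which runs a two-sample t-test directly from summary statistics. The sample sizes of 30 are the same assumption made in the cell above, since the question does not provide them.

```python
from scipy import stats

# Summary statistics from the question; nobs1 and nobs2 are assumed values
result = stats.ttest_ind_from_stats(
    mean1=85, std1=6, nobs1=30,
    mean2=82, std2=5, nobs2=30,
    equal_var=True,  # pooled-variance (Student's) t-test
)

print("t-statistic:", result.statistic)
print("p-value:", result.pvalue)
print("Reject H0 at alpha = 0.01:", result.pvalue < 0.01)
```

Setting `equal_var=False` would run Welch's t-test instead, which does not assume the two groups share a common variance.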

**Q13. A population has a mean of 60 and a standard deviation of 8. A sample of 50 observations has a mean
of 65. Calculate the 90% confidence interval for the true population mean.**

In [6]:
import math
import scipy.stats as stats

# Given data
population_mean = 60  # stated in the question; not needed for the interval itself
population_std = 8
sample_mean = 65
sample_size = 50
confidence_level = 0.9

# Calculate the standard error of the mean
standard_error = population_std / math.sqrt(sample_size)

# Calculate the margin of error
margin_of_error = stats.norm.ppf((1 + confidence_level) / 2) * standard_error

# Calculate the confidence interval around the sample mean
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

# Print the confidence interval, rounded to four decimal places
print(f"Confidence Interval: ({confidence_interval[0]:.4f}, {confidence_interval[1]:.4f})")


Confidence Interval: (63.1391, 66.8609)


Interpreting the results:<br>
The 90% confidence interval for the true population mean is approximately (63.1391, 66.8609). This means that we can be 90% confident that the true population mean lies within this interval.

In other words, based on the given sample, we estimate that the true population mean is between 63.1391 and 66.8609 with 90% confidence. The interval is centered on the sample mean of 65; the stated population mean of 60 is not used in the calculation and lies outside the interval.

**Q14. In a study of the effects of caffeine on reaction time, a sample of 30 participants had an average
reaction time of 0.25 seconds with a standard deviation of 0.05 seconds. Conduct a hypothesis test to
determine if the caffeine has a significant effect on reaction time at a 90% confidence level using a t-test.**

In [7]:
import scipy.stats as stats

# Given data
sample_mean = 0.25  # Average reaction time
sample_std = 0.05   # Standard deviation
sample_size = 30    # Sample size
confidence_level = 0.9

# Null hypothesis: Caffeine has no effect on reaction time (mean = 0)
# Alternative hypothesis: Caffeine has a significant effect on reaction time (mean ≠ 0)

# Calculate the t-statistic
t_statistic = (sample_mean - 0) / (sample_std / (sample_size ** 0.5))

# Calculate the degrees of freedom
degrees_of_freedom = sample_size - 1

# Calculate the critical value based on the confidence level and degrees of freedom
critical_value = stats.t.ppf(1 - (1 - confidence_level) / 2, degrees_of_freedom)

# Determine if the null hypothesis is rejected or not
if abs(t_statistic) > critical_value:
    # Reject the null hypothesis
    print("There is a significant effect of caffeine on reaction time.")
else:
    # Fail to reject the null hypothesis
    print("There is no significant effect of caffeine on reaction time.")




There is a significant effect of caffeine on reaction time.


Interpreting the results:<br>
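Based on the given sample of 30 participants, the t-statistic (approximately 27.39) far exceeds the two-tailed critical value at the 90% confidence level (approximately 1.70 for 29 degrees of freedom), so we reject the null hypothesis, as the code output shows. Note, however, that the null hypothesis used here is a mean reaction time of 0 seconds, chosen only because the question gives no baseline. Rejecting it shows that the mean reaction time is not zero, not that caffeine changed it; a meaningful test of caffeine's effect would compare against a known baseline or a control group.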

------