# Q1: What is the difference between a t-test and a z-test? Provide an example scenario where you would use each type of test.
## Both t-tests and z-tests are statistical tests used to determine whether a sample mean differs significantly from a hypothesized value, or whether two sample means differ significantly from each other. They differ in their assumptions and in the situations where each is appropriate.

## The main difference is that a t-test is used when the population standard deviation is unknown and must be estimated from the sample, which matters most when the sample is small (a common rule of thumb is fewer than 30 observations). A z-test is used when the population standard deviation is known; with a large sample (30 or more) it is also used as an approximation, because the t-distribution is then very close to the standard normal distribution.

### A t-test uses the t-distribution to calculate the probability that the difference between two sample means is due to chance. For example, a researcher might use a t-test to determine whether there is a significant difference in the average weight loss between two groups of participants who followed different diets for a month.

### On the other hand, a z-test uses the standard normal distribution to test the difference between two means. For example, a quality control manager might use a z-test to determine whether there is a significant difference in the mean length of two different brands of screws produced by a factory, where the population standard deviation is known.
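### As a rough illustration of the two scenarios above, the sketch below runs a t-test on small samples with unknown standard deviation and a z-test where the standard deviation is treated as known. All numbers are made up for illustration, the known sigma of 0.5 is an assumption, and `two_sample_z_test` is a hypothetical helper; `ttest_ind` is SciPy's standard two-sample t-test.

```python
import math
from scipy.stats import norm, ttest_ind

# Hypothetical weight-loss data (kg) for two small diet groups.
# Sigma is unknown and the samples are small, so a t-test fits.
diet_a = [2.1, 3.4, 1.8, 2.9, 3.1, 2.5, 2.2, 3.0]
diet_b = [1.2, 2.0, 1.5, 1.9, 2.4, 1.1, 1.7, 2.1]
t_stat, t_p = ttest_ind(diet_a, diet_b)  # pooled variance, two-tailed by default


def two_sample_z_test(mean1, mean2, sigma1, sigma2, n1, n2):
    """Two-tailed z-test for a difference in means with KNOWN population sigmas."""
    se = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)  # standard error of the difference
    z = (mean1 - mean2) / se
    p = 2 * norm.sf(abs(z))  # area in both tails of the standard normal
    return z, p


# Hypothetical screw lengths (mm); sigma = 0.5 is assumed known from process history.
z_stat, z_p = two_sample_z_test(25.3, 25.0, 0.5, 0.5, 100, 100)

print(f"t-test: t = {t_stat:.3f}, p = {t_p:.4f}")
print(f"z-test: z = {z_stat:.3f}, p = {z_p:.6f}")
```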

# Q2: Differentiate between one-tailed and two-tailed tests.
## In hypothesis testing, a one-tailed test is a statistical test where the alternative hypothesis is directional, meaning it predicts the direction of the difference between the sample statistic and the population parameter. In contrast, a two-tailed test is a statistical test where the alternative hypothesis is non-directional, meaning it predicts that there is a difference between the sample statistic and the population parameter, but it does not specify the direction of that difference.

## The choice between a one-tailed and a two-tailed test depends on the research question and the directionality of the hypothesis. If the research question and hypothesis specify a direction, a one-tailed test is appropriate. For example, if a researcher hypothesizes that a new medication will decrease symptoms of a disease, they would use a one-tailed test because they are only interested in detecting a decrease in symptoms, not an increase.

## On the other hand, if the hypothesis is non-directional or there is no clear prediction about the direction of the difference, a two-tailed test is appropriate. For example, if a researcher wants to test whether there is a significant difference in IQ scores between two groups of students, but does not have a prior expectation about which group will score higher, they would use a two-tailed test.

## In a one-tailed test, the critical region is located entirely on one side of the sampling distribution, whereas in a two-tailed test, the critical region is split between the two tails of the sampling distribution. As a result, when the test statistic has a symmetric distribution (as in z-tests and t-tests) and the observed effect lies in the hypothesized direction, the p-value for a one-tailed test is half that of a two-tailed test for the same observed effect.
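### The relationship between one-tailed and two-tailed p-values can be sketched with the standard normal distribution from Python's standard library. The observed z statistic of 1.8 is a made-up value and `p_values` is a hypothetical helper name.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def p_values(z):
    """Return (upper-tailed, lower-tailed, two-tailed) p-values for a z statistic."""
    upper = 1 - std_normal.cdf(z)      # Ha: the parameter is greater
    lower = std_normal.cdf(z)          # Ha: the parameter is smaller
    two_sided = 2 * min(upper, lower)  # Ha: the parameter differs in either direction
    return upper, lower, two_sided


upper, lower, two_sided = p_values(1.8)  # hypothetical observed z statistic
print(f"upper-tailed: {upper:.4f}, lower-tailed: {lower:.4f}, two-tailed: {two_sided:.4f}")
```

### With these numbers the upper-tailed test would reject at a significance level of 0.05 while the two-tailed test would not, which is why the direction of the hypothesis should be fixed before looking at the data.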

# Q3: Explain the concept of Type 1 and Type 2 errors in hypothesis testing. Provide an example scenario for each type of error.
## In hypothesis testing, a Type 1 error is the rejection of a null hypothesis when it is actually true. This error occurs when a researcher concludes that there is a significant effect or difference between two groups when in reality there is no such effect or difference. Type 1 errors are also called false positives, and the probability of committing one is the significance level, denoted by alpha (α).
### Example: A researcher wants to determine whether a new drug reduces the symptoms of a disease. They conduct a study with a sample of patients and find a statistically significant improvement in symptoms. However, in reality, the drug has no effect on the symptoms of the disease, and the observed difference is due to chance. The researcher commits a Type 1 error by rejecting the null hypothesis that the drug has no effect when it is actually true.

## A Type 2 error is the failure to reject a null hypothesis when it is actually false. This error occurs when a researcher fails to detect a significant effect or difference between two groups when in reality there is such an effect or difference. Type 2 errors are also called false negatives, and the probability of committing one is denoted by beta (β); the power of a test is 1 - β.
### Example: A researcher wants to determine whether a new educational program improves the academic performance of students. They conduct a study with a sample of students and find no statistically significant difference in performance between the group that received the educational program and the control group. However, in reality, the program does improve academic performance, but the observed difference is not statistically significant due to a small sample size or other factors. The researcher commits a Type 2 error by failing to reject the null hypothesis that the program has no effect when it is actually false.

## Both Type 1 and Type 2 errors are possible in any hypothesis testing scenario, and researchers need to balance the risks of each type of error when designing their studies and interpreting their results. Reducing the risk of one type of error often increases the risk of the other, so it is important to carefully consider the consequences of each type of error in the specific context of the research question.
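### The trade-off between the two error types can be made concrete with a small calculation. The sketch below assumes an upper-tailed one-sample z-test with known sigma; the effect size, sigma, and sample size are hypothetical, and `type2_error_rate` is an illustrative helper name.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()


def type2_error_rate(alpha, effect, sigma, n):
    """Beta for an upper-tailed one-sample z-test when the true mean shift is `effect`."""
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection threshold on the z scale
    shift = effect / (sigma / math.sqrt(n))  # mean of the z statistic when Ha is true
    return std_normal.cdf(z_crit - shift)    # P(fail to reject H0 | Ha is true)


# Hypothetical scenario: true effect of 2 units, sigma of 10, sample size of 50.
for alpha in (0.10, 0.05, 0.01):
    beta = type2_error_rate(alpha, effect=2, sigma=10, n=50)
    print(f"alpha = {alpha:.2f} -> beta = {beta:.3f}, power = {1 - beta:.3f}")
```

### Lowering alpha raises beta for a fixed sample size; increasing the sample size is the usual way to reduce both at once.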

# Q4: Explain Bayes's theorem with an example.
## Bayes's theorem is a mathematical formula that describes the relationship between conditional probabilities of two events. The theorem states that the probability of event A given that event B has occurred is equal to the probability of event B given that event A has occurred multiplied by the probability of event A, divided by the probability of event B.

### Here is the formula for Bayes's theorem:

## P(A|B) = P(B|A) * P(A) / P(B)

#### where:
#### P(A|B) is the probability of event A given that event B has occurred
#### P(B|A) is the probability of event B given that event A has occurred
#### P(A) is the prior probability of event A
#### P(B) is the marginal (overall) probability of event B

## Example of Bayes's Theorem:
### Suppose we have two bags of marbles: Bag A and Bag B. Bag A contains 5 red marbles and 3 blue marbles, while Bag B contains 2 red marbles and 7 blue marbles. We choose one bag at random and draw one marble from it. The marble we draw is blue. What is the probability that we chose Bag A?

#### To calculate this probability, we can set up the following events:

### A: We chose Bag A
### B: We drew a blue marble
#### We want to calculate the probability of A given B, or P(A|B).

### First, we need to calculate the conditional probabilities of B given A and B given not A:
#### P(B|A) = 3/8, since Bag A contains 3 blue marbles out of 8 total marbles
#### P(B|not A) = 7/9, since Bag B contains 7 blue marbles out of 9 total marbles
### Now, we can calculate the prior probabilities of A and B:

#### P(A) = 0.5, since you chose one bag at random
#### P(B) = P(B|A) * P(A) + P(B|not A) * P(not A) = 0.375 * 0.5 + 0.778 * 0.5 ≈ 0.576, since the probability of drawing a blue marble depends on which bag we chose.
### Now we can use Bayes's theorem to calculate the probability of A given B:
#### P(A|B) = P(B|A) * P(A) / P(B) = (3/8 * 0.5) / 0.576 ≈ 0.325
### So the probability that we chose Bag A given that we drew a blue marble is approximately 0.325 or 32.5%. This is lower than the prior probability of 50% because Bag B has a much higher proportion of blue marbles, so a blue draw is evidence in favor of Bag B. Even so, Bag A remains a real possibility, since more than a third of its marbles are blue.
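### The marble calculation can be checked with exact fractions, which avoids rounding slips in the intermediate steps. This is a minimal sketch; the variable names are illustrative.

```python
from fractions import Fraction

# Likelihood of drawing a blue marble from each bag, taken from the marble counts.
p_blue_given_a = Fraction(3, 8)  # Bag A: 3 blue out of 8
p_blue_given_b = Fraction(7, 9)  # Bag B: 7 blue out of 9
p_a = p_b = Fraction(1, 2)       # each bag is equally likely to be chosen

# Law of total probability, then Bayes's theorem.
p_blue = p_blue_given_a * p_a + p_blue_given_b * p_b
p_a_given_blue = p_blue_given_a * p_a / p_blue

print(f"P(blue) = {p_blue} ≈ {float(p_blue):.4f}")
print(f"P(Bag A | blue) = {p_a_given_blue} ≈ {float(p_a_given_blue):.4f}")
```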

# Q5: What is a confidence interval? How to calculate the confidence interval, explain with an example.
## A confidence interval is a range of values that is likely to contain the true value of a population parameter with a certain level of confidence. It is used to estimate an unknown population parameter based on a sample of data.

### To calculate a confidence interval, we first need to choose a confidence level, which is the long-run proportion of intervals constructed this way that will contain the true population parameter. The most common confidence levels are 90%, 95%, and 99%. The corresponding alpha level, or level of significance, is 1 minus the confidence level, so a 95% confidence level corresponds to alpha = 0.05.

## The formula for a confidence interval for the mean of a population is:

### CI = X̄ ± z*(s/√n)
#### where:
#### CI is the confidence interval
### X̄ is the sample mean
#### z is the critical value from the standard normal distribution corresponding to the desired confidence level
#### s is the sample standard deviation
#### n is the sample size

## Example:
### Suppose we want to estimate the average height of all students at a university. We take a random sample of 50 students and measure their heights. The sample mean is 68 inches and the sample standard deviation is 3 inches.

### We want to construct a 95% confidence interval for the true population mean height. The critical value for a 95% confidence level is 1.96 from the standard normal distribution.

### CI = 68 ± 1.96*(3/√50)
### CI = 68 ± 0.83
### CI = [67.17, 68.83]

## This means that we are 95% confident that the true population mean height is between 67.17 inches and 68.83 inches. If we were to repeat this sampling process many times and build an interval from each sample, we would expect about 95% of those intervals to contain the true population mean.
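### The height example can be reproduced with a few lines of Python. This is a sketch using only the standard library; `mean_confidence_interval` is a hypothetical helper name, and it uses the z critical value as in the formula above.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_std, n, confidence=0.95):
    """z-based confidence interval for a population mean."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # about 1.96 for 95%
    margin = z * sample_std / math.sqrt(n)          # critical value times standard error
    return sample_mean - margin, sample_mean + margin


lower, upper = mean_confidence_interval(68, 3, 50)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```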

# Q6. Use Bayes' Theorem to calculate the probability of an event occurring given prior knowledge of the event's probability and new evidence. Provide a sample problem and solution.
## Suppose a certain disease affects 1 in 1000 people in a population. There is a test for this disease that is 99% accurate, meaning that it correctly identifies 99% of people who have the disease and 99% of people who do not have the disease. If a person tests positive for the disease, what is the probability that they actually have the disease?

### To use Bayes' Theorem to calculate this probability, we can set up the following events:

#### A: The person has the disease
#### B: The person tests positive for the disease
### We want to calculate the probability of A given B, or P(A|B).

### First, we can calculate the prior probability of A:

#### P(A) = 1/1000, since the disease affects 1 in 1000 people in the population
### Next, we need to calculate the conditional probabilities of B given A and B given not A:

#### P(B|A) = 0.99, since the test correctly identifies 99% of people who have the disease
#### P(B|not A) = 0.01, since the test incorrectly returns a positive result for 1% of people who do not have the disease
### Now we can use Bayes' Theorem to calculate the probability of A given B:

##### P(A|B) = P(B|A) * P(A) / P(B) = (0.99 * 0.001) / P(B)
### To calculate P(B), we can use the Law of Total Probability:
####  P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
#### P(B) = 0.99 * 0.001 + 0.01 * 0.999
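#### P(B) = 0.01098
### Substituting P(B) back into Bayes' Theorem:
#### P(A|B) = 0.00099 / 0.01098 ≈ 0.0902
### So a person who tests positive has only about a 9% chance of actually having the disease. The disease is so rare that false positives among the many healthy people outnumber true positives among the few who are ill.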
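### The same steps can be wrapped in a small function to see how the answer depends on the inputs. This is a minimal sketch; `posterior_given_positive` is a hypothetical helper, and "99% accurate" is read as 99% sensitivity and 99% specificity, as in the problem statement.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1 - specificity  # false positive rate
    # Law of total probability for a positive result.
    p_pos = p_pos_given_disease * prevalence + p_pos_given_healthy * (1 - prevalence)
    return p_pos_given_disease * prevalence / p_pos


posterior = posterior_given_positive(prevalence=0.001, sensitivity=0.99, specificity=0.99)
print(f"P(disease | positive) = {posterior:.4f}")
```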

# Q7. Calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation of 5. Interpret the results.

In [7]:
import math
from scipy.stats import norm
def estimate_population_mean(sample_mean, sample_std, sample_size, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)  # two-sided critical value (about 1.96 for alpha = 0.05)
    moe = z * sample_std / math.sqrt(sample_size)  # margin of error
    lower = sample_mean - moe  # lower bound of confidence interval
    upper = sample_mean + moe  # upper bound of confidence interval    
    return lower, upper

### The question does not give the sample size, so we cannot calculate the exact confidence interval; the function above takes the sample size as an argument. We can still interpret the result: a 95% confidence interval means that if we were to take many random samples of the same size from the same population, we would expect 95% of the resulting confidence intervals to contain the true population mean. In other words, we are 95% confident that the true population mean falls within the computed range.

### For example, if we had a sample size of 100, the confidence interval would be:

In [29]:
lower,upper = estimate_population_mean(50,5,100)

In [30]:
print(f'The population mean is between {lower:.2f} and {upper:.2f}')

The population mean is between 49.02 and 50.98


### This means that we are 95% confident that the true population mean is between 49.02 and 50.98. If we were to repeat the sampling process many times and build an interval from each sample, we would expect about 95% of those intervals to contain the true population mean.

# Q8. What is the margin of error in a confidence interval? How does sample size affect the margin of error ? Provide an example of a scenario where a larger sample size would result in a smaller margin of error.
## The margin of error in a confidence interval is the maximum expected distance between the point estimate (such as the sample mean) and the true population parameter at a given confidence level; it is the half-width of the interval. It is the product of the critical value (obtained from the standard normal or t-distribution based on the desired confidence level) and the standard error of the estimate.

## The margin of error decreases with increasing sample size, as larger samples tend to produce more precise estimates of the population parameter. This is because larger samples provide more information about the population, and therefore have less sampling variability. As a result, larger samples can estimate the population parameter with greater precision, resulting in a smaller margin of error.

### For example, suppose we want to estimate the average income of people in a particular city with a 95% confidence interval. If we take a sample of 50 people, the margin of error might be around 10,000 INR. If we take a sample of 500 people from the same population, the margin of error shrinks by a factor of about √10, to roughly 3,200 INR. The interval around our estimate is therefore narrower, so the estimate is more precise at the same 95% confidence level.

## In general, a larger sample size leads to a smaller margin of error, all other things being equal. However, the margin of error shrinks in proportion to 1/√n, so quadrupling the sample size only halves it, and it never reaches zero for a finite sample because some sampling variability always remains.
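### The effect of sample size on the margin of error can be sketched as follows. The income standard deviation of 36,000 INR is an assumed value chosen only for illustration, and `margin_of_error` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def margin_of_error(std, n, confidence=0.95):
    """Margin of error for a mean: critical z value times the standard error."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return z * std / math.sqrt(n)


assumed_std = 36_000  # hypothetical income standard deviation in INR
for n in (50, 500, 5000):
    print(f"n = {n:>4}: margin of error ~ {margin_of_error(assumed_std, n):,.0f} INR")
```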

# Q9. Calculate the z-score for a data point with a value of 75, a population mean of 70, and a population standard deviation of 5. Interpret the results.
## The z-score (also known as standard score) is a measure of how many standard deviations a data point is from the mean of a distribution. It is calculated by subtracting the population mean from the data point and then dividing by the population standard deviation. The formula for calculating the z-score is:

### z = (x - μ) / σ
### where:
#### x is the data point
#### μ is the population mean
#### σ is the population standard deviation

#### z = (75 - 70) / 5
#### z = 1

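### A z-score of 1 means that the data point 75 lies exactly one standard deviation above the population mean of 70. In general, a positive z-score indicates a data point above the mean, while a negative z-score indicates a data point below the mean. A z-score of 0 indicates a data point at the mean.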
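### The calculation is short enough to express as a one-line function. In the sketch below, `z_score` is a hypothetical helper, and the percentile step assumes the population is approximately normal, which the question does not state.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


z = z_score(75, 70, 5)
# Share of values below x; only meaningful if the population is roughly normal.
percentile = NormalDist().cdf(z)
print(f"z = {z:.2f}, about {percentile:.1%} of values fall below this point")
```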

# Q10. In a study of the effectiveness of a new weight loss drug, a sample of 50 participants lost an average of 6 pounds with a standard deviation of 2.5 pounds. Conduct a hypothesis test to determine if the drug is significantly effective at a 95% confidence level using a t-test.

In [103]:
import math
from scipy.stats import t

# Given 
sample_mean = 6
mu = 0 # Hypothesized mean
sample_std = 2.5
sample_size = 50
significance_level = 0.05

#### Null hypothesis (H0): The mean weight loss for the population of individuals taking the new drug is zero (μ = 0).
#### Alternative hypothesis (Ha): The mean weight loss for the population of individuals taking the new drug is different from zero (μ ≠ 0).

In [122]:
# t-statistic
t_stat = (sample_mean - mu) / (sample_std / math.sqrt(sample_size))

# degrees of freedom and p-value
dof = sample_size - 1
p_value = t.sf(abs(t_stat), dof) * 2  # 2-tailed test

# Conclusion
if p_value < significance_level:
    print(f"Reject the null hypothesis with a p-value of {p_value}")
else:
    print(f"Fail to reject the null hypothesis with a p-value of {p_value}")

Reject the null hypothesis with a p-value of 6.896726383560639e-100


### The p-value is far smaller than our significance level of 0.05, which means that a t-statistic as extreme as ours (or more extreme) would be very unlikely if the null hypothesis were true. Therefore, we reject the null hypothesis and conclude that the mean weight loss for the population of individuals taking the new drug is significantly different from zero at the 0.05 significance level (95% confidence). Because the sample mean is positive, the data support the claim that the drug produces weight loss, although a study without a control group cannot rule out other explanations.

# Q11. In a survey of 500 people, 65% reported being satisfied with their current job. Calculate the 95% confidence interval for the true proportion of people who are satisfied with their job.

In [57]:
import math
from scipy.stats import norm

# data
p = 0.65  # sample proportion of satisfied respondents
q = 1 - p # sample proportion of respondents who are not satisfied
n = 500
confidence_level = 0.95

# z-score
z_score = norm.ppf((1 + confidence_level) / 2)

# SE (standard error)
SE = math.sqrt(p*q/n)

# margin of error
moe = z_score * SE

# confidence interval
lower_bound,upper_bound = p - moe , p + moe

# result
print(f'''The true proportion of people who are satisfied with their job is between
{lower_bound*100:.2f}% and {upper_bound*100:.2f}% with a {confidence_level*100:.0f}% confidence level.''')

The true proportion of people who are satisfied with their job is between
60.82% and 69.18% with a 95% confidence level.


# Q12. A researcher is testing the effectiveness of two different teaching methods on student performance. Sample A has a mean score of 85 with a standard deviation of 6, while sample B has a mean score of 82 with a standard deviation of 5. Conduct a hypothesis test to determine if the two teaching methods have a significant difference in student performance using a t-test with a significance level of 0.01.

In [18]:
# Given
x̄A, x̄B = 85, 82  # sample means
sA, sB = 6, 5  # sample standard deviations of sample A and sample B
nA, nB = 30, 30  # the question gives no sample sizes; 30 per group is an assumption

#### Null hypothesis (H0): There is no difference in mean student performance between the two teaching methods (μA = μB).
#### Alternative hypothesis (Ha): There is a difference in mean student performance between the two teaching methods (μA ≠ μB).

In [4]:
significance_level = 0.01

In [None]:
dof = nA + nB - 2  # degrees of freedom for the pooled two-sample t-test

In [None]:
from math import sqrt
# SP is the pooled standard deviation (** is exponentiation; ^ is bitwise XOR in Python)
SP = sqrt(((nA-1)*sA**2 + (nB-1)*sB**2) / (nA+nB-2))

In [None]:
t_stat = (x̄A - x̄B)/(SP * sqrt(1/nA + 1/nB))
print(f"{t_stat:.3f}")

2.104


In [21]:
from scipy.stats import t

# two-tailed critical value using the t.ppf() function
critical_value = t.ppf(1 - significance_level / 2, dof)

# p-value using the t.sf() function
p_value = t.sf(abs(t_stat), dof) * 2  # 2-tailed test

print(f"Critical value: {critical_value:.2f}")
print(f"p value: {p_value:.2f}")

Critical value: 2.66
p value: 0.04


In [22]:
# Rejecting when |t| exceeds the critical value is equivalent to p_value < significance_level
if abs(t_stat) > critical_value:
    print(f"Reject the null hypothesis with a p-value of {p_value:.2f} at the {significance_level} significance level")
else:
    print(f"Fail to reject the null hypothesis with a p-value of {p_value:.2f} at the {significance_level} significance level")

Fail to reject the null hypothesis with a p-value of 0.04 at the 0.01 significance level


### The t-statistic of about 2.10 is smaller than the two-tailed critical value of about 2.66, and the p-value of about 0.04 is larger than our significance level of 0.01. Therefore, we fail to reject the null hypothesis: with the assumed sample size of 30 per group, the data do not show a significant difference in student performance between the two teaching methods at the 0.01 significance level. The difference would be significant at the 0.05 level, and the conclusion depends on the assumed sample sizes, which the question does not provide.

# Q13. A population has a mean of 60 and a standard deviation of 8. A sample of 50 observations has a mean of 65. Calculate the 90% confidence interval for the true population mean.

In [121]:
import math
from scipy.stats import norm

# data
mu = 60  # stated population mean (not needed to build the interval)
sigma = 8
sample_mean = 65
sample_size = 50
confidence_level = 0.90
# z-score
z_score = norm.ppf((1 + confidence_level) / 2)
# margin of error
moe = z_score * (sigma/math.sqrt(sample_size))
# confidence interval
lower_bound,upper_bound = sample_mean - moe , sample_mean + moe
print(f"The population mean lies between {lower_bound:.2f} and {upper_bound:.2f} with a {confidence_level*100:.0f} % confidence level.")

The population mean lies between 63.14 and 66.86 with a 90 % confidence level.


#### Therefore, we can say with 90% confidence that the true population mean lies between 63.14 and 66.86.

# Q14. In a study of the effects of caffeine on reaction time, a sample of 30 participants had an average reaction time of 0.25 seconds with a standard deviation of 0.05 seconds. Conduct a hypothesis test to determine if the caffeine has a significant effect on reaction time at a 90% confidence level using a t-test.

In [28]:
import math
from scipy.stats import t

# Given 
sample_mean = 0.25
mu = 0 # Hypothesized mean (the question gives no caffeine-free baseline, so 0 is a placeholder)
sample_std = 0.05
sample_size = 30
significance_level = 0.10

#### Null hypothesis (H0): The mean reaction time for the population of individuals taking the caffeine is zero (μ = 0).
#### Alternative hypothesis (Ha): The mean reaction time for the population of individuals taking the caffeine is different from zero (μ ≠ 0).

In [29]:
# t-statistic
t_stat = (sample_mean - mu) / (sample_std / math.sqrt(sample_size))

# degrees of freedom and p-value
dof = sample_size - 1
p_value = t.sf(abs(t_stat), dof) * 2  # 2-tailed test

# Conclusion
if p_value < significance_level:
    print(f"Reject the null hypothesis with a p-value of {p_value}")
else:
    print(f"Fail to reject the null hypothesis with a p-value of {p_value}")

Reject the null hypothesis with a p-value of 2.8325244885113353e-22


### The p-value is much smaller than our significance level of 0.10, which means that the probability of observing a t-value as extreme as ours (or more extreme) if the null hypothesis were true is very small. Therefore, we reject the null hypothesis and conclude that the mean reaction time for the population of individuals taking the caffeine is significantly different from zero at a 90% confidence level. Note that this only shows the mean reaction time is not zero, which is expected for any reaction time; to assess the effect of caffeine itself, the hypothesized mean should be a baseline reaction time without caffeine, which the question does not provide.