# Q1: What is the difference between a t-test and a z-test? Provide an example scenario where you would use each type of test.

# Ans: 1 

Both t-tests and z-tests are used to test hypotheses about means, for example whether the observed difference between the means of two groups is statistically significant.


**The difference between a t-test and a z-test:**

- **Population standard deviation and sample size:** The key difference is whether the population standard deviation (σ) is known. A z-test is appropriate when σ is known, or when the sample is large enough (commonly n >= 30) that the sample standard deviation is a reliable estimate of σ. A t-test is used when σ is unknown and has to be estimated from the sample, which matters most when the sample is small (commonly n < 30).

- **Distribution and assumptions:** Both tests assume that the sample mean is approximately normally distributed. The z-test compares the test statistic to the standard normal distribution, whereas the t-test compares it to a t-distribution, whose heavier tails account for the extra uncertainty from estimating σ.
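To make the sample-size rule of thumb concrete, the short sketch below compares two-sided 5% critical values of the t-distribution with the corresponding z critical value. The degrees of freedom listed are arbitrary illustrative choices, not values from the examples in this notebook.

```python
from scipy.stats import norm, t

# Two-sided 5% critical value of the standard normal distribution
z_crit = norm.ppf(0.975)

# The same critical value for t-distributions with increasing degrees of freedom
dfs = [5, 9, 29, 99, 999]  # illustrative choices
t_crits = [t.ppf(0.975, df=df) for df in dfs]

print(f"z critical value: {z_crit:.4f}")
for df, t_crit in zip(dfs, t_crits):
    print(f"t critical value (df={df}): {t_crit:.4f}")
```

The t critical values are always larger than the z value and approach it as the degrees of freedom grow, which is why the two tests give nearly the same answer for large samples.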



Now, let's look at an example for each test using Python:

# t-test:


Suppose we want to compare the heights of two groups of individuals: Group A and Group B. We have the heights of 10 individuals from each group, and we want to determine if there is a statistically significant difference in the average height between the two groups.

In [8]:
import numpy as np 
from scipy.stats import ttest_ind

# Heights of individuals in Group A and Group B
group_a_heights = np.array([165, 170, 168, 175, 172, 167, 169, 171, 173, 168])
group_b_heights = np.array([160, 162, 165, 158, 163, 166, 159, 161, 164, 162])

# perform independent t test 
t_stat, p_value = ttest_ind(group_a_heights, group_b_heights)

# print the result 
print("t-statistic:", t_stat)
print("p-value:", p_value)

# check whether the difference is statistically significant at the 5% significance level
if p_value < 0.05:
    print("There is a statistically significant difference in the average height between Group A and Group B.")
else:
    print("There is no statistically significant difference in the average height between Group A and Group B.")

t-statistic: 6.218479840396996
p-value: 7.226808844213945e-06
There is a statistically significant difference in the average height between Group A and Group B.


# z-test:
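Now, suppose we know the population standard deviation and want to compare the average test scores of two groups: Group X and Group Y. A z-test is normally applied to larger samples; the example below uses only 10 scores per group to keep the data short, and treats the population standard deviation as known.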

In [9]:
from scipy.stats import norm

# test scores of individuals in group x and group y 
group_x_scores = np.array([75, 80, 85, 90, 78, 82, 88, 92, 79, 81])
group_y_scores = np.array([70, 72, 74, 68, 73, 71, 75, 76, 72, 74])

# population standard deviation of test scores (assumed known for this example)
population_std = 5.0

# calculate the sample means and standard errors
mean_x = np.mean(group_x_scores)
mean_y = np.mean(group_y_scores)

se_x = population_std / np.sqrt(len(group_x_scores))
se_y = population_std / np.sqrt(len(group_y_scores))

# calculate the z-score for the difference in means
z_score = (mean_x - mean_y ) / np.sqrt(se_x**2 + se_y**2)

# calculate the p-value
p_value = 2*(1 - norm.cdf(abs(z_score)))

# print the result 
print("z-score:", z_score)
print("p-value:", p_value)

# Check if the difference is statistically significant at 5% significance level
if p_value < 0.05:
    print("There is a statistically significant difference in the average test scores between Group X and Group Y.")
else:
    print("There is no statistically significant difference in the average test scores between Group X and Group Y.")

z-score: 4.695742752749559
p-value: 2.656396824285423e-06
There is a statistically significant difference in the average test scores between Group X and Group Y.


# Q2: Differentiate between one-tailed and two-tailed tests.

# Ans: 2 



One-tailed and two-tailed tests are two types of hypothesis tests used in statistical analysis to assess the significance of an observed effect or difference. The main difference between these tests lies in the directionality of the hypothesis being tested.

**One-tailed test:**

In a one-tailed test, the hypothesis being tested is directional, meaning it specifies a particular direction of the effect or difference. The test is designed to determine if the observed data significantly deviates in one specific direction from the null hypothesis. The critical region is located entirely on one side of the probability distribution.

- **Null hypothesis (H0):** There is no effect or difference between the groups.
- **Alternative hypothesis (Ha):** There is a specific effect or difference in a particular direction.

One-tailed tests are typically used when there is a strong prior expectation or theoretical reason to believe that the effect or difference, if it exists, will occur in a specific direction.



**Two-tailed test:**

In a two-tailed test, the hypothesis being tested is non-directional, meaning it does not specify a particular direction of the effect or difference. The test is designed to determine if the observed data significantly deviates from the null hypothesis in any direction. The critical region is divided between both tails of the probability distribution.

- **Null hypothesis (H0):** There is no effect or difference between the groups.
- **Alternative hypothesis (Ha):** There is a significant effect or difference, but the direction is not specified.

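Two-tailed tests are more conservative because the significance level is split between both tails (for example, 2.5% in each tail at α = 0.05), so a larger deviation is needed to reject the null hypothesis. They are typically used when there is no prior expectation or theoretical reason to predict the direction of the effect or difference.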
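As a small illustration of the difference, the sketch below computes both p-values for the same test statistic. The value `z_stat = 1.8` is a hypothetical number chosen for illustration, not a result from this notebook.

```python
from scipy.stats import norm

z_stat = 1.8   # hypothetical test statistic, chosen for illustration
alpha = 0.05

# One-tailed (upper tail): probability of a value at least this large under H0
p_one_tailed = norm.sf(z_stat)

# Two-tailed: probability of a value at least this extreme in either direction
p_two_tailed = 2 * norm.sf(abs(z_stat))

print(f"one-tailed p-value: {p_one_tailed:.4f}, reject H0: {p_one_tailed < alpha}")
print(f"two-tailed p-value: {p_two_tailed:.4f}, reject H0: {p_two_tailed < alpha}")
```

For a symmetric distribution the two-tailed p-value is twice the one-tailed one, so the same data can be significant in a one-tailed test and not significant in a two-tailed test. This is why the direction of the hypothesis should be fixed before looking at the data.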



# Q3: Explain the concept of Type 1 and Type 2 errors in hypothesis testing. Provide an example scenario for each type of error.

# Ans: 3 

**Type 1 error (False Positive):**

A Type 1 error occurs when we incorrectly reject a true null hypothesis. In other words, we conclude that there is a significant effect or difference when, in reality, there is no such effect or difference in the population. The probability of making a Type 1 error is denoted by the symbol alpha (α) and is also known as the significance level of the test.

- **Example:** Let's say a company claims their new energy drink improves people's memory. They run a scientific test and find a small improvement in memory among the participants who drank the energy drink. However, in reality, the drink has no effect on memory. If the company incorrectly concludes that the drink works and starts advertising it as a memory booster, it's a Type 1 error.


**Type 2 error (False Negative):**

A Type 2 error occurs when we incorrectly fail to reject a false null hypothesis. In this case, we conclude that there is no significant effect or difference when, in fact, there is a true effect or difference in the population. The probability of making a Type 2 error is denoted by the symbol beta (β).

- **Example:** Let's consider the same company testing the memory-boosting energy drink. Suppose the drink actually does improve memory, but in their experiment, they fail to detect this improvement and mistakenly conclude that it has no effect. It's a Type 2 error because they missed the true effect of the drink.
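Both error rates can be estimated by simulation. The sketch below is a minimal illustration under assumed settings (normally distributed data, 30 observations per experiment, a true effect of 0.3 standard deviations in the second case); none of these numbers come from the energy drink example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
alpha = 0.05
n_experiments = 2000            # assumed number of simulated experiments
sample_size = 30                # assumed observations per experiment

# Case 1: H0 is true (the true mean really is 0),
# so every rejection is a Type 1 error
null_samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, sample_size))
p_null = stats.ttest_1samp(null_samples, popmean=0.0, axis=1).pvalue
type1_rate = np.mean(p_null < alpha)

# Case 2: H0 is false (the true mean is 0.3),
# so every failure to reject is a Type 2 error
effect_samples = rng.normal(loc=0.3, scale=1.0, size=(n_experiments, sample_size))
p_effect = stats.ttest_1samp(effect_samples, popmean=0.0, axis=1).pvalue
type2_rate = np.mean(p_effect >= alpha)

print("Estimated Type 1 error rate:", type1_rate)
print("Estimated Type 2 error rate:", type2_rate)
```

The estimated Type 1 rate should land close to the chosen α, while the Type 2 rate depends on the effect size and the sample size: a larger sample or a larger true effect lowers β.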


# Q4: Explain Bayes's theorem with an example.

# Ans: 4 



Bayes's theorem is a fundamental concept in probability theory that allows us to update the probability of an event based on new evidence. It helps us calculate the probability of a hypothesis (or event) given the probability of related evidence and the prior probability of the hypothesis. Bayes's theorem is represented mathematically as:

**P(H|E) = P(E|H) · P(H) / P(E)**

Where:

- P(H|E) is the posterior probability of the hypothesis H given the evidence E.
- P(E|H) is the likelihood of the evidence E given the hypothesis H.
- P(H) is the prior probability of the hypothesis H.
- P(E) is the probability of the evidence E.


Let's demonstrate Bayes's theorem with a simple example in Python:

**Example: Coin Toss**
Suppose we have an unfair coin that is biased either towards heads (H) or towards tails (T), but we do not know which. We assign a prior probability to each possibility and then use Bayes's theorem to calculate the posterior probability of each bias after observing three consecutive tosses: H, H, T.

In [10]:
# Prior probabilities
p_h = 0.4  # Probability of the coin being biased towards heads (prior)
p_t = 0.6  # Probability of the coin being biased towards tails (prior)

# Likelihoods
p_h_given_h = 0.9  # Probability of getting heads given the coin is biased towards heads
p_t_given_h = 0.1  # Probability of getting tails given the coin is biased towards heads
p_h_given_t = 0.3  # Probability of getting heads given the coin is biased towards tails
p_t_given_t = 0.7  # Probability of getting tails given the coin is biased towards tails

# Evidence (observed tosses)
evidence = ['H', 'H', 'T']

# Calculate the probability of the evidence P(E)
p_e = (p_h * p_h_given_h * p_h_given_h * p_t_given_h) + (p_t * p_h_given_t * p_h_given_t * p_t_given_t)

# Calculate the posterior probabilities P(H|E) and P(T|E)
p_h_given_e = (p_h * p_h_given_h * p_h_given_h * p_t_given_h) / p_e
p_t_given_e = (p_t * p_h_given_t * p_h_given_t * p_t_given_t) / p_e

print("Posterior probability of the coin being biased towards heads (H):", p_h_given_e)
print("Posterior probability of the coin being biased towards tails (T):", p_t_given_e)

Posterior probability of the coin being biased towards heads (H): 0.4615384615384615
Posterior probability of the coin being biased towards tails (T): 0.5384615384615383
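The calculation above hard-codes the three tosses, and the `evidence` list is defined but not used. A possible generalisation is sketched below: it applies Bayes's theorem one toss at a time, so the posterior after each toss becomes the prior for the next. The function name `posterior_heads_bias` and its parameter names are hypothetical; the default values are the prior and likelihoods from the example.

```python
def posterior_heads_bias(tosses, prior_h=0.4, p_heads_if_h=0.9, p_heads_if_t=0.3):
    """Return P(coin is heads-biased | tosses), updating one toss at a time."""
    belief_h = prior_h
    for toss in tosses:
        # Likelihood of this toss under each hypothesis
        like_h = p_heads_if_h if toss == 'H' else 1 - p_heads_if_h
        like_t = p_heads_if_t if toss == 'H' else 1 - p_heads_if_t
        # Bayes's theorem: the posterior after this toss becomes the next prior
        numerator = like_h * belief_h
        belief_h = numerator / (numerator + like_t * (1 - belief_h))
    return belief_h

print(posterior_heads_bias(['H', 'H', 'T']))
print(posterior_heads_bias(['H', 'H', 'T', 'H', 'H']))
```

For the tosses H, H, T this reproduces the posterior computed above, and observing further heads pushes the probability of a heads bias upward.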


# Q5: What is a confidence interval? How to calculate the confidence interval, explain with an example.

# Ans: 5 

**Confidence interval:**

A confidence interval is a range of values that provides an estimate of the unknown population parameter (such as the population mean) along with a level of confidence. It is a measure of the uncertainty associated with an estimate based on a sample from the population. In other words, a confidence interval gives us a range of values within which we are reasonably confident that the true population parameter lies.

The level of confidence associated with a confidence interval is expressed as a percentage, typically 90%, 95%, or 99%. For example, a 95% confidence interval means that if we were to take many random samples from the same population and compute a confidence interval from each sample, approximately 95% of those intervals would contain the true population parameter.

**How to calculate the confidence interval:**
The formula for calculating a confidence interval depends on the type of data and the distribution of the data. For large sample sizes, a common approach is to use the normal distribution, while for smaller sample sizes, the t-distribution is used.

- **Example: Calculating a Confidence Interval for the Mean**

Suppose we want to estimate the average height of a certain population. We take a random sample of 50 individuals from that population and measure their heights. The sample mean and the sample standard deviation are computed from the data below, and because the population standard deviation is unknown, the interval is based on the t-distribution.


In [11]:
from scipy.stats import t

# Sample data (heights in centimeters)
sample_data = np.array([165, 170, 175, 168, 172, 173, 169, 171, 168, 172,
                        170, 167, 170, 171, 174, 172, 170, 168, 172, 169,
                        170, 172, 170, 168, 170, 172, 170, 169, 173, 171,
                        167, 170, 172, 169, 171, 170, 168, 173, 170, 169,
                        171, 173, 170, 168, 174, 171, 170, 172, 169, 172])

# Sample statistics
sample_mean = np.mean(sample_data)
sample_std = np.std(sample_data, ddof=1)  # ddof=1 for sample standard deviation
sample_size = len(sample_data)

# Confidence level (as a decimal)
confidence_level = 0.95

# Calculate the critical value for t-distribution
# Degrees of freedom = sample_size - 1
critical_value = t.ppf((1 + confidence_level) / 2, df=sample_size - 1)

# Calculate the standard error of the mean
standard_error = sample_std / np.sqrt(sample_size)

# Calculate the margin of error
margin_of_error = critical_value * standard_error

# Calculate the confidence interval
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

print("Sample mean:", sample_mean)
print("Margin of error:", margin_of_error)
print("95% Confidence Interval: ({:.2f}, {:.2f})".format(lower_bound, upper_bound))

Sample mean: 170.4
Margin of error: 0.5741643527112024
95% Confidence Interval: (169.83, 170.97)
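As a cross-check, the same interval can be obtained from SciPy's built-in `t.interval` method together with `scipy.stats.sem`. This is a sketch that repeats the sample data so it runs on its own; the variable name `heights` is introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Same 50 heights (in centimeters) as in the example above
heights = np.array([165, 170, 175, 168, 172, 173, 169, 171, 168, 172,
                    170, 167, 170, 171, 174, 172, 170, 168, 172, 169,
                    170, 172, 170, 168, 170, 172, 170, 169, 173, 171,
                    167, 170, 172, 169, 171, 170, 168, 173, 170, 169,
                    171, 173, 170, 168, 174, 171, 170, 172, 169, 172])

sample_mean = np.mean(heights)
standard_error = stats.sem(heights)  # standard error of the mean (ddof=1 by default)

# The confidence level is passed positionally because the keyword name
# differs between SciPy versions
lower, upper = stats.t.interval(0.95, len(heights) - 1,
                                loc=sample_mean, scale=standard_error)

print("95% Confidence Interval: ({:.2f}, {:.2f})".format(lower, upper))
```

Both approaches use the same t critical value and standard error, so they should agree with the interval printed above.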


# Q6. Use Bayes' Theorem to calculate the probability of an event occurring given prior knowledge of the event's probability and new evidence. Provide a sample problem and solution.

# Ans : 6

Sample Problem:
Suppose there is a rare disease that affects 1 in 10,000 people in a certain population. We have a diagnostic test for this disease, but it is not perfect. The test has the following characteristics:

- The probability of a positive test result (indicating the presence of the disease) given that a person has the disease is 0.98 (P(Positive|Disease) = 0.98).
- The probability of a negative test result (indicating the absence of the disease) given that a person does not have the disease is 0.99 (P(Negative|No Disease) = 0.99).

Now, a person from the population gets tested, and the test result is positive. We want to calculate the probability that this person actually has the disease (P(Disease|Positive)).

Solution:
Let's use Bayes' Theorem to calculate the probability of having the disease given a positive test result (P(Disease|Positive)).

Bayes' Theorem is given by:

  **P(Disease|Positive) = P(Positive|Disease) · P(Disease) / P(Positive)**

Where:
- P(Disease|Positive) is the probability of having the disease given a positive test result.
- P(Positive|Disease) is the probability of a positive test result given that the person has the disease (0.98).
- P(Disease) is the prior probability of having the disease (1 in 10,000 or 0.0001).
- P(Positive) is the overall probability of a positive test result, obtained from the law of total probability: P(Positive) = P(Positive|Disease) · P(Disease) + P(Positive|No Disease) · P(No Disease).


Let's calculate P(Positive) and then use Bayes' Theorem to find P(Disease|Positive) in Python:

In [12]:
# Given data
p_positive_given_disease = 0.98
p_disease = 0.0001
p_positive_given_no_disease = 1 - 0.99
p_no_disease = 1 - p_disease

# Calculate P(Positive)
p_positive = (p_positive_given_disease * p_disease) + (p_positive_given_no_disease * p_no_disease)

# Calculate P(Disease|Positive) using Bayes' Theorem
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive

# Display the result
print("Probability of having the disease given a positive test result:")
print("{:.6f}".format(p_disease_given_positive))

Probability of having the disease given a positive test result:
0.009706
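The result, just under 1%, can look surprisingly low for a test that is 98% sensitive. One way to see why is to count people instead of multiplying probabilities. The sketch below applies the same rates to a hypothetical population of 1,000,000; the population size is an assumption chosen only to give round numbers.

```python
population = 1_000_000   # hypothetical population size, chosen for round numbers

prevalence = 0.0001      # 1 in 10,000 people have the disease
sensitivity = 0.98       # P(Positive | Disease)
specificity = 0.99       # P(Negative | No Disease)

diseased = population * prevalence
healthy = population - diseased

true_positives = diseased * sensitivity
false_positives = healthy * (1 - specificity)

# Among everyone who tests positive, the share who actually have the disease
posterior = true_positives / (true_positives + false_positives)

print(f"People with the disease: {diseased:.0f}")
print(f"True positives: {true_positives:.0f}")
print(f"False positives: {false_positives:.0f}")
print(f"P(Disease|Positive): {posterior:.6f}")
```

Because the disease is so rare, the roughly 1% of healthy people who test positive far outnumber the true positives, which is what drives the posterior probability down.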


# Q7. Calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation of 5. Interpret the results.

# Ans: 7 


To calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation of 5, we need to use the formula for a confidence interval for the population mean:

**CI = x̄ ± Z * (s / √n)**

The question does not give a sample size, so we assume a sample size of 30 for illustration and calculate the 95% confidence interval using the given information (mean = 50, standard deviation = 5):

In [13]:
import scipy.stats as st

# Given data
sample_mean = 50
sample_std = 5
confidence_level = 0.95
sample_size = 30

# Calculate the critical value (Z-score) for 95% confidence level
critical_value = st.norm.ppf((1 + confidence_level) / 2)

# Calculate the margin of error
margin_of_error = critical_value * (sample_std / (sample_size ** 0.5))

# Calculate the confidence interval
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

# Display the result
print("95% Confidence Interval: ({:.2f}, {:.2f})".format(lower_bound, upper_bound))

95% Confidence Interval: (48.21, 51.79)


**Interpretation of the Results:**

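The 95% confidence interval for the population mean (μ) based on the sample data is (48.21, 51.79). This means that we are 95% confident that the true population mean falls within this range: if we repeated the sampling many times, about 95% of the intervals built this way would contain μ.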
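The calculation above uses a z critical value together with a sample standard deviation. Since the standard deviation is estimated from the sample, a t critical value is arguably the more cautious choice. The sketch below compares the two under the same assumed sample size of 30.

```python
import scipy.stats as st

sample_mean, sample_std = 50, 5
sample_size = 30            # assumed, as in the calculation above
confidence_level = 0.95

standard_error = sample_std / sample_size ** 0.5

# Critical values from the normal and the t-distribution
z_crit = st.norm.ppf((1 + confidence_level) / 2)
t_crit = st.t.ppf((1 + confidence_level) / 2, df=sample_size - 1)

z_interval = (sample_mean - z_crit * standard_error,
              sample_mean + z_crit * standard_error)
t_interval = (sample_mean - t_crit * standard_error,
              sample_mean + t_crit * standard_error)

print("z-based 95% CI: ({:.2f}, {:.2f})".format(*z_interval))
print("t-based 95% CI: ({:.2f}, {:.2f})".format(*t_interval))
```

The t-based interval is slightly wider, and the gap between the two shrinks as the sample size increases.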

# Q8. What is the margin of error in a confidence interval? How does sample size affect the margin of error? Provide an example of a scenario where a larger sample size would result in a smaller margin of error.

# Ans: 8 



The margin of error (MOE) in a confidence interval (CI) is a measure of the uncertainty or precision associated with the estimate of a population parameter based on a sample. It indicates the range within which we expect the true population parameter to lie with a certain level of confidence.

In statistical terms, when we calculate a confidence interval for a population parameter (e.g., mean, proportion), we use a sample statistic (e.g., sample mean, sample proportion) to estimate the true population parameter. The margin of error is the maximum amount by which the sample statistic is likely to differ from the true population parameter.

The margin of error is directly related to the level of confidence chosen for the interval and the variability of the data in the sample. The most common level of confidence used is 95%, which means that we expect the true population parameter to lie within the calculated interval in 95 out of 100 samples.

The formula for calculating the margin of error in a confidence interval is:

**MOE = Z * (σ / √n)**

Where:

- MOE = Margin of Error
- Z = Z-score associated with the desired level of confidence (e.g., 1.96 for a 95% confidence level)
- σ = Standard deviation of the population (unknown in most cases, so we often use the sample standard deviation as an estimate)
- n = Sample size

Now, as for how the sample size affects the margin of error, we can observe that the margin of error is inversely proportional to the square root of the sample size. In other words, as the sample size increases, the margin of error decreases.

Here's an example:

Let's say you want to estimate the average height of students at a particular university with a 95% confidence level. You consider two sample sizes, one with 100 students and another with 400 students.

For the sample with 100 students:
MOE = Z * (σ / √n)
Assuming σ (population standard deviation) is 4 inches (just for illustration purposes):
MOE = 1.96 * (4 / √100) = 1.96 * 0.4 ≈ 0.78 inches

For the sample with 400 students:
MOE = Z * (σ / √n)
MOE = 1.96 * (4 / √400) = 1.96 * 0.2 ≈ 0.39 inches

Quadrupling the sample size from 100 to 400 halves the margin of error, because the margin of error shrinks in proportion to 1/√n.
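The same arithmetic can be sketched in Python for a range of sample sizes. The helper name `margin_of_error` is hypothetical, and the extra sample sizes 25 and 1600 are added only to extend the pattern.

```python
import math

def margin_of_error(sigma, n, z=1.96):
    """Margin of error for a mean: z * sigma / sqrt(n)."""
    return z * sigma / math.sqrt(n)

sigma = 4  # inches, the illustrative value used above

for n in [25, 100, 400, 1600]:
    print(f"n = {n:>4}: MOE = {margin_of_error(sigma, n):.3f} inches")
```

Each fourfold increase in the sample size cuts the margin of error in half, so extra precision becomes progressively more expensive in terms of data.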

# Q9. Calculate the z-score for a data point with a value of 75, a population mean of 70, and a population standard deviation of 5. Interpret the results.

In [14]:
# Ans: 9


# Given data
data_point = 75
population_mean = 70
population_std_dev = 5

# Calculate the z-score
z_score = (data_point - population_mean) / population_std_dev

# Display the z-score
print("The z-score for the data point is:", z_score)

The z-score for the data point is: 1.0


**Now, let's interpret the results:**

The calculated z-score represents the number of standard deviations the data point (75) is away from the population mean (70). In this case:

Z = (75 - 70) / 5 = 1

A z-score of 1 means that the data point is 1 standard deviation above the mean. Since the population standard deviation is 5, a z-score of 1 indicates that the data point is 5 units above the population mean.
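If we additionally assume that the population is normally distributed (the question does not say so), the z-score can be turned into a percentile. A minimal sketch:

```python
from scipy.stats import norm

z_score = (75 - 70) / 5

# Under a normal distribution: share of the population below / above the data point
share_below = norm.cdf(z_score)
share_above = norm.sf(z_score)

print(f"Share of values below 75: {share_below:.4f}")
print(f"Share of values above 75: {share_above:.4f}")
```

Under that assumption, a value one standard deviation above the mean is larger than roughly 84% of the population.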

# Q10. In a study of the effectiveness of a new weight loss drug, a sample of 50 participants lost an average of 6 pounds with a standard deviation of 2.5 pounds. Conduct a hypothesis test to determine if the drug is significantly effective at a 95% confidence level using a t-test.

# Ans: 10 

**Let's define the hypotheses:**

- Null Hypothesis (H0): The weight loss drug is not effective: the population mean weight loss is less than or equal to 0 pounds. (μ ≤ 0)

- Alternative Hypothesis (H1): The weight loss drug is effective: the population mean weight loss is greater than 0 pounds. (μ > 0)


We will use a one-tailed t-test because the alternative hypothesis is directional (μ > 0). A 95% confidence level corresponds to a significance level of α = 0.05.


In [15]:
import scipy.stats as stats

sample_mean = 6
sample_std_dev = 2.5
sample_size = 50
population_mean_null = 0

# Calculate the t-statistic
standard_error = sample_std_dev / (sample_size ** 0.5)
t_statistic = (sample_mean - population_mean_null) / standard_error

# Degrees of freedom for a one-sample t-test
degrees_of_freedom = sample_size - 1

# Calculate the critical t-value for a one-tailed t-test at 95% confidence level
alpha = 0.05
critical_t_value = stats.t.ppf(1 - alpha, df=degrees_of_freedom)

# Print the results
print("Calculated t-statistic:", t_statistic)
print("Critical t-value:", critical_t_value)

Calculated t-statistic: 16.970562748477143
Critical t-value: 1.6765508919142629
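Comparing the two printed values gives the decision: the calculated t-statistic (about 16.97) is far larger than the critical t-value (about 1.68), so we reject H0 and conclude that the drug is significantly effective at the 95% confidence level. The sketch below makes that comparison explicit and adds the one-tailed p-value; it recomputes the statistic from the same summary values so it runs on its own.

```python
import scipy.stats as stats

sample_mean, sample_std_dev, sample_size = 6, 2.5, 50
population_mean_null = 0
alpha = 0.05

standard_error = sample_std_dev / sample_size ** 0.5
t_statistic = (sample_mean - population_mean_null) / standard_error
degrees_of_freedom = sample_size - 1

critical_t_value = stats.t.ppf(1 - alpha, df=degrees_of_freedom)

# One-tailed (upper tail) p-value: P(T >= t_statistic) under H0
p_value = stats.t.sf(t_statistic, df=degrees_of_freedom)

reject_null = t_statistic > critical_t_value
print("p-value:", p_value)
print("Reject H0:", reject_null)
```

The critical-value rule and the p-value rule always agree: the statistic exceeds the critical value exactly when the p-value is below α.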


# Q11. In a survey of 500 people, 65% reported being satisfied with their current job. Calculate the 95% confidence interval for the true proportion of people who are satisfied with their job.

# Ans: 11

To calculate the 95% confidence interval for the true proportion of people who are satisfied with their job, we can use the formula for the confidence interval for a proportion, p̂ ± Z * √(p̂(1 − p̂) / n), where p̂ is the sample proportion and n is the sample size.

In [16]:
import math

# Given data
sample_proportion = 0.65
sample_size = 500

# Calculate the critical z-score for a 95% confidence level
confidence_level = 0.95
critical_z_score = stats.norm.ppf(1 - (1 - confidence_level) / 2)

# Calculate the standard error
standard_error = math.sqrt((sample_proportion * (1 - sample_proportion)) / sample_size)

# Calculate the lower and upper bounds of the confidence interval
lower_bound = sample_proportion - critical_z_score * standard_error
upper_bound = sample_proportion + critical_z_score * standard_error

# Convert bounds to percentages
lower_bound_percent = lower_bound * 100
upper_bound_percent = upper_bound * 100

# Print the results
print("95% Confidence Interval for the proportion of people satisfied with their job:")
print(f"{lower_bound_percent:.2f}% to {upper_bound_percent:.2f}%")

95% Confidence Interval for the proportion of people satisfied with their job:
60.82% to 69.18%


# Q12. A researcher is testing the effectiveness of two different teaching methods on student performance. Sample A has a mean score of 85 with a standard deviation of 6, while sample B has a mean score of 82 with a standard deviation of 5. Conduct a hypothesis test to determine if the two teaching methods have a significant difference in student performance using a t-test with a significance level of 0.01.

# Ans: 12


To conduct a hypothesis test to determine if there is a significant difference in student performance between the two teaching methods, we can use a two-sample t-test. The null hypothesis (H0) assumes that there is no difference between the population means of the two methods, while the alternative hypothesis (H1) assumes that there is a difference. The question does not give the sample sizes, so the code below assumes 30 students in sample A and 25 in sample B for illustration.

Let's define the hypotheses:

- Null Hypothesis (H0): The two teaching methods lead to the same mean student performance.
  (μA - μB = 0)

- Alternative Hypothesis (H1): The two teaching methods lead to different mean student performance.
   (μA - μB ≠ 0)

We will use a two-tailed t-test because the alternative hypothesis is non-directional (it doesn't specify which mean is larger).


In [17]:
import math
import scipy.stats as stats

# Given data for sample A
sample_mean_A = 85
sample_std_dev_A = 6
sample_size_A = 30  # assumed; the problem does not give sample sizes

# Given data for sample B
sample_mean_B = 82
sample_std_dev_B = 5
sample_size_B = 25  # assumed; the problem does not give sample sizes

# Calculate the standard error of the difference in means
# (unpooled: each sample variance is divided by its own sample size)
standard_error = math.sqrt(((sample_std_dev_A ** 2) / sample_size_A) + ((sample_std_dev_B ** 2) / sample_size_B))

# Calculate the t-statistic
t_statistic = (sample_mean_A - sample_mean_B) / standard_error

# Degrees of freedom for a two-sample t-test (n_A + n_B - 2)
degrees_of_freedom = sample_size_A + sample_size_B - 2

# Calculate the critical t-value for a two-tailed t-test at a significance level of 0.01
alpha = 0.01
critical_t_value = stats.t.ppf(1 - alpha / 2, df=degrees_of_freedom)

# Print the results
print("Calculated t-statistic:", t_statistic)
print("Critical t-value:", critical_t_value)

Calculated t-statistic: 2.0225995873897262
Critical t-value: 2.6718226362410027


**Let's interpret the results:**

If the absolute value of the calculated t-statistic is greater than the critical t-value, we reject the null hypothesis (H0); otherwise we fail to reject it. Here the absolute value of the calculated t-statistic (about 2.02) is less than the critical t-value (about 2.67), so we fail to reject the null hypothesis at the 0.01 significance level. With the sample sizes used in the code, we do not have sufficient evidence to claim a significant difference in performance between the two teaching methods.
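The same test can be cross-checked with SciPy's `ttest_ind_from_stats`, which works directly from summary statistics and also returns a p-value. This is a sketch that reuses the sample sizes 30 and 25 from the code above; passing `equal_var=False` selects Welch's t-test, which matches a statistic built by dividing each variance by its own sample size.

```python
from scipy.stats import ttest_ind_from_stats

result = ttest_ind_from_stats(
    mean1=85, std1=6, nobs1=30,   # sample A (n = 30 as in the code above)
    mean2=82, std2=5, nobs2=25,   # sample B (n = 25 as in the code above)
    equal_var=False,              # Welch's t-test: variances are not pooled
)

alpha = 0.01
print("t-statistic:", result.statistic)
print("p-value:", result.pvalue)
print("Reject H0 at alpha = 0.01:", result.pvalue < alpha)
```

The p-value gives the same decision as the critical-value comparison: it is above 0.01, so H0 is not rejected at that level.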

# Q13. A population has a mean of 60 and a standard deviation of 8. A sample of 50 observations has a mean of 65. Calculate the 90% confidence interval for the true population mean.

In [19]:
# Ans: 13 

# Given data
sample_mean = 65
population_mean = 60
population_std_dev = 8
sample_size = 50

# Calculate the critical z-score for a 90% confidence level
confidence_level = 0.90
critical_z_score = stats.norm.ppf(1 - (1 - confidence_level) / 2)

# Calculate the standard error
standard_error = population_std_dev / (sample_size ** 0.5)

# Calculate the lower and upper bounds of the confidence interval
lower_bound = sample_mean - critical_z_score * standard_error
upper_bound = sample_mean + critical_z_score * standard_error

# Print the results
print("90% Confidence Interval for the true population mean:")
print(f"{lower_bound:.2f} to {upper_bound:.2f}")

90% Confidence Interval for the true population mean:
63.14 to 66.86


# Q14. In a study of the effects of caffeine on reaction time, a sample of 30 participants had an average reaction time of 0.25 seconds with a standard deviation of 0.05 seconds. Conduct a hypothesis test to determine if the caffeine has a significant effect on reaction time at a 90% confidence level using a t-test.

# Ans: 14


To conduct a hypothesis test to determine if caffeine has a significant effect on reaction time, we can use a one-sample t-test. The question does not state a baseline (no-caffeine) reaction time, so the code below uses a null value of 0 as a placeholder; in practice the null value μ0 would be the known baseline reaction time.

Let's define the hypotheses:

- Null Hypothesis (H0): Caffeine has no effect on reaction time. (μ = μ0, with μ0 = 0 in the code below)

- Alternative Hypothesis (H1): Caffeine does have an effect on reaction time. (μ ≠ μ0)

We will use a two-tailed t-test because the alternative hypothesis is non-directional.


In [20]:
# Given data
sample_mean = 0.25
population_mean_null = 0
sample_std_dev = 0.05
sample_size = 30

# Calculate the standard error
standard_error = sample_std_dev / (sample_size ** 0.5)

# Calculate the t-statistic
t_statistic = (sample_mean - population_mean_null) / standard_error

# Degrees of freedom for a one-sample t-test
degrees_of_freedom = sample_size - 1

# Calculate the critical t-value for a two-tailed t-test at a 90% confidence level
confidence_level = 0.90
critical_t_value = stats.t.ppf(1 - (1 - confidence_level) / 2, df=degrees_of_freedom)

# Print the results
print("Calculated t-statistic:", t_statistic)
print("Critical t-value:", critical_t_value)

Calculated t-statistic: 27.386127875258307
Critical t-value: 1.6991270265334972


**Let's interpret the results:**

If the absolute value of the calculated t-statistic is greater than the critical t-value, we reject the null hypothesis (H0); otherwise we fail to reject it. Here the calculated t-statistic (about 27.39) is far greater than the critical t-value (about 1.70), so we reject H0 at the 90% confidence level (α = 0.10). Note that this result is driven by the null value of 0 used in the code: any realistic reaction time is far from 0 seconds, so a meaningful conclusion about caffeine requires testing against the baseline reaction time.
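To show how the conclusion depends on the null value, the sketch below wraps the test in a small helper and runs it for two null values: the 0 used in the code above and a baseline of 0.27 seconds. The baseline of 0.27 is purely hypothetical and is not given in the question; the function name `one_sample_t_test` is also introduced here for illustration.

```python
import scipy.stats as stats

def one_sample_t_test(sample_mean, sample_std_dev, sample_size, null_mean):
    """Two-tailed one-sample t-test computed from summary statistics."""
    standard_error = sample_std_dev / sample_size ** 0.5
    t_statistic = (sample_mean - null_mean) / standard_error
    p_value = 2 * stats.t.sf(abs(t_statistic), df=sample_size - 1)
    return t_statistic, p_value

alpha = 0.10  # 90% confidence level

# 0.0 is the null value used above; 0.27 s is a hypothetical baseline
for null_mean in [0.0, 0.27]:
    t_statistic, p_value = one_sample_t_test(0.25, 0.05, 30, null_mean)
    print(f"null mean = {null_mean}: t = {t_statistic:.3f}, "
          f"p = {p_value:.4g}, reject H0: {p_value < alpha}")
```

With a realistic baseline the t-statistic is much smaller in magnitude, so the strength of the evidence depends heavily on which reaction time caffeine is compared against.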