Q1: What is the difference between a t-test and a z-test? Provide an example scenario where you would
use each type of test.

Both t-tests and z-tests are statistical hypothesis tests used to make inferences about population parameters based on sample data. However, they are used under different circumstances and assumptions.

**1. t-test:**
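A t-test is used when the population standard deviation is unknown and must be estimated from the sample, which is the usual situation in practice. The choice matters most for small samples (a common rule of thumb is fewer than 30 observations), because the t-distribution has heavier tails than the normal distribution and so accounts for the extra uncertainty in the estimated standard deviation. The test assumes the data are approximately normal, or that the sample is large enough for the sample mean to be approximately normal.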

**Example scenario for using a t-test:**
Suppose you are testing whether there is a significant difference in the average test scores of two groups of students (Group A and Group B) after a new teaching method was introduced. You collect test scores from 15 students in each group. Since the population standard deviations are unknown, you would use a t-test to compare the means of the two groups.

**2. z-test:**
A z-test is used when the population standard deviation is known, or when the sample is large enough (a common rule of thumb is more than 30 observations) that the sample standard deviation is a reliable stand-in for it. With a large sample, the central limit theorem makes the sampling distribution of the mean approximately normal, so the normal approximation is valid.

**Example scenario for using a z-test:**
Imagine you are analyzing the heights of a sample of 1000 adults in a city. You want to determine if the average height of this sample is significantly different from the known average height of adults in the entire country. Since you have a large sample size and the population standard deviation is known, you can use a z-test for this comparison.
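
The practical gap between the two tests can be seen by comparing their critical values. The sketch below is only an illustration: the helper name `t_critical` and the chosen sample sizes are assumptions, not part of the original answer. It prints the two-sided 95% critical value of the t-distribution for several sample sizes next to the fixed z critical value.

```python
from scipy import stats

# Two-sided 95% critical value from the standard normal distribution
z_crit = stats.norm.ppf(0.975)


def t_critical(n, confidence_level=0.95):
    """Two-sided critical value of the t-distribution for a sample of size n."""
    return stats.t.ppf((1 + confidence_level) / 2, df=n - 1)


# Hypothetical sample sizes chosen to show the trend
for n in (5, 15, 30, 100, 1000):
    print(f"n={n:>4}: t critical = {t_critical(n):.3f}, z critical = {z_crit:.3f}")
```

The t critical value is always larger than the z critical value, and the difference shrinks as the sample grows, which is why the two tests give nearly identical answers for large samples.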

Q2: Differentiate between one-tailed and two-tailed tests.

**One-Tailed Test:**
A one-tailed test, also known as a one-sided test, is a type of statistical hypothesis test where the alternative hypothesis focuses on a specific direction of difference or effect. In other words, it tests whether the sample data is significantly greater than or less than a certain value. The critical region for the test is only on one side of the distribution (either the upper tail or the lower tail).

**Example of a one-tailed test:**
Imagine a pharmaceutical company has developed a new drug that it claims increases the average life expectancy of patients. To test this claim, the company compares the life expectancy of patients who took the new drug to a historical average life expectancy. The null hypothesis (H0) states that the drug does not increase life expectancy above the historical average. The alternative hypothesis (Ha) for a one-tailed test is that the new drug increases life expectancy. If the data lead to rejecting H0 in favor of Ha, this suggests that the new drug is effective in increasing life expectancy.

**Two-Tailed Test:**
A two-tailed test, also known as a two-sided test, is a type of statistical hypothesis test where the alternative hypothesis is concerned with a difference or effect in either direction. It tests whether the sample data is significantly different from a certain value, without specifying whether it's greater or smaller. The critical region for the test is divided between both tails of the distribution.

**Example of a two-tailed test:**
Suppose a car manufacturer claims that a certain model of their cars has an average fuel efficiency of 40 miles per gallon (mpg). You decide to test this claim by collecting data from a sample of these cars and calculating their average fuel efficiency. The null hypothesis (H0) would be that the average fuel efficiency is 40 mpg. The alternative hypothesis (Ha) for a two-tailed test would be that the average fuel efficiency is not equal to 40 mpg. If the statistical analysis shows that the sample mean is significantly different from 40 mpg in either direction, it would suggest that the manufacturer's claim is not supported.
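
The difference between the two kinds of test shows up directly in the p-value. The following is a minimal sketch, assuming a t statistic has already been computed; the function name `p_values` and the inputs (t = 1.8 with 24 degrees of freedom) are hypothetical.

```python
from scipy import stats


def p_values(t_statistic, df):
    """Return upper-tail, lower-tail, and two-tailed p-values for a t statistic."""
    upper = stats.t.sf(t_statistic, df)                # Ha: parameter is greater
    lower = stats.t.cdf(t_statistic, df)               # Ha: parameter is smaller
    two_sided = 2 * stats.t.sf(abs(t_statistic), df)   # Ha: parameter is different
    return upper, lower, two_sided


# Hypothetical test statistic and degrees of freedom
upper, lower, two_sided = p_values(1.8, df=24)
print(f"Upper-tail p-value: {upper:.4f}")
print(f"Lower-tail p-value: {lower:.4f}")
print(f"Two-tailed p-value: {two_sided:.4f}")
```

For these inputs the upper-tail p-value falls below 0.05 while the two-tailed p-value does not, so the same data can be significant in a one-tailed test and not in a two-tailed one. This is why the direction of the alternative hypothesis should be fixed before looking at the data.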



Q3: Explain the concept of Type 1 and Type 2 errors in hypothesis testing. Provide an example scenario for
each type of error.

**Type I Error (False Positive):**
A Type I error occurs when you reject the null hypothesis when it is actually true. In other words, you conclude that there is a significant effect or difference when, in reality, there is no such effect or difference. This error is also known as a false positive or alpha error, and its probability is denoted by the symbol "α" (the significance level).

**Example of Type I Error:**
Imagine a clinical trial for a new medical treatment. The null hypothesis (H0) states that the treatment has no effect on patients' recovery time. The alternative hypothesis (Ha) suggests that the treatment does have an effect. If the researchers incorrectly reject the null hypothesis based on the data from the trial, and conclude that the treatment is effective when it's not, they have committed a Type I error. This could lead to unnecessary adoption of an ineffective treatment.

**Type II Error (False Negative):**
A Type II error occurs when you fail to reject the null hypothesis when it is actually false. In this case, you conclude that there is no significant effect or difference, even though such an effect or difference exists. This error is also known as a false negative or beta error, and its probability is denoted by the symbol "β."

**Example of Type II Error:**
Continuing with the medical treatment example, the null hypothesis (H0) still states that the treatment has no effect on patients' recovery time, and the alternative hypothesis (Ha) states that it does. Suppose the treatment really does shorten recovery time, but the trial is too small or the data too noisy to show it. If the researchers fail to reject the null hypothesis and conclude that the treatment is not effective when it actually is, they have committed a Type II error. This could prevent a potentially beneficial treatment from being adopted.

In summary:
- **Type I Error:** Rejecting the null hypothesis when it's true (False Positive)
- **Type II Error:** Failing to reject the null hypothesis when it's false (False Negative)

Hypothesis testing aims to strike a balance between these two types of errors. The significance level (α) is the probability of a Type I error, and the power of the test (1 − β) is the probability of correctly rejecting a false null hypothesis. For a fixed sample size and effect size, raising α reduces the likelihood of a Type II error but increases the risk of a Type I error, and vice versa. Increasing the sample size raises the power, and so lowers β, without changing α.
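
The trade-off can be made concrete with a one-sided z-test, where the power has a closed form. This is a sketch under stated assumptions: the function name `z_test_power`, the effect size of 0.5, and the standard deviation of 2.0 are hypothetical values chosen for illustration.

```python
import numpy as np
from scipy import stats


def z_test_power(effect, sigma, n, alpha):
    """Power of a one-sided (upper-tail) z-test when the true mean shift is `effect`."""
    z_alpha = stats.norm.ppf(1 - alpha)        # rejection threshold on the z scale
    shift = effect / (sigma / np.sqrt(n))      # mean of the z statistic when Ha is true
    return stats.norm.sf(z_alpha - shift)      # P(reject H0 | Ha true) = 1 - beta


# Hypothetical effect size and population standard deviation
for alpha in (0.01, 0.05, 0.10):
    for n in (20, 80):
        power = z_test_power(effect=0.5, sigma=2.0, n=n, alpha=alpha)
        print(f"alpha={alpha:.2f}, n={n:>2}: power={power:.3f}, beta={1 - power:.3f}")
```

Reading down the output, a larger α lowers β for the same sample size, and a larger sample lowers β for the same α.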

Q4: Explain Bayes's theorem with an example.

Bayes's Theorem is a fundamental concept in probability theory and statistics that allows you to update the probability of a hypothesis based on new evidence or information. It provides a way to incorporate prior knowledge (prior probability) and new data (likelihood) to calculate the revised probability (posterior probability) of the hypothesis being true.

The formula for Bayes's Theorem is:

\( P(A|B) = \frac{P(B|A) \, P(A)}{P(B)} \)

Where:
- \( P(A|B) \) is the posterior probability of event A given event B.
- \( P(B|A) \) is the likelihood of event B given event A.
- \( P(A) \) is the prior probability of event A.
- \( P(B) \) is the probability of event B.


In [1]:
#define function for Bayes' theorem
def bayesTheorem(pA, pB, pBA):
    return pA * pBA / pB

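#define probabilities
pRain = 0.2          # P(Rain): prior probability of rain
pCloudy = 0.4        # P(Cloudy): overall probability of a cloudy day
pCloudyRain = 0.85   # P(Cloudy | Rain): probability it is cloudy given that it rains

#use function to calculate the conditional probability P(Rain | Cloudy)
bayesTheorem(pRain, pCloudy, pCloudyRain)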


0.425

Q5: What is a confidence interval? How to calculate the confidence interval, explain with an example.

A confidence interval is a range of values that is used to estimate the true value of a population parameter, such as the mean or the proportion, based on a sample from that population. It provides a measure of the uncertainty associated with the sample estimate and indicates the range within which the true population parameter is likely to fall.

Confidence intervals are often used in statistical inference to quantify the uncertainty in an estimate. For example, a 95% confidence interval for the mean is built by a procedure that captures the true population mean in about 95% of repeated samples. Informally, we say that we are 95% confident that the true population mean lies within the calculated interval.

The formula to calculate a confidence interval depends on the parameter being estimated and the distribution of the sample statistic. Here's a general formula for calculating a confidence interval for a population mean (assuming a normal distribution):

Confidence Interval = Sample Mean ± Margin of Error

The margin of error depends on the level of confidence (typically denoted as \(1 - \alpha\), where \(\alpha\) is the significance level) and the standard error of the sample mean. The standard error reflects the variability of the sample mean.


In [2]:
import numpy as np
from scipy import stats

# Sample data
data = np.array([62, 68, 65, 70, 63, 67, 64, 69, 66, 68])

# Confidence level (e.g., 95%)
confidence_level = 0.95

# Calculate sample statistics
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)  # ddof=1 for sample standard deviation
sample_size = len(data)

# Calculate standard error
standard_error = sample_std / np.sqrt(sample_size)

# Calculate t-score for the desired confidence level and degrees of freedom
t_score = stats.t.ppf((1 + confidence_level) / 2, df=sample_size - 1)

# Calculate margin of error
margin_of_error = t_score * standard_error

# Calculate confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print("Sample Mean:", sample_mean)
print("Confidence Interval:", confidence_interval)


Sample Mean: 66.2
Confidence Interval: (64.29835223544127, 68.10164776455873)
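
The meaning of the 95% level can be checked by simulation. The sketch below assumes a normal population with a hypothetical mean of 66 and standard deviation of 2.5 (these values are illustrative, not taken from the data above), draws many samples of size 10, and counts how often the t-based interval contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)            # fixed seed so the run is repeatable
true_mean, true_sd = 66.0, 2.5            # hypothetical population values
n, repetitions, confidence_level = 10, 2000, 0.95

# Each row is one simulated sample of size n
samples = rng.normal(true_mean, true_sd, size=(repetitions, n))
means = samples.mean(axis=1)
standard_errors = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_score = stats.t.ppf((1 + confidence_level) / 2, df=n - 1)

# An interval covers the true mean when the mean lies within one margin of error
covered = np.abs(means - true_mean) <= t_score * standard_errors
coverage = covered.mean()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should land close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes the long-run behavior of the procedure.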


Q6. Use Bayes' Theorem to calculate the probability of an event occurring given prior knowledge of the
event's probability and new evidence. Provide a sample problem and solution.

Suppose you are a quality control manager at a factory that produces electronic components. You know that 5% of the components produced are defective. You also know that the testing machine has a false positive rate of 2% (it reports a component as defective when it's actually not), and a false negative rate of 1% (it fails to detect a defective component). If a randomly selected component is tested and the machine reports it as defective, what is the probability that the component is actually defective?

In [3]:
# Given probabilities
p_d = 0.05  # Prior probability of being defective
p_t_given_d = 0.99  # Probability of testing defective given it's defective
p_t_given_not_d = 0.02  # Probability of testing defective given it's not defective

# Calculate the total probability of testing defective
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)

# Calculate the posterior probability of being defective given testing defective
p_d_given_t = (p_t_given_d * p_d) / p_t

print("Probability of being defective given testing defective:", p_d_given_t)


Probability of being defective given testing defective: 0.7226277372262774
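
The same calculation can be wrapped in a function to see how strongly the answer depends on the prior. In this sketch the function name `posterior_defective` and the alternative defect rates of 1% and 20% are assumptions added for illustration; only the 5% case comes from the problem above.

```python
def posterior_defective(prior, sensitivity, false_positive_rate):
    """P(defective | machine reports defective) via Bayes' theorem."""
    # Total probability that the machine reports a component as defective
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence


# 0.05 is the defect rate from the problem; 0.01 and 0.20 are hypothetical
for prior in (0.01, 0.05, 0.20):
    posterior = posterior_defective(prior, sensitivity=0.99, false_positive_rate=0.02)
    print(f"Prior {prior:.2f} -> posterior {posterior:.4f}")
```

With the same machine, a rarer defect makes a positive report much less conclusive, because false positives from the large pool of good components outnumber the true positives.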


Q7. Calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation
of 5. Interpret the results.

In [5]:
import numpy as np

sample_mean = 50
sample_std = 5
sample_size = 30  # Assumption: the question does not give a sample size

# Calculate standard error
standard_error = sample_std / np.sqrt(sample_size)

# Calculate the critical value for a 95% confidence interval
critical_value = 1.96

# Calculate the margin of error
margin_of_error = critical_value * standard_error

# Calculate confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print("Sample Mean:", sample_mean)
print("Confidence Interval:", confidence_interval)


Sample Mean: 50
Confidence Interval: (48.21077297881646, 51.78922702118354)

Interpretation: with the assumed sample size of 30, the 95% confidence interval for the population mean runs from about 48.21 to 51.79. Intervals built this way capture the true mean in about 95% of repeated samples. A larger sample would narrow the interval, and a smaller one would widen it.


Q8. What is the margin of error in a confidence interval? How does sample size affect the margin of error?
Provide an example of a scenario where a larger sample size would result in a smaller margin of error.

The margin of error (MOE) in a confidence interval is the half-width of the interval: the maximum amount by which the sample estimate (e.g., sample mean or proportion) is expected to differ from the true population parameter at the chosen confidence level. The interval runs from the estimate minus the MOE to the estimate plus the MOE. The larger the margin of error, the less precise the estimate.

The formula for the margin of error depends on factors such as the level of confidence, the variability of the sample data, and the sample size. In general, the margin of error is calculated as the product of a critical value (usually obtained from a statistical distribution like the normal or t-distribution) and the standard error of the sample statistic.

Sample size has a significant impact on the margin of error. As the sample size increases, the margin of error decreases, because the standard error is proportional to \( 1/\sqrt{n} \): quadrupling the sample size halves the margin of error. Larger samples provide more information about the population and thus reduce the uncertainty in the estimate.

**Example Scenario:**
Imagine you are conducting a political poll to estimate the proportion of voters who support a particular candidate. You want to calculate a confidence interval with a 95% confidence level.

Scenario 1: Small Sample Size
If you survey only 50 voters, the margin of error will likely be larger. This means that the confidence interval will be wider, and your estimate of the proportion of supporters will be less precise.

Scenario 2: Large Sample Size
Now, if you survey 1000 voters, the margin of error will be smaller. The confidence interval will be narrower, and your estimate of the proportion of supporters will be more precise.

In both scenarios, you might use the same level of confidence (e.g., 95%), but the larger sample size in the second scenario leads to a smaller margin of error. This demonstrates the trade-off between precision and sample size in estimation. A larger sample size provides more information and reduces the uncertainty associated with the estimate, resulting in a smaller margin of error.
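
The two polling scenarios can be put into numbers. This is a minimal sketch, assuming a sample proportion of 0.5 (the value that gives the widest interval) and the normal approximation; the function name `proportion_margin_of_error` is hypothetical.

```python
import numpy as np
from scipy.stats import norm


def proportion_margin_of_error(p_hat, n, confidence_level=0.95):
    """Margin of error for a sample proportion using the normal approximation."""
    critical_value = norm.ppf((1 + confidence_level) / 2)
    standard_error = np.sqrt(p_hat * (1 - p_hat) / n)
    return critical_value * standard_error


# 0.5 is an assumed sample proportion; 50 and 1000 are the two scenarios above
for n in (50, 1000):
    moe = proportion_margin_of_error(0.5, n)
    print(f"n={n:>4}: margin of error = {moe:.3f}")
```

Going from 50 to 1000 respondents shrinks the margin of error from roughly 14 percentage points to roughly 3, a factor of \( \sqrt{20} \).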

Q9. Calculate the z-score for a data point with a value of 75, a population mean of 70, and a population
standard deviation of 5. Interpret the results.

The z-score (also known as the standard score) measures how many standard deviations a data point is away from the mean of the population. It's calculated using the formula:

\( z = \frac{x - \mu}{\sigma} \)

Where:
- \( x \) is the data point's value (75 in this case).
- \( \mu \) is the population mean (70 in this case).
- \( \sigma \) is the population standard deviation (5 in this case).

Let's calculate the z-score and interpret the result:

\( z = \frac{75 - 70}{5} = 1 \)

Interpretation:
The calculated z-score of 1 means that the data point with a value of 75 is 1 standard deviation above the population mean of 70. This indicates that the data point is relatively close to the mean of the population. Since the z-score is positive, it suggests that the data point is on the right side of the distribution (above the mean). The z-score provides a standardized measure of how much the data point deviates from the mean in terms of standard deviations.
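
The calculation can also be done in code. The sketch below additionally converts the z-score to a percentile, which assumes the population is normally distributed (the question does not state this); the function name `z_score` is hypothetical.

```python
from scipy.stats import norm


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


z = z_score(75, 70, 5)

# Share of the population below x, assuming a normal population
percentile = norm.cdf(z)

print("z-score:", z)
print(f"Share of values below 75 (normal assumption): {percentile:.4f}")
```

Under the normality assumption, a z-score of 1 places the data point above roughly 84% of the population.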

Q10. In a study of the effectiveness of a new weight loss drug, a sample of 50 participants lost an average
of 6 pounds with a standard deviation of 2.5 pounds. Conduct a hypothesis test to determine if the drug is
significantly effective at a 95% confidence level using a t-test.

In [None]:
import numpy as np
from scipy import stats

# Given sample data
sample_mean = 6
sample_std = 2.5
sample_size = 50

# Population parameters under the null hypothesis (no effect)
population_mean_null = 0  # Assuming no weight loss under the null hypothesis

# Calculate t-statistic
t_statistic = (sample_mean - population_mean_null) / (sample_std / np.sqrt(sample_size))

# Calculate degrees of freedom
degrees_of_freedom = sample_size - 1

# Calculate p-value for a two-tailed t-test
# (the survival function sf = 1 - cdf is more accurate for very small tail probabilities)
p_value = 2 * stats.t.sf(abs(t_statistic), df=degrees_of_freedom)

# Set significance level (alpha)
alpha = 0.05

# Check if p-value is less than alpha to reject the null hypothesis
if p_value < alpha:
    print("Reject the null hypothesis. The drug is significantly effective.")
else:
    print("Fail to reject the null hypothesis. There is no significant evidence of effectiveness.")


Reject the null hypothesis. The drug is significantly effective.


Q11. In a survey of 500 people, 65% reported being satisfied with their current job. Calculate the 95%
confidence interval for the true proportion of people who are satisfied with their job.

In [7]:
import numpy as np
from scipy.stats import norm

# Given values
sample_proportion = 0.65  # 65%
sample_size = 500

# Calculate standard error of the sample proportion
standard_error = np.sqrt((sample_proportion * (1 - sample_proportion)) / sample_size)

# Calculate the critical value for a 95% confidence interval
critical_value = norm.ppf(0.975)  # 0.975 for two-tailed 95% confidence interval

# Calculate the margin of error
margin_of_error = critical_value * standard_error

# Calculate confidence interval
confidence_interval = (sample_proportion - margin_of_error, sample_proportion + margin_of_error)

print("Sample Proportion:", sample_proportion)
print("Confidence Interval:", confidence_interval)


Sample Proportion: 0.65
Confidence Interval: (0.6081925393809212, 0.6918074606190788)


Q12. A researcher is testing the effectiveness of two different teaching methods on student performance.
Sample A has a mean score of 85 with a standard deviation of 6, while sample B has a mean score of 82
with a standard deviation of 5. Conduct a hypothesis test to determine if the two teaching methods have a
significant difference in student performance using a t-test with a significance level of 0.01.

In [8]:
import numpy as np
from scipy import stats

# Given sample data for two teaching methods
sample_A_mean = 85
sample_A_std = 6
sample_A_size = 30  # Assumption: the question does not give a sample size

sample_B_mean = 82
sample_B_std = 5
sample_B_size = 35  # Assumption: the question does not give a sample size

# Set significance level (alpha)
alpha = 0.01

# Calculate the pooled standard deviation
pooled_std = np.sqrt(((sample_A_size - 1) * sample_A_std ** 2 + (sample_B_size - 1) * sample_B_std ** 2) / (sample_A_size + sample_B_size - 2))

# Calculate the t-statistic
t_statistic = (sample_A_mean - sample_B_mean) / (pooled_std * np.sqrt((1 / sample_A_size) + (1 / sample_B_size)))

# Calculate degrees of freedom
degrees_of_freedom = sample_A_size + sample_B_size - 2

# Calculate the p-value for a two-tailed t-test
p_value = 2 * (1 - stats.t.cdf(abs(t_statistic), df=degrees_of_freedom))

# Check if p-value is less than alpha to reject the null hypothesis
if p_value < alpha:
    print("Reject the null hypothesis. The two teaching methods have a significant difference in student performance.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in student performance between the two teaching methods.")


Fail to reject the null hypothesis. There is no significant difference in student performance between the two teaching methods.
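
The manual calculation can be cross-checked with SciPy's `ttest_ind_from_stats`, which runs a two-sample t-test directly from summary statistics. As above, the sample sizes of 30 and 35 are assumptions because the question does not provide them. The sketch also runs Welch's version (`equal_var=False`), which does not assume equal population variances.

```python
from scipy import stats

# Pooled-variance t-test from summary statistics (sample sizes are assumed)
pooled = stats.ttest_ind_from_stats(
    mean1=85, std1=6, nobs1=30,
    mean2=82, std2=5, nobs2=35,
    equal_var=True,
)

# Welch's t-test: same inputs, without the equal-variance assumption
welch = stats.ttest_ind_from_stats(
    mean1=85, std1=6, nobs1=30,
    mean2=82, std2=5, nobs2=35,
    equal_var=False,
)

alpha = 0.01
for name, result in (("Pooled", pooled), ("Welch", welch)):
    decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
    print(f"{name}: t = {result.statistic:.3f}, p = {result.pvalue:.4f} -> {decision}")
```

Both versions give a p-value above 0.01 for these inputs, so the conclusion does not hinge on the equal-variance assumption. It does depend on the assumed sample sizes: larger groups with the same means and standard deviations would produce a smaller p-value.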


Q13. A population has a mean of 60 and a standard deviation of 8. A sample of 50 observations has a mean
of 65. Calculate the 90% confidence interval for the true population mean.

In [9]:
import numpy as np
from scipy.stats import norm

# Given values
sample_mean = 65
population_mean = 60  # Stated in the question; not needed to build the interval
population_std = 8
sample_size = 50

# Set confidence level (e.g., 90%)
confidence_level = 0.90

# Calculate standard error of the sample mean
standard_error = population_std / np.sqrt(sample_size)

# The population standard deviation is known, so the critical value comes
# from the standard normal distribution (z) rather than the t-distribution
critical_value = norm.ppf(1 - (1 - confidence_level) / 2)

# Calculate margin of error
margin_of_error = critical_value * standard_error

# Calculate confidence interval
lower = sample_mean - margin_of_error
upper = sample_mean + margin_of_error

print("Sample Mean:", sample_mean)
print(f"Confidence Interval: ({lower:.2f}, {upper:.2f})")


Sample Mean: 65
Confidence Interval: (63.14, 66.86)


Q14. In a study of the effects of caffeine on reaction time, a sample of 30 participants had an average
reaction time of 0.25 seconds with a standard deviation of 0.05 seconds. Conduct a hypothesis test to
determine if the caffeine has a significant effect on reaction time at a 90% confidence level using a t-test.

In [10]:
import numpy as np
from scipy import stats

# Given sample data
sample_mean = 0.25  # Average reaction time
sample_std = 0.05  # Standard deviation
sample_size = 30  # Sample size

# Population parameters under the null hypothesis (no effect of caffeine)
# Assumption: the question gives no baseline, so 0.28 seconds is a
# hypothetical mean reaction time without caffeine
population_mean_null = 0.28

# Set confidence level (e.g., 90%)
confidence_level = 0.90

# Calculate t-statistic
t_statistic = (sample_mean - population_mean_null) / (sample_std / np.sqrt(sample_size))

# Calculate degrees of freedom
degrees_of_freedom = sample_size - 1

# Calculate p-value for a one-tailed t-test (lower tail)
p_value = stats.t.cdf(t_statistic, df=degrees_of_freedom)

# Check if p-value is less than alpha to reject the null hypothesis
if p_value < (1 - confidence_level):
    print("Reject the null hypothesis. Caffeine has a significant effect on reaction time.")
else:
    print("Fail to reject the null hypothesis. There is no significant evidence of caffeine's effect on reaction time.")


Reject the null hypothesis. Caffeine has a significant effect on reaction time.
