## Q1: What is the difference between a t-test and a z-test? Provide an example scenario where you would use each type of test.

The t-test and z-test are both used to test hypotheses about a mean. The main difference between them lies in what is known about the population variance: the z-test assumes the population variance is known, while the t-test estimates it from the sample and uses Student's t-distribution to account for the extra uncertainty.


A t-test is used when the population variance is unknown, which matters most when the sample is small (roughly fewer than 30 observations). For example, if a researcher wants to test whether a new treatment reduces blood pressure, they can collect a sample of patients and compare the mean blood pressure before and after the treatment using a paired t-test.


A z-test is used when the population variance is known, or when the sample is large (roughly 30 or more) so that the sample variance estimates it closely. For example, if a company wants to test whether a new advertising campaign increased sales, and the standard deviation of sales is known from historical data, it can collect a large sample of customers and compare the mean sales before and after the campaign using a z-test.


In summary, the t-test is used when the population variance is unknown and must be estimated from the sample, while the z-test is used when the population variance is known or the sample is large enough for the normal approximation to hold.
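The practical gap between the two tests can be seen by comparing their critical values. The sketch below assumes a two-tailed test at α = 0.05 and a few illustrative sample sizes (neither is taken from the text above); it shows the t critical value shrinking toward the z critical value as the sample grows.

```python
import scipy.stats as stats

alpha = 0.05  # assumed two-tailed significance level

# The z critical value does not depend on the sample size
z_crit = stats.norm.ppf(1 - alpha / 2)

# t critical values for a few illustrative sample sizes
sample_sizes = [5, 15, 30, 100, 1000]
t_crits = {n: stats.t.ppf(1 - alpha / 2, df=n - 1) for n in sample_sizes}

print(f"z critical value: {z_crit:.3f}")
for n, t_crit in t_crits.items():
    print(f"n = {n:4d}: t critical value = {t_crit:.3f}")
```

For small samples the t-test demands a noticeably larger test statistic before rejecting, which is the price paid for estimating the variance from the data.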

## Z-Test

In [1]:
import math
import scipy.stats as stats

# sample data
sample = [72, 85, 90, 78, 82, 92, 87, 75, 80, 84, 76, 79, 88, 91, 83, 85, 81, 74, 77, 86]

# population parameters
pop_mean = 80
pop_sd = 10

# sample statistics
n = len(sample)
sample_mean = sum(sample) / n
sample_sd = math.sqrt(sum((x - sample_mean) ** 2 for x in sample) / (n - 1))

# calculate z-statistic and p-value
z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))
p_value = stats.norm.sf(abs(z)) * 2

print("Z-statistic: ", z)
print("P-value: ", p_value)


Z-statistic:  1.0062305898749053
P-value:  0.31430466047385397


## T-Test

In [2]:
import math
import scipy.stats as stats

# sample data
sample = [8.5, 10.1, 9.7, 8.9, 9.6, 9.3, 10.5, 10.4, 9.1, 9.8, 9.7, 9.9, 9.2, 8.8, 10.2, 9.4, 9.6, 9.9, 9.7, 8.7, 10.6, 9.3, 10.3, 9.2, 9.6]

# hypothesized population mean
pop_mean = 10

# sample statistics
n = len(sample)
sample_mean = sum(sample) / n
sample_sd = math.sqrt(sum((x - sample_mean) ** 2 for x in sample) / (n - 1))

# calculate t-statistic and p-value
t = (sample_mean - pop_mean) / (sample_sd / math.sqrt(n))
p_value = stats.t.sf(abs(t), n - 1) * 2

print("T-statistic: ", t)
print("P-value: ", p_value)


T-statistic:  -3.521803625302499
P-value:  0.001745569829291585


## Q2: Differentiate between one-tailed and two-tailed tests.

One-tailed and two-tailed tests are different types of hypothesis tests used in statistics.

In a one-tailed test, the null hypothesis is rejected only if the sample statistic falls in one pre-specified tail of the sampling distribution, either the upper or the lower. The alternative hypothesis specifies the direction of the effect: an increase or a decrease in the population parameter being tested. For example, a one-tailed test could be used to test whether the mean weight of students in a particular school is less than a certain value, where the alternative hypothesis is that the mean weight is less than the value.

In a two-tailed test, the null hypothesis is rejected if the sample statistic falls in either tail of the sampling distribution. The alternative hypothesis is that the population parameter differs from the hypothesized value in either direction. For example, a two-tailed test could be used to test whether the mean height of students in a particular school is different from a certain value, where the alternative hypothesis is that the mean height is either greater or less than the value.

The choice of a one-tailed or two-tailed test depends on the research question and the direction of the effect being investigated.
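The distinction can be made concrete by computing the p-value for each kind of alternative hypothesis from the same test statistic. This is a minimal sketch for a z-statistic using only the standard library; the function name `z_test_p_values` and the value z = 1.8 are hypothetical and chosen for illustration.

```python
from statistics import NormalDist

def z_test_p_values(z):
    """Return left-tailed, right-tailed, and two-tailed p-values for a z-statistic."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    left = std_normal.cdf(z)          # H1: parameter is less than the null value
    right = 1 - std_normal.cdf(z)     # H1: parameter is greater than the null value
    two_sided = 2 * min(left, right)  # H1: parameter differs from the null value
    return left, right, two_sided

# Hypothetical test statistic, chosen only for illustration
left, right, two_sided = z_test_p_values(1.8)
print(f"left-tailed:  {left:.4f}")
print(f"right-tailed: {right:.4f}")
print(f"two-tailed:   {two_sided:.4f}")
```

With z = 1.8, the right-tailed test rejects at α = 0.05 while the two-tailed test does not, which is why the direction of the test should be fixed before looking at the data.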


The Python code for a one-tailed or two-tailed test depends on the hypothesis being tested and on the type of test being used (t-test, z-test, and so on). Here is an example of a two-tailed t-test in Python:

Suppose we want to test if the mean height of a population of students is significantly different from 68 inches. We take a sample of 30 students and find the sample mean height to be 70 inches with a sample standard deviation of 2.5 inches. We can perform a two-tailed t-test in Python as follows:

In [13]:
import math
import scipy.stats as stats

# hypothesized population mean under the null hypothesis
mu = 68

# sample summary statistics
sample_mean = 70
sample_std = 2.5
n = 30

# ttest_1samp needs the raw observations, so compute the t-statistic
# and the two-tailed p-value directly from the summary statistics
t_statistic = (sample_mean - mu) / (sample_std / math.sqrt(n))
p_value = 2 * stats.t.sf(abs(t_statistic), df=n - 1)

# print the results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

## Q4: Explain Bayes's theorem with an example.

Bayes's theorem is a fundamental theorem in probability theory that describes the probability of an event based on prior knowledge of conditions that might be related to the event. The theorem is named after Thomas Bayes, an 18th-century British statistician and minister.

The formula for Bayes's theorem is as follows:

P(A|B) = P(B|A) * P(A) / P(B)

where:

P(A|B) is the probability of event A given event B
P(B|A) is the probability of event B given event A
P(A) is the prior probability of event A
P(B) is the marginal (overall) probability of event B

Here is an example to illustrate Bayes's theorem:

Suppose that a certain medical test is used to diagnose a disease, and the test has a false positive rate of 5% and a false negative rate of 2%. The disease occurs in 1% of the population. A person takes the test and the result is positive. What is the probability that the person actually has the disease?

We can use Bayes's theorem to solve this problem as follows:

Let A be the event of having the disease, and B be the event of testing positive for the disease. Then we have:

P(A) = 0.01 (prior probability of having the disease)
P(B|A) = 0.98 (probability of testing positive given that the person has the disease)
P(B|not A) = 0.05 (probability of testing positive given that the person does not have the disease)

We want to find P(A|B), the probability of having the disease given that the person tested positive. Using Bayes's theorem, we have:

P(A|B) = P(B|A) * P(A) / P(B)

P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
= 0.98 * 0.01 + 0.05 * 0.99
= 0.0098 + 0.0495
= 0.0593

P(A|B) = P(B|A) * P(A) / P(B)
= 0.98 * 0.01 / 0.0593
= 0.1653

Therefore, the probability that the person actually has the disease given a positive test result is about 16.53%.
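The same calculation can be checked in code. This is a minimal sketch; the function name `posterior_given_positive` and its parameter names are hypothetical, and the inputs are the prevalence, false negative rate, and false positive rate stated in the example.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes's theorem."""
    # Law of total probability: overall chance of a positive result
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# 1% prevalence, 2% false negatives (98% sensitivity), 5% false positives
posterior = posterior_given_positive(prior=0.01, sensitivity=0.98, false_positive_rate=0.05)
print(f"P(disease | positive) = {posterior:.4f}")
```

The posterior stays low despite the accurate test because the disease is rare: false positives among the healthy 99% outnumber true positives among the sick 1%.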



## Q5: What is a confidence interval? How to calculate the confidence interval, explain with an example.

A confidence interval is a range of values calculated from a sample of data that is likely to contain an unknown population parameter with a certain level of confidence. It provides a range of values within which the true population parameter is expected to lie.

To calculate a confidence interval, one needs to know the sample size, sample mean, and sample standard deviation. The confidence level also needs to be specified: it is the long-run proportion of intervals, built this way from repeated samples, that would contain the true population parameter. The formula for calculating a confidence interval for a population mean is:

Confidence interval = sample mean ± (critical value x standard error)

where the critical value is obtained from a t-distribution or z-distribution based on the sample size and desired confidence level, and the standard error is calculated as the sample standard deviation divided by the square root of the sample size.

For example, suppose a random sample of 50 people was taken from a population and their average height was found to be 175 cm with a standard deviation of 8 cm. We want to calculate a 95% confidence interval for the true population mean height.

Using a t-distribution (since the population standard deviation is unknown), with 49 degrees of freedom (n-1), we obtain the critical value of 2.010. The standard error can be calculated as:

Standard error = 8 / sqrt(50) = 1.13

Thus, the confidence interval can be calculated as:

175 ± (2.010 x 1.13) = 175 ± 2.27 = (172.73, 177.27)

Therefore, we can be 95% confident that the true population mean height lies between 172.73 cm and 177.27 cm based on this sample.

In [30]:
import numpy as np
import scipy.stats as stats

# Generate sample data
np.random.seed(123)
sample = np.random.normal(loc=10, scale=2, size=100)

# Calculate sample mean and standard deviation
sample_mean = np.mean(sample)
sample_std = np.std(sample, ddof=1)

# Calculate standard error
std_error = sample_std / np.sqrt(len(sample))

# Calculate margin of error
margin_of_error = stats.t.ppf(0.975, len(sample)-1) * std_error

# Calculate confidence interval
conf_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print("Sample mean:", sample_mean)
print("Standard error:", std_error)
print("Margin of error:", margin_of_error)
print("95% confidence interval:", conf_interval)

Sample mean: 10.05421814698072
Standard error: 0.22678486750723909
Margin of error: 0.4499903784535145
95% confidence interval: (9.604227768527204, 10.504208525434235)


## Q6. Use Bayes' Theorem to calculate the probability of an event occurring given prior knowledge of the event's probability and new evidence. Provide a sample problem and solution.


Problem: A factory produces two types of widgets, type A and type B, in the ratio of 3:2. The probability that a type A widget is defective is 0.1, while the probability that a type B widget is defective is 0.2. A widget is randomly selected from the factory and found to be defective. What is the probability that it is a type A widget?

Solution:

Let's first define the events we are working with:

A = the widget is of type A
B = the widget is of type B
D = the widget is defective

We are given that the factory produces widgets in the ratio of 3:2, so:

P(A) = 3/5
P(B) = 2/5

We are also given the probability of a defective widget for each type:

P(D|A) = 0.1
P(D|B) = 0.2

We want to find the probability that the defective widget is of type A:

P(A|D) = ?

Using Bayes' Theorem, we have:

P(A|D) = P(D|A) * P(A) / P(D)

We can calculate the denominator, P(D), using the law of total probability:

P(D) = P(D|A) * P(A) + P(D|B) * P(B)

Plugging in the values we have, we get:

P(D) = 0.1 * 3/5 + 0.2 * 2/5
= 0.14

Now we can substitute back into the Bayes' Theorem equation:

P(A|D) = 0.1 * 3/5 / 0.14
= 0.06 / 0.14
= 0.429

So the probability that the defective widget is of type A is approximately 0.429, or 42.9%.
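The solution generalizes to any number of competing hypotheses: multiply each prior by its likelihood, then normalize. The sketch below uses exact fractions to avoid rounding; the helper name `posteriors` is hypothetical, and the inputs are the production ratio and defect rates given in the problem.

```python
from fractions import Fraction

def posteriors(priors, likelihoods):
    """Return P(hypothesis | evidence) for every hypothesis."""
    # Joint probability of each hypothesis together with the evidence
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # law of total probability
    return {h: joint[h] / evidence for h in joint}

# Widget example: production ratio 3:2, defect rates 0.1 and 0.2
priors = {"A": Fraction(3, 5), "B": Fraction(2, 5)}
likelihoods = {"A": Fraction(1, 10), "B": Fraction(1, 5)}

result = posteriors(priors, likelihoods)
for widget_type, prob in result.items():
    print(f"P({widget_type} | defective) = {prob} = {float(prob):.4f}")
```

Because the posteriors are normalized by the same denominator, they always sum to 1, which is a quick sanity check on any hand calculation.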



## Q7. Calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation of 5. Interpret the results.

To calculate the 95% confidence interval, we need to use the following formula:

Confidence interval = sample mean ± (critical value x standard error)

Here, the critical value is the value taken from the standard normal distribution that corresponds to the desired level of confidence (in this case, 95%). We can find this value using a z-score table or a calculator. For a 95% confidence level, the critical value is 1.96.

The standard error can be calculated as the standard deviation divided by the square root of the sample size:

standard error = standard deviation / √n

Substituting the given values in the formula, we get:

standard error = 5 / √n

The problem does not state a sample size, so assuming a sample size of 30, we get:

standard error = 5 / √30 = 0.9129

Therefore, the 95% confidence interval is:

50 ± (1.96 x 0.9129) = 50 ± 1.79 = (48.21, 51.79)

Interpretation: We are 95% confident that the true population mean lies between 48.21 and 51.79. This means that if we were to take many samples and calculate the confidence interval for each one, about 95% of those intervals would contain the true population mean.
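The formula can be applied directly to the summary statistics. This is a minimal standard-library sketch; the function name `z_confidence_interval` is hypothetical, and n = 30 is the assumed sample size used in the calculation above, not a value given in the question.

```python
import math
from statistics import NormalDist

def z_confidence_interval(mean, sd, n, confidence=0.95):
    """Normal-approximation confidence interval for a mean from summary statistics."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # about 1.96 for 95%
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin

# Mean and standard deviation from the question; n = 30 is an assumption
lower, upper = z_confidence_interval(mean=50, sd=5, n=30)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Since the standard deviation here comes from the sample, a t critical value would give a slightly wider interval; the difference is small at n = 30 and grows as the sample shrinks.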


In [32]:
import numpy as np
import scipy.stats as stats

# sample data
data = [52, 47, 48, 50, 49, 53, 50, 51, 49, 51]

# calculate sample mean and standard deviation
sample_mean = np.mean(data)
sample_std = np.std(data, ddof=1)

# calculate t-value for 95% confidence interval with n-1 degrees of freedom
t_value = stats.t.ppf(0.975, df=len(data)-1)

# calculate margin of error
margin_of_error = t_value * sample_std / np.sqrt(len(data))

# calculate lower and upper bounds of confidence interval
lower_bound = sample_mean - margin_of_error
upper_bound = sample_mean + margin_of_error

# print the results
print("Sample Mean:", sample_mean)
print("Sample Standard Deviation:", sample_std)
print("Margin of Error:", margin_of_error)
print("Lower Bound:", lower_bound)
print("Upper Bound:", upper_bound)


Sample Mean: 50.0
Sample Standard Deviation: 1.8257418583505538
Margin of Error: 1.3060570468577517
Lower Bound: 48.69394295314225
Upper Bound: 51.30605704685775


## Q8. What is the margin of error in a confidence interval? How does sample size affect the margin of error? Provide an example of a scenario where a larger sample size would result in a smaller margin of error.

The margin of error is a measure of the precision of an estimated population parameter, such as the population mean or proportion, based on a sample statistic. It is the maximum expected difference between the true population parameter and the sample estimate, given a specified level of confidence. It is usually represented as a range of values above and below the sample estimate.

The margin of error can be calculated using the following formula:
Margin of error = z * (standard deviation / sqrt(sample size))

Where z is the z-score corresponding to the desired level of confidence, standard deviation is the population standard deviation (if known) or the sample standard deviation (if unknown), and sample size is the size of the sample.

Sample size affects the margin of error in that the larger the sample size, the smaller the margin of error, given all other factors are constant. This is because a larger sample size reduces the standard error of the sample mean, making the sample estimate more accurate and closer to the true population parameter.

For example, suppose we want to estimate the proportion of adults in a city who support a new policy proposal. We take a random sample of 200 adults and find that 120 of them support the proposal. We want to calculate a 95% confidence interval for the true proportion of supporters in the population.

Assuming that the sample is representative of the population, we can use the sample proportion p = 120/200 = 0.6 to calculate the margin of error:
Margin of error = 1.96 * sqrt((0.6 * 0.4) / 200) ≈ 0.068 or 6.8%

This means that we can be 95% confident that the true proportion of supporters in the population falls within the range of 0.532 (120/200 - 0.068) to 0.668 (120/200 + 0.068).

Suppose we want to decrease the margin of error to 4%. To achieve this, we need to increase the sample size. Assuming all other factors are constant, we can use the following formula to calculate the required sample size:
Sample size = (z * standard deviation / margin of error)²

If we use a z-score of 1.96, a standard deviation of 0.5 (assuming maximum variability), and a margin of error of 0.04, we get:
Sample size = (1.96 * 0.5 / 0.04)² ≈ 600.25 or 601 (rounded up)

This means that we need to increase the sample size from 200 to at least 601 to achieve a 95% confidence interval with a margin of error of 4%.
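The relationship between sample size and margin of error can be explored numerically. The sketch below uses the normal approximation for a proportion; the function names are hypothetical, the sample sizes 800 and 3200 are illustrative, and p = 0.5 is the worst-case assumption used in the sample-size calculation above.

```python
import math
from statistics import NormalDist

def proportion_margin(p_hat, n, confidence=0.95):
    """Margin of error for a sample proportion (normal approximation)."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def required_sample_size(margin, p=0.5, confidence=0.95):
    """Smallest n whose margin of error is at most `margin`; p = 0.5 is the worst case."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Survey example: 120 supporters out of 200, then larger hypothetical samples
for n in (200, 800, 3200):
    print(f"n = {n:4d}: margin of error = {proportion_margin(0.6, n):.3f}")

print("n needed for a 4% margin:", required_sample_size(0.04))
```

Because the margin of error scales with 1/sqrt(n), quadrupling the sample size only halves the margin, so tight margins become expensive quickly.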


## Q9. Calculate the z-score for a data point with a value of 75, a population mean of 70, and a population standard deviation of 5. Interpret the results.

In [36]:
import scipy.stats as stats

x = 75 # data point value
P_mean = 70 # population mean
std = 5 # population standard deviation

z = (x - P_mean) / std

print("The z-score is:", z)


The z-score is: 1.0
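A z-score of 1.0 means the data point lies exactly one standard deviation above the population mean. If the population is also assumed to be normally distributed (the question does not say so), the z-score can be turned into a percentile, as in this minimal standard-library sketch:

```python
from statistics import NormalDist

x, pop_mean, pop_sd = 75, 70, 5  # values from the question

z = (x - pop_mean) / pop_sd
# Share of a normal population that falls below the data point
percentile = NormalDist(mu=pop_mean, sigma=pop_sd).cdf(x)

print(f"z-score: {z}")
print(f"Proportion of values below {x}: {percentile:.4f}")
```

Under the normality assumption, roughly 84% of the population falls below a value of 75.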


## Q10. In a study of the effectiveness of a new weight loss drug, a sample of 50 participants lost an average of 6 pounds with a standard deviation of 2.5 pounds. Conduct a hypothesis test to determine if the drug is significantly effective at a 95% confidence level using a t-test.

To conduct a hypothesis test, we need to state the null and alternative hypotheses:

Null hypothesis (H0): The weight loss drug is not effective, i.e. the population mean weight loss is 0 (μ = 0).
Alternative hypothesis (Ha): The weight loss drug is effective, i.e. the population mean weight loss is greater than 0 (μ > 0).

We will use a one-sample t-test with the following parameters:

Sample size (n) = 50
Sample mean (x̄) = 6
Sample standard deviation (s) = 2.5
Degrees of freedom (df) = n - 1 = 49
Level of significance (α) = 0.05
Direction of the alternative hypothesis: one-tailed (greater than)

We will first calculate the t-statistic using the formula:

t = (x̄ - μ) / (s / sqrt(n))

where μ is the population mean, which is assumed to be 0 under the null hypothesis.

t = (6 - 0) / (2.5 / sqrt(50))
t = 16.97

Next, we will find the critical t-value using a t-distribution table or the t.ppf() function in Python. Since we have a one-tailed test with α = 0.05 and df = 49, the critical t-value is:

t_critical = stats.t.ppf(1 - 0.05, 49)
t_critical = 1.677

Since the calculated t-statistic (16.97) is greater than the critical t-value (1.677), we reject the null hypothesis and conclude that the weight loss drug is significantly effective at the 0.05 significance level (95% confidence).

We can also calculate the one-tailed p-value using the t.sf() function in Python:

p_value = stats.t.sf(t, df)

The resulting p-value is effectively zero (far below 10^-15), much smaller than the level of significance (α = 0.05), which further supports the rejection of the null hypothesis.


In [43]:
import numpy as np
from scipy.stats import t

# Define the sample size, sample mean, and sample standard deviation
n = 50
sample_mean = 6
sample_std = 2.5

# Define the null hypothesis mean and alpha level
null_mean = 0
alpha = 0.05

# Calculate the t-statistic
t_stat = (sample_mean - null_mean) / (sample_std / np.sqrt(n))

# Calculate the degrees of freedom
df = n - 1

# Calculate the critical t-value for a one-tailed (greater than) test
t_crit = t.ppf(1 - alpha, df)

# Determine if the null hypothesis is rejected
if t_stat > t_crit:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")


Reject the null hypothesis


## Q11. In a survey of 500 people, 65% reported being satisfied with their current job. Calculate the 95% confidence interval for the true proportion of people who are satisfied with their job.

To calculate the 95% confidence interval for the true proportion of people who are satisfied with their job, we can use the formula:

Confidence interval = sample proportion ± z* (sqrt(sample proportion * (1 - sample proportion) / sample size))

Where z* is the z-score for the desired level of confidence, which is 1.96 for 95% confidence interval.

Substituting the values given in the problem, we get:

Confidence interval = 0.65 ± 1.96 * (sqrt(0.65 * 0.35 / 500))

Confidence interval = 0.65 ± 0.042

Confidence interval = (0.608, 0.692)

Therefore, we can say with 95% confidence that the true proportion of people who are satisfied with their job falls within the range of 0.608 to 0.692.

In [53]:
import statsmodels.stats.proportion as proportion

# sample size
n = 500
# number of successes (65% of 500)
x = 325
# 95% confidence interval for a single proportion, using the normal approximation
conf_int = proportion.proportion_confint(count=x, nobs=n, alpha=0.05, method='normal')

print(f"95% confidence interval: ({conf_int[0]:.3f}, {conf_int[1]:.3f})")

95% confidence interval: (0.608, 0.692)

## Q12. A researcher is testing the effectiveness of two different teaching methods on student performance. Sample A has a mean score of 85 with a standard deviation of 6, while sample B has a mean score of 82 with a standard deviation of 5. Conduct a hypothesis test to determine if the two teaching methods have a significant difference in student performance using a t-test with a significance level of 0.01.

In [50]:
import numpy as np
import scipy.stats as stats

# Sample A
sample_a = [85, 79, 91, 88, 83, 92, 80, 89, 85, 87]

# Sample B
sample_b = [82, 77, 85, 81, 79, 88, 75, 84, 82, 80]

# Calculate the t-statistic and two-tailed p-value
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)

# Print the results
print("t-statistic: ", t_stat)
print(f"p-value: {p_value:.4f}") # two-tailed test for a difference in either direction

# Determine if the null hypothesis should be rejected
alpha = 0.01
if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")


t-statistic:  2.5070198473190253
p-value: 0.0220
Fail to reject the null hypothesis
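The cell above runs the test on illustrative raw scores rather than on the summary statistics stated in the question. As a sketch of the alternative, `scipy.stats.ttest_ind_from_stats` can work from the means and standard deviations directly. The question does not give the sample sizes, so `n_a = n_b = 30` below is a hypothetical choice, and Welch's unequal-variance version is an assumption rather than something the question specifies.

```python
import scipy.stats as stats

n_a = n_b = 30  # hypothetical sample sizes; the question does not give them

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=85, std1=6, nobs1=n_a,
    mean2=82, std2=5, nobs2=n_b,
    equal_var=False,  # Welch's t-test, which does not assume equal variances
)

alpha = 0.01
print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")
print("Reject the null hypothesis" if p_value < alpha else "Fail to reject the null hypothesis")
```

The conclusion depends on the assumed sample sizes: with the same means and standard deviations, larger samples shrink the standard error and can push the p-value below 0.01.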


## Q13. A population has a mean of 60 and a standard deviation of 8. A sample of 50 observations has a mean of 65. Calculate the 90% confidence interval for the true population mean.

To calculate the 90% confidence interval for the true population mean, we can use the following formula:

Confidence Interval = sample mean ± z*(standard deviation / sqrt(n))

Where:

sample mean = 65
standard deviation = 8
n = 50
z-score for 90% confidence level = 1.645 (from z-table or calculator)

Plugging in the values, we get:

Confidence Interval = 65 ± 1.645*(8 / sqrt(50))
Confidence Interval = 65 ± 1.861

Therefore, the 90% confidence interval for the true population mean is (63.139, 66.861). We can interpret this result as: we are 90% confident that the true population mean falls between 63.139 and 66.861.



In [52]:
import scipy.stats as stats
import math

# Given variables
pop_mean = 60
pop_stddev = 8
sample_mean = 65
n = 50
conf_level = 0.90

# Calculate the standard error
std_error = pop_stddev / math.sqrt(n)

# The population standard deviation is known, so use the z critical value
z_value = stats.norm.ppf((1 + conf_level) / 2)

# Calculate the lower and upper bounds of the confidence interval
lower_bound = sample_mean - z_value * std_error
upper_bound = sample_mean + z_value * std_error

# Print the result
print(f"The 90% confidence interval for the true population mean is [{lower_bound:.3f}, {upper_bound:.3f}]")


The 90% confidence interval for the true population mean is [63.139, 66.861]


## Q14. In a study of the effects of caffeine on reaction time, a sample of 30 participants had an average reaction time of 0.25 seconds with a standard deviation of 0.05 seconds. Conduct a hypothesis test to determine if the caffeine has a significant effect on reaction time at a 90% confidence level using a t-test.

To conduct a hypothesis test, we need to set up our null and alternative hypotheses:

Null hypothesis (H0): The mean reaction time for participants who consume caffeine is equal to the baseline mean reaction time without caffeine.
Alternative hypothesis (Ha): The mean reaction time for participants who consume caffeine is different from the baseline mean reaction time without caffeine.

We will use a two-tailed one-sample t-test since we are testing for a difference in either direction. We will use a significance level of 0.1. The problem does not state the baseline mean, so for illustration we assume μ = 0.26 seconds.

First, we need to calculate the t-statistic:

t = (x̄ - μ) / (s / sqrt(n))

where x̄ is the sample mean, μ is the population mean under the null hypothesis, s is the sample standard deviation, and n is the sample size.

t = (0.25 - 0.26) / (0.05 / sqrt(30))
t = -1.095

Next, we need to find the critical t-value from the t-distribution table with 29 degrees of freedom and a tail area of 0.1/2 = 0.05 (since we are conducting a two-tailed test). The critical t-value is approximately ±1.699.

Since the absolute value of our calculated t-statistic (1.095) is less than the critical t-value of 1.699, we fail to reject the null hypothesis. Under the assumed baseline, this sample does not show a significant effect of caffeine on reaction time at a 90% confidence level.

In [54]:
import scipy.stats as stats

# Sample data
n = 30
sample_mean = 0.25
sample_std = 0.05

# The problem gives no baseline, so assume the mean reaction time without caffeine is 0.26 seconds
# Null hypothesis: the population mean reaction time with caffeine is 0.26 seconds
# Alternative hypothesis: the population mean reaction time with caffeine differs from 0.26 seconds
null_mean = 0.26
alpha = 0.1 # significance level for a 90% confidence level

# Calculate the t-value and two-tailed p-value
t_value = (sample_mean - null_mean) / (sample_std / (n**0.5))
p_value = 2 * stats.t.sf(abs(t_value), df=n-1)

# Determine whether to reject or fail to reject the null hypothesis based on the p-value
if p_value < alpha:
    print("Reject null hypothesis. The caffeine has a significant effect on reaction time.")
else:
    print("Fail to reject null hypothesis. There is insufficient evidence to suggest that caffeine has a significant effect on reaction time.")


Fail to reject null hypothesis. There is insufficient evidence to suggest that caffeine has a significant effect on reaction time.
