Q1: What is Estimation Statistics? Explain point estimate and interval estimate?

Estimation statistics is a branch of statistics that deals with estimating population parameters, such as the population mean or population proportion, from sample data. It involves making inferences about population characteristics based on sample data, using either point estimates or interval estimates.

Point estimate refers to a single value that is used to estimate a population parameter. For example, the sample mean is a point estimate of the population mean, while the sample proportion is a point estimate of the population proportion. Point estimates are often calculated as the sample statistic that best represents the population parameter. However, they do not provide any information about the uncertainty or variability associated with the estimate.

Interval estimate, on the other hand, provides a range of plausible values for a population parameter. It is also called a confidence interval. A confidence interval is constructed based on the sample statistic and its associated standard error. The interval estimate consists of two values, an upper bound and a lower bound, which define the range of values that is likely to contain the true population parameter with a certain level of confidence. For example, a 95% confidence interval for the population mean would provide a range of values within which the population mean is estimated to lie with a 95% level of confidence.

Interval estimates are more informative than point estimates because they provide a measure of the uncertainty associated with the estimate. They allow us to quantify the level of confidence we have in our estimate and determine the precision of our estimate. There is a trade-off, however: for a fixed sample, a higher confidence level requires a wider interval, so greater confidence comes at the cost of a less precise range.

Q2. Write a Python function to estimate the population mean using a sample mean and standard
deviation?

In [3]:
import math

def estimate_population_mean(sample_mean, sample_std_dev, sample_size):
    """
    Returns a 95% confidence interval (lower_bound, upper_bound) for the
    population mean, given the sample mean, sample standard deviation and
    sample size. Uses the normal critical value 1.96, which suits large samples.
    """
    standard_error = sample_std_dev / math.sqrt(sample_size)
    margin_of_error = 1.96 * standard_error  # z critical value for 95% confidence
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error
    return (lower_bound, upper_bound)
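
The 1.96 multiplier above is the normal critical value. For smaller samples a t critical value is usually preferred. The following is a minimal sketch of that variant, assuming SciPy is installed; the function name `t_confidence_interval`, its `confidence` parameter and the example numbers are illustrative and not part of the original answer.

```python
import math
from scipy import stats

def t_confidence_interval(sample_mean, sample_std_dev, sample_size, confidence=0.95):
    """Confidence interval for the mean using a t critical value (hypothetical helper)."""
    standard_error = sample_std_dev / math.sqrt(sample_size)
    # Two-sided critical value with n - 1 degrees of freedom
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=sample_size - 1)
    margin = t_crit * standard_error
    return (sample_mean - margin, sample_mean + margin)

# Made-up example values: mean 100, standard deviation 15, sample size 25
lower, upper = t_confidence_interval(100.0, 15.0, 25)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

With only 25 observations the t critical value is noticeably larger than 1.96, so this interval is somewhat wider than the normal-based one.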


Q3: What is Hypothesis testing? Why is it used? State the importance of Hypothesis testing?

Hypothesis testing is a statistical method used to make decisions based on data, by testing whether a hypothesis about a population parameter is supported by the sample data. The process involves stating a null hypothesis and an alternative hypothesis about the population parameter, and then determining the probability of observing sample data at least as extreme as what was collected, assuming that the null hypothesis is true.

Hypothesis testing is used to make inferences about a population based on a sample, and to draw conclusions about the statistical significance of the results. It is used in a variety of fields, including medicine, engineering, business, and social sciences, to test theories and hypotheses and make informed decisions based on data.

The importance of hypothesis testing lies in its ability to provide a structured approach to making decisions based on data. By defining clear hypotheses and test criteria, hypothesis testing allows researchers and decision-makers to make informed decisions based on the evidence. It also helps to reduce the risk of making decisions based on random fluctuations or biased interpretations of the data.

Hypothesis testing can also help to identify relationships between variables and to determine the strength of those relationships. It can be used to determine whether a treatment or intervention has a significant effect, to compare groups or populations, and to test the validity of models or theories.

Overall, hypothesis testing is an important tool for making data-driven decisions, and for advancing scientific knowledge by testing theories and hypotheses.

Q4. Create a hypothesis that states whether the average weight of male college students is greater than
the average weight of female college students?

Null Hypothesis = The average weight of male college students is less than or equal to
the average weight of female college students.
Alternate Hypothesis = The average weight of male college students is greater than
the average weight of female college students.

Q5. Write a Python script to conduct a hypothesis test on the difference between two population means,
given a sample from each population?

In [None]:
import scipy.stats as stats

def two_sample_t_test(sample1, sample2, alpha):
    """
    Conducts a two-sided, two-sample t-test (Welch's, unequal variances)
    on the difference between two population means.
    Returns the test statistic, p-value, and whether the null hypothesis is rejected.
    """
    n1 = len(sample1)
    n2 = len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    # Unbiased sample variances
    var1 = sum((x - mean1) ** 2 for x in sample1) / (n1 - 1)
    var2 = sum((x - mean2) ** 2 for x in sample2) / (n2 - 1)
    # Unpooled standard error of the difference in means
    se = ((var1 / n1) + (var2 / n2)) ** 0.5
    # Welch-Satterthwaite degrees of freedom, consistent with the unpooled standard error
    dof = (var1 / n1 + var2 / n2) ** 2 / (
        (var1 / n1) ** 2 / (n1 - 1) + (var2 / n2) ** 2 / (n2 - 1)
    )
    t_stat = (mean1 - mean2) / se
    p_value = stats.t.sf(abs(t_stat), dof) * 2  # two-sided p-value
    return t_stat, p_value, p_value < alpha
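
SciPy's built-in `ttest_ind` can be used to cross-check a hand-written two-sample t-test. The sketch below uses made-up sample values (not data from this notebook) and `equal_var=False`, which selects Welch's test and does not assume equal population variances.

```python
from scipy import stats

# Made-up illustrative samples
group_a = [12.1, 11.8, 12.5, 12.9, 11.6, 12.3]
group_b = [11.2, 11.5, 10.9, 11.8, 11.1, 11.4]
alpha = 0.05

# equal_var=False selects Welch's t-test (no equal-variance assumption)
result = stats.ttest_ind(group_a, group_b, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue
reject_null = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0: {reject_null}")
```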


Q6: What is a null and alternative hypothesis? Give some examples?

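The null hypothesis (H0) is the default assumption you begin with, usually a statement of no effect or no difference.
The alternate hypothesis (H1) is the claim that contradicts the null hypothesis; it is accepted only when the data give enough evidence against the null hypothesis.
Example: in a criminal trial, the null hypothesis is that the person is innocent and the alternate hypothesis is that the person is guilty.
Another example: H0: a coin is fair (p = 0.5); H1: the coin is not fair (p ≠ 0.5).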

Q7: Write down the steps involved in hypothesis testing?

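step-1 : State the null hypothesis and the alternate hypothesis
step-2 : Choose the significance value (alpha), for example 0.05
step-3 : Find the confidence interval (confidence level = 1 - significance value), which fixes the critical value(s) for the test
step-4 : Compute the test statistic (for example a z-score or t-score) from the sample and find the corresponding p-value
step-5 : If p-value < significance value then we reject the null hypothesis
Else:
    we fail to reject the null hypothesis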
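
The steps above can be sketched in code as follows. This is only an illustration, assuming SciPy is available; the sample values and the hypothesized mean of 49 are made up.

```python
from scipy import stats

# Step 1: H0: population mean = 49, H1: population mean != 49 (illustrative)
hypothesized_mean = 49
sample = [51.2, 49.8, 52.5, 50.9, 53.1, 48.7, 52.0, 51.5]  # made-up data

# Step 2: choose the significance value
alpha = 0.05

# Step 3: the matching confidence level
confidence_level = 1 - alpha

# Step 4: compute the test statistic and its two-sided p-value
t_stat, p_value = stats.ttest_1samp(sample, popmean=hypothesized_mean)

# Step 5: compare the p-value with the significance value
if p_value < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"
print(f"t = {t_stat:.3f}, p = {p_value:.4f}: {decision}")
```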

Q8. Define p-value and explain its significance in hypothesis testing?

In hypothesis testing, the p-value is the probability of observing a test statistic as extreme as or more extreme than the one calculated from the sample data, assuming that the null hypothesis is true. Put simply, it measures the strength of evidence against the null hypothesis.

The significance of the p-value lies in its ability to help researchers and decision-makers determine whether to reject or fail to reject the null hypothesis. Typically, a predetermined significance level (alpha) is chosen, such as 0.05 or 0.01, which represents the maximum probability of a Type I error (rejecting the null hypothesis when it is actually true). If the p-value is less than or equal to the significance level, then the null hypothesis is rejected, and it is concluded that there is evidence to support the alternative hypothesis. If the p-value is greater than the significance level, then the null hypothesis is not rejected, and it is concluded that there is insufficient evidence to support the alternative hypothesis.

The p-value provides a standardized measure of the strength of evidence against the null hypothesis, and it allows researchers and decision-makers to compare the strength of evidence across different studies or analyses. It also helps to reduce the risk of making decisions based on random fluctuations or biased interpretations of the data.

However, it is important to note that the p-value does not provide information about the size or practical significance of the effect. A small p-value does not necessarily indicate a large or important effect, and a large p-value does not necessarily indicate a small or unimportant effect. It is also not a measure of the probability that the alternative hypothesis is true. Therefore, the p-value should be considered in conjunction with other factors, such as effect size, study design, and practical significance, when making decisions based on the results of hypothesis tests.
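
As a small illustration of how a p-value is obtained from a test statistic, the sketch below computes the two-sided p-value of a z statistic using only the Python standard library. The helper name `two_sided_p_value` and the z values are chosen for illustration.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Two-sided p-value for a z test statistic under the standard normal."""
    # P(|Z| >= |z|) = 2 * (1 - Phi(|z|))
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05
for z in (0.5, 1.96, 3.0):
    p = two_sided_p_value(z)
    print(f"z = {z:4.2f}  p = {p:.4f}  reject H0: {p <= alpha}")
```

The further the statistic lies from zero, the smaller the p-value, and z = 1.96 sits almost exactly at the 0.05 threshold.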

Q9. Generate a Student's t-distribution plot using Python's matplotlib library, with the degrees of freedom
parameter set to 10?

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

# Set the degrees of freedom
df = 10

# Generate the x-axis values (t-values)
t_values = np.linspace(-4, 4, 1000)

# Calculate the y-axis values (pdf values)
pdf_values = stats.t.pdf(t_values, df)

# Create the plot
fig, ax = plt.subplots()
ax.plot(t_values, pdf_values, label=f"df = {df}")

# Set the plot title and axis labels
ax.set_title("Student's t-distribution")
ax.set_xlabel("t-values")
ax.set_ylabel("Probability density function")

# Add a legend and display the plot
ax.legend()
plt.show()


Q10. Write a Python program to calculate the two-sample t-test for independent samples, given two
random samples of equal size and a null hypothesis that the population means are equal?

In [None]:
import numpy as np
from scipy.stats import ttest_ind

# Generate two random samples of equal size
sample1 = np.random.normal(10, 2, size=50)
sample2 = np.random.normal(12, 2, size=50)

# Calculate the t-statistic and p-value for the two-sample t-test
t_statistic, p_value = ttest_ind(sample1, sample2)

# Print the results
print(f"Sample 1 mean: {np.mean(sample1)}")
print(f"Sample 2 mean: {np.mean(sample2)}")
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")


Q11: What is Student’s t distribution? When to use the t-Distribution?

Student's t-distribution is a probability distribution that arises in hypothesis testing when the population standard deviation is unknown and must be estimated from a small sample. It is a bell-shaped distribution that is similar in shape to the standard normal distribution but has heavier tails. The shape of the distribution depends on the degrees of freedom, which are determined by the sample size. As the sample size increases, the t-distribution approaches the standard normal distribution.

The t-distribution is used in hypothesis testing when the population standard deviation is unknown and must be estimated from a small sample. In this case, the test statistic is calculated using the sample mean and standard deviation, and it follows the t-distribution when the population is approximately normally distributed. When the sample is small and the population is clearly not normal, the Central Limit Theorem may not apply and t-based results can be unreliable, so a non-parametric test may be a better choice.

The t-distribution is commonly used in statistics for confidence interval estimation, hypothesis testing, and constructing tolerance intervals. It is also used in various applications, including quality control, reliability engineering, and experimental design.
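
To illustrate how the t-distribution approaches the standard normal as the degrees of freedom grow, the sketch below compares two-sided 95% critical values. It assumes SciPy is available; the particular degrees of freedom are arbitrary.

```python
from scipy import stats

# Two-sided 95% critical value of the standard normal (about 1.96)
normal_crit = stats.norm.ppf(0.975)

# Matching t critical values for increasing degrees of freedom
t_crits = {df: stats.t.ppf(0.975, df) for df in (2, 5, 10, 30, 100, 1000)}

for df, crit in t_crits.items():
    print(f"df = {df:>4}: t critical = {crit:.3f}  (normal: {normal_crit:.3f})")
```

The heavier tails at low degrees of freedom show up as larger critical values, which shrink toward the normal value as the sample size increases.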

Q12: What is t-statistic? State the formula for t-statistic?

In statistics, the t-statistic is a measure of the difference between a sample mean and a hypothesized population mean, relative to the variability in the sample. It is used in hypothesis testing to determine whether the difference between the sample mean and the hypothesized population mean is statistically significant.

The formula for the t-statistic is:

t = (x̄ - μ) / (s / √n)

where:

x̄ is the sample mean
μ is the hypothesized population mean
s is the sample standard deviation
n is the sample size

The t-statistic is calculated by subtracting the hypothesized population mean from the sample mean, and then dividing the result by the standard error of the sample mean. The standard error of the sample mean is the standard deviation of the sample divided by the square root of the sample size. The t-statistic therefore measures how many standard errors the sample mean is from the hypothesized population mean. A large absolute value of the t-statistic indicates a larger difference between the sample mean and the hypothesized population mean, relative to the variability in the sample. When the null hypothesis is true and the population is approximately normal, the t-statistic follows a t-distribution with n-1 degrees of freedom, where n is the sample size.
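
The formula translates directly into a small function. This is a minimal sketch; the function name `t_statistic` and the example numbers are illustrative.

```python
import math

def t_statistic(sample_mean, hypothesized_mean, sample_std_dev, sample_size):
    """One-sample t-statistic: (x̄ - μ) / (s / √n)."""
    standard_error = sample_std_dev / math.sqrt(sample_size)
    return (sample_mean - hypothesized_mean) / standard_error

# Made-up example: x̄ = 52, μ = 50, s = 5, n = 25
t = t_statistic(52.0, 50.0, 5.0, 25)
print(t)  # 2.0: the sample mean is two standard errors above 50
```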

Q13. A coffee shop owner wants to estimate the average daily revenue for their shop. They take a random
sample of 50 days and find the sample mean revenue to be $500 with a standard deviation of $50.
Estimate the population mean revenue with a 95% confidence interval?

To estimate the population mean revenue with a 95% confidence interval, we can use the following formula:

CI = x̄ ± tα/2 * (s/√n)

where:

CI is the confidence interval
x̄ is the sample mean revenue ($500)
tα/2 is the t-score associated with a 95% confidence level and 49 degrees of freedom (50 - 1 = 49). Using a t-table, we find tα/2 to be 2.009.
s is the sample standard deviation ($50)
n is the sample size (50)

Plugging in the values, we get:

CI = 500 ± 2.009 * (50/√50)
CI = 500 ± 2.009 * 7.071
CI = 500 ± 14.21

Therefore, the 95% confidence interval for the population mean revenue is (485.79, 514.21). We can be 95% confident that the true population mean revenue falls within this interval.
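
The arithmetic can be checked numerically. The sketch below assumes SciPy is available and uses `stats.t.interval`, passing the confidence level as the first positional argument.

```python
import math
from scipy import stats

sample_mean, sample_std, n = 500, 50, 50  # values from the question
standard_error = sample_std / math.sqrt(n)

# 95% t-based interval with n - 1 = 49 degrees of freedom
lower, upper = stats.t.interval(0.95, n - 1, loc=sample_mean, scale=standard_error)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```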

Q14. A researcher hypothesizes that a new drug will decrease blood pressure by 10 mmHg. They conduct a
clinical trial with 100 patients and find that the sample mean decrease in blood pressure is 8 mmHg with a
standard deviation of 3 mmHg. Test the hypothesis with a significance level of 0.05?

To test the hypothesis that the new drug decreases blood pressure by 10 mmHg, we can use a one-sample t-test. The null and alternative hypotheses are:

H0: μ = 10 (the new drug decreases blood pressure by 10 mmHg on average)
Ha: μ < 10 (the new drug decreases blood pressure by less than 10 mmHg on average)

We will use a significance level of 0.05, which means that we will reject the null hypothesis if the p-value is less than 0.05.

The test statistic for the one-sample t-test is:

t = (x̄ - μ) / (s/√n)

where:

x̄ is the sample mean decrease in blood pressure (8 mmHg)
μ is the hypothesized population mean decrease in blood pressure (10 mmHg)
s is the sample standard deviation (3 mmHg)
n is the sample size (100)

Plugging in the values, we get:

t = (8 - 10) / (3/√100)
t = -2 / 0.3
t = -6.67

Using a t-table with 99 degrees of freedom (100 - 1 = 99), the one-tailed critical value at the 0.05 level is about -1.660, and the p-value is far below 0.001. Since the p-value is less than 0.05, we reject the null hypothesis and conclude that there is sufficient evidence to support the alternative hypothesis that the new drug decreases blood pressure by less than 10 mmHg.
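
The test can also be run from the summary statistics in code. This is a minimal sketch assuming SciPy is available; the variable names are illustrative.

```python
import math
from scipy import stats

x_bar, mu_0, s, n = 8, 10, 3, 100  # values from the question
alpha = 0.05

# One-sample t-statistic from summary statistics
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))

# Left-tailed p-value, matching Ha: mu < 10
p_value = stats.t.cdf(t_stat, df=n - 1)

reject_null = p_value < alpha
print(f"t = {t_stat:.2f}, p = {p_value:.2e}, reject H0: {reject_null}")
```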

Q15. An electronics company produces a certain type of product with a mean weight of 5 pounds and a
standard deviation of 0.5 pounds. A random sample of 25 products is taken, and the sample mean weight
is found to be 4.8 pounds. Test the hypothesis that the true mean weight of the products is less than 5
pounds with a significance level of 0.01?

To test the hypothesis that the true mean weight of the products is less than 5 pounds, we can use a one-tailed t-test with the following null and alternative hypotheses:

Null hypothesis: The true mean weight of the products is equal to 5 pounds.
Alternative hypothesis: The true mean weight of the products is less than 5 pounds.

The significance level is given as 0.01, which means we need to find the critical t-value for a one-tailed test with 24 degrees of freedom and a significance level of 0.01. This can be done using a t-table or a calculator and gives a critical t-value of -2.492.

Next, we need to calculate the test statistic, which is given by:

t = (sample mean - hypothesized mean) / (standard deviation / sqrt(sample size))

Plugging in the values, we get:

t = (4.8 - 5) / (0.5 / sqrt(25)) = -2.0

The test statistic is -2.0, which is greater than the critical t-value of -2.492, so it does not fall in the rejection region. Therefore, we fail to reject the null hypothesis at a significance level of 0.01. Because 0.5 pounds is given as the population standard deviation, a z-test is also appropriate here; its critical value is -2.326 and it leads to the same conclusion.

In other words, we do not have sufficient evidence to conclude that the company is producing products with a mean weight less than 5 pounds.
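
The comparison with the critical value can be checked numerically. The sketch below assumes SciPy is available and computes both the t and the z critical values for a left-tailed test at the 0.01 level.

```python
import math
from scipy import stats

x_bar, mu_0, sd, n = 4.8, 5.0, 0.5, 25  # values from the question
alpha = 0.01

test_stat = (x_bar - mu_0) / (sd / math.sqrt(n))

t_crit = stats.t.ppf(alpha, df=n - 1)  # left-tail t critical value, 24 df
z_crit = stats.norm.ppf(alpha)         # left-tail z critical value

# Reject only if the statistic falls below the critical value
reject_with_t = test_stat < t_crit
reject_with_z = test_stat < z_crit
print(f"statistic = {test_stat:.3f}, t critical = {t_crit:.3f}, z critical = {z_crit:.3f}")
print(f"reject H0 (t): {reject_with_t}, reject H0 (z): {reject_with_z}")
```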

Q16. Two groups of students are given different study materials to prepare for a test. The first group (n1 =
30) has a mean score of 80 with a standard deviation of 10, and the second group (n2 = 40) has a mean
score of 75 with a standard deviation of 8. Test the hypothesis that the population means for the two
groups are equal with a significance level of 0.01?

To test the hypothesis that the population means for the two groups are equal, we can use a two-sample t-test with the following null and alternative hypotheses:

Null hypothesis: The population means for the two groups are equal.
Alternative hypothesis: The population means for the two groups are not equal.

The significance level is given as 0.01, which means we need to find the critical t-value for a two-tailed test with (30 + 40 - 2) = 68 degrees of freedom and 0.01/2 = 0.005 in each tail. This can be done using a t-table or a calculator and gives a critical t-value of approximately ±2.650.

Next, we need to calculate the test statistic, which is given by:

t = (sample mean difference - hypothesized difference) / standard error

where the sample mean difference is the difference between the two sample means, hypothesized difference is the assumed difference under the null hypothesis, and the standard error is given by:

SE = sqrt((s1^2/n1) + (s2^2/n2))

Plugging in the values, we get:

sample mean difference = 80 - 75 = 5
hypothesized difference = 0
s1 = 10, n1 = 30
s2 = 8, n2 = 40

SE = sqrt((10^2/30) + (8^2/40)) = sqrt(3.333 + 1.6) = 2.221

t = (5 - 0) / 2.221 = 2.251

The test statistic is 2.251, which is smaller in absolute value than the critical t-value. Therefore, we fail to reject the null hypothesis and conclude that there is insufficient evidence to suggest that the population means for the two groups are different at a significance level of 0.01.

In other words, we cannot conclude that the study materials provided to the two groups resulted in different test scores.
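
SciPy can run this test directly from the summary statistics with `ttest_ind_from_stats`. The sketch below assumes SciPy is available and uses `equal_var=False` (Welch's test), which matches the unpooled standard error formula but computes its own degrees of freedom rather than n1 + n2 - 2.

```python
from scipy import stats

alpha = 0.01

# Summary statistics from the question
result = stats.ttest_ind_from_stats(
    mean1=80, std1=10, nobs1=30,
    mean2=75, std2=8, nobs2=40,
    equal_var=False,  # Welch's test: unpooled standard error
)

reject_null = result.pvalue < alpha
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}, reject H0: {reject_null}")
```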

Q17. A marketing company wants to estimate the average number of ads watched by viewers during a TV
program. They take a random sample of 50 viewers and find that the sample mean is 4 with a standard
deviation of 1.5. Estimate the population mean with a 99% confidence interval?

To estimate the population mean with a 99% confidence interval, we can use the following formula:

Confidence interval = sample mean ± (critical value) x (standard error)

where the critical value is obtained from the t-distribution table for a given level of confidence and degrees of freedom, and the standard error is calculated as:

Standard error = standard deviation / sqrt(sample size)

In this case, we have:

Sample size (n) = 50
Sample mean (x̄) = 4
Standard deviation (s) = 1.5

Degrees of freedom (df) = n - 1 = 49 (since we are using the t-distribution)

From the t-distribution table for 49 degrees of freedom and a 99% confidence level, the critical value is 2.680.

The standard error is:

Standard error = 1.5 / sqrt(50) = 0.2121

Plugging in the values, we get:

Confidence interval = 4 ± (2.680) x (0.2121)
= 4 ± 0.568

Therefore, the 99% confidence interval for the population mean is (3.432, 4.568).

We can say with 99% confidence that the average number of ads watched by viewers during a TV program is between 3.432 and 4.568.
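
To see how the chosen confidence level affects the width of the interval for this sample, the sketch below computes the interval at 90%, 95% and 99% confidence. It assumes SciPy is available; only the 99% case is asked for in the question.

```python
import math
from scipy import stats

sample_mean, sample_std, n = 4, 1.5, 50  # values from the question
standard_error = sample_std / math.sqrt(n)

intervals = {}
for confidence in (0.90, 0.95, 0.99):
    # Two-sided t critical value with n - 1 degrees of freedom
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    margin = t_crit * standard_error
    intervals[confidence] = (sample_mean - margin, sample_mean + margin)
    print(f"{confidence:.0%} CI: ({sample_mean - margin:.3f}, {sample_mean + margin:.3f})")
```

A higher confidence level needs a larger critical value, so the interval widens as the confidence level goes up.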