## ASSIGNMENT ON STATISTICS ADVANCE

Q1: What is Estimation Statistics? Explain point estimate and interval estimate.

Estimation statistics is a branch of inferential statistics that involves making inferences about population parameters based on sample data. It is used when it is not feasible or practical to collect data from an entire population, so we rely on a sample to make inferences about the population.

Point Estimate:
A point estimate is a single value that is used to estimate an unknown population parameter. It is calculated based on the sample data and is used to make an educated guess about the population parameter. For example, the sample mean (x̄) is often used as a point estimate for the population mean (μ). However, it's important to note that a point estimate does not provide information about the precision or accuracy of the estimate.

Interval Estimate:
An interval estimate provides a range of values within which the true population parameter is likely to lie. It takes into account both the point estimate and the variability or uncertainty associated with the estimate. The most common type of interval estimate is the confidence interval, which provides a range of values around the point estimate.
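As a minimal sketch of the two kinds of estimate, the snippet below computes a point estimate and a 95% confidence interval from a small sample. The data values and variable names are made up for illustration, and the interval assumes the data are roughly normal so that a t critical value applies.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical sample (illustrative values only)
sample = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])

# Point estimate: the sample mean
point_estimate = sample.mean()

# Interval estimate: mean +/- t critical value * standard error
n = len(sample)
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% interval
lower = point_estimate - t_crit * standard_error
upper = point_estimate + t_crit * standard_error

print(f"Point estimate: {point_estimate:.3f}")
print(f"95% interval: ({lower:.3f}, {upper:.3f})")
```

The point estimate is a single number, while the interval widens or narrows with the sample's variability and size.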

Q2. Write a Python function to estimate the population mean using a sample mean and standard deviation.

In [3]:
import math

def estimate_population_mean(sample_mean, sample_std_dev, sample_size):
    # Calculate the standard error (standard deviation of the sampling distribution)
    standard_error = sample_std_dev / math.sqrt(sample_size)
    
    # Calculate the margin of error (usually based on a desired confidence level)
    # For example, for a 95% confidence level, the z-value would be 1.96
    z_value = 1.96
    margin_of_error = z_value * standard_error
    
    # Calculate the lower and upper bounds of the confidence interval
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error
    
    # Return the estimated population mean and confidence interval
    return sample_mean, lower_bound, upper_bound


Q3: What is Hypothesis testing? Why is it used? State the importance of Hypothesis testing.

Hypothesis testing is a statistical procedure used to make inferences or draw conclusions about a population based on sample data. It involves formulating a null hypothesis and an alternative hypothesis, collecting data, and using statistical methods to assess whether the data provide enough evidence to reject the null hypothesis.

The main purpose of hypothesis testing is to make objective decisions or conclusions about the characteristics of a population using sample data. It allows us to evaluate the validity of a claim or hypothesis about a population based on the available evidence.

### Importance of Hypothesis Testing:

Objectivity: Hypothesis testing provides an objective framework for evaluating claims or hypotheses. It involves using statistical methods and data to make informed decisions rather than relying on personal opinions or biases.

Scientific Research: Hypothesis testing is crucial in scientific research as it helps researchers test their hypotheses, determine the significance of their findings, and contribute to the existing body of knowledge. It allows researchers to draw meaningful and valid conclusions based on statistical evidence.

Decision-Making: Hypothesis testing plays a vital role in decision-making processes, particularly in fields such as business, medicine, and engineering. It allows decision-makers to assess the effectiveness of interventions, compare different strategies or treatments, and make informed choices based on statistical evidence.

Control of Errors: Hypothesis testing provides a framework for controlling errors, specifically Type I and Type II errors. A Type I error occurs when we reject a true null hypothesis, while a Type II error occurs when we fail to reject a false null hypothesis. Setting a predetermined significance level (alpha) caps the probability of a Type I error, and the probability of a Type II error can be reduced by increasing the sample size, which raises the power of the test.

Scientific Validity: Hypothesis testing adds rigor and validity to scientific studies by subjecting claims and hypotheses to statistical scrutiny. It allows researchers to quantify the strength of evidence against the null hypothesis, increasing the credibility and reliability of research findings.
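The link between the significance level and Type I errors can be illustrated with a small simulation. This is only a sketch: the seed, the sample sizes, and the number of trials are arbitrary choices, and both samples are drawn from the same normal population so that the null hypothesis is true by construction.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
alpha = 0.05
n_trials = 2000
rejections = 0

for _ in range(n_trials):
    # Both samples come from the same population, so H0 is true
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        rejections += 1  # rejecting a true H0 is a Type I error

type1_rate = rejections / n_trials
print(f"Observed Type I error rate: {type1_rate:.3f}")
```

The observed rejection rate should land close to alpha, which is what "controlling" the Type I error means in practice.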

Q4. Create a hypothesis that states whether the average weight of male college students is greater than the average weight of female college students

Hypothesis: The average weight of male college students is greater than the average weight of female college students.

Null hypothesis (H0): The average weight of male college students is equal to or less than the average weight of female college students.
Alternative hypothesis (HA): The average weight of male college students is greater than the average weight of female college students.

Symbolically:

H0: μ_male ≤ μ_female
HA: μ_male > μ_female

In this hypothesis, μ_male represents the population mean weight of male college students, and μ_female represents the population mean weight of female college students.

To test this hypothesis, we would collect data on the weights of male and female college students, calculate the sample means for each group, and perform a statistical test (such as a t-test or z-test) to determine if there is sufficient evidence to support the alternative hypothesis that the average weight of male college students is greater than the average weight of female college students.
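A one-sided test of this kind might be sketched as follows. The weights are invented for illustration, and the `alternative="greater"` argument of `scipy.stats.ttest_ind` assumes SciPy 1.6 or newer.

```python
import scipy.stats as stats

# Hypothetical weights in kg (illustrative values only)
male_weights = [72.5, 80.1, 68.4, 77.3, 85.0, 74.2, 79.8, 70.6]
female_weights = [60.3, 65.8, 58.1, 62.7, 70.2, 59.9, 64.4, 61.5]

# H0: mu_male <= mu_female, HA: mu_male > mu_female
# equal_var=False requests Welch's t-test (no equal-variance assumption)
t_stat, p_value = stats.ttest_ind(
    male_weights, female_weights, equal_var=False, alternative="greater"
)

alpha = 0.05
reject_h0 = p_value < alpha
print(f"t = {t_stat:.3f}, one-sided p = {p_value:.4f}, reject H0: {reject_h0}")
```

With real data, the same call would be applied to the measured weights of each group.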

Q5. Write a Python script to conduct a hypothesis test on the difference between two population means, given a sample from each population.

In [4]:
import scipy.stats as stats

def conduct_hypothesis_test(sample1, sample2, alpha):
    # Perform an independent t-test assuming unequal variances
    t_stat, p_value = stats.ttest_ind(sample1, sample2, equal_var=False)

    # Compare the p-value to the significance level
    if p_value < alpha:
        print("Reject the null hypothesis.")
    else:
        print("Fail to reject the null hypothesis.")

    # Return the test statistic and p-value
    return t_stat, p_value


Q6: What is a null and alternative hypothesis? Give some examples.

1) Null Hypothesis (H0):
The null hypothesis represents the default assumption or the statement of no effect or no difference. It assumes that there is no relationship, no effect, or no significant difference between variables. It is typically denoted as H0. In hypothesis testing, we aim to provide evidence to either reject or fail to reject the null hypothesis. 

Examples of null hypotheses include:

The mean height of men and women is equal.
There is no difference in test scores between two teaching methods.
A new drug has no effect on reducing cholesterol levels.

2) Alternative Hypothesis (HA or H1):
The alternative hypothesis is the statement that contradicts or opposes the null hypothesis. It represents the possibility of a relationship, an effect, or a significant difference between variables. It is denoted as HA or sometimes as H1. The alternative hypothesis is what the researcher or analyst seeks to support or demonstrate through statistical evidence. 

Examples of alternative hypotheses include:

The mean height of men is greater than the mean height of women.
Teaching method A leads to higher test scores compared to teaching method B.
The new drug significantly reduces cholesterol levels.

Q7: Write down the steps involved in hypothesis testing.

Hypothesis testing involves a series of steps to evaluate the validity of a claim or hypothesis about a population based on sample data. Here are the general steps involved in hypothesis testing:

State the Null and Alternative Hypotheses:
Clearly define the null hypothesis (H0) and the alternative hypothesis (HA or H1) based on the research question or claim.

Set the Significance Level (Alpha):
Determine the desired level of significance (alpha), which represents the probability of rejecting the null hypothesis when it is true. Commonly used significance levels are 0.05 (5%) and 0.01 (1%).

Collect Sample Data:
Collect data from a representative sample that is relevant to the research question or claim. The sample should be obtained using appropriate sampling methods to ensure it is representative of the population.

Select an Appropriate Test Statistic:
Choose a suitable test statistic based on the nature of the data and the hypothesis being tested. This could be a t-test, z-test, chi-square test, or others, depending on the specific situation.

Determine the Test Statistic's Distribution:
Determine the distribution of the test statistic under the assumption that the null hypothesis is true. This step is important for calculating p-values or critical values needed for the hypothesis test.

Calculate the Test Statistic:
Use the sample data to calculate the value of the chosen test statistic based on the selected statistical test.

Determine the Rejection Region or Calculate the p-value:
Depending on the chosen test statistic and the form of the alternative hypothesis, determine the rejection region (critical region) or calculate the p-value associated with the test statistic. The rejection region represents the values of the test statistic that would lead to the rejection of the null hypothesis. The p-value represents the probability of observing a test statistic as extreme as or more extreme than the calculated value, assuming the null hypothesis is true.

Make a Decision:
Compare the test statistic to the critical value(s) or compare the p-value to the significance level (alpha). If the test statistic falls in the rejection region or the p-value is less than alpha, reject the null hypothesis. Otherwise, fail to reject the null hypothesis.

Draw Conclusions:
Based on the decision in the previous step, interpret the results of the hypothesis test in the context of the research question or claim. Provide conclusions, including any implications or insights gained from the analysis.
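The steps above can be sketched end to end for a one-sample t-test. The hypothesized mean, the data, and the variable names are assumptions made for illustration only.

```python
import numpy as np
import scipy.stats as stats

# State the hypotheses and set the significance level
# H0: mu = 50, HA: mu != 50
mu_0 = 50.0
alpha = 0.05

# Collect sample data (hypothetical values)
sample = np.array([52.1, 48.3, 53.7, 51.2, 49.8, 54.0, 50.6, 52.9, 51.5, 53.1])

# Test statistic: one-sample t, which follows a t-distribution
# with n - 1 degrees of freedom when H0 is true
n = len(sample)
df = n - 1
t_stat = (sample.mean() - mu_0) / (sample.std(ddof=1) / np.sqrt(n))

# Rejection region (critical value) and p-value for a two-tailed test
t_crit = stats.t.ppf(1 - alpha / 2, df)
p_value = 2 * stats.t.sf(abs(t_stat), df)

# Decision: the two rules are equivalent
reject_by_critical_value = abs(t_stat) > t_crit
reject_by_p_value = p_value < alpha
print(f"t = {t_stat:.3f}, critical value = {t_crit:.3f}, p = {p_value:.4f}")
print("Reject H0" if reject_by_p_value else "Fail to reject H0")
```

The critical-value rule and the p-value rule always lead to the same decision, so in practice only one of them needs to be reported.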

Q8. Define p-value and explain its significance in hypothesis testing.

In hypothesis testing, the p-value is a measure of the strength of evidence against the null hypothesis. It represents the probability of observing a test statistic as extreme as or more extreme than the one calculated from the sample data, assuming the null hypothesis is true. The p-value helps us determine the level of statistical significance and make decisions about rejecting or failing to reject the null hypothesis.

The significance of the p-value in hypothesis testing can be understood as follows:

Testing the Null Hypothesis:
The p-value allows us to assess the compatibility of the observed data with the null hypothesis. If the p-value is small (below the predetermined significance level, alpha), it suggests that the observed data is unlikely to have occurred under the assumption of the null hypothesis.

Decision-Making:
By comparing the p-value with the significance level (alpha), we can make decisions about rejecting or failing to reject the null hypothesis. If the p-value is less than alpha, we reject the null hypothesis in favor of the alternative hypothesis. If the p-value is greater than or equal to alpha, we fail to reject the null hypothesis.

Quantifying the Strength of Evidence:
The p-value provides a quantitative measure of the strength of evidence against the null hypothesis. A small p-value suggests strong evidence against the null hypothesis, indicating that the observed data is unlikely to have occurred due to chance alone. Conversely, a large p-value suggests weak evidence against the null hypothesis, indicating that the observed data is reasonably consistent with the null hypothesis.

Interpretation:
The p-value helps in interpreting the results of a hypothesis test. If the p-value is very small (e.g., less than 0.05), it is commonly interpreted as statistically significant, suggesting that the observed effect or relationship is unlikely to be due to random variation. If the p-value is not small (e.g., greater than 0.05), it is interpreted as not statistically significant, indicating that the observed effect or relationship could plausibly be due to random variation.
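To make the definition concrete, the sketch below turns a test statistic into a p-value for each form of the alternative hypothesis. The statistic and degrees of freedom are illustrative values, not results from a particular study.

```python
import scipy.stats as stats

t_stat = 1.0  # illustrative test statistic
df = 8        # illustrative degrees of freedom

# Two-tailed: probability of a value at least this extreme in either direction
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)
# Right-tailed: probability of a value at least this large
p_right_tailed = stats.t.sf(t_stat, df)
# Left-tailed: probability of a value at least this small
p_left_tailed = stats.t.cdf(t_stat, df)

print(f"two-tailed:   {p_two_tailed:.4f}")
print(f"right-tailed: {p_right_tailed:.4f}")
print(f"left-tailed:  {p_left_tailed:.4f}")
```

The same statistic can therefore give different p-values depending on whether the alternative hypothesis is one-sided or two-sided.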

Q9. Generate a Student's t-distribution plot using Python's matplotlib library, with the degrees of freedom parameter set to 10.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Generate x-axis values for the plot
x = np.linspace(-4, 4, 1000)

# Calculate the probability density function (PDF) for the t-distribution with 10 degrees of freedom
pdf = stats.t.pdf(x, df=10)

# Plot the t-distribution
plt.plot(x, pdf, label='t-distribution (df=10)')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.title("Student's t-distribution (df=10)")
plt.legend()
plt.grid(True)
plt.show()


Q10. Write a Python program to calculate the two-sample t-test for independent samples, given two random samples of equal size and a null hypothesis that the population means are equal.

In [7]:
import numpy as np
import scipy.stats as stats

def two_sample_t_test(sample1, sample2):
    # Calculate the means and variances of the two samples
    mean1 = np.mean(sample1)
    mean2 = np.mean(sample2)
    var1 = np.var(sample1, ddof=1)
    var2 = np.var(sample2, ddof=1)
    n1 = len(sample1)
    n2 = len(sample2)

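    # Calculate the pooled standard deviation, weighting each variance by its
    # degrees of freedom (reduces to the simple average when n1 == n2)
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))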

    # Calculate the test statistic
    t = (mean1 - mean2) / (pooled_std * np.sqrt(1/n1 + 1/n2))

    # Calculate the degrees of freedom
    df = n1 + n2 - 2

    # Calculate the p-value (two-tailed)
    p_value = 2 * (1 - stats.t.cdf(abs(t), df))

    # Return the test statistic and p-value
    return t, p_value

# Example usage
sample1 = [5, 7, 9, 8, 6]
sample2 = [4, 6, 8, 7, 5]

t_stat, p_value = two_sample_t_test(sample1, sample2)

print("Test statistic:", t_stat)
print("p-value:", p_value)


Test statistic: 1.0
p-value: 0.3465935070873343
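
As a cross-check, SciPy's built-in pooled-variance test can be run on the same two samples. It should agree with the hand-written function up to floating-point rounding.

```python
import scipy.stats as stats

sample1 = [5, 7, 9, 8, 6]
sample2 = [4, 6, 8, 7, 5]

# equal_var=True requests the pooled-variance (Student) t-test
result = stats.ttest_ind(sample1, sample2, equal_var=True)

print("Test statistic:", result.statistic)
print("p-value:", result.pvalue)
```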


Q11: What is Student’s t distribution? When to use the t-Distribution.

Student's t-distribution, also known as the t-distribution, is a probability distribution that is used in hypothesis testing and confidence interval estimation when the population standard deviation is unknown and the sample size is small.

The t-distribution is similar to the standard normal distribution (Z-distribution), but it has heavier tails. The shape of the t-distribution depends on the degrees of freedom (df), which is determined by the sample size. As the sample size increases, the t-distribution approaches the standard normal distribution.

The t-distribution is used in situations where the population standard deviation is unknown and needs to be estimated from the sample data. This commonly occurs when working with small sample sizes or when the population standard deviation is not available. Some scenarios where the t-distribution is appropriate include:

Testing the population mean: When you want to test whether the mean of a sample is significantly different from a hypothesized population mean, and the population standard deviation is unknown, you can use the t-distribution.

Confidence interval estimation: When you want to estimate the population mean with a certain level of confidence and the population standard deviation is unknown, you can use the t-distribution to construct a confidence interval.

Comparing two population means: When comparing the means of two independent samples and the population standard deviations are unknown, the t-distribution is used to perform a two-sample t-test.

The t-distribution is particularly valuable when working with small sample sizes because it takes into account the additional uncertainty introduced by estimating the population standard deviation from limited data. As the sample size increases, the t-distribution becomes closer to the standard normal distribution, and the use of the t-distribution becomes less critical.
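The heavier tails, and the convergence to the normal distribution, can be checked numerically. This sketch compares the probability of exceeding 2 under the standard normal distribution and under t-distributions with a few arbitrarily chosen degrees of freedom.

```python
import scipy.stats as stats

# Upper-tail probability beyond 2 for the standard normal distribution
normal_tail = stats.norm.sf(2.0)

# The same tail probability for t-distributions with increasing df
t_tails = {df: stats.t.sf(2.0, df) for df in (5, 30, 1000)}

print(f"normal:      {normal_tail:.4f}")
for df, tail in t_tails.items():
    print(f"t (df={df}): {tail:.4f}")
```

The t tail probabilities are larger than the normal one and shrink toward it as the degrees of freedom grow.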

Q12: What is t-statistic? State the formula for t-statistic.

The t-statistic (also known as the t-value or t-score) is a measure of how the sample mean differs from the hypothesized population mean in hypothesis testing. It is used to assess whether the observed difference between sample data and the population parameter is statistically significant.

The formula for the t-statistic depends on the type of hypothesis test being conducted. Here are the formulas for two common scenarios:

One-sample t-test:
The one-sample t-test compares the mean of a single sample to a hypothesized population mean.

t = (x̄ - μ) / (s / √n)

In this formula:

t: The t-statistic
x̄: The sample mean
μ: The hypothesized population mean
s: The sample standard deviation
n: The sample size

Independent two-sample t-test:
The independent two-sample t-test compares the means of two independent samples to determine if they come from populations with equal means.

t = (x̄1 - x̄2) / √[(s1^2 / n1) + (s2^2 / n2)]

In this formula:

t: The t-statistic
x̄1, x̄2: The sample means of the two independent samples
s1, s2: The sample standard deviations of the two independent samples
n1, n2: The sample sizes of the two independent samples

In both formulas, the t-statistic represents the standardized difference between the sample mean(s) and the hypothesized population mean(s). A larger absolute value of the t-statistic indicates a larger difference between the sample mean(s) and the hypothesized population mean(s).

The t-statistic is then compared to critical values from the t-distribution or used to calculate a p-value to make a decision about rejecting or failing to reject the null hypothesis.

It's important to note that the formula for the t-statistic may vary depending on the specific hypothesis test being conducted, such as paired t-tests or other variants.
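The two formulas above translate directly into code. The function names and the input numbers below are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import math

def one_sample_t(sample_mean, hypothesized_mean, sample_std, n):
    """t = (x_bar - mu) / (s / sqrt(n))"""
    return (sample_mean - hypothesized_mean) / (sample_std / math.sqrt(n))

def two_sample_t(mean1, mean2, std1, std2, n1, n2):
    """t = (x_bar1 - x_bar2) / sqrt(s1^2 / n1 + s2^2 / n2)"""
    standard_error = math.sqrt(std1**2 / n1 + std2**2 / n2)
    return (mean1 - mean2) / standard_error

# (52 - 50) / (4 / sqrt(16)) = 2 / 1 = 2.0
t_one = one_sample_t(52.0, 50.0, 4.0, 16)

# (10 - 8) / sqrt(9/9 + 16/16) = 2 / sqrt(2), about 1.414
t_two = two_sample_t(10.0, 8.0, 3.0, 4.0, 9, 16)

print(t_one, t_two)
```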

Q13. A coffee shop owner wants to estimate the average daily revenue for their shop. They take a random sample of 50 days and find the sample mean revenue to be $500 with a standard deviation of $50. Estimate the population mean revenue with a 95% confidence interval.

To estimate the population mean revenue with a 95% confidence interval, we can use the sample mean, sample standard deviation, and the t-distribution.

Given information:

Sample size (n) = 50
Sample mean (x̄) = $500
Sample standard deviation (s) = $50
Confidence level (1 - α) = 95% (which corresponds to α = 0.05)

To calculate the confidence interval, we need to determine the critical value for a t-distribution with n-1 degrees of freedom (49 in this case) that leaves an upper-tail area of α/2 = 0.025 (since the interval is two-sided).

Using the t-distribution table or statistical software, the critical value for a t-distribution with 49 degrees of freedom and an upper-tail area of 0.025 is approximately 2.009.

Now we can calculate the margin of error (E) and the confidence interval:

Margin of error (E) = Critical value * (sample standard deviation / √n)
E = 2.009 * (50 / √50)
E ≈ 14.21

Confidence interval = Sample mean ± Margin of error
Confidence interval = $500 ± $14.21

Therefore, the 95% confidence interval estimate for the population mean revenue is approximately $485.79 to $514.21. We can be 95% confident that the true average daily revenue for the coffee shop lies within this interval.
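The calculation can be reproduced from the summary statistics with a few lines of code. This is a sketch with variable names chosen for illustration; it looks up the critical value with `scipy.stats.t.ppf` instead of a printed table.

```python
import math
import scipy.stats as stats

n = 50
sample_mean = 500.0
sample_std = 50.0
confidence = 0.95

# Critical value that leaves alpha / 2 in the upper tail
alpha = 1 - confidence
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

# Margin of error and interval bounds
margin_of_error = t_crit * sample_std / math.sqrt(n)
lower = sample_mean - margin_of_error
upper = sample_mean + margin_of_error

print(f"critical value: {t_crit:.3f}")
print(f"margin of error: {margin_of_error:.2f}")
print(f"interval: ({lower:.2f}, {upper:.2f})")
```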

Q14. A researcher hypothesizes that a new drug will decrease blood pressure by 10 mmHg. They conduct a clinical trial with 100 patients and find that the sample mean decrease in blood pressure is 8 mmHg with a standard deviation of 3 mmHg. Test the hypothesis with a significance level of 0.05.

To test the hypothesis, we will perform a one-sample t-test.

Given information:

Sample size (n) = 100
Sample mean decrease in blood pressure (x̄) = 8 mmHg
Sample standard deviation (s) = 3 mmHg
Hypothesized population mean decrease in blood pressure (μ) = 10 mmHg
Significance level (α) = 0.05
Null hypothesis (H0): The new drug decreases blood pressure by 10 mmHg on average (μ = 10)
Alternative hypothesis (HA): The new drug decreases blood pressure by less than 10 mmHg on average (μ < 10)

Next, we calculate the t-statistic using the formula:
t = (x̄ - μ) / (s / √n)

t = (8 - 10) / (3 / √100)
t = -2 / 0.3
t = -6.67

Using a t-distribution table or statistical software, we find the critical t-value at a significance level of 0.05 for a one-tailed test with 99 degrees of freedom is approximately -1.660. Since our calculated t-value (-6.67) is less than the critical t-value (-1.660), we reject the null hypothesis.

Therefore, we can conclude that there is evidence to suggest that the new drug decreases blood pressure by less than 10 mmHg at a significance level of 0.05.
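The same lower-tailed test of H0: μ = 10 against HA: μ < 10 can be checked in code from the summary statistics. The variable names are illustrative.

```python
import math
import scipy.stats as stats

n = 100
sample_mean = 8.0
sample_std = 3.0
mu_0 = 10.0
alpha = 0.05

# One-sample t statistic from summary statistics
t_stat = (sample_mean - mu_0) / (sample_std / math.sqrt(n))
df = n - 1

# Lower-tail critical value and p-value
t_crit = stats.t.ppf(alpha, df)
p_value = stats.t.cdf(t_stat, df)

reject_h0 = t_stat < t_crit
print(f"t = {t_stat:.2f}, critical value = {t_crit:.3f}, p = {p_value:.2e}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```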

Q15. An electronics company produces a certain type of product with a mean weight of 5 pounds and a standard deviation of 0.5 pounds. A random sample of 25 products is taken, and the sample mean weight is found to be 4.8 pounds. Test the hypothesis that the true mean weight of the products is less than 5 pounds with a significance level of 0.01.

To test the hypothesis, we will perform a one-sample z-test, because the population standard deviation (σ) is given.

Given information:

Population mean weight (μ) = 5 pounds
Population standard deviation (σ) = 0.5 pounds
Sample size (n) = 25
Sample mean weight (x̄) = 4.8 pounds
Significance level (α) = 0.01
Null hypothesis (H0): The true mean weight of the products is 5 pounds or greater (μ ≥ 5)
Alternative hypothesis (HA): The true mean weight of the products is less than 5 pounds (μ < 5)

Next, we calculate the z-statistic using the formula:
z = (x̄ - μ) / (σ / √n)

z = (4.8 - 5) / (0.5 / √25)
z = -0.2 / 0.1
z = -2

Using a standard normal table or statistical software, we find the critical z-value at a significance level of 0.01 for a one-tailed test is approximately -2.326. Since our calculated z-value (-2) is greater than the critical z-value (-2.326), we fail to reject the null hypothesis. (A t-test with 24 degrees of freedom, whose critical value is approximately -2.492, leads to the same decision.)

Therefore, based on the given data, we do not have sufficient evidence to conclude that the true mean weight of the products is less than 5 pounds at a significance level of 0.01.
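Because the problem states the population standard deviation, the statistic can be compared against either a standard normal or a t critical value. The sketch below computes both, with illustrative variable names, to show that the decision does not depend on that choice here.

```python
import math
import scipy.stats as stats

mu_0 = 5.0
sigma = 0.5
n = 25
sample_mean = 4.8
alpha = 0.01

# Standardized difference between the sample mean and the hypothesized mean
statistic = (sample_mean - mu_0) / (sigma / math.sqrt(n))

# Lower-tail critical values under each reference distribution
z_crit = stats.norm.ppf(alpha)
t_crit = stats.t.ppf(alpha, df=n - 1)

# Lower-tail p-value under the standard normal distribution
p_value_z = stats.norm.cdf(statistic)

reject_with_z = statistic < z_crit
reject_with_t = statistic < t_crit
print(f"statistic = {statistic:.2f}, z critical = {z_crit:.3f}, t critical = {t_crit:.3f}")
print(f"p-value (normal) = {p_value_z:.4f}")
print(f"reject with z: {reject_with_z}, reject with t: {reject_with_t}")
```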

Q16. Two groups of students are given different study materials to prepare for a test. The first group (n1 = 30) has a mean score of 80 with a standard deviation of 10, and the second group (n2 = 40) has a mean score of 75 with a standard deviation of 8. Test the hypothesis that the population means for the two groups are equal with a significance level of 0.01.

To test the hypothesis that the population means for the two groups are equal, we can perform an independent two-sample t-test.

Given information:

Group 1: n1 = 30, mean score (x̄1) = 80, standard deviation (s1) = 10
Group 2: n2 = 40, mean score (x̄2) = 75, standard deviation (s2) = 8
Significance level (α) = 0.01
Null hypothesis (H0): The population means for the two groups are equal (μ1 = μ2)
Alternative hypothesis (HA): The population means for the two groups are not equal (μ1 ≠ μ2)

The formula for the test statistic in an independent two-sample t-test is:

t = (x̄1 - x̄2) / √[(s1^2 / n1) + (s2^2 / n2)]

Plugging in the given values, we have:

t = (80 - 75) / √[(10^2 / 30) + (8^2 / 40)]
t = 5 / √[3.33 + 1.60]
t ≈ 5 / √4.93
t ≈ 5 / 2.221
t ≈ 2.25

Using a t-distribution table or statistical software, we find the critical t-value at a significance level of 0.01 for a two-tailed test with (n1 + n2 - 2) degrees of freedom (68 degrees of freedom in this case) is approximately ±2.650. (The Welch approximation, which matches the unpooled formula above, gives about 54 degrees of freedom and a critical value of approximately ±2.670.)

Since the absolute value of our calculated t-value (2.25) is less than the critical t-value (2.650), we fail to reject the null hypothesis.

Therefore, based on the given data, we do not have sufficient evidence to conclude that the population means for the two groups differ at a significance level of 0.01.
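The sketch below recomputes the test from the summary statistics so the arithmetic can be checked independently. It reports the decision under both the n1 + n2 - 2 degrees of freedom used in the text and the Welch-Satterthwaite approximation, which is the usual companion of the unpooled formula. Variable names are illustrative.

```python
import math
import scipy.stats as stats

n1, mean1, std1 = 30, 80.0, 10.0
n2, mean2, std2 = 40, 75.0, 8.0
alpha = 0.01

# Unpooled standard error terms and the t statistic
v1 = std1**2 / n1
v2 = std2**2 / n2
t_stat = (mean1 - mean2) / math.sqrt(v1 + v2)

# Two choices of degrees of freedom
df_simple = n1 + n2 - 2
df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"t = {t_stat:.3f}")
for label, df in (("n1 + n2 - 2", df_simple), ("Welch", df_welch)):
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    decision = "reject H0" if abs(t_stat) > t_crit else "fail to reject H0"
    print(f"{label}: df = {df:.1f}, critical = {t_crit:.3f}, p = {p_value:.4f}, {decision}")
```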

Q17. A marketing company wants to estimate the average number of ads watched by viewers during a TV program. They take a random sample of 50 viewers and find that the sample mean is 4 with a standard deviation of 1.5. Estimate the population mean with a 99% confidence interval.

To estimate the population mean with a 99% confidence interval, we can use the sample mean, sample standard deviation, and the t-distribution.

Given information:

Sample size (n) = 50
Sample mean (x̄) = 4
Sample standard deviation (s) = 1.5
Confidence level (1 - α) = 99% (which corresponds to α = 0.01)

To calculate the confidence interval, we need to determine the critical value for a t-distribution with n-1 degrees of freedom (49 in this case) that leaves an upper-tail area of α/2 = 0.005 (since the interval is two-sided).

Using the t-distribution table or statistical software, the critical value for a t-distribution with 49 degrees of freedom and an upper-tail area of 0.005 is approximately ±2.680.

Now we can calculate the margin of error (E) and the confidence interval:

Margin of error (E) = Critical value * (sample standard deviation / √n)
E = 2.680 * (1.5 / √50)
E ≈ 0.569

Confidence interval = Sample mean ± Margin of error
Confidence interval = 4 ± 0.569

Therefore, the 99% confidence interval estimate for the population mean of the number of ads watched by viewers during a TV program is approximately 3.431 to 4.569. We can be 99% confident that the true average number of ads watched falls within this interval.
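For comparison, the sketch below uses `scipy.stats.t.interval` to compute the interval at two confidence levels from the same summary statistics. The 95% level is included only as an illustrative contrast, and the confidence level is passed positionally because the name of that parameter differs between SciPy versions.

```python
import math
import scipy.stats as stats

n = 50
sample_mean = 4.0
sample_std = 1.5
standard_error = sample_std / math.sqrt(n)

intervals = {}
for confidence in (0.95, 0.99):
    # Arguments: confidence level, degrees of freedom, center, scale
    lower, upper = stats.t.interval(
        confidence, n - 1, loc=sample_mean, scale=standard_error
    )
    intervals[confidence] = (lower, upper)
    print(f"{confidence:.0%} interval: ({lower:.3f}, {upper:.3f})")
```

A higher confidence level produces a wider interval from the same data.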