In [None]:
1. What is Estimation Statistics? Explain point estimate and interval estimate.

In [None]:
ANS- Estimation statistics is a branch of statistics that deals with estimating population parameters based on sample statistics. 
     It involves using statistical methods to make inferences about population parameters based on the characteristics of a sample.

There are two main types of estimation in statistics: 
    1. Point estimation  
    2. Interval estimation.

Point estimation involves using a single value to estimate a population parameter. 
For example, the sample mean is often used as a point estimate of the population mean, while the sample proportion is often used as a 
point estimate of the population proportion. Point estimates can be calculated using various formulas and techniques, depending on the 
type of parameter being estimated and the characteristics of the sample.

Interval estimation, on the other hand, involves estimating a population parameter by calculating a range of values within which the true 
parameter value is likely to fall, along with a level of confidence associated with that range. This range is known as a confidence interval. 
Confidence intervals are calculated using point estimates, along with information about the variability of the sample data and the sample size. 
For example, a 95% confidence interval for the population mean is a range produced by a procedure that, over repeated sampling, 
captures the true population mean about 95% of the time.

The choice between point estimation and interval estimation depends on the purpose of the estimation and the level of precision required. 

Point estimates are often used when a single value is sufficient, such as when making a decision based on a sample statistic. 

Interval estimates are often used when a range of values is required, such as when estimating a population parameter with a certain 
level of confidence.
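
As a minimal sketch of the two kinds of estimate, the snippet below computes a point estimate (the sample mean) and a 95% interval estimate 
from a small sample. The data values are hypothetical, and the t critical value is used on the assumption that the population standard 
deviation is unknown.

```python
import math
from scipy import stats

# Hypothetical sample, used only for illustration
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]

n = len(sample)
point_estimate = sum(sample) / n  # point estimate: the sample mean

# Sample standard deviation (n - 1 in the denominator) and standard error
sample_std = math.sqrt(sum((x - point_estimate) ** 2 for x in sample) / (n - 1))
standard_error = sample_std / math.sqrt(n)

# Interval estimate: point estimate +/- critical value * standard error
t_crit = stats.t.ppf(0.975, df=n - 1)
margin = t_crit * standard_error
interval_estimate = (point_estimate - margin, point_estimate + margin)

print("Point estimate:", round(point_estimate, 3))
print("95% interval estimate:", tuple(round(b, 3) for b in interval_estimate))
```

The interval is centered on the point estimate, so the two answers are consistent; the interval simply adds a statement about precision.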

In [None]:
2. Write a Python function to estimate the population mean using a sample mean and standard deviation.

In [None]:
import math

def estimate_population_mean(sample_mean, sample_std, sample_size):
    """
    Estimates the population mean from a sample mean, standard deviation,
    and sample size.
    
    Parameters:
    sample_mean (float): the sample mean
    sample_std (float): the sample standard deviation
    sample_size (int): the sample size
    
    Returns:
    tuple: (point_estimate, lower_bound, upper_bound), where the bounds form
           an approximate 95% confidence interval
    """
    standard_error = sample_std / math.sqrt(sample_size)
    z_score = 1.96 # z-score for a 95% confidence interval
    margin_of_error = z_score * standard_error
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error
    # The point estimate of the population mean is the sample mean itself
    return sample_mean, lower_bound, upper_bound

This function takes in three parameters: the sample mean, the sample standard deviation, and the sample size. It then calculates the 
standard error of the mean using the sample standard deviation and sample size. It assumes a 95% confidence level and calculates the margin 
of error based on the standard error and the z-score for a 95% confidence interval (1.96). 
Finally, it returns the sample mean as the point estimate of the population mean, together with the lower and upper bounds of the confidence 
interval. The midpoint of the interval is always the sample mean, so the bounds are what carry the additional information.

Note that using the z-score assumes that the sample mean is approximately normally distributed, which holds for a normal population or a 
reasonably large sample. For small samples with an unknown population standard deviation, a t critical value with n - 1 degrees of freedom 
should be used instead of 1.96.

In [None]:
3. What is Hypothesis testing? Why is it used? State the importance of Hypothesis testing.

In [None]:
ANS- Hypothesis testing is a statistical method used to make inferences about a population based on a sample of data. 
     It involves formulating a hypothesis about a population parameter, collecting and analyzing sample data, and using the data to 
     test the hypothesis.

Hypothesis testing is used to determine whether a hypothesis about a population parameter is supported by the sample data or 
whether it is likely to be false. It is used in a variety of fields, including business, economics, psychology, medicine, and many others.

The importance of hypothesis testing lies in its ability to provide a rigorous framework for making decisions based on data. 
By testing hypotheses, we can make informed decisions about the population based on the sample data. 
Hypothesis testing allows us to draw conclusions about a population based on limited sample data, while also accounting for the uncertainty 
inherent in the sampling process.

In addition to providing a framework for decision-making, hypothesis testing is also important for scientific inquiry. 
Hypothesis testing allows us to test theoretical predictions and validate or refute scientific theories. It is an essential tool for 
advancing scientific knowledge and understanding.

Overall, hypothesis testing is a powerful tool for making inferences about a population based on sample data. It allows us to draw conclusions 
with a high degree of confidence, while also accounting for the uncertainty inherent in the sampling process.

In [None]:
4. Create a hypothesis that states whether the average weight of male college students is greater than the average weight of female college students.

In [None]:
ANS- Hypothesis: The average weight of male college students is greater than the average weight of female college students.

Symbolically, we can represent this as:

Null Hypothesis, H0: μ_male <= μ_female

Alternate Hypothesis, H1: μ_male > μ_female

Where H0 is the null hypothesis, which states that the average weight of male college students is less than or equal to that of female 
college students, and H1 is the alternative hypothesis, which states that the average weight of male college students is greater than the 
average weight of female college students.

To test this hypothesis, we would collect a sample of male and female college students, measure their weights, and compare the sample 
means using an appropriate statistical test, such as a one-tailed two-sample t-test.
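
One way such a test could be run is sketched below with made-up weights in kilograms. scipy.stats.ttest_ind with alternative="greater" 
performs the one-tailed comparison that matches H1. The alternative argument requires SciPy 1.6 or later, and equal_var=False (Welch's test) 
is an assumption chosen here because nothing says the two groups share a variance.

```python
import numpy as np
from scipy import stats

# Hypothetical weights in kg, for illustration only
male = np.array([72.5, 80.1, 68.4, 77.3, 85.0, 74.2, 79.8, 70.6])
female = np.array([58.3, 62.7, 55.1, 66.0, 60.4, 63.9, 57.8, 61.2])

alpha = 0.05

# H0: mu_male <= mu_female, H1: mu_male > mu_female (one-tailed, Welch's t-test)
result = stats.ttest_ind(male, female, equal_var=False, alternative="greater")
t_stat, p_value = result.statistic, result.pvalue

print("t-statistic:", t_stat)
print("one-tailed p-value:", p_value)
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```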

In [None]:
5. Write a Python script to conduct a hypothesis test on the difference between two population means, given a sample from each population.

In [1]:
import numpy as np
from scipy.stats import t

# Step 1: Define the null and alternative hypothesis
# H0: mu1 = mu2
# H1: mu1 != mu2

# Step 2: Set the significance level (alpha)
alpha = 0.05

# Step 3: Collect sample data from each population
sample1 = np.array([1, 2, 3, 4, 5])
sample2 = np.array([6, 7, 8, 9, 10])

# Step 4: Calculate the sample means and standard deviations
mean1 = np.mean(sample1)
mean2 = np.mean(sample2)
std1 = np.std(sample1, ddof=1)
std2 = np.std(sample2, ddof=1)

# Step 5: Calculate the test statistic (t-value)
n1 = len(sample1)
n2 = len(sample2)
sp = np.sqrt(((n1-1)*std1**2 + (n2-1)*std2**2)/(n1+n2-2))
t_value = (mean1 - mean2)/(sp*np.sqrt(1/n1 + 1/n2))

# Step 6: Calculate the degrees of freedom
df = n1 + n2 - 2

# Step 7: Calculate the critical value
t_critical = t.ppf(1-alpha/2, df)

# Step 8: Compare the test statistic to the critical value and make a decision
if np.abs(t_value) > t_critical:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

# Step 9: Calculate the p-value
p_value = 2*(1 - t.cdf(np.abs(t_value), df))
print("p-value:", p_value)

Reject the null hypothesis.
p-value: 0.0010528257933664076


In [None]:
In this example, we have two samples (sample1 and sample2) from two populations, and we want to test whether the population means are equal or not. 
We define the null and alternative hypothesis, set the significance level (alpha), collect the sample data, and calculate the sample means and 
standard deviations.

Next, we calculate the test statistic (t-value) using the formula (mean1 - mean2)/(sp*np.sqrt(1/n1 + 1/n2)), where sp is pooled standard deviation. 
We then calculate the degrees of freedom (df) and the critical value (t_critical) using the t distribution from the scipy.stats module.

We compare the test statistic to the critical value and make a decision to either reject the null hypothesis or fail to reject it. 
Finally, we calculate the p-value using the t.cdf function and print it out along with the decision.
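
As a cross-check on the manual calculation, the same pooled-variance test is available as scipy.stats.ttest_ind. The sketch below reuses 
the two samples from the script above and assumes equal variances (equal_var=True), which is what the pooled standard deviation implies.

```python
import numpy as np
from scipy import stats

sample1 = np.array([1, 2, 3, 4, 5])
sample2 = np.array([6, 7, 8, 9, 10])

# Two-sided, equal-variance (pooled) two-sample t-test
t_stat, p_value = stats.ttest_ind(sample1, sample2, equal_var=True)

print("t-statistic:", t_stat)
print("p-value:", p_value)
```

The p-value should agree with the one printed by the manual script up to floating-point rounding.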

In [None]:
6. What is a null and alternative hypothesis? Give some examples.

In [None]:
ANS- In statistical hypothesis testing, the null hypothesis and alternative hypothesis are two complementary statements about a population 
     or process under study. The null hypothesis is typically the default position or assumption that there is no significant difference or 
     relationship between variables, while the alternative hypothesis is the opposite or complementary statement that there is some significant 
    difference or relationship.

Here are some examples:

1. Null hypothesis: The mean height of male and female students in a university is the same.
   Alternative hypothesis: The mean height of male students is greater than that of female students.

2. Null hypothesis: The proportion of customers who purchase a product after seeing an advertisement is 50% or less.
   Alternative hypothesis: The proportion of customers who purchase a product after seeing an advertisement is greater than 50%.

3. Null hypothesis: The quality of a product manufactured by a company is acceptable.
   Alternative hypothesis: The quality of a product manufactured by a company is not acceptable.

4. Null hypothesis: The time taken to complete a task is the same for two different methods.
   Alternative hypothesis: The time taken to complete a task is different for two different methods.

In each example, the null hypothesis assumes that there is no significant difference or effect, while the alternative hypothesis proposes that 
there is a significant difference or effect. The hypothesis testing procedure is used to determine whether there is enough evidence to reject 
the null hypothesis in favor of the alternative hypothesis.

In [None]:
7. Write down the steps involved in hypothesis testing.

In [None]:
ANS- The general steps involved in hypothesis testing are:

1. State the research question and formulate the null and alternative hypotheses: The first step is to clearly define the research question and 
   specify the null hypothesis (which assumes no effect or difference) and the alternative hypothesis (which proposes an effect or difference).

2. Choose the level of significance: The level of significance (alpha) represents the maximum probability of rejecting the null hypothesis when 
   it is actually true. It is typically set to 0.05 (5%) or 0.01 (1%) depending on the context.

3. Collect data and calculate test statistics: Collect a sample of data and calculate a test statistic that quantifies the difference between the 
   sample data and the null hypothesis.

4. Determine the p-value: The p-value is the probability of observing a test statistic as extreme or more extreme than the one calculated from 
   the sample data, assuming the null hypothesis is true.

5. Compare the p-value to the level of significance: If the p-value is less than the level of significance, reject the null hypothesis and 
   accept the alternative hypothesis. If the p-value is greater than the level of significance, do not reject the null hypothesis.

6. Interpret the results: If the null hypothesis is rejected, interpret the results in terms of the alternative hypothesis and draw conclusions 
   about the research question. If the null hypothesis is not rejected, conclude only that the data do not provide sufficient evidence for the 
   alternative hypothesis; this is not proof that the null hypothesis is true.

7. Make a decision: Based on the conclusions drawn from the hypothesis test, make a decision about the research question.

It is important to note that these steps may vary depending on the type of hypothesis test being conducted and the specific context of the 
research question.
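
The steps above can be sketched as a small function for one common case, a two-sided one-sample t-test. The function name 
one_sample_t_test and the example data are hypothetical and serve only to show where each step appears in code.

```python
import math
from scipy import stats

def one_sample_t_test(sample, mu0, alpha=0.05):
    """Two-sided one-sample t-test of H0: mu = mu0 against H1: mu != mu0."""
    # Steps 1-2: the hypotheses and the significance level arrive as arguments
    n = len(sample)

    # Step 3: compute the test statistic from the sample
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    t_stat = (mean - mu0) / (s / math.sqrt(n))

    # Step 4: two-sided p-value from the t-distribution with n - 1 df
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

    # Step 5: compare the p-value to the significance level
    reject = bool(p_value < alpha)
    return t_stat, p_value, reject

# Steps 6-7: interpret the result for hypothetical data
t_stat, p_value, reject = one_sample_t_test([1, 2, 3, 4, 5], mu0=10)
print("t:", t_stat, "p:", p_value)
print("Reject H0" if reject else "Fail to reject H0")
```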

In [None]:
8. Define p-value and explain its significance in hypothesis testing.

In [None]:
ANS- In hypothesis testing, the p-value is the probability of observing a test statistic as extreme or more extreme than the one calculated 
     from the sample data, assuming the null hypothesis is true. It is a measure of the evidence against the null hypothesis provided by the 
     sample data.

The p-value is significant in hypothesis testing because it helps determine whether the null hypothesis should be rejected or not. 
If the p-value is less than the level of significance (often set to 0.05 or 0.01), then the null hypothesis is rejected and the alternative hypothesis 
is accepted. This means that there is strong evidence against the null hypothesis and the result is considered statistically significant.

On the other hand, if the p-value is greater than the level of significance, then the null hypothesis is not rejected. This means that there is 
insufficient evidence against the null hypothesis and the result is considered statistically non-significant.

It is important to note that the p-value is not the probability of the null hypothesis being true or false. It only represents the probability of 
observing the sample data or more extreme data, assuming the null hypothesis is true. Therefore, the interpretation of the p-value should always be 
considered in conjunction with the context of the research question and other relevant factors.
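
To make the definition concrete, the sketch below converts a test statistic into p-values with scipy.stats.t. The statistic (2.5) and 
degrees of freedom (15) are arbitrary illustrative values, not taken from any data set in this document.

```python
from scipy import stats

t_stat = 2.5  # illustrative test statistic
df = 15       # illustrative degrees of freedom

# Right-tailed: probability of a statistic at least this large under H0
p_right = stats.t.sf(t_stat, df)

# Left-tailed: probability of a statistic at most this large under H0
p_left = stats.t.cdf(t_stat, df)

# Two-tailed: probability of a statistic at least this extreme in either direction
p_two = 2 * stats.t.sf(abs(t_stat), df)

print("right-tailed p:", p_right)
print("left-tailed p:", p_left)
print("two-tailed p:", p_two)
```

Which of the three is reported depends on the alternative hypothesis, so the same data can be significant in a one-tailed test and 
non-significant in a two-tailed test at the same level.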

In [None]:
9. Generate a Student's t-distribution plot using Python's matplotlib library, with the degrees of freedom parameter set to 10.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate x-values for the t-distribution
x = np.linspace(-4, 4, 1000)

# Compute the t-distribution using the probability density function
df = 10  # Degrees of freedom
samples = np.random.standard_t(df, size=1000)  # Random sample from the t-distribution
pdf = stats.t.pdf(x, df)  # Probability density function

# Plot the theoretical density and a histogram of the random sample
fig, ax = plt.subplots()
ax.plot(x, pdf, 'k-', lw=2, label='t-distribution (df=10)')
ax.hist(samples, bins=30, density=True, histtype='stepfilled', alpha=0.5, label='random sample')
ax.legend(loc='best', frameon=False)
ax.set_xlabel('x')
ax.set_ylabel('pdf(x)')
plt.show()

In [None]:
10. Write a Python program to calculate the two-sample t-test for independent samples, given two random samples of equal size and 
a null hypothesis that the population means are equal.

In [3]:
import numpy as np
from scipy.stats import t

# Generate two random samples of equal size
sample1 = np.random.normal(50, 10, size=100)
sample2 = np.random.normal(45, 12, size=100)

# Calculate the sample means and standard deviations
mean1 = np.mean(sample1)
mean2 = np.mean(sample2)
std1 = np.std(sample1, ddof=1)
std2 = np.std(sample2, ddof=1)

# Calculate the standard error of the difference between means
se = np.sqrt((std1**2)/len(sample1) + (std2**2)/len(sample2))

# Calculate the t-statistic and degrees of freedom
t_stat = (mean1 - mean2) / se
df = len(sample1) + len(sample2) - 2

# Calculate the p-value for the two-sided test
p_val = 2 * t.cdf(-np.abs(t_stat), df)

# Print the results
print("Sample 1 Mean:", mean1)
print("Sample 2 Mean:", mean2)
print("T-Statistic:", t_stat)
print("P-Value:", p_val)

Sample 1 Mean: 48.725845929122926
Sample 2 Mean: 46.40795130381128
T-Statistic: 1.4088837034464647
P-Value: 0.16043803281428895


In [None]:
In this program, we first generate two random samples of equal size using the np.random.normal function from the NumPy library. 
We then calculate the sample means and standard deviations using the np.mean and np.std functions, respectively.

Next, we calculate the standard error of the difference between means using the formula se = sqrt((std1**2)/n1 + (std2**2)/n2), 
where std1 and std2 are the standard deviations of the two samples, and n1 and n2 are the sizes of the two samples.

We then calculate the t-statistic using the formula t_stat = (mean1 - mean2) / se, where mean1 and mean2 are the sample means, 
and se is the standard error of the difference between means.

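We also calculate the degrees of freedom using the formula df = n1 + n2 - 2, where n1 and n2 are the sizes of the two samples. 
Because the two samples are the same size, the standard error above equals the pooled-variance standard error, so this is the standard 
equal-variance t-test. For unequal sample sizes or clearly unequal variances, Welch's t-test with its own degrees-of-freedom formula is the safer choice.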

Finally, we calculate the p-value for a two-sided test using the t.cdf function from the SciPy library, and print the results.
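
The sketch below checks the manual formula against scipy.stats.ttest_ind for equal-sized samples. The seed passed to 
np.random.default_rng is an arbitrary choice added here to make the run repeatable; the original program is unseeded, so its printed 
numbers will differ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
a = rng.normal(50, 10, size=100)
b = rng.normal(45, 12, size=100)

# Manual t-statistic using the unpooled standard error
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_manual = (a.mean() - b.mean()) / se
df = len(a) + len(b) - 2
p_manual = 2 * stats.t.cdf(-abs(t_manual), df)

# Library result for the pooled (equal-variance) test
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=True)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

With equal sample sizes the two statistics coincide; with unequal sizes they generally would not.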

In [None]:
11. What is Student’s t distribution? When to use the t-Distribution.

In [None]:
ANS- Student’s t-distribution is a probability distribution used to make inferences about a population mean when the population standard 
     deviation is unknown and must be estimated from the sample, which matters most when the sample size is small. It is also used in 
     hypothesis testing to determine whether two sample means are significantly different from each other.

The t-distribution is similar to the standard normal distribution but has heavier tails, which reflect the extra uncertainty of estimating 
the standard deviation from a small sample. It is a family of distributions indexed by the degrees of freedom (df); for a one-sample problem, 
df is the sample size minus one. As df grows, the t-distribution approaches the standard normal distribution.

The t-distribution is used when the population standard deviation is unknown and must be estimated from the sample data, provided the data 
are approximately normal or the sample is reasonably large. With large samples the t and normal distributions give nearly identical results, 
so the distinction matters most for small samples. 
In hypothesis testing, it supplies the critical values and p-values for t-tests on one mean, on two means, and on paired differences.
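
A short sketch of the heavier tails: the two-sided 95% critical value of the t-distribution is compared with the normal value for 
several degrees of freedom. The particular df values are arbitrary choices for illustration.

```python
from scipy import stats

# Two-sided 95% critical value of the standard normal distribution
z_crit = stats.norm.ppf(0.975)

# Matching critical values of the t-distribution for increasing df
t_crits = {df: stats.t.ppf(0.975, df) for df in (2, 5, 10, 30, 100, 1000)}

print(f"normal: {z_crit:.4f}")
for df, t_crit in t_crits.items():
    print(f"t, df={df:>4}: {t_crit:.4f}")
```

The t critical values are always larger than the normal one and shrink toward it as df increases, which is why intervals based on small 
samples are wider.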

In [None]:
12. What is t-statistic? State the formula for t-statistic.

In [None]:
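The t-statistic measures how far a sample estimate lies from its hypothesized value, in units of standard error. For a single sample it is 
t = (x̄ - μ) / (s / √n). For two independent samples, it is used to determine whether the difference between the sample means is 
statistically significant, and is calculated as the difference between the sample means divided by the standard error of the difference.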

The formula for t-statistic is:

t = (x̄1 - x̄2) / (s√(1/n1 + 1/n2))

where:

x̄1 and x̄2 are the sample means of the two independent samples
s is the pooled standard deviation of the two samples, which is calculated as:
s = sqrt(((n1-1)s1^2 + (n2-1)s2^2) / (n1 + n2 - 2))
n1 and n2 are the sample sizes of the two independent samples

The t-statistic is then compared to the critical values from the t-distribution with degrees of freedom equal to (n1 + n2 - 2) to determine 
the p-value and the significance of the difference between the two sample means.
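
The formula translates directly into code. The helper name pooled_t_statistic is hypothetical, and the inputs are the summary statistics 
of the samples [1, 2, 3, 4, 5] and [6, 7, 8, 9, 10] (means 3 and 8, each with sample variance 2.5).

```python
import math

def pooled_t_statistic(mean1, s1, n1, mean2, s2, n2):
    """Pooled-variance t-statistic and degrees of freedom for two independent samples."""
    df = n1 + n2 - 2
    # Pooled standard deviation: weighted average of the two sample variances
    s_pooled = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)
    # Difference in means divided by the standard error of the difference
    t = (mean1 - mean2) / (s_pooled * math.sqrt(1 / n1 + 1 / n2))
    return t, df

t_stat, df = pooled_t_statistic(3.0, math.sqrt(2.5), 5, 8.0, math.sqrt(2.5), 5)
print("t-statistic:", t_stat, "df:", df)
```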

In [None]:
13. A coffee shop owner wants to estimate the average daily revenue for their shop. They take a random sample of 50 days and find the sample mean 
    revenue to be $500 with a standard deviation of $50. Estimate the population mean revenue with a 95% confidence interval.

In [None]:
ANS- To estimate the population mean revenue with a 95% confidence interval, we can use the following formula:

Confidence interval = sample mean ± (critical value) x (standard error)

where:

sample mean = $500
standard deviation = $50
sample size (n) = 50
degrees of freedom (df) = n - 1 = 49
critical value for a 95% confidence interval and df = 49 is 2.01 (from t-distribution table)

First, we need to calculate the standard error:

standard error = standard deviation / sqrt(sample size)
= $50 / sqrt(50)
= $7.07

Then, we can plug in the values into the formula:

Confidence interval = $500 ± (2.01) x ($7.07)
= $500 ± $14.21

So the 95% confidence interval for the population mean revenue is ($485.79, $514.21). We can be 95% confident that the true population mean 
revenue is between these two values based on the sample data.
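
The same interval can be computed without a table. This sketch takes the critical value from scipy.stats.t.ppf instead of rounding it 
to 2.01; the variable names are illustrative.

```python
import math
from scipy import stats

sample_mean = 500.0
sample_std = 50.0
n = 50
confidence = 0.95

# Standard error of the mean
standard_error = sample_std / math.sqrt(n)

# Two-sided t critical value with n - 1 degrees of freedom
t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)

margin = t_crit * standard_error
lower, upper = sample_mean - margin, sample_mean + margin

print(f"critical value: {t_crit:.4f}")
print(f"95% CI: (${lower:.2f}, ${upper:.2f})")
```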

In [None]:
14. A researcher hypothesizes that a new drug will decrease blood pressure by 10 mmHg. They conduct a clinical trial with 100 patients and find that 
    the sample mean decrease in blood pressure is 8 mmHg with a standard deviation of 3 mmHg. Test the hypothesis with a significance level of 0.05.

In [None]:
To test the hypothesis, we can use a one-sample t-test.

The null hypothesis is that the true population mean decrease in blood pressure is equal to 10 mmHg. 
The alternative hypothesis is that the true population mean decrease in blood pressure is less than 10 mmHg.

Let's set the significance level (alpha) to 0.05.

First, we need to calculate the t-statistic:

t = (sample mean - hypothesized mean) / (standard deviation / sqrt(sample size))
= (8 - 10) / (3 / sqrt(100))
= -2 / 0.3
= -6.67

Using a t-distribution table with degrees of freedom (df) = n - 1 = 99 and alpha = 0.05 for a one-tailed test (since we are testing if the mean 
decrease in blood pressure is less than 10 mmHg), we find the critical t-value to be -1.66.

Since the calculated t-value (-6.67) is less than the critical t-value (-1.66), we reject the null hypothesis. 
This means that there is enough evidence to conclude that the new drug decreases blood pressure by less than 10 mmHg.

Therefore, the researcher's hypothesis is not supported by the sample data.
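
A short sketch of the same left-tailed test in code, using only the summary statistics given in the question. The variable names are 
illustrative.

```python
import math
from scipy import stats

sample_mean = 8.0   # observed mean decrease (mmHg)
mu0 = 10.0          # hypothesized mean decrease (mmHg)
s = 3.0             # sample standard deviation (mmHg)
n = 100
alpha = 0.05

# One-sample t-statistic
t_stat = (sample_mean - mu0) / (s / math.sqrt(n))

# Left-tailed critical value and p-value with n - 1 degrees of freedom
t_crit = stats.t.ppf(alpha, df=n - 1)
p_value = stats.t.cdf(t_stat, df=n - 1)

reject = bool(t_stat < t_crit)
print(f"t = {t_stat:.2f}, critical value = {t_crit:.2f}, p-value = {p_value:.2e}")
print("Reject H0" if reject else "Fail to reject H0")
```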

In [None]:
15. An electronics company produces a certain type of product with a mean weight of 5 pounds and a standard deviation of 0.5 pounds. A random sample 
    of 25 products is taken, and the sample mean weight is found to be 4.8 pounds. Test the hypothesis that the true mean weight of the products is 
    less than 5 pounds with a significance level of 0.01.

In [None]:
We need to test the hypothesis that the true mean weight of the products is less than 5 pounds.

Null hypothesis: The true mean weight of the products is equal to 5 pounds.
Alternative hypothesis: The true mean weight of the products is less than 5 pounds.

Let's use the stated significance level of 0.01.

We can use the one-sample t-test to test the hypothesis.

The test statistic can be calculated as:

t = (x̄ - μ) / (s / √n)

where x̄ is the sample mean weight, μ is the hypothesized true mean weight (5 pounds), s is the sample standard deviation, and n is the sample size.

Substituting the values, we get:

t = (4.8 - 5) / (0.5 / √25) = -2

The degrees of freedom for the t-distribution is (n - 1) = (25 - 1) = 24.

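Using a t-table or a t-distribution calculator, the critical t-value for a one-tailed test with a significance level of 0.01 and 24 
degrees of freedom is -2.492.

Since the calculated t-value (-2) is greater than the critical t-value (-2.492), it does not fall in the rejection region, so we fail to 
reject the null hypothesis. The same decision follows from a z-test (critical value -2.326), which is arguably the more appropriate test 
here because 0.5 pounds is given as the population standard deviation.

Conclusion: There is not sufficient evidence to conclude that the true mean weight of the products is less than 5 pounds at a significance level of 0.01.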
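
A quick way to check the decision is to compute both critical values in code. The sketch treats 0.5 first as a sample standard deviation 
(t-test) and then as a known population standard deviation (z-test); the variable names are illustrative.

```python
import math
from scipy import stats

sample_mean = 4.8
mu0 = 5.0
sd = 0.5
n = 25
alpha = 0.01

# The test statistic has the same value under either reading of sd
stat = (sample_mean - mu0) / (sd / math.sqrt(n))

# Left-tailed critical values
t_crit = stats.t.ppf(alpha, df=n - 1)  # if sd is a sample estimate
z_crit = stats.norm.ppf(alpha)         # if sd is the known population value

reject_t = bool(stat < t_crit)
reject_z = bool(stat < z_crit)

print(f"statistic = {stat:.3f}")
print(f"t critical = {t_crit:.3f}, reject: {reject_t}")
print(f"z critical = {z_crit:.3f}, reject: {reject_z}")
```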

In [None]:
16. Two groups of students are given different study materials to prepare for a test. The first group (n1 =30) has a mean score of 80 with a 
standard deviation of 10, and the second group (n2 = 40) has a mean score of 75 with a standard deviation of 8. Test the hypothesis that the 
population means for the two groups are equal with a significance level of 0.01.

In [None]:
We can use a two-sample t-test to test the hypothesis that the population means for the two groups are equal. 

The null hypothesis is that the population means are equal, and the alternative hypothesis is that they are not equal.

Assuming equal variances, the formula for the test statistic is:

t = (x1 - x2) / (s_pool * sqrt(1/n1 + 1/n2))

where x1 and x2 are the sample means, s_pool is the pooled standard deviation, and n1 and n2 are the sample sizes.

To calculate the pooled standard deviation, we use the formula:

s_pool = sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))

where s1 and s2 are the sample standard deviations.

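Using a significance level of 0.01 and a two-tailed test, the critical value for t with degrees of freedom (df) = n1 + n2 - 2 = 68 and 
α/2 = 0.005 is approximately ±2.650. Printed tables that skip from df = 60 to df = 70 give ±2.660 for the df = 60 row, which is slightly 
more conservative and is the value used in the code.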

We can calculate the values as follows:

In [4]:
import math

# sample statistics
x1 = 80
s1 = 10
n1 = 30

x2 = 75
s2 = 8
n2 = 40

# calculate pooled standard deviation
s_pool = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))

# calculate test statistic
t = (x1 - x2) / (s_pool * math.sqrt(1/n1 + 1/n2))

# calculate critical value
cv = 2.660

# compare test statistic to critical value
if abs(t) > cv:
    print("Reject null hypothesis: population means are not equal")
else:
    print("Fail to reject null hypothesis: population means may be equal")

Fail to reject null hypothesis: population means may be equal


In [None]:
The calculated t-value is 2.32, which is smaller in absolute value than the critical value of 2.660. 
Therefore, we fail to reject the null hypothesis and conclude that there is not enough evidence to suggest that the population means for the 
two groups are different at a significance level of 0.01.
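
When only summary statistics are available, scipy.stats.ttest_ind_from_stats runs the same pooled test and also returns a p-value. 
The sketch below assumes equal variances, as the worked solution does, and computes the exact critical value for 68 degrees of freedom 
instead of reading it from a table.

```python
from scipy import stats

alpha = 0.01

# Pooled two-sample t-test from summary statistics (two-sided)
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=80, std1=10, nobs1=30,
    mean2=75, std2=8, nobs2=40,
    equal_var=True,
)

# Exact two-tailed critical value for df = n1 + n2 - 2 = 68
t_crit = stats.t.ppf(1 - alpha / 2, df=30 + 40 - 2)

reject = bool(p_value < alpha)
print(f"t = {t_stat:.3f}, critical value = {t_crit:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if reject else "Fail to reject H0")
```

Comparing the p-value with alpha and comparing the statistic with the critical value are equivalent ways of making the same decision.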

In [None]:
17. A marketing company wants to estimate the average number of ads watched by viewers during a TV program. They take a random sample of 50 viewers 
    and find that the sample mean is 4 with a standard deviation of 1.5. Estimate the population mean with a 99% confidence interval.

In [None]:
ANS- To estimate the population mean with a 99% confidence interval, we can use the following formula:

CI = x̄ ± z*(s/√n)

where:

x̄ = sample mean = 4
s = sample standard deviation = 1.5 (used in place of the unknown population standard deviation σ)
n = sample size = 50
z = z-score for 99% confidence level = 2.576 (a large-sample approximation; the t critical value for 49 degrees of freedom, about 2.68, 
    would give a slightly wider interval)

First, we calculate the standard error of the mean from the sample standard deviation:

SE = s/√n = 1.5/√50 = 0.212

Substituting the values in the formula, we get:

CI = 4 ± 2.576*(0.212)
CI = 4 ± 0.546
CI = [3.454, 4.546]

Therefore, we can say with 99% confidence that the population mean number of ads watched by viewers during a TV program is between 3.454 and 4.546.
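
The calculation can be reproduced with the standard library alone. This sketch uses statistics.NormalDist for the z-score, following the 
large-sample z approximation used in the solution; the variable names are illustrative.

```python
import math
from statistics import NormalDist

x_bar = 4.0        # sample mean
s = 1.5            # sample standard deviation
n = 50             # sample size
confidence = 0.99

# z-score that leaves (1 - confidence) / 2 in each tail
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Standard error of the mean and margin of error
standard_error = s / math.sqrt(n)
margin = z * standard_error

lower, upper = x_bar - margin, x_bar + margin
print(f"z = {z:.3f}, SE = {standard_error:.3f}")
print(f"99% CI: [{lower:.3f}, {upper:.3f}]")
```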