Question 1: What is Estimation Statistics? Explain point estimate and interval estimate

Answer :
Estimation statistics is the branch of statistics concerned with inferring population parameters, such as a mean or a proportion, from sample data. Because measuring an entire population is rarely practical, a sample is drawn and statistical methods are used to estimate the values that describe the population.
There are two main types of estimation in statistics: point estimation and interval estimation.
Point Estimation: Point estimation involves estimating a population parameter by a single value, which is called a point estimate. A point estimate is usually a sample statistic, such as the sample mean, sample proportion, or sample standard deviation. For example, if you want to estimate the average height of all people in a city, you can take a sample of people and calculate the mean height of that sample. This mean height is a point estimate of the population mean height. On its own, a point estimate says nothing about how far it may be from the true value.

Interval Estimation: Interval estimation involves estimating a population parameter by a range of values, which is called an interval estimate. The most common interval estimate is the confidence interval, whose width indicates the precision of the estimate: a narrower interval means less uncertainty. A confidence interval is usually computed as a sample statistic plus or minus a margin of error, where the margin of error depends on the chosen confidence level (for example 95%), the variability of the sample, and the sample size. For example, if you want to estimate the average height of all people in a city, you can take a sample of people and calculate the mean height of that sample, along with a margin of error. The confidence interval for the population mean height then runs from the sample mean minus the margin of error to the sample mean plus the margin of error.

In summary, point estimation involves estimating a population parameter by a single value, while interval estimation involves estimating a population parameter by a range of values that provides a measure of the precision or uncertainty of the estimate.
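
As a minimal sketch of the two kinds of estimate, the snippet below computes a point estimate and a 95% interval estimate for a mean. The height values are made up for illustration and do not come from any real survey. The interval uses the t distribution through scipy.stats.t.interval, on the assumption that the sample is random and roughly normal.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of heights in cm (illustrative values only)
heights = np.array([162.0, 170.5, 158.2, 175.1, 168.4, 181.0, 165.3, 172.8])

# Point estimate: a single number, here the sample mean
point_estimate = heights.mean()

# Interval estimate: sample mean +/- margin of error, based on the t distribution
std_error = stats.sem(heights)   # sample std (ddof=1) divided by sqrt(n)
dof = len(heights) - 1
lower, upper = stats.t.interval(0.95, dof, loc=point_estimate, scale=std_error)

margin_of_error = (upper - lower) / 2
print(f"Point estimate  : {point_estimate:.2f} cm")
print(f"Margin of error : {margin_of_error:.2f} cm")
print(f"95% interval    : ({lower:.2f}, {upper:.2f}) cm")
```

The interval is symmetric around the point estimate, so the point estimate can always be recovered as the midpoint of the interval.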

Question 2 : Write a Python function to estimate the population mean using a sample mean and standard deviation.

Answer :
1. Creating an estimate_pop_mean function

In [1]:
import math
import scipy.stats as stats

def estimate_pop_mean(samples, confidence_level=0.95):
    # calculate the sample mean and standard deviation
    sample_mean = sum(samples) / len(samples)
    sample_std = math.sqrt(sum([(x - sample_mean)**2 for x in samples]) / (len(samples) - 1))

    # calculate the t-value for the desired level of confidence and degrees of freedom
    alpha = 1 - confidence_level
    dof = len(samples) - 1
    t_value = stats.t.ppf(1 - alpha/2, dof)

    # calculate the standard error and margin of error
    std_error = sample_std / math.sqrt(len(samples))
    margin_of_error = t_value * std_error

    # calculate the confidence interval bounds
    lower_bound = sample_mean - margin_of_error
    upper_bound = sample_mean + margin_of_error

    # return the confidence interval as a tuple
    return (lower_bound, upper_bound)

2. Example with sample and population data


In [2]:
import numpy as np
import random

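# Seed NumPy's generator so the population is reproducible.
# Note: random.sample() below uses Python's own generator, which is not seeded here,
# so the drawn sample (and the printed estimates) will vary between runs.
np.random.seed(42)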

# Create a population of 1000 random values between 0 and 100
population_size = 1000
population = np.random.uniform(low=0, high=100, size=population_size)

# Take a random sample of 100 values from the population
sample_size = 100
sample = random.sample(list(population), sample_size)

# Estimate the population mean and interval using the sample data
lower_bound, upper_bound = estimate_pop_mean(sample)

# Print the point estimate and the 95% confidence interval
print(f"ESTIMATED population mean point estimate is : {np.mean(sample)}")
print(f"ESTIMATED population mean with 95% confidence interval : ({lower_bound},{upper_bound})")
print('\n==========================================================================================================\n')
# Printing Actual population mean 
print(f"ACTUAL Population mean is : {np.mean(population)}")

ESTIMATED population mean point estimate is : 47.368925783282364
ESTIMATED population mean with 95% confidence interval : (41.64314360778864,53.09470795877612)
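
==========================================================================================================
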
ACTUAL Population mean is : 49.02565533201336
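
One way to read the 95% confidence level is as a long-run coverage rate: if the sampling were repeated many times, roughly 95% of the resulting intervals would contain the true mean. The simulation below is a rough sketch of that idea. It assumes fresh draws from Uniform(0, 100) on every trial rather than the fixed population above, and `t_interval` is a compact stand-in written for this sketch that applies the same formula as estimate_pop_mean.

```python
import numpy as np
from scipy import stats

def t_interval(samples, confidence_level=0.95):
    """Return (lower, upper) t-based confidence bounds for the mean."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    mean = samples.mean()
    std_error = samples.std(ddof=1) / np.sqrt(n)
    t_value = stats.t.ppf(1 - (1 - confidence_level) / 2, n - 1)
    return mean - t_value * std_error, mean + t_value * std_error

rng = np.random.default_rng(0)   # seeded so the simulation is repeatable
true_mean = 50.0                 # mean of Uniform(0, 100)
n_trials, sample_size = 2000, 100

hits = 0
for _ in range(n_trials):
    sample = rng.uniform(0, 100, size=sample_size)
    lower, upper = t_interval(sample)
    # Count the trial as a hit when the interval contains the true mean
    hits += int(lower <= true_mean <= upper)

coverage = hits / n_trials
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

Any single interval either contains the true mean or it does not; the 95% figure describes the procedure over many repetitions, not one particular interval.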


Question 3 : What is Hypothesis testing? Why is it used? State the importance of Hypothesis testing

Answer :
Hypothesis testing is a statistical method used to determine whether a hypothesis about a population parameter is supported by the evidence provided by a sample of data. The process involves making a statistical inference about a population parameter, such as a mean or a proportion, based on sample data.
In hypothesis testing, a null hypothesis is initially assumed to be true, and then evidence is gathered and analyzed to determine whether the null hypothesis can be rejected in favor of an alternative hypothesis. The null hypothesis represents the status quo or the default assumption, while the alternative hypothesis represents the proposed change or difference.
Hypothesis testing is used in many fields, including science, engineering, business, and social sciences, to draw conclusions about the population based on sample data. For example, a pharmaceutical company may use hypothesis testing to determine whether a new drug is more effective than an existing drug. A marketing team may use hypothesis testing to determine whether a new advertising campaign is more effective than the current one.
The importance of hypothesis testing lies in its ability to make statistical inferences about a population based on sample data, while accounting for the uncertainty and variability of the data. It provides a systematic and objective way to test a hypothesis and draw conclusions based on the evidence. Hypothesis testing allows us to make informed decisions based on data, rather than relying on intuition or guesswork. It is a critical tool in scientific research, where the validity and reliability of the results depend on the appropriate use of hypothesis testing.
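
The procedure described above (state a null hypothesis, collect a sample, and measure the evidence against the null) can be sketched with a one-sample t-test. The measurements and the claimed mean of 5.0 are hypothetical values chosen only for illustration, and the test assumes the data are a random sample from a roughly normal population.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements (illustrative values only)
sample = np.array([5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.3, 5.7])
claimed_mean = 5.0   # H0: the population mean equals 5.0
alpha = 0.05         # significance level, chosen before looking at the data

# One-sample t-test; the reported p-value is two-sided by default
result = stats.ttest_1samp(sample, popmean=claimed_mean)

print(f"t-statistic : {result.statistic:.3f}")
print(f"p-value     : {result.pvalue:.4f}")
if result.pvalue < alpha:
    print("Reject H0: the sample is unlikely under the claimed mean")
else:
    print("Fail to reject H0: the sample is consistent with the claimed mean")
```

Failing to reject the null hypothesis does not prove it true; it only means the sample did not provide enough evidence against it at the chosen significance level.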


Question 4 : Create a hypothesis that states whether the average weight of male college students is greater than the average weight of female college students.

Answer :
Null hypothesis: The average weight of male college students is equal to or less than the average weight of female college students.
Alternative hypothesis: The average weight of male college students is greater than the average weight of female college students.
In statistical notation, this can be represented as:
H0: μm ≤ μf
Ha: μm > μf
where:
H0 = null hypothesis
Ha = alternative hypothesis
μm = population mean weight of male college students
μf = population mean weight of female college students
The null hypothesis states that male college students are, on average, no heavier than female college students. It is the default position and is retained unless the sample provides strong evidence against it.
The alternative hypothesis states that male college students have a higher average weight than female college students. Because the alternative points in one direction only, this is a one-tailed (right-tailed) test.
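
A right-tailed test of these hypotheses could be sketched as follows. The weights are invented for illustration only, and the choice of Welch's t-test (which does not assume equal variances in the two groups) is an assumption of this sketch, not something the question prescribes. The `alternative="greater"` argument of scipy.stats.ttest_ind requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Hypothetical weights in kg (illustrative values only)
male_weights = np.array([72.5, 80.1, 68.4, 77.3, 85.0, 74.2, 79.6, 70.8])
female_weights = np.array([58.3, 63.1, 55.7, 66.4, 60.2, 61.9, 57.5, 64.8])

alpha = 0.05

# H0: mu_m <= mu_f   versus   Ha: mu_m > mu_f   (right-tailed)
# equal_var=False selects Welch's t-test
result = stats.ttest_ind(male_weights, female_weights,
                         equal_var=False, alternative="greater")

print(f"t-statistic       : {result.statistic:.3f}")
print(f"one-sided p-value : {result.pvalue:.6f}")
if result.pvalue < alpha:
    print("Reject H0 in favor of Ha: mu_m > mu_f")
else:
    print("Fail to reject H0: mu_m <= mu_f")
```

When the t-statistic is positive, the one-sided p-value is half of the two-sided one, which is why a one-tailed test is more sensitive to differences in the hypothesized direction.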



Question 5 : Write a Python script to conduct a hypothesis test on the difference between two population means, given a sample from each population.

Answer :
This is done with the t distribution from scipy.stats, using a pooled-variance (equal-variance) two-sample t-test.

In [3]:
import numpy as np
from scipy.stats import t

# Sample data
sample1 = np.array([2.5, 3.2, 2.9, 3.8, 3.5])
sample2 = np.array([2.1, 3.0, 3.5, 2.8, 3.2])

# Sample statistics
n1 = len(sample1)
n2 = len(sample2)
mean1 = np.mean(sample1)
mean2 = np.mean(sample2)
std1 = np.std(sample1, ddof=1)
std2 = np.std(sample2, ddof=1)

# Null and alternative hypotheses
null_hypothesis = "The population means of the two samples are EQUAL"
alt_hypothesis = "The population means of the two samples are NOT equal"

# Calculate the pooled standard deviation (assumes equal population variances),
# then the t-statistic and degrees of freedom
sp = np.sqrt(((n1-1)*(std1**2) + (n2-1)*(std2**2)) / (n1+n2-2))
t_stat = (mean1 - mean2) / (sp * np.sqrt(1/n1 + 1/n2))
df = n1 + n2 - 2

# Calculate the two-sided p-value (matching the "not equal" alternative) and the critical value
p_value = 2 * (1 - t.cdf(abs(t_stat), df=df))
alpha = 0.05
t_crit = t.ppf(1-alpha/2, df=df)

# Print p_value
print(f'p-value : {p_value:.4f}')

# Print t_stat
print(f't-statistic : {t_stat}')

# Print t_crit
print(f't-crit : {t_crit}')

# Decision rule 1: compare the p-value to the significance level
if p_value < alpha:
    print("Reject the null hypothesis: " + null_hypothesis)
    print("Accept the alternative hypothesis: " + alt_hypothesis)
else:
    print("Fail to reject the null hypothesis: " + null_hypothesis)

# Decision rule 2 (equivalent): compare |t| to the two-sided critical value
if abs(t_stat) > t_crit:
    print("Reject the null hypothesis: " + null_hypothesis)
    print("Accept the alternative hypothesis: " + alt_hypothesis)
else:
    print("Fail to reject the null hypothesis: " + null_hypothesis)

p-value : 0.4492
t-statistic : 0.7955870797707367
t-crit : 2.3060041350333704
Fail to reject the null hypothesis: The population means of the two samples are EQUAL
Fail to reject the null hypothesis: The population means of the two samples are EQUAL
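
The manual calculation can be cross-checked against scipy.stats.ttest_ind, which reports a two-sided p-value by default. The sketch below runs the pooled-variance test on the same two samples and, as an alternative that is not part of the script above, Welch's t-test, which drops the equal-variance assumption.

```python
import numpy as np
from scipy import stats

# Same sample data as in the script above
sample1 = np.array([2.5, 3.2, 2.9, 3.8, 3.5])
sample2 = np.array([2.1, 3.0, 3.5, 2.8, 3.2])

# Pooled-variance (Student) t-test: same formula as the manual calculation
pooled = stats.ttest_ind(sample1, sample2, equal_var=True)

# Welch's t-test: does not assume equal population variances
welch = stats.ttest_ind(sample1, sample2, equal_var=False)

print(f"pooled : t = {pooled.statistic:.4f}, two-sided p = {pooled.pvalue:.4f}")
print(f"welch  : t = {welch.statistic:.4f}, two-sided p = {welch.pvalue:.4f}")
```

Because both samples have the same size, the two tests produce the same t-statistic here; they differ only in the degrees of freedom used to compute the p-value.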
