Q1.

The standard deviation (SD) and the standard error of the mean (SEM) differ in what they describe: the SD applies to the distribution of individual data points, while the SEM applies to the distribution of sample means.
In addition, the standard deviation does not systematically shrink as the sample grows, whereas the standard error of the mean decreases as the sample size increases.

The standard error of the mean measures the precision of the sample mean as an estimate of the population mean. It is primarily used to draw conclusions about the population mean based on sample data. It captures how much the sample mean is likely to vary from sample to sample. A smaller SEM suggests that the sample mean is likely to be a more precise estimate of the population mean, while a larger SEM suggests greater uncertainty about how well the sample mean represents the population mean.


The standard deviation measures the dispersion of the individual data points in a dataset relative to the mean of the dataset. It can be used to quantify how much the individual data points in the sample differ from the mean. A larger SD means the data points lie farther from the mean on average, so the dataset is more variable, while a smaller SD indicates a more stable dataset in which the data points are closer to the mean.
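
The contrast between the two quantities can be illustrated with a small simulation. This is only a sketch: the normal population (mean 50, SD 10), the sample sizes, and the seed are assumptions chosen for illustration, not values from the course material.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumed seed, for reproducibility only

sample_sizes = (20, 200, 2000, 20000)  # hypothetical sample sizes
sds, sems = [], []

for n in sample_sizes:
    # Draw a sample from an assumed normal population (mean 50, SD 10)
    x = rng.normal(loc=50, scale=10, size=n)
    sd = x.std(ddof=1)       # spread of the individual data points
    sem = sd / np.sqrt(n)    # estimated spread of the sample mean
    sds.append(sd)
    sems.append(sem)
    print(f"n={n:>6}  SD={sd:6.2f}  SEM={sem:6.3f}")
```

The SD column should stay near 10 for every sample size, while the SEM column keeps shrinking as n grows.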

Q2.

To create a 95% confidence interval that covers about 95% of the bootstrapped sample means using the SEM, we can follow these steps:

Firstly, we resample the original data with replacement many times (the video notes that a computer is usually used to do this thousands of times). Each bootstrap sample has the same size as the original data. Then, for each resample, we calculate the sample mean, which generates a distribution of bootstrapped sample means.

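Secondly, we estimate the SEM as the standard deviation of the bootstrapped means. No further division by the square root of n is needed, because the spread of the bootstrapped means already reflects the sample size; dividing by the square root of n applies only when starting from the SD of the original data points.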

Lastly, the 95% confidence interval is the mean of the bootstrapped means plus or minus 1.96 multiplied by the SEM, where 1.96 is the z-score corresponding to the 95% confidence level for a normal distribution.
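
The three steps above can be sketched in Python as follows. The sample values, the number of resamples, and the seed are assumptions made for illustration; the SEM is taken as the standard deviation of the bootstrapped means.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only

# Hypothetical sample; replace with the actual data
sample = np.array([1.2, 2.4, 2.7, 3.1, 4.3, 5.6, 6.2])

# Step 1: resample with replacement and record the mean of each resample
n_boot = 5000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

# Step 2: the SD of the bootstrapped means estimates the SEM
sem = boot_means.std(ddof=1)

# Step 3: mean of the bootstrapped means +/- 1.96 * SEM
center = boot_means.mean()
ci_lower = center - 1.96 * sem
ci_upper = center + 1.96 * sem
print(f"SEM estimate: {sem:.3f}")
print(f"95% CI: ({ci_lower:.3f}, {ci_upper:.3f})")
```

This interval relies on the bootstrapped means being roughly normally distributed, which is what justifies the 1.96 multiplier.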

Q3.

In [None]:
# Assuming 'boot_means' is a vector of bootstrapped sample means
lower_bound <- quantile(boot_means, 0.025)  # 2.5th percentile
upper_bound <- quantile(boot_means, 0.975)  # 97.5th percentile
ci <- c(lower_bound, upper_bound)           # 95% Confidence Interval


To create a 95% bootstrapped confidence interval using the bootstrapped means (without using their standard deviation to estimate the standard error of the mean), we can follow these steps:

Firstly, resample the original dataset with replacement many times. Each bootstrap sample has the same size as the original dataset. For each resample, compute the sample mean, which produces a distribution of bootstrapped means.

Secondly, after deriving a distribution of bootstrapped means, sort them in ascending order.

Lastly, identify the 2.5th percentile and the 97.5th percentile of the sorted bootstrapped means. These percentiles correspond respectively to the lower and upper bounds of the confidence interval. The 95% confidence interval is the range from the value at the 2.5th percentile to the value at the 97.5th percentile, which covers the middle 95% of the bootstrapped means.

Q4.

In [None]:
import numpy as np

# Sample data (replace with your actual sample)
sample = np.array([1.2, 2.4, 2.7, 3.1, 4.3, 5.6, 6.2])

# Function to calculate the bootstrap confidence interval for the population mean
def bootstrap_ci_mean(sample, n_bootstrap=1000, ci=95):
    np.random.seed(42)  # Set seed for reproducibility
    bootstrap_means = []  # To store the mean of each bootstrap sample
    
    # Generate bootstrap samples
    for _ in range(n_bootstrap):
        # Draw a random sample (with replacement) of the same size as the original sample
        bootstrap_sample = np.random.choice(sample, size=len(sample), replace=True)
        
        # Calculate the mean of the bootstrap sample
        bootstrap_means.append(np.mean(bootstrap_sample))  
    
    # Calculate the percentiles for the desired confidence interval
    lower_bound = np.percentile(bootstrap_means, (100 - ci) / 2)
    upper_bound = np.percentile(bootstrap_means, 100 - (100 - ci) / 2)
    
    return lower_bound, upper_bound

# Calculate the 95% confidence interval for the population mean
mean_ci = bootstrap_ci_mean(sample)
print(f"95% Bootstrap CI for the population mean: {mean_ci}")


# --- Modifications for Other Population Parameters ---

# To modify this code for calculating a 95% bootstrap confidence interval 
# for a different population parameter (e.g., median), you need to change 
# the statistic being computed within the bootstrap sampling loop.

# Example for population median:

def bootstrap_ci_median(sample, n_bootstrap=1000, ci=95):
    np.random.seed(42)  # Set seed for reproducibility
    bootstrap_medians = []  # To store the median of each bootstrap sample
    
    # Generate bootstrap samples
    for _ in range(n_bootstrap):
        # Draw a random sample (with replacement) of the same size as the original sample
        bootstrap_sample = np.random.choice(sample, size=len(sample), replace=True)
        
        # Change here: Calculate the median instead of the mean
        bootstrap_medians.append(np.median(bootstrap_sample))  
    
    # Calculate the percentiles for the desired confidence interval
    lower_bound = np.percentile(bootstrap_medians, (100 - ci) / 2)
    upper_bound = np.percentile(bootstrap_medians, 100 - (100 - ci) / 2)
    
    return lower_bound, upper_bound

# Calculate the 95% confidence interval for the population median
median_ci = bootstrap_ci_median(sample)
print(f"95% Bootstrap CI for the population median: {median_ci}")


# Example for population variance:

def bootstrap_ci_variance(sample, n_bootstrap=1000, ci=95):
    np.random.seed(42)  # Set seed for reproducibility
    bootstrap_variances = []  # To store the variance of each bootstrap sample
    
    # Generate bootstrap samples
    for _ in range(n_bootstrap):
        # Draw a random sample (with replacement) of the same size as the original sample
        bootstrap_sample = np.random.choice(sample, size=len(sample), replace=True)
        
        # Change here: Calculate the variance instead of the mean
        bootstrap_variances.append(np.var(bootstrap_sample))  
    
    # Calculate the percentiles for the desired confidence interval
    lower_bound = np.percentile(bootstrap_variances, (100 - ci) / 2)
    upper_bound = np.percentile(bootstrap_variances, 100 - (100 - ci) / 2)
    
    return lower_bound, upper_bound

# Calculate the 95% confidence interval for the population variance
variance_ci = bootstrap_ci_variance(sample)
print(f"95% Bootstrap CI for the population variance: {variance_ci}")


# Example for population standard deviation:

def bootstrap_ci_std(sample, n_bootstrap=1000, ci=95):
    np.random.seed(42)  # Set seed for reproducibility
    bootstrap_stds = []  # To store the standard deviation of each bootstrap sample
    
    # Generate bootstrap samples
    for _ in range(n_bootstrap):
        # Draw a random sample (with replacement) of the same size as the original sample
        bootstrap_sample = np.random.choice(sample, size=len(sample), replace=True)
        
        # Change here: Calculate the standard deviation instead of the mean
        bootstrap_stds.append(np.std(bootstrap_sample))  
    
    # Calculate the percentiles for the desired confidence interval
    lower_bound = np.percentile(bootstrap_stds, (100 - ci) / 2)
    upper_bound = np.percentile(bootstrap_stds, 100 - (100 - ci) / 2)
    
    return lower_bound, upper_bound

# Calculate the 95% confidence interval for the population standard deviation
std_ci = bootstrap_ci_std(sample)
print(f"95% Bootstrap CI for the population standard deviation: {std_ci}")


Comment:
Change for the Median:
Replace np.mean(bootstrap_sample) with np.median(bootstrap_sample) to calculate the median instead of the mean in each bootstrap sample.

Change for the Variance:
Replace np.mean(bootstrap_sample) with np.var(bootstrap_sample) to calculate the variance for each bootstrap sample.

Change for the Standard Deviation:
Replace np.mean(bootstrap_sample) with np.std(bootstrap_sample) to calculate the standard deviation for each bootstrap sample.

Bootstrap Resampling:
The key mechanism stays the same: repeatedly sample from the original dataset (with replacement) and compute the desired statistic for each bootstrap sample.

Confidence Interval Calculation:
The np.percentile function is used to compute the lower and upper bounds for the 95% confidence interval, based on the distribution of bootstrap statistics.
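
Since only the statistic changes between the functions above, the four versions could be folded into a single function that takes the statistic as an argument. The sketch below is one possible way to do this, not part of the original answer; the name `bootstrap_ci` and its `statistic` and `seed` parameters are hypothetical.

```python
import numpy as np

def bootstrap_ci(sample, statistic=np.mean, n_bootstrap=1000, ci=95, seed=42):
    """Percentile bootstrap CI for any statistic of a 1-D sample."""
    rng = np.random.default_rng(seed)  # local generator; seed is an assumption
    sample = np.asarray(sample)

    # Apply the chosen statistic to each bootstrap resample
    boot_stats = np.array([
        statistic(rng.choice(sample, size=sample.size, replace=True))
        for _ in range(n_bootstrap)
    ])

    # Percentile bounds for the requested confidence level
    alpha = (100 - ci) / 2
    lower, upper = np.percentile(boot_stats, [alpha, 100 - alpha])
    return lower, upper

# Same sample data as in the cell above
sample = np.array([1.2, 2.4, 2.7, 3.1, 4.3, 5.6, 6.2])

mean_ci = bootstrap_ci(sample, np.mean)
median_ci = bootstrap_ci(sample, np.median)
std_ci = bootstrap_ci(sample, np.std)
print(f"mean: {mean_ci}\nmedian: {median_ci}\nstd: {std_ci}")
```

Passing the statistic as a function keeps the resampling and percentile logic in one place, so switching parameters means changing only one argument.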

https://chatgpt.com/share/66feea75-22ec-8007-8402-879dde809375

Q5.

The distinction matters because the population parameter and the sample statistic serve different roles in statistical inference.

The population parameter is what we are ultimately interested in. However, because it is usually impractical or impossible to survey the entire population, we typically cannot measure it directly. It is a fixed but unknown value that describes a characteristic of the whole population.

The sample statistic is a value calculated from a sample that estimates the population parameter. Unlike the population parameter, it is not fixed: it varies from sample to sample, because different samples can produce slightly different estimates.

The confidence interval is built from the sample statistic and the variability in the sample, but it is used to make inferences about the unknown population parameter. The sample statistic sits at the center of the confidence interval computation and is the best estimate of the population parameter. The population parameter is the target of the inference.

To sum up, the confidence interval shows how uncertain we are about the estimate obtained from the data. It provides a range of values within which we expect the population parameter to lie, based on the sample statistic. Without distinguishing between the two, we may misunderstand what a confidence interval denotes.

Q6.

1).

Bootstrapping is a resampling technique used in statistics to estimate the distribution of a sample statistic by repeatedly sampling with replacement from the original data. 

The process of bootstrapping includes four steps.

Firstly, randomly choose data points from the sample with replacement, which means the same data point can be picked more than once.

Secondly, repeat this draw until the new resampled dataset is the same size as the original sample.

Thirdly, compute whatever statistic we want to analyze from the resampled dataset just constructed.

Lastly, repeat this resampling process many times and record the statistic each time.

2).

The main purpose of bootstrapping is to get an idea of the variability or uncertainty of an estimate when there is only one sample. Basically, it helps figure out how much the sample statistic might change if different samples were drawn, without actually needing to collect more data.
In other words, it provides confidence intervals and supports inference about the population, even though there is only one set of data.

3).

Firstly, take the sample and randomly resample from it many times with replacement.
Then, for each bootstrap sample, calculate the mean and keep track of all these means.
After repeating the above steps many times, a distribution of bootstrapped means has been constructed.
If the hypothesized population average is within the range of typical values in the bootstrap distribution, the guess might be plausible. However, if the hypothesized population average is far outside this range, it is a sign that the guess might not be accurate.
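
The check described above can be sketched as follows, taking the middle 95% of the bootstrapped means as the "range of typical values". The sample, the seed, and the two hypothesized averages are assumptions for illustration, and `is_plausible` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(7)  # assumed seed, for reproducibility only

# Hypothetical sample; replace with the actual data
sample = np.array([1.2, 2.4, 2.7, 3.1, 4.3, 5.6, 6.2])

# Build the distribution of bootstrapped means
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5000)
])

# Treat the middle 95% of bootstrapped means as the typical range
lower, upper = np.percentile(boot_means, [2.5, 97.5])

def is_plausible(hypothesized_mean):
    """True if the guess falls inside the typical range of bootstrapped means."""
    return bool(lower <= hypothesized_mean <= upper)

for guess in (3.5, 10.0):  # hypothetical guesses for the population average
    print(f"guess={guess}: plausible={is_plausible(guess)}")
```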

Q7.

1). If a 95% confidence interval for the mean includes zero, it suggests that zero is a plausible value for the population mean based on the data. Thus, we cannot confidently say that the population mean is different from zero, so we fail to reject the null hypothesis. Even if the observed sample mean is not zero, a confidence interval that includes zero suggests that the observed effect might not be statistically significant. Because the range of plausible values includes zero, the nonzero sample mean might simply be due to random sampling variability.

2). The following situations would lead us to reject the null hypothesis:
Firstly, if the confidence interval does not include zero, it suggests that zero is not a plausible value for the population mean, providing evidence that the population mean is likely different from zero.

Secondly, if we perform a hypothesis test and find a p-value less than the significance level, it indicates strong evidence against the null hypothesis.

Lastly, a larger sample size typically provides more reliable estimates and narrower confidence intervals. If the population mean truly differs from zero, even by a small amount, a large enough sample can lead to a confidence interval that excludes zero, allowing rejection of the null hypothesis.
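
The effect of sample size on the interval can be illustrated with simulated data. This is a sketch under stated assumptions: the true mean of 0.2, the SD of 1, the two sample sizes, and the seed are all hypothetical, and `percentile_ci` is a helper name introduced here.

```python
import numpy as np

rng = np.random.default_rng(3)  # assumed seed, for reproducibility only

def percentile_ci(values, n_boot=2000):
    """95% percentile bootstrap CI for the mean of values."""
    boot_means = np.array([
        rng.choice(values, size=values.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.percentile(boot_means, [2.5, 97.5])

true_mean = 0.2  # assumed small but nonzero effect
results = {}
for n in (10, 2000):
    # Simulated data from an assumed normal population
    values = rng.normal(loc=true_mean, scale=1.0, size=n)
    lower, upper = percentile_ci(values)
    results[n] = (lower, upper)
    excludes_zero = lower > 0 or upper < 0
    print(f"n={n:>4}: CI=({lower:.3f}, {upper:.3f}), excludes zero: {excludes_zero}")
```

With the larger sample the interval is much narrower, which is what allows it to exclude zero even though the assumed effect is small.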

Q8.

Problem Introduction:
The aim of this analysis is to determine whether the new vaccine developed by AliTech significantly improves the health of patients. We are tasked with testing the null hypothesis of "no effect," meaning there is no significant improvement in health after taking the vaccine.

Null Hypothesis of "No Effect":
The null hypothesis assumes that the vaccine doesn't change health scores.

Data Visualization:
We'll plot the health scores before and after the vaccine to get a first look at any possible improvements.

Quantitative Analysis:
We’ll use a method called bootstrapping to analyze the health score changes. This involves:
1. Health score difference: Subtract the initial score from the final score for each patient.
2. Bootstrapping: Resample the differences with replacement and calculate the average health score difference many times to build a distribution.
3. Confidence Interval: Use the distribution to check whether the difference is significant (if the confidence interval doesn't include zero, there is evidence that the vaccine has an effect).

Supporting Visualizations:
Histogram: Shows the distribution of health score improvements based on bootstrapping.
Initial vs. Final Health: Compare the health scores before and after the vaccine for each patient.

Findings and Discussion:
Conclusion regarding Null Hypothesis of "No Effect":
Based on our analysis, we will either reject the null hypothesis if the confidence interval excludes zero, and conclude that the vaccine has an effect, or fail to reject it if the interval includes zero.

Further Considerations:
Small Sample: With only 10 patients, more testing with larger groups is needed to confirm the results.
No Control Group: A control group (people who didn't take the vaccine) would strengthen the analysis.
Long-term Impact: We only looked at short-term health improvements; long-term effects could be different.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Read data
data = pd.read_csv('alitech_vaccine_data.csv')

# Set random seed for reproducibility
np.random.seed(42)

# Calculate health score difference
data['HealthScoreDifference'] = data['FinalHealthScore'] - data['InitialHealthScore']

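# Bootstrapping: resample with replacement and record the mean of each resample
def bootstrap(values, n_iterations=1000):
    means = []
    for _ in range(n_iterations):
        # Each resample has the same size as the original set of values
        sample = np.random.choice(values, size=len(values), replace=True)
        means.append(np.mean(sample))
    return np.array(means)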

# Apply bootstrapping
bootstrap_means = bootstrap(data['HealthScoreDifference'].values)

# Confidence interval (95%)
confidence_interval = np.percentile(bootstrap_means, [2.5, 97.5])

# Visualize the bootstrapped means
plt.hist(bootstrap_means, bins=30, edgecolor='k')
plt.title('Bootstrap Distribution of Health Score Differences')
plt.axvline(confidence_interval[0], color='red', linestyle='--')
plt.axvline(confidence_interval[1], color='red', linestyle='--')
plt.xlabel('Mean Health Score Difference')
plt.ylabel('Frequency')
plt.show()

# Results
mean_diff = data['HealthScoreDifference'].mean()  # observed mean difference in the original sample
print(f"Observed mean health score difference: {mean_diff:.2f}")
print(f"95% bootstrap confidence interval: {confidence_interval}")


Q9.

Yes