In [None]:
The standard deviation describes the variation in the original data: it quantifies how widely the individual data points are dispersed around the sample mean.
The standard error of the mean (SEM) describes how much the sample mean itself would vary from sample to sample, so it indicates how precisely the mean is estimated.
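
As a rough illustration of the difference, the sketch below computes both quantities for a small made-up sample. The data values and variable names are assumptions chosen for the example, not results from any study.

```python
import numpy as np

# Illustrative data only (hypothetical values)
sample = np.array([12, 15, 14, 10, 18, 20, 25, 30, 16, 13])

# Sample standard deviation: spread of the individual data points
# (ddof=1 gives the sample, not population, standard deviation)
sd = np.std(sample, ddof=1)

# Standard error of the mean: standard deviation divided by sqrt(n)
sem = sd / np.sqrt(len(sample))

print(f"Standard deviation: {sd:.3f}")
print(f"Standard error of the mean: {sem:.3f}")
```

The SEM is always smaller than the standard deviation for n > 1, and it shrinks as the sample size grows, while the standard deviation does not.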


In [None]:
First, compute the sample mean, then compute the standard error of the mean (SEM) by dividing the sample standard deviation by the square root of the sample size. Next, determine the critical value for the chosen confidence level: a z-value when the population standard deviation is known or the sample is large, otherwise a t-value with n - 1 degrees of freedom. Multiply the SEM by this critical value to get the margin of error. Lastly, the confidence interval is the range obtained by subtracting the margin of error from the sample mean and adding it to the sample mean.
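
The steps above can be sketched as follows. This is a minimal example assuming a small made-up sample and a t-based critical value; the data and variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Illustrative data only (hypothetical values)
sample = np.array([12, 15, 14, 10, 18, 20, 25, 30, 16, 13])
n = len(sample)

# Step 1: sample mean and standard error of the mean
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)

# Step 2: two-sided critical value from the t distribution (n - 1 degrees of freedom)
confidence = 0.95
t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)

# Step 3: margin of error
margin = t_crit * sem

# Step 4: confidence interval = mean minus/plus the margin of error
ci_lower, ci_upper = mean - margin, mean + margin
print(f"Mean: {mean:.2f}, margin of error: {margin:.2f}")
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
```

The same interval can be obtained in one call with `stats.t.interval(0.95, n - 1, loc=mean, scale=sem)`.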

In [None]:
The 95% confidence interval is calculated from the 2.5th and 97.5th percentiles of the bootstrap distribution of the mean. These two values are the interval's lower and upper bounds.


In [1]:
import numpy as np

# Sample data (you can replace this with your own data)
sample = np.array([12, 15, 14, 10, 18, 20, 25, 30, 16, 13])

# Define a function to generate bootstrap samples and compute a statistic
def bootstrap(data, num_bootstrap_samples=1000, statistic=np.mean, ci=95):
    """
    data: Original sample data
    num_bootstrap_samples: Number of bootstrap samples
    statistic: The statistic to compute, e.g., np.mean (mean) or np.median (median)
    ci: Confidence interval percentage
    """
    # Store the statistic for each bootstrap sample
    bootstrap_stats = []
    
    # Generate num_bootstrap_samples bootstrap samples
    for _ in range(num_bootstrap_samples):
        # Randomly sample with replacement from the original data
        bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
        # Compute the statistic for the bootstrap sample (e.g., mean)
        stat = statistic(bootstrap_sample)
        bootstrap_stats.append(stat)
    
    # Calculate the confidence interval bounds (2.5% and 97.5% percentiles for a 95% CI);
    # np.percentile does not need sorted input
    tail = (100 - ci) / 2
    lower_bound, upper_bound = np.percentile(bootstrap_stats, [tail, 100 - tail])
    
    return lower_bound, upper_bound

# Calculate the 95% bootstrap confidence interval for the population mean
mean_ci_lower, mean_ci_upper = bootstrap(sample)
print(f"95% Bootstrap Confidence Interval (Population Mean): [{mean_ci_lower}, {mean_ci_upper}]")

# To calculate the 95% bootstrap confidence interval for the population median,
# simply change the statistic parameter to np.median
median_ci_lower, median_ci_upper = bootstrap(sample, statistic=np.median)
print(f"95% Bootstrap Confidence Interval (Population Median): [{median_ci_lower}, {median_ci_upper}]")


95% Bootstrap Confidence Interval (Population Mean): [13.7, 21.2]
95% Bootstrap Confidence Interval (Population Median): [13.0, 21.5]


In [None]:
Confidence intervals are constructed to estimate population parameters from sample data. Because sample data are subject to random sampling error, a sample statistic is only an imprecise estimate of the corresponding population parameter. A confidence interval gives a range of plausible values for the population parameter at a stated confidence level. Keeping the distinction between the sample statistic and the population parameter in mind lets us interpret the data more accurately and draw more trustworthy conclusions.
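
One way to make the role of the population parameter concrete is a small simulation: draw many samples from a population whose mean is known, build a 95% interval from each, and count how often the interval contains the true mean. The population values, sample size, and number of trials below are assumptions picked for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical population parameters (normally unknown in practice)
true_mean, true_sd = 70.0, 10.0
n, n_trials = 25, 2000

covered = 0
for _ in range(n_trials):
    # Draw one sample and compute its statistic and standard error
    sample = rng.normal(true_mean, true_sd, size=n)
    sem = stats.sem(sample)
    # 95% t-based confidence interval around the sample mean
    lower, upper = stats.t.interval(0.95, n - 1, loc=sample.mean(), scale=sem)
    if lower <= true_mean <= upper:
        covered += 1

coverage = covered / n_trials
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The fraction should land close to 0.95, which is what the confidence level refers to: the long-run share of intervals that capture the population parameter.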


In [None]:
Hey, imagine you want to know your friends' weights, but you've only measured a few of them. Bootstrapping is like playing a game. Its main purpose is to help us get more information out of this small sample, especially about characteristics of the whole group, such as the average weight. In this way, we can get a sense of how reliable our estimates are, for example through confidence intervals.

Suppose you think the average weight of your friends is 70 kg, but you only have a sample of size n, say 6 people. First, draw a random sample of 6 weights from your 6 friends with replacement. That is, you are allowed to pick the same person's weight multiple times. This is like making many new samples out of the one you already have. For each new sample, calculate its average weight. Repeat this process, say 1,000 times, and you'll end up with 1,000 values for the average weight.

Now see what these 1,000 averages look like. You'll notice that they fluctuate within a certain range, and that range gives you some sense of how uncertain the average weight is. You can take the 2.5th and 97.5th percentiles of these averages to get a 95% confidence interval. Finally, you can check whether your guess falls within that confidence interval. If it does, then your guess is plausible; if it doesn't, then it's probably time to reconsider.
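
Here is a minimal sketch of that game in code. The six weights are made up for the example, and the names `weights`, `guess`, and `boot_means` are just illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical weights (kg) of the 6 friends you measured
weights = np.array([62.0, 75.0, 68.0, 81.0, 70.0, 66.0])
guess = 70.0  # your guess for the average weight of all your friends

# Resample 6 weights with replacement, 1,000 times, keeping each average
boot_means = np.array([
    rng.choice(weights, size=len(weights), replace=True).mean()
    for _ in range(1000)
])

# 95% percentile confidence interval from the bootstrap averages
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the average weight: [{lower:.1f}, {upper:.1f}]")

if lower <= guess <= upper:
    print("The guess falls inside the interval, so it is plausible.")
else:
    print("The guess falls outside the interval, so it may be time to reconsider.")
```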

In [None]:
When the confidence interval contains zero, the sample mean may not be zero, but we still cannot rule out the possibility that the drug's effect is zero, so we fail to reject the null hypothesis. Conversely, if the confidence interval excludes zero (equivalently, for a two-sided test, the p-value is less than the significance level), we reject the null hypothesis and conclude that the drug does have an effect.

In [7]:
import pandas as pd
from scipy import stats

# Step 1: Load the dataset
data = pd.DataFrame({
    'PatientID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Age': [45, 34, 29, 52, 37, 41, 33, 48, 26, 39],
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'InitialHealthScore': [84, 78, 83, 81, 81, 80, 79, 85, 76, 83],
    'FinalHealthScore': [86, 86, 80, 86, 84, 86, 86, 82, 83, 84]
})

# Step 2: Data overview
print(data.info())  # Check the structure of the data
print(data.describe())  # Check basic statistics

# Step 3: Calculate health improvement
data['HealthImprovement'] = data['FinalHealthScore'] - data['InitialHealthScore']
print(data[['PatientID', 'HealthImprovement']])

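# Step 4: Hypothesis testing (one-sample t-test)
# Null hypothesis (H0): The vaccine has no effect (mean improvement = 0)
# Alternative hypothesis (H1): The vaccine has an effect (mean improvement != 0)
# Note: ttest_1samp is two-sided by default; pass alternative='greater'
# to test specifically for an improvement (mean improvement > 0)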

t_statistic, p_value = stats.ttest_1samp(data['HealthImprovement'], 0)
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

# Step 5: Decision based on significance level
alpha = 0.05  # Significance level of 5%
if p_value < alpha:
    print("Reject the null hypothesis: The vaccine is effective.")
else:
    print("Fail to reject the null hypothesis: There is not enough evidence to claim the vaccine is effective.")

# Step 6: Construct a 95% confidence interval for mean health improvement
mean_improvement = data['HealthImprovement'].mean()
std_error = stats.sem(data['HealthImprovement'])
confidence_level = 0.95
degrees_freedom = len(data['HealthImprovement']) - 1
confidence_interval = stats.t.interval(confidence_level, degrees_freedom, loc=mean_improvement, scale=std_error)

print(f"Mean Health Improvement: {mean_improvement}")
print(f"95% Confidence Interval for Mean Improvement: {confidence_interval}")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   PatientID           10 non-null     int64 
 1   Age                 10 non-null     int64 
 2   Gender              10 non-null     object
 3   InitialHealthScore  10 non-null     int64 
 4   FinalHealthScore    10 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 532.0+ bytes
None
       PatientID       Age  InitialHealthScore  FinalHealthScore
count   10.00000  10.00000           10.000000         10.000000
mean     5.50000  38.40000           81.000000         84.300000
std      3.02765   8.30261            2.828427          2.110819
min      1.00000  26.00000           76.000000         80.000000
25%      3.25000  33.25000           79.250000         83.250000
50%      5.50000  38.00000           81.000000         85.000000
75%      7.75000  44.00000           83.000000  