In [None]:
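# Q1 The standard error of the mean (SEM) measures how much sample means would vary from one sample to the next if we repeatedly drew samples of the same size from the population.
# The standard deviation of the original data measures the spread of the individual data points within a single sample, that is, how much the data points differ from that sample's mean.
# For the mean, the SEM is approximately the standard deviation divided by the square root of the sample size, so it shrinks as the sample grows while the standard deviation does not.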
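
In [None]:
# A minimal sketch of the distinction, reusing the illustrative sample from the Q4 cell.
# The variable names are assumptions made for this example; ddof=1 requests the sample standard deviation.
```python
import numpy as np

# Illustrative data (same values as the Q4 sample)
sample = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

# Standard deviation: spread of the individual data points around the sample mean
sd = sample.std(ddof=1)

# Standard error of the mean: estimated spread of the sample mean across repeated samples
sem = sd / np.sqrt(len(sample))

print("Standard deviation:", round(sd, 3))
print("Standard error of the mean:", round(sem, 3))
```
# The SEM is smaller than the standard deviation by a factor of sqrt(n).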

In [None]:
# Q2 Compute the SEM: this is the standard deviation of the bootstrapped sample means.
# Multiply the SEM by 1.96: about 95% of the area under a normal curve lies within 1.96 standard deviations of its mean.
# Add and subtract this value from the sample mean: this gives the lower and upper bounds of the confidence interval.
# Confidence Interval = Sample Mean ± 1.96 × SEM
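
In [None]:
# The steps above could be sketched as follows. The data are illustrative, and the seed and the
# 1,000 resamples are assumptions chosen only to make the run repeatable.
```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the run is repeatable
sample = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])  # illustrative data

# Resample with replacement and record the mean of each resample
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(1000)
])

# SEM = standard deviation of the bootstrapped means
sem = boot_means.std(ddof=1)

# 95% interval: sample mean plus or minus 1.96 standard errors
ci_lower = sample.mean() - 1.96 * sem
ci_upper = sample.mean() + 1.96 * sem
print("95% CI:", (ci_lower, ci_upper))
```
# Unlike the percentile method, this interval is symmetric around the sample mean by construction.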

In [None]:
# Q3 Generate many bootstrapped sample means: Resample the original data with replacement many times (e.g., 1,000 times) and calculate the mean for each resample.
# Sort the bootstrapped means: Arrange the bootstrapped means in ascending order.
# Find the 2.5th and 97.5th percentiles: The values at these percentiles form the lower and upper bounds of the confidence interval.

In [None]:
# Q4 
import numpy as np

# Sample data (you can replace this with your own data)
sample = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

# Function to perform bootstrapping
def bootstrap_ci(data, num_bootstrap_samples, statistic_func, ci=95):
    # List to store the bootstrap statistics (one value per resample)
    bootstrap_statistics = []
    
    # Perform bootstrapping
    for _ in range(num_bootstrap_samples):
        # Resample the data with replacement
        bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
        # Calculate the statistic (mean here)
        stat = statistic_func(bootstrap_sample)
        bootstrap_statistics.append(stat)
    
    # Percentile bounds for the requested confidence level
    # (2.5th and 97.5th percentiles when ci=95); np.percentile does not need sorted input
    tail = (100 - ci) / 2
    lower_bound = np.percentile(bootstrap_statistics, tail)
    upper_bound = np.percentile(bootstrap_statistics, 100 - tail)
    
    return lower_bound, upper_bound

# Generate a 95% bootstrap confidence interval for the mean
mean_ci = bootstrap_ci(sample, num_bootstrap_samples=1000, statistic_func=np.mean)

print("95% Bootstrap Confidence Interval for the Mean:", mean_ci)

# To generate a confidence interval for the median instead, 
# just change `statistic_func=np.mean` to `statistic_func=np.median`:

median_ci = bootstrap_ci(sample, num_bootstrap_samples=1000, statistic_func=np.median)

print("95% Bootstrap Confidence Interval for the Median:", median_ci)


In [None]:
# Q5 Population parameters are fixed values (like the true population mean) that describe the entire population, but they are usually unknown.
# Sample statistics are estimates calculated from a sample, and they vary depending on the sample.

In [None]:
# Q6a Bootstrapping is like doing an experiment over and over using the data you already have. 
# Imagine you have a set of data points (like test scores from a class). 
# Instead of just analyzing that one set, you randomly pick data points from it with replacement—meaning you might pick the same score more than once—until you have a new set that’s the same size as the original. 
# You do this a bunch of times, each time creating a new set and calculating some statistic (like the average). 
# This helps you get a sense of how much the statistic might change if you could repeat the sampling process.

In [None]:
# Q6b The main purpose of bootstrapping is to estimate uncertainty. 
# When we have a sample, we don’t know how close it is to the true population value (like the real average score of all students, not just the ones in our sample). 
# Bootstrapping helps us figure out how much the sample result (like the average) could vary. 
# It’s a way to estimate how accurate or stable our sample's result is by simulating the process of sampling multiple times, even though we only have one set of real data.

In [None]:
# Q6c Bootstrap the sample: Take your sample and create many resampled sets (using replacement). For each set, calculate the average score.
# Look at the range of these averages: After running many bootstrap samples, you’ll have a collection of average scores from all the resamples. This collection will give you an idea of what the true population average might be, based on your sample.
# Compare with your guess: If the hypothesized average of 75 falls inside the 95% bootstrap confidence interval (the middle 95% of the bootstrapped averages), then your guess is plausible. If it falls outside that interval, the sample suggests that 75 is unlikely to be the true population average.
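
In [None]:
# A rough sketch of this check. The scores below are hypothetical values invented for illustration
# (they are not data from this notebook), and the seed and resample count are likewise assumptions.
```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed, for repeatability only

# Hypothetical test scores; not data from this notebook
scores = np.array([68, 72, 75, 79, 81, 70, 77, 74, 83, 69, 76, 78])
hypothesized_mean = 75

# Bootstrap the average score
boot_means = np.array([
    rng.choice(scores, size=len(scores), replace=True).mean()
    for _ in range(2000)
])

# Middle 95% of the bootstrapped averages
lower, upper = np.percentile(boot_means, [2.5, 97.5])

# The guess is plausible if it lies inside the interval
plausible = lower <= hypothesized_mean <= upper
print("95% interval:", (lower, upper))
print("Is 75 plausible?", plausible)
```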

In [None]:
# Q7 When a confidence interval includes zero, it means that zero is a possible value for the true effect (e.g., the drug's effect) based on the sample data. 
# Even though the observed sample mean is not zero, the fact that zero lies within the interval means we cannot confidently rule out the possibility that the drug has no effect on average. This leads to failing to reject the null hypothesis because we cannot confidently say the drug has a significant effect.
# To reject the null hypothesis, the confidence interval would need to exclude zero. This would suggest that the true effect is significantly different from zero, meaning we have enough evidence to say the drug likely does have an effect.

In [None]:
# Q8 The difference in health scores (FinalHealthScore - InitialHealthScore) gives us insight into whether the vaccine improved the patients' health. The summary statistics for the health score change are:
# Mean change: 3.3
# Standard deviation: 4.03
# Minimum change: -3
# Maximum change: 8
# The positive mean indicates an improvement in health scores on average, but we need further analysis to confirm if this improvement is statistically significant.
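
In [None]:
# The summary figures above can be reproduced with a few lines of NumPy, using the same scores as the
# data cell in this notebook. The array names here are assumptions made for this sketch.
```python
import numpy as np

# Scores copied from the notebook's data cell
initial = np.array([84, 78, 83, 81, 81, 80, 79, 85, 76, 83])
final = np.array([86, 86, 80, 86, 84, 86, 86, 82, 83, 84])

# Per-patient change in health score
change = final - initial

print("Mean change:", change.mean())
print("Standard deviation:", round(change.std(ddof=1), 2))  # sample SD
print("Minimum change:", change.min())
print("Maximum change:", change.max())
```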

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Create DataFrame
data = {
    'PatientID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Age': [45, 34, 29, 52, 37, 41, 33, 48, 26, 39],
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'InitialHealthScore': [84, 78, 83, 81, 81, 80, 79, 85, 76, 83],
    'FinalHealthScore': [86, 86, 80, 86, 84, 86, 86, 82, 83, 84]
}

df = pd.DataFrame(data)

# Calculate the health score change
df['HealthScoreChange'] = df['FinalHealthScore'] - df['InitialHealthScore']

# Save boxplot in the current directory
plt.figure(figsize=(10, 6))
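plt.boxplot([df['InitialHealthScore'], df['FinalHealthScore']],
            patch_artist=True,
            boxprops=dict(facecolor='lightblue'))
# Name the two boxes via xticks; the labels= keyword of boxplot is deprecated in recent Matplotlib releases
plt.xticks([1, 2], ['Initial Health Score', 'Final Health Score'])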
plt.title('Comparison of Initial and Final Health Scores')
plt.ylabel('Health Score')
plt.grid(True)
plt.savefig('initial_vs_final_health_scores.png')  # Save image in current directory

# Save histogram in the current directory
plt.figure(figsize=(10, 6))
plt.hist(df['HealthScoreChange'], bins=5, edgecolor='black')
plt.title('Distribution of Health Score Changes After Vaccine')
plt.xlabel('Health Score Change')
plt.ylabel('Frequency')
plt.grid(True)
plt.savefig('health_score_change_histogram.png')  # Save image in current directory

from PIL import Image

# Open and display the first saved image (Boxplot)
img1 = Image.open('initial_vs_final_health_scores.png')
plt.figure(figsize=(10, 6))
plt.imshow(img1)
plt.axis('off')  # Hide axis
plt.show()

# Open and display the second saved image (Histogram)
img2 = Image.open('health_score_change_histogram.png')
plt.figure(figsize=(10, 6))
plt.imshow(img2)
plt.axis('off')  # Hide axis
plt.show()

In [None]:
# The histogram above shows the distribution of changes in health scores after patients received the vaccine. Eight of the ten patients improved, by between 1 and 8 points, while two patients declined by 3 points each.

In [None]:
# To quantitatively assess whether the vaccine has an effect, we will use bootstrapping to generate a confidence interval for the mean change in health scores. The difference between each patient’s initial and final health score gives the HealthScoreChange. The purpose of the bootstrapping method is to estimate how the sample mean health score change would vary if we repeatedly took samples from the same population.
# We will compute a 95% confidence interval for the mean health score change. If the confidence interval does not include zero, we can reject the null hypothesis that the vaccine has no effect. Conversely, if zero is within the interval, we fail to reject the null hypothesis.
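
In [None]:
# One way this interval could be computed is sketched below, using the percentile method.
# The seed and the 10,000 resamples are assumptions; the exact bounds shift slightly with both.
```python
import numpy as np

rng = np.random.default_rng(1)  # assumed seed, for repeatability only

# Scores copied from the notebook's data cell
initial = np.array([84, 78, 83, 81, 81, 80, 79, 85, 76, 83])
final = np.array([86, 86, 80, 86, 84, 86, 86, 82, 83, 84])
change = final - initial

# Resample the changes with replacement and record each resample's mean
n_resamples = 10_000
boot_means = np.array([
    rng.choice(change, size=len(change), replace=True).mean()
    for _ in range(n_resamples)
])

# 95% percentile interval for the mean change
ci_lower, ci_upper = np.percentile(boot_means, [2.5, 97.5])
print("95% bootstrap CI for the mean change:", (ci_lower, ci_upper))
print("Interval excludes zero:", ci_lower > 0 or ci_upper < 0)
```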

In [None]:
# The bootstrapped 95% confidence interval for the mean change in health scores is (0.7, 5.5). Since this interval does not include zero, it suggests that there is a statistically significant improvement in health scores after receiving the vaccine.

In [None]:
# Because the 95% confidence interval for the mean health score change does not include zero, we reject the null hypothesis of no effect at the 5% significance level. This indicates that the vaccine likely has a positive effect on the health of the patients who took it.

In [None]:
# Q9 maybe.