
In [None]:
# Initialization
import pandas as pd
import numpy as np
import bokeh as bk
import bokeh.plotting as bkp
import bokeh.models as bkm
np.random.seed(0)

In [None]:
bk.io.output_notebook()

from bokeh.plotting import figure, show

# `data` is loaded in the "Reading Data" cell further down; run that cell first.
x = data.InitialHealthScore
y = data.FinalHealthScore


p = figure(width=600, height=600, toolbar_location=None,
        title="Health Scores, Before and After Vaccination")


# Arrows: dashed connector from the no-change line (x, x) up or down to each
# observed point (x, y); green marks an increase, red a decrease.
for i in range(x.size):
    col = "green" if x[i] < y[i] else "red"
    head = bkm.TeeHead(line_color="blue", size=15, line_width=3, line_alpha=0.3)
    p.add_layout(bkm.Arrow(end=head, line_color=col, line_width=2, line_alpha=0.3,
                           line_dash="dashed",
                           x_start=x[i], y_start=x[i], x_end=x[i], y_end=y[i]))

# Scatterplot
p.scatter(x=x, y=y, size=10, color="blue", alpha=0.6, legend_label="Health Data (n = 10)")


# Diagonal line of no change (final score equals initial score)
p.line(x=[75, 86], y=[75,86], line_color='blue',
       line_width=2, legend_label='No Change in Health Data', line_dash='dashed')

# Add Label to central line
label = bkm.Label(x=210, y=210, x_units='screen', y_units='screen', angle = 0.25*np.pi,
                 text='Line of No Change', text_font_style="bold",
                 background_fill_color='white', background_fill_alpha=1.0)
p.add_layout(label)

# Zero-length dummy lines so the green and red arrow colours appear in the legend
p.line(x=[0,0], y=[0,0], line_color='green',
       line_width=2, legend_label='Increase in Health Score', line_dash='dashed')
p.line(x=[0,0], y=[0,0], line_color='red',
       line_width=2, legend_label='Decrease in Health Score', line_dash='dashed')


p.legend.location = "bottom_right"

p.x_range.start = 75
p.y_range.start = 75
p.xaxis.axis_label = "Initial Health Score"
p.yaxis.axis_label = "Final Health Score"

show(p)

# Vaccine Data Analysis Assignment

AliTech has created a new vaccine that aims to improve the health of the general population. Here we aim to use data published by AliTech to determine the effectiveness of their vaccine.

The **null hypothesis** in this situation is that the vaccine has **no effect** on a patient's overall health. Over a population, this means that the **average health score** does not change between before and after taking the vaccine.

The table below shows the data provided by AliTech, and the scatterplot above visualizes each patient's initial and final health scores. The blue diagonal line marks no change between the initial and final health score, as the null hypothesis predicts on average, and the green and red dashed lines mark increases and decreases in the final health score.
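
As a quick numeric check of what the null hypothesis is being compared against, the following minimal sketch computes each patient's change and the observed mean change. The scores are copied from the AliTech table in this notebook; the variable names are illustrative and not part of the notebook's own code.

```python
from statistics import mean

# Scores copied from the AliTech table (patients 1 to 10).
initial = [84, 78, 83, 81, 81, 80, 79, 85, 76, 83]
final = [86, 86, 80, 86, 84, 86, 86, 82, 83, 84]

# Per-patient change; under the null hypothesis its population mean is 0.
changes = [f - i for i, f in zip(initial, final)]

observed_mean_change = mean(changes)
n_improved = sum(c > 0 for c in changes)
n_declined = sum(c < 0 for c in changes)

print("Changes:", changes)
print("Observed mean change:", observed_mean_change)
print("Improved:", n_improved, "Declined:", n_declined)
```

In this sample, eight patients improved and two declined, for an observed mean change of 3.3 points. The question is whether a mean this far from 0 could plausibly arise by chance.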

In [None]:
# Reading Data
data = pd.read_csv('STA130 HW4 data.csv')
data['HealthScoreChange'] = data.FinalHealthScore-data.InitialHealthScore
data

Unnamed: 0,PatientID,Age,Gender,InitialHealthScore,FinalHealthScore,HealthScoreChange
0,1,45,M,84,86,2
1,2,34,F,78,86,8
2,3,29,M,83,80,-3
3,4,52,F,81,86,5
4,5,37,M,81,84,3
5,6,41,F,80,86,6
6,7,33,M,79,86,7
7,8,48,F,85,82,-3
8,9,26,M,76,83,7
9,10,39,F,83,84,1


In [None]:
# Defining Bootstrapping functions

def bootstrap_sample(data):
    # Sample with replacement the same size as the original data
    return data.sample(n=len(data), replace=True)

# Function to create multiple bootstrapped samples
def bootstrap_samples(data, n_samples=1000):
    samples = []
    for _ in range(n_samples):
        sample = bootstrap_sample(data)
        samples.append(sample)
    return samples

In [None]:
np.random.seed(0)

# Creating bootstrapped Samples
n_samples = 1000
bootstrapped_samples = bootstrap_samples(data, n_samples=n_samples)

bootstrapped_means = [sample["HealthScoreChange"].mean() for sample in bootstrapped_samples]



In [None]:
bk.io.output_notebook()

from bokeh.plotting import figure, show

x = bootstrapped_means


p = figure(width=870, height=550, toolbar_location=None,
        title="Distribution of Bootstrapped Sample Mean of Health Score Change \nn=10, 1000 bootstrapped iterations")


# Percentiles for the 95% confidence interval (2.5th, 97.5th) and the median (50th)
percentiles = [2.5, 97.5, 50.0]
lower_bound = np.percentile(x, percentiles[0])  # 2.5th percentile
upper_bound = np.percentile(x, percentiles[1])  # 97.5th percentile

median = np.percentile(x, percentiles[2])  # 50th percentile (median)
mean = np.mean(x)
standard_error = np.std(x)  # spread of the bootstrapped means
# Proportion of bootstrapped means at or below zero (reported below as the p-value)
p_score = len([a for a in x if a <= 0]) / len(x)



# Histogram
hist, edges = np.histogram(x, density=True, bins='auto')
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
         fill_color="skyblue", line_color="white",
         legend_label="Bootstrapped Samples (1000)")


# Vertical reference lines: CI bounds, median, mean, and the null value of 0
p.line(x=[lower_bound, lower_bound], y=[0, max(hist)], line_color='red',
       line_width=2, legend_label='Lower 95% CI (2.5th Percentile)', line_dash='dashed')
p.line(x=[upper_bound, upper_bound], y=[0, max(hist)], line_color='green',
       line_width=2, legend_label='Upper 95% CI (97.5th Percentile)', line_dash='dashed')
p.line(x=[median, median], y=[0, max(hist)], line_color='orange',
       line_width=2, legend_label='Median (50th Percentile)', line_dash='dashed')
p.line(x=[mean, mean], y=[0, max(hist)], line_color='yellow',
       line_width=2, legend_label='Mean', line_dash='dashed')
p.line(x=[0,0], y=[0, max(hist)], line_color='black',
       line_width=2, legend_label='Null Hypothesis (Change = 0)', line_dash='dashed')

# Shaded bands: the 95% confidence interval, and a band two standard errors wide centred on the median
p.rect(x=0.5*(upper_bound+lower_bound), y=0, width=upper_bound-lower_bound, height=2*max(hist), color='red', alpha=0.03)
p.rect(x=median, y=0, width=standard_error*2, height=2*max(hist), color='red', alpha=0.05)



p.y_range.start = 0
p.xaxis.axis_label = "Health Score Change (Bootstrapped Mean)"
p.yaxis.axis_label = "Density"

show(p)
print("Mean:", round(mean,1))
print("Median:", round(median,1))
print("95% Confidence Interval Lower Bound:", round(lower_bound,2))
print("95% Confidence Interval Upper Bound:", round(upper_bound,2))
print("Standard Error:", round(standard_error,2))
print("p-Value:", round(p_score,3))

Mean: 3.4
Median: 3.4
95% Confidence Interval Lower Bound: 0.9
95% Confidence Interval Upper Bound: 5.5
Standard Error: 1.17
p-Value: 0.006


Histogram code is adapted from the Bokeh documentation example at
https://docs.bokeh.org/en/3.1.1/docs/examples/topics/stats/histogram.html

### Quantitative Analysis

The above code draws 1000 bootstrap samples from the original sample of n=10, each of size 10 and sampled with replacement, and stores the mean health score change of each.
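
The resampling step can also be sketched with NumPy alone. This is a minimal illustration, not the notebook's own implementation: it assumes the ten `HealthScoreChange` values from the table, and the generator `rng`, its seed, and the index-matrix approach are choices made here for brevity, so the numbers will differ slightly from the printout.

```python
import numpy as np

# HealthScoreChange values copied from the table.
changes = np.array([2, 8, -3, 5, 3, 6, 7, -3, 7, 1])

rng = np.random.default_rng(0)  # illustrative seed, not the notebook's random stream
n_boot = 1000

# Each row holds the indices of one resample of size n, drawn with replacement.
idx = rng.integers(0, changes.size, size=(n_boot, changes.size))
boot_means = changes[idx].mean(axis=1)

print("Number of bootstrapped means:", boot_means.size)
print("Mean of bootstrapped means:", round(boot_means.mean(), 2))
print("Bootstrap standard error:", round(boot_means.std(), 2))
```

Drawing all indices at once avoids the Python loop over samples. The DataFrame-based functions in the notebook do the same job while keeping every column of each resample available.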

The above histogram shows the distribution of the bootstrapped sample means. Here is a printout of the result:

```
Mean: 3.4
Median: 3.4
95% Confidence Interval Lower Bound: 0.9
95% Confidence Interval Upper Bound: 5.5
Standard Error: 1.17
p-Value: 0.006
```

The histogram and the confidence interval both give strong evidence that the vaccine has, on average, a positive effect on patient health. We are 95% confident that the true mean change in health score in the population lies between 0.9 and 5.5, an interval that is entirely positive and therefore excludes the null value of 0.

Additionally, only 0.6% of the bootstrapped means fall at or below 0. Treating this proportion as an approximate one-sided p-value of 0.006 gives very strong grounds to reject the null hypothesis.
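
The two summaries used in this argument, the percentile interval and the share of bootstrapped means at or below 0, can be reproduced in a few lines. The sketch below assumes the ten change values from the table and an illustrative seed, so its bounds will be close to, but not identical to, the values printed above.

```python
import numpy as np

# HealthScoreChange values copied from the table.
changes = np.array([2, 8, -3, 5, 3, 6, 7, -3, 7, 1])

rng = np.random.default_rng(0)  # illustrative seed
boot_means = rng.choice(changes, size=(1000, changes.size), replace=True).mean(axis=1)

# Percentile interval: the middle 95% of the bootstrapped means.
lower, upper = np.percentile(boot_means, [2.5, 97.5])

# Share of bootstrapped means at or below the null value of 0.
prop_at_or_below_zero = np.mean(boot_means <= 0)

print("95% percentile interval:", round(lower, 2), "to", round(upper, 2))
print("Proportion of means <= 0:", prop_at_or_below_zero)
```

If 0 lies outside the interval and the proportion is small, both summaries point the same way: a mean change of 0 is hard to reconcile with the resampled data.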




### Conclusion

In conclusion, assuming the data is representative of the whole population, we can say with great confidence that the vaccine has a positive effect on patient health. This confidence is tempered by the very small sample size of the study (n=10), which makes it less likely that the sample is representative of the population.


### Further Considerations

Because a vaccine is usually meant for widespread use, it would be prudent to study its effects on specific demographic groups. Within the data we have, age and gender are the obvious choices. However, with such a small sample (n=10), a subgroup analysis would be unlikely to yield statistically significant results.

Thus, a much larger sample size is required before any definitive conclusions are drawn.
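
To make the subgroup limitation concrete, the following sketch splits the changes by gender using the values from the table. The `records` list and `by_gender` dictionary are illustrative names introduced here, not part of the notebook's code.

```python
from statistics import mean

# (Gender, HealthScoreChange) pairs copied from the table, patients 1 to 10.
records = [("M", 2), ("F", 8), ("M", -3), ("F", 5), ("M", 3),
           ("F", 6), ("M", 7), ("F", -3), ("M", 7), ("F", 1)]

# Group the changes by gender.
by_gender = {}
for gender, change in records:
    by_gender.setdefault(gender, []).append(change)

for gender, values in sorted(by_gender.items()):
    print(gender, "n =", len(values), "mean change =", mean(values))
```

Each group holds only five patients and the group means differ by just 0.2 points, so any comparison between them would rest on very little data.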