# Health Study - Part 1

The goals of this notebook are as follows:  
- **basic descriptive statistics**  
- **simple visualisations**  
- **a simulation related to how often a disease occurs**  
- **confidence intervals for systolic blood pressure**  
- **a hypothesis test: "Do smokers have higher mean blood pressure than non-smokers?"**

In [None]:
import numpy as np
import pandas as pd

from src.health import HealthAnalysis
from src.viz import (hist_bp,
                     box_weight_by_sex,
                     bar_smoker_share,
                     scatter_bp_vs_age,
                     bar_mean_bp_smoker)
np.random.seed(42) # For reproducibility

In [None]:
analysis = HealthAnalysis()
df = analysis.df

df.head()

First we load the dataset through the `HealthAnalysis` class, which reads the CSV file and converts columns to numeric/categorical types.  
We also compute BMI from height and weight for possible later analysis.
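
As a rough illustration of the BMI step, here is a minimal sketch. It is not the code inside `HealthAnalysis`: the function name `bmi` and the units (weight in kilograms, height in centimetres) are assumptions made for this example.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by squared height in metres."""
    # Assumption: height is stored in centimetres, so convert to metres first.
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


print(round(bmi(70, 175), 1))  # 22.9
```

Because the function only uses arithmetic operators, it should also work element-wise when given two pandas columns instead of two numbers.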

In [None]:
desc = analysis.descriptive()
desc

## Descriptive Stats

Basic descriptive statistics (mean, median, min, max) for:  
- **age**  
- **height**  
- **weight**  
- **systolic blood pressure**  
- **cholesterol**
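
A summary table of this kind can be sketched with plain pandas. The implementation of `analysis.descriptive()` is not shown in this notebook, so the snippet below is only an assumed equivalent on toy data; `systolic_bp` is the only column name confirmed elsewhere in the notebook.

```python
import pandas as pd

# Toy data, not the real dataset; "age" is an assumed column name.
toy = pd.DataFrame({
    "age": [34, 51, 47, 62],
    "systolic_bp": [138.0, 152.0, 149.0, 161.0],
})

# One row per statistic, one column per variable.
summary = toy.agg(["mean", "median", "min", "max"])
print(summary)
```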

In [None]:
hist_bp(df)

## Histogram of Systolic Blood Pressure

The histogram shows how systolic blood pressure is distributed within the dataset.  
It looks roughly bell-shaped and is centered around **149 mmHg**.  
Most participants fall in the interval of about **135-165 mmHg**, with only a few very low or very high values in the tails.

In [None]:
box_weight_by_sex(df)

## Weight by sex

The boxplot compares the weight distribution for men and women.  
- Men have a higher median weight and a slightly larger spread.
- Women have a lower median weight and several outliers in the data.

In [None]:
bar_smoker_share(df)

## Share of Smokers

This bar chart shows the percentage of smokers and non-smokers in the dataset.  

- **25-30%** of participants are smokers.
- **70-75%** are non-smokers.

This means the majority are non-smokers, so the two groups differ in size, which I should keep in mind when I compare blood pressure between them later.

In [None]:
scatter_bp_vs_age(df)

## Systolic blood pressure vs age

This scatter plot shows the relationship between age and systolic blood pressure.

There is a clear upward trend: the older a participant is, the higher their blood pressure tends to be.  
This pattern matches usual medical expectations and is useful context for the rest of the analysis.

In [None]:
bar_mean_bp_smoker(df)

## Mean systolic blood pressure based on smoker status

This bar chart compares the **average** systolic blood pressure for smokers and non-smokers.

The two bars are almost identical. In the dataset given, the mean systolic blood pressure is about **149 mmHg** for both smokers and non-smokers.  
This suggests that there is no apparent difference in average blood pressure between the two groups, which I will check with the hypothesis tests later.

In [None]:
p_real = analysis.disease_rate()
p_sim = analysis.simulate_disease(n=1000)

print(f"Disease share in data: {p_real * 100:.3f}%")
print(f"Disease share in simulation: {p_sim * 100:.3f}%")

## Disease rate and disease simulation

First, I calculate the share of people with the disease in the given dataset.  
In this dataset the disease rate comes out to about 5.9%.  

Then I simulate 1000 "virtual" participants with the same probability of disease using a binomial model.  
The simulated disease rate comes out very close to the observed one, which makes sense because it is based on the same probability. It is not exactly the same because of random variation.
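
The binomial step can be sketched as follows. This is an assumed stand-in for `analysis.simulate_disease`, whose code is not shown here; the helper name `simulate_rate` and the use of NumPy's `default_rng` are choices made for this example. With p ≈ 0.059 and n = 1000, the standard error of the simulated share is sqrt(p(1 - p) / n) ≈ 0.0075, so deviations of around one percentage point from the observed rate are expected.

```python
import numpy as np


def simulate_rate(p, n, seed=42):
    """Draw the number of cases among n people and return the share."""
    rng = np.random.default_rng(seed)
    cases = rng.binomial(n=n, p=p)  # one draw from Binomial(n, p)
    return cases / n


p_observed = 0.059  # approximate rate quoted above
p_simulated = simulate_rate(p_observed, n=1000)
print(f"Simulated rate: {p_simulated:.3f}")
```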

In [None]:
ci_norm = analysis.ci_bp_normal(alpha=0.05)
ci_t = analysis.ci_bp_t(alpha=0.05)
ci_boot = analysis.ci_bp_bootstrap(alpha=0.05, n_boot=5000)

ci_norm = tuple(float(x) for x in ci_norm)
ci_t = tuple(float(x) for x in ci_t)
ci_boot = tuple(float(x) for x in ci_boot)

print("95% CI using normal approximation:", tuple(round(x, 2) for x in ci_norm))
print("95% CI using t-distribution:", tuple(round(x, 2) for x in ci_t))
print("95% CI using bootstrap:", tuple(round(x, 2) for x in ci_boot))

## Confidence intervals for mean systolic blood pressure

Here I calculate 95% confidence intervals for the **mean** of "systolic_bp" using three different methods:  

- Normal approximation - Assumes the sampling distribution of the mean is approximately normal.
- t-distribution - Similar to the normal approximation, but takes into account the uncertainty in the estimated standard deviation.
- Bootstrap - A non-parametric method that does not assume a normal distribution. It resamples the dataset with replacement many times and calculates the mean of each resample.

All three methods produce almost identical intervals, which suggests that the sample is large and that the distribution is not extremely skewed.
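
To make two of the methods concrete, here is a minimal sketch of a normal-approximation interval and a percentile bootstrap interval. It is not the code behind `ci_bp_normal` or `ci_bp_bootstrap`, which is not shown in this notebook; the function names, the percentile variant of the bootstrap and the synthetic data are all assumptions. The t-interval has the same form as the normal one, with the z quantile replaced by a t quantile with n - 1 degrees of freedom, and is left out to keep the example dependency-light.

```python
import numpy as np
from statistics import NormalDist


def ci_normal(x, alpha=0.05):
    """Mean +/- z * standard error, with z from the standard normal."""
    x = np.asarray(x, dtype=float)
    se = x.std(ddof=1) / np.sqrt(x.size)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return x.mean() - z * se, x.mean() + z * se


def ci_bootstrap(x, alpha=0.05, n_boot=5000, seed=42):
    """Percentile bootstrap: resample with replacement, take quantiles."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row of idx selects one resample of the same size as x.
    idx = rng.integers(0, x.size, size=(n_boot, x.size))
    boot_means = x[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return lo, hi


# Synthetic stand-in for systolic_bp, not the real data.
rng = np.random.default_rng(0)
bp = rng.normal(loc=149, scale=13, size=800)
print("normal:   ", ci_normal(bp))
print("bootstrap:", ci_bootstrap(bp))
```

On a large, roughly symmetric sample like this synthetic one, the two intervals should nearly coincide, which mirrors what is observed above.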

In [None]:
t_stat, p_two, p_boot = analysis.smoker_bp_ttests()

print(f"Welch t-test (two-sided): t = {t_stat:.2f}, p ≈ {p_two:.4f}")
print(f"Bootstrap test (one-sided, smoker > non-smoker): p ≈ {p_boot:.4f}")

## Hypothesis test: Do smokers have higher mean blood pressure?

To find out whether smokers have higher average systolic blood pressure than non-smokers, I run two statistical tests:  
- **Welch t-test** (two-sided)
- **Bootstrap test** (one-sided)

Results:  
- t-test p-value ≈ 0.65
- bootstrap p-value ≈ 0.33

Both p-values are far above the common significance level of 0.05.  
This means there is no statistical evidence in our dataset that smokers have higher blood pressure than non-smokers.  
This is consistent with the earlier bar plot of average blood pressure for smokers and non-smokers.
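
The two tests can be sketched as below. How `smoker_bp_ttests` is implemented is not shown in this notebook, so this is an assumed version: `welch_t` computes the Welch statistic with Welch-Satterthwaite degrees of freedom, and `bootstrap_p_one_sided` resamples both groups after shifting them to a common mean so that the null hypothesis holds. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, df


def bootstrap_p_one_sided(a, b, n_boot=5000, seed=42):
    """Share of null resamples whose mean difference is >= the observed one."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    observed = a.mean() - b.mean()
    # Shift both groups to the pooled mean so that H0 (equal means) holds.
    pooled = np.concatenate([a, b]).mean()
    a0, b0 = a - a.mean() + pooled, b - b.mean() + pooled
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        ra = rng.choice(a0, size=a0.size, replace=True)
        rb = rng.choice(b0, size=b0.size, replace=True)
        diffs[i] = ra.mean() - rb.mean()
    return float(np.mean(diffs >= observed))


# Synthetic groups drawn with the same mean, not the real data.
rng = np.random.default_rng(0)
smokers = rng.normal(149, 13, size=200)
non_smokers = rng.normal(149, 13, size=600)
t_stat, df = welch_t(smokers, non_smokers)
p_boot = bootstrap_p_one_sided(smokers, non_smokers)
print(f"t = {t_stat:.2f}, df = {df:.1f}, bootstrap p = {p_boot:.3f}")
```

Note that a one-sided p-value is roughly half the two-sided one when the statistic is approximately symmetric and the observed difference points in the hypothesised direction, which fits the values of about 0.33 and 0.65 reported above.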

## Method Discussion & Results

In part 1 of this project, I used several statistical methods to describe the data and answer the questions set out at the start.

#### 1. Descriptive stats & visualisation graphs

Here I gathered basic summary statistics (mean, median, min, max) to get a good idea of the range of each variable.  
I also created several graphs (histograms, boxplots, scatter plots) to make patterns in the data easier to observe, to identify potential outliers and to check whether assumptions such as normality seem reasonable.

#### 2. Confidence Intervals

I calculated 95% confidence intervals for the mean blood pressure using:
- **Normal Approximation**
- **t-distribution**
- **Bootstrap confidence interval**

All three methods produced nearly identical intervals.  
This indicates that the sample is large and not extremely skewed, which makes normal-based methods appropriate and stable for this dataset.

#### 3. Hypothesis testing

In the final part, I used two different methods to compare smokers & non-smokers:
- **Welch t-test**
- **Bootstrap hypothesis test**

Both tests led to the same conclusion (no significant difference), which in turn strengthens the reliability of the result.  
This suggests that, in this dataset, smoking status **does not** have a detectable association with mean systolic blood pressure.