# Health Study – Part 2

This notebook continues the analysis from Part 1 using the same health study
dataset containing information about participants’ age, sex, height, weight,
blood pressure, cholesterol level, smoking habits, and disease occurrence.

In this second part, the focus is on extending the analysis by:
- organising the code into modules, functions and a class,
- performing matrix-based analysis using linear regression with scikit-learn,
- adding new visualisations to gain deeper insight into the data,
- documenting methods and explaining analytical choices.

All analysis builds on the cleaned dataset and is written to be fully reproducible.



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Project modules from the analysis/ folder
from analysis.cleaning import load_and_clean_data
from analysis.analyzer import HealthAnalyzer

# Fix the NumPy seed so that any random steps give the same result on every run
np.random.seed(42)

# Default figure size for all plots in the notebook
plt.rcParams["figure.figsize"] = (8, 5)

DATA_PATH = "data/health_study_dataset.csv"


In [None]:
df = load_and_clean_data(DATA_PATH)

print("Data loaded and cleaned.")
df.head()


## Code structure in Part 2

To keep the notebook cleaner and make the analysis easier to reuse, some parts
of the code were moved into separate Python modules inside the `analysis/` folder:

- `cleaning.py` provides the function `load_and_clean_data`.
- `analyzer.py` contains the class `HealthAnalyzer`, which handles descriptive
  statistics, plots and regression.

The notebook now focuses on demonstrating and explaining the analysis, while the
modules contain the underlying code logic.
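
As a rough illustration of this layout, the sketch below shows how a class of this kind could wrap a DataFrame and expose a summary method. It is not the contents of `analysis/analyzer.py`: the class name `HealthAnalyzerSketch`, the optional `columns` argument and the choice of statistics are all assumptions made for the example, and the data is a tiny made-up table.

```python
import pandas as pd


class HealthAnalyzerSketch:
    """Hypothetical stand-in for a class that bundles analysis steps."""

    def __init__(self, df: pd.DataFrame) -> None:
        # Work on a copy so the caller's DataFrame is never modified.
        self.df = df.copy()

    def descriptive_stats(self, columns=None) -> pd.DataFrame:
        """Return mean, median, min and max for the numeric columns."""
        numeric = self.df.select_dtypes(include="number")
        if columns is not None:
            numeric = numeric[list(columns)]
        # One row per variable, one column per statistic.
        return numeric.agg(["mean", "median", "min", "max"]).T


# Made-up rows, for illustration only (not the study dataset).
demo = pd.DataFrame(
    {"age": [30, 40, 50], "weight": [60.0, 70.0, 80.0], "sex": ["F", "M", "F"]}
)
stats = HealthAnalyzerSketch(demo).descriptive_stats()
print(stats)
```

Keeping the data on the instance means every method works on the same cleaned table, which is the main practical benefit of using a class here rather than loose functions.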



## Descriptive analysis using the HealthAnalyzer class

In Part 2, the descriptive statistics from Part 1 are now generated using the 
`HealthAnalyzer` class. This demonstrates how the refactored code can be called 
directly from the notebook in a cleaner and more modular way.


In [None]:
analyzer = HealthAnalyzer(df)

analyzer.descriptive_stats()


## Visualisations using the HealthAnalyzer class

In this section, the visualisation methods from the `HealthAnalyzer` class are
used to recreate and extend the plots from Part 1. These methods provide a
clean and modular way to generate figures directly from the notebook.



In [None]:
analyzer.plot_systolic_hist()
analyzer.plot_weight_by_sex()
analyzer.plot_smoker_counts()


## Extended visualisation

This plot extends the visual analysis from Part 1 by plotting systolic blood
pressure against age, with smokers and non-smokers shown separately. This makes
it possible to see whether blood pressure changes with age in a similar way in
both groups.
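
A plot of this kind could be produced roughly as sketched below. This is not the code behind `plot_bp_vs_age_by_smoking`: the function name `scatter_bp_vs_age_by_smoking`, the column names `age`, `smoker` and `systolic_bp`, the `"Yes"`/`"No"` coding and the age unit are assumptions, and the data is synthetic.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def scatter_bp_vs_age_by_smoking(df, ax=None):
    """Scatter systolic blood pressure against age, one colour per smoking group."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 5))
    # Each group is drawn with its own scatter call so it gets its own colour.
    for label, group in df.groupby("smoker"):
        ax.scatter(group["age"], group["systolic_bp"], alpha=0.6,
                   label=f"smoker = {label}")
    ax.set_xlabel("Age (years)")
    ax.set_ylabel("Systolic blood pressure (mmHg)")
    ax.set_title("Systolic blood pressure vs age by smoking status")
    ax.legend()
    return ax


# Synthetic stand-in data (not the study dataset), for illustration only.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "age": rng.integers(20, 80, 100),
    "smoker": rng.choice(["No", "Yes"], 100),
})
demo["systolic_bp"] = 110 + 0.4 * demo["age"] + rng.normal(0, 10, 100)

ax = scatter_bp_vs_age_by_smoking(demo)
```

Returning the `Axes` object rather than calling `plt.show()` inside the function leaves the caller free to adjust or save the figure afterwards.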



In [None]:
analyzer.plot_bp_vs_age_by_smoking()

## Regression analysis

In this section, a linear regression model is fitted to predict systolic blood 
pressure using age and weight as predictors. This introduces the matrix-based 
analysis required in Part 2, using scikit-learn to estimate the relationship 
between these variables.
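
For reference, the model can be written in matrix form. The notation below is the standard ordinary least squares formulation and is not taken from the notebook's modules. Here $\mathbf{y}$ is the vector of $n$ systolic blood pressure values and $X$ is the $n \times 3$ design matrix whose columns are a column of ones (for the intercept), age and weight.

$$
\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon},
\qquad
\hat{\boldsymbol{\beta}}
= \arg\min_{\boldsymbol{\beta}} \lVert \mathbf{y} - X\boldsymbol{\beta} \rVert^{2}
= (X^{\top} X)^{-1} X^{\top} \mathbf{y}
$$

The closed form on the right holds when $X^{\top} X$ is invertible. In practice the minimisation is usually carried out with a numerical least-squares solver rather than by forming the inverse explicitly, which is also how the scikit-learn documentation describes `LinearRegression`.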


In [None]:
reg = analyzer.fit_bp_regression(features=["age", "weight"])

print("Regression coefficients (standardised predictors):")
for name, coef in zip(reg["features"], reg["coefficients"]):
    print(f"{name}: {coef:.3f}")
print(f"Intercept: {reg['intercept']:.3f} mmHg")


## Regression analysis: Model 1 (age + weight)

The first regression model uses `age` and `weight` as predictors of systolic 
blood pressure. The predictors are standardised before fitting, which allows the 
coefficients to be compared on the same scale.
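
The sketch below shows one way such a fit could be set up with `StandardScaler` and `LinearRegression`. It is an assumption about the general approach, not the implementation of `fit_bp_regression`: the function name `fit_standardised_regression` is hypothetical, the returned keys simply mirror the ones used in this notebook, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def fit_standardised_regression(df, features, target="systolic_bp"):
    """Fit OLS on z-scored predictors; the target stays in its original units."""
    X = StandardScaler().fit_transform(df[features])  # mean 0, sd 1 per column
    y = df[target].to_numpy()
    model = LinearRegression().fit(X, y)
    return {
        "features": list(features),
        "coefficients": pd.Series(model.coef_, index=features),
        "intercept": float(model.intercept_),
        "r_squared": float(model.score(X, y)),  # R-squared on the fitted data
    }


# Synthetic stand-in data (not the study dataset), for illustration only.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "age": rng.normal(50, 12, 200),
    "weight": rng.normal(75, 10, 200),
})
demo["systolic_bp"] = (
    100 + 0.5 * demo["age"] + 0.2 * demo["weight"] + rng.normal(0, 8, 200)
)

result = fit_standardised_regression(demo, ["age", "weight"])
print(result["coefficients"].round(3))
print(f"Intercept: {result['intercept']:.3f}  R-squared: {result['r_squared']:.3f}")
```

Because the standardised predictors all have mean zero, the intercept of a fit like this equals the mean of the target, which gives a quick sanity check on the output.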


In [None]:
reg1 = analyzer.fit_bp_regression(features=["age", "weight"])

print("Model 1: systolic_bp ~ age + weight\n")

for name, coef in zip(reg1["features"], reg1["coefficients"]):
    print(f"{name}: {coef:.3f}")

print(f"Intercept: {reg1['intercept']:.3f}")
print(f"R-squared: {reg1['r_squared']:.3f}")



## Regression analysis: Model 2 (age + weight + cholesterol)

To extend the model, cholesterol is added as a third predictor. This allows us 
to examine whether including an additional health variable improves the model’s 
ability to explain variation in systolic blood pressure.



In [None]:
reg2 = analyzer.fit_bp_regression(features=["age", "weight", "cholesterol"])

print("Model 2: systolic_bp ~ age + weight + cholesterol\n")

for name, coef in zip(reg2["features"], reg2["coefficients"]):
    print(f"{name}: {coef:.3f}")

print(f"Intercept: {reg2['intercept']:.3f}")
print(f"R-squared: {reg2['r_squared']:.3f}")


### Comparing two regression models

Two multiple linear regression models were fitted for systolic blood pressure:

- **Model 1:** age and weight as predictors  
- **Model 2:** age, weight and cholesterol as predictors  

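Both models use standardised predictors, while systolic blood pressure is kept
in its original units. Each coefficient therefore shows the expected change in
systolic blood pressure (in mmHg) for a one-standard-deviation increase in that
predictor, with the other predictors held constant.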
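
A coefficient per standard deviation can be converted back to a coefficient per original unit by dividing it by that predictor's standard deviation. The sketch below checks this on synthetic data with plain NumPy. The helper `ols` and all values are made up for the example, and the standard deviation is the population version (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Synthetic stand-in data (not the study dataset), for illustration only.
rng = np.random.default_rng(0)
n = 300
age = rng.normal(50, 12, n)
weight = rng.normal(75, 10, n)
bp = 100 + 0.5 * age + 0.2 * weight + rng.normal(0, 8, n)

X_raw = np.column_stack([age, weight])
sd = X_raw.std(axis=0)                      # population sd (ddof=0)
X_std = (X_raw - X_raw.mean(axis=0)) / sd   # z-scores


def ols(X, y):
    """Return (intercept, slopes) from least squares with an intercept column."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]


_, coef_std = ols(X_std, bp)   # change in bp per one-sd increase
_, coef_raw = ols(X_raw, bp)   # change in bp per one-unit increase

# Dividing each standardised coefficient by its sd recovers the per-unit slope.
print(coef_std / sd)
print(coef_raw)
```

The two printed arrays agree, so standardising changes only the scale on which the coefficients are read, not the fitted model.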

The R-squared values were:

- Model 1: 0.405  
- Model 2: 0.406  

Model 2 has a slightly higher R-squared than Model 1, but the improvement is
only 0.001. Adding cholesterol therefore explains almost no additional variation
in systolic blood pressure once age and weight are in the model. Both models
explain about 40% of the variation, so most of it remains unexplained by these
predictors.
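
When reading a small gain like this, it helps to remember that R-squared computed on the fitted data cannot decrease when a predictor is added to an ordinary least squares model, even if the new predictor is unrelated to the outcome. Adjusted R-squared penalises the number of predictors and is one common way to account for that. The sketch below illustrates both on synthetic data. The helper functions and the `unrelated` variable are made up for the example and do not come from the notebook's modules.

```python
import numpy as np


def r_squared(X, y):
    """R-squared of an OLS fit with intercept, evaluated on the fitted data."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)


def adjusted_r_squared(r2, n, p):
    """Penalise R-squared for p predictors, given n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Synthetic stand-in data (not the study dataset), for illustration only.
rng = np.random.default_rng(1)
n = 500
age = rng.normal(50, 12, n)
weight = rng.normal(75, 10, n)
unrelated = rng.normal(5, 1, n)   # plays no part in generating bp
bp = 100 + 0.5 * age + 0.2 * weight + rng.normal(0, 8, n)

r2_small = r_squared(np.column_stack([age, weight]), bp)
r2_large = r_squared(np.column_stack([age, weight, unrelated]), bp)

print(f"Two predictors:   R2 = {r2_small:.4f}, "
      f"adjusted = {adjusted_r_squared(r2_small, n, 2):.4f}")
print(f"Three predictors: R2 = {r2_large:.4f}, "
      f"adjusted = {adjusted_r_squared(r2_large, n, 3):.4f}")
```

The same two helpers could be applied to the study data, using its own number of observations, to check whether the 0.001 gain survives the adjustment.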

This comparison illustrates how multiple regression and linear algebra can be 
used to evaluate how much additional predictors contribute to explaining a 
health-related outcome.



## Method choices and motivation – Part 2

In Part 2 the aim was to build on the work from Part 1 by organising the code
better and adding a few more advanced analyses.

**Code structure**

The function `load_and_clean_data` in `analysis/cleaning.py` handles reading and
preparing the dataset, while the class `HealthAnalyzer` in `analysis/analyzer.py`
collects the main analysis tools such as descriptive statistics, plots and
regression. This keeps the notebook cleaner and makes the code easier to reuse.

**Visualisations**

The basic plots from Part 1 were moved into class methods and slightly improved
in their layout. A new scatter plot of blood pressure against age was also
added. This makes it easier to study how blood pressure changes with age for
smokers and non-smokers and provides a more detailed visual view than in Part 1.

**Regression**

Multiple linear regression was used to study how systolic blood pressure relates
to age, weight and cholesterol. The predictors were standardised so that the
coefficients could be compared on the same scale. Two models were tested, and
their R-squared values were compared to see whether cholesterol adds any extra
explanatory power.

Overall, Part 2 extends the analysis from Part 1 with a clearer code structure,
an extra visualisation and a more explicit use of regression and linear algebra.

## Sources

- **Course material**, which covers the fundamentals of linear regression and model evaluation.
- **Scikit-learn documentation** for `LinearRegression` and `StandardScaler`, which describes how the regression model is implemented in Python and why standardising predictors is useful.
- **General descriptions of multiple linear regression and R-squared**, such as the summaries available on Wikipedia or in standard introductory statistics texts.
