<span style="color:#04c921; font-size:24px; font-weight:700"> Linear Regression</span>

In this practice, we will use linear regression to analyze the Framingham data set and see whether we can find any meaningful relationships between blood pressure, age, and gender.

The numeric difference between your systolic and diastolic blood pressure is called your pulse pressure. For example, if your resting blood pressure is 120/80, your pulse pressure is 40. For adults older than age 60, a pulse pressure greater than 60 can be a useful predictor of heart attacks or other cardiovascular disease; this is especially true for men. The most important cause of elevated pulse pressure is stiffness of the aorta, the largest artery in the body. The greater your pulse pressure, the stiffer and more damaged the vessels are thought to be.
([Reference](http://www.mayoclinic.org/diseases-conditions/high-blood-pressure/expert-answers/pulse-pressure/faq-20058189))
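
The arithmetic above can be sketched in a few lines of R. The readings below are made-up values, not taken from the Framingham data, and the `elevated` flag simply applies the threshold of 60 mentioned in the text.

```r
# Hypothetical readings for three people (not taken from the Framingham data)
readings <- data.frame(
  sysBP = c(120, 150, 135),
  diaBP = c(80, 85, 70)
)

# Pulse pressure is the systolic reading minus the diastolic reading
readings$pulseP <- readings$sysBP - readings$diaBP

# Flag readings whose pulse pressure exceeds the threshold of 60 from the text
readings$elevated <- readings$pulseP > 60
print(readings)
```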

#### Read the data

Load the Framingham dataset from the directory '/dsa/data/all_datasets/framingham/'.

In [None]:
framingham_data <- read.csv("/dsa/data/all_datasets/framingham/framingham.csv")

Let's add a new column for the pulse pressure named "pulseP" and compute it from sysBP and diaBP.

In [None]:
framingham_data["pulseP"] <- framingham_data$sysBP - framingham_data$diaBP
head(framingham_data)

<span style="color:#1d80ba; font-size:14px; font-weight:700">Activity 1:</span> For this analysis, we need adults (age > 18) who are not taking blood pressure medication (BPMeds == 0). Create two subsets, one for males and one for females, and keep only the columns we will work with (age, sysBP, diaBP, BMI, heartRate, pulseP).

In [None]:
framingham_data_male   <- subset(framingham_data, <what goes in here>, select=c(2,11:14,17))
framingham_data_female <- subset(framingham_data, <what goes in here>, select=c(2,11:14,17))
head(framingham_data_male)
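
As a reminder of how `subset()` combines row conditions and column selection, here is a minimal sketch on a toy data frame. The data frame `people` and all of its columns are hypothetical and only stand in for the real data; note that `select` also accepts column names, which can be easier to read than numeric positions.

```r
# Toy data frame with hypothetical columns (not the Framingham data)
people <- data.frame(
  id     = 1:6,
  group  = c("a", "b", "a", "b", "a", "b"),
  age    = c(17, 25, 40, 63, 30, 52),
  onMeds = c(0, 0, 1, 0, 0, 0),
  score  = c(5.1, 6.3, 7.0, 4.8, 5.9, 6.1)
)

# Rows: combine conditions with & ; columns: pick by name via select
group_a <- subset(people, group == "a" & age > 18 & onMeds == 0,
                  select = c(age, score))
group_b <- subset(people, group == "b" & age > 18 & onMeds == 0,
                  select = c(age, score))

print(group_a)
print(group_b)
```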

Now, let's see if we can model the relation between age and pulse pressure for males vs. females.

<span style="color:#1d80ba; font-size:14px; font-weight:700">Activity 2:</span>   Fit a linear regression model where age is the independent variable, and pulse pressure is the dependent variable. 

In [None]:
# find the model for males
age_pp_male <- lm(<what goes in here> ~ <what goes in here>)
summary(age_pp_male)

# find the model for females
age_pp_female <- lm(<what goes in here> ~ <what goes in here>)
summary(age_pp_female)

The $R^2$ values are 0.10 and 0.22, which suggests that age alone is not a good predictor of pulse pressure in either males or females. Keep two caveats in mind. First, we are fitting a **linear** model, and the actual relationship could be nonlinear. Second, there might be different types of relationships in separate age brackets (such as before 50, after 50, etc.) that are hard to capture with a single linear model.
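
To illustrate the first caveat, the sketch below fits a straight line and then a quadratic model to synthetic data that follows a curve. The data are generated purely for illustration and have nothing to do with the Framingham measurements; the point is only that a low $R^2$ from a linear fit does not rule out a strong nonlinear relationship.

```r
# Synthetic data with a mostly quadratic relationship (an assumption for illustration)
x <- seq(-5, 5, by = 1)
y <- x^2 + sin(x)

# A straight line cannot follow the symmetric curve
fit_linear <- lm(y ~ x)
r2_linear <- summary(fit_linear)$r.squared

# Adding a squared term lets the model follow the curve
fit_quadratic <- lm(y ~ x + I(x^2))
r2_quadratic <- summary(fit_quadratic)$r.squared

print(c(linear = r2_linear, quadratic = r2_quadratic))
```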

<span style="color:#1d80ba; font-size:14px; font-weight:700">Activity 3:</span>  Let's do the same for BMI and pulse pressure. Fit a linear regression model where BMI is the independent variable and pulse pressure is the dependent variable. 

In [None]:
# find the model for males
bmi_pp_male <- lm(<what goes in here>)
summary(bmi_pp_male)

# find the model for females
bmi_pp_female <- lm(<what goes in here>)
summary(bmi_pp_female)

Again, $R^2$ values are too low to suggest a good model based on BMI.

<span style="color:#1d80ba; font-size:14px; font-weight:700">Activity 4:</span>  Let's create two subsets from the male data: the first for those younger than 50, the second for those aged 50 or older. We want to see whether the relationship between heart rate and pulse pressure differs between older and younger males.


In [None]:
framingham_data_male_younger <- subset(framingham_data_male, <what goes in here>)
framingham_data_male_older   <- subset(framingham_data_male, <what goes in here>)

# Now fit a linear model to each subset, with pulse pressure as the independent variable and heart rate as the dependent variable.
hr_pp_male_young <- lm(<what goes in here>)
summary(hr_pp_male_young)

hr_pp_male_older <- lm(<what goes in here>)
summary(hr_pp_male_older)

Again, we do not see a meaningful linear model for heart rate given pulse pressure, for either younger or older males. Let's look at the relationship between systolic and diastolic blood pressures. 

The dynamic relationship between diastolic and systolic blood pressure, expressed by the **ambulatory arterial stiffness index (AASI)**, has been introduced as a measure of arterial function:

$$\text{AASI} = 1 - (\text{regression slope of diastolic versus systolic blood pressure})$$

The available evidence suggests that AASI can predict future cardiovascular events, particularly stroke, and is associated with indices of arterial function. ([Reference](https://www.ncbi.nlm.nih.gov/pubmed/22632918))
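
A minimal sketch of this calculation, assuming a handful of made-up paired readings in a hypothetical data frame `bp` (not Framingham data): regress diastolic on systolic pressure, take the slope, and subtract it from one.

```r
# Hypothetical paired readings (made up for illustration, not Framingham data)
bp <- data.frame(
  sys = c(110, 120, 130, 140, 150, 160),
  dia = c(70, 76, 79, 85, 88, 94)
)

# Regress diastolic on systolic pressure
fit <- lm(dia ~ sys, data = bp)

# The slope is the coefficient on sys; AASI is one minus that slope
slope <- coef(fit)[["sys"]]
aasi <- 1 - slope
print(aasi)
```

A steeper slope (diastolic pressure tracking systolic pressure closely) gives a smaller index, while a flatter slope gives a larger one.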

<span style="color:#1d80ba; font-size:14px; font-weight:700">Activity 5:</span>  Now, we want to compute the AASI for males older than 50 vs. females older than 50. For this, we first create the female subset, then compute the slopes of diastolic vs. systolic linear regression model for both sets, and finally compute the AASI for both sets. 

In [None]:
framingham_data_female_older <- subset(framingham_data_female, age >= 50)
# Now compute the linear models for framingham_data_female_older and framingham_data_male_older; 
# Use systolic pressure as independent variable

In [None]:
# find the model for older males
slope_male_older <- lm(<what goes in here> ~ <what goes in here>)
summary(slope_male_older)

# find the model for older females
slope_female_older <- lm(<what goes in here> ~ <what goes in here>)
summary(slope_female_older)


The $R^2$ values suggest that there is a somewhat linear relationship between systolic and diastolic blood pressures for both genders. Let's look at the coefficients of the models.

In [None]:
# The first column (Estimate) holds the intercept in row 1 and the slope in row 2.

coef(summary(slope_male_older))
slope_mo <- coef(summary(slope_male_older))[2,1]

coef(summary(slope_female_older))
slope_fo <- coef(summary(slope_female_older))[2,1]
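
The positional index `[2,1]` works because `coef(summary(...))` returns a matrix with one row per model term and one column per statistic. As a sketch on a small made-up data set (the names `d`, `x`, and `y` are hypothetical), the same cell can also be selected by row and column name, which is less fragile if the model formula changes.

```r
# Small made-up data set (for illustration only)
d <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit <- lm(y ~ x, data = d)

# Coefficient matrix: rows are model terms, columns are statistics
coefs <- coef(summary(fit))
print(dimnames(coefs))

# Name-based indexing selects the same cell as coefs[2, 1]
slope_by_name <- coefs["x", "Estimate"]
slope_by_position <- coefs[2, 1]
```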


In [None]:
# Now, given slopes for both models, compute the corresponding AASI values

AASI_male_older <- 1 - slope_mo
AASI_female_older <- 1 - slope_fo

In [None]:
print(AASI_male_older)
print(AASI_female_older)

The AASI for older females is slightly higher than the AASI for older males in this data set. Keep in mind that this is an approximation; the real AASI is measured for an individual by monitoring their blood pressure over a 24-hour period. However, the almost linear relationship between systolic and diastolic blood pressure is indeed medically relevant; it is not just a fluke of this dataset. 

In [None]:
# Let's plot the model for males
library(ggplot2)
p = ggplot(framingham_data_male_older, aes(x=<what goes in here>, y=<what goes in here>)) +
    geom_point() +  
    geom_smooth(method=lm,level = 0.95)   # Add linear regression line, by default includes 95% confidence region
p+xlab('systolic BP')+ylab('diastolic BP')

In [None]:
# and for females
p = ggplot(framingham_data_female_older, aes(x=<what goes in here>, y=<what goes in here>)) +
    geom_point() +  
    geom_smooth(method=lm,level = 0.95)   # Add linear regression line, by default includes 95% confidence region
p+xlab('systolic BP')+ylab('diastolic BP')

# Save your notebook!