# Lab week 6 - Outcomes


In this lab you should read through and run the code in the lab sheet and complete the lab assessment. By the end of this lab you should be able to use R to:


* Run a linear regression and interpret the findings.
* Create and interpret a residual plot.
* Use linear regression for prediction.
* Use log-transformation.


Before running the code in the following questions, we will load the necessary packages below: 

In [None]:
library(ggplot2)
library(repr) 
options(repr.plot.width=4, repr.plot.height=4, repr.plot.res = 120)


# Exercise 1.1: Linear Regression


At the end of last week's lab you were asked to create a scatterplot to visualise the relationship between the two variables 'Velocity' and 'Distance' from the galaxy datafile. However, the datafile contained three outliers, which were visible in both the boxplots and the scatterplot. The datafile below contains the error-corrected data.

In [None]:
galaxy <- read.csv("week6-galaxy.csv")
head(galaxy)

As a reminder, the two variables describe the recessional velocity of a galaxy moving away from Earth (measured in km per second) and the distance of that galaxy from Earth (measured in million light years).

Calculate the covariance and the correlation coefficient of 'Velocity' and 'Distance'. Use the empty code cell below to do so.

In [None]:
# use this cell for your calculations.
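
As a reminder of how the two measures relate, here is a minimal sketch on made-up numbers (the vectors `x` and `y` are illustrative assumptions, not the galaxy data). It shows that the correlation coefficient is the covariance rescaled by both standard deviations.

```r
# Made-up illustration data (NOT the galaxy datafile)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

cov_xy <- cov(x, y)   # sample covariance, in units of x times units of y
cor_xy <- cor(x, y)   # Pearson correlation, unit-free, between -1 and 1

# The correlation is the covariance divided by both standard deviations
cor_by_hand <- cov_xy / (sd(x) * sd(y))

print(c(covariance = cov_xy, correlation = cor_xy, by_hand = cor_by_hand))
```

Because the covariance carries the units of both variables, its size is hard to judge on its own, whereas the correlation is comparable across datasets.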

*How would you interpret these findings?*

Let's visualise the association between the two variables using a scatterplot now.

In [None]:
ggplot(galaxy, aes(x = Velocity, y = Distance)) + geom_point() 

Derive the summary statistics for both variables via the `summary()` command. Use the code cell below.

In [None]:
# summary(galaxy$Distance)

In last week's lectures you learned about linear regression. You can now use the `geom_smooth()` command to add the regression line to your scatterplot, as shown below. This regression line is the graph of the linear function $y = \beta_0 + \beta_1 x$. Remember from last week that $y$ is hence a linear transformation of $x$.

In [None]:
ggplot(galaxy, aes(x = Velocity, y = Distance)) +
 geom_point() +
 geom_smooth(method = lm, se = FALSE)


In the equation above, the response variable $y$ depends on the explanatory variable $x$, while $\beta_0$ and $\beta_1$ are constants (called "regression coefficients"). The corresponding regression model is $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i$ denotes the individual vertical deviation from the regression line for the *i-th* observation. Those deviations are called "residuals". We will now fit a linear regression model to our data, that is, we will find the optimal regression coefficients using R. To do so, run the lines below. Pay attention to the order of the variables: in `lm()` the response variable has to be entered first, to the left of the `~`.

In [None]:
# runs the linear regression and stores the fitted model (coefficients, residuals, fitted values, ...)
lmRes1 <- lm(galaxy$Distance ~ galaxy$Velocity)
# summarizes the results
summary(lmRes1)
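
To make "finding the optimal regression coefficients" more concrete, the sketch below computes the least-squares estimates by hand, using $\hat\beta_1 = \mathrm{cov}(x, y) / \mathrm{var}(x)$ and $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$, and compares them with `lm()`. The data frame `toy` and its columns are made up for illustration and are not part of the lab data.

```r
# Made-up illustration data (NOT the galaxy datafile)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Least-squares estimates computed by hand
b1_hand <- cov(toy$x, toy$y) / var(toy$x)        # slope
b0_hand <- mean(toy$y) - b1_hand * mean(toy$x)   # intercept

# The same estimates obtained from lm(); the response goes left of the ~
fit <- lm(y ~ x, data = toy)

print(coef(fit))
print(c(intercept = b0_hand, slope = b1_hand))
```

Both approaches should agree up to rounding, which is a quick way to convince yourself that `lm()` is doing nothing more mysterious than least squares.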

# Interpretation of Regression Results

The first two lines state the command you ran. This is followed by a five-number summary of the residuals. Next, you will see the coefficients table, the first column of which contains the estimates of $\beta_0$ and $\beta_1$. The residual standard error measures the typical (vertical) distance between the observed values and the regression line, in units of the dependent variable (i.e. the response variable). You will also find the coefficient of determination $R^2$, which indicates how well the model approximates the observations. More precisely, $R^2$ is the proportion of the variation in the response that can be explained by the explanatory variable, hence the name 'explanatory variable'. You can ignore all other output for now.
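
The quantities described above can also be pulled out of the summary object directly rather than read off the printed output. The following is a sketch on made-up data (`toy` is an assumption for illustration); it also checks two standard identities for simple linear regression: $R^2$ equals the squared correlation, and the residual standard error is $\sqrt{\sum_i e_i^2 / (n - 2)}$, where $e_i$ are the residuals.

```r
# Made-up illustration data (NOT the galaxy datafile)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))
fit <- lm(y ~ x, data = toy)
s <- summary(fit)

r_squared <- s$r.squared   # coefficient of determination
rse <- s$sigma             # residual standard error

# Cross-checks that hold for a regression with a single explanatory variable
r_squared_check <- cor(toy$x, toy$y)^2
rse_check <- sqrt(sum(residuals(fit)^2) / df.residual(fit))  # df.residual is n - 2 here

print(c(r_squared = r_squared, check = r_squared_check))
print(c(rse = rse, check = rse_check))
```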


From the above results, answer the following questions: 

* What is the estimated regression line? **Write it out.** 
* From the output, what are the estimates for $\beta_0$, $\beta_1$ and $R^2$?
* For your data, does it appear that 'Distance' depends on 'Velocity'? How would you interpret the regression coefficients?
* Report the standard deviation of the residuals. Note: This is labelled *Residual Standard Error* in the regression output.


# Exercise 1.2: Exchanging Response and Explanatory Variable

Repeat the method from above but exchange the response and explanatory variables, i.e. choose 'Distance' as the explanatory and 'Velocity' as the response variable this time. Then answer the corresponding assignment questions.

In [None]:
# this block can be used for fitting the linear model with exchanged variables.

lmRes2 <- lm(...)     # has to be completed by you!

summary(lmRes2)
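
Exchanging the two variables does not simply invert the slope. As a sketch on made-up data (the data frame `toy` is an assumption for illustration), the product of the two slopes equals the squared correlation, while $R^2$ is the same in both directions.

```r
# Made-up illustration data (NOT the galaxy datafile)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

fit_yx <- lm(y ~ x, data = toy)   # y is the response
fit_xy <- lm(x ~ y, data = toy)   # x is the response

slope_yx <- unname(coef(fit_yx)["x"])
slope_xy <- unname(coef(fit_xy)["y"])

# The slopes are reciprocal only if the correlation is exactly +1 or -1
print(c(slope_yx = slope_yx, one_over_slope_xy = 1 / slope_xy))
print(c(product = slope_yx * slope_xy, r_squared = cor(toy$x, toy$y)^2))
```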

# Exercise 2: Residual Analysis


## Plotting the Residuals

As explained above, residuals are vertical deviations from the regression line. Very large residuals (positive as well as negative) hence mark outliers in the dataframe. Outliers are troublesome for a regression analysis because they pull the regression line towards themselves. Let us take a look at the residuals of the previous regression (i.e. the regression performed in Exercise 1.2). To plot the residuals, you can use the command `plot(lmRes2,1)`, where 'lmRes2' is the variable we created above.


In [None]:
residuals(lmRes2)      # shows all residuals
plot(lmRes2, 1)        # plots all residuals against their fitted values (values on the regression line)


In the residual plot above you will find that the most extreme residuals have their row number printed next to them. Bear these in mind, as we will remove the corresponding rows from the dataframe later. Hint: You may also find the following lines of code helpful to clearly identify the most extreme outliers.

In [None]:
x <- residuals(lmRes2)
sort(x)
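
Since `sort(x)` orders the residuals by sign, the most extreme ones end up at both ends of the output. One possible alternative, sketched here on made-up data (`toy` is an assumption, with one deliberately unusual point in row 5), is to rank the residuals by absolute value so the largest appear first.

```r
# Made-up illustration data with one unusual observation in row 5
toy <- data.frame(x = 1:8,
                  y = c(1.2, 1.9, 3.1, 4.0, 9.5, 6.1, 6.8, 8.2))
fit <- lm(y ~ x, data = toy)

res <- residuals(fit)                      # named by row number
ranked <- sort(abs(res), decreasing = TRUE)

head(ranked, 3)                            # three largest residuals in absolute value
largest_rows <- as.integer(names(ranked)[1:3])
print(largest_rows)                        # row numbers of those observations
```

The names of the residual vector are the row names of the data, which is why they can be converted into row numbers here.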

**What do you notice about these residuals?**

* Is there any residual pattern?
* Are there any outliers? 
* What do you think the mean of the residuals might be (from the residual plot, no calculation needed)?


## Removing the Outliers

Find the three largest residuals (based on their absolute value) and remove the corresponding observations (rows) from the dataframe using the code line below.


In [None]:
galaxy_Modified <- galaxy[-c(...),]     # has to be completed
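
If negative indexing is unfamiliar, the following sketch shows how it works on a small made-up data frame. The row numbers in `rows_to_drop` are hypothetical and unrelated to the galaxy data.

```r
# Made-up illustration data (NOT the galaxy datafile)
toy <- data.frame(id = 1:6,
                  value = c(10, 12, 11, 50, 13, 9))

rows_to_drop <- c(2, 4)                 # hypothetical row numbers to remove
toy_reduced <- toy[-rows_to_drop, ]     # a negative index drops rows, the comma keeps all columns

print(nrow(toy_reduced))
print(toy_reduced)
```

Note that the remaining rows keep their original row names, so labels in later residual plots still refer to the numbering of the original data frame.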

To see what effect these large residuals have we can re-run the regression without them. Make note of any differences in the results (i.e. the values of $\beta_0$ and $\beta_1$).

In [None]:
lmRes_corrected <- lm(...)

summary(lmRes_corrected)

Also re-run the residual analysis using the residual plot as seen above.

In [None]:
plot(lmRes_corrected,1)

* What do you notice about the outlier corrected residuals? 
* How does the plot compare to the plot of the original residuals?


# Exercise 3: Prediction

A fitted regression model is often used to predict the dependent variable (response) for different levels of the independent variable (explanatory variable). You will now try to predict the number of dead larvae based on the concentration of the insecticide used. But first, we will read in the 'larvae' datafile. This datafile has also been slightly altered compared to previous labs, so please make sure you read in this new datafile.

Remember: The datafile contains data from an experiment testing the effect of different insecticide intensities. The variable 'NumberLarvae' records the number of dead larvae following different intensities of insecticide treatment (the variable 'Insecticide'), measured in millilitres per litre.

In [None]:
larvae <- read.csv("week6-larvae.csv")
head(larvae)

The scatterplot below shows the relationship that we have already seen in last week's lab. Try to add a linear regression line to this visualisation.

In [None]:
ggplot(larvae, aes(x = Insecticide, y = NumberLarvae)) + geom_point() 
ggplot(larvae, aes(x = Insecticide, y = NumberLarvae)) + geom_point() +  ...   # to be completed

Use the code cell below to fit a linear regression model with 'Insecticide' as the independent variable and 'NumberLarvae' as the dependent variable.


In [None]:
lm2 <- lm(...)  # needs to be completed
plot(lm2,1)        
summary(lm2)

Interpret your findings regarding the regression coefficients $\beta_0$ and $\beta_1$ as well as $R^2$ and the residual standard deviation.

Now, repeat the **residual analysis** from above for this regression. 

You will now use the estimated regression line to predict how many larvae will die if an insecticide with a certain concentration is used. Let's calculate the predicted number of dead larvae using a dosage of 1 millilitre per litre.

Enter the estimates of the regression coefficients $\beta_0$ and $\beta_1$ from your regression output above in the code below and then run the chunk of code.

In [None]:
# Enter intercept estimate: 
b0 <- 
# Enter b1 estimate: 
b1 <- 

concentration <- 1                                  
PredictedDeadLarvae <- b0 + b1*concentration
PredictedDeadLarvae
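
The manual calculation above can be cross-checked with R's `predict()` function. The sketch below uses made-up data (the data frame `toy` and its columns `dose` and `count` are assumptions, not the larvae file) and shows that both routes give the same prediction.

```r
# Made-up illustration data (NOT the larvae datafile)
toy <- data.frame(dose = c(0.5, 1, 1.5, 2, 2.5),
                  count = c(3, 5, 8, 9, 12))

# Fit with the data argument so that predict() can match column names
fit <- lm(count ~ dose, data = toy)

b0 <- unname(coef(fit)[1])   # intercept estimate
b1 <- unname(coef(fit)[2])   # slope estimate

by_hand <- b0 + b1 * 1
by_predict <- unname(predict(fit, newdata = data.frame(dose = 1)))

print(c(by_hand = by_hand, by_predict = by_predict))
```

`predict()` with `newdata` relies on the model having been fitted with plain column names and a `data` argument. A model written as `lm(df$y ~ df$x)` cannot be used this way.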

Now turn to your assignment and answer the questions for this part of today's lab. 

# Exercise 4: Log Transformation in Linear Regression

**What did you notice about the residuals?** Take another look at the residual plot.

* Is there any residual pattern?
* Are there any outliers? 
* Does the variance of the residuals depend on the fitted values or is it constant?

One application of log-transformation is to remove non-constant variance in data.
Your last task for today is to perform a log-transformation on the 'NumberLarvae' variable and investigate the consequences of running a linear regression with the transformed values. Run the code below to execute the transformation.

In [None]:
larvae$LogNumberLarvae <- log(larvae$NumberLarvae)
head(larvae)

If the response variable is log-transformed in a linear regression, the interpretation of the regression coefficients changes. We shall focus on $\beta_1$ only. More precisely, $e^{\beta_1}$ is the multiplicative change in the response that we expect to see for every one-unit increase of the independent variable.
Now, perform a linear regression analysis and a residual analysis using the code cells below.

In [None]:
lm3 <- lm(...)      # linear model, needs to be completed

summary(lm3)

In [None]:
plot(lm3,1)          # residual plot
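
The multiplicative interpretation of $e^{\beta_1}$ follows from back-transforming the model: if $\log(y) = \beta_0 + \beta_1 x$, then $y = e^{\beta_0} \cdot (e^{\beta_1})^x$, so each one-unit increase in $x$ multiplies $y$ by $e^{\beta_1}$. The sketch below illustrates this on made-up data constructed to grow by exactly 50% per unit (the data frame `toy` and its columns are assumptions, not the larvae file).

```r
# Made-up illustration data: count grows by a factor of 1.5 per unit of dose
toy <- data.frame(dose = 0:5)
toy$count <- 2 * 1.5^toy$dose
toy$log_count <- log(toy$count)      # natural logarithm, as in the lab

fit <- lm(log_count ~ dose, data = toy)
b1 <- unname(coef(fit)["dose"])

print(b1)        # slope on the log scale
print(exp(b1))   # back-transformed: multiplicative change per unit of dose
```

Because the toy data follow the exponential pattern exactly, `exp(b1)` recovers the factor 1.5 up to rounding. With real data the estimate describes the average multiplicative change.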

**What do you notice about the new residuals?**

* Is there any residual pattern now?
* Are there any outliers? 
* Is the variance across the fitted values similar or rather different?
* What has changed from the previous residual plot? 

Compare your findings for both models regarding the residuals. 

* In which model do the residuals appear to be more constant (from the residual plot)?

Please finalise this week's lab sheet by answering the remaining questions in your assignment now. Remember to round all results to 3 decimal places. Trailing zeros can, as always, be omitted. For example, state 2.5 as 2.5 and not as 2.500.

Good luck!