# Lecture 1: Code demo

In [None]:
suppressPackageStartupMessages(library(tidyverse))
library(palmerpenguins)
library(infer)
library(modelr)

## Part I

### 0. Remove the NAs

In [None]:
penguins_clean <-
    penguins %>%
    ... #which function removes the rows that contain NAs?

cat("We lost", nrow(penguins) - nrow(penguins_clean), "rows by removing the NAs")

### 1. Start with scatterplot

In [None]:
# Adjust these numbers so the plot looks good on your desktop.
options(repr.plot.width = 7, repr.plot.height = 5) 

penguins_clean %>% 
  ggplot(aes(flipper_length_mm, body_mass_g)) + 
  geom_point(size = 2) + 
  ggtitle("Palmer Penguins (NAs removed)") +  #make sure to have meaningful titles and axis labels
  xlab("Flipper Length (mm)") + 
  ylab("Body mass (g)") + 
  theme(text = element_text(size = 15))

### 2. Fit the `lm`

In [None]:
(penguins_lm <- ...)

#### 2.1 Extract coefficients

In [None]:
# one way to extract coefs from lm

...

In [None]:

# another way to extract coefs from lm

...

#### 2.2 Make predictions

In [None]:
#using a useful function from the modelr library
(penguins_clean <-
    penguins_clean %>%
    ...)

#### 2.3 `lm` with no intercept

If you add a 0 to your formula, you remove the intercept of the model.

We **do not** want to fit a model without an intercept, but this trick can be useful to us later (more on this next lecture). 

In [None]:
lm(body_mass_g ~ 0 + flipper_length_mm, data = penguins_clean)
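To see what the `0 +` does, here is a minimal sketch on R's built-in `cars` data (chosen only for illustration; it is not part of this demo). It compares the coefficient names of a fit with and without the intercept.

```r
# Built-in `cars` data (stopping distance vs. speed), used only for illustration.
fit_with_intercept <- lm(dist ~ speed, data = cars)
fit_no_intercept   <- lm(dist ~ 0 + speed, data = cars)

# The default model estimates an intercept and a slope ...
names(coef(fit_with_intercept))

# ... while `0 +` leaves only the slope, forcing the line through the origin.
names(coef(fit_no_intercept))
```

The no-intercept fit has a single coefficient, so the fitted line is forced to pass through (0, 0).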

### 3. Write the equation

Equation for prediction?


Equation for the estimated conditional mean of response variable?




#### 3.1 Interpret the parameter

- **Slope:** ...
- **Intercept:** ...

### 4. Show plot

#### 4.1 `geom_smooth`

In [None]:
# Adjust these numbers so the plot looks good on your desktop.
options(repr.plot.width = 7, repr.plot.height = 5) 

penguins_clean %>% 
  ggplot(aes(flipper_length_mm, body_mass_g)) + 
  geom_point(size = 2) + 
  ... + #adding the prediction line / line for conditional expectation
  ggtitle("Palmer Penguins (NAs removed)") + 
  xlab("Flipper Length (mm)") + 
  ylab("Body mass (g)") + 
  theme(text = element_text(size = 25))

#### 4.2 Using the predictions

In [None]:
# Adjust these numbers so the plot looks good on your desktop.
options(repr.plot.width = 7, repr.plot.height = 5) 

penguins_clean %>% 
  ggplot(aes(flipper_length_mm, body_mass_g)) + 
  geom_point(size = 2) + 
  ... + #another way to add the predicted values across a range of input values
  ggtitle("Palmer Penguins (NAs removed)") + 
  xlab("Flipper Length (mm)") + 
  ylab("Body mass (g)") + 
  theme(text = element_text(size = 25))

## Part II - Confidence Interval

### 5. Mathematical Approach

#### 5.1 Manual Calculation (not examinable)

Let's first calculate it by hand. Of course, you won't need to do this, but it might help you understand what R is doing.

In [None]:
# Number of data points
n <- nrow(penguins_clean)

# Estimate σ  
s <- sqrt(sum(penguins_lm$residuals**2)/ (n - 2)) 
s

##### The case of $\beta_0$

In [None]:
# Std. error of the beta0 estimator
x <- penguins_clean$flipper_length_mm
s_beta0_hat <- sqrt(s^2 * (1 / n + mean(x)^2 / sum((x - mean(x))^2)))

# 95% confidence interval for beta0
tibble(
    ci_lower = coef(penguins_lm)[1] - qt(0.975, n - 2) * s_beta0_hat,
    ci_upper = coef(penguins_lm)[1] + qt(0.975, n - 2) * s_beta0_hat
)
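For reference, the cell above implements the usual simple linear regression formulas. The notation is introduced here only for illustration: $x_i$ is the flipper length of penguin $i$, $\bar{x}$ is its sample mean, and $e_i$ are the residuals of the fitted model.

$$
s^2 = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2,
\qquad
\operatorname{SE}(\hat\beta_0) = s\,\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}},
\qquad
\hat\beta_0 \pm t_{0.975,\,n-2}\,\operatorname{SE}(\hat\beta_0)
$$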

##### The case of $\beta_1$

In [None]:
# Std. error of the beta1 estimator
x <- penguins_clean$flipper_length_mm
s_beta1_hat <- sqrt(s^2 / sum((x - mean(x))^2))

# 95% confidence interval for beta1
tibble(
    ci_lower = coef(penguins_lm)[2] - qt(0.975, n - 2) * s_beta1_hat,
    ci_upper = coef(penguins_lm)[2] + qt(0.975, n - 2) * s_beta1_hat
)
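In formula form, with $s$ the estimate of $\sigma$ computed earlier and $x_i$, $\bar{x}$ denoting the flipper lengths and their sample mean (notation introduced here only for illustration), the slope calculation reads:

$$
\operatorname{SE}(\hat\beta_1) = \frac{s}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}},
\qquad
\hat\beta_1 \pm t_{0.975,\,n-2}\,\operatorname{SE}(\hat\beta_1)
$$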

#### 5.2 R does it for us! (examinable)

You can use the `confint` function.

In [None]:
...

Or the `broom::tidy` function, which extracts everything for us! 

In [None]:
...
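As a sanity check that the manual formula and R agree, here is a small sketch on the built-in `cars` data (an illustrative stand-in, not the penguins fit). The names `manual_ci` and `r_ci` are made up for this example.

```r
# Built-in `cars` data, used only for illustration.
fit <- lm(dist ~ speed, data = cars)
n_obs <- nrow(cars)

# Manual 95% CI for the slope, following the formulas from section 5.1.
s_hat    <- sqrt(sum(residuals(fit)^2) / (n_obs - 2))
sxx      <- sum((cars$speed - mean(cars$speed))^2)
se_slope <- s_hat / sqrt(sxx)
manual_ci <- coef(fit)[["speed"]] + c(-1, 1) * qt(0.975, df = n_obs - 2) * se_slope

# The same interval from confint().
r_ci <- confint(fit, parm = "speed", level = 0.95)

manual_ci
r_ci

# Should be TRUE up to floating-point tolerance.
all.equal(manual_ci, unname(r_ci[1, ]))
```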

### 6. Bootstrap Approach

In [None]:
# Infer package framework https://infer.netlify.app

(ci_slope <-
  penguins_clean %>% 
  ... %>% #first specify the model fitted
  generate(reps = 15000, type = "bootstrap") %>% #then generate the bootstrap samples
  ... %>% #then get the bootstrap estimates of slope parameter
  ...) #then get percentile based bootstrap ci
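To make the idea behind the percentile bootstrap concrete, here is a rough base-R sketch on the built-in `cars` data. It is not the `infer` pipeline from the cell above, just one way to spell out the same steps by hand; the seed, the number of replicates, and all variable names are arbitrary choices for this example.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

# Built-in `cars` data, used only for illustration.
fit   <- lm(dist ~ speed, data = cars)
n_obs <- nrow(cars)
reps  <- 2000   # fewer replicates than in the cell above, to keep this quick

# Resample rows with replacement, refit the model, and keep the slope each time.
boot_slopes <- replicate(reps, {
  idx <- sample(n_obs, size = n_obs, replace = TRUE)
  coef(lm(dist ~ speed, data = cars[idx, ]))[["speed"]]
})

# Percentile-based 95% CI: the 2.5% and 97.5% quantiles of the bootstrap slopes.
boot_ci <- quantile(boot_slopes, probs = c(0.025, 0.975))
boot_ci
```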

## Part III - Hypothesis Test

### 7.1 Math-based approach

We can get the hypothesis test directly from R using the `summary` function:

In [None]:
summary(penguins_lm) #messy output

or the `broom::tidy` function:

In [None]:
... #clean-looking output
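For each coefficient $\beta_j$, both outputs report a test of $H_0: \beta_j = 0$ against a two-sided alternative. Under the usual model assumptions, and with $T_{n-2}$ denoting a $t$-distributed random variable with $n-2$ degrees of freedom (notation introduced here only for illustration), the reported statistic and p-value are:

$$
t = \frac{\hat\beta_j - 0}{\operatorname{SE}(\hat\beta_j)},
\qquad
p\text{-value} = 2\,P\!\left(T_{n-2} \ge |t|\right)
$$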

### 7.2 Bootstrap 

In [None]:
# Infer package framework
beta1_hat <- coef(penguins_lm)[2]

... |> #obtain bootstrap estimates of slope first
    ... #then get p-value

## Part IV - Range Problem

In [None]:
# Let's read the data
milk_fat <- read_csv("data/milk_fat_train.csv")

In [None]:
# Let's take a look
milk_fat %>% 
    ggplot() + 
    geom_point(aes(week, fat), size=4) + 
    xlab("Week") + 
    ylab("Fat (%)") +
    theme(text = element_text(size=25))


The relationship seems fairly linear here. Let's try fitting a linear model. 


In [None]:
(milk_fat_lm <- lm(fat ~ week, data = milk_fat))

In [None]:
# Let's take a look
milk_fat %>% 
    ggplot(aes(week, fat)) + 
    geom_point(size = 4) + 
    xlab("Week") + 
    ylab("Fat (%)") +
    theme(text = element_text(size=25)) + 
    geom_smooth(method = lm, se = FALSE)

What would happen if we predicted for week 40? The model would predict a negative fat content!
<font color='red'>This is clearly impossible!!</font> Don't do it!

But what if you wanted to predict for weeks 7, 8, 9, and 10?
The model's predictions wouldn't be absurd, so maybe we can do it. Let's try.

In [None]:
(milk_fat_extra_weeks <- 
    milk_fat %>% 
    rows_append(tibble(week = c(4, 5, 6, 7, 8, 9, 10), fat = NA)) %>% 
    add_predictions(model = milk_fat_lm, var = 'pred'))


In [None]:
# Let's take a look
(plot_pred <- 
    milk_fat_extra_weeks %>% 
    ggplot() + 
    geom_point(aes(week, fat), size = 4) + 
    xlab("Week") + 
    ylab("Fat (%)") +
    theme(text = element_text(size=25)) + 
    geom_line(aes(week, pred)))

Ok, seems reasonable, doesn't it?

Let's load the true values and compare.

In [None]:
milk_fat_true <- read_csv("data/milk_fat_true.csv")

plot_pred + 
    geom_point(aes(week, fat), size = 4, color = 'red', data = milk_fat_true)

Even if you have a very good model in a given range of your data, there is no guarantee that the shape of the association will be the same outside the range of the data. 

<font color='red'><u>**Careful when predicting outside the range of your data**.</u></font> (That is, avoid it whenever possible!)

This is not a rare example picked just to illustrate the problem; it is a common issue.
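The same effect is easy to reproduce with made-up numbers. The sketch below is purely illustrative and does not use the milk-fat data: `true_curve` is an invented function that flattens out, and a straight line is fitted only to weeks 1 to 5.

```r
# Invented curve that levels off near 3; NOT the milk-fat data.
true_curve <- function(week) 3 + 2 * exp(-0.3 * week)

# "Training" data: only weeks 1 to 5 are observed.
train <- data.frame(week = 1:5)
train$y <- true_curve(train$week)

# A straight line fits well inside this range ...
line_fit <- lm(y ~ week, data = train)

# ... but compare prediction and truth further and further outside of it.
new_weeks <- data.frame(week = c(6, 10, 40))
data.frame(
  week      = new_weeks$week,
  predicted = predict(line_fit, newdata = new_weeks),
  truth     = true_curve(new_weeks$week)
)
```

The further the new week is from the observed range, the larger the gap between the line and the curve; by week 40 the line predicts a negative value while the curve stays above 3.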


Note that this data has a temporal component, so the observations here are NOT independent.
This is intuitive: the % of fat in the milk this week is most likely associated with the % of fat in the prior week(s).

<font color='red'><u>**Careful with temporal data**.</u></font> It usually violates the independence assumption.


## Part V (Optional) - The misleading word `Linear` 

In [None]:
crazy_lm <- lm(body_mass_g ~ flipper_length_mm + I(flipper_length_mm^2), data = penguins_clean)

In [None]:
crazy_lm
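One way to read the title of this part: the model fitted by `crazy_lm` is a curve in flipper length, yet it still counts as a *linear* model because it is linear in the coefficients. Writing $x$ for flipper length (notation introduced here only for illustration), the fitted conditional mean has the form:

$$
E[\text{body mass} \mid x] = \beta_0 + \beta_1 x + \beta_2 x^2
$$

The response is a linear combination of $\beta_0$, $\beta_1$, and $\beta_2$; only the inputs $x$ and $x^2$ are nonlinear in flipper length.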

In [None]:
penguins_clean <-
    penguins_clean %>%
    add_predictions(crazy_lm, var = 'crazy_pred')

In [None]:
# Adjust these numbers so the plot looks good on your desktop.
options(repr.plot.width = 20, repr.plot.height = 10) 

penguins_clean %>% 
  ggplot(aes(flipper_length_mm, body_mass_g)) + 
  geom_point(size = 3) + 
  geom_line(aes(y = slr_pred), color = 'blue', lwd = 1) + 
  geom_line(aes(y = crazy_pred), color = 'red', lwd = 1) + 
  ggtitle("Palmer Penguins (NAs removed)") + 
  xlab("Flipper Length (mm)") + 
  ylab("Body mass (g)") + 
  theme(text = element_text(size = 25))

In [None]:
summary(penguins_lm)

In [None]:
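ybar <- mean(penguins_clean$body_mass_g)
(SSE <- sum(penguins_lm$residuals^2))                # sum of squared residuals
(SSR <- sum((penguins_lm$fitted.values - ybar)^2))   # variation explained by the regression
(SST <- var(penguins_clean$body_mass_g) * (length(penguins_clean$body_mass_g) - 1))  # total variation

In [None]:
SSR / SST  # proportion of variation explained (R-squared)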
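For a least-squares fit that includes an intercept, the three sums of squares computed above are linked, and their ratio is the coefficient of determination, which should match the "Multiple R-squared" reported by `summary(penguins_lm)`:

$$
SST = SSR + SSE,
\qquad
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
$$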