## Linear regression
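Linear regression models are fit with the lm() function. Its key arguments are formula, which is required, and data, which is optional but usually supplied so that the variables named in the formula are looked up in that data frame.
A useful way to view the results is to save the fitted model to an object and then call summary() on it, as in summary(model).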

Let’s say we want to run a regression on a dataset df that has an outcome variable y and predictors x1, x2, and x3. We could build the linear model as follows.

In [3]:
# Creating the faux dataframe
df <- data.frame(y = c(1, 2, 3, 4, 5),
  x1 = c(0.1, 0.2, 0.3, 0.4, 0.5),
  x2 = c(10, 20, 30, 40, 50),
  x3 = c(100, 200, 300, 400, 500))

In [4]:
# Create the model
lm1 <- lm(y ~ x1 + x2 + x3, data = df)
 
# Print summary
summary(lm1)

"essentially perfect fit: summary may be unreliable"



Call:
lm(formula = y ~ x1 + x2 + x3, data = df)

Residuals:
         1          2          3          4          5 
-5.292e-16  5.398e-16  2.689e-16 -4.055e-17 -2.390e-16 

Coefficients: (2 not defined because of singularities)
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -7.944e-16  5.076e-16 -1.565e+00    0.216    
x1           1.000e+01  1.530e-15  6.535e+15   <2e-16 ***
x2                  NA         NA         NA       NA    
x3                  NA         NA         NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.839e-16 on 3 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1 
F-statistic: 4.27e+31 on 1 and 3 DF,  p-value: < 2.2e-16
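
The two NA rows above come from the faux data rather than from lm() itself: x2 and x3 are exact multiples of x1, so their coefficients cannot be estimated separately. The sketch below shows one way to check this, then refits on a hypothetical dataset (df_noisy and lm2 are names made up here, not part of the original example) whose predictors are drawn independently.

```r
# Faux dataframe from above
df <- data.frame(y = c(1, 2, 3, 4, 5),
  x1 = c(0.1, 0.2, 0.3, 0.4, 0.5),
  x2 = c(10, 20, 30, 40, 50),
  x3 = c(100, 200, 300, 400, 500))

# x2 and x3 are exact multiples of x1, so all three carry the same information
all.equal(df$x2, 100 * df$x1)
all.equal(df$x3, 1000 * df$x1)

# Hypothetical dataset (an assumption for illustration): same outcome,
# but each predictor is drawn independently, so none is a multiple of another
set.seed(1)
df_noisy <- data.frame(y = c(1, 2, 3, 4, 5),
  x1 = rnorm(5),
  x2 = rnorm(5),
  x3 = rnorm(5))

# Refit with the same formula; every coefficient should now get an estimate
lm2 <- lm(y ~ x1 + x2 + x3, data = df_noisy)
coef(lm2)
```

With only five rows and four coefficients this second fit is still not a meaningful model. It is only meant to show that the NA values disappear once the predictors are no longer collinear.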


In [5]:
# The Estimate column holds the regression coefficient for each variable.
# summary(lm1)$coefficients returns the full coefficient table as a matrix;
# coef(lm1) returns just the estimates as a named vector.
summary(lm1)$coefficients

"essentially perfect fit: summary may be unreliable"


             Estimate       Std. Error    t value       Pr(>|t|)
(Intercept)  -7.944109e-16  5.075553e-16  -1.565171     0.21551
x1           10.0           1.530337e-15  6.534509e+15  7.903725e-48


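The Pr(>|t|) column gives the p-value associated with each coefficient. In the summary() output, one or more asterisks next to a p-value mean it is below 0.05, and a single . marks a p-value between 0.05 and 0.1, as listed in the Signif. codes line.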
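
As a sketch of pulling individual pieces out of a fitted model: coef() returns the estimates as a named vector, while the matrix from summary()$coefficients can be indexed by column name. The object names est, coef_table, p_values, and std_errors are illustrative only.

```r
# Faux dataframe and model from above
df <- data.frame(y = c(1, 2, 3, 4, 5),
  x1 = c(0.1, 0.2, 0.3, 0.4, 0.5),
  x2 = c(10, 20, 30, 40, 50),
  x3 = c(100, 200, 300, 400, 500))
lm1 <- lm(y ~ x1 + x2 + x3, data = df)

# Named vector of estimates; x2 and x3 are NA because they are collinear with x1
est <- coef(lm1)

# Coefficient table as a matrix (the perfect-fit warning is suppressed for this sketch)
coef_table <- suppressWarnings(summary(lm1)$coefficients)

# Pull single columns out of the table by column name
p_values <- coef_table[, "Pr(>|t|)"]
std_errors <- coef_table[, "Std. Error"]

est
p_values
```

Note that the table from summary() only has rows for coefficients that could be estimated, so it can have fewer entries than coef().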


## Logistic regression
A logistic regression model is fit much like a linear one, except that we use glm() and have to specify family. The family argument sets the error distribution and link function, and so determines the type of regression performed. Logistic regression generally uses family = binomial (equivalently, binomial()).

In [15]:
# Creating the faux dataset
df <- data.frame(
  y = c(1, 0, 1, 0, 1),   # Binary response variable (0 or 1)
  x1 = c(0.1, 0.2, 0.3, 0.4, 0.5),
  x2 = c(10, 20, 30, 40, 50),
  x3 = c(100, 200, 300, 400, 500)
)

In [17]:
# Create the model
glm1 <- glm(y ~ x1 + x2 + x3, data = df, family = binomial)
 
# Print the summary
summary(glm1)


Call:
glm(formula = y ~ x1 + x2 + x3, family = binomial, data = df)

Coefficients: (2 not defined because of singularities)
             Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.055e-01  2.141e+00   0.189     0.85
x1          1.433e-15  6.455e+00   0.000     1.00
x2                 NA         NA      NA       NA
x3                 NA         NA      NA       NA

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.7301  on 4  degrees of freedom
Residual deviance: 6.7301  on 3  degrees of freedom
AIC: 10.73

Number of Fisher Scoring iterations: 4
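
The family argument can be written in a few equivalent ways. The sketch below uses only x1, to sidestep the collinear predictors, and checks that binomial, binomial(), and "binomial" give the same fit, and that the binomial family defaults to the logit link. The names fit_a, fit_b, and fit_c are illustrative.

```r
# Faux binary-outcome data from above, keeping only x1
df <- data.frame(
  y = c(1, 0, 1, 0, 1),
  x1 = c(0.1, 0.2, 0.3, 0.4, 0.5)
)

# Three spellings of the same family
fit_a <- glm(y ~ x1, data = df, family = binomial)
fit_b <- glm(y ~ x1, data = df, family = binomial())
fit_c <- glm(y ~ x1, data = df, family = "binomial")

# Compare the fitted coefficients across the three spellings
all.equal(coef(fit_a), coef(fit_b))
all.equal(coef(fit_a), coef(fit_c))

# The default link function of the binomial family
binomial()$link  # "logit"
```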


## Predictions
To make predictions of the outcome variable using a regression model, we need a new dataset whose column names match the names of the predictor variables used in the model formula. Once the column names match, we can pass the model and the new dataset to the predict() function. It returns one predicted value for each row of the new dataset.

In [18]:
pred_data <- data.frame(
  x1 = c(0, 1, -1), 
  x2 = c(1, 6, 5),
  x3 = c(10, -4, 9)
)
 
predict(lm1, pred_data)

"prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases"

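For a logistic model, predict() returns values on the log-odds (link) scale by default, and type = "response" returns probabilities instead. A minimal sketch, refitting with x1 alone so the fit is not rank-deficient (glm_x1, log_odds, and probs are names made up for this example):

```r
# Faux binary-outcome data from the logistic regression section, keeping only x1
df <- data.frame(
  y = c(1, 0, 1, 0, 1),
  x1 = c(0.1, 0.2, 0.3, 0.4, 0.5)
)
glm_x1 <- glm(y ~ x1, data = df, family = binomial)

# New data: the column name must match the predictor used in the formula
pred_data <- data.frame(x1 = c(0, 1, -1))

# Default: predictions on the log-odds scale
log_odds <- predict(glm_x1, newdata = pred_data)

# type = "response": predicted probabilities that y equals 1
probs <- predict(glm_x1, newdata = pred_data, type = "response")

# The two scales are related by the inverse logit, plogis()
all.equal(unname(plogis(log_odds)), unname(probs))
```

For lm() models there is no such distinction, since predictions are already on the scale of the outcome.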