# **MS R code cheat sheet**

*(Based strictly on course R-lectures - everything you need for regression, PCA, LDA, MANOVA, factor analysis, diagnostics, selection and more)*


Load data:
```r
source("path/data_name.R")
```

## **Linear Regression (GLM)**



Model:
$$
y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \varepsilon
$$

##### **R code**

```r
model <- lm(y ~ x1 + x2 + x3, data = df)
summary(model)
```

### **Extract $R^2$ and adjusted $R^2$**

```r
summary(model)$r.squared
summary(model)$adj.r.squared
```

### **SSE (sum of squared residuals)**

```r
SSE <- sum(residuals(model)^2)
```
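
As a quick check of how SSE connects to $R^2$, here is a minimal sketch. It uses the built-in `mtcars` data and the name `m` purely for illustration (neither comes from the course material) and verifies $R^2 = 1 - \text{SSE}/\text{SST}$.

```r
# Illustration on built-in mtcars data (an assumption, not the course data)
m <- lm(mpg ~ wt + hp, data = mtcars)

SSE <- sum(residuals(m)^2)                      # residual sum of squares
SST <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares

R2_manual <- 1 - SSE / SST
all.equal(R2_manual, summary(m)$r.squared)      # compares manual and reported R^2
all.equal(SSE, deviance(m))                     # deviance() of an lm fit is the SSE
```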

### **Residual diagnostics**

```r
plot(model)                # Residual plots
rstudent(model)            # Studentized residuals
hatvalues(model)           # Leverage (h_ii)
cooks.distance(model)      # Cook's distance
```

### **Count $|\text{Rstudent}| > 2$**

```r
sum(abs(rstudent(model)) > 2)
```


### **Influential obs ($|\text{Rstudent}| > 2$ & leverage $>$ threshold)**

```r
lev <- hatvalues(model)
rst <- abs(rstudent(model))
sum(rst > 2 & lev > 2*mean(lev))   # threshold here: twice the average leverage
```
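
To see which observations are flagged rather than only how many, a sketch along the following lines may help. The `mtcars` data and the names `m`, `threshold` and `flagged` are illustrative assumptions; the threshold $2p/n$ is the same as `2*mean(lev)` because the leverages sum to the number of coefficients $p$.

```r
# Illustration on built-in mtcars data (an assumption, not the course data)
m <- lm(mpg ~ wt + hp, data = mtcars)

lev <- hatvalues(m)
rst <- abs(rstudent(m))

p <- length(coef(m))        # number of coefficients, intercept included
n <- nobs(m)                # number of observations
threshold <- 2 * p / n      # identical to 2 * mean(lev)

flagged <- which(rst > 2 & lev > threshold)   # named indices of flagged rows
flagged
length(flagged)             # same count as sum(rst > 2 & lev > threshold)
```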

### **VIF + Tolerance**

```r
library(car)
vif(model)           # VIF > 10 indicates multicollinearity
1/vif(model)         # Tolerance = 1/VIF; tolerance < 0.1 indicates multicollinearity
                     # If VIF < 10 and tolerance > 0.1: no multicollinearity problem
```
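
If `car` is not available, the VIF can be computed by hand: regress each predictor on the others and take $1/(1 - R_j^2)$. The sketch below assumes the built-in `mtcars` data and hypothetical names (`X`, `vif_manual`); it also cross-checks against the diagonal of the inverse correlation matrix, which gives the same values.

```r
# Illustration on built-in mtcars data (an assumption, not the course data)
m <- lm(mpg ~ wt + hp + disp, data = mtcars)
X <- model.matrix(m)[, -1]          # predictor columns, intercept dropped

vif_manual <- sapply(colnames(X), function(v) {
  others <- setdiff(colnames(X), v)
  r2 <- summary(lm(X[, v] ~ X[, others]))$r.squared   # R^2 of x_j on the rest
  1 / (1 - r2)
})

vif_manual                          # VIF per predictor
1 / vif_manual                      # tolerance per predictor
diag(solve(cor(X)))                 # same VIFs via the inverse correlation matrix
```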

### **Backwards elimination**

```r
drop1(model, test="F")   # find variable to remove
```

We always remove the least significant variable first: the one with the lowest $F$-value, which is also the one with the highest $p$-value. For dropping a single variable from the linear model,
$$
F = \frac{\text{RSS}_{\text{reduced}} - \text{RSS}_{\text{full}}}{\hat\sigma^2}
$$
where $\hat\sigma^2$ is the residual mean square of the full model.

A small $F$ means that removing the variable barely increases the RSS, so the variable contributes little and is removed first. Alternatively, look at the $t$-values in `summary(model)` and remove the variable with the smallest $|t|$; for a single coefficient, $F = t^2$.
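
The relation $F = t^2$ can be checked numerically. This is a small sketch on the built-in `mtcars` data with illustrative names (`m`, `tab`, `t_vals`, `F_vals`), none of which come from the lectures.

```r
# Illustration on built-in mtcars data (an assumption, not the course data)
m <- lm(mpg ~ wt + hp + qsec, data = mtcars)

tab <- drop1(m, test = "F")                   # first row is "<none>"
F_vals <- tab[["F value"]][-1]                # F for dropping each variable
t_vals <- coef(summary(m))[-1, "t value"]     # t for each coefficient, no intercept

all.equal(unname(t_vals^2), F_vals)           # F equals t squared, term by term

# Candidate to remove first: smallest F (equivalently, largest p-value)
rownames(tab)[-1][which.min(F_vals)]
```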

### **Forward selection (start with no predictors)**

```r
add1(lm(y ~ 1, data=df), scope = ~ x1+x2+x3, test="F")
```

Add the variable with the highest $F$-value (lowest $p$-value) first, then repeat `add1()` on the updated model.
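
One forward step can be sketched as follows. The `mtcars` data and the names `null`, `best` and `step1` are assumptions for illustration; in practice the loop stops once no remaining candidate is significant.

```r
# Illustration on built-in mtcars data (an assumption, not the course data)
null <- lm(mpg ~ 1, data = mtcars)
scope <- ~ wt + hp + qsec

tab <- add1(null, scope = scope, test = "F")
best <- rownames(tab)[which.max(tab[["F value"]])]   # which.max skips the NA in "<none>"
best

# Add the chosen variable, then test the remaining candidates again
step1 <- update(null, as.formula(paste(". ~ . +", best)))
add1(step1, scope = scope, test = "F")
```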

## **Logistic Regression (binary)**

Model:

$$
\text{logit}(p) = \alpha + \beta x
$$

##### **R code**

```r
model <- glm(y ~ x, family=binomial(link="logit"), data=df)
summary(model)
```

### **Prediction**

```r
predict(model, newdata=df, type="response")
```
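
With `type="response"` the predictions are probabilities, i.e. the inverse logit of the linear predictor. The following sketch uses the built-in `mtcars` data, and the 0.5 cut-off is an illustrative assumption rather than a rule from the lectures.

```r
# Illustration on built-in mtcars data (an assumption, not the course data)
m <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

eta   <- predict(m, type = "link")       # linear predictor alpha + beta * x
p_hat <- predict(m, type = "response")   # fitted probabilities
all.equal(p_hat, plogis(eta))            # response = logistic(link)

pred_class <- as.integer(p_hat > 0.5)    # classify with an assumed 0.5 cut-off
table(observed = mtcars$am, predicted = pred_class)

exp(coef(m))                             # coefficients on the odds-ratio scale
```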

## **Linear Discriminant Analysis (LDA)**

### **Fit LDA**

```r
library(MASS)
lda.model <- lda(group ~ x1 + x2 + x3, data=df)
lda.model
```

### **Resubstitution errors**

```r
pred <- predict(lda.model)$class
sum(pred != df$group)
```
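
The error count can be turned into a confusion table and an error rate. As a sketch on the built-in `iris` data (an assumption, not course data), the block below also shows the leave-one-out alternative that `lda()` offers through `CV = TRUE`, which is usually less optimistic than resubstitution.

```r
library(MASS)

# Illustration on built-in iris data (an assumption, not the course data)
fit  <- lda(Species ~ ., data = iris)
pred <- predict(fit)$class

conf <- table(observed = iris$Species, predicted = pred)
conf
resub_error <- 1 - sum(diag(conf)) / sum(conf)   # resubstitution error rate

# Leave-one-out cross-validation: each observation is classified
# by a model fitted without it
cv <- lda(Species ~ ., data = iris, CV = TRUE)
cv_error <- mean(cv$class != iris$Species)
c(resubstitution = resub_error, loo_cv = cv_error)
```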

### **Posterior probabilities**

```r
predict(lda.model)$posterior
```

## **MANOVA**

**Model**

$$
Y = \mu + \text{Group} + \varepsilon
$$

```r
fit <- manova(cbind(y1, y2, y3) ~ group, data=df)
summary(fit, test="Wilks")
```
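
Wilks' lambda is $\Lambda = \det(E)/\det(E + H)$, where $H$ is the hypothesis (group) SSCP matrix and $E$ the residual SSCP matrix. A minimal check on the built-in `iris` data (an assumption; the names `s`, `H`, `E` are illustrative):

```r
# Illustration on built-in iris data (an assumption, not the course data)
fit <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length) ~ Species,
              data = iris)
s <- summary(fit, test = "Wilks")

H <- s$SS$Species       # between-group SSCP matrix
E <- s$SS$Residuals     # within-group (residual) SSCP matrix

wilks_manual <- det(E) / det(E + H)
all.equal(wilks_manual, s$stats["Species", "Wilks"])   # matches the summary table
```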


### **Test of individual variables (Type III)**

```r
library(car)
Anova(fit, type = "III")   # Anova() defaults to Type II, so request Type III explicitly
```


## **PCA (Principal Component Analysis)**

### **Scale $=$ TRUE $\to$ correlation matrix**

(*used when variables have different units, like in lectures*)

```r
p <- prcomp(df, scale. = TRUE)   # df must be numeric; the argument is spelled `scale.`
summary(p)      # variance explained
p$rotation      # loadings
p$x             # scores
```
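
That scaling corresponds to working on the correlation matrix can be verified directly: the component variances equal the eigenvalues of `cor(X)`, and the loadings equal its eigenvectors up to sign. A sketch on four `mtcars` columns (an illustrative choice, not course data):

```r
# Illustration on built-in mtcars data (an assumption, not the course data)
X <- mtcars[, c("mpg", "disp", "hp", "wt")]

p <- prcomp(X, scale. = TRUE)
e <- eigen(cor(X))

all.equal(p$sdev^2, e$values)                        # variances = eigenvalues
all.equal(abs(unname(p$rotation)), abs(e$vectors))   # loadings agree up to sign

cumsum(p$sdev^2) / sum(p$sdev^2)   # cumulative proportion of variance explained
```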

## **Factor Analysis (Principal Factor, VARIMAX)**

### **Extraction**

```r
fa <- factanal(df, factors=3, rotation="varimax")   # note: factanal() fits by maximum likelihood
fa
```
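
Communalities are not printed directly by `factanal()`, but they follow from the loadings (row sums of squares) and are approximately one minus the uniquenesses. The sketch below uses the built-in `ability.cov` covariance matrix and two factors, both chosen only for illustration.

```r
# Illustration on the built-in ability.cov data (an assumption, not course data)
fa2 <- factanal(factors = 2, covmat = ability.cov, rotation = "varimax")

L <- unclass(fa2$loadings)        # rotated loadings as a plain matrix
communality <- rowSums(L^2)       # share of each variable explained by the factors

# The two columns should agree up to the optimiser's tolerance
round(cbind(communality, one_minus_uniqueness = 1 - fa2$uniquenesses), 3)

colSums(L^2)                      # "SS loadings": variance explained per factor
```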

## **Multivariate Normal Density / KDE (from lectures)**


### **2D kernel density plot**

```r
library(MASS)
d <- kde2d(x, y)
persp(d)
```
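
For the multivariate normal density itself, base R has no built-in function, so a hand-written helper is one option. The function `dmvn` below is a hypothetical name and a minimal sketch of
$f(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\{-\tfrac12 (x-\mu)^\top \Sigma^{-1} (x-\mu)\}$, using `mahalanobis()` for the quadratic form.

```r
# Hypothetical helper (not from the lectures): multivariate normal density
dmvn <- function(x, mu, Sigma) {
  p  <- length(mu)
  d2 <- mahalanobis(x, center = mu, cov = Sigma)   # squared Mahalanobis distance
  exp(-0.5 * d2) / sqrt((2 * pi)^p * det(Sigma))
}

# Standard bivariate normal at the origin: density is 1 / (2 * pi)
all.equal(dmvn(c(0, 0), mu = c(0, 0), Sigma = diag(2)), 1 / (2 * pi))

# In one dimension it reduces to dnorm()
all.equal(dmvn(1.3, mu = 0.5, Sigma = matrix(4)), dnorm(1.3, mean = 0.5, sd = 2))
```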

## **Useful helper functions from Lectures**


### **Logit + Logistic**

```r
logit <- function(p) log(p/(1-p))
logistic <- function(y) exp(y)/(1+exp(y))
```
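
A short usage sketch: the two helpers are inverses of each other and match base R's `qlogis()` and `plogis()`. The last two lines illustrate one caveat of the `exp(y)/(1+exp(y))` form, namely overflow for very large `y`.

```r
logit    <- function(p) log(p / (1 - p))
logistic <- function(y) exp(y) / (1 + exp(y))

p <- c(0.1, 0.5, 0.9)
all.equal(logistic(logit(p)), p)      # round trip returns the probabilities
all.equal(logit(p), qlogis(p))        # base R equivalent of logit
all.equal(logistic(2), plogis(2))     # base R equivalent of logistic

logistic(800)   # NaN, because exp(800) overflows to Inf
plogis(800)     # 1, the built-in handles large arguments
```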

## **Common exam patterns**


### "Compute R of the model"

> **Use Multiple $R^2$** from `summary(model)` (take the square root if the multiple correlation $R$ itself is asked for)

### "How many obs have $|\text{Rstudent}| > 2$"

> `sum(abs(rstudent(model)) > 2)`

### "Influential $=$ Rstudent $>2$ & leverage $>$ threshold"

> `sum(rst>2 & lev>threshold)`

### "Forward / backward selection"

> `add1()` (forward) and `drop1()` (backward)

### "Multicollinearity"

> Look at `vif()`

### "Test nested model" ($F$-test)

> Look at: 
```r
anova(model_small, model_big)
```
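
The $F$-statistic reported by `anova()` for nested models can be reproduced by hand from the two residual sums of squares. A sketch on the built-in `mtcars` data (an assumption; `model_small` and `model_big` are refitted here just for illustration):

```r
# Illustration on built-in mtcars data (an assumption, not the course data)
model_small <- lm(mpg ~ wt, data = mtcars)
model_big   <- lm(mpg ~ wt + hp + qsec, data = mtcars)
tab <- anova(model_small, model_big)

rss0 <- deviance(model_small)                          # RSS of the reduced model
rss1 <- deviance(model_big)                            # RSS of the full model
df1  <- df.residual(model_big)
q    <- df.residual(model_small) - df1                 # number of extra parameters

F_manual <- ((rss0 - rss1) / q) / (rss1 / df1)
p_manual <- pf(F_manual, q, df1, lower.tail = FALSE)

all.equal(F_manual, tab$F[2])                 # matches the anova() table
all.equal(p_manual, tab[2, "Pr(>F)"])
```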
