# Contents
- How linear regression works (simple)
- Determining how good the fitted line is using R squared
- P values
- R
- Python



## Overview
Given some data that you think are related, linear regression quantifies the strength of the relationship: this is R^2, and you want it to be large. It also tells you how reliable that relationship is: this is the p value, which we calculate from F, and you want it to be small. You need both for a good result.


## Conditions for linear regression

1. Linear relationship: There exists a linear relationship between the independent variable, x, and the dependent variable, y.
2. Independence: The residuals are independent. In particular, there is no correlation between consecutive residuals in time series data.
3. Homoscedasticity: The residuals have constant variance at every level of x.
4. Normality: The residuals of the model are normally distributed.

## The steps in a linear regression are:
1. Use least squares to fit a line to the data
2. Calculate the R^2
3. Calculate the p value for R^2

## How it works

This is our data:
![image.png](attachment:image.png)
First, draw a line through the data, measure the distance from the line to each data point, square each distance, and add the squares up.

**NOTE**: the distance from the line to the point is called a residual.

![image.png](attachment:image.png)
Now rotate the line a little, measure the residuals again, square them, and sum the squares to get the sum of squared residuals for that rotation.



![image.png](attachment:image.png)

You can plot the sum of squared residuals against the corresponding rotation. We want the rotation with the smallest sum, hence "least squares".
![image.png](attachment:image.png)

Now we have a line fitted to the data.
![image.png](attachment:image.png)



Since the slope is not 0, knowing a mouse's weight helps us guess the mouse's size.
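The rotation idea can be sketched in plain Python. This is only an illustration: the weight and size values are assumed (the notes show the data only as images), the helper name `sum_squared_residuals` is hypothetical, and the line is assumed to pivot through the point of means, which every least-squares line passes through.

```python
# Illustrative data (an assumption; the notes show the data only as images)
weight = [0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3]
size = [1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3]

n = len(weight)
mean_w = sum(weight) / n
mean_s = sum(size) / n


def sum_squared_residuals(slope):
    """Sum of squared residuals for a line with this slope through the point of means."""
    intercept = mean_s - slope * mean_w
    return sum((s - (intercept + slope * w)) ** 2 for w, s in zip(weight, size))


# "Rotate" the line: try slopes from 0.00 to 2.00 in steps of 0.01
candidate_slopes = [i / 100 for i in range(201)]
best_by_search = min(candidate_slopes, key=sum_squared_residuals)

# Closed-form least-squares slope and intercept, for comparison
s_xy = sum((w - mean_w) * (s - mean_s) for w, s in zip(weight, size))
s_xx = sum((w - mean_w) ** 2 for w in weight)
slope = s_xy / s_xx
intercept = mean_s - slope * mean_w

print(f"best slope found by rotating: {best_by_search:.2f}")
print(f"closed-form fit: size = {intercept:.3f} + {slope:.3f} * weight")
```

The grid search and the closed-form solution should agree up to the grid spacing; in practice only the closed form (or a library) is used.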


## Determining how good the guess is with R Squared

First we calculate the average mouse size.
![image.png](attachment:image.png)
Then we sum the squared residuals around that average.

![image.png](attachment:image.png)
As in least squares, we measure the distance from the mean to each data point, square it, and add up the squares. This is called SS(mean), the sum of squares around the mean.

SS(mean) = Σ (data - mean)^2

Var(mean) = SS(mean) / n, where n is the sample size.

Now we go back to our plot and sum the squared residuals around our least squares fit.
![image.png](attachment:image.png)
This is called SS(fit), the sum of squares around the least squares fit.

SS(fit) = Σ (data - line)^2

Var(fit) = SS(fit) / n

**NOTE**: in general, the variance of something is the sum of squares divided by the number of those things, i.e. the average of the squares.



![image.png](attachment:image.png)
We can say that some of the variation in mouse size is "explained" by looking at mouse weight

Or, in simpler terms, heavier mice are bigger and lighter mice are smaller.

R^2 = (Var(mean) - Var(fit)) / Var(mean)

We can thus say that mouse weight explains R^2 x 100% of the variation in mouse size.
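The R^2 calculation can be followed step by step in a short Python sketch. The data values are assumed for illustration (the notes show the data only as images); the variable names mirror SS(mean), SS(fit), Var(mean) and Var(fit) from the text.

```python
# Illustrative data (an assumption; the notes show the data only as images)
weight = [0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3]
size = [1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3]

n = len(size)
mean_w = sum(weight) / n
mean_s = sum(size) / n

# Least-squares line: size = intercept + slope * weight
slope = (sum((w - mean_w) * (s - mean_s) for w, s in zip(weight, size))
         / sum((w - mean_w) ** 2 for w in weight))
intercept = mean_s - slope * mean_w

# Sum of squares around the mean and around the fitted line
ss_mean = sum((s - mean_s) ** 2 for s in size)
ss_fit = sum((s - (intercept + slope * w)) ** 2 for w, s in zip(weight, size))

# Variances as defined in the notes (sum of squares divided by n)
var_mean = ss_mean / n
var_fit = ss_fit / n

r_squared = (var_mean - var_fit) / var_mean
print(f"SS(mean) = {ss_mean:.2f}, SS(fit) = {ss_fit:.2f}, R^2 = {r_squared:.3f}")
```

Because both variances are divided by the same n, the n cancels and R^2 can equally be computed from the sums of squares directly.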

**We need a way to determine if the R^2 value is statistically significant.**
A line can always be drawn through any 2 points, so their R^2 is always 100%, no matter how unrelated the data are.

Thus we need p values.

## P values

The p value for R^2 comes from something called F.

F = variation in size explained by weight / variation in size not explained by weight
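A minimal Python sketch of F follows. It assumes the usual degrees-of-freedom form, which the notes do not spell out: the explained variation is divided by the number of extra parameters in the fit (`p_fit - p_mean`), and the unexplained variation by `n - p_fit`. The data values and the function name `f_statistic` are hypothetical.

```python
def f_statistic(x, y):
    """F for a straight-line fit compared with a mean-only model."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Least-squares line: y = intercept + slope * x
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x

    ss_mean = sum((b - mean_y) ** 2 for b in y)
    ss_fit = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

    p_mean = 1  # parameters in the mean-only model (just the mean)
    p_fit = 2   # parameters in the fitted line (intercept and slope)

    explained = (ss_mean - ss_fit) / (p_fit - p_mean)
    not_explained = ss_fit / (n - p_fit)
    return explained / not_explained


# Illustrative data (an assumption; the notes show the data only as images)
weight = [0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3]
size = [1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3]

f_stat = f_statistic(weight, size)
print(f"F = {f_stat:.2f} on 1 and {len(size) - 2} degrees of freedom")
```

With only two data points `n - p_fit` is 0, so F cannot be computed at all, which is the formal side of the two-point warning. If SciPy is installed, `scipy.stats.f.sf(f_stat, 1, len(size) - 2)` should convert F into the p value.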


## How to do it in R


![image.png](attachment:image.png)



In [5]:
print("plot(mouse.data$weight, mouse.data$size)")


plot(mouse.data$weight, mouse.data$size)


![image.png](attachment:image.png)

In [4]:
print("mouse.regression <- lm(size ~ weight, data = mouse.data)")
print("summary(mouse.regression)")

mouse.regression <- lm(size ~ weight, data = mouse.data)
summary(mouse.regression)


![image.png](attachment:image.png)

## Explaining the summary
The summary function is where all the meat of the linear model can be found.

### Residuals

This is a summary of the residuals (the distances from the data to the fitted line). You want them to be symmetrically distributed around the line, which means the min and max values should be approximately the same distance from 0.

You also want 1Q and 3Q to be approximately the same distance from 0.
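The five numbers in the residual summary can be reproduced with a short Python sketch. The data values are assumed for illustration, and the `"inclusive"` quantile method is chosen on the assumption that it matches the default interpolation R uses in `summary`.

```python
import statistics

# Illustrative data (an assumption; the notes show the data only as images)
weight = [0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3]
size = [1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3]

n = len(size)
mean_w = sum(weight) / n
mean_s = sum(size) / n

# Least-squares line: size = intercept + slope * weight
slope = (sum((w - mean_w) * (s - mean_s) for w, s in zip(weight, size))
         / sum((w - mean_w) ** 2 for w in weight))
intercept = mean_s - slope * mean_w

# Residual = observed size minus the size predicted by the line
residuals = [s - (intercept + slope * w) for w, s in zip(weight, size)]

# Quartiles with linear interpolation between the min and max
q1, median, q3 = statistics.quantiles(residuals, n=4, method="inclusive")

print(f"Min {min(residuals):.4f}  1Q {q1:.4f}  Median {median:.4f}  "
      f"3Q {q3:.4f}  Max {max(residuals):.4f}")
```

Comparing `abs(min(residuals))` with `max(residuals)`, and `abs(q1)` with `q3`, gives a quick numeric check of the symmetry described above.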


### Coefficients

This section tells us about the least squares estimates for the fitted line

The Std. Error and t value are provided to show how the p values were calculated.

Pr(>|t|): these are the p values for the estimated parameters. We want the p value for weight to be < 0.05, that is, statistically significant.

A significant p value for weight (seen above) means that weight gives us a reliable guess of mouse size.
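How the Std. Error and t value columns arise can be sketched in Python. The formulas below are the standard textbook ones for a straight-line fit, not something stated in these notes, and the data values are assumed for illustration.

```python
import math

# Illustrative data (an assumption; the notes show the data only as images)
weight = [0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3]
size = [1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3]

n = len(size)
mean_w = sum(weight) / n
mean_s = sum(size) / n

s_xx = sum((w - mean_w) ** 2 for w in weight)
slope = sum((w - mean_w) * (s - mean_s) for w, s in zip(weight, size)) / s_xx
intercept = mean_s - slope * mean_w

# Residual variance: SS(fit) divided by n minus the 2 fitted parameters
ss_fit = sum((s - (intercept + slope * w)) ** 2 for w, s in zip(weight, size))
residual_variance = ss_fit / (n - 2)

# Standard errors of the two estimates (standard simple-regression formulas)
se_slope = math.sqrt(residual_variance / s_xx)
se_intercept = math.sqrt(residual_variance * (1 / n + mean_w ** 2 / s_xx))

# t value = estimate divided by its standard error
t_slope = slope / se_slope
t_intercept = intercept / se_intercept

print(f"intercept: estimate {intercept:.4f}, std. error {se_intercept:.4f}, t {t_intercept:.3f}")
print(f"weight:    estimate {slope:.4f}, std. error {se_slope:.4f}, t {t_slope:.3f}")
```

For a fit with a single predictor, the square of the t value for the slope equals F, which is why the p value for weight and the p value for the F statistic agree.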

### Residual standard error:
This is the square root of the denominator in the equation for F.
### Multiple R squared:
This is the R^2, which here means weight explains 61% of the variation in size.
### F statistic:
This tells us whether the R^2 is significant. Again you see a p value, which here says that weight gives a reliable estimate of size.

## How to do linear regressions in Python
Two commonly used libraries for linear regression in Python are **statsmodels** and **scikit-learn**. In these notes I'll be using scikit-learn.

When using Python you will usually also use numpy for array manipulation and pandas as the data structure that holds the data. Here is an example of what the workflow looks like:

In [19]:
# Libraries you will need to import
import numpy as np
import matplotlib.pyplot as plt  # To visualize
import pandas as pd  # To read data
from sklearn.linear_model import LinearRegression


data = pd.read_csv(" insert data file here ")  # load data set (replace the placeholder with the path to your CSV file)
X = data.iloc[:, 0].values.reshape(-1, 1)  # .values converts the column into a numpy array
Y = data.iloc[:, 1].values.reshape(-1, 1)  # -1 lets numpy infer the number of rows; 1 gives a single column
linear_regressor = LinearRegression()  # create the model object
linear_regressor.fit(X, Y)  # perform linear regression
Y_pred = linear_regressor.predict(X)  # make predictions



plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()

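For a version that runs without a data file, here is a small sketch with the data typed in directly. The values are assumed for illustration (the notes show the data only as images), and plotting is left out to keep it short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data (an assumption; the notes show the data only as images)
weight = np.array([0.9, 1.8, 2.4, 3.5, 3.9, 4.4, 5.1, 5.6, 6.3]).reshape(-1, 1)  # one feature column
size = np.array([1.4, 2.6, 1.0, 3.7, 5.5, 3.2, 3.0, 4.9, 6.3])

model = LinearRegression()
model.fit(weight, size)  # least-squares fit

slope = model.coef_[0]                 # estimated slope for weight
intercept = model.intercept_           # estimated intercept
r_squared = model.score(weight, size)  # R^2 of the fit on this data

print(f"size = {intercept:.3f} + {slope:.3f} * weight, R^2 = {r_squared:.3f}")
```

scikit-learn's `LinearRegression` reports the fit and R^2 but no p values; statsmodels is the usual choice when an R-style summary is needed.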

## Multiple Linear regression
Let's say we wanted to know if **weight** and **tail length** did a good job of predicting the **length of the mouse's body**.
![image.png](attachment:image.png)
Go to the Multiple Linear regression notes for more info