# Linear Regression From Scratch
---
Linear regression is a powerful tool, even though it may seem basic compared to other models.
<br/>
<br/>
Here is a general overview of the math involved:
- Finding the mean of the data
- Finding the standard deviation
- Finding the correlation coefficient
- Using the previous values to find the slope of the regression equation
<br/>
<br/>
What is linear regression? It is a prediction model that, given a set of points, fits a line through them so that a y value can be predicted for any x value.
<br/>
![LinReg](LinReg.png)
<br/>
(The dots represent the points and the red line represents the predictions based on the x values.)
<br/>
<br/>
In this problem we are given some points and need to find the line of best fit. For this we will use the Least Squares Regression Line (LSRL), the line that minimizes the sum of the squared vertical distances between the points and the line.
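
For reference, the least-squares criterion can be written out explicitly. The symbols below ($a$ for the intercept, $b$ for the slope, $n$ for the number of points, $s_x$ and $s_y$ for the standard deviations of X and Y) are notation assumed here for illustration:

$$
\min_{a,\,b} \sum_{i=1}^{n} \bigl(y_i - (a + b\,x_i)\bigr)^2
$$

The line that minimizes this sum has slope and intercept

$$
b = r\,\frac{s_y}{s_x}, \qquad a = \bar{y} - b\,\bar{x}
$$

where $r$ is the correlation coefficient and $\bar{x}$, $\bar{y}$ are the means.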
<br/>
<br/>
Sources: https://youtu.be/nk2CQITm_eo

## Finding a mean
---
We first need the mean of both the X and Y values. To find a mean (or average), add up all the values and divide by how many values there are.
<br/>
<br/>
Pseudo-code:
<br/>
><p>
loop through values:
<br/>
&emsp; add values to sum
<br/>
return sum/total values
<p/>

In [14]:
import math
class ScatterPlot:
    def __init__(self, points):
        self.points = points
        self.pointsX = points[0]
        self.pointsY = points[1]
        self.meanX = self.mean(self.pointsX)
        self.meanY = self.mean(self.pointsY)
    def mean(self, points):
        sum = 0
        for p in points:
            sum+=p    
        return (sum/len(points))    
    
#lets do some test cases
scatter = ScatterPlot([[0, 1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6]]) #2d array, [x values], [y values]
print("Mean of X values: " + str(scatter.meanX))# should be 2.5
print("Mean of Y values: " + str(scatter.meanY))# should be 3.5

Mean of X values: 2.5
Mean of Y values: 3.5
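
As a quick sanity check, the hand-written mean can be compared against Python's standard library. This is only a sketch: the standalone `mean` helper below is a hypothetical rewrite for illustration, not part of the `ScatterPlot` class.

```python
import statistics

xs = [0, 1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5, 6]

def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

# Both approaches should agree on the same data.
print(mean(xs), statistics.mean(xs))
print(mean(ys), statistics.mean(ys))
```

Using the built-in `sum` also avoids naming a local variable `sum`, which would hide the built-in inside the method.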


## Finding Standard Deviation
---
Standard deviation is a measure of how far the points typically are from the mean. It gives a *standard*, or common, way to describe the spread of the points.
<br/>
<br/>
Pseudo-code:
<br/>
><p>
loop through values:
<br/>
&emsp; add the (value - mean)^2 to sum
<br/>
return sqrt(sum/total values)
<p/>

In [16]:
import math
class ScatterPlot:
    def __init__(self, points):
        self.points = points
        self.pointsX = points[0]
        self.pointsY = points[1]
        self.meanX = self.mean(self.pointsX)
        self.meanY = self.mean(self.pointsY)
        self.SDx = self.standardDev(self.pointsX, self.meanX)
        self.SDy = self.standardDev(self.pointsY, self.meanY)
    def mean(self, points):
        sum = 0
        for p in points:
            sum+=p
        return (sum/len(points))
    def standardDev(self, points, mean):
        sumOfSquares = 0
        for p in points:
            sumOfSquares+= ((p-mean) ** 2)
        return math.sqrt(sumOfSquares/len(points))
    
#lets do some test cases
scatter = ScatterPlot([[0, 1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6]]) #2d array, [x values], [y values]
print("Standard Deviation of X values: " + str(scatter.SDx))# should be ~1.7078
print("Standard Deviation of Y values: " + str(scatter.SDy))# should be same ~1.7078 because the spread is the same

Standard Deviation of X values: 1.707825127659933
Standard Deviation of Y values: 1.707825127659933
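
The value above is the *population* standard deviation, which divides by the number of values. The short sketch below illustrates the difference from the *sample* standard deviation, which divides by one less than the number of values. The `population_sd` helper is a hypothetical standalone function written for this comparison.

```python
import math
import statistics

xs = [0, 1, 2, 3, 4, 5]

def population_sd(values):
    """Square root of the average squared distance from the mean (divide by n)."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

print(population_sd(xs))      # hand-written population standard deviation
print(statistics.pstdev(xs))  # standard library population version (divide by n)
print(statistics.stdev(xs))   # standard library sample version (divide by n - 1), slightly larger
```

Either convention works for regression as long as the same one is used for the standard deviations and for the correlation coefficient.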


## Finding Correlation Coefficient
---
The correlation coefficient (r) is a measure of the strength and direction of the linear association between X and Y. It ranges from -1 to 1.
<br/>
<br/>
Pseudo-code:
<br/>
><p>
loop through points:
<br/>
&emsp; add ((x - meanX)/SDx) * ((y - meanY)/SDy) to sum
<br/>
return sum/total points
<p/>

In [None]:
import math
class ScatterPlot:
    def __init__(self, points):
        self.points = points
        self.pointsX = points[0]
        self.pointsY = points[1]
        self.meanX = self.mean(self.pointsX)
        self.meanY = self.mean(self.pointsY)
        self.SDx = self.standardDev(self.pointsX, self.meanX)
        self.SDy = self.standardDev(self.pointsY, self.meanY)
        self.r = self.correlation()
        #Slope
        self.slope = self.r*(self.SDy/self.SDx)
        
    def mean(self, points):
        sum = 0
        for p in points:
            sum+=p
        return (sum/len(points))
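    def standardDev(self, points, mean):
        sumOfSquares = 0
        for p in points:
            sumOfSquares+= ((p-mean) ** 2)
        # divide by the number of values (population standard deviation)
        return math.sqrt(sumOfSquares/len(points))
    def correlation(self):
        zSum = 0
        for i in range (len(self.pointsX)):
            # product of the z-scores of x and y for this point
            zSum += (((self.pointsX[i]-self.meanX)/self.SDx) * ((self.pointsY[i]-self.meanY)/self.SDy))
        # average over the number of points, not over len(self.points), which is always 2
        return zSum/(len(self.pointsX))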
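
This cell has no test cases, so here is a minimal standalone sketch of the same idea: r as the average product of z-scores. The function name `pearson_r` is an assumption made for this example and is not part of the `ScatterPlot` class.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient as the average product of z-scores.

    Uses the population standard deviation (divide by n), so the final
    average also divides by n.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    z_sum = sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys))
    return z_sum / n

xs = [0, 1, 2, 3, 4, 5]
print(pearson_r(xs, [1, 2, 3, 4, 5, 6]))  # points on an increasing line, so r is about 1
print(pearson_r(xs, [6, 5, 4, 3, 2, 1]))  # points on a decreasing line, so r is about -1
```

Because of floating-point rounding, the printed values may differ from 1 and -1 in the last few digits.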

## Putting it all together
---
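With the means, standard deviations, and correlation coefficient in hand, we can build the regression line. The slope is r multiplied by the ratio of the Y standard deviation to the X standard deviation, and the y-intercept is chosen so that the line passes through the point (mean of X, mean of Y).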
<br/>
<br/>
Pseudo-code:
<br/>
><p>
slope = r * (SDy / SDx)
<br/>
y-intercept = meanY - slope * meanX
<br/>
return slope * x + y-intercept for each x value
<p/>

In [8]:
import math
class ScatterPlot:
    def __init__(self, points):
        self.points = points
        self.pointsX = points[0]
        self.pointsY = points[1]
        self.meanX = self.mean(self.pointsX)
        self.meanY = self.mean(self.pointsY)
        self.SDx = self.standardDev(self.pointsX, self.meanX)
        self.SDy = self.standardDev(self.pointsY, self.meanY)
        self.r = self.correlation()
        self.slope = self.r*(self.SDy/self.SDx)
        self.regression = self.LSRL()
        
    def mean(self, points):
        sum = 0
        for p in points:
            sum+=p
        return (sum/len(points))
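    def standardDev(self, points, mean):
        sumOfSquares = 0
        for p in points:
            sumOfSquares+= ((p-mean) ** 2)
        # divide by the number of values (population standard deviation)
        return math.sqrt(sumOfSquares/len(points))
    def correlation(self):
        zSum = 0
        for i in range (len(self.pointsX)):
            # product of the z-scores of x and y for this point
            zSum += (((self.pointsX[i]-self.meanX)/self.SDx) * ((self.pointsY[i]-self.meanY)/self.SDy))
        # average over the number of points, not over len(self.points), which is always 2
        return zSum/(len(self.pointsX))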
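    def LSRL(self):
        # store the y-intercept on the instance so perdictLSRL can use it
        self.yInt = self.meanY-(self.slope*self.meanX)
        return "y = " + str(round(self.slope, 3)) + "x + " + str(round(self.yInt, 3))
    def perdictLSRL(self):
        predictions = []
        # predict a y value for each x value (self.points holds both lists, so iterate pointsX)
        for x in self.pointsX:
            predictions.append(self.slope*x + self.yInt)
        return predictions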
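
To see the whole pipeline run end to end, here is a compact standalone sketch of the same steps on a small dataset. The names `fit_line` and `predict` and the sample data are assumptions made for this example, not part of the `ScatterPlot` class.

```python
import math

def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # population standard deviations (divide by n)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    # correlation coefficient: average product of z-scores
    r = sum(((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
            for x, y in zip(xs, ys)) / n
    slope = r * (sd_y / sd_x)
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, xs):
    """Predicted y value for each x value."""
    return [slope * x + intercept for x in xs]

slope, intercept = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(slope, 3), round(intercept, 3))  # 0.6 2.2
print(predict(slope, intercept, [6]))        # about 5.8
```

Returning the slope and intercept as numbers, rather than only as a formatted string, makes it easier to reuse the fitted line for new x values.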