# Simple Linear Regression
Simple linear regression is among the simplest and most popular regression algorithms. Like almost all machine learning algorithms, it has strong roots in statistics.
The algorithm models the relationship between two variables, **X** and **Y**.
Given known values of X and Y, and assuming that the relationship between them is linear, can we fit a line that predicts values of Y from given values of X?
This is exactly what linear regression does. The definition also exposes its limitation: it can only fit observations that have a linear relationship.

Talking about lines, one of the most familiar equations in math springs to mind.
$$ y = mx +c$$

The above equation shows that we can predict the unknown (dependent) value of **Y** from **X** if we know the right combination of __m__ and __c__.
Here __m__ is the slope (the scale factor) and __c__ is the intercept (the bias coefficient).
So how do we compute the right values of __m__ and __c__? There are two approaches:
+ Gradient descent
+ Least squares method

For this post, we will be using the least squares method. Our aim is to reduce the difference between the actual value of *y* and the predicted value. Let's call the predicted value *h(x)*, so that $h(x_i) = mx_i + c$.
For the *i*-th observation of *y*,
$$y_i = mx_i + c + \epsilon_i$$
Here $\epsilon_i$ is the error in the prediction of $y_i$. The learning algorithm's main task is to learn the values of *m* and *c* that make these errors as small as possible. This is measured by the cost function, which is given by
$$J(m, c) = \frac{1}{2n}\sum_{i=1}^{n}\epsilon_i^2$$
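To make the cost function concrete, here is a minimal sketch that evaluates $J(m, c)$ with NumPy. The function name `cost` and the four toy points are assumptions made for illustration; they are not part of the implementation later in this post.

```python
import numpy as np

def cost(m, c, x, y):
    """Half mean squared error J(m, c) for the line y = m*x + c."""
    residuals = y - (m * x + c)          # epsilon_i for every observation
    return np.sum(residuals ** 2) / (2 * x.size)

# Toy data generated exactly by y = 2x + 1 (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

print(cost(2.0, 1.0, x, y))  # perfect fit → 0.0
print(cost(1.0, 1.0, x, y))  # slope too small → 1.75
```

The cost is zero only when the line passes through every point, and it grows as the line moves away from the data.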
Our task is to minimize the cost function defined above. We can do that iteratively through gradient descent, as mentioned above, or we can use the least squares method, which gives the minimizing values directly from a closed-form formula. For a single feature the closed form is cheap to compute and exact up to floating-point rounding.
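For comparison, the gradient descent route can be sketched as follows. This is only a rough illustration: the function name `gradient_descent`, the learning rate `lr`, and the iteration count `n_iter` are assumptions picked so that the ten example points used later in this post converge. They are not tuned values.

```python
import numpy as np

def gradient_descent(x, y, lr=0.01, n_iter=10000):
    """Minimize J(m, c) = 1/(2n) * sum((y - (m*x + c))**2) by gradient descent."""
    m, c = 0.0, 0.0
    n = x.size
    for _ in range(n_iter):
        error = (m * x + c) - y            # prediction minus actual value
        grad_m = np.sum(error * x) / n     # partial derivative of J w.r.t. m
        grad_c = np.sum(error) / n         # partial derivative of J w.r.t. c
        m -= lr * grad_m                   # step against the gradient
        c -= lr * grad_c
    return m, c

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12], dtype=float)

m_gd, c_gd = gradient_descent(x, y)
print(round(m_gd, 4), round(c_gd, 4))  # → 1.1697 1.2364
```

Gradient descent needs a suitable learning rate and many passes over the data to reach the slope and intercept that the closed form produces in a single step.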
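Without going into a lot of math, the values of *m* and *c* that minimize the cost function are
$$m = \frac{SS_{xy}}{SS_{xx}}$$
$$c = \overline{y} - m\overline{x}$$
Here $\overline{y}$ and $\overline{x}$ are the arithmetic means of *y* and *x* respectively, $SS_{xy}$ is the sum of cross-deviations of *x* and *y*, and $SS_{xx}$ is the sum of squared deviations of *x*: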
$${SS_{xy}} = \sum_{i=1}^{n}(x_i - \overline{x})(y_i - \overline{y}) = \sum_{i=1}^{n}x_iy_i - n\overline{x}\overline{y}$$
$${SS_{xx}} = \sum_{i=1}^{n}(x_i - \overline{x})^2 = \sum_{i=1}^{n}x_i^2 - n(\overline{x})^2$$
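As a quick numerical check, the sketch below computes both forms of $SS_{xy}$ and $SS_{xx}$ for the ten example points used later in this post and derives the slope and intercept of the fitted line from them. The variable names are chosen for illustration only.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12], dtype=float)
n = x.size
x_bar, y_bar = x.mean(), y.mean()

# Centered form and shortcut form of each sum
ss_xy_centered = np.sum((x - x_bar) * (y - y_bar))
ss_xy_shortcut = np.sum(x * y) - n * x_bar * y_bar
ss_xx_centered = np.sum((x - x_bar) ** 2)
ss_xx_shortcut = np.sum(x * x) - n * x_bar ** 2

print(np.isclose(ss_xy_centered, ss_xy_shortcut))  # → True
print(np.isclose(ss_xx_centered, ss_xx_shortcut))  # → True

slope = ss_xy_centered / ss_xx_centered   # SS_xy / SS_xx
intercept = y_bar - slope * x_bar         # line passes through (x_bar, y_bar)
print(ss_xy_centered, ss_xx_centered)         # → 96.5 82.5
print(round(slope, 4), round(intercept, 4))   # → 1.1697 1.2364
```

The shortcut form avoids a pass to subtract the means, but on data with a large mean and a small spread the centered form is usually less prone to rounding error.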

That's all the math. Let's implement this in Python.

In [1]:
import numpy as np

class linearRegression():
    """Computes a simple linear regression line using the least squares method.

    This implementation handles a single feature (univariate training and test data).

    Attributes:
        x_train (numpy.ndarray): The 1-D training feature array.
        y_train (numpy.ndarray): The 1-D training label array.
        x_test (numpy.ndarray): The 1-D test feature array.
        y_test (numpy.ndarray): The 1-D test label array.
        m (float): Initial value of the slope. Optional.
        c (float): Initial value of the intercept. Optional.
    """
    def __init__(self, m = 0.0, c = 0.0):
        if not isinstance(m, float):
            raise ValueError("The type of m should be a 'float' but found {}".format(type(m)))
        if not isinstance(c, float):
            raise ValueError("The type of c should be a 'float' but found {}".format(type(c)))
        self.x_train = np.array([])
        self.y_train = np.array([])
        self.x_test = np.array([])
        self.y_test = np.array([])
        self.m = m
        self.c = c

    def fit(self, x_train, y_train):
        """Fits the simple linear regression model after validating the data passed to it.

        Args:
            x_train (numpy.ndarray): The independent variable.
            y_train (numpy.ndarray): The dependent variable.

        Raises:
            ValueError: If x_train is not a numpy.ndarray.
            ValueError: If y_train is not a numpy.ndarray.
            ValueError: If x_train is not a one dimensional array.
            ValueError: If y_train is not a one dimensional array.
            ValueError: If the lengths of x_train and y_train are not the same.
        """
        if not isinstance(x_train, np.ndarray):
            raise ValueError("The type of x_train should be a 'numpy.ndarray' but found {}".format(type(x_train)))
        if not isinstance(y_train, np.ndarray):
            raise ValueError("The type of y_train should be a 'numpy.ndarray' but found {}".format(type(y_train)))

        # Only a single feature is supported, so both arrays must be 1-D
        if x_train.ndim != 1:
            raise ValueError("This implementation only calculates a univariate linear regression line. We found dimension as {}".format(x_train.ndim))

        if y_train.ndim != 1:
            raise ValueError("The dependent/target training value must be a one dimensional array. We found dimension as {}".format(y_train.ndim))

        if x_train.size != y_train.size:
            raise ValueError("The number of training examples and the number of labels must be the same.")

        self.x_train = x_train
        self.y_train = y_train
        self.m, self.c = self.compute_coef(self.x_train, self.y_train)

    def compute_coef(self, x, y):

        """Computes the regression coefficients for the given values of x and y.

        Args:
            x (numpy ndarray): The independent variables.
            y (numpy ndarray): The dependent variable.


        Returns:
            (m, c) (:tuple:float): The computed regression coefficients.

        """
        
        # number of observations/points 
        n = np.size(x) 

        # mean of x and y
        m_x, m_y = np.mean(x), np.mean(y)

        #compute the cross-deviation and deviation of x
        SS_xy = np.sum(y*x) - n*m_y*m_x 
        SS_xx = np.sum(x*x) - n*m_x*m_x 

        # Compute the regression coefficients: m is the slope, c is the intercept
        m = SS_xy / SS_xx
        c = m_y - m*m_x

        return (m, c)

    def predict(self, x_test):
        """Predicts the y values for x_test using the calculated coefficients.

        Arguments:
            x_test {numpy ndarray} -- The x values on which the predictions need to be made.

        Raises:
            ValueError: If x_test is not of type numpy ndarray

        Return:
            y_pred {list: float} -- The predicted values.
        """
        if not isinstance(x_test, np.ndarray):
            raise ValueError("The type of x_test should be a 'numpy.ndarray' but found {}".format(type(x_test)))

        y_pred = []

        for _x in x_test:
            # float() converts the NumPy scalar so the list holds plain Python floats
            y_pred.append(float(self.m*_x + self.c))

        return y_pred



In [2]:
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
model = linearRegression()
model.fit(x,y)
# Round for display; the fitted slope is about 1.1697 and the intercept about 1.2364
print([round(v, 4) for v in model.predict(x)])


[1.2364, 2.4061, 3.5758, 4.7455, 5.9152, 7.0848, 8.2545, 9.4242, 10.5939, 11.7636]
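
As an independent reference, NumPy's `np.polyfit` with degree 1 fits a least-squares line to the same data and returns the slope followed by the intercept. The sketch below is a sanity check only and is not part of the class above. The rounding to four decimals is for display.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# polyfit returns coefficients from the highest degree down: [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
print(round(float(slope), 4), round(float(intercept), 4))  # → 1.1697 1.2364

# Predictions from the reference line y = slope*x + intercept
reference = [round(float(slope * xi + intercept), 4) for xi in x]
print(reference)
# → [1.2364, 2.4061, 3.5758, 4.7455, 5.9152, 7.0848, 8.2545, 9.4242, 10.5939, 11.7636]
```

A correct least squares implementation should agree with these reference values up to floating-point rounding.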
