<a href="https://colab.research.google.com/github/LucyMariel/Lucy/blob/master/LinearRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Finding the $a, b$ that best fit the data is called solving a linear regression. There are various known methods, but this text describes one that uses the steepest descent method. The steepest descent method is a basic technique that can be applied to a wide range of problems that are difficult to solve analytically, and it is closely related to how neural networks are trained, so it is worth learning here.

**Overview of scratch code**

Based on the following template, we will now implement the linear regression algorithm. The scratch code below combines the various functions into one class. The body of each function is written as pass for now, and the class will approach completion as you progress through the text.



In [1]:
class ScratchLinearRegression():
    def __init__(self, num_iter, lr, no_bias, verbose):
        self.num_iter = num_iter
        self.lr = lr
        self.bias = not no_bias  # include a constant (bias) term unless no_bias is True
        self.verbose = verbose
        self.theta = np.array([])
        self.loss = np.array([])
        self.val_loss = np.array([])

    # problem6 (Learning and estimation)
    def fit(self):
        """
        Learning linear regression
        """
        pass


    # problem1
    def _linear_hypothesis(self):
        """
        Hypothetical function
        """
        pass

    # problem2
    def _gradient_descent(self):
        """
        Calculation of updated parameter values using the steepest descent method.
        """
        pass

    # problem3
    def predict(self):
        """
        Estimation by linear regression.
        """
        pass

    # problem4
    def _mse(self):
        """
        Calculation of mean square error
        """
        pass

    # problem5
    def _loss_func(self):
        """
        Loss function
        """
        pass


Line 1: Class definition

Line 2: Constructor definition. The arguments are num_iter (number of training iterations), lr (learning rate), no_bias (whether to omit the constant term), and verbose (whether to print the learning process).

Line 3 ~: Definition of the necessary member variables. Variables used across several functions (theta), values derived from the arguments passed at instantiation (num_iter, lr, bias, verbose), and variables read through the instance (loss, val_loss) are defined as member variables.

We will now find the $a, b$ that best fit the HousePrice data. A linear regression with one explanatory variable, $y = ax + b$, is specifically called a simple linear regression. One with two or more explanatory variables is called a multiple linear regression, and the two are collectively called linear regression.

Here, a human hypothetically defines a function (equation) that the data will approximately follow. This is called a hypothetical function.

The relationships between variables are rarely clear when the data is first given, so in general it is necessary to consider which variables to analyze and how they affect one another.
This work is called modeling.
The relationship between variables obtained by modeling, whether shown as a diagram or written as a function, is called a model, and the hypothetical function is one means of expressing the model.

In [2]:
#Implementation using inner product
import numpy as np

# y=ax1+b
a = 1
b = 2
x1 = 3
y = a * x1 + b
print(y)

5


Next, let's confirm that the inner product gives the same result as $y = a x_1 + b$.
In Python you can use the @ operator to calculate the inner product (matrix product).
The same calculation is obtained by making $b$ the first component of $\theta$ and setting the first component of the input $X$ to $1$.

In [3]:
theta = np.array([[b], [a]])
X = np.array([[1, x1]])
y = X@theta
print(y)

[[5]]


To take the inner product of the explanatory variables $X$ and $\theta$, the number of columns of the first matrix must match the number of rows of the second. Checking the shapes, print(X.shape) gives (1, 2) and print(theta.shape) gives (2, 1). Therefore X@theta is (1, 2)@(2, 1), which is a valid computation because the columns of the first matrix and the rows of the second match, and the result has shape (1, 1).

In other words, the y = X@theta part defines the hypothetical function of linear regression.

This confirms that the scalar form $y = a x_1 + b$ (1) can be replaced with the matrix form $y = X\theta$ (2) by using the inner product. Keep in mind that this notation is commonly used.
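The same replacement carries over to several samples at once. The following is a minimal sketch, assuming three made-up values of $x_1$ (they are not from the HousePrice data): each row of $X$ is one sample with a leading 1, and a single matrix product returns all predictions.

```python
import numpy as np

# Hypothetical values, chosen only for illustration
a, b = 1, 2
x1 = np.array([3.0, 4.0, 5.0])                  # three samples of one feature

theta = np.array([[b], [a]])                    # shape (2, 1): constant term first
X = np.column_stack([np.ones_like(x1), x1])     # shape (3, 2): leading column of ones

y_matrix = X @ theta                            # shape (3, 1), one prediction per row
y_scalar = a * x1 + b                           # shape (3,), the scalar formula

print(X.shape, theta.shape, y_matrix.shape)
print(np.allclose(y_matrix.ravel(), y_scalar))
```

The leading column of ones is what lets the constant term $b$ be handled as just another component of $\theta$.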

**Function Implementation**

Let's implement the hypothetical function and run it with the generated $\theta$ and $X$.
In the following example, the hypothetical function is defined as _linear_hypothesis(self, X) and reads $\theta$ from self.theta.
The single leading underscore indicates that the function is intended for use only within the class; adding it is common practice and improves readability. In Problem 1 it is not necessary to incorporate the function into the class, but if you want to incorporate it to complete the scratch class for linear regression, give it self as its first argument.

In [4]:
def _linear_hypothesis(self, X):
    """
    Compute the output of the hypothetical function.
    Parameters
    ----------
    X : ndarray, shape (n_samples, n_features)
      Training data
    Returns
    -------
    ndarray, shape (n_samples, 1)
    Estimation result of the linear hypothetical function
    """
    pred = X @ self.theta
    return pred

Line 1: Function definition. It takes an explanatory variable X as an argument. The first argument, self, specifies that this function is a member function.

Lines 2-12: docstring (written function description)

Line 13: Calculation of the predicted value, which is the output of the hypothetical function. theta is initialized in the fit function and kept as a member variable.

Line 14: Return as return value

The hypothetical function is a means of expressing what kind of model is assumed.
The hypothetical function of linear regression can be calculated using matrix multiplication

**Problem 2: Steepest descent method**

In the previous text, we introduced that solving a regression equation means finding the best value of $\theta$ (coefficient) for the data. This section explains how to find this $\theta$.

The best fit to the data is generally defined as the $\theta$ for which the mean of the squared differences between the predicted and true values is smallest.

The difference between the predicted value and the true value is called the error, and the mean of the squared errors is called the mean squared error (MSE).

MSE is often used as a goodness-of-fit measure because it is easy to implement.

Mean squared error (MSE)
The mean squared error is expressed by the following equation

$$
L(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
$$

$x^{(i)}$:$x$ value of $i$th data

$y^{(i)}$:$y$ value of $i$th data

$h_\theta(x^{(i)})$: Predicted value of $i$th data

The goal of the steepest descent method is to find $\theta$ such that the value of this equation is minimized.
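To make the quantity being minimized concrete, here is a small sketch that evaluates the mean squared error for a few candidate parameters. The toy data and the one-parameter model $h(x) = ax$ are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical toy data: three samples generated by y = 2x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

def mse_for_slope(a):
    """MSE of the one-parameter model h(x) = a * x on the toy data."""
    pred = a * x
    return np.mean((pred - y) ** 2)

# The error shrinks as a approaches 2 and grows again beyond it
for a in (0.0, 1.0, 2.0, 3.0):
    print(a, mse_for_slope(a))
```

The values are symmetric around $a = 2$, where the error is zero, which is the kind of parameter the steepest descent method searches for.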

**The steepest descent method (how to find $\theta$)**

In linear regression, when the horizontal axis is $\theta$ and the vertical axis is the mean squared error, the function draws a parabola.
To see why, substitute the simple regression equation $ax + b$ for $h_\theta(x^{(i)})$:

$$
L(a, b) = \frac{1}{m}\sum_{i=1}^{m}\left(a x^{(i)} + b - y^{(i)}\right)^2
$$

As you can see from this equation, it is a quadratic function of the parameters $a, b$ that is convex downward (it opens upward).
Thus, if we plot the mean squared error with $L$ on the vertical axis and $\theta$ on the horizontal axis, the curve is bowl-shaped (in practice, we do not know the overall shape of the mean squared error in advance).
The point where the mean squared error is smallest is the bottom of this curve.

The location of this $\theta$ is found by the following procedure.

1. First, plot based on a random $\theta$.
2. Calculate the MSE.
3. Find the slope at the position of the randomly chosen $\theta$.

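The slope (gradient) is the partial derivative of the mean squared error with respect to each parameter $\theta_j$:

$$
\frac{\partial L(\theta)}{\partial \theta_j} = \frac{2}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
$$

Note that in linear regression the coefficient of $\theta_j$ in $h_\theta(x^{(i)})$ is $x_{j}^{(i)}$, which is why this factor appears.

Thus, with learning rate $\alpha$ (lr in the code) and the constant factor 2 dropped (equivalently, using the mean squared error divided by 2 as the function to minimize), the equation representing the update is as follows

$$
\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
$$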

**Creating Functions**

The update expression is implemented in Python as follows.
The following is a function extracted from the completed code, so it does not work by itself.
To make it work on its own, you need to place it inside a class definition so that the self argument, the fourth line (which calls another function of the same class), self.theta defined in the constructor, and so on are all available.

In [5]:
def _gradient_descent(self, X, y):
    m = X.shape[0]  # number of data
    n = X.shape[1]  # number of explanatory variables (columns of X)
    pred = self._linear_hypothesis(X)
    for j in range(n):
        gradient = 0
        for i in range(m):
            gradient += (pred[i] - y[i]) * X[i, j]
        self.theta[j] = self.theta[j] - self.lr * (gradient / m)

Line 1: Function definition. It takes as arguments the explanatory variable X and the objective variable y.

Line 2: The number of data is stored in m

Line 3: The number of explanatory variables is stored in n

Line 4: The hypothetical function is executed to compute the predictions

Line 5: Loop over the n explanatory variables.

Line 6: Assign 0 to gradient (the result of the calculation of Σ in the update formula)

Line 7: Loop over the m data.

Line 8: (pred[i] - y[i]) * X[i, j] is the term inside Σ in the update formula; accumulating it with gradient += yields the final value of Σ.

Line 9: The Σ (gradient) of the update formula has been calculated, and the corresponding theta (self.theta[j]) is updated using it.
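The double loop can also be written as one matrix expression, $\theta - \alpha X^{T}(X\theta - y)/m$. The sketch below is not part of the original code: both function names are hypothetical, and the random data exists only to check that the two forms produce the same update.

```python
import numpy as np

def gradient_step_loop(theta, X, y, lr):
    """The same double loop as above, written as a standalone function."""
    m, n = X.shape
    pred = X @ theta
    new_theta = theta.copy()
    for j in range(n):
        gradient = 0.0
        for i in range(m):
            gradient += (pred[i, 0] - y[i, 0]) * X[i, j]
        new_theta[j, 0] = theta[j, 0] - lr * gradient / m
    return new_theta

def gradient_step_vectorized(theta, X, y, lr):
    """One matrix expression: theta - lr * X^T (X theta - y) / m."""
    m = X.shape[0]
    return theta - lr * (X.T @ (X @ theta - y)) / m

# Random data, used only to compare the two implementations
rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])
y = rng.normal(size=(5, 1))
theta = np.zeros((3, 1))

step_loop = gradient_step_loop(theta, X, y, 0.1)
step_vec = gradient_step_vectorized(theta, X, y, 0.1)
print(np.allclose(step_loop, step_vec))
```

The vectorized form avoids Python-level loops, which usually matters once the number of data grows; the loop form stays closer to the update formula and may be easier to read while learning.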

3. Summary
 The steepest descent method is a method for finding the parameter values that minimize the objective function.
 The update equation is derived by differentiating the mean squared error (loss function).
 Steepest descent can be pictured as walking downhill to the bottom of the mean squared error valley.
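The descent can be watched on the simplest possible valley. The following sketch assumes a made-up one-parameter function $f(\theta) = (\theta - 3)^2$ rather than the mean squared error of real data; the function name descend is hypothetical.

```python
def descend(theta, lr, num_iter):
    """Minimize f(theta) = (theta - 3)**2 by repeatedly stepping against the slope."""
    history = [theta]
    for _ in range(num_iter):
        slope = 2 * (theta - 3)        # derivative of (theta - 3)**2
        theta = theta - lr * slope     # move downhill by lr times the slope
        history.append(theta)
    return history

path = descend(theta=0.0, lr=0.1, num_iter=50)
print(path[:4])    # first few positions
print(path[-1])    # close to the minimum at theta = 3
```

In this particular example each step multiplies the distance to the minimum by $|1 - 2 \cdot lr|$, so a learning rate above 1 makes the steps overshoot and the distance grow instead of shrink.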

**Estimation**

Finding output values using the learned parameters is called estimation (prediction). Let's create a predict method that returns the predicted values.

In [6]:
def predict(self, X):
    if self.bias:
        bias = np.ones(X.shape[0]).reshape(X.shape[0], 1)  # column of ones
        X = np.hstack([bias, X])  # prepend the bias column
    pred_y = self._linear_hypothesis(X)
    return pred_y

Line 1: Function definition. It receives self, so that the instance itself can be used, and the explanatory variable X.

Line 2: Check whether a bias term is used.

Line 3: Definition of the bias term (constant term). An array of ones with one entry per data point is created and reshaped to (number of data, 1).

Line 4: The bias column is prepended to the explanatory variable X received as an argument.

Line 5: Pass the explanatory variable X to the hypothetical function and calculate the predictions.

Line 6: The calculated predicted value is returned as the return value.
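The effect of the bias column is easy to check on a tiny input. This is a sketch with made-up numbers: theta is fixed by hand to represent $y = 1 \cdot x + 2$ instead of being learned.

```python
import numpy as np

X = np.array([[3.0], [4.0]])          # two samples, one feature (hypothetical values)
theta = np.array([[2.0], [1.0]])      # [bias, weight], i.e. y = 1 * x + 2

ones = np.ones((X.shape[0], 1))       # same result as np.ones(m).reshape(m, 1)
X_with_bias = np.hstack([ones, X])    # shape (2, 2): [1, x] per row

pred = X_with_bias @ theta
print(X_with_bias)
print(pred)
```

Without the column of ones, X would have shape (2, 1) and could not be multiplied by a theta of shape (2, 1).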

4. Summary
Estimation is the use of updated parameters to obtain predictions

**What is mean square error (MSE)**

The mean squared error is obtained by squaring the difference between the correct value and the predicted value for each training sample and averaging the squares. The formula is as follows.

$$
L(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
$$

$x^{(i)}$:$x$ value of $i$th data

$y^{(i)}$:$y$ value of $i$th data

$h_\theta(x^{(i)})$: Predicted value of $i$th data

**Creating Functions**

The MSE defined above can be written in Python as follows.
The variables appearing in the function are as follows.

y_pred : the predicted value

y : the correct value

mse : the result of the mean squared error calculation

In [7]:
def _mse(self, y_pred, y):
    mse = ((y_pred - y) ** 2).sum() / y.shape[0]
    return mse

Line 1: Function definition. It receives the predicted value y_pred and the correct answer y as arguments.

Line 2: Calculation of MSE. The square of the error is calculated by (y_pred - y) ** 2, and the sum is calculated by ((y_pred - y) ** 2).sum(). The sum is divided by y.shape[0], the number of data, to compute the average.

Line 3: MSE is returned as the return value.

The mean squared error is the squared residual between the correct label and the prediction result, averaged over all training data.
NumPy makes it easy to calculate.
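One thing to watch when computing this with NumPy is the shape of the two arrays. The sketch below uses made-up values and a standalone function named mse (an assumption, not the class method) to show that subtracting a flat array of shape (3,) from a column of shape (3, 1) broadcasts to a (3, 3) matrix instead of an element-wise difference.

```python
import numpy as np

y_pred = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), like the output of X @ theta
y_col = np.array([[1.0], [2.0], [5.0]])    # shape (3, 1)
y_flat = y_col.ravel()                     # shape (3,)

def mse(y_pred, y):
    """Mean squared error, dividing by the number of data in y."""
    return ((y_pred - y) ** 2).sum() / y.shape[0]

print((y_pred - y_col).shape)    # element-wise difference
print((y_pred - y_flat).shape)   # broadcast to a square matrix
print(mse(y_pred, y_col))
print(mse(y_pred, y_flat.reshape(-1, 1)))  # reshaping restores the intended result
```

Keeping both the predictions and the correct values as column vectors of shape (number of data, 1) avoids the problem.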

**Objective function**

The objective function can be expressed by the following equation.

$$
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
$$

It is the formula for the mean squared error further divided by two. It is called the objective function in the sense that it is the function to be minimized.

The term loss function has almost the same meaning. The function that determines how the residuals are evaluated is called the loss function, and the resulting function to be minimized (or maximized) is the objective function.

There is also the term cost function, which can safely be assumed to mean almost the same thing.

Here, the mean squared error was expressed by the following equation.

$$
L(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
$$

If you compare the formulas, the only difference is whether or not they are divided by 2. Therefore, the calculation can be performed using the MSE function we have created so far. Please refer to the text on the steepest descent method for the difference between this loss function and the mean squared error.
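A likely reason for the division by 2 is that it cancels the 2 produced by differentiating the square. The following derivation is a sketch under the assumption that $J(\theta)$ denotes the mean squared error divided by 2 and that $h_\theta(x^{(i)}) = \sum_k \theta_k x_k^{(i)}$, with $m$ the number of data.

$$
\begin{aligned}
J(\theta) &= \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \\
\frac{\partial J(\theta)}{\partial \theta_j} &= \frac{1}{2m}\sum_{i=1}^{m} 2\left(h_\theta(x^{(i)}) - y^{(i)}\right)\frac{\partial h_\theta(x^{(i)})}{\partial \theta_j} \\
&= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}
\end{aligned}
$$

The last line is exactly the quantity that _gradient_descent accumulates and divides by m before multiplying by lr. Dividing by 2 changes the loss value but not the location of the minimum.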

**Creating Functions**

The loss function defined above can be written using Python as follows.

In [8]:
def _loss_func(self, y_pred, y):
    loss = self._mse(y_pred, y) / 2
    return loss

Line 1: Function definition. It receives the predicted value y_pred and the correct answer y as arguments.

Line 2: Call the MSE function (_mse) to calculate the mean squared error and divide it by 2

Line 3: Returns the calculated loss as the return value

The objective function is the mean squared error between the correct labels and the predictions, divided by 2.
NumPy makes it easy to calculate.

**Learning and Estimation**

In the problems so far, we have created the following functions.

_linear_hypothesis : Output computation of a hypothetical function

_gradient_descent : Update $\theta$.

The purpose of this problem is to use these two functions (which can be modified) to implement the ScratchLinearRegression class shown in the introduction. So let's implement the fit() method, which we have not yet done.

The fit() method is the name used for learning in scikit-learn. In this text, we follow this convention and use the fit() method as the learning method.

Following the flow of the steepest descent algorithm, it is implemented as follows

1. Set lr (learning rate) and num_iter (number of training iterations) and initialize $\theta$ with __init__() (no_bias (presence of bias) and verbose (output of learning process) are advanced tasks and need not be considered at first)

2. Estimates are calculated with _linear_hypothesis

3. Update $\theta$ with _gradient_descent.
Don't forget to calculate the loss value (the value $J(\theta)$ from problem 5) each time and save it in a list, because the saved values let you draw a learning curve.

4. Repeat 2 and 3 for num_iter (number of training iterations)

5. Estimate with predict

In [10]:
#Creating Functions / Now let's write fit() in the class.
def fit(self, X, y, X_val, y_val):
    if self.bias:
        bias = np.ones((X.shape[0], 1))
        X = np.hstack((bias, X))
        bias = np.ones((X_val.shape[0], 1))
        X_val = np.hstack((bias, X_val))
    self.theta = np.zeros(X.shape[1])
    self.theta = self.theta.reshape(X.shape[1], 1)
    for i in range(self.num_iter):
        pred = self._linear_hypothesis(X)
        pred_val = self._linear_hypothesis(X_val)
        self._gradient_descent(X, y)
        loss = self._loss_func(pred, y)
        self.loss = np.append(self.loss, loss)
        loss_val = self._loss_func(pred_val, y_val)
        self.val_loss = np.append(self.val_loss, loss_val)
        if self.verbose:
            print('Loss at iteration {}: {}'.format(i, loss))

Line 1: Function definition. As arguments, explanatory variables and objective variables of training and evaluation data are received, respectively.

Line 2: Determine whether to use the bias term.

Line 3-4: The bias term is initialized and combined with the explanatory variables of the "training data".

Line 5-6: The bias term is initialized and combined with the explanatory variables for the "evaluation data".

Line 7: theta is initialized to 0. The number of elements in theta matches the number of columns of X (the explanatory variables, plus the bias column if it is used).

Line 8: theta is reshaped so that matrix operations can be performed. (X.shape[1],) → (X.shape[1], 1)

Line 9: Loop for the number of iterations

Line 10: The hypothetical function is executed and the predicted value of the "training data" is calculated.

Line 11: The hypothetical function is executed and the predicted value of the "evaluation data" is calculated.

Line 12: The function of the steepest descent method is executed and theta is updated.

Line 13: The loss of the "training data" is calculated.

Line 14: The loss of the "training data" is stored.

Line 15: The loss of the "evaluation data" is calculated.

Line 16: The loss of the "evaluation data" is stored.

Line 17: Determine whether to display the output of the learning process.

Line 18: Output the learning process
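Putting the pieces together, the class might look like the sketch below. It is not the author's completed code: method names follow the template at the beginning, the update uses a vectorized form in place of the double loop, _add_bias is a hypothetical helper, self.bias = not no_bias is an assumed reading of the no_bias argument, and the data is synthetic ($y = 3x + 2$ plus noise) so that the result can be checked.

```python
import numpy as np

class ScratchLinearRegression:
    """Linear regression trained by steepest descent (assembled sketch)."""

    def __init__(self, num_iter, lr, no_bias, verbose):
        self.num_iter = num_iter
        self.lr = lr
        self.bias = not no_bias          # assumption: no_bias=True omits the constant term
        self.verbose = verbose
        self.theta = np.array([])
        self.loss = np.array([])
        self.val_loss = np.array([])

    def _add_bias(self, X):
        # Hypothetical helper (not in the original): prepend a column of ones
        return np.hstack([np.ones((X.shape[0], 1)), X]) if self.bias else X

    def _linear_hypothesis(self, X):
        return X @ self.theta

    def _gradient_descent(self, X, y):
        m = X.shape[0]
        pred = self._linear_hypothesis(X)
        # Vectorized form of the double loop in the text
        self.theta = self.theta - self.lr * (X.T @ (pred - y)) / m

    def _mse(self, y_pred, y):
        return ((y_pred - y) ** 2).sum() / y.shape[0]

    def _loss_func(self, y_pred, y):
        return self._mse(y_pred, y) / 2

    def fit(self, X, y, X_val, y_val):
        X, X_val = self._add_bias(X), self._add_bias(X_val)
        self.theta = np.zeros((X.shape[1], 1))
        for i in range(self.num_iter):
            pred = self._linear_hypothesis(X)
            pred_val = self._linear_hypothesis(X_val)
            self._gradient_descent(X, y)
            self.loss = np.append(self.loss, self._loss_func(pred, y))
            self.val_loss = np.append(self.val_loss, self._loss_func(pred_val, y_val))
            if self.verbose:
                print("Loss at iteration {}: {}".format(i, self.loss[-1]))

    def predict(self, X):
        return self._linear_hypothesis(self._add_bias(X))


# Synthetic data (an assumption for this sketch): y = 3x + 2 plus small noise
rng = np.random.default_rng(0)
X_all = rng.uniform(-1, 1, size=(100, 1))
y_all = 3 * X_all + 2 + rng.normal(scale=0.1, size=(100, 1))
X_train, X_test = X_all[:80], X_all[80:]
y_train, y_test = y_all[:80], y_all[80:]

slr = ScratchLinearRegression(num_iter=500, lr=0.1, no_bias=False, verbose=False)
slr.fit(X_train, y_train, X_test, y_test)
print(slr.theta.ravel())            # should be close to [2, 3] (bias, slope)
print(slr.loss[0], slr.loss[-1])    # the loss should have decreased
```

The synthetic feature lies in [-1, 1], which keeps a learning rate of 0.1 stable. Features on a much larger scale generally need a smaller learning rate or rescaling before the same loop converges.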

**Data preparation**

Next, let's define the data to be used for training

In [12]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
dataset = pd.read_csv("train.csv")
X = dataset.loc[:, ['GrLivArea', 'YearBuilt']]
y = dataset.loc[:, ['SalePrice']]
X = X.values
y = y.values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

Lines 1-3: Importing the required libraries

Line 4: train.csv is read using pandas

Line 5: Only the variables you want to use as "explanatory variables" are retrieved.

Line 6: Only the variable you want to use as the "objective variable" is retrieved.

Lines 7-8: The pandas objects are converted to NumPy arrays, because the scratch class performs its calculations on ndarrays.

Line 9: Split training data and test data at a ratio of 8:2

**Run**

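Next, let's run the model and check that the implementation works.

Here, num_iter (number of iterations) is set to 10, lr (learning rate) to 0.01, no_bias to False so that the bias term is included, and verbose to True so that the learning process is output.

In [13]:
slr = ScratchLinearRegression(num_iter=10, lr=0.01, no_bias=False, verbose=True)
slr.fit(X_train, y_train, X_test, y_test)

NameError: name 'bias' is not defined

(This is the output recorded when the constructor assigned self.bias = bias, a name that does not exist among its parameters. The cell runs only once the constructor derives self.bias from no_bias and the functions from the earlier sections have been placed inside the class.)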

Line 1: Instantiation by putting arguments into the created class
Line 2: Learning execution

**predict**

Next, let's make a prediction with the trained model.

In [None]:
slr.predict(X_test)

Line 1: Prediction using test data

**Problem 7: Plotting the learning curve**

Learning curve

In machine learning, we check whether learning is going well and make corrections and parameter adjustments accordingly.

One way to check the progress of learning is to visualize the value of the loss function (objective function).

A plot of how the loss values output by the loss function change over the iterations is called a learning curve.

In the problems so far, the loss is stored in self.loss of the scratch class we created. This is used to verify that the loss value is decreasing.

Since the training and test data losses are stored in loss and val_loss, they are read through the instance and drawn.

In [14]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(slr.loss)
plt.plot(slr.val_loss)

Line 1: Import required libraries
Line 2: Setup to be able to draw on jupyter if using jupyter notebook
Line 3: Draw training data loss
Line 4: Draw test data loss

Draw a learning curve from the losses to see how learning is progressing.
Visualizing the losses (plotting the learning curve) is also important for verifying that the learning and the implementation are working.