### Implement Linear Regression by writing code for fit, predict and score functions

In [271]:
import numpy as np

In [272]:
data = np.loadtxt("data.csv", delimiter=",")

In [273]:
x = data[:, 0]
y = data[:, 1]

In [274]:
x.shape

(100,)

In [275]:
from sklearn import model_selection
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(x, y, test_size=0.30)
X_train.shape

(70,)

Let's have a quick look at how to calculate the slope m and the intercept c of the best-fit line y = mx + c, i.e. the line that minimizes the cost (the mean squared error) on the training data.

![](images/cost17.png)

![](images/cost18.png)
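For reference, the closed-form solution can also be written out in text. This is the standard least-squares result, given here on the assumption that it matches the formulas in the images above. Bars denote means over the $n$ training points, and $J$ is a name chosen here for the cost:

$$
J(m, c) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - (m x_i + c)\bigr)^2
$$

Setting $\partial J / \partial c = 0$ and $\partial J / \partial m = 0$ gives

$$
c = \bar{y} - m\,\bar{x},
\qquad
m = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^{\,2}}
$$

The numerator of $m$ is the covariance of $x$ and $y$, and the denominator is the variance of $x$.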

In [276]:
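def fit(X_train, Y_train):
    # slope: covariance of x and y divided by the variance of x
    num = (X_train * Y_train).mean() - X_train.mean() * Y_train.mean()
    den = (X_train**2).mean() - X_train.mean()**2
    m = num / den
    # intercept: the best-fit line passes through (mean of x, mean of y)
    c = Y_train.mean() - m * X_train.mean()
    return m, c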
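As a quick sanity check, the closed-form fit can be compared with NumPy's own least-squares line fit. The sketch below uses synthetic data, since `data.csv` is not reproduced here. The names `x_demo`, `y_demo`, `m_ref` and `c_ref`, and the chosen slope, intercept and noise level, are illustrative assumptions.

```python
import numpy as np

def fit(x, y):
    # same closed-form expressions as the fit function above
    num = (x * y).mean() - x.mean() * y.mean()
    den = (x**2).mean() - x.mean()**2
    m = num / den
    c = y.mean() - m * x.mean()
    return m, c

# synthetic, roughly linear data (assumed values, not the notebook's dataset)
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 100, size=200)
y_demo = 2.0 * x_demo + 5.0 + rng.normal(0, 10, size=200)

m_fit, c_fit = fit(x_demo, y_demo)
# np.polyfit with degree 1 returns [slope, intercept]
m_ref, c_ref = np.polyfit(x_demo, y_demo, 1)

print("closed form:", m_fit, c_fit)
print("np.polyfit: ", m_ref, c_ref)
```

Both lines should agree up to floating-point rounding, since both solve the same least-squares problem.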

Let's have a quick look at the flow of things before writing other functions.

- m, c = fit(X_train, Y_train)
- for test data
    - Y_test_pred = predict(X_test, m, c)
    - score(Y_test, Y_test_pred)
- for training data
    - Y_train_pred = predict(X_train, m, c)
    - score(Y_train, Y_train_pred)
- Call the fit function on the training data to get the values of m and c.
- Call the predict function, which takes X_test, m and c and returns the predicted Y values.
- Call the score function, which takes Y_test and Y_test_pred as arguments and compares them to give us a score of how well our algorithm is doing.

In [277]:
def predict(x, m, c):
    return m * x + c

In [278]:
def score(y_truth, y_pred): # coefficient of determination: R^2 = 1 - u / v
    u = ((y_truth - y_pred)**2).sum()          # residual sum of squares
    v = ((y_truth - y_truth.mean())**2).sum()  # total sum of squares
    return 1 - u / v
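To get a feel for what the score means, here is a small sketch on a hand-made array. The array `y_true` and the three result names are illustrative only. A perfect prediction scores 1, always predicting the mean scores 0, and predictions worse than the mean give a negative score.

```python
import numpy as np

def score(y_truth, y_pred):
    u = ((y_truth - y_pred)**2).sum()          # residual sum of squares
    v = ((y_truth - y_truth.mean())**2).sum()  # total sum of squares
    return 1 - u / v

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = score(y_true, y_true)                                # u = 0
baseline = score(y_true, np.full_like(y_true, y_true.mean()))  # u = v
reversed_order = score(y_true, y_true[::-1])                   # u = 20, v = 5

print(perfect, baseline, reversed_order)  # 1.0 0.0 -3.0
```

A negative score is therefore a sign that the predictions are worse than a constant guess, which is often caused by a bug in how the predictions were produced.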

In [279]:
def cost(x, y, m, c):
    return ((y - (m * x + c))**2).mean()  # mean squared error: sum of squared residuals divided by n

In [280]:
m, c = fit(X_train, Y_train)

# test data: predict from the features X_test, not from the targets
Y_test_pred = predict(X_test, m, c)
print("Test Score: ", score(Y_test, Y_test_pred))

# train data
Y_train_pred = predict(X_train, m, c)
print("Train Score: ", score(Y_train, Y_train_pred))
print("m, c: ", m, c)
print("Cost on training data: ", cost(X_train, Y_train, m, c))
print("Cost on training data with m + 1: ", cost(X_train, Y_train, m + 1, c)) # cost grows quadratically as m moves away from the optimum

m, c:  1.2168504508202016 13.781144625806732
Cost on training data:  116.67259128666761
Cost on training data with m + 1:  2655.7729166245445
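The jump in cost when m is shifted can be made precise. At the fitted (m, c), the residuals are uncorrelated with x, so shifting the slope by delta raises the mean squared error by exactly delta² times the mean of x². The sketch below checks this on synthetic data; `x_demo`, `y_demo`, `mean_x2` and the generating parameters are assumptions for illustration.

```python
import numpy as np

def fit(x, y):
    num = (x * y).mean() - x.mean() * y.mean()
    den = (x**2).mean() - x.mean()**2
    m = num / den
    c = y.mean() - m * x.mean()
    return m, c

def cost(x, y, m, c):
    return ((y - (m * x + c))**2).mean()

# synthetic data (assumed values, not the notebook's dataset)
rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 100, size=500)
y_demo = 1.2 * x_demo + 14.0 + rng.normal(0, 10, size=500)

m_opt, c_opt = fit(x_demo, y_demo)
base = cost(x_demo, y_demo, m_opt, c_opt)
mean_x2 = (x_demo**2).mean()

deltas = [0.5, 1.0, 2.0]
# increase in cost when only the slope is shifted by delta
increases = [cost(x_demo, y_demo, m_opt + d, c_opt) - base for d in deltas]
for d, inc in zip(deltas, increases):
    print(d, inc, d**2 * mean_x2)  # the last two columns should match
```

Applied to the output above, the difference between the two costs (about 2539.10) would be the mean of X_train² under this identity.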


Now let's use scikit-learn's built-in LinearRegression on the same training data and check that its coefficients and score match the values from our own implementation.

In [281]:
from sklearn.linear_model import LinearRegression
algo = LinearRegression()
algo.fit(X_train.reshape(-1, 1), Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [282]:
# Find slope m
m = algo.coef_[0]

In [283]:
# Find intercept c
c = algo.intercept_

In [284]:
print("m, c: ", m, c)

m, c:  1.2168504508201818 13.781144625807713


In [285]:
print("Train Score: ", algo.score(X_train.reshape(-1, 1), Y_train))

Train Score:  0.5264787415636045
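To put the two implementations side by side on both splits, the following sketch repeats the comparison on synthetic data. The data, the fixed `random_state`, and the names `x_tr`, `x_te`, `algo_demo` and so on are assumptions chosen so the example is reproducible; they are not part of the notebook above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def fit(x, y):
    num = (x * y).mean() - x.mean() * y.mean()
    den = (x**2).mean() - x.mean()**2
    m = num / den
    c = y.mean() - m * x.mean()
    return m, c

def predict(x, m, c):
    return m * x + c

def score(y_truth, y_pred):
    u = ((y_truth - y_pred)**2).sum()
    v = ((y_truth - y_truth.mean())**2).sum()
    return 1 - u / v

# synthetic data (assumed values, not the notebook's dataset)
rng = np.random.default_rng(42)
x_demo = rng.uniform(0, 100, size=100)
y_demo = 1.2 * x_demo + 14.0 + rng.normal(0, 10, size=100)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.30, random_state=0
)

# our implementation
m_own, c_own = fit(x_tr, y_tr)
own_train = score(y_tr, predict(x_tr, m_own, c_own))
own_test = score(y_te, predict(x_te, m_own, c_own))

# scikit-learn expects a 2-D feature array, hence reshape(-1, 1)
algo_demo = LinearRegression().fit(x_tr.reshape(-1, 1), y_tr)
sk_train = algo_demo.score(x_tr.reshape(-1, 1), y_tr)
sk_test = algo_demo.score(x_te.reshape(-1, 1), y_te)

print("m, c (own):    ", m_own, c_own)
print("m, c (sklearn):", algo_demo.coef_[0], algo_demo.intercept_)
print("train score:", own_train, sk_train)
print("test score: ", own_test, sk_test)
```

The slope, intercept and both scores should agree up to floating-point rounding, because `LinearRegression.score` also returns the coefficient of determination.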
