# Simple Linear Regression

## Implementing this from scratch involves five main steps.

1. Calculate Mean and Variance
2. Estimate Covariance
3. Estimate coefficients
4. Make predictions 
5. Calculate RMSE

There is an interesting observation here. The tutorial that I am following omits the division by N in its variance and covariance formulas, so strictly speaking it computes sums of squared deviations and cross-products rather than the variance and covariance. Even so, its coefficients match the ones from my method.

The reason?
The slope is the covariance divided by the variance. Both carry the same factor of 1/len(x), one in the numerator and one in the denominator, so the factor cancels and both approaches give the same coefficients. The intercept depends only on the slope and the two means, so it is unaffected as well.
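A quick way to check the cancellation is to compute the slope both ways on a small dataset. This is only a sketch: the helper names `slope_with_n` and `slope_without_n` are made up for this illustration and are not from the tutorial, and the data is the dummy dataset used later in this notebook.

```python
def mean(values):
    return sum(values) / len(values)

def slope_with_n(x, y):
    """Slope as covariance / variance, both divided by N."""
    n = len(x)
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    var = sum((xi - mx) ** 2 for xi in x) / n
    return cov / var

def slope_without_n(x, y):
    """Slope from the raw sums, with no division by N."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

b1_with_n = slope_with_n(x, y)
b1_without_n = slope_without_n(x, y)
print(b1_with_n, b1_without_n)  # both are approximately 0.8
```

The two values agree up to floating-point rounding, because dividing the numerator and the denominator by the same N does not change the ratio.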

## 1. Functions for Mean and Variance

In [13]:
from math import sqrt

def mean_dataset(values):
    return sum(values) / float(len(values))

def variance_dataset(values, mean):
    return sum([(x-mean)**2 for x in values])/len(values)
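Note that `variance_dataset` divides by N, so it is the population variance rather than the sample variance (which divides by N - 1). As a sanity check, the sketch below compares both helpers against `mean` and `pvariance` from Python's standard `statistics` module. The helpers are repeated so the snippet runs on its own, with the second parameter renamed to `mean_value` only to avoid clashing with the imported `mean`.

```python
from statistics import mean, pvariance

def mean_dataset(values):
    return sum(values) / float(len(values))

def variance_dataset(values, mean_value):
    # Population variance: average squared deviation from the mean.
    return sum((x - mean_value) ** 2 for x in values) / len(values)

values = [1, 2, 4, 3, 5]
m = mean_dataset(values)
v = variance_dataset(values, m)

print(m, mean(values))       # both means should agree
print(v, pvariance(values))  # both population variances should agree
```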

## 2. Estimating covariance

In [20]:
def covariance_dataset(x, y):
    mean_x = mean_dataset(x)
    mean_y = mean_dataset(y)
    
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar/len(x)
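One easy check for a covariance function is that the covariance of a variable with itself equals its variance. The sketch below repeats the helpers so it runs on its own, writes the covariance loop with `zip` (an equivalent formulation, not a change in behavior), and uses the dummy data from later in this notebook.

```python
def mean_dataset(values):
    return sum(values) / float(len(values))

def variance_dataset(values, mean):
    return sum((x - mean) ** 2 for x in values) / len(values)

def covariance_dataset(x, y):
    mean_x, mean_y = mean_dataset(x), mean_dataset(y)
    # Average product of the paired deviations from each mean.
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / len(x)

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

cov_xy = covariance_dataset(x, y)
cov_xx = covariance_dataset(x, x)
var_x = variance_dataset(x, mean_dataset(x))

print(cov_xy)         # approximately 1.6
print(cov_xx, var_x)  # covariance of x with itself matches its variance
```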

## 3. Estimating coefficients

In [21]:
def coefficients(dataset):
    x = [row[0] for row in dataset]
    y = [row[1] for row in dataset]
    
    x_mean, y_mean = mean_dataset(x), mean_dataset(y)
    
    b1 = covariance_dataset(x, y)/ variance_dataset(x, x_mean)
    b0 = y_mean - b1 * x_mean
    
    return [b0, b1]
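Because the intercept is defined as `b0 = y_mean - b1 * x_mean`, the fitted line always passes through the point (x_mean, y_mean). The sketch below checks this on the dummy data. It is a condensed restatement of `coefficients` that works from the raw sums (the 1/N factors cancel in the slope), not the exact code from the cell above.

```python
def mean_dataset(values):
    return sum(values) / float(len(values))

def coefficients(dataset):
    x = [row[0] for row in dataset]
    y = [row[1] for row in dataset]
    x_mean, y_mean = mean_dataset(x), mean_dataset(y)
    # Raw sums are enough here because 1/N cancels in the ratio.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return [b0, b1]

dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
b0, b1 = coefficients(dataset)
print(b0, b1)  # approximately 0.4 and 0.8

x_mean = mean_dataset([row[0] for row in dataset])
y_mean = mean_dataset([row[1] for row in dataset])
print(b0 + b1 * x_mean, y_mean)  # the line passes through the means
```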

## 4. Make predictions

In [22]:
def simple_linear_regression(train, test):
    predictions = list()
    b0, b1 = coefficients(train)
    
    for row in test:
        yhat = b0 + b1 * row[0]
        predictions.append(yhat)
    
    return predictions 

## 5. Calculating RMSE


In [23]:
def rmse_metric(actual, predicted):
    sum_error = 0.0
    
    for i in range(len(actual)):
        prediction_error = (actual[i] - predicted[i])
        sum_error += (prediction_error ** 2)
    mean_error = sum_error / float(len(actual))
    
    return sqrt(mean_error)
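As a small worked example, the sketch below applies RMSE to the dummy targets and the predictions that the fitted line produces for them (rounded here for readability). The function is an equivalent `zip`-based formulation, repeated so the snippet runs on its own.

```python
from math import sqrt

def rmse_metric(actual, predicted):
    # Square each error, average the squares, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(actual))

actual = [1, 3, 3, 2, 5]
predicted = [1.2, 2.0, 3.6, 2.8, 4.4]

rmse = rmse_metric(actual, predicted)
print(round(rmse, 3))  # 0.693
```

The squared errors sum to about 2.4, their mean is about 0.48, and the square root of that is about 0.693.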
        

## Evaluating algorithm

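I'm not entirely sure, but I think this evaluation function stays the same across regression problems, since it only needs a dataset and an algorithm to call. Note that it scores the model on the same rows it was trained on.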

In [24]:
def evaluation_algorithm(dataset, algorithm):
    test_set = list()
    for row in dataset:
        row_copy = list(row)
        row_copy[-1] = None
        test_set.append(row_copy)
        
    predicted = algorithm(dataset, test_set)
    print(predicted)
    actual = [row[-1] for row in dataset]
    rmse = rmse_metric(actual, predicted)
    return rmse
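Since the harness only assumes that `algorithm` takes `(train, test)` and returns a list of predictions, any other regression function with that signature can be plugged in. The sketch below shows this with a hypothetical `mean_baseline` (not part of the tutorial) that always predicts the mean of the training targets. The harness is a condensed restatement without the debug print.

```python
from math import sqrt

def rmse_metric(actual, predicted):
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(actual))

def evaluation_algorithm(dataset, algorithm):
    # Hide the target (last column) before handing rows to the algorithm.
    test_set = [list(row[:-1]) + [None] for row in dataset]
    predicted = algorithm(dataset, test_set)
    actual = [row[-1] for row in dataset]
    return rmse_metric(actual, predicted)

def mean_baseline(train, test):
    """Hypothetical baseline: always predict the mean training target."""
    y_mean = sum(row[-1] for row in train) / len(train)
    return [y_mean for _ in test]

dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
baseline_rmse = evaluation_algorithm(dataset, mean_baseline)
print(round(baseline_rmse, 3))  # 1.327
```

A baseline like this gives a reference point: a regression model is only useful if its RMSE on the same data is clearly lower.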
    

# Testing with dummy data

In [27]:
dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
rmse = evaluation_algorithm(dataset, simple_linear_regression)
print("RMSE for the dummy dataset is: ", round(rmse, 3))

[1.1999999999999995, 1.9999999999999996, 3.5999999999999996, 2.8, 4.3999999999999995]
RMSE for the dummy dataset is:  0.693


# Testing with actual dataset

## Dataset used: [Swedish Auto Insurance Case study](https://www.kaggle.com/datasets/redwankarimsony/auto-insurance-in-sweden)

In [35]:
from random import seed
from random import randrange
from csv import reader
from math import sqrt

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, "r") as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
        return dataset

# Convert every value to float in place, skipping the first row
def str_to_float(dataset):
    for i in range(1, len(dataset)):
        for j in range(len(dataset[0])):
            dataset[i][j] = float(dataset[i][j])
        
# Split a dataset into a train and test set
def train_test_split(dataset, split):
    train = list()
    train_size = split * len(dataset)
    dataset_copy = list(dataset)
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy
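To see what `train_test_split` returns, the sketch below runs it on ten made-up rows with a 0.6 split. The function is repeated so the snippet runs on its own. The `rows` data is purely illustrative.

```python
from random import seed, randrange

def train_test_split(dataset, split):
    train = list()
    train_size = split * len(dataset)
    dataset_copy = list(dataset)
    # Move randomly chosen rows into train until it is large enough.
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy

seed(1)  # fixed seed so the split is reproducible
rows = [[i, 2 * i] for i in range(10)]
train, test = train_test_split(rows, 0.6)

print(len(train), len(test))  # 6 4
print(len(rows))              # 10, the original list is left untouched
```

Every row ends up in exactly one of the two lists, and the original list is not modified because the function samples from a shallow copy.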

In [51]:
seed(1)
filename = 'swedish_insurance.csv'
dataset_insurance = load_csv(filename)
str_to_float(dataset_insurance)

# Drop the first row, which str_to_float skipped and is not part of the data
dataset_insurance.pop(0)

# Evaluate an algorithm using a train/test split
def evaluate_algorithm(dataset, algorithm, split, *args):
    train, test = train_test_split(dataset, split)
    test_set = list()
    for row in test:
        row_copy = list(row)
        row_copy[-1] = None
        test_set.append(row_copy)
    predicted = algorithm(train, test_set, *args)
    actual = [row[-1] for row in test]
    rmse = rmse_metric(actual, predicted)
    return rmse

split = 0.6
rmse = evaluate_algorithm(dataset_insurance, simple_linear_regression, split)
print("RMSE is: ", rmse)


RMSE is:  33.62982326492123
