#IT 166 Lab 9

In this lab, we will implement a simple linear regression model from scratch for a dataset loaded from a CSV file.

Objectives
* Be able to develop a simple linear regression model
* Be able to evaluate the performance of the predictive model

Preparation
* Launch the Jupyter notebook.
* Rename the notebook page as “lab6”.

##Simple Linear Regression

Linear regression assumes a linear or straight line relationship between the input variables ($X$) and the single output variable ($y$).

More specifically, that output ($y$) can be calculated from a linear combination of the input variables ($X$). When there is a *single input variable*, the method is referred to as a **Simple Linear Regression**.

In simple linear regression we can use statistics on the training data to estimate the coefficients required by the model to make predictions on new data.

The line for a simple linear regression model can be written as:

$y = b_0 + b_1 \times x$

where $b_0$ and $b_1$ are the coefficients we must estimate from the training data.

Once the coefficients are known, we can use this equation to estimate output values for $y$ given new input examples of $x$.

Estimating these coefficients requires calculating statistical properties of the training data, such as the mean, variance, and covariance.

All the algebra has been taken care of and we are left with some arithmetic to implement to estimate the simple linear regression coefficients.

Briefly, we can estimate the coefficients as follows:

\begin{align*}
b_1 &= \frac{\sum_i (x_i - \bar{x})\times(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \\
 &= \frac{\mathrm{Covariance}(x, y)}{\mathrm{Variance}(x)} \\
b_0 &= \bar{y} - b_1 \times \bar{x}
\end{align*}


where $x_i$ and $y_i$ are the $i^{th}$ values of the input $x$ and the output $y$, and $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$, respectively.
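
To make the arithmetic concrete, here is a minimal sketch that applies the two formulas above to a tiny, made-up dataset. The values in `xs` and `ys` are hypothetical (they are not taken from the lab's data) and were chosen to lie exactly on the line $y = 1 + 2 \times x$, so the estimated coefficients should recover $b_0 = 1$ and $b_1 = 2$.

```python
# Hypothetical toy data (not from the lab's CSV file).
# The points lie exactly on y = 1 + 2x, so we expect b0 = 1 and b1 = 2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

# Means of the input and the output
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Numerator and denominator of the b1 formula
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
denominator = sum((x - x_bar) ** 2 for x in xs)

b1 = numerator / denominator
b0 = y_bar - b1 * x_bar

print(b0, b1)  # 1.0 2.0

# Use the fitted line to estimate y for a new input x = 6
y_new = b0 + b1 * 6.0
print(y_new)  # 13.0
```

Real data will not fall exactly on a line, so the coefficients then describe the best-fitting line rather than an exact relationship.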

##Background: 

###Covariance & Variance

The covariance of two groups of numbers describes how those numbers change together.

Covariance is closely related to correlation. Correlation is a normalized form of covariance: dividing the covariance of two groups of numbers by the product of their standard deviations gives a unitless value between $-1$ and $1$, whereas covariance itself depends on the units of the data.

We can calculate the covariance between two variables, and the variance of a single variable, as follows:

$covariance = \sum((x_i - \bar{x})\times(y_i - \bar{y})) / (n-1)$

$variance = \sum((x_i - \bar{x})^2) / (n-1)$

Both formulas divide by $n-1$ because the covariance between $x$ and $y$ and the variance of $x$ are estimates computed from a sample of $x$ and $y$ rather than from the whole population.
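
As a quick illustration, the sketch below evaluates the two formulas above on a small, made-up dataset. The function names `sample_variance` and `sample_covariance` and the data values are assumptions made for this example only. It also shows that the $(n-1)$ terms cancel when the covariance is divided by the variance, which is why the slope can be computed from either form.

```python
import math

def sample_variance(values):
    # Sum of squared deviations from the mean, divided by n - 1
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def sample_covariance(a, b):
    # Sum of products of paired deviations, divided by n - 1
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)

# Hypothetical toy data (not from the lab's CSV file)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

var_x = sample_variance(xs)        # 10 / 4 = 2.5
cov_xy = sample_covariance(xs, ys) # 6 / 4 = 1.5

# The (n - 1) divisors cancel in the ratio, leaving 6 / 10
slope = cov_xy / var_x
print(var_x, cov_xy, slope)

# Correlation: covariance normalized by both standard deviations
correlation = cov_xy / math.sqrt(var_x * sample_variance(ys))
print(correlation)
```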

###Performance Evaluation

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure how far data points are from the regression line; RMSE measures how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit (i.e., the regression model). RMSE is commonly used in forecasting and regression analysis to verify experimental results.
\begin{equation}
RMSE = \sqrt{\frac{\sum(y^{predicted}_i - y^{actual}_i)^2}{n}}
\end{equation}
Another commonly used performance metric is the Mean Absolute Error (MAE), which is the average of the absolute errors. The formula is:
\begin{equation}
MAE = \frac{\sum|y^{predicted}_i - y^{actual}_i|}{n}
\end{equation}
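
The two metrics can be compared side by side with a short sketch. The function names `rmse` and `mae` and the (actual, predicted) pairs below are hypothetical and serve only to illustrate the formulas above. Because RMSE squares each error before averaging, large errors count more heavily, so RMSE is never smaller than MAE on the same predictions.

```python
import math

def rmse(pairs):
    # pairs: list of (actual, predicted) values
    squared = sum((predicted - actual) ** 2 for actual, predicted in pairs)
    return math.sqrt(squared / len(pairs))

def mae(pairs):
    # pairs: list of (actual, predicted) values
    absolute = sum(abs(predicted - actual) for actual, predicted in pairs)
    return absolute / len(pairs)

# Hypothetical (actual, predicted) pairs; the errors are 2, -2, and 4
pairs = [(10.0, 12.0), (20.0, 18.0), (30.0, 34.0)]

rmse_value = rmse(pairs)  # sqrt((4 + 4 + 16) / 3) = sqrt(8)
mae_value = mae(pairs)    # (2 + 2 + 4) / 3

print(f"RMSE = {rmse_value:.3f}, MAE = {mae_value:.3f}")
```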

#The Task

In this exercise, we are provided with a CSV (Comma Separated Values) file called salary.csv. The file is a plain-text file containing 30 lines of information. Each line consists of two numbers (i.e., years of experience and salary) separated by a comma, such as "1.1,39343". We assume that an employee's salary ($y$) increases approximately linearly with the number of years of working experience ($x$). Our task is to build a Simple Linear Regression model (i.e., $y = b_0 + b_1 \times x$).

##**Here are the steps we intend to do:**
1.   Open and read the CSV file into a List, in which each item is also a List containing one pair of "years" and "salary". After the dataset is loaded successfully, it is stored in a List such as [["1.1", "39343"], ["1.3", "46205"], ...]. This list is similar to a $[30 \times 2]$ array: 30 rows (one per sample), each with 2 values.
2.   Split the dataset into a training set and a test set based on a ratio (e.g., $75\%$ for training and $25\%$ for testing).
3.   Calculate the Covariance and Variance of the training dataset to estimate the coefficients (i.e., $b_0, b_1$)
4.   Evaluate the model performance (e.g., using Root Mean Square Error or RMSE) 
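
Step 1 can also be sketched with Python's built-in `csv` module, as an alternative to splitting each line by hand. This is only an illustration: the helper name `load_rows` is hypothetical, and the sketch reads from an in-memory string holding the two sample lines quoted above so that it runs without the file. Unlike the list described in step 1, it converts each value to a float while loading.

```python
import csv
import io

# Two sample lines in the format described above; the real file has 30 lines.
sample_text = "1.1,39343\n1.3,46205\n"

def load_rows(file_obj):
    # Hypothetical helper: read [years, salary] pairs and convert them to floats
    rows = []
    for row in csv.reader(file_obj):
        if not row:  # csv.reader yields an empty list for a blank line
            continue
        rows.append([float(row[0]), float(row[1])])
    return rows

rows = load_rows(io.StringIO(sample_text))
print(rows)  # [[1.1, 39343.0], [1.3, 46205.0]]
```

With the actual file, the same helper could be called as `load_rows(open_file)` inside a `with open("salary.csv", newline="") as open_file:` block.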




In [None]:
#Mount drive
from google.colab import drive
drive.mount('/content/drive')
import os
os.chdir("/content/drive/My Drive/Colab Notebooks/IT166")

Mounted at /content/drive


In [2]:
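#train_test_split() splits the list based on ratio (e.g., 0.8)
#If ratio = 0.8, then 80% of the data are randomly selected for the training set
#Note: the selected items are removed from aList itself, so after the call
#the caller's list holds only the test samples (testSet is the same list object)
import random
def train_test_split(aList, ratio):
    num_train = round(len(aList) * ratio)
    trainSet = []
    while len(trainSet) < num_train:
        index = random.randint(0, len(aList) - 1)
        #pop() removes the item from aList and returns it
        trainSet.append(aList.pop(index))
    testSet = aList
    return trainSet, testSet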
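
The function above removes the training items from the list that is passed in. If the original list needs to stay intact, one possible alternative is to shuffle a copy and slice it. The sketch below is a variant written for illustration, not the version used in the rest of this lab; the name `split_copy` and the optional `seed` parameter (for repeatable splits) are assumptions.

```python
import random

def split_copy(rows, ratio, seed=None):
    # Hypothetical variant: shuffle a shallow copy so that rows is left unchanged
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    num_train = round(len(shuffled) * ratio)
    return shuffled[:num_train], shuffled[num_train:]

# Hypothetical data: 10 samples of the form [x, y]
rows = [[i, i * 10] for i in range(10)]
train, test = split_copy(rows, 0.8, seed=42)

print(len(train), len(test), len(rows))  # 8 2 10
```

Every sample still ends up in exactly one of the two sets, and passing the same `seed` reproduces the same split.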

In [3]:
#We can use the functions below developed in the previous labs: 
#sumList, meanList, minMaxList, stdevList
import math

def sumList(aList):
    total = 0
    for i in aList:
        total += i
    return total

def meanList(aList):
    return sumList(aList) / len(aList)

def minMaxList(aList):
    minValue = aList[0]
    maxValue = aList[0]
    for i in aList:
        if i > maxValue:
            maxValue = i
        if i < minValue:
            minValue = i
    return minValue, maxValue

#To calculate standard deviation
#(population form: divides by n, unlike varianceList below, which divides by n-1)
def stdevList(aList):
    squaredDiffSum = 0
    mean = meanList(aList)
    for i in aList:
        squaredDiffSum += (i - mean) ** 2
    stdev = math.sqrt(squaredDiffSum / len(aList))
    return stdev

# Calculate the variance of a list of numbers
def varianceList(aList):
    mean = meanList(aList)
    return sumList([(x - mean)**2 for x in aList])/(len(aList)-1)

# Calculate the covariance of two lists of numbers
#The two lists should have the same length.
def covarianceList(aList, bList):
    covar = 0.0
    meanA = meanList(aList)
    meanB = meanList(bList)
    for idx in range(len(aList)):
        covar += (aList[idx] - meanA) * (bList[idx] - meanB)
    return covar/(len(aList)-1)

# Calculate the coefficients
#The input is a list, in which each item is also a list containing a sample [x_i, y_i]
def coefficients(trainingDatasetList):
    xList = [item[0] for item in trainingDatasetList]
    yList = [item[1] for item in trainingDatasetList]
    b1 = covarianceList(xList, yList) / varianceList(xList)
    b0 = meanList(yList) - b1 * meanList(xList)
    return b0, b1

#Build the regression model and test it
def simple_linear_regression(trainList, testList):
    b0, b1 = coefficients(trainList)
    testResultList = []
    for item in testList:
        y_predict = b0 + b1 * item[0]
        testResultList.append([item[1], y_predict])
    return testResultList #a list of actual and predicted value pairs

#Evaluate the model performance using RMSE
def evalRMSE(testResultList):
    err = 0
    for item in testResultList:
        err += (item[0] - item[1]) ** 2 
    return (err / len(testResultList)) ** 0.5

In [7]:
#The app using the Simple Linear Regression 

dataset = []

with open("salary.csv") as file:
    for line in file:
        s1 = line.strip()
        if not s1:  #skip blank lines, e.g., an empty line at the end of the file
            continue
        s2 = s1.split(",")
        dataset.append(s2)

#print(dataset)
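#Split the dataset: 80% as training set; 20% as test set
#train_test_split() removes the training samples from dataset,
#so dataset printed below holds only the remaining test samples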
trainSet, testSet = train_test_split(dataset, 0.8)
print(dataset)
print(trainSet)


[['1.3', '46205'], ['2.9', '56642'], ['4.0', '56957'], ['5.1', '66029'], ['6.0', '93940'], ['7.1', '98273']]
[['9.6', '112635'], ['5.9', '81363'], ['10.5', '121872'], ['3.9', '63218'], ['4.5', '61111'], ['6.8', '91738'], ['1.1', '39343'], ['8.7', '109431'], ['2.0', '43525'], ['7.9', '101302'], ['4.9', '67938'], ['9.5', '116969'], ['8.2', '113812'], ['3.2', '54445'], ['10.3', '122391'], ['4.1', '57081'], ['3.0', '60150'], ['4.0', '55794'], ['9.0', '105582'], ['5.3', '83088'], ['3.7', '57189'], ['3.2', '64445'], ['1.5', '37731'], ['2.2', '39891']]


In [None]:
print(trainSet)
print(testSet)

for item in trainSet:
    item[0] = float(item[0])
    item[1] = float(item[1])

for item in testSet:
    item[0] = float(item[0])
    item[1] = float(item[1])

resultList = simple_linear_regression(trainSet, testSet)
rmse = evalRMSE(resultList)

print(f"RMSE is {rmse}")