# Linear Regression Lab
## Pre-lab
In reality, before applying a machine learning algorithm, you would need to:
1. Find a dataset
2. Modify the dataset as per your needs
3. Preprocess the dataset for improved performance (normalize, reduce dimensions, etc.)

For now, the first two steps have already been done for you, and the third step will be discussed in the next article.

To import the data and any necessary libraries, simply run the cell below.

In [None]:
# Deal with large arrays quickly and easily
import numpy as np
# Display data table
import pandas as pd

# Get data from CSV file on GitHub
dataTable = pd.read_csv("https://raw.githubusercontent.com/Endothermic-Dragon/Polygence/master/Jupyter%20Notebooks/Linear%20Regression/Fish%20Dataset.csv")

# Get data as array
data = dataTable.to_numpy()

# Display data table
dataTable

After running the cell above, you should see a data table about fish. Your goal is to use each fish's vertical length, diagonal length, cross length, height, and diagonal width to predict its weight. First, let's perform a basic visualization of the data to see what kind of model would work well.

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt

# Specify five subplots
fig, axes = plt.subplots(5, 1, figsize=(5, 25))

# Generate and assign subplots
sns.scatterplot(ax=axes[0], data=dataTable, x="Vertical Length (cm)", y="Weight (g)")
sns.scatterplot(ax=axes[1], data=dataTable, x="Diagonal Length (cm)", y="Weight (g)")
sns.scatterplot(ax=axes[2], data=dataTable, x="Cross Length (cm)", y="Weight (g)")
sns.scatterplot(ax=axes[3], data=dataTable, x="Height (cm)", y="Weight (g)")
sns.scatterplot(ax=axes[4], data=dataTable, x="Diagonal Width (cm)", y="Weight (g)")

# Delete temporary variables
del fig, axes

# Show figure
plt.show()

Above, we plotted each input directly against the output. In general, the weight could depend on combinations of several inputs at once, but this dataset was chosen so that it does not, as the purpose of this exercise is to learn the inner computational mechanics of linear regression.

It seems like each of the input columns would benefit from a fit with a squared term. So, let's add that into our linear regression equation to reduce bias.

<h3>

$$f(\theta, x_i)=\theta_1 x_{i,\, \text{vertical\_length}} + \theta_2 x_{i,\, \text{vertical\_length}}^2 \\ + \theta_3 x_{i,\, \text{diagonal\_length}} + \theta_4 x_{i,\, \text{diagonal\_length}}^2 \\ + \theta_5 x_{i,\, \text{cross\_length}} + \theta_6 x_{i,\, \text{cross\_length}}^2 \\ + \theta_7 x_{i,\, \text{height}} + \theta_8 x_{i,\, \text{height}}^2 \\ + \theta_9 x_{i,\, \text{diagonal\_width}} + \theta_{10} x_{i,\, \text{diagonal\_width}}^2 \\ + \theta_{11}$$

</h3>

Note that we will precompute the squared terms and store them in an array to reduce the computation needed later. We will also split the data into training and validation groups.

In [None]:
# Shuffle data to make it unordered and random
np.random.shuffle(data)

trainingData = data[:71, :]
validationData = data[71:, :]

# Get columns
x_vertical_length, x_diagonal_length, x_cross_length, x_height, x_diagonal_width, trainingOutputs = np.hsplit(trainingData, 6)

# Horizontally stack columns and additional columns as appropriate
trainingData = np.hstack((
    x_vertical_length, x_vertical_length**2,
    x_diagonal_length, x_diagonal_length**2,
    x_cross_length, x_cross_length**2,
    x_height, x_height**2,
    x_diagonal_width, x_diagonal_width**2
))

# Get columns
x_vertical_length, x_diagonal_length, x_cross_length, x_height, x_diagonal_width, validationOutputs = np.hsplit(validationData, 6)

# Horizontally stack columns and additional columns as appropriate
validationData = np.hstack((
    x_vertical_length, x_vertical_length**2,
    x_diagonal_length, x_diagonal_length**2,
    x_cross_length, x_cross_length**2,
    x_height, x_height**2,
    x_diagonal_width, x_diagonal_width**2
))

# Remove temporary variables
del x_vertical_length, x_diagonal_length, x_cross_length, x_height, x_diagonal_width

print("Data split successfully.")

This lab also contains various checks to make sure that you've implemented the functions correctly. Run the next cell to initialize them.

In [None]:
def checkF():
    valid = []
    for _ in range(5):
        # Randomly generate "data"
        thetas = np.random.uniform(-10, 10, 11)
        xs = np.random.uniform(-10, 10, 10)

        # Get the correct answer
        expectedOutputs = sum(thetas[i]*xs[i] for i in range(10)) + thetas[10]

        # Compare function's output to the correct answer
        # Add up the absolute value of the differences
        # Make sure that the sum is under the float error tolerance of 0.01
        validity = np.sum(np.abs(f(thetas, xs) - expectedOutputs)) < 0.01

        # Add validity check to list
        valid.append(validity)

    # If all checks are true, return success
    if all(valid):
        print("Successful implementation of function!")
    else:
        print("Oops! There's an error in your code.")

def checkJ():
    valid = []
    for _ in range(5):
        # Randomly generate "data"
        thetas = np.random.uniform(-10, 10, 11)
        data = np.random.uniform(-10, 10, (100, 10))
        outputs = np.random.uniform(-10, 10, (100, 1))

        # Get the correct answer
        expectedOutputs = np.mean(
            (
                (np.hstack((data, np.ones((100, 1)))) @ np.array([thetas]).T) - outputs
            )**2
        )

        # Compare function's output to the correct answer
        # Add up the absolute value of the differences
        # Make sure that the sum is under the float error tolerance of 0.1
        validity = np.sum(np.abs(J(thetas, data, outputs) - expectedOutputs)) < 0.1

        # Add validity check to list
        valid.append(validity)

    # If all checks are true, return success
    if all(valid):
        print("Successful implementation of function!")
    else:
        print("Oops! There's an error in your code.")

def checkGradients():
    valid = []
    for _ in range(5):
        # Randomly generate "data"
        thetas = np.random.uniform(-10, 10, 11)
        data = np.random.uniform(-10, 10, (100, 10))
        outputs = np.random.uniform(-10, 10, (100, 1))

        # Get the correct answer
        expectedOutputs = np.mean(
            (
                (np.hstack((data, np.ones((100, 1)))) @ np.array([thetas]).T) - outputs
            ) * np.hstack((data, np.ones((100, 1))))
        , 0) * 2

        # Compare function's output to the correct answer
        # Add up the absolute value of the differences
        # Make sure that the sum is under the float error tolerance of 0.1
        validity = np.sum(np.abs(getGradients(thetas, data, outputs) - expectedOutputs)) < 0.1

        # Add validity check to list
        valid.append(validity)

    # If all checks are true, return success
    if all(valid):
        print("Successful implementation of function!")
    else:
        print("Oops! There's an error in your code.")


Great! Now you have everything you need to make your linear regression model. Specifically, you have to use these to train your model:
* `trainingData` - data which has all the training input data, with the columns in this order:
  * $\text{Vertical Length}$
  * $\text{Vertical Length}^2$
  * $\text{Diagonal Length}$
  * $\text{Diagonal Length}^2$
  * $\text{Cross Length}$
  * $\text{Cross Length}^2$
  * $\text{Height}$
  * $\text{Height}^2$
  * $\text{Diagonal Width}$
  * $\text{Diagonal Width}^2$
* `validationData` - data which has all the validation input data in the same column layout as above
* `trainingOutputs` - a column of the expected outputs for the corresponding training data
* `validationOutputs` - a column of the expected outputs for the corresponding validation data

But wait, one more thing before you start the hands-on portion:
## Quick Tutorial

How the heck do you process all of this data without descending into `for` loop hell? Well, `numpy` is a Python library that's used to process large amounts of data easily. One of its advantages is that you don't need to write a loop to process each row of data. You'll need to know the following functions and operations for this lab:
* `np.array` - converts a Python list to a numpy array
* `np_array[a:b, c:d]` - gets elements from row a (inclusive) to b (exclusive), and then column c (inclusive) to d (exclusive). This is very similar to accessing elements in a Python list, but now you can access more than one dimension at once.
* `np_array + np_array` - adds the two arrays element-wise. For example, `[1, 2] + [3, 4] = [4, 6]`.
* `np_array - np_array` - subtracts the two arrays element-wise. For example, `[1, 2] - [3, 4] = [-2, -2]`.
* `np_array * np_array` - multiplies the two arrays element-wise. For example, `[1, 2] * [3, 4] = [3, 8]`.
* `np.mean(np_array)` - gives the average value for all the cells.

Instead of importing `math` and using `math.pow`, you can use `**` as its equivalent. **MAKE SURE** that you don't use `^`, as that is the bitwise XOR operator, not a math exponent. If you don't know what that is, don't worry about it; just remember not to use it.  
Example: `3**4` returns `81`.
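
As a quick illustration of the operations listed above, here is a minimal sketch on small made-up arrays. The names `a`, `b`, and `grid` are hypothetical and are not part of the lab's data.

```python
import numpy as np

# Convert Python lists to numpy arrays
a = np.array([1, 2])
b = np.array([3, 4])

# Element-wise arithmetic, no loop needed
print(a + b)   # [4 6]
print(a - b)   # [-2 -2]
print(a * b)   # [3 8]

# Slicing a 2D array: rows 0 to 2 (exclusive), columns 1 to 3 (exclusive)
grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])
print(grid[0:2, 1:3])   # the rows [2 3] and [5 6]

# Average over every cell
print(np.mean(grid))    # 5.0

# ** is exponentiation; ^ is a bitwise operator and gives a different result
print(3**4)   # 81
print(3^4)    # 7
```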


## Programming the Model
First, implement the $f(\theta, x_i)$ function. Remember that $x_i$ contains 10 values in the format of a row, and $\theta$ contains 11 values in the format of a row.

Running the cell will automatically tell you whether your implementation is correct or not.

In [None]:
def f(thetas, xs):
    return

checkF()

<details>
<summary>Stuck or completed? Click here to reveal a working example function that you could've written.</summary>

```python
def f(thetas, xs):
    # Multiply corresponding thetas to xs and add them together
    result = 0
    for i in range(10):
        result += thetas[i]*xs[i]
    
    # Add the last theta and return result
    return result + thetas[10]
```
</details>

You've implemented your function. Great job! Now, you have to implement the cost function. We will graph this over iterations to ensure that we're making progress. Here's the cost function formula:

<h3>

$$J(θ) = \frac{1}{n} \sum_{i=1}^n (f(θ, x_i) - y_i)^2$$

</h3>

In the function below, you can see that the parameters `data` and `outputs` are given default values of `trainingData` and `trainingOutputs`, respectively, which are used if those arguments are not specified when calling the function. This will come in handy later, when we'll be checking for overfitting on our validation data.
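
If default parameter values are new to you, the following minimal sketch shows how they behave. The names `average` and `default_values` are made up for illustration and are not part of the lab.

```python
default_values = [1, 2, 3]

def average(values=default_values):
    # values falls back to default_values when the caller passes nothing
    return sum(values) / len(values)

print(average())            # 2.0
print(average([10, 20]))    # 15.0

# The default is captured when the function is defined, so rebinding the
# name afterwards does not change what average() uses
default_values = [100, 200]
print(average())            # still 2.0
```

One consequence worth keeping in mind: if you re-run the cell that splits the data, you would likely also want to re-run the cells that define functions using those defaults.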

In [None]:
def J(thetas, data=trainingData, outputs=trainingOutputs):
    return

checkJ()

<details>
<summary>Stuck or completed? Click here to reveal a working example function that you could've written.</summary>

```python
def J(thetas, data=trainingData, outputs=trainingOutputs):
    result = 0
    # Loop through each row, adding results together
    for i in range(len(data)):
        # Two arrays go into f, number returned
        # Subtract prediction and expected output, square, and add to result
        result += (f(thetas, data[i]) - outputs[i, 0])**2
    # Take the average and return
    return result/len(data)
```
</details>

Next, it's time to implement the gradients. If you don't know how to take the gradient of the cost function, then click the dropdowns below to reveal them.

<details>
<summary>View Gradients</summary>
<h3>
<details>
<summary>

$\frac{\partial}{\partial \theta_1} J(\theta)$</summary>
$$\frac{2}{n} \sum_{i=1}^n ((f(θ, x_i) - y_i) * x_{i,\, \text{vertical\_length}})$$
</details>

<details>
<summary>

$\frac{\partial}{\partial \theta_2} J(\theta)$</summary>
$$\frac{2}{n} \sum_{i=1}^n ((f(θ, x_i) - y_i) * x_{i,\, \text{vertical\_length}}^2)$$
</details>

<details>
<summary>

$\frac{\partial}{\partial \theta_3} J(\theta)$</summary>
$$\frac{2}{n} \sum_{i=1}^n ((f(θ, x_i) - y_i) * x_{i,\, \text{diagonal\_length}})$$
</details>

<details>
<summary>

$\frac{\partial}{\partial \theta_4} J(\theta)$</summary>
$$\frac{2}{n} \sum_{i=1}^n ((f(θ, x_i) - y_i) * x_{i,\, \text{diagonal\_length}}^2)$$
</details>

<details>
<summary>

$\frac{\partial}{\partial \theta_5} J(\theta)$</summary>
$$\frac{2}{n} \sum_{i=1}^n ((f(θ, x_i) - y_i) * x_{i,\, \text{cross\_length}})$$
</details>

<details>
<summary>

$\frac{\partial}{\partial \theta_6} J(\theta)$</summary>
$$\frac{2}{n} \sum_{i=1}^n ((f(θ, x_i) - y_i) * x_{i,\, \text{cross\_length}}^2)$$
</details>

<details>
<summary>

$\frac{\partial}{\partial \theta_7} J(\theta)$</summary>
$$\frac{2}{n} \sum_{i=1}^n ((f(θ, x_i) - y_i) * x_{i,\, \text{height}})$$
</details>

<details>
<summary>

$\frac{\partial}{\partial \theta_8} J(\theta)$</summary>
$$\frac{2}{n} \sum_{i=1}^n ((f(θ, x_i) - y_i) * x_{i,\, \text{height}}^2)$$
</details>

<details>
<summary>

$\frac{\partial}{\partial \theta_9} J(\theta)$</summary>
$$\frac{2}{n} \sum_{i=1}^n ((f(θ, x_i) - y_i) * x_{i,\, \text{diagonal\_width}})$$
</details>

<details>
<summary>

$\frac{\partial}{\partial \theta_{10}} J(\theta)$</summary>
$$\frac{2}{n} \sum_{i=1}^n ((f(θ, x_i) - y_i) * x_{i,\, \text{diagonal\_width}}^2)$$
</details>

<details>
<summary>

$\frac{\partial}{\partial \theta_{11}} J(\theta)$</summary>
$$\frac{2}{n} \sum_{i=1}^n (f(θ, x_i) - y_i)$$
</details>
</h3>
</details>

<details>
<summary>Implementation Hint</summary>

Try to use the fact that all the partial derivatives have $f(θ, x_i) - y_i$ and $\frac{2}{n}$ in common to "parallel process" these.
</details>

In [None]:
def getGradients(thetas, data=trainingData, outputs=trainingOutputs):
    return

checkGradients()

<details>
<summary>Stuck or completed? Click here to reveal a working example function that you could've written.</summary>

```python
def getGradients(thetas, data=trainingData, outputs=trainingOutputs):
    # Get length of training data
    n = len(data)

    # Store gradient values
    results = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

    # Loop through each training datapoint
    for i in range(n):
        # Calculate the part which all the gradients have in common
        resultPart = f(thetas, data[i]) - outputs[i, 0]
        for j in range(10):
            # Multiply the "uncommon" part for each and add to the sum
            results[j] += resultPart * data[i, j]
        results[10] += resultPart

    # Multiply each sum by 2/n to get the gradient for each theta
    for i in range(11):
        results[i] = 2 * results[i] / n

    return results
```
</details>
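
If you want an extra sanity check beyond `checkGradients`, one common approach is to compare the formulas against a numerical estimate obtained by nudging each theta slightly. The sketch below does this on small random data. The names `cost`, `numerical_gradients`, and `analytic_gradients` are hypothetical, and unlike the lab, the outputs here are assumed to be a flat array rather than a column.

```python
import numpy as np

def cost(thetas, data, outputs):
    # Mean squared error of a linear model whose last theta is the constant term
    predictions = data @ thetas[:-1] + thetas[-1]
    return np.mean((predictions - outputs) ** 2)

def numerical_gradients(thetas, data, outputs, eps=1e-6):
    # Central difference: nudge one theta at a time and measure the change in cost
    gradients = np.zeros(len(thetas))
    for j in range(len(thetas)):
        nudge = np.zeros(len(thetas))
        nudge[j] = eps
        gradients[j] = (cost(thetas + nudge, data, outputs)
                        - cost(thetas - nudge, data, outputs)) / (2 * eps)
    return gradients

def analytic_gradients(thetas, data, outputs):
    # Same formula as the partial derivatives: (2/n) * sum(error * x_j)
    errors = data @ thetas[:-1] + thetas[-1] - outputs
    return 2 * np.append(data.T @ errors, np.sum(errors)) / len(data)

# Small random problem: 20 rows, 3 inputs, 4 thetas (assumed sizes for illustration)
rng = np.random.default_rng(0)
data = rng.uniform(-1, 1, (20, 3))
outputs = rng.uniform(-1, 1, 20)
thetas = rng.uniform(-1, 1, 4)

# Largest disagreement between the two estimates
difference = np.max(np.abs(
    numerical_gradients(thetas, data, outputs) - analytic_gradients(thetas, data, outputs)
))
print(difference < 1e-6)
```

The same idea should carry over to your own `J` and `getGradients`, although the loop-based versions will be noticeably slower.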

Finally, it's time to implement a complete iteration that takes in $\theta$ values, and returns new $\theta$ values after performing a gradient descent step with step size of 0.0000003.
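
Written as a formula, one such step could be sketched as follows, assuming $\alpha$ denotes the step size (the symbol is chosen here for illustration and may differ from the one used in the article):

<h3>

$$\theta_j \leftarrow \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta) \quad \text{for } j = 1, \dots, 11, \qquad \alpha = 0.0000003$$

</h3>

All eleven partial derivatives are evaluated at the old $\theta$ values before any of them is updated.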

If you're wondering why it's such a small and weird number, it's because that's the value that works well in this scenario, as found by trial and error. To do this yourself, print the gradients after each step and check that they shrink toward zero steadily, which indicates that the cost is approaching a minimum.
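
To get a feel for that trial and error, here is a minimal sketch on a toy cost $J(\theta) = \theta^2$, not the lab's cost function. The function name `descend` and the three step sizes are assumptions chosen purely for illustration.

```python
def descend(step_size, iterations=20):
    # Minimize J(theta) = theta**2, whose gradient is 2*theta, starting at theta = 1
    theta = 1.0
    for _ in range(iterations):
        theta = theta - step_size * 2 * theta
    return theta

# Print how far theta is from the minimum (theta = 0) after 20 steps
for step_size in (0.01, 0.4, 1.1):
    print(step_size, abs(descend(step_size)))
```

With the smallest step size, theta creeps toward the minimum slowly; with the middle one it gets there quickly; with the largest one each step overshoots and theta moves further away.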

At this point, you're probably wondering why we don't use 0.1, 0.001, or one of those other values we went over in the Gradient Descent article. Those learning rates only become "reasonably good" after a process called normalization, which speeds up the training process significantly. We will go over how to normalize data in the next article.

When editing this next cell, keep in mind whether you implemented the `getGradients` function to return a python list or a numpy array.

This should be relatively easy, so there isn't a check for this part. You can just compare your code with the example code after you're done.

In [None]:
def step(thetas):
    return

<details>
<summary>Stuck or completed? Click here to reveal a working example function that you could've written.</summary>

```python
def step(thetas):
    # Get the gradients
    gradients = getGradients(thetas)

    # Loop through and update each theta
    for i in range(11):
        thetas[i] = thetas[i] - 0.0000003*gradients[i]
    
    return thetas
```
</details>

Now that you're done implementing all of the functions, run the next cell to perform 500 iterations, and see how the cost function goes down.

Feel free to adjust the number of iterations, stored in the variable `iterations`, and see how the cost changes over time. Even if you change it to 5, you should be able to see how the cost dramatically goes down, showing how effective this algorithm is!

In [None]:
# Number of iterations to run
iterations = 500

# Set any values for theta
thetas = np.zeros(11)

# Store cost history for plotting later
cost_history = [J(thetas)]

# Run iterations - perform a step + store cost in history for plotting later
for _ in range(iterations):
    thetas = step(thetas)
    cost_history.append(J(thetas))

print("Starting cost:", cost_history[0])
print("Ending cost:", cost_history[-1])

# Plot cost
plt.figure()
plt.plot(np.arange(iterations+1), cost_history)
plt.show()

But wait, we're not done! We still need to use our validation data. We'll plot the cost on the validation dataset over each iteration, in order to ensure that the cost is going down and we aren't overfitting.

## Model Validation

Remember how we had parameters `data` and `outputs` in our `J` function? It's time to reuse those, by calling the function with `validationData` and `validationOutputs`.

Using the code above, implement a similar version that measures the cost using the validation data over iterations. Remember that you only want to change the dataset for which the cost is measured, not train it for that new dataset, because you want to make sure that your model generalizes well.

The code above has been copied below for your convenience.

In [None]:
# Number of iterations to run
iterations = 500

# Set any values for theta
thetas = np.zeros(11)

# Store cost history for plotting later
cost_history = [J(thetas)]

# Run iterations - perform a step + store cost in history for plotting later
for _ in range(iterations):
    thetas = step(thetas)
    cost_history.append(J(thetas))

print("Starting cost:", cost_history[0])
print("Ending cost:", cost_history[-1])

# Plot cost
plt.figure()
plt.plot(np.arange(iterations+1), cost_history)
plt.show()

<details>
<summary>Stuck or completed? Click here to reveal a working example function that you could've written.</summary>

```python
# Number of iterations to run
iterations = 1000

# Set any values for theta
thetas = np.zeros(11)

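# Store cost history for plotting later
# Modified here - the starting cost is also measured on the validation data
cost_history = [J(thetas, validationData, validationOutputs)]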

# Run iterations - perform a step + store cost in history for plotting later
for _ in range(iterations):
    thetas = step(thetas)
    # Modified here - pass validation data and outputs to the cost function J
    cost_history.append(J(thetas, validationData, validationOutputs))

print("Starting cost:", cost_history[0])
print("Ending cost:", cost_history[-1])

# Plot cost
plt.figure()
plt.plot(np.arange(iterations+1), cost_history)
plt.show()
```
</details>

As you can see, the cost on the validation data is similar to the cost on the training data, which suggests that our model is not overfitting (in other words, it does not suffer from high variance).

## Model Accuracy
Great! Now you know for sure that your model works. But how well? We can measure that using something called the coefficient of determination, or $R^2$ for short. This provides a measure for how good of a fit your model is, and is calculated by the following formula:

<h3>

$$R^2 = 1-\frac{\sum_{i=1}^n (f(\theta, x_i) - y_i)^2}{\sum_{i=1}^n (y_{average} - y_i)^2}$$

</h3>

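The value of $R^2$ is at most 1, where 1 represents a perfect fit and 0 means the model does no better than always predicting the average output. It can even be negative if the model does worse than that, but for a reasonably trained model it falls between 0 and 1, so we can use it to judge the accuracy of our model as a percentage.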

Implement the formula above using the validation data to quantify the accuracy level of our model in the next cell. You should have around a 90% accuracy, but there will be fluctuations due to randomization in the splitting of the original data into training and validation sets.

<details>
<summary>Implementation Hint</summary>

Using the `np.mean` function might be helpful.
</details>

In [None]:
# Values of thetas carry over from previous cell

r_squared = 0

# Type your code here
# The value of r_squared should be a decimal between 0 and 1 after calculations


# Display R^2 as percentage
print(f"{r_squared*100:.2f}% accuracy")

<details>
<summary>Stuck or completed? Click here to reveal a working example function that you could've written.</summary>

```python
# Values of thetas carry over from previous cell

# Get number of rows
n = len(validationData)

# Get the mean of the output values
averageValidationOutput = np.mean(validationOutputs)

# Calculate numerator and denominator sum with loop
numerator = 0
denominator = 0

for i in range(n):
    numerator += (f(thetas, validationData[i]) - validationOutputs[i, 0])**2
    denominator += (averageValidationOutput - validationOutputs[i, 0])**2

# Plug into formula
r_squared = 1 - numerator/denominator

# Display as percentage
print(f"{r_squared*100:.2f}% accuracy")
```
</details>
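
To build some intuition for the formula, here is a minimal sketch that evaluates it on a tiny made-up set of outputs. The helper name `r_squared_score` and the numbers are assumptions for illustration only, and the outputs are stored as a flat array rather than a column.

```python
import numpy as np

def r_squared_score(predictions, outputs):
    # 1 - (sum of squared errors) / (sum of squared deviations from the mean)
    residual = np.sum((predictions - outputs) ** 2)
    total = np.sum((np.mean(outputs) - outputs) ** 2)
    return 1 - residual / total

outputs = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions give 1
print(r_squared_score(outputs, outputs))                         # 1.0

# Always predicting the mean of the outputs gives 0
print(r_squared_score(np.full(4, np.mean(outputs)), outputs))    # 0.0
```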

As a side note, you can increase the accuracy by running more iterations, implementing a better learning rate constant, implementing a dynamic learning rate as was discussed in the article, or using a more advanced gradient descent algorithm.

Running 10,000 iterations should give you around 96% accuracy, and running 100,000 iterations should give you a little higher than 97.5% accuracy.

Congratulations, you've now successfully completed the lab! But, as you can tell, that was a lot of work. We can make these calculations a lot easier and faster by using matrix computations and numpy arrays to our advantage, which we will learn about in the next article.

## Credits
* This lab used a modified version of Aung Pyae's fish market dataset from Kaggle. You can find the original dataset [here](https://www.kaggle.com/datasets/aungpyaeap/fish-market).
* Formula to calculate $R^2$ accuracy from [Newcastle University, UK](https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/statistics/regression-and-correlation/coefficient-of-determination-r-squared.html).