# Training Faster and Better Lab
## Preprocessing Data
### Import The Data
Much like the last lab, we need to import the dataset. The last lab did this for you automatically, but now it's time to take off the training wheels. Your job is to import the data from the CSV file at [this GitHub link](https://raw.githubusercontent.com/Endothermic-Dragon/Polygence/master/Jupyter%20Notebooks/Training%20Faster%20and%20Better/House%20Pricing%20Dataset.csv). Feel free to refer to the code from the last lab, but here's a step-by-step explanation of what you have to do:
1. Import numpy as np, so you can use it to work with large arrays and matrices.
2. Import pandas as pd, so you can use it to fetch the data and display the data table to confirm that all the data has been fetched correctly.
3. Use `pd.read_csv()` to fetch the data from the appropriate URL.
  - [Click for documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
  - You only need to input the `filepath_or_buffer` parameter - essentially, you should only have one input into the `pd.read_csv()` function.
  - Assign the output of the function to the variable `dataTable`.
4. To display the data table, simply type in `dataTable` on an empty line at the end.

A template has been set up so it's easier to fill in the proper code. Once again, feel free to refer to the code from the last lab.

In [None]:
# Import numpy as np, and pandas as pd

# Get data from CSV file on GitHub

# Display data table

<details>
<summary>Stuck or completed? Click here to reveal a working example program that you could've written.</summary>

```python
# Deal with large arrays quickly and easily
import numpy as np
# Display data table
import pandas as pd

# Get data from CSV file on GitHub
dataTable = pd.read_csv("https://raw.githubusercontent.com/Endothermic-Dragon/Polygence/master/Jupyter%20Notebooks/Training%20Faster%20and%20Better/House%20Pricing%20Dataset.csv")

# Display data table
dataTable
```
</details>

Our goal with all of these variables is to predict the data for "House price of unit area".
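
As a quick offline check of the import steps, the sketch below reads a tiny in-memory CSV with `pd.read_csv()`. The two rows are made-up values (not rows from the real dataset), and `io.StringIO` stands in for the URL; `read_csv` accepts either a path/URL string or a file-like object as its first argument.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csvText = (
    "House age,House price of unit area\n"
    "10.5,40.2\n"
    "3.0,55.1\n"
)

# One positional argument (filepath_or_buffer) is all read_csv needs
toyTable = pd.read_csv(io.StringIO(csvText))

print(toyTable.shape)          # (rows, columns)
print(list(toyTable.columns))  # column names come from the first line
```

Swapping the `io.StringIO(...)` argument for the URL string is the only change needed to load the real file.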

### Visualize The Data

Now that you've imported the data, you want to visualize all of it. A good library for this is `seaborn`, which works hand-in-hand with matplotlib's `pyplot`. Run the cell below to visualize all variable relationships and histograms. Keep in mind that this takes about 25 to 45 seconds to render, because of all the graphs and variables involved.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create and show plot
sns.pairplot(dataTable)
plt.show()

### Analyzing correlations

Looking through the data carefully, we can make a few observations:
1. The data for `Distance to the nearest MRT station` is very skewed
    - Apply a nonlinear transformation, such as a natural log
2. `Latitude` and `Longitude` don't seem to have a specific relationship with `House price of unit area`
    - This could hint that the relationship is not exactly visible in two dimensions, but might have a correlation in higher dimensions
    - We would generally expect latitude and longitude to affect the house price in some way, but how they're related is unknown to us
    - We want to bucket this data so our algorithm can model nonlinear relationships
3. The scatterplot for `Longitude` and `Distance to the nearest MRT station` is very linear, but has a very sharp break
    - We want to make sure to insert a boundary for a bin there
4. There seems to be an outlier for `House price of unit area`

We've visualized our data and drawn some basic conclusions. Now, let's visualize the correlation coefficients. Pandas has a built-in function to calculate the correlation matrix, and we can feed it to seaborn to visualize it as a heatmap.

In [None]:
fig, ax = plt.subplots(figsize=(8,7))

# Get the correlation matrix, and round to two decimals to keep the annotations compact
corr = dataTable.corr().round(2)

# Display the heatmap of the matrix
sns.heatmap(corr, annot=True, ax=ax)
plt.show()

### Further Analysis and Transforming Data

We can see from the heatmap that `Distance to the nearest MRT station` has a pretty decent relationship with `House price of unit area`. However, previously we saw that the data was skewed. Let's see if distributing the data more evenly helps increase the level of correlation.
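
To make the idea concrete, here is a minimal sketch on synthetic data (not the lab's dataset). It assumes a log-normal "distance" and a made-up "price" that depends linearly on the log of that distance, then compares skewness and correlation before and after a natural log. The `skewness` helper is a hypothetical name introduced just for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, right-skewed "distances" (log-normal), NOT the lab's data
z = rng.normal(size=1000)
distance = np.exp(6 + z)
# Synthetic price that depends linearly on log(distance), plus noise
price = 50 - 5 * np.log(distance) + rng.normal(scale=2, size=1000)

def skewness(a):
    # Third standardized moment: about 0 for a symmetric distribution
    return np.mean(((a - np.mean(a)) / np.std(a)) ** 3)

corrRaw = np.corrcoef(distance, price)[0, 1]
corrLog = np.corrcoef(np.log(distance), price)[0, 1]

print("skewness before:", skewness(distance))
print("skewness after: ", skewness(np.log(distance)))
print("correlation before:", corrRaw)
print("correlation after: ", corrLog)
```

In this constructed case the log removes most of the skew and the correlation's magnitude rises. Real data will not behave this cleanly, which is why experimenting with the transformation is worthwhile.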

The space below is for you to experiment for what transformations work best.

<details>
<summary>What I did</summary>

`np.log(distance)**1.1`
</details>

In [None]:
# Get all the data from the "Distance to the nearest MRT station" column
distance = dataTable["Distance to the nearest MRT station"]

# Modify the values so it looks more like a normal distribution
# EDIT HERE
distanceModified = distance

# Plot the old relationship
sns.histplot(distance)
plt.show()

# Plot the new relationship
sns.histplot(distanceModified)
plt.show()

Once you're happy with your transformed data, run the next cell to replace the old data.

In [None]:
# Reassign the column data with the modified data
dataTable["Distance to the nearest MRT station"] = distanceModified

Now, check the fruit of your labors. Run the next cell to regraph the correlation matrix. The value's <u>*magnitude*</u> should have gone up.

In [None]:
fig, ax = plt.subplots(figsize=(8,7))

corr = dataTable.corr().round(2)
sns.heatmap(corr, annot=True, ax=ax)
plt.show()

Let's address the relationship between `Longitude` and `Distance to the nearest MRT station`. Since we'll be making buckets for `Longitude`, we can simply identify the boundary. Keep in mind that the graph won't be linear anymore, as we just applied a transformation to `Distance to the nearest MRT station`, but the boundary should still be clear. Once you've identified the critical section you want to see in more detail, zoom into that area.

After zooming in, make sure to take note of the bottom right of the graph, which shows how shifted the graph is. This offset is written in scientific notation, where the number after the `e` is the power of 10 that the number before it is multiplied by. For example, `1.7e+3` actually means $1.7 \cdot 10^3$. A negative value after the `e` represents a negative exponent. For example, `0.8e-13` actually means $0.8 \cdot 10^{-13}$.
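
Python itself reads and writes this notation, so the two examples above can be checked directly. This is just a small illustration; the variable names are arbitrary.

```python
import math

big = 1.7e+3     # 1.7 * 10**3
small = 0.8e-13  # 0.8 * 10**-13

print(big)                                  # 1700.0
print(f"{big:.1e}")                         # 1.7e+03
print(math.isclose(small, 0.8 * 10**-13))   # True
```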

<details>
<summary>What I did</summary>

These bounds worked well with the transformation I used: `[121.53, 121.55, 5, 8.5]`

Regardless of what transformation you used, you should get a boundary value close to 121.54125.
</details>

In [None]:
# Plot the data
dataTable.plot(kind='scatter', x='Longitude', y='Distance to the nearest MRT station')
plt.show()

# Zoom in to the area of change to pinpoint where the break occurs
dataTable.plot(kind='scatter', x='Longitude', y='Distance to the nearest MRT station')
# Set zoomed area bounds
# EDIT HERE
# The parameters in the array should be numbers (floats to be specific), not strings
plt.axis(["x lower limit", "x higher limit", "y lower limit", "y higher limit"])
plt.show()

For now, remember that value (or scroll back up later). We'll deal with this again in the near future.

From our correlation matrix, we can see that the following variables have somewhat weak correlations with `House price of unit area`:
- `Number of convenience stores`
- `House age`
- `Transaction date`

So, let's graph them to see if there's anything we can do.

We'll start with `Number of convenience stores`. Since `Number of convenience stores` is going to be an input in our model, and the output is supposed to predict `House price of unit area`, we'll have `Number of convenience stores` be on the x-axis and `House price of unit area` be on the y-axis.

In [None]:
dataTable.plot(kind='scatter', x='Number of convenience stores', y='House price of unit area')
plt.show()

We can see that it roughly does have a linear shape, so there's not much we can do about that. However, we see the same outlier we saw earlier! Let's remove it.

We can separate this by its value of `House price of unit area`. The other datapoints are around 45ish, but this point is at around 120. Visually, we can see that 100 is a clear cutoff mark that separates this from all the other points. Use this observation to select ONLY the outlier.
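
If conditional selection is new to you, here is the general pattern on a toy table. The column name `score`, its values, and the threshold are all hypothetical, chosen only to show how a comparison produces a True/False mask and how that mask picks out rows.

```python
import pandas as pd

# Hypothetical toy table; the column name and values are made up
toy = pd.DataFrame({"score": [41, 38, 250, 45, 40]})

# A comparison on a column returns one True/False per row
mask = toy["score"] > 200
# Indexing with the mask keeps only the rows marked True
selected = toy[mask].index

print(mask.tolist())      # [False, False, True, False, False]
print(selected.tolist())  # [2]

# drop() removes the rows with those index labels
cleaned = toy.drop(selected)
print(len(cleaned))       # 4
```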

In [None]:
# Conditionally select the outlier point
# EDIT HERE
# Change the condition so it is only satisfied by the outlier point
droppedData = dataTable['House price of unit area'] != 0

# Get all the indices of the data you're dropping (should be ONLY the outlier)
index = dataTable[droppedData].index

# Exactly one point should be selected (this also catches an empty selection)
if len(index) != 1:
    print("Oops, there's an error. Here's what you selected:")
    print(index)
else:
    print(f"You've selected the datapoint at index {index[0]}.")

Now, delete this datapoint by running the next line.

In [None]:
dataTable = dataTable.drop(index)

Next, let's tackle `House age` in a similar manner - by graphing it on a scatterplot!

In [None]:
dataTable.plot(kind='scatter', x='House age', y='House price of unit area')
plt.show()

There's a pretty clear nonlinear relationship where it starts high, dips down, and goes back up again. A quadratic fit would suffice for this. We'll add a `House age^2` feature later on to account for this.
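
To see why a squared feature helps, here is a minimal sketch on synthetic U-shaped data (made up for illustration). Note the swap: it uses `np.linalg.lstsq` for a direct least-squares fit rather than the gradient descent used in this lab, purely to keep the example short. `fitAndScore` is a hypothetical helper name.

```python
import numpy as np

# Synthetic U-shaped data (made up for illustration, not the lab's data)
age = np.linspace(0, 40, 41)
price = 0.05 * (age - 20) ** 2 + 30

# Design matrices: bias + age, and bias + age + age^2
linearFeatures = np.column_stack([np.ones_like(age), age])
quadraticFeatures = np.column_stack([np.ones_like(age), age, age ** 2])

def fitAndScore(features, targets):
    # Least-squares fit, then mean squared error of the fitted curve
    coefficients, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return np.mean((features @ coefficients - targets) ** 2)

print("MSE with age only:     ", fitAndScore(linearFeatures, price))
print("MSE with age and age^2:", fitAndScore(quadraticFeatures, price))
```

The model stays linear in its coefficients; only the inputs gain a curved column, which is what lets a straight-line method trace a dip.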

For now, let's move on to `Transaction date`. The date is in a decimal format. Let's turn that into years, months, and days, as you would generally expect trends that vary by season (and thus, by time of year). The input for the function will be a 1D array of dates, each expressed as a year in decimal form. Why 1D, you ask? That's because retrieving a specific column from `dataTable` returns a 1D array. The job of your function is to take that and output a list of 3 arrays: the first being the integer year, the second the integer month, and the third the integer day. In practice, you can write the function as if the input were a single decimal number, as `numpy` automatically applies each arithmetic operation to every element.

Some helpful operators you might need:
- `/` - plain old division, returns decimal (float) result
- `//` - floor division, which discards the remainder (returns the quotient)
- `%` - modulus, which discards the quotient (returns the remainder)
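
A quick illustration of the three operators, first on plain numbers and then on a numpy array. The array values are arbitrary and were picked so the results are exact.

```python
import numpy as np

print(7 / 2)   # 3.5  (true division)
print(7 // 2)  # 3    (floor division keeps the quotient)
print(7 % 2)   # 1    (modulus keeps the remainder)

# The same operators apply element-wise to numpy arrays
values = np.array([2.5, 7.25])
wholePart = values // 1       # whole-number part of each element
fractionalPart = values % 1   # what is left after removing the whole part
print(wholePart, fractionalPart)
```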

In [None]:
# EDIT HERE
def breakDate(years):
    yearsReturn = 
    monthsReturn = 
    daysReturn = 
    return [yearsReturn, monthsReturn, daysReturn]

# Use a test input and an if statement to confirm correct implementation
testInput = breakDate(np.array([[3 + 5/12 + 17/12/30]]))

# Account for float error and implementation variation
if testInput[0][0,0] == 3 and testInput[1][0,0] == 5 and testInput[2][0,0] in [16, 17]:
    # "Spread" list output and assign to three variables
    years, months, days = breakDate(dataTable["Transaction date"])

    # Add new columns to dataTable
    dataTable["Year"] = years
    dataTable["Month"] = months
    dataTable["Day"] = days

    # Remove the old measure
    dataTable = dataTable.drop(labels="Transaction date", axis=1)
else:
    print("Oops, looks like your implementation is incorrect!")

<details>
<summary>Stuck or completed? Click here to reveal a working example function that you could've written.</summary>

```python
def breakDate(years):
    yearsReturn = years // 1
    monthsReturn = ((years % 1) * 12) // 1
    daysReturn = ((((years % 1) * 12) % 1) * 30) // 1
    return [yearsReturn, monthsReturn, daysReturn]
```
</details>

Next, let's visualize the data we just transformed. One of the lines is set up for you as an example; fill in the other two. Make sure to take note of the function name and parameter inputs, as you'll have to do this by yourself sooner or later!

In [None]:
# Display scatterplot for year
dataTable.plot(kind='scatter', x='Year', y='House price of unit area')
plt.show()

# Display scatterplot for month

# Display scatterplot for day

<details>
<summary>Stuck or completed? Click here to reveal a working example program that you could've written.</summary>

```python
# Display scatterplot for month
dataTable.plot(kind='scatter', x='Month', y='House price of unit area')
plt.show()

# Display scatterplot for day
dataTable.plot(kind='scatter', x='Day', y='House price of unit area')
plt.show()
```
</details>

From the above information, we can conclude a few things:
- There is an overall deviation in the price from the year 2012 to 2013. This is important data to keep, as it will provide our model with an overall sense of trends.
- There is a slight fluctuation over the months. I personally feel that a cubic should be enough to get the gist of the trend, but if you feel otherwise, feel free to use your own adaptation!
- The days recorded are either at the beginning of the month or at the end. It doesn't really provide any extra information; nor are there any large-scale general patterns to help train our model.

As per this analysis, let's strike out `Day`. We'll modify `Month` later.

In [None]:
# The axis parameter tells the dataframe that we're trying to strike out a column
dataTable = dataTable.drop("Day", axis=1)

Great, now we've finished identifying what we need to change (and even changed a few things)!

### Binning and Adding Preprocessed Features

So far, our to-do list is:
- Create bins for latitude
- Create bins for longitude, making sure to add a bin for 121.54125
- House age - fit with a quadratic
  - Add feature which is squared
- Month - fit with a cubic
  - Add features which are squared and cubed

Before making bins, let's first visualize the histograms of `Latitude` and `Longitude`. Use `sns.histplot()` to do this. Make sure to specify what you're plotting as the `x` parameter.

In [None]:
# Histogram for latitude
sns.histplot(dataTable, x="Latitude")
plt.show()

# Histogram for longitude
sns.histplot(dataTable, x="Longitude")
plt.show()

Now, let's make the buckets by quantile (also known as percentile). One of them is shown; your job is to code the other one. Make sure to not just copy and paste, but to truly *understand* what you're coding.

In [None]:
# Get the boundary values, add -infinity and infinity to represent open intervals on either end
latitudeBins = np.append(np.percentile(dataTable.Latitude, np.arange(1,10)*10), [-np.inf, np.inf])
# Sort to get the infinities at the right spots
latitudeBins.sort()
# Use pandas' cut function to bin and automatically label each datapoint
latitudeData = pd.cut(dataTable.Latitude, bins=latitudeBins, labels=["Latitude Bin " + str(i) for i in range(1, 11)])
# Create bin column, assign 1 if in bin, otherwise assign 0
for i in range(1,11):
    binName = "Latitude Bin " + str(i)
    dataTable[binName] = (latitudeData == binName).astype(int)

# EDIT HERE
# Get the boundary values, add -infinity and infinity to represent open intervals on either end
# Also add 121.54125 as a bin boundary

# Sort to get the infinities and 121.54125 at the right spots

# Use pandas' cut function to bin and automatically label each datapoint
# Take note that there's now one more bin than there was before

# Create bin column, assign 1 if in bin, otherwise assign 0
# Take note that there's now one more bin than there was before


# Delete the latitude and longitude data, as we don't need it anymore
dataTable = dataTable.drop(labels=["Latitude", "Longitude"], axis=1)

<details>
<summary>Stuck or completed? Click here to reveal a working example program that you could've written.</summary>

```python
# Get the boundary values, add -infinity and infinity to represent open intervals on either end
# Also add 121.54125 as a bin boundary
longitudeBins = np.append(np.percentile(dataTable.Longitude, np.arange(1,10)*10), [-np.inf, 121.54125, np.inf])
# Sort to get the infinities and 121.54125 at the right spots
longitudeBins.sort()
# Use pandas' cut function to bin and automatically label each datapoint
# Take note that there's now one more bin than there was before
longitudeData = pd.cut(dataTable.Longitude, bins=longitudeBins, labels=["Longitude Bin " + str(n) for n in range(1, 12)])
# Create bin column, assign 1 if in bin, otherwise assign 0
# Take note that there's now one more bin than there was before
for i in range(1,12):
    binName = "Longitude Bin " + str(i)
    dataTable[binName] = (longitudeData == binName).astype(int)
```
</details>
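
To see what the quantile edges and `pd.cut` actually produce, here is a small sketch on eight made-up values. It uses quartiles instead of deciles so the result stays readable; the names `values`, `edges`, and `oneHot` are hypothetical.

```python
import numpy as np
import pandas as pd

# Eight made-up values, NOT the lab's latitude or longitude data
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Quartile boundaries plus open ends, sorted into increasing order
edges = np.append(np.percentile(values, [25, 50, 75]), [-np.inf, np.inf])
edges.sort()

# Five edges define four bins, so four labels are needed
labels = ["Bin " + str(i) for i in range(1, 5)]
binned = pd.cut(values, bins=edges, labels=labels)

# One 0/1 column per bin; every row has exactly one 1
oneHot = pd.DataFrame({name: (binned == name).astype(int) for name in labels})
print(edges)
print(oneHot)
```

Because each bin gets its own coefficient, a linear model can assign a separate offset to each range of the original variable, which is how binning captures a nonlinear relationship.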

Now, let's create the preprocessed features for `House age` and `Month`. I used just squared for `House age`, and squared and cubed for `Month`. If you want to use something different, simply modify the code below! Otherwise, just run the cell.

Once again, we precompute this beforehand so we don't have to recalculate it on every iteration, and can simply fit values of θ to it - plus, it makes dealing with derivatives and matrices easier.

In [None]:
# Add the other features
dataTable["House age^2"] = dataTable["House age"]**2
dataTable["Month^2"] = dataTable["Month"]**2
dataTable["Month^3"] = dataTable["Month"]**3

Next, we need to Z-score all the data for efficient training. Fill in the `zScore` function, once again keeping in mind that the input is a 1D array of data (while technically we're singling out a column, getting the data column from `dataTable` returns a 1D array). The formula to calculate a Z-score is $x^{\prime} = (x - μ) / σ$, where μ is the mean and σ is the standard deviation.

You can use `np.mean(a)` to calculate the mean of all elements in array `a`. You can also calculate the standard deviation of all elements in array `a` with `np.std(a)`.

In [None]:
# EDIT HERE
def zScore(columnData):
    return

# Use a test input and an if statement to confirm correct implementation
testInput = np.array([1,2,3,4,5])
testOutput = zScore(testInput)

# Account for float error while checking
if -0.01 < testOutput[0]+1.414 < 0.01 and -0.01 < testOutput[2] < 0.01:
    # Apply Z-Score function
    for i in dataTable.columns:
        # Exclude our bins
        if "Latitude" not in i and "Longitude" not in i:
            dataTable[i] = zScore(dataTable[i])

    # Reorder the data columns and remove any other unnecessary columns if we didn't already
    dataTable = dataTable[[
                        "Number of convenience stores",
                        "Distance to the nearest MRT station",
                        "House age",
                        "House age^2",
                        "Year",
                        "Month",
                        "Month^2",
                        "Month^3",
                        *["Latitude Bin " + str(i) for i in range(1, 11)],
                        *["Longitude Bin " + str(n) for n in range(1, 12)],
                        "House price of unit area",
                        ]]

    # Display the data table
    display(dataTable)
else:
    print("Oops, you have an incorrect implementation of calculating the Z-Score!\nRecheck your code.")

<details>
<summary>Stuck or completed? Click here to reveal a working example function that you could've written.</summary>

```python
def zScore(columnData):
    return (columnData - np.mean(columnData)) / np.std(columnData)
```
</details>
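
One way to sanity-check any Z-score implementation is to confirm that the result has a mean of about 0 and a standard deviation of about 1. The sketch below does this by hand on the same five numbers used as the test input; the variable names are arbitrary.

```python
import numpy as np

sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

mean = np.mean(sample)         # 3.0
std = np.std(sample)           # sqrt(2), about 1.414 (population std, ddof=0)
scaled = (sample - mean) / std

print(scaled)
print(np.mean(scaled), np.std(scaled))  # approximately 0 and 1
```

Note that `np.std` defaults to the population standard deviation (`ddof=0`), while pandas' own `Series.std()` defaults to the sample version (`ddof=1`), so the two give slightly different scalings.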

### Finishing Up Preprocessing

Great! Now all of our data is preprocessed. We just need to randomize our data and split it into training and validation sets. Note that a column of 1s is added to the training and validation data. This acts as the bias term: any coefficient multiplied by it is returned as-is and added to the other terms.

This step is routine and not the focus of this lab, which is the mathematics, so it is done for you already. However, it is still recommended that you look through the code to understand what it is doing.

In [None]:
# Randomly scramble data, but in a consistent manner (random_state is kind of like a random seed)
dataTable = dataTable.sample(frac=1, random_state=314).reset_index(drop=True)

In [None]:
# If odd number of data points, include in training data
splitPoint = len(dataTable) // 2 + len(dataTable) % 2

trainingData = dataTable.drop(labels="House price of unit area", axis=1).to_numpy()[:splitPoint]
trainingData = np.hstack([
                          np.ones((splitPoint, 1)),
                          trainingData
                         ])
trainingOutputs = dataTable[["House price of unit area"]].to_numpy()[:splitPoint]

validationData = dataTable.drop(labels="House price of unit area", axis=1).to_numpy()[splitPoint:]
validationData = np.hstack([
                          np.ones((len(dataTable) - splitPoint, 1)),
                          validationData
                           ])
validationOutputs = dataTable[["House price of unit area"]].to_numpy()[splitPoint:]

The code below has been set up for you so you simply have to fill in the functions. This should be pretty doable, given that you did a similar activity last lab.

This time, `thetas` is a column vector: a 2D array with one row per feature and a single column.

Also... new numpy feature! The `@` operator performs matrix multiplication on two numpy arrays (for two 1D arrays, this is the dot product). This *might* be helpful (hint hint) when batch calculating `f`, which should take in all the data and the values of θ, and output a column of predictions.
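
Here is a small sketch of `@` on shapes like the ones in this lab: a data matrix with a leading column of 1s, multiplied by a column of coefficients. The numbers and the name `exampleThetas` are made up for illustration.

```python
import numpy as np

# Three made-up rows: a leading 1 (bias term) followed by one feature
data = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [1.0, 2.0],
])
# Column of coefficients: intercept 10, slope 3 (hypothetical values)
exampleThetas = np.array([[10.0], [3.0]])

predictions = data @ exampleThetas   # (3, 2) @ (2, 1) -> (3, 1)
print(predictions.shape)             # (3, 1)
print(predictions.ravel())           # [10. 13. 16.]
```

Each prediction is `1 * 10 + feature * 3`, so the coefficient paired with the column of 1s behaves as the intercept.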

Calculating `J` should be simple enough, given the formula:

<h3>

$$J(θ) = \frac{1}{n} \sum_{i=1}^n (f(θ, x_i) - y_i)^2$$

</h3>

and the `np.mean(...)` function.

The `getGradients` function is a bit more difficult, though. In the past article, we considered each row one by one. Specifically, for each row, we plugged that row's data into `f` (along with the values of $\theta$), subtracted the expected output to get the error, and multiplied that error by each feature before putting the results in lists. At the end, we took the average of each feature's list and multiplied it by 2, before returning the results as an array.

You can do the same operations, but now in a much more efficient and organized manner. First, plug the entire matrix of data and the θ values into `f`. As per the previous specifications, `f` returns a column of predictions; subtract the column of expected outputs from it to get a column of errors. You can directly multiply this column with the data matrix using `*`, as numpy is "smart" enough to figure out to multiply each error row (consisting of one element) with each feature row (consisting of multiple elements), and in the process "stretch" the error row to match the size of the feature row (by repeating the singular element). This is called broadcasting, and you can read more about it [here](https://numpy.org/doc/stable/user/basics.broadcasting.html).

After that, take the mean of the array, making sure to specify the axis, which determines the direction in which the data is "collapsed" when taking the mean. In this example, your axis should be 0, which takes the mean "vertically" - in other words, it returns the mean of each column. For the sake of completeness, you should also be aware that an axis of 1 corresponds to taking the mean "horizontally", and returns the mean of each row (as you would expect). Note that since this process "collapses" a dimension, it returns a 1D array, despite taking a 2D array as input. Finally, multiply all the values of the array by 2, transform it into a vertical 2D array so it is compatible with the shape of `thetas`, and return it.

<details>
<summary>Hint to transform into 2D vertical array</summary>

To transform a 1D array into a vertical 2D array, you might find the `reshape` numpy function useful. You can find its documentation [here](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html). However, there are other methods. For example, you can use `np.newaxis`. Or, my personal favorite (due to its readability and simplicity), wrap it in a list, convert to an array, and use `a.T`, with `a` as your array, to "flip" the array's elements - formally known as a matrix transposition.
</details>
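
The broadcasting, axis, and reshaping ideas above can be tried out on a tiny example. The arrays below are arbitrary and are not the lab's data; they only show how the shapes change at each step.

```python
import numpy as np

column = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1)
matrix = np.array([
    [1.0, 10.0],
    [1.0, 20.0],
    [1.0, 30.0],
])                                          # shape (3, 2)

# Broadcasting repeats the single column across both matrix columns
product = column * matrix                   # shape (3, 2)

# axis=0 collapses the rows, leaving one mean per column
columnMeans = np.mean(product, axis=0)      # shape (2,)

# Three equivalent ways to get a vertical (2, 1) array
viaReshape = columnMeans.reshape(-1, 1)
viaNewaxis = columnMeans[:, np.newaxis]
viaTranspose = np.array([columnMeans]).T

print(product)
print(columnMeans)
print(viaReshape.shape)                     # (2, 1)
```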

Phew! That was a lot. While it may initially take a bit of effort to understand and implement all of that, in the end your code should be about 3 to 4 lines - a huge improvement in readability. In addition, numpy is a library programmed partially in C, which is much faster than plain Python. Consequently, your code should run much faster, too. If you want to know why this is the case, check out [this article](https://www.huffpost.com/entry/computer-programming-languages-why-c-runs-so-much_b_59af8178e4b0c50640cd632e).

The code in all of your functions should be very clear and concise, and should not utilize any loops.

In [None]:
thetas = np.zeros([
                    trainingData.shape[1],
                    1
                  ])

def f(thetas, data=trainingData):
    return

def J(thetas, data=trainingData, outputs=trainingOutputs):
    return

def getGradients(thetas, data=trainingData, outputs=trainingOutputs):
    return

iterations = 1000
cost_history = []
cost_history.append(J(thetas))
for i in range(iterations):
    # Learning rate is chosen for you, but feel free to modify
    # If you get any errors involving infinity or your cost is increasing, chances are the learning rate is too high
    thetas = thetas - 0.25 * getGradients(thetas)
    cost_history.append(J(thetas))

print("Initial cost:", cost_history[0])
print("Final cost:", cost_history[-1])
plt.plot(np.arange(iterations+1), cost_history)
plt.show()

<details>
<summary>Stuck or completed? Click here to reveal a working example program that you could've written.</summary>

```python
thetas = np.zeros([
                    trainingData.shape[1],
                    1
                  ])

def f(thetas, data=trainingData):
    return data @ thetas

def J(thetas, data=trainingData, outputs=trainingOutputs):
    return np.sum((f(thetas, data) - outputs)**2) / len(data)

def getGradients(thetas, data=trainingData, outputs=trainingOutputs):
    resultPart = f(thetas, data) - outputs
    results = 2 * np.mean(resultPart * data, 0)

    return np.array([results]).T

iterations = 1000
cost_history = []
cost_history.append(J(thetas))
for i in range(iterations):
    thetas = thetas - 0.25 * getGradients(thetas)
    cost_history.append(J(thetas))

print("Initial cost:", cost_history[0])
print("Final cost:", cost_history[-1])
plt.plot(np.arange(iterations+1), cost_history)
plt.show()
```
</details>
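
If you want extra confidence that a vectorized gradient is right, one common technique is a finite-difference check: nudge each θ slightly, see how much `J` changes, and compare against the analytic gradient. The sketch below is one way to do it on small synthetic data, with the functions redefined to take explicit arguments so the block runs on its own. The loop is only part of the check, not of the gradient itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic problem (made-up data, with a leading bias column)
data = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
outputs = rng.normal(size=(20, 1))
thetas = rng.normal(size=(4, 1))

def f(thetas, data):
    return data @ thetas

def J(thetas, data, outputs):
    return np.mean((f(thetas, data) - outputs) ** 2)

def getGradients(thetas, data, outputs):
    errors = f(thetas, data) - outputs
    return 2 * np.mean(errors * data, axis=0).reshape(-1, 1)

# Central finite differences: nudge one theta at a time and watch J change
step = 1e-6
numerical = np.zeros_like(thetas)
for k in range(len(thetas)):
    nudge = np.zeros_like(thetas)
    nudge[k, 0] = step
    numerical[k, 0] = (J(thetas + nudge, data, outputs)
                       - J(thetas - nudge, data, outputs)) / (2 * step)

# Largest disagreement between the two gradients (should be tiny)
print(np.max(np.abs(numerical - getGradients(thetas, data, outputs))))
```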

In the next cell, fill in the missing code to visualize the cost of the validation data and ensure that you're not overfitting. The code from the last cell or last lab might be useful. Note that the functions you previously wrote carry over in memory.

<details>
<summary>Stuck or completed? Click here to reveal a working example program that you could've written.</summary>

```python
thetas = np.zeros([
                   trainingData.shape[1],
                   1
                  ])

iterations = 1000
cost_history = []
cost_history.append(J(thetas, validationData, validationOutputs))
for i in range(iterations):
    thetas = thetas - 0.25 * getGradients(thetas)
    cost_history.append(J(thetas, validationData, validationOutputs))

print("Initial cost:", cost_history[0])
print("Final cost:", cost_history[-1])
plt.plot(np.arange(iterations+1), cost_history)
plt.show()
```
</details>

Finally, calculate the accuracy of your model. Remember, the formula is as follows:

<h3>

$$R^2 = 1-\frac{\sum_{i=1}^n (f(\theta, x_i) - y_i)^2}{\sum_{i=1}^n (y_{average} - y_i)^2}$$

</h3>

You should use `np.mean(...)` to calculate the average, and `np.sum(...)` to calculate the sums of the errors. In total, your code should have around 5-7 lines of code (excluding comments and empty lines), and you should not have any loops.

In [None]:
# Value of thetas carries over from previous cell

# Get the mean of the output values

# Calculate numerator and denominator sums (no loop needed)

# Plug into formula

# Display R^2 as percentage

<details>
<summary>Stuck or completed? Click here to reveal a working example program that you could've written.</summary>

```python
# Value of thetas carries over from previous cell

# Get the mean of the output values
averageValidationOutput = np.mean(validationOutputs)

# Calculate numerator and denominator sums (no loop needed)
numerator = np.sum((f(thetas, validationData) - validationOutputs)**2)
denominator = np.sum((validationOutputs - averageValidationOutput)**2)

# Plug into formula
r_squared = 1 - numerator/denominator

# Display R^2 as percentage
print(f"{r_squared * 100:.2f}% accuracy")
```
</details>
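
To build intuition for what the $R^2$ formula reports, the sketch below evaluates it on three made-up sets of predictions: perfect ones, ones that always guess the mean, and ones that are worse than guessing the mean. The helper name `rSquared` and the toy numbers are assumptions for illustration only.

```python
import numpy as np

actual = np.array([[1.0], [2.0], [3.0], [4.0]])

def rSquared(predictions, actual):
    numerator = np.sum((predictions - actual) ** 2)
    denominator = np.sum((np.mean(actual) - actual) ** 2)
    return 1 - numerator / denominator

# Predictions identical to the actual values
perfect = rSquared(actual, actual)
# Predictions that always guess the mean of the actual values
meanOnly = rSquared(np.full_like(actual, np.mean(actual)), actual)
# Predictions in reversed order, worse than guessing the mean
worse = rSquared(actual[::-1], actual)

print(perfect, meanOnly, worse)  # 1.0 0.0 -3.0
```

So a value near 1 means the model explains most of the variation, 0 means it does no better than always predicting the mean, and a negative value means it does worse, which is worth keeping in mind when reading the result as a percentage.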

## Credits
* This lab used a modified version of Algor_Bruce's real estate price prediction dataset from Kaggle. You can find the original dataset [here](https://www.kaggle.com/datasets/quantbruce/real-estate-price-prediction).
* Formula to calculate $R^2$ accuracy from [Newcastle University, UK](https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/statistics/regression-and-correlation/coefficient-of-determination-r-squared.html).