# Basic Regression

For this, we'll do a simple linear regression as described [on this Wikipedia page](https://en.wikipedia.org/wiki/Simple_linear_regression), creating a line of best fit through a series of points.

## Basics - Getting warmed up


Let's run some basic Python. Using the terminal below, open a Python instance by typing `py`. If that fails, try `python.exe`.

Finish the following code by first experimenting in the Python console, then inserting the answer here.

In [None]:
int_val = 3 # create an integer value
list_of_class_nums = [1770, 1000, 2300] # create a list of the class numbers you are taking
list_about_me = ['Clint', 4, 1.7] # create a list with your name, birthday month, and your estimated height (meters)

# loop through each entry of list_about_me and print it
for val in list_about_me:
    print(val) # replace with print statement

A list wasn't the most descriptive container for the "about me" info. Let's use a dictionary instead.

In [None]:
dict_about_me = {"name":"Clint", "birth_month":6, "estimated_height":1.7} # use keys "name", "birth_month", "estimated_height"

# a dictionary has a useful way to loop through keys and values together
for key, value in dict_about_me.items():
    print("key=", key, "val=", value) # replace with print statement

## Creating your virtual environment

We'll install packages locally to the folder where you're doing these exercises. This avoids conflicts with PC privileges. To do this, let's create a virtual environment. The following commands assume you are using a Windows machine.

Run the following from a terminal:

```
py -m venv lab
```

A virtual environment should be created in your directory. Look for a folder named `lab` and see what's inside. Look in Windows Explorer, or type the following to see the contents:

```
tree lab
```

Inside a directory called `Scripts`, there is an `activate` file. In the terminal, run it by typing `lab\Scripts\activate`. Now your virtual environment is activated, as indicated by the text `(lab)` at the front of the terminal entry line.
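To double-check from inside Python that the environment is active, one option is to compare `sys.prefix` with `sys.base_prefix`; the two differ inside an environment created with `venv`. This is a minimal sketch, and the helper name `in_virtual_environment` is made up for illustration.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when Python is running inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment folder (e.g. lab),
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtual_environment())
```

If this prints `False` after activation, the `py` launcher may be starting the base interpreter; try `python` instead.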

## Manually Calculating the Linear Regression

To get familiar with some Python basics, we will manually perform a linear regression in Python. The following sections will take advantage of existing libraries to do the leg-work for you.

For this, we'll use `numpy`. Numpy is one of the most common Python scientific packages. It includes a multi-dimensional array and some high-performance functions and operations on those arrays. I'd highly recommend viewing [their documentation](https://numpy.org/doc/stable/) to get a more complete understanding. At the very least, go through [their absolute beginners tutorial](https://numpy.org/doc/stable/user/absolute_beginners.html) if you're starting from scratch.

You know how I said everything feels very easy to do at first with Python? Parts of `numpy` might be an exception. This package is powerful and computationally fast, but from experience teaching it to others, it will feel rigid and a bit confusing. This is probably because much of it is written in `C`.

### Reading the data from file

We start by importing it. Because it is such a common library, it is convention to shorten the name of `numpy` to `np` with the `as` keyword, as seen below. We'll load the data from file using `np.genfromtxt`.

In [None]:
import numpy as np # for performing arithmetic on homogeneous arrays

# create your first numpy array from list_of_class_nums
np_class_nums = np.array(list_of_class_nums) # insert fields here

# try something with it: let's sort the array from lowest number class to highest
np_class_nums.sort()
print(np_class_nums)

Before using the function, briefly review its documentation (either with the `help()` function or with `pydoc3`).

* We'll use the `delimiter=','` option to say that entries are comma-separated.
* We'll use the `skip_header=True` option to skip the first line, since we don't care about the names of the columns. (`skip_header` is really a count of lines to skip; `True` is treated as `1`.)
* We'll use the `dtype=np.int64` option to state that the incoming values are integers.
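Here is a small sketch of how these three options behave, using an in-memory CSV instead of the real file. The sample values are made up for illustration and are not taken from `linear_regression_data.csv`.

```python
import io

import numpy as np

# Made-up stand-in for a CSV file with a one-line header (assumption).
csv_text = "x,y\n1,2\n2,4\n3,5\n4,8\n"

sample = np.genfromtxt(
    io.StringIO(csv_text),  # genfromtxt accepts file-like objects as well as file names
    delimiter=',',          # entries are comma-separated
    skip_header=1,          # skip one line: the column names
    dtype=np.int64,         # parse every entry as an integer
)

print(sample.shape)  # one row per data line, one column per field
print(sample)
```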

In [None]:
xy = np.genfromtxt('linear_regression_data.csv', delimiter=',', skip_header=1, dtype=np.int64) # skip_header counts the lines to skip
print(f"  x y\n{xy}")

### Finding the mean and deviations from mean

We can now take this array and use it going forward to find the mean on each column, using the same `numpy` library. Afterwards, we'll find the deviation of each x and y value from the mean.

We have to take the mean of each column of our 2D `xy` array. To do this, we use the `axis` argument. This is often confusing to interpret, even for intermediate users: `axis` names the dimension we wish to aggregate, or collapse. `axis=0` collapses the 0th dimension (the rows), leaving one value per column. `axis=1` collapses the 1st dimension (the columns), leaving one value per row. In other words, for a `mean`, specifying `axis=0` takes all row entries in each column and finds their mean. This is what we want.

If you ever feel confused, don't be afraid to experiment. Take a small data sample, and run the mean for `axis=0`, then `axis=1` and see which looks right.
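One way to run that experiment is sketched below on a tiny made-up array (the values are illustrative, not the lab data).

```python
import numpy as np

# Three rows, two columns: think of column 0 as x and column 1 as y.
small = np.array([[1, 10],
                  [2, 20],
                  [3, 30]])

col_means = small.mean(axis=0)  # collapse the rows: one mean per column
row_means = small.mean(axis=1)  # collapse the columns: one mean per row

print(col_means)  # two values, one per column
print(row_means)  # three values, one per row
```

The result with two entries (one per column) is the one that matches "mean of x and mean of y".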

In [None]:
xy_mean = xy.mean(axis=0) # calculate the mean using either np.mean or xy.mean, with axis=0

print(f"Mean of x {xy_mean[0]} and mean of y {xy_mean[1]}")

Basic arithmetic on numpy arrays is performed element-by-element, provided the arrays have the same shape. When the shapes differ, `numpy` will try to [broadcast](https://numpy.org/doc/stable/user/basics.broadcasting.html) the dimensions to perform element-wise arithmetic. Our `xy_mean` object has two values. Numpy will say, "well `xy` has 2 columns... let's broadcast the rows." Tell me to write the example on the board of how this will work.
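The broadcasting described above can be sketched on a tiny made-up array. The explicit `np.tile` version is only there to show what `numpy` is conceptually doing; it is not needed in practice.

```python
import numpy as np

small = np.array([[1, 10],
                  [2, 20],
                  [3, 30]])            # shape (3, 2)
small_mean = small.mean(axis=0)        # shape (2,): one mean per column

# Broadcasting: the (2,) array is reused for each of the 3 rows.
small_dev = small - small_mean

# Conceptually the same as repeating the means once per row, then subtracting.
tiled_mean = np.tile(small_mean, (small.shape[0], 1))  # shape (3, 2)
explicit_dev = small - tiled_mean

print(small_dev)
print(np.array_equal(small_dev, explicit_dev))
```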

In [None]:
xy_dev = xy - xy_mean # subtract the column means from xy

# check your work!
print(f"Deviations of x and y from their means:\n  xdev ydev\n{xy_dev}")

### Slicing

We need to take another `numpy` aside before proceeding. We need to talk about [indexing and slicing](https://numpy.org/doc/stable/user/basics.indexing.html).

For n-dimensional arrays (in our case 2-dimensional `xy`), we can specify the index of the row and column we want to select with a comma `,` within the square brackets. For example, for a 2D array `arr_2d[50,16]` means "get row index 50 and column index 16". Let's try it on our `xy` array.

In [None]:
xy[2,1] # select the third element of the y column

# Hint: remember indexing starts at 0, so x is the 0th column

We can select a range of elements with the colon `:` operator in the index. The numbers on either side of the colon are the bounds of the range you want. For example, `3:5` means, "select index 3 up to but not including 5." If you know interval notation in math, this would be the range $[3,5)$. A colon by itself means select all. I'll give you an example, then we'll practice with `xy_dev`.

In [None]:
arr_1d = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
print(arr_1d[4:7]) # I want elements in range [4,7)
print(arr_1d[:]) # I want everything. In this case, it is the same as leaving off the index

xy_dev[:,0] # select every row in the x column
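Ranges and single indices can also be mixed across the two dimensions. The sketch below uses a small made-up array to show a few combinations.

```python
import numpy as np

small = np.array([[1, 10],
                  [2, 20],
                  [3, 30],
                  [4, 40]])

first_two_y = small[0:2, 1]  # rows 0 and 1 of column 1
last_rows = small[2:, :]     # rows 2 to the end, every column
x_column = small[:, 0]       # every row of column 0

print(first_two_y)
print(last_rows)
print(x_column)
```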

### Slope

We can now calculate the slope using the formula:

$$m = \frac{\sum ((x_i - \bar{x}) * (y_i - \bar{y}))}{\sum ((x_i - \bar{x})^2)}$$

or:

$$m = \frac{\sum ( \textrm{x\_dev} * \textrm{y\_dev})}{\sum ( \textrm{x\_dev}^2)}$$

We can use `numpy.sum` to calculate each sum, and the division operator `/` to divide the two sums. Remember that the exponent operator in Python is `**`.

In [None]:
regression_slope = np.sum(xy_dev[:,0] * xy_dev[:,1]) / np.sum(xy_dev[:,0]**2) # sum(x_dev * y_dev) / sum(x_dev^2)
print(f"slope = {regression_slope}")

Now that we have the slope and mean, the intercept can be found using the formula:

$$c = \bar{y} - m * \bar{x}$$

In [None]:
regression_intercept = xy_mean[1] - regression_slope * xy_mean[0] # take y_mean - slope * x_mean
print(f"intercept = {regression_intercept}")

print(f"In the form y = a + bx, we have:\n  y = {regression_intercept:.2f} + {regression_slope:.2f}x")
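As a sanity check on the slope and intercept formulas, here is a minimal sketch on made-up data, compared against `np.polyfit` (a degree-1 polynomial fit, swapped in only as an independent reference). The names `sample_x` and `sample_y` are illustrative.

```python
import numpy as np

# Made-up sample points (assumption), not the lab data.
sample_x = np.array([1.0, 2.0, 3.0, 4.0])
sample_y = np.array([2.0, 4.0, 5.0, 8.0])

x_dev = sample_x - sample_x.mean()
y_dev = sample_y - sample_y.mean()

# m = sum(x_dev * y_dev) / sum(x_dev^2)
sample_slope = np.sum(x_dev * y_dev) / np.sum(x_dev**2)
# c = y_mean - m * x_mean
sample_intercept = sample_y.mean() - sample_slope * sample_x.mean()

# np.polyfit with degree 1 returns [slope, intercept].
ref_slope, ref_intercept = np.polyfit(sample_x, sample_y, 1)

print(f"manual:  slope={sample_slope:.4f}, intercept={sample_intercept:.4f}")
print(f"polyfit: slope={ref_slope:.4f}, intercept={ref_intercept:.4f}")
```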

### Plotting

To plot the data and the fitted line, we'll use `matplotlib`: see [documentation here](https://matplotlib.org/stable/api/index). Let's install it into our environment with `pip` (run `pip install matplotlib` in the terminal with the virtual environment activated). Afterwards we'll import its `pyplot` module, giving it the shortcut `plt`.

In [None]:
import matplotlib.pyplot as plt

plt.plot(xy[:,0], xy[:,1], 'go', label="Example data", markersize=10)

y_hat = regression_slope * xy[:,0] + regression_intercept
plt.plot(xy[:,0], y_hat, label="Fitted line")
plt.legend() # add a legend to the graph.
plt.show()

# Change the marker size, shape and color. Let's use the website or help menu to find out how to change color

### Finding Sum of Squared values and R-squared

We will calculate the following three sum-of-squares values:

1. Sum of Squares Total -- SST
2. Sum of Squares Regression -- SSR
3. Sum of Squares Estimate of Errors -- SSE

The equations are as follows:

$$
\begin{aligned}
SST &= \sum (y_i - \bar{y})^2 \\
SSR &= \sum (\hat{y}_i - \bar{y})^2 \\
SSE &= \sum (y_i - \hat{y}_i)^2
\end{aligned}
$$

Where $y_i$ is each observed y entry, $\hat{y}_i$ is the predicted y entry and $\bar{y}$ is the mean.

Note the following relationship:
$$SST = SSR + SSE$$

We can calculate all three with our previously calculated values in `xy_dev` and `xy_mean`, and `np.sum`. Just as we did in the plot above, we can calculate all the predicted values with our initial `x` and store it in a variable called `y_hat`.

In [None]:
sst = np.sum(xy_dev[:,1]**2) # we have the y_dev, let's square it
print(f"Sum of Squares Total: {sst:.4f}")

ssr = np.sum((y_hat - xy_mean[1])**2) # use y_hat and mean within the sum
print(f"Sum of Squares Regression: {ssr:.4f}")

sse = sst - ssr # subtract two sum of square values above
print(f"Sum of Squares Estimate of Errors: {sse:.4f}")

Finally, we can calculate R-squared with the following formula:
$$R^2 = \frac{SSR}{SST}$$

In [None]:
r_squared = ssr / sst # simple division
print(f"r-squared = {r_squared}")
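The relationship $SST = SSR + SSE$ and the R-squared formula can be checked numerically. The sketch below uses made-up data and computes SSE directly from the residuals rather than by subtraction, so the identity is actually tested. All `sample_*` names are illustrative.

```python
import numpy as np

# Made-up sample points (assumption), not the lab data.
sample_x = np.array([1.0, 2.0, 3.0, 4.0])
sample_y = np.array([2.0, 4.0, 5.0, 8.0])

# Fit the line with the same slope and intercept formulas used earlier.
x_dev = sample_x - sample_x.mean()
y_dev = sample_y - sample_y.mean()
sample_slope = np.sum(x_dev * y_dev) / np.sum(x_dev**2)
sample_intercept = sample_y.mean() - sample_slope * sample_x.mean()
sample_y_hat = sample_slope * sample_x + sample_intercept

sample_sst = np.sum((sample_y - sample_y.mean())**2)      # total
sample_ssr = np.sum((sample_y_hat - sample_y.mean())**2)  # explained by the line
sample_sse = np.sum((sample_y - sample_y_hat)**2)         # left over (residuals)

print(f"SST = {sample_sst:.4f}, SSR + SSE = {sample_ssr + sample_sse:.4f}")
print(f"R^2 = SSR / SST = {sample_ssr / sample_sst:.4f}")
print(f"R^2 = 1 - SSE / SST = {1 - sample_sse / sample_sst:.4f}")
```

Both ways of writing R-squared agree here because the line was fitted by least squares with an intercept.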

## Using numpy Least Squares function

If we wanted to solve a least-squares regression more efficiently using only `numpy`, we can use linear algebra and a tool already available in `numpy`. The library has a linear algebra submodule, which includes a least-squares solver: `numpy.linalg.lstsq`. See [the help page](https://numpy.org/doc/stable/reference/generated/numpy.linalg.lstsq.html) for their own example.

The result of `np.linalg.lstsq` provides a few useful values for the data given: the solution itself (slope and intercept), the sum of the squared residuals, the matrix rank, and the singular values of the input matrix. Really, we're interested in the first thing it returns, the solution (i.e. slope, intercept).
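The four returned values can be unpacked by name. The following is a minimal sketch on made-up data; `np.column_stack` is used here as one way to build the design matrix, and the `sample_*` names are illustrative.

```python
import numpy as np

# Made-up sample points (assumption), not the lab data.
sample_x = np.array([1.0, 2.0, 3.0, 4.0])
sample_y = np.array([2.0, 4.0, 5.0, 8.0])

# Design matrix: one column of x values and one column of ones (for the intercept).
design = np.column_stack([sample_x, np.ones_like(sample_x)])

solution, residuals, rank, singular_values = np.linalg.lstsq(design, sample_y, rcond=None)
sample_slope, sample_intercept = solution

print(f"slope = {sample_slope:.4f}, intercept = {sample_intercept:.4f}")
print(f"sum of squared residuals = {residuals[0]:.4f}")  # a length-1 array for 1-D y
print(f"rank = {rank}, singular values = {singular_values}")
```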

In [None]:
A = np.array([xy[:,0], np.ones(xy.shape[0])]).T # T is shortcut for "transpose()"
lstsq_out = np.linalg.lstsq(A, xy[:,1], rcond=None) # Finding the least-squares solution to Ax = y

slope, intercept = lstsq_out[0]
print(f"slope = {slope}, intercept = {intercept}")
print(f"sum of squared residuals (SSE): {lstsq_out[1]}")
print(f"matrix rank = {lstsq_out[2]}")
print(f"singular values of input (A): {lstsq_out[3]}")

Again, we can plot our results, as we saw before.

In [None]:
# use the matplotlib pyplot. Plot slope and intercept
plt.plot(xy[:,0],xy[:,1], 'o', label="example data", markersize=5) # fill in formatting info
y_hat = slope * xy[:,0] + intercept # calculate slope * x + intercept with new info above
plt.plot(xy[:,0], y_hat, label='Fit line') # add formatting

plt.legend() # add a legend to the graph.
plt.show()

## Using statsmodels to tell us everything in the universe

The [statsmodels](https://www.statsmodels.org/stable/) package is a Swiss Army knife for creating and observing statistical models. In a few lines of code, you can be handed a table of statistical summaries. It can be seen as a merger between the worlds of R and Python, and functions primarily with `pandas.DataFrame` objects, similar to R's data frames.

Below we'll run an Ordinary Least Squares regression model on the data and see a table summarizing multiple statistics.

In [None]:
from statsmodels.formula.api import ols # Ordinary Least Squares regression
from statsmodels.stats.api import anova_lm # For ANOVA least squares
from pandas import DataFrame

xy_df = DataFrame(data=xy, columns=["height", "weight"]) # statsmodels lives in the world of data frames
ols_fit_results = ols(formula="weight ~ height", data=xy_df).fit() # regress y (weight) on x (height), matching the fits above

print(ols_fit_results.summary())
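Individual numbers can also be pulled out of the fit instead of reading them off the summary table. The sketch below uses made-up data and assumes the response is `weight` and the predictor is `height`; the `sample_*` names are illustrative.

```python
from pandas import DataFrame
from statsmodels.formula.api import ols
from statsmodels.stats.api import anova_lm

# Made-up sample data (assumption), not the lab data.
sample_df = DataFrame({"height": [1, 2, 3, 4], "weight": [2, 4, 5, 8]})

# The formula reads "response ~ predictor"; an intercept is included by default.
sample_fit = ols(formula="weight ~ height", data=sample_df).fit()

sample_intercept = sample_fit.params["Intercept"]
sample_slope = sample_fit.params["height"]
sample_r_squared = sample_fit.rsquared

print(f"slope = {sample_slope:.4f}, intercept = {sample_intercept:.4f}")
print(f"r-squared = {sample_r_squared:.4f}")

# ANOVA table for the fitted model, including its sums of squares.
print(anova_lm(sample_fit))
```

These values should agree with the manual calculation and with `np.linalg.lstsq` when run on the same data.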