### Exercises

#### Question 1

The accompanying file `data.csv` contains information for the value `x` of something observed at time `t`.

Given this data, we want to calculate the rate of change of this value over time. We'll do this by taking two consecutive observations, say $x(t_i)$ and $x(t_{i+1})$, and approximating the rate of change with this formula:

$$
v(t_{i+1}) = \frac{x(t_{i+1}) - x(t_i)}{t_{i+1} - t_i}
$$

For example, if the data looks like this:

```
t     x
0.1   10
0.2   12
0.4   14
0.5   15
```

Then the first row of data would be considered $t_0$, the second row $t_1$, and so on.

We can then approximate the rate of change starting at $v_1 = v(t_1)$, which would be calculated as:

$$
v_1 = \frac{12 - 10}{0.2 - 0.1} = 20.0
$$

Similarly, $v_2$ would be calculated as:

$$
v_2 = \frac{14 - 12}{0.4 - 0.2} = 10.0
$$

Use NumPy arrays to create an array that holds the calculated rates of change and determine the minimum, maximum, average and standard deviation of the rate of change.
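As a minimal sketch of the idea, the worked example above can be reproduced on the four-row sample table with `np.diff`, which returns the differences between consecutive elements. The arrays `t` and `x` below are typed in by hand from the sample table and only stand in for the real contents of `data.csv`.

```python
import numpy as np

# Sample table from the text above, not the real data.csv
t = np.array([0.1, 0.2, 0.4, 0.5])
x = np.array([10.0, 12.0, 14.0, 15.0])

# np.diff(a)[i] is a[i + 1] - a[i], so this yields v_1, v_2 and v_3
rates = np.diff(x) / np.diff(t)

print(rates)           # approximately [20. 10. 10.]
print(np.min(rates))   # smallest rate of change
print(np.max(rates))   # largest rate of change
print(np.mean(rates))  # average rate of change
print(np.std(rates))   # population standard deviation (ddof=0 by default)
```

The result has one element fewer than the input, since each rate needs a pair of consecutive rows.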

In [22]:
import numpy as np
import csv

# CSV file
csv_file = './data.csv'

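def rate_change(table):
    """Approximate the rate of change between consecutive CSV rows.

    Args:
        table (str): path to a CSV file with a header row and columns t, x

    Returns:
        numpy.ndarray: one rate per pair of consecutive rows
    """
    # Read CSV data
    with open(table, 'r') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        row_data = list(reader)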

    # Unpack into numpy
    data = [[float(t), float(x)] for t, x in row_data]
    np_data = np.array(data)

    # Delta T and X calcs
    delta_t = np_data[1:, 0] - np_data[:-1, 0]
    delta_x = np_data[1:, 1] - np_data[:-1, 1]

    # Rate of change is delta x over delta t, matching the formula above
    rates = delta_x / delta_t
    return rates

values = rate_change(csv_file)

print(f'Rate change: {values}\n')
print(f'Lowest value: {np.amin(values)}')
print(f'Highest value: {np.amax(values)}')
print(f'Average value: {np.mean(values)}')
print(f'Standard deviation: {np.std(values)}')


#### Question 2

In linear regression we try to find the coefficients `m` (slope) and `c` (y-intercept) of a straight line

$$
y = mx + c
$$

that provides the "best" fit given some `x` and `y` data. This formula then allows us to predict `y` values for given `x` values.

Given an array of `n` `(x, y)` data pairs, these coefficients can be calculated very simply.

A bit of terminology first:

- Let `X` mean the column of `x` values.
- Let `Y` mean the column of `y` values.
- Let `XX` mean a column calculated by multiplying each `x` in the `X` column by itself.
- Let `XY` mean a column calculated by multiplying (pairwise) the `x` and `y` values from the `X` and `Y` columns.

Then, given some column (say `X`), this symbol: $\sum{X}$ means the sum of all the elements in the column.

Similarly, the symbol $\sum{XY}$ means the sum of the values obtained by multiplying (pairwise) the values in `X` and `Y`.

Given those definitions, the formulas for calculating the "best" values of `m` and `c` are given by:

$$
m = \frac{n\sum{XY} - \sum{X}\sum{Y}}{n\sum{XX} - (\sum{X})^2}
$$

$$
c = \frac{\sum{Y}\sum{XX} - \sum{X}\sum{XY}}{n\sum{XX} - (\sum{X})^2}
$$

(where `n` is the number of `(x,y)` pairs in our data set.)
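To make the formulas concrete, here is a small sketch that applies them to the four-row sample table from Question 1 (not the real `data.csv`) and cross-checks the result against `np.polyfit`. The helper name `fit_line` is made up for this illustration.

```python
import numpy as np

def fit_line(X, Y):
    """Return (m, c) for y = m*x + c using the summation formulas."""
    n = len(X)
    sum_x, sum_y = np.sum(X), np.sum(Y)
    sum_xy, sum_xx = np.sum(X * Y), np.sum(X * X)
    denominator = n * sum_xx - sum_x ** 2  # shared by both formulas
    m = (n * sum_xy - sum_x * sum_y) / denominator
    c = (sum_y * sum_xx - sum_x * sum_xy) / denominator
    return m, c

# Sample table from Question 1: t plays the role of X, x the role of Y
X = np.array([0.1, 0.2, 0.4, 0.5])
Y = np.array([10.0, 12.0, 14.0, 15.0])

m, c = fit_line(X, Y)
print(m, c)  # approximately 12.0 and 9.15

# np.polyfit with degree 1 returns [slope, intercept] of a least-squares line
m_check, c_check = np.polyfit(X, Y, 1)
print(m_check, c_check)
```

Worked by hand for the same table: $n = 4$, $\sum{X} = 1.2$, $\sum{Y} = 51$, $\sum{XY} = 16.5$ and $\sum{XX} = 0.46$, so $m = (66 - 61.2) / (1.84 - 1.44) = 12$ and $c = (23.46 - 19.8) / 0.4 = 9.15$.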

Using the same data we saw in Question 1, calculate the values for `m` and `c` for that data set given the formulas above.

You can think of the `t` column in the data as the `X` column, and the `x` values in the data as the `Y` column - we are trying to predict the value of `x` given a value of `t`.

This will result in a straight line that "best" fits through the data.

Compare the slope of this regression line to the average rate of change you calculated in Question 1.
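The two numbers need not agree. As a rough illustration on the four-row sample table from Question 1 (again, not the real data), the mean of the consecutive rates and the least-squares slope come out different: the mean weights every interval equally regardless of its length, while the regression fits all points at once.

```python
import numpy as np

# Sample table from Question 1, typed in by hand
t = np.array([0.1, 0.2, 0.4, 0.5])
x = np.array([10.0, 12.0, 14.0, 15.0])

# Average of the rates between consecutive observations
mean_rate = np.mean(np.diff(x) / np.diff(t))

# Slope of the least-squares line through all four points
slope, intercept = np.polyfit(t, x, 1)

print(mean_rate)  # approximately 13.33
print(slope)      # approximately 12.0
```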

In [16]:
import numpy as np
import csv

# CSV file
csv_file = './data.csv'

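def linear_regression(table):
    """Fit y = m*x + c to the (t, x) columns of a CSV file and return (m, c)."""
    # Read CSV data
    with open(table, 'r') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        row_data = list(reader)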

    # Unpack into numpy
    data = [[float(t), float(x)] for t, x in row_data]
    np_data = np.array(data)

    # Columns X and Y values
    X = np_data[:, 0]
    Y = np_data[:, 1]

    # Number of (x, y) pairs
    n = len(X)

    # Denominator shared by the formulas for m and c
    denominator = (n * np.sum(X * X)) - (np.sum(X) * np.sum(X))

    # Slope m
    m = ((n * np.sum(X * Y)) - (np.sum(X) * np.sum(Y))) / denominator

    # Intercept c
    c = ((np.sum(Y) * np.sum(X * X)) - (np.sum(X) * np.sum(X * Y))) / denominator

    return m, c

values = linear_regression(csv_file)
print(values)

(49.978008206387344, 10.081268844890284)
