_Linear Regression_ means taking a group of points on a graph and finding a line that approximates that group of points. A good Linear Regression algorithm minimizes the _error_, which here is the total distance from the points to the line. The line with the least error is the line that fits the data best. We call this a line of _best fit_.

In this project, we will use loops, lists, and arithmetic to find a line of best fit for a given set of data.

## Part 1: Calculating Error

The line we will end up with will have a formula that looks like:
```
y = m*x + b
```
`m` is the slope of the line and `b` is the intercept, where the line crosses the y-axis.

Here we will write a function called `get_y()` that takes in `m`, `b`, and `x`. It should return what the `y` value would be for that `x` on that line.

In [9]:
def get_y(m, b, x):
  y = m*x + b
  return y

print(get_y(1, 0, 7) == 7)
print(get_y(5, 10, 3) == 25)

True
True
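
To see what `m` and `b` do, the following minimal sketch tabulates a few points on a single line. The values `m = 2` and `b = 1` and the name `points_on_line` are arbitrary choices made for illustration and are not part of the project.

```python
def get_y(m, b, x):
    # y-value of the line y = m*x + b at the given x
    return m * x + b

# Illustrative values only: slope 2, intercept 1
m, b = 2, 1
points_on_line = [(x, get_y(m, b, x)) for x in range(4)]
print(points_on_line)  # [(0, 1), (1, 3), (2, 5), (3, 7)]
```

At `x = 0` the line sits at `b`, and each step of 1 in `x` raises `y` by `m`.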



We want to try a number of different `m` values and `b` values and see which line produces the least error. To calculate the error between a point and a line, we will write a function called `calculate_error()`, which will take in `m`, `b`, and an `[x, y]` point called `point` and return the vertical distance between the line and the point.

To find the distance, we follow these steps:
1. Get the x-value from the point and store it in a variable called `x_point`
2. Get the y-value from the point and store it in a variable called `y_point`
3. Use `get_y()` to get the y-value of the line at `x_point`
4. Find the difference between the y-value from `get_y()` and `y_point`
5. Return the absolute value of that difference (you can use the built-in function `abs()` to do this)

The distance represents the error between the line `y = m*x + b` and the `point` given.
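
Written as a formula, the steps above amount to the following, where $(x_p, y_p)$ is notation introduced here for the coordinates of `point`:

$$
\text{error} = \lvert (m \cdot x_p + b) - y_p \rvert
$$

This measures the vertical gap between the point and the line, not the perpendicular distance.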

In [10]:
def calculate_error(m,b,point):
    x_point = point[0]
    y_point = point[1]
    y1 = get_y(m, b, x_point)
    return abs(y1 - y_point)

To test the function, we will take sample values of m, b and a point:

In [11]:
#the following line is y = x, so (3, 3) should lie on it. thus, error should be 0:
print(calculate_error(1, 0, (3, 3)))
#the point (3, 4) should be 1 unit away from the line y = x:
print(calculate_error(1, 0, (3, 4)))
#the point (3, 3) should be 1 unit away from the line y = x - 1:
print(calculate_error(1, -1, (3, 3)))
#the point (3, 3) should be 5 units away from the line y = -x + 1:
print(calculate_error(-1, 1, (3, 3)))

0
1
1
5


We will be using a set of datapoints, for example:

In [12]:
datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]

Next, we will write a function `calculate_all_error()`, which will iterate through each point in `points` and calculate the error from that point to the line (using `calculate_error()`). It should keep a running total of the error, and then return that total after the loop.

In [13]:
def calculate_all_error(m, b, points):
    error_total = 0
    for point in points:
        error = calculate_error(m, b, point)
        error_total += error
    return error_total

To test the function, we will take sample values of m, b and a set of datapoints:

In [15]:
#every point in this dataset lies upon y=x, so the total error should be zero:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, 0, datapoints))

#every point in this dataset is 1 unit away from y = x + 1, so the total error should be 4:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, 1, datapoints))

#every point in this dataset is 1 unit away from y = x - 1, so the total error should be 4:
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(1, -1, datapoints))


#the points in this dataset are 1, 5, 9, and 3 units away from y = -x + 1, respectively, so total error should be
# 1 + 5 + 9 + 3 = 18
datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error(-1, 1, datapoints))

0
4
4
18
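
The running total could also be written with the built-in `sum()` and a generator expression. This is only a sketch of an alternative: the name `calculate_all_error_sum` is hypothetical, and `get_y()` and `calculate_error()` are repeated so the block runs on its own.

```python
def get_y(m, b, x):
    return m * x + b

def calculate_error(m, b, point):
    x_point, y_point = point
    return abs(get_y(m, b, x_point) - y_point)

# Hypothetical alternative name, to avoid clashing with calculate_all_error
def calculate_all_error_sum(m, b, points):
    # Add up the error of every point in a single expression
    return sum(calculate_error(m, b, point) for point in points)

datapoints = [(1, 1), (3, 3), (5, 5), (-1, -1)]
print(calculate_all_error_sum(1, 0, datapoints))   # 0
print(calculate_all_error_sum(-1, 1, datapoints))  # 18
```

With floating-point inputs, the last digits of the result may differ slightly from the explicit loop, because `sum()` may accumulate floats differently depending on the Python version.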


## Part 2: Try a number of slope and intercept values

We want to try a bunch of different slopes (`m` values) and a bunch of different intercepts (`b` values) and see which one produces the smallest error value for the dataset.

Using a list comprehension, let's create a list of possible `m` values to try. 

The list `possible_ms`  goes from -10 to 10 inclusive, in increments of 0.1.

In [16]:
possible_ms = [m * 0.1 for m in range(-100, 101)]

The list `possible_bs` goes from -20 to 20 inclusive, in steps of 0.1:

In [17]:
possible_bs = [b * 0.1 for b in range(-200, 201)]
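
Multiplying by `0.1` introduces small floating-point rounding errors, which is why some grid values print with long trailing digits. The sketch below shows the effect and one possible alternative that divides by 10 instead. The names `possible_ms_div` and `possible_bs_div` are hypothetical and are not used elsewhere in this project.

```python
# 3 * 0.1 picks up a tiny rounding error; 3 / 10 gives the same float as the literal 0.3
print(3 * 0.1)  # 0.30000000000000004
print(3 / 10)   # 0.3

# Hypothetical alternative grids built by division instead of multiplication
possible_ms_div = [m / 10 for m in range(-100, 101)]
possible_bs_div = [b / 10 for b in range(-200, 201)]

print(len(possible_ms_div), possible_ms_div[0], possible_ms_div[-1])  # 201 -10.0 10.0
print(len(possible_bs_div), possible_bs_div[0], possible_bs_div[-1])  # 401 -20.0 20.0
```

Either grid covers the same range; the division form simply produces values that print more cleanly.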

We are going to find the smallest error. First, we will make every possible `y = m*x + b` line by pairing all the possible values of `m` with all the possible values of `b`. Then, we will see which `y = m*x + b` line produces the smallest total error with the set of data stored in `datapoints`.

We will keep track of the following variables:
* `smallest_error` &mdash; this should start at infinity (`float("inf")`) so that any error we get at first will be smaller than our value of `smallest_error`
* `best_m` &mdash; we can start this at `0`
* `best_b` &mdash; we can start this at `0`

We will be carrying out the following steps:
* Iterate through each element `m` in `possible_ms`
* For every `m` value, take every `b` value in `possible_bs`
* If the value returned from `calculate_all_error` on this `m` value, this `b` value, and `datapoints` is less than our current `smallest_error`, set `best_m` and `best_b` to be these values, and set `smallest_error` to this error.

By the end of these nested loops, the `smallest_error` should hold the smallest error we have found, and `best_m` and `best_b` should be the values that produced that smallest error value.

In [18]:
datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]
smallest_error = float("inf")
best_m = 0
best_b = 0

for m in possible_ms:
    for b in possible_bs:
        error = calculate_all_error(m, b, datapoints)
        if error < smallest_error:
            best_m = m
            best_b = b
            smallest_error = error

print(best_m, best_b, smallest_error)

0.30000000000000004 1.7000000000000002 4.999999999999999
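
The same search can be sketched more compactly with `itertools.product()` and the built-in `min()`. This is an alternative formulation rather than the approach used in this project. The helper functions are repeated so the block runs on its own, and the lambda parameter `pair` is an illustrative name.

```python
from itertools import product

def get_y(m, b, x):
    return m * x + b

def calculate_error(m, b, point):
    x_point, y_point = point
    return abs(get_y(m, b, x_point) - y_point)

def calculate_all_error(m, b, points):
    error_total = 0
    for point in points:
        error_total += calculate_error(m, b, point)
    return error_total

datapoints = [(1, 2), (2, 0), (3, 4), (4, 4), (5, 3)]
possible_ms = [m * 0.1 for m in range(-100, 101)]
possible_bs = [b * 0.1 for b in range(-200, 201)]

# product() yields every (m, b) pair in the same order as the nested loops;
# min() keeps the first pair with the lowest total error
best_m, best_b = min(
    product(possible_ms, possible_bs),
    key=lambda pair: calculate_all_error(pair[0], pair[1], datapoints),
)
smallest_error = calculate_all_error(best_m, best_b, datapoints)
print(best_m, best_b, smallest_error)
```

Like the nested loops, this evaluates all 201 × 401 combinations, so it is no faster; it only removes the bookkeeping variables.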


## Part 3: What does the model predict?


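Among the lines we tried, the search reports an `m` of about 0.3 and a `b` of about 1.7 as the best fit (the trailing digits in the printed output are floating-point rounding):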

```
y = 0.3x + 1.7
```

This line produced a total error of 5.

Using this `m` and this `b`, what value of y does the line predict for the datapoint x = 6?
In other words, what is the output of `get_y()` when we call it with:
* m = 0.3
* b = 1.7
* x = 6

In [20]:
get_y(0.3,1.7,6)

3.5

The model predicts that the value of y will be 3.5 when x = 6.
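
The same call can be repeated for other inputs. The sketch below assumes the rounded values `0.3` and `1.7` from the search, and the x-values are chosen purely for illustration.

```python
def get_y(m, b, x):
    return m * x + b

best_m, best_b = 0.3, 1.7  # rounded values reported by the search

# x-values chosen for illustration only
for x in (6, 7, 8):
    # round() hides floating-point noise in the printed prediction
    print(x, round(get_y(best_m, best_b, x), 2))
```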