# Evaluating Regression Lines Lab

### Introduction

In the previous lesson, we learned to evaluate how well a regression line estimated our actual data.  In this lab, we will turn these formulas into code.  In doing so, we'll build lots of useful functions for both calculating and displaying our errors for a given regression line and dataset.

> In moving through this lab, we'll have access to the functions that we previously built to plot our data, available in the [graph](https://github.com/learn-co-curriculum/evaluating-regression-lines-lab/blob/master/graph.py) file.

### Determining Quality

In the file `movie_data.py`, you will find movie data written as a Python list of dictionaries, with each dictionary representing a movie.  The movies are derived from the first 30 entries of a dataset containing 538 movies, [provided here](https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv).

In [2]:
from movie_data import movies 
len(movies)

30

> Press shift + enter

In [3]:
movies[0]

{'budget': 13000000, 'domgross': 25682380.0, 'title': '21 & Over'}

In [4]:
movies[0]['budget']/1000000

13.0

The budget and revenue figures are in dollars and run into the millions, so we will simplify things by dividing each by a million and rounding to the nearest whole number.

In [5]:
scaled_movies = list(map(lambda movie: {'title': movie['title'], 'budget': round(movie['budget']/1000000, 0), 'domgross': round(movie['domgross']/1000000, 0)}, movies))
scaled_movies[0]

{'title': '21 & Over', 'budget': 13.0, 'domgross': 26.0}

Note that, as in previous lessons, the budget is our explanatory variable and the revenue is our dependent variable.  Here revenue is represented by the key `domgross`.

#### Plotting our data

Let's write the code to plot this data set.

As a first task, convert the budget values of our `scaled_movies` to `x_values`, and convert the domgross values of the `scaled_movies` to `y_values`.

In [29]:
x_values= list(map(lambda x: x['budget'], scaled_movies))
y_values= list(map(lambda y: y['domgross'], scaled_movies))

In [30]:
x_values and x_values[0] # 13.0

13.0

In [31]:
y_values and y_values[0] # 26.0

26.0

Assign a variable called `titles` equal to the titles of the movies.

In [32]:
titles = list(map(lambda title: title['title'], scaled_movies))

In [33]:
titles and titles[0]

'21 & Over'

Great! Now we have the data necessary to make a trace of our data.

In [34]:
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)
from graph import trace_values, plot

movies_trace = trace_values(x_values, y_values, text=titles, name='movie data')

plot([movies_trace])

#### Plotting a regression line

Now let's add a regression line to make a prediction of output (revenue) based on an input (the budget).  We'll use the following regression formula:

* $\hat{y} = m x + b$, with $m = 1.7$, and $b = 10$. 


* $\hat{y} = 1.7x + 10$

Write a function called `regression_formula` that calculates our $\hat{y}$ for any provided value of $x$. 

In [36]:
def regression_formula(x):
    return 1.7*x+10

Check to see that the regression formula generates the correct outputs.

In [39]:
regression_formula(100) # 180.0
regression_formula(250) # 435.0

435.0

Let's plot the data as well as the regression line to get a sense of what we are looking at.

In [40]:
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)
from graph import trace_values, m_b_trace, plot

if x_values and y_values:
    movies_trace = trace_values(x_values, y_values, text=titles, name='movie data')
    regression_trace = m_b_trace(1.7, 10, x_values, name='estimated revenue')
    plot([movies_trace, regression_trace])

### Calculating errors of a regression line

Now that we have our regression formula, we can move towards calculating the error. We provide a function called `y_actual` that, given a dataset of `x_values` and `y_values` and a value of `x`, returns the actual y value at that `x`.



In [41]:
def y_actual(x, x_values, y_values):
    combined_values = list(zip(x_values, y_values))
    point_at_x = list(filter(lambda point: point[0] == x,combined_values))[0]
    return point_at_x[1]

In [42]:
x_values and y_values and y_actual(13, x_values, y_values) # 26.0

26.0

Write a function called `error`, that given a list of `x_values`, and a list of `y_values`, the values `m` and `b` of a regression line, and a value of `x`, returns the error at that x value.  Remember ${\varepsilon_i} =  y_i - \hat{y}_i$.  

In [49]:
def error(x_values, y_values, m, b, x):
    return y_actual(x,x_values,y_values)-(m*x+b)
    

In [50]:
error(x_values, y_values, 1.7, 10, 13) # -6.099999999999994

-6.099999999999994
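The result above is not exactly `-6.1` because `1.7` cannot be represented exactly as a binary floating point number. The following is a minimal sketch, using the same slope, intercept, and data point as above, of how to compare such a result against the value we expect by hand. The variable name `error_at_13` is only illustrative and is not part of the lab.

```python
import math

m, b = 1.7, 10       # slope and intercept of the regression line used above
x, y = 13.0, 26.0    # budget and revenue (in millions) of the first movie

# error = actual value minus predicted value
error_at_13 = y - (m * x + b)

# math.isclose tolerates the tiny rounding difference from floating point arithmetic
print(math.isclose(error_at_13, -6.1))
```

When checking your own functions against the commented values in this lab, comparing with `math.isclose` tends to be more reliable than comparing with `==`.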

Now that we have a formula to calculate our errors, write a function called `error_line_trace` that returns a trace of an error at a given point.  So for a given movie budget, it will display the difference between the regression line and the actual movie revenue.

![](./error-line.png)

Ok, so the function `error_line_trace` takes our dataset of `x_values` as the first argument and `y_values` as the second argument.  It also takes in values of $m$ and $b$ as the next two arguments to represent the regression line we will calculate errors from. Finally, the last argument is the value $x$ it is drawing an error for.

The return value is a dictionary that represents a trace, and looks like the following:

```python
{'marker': {'color': 'red'},
 'mode': 'lines',
 'name': 'error at 120',
 'x': [120, 120],
 'y': [93.0, 214.0]}
```

The trace represents the error line above. The data in `x` and `y` represent the starting point and ending point of the error line. Note that the x value is the same for the starting and ending point, just as it is for each vertical line. It's just the y values that differ - representing the actual value and the expected value. The mode of the trace equals `'lines'`.

In [59]:
def error_line_trace(x_values, y_values, m, b, x):
    # vertical segment from the actual y value to the predicted y value at x
    y_values_at_x = [y_actual(x, x_values, y_values), m*x + b]
    return {'marker': {'color': 'red'}, 'mode': 'lines', 'name': 'error at ' + str(x), 'x': [x, x], 'y': y_values_at_x}

In [69]:
error_at_120m = error_line_trace(x_values, y_values, 1.7, 10, 120)

# {'marker': {'color': 'red'},
#  'mode': 'lines',
#  'name': 'error at 120',
#  'x': [120, 120],
#  'y': [93.0, 214.0]}
error_at_120m

{'marker': {'color': 'red'},
 'mode': 'lines',
 'name': 'error at 120',
 'x': [120, 120],
 'y': [93.0, 214.0]}

We just ran our function to draw a trace of the error for the movie Elysium.  Let's see how it looks.

In [70]:
scaled_movies[17]

{'title': 'Elysium', 'budget': 120.0, 'domgross': 93.0}

In [78]:
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)
from graph import trace_values, m_b_trace, plot
if x_values and y_values:
    movies_trace = trace_values(x_values, y_values, text=titles, name='movie data')
    regression_trace = m_b_trace(1.7, 10, x_values, name='estimated revenue')
    # include the error trace so the red error line for Elysium is drawn
    plot([movies_trace, regression_trace, error_at_120m])

From there, we can write a function called `error_line_traces` that takes in a list of `x_values`, a list of `y_values`, and the `m` and `b` values of a regression line, and returns a list of error traces, one for every x value provided.

In [None]:
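def error_line_traces(x_values, y_values, m, b):
    # one error trace per x value in the dataset
    return list(map(lambda x: error_line_trace(x_values, y_values, m, b, x), x_values))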

In [None]:
errors_for_regression = error_line_traces(x_values, y_values, 1.7, 10)

In [None]:
errors_for_regression and len(errors_for_regression) # 30

In [None]:
errors_for_regression and errors_for_regression[-1]

# {'x': [200.0, 200.0],
#  'y': [409.0, 350.0],
#  'mode': 'lines',
#  'marker': {'color': 'red'},
#  'name': 'error at 200.0'}

In [None]:
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

from graph import trace_values, m_b_trace, plot

if x_values and y_values:
    movies_trace = trace_values(x_values, y_values, text=titles, name='movie data')
    regression_trace = m_b_trace(1.7, 10, x_values, name='estimated revenue')
    plot([movies_trace, regression_trace, *errors_for_regression])

> Don't worry about some of the points that don't have associated error lines.  It is a complication with Plotly and not our functions.
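Separately from any Plotly rendering behavior, one thing worth checking in your own data is whether two movies share the same rounded budget. The `y_actual` function defined earlier returns the y value of the first point it finds at a given `x`, so a second point at the same `x` would never be looked up. The sketch below reproduces `y_actual` with made-up values (`x_sample` and `y_sample` are hypothetical, not taken from the movie data) to show the behavior; whether the movie dataset actually contains repeated budgets is not shown in this lab.

```python
def y_actual(x, x_values, y_values):
    # same logic as the lab's y_actual: take the first point whose x matches
    combined_values = list(zip(x_values, y_values))
    point_at_x = list(filter(lambda point: point[0] == x, combined_values))[0]
    return point_at_x[1]

# hypothetical data in which two points share x = 10.0
x_sample = [10.0, 10.0, 25.0]
y_sample = [30.0, 12.0, 40.0]

# looking up by x finds the first match both times, so 12.0 never appears
looked_up = [y_actual(x, x_sample, y_sample) for x in x_sample]
print(looked_up)  # [30.0, 30.0, 40.0]

# pairing the lists position by position keeps every point
paired = [y for _, y in zip(x_sample, y_sample)]
print(paired)     # [30.0, 12.0, 40.0]
```

If repeated x values do occur, iterating over `zip(x_values, y_values)` instead of looking up each y by its x is one way to make sure every point gets its own error line.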

### Calculating RSS

Now write a function called `squared_error`, that given a value of x, returns the squared error at that x value.

${\varepsilon_i}^2 =  (y_i - \hat{y}_i)^2$

In [None]:
def squared_error(x_values, y_values, m, b, x):
    # square the error so positive and negative errors do not cancel out
    return error(x_values, y_values, m, b, x)**2

In [None]:
x_values and y_values and squared_error(x_values, y_values, 1.7, 10, x_values[0]) # 37.20999999999993

Now write a function that will iterate through the x and y values to create a list of squared errors at each point, $(x_i, y_i)$, of the dataset.

In [None]:
def squared_errors(x_values, y_values, m, b):
    # one squared error per x value in the dataset
    return list(map(lambda x: squared_error(x_values, y_values, m, b, x), x_values))

In [None]:
x_values and y_values and squared_errors(x_values, y_values, 1.7, 10)

Next, write a function called `residual_sum_squares` that, provided a list of `x_values`, a list of `y_values`, and the `m` and `b` values of a regression line, returns the sum of the squared errors for the movies in our dataset.

In [None]:
def residual_sum_squares(x_values, y_values, m, b):
    # RSS is the total of the squared errors
    return sum(squared_errors(x_values, y_values, m, b))

In [None]:
residual_sum_squares(x_values, y_values, 1.7, 10) # 327612.2800000001

Finally, write a function called `root_mean_squared_error` that calculates the RMSE for the movies in the dataset, provided the same parameters as RSS.  The RMSE is the square root of the mean squared error, $RMSE = \sqrt{\frac{RSS}{n}}$, where $n$ is the number of data points.  Remember that `root_mean_squared_error` is a way for us to measure the approximate error per data point.

In [None]:
import math
def root_mean_squared_error(x_values, y_values, m, b):
    # divide RSS by the number of points before taking the square root
    return math.sqrt(residual_sum_squares(x_values, y_values, m, b)/len(x_values))

In [None]:
root_mean_squared_error(x_values, y_values, 1.7, 10) # approximately 104.5, the square root of 327612.28/30
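To see why RMSE can be read as an approximate error per data point, here is a self-contained sketch on four made-up points. The data (`toy_x`, `toy_y`) and the line $\hat{y} = 2x + 1$ are hypothetical and chosen so that every point is exactly 2 units from the line. The sketch uses the conventional definition, the square root of RSS divided by the number of points.

```python
import math

# hypothetical toy data: each point sits exactly 2 units above or below y = 2x + 1
toy_x = [1, 2, 3, 4]
toy_y = [5, 3, 9, 7]
m, b = 2, 1

# residual = actual value minus predicted value, paired point by point
residuals = [y - (m * x + b) for x, y in zip(toy_x, toy_y)]
rss = sum(r ** 2 for r in residuals)
rmse = math.sqrt(rss / len(toy_x))

print(residuals)  # [2, -2, 2, -2]
print(rss)        # 16
print(rmse)       # 2.0
```

Because every point misses the line by 2, the RMSE comes out to 2.0, in the same units as the y values.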

#### Some functions for your understanding

Now we'll provide a couple of functions for you.  Note that we can represent multiple regression lines by a list of (m, b) pairs:

In [None]:
regression_lines = [(1.7, 10), (1.9, 20)]

Then we can return a list of the regression lines along with the associated RMSE.

In [None]:
def root_mean_squared_errors(x_values, y_values, regression_lines):
    errors = []
    for regression_line in regression_lines:
        error = root_mean_squared_error(x_values, y_values, regression_line[0], regression_line[1])
        errors.append([regression_line[0], regression_line[1], round(error, 0)])
    return errors

Now let's generate the RMSE values for each of these lines.

In [None]:
x_values and y_values and root_mean_squared_errors(x_values, y_values, regression_lines)

Next, we provide a function called `trace_rmse` that builds a bar chart displaying the value of the RMSE.  The return value is a dictionary with keys of `x` and `y`, both of which point to lists.  The `x` key points to a list of strings, one per regression line, each containing that line's m and b values.  The `y` key points to a list of the RMSE values for each corresponding regression line.

In [None]:
import plotly.graph_objs as go

def trace_rmse(x_values, y_values, regression_lines):
    errors = root_mean_squared_errors(x_values, y_values, regression_lines)
    x_values_bar = list(map(lambda error: 'm: ' + str(error[0]) + ' b: ' + str(error[1]), errors))
    y_values_bar = list(map(lambda error: error[-1], errors))
    return dict(
        x=x_values_bar,
        y=y_values_bar,
        type='bar'
    )


x_values and y_values and trace_rmse(x_values, y_values, regression_lines)

Once this is built, we can create a subplot showing the two regression lines, as well as the related RMSE for each line.

In [None]:
import plotly
from plotly.offline import iplot
from plotly import tools
import plotly.graph_objs as go

def regression_and_rss(scatter_trace, regression_traces, rss_calc_trace):
    fig = tools.make_subplots(rows=1, cols=2)
    for reg_trace in regression_traces:
        fig.append_trace(reg_trace, 1, 1)
    fig.append_trace(scatter_trace, 1, 1)
    fig.append_trace(rss_calc_trace, 1, 2)
    iplot(fig)

In [None]:
### add more regression lines here, by adding new elements to the list
regression_lines = [(1.7, 10), (1, 50)]

if x_values and y_values:
    regression_traces = list(map(lambda line: m_b_trace(line[0], line[1], x_values, name='m:' + str(line[0]) + 'b: ' + str(line[1])), regression_lines))

    scatter_trace = trace_values(x_values, y_values, text=titles, name='movie data')
    rmse_calc_trace = trace_rmse(x_values, y_values, regression_lines)

    regression_and_rss(scatter_trace, regression_traces, rmse_calc_trace)

As we can see above, the second line (m: 1, b: 50) has the lower RMSE. We can thus conclude that the second line "fits" our set of movie data better than the first line. Ultimately, our goal will be to choose the regression line with the lowest RMSE or RSS. We will learn how to accomplish this goal in the following lessons and labs.
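As a rough illustration of choosing between candidate lines, the sketch below picks the line with the lowest RMSE from a short list. Everything in it is an assumption made for the example: the helper `rmse_for_line`, the toy data, and the candidate lines are hypothetical and not part of the lab. Note that it only compares the candidates it is given; it does not search for the best possible line.

```python
import math

def rmse_for_line(xs, ys, m, b):
    # conventional RMSE: square root of the mean squared error
    squared = [(y - (m * x + b)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(xs))

# hypothetical toy data lying exactly on y = 2x + 1
toy_x = [1, 2, 3, 4]
toy_y = [3, 5, 7, 9]

# candidate (m, b) pairs, in the same form as regression_lines above
candidate_lines = [(2, 1), (1, 4), (3, -2)]

# pick the candidate whose RMSE is smallest
best_line = min(candidate_lines, key=lambda line: rmse_for_line(toy_x, toy_y, line[0], line[1]))
print(best_line)  # (2, 1)
```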