In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from random import randrange
import random

# Task 3: Research and theory
### Task 3A: Research - State of the art solutions

### Task 3B: Theory - MSE versus MAE

$$ MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 $$

$y_i$ is the actual expected output and $\hat{y}_i$ is the model's prediction.

MSE measures the average squared error of our predictions. For each point, it calculates the squared difference between the prediction and the target, and then averages those values.

The higher this value, the worse the model is. It is never negative, since each individual prediction error is squared before summing, and it is zero for a perfect model.
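
As a minimal sketch of the formula above, the MSE can be computed in plain Python. The helper name `mean_squared_error` and the values in `y_true` and `y_pred` are made up for illustration; they are not data from this assignment.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# illustrative values only
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

# errors are 1, 0, -2, -1, so the squared errors are 1, 0, 4, 1
mse = mean_squared_error(y_true, y_pred)
print(mse)  # 1.5
```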

*Advantage*: Useful if we have unexpected values that we should care about, that is, very high or very low values that deserve attention.

*Disadvantage*: If we make a single very bad prediction, the squaring makes the error even larger, which may skew the metric towards overestimating the model’s badness. This is particularly problematic if we have noisy data (that is, data that for whatever reason is not entirely reliable): even a “perfect” model may have a high MSE in that situation, so it becomes hard to judge how well the model is performing. On the other hand, if all the errors are smaller than 1, the opposite effect occurs: squaring shrinks them, so we may underestimate the model’s badness.

*Note that* if we want a single constant prediction, the best one under MSE is the mean of the target values. It can be found by setting the derivative of the total error with respect to that constant to zero and solving the resulting equation.
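
This derivation can be sketched as follows, writing $c$ for the constant prediction (a symbol introduced here only for illustration):

$$ \frac{d}{dc}\,\frac{1}{N}\sum_{i=1}^{N}(y_i - c)^2 = -\frac{2}{N}\sum_{i=1}^{N}(y_i - c) = 0 \quad\Rightarrow\quad c = \frac{1}{N}\sum_{i=1}^{N} y_i $$

The second derivative is $2 > 0$, so this stationary point is a minimum.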


$$ MAE = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i| $$

MAE calculates the error as an average of absolute differences between the target values and the predictions. The MAE is a linear score which means that all the individual differences are weighted equally in the average. For example, the difference between 10 and 0 will be twice the difference between 5 and 0.

What is important about this metric is that it does not penalize huge errors as heavily as MSE does. Thus, it is less sensitive to outliers than mean squared error.
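
To make the difference in outlier sensitivity concrete, here is a small sketch comparing the two metrics on made-up numbers. The helper names `mse` and `mae` and all values are assumptions for illustration only.

```python
def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


# illustrative values only
y_true = [10.0, 12.0, 11.0, 13.0]
all_small = [11.0, 11.0, 12.0, 12.0]    # every error has size 1
one_outlier = [11.0, 11.0, 12.0, 33.0]  # the last error has size 20

print(mae(y_true, all_small), mse(y_true, all_small))      # 1.0 1.0
print(mae(y_true, one_outlier), mse(y_true, one_outlier))  # 5.75 100.75
```

With a single bad prediction the MAE grows by a factor of about 6, while the MSE grows by a factor of about 100.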


#### MAE – Mean Absolute Error
MAE is the most intuitive of them all. The name in itself is pretty good at telling us what’s going on.

- Mean: average
- Absolute: without direction, get rid of any negative signs

Simply put, it is the average absolute difference between the predicted and actual values across the whole test set.

In the background, the algorithm takes the absolute differences between all of the predicted and actual prices, adds them up and then divides the sum by the number of observations. It doesn’t matter if the prediction is higher or lower than the actual price; the algorithm only looks at the absolute value. A lower value indicates better accuracy.

In our case, the MAE tells us that on average our predictions are off by roughly \\$24,213. Is this good or bad? To compare, we can go back to the stats table printed earlier by Python and find the mean house price, which is roughly \\$493,091. A simple calculation tells us that the error is about 5% of the mean house price, which I think is pretty good. However, keep in mind that our training and test sets are pretty tiny, and things might change significantly when a larger dataset is used.
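
The "simple calculation" is just the ratio of the two numbers quoted above. A quick sketch, with variable names chosen here for illustration:

```python
mae_dollars = 24213   # MAE in dollars, as quoted above
mean_price = 493091   # mean house price in dollars, as quoted above

# MAE expressed as a fraction of the mean house price
relative_error = mae_dollars / mean_price
print(f"{relative_error:.1%}")  # 4.9%
```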

As a general guide, I think we can use MAE when we aren’t too worried about the outliers.

#### MSE – Mean Squared Error
I personally don’t focus too much on MSE, as I see it as a stepping stone for calculating RMSE. However, let’s see what it’s about.

- Mean: average
- Squared: square the errors, so a difference of 2 becomes 4 and a difference of 3 becomes 9

As you can see, as a result of the squaring, it assigns more weight to the bigger errors. The algorithm then adds the squared errors up and averages them. If you are worried about the outliers, this is the number to look at. Keep in mind that it is not in the same unit as our dependent variable: the errors are squared, so the unit is squared dollars. In our case, the value was roughly 823,755,495; this is NOT the dollar value of the error like MAE. As before, the lower the number, the better.
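
Since MSE is described here as a stepping stone to RMSE, the following sketch shows that step. It assumes the MSE quoted above is read as 823,755,495; taking the square root brings the value back to dollars.

```python
import math

mse_value = 823755495  # MSE quoted above, read as 823,755,495 (assumption)

# RMSE is the square root of the MSE and is in the same unit as the target
rmse = math.sqrt(mse_value)
print(round(rmse))  # 28701
```

This is somewhat larger than the MAE of roughly 24,213, which is expected because squaring gives more weight to the bigger errors.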

### Task 3C: Theory - analyze a less obvious dataset

In [None]:
def read_file(filename):
    """ Reads a file and returns its lines as a list of strings. """
    
    # the with-block closes the file even if reading raises an error
    with open(filename, "r") as f:
        return f.readlines()


def create_dataframe(lines):
    """ Create a dataframe from the lines of a semicolon-separated file. """
    
    # get column names from first line (expected to include 'label' and 'text')
    cols = [name.strip() for name in lines[0].split(';')]
    
    # collect one record per line; split on the first ';' only,
    # because the message text itself may contain semicolons
    records = []
    for line in lines[1:]:
        parts = line.split(';', 1)
        records.append({'label': parts[0], 'text': parts[1].strip()})
    
    # build the dataframe in one go instead of assigning through df.loc[i],
    # which is slow and may write to a temporary copy
    return pd.DataFrame(records, columns=cols)

In [None]:
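import html

lines = read_file("data/SmsCollection.csv")

df = create_dataframe(lines)

# unescape html entities (e.g. &amp; -> &) in every cell;
# html.unescape expects a string, so map it over the values of each column
df = df.apply(lambda col: col.map(html.unescape))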

with pd.option_context('display.min_rows', 50, 'display.max_colwidth', 10000):
    display(df)

