In [5]:
#The Model
'''
    You've collected additional data: you know how many hours each of your users works each day, and whether they have a PhD. You'd like to use this additional data to improve your model.
    Accordingly, you hypothesize a linear model with more independent variables:

        minutes = α + β1friends + β2work hours + β3phd + ε
    
    Obviously, whether a user has a PhD is not a number, but, as we mentioned in Chapter 11, we can introduce a dummy variable that equals 1 for users with PhDs and 0 for users without, after which it's just as numeric as the other variables.

    Recall that in Chapter 14 we fit a model of the form:
        
        yi = α + βxi + εi
    
    Now imagine that each input xi is not a single number but rather a vector of k numbers, xi1, ..., xik. The multiple regression model assumes that:

        yi = α + β1xi1 + ... + βkxik + εi
    
    In multiple regression the vector of parameters is usually called β. We'll want this to include the constant term as well, which we can achieve by adding a column of 1s to our data:

        beta = [alpha, beta_1, ..., beta_k]

    and:

        x_i = [1, x_i1, ..., x_ik]

    Then our model is just:
'''

from scratch.linear_algebra import dot, Vector

def predict(x: Vector, beta: Vector) -> float:
    """assumes that the first element of x is 1"""
    return dot(x, beta)

#In this particular case, our independent variable x will be a list of vectors, each of which looks like this:

[1,     #constant term
 49,    #number of friends
 4,     #work hours per day
 0]     #doesn't have PhD

[1, 49, 4, 0]
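
As a quick worked example of the prediction step, here is a minimal sketch that applies predict to the sample user above. The local dot is a stand-in for scratch.linear_algebra.dot, and example_beta holds made-up round coefficients chosen only so the arithmetic is easy to follow; they are not fitted values.

```python
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    """Sum of componentwise products (stand-in for scratch.linear_algebra.dot)."""
    assert len(v) == len(w), "vectors must be the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def predict(x: Vector, beta: Vector) -> float:
    """Assumes that the first element of x is 1."""
    return dot(x, beta)

# Hypothetical coefficients: [constant, friends, work hours, phd]
example_beta = [30.0, 1.0, -2.0, 1.0]
# The sample user from above: 49 friends, 4 work hours per day, no PhD
example_user = [1, 49, 4, 0]

# 30*1 + 1*49 + (-2)*4 + 1*0 = 71
predicted_minutes = predict(example_user, example_beta)
print(predicted_minutes)
```

Because the constant term is stored in beta[0] and every x starts with 1, the whole model collapses to a single dot product.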

In [6]:
#Further Assumptions of the Least Squares Model
'''
    There are a couple of further assumptions that are required for this model (and our solution) to make sense.
    The first is that the columns of x are linearly independent - that there's no way to write any one as a weighted sum of some of the others. If this assumption fails, it's impossible to estimate beta. To see this in an extreme case, imagine we had an extra field num_acquaintances in our data that for every user was exactly equal to num_friends.
    Then, starting with any beta, if we add any amount to the num_friends coefficient and subtract that same amount from the num_acquaintances coefficient, the model's predictions will remain unchanged. This means that there's no way to find the coefficient for num_friends. (Usually violations of this assumption won't be so obvious.)
    The second important assumption is that the columns of x are all uncorrelated with the errors ε. If this fails to be the case, our estimates of beta will be systematically wrong.
    For instance, in Chapter 14, we built a model that predicted that each additional friend was associated with an extra 0.90 daily minutes on the site.
    Imagine it's also the case that:
        
        -People who work more hours spend less time on the site
        -People with more friends tend to work more hours

    That is, imagine that the "actual" model is:

        minutes = α + β1friends + β2work hours + ε

    Where β2 is negative, and that work hours and friends are positively correlated. In that case, when we minimize the errors of the single-variable model:
    
        minutes = α + β1friends + ε

    we will underestimate β1.

    Think about what would happen if we made predictions using the single-variable model with the "actual" value of β1. (That is, the value that arises from minimizing the errors of what we called the "actual" model.) The predictions would tend to be way too large for users who work many hours and a little too large for users who work few hours, because β2 < 0 and we "forgot" to include it. Because work hours is positively correlated with number of friends, this means the predictions tend to be way too large for users with many friends, and only slightly too large for users with few friends.
    The result of this is that we can reduce the errors (in the single-variable model) by decreasing our estimate of β1, which means that the error-minimizing β1 is smaller than the "actual" value. That is, in this case the single-variable least squares solution is biased toward underestimating β1. And, in general, whenever the independent variables are correlated with the errors like this, our least squares solution will give us a biased estimate of β1.
'''
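
The linear independence assumption can be illustrated with a small sketch. The three users and the coefficients below are invented for the illustration; the only thing that matters is that the num_acquaintances column is an exact copy of the num_friends column.

```python
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    """Sum of componentwise products (stand-in for scratch.linear_algebra.dot)."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

# Hypothetical rows: [constant, num_friends, num_acquaintances]
# The last two columns are identical, so they are linearly dependent.
users = [[1, 10, 10], [1, 25, 25], [1, 49, 49]]

beta = [30.0, 1.0, 0.5]

# Move an arbitrary amount from one coefficient to the other
shift = 7.0
shifted_beta = [beta[0], beta[1] + shift, beta[2] - shift]

original_predictions = [dot(x, beta) for x in users]
shifted_predictions = [dot(x, shifted_beta) for x in users]

# Both betas give the same predictions, so the data cannot tell them apart
print(original_predictions)
print(shifted_predictions)
```

Since every choice of shift fits the data equally well, the sum of squared errors has no unique minimizer and the individual coefficients are not identifiable.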

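
The omitted-variable argument above can be checked on a tiny, noise-free dataset. This is a sketch under made-up assumptions: the friends and work_hours values are invented (chosen to be positively correlated), and the "actual" coefficients alpha, beta_1, beta_2 are arbitrary except that beta_2 is negative. The slope formula is the usual single-variable least squares slope, covariance divided by variance.

```python
from typing import List

def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)

def least_squares_slope(xs: List[float], ys: List[float]) -> float:
    """Single-variable least squares slope: cov(x, y) / var(x)."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    return covariance / variance

# Hypothetical "actual" model: minutes = alpha + beta_1*friends + beta_2*work_hours
alpha, beta_1, beta_2 = 30.0, 1.0, -2.0

# Invented data in which users with more friends also work more hours
friends = [10, 20, 30, 40, 50, 60]
work_hours = [4, 5, 5, 7, 8, 9]
minutes = [alpha + beta_1 * f + beta_2 * w
           for f, w in zip(friends, work_hours)]

# Fit minutes on friends alone, "forgetting" work hours
single_variable_beta_1 = least_squares_slope(friends, minutes)

print(single_variable_beta_1)   # smaller than the "actual" beta_1 of 1.0
```

The fitted slope equals beta_1 plus beta_2 times the slope of work_hours on friends, which is why a negative beta_2 combined with a positive correlation pulls the estimate down.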

In [7]:
#Fitting the Model
'''
    As we did in the simple linear model, we'll choose beta to minimize the sum of squared errors. Finding an exact solution is not simple to do by hand, which means we'll need to use gradient descent. The error function is almost identical to the one used in Chapter 14, except that instead of expecting parameters [alpha, beta] it will take a vector of arbitrary length:
'''

from typing import List

def error(x: Vector, y: float, beta: Vector) -> float:
    return predict(x, beta) - y

def squared_error(x: Vector, y: float, beta: Vector) -> float:
    return error(x,y,beta) ** 2

x = [1,2,3]
y= 30
beta = [4,4,4]  #so prediction = 4 + 8 + 12 = 24
                #             (1*4) + (2*4) + (3*4) = 24

assert error(x,y,beta) == -6            #24 - 30
assert squared_error(x,y,beta) == 36    #(-6) ** 2

#If you know calculus, it's easy to compute the gradient
def sqerror_gradient(x: Vector, y: float, beta: Vector) -> Vector:
    err = error(x,y,beta)
    return [2 * err * x_i for x_i in x]

assert sqerror_gradient(x,y,beta) == [-12, -24, -36]
#err = -6 -> [2 * -6 * 1, 2 * -6 * 2, 2 * -6 * 3] -> [-12, -24, -36]

'''
    Otherwise, you'll need to take my word for it. 
    At this point, we're ready to find the optimal beta using gradient descent. Let's first write out a least_squares_fit function that can work with any dataset:
'''
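
For readers who would rather not take the gradient on faith, it can be checked numerically. The sketch below compares sqerror_gradient against a central-difference estimate; estimate_gradient and its step size h are hypothetical helpers written for this check, and the functions from this cell are redefined locally so the block runs on its own.

```python
from typing import List

Vector = List[float]

def predict(x: Vector, beta: Vector) -> float:
    return sum(x_i * b_i for x_i, b_i in zip(x, beta))

def squared_error(x: Vector, y: float, beta: Vector) -> float:
    return (predict(x, beta) - y) ** 2

def sqerror_gradient(x: Vector, y: float, beta: Vector) -> Vector:
    err = predict(x, beta) - y
    return [2 * err * x_i for x_i in x]

def estimate_gradient(x: Vector, y: float, beta: Vector,
                      h: float = 1e-6) -> Vector:
    """Central-difference estimate of d(squared_error)/d(beta_j) for each j."""
    estimates = []
    for j in range(len(beta)):
        beta_up = [b + (h if i == j else 0.0) for i, b in enumerate(beta)]
        beta_down = [b - (h if i == j else 0.0) for i, b in enumerate(beta)]
        estimates.append((squared_error(x, y, beta_up)
                          - squared_error(x, y, beta_down)) / (2 * h))
    return estimates

x = [1, 2, 3]
y = 30
beta = [4, 4, 4]

analytic = sqerror_gradient(x, y, beta)    # [-12, -24, -36]
numeric = estimate_gradient(x, y, beta)    # approximately the same values

max_difference = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_difference)
```

The two agree up to floating-point rounding because the squared error is quadratic in beta, so the central difference has no truncation error.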
import random
import tqdm
from scratch.linear_algebra import vector_mean
from scratch.gradient_descent import gradient_step

def least_squares_fit(xs: List[Vector],
                      ys: List[float],
                      learning_rate: float = 0.001,
                      num_steps: int = 1000,
                      batch_size: int = 1) -> Vector:
    '''
    Find the beta that minimizes the sum of squared errors 
    assuming the model y = dot(x, beta)
    '''

    #start with a random guess
    guess = [random.random() for _ in xs[0]]

    for _ in tqdm.trange(num_steps, desc="least squares fit"):
        for start in range(0, len(xs), batch_size):
            batch_xs = xs[start:start+batch_size]
            batch_ys = ys[start:start+batch_size]

            gradient = vector_mean([sqerror_gradient(x, y, guess)
                                    for x, y in zip(batch_xs, batch_ys)])
            guess = gradient_step(guess, gradient, -learning_rate)
        
    return guess
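
To see the fitting loop recover known coefficients, here is a self-contained sketch on synthetic data. It mirrors least_squares_fit but replaces the scratch helpers with local stand-ins and drops the tqdm progress bar; the data (y = 2 + 3x with no noise), the learning rate, the step count, and the batch size are all assumptions chosen so that convergence is easy to verify.

```python
import random
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sqerror_gradient(x: Vector, y: float, beta: Vector) -> Vector:
    err = dot(x, beta) - y
    return [2 * err * x_i for x_i in x]

def vector_mean(vectors: List[Vector]) -> Vector:
    """Componentwise mean (stand-in for scratch.linear_algebra.vector_mean)."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
    """Move step_size along gradient (stand-in for the scratch helper)."""
    return [v_i + step_size * g_i for v_i, g_i in zip(v, gradient)]

def fit(xs: List[Vector], ys: List[float],
        learning_rate: float = 0.01,
        num_steps: int = 2000,
        batch_size: int = 4) -> Vector:
    guess = [random.random() for _ in xs[0]]      # start with a random guess
    for _ in range(num_steps):
        for start in range(0, len(xs), batch_size):
            batch_xs = xs[start:start + batch_size]
            batch_ys = ys[start:start + batch_size]
            gradient = vector_mean([sqerror_gradient(x, y, guess)
                                    for x, y in zip(batch_xs, batch_ys)])
            guess = gradient_step(guess, gradient, -learning_rate)
    return guess

random.seed(0)
xs = [[1.0, float(x)] for x in range(-5, 6)]      # leading 1 is the constant term
ys = [2.0 + 3.0 * x[1] for x in xs]               # exact line, no noise

fitted = fit(xs, ys)
print(fitted)    # close to [2.0, 3.0]
```

On real, noisy data the minibatch updates would keep jittering around the minimum, which is one reason the learning rate is kept small in the cell above.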

In [21]:
#We can then apply that to our data

from scratch.statistics import daily_minutes_good
from scratch.gradient_descent import gradient_step

random.seed(0)
learning_rate = 0.001

#The text never shows exactly what 'inputs' contains (beyond its being the per-user data), so I could not reproduce the exact values needed for the asserts below to pass, even with random.seed(0). They are left commented out.
beta = least_squares_fit(inputs, daily_minutes_good, learning_rate, 5000, 25)
'''assert 30.50 < beta[0] < 30.70  #constant
assert  0.96 < beta[1] <  1.00  #num friends
assert -1.89 < beta[2] < -1.85  #work hours per day
assert  0.91 < beta[3] <  0.94  #has PhD'''


'''
    In practice, you wouldn't estimate a linear regression using gradient descent; you'd get the exact coefficients using linear algebra techniques that are beyond the scope of this book. If you did so, you'd find the equation:

        minutes = 30.58 + 0.972 friends - 1.87 work hours + 0.923 phd

    which is pretty close to what we found.
'''
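
The text does not say which linear algebra technique it has in mind. One common option is to solve the normal equations (XᵀX)β = Xᵀy, and the sketch below does that in pure Python on a small invented dataset. The helper names solve and normal_equations_fit, the data, and true_beta are all assumptions made for the illustration; a numerical library would normally be used instead.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]

def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.
    Assumes a is square and nonsingular."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(a, b)]   # augmented matrix [a | b]
    for col in range(n):
        # Swap in the row with the largest entry in this column
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a_rj - factor * a_cj
                          for a_rj, a_cj in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def normal_equations_fit(xs: List[Vector], ys: List[float]) -> Vector:
    """Least squares beta from (X^T X) beta = X^T y."""
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    return solve(xtx, xty)

# Invented data: each row is [1, x1, x2], columns are linearly independent
xs = [[1.0, 1.0, 2.0],
      [1.0, 2.0, 1.0],
      [1.0, 3.0, 5.0],
      [1.0, 4.0, 3.0],
      [1.0, 5.0, 4.0]]
true_beta = [2.0, 3.0, -1.0]
ys = [sum(x_i * b_i for x_i, b_i in zip(x, true_beta)) for x in xs]

beta_hat = normal_equations_fit(xs, ys)
print(beta_hat)    # close to [2.0, 3.0, -1.0]
```

This approach relies on the linear independence assumption discussed earlier: if the columns of x are dependent, XᵀX is singular and solve divides by zero.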

least squares fit: 100%|██████████| 5000/5000 [00:01<00:00, 4222.02it/s]



In [22]:
#Interpreting the Model
'''
    You should think of the coefficients of the model as representing all-else-being-equal estimates of the impacts of each factor. All else being equal, each additional friend corresponds to an extra minute spent on the site each day. All else being equal, each additional hour in a user's workday corresponds to about two fewer minutes spent on the site each day. All else being equal, having a PhD is associated with spending an extra minute on the site each day.
    What this doesn't (directly) tell us is anything about the interactions among the variables. It's possible that the effect of work hours is different for people with many friends than it is for people with few friends. This model doesn't capture that. One way to handle this case is to introduce a new variable that is the product of "friends" and "work hours." This effectively allows the "work hours" coefficient to increase (or decrease) as the number of friends increases.
    Or it's possible that the more friends you have, the more time you spend on the site up to a point, after which further friends cause you to spend less time on the site. (Perhaps with too many friends the experience is just too overwhelming?) We could try to capture this in our model by adding another variable that's the square of the number of friends.
    Once we start adding variables, we need to worry about whether their coefficients "matter." There are no limits to the numbers of products, logs, squares, and higher powers we could add.
'''
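
The interaction and squared terms described above are just extra columns computed from the existing ones. A minimal sketch, assuming each row has the layout [constant, friends, work hours, phd] used earlier; add_interaction_and_square is a hypothetical helper name.

```python
from typing import List

Vector = List[float]

def add_interaction_and_square(x: Vector) -> Vector:
    """Append friends * work_hours and friends ** 2 to a row of the form
    [constant, friends, work_hours, phd]."""
    constant, friends, work_hours, phd = x
    return [constant, friends, work_hours, phd,
            friends * work_hours,    # interaction term
            friends ** 2]            # squared term

row = [1, 49, 4, 0]
expanded_row = add_interaction_and_square(row)
print(expanded_row)    # [1, 49, 4, 0, 196, 2401]
```

With the interaction column in place, the effect of one more work hour becomes the work hours coefficient plus the interaction coefficient times the number of friends, so it is free to vary from user to user. Fitting then proceeds exactly as before, with beta two elements longer.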


In [24]:
#Goodness of Fit

#Again we can look at R-squared:
from scratch.simple_linear_regression import total_sum_of_squares

def multiple_r_squared(xs: List[Vector], ys: Vector, beta: Vector) -> float:
    sum_of_squared_errors = sum(error(x,y,beta) ** 2
                                for x, y in zip(xs, ys))
    #total_sum_of_squares is a function, so it has to be called on ys
    return 1.0 - sum_of_squared_errors / total_sum_of_squares(ys)

#Which has now increased to 0.68:
#assert 0.67 < multiple_r_squared(inputs, daily_minutes_good, beta) < 0.68

'''
    Keep in mind, however, that adding new variables to a regression can never decrease the R-squared of the least squares fit, and in practice will almost always increase it.
'''
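
To make the R-squared calculation concrete, here is a self-contained sketch on a four-point invented dataset. The local total_sum_of_squares is a stand-in for the one imported from scratch.simple_linear_regression, assumed to be the sum of squared deviations of ys from their mean.

```python
from typing import List

Vector = List[float]

def error(x: Vector, y: float, beta: Vector) -> float:
    return sum(x_i * b_i for x_i, b_i in zip(x, beta)) - y

def total_sum_of_squares(ys: Vector) -> float:
    """Total squared variation of ys around their mean (local stand-in)."""
    y_bar = sum(ys) / len(ys)
    return sum((y - y_bar) ** 2 for y in ys)

def multiple_r_squared(xs: List[Vector], ys: Vector, beta: Vector) -> float:
    sum_of_squared_errors = sum(error(x, y, beta) ** 2
                                for x, y in zip(xs, ys))
    return 1.0 - sum_of_squared_errors / total_sum_of_squares(ys)

# Invented data lying exactly on y = 1 + 2x
xs = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
ys = [3.0, 5.0, 7.0, 9.0]

# A beta that reproduces every y explains all of the variation
perfect_fit = multiple_r_squared(xs, ys, [1.0, 2.0])      # 1.0

# A beta that always predicts the mean of ys (6.0) explains none of it
mean_only_fit = multiple_r_squared(xs, ys, [6.0, 0.0])    # 0.0

print(perfect_fit, mean_only_fit)
```

These two cases bracket the usual range: a least squares fit that includes a constant term lands somewhere between predicting the mean and fitting every point exactly.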