In [305]:
import sympy as sp

I'm trying out sympy to compute derivatives of the MSE cost function. Normally I would do this by hand, but I wanted to see what the 2nd derivative of the MSE looks like and learn a little about sympy along the way. Looking at Newton's method made me more curious about the 2nd derivative and how it can be used.

In [306]:
i, m, n = sp.symbols('i m n', integer=True)
X = sp.MatrixSymbol('X', m, n)
Y = sp.MatrixSymbol('Y', m, 1)
w = sp.MatrixSymbol('w', n, 1)

print(X.shape,Y.shape,w.shape)

(m, n) (m, 1) (n, 1)


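We define the symbols like this: i, m and n are integer symbols, and MatrixSymbol declares a matrix by name and shape without fixing its entries, which is convenient when the sizes are themselves symbolic.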
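As a small sketch of how MatrixSymbol behaves (the 2x2 matrix A below is a made-up example, not part of the notebook), a product of matrix symbols keeps track of its shape, and a fixed-size symbol can be expanded into explicit entries:

```python
import sympy as sp

m, n = sp.symbols('m n', integer=True, positive=True)
X = sp.MatrixSymbol('X', m, n)
w = sp.MatrixSymbol('w', n, 1)

prediction = X * w                  # stays a symbolic matrix product
product_shape = prediction.shape    # shape is tracked symbolically
entry = X[0, 1]                     # a single symbolic entry of X

# Hypothetical fixed-size matrix, used only to show as_explicit().
A = sp.MatrixSymbol('A', 2, 2)
explicit = A.as_explicit()          # 2x2 matrix holding A[0, 0], A[0, 1], ...
print(product_shape, entry, explicit.shape)
```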

In [307]:
cost = sp.summation(((X[i,:]*w)[0,0]-Y[i,0])**2, (i, 0, m-1)) / (2*m)
display(cost)
print(sp.latex(cost))

 m - 1                                              
_______                                             
╲                                                   
 ╲                                                  
  ╲                                                2
   ╲    ⎛             n - 1                       ⎞ 
    ╲   ⎜              ___                        ⎟ 
     ╲  ⎜              ╲                          ⎟ 
     ╱  ⎜               ╲                         ⎟ 
    ╱   ⎜-(Y)[i, 0] +   ╱    (X)[i, i₁]⋅(w)[i₁, 0]⎟ 
   ╱    ⎜              ╱                          ⎟ 
  ╱     ⎜              ‾‾‾                        ⎟ 
 ╱      ⎝             i₁ = 0                      ⎠ 
╱                                                   
‾‾‾‾‾‾‾                                             
 i = 0                                              
────────────────────────────────────────────────────
                        2⋅m                         

\frac{\sum_{i=0}^{m - 1} \left(- Y_{i, 0} + \sum_{i_{1}=0}^{n - 1} X_{i, i_{1}} w_{i_{1}, 0}\right)^{2}}{2 m}


Here, we write the cost function as a symbolic expression. The output looks really good, and it surprised me that sympy understands summations and expands the row-times-vector product X[i,:]*w into an explicit sum over the columns. The second output is the LaTeX for the same expression. Sympy can handle symbolic matrix sizes like the ones we have, but to actually look at the derivatives we have to pick specific entries of w to differentiate with respect to. Any number of entries works as long as it is enough to show the pattern, so we take the 1st derivatives with respect to the first 3 weights (which assumes n is at least 3).
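For reference, here is the same cost in more conventional notation. The name J and the norm form on the right are my own shorthand (a standard identity), not something sympy printed:

$$
J(w) = \frac{1}{2m}\sum_{i=0}^{m-1}\left(\sum_{j=0}^{n-1} X_{i,j}\,w_{j,0} - Y_{i,0}\right)^{2} = \frac{1}{2m}\,\lVert Xw - Y\rVert_2^{2}
$$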

In [308]:
d1 = sp.derive_by_array(cost, [w[k,0] for k in range(3)]) # only the first 3 weights, for display convenience (k avoids reusing the name of the summation symbol i)
d1 = sp.simplify(d1)
display(d1)
print(sp.latex(d1))

⎡m - 1                                                         m - 1          
⎢______                                                        ______         
⎢╲                                                             ╲              
⎢ ╲     ⎛             n - 1                       ⎞             ╲     ⎛       
⎢  ╲    ⎜              ___                        ⎟              ╲    ⎜       
⎢   ╲   ⎜              ╲                          ⎟               ╲   ⎜       
⎢    ╲  ⎜               ╲                         ⎟                ╲  ⎜       
⎢    ╱  ⎜-(Y)[i, 0] +   ╱    (X)[i, i₁]⋅(w)[i₁, 0]⎟⋅(X)[i, 0]      ╱  ⎜-(Y)[i,
⎢   ╱   ⎜              ╱                          ⎟               ╱   ⎜       
⎢  ╱    ⎜              ‾‾‾                        ⎟              ╱    ⎜       
⎢ ╱     ⎝             i₁ = 0                      ⎠             ╱     ⎝       
⎢╱                                                             ╱              
⎢‾‾‾‾‾‾                                             

\left[\begin{matrix}\frac{\sum_{i=0}^{m - 1} \left(- Y_{i, 0} + \sum_{i_{1}=0}^{n - 1} X_{i, i_{1}} w_{i_{1}, 0}\right) X_{i, 0}}{m} & \frac{\sum_{i=0}^{m - 1} \left(- Y_{i, 0} + \sum_{i_{1}=0}^{n - 1} X_{i, i_{1}} w_{i_{1}, 0}\right) X_{i, 1}}{m} & \frac{\sum_{i=0}^{m - 1} \left(- Y_{i, 0} + \sum_{i_{1}=0}^{n - 1} X_{i, i_{1}} w_{i_{1}, 0}\right) X_{i, 2}}{m}\end{matrix}\right]


The output is really nice. The factor of 2 from the power rule cancels the 2 in the denominator, and what is left is the same expression as the usual partial derivatives for linear regression. Next, we do the same thing and compute the 2nd derivatives, again with respect to the first 3 weights.
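The match with the usual linear regression gradient can be checked against the vectorized form X^T (Xw - Y) / m. This is only a sketch: it assumes small concrete sizes (m = 4, n = 3) and uses explicit matrices of made-up symbols instead of the MatrixSymbol setup above.

```python
import sympy as sp

m, n = 4, 3  # small concrete sizes, assumed only for this check
X = sp.Matrix(m, n, lambda r, c: sp.Symbol(f"x_{r}_{c}"))
Y = sp.Matrix(m, 1, lambda r, c: sp.Symbol(f"y_{r}"))
w = sp.Matrix(n, 1, lambda r, c: sp.Symbol(f"w_{r}"))

residual = X * w - Y
cost = (residual.T * residual)[0, 0] / (2 * m)

# Entry-by-entry partial derivatives, as in the cell above.
grad = sp.Matrix([sp.diff(cost, wk) for wk in w])

# Vectorized form of the linear regression gradient.
vectorized = X.T * residual / m

# Expanding the difference should leave zero in every entry.
difference = (grad - vectorized).applyfunc(sp.expand)
gradient_matches = all(entry == 0 for entry in difference)
print(gradient_matches)
```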

In [309]:
d2 = sp.derive_by_array(d1, [w[k,0] for k in range(3)]) # only the first 3 weights, for display convenience (k avoids reusing the name of the summation symbol i)
d2 = sp.simplify(d2)
display(d2)
print(sp.latex(d2))

⎡    m - 1                  m - 1                      m - 1                  
⎢     ___                    ___                        ___                   
⎢     ╲                      ╲                          ╲                     
⎢      ╲      2               ╲                          ╲                    
⎢      ╱   (X) [i, 0]         ╱   (X)[i, 0]⋅(X)[i, 1]    ╱   (X)[i, 0]⋅(X)[i, 
⎢     ╱                      ╱                          ╱                     
⎢     ‾‾‾                    ‾‾‾                        ‾‾‾                   
⎢    i = 0                  i = 0                      i = 0                  
⎢    ────────────────       ─────────────────────────  ───────────────────────
⎢           m                           m                          m          
⎢                                                                             
⎢m - 1                          m - 1                  m - 1                  
⎢ ___                            ___                

\left[\begin{matrix}\frac{\sum_{i=0}^{m - 1} X_{i, 0}^{2}}{m} & \frac{\sum_{i=0}^{m - 1} X_{i, 0} X_{i, 1}}{m} & \frac{\sum_{i=0}^{m - 1} X_{i, 0} X_{i, 2}}{m}\\\frac{\sum_{i=0}^{m - 1} X_{i, 0} X_{i, 1}}{m} & \frac{\sum_{i=0}^{m - 1} X_{i, 1}^{2}}{m} & \frac{\sum_{i=0}^{m - 1} X_{i, 1} X_{i, 2}}{m}\\\frac{\sum_{i=0}^{m - 1} X_{i, 0} X_{i, 2}}{m} & \frac{\sum_{i=0}^{m - 1} X_{i, 1} X_{i, 2}}{m} & \frac{\sum_{i=0}^{m - 1} X_{i, 2}^{2}}{m}\end{matrix}\right]


Wow, everything simplifies a lot. The 2nd derivatives no longer depend on w or Y at all, only on X. Additionally, the matrix is symmetric, so only half of it (one triangle, including the diagonal) needs to be computed and stored in memory; the rest is a reflection of that half.

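$$
\begin{bmatrix}
\dfrac{\partial^2 J}{\partial w_0\,\partial w_0} & \dfrac{\partial^2 J}{\partial w_1\,\partial w_0} & \dfrac{\partial^2 J}{\partial w_2\,\partial w_0} \\[2.2ex]
\dfrac{\partial^2 J}{\partial w_0\,\partial w_1} & \dfrac{\partial^2 J}{\partial w_1\,\partial w_1} & \dfrac{\partial^2 J}{\partial w_2\,\partial w_1} \\[2.2ex]
\dfrac{\partial^2 J}{\partial w_0\,\partial w_2} & \dfrac{\partial^2 J}{\partial w_1\,\partial w_2} & \dfrac{\partial^2 J}{\partial w_2\,\partial w_2}
\end{bmatrix}
$$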

These are the 2nd derivatives the output represents, where J is the cost and w_k is short for w[k,0]. A matrix of all the 2nd partial derivatives of a function is called the Hessian. Depending on the index convention, the layout we got can be read as the transpose of the Hessian, but since the matrix is symmetric the transpose is the same matrix, so what we computed is already the Hessian of the MSE.
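Both the symmetry and the pattern in the entries (each one is a sum of products of two columns of X, divided by m) suggest the closed form X^T X / m. Here is a rough check using sympy's built-in hessian function, again assuming small concrete sizes and made-up symbol names rather than the MatrixSymbol setup:

```python
import sympy as sp

m, n = 4, 3  # small concrete sizes, assumed only for this check
X = sp.Matrix(m, n, lambda r, c: sp.Symbol(f"x_{r}_{c}"))
Y = sp.Matrix(m, 1, lambda r, c: sp.Symbol(f"y_{r}"))
w = sp.Matrix(n, 1, lambda r, c: sp.Symbol(f"w_{r}"))

residual = X * w - Y
cost = (residual.T * residual)[0, 0] / (2 * m)

# Full n x n Hessian with respect to all weights.
H = sp.hessian(cost, list(w))

# Compare with the closed form suggested by the output above.
closed_form = X.T * X / m
difference = (H - closed_form).applyfunc(sp.expand)
hessian_matches = all(entry == 0 for entry in difference)

is_symmetric = H.applyfunc(sp.expand).is_symmetric()
print(hessian_matches, is_symmetric)
```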

$\mathbf H_f= \begin{bmatrix}
  \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1\,\partial x_n} \\[2.2ex]
  \dfrac{\partial^2 f}{\partial x_2\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2\,\partial x_n} \\[2.2ex]
  \vdots & \vdots & \ddots & \vdots \\[2.2ex]
  \dfrac{\partial^2 f}{\partial x_n\,\partial x_1} & \dfrac{\partial^2 f}{\partial x_n\,\partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}.$

Here's what the Hessian looks like for reference. This is from the Wikipedia page https://en.wikipedia.org/wiki/Hessian_matrix.
We can now use this for other computations like Newton's method. Overall, sympy is incredible. Not having to do the derivatives by hand is really powerful and saves a lot of time. It can even evaluate the expressions numerically after the symbolic manipulation. Additionally, the library is really easy to use, and all of this was achieved with only a few lines of code. :)
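As a sketch of where this could go next, here is one Newton step on a tiny made-up dataset, using the closed forms X^T (Xw - Y) / m for the gradient and X^T X / m for the Hessian (both are assumptions carried over from the pattern in the outputs above, and the data is invented). Because the MSE is quadratic in w, a single step from any starting point should land on the minimum, as long as the Hessian is invertible.

```python
import sympy as sp

# Made-up data: 4 samples, a bias column plus one feature, with y = 1 + 2*x.
X = sp.Matrix([[1, 0], [1, 1], [1, 2], [1, 3]])
Y = sp.Matrix([1, 3, 5, 7])
m = X.rows

def gradient(w):
    """Gradient of the MSE cost at w: X^T (X w - Y) / m."""
    return X.T * (X * w - Y) / m

H = X.T * X / m             # Hessian of the MSE cost, constant in w
w0 = sp.zeros(X.cols, 1)    # arbitrary starting point

# One Newton step: solve H * step = gradient rather than inverting H.
w1 = w0 - H.LUsolve(gradient(w0))

print(w1.T)             # Matrix([[1, 2]])
print(gradient(w1).T)   # Matrix([[0, 0]])
```

The arithmetic is exact because sympy keeps the entries as rationals, so the recovered weights are exactly the intercept 1 and slope 2 used to generate the data.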