# Multinomial Logistic Regression

**Sections**
- [1.0 Synthetic Data & Model](#1.0-Synthetic-Data-&-Model)
- [2.0 Newton Raphson Algorithm](#2.0-Newton-Raphson-Algorithm)
- [3.0 Newton Raphson Implementation](#3.0-Newton-Raphson-Implementation)
- [4.0 Prediction at X values](#4.0-Prediction-at-X-values)

### 0. Importing Modules

In [1]:
import math
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import bokeh
from bokeh.plotting import figure, show
from bokeh.models import tickers, ranges
from bokeh.io import output_notebook
output_notebook()

## 1.0 Synthetic Data & Model

$$ K = 4 $$

\begin{align}
\large
P(Y_i = j \hspace{1 mm} |  \hspace{1 mm} \beta , x_i) = \dfrac{ {\rm e}^{x_i \beta_j}}{1 + \Sigma_{j=1}^{K-1} {\rm e}^{x_i \beta_j}}, \hspace{3 mm} for \hspace{3 mm} j = 1,2,3
\end{align}

$$\large = \dfrac{ {\rm e}^{x_i \beta_j}}{1 + \Sigma_{j=1}^{3} {\rm e}^{x_i \beta_j}}$$

\begin{align}
\large
P(Y_i = 4 \hspace{1 mm} |  \hspace{1 mm} \beta , x_i) = \dfrac{1}{1 + \Sigma_{j=1}^{3} {\rm e}^{x_i \beta_j}}
\end{align}
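
As a quick sanity check on the model above, the sketch below evaluates all four class probabilities for a single scalar $x$ and confirms that they sum to one. The function name `class_probabilities` and the numeric inputs are hypothetical, chosen only for illustration.

```python
import numpy as np

def class_probabilities(x, betas):
    """Return [P(Y=1|x), ..., P(Y=K|x)] for a scalar x and K-1 slopes."""
    betas = np.asarray(betas, dtype=float)
    weights = np.exp(x * betas)          # e^{x * beta_j} for j = 1..K-1
    denominator = 1.0 + weights.sum()    # shared by all K classes
    # Class K is the reference category: its numerator is 1.
    return np.append(weights / denominator, 1.0 / denominator)

# Hypothetical inputs, not taken from the data set
probs = class_probabilities(0.5, [1.0, 2.0, 3.0])
print(probs, probs.sum())
```

Because class $K$ carries no parameter of its own, only $K-1 = 3$ coefficients need to be estimated.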

(1) Write down the log-likelihood function

 $$\large L (\beta) = \prod_{i=1}^{n} p(y_i \hspace{1 mm} |  \hspace{1 mm} x_i, \beta) $$

$$\large log (L (\beta)) = \Sigma_{j=1}^{K-1}\hspace{1 mm}\beta_j \cdot \hspace{3 mm} \Sigma_{y_i=j}\hspace{1 mm} x_i - \hspace{3 mm}
\Sigma_{i=1}^{n}\hspace{1 mm} log \hspace{1 mm} (1 + \Sigma_{j=1}^{K-1} {\rm e}^{x_i \beta_j})$$

$$\large log (L (\beta)) = \Sigma_{j=1}^{3}\hspace{1 mm}\beta_j \cdot \hspace{3 mm} \Sigma_{y_i=j}\hspace{1 mm} x_i - \hspace{3 mm}
\Sigma_{i=1}^{50}\hspace{1 mm} log \hspace{1 mm} (1 + \Sigma_{j=1}^{3} {\rm e}^{x_i \beta_j})$$
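
The log-likelihood above translates almost term by term into NumPy. The following is a minimal sketch, assuming `x` and `y` are 1-D arrays with `y` coded 1 to K; the function name `log_likelihood` and the demo values are hypothetical and not part of the original notebook.

```python
import numpy as np

def log_likelihood(x, y, betas):
    """Log-likelihood of the baseline-category model with one predictor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    betas = np.asarray(betas, dtype=float)
    # First term: sum_j beta_j * sum_{i: y_i = j} x_i
    linear = sum(b * x[y == j].sum() for j, b in enumerate(betas, start=1))
    # Second term: sum_i log(1 + sum_j exp(x_i * beta_j))
    log_norm = np.log1p(np.exp(np.outer(x, betas)).sum(axis=1)).sum()
    return linear - log_norm

# Hypothetical demo data: with all betas at zero every class has
# probability 1/4, so the value should equal -n * log(4).
x_demo = np.array([0.2, 0.4, 0.6, 0.8])
y_demo = np.array([1, 2, 3, 4])
print(log_likelihood(x_demo, y_demo, [0.0, 0.0, 0.0]), -4 * np.log(4))
```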

(2) Expressions of the partial derivatives

$$\large \dfrac{\partial L}{\partial \beta_j} \text{      and      }  
\dfrac{\partial ^2L}{\partial \beta_j \partial \beta_k} $$

$$\large \dfrac{\partial L}{\partial \beta_j} = \Sigma_{y_i=j}\hspace{1 mm} x_i 
- \Sigma_{i=1}^n \dfrac{ x_i{\rm e}^{x_i \beta_j}}{1 + \Sigma_{j=1}^{K-1} {\rm e}^{x_i \beta_j}}$$

$$\large \dfrac{\partial ^2L}{\partial \beta_j \partial \beta_k} = 
-1(j=k) \hspace{3 mm}
\Sigma_{i=1}^n \dfrac{ x_i^2{\rm e}^{x_i \beta_j}}{1 + \Sigma_{j=1}^{K-1} {\rm e}^{x_i \beta_j}}
+ \Sigma_{i=1}^n \dfrac{ x_i^2{\rm e}^{x_i (\beta_j + \beta_k)}}{(1 + \Sigma_{j=1}^{K-1} {\rm e}^{x_i \beta_j})^2}$$
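
One way to gain confidence in these expressions is to compare the analytic gradient with central finite differences of the log-likelihood. The sketch below does that on randomly generated data; all function names, the seed, and the parameter values are assumptions made for illustration, and the vectorized form is an alternative to the loop-based code used later in this notebook.

```python
import numpy as np

def log_lik(x, y, betas):
    linear = sum(b * x[y == j].sum() for j, b in enumerate(betas, start=1))
    return linear - np.log1p(np.exp(np.outer(x, betas)).sum(axis=1)).sum()

def class_probs(x, betas):
    weights = np.exp(np.outer(x, betas))                  # n x (K-1)
    return weights / (1.0 + weights.sum(axis=1, keepdims=True))

def gradient(x, y, betas):
    # sum_{y_i = j} x_i  -  sum_i x_i * P(Y_i = j | x_i)
    observed = np.array([x[y == j].sum() for j in range(1, len(betas) + 1)])
    return observed - (x[:, None] * class_probs(x, betas)).sum(axis=0)

def hessian(x, betas):
    # -1(j=k) sum_i x_i^2 p_ij  +  sum_i x_i^2 p_ij p_ik
    probs = class_probs(x, betas)
    x2 = (x ** 2)[:, None]
    return -np.diag((x2 * probs).sum(axis=0)) + (x2 * probs).T @ probs

def numeric_gradient(f, betas, h=1e-6):
    grad = np.zeros_like(betas)
    for j in range(len(betas)):
        step = np.zeros_like(betas)
        step[j] = h
        grad[j] = (f(betas + step) - f(betas - step)) / (2 * h)
    return grad

# Synthetic data for the check only (not the notebook's data_1.csv)
rng = np.random.default_rng(0)
x = rng.uniform(size=20)
y = rng.integers(1, 5, size=20)          # categories 1..4
betas = np.array([0.5, -1.0, 2.0])

analytic = gradient(x, y, betas)
numeric = numeric_gradient(lambda b: log_lik(x, y, b), betas)
print(np.max(np.abs(analytic - numeric)))
```

If the derivation is right, the printed discrepancy should be tiny compared with the size of the gradient entries.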

## 2.0 Newton Raphson Algorithm

In [2]:
def get_L1_vector(x, yi, Bj):
    """
    Computes the partial derivatives dL/dBj of the log-likelihood.

    Assumes Bj is a row vector with K-1 entries and x is a column array.

    Args:
        x (np.ndarray): Column vector with the N observations of the
            predictor.
        yi (np.ndarray): Column vector with categorical data (1 to K).
        Bj (np.ndarray): Row vector with regression parameters.

    Returns:
        list: The K-1 partial derivatives, one per entry of Bj.
    """
    dL_dBj = []

    # The denominator does not depend on the category, so compute it once.
    denominator = np.ones(shape = x.shape)
    for i, xi in enumerate(x):
        denominator[i] = (1 + np.sum(np.exp(xi * Bj))) # Vector Scaling

    for category, bj in enumerate(Bj, start = 1):

        # Sum of x over the observations that fall in this category
        term_1 = np.sum(x[yi == category])
        numerator = x * np.exp(x * bj)
        dL_dBj.append(term_1 - np.sum(numerator/denominator))

    return dL_dBj

def get_Lprime2_matrix(x, K, Bj):
    """
    Computes partial second derivatives d2L/dBj dBk.

    Assumes Bj is a row vector with K-1 entries and x is a column array.

    Args:
        x (np.ndarray): Column vector with the N observations of the
            predictor.
        K (int): Number of categories or discrete values y can take
            from 1 to K.
        Bj (np.ndarray): Row vector with regression parameters.

    Returns:
        np.ndarray: (K-1) x (K-1) matrix of second derivatives.
    """
 
    l_prime2 = np.zeros(shape = (K-1, K-1)) # Matrix L2 is (j,k)

    # Approach: 
    #   Since x is a column vector and Bj is a row, vectorized operations make
    #   more sense. 
    #   The only explicit iteration 1:n is for the denominator.
    denominator = np.ones(shape = x.shape)

    for i, xi in enumerate(x):
        denominator[i] = (1 + np.sum(np.exp(xi * Bj))) # Vector Scaling
    
    # Note: symmetric matrix, we are estimating K-1 parameters
    for j in range(K-1):
        for k in range(K-1):

            if j == k:
                f = -1
            else:
                f = 0

            l_prime2[j, k] = f*np.sum(x**2*np.exp(x * Bj[j])/denominator) + \
                    np.sum(x**2*np.exp(x * (Bj[j] + Bj[k]))/(denominator**2))

    return l_prime2

def newton_raphson(xArr, yArr, b_0, tolerance = 0.00001):
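    """
    Performs Newton-Raphson root finding on the gradient of the
    log-likelihood.
    
    Args:
        xArr (np.ndarray): Column array with x values.
        yArr (np.ndarray): Column array with y values (discrete).
        b_0 (np.ndarray): Initial guess for the K-1 regression parameters.
        tolerance (float): Stops iteration when the largest absolute change
            in any parameter between iterations is within tolerance.

    Returns:
        tuple: The final parameter estimates and the list of all iterates.
    """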

    k = len(b_0) + 1
    difference = tolerance * 5
    
    beta_iter = [b_0]
    while abs(difference) > tolerance:
        
        L_1 = get_L1_vector(xArr, yArr, b_0)
        L_2 = get_Lprime2_matrix(xArr, k, b_0)
        beta_1 = b_0 - np.linalg.solve(L_2, L_1)

        # Calculate difference and update iteration state
        difference = max(abs(np.array(beta_1) - np.array(b_0)))
        b_0 = beta_1
        beta_iter.append(b_0)
    
    return beta_1, beta_iter

## 3.0 Newton Raphson Implementation

(3) Using the partial derivatives just found, write and run a Newton-Raphson
algorithm to obtain the maximum likelihood estimator $\hat \beta$. <br>
State the algorithm and the final result.
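
Stated compactly, the iteration implemented by `newton_raphson` in Section 2.0 can be written as follows. The symbols $g$ (gradient vector), $H$ (matrix of second derivatives), and $\varepsilon$ (the `tolerance` argument) are notation introduced here for convenience.

$$\large \beta^{(t+1)} = \beta^{(t)} - H(\beta^{(t)})^{-1} \hspace{1 mm} g(\beta^{(t)}), \hspace{3 mm}
g_j = \dfrac{\partial L}{\partial \beta_j}, \hspace{3 mm}
H_{jk} = \dfrac{\partial ^2L}{\partial \beta_j \partial \beta_k}$$

$$\large \text{stop when} \hspace{3 mm} \max_j \hspace{1 mm} | \beta_j^{(t+1)} - \beta_j^{(t)} | \le \varepsilon$$

In the code the inverse is never formed explicitly: `np.linalg.solve(L_2, L_1)` solves the linear system $H \delta = g$ and the update is $\beta^{(t)} - \delta$.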

In [3]:
df = pd.read_csv('data_1.csv')
df.head()

Unnamed: 0,y,x
0,2,0.208561
1,4,0.002906
2,2,0.392529
3,2,0.836454
4,3,0.465919


In [4]:
xArr = df.x.values
yArr = df.y.values

# Run NR with different starting points
betas = [np.array([0.5, 0.5, 0.5]),
         np.array([1, 1, 1]),
         np.array([4, 4, 4]),
         np.array([-2, -2, -2]),
        ]

root = [] # Container for different starting points
for b_0 in betas:
    beta_1, _ = newton_raphson(xArr, yArr, b_0)
    root.append(beta_1)

In [5]:
df_nr = pd.DataFrame(root)
beta_hat = pd.Series(df_nr.mean().values, index = ['B0', 'B1', 'B2'], name = 'B-hats')
beta_hat

B0    0.940669
B1    1.769768
B2    2.585184
Name: B-hats, dtype: float64

## 4.0 Prediction at X values

(4) Find the predictive probabilities for y with a new predictor at x = $\bar x$.

In [6]:
x_bar= xArr.mean()
x_bar

0.54944584972

In [7]:
def get_p_j_given_x(x, Bj):
    """Calculates P( y = j | x) for j = 1 through K-1.

    Args:
        x (np.ndarray): N observations x M features.
        Bj (np.ndarray): M features x K-1 regression parameters.

    Returns:
        np.ndarray: N x K-1 matrix with the probabilities of each observation
        being classified as a given category.
    """
    numerator = np.exp(x @ Bj) # Returns N x K-1 Matrix
    # Note: It is critical to sum over the axis because it is
    # only within an observation that the probabilities must add up to 1.
    denominator = (1 + 
                    np.sum(np.exp(x @ Bj), axis = 1)).reshape(-1,1) # N Vector
    
    return numerator / denominator

def get_pK(x, Bj):
    """Calculates P( y = K | x).
    """    
    denominator = (1 + np.sum(np.exp(x @ Bj), axis = 1)).reshape(-1,1)
    return 1 / denominator

In [8]:
# Probabilities j = 1 through K-1
xi = np.array([x_bar]).reshape(1,1)
beta_hat = beta_hat.values.reshape(1,-1)

p_array_K_minus_one = get_p_j_given_x(xi, beta_hat)

# Probabilities j = K
p_K = 1 - np.sum(p_array_K_minus_one, axis = 1).reshape(-1,1)

# Same as computing P(y = K | x) directly
p_K_v2 = get_pK(xi, beta_hat)
assert np.max(np.abs(p_K - p_K_v2)) < 10**-15

# Full array
p_array = np.concatenate([p_array_K_minus_one, p_K], axis = 1)
p_array

array([[0.17724729, 0.27952478, 0.43751799, 0.10570995]])
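
As a cross-check, the estimates and $\bar x$ printed above can be plugged into the model directly. The sketch below hard-codes those printed (rounded) values and treats the model as a softmax in which class $K$ has a logit fixed at zero; subtracting the largest logit before exponentiating is a standard stability trick that is not used in the original cells.

```python
import numpy as np

# Values copied from the outputs above (rounded as printed)
beta_hat_reported = np.array([0.940669, 1.769768, 2.585184])
x_bar_reported = 0.54944584972

# Logits for classes 1..K-1, plus a zero logit for reference class K
logits = np.append(x_bar_reported * beta_hat_reported, 0.0)
logits -= logits.max()                     # avoids overflow; ratios unchanged
probs = np.exp(logits) / np.exp(logits).sum()

print(probs)
print('Most probable class:', probs.argmax() + 1)
```

Up to the rounding of the printed coefficients, this should agree with `p_array`.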

(5) Hence, what would be the predicted outcome for $y$ at this $x$?

In [9]:
print('Predicted outcome for y =' ,p_array.argmax() + 1 )

Predicted outcome for y = 3
