## Maximum Entropy

The principle of maximum entropy is a general principle from information theory, and a classification model built on it is called a maximum entropy model. Information entropy quantifies the uncertainty of a random variable. The principle states that, among all probability distributions satisfying the constraints derived from known information, the one with the largest entropy should be chosen: it makes full use of what is known and makes the fewest assumptions about what is unknown.

For a discrete random variable $x$, its information entropy can be defined as:
$$
H = -\sum^{n}_{i=1}f(x_{i})\ln f(x_{i})
$$

For a continuous random variable $x$, its information entropy can be defined as:
$$
H=-\int_{R} f(x) \ln f(x) d x
$$

Here $f(x)$ is the probability density function of a continuous variable, and $f(x_{i})$ is the probability mass at the discrete point $x_{i}$. The maximum entropy method finds the $f(x)$ or $f(x_{i})$ that maximizes the entropy $H$ under the given constraints, which is a constrained optimization problem.
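
As a small illustration of the discrete definition, the sketch below computes $H$ for three hand-picked distributions over four outcomes. The distributions and the helper name `entropy` are illustrative assumptions, not part of the original notebook. With no constraint other than normalization, the uniform distribution has the largest entropy.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # terms with p = 0 contribute 0 to the sum
    return float(-np.sum(nz * np.log(nz)))

# Hypothetical distributions over four outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
certain = [1.0, 0.0, 0.0, 0.0]

h_uniform = entropy(uniform)
h_skewed = entropy(skewed)
h_certain = entropy(certain)
print(h_uniform, h_skewed, h_certain)
```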

Assume the target classification model is a conditional probability distribution $P(Y|X)$, where $X$ represents the input and $Y$ the output. Given a training data set, the learning goal is to select the model with maximum entropy as the target model. From the data set, the empirical distribution $\hat{P}(X,Y)$ of the joint distribution $P(X,Y)$ and the empirical distribution $\hat{P}(X)$ of the marginal distribution $P(X)$ can be determined by counting: $\hat{P}(X=x,Y=y)$ is the fraction of samples equal to $(x,y)$, and $\hat{P}(X=x)$ is the fraction of samples whose input is $x$. A characteristic function (feature function) $f(x,y)$ is then used to describe a fact relating input and output: it takes the value 1 when $x$ and $y$ satisfy the fact, and 0 otherwise.

The expected value of the characteristic function $f(x,y)$ with respect to the empirical distribution $\hat{P}(X,Y)$ is $E_{\hat{P}}(f)$:
$$
E_{\hat{P}}(f) = \sum _{x,y} \hat{P}(x,y)f(x,y)
$$
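
The counting behind $\hat{P}(X,Y)$, $\hat{P}(X)$ and $E_{\hat{P}}(f)$ can be sketched on a tiny data set. The five samples and the feature "sunny and yes" below are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy data: x is a weather word, y is a yes/no label.
data = [("sunny", "yes"), ("sunny", "yes"), ("sunny", "no"),
        ("rainy", "no"), ("rainy", "no")]
N = len(data)

# Empirical joint and marginal distributions obtained by counting.
P_hat_xy = {xy: c / N for xy, c in Counter(data).items()}
P_hat_x = {x: c / N for x, c in Counter(x for x, _ in data).items()}

def f(x, y):
    """Binary feature: 1 when the fact 'sunny and yes' holds, else 0."""
    return 1 if (x == "sunny" and y == "yes") else 0

# E_{P_hat}(f) = sum over (x, y) of P_hat(x, y) * f(x, y)
E_hat_f = sum(p * f(x, y) for (x, y), p in P_hat_xy.items())
print(P_hat_x, E_hat_f)
```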

The expected value of the characteristic function $f(x,y)$ with respect to the model $P(Y|X)$ and the empirical distribution $\hat{P}(X)$ is $E_{P}(f)$:
$$
E_{P}(f) = \sum _{x,y} \hat{P}(x)P(y|x)f(x,y)
$$
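
The model expectation depends on which conditional model $P(y|x)$ is plugged in. The sketch below evaluates $E_{P}(f)$ for two hypothetical models on a made-up marginal $\hat{P}(x)$. All names and numbers are illustrative assumptions. Different candidate models generally give different values of $E_{P}(f)$, so requiring a particular value restricts the set of admissible models.

```python
# Hypothetical toy setup: empirical marginal over inputs and two labels.
P_hat_x = {"sunny": 0.6, "rainy": 0.4}
labels = ["yes", "no"]

def f(x, y):
    """Binary feature: 1 when the fact 'sunny and yes' holds, else 0."""
    return 1 if (x == "sunny" and y == "yes") else 0

def expected_under_model(P_cond):
    """E_P(f) = sum over (x, y) of P_hat(x) * P(y|x) * f(x, y)."""
    return sum(P_hat_x[x] * P_cond(y, x) * f(x, y)
               for x in P_hat_x for y in labels)

def uniform_model(y, x):
    # Assigns the same probability to every label, whatever x is.
    return 1.0 / len(labels)

# A second hypothetical model given as an explicit table of P(y|x).
table = {("sunny", "yes"): 2 / 3, ("sunny", "no"): 1 / 3,
         ("rainy", "yes"): 0.0, ("rainy", "no"): 1.0}

def table_model(y, x):
    return table[(x, y)]

E_uniform = expected_under_model(uniform_model)
E_table = expected_under_model(table_model)
print(E_uniform, E_table)
```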

If the model captures the information contained in the training data, the two expected values can be required to be equal:
$$
\begin{aligned}
\sum _{x,y} \hat{P}(x,y)f(x,y) &= \sum _{x,y} \hat{P}(x)P(y|x)f(x,y) \\
E_{\hat{P}}(f) &= E_{P}(f)
\end{aligned}
$$

The above equation serves as a constraint for learning the maximum entropy model. If there are $n$ characteristic functions $f_{i}(x,y)$, $i=1,\dots,n$, there are $n$ constraints.

Let $C$ denote the set of models that satisfy the above constraints. The model in $C$ with the largest conditional entropy $H(P)$ is the maximum entropy model:
$$
\begin{aligned}
\max_{P \in C} \enspace & H(P) = -\sum_{x,y} \hat{P}(x)P(y|x) \log P(y|x) \\
s.t. \enspace & E_{\hat{P}}(f_{i}) = E_{P}(f_{i}), \enspace i=1,\dots,n \\
& \sum_{y} P(y|x)=1
\end{aligned}
$$
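
The objective $H(P)$ is straightforward to evaluate for a tabular model. The sketch below assumes a hypothetical marginal $\hat{P}(x)$ over two inputs and two candidate tables for $P(y|x)$. The helper name `conditional_entropy` and all numbers are illustrative.

```python
import numpy as np

def conditional_entropy(P_hat_x, P_y_given_x):
    """H(P) = -sum_x P_hat(x) sum_y P(y|x) log P(y|x).

    P_hat_x has shape (n_x,); P_y_given_x has shape (n_x, n_y) with rows summing to 1.
    """
    P_hat_x = np.asarray(P_hat_x, dtype=float)
    P = np.asarray(P_y_given_x, dtype=float)
    safe = np.where(P > 0, P, 1.0)  # log(1) = 0, so zero entries contribute 0
    return float(-np.sum(P_hat_x[:, None] * P * np.log(safe)))

P_hat_x = [0.6, 0.4]                 # hypothetical empirical marginal
uniform = [[0.5, 0.5], [0.5, 0.5]]   # no preference between the two labels
peaked = [[0.9, 0.1], [0.2, 0.8]]    # a more committed candidate model

H_uniform = conditional_entropy(P_hat_x, uniform)
H_peaked = conditional_entropy(P_hat_x, peaked)
print(H_uniform, H_peaked)
```

Without feature constraints the uniform table has the larger conditional entropy; the constraints are what pull the solution away from it.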

Equivalently, rewrite the maximization problem as a minimization problem:
$$
\begin{aligned}
\min_{P \in C} \enspace & -H(P) = \sum_{x,y} \hat{P}(x)P(y|x) \log P(y|x) \\
s.t. \enspace & E_{\hat{P}}(f_{i}) - E_{P}(f_{i})=0, \enspace i=1,\dots,n \\
& \sum_{y} P(y|x)=1
\end{aligned}
$$

The constrained optimization problem above can be converted into an unconstrained one with the method of Lagrange multipliers, and the primal problem can then be solved through its dual. With multipliers $w = (w_{0}, w_{1}, \dots, w_{n})$, the Lagrangian function $L(P,w)$ is defined as follows:
$$
\begin{aligned}
L(P, w)&=-H(P)+w_{0}\left(1-\sum_{y} P(y \mid x)\right)+\sum_{i=1}^{n} w_{i}\left(E_{\hat{P}}\left(f_{i}\right)-E_{P}\left(f_{i}\right)\right) \\
&=\sum_{x, y} \hat{P}(x) P(y \mid x) \log P(y \mid x)+w_{0}\left(1-\sum_{y} P(y \mid x)\right) \\
&\quad+\sum_{i=1}^{n} w_{i}\left(\sum_{x, y} \hat{P}(x, y) f_{i}(x, y)-\sum_{x, y} \hat{P}(x) P(y \mid x) f_{i}(x, y)\right)
\end{aligned}
$$

The primal optimization problem is:
$$
\min_{P \in C} \max_{w} L(P, w)
$$

The dual problem of the problem above is:
$$
\max_{w} \min_{P \in C} L(P, w)
$$

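To solve the dual problem, first solve the inner minimization problem $\min _{P \in C} L(P, w)$. Its optimal value is a function of $w$, the dual function $\Psi(w)$, and its minimizer is denoted $P_{w}$:
$$
\begin{aligned}
\Psi(w)&=\min _{P \in C} L(P, w)=L\left(P_{w}, w\right) \\
P_{w}&=\arg \min _{P \in C} L(P, w)=P_{w}(y \mid x)
\end{aligned}
$$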

Taking the partial derivative of $L(P,w)$ with respect to $P(y|x)$ and setting it to 0 gives:
$$
\begin{aligned}
P_{w}(y \mid x)&=\frac{1}{Z_{w}(x)} \exp \left(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\right) \\
Z_{w}(x) &= \sum _{y} \exp \left(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\right)
\end{aligned}
$$
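
The differentiation step can be written out as follows. This is a sketch under one assumption that the text does not spell out: the normalization constraint holds separately for every $x$, so its multiplier is written here as $w_{0}(x)$. For a fixed pair $(x,y)$ with $\hat{P}(x)>0$:

$$
\begin{aligned}
\frac{\partial L(P, w)}{\partial P(y \mid x)} &= \hat{P}(x)\left(\log P(y \mid x)+1\right)-w_{0}(x)-\hat{P}(x) \sum_{i=1}^{n} w_{i} f_{i}(x, y)=0 \\
\log P(y \mid x) &= \sum_{i=1}^{n} w_{i} f_{i}(x, y)+\frac{w_{0}(x)}{\hat{P}(x)}-1 \\
P(y \mid x) &= \exp \left(\sum_{i=1}^{n} w_{i} f_{i}(x, y)\right) \exp \left(\frac{w_{0}(x)}{\hat{P}(x)}-1\right)
\end{aligned}
$$

The last factor does not depend on $y$, so the constraint $\sum_{y} P(y \mid x)=1$ forces it to equal $1 / Z_{w}(x)$, which yields the exponential form of $P_{w}(y \mid x)$.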

The model given by the formula for $P_{w}(y \mid x)$ is the maximum entropy model; $Z_{w}(x)$ is the normalization factor that makes the probabilities sum to 1 over $y$.
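
The formula for $P_{w}(y \mid x)$ is a softmax over the labels of the weighted feature sums. A minimal sketch, assuming two hypothetical binary features and a two-label problem (the names `feature_functions` and `p_w` are illustrative):

```python
import numpy as np

labels = ["yes", "no"]

# Hypothetical binary feature functions f_i(x, y).
feature_functions = [
    lambda x, y: 1.0 if (x == "sunny" and y == "yes") else 0.0,
    lambda x, y: 1.0 if (x == "rainy" and y == "no") else 0.0,
]

def p_w(x, w):
    """Return P_w(y|x) for every label y as an array aligned with `labels`."""
    scores = np.array([
        sum(w_i * f_i(x, y) for w_i, f_i in zip(w, feature_functions))
        for y in labels
    ])
    scores -= scores.max()        # numerical stability; the ratios are unchanged
    unnorm = np.exp(scores)
    return unnorm / unnorm.sum()  # division by Z_w(x)

p_zero = p_w("sunny", [0.0, 0.0])  # all weights zero: uniform over labels
p_pos = p_w("sunny", [1.0, 0.0])   # positive weight on the 'sunny and yes' feature
print(p_zero, p_pos)
```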

Then solve the outer maximization problem $\max_{w} \Psi (w)$ and denote its solution by $w^{*}$:
$$
w^{*} = \arg \max_{w} \Psi (w)
$$

Learning the maximum entropy model thus reduces to maximizing the dual function $\Psi (w)$, and $P^{*}=P_{w^{*}}=P_{w^{*}}(y|x)$, obtained from the optimal solution, is the final maximum entropy model.
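
One simple way to maximize $\Psi(w)$ is plain gradient ascent, using the standard result that $\partial \Psi / \partial w_{i} = E_{\hat{P}}(f_{i}) - E_{P_{w}}(f_{i})$. The sketch below is not the iterative scaling procedure used by the class later in this notebook. It runs on a hypothetical 2x2 table of joint counts with two hand-picked features, and the step size and iteration count are arbitrary choices. At the optimum the model expectations match the empirical ones.

```python
import numpy as np

# Hypothetical toy problem: 2 inputs, 2 labels, joint counts[x, y].
counts = np.array([[3.0, 1.0],
                   [1.0, 3.0]])
P_hat_xy = counts / counts.sum()
P_hat_x = P_hat_xy.sum(axis=1)

# Feature tensor F[i, x, y]: f_0 fires on (x=0, y=0), f_1 fires on (x=1, y=1).
F = np.zeros((2, 2, 2))
F[0, 0, 0] = 1.0
F[1, 1, 1] = 1.0

def model(w):
    """P_w(y|x) as an array of shape (n_x, n_y)."""
    scores = np.tensordot(w, F, axes=1)  # sum_i w_i f_i(x, y)
    unnorm = np.exp(scores - scores.max(axis=1, keepdims=True))
    return unnorm / unnorm.sum(axis=1, keepdims=True)

def model_expectation(w):
    """E_P(f_i) = sum_{x,y} P_hat(x) P_w(y|x) f_i(x, y) for every i."""
    return (F * (P_hat_x[:, None] * model(w))).sum(axis=(1, 2))

E_hat = (F * P_hat_xy).sum(axis=(1, 2))  # E_{P_hat}(f_i)

w = np.zeros(2)
for _ in range(500):
    w += 2.0 * (E_hat - model_expectation(w))  # ascent step along the gradient

E_model = model_expectation(w)
print(w, E_hat, E_model)
```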

In [12]:
import pandas as pd
import numpy as np
from collections import defaultdict

In [20]:
class MaxEnt:
    def __init__(self, max_iter=100):
        # training inputs
        self.X_ = None
        # distinct labels seen in training
        self.y_ = None
        # number of labels
        self.m = None   
        # number of feature columns (distinct input values + 1 for unseen values)
        self.n = None   
        # number of training samples
        self.N = None   
        # constant that scales the update step
        self.M = None
        # weights, shape (m, n)
        self.w = None
        # label -> index
        self.labels = defaultdict(int)
        # feature value -> index (index 0 is left for unseen values)
        self.features = defaultdict(int)
        # maximum number of iterations
        self.max_iter = max_iter

    # empirical expectation E_P_hat(f): (label, feature value) co-occurrence counts divided by N
    def _EP_hat_f(self, x, y):
        self.Pxy = np.zeros((self.m, self.n))
        self.Px = np.zeros(self.n)
        for x_, y_ in zip(x, y):
            # traverse every sample
            for x__ in set(x_):
                self.Pxy[self.labels[y_], self.features[x__]] += 1
                self.Px[self.features[x__]] += 1           
        self.EP_hat_f = self.Pxy/self.N
    
    # model expectation E_P(f), built from P_w(y|x) of each sample and the feature counts in self.Px
    # NOTE: self.Px holds counts over the whole training set, not the features of the current sample
    def _EP_f(self):
        self.EP_f = np.zeros((self.m, self.n))
        for X in self.X_:
            pw = self._pw(X)
            pw = pw.reshape(self.m, 1)
            px = self.Px.reshape(1, self.n)
            self.EP_f += pw*px / self.N
    
    # maximum entropy model P_w(y|x): softmax over labels of the summed weights of the active features
    def _pw(self, x):
        mask = np.zeros(self.n+1)
        for ix in x:
            mask[self.features[ix]] = 1
        tmp = self.w * mask[1:]
        pw = np.exp(np.sum(tmp, axis=1))
        Z = np.sum(pw)
        pw = pw/Z
        return pw

    # train with improved iterative scaling (IIS), using a constant M for the closed-form update
    def fit(self, x, y):
        self.X_ = x
        self.y_ = list(set(y))
        # map each distinct input value to a feature index starting at 1
        tmp = set(self.X_.flatten())
        self.features = defaultdict(int, zip(tmp, range(1, len(tmp)+1)))   
        self.labels = dict(zip(self.y_, range(len(self.y_))))
        self.n = len(self.features)+1  
        self.m = len(self.labels)
        self.N = len(x)  
        # calculate EP_hat_f
        self._EP_hat_f(x, y)
        # initialize the weight matrix
        self.w = np.zeros((self.m, self.n))
        # constant that scales the update step
        self.M = 100
        i = 0
        while i <= self.max_iter:
            # calculate EP_f under the current weights
            self._EP_f()
            # update step: sigma = (1/M) * log(EP_hat_f / EP_f), set to 0 where the ratio is undefined
            # errstate silences the divide-by-zero and log(0) warnings for entries that are zeroed anyway
            with np.errstate(divide='ignore', invalid='ignore'):
                tmp = np.true_divide(self.EP_hat_f, self.EP_f)
                tmp[tmp == np.inf] = 0
                tmp = np.nan_to_num(tmp)
                sigma = np.where(tmp != 0, 1/self.M*np.log(tmp), 0)  
            # apply the update to the weights
            self.w = self.w + sigma
            i += 1
        print('training done.')
        return self

    def predict(self, x):
        res = np.zeros(len(x), dtype=np.int64)
        for ix, x_ in enumerate(x):
            tmp = self._pw(x_)
            print(tmp, np.argmax(tmp), self.labels)
            res[ix] = self.labels[self.y_[np.argmax(tmp)]]
        return np.array([self.y_[ix] for ix in res])
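
The update computed inside `fit` can also be isolated in a small helper that masks out undefined ratios before taking the logarithm. The following is a sketch only: the function name `scaling_step` and the example arrays are hypothetical. It is meant to reproduce the values of `sigma` in the class (zero wherever either expectation is zero) without relying on `inf`/`nan` cleanup.

```python
import numpy as np

def scaling_step(E_hat, E_model, M):
    """delta = (1/M) * log(E_hat / E_model), with 0 where either expectation is 0."""
    E_hat = np.asarray(E_hat, dtype=float)
    E_model = np.asarray(E_model, dtype=float)
    delta = np.zeros_like(E_hat)
    ok = (E_hat > 0) & (E_model > 0)  # entries where the log-ratio is defined
    delta[ok] = np.log(E_hat[ok] / E_model[ok]) / M
    return delta

# Hypothetical expectations for three features and M = 2.
delta = scaling_step([0.4, 0.0, 0.2], [0.2, 0.1, 0.0], M=2)
print(delta)
```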

In [14]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
raw_data = load_iris()
X, labels = raw_data.data, raw_data.target
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=43)
print(X_train.shape, y_train.shape)

(105, 4) (105,)


In [21]:
from sklearn.metrics import accuracy_score
maxent = MaxEnt()
maxent.fit(X_train, y_train)
y_pred = maxent.predict(X_test)
print(accuracy_score(y_test, y_pred))

training done.
[0.87116843 0.04683368 0.08199789] 0 {0: 0, 1: 1, 2: 2}
[0.00261138 0.49573305 0.50165557] 2 {0: 0, 1: 1, 2: 2}
[0.12626693 0.017157   0.85657607] 2 {0: 0, 1: 1, 2: 2}
[1.55221378e-04 4.45985560e-05 9.99800180e-01] 2 {0: 0, 1: 1, 2: 2}
[7.29970746e-03 9.92687370e-01 1.29226740e-05] 1 {0: 0, 1: 1, 2: 2}
[0.01343943 0.01247887 0.9740817 ] 2 {0: 0, 1: 1, 2: 2}
[0.85166079 0.05241898 0.09592023] 0 {0: 0, 1: 1, 2: 2}
[0.00371481 0.00896982 0.98731537] 2 {0: 0, 1: 1, 2: 2}
[2.69340079e-04 9.78392776e-01 2.13378835e-02] 1 {0: 0, 1: 1, 2: 2}
[0.01224702 0.02294254 0.96481044] 2 {0: 0, 1: 1, 2: 2}
[0.00323508 0.98724246 0.00952246] 1 {0: 0, 1: 1, 2: 2}
[0.00196548 0.01681989 0.98121463] 2 {0: 0, 1: 1, 2: 2}
[0.00480966 0.00345107 0.99173927] 2 {0: 0, 1: 1, 2: 2}
[0.00221101 0.01888735 0.97890163] 2 {0: 0, 1: 1, 2: 2}
[9.87528545e-01 3.25313387e-04 1.21461416e-02] 0 {0: 0, 1: 1, 2: 2}
[3.84153917e-05 5.25603786e-01 4.74357798e-01] 1 {0: 0, 1: 1, 2: 2}
[0.91969448 0.00730851 0.0729

In [28]:
# maximum entropy model in the maxentropy package
import maxentropy
# The example below is left disabled; `f` (the feature functions) is not defined in this notebook.
# samplespace = np.arange(6) + 1
# model = maxentropy.Model(samplespace)
# model.verbose = True
# # set the expectation value
# K=[4.5]
# model.fit(f, K)
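
The disabled cell above uses the sample space $\{1,\dots,6\}$ and a target expectation of 4.5, which suggests a die with a prescribed mean. The sketch below solves that problem with NumPy only and does not use the `maxentropy` package. It assumes the single feature is $f(x)=x$, which the notebook does not state. Under that assumption the maximum entropy distribution has the form $p_{i} \propto \exp(\lambda x_{i})$, and $\lambda$ can be found by bisection because the mean increases with $\lambda$.

```python
import numpy as np

samplespace = np.arange(6) + 1
target_mean = 4.5  # assumed constraint: E[x] = 4.5

def gibbs(lam):
    """Distribution p_i proportional to exp(lam * x_i) on the sample space."""
    unnorm = np.exp(lam * samplespace)
    return unnorm / unnorm.sum()

# Bisection on lam: the mean of gibbs(lam) is increasing in lam.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if gibbs(mid) @ samplespace < target_mean:
        lo = mid
    else:
        hi = mid

lam = 0.5 * (lo + hi)
p = gibbs(lam)
print(lam, p, p @ samplespace)
```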