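The entropy weight method is one of the classic algorithms for computing indicator weights. It uses information entropy to measure how dispersed the values of an indicator are. **The more dispersed an indicator is, the more information it carries, the lower its uncertainty, and the smaller its entropy; the less information it carries, the higher its uncertainty and the larger its entropy.** Because of this property, entropy can be used to judge the randomness and disorder of an event, and also to judge the dispersion of an indicator: the more dispersed an indicator is, the greater its influence on the composite evaluation.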
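To make the relationship between dispersion and entropy concrete, here is a minimal sketch that compares the normalized entropy of two made-up distributions. The helper name `normalized_entropy` and both example vectors are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def normalized_entropy(p):
    """Shannon entropy of a discrete distribution, divided by ln(n) so it lies in [0, 1]."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # convention: 0 * ln(0) is treated as 0
    return float(-(nz * np.log(nz)).sum() / np.log(p.size))

uniform = [0.25, 0.25, 0.25, 0.25]  # every record contributes equally
skewed = [0.97, 0.01, 0.01, 0.01]   # one record dominates

e_uniform = normalized_entropy(uniform)
e_skewed = normalized_entropy(skewed)
print(e_uniform, e_skewed)
```

The uniform vector gives an entropy of about 1, so an indicator like it would receive almost no weight. The skewed vector gives a much smaller entropy and would therefore be weighted more heavily.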

### Implementation steps

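1. Suppose the data has n records and m variables, so it can be represented as an n×m matrix A (n rows and m columns, i.e. n records and m feature columns).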

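2. Normalize the data with min-max scaling. $x_{ij}$ denotes the element in row i, column j of matrix A, and $x_j$ denotes column j. The normalized value replaces the original one and is still written $x_{ij}$ in the following steps.

$$
x_{ij}=\frac{x_{ij}-\min(x_j)}{\max(x_j)-\min(x_j)}
$$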

3. Compute the proportion of the i-th record under the j-th indicator.
$$
P_{ij}=\frac{x_{ij}}{\sum_{i=1}^{n} x_{ij}} \quad (i=1,2,\dots,n;\ j=1,2,\dots,m)
$$

4. Compute the entropy of the j-th indicator, taking $P_{ij}\ln(P_{ij})=0$ when $P_{ij}=0$.
$$
e_{j}=-k\sum_{i=1}^{n} P_{ij}\ln(P_{ij}), \quad k=\frac{1}{\ln n}
$$

5. Compute the divergence coefficient of the j-th indicator.
$$
g_j = 1-e_j
$$

6. Compute the weight of the j-th indicator.
$$
W_j = \frac{g_j}{\sum_{j=1}^{m} g_j}
$$
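The six steps above can be traced on a small example. The following is a minimal NumPy sketch under stated assumptions: the 4×3 matrix `A` is made-up toy data, and the variable names (`X`, `P`, `e`, `g`, `W`) simply mirror the symbols in the formulas. It is meant as an illustration, not as the notebook's own implementation.

```python
import numpy as np

# Step 1: toy data, 4 records (rows) x 3 indicators (columns); values are made up.
A = np.array([
    [1.0, 10.0, 5.0],
    [2.0, 10.5, 0.0],
    [3.0, 11.0, 0.0],
    [4.0, 30.0, 0.0],
])
n, m = A.shape

# Step 2: min-max normalization of each column
X = (A - A.min(axis=0)) / (A.max(axis=0) - A.min(axis=0))

# Step 3: proportion of each record within its column
P = X / X.sum(axis=0)

# Step 4: entropy of each column, treating 0 * ln(0) as 0
k = 1.0 / np.log(n)
plogp = np.zeros_like(P)
mask = P > 0
plogp[mask] = P[mask] * np.log(P[mask])
e = -k * plogp.sum(axis=0)

# Step 5: divergence coefficient
g = 1.0 - e

# Step 6: weights, which sum to 1
W = g / g.sum()
print(W)
```

In this toy matrix the third column is concentrated in a single record, so its entropy is 0 and it receives the largest weight. The evenly spread first column receives the smallest.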

In [2]:
import pandas as pd
import numpy as np
import math
from numpy import array

In [7]:
df = pd.read_csv('test.csv',encoding='gb2312')
df = df.dropna()  # dropna() returns a new DataFrame, so keep the result
df.head()

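|   | var1 | var2 | var3 | var4 | var5 | var6 |
|---|---|---|---|---|---|---|
| 0 | 171.33 | 151.33 | 0.28 | 0.0 | 106.36 | 0.05 |
| 1 | 646.66 | 370.0 | 1.07 | 61.0 | 1686.79 | 1.64 |
| 2 | 533.33 | 189.66 | 0.59 | 0.0 | 242.31 | 0.57 |
| 3 | 28.33 | 0.0 | 0.17 | 0.0 | 137.85 | 2.29 |
| 4 | 620.0 | 234.0 | 0.88 | 41.33 | 428.33 | 0.13 |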


In [8]:
# Define the entropy weight function
def cal_weight(x):
    '''Compute variable weights with the entropy weight method.'''
    # Min-max normalization of each column
    x = x.apply(lambda col: (col - np.min(col)) / (np.max(col) - np.min(col)))

    # Entropy constant k = 1 / ln(n)
    rows = x.index.size  # number of records
    cols = x.columns.size  # number of indicators
    k = 1.0 / math.log(rows)

    # Entropy contribution -k * p * ln(p) of every cell
    x = array(x)
    col_sums = x.sum(axis=0)  # computed once instead of inside the loop
    lnf = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            if x[i][j] == 0:
                lnfij = 0.0  # convention: 0 * ln(0) = 0
            else:
                p = x[i][j] / col_sums[j]
                lnfij = math.log(p) * p * (-k)
            lnf[i][j] = lnfij
    E = pd.DataFrame(lnf)

    # Redundancy (divergence coefficient) d_j = 1 - e_j
    d = 1 - E.sum(axis=0)
    # Weight of each indicator
    w = pd.DataFrame(d / d.sum())
    # Composite scores per sample, based on the original data, are not computed here
    return w
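One edge case is worth noting: the min-max step divides by `max - min`, so a column whose values are all identical produces a division by zero and NaN values downstream. A minimal sketch of one possible guard is to drop such columns before computing weights. The helper `drop_constant_columns` and the `demo` frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def drop_constant_columns(df):
    """Return df without columns whose max equals min (hypothetical helper)."""
    span = df.max() - df.min()
    return df.loc[:, span != 0]

# Made-up example: column "b" is constant and would break min-max scaling.
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]})
cleaned = drop_constant_columns(demo)
print(list(cleaned.columns))
```

Dropping the column is only one option. A constant indicator cannot distinguish between records, so assigning it a weight of zero would be an equally reasonable choice.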


In [12]:
# Compute the weight of each column in df
w = cal_weight(df)  # call cal_weight
w.index = df.columns
w.columns = ['weight']
w    # w.sum() == 1

|   | weight |
|---|---|
| var1 | 0.088485 |
| var2 | 0.07484 |
| var3 | 0.140206 |
| var4 | 0.410843 |
| var5 | 0.144374 |
| var6 | 0.141251 |
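A comment inside `cal_weight` mentions computing a composite score for each sample, but the notebook does not show that step. The sketch below is one possible way to do it, assuming the score is the weighted sum of the min-max normalized indicators. That choice is an assumption, since the notebook's comment refers to the original data without giving a formula. The five rows are the ones displayed by `df.head()`, and the weights are the values reported above. The names `sample`, `weights`, `normalized` and `score` are illustrative.

```python
import pandas as pd

# First five rows as displayed by df.head(); the full test.csv is not shown.
sample = pd.DataFrame({
    "var1": [171.33, 646.66, 533.33, 28.33, 620.0],
    "var2": [151.33, 370.0, 189.66, 0.0, 234.0],
    "var3": [0.28, 1.07, 0.59, 0.17, 0.88],
    "var4": [0.0, 61.0, 0.0, 0.0, 41.33],
    "var5": [106.36, 1686.79, 242.31, 137.85, 428.33],
    "var6": [0.05, 1.64, 0.57, 2.29, 0.13],
})

# Weights copied from the output table above
weights = pd.Series({
    "var1": 0.088485, "var2": 0.07484, "var3": 0.140206,
    "var4": 0.410843, "var5": 0.144374, "var6": 0.141251,
})

# Assumption: score = weighted sum of min-max normalized indicators
normalized = (sample - sample.min()) / (sample.max() - sample.min())
score = normalized.mul(weights, axis=1).sum(axis=1)
print(score.sort_values(ascending=False))
```

Because the normalization here uses only these five rows, the scores will differ from those computed on the full dataset.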


In [19]:
w.sort_values(by='weight', ascending=False)  # sort in descending order


|   | weight |
|---|---|
| var4 | 0.410843 |
| var5 | 0.144374 |
| var6 | 0.141251 |
| var3 | 0.140206 |
| var1 | 0.088485 |
| var2 | 0.07484 |
