Habr article: [Feature selection methods](https://habr.com/post/264915/)

# 1. Imports

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [2]:
import pandas as pd
import numpy as np

In [3]:
from sklearn.datasets import load_breast_cancer, load_wine

# 2. Data
## 2.1. Data Import

In [4]:
X, y = load_breast_cancer(return_X_y=True)

In [5]:
X.shape
y.shape

(569, 30)

(569,)

In [6]:
X

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

# Filters
## Information gain

Entropy formula:

$$H(X) = - \sum_{x_i \in X}{p(x_i)} * \log_{2}{p(x_i)}$$

where $p(x_i)$ is the probability that the variable $X$ takes the value $x_i$.

In a classification task this probability equals the relative frequency of the class. In other words, it is the number of observations in which $X = x_i$ divided by the total number of observations.
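As a quick check of the formula, here is a minimal sketch that estimates entropy from relative frequencies. The helper name `entropy_bits` and the two toy samples are illustrative assumptions, and the logarithm is taken in base 2 as in the formula.

```python
import numpy as np

def entropy_bits(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    _, counts = np.unique(values, return_counts=True)
    probs = counts / counts.sum()            # relative frequencies p(x_i)
    # sum of p * log2(1/p) is the same as -sum of p * log2(p)
    return float(np.sum(probs * np.log2(1.0 / probs)))

fair_coin = ["H", "T", "H", "T"]   # both outcomes equally frequent
one_sided = ["H", "H", "H", "H"]   # a single outcome

print(entropy_bits(fair_coin))     # 1.0
print(entropy_bits(one_sided))     # 0.0
```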

<font color='red'>
To get a better feel for this measure, consider two simple examples. First, take a coin for which heads and tails are equally likely. In this case the entropy computed by the formula equals 1 (one bit, since the logarithm is base 2). If instead the coin always lands heads up, the entropy equals 0. In other words, high entropy indicates a distribution close to uniform, while low entropy indicates a distribution concentrated on a few values, which is the more interesting case.

To compute the correlation between variables we need to define a few more measures. The first of them is the specific conditional entropy:
</font>
$$H(Y|X = x_i)$$

<font color='red'>
This is the entropy $H(Y)$ computed only over the records for which $X = x_i$.
</font>

Formula for the conditional entropy:

$$H(Y|X) = \sum_{x_i \in X}{p(x_i)} * H(Y|X = x_i)$$

<font color='red'>
This quantity is interesting not in itself but through its difference from the ordinary entropy of Y. That difference measures how much more ordered the variable Y becomes once we know the values of X. Put more simply, it tells us whether the values of X and Y are related and how strongly. This is what the information gain expresses:
</font>

$$IG(Y|X) = H(Y) - H(Y|X)$$

<font color='red'>
The larger the IG, the stronger the dependence. So we can compute the information gain for every feature and discard the ones that have little influence on the target variable. This reduces model training time and also lowers the risk of overfitting.
</font>
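The three formulas can be traced by hand on a toy example. The sketch below is an illustration only: the helper names and the four-record arrays are assumptions, and the entropy is taken in base 2 so the numbers come out round.

```python
import numpy as np

def entropy_bits(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    _, counts = np.unique(values, return_counts=True)
    probs = counts / counts.sum()
    return float(np.sum(probs * np.log2(1.0 / probs)))

def information_gain_bits(y, x):
    """IG(Y|X) = H(Y) - sum_i p(x_i) * H(Y | X = x_i), in bits."""
    y, x = np.asarray(y), np.asarray(x)
    h_y_given_x = 0.0
    for value in np.unique(x):
        mask = x == value
        # mask.mean() is p(x_i); entropy of y[mask] is H(Y | X = x_i)
        h_y_given_x += mask.mean() * entropy_bits(y[mask])
    return entropy_bits(y) - h_y_given_x

y = [0, 0, 1, 1]
x_informative = [5, 5, 7, 7]   # knowing x determines y completely
x_useless = [5, 7, 5, 7]       # knowing x says nothing about y

ig_informative = information_gain_bits(y, x_informative)   # 1.0, equal to H(Y)
ig_useless = information_gain_bits(y, x_useless)           # 0.0
print(ig_informative, ig_useless)
```

The first feature removes all uncertainty about `y`, so its gain equals $H(Y)$; the second leaves $H(Y|X) = H(Y)$, so its gain is zero.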

In [7]:
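def entropy(arr, base=None):
    """Entropy of the empirical distribution of arr (natural log unless base is given)."""
    length = len(arr)
    uniqs, freqs = np.unique(arr, return_counts=True)
    # relative frequency of each distinct value
    probs = freqs / length
    # log-probabilities, in nats
    log_probs = np.log(probs)
    # convert to the requested base, e.g. base=2 for bits
    if base:
        log_probs /= np.log(base)
    H = -np.sum(probs * log_probs)
    return H

def spec_cond_entropy(arr, feature, value):
    """H(Y | X = value): entropy of arr over the records where feature == value."""
    mask = (feature == value)
    if not mask.any():
        raise ValueError(f'value {value!r} does not occur in feature')
    return entropy(arr[mask])

def cond_entropy(arr, feature):
    """H(Y | X): specific conditional entropies weighted by p(x_i)."""
    length = len(feature)
    uniqs, freqs = np.unique(feature, return_counts=True)
    # p(x_i) for each distinct feature value
    probs = freqs / length
    # H(Y | X = x_i) for each distinct feature value
    H_YX_xi = np.array([spec_cond_entropy(arr, feature, uniq) for uniq in uniqs])
    H_YX = np.sum(probs * H_YX_xi)
    return H_YX

def information_gain(arr, feature):
    """IG(Y|X) = H(Y) - H(Y|X)."""
    H_Y = entropy(arr)
    H_YX = cond_entropy(arr, feature)
    IG_YX = H_Y - H_YX
    return IG_YX

def mutual_info_cl(X, y):
    '''
    Information gain of y with respect to each column of X.
    Only discrete features
    '''
    mut_inf = np.array([information_gain(y, feature) for feature in X.T])

    return mut_inf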

In [8]:
ig_lst = []
for feature in X.T:
    ig_lst.append(information_gain(y, feature))

In [9]:
np.argsort(ig_lst)[::-1]

array([ 7, 14, 23,  6, 10, 12, 26,  3, 16,  2, 13, 15, 25, 19, 27,  5, 20,
       22, 29, 11, 21,  0, 17, 28,  9,  1, 18,  4,  8, 24])

In [10]:
ig_lst[24]
ig_lst[7]

0.4988711330717472

0.6530072400856323

In [11]:
ig_arr = mutual_info_cl(X, y)

ig_arr

array([0.59664833, 0.57927133, 0.64234217, 0.64326176, 0.53800617,
       0.63016032, 0.6481345 , 0.65300724, 0.50948857, 0.57959374,
       0.64721491, 0.5990847 , 0.64569813, 0.64142258, 0.6481345 ,
       0.6399058 , 0.64326176, 0.59513155, 0.56708948, 0.63595265,
       0.62404558, 0.59848751, 0.62285121, 0.6481345 , 0.49887113,
       0.63654984, 0.64477854, 0.63411347, 0.58598326, 0.61797847])

In [12]:
ig_arr.argsort()[::-1]

array([ 7, 14, 23,  6, 10, 12, 26,  3, 16,  2, 13, 15, 25, 19, 27,  5, 20,
       22, 29, 11, 21,  0, 17, 28,  9,  1, 18,  4,  8, 24])

In [13]:
from sklearn.feature_selection import mutual_info_classif

In [14]:
mut_inf = mutual_info_classif(X, y, discrete_features=True)

In [15]:
mut_inf

array([0.59664833, 0.57927133, 0.64234217, 0.64326176, 0.53800617,
       0.63016032, 0.6481345 , 0.65300724, 0.50948857, 0.57959374,
       0.64721491, 0.5990847 , 0.64569813, 0.64142258, 0.6481345 ,
       0.6399058 , 0.64326176, 0.59513155, 0.56708948, 0.63595265,
       0.62404558, 0.59848751, 0.62285121, 0.6481345 , 0.49887113,
       0.63654984, 0.64477854, 0.63411347, 0.58598326, 0.61797847])

In [16]:
np.argsort(mut_inf)[::-1]

array([ 7, 14, 23,  6, 10, 12, 26,  3, 16,  2, 13, 15, 25, 19, 27,  5, 20,
       22, 29, 11, 21,  0, 17, 28,  9,  1, 18,  4,  8, 24])
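One caveat when reading these scores: the features in this dataset are continuous measurements, so treating every distinct value as its own category leaves most groups with a single record. $H(Y|X)$ then drops towards 0 and the information gain approaches $H(Y)$ for almost any feature. A common workaround is to bin the feature first. The sketch below is an assumption-laden illustration on synthetic data (not the notebook's dataset); `quantile_bin` is a hypothetical helper and the choice of 5 equal-frequency bins is arbitrary.

```python
import numpy as np

def information_gain(y, x):
    """IG(Y|X) in nats for a discrete feature x (both numpy arrays)."""
    def entropy(values):
        _, counts = np.unique(values, return_counts=True)
        probs = counts / counts.sum()
        return float(np.sum(probs * np.log(1.0 / probs)))
    h_y_given_x = sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))
    return entropy(y) - h_y_given_x

def quantile_bin(x, n_bins=5):
    """Map a continuous feature to bin labels 0..n_bins-1 (equal-frequency bins)."""
    inner_edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(x, inner_edges)

rng = np.random.default_rng(0)                  # synthetic data for illustration
y = rng.integers(0, 2, size=500)
noise = rng.normal(size=500)                    # independent of y
signal = y + rng.normal(scale=0.5, size=500)    # depends on y

h_y = information_gain(y, y)                    # IG(Y|Y) equals H(Y)
ig_noise_raw = information_gain(y, noise)       # equals H(Y): every value is unique
ig_noise_binned = information_gain(y, quantile_bin(noise))     # close to 0
ig_signal_binned = information_gain(y, quantile_bin(signal))   # clearly larger
print(h_y, ig_noise_raw, ig_noise_binned, ig_signal_binned)
```

After binning, the pure-noise feature scores near zero while the informative one keeps a clearly higher score, which is the separation a filter needs.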

## Spearman correlation coefficient

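$$\rho (x, y) = \frac{\sum_{i}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i}(x_{i} - \bar{x})^2 \sum_{i}(y_{i} - \bar{y})^2}}$$

For Spearman's coefficient, $x_i$ and $y_i$ are the ranks of the observations; applied to the raw values, the same formula gives Pearson's coefficient.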

In [None]:
from scipy.stats import pearsonr

In [17]:
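def spearmen_rank_correlation(x_arr, y_arr):
    # NOTE: no ranking is done here, so on raw values this is Pearson's r;
    # pass ranks (e.g. from scipy.stats.rankdata) to get Spearman's rho.
    x_mean, y_mean = np.mean(x_arr), np.mean(y_arr)
    # center both arrays around their means
    x_arr_centr = (x_arr - x_mean)
    y_arr_centr = (y_arr - y_mean)

    # covariance term divided by the product of the two spreads
    return np.sum(x_arr_centr * y_arr_centr) / np.sqrt(np.sum(x_arr_centr ** 2) * np.sum(y_arr_centr ** 2))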
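The function above applies the formula to the values exactly as they are passed in. To get a rank correlation, the inputs have to be replaced by their ranks first. The sketch below shows one way to do that with NumPy only; `average_ranks`, `pearson`, and `spearman` are hypothetical helper names, and tied values are assumed to share the mean of their ranks.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    values = np.asarray(values)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)        # highest rank inside each tie group
    first = last - counts + 1       # lowest rank inside each tie group
    return ((first + last) / 2.0)[inverse]

def pearson(x, y):
    """The correlation formula applied to the values as given."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def spearman(x, y):
    """Spearman's rho: the same formula computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3                          # monotone but non-linear in x

print(average_ranks([10, 20, 20, 30]))   # ranks 1, 2.5, 2.5, 4
print(pearson(x, y))                     # below 1: the relation is not linear
print(spearman(x, y))                    # 1.0: the relation is monotone
```

The two coefficients differ whenever the relation is monotone but not linear, which explains why values computed on raw data do not match `df.corr('spearman')`.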

In [19]:
[spearmen_rank_correlation(x, y) for x in X.T]

[-0.730028511375456,
 -0.4151852998452045,
 -0.7426355297258331,
 -0.7089838365853907,
 -0.35855996508593224,
 -0.5965336775082528,
 -0.6963597071719053,
 -0.7766138400204362,
 -0.3304985542625469,
 0.012837602698432376,
 -0.5671338208247172,
 0.008303332973877451,
 -0.5561407034314833,
 -0.5482359402780248,
 0.06701601057948735,
 -0.29299924424885854,
 -0.2537297659808307,
 -0.40804233271650475,
 0.0065217558706479415,
 -0.07797241739025611,
 -0.7764537785950394,
 -0.45690282139679833,
 -0.7829141371737588,
 -0.7338250349210514,
 -0.4214648610664028,
 -0.5909982378417925,
 -0.6596102103692342,
 -0.7935660171412697,
 -0.41629431104861936,
 -0.3238721887208241]

In [24]:
def corr(X_arr):
    X_arr = np.asarray(X_arr)
    n_cols = X_arr.shape[1]
    corr_matrix = np.zeros((n_cols, n_cols))
    # fill the matrix with the pairwise coefficient of every two columns
    for row in range(n_cols):
        for col in range(n_cols):
            corr_matrix[row, col] = spearmen_rank_correlation(X_arr[:, row], X_arr[:, col])

    return corr_matrix

In [28]:
corr(np.c_[X, y])

array([[ 1.00000000e+00,  3.23781891e-01,  9.97855281e-01,
         9.87357170e-01,  1.70581187e-01,  5.06123578e-01,
         6.76763550e-01,  8.22528522e-01,  1.47741242e-01,
        -3.11630826e-01,  6.79090388e-01, -9.73174431e-02,
         6.74171616e-01,  7.35863663e-01, -2.22600125e-01,
         2.05999980e-01,  1.94203623e-01,  3.76168956e-01,
        -1.04320881e-01, -4.26412691e-02,  9.69538973e-01,
         2.97007644e-01,  9.65136514e-01,  9.41082460e-01,
         1.19616140e-01,  4.13462823e-01,  5.26911462e-01,
         7.44214198e-01,  1.63953335e-01,  7.06588569e-03,
        -7.30028511e-01],
       [ 3.23781891e-01,  1.00000000e+00,  3.29533059e-01,
         3.21085696e-01, -2.33885160e-02,  2.36702222e-01,
         3.02417828e-01,  2.93464051e-01,  7.14009805e-02,
        -7.64371834e-02,  2.75868676e-01,  3.86357623e-01,
         2.81673115e-01,  2.59844987e-01,  6.61377735e-03,
         1.91974611e-01,  1.43293077e-01,  1.63851025e-01,
         9.12716776e-03,  5.44

In [44]:
corr_lst = [abs(spearmen_rank_correlation(x, y)) for x in X.T]
np.argsort(corr_lst)[::-1]

array([27, 22,  7, 20,  2, 23,  0,  3,  6, 26,  5, 25, 10, 12, 13, 21, 24,
       28,  1, 17,  4,  8, 29, 15, 16, 19, 14,  9, 11, 18])

In [46]:
corr_lst[27]
corr_lst[18]

0.7935660171412697

0.0065217558706479415
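Once every feature has a score, whether an information gain or an absolute correlation with the target, a filter keeps the best-scoring columns. The sketch below assumes a simple top-k rule; `top_k_features`, the toy scores, and the toy matrix are made up for illustration and are not taken from the outputs above.

```python
import numpy as np

def top_k_features(scores, k):
    """Indices of the k largest scores, best first (hypothetical helper)."""
    order = np.argsort(scores)[::-1]
    return order[:k]

# Made-up scores for five features and a toy data matrix.
scores = np.array([0.10, 0.80, 0.05, 0.60, 0.30])
X_toy = np.arange(15).reshape(3, 5)     # 3 samples, 5 features

keep = top_k_features(scores, k=2)      # features 1 and 3
X_reduced = X_toy[:, keep]              # shape (3, 2)
print(keep, X_reduced.shape)
```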

In [26]:
df = pd.DataFrame(np.c_[X, y])
df.corr('spearman')

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.340956 | 0.997802 | 0.999602 | 0.14851 | 0.497578 | 0.645728 | 0.759702 | 0.120242 | -0.349931 | ... | 0.314911 | 0.971555 | 0.978863 | 0.125789 | 0.491357 | 0.596043 | 0.727265 | 0.174698 | 0.044564 | -0.732785 |
| 1 | 0.340956 | 1.0 | 0.348142 | 0.344145 | 0.024649 | 0.266499 | 0.342646 | 0.306891 | 0.11013 | -0.059303 | ... | 0.909218 | 0.375273 | 0.368335 | 0.101401 | 0.290917 | 0.339725 | 0.319235 | 0.120693 | 0.116144 | -0.461971 |
| 2 | 0.997802 | 0.348142 | 1.0 | 0.997068 | 0.182923 | 0.543925 | 0.681958 | 0.788629 | 0.150049 | -0.304891 | ... | 0.323109 | 0.97898 | 0.980864 | 0.156611 | 0.534565 | 0.632106 | 0.757526 | 0.199007 | 0.088961 | -0.748496 |
| 3 | 0.999602 | 0.344145 | 0.997068 | 1.0 | 0.138053 | 0.488988 | 0.642557 | 0.755165 | 0.113928 | -0.358425 | ... | 0.318178 | 0.971822 | 0.980264 | 0.119712 | 0.485813 | 0.593736 | 0.72339 | 0.17086 | 0.038758 | -0.734122 |
| 4 | 0.14851 | 0.024649 | 0.182923 | 0.138053 | 1.0 | 0.678806 | 0.518511 | 0.565172 | 0.542228 | 0.588465 | ... | 0.060645 | 0.226345 | 0.191735 | 0.796085 | 0.481384 | 0.429107 | 0.498868 | 0.393579 | 0.511457 | -0.371892 |
| 5 | 0.497578 | 0.266499 | 0.543925 | 0.488988 | 0.678806 | 1.0 | 0.896518 | 0.848295 | 0.552203 | 0.499195 | ... | 0.255305 | 0.592254 | 0.53159 | 0.578902 | 0.901029 | 0.837921 | 0.825473 | 0.450333 | 0.688986 | -0.609288 |
| 6 | 0.645728 | 0.342646 | 0.681958 | 0.642557 | 0.518511 | 0.896518 | 1.0 | 0.927352 | 0.446793 | 0.258174 | ... | 0.335866 | 0.722424 | 0.676628 | 0.488775 | 0.849985 | 0.938543 | 0.904938 | 0.383667 | 0.541838 | -0.733308 |
| 7 | 0.759702 | 0.306891 | 0.788629 | 0.755165 | 0.565172 | 0.848295 | 0.927352 | 1.0 | 0.423767 | 0.142659 | ... | 0.300562 | 0.81396 | 0.780395 | 0.490035 | 0.758309 | 0.827281 | 0.937075 | 0.355477 | 0.42111 | -0.777877 |
| 8 | 0.120242 | 0.11013 | 0.150049 | 0.113928 | 0.542228 | 0.552203 | 0.446793 | 0.423767 | 1.0 | 0.428467 | ... | 0.11889 | 0.190526 | 0.154462 | 0.42423 | 0.440828 | 0.394481 | 0.397477 | 0.710359 | 0.410069 | -0.332567 |
| 9 | -0.349931 | -0.059303 | -0.304891 | -0.358425 | 0.588465 | 0.499195 | 0.258174 | 0.142659 | 0.428467 | 1.0 | ... | -0.047791 | -0.247456 | -0.304927 | 0.493474 | 0.403653 | 0.242611 | 0.139152 | 0.295046 | 0.760771 | 0.025903 |
