Here a random forest is used for feature selection, following the four-step selection procedure described in the literature:

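1. Remove the most strongly correlated features using the Pearson correlation coefficient
2. Remove the least important features using a random forest
3. Forward selection
4. Best subset selection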
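
Steps 2 to 4 are only named above, so the following is a minimal sketch of one plausible reading of them, not the procedure from the cited literature. The helper names (`cv_rmse`, `drop_least_important`, `forward_selection`, `best_subset`), the use of cross-validated RMSE as the selection score, the forest settings, and the synthetic demo data are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def cv_rmse(X, y, features, cv=5):
    """Cross-validated RMSE of a random forest restricted to `features` (assumed score)."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    scores = cross_val_score(model, X[list(features)], y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()


def drop_least_important(X, y, n_keep):
    """Step 2: refit repeatedly, dropping the feature with the lowest importance."""
    features = list(X.columns)
    while len(features) > n_keep:
        model = RandomForestRegressor(n_estimators=50, random_state=0)
        model.fit(X[features], y)
        features.pop(int(np.argmin(model.feature_importances_)))
    return features


def forward_selection(X, y, candidates, n_select):
    """Step 3: greedily add the feature that gives the lowest CV RMSE."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < n_select:
        best_feature = min(remaining, key=lambda f: cv_rmse(X, y, selected + [f]))
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected


def best_subset(X, y, candidates, size):
    """Step 4: score every subset of the given size and return the best one."""
    return min(combinations(candidates, size), key=lambda s: cv_rmse(X, y, s))


# Synthetic demo data (hypothetical): only columns "a" and "b" drive the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
y_demo = 3 * X_demo["a"] + 2 * X_demo["b"] + 0.1 * rng.normal(size=200)

kept = drop_least_important(X_demo, y_demo, n_keep=3)
selected = forward_selection(X_demo, y_demo, kept, n_select=2)
best = best_subset(X_demo, y_demo, kept, size=2)
print(kept, selected, best)
```

Best subset selection grows combinatorially with the number of candidates, which is presumably why it comes last, after the earlier steps have shrunk the feature set.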

In [1]:
from matminer.featurizers.composition import alloy
from matminer.featurizers.conversions import StrToComposition

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score, r2_score

from figrecipes import PlotlyFig
import pandas as pd
import numpy as np

In [12]:
data = pd.read_csv('data.csv')

# Convert formula to composition
data = StrToComposition().featurize_dataframe(data, 'formula')
# Then compute features from the composition
data = alloy.WenAlloys().featurize_dataframe(data, 'composition')

StrToComposition:   0%|          | 0/2000 [00:00<?, ?it/s]

WenAlloys:   0%|          | 0/2000 [00:00<?, ?it/s]

In [19]:
# Drop 'formula', 'C11', 'C12', 'C44', 'a', 'b', 'c', 'G', 'B', 'E', 'v', 'Zener', 'composition', 'Weight Fraction', 'Atomic Fraction' from the data
# data.drop(['formula', 'C11', 'C12', 'C44', 'a', 'b', 'c', 'G', 'B', 'E', 'v', 'Zener', 'composition', 'Weight Fraction', 'Atomic Fraction'], axis=1, inplace=True)

data.dropna(axis=1, how='any', inplace=True)

# Use the first 1500 rows for training and validation; the last 500 rows form the test set.
data_fit = data.iloc[:1500]
data_test = data.iloc[1500:]

In [20]:
data.columns

Index(['Nb', 'Mo', 'Ta', 'W', 'Pugh', 'Yang delta', 'Yang omega', 'APE mean',
       'Radii local mismatch', 'Radii gamma', 'Configuration entropy',
       'Atomic weight mean', 'Total weight', 'Lambda entropy',
       'Electronegativity delta', 'Electronegativity local mismatch',
       'VEC mean', 'Mixing enthalpy', 'Mean cohesive energy',
       'Interant electrons', 'Interant s electrons', 'Interant p electrons',
       'Interant d electrons', 'Interant f electrons', 'Shear modulus mean',
       'Shear modulus delta', 'Shear modulus local mismatch',
       'Shear modulus strength model'],
      dtype='object')

In [25]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 28 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Nb                                2000 non-null   int64  
 1   Mo                                2000 non-null   int64  
 2   Ta                                2000 non-null   int64  
 3   W                                 2000 non-null   int64  
 4   Pugh                              2000 non-null   float64
 5   Yang delta                        2000 non-null   float64
 6   Yang omega                        2000 non-null   float64
 7   APE mean                          2000 non-null   float64
 8   Radii local mismatch              2000 non-null   float64
 9   Radii gamma                       2000 non-null   float64
 10  Configuration entropy             2000 non-null   float64
 11  Atomic weight mean                2000 non-null   float64
 12  Total 

In [27]:
data['VEC mean']

0       5.42
1       5.62
2       5.63
3       5.50
4       5.34
        ... 
1995    5.49
1996    5.52
1997    5.55
1998    5.50
1999    5.34
Name: VEC mean, Length: 2000, dtype: float64

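## 1. Remove strongly correlated features

Compute the Pearson correlation coefficient between every pair of features, using 0.95 as the threshold. If the correlation between two features exceeds 0.95, keep only the one that gives the smaller modelling error.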
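
The filter described above can be sketched as follows. The text does not say how the "modelling error" of a feature is measured, so this sketch assumes the cross-validated RMSE of a random forest trained on that feature alone. The function names, the forest settings, and the synthetic demo data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def single_feature_error(X, y, feature, cv=5):
    """Assumed error measure: CV RMSE of a random forest using one feature only."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    scores = cross_val_score(model, X[[feature]], y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()


def drop_correlated(X, y, threshold=0.95):
    """For each pair with Pearson r above `threshold`, drop the higher-error feature."""
    corr = X.corr(method="pearson")
    cols = list(corr.columns)
    dropped = set()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in dropped or b in dropped:
                continue
            r = corr.loc[a, b]
            # NaN appears for constant columns; the comparison follows the text
            # literally (r > 0.95). Using abs(r) would also catch strong
            # negative correlations.
            if np.isnan(r) or r <= threshold:
                continue
            err_a = single_feature_error(X, y, a)
            err_b = single_feature_error(X, y, b)
            dropped.add(a if err_a > err_b else b)
    kept_cols = [c for c in cols if c not in dropped]
    return kept_cols, sorted(dropped)


# Synthetic demo data (hypothetical): "x1_noisy" is a near copy of "x1".
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X_demo = pd.DataFrame({
    "x1": x1,
    "x1_noisy": x1 + 0.05 * rng.normal(size=200),
    "x2": rng.normal(size=200),
})
y_demo = 2 * X_demo["x1"] + X_demo["x2"]

kept_cols, dropped_cols = drop_correlated(X_demo, y_demo)
print(kept_cols, dropped_cols)
```

On the notebook's data, the call would take the feature columns of `data_fit` and the chosen target column in place of `X_demo` and `y_demo`.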

In [18]:
data_fit.corr(method='pearson')

,Nb,Mo,Ta,W,Pugh,Yang delta,Yang omega,APE mean,Radii local mismatch,Radii gamma,...,Mean cohesive energy,Interant electrons,Interant s electrons,Interant p electrons,Interant d electrons,Interant f electrons,Shear modulus mean,Shear modulus delta,Shear modulus local mismatch,Shear modulus strength model
Nb,1.0,-0.318892,-0.333438,-0.374204,0.867078,-0.465959,-0.564729,-0.25095,-0.516958,-0.415125,...,-0.24308,,,,,,-0.761098,0.91695,0.433206,-0.828498
Mo,-0.318892,1.0,-0.340382,-0.343044,-0.414675,-0.586266,0.50588,0.575089,-0.495778,0.218043,...,-0.777568,,,,,,0.273248,-0.647357,-0.554282,0.719668
Ta,-0.333438,-0.340382,1.0,-0.289237,0.158796,0.228781,0.380938,-0.187563,0.18122,-0.643328,...,0.215326,,,,,,-0.290368,-0.108566,-0.5147,0.206509
W,-0.374204,-0.343044,-0.289237,1.0,-0.619616,0.834666,-0.29892,-0.136712,0.842257,0.82519,...,0.81211,,,,,,0.780007,-0.180392,0.609591,-0.075816
Pugh,0.867078,-0.414675,0.158796,-0.619616,1.0,-0.474657,-0.300321,-0.257702,-0.553879,-0.793358,...,-0.244735,,,,,,-0.965499,0.850657,0.077569,-0.678362
Yang delta,-0.465959,-0.586266,0.228781,0.834666,-0.474657,1.0,-0.226466,-0.411039,0.990344,0.446563,...,0.940203,,,,,,0.561353,-0.116392,0.460448,-0.091846
Yang omega,-0.564729,0.50588,0.380938,-0.29892,-0.300321,-0.226466,1.0,0.506716,-0.20004,-0.126819,...,-0.327614,,,,,,0.159504,-0.670994,-0.802131,0.770792
APE mean,-0.25095,0.575089,-0.187563,-0.136712,-0.257702,-0.411039,0.506716,1.0,-0.391685,0.180557,...,-0.399307,,,,,,0.227707,-0.505425,-0.495735,0.55948
Radii local mismatch,-0.516958,-0.495778,0.18122,0.842257,-0.553879,0.990344,-0.20004,-0.391685,1.0,0.505317,...,0.894015,,,,,,0.624634,-0.185363,0.457005,-0.032381
Radii gamma,-0.415125,0.218043,-0.643328,0.82519,-0.793358,0.446563,-0.126819,0.180557,0.505317,1.0,...,0.34194,,,,,,0.905323,-0.43771,0.419703,0.213957
