## Feature Ranking Methods Comparison

Using several datasets downloaded from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/), we compare three feature ranking methods for decision-tree-based classifiers:
* The default (or "global") feature importance;
* SHAP values;
* Our PCFI.

To do so, we first preprocess each dataset and fit a Random Forest classifier via GridSearchCV.
Then we select the top features ranked by each method (these are printed on screen) and refit the Random Forest model using the best parameters found by GridSearchCV. Finally, the methods are compared on the accuracy of the refitted model.
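
The fit, rank and refit loop described above can be sketched as follows. This is only a minimal illustration, not the code in `def_functions.py`: the dataset, the parameter grid, the split size and `n_top` are assumptions, and only the global feature importance is used for ranking.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Illustrative setup: dataset, grid, split size and n_top are assumptions.
X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=45
)

# 1) Tune a random forest over a small, hypothetical parameter grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=45), param_grid, cv=3)
search.fit(X_train, y_train)

# 2) Rank the features, here by the forest's global importance only.
importances = search.best_estimator_.feature_importances_
n_top = 3
top_features = X.columns[np.argsort(importances)[::-1][:n_top]]

# 3) Refit with the tuned parameters on the selected features and score.
refit = RandomForestClassifier(random_state=45, **search.best_params_)
refit.fit(X_train[top_features], y_train)
accuracy = accuracy_score(y_test, refit.predict(X_test[top_features]))
print(list(top_features), round(100 * accuracy, 2))
```

Swapping step 2 for a different ranking (SHAP or PCFI) while keeping steps 1 and 3 fixed is what makes the refitted accuracies comparable across methods.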

Similarly, we also compare the SHAP and PCFI methods when selecting the top features that are most important to the rare class in the dataset (this is not possible with global feature importance). The only difference is that the methods are scored with the f1_score on the rare class.
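
As a small sketch of rare-class scoring, the snippet below picks the least frequent label and restricts scikit-learn's `f1_score` to it. The toy labels are made up for illustration, and taking the rare class to be the least frequent label is an assumption about how it is defined here.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels (made up for illustration): class 2 is the rare class.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 2, 0])

# Assumption: the rare class is the least frequent label.
classes, counts = np.unique(y_true, return_counts=True)
rare_class = classes[np.argmin(counts)]

# Restrict the F1 computation to the rare class only.
rare_f1 = f1_score(y_true, y_pred, labels=[rare_class], average="macro")
print(rare_class, round(100 * rare_f1, 2))  # 2 66.67
```

Here the rare class has precision 1 (one prediction, correct) and recall 0.5 (one of two samples found), giving an F1 of 2/3.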

#### Results

SHAP and PCFI perform very similarly, and neither emerges as clearly better than the other.

When ranking features globally, out of 9 datasets, PCFI outperforms SHAP 3 times and the converse is true in 2 cases; the two methods tie on the remaining 4.

When the ranking methods are instead biased in favour of the rare class, each method outperforms the other once and the remaining 7 datasets are ties.
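
These counts can be checked with a short tally. The scores are copied from the cell outputs in this notebook; the dictionary layout and the `tally` helper are only illustrative.

```python
# Scores copied from the outputs in this notebook (percentages).
# Each entry maps a dataset to (SHAP score, PCFI score).
global_accuracy = {
    "Dermatology": (60.0, 76.67),
    "Students": (54.55, 69.7),
    "Tumor": (40.91, 36.36),
    "Flag": (61.22, 61.22),
    "Breast tissue": (77.78, 70.37),
    "Car": (80.32, 80.32),
    "WiFi": (97.6, 97.6),
    "Iris": (94.74, 94.74),
    "Wine": (97.78, 100.0),
}
rare_f1 = {
    "Dermatology": (100.0, 100.0),
    "Students": (0.0, 0.0),
    "Tumor": (53.33, 71.43),
    "Flag": (0.0, 0.0),
    "Breast tissue": (100.0, 100.0),
    "Car": (48.28, 48.28),
    "WiFi": (98.08, 98.08),
    "Iris": (100.0, 100.0),
    "Wine": (96.97, 96.77),
}


def tally(scores):
    """Return (SHAP wins, PCFI wins, ties) for a dict of score pairs."""
    shap_wins = sum(s > p for s, p in scores.values())
    pcfi_wins = sum(p > s for s, p in scores.values())
    ties = len(scores) - shap_wins - pcfi_wins
    return shap_wins, pcfi_wins, ties


print(tally(global_accuracy))  # (2, 3, 4)
print(tally(rare_f1))          # (1, 1, 7)
```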


In [8]:
# import custom functions and libraries
%run def_functions.py

proj_path = os.path.dirname(os.getcwd())
datasets = []

#### Dermatology dataset

In [11]:
col_names = np.array([
    'erythema', 'scaling', 'definite borders',
    'itching', 'koebner phenomenon', 'polygonal papules',
    'follicular papules', 'oral mucosal involvement', 'knee and elbow involvement',
    'scalp involvement', 'family history', 'melanin incontinence',
    'eosinophils in the infiltrate', 'PNL infiltrate', 'fibrosis of the papillary dermis',
    'exocytosis', 'acanthosis', 'hyperkeratosis',
    'parakeratosis', 'clubbing of the rete ridges', 'elongation of the rete ridges',
    'thinning of the suprapapillary epidermis', 'spongiform pustule', 'munro microabscess',
    'focal hypergranulosis', 'disappearance of the granular layer',
    'vacuolisation and damage of basal layer',
    'spongiosis', 'saw-tooth appearance of retes', 'follicular horn plug',
    'perifollicular parakeratosis', 'inflammatory mononuclear infiltrate',
    'band-like infiltrate',
    'Age', 'Class'
])
col_names = np.array([lab.capitalize() for lab in col_names])
feature_names = np.array(col_names[:-1])
class_col = col_names[-1]
class_names = np.array(['psoriasis', 'seborrheic dermatitis', 'lichen planus',
                        'pityriasis rosea', 'chronic dermatitis', 'pityriasis rubra pilaris'])
class_names = np.array([lab.capitalize() for lab in class_names])
class_tags = np.arange(len(class_names)) + 1
data = pd.read_csv(proj_path+'/0_data/dermatology.data.csv', header=None, names=col_names)
# drop rows where Age is missing (encoded as '?'); copy to avoid assigning to a view
data = data[data.Age != '?'].copy()
data['Age'] = data['Age'].astype(int)

rnd_seed = 45
data_info = ['Dermatology', data, feature_names, class_col, class_names, class_tags, rnd_seed]
datasets.append(data_info)


###


np.random.seed(rnd_seed)
feature_names = np.random.permutation(feature_names)###PERMUTE FEATURES
rf_gscv, top_features_lst, rare_class = fitRF_and_rank(data, feature_names, class_col, rnd_seed)
print('Method top selected features:')
for lab,feat in zip(['Global: ', 'Shap: ', 'Pcfi: ', 'Rare class Shap: ', 'Rare class Pcfi: '], top_features_lst):
    print(lab,feat)
label_lst = ['Global', 'Shap', 'Pcfi', 'Rare Shap', 'Rare Pfci']
is_global_score = [True, True, True, False, False]
print('\n')
scores = [refitRF(data, top_features, class_col, rf_gscv, rare_class, rnd_seed, global_score=global_score)
             for global_score,top_features in zip(is_global_score,top_features_lst)]
print('Feature ranking method accuracy:')
for lab,score in zip(label_lst[:3], scores[:3]):
    print(lab, score)
print('Feature ranking method f1_score for rare class')
for lab,score in zip(label_lst[3:], scores[3:]):
    print(lab, score)

Is the dataset balanced? False
Classes and counts: [1 2 3 4 5 6] [111  60  71  48  48  20]
Classes count in train: (array([1, 2, 3, 4, 5, 6]), array([80, 46, 58, 34, 36, 14]))
Classes count in test: (array([1, 2, 3, 4, 5, 6]), array([31, 14, 13, 14, 12,  6]))
Best score: 0.9776001512352437
Model performances:
Accuracy: 98.89


Setting feature_perturbation = "tree_path_dependent" because no background data was given.


Do y and y_train have same rare class? True
Method top selected features:
Global:  ['Clubbing of the rete ridges' 'Thinning of the suprapapillary epidermis'
 'Fibrosis of the papillary dermis']
Shap:  ['Clubbing of the rete ridges' 'Elongation of the rete ridges'
 'Thinning of the suprapapillary epidermis']
Pcfi:  ['Fibrosis of the papillary dermis' 'Clubbing of the rete ridges'
 'Koebner phenomenon']
Rare class Shap:  ['Perifollicular parakeratosis' 'Follicular horn plug'
 'Follicular papules']
Rare class Pcfi:  ['Perifollicular parakeratosis' 'Follicular horn plug'
 'Follicular papules']


Feature ranking method accuracy:
Global 63.33
Shap 60.0
Pcfi 76.67
Feature ranking method f1_score for rare class
Rare Shap 100.0
Rare Pfci 100.0


#### Students dataset

In [12]:
col_names = np.array([
    'Gender', 'Caste', 'Class X Percentage', 'Class XII Percentage', 'Internal Assessment Percentage',
    'End Semester Percentage', 'Whether the student has back or arrear papers', 'Marital Status',
    'Lived in Town or Village', 'Admission Category', 'Family Monthly Income', 'Family Size',
    'Father Qualification', 'Mother Qualification', 'Father Occupation', 'Mother Occupation',
    'Number of Friends', 'Study Hours', 'Student School attended at Class X level', 'Medium',
    'Home to College Travel Time', 'Class Attendance Percentage'
])
col_names = np.array([lab.capitalize() for lab in col_names])
students_data = pd.read_csv(proj_path+'/0_data/Student_performances.csv', header=None, names=col_names)
students_data = students_data.loc[:,col_names!='Marital status']

string_to_int_list = [
    {'M':0,'F':1}, {'G':0,'ST':1,'SC':2,'OBC':3,'MOBC':4}, {'Best':4,'Vg':3,'Good':2,'Pass':1,'Fail':0},
    {'Best':4,'Vg':3,'Good':2,'Pass':1,'Fail':0}, {'Best':4,'Vg':3,'Good':2,'Pass':1,'Fail':0},
    {'Best':4,'Vg':3,'Good':2,'Pass':1,'Fail':0}, {'Y':1,'N':0},
    {'T':1,'V':0}, {'Free':0,'Paid':1},
    {'Vh':4,'High':3,'Am':2,'Medium':1,'Low':0}, {'Large':2,'Average':1,'Small':0},
    {'Il':0,'Um':1,'10':2,'12':3,'Degree':4,'Pg':5}, {'Il':0,'Um':1,'10':2,'12':3,'Degree':4,'Pg':5},
    {'Service':0,'Business':1,'Retired':2,'Farmer':3,'Others':4},
    {'Service':0,'Business':1,'Retired':2,'Housewife':3,'Others':4},
    {'Large':2,'Average':1,'Small':0}, {'Good':2,'Average':1,'Poor':0},
    {'Govt':1,'Private':0}, {'Eng':0,'Asm':1,'Hin':2,'Ben':3},
    {'Large':2,'Average':1,'Small':0}, {'Good':2,'Average':1,'Poor':0}
]
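# encode each categorical column as integers, using the mapping at the same position in string_to_int_list
for col_name,string_to_int in zip(students_data.columns, string_to_int_list):
    students_data.loc[:,col_name] = students_data.apply(lambda r: string_to_int[r[col_name]], axis=1)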
    
feature_names = np.array(students_data.columns[students_data.columns!='End semester percentage'])
class_col = 'End semester percentage'
class_names = np.array(['Fail', 'Pass', 'Good', 'Vg', 'Best'])
class_tags = np.arange(len(class_names))
rnd_seed = 44  # 44 rather than 45: with seed 45 no rare-class sample ends up in the test set

data_info = ['Student_finals', students_data, feature_names, class_col, class_names, class_tags, rnd_seed]
datasets.append(data_info)


###


data = students_data
np.random.seed(rnd_seed)
feature_names = np.random.permutation(feature_names)###PERMUTE FEATURES
rf_gscv, top_features_lst, rare_class = fitRF_and_rank(data, feature_names, class_col, rnd_seed)
print('Method top selected features:')
for lab,feat in zip(['Global: ', 'Shap: ', 'Pcfi: ', 'Rare class Shap: ', 'Rare class Pcfi: '], top_features_lst):
    print(lab,feat)
label_lst = ['Global', 'Shap', 'Pcfi', 'Rare Shap', 'Rare Pfci']
is_global_score = [True, True, True, False, False]
print('\n')
scores = [refitRF(data, top_features, class_col, rf_gscv, rare_class, rnd_seed, global_score=global_score)
             for global_score,top_features in zip(is_global_score,top_features_lst)]
print('Feature ranking method accuracy:')
for lab,score in zip(label_lst[:3], scores[:3]):
    print(lab, score)
print('Feature ranking method f1_score for rare class')
for lab,score in zip(label_lst[3:], scores[3:]):
    print(lab, score)

Is the dataset balanced? False
Classes and counts: [1 2 3 4] [27 54 42  8]
Classes count in train: (array([1, 2, 3, 4]), array([22, 41, 29,  6]))
Classes count in test: (array([1, 2, 3, 4]), array([ 5, 13, 13,  2]))
Best score: 0.6441071052076744
Model performances:
Accuracy: 66.67


Setting feature_perturbation = "tree_path_dependent" because no background data was given.


Do y and y_train have same rare class? True
Method top selected features:
Global:  ['Class x percentage' 'Class xii percentage'
 'Internal assessment percentage' 'Father occupation']
Shap:  ['Class x percentage' 'Internal assessment percentage'
 'Class xii percentage' 'Home to college travel time']
Pcfi:  ['Class x percentage' 'Internal assessment percentage'
 'Class xii percentage' 'Mother qualification']
Rare class Shap:  ['Internal assessment percentage' 'Class xii percentage'
 'Class x percentage']
Rare class Pcfi:  ['Internal assessment percentage' 'Class xii percentage'
 'Class x percentage']


Feature ranking method accuracy:
Global 57.58
Shap 54.55
Pcfi 69.7
Feature ranking method f1_score for rare class
Rare Shap 0.0
Rare Pfci 0.0


#### Tumor dataset

In [13]:
col_names = np.array([
    'class', 'age', 'sex', 'histologic-type', 'degree-of-diffe', 'bone', 'bone-marrow', 'lung', 'pleura',
    'peritoneum', 'liver', 'brain', 'skin', 'neck', 'supraclavicular', 'axillar', 'mediastinum', 'abdominal'
])
col_names = np.array([lab.capitalize() for lab in col_names])
feature_names = col_names[col_names!='Class']
class_names = np.array([
    'lung', 'head & neck', 'esophagus', 'thyroid', 'stomach', 'duoden & sm.int',
    'colon', 'rectum', 'anus', 'salivary glands', 'pancreas', 'gallbladder',
    'liver', 'kidney', 'bladder', 'testis', 'prostate', 'ovary', 'corpus uteri', 
    'cervix uteri', 'vagina', 'breast'
])
tumor_data = pd.read_csv(proj_path+'/0_data/primary-tumor.data.csv', header=None, names=col_names)
# drop the 'Histologic-type' and 'Degree-of-diffe' columns
tumor_data = tumor_data.drop(columns=['Histologic-type', 'Degree-of-diffe'])
# discard rows with missing values (encoded as '?')
has_missing = (tumor_data == '?').any(axis=1)
tumor_data = tumor_data.loc[~has_missing]
class_tags, class_counts = np.unique(tumor_data.Class, return_counts=True)
keep = class_tags[class_counts>=15]
tumor_data = tumor_data.loc[tumor_data.Class.isin(keep)]

col_names = tumor_data.columns
feature_names = np.array(col_names[col_names!='Class'])
class_names = np.array([col for col in class_names if col in class_names[keep-1]])
class_tags = np.unique(tumor_data.Class)
class_col = 'Class'
rnd_seed = 45
data_info = ['Tumor', tumor_data, feature_names, class_col, class_names, class_tags, rnd_seed]
datasets.append(data_info)


###


data = tumor_data
np.random.seed(rnd_seed)
feature_names = np.random.permutation(feature_names)###PERMUTE FEATURES
rf_gscv, top_features_lst, rare_class = fitRF_and_rank(data, feature_names, class_col, rnd_seed)
print('Method top selected features:')
for lab,feat in zip(['Global: ', 'Shap: ', 'Pcfi: ', 'Rare class Shap: ', 'Rare class Pcfi: '], top_features_lst):
    print(lab,feat)
label_lst = ['Global', 'Shap', 'Pcfi', 'Rare Shap', 'Rare Pfci']
is_global_score = [True, True, True, False, False]
print('\n')
scores = [refitRF(data, top_features, class_col, rf_gscv, rare_class, rnd_seed, global_score=global_score)
             for global_score,top_features in zip(is_global_score,top_features_lst)]
print('Feature ranking method accuracy:')
for lab,score in zip(label_lst[:3], scores[:3]):
    print(lab, score)
print('Feature ranking method f1_score for rare class')
for lab,score in zip(label_lst[3:], scores[3:]):
    print(lab, score)

Is the dataset balanced? False
Classes and counts: [ 1  2  5 11 12 14 18 22] [82 20 39 28 16 24 29 24]
Classes count in train: (array([ 1,  2,  5, 11, 12, 14, 18, 22]), array([59, 15, 32, 23,  9, 21, 25, 12]))
Classes count in test: (array([ 1,  2,  5, 11, 12, 14, 18, 22]), array([23,  5,  7,  5,  7,  3,  4, 12]))
Best score: 0.5213973422928647
Model performances:
Accuracy: 54.55


Setting feature_perturbation = "tree_path_dependent" because no background data was given.


Do y and y_train have same rare class? True
Method top selected features:
Global:  ['Age' 'Sex' 'Mediastinum']
Shap:  ['Sex' 'Neck' 'Peritoneum']
Pcfi:  ['Age' 'Neck' 'Sex']
Rare class Shap:  ['Age' 'Sex' 'Peritoneum']
Rare class Pcfi:  ['Age' 'Sex' 'Liver']


Feature ranking method accuracy:
Global 40.91
Shap 40.91
Pcfi 36.36
Feature ranking method f1_score for rare class
Rare Shap 53.33
Rare Pfci 71.43


#### Flag dataset

In [14]:
col_names = np.array([
    'name', 'landmass', 'zone', 'area', 'population', 'language', 'religion', 'bars', 'stripes', 'colours',
    'red', 'green', 'blue', 'gold', 'white', 'black', 'orange', 'mainhue', 'circles', 'crosses', 'saltires',
    'quarters', 'sunstars', 'crescent', 'triangle', 'icon', 'animate', 'text', 'topleft', 'botright'
])
col_names = np.array([lab.capitalize() for lab in col_names])

feature_names = np.array(col_names[np.logical_and(col_names!='Religion', col_names!='Name')])
class_col = 'Religion'
class_names = np.array(['Catholic', 'Other Christian', 'Muslim',# 'Buddhist', 'Hindu',
                        'Ethnic', 'Marxist', 'Others'])

class_tags = np.arange(len(class_names))
flag_data = pd.read_csv(proj_path+'/0_data/flag.data.csv', header=None, names=col_names)

string_to_int = {'black':0, 'blue':1, 'brown':2, 'gold':3, 'green':4, 'orange':5, 'red':6, 'white':7}
for col_name in ['Mainhue', 'Topleft', 'Botright']:
    flag_data.loc[:,col_name] = flag_data.apply(lambda r: string_to_int[r[col_name]], axis=1)
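# merge Buddhist (3), Hindu (4) and Others (7) into a single 'Others' class (5);
# Ethnic (5) and Marxist (6) are renumbered to 3 and 4 to match class_names
string_to_int = {0:0, 1:1, 2:2, 3:5, 4:5, 5:3, 6:4, 7:5}
flag_data.loc[:,['Religion']] = flag_data.apply(lambda r: string_to_int[r['Religion']], axis=1)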

rnd_seed = 45

data_info = ['Flag', flag_data, feature_names, class_col, class_names, class_tags, rnd_seed]
datasets.append(data_info)


###


data = flag_data  
np.random.seed(rnd_seed)
feature_names = np.random.permutation(feature_names)###PERMUTE FEATURES
rf_gscv, top_features_lst, rare_class = fitRF_and_rank(data, feature_names, class_col, rnd_seed)
print('Method top selected features:')
for lab,feat in zip(['Global: ', 'Shap: ', 'Pcfi: ', 'Rare class Shap: ', 'Rare class Pcfi: '], top_features_lst):
    print(lab,feat)
label_lst = ['Global', 'Shap', 'Pcfi', 'Rare Shap', 'Rare Pfci']
is_global_score = [True, True, True, False, False]
print('\n')
scores = [refitRF(data, top_features, class_col, rf_gscv, rare_class, rnd_seed, global_score=global_score)
             for global_score,top_features in zip(is_global_score,top_features_lst)]
print('Feature ranking method accuracy:')
for lab,score in zip(label_lst[:3], scores[:3]):
    print(lab, score)
print('Feature ranking method f1_score for rare class')
for lab,score in zip(label_lst[3:], scores[3:]):
    print(lab, score)

Is the dataset balanced? False
Classes and counts: [0 1 2 3 4 5] [40 60 36 27 15 16]
Classes count in train: (array([0, 1, 2, 3, 4, 5]), array([33, 47, 27, 18,  9, 11]))
Classes count in test: (array([0, 1, 2, 3, 4, 5]), array([ 7, 13,  9,  9,  6,  5]))
Best score: 0.6064553480966856
Model performances:
Accuracy: 63.27


Setting feature_perturbation = "tree_path_dependent" because no background data was given.


Do y and y_train have same rare class? True
Method top selected features:
Global:  ['Landmass' 'Language' 'Area']
Shap:  ['Landmass' 'Zone' 'Language']
Pcfi:  ['Landmass' 'Language' 'Area']
Rare class Shap:  ['Zone' 'Green' 'Landmass']
Rare class Pcfi:  ['Population' 'Green' 'Sunstars']


Feature ranking method accuracy:
Global 61.22
Shap 61.22
Pcfi 61.22
Feature ranking method f1_score for rare class
Rare Shap 0.0
Rare Pfci 0.0


F-score is ill-defined and being set to 0.0 due to no predicted samples.
F-score is ill-defined and being set to 0.0 due to no predicted samples.
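
The warning above means that the refitted model never predicted the rare class, so its precision is 0/0 and the F-score falls back to 0.0. A minimal sketch of the situation, with made-up labels, is given below; passing `zero_division=0` to scikit-learn's `f1_score` states the fallback explicitly. Whether the notebook's `refitRF` helper should do the same is left open.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels (made up): class 4 occurs in y_true but is never predicted.
y_true = np.array([0, 1, 4, 4, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0])
rare_class = 4

# True when the model never predicts the rare class (precision is 0/0).
never_predicted = not np.any(y_pred == rare_class)

# zero_division=0 makes the fallback value explicit.
rare_f1 = f1_score(
    y_true, y_pred, labels=[rare_class], average="macro", zero_division=0
)
print(never_predicted, rare_f1)  # True 0.0
```

A score of 0.0 obtained this way says that the rare class was missed entirely, which is a different failure from predicting it and getting it wrong.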


#### Breast tissue dataset

In [15]:
breast_data = pd.read_excel(proj_path+'/0_data/BreastTissue.xls', sheet_name=1)
breast_data = breast_data.iloc[:,1:]

feature_names = np.array(breast_data.columns[1:])
class_col = 'Class'
class_names = np.unique(breast_data.Class)
class_tags = np.arange(len(class_names))

string_to_int = dict(zip(class_names, class_tags))
breast_data.loc[:,'Class'] = breast_data.apply(lambda r: string_to_int[r.Class], axis=1)

rnd_seed = 45

data_info = ['Breast_tissue', breast_data, feature_names, class_col, class_names, class_tags, rnd_seed]
datasets.append(data_info)


###


data = breast_data
np.random.seed(rnd_seed)
feature_names = np.random.permutation(feature_names)###PERMUTE FEATURES
rf_gscv, top_features_lst, rare_class = fitRF_and_rank(data, feature_names, class_col, rnd_seed)
print('Method top selected features:')
for lab,feat in zip(['Global: ', 'Shap: ', 'Pcfi: ', 'Rare class Shap: ', 'Rare class Pcfi: '], top_features_lst):
    print(lab,feat)
label_lst = ['Global', 'Shap', 'Pcfi', 'Rare Shap', 'Rare Pfci']
is_global_score = [True, True, True, False, False]
print('\n')
scores = [refitRF(data, top_features, class_col, rf_gscv, rare_class, rnd_seed, global_score=global_score)
             for global_score,top_features in zip(is_global_score,top_features_lst)]
print('Feature ranking method accuracy:')
for lab,score in zip(label_lst[:3], scores[:3]):
    print(lab, score)
print('Feature ranking method f1_score for rare class')
for lab,score in zip(label_lst[3:], scores[3:]):
    print(lab, score)

Is the dataset balanced? False
Classes and counts: [0 1 2 3 4 5] [22 21 14 15 16 18]
Classes count in train: (array([0, 1, 2, 3, 4, 5]), array([17, 14, 11, 11, 12, 14]))
Classes count in test: (array([0, 1, 2, 3, 4, 5]), array([5, 7, 3, 4, 4, 4]))
Best score: 0.7054865424430643
Model performances:
Accuracy: 74.07


Setting feature_perturbation = "tree_path_dependent" because no background data was given.


Do y and y_train have same rare class? True
Method top selected features:
Global:  ['P' 'I0' 'Max IP']
Shap:  ['I0' 'P' 'DA']
Pcfi:  ['P' 'I0' 'Max IP']
Rare class Shap:  ['I0' 'P' 'DA']
Rare class Pcfi:  ['I0' 'P' 'DA']


Feature ranking method accuracy:
Global 70.37
Shap 77.78
Pcfi 70.37
Feature ranking method f1_score for rare class
Rare Shap 100.0
Rare Pfci 100.0


#### Car dataset

In [16]:
col_names = np.array([
    'buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'
])
col_names = np.array([lab.capitalize() for lab in col_names])
car_data = pd.read_csv(proj_path+'/0_data/car.data.csv', header=None, names=col_names)

string_to_int_list = [
    {'vhigh':3, 'high':2, 'med':1, 'low':0},
    {'vhigh':3, 'high':2, 'med':1, 'low':0},
    {'2':2, '3':3, '4':4, '5more':5},
    {'2':0, '4':1, 'more':2},
    {'small':0, 'med':1, 'big':2},
    {'low':0, 'med':1, 'high':2},
    {'unacc':0, 'acc':1, 'good':2, 'vgood':3}
]

for col_name,string_to_int in zip(car_data.columns, string_to_int_list):
    car_data.loc[:,col_name] = car_data.apply(lambda r: string_to_int[r[col_name]], axis=1)

    
feature_names = np.array(car_data.columns[car_data.columns!='Class'])
class_col = 'Class'
class_names = np.array(['unacc', 'acc', 'good', 'vgood'])
class_tags = np.arange(len(class_names))
print(feature_names, class_names, class_tags)

rnd_seed = 45
data_info = ['Cars', car_data, feature_names, class_col, class_names, class_tags, rnd_seed]
datasets.append(data_info)


###


data = car_data
np.random.seed(rnd_seed)
feature_names = np.random.permutation(feature_names)###PERMUTE FEATURES
rf_gscv, top_features_lst, rare_class = fitRF_and_rank(data, feature_names, class_col, rnd_seed)
print('Method top selected features:')
for lab,feat in zip(['Global: ', 'Shap: ', 'Pcfi: ', 'Rare class Shap: ', 'Rare class Pcfi: '], top_features_lst):
    print(lab,feat)
label_lst = ['Global', 'Shap', 'Pcfi', 'Rare Shap', 'Rare Pfci']
is_global_score = [True, True, True, False, False]
print('\n')
scores = [refitRF(data, top_features, class_col, rf_gscv, rare_class, rnd_seed, global_score=global_score)
             for global_score,top_features in zip(is_global_score,top_features_lst)]
print('Feature ranking method accuracy:')
for lab,score in zip(label_lst[:3], scores[:3]):
    print(lab, score)
print('Feature ranking method f1_score for rare class')
for lab,score in zip(label_lst[3:], scores[3:]):
    print(lab, score)

['Buying' 'Maint' 'Doors' 'Persons' 'Lug_boot' 'Safety'] ['unacc' 'acc' 'good' 'vgood'] [0 1 2 3]
Is the dataset balanced? False
Classes and counts: [0 1 2 3] [1210  384   69   65]
Classes count in train: (array([0, 1, 2, 3]), array([908, 288,  51,  49]))
Classes count in test: (array([0, 1, 2, 3]), array([302,  96,  18,  16]))
Best score: 0.9691249616395052
Model performances:
Accuracy: 98.38


Setting feature_perturbation = "tree_path_dependent" because no background data was given.


Do y and y_train have same rare class? True
Method top selected features:
Global:  ['Safety' 'Persons' 'Buying']
Shap:  ['Safety' 'Persons' 'Buying']
Pcfi:  ['Safety' 'Persons' 'Buying']
Rare class Shap:  ['Safety' 'Buying' 'Lug_boot']
Rare class Pcfi:  ['Safety' 'Buying' 'Lug_boot']


Feature ranking method accuracy:
Global 80.32
Shap 80.32
Pcfi 80.32
Feature ranking method f1_score for rare class
Rare Shap 48.28
Rare Pfci 48.28


#### WiFi signals dataset

In [17]:
wifi_data = pd.read_excel(proj_path+'/0_data/WiFi_signals.xls')

class_names = np.unique(wifi_data.Room)
feature_names = np.array(wifi_data.columns[wifi_data.columns!='Room'])
class_col = 'Room'
class_tags = class_names

rnd_seed = 45

data_info = ['Wifi', wifi_data, feature_names, class_col, class_names, class_tags, rnd_seed]
datasets.append(data_info)


###


data = wifi_data
np.random.seed(rnd_seed)
feature_names = np.random.permutation(feature_names)###PERMUTE FEATURES
rf_gscv, top_features_lst, rare_class = fitRF_and_rank(data, feature_names, class_col, rnd_seed)
print('Method top selected features:')
for lab,feat in zip(['Global: ', 'Shap: ', 'Pcfi: ', 'Rare class Shap: ', 'Rare class Pcfi: '], top_features_lst):
    print(lab,feat)
label_lst = ['Global', 'Shap', 'Pcfi', 'Rare Shap', 'Rare Pfci']
is_global_score = [True, True, True, False, False]
print('\n')
scores = [refitRF(data, top_features, class_col, rf_gscv, rare_class, rnd_seed, global_score=global_score)
             for global_score,top_features in zip(is_global_score,top_features_lst)]
print('Feature ranking method accuracy:')
for lab,score in zip(label_lst[:3], scores[:3]):
    print(lab, score)
print('Feature ranking method f1_score for rare class')
for lab,score in zip(label_lst[3:], scores[3:]):
    print(lab, score)

Is the dataset balanced? True
Classes and counts: [1 2 3 4] [500 500 500 500]
Classes count in train: (array([1, 2, 3, 4]), array([372, 379, 378, 371]))
Classes count in test: (array([1, 2, 3, 4]), array([128, 121, 122, 129]))
Best score: 0.9859959653171946
Model performances:
Accuracy: 98.2


Setting feature_perturbation = "tree_path_dependent" because no background data was given.


Do y and y_train have same rare class? False
Method top selected features:
Global:  ['Signal 5' 'Signal 4' 'Signal 1']
Shap:  ['Signal 5' 'Signal 4' 'Signal 1']
Pcfi:  ['Signal 5' 'Signal 4' 'Signal 1']
Rare class Shap:  ['Signal 4' 'Signal 1' 'Signal 5']
Rare class Pcfi:  ['Signal 4' 'Signal 1' 'Signal 5']


Feature ranking method accuracy:
Global 97.6
Shap 97.6
Pcfi 97.6
Feature ranking method f1_score for rare class
Rare Shap 98.08
Rare Pfci 98.08


#### Iris dataset

In [18]:
from sklearn.datasets import load_iris

iris = load_iris()
feature_names = np.array(iris.feature_names)
class_names = iris.target_names
class_tags = np.array([0, 1, 2])
class_col = 'Class'

iris_data = pd.DataFrame(iris.data,columns=feature_names)
iris_data[class_col] = pd.Series(iris.target)

rnd_seed = 45
data_info = ['Iris', iris_data, feature_names, class_col, class_names, class_tags, rnd_seed]
datasets.append(data_info)


###


data = iris_data
np.random.seed(rnd_seed)
feature_names = np.random.permutation(feature_names)###PERMUTE FEATURES
rf_gscv, top_features_lst, rare_class = fitRF_and_rank(data, feature_names, class_col, rnd_seed)
print('Method top selected features:')
for lab,feat in zip(['Global: ', 'Shap: ', 'Pcfi: ', 'Rare class Shap: ', 'Rare class Pcfi: '], top_features_lst):
    print(lab,feat)
label_lst = ['Global', 'Shap', 'Pcfi', 'Rare Shap', 'Rare Pfci']
is_global_score = [True, True, True, False, False]
print('\n')
scores = [refitRF(data, top_features, class_col, rf_gscv, rare_class, rnd_seed, global_score=global_score)
             for global_score,top_features in zip(is_global_score,top_features_lst)]
print('Feature ranking method accuracy:')
for lab,score in zip(label_lst[:3], scores[:3]):
    print(lab, score)
print('Feature ranking method f1_score for rare class')
for lab,score in zip(label_lst[3:], scores[3:]):
    print(lab, score)

Is the dataset balanced? True
Classes and counts: [0 1 2] [50 50 50]
Classes count in train: (array([0, 1, 2]), array([36, 41, 35]))
Classes count in test: (array([0, 1, 2]), array([14,  9, 15]))
Best score: 0.9639376218323585
Model performances:
Accuracy: 94.74
Do y and y_train have same rare class? False
Method top selected features:
Global:  ['petal width (cm)' 'petal length (cm)' 'sepal length (cm)']
Shap:  ['petal width (cm)' 'petal length (cm)' 'sepal length (cm)']
Pcfi:  ['petal width (cm)' 'petal length (cm)' 'sepal length (cm)']
Rare class Shap:  ['petal width (cm)' 'petal length (cm)' 'sepal length (cm)']
Rare class Pcfi:  ['petal width (cm)' 'petal length (cm)' 'sepal length (cm)']




Setting feature_perturbation = "tree_path_dependent" because no background data was given.


Feature ranking method accuracy:
Global 94.74
Shap 94.74
Pcfi 94.74
Feature ranking method f1_score for rare class
Rare Shap 100.0
Rare Pfci 100.0


#### Wine dataset

In [19]:
from sklearn.datasets import load_wine

wines = load_wine()
feature_names = np.array(wines.feature_names)
class_names = wines.target_names
class_tags = np.array([0, 1, 2])
class_col = 'Class'

wine_data = pd.DataFrame(wines.data,columns=feature_names)
wine_data[class_col] = pd.Series(wines.target)

rnd_seed = 45
data_info = ['Wine', wine_data, feature_names, class_col, class_names, class_tags, rnd_seed]
datasets.append(data_info)


###


data = wine_data
np.random.seed(rnd_seed)
feature_names = np.random.permutation(feature_names)###PERMUTE FEATURES
rf_gscv, top_features_lst, rare_class = fitRF_and_rank(data, feature_names, class_col, rnd_seed)
print('Method top selected features:')
for lab,feat in zip(['Global: ', 'Shap: ', 'Pcfi: ', 'Rare class Shap: ', 'Rare class Pcfi: '], top_features_lst):
    print(lab,feat)
label_lst = ['Global', 'Shap', 'Pcfi', 'Rare Shap', 'Rare Pfci']
is_global_score = [True, True, True, False, False]
print('\n')
scores = [refitRF(data, top_features, class_col, rf_gscv, rare_class, rnd_seed, global_score=global_score)
             for global_score,top_features in zip(is_global_score,top_features_lst)]
print('Feature ranking method accuracy:')
for lab,score in zip(label_lst[:3], scores[:3]):
    print(lab, score)
print('Feature ranking method f1_score for rare class')
for lab,score in zip(label_lst[3:], scores[3:]):
    print(lab, score)

Is the dataset balanced? False
Classes and counts: [0 1 2] [59 71 48]
Classes count in train: (array([0, 1, 2]), array([47, 54, 32]))
Classes count in test: (array([0, 1, 2]), array([12, 17, 16]))
Best score: 0.9851851851851853
Model performances:
Accuracy: 100.0


Setting feature_perturbation = "tree_path_dependent" because no background data was given.


Do y and y_train have same rare class? True
Method top selected features:
Global:  ['color_intensity' 'proline' 'flavanoids' 'alcohol'
 'od280/od315_of_diluted_wines' 'hue' 'total_phenols' 'malic_acid']
Shap:  ['color_intensity' 'proline' 'flavanoids' 'alcohol'
 'od280/od315_of_diluted_wines' 'hue' 'total_phenols' 'proanthocyanins']
Pcfi:  ['color_intensity' 'proline' 'flavanoids' 'alcohol'
 'od280/od315_of_diluted_wines' 'hue' 'total_phenols' 'malic_acid']
Rare class Shap:  ['flavanoids' 'od280/od315_of_diluted_wines' 'color_intensity']
Rare class Pcfi:  ['flavanoids' 'od280/od315_of_diluted_wines' 'hue']


Feature ranking method accuracy:
Global 100.0
Shap 97.78
Pcfi 100.0
Feature ranking method f1_score for rare class
Rare Shap 96.97
Rare Pfci 96.77
