------------------------------------------------
Choosing important features (feature importance)
------------------------------------------------

Feature importance is a technique for selecting features with a trained supervised classifier. When a classifier such as a decision tree is trained, each attribute is evaluated as a candidate for every split, and the measure used to choose the splits (for example, the decrease in impurity) can also be used to rank and select features. Let's look at this in detail.
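
To make the split measure concrete, here is a minimal sketch of the Gini impurity decrease for a single threshold split. The helper names (gini, impurity_decrease) and the toy arrays are illustrative assumptions, not part of the notebook or of any library.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(feature, labels, threshold):
    """Weighted Gini decrease from splitting on feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    weighted_child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_child

# Toy data (hypothetical): x_good separates the two classes, x_noise does not
y = np.array([0, 0, 0, 1, 1, 1])
x_good = np.array([1, 2, 3, 7, 8, 9])
x_noise = np.array([1, 9, 2, 8, 3, 7])

gain_good = impurity_decrease(x_good, y, threshold=5)
gain_noise = impurity_decrease(x_noise, y, threshold=5)
print("Gini decrease, separating feature:", gain_good)
print("Gini decrease, noisy feature:     ", gain_noise)
```

A feature that yields large decreases across many splits ends up with a high importance score; a feature like x_noise contributes very little.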

Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy.
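
The two measures can be compared side by side. The sketch below assumes a small synthetic dataset from make_classification rather than the Otto data. It reads mean decrease impurity from feature_importances_ and approximates mean decrease accuracy with scikit-learn's permutation_importance on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 8 features, 3 of them informative
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_tr, y_tr)

# Mean decrease impurity: averaged over all trees, normalized to sum to 1
mdi = forest.feature_importances_

# Mean decrease accuracy: drop in held-out score when one column is shuffled
perm = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)
mda = perm.importances_mean

for i, (a, b) in enumerate(zip(mdi, mda)):
    print("feature %d  impurity=%.3f  permutation=%.3f" % (i, a, b))
```

Impurity-based scores are computed from the training data only, whereas the permutation scores here use a held-out set, so the two rankings need not agree exactly.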

Otto Train data

You can download the training dataset, train.csv.zip, from https://www.kaggle.com/c/otto-group-product-classification-challenge/data and unzip it. Note that the code below reads the file from data/ottoTrain.csv, so either save train.csv under that name or adjust the path in the read_csv call.

This dataset describes 93 obfuscated features of more than 61,000 products grouped into nine product categories (for example, fashion, electronics, and so on). Input attributes are the counts of different events of some kind.

The goal is to predict, for each new product, an array of probabilities over the nine categories. Models are evaluated using multiclass logarithmic loss (also called cross entropy).
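
As a small worked example of the evaluation metric, the sketch below computes multiclass logarithmic loss by hand and checks it against sklearn.metrics.log_loss. The three-class labels and probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true classes and predicted probabilities (each row sums to 1)
y_true = np.array([0, 2, 1])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Log loss is the mean negative log-probability assigned to the true class
manual = -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))
library = log_loss(y_true, proba, labels=[0, 1, 2])

print("manual :", manual)
print("sklearn:", library)
```

Confident wrong predictions are penalized heavily, because the log of a probability close to zero is a large negative number.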

In [19]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from pandas import read_csv
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
np.random.seed(1)

In [2]:
#Function to create Train and Test set from the original dataset
def getTrainTestData(dataset,split):
    #Fix the seed so that the shuffle, and therefore the split, is reproducible
    np.random.seed(0)
    #Shuffle the rows in place (this modifies the array passed in)
    np.random.shuffle(dataset)
    #Number of training rows; a plain int avoids the np.uint16 overflow above 65535 rows
    trainlength = int(np.floor(split*dataset.shape[0]))
    #The first trainlength rows form the training set, the rest the test set
    training = np.array(dataset[:trainlength])
    testing = np.array(dataset[trainlength:])
    return training,testing
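
As an alternative to the hand-written split, scikit-learn's train_test_split can do the same job and can also keep the class proportions equal in both parts. This is a sketch on made-up arrays (X_demo, y_demo), not a replacement the notebook itself uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 4 count features, 2 balanced classes
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(100, 4))
y_demo = np.repeat(["Class_1", "Class_2"], 50)

# 70/30 split; stratify keeps the class ratio the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, stratify=y_demo, random_state=0)

print(X_tr.shape, X_te.shape)
```

Unlike getTrainTestData, this call does not shuffle the input arrays in place.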

In [3]:
#Function to evaluate model performance
def getAccuracy(pre,ytest): 
    count = 0
    for i in range(len(ytest)):
        if ytest[i]==pre[i]: 
            count+=1
    acc = float(count)/len(ytest)
    return acc
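
The same accuracy can be computed without an explicit loop. A minimal sketch, using made-up label arrays, shows the NumPy one-liner and scikit-learn's accuracy_score giving the same value:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels
ytest_demo = np.array(["Class_1", "Class_2", "Class_2", "Class_3"])
pre_demo = np.array(["Class_1", "Class_2", "Class_3", "Class_3"])

# Fraction of positions where prediction and truth agree
acc_numpy = float(np.mean(pre_demo == ytest_demo))
acc_sklearn = accuracy_score(ytest_demo, pre_demo)

print(acc_numpy, acc_sklearn)
```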

In [5]:
#Load dataset as pandas data frame
data = read_csv('data/ottoTrain.csv')
#Extract attribute names from the data frame
feat = data.keys()
#Index.get_values() was removed in pandas 1.0; to_numpy() returns the same array
feat_labels = feat.to_numpy()
#Extract data values from the data frame
dataset = data.values
#Shuffle the dataset
np.random.shuffle(dataset)
#We will select 50000 instances to train the classifier
inst = 50000
#Extract 50000 instances from the dataset
dataset = dataset[0:inst,:]
#Create Training and Testing data for performance evaluation
train,test = getTrainTestData(dataset, 0.7)
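#Split data into input and output variables
#Columns 0-93 hold 'id' and feat_1..feat_93; column 94 holds the class label
#Note that the 'id' column is kept among the inputs here
Xtrain = train[:,0:94]
ytrain = train[:,94]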
shape = np.shape(Xtrain)
print("Shape of the dataset ",shape)
#Print the size of Data in MBs
print("Size of Data set before feature selection:",(Xtrain.nbytes/1e6),"MB")

Shape of the dataset  (35000, 94)
Size of Data set before feature selection: 26.32 MB


In [6]:
#Select the test data for model evaluation
Xtest = test[:,0:94] 
ytest = test[:,94]
#Create a random forest classifier with the following Parameters
trees= 250
max_feat= 7
max_depth = 30
min_sample = 2
clf = RandomForestClassifier(n_estimators=trees, max_features=max_feat, max_depth=max_depth, 
min_samples_split= min_sample, random_state=0,n_jobs=-1)
#Train the classifier and measure the training time
import time
start = time.time()
clf.fit(Xtrain, ytrain)
end = time.time()
#Report the model training time
print("Execution time for building the Tree is: %f"%(float(end)- float(start)))
#Predict on the test data and evaluate the model performance
pre = clf.predict(Xtest)
acc = getAccuracy(pre, ytest)
print("Accuracy of model before feature selection is",(100*acc))

Execution time for building the Tree is: 11.645748
Accuracy of model before feature selection is 98.82


In [8]:
#Once we have trained the model we will rank all the features 
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)

('id', 0.3334665042017519)
('feat_1', 0.003618695862880122)
('feat_2', 0.003724305088853096)
('feat_3', 0.01157921747206275)
('feat_4', 0.010297382675187447)
('feat_5', 0.0010359139416194119)
('feat_6', 0.0003817133603805617)
('feat_7', 0.0024867672489765026)
('feat_8', 0.00966897216105461)
('feat_9', 0.007906150362995095)
('feat_10', 0.002234248080213037)
('feat_11', 0.03032120226642743)
('feat_12', 0.0011208629500706663)
('feat_13', 0.003991984466073026)
('feat_14', 0.0194087068806635)
('feat_15', 0.01539863449663281)
('feat_16', 0.0055203970543115455)
('feat_17', 0.007198233904267588)
('feat_18', 0.0036309310056707516)
('feat_19', 0.003800885800560713)
('feat_20', 0.004600100163709177)
('feat_21', 0.0012839572570891805)
('feat_22', 0.0034580481856073624)
('feat_23', 0.001941425686466054)
('feat_24', 0.009502403878816025)
('feat_25', 0.01838207049845683)
('feat_26', 0.022011162365845237)
('feat_27', 0.008292147847657359)
('feat_28', 0.003155738407834562)
('feat_29', 0.002479225759860
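
A listing in column order is hard to scan, so it can help to sort the features by importance. The sketch below uses a few made-up label and importance values (feat_labels_demo, importances_demo). In the notebook, the corresponding arrays would be feat_labels and clf.feature_importances_.

```python
import numpy as np

# Hypothetical labels and importance scores, for illustration only
feat_labels_demo = np.array(["id", "feat_1", "feat_2", "feat_3"])
importances_demo = np.array([0.33, 0.004, 0.012, 0.010])

# argsort is ascending, so reverse it to get the largest importance first
order = np.argsort(importances_demo)[::-1]
ranked = [(str(feat_labels_demo[i]), float(importances_demo[i])) for i in order]

for name, score in ranked:
    print("%-8s %.4f" % (name, score))
```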

In [14]:
#Select features which have higher contribution in the final prediction
sfm = SelectFromModel(clf, threshold=0.01) 
sfm.fit(Xtrain,ytrain)

SelectFromModel(estimator=RandomForestClassifier(bootstrap=True,
                                                 class_weight=None,
                                                 criterion='gini', max_depth=30,
                                                 max_features=7,
                                                 max_leaf_nodes=None,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                                                 min_weight_fraction_leaf=0.0,
                                                 n_estimators=250, n_jobs=-1,
                                                 oob_score=False,
                                                 random_state=0, verbose=0,
                                                 warm_start=Fal
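
When the forest is already trained, SelectFromModel can reuse it through prefit=True instead of fitting a fresh copy, and get_support() reports which columns pass the threshold. This is a sketch on synthetic data; the feature names and the 0.1 threshold are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (assumption): 10 features, 3 of them informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
names = np.array(["feat_%d" % (i + 1) for i in range(X.shape[1])])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# prefit=True reuses the trained forest; no second training pass is needed
selector = SelectFromModel(forest, threshold=0.1, prefit=True)
mask = selector.get_support()      # boolean mask: importance >= threshold
selected = names[mask]
X_small = selector.transform(X)

print("Selected features:", list(selected))
print("Reduced shape:", X_small.shape)
```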

In [15]:
#Transform input dataset
Xtrain_1 = sfm.transform(Xtrain) 
Xtest_1= sfm.transform(Xtest)
#Let's see the size and shape of the new dataset
print("Size of Data set after feature selection: ",(Xtrain_1.nbytes/1e6)," MB")
shape = np.shape(Xtrain_1)
print("Shape of the dataset ",shape)

Size of Data set after feature selection:  5.6  MB
Shape of the dataset  (35000, 20)


In [20]:
Xtrain_1.shape
Xtest_1.shape

(35000, 20)

(15000, 20)

In [16]:
#Model training time
start = time.time() 
clf.fit(Xtrain_1, ytrain) 
end = time.time()
print("Execution time for building the Random Forest is: ",(float(end)- float(start)))
#Let's evaluate the model on test data
pre = clf.predict(Xtest_1)
acc2 = getAccuracy(pre, ytest)
print("Accuracy after feature selection ",(100*acc2))

Execution time for building the Random Forest is:  5.580687522888184
Accuracy after feature selection  99.97333333333333
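
The accuracies above should be read with some care: the 'id' column is among the inputs and receives by far the largest importance (about 0.33). One possible explanation, stated here as an assumption rather than something the notebook verifies, is that the rows in the CSV are ordered by class, so a row identifier reveals the label. The sketch below reproduces that effect on purely synthetic data where the other features are noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_per_class = 200
y = np.repeat([0, 1, 2], n_per_class)       # rows sorted by class (assumption)
row_id = np.arange(1, len(y) + 1)           # identifier follows the file order
noise = rng.normal(size=(len(y), 5))        # features unrelated to the class
X_with_id = np.column_stack([row_id, noise])

# Shared random train/test row indices, stratified by class
tr, te = train_test_split(np.arange(len(y)), test_size=0.3,
                          random_state=0, stratify=y)

def holdout_accuracy(X):
    """Fit a small forest on the training rows and score it on the test rows."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[tr], y[tr])
    return model.score(X[te], y[te])

acc_with_id = holdout_accuracy(X_with_id)
acc_without_id = holdout_accuracy(X_with_id[:, 1:])   # drop the id column

print("Accuracy with id column:   ", acc_with_id)
print("Accuracy without id column:", acc_without_id)
```

If the same pattern holds for the Otto data, repeating the experiment with Xtrain = train[:,1:94] would give a more realistic picture of how much feature selection helps.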
