------------------------------------------------
Choosing important features (feature importance)
------------------------------------------------

Feature importance is a technique for selecting features with a trained supervised classifier. When we train a classifier such as a decision tree, each attribute is evaluated as a candidate for splitting the data; the score computed during that evaluation can also be used to rank and select features. Let’s look at this in detail.

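Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity (how much the splits on a feature reduce impurity, averaged over the trees) and mean decrease accuracy (how much accuracy drops when the values of a feature are randomly permuted).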
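As a rough illustration of the two measures, the sketch below fits a small random forest on synthetic data and prints both scores for each feature. The data from make_classification, the forest size, and the variable names mdi and mda are assumptions made for this example only; they are not part of the Otto workflow. Mean decrease accuracy is approximated here with scikit-learn's permutation_importance on a held-out split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 10 features, 3 of them informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

forest = RandomForestClassifier(n_estimators=50, random_state=1)
forest.fit(X_train, y_train)

# Mean decrease impurity: computed during training, normalized to sum to 1.
mdi = forest.feature_importances_

# Mean decrease accuracy: drop in score when one feature is shuffled,
# measured on held-out data and averaged over several repeats.
result = permutation_importance(forest, X_test, y_test,
                                n_repeats=5, random_state=1)
mda = result.importances_mean

# Print features from most to least important according to MDI.
for idx in np.argsort(mdi)[::-1]:
    print(f"feature {idx}: MDI={mdi[idx]:.3f}  MDA={mda[idx]:.3f}")
```

The two rankings usually agree on the strongest features but can differ on weak ones, since impurity-based scores are computed on the training data while permutation scores use held-out data.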

Otto Train data

You can download the training dataset, train.csv.zip, from https://www.kaggle.com/c/otto-group-product-classification-challenge/data and place the unzipped train.csv file in your working directory.

This dataset describes 93 obfuscated details of more than 61,000 products grouped into nine product categories (for example, fashion and electronics). The input attributes are counts of different events.

The goal is to predict, for each new product, an array of probabilities with one entry for each of the nine categories. Models are evaluated using multiclass logarithmic loss (also called cross entropy).

In [1]:
from pandas import read_csv
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
np.random.seed(1)

In [2]:
# Function to create train and test sets from the original dataset
def getTrainTestData(dataset, split):
    np.random.seed(0)  # fixed seed so the split is reproducible
    # Shuffle the rows in place (this also reorders the caller's array)
    np.random.shuffle(dataset)
    # Use a plain int: np.uint16 would overflow above 65,535 rows
    trainlength = int(np.floor(split * np.shape(dataset)[0]))
    # The first trainlength rows form the training set, the rest the test set
    training = np.array(dataset[:trainlength])
    testing = np.array(dataset[trainlength:])
    return training, testing

In [3]:
# Function to evaluate model performance: fraction of correct predictions
def getAccuracy(pre, ytest):
    count = 0
    for i in range(len(ytest)):
        if ytest[i] == pre[i]:  # prediction matches the true label
            count += 1
    acc = float(count) / len(ytest)
    return acc
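The same accuracy can be computed without an explicit loop. The sketch below is an equivalent NumPy formulation on hypothetical toy labels (the arrays are made up for illustration), which can be handy for cross-checking the helper above.

```python
import numpy as np

# Hypothetical toy labels and predictions, for illustration only.
ytest = np.array([1, 3, 2, 2, 5])
pre = np.array([1, 3, 4, 2, 1])

# Element-wise comparison gives a boolean array; its mean is the
# fraction of positions where the prediction equals the true label.
acc = float(np.mean(pre == ytest))
print(acc)
```

Three of the five predictions match, so this accuracy is 3/5.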