<h2>Creating a training pipeline</h2>

Load the necessary modules:

In [2]:
from sklearn.datasets import make_blobs
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier
import pandas as pd

Let's generate some labeled sample data for training and testing. The scikit-learn
package has a built-in function, make_blobs, that handles this. In the following cell, we
create 150 data points, where each data point is a 25-dimensional feature vector. The values
are drawn from isotropic Gaussian clusters (three by default), and the label of each data
point is the index of the cluster it belongs to. Setting random_state makes the generated
data reproducible.

In [4]:
arg = {'n_samples':150,'n_features':25,'random_state':7}
X,Y = make_blobs(**arg)
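
As a quick sanity check, the generated arrays can be inspected before building anything on top of them. The following is a minimal sketch that reuses the same arguments as the cell above; the use of Counter to tally the labels is an addition for illustration and not part of the original notebook.

```python
from collections import Counter

from sklearn.datasets import make_blobs

# Same arguments as the cell above; centers is left at its default value
X, Y = make_blobs(n_samples=150, n_features=25, random_state=7)

print(X.shape)  # (150, 25): 150 samples, 25 features each
print(Y.shape)  # (150,): one integer label per sample

# Tally how many samples fall into each class label
print(sorted(Counter(Y.tolist()).items()))
```

The distinct values in Y are the class labels that the classifier will later learn to predict.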

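The first block in the pipeline is the feature selector. This block keeps the K best features,
ranked here by the f_regression score function, with K set to 9. For a class-label target such
as this one, f_classif is the more conventional score function.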

In [8]:
k_best = SelectKBest(f_regression,k = 9)
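
To see what the selector does on its own, it can be fitted outside a pipeline. This is a sketch under the assumption that the data comes from the same make_blobs call used earlier; the names selector and X_reduced are illustrative only.

```python
from sklearn.datasets import make_blobs
from sklearn.feature_selection import SelectKBest, f_regression

X, Y = make_blobs(n_samples=150, n_features=25, random_state=7)

# Score every feature against the labels and keep the 9 highest-scoring ones
selector = SelectKBest(f_regression, k=9)
X_reduced = selector.fit_transform(X, Y)

print(X_reduced.shape)         # (150, 9): only the selected columns remain
print(selector.scores_.shape)  # (25,): one score per original feature
```

Inside a pipeline, this transformation is applied automatically before the data reaches the classifier.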

The next block in the pipeline is an Extremely Random Forests classifier (ExtraTreesClassifier,
scikit-learn's extremely randomized trees) with 60 estimators and a maximum depth of four:

In [9]:
arg = {'n_estimators':60,'max_depth':4}
classifier = ExtraTreesClassifier(**arg)

Construct the pipeline by joining the individual blocks that we've created. We can
name each block so that it's easier to track:

In [10]:
processor_pipeline = Pipeline([('selector',k_best),('erf',classifier)])

We can change the parameters of the individual blocks. Let's change the value of K in the
first block to 7 and the number of estimators in the second block to 30. We will use the
names we assigned in the previous cell to define the scope, in the form
<block name>__<parameter name> (two underscores):

In [12]:
processor_pipeline.set_params(selector__k = 7 ,erf__n_estimators = 30)
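
The effect of set_params can be verified by reading the values back. The sketch below rebuilds an equivalent pipeline under the hypothetical name pipe, so that it runs on its own, and checks both the pipeline-level view and the underlying step object.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

# Equivalent to processor_pipeline before the parameter change
pipe = Pipeline([
    ('selector', SelectKBest(f_regression, k=9)),
    ('erf', ExtraTreesClassifier(n_estimators=60, max_depth=4)),
])

# <step name>__<parameter name> addresses a parameter of one step
pipe.set_params(selector__k=7, erf__n_estimators=30)

params = pipe.get_params()
print(params['selector__k'])           # value seen through the pipeline
print(params['erf__n_estimators'])
print(pipe.named_steps['selector'].k)  # same value on the step object itself
```

The same double-underscore keys are what tools such as GridSearchCV expect when tuning a pipeline.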

Train the pipeline using the sample data that we generated earlier:

In [13]:
processor_pipeline.fit(X,Y)

Predict the output for all the input values and print it:

In [14]:
y_pred = processor_pipeline.predict(X)
print("\n Predicted :",y_pred)


 Predicted : [1 1 0 0 0 2 1 1 1 1 1 2 0 2 1 0 0 2 2 0 1 1 0 2 1 2 2 2 1 1 0 1 1 2 2 2 2
 1 1 0 2 2 2 2 1 2 1 1 0 1 1 1 1 1 0 1 0 2 0 1 0 2 0 0 0 1 0 2 0 0 2 1 2 0
 0 0 2 0 2 1 0 0 2 0 2 0 1 2 1 0 0 2 2 1 1 0 2 1 0 0 0 2 2 0 0 2 1 2 1 2 2
 2 0 1 1 1 1 2 1 2 0 1 2 1 1 2 2 0 1 2 2 0 0 0 0 1 2 2 1 0 0 2 0 0 2 1 0 0
 2 1]


Compute the score using the labeled training data:

In [17]:
# Score against the true labels Y; scoring against y_pred would always give 1.00
print("\nScore:{:.2f}".format(processor_pipeline.score(X,Y)))


Score:1.00
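
The score above is measured on the same samples that were used for fitting, so it says little about how the pipeline behaves on unseen data. A common alternative is to hold out part of the data. The following is a minimal sketch, not part of the original notebook: the 75/25 split, the stratify option, and the fixed random_state values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, Y = make_blobs(n_samples=150, n_features=25, random_state=7)

# Hold out a quarter of the samples; stratify keeps the class balance
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=7, stratify=Y)

pipe = Pipeline([
    ('selector', SelectKBest(f_regression, k=7)),
    ('erf', ExtraTreesClassifier(n_estimators=30, max_depth=4,
                                 random_state=7)),
])

# Feature selection and the classifier are both fitted on the training part only
pipe.fit(X_train, Y_train)

test_score = pipe.score(X_test, Y_test)
print("Held-out score: {:.2f}".format(test_score))
```

Because the selector sits inside the pipeline, it never sees the held-out samples during fitting, which avoids leaking test information into the feature selection step.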


Extract the features chosen by the selector block. We specified that we wanted to choose 7
features out of 25. Use the following code:

In [22]:
status = processor_pipeline.named_steps['selector'].get_support()

# Extract and print the indices of the selected features
selected = [i for i,n in enumerate(status) if n]
print("\nIndices of selected features:", ', '.join([str(x) for x in selected]))


Indices of selected features: 5, 7, 10, 15, 16, 17, 21
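
When the features have names, the boolean mask returned by get_support() can be mapped back to them with pandas. This is a sketch only: the column names f0 to f24 are hypothetical, since the generated data has no real feature names, and the selector is fitted on its own here rather than inside the pipeline.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.feature_selection import SelectKBest, f_regression

X, Y = make_blobs(n_samples=150, n_features=25, random_state=7)

# Hypothetical column names, one per feature
columns = ['f{}'.format(i) for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=columns)

selector = SelectKBest(f_regression, k=7).fit(df, Y)

# Boolean mask over the 25 columns: True where the feature was kept
mask = selector.get_support()
selected_names = df.columns[mask].tolist()
print("Selected features:", ', '.join(selected_names))

# get_support(indices=True) returns the integer positions directly
selected_idx = selector.get_support(indices=True)
print("Selected indices:", selected_idx.tolist())
```

Calling get_support(indices=True) also replaces the manual enumerate loop when only the positions are needed.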
