# Creating a training pipeline

__Machine-learning systems are usually built from several modules.__ These modules are combined in a particular order to achieve an end goal. __The scikit-learn library provides tools, such as the `Pipeline` class, that let us build these pipelines by chaining modules together.__ We only need to specify the modules and their parameters; scikit-learn then builds a pipeline that processes the data and trains the system.

The pipeline can include modules that perform tasks such as preprocessing, feature selection, classification (for example, with random forests), and clustering. In this section, we will build a pipeline that selects the top K features of each input data point and then classifies the data points using an Extremely Random Forest classifier (`ExtraTreesClassifier` in scikit-learn).

In [7]:
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier
import numpy as np

Next, we create 150 data points spread across three classes, where each data point is a 25-dimensional feature vector. The values are drawn by a random sample generator, seeded with `random_state=7` so that the data is reproducible. Of the 25 features, 6 are informative and none are redundant; the remaining 19 carry no information about the class.

In [12]:
# Generate data
# (make_classification is imported directly from sklearn.datasets; the
# sklearn.datasets.samples_generator module was removed in scikit-learn 0.24)
X, y = make_classification(n_samples=150,
        n_features=25, n_classes=3, n_informative=6,
        n_redundant=0, random_state=7)
print("X Shape: " , np.shape(X))
print("X example: \n" , X[0])

print("\nY Shape: " , np.shape(y))
print(y[0])

X Shape:  (150, 25)
X example: 
 [ 1.01856035 -0.1850947   0.33953529  0.88377939 -2.22145741 -0.71205954
  0.46313981 -2.42424476 -0.07998485  0.03653191 -1.27561144 -1.5670243
 -0.82216114 -0.47040384  0.98701872 -0.34439804  0.02056176 -1.65437764
  0.94696772 -0.22854693  0.40599781  0.16376894 -0.89722827  2.43356744
 -0.69119524]

Y Shape:  (150,)
0
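As a quick sanity check on the generated data, a minimal sketch along the following lines confirms the shape and shows how the samples are distributed over the classes. It regenerates the data so that it runs on its own; the variable names `labels` and `counts` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as in the section above
X, y = make_classification(n_samples=150, n_features=25, n_classes=3,
                           n_informative=6, n_redundant=0, random_state=7)

# Distinct class labels and the number of samples in each class
labels, counts = np.unique(y, return_counts=True)
print("Feature matrix shape:", X.shape)
print("Classes:", labels)
print("Samples per class:", counts)
```

Roughly balanced class counts are what `make_classification` aims for by default, which is why plain accuracy is a reasonable metric later in this section.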


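The first block in the pipeline is the feature selector, which keeps the K features with the highest scores. The code below uses `f_regression` as the scoring function; note that it is designed for continuous targets, and `f_classif` is the usual choice for class labels. Let's set the value of K to 9, as follows: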

In [15]:
# Select top K features  
k_best_selector = SelectKBest(f_regression, k=9) 
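To see what the selector does on its own, outside a pipeline, the sketch below fits it directly and checks the shape of the reduced data. It deliberately swaps in `f_classif`, the ANOVA F-test that scikit-learn provides for classification targets, rather than the `f_regression` function used in this section; that substitution is an assumption made for illustration, and the selected features may differ from those reported later.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=150, n_features=25, n_classes=3,
                           n_informative=6, n_redundant=0, random_state=7)

# f_classif scores each feature against the class labels (assumed alternative
# to the f_regression scorer used in the main text)
selector = SelectKBest(f_classif, k=9)

# fit_transform scores all 25 features and keeps the 9 best columns
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (150, 9)
```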

The next block in the pipeline is an Extremely Random Forests classifier with 60 estimators and a maximum depth of 4:

In [16]:
# Initialize Extremely Random Forests classifier  
classifier = ExtraTreesClassifier(n_estimators=60, max_depth=4) 

Let's construct the pipeline by joining the individual blocks that we've constructed. We can name each block so that it's easier to track:

In [17]:
# Construct the pipeline 
processor_pipeline = Pipeline([('selector', k_best_selector), ('erf', classifier)]) 

We can change the parameters of the individual blocks after the pipeline has been built. Let's change the value of K in the first block to 7 and the number of estimators in the second block to 30. Each parameter is addressed as `<block name>__<parameter>` (two underscores), using the names we assigned when constructing the pipeline:

In [18]:
# Set the parameters 
processor_pipeline.set_params(selector__k=7, erf__n_estimators=30) 

Pipeline(memory=None,
     steps=[('selector', SelectKBest(k=7, score_func=<function f_regression at 0x0000021531103950>)), ('erf', ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=4, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_spl...ators=30, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False))])
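The printed representation is long and, in some versions, truncated, so it can be easier to read individual values back. The following sketch rebuilds the same pipeline so that it is self-contained (the variable names `pipeline` and `params` are illustrative) and uses `get_params()` and `named_steps` to confirm that `set_params` updated the underlying blocks.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('selector', SelectKBest(f_regression, k=9)),
    ('erf', ExtraTreesClassifier(n_estimators=60, max_depth=4)),
])

# Same update as in the section above
pipeline.set_params(selector__k=7, erf__n_estimators=30)

# get_params() exposes nested parameters under the same name__parameter keys
params = pipeline.get_params()
print(params['selector__k'])        # 7
print(params['erf__n_estimators'])  # 30

# The step objects themselves were modified in place
print(pipeline.named_steps['selector'].k)        # 7
print(pipeline.named_steps['erf'].n_estimators)  # 30
```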

Train the pipeline using the sample data that we generated earlier:

In [19]:
# Training the pipeline  
processor_pipeline.fit(X, y) 

Pipeline(memory=None,
     steps=[('selector', SelectKBest(k=7, score_func=<function f_regression at 0x0000021531103950>)), ('erf', ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=4, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_spl...ators=30, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False))])

Predict the output for all the input values and print it:

In [20]:
# Predict outputs for the input data 
output = processor_pipeline.predict(X) 
print("\nPredicted output:\n", output) 


Predicted output:
 [1 2 2 0 2 0 2 1 0 1 1 2 2 0 2 2 1 0 0 1 0 2 1 1 2 2 0 0 1 2 1 2 1 0 2 2 1
 1 2 2 2 0 1 2 2 1 2 2 1 0 1 2 2 2 2 0 2 2 0 2 2 0 1 0 2 2 1 1 1 2 0 1 0 2
 0 0 1 2 2 0 0 2 2 2 2 0 0 0 2 2 2 1 2 0 2 2 2 2 0 0 1 1 1 1 2 2 1 2 1 1 1
 0 2 1 1 0 1 1 1 1 0 0 0 1 2 0 0 0 2 1 2 0 0 1 1 1 1 0 1 1 1 2 0 2 0 1 2 0
 2 2]


In [21]:
print("\nActual output:\n" , y)


Actual output:
 [0 2 2 0 2 0 2 1 0 1 1 2 1 0 2 2 1 0 0 1 0 1 0 1 2 2 0 0 1 0 1 2 1 0 2 2 1
 1 2 2 2 0 0 0 2 1 1 2 1 0 1 2 2 1 2 0 2 2 0 2 2 0 1 0 2 1 1 1 1 2 0 1 0 2
 0 0 1 2 2 0 0 1 0 2 2 0 0 0 2 2 2 1 2 0 2 0 2 0 0 0 1 1 1 1 2 2 2 2 0 1 1
 0 2 1 1 0 1 1 1 1 0 0 0 1 2 0 0 0 2 1 2 0 0 1 0 1 1 0 1 1 1 1 2 2 0 1 1 0
 2 2]
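Comparing the two arrays by eye is error-prone. One way to summarize the differences, sketched here under the assumption that a fixed `random_state` on the classifier is acceptable (the original leaves it unset, so its exact predictions vary between runs), is to count the mismatches and build a confusion matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=150, n_features=25, n_classes=3,
                           n_informative=6, n_redundant=0, random_state=7)

# random_state=0 on the classifier is an assumption added for repeatability
pipeline = Pipeline([
    ('selector', SelectKBest(f_regression, k=7)),
    ('erf', ExtraTreesClassifier(n_estimators=30, max_depth=4, random_state=0)),
])
output = pipeline.fit(X, y).predict(X)

# Number of data points whose predicted label differs from the actual label
mismatches = int(np.sum(output != y))
print("Mismatched predictions:", mismatches, "of", len(y))

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y, output)
print(cm)
```

The diagonal of the matrix holds the correct predictions, so off-diagonal entries show which classes are being confused with each other.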


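Compute the score, which for a classifier is the mean accuracy, using the labeled training data. Because the same data was used for training, this figure is an optimistic estimate of performance on unseen data: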

In [22]:
# Print scores  
print("\nScore:", processor_pipeline.score(X, y)) 


Score: 0.8666666666666667
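For an estimate that does not reuse the training data, one option is k-fold cross-validation. The sketch below is not part of the original notebook: the choice of 5 folds and the fixed `random_state` on the classifier are assumptions. Because the whole pipeline is passed to `cross_val_score`, the feature selector is refit on the training portion of each fold, so the held-out portion never influences which features are chosen.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=150, n_features=25, n_classes=3,
                           n_informative=6, n_redundant=0, random_state=7)

# random_state=0 is an assumption added for repeatability
pipeline = Pipeline([
    ('selector', SelectKBest(f_regression, k=7)),
    ('erf', ExtraTreesClassifier(n_estimators=30, max_depth=4, random_state=0)),
])

# One accuracy value per fold; each fold is scored on data not used to fit it
scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

The cross-validated mean will typically be lower than the training score shown above, which is the expected gap between seen and unseen data.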


Extract the features chosen by the selector block. We specified that we wanted to choose 7 features out of 25. Use the following code:

In [23]:
# Print the features chosen by the pipeline selector 
status = processor_pipeline.named_steps['selector'].get_support() 
 
# Extract and print indices of selected features 
selected = [i for i, x in enumerate(status) if x] 
print("\nIndices of selected features:", ', '.join([str(x) for x in selected])) 



Indices of selected features: 4, 7, 8, 12, 14, 17, 22
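The boolean mask returned by `get_support()` can also be requested directly as indices, and it corresponds exactly to the columns that the selector passes on to the classifier. A minimal sketch, which rebuilds and fits the pipeline so that it is self-contained (the names `mask`, `indices`, and `X_selected` are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=150, n_features=25, n_classes=3,
                           n_informative=6, n_redundant=0, random_state=7)

pipeline = Pipeline([
    ('selector', SelectKBest(f_regression, k=7)),
    ('erf', ExtraTreesClassifier(n_estimators=30, max_depth=4)),
])
pipeline.fit(X, y)

selector = pipeline.named_steps['selector']

# Boolean mask over the 25 input features, and the same selection as indices
mask = selector.get_support()
indices = selector.get_support(indices=True)
print("Indices of selected features:", indices)

# The selector's transform keeps exactly the masked columns of X
X_selected = selector.transform(X)
print(X_selected.shape)  # (150, 7)
print(np.array_equal(X_selected, X[:, mask]))  # True

# Scores that the selector assigned to the chosen features
print(selector.scores_[indices])
```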


The predicted output list shows the labels that the pipeline predicted for each input data point. The score is the fraction of those predictions that match the actual labels (here about 0.87, measured on the training data). The last output lists the indices of the features chosen by the selector.