# Transformers and Pipelines



In [1]:
import pandas as pd

from sklearn.base import TransformerMixin,BaseEstimator
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

### Transformers

To start this discussion of transformers, consider the following toy dataset, which we will refer to as our corpus, consisting of 4 documents:


In [2]:
corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?',
         ]


The data is not yet in a state where we can use it to train a predictive model. To prepare the data for modelling we can use a transformer. An sklearn transformer is any object with both a 'fit' method, which learns the transform parameters, and a 'transform' method, which applies the transform. This is similar to the sklearn estimator object, which has both a 'fit' and a 'predict' method. For an estimator, the fit method learns the model parameters, and the predict method predicts the target variable. Slightly confusingly, sklearn also refers to transformers as a type of estimator.

In [3]:
# the procedure for using a transformer is as follows

# initialise an instance of our desired transformer
cv = CountVectorizer()

# fit the transformer to learn the necessary parameters
cv.fit(corpus)

# apply the transform
X_train_transformed = cv.transform(corpus)
X_train_transformed

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

The data, which was formerly a list of strings, is now a sparse matrix. It's easier to see what has happened if we convert the data into a dataframe.

In [4]:

X = pd.DataFrame(X_train_transformed.toarray(),
                 # get_feature_names() was removed in scikit-learn 1.2
                 columns=cv.get_feature_names_out()
                )

X.head()

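| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 3 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |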


The data has been transformed into a 'bag of words' representation. Our original data only had one feature (the text contained within the document) whereas now it has 9 (one for each unique word present in our corpus). The entry for each sample is how many times that feature/word appears in that document. When we apply the fit method of CountVectorizer, we are learning all of the unique words in our corpus. There are numerous other examples of transformers in the sklearn library. For instance, many preprocessing, dimension reduction, and feature engineering methods are implemented as transformers.
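To make the role of fit concrete, the sketch below inspects what CountVectorizer learned and then transforms a document it has never seen. The new sentence is made up for illustration. Words that were not in the corpus at fit time are ignored, and with the default settings single-character tokens such as "a" are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?',
         ]

cv = CountVectorizer()
cv.fit(corpus)

# vocabulary_ maps each learned word to its column index
print(sorted(cv.vocabulary_))
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

# a hypothetical unseen document: 'brand' and 'new' are not in the vocabulary
new_counts = cv.transform(['This is a brand new document.'])
print(new_counts.toarray())
# [[0 1 0 1 0 0 0 0 1]]
```

Only 'document', 'is', and 'this' are counted, because the columns were fixed when fit was called.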

You can also make your own custom transformer classes. Below is an example that transforms a sparse matrix into a dense one. If you are not familiar with classes, this article is a good intro:
https://realpython.com/python3-object-oriented-programming/#classes-in-python

In [5]:
class ToDenseTransformer(BaseEstimator,TransformerMixin):

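    # define the transform operation: convert the sparse matrix to a dense array
    # (toarray() returns a NumPy ndarray; todense() returns np.matrix,
    # which recent versions of sklearn no longer accept as input)
    def transform(self, X, y=None, **fit_params):
        return X.toarray()

    # no parameters to learn in this case,
    # so fit just returns the unchanged object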
    def fit(self, X, y=None, **fit_params):
        return self
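As a quick check of how such a class behaves, here is a minimal sketch that applies a copy of the transformer to a small hand-made sparse matrix. The copy returns a NumPy array via `toarray()`, and the input matrix is invented for illustration. Because the class inherits from TransformerMixin, it gets a `fit_transform` method for free.

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin


class ToDenseTransformer(BaseEstimator, TransformerMixin):

    # nothing to learn, so fit returns the object unchanged
    def fit(self, X, y=None):
        return self

    # convert a scipy sparse matrix into a dense NumPy array
    def transform(self, X):
        return X.toarray()


# a small made-up sparse matrix
sparse_X = sparse.csr_matrix([[0, 1], [2, 0]])

# fit_transform is provided by TransformerMixin: it calls fit, then transform
dense_X = ToDenseTransformer().fit_transform(sparse_X)
print(type(dense_X))
print(dense_X)
```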
  

### Pipelines

Pipelines are a convenient way to chain together multiple transformers sequentially. The restriction is that every step except the last must be a transformer (with fit and transform methods); the final step is usually an estimator with a predict method, which is what lets the pipeline as a whole make predictions. To demonstrate how to make and use a pipeline we will use the 20 Newsgroups dataset. The modelling task is to classify a document as being about religion or atheism.
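Before turning to the real data, the chaining idea can be sketched on a tiny made-up corpus. The documents, labels, and step names below are invented for illustration, and the TfidfTransformer step is an assumption added here only to show two transformers in a row. After fitting, each step can be retrieved by name through `named_steps`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# hypothetical toy data: 1 = religion, 0 = atheism
toy_docs = [
    "god faith church prayer",
    "church prayer faith belief",
    "atheism reason evidence science",
    "science evidence reason logic",
]
toy_labels = [1, 1, 0, 0]

# two transformers followed by a final estimator
toy_pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("logreg", LogisticRegression(solver="liblinear")),
])
toy_pipeline.fit(toy_docs, toy_labels)

# fitted steps are accessible by the names given in the tuples
vocabulary_size = len(toy_pipeline.named_steps["counts"].vocabulary_)
print(vocabulary_size)

# the pipeline accepts raw text at prediction time
toy_pred = toy_pipeline.predict(["faith prayer"])
print(toy_pred)
```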


In [6]:

# downloading the data
categories = [
    'alt.atheism',
    'talk.religion.misc',
]

training_data = fetch_20newsgroups(subset='train', categories=categories)
X_train = training_data.data
Y_train = training_data.target

testing_data = fetch_20newsgroups(subset='test', categories=categories)
X_test = testing_data.data
Y_test = testing_data.target

In [7]:
# create pipeline object
pipeline = Pipeline(
    [
        ("countvectorize", CountVectorizer()), 
        ("logreg", LogisticRegression(solver='liblinear'))
    ]
)
# the main argument when you create an instance of the Pipeline class is a list of tuples

# each tuple is one of the steps in the pipeline. The first element of the tuple is the desired name of the step, the second element is the transformer or estimator object.

# once the pipeline has been created it behaves like an estimator

# call the fit method on the whole pipeline
pipeline.fit(X_train, Y_train)

# this fits each transformer in turn, transforms the data with it, and passes the result to the next step in the pipeline. In the final step the predictive model is fitted. This allows you to wrap up your data processing and your estimator into one object, which means you can make predictions directly on unprocessed data.

y_pred = pipeline.predict(X_test)

# when you call the predict method, the data is sequentially transformed by all of the previously fit transformers in the pipeline before it is passed to the estimator for a prediction of the target variable.
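The fit and predict behaviour described in the comments above can be sketched by performing the same steps by hand and comparing the results. To keep the example small and free of downloads it uses a made-up toy corpus rather than the newsgroups data, so all document and label names here are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# hypothetical toy data: 1 = religion, 0 = atheism
docs_train = [
    "god faith church prayer",
    "church prayer faith belief",
    "atheism reason evidence science",
    "science evidence reason logic",
]
labels_train = [1, 1, 0, 0]
docs_test = ["faith and prayer", "evidence and logic"]

# the pipeline route: one fit call, one predict call
pipeline = Pipeline([
    ("countvectorize", CountVectorizer()),
    ("logreg", LogisticRegression(solver="liblinear")),
])
pipeline.fit(docs_train, labels_train)
pipeline_pred = pipeline.predict(docs_test)

# the manual route: fit and apply the transformer, then fit the model
cv = CountVectorizer()
counts_train = cv.fit_transform(docs_train)
logreg = LogisticRegression(solver="liblinear")
logreg.fit(counts_train, labels_train)

# at prediction time only transform is called, never fit
manual_pred = logreg.predict(cv.transform(docs_test))

# both routes should give identical predictions
print(np.array_equal(pipeline_pred, manual_pred))
```

The important detail is that the test documents are only ever transformed, using the vocabulary learned from the training documents.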

In [8]:
# you can also use the score method, which will access the estimator's score method within the pipeline. In this case the classification accuracy is the score metric.
pipeline.score(X_test, Y_test)

0.8
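To see that the pipeline's score is the classification accuracy here, the sketch below compares it with `accuracy_score` computed on the pipeline's own predictions. It uses a small made-up corpus instead of the newsgroups data, so the documents and labels are assumptions for illustration only.

```python
import math

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# hypothetical toy data: 1 = religion, 0 = atheism
docs_train = [
    "god faith church prayer",
    "church prayer faith belief",
    "atheism reason evidence science",
    "science evidence reason logic",
]
labels_train = [1, 1, 0, 0]
docs_test = ["faith church", "reason science", "prayer belief"]
labels_test = [1, 0, 1]

pipeline = Pipeline([
    ("countvectorize", CountVectorizer()),
    ("logreg", LogisticRegression(solver="liblinear")),
])
pipeline.fit(docs_train, labels_train)

# score delegates to the final estimator, which reports mean accuracy for a classifier
pipeline_score = pipeline.score(docs_test, labels_test)

# the same number computed explicitly from the predictions
explicit_accuracy = accuracy_score(labels_test, pipeline.predict(docs_test))

print(math.isclose(pipeline_score, explicit_accuracy))
```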