# Using Sklearn's Pipeline

Sklearn's `Pipeline` lets you go from raw data to a trained model in a repeatable way, as a series of sequential steps. Every step is encapsulated within the pipeline:

- the output of one step is the input for the next.
- each step is a tuple of two elements: a string `name` and an object that implements a preprocessing, transform, or fit operation.
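
As a minimal sketch of the two points above, the toy example below chains a scaler and a classifier. The data, the step names `scale` and `clf`, and the choice of `StandardScaler` are assumptions made for illustration only; they are not part of the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): 6 samples, 2 features, binary target
X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 240.0],
              [8.0, 900.0], [9.0, 950.0], [10.0, 870.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Each step is a (name, estimator) tuple; the scaler's output feeds the classifier
toy_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Fitting the pipeline fits and applies each step in order
toy_pipeline.fit(X, y)

# Steps can be looked up by the name given in the tuple
print(list(toy_pipeline.named_steps))
print(toy_pipeline.predict(X))
```

Naming the steps is what later makes it possible to inspect or replace a single stage without rebuilding the whole pipeline.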

In [1]:
# silence all warnings (including sklearn's deprecation warnings)
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# import our custom train_test split function
from multilabel import multilabel_train_test_split

# set seed for reproducibility
np.random.seed(0)

df = pd.read_csv('../data/TrainingData.csv', index_col=0)

NUMERIC_COLUMNS = ['FTE', 'Total']
LABELS = ['Function',
 'Use',
 'Sharing',
 'Reporting',
 'Student_Type',
 'Position_Type',
 'Object_Type',
 'Pre_K',
 'Operating_Status']

### Build a pipeline using numeric data only

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
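# `sklearn.preprocessing.Imputer` was removed in scikit-learn 0.22;
# SimpleImputer is its replacement (aliased so the code below is unchanged)
from sklearn.impute import SimpleImputer as Imputer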

# Split and select numeric data only
X_train, X_test, y_train, y_test = train_test_split(
    df[NUMERIC_COLUMNS], 
    pd.get_dummies(df[LABELS]), 
    test_size=0.3, 
    random_state=42
)

# fill in any missing values with the imputer
# by default it fills in NaNs with the mean of that column
steps = [
    ('imp', Imputer()),
    ('clf', OneVsRestClassifier(LogisticRegression()))
]

# instantiate the pipeline
pipeline = Pipeline(steps)

# Fit your pipeline to the training data
pipeline.fit(X_train, y_train)

# Compute its accuracy
pipeline.score(X_test, y_test)
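
To make the imputer's default behaviour concrete, here is a small sketch on a hypothetical array. It uses `SimpleImputer` from `sklearn.impute`, the class that replaced `Imputer` in newer scikit-learn releases; the toy values are assumptions chosen so the column means are easy to check by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
toy = np.array([[1.0, 10.0],
                [np.nan, 30.0],
                [3.0, np.nan]])

# The default strategy is "mean": each NaN becomes the mean of its column
imputer = SimpleImputer()
filled = imputer.fit_transform(toy)

print(filled)
# [[ 1. 10.]
#  [ 2. 30.]
#  [ 3. 20.]]
```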

### Build a pipeline using the text data

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# assign the filled column back instead of relying on inplace on a column accessor
df['Position_Extra'] = df['Position_Extra'].fillna('')
                         
# using one text column
X_train, X_test, y_train, y_test = train_test_split(
    df['Position_Extra'],
    pd.get_dummies(df[LABELS]),
    test_size=0.3,
    random_state=42
)

steps = [
    ('vec', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)),
    ('clf', OneVsRestClassifier(LogisticRegression()))
]

pipeline = Pipeline(steps)
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)
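
It can help to see what the token pattern actually matches. The sketch below applies the same regular expression with Python's `re` module to two strings; the first sample string is made up for illustration. Because of the lookahead `(?=\s+)`, a token only counts if it is followed by whitespace.

```python
import re

# Same pattern as used for CountVectorizer above
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Hypothetical example string
sample = "KINDERGARTEN teacher 2nd grade"
tokens = re.findall(TOKENS_ALPHANUMERIC, sample)
print(tokens)  # ['KINDERGARTEN', 'teacher', '2nd']

# A value with no whitespace at all yields no tokens
no_space_tokens = re.findall(TOKENS_ALPHANUMERIC, "PROFESSIONAL-INSTRUCTIONAL")
print(no_space_tokens)  # []
```

Note that the last word of a string is dropped unless it is followed by whitespace, which may or may not be what you want for short text fields like `Position_Extra`.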

# Build a pipeline processing text and numerical data

If we want to build a model that uses ALL available features, both numeric and text, we can't build a simple pipeline with these steps:

- the numeric and text preprocessing cannot follow each other, because each step will not know what to do with the other data type.
- output of the `CountVectorizer` can't be the input of the `Imputer`.
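
The first point can be checked directly. The sketch below feeds a text-only column to a mean imputer and catches the resulting error. It uses `SimpleImputer` (the newer name for `Imputer`), and the three-row frame is a hypothetical stand-in for the real data.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical text-only frame, mimicking the Position_Extra column
text_only = pd.DataFrame(
    {"Position_Extra": ["TEACHER", "UNDESIGNATED", "KINDERGARTEN"]}
)

try:
    # The mean strategy needs numeric input, so this is expected to fail
    SimpleImputer().fit_transform(text_only)
    imputer_failed = False
except (TypeError, ValueError) as err:
    imputer_failed = True
    print("Imputer rejected text input:", err)
```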

In order to build a pipeline, we need to operate on the numeric and text columns separately and then combine the results. Sklearn provides two classes for this, `FunctionTransformer` and `FeatureUnion`.

**FunctionTransformer**

Allows you to wrap plain Python functions so they can be used as steps in the pipeline.

- we'll use one to return just the numeric columns
- and a second to return just the text columns

Using these functions, we can build separate pipelines for text and numeric data.

**Feature Union**

Combines the outputs of the numeric and text pipelines by concatenating them column-wise; the combined result becomes the input to the classifier.
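
As a rough illustration of how the two pieces fit together, the sketch below runs a `FeatureUnion` over a three-row toy frame. The column names `amount` and `text`, the helper functions, and the data are all hypothetical; `SimpleImputer` stands in for `Imputer`, and `CountVectorizer` uses its default token pattern here.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical frame with one numeric and one text column
toy = pd.DataFrame({
    "amount": [1.0, np.nan, 3.0],
    "text": ["teacher aide", "bus driver", "teacher"],
})

def select_numeric(frame):
    # Keep a 2-D frame so the imputer sees a column
    return frame[["amount"]]

def select_text(frame):
    # Return a 1-D Series of strings for the vectorizer
    return frame["text"]

union = FeatureUnion([
    ("numeric", Pipeline([
        ("select", FunctionTransformer(select_numeric, validate=False)),
        ("impute", SimpleImputer()),
    ])),
    ("text", Pipeline([
        ("select", FunctionTransformer(select_text, validate=False)),
        ("vectorize", CountVectorizer()),
    ])),
])

# 1 imputed numeric column + 4 vocabulary columns (aide, bus, driver, teacher)
combined = union.fit_transform(toy)
print(combined.shape)  # (3, 5)
```

The numeric block comes first in the result because it is listed first in the union.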

Define `get_text_data` by wrapping a lambda function in `FunctionTransformer()` to obtain a single text column, `Position_Extra` in this example.

Define `get_numeric_data` the same way to obtain all the numeric columns (including missing values), specified by `NUMERIC_COLUMNS`.

#### FunctionTransformer

In [2]:
from sklearn.preprocessing import FunctionTransformer

# Obtain the text data: get_text_data
get_text_data = FunctionTransformer(lambda x: x['Position_Extra'], validate=False)

# Obtain the numeric data: get_numeric_data
get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)

# Fit and transform the text data: just_text_data
just_text_data = get_text_data.fit_transform(df)

# Fit and transform the numeric data: just_numeric_data
just_numeric_data = get_numeric_data.fit_transform(df)

# Print head to check results
print('Text Data')
print(just_text_data.head())
print('\nNumeric Data')
print(just_numeric_data.head())

Text Data
134338                 KINDERGARTEN 
206341                  UNDESIGNATED
326408                       TEACHER
364634    PROFESSIONAL-INSTRUCTIONAL
47683     PROFESSIONAL-INSTRUCTIONAL
Name: Position_Extra, dtype: object

Numeric Data
        FTE      Total
134338  1.0  50471.810
206341  NaN   3477.860
326408  1.0  62237.130
364634  NaN     22.300
47683   NaN     54.166


#### FeatureUnion

Sklearn's tools allow the streamlining of all preprocessing steps of our model, even when multiple datatypes are involved. For example, we don't want to impute our text data, and we don't want to create a bag-of-words with our numeric data. Instead, we deal with these separately and then join the results together using `FeatureUnion()`.

We'll still have two high-level steps in our pipeline: preprocessing and model instantiation. The difference is that the first preprocessing step actually consists of a pipeline for numeric data and a pipeline for text data.

In [9]:
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
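# `sklearn.preprocessing.Imputer` was removed in scikit-learn 0.22;
# SimpleImputer is its replacement (aliased so the code below is unchanged)
from sklearn.impute import SimpleImputer as Imputer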
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Fill NaN values in df.Position_Extra with empty string
df['Position_Extra'] = df['Position_Extra'].fillna('')

# Split using ALL data in df
X_train, X_test, y_train, y_test = train_test_split(df,
                                                    pd.get_dummies(df[LABELS]),
                                                    test_size=0.3,
                                                    random_state=42)

# Create a FeatureUnion with nested pipeline: process_and_join_features
process_and_join_features = FeatureUnion(
            transformer_list = [
                ('numeric_features', Pipeline([
                    ('selector', get_numeric_data),
                    ('imputer', Imputer())
                ])),
                ('text_features', Pipeline([
                    ('selector', get_text_data),
                    ('vectorizer', CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC))
                ]))
             ]
        )

# Instantiate nested pipeline: pl
pl = Pipeline([
        ('union', process_and_join_features),
        ('clf', OneVsRestClassifier(LogisticRegression()))
    ])


# Fit pl to the training data
pl.fit(X_train, y_train)

# Compute and print accuracy
accuracy = pl.score(X_test, y_test)
print("\nAccuracy on sample data - all data: ", accuracy)




Accuracy on sample data - all data:  0.0
