## Code Demo

Using some of the data from my capstone, we'll go through a relatively quick example of how a pipeline can be deployed for text analysis.

In [51]:
# Import the libraries, because that's what you do.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# sklearn.cross_validation was removed from scikit-learn; these helpers now live in model_selection
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report

In [29]:
# Read in the data, in this case roughly 4,000 rows covering all four authors, tokenized at the word level
# 'Code' is the column for which author it is: 1 is Twain, 2 is Wilde, 3 is Lincoln, 4 is 'Modern'
df_1kW = pd.read_csv('df_1k_w.csv', index_col=0, encoding='utf-8')

In [30]:
df_1kW.head()

|   | 0 | code |
|---|---|------|
| 0 | This, book, is, a, record, of, a, pleasure, tr... | 1 |
| 1 | ., But, it, was, not, ., There, was, a, tolera... | 1 |
| 2 | a, year, ., A, good, many, expedients, were, r... | 1 |
| 3 | no, particle, of, trimming, about, this, monst... | 1 |
| 4 | fog, !, We, got, plenty, of, fresh, oranges, ,... | 1 |


In [31]:
df_1kW.shape

(4004, 2)

Everything appears to be in order, so off we go! We won't be looking at the target quotes, just at the initial training data - the corpora of Twain, Wilde, Lincoln, and Modern - to see how this works.

In [33]:
X = df_1kW['0'].values
y = df_1kW['code'].values

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [36]:
# Initialize a CountVectorizer
cvec_W = CountVectorizer(decode_error='ignore', stop_words='english') # single-word count vectorizer

In [41]:
# Initialize the TF-IDF transformer, which will transform the count matrix into a TF-IDF matrix
tfidf = TfidfTransformer()

In [45]:
# Initialize the logistic regression
logit = LogisticRegression()

In [42]:
# Turn text into counts.
X_train_count = cvec_W.fit_transform(X_train)

In [43]:
# Turn counts into TF-IDF weights.
X_train_df = tfidf.fit_transform(X_train_count)
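To see what these two steps do without the capstone data, here is a minimal sketch on a made-up three-document corpus (`toy_docs` and the other names are hypothetical, for illustration only). It also checks that, with default settings, the two-step route gives the same matrix as scikit-learn's `TfidfVectorizer`, which bundles both steps into one object.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical toy corpus, not the capstone data.
toy_docs = ["the river boat", "the river bank", "a speech to the nation"]

# Step 1: raw term counts, one row per document, one column per word.
counts = CountVectorizer().fit_transform(toy_docs)

# Step 2: reweight the counts so words common to every document count for less.
weights = TfidfTransformer().fit_transform(counts)

# TfidfVectorizer does both steps at once (default settings on both sides).
one_step = TfidfVectorizer().fit_transform(toy_docs)

print(counts.shape, weights.shape)
print(np.allclose(weights.toarray(), one_step.toarray()))
```

The shape never changes between the two steps; only the cell values do, going from integer counts to normalized weights.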

In [46]:
# Fit the Logistic Regression to the training data
q_clf = logit.fit(X_train_df, y_train)

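We then repeat the process with the test data. The only change is that we call `transform` on the already-fitted transformers rather than `fit_transform`, so the test set is encoded with the vocabulary and IDF weights learned from the training data.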

In [48]:
X_test_count = cvec_W.transform(X_test)

In [49]:
X_test_df = tfidf.transform(X_test_count)

In [50]:
predicted = q_clf.predict(X_test_df)
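Why `transform` and not `fit_transform` on the test set? A minimal sketch on made-up sentences (all names here are hypothetical) shows the idea: the vocabulary is learned once from the training text, and test documents are mapped onto those same columns, with unseen words simply dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, not the capstone corpus.
train_docs = ["the river boat", "the river bank"]
test_docs = ["the boat sank"]  # "sank" never appears in the training text

vec = CountVectorizer()
train_counts = vec.fit_transform(train_docs)  # learns the vocabulary
test_counts = vec.transform(test_docs)        # reuses it, learns nothing new

print(sorted(vec.vocabulary_))          # words the vectorizer knows
print(train_counts.shape, test_counts.shape)
print(test_counts.toarray())
```

Both matrices end up with the same number of columns, which is what lets a model fitted on the training matrix score the test matrix. Calling `fit_transform` on the test documents instead would build a different vocabulary, and the columns would no longer line up.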

In [61]:
print(classification_report(y_test, predicted, target_names=['Twain', 'Wilde', 'Lincoln', 'Modern']))

             precision    recall  f1-score   support

      Twain       0.83      0.81      0.82       315
      Wilde       0.91      0.88      0.90       308
    Lincoln       0.92      0.93      0.92       292
     Modern       0.87      0.91      0.89       287

avg / total       0.88      0.88      0.88      1202



## Pipelines

So, we could do that. Over and over and over again. Or! We can build a pipeline!

As constructed, a pipeline is a list of tuples. The first element of each tuple is an arbitrary name, meaning we choose it. The second element is the transformer or estimator to run at that step; whatever we call the pipeline on is passed through the steps in order. (sklearn also provides a helper, `make_pipeline`, that lets you skip the names entirely. See the documentation for more information: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline)

In [58]:
quote_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('logit', LogisticRegression())])
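To make the naming concrete, here is a small sketch (not part of the capstone code) comparing a hand-named pipeline with `sklearn.pipeline.make_pipeline`, the helper that generates step names for us. The `set_params` call at the end is just an illustration of what the names are good for; the particular values chosen are assumptions, not tuned settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline

# Names chosen by hand, as in the cell above.
named = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('logit', LogisticRegression())])

# Names generated automatically from the lowercased class names.
auto = make_pipeline(CountVectorizer(), TfidfTransformer(), LogisticRegression())

named_names = [name for name, _ in named.steps]
auto_names = [name for name, _ in auto.steps]
print(named_names)
print(auto_names)

# The names let us reach into a step using the <step>__<parameter> syntax.
# Illustrative values only.
named.set_params(vect__stop_words='english', logit__C=10.0)
print(named.named_steps['vect'].stop_words)
```

Hand-picked names are shorter to type when setting parameters later, which is one reason to prefer the explicit list of tuples.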

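Let's try that again, shall we? We call the pipeline just as we would its component parts. (Note that the `CountVectorizer` here uses its default settings, without the `stop_words='english'` option from the step-by-step run, so the scores below need not match the earlier ones exactly.)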

In [54]:
quote_fit = quote_clf.fit(X_train, y_train)

In [59]:
quote_pred = quote_fit.predict(X_test)

In [62]:
print(classification_report(y_test, quote_pred, target_names=['Twain', 'Wilde', 'Lincoln', 'Modern']))

             precision    recall  f1-score   support

      Twain       0.80      0.76      0.78       315
      Wilde       0.87      0.87      0.87       308
    Lincoln       0.87      0.93      0.90       292
     Modern       0.89      0.88      0.88       287

avg / total       0.86      0.86      0.86      1202
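Because the pipeline behaves like a single estimator, it can also be handed to helpers such as `cross_val_score`, which refits every step inside each fold. The sketch below is an assumption about how that might look, using a made-up toy corpus (`toy_X`, `toy_y`) rather than the capstone data, so the scores it prints say nothing about the real model.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the capstone data.
toy_X = [
    "the river boat drifted south",
    "a raft on the wide river",
    "the steamboat pilot knew the river",
    "boys on a raft at night",
    "the pilot steered the boat",
    "the river ran past the town",
    "the union must be preserved",
    "a nation conceived in liberty",
    "the people of this nation",
    "liberty and union for the people",
    "government of the people",
    "the nation shall not perish",
]
toy_y = [1] * 6 + [3] * 6  # 1 = Twain-like, 3 = Lincoln-like (made-up labels)

toy_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('logit', LogisticRegression())])

# Each fold fits the vectorizer, the TF-IDF step, and the model on its own
# training split, then scores on the held-out split.
scores = cross_val_score(toy_clf, toy_X, toy_y, cv=3)
print(scores)
print(scores.mean())
```

Doing the same thing with the separate objects would mean repeating the fit/transform bookkeeping by hand for every fold.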

