We'll take the ideas introduced in Rachael Tatman's [Beginner Tutorial: Python](https://www.kaggle.com/rtatman/beginner-s-tutorial-python), scale them up into a complete machine learning pipeline, and add some basic feature engineering.

First, we need to make quite a few imports. It's a long list, but everything important is a component of [pandas](http://pandas.pydata.org/pandas-docs/stable/) for data manipulation, [spaCy](https://spacy.io/docs/usage/lightning-tour) for text processing, or [scikit-learn](http://scikit-learn.org/stable/) for machine learning. I'll assume you have general familiarity with pandas and scikit-learn.

In [1]:
import pandas as pd
import spacy

from multiprocessing import cpu_count
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from spacy import attrs
from spacy.symbols import VERB, NOUN, ADV, ADJ

Next we'll declare important constants, per the Python style guide, [PEP8](https://www.python.org/dev/peps/pep-0008). This isn't strictly necessary, but makes for cleaner code.

In [2]:
TEXT_COLUMN = 'text'
Y_COLUMN = 'author'
TRAIN_DATA_FILE = "train_data_spooky_author.csv"

We're going to run a couple of different models with different sets of features, so it's worth taking a moment to set up our model evaluation process as its own function.

For evaluation, we need to do several things:
1. Split the input dataframe into a feature dataframe and a label dataframe (X and y).
2. Conduct feature engineering.
3. Train the model.
4. Perform cross validation.
5. Report the relevant score. In this case, we'll use log loss to match the competition's evaluation.
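
To make the last step concrete, here is a minimal pure-Python sketch of multiclass log loss: the average negative log of the probability the model assigned to the true author. The function name, the toy probabilities, and the clipping constant `eps` are illustrative assumptions; the evaluation code in this notebook relies on `sklearn.metrics.log_loss` instead.

```python
import math


def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Average negative log-probability assigned to the true class.

    predicted_probs is a list of {label: probability} dicts, one per row.
    Probabilities are clipped to [eps, 1 - eps] so log(0) can never occur.
    """
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        clipped = min(max(probs[label], eps), 1 - eps)
        total -= math.log(clipped)
    return total / len(true_labels)


# Toy example (made-up numbers): three sentences, one per author.
authors = ['EAP', 'HPL', 'MWS']
fairly_confident = [
    {'EAP': 0.8, 'HPL': 0.1, 'MWS': 0.1},
    {'EAP': 0.3, 'HPL': 0.6, 'MWS': 0.1},
    {'EAP': 0.25, 'HPL': 0.25, 'MWS': 0.5},
]
uniform_guess = [{author: 1 / 3 for author in authors} for _ in authors]

confident_loss = multiclass_log_loss(authors, fairly_confident)
uniform_loss = multiclass_log_loss(authors, uniform_guess)
print(round(confident_loss, 3), round(uniform_loss, 3))
```

Guessing one third for every author scores ln(3), roughly 1.099, so any model worth keeping should land well below that.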

Integrating [Scikit-learn pipelines](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) into our evaluation makes this straightforward to repeat with different models and features.

In [3]:
def test_pipeline(df, nlp_pipeline, pipeline_name=''):
    y = df[Y_COLUMN].copy()
    X = pd.Series(df[TEXT_COLUMN])
    # If you've done EDA, you may have noticed that the author classes aren't quite balanced.
    # We'll use stratified splits just to be on the safe side.
    # random_state only has an effect when shuffle=True (newer scikit-learn versions raise
    # an error if it is set without shuffling), so we leave it out and keep the default folds.
    rskf = StratifiedKFold(n_splits=5)
    losses = []
    for train_index, test_index in rskf.split(X, y):
        # split() yields positional indices, so select rows with .iloc.
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        nlp_pipeline.fit(X_train, y_train)
        losses.append(metrics.log_loss(y_test, nlp_pipeline.predict_proba(X_test)))
    print(f'{pipeline_name} kfolds log losses: {[str(round(x, 3)) for x in sorted(losses)]}')
    # pd.np is gone from recent pandas releases, so take the mean via a Series instead.
    print(f'{pipeline_name} mean log loss: {round(pd.Series(losses).mean(), 3)}')
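
For intuition about what the stratified splits buy us, here is a simplified pure-Python sketch that deals each author's rows round-robin into folds, so every fold keeps the same class mix as the full dataset. This is not scikit-learn's actual algorithm; `stratified_folds` and the toy label counts are made up for illustration.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits):
    """Assign row positions to folds so each fold mirrors the overall class mix."""
    rows_by_class = defaultdict(list)
    for position, label in enumerate(labels):
        rows_by_class[label].append(position)

    folds = [[] for _ in range(n_splits)]
    for rows in rows_by_class.values():
        # Deal this class's rows out like cards, one fold at a time.
        for i, row in enumerate(rows):
            folds[i % n_splits].append(row)
    return folds


# Deliberately unbalanced toy labels: twice as many EAP rows as the others.
toy_labels = ['EAP'] * 10 + ['HPL'] * 5 + ['MWS'] * 5
folds = stratified_folds(toy_labels, n_splits=5)
fold_class_counts = [Counter(toy_labels[row] for row in fold) for fold in folds]
for counts in fold_class_counts:
    print(dict(counts))
```

Every fold ends up with two EAP rows and one each of HPL and MWS, which is the same 2:1:1 ratio as the toy dataset.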

We're ready to load the data and run our first model. We'll start with the exact same model as in Rachael's tutorial: a naive Bayes classifier on unigram probabilities. Using sklearn instead of implementing everything ourselves will make this both easier to code up and faster to run.

The `Id` column doesn't actually help us (or if it does, isn't really in the spirit of an NLP competition), so we'll skip over it.

In [4]:
train_df = pd.read_csv(TRAIN_DATA_FILE, usecols=[TEXT_COLUMN, Y_COLUMN])

In [82]:
unigram_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('mnb', MultinomialNB())
                        ])
test_pipeline(train_df, unigram_pipe, "Unigrams only")

Unigrams only kfolds log losses: ['0.455', '0.46', '0.47', '0.473', '0.474']
Unigrams only mean log loss: 0.466


In [88]:
unigram_pipe.classes_
train_df[Y_COLUMN].unique()
text_series = train_df[TEXT_COLUMN].copy()

In [97]:
unigram_predictions = pd.DataFrame(
    unigram_pipe.predict_proba(text_series),
    columns=['naive_bayes_pred_' + x for x in unigram_pipe.classes_])

unigram_predictions.drop(unigram_predictions.columns[0], axis=1, inplace=True)
unigram_predictions

Unnamed: 0,naive_bayes_pred_HPL,naive_bayes_pred_MWS
0,9.777886e-08,1.808110e-08
1,1.926767e-01,4.028278e-02
2,5.844533e-04,7.521696e-08
3,3.250437e-10,1.000000e+00
4,9.197734e-01,9.724659e-04
5,3.665646e-16,1.000000e+00
6,1.948935e-03,4.957886e-05
7,1.559503e-02,1.612274e-02
8,3.715539e-10,1.294066e-06
9,1.061606e-04,9.933449e-01


In [100]:
df = pd.DataFrame(text_series.reset_index(drop=True))
df = df.merge(unigram_predictions, left_index=True, right_index=True)

Since we want to turn this into a nice clean pipeline, we'll do all of our feature engineering using custom transformers. This first transformer takes the unigram pipeline that we built above and returns the predicted probabilities as features. We could use the raw CountVectorizer output and let our final model deal with the unigram features directly, but there are two reasons not to:

- CountVectorizer returns [a sparse format](https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.sparse.csr_matrix.html) that is a pain to integrate with the rest of our pipeline.
- Pairing CountVectorizer with MultinomialNB lets us skip converting the word counts to probabilities, and skip ensuring that those probabilities are never exactly zero. See the `alpha` parameter in the [MultinomialNB documentation](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB). As long as we use the default value of 1, the model will perform this task ([Laplace smoothing](https://en.wikipedia.org/wiki/Additive_smoothing)) for us.
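
As a rough illustration of what that smoothing does, here is a pure-Python sketch of additive smoothing for a single author. The function name, the three-word vocabulary, and the token list are invented for the example; MultinomialNB handles this internally and works on the full vocabulary.

```python
from collections import Counter


def smoothed_word_probs(tokens, vocabulary, alpha=1.0):
    """Additive (Laplace) smoothing: add alpha to every word count before normalizing."""
    counts = Counter(tokens)
    total = sum(counts[word] for word in vocabulary)
    denominator = total + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}


# Hypothetical data: this author never used the word 'cosmic'.
vocabulary = ['raven', 'monster', 'cosmic']
author_tokens = ['raven', 'raven', 'monster']

probs = smoothed_word_probs(author_tokens, vocabulary)
print(probs)
```

Without smoothing, 'cosmic' would get a probability of exactly zero, and a single unseen word would wipe out the author's score for the whole sentence. With `alpha=1` it gets a small but nonzero share instead.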

In [6]:
class UnigramPredictions(TransformerMixin):
    def __init__(self):
        self.unigram_mnb = Pipeline([('text', CountVectorizer()), ('mnb', MultinomialNB())])

    def fit(self, x, y=None):
        # Every custom transformer requires a fit method. In this case, we want to train
        # the naive bayes model.
        self.unigram_mnb.fit(x, y)
        return self
    
    def add_unigram_predictions(self, text_series):
        # Resetting the index ensures the indexes equal the row numbers.
        # This guarantees nothing will be misaligned when we merge the dataframes further down.
        df = pd.DataFrame(text_series.reset_index(drop=True))
        # Make unigram predicted probabilities and label them with the prediction class, aka 
        # the author.
        unigram_predictions = pd.DataFrame(
            self.unigram_mnb.predict_proba(text_series),
            columns=['naive_bayes_pred_' + x for x in self.unigram_mnb.classes_]
                                           )
        # We only need 2 out of 3 columns, as each column is always one minus the
        # sum of the other two. In some cases, that collinearity can actually be problematic.
        del unigram_predictions[unigram_predictions.columns[0]]
        df = df.merge(unigram_predictions, left_index=True, right_index=True)
        return df

    def transform(self, text_series):
        # Every custom transformer also requires a transform method. This time we just want to 
        # provide the unigram predictions.
        return self.add_unigram_predictions(text_series)
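
If the fit/transform contract is new to you, here is a stripped-down sketch of it with no scikit-learn dependency. `FitTransformMixin` is a hypothetical stand-in for the one thing we lean on from `TransformerMixin` (a `fit_transform` method that chains `fit` and `transform`), and `AverageWordLength` is a toy transformer invented for this example.

```python
class FitTransformMixin:
    """Hypothetical stand-in for TransformerMixin: chains fit() and transform()."""

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


class AverageWordLength(FitTransformMixin):
    """Toy transformer: turns each text into its average whitespace-split word length."""

    def fit(self, X, y=None):
        # Nothing to learn, but fit() must still return self so calls can be chained.
        return self

    def transform(self, X):
        features = []
        for text in X:
            words = text.split()
            average = sum(len(word) for word in words) / len(words) if words else 0.0
            features.append(average)
        return features


word_lengths = AverageWordLength().fit_transform(['a gold snuff box', 'spring'])
print(word_lengths)
```

The important habit is returning `self` from `fit`: a pipeline fits each step and then immediately transforms with it before handing the result to the next step.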

It's time to start adding new features with spaCy. We'll flag the main parts of speech used in each sentence, average word length, and overall sentence length.

The single slowest step of working with spaCy is often loading the model in the first place, so we'll ensure this step is only done once. By default, spaCy will tag each word, build a dependency model, and perform entity recognition. We only need the part of speech tags, so we'll restrict the pipeline accordingly. In tests on my local machine, this sped up the parse by 5-10x.

In [8]:
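# 'en' was a shortcut link to this model in older spaCy releases. The full package name
# also works in newer versions, where the shortcut is no longer available.
NLP = spacy.load('en_core_web_sm', disable=['parser', 'ner'])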

In [101]:
cpu_count()
train_sm = df
x = [i for i in NLP.pipe(train_sm[TEXT_COLUMN].values, n_threads=cpu_count())]
train_sm['doc'] = x


In [66]:
df = train_sm
df['pos_counts'] = df['doc'].apply(lambda x: x.count_by(attrs.POS))
df['sentence_length'] = df['doc'].str.len()




In [102]:
df.head()

Unnamed: 0,text,naive_bayes_pred_HPL,naive_bayes_pred_MWS,doc
0,"This process, however, afforded me no means of...",9.777886e-08,1.80811e-08,"(This, process, ,, however, ,, afforded, me, n..."
1,It never once occurred to me that the fumbling...,0.1926767,0.04028278,"(It, never, once, occurred, to, me, that, the,..."
2,"In his left hand was a gold snuff box, from wh...",0.0005844533,7.521696e-08,"(In, his, left, hand, was, a, gold, snuff, box..."
3,How lovely is spring As we looked from Windsor...,3.250437e-10,1.0,"(How, lovely, is, spring, As, we, looked, from..."
4,"Finding nothing else, not even gold, the Super...",0.9197734,0.0009724659,"(Finding, nothing, else, ,, not, even, gold, ,..."


In [104]:
def part_of_speechiness(pos_counts, part_of_speech):
    # eval() turns the tag name (e.g. 'VERB') into the integer ID imported from spacy.symbols.
    # Parts of speech that never occur in the sentence count as zero.
    return pos_counts.get(eval(part_of_speech), 0)

df['pos_counts'] = df['doc'].apply(lambda x: x.count_by(attrs.POS))
# We get a very minor speed boost here by using pandas built in string methods
# instead of df['doc'].apply(len). String processing is generally slow in python,
# use the pandas string methods directly where possible.
df['sentence_length'] = df['doc'].str.len()


for part_of_speech in ['NOUN', 'VERB', 'ADJ', 'ADV']:
    df[f'{part_of_speech.lower()}iness'] = df['pos_counts'].apply(
                lambda x: part_of_speechiness(x, part_of_speech))
    df[f'{part_of_speech.lower()}iness'] /= df['sentence_length']
    df['avg_word_length'] = (df['doc'].apply(
            lambda x: sum([len(word) for word in x])) / df['sentence_length'])


In [105]:
df.head()

Unnamed: 0,text,naive_bayes_pred_HPL,naive_bayes_pred_MWS,doc,pos_counts,sentence_length,nouniness,avg_word_length,verbiness,adjiness,adviness
0,"This process, however, afforded me no means of...",9.777886e-08,1.80811e-08,"(This, process, ,, however, ,, afforded, me, n...","{96: 7, 99: 8, 83: 4, 84: 6, 85: 3, 88: 1, 89:...",48,0.1875,3.979167,0.166667,0.083333,0.0625
1,It never once occurred to me that the fumbling...,0.1926767,0.04028278,"(It, never, once, occurred, to, me, that, the,...","{96: 1, 99: 3, 84: 2, 85: 2, 83: 1, 89: 2, 91:...",15,0.133333,3.866667,0.2,0.066667,0.133333
2,"In his left hand was a gold snuff box, from wh...",0.0005844533,7.521696e-08,"(In, his, left, hand, was, a, gold, snuff, box...","{96: 5, 99: 4, 83: 7, 84: 6, 85: 1, 89: 5, 91:...",41,0.243902,4.02439,0.097561,0.170732,0.02439
3,How lovely is spring As we looked from Windsor...,3.250437e-10,1.0,"(How, lovely, is, spring, As, we, looked, from...","{96: 4, 99: 6, 83: 6, 84: 6, 85: 2, 88: 2, 89:...",38,0.157895,4.552632,0.157895,0.157895,0.052632
4,"Finding nothing else, not even gold, the Super...",0.9197734,0.0009724659,"(Finding, nothing, else, ,, not, even, gold, ,...","{96: 4, 99: 5, 83: 4, 84: 3, 85: 4, 88: 1, 89:...",31,0.193548,4.774194,0.16129,0.129032,0.129032
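
To see how the `pos_counts` dictionaries turn into the fraction columns, here is a small sketch that reproduces three values from the first row above without `eval()`. The integer IDs are an assumption read off these outputs by matching counts to fractions (for example, 8 / 48 = 0.166667 implies VERB is 99 here); they vary between spaCy versions. Only the counts visible in the truncated output are used, and `pos_fractions` is a name made up for this example.

```python
# Assumed mapping, inferred from the outputs above; check your own spaCy version.
POS_IDS = {'VERB': 99, 'ADJ': 83, 'ADV': 85}

# The visible part of the first row's pos_counts, plus its sentence_length.
visible_counts = {96: 7, 99: 8, 83: 4, 84: 6, 85: 3, 88: 1}
sentence_length = 48


def pos_fractions(pos_counts, sentence_length, pos_ids):
    """Fraction of tokens carrying each part of speech (0 when a tag is absent)."""
    return {
        f'{name.lower()}iness': pos_counts.get(pos_id, 0) / sentence_length
        for name, pos_id in pos_ids.items()
    }


fractions = pos_fractions(visible_counts, sentence_length, POS_IDS)
print({name: round(value, 6) for name, value in fractions.items()})
```

An explicit name-to-ID dictionary does the same job as the `eval()` lookup, at the cost of hard-coding IDs that may not match your installation.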


In [None]:
class PartOfSpeechFeatures(TransformerMixin):
    def __init__(self):
        self.NLP = NLP
        # Store the number of cpus available for when we do multithreading later on
        self.num_cores = cpu_count()

    def part_of_speechiness(self, pos_counts, part_of_speech):
        if eval(part_of_speech) in pos_counts:
            return pos_counts[eval(part_of_speech).numerator]
        return 0

    def add_pos_features(self, df):
        text_series = df[TEXT_COLUMN]
        """
        Parse each sentence with part of speech tags. 
        Using spaCy's pipe method gives us multi-threading 'for free'. 
        This is important as this is by far the single slowest step in the pipeline.
        If you want to test this for yourself, you can use:
            from time import time 
            start_time = time()
            (some code)
            print(f'Code took {time() - start_time} seconds')
        For faster functions the timeit module would be standard... but that's
        meant for situations where you can wait for the function to be called 1,000 times.
        """
        df['doc'] = [i for i in self.NLP.pipe(text_series.values, n_threads=self.num_cores)]
        df['pos_counts'] = df['doc'].apply(lambda x: x.count_by(attrs.POS))
        # We get a very minor speed boost here by using pandas built in string methods
        # instead of df['doc'].apply(len). String processing is generally slow in python,
        # use the pandas string methods directly where possible.
        df['sentence_length'] = df['doc'].str.len()
        # This next step generates the fraction of each sentence that is composed of a 
        # specific part of speech.
        # There's admittedly some voodoo in this step. Math can be more highly optimized in python
        # than string processing, so spaCy really stores the parts of speech as numbers. If you
        # try >>> VERB in the console you'll get an integer ID back (the exact number depends
        # on your spaCy version).
        # The monkey business with eval() here allows us to generate several named columns
        # without hard-coding the mapping from names like 'VERB' to those IDs.
        for part_of_speech in ['NOUN', 'VERB', 'ADJ', 'ADV']:
            df[f'{part_of_speech.lower()}iness'] = df['pos_counts'].apply(
                lambda x: self.part_of_speechiness(x, part_of_speech))
            df[f'{part_of_speech.lower()}iness'] /= df['sentence_length']
        df['avg_word_length'] = (df['doc'].apply(
            lambda x: sum([len(word) for word in x])) / df['sentence_length'])
        return df

    def fit(self, x, y=None):
        # since this transformer doesn't train a model, we don't actually need to do anything here.
        return self

    def transform(self, df):
        return self.add_pos_features(df.copy())

Finally, sklearn models generally don't accept strings as inputs, so we'll need to drop all string columns. This includes the original 'text' column that we read from the csv!

In [None]:
class DropStringColumns(TransformerMixin):
    # You may have noticed something odd about this class: there's no __init__!
    # This transformer has no state to set up, so Python's default constructor is all we need.
    def fit(self, x, y=None):
        return self

    def transform(self, df):
        for col, dtype in zip(df.columns, df.dtypes):
            if dtype == object:
                del df[col]
        return df

In [107]:
annotated_df = df.copy()
for col, dtype in zip(df.columns, df.dtypes):
    if dtype == object:
        del df[col]
df.head()



Unnamed: 0,naive_bayes_pred_HPL,naive_bayes_pred_MWS,sentence_length,nouniness,avg_word_length,verbiness,adjiness,adviness
0,9.777886e-08,1.80811e-08,48,0.1875,3.979167,0.166667,0.083333,0.0625
1,0.1926767,0.04028278,15,0.133333,3.866667,0.2,0.066667,0.133333
2,0.0005844533,7.521696e-08,41,0.243902,4.02439,0.097561,0.170732,0.02439
3,3.250437e-10,1.0,38,0.157895,4.552632,0.157895,0.157895,0.052632
4,0.9197734,0.0009724659,31,0.193548,4.774194,0.16129,0.129032,0.129032


In [115]:
print(len(X), len(y))
y.head()
X.loc[train_index]

19579 19579


Unnamed: 0,naive_bayes_pred_HPL,naive_bayes_pred_MWS,sentence_length,nouniness,avg_word_length,verbiness,adjiness,adviness
3851,8.956342e-08,9.996697e-01,21,0.190476,4.809524,0.238095,0.190476,0.000000
3855,8.065876e-09,9.995084e-01,34,0.147059,3.323529,0.235294,0.058824,0.000000
3864,2.704521e-14,1.000000e+00,46,0.239130,4.021739,0.152174,0.108696,0.043478
3868,3.609708e-08,9.999788e-01,29,0.241379,4.103448,0.137931,0.137931,0.103448
3869,1.831372e-07,4.702795e-01,49,0.122449,3.755102,0.061224,0.102041,0.122449
3870,1.201069e-02,4.491862e-01,11,0.090909,3.727273,0.181818,0.000000,0.272727
3871,5.702152e-05,9.966904e-01,13,0.153846,4.230769,0.076923,0.153846,0.230769
3872,2.286652e-01,7.345459e-03,15,0.200000,4.400000,0.200000,0.000000,0.133333
3874,4.762694e-07,9.999167e-01,50,0.280000,3.920000,0.120000,0.140000,0.020000
3875,1.803668e-03,7.965739e-04,26,0.115385,3.961538,0.153846,0.038462,0.115385


In [116]:
X = df
y = train_df[Y_COLUMN]
# random_state only has an effect when shuffle=True, so we leave it out here.
rskf = StratifiedKFold(n_splits=5)
losses = []
nlp_pipeline = LogisticRegression()
for train_index, test_index in rskf.split(X, y):
    # split() yields positional indices, so select rows with .iloc.
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    nlp_pipeline.fit(X_train, y_train)
    losses.append(metrics.log_loss(y_test, nlp_pipeline.predict_proba(X_test)))

print(f'kfolds log losses: {[str(round(x, 3)) for x in sorted(losses)]}')
print(f'mean log loss: {round(pd.Series(losses).mean(), 3)}')

kfolds log losses: ['0.255', '0.258', '0.258', '0.263', '0.458']
mean log loss: 0.299


If you want to experiment with different combinations of features, try writing your own transformers and adding them to the pipeline.

If you're running this at home, expect this next step to take around 30 seconds, as we're retraining the model several times during cross validation.

In [None]:
logit_all_features_pipe = Pipeline([
        ('uni', UnigramPredictions()),
        ('nlp', PartOfSpeechFeatures()),
        ('clean', DropStringColumns()), 
        ('clf', LogisticRegression())
                                     ])
test_pipeline(train_df, logit_all_features_pipe)

This pipeline is better... but only just barely. I'll leave it as an exercise for you to add better features and more powerful models. However, if we did want to submit this, we'd just feed `logit_all_features_pipe` into the `generate_submission_df` function.

In [None]:
def generate_submission_df(trained_prediction_pipeline, test_df):
    predictions = pd.DataFrame(
        trained_prediction_pipeline.predict_proba(test_df.text),
        columns=trained_prediction_pipeline.classes_
                               )
    predictions['id'] = test_df['id']
    predictions.to_csv("submission.csv", index=False)
    return predictions

Exercises:
1. Update the `PartOfSpeechFeatures` transformer to record all parts of speech, not just the original four.
2. Can you generate a useful feature with spaCy's dependency parser? Fair warning, I haven't tried yet so the answer may well be no!
3. More challenging: Kevin Schiroo figured out that [sentences for MWS are missing exclamation marks](https://www.kaggle.com/c/spooky-author-identification/discussion/42135). A simple regex based on capital letters like `re.sub(r'\b (?=[A-Z])', '! ', sentence)` would insert too many exclamation points by treating names as ends of sentences. Can you use spaCy's entity recognition model to clean those up?
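
To see the problem in that last exercise, here is the naive regex applied to the visible start of one of the training sentences shown earlier. The helper name is made up, and this only demonstrates the failure; it is not a solution.

```python
import re


def naive_exclamation_fix(sentence):
    """Insert '!' before every space that is followed by a capital letter."""
    return re.sub(r'\b (?=[A-Z])', '! ', sentence)


snippet = 'How lovely is spring As we looked from Windsor'
fixed = naive_exclamation_fix(snippet)
print(fixed)
# How lovely is spring! As we looked from! Windsor
```

The break before "As" is plausibly a real missing exclamation mark, but the one before "Windsor" is a false positive triggered by a proper noun, which is exactly where entity recognition could help.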