This is an advanced example of model evaluation that adds POS tag vectorization (based on another column in the data set), using FeatureUnion in a pipeline and GridSearchCV with a cross-validation splitting strategy (10 folds in a (Stratified)KFold).

I use a Bernoulli Naive Bayes classifier only because, on these data, all classifiers provide more or less the same results.
See https://github.com/KaterynaD/TweetsAutorshipAttributionModelsEvaluation/blob/master/Autorship%2Battribution%2Bmodels%2Bevaluation.ipynb

In terms of results, the char and stemmed word vectorizers provide good results. The other combinations (with or without the POS tag vectorizer) may provide slightly better or slightly worse results.
For most pairs of authors I get 90 - 96% accuracy. I experimented with data sets of 500 - 2000 rows per author.

In [779]:
import pandas as pd
from pandas import Series,DataFrame
import numpy as np

To reduce sparsity, I pre-process the data before analysis:
removing re-tweets
removing short messages (fewer than 4 words)
replacing @ mentions with REF
replacing any URL with URL
replacing any date with DATE
replacing any time with TIME
replacing digits with NUM

In [780]:
#data
df=pd.read_csv(r'C:\Kate\Python\Authorship Attribution\data\AllTweets.csv')
author1='KimKardashian'
author2='HillaryClinton'
df_kk=df.loc[(df['author'] == author1)]
df_hc=df.loc[(df['author'] == author2)]
df=pd.concat([df_kk,df_hc],ignore_index=True)
len(df)

14044

In [781]:
#2000 random sample rows for KK
df_kk = df_kk.sample(n=2000)
#2000 random sample rows for HC
df_hc = df_hc.sample(n=2000)
#join back together
df=pd.concat([df_kk,df_hc],ignore_index=True)
len(df)

4000

In [782]:
#data pre-processing
df.drop(df[df.retweet==True].index, inplace=True)
df['num_of_words'] = df["text"].str.split().apply(len)
df.drop(df[df.num_of_words<4].index, inplace=True)
df["text"] = df["text"].replace(r"http\S+", "URL", regex=True)
df["text"] = df["text"].replace(r"@\S+", "REF", regex=True)
df["text"] = df["text"].replace(r"(\d{1,2})[/.-](\d{1,2})[/.-](\d{2,4})+", "DATE", regex=True)
df["text"] = df["text"].replace(r"(\d{1,2})[/:](\d{2})[/:](\d{2})?(am|pm)+", "TIME", regex=True)
df["text"] = df["text"].replace(r"(\d{1,2})[/:](\d{2})?(am|pm)+", "TIME", regex=True)
df["text"] = df["text"].replace(r"\d+", "NUM", regex=True)
len(df)

3835
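To see what these substitutions do to a single tweet, here is a minimal sketch that applies the same patterns in the same order using only the standard `re` module. The `normalize` helper and the sample tweet are made up for illustration; they are not part of the notebook.

```python
import re

# Same patterns, applied in the same order as in the pre-processing cell.
SUBSTITUTIONS = [
    (r"http\S+", "URL"),
    (r"@\S+", "REF"),
    (r"(\d{1,2})[/.-](\d{1,2})[/.-](\d{2,4})+", "DATE"),
    (r"(\d{1,2})[/:](\d{2})[/:](\d{2})?(am|pm)+", "TIME"),
    (r"(\d{1,2})[/:](\d{2})?(am|pm)+", "TIME"),
    (r"\d+", "NUM"),
]


def normalize(text):
    """Apply each substitution in turn to one tweet (hypothetical helper)."""
    for pattern, token in SUBSTITUTIONS:
        text = re.sub(pattern, token, text)
    return text


# Made-up tweet, not taken from the data set.
sample = "Join us 10/12/2016 at 7:30pm http://t.co/abc with @Bob and 3 friends"
normalized = normalize(sample)
print(normalized)
```

The order matters: URLs, dates and times are replaced before the generic digit rule, so digits inside them are not turned into NUM first.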

POS tag vectorizer

I am going to convert the text column in the data set to a string of POS tags, replacing the usual 2-3 character POS tag abbreviations with 1 character codes (e.g. 'NNP' -> 'N', 'NNPS' -> 'O'), so that CountVectorizer with the 'char' analyzer can find the most informative POS tag combinations per author.

In [783]:
#mapping of 2-3 char POS tag abbreviations to 1 char codes
#http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
pos_code_map={'CC':'A','CD':'B','DT':'C','EX':'D','FW':'E','IN':'F','JJ':'G','JJR':'H','JJS':'I','LS':'J','MD':'K','NN':'L','NNS':'M',
'NNP':'N','NNPS':'O','PDT':'P','POS':'Q','PRP':'R','PRP$':'S','RB':'T','RBR':'U','RBS':'V','RP':'W','SYM':'X','TO':'Y','UH':'Z',
'VB':'1','VBD':'2','VBG':'3','VBN':'4','VBP':'5','VBZ':'6','WDT':'7','WP':'8','WP$':'9','WRB':'@'}
#inverse mapping (.items() works in both Python 2 and Python 3)
code_pos_map={v: k for k, v in pos_code_map.items()}

In [784]:
#abbreviation converters
def convert(tag):
    # tags missing from the map are encoded as '?'
    return pos_code_map.get(tag, '?')
def inv_convert(code):
    # codes missing from the map are decoded as '?'
    return code_pos_map.get(code, '?')

In [785]:
#POS tag converting
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk import pos_tag, word_tokenize
def pos_tags(text):
    tokenizer = RegexpTokenizer(r'\w+')
    text_processed=tokenizer.tokenize(text)
    return "".join(convert(tag) for (word, tag) in nltk.pos_tag(text_processed))
def text_pos_inv_convert(text):
    return "-".join(inv_convert(c.upper()) for c in text)
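The round trip between tags and 1 character codes can be sketched without NLTK. The example below uses a small subset of the mapping and a hand-written tag sequence (an assumption for illustration; in the notebook the tags come from `nltk.pos_tag`), and the `encode`/`decode` names are hypothetical.

```python
# Small subset of the notebook's mapping, enough for the illustration.
pos_code_map = {'DT': 'C', 'NN': 'L', 'NNP': 'N', 'VBZ': '6'}
code_pos_map = {code: tag for tag, code in pos_code_map.items()}


def encode(tags):
    """Turn a list of POS tags into a string with one character per tag."""
    return "".join(pos_code_map.get(tag, '?') for tag in tags)


def decode(codes):
    """Turn a (possibly lower-cased) code string back into readable tags."""
    return "-".join(code_pos_map.get(c.upper(), '?') for c in codes)


# Hand-written tag sequence standing in for the output of a POS tagger.
tags = ['NNP', 'NNP', 'VBZ', 'DT', 'NN']
encoded = encode(tags)

# CountVectorizer lower-cases its input by default, so a feature looks like 'nn6'.
trigram = encoded[:3].lower()
decoded = decode(trigram)
print(encoded, trigram, decoded)
```

Upper-casing in `decode` is what makes the lower-cased vectorizer features readable again; digits and '@' are unaffected by the case change.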

In [786]:
#a new column for POS tags
df['text_pos']=df['text'].apply(pos_tags)

Here is what a sequence of POS tags looks like when prepared for CountVectorizer()

In [787]:
df.loc[:,['author','text','text_pos']].head()

Unnamed: 0,author,text,text_pos
0,KimKardashian,"Watching ""I'm pregnant and I sniff toxic fumes...",3R5GAR5GMFC6GNG
1,KimKardashian,West Coast...Keeping Up With The Kardashians i...,NNNWFCO6FN
2,KimKardashian,RT REF Another day... Another photo shoot! But...,NNCLCLLACLSMAR53CLFSLF63W
3,KimKardashian,RT REF Thx love thisREF Cutest moment Between...,NNN1LNLNNANLNLLN
4,KimKardashian,Excited to share with you an all-new #KUWTK! O...,4YLFRCCGNTBHLY1


Now let's check whether there are unique combinations of POS tags per author

In [841]:
df_features=pd.DataFrame()

In [842]:
from sklearn.feature_extraction.text import CountVectorizer
for a in df.author.unique():
    # count POS tag 3-grams separately for each author
    v = CountVectorizer(analyzer='char',ngram_range=(3, 3))
    ngrams = v.fit_transform(df[df['author'] == a]['text_pos'])
    df_t=pd.DataFrame(
    {'Feature': v.get_feature_names(),
     'Count': list(ngrams.sum(axis=0).flat),
     'Author': a
    })
    df_features=pd.concat([df_features,df_t],ignore_index=True)

Let's convert the 1 character POS tag codes back to the commonly known 2-3 character abbreviations

In [843]:
df_features['Feature_POS']=df_features.apply(lambda x: text_pos_inv_convert(x['Feature']), axis=1)

There are indeed a lot of POS tag combinations that are unique to each author

In [844]:
df_features[~df_features.Feature.isin(df_features[df_features['Author'] != author2].Feature)].sort_values('Count', ascending=False).loc[:,['Author','Count','Feature','Feature_POS']].head()

Unnamed: 0,Author,Count,Feature,Feature_POS
4538,HillaryClinton,14,6ga,VBZ-JJ-CC
4101,HillaryClinton,13,4f3,VBN-IN-VBG
5693,HillaryClinton,12,gm7,JJ-NNS-WDT
5076,HillaryClinton,12,bmf,CD-NNS-IN
6087,HillaryClinton,12,lgf,NN-JJ-IN


In [845]:
df_features[~df_features.Feature.isin(df_features[df_features['Author'] != author1].Feature)].sort_values('Count', ascending=False).loc[:,['Author','Count','Feature','Feature_POS']].head()

Unnamed: 0,Author,Count,Feature,Feature_POS
2144,KimKardashian,13,ln5,NN-NNP-VBP
779,KimKardashian,11,5r2,VBP-PRP-VBD
2624,KimKardashian,8,nn@,NNP-NNP-WRB
1412,KimKardashian,8,co6,DT-NNPS-VBZ
2710,KimKardashian,7,o6f,NNPS-VBZ-IN
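The idea behind the two tables above, counting character 3-grams per author and keeping those the other author never uses, can be sketched with the standard library alone. The toy POS-code strings and author names below are invented, and `char_trigrams` is a simplified, hypothetical stand-in for CountVectorizer's 'char' analyzer.

```python
from collections import Counter


def char_trigrams(text):
    """All overlapping 3-character windows of a lower-cased string."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


# Toy POS-code strings per author (invented, not from the data set).
docs_by_author = {
    'author_a': ['NNLC', 'NLC6'],
    'author_b': ['LC6F', 'C6FN'],
}

counts = {
    author: Counter(g for doc in docs for g in char_trigrams(doc))
    for author, docs in docs_by_author.items()
}

# Keep only the trigrams that the other author never produces.
unique_a = {g: n for g, n in counts['author_a'].items() if g not in counts['author_b']}
unique_b = {g: n for g, n in counts['author_b'].items() if g not in counts['author_a']}
print(unique_a, unique_b)
```

Here 'lc6' occurs for both toy authors, so it is dropped from both results, just as shared POS patterns are excluded from the tables above.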


To check for overfitting, let's hold out 40% of the available data as a test set: twt_test (X) and author_test (Y).

In [793]:
from sklearn.model_selection import train_test_split
twt_train, twt_test, author_train, author_test = train_test_split(df.loc[:,['text','text_pos']], df['author'], test_size=0.4, random_state=42)

The function below is used as the tokenizer in the evaluation. I found that removing stop words does not improve the model, so I dropped that step (originally item 2).

In [794]:
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
def text_process(text):
    """
    Takes in a string of text, then performs the following:
    1. Tokenizes and removes punctuation
    2. Stems
    3. Returns a list of the cleaned text
    """
    # tokenizing
    tokenizer = RegexpTokenizer(r'\w+')
    text_processed=tokenizer.tokenize(text)

    # stemming
    porter_stemmer = PorterStemmer()
    text_processed = [porter_stemmer.stem(word) for word in text_processed]

    return text_processed

In [795]:
ScoreSummaryByModelParams = list()

In [796]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

The ItemSelector and TextAndTextCodedExtractor classes are used in a pipeline to pick the proper column (text or text_pos) from the data set for each vectorizer

In [797]:
from sklearn.base import BaseEstimator, TransformerMixin
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

In [798]:
class TextAndTextCodedExtractor(BaseEstimator, TransformerMixin):
    """Extract the text & text_pos from a tweet in a single pass.
    """
    def fit(self, x, y=None):
        return self

    def transform(self, tweets):
        features=tweets.loc[:,['text_pos','text']].to_records(index=False)

        return features
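How a column selector and FeatureUnion fit together can be seen on a two-row toy frame. This is only a sketch under assumptions: `ColumnSelector` is a hypothetical stand-in for ItemSelector that indexes a DataFrame directly (the notebook converts to records first), and the texts and POS codes are invented.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for ItemSelector: returns one DataFrame column."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


# Invented rows; text_pos holds one character per POS tag.
toy = pd.DataFrame({
    "text": ["so excited for tonight", "we must protect voting rights"],
    "text_pos": ["TGFL", "RK1ML"],
})

union = FeatureUnion([
    ("char", Pipeline([
        ("selector", ColumnSelector("text")),
        ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(3, 3))),
    ])),
    ("text_pos", Pipeline([
        ("selector", ColumnSelector("text_pos")),
        ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(3, 3))),
    ])),
])

# Each branch vectorizes its own column; the results are stacked side by side.
features = union.fit_transform(toy)
n_char = len(union.transformer_list[0][1].named_steps["tfidf"].vocabulary_)
n_pos = len(union.transformer_list[1][1].named_steps["tfidf"].vocabulary_)
print(features.shape, n_char, n_pos)
```

The width of the combined matrix is the sum of the two vocabularies, which is why adding a vectorizer to the union only ever adds columns for the classifier.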

The ModelParamsEvaluation function receives two parts of its pipeline as parameters: f_union, which is itself a pipeline with different combinations of vectorizers, and a model. It also takes the parameter grid to search and a comment used to label the results

In [799]:
def ModelParamsEvaluation (f_union,model,params,comment):
    pipeline = Pipeline([
    # Extract the text & text_coded
    ('textandtextcoded', TextAndTextCodedExtractor()),

    # Use FeatureUnion to combine the features from text and text_coded
    ('union', f_union, ),

    # Use a  classifier on the combined features
    ('clf', model),
    ])
    grid_search = GridSearchCV(estimator=pipeline, param_grid=params, verbose=1, cv=10)
    grid_search.fit(twt_train, author_train)
    #best score
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(params.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
        ScoreSummaryByModelParams.append([comment,grid_search.best_score_,"\t%s: %r" % (param_name, best_parameters[param_name])])    
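The parameter names used in the grids that follow, such as union__char__tfidf__ngram_range, address nested steps by joining step names with double underscores. A minimal sketch of that convention, with a cut-down pipeline that is an assumption for illustration rather than the one built above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Cut-down pipeline: one vectorizer branch inside a union, then a classifier.
pipe = Pipeline([
    ("union", FeatureUnion([
        ("char", Pipeline([("tfidf", TfidfVectorizer(analyzer="char"))])),
    ])),
    ("clf", BernoulliNB()),
])

# step__substep__parameter: the same names a param_grid for GridSearchCV uses.
pipe.set_params(union__char__tfidf__ngram_range=(3, 3), clf__alpha=0.001)

tfidf = pipe.named_steps["union"].transformer_list[0][1].named_steps["tfidf"]
print(tfidf.ngram_range, pipe.named_steps["clf"].alpha)
```

GridSearchCV applies each candidate through this same mechanism, so a grid key is valid only if every name along the path matches a step name in the pipeline.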
 

First I examine the model with only 1 vectorizer. FeatureUnion is not needed in this case, but I use it to keep the same pattern across all experiments

In [800]:
f1_union=FeatureUnion(
        transformer_list=[
              # Pipeline for pulling char features  from the text
            ('char', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',     TfidfVectorizer(analyzer='char')),
            ])),               

        ],
    )

The 'char' analyzer provides a very good result by itself

In [801]:
from sklearn.naive_bayes import BernoulliNB
p = {
    'union__char__tfidf__max_df': (0.5, 0.75, 1.0),
    'union__char__tfidf__ngram_range': ((2, 2), (3, 3)), 
    'union__char__tfidf__max_features': (None, 5000, 10000, 50000),
    'clf__alpha': (1,0.1,0.01,0.001,0.0001,0)}

ModelParamsEvaluation(f1_union,BernoulliNB(),p,'Bernoulli Naive Bayes, char')

Fitting 10 folds for each of 144 candidates, totalling 1440 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    8.7s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:   37.5s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:  1.4min
[Parallel(n_jobs=1)]: Done 799 tasks       | elapsed:  2.5min
[Parallel(n_jobs=1)]: Done 1249 tasks       | elapsed:  3.8min


Best score: 0.965
Best parameters set:
	clf__alpha: 0.001
	union__char__tfidf__max_df: 0.5
	union__char__tfidf__max_features: 5000
	union__char__tfidf__ngram_range: (3, 3)


[Parallel(n_jobs=1)]: Done 1440 out of 1440 | elapsed:  4.4min finished
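The "144 candidates, totalling 1440 fits" in the log follows directly from the grid: one candidate per combination of parameter values, each fitted once per fold. A small sketch using only the standard library (the variable names are illustrative):

```python
from itertools import product

# The grid from the cell above.
param_grid = {
    'union__char__tfidf__max_df': (0.5, 0.75, 1.0),
    'union__char__tfidf__ngram_range': ((2, 2), (3, 3)),
    'union__char__tfidf__max_features': (None, 5000, 10000, 50000),
    'clf__alpha': (1, 0.1, 0.01, 0.001, 0.0001, 0),
}
n_folds = 10

# One candidate per combination of values, i.e. the Cartesian product.
candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)
```

Because the count is a product (3 x 2 x 4 x 6 here), every extra value added to one parameter multiplies the run time of the whole search.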


In [802]:
f1_union=FeatureUnion(
        transformer_list=[
            # Pipeline for pulling word features from the text
            ('word', Pipeline([
            ('selector', ItemSelector(key='text')),
            ('tfidf',    TfidfVectorizer(analyzer='word')),
            ])),              

        ],
    )

The 'word' analyzer is worse for these 2 authors, but for other pairs (AdamSavage - ScottKelly) it provides better results than the char analyzer. As the grid search result shows, removing stop words is not recommended

In [803]:
p = {
    'union__word__tfidf__max_df': (0.5, 0.75, 1.0),
    'union__word__tfidf__ngram_range': ((1, 1),(2, 2), (3, 3),(4,4),(5,5)), 
    'union__word__tfidf__max_features': (None, 5000, 10000, 50000),
    'union__word__tfidf__stop_words': (None, 'english'),
    'clf__alpha': (1,0.1,0.01,0.001,0.0001,0)}

ModelParamsEvaluation(f1_union,BernoulliNB(),p,'Bernoulli Naive Bayes, word')

Fitting 10 folds for each of 720 candidates, totalling 7200 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    6.1s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:   26.7s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:  1.0min
[Parallel(n_jobs=1)]: Done 799 tasks       | elapsed:  1.8min
[Parallel(n_jobs=1)]: Done 1249 tasks       | elapsed:  2.8min
[Parallel(n_jobs=1)]: Done 1799 tasks       | elapsed:  4.1min
[Parallel(n_jobs=1)]: Done 2449 tasks       | elapsed:  5.5min
[Parallel(n_jobs=1)]: Done 3199 tasks       | elapsed:  7.2min
[Parallel(n_jobs=1)]: Done 4049 tasks       | elapsed:  9.1min
[Parallel(n_jobs=1)]: Done 4999 tasks       | elapsed: 11.2min
[Parallel(n_jobs=1)]: Done 6049 tasks       | elapsed: 13.5min
[Parallel(n_jobs=1)]: Done 7199 tasks       | elapsed: 16.1min
[Parallel(n_jobs=1)]: Done 7200 out of 7200 | elapsed: 16.1min finished


Best score: 0.948
Best parameters set:
	clf__alpha: 0.1
	union__word__tfidf__max_df: 0.5
	union__word__tfidf__max_features: None
	union__word__tfidf__ngram_range: (1, 1)
	union__word__tfidf__stop_words: None


In [804]:
f1_union=FeatureUnion(
        transformer_list=[
            # Pipeline for pulling word features from the text
            ('text', Pipeline([
            ('selector', ItemSelector(key='text')),
            ('tfidf',    TfidfVectorizer(analyzer='word',tokenizer= text_process)),
            ])),              

        ],
    )

The 'stemmed word' analyzer is better than the plain 'word' analyzer but still worse than the 'char' analyzer for these 2 authors.
But for other pairs (AdamSavage - ScottKelly) it provides better results than the char analyzer and worse than the plain 'word' analyzer.

In [805]:
p = {
    'union__text__tfidf__max_df': (0.5, 0.75, 1.0),
    'union__text__tfidf__ngram_range': ((1, 1),(2, 2), (3, 3),(4,4),(5,5)), 
    'union__text__tfidf__max_features': (None, 5000, 10000, 50000),
    'clf__alpha': (1,0.1,0.01,0.001,0.0001,0)}

ModelParamsEvaluation(f1_union,BernoulliNB(),p,'Bernoulli Naive Bayes, stemmed words, no stop words')

Fitting 10 folds for each of 360 candidates, totalling 3600 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:   22.6s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:  1.5min
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:  3.5min
[Parallel(n_jobs=1)]: Done 799 tasks       | elapsed:  6.2min
[Parallel(n_jobs=1)]: Done 1249 tasks       | elapsed:  9.7min
[Parallel(n_jobs=1)]: Done 1799 tasks       | elapsed: 14.0min
[Parallel(n_jobs=1)]: Done 2449 tasks       | elapsed: 19.0min
[Parallel(n_jobs=1)]: Done 3199 tasks       | elapsed: 24.8min
[Parallel(n_jobs=1)]: Done 3600 out of 3600 | elapsed: 27.9min finished


Best score: 0.952
Best parameters set:
	clf__alpha: 0.1
	union__text__tfidf__max_df: 0.5
	union__text__tfidf__max_features: None
	union__text__tfidf__ngram_range: (1, 1)


In [806]:
f1_union=FeatureUnion(
        transformer_list=[
            # Pipeline for pulling pos tag features  from the text_pos
            ('text_pos', Pipeline([
            ('selector', ItemSelector(key='text_pos')),
            ('tfidf',    TfidfVectorizer(analyzer='char')),
            ])),                  

        ],
    )

The POS tag vectorizer alone is not very discriminative. Let's see how it works in combination with the other vectorizers

In [807]:
p = {
    'union__text_pos__tfidf__max_df': (0.5, 0.75, 1.0),
    'union__text_pos__tfidf__ngram_range': ((3, 3), (4, 4),(5,5),(6,6),(7,7)), 
    'union__text_pos__tfidf__max_features': (None, 5000, 10000, 50000),
    'clf__alpha': (1,0.1,0.01,0.001,0.0001,0)}

ModelParamsEvaluation(f1_union,BernoulliNB(),p,'Bernoulli Naive Bayes, POS tags')

Fitting 10 folds for each of 360 candidates, totalling 3600 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:    5.0s
[Parallel(n_jobs=1)]: Done 199 tasks       | elapsed:   20.5s
[Parallel(n_jobs=1)]: Done 449 tasks       | elapsed:   46.2s
[Parallel(n_jobs=1)]: Done 799 tasks       | elapsed:  1.4min
[Parallel(n_jobs=1)]: Done 1249 tasks       | elapsed:  2.1min
[Parallel(n_jobs=1)]: Done 1799 tasks       | elapsed:  3.1min
[Parallel(n_jobs=1)]: Done 2449 tasks       | elapsed:  4.2min
[Parallel(n_jobs=1)]: Done 3199 tasks       | elapsed:  5.5min


Best score: 0.764
Best parameters set:
	clf__alpha: 0.1
	union__text_pos__tfidf__max_df: 0.5
	union__text_pos__tfidf__max_features: None
	union__text_pos__tfidf__ngram_range: (3, 3)


[Parallel(n_jobs=1)]: Done 3600 out of 3600 | elapsed:  6.2min finished


In [808]:
df_ScoreSummaryByModelParams=DataFrame(ScoreSummaryByModelParams,columns=['Method','BestScore','BestParameter'])
df_ScoreSummaryByModelParams.sort_values(['BestScore'],ascending=False,inplace=True)
df_ScoreSummaryByModelParams

Unnamed: 0,Method,BestScore,BestParameter
0,"Bernoulli Naive Bayes, char",0.965233,\tclf__alpha: 0.001
1,"Bernoulli Naive Bayes, char",0.965233,\tunion__char__tfidf__max_df: 0.5
2,"Bernoulli Naive Bayes, char",0.965233,\tunion__char__tfidf__max_features: 5000
3,"Bernoulli Naive Bayes, char",0.965233,"\tunion__char__tfidf__ngram_range: (3, 3)"
12,"Bernoulli Naive Bayes, stemmed words, no stop ...",0.952195,"\tunion__text__tfidf__ngram_range: (1, 1)"
11,"Bernoulli Naive Bayes, stemmed words, no stop ...",0.952195,\tunion__text__tfidf__max_features: None
10,"Bernoulli Naive Bayes, stemmed words, no stop ...",0.952195,\tunion__text__tfidf__max_df: 0.5
9,"Bernoulli Naive Bayes, stemmed words, no stop ...",0.952195,\tclf__alpha: 0.1
8,"Bernoulli Naive Bayes, word",0.948283,\tunion__word__tfidf__stop_words: None
7,"Bernoulli Naive Bayes, word",0.948283,"\tunion__word__tfidf__ngram_range: (1, 1)"


In [809]:
f2_union=FeatureUnion(
        transformer_list=[
            # Pipeline for pulling char features  from the text
            ('char', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',     TfidfVectorizer(analyzer='char',ngram_range=(3, 3),max_df=0.5,max_features=5000)),
            ])),
            # Pipeline for pulling word features from the text
            ('word', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',    TfidfVectorizer(analyzer='word',stop_words=None,ngram_range=(1, 1),max_df=0.5,max_features=None)),
            ])),        

        ],

    )

In [810]:
p = {'clf__alpha': (1,0.1,0.01,0.001,0.0001,0)}
ModelParamsEvaluation(f2_union,BernoulliNB(),p,'Bernoulli Naive Bayes, char + word')

Fitting 10 folds for each of 6 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:   13.7s
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:   16.8s finished


Best score: 0.969
Best parameters set:
	clf__alpha: 0.001


With small variations, the 'char + stemmed word' combination provides the best result for most of the analyzed author pairs

In [811]:
f2_union=FeatureUnion(
        transformer_list=[
            # Pipeline for pulling char features  from the text
            ('char', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',     TfidfVectorizer(analyzer='char',ngram_range=(3, 3),max_df=0.5,max_features=5000)),
            ])),
            # Pipeline for pulling stemmed word features from the text
            ('text', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',    TfidfVectorizer(analyzer='word',tokenizer= text_process,ngram_range=(1, 1),max_df=0.5,max_features=None)),
            ])),        

        ],

    )

In [812]:
p = {'clf__alpha': (1,0.1,0.01,0.001,0.0001,0)}
ModelParamsEvaluation(f2_union,BernoulliNB(),p,'Bernoulli Naive Bayes, char + stemmed word')

Fitting 10 folds for each of 6 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:   28.3s
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:   34.7s finished


Best score: 0.973
Best parameters set:
	clf__alpha: 0.001


In [813]:
f3_union=FeatureUnion(
        transformer_list=[
            # Pipeline for pulling char features  from the text
            ('char', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',     TfidfVectorizer(analyzer='char',ngram_range=(3, 3),max_df=0.5,max_features=5000)),
            ])),
            # Pipeline for pulling word features from the text
            ('word', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',    TfidfVectorizer(analyzer='word',stop_words=None,ngram_range=(1, 1),max_df=0.5,max_features=None)),
            ])),    
            # Pipeline for pulling stemmed word features from the text
            ('text', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',    TfidfVectorizer(analyzer='word',tokenizer= text_process,ngram_range=(1, 1),max_df=0.5,max_features=None)),
            ])),        

        ],

    )

In [814]:
p = {'clf__alpha': (1,0.1,0.01,0.001,0.0001,0)}
ModelParamsEvaluation(f3_union,BernoulliNB(),p,'Bernoulli Naive Bayes, char + word + stemmed word')

Fitting 10 folds for each of 6 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:   31.5s
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:   38.6s finished


Best score: 0.970
Best parameters set:
	clf__alpha: 0.0001


Using the POS tag vectorizer does not improve the score dramatically. Its impact is more visible only for the AdamSavage - ScottKelly pair

In [815]:
f3_union=FeatureUnion(
        transformer_list=[
             # Pipeline for pulling word stemmed features from the text
            ('text', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',     TfidfVectorizer(analyzer='word',tokenizer= text_process,ngram_range=(1, 1),max_df=0.5,max_features=None)),
            ])),
                    
            # Pipeline for pulling char features  from the text
            ('char', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',     TfidfVectorizer(analyzer='char',ngram_range=(3, 3),max_df=0.5,max_features=5000)),
            ])),
                    
            # Pipeline for pulling flexible pattern features  from the text_coded with POS tags
            ('text_pos', Pipeline([
                ('selector', ItemSelector(key='text_pos')),
                ('tfidf',    TfidfVectorizer(analyzer='char',ngram_range=(3, 3),max_df=0.5,max_features=None)),
            ])),                  

        ],

    )

In [816]:
p = {'clf__alpha': (1,0.1,0.01,0.001,0.0001,0)}
ModelParamsEvaluation(f3_union,BernoulliNB(),p,'Bernoulli Naive Bayes, char + stemmed word + POS tags')

Fitting 10 folds for each of 6 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:   30.7s
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:   37.7s finished


Best score: 0.966
Best parameters set:
	clf__alpha: 0.0001


In [817]:
f3_union=FeatureUnion(
        transformer_list=[
             # Pipeline for pulling word features from the text
            ('word', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',     TfidfVectorizer(analyzer='word',ngram_range=(1, 1),max_df=0.5,max_features=None)),
            ])),
                    
            # Pipeline for pulling char features  from the text
            ('char', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',     TfidfVectorizer(analyzer='char',ngram_range=(3, 3),max_df=0.5,max_features=5000)),
            ])),
                    
            # Pipeline for pulling flexible pattern features  from the text_coded with POS tags
            ('text_pos', Pipeline([
                ('selector', ItemSelector(key='text_pos')),
                ('tfidf',    TfidfVectorizer(analyzer='char',ngram_range=(3, 3),max_df=0.5,max_features=None)),
            ])),                  

        ],

    )

In [818]:
p = {'clf__alpha': (1,0.1,0.01,0.001,0.0001,0)}
ModelParamsEvaluation(f3_union,BernoulliNB(),p,'Bernoulli Naive Bayes, char + word + POS tags')

Fitting 10 folds for each of 6 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:   16.0s
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:   19.7s finished


Best score: 0.966
Best parameters set:
	clf__alpha: 0.001


In [819]:
f4_union=FeatureUnion(
        transformer_list=[

            # Pipeline for pulling word features from the text
            ('word', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',    TfidfVectorizer(analyzer='word',stop_words=None,ngram_range=(1, 1),max_df=0.5,max_features=None)),
            ])),
                    
             # Pipeline for pulling word features after word_processing from the text
            ('text', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',     TfidfVectorizer(analyzer='word',tokenizer= text_process,ngram_range=(1, 1),max_df=0.5,max_features=None)),
            ])),
                    
            # Pipeline for pulling char features  from the text
            ('char', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',     TfidfVectorizer(analyzer='char',ngram_range=(3, 3),max_df=0.5,max_features=5000)),
            ])),
                    
            # Pipeline for pulling flexible pattern features  from the text_coded
            ('text_pos', Pipeline([
                ('selector', ItemSelector(key='text_pos')),
                ('tfidf',    TfidfVectorizer(analyzer='char',ngram_range=(3, 3),max_df=0.5,max_features=None)),
            ])),                  

        ],

    )

In [820]:
p = {'clf__alpha': (1,0.1,0.01,0.001,0.0001,0)}

ModelParamsEvaluation(f4_union,BernoulliNB(),p,'Bernoulli Naive Bayes, char + word + stemmed word + POS tag')

Fitting 10 folds for each of 6 candidates, totalling 60 fits


[Parallel(n_jobs=1)]: Done  49 tasks       | elapsed:   35.9s
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:   43.9s finished


Best score: 0.968
Best parameters set:
	clf__alpha: 0.001


In [821]:
df_ScoreSummaryByModelParams=DataFrame(ScoreSummaryByModelParams,columns=['Method','BestScore','BestParameter'])
df_ScoreSummaryByModelParams.sort_values(['BestScore'],ascending=False,inplace=True)
df_ScoreSummaryByModelParams

Unnamed: 0,Method,BestScore,BestParameter
18,"Bernoulli Naive Bayes, char + stemmed word",0.972621,\tclf__alpha: 0.001
19,"Bernoulli Naive Bayes, char + word + stemmed word",0.970448,\tclf__alpha: 0.0001
17,"Bernoulli Naive Bayes, char + word",0.968709,\tclf__alpha: 0.001
22,"Bernoulli Naive Bayes, char + word + stemmed w...",0.96784,\tclf__alpha: 0.001
21,"Bernoulli Naive Bayes, char + word + POS tags",0.966102,\tclf__alpha: 0.001
20,"Bernoulli Naive Bayes, char + stemmed word + P...",0.966102,\tclf__alpha: 0.0001
1,"Bernoulli Naive Bayes, char",0.965233,\tunion__char__tfidf__max_df: 0.5
0,"Bernoulli Naive Bayes, char",0.965233,\tclf__alpha: 0.001
3,"Bernoulli Naive Bayes, char",0.965233,"\tunion__char__tfidf__ngram_range: (3, 3)"
2,"Bernoulli Naive Bayes, char",0.965233,\tunion__char__tfidf__max_features: 5000


Now let's run the prediction and review the results.
The PredictionEvaluation function combines the scores from several methods for comparison, the ModelRun function runs the prediction for different models, and most_informative_feature_for_binary_classification is used to get the most informative features from a model

In [822]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc, precision_score, accuracy_score, recall_score, f1_score

In [823]:
ScoreSummaryByVector = list()

In [824]:
def PredictionEvaluation(author_test_b,author_predictions_b,comment):
    Precision=precision_score(author_test_b,author_predictions_b)
    print ('Precision: %0.3f' % (Precision))
    Accuracy=accuracy_score(author_test_b,author_predictions_b)
    print ('Accuracy: %0.3f' % (Accuracy))
    Recall=recall_score(author_test_b,author_predictions_b)
    print ('Recall: %0.3f' % (Recall))
    F1=f1_score(author_test_b,author_predictions_b)
    print ('F1: %0.3f' % (F1))
    print ('Confussion matrix:')
    print (confusion_matrix(author_test_b,author_predictions_b))
    ROC_AUC=roc_auc_score(author_test_b,author_predictions_b)
    print ('ROC-AUC: %0.3f' % (ROC_AUC))
    ScoreSummaryByVector.append([Precision,Accuracy,Recall,F1,ROC_AUC,comment])

In [825]:
def ModelRun (f_union,model):
    pipeline = Pipeline([
    # Extract the text & text_coded
    ('textandtextcoded', TextAndTextCodedExtractor()),

    # Use FeatureUnion to combine the features from text and text_coded
    ('union', f_union, ),

    # Use a  classifier on the combined features
    ('clf', model),
    ])
    pipeline.fit(twt_train, author_train)
    author_predicted = pipeline.predict(twt_test)
    
    feature_names=list()
    for p in (pipeline.get_params()['union'].transformer_list):
        fn=(p[0],pipeline.get_params()['union'].get_params()[p[0]].get_params()['tfidf'].get_feature_names())
        feature_names.append(fn)
    # one frame per vectorizer, stacked into a single feature table
    df_fn=pd.concat(
        [pd.DataFrame({'FeatureType': fn[0], 'Feature': fn[1]}) for fn in feature_names],
        ignore_index=True)
    
    from sklearn.preprocessing import LabelBinarizer
    lb = LabelBinarizer()
    author_test_b = lb.fit_transform(author_test.values)
    # reuse the binarizer fitted on the test labels so both arrays share one encoding
    author_predicted_b  = lb.transform(author_predicted)
    return (df_fn,pipeline.get_params()['clf'],author_predicted,author_predicted_b, author_test_b)

In [826]:
def most_informative_feature_for_binary_classification(feature_names, classifier):
    class_labels = classifier.classes_

    topnvalues_class0 = sorted(zip(classifier.coef_[0], feature_names['Feature'].values, feature_names['FeatureType'].values))
    topnvalues_class1 = sorted(zip(classifier.coef_[0], feature_names['Feature'].values, feature_names['FeatureType'].values), reverse=True)

    topn_df_class0=pd.DataFrame(topnvalues_class0, columns=['Coef','Feature','FeatureType'])
    topn_df_class0['Author']=class_labels[0]
    
    topn_df_class1=pd.DataFrame(topnvalues_class1, columns=['Coef','Feature','FeatureType'])
    topn_df_class1['Author']=class_labels[1]    
    
    topn_df=pd.concat([topn_df_class0,topn_df_class1])
    
        
    return topn_df

In [827]:
f2_union=FeatureUnion(
        transformer_list=[
            # Pipeline for pulling char features  from the text
            ('char', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',     TfidfVectorizer(analyzer='char',ngram_range=(3, 3),max_df=0.5,max_features=5000)),
            ])),
            # Pipeline for pulling stemmed word features from the text
            ('text', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',    TfidfVectorizer(analyzer='word',tokenizer= text_process,ngram_range=(1, 1),max_df=0.5,max_features=None)),
            ])),        

        ],

    )

In [828]:
(feature_names,clf,author_predicted,author_predicted_b, author_test_b)=ModelRun(f2_union,BernoulliNB(alpha=0.0001))

In [829]:
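#note: PredictionEvaluation expects (author_test_b, author_predictions_b); the arguments here are passed in the opposite order,
#so the Precision and Recall printed below are swapped and the confusion matrix is transposed
PredictionEvaluation(author_predicted_b, author_test_b,'char+stemmed word')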

Precision: 0.953
Accuracy: 0.960
Recall: 0.967
F1: 0.960
Confussion matrix:
[[736  36]
 [ 25 737]]
ROC-AUC: 0.960
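The printed scores can be recomputed by hand from the confusion matrix, which is a useful sanity check. The sketch below assumes scikit-learn's layout, where rows correspond to the first argument passed to confusion_matrix and columns to the second; the variable names are illustrative.

```python
# Confusion matrix as printed above; rows follow the first argument, columns the second.
matrix = [[736, 36],
          [25, 737]]
(tn, fp), (fn, tp) = matrix

precision = float(tp) / (tp + fp)
recall = float(tp) / (tp + fn)
accuracy = float(tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions, ROC-AUC reduces to the mean of the two per-class recalls.
specificity = float(tn) / (tn + fp)
roc_auc = (recall + specificity) / 2

for name, value in [('Precision', precision), ('Accuracy', accuracy),
                    ('Recall', recall), ('F1', f1), ('ROC-AUC', roc_auc)]:
    print('%s: %0.3f' % (name, value))
```

Swapping the two arguments transposes the matrix, which exchanges precision and recall but leaves accuracy and F1 unchanged, so the argument order is worth double-checking when reading these numbers.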


In [830]:
f3_union=FeatureUnion(
        transformer_list=[
             # Pipeline for pulling word stemmed features from the text
            ('text', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',     TfidfVectorizer(analyzer='word',tokenizer= text_process,ngram_range=(1, 1),max_df=0.5,max_features=None)),
            ])),
                    
            # Pipeline for pulling char features  from the text
            ('char', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',     TfidfVectorizer(analyzer='char',ngram_range=(3, 3),max_df=0.5,max_features=5000)),
            ])),
                    
            # Pipeline for pulling flexible pattern features from text_pos (the text coded with POS tags)
            ('text_pos', Pipeline([
                ('selector', ItemSelector(key='text_pos')),
                ('tfidf',    TfidfVectorizer(analyzer='char',ngram_range=(3, 3),max_df=0.5,max_features=None)),
            ])),                  

        ],

    )

In [831]:
(feature_names,clf,author_predicted,author_predicted_b, author_test_b)=ModelRun(f3_union,BernoulliNB(alpha=0.0001))

In [832]:
PredictionEvaluation(author_predicted_b, author_test_b,'char+stemmed word+POS tag')

Precision: 0.951
Accuracy: 0.958
Recall: 0.966
F1: 0.958
Confussion matrix:
[[735  38]
 [ 26 735]]
ROC-AUC: 0.958


In [833]:
f4_union=FeatureUnion(
        transformer_list=[

            # Pipeline for pulling word features from the text
            ('word', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',    TfidfVectorizer(analyzer='word',stop_words=None,ngram_range=(1, 1),max_df=0.5,max_features=None)),
            ])),
                    
             # Pipeline for pulling stemmed word features (text_process tokenizer) from the text
            ('text', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',     TfidfVectorizer(analyzer='word',tokenizer= text_process,ngram_range=(1, 1),max_df=0.5,max_features=None)),
            ])),
                    
            # Pipeline for pulling char features  from the text
            ('char', Pipeline([
                ('selector', ItemSelector(key='text')),
                ('tfidf',     TfidfVectorizer(analyzer='char',ngram_range=(3, 3),max_df=0.5,max_features=5000)),
            ])),
                    
            # Pipeline for pulling flexible pattern features from text_pos (the text coded with POS tags)
            ('text_pos', Pipeline([
                ('selector', ItemSelector(key='text_pos')),
                ('tfidf',    TfidfVectorizer(analyzer='char',ngram_range=(3, 3),max_df=0.5,max_features=None)),
            ])),                  

        ],

    )

In [834]:
(feature_names,clf,author_predicted,author_predicted_b, author_test_b)=ModelRun(f4_union,BernoulliNB(alpha=0.0001))

In [835]:
PredictionEvaluation(author_predicted_b, author_test_b,'char+word+stemmed word+POS tag')

Precision: 0.952
Accuracy: 0.959
Recall: 0.966
F1: 0.959
Confussion matrix:
[[735  37]
 [ 26 736]]
ROC-AUC: 0.959


Here is the summary per model, sorted by F1.

In [836]:
df_ScoreSummaryByVector=DataFrame(ScoreSummaryByVector,columns=['Precision','Accuracy','Recall','F1','ROC-AUC','Vector'])
df_ScoreSummaryByVector.sort_values(['F1'],ascending=False,inplace=True)
df_ScoreSummaryByVector

Unnamed: 0,Precision,Accuracy,Recall,F1,ROC-AUC,Vector
0,0.953428,0.960235,0.967192,0.960261,0.96028,char+stemmed word
2,0.952135,0.958931,0.965879,0.958958,0.958976,char+word+stemmed word+POS tag
1,0.950841,0.958279,0.965834,0.958279,0.958338,char+stemmed word+POS tag
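
To put these differences in perspective, here is a small sketch (plain Python 3, names are mine) that counts the misclassified tweets behind each row of the summary, using the three confusion matrices printed by PredictionEvaluation.

```python
# Confusion matrices copied from the three PredictionEvaluation outputs.
runs = {
    'char+stemmed word': [[736, 36], [25, 737]],
    'char+stemmed word+POS tag': [[735, 38], [26, 735]],
    'char+word+stemmed word+POS tag': [[735, 37], [26, 736]],
}

summary = {}
for name, m in runs.items():
    total = sum(sum(row) for row in m)
    errors = m[0][1] + m[1][0]  # off-diagonal cells are the wrong predictions
    col_totals = (m[0][0] + m[1][0], m[0][1] + m[1][1])
    summary[name] = (errors, total, col_totals)
    print('%-32s errors=%d of %d, column totals=%s' % (name, errors, total, col_totals))
```

The three models make 61, 64 and 63 errors on 1534 test tweets, so they differ by at most 3 tweets and I would not read much into the ranking. The column totals are the same in every run (761 and 773) while the row totals vary, which hints that the columns, not the rows, hold the true labels.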


Now let's review the most informative features for the last model (char+word+stemmed word+POS tag).

In [837]:
TopFeatures_df=most_informative_feature_for_binary_classification(feature_names, clf)

In [838]:
df1=TopFeatures_df.loc[((TopFeatures_df['Author']==author2) & (TopFeatures_df['FeatureType']=='char')),['Author','Coef','Feature']].head(10)
df1.rename(columns={'Coef':'CoefChar','Feature':'Char'}, inplace=True)
df1.reset_index(inplace=True)
df2=TopFeatures_df.loc[((TopFeatures_df['Author']==author2) & (TopFeatures_df['FeatureType']=='word')),['Coef','Feature']].head(10)
df2.rename(columns={'Coef':'CoefWord','Feature':'Word'}, inplace=True)
df2.reset_index(inplace=True)
df3=TopFeatures_df.loc[((TopFeatures_df['Author']==author2) & (TopFeatures_df['FeatureType']=='text')),['Coef','Feature']].head(10)
df3.rename(columns={'Coef':'CoefText','Feature':'Text'}, inplace=True)
df3.reset_index(inplace=True)
df4=TopFeatures_df.loc[((TopFeatures_df['Author']==author2) & (TopFeatures_df['FeatureType']=='text_pos')),['Coef','Feature']].head(10)
df4.rename(columns={'Coef':'CoefTextPOS','Feature':'TextPOS'}, inplace=True)
df4['TextPOS']=df4.apply(lambda x: text_pos_inv_convert(x['TextPOS']), axis=1)
df4.reset_index(inplace=True)
# author2 is HillaryClinton, so the result is named df_hc_top_features
df_hc_top_features = pd.concat([df1,df2,df3,df4],axis=1)
df_hc_top_features.drop('index', axis=1, inplace=True)
df_hc_top_features

Unnamed: 0,Author,CoefChar,Char,CoefWord,Word,CoefText,Text,CoefTextPOS,TextPOS
0,HillaryClinton,-16.213406,\nno,-16.213406,________,-16.213406,________,-16.213406,VB-VB-VB
1,HillaryClinton,-16.213406,"""c",-16.213406,aaron,-16.213406,aaron,-16.213406,VB-VB-VBN
2,HillaryClinton,-16.213406,"""i",-16.213406,abandoned,-16.213406,abandon,-16.213406,VB-VB-WP
3,HillaryClinton,-16.213406,"""n",-16.213406,abbey,-16.213406,abbey,-16.213406,VB-VB-WRB
4,HillaryClinton,-16.213406,(v,-16.213406,abhorrent,-16.213406,abhorr,-16.213406,VB-VB-JJ
5,HillaryClinton,-16.213406,-h,-16.213406,abiding,-16.213406,abid,-16.213406,VB-VB-NNS
6,HillaryClinton,-16.213406,ah,-16.213406,ability,-16.213406,abil,-16.213406,VB-VB-PDT
7,HillaryClinton,-16.213406,ec,-16.213406,able,-16.213406,abl,-16.213406,VB-VB-TO
8,HillaryClinton,-16.213406,ef,-16.213406,abortion,-16.213406,abort,-16.213406,VB-VBG-WRB
9,HillaryClinton,-16.213406,io,-16.213406,above,-16.213406,abov,-16.213406,VB-VBG-JJR


In [839]:
df1=TopFeatures_df.loc[((TopFeatures_df['Author']==author1) & (TopFeatures_df['FeatureType']=='char')),['Author','Coef','Feature']].head(10)
df1.rename(columns={'Coef':'CoefChar','Feature':'Char'}, inplace=True)
df1.reset_index(inplace=True)
df2=TopFeatures_df.loc[((TopFeatures_df['Author']==author1) & (TopFeatures_df['FeatureType']=='word')),['Coef','Feature']].head(10)
df2.rename(columns={'Coef':'CoefWord','Feature':'Word'}, inplace=True)
df2.reset_index(inplace=True)
df3=TopFeatures_df.loc[((TopFeatures_df['Author']==author1) & (TopFeatures_df['FeatureType']=='text')),['Coef','Feature']].head(10)
df3.rename(columns={'Coef':'CoefText','Feature':'Text'}, inplace=True)
df3.reset_index(inplace=True)
df4=TopFeatures_df.loc[((TopFeatures_df['Author']==author1) & (TopFeatures_df['FeatureType']=='text_pos')),['Coef','Feature']].head(10)
df4.rename(columns={'Coef':'CoefTextPOS','Feature':'TextPOS'}, inplace=True)
df4['TextPOS']=df4.apply(lambda x: text_pos_inv_convert(x['TextPOS']), axis=1)
df4.reset_index(inplace=True)
df_kk_top_features = pd.concat([df1,df2,df3,df4],axis=1)
df_kk_top_features.drop('index', axis=1, inplace=True)
df_kk_top_features

Unnamed: 0,Author,CoefChar,Char,CoefWord,Word,CoefText,Text,CoefTextPOS,TextPOS
0,KimKardashian,-0.902746,ing,-0.984472,url,-0.984472,url,-1.282754,NNP-NNP-NNP
1,KimKardashian,-0.97238,url,-1.089562,ref,-1.039486,i,-2.047238,IN-NNP-NNP
2,KimKardashian,-0.97238,re,-1.200947,the,-1.089562,ref,-2.120263,NNP-NNP-NN
3,KimKardashian,-0.98935,the,-1.375444,to,-1.200947,the,-2.120263,NN-NN-NNP
4,KimKardashian,-1.031804,to,-1.510004,my,-1.375444,to,-2.166783,IN-DT-NN
5,KimKardashian,-1.036919,ng,-1.84401,in,-1.510004,my,-2.174751,NNP-NN-NN
6,KimKardashian,-1.057645,ur,-1.927891,for,-1.777318,a,-2.207274,NN-NNP-NNP
7,KimKardashian,-1.081487,ref,-1.934161,you,-1.84401,in,-2.23238,NN-IN-NNP
8,KimKardashian,-1.12533,ef,-1.966112,is,-1.927891,for,-2.249475,DT-JJ-NN
9,KimKardashian,-1.159521,he,-2.005853,on,-1.934161,you,-2.407945,NN-NN-NN
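
A hedged reading of these numbers: for a two-class BernoulliNB the coefficient is the log of the smoothed share of one class's training tweets that contain the feature, so exponentiating it recovers that share. A minimal Python 3 sketch, with two values copied from the table (variable names are mine):

```python
import math

coef_ing = -0.902746   # CoefChar for 'ing'
coef_nnp = -1.282754   # CoefTextPOS for NNP-NNP-NNP

share_ing = math.exp(coef_ing)
share_nnp = math.exp(coef_nnp)
print(round(share_ing, 3), round(share_nnp, 3))   # 0.405 0.277
```

So roughly 40% of this author's training tweets contain the trigram 'ing'. Features such as url, the and to score high because they are frequent, not because they are distinctive. Comparing the log-probabilities of the two classes would be one way to surface author-specific features.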


Finally, let's take a look at the tweets that were predicted wrongly.

In [840]:
author_predicted=pd.DataFrame(author_predicted,columns=['predicted'])
df_wrong_result = pd.concat([twt_test.reset_index(),author_test.reset_index(),author_predicted], axis=1)
df_wrong_result.drop('index', axis=1, inplace=True)
df_wrong_result.drop('text_pos', axis=1, inplace=True)
df_wrong_result=df_wrong_result[df_wrong_result['author']!=df_wrong_result['predicted']]
df_wrong_result.head(10)

Unnamed: 0,text,author,predicted
4,I don't understand why its always easier to gi...,KimKardashian,HillaryClinton
12,Happy #WomensEqualityDay from REF,HillaryClinton,KimKardashian
20,"""I trusted her when my life was on the line, a...",HillaryClinton,KimKardashian
26,Late night set vibespic.twitter.com/EtWQUMYPXN,KimKardashian,HillaryClinton
31,Life isn't always about yourself...helping oth...,KimKardashian,HillaryClinton
41,Patiently waiting #Paris pic.twitter.com/bNUMw...,KimKardashian,HillaryClinton
75,Teamed up w/ REF for a AW x REF capsule all sa...,KimKardashian,HillaryClinton
98,The best Mother's Day gift has been seeing my ...,HillaryClinton,KimKardashian
101,These pictures of the devastation are just sho...,KimKardashian,HillaryClinton
140,I’ll be REF in Midtown Crossing on August NUM!...,KimKardashian,HillaryClinton
