In [1]:
%load_ext autoreload
%autoreload 2

# SMaPP Text Classification Pipeline


## About

This document provides a quick intro to the basic functionality of the pipeline.

Goals:
- Make training supervised models for text classification easier for lab members
- Abstract away tedious and repetitive tasks
- Stay light enough to be modifiable and useful for specific use cases

What it provides:
- Quick loading of data from common SMaPP formats
- Easy construction of a pipeline that selects the best algorithm, tuning parameters, and feature set from common choices with reasonable defaults


## Installation

The package can be installed directly from GitHub using `pip`:

In [5]:
import sys
sys.path.append('/Users/fridolinlinder/projects/smapp_text_classifier/')

In [6]:
#!pip install git+https://github.com/smappnyu/smapp_text_classifier.git

The two main classes contained in the package are `DataSet` and `TextClassifier`. Let's import them:

In [7]:
from smapp_text_classifier.data import DataSet
from smapp_text_classifier.models import TextClassifier
from smapp_text_classifier.plot import plot_learning_curve

We also need to import a few additional packages:

In [8]:
import sys
import logging
import json
import sklearn
import nltk

import numpy as np
import pandas as pd

from pprint import pprint

All logging is implemented using the standard Python `logging` module. If you want fewer messages, set the logging level to `logging.ERROR`.

In [9]:
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', 
                    level=logging.INFO)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
logging.getLogger("gensim").setLevel(logging.ERROR)
np.random.seed(989898)
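
As a minimal sketch of the quieter setting mentioned above, the root logger's threshold can be raised to `logging.ERROR` using only the standard library. The variable name `root_logger` and the two example messages are illustrative and not part of the package.

```python
import logging

# Configure the root logger with the same format as above, then raise the threshold
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
root_logger = logging.getLogger()
root_logger.setLevel(logging.ERROR)

root_logger.debug('hidden: DEBUG is below the ERROR threshold')
root_logger.error('shown: ERROR meets the threshold')
```

With this setting, the `DEBUG` lines that appear in the cell outputs further down would no longer be printed.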

## Tutorial Setup

The goal of this exercise is to train a supervised model that learns a function mapping text documents to labels. We start with our labeled data in `.csv` and `.json` format. Here's what our directory looks like:

In [10]:
!ls

Untitled.ipynb      clinton_2016.json   pipeline_demo.ipynb
clinton_2016.csv    feature_cache       test_cache


Let's take a look at the data:

In [11]:
df_clinton = pd.read_csv('clinton_2016.csv')
df_clinton.head()

Unnamed: 0,label,tweet_id,user_id,text
0,Neutral,773692075699306496,725302089048453124,RT @CNN: Singer Stevie Nicks is backing Hillar...
1,Negative,786581360672735232,753594430330900481,RT @Italians4Trump: Hillary Supporters Attack ...
2,Positive,775873669725843456,1452015206,RT @HillaryClinton: How pay-to-play works:\n\n...
3,Positive,757926635404300292,550488178,one thing i know for sure is that Leslie Knope...
4,Positive,742758704165093376,2910845500,RT @HillaryClinton: Trump's rhetoric is shamef...


In [12]:
with open('clinton_2016.json') as infile:
    pprint(json.loads(next(infile)), depth=1)

{'_id': {...},
 'contributors': None,
 'coordinates': None,
 'created_at': 'Thu Sep 08 01:19:00 +0000 2016',
 'entities': {...},
 'extended_entities': {...},
 'favorite_count': 0,
 'favorited': False,
 'filter_level': 'low',
 'geo': None,
 'id': 773692075699306496,
 'id_str': '773692075699306496',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'is_quote_status': False,
 'lang': 'en',
 'place': None,
 'possibly_sensitive': False,
 'random_number': 0.3559609594276685,
 'retweet_count': 0,
 'retweeted': False,
 'retweeted_status': {...},
 'source': '<a href="https://roundteam.co" rel="nofollow">RoundTeam</a>',
 'stance': 'Neutral',
 'text': 'RT @CNN: Singer Stevie Nicks is backing Hillary Clinton, predicting '
         'a "landslide" in November https://t.co/JE4KdZjzci '
         'https://t.co/TZkHCD69…',
 'timestamp': {...},
 'timestamp_ms': '1473297540007',
 'truncated

## Importing and Standardizing the Data

Data can come as JSON or in tabular form. The only requirement is one column/field containing the text and one containing the label. We can specify a tokenizer that is used for bag-of-words features (and to determine word boundaries for bag-of-character features). The tokenizer can be any function that maps a string to a list of tokens (e.g. `'This is a sentence' -> ['this', 'is', 'a', 'sentence']`). Here we use a tokenizer that was specifically developed for tweets. This is also the place to add lemmatization or other desired transformations of the text.

In [13]:
tokenizer = nltk.TweetTokenizer()
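
To illustrate the tokenizer contract (a string in, a list of tokens out), here is a minimal sketch of a custom tokenizer built on the standard library. The name `simple_tokenize` and the regular expression are assumptions made for illustration; they are not part of `smapp_text_classifier` or `nltk`.

```python
import re

# Hypothetical helper, not part of smapp_text_classifier or nltk.
# Matches word-like tokens, optionally prefixed by @ or # and with an apostrophe suffix.
TOKEN_PATTERN = re.compile(r"[@#]?\w+(?:'\w+)?")

def simple_tokenize(text):
    """Lowercase the text and return word-like tokens, keeping @mentions and #hashtags."""
    return TOKEN_PATTERN.findall(text.lower())

print(simple_tokenize('This is a sentence'))
print(simple_tokenize("RT @CNN: Trump's #rhetoric"))
```

A function like this could be passed wherever the notebook passes `tokenizer.tokenize`, and it is also where lowercasing, lemmatization, or similar transformations would go.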

The `DataSet` class allows the classification pipeline that we will use later to access all relevant information about the dataset. It is a light wrapper around a pandas dataframe that implements a few basic functions. The class can be instantiated with data from different formats: Files (tabular format, json format) or `pandas.DataFrame` objects.



Importing a JSON file:

In [14]:
dataset = DataSet(input_='clinton_2016.json',
                  name='clinton',
                  field_mapping={'label': 'stance', 'text': 'text'})
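
Conceptually, `field_mapping` tells `DataSet` which source fields hold the label and the text. The following is a rough sketch of that idea for line-delimited JSON, using only the standard library. It is not the package's implementation; the variable names are illustrative, and the two records are cut down to three fields each.

```python
import json

# Two line-delimited JSON records shaped like the file above (most fields omitted)
raw_lines = [
    '{"stance": "Neutral", "text": "RT @CNN: Singer Stevie Nicks is backing Hillar...", "lang": "en"}',
    '{"stance": "Negative", "text": "Sounds to me like Hillary is describing hersel...", "lang": "en"}',
]

# Standardized name -> field name in the source data
field_mapping = {'label': 'stance', 'text': 'text'}

# Keep only the mapped fields and rename them to the standardized names
records = [
    {target: doc[source] for target, source in field_mapping.items()}
    for doc in map(json.loads, raw_lines)
]
print(records[0])
```

Every record ends up with exactly the keys `label` and `text`, regardless of what the fields were called in the source file.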

The init method of `DataSet` does the following:
- Transform to a dataframe:

In [15]:
dataset.df.head()

Unnamed: 0,label,text
0,Neutral,RT @CNN: Singer Stevie Nicks is backing Hillar...
1,Positive,RT @EricJafMN: @realDonaldTrump Why did you ca...
2,Positive,RT @HillaryClinton: How pay-to-play works:\n\n...
3,Negative,Sounds to me like Hillary is describing hersel...
4,Positive,RT @peaceisactive: LeBron James: Why I'm Endor...


- Split into training and test set:

In [16]:
print(f'Train rows: {dataset.train_idxs[:10]}')

Train rows: [84, 83, 31, 4, 69, 18, 44, 42, 54, 0]


In [17]:
print(f'Test rows: {dataset.test_idxs[:10]}')

Test rows: [39, 63, 22, 2, 79, 16, 19, 50, 88, 33]
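
The split amounts to shuffling the row indices and partitioning them into two disjoint sets. Below is a minimal sketch of that idea using the standard library. The helper `split_indices`, the test fraction of 0.2, and the seed are assumptions for illustration; the package's actual defaults are not shown in this notebook.

```python
import random

def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into disjoint train and test lists."""
    idxs = list(range(n_rows))
    random.Random(seed).shuffle(idxs)
    n_test = int(round(n_rows * test_fraction))
    # The first n_test shuffled indices form the test set, the rest the training set
    return idxs[n_test:], idxs[:n_test]

train_idxs, test_idxs = split_indices(100)
print(train_idxs[:10], test_idxs[:10])
```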


In [18]:
dataset.df_test.head()

Unnamed: 0,label,text
39,Neutral,RT @PoliticusSarah: Comey Letter Backfires As ...
63,Negative,"@HillaryClinton Dear Hillary, I fail 2 see the..."
22,Negative,RT @45_Committee: “Why aren’t I 50 points ahea...
2,Positive,RT @HillaryClinton: How pay-to-play works:\n\n...
79,Positive,RT @HillaryforVA: Jimmy Ochan found refuge in ...


In [19]:
dataset.get_labels('train')[:5]

84    Negative
83    Negative
31    Negative
4     Positive
69    Negative
Name: label, dtype: object

Passing a dataframe:

In [20]:
dataset = DataSet(input_=df_clinton, name='clinton',
                  field_mapping={'label': 'label', 'text': 'text'})

Passing data that is already split into train/test. Note that the dataframes could also be files.

In [21]:
df_train = df_clinton.iloc[:700]
df_test = df_clinton.iloc[700:]  # start at 700 so that no row is dropped between train and test
dataset = DataSet(train_input=df_train, test_input=df_test, name='clinton',
                  field_mapping={'label': 'label', 'text': 'text'})

Importing a CSV file:

In [22]:
dataset = DataSet(input_='clinton_2016.csv', 
                  name='clinton', 
                  field_mapping={'label': 'label', 'text': 'text'}
                 )

## Creating a text pipeline

### Bag-of-words features

Now we can initialize the classification pipeline. On first use, it pre-computes the desired features so that models can be tested quickly and repeatedly without re-vectorizing the text each time. The document-term matrix is computed once and cached to file; documents are then vectorized by loading the corresponding rows of this matrix.

When the pipeline is first instantiated, the feature matrices are pre-computed:

In [31]:
clf = TextClassifier(dataset=dataset, 
                     algorithm='svm', 
                     feature_set='word_ngrams',
                     ngram_range=(1, 3),
                     cache_dir='feature_cache',
                     tokenize=tokenizer.tokenize)

2019-06-10 14:01:48,134 - root - DEBUG - Pre-computing feature_cache/clinton_word_(1, 1).joblib
2019-06-10 14:01:48,135 - root - DEBUG - Transforming from cache
2019-06-10 14:01:48,136 - root - DEBUG - Cache not found
2019-06-10 14:01:48,136 - root - DEBUG - Transforming from scratch
2019-06-10 14:01:48,283 - root - DEBUG - fit_transform took 0.15s
2019-06-10 14:01:48,283 - root - DEBUG - Pre-computing feature_cache/clinton_word_(1, 2).joblib
2019-06-10 14:01:48,284 - root - DEBUG - Transforming from cache
2019-06-10 14:01:48,285 - root - DEBUG - Cache not found
2019-06-10 14:01:48,286 - root - DEBUG - Transforming from scratch
2019-06-10 14:01:48,668 - root - DEBUG - fit_transform took 0.38s
2019-06-10 14:01:48,669 - root - DEBUG - Pre-computing feature_cache/clinton_word_(1, 3).joblib
2019-06-10 14:01:48,669 - root - DEBUG - Transforming from cache
2019-06-10 14:01:48,669 - root - DEBUG - Cache not found
2019-06-10 14:01:48,670 - root - DEBUG - Transforming from scratch
2019-06-10 14

In this case we computed three matrices, one for each n-gram range from `(1, 1)` (unigrams only) up to `(1, 3)` (uni-, bi-, and tri-grams):

In [33]:
!ls feature_cache/

clinton_word_(1, 1).joblib clinton_word_(1, 3).joblib
clinton_word_(1, 2).joblib
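
To make the n-gram ranges concrete, here is a small sketch that enumerates the word n-grams covered by a given `ngram_range`. The helper `word_ngrams` is hypothetical and only illustrates the idea; how the package builds its matrices internally is not shown in this notebook.

```python
def word_ngrams(tokens, ngram_range):
    """Return all word n-grams with min_n <= n <= max_n, joined by spaces."""
    min_n, max_n = ngram_range
    return [
        ' '.join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

tokens = ['this', 'is', 'a', 'sentence']
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    print(ngram_range, word_ngrams(tokens, ngram_range))
```

Each wider range contains all features of the narrower one plus the longer n-grams, which is why a separate matrix is stored per range.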


If pre-computed features exist, the pipeline re-uses them:

In [35]:
clf = TextClassifier(dataset=dataset, 
                     algorithm='svm', 
                     feature_set='word_ngrams',
                     ngram_range=(1, 3),
                     cache_dir='feature_cache',
                     tokenize=tokenizer.tokenize)

To re-compute the features, set the `recompute_features` argument to `True`:

In [36]:
clf = TextClassifier(dataset=dataset, 
                     algorithm='svm', 
                     feature_set='word_ngrams',
                     ngram_range=(1, 3),
                     cache_dir='feature_cache',
                     tokenize=tokenizer.tokenize,
                     recompute_features=True)

2019-06-10 14:05:27,019 - root - DEBUG - Pre-computing feature_cache/clinton_word_(1, 1).joblib
2019-06-10 14:05:27,020 - root - DEBUG - Transforming from cache
2019-06-10 14:05:27,021 - root - DEBUG - Not loading due to recompute request
2019-06-10 14:05:27,021 - root - DEBUG - Transforming from scratch
2019-06-10 14:05:27,259 - root - DEBUG - fit_transform took 0.24s
2019-06-10 14:05:27,260 - root - DEBUG - Pre-computing feature_cache/clinton_word_(1, 2).joblib
2019-06-10 14:05:27,261 - root - DEBUG - Transforming from cache
2019-06-10 14:05:27,262 - root - DEBUG - Not loading due to recompute request
2019-06-10 14:05:27,262 - root - DEBUG - Transforming from scratch
2019-06-10 14:05:27,650 - root - DEBUG - fit_transform took 0.39s
2019-06-10 14:05:27,651 - root - DEBUG - Pre-computing feature_cache/clinton_word_(1, 3).joblib
2019-06-10 14:05:27,652 - root - DEBUG - Transforming from cache
2019-06-10 14:05:27,653 - root - DEBUG - Not loading due to recompute request
2019-06-10 14:05:
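
The compute-once, load-later behaviour seen in the log messages can be sketched with the standard library as follows. This is only an illustration of the caching idea under simplifying assumptions: the helper `cached_features`, the `recompute` flag, the pickle file, and the plain token counts are all stand-ins, not the package's actual code or file format.

```python
import os
import pickle
import tempfile
from collections import Counter

def cached_features(docs, cache_path, recompute=False):
    """Load per-document token counts from cache_path, or compute and cache them."""
    if os.path.exists(cache_path) and not recompute:
        with open(cache_path, 'rb') as infile:
            return pickle.load(infile), 'cache'
    # Cache missing or recompute requested: build the features from scratch
    features = [Counter(doc.lower().split()) for doc in docs]
    with open(cache_path, 'wb') as outfile:
        pickle.dump(features, outfile)
    return features, 'scratch'

docs = ['Vote early', 'vote often', 'early bird']
cache_path = os.path.join(tempfile.mkdtemp(), 'toy_word_counts.pkl')

_, first = cached_features(docs, cache_path)
features, second = cached_features(docs, cache_path)
_, third = cached_features(docs, cache_path, recompute=True)

# "Vectorize" a subset of documents by picking their rows from the cached features
train_rows = [features[i] for i in [2, 0]]
print(first, second, third)
```

The first call builds the features, the second reads them back from disk, and the third ignores the cache because a recompute was requested.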

### Embedding features

To use basic word embedding features, all pre-trained gensim models are available and can be accessed by name (see https://github.com/RaRe-Technologies/gensim-data for available models). When a model is used for the first time, it is downloaded from the gensim server and stored locally in the gensim data directory (usually in the home directory).

In [40]:
clf = TextClassifier(dataset=dataset, 
                     algorithm='svm', 
                     feature_set='embeddings', 
                     embedding_model_name='glove-twitter-100',
                     tokenize=tokenizer.tokenize)

2019-06-10 14:10:43,250 - root - DEBUG - Pre-computing feature_cache/clinton_glove-twitter-100_mean.pkl
2019-06-10 14:10:43,251 - root - DEBUG - Transforming from cache
2019-06-10 14:10:43,251 - root - DEBUG - Cache not found
2019-06-10 14:10:43,252 - root - DEBUG - Transforming from scratch
2019-06-10 14:10:43,252 - root - DEBUG - Loading embedding model
2019-06-10 14:10:43,297 - smart_open.smart_open_lib - DEBUG - {'uri': '/Users/fridolinlinder/gensim-data/glove-twitter-100/glove-twitter-100.gz', 'mode': 'rb', 'kw': {}}
2019-06-10 14:12:23,961 - root - DEBUG - _load_embedding_model took 100.71s
2019-06-10 14:12:23,962 - root - DEBUG - Embedding documents
2019-06-10 14:12:24,355 - root - DEBUG - transform took 101.10s
2019-06-10 14:12:24,356 - root - DEBUG - Pre-computing feature_cache/clinton_glove-twitter-100_max.pkl
2019-06-10 14:12:24,356 - root - DEBUG - Transforming from cache
2019-06-10 14:12:24,357 - root - DEBUG - Cache not found
2019-06-10 14:12:24,357 - root - DEBUG - Trans

The pipeline pre-computes two document-feature matrices: one in which the word vectors of a document are averaged to obtain the document vector, and one in which the maximum of each dimension is taken. Later we can cross-validate over these matrices.
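
The two pooling strategies can be sketched on toy data as follows. The 2-dimensional vectors, the helper `pool`, and the choice to skip tokens that are missing from the embedding vocabulary are assumptions for illustration; the package's handling of unknown tokens is not shown in this notebook.

```python
from statistics import mean

# Toy 2-dimensional "embeddings"; glove-twitter-100 vectors have 100 dimensions
toy_vectors = {
    'vote': [1.0, 4.0],
    'early': [3.0, 2.0],
}

def pool(tokens, vectors, how):
    """Combine the vectors of known tokens dimension by dimension using `how`."""
    known = [vectors[token] for token in tokens if token in vectors]
    # zip(*known) iterates over dimensions instead of over tokens
    return [how(dimension) for dimension in zip(*known)]

tokens = ['vote', 'early', 'unknownword']
mean_vector = pool(tokens, toy_vectors, mean)
max_vector = pool(tokens, toy_vectors, max)
print(mean_vector, max_vector)
```

Both results have the same dimensionality as the word vectors, so every document is represented by a fixed-length vector no matter how many tokens it contains.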

We can also use character n-gram features:

In [52]:
clf = TextClassifier(dataset=dataset, 
                     algorithm='elasticnet', 
                     feature_set='char_ngrams', 
                     ngram_range=(3, 5),
                     tokenize=tokenizer.tokenize)
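
As a rough illustration of character n-grams that respect word boundaries, the sketch below extracts all substrings of the requested lengths from each token separately. The helper `char_ngrams` is hypothetical, and whether the package pads tokens or treats boundaries in exactly this way is an assumption.

```python
def char_ngrams(tokens, ngram_range):
    """Return character n-grams (min_n <= n <= max_n) extracted within each token."""
    min_n, max_n = ngram_range
    grams = []
    for token in tokens:
        for n in range(min_n, max_n + 1):
            # Slide a window of width n over the token; short tokens yield nothing
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams

print(char_ngrams(['vote', 'now'], (3, 4)))
```

Because the windows never cross token boundaries, the tokenizer still matters for this feature set.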

## Using the pipeline

The main functionality is building the pipeline and providing reasonable default parameter distributions for randomized cross-validation.

This pipeline can be tuned using standard scikit-learn functionality:

In [54]:
CV = sklearn.model_selection.RandomizedSearchCV(
    clf.pipeline, 
    param_distributions=clf.params,
    n_iter=20, 
    cv=5, 
    n_jobs=4, 
    scoring='accuracy', 
    iid=True,  # note: `iid` was deprecated in scikit-learn 0.22 and removed in 0.24; drop it on newer versions
    return_train_score=False,
    random_state=12333
)
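
Randomized search draws `n_iter` candidate settings from the parameter distributions and scores each one with cross-validation, keeping the best. The sampling step alone can be sketched with the standard library as below. The parameter names follow the ones that appear in the results further down, but the ranges and the `lambda` samplers are assumptions for illustration; the real search space is whatever `clf.params` contains.

```python
import random

rng = random.Random(12333)

# Hypothetical search space; the real one is defined by clf.params
param_distributions = {
    'clf__alpha': lambda: 10 ** rng.uniform(-4, -2),
    'clf__l1_ratio': lambda: rng.uniform(0, 1),
    'vectorize__ngram_range': lambda: rng.choice([(3, 3), (3, 4), (3, 5)]),
}

n_iter = 20
# One independent draw per parameter for each of the n_iter candidates
candidates = [
    {name: draw() for name, draw in param_distributions.items()}
    for _ in range(n_iter)
]
print(candidates[0])
```

Each candidate would then be fitted and scored on every cross-validation fold, which is the part `RandomizedSearchCV` handles in the cell above.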

In [55]:
X = dataset.get_texts('train')
y = dataset.get_labels('train')
CV = CV.fit(X, y)

2019-06-10 14:19:41,903 - root - DEBUG - Transforming from cache
2019-06-10 14:19:42,175 - root - DEBUG - _load_from_cache took 0.27s
2019-06-10 14:19:42,176 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:19:42,179 - root - DEBUG - fit_transform took 0.28s


In [56]:
CV.best_score_

0.6894904458598726

In [57]:
y_valid = dataset.get_labels('test')
X_valid = dataset.get_texts('test')
y_pred = CV.predict(X_valid)
score = round(sklearn.metrics.accuracy_score(y_true=y_valid, y_pred=y_pred), 3)
print(score)

2019-06-10 14:19:42,309 - root - DEBUG - Transforming from cache
2019-06-10 14:19:42,569 - root - DEBUG - _load_from_cache took 0.26s
2019-06-10 14:19:42,570 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:19:42,572 - root - DEBUG - transform took 0.26s


0.695


In [58]:
y_valid = dataset.get_labels('test')
X_valid = dataset.get_texts('test')

In [59]:
best_tuning_params = CV.best_estimator_.get_params()
print(f'Best n_gram range: {best_tuning_params["vectorize__ngram_range"]}')
print(f'Best l1_ratio (elastic net): {best_tuning_params["clf__l1_ratio"]:.2f}')

Best n_gram range: (3, 5)
Best l1_ratio (elastic net): 0.12


In [60]:
pd.DataFrame(CV.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_clf__alpha,param_clf__l1_ratio,param_vectorize__ngram_range,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.314328,0.004339,0.285309,0.019937,0.00452893,0.79552,"(3, 5)","{'clf__alpha': 0.004528929106661639, 'clf__l1_...",0.637795,0.606299,0.722222,0.66129,0.604839,0.646497,0.043341,18
1,0.304766,0.012662,0.263431,0.009517,0.00620494,0.662786,"(3, 5)","{'clf__alpha': 0.006204944906447042, 'clf__l1_...",0.708661,0.645669,0.769841,0.637097,0.653226,0.683121,0.050191,3
2,0.123009,0.00311,0.108532,0.012287,0.00065232,0.414368,"(3, 3)","{'clf__alpha': 0.0006523198121056162, 'clf__l1...",0.669291,0.637795,0.730159,0.669355,0.629032,0.667197,0.035491,12
3,0.275257,0.019906,0.203141,0.003667,0.00451058,0.982801,"(3, 4)","{'clf__alpha': 0.004510582561863079, 'clf__l1_...",0.661417,0.606299,0.68254,0.604839,0.524194,0.616242,0.054893,20
4,0.231471,0.003356,0.201713,0.005346,0.00158527,0.197508,"(3, 4)","{'clf__alpha': 0.0015852703375864763, 'clf__l1...",0.716535,0.677165,0.674603,0.645161,0.637097,0.670382,0.028072,10
5,0.281163,0.003764,0.257206,0.003003,0.0065746,0.254428,"(3, 5)","{'clf__alpha': 0.006574596558343654, 'clf__l1_...",0.685039,0.661417,0.738095,0.637097,0.637097,0.671975,0.037611,9
6,0.28125,0.004924,0.258284,0.004115,0.00848903,0.188967,"(3, 5)","{'clf__alpha': 0.008489030400038417, 'clf__l1_...",0.677165,0.661417,0.753968,0.620968,0.629032,0.66879,0.047377,11
7,0.294463,0.014638,0.290169,0.00969,0.00679951,0.662623,"(3, 5)","{'clf__alpha': 0.006799507177839524, 'clf__l1_...",0.653543,0.685039,0.769841,0.653226,0.612903,0.675159,0.052661,6
8,0.336295,0.009006,0.302997,0.007585,0.000446601,0.411518,"(3, 5)","{'clf__alpha': 0.00044660147215268455, 'clf__l...",0.724409,0.629921,0.706349,0.669355,0.645161,0.675159,0.035832,6
9,0.349309,0.01191,0.314805,0.008609,0.00920426,0.717592,"(3, 5)","{'clf__alpha': 0.00920425881517373, 'clf__l1_r...",0.669291,0.653543,0.753968,0.701613,0.596774,0.675159,0.051951,6


## Cross-validating across multiple algorithms and feature sets

We can use a simple loop to compare the performance of different algorithms. So far the following are implemented:

In [61]:
algorithms = ['random_forest', 'elasticnet', 'svm']

These feature sets are available (note that if you use `embeddings` you need to provide the name of a gensim embedding model as well):

In [62]:
feature_sets = ['embeddings', 'char_ngrams', 'word_ngrams']

In [67]:
for algorithm in algorithms:
    for feature_set in feature_sets:
        print(f'Fitting {algorithm} with {feature_set}')
        
        clf = TextClassifier(
            dataset=dataset, 
            algorithm=algorithm, 
            feature_set=feature_set, 
            max_n_features=10000, 
            embedding_model_name='glove-twitter-100'
        )

        CV = sklearn.model_selection.RandomizedSearchCV(
            clf.pipeline,
            param_distributions=clf.params,
            n_iter=10, 
            cv=3, 
            n_jobs=8,
            scoring='accuracy', 
            iid=False  # note: `iid` was deprecated in scikit-learn 0.22 and removed in 0.24; drop it on newer versions
        )
        X = dataset.get_texts('train')
        y = dataset.get_labels('train')
        CV = CV.fit(X, y)
        print(CV.best_score_)
        
        y_valid = dataset.get_labels('test')
        X_valid = dataset.get_texts('test')
        y_pred = CV.predict(X_valid)
        score = round(sklearn.metrics.accuracy_score(y_true=y_valid, y_pred=y_pred), 3)
        print(f'Best score for {algorithm} with {feature_set} on test set: {score}')
        
        best_t_params = CV.best_estimator_.get_params()

Fitting random_forest with embeddings


2019-06-10 14:21:55,992 - root - DEBUG - Transforming from cache
2019-06-10 14:21:55,995 - root - DEBUG - _load_from_cache took 0.00s
2019-06-10 14:21:55,996 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:21:55,998 - root - DEBUG - transform took 0.01s
2019-06-10 14:21:56,707 - root - DEBUG - Transforming from cache
2019-06-10 14:21:56,711 - root - DEBUG - _load_from_cache took 0.00s
2019-06-10 14:21:56,711 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:21:56,712 - root - DEBUG - transform took 0.01s


0.6193969772917142
Best score for random_forest with embeddings on test set: 0.595
Fitting random_forest with char_ngrams


2019-06-10 14:22:03,670 - root - DEBUG - Transforming from cache
2019-06-10 14:22:03,938 - root - DEBUG - _load_from_cache took 0.27s
2019-06-10 14:22:03,939 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:22:03,942 - root - DEBUG - fit_transform took 0.27s
2019-06-10 14:22:04,534 - root - DEBUG - Transforming from cache


0.6942735626946153


2019-06-10 14:22:04,781 - root - DEBUG - _load_from_cache took 0.25s
2019-06-10 14:22:04,782 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:22:04,785 - root - DEBUG - transform took 0.25s


Best score for random_forest with char_ngrams on test set: 0.667
Fitting random_forest with word_ngrams


2019-06-10 14:22:13,527 - root - DEBUG - Transforming from cache
2019-06-10 14:22:13,569 - root - DEBUG - _load_from_cache took 0.04s
2019-06-10 14:22:13,570 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:22:13,573 - root - DEBUG - fit_transform took 0.05s
2019-06-10 14:22:14,267 - root - DEBUG - Transforming from cache
2019-06-10 14:22:14,310 - root - DEBUG - _load_from_cache took 0.04s
2019-06-10 14:22:14,311 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:22:14,312 - root - DEBUG - transform took 0.05s


0.6401534138376244
Best score for random_forest with word_ngrams on test set: 0.629
Fitting elasticnet with embeddings


2019-06-10 14:22:14,717 - root - DEBUG - Transforming from cache
2019-06-10 14:22:14,721 - root - DEBUG - _load_from_cache took 0.00s
2019-06-10 14:22:14,722 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:22:14,723 - root - DEBUG - transform took 0.01s
2019-06-10 14:22:14,754 - root - DEBUG - Transforming from cache
2019-06-10 14:22:14,758 - root - DEBUG - _load_from_cache took 0.00s
2019-06-10 14:22:14,758 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:22:14,760 - root - DEBUG - transform took 0.01s


0.6082782714361662
Best score for elasticnet with embeddings on test set: 0.614
Fitting elasticnet with char_ngrams


2019-06-10 14:22:18,333 - root - DEBUG - Transforming from cache
2019-06-10 14:22:18,451 - root - DEBUG - _load_from_cache took 0.12s
2019-06-10 14:22:18,452 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:22:18,454 - root - DEBUG - fit_transform took 0.12s
2019-06-10 14:22:18,487 - root - DEBUG - Transforming from cache
2019-06-10 14:22:18,609 - root - DEBUG - _load_from_cache took 0.12s
2019-06-10 14:22:18,610 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:22:18,612 - root - DEBUG - transform took 0.13s


0.6910989595200121
Best score for elasticnet with char_ngrams on test set: 0.662
Fitting elasticnet with word_ngrams


2019-06-10 14:22:20,721 - root - DEBUG - Transforming from cache
2019-06-10 14:22:20,959 - root - DEBUG - _load_from_cache took 0.24s
2019-06-10 14:22:20,960 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:22:20,963 - root - DEBUG - fit_transform took 0.24s
2019-06-10 14:22:20,972 - root - DEBUG - Transforming from cache


0.6752107541581225


2019-06-10 14:22:21,203 - root - DEBUG - _load_from_cache took 0.23s
2019-06-10 14:22:21,204 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:22:21,207 - root - DEBUG - transform took 0.23s


Best score for elasticnet with word_ngrams on test set: 0.652
Fitting svm with embeddings


2019-06-10 14:22:21,606 - root - DEBUG - Transforming from cache
2019-06-10 14:22:21,610 - root - DEBUG - _load_from_cache took 0.00s
2019-06-10 14:22:21,610 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:22:21,612 - root - DEBUG - transform took 0.01s
2019-06-10 14:22:21,661 - root - DEBUG - Transforming from cache
2019-06-10 14:22:21,665 - root - DEBUG - _load_from_cache took 0.00s
2019-06-10 14:22:21,666 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:22:21,668 - root - DEBUG - transform took 0.01s


0.6114528746107694
Best score for svm with embeddings on test set: 0.662
Fitting svm with char_ngrams


2019-06-10 14:22:27,139 - root - DEBUG - Transforming from cache
2019-06-10 14:22:27,348 - root - DEBUG - _load_from_cache took 0.21s
2019-06-10 14:22:27,349 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:22:27,352 - root - DEBUG - fit_transform took 0.21s
2019-06-10 14:22:27,818 - root - DEBUG - Transforming from cache
2019-06-10 14:22:28,017 - root - DEBUG - _load_from_cache took 0.20s
2019-06-10 14:22:28,017 - root - DEBUG - Checking if cache matches index docs


0.7069795701374648


2019-06-10 14:22:28,020 - root - DEBUG - transform took 0.20s


Best score for svm with char_ngrams on test set: 0.695
Fitting svm with word_ngrams


2019-06-10 14:22:30,919 - root - DEBUG - Transforming from cache
2019-06-10 14:22:31,059 - root - DEBUG - _load_from_cache took 0.14s
2019-06-10 14:22:31,060 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:22:31,062 - root - DEBUG - fit_transform took 0.14s
2019-06-10 14:22:31,161 - root - DEBUG - Transforming from cache
2019-06-10 14:22:31,301 - root - DEBUG - _load_from_cache took 0.14s
2019-06-10 14:22:31,302 - root - DEBUG - Checking if cache matches index docs
2019-06-10 14:22:31,303 - root - DEBUG - transform took 0.14s


0.710207336523126
Best score for svm with word_ngrams on test set: 0.657


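In this case, the SVM with character n-grams performed best on the test set (accuracy 0.695), while the SVM with word n-grams achieved the highest cross-validation score (0.710).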
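
To compare the runs at a glance, the scores printed by the loop can be collected in a small table. The sketch below copies the numbers from the output above (cross-validation scores rounded to three decimals); the variable names are illustrative.

```python
# (algorithm, feature_set) -> (best CV accuracy, test accuracy), copied from the output above
results = {
    ('random_forest', 'embeddings'): (0.619, 0.595),
    ('random_forest', 'char_ngrams'): (0.694, 0.667),
    ('random_forest', 'word_ngrams'): (0.640, 0.629),
    ('elasticnet', 'embeddings'): (0.608, 0.614),
    ('elasticnet', 'char_ngrams'): (0.691, 0.662),
    ('elasticnet', 'word_ngrams'): (0.675, 0.652),
    ('svm', 'embeddings'): (0.611, 0.662),
    ('svm', 'char_ngrams'): (0.707, 0.695),
    ('svm', 'word_ngrams'): (0.710, 0.657),
}

best_in_cv = max(results, key=lambda key: results[key][0])
best_on_test = max(results, key=lambda key: results[key][1])

# Print all runs, best test accuracy first
for (algorithm, feature_set), (cv_score, test_score) in sorted(
        results.items(), key=lambda item: item[1][1], reverse=True):
    print(f'{algorithm:<15}{feature_set:<13}{cv_score:.3f}  {test_score:.3f}')
```

In practice, the loop above could append each pair of scores to such a dictionary instead of only printing them, so that the comparison does not have to be read off the log output.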