# SMaPP Text Classification Pipeline


## About

This document provides a quick intro to the basic functionality of the supervised text classification pipeline.

Goals: 
- Make training of supervised models for text classification easier for lab members
- Abstracted enough to take away tedious and repetitive tasks
- But light enough to be modifiable and useful for specific use-cases

What it provides:
- Quickly load data from common SMaPP formats
- Easily build a pipeline that selects the best algorithm, tuning parameters and feature set from common choices with reasonable defaults

## Installation

The package can be installed directly from GitHub using `pip`:

In [1]:
import sys
sys.path.append('/Users/fridolinlinder/projects/smapp_text_classifier/')

In [None]:
#!pip install git+https://github.com/smappnyu/smapp_text_classifier.git

The two main classes contained in the package are `DataSet` and `TextClassifier`. Let's import them:

In [2]:
from smapp_text_classifier.data import DataSet
from smapp_text_classifier.models import TextClassifier
from smapp_text_classifier.plot import plot_learning_curve

We also need to import some additional packages:

In [3]:
import sys
import logging
import json
import sklearn
import nltk

import numpy as np
import pandas as pd

from pprint import pprint

All logging (the amount of messages the package gives you about what is going on) is implemented using the standard Python `logging` module. If you want fewer messages, set the logging level to `logging.WARNING` or `logging.ERROR`; `logging.DEBUG`, used below, is the most verbose setting.
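As a minimal sketch of how the level controls verbosity, the snippet below collects messages in a list instead of printing them. The logger name `demo` and the `ListHandler` class are made up for this illustration and are not part of the package.

```python
import logging


class ListHandler(logging.Handler):
    """Collect emitted messages in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


demo_logger = logging.getLogger("demo")  # hypothetical logger name
demo_logger.propagate = False            # keep the demo out of the root logger
handler = ListHandler()
demo_logger.addHandler(handler)

# At DEBUG level every message passes through.
demo_logger.setLevel(logging.DEBUG)
demo_logger.debug("shown at DEBUG level")

# At ERROR level, debug and info messages are dropped.
demo_logger.setLevel(logging.ERROR)
demo_logger.debug("hidden at ERROR level")
demo_logger.info("hidden at ERROR level")
demo_logger.error("shown at ERROR level")

print(handler.messages)
```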

In [4]:
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', 
                    level=logging.INFO)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
logging.getLogger("gensim").setLevel(logging.ERROR)
np.random.seed(989898)

## Tutorial Setup

The goal of this exercise is to train a supervised model that learns a function mapping text documents to labels. We start out with our labeled data in `.csv` and `.json` format. Here's what our directory looks like:

In [None]:
!ls

Let's take a look at the data:

In [5]:
df_clinton = pd.read_csv('clinton_2016.csv')
df_clinton.head()

Unnamed: 0,label,tweet_id,user_id,text
0,Neutral,773692075699306496,725302089048453124,RT @CNN: Singer Stevie Nicks is backing Hillar...
1,Negative,786581360672735232,753594430330900481,RT @Italians4Trump: Hillary Supporters Attack ...
2,Positive,775873669725843456,1452015206,RT @HillaryClinton: How pay-to-play works:\n\n...
3,Positive,757926635404300292,550488178,one thing i know for sure is that Leslie Knope...
4,Positive,742758704165093376,2910845500,RT @HillaryClinton: Trump's rhetoric is shamef...


In [6]:
with open('clinton_2016.json') as infile:
    pprint(json.loads(next(infile)), depth=1)

{'_id': {...},
 'contributors': None,
 'coordinates': None,
 'created_at': 'Thu Sep 08 01:19:00 +0000 2016',
 'entities': {...},
 'extended_entities': {...},
 'favorite_count': 0,
 'favorited': False,
 'filter_level': 'low',
 'geo': None,
 'id': 773692075699306496,
 'id_str': '773692075699306496',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'is_quote_status': False,
 'lang': 'en',
 'place': None,
 'possibly_sensitive': False,
 'random_number': 0.3559609594276685,
 'retweet_count': 0,
 'retweeted': False,
 'retweeted_status': {...},
 'source': '<a href="https://roundteam.co" rel="nofollow">RoundTeam</a>',
 'stance': 'Neutral',
 'text': 'RT @CNN: Singer Stevie Nicks is backing Hillary Clinton, predicting '
         'a "landslide" in November https://t.co/JE4KdZjzci '
         'https://t.co/TZkHCD69…',
 'timestamp': {...},
 'timestamp_ms': '1473297540007',
 'truncated

## Importing and Standardizing the Data

Data can come as json or in tabular form. The only requirement is one column/field containing text and one containing a label. We can specify a tokenizer that is used for bag-of-words features (and to determine word boundaries for bag-of-character features). The tokenizer can be any function that maps a string to a list of tokens (e.g. `'This is a sentence' -> ['this', 'is', 'a', 'sentence']`). Here we use a tokenizer that was specifically developed for tweets. You could also add lemmatization or other desired transformations of the text at this step.

In [7]:
tokenizer = nltk.TweetTokenizer()
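
Since any function from a string to a list of tokens qualifies, a hand-written tokenizer is also possible. The following is only an illustrative stand-in (the name `simple_tokenizer` and its regular expression are assumptions, not part of the package or of nltk):

```python
import re


def simple_tokenizer(text):
    """Lowercase the text and split it into word-like tokens.

    @mentions and #hashtags are kept together with their prefix.
    """
    return re.findall(r"[@#]?\w+", text.lower())


print(simple_tokenizer("This is a sentence"))
```

Presumably such a function could be passed wherever `tokenizer.tokenize` is used later in this tutorial, though that has not been tested here.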

The `DataSet` class allows the classification pipeline that we will use later to access all relevant information about the dataset. It is a light wrapper around a pandas dataframe that implements a few basic functions. The class can be instantiated with data from different formats: Files (tabular format, json format) or `pandas.DataFrame` objects.

Importing a json:

In [8]:
dataset = DataSet(input_='clinton_2016.json',
                  name='clinton',
                  field_mapping={'label': 'stance', 'text': 'text'})

The init method of `DataSet` does the following:
- Transform to a dataframe:

In [9]:
dataset.df.head()

Unnamed: 0,label,text
0,Neutral,RT @CNN: Singer Stevie Nicks is backing Hillar...
1,Positive,RT @EricJafMN: @realDonaldTrump Why did you ca...
2,Positive,RT @HillaryClinton: How pay-to-play works:\n\n...
3,Negative,Sounds to me like Hillary is describing hersel...
4,Positive,RT @peaceisactive: LeBron James: Why I'm Endor...


- Split into training and test set:

In [10]:
print(f'Train rows: {dataset.train_idxs[:10]}')

Train rows: [84, 83, 31, 4, 69, 18, 44, 42, 54, 0]


In [11]:
print(f'Test rows: {dataset.test_idxs[:10]}')

Test rows: [39, 63, 22, 2, 79, 16, 19, 50, 88, 33]


In [12]:
dataset.df_test.head()

Unnamed: 0,label,text
39,Neutral,RT @PoliticusSarah: Comey Letter Backfires As ...
63,Negative,"@HillaryClinton Dear Hillary, I fail 2 see the..."
22,Negative,RT @45_Committee: “Why aren’t I 50 points ahea...
2,Positive,RT @HillaryClinton: How pay-to-play works:\n\n...
79,Positive,RT @HillaryforVA: Jimmy Ochan found refuge in ...


In [13]:
dataset.get_labels('train')[:5]

84    Negative
83    Negative
31    Negative
4     Positive
69    Negative
Name: label, dtype: object
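
To make the splitting step concrete, here is a rough sketch of a random train/test split over row indices. It is not the `DataSet` implementation; the function name, the `test_fraction` and `seed` parameters, and the shuffling strategy are all assumptions for illustration.

```python
import random


def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    rng = random.Random(seed)
    idxs = list(range(n_rows))
    rng.shuffle(idxs)
    n_test = int(round(n_rows * test_fraction))
    # The first n_test shuffled indices form the test set, the rest the training set.
    return idxs[n_test:], idxs[:n_test]


train_idxs, test_idxs = split_indices(10, test_fraction=0.3, seed=1)
print(train_idxs, test_idxs)
```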

Passing a dataframe:

In [14]:
dataset = DataSet(input_=df_clinton, name='clinton',
                  field_mapping={'label': 'label', 'text': 'text'})

Passing data that is already split into train/test. Note that the dataframes could also be files.

In [15]:
df_train = df_clinton.iloc[:700]
df_test = df_clinton.iloc[700:]  # start at 700 so that no row is dropped between train and test
dataset = DataSet(train_input=df_train, test_input=df_test, name='clinton',
                  field_mapping={'label': 'label', 'text': 'text'})

Importing a csv:

In [16]:
dataset = DataSet(input_='clinton_2016.csv', 
                  name='clinton', 
                  field_mapping={'label': 'label', 'text': 'text'}
                 )

## Creating a text pipeline

### Bag-of-words features

Now we can initialize the classification pipeline. The first time it is instantiated, it pre-computes the desired features to allow quick and repeated testing without re-vectorizing the text each time. The document-term matrix is computed once and stored in a file (in the `cache_dir` you provide). Documents can then be vectorized (transformed into a matrix that can be used in statistical models) by loading the corresponding rows of this matrix.

When the pipeline is first instantiated, the feature matrices are pre-computed:

In [20]:
clf = TextClassifier(
    dataset=dataset, 
    algorithm='svm', 
    feature_set='word_ngrams',
    ngram_range=(1, 3),
    cache_dir='feature_cache',
    tokenize=tokenizer.tokenize
)

2019-08-05 10:18:21,555 - root - DEBUG - Pre-computing feature_cache/clinton_word_(1, 1).joblib
2019-08-05 10:18:21,557 - root - DEBUG - Transforming from cache
2019-08-05 10:18:21,557 - root - DEBUG - Cache not found
2019-08-05 10:18:21,558 - root - DEBUG - Transforming from scratch
2019-08-05 10:18:21,748 - root - DEBUG - fit_transform took 0.19s
2019-08-05 10:18:21,749 - root - DEBUG - Pre-computing feature_cache/clinton_word_(1, 2).joblib
2019-08-05 10:18:21,749 - root - DEBUG - Transforming from cache
2019-08-05 10:18:21,750 - root - DEBUG - Cache not found
2019-08-05 10:18:21,750 - root - DEBUG - Transforming from scratch
2019-08-05 10:18:22,170 - root - DEBUG - fit_transform took 0.42s
2019-08-05 10:18:22,171 - root - DEBUG - Pre-computing feature_cache/clinton_word_(1, 3).joblib
2019-08-05 10:18:22,171 - root - DEBUG - Transforming from cache
2019-08-05 10:18:22,172 - root - DEBUG - Cache not found
2019-08-05 10:18:22,172 - root - DEBUG - Transforming from scratch
2019-08-05 10

In this case we computed three matrices for uni-, bi-, and tri-grams:

In [21]:
!ls feature_cache/

clinton_word_(1, 1).joblib clinton_word_(1, 3).joblib
clinton_word_(1, 2).joblib


If pre-computed features exist, the pipeline re-uses them:

In [22]:
clf = TextClassifier(
    dataset=dataset, 
    algorithm='svm', 
    feature_set='word_ngrams',
    ngram_range=(1, 3),
    cache_dir='feature_cache',
    tokenize=tokenizer.tokenize
)

To re-compute the features, set the `recompute_features` argument to `True`:

In [23]:
clf = TextClassifier(
    dataset=dataset, 
    algorithm='svm', 
    feature_set='word_ngrams',
    ngram_range=(1, 3),
    cache_dir='feature_cache',
    tokenize=tokenizer.tokenize,
    recompute_features=True
)

2019-08-05 10:19:40,771 - root - DEBUG - Pre-computing feature_cache/clinton_word_(1, 1).joblib
2019-08-05 10:19:40,772 - root - DEBUG - Transforming from cache
2019-08-05 10:19:40,772 - root - DEBUG - Not loading due to recompute request
2019-08-05 10:19:40,774 - root - DEBUG - Transforming from scratch
2019-08-05 10:19:40,956 - root - DEBUG - fit_transform took 0.18s
2019-08-05 10:19:40,957 - root - DEBUG - Pre-computing feature_cache/clinton_word_(1, 2).joblib
2019-08-05 10:19:40,958 - root - DEBUG - Transforming from cache
2019-08-05 10:19:40,958 - root - DEBUG - Not loading due to recompute request
2019-08-05 10:19:40,959 - root - DEBUG - Transforming from scratch
2019-08-05 10:19:41,365 - root - DEBUG - fit_transform took 0.41s
2019-08-05 10:19:41,366 - root - DEBUG - Pre-computing feature_cache/clinton_word_(1, 3).joblib
2019-08-05 10:19:41,367 - root - DEBUG - Transforming from cache
2019-08-05 10:19:41,367 - root - DEBUG - Not loading due to recompute request
2019-08-05 10:19:
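
The compute-once, load-later behaviour described in this section can be sketched roughly as below. This is a simplified illustration using `pickle`, not the package's code; `load_or_compute`, its `recompute` flag, and the toy feature matrix are hypothetical.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute_fn, recompute=False):
    """Return cached features if present, otherwise compute and store them."""
    if os.path.exists(cache_path) and not recompute:
        with open(cache_path, "rb") as infile:
            return pickle.load(infile)
    features = compute_fn()
    with open(cache_path, "wb") as outfile:
        pickle.dump(features, outfile)
    return features


calls = []


def toy_features():
    calls.append(1)  # track how often the expensive step actually runs
    return [[1, 0, 2], [0, 1, 0]]


with tempfile.TemporaryDirectory() as cache_dir:
    path = os.path.join(cache_dir, "toy_features.pkl")
    first = load_or_compute(path, toy_features)                  # computes and writes
    second = load_or_compute(path, toy_features)                 # reads from disk
    third = load_or_compute(path, toy_features, recompute=True)  # forces a recompute

print(len(calls))
```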

### Embedding features

To use basic word embedding features, all pre-trained gensim models are available and can be accessed by name (see https://github.com/RaRe-Technologies/gensim-data for available models). When a model is used for the first time, it is downloaded from the gensim server and stored locally in the gensim data directory (usually in the home directory).

In [26]:
clf = TextClassifier(
    dataset=dataset, 
    algorithm='svm', 
    feature_set='embeddings', 
    embedding_model_name='glove-wiki-gigaword-50',
    tokenize=tokenizer.tokenize
)

2019-08-05 10:21:45,170 - root - DEBUG - Pre-computing feature_cache/clinton_glove-wiki-gigaword-50_mean.pkl
2019-08-05 10:21:45,171 - root - DEBUG - Transforming from cache
2019-08-05 10:21:45,172 - root - DEBUG - Cache not found
2019-08-05 10:21:45,189 - root - DEBUG - Transforming from scratch
2019-08-05 10:21:45,191 - root - DEBUG - Loading embedding model
2019-08-05 10:21:45,468 - smart_open.smart_open_lib - DEBUG - {'uri': '/Users/fridolinlinder/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz', 'mode': 'rb', 'kw': {}}
2019-08-05 10:22:25,570 - root - DEBUG - _load_embedding_model took 40.38s
2019-08-05 10:22:25,861 - root - DEBUG - fit_transform took 40.69s
2019-08-05 10:22:25,862 - root - DEBUG - Pre-computing feature_cache/clinton_glove-wiki-gigaword-50_max.pkl
2019-08-05 10:22:25,864 - root - DEBUG - Transforming from cache
2019-08-05 10:22:25,864 - root - DEBUG - Cache not found
2019-08-05 10:22:25,866 - root - DEBUG - Transforming from scratch
2019-08-05 10:22:2

The pipeline pre-computes two document-feature matrices: one where the word vectors in a document are averaged to obtain a document vector, and one where the maximum of each dimension is used. Later we can cross-validate over these matrices.
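
As a small worked example of the two pooling strategies, the sketch below uses plain Python lists with made-up three-dimensional word vectors; real embeddings would come from the gensim model, and `pool_word_vectors` is a hypothetical helper name.

```python
def pool_word_vectors(word_vectors):
    """Return (mean, max) document vectors from equal-length word vectors."""
    n_words = len(word_vectors)
    columns = list(zip(*word_vectors))  # one tuple per embedding dimension
    mean_vec = [sum(col) / n_words for col in columns]
    max_vec = [max(col) for col in columns]
    return mean_vec, max_vec


# Two toy word vectors standing in for a two-word document.
toy_vectors = [[1.0, -2.0, 0.0], [3.0, 0.0, -4.0]]
mean_vec, max_vec = pool_word_vectors(toy_vectors)
print(mean_vec)  # [2.0, -1.0, -2.0]
print(max_vec)   # [3.0, 0.0, 0.0]
```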

We can also use character n-gram features:

In [27]:
clf = TextClassifier(
    dataset=dataset, 
    algorithm='random_forest', 
    feature_set='char_ngrams', 
    ngram_range=(3, 5),
    tokenize=tokenizer.tokenize
)

2019-08-05 10:23:40,837 - root - DEBUG - Pre-computing feature_cache/clinton_char_wb_(3, 3).joblib
2019-08-05 10:23:40,839 - root - DEBUG - Transforming from cache
2019-08-05 10:23:40,840 - root - DEBUG - Cache not found
2019-08-05 10:23:40,841 - root - DEBUG - Transforming from scratch
2019-08-05 10:23:41,533 - root - DEBUG - fit_transform took 0.69s
2019-08-05 10:23:41,535 - root - DEBUG - Pre-computing feature_cache/clinton_char_wb_(3, 4).joblib
2019-08-05 10:23:41,539 - root - DEBUG - Transforming from cache
2019-08-05 10:23:41,541 - root - DEBUG - Cache not found
2019-08-05 10:23:41,543 - root - DEBUG - Transforming from scratch
2019-08-05 10:23:43,006 - root - DEBUG - fit_transform took 1.47s
2019-08-05 10:23:43,008 - root - DEBUG - Pre-computing feature_cache/clinton_char_wb_(3, 5).joblib
2019-08-05 10:23:43,009 - root - DEBUG - Transforming from cache
2019-08-05 10:23:43,010 - root - DEBUG - Cache not found
2019-08-05 10:23:43,011 - root - DEBUG - Transforming from scratch
2019
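
To illustrate what character n-grams within word boundaries look like, here is a minimal sketch. Padding each token with a space mirrors what the `char_wb` part of the cache file names suggests, but whether the package builds its features exactly this way is an assumption; `char_ngrams` is a hypothetical helper.

```python
def char_ngrams(tokens, n_min, n_max):
    """Character n-grams of length n_min..n_max that never cross token boundaries."""
    ngrams = []
    for token in tokens:
        padded = f" {token} "  # pad with spaces so word edges are visible
        for n in range(n_min, n_max + 1):
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


print(char_ngrams(["hill"], 3, 4))
```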

## Using the pipeline

The main functionality is building the pipeline and providing reasonable default parameter distributions for randomized cross-validation.

This pipeline can be tuned using standard scikit-learn functionality:

In [37]:
CV = sklearn.model_selection.RandomizedSearchCV(
    clf.pipeline, 
    param_distributions=clf.params,
    n_iter=10, 
    cv=3, 
    n_jobs=4, 
    scoring='accuracy', 
    verbose=3
)

In [38]:
X = dataset.get_texts('train')
y = dataset.get_labels('train')
CV = CV.fit(X, y)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  30 out of  30 | elapsed:   16.2s finished
2019-08-05 10:25:42,513 - root - DEBUG - Transforming from cache
2019-08-05 10:25:42,724 - root - DEBUG - _load_from_cache took 0.21s
2019-08-05 10:25:42,727 - root - DEBUG - Checking if cache matches index docs
2019-08-05 10:25:42,737 - root - DEBUG - fit_transform took 0.22s


In [39]:
CV.best_score_

0.6735668789808917

In [40]:
y_valid = dataset.get_labels('test')
X_valid = dataset.get_texts('test')
y_pred = CV.predict(X_valid)
score = round(sklearn.metrics.accuracy_score(y_true=y_valid, y_pred=y_pred), 3)
print(score)

2019-08-05 10:25:55,009 - root - DEBUG - Transforming from cache
2019-08-05 10:25:55,222 - root - DEBUG - _load_from_cache took 0.21s
2019-08-05 10:25:55,223 - root - DEBUG - Checking if cache matches index docs
2019-08-05 10:25:55,230 - root - DEBUG - transform took 0.22s


0.643


In [42]:
best_tuning_params = CV.best_estimator_.get_params()
print(f'Best n_gram range: {best_tuning_params["vectorize__ngram_range"]}')
print(f'Best n_estimators (random_forest): {best_tuning_params["clf__n_estimators"]:.2f}')

Best n_gram range: (3, 3)
Best n_estimators (random_forest): 25.00


In [43]:
pd.DataFrame(CV.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_clf__min_samples_split,param_clf__n_estimators,param_vectorize__ngram_range,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,1.867104,0.009309,0.288157,0.000416,0.0260949,311,"(3, 3)",{'clf__min_samples_split': 0.02609494696283190...,0.628571,0.717703,0.641148,0.66242,0.039381,3
1,1.554156,0.138557,0.611973,0.00995,0.0810205,191,"(3, 5)",{'clf__min_samples_split': 0.08102054365737989...,0.628571,0.712919,0.645933,0.66242,0.036363,3
2,1.174629,0.010136,0.27353,0.009018,0.0704503,260,"(3, 3)",{'clf__min_samples_split': 0.07045032438032485...,0.62381,0.688995,0.650718,0.654459,0.026753,7
3,0.867482,0.017298,0.252471,0.002699,0.0595418,170,"(3, 3)",{'clf__min_samples_split': 0.05954180911108659...,0.619048,0.69378,0.645933,0.652866,0.030912,9
4,2.5415,0.062147,0.609818,0.00665,0.0668841,432,"(3, 4)",{'clf__min_samples_split': 0.06688412696051756...,0.628571,0.712919,0.641148,0.660828,0.037147,5
5,1.326636,0.05647,0.559391,0.033276,0.0560541,180,"(3, 4)",{'clf__min_samples_split': 0.05605413642936416...,0.633333,0.708134,0.641148,0.660828,0.033563,5
6,0.540363,0.010249,0.238638,0.001941,0.0612805,81,"(3, 3)",{'clf__min_samples_split': 0.06128047366270688...,0.62381,0.688995,0.636364,0.649682,0.028235,10
7,1.095779,0.015361,0.277868,0.016096,0.0567088,223,"(3, 3)",{'clf__min_samples_split': 0.05670880404405077...,0.62381,0.69378,0.645933,0.654459,0.029204,7
8,0.337908,0.012966,0.222622,0.009213,0.00857212,25,"(3, 3)",{'clf__min_samples_split': 0.00857211899481211...,0.657143,0.717703,0.645933,0.673567,0.031506,1
9,1.876324,0.018965,0.663911,0.002699,0.0762935,300,"(3, 5)",{'clf__min_samples_split': 0.07629350119401898...,0.642857,0.717703,0.641148,0.667197,0.035677,2
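
The idea behind the randomized search used above can be sketched in a few lines: draw a fixed number of random parameter settings, score each, and keep the best. This toy version leaves out cross-validation (in `RandomizedSearchCV` each candidate's score is the mean over the folds); `random_search`, the samplers and the objective are all made up for illustration.

```python
import random


def random_search(score_fn, param_samplers, n_iter=10, seed=0):
    """Draw n_iter random parameter settings and return the best one with its score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: sampler(rng) for name, sampler in param_samplers.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy objective (purely illustrative): highest when x is close to 3.
def toy_score(params):
    return -(params["x"] - 3.0) ** 2


samplers = {"x": lambda rng: rng.uniform(0.0, 6.0)}
best_params, best_score = random_search(toy_score, samplers, n_iter=50, seed=42)
print(best_params, best_score)
```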


## Cross-validating across multiple algorithms and feature sets

We can use a simple loop to check the performance of different algorithms. So far the following algorithms are implemented:

In [44]:
algorithms = ['random_forest', 'elasticnet', 'svm']

These feature sets are available (note that if you use `embeddings` you need to provide a gensim embedding model as well).

In [45]:
feature_sets = ['embeddings', 'char_ngrams', 'word_ngrams']

In [None]:
for algorithm in algorithms:
    for feature_set in feature_sets:
        print(f'Fitting {algorithm} with {feature_set}')
        
        clf = TextClassifier(
            dataset=dataset, 
            algorithm=algorithm, 
            feature_set=feature_set, 
            max_n_features=10000, 
            embedding_model_name='glove-twitter-100'
        )

        CV = sklearn.model_selection.RandomizedSearchCV(
            clf.pipeline,
            param_distributions=clf.params,
            n_iter=10, 
            cv=3, 
            n_jobs=8,
            scoring='accuracy'  # the former `iid` argument was removed in scikit-learn 0.24
        )
        X = dataset.get_texts('train')
        y = dataset.get_labels('train')
        CV = CV.fit(X, y)
        print(CV.best_score_)
        
        y_valid = dataset.get_labels('test')
        X_valid = dataset.get_texts('test')
        y_pred = CV.predict(X_valid)
        score = round(sklearn.metrics.accuracy_score(y_true=y_valid, y_pred=y_pred), 3)
        print(f'Best score for {algorithm} with {feature_set} on test set: {score}')
        
        best_t_params = CV.best_estimator_.get_params()

In this case, SVM with character n-grams performed best.
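
The loop above only prints the scores. One possible way to compare combinations afterwards is to store each score in a dictionary keyed by `(algorithm, feature_set)` and pick the maximum. The sketch below shows the idea; the keys and scores are placeholders and not results from this tutorial, and `best_combination` is a hypothetical helper.

```python
def best_combination(results):
    """Return the (algorithm, feature_set) key with the highest score."""
    return max(results, key=results.get)


# Placeholder scores for illustration only; these are NOT results from this tutorial.
toy_results = {
    ("algo_a", "features_x"): 0.50,
    ("algo_a", "features_y"): 0.75,
    ("algo_b", "features_x"): 0.25,
}
print(best_combination(toy_results))
```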