In [None]:
%load_ext autoreload
%autoreload 2

# SMaPP Text Classification Pipeline


## About

This document provides a quick intro to the basic functionality of the pipeline.

Goals:
- Make training of supervised models for text classification easier for lab members
- Abstract enough to take away tedious and repetitive tasks
- But light enough to be modifiable and useful for specific use cases

What it provides:
- Quick loading of data from common SMaPP formats
- An easy way to build a pipeline that selects the best algorithm, tuning parameters, and feature set from common choices with reasonable defaults


## Installation

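The package can be installed directly from GitHub using `pip` (uncomment the `pip` line below). The `sys.path` cell is only needed when working from a local checkout of the repository: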

In [None]:
import sys
sys.path.append('/Users/fridolinlinder/projects/smapp_text_classifier/')

In [None]:
#!pip install git+https://github.com/smappnyu/smapp_text_classifier.git

The two main classes contained in the package are `DataSet` and `TextClassifier`. Let's import them:

In [None]:
from smapp_text_classifier.data import DataSet
from smapp_text_classifier.models import TextClassifier
from smapp_text_classifier.plot import plot_learning_curve

We also need to import some additional packages:

In [None]:
import sys
import logging
import json
import sklearn
import nltk

import numpy as np
import pandas as pd

from pprint import pprint

All logging is implemented using the standard Python `logging` module. If you want fewer messages, set the logging level to `logging.ERROR`:

In [None]:
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', 
                    level=logging.INFO)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
logging.getLogger("gensim").setLevel(logging.ERROR)
np.random.seed(989898)
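
As a minimal sketch of the quieter configuration mentioned above, the root logger can be set to `logging.ERROR` so that only errors are shown. The variable name `root_logger` is chosen here for illustration only:

```python
import logging

# Configure the same message format as above, then raise the threshold
# of the root logger so that only errors and worse are emitted.
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
root_logger = logging.getLogger()
root_logger.setLevel(logging.ERROR)

root_logger.info('This message is suppressed')
root_logger.error('This message is still shown')
```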

## Tutorial Setup

The goal of this exercise is to train a supervised model that learns the function mapping a set of text documents to a set of labels. We start out with our labeled data in `.csv` and `.json` format. Here's what our directory looks like:

In [None]:
!ls

Let's take a look at the data:

In [None]:
df_clinton = pd.read_csv('clinton_2016.csv')
df_clinton.head()

In [None]:
with open('clinton_2016.json') as infile:
    pprint(json.loads(next(infile)), depth=1)

## Importing and Standardizing the Data

Data can come as JSON or in tabular form. The only requirement is one column/field containing the text and one containing the label. We can specify a tokenizer that is used for bag-of-words features (and to determine word boundaries for bag-of-character features). The tokenizer can be any function that maps a string to a list of tokens (e.g. `'This is a sentence' -> ['this', 'is', 'a', 'sentence']`). Here we use a tokenizer that was developed specifically for tweets. This is also the place to add lemmatization or other desired transformations of the text.

In [None]:
tokenizer = nltk.TweetTokenizer()
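
Since any function from a string to a list of tokens works, a custom tokenizer can be very small. The sketch below is a hypothetical example (the name `simple_tokenizer` and the regular expression are not part of the package); it lowercases the text and keeps hashtags and mentions intact:

```python
import re


def simple_tokenizer(text):
    """Lowercase the text and split it into word tokens.

    A leading '#' or '@' is kept so hashtags and mentions stay intact.
    """
    return re.findall(r"[#@]?\w+", text.lower())


print(simple_tokenizer('This is a sentence'))
```

Such a function could presumably be passed wherever `tokenizer.tokenize` is used below.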

The `DataSet` class allows the classification pipeline that we will use later to access all relevant information about the dataset. It is a light wrapper around a pandas dataframe that implements a few basic functions. The class can be instantiated with data from different formats: Files (tabular format, json format) or `pandas.DataFrame` objects.



Importing a json:

In [None]:
dataset = DataSet(input_='clinton_2016.json',
                  name='clinton',
                  field_mapping={'label': 'stance', 'text': 'text'})
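
To illustrate what `field_mapping` expresses, here is a rough sketch using plain `pandas`. It assumes one JSON object per line (as in the file above) and is not the package's actual implementation; the example records are made up:

```python
import json

import pandas as pd

# Two made-up records in JSON-lines form (one JSON object per line).
lines = [
    '{"text": "first tweet", "stance": "support", "id": 1}',
    '{"text": "second tweet", "stance": "oppose", "id": 2}',
]
field_mapping = {'label': 'stance', 'text': 'text'}

raw = pd.DataFrame([json.loads(line) for line in lines])

# field_mapping maps standard names to source names, so invert it for renaming.
rename = {source: target for target, source in field_mapping.items()}
df = raw.rename(columns=rename)[['text', 'label']]
print(df)
```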

The init method of `DataSet` does the following:
- Transform to a dataframe:

In [None]:
dataset.df.head()

- Split into training and test set:

In [None]:
print(f'Train rows: {dataset.train_idxs[:10]}')

In [None]:
print(f'Test rows: {dataset.test_idxs[:10]}')

In [None]:
dataset.df_test.head()

In [None]:
dataset.get_labels('train')[:5]
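
Conceptually, the split amounts to partitioning the row indices of the dataframe. The following sketch shows one way to do this with `numpy`; the 80/20 proportion and the variable names are assumptions for illustration, not necessarily what `DataSet` does internally:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'text': [f'document {i}' for i in range(10)],
    'label': [i % 2 for i in range(10)],
})

# Shuffle the row positions, then cut them into a test and a training part.
rng = np.random.default_rng(989898)
idxs = rng.permutation(len(df))
n_test = int(0.2 * len(df))  # assumed test share of 20%
test_idxs, train_idxs = idxs[:n_test], idxs[n_test:]

df_train = df.iloc[train_idxs]
df_test = df.iloc[test_idxs]
print(len(df_train), len(df_test))
```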

Passing a dataframe:

In [None]:
dataset = DataSet(input_=df_clinton, name='clinton',
                  field_mapping={'label': 'label', 'text': 'text'})

Passing data that is already split into training and test sets (the inputs could also be files instead of dataframes):

In [None]:
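# First 700 rows for training, all remaining rows for testing
df_train = df_clinton.iloc[:700]
df_test = df_clinton.iloc[700:]  # start at 700 so that no row is skipped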
dataset = DataSet(train_input=df_train, test_input=df_test, name='clinton',
                  field_mapping={'label': 'label', 'text': 'text'})

Importing a csv:

In [None]:
dataset = DataSet(input_='clinton_2016.csv', 
                  name='clinton', 
                  field_mapping={'label': 'label', 'text': 'text'}
                 )

## Creating a text pipeline

### Bag-of-words features

Now we can initialize the classification pipeline. To allow quick, repeated testing without re-vectorizing the text each time, the pipeline computes the document-term matrix once and caches it to file. Documents are then vectorized by loading the corresponding rows of this matrix.

When the pipeline is first instantiated, the feature matrices are pre-computed:

In [None]:
clf = TextClassifier(dataset=dataset, 
                     algorithm='svm', 
                     feature_set='word_ngrams',
                     ngram_range=(1, 3),
                     cache_dir='feature_cache',
                     tokenize=tokenizer.tokenize)

In this case we computed three matrices for uni-, bi-, and tri-grams:

In [None]:
!ls feature_cache/
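
The caching idea can be sketched with scikit-learn and SciPy as follows. This is only an illustration of the technique under assumed names (`cache_path`, `train_rows`), not the package's own caching code:

```python
import os
import tempfile

import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

texts = ['stronger together', 'lock her up', 'together we win', 'never her']
cache_path = os.path.join(tempfile.mkdtemp(), 'unigrams.npz')

# Compute the document-term matrix only if no cached copy exists yet.
if not os.path.exists(cache_path):
    dtm = CountVectorizer(ngram_range=(1, 1)).fit_transform(texts)
    sp.save_npz(cache_path, dtm)

# Later: "vectorize" documents by loading the matrix and selecting their rows.
dtm_cached = sp.load_npz(cache_path)
train_rows = [0, 2]
X_train = dtm_cached[train_rows]
print(X_train.shape)
```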

If precomputed features exist, the pipeline re-uses them:

In [None]:
clf = TextClassifier(dataset=dataset, 
                     algorithm='svm', 
                     feature_set='word_ngrams',
                     ngram_range=(1, 3),
                     cache_dir='feature_cache',
                     tokenize=tokenizer.tokenize)

To re-compute the features, set the `recompute_features` argument to `True`:

In [None]:
clf = TextClassifier(dataset=dataset, 
                     algorithm='svm', 
                     feature_set='word_ngrams',
                     ngram_range=(1, 3),
                     cache_dir='feature_cache',
                     tokenize=tokenizer.tokenize,
                     recompute_features=True)

### Embedding features

To use basic word embedding features, all pre-trained gensim models are available and can be accessed by name (see https://github.com/RaRe-Technologies/gensim-data for available models). When a model is used for the first time, it is downloaded from the gensim server and stored locally in the gensim data directory (usually in the home directory).

In [None]:
clf = TextClassifier(dataset=dataset, 
                     algorithm='svm', 
                     feature_set='embeddings', 
                     embedding_model_name='glove-twitter-100',
                     tokenize=tokenizer.tokenize)

The pipeline pre-computes two document-feature matrices: one where the word vectors in a document are averaged to obtain a document vector, and one where the maximum of each dimension is used. Later we can cross-validate over these matrices.
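
The two pooling strategies can be illustrated with a toy example. The three-dimensional vectors below are made up, and skipping out-of-vocabulary tokens is an assumption of this sketch:

```python
import numpy as np

# Made-up word vectors standing in for a pre-trained embedding model.
word_vectors = {
    'good': np.array([1.0, 0.0, 2.0]),
    'morning': np.array([3.0, -1.0, 0.0]),
}
tokens = ['good', 'morning', 'everyone']  # 'everyone' is out of vocabulary

# Stack the vectors of all known tokens into a (n_tokens, n_dims) array.
vectors = np.array([word_vectors[t] for t in tokens if t in word_vectors])

doc_mean = vectors.mean(axis=0)  # average over tokens, per dimension
doc_max = vectors.max(axis=0)    # maximum over tokens, per dimension
print(doc_mean, doc_max)
```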

We can also use character n-gram features:

In [None]:
clf = TextClassifier(dataset=dataset, 
                     algorithm='elasticnet', 
                     feature_set='char_ngrams', 
                     ngram_range=(3, 5),
                     tokenize=tokenizer.tokenize)
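
To make the role of word boundaries concrete, here is a small pure-Python sketch of character n-grams extracted within token boundaries. Padding each token with a space follows the convention of scikit-learn's `char_wb` analyzer; whether the package does exactly this is an assumption:

```python
def char_ngrams(tokens, n_min=3, n_max=5):
    """Return character n-grams that do not cross token boundaries."""
    grams = []
    for token in tokens:
        padded = f' {token} '  # mark the word boundaries with spaces
        for n in range(n_min, n_max + 1):
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


print(char_ngrams(['vote'], n_min=3, n_max=3))
```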

## Using the pipeline

The main functionality of `TextClassifier` is to build the pipeline (`clf.pipeline`) and to provide reasonable default parameter distributions (`clf.params`) for randomized cross-validation.

The pipeline can then be tuned using standard scikit-learn functionality:

In [None]:
CV = sklearn.model_selection.RandomizedSearchCV(
    clf.pipeline, 
    param_distributions=clf.params,
    n_iter=20, 
    cv=5, 
    n_jobs=4, 
    scoring='accuracy', 
    # the `iid` argument was removed in scikit-learn 0.24 and is omitted here
    return_train_score=False,
    random_state=12333
)

In [None]:
X = dataset.get_texts('train')
y = dataset.get_labels('train')
CV = CV.fit(X, y)

In [None]:
CV.best_score_

In [None]:
y_valid = dataset.get_labels('test')
X_valid = dataset.get_texts('test')
y_pred = CV.predict(X_valid)
score = round(sklearn.metrics.accuracy_score(y_true=y_valid, y_pred=y_pred), 3)
print(score)

In [None]:
best_tuning_params = CV.best_estimator_.get_params()
print(f'Best n_gram range: {best_tuning_params["vectorize__ngram_range"]}')
print(f'Best l1_ratio (elastic net): {best_tuning_params["clf__l1_ratio"]:.2f}')
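
The parameter names used above follow scikit-learn's `<step>__<parameter>` convention for pipelines. The self-contained sketch below reproduces this on made-up data with a hand-built pipeline; the step names `vectorize` and `clf` mirror the keys above, while the concrete estimators are assumptions for illustration:

```python
from scipy.stats import uniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

texts = ['great candidate', 'love her plan', 'strong debate', 'good speech',
         'terrible candidate', 'hate her plan', 'weak debate', 'bad speech']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ('vectorize', TfidfVectorizer()),
    ('clf', SGDClassifier(penalty='elasticnet', random_state=0)),
])

# Keys address a pipeline step and one of its parameters: <step>__<parameter>.
params = {
    'vectorize__ngram_range': [(1, 1), (1, 2)],
    'clf__l1_ratio': uniform(0, 1),  # sampled uniformly from [0, 1]
}

search = RandomizedSearchCV(pipeline, param_distributions=params, n_iter=4,
                            cv=2, scoring='accuracy', random_state=0)
search.fit(texts, labels)

best = search.best_estimator_.get_params()
print(best['vectorize__ngram_range'], round(best['clf__l1_ratio'], 2))
```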

In [None]:
pd.DataFrame(CV.cv_results_)

## Cross-validating across multiple algorithms and feature sets

We can use a simple loop to check the performance of different algorithms. So far the following are implemented:

In [None]:
algorithms = ['random_forest', 'elasticnet', 'svm']

These feature sets are available (note that if you use `embeddings`, you also need to provide the name of a gensim embedding model):

In [None]:
feature_sets = ['embeddings', 'char_ngrams', 'word_ngrams']

In [None]:
for algorithm in algorithms:
    for feature_set in feature_sets:
        print(f'Fitting {algorithm} with {feature_set}')
        
        clf = TextClassifier(
            dataset=dataset, 
            algorithm=algorithm, 
            feature_set=feature_set, 
            max_n_features=10000, 
            embedding_model_name='glove-twitter-100'
        )

        CV = sklearn.model_selection.RandomizedSearchCV(
            clf.pipeline,
            param_distributions=clf.params,
            n_iter=10, 
            cv=3, 
            n_jobs=8,
            scoring='accuracy'
        )
        X = dataset.get_texts('train')
        y = dataset.get_labels('train')
        CV = CV.fit(X, y)
        print(CV.best_score_)
        
        y_valid = dataset.get_labels('test')
        X_valid = dataset.get_texts('test')
        y_pred = CV.predict(X_valid)
        score = round(sklearn.metrics.accuracy_score(y_true=y_valid, y_pred=y_pred), 3)
        print(f'Best score for {algorithm} with {feature_set} on test set: {score}')
        
        best_t_params = CV.best_estimator_.get_params()

In this case, the SVM with character n-grams performed best.