In [27]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# SMaPP Text Classification Pipeline


## About

This document provides a quick intro to the basic functionality of the pipeline.

Goals: 
- Make training of supervised models for text classification easier for lab Members
- Abstracted enough to take away tedious and repetitive tasks
- But light enough to be modifiable and useful for specific use-cases

What does it provide:
- Quickly load data from common SMaPP formats
- Easily build a pipeline that selects best algorithm, tuning parameters and featureset from common choices with reasonable defaults


## Installation

The package can be installed directly off of GitHub using `pip`:

In [None]:
#!pip install git+https://github.com/smappnyu/smapp_text_classifier.git

The two main classes contained in the package are `DataSet` and `TextClassifier`. Let's import them:

In [1]:
from smapp_text_classifier.data import DataSet
from smapp_text_classifier.models import TextClassifier
from smapp_text_classifier import plot_learning_curve

ImportError: cannot import name 'plot_learning_curve'

We need to import some additional packages

In [3]:
import sys
import logging
import json
import sklearn
import nltk

import numpy as np
import pandas as pd

from pprint import pprint

All logging is implemented using the standard python logging module. If you want less messages set the logging level to `logging.ERROR`

In [35]:
logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', 
                    level=logging.INFO)
logging.getLogger("gensim").setLevel(logging.ERROR)
np.random.seed(989898)

## Tutorial Setup

The goal of this exercise is to train a supervised model that learns the function mapping a set of labels to a set of text documents. We start out with our labeled data in `.csv` and `.json` format. Here's what our directory looks like:

In [36]:
!ls

clinton_2016.csv    clinton_2016.json   [1m[36membedding_models[m[m    pipeline_demo.ipynb


Let's take a look at the data:

In [37]:
df_clinton = pd.read_csv('clinton_2016.csv')
df_clinton.head()

Unnamed: 0,label,tweet_id,user_id,text
0,Neutral,773692075699306496,725302089048453124,RT @CNN: Singer Stevie Nicks is backing Hillar...
1,Negative,786581360672735232,753594430330900481,RT @Italians4Trump: Hillary Supporters Attack ...
2,Positive,775873669725843456,1452015206,RT @HillaryClinton: How pay-to-play works:\n\n...
3,Positive,757926635404300292,550488178,one thing i know for sure is that Leslie Knope...
4,Positive,742758704165093376,2910845500,RT @HillaryClinton: Trump's rhetoric is shamef...


In [38]:
with open('clinton_2016.json') as infile:
    pprint(json.loads(next(infile)), depth=1)

{'_id': {...},
 'contributors': None,
 'coordinates': None,
 'created_at': 'Thu Sep 08 01:19:00 +0000 2016',
 'entities': {...},
 'extended_entities': {...},
 'favorite_count': 0,
 'favorited': False,
 'filter_level': 'low',
 'geo': None,
 'id': 773692075699306496,
 'id_str': '773692075699306496',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'is_quote_status': False,
 'lang': 'en',
 'place': None,
 'possibly_sensitive': False,
 'random_number': 0.3559609594276685,
 'retweet_count': 0,
 'retweeted': False,
 'retweeted_status': {...},
 'source': '<a href="https://roundteam.co" rel="nofollow">RoundTeam</a>',
 'stance': 'Neutral',
 'text': 'RT @CNN: Singer Stevie Nicks is backing Hillary Clinton, predicting '
         'a "landslide" in November https://t.co/JE4KdZjzci '
         'https://t.co/TZkHCD69…',
 'timestamp': {...},
 'timestamp_ms': '1473297540007',
 'truncated

## Importing and Standardizing the Data

Data can come as json or in tabular form. Only requirement is one column/field containing text and one containing a label. We can specify a tokenizer that is used for bag-of-words features (and to determine word boundaries for bag-of-character features). The tokenizer can be any function that maps a string to a list of tokens (e.g. `'This is a sentence' -> ['this', 'is', 'a', 'sentence']`). Here we use a tokenizer that was specifically developed for tweets. Here you could also add lemmatizatio or other desired transformations of the text.

In [39]:
tokenizer = nltk.TweetTokenizer()

The `DataSet` class allows the classification pipeline that we will use later to access all relevant information about the dataset. It is a light wrapper around a pandas dataframe that implements a few basic functions. The class can be instantiated with data from different formats: Files (tabular format, json format) or `pandas.DataFrame` objects.



Importing a json:

In [40]:
dataset = DataSet(input_='clinton_2016.json',
                  name='clinton',
                  field_mapping={'label': 'stance', 'text': 'text'},
                  tokenizer=tokenizer.tokenize)

The init method of `DataSet` does the following:
- Transform to a dataframe:

In [41]:
dataset.df.head()

Unnamed: 0,label,text
0,Neutral,RT @CNN: Singer Stevie Nicks is backing Hillar...
1,Positive,RT @EricJafMN: @realDonaldTrump Why did you ca...
2,Positive,RT @HillaryClinton: How pay-to-play works:\n\n...
3,Negative,Sounds to me like Hillary is describing hersel...
4,Positive,RT @peaceisactive: LeBron James: Why I'm Endor...


- Split into training and test set:

In [42]:
print(f'Train rows: {dataset.train_idxs[:10]}')

Train rows: [84, 83, 31, 4, 69, 18, 44, 42, 54, 0]


In [43]:
print(f'Test rows: {dataset.test_idxs[:10]}')

Test rows: [39, 63, 22, 2, 79, 16, 19, 50, 88, 33]


In [16]:
dataset.df_test.head()

Unnamed: 0,label,text
26,Negative,RT @RichardGrenell: The Clinton camp says Hill...
86,Negative,RT @RealJamesWoods: Justice in America today.....
2,Positive,RT @HillaryClinton: How pay-to-play works:\n\n...
55,Positive,"RT @mayaharris_: ""I will not let anyone turn b..."
75,Negative,RT @Stonewoodforge: Here Are Hillary Clinton's...


In [17]:
dataset.get_labels('train')[:5]

48    Negative
6     Negative
99    Positive
82     Neutral
76    Negative
Name: label, dtype: object

Passing a dataframe

In [18]:
dataset = DataSet(input_=df_clinton, name='clinton',
                  field_mapping={'label': 'label', 'text': 'text'},
                  tokenizer=tokenizer.tokenize)

Passing data that is already split into train/test. Note that the dataframes could also be files.

In [19]:
df_train = df_clinton.iloc[:700]
df_test = df_clinton.iloc[701:]
dataset = DataSet(train_input=df_train, test_input=df_test, name='clinton',
                  field_mapping={'label': 'label', 'text': 'text'},
                  tokenizer=tokenizer.tokenize)

Importing a csv:

In [20]:
dataset = DataSet(input_='clinton_2016.csv', 
                  name='clinton', 
                  field_mapping={'label': 'label', 'text': 'text'},
                  tokenizer=tokenizer.tokenize
                 )

## Creating a text pipeline
If we want to use embeddings we need to download (or train our own) an embedding model. Here we download a large embedding model trained by Facebook on the common crawl web archive (watch out it's 6GB so the download might take a while):

In [None]:
# !mkdir -p embedding_models/
# !wget -P embedding_models/ https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
# !gunzip embedding_models/cc.en.300.bin.gz

Now we can initialize the classification pipeline. The first time it pre-computes the desired features to allow quick and repeated testing without loading a large number of feature matrices into memory

In [31]:
#this will take 90s when run for the first time
clf = TextClassifier(dataset=dataset, 
                     algorithm='svm', 
                     feature_set='embeddings',
                     embedding_model=('fasttext', 'embedding_models/cc.en.300.bin'))

2019-04-01 09:28:47,659 - root - INFO - Using precomputed features for mean-pooled embeddings
2019-04-01 09:28:47,660 - root - INFO - Using precomputed features for max-pooled embeddings


In [29]:
!ls feature_cache/

clinton_fasttext_max_embedded.p  clinton_word_2.npz
clinton_fasttext_mean_embedded.p clinton_word_3.npz
clinton_word_1.npz


In [46]:
clf = TextClassifier(dataset=dataset, algorithm='elasticnet', feature_set='char_ngrams',
                    recompute_features=True)

2019-04-01 09:51:56,359 - root - INFO - Precomputing char_ngrams (3)..
2019-04-01 09:51:56,391 - root - INFO - Precomputing char_ngrams (4)..
2019-04-01 09:51:56,421 - root - INFO - Precomputing char_ngrams (5)..


In [45]:
!ls feature_cache/

clinton_word_1.npz       clinton_word_2.npz       clinton_word_3.npz
clinton_word_1_vocab.pkl clinton_word_2_vocab.pkl clinton_word_3_vocab.pkl


If precomputed features exist the pipeline re-uses them

In [None]:
clf = TextClassifier(dataset=dataset, algorithm='elasticnet', feature_set='embeddings', 
                     max_n_features=10000, recompute_features=False, ngram_range=(1, 3),
                     embedding_model=('fasttext', 'embedding_models/cc.en.300.bin'))

In [None]:
clf.pipeline.named_steps

In [49]:
clf.params

{'vectorize__ngram_range': [(3, 3), (3, 4), (3, 5)],
 'clf__alpha': <scipy.stats._distn_infrastructure.rv_frozen at 0x1c18282630>,
 'clf__l1_ratio': <scipy.stats._distn_infrastructure.rv_frozen at 0x1c18282b38>}

{'vectorize': PrecomputeVectorizer(cache_dir='feature_cache/',
            dataset=<smapp_text_classifier.data.DataSet object at 0x1c2809be80>,
            embedding_model_name=None, feature_set='char_ngrams',
            ngram_range=None, pooling_method=None),
 'reduce': Chi2Reducer(max_n_features=20000),
 'clf': SGDClassifier(alpha=0.0001, average=False, class_weight=None,
        early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
        l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=1000,
        n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='elasticnet',
        power_t=0.5, random_state=None, shuffle=True, tol=0.001,
        validation_fraction=0.1, verbose=0, warm_start=False)}

Main functionality is the building of the pipeline:

And reasonable default parameters for randomized cross validation:

This pipeline can be tuned using standard scikit-learn functionality:

In [None]:
CV = sklearn.model_selection.RandomizedSearchCV(
    clf.pipeline, 
    param_distributions=clf.params,
    n_iter=20, 
    cv=5, 
    n_jobs=8, 
    scoring='accuracy', 
    iid=True, 
    return_train_score=False,
    random_state=12333
)

In [None]:
X = dataset.get_texts('train')
y = dataset.get_labels('train')
CV = CV.fit(X, y)

In [None]:
CV.best_score_

In [None]:
y_valid = dataset.get_labels('test')
X_valid = dataset.get_texts('test')
y_pred = CV.predict(X_valid)
score = round(sklearn.metrics.accuracy_score(y_true=y_valid, y_pred=y_pred), 3)
print(score)

In [None]:
y_valid = dataset.get_labels('test')
X_valid = dataset.get_texts('test')

In [None]:
best_tuning_params = CV.best_estimator_.get_params()
print(f'Best n_gram range: {best_tuning_params["vectorize__ngram_range"]}')
print(f'Best l1_ratio (elastic net): {best_tuning_params["clf__l1_ratio"]:.2f}')

In [None]:
pd.DataFrame(CV.cv_results_)

## Cross validating accross multiple Algorithms and Feature sets

We can use a simple loop to check the performance of different algorithms. So far the following four are implemented.

In [None]:
algorithms = ['random_forest', 'elasticnet', 'svm']

These feature sets are available (note that if you use `embeddings` you need to provide a gensim embedding model as well).

In [None]:
feature_sets = ['embeddings', 'char_ngrams', 'word_ngrams']

In [None]:

for algorithm in algorithms:
    for feature_set in feature_sets:
        print(f'Fitting {algorithm} with {feature_set}')
        
        clf = TextClassifier(
            dataset=dataset, 
            algorithm=algorithm, 
            feature_set=feature_set, 
            max_n_features=1e4, 
            embedding_model=('fasttext', 'embedding_models/cc.en.300.bin')
        )

        CV = sklearn.model_selection.RandomizedSearchCV(
            clf.pipeline,
            param_distributions=clf.params,
            n_iter=10, 
            cv=3, 
            n_jobs=8,
            scoring='accuracy', 
            iid=False
        )
        X = dataset.get_texts('train')
        y = dataset.get_labels('train')
        CV = CV.fit(X, y)
        print(CV.best_score_)
        
        y_valid = dataset.get_labels('test')
        X_valid = dataset.get_texts('test')
        y_pred = CV.predict(X_valid)
        score = round(sklearn.metrics.accuracy_score(y_true=y_valid, y_pred=y_pred), 3)
        print(f'Best score for {algorithm} with {feature_set} on test set: {score}')
        
        best_t_params = CV.best_estimator_.get_params()
        logging.info(f'Best feature_set: '
                     f'\n\tngram_range: {best_t_params["vectorize__ngram_range"]}'
                     f'\n\tpooling_method: {best_t_params["vectorize__pooling_method"]}')