In [1]:
%load_ext autoreload
%autoreload 2

# SMaPP Text Classification Pipeline

This document provides a quick intro to the basic functionality of the pipeline.

Goals: 
- Make training of supervised models for text classification easier for lab Members
- Abstracted enough to take away tedious and repetitive tasks
- But light enough to be modifiable and useful for specific use-cases

What does it provide:
- Quickly load data from common SMaPP formats
- Easily build a pipeline that selects best algorithm, tuning parameters and featureset from common choices with reasonable defaults

In [None]:
!pip install git+https://github.com/smappnyu/smapp_text_classifier.git

In [14]:
import sys
import logging
import json
import pandas as pd

from pprint import pprint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score
import nltk

In [3]:
from smapp_text_classifier.data import DataSet
from smapp_text_classifier.models import TextClassifier

In [4]:
logging.basicConfig(format='%(asctime)s - %(message)s', level=logging.INFO)
logging.getLogger("gensim").setLevel(logging.ERROR)

## Starting point

In [5]:
!ls

clinton_2016.csv    [1m[36membedding_models[m[m    pipeline_demo.ipynb
clinton_2016.json   [1m[36mfeature_cache[m[m


In [6]:
df_clinton = pd.read_csv('clinton_2016.csv')
df_clinton.head()

Unnamed: 0,label,tweet_id,user_id,text
0,Neutral,773692075699306496,725302089048453124,RT @CNN: Singer Stevie Nicks is backing Hillar...
1,Negative,786581360672735232,753594430330900481,RT @Italians4Trump: Hillary Supporters Attack ...
2,Positive,775873669725843456,1452015206,RT @HillaryClinton: How pay-to-play works:\n\n...
3,Positive,757926635404300292,550488178,one thing i know for sure is that Leslie Knope...
4,Positive,742758704165093376,2910845500,RT @HillaryClinton: Trump's rhetoric is shamef...


In [7]:
with open('clinton_2016.json') as infile:
    pprint(json.loads(next(infile)), depth=1)

{'_id': {...},
 'contributors': None,
 'coordinates': None,
 'created_at': 'Thu Sep 08 01:19:00 +0000 2016',
 'entities': {...},
 'extended_entities': {...},
 'favorite_count': 0,
 'favorited': False,
 'filter_level': 'low',
 'geo': None,
 'id': 773692075699306496,
 'id_str': '773692075699306496',
 'in_reply_to_screen_name': None,
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'is_quote_status': False,
 'lang': 'en',
 'place': None,
 'possibly_sensitive': False,
 'random_number': 0.3559609594276685,
 'retweet_count': 0,
 'retweeted': False,
 'retweeted_status': {...},
 'source': '<a href="https://roundteam.co" rel="nofollow">RoundTeam</a>',
 'stance': 'Neutral',
 'text': 'RT @CNN: Singer Stevie Nicks is backing Hillary Clinton, predicting '
         'a "landslide" in November https://t.co/JE4KdZjzci '
         'https://t.co/TZkHCD69…',
 'timestamp': {...},
 'timestamp_ms': '1473297540007',
 'truncated

## Importing and Standardizing the Data

Data can come as json or in tabular form. Only requirement is one column/field containing text and one containing a label. We can specify a tokenizer that is used for bag-of-words features (and to determine word boundaries for bag-of-character features). The tokenizer can be any function that maps a string to a list of tokens (e.g. `'This is a sentence' -> ['this', 'is', 'a', 'sentence']`). Here we use a tokenizer that was specifically developed for tweets. Here you could also add lemmatizatio or other desired transformations of the text.

In [8]:
tokenizer = nltk.TweetTokenizer()

The `DataSet` class allows the classification pipeline that we will use later to access all relevant information about the dataset. It is a light wrapper around a pandas dataframe that implements a few basic functions. The class can be instantiated with data from different formats: Files (tabular format, json format) or `pandas.DataFrame` objects.

Importing a csv:

In [9]:
dataset = DataSet(input_='clinton_2016.csv', name='clinton', 
                  field_mapping={'label': 'label', 'text': 'text'}, # change this
                  tokenizer=tokenizer.tokenize)

Importing a json:

In [10]:
dataset = DataSet(input_='clinton_2016.json', name='clinton', 
                  field_mapping={'label': 'stance', 'text': 'text'},
                  tokenizer=tokenizer.tokenize)

Passing a dataframe

In [11]:
dataset = DataSet(input_=df_clinton, name='clinton',
                  field_mapping={'label': 'label', 'text': 'text'},
                  tokenizer=tokenizer.tokenize)

Passing data that is already split into train/test. Note that the dataframes could also be files.

In [16]:
df_train = df_clinton.iloc[:700]
df_test = df_clinton.iloc[701:]
dataset = DataSet(train_input=df_train, test_input=df_test, name='clinton',
                  field_mapping={'label': 'label', 'text': 'text'},
                  tokenizer=tokenizer.tokenize)

The init method of `DataSet` does the following:
- Transform to a dataframe:

In [17]:
dataset.df.head()

Unnamed: 0,label,text
0,Neutral,RT @CNN: Singer Stevie Nicks is backing Hillar...
1,Negative,RT @Italians4Trump: Hillary Supporters Attack ...
2,Positive,RT @HillaryClinton: How pay-to-play works:\n\n...
3,Positive,one thing i know for sure is that Leslie Knope...
4,Positive,RT @HillaryClinton: Trump's rhetoric is shamef...


- Split into training and test set:

In [18]:
print(f'Train rows: {dataset.train_idxs[:10]}')

Train rows: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [19]:
print(f'Test rows: {dataset.test_idxs[:10]}')

Test rows: [700, 701, 702, 703, 704, 705, 706, 707, 708, 709]


In [20]:
dataset.df_test.head()

Unnamed: 0,label,text
700,Positive,RT @DorothyKidd1: Hillary Clinton draws surpri...
701,Negative,"Hillary Clinton sucks, but not like Monica Lew..."
702,Neutral,RT @TIME: State Dept. reopens Hillary Clinton ...
703,Neutral,"Donald Trump, Hillary Clinton win Washington p..."
704,Neutral,RT @MacFarlaneNews: #Breaking: Parents of two...


In [21]:
dataset.get_labels('train')[:5]

0     Neutral
1    Negative
2    Positive
3    Positive
4    Positive
Name: label, dtype: object

## Creating a text pipeline

In [22]:
clf = TextClassifier(dataset=dataset, algorithm='svm', 
                     feature_set='char_ngrams',
                     max_n_features=10000, recompute_features=True,
                     ngram_range = (1, 20),
                     embedding_model=('twitter', 
                                      'embedding_models/twitter/text_sample_2013to2016_gensim_200.model'))

TypeError: __init__() got an unexpected keyword argument 'ngram_range'

In [None]:
!ls

In [None]:
!ls feature_cache/

In [None]:
clf = TextClassifier(dataset=dataset, algorithm='elasticnet', feature_set='word_ngrams',
                     max_n_features=20000)

In [None]:
!ls feature_cache/

If precomputed features exist the pipeline re-uses them

In [None]:
clf = TextClassifier(dataset=dataset, algorithm='elasticnet', feature_set='char_ngrams',
                      max_n_features=20000)

But we can also force recomputation 

In [None]:
clf = TextClassifier(dataset=dataset, algorithm='elasticnet', feature_set='char_ngrams',
                      max_n_features=20000, recompute_features=True)
# Method to see the selected features

Main functionality is the building of the pipeline:

In [None]:
clf.pipeline.named_steps

And reasonable default parameters for Randomized Cross Validation

This pipeline can be tuned using standard scikit-learn functionality:

In [None]:
CV = RandomizedSearchCV(clf.pipeline, param_distributions=clf.params,
                        n_iter=5, cv=3, n_jobs=8, scoring='accuracy', 
                        iid=True, return_train_score=True)

In [None]:
X = dataset.get_texts('train')
y = dataset.get_labels('train')
CV = CV.fit(X, y)

In [None]:
CV.best_score_

In [None]:
best_tuning_params = CV.best_estimator_.get_params()
print(f'Best n_gram range: {best_tuning_params["vectorize__ngram_range"]}')
print(f'Best l1_ratio (elastic net): {best_tuning_params["clf__l1_ratio"]:.2f}')

In [None]:
pd.DataFrame(CV.cv_results_)

## Cross validating accross multiple Algorithms and Feature sets

We can use a simple loop to check the performance of different algorithms. So far the following four are implemented.

In [None]:
algorithms = ['random_forest', 'elasticnet']

These feature sets are available (note that if you use `embeddings` you need to provide a gensim embedding model as well).

In [None]:
feature_sets = ['embeddings', 'char_ngrams', 'word_ngrams']

In [None]:
for algorithm in algorithms:
    for feature_set in feature_sets:
        logging.info(f'Fitting {algorithm} with {feature_set}')
        
        clf = TextClassifier(
            dataset=dataset, algorithm=algorithm, feature_set=feature_set, 
            max_n_features=1e4, 
            embedding_model=('twitter', 
                             'embedding_models/twitter/text_sample_2013to2016_gensim_200.model')
        )
        
        logging.info(f'Cross validating {algorithm} on {feature_set}')
        CV = RandomizedSearchCV(clf.pipeline,
                        param_distributions=clf.params,
                        n_iter=10, cv=3, n_jobs=8,
                        scoring='f1', iid=True)
        X = dataset.get_texts('train')
        y = dataset.get_labels('train')
        CV = CV.fit(X, y)
        
        y_valid = dataset.get_labels('test')
        X_valid = dataset.get_texts('test')
        y_pred = CV.predict(X_valid)
        score = round(accuracy_score(y_true=y_valid, y_pred=y_pred), 3)
        logging.info(f'Best score for {algorithm} with {feature_set} on test set: {score}')
        
        best_t_params = CV.best_estimator_.get_params()
        logging.info(f'Best feature_set: '
                     f'\n\tngram_range: {best_t_params["vectorize__ngram_range"]}'
                     f'\n\tpooling_method: {best_t_params["vectorize__pooling_method"]}')