## sklearn and TextAttack

The following code trains two text classification models using sklearn. Both are logistic regression models; the only difference is in the features.

We will load data using `datasets`, train the models, and attack them using TextAttack.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/QData/TextAttack/blob/master/docs/2notebook/Example_1_sklearn.ipynb)

[![View Source on GitHub](https://img.shields.io/badge/github-view%20source-black.svg)](https://github.com/QData/TextAttack/blob/master/docs/2notebook/Example_1_sklearn.ipynb)

### Setup

Ensure that TextAttack and the resources it needs are installed and working in the notebook.

In [6]:
!git clone https://github.com/QData/TextAttack

fatal: destination path 'TextAttack' already exists and is not an empty directory.


In [7]:
cd TextAttack

/content/TextAttack


In [8]:
!git checkout api-rework

Already on 'api-rework'
Your branch is up to date with 'origin/api-rework'.


In [9]:
!pip install ./

Processing /content/TextAttack
Building wheels for collected packages: textattack
  Building wheel for textattack (setup.py) ... done
  Created wheel for textattack: filename=textattack-0.2.15-cp37-none-any.whl size=354335 sha256=2eabe4812c8c24ad6157c430e74e45925c75f1fb0650ed7bfa77156bb9b23448
  Stored in directory: /tmp/pip-ephem-wheel-cache-3gcqun5_/wheels/2f/52/bb/f9360550e2f59e4fd2dcb990574cb527768028bab66b8eb83c
Successfully built textattack
Installing collected packages: textattack
  Found existing installation: textattack 0.2.15
    Uninstalling textattack-0.2.15:
      Successfully uninstalled textattack-0.2.15
Successfully installed textattack-0.2.15


### Training

This code trains two models: one on bag-of-words statistics (`bow_unstemmed`) and one on tf–idf statistics (`tfidf_unstemmed`). The dataset is the Rotten Tomatoes movie review dataset, which the code loads under the name `rotten_tomatoes`.
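
To make the difference between the two feature types concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It is not the notebook's code: it splits on whitespace instead of using sklearn's tokenizer, skips stop-word removal and n-grams, and assumes the smoothed idf formula and L2 normalization that sklearn documents as the `TfidfVectorizer` defaults.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration (not taken from the dataset).
docs = ["good movie", "bad movie", "good good fun"]

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

def bow_vector(tokens):
    """Bag-of-words: raw term counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

n_docs = len(docs)
# Document frequency: in how many documents each word appears.
doc_freq = {w: sum(w in toks for toks in tokenized) for w in vocab}
# Smoothed inverse document frequency (assumed to match sklearn's default).
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """tf-idf: counts weighted by idf, then scaled to unit L2 norm."""
    weighted = [c * idf[w] for c, w in zip(bow_vector(tokens), vocab)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted] if norm else weighted

bow = [bow_vector(t) for t in tokenized]
tfidf = [tfidf_vector(t) for t in tokenized]

for doc, b, t in zip(docs, bow, tfidf):
    print(doc, b, [round(x, 3) for x in t])
```

In the bag-of-words rows every occurrence counts equally, while in the tf-idf rows a word that appears in fewer documents (such as `bad`) receives more weight than one that appears in many (such as `movie`).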

In [1]:
import nltk # the Natural Language Toolkit
nltk.download('punkt') # The NLTK tokenizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
!pip install datasets



In [4]:
import datasets
import os
import pandas as pd
import re
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Nice to see additional metrics
from sklearn.metrics import classification_report

def load_data(dataset_split='train'):
    dataset = datasets.load_dataset('rotten_tomatoes')[dataset_split]
    # Build a DataFrame of review texts and sentiment labels
    df = pd.DataFrame()
    df['Review'] = [review['text'] for review in dataset]
    df['Sentiment'] = [review['label'] for review in dataset]
    # Replace every non-alphabetic character (digits and punctuation included) with a space
    df['Review'] = df['Review'].apply(lambda x: re.sub("[^a-zA-Z]", ' ', str(x)))
    # Tokenize and stem the reviews (adds a 'Stemmed' column)
    df_tokenized = tokenize_review(df)
    return df_tokenized

def tokenize_review(df):
    # Tokenize Reviews in training
    tokened_reviews = [word_tokenize(rev) for rev in df['Review']]
    # Create word stems
    stemmed_tokens = []
    porter = PorterStemmer()
    for i in range(len(tokened_reviews)):
        stems = [porter.stem(token) for token in tokened_reviews[i]]
        stems = ' '.join(stems)
        stemmed_tokens.append(stems)
    df.insert(1, column='Stemmed', value=stemmed_tokens)
    return df

def transform_BOW(training, testing, column_name):
    vect = CountVectorizer(max_features=100, ngram_range=(1,3), stop_words=ENGLISH_STOP_WORDS)
    vectFit = vect.fit(training[column_name])
    BOW_training = vectFit.transform(training[column_name])
    BOW_training_df = pd.DataFrame(BOW_training.toarray(), columns=vect.get_feature_names())
    BOW_testing = vectFit.transform(testing[column_name])
    BOW_testing_Df = pd.DataFrame(BOW_testing.toarray(), columns=vect.get_feature_names())
    return vectFit, BOW_training_df, BOW_testing_Df

def transform_tfidf(training, testing, column_name):
    Tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=100, stop_words=ENGLISH_STOP_WORDS)
    Tfidf_fit = Tfidf.fit(training[column_name])
    Tfidf_training = Tfidf_fit.transform(training[column_name])
    Tfidf_training_df = pd.DataFrame(Tfidf_training.toarray(), columns=Tfidf.get_feature_names())
    Tfidf_testing = Tfidf_fit.transform(testing[column_name])
    Tfidf_testing_df = pd.DataFrame(Tfidf_testing.toarray(), columns=Tfidf.get_feature_names())
    return Tfidf_fit, Tfidf_training_df, Tfidf_testing_df

def add_augmenting_features(df):
    tokened_reviews = [word_tokenize(rev) for rev in df['Review']]
    # Create feature that measures length of reviews
    len_tokens = []
    for i in range(len(tokened_reviews)):
        len_tokens.append(len(tokened_reviews[i]))
    len_tokens = preprocessing.scale(len_tokens)
    df.insert(0, column='Lengths', value=len_tokens)

    # Create average word length (training)
    Average_Words = [len(x)/(len(x.split())) for x in df['Review'].tolist()]
    Average_Words = preprocessing.scale(Average_Words)
    df['averageWords'] = Average_Words
    return df

def build_model(X_train, y_train, X_test, y_test, name_of_test):
    log_reg = LogisticRegression(C=30, max_iter=200).fit(X_train, y_train)
    y_pred = log_reg.predict(X_test)
    print('Training accuracy of '+name_of_test+': ', log_reg.score(X_train, y_train))
    print('Testing accuracy of '+name_of_test+': ', log_reg.score(X_test, y_test))
    print(classification_report(y_test, y_pred))  # Evaluating prediction ability
    return log_reg

# Load training and test sets
# Loading reviews into DF
df_train = load_data('train')

print('...successfully loaded training data')
print('Total length of training data: ', len(df_train))
# Add augmenting features
df_train = add_augmenting_features(df_train)
print('...augmented data with len_tokens and average_words')

# Load test DF
df_test = load_data('test')

print('...successfully loaded testing data')
print('Total length of testing data: ', len(df_test))
df_test = add_augmenting_features(df_test)
print('...augmented data with len_tokens and average_words')

# Create unstemmed BOW features for training set
unstemmed_BOW_vect_fit, df_train_bow_unstem, df_test_bow_unstem = transform_BOW(df_train, df_test, 'Review')
print('...successfully created the unstemmed BOW data')

# Create TfIdf features for training set
unstemmed_tfidf_vect_fit, df_train_tfidf_unstem, df_test_tfidf_unstem = transform_tfidf(df_train, df_test, 'Review')
print('...successfully created the unstemmed TFIDF data')

# Running logistic regression on dataframes
bow_unstemmed = build_model(df_train_bow_unstem, df_train['Sentiment'], df_test_bow_unstem, df_test['Sentiment'], 'BOW Unstemmed')

tfidf_unstemmed = build_model(df_train_tfidf_unstem, df_train['Sentiment'], df_test_tfidf_unstem, df_test['Sentiment'], 'TFIDF Unstemmed')


Using custom data configuration default



Downloading and preparing dataset rotten_tomatoes_movie_review/default (download: 476.34 KiB, generated: 1.28 MiB, post-processed: Unknown size, total: 1.75 MiB) to /root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/dfe295aa1215177b34ce0b4502bfc1d5f20efe7378e38d55678355eaa4d994ee...


Dataset rotten_tomatoes_movie_review downloaded and prepared to /root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/dfe295aa1215177b34ce0b4502bfc1d5f20efe7378e38d55678355eaa4d994ee. Subsequent calls will reuse this data.
...successfully loaded training data
Total length of training data:  8530
...augmented data with len_tokens and average_words


Using custom data configuration default
Reusing dataset rotten_tomatoes_movie_review (/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/dfe295aa1215177b34ce0b4502bfc1d5f20efe7378e38d55678355eaa4d994ee)


...successfully loaded testing data
Total length of testing data:  1066
...augmented data with len_tokens and average_words
...successfully created the unstemmed BOW data
...successfully created the unstemmed TFIDF data
Training accuracy of BOW Unstemmed:  0.6193434935521688
Testing accuracy of BOW Unstemmed:  0.6031894934333959
              precision    recall  f1-score   support

           0       0.59      0.69      0.63       533
           1       0.62      0.52      0.57       533

    accuracy                           0.60      1066
   macro avg       0.61      0.60      0.60      1066
weighted avg       0.61      0.60      0.60      1066

Training accuracy of TFIDF Unstemmed:  0.6220398593200469
Testing accuracy of TFIDF Unstemmed:  0.6088180112570356
              precision    recall  f1-score   support

           0       0.60      0.67      0.63       533
           1       0.62      0.54      0.58       533

    accuracy                           0.61      1066
   macro 
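
The cleaning step in `load_data` can be tried in isolation. The sketch below applies the same regular expression to a sample string chosen for illustration; the helper name `clean_review` is hypothetical, and `str.split` stands in for NLTK's `word_tokenize` so the snippet runs without any downloads.

```python
import re

def clean_review(text):
    """Replace every character that is not an ASCII letter with a space."""
    return re.sub("[^a-zA-Z]", " ", str(text))

# Sample string chosen for illustration.
sample = "the 21st century's new conan"
cleaned = clean_review(sample)

# Whitespace split as a simple stand-in for word_tokenize.
tokens = cleaned.split()
print(tokens)
```

Digits and apostrophes are dropped along with other punctuation, so `21st` becomes `st` and `century's` becomes `century` and `s`.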

### Attacking

TextAttack includes a built-in `SklearnModelWrapper` that can run attacks on most sklearn models. (If your tokenization strategy differs from the one above, you may need to subclass `SklearnModelWrapper` to make sure the model inputs and outputs come in the correct format.)
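
The contract such a wrapper has to satisfy can be sketched without TextAttack or sklearn. The classes below are hypothetical stand-ins, not TextAttack's source: the sketch assumes the wrapper is called with a list of raw strings, converts them to features with the fitted vectorizer, and returns one row of class probabilities per string.

```python
import math

class ToyVectorizer:
    """Hypothetical stand-in for a fitted sklearn vectorizer."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def transform(self, texts):
        # One row of word counts per input string.
        return [[text.lower().split().count(w) for w in self.vocabulary]
                for text in texts]

class ToyModel:
    """Hypothetical stand-in for a fitted binary logistic regression."""

    def __init__(self, weights, bias=0.0):
        self.weights = weights
        self.bias = bias

    def predict_proba(self, rows):
        probs = []
        for row in rows:
            z = self.bias + sum(w * x for w, x in zip(self.weights, row))
            p_pos = 1.0 / (1.0 + math.exp(-z))
            probs.append([1.0 - p_pos, p_pos])  # [negative, positive]
        return probs

class SketchModelWrapper:
    """Illustrative wrapper: list of strings in, class probabilities out."""

    def __init__(self, model, vectorizer):
        self.model = model
        self.vectorizer = vectorizer

    def __call__(self, text_input_list):
        features = self.vectorizer.transform(text_input_list)
        return self.model.predict_proba(features)

wrapper = SketchModelWrapper(
    ToyModel(weights=[1.5, -1.5]), ToyVectorizer(["good", "bad"])
)
scores = wrapper(["a good movie", "a bad movie"])
print(scores)
```

If your preprocessing does more than the vectorizer alone (for example, stemming before vectorizing), that extra step would belong inside `__call__` of a subclass, so attacked text passes through the same pipeline as the training text.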

Once we initialize the model wrapper, we load a few samples from the Rotten Tomatoes dataset and run the `TextFoolerJin2019` attack on our model.
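
The attack summary printed further down lists the search method as `GreedyWordSwapWIR` with `wir_method: delete`. The idea of ranking words by deletion can be sketched as follows. This is a simplified illustration, not TextAttack's implementation; the scorer `toy_positive_score` and its word weights are made up.

```python
def word_importance_by_deletion(words, score_fn):
    """Rank word positions by how much deleting each word lowers the score."""
    base = score_fn(words)
    drops = []
    for i in range(len(words)):
        without_i = words[:i] + words[i + 1:]
        drops.append((base - score_fn(without_i), i))
    # Largest drop first; ties keep left-to-right order.
    drops.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in drops]

# Hypothetical scorer: a made-up "positive" score driven by two words.
POSITIVE_WEIGHTS = {"good": 0.3, "fun": 0.2}

def toy_positive_score(words):
    return 0.1 + sum(POSITIVE_WEIGHTS.get(w, 0.0) for w in words)

words = "wasabi is a good place to have fun".split()
ranking = word_importance_by_deletion(words, toy_positive_score)
print([words[i] for i in ranking])
```

A greedy search would then try substitutions for the highest-ranked words first, which is why a single swapped word is often enough to flip a weak classifier's prediction.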

In [5]:
from textattack.models.wrappers import SklearnModelWrapper

model_wrapper = SklearnModelWrapper(bow_unstemmed, unstemmed_BOW_vect_fit)

In [6]:
from textattack.datasets import HuggingFaceDataset
from textattack.attack_recipes import TextFoolerJin2019
from textattack import Attacker

dataset = HuggingFaceDataset("rotten_tomatoes", None, "train")
attack = TextFoolerJin2019.build(model_wrapper)

attacker = Attacker(attack, dataset)
attacker.attack_dataset()

Using custom data configuration default
Reusing dataset rotten_tomatoes_movie_review (/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/dfe295aa1215177b34ce0b4502bfc1d5f20efe7378e38d55678355eaa4d994ee)
textattack: Loading datasets dataset rotten_tomatoes, split train.
textattack: Unknown if model of class <class 'sklearn.linear_model._logistic.LogisticRegression'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
textattack: Attempting to attack 5 samples when only 8530 are available.
  0%|          | 0/5 [00:00<?, ?it/s]Using /tmp/tfhub_modules to cache modules.


Attack(
  (search_method): GreedyWordSwapWIR(
    (wir_method):  delete
  )
  (goal_function):  UntargetedClassification
  (transformation):  WordSwapEmbedding(
    (max_candidates):  50
    (embedding):  WordEmbedding
  )
  (constraints): 
    (0): WordEmbeddingDistance(
        (embedding):  WordEmbedding
        (min_cos_sim):  0.5
        (cased):  False
        (include_unknown_words):  True
        (compare_against_original):  True
      )
    (1): PartOfSpeech(
        (tagger_type):  nltk
        (tagset):  universal
        (allow_verb_noun_swap):  True
        (compare_against_original):  True
      )
    (2): UniversalSentenceEncoder(
        (metric):  angular
        (threshold):  0.840845057
        (window_size):  15
        (skip_text_shorter_than_window):  True
        (compare_against_original):  False
      )
    (3): RepeatModification
    (4): StopwordModification
    (5): InputColumnModification(
        (matching_column_labels):  ['premise', 'hypothesis']
       

[Succeeded / Failed / Total] 2 / 0 / 3:  60%|██████    | 3/5 [00:06<00:04,  2.08s/it]

--------------------------------------------- Result 1 ---------------------------------------------
Positive (55%) --> Negative (51%)

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .

the rock is destined to be the 21st century's newest " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .


--------------------------------------------- Result 2 ---------------------------------------------
Positive (52%) --> Negative (52%)

the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .

the gorgeously elaborate continuation of " the lord of the rings " trilogy 

[Succeeded / Failed / Total] 3 / 0 / 5: 100%|██████████| 5/5 [00:06<00:00,  1.28s/it]

--------------------------------------------- Result 4 ---------------------------------------------
Positive (72%) --> Negative (63%)

if you sometimes like to go to the movies to have fun , wasabi is a good place to start .

if you sometimes like to go to the movie to have amuse , wasabi is a good place to start .


--------------------------------------------- Result 5 ---------------------------------------------
Negative (78%) --> [SKIPPED]

emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .



+-------------------------------+--------+
| Attack Results                |        |
+-------------------------------+--------+
| Number of successful attacks: | 3      |
| Number of failed attacks:     | 0      |
| Number of skipped attacks:    | 2      |
| Original accuracy:            | 60.0%  |
| Accuracy under attack:        | 0.0%   |
| Attack success r




[<textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7f9906d85cd0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7f98f9717b10>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7f9903ebaad0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7f98f972f9d0>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7f990e679b50>]
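
The two accuracy figures in the results table follow from the three counts above them. The arithmetic below is a sketch under one assumption: a sample is skipped when the model already misclassifies it before any attack, so only correctly classified samples are attacked.

```python
# Counts copied from the results table.
successful, failed, skipped = 3, 0, 2
total = successful + failed + skipped

# Samples that were attacked at all were classified correctly to begin with.
original_accuracy = (successful + failed) / total
# Only samples whose attack failed are still classified correctly afterwards.
accuracy_under_attack = failed / total

print(f"Original accuracy:     {original_accuracy:.1%}")
print(f"Accuracy under attack: {accuracy_under_attack:.1%}")
```

With 3 of 5 samples classified correctly before the attack and none afterwards, this reproduces the 60.0% and 0.0% shown in the table.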

### Conclusion
We were able to train a model on the Rotten Tomatoes dataset using `sklearn` and use it in TextAttack by wrapping it in `SklearnModelWrapper`. It's that simple!