## sklearn and TextAttack

The following code trains two text classification models using sklearn. Both are logistic regression models; they differ only in their features.

We will load data using `datasets`, train the models, and attack them using TextAttack.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/QData/TextAttack/blob/master/docs/2notebook/Example_1_sklearn.ipynb)

[![View Source on GitHub](https://img.shields.io/badge/github-view%20source-black.svg)](https://github.com/QData/TextAttack/blob/master/docs/2notebook/Example_1_sklearn.ipynb)

### Setup

Ensure that the TextAttack library and its resources are installed and working in the notebook.

In [3]:
!git clone https://github.com/QData/TextAttack

Cloning into 'TextAttack'...
remote: Enumerating objects: 221, done.
remote: Counting objects: 100% (221/221), done.
remote: Compressing objects: 100% (157/157), done.
remote: Total 18291 (delta 103), reused 119 (delta 64), pack-reused 18070
Receiving objects: 100% (18291/18291), 23.92 MiB | 29.20 MiB/s, done.
Resolving deltas: 100% (13652/13652), done.


In [4]:
cd TextAttack

/content/TextAttack


In [6]:
!git checkout api-rework

Branch 'api-rework' set up to track remote branch 'api-rework' from 'origin'.
Switched to a new branch 'api-rework'


In [7]:
!pip install ./

Processing /content/TextAttack
Collecting bert-score>=0.3.5
  Downloading https://files.pythonhosted.org/packages/14/27/ccf86d5dfc19f89bee4449e96ac6e0f7c312f1614de86609c5f6da5c40af/bert_score-0.3.8-py3-none-any.whl (58kB)
     |████████████████████████████████| 61kB 3.3MB/s 
Collecting flair==0.6.1.post1
  Downloading https://files.pythonhosted.org/packages/4a/49/a812ed93088ba9519cbb40eb9f52341694b31cfa126bfddcd9db3761f3ac/flair-0.6.1.post1-py3-none-any.whl (337kB)
     |████████████████████████████████| 337kB 7.4MB/s 
Collecting language_tool_python
  Downloading https://files.pythonhosted.org/packages/37/26/48b22ad565fd372edec3577218fb817e0e6626bf4e658033197470ad92b3/language_tool_python-2.5.3-py3-none-any.whl
Collecting lemminflect
  Downloading https://files.pythonhosted.org/packages/4b/67/d04ca98b661d4ad52b9b965c9dabb1f1a2c85541d20f8decb9a9df4e4b32/lemminflect-0.2.2-py3-none-any.whl (769kB)
     |████████████████████████████████| 778kB 10.3MB/s 

### Training

This code trains two models: one on bag-of-words statistics (`bow_unstemmed`) and one on tf–idf statistics (`tfidf_unstemmed`). The dataset is the IMDB movie review dataset.
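To make the difference between the two feature sets concrete, here is a minimal pure-Python sketch on a three-document toy corpus. The corpus and helper names are illustrative and not part of the notebook. It assumes the smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is what sklearn's `TfidfVectorizer` documents as its default; the notebook itself relies on sklearn rather than on code like this.

```python
import math
from collections import Counter

# Toy corpus (illustrative only; the notebook uses IMDB reviews).
docs = ["good movie", "bad movie", "good good fun"]
vocab = sorted({word for doc in docs for word in doc.split()})


def bow(doc):
    """Bag-of-words: raw count of each vocabulary word in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


def idf(term):
    """Smoothed inverse document frequency (assumed sklearn default form)."""
    doc_freq = sum(term in doc.split() for doc in docs)
    return math.log((1 + len(docs)) / (1 + doc_freq)) + 1


def tfidf(doc):
    """Counts reweighted by idf, then scaled to unit L2 length."""
    raw = [count * idf(word) for count, word in zip(bow(doc), vocab)]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw] if norm else raw


bow_rows = [bow(doc) for doc in docs]
tfidf_rows = [tfidf(doc) for doc in docs]

for doc, counts, weights in zip(docs, bow_rows, tfidf_rows):
    print(doc, counts, [round(w, 3) for w in weights])
```

In the second document, "bad" and "movie" each appear once, so their bag-of-words counts are equal, but "bad" receives the larger tf-idf weight because it occurs in fewer documents.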

In [1]:
import nltk # the Natural Language Toolkit
nltk.download('punkt') # The NLTK tokenizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
import datasets
import os
import pandas as pd
import re
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Nice to see additional metrics
from sklearn.metrics import classification_report

def load_data(dataset_split='train'):
    dataset = datasets.load_dataset('imdb')[dataset_split]
    # Build a DataFrame of review texts and their sentiment labels
    df = pd.DataFrame()
    df['Review'] = [review['text'] for review in dataset]
    df['Sentiment'] = [review['label'] for review in dataset]
    # Replace every non-alphabetic character (digits included) with a space
    df['Review'] = df['Review'].apply(lambda x: re.sub("[^a-zA-Z]", ' ', str(x)))
    # Tokenize the training and testing data
    df_tokenized = tokenize_review(df)
    return df_tokenized

def tokenize_review(df):
    # Tokenize Reviews in training
    tokened_reviews = [word_tokenize(rev) for rev in df['Review']]
    # Create word stems
    stemmed_tokens = []
    porter = PorterStemmer()
    for i in range(len(tokened_reviews)):
        stems = [porter.stem(token) for token in tokened_reviews[i]]
        stems = ' '.join(stems)
        stemmed_tokens.append(stems)
    df.insert(1, column='Stemmed', value=stemmed_tokens)
    return df

def transform_BOW(training, testing, column_name):
    # Pass the stop words as a list, which works across sklearn versions
    vect = CountVectorizer(max_features=100, ngram_range=(1,3), stop_words=list(ENGLISH_STOP_WORDS))
    vectFit = vect.fit(training[column_name])
    BOW_training = vectFit.transform(training[column_name])
    # get_feature_names() was removed in sklearn 1.2; get_feature_names_out() requires sklearn >= 1.0
    BOW_training_df = pd.DataFrame(BOW_training.toarray(), columns=vect.get_feature_names_out())
    BOW_testing = vectFit.transform(testing[column_name])
    BOW_testing_Df = pd.DataFrame(BOW_testing.toarray(), columns=vect.get_feature_names_out())
    return vectFit, BOW_training_df, BOW_testing_Df

def transform_tfidf(training, testing, column_name):
    # Pass the stop words as a list, which works across sklearn versions
    Tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=100, stop_words=list(ENGLISH_STOP_WORDS))
    Tfidf_fit = Tfidf.fit(training[column_name])
    Tfidf_training = Tfidf_fit.transform(training[column_name])
    # get_feature_names() was removed in sklearn 1.2; get_feature_names_out() requires sklearn >= 1.0
    Tfidf_training_df = pd.DataFrame(Tfidf_training.toarray(), columns=Tfidf.get_feature_names_out())
    Tfidf_testing = Tfidf_fit.transform(testing[column_name])
    Tfidf_testing_df = pd.DataFrame(Tfidf_testing.toarray(), columns=Tfidf.get_feature_names_out())
    return Tfidf_fit, Tfidf_training_df, Tfidf_testing_df

def add_augmenting_features(df):
    tokened_reviews = [word_tokenize(rev) for rev in df['Review']]
    # Create feature that measures length of reviews
    len_tokens = []
    for i in range(len(tokened_reviews)):
        len_tokens.append(len(tokened_reviews[i]))
    len_tokens = preprocessing.scale(len_tokens)
    df.insert(0, column='Lengths', value=len_tokens)

    # Create average word length (training)
    Average_Words = [len(x)/(len(x.split())) for x in df['Review'].tolist()]
    Average_Words = preprocessing.scale(Average_Words)
    df['averageWords'] = Average_Words
    return df

def build_model(X_train, y_train, X_test, y_test, name_of_test):
    log_reg = LogisticRegression(C=30, max_iter=200).fit(X_train, y_train)
    y_pred = log_reg.predict(X_test)
    print('Training accuracy of '+name_of_test+': ', log_reg.score(X_train, y_train))
    print('Testing accuracy of '+name_of_test+': ', log_reg.score(X_test, y_test))
    print(classification_report(y_test, y_pred))  # Evaluating prediction ability
    return log_reg

# Load training and test sets
# Loading reviews into DF
df_train = load_data('train')

print('...successfully loaded training data')
print('Total length of training data: ', len(df_train))
# Add augmenting features
df_train = add_augmenting_features(df_train)
print('...augmented data with len_tokens and average_words')

# Load test DF
df_test = load_data('test')

print('...successfully loaded testing data')
print('Total length of testing data: ', len(df_test))
df_test = add_augmenting_features(df_test)
print('...augmented data with len_tokens and average_words')

# Create unstemmed BOW features for training set
unstemmed_BOW_vect_fit, df_train_bow_unstem, df_test_bow_unstem = transform_BOW(df_train, df_test, 'Review')
print('...successfully created the unstemmed BOW data')

# Create TfIdf features for training set
unstemmed_tfidf_vect_fit, df_train_tfidf_unstem, df_test_tfidf_unstem = transform_tfidf(df_train, df_test, 'Review')
print('...successfully created the unstemmed TFIDF data')

# Running logistic regression on dataframes
bow_unstemmed = build_model(df_train_bow_unstem, df_train['Sentiment'], df_test_bow_unstem, df_test['Sentiment'], 'BOW Unstemmed')

tfidf_unstemmed = build_model(df_train_tfidf_unstem, df_train['Sentiment'], df_test_tfidf_unstem, df_test['Sentiment'], 'TFIDF Unstemmed')

Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.06 MiB, post-processed: Unknown size, total: 207.28 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3...



Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3. Subsequent calls will reuse this data.
...successfully loaded training data
Total length of training data:  25000
...augmented data with len_tokens and average_words


Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3)


...successfully loaded testing data
Total length of testing data:  25000
...augmented data with len_tokens and average_words
...successfully created the unstemmed BOW data
...successfully created the unstemmed TFIDF data
Training accuracy of BOW Unstemmed:  0.728
Testing accuracy of BOW Unstemmed:  0.72496
              precision    recall  f1-score   support

           0       0.73      0.72      0.72     12500
           1       0.72      0.73      0.73     12500

    accuracy                           0.72     25000
   macro avg       0.73      0.72      0.72     25000
weighted avg       0.73      0.72      0.72     25000

Training accuracy of TFIDF Unstemmed:  0.72884
Testing accuracy of TFIDF Unstemmed:  0.72684
              precision    recall  f1-score   support

           0       0.73      0.73      0.73     12500
           1       0.73      0.73      0.73     12500

    accuracy                           0.73     25000
   macro avg       0.73      0.73      0.73     25000


### Attacking

TextAttack includes a built-in `SklearnModelWrapper` that can run attacks on most sklearn models. (If your tokenization strategy differs from the one above, you may need to subclass `SklearnModelWrapper` so that the model inputs and outputs come in the correct format.)

Once we initialize the model wrapper, we load a few samples from the IMDB dataset and run the `TextFoolerJin2019` attack on our model.
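The contract a model wrapper has to satisfy can be sketched without TextAttack: take a list of raw strings, turn them into features with the fitted vectorizer, and return one row of class probabilities per string. The sketch below is an assumption about that general shape, not TextAttack's implementation; `TextToProbabilityWrapper`, `LowercasingVectorizer`, and `TinyLogisticModel` are hypothetical stand-ins for the wrapper, the fitted vectorizer, and the fitted classifier.

```python
import math
import re


class LowercasingVectorizer:
    """Hypothetical stand-in for a fitted vectorizer exposing transform()."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def transform(self, texts):
        rows = []
        for text in texts:
            # Custom tokenization: lowercase, keep alphabetic runs only.
            tokens = re.findall(r"[a-z]+", text.lower())
            rows.append([tokens.count(word) for word in self.vocabulary])
        return rows


class TinyLogisticModel:
    """Hypothetical stand-in for a fitted classifier exposing predict_proba()."""

    def __init__(self, weights, bias=0.0):
        self.weights = weights
        self.bias = bias

    def predict_proba(self, rows):
        probabilities = []
        for row in rows:
            z = self.bias + sum(w * x for w, x in zip(self.weights, row))
            positive = 1.0 / (1.0 + math.exp(-z))
            probabilities.append([1.0 - positive, positive])
        return probabilities


class TextToProbabilityWrapper:
    """Maps a list of strings to per-class probabilities (illustrative)."""

    def __init__(self, model, vectorizer):
        self.model = model
        self.vectorizer = vectorizer

    def __call__(self, text_input_list):
        features = self.vectorizer.transform(text_input_list)
        return self.model.predict_proba(features)


wrapper = TextToProbabilityWrapper(
    TinyLogisticModel(weights=[2.0, -2.0]),
    LowercasingVectorizer(["good", "bad"]),
)
scores = wrapper(["A good, GOOD film", "bad acting"])
print(scores)
```

If your own preprocessing differs from the notebook's, the place to reproduce it is the step between the raw strings and the model call, so that attacked text is featurized the same way as the training text.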

In [4]:
from textattack.models.wrappers import SklearnModelWrapper

model_wrapper = SklearnModelWrapper(bow_unstemmed, unstemmed_BOW_vect_fit)

textattack: Updating TextAttack package dependencies.
textattack: Downloading NLTK required packages.


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package omw to /root/nltk_data...
[nltk_data]   Unzipping corpora/omw.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


textattack: Downloading https://textattack.s3.amazonaws.com/word_embeddings/paragramcf.
100%|██████████| 481M/481M [00:10<00:00, 46.6MB/s]
textattack: Unzipping file /root/.cache/textattack/tmpwx5_yv6o.zip to /root/.cache/textattack/word_embeddings/paragramcf.
textattack: Successfully saved word_embeddings/paragramcf to cache.


In [5]:
from textattack.datasets import HuggingFaceDataset
from textattack.attack_recipes import TextFoolerJin2019
from textattack import Attacker

dataset = HuggingFaceDataset("imdb", None, "train")
attack = TextFoolerJin2019.build(model_wrapper)

attacker = Attacker(attack, dataset)
attacker.attack_dataset()

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3)
textattack: Loading datasets dataset imdb, split train.
textattack: Unknown if model of class <class 'sklearn.linear_model._logistic.LogisticRegression'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
textattack: Attempting to attack 5 samples when only 25000 are available.
  0%|          | 0/5 [00:00<?, ?it/s]

Attack(
  (search_method): GreedyWordSwapWIR(
    (wir_method):  delete
  )
  (goal_function):  UntargetedClassification
  (transformation):  WordSwapEmbedding(
    (max_candidates):  50
    (embedding):  WordEmbedding
  )
  (constraints): 
    (0): WordEmbeddingDistance(
        (embedding):  WordEmbedding
        (min_cos_sim):  0.5
        (cased):  False
        (include_unknown_words):  True
        (compare_against_original):  True
      )
    (1): PartOfSpeech(
        (tagger_type):  nltk
        (tagset):  universal
        (allow_verb_noun_swap):  True
        (compare_against_original):  True
      )
    (2): UniversalSentenceEncoder(
        (metric):  angular
        (threshold):  0.840845057
        (window_size):  15
        (skip_text_shorter_than_window):  True
        (compare_against_original):  False
      )
    (3): RepeatModification
    (4): StopwordModification
    (5): InputColumnModification(
        (matching_column_labels):  ['premise', 'hypothesis']
       

Using /tmp/tfhub_modules to cache modules.
Downloading TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder/4'.
Downloaded https://tfhub.dev/google/universal-sentence-encoder/4, Total size: 987.47MB
Downloaded TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder/4'.
[Succeeded / Failed / Total] 1 / 0 / 2:  40%|████      | 2/5 [00:22<00:34, 11.37s/it]

--------------------------------------------- Result 1 ---------------------------------------------
[[Positive (69%)]] --> [[Negative (50%)]]

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school [[life]], such as "Teachers". My 35 [[years]] in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see [[right]] through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I [[saw]] the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!

Br

[Succeeded / Failed / Total] 2 / 0 / 3:  60%|██████    | 3/5 [00:23<00:15,  7.78s/it]

--------------------------------------------- Result 3 ---------------------------------------------
[[Positive (69%)]] --> [[Negative (60%)]]

Brilliant over-acting by Lesley Ann Warren. [[Best]] dramatic hobo lady I have ever seen, and [[love]] scenes in clothes warehouse are second to none. The corn on face is a classic, as good as anything in Blazing Saddles. The take on lawyers is also superb. After being accused of being a turncoat, selling out his boss, and being dishonest the lawyer of Pepto Bolt shrugs indifferently "I'm a lawyer" he says. Three funny words. Jeffrey Tambor, a favorite from the later Larry Sanders show, is fantastic here too as a mad millionaire who wants to crush the ghetto. His character is more malevolent than usual. The hospital scene, and the scene where the homeless invade a demolition site, are all-time classics. Look for the legs scene and the two big diggers fighting (one bleeds). This movie gets better each time I see it (which is 

[Succeeded / Failed / Total] 3 / 0 / 4:  80%|████████  | 4/5 [00:23<00:05,  5.95s/it]

--------------------------------------------- Result 4 ---------------------------------------------
[[Positive (63%)]] --> [[Negative (52%)]]

This is easily the most underrated film inn the Brooks cannon. Sure, its flawed. It does not give a realistic view of homelessness (unlike, say, how Citizen Kane gave a realistic view of lounge singers, or Titanic gave a realistic view of Italians YOU IDIOTS). Many of the jokes fall flat. But still, this film is very lovable in a way many comedies are not, and to pull that off in a story about some of the most traditionally reviled members of society is truly impressive. Its not The Fisher King, but its not crap, either. My only complaint is that Brooks should have [[cast]] someone else in the lead (I [[love]] Mel as a Director and Writer, not so much as a lead).

This is easily the most underrated film inn the Brooks cannon. Sure, its flawed. It does not give a realistic view of homelessness (unlike, say, how Citizen Kane g

[Succeeded / Failed / Total] 4 / 0 / 5: 100%|██████████| 5/5 [00:24<00:00,  4.82s/it]

--------------------------------------------- Result 5 ---------------------------------------------
[[Positive (55%)]] --> [[Negative (58%)]]

This is not the typical Mel Brooks film. It was much less slapstick than most of his movies and actually had a plot that was followable. Leslie Ann Warren made the movie, she is such a fantastic, under-rated actress. There were some moments that could have been fleshed out a bit more, and some scenes that could probably have been cut to make the room to do so, but all in all, this is worth the price to rent and see it. The acting was good overall, Brooks himself did a good job without his characteristic speaking to directly to the audience. Again, Warren was the [[best]] actor in the movie, but "Fume" and "Sailor" both played their parts well.

This is not the typical Mel Brooks film. It was much less slapstick than most of his movies and actually had a plot that was followable. Leslie Ann Warren made the movie, she is such a fan




[<textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7fdf1ef8f690>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7fdf207be1d0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7fdf1eb05ed0>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7fdf5745a450>,
 <textattack.attack_results.successful_attack_result.SuccessfulAttackResult at 0x7fdf5417b790>]
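The returned list is easier to read as a tally by result type. The following is a small sketch that counts results by class name; the two empty classes are stand-ins defined here only so the snippet runs without TextAttack installed, and `tally_results` is a hypothetical helper. With real results you would pass the list returned by `attacker.attack_dataset()` instead.

```python
from collections import Counter


# Stand-ins that only mirror the class names shown in the output above.
class SuccessfulAttackResult:
    pass


class SkippedAttackResult:
    pass


def tally_results(results):
    """Count attack results by their class name."""
    return Counter(type(result).__name__ for result in results)


# Same composition as the list printed above: four successes, one skip.
results = [
    SuccessfulAttackResult(),
    SkippedAttackResult(),
    SuccessfulAttackResult(),
    SuccessfulAttackResult(),
    SuccessfulAttackResult(),
]
summary = tally_results(results)
print(dict(summary))
```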

### Conclusion
We were able to train a model on the IMDB dataset using `sklearn` and use it in TextAttack by initializing with the `SklearnModelWrapper`. It's that simple!