## sklearn and TextAttack

The following code trains two different text classification models using sklearn. Both are logistic regression models; the difference is in the features.

We will load data using `datasets`, train the models, and attack them using TextAttack.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/QData/TextAttack/blob/master/docs/2notebook/Example_1_sklearn.ipynb)

[![View Source on GitHub](https://img.shields.io/badge/github-view%20source-black.svg)](https://github.com/QData/TextAttack/blob/master/docs/2notebook/Example_1_sklearn.ipynb)

Please remember to run **pip3 install textattack[tensorflow]** in your notebook environment before running the code below.
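Before running the cells, it can help to confirm that the optional packages are importable. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` and the list of package names are assumptions made for illustration, not part of TextAttack.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec looks a module up without importing it and returns None if absent.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list of packages this notebook relies on; adjust to your setup.
required = ["textattack", "sklearn", "nltk", "datasets", "tensorflow", "tensorflow_hub"]

missing = missing_modules(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

If anything is reported missing, install it before starting the attack cell, since some TextAttack components are only imported when they are first used.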

### Training

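The notebook describes two models: one on bag-of-words statistics (`bow_unstemmed`) and one on tf–idf statistics (`tfidf_unstemmed`). As written, the cells below build only the bag-of-words model. The training and test data are read from local pickle files and carry ten topic labels (0 to 9); the unused `load_data` helper suggests they were derived from the `yahoo_answers_topics` dataset rather than from IMDB movie reviews.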
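To make the difference between the two feature types concrete, here is a small pure-Python sketch on three made-up sentences. It uses the plain `tf * log(N / df)` weighting as an illustration; sklearn's `TfidfVectorizer` uses a smoothed idf and L2-normalises each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "the movie was great",
    "the movie was boring",
    "the plot was great and the acting was great",
]
tokenized = [doc.split() for doc in docs]

# Bag-of-words: raw term counts per document.
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents in which each term occurs.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)


def tfidf(counts):
    """Weight each count by log(N / df), so terms found everywhere score zero."""
    return {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}


weights = [tfidf(counts) for counts in bow]

print("BOW count of 'the' in doc 2:", bow[2]["the"])
print("tf-idf of 'the' in doc 2:", weights[2]["the"])
print("tf-idf of 'boring' in doc 1:", weights[1]["boring"])
```

A word such as "the" keeps a high bag-of-words count but receives zero tf–idf weight because it occurs in every document, while a rarer word such as "boring" is weighted up.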

In [3]:
import nltk # the Natural Language Toolkit
nltk.download('punkt') # The NLTK tokenizer
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/macbookpro/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/macbookpro/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
import pickle
import pandas as pd
def load_obj(name):
    """load .pickle"""
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)
Xtrain = load_obj("./X_train")
labeltrain = load_obj("./y_train")

df_train = pd.DataFrame(Xtrain, columns=['Line'])
df_train['Topic'] = labeltrain.astype(int)


Xtest = load_obj("./X_test")
labeltest = load_obj("./y_test")

df_test = pd.DataFrame(Xtest, columns=['Line'])
df_test['Topic'] = labeltest.astype(int)


In [6]:
import datasets
import os
import pandas as pd
import re
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import pickle

# Nice to see additional metrics
from sklearn.metrics import classification_report


def load_data(dataset_split='train'):
    dataset = datasets.load_dataset('yahoo_answers_topics')[dataset_split]
    l = int(0.01*len(dataset))
    dataset = dataset.select(range(l))
    # Build a DataFrame holding the concatenated text and its topic label
    df = pd.DataFrame()
    df['Line'] = [review['question_title']+review['question_content']+review['best_answer'] for review in dataset]
    df['Topic'] = [review['topic'] for review in dataset]
    # Replace every non-alphabetic character (digits included) with a space
    df['Line'] = df['Line'].apply(lambda x: re.sub("[^a-zA-Z]", ' ', str(x)))
    # Tokenize and stem the text
    df_tokenized = tokenize_review(df)
    return df_tokenized
    

def tokenize_review(df):
    # Tokenize Reviews in training
    print(df['Line'][0])
    tokened_reviews = [word_tokenize(rev) for rev in df['Line']]
    # Create word stems
    stemmed_tokens = []
    porter = PorterStemmer()
    for i in range(len(tokened_reviews)):
        stems = [porter.stem(token) for token in tokened_reviews[i]]
        stems = ' '.join(stems)
        stemmed_tokens.append(stems)
    df.insert(1, column='Stemmed', value=stemmed_tokens)
    return df



def transform_BOW(training, testing, column_name):
    stopwords = nltk.corpus.stopwords.words("english")
    ngram_range, min_df, max_df = (1, 2), 0.005, 0.25
    vect = CountVectorizer(input='content', decode_error='ignore',
                             strip_accents='ascii', lowercase=True,
                             stop_words=stopwords, token_pattern=r'\b[a-zA-Z]{3,}\b',
                             max_features=1000,
                             max_df=max_df, min_df=min_df, ngram_range=ngram_range)
    vectFit = vect.fit(training[column_name])
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is available from scikit-learn 1.0 onwards.
    feature_names = vectFit.get_feature_names_out()
    BOW_training = vectFit.transform(training[column_name])
    BOW_training_df = pd.DataFrame(BOW_training.toarray(), columns=feature_names)
    BOW_testing = vectFit.transform(testing[column_name])
    BOW_testing_Df = pd.DataFrame(BOW_testing.toarray(), columns=feature_names)
    return vectFit, BOW_training_df, BOW_testing_Df


def add_augmenting_features(df):
    tokened_reviews = [word_tokenize(rev) for rev in df['Line']]
    # Create feature that measures length of reviews
    len_tokens = []
    for i in range(len(tokened_reviews)):
        len_tokens.append(len(tokened_reviews[i]))
    len_tokens = preprocessing.scale(len_tokens)
    df.insert(0, column='Lengths', value=len_tokens)

    # Create average word length (training)
    Average_Words = [len(x)/(len(x.split())) for x in df['Line'].tolist()]
    Average_Words = preprocessing.scale(Average_Words)
    df['averageWords'] = Average_Words
    return df

def build_model(X_train, y_train, X_test, y_test, name_of_test):
    log_reg = LogisticRegression(C=30, max_iter=200).fit(X_train, y_train)
    y_pred = log_reg.predict(X_test)
    print('Training accuracy of '+name_of_test+': ', log_reg.score(X_train, y_train))
    print('Testing accuracy of '+name_of_test+': ', log_reg.score(X_test, y_test))
    print(classification_report(y_test, y_pred))  # Evaluating prediction ability
    return log_reg


#df_train = tokenize_review(train_df)
#df_test = tokenize_review(test_df)

print('Total length of training data: ', len(df_train))
# Add augmenting features
df_train = add_augmenting_features(df_train)
print('...augmented data with len_tokens and average_words')


print('Total length of testing data: ', len(df_test))
df_test = add_augmenting_features(df_test) 
print('...augmented data with len_tokens and average_words')

# Create unstemmed BOW features for training set
unstemmed_BOW_vect_fit, df_train_bow_unstem, df_test_bow_unstem = transform_BOW(df_train, df_test, 'Line')
print('...successfully created the unstemmed BOW data')

# Running logistic regression on dataframes
bow_unstemmed = build_model(df_train_bow_unstem, df_train['Topic'], df_test_bow_unstem, df_test['Topic'], 'BOW Unstemmed')


...successfully loaded training data
...successfully loaded testing data
Total length of training data:  18000
...augmented data with len_tokens and average_words
Total length of testing data:  6000
...augmented data with len_tokens and average_words




...successfully created the unstemmed BOW data


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Training accuracy of BOW Unstemmed:  0.9831111111111112
Testing accuracy of BOW Unstemmed:  0.9696666666666667
              precision    recall  f1-score   support

           0       0.96      0.96      0.96       424
           1       0.97      0.96      0.96       701
           2       0.99      0.96      0.97       462
           3       0.96      0.92      0.94       521
           4       0.99      0.99      0.99       746
           5       0.99      0.99      0.99       396
           6       0.95      0.98      0.96      1443
           7       0.97      0.95      0.96       396
           8       0.98      0.98      0.98       426
           9       0.97      0.99      0.98       485

    accuracy                           0.97      6000
   macro avg       0.97      0.97      0.97      6000
weighted avg       0.97      0.97      0.97      6000



### Attacking

TextAttack includes a built-in `SklearnModelWrapper` that can run attacks on most sklearn models. (If your tokenization strategy differs from the one above, you may need to subclass `SklearnModelWrapper` so that the model inputs and outputs are in the correct format.)
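Conceptually, a wrapper of this kind pairs a fitted vectorizer with a fitted classifier and exposes them as a single callable that maps a list of strings to per-class probabilities. The sketch below illustrates that idea with sklearn only; `VectorizerThenModel` is a hypothetical class written for this example and is not TextAttack's implementation, and the four training sentences are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


class VectorizerThenModel:
    """Hypothetical stand-in: vectorize raw strings, then return class probabilities."""

    def __init__(self, model, vectorizer):
        self.model = model
        self.vectorizer = vectorizer

    def __call__(self, text_input_list):
        # Turn raw strings into the same feature space the model was trained on.
        features = self.vectorizer.transform(text_input_list)
        # One row per input text, one column per class.
        return self.model.predict_proba(features)


# Tiny invented training set: label 1 is positive, label 0 is negative.
texts = ["good great film", "great fun movie", "bad boring film", "boring awful movie"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer().fit(texts)
model = LogisticRegression().fit(vectorizer.transform(texts), labels)

predict = VectorizerThenModel(model, vectorizer)
probs = predict(["great film", "awful film"])
print(probs.shape)
```

If a custom pipeline adds extra features (for example the length features computed earlier), a subclass would need to reproduce those steps inside its call method so that the attack sees the same inputs the model was trained on.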

Once we initialize the model wrapper, we wrap the test DataFrame as a TextAttack dataset and run the `TextFoolerJin2019` attack on 30 of its examples.

In [7]:
from textattack.models.wrappers import SklearnModelWrapper

model_wrapper = SklearnModelWrapper(bow_unstemmed, unstemmed_BOW_vect_fit)

In [None]:
# l = len(df_test)
# load  = load.select(range(l))
# len(load)
# new = []
# label = []
# for i in load:
#   new.append(df_test['Line'])
#   label.append(df_test['Topic'])
# len(new)
# load = load.add_column(name="Line", column = new)
# load = load.add_column(name="Topic", column = label)
# load = load.remove_columns(['question_title','question_content','best_answer','id','topic'])

In [12]:
!pip install tensorflow_hub



In [13]:
import textattack
from datasets import Dataset
from textattack import Attacker
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset

data = Dataset.from_pandas(df_test)
dataset = HuggingFaceDataset(data, dataset_columns=(['Line'],'Topic'))

attack = TextFoolerJin2019.build(model_wrapper)

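attack_args = textattack.AttackArgs(
    num_examples=30,       # number of examples to attack
    attack_n=True,         # keep going until 30 examples were attacked; skipped ones do not count
    log_to_csv="log.csv",  # write per-example results to this CSV file
)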

attacker = Attacker(attack, dataset, attack_args)
att = attacker.attack_dataset()


textattack: Unknown if model of class <class 'sklearn.linear_model._logistic.LogisticRegression'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
textattack: Logging to CSV at path log.csv


Attack(
  (search_method): GreedyWordSwapWIR(
    (wir_method):  delete
  )
  (goal_function):  UntargetedClassification
  (transformation):  WordSwapEmbedding(
    (max_candidates):  50
    (embedding):  WordEmbedding
  )
  (constraints): 
    (0): WordEmbeddingDistance(
        (embedding):  WordEmbedding
        (min_cos_sim):  0.5
        (cased):  False
        (include_unknown_words):  True
        (compare_against_original):  True
      )
    (1): PartOfSpeech(
        (tagger_type):  nltk
        (tagset):  universal
        (allow_verb_noun_swap):  True
        (compare_against_original):  True
      )
    (2): UniversalSentenceEncoder(
        (metric):  angular
        (threshold):  0.840845057
        (window_size):  15
        (skip_text_shorter_than_window):  True
        (compare_against_original):  False
      )
    (3): RepeatModification
    (4): StopwordModification
    (5): InputColumnModification(
        (matching_column_labels):  ['premise', 'hypothesis']
       









Failed to import tensorflow. Please note that tensorflow is not installed by default when you install tensorflow_hub. This is so that users can decide which tensorflow package to use. To use tensorflow_hub, please install a current version of tensorflow by following the instructions at https://tensorflow.org/install and https://tensorflow.org/hub/installation.




ModuleNotFoundError: Lazy module loader cannot find module named `tensorflow_hub`. This might be because TextAttack does not automatically install some optional dependencies. Please run `pip install tensorflow_hub` to install the package.

### Conclusion
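We trained a model with `sklearn` and made it usable in TextAttack by wrapping it in `SklearnModelWrapper`. The `TextFoolerJin2019` recipe also needs `tensorflow` and `tensorflow_hub` for its `UniversalSentenceEncoder` constraint, as the error output above shows, so install them before running the attack.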