# Robustness test for NLP machine learning models

The main goal of our work is to test the robustness of different NLP machine learning models, in our case SVM and XGBoost. These algorithms are used to classify unstructured text: we work with the customer complaints recorded by the Consumer Financial Protection Bureau as the independent variable and the product they refer to as the dependent variable. Robustness is tested by replacing a fraction of the total number of words in the texts.

In [1]:
fraction=0.05

In [55]:
%run Setup.ipynb

In [3]:
df = pd.read_csv(DATA_PATH+'/df.csv')
df=df.drop('Unnamed: 0',axis=1)

We split the data set into a train set (90%) and a test set (10%).

In [4]:
train_corpus, test_corpus, Y_train, Y_test =\
                                 train_test_split(np.array(df['CCN clean']), np.array(df['True Product']), test_size=0.1, random_state=42)

train_corpus.shape, test_corpus.shape

((18000,), (2000,))

# Text Representation

In order to give the models an appropriate input, we represent the unstructured texts numerically. In the TF-IDF representation each word gets an associated value, calculated by multiplying TF (term frequency: the number of times the word appears in a document) by IDF = log(N/df), where N is the total number of documents and df is the number of documents in which the word appears.
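As a small worked illustration of the formula above, the sketch below computes TF × IDF by hand on three toy documents (invented for this example, not taken from the CFPB data). It uses the plain log(N/df) form; scikit-learn's `TfidfVectorizer` by default applies a smoothed IDF, ln((1+N)/(1+df)) + 1, and L2-normalises each row, so its values will differ from these.

```python
import math

# Toy corpus, invented for illustration only.
toy_docs = [
    "credit card fee charged twice",
    "mortgage payment late fee",
    "credit report error",
]
toy_tokenized = [d.split() for d in toy_docs]
toy_n_docs = len(toy_tokenized)

def toy_tf(word, tokens):
    # Term frequency: raw count of the word in one document.
    return tokens.count(word)

def toy_idf(word):
    # Inverse document frequency: log(N / df), as in the text above.
    df = sum(1 for tokens in toy_tokenized if word in tokens)
    return math.log(toy_n_docs / df)

def toy_tfidf(word, tokens):
    return toy_tf(word, tokens) * toy_idf(word)

for word in ("fee", "mortgage"):
    print(word, round(toy_tfidf(word, toy_tokenized[1]), 3))
```

In the second document, "mortgage" (present in one document only) scores higher than "fee" (present in two), which is the behaviour the IDF factor is meant to produce.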

In [5]:
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0)
tv_train_features = tv.fit_transform(train_corpus)
tv_test_features = tv.transform(test_corpus)
print('TFIDF model:> Train features shape:', tv_train_features.shape, ' Test features shape:', tv_test_features.shape)

TFIDF model:> Train features shape: (18000, 24180)  Test features shape: (2000, 24180)


# Support Vector Machine

SVM is an algorithm that we train to classify our texts. It is based on finding a hyperplane (of n-1 dimensions, where n is the number of dimensions of the feature vectors used as independent variable) that separates the elements into different categories. We use a soft margin, which tolerates some misclassified training points.
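To make the soft-margin idea concrete, here is a minimal sketch that evaluates a linear decision function and the hinge loss for a fixed, hand-picked hyperplane. The weights, bias and points are invented, and nothing is trained; note also that `LinearSVC` minimises the squared hinge loss by default, so this only illustrates the principle.

```python
# Hypothetical 2-D hyperplane w.x + b = 0, chosen by hand for illustration.
toy_w = [1.0, -1.0]
toy_b = 0.0
toy_C = 1.0  # penalty on margin violations, same role as C in LinearSVC

# (point, label) pairs with labels in {-1, +1}; the last point is misclassified.
toy_samples = [([2.0, 0.0], 1), ([0.0, 2.0], -1), ([0.5, 0.0], 1), ([0.0, 1.0], 1)]

def toy_decision(x):
    # Signed score: positive on one side of the hyperplane, negative on the other.
    return sum(wi * xi for wi, xi in zip(toy_w, x)) + toy_b

def toy_hinge(x, y):
    # Zero when the point is on the correct side with margin >= 1.
    return max(0.0, 1.0 - y * toy_decision(x))

toy_losses = [toy_hinge(x, y) for x, y in toy_samples]

# Soft-margin objective: small weights (wide margin) plus penalised violations.
toy_objective = 0.5 * sum(wi * wi for wi in toy_w) + toy_C * sum(toy_losses)
print(toy_losses, toy_objective)
```

The third point is classified correctly but lies inside the margin, so it is penalised less than the misclassified fourth point; a hard margin would not admit either.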

In [49]:
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(tv_train_features, Y_train)
svm_baseline = svm.score(tv_test_features, Y_test)
print('Test Accuracy:', svm_baseline)
#B

Test Accuracy: 0.8285


# Extreme Gradient Boosting (XGBoost)

XGBoost is an implementation of gradient-boosted decision trees. The trees are built sequentially: each new tree is fitted to the errors (the gradient of the loss) left by the trees built so far, so the examples that are currently predicted badly have the most influence on the next tree. The predictions of the individual trees are then added together, giving an ensemble that is stronger and more precise than any single tree.
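One way to picture sequential boosting is the sketch below: gradient boosting for squared error with one-split regression stumps on invented 1-D data. It is not XGBoost's algorithm (which adds regularisation, second-order information and much more); it only shows each new tree being fitted to the residuals of the current ensemble. All names are illustrative.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def boost(xs, ys, n_trees=20, learning_rate=0.5):
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # Residuals are the negative gradient of the squared loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return base, stumps, preds

toy_x = [1, 2, 3, 4, 5, 6]
toy_y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
base, stumps, preds = boost(toy_x, toy_y)
mse_start = sum((y - base) ** 2 for y in toy_y) / len(toy_y)
mse_end = sum((y - p) ** 2 for y, p in zip(toy_y, preds)) / len(toy_y)
print(mse_start, mse_end)
```

The training error after boosting is lower than that of the constant starting prediction, because every stump corrects part of what the previous ones left unexplained.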

In [50]:
xgboost = xgb.XGBClassifier(min_child_weight=1,
                                max_depth=6,
                                subsample=1,
                                colsample_bytree=1, random_state=42)
xgboost.fit(tv_train_features, Y_train)
xgboost_baseline = xgboost.score(tv_test_features, Y_test)
print('Test Accuracy:', xgboost_baseline)
#A



Test Accuracy: 0.839


In [51]:
Baseline_xgboost_svm=[xgboost_baseline,svm_baseline]
Baseline_xgboost_svm

[0.839, 0.8285]

In [52]:
pd.DataFrame(Baseline_xgboost_svm).to_csv(RESULTS_PATH+'/Baseline_xgboost_svm.csv')

# Random

In this case we randomly select a sample of words from each document. To replace the words with a synonym we load the GloVe embeddings, a vocabulary in which each word is stored as a vector. The word whose vector is most similar to that of the original word is used as the replacement.
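The notion of "most similar vector" can be sketched with cosine similarity over a tiny hand-made embedding table. The 3-dimensional vectors are invented (they are not GloVe values) and `most_similar_toy` is a hypothetical helper that only mirrors the idea of a nearest-neighbour lookup, including the fallback to the original word when it is out of vocabulary.

```python
import math

# Invented 3-d vectors for illustration; real GloVe vectors have 25 or more dimensions.
toy_vectors = {
    "loan": [0.9, 0.1, 0.0],
    "mortgage": [0.8, 0.2, 0.1],
    "card": [0.1, 0.9, 0.2],
    "bank": [0.5, 0.5, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_toy(word, vectors):
    """Return (neighbour, similarity), or (word, 1.0) if the word is unknown."""
    if word not in vectors:
        return word, 1.0
    candidates = (
        (other, cosine(vectors[word], v))
        for other, v in vectors.items()
        if other != word  # the query word itself is excluded
    )
    return max(candidates, key=lambda pair: pair[1])

print(most_similar_toy("loan", toy_vectors))
print(most_similar_toy("xxxx", toy_vectors))
```

Note that the nearest vector is not guaranteed to be a true synonym: embeddings place words used in similar contexts close together, which can include related or even opposite terms.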

In [8]:
vec=[]
for phrase in test_corpus:
    vec.append(phrase)

In [9]:
listword=[]
for phrase in vec:
    for word in phrase.split():
        listword.append(word)

In [10]:
n_srch=round(fraction*len(listword))

In [11]:
to_rep_rand=random.sample(listword,n_srch)
to_rep_rand
replaced=to_rep_rand.copy()

Random GloVe

In [12]:
glove_vectors = gensim.downloader.load('glove-twitter-25')

We tokenize each document so that we can work with lists of words instead of strings (the format of the original documents).

In [13]:
lol_w2v_test=[]
for i in range(len(test_corpus)):
    lol_w2v_test.append(nltk.word_tokenize(test_corpus[i]))

Initialize a new array in which we will apply the replacement

In [14]:
new_test_rand_glove=test_corpus.copy()

In [15]:
for i in range(len(lol_w2v_test)):
    #we randomly select a sample of words from each document
    n_srch=round(fraction*len(lol_w2v_test[i]))
    to_rep_rand=random.sample(lol_w2v_test[i],n_srch)
    rand_sims=[]
    replaced_rand_glove=[]
    for j in range(len(to_rep_rand)):
        #for each word we look for a synonym (the try is needed because the word may not be in the GloVe vocabulary)
        appoggio_rand=to_rep_rand[j]
        try:
            rand_sims.append(glove_vectors.most_similar(appoggio_rand,topn=1))
        except Exception:
            #out-of-vocabulary word: we keep the original word as its own replacement
            rand_sims.append([(appoggio_rand,1)])
        #the output of "most_similar" is a list of tuples, we only want a list of words (replaced_rand_glove)
        replaced_rand_glove.append(list(rand_sims[j][0])[0])
    for f in range(len(replaced_rand_glove)):
        #we apply the replacement in the text
        try:
            new_test_rand_glove[i] = new_test_rand_glove[i].replace(to_rep_rand[f],replaced_rand_glove[f],1)
        except Exception:
                pass
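
The loop above applies each replacement with `str.replace`, which substitutes the first matching substring and can therefore change a different word from the one that was sampled. A possible alternative, sketched here under our own assumptions, samples token positions and replaces them in place. `toy_synonyms` and `lookup` are hypothetical stand-ins for the embedding lookup.

```python
import random

def replace_random_fraction(tokens, fraction, synonym_of, rng):
    """Replace round(fraction * len(tokens)) randomly chosen token positions."""
    n = round(fraction * len(tokens))
    positions = rng.sample(range(len(tokens)), n)
    out = list(tokens)
    for pos in positions:
        out[pos] = synonym_of(out[pos])
    return out, sorted(positions)

# Hypothetical lookup standing in for the embedding model; unknown words are kept.
toy_synonyms = {"loan": "mortgage", "bank": "lender", "fee": "charge"}

def lookup(word):
    return toy_synonyms.get(word, word)

toy_tokens = "the bank charged a fee on my loan and the bank refused a refund".split()
perturbed, changed_positions = replace_random_fraction(
    toy_tokens, 0.2, lookup, random.Random(42)
)
print(len(changed_positions), " ".join(perturbed))
```

Working on positions keeps the number of touched tokens exactly at the target and leaves every other token unchanged.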
    

In [16]:
tv_test_features_rand_glove = tv.transform(new_test_rand_glove)

In [53]:
pd.DataFrame(tv_test_features_rand_glove).to_csv(RESULTS_PATH+'/test_features_random_5.csv')

In [19]:
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(tv_train_features, Y_train)
svm_random_5 = svm.score(tv_test_features_rand_glove, Y_test)
print('Test Accuracy:', svm_random_5)
#O

Test Accuracy: 0.82


In [20]:
xgboost = xgb.XGBClassifier(min_child_weight=1,
                                max_depth=6,
                                subsample=1,
                                colsample_bytree=1,random_state=42)
xgboost.fit(tv_train_features, Y_train)
xgboost_random_5 = xgboost.score(tv_test_features_rand_glove, Y_test)
print('Test Accuracy:', xgboost_random_5)
#N



Test Accuracy: 0.832


In [21]:
Random_xgboost_svm_5=[xgboost_random_5,svm_random_5]
Random_xgboost_svm_5

[0.832, 0.82]

In [22]:
pd.DataFrame(Random_xgboost_svm_5).to_csv(RESULTS_PATH+'/Random_xgboost_svm_5.csv')

# TF IDF Search Method

We initialize list1 as a list of documents and transform each document into a "TextBlob" so that we can use functions built on the textblob library. These functions calculate the tf_idf of each word and let us sort the words of a document by their tf_idf score.
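The selection step can be sketched without textblob as follows: score every distinct word of a document, sort by score, and keep the top fraction as replacement targets. The `tfidf` helper used in this notebook comes from Setup.ipynb and is not shown here, so the scoring below (relative term frequency times log(N/df)) is an assumption and may differ from it; the corpus is invented.

```python
import math

# Toy corpus of pre-tokenized documents, invented for illustration.
toy_corpus = [
    "credit card fee charged twice on my credit card".split(),
    "mortgage payment late fee".split(),
    "credit report error on my report".split(),
]

def rank_by_tfidf(tokens, corpus):
    """Return the distinct words of one document, highest TF-IDF first."""
    n_docs = len(corpus)
    scores = {}
    for word in set(tokens):
        tf = tokens.count(word) / len(tokens)
        df = sum(1 for doc in corpus if word in doc)
        scores[word] = tf * math.log(n_docs / df)
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))

toy_fraction = 0.25
toy_doc = toy_corpus[0]
n_targets = round(toy_fraction * len(toy_doc))
targets = rank_by_tfidf(toy_doc, toy_corpus)[:n_targets]
print(targets)
```

Words that are frequent in the document but rare in the corpus come first, so the replacement budget is spent on the words that distinguish the document most.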

In [23]:
list1 = test_corpus.tolist()

In [24]:
for i in range(len(list1)):
    list1[i]=tb(list1[i])

In [25]:
num_per_doc=[]
for i in range(len(test_corpus)):
    appoggio=[]
    for word in test_corpus[i].split():
        appoggio.append(word)
    num_per_doc.append(len(appoggio))

In [26]:
new_test_tf_glove=test_corpus.copy()

In this loop we build, for each document, the list of words to replace ("to_rep_tf") and the list of their synonyms taken from the GloVe vocabulary ("tf_sims"). We then apply the replacements to the array of documents ("new_test_tf_glove").

In [27]:
#this loop is similar to the one we used for the random search method (the structure is the same: find the words to replace, find synonyms, apply the replacement)
for i, blob in enumerate(list1):
    counter=0
    to_rep_tf=[]
    scores = {word: tfidf(word, blob, list1) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for s in range(round(num_per_doc[i]*fraction)):
        try:
            to_rep_tf.append(list(sorted_words[s])[0])
        except Exception:
            break
    tf_sims=[]
    for word in range(len(to_rep_tf)):
        appoggio=to_rep_tf[word]
        try:
            tf_sims.append(glove_vectors.most_similar(appoggio,topn=1))
        except Exception:
            tf_sims.append([(appoggio,1)])
    replaced_tf_glove=[]
    for word in range(len(tf_sims)):
        replaced_tf_glove.append(list(tf_sims[word][0])[0])
    for j in range(len(replaced_tf_glove)):
        a=0
        counter_same_word=0
        if to_rep_tf[j]==replaced_tf_glove[j]:
            pass
        while a!=1:
            #in the list we have the word just once, but we want to replace all the times it appears
            if to_rep_tf[j] in new_test_tf_glove[i]:
                new_test_tf_glove[i] = new_test_tf_glove[i].replace(to_rep_tf[j],replaced_tf_glove[j],1)
                counter+=1
                counter_same_word+=1
                # we use counter to keep the number of replacements at the target (fraction*number of words per doc)
                # we use counter_same_word because certain words can be part of other words, and so replacing them would change other words
                #to contain this error we replace each word at most 5 times
            else: 
                a=1
            if counter_same_word==5:
                break
        if counter== round(num_per_doc[i]*fraction):
            break
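
The comments in the loop above note that a target word can be part of other words, which is why replacements are capped at five per word. One possible alternative, not what this notebook does, is to match whole words only with a regular-expression word boundary. `replace_whole_word` is a hypothetical helper.

```python
import re

def replace_whole_word(text, target, replacement, max_count=0):
    """Replace `target` only where it stands as a whole word.

    Returns (new_text, number_of_replacements).
    max_count=0 means every occurrence is replaced (re.subn semantics).
    """
    pattern = r"\b" + re.escape(target) + r"\b"
    # A function is used as replacement so backslashes in it are taken literally.
    return re.subn(pattern, lambda match: replacement, text, count=max_count)

toy_text = "the card issuer discarded my card"

# Plain substring replacement also rewrites the inside of "discarded".
print(toy_text.replace("card", "loan"))

# Whole-word replacement leaves "discarded" untouched and reports the count.
print(replace_whole_word(toy_text, "card", "loan"))
```

The returned count could play the role of `counter_same_word`, since it reports how many occurrences were actually replaced.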

In [28]:
tv_test_features_tf_glove = tv.transform(new_test_tf_glove)

In [33]:
pd.DataFrame(tv_test_features_tf_glove).to_csv(RESULTS_PATH+'/test_features_tfidf_5.csv')

In [34]:
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(tv_train_features, Y_train)
svm_tfidf_5 = svm.score(tv_test_features_tf_glove, Y_test)
print('Test Accuracy:', svm_tfidf_5)
#C

Test Accuracy: 0.766


In [30]:
xgboost = xgb.XGBClassifier(min_child_weight=1,
                                max_depth=6,
                                subsample=1,
                                colsample_bytree=1, random_state=42)
xgboost.fit(tv_train_features, Y_train)
xgboost_tfidf_5 = xgboost.score(tv_test_features_tf_glove, Y_test)
print('Test Accuracy:', xgboost_tfidf_5)



Test Accuracy: 0.8025


In [35]:
tfidf_xgboost_svm_5=[xgboost_tfidf_5,svm_tfidf_5]
tfidf_xgboost_svm_5

[0.8025, 0.766]

In [36]:
pd.DataFrame(tfidf_xgboost_svm_5).to_csv(RESULTS_PATH+'/tfidf_xgboost_svm_5.csv')

# Weight based

The weight-based method is described in the paper as a method based on the distance of each word from the hyperplane: the closer a word is, the more important it is, and we want to replace the words that are most important for the algorithm. Our approach was to use the trained SVM to calculate the distance of each document from the hyperplane and to replace all the words in the documents that lie closest to it.

We find the distance of each doc from the hyperplane

In [37]:
y = svm.decision_function(tv_test_features)
w_norm = np.linalg.norm(svm.coef_)
dist = y / w_norm
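
For reference, the geometric distance of a point x from the hyperplane w·x + b = 0 is |w·x + b| / ||w||. The sketch below evaluates it with invented numbers. If the target has more than two classes, a one-vs-rest linear SVM has one hyperplane per class, so one option is to divide each class score by the norm of that class's own weight row; the cell above instead divides all scores by the norm of the whole `coef_` matrix, and which of the two is intended is for the authors to decide.

```python
import math

def distance_to_hyperplane(x, w, b):
    """Unsigned distance of point x from the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical per-class weight rows and intercepts (one-vs-rest style).
toy_class_weights = [[3.0, 4.0], [1.0, 0.0]]
toy_class_bias = [0.0, -1.0]
toy_point = [1.0, 1.0]

per_class_distance = [
    distance_to_hyperplane(toy_point, w, b)
    for w, b in zip(toy_class_weights, toy_class_bias)
]
print(per_class_distance)
```

The point lies exactly on the second class's hyperplane (distance 0), which under the weight-based method would make it one of the first candidates for replacement.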

In [38]:
distances= []
for i in range(len(dist)):
    distances.append(sqdist(dist[i]))

We order the documents

In [40]:
pd_dist=pd.Series(distances)
sorted_pd=pd_dist.sort_values()

In [41]:
indexes=[]
for i in range(round(len(sorted_pd)*fraction)):
    indexes.append(list(sorted_pd.index)[i])

In [42]:
vec=[]
for i in range(len(indexes)):
    vec.append(test_corpus[indexes[i]])

In [43]:
new_test_weight_glove=test_corpus.copy()

In [45]:
#the main idea behind the cycle is always the same, in this case instead of using replace we use the join function
#because we only want to join all the words from the replaced vector
for i in range(len(vec)):
    to_rep_weight=[]
    weight_sims=[]
    for word in vec[i].split():
        to_rep_weight.append(word)
    replaced_weight_glove=[]
    for j in range(len(to_rep_weight)):
        appoggio=to_rep_weight[j]
        try:
            weight_sims.append(glove_vectors.most_similar(appoggio,topn=1))
        except Exception:
            weight_sims.append([(appoggio,1)])
        replaced_weight_glove.append(list(weight_sims[j][0])[0])
    for j in range(len(replaced_weight_glove)):
        if to_rep_weight[j]==replaced_weight_glove[j]:
            pass
        try:
            new_test_weight_glove[indexes[i]] = listToString(replaced_weight_glove)
        except Exception:
            pass

In [54]:
tv_test_features_weight_glove = tv.transform(new_test_weight_glove)

In [55]:
pd.DataFrame(tv_test_features_weight_glove).to_csv(RESULTS_PATH+'/test_features_weight_5.csv')

In [56]:
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(tv_train_features, Y_train)
svm_weight_5 = svm.score(tv_test_features_weight_glove, Y_test)
print('Test Accuracy:', svm_weight_5)

Test Accuracy: 0.8165


In [57]:
xgboost = xgb.XGBClassifier(min_child_weight=1,
                                max_depth=6,
                                subsample=1,
                                colsample_bytree=1, random_state=42)
xgboost.fit(tv_train_features, Y_train)
xgboost_weight_5 = xgboost.score(tv_test_features_weight_glove, Y_test)
print('Test Accuracy:', xgboost_weight_5)



Test Accuracy: 0.8275


In [58]:
weight_xgboost_svm_5=[xgboost_weight_5,svm_weight_5]
weight_xgboost_svm_5

[0.8275, 0.8165]

In [59]:
pd.DataFrame(weight_xgboost_svm_5).to_csv(RESULTS_PATH+'/weight_xgboost_svm_5.csv')

# Other replace methods

In the previous examples we used the pretrained GloVe vocabulary, but it is not the only possible choice.

# Word2Vec

Word2Vec, like GloVe, provides a vocabulary of pretrained word vectors, but it was developed by a different group, covers a different set of words and assigns different vectors to them.

In [46]:
vectors = gensim.downloader.load( 'word2vec-google-news-300')

In [47]:
new_test_tf_w2v=test_corpus.copy()

In [48]:
#we just repeat the cycle we used for the tf-idf search method and we substitute the glove vectors with the w2v 
for i, blob in enumerate(list1):
    counter=0
    to_rep_tf=[]
    scores = {word: tfidf(word, blob, list1) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for s in range(round(num_per_doc[i]*fraction)):
        try:
            to_rep_tf.append(list(sorted_words[s])[0])
        except Exception:
            break
    tf_sims=[]
    for word in range(len(to_rep_tf)):
        appoggio=to_rep_tf[word]
        try:
            tf_sims.append(vectors.most_similar(appoggio,topn=1))
        except Exception:
            tf_sims.append([(appoggio,1)])
    replaced_tf_w2v=[]
    for word in range(len(tf_sims)):
        replaced_tf_w2v.append(list(tf_sims[word][0])[0])
    for j in range(len(replaced_tf_w2v)):
        a=0
        counter_same_word=0
        if to_rep_tf[j]==replaced_tf_w2v[j]:
            pass
        while a!=1:
            if to_rep_tf[j] in new_test_tf_w2v[i]:
                new_test_tf_w2v[i] = new_test_tf_w2v[i].replace(to_rep_tf[j],replaced_tf_w2v[j],1)
                counter+=1
                counter_same_word+=1
            else: 
                a=1
            if counter_same_word==5:
                break
        if counter== round(num_per_doc[i]*fraction):
            break

In [49]:
tv_test_features_tf_w2v = tv.transform(new_test_tf_w2v)

In [50]:
pd.DataFrame(tv_test_features_tf_w2v).to_csv(RESULTS_PATH+'/test_features_tf_w2v_5.csv')

In [51]:
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(tv_train_features, Y_train)
svm_tf_w2v_5 = svm.score(tv_test_features_tf_w2v, Y_test)
print('Test Accuracy:', svm_tf_w2v_5)

Test Accuracy: 0.765


In [52]:
xgboost = xgb.XGBClassifier(min_child_weight=1,
                                max_depth=6,
                                subsample=1,
                                colsample_bytree=1, random_state=42)
xgboost.fit(tv_train_features, Y_train)
xgboost_tf_w2v_5 = xgboost.score(tv_test_features_tf_w2v, Y_test)
print('Test Accuracy:', xgboost_tf_w2v_5)

Test Accuracy: 0.783


In [53]:
w2v_xgboost_svm_5=[xgboost_tf_w2v_5,svm_tf_w2v_5]
w2v_xgboost_svm_5

[0.783, 0.765]

In [56]:
pd.DataFrame(w2v_xgboost_svm_5).to_csv(RESULTS_PATH+'/w2v_xgboost_svm_5.csv')

# Trained Word2Vec

In this case we train our own vocabulary using the Word2Vec algorithm. To build the vocabulary we use the complaints of both the train and the test corpus.

We create a list of lists, which is the input required by the Word2Vec function (the function that creates the vocabulary matrix).

In [60]:
dic_w2v_train=[]
for i in range(len(train_corpus)):
    dic_w2v_train.append(nltk.word_tokenize(train_corpus[i]))

In [61]:
dic_w2v_test=[]
for i in range(len(test_corpus)):
    dic_w2v_test.append(nltk.word_tokenize(test_corpus[i]))

We join the two lists

In [62]:
dic_w2v=[]
dic_w2v= dic_w2v_train+dic_w2v_test

We create our model/vocabulary

In [63]:
model = gensim.models.Word2Vec(sentences=dic_w2v)

In [64]:
new_test_tf_cfpb=test_corpus.copy()

In [65]:
#we repeat the same cycle but with the synonyms found by the vocabulary that we trained
for i, blob in enumerate(list1):
    counter=0
    to_rep_tf=[]
    scores = {word: tfidf(word, blob, list1) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for s in range(round(num_per_doc[i]*fraction)):
        try:
            to_rep_tf.append(list(sorted_words[s])[0])
        except Exception:
            break
    tf_sims=[]
    for word in range(len(to_rep_tf)):
        appoggio=to_rep_tf[word]
        try:
            tf_sims.append(model.wv.most_similar(appoggio,topn=1))
        except Exception:
            tf_sims.append([(appoggio,1)])
    replaced_tf_cfpb=[]
    for word in range(len(tf_sims)):
        replaced_tf_cfpb.append(list(tf_sims[word][0])[0])
    for j in range(len(replaced_tf_cfpb)):
        a=0
        counter_same_word=0
        if to_rep_tf[j]==replaced_tf_cfpb[j]:
            pass
        while a!=1:
            if to_rep_tf[j] in new_test_tf_cfpb[i]:
                new_test_tf_cfpb[i] = new_test_tf_cfpb[i].replace(to_rep_tf[j],replaced_tf_cfpb[j],1)
                counter+=1
                counter_same_word+=1
            else: 
                a=1
            if counter_same_word==5:
                break
        if counter== round(num_per_doc[i]*fraction):
            break

In [66]:
tv_test_features_tf_cfpb= tv.transform(new_test_tf_cfpb)

In [67]:
pd.DataFrame(tv_test_features_tf_cfpb).to_csv(RESULTS_PATH+'/test_features_tf_cfpb_5.csv')

In [68]:
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(tv_train_features, Y_train)
svm_tf_cfpb_5 = svm.score(tv_test_features_tf_cfpb, Y_test)
print('Test Accuracy:', svm_tf_cfpb_5)
#H

Test Accuracy: 0.807


In [69]:
xgboost = xgb.XGBClassifier(min_child_weight=1,
                                max_depth=6,
                                subsample=1,
                                colsample_bytree=1, random_state=42)
xgboost.fit(tv_train_features, Y_train)
xgboost_tf_cfpb_5 = xgboost.score(tv_test_features_tf_cfpb, Y_test)
print('Test Accuracy:', xgboost_tf_cfpb_5)
#I



Test Accuracy: 0.8265


In [70]:
cfpb_xgboost_svm_5=[xgboost_tf_cfpb_5,svm_tf_cfpb_5]
cfpb_xgboost_svm_5

[0.8265, 0.807]

In [71]:
pd.DataFrame(cfpb_xgboost_svm_5).to_csv(RESULTS_PATH+'/cfpb_xgboost_svm_5.csv')
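
The accuracies printed in the cells above can be gathered into one comparison. The sketch below computes the accuracy drop of each method relative to the baseline; the numbers are copied from the outputs of this notebook (fraction = 0.05), while the dictionary layout and labels are only our own arrangement.

```python
baseline_acc = {"xgboost": 0.839, "svm": 0.8285}

perturbed_acc = {
    "random + GloVe": {"xgboost": 0.832, "svm": 0.82},
    "tf-idf + GloVe": {"xgboost": 0.8025, "svm": 0.766},
    "weight + GloVe": {"xgboost": 0.8275, "svm": 0.8165},
    "tf-idf + Word2Vec (Google News)": {"xgboost": 0.783, "svm": 0.765},
    "tf-idf + Word2Vec (trained on CFPB)": {"xgboost": 0.8265, "svm": 0.807},
}

# Drop = baseline accuracy minus accuracy on the perturbed test set.
accuracy_drop = {
    method: {name: round(baseline_acc[name] - acc, 4) for name, acc in scores.items()}
    for method, scores in perturbed_acc.items()
}

for method, drop in accuracy_drop.items():
    print(f"{method:38s} xgboost -{drop['xgboost']:.4f}   svm -{drop['svm']:.4f}")
```

In these runs the TF-IDF-targeted replacements with pretrained embeddings lower accuracy more than random replacement, and SVM loses more accuracy than XGBoost under every method.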