# Pre-process Test Set

---


Pre-processing the test set in an NLP project is not trivial. The test set must end up with the same shape and format as the training set, so in a sense both sets have to be pre-processed "together".

More precisely, pre-processing the training set produced a vocabulary of n-grams, and the test set must be encoded with that same vocabulary, in the same order. It also produced a tfidf representation, where idf stands for 'inverse document frequency'... *in the training corpus*. The idf of the test corpus is irrelevant. Both the vocabulary and the idf of the training set therefore need to be kept: in a way, they are part of the learning process.
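
To make the shared-vocabulary idea concrete, here is a minimal pure-Python sketch. The helpers `build_vocabulary` and `encode` are hypothetical and are not the API of `custom.clean_preprocess`; reserving column 0 for out-of-vocabulary tokens is an assumption that mirrors the 'unknown' column shown in the tables further down.

```python
from collections import Counter

def build_vocabulary(counters, vocabulary_size):
    """Map the most common training tokens to column indices 1..vocabulary_size."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    most_common = total.most_common(vocabulary_size)
    # column 0 is reserved for tokens that are not in the vocabulary
    return {token: ix + 1 for ix, (token, _) in enumerate(most_common)}

def encode(counter, vocabulary):
    """Turn one document's token counts into a fixed-length count vector."""
    row = [0] * (len(vocabulary) + 1)
    for token, count in counter.items():
        # unseen tokens fall back to column 0
        row[vocabulary.get(token, 0)] += count
    return row

# toy corpora (hypothetical data)
train_counters = [Counter("call me now".split()), Counter("call you later".split())]
test_counters = [Counter("please call me".split())]

# the vocabulary is learned from the training counters only...
vocabulary = build_vocabulary(train_counters, vocabulary_size=3)

# ...and reused, unchanged, to encode both sets
train_rows = [encode(c, vocabulary) for c in train_counters]
test_rows = [encode(c, vocabulary) for c in test_counters]
```

Because the test documents are encoded with the training vocabulary, every row has the same length and column order in both sets, and any token the training corpus never ranked (here 'please') is counted in column 0.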


### Setup

In [1]:
import os
import time
import json

import numpy as np
import pandas as pd

from datetime import datetime

start_ = time.time()
dt_object = datetime.fromtimestamp(start_)
day, T = str(dt_object).split('.')[0].split(' ')
print('Revised on: ' + day)

Revised on: 2021-02-18


### Load Data

In [2]:
import urlextract
from nltk.stem import WordNetLemmatizer

def load_data(data):
    raw_path = os.path.join("data","1_raw")
    filename = ''.join([data, ".csv"])
    out_dfm = pd.read_csv(os.path.join(raw_path, filename))
    out_arr = np.array(out_dfm.iloc[:,0].ravel())
    return out_arr

X_train = load_data("X_train") 
y_train = load_data("y_train") 
X_test = load_data("X_test") 
y_test = load_data("y_test") 

In [3]:
def make_int(y_array):
    y = y_array.copy()
    y[y=='ham'] = 0
    y[y=='spam'] = 1
    y = y.astype('int')
    return y

y_test_int = make_int(y_test)
y_train_int = make_int(y_train)

# load contractions map for custom cleanup
with open("contractions_map.json") as f:
    contractions_map = json.load(f)

### Define Pipeline

In [4]:
import custom.clean_preprocess as cp
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

pipe = Pipeline([('counter', cp.DocumentToNgramCounterTransformer(n_grams=3)),
                 ('bot', cp.WordCounterToVectorTransformer(vocabulary_size=2000)), # careful
                 ('tfidf', TfidfTransformer(sublinear_tf=True))]) # careful

### Counters & Transformer

I create counters for both the training and test sets, but fit the transformer on the training corpus only. This yields the same 2,000 features (plus one 'unknown' column) in both data sets:

In [5]:
# counters
X_train_counter = pipe['counter'].fit_transform(X_train) 
X_test_counter = pipe['counter'].fit_transform(X_test) 

X_train_transformer = pipe['bot'].fit(X_train_counter) # only on the training counter

Here are the first 11 words in the vocabulary:

In [6]:
first_11_vocab = [w for (ct, w) in enumerate(X_train_transformer.vocabulary_) if ct < 11]
print(first_11_vocab)

['NUM', 'i', 'you', 'u', 'me', 'not', 'my', 'your', 'am', 'have', 'call']


### Bag-of-upto-Trigrams

In [7]:
# BoTs
X_train_bot = X_train_transformer.transform(X_train_counter)
X_test_bot = X_train_transformer.transform(X_test_counter) # same transformer

In [8]:
# sanity checks
X_train_bot, X_test_bot

(<3900x2001 sparse matrix of type '<class 'numpy.intc'>'
 	with 59102 stored elements in Compressed Sparse Row format>,
 <1672x2001 sparse matrix of type '<class 'numpy.intc'>'
 	with 24657 stored elements in Compressed Sparse Row format>)

Both bags of trigrams share the same vocabulary, hence the same 2,001 columns.

__Train BoT__

In [9]:
# train BoT
pd.DataFrame(X_train_bot[0:6, 0:12].toarray() 
            , columns=['unknown'] + first_11_vocab)

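|   | unknown | NUM | i | you | u | me | not | my | your | am | have | call |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 60 | 7 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 24 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 21 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 23 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |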


In [10]:
print(X_train[4])

Single line with a big meaning::::: \Miss anything 4 ur \"Best Life\" but


In [11]:
print(X_train_counter[4])

Counter({'single': 1, 'line': 1, 'big': 1, 'meaning': 1, 'miss': 1, 'anything': 1, 'NUM': 1, 'ur': 1, 'best': 1, 'life': 1, 'but': 1, 'single_line': 1, 'line_with': 1, 'with_a': 1, 'a_big': 1, 'big_meaning': 1, 'meaning_miss': 1, 'miss_anything': 1, 'anything_NUM': 1, 'NUM_ur': 1, 'ur_best': 1, 'best_life': 1, 'life_but': 1, 'single_line_with': 1, 'line_with_a': 1, 'with_a_big': 1, 'a_big_meaning': 1, 'big_meaning_miss': 1, 'meaning_miss_anything': 1, 'miss_anything_NUM': 1, 'anything_NUM_ur': 1, 'NUM_ur_best': 1, 'ur_best_life': 1, 'best_life_but': 1})


__Test BoT__

In [12]:
# test BoT
pd.DataFrame(X_test_bot[0:6, 0:12].toarray() 
            , columns=['unknown'] + first_11_vocab)

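|   | unknown | NUM | i | you | u | me | not | my | your | am | have | call |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | 0 | 1 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 26 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 83 | 0 | 1 | 3 | 0 | 2 | 2 | 2 | 2 | 0 | 1 | 0 |
| 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |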


In [13]:
print(X_test[3])

Any chance you might have had with me evaporated as soon as you violated my privacy by stealing my phone number from your employer's paperwork. Not cool at all. Please do not contact me again or I will report you to your supervisor.


In [14]:
print(X_test_counter[3])

Counter({'you': 3, 'me': 2, 'my': 2, 'your': 2, 'not': 2, 'any': 1, 'chance': 1, 'might': 1, 'have': 1, 'had': 1, 'evaporated': 1, 'soon': 1, 'violated': 1, 'privacy': 1, 'stealing': 1, 'phone': 1, 'number': 1, 'employer': 1, 'paperwork': 1, 'cool': 1, 'all': 1, 'please': 1, 'do': 1, 'contact': 1, 'again': 1, 'or': 1, 'i': 1, 'report': 1, 'supervisor': 1, 'any_chance': 1, 'chance_you': 1, 'you_might': 1, 'might_have': 1, 'have_had': 1, 'had_with': 1, 'with_me': 1, 'me_evaporated': 1, 'evaporated_as': 1, 'as_soon': 1, 'soon_as': 1, 'as_you': 1, 'you_violated': 1, 'violated_my': 1, 'my_privacy': 1, 'privacy_by': 1, 'by_stealing': 1, 'stealing_my': 1, 'my_phone': 1, 'phone_number': 1, 'number_from': 1, 'from_your': 1, 'your_employers': 1, 'employers_paperwork': 1, 'paperwork_not': 1, 'not_cool': 1, 'cool_at': 1, 'at_all': 1, 'all_please': 1, 'please_do': 1, 'do_not': 1, 'not_contact': 1, 'contact_me': 1, 'me_again': 1, 'again_or': 1, 'or_i': 1, 'i_will': 1, 'will_report': 1, 'report_you':

### Tfidf Representation

The `fit` method of Scikit-Learn's **TfidfTransformer** learns the idf vector, as its docstring in the [source](https://github.com/scikit-learn/scikit-learn/blob/95119c13a/sklearn/feature_extraction/text.py#L1314) states:

```
def fit(self, X, y=None):
    """Learn the idf vector (global term weights).
```

The idf vector must describe the training corpus only, so we learn these "inverse document frequencies" from the training set alone. In short, we cannot use `fit_transform`; we need to separate `fit` and `transform` so that we can fit on the training BoT and then transform each BoT in turn:

In [15]:
# fit
X_train_fit = pipe['tfidf'].fit(X_train_bot)

In [16]:
# vector of idfs
print(X_train_fit.idf_)

[1.01576056 2.34733003 1.98691455 ... 7.07176363 7.07176363 7.07176363]
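
For intuition, the fit/transform split can be reproduced by hand. The sketch below assumes `TfidfTransformer`'s default settings (`smooth_idf=True`, `norm='l2'`) together with the `sublinear_tf=True` option used in the pipeline; the helpers `fit_idf` and `tfidf_transform` and the toy count matrices are hypothetical and only illustrate the arithmetic on small dense arrays.

```python
import numpy as np

def fit_idf(train_counts):
    """Learn smoothed idf weights from the training count matrix only."""
    n_docs = train_counts.shape[0]
    # number of training documents in which each term appears
    doc_freq = np.count_nonzero(train_counts, axis=0)
    return np.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_transform(counts, idf):
    """Apply sublinear tf scaling, the given idf weights and L2 row normalization."""
    counts = counts.astype(float)
    tf = np.zeros_like(counts)
    nonzero = counts > 0
    tf[nonzero] = 1 + np.log(counts[nonzero])  # sublinear tf: 1 + ln(tf)
    weighted = tf * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows untouched
    return weighted / norms

# toy count matrices (hypothetical data): rows are documents, columns are terms
train_counts = np.array([[2, 1, 0], [1, 0, 0], [3, 0, 1]])
test_counts = np.array([[1, 1, 1]])

idf = fit_idf(train_counts)                       # learned from the training set only
train_tfidf = tfidf_transform(train_counts, idf)
test_tfidf = tfidf_transform(test_counts, idf)    # reuses the training idf
```

Under this formula, the largest value printed above (7.07176363) is what a term appearing in 8 of the 3,900 training documents would receive, since ln(3901 / 9) + 1 ≈ 7.0718.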


In [17]:
# transform
X_train_tfidf = X_train_fit.transform(X_train_bot)
X_test_tfidf = X_train_fit.transform(X_test_bot)

In [18]:
# sanity checks
X_train_tfidf, X_test_tfidf

(<3900x2001 sparse matrix of type '<class 'numpy.float64'>'
 	with 59102 stored elements in Compressed Sparse Row format>,
 <1672x2001 sparse matrix of type '<class 'numpy.float64'>'
 	with 24657 stored elements in Compressed Sparse Row format>)

We expect the tfidf values to differ between the two sets, of course, and the test set to show a higher proportion of unknown tokens overall. The unknown bucket is the first column:

In [19]:
print(X_train_tfidf[:5,:5].toarray())

[[0.26829044 0.         0.         0.         0.        ]
 [0.15615824 0.20867906 0.         0.         0.08553675]
 [0.20873123 0.         0.         0.11189586 0.        ]
 [0.27765099 0.         0.         0.         0.        ]
 [0.20303381 0.11600693 0.         0.         0.        ]]


In [20]:
print(X_test_tfidf[:5,:5].toarray())

[[0.2312656  0.         0.1027846  0.19926706 0.        ]
 [0.25345525 0.         0.11643234 0.13331726 0.        ]
 [0.66482534 0.         0.         0.         0.        ]
 [0.19054668 0.         0.06878327 0.16528286 0.        ]
 [0.35660928 0.         0.         0.         0.        ]]


In [21]:
print(np.mean(X_train_tfidf[:,0:1].toarray()))
print(np.mean(X_test_tfidf[:,0:1].toarray()))

0.2254668870816081
0.23216159731868138


### SVD Projection

I changed the default to 800 components and perform no scaling.

In [22]:
from scipy.sparse.linalg import svds
from sklearn.utils.extmath import svd_flip

def perform_SVD(X, n_components=800): 
    
    X_array = X.asfptype()
    U, Sigma, VT = svds(X_array.T, 
                        k=n_components)
    # reverse outputs
    Sigma = Sigma[::-1]
    U, VT = svd_flip(U[:, ::-1], VT[::-1])
    
    # return V
    return VT.T

X_train_svd = perform_SVD(X_train_tfidf)
X_test_svd = perform_SVD(X_test_tfidf)
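
Note that the cell above factorizes the training and test matrices separately, so the two sets of 800 columns do not share a basis. In keeping with the idea of learning only from the training corpus, one alternative is to learn the singular vectors on the training matrix and project the test matrix onto them. The following is a minimal sketch under that assumption, on small dense arrays; the helpers `fit_svd` and `project` are hypothetical and this is not the method used in this notebook.

```python
import numpy as np

def fit_svd(X_train, n_components):
    """Learn the top right-singular vectors and singular values from the training matrix."""
    _, sigma, components = np.linalg.svd(X_train, full_matrices=False)
    return components[:n_components], sigma[:n_components]

def project(X, components, sigma):
    """Express the rows of X in the basis learned by fit_svd (no extra scaling)."""
    return X @ components.T / sigma

# toy data (hypothetical): 6 training rows, 3 test rows, 4 features
rng = np.random.default_rng(0)
X_train_toy = rng.random((6, 4))
X_test_toy = rng.random((3, 4))

components, sigma = fit_svd(X_train_toy, n_components=2)  # training set only
X_train_proj = project(X_train_toy, components, sigma)
X_test_proj = project(X_test_toy, components, sigma)      # same basis as training
```

Dividing by the singular values makes the training projection equal to the truncated matrix of document-side singular vectors, which is what `perform_SVD` returns up to the sign of each column, while the test rows are mapped into that same space.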

### Cosine Similarities

In [23]:
import scipy.sparse as sp
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

def calc_cossims(X, y, svd):
    # all similarities
    similarities = cosine_similarity(svd)
    # spam similarities
    df = pd.DataFrame({'sms':X, 'target':y}) 
    spam_ix = df.loc[df['target']=='spam'].index
    # mean spam sims
    mean_spam_sims = []
    for ix in range(similarities.shape[0]):
        mean_spam_sim = np.mean(similarities[ix, spam_ix])
        mean_spam_sims.append(mean_spam_sim)
    return csr_matrix(mean_spam_sims).T

# calculate the mean spam similarities features
X_train_spamcos = calc_cossims(X_train, y_train, X_train_svd)
X_test_spamcos = calc_cossims(X_test, y_test, X_test_svd)

In [24]:
# sanity check
X_train_spamcos, X_test_spamcos

(<3900x1 sparse matrix of type '<class 'numpy.float64'>'
 	with 3899 stored elements in Compressed Sparse Column format>,
 <1672x1 sparse matrix of type '<class 'numpy.float64'>'
 	with 1672 stored elements in Compressed Sparse Column format>)
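
As written, `calc_cossims` needs the labels of whichever set it is given, so the test feature is built from `y_test`, which would not be available for unseen messages. One label-free alternative, assuming training and test rows live in the same projected space, is to compare every row against the training spam rows only. A minimal sketch with a hypothetical helper `mean_sim_to_spam` and toy data:

```python
import numpy as np

def mean_sim_to_spam(X, X_train, y_train):
    """Mean cosine similarity of each row of X to the training rows labelled 'spam'."""
    spam_rows = X_train[np.asarray(y_train) == 'spam']

    def unit(M):
        # scale each row to unit length; all-zero rows are left as zeros
        norms = np.linalg.norm(M, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        return M / norms

    # dot products of unit vectors are cosine similarities
    return (unit(X) @ unit(spam_rows).T).mean(axis=1)

# toy data (hypothetical): 3 training rows, 2 test rows, 2 components
X_train_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_train_toy = np.array(['spam', 'ham', 'spam'])
X_test_toy = np.array([[2.0, 0.0], [0.0, 3.0]])

train_feature = mean_sim_to_spam(X_train_toy, X_train_toy, y_train_toy)
test_feature = mean_sim_to_spam(X_test_toy, X_train_toy, y_train_toy)  # no test labels needed
```

Only `y_train` appears in the call for the test set, so the feature can be computed for any new message.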

### Stack

- add the spam cosine similarities feature to the SVD projection of the tfidf-ed bag-of-upto-trigrams

In [25]:
# stack cossim feature
X_train_processed = sp.hstack((X_train_spamcos, X_train_svd))
X_test_processed = sp.hstack((X_test_spamcos, X_test_svd))

In [26]:
# sanity check
X_train_processed, X_test_processed

(<3900x801 sparse matrix of type '<class 'numpy.float64'>'
 	with 3123099 stored elements in COOrdinate format>,
 <1672x801 sparse matrix of type '<class 'numpy.float64'>'
 	with 1339272 stored elements in COOrdinate format>)

### Persist


In [27]:
def persist(X, filename):
    proc_dir = os.path.join("data", "2_processed")
    filename = ''.join([filename, '.npz'])
    sp.save_npz(os.path.join(proc_dir, filename), X)

In [28]:
persist(X_train_processed, 'X_train_processed')

In [29]:
persist(X_test_processed, 'X_test_processed')

In [30]:
m, s = divmod(time.time() - start_, 60)
print(f'Elapsed: {m:0.0f} m {s:0.0f} s')

Elapsed: 0 m 41 s


---