# Pipeline: NLP Only


In this notebook we focus solely on the TF-IDF vectorizer and on optimizing the features it produces.


Sources:
- Sample pipeline for text feature extraction and evaluation: https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html
- Metrics and scoring: quantifying the quality of predictions: https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values
- Demonstration of multi-metric evaluation on cross_val_score and GridSearchCV: https://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html


In [1]:
import pandas as pd
import numpy as np
import glob
import os
import munge_help
from time import time

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

import utils

import xgboost as xgb

#### Extract NVD data

In [2]:
X_train = utils.load_obj(path=os.path.join('data_processed', 'description_train_raw.pkl'))
y_train = utils.load_obj(path=os.path.join('data_processed', 'y_train.pkl'))

In [3]:
X_train.head()

128645    In Advantech WebAccess versions V8.2_20170817 ...
83452     Cross-site request forgery (CSRF) vulnerabilit...
91453     Cross-site scripting (XSS) vulnerability in Hu...
57320     The Extbase Framework in TYPO3 4.6.x through 4...
146442    An issue was discovered in GNU LibreDWG 0.7 an...
Name: description, dtype: object

In [4]:
X_train.shape

(102492,)

### Make Pipeline

In [5]:
pipeline = Pipeline([
    ('vect', TfidfVectorizer(encoding='utf-8',
                             strip_accents='ascii',
                             lowercase=True,
                             analyzer='word', 
                             stop_words='english',
                             binary=False, 
                             norm='l2', 
                             use_idf=True, 
                             smooth_idf=True)
    ),
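    ('clf', xgb.XGBClassifier(n_estimators=100,   # number of boosting rounds
                              learning_rate=0.9,  # scikit-learn API name for eta
                              max_depth=6,
                              # num_boost_round belongs to xgb.train(); the wrapper uses n_estimators
                              subsample=0.9,
                              n_jobs=-1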
                             )
    )
]
)

The most important parameters to fine-tune are:
- max_df: when building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold.
- min_df: when building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.
- max_features: build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
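
To make the three thresholds concrete, here is a minimal pure-Python sketch of the pruning they describe. It is a simplified imitation for illustration, not scikit-learn's implementation: the function `prune_vocabulary`, the whitespace tokenizer, and the four toy documents are all assumptions introduced here, and float thresholds are treated as proportions of the number of documents.

```python
from collections import Counter


def prune_vocabulary(docs, max_df=1.0, min_df=0.0, max_features=None):
    """Toy imitation of document-frequency pruning (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Term frequency: total number of occurrences across the corpus.
    tf = Counter(term for tokens in tokenized for term in tokens)

    # Float thresholds are interpreted as proportions of the corpus size.
    high = max_df * n_docs
    low = min_df * n_docs

    # Drop terms strictly above max_df or strictly below min_df.
    kept = [term for term in df if low <= df[term] <= high]

    # Of the survivors, keep the max_features most frequent terms
    # (ties broken alphabetically so the result is deterministic).
    kept.sort(key=lambda term: (-tf[term], term))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)


# Made-up descriptions in the style of vulnerability summaries.
docs = [
    "buffer overflow in parser",
    "sql injection in login form",
    "buffer overflow in login handler",
    "cross site scripting in form",
]

# "in" occurs in all 4 documents (100% > 75%), so max_df=0.75 removes it.
print(prune_vocabulary(docs, max_df=0.75))
# min_df=0.5 additionally removes terms found in fewer than 2 documents.
print(prune_vocabulary(docs, max_df=0.75, min_df=0.5))
# max_features=2 keeps only the two most frequent surviving terms.
print(prune_vocabulary(docs, max_df=0.75, max_features=2))
```

In this toy corpus, a higher min_df or a lower max_features shrinks the vocabulary quickly, which is why these parameters are worth searching over together.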

In [6]:
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__min_df': (0, 0.05, 0.1),
    'vect__max_features': (100, 150, 200),
    'vect__ngram_range': ((1, 2), (1, 3)),  # unigrams + bigrams, or unigrams through trigrams
}
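
A grid like this multiplies out quickly, so it can help to count the candidates before launching the search. The sketch below enumerates the combinations with `itertools.product`. The fold count of 5 is an assumption taken from the fit log in this notebook, not from an explicit `cv` argument.

```python
from itertools import product

# Same grid as in the cell above.
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__min_df': (0, 0.05, 0.1),
    'vect__max_features': (100, 150, 200),
    'vect__ngram_range': ((1, 2), (1, 3)),
}

# Every combination of one value per parameter is one candidate.
candidates = [
    dict(zip(parameters, combo))
    for combo in product(*parameters.values())
]

n_folds = 5  # assumption: matches the "Fitting 5 folds" line in the fit log
total_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
```

Each extra value on any axis scales the total multiplicatively, so a cheap count like this is a useful sanity check on the expected runtime.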

In [7]:
#instantiate grid search
grid_search = GridSearchCV(pipeline, 
                           parameters, 
                           n_jobs=-1, 
                           verbose=10, #lots of details
                           scoring=['roc_auc', 'f1'],
                           refit='roc_auc', 
                           return_train_score=True
                          )

# start the timer
t0 = time()


#fit to training data
grid_search.fit(X_train, y_train)

# report the elapsed wall-clock time
print("done in %0.3fs" % (time() - t0))

Fitting 5 folds for each of 54 candidates, totalling 270 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  6.8min
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed: 11.1min
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed: 13.3min
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed: 16.8min
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 20.3min
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed: 24.1min
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed: 28.1min
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed: 33.4min
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed: 37.5min
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed: 43.1min
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed: 48.8min
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed: 57.0min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 63.3min
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed: 70

done in 6995.898s


In [8]:
print("Best score: %0.3f" % grid_search.best_score_)
print('\n')
print(20*'#')
print('\n')
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Best score: 0.865


####################


Best parameters set:
	vect__max_df: 0.75
	vect__max_features: 200
	vect__min_df: 0
	vect__ngram_range: (1, 2)


In [9]:
#dict with keys as column headers and values as columns
results = grid_search.cv_results_

results.keys()

dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_vect__max_df', 'param_vect__max_features', 'param_vect__min_df', 'param_vect__ngram_range', 'params', 'split0_test_roc_auc', 'split1_test_roc_auc', 'split2_test_roc_auc', 'split3_test_roc_auc', 'split4_test_roc_auc', 'mean_test_roc_auc', 'std_test_roc_auc', 'rank_test_roc_auc', 'split0_train_roc_auc', 'split1_train_roc_auc', 'split2_train_roc_auc', 'split3_train_roc_auc', 'split4_train_roc_auc', 'mean_train_roc_auc', 'std_train_roc_auc', 'split0_test_f1', 'split1_test_f1', 'split2_test_f1', 'split3_test_f1', 'split4_test_f1', 'mean_test_f1', 'std_test_f1', 'rank_test_f1', 'split0_train_f1', 'split1_train_f1', 'split2_train_f1', 'split3_train_f1', 'split4_train_f1', 'mean_train_f1', 'std_train_f1'])
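
Because two metrics were scored, each one gets its own `mean_test_*`, `rank_test_*`, and `mean_train_*` columns. The sketch below shows one way to read them. It uses a small stand-in dictionary with made-up numbers, not the real search output, and the helpers `best_by` and `train_test_gap` are hypothetical names introduced here.

```python
# Toy stand-in for grid_search.cv_results_; all numbers are made up.
toy_results = {
    'params': [
        {'vect__max_features': 100},
        {'vect__max_features': 150},
        {'vect__max_features': 200},
    ],
    'mean_test_roc_auc': [0.81, 0.84, 0.86],
    'mean_train_roc_auc': [0.83, 0.87, 0.90],
    'rank_test_roc_auc': [3, 2, 1],
    'mean_test_f1': [0.60, 0.66, 0.65],
    'rank_test_f1': [3, 1, 2],
}


def best_by(results, metric):
    """Return (index, params, mean test score) of the rank-1 candidate."""
    ranks = list(results['rank_test_' + metric])
    i = ranks.index(1)
    return i, results['params'][i], results['mean_test_' + metric][i]


def train_test_gap(results, metric, i):
    """Mean train score minus mean test score; a large gap hints at overfitting."""
    return results['mean_train_' + metric][i] - results['mean_test_' + metric][i]


for metric in ('roc_auc', 'f1'):
    i, params, score = best_by(toy_results, metric)
    print(f"best by {metric}: candidate {i}, params={params}, score={score:.3f}")

i_auc, _, _ = best_by(toy_results, 'roc_auc')
print(f"roc_auc train/test gap: {train_test_gap(toy_results, 'roc_auc', i_auc):.3f}")
```

The same functions should work on the real `results` dictionary, since they only rely on keys listed in the output above. The two metrics can disagree on the best candidate, and `refit='roc_auc'` decides which one `best_estimator_` is built from.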

In [12]:
utils.save_obj(obj = grid_search,
               path = os.path.join('artifacts', 'grid_search_nlp_2020-11-28.pkl'))
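
The `utils` module is local to this project and not shown in the notebook. Given the `.pkl` file names, a plausible guess is that `save_obj` and `load_obj` are thin wrappers around `pickle`. The sketch below is that assumption written out, not the project's actual code.

```python
import os
import pickle
import tempfile


def save_obj(obj, path):
    """Pickle obj to path, creating the parent directory if needed (assumed behavior)."""
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'wb') as f:
        pickle.dump(obj, f)


def load_obj(path):
    """Load a pickled object from path (assumed behavior)."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Round-trip a small placeholder object in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'artifacts', 'demo.pkl')
    save_obj(obj={'best_score': 0.865}, path=demo_path)
    restored = load_obj(path=demo_path)

print(restored)
```

A pickled grid search generally needs compatible versions of scikit-learn and xgboost to load again, so recording those versions next to the artifact is worth considering.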