# The Machine Learning Process Applied to Natural Language
Date: 2020-28-01  
Author: Jason Beach  
Categories: Process, DataScience 
Tags: machine-learning, best-practice, nlp   
<!--eofm-->

The Machine Learning Process is a reproducible method for developing and deploying automated intelligent solutions.  However, its generality means that it lacks detail for specific areas of focus.  One such area is Natural Language Processing (NLP).  NLP encompasses many different problem types, but this post focuses on classification problems.

## Revising the Machine Learning Process

To fit the Machine Learning (ML) Process to NLP problems, we need to make a few revisions.  Let's revisit the diagram, which succinctly displays the steps, and then review each step individually.

![machine learning process](images/ml_process.png)

### Discover

We still want to place the primary emphasis on determining the problem and how to create business value.  However, because we are focused on NLP, the pool of model families is reduced.  We also need to think about constraints in time and hardware, as more modern models can be quite demanding.

The selection of dimensions depends on the word embeddings and the type of model used.  Newer models, such as FastText, build word representations from character n-grams, rather than relying only on the whole-word vectors that are more traditionally used.  For the initial stages of data exploration, more traditional approaches are still valuable.  A few of the ways we can represent language include:

* simple occurrence count
* TF-IDF
* trigrams, n-grams
* n-character subset
* word embedding
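
The simpler representations in this list can be sketched with the standard library alone.  This is a minimal illustration, not a production tokenizer: the function names, the whitespace tokenization, and the `<`/`>` word-boundary markers (in the style of subword models) are assumptions made for the example.

```python
from collections import Counter


def word_counts(text):
    """Simple occurrence count of lower-cased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def word_ngrams(text, n):
    """Contiguous word n-grams (n=3 gives trigrams)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(word, n):
    """Character n-grams of one word, padded with assumed boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# Hypothetical sample sentence, used only for illustration.
sample = "the trade was booked and the trade was cancelled"
counts = word_counts(sample)         # token -> number of occurrences
trigrams = word_ngrams(sample, 3)    # list of 3-token tuples
subwords = char_ngrams("trade", 3)   # 3-character subsets of one word
```

In practice a library vectorizer would likely replace these helpers; the sketch is only meant to make the representations concrete during early exploration.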

Newer NLP models often come with pre-trained versions, which can greatly speed up training and improve primary outcome metrics.  However, a pre-trained model can be a hindrance if the text to which it is applied differs from the text on which it was originally trained.  Comparing samples of the original text with the target text can be very valuable.  Some methods for doing this are described [here]( {{< ref "/posts/blog_page-todo.md" >}}).   

A few of the most traditional models, as well as some popular, network-based models include:

* Naive Bayes
* Logistic regression
* Support vector machines
* FastText
* ULMFiT
* Google’s BERT

An additional consideration is that customers may have a current system that your solution is meant to replace.  The legacy systems may include lexicons of regular expressions, as well as complicated decision trees.  If the customer expects certain statements to be 'hits' under any condition, then these may have to be incorporated into your solution.  At the least, they will be important aspects of the final evaluation of your solution.

Evaluation in text settings is typically based on the Precision-Recall curve.  Perhaps reduction of False Positives is of importance to a compliance office, so Precision should be emphasized.  Maybe every Positive text is important to hit, in which case, Recall should be high so that there are no misses.  Often, a high F1 score strikes a balance between the two.  Accuracy is a good secondary evaluation, but of less importance.
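
To make the trade-off concrete, the following sketch computes Precision, Recall, and F1 from binary labels.  The function name and the toy labels are assumptions for illustration; a real evaluation would use the holdout data and, typically, a library implementation.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

A compliance office that wants fewer False Positives would watch `precision`; a team that cannot afford misses would watch `recall`.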

### Collect and Transform

Data types to account for:

* communication formats: chat, email, voice
* document, file, and attachment formats: pdf, docx, png
* international content in multiple languages
* encoding of plain text

The ETL team performs data splitting before handover.  The preferred method is cross-validation.  However, this is often impractical during some steps because tagged training data must be created manually.  In that case, perform cross-validation only where tagged data is available.  

The standard operating procedure (SOP) is to split:

 * 10% is used, up-front, as our sample to verify ETL (3-4 iterations)
 * 75% is used for training
   - chat dataset
   - email dataset
   - voice dataset
   - document dataset
 * 25% is kept separate as a holdout for testing
 
 
The 75% used for training can be kept as independent datasets, separated by source.  This helps data validation be more consistent.
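
The split can be sketched as follows.  Since 75% and 25% already cover the whole dataset, this sketch assumes the 10% ETL verification sample is drawn from the full set up-front and therefore overlaps the other two; that reading, the record layout, and the function name are assumptions, not part of the SOP.

```python
import random


def split_records(records, seed=0):
    """Return (etl_sample, train_by_source, holdout) for a list of records."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)

    # Assumption: the 10% verification sample is taken from all records.
    etl_sample = rng.sample(shuffled, int(n * 0.10))

    # 75% for training, the remaining 25% held out for testing.
    cut = int(n * 0.75)
    train, holdout = shuffled[:cut], shuffled[cut:]

    # Keep training data as independent datasets, one per source.
    train_by_source = {}
    for rec in train:
        train_by_source.setdefault(rec["source"], []).append(rec)
    return etl_sample, train_by_source, holdout


# Hypothetical records with an id and a source field.
sources = ("chat", "email", "voice", "document")
records = [{"id": i, "source": sources[i % 4]} for i in range(200)]
etl_sample, train_by_source, holdout = split_records(records)
```

Fixing the seed matters here: if data arrives incrementally and the split is rerun, a stable assignment keeps records from drifting between training and holdout.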

Data validation with the up-front 10% sample is a very important aspect of NLP solutions because there can be great variety within the sources being used.  During ETL verification, the raw data must be easily accessible for comparison, in case there is any doubt.  Note any peculiarities that may later become sources of bias.
 
Because text data usually involves a large number of files that are delivered incrementally, it may be necessary to begin exploring and summarizing, and even training models, while the data is still arriving.  This risks introducing bias, but it may be necessary from a business perspective.  One approach for dealing with this is [here]( {{< ref "/posts/blog_math-reuse_tng_sample.md" >}}).   

### Summary and Process

Summarizing

* word occurrence
* TF-IDF
* n-grams

Perform a manual review of messages to get a feel for culture and style.

Identify opportunities to filter out records based on tags or metadata.

Pay particular attention to the metadata, reply chains, and disclaimers to identify changes. 


Compare the target text with the text used in pre-training, using the methods [here]( {{< ref "/posts/blog_page-todo.md" >}}).  Similar comparisons can be made:

* among data sources
* within data source, across time intervals
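
As a rough stand-in for such comparisons (this is not the method from the linked post), the sketch below measures vocabulary overlap and the out-of-vocabulary rate between two samples of text.  The function names, the whitespace tokenization, and the sample sentences are assumptions for illustration.

```python
def vocabulary(texts):
    """Set of lower-cased, whitespace-separated tokens across all texts."""
    return {tok for text in texts for tok in text.lower().split()}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def oov_rate(reference_vocab, target_texts):
    """Share of target tokens that never appear in the reference vocabulary."""
    tokens = [tok for text in target_texts for tok in text.lower().split()]
    if not tokens:
        return 0.0
    return sum(tok not in reference_vocab for tok in tokens) / len(tokens)


# Hypothetical samples: pre-training style text versus target chat text.
reference = ["the market closed higher today", "the report was published today"]
target = ["pls call re the trade", "the trade was booked today"]

ref_vocab = vocabulary(reference)
overlap = jaccard(ref_vocab, vocabulary(target))
oov = oov_rate(ref_vocab, target)
```

The same two numbers can be computed between data sources, or between time intervals within one source, to flag shifts worth a manual look.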


Pre-processing is a type of feature engineering; the following aspects affect the model:

* lower-case, stop word removal, etc
* auto-gen: disclaimers, newsletter, spam, etc
* selection of clean sentence
* contacts in-scope
* metadata
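
A few of these steps can be sketched as a small function.  The stop-word list and the disclaimer pattern are illustrative assumptions; real disclaimers and auto-generated content vary by customer and need patterns built from the manual review.

```python
import re

# Assumption: a tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "was"}

# Assumption: a hypothetical disclaimer that runs to the end of the message.
DISCLAIMER = re.compile(
    r"this (e-?mail|message) is confidential.*",
    re.IGNORECASE | re.DOTALL,
)


def preprocess(text):
    """Strip the disclaimer, lower-case, tokenize, and drop stop words."""
    text = DISCLAIMER.sub("", text)
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


raw = (
    "The trade was booked.\n"
    "This email is confidential and intended only for the recipient."
)
tokens = preprocess(raw)
```

Some models, particularly pre-trained ones, may perform better with less aggressive cleaning, so each step is worth evaluating rather than applying by default.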

Apply pre-trained models, before training on the target dataset.  Evaluation is important for current and future use.

Balance the dataset by under-sampling the negative class.
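
Under-sampling can be sketched as below.  The record layout, the function name, and the default 1:1 ratio are assumptions for the example; the target ratio is a modeling decision.

```python
import random


def undersample_negatives(records, ratio=1.0, seed=0):
    """Keep all positives and at most ratio * len(positives) negatives."""
    positives = [r for r in records if r["label"] == 1]
    negatives = [r for r in records if r["label"] == 0]
    keep = min(len(negatives), int(len(positives) * ratio))
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    balanced = positives + rng.sample(negatives, keep)
    rng.shuffle(balanced)
    return balanced


# Hypothetical data: 10 positives among 100 records.
records = [{"id": i, "label": 1 if i < 10 else 0} for i in range(100)]
balanced = undersample_negatives(records)
```

This should be applied to the training data only; the holdout set should keep its natural class balance so that Precision and Recall reflect real conditions.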

### Build

Regex 

Lexicon testing: apply the converted lexicons and test their performance on the customer data.

* Apply the lexicon to the non-holdout data
* Record hits
* Compare to "gold standard" hits from the other system (all lexicons, all participants, subset of timeframe)
* Revise lexicons
* Rerun
* Uplift: use news/spam/reply chain analysis to reduce false positives
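
One pass of the lexicon test can be sketched as follows.  The lexicon entries, messages, and "gold standard" hits are all hypothetical; in practice the patterns come from the customer's converted lexicons and the gold hits from the legacy system.

```python
import re

# Hypothetical lexicon: name -> compiled regular expression.
lexicon = {
    "guarantee": re.compile(r"\bguarantee[ds]?\b", re.IGNORECASE),
    "off_channel": re.compile(
        r"\b(call|text) me on my (cell|personal)\b", re.IGNORECASE
    ),
}


def lexicon_hits(messages, lexicon):
    """Return the set of (message_id, lexicon_name) pairs that match."""
    hits = set()
    for msg_id, text in messages.items():
        for name, pattern in lexicon.items():
            if pattern.search(text):
                hits.add((msg_id, name))
    return hits


def compare_to_gold(hits, gold):
    """Split hits into those matching gold, missed by us, and extra."""
    return {"matched": hits & gold, "missed": gold - hits, "extra": hits - gold}


# Hypothetical non-holdout messages and legacy-system hits.
messages = {
    "m1": "We guarantee this return.",
    "m2": "Call me on my cell after lunch.",
    "m3": "Lunch at noon?",
    "m4": "Returns are guaranteed, trust me.",
}
gold = {("m1", "guarantee"), ("m2", "off_channel"), ("m3", "guarantee")}

hits = lexicon_hits(messages, lexicon)
report = compare_to_gold(hits, gold)
```

The `missed` set drives lexicon revisions, while the `extra` set is where news, spam, and reply chain analysis can help reduce false positives before the rerun.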


Models

Manually creating tagged training data is the most time-intensive aspect.


Iterations

* INITIAL LOOK
  - initial run on OOB models on partial chat and email (because didn't have all data)
  - review and assess data contents. plan accordingly 
* CHAT TUNING
  - begin model training in Cognition using chat training data
  - iterate over chat training KG 2x
* EMAIL TUNING
  - apply models to partial email training set (1.8m emails)
  - tune models and engine filtering based on findings (note: this process was ridiculously truncated)
* DEMO
  - run updated engine on chat training and email training set (1.8m emails)
  - review ~6% of the alerts. triage. select a sub-set of best alerts to demo

....


### Deliver

## Demonstration

## Conclusion