# Natural Language Processing

## General: Goals

- create portfolio-worthy data science projects
- learn NLP and practice ML by building classifiers that use open text fields as input
- create a Python framework that generalizes the 'Intro To Text Analytics With R' workflow
- try the framework with Twitter data instead of SMS data 

---


## General: ML Project Structure

These steps do not need to be sequential. For example, feature engineering can be done sooner. They are also cyclic: in practice, one might need to continuously re-define the problem, and iterate on EDA and modeling steps.

1. define the problem, set expectations and evaluation criteria
2. preliminary and minimal EDA and pre-splitting cleanup
3. split dataset into training and test subsets; set the test subset aside
4. create a cleanup and preprocessing pipeline for the training data that can be re-applied to the test data
5. train a couple of baseline models to ensure the process is smooth and pre-processing is dialed in
6. using cross-validation, evaluate a variety of models without hyperparameter tuning to establish some baselines
7. short-list promising models for further hyperparameter tuning
8. iterate on any phase of the project as needed
9. consider feature selection and feature engineering
10. decide on a final cleanup and processing pipeline
11. settle on a final model
12. re-apply all cleanup and processing steps to the test set and evaluate final model - once
13. create a final presentation of the solution for technical and non-technical audiences
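
As a rough illustration of steps 3, 4, 6, and 12 above, here is a minimal sketch using scikit-learn. The toy corpus, the choice of `TfidfVectorizer` with `LogisticRegression`, and all variable names are assumptions made for the example, not this project's actual pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 1 = spam, 0 = ham.
spam = [f"win a free prize now call {i}" for i in range(20)]
ham = [f"see you at lunch around {i}" for i in range(20)]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# Step 3: split once and set the test subset aside.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Step 4: preprocessing lives inside the pipeline, so it is fit on
# training data only and re-applied unchanged to the test data.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Step 6: cross-validated baseline on the training subset.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Step 12: fit on all training data and evaluate on the test set, once.
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(cv_scores.mean(), test_accuracy)
```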

---

## General: Project Definition

There is far more unstructured text data than structured data. To leverage it, one needs to apply text analytics techniques that give the data structure before any value can be extracted from it.

Value can be defined in many ways; here are a couple:

**Framework for Binary Classification** 


Given the goal of binary classification, say, separating *spam* from *ham* (legitimate) messages, or *positive* (happier) from *negative* (unhappier) comments on social media, one can build classifiers that learn from a corpus and can predict, given a new instance (a new message or comment), whether it is spam or ham, or positive or negative.

Both cases are well-known and mostly solved. It is less clear to me whether classifiers built for spam detection would also perform well, say, predicting negative Tweets. Having a framework that quickly deploys and assesses the accuracy of classifiers for binary predictions given open text fields seems a valuable pursuit.
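
One possible shape for such a framework is a helper that runs several candidate classifiers over any labeled text corpus and reports cross-validated accuracy. The sketch below is only an illustration: `compare_classifiers`, the toy sentiment corpus, and the two candidate models are all assumptions, not part of the original tutorial.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_classifiers(texts, labels, classifiers, cv=5):
    """Return the mean cross-validated accuracy for each named classifier."""
    results = {}
    for name, clf in classifiers.items():
        # Vectorizing inside the pipeline keeps each CV fold leak-free.
        pipeline = make_pipeline(CountVectorizer(), clf)
        results[name] = cross_val_score(pipeline, texts, labels, cv=cv).mean()
    return results


# Hypothetical toy sentiment corpus; 1 = positive, 0 = negative.
positive = [f"love this great day number {i}" for i in range(10)]
negative = [f"hate this awful day number {i}" for i in range(10)]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

candidates = {"logistic": LogisticRegression(), "naive_bayes": MultinomialNB()}
scores = compare_classifiers(texts, labels, candidates)
print(scores)
```

Swapping in a different corpus (SMS, Tweets) only changes `texts` and `labels`, which is the point of keeping the helper generic.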


**Entity Extraction** 

A common need in businesses that capture text data is to be able to extract keywords from this text data. Say a business has an online app that historically has captured information in a text field, but nobody has had the time to read that input. Going forward, product managers decide this field should be turned into a drop-down menu. To integrate the past and future versions of this field, there's a need to bucket the unstructured text into categories for the new drop-down. To solve this, a data scientist can apply entity extraction techniques:

> "Named-entity recognition (NER) [...] is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories..." [Wikipedia, accessed Dec 19, 2020](https://en.wikipedia.org/wiki/Named-entity_recognition)
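
Before reaching for a statistical NER model, the bucketing task described above can be prototyped with a plain keyword lookup. This is a deliberately simple baseline, not NER proper; the categories, keywords, and the `bucket_text` helper are hypothetical.

```python
import re

# Hypothetical drop-down categories and the keywords that map to them.
CATEGORY_KEYWORDS = {
    "billing": {"invoice", "refund", "charge"},
    "login": {"password", "login", "account"},
}


def bucket_text(text, default="other"):
    """Assign free text to the first category whose keywords it mentions."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    for category, keywords in CATEGORY_KEYWORDS.items():
        if tokens & keywords:
            return category
    return default


print(bucket_text("I need a refund for this charge"))
print(bucket_text("Forgot my PASSWORD"))
print(bucket_text("hello there"))
```

The first matching category wins, so the order of the dictionary matters when a message mentions keywords from several buckets.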



---

## This Project: Introduction to Text Analytics With R

This is a Python offshoot of the original YouTube series tutorial by David Langer from Data Science Dojo: [DSD's Introduction to Text Analytics with R](https://www.youtube.com/playlist?list=PLTJTBoU5HOCR5Vkah2Z-AU76ZYsZjGFK6)


#### What is the problem this project is trying to solve?

Separate SPAM from HAM (legitimate) SMS messages.


#### What are the expectations?

That the classifier built for this purpose can do this task quickly and accurately enough to avoid frustrating users.


#### What does success look like for this project?


A classifier with high specificity, meaning it rarely sends legitimate SMS to the spam folder, which would be far more frustrating than the occasional unfiltered spam. The classifier should also have reasonably high sensitivity, meaning it filters most spam messages. These definitions treat spam as the positive class, which is consistent with the metrics reported under Status History.
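
Which class counts as positive determines which of the two metrics protects legitimate messages, so the sketch below takes the positive label as a parameter. The helper name and the toy labels are assumptions for illustration; here 1 stands for spam.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    # Sensitivity: share of actual positives that were caught.
    # Specificity: share of actual negatives that were left alone.
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels: 4 spam (1) and 6 ham (0); one spam missed, one ham flagged.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)  # 0.75 and 5/6
```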

## This Project: The Data

The dataset is now spread across the internet; a good source is the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection#). The dataset is a collage of sources; the data collection process is explained by the original authors, Tiago A. Almeida and José María Gómez Hidalgo, on this [University of Campinas website](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/).

## This Project: Management

#### Status History

*Notebook 3: Tfidf*

- Unigram Tfidf with 450 terms, logistic classifier: 0.975 accuracy

*Notebook 4: Bigrams*

- Bag-of-upto-Bigrams with 500 terms, logistic classifier: 0.985 accuracy

*Notebook 5: Ngrams*
    
- Bag-of-upto-Trigrams (BoT) with 2,000 terms, logistic classifier: 0.986 accuracy
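
For reference, a bag of up-to-trigrams can be built with scikit-learn's `CountVectorizer` and `ngram_range=(1, 3)`. This is a minimal sketch on two made-up messages; whether the notebooks used this exact class, and `max_features` to cap the vocabulary, is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical messages, three tokens each.
docs = ["free prize now", "call me now"]

# ngram_range=(1, 3) keeps unigrams, bigrams, and trigrams.
# max_features (e.g. 2000) would keep only the most frequent terms.
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(docs)

# 5 unigrams + 4 bigrams + 2 trigrams = 11 columns.
print(X.shape)
print(sorted(vectorizer.vocabulary_))
```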
    
*Notebook 6a-b: Dimensionality Reduction*

- SVD with 300 components on Tfidf: 0.9864 accuracy, 0.9089 sensitivity, 0.9982 specificity
- SVD not clearly advantageous with a logistic classifier, but that will change with more complex models
- Notebook 6b experiments with scaling vs not: similar results 
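
A sketch of the Tfidf-then-SVD step, assuming scikit-learn's `TruncatedSVD` (the notebooks may have used a different implementation). The toy corpus and the 5-component setting are placeholders; the notebooks report 300 or more components on the real data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus of 40 short messages.
spam = [f"win a free prize now call {i}" for i in range(10, 30)]
ham = [f"see you at lunch around {i}" for i in range(10, 30)]
texts = spam + ham

# Tfidf produces a sparse document-term matrix; TruncatedSVD works on
# sparse input directly and projects it onto a few dense components.
lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=5, random_state=0),
)
X_reduced = lsa.fit_transform(texts)
print(X_reduced.shape)  # one row per message, one column per component
```

A scaler could be appended to the pipeline to reproduce the scaled-versus-unscaled comparison.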

*Notebook 7: Feature Engineering*

- After data viz, the first feature (raw document length) is the most useful in separating the target
- After modeling, all features but the RSR achieve best accuracy (0.8823), yet low sensitivity (0.3270)

*Notebook 8: Cosine Similarity*

- Benefit of spam cosine similarity unclear with logistic classifier; this changes with random forests
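
One way to read "spam cosine similarity" is as a single extra feature: each document's mean cosine similarity to the known spam documents. The NumPy sketch below follows that reading, which is an assumption about what the notebook computes; the helper name and tiny vectors are illustrative.

```python
import numpy as np


def mean_cosine_similarity_to(X, reference):
    """Mean cosine similarity of each row of X to the rows of reference.

    Assumes no all-zero rows, since those have no defined direction.
    """
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    ref_unit = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    return (X_unit @ ref_unit.T).mean(axis=1)


# Toy document vectors (e.g. Tfidf or SVD rows) and labels; 1 = spam.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1, 0, 1])

# Note: spam rows are compared with themselves here, which inflates
# their score; a real pipeline should use training spam only.
spam_similarity = mean_cosine_similarity_to(X, X[y == 1])
print(spam_similarity)
```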
    
*Notebook 9: Comparing Representations*

- Use logistic classifier to compare twelve possible representations
- BoT performs well, also Tfidf with more features (except cosine similarity)
- Sensitivity needs most improvement, cannot break 0.9069 without random forests

*Notebook 10: Random Forests 1*

- Study hyperparameters, conduct grid search on BoT
- Consider moving the decision threshold (plots precision-recall curves) to gain sensitivity
    - a strategy that should be reserved for the final stages, since it doesn't improve the classifier itself
    
```
threshold: 0.5
accuracy: 0.9887
sensitivity: 0.9225
specificity: 0.9988

threshold: 0.2
accuracy: 0.9785
sensitivity: 0.9767
specificity: 0.9787
```
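
Moving the decision threshold amounts to comparing the predicted spam probability against a cutoff other than 0.5. A minimal sketch with made-up probabilities follows; with a fitted scikit-learn classifier they would typically come from `predict_proba(X)[:, 1]`. The helper name is hypothetical.

```python
import numpy as np


def predict_with_threshold(spam_probabilities, threshold=0.5):
    """Label a message as spam (1) when its probability meets the cutoff."""
    return (np.asarray(spam_probabilities) >= threshold).astype(int)


# Hypothetical predicted spam probabilities for four messages.
proba = [0.9, 0.35, 0.25, 0.1]

default_labels = predict_with_threshold(proba, threshold=0.5)
lowered_labels = predict_with_threshold(proba, threshold=0.2)
print(default_labels)  # [1 0 0 0]
print(lowered_labels)  # [1 1 1 0]
```

Lowering the cutoff flags more messages as spam, which raises sensitivity at the cost of specificity, as in the two result blocks above.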


*Notebook 11: Random Forests 2*

- Grid search all 12 representations using a shallower test param grid
- Compare run times with a py script (25min notebook, 15min command line)
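
A grid search over random forest hyperparameters might look like the sketch below, assuming scikit-learn's `GridSearchCV`. The synthetic data, the small grid, and the choice to optimize recall (sensitivity) are assumptions for illustration, not the grids used in the notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for one document representation, with a
# minority positive class roughly mimicking the spam share.
X, y = make_classification(
    n_samples=200, n_features=20, weights=[0.87, 0.13], random_state=0
)

# A deliberately shallow grid so the search finishes quickly.
param_grid = {
    "max_depth": [4, 8],
    "max_features": [5, 10],
    "n_estimators": [25],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="recall",  # recall of the positive class = sensitivity
    cv=3,
)
search.fit(X, y)
best_params = search.best_params_
print(best_params, search.best_score_)
```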

*Notebooks 12a-c: Random Forests 3*

- Run grid searches on notebooks since py script fails too often

*Notebook 13: Random Forests 4*

- Evaluate results of previous grid searches; find issues with low sensitivity and SVD scaling
- Go back and study SVD scaling (Notebook 6)
- Unscaled SVD speeds up training and improves sensitivity to 0.95 with 0.5 threshold

*Notebook 14: Random Forests 5*

- More grid searches, best results on unscaled SVD w/ 500 components on Tfidf, w/ spam cosine similarities
- Quick one-time predictions are too variable; 10-fold CV and a study of the variation are needed

```
threshold: 0.5
accuracy: 0.9925
sensitivity: 0.9639
specificity: 0.9968
```
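
To study that variation, sensitivity can be collected per fold and summarized by its mean and standard deviation. This is a minimal sketch on synthetic data, assuming scikit-learn's `cross_val_score` with `scoring="recall"`; the model settings are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the document features.
X, y = make_classification(
    n_samples=300, n_features=20, weights=[0.87, 0.13], random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# One sensitivity (recall of the positive class) value per fold.
sensitivities = cross_val_score(model, X, y, cv=10, scoring="recall")
mean_sens, std_sens = sensitivities.mean(), sensitivities.std()
print(f"sensitivity: {mean_sens:.3f} +/- {std_sens:.3f}")
```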

*Notebook 15: Random Forests 6*

- Upping the number of components in SVD is the most helpful tactic
- The final model uses an 800-component SVD
- The final params are: ```{'max_depth': 8, 'max_features': 150, 'min_samples_split': 3, 'n_estimators': 100}```
- The best decision threshold is $0.3$
- Accuracy, specificity, and most importantly sensitivity balance out at $\approx{99.2\%}$ with little variation
- Mean validation sensitivity was 0.9742 with a mean fit time of $\approx 45$ sec

*Notebook 15: Voting Classifier*

- Wisdom of the crowds doesn't hold up when there's an expert
- Random forest vastly outperforms the simpler models and combinations thereof (includes LR, SVC)
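
The comparison between a lone random forest and a voting ensemble can be sketched as follows. Hard (majority) voting, the synthetic data, and all model settings are assumptions; the notebook may have used soft voting or different members.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(
    n_samples=300, n_features=20, weights=[0.87, 0.13], random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Hard voting: each member casts one vote and the majority label wins.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svc", SVC()),
        ("rf", forest),
    ],
    voting="hard",
)

# Compare cross-validated sensitivity of the forest alone vs the ensemble.
recall_by_model = {
    name: cross_val_score(model, X, y, cv=5, scoring="recall").mean()
    for name, model in {"forest": forest, "ensemble": ensemble}.items()
}
print(recall_by_model)
```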

*Notebook 16: SVM 1*

- Study SVC class, stumbles around

*Notebook 17: SVM 2*

- Grid search SVC varying C and gamma yields 86% sensitivity
- Increase voting classifier's sensitivity to 93%, still well below random forest alone


#### House cleaning

- make sure all notebooks have attributions, nice intro, etc.
- make sure notebooks make sense to someone unfamiliar with project
    - needs intermediate notebooks explaining text cleanup with demos
- update README


#### Current Status



#### Future Steps

- Settle on a cleanup-preprocessing pipeline
    - Ex. if using Tfidf, keep idf, etc. (SVD, cosine similarity)
- Create a script for the pipeline 
    - Make sure it works with the test dataset
- Modeling:
    - SGD
    - Boosting
    - Ensembles
    - Select a final model, select two other promising ones
- Evaluation: evaluate the chosen model on the test set, and evaluate the two runners-up to see whether the decision was correct?
    - Question: is this a scientific way to evaluate final model selection?
- Presentation: ooph

#### Topic Modeling

- Latent Dirichlet Allocation (LDA): study, read
- lda2Vec (word2vec): study, read
- NER: study, read

#### Statistical Modeling

- SGD
- Boosting
- Ensembles

#### Feature Selection

- Random Forest + VarImp plot 
- LASSO
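
Both ideas can be prototyped in a few lines. The sketch assumes scikit-learn, uses impurity-based `feature_importances_` as a stand-in for an R-style VarImp plot, and uses L1-penalized logistic regression as the classification analog of LASSO. The data is synthetic and `C=0.1` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)

# Random forest: rank features by impurity-based importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important first

# L1 penalty: features whose coefficient shrinks to zero are dropped.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(lasso.coef_[0])

print("forest ranking:", ranking)
print("features kept by L1:", kept)
```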

#### Evaluation

- MCC
- confusion matrix
- specific predictions
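
These three checks are all available from a pair of label vectors. A minimal sketch with scikit-learn's `confusion_matrix` and `matthews_corrcoef` on made-up labels (1 = spam is an assumption):

```python
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Toy labels: 4 spam (1) and 6 ham (0); one spam missed, one ham flagged.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)) = 14/24 here.
mcc = matthews_corrcoef(y_true, y_pred)

# Indices of specific predictions worth reading by hand.
misclassified = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]
print(tn, fp, fn, tp, round(mcc, 4), misclassified)
```

MCC stays informative on imbalanced data, where accuracy alone can look high even when most spam is missed.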

#### Thoughts

- what about using TextBlob?

---