# Natural Language Processing

## General Goals

- study NLP and practice ML by building classifiers that use open text fields as input
- create a Python framework that generalizes a workflow in R for detecting SMS spam messages
- evaluate the effectiveness of the framework with Twitter data to predict some binary outcome

---


## General ML Project Structure

These steps do not need to be sequential. For example, feature engineering can be done sooner. They are also cyclic: in practice, one might need to continuously re-define the problem and iterate on exploratory data analysis and statistical modeling steps:

1. define the problem, set expectations and evaluation criteria
2. preliminary and minimal EDA and pre-splitting cleanup
3. split the dataset into training and test subsets; set the test subset aside
4. create a cleanup and preprocessing pipeline for the training data that can be re-applied to the test data
5. train a couple of baseline models to ensure the process runs smoothly and preprocessing is dialed in
6. using cross-validation, evaluate a variety of models without hyperparameter tuning to establish some baselines
7. short-list promising models for further hyperparameter tuning
8. iterate on any phase of the project as needed
9. consider feature selection and feature engineering
10. decide on a final cleanup and processing pipeline
11. settle on a final model
12. re-apply all cleanup and processing steps to the test set and evaluate final model(s)
13. create a final presentation of the solution for technical and non-technical audiences
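
Steps 3 and 6 above can be sketched in a few lines. The following is a minimal illustration, not the project's actual code: it uses only the Python standard library, the function names (`stratified_split`, `k_fold_indices`, `majority_baseline_accuracy`) are hypothetical, and the labels are a toy list invented for the example. It splits the data once, sets the test indices aside, and cross-validates a trivial "always predict the majority class" baseline on the training indices only. In practice a library such as scikit-learn would likely handle both the split and the cross-validation; the point here is only that the test subset never enters the cross-validation loop.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with similar class proportions in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def k_fold_indices(indices, k=5):
    """Yield (fit_idx, validation_idx) pairs built from the given indices."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield fit, validation


def majority_baseline_accuracy(labels, fit_idx, validation_idx):
    """Accuracy of always predicting the most common class seen in fit_idx."""
    majority = Counter(labels[i] for i in fit_idx).most_common(1)[0][0]
    hits = sum(labels[i] == majority for i in validation_idx)
    return hits / len(validation_idx)


# Toy labels invented for illustration: 80 ham and 20 spam messages.
labels = ["ham"] * 80 + ["spam"] * 20

# Step 3: split once and set the test indices aside.
train_idx, test_idx = stratified_split(labels)

# Step 6: cross-validate a baseline using the training indices only.
scores = [
    majority_baseline_accuracy(labels, fit, val)
    for fit, val in k_fold_indices(train_idx, k=5)
]
mean_score = sum(scores) / len(scores)
print(f"baseline accuracy: {mean_score:.2f}")
```
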
---

## Project Inspiration

There is a lot more unstructured text data than structured data. To leverage it, one needs to apply text-analytics techniques to structure the data before getting value from it.

Value can be defined in many ways; here are a couple:

**Framework for Binary Classification** 


Given the goal of binary classification, say, to separate *spam* from *ham* (legitimate) messages, or *positive* (happier) from *negative* (unhappier) comments on social media, one can build classifiers that learn from a corpus of documents and predict an outcome given an instance (a text, post, or comment).

Both cases are well-known and mostly solved. It is less clear to me whether classifiers built for spam detection would also perform well for predicting negative Tweets. Having a framework that quickly deploys and assesses the accuracy of classifiers for binary predictions given open text fields seems a valuable pursuit.


**Named-Entity Recognition** 

A common need in businesses that capture data through open text fields is to be able to extract keywords from variable-size text inputs. Say a business has an online app that historically has captured information through an open text field but nobody has had the time to read that input. Going forward, product owners decide this field should be turned into a drop-down menu. To integrate the past and future versions of this field, there's a need to bucket the unstructured text into categories for the new drop-down. To solve this, a data scientist can apply entity extraction techniques such as NER:

> "Named-entity recognition (NER) [...] is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories..." [Wikipedia, accessed Dec 19, 2020](https://en.wikipedia.org/wiki/Named-entity_recognition)


---

## Project Definition


#### What is the problem this project is trying to solve?

Automatically separate *spam* from *ham* (legitimate) SMS (short message service, aka "text") messages.


#### What are the expectations?

That the classifier built for this purpose can do this task quickly and accurately enough to avoid frustrating SMS users. I'm also not building an app or hosting the spam detector online, just building the classifier in Jupyter Notebooks and presenting results in slides or some other friendly format.

#### What does success look like for this project?

Train a classifier that achieves high accuracy, but not at the expense of either sensitivity or specificity. Taking spam as the positive class (it is a spam detector, after all), the classifier should correctly classify most spam as spam (sensitivity, or true positive rate) and, most importantly, most ham as ham (specificity, or true negative rate). Misclassifying ham is the worse mistake: sending a legitimate message to the spam folder, where it is likely to be ignored, is worse than letting some spam end up in the inbox, where it can be quickly deleted.

Success looks like a 98% (give or take 1%) F1-score (the harmonic mean of precision and recall, where recall is the same as sensitivity) and a quick, subsecond prediction pipeline for new inputs.
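
To make these success criteria concrete, here is a minimal sketch of how the metrics relate to the four cells of a confusion matrix, with spam as the positive class. It assumes nothing beyond the Python standard library; the helper names (`confusion_counts`, `summarize_metrics`) are hypothetical and the ten labels are invented for illustration. F1 is computed with its standard definition, the harmonic mean of precision and recall, which does not involve true negatives, so specificity is reported separately.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives for the given positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1  # ham sent to the spam folder: the costly mistake
        else:
            if truth == positive:
                fn += 1  # spam that reached the inbox
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize_metrics(tp, fp, tn, fn):
    """Derive the metrics discussed above from confusion-matrix counts."""
    sensitivity = safe_ratio(tp, tp + fn)  # recall, true positive rate
    specificity = safe_ratio(tn, tn + fp)  # true negative rate
    precision = safe_ratio(tp, tp + fp)
    f1 = safe_ratio(2 * precision * sensitivity, precision + sensitivity)
    accuracy = safe_ratio(tp + tn, tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Toy labels invented for illustration: 4 spam and 6 ham messages.
y_true = ["spam"] * 4 + ["ham"] * 6
y_pred = ["spam", "spam", "spam", "ham"] + ["spam"] + ["ham"] * 5

metrics = summarize_metrics(*confusion_counts(y_true, y_pred))
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case one ham message is flagged as spam, which lowers specificity to 5/6 without appearing in the sensitivity figure. That is why a single F1 number is worth reading alongside specificity when false positives are the more expensive error.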

## Data

The dataset is now spread across the internet; a good source is the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection#). The dataset is a collage of sources; the data collection process is explained by the original authors, Tiago A. Almeida and José María Gómez Hidalgo, on this [University of Campinas website](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/).
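
As a rough sketch of how the data might be loaded, the snippet below assumes the UCI download is a plain-text file with one message per line, formatted as a label (`ham` or `spam`), a tab, and the raw message text. The file name `SMSSpamCollection` in the comment and the helper `parse_sms_lines` are assumptions for illustration, and the three sample rows are invented rather than taken from the dataset.

```python
from collections import Counter


def parse_sms_lines(lines):
    """Split 'label<TAB>text' lines into parallel lists of labels and texts."""
    labels, texts = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, in case a message itself contains tabs.
        label, text = line.split("\t", 1)
        labels.append(label)
        texts.append(text)
    return labels, texts


# Invented sample rows in the assumed format, for illustration only.
sample = [
    "ham\tAre we still on for lunch today?\n",
    "spam\tYou have won a prize! Reply now to claim it.\n",
    "ham\tRunning late, see you in ten minutes.\n",
]

labels, texts = parse_sms_lines(sample)
class_counts = Counter(labels)
print(class_counts)

# With the real file (assumed name), the same function would be used as:
# with open("SMSSpamCollection", encoding="utf-8") as handle:
#     labels, texts = parse_sms_lines(handle)
```

Checking the class counts right after loading is a cheap way to see how imbalanced the two classes are before choosing evaluation metrics.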

## Acknowledgements

This journey into the fields of NLP and ML took months of learning and development of my own understanding of various inner workings of models I never ended up deploying. I am indebted to numerous tutorials and blogs I've read and watched along the way. Below is a list in order of most-to-least influential:

- [Data Science Dojo's](https://datasciencedojo.com/) [Introduction To Text Analytics With R](https://www.youtube.com/playlist?list=PLTJTBoU5HOCR5Vkah2Z-AU76ZYsZjGFK6) by [David Langer](https://www.daveondata.com/)
- Aurélien Géron's [Classification Notebook](https://github.com/ageron/handson-ml/blob/master/03_classification.ipynb) 
- Scikit-Learn's [API Docs](https://scikit-learn.org/stable/modules/classes.html)
- Chayan Kathuria's tutorial [Build & Deploy a Spam Classifier app on Heroku Cloud in 10 minutes!](https://towardsdatascience.com/build-deploy-a-spam-classifier-app-on-heroku-cloud-in-10-minutes-f9347b27ff72)
- Analytics Vidhya's [Introduction to Topic Modeling and Latent Semantic Analysis](https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/)
- Prof. Steve Brunton's [YouTube lectures on Singular Value Decomposition](https://www.youtube.com/playlist?list=PLMrJAkhIeNNSVjnsviglFoY2nXildDCcv) 
- Kevin Arvai's tutorial [Fine Tuning a Classifier in Scikit-Learn](https://towardsdatascience.com/fine-tuning-a-classifier-in-scikit-learn-66e048c21e65)
- Cole Brendel's article [Quickly Compare Multiple Models](https://towardsdatascience.com/quickly-test-multiple-models-a98477476f0)
- Josh Starmer's [StatQuest YouTube channel](https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw)

---
