# Natural Language Processing

## General: Goals

- create portfolio-worthy data science projects
- learn NLP and practice ML by building classifiers that use open text fields as input
- create a Python framework that generalizes a workflow in R for detecting SMS spam messages
- evaluate the effectiveness of the framework with Twitter data to predict some binary outcome

---


## General: ML Project Structure

These steps do not need to be sequential. For example, feature engineering can be done sooner. They are also cyclic: in practice, one might need to continuously re-define the problem, and iterate on EDA and modeling steps.

1. define the problem, set expectations and evaluation criteria
2. preliminary and minimal EDA and pre-splitting cleanup
3. split dataset into training and test subsets; set the test subset aside
4. create a cleanup and preprocessing pipeline for the training data that can be re-applied to the test data
5. train a couple of baseline models to ensure the process runs smoothly and preprocessing is dialed in
6. using cross-validation, evaluate a variety of models without hyperparameter tuning to establish some baselines
7. short-list promising models for further hyperparameter tuning
8. iterate on any phase of the project as needed
9. consider feature selection and feature engineering
10. decide on a final cleanup and processing pipeline
11. settle on a final model
12. re-apply all cleanup and processing steps to the test set and evaluate final model - once
13. create a final presentation of the solution for technical and non-technical audiences

---
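As a minimal sketch of steps 3 and 6 in the project structure above, the following uses only the standard library to set aside a stratified test subset and to generate cross-validation folds over the training subset. The function names `stratified_split` and `k_fold_indices`, the 80/20 ratio, and the toy labels are assumptions made for illustration; libraries such as scikit-learn offer equivalent utilities.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), keeping class proportions similar in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def k_fold_indices(indices, k=5, seed=42):
    """Yield (fit_idx, validation_idx) pairs that partition `indices` into k folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


# Toy labels (hypothetical): 80 ham and 20 spam messages.
labels = ["ham"] * 80 + ["spam"] * 20
train_idx, test_idx = stratified_split(labels)

# The test indices are set aside; cross-validation only touches the training subset.
folds = list(k_fold_indices(train_idx, k=5))
```

Keeping the test indices out of `k_fold_indices` is what allows the final model to be evaluated on the test set only once, as step 12 requires.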

## General: Project Definition

There is far more unstructured text data than structured data. To get value from this unstructured text, one first needs to apply text analytics techniques that give it structure.
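One common way to give text structure is a bag-of-words representation, where each document becomes a row of term counts. The sketch below is a minimal standard-library version; the tokenizer rule, the function names, and the example messages are assumptions for illustration, not part of the original tutorial.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, matrix); each row counts vocabulary terms in one document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical example messages, not taken from any dataset.
documents = ["Free prize! Claim your free prize now", "Are you free for lunch now?"]
vocabulary, matrix = bag_of_words(documents)
```

Each row of `matrix` can then be fed to a classifier as a fixed-length numeric feature vector.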

Value can be defined in many ways; here are a couple:

**Framework for Binary Classification** 


Given the goal of binary classification, say, to separate *spam* from *ham* (legitimate) messages, or *positive* (happier) from *negative* (unhappier) comments on social media, one can build classifiers that learn from a corpus and are able to predict, given a new instance (a new message or comment), whether it is spam or not, or positive or negative.

Both cases are well known and mostly solved. It is less clear to me whether classifiers built for spam detection would also perform well at, say, predicting negative tweets. Having a framework that quickly deploys classifiers for binary predictions on open text fields and assesses their accuracy seems a valuable pursuit.
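To make the idea concrete, here is a minimal sketch of a text classifier with a `fit`/`predict` interface plus an `accuracy` helper that works for any corpus of labeled texts. Multinomial Naive Bayes with add-one smoothing is used purely as an illustrative choice; the class name, the whitespace tokenization, and the toy messages are all assumptions and not the method of the original tutorial.

```python
import math
from collections import Counter


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = set()
        for counts in self.word_counts.values():
            self.vocabulary.update(counts)
        return self

    def _log_score(self, words, label):
        # log P(label) + sum of log P(word | label); smoothing handles unseen words.
        total_docs = sum(self.class_counts.values())
        total_words = sum(self.word_counts[label].values())
        score = math.log(self.class_counts[label] / total_docs)
        for word in words:
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(self.vocabulary)))
        return score

    def predict(self, text):
        words = text.lower().split()
        return max(self.class_counts, key=lambda label: self._log_score(words, label))


def accuracy(classifier, texts, labels):
    """Share of texts whose predicted label matches the true label."""
    predictions = [classifier.predict(t) for t in texts]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical toy corpus; a real run would use the SMS or Twitter data.
train_texts = ["win a free prize now", "free cash claim now",
               "see you at lunch", "call me when you are home"]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayesTextClassifier().fit(train_texts, train_labels)

test_texts = ["claim your free prize", "see you at home"]
test_labels = ["spam", "ham"]
score = accuracy(model, test_texts, test_labels)
```

Because the classifier only sees texts and labels, the same code could be refit on a different corpus, such as labeled tweets, to compare how well the approach transfers.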


**Entity Extraction** 

A common need in businesses that capture text data is to extract keywords from it. Say a business has an online app that historically has captured information in a text field, but nobody has had the time to read that input. Going forward, product managers decide this field should be turned into a drop-down menu. To integrate the past and future versions of this field, there's a need to bucket the unstructured text into categories for the new drop-down. To solve this, a data scientist can apply entity extraction techniques:

> "Named-entity recognition (NER) [...] is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories..." [Wikipedia, accessed Dec 19, 2020](https://en.wikipedia.org/wiki/Named-entity_recognition)
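As a much simpler stand-in for a trained NER model, the drop-down scenario above can be sketched with rule-based keyword matching. The categories, keywords, and the `bucket_text` function below are hypothetical and only illustrate the bucketing step; this is plain keyword lookup, not named-entity recognition.

```python
import re

# Hypothetical drop-down categories and the keywords that map to them.
CATEGORY_KEYWORDS = {
    "billing": ["invoice", "refund", "charge"],
    "login": ["password", "login", "locked"],
    "shipping": ["delivery", "shipping", "tracking"],
}


def bucket_text(text, category_keywords=CATEGORY_KEYWORDS, default="other"):
    """Return the category whose keywords match the most tokens in `text`.

    Ties go to the category listed first; texts with no match get `default`.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = {
        category: sum(token in keywords for token in tokens)
        for category, keywords in category_keywords.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


# Hypothetical free-text entries from the old text field.
examples = [
    "I was charged twice, please refund me",
    "Forgot my password and now I am locked out",
    "Great app, thanks!",
]
buckets = [bucket_text(entry) for entry in examples]
```

Exact keyword matching misses variants such as "charged", which is one reason a stemmer or a trained model is usually preferred on real data.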



---

## This Project: Introduction to Text Analytics With R

This is a Python offshoot of the YouTube tutorial series by David Langer from Data Science Dojo: [DSD's Introduction to Text Analytics with R](https://www.youtube.com/playlist?list=PLTJTBoU5HOCR5Vkah2Z-AU76ZYsZjGFK6)


#### What is the problem this project is trying to solve?

Separate SPAM from HAM (legitimate) SMS messages.


#### What are the expectations?

That the classifier built for this purpose can do this task quickly and accurately enough to avoid frustrating users.


#### What does success look like for this project?


A classifier that achieves high sensitivity, taking ham as the positive class: it rarely sends legitimate SMS to the spam folder, which would be far more frustrating than the occasional spam message slipping through. The classifier should also have reasonably high specificity, meaning it filters most spam messages as well.
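The two success metrics can be computed directly from true and predicted labels. The sketch below assumes ham is treated as the positive class, which is what the description above implies; the function name and the toy labels are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred, positive="ham"):
    """Return (sensitivity, specificity) for a binary problem.

    Sensitivity is the share of actual positives predicted as positive;
    specificity is the share of actual negatives predicted as negative.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels: 8 ham and 4 spam messages.
y_true = ["ham"] * 8 + ["spam"] * 4
# One ham message is wrongly flagged as spam; one spam message slips through.
y_pred = ["ham"] * 7 + ["spam"] + ["spam"] * 3 + ["ham"]

sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
```

With spam as the positive class instead, the two numbers swap roles, so it is worth stating the positive class whenever these metrics are reported.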

## This Project: The Data

Copies of the dataset are now spread across the internet; a good source is probably the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection#). The dataset is a collage of several sources; the original authors, Tiago A. Almeida and José María Gómez Hidalgo, explain the data collection process on this [University of Campinas website](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/).
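A minimal loading sketch follows, assuming the downloaded file stores one message per line as a label, a tab, and the message text. That layout, the file name in the comment, and the `read_sms_collection` helper are assumptions to verify against the actual download.

```python
import csv
import io


def read_sms_collection(lines):
    """Parse tab-separated `label<TAB>message` lines into (labels, messages)."""
    # QUOTE_NONE keeps quotation marks inside messages as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    labels, messages = [], []
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        labels.append(row[0])
        messages.append(row[1])
    return labels, messages


# Hypothetical two-line sample in the assumed format.
sample = io.StringIO('ham\tSee you at "lunch" tomorrow\nspam\tFree prize, claim now!\n')
labels, messages = read_sms_collection(sample)

# With a downloaded copy (the path is an assumption):
# with open("SMSSpamCollection", encoding="utf-8", newline="") as handle:
#     labels, messages = read_sms_collection(handle)
```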

---