# Natural Language Processing

## Goals

*Primary*

- create a portfolio-worthy data science project
- start a portfolio of projects online

*Secondary*

- learn NLP
- practice ML: build classifiers for unstructured, text data

*Main Steps*

- recreate the 'Intro To Text Analytics With R' project in Python
- repurpose and generalize the NLP-ML workflow with Twitter data

---


## Structure

1. define the problem, set expectations and evaluation criteria
2. preliminary and minimal EDA and pre-splitting cleanup
3. split the dataset into training and test subsets; set the test subset aside
4. create a cleanup and preprocessing pipeline for the training data that can be re-applied to the test data
5. train a couple of baseline models to ensure the process runs smoothly and the pre-processing is dialed in
6. using cross-validation, evaluate a variety of models without hyperparameter tuning to establish some baselines
7. short-list promising models for further hyperparameter tuning
8. iterate on any phase of the project as needed
9. consider feature selection and feature engineering (this can be done earlier)
10. decide on a final cleanup and processing pipeline
11. settle on a final model
12. re-apply all cleanup and processing steps to the test set and evaluate final model - once
13. create a final presentation of the solution for technical and non-technical audiences
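
The steps above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the project's actual code: the toy corpus, the Tfidf-plus-logistic-regression pipeline, the stratified split, and all parameter values are assumptions chosen for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the real data (1 = spam, 0 = ham).
texts = [
    "win a free prize now", "free cash offer call now",
    "claim your free reward today", "urgent prize waiting call now",
    "are we still on for lunch", "see you at the meeting tomorrow",
    "can you call me later", "thanks for the help yesterday",
    "running late be there soon", "happy birthday have a great day",
    "did you finish the report", "let us grab coffee this week",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Step 3: split once and set the test subset aside.
# stratify=labels is an assumption; it keeps the class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Step 4: preprocessing lives inside the pipeline, so it is fitted on the
# training data only and re-applied unchanged to the test data.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("clf", LogisticRegression()),
])

# Steps 5-6: cross-validated baseline, using the training subset only.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=3, scoring="accuracy")

# Step 12: fit on all training data and evaluate on the test set, once.
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(cv_scores.mean(), test_accuracy)
```
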

---

## General Definition

Unstructured text data is far more abundant than structured data. To leverage it, one needs to apply text analytics techniques, or 'natural language processing', to give the data structure and extract value from it.

Value can be defined in many ways. 

**Example 1.** 

A common need in businesses that capture text data is to extract keywords from it. Say a business has an online app that has historically captured information in a free-text field, but nobody has had the time to read that input. Going forward, product managers decide this field should be turned into a drop-down menu. To integrate the past and future states of this field, the unstructured text now needs to be bucketed into the categories of the new drop-down menu. To solve this, a data scientist can apply entity extraction techniques:

> "Named-entity recognition (NER) [...] is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories..." [Wikipedia, accessed Dec 19, 2020](https://en.wikipedia.org/wiki/Named-entity_recognition)


**Example 2.** 

```
[TODO: needs revision]
```

Given the goal of binary classification, say, to separate spam from ham (legitimate) messages or emails, or *positive* (happier) from *negative* (unhappier) Tweets, one can build classifiers that learn from a corpus and are able to predict, given a new instance (a new email), whether it is spam or not. The spam/ham problem is old and mostly solved, and the positive/negative Tweet problem is also well known and studied. It is less clear to me whether classifiers built for one would also perform well in the other realm. Being able to generalize a workflow with quick deployment and testing of various classifiers is still challenging - many "out-of-the-box" analytics tools fail to deliver.


---

## This Project: Text Analytics With R

This is a Python offshoot of the original YouTube series tutorial by David Langer from Data Science Dojo: [DSD's Introduction to Text Analytics with R](https://www.youtube.com/playlist?list=PLTJTBoU5HOCR5Vkah2Z-AU76ZYsZjGFK6)


#### What is the problem this project is trying to solve?



#### What are the expectations?


```
TODO....
```


#### What does *success* look like for this project?


```
TODO....
```

## Project Management

#### Status History

*Notebook 3: Tfidf*

- $97.5\%$ accuracy with a Tfidf matrix of 450 unigram terms on sklearn's baseline logistic classifier.

*Notebook 4: Bigrams*

- $98.46\%$ accuracy with a Bag-of-(up-to)-Bigrams of 500 terms on sklearn's baseline logistic classifier.

*Notebook 5: Ngrams*
    
- $98.59\%$ accuracy with a Bag-of-(up-to)-Trigrams with 2,000 terms on sklearn's baseline logistic classifier.
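
As a rough illustration of the three representations in Notebooks 3 to 5, the sketch below builds them with scikit-learn's vectorizers. The toy documents are hypothetical, and using `CountVectorizer` for the n-gram bags is an assumption; the notebooks may weight terms differently. `max_features` caps the vocabulary at the most frequent terms, mirroring the 450, 500, and 2,000 term limits.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy documents; the real corpus is much larger.
docs = ["free prize call now", "call me when you are free", "see you at lunch"]

# Notebook 3 style: Tfidf over unigrams, vocabulary capped at 450 terms.
unigram_tfidf = TfidfVectorizer(ngram_range=(1, 1), max_features=450)

# Notebook 4 style: unigrams and bigrams, capped at 500 terms.
up_to_bigrams = CountVectorizer(ngram_range=(1, 2), max_features=500)

# Notebook 5 style: unigrams through trigrams, capped at 2,000 terms.
up_to_trigrams = CountVectorizer(ngram_range=(1, 3), max_features=2000)

X_tfidf = unigram_tfidf.fit_transform(docs)
X_bigrams = up_to_bigrams.fit_transform(docs)
X_trigrams = up_to_trigrams.fit_transform(docs)

# Each matrix has one row per document and one column per retained term.
print(X_tfidf.shape, X_bigrams.shape, X_trigrams.shape)
```

On a corpus this small the caps are never reached, so the column counts simply grow as longer n-grams are added.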
    
*Notebook 6: Dimensionality Reduction*

- BoT with 2,000 terms has best accuracy and sensitivity (0.9069), great specificity (0.9979)
- SVD with 300 components on Tfidf has second highest accuracy (0.9840) and high sensitivity (0.8798) and specificity (0.9993)
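
A minimal sketch of the SVD-on-Tfidf reduction, assuming scikit-learn's `TruncatedSVD` (which accepts sparse matrices). The toy documents are hypothetical, and `n_components=2` is used only because the toy vocabulary is tiny; the notebook used 300 components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents standing in for the SMS corpus.
docs = [
    "free prize call now",
    "call me when you are free",
    "see you at lunch",
    "claim your prize today",
]

# Sparse Tfidf matrix: one row per document, one column per term.
tfidf = TfidfVectorizer().fit_transform(docs)

# Project the term space onto a small number of dense components.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(tfidf.shape, "->", reduced.shape)
```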

*Notebook 7: Feature Engineering*

- After visualizations, the first feature (raw document length) appears to be the most useful in separating the target
- After a quick modeling stage, using all features except RSR gives the best accuracy (0.8823), yet low sensitivity (0.3270) and high specificity (0.9672)

*Notebook 8: Comparing Representations*

- Given the baseline logistic classifier:
    - Overall accuracy and specificity are high
    - Sensitivity, which is desired, is the metric that needs improvement

- Best representations:
    - BoT alone, acc:0.9859, tpr:0.9069, tnr:0.9979
    - BoT with all features but RSR, acc:0.9849, tpr:0.9031, tnr:0.9973
    - SVD with all features but RSR (to test out with a more complex model), acc:0.9828, tpr:0.8973, tnr:0.9959
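
For reference, the three metrics reported above (acc, tpr, tnr) can be computed from confusion-matrix counts as in the sketch below. It assumes spam is the positive class, so sensitivity (tpr) is the share of spam caught and specificity (tnr) is the share of ham left alone; the labels and predictions are made up.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity (tpr), and specificity (tnr)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "acc": (tp + tn) / total if total else 0.0,
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        "tnr": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Made-up example: 3 spam (1) and 5 ham (0) messages.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = summarize(y_true, y_pred)
print(metrics)
```

With an imbalanced target, accuracy and specificity can both look strong while sensitivity lags, which is the pattern seen in the results above.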
    
*Notebook 9: Cosine Similarity*

- Given the baseline logistic classifier:
    - Cosine similarities (over spam SMS) appear helpful upon visualization (esp. Tfidf)
    - The simple LR model only predicts the ham base rate (BoT = Tfidf)
    
- This needs testing with a more complex model
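
One way to build a "similarity to spam" feature is to average each document's cosine similarity against the known spam messages. The sketch below is an assumption about how such a feature could be computed, not the notebook's code: it uses raw whitespace-token counts in pure Python, whereas the notebook works on BoT and Tfidf vectors, and the helper names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_spam_similarity(doc, spam_docs):
    """Average cosine similarity of one document to all spam documents."""
    vec = Counter(doc.lower().split())
    sims = [cosine(vec, Counter(s.lower().split())) for s in spam_docs]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical spam messages from the training subset.
spam_docs = ["free prize now", "call now free"]

spam_like = mean_spam_similarity("free prize call", spam_docs)
ham_like = mean_spam_similarity("see you at lunch", spam_docs)
print(spam_like, ham_like)
```

To avoid leakage, the spam reference set should come from the training subset only.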

#### House cleaning

- add some intros to notebooks; add source info
- move data folder to project
- add clean data subfolder and redo loading in notebooks
- update README


#### Current Status

Questions:

- original added cosine similarity to SVD on Tfidf of BoT
- use stratified train-test split for first split?

- Random Forests 2: evaluate first gridsearch
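
On the stratified-split question: stratifying means sampling each class separately so that train and test keep the same class ratio, which matters when one class is rare. The pure-Python sketch below illustrates the idea with a hypothetical helper and made-up labels; in practice scikit-learn's `train_test_split` accepts a `stratify` argument that does this.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) preserving class proportions."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then carve off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up imbalanced labels: 10% spam, 90% ham.
labels = ["spam"] * 10 + ["ham"] * 90
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

test_spam = sum(1 for i in test_idx if labels[i] == "spam")
print(len(train_idx), len(test_idx), test_spam)
```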


#### Future Steps

- Settle on a cleanup-preprocessing pipeline
    - Ex. if using Tfidf, keep idf, etc. (SVD, cosine similarity)
- Create a script for the pipeline 
    - Make sure it works with the test dataset
    
- Modeling
- Evaluation
- Presentation

#### Topic Modeling

- Latent Dirichlet Allocation (LDA)
- lda2Vec (word2vec)

#### Statistical Modeling

- SGD
- Decision Trees
- Random Forests
- Boosting
- SVC
- Ensembles

#### Feature Selection

- Random Forest + VarImp plot 
- LASSO

#### Evaluation

- MCC
- confusion matrix
- specific predictions
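
The Matthews correlation coefficient (MCC) condenses the whole confusion matrix into one value between -1 and 1, which makes it more informative than accuracy on an imbalanced target. Below is a minimal pure-Python sketch from raw counts; the counts are made up, and scikit-learn offers `matthews_corrcoef` for label arrays.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))
    Returns 0.0 when the denominator is zero (a common convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up counts: a classifier that catches some spam.
some_signal = mcc(tp=2, fp=1, tn=4, fn=1)

# A classifier that always predicts ham (the base rate) scores zero,
# even though its accuracy would be 5/8.
base_rate_only = mcc(tp=0, fp=0, tn=5, fn=3)

print(some_signal, base_rate_only)
```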

#### Thoughts

- what about using TextBlob?

---