# Natural Language Processing


## Structure

1. define the problem, set expectations, include evaluation criteria
2. preliminary and minimal EDA and pre-splitting cleanup
3. split dataset into training and test subsets; set the test subset aside
4. create a cleanup and preprocessing pipeline for the training data that can be re-applied to the test data
5. train a couple of baseline models to ensure the process runs smoothly and preprocessing is dialed in
6. using cross-validation, evaluate a variety of models without hyperparameter tuning to establish some baselines
7. short-list promising models for further hyperparameter tuning
8. iterate on any phase of the project as needed
9. consider feature selection and feature engineering (this can be done earlier)
10. decide on a final cleanup and processing pipeline
11. settle on a final model
12. re-apply all cleanup and processing steps to the test set and evaluate the final model exactly once
13. create a final presentation of the solution for technical and non-technical audiences
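
As a rough illustration of steps 3, 4, 6 and 12 above, here is a minimal sketch using scikit-learn. The toy corpus, the choice of `TfidfVectorizer` plus `LogisticRegression`, and all parameter values are assumptions made for the example, not the project's actual data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration only (1 = spam, 0 = ham).
spam = [
    "win a free prize now", "free cash offer call now",
    "claim your free reward today", "urgent winner call this number",
    "free entry win cash prize", "call now to claim your prize",
    "you have won a free holiday", "cash reward waiting claim now",
]
ham = [
    "are we still on for lunch", "see you at the meeting tomorrow",
    "thanks for sending the notes", "can you pick up some milk",
    "running late see you soon", "happy birthday have a great day",
    "the report is due on friday", "let me know when you get home",
]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# Step 3: split once and set the test subset aside.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Step 4: keep preprocessing inside a pipeline so it can be re-applied
# to the test data without refitting on it.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step 6: cross-validate on the training subset only.
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=3, scoring="accuracy")
print("CV accuracy per fold:", cv_scores)

# Step 12: fit on all training data, then score the test subset once.
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print("Held-out accuracy:", test_accuracy)
```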
---

## Definition

*Primary Goals*

- create a portfolio-worthy project
- start a portfolio of projects online (perhaps beyond GitHub)

*Secondary Goals*

- learn NLP
- practice ML: build classifiers for unstructured, text data

*Tertiary Goals*

- recreate the 'Intro To Text Analytics With R' project in Python
- repurpose and generalize the NLP-ML workflow with Twitter data

*Evaluation Criteria*

- accuracy, precision-recall curves, possibly the Matthews correlation coefficient (MCC)
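
The three criteria above could be computed along these lines with `sklearn.metrics`. The labels and scores are made-up values for illustration, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef, precision_recall_curve

# Made-up true labels and predicted spam probabilities (1 = spam).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.35, 0.60, 0.05, 0.90, 0.80, 0.40, 0.70])

# Hard predictions at an assumed threshold of 0.5.
y_pred = (y_score >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)

# The curve is computed from scores, not hard predictions.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

print("accuracy:", accuracy)
print("MCC:", mcc)
```

On these toy values there are 3 true positives, 5 true negatives, 1 false positive and 1 false negative, so accuracy is 0.8 and MCC is 14/24, about 0.58. MCC is often the more informative of the two when the classes are imbalanced.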

---


## Management

#### Status History

*Notebook 3: Tfidf*

- $97.5\%$ accuracy with a Tfidf matrix of 450 unigram terms on sklearn's baseline logistic classifier.
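
A minimal sketch of this kind of representation, assuming scikit-learn's `TfidfVectorizer` with `max_features=450` and a default-style `LogisticRegression`. The notebook's exact settings are not recorded here, and the toy messages are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy messages (1 = spam, 0 = ham).
texts = [
    "win a free prize now",
    "free cash offer call now",
    "claim your free reward",
    "see you at lunch",
    "thanks for the notes",
    "meeting moved to monday",
]
labels = [1, 1, 1, 0, 0, 0]

# Unigrams only; max_features keeps the most frequent terms (450 as in the note).
vectorizer = TfidfVectorizer(ngram_range=(1, 1), max_features=450)
X = vectorizer.fit_transform(texts)

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("matrix shape:", X.shape)
print("training accuracy:", clf.score(X, labels))
```

The toy vocabulary is far smaller than 450 terms, so the cap has no effect here; on a real corpus it limits the matrix to the 450 most frequent unigrams.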

*Notebook 4: Bigrams*

- $98.5\%$ accuracy with a Bag-of-(up-to)-Bigrams of 2,000 terms on sklearn's baseline logistic classifier.
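
One way to build a bag of unigrams and bigrams is `CountVectorizer` with `ngram_range=(1, 2)`. The sketch below assumes that approach and a 2,000-term cap; the toy messages are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy messages (1 = spam, 0 = ham).
texts = [
    "win a free prize now",
    "free prize waiting call now",
    "claim your free prize today",
    "see you at lunch",
    "thanks for the notes",
    "meeting moved to monday",
]
labels = [1, 1, 1, 0, 0, 0]

# ngram_range=(1, 2) counts unigrams and bigrams together.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), max_features=2000),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

vocabulary = model.named_steps["countvectorizer"].vocabulary_
bigrams = sorted(term for term in vocabulary if " " in term)
print("number of terms:", len(vocabulary))
print("some bigrams:", bigrams[:5])
```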

*Notebook 5: Ngrams*
    
- No clear improvement over the previous representation and model after grid searches and evaluation plots.
    
*Notebook 6: Dimensionality Reduction*

- SVD on the Tfidf matrix and SVD on the Bag-of-Bigrams give similar accuracy with the baseline classifier
- accuracy is lower than with the original (unreduced) data; it remains to be seen whether SVD combined with new features and more complex models improves accuracy
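
For reference, reducing a sparse Tfidf matrix with SVD might look like the sketch below. It assumes scikit-learn's `TruncatedSVD` (which accepts sparse input); the number of components and the toy messages are placeholders.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy messages.
texts = [
    "win a free prize now",
    "free cash offer call now",
    "claim your free reward",
    "see you at lunch",
    "thanks for the notes",
    "meeting moved to monday",
]

tfidf = TfidfVectorizer().fit_transform(texts)

# n_components=2 is a placeholder; real corpora usually need many more.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(tfidf.shape, "->", reduced.shape)
print("explained variance ratio:", svd.explained_variance_ratio_)
```

Because the reduced matrix is dense and low-dimensional, it can be cheaper to feed into more complex models, at the cost of discarding some of the information in the original terms.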

*Notebook 7: Feature Engineering*

- Based on the visualizations, the first feature (raw document length) appears to be the most useful for separating the target classes
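
A quick check of such a feature could be as simple as comparing length per class. The sketch assumes "raw document length" means the character count of the uncleaned text; the messages are invented and deliberately make spam longer.

```python
import numpy as np

# Invented toy messages (1 = spam, 0 = ham).
texts = [
    "congratulations you have won a free prize call now to claim it",
    "urgent your account has been selected for a cash reward reply now",
    "ok see you soon",
    "running late",
    "call me later",
    "free entry into our weekly prize draw text win to enter today",
]
labels = np.array([1, 1, 0, 0, 0, 1])

# Raw length in characters, computed before any cleanup.
lengths = np.array([len(text) for text in texts])

for name, value in (("ham", 0), ("spam", 1)):
    print(name, "mean length:", lengths[labels == value].mean())
```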

*Notebook 8: Comparing Representations*

- Keep the two best representations:
    * Bag-of-up-to-Bigrams with 2,000 terms
    * SVD on the bigrams above plus document length, for testing with more complex models
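
The second representation could be assembled with a `FeatureUnion`, so that the same steps are re-applied to the test data later. This is a sketch under assumptions: `doc_lengths` is a hypothetical helper, the component count is a placeholder, and the toy messages are invented.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer


def doc_lengths(docs):
    """Hypothetical helper: character counts as a single-column array."""
    return np.array([[len(doc)] for doc in docs], dtype=float)


# Invented toy messages.
texts = [
    "win a free prize now",
    "free prize waiting call now",
    "claim your free prize today",
    "see you at lunch",
    "thanks for the notes",
    "meeting moved to monday",
]

features = FeatureUnion([
    # Branch 1: bag of up-to-bigrams reduced with SVD.
    ("svd_bigrams", make_pipeline(
        CountVectorizer(ngram_range=(1, 2), max_features=2000),
        TruncatedSVD(n_components=2, random_state=0),
    )),
    # Branch 2: raw document length as one extra column.
    ("length", FunctionTransformer(doc_lengths)),
])

X = features.fit_transform(texts)
print(X.shape)  # two SVD columns followed by one length column
```

The length column is on a very different scale from the SVD components, so a scaler may be worth adding before models that are sensitive to feature scale.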


#### Current Status

- Revisit the R Analytics project: create a Cosine Similarity feature
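
The exact definition of the feature is not recorded in these notes. One plausible version, assumed here, is each message's mean cosine similarity to the spam messages in the training data. The toy messages are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented toy messages (1 = spam, 0 = ham).
texts = [
    "free prize call now",
    "win free cash now",
    "claim your free prize",
    "see you at lunch",
    "meeting moved to monday",
    "thanks for the notes",
]
labels = np.array([1, 1, 1, 0, 0, 0])

tfidf = TfidfVectorizer().fit_transform(texts)
spam_idx = np.flatnonzero(labels == 1)

# Similarity of every message to every spam message: (n_docs, n_spam).
similarity_to_spam = cosine_similarity(tfidf, tfidf[spam_idx])

# Assumed feature: the mean of those similarities for each message.
mean_spam_similarity = similarity_to_spam.mean(axis=1)
print(mean_spam_similarity)
```

Spam rows include their similarity to themselves, which inflates the feature on training data; excluding the self-match, or computing the feature inside cross-validation folds, would be a more careful variant.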


#### Future Steps

- Finalize cleanup pipeline:
    * maybe use a .py script
- Modeling
- Evaluation
- Presentation

#### Topic Modeling

- Latent Dirichlet Allocation (LDA)
- lda2vec (builds on word2vec)
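
As a starting point for the LDA item, here is a minimal sketch with scikit-learn's `LatentDirichletAllocation` on raw term counts. The two-topic setting and the toy messages are assumptions; lda2vec is not covered.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy messages.
texts = [
    "free prize call now",
    "win free cash prize now",
    "claim your free prize",
    "see you at the meeting",
    "meeting moved to monday",
    "notes from the monday meeting",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)

# n_components is the number of topics; 2 is a placeholder.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic distribution per message

# Terms ordered by column index, to label the topic-term weights.
terms = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))
for k, weights in enumerate(lda.components_):
    top_terms = terms[np.argsort(weights)[::-1][:3]]
    print("topic", k, ":", list(top_terms))
```

The per-message topic proportions in `doc_topics` could also serve as extra features for the classifiers.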

#### Statistical Modeling

- SGD
- Decision Trees
- Random Forests
- Boosting
- SVC
- Ensembles
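
Several of the candidates above can be compared with the same cross-validation loop. The sketch assumes scikit-learn estimators with mostly default settings and a Tfidf representation; the toy corpus is invented, and a voting ensemble is left out for brevity.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Invented toy corpus (1 = spam, 0 = ham).
spam = [
    "win a free prize now", "free cash offer call now",
    "claim your free reward today", "urgent winner call this number",
    "free entry win cash prize", "call now to claim your prize",
]
ham = [
    "are we still on for lunch", "see you at the meeting tomorrow",
    "thanks for sending the notes", "can you pick up some milk",
    "running late see you soon", "the report is due on friday",
]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

candidates = {
    "SGD": SGDClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "SVC": SVC(kernel="linear"),
}

results = {}
for name, clf in candidates.items():
    # The vectorizer sits inside the pipeline so each fold refits it.
    model = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(model, texts, labels, cv=3, scoring="accuracy")
    results[name] = scores.mean()

for name, score in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```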

#### Feature Selection

- Random Forest + variable importance (VarImp) plot
- LASSO
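
Both ideas can be sketched in scikit-learn. Since LASSO proper is L1-penalized least squares, the example swaps in an L1-penalized `LinearSVC` as a classification analogue; that substitution, the parameter values and the toy corpus are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Invented toy corpus (1 = spam, 0 = ham).
texts = [
    "win a free prize now", "free cash offer call now",
    "claim your free reward today", "free entry win cash prize",
    "are we still on for lunch", "see you at the meeting tomorrow",
    "thanks for sending the notes", "the report is due on friday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
terms = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))

# Random forest: rank terms by impurity-based importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
top = np.argsort(forest.feature_importances_)[::-1][:5]
print("top terms by forest importance:", list(terms[top]))

# L1 penalty: terms with a non-zero coefficient are the ones kept.
l1_model = LinearSVC(penalty="l1", dual=False, C=1.0).fit(X, labels)
kept = terms[l1_model.coef_.ravel() != 0]
print("terms kept by the L1 model:", list(kept))
```

The importances in `forest.feature_importances_` are the values a variable importance plot would show, for example as a horizontal bar chart of the top terms.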

#### Evaluation

- MCC
- confusion matrix
- specific predictions
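
A confusion matrix and a look at individual misclassified messages might be produced as follows. The messages, labels and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up messages with true labels and model predictions (1 = spam).
texts = [
    "see you at lunch", "thanks for the notes", "running late",
    "call me later", "free for coffee now", "happy birthday",
    "win a free prize now", "claim your cash reward",
    "hi it is me call back", "urgent free entry today",
]
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN, FP, FN, TP:", tn, fp, fn, tp)

# Specific predictions: list the messages the model got wrong.
wrong = np.flatnonzero(y_true != y_pred)
for i in wrong:
    print(f"true={y_true[i]} predicted={y_pred[i]}: {texts[i]}")
```

Reading the misclassified messages directly often suggests new cleanup steps or features more readily than aggregate scores do.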

#### Thoughts

- what's the use case?
- what about using TextBlob?

---