# Natural Language Processing


## Structure

1. set expectations and define the problem, including evaluation criteria
2. preliminary and minimal EDA and pre-splitting cleanup
3. split dataset into training and test subsets; set the test subset aside
4. create a cleanup pipeline for the training data that can be re-applied to the test data
    * a POC is done by sampling down the sentiment140 dataset
5. create an ML pre-processing pipeline that can be re-applied as well
6. train a couple baseline models to ensure process is smooth and pre-processing is dialed in
7. using cross-validation, evaluate a variety of models without hyperparameter tuning to establish some baselines
8. short-list promising models for further hyperparameter tuning
9. iterate on any phase of the project as needed
10. consider feature selection and feature engineering
11. decide on a final cleanup and processing pipeline
    * apply to the entire training subset
12. settle on a final model
    * cross-validate results using the entire training subset
13. re-apply all cleanup and processing steps to the test set and evaluate final model - once
14. create a final presentation of the solution for technical and non-technical audiences
---
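A minimal sketch of steps 3 through 7 and step 13 of the structure above, assuming scikit-learn is the toolkit. The tiny corpus, the TF-IDF plus logistic regression baseline, and every parameter value are illustrative assumptions, not the project's actual data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Made-up stand-in for a sample of sentiment140 (1 = positive, 0 = negative).
texts = [
    "i love this phone", "what a great day", "so happy right now",
    "best coffee ever", "this is awesome", "feeling good today",
    "i hate mondays", "worst service ever", "so sad right now",
    "this is terrible", "feeling awful today", "what a bad day",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Step 3: split once and set the test subset aside.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Steps 4-5: a single Pipeline object, so the same fitted steps
# can be re-applied to the test data later.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Steps 6-7: cross-validate on the training subset only.
cv_scores = cross_val_score(baseline, X_train, y_train, cv=3, scoring="accuracy")

# Step 13: fit on the whole training subset, then score the test subset once.
baseline.fit(X_train, y_train)
test_accuracy = baseline.score(X_test, y_test)
```

Keeping vectorizer and classifier in one object means the vocabulary is learned from training folds only, which avoids leaking test data into pre-processing.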

## Definition

First Goal:

- learn NLP and machine learning
- train classifiers to predict positive/negative sentiment of a Tweet using the sentiment140 data
- compare with TextBlob sentiment polarity
- evaluate models based on accuracy at first, then precision and recall, or MCC (Matthews correlation coefficient)

The application, or "app idea", is beyond the scope of the project. It could be any app that takes short text input from users and needs to evaluate the "sentiment" of that text as part of its process. One example: Twitter users who download their Tweets from the last year and want to know how positive or negative their posts were, either as a timeline or as an aggregate compared with a cohort of friends. Another example: an app that tracks positive/negative sentiment around a topic by combining hashtags with this sentiment evaluation.

The scope of this project is to provide a model that most accurately predicts the sentiment of a Tweet, whether negative or positive, within a reasonable amount of time; it does not need to be as fast as possible (not real time). The evaluation criteria for "success" and "most accurate prediction" are TBD. I might use ROC curves and AUC. Accuracy alone isn't enough without considering recall.

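To make the point about accuracy versus recall concrete, here is a small dependency-free sketch that computes accuracy, precision, recall, and MCC from a confusion matrix. The labels are a made-up, deliberately imbalanced example (not project results), and the helper names are hypothetical. A model that always predicts "negative" scores 80% accuracy here while its recall and MCC are both zero.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and MCC; undefined ratios fall back to 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "mcc": mcc,
    }


# Made-up example: 8 negatives, 2 positives, and a model that always says "negative".
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10
metrics = binary_metrics(y_true, y_pred)
```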
---


## Management

#### Former Statuses

- Comparing against TextBlob revealed issues with the classifiers' predictions:
    * TextBlob is less accurate on the sentiment140 data
    * but on some very clear new test cases, TextBlob was a lot more stable
    * *why are predictions unstable with the classifiers (NB, LR)?*
        - hypothesis: too many variables
        
- Tested the POC process so far with the twitterbot data:
    * so far accuracies are lower
    * this just introduces more complexity - stop
    
- SVD did not improve accuracy, speed, or size

- Added Ngrams:
    * Bigrams: 80% accuracy, m=120k, n=50k, Tfidf, LR, 18s train time
    * Trigrams: 80% accuracy, m=120k, n=100k, Tfidf, LR, 3s train time
    
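A minimal sketch of the Tfidf + LR setup with n-grams, assuming scikit-learn. Reading `n` as a vocabulary cap set through `max_features` is an assumption about the notes above, and the corpus is made up. Switching `ngram_range` to `(1, 3)` would add trigrams.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up corpus (1 = positive, 0 = negative).
texts = [
    "this is so good", "really good movie", "i am very happy", "love it so much",
    "this is not good", "really bad movie", "i am very sad", "hate it so much",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Unigrams plus bigrams; max_features caps the vocabulary size
# (assumed to correspond to "n" in the notes).
bigram_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000),
    LogisticRegression(max_iter=1000),
)
bigram_model.fit(texts, labels)

# The learned vocabulary now contains two-word features such as "not good".
vocabulary = bigram_model.named_steps["tfidfvectorizer"].vocabulary_
prediction = bigram_model.predict(["this movie is not good"])
```

Bigrams let the model see negations like "not good" as a single feature, which unigrams alone cannot represent.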
#### Previous Status

- Feature Eng:
    * got 59% accuracy with 7 features
    * no improvement when joined with Trigrams
    * no improvement when joined with SVD
    
#### Current Status

- Cosine Similarity 
    * revisit R Analytics

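For reference, a small dependency-free sketch of cosine similarity between two texts using raw term counts. How cosine similarity would be used in this project is not stated above, so the whitespace tokenization and the function name are assumptions made only for illustration.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts represented as bag-of-words count vectors."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


same = cosine_similarity("good day", "good day")
partial = cosine_similarity("good day", "good night")
disjoint = cosine_similarity("good day", "bad night")
```

Identical texts score close to 1, texts sharing half their terms score about 0.5, and texts with no shared terms score 0.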

#### Future Steps

- Finalize cleanup pipeline: from dev versions to implementation
    * maybe use a .py script
    
- Start modeling again

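If the cleanup pipeline does move into a `.py` script, it could start from something like the sketch below. The specific steps (dropping URLs and @mentions, keeping hashtag words, lowercasing) are assumptions for illustration, not the project's actual dev-version rules, and the function names are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Apply the same cleanup to one Tweet, for training or test data alike."""
    text = URL_RE.sub(" ", text)       # drop links
    text = MENTION_RE.sub(" ", text)   # drop @mentions
    text = text.replace("#", " ")      # keep the hashtag word, drop the symbol
    text = text.lower()
    return WHITESPACE_RE.sub(" ", text).strip()


def clean_corpus(texts):
    """Clean a list of Tweets, preserving order."""
    return [clean_tweet(t) for t in texts]


example = clean_tweet("@bob I LOVE this!! http://t.co/abc #happy")
```

Because the function is stateless, it can be re-applied unchanged to the held-out test subset.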
#### Topic Modeling

- Latent Dirichlet Allocation (LDA)
- lda2Vec (word2vec)

#### Statistical Modeling

- SGD
- Decision Trees
- Random Forests
- Boosting
- SVC
- Ensembles
- Evaluate: MCC, look at confusion matrix

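One way to compare several of the model families listed above by MCC, then inspect a confusion matrix for the best one. This is a sketch assuming scikit-learn; the corpus is made up, the models use near-default settings, and `SVC(kernel="linear")` is just one possible reading of "SVC".

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, make_scorer, matthews_corrcoef
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Made-up corpus (1 = positive, 0 = negative).
texts = [
    "i love this phone", "what a great day", "so happy right now",
    "best coffee ever", "this is awesome", "feeling good today",
    "i hate mondays", "worst service ever", "so sad right now",
    "this is terrible", "feeling awful today", "what a bad day",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

candidates = {
    "sgd": SGDClassifier(random_state=0),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "svc": SVC(kernel="linear"),
}
mcc_scorer = make_scorer(matthews_corrcoef)

# Mean cross-validated MCC per candidate, no hyperparameter tuning.
results = {}
for name, model in candidates.items():
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipeline, texts, labels, cv=3, scoring=mcc_scorer)
    results[name] = scores.mean()

# Confusion matrix from out-of-fold predictions of the best candidate.
best_name = max(results, key=results.get)
best_pipeline = make_pipeline(TfidfVectorizer(), candidates[best_name])
predictions = cross_val_predict(best_pipeline, texts, labels, cv=3)
matrix = confusion_matrix(labels, predictions)
```

On a corpus this small the scores are noise; the point is only the shape of the comparison loop.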

#### Feature Selection

- Random Forest + VarImp plot 
- LASSO

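A rough sketch of both feature-selection ideas, assuming scikit-learn. Ranked random-forest importances stand in for a VarImp plot, and an L1-penalised logistic regression wrapped in `SelectFromModel` is swapped in as the classification analogue of LASSO. The corpus and all parameter values are made up.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Made-up corpus (1 = positive, 0 = negative).
texts = [
    "i love this phone", "what a great day", "so happy right now",
    "best coffee ever", "this is awesome", "feeling good today",
    "i hate mondays", "worst service ever", "so sad right now",
    "this is terrible", "feeling awful today", "what a bad day",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
# Terms ordered by their column index in X.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Random forest: rank terms by impurity-based importance (the data behind a VarImp plot).
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
ranked = sorted(zip(forest.feature_importances_, terms), reverse=True)
top_terms = [term for _, term in ranked[:5]]

# L1 penalty: terms whose coefficients shrink to (near) zero are dropped.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
selector = SelectFromModel(l1_model).fit(X, labels)
X_selected = selector.transform(X)
```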
#### Thoughts

- run with full data only as last resort
- run with better data, whether curated dataset or other data

#### Goal

- what's the use case?
    * Ex. app idea where user inputs text to get polarity, but TextBlob does that well already

---