## Syllabus

Topics Covered:

1\. What is NLP?
  - A changing field
  - Resources
  - Tools
  - Python libraries
  - Example applications
  - Ethics issues

2\. Topic Modeling with NMF and SVD
  - Stop words, stemming, & lemmatization
  - Term-document matrix
  - Term Frequency-Inverse Document Frequency (TF-IDF)
  - Singular Value Decomposition (SVD)
  - Non-negative Matrix Factorization (NMF)
  - Truncated SVD, Randomized SVD

3\. Sentiment classification with Naive Bayes, Logistic regression, and ngrams
  - Sparse matrix storage
  - Counters
  - the fastai library
  - Naive Bayes
  - Logistic regression
  - Ngrams
  - Logistic regression with Naive Bayes features, with trigrams
  
4\. Regex (and re-visiting tokenization)

5\. Language modeling & sentiment classification with deep learning
  - Language model
  - Transfer learning
  - Sentiment classification

6\. Translation with RNNs
  - Review Embeddings
  - BLEU metric
  - Teacher Forcing
  - Bidirectional
  - Attention



# 1. What is NLP?

## What can you do with NLP?

NLP is a broad field, encompassing a variety of tasks, including:

- Part-of-speech tagging: identify whether each word is a noun, verb, adjective, etc.
- Named entity recognition (NER): identify person names, organizations, locations, medical codes, time expressions, quantities, monetary values, etc.
- Question answering
- Speech recognition
- Text-to-speech and Speech-to-text
- Topic modeling
- Sentiment classification
- Language modeling
- Translation

Many NLP techniques are useful in a variety of settings. For instance, you may have text within your tabular data.

## This course

In this course, we will cover these applications:
- Topic modeling
- Sentiment classification
- Language modeling

## A changing field

NLP originally relied on hard-coded rules about a language. In the 1990s, there was a shift towards statistical and machine learning approaches, but the complexity of natural language meant that simple statistical approaches were often not state-of-the-art. We are now in the midst of a major shift towards neural networks. Because deep learning allows for much greater complexity, it now achieves state-of-the-art results on many tasks.

This doesn't have to be binary: there is room to combine deep learning with rules-based approaches.

### A field in flux

The field is still very much in a state of flux, with best practices changing.

<img src="images/skomoroch.png" alt="" style="width: 60%"/>

## Resources

**Books**

We won't be using a textbook, although here are a few helpful references:

- [**Speech and Language Processing**](https://web.stanford.edu/~jurafsky/slp3/), by Dan Jurafsky and James H. Martin (free PDF)

- [**Introduction to Information Retrieval**](https://nlp.stanford.edu/IR-book/html/htmledition/irbook.html) by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (free online)

- [**Natural Language Processing with PyTorch**](https://learning.oreilly.com/library/view/natural-language-processing/9781491978221/) by Brian McMahan and Delip Rao (requires purchase or an O'Reilly Safari account)

**Blogs**

Good NLP-related blogs:
- [Sebastian Ruder](http://ruder.io/)
- [Joyce Xu](https://medium.com/@joycex99)
- [Jay Alammar](https://jalammar.github.io/)
- [Stephen Merity](https://smerity.com/articles/articles.html)
- [Rachael Tatman](https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213)

## NLP Tools

- Regex (example: find all phone numbers: 123-456-7890, (123) 456-7890, etc.)
- Tokenization: splitting your text into meaningful units (has a different meaning in security)
- Word embeddings
- Linear algebra/matrix decomposition
- Neural nets
- Hidden Markov Models
- Parse trees

Example from http://damir.cavar.me/charty/python/: "She killed the man with the tie."

Was the man wearing the tie?
<img src="images/parse2.png" alt="" style="width: 60%"/>

Or was the tie the murder weapon?
<img src="images/parse1.png" alt="" style="width: 60%"/>
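
The first two tools in the list above, regex and tokenization, can be sketched with Python's built-in `re` module. This is a minimal illustration, not a production approach: the pattern `PHONE_RE` and the helpers `find_phone_numbers` and `tokenize` are hypothetical names made up for this example. The phone pattern only covers the two formats mentioned above, and the tokenizer is far cruder than what a library such as spaCy provides.

```python
import re

# Hypothetical pattern: covers only the two formats from the list above,
# 123-456-7890 and (123) 456-7890.
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s?|\d{3}-)\d{3}-\d{4}")

def find_phone_numbers(text):
    """Return every substring of `text` that matches PHONE_RE."""
    return PHONE_RE.findall(text)

def tokenize(text):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(find_phone_numbers("Call 123-456-7890 or (123) 456-7890 before noon."))
# ['123-456-7890', '(123) 456-7890']

print(tokenize("She killed the man with the tie."))
# ['She', 'killed', 'the', 'man', 'with', 'the', 'tie', '.']
```

Note that the tokenizer output is the same under either reading of the sentence: the ambiguity shown in the two parse trees lives in the structure, not in the tokens.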

## Python Libraries

- [nltk](https://www.nltk.org/): first released in 2001, very broad NLP library
- [spaCy](https://spacy.io/): creates parse trees, excellent tokenizer, opinionated
- [gensim](https://radimrehurek.com/gensim/): topic modeling and similarity detection

Specialized tools:
- [PyText](https://pytext-pytext.readthedocs-hosted.com/en/latest/)
- [fastText](https://fasttext.cc/): has a library of embeddings

General ML/DL libraries with text features:
- [sklearn](https://scikit-learn.org/stable/): general purpose Python ML library
- [fastai](https://docs.fast.ai/): fast & accurate neural nets using modern best practices, on top of PyTorch

### Bias

<img src="images/google-translate.png" alt="" style="width: 65%"/>

- [How Vector Space Mathematics Reveals the Hidden Sexism in Language](https://www.technologyreview.com/s/602025/how-vector-space-mathematics-reveals-the-hidden-sexism-in-language/)
- [Semantics derived automatically from language corpora contain human-like biases](https://arxiv.org/abs/1608.07187)
- [Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them](https://arxiv.org/abs/1903.03862)
- [Word Embeddings, Bias in ML, Why You Don't Like Math, & Why AI Needs You](https://www.youtube.com/watch?v=25nC0n9ERq4&list=PLtmWHNX-gukLQlMvtRJ19s7-8MrnRV6h6&index=9)

<img src="images/rigler-tweet.png" alt="" style="width: 65%"/>

### Fakery

<img src="images/gpt2-howard.png" alt="" style="width: 65%"/>

[OpenAI's New Multitalented AI writes, translates, and slanders](https://www.theverge.com/2019/2/14/18224704/ai-machine-learning-language-models-read-write-openai-gpt2)

[He Predicted The 2016 Fake News Crisis. Now He's Worried About An Information Apocalypse.](https://www.buzzfeednews.com/article/charliewarzel/the-terrifying-future-of-fake-news) (interview with Aviv Ovadya)

- Generate an audio or video clip of a world leader declaring war. “It doesn’t have to be perfect — just good enough to make the enemy think something happened that it provokes a knee-jerk and reckless response of retaliation.”

- A combination of political botnets and astroturfing, where political movements are manipulated by fake grassroots campaigns to effectively compete with real humans for legislator and regulator attention because it will be too difficult to tell the difference.

- Imagine if every bit of spam you receive looked identical to emails from real people you knew (with appropriate context, tone, etc).

<img src="images/etzioni-fraud.png" alt="" style="width: 65%"/>

[How Will We Prevent AI-Based Forgery?](https://hbr.org/2019/03/how-will-we-prevent-ai-based-forgery): "We need to promulgate the norm that any item that isn’t signed is potentially forged." 