<a href="https://colab.research.google.com/github/SiP-AI-ML/LessonMaterials/blob/master/Intro_to_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. What is NLP?
We adapted these materials from the fast.ai NLP course, so full credit to them.
Here is the repo: https://github.com/fastai/course-nlp
We also referenced a blog post from Algorithmia: https://algorithmia.com/blog/introduction-natural-language-processing-nlp

Natural language processing (NLP) is a field of artificial intelligence in which computers analyze, understand, and derive meaning from human language in a smart and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

Some examples of NLP in common products and techniques:
- Amazon Alexa/Google Home
- Spam detectors
- Google Search
- Lemmatization

Unlike common word processors, which treat text as a plain sequence of symbols, NLP considers the hierarchical structure of language:
- several words make a phrase
- several phrases make a sentence
- sentences convey ideas
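
To make the hierarchy concrete, here is a minimal sketch in Python that breaks a short text into sentences and each sentence into words. The function name `parse_hierarchy` and the splitting rules are assumptions made for illustration. It skips the phrase level entirely, since finding phrases generally needs a parser, and the naive sentence splitter will stumble on abbreviations such as "Dr.".

```python
import re

def parse_hierarchy(text):
    """Return a list of sentences, each represented as a list of words."""
    # Naive rule (an assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Keep only runs of word characters; punctuation is dropped.
    return [re.findall(r"\w+", s) for s in sentences if s]

for words in parse_hierarchy("Several words make a phrase. Sentences convey ideas!"):
    print(words)
```

The result is a nested list: the outer level holds sentences and the inner level holds the words of each sentence.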

John Rehling, an NLP expert at Meltwater Group, said in *How Natural Language Processing Helps Uncover Social Media Sentiment*: “By analyzing language for its meaning, NLP systems have long filled useful roles, such as correcting grammar, converting speech to text and automatically translating between languages.”

NLP is used to analyze text, allowing machines to understand how humans speak. This human-computer interaction enables real-world applications like automatic text summarization, sentiment analysis, topic extraction, named entity recognition, part-of-speech tagging, relationship extraction, stemming, and more. NLP is commonly used for text mining, machine translation, and automated question answering.

NLP is considered a hard problem in computer science because human language is rarely precise or plainly spoken. To understand human language is to understand not only the words, but also the concepts and how they’re linked together to create meaning. Although language is one of the easiest things for the human mind to learn, its ambiguity is what makes natural language processing a difficult problem for computers to master.

## What can you do with NLP?

NLP algorithms have a variety of uses. They allow developers to create software that understands human language. Because human language is complicated, NLP can be difficult to learn and implement correctly. However, with the knowledge gained from these materials, you will be better equipped to use NLP successfully. Some of the tasks developers use NLP algorithms for are:

- Part-of-speech tagging: identify if each word is a noun, verb, adjective, etc.
- Named entity recognition (NER): identify person names, organizations, locations, medical codes, time expressions, quantities, monetary values, etc.
- Question answering
- Speech recognition
- Text-to-speech and Speech-to-text
- Topic modeling
- Sentiment classification
- Language modeling
- Translation

Many techniques from NLP are useful in a variety of places; for instance, you may have text within your tabular data.

There are also interesting techniques that let you go between text and images.

### Top-down teaching approach

We'll be using a *top-down* teaching method, which is different from how most CS/math courses operate.  Typically, in a *bottom-up* approach, you first learn all the separate components you will be using, and then you gradually build them up into more complex structures.  The problems with this are that students often lose motivation, don't have a sense of the "big picture", and don't know what they'll need.

All that to say, don't worry if you don't understand everything at first!  You're not supposed to.  We will start using some "black boxes" that haven't yet been explained, and then we'll dig into the lower level details later. The goal is to get experience working with interesting applications, which will motivate you to learn more about the underlying structures as time goes on.

To start, focus on what things DO, not what they ARE.

## A changing field

Historically, NLP relied on hard-coded rules about a language. In the 1990s, there was a shift towards statistical and machine learning approaches, but the complexity of natural language meant that simple statistical approaches were often not state-of-the-art. We are now in the midst of a major shift towards neural networks. Because deep learning can model much greater complexity, it now achieves state-of-the-art results on many tasks.

This doesn't have to be binary: there is room to combine deep learning with rules-based approaches.

## Here is a quick Intro Video
https://www.youtube.com/watch?v=5ctbvkAMQO4 (stop at 5:25)

## Fakery

<img src="https://github.com/SiP-AI-ML/LessonMaterials/blob/master/images/gpt2-howard.png?raw=1" alt="" style="width: 65%"/>

[OpenAI's New Multitalented AI writes, translates, and slanders](https://www.theverge.com/2019/2/14/18224704/ai-machine-learning-language-models-read-write-openai-gpt2)

[He Predicted The 2016 Fake News Crisis. Now He's Worried About An Information Apocalypse.](https://www.buzzfeednews.com/article/charliewarzel/the-terrifying-future-of-fake-news) (interview with Aviv Ovadya)

- Generate an audio or video clip of a world leader declaring war. “It doesn’t have to be perfect — just good enough to make the enemy think something happened that it provokes a knee-jerk and reckless response of retaliation.”

- A combination of political botnets and astroturfing, where political movements are manipulated by fake grassroots campaigns to effectively compete with real humans for legislator and regulator attention because it will be too difficult to tell the difference.

- Imagine if every bit of spam you receive looked identical to emails from real people you knew (with appropriate context, tone, etc).

<img src="https://github.com/SiP-AI-ML/LessonMaterials/blob/master/images/etzioni-fraud.png?raw=1" alt="" style="width: 65%"/>

[How Will We Prevent AI-Based Forgery?](https://hbr.org/2019/03/how-will-we-prevent-ai-based-forgery): "We need to promulgate the norm that any item that isn’t signed is potentially forged." 

## Resources

**Books**

Here are a few helpful references if you want to go in depth (note: these are quite advanced):

- [**Speech and Language Processing**](https://web.stanford.edu/~jurafsky/slp3/), by Dan Jurafsky and James H. Martin (free PDF)

- [**Introduction to Information Retrieval**](https://nlp.stanford.edu/IR-book/html/htmledition/irbook.html) by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (free online)

- [**Natural Language Processing with PyTorch**](https://learning.oreilly.com/library/view/natural-language-processing/9781491978221/) by Brian McMahan and Delip Rao (need to purchase or have O'Reilly Safari account) 

**Blogs**

Good NLP-related blogs:
- [Sebastian Ruder](http://ruder.io/)
- [Joyce Xu](https://medium.com/@joycex99)
- [Jay Alammar](https://jalammar.github.io/)
- [Stephen Merity](https://smerity.com/articles/articles.html)
- [Rachael Tatman](https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213)

## NLP Tools

- Regex (example: find all phone numbers: 123-456-7890, (123) 456-7890, etc.)
- Tokenization: splitting your text into meaningful units (the term has a different meaning in security)
- Lemmatization: grouping together the different inflected forms of a word (e.g., *run*, *runs*, *ran*, *running*) so they can be analyzed together
- Word embeddings
- Linear algebra/matrix decomposition
- Neural nets
- Hidden Markov Models
- Parse trees
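
The first three tools in the list can be sketched with nothing more than Python's standard `re` module. This is a rough illustration under stated assumptions, not how production libraries work: the phone pattern only covers the two formats shown above, the tokenizer is a simple regex, and `TOY_LEMMAS` is a hypothetical hand-written lookup table standing in for the dictionaries and morphological rules that real lemmatizers use.

```python
import re

# Matches 123-456-7890 and (123) 456-7890 only; other formats are ignored.
PHONE_PATTERN = re.compile(r"(?:\(\d{3}\)\s?|\b\d{3}-)\d{3}-\d{4}\b")

# A word (optionally with an apostrophe part, e.g. "aren't")
# or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

# Hypothetical toy table mapping inflected forms to a shared base form.
TOY_LEMMAS = {"cats": "cat", "runs": "run", "ran": "run", "running": "run"}

def find_phone_numbers(text):
    """Return every substring that looks like one of the two phone formats."""
    return PHONE_PATTERN.findall(text)

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def lemmatize(tokens):
    """Replace each token with its base form if the toy table knows it."""
    return [TOY_LEMMAS.get(token, token) for token in tokens]

print(find_phone_numbers("Call 123-456-7890 or (123) 456-7890 today."))
print(tokenize("The cats aren't running."))
print(lemmatize(tokenize("The cats ran.")))
```

In practice you would reach for a library tokenizer and lemmatizer rather than hand-written patterns; the point here is only to show what each step does to the text.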

## Python Libraries

- [nltk](https://www.nltk.org/): first released in 2001, very broad NLP library
- [spaCy](https://spacy.io/): creates parse trees, excellent tokenizer, opinionated
- [gensim](https://radimrehurek.com/gensim/): topic modeling and similarity detection

specialized tools:
- [PyText](https://pytext-pytext.readthedocs-hosted.com/en/latest/)
- [fastText](https://fasttext.cc/): has a library of embeddings

general ML/DL libraries with text features:
- [sklearn](https://scikit-learn.org/stable/): general purpose Python ML library
- [fastai](https://docs.fast.ai/): fast & accurate neural nets using modern best practices, on top of PyTorch

## Ethics issues

### Bias

- [How Vector Space Mathematics Reveals the Hidden Sexism in Language](https://www.technologyreview.com/s/602025/how-vector-space-mathematics-reveals-the-hidden-sexism-in-language/)
- [Semantics derived automatically from language corpora contain human-like biases](https://arxiv.org/abs/1608.07187)
- [Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them](https://arxiv.org/abs/1903.03862)
- [Word Embeddings, Bias in ML, Why You Don't Like Math, & Why AI Needs You](https://www.youtube.com/watch?v=25nC0n9ERq4&list=PLtmWHNX-gukLQlMvtRJ19s7-8MrnRV6h6&index=9)

<img src="https://github.com/SiP-AI-ML/LessonMaterials/blob/master/images/rigler-tweet.png?raw=1" alt="" style="width: 65%"/>