# Natural Language Processing

Natural Language Processing (NLP) is also known as computational linguistics, text analysis, text mining, etc.
<br>
Based on its name it can be defined as follows (let's try it backwards):


## Natural Language


**There exist Formal and Natural languages.**

*   Formal languages are by construction explicit and non-ambiguous.
*   Natural languages are in essence implicit and ambiguous
    *   "Remove the stones from the cherries and put them in the pie."
    *   "The hunter shot the tiger; his wife too."
    *   "Time flies like an arrow."
    *   "She was eating a fish with (bones, anger, some friends, a fork)"

Natural language is a method of human communication, either spoken or written, which typically
<br>
contains words that are structured in some conventional way.

### <font color="grey">Natural Language Functions</font>

* Conciseness
  * The student gave *his* homework to the professor *who* told *him* that *it* could have been better.
  * The student gave the homework of the student to the professor. <br>The professor told the student that the homework of the student could have been better.
* Unlimited expressive power (Representation)
  * Logical expressions of any order.
    * Earth is curved  -  "curved_earth = TRUE"
    * All politicians lie.  -  For any x, politician(x) -> liar(x), etc
* Shared knowledge
  * I gave him a nice pen.
  * Scenario 1: - A "Mont Blanc"? - Yes, like that brand.
  * Scenario 2: - How large is it?  - To the moon and back.


### <font color="grey">Why Is Natural language IMPLICIT and AMBIGUOUS?</font>

Implicit -> allows conciseness (but potentially ambiguous)
<br>
Unlimited expressive power -> flexible interpretation rules (so meanings != word as it is written)
<br>
Why do we still understand each other?  -> A lot of shared knowledge!

Moreover...

*  Existing words (or linguistic units) change over time.
*  New words are added.
*  Language contains errors.
*  Linguistic units denote an idea or even several.
*  We have lots of common sense knowledge and understand hierarchical structures in language.

## Natural Language Processing OR "Machine! Do something with those texts!"

When a computer has to deal with the textual data, it actually faces the following:

1.   Human language is very high dimensional and sparse.
<br>
Depending on the representation and the language, for a machine, natural language consists of a
<br>*huge number of words*, and an even greater number of possible word-like features
<br>(parts of speech, grammatical variations of the language, various languages, etc.)
<br>
Moreover, for most of those, we simply *do not have enough data* to give to the machine or even worse,
<br>the same *words mean multiple things*.
2.   Human language is redundant and ambiguous.
<br>
As we have discussed earlier, human language is unnecessarily redundant.
<br>There are an enormous number of ways to describe the same thing.
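
To make "high dimensional and sparse" concrete, here is a minimal sketch that builds bag-of-words count vectors for three tiny documents and measures how many entries are zero. The helper names (`bag_of_words`, `sparsity`) and the whitespace tokenization are assumptions made for illustration, not part of any particular library.

```python
from collections import Counter


def bag_of_words(docs):
    """Build a shared vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in tokenized:
        vec = [0] * len(vocab)
        for tok, count in Counter(doc).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


def sparsity(vectors):
    """Fraction of zero entries across all document vectors."""
    total = sum(len(vec) for vec in vectors)
    zeros = sum(1 for vec in vectors for x in vec if x == 0)
    return zeros / total


docs = [
    "time flies like an arrow",
    "fruit flies like a banana",
    "the hunter shot the tiger",
]
vocab, vectors = bag_of_words(docs)
print(len(vocab), round(sparsity(vectors), 2))
```

Even with three short sentences, most entries of each vector are zero; with a realistic vocabulary of hundreds of thousands of words, almost all of them are.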

---
# Applications and Examples

---

## Text normalization

* Spell checking
* Grammar assistants (Grammarly)
* Formatting (more or less formal, etc.)
* Recapitalization
* Styling
* Autocompletion (SmartCompose in Gmail)

## Document classification

Some examples of the document classification that you probably use every day.

* Spam filtering
  * Fake review detection
  * Plagiarism detection
* Authorship attribution
* Sentiment analysis
  * Opinion mining (Stance detection)
  * Early warning/indicator systems<br>(e.g., "Nestle processed food should advertise the products differently")
  * Potential simple decision-making variations a.k.a. - "shall I buy this product?"

Some **interesting examples** of classification systems:
* Gmail Spam detection
* Priority Inbox [Gmail link]() [Yahoo link]()

## Automatic text summarization

[Summarization](http://amzn.to/2giFqN1) is the process of distilling the most important information from a source.
[Why](http://amzn.to/2fhUPNt)?
* Reduce reading time
* Selecting the core ideas of the documents gets faster
* More effective indexing
* Potentially less biased than human summaries
* Useful for QA systems
* Reduces the time to automatically process input data as well as reduces the storage needed to keep the data.

Methods:
* *Extractive [Methods](https://arxiv.org/abs/1707.02268)* - methods that involve selection of phrases and <br> sentences from the input document
* *Abstractive Methods* - methods that generate entirely new phrases and sentences that capture the meaning <br> of the original source.

Papers to check out:
<br>
[Get to the point: Summarization with pointer-generator networks](https://arxiv.org/abs/1704.04368) [Code](https://github.com/abisee/pointer-generator)
<br>
[A neural attention model for abstractive sentence summarization](https://arxiv.org/abs/1509.00685)
<br>
[Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond](https://arxiv.org/abs/1602.06023)

### Topic modeling and Clustering

Another way to think about summarization is document representation through topics (significantly reduces the size and still conveys the general direction of the document) or through document agglomerations that represent some topic (by means of clustering or topic modelling in each document).

## Natural language generation

NLG is a process of automatically transforming data into written text that is preferably grammatically, syntactically and semantically correct.

NLG can be used both to help people write what they want to write and to summarize and shorten given documents.

One example of generation is Google's SmartCompose and SmartReply efforts in Gmail.

### Caption generation

Image caption generation with visual attention: [paper](https://arxiv.org/abs/1502.03044).

### Handwriting generation

[Alex Graves demo](http://www.cs.toronto.edu/~graves/handwriting.html) and the [paper](https://arxiv.org/abs/1308.0850) describing the idea behind it.

## Natural Language Understanding (NLU)

One of the main motivations behind better NLU is the creation of chat and speech enabled bots that can
<br>
interact with and react to human-generated language. Currently many big companies have efforts in
<br>
those directions: Alexa, Siri, Google Assistant, Cortana, etc.

### Similar keyword search and Query expansion

A common problem of information retrieval in the context of query understanding. Typically, this involves:
* finding synonyms of words,
* identifying and using part-of and is-a relationships between keywords,
* performing morphological transformations,
* correcting spelling mistakes,
* reweighting the terms in the query.
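
The steps above can be sketched as a small pipeline. Everything here is a toy assumption: the `SYNONYMS` and `SPELLING_FIXES` tables are hypothetical hand-written lookups (a real system might derive them from WordNet or query logs), and `naive_stem` only strips a plural "s".

```python
# Hypothetical lookup tables, written by hand for this example only.
SYNONYMS = {"laptop": ["notebook"], "cheap": ["inexpensive", "budget"]}
SPELLING_FIXES = {"lapotp": "laptop"}


def naive_stem(word):
    """Very rough morphological normalization: strip a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def expand_query(query, synonym_weight=0.5):
    """Return {term: weight}; original terms get 1.0, synonyms a lower weight."""
    weights = {}
    for raw in query.lower().split():
        term = naive_stem(raw)                   # morphological transformation
        term = SPELLING_FIXES.get(term, term)    # spelling correction
        weights[term] = max(weights.get(term, 0.0), 1.0)
        for synonym in SYNONYMS.get(term, []):   # synonym expansion + reweighting
            weights[synonym] = max(weights.get(synonym, 0.0), synonym_weight)
    return weights


print(expand_query("cheap lapotps"))
```

Giving synonyms a lower weight than the original terms is one simple way to reweight the query so that expansions do not drown out what the user actually typed.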

### Entities and Knowledge bases

Identifying named entities (and just entities) is a common problem for text <br> understanding. It allows linking the text back to its semantics, <br> by disambiguating what exactly is mentioned.

Common subtasks:
* [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition).
* Entity linking / disambiguation
* Coreference resolution

Information about entities and their relations (e.g. isA / hierarchical relation) is stored in some sort of database.

Examples: 
* [Google Knowledge Graph](http://go/kg) is an internal database of entities and facts.
* [WordNet](https://en.wikipedia.org/wiki/WordNet) is a lexical database for <br>
English, merging words into synsets with particular meaning and providing <br> semantic relations between them.


[More information on entities and knowledge bases is below](#scrollTo=WrN-j9kLdo0j).


### Entity extraction

**[Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)** is typically the problem of
<br>
detecting entities of type LOCation, ORGanization, or PERson. Note: each of those entities might span several
<br>
words/tokens in a given text. NER is typically approached with linguistic grammar-based rules, statistical models, or
<br>
machine learning. Conditional Random Fields are a common choice for this task.
<br>
Examples:
  * Kaggle competition [dataset](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus).
  * [DataTurk](https://dataturks.com/projects/Mohan/Best%20Buy%20E-commerce%20NER%20dataset) - <br> search queries with entities.
  * [Stanford NER](https://nlp.stanford.edu/software/CRF-NER.shtml)
  * [Spacy](https://en.wikipedia.org/wiki/SpaCy)
  * Apache [OpenNLP](http://opennlp.apache.org/index.html)
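
To show what "entities might span several tokens" means in practice, here is a minimal dictionary-lookup tagger that emits BIO tags (B- for the first token of an entity, I- for the following ones, O for everything else). This is the simplest rule-based flavour, not a CRF and not the API of any of the tools listed above; the `GAZETTEER` contents are a hypothetical hand-made lexicon.

```python
# Hypothetical gazetteer: lowercase token tuples mapped to entity types.
GAZETTEER = {
    ("new", "york"): "LOC",
    ("paris",): "LOC",
    ("max", "planck", "institute"): "ORG",
    ("alex", "graves"): "PER",
}


def tag_entities(tokens):
    """Greedy longest-match lookup; returns one BIO tag per token."""
    tags = ["O"] * len(tokens)
    lowered = [tok.lower() for tok in tokens]
    max_len = max(len(key) for key in GAZETTEER)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest possible span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(lowered[i:i + n]))
            if label:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                matched = n
                break
        i += matched or 1
    return tags


tokens = "Alex Graves visited New York".split()
print(list(zip(tokens, tags := tag_entities(tokens))))
```

A lookup like this cannot handle unseen names or ambiguous ones, which is exactly where statistical sequence models earn their keep.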

### Entity disambiguation

Entity disambiguation is the problem of matching an entity mentioned in the text to the actual entity it refers to.
<br>
Typically, supervised learning is used to solve this problem where anchor texts are leveraged in the training data.
<br>
Further, several explorations were made with [clean unambiguous training data](http://www.aclweb.org/anthology/C10-1145),
<br>or with [topically related texts that potentially have similar types](https://www.cc.gatech.edu/~zha/CSE8801/query-annotation/p457-kulkarni.pdf) of entities in them, etc.
<br>
For example:
<br>
> Donald **Trump** and **Trump** card.
<br>
>**Paris** Hilton and **Paris** city.
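
A very rough way to see the task in code: pick the candidate whose known context words overlap most with the words around the mention. This is a simple overlap heuristic used here for illustration only, not the supervised approach described above; the `CANDIDATES` table and its context words are hypothetical.

```python
import re

# Hypothetical candidate entities with hand-picked context words.
CANDIDATES = {
    "paris": {
        "Paris (city)": {"france", "city", "capital", "seine", "visit"},
        "Paris Hilton (person)": {"hilton", "celebrity", "actress", "hotel", "heiress"},
    }
}


def disambiguate(mention, context):
    """Return the candidate sharing the most words with the context, or None."""
    candidates = CANDIDATES.get(mention.lower(), {})
    if not candidates:
        return None
    words = set(re.findall(r"[a-z]+", context.lower()))
    return max(candidates, key=lambda name: len(candidates[name] & words))


print(disambiguate("Paris", "We plan to visit Paris, the capital of France"))
print(disambiguate("Paris", "Paris Hilton is a celebrity"))
```

A supervised system replaces the hand-picked word sets with features learned from training data such as anchor texts.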

### Entity Linking/Matching

Another important problem regarding entities is coreference resolution (or entity linking, or entity matching).
<br>
Coreference denotes the presence of several expressions in a text that refer to the same person/thing.
<br>
For example:
<br>
> **The food** was salty so guests did not enjoy **it**.
<br>
> **The student** was absent for 3 months. **Such a person** won't pass any exam.

### Relation extraction

Entities can have various relationships with each other that are important to extract. For example:

* Types of relationships:

  * *is-a* - the relationship between two entities in which one entity inherits from the other

  * *Hypernyms* - a word with a broad meaning constituting a category into which words with more specific meanings fall;
  
  * *Hyponyms* - a word or phrase whose semantic field is included within that of another word. <br>A hyponym denotes a subset of the hypernym.
  
  * *Meronymy* denotes a constituent part of, or a member of something. Meronym is a part of a whole!

  * *Synsets* - A set of one or more synonyms that are interchangeable in some context without changing the <br> truth value of the proposition in which they are embedded;

  * *Metonymy* - the substitution of the name of an attribute or adjunct for that of the thing meant, <br> for example suit for business executive.
  
  * *Anything*: who is married to who, what causes what, etc.
  
* Relation extraction

  * Methodologies: [Regex](http://www.aclweb.org/anthology/D08-1003), [Rule based](http://iswc2012.semanticweb.org/sites/default/files/76490257.pdf), [Wikipedia categories](http://pages.cs.wisc.edu/~anhai/papers/kcs-sigmod13.pdf), [Distant supervision](https://web.stanford.edu/~jurafsky/mintz.pdf), [Bayesian networks](http://aclweb.org/anthology/D17-1192), [Factor graphs](https://cs.stanford.edu/people/czhang/zhang.thesis.pdf).
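
As a small taste of the regex-based route, the sketch below extracts *is-a* relations from the lexical pattern "X such as A, B and C". It is only one hand-written pattern chosen for illustration (an assumption, not taken from the linked papers), and it will happily over-match on longer sentences.

```python
import re

# Matches e.g. "fruits such as apples, pears and cherries".
SUCH_AS = re.compile(r"(\w+),? such as ([\w ,]+)")
# Splits the enumerated items on commas and on "and" / "or".
ITEM_SEPARATOR = re.compile(r"\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def extract_is_a(sentence):
    """Return (hyponym, 'is-a', hypernym) triples found in the sentence."""
    relations = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        for item in ITEM_SEPARATOR.split(match.group(2)):
            item = item.strip()
            if item:
                relations.append((item, "is-a", hypernym))
    return relations


print(extract_is_a("She likes fruits such as apples, pears and cherries."))
```

Patterns like this are precise but brittle, which is why the other listed methodologies (distant supervision, probabilistic models) are used to scale beyond a handful of templates.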

### Knowledge Bases/Knowledge Graphs

* [WordNet](https://en.wikipedia.org/wiki/WordNet)
<br>
WordNet is a lexical database for English. It groups English words into sets of synonyms (synsets) and provides their short
<br>
descriptions and usages. Moreover, it contains several relations between the synsets.

* [OmegaWiki](http://www.omegawiki.org/Meta:Main_Page), [BabelNet](https://babelnet.org/)
<br>
OmegaWiki aims at creating dictionaries of all words of all languages. BabelNet is a multilingual encyclopedic dictionary.

* <a href="https://en.wikipedia.org/wiki/Taxonomy_(general)">**Taxonomy**</a>
<br>
Taxonomy refers to the hierarchical categorization where relatively well-defined classes are nested under broader categories.

* [Folksonomy](https://en.wikipedia.org/wiki/Folksonomy)
<br>
Folksonomy is a relatively new system where users apply tags to online items.
<br>
As opposed to Taxonomy, Folksonomy does not derive a hierarchical structure between the tags but rather only assigns them.

* <a href="https://en.wikipedia.org/wiki/Ontology_(information_science)">**Ontology**</a>
<br>
An ontology encompasses the representation, naming, and definitions of the categories, properties, and relations of the concepts/entities
<br>
for several or all domains.

* [DBPedia](https://en.wikipedia.org/wiki/DBpedia)
<br>
DBPedia aims at extracting structured content from Wikipedia. It describes about 4.5M entities,
<br>
with about 1.5M persons, 700K places, 240K organizations, etc.

* <a href="https://en.wikipedia.org/wiki/YAGO_(database)">YAGO</a>
<br>
YAGO is an open-source knowledge base that was developed at the Max Planck Institute.
<br>
This knowledge base contains over 10M entities and about 120M facts about those entities.
<br>
YAGO extracts information from Wikipedia boxes, WordNet and [GeoNames](https://en.wikipedia.org/wiki/GeoNames).

* [**Knowledge Bases**](https://en.wikipedia.org/wiki/Knowledge_base) and [Knowledge Graphs](https://en.wikipedia.org/wiki/Knowledge_Graph)
<br>
KBs and KGs are technologies for storing complex structured and unstructured information.
<br>
One of the main examples of a Knowledge Graph is the [Google Knowledge Graph](https://en.wikipedia.org/wiki/Knowledge_Graph), which was in part
<br>
powered by [Freebase](https://en.wikipedia.org/wiki/Freebase) (Freebase is a large collaborative knowledge base that contains
<br>
data composed mainly by its community members).
<br>
Another example: Knowledge Graphs with [DeepDive](https://meta.wikimedia.org/wiki/Research:Wikipedia_Knowledge_Graph_with_DeepDive).

* <a href="https://en.wikipedia.org/wiki/Commonsense_knowledge_(artificial_intelligence)">Commonsense knowledge</a>
<br>
Common sense knowledge consists of facts about everyday life, e.g., The sky is blue, a lemon is yellow and sour.
<br>
A large corpus of this data called [Open Mind Common Sense](https://en.wikipedia.org/wiki/Open_Mind_Common_Sense) (OMCS) was created by MIT.

## Question and Answer (Q&A) systems

Q&A systems are supposed to return the answer to a query expressed in natural language or admit that the answer is not known.
Q&A systems usually benefit from a lot of NLP research areas: named entity recognition, taxonomies, ontologies, some relevant data for a domain, query rewriting, query similarity or matching to the existing material.

Some prominent examples:
* Siri
* Google Now/ Google Assistant
* Cortana
* Amazon Alexa
* etc.

## Decision making systems

Decision making systems help aggregate unstructured text data to make decisions.
<br>For example, one can measure the strength of positive reaction towards a local place (restaurant or hotel)
<br>in order to decide whether to visit it, or one may want to answer questions like
<br>"How many people support vaccination?" and why.  Natural Language Processing tools help in such
<br>aggregation of lexical corpora of raw texts.

Some examples:
* Detecting and analyzing relevant news for trading on financial stock market ([paper](http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/download/1529/1904)).
* Finding patterns in user's survey free-format answers, e.g. discovering topics of responses ([paper](https://scholar.harvard.edu/files/dtingley/files/topicmodelsopenendedexperiments.pdf)).
* Extraction of relevant snippets from user reviews, e.g. 'long battery'.


## Machine translation i18n

Machine translation refers to the class of algorithms translating input text in a source language into a target language.

Historically, statistical translation of n-grams/phrases was used. Now deep learning-based translation achieves
<br>better results and allows better capturing of the meaning in the long texts
<br>([paper](https://arxiv.org/pdf/1409.3215.pdf): Sequence to Sequence Learning with Neural Networks).
<br>Such methods require a large collection of parallel (aligned) text corpora between languages.
<br>The novel advances include using universal embedding representations, requiring parallel corpora
<br>only for some pairs ([paper](https://www.aclweb.org/anthology/Q/Q17/Q17-1024.pdf): Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation),
<br>and even training of translation system without any parallel corpora ([paper](https://arxiv.org/abs/1710.11041): Unsupervised Neural Machine Translation).

Demo: [Google Translate](https://translate.google.com/), Available as a Cloud API at https://cloud.google.com/translate/.

Open-source APIs: 
* Classic: [Moses](http://www.statmt.org/moses/)
* Newer Neural MT: [Open NMT](http://opennmt.net/)


---

## Speech recognition

Speech recognition (or Speech-to-Text) is an essential step for enabling Natural Language Interfaces.

Some nice existing applications:
* Voice control of devices: Google Assistant, Apple's Siri, Amazon's Alexa
* Automatic transcription of speech for dictation: hands-free input ([search by voice](http://www.google.com/mobile/voice-search/))
* Automatic transcription of videos (generating subtitles): [Auto-captions on youtube](https://www.youtube.com/watch?v=kTvHIDKLFqc)

Available tools:
* Google Speech-to-Text API is open for developers: https://cloud.google.com/speech-to-text/ (public link).
* List of open-source toolkits: https://www.kdnuggets.com/2017/03/open-source-toolkits-speech-recognition.html 

Deep neural networks enabled significant advances in this domain.

Some papers about recent models:
* End-to-end speech recognition with RNN: [DeepMind paper](http://proceedings.mlr.press/v32/graves14.pdf), 2014.
* Attention-based models for speech recognition: [paper](https://dl.acm.org/citation.cfm?id=2969304), 2015
* Tracking state-of-the-art in Speech Recognition: https://github.com/syhw/wer_are_we

## Speech synthesis

[WaveNet](https://deepmind.com/blog/wavenet-generative-model-raw-audio/): speech synthesis

# Typical NLP-heavy problems

* *Classification*: Classifying documents into particular categories (giving labels).
* *Regression*: Predict numerical values
* *Clustering*: Separating documents into non-overlapping subsets.
* *Ranking*: For a given input, rank documents according to some metric.
* *Association rule mining*: Infer likely association patterns in data.
* *Structured output*: Bounding boxes, parse trees, etc.
* *Sequence-to-sequence*: For a given input/source text, generate output annotations or another text.

To address many of the challenges above, several important things have to be clarified beforehand:

1. Set the research/exploration goal (e.g., how strong will the earthquake be on a given day).
2. Make a hypothesis (e.g., the strength of past earthquakes is an informative signal).
3. Collect the data (e.g., collect the historical strength of the quakes on each day).
4. Test the hypothesis (e.g., train a model using the data).
5. Analyze the results (e.g., are the results better than existing systems).
6. Reach a conclusion (e.g., the model should be used or not because of A, B, C).
7. Refine the hypothesis and repeat (e.g., time of the year could be a useful signal).

Below is the table of the approaches that can be applied to the problems/applications above.

| Type                   | Input   | Clarifications|
|------------------------|---------|---------------|
|   Rule-based           |Explicit linguistic patterns, <br>lexicons, etc.| Such an approach always has predefined behaviour and usually can't generalize. |
|   Supervised           | Training examples: typically <br>tuples of the form <br>(features, label) | Here input features are usually some vectorized, transformed, normalized input representation.|
|   Semi-supervised or <br> Pseudo-relevance <br>feedback   | Same as in the supervised case, <br>but also results of the prediction <br>(e.g. with greatest confidence) <br>are used as input pairs.   | Once we have trained the system on the input (feature, label) pairs, we label unknown examples <br>with the model. If the confidence of the label is high (on that unseen example), <br>we add those examples to the input set and retrain the model. |
|   Distant supervision  | Same as for supervised learning, <br>but (feature, label) pairs do not <br> come from manual annotation, but from <br>some heuristic-based annotation.   | Rather than annotating thousands of examples (documents, sentences, words, etc.), <br>we take a few examples of the class and try to generalize and match those <br>in the domain-specific corpus. Such systems first generate or extract examples that are <br> very similar to, or slight modifications of, the labelled input examples. For example, if you need to <br> find all the sentences about Obama's marriage, you would search for all sentences that match <br> Michelle and Barack Obama. Furthermore, those examples could be used to find any sentence <br> about a marriage. Another example: for a given list of movies, match all the sentences that contain <br> movie names. Clearly, such heuristics result in noisy input sets.|
|   Unsupervised         | Unlabeled features   | The system is expected to find some dependencies and patterns without clearly stating <br>which patterns we are looking for. E.g., clustering of documents, topics extraction <br>from the documents, etc. |
|   Hybrid               | Varies for method <br>combinations   | Combines several approaches that are mentioned earlier for various purposes. |

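The distant supervision row can be sketched in a few lines: a seed fact labels every sentence that mentions both of its entities, whether or not the sentence really expresses the relation. The `SEED_FACTS` list, the relation name `married_to` and the substring matching are assumptions made for this toy example.

```python
# Hypothetical seed knowledge: (subject, object, relation).
SEED_FACTS = [("Barack Obama", "Michelle Obama", "married_to")]


def distant_label(sentences, seed_facts):
    """Label a sentence with a relation if it mentions both entities of a seed fact."""
    labelled = []
    for sentence in sentences:
        for subject, obj, relation in seed_facts:
            if subject in sentence and obj in sentence:
                labelled.append((sentence, relation))
                break  # one label per sentence is enough for this sketch
    return labelled


sentences = [
    "Barack Obama married Michelle Obama in 1992.",
    "Barack Obama and Michelle Obama visited Berlin.",  # matched, but not about marriage
    "Michelle Obama wrote a book.",
]
for sentence, relation in distant_label(sentences, SEED_FACTS):
    print(relation, "<-", sentence)
```

The second sentence gets the `married_to` label even though it says nothing about marriage, which is the kind of noise such heuristics introduce into the training set.
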
# Evaluation Metrics

Depending on the problem you solve or the data (balanced number of classes or not), different metrics
<br>might be more suitable.

* **Accuracy**
<br>
$ Accuracy = \frac{n_{correct}}{N}$, where $N$ is the total number of examples that we were analysing,
<br>$n_{correct}$ is the number of examples for which we guessed the label correctly.

*a.k.a. Classification*

* **Precision**
<br>
In the case of classification, and in particular if the classes are unbalanced, Precision is often a more informative measure than Accuracy.
<br>
$ Precision = \frac{n_{correct\ class\ prediction}}{n_{class\ predictions}} = \frac{TP}{TP + FP}$,
where $TP, FP, FN, TN$ are explained [here](https://en.wikipedia.org/wiki/False_positives_and_false_negatives).

* **Precision@K**
<br>
When your task is not simply to classify some examples but, e.g., to rank them, a popular metric is P@K.
<br>Here you pick K (typically 1, 3, 5, 10, etc.) and compute precision over the top K results in your ranking.
<br>For example, if you have a query for which you need to find similar documents,
<br>P@K is the fraction of the top K documents returned for that query that are actually relevant.

* **Recall**
<br>
Another very important concept in NLP is Recall, which basically tells us how many of the class examples
<br>or positive examples (maybe documents) our system managed to extract.
<br>
$ Recall = \frac{n_{correct\ class\ prediction}}{n_{positive\ class\ examples}} = \frac{TP}{TP + FN}$

* **F1**
<br>
[F1](https://en.wikipedia.org/wiki/F1_score) is the harmonic mean of the two above metrics.
<br>Usually used to compare different approaches when you are not optimizing for P and R in particular
<br>but rather overall performance.
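
The classification and ranking metrics in this list can be computed by hand in a few lines. The sketch below assumes binary labels with `1` as the positive class and uses $F_1 = \frac{2 \cdot P \cdot R}{P + R}$; the function names are made up for this example, and returning `0.0` when a denominator is zero is a convention chosen here, not a universal rule.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) with respect to the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_report(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def precision_at_k(ranked_relevance, k):
    """ranked_relevance[i] is 1 if the i-th ranked result is relevant, else 0."""
    return sum(ranked_relevance[:k]) / k


# Unbalanced toy data: 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]
print(classification_report(y_true, y_pred))
print(precision_at_k([1, 0, 1, 1, 0], k=3))
```

On this toy data accuracy looks decent because negatives dominate, while recall shows that most positive examples were missed.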

*a.k.a. Clustering* 

* [**Silhouette coefficient, Modularity, etc**](https://en.wikipedia.org/wiki/Cluster_analysis).
<br>
Once we move on from classification problems and focus on clustering of documents, many things can be measured, depending
<br>on whether you have labels or not.
<br>
The case where we do not have labels: 
  * For an already proposed clustering of the input, *modularity* measures how well nodes are assigned to the clusters.
  <br> In particular, we would estimate how our current assignment is different from the assumed random graph.
  * Silhouette coefficient estimates how the average distance between objects in the same cluster differs
  <br>from the average distance of those objects to the other clusters.
  * [Davies–Bouldin index](https://en.wikipedia.org/wiki/Davies-Bouldin_index) compares intra-cluster scatter with inter-cluster separation (lower values indicate better-separated clusters).
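
For intuition, here is a from-scratch sketch of the silhouette coefficient: for each point, $a$ is the mean distance to the other members of its own cluster, $b$ is the smallest mean distance to any other cluster, and the score is $(b - a) / \max(a, b)$. It assumes Euclidean distance via `math.dist` (Python 3.8+) and at least two clusters; scoring singleton clusters as 0 is a common convention adopted here.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""

    def mean_distance(point, members):
        return sum(math.dist(point, other) for other in members) / len(members)

    scores = []
    for i, (point, label) in enumerate(zip(points, labels)):
        same = [q for j, (q, l) in enumerate(zip(points, labels)) if l == label and j != i]
        if not same:
            scores.append(0.0)  # singleton cluster
            continue
        a = mean_distance(point, same)
        b = min(
            mean_distance(point, [q for q, l in zip(points, labels) if l == other])
            for other in set(labels)
            if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette(points, [0, 0, 1, 1]))  # natural grouping
print(silhouette(points, [0, 1, 0, 1]))  # grouping that splits neighbours
```

Values close to 1 mean points sit much closer to their own cluster than to any other; negative values suggest points were assigned to the wrong cluster.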

*a.k.a. Language models*

* **Perplexity**
<br>
When we consider text generation tasks, where we typically do not have labelled examples, or multiple correct answers
<br>are possible, *perplexity* can be used to evaluate your model. tl;dr - Perplexity estimates how surprised the model
<br>is upon receiving an input, e.g., how surprised the model is that the next word after the current one
<br>"eat" is "me", or "meat" or whatever. Typically, the lower the perplexity, the more information about
<br>the input the model has (no surprises). Another interpretation is that we compare our probability
<br>distribution to a fair die: a perplexity of $k$ means the model is as uncertain as when rolling a fair $k$-sided die.
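
A minimal sketch of the computation, assuming we already have the probability the model assigned to each token that actually occurred: perplexity is $\exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i)\right)$. The function name and the example probabilities are made up for illustration.

```python
import math


def perplexity(probabilities):
    """Perplexity from the probabilities a model gave to the observed tokens."""
    n = len(probabilities)
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / n)


# A model that always spreads its belief evenly over 6 options
# behaves like a fair six-sided die.
print(perplexity([1 / 6] * 10))

# A confident (and correct) model is less "surprised" than an unsure one.
print(perplexity([0.9, 0.8, 0.95]), perplexity([0.1, 0.2, 0.05]))
```

Note that a single token with probability 0 makes the perplexity infinite, which is one reason language models are usually smoothed.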


# Where text data comes from

Like anywhere in Data Science, it is important to first understand your data! Note: Of course, if you have it!

In case you do not have the data:

*Raw text data (Unlabeled/Non-annotated):*
* Pay for it :)
* Crawl it from the Web.
<br>
Examples: [Scrapy](https://scrapy.org/), [wiki/google crawler](http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/)
<br>
Of course, you might need to play with various IP addresses, throttling rates etc.
* Twitter
<br>
You can get either historical or live-stream data. For either of the two, you can get 1% of the stream for free.
  * 1% historical tweets from [archive.org](https://archive.org/details/twitterstream?sort=-date).
  * Similarly, you can get 1% of full twitter stream as tweets appear via Twitter API.
  <br>You can also specify keywords (up to 2K) that would match tweets in real time - as a result you get up to
  <br>1% total stream of messages. Note: if you have a not very popular query, you might get all the tweets about it,
  <br>but of course, no guarantees.
* News media
<br>
[News Archive](https://archive.org),
[GDELT](https://www.gdeltproject.org)
* Wikipedia
<br>
[Wikipedia dumps](https://dumps.wikimedia.org/)

*Annotated data:*
* Annotate and/or even generate your data using CrowdSourcing
  <br>
  [Crowdflower](https://www.figure-eight.com/), [MechanicalTurk](https://www.mturk.com/), etc.
* Annotate using auxiliary lexical resources
<br>
[LIWC](http://liwc.wpengine.com/) is a tool that analyses your textual input for the presence of various
<br>"shades" - informal speech; syntactic structures; affect, social words; cognitive, perceptual,
<br>biological processes; relativity; personal concerns, etc.

*Finally, you have the data!
... it is still not the final truth!*

The work of [A. Olteanu](http://www.aolteanu.com/) describes well the various [biases and pitfalls](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2886526) that come with data analysis.
* Biases
* Not representative
* Noisy data
* Incomplete data
* Incorrect data
* Missing data, etc.

So, first, get to know your data!

Note: In this class, we will be working with the data that fits into the memory.
<br>However, once it does not, you should adopt other tools to scale your pipelines - Flume, Spark, HDFS, etc. -
<br>which are out of the scope of this class.


---