# What is Natural Language Processing?

In order to understand the meaning of the term _Natural Language Processing_ (__NLP__ for short[1]), it is a good idea to start by looking at the words that make up the term.

[1] Also known as _computational linguistics_, _text analytics_ and _text mining_. All these terms are largely synonyms.


## _Processing_
The term _processing_ refers to the fact that we will be using machines to do the work. That is, our goal is to automate. By definition, automation implies machines, and machines imply engineering, which is why NLP is, in that regard, a subfield of computer engineering. But not only that.

## _Language_
The term _natural language_ refers to what we want to automate. A _language_ is a means of communication where there is an arbitrary pairing of signals (X) and their meaning (Y).


### ... or, put in other code,


In [1]:
#    Data for Machine Learning
training_data = [
    ('hello world', 'true'),
    ('*', 'false')
]

#    Real-world linguistic inputs from user-generated content
real_world_linguistic_data = [
    'hello world',
    'helloworld', 'hell o wor ld',
    'undefined', 'asdfasdasdasdf'
]


def machine_learning(input):
    for stimulus, response in training_data:
        if input == stimulus:
            return response
    return training_data[-1][1]

ml = machine_learning

for test in real_world_linguistic_data:
    print('-nlp', ml(test), '"%s"' % test)
print()


#    Data for Computational Linguistics
dictionary_of_accepted_terms = ['hello world', 'foo', 'bar']


#    Definition of each field of study:
def computational_linguistics(string):
    forms = []
    for w in dictionary_of_accepted_terms:
        sim = len(set(string.lower()).intersection(set(w))) / float(len(w))
        if sim >= 0.5:
            forms.append((sim, w))
    if forms:
        return sorted(forms)[-1][1]
    else:
        return string


def natural_language_processing(test):
    _test = cl(test)
    try:
        assert _test
    except Exception:
        return None
    return ml(_test)


cl = computational_linguistics
nlp = natural_language_processing

for test in real_world_linguistic_data:
    print('+nlp', nlp(test), '"%s"' % test)     #  Natural Language Processing is just a wrapper
                                                #  over Machine Learning in order to handle
                                                #  linguistic inputs.

-nlp true "hello world"
-nlp false "helloworld"
-nlp false "hell o wor ld"
-nlp false "undefined"
-nlp false "asdfasdasdasdf"

+nlp true "hello world"
+nlp true "helloworld"
+nlp true "hell o wor ld"
+nlp false "undefined"
+nlp false "asdfasdasdasdf"


This may be the first time someone has reduced a bunch of scientific disciplines to a few Python functions :D but I have the feeling that the idea of being able to do just that some day is the main reason why most of you are reading this. Now let's go a bit deeper into the definition, in a slightly 'verbose=True' manner ;)


#### _Natural language_
The adjective _natural_ stands in opposition to _artificial_ (man-made). It distinguishes languages that humans design explicitly to be perfect (i.e., non-redundant and unambiguous, like math, musical notation, traffic lights or Python) from languages that have not been so designed: the _natural languages_.

Natural languages develop spontaneously over time simply as a result of being used by speakers, mostly as a function of 1) the way they are learned by children in one generation and passed on to the next and 2) historical accidents happening to that language (_twerk_ is a historical accident that will now be in English for some time before it can be washed away ;)

#### Actually just _messy_ language
From our previous point it follows that languages are subject to standard evolutionary processes like those happening in biological systems, and that results in a number of interesting properties that can be roughly summarized by saying that natural languages are __really messy__. Below is a more detailed list:

1. __Existing linguistic units change over time__. Their form changes (irregular verbs turn into regular verbs, BrEn _learnt_ > AmEn _learned_) but, more importantly, their meaning changes and adapts to new realities: a few decades ago, the word _notebook_ would never have been applied to something that has no pages and needs electricity to write on.

2. __New linguistic units are added all the time__: new words are invented as human realities change and almost every time Steve Jobs opened his mouth :D The vocabulary expands at a very fast pace and an important part of human linguistic behavior is to be able to deduce the meaning of unknown words based on their usage and on contextual clues, as well as by inference (generalizing from known attributes of similar words). Understanding language involves a lot of guessing all the time.

3. When linguistic units are produced, they usually __contain errors__. Try typing something on your smartphone and see what happens :) For humans, spoken communication is generally easier and faster, but everyone has trouble pronouncing a word occasionally. Over time, some errors become so widespread that they end up being the accepted way of saying something:
>   __mobile phone__
>     * mobile(1)  *phone that moves
>     * mobile(2)   phone that is easily portable (so what do we have the word "portable" for?)

4. __Most linguistic units denote ideas rather than things__. Whereas it is relatively easy for a robot to perceive a ball using some sensors and push it around a room, it is much more difficult to teach a computer the meaning of a term like _civil liberties_. We still have no effective representation for a large part of the knowledge that we use every day to understand language, and some of those ideas are even explicitly encoded in language itself (it is generally challenging to come up with an exhaustive explanation of the meaning and usage of the present perfect tense, or of polite forms in the languages that have them, like German, Japanese or Spanish). 

5. __Hierarchical structure__. Languages work at different levels of abstraction, which require different levels of resolution:
>    _Characters < words < sentences < documents_

Currently, technologies working at the __document__ level provide fairly good results, and systems working at the __sentence__ level achieve very good performance depending on the task. However, at the __word__ level there is still a lot of room for improvement. Most available technologies rely heavily on having a wider linguistic context available to generate results, or on hand-crafted information.


So, when a computer looks at it:

1. __Human language is inherently very high-dimensional and sparse.__ From a computer's perspective, languages consist of too many different words: most of them appear very rarely, a few of them appear too often, and many of them, when they appear, do not always appear with the same meaning (!). That last part is known as the modulation of meaning in context, or __semantic modulation__, and it means that words modify each other's meaning (which is, in turn, the cornerstone of the [Distributional principle](https://en.wikipedia.org/wiki/Distributional_semantics)).
This is true of virtually all words but some of them actually __require__ modulation: intrinsically ambiguous words such as _firm_ (_solid_ as an adjective, a type of business when used as a noun) are like Schrödinger's cat: their meaning is linguistically undefined until they are observed in a context that allows the speaker to disambiguate them: _He works at a law firm_ versus _The government has a firm control of the situation_.

2. __Human language is (unnecessarily?) redundant.__ In language, there are many different ways to express the same thought (or the same overall idea):
   * _That movie was terrible._
   * _The movie was awful._
   * _I did not like anything about this film._
   * _Last night at the cinema -it was awful!_
   * _What did I just watch?_
   * _I am never getting that hour and a half of my life back._
   * _Who would do something like this?_
   * _I almost had a mental stroke watching that film._
   * ... and more.

For a computer, these all mean:

    SENTIMENT(IDofMovieInDatabase, -5)

:D The differences are stylistic and colorful, and humans can handle them relatively effortlessly -decoding all these variants is a type of mental gymnastics that our brains like to do, and generally find interesting: people who can come up with more creative and surprising ways of making an assertion are usually regarded as witty and can use that to call other people's attention, which is extremely important during an information exchange. For computers, however, __those people are simply annoying :)__

Despite all the variation above, the appropriate behavior could still be triggered by the latter, more abstract representation: *__Computer, do not recommend me movies like this ever again__*. We want our system to get to that abstract interpretation, to the same conclusion, from any of the starting points above. Doing that in each case requires a varying number of intermediate steps.
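
A minimal sketch of why that jump is hard. It assumes a naive word-lexicon scorer; the lexicon, its scores and the name `lexicon_sentiment` are illustrative inventions, not part of any real system. Such a scorer handles the literal variants but is blind to the figurative ones:

```python
import re

# Hypothetical toy lexicon: the words and scores are assumptions for illustration.
NEGATIVE_LEXICON = {"terrible": -5, "awful": -5}


def lexicon_sentiment(sentence):
    """Sum the scores of the known words in the sentence (0 if none is known)."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return sum(NEGATIVE_LEXICON.get(token, 0) for token in tokens)


variants = [
    "That movie was terrible.",
    "The movie was awful.",
    "What did I just watch?",
    "I almost had a mental stroke watching that film.",
]

for variant in variants:
    print(lexicon_sentiment(variant), variant)
```

The first two variants score -5, while the last two score 0 even though a human reads them as equally negative: the missing steps are exactly the kind of background knowledge discussed next.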

Take for instance the last variant, _I almost had a mental stroke watching that film_. The ability to detect the correct interpretation of this sentence would require the system to be aware of a number of assumptions:

Assumption | Topic | Description | Formalization
--- | --- | --- | ---
1 | Stroke | People can suffer mental strokes. | *HAS(person, stroke) && IS_OF_TYPE(stroke, mental) && ...*
2 | Stroke | Mental strokes are highly negative. | *SENTIMENT(stroke, -5)*
3 | Watching | People watch films. | *SAME_AS(film, movie) && WATCH(person, movie)*
4 | Watching | The action of watching consists in receiving visual input. |
5 | Watching | Receiving visual input is a simple thing for humans and rarely causes permanent brain damage. |
6 | Language | People sometimes do not speak literally, which involves a partial[2] violation of linguistic meaning. |
7 | Language | When people do not speak literally, sometimes they exaggerate. |
8 | Language | When people exaggerate, they make a stronger statement than is literally correct, but one that they expect the listener to be able to perceive as such. |
9 | Language | When the "error" is resolved by the listener, the differential between the strength of the non-literal expression and the inferred, literal one, can be seen as additional emphasis. | 

[2] A total violation of linguistic rules would make communication impossible.


That is a high-level summary of all the knowledge required to correctly interpret the example above (a detailed representation would need e.g. further decomposing "movie" into some kind of annotation such as **[+event, +information]**, or sometimes into a database entry like **{'year': int, 'director': str, 'duration (minutes)': int, 'cast': List: str }** ).

Over the course of their early development, up until adolescence, and probably borrowing also from long-evolved instincts, humans seem able to easily master assumptions 6-9, get exposed to tons of examples of facts such as those in 1-5, and can easily derive many more through inference. Computers, however, still need to be taught all of this explicitly, and we still lack a suitable representation for most of it.

For most purposes, we are currently still at assumption 3 in the table above but working hard on assumption 4, with lots of great research teams doing very interesting early work (mostly through manual annotation and validation by humans combined with automated extraction and generalization workflows). However, it would seem that most of the work still needs to be done. Assumptions such as 6-9, on the other hand, still seem a bit far away.

In this sense, the adjective _natural_ in natural language also opposes the adjective _artificial_ in _artificial intelligence_, and it is the reason why natural language is an area of work within artificial intelligence: computers, artificial systems, cannot yet understand natural language without help, and there is a whole area of research on how to make that possible. That is the focus of NLP, and it lies right at the __intersection of engineering, linguistics, and machine learning__.

## Applications and examples


### Speech recognition

 > Development of technology that transforms spoken language into text by electronic systems.

*From [Wikipedia](https://en.wikipedia.org/wiki/Speech_recognition)*

1. Transform what people say into strings, which can be
  * the final output or
  * input for NLP.
  * __DEMO__: in our phone (_where am I?_, _play [SONG](https://www.youtube.com/watch?v=bbr60I0u2Ng)_)
2. Input (directive) vs. interactive
  * Knowledge required.
3. Word error rates
  * 0.3% for number recognition over the phone.
  * 10% for listening to the news on TV.
  * 30% for conversation between many speakers with background noise.
4. New word challenge. 
5. Based on sequential models (Hidden Markov Models).
6. Data acquisition bottleneck.
  * Tens to thousands of hours.
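
Word error rates like the ones above are conventionally computed as the word-level edit distance between the reference transcript and the recognizer's output, divided by the number of reference words: WER = (substitutions + deletions + insertions) / N. A minimal sketch, assuming whitespace tokenization (the function name and example sentences are illustrative):

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("the reference must contain at least one word")
    # previous[j]: edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("where am i", "where was i"))  # one substitution out of three words
```

Note that WER can exceed 1.0 when the hypothesis contains many spurious insertions.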

### Machine translation

*(Adapted from [Wikipedia](https://en.wikipedia.org/wiki/Machine_translation))*

Machine translation is the task of giving a computer a text in one language as input and getting back its translation into any other language. Most of you are probably familiar with [__Google Translate__](https://translate.google.com), which is probably the most paradigmatic example of the task and of what can be achieved with very large amounts of data. We all know it makes mistakes but, as its training dataset keeps increasing, it has an endless stream of data to learn from, and that data stream will eventually provide the exact translation of the whole paragraph you just wrote.

> ese flujo de datos de vez en cuando va a proporcionar la traducción exacta de todo el párrafo que acaba de escribir.

> що потік даних іноді забезпечить точний переклад всього пункту ви тільки що написали.

Just as the number Pi contains a copy of anything you will ever write :)

Ultimately, __machine translation__ is looking up a word's translation over a humongously huge __bilingual dictionary__. That dictionary is actually pretty bad because the entries are not for words but rather for full sentences, that is why you need to use 


In many ways, __machine translation__

At one level, MT simply replaces words in one language (source language) with words in another language (target language). However, the mapping between the words in each language is not one-to-one: sometimes, one source word translates into more than one target word; other times, two source terms translate into a single target term. Apart from that, translation exhibits the full range of complexity in natural languages, which means that, strictly speaking, deep linguistic understanding would be necessary to perform automated translation in a well-founded way.
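
The word-replacement view, including the cases where the mapping is not one-to-one, can be sketched with a toy phrase table and a greedy longest-match lookup. This is only an illustration: the tiny English-to-Spanish table and the `translate` function are assumptions made up for the example, and real systems are far more involved.

```python
# Hypothetical toy phrase table (English -> Spanish), for illustration only.
PHRASE_TABLE = {
    ("i",): ["yo"],
    ("sleep",): ["duermo"],
    ("do", "not"): ["no"],          # two source words -> one target word
    ("seldom",): ["rara", "vez"],   # one source word -> two target words
}
MAX_PHRASE_LENGTH = max(len(source) for source in PHRASE_TABLE)


def translate(sentence):
    """Greedy left-to-right lookup, trying the longest source phrase first."""
    tokens = sentence.lower().split()
    output = []
    position = 0
    while position < len(tokens):
        for length in range(MAX_PHRASE_LENGTH, 0, -1):
            source = tuple(tokens[position:position + length])
            if source in PHRASE_TABLE:
                output.extend(PHRASE_TABLE[source])
                position += len(source)
                break
        else:
            # Unknown word: copy it through untranslated.
            output.append(tokens[position])
            position += 1
    return " ".join(output)


print(translate("I do not sleep"))   # yo no duermo
print(translate("I seldom sleep"))   # yo rara vez duermo
```

The lookup has no notion of word order, agreement or meaning, which is where the need for deeper linguistic understanding comes in.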

Although there is currently great success in mapping a text to its translation, it is still not the type of translation we would like: via ideas rather than examples. At this point we still only have a super-smart parrot that repeats things people say but has no understanding of what it is saying, which is why it often says them at the wrong time or makes mistakes. The crucial transition to 

5. Non-obvious evaluation.
  * Proper evaluation data is hard to come by, sometimes due to lack of resources and other times due to the intrinsic difficulty of providing a single correct answer: for any given sentence, many translations are actually possible. Metrics such as [BLEU](https://en.wikipedia.org/wiki/BLEU) (Bilingual Evaluation Understudy), [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)) or [WER](https://en.wikipedia.org/wiki/Word_error_rate) are used.

> Most commercial machine translation is successful thanks to customization and focus on specific tasks/domains, decreasing error rates by limiting the scope of the potential substitutions. This technique is particularly effective with formulaic language such as legal documents. Informal, spontaneous conversation is much more challenging and machine translation of that type of input is currently an unsolved problem.
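
To give a flavor of how a metric such as BLEU copes with the fact that many translations are acceptable, here is a sketch of one of its ingredients, the modified (clipped) n-gram precision against several reference translations. It is not the full BLEU score (which also combines several n-gram orders and a brevity penalty), and the function names are illustrative:

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n=1):
    """Share of candidate n-grams found in some reference, with clipped counts."""
    candidate_counts = Counter(ngrams(candidate.split(), n))
    if not candidate_counts:
        return 0.0
    # For each n-gram, the maximum number of times it occurs in any one reference.
    max_reference_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference.split(), n)).items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)
    clipped = sum(min(count, max_reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())


references = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_precision("the cat is on the mat", references))        # 1.0
print(modified_precision("the the the the the the the", references))  # 2/7, thanks to clipping
```

Clipping is what prevents a degenerate candidate that just repeats a frequent word from getting a perfect score.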

### Document classification
1. Spam filtering.
  * Detection of fake reviews.
  * Plagiarism detection.
2. Authorship attribution.
3. Sentiment analysis.
  * Opinion mining.
  * Early warning systems.


### Sentiment analysis
1. *Should I go see this movie or not?, Are my employees happy or not?*
2. Many providers/tools.
3. Non-trivial for the same reasons as 
  * parsing
  * Aspect-based sentiment analysis/Attribute-based understanding
4. Slack emotion tracking bot


### A couple of interesting classification use cases
[Assist.ai](https://assist.ai)

[Xeneta](https://medium.com/xeneta/boosting-sales-with-machine-learning-fbcf2e618be3#.ynm031lo2)
Lead qualification for improving the efficiency of a sales department. They asked the question, _Given a company description, can we train an algorithm to predict whether or not it’s a potential customer?_

### Natural language generation
1. Canned (some number of fixed strings as options used interchangeably for each relevant generation field. Heavily lexicalized, i.e., very reliant on dictionaries/hard-coding/etc.)
2. Re-used data.
3. [Finite-state](https://en.wikipedia.org/wiki/Finite-state_machine) NLP generation: the system's response is based on a template with some placeholders, plus some knowledge of their fillers and of any semantic or syntactic properties that would require a different message. This is what your system does when the message _Compressing 2 files_ changes into _Compressing 1 **file**_.
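
The template-based case can be sketched in a few lines. The function name is hypothetical, and the only "knowledge of the filler" encoded here is English number agreement:

```python
def compression_message(n_files):
    """Fill the template 'Compressing N file(s)' with the right noun form."""
    noun = "file" if n_files == 1 else "files"
    return "Compressing {} {}".format(n_files, noun)


print(compression_message(2))  # Compressing 2 files
print(compression_message(1))  # Compressing 1 file
```

Languages with richer morphology (several plural forms, gender or case agreement) need more states than this single singular/plural switch.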


### Automatic text summarization
1. Given a text, return a new text with a length that is a user-defined fraction of the length of the input, and as much of its original content as possible.
2. Some examples:
  * Action items from a meeting.
  * Summary of an email thread.
  * Paper abstracts.
  * List of events in a series of documents.
3. How?
  1. **Unsupervised**. Unless specified otherwise, some variation over a TF-IDF-weighted algorithm.
  2. **Supervised**. If there is a corpus of _reference summaries_ available.
4. Extractive and abstractive summarization.
  * Extractive: the output consists of sentences from the original text.
  * Abstractive: the output involves some level of Natural Language Generation.
5. __Maximal Marginal Relevance__. Iteratively and dynamically choose the next sentence that should go into the summary given the current activation state based on previous sentences. [Example](https://techcrunch.com/2016/07/17/softbank-is-reportedly-bidding-to-buy-chip-giant-arm-for-31-billion/).
6. Non-obvious evaluation.
  * Proper evaluation data is hard to come by, sometimes due to lack of resources and other times due to the intrinsic difficulty of providing a correct answer. As in the case of __machine translation__, many answers are actually possible, and metrics such as BLEU or ROUGE are used.
  * _It works and we are happy with it, but there is no theoretical guarantee it is the best._
7. We know we want it to involve understanding:
  * The summary of a text about a person should probably include biographical data about that person.
  * The summary of a text about a news item should probably answer what, when, where.
  * The summary of a text about medical research should include the sample, the methodology, and the results.
  * ... and so on.
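
A rough sketch of extractive summarization with Maximal Marginal Relevance, as described in the list above. It assumes plain bag-of-words vectors and cosine similarity (a TF-IDF weighting could be swapped in); the function names, the `trade_off` parameter and the toy sentences are illustrative choices:

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def mmr_summary(sentences, n_sentences=2, trade_off=0.5):
    """Pick sentences that are relevant to the document but not redundant."""
    vectors = [bag_of_words(sentence) for sentence in sentences]
    document = bag_of_words(" ".join(sentences))
    selected = []
    candidates = list(range(len(sentences)))

    def mmr_score(i):
        relevance = cosine(vectors[i], document)
        # Similarity to the most similar sentence already in the summary.
        redundancy = max((cosine(vectors[i], vectors[j]) for j in selected),
                         default=0.0)
        return trade_off * relevance - (1 - trade_off) * redundancy

    while candidates and len(selected) < n_sentences:
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    # Return the chosen sentences in their original order.
    return [sentences[i] for i in sorted(selected)]


text = [
    "The cat sat on the mat.",
    "The cat sat on the mat today.",
    "Dogs bark at night.",
]
print(mmr_summary(text, n_sentences=2))
```

With these toy sentences the summary keeps only one of the two near-duplicates and uses the second slot for the unrelated sentence, which is the behavior the redundancy term is meant to produce.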


### Text normalization
1. Spellchecking
  * Re-capitalization.
2. Autocomplete.
3. Swipe-style keyboards for smartphones.

### Question and Answer (Q&A/*QA*) systems (NLP wrapper around Information Retrieval)
1. Return the answer to a query expressed in natural language or, more importantly, admit that the answer is not known.
3. From [Eliza](http://manifestation.com/neurotoys/eliza.php3/) to [Viv](https://www.youtube.com/watch?v=Rblb3sptgpQ). Still not at a HAL-stage.
  * Eliza was just a set of insightfully vague templates with some cleverly unspecific Natural Language Generation built into it. We must not underestimate the power of vagueness in human communication -__many politicians make their entire careers as objects of the class *Eliza*__ :D
2. Examples
   * _What are the most recent results in the field of Q&A?_
   * _Whose idea was the pub quest?_
   * _How can I reduce stress?_
2. An example like the first one in the list above should be translated into

          candidates = SELECT ID FROM scientific_disciplines sd WHERE (
              sd.area REGEXP '^q.*a$' AND         --  An entity detected as 'AREA' (of knowledge).
              sd.subject REGEXP '^results?$'      --  A tag assigned by a document classification algorithm.
          ) ORDER BY sd.year DESC                 --  An entity detected as 'YEAR'.
  * Will normally benefit from Named Entity Recognition (NER).
  * Will normally benefit from having available a database with relevant data (_=taxonomy, ontology_)
  * Will normally start by rewriting the user query into a statement and then trying to match it to documents in the dataset.
  * After rewriting, a sequential n-gram model (Hidden Markov Model) scores the similarity between the rewritten question and each candidate answer.
     * Sparsity can easily hurt performance for high _n_'s.
     * The user’s question is often syntactically close to the actual answer: _Where is[1] the[2] White[3] House[4]?_ > _The[2] White[3] House[4] is[1] in Washington DC._
4. Performance.
  * Depends heavily on question type, ranging from 0.6 (_why_- or _how_-questions) to 0.9 (dates, strongly statistically associated and unambiguous concepts, etc.).
  * Sometimes web data yields better results than standard evaluation datasets.
  
5. [An applied example.](http://nbviewer.jupyter.org/github/JordiCarreraVentura/question_answer/blob/master/Question%20and%20Answer%20assignment.html#Question-and-Answer-assignment)
  * Nearly 90% accuracy, but the dataset is probably too small so we should not be over-confident.
  * State-of-the-art is 84-85% Mean Reciprocal Rank on the [TREC evaluation task](http://aclweb.org/aclwiki/index.php?title=Question_Answering_(State_of_the_art)) of the ACL (Association for Computational Linguistics), which is like the NASA of NLP.
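
The query-rewriting step mentioned above (_Where is the White House?_ > _The White House is ..._) can be sketched as follows. This is a deliberately naive illustration: the regular expression only covers a few question shapes, matching is done by simple prefix comparison rather than an n-gram model, and both function names are made up. When nothing matches, the system admits it does not know by returning `None`:

```python
import re

QUESTION_PATTERN = re.compile(
    r"^(where|who|what)\s+(is|are|was|were)\s+(.+?)\??$", re.IGNORECASE)


def rewrite_question(question):
    """Turn 'Where is X?' into the statement prefix 'X is'."""
    match = QUESTION_PATTERN.match(question.strip())
    if match is None:
        return None
    _, verb, subject = match.groups()
    return "{} {}".format(subject, verb.lower())


def find_answer(question, documents):
    """Return the first document that starts with the rewritten question."""
    prefix = rewrite_question(question)
    if prefix is None:
        return None
    for document in documents:
        if document.lower().startswith(prefix.lower()):
            return document
    return None


documents = ["The White House is in Washington DC."]
print(rewrite_question("Where is the White House?"))        # the White House is
print(find_answer("Where is the White House?", documents))  # The White House is in Washington DC.
print(find_answer("How can I reduce stress?", documents))   # None
```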


#### A kick-ass Q&A example
[University professor’s teaching assistant was an AI](http://www.smh.com.au/technology/innovation/professor-reveals-to-students-that-his-assistant-was-an-ai-all-along-20160513-gou6us.html)

Are you getting any ideas? :) (=_Lviv Data Science Summer School Bot_)

#### Many other prominent examples
- [Maluuba](http://www.maluuba.com)
- [Siri](https://www.apple.com/ios/siri/)
- [Google Now](https://www.google.com/search/about/learn-more/now/)
- [Cortana](http://www.windowscentral.com/cortana)
- [Amazon Echo](https://www.amazon.com/Amazon-Echo-Bluetooth-Speaker-with-WiFi-Alexa/dp/B00X4WHP5E?ie=UTF8&*Version*=1&*entries*=0)/[Amazon Alexa](https://www.engadget.com/2015/07/03/amazon-echo/)

##### ... even in Spain!
- [Hutoma](http://www.hutoma.com)
- [Inbenta](https://www.inbenta.com/en)

### Natural Language Understanding (NLU)

This is it :) That is the ultimate goal and what all the major companies have their own teams working on right now. Behind every virtual assistant, there is a team of people taking care of the knowledge-engineering.

1. Entity extraction

       PERSON[Jon Snow] is a character in SHOW[Game of Thrones].

2. Parsing: relation extraction

       Jon Snow is a ROLE[character+{in, of, at, from}] Game of Thrones.

3. Tuple/Triple/n?-ple extraction

       IsCharacter('Jon Snow', 'Game of Thrones')

4. Semantic relations (easy to transform into [OWL](https://en.wikipedia.org/wiki/Web_Ontology_Language)/[RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework)-type ontologies).

5. Could already answer a few different questions:

       Who are the main characters in Game of Thrones?
       Is Jon Snow a character from Game of Thrones?
       Who is Jon Snow?
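
The steps above can be sketched end to end for this one sentence pattern. It is a toy illustration: a single hand-written regular expression stands in for entity and relation extraction, and the helper names are hypothetical. Real systems rely on trained NER models and parsers.

```python
import re

# One hand-written pattern: "<PERSON> is a character {in, of, at, from} <SHOW>."
CHARACTER_PATTERN = re.compile(
    r"^(?P<person>.+?) is a character (?:in|of|at|from) (?P<show>.+?)\.?$")


def extract_triple(sentence):
    """Return ('IsCharacter', person, show), or None if the pattern does not apply."""
    match = CHARACTER_PATTERN.match(sentence.strip())
    if match is None:
        return None
    return ("IsCharacter", match.group("person"), match.group("show"))


def characters_in(show, triples):
    """Answer: who are the characters in <show>?"""
    return [person for relation, person, s in triples
            if relation == "IsCharacter" and s == show]


def is_character(person, show, triples):
    """Answer: is <person> a character from <show>?"""
    return ("IsCharacter", person, show) in triples


triples = [extract_triple("Jon Snow is a character in Game of Thrones.")]
print(triples[0])                                           # ('IsCharacter', 'Jon Snow', 'Game of Thrones')
print(characters_in("Game of Thrones", triples))            # ['Jon Snow']
print(is_character("Jon Snow", "Game of Thrones", triples)) # True
```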

2. A huge number of applications:
  * Structure content according to search parameters for easier browsing (any company with a big catalogue of user-generated data needs this): people, products, jobs, institutions, etc.
  * Extending in-house knowledge-bases with an endless stream of recent publications/up-to-date data.

3. Huge potential gains in many big industries:
  * Healthcare (lots of research papers and data on clinical trials)
  * Finance (lots of financial reports)
  * Legal (tons of patents)
  * Human capital management and counter-terrorism

4. Already massive datasets available:
   * Relations
     * [Freebase](https://developers.google.com/freebase/). Google took it offline but the last version of the data is still downloadable; a 32 GB .zip file full of relation triples :) An NLP scientist's dream come true!
     * [ConceptNet](http://conceptnet5.media.mit.edu).
   * Entities or concepts
     * [Semantically Enriched Wikipedia](http://lcl.uniroma1.it/sew/), because WordNet is too small :)
     * [BabelNet](http://babelnet.org).
     * [OmegaWiki](http://www.omegawiki.org).
     * [FrameNet](https://framenet.icsi.berkeley.edu/fndrupal/about)
     * Kaggle, Yelp, Microsoft, Google challenges/competitions.
     * [Linguistic Data Consortium](https://www.ldc.upenn.edu).

##### A cool example of fully-developed knowledge acquisition pipeline
Carnegie Mellon University's [__Never-Ending learning system__](http://www.cs.cmu.edu/~tom/pubs/NELL_aaai15.pdf).

__NOTE:__ By the way, Carnegie Mellon University, together with Stanford University and the University of Edinburgh, are the research centers doing some of the coolest research on NLP. You probably want to read anything published by people working there.


##### An extremely inspiring and socially relevant use case: "Manolo"
__ScrapingHub's blog__. [How web-scraping reveals corruption](https://blog.scrapinghub.com/2016/03/09/how-web-scraping-is-revealing-lobbying-and-corruption-in-peru/). Peruvian journalists are going through **Manolo**'s data in order to find cases of corruption and report them. There is money in serving ads on search results, but __Manolo__ can make the world a better place.



## Summary

* NLP is about creating systems that can understand people when they speak using their own words, and that can talk back to them using a similar language.
* Even a shallow understanding of language, however, can already provide a competitive advantage in a commercial environment, and many companies exist that develop applied NLP solutions. Any task involving linguistic data and repetition in the workplace can probably be automated, like answering recurring questions about a laptop's technical specifications on Amazon, or translating instruction manuals that contain very simple and repeating linguistic structures. The philosophy behind NLP is that no humans should be harmed to perform anything that can be done by a computer.
* NLP is a subfield of __engineering__ (because it involves automation), __linguistics__ (because it involves the formal study of human languages), and __computer science__ (because it involves machine learning). The machine learning component already takes care of the part about receiving some input and performing a human-like action as a response. NLP is required, however, to map linguistic inputs to a suitable formal representation. In that sense, NLP is about formalizing language and using machine learning methods to aid in that task.
* Down to its bare minimum, NLP is about __extracting structured knowledge from unstructured (user-generated) data__, and using it to train systems that can react to linguistic stimuli in the same way as a human would.