# What is Natural Language Processing?

In order to understand the meaning of the term _Natural Language Processing_ (__NLP__ for short[1]), it is a good idea to start by looking at the words that make up the term.

[1] Also known as _computational linguistics_, _text analytics_ and _text mining_. These terms are largely synonymous.


## _Processing_
The term _processing_ refers to the fact that we will be using machines to do the work. That is, our goal is to automate. By definition, automation implies machines, and machines imply engineering, which is why NLP is, in that regard, a subfield of computer engineering. But not only that.

## _Language_
The term _natural language_ refers to what we want to automate. A _language_ is a means of communication where there is an arbitrary pairing of signals (X) and their meaning (Y).


### ... or, put in other code,


    # Data for Machine Learning: (stimulus, response) pairs.
    # The last pair acts as the fallback for unseen inputs.
    training_data = [
        ('hello world', 'groovy'),
        ('*', 'not so groovy')
    ]


    # Data for Computational Linguistics
    dictionary_of_accepted_terms = ['hello world', 'foo', 'bar']


    # Definition of each field of study:
    def computational_linguistics(string):
        """Map a noisy string to the most similar accepted term, if any."""
        forms = []
        for w in dictionary_of_accepted_terms:
            # Distinct characters shared by both strings,
            # relative to the length of the accepted term.
            sim = len(set(string.lower()).intersection(set(w))) / float(len(w))
            if sim >= 0.5:
                forms.append((sim, w))
        if forms:
            return sorted(forms)[-1][1]
        else:
            return string


    def machine_learning(text):
        """Return the learned response, or the fallback for unknown inputs."""
        for stimulus, response in training_data:
            if text == stimulus:
                return response
        return training_data[-1][1]


    def natural_language_processing(test):
        # Natural Language Processing is just a wrapper over Machine Learning
        # in order to handle linguistic inputs.
        _test = cl(test)
        if not _test:
            return None
        return ml(_test)


    cl = computational_linguistics
    ml = machine_learning
    nlp = natural_language_processing


    for test in [
        'hello world',
        'helloworld', 'hell o wor ld',
        'undefined', 'asdfasdasdasdf'
    ]:
        print('"%s"' % test)
        print('  %s = %s' % ('-nlp', ml(test)))
        print('  %s = %s' % ('+nlp', nlp(test)))
        print()

Output:

    "hello world"
      -nlp = groovy
      +nlp = groovy

    "helloworld"
      -nlp = not so groovy
      +nlp = groovy

    "hell o wor ld"
      -nlp = not so groovy
      +nlp = groovy

    "undefined"
      -nlp = not so groovy
      +nlp = not so groovy

    "asdfasdasdasdf"
      -nlp = not so groovy
      +nlp = not so groovy



This may be the first time someone has reduced a bunch of scientific disciplines to a few Python functions :D but I have the feeling that the idea of being able to do just that some day is the main reason why most of you are reading this. Now let's go a bit deeper into the definition, in a slightly 'verbose=True' manner ;)


#### _Natural language_
The adjective _natural_ stands in opposition to _artificial_ (man-made): it distinguishes languages that humans design explicitly to be perfect (i.e., non-redundant and unambiguous, like math, musical notation, traffic lights or Python) from languages that have not been designed that way, the _natural languages_.

Natural languages develop spontaneously over time, simply as a result of being used by speakers, and mostly as a function of 1) the way they are learned by children in one generation and passed on to the next and 2) historical accidents happening to that language (_twerk_ is a historical accident that will now be in English for some time before it can be washed away :)

#### Actually just _messy_ language
From our previous point it follows that languages are subject to standard evolutionary processes like those happening in biological systems, and that results in a number of interesting properties that can be roughly summarized by saying that natural languages are __really messy__. Below is a more detailed list:

1. __Existing linguistic units change over time__. Their form changes (irregular verbs turn into regular verbs, BrEn _learnt_ > AmEn _learned_) but, more importantly, their meaning changes and adapts to new realities: ten years ago, the word _notebook_ would have never been applied to something that has no pages and needs electricity to write on it.

2. __New linguistic units are added all the time__: new words are invented as human realities change and almost every time Steve Jobs opened his mouth :D The vocabulary expands at a very fast pace and an important part of human linguistic behavior is to be able to deduce the meaning of unknown words based on their usage and on contextual clues, as well as by inference (generalizing from known attributes of similar words). Understanding language involves a lot of guessing all the time.

3. When linguistic units are produced, they usually __contain errors__. Try typing something on your smartphone and see what happens :) For humans, spoken communication is generally easier and faster, but everyone has trouble pronouncing a word occasionally. Over time, some errors become so widespread that they end up being the accepted way of saying something:
>   __mobile phone__
>     * mobile(1)  *phone that moves
>     * mobile(2)   phone that is easily portable (so what do we have the word "portable" for?)

4. __Most linguistic units denote ideas rather than things__. Whereas it is relatively easy for a robot to perceive a ball using some sensors and push it around a room, it is much more difficult to teach a computer the meaning of a term like _civil liberties_. We still have no effective representation for a large part of the knowledge that we use every day to understand language, and some of those ideas are even explicitly encoded in language itself (it is generally challenging to come up with an exhaustive explanation of the meaning and usage of the present perfect tense, or of polite forms in the languages that have them, like German or Italian). 

5. __Hierarchical structure__. Languages work at different levels of abstraction, which require different levels of resolution:
>    _Characters < words < sentences < documents_

Currently, technologies working at the __document__ level provide fairly good results, and systems working at the __sentence__ level achieve very good performance depending on the task. However, at the __word__ level there is still a lot of room for improvement. Most available technologies rely heavily on having a wider linguistic context available to generate results, or on hand-crafted information.
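
The levels listed above can be made concrete with a rough sketch that splits one document into sentences, words and characters. The regular expressions and the function name `levels` are assumptions made for this illustration, not a robust tokenizer (real sentence splitting has to deal with abbreviations, quotes and so on).

```python
import re


def levels(document):
    """Split a document into the units of each level (naive, English-like text)."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', document.strip()) if s]
    # Assumption: a word is a run of letters, digits or apostrophes.
    words = [re.findall(r"[\w']+", s) for s in sentences]
    # One list of characters per word, across all sentences.
    characters = [list(w) for sent in words for w in sent]
    return {'sentences': sentences, 'words': words, 'characters': characters}


doc = "That movie was terrible. I did not like anything about this film!"
units = levels(doc)
print(len(units['sentences']))   # number of sentences
print(units['words'][0])         # words of the first sentence
print(units['characters'][0])    # characters of the first word
```

Each level throws away the context of the level above it, which is one way to see why a single word is harder to interpret than a whole document.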


So, when a computer looks at it:

1. __Human language is inherently very high-dimensional and sparse.__ From a computer's perspective, languages consist of too many different words: most of them appear very rarely, a few of them appear too often, and many of them, when they appear, do not always appear with the same meaning!!! That last part is known as the modulation of meaning in context, or __semantic modulation__, and it means that words modify each other's meaning (which is, in turn, the cornerstone of the [Distributional principle](https://en.wikipedia.org/wiki/Distributional_semantics)).
This is true of virtually all words but some of them actually __require__ modulation: intrinsically ambiguous words such as _firm_ (_solid_ as an adjective, a type of business when used as a noun) are like Schrödinger's cat: their meaning is linguistically undefined until they are observed in a context that allows the speaker to disambiguate them: _He works at a law firm_ versus _The government has a firm control of the situation_.

2. __Human language is (unnecessarily?) redundant.__ In language, there are many different ways to express the same thought (or the same overall idea):
   * _That movie was terrible._
   * _The movie was awful._
   * _I did not like anything about this film._
   * _Last night at the cinema -it was awful!_
   * _What did I just watch?_
   * _I am never getting that hour and a half of my life back._
   * _Who would do something like this?_
   * _I almost had a mental stroke watching that film._
   * ... and more.

For a computer, these all mean:

    SENTIMENT("movie", -5)

The differences are stylistic and colorful, and humans can handle them relatively effortlessly: decoding all these variants is a type of mental gymnastics that our brains like to do and generally find interesting. People who can come up with more creative and surprising ways of making an assertion are usually regarded as witty and can use that to attract other people's attention, which is extremely important during an information exchange. For computers, however, __those people are simply annoying :)__

Despite all the variation above, the appropriate behavior could still be triggered by the latter, more abstract representation: *__Computer, do not recommend me movies like this ever again__*. We want our system to get to the abstract interpretation, to the same conclusion, from any of the starting points above. Doing that in each case requires a varying number of intermediate steps.
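
To illustrate why the variants above need a varying number of intermediate steps, here is a minimal sketch of a lexicon-based scorer. The lexicon, its scores and the function name `sentiment` are invented for this example; the only point is that literal variants are caught by a plain word lookup, while figurative ones are not.

```python
import re

# Hypothetical toy lexicon: word -> polarity score (values are invented).
LEXICON = {'terrible': -5, 'awful': -5}


def sentiment(sentence):
    """Return the lowest lexicon score found in the sentence, or None."""
    words = re.findall(r"[a-z']+", sentence.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return min(scores) if scores else None


variants = [
    'That movie was terrible.',
    'The movie was awful.',
    'What did I just watch?',
    'I almost had a mental stroke watching that film.',
]
for v in variants:
    print('%-50s %s' % (v, sentiment(v)))
```

The first two variants get a score of -5; the last two return `None`, even though a human reads all four as equally negative.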

Take for instance the last variant, _I almost had a mental stroke watching that film_. The ability to detect the correct interpretation here would require the system to be aware of a number of assumptions:

Assumption | Topic | Description | Formalization
--- | --- | --- | ---
1 | Stroke | People can suffer mental strokes. | *HAS(person, stroke) && IS_OF_TYPE(stroke, mental) && ...*
2 | Stroke | Mental strokes are highly negative. | *SENTIMENT(stroke, -5)*
3 | Watching | People watch films. | *SAME_AS(film, movie) && WATCH(person, movie)*
4 | Watching | The action of watching consists in receiving visual input. |
5 | Watching | Receiving visual input is a simple thing for humans and rarely causes permanent brain damage. |
6 | Language | People sometimes do not speak literally, which involves a partial[2] violation of linguistic meaning. |
7 | Language | When people do not speak literally, sometimes they exaggerate. |
8 | Language | When people exaggerate, they make a stronger statement than is literally correct, but one that they expect the listener to recognize as such. |
9 | Language | When the "error" is resolved by the listener, the differential between the strength of the non-literal expression and the inferred, literal one, can be seen as additional emphasis. | 

[2] A total violation of linguistic rules would make communication impossible.


That is a high-level summary of all the knowledge required to correctly interpret the example above (a detailed representation would need e.g. further decomposing "movie" into some kind of annotation such as [+event, +psychological_content], or sometimes into a database entry like {'year': int, 'director': str, 'duration (minutes)': int}). Over the course of their early development, up until their adolescence, and probably borrowing also from long-evolved instincts, humans seem able to easily master assumptions 6-9, get exposed to tons of examples of facts such as those in 1-5, and can easily derive many more through inference. Computers, however, need to be taught all of this explicitly, and we still lack a suitable representation for most of it.
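
As a rough illustration of what teaching a computer "explicitly" could look like, the sketch below stores a few of the formalized assumptions from the table as plain Python data and uses `SAME_AS` to carry a fact about _movie_ over to _film_. The data layout and the names `FACTS`, `SENTIMENT`, `canonical` and `holds` are assumptions made for this example, not an established representation.

```python
# Facts as (predicate, arg1, arg2) triples, mirroring the table's notation.
FACTS = {
    ('HAS', 'person', 'stroke'),
    ('IS_OF_TYPE', 'stroke', 'mental'),
    ('SAME_AS', 'film', 'movie'),
    ('WATCH', 'person', 'movie'),
}

# Sentiment scores attached to concepts (assumption 2 in the table).
SENTIMENT = {'stroke': -5}


def canonical(term):
    """Resolve a term through SAME_AS facts (one step is enough here)."""
    for predicate, a, b in FACTS:
        if predicate == 'SAME_AS' and a == term:
            return b
    return term


def holds(predicate, a, b):
    """Check whether a fact is known, after resolving synonyms."""
    return (predicate, canonical(a), canonical(b)) in FACTS


print(canonical('film'))                 # resolved through SAME_AS
print(holds('WATCH', 'person', 'film'))  # known via WATCH(person, movie)
print(SENTIMENT.get('stroke'))
```

Nothing in this sketch covers assumptions 4-9; those are precisely the ones for which a suitable representation is still missing.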

For most purposes, we are currently still at assumption 3 in the table above but working hard on assumption 4, with lots of great research teams doing very interesting early work (mostly through manual annotation and validation by humans combined with automated extraction and generalization workflows). However, it would seem that most of the work still needs to be done. Assumptions such as 6-9, on the other hand, still seem a bit far away. During the rest of the course we will be looking into all this in more detail.

In this sense, the adjective _natural_ in natural language also opposes the adjective _artificial_ in _artificial intelligence_, and it is the reason why natural language is an area of work within artificial intelligence: computers, artificial systems, cannot yet understand natural language without help, and there is a whole area of research on how to make that possible. That is the focus of NLP, and it lies right at the __intersection of engineering, linguistics, and machine learning__.


### [Probably unnecessary]
Computational linguistics --- Natural Language Processing --- Machine learning

Engineering + Science => Machine Learning + {Linguistics} > Natural Language Processing


* Computational Linguistics is language preprocessing (formalization) for Machine Learning, so that Machine Learning can be applied to language: it takes a messy real-world object, language, and transforms it into the data type expected by ML as input.
* Currently, most researchers do both simultaneously, and that is called Natural Language Processing: ML (algorithms) + CL (linguistic data handling).
* Usually, when a data science department outsources some annotation task, they ask CL people to do it.
* There is a pre-computational era of Linguistics, which is still about formalizing natural language data.







## Applications and examples

### Speech recognition
 > Development of technology that enables electronic systems to transform spoken language into text.

Adapted from [Wikipedia](https://en.wikipedia.org/wiki/Speech_recognition)


### Machine translation

> At one level, MT simply replaces words in one language (source language) with words in another language (target language). However, the mapping between the words in each language is not bi-univocal (one-to-one): sometimes, one source word translates into more than one target word; other times, two source terms translate into a single target term. Apart from that, translation exhibits the full range of complexity in natural languages, which means that, strictly speaking, deep linguistic understanding would be necessary to perform automated translation in a well-founded way.

> Most commercial machine translation is successful thanks to customization and focus on specific tasks/domains, decreasing error rates by limiting the scope of the potential substitutions. This technique is particularly effective with formulaic language such as legal documents. Informal, spontaneous conversation is much more challenging and machine translation of that type of input is currently an unsolved problem.

Adapted from [Wikipedia](https://en.wikipedia.org/wiki/Machine_translation)
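
The word-for-word view of MT described above, and the reason it breaks down, can be sketched with a toy lookup table. The English-to-Spanish entries and the function names `translate_word_by_word` and `translate_with_phrases` are assumptions chosen for illustration only; no real MT system is this simple.

```python
# Toy English -> Spanish lexicon (hypothetical, for illustration only).
LEXICON = {
    'the': 'el',
    'lake': 'lago',
    'is': 'es',
    'shallow': 'poco profundo',   # one source word -> two target words
    'ice': 'hielo',
    'cream': 'crema',
}
# Two-word source terms that map to a single target word.
PHRASES = {('ice', 'cream'): 'helado'}


def translate_word_by_word(sentence):
    """Replace each source word independently; unknown words are kept."""
    return ' '.join(LEXICON.get(w, w) for w in sentence.lower().split())


def translate_with_phrases(sentence):
    """Same lookup, but try two-word phrases before single words."""
    words = sentence.lower().split()
    out, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            out.append(PHRASES[pair])
            i += 2
        else:
            out.append(LEXICON.get(words[i], words[i]))
            i += 1
    return ' '.join(out)


print(translate_word_by_word('the lake is shallow'))
print(translate_word_by_word('ice cream'))
print(translate_with_phrases('ice cream'))
```

The plain lookup turns _ice cream_ into _hielo crema_, while the phrase-aware version yields _helado_. Every such mismatch needs its own entry, which hints at why restricting the domain lowers error rates.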


### Text normalization
##### Spellchecking
##### Autocomplete
##### Swipe keyboards for smartphones

### Entity extraction
  Corrupción en Perú ("Corruption in Peru", ScrapingHub)
  
### Document classification
(2Do: find state-of-the-art)
  Assist.ai

Xeneta: [Boosting Sales With Machine Learning](https://medium.com/xeneta/boosting-sales-with-machine-learning-fbcf2e618be3#.ynm031lo2) (NLP to qualify deals)

> In this blog post I’ll explain how we’re making our sales process at Xeneta more effective by training a machine learning algorithm to predict the quality of our leads based upon their company descriptions.
>
> Given a company description, can we train an algorithm to predict whether or not it’s a potential Xeneta customer?

[Classification tutorial]
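
To give a feel for the question in the quote, here is a minimal sketch of a text classifier: a multinomial Naive Bayes model with add-one smoothing, written with the standard library only. This is not the method from the blog post; the training descriptions, the labels and the function names `train` and `predict` are all invented for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training set: (company description, label). Invented data.
TRAIN = [
    ('global freight forwarding and container shipping', 'customer'),
    ('ocean freight logistics for importers', 'customer'),
    ('mobile games studio and app development', 'other'),
    ('online fashion store and retail app', 'other'),
]


def train(examples):
    """Count words per label for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in text.lower().split():
            if w in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][w] + 1) /
                                  (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict('container shipping and freight rates', model))
print(predict('a studio building retail apps', model))
```

With four training sentences the model only reacts to words it has already seen, so it treats _apps_ and _app_ as unrelated; that is exactly the kind of gap that linguistic preprocessing is meant to close.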

##### Authorship Attribution
A style marker, which relies on sequential rules, versus function-word frequency, which relies on "bag-of-words assumptions." It seems the authors set out to find better results with the rule-based method, but found instead that the bag-of-words assumptions yield much higher accuracy, as previous research had also found.
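
To make the bag-of-words side of that comparison concrete, here is a minimal sketch of function-word frequency features. The short function-word list and the function name `function_word_profile` are assumptions made for this example; an actual study would use a much longer list and compare profiles across authors.

```python
import re
from collections import Counter

# Assumption: a tiny, hand-picked list of English function words.
FUNCTION_WORDS = ['the', 'of', 'and', 'to', 'a', 'in', 'that', 'it']


def function_word_profile(text):
    """Relative frequency of each function word, ignoring word order."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


profile = function_word_profile('The cat sat in the hat and it sat on the mat.')
print(dict(zip(FUNCTION_WORDS, profile)))

# Shuffling the words leaves the profile unchanged: that is the
# "bag-of-words assumption" in practice.
shuffled = function_word_profile('the mat the hat the cat sat sat in on and it')
print(shuffled == profile)
```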

### Sentiment analysis (2Do: find state-of-the-art)
Classification
+parsing
+Aspect-based sentiment analysis/Attribute-based understanding


### Question and Answer systems (Information Retrieval)
(2Do: find state-of-the-art)

University professor’s teaching assistant
A custom implementation of the Teaching Assistant.
http://www.smh.com.au/technology/innovation/professor-reveals-to-students-that-his-assistant-was-an-ai-all-along-20160513-gou6us.html

Vs. other bots (Eliza-like, literal pattern-like)


The aspects of NLP most frequently involved are analysis of learners’ responses, feedback provision, automated generation of exercises, and the monitoring of learning progress. Other aspects related to learning and teaching also involve NLP, such as plagiarism detection, writing support, use of learner corpora or parallel corpora to detect and resolve errors, or adaptive learning systems integrating ontologies for the associated domains.

The contribution of NLP to these systems is generally regarded as positive. It must be recognized, however, that only a handful of such applications have made it to the general public as commercial software. In most cases, the systems never left the laboratory and have a limited range of use, sometimes only as a proof of concept. Is this due, as many believe, to the high production cost of NLP resources? Is it because of the current quality of NLP results? Is it a consequence of the integration strategy of NLP into these applications?


###### Virtual assistants
(QA(NER, Class, norm, parsing) + NLG)
(2Do: find state-of-the-art)
  Inbenta
  Maluuba
  Siri
  Google Now
  Cortana
  
### Automatic summarization

### Natural language generation
Canned
Re-used
Generative


## Summary

* NLP is about creating systems that can understand people when they speak using their own words, and that can talk back to them using a similar language.  
* A deep understanding of language is not trivial and is an open research topic that occupies many talented researchers around the world (it is also one of the most fascinating fields of study, IMHO :)  
* Even a shallow understanding of language, however, can already provide a competitive advantage in a commercial environment, and many companies exist that develop applied NLP solutions. Any task involving linguistic data and repetition in the workplace can probably be automated, like answering recurring questions about a laptop's technical specifications on Amazon, or translating instruction manuals that contain very simple and repeating linguistic structures. The philosophy behind NLP is that no humans should be harmed to perform anything that can be done by a computer.
* NLP is a subfield of engineering (because it involves automation), linguistics (because it involves the formal study of human languages), and computer science (because it involves machine learning). The machine learning component already takes care of the part about receiving some input and performing a human-like action as a response. NLP is required, however, to map linguistic inputs to a suitable formal representation. In that sense, NLP is about formalizing language and using machine learning methods to aid in that task.
* Down to its bare minimum, NLP is about extracting structured knowledge from unstructured (user-generated) data, and using it to train systems that can react to linguistic stimuli in the same way as a human would.