# **Natural Language Processing with Python**
Revision notes for [AI Bootcamp](https://instituteofcoding.org/skillsbootcamps/course/skills-bootcamp-in-artificial-intelligence/) W08-W11. <br>

Content based on the [NLTK book](https://www.nltk.org/book/). <br>

You can find Chapter 1 [here](https://www.nltk.org/book/ch01.html).

# CONTENTS

1. Language Processing and Python
    1. Computing with Language: Texts and Words
    2. A Closer Look at Python: Texts as Lists of Words
    3. Computing with Language: Simple Statistics
    4. Back to Python: Making Decisions and Taking Control
    5. [Automatic Natural Language Understanding](#AutoNLU)
        1. [Word Sense Disambiguation](#WordSense)
        2. [Pronoun Resolution](#PronounResolution)
        3. [Generating Language Output](#GLO)
        4. [Machine Translation](#Translation)
        5. [Spoken Dialogue Systems](#DialogSystems)
        6. [Textual Entailment](#Entailment)
        7. [Limitations of NLP](#NLPLimitations)


**Install**, **import** and **download NLTK**. <br>

*Uncomment the `!pip install nltk` and `nltk.download()` lines if you haven't installed NLTK and downloaded its data yet.*

In [2]:
# install nltk
#!pip install nltk

# load nltk
import nltk

# download nltk
#nltk.download()

Load all items (9 texts) from **NLTK's book module**.

In [3]:
# load all items from NLTK’s book module.
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


<a name="AutoNLU"></a>
## 1.5 Automatic Natural Language Understanding
1. [Word Sense Disambiguation](#WordSense)
2. [Pronoun Resolution](#PronounResolution)
3. [Generating Language Output](#GLO)
4. [Machine Translation](#Translation)
5. [Spoken Dialogue Systems](#DialogSystems)
6. [Textual Entailment](#Entailment)
7. [Limitations of NLP](#NLPLimitations)

A **long-standing challenge** within artificial intelligence has been to build intelligent machines, and a major part of intelligent behaviour is **understanding language**.

<a name="WordSense"></a>
### 5.1 Word Sense Disambiguation
In WSD we want to **work out which sense of a word** was intended in a given context (**contextual effect**).

Consider the ambiguous words serve and dish:
	
>**serve**: help with food or drink; hold an office; put ball into play<br>
**dish**: plate; course of a meal; communications device

We automatically **disambiguate words using context**, exploiting the simple fact that **nearby words have closely related meanings**.
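
The idea of using nearby words can be sketched with a simplified overlap count (in the spirit of the Lesk algorithm, not NLTK's implementation). The signature word sets below are hypothetical and hand-written for illustration; a real system would derive them from a dictionary or a sense-annotated corpus.

```python
# Hypothetical signature words for each sense (assumption, not from NLTK).
SENSES = {
    "serve": {
        "help with food or drink": {"food", "drink", "meal", "waiter", "dinner"},
        "hold an office": {"office", "term", "president", "committee", "elected"},
        "put ball into play": {"ball", "tennis", "court", "net", "player"},
    },
    "dish": {
        "plate": {"plate", "wash", "table", "broke", "cupboard"},
        "course of a meal": {"meal", "food", "served", "delicious", "dinner"},
        "communications device": {"satellite", "signal", "antenna", "tv", "roof"},
    },
}


def disambiguate(word, sentence):
    """Return the sense whose signature shares the most words with the context."""
    context = {token.strip(".,!?").lower() for token in sentence.split()}
    context.discard(word)
    scores = {
        sense: len(signature & context)
        for sense, signature in SENSES[word].items()
    }
    # On a tie, max() keeps the first sense listed.
    return max(scores, key=scores.get)


print(disambiguate("serve", "He will serve a second term as president"))
# -> hold an office
print(disambiguate("dish", "The satellite dish on the roof lost its signal"))
# -> communications device
```

The sketch only counts shared words, so it fails whenever the context contains none of the listed signature words.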

<a name="PronounResolution"></a>
### 5.2 Pronoun Resolution
A deeper kind of language understanding is to **work out "who did what to whom"** — i.e., to detect the **subjects and objects of verbs**.

Try to determine what was sold, caught, and found:
>The thieves stole the paintings. They were subsequently **sold**.<br>
The thieves stole the paintings. They were subsequently **caught**.<br>
The thieves stole the paintings. They were subsequently **found**.

The third case is **ambiguous**!

Answering this question involves finding the **antecedent of the pronoun** they, either thieves or paintings. 

Computational techniques for tackling this problem include **anaphora resolution** — identifying what a pronoun or noun phrase refers to — and **semantic role labeling** — identifying how a noun phrase relates to the verb (as agent, patient, instrument, and so on).
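
As a rough illustration of the thieves/paintings example, the toy resolver below filters candidate antecedents by what each verb can plausibly apply to. Both lookup tables and the function name `resolve_they` are hypothetical, hand-written assumptions; real anaphora resolution systems learn such preferences from data.

```python
# Hypothetical selectional preferences: the kinds of entity that can
# plausibly undergo each verb (hand-written for illustration only).
CAN_UNDERGO = {
    "sold": {"object"},
    "caught": {"person"},
    "found": {"person", "object"},
}

# Hypothetical semantic types for the candidate antecedents.
ENTITY_TYPE = {"thieves": "person", "paintings": "object"}


def resolve_they(verb, candidates=("thieves", "paintings")):
    """Return the candidates that could plausibly be the patient of `verb`."""
    allowed = CAN_UNDERGO[verb]
    return [noun for noun in candidates if ENTITY_TYPE[noun] in allowed]


for verb in ("sold", "caught", "found"):
    print(verb, "->", resolve_they(verb))
# sold -> ['paintings']
# caught -> ['thieves']
# found -> ['thieves', 'paintings']
```

Two surviving candidates for *found* mirror the ambiguity noted above: the verb alone does not settle the antecedent.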

<a name="GLO"></a>
### 5.3 Generating Language Output
If we can automatically solve such problems of language understanding, we will be able to move on to tasks that involve generating language output, such as **question answering** and **machine translation**. 

In the first case, a machine should be able to answer a user's questions relating to a collection of texts:

>**Text**: The thieves stole the paintings. They were subsequently sold.<br>
**Human**: Who or what was sold?<br>
**Machine**: The paintings.

The machine's answer demonstrates that it has correctly worked out that they refers to paintings and not to thieves. 

In the second case, the machine should be able to **translate** the text into another language, accurately **conveying the meaning** of the original text. 

In translating the example text into French, we are forced to choose the gender of the pronoun in the second sentence: ils (masculine) if the thieves are found, and elles (feminine) if the paintings are found. **Correct translation actually depends on correct understanding of the pronoun**.
		
>**Original text**: The thieves stole the paintings. They were subsequently found.<br>
**Translation 1**: Les voleurs ont volé les peintures. **Ils** ont été trouvés plus tard. (the thieves)<br>
**Translation 2**: Les voleurs ont volé les peintures. **Elles** ont été trouvées plus tard. (the paintings)


In all of these examples, working out the **sense of a word**, the **subject of a verb**, and the **antecedent of a pronoun** are steps in **establishing the meaning of a sentence**, things we would expect a language understanding system to be able to do.

<a name="Translation"></a>
### 5.4 Machine Translation
Today, practical translation systems exist for particular pairs of languages, and some are integrated into web search engines. However, these systems have some serious shortcomings, which are starkly revealed by translating a sentence back and forth between a pair of languages until equilibrium is reached, e.g.:

>1. how long **before** the next flight to Alice Springs? <br>
2. wie lang **vor** dem folgenden Flug zu Alice **Springs**? <br>
3. how long before the following flight to Alice **jump**? <br>
4. wie lang vor dem folgenden Flug zu Alice **springen** Sie? <br>
5. how long before the following flight to Alice **do you jump**?

The system correctly translates Alice Springs from English to German (line 2), but on the way back to English, this ends up as **Alice jump** (line 3). 

The translation system **did not recognize when a word was part of a proper name**, and it **misinterpreted the grammatical structure**.

[TranslationParty](https://www.translationparty.com/)

Machine translation is difficult because **a given word could have several possible translations** (depending on its meaning), and because **word order** must be changed in keeping with the grammatical structure of the target language.

Today these difficulties are being tackled by collecting massive quantities of parallel texts from news and government websites that publish documents in two or more languages.

Given a document in German and English, and possibly a bilingual dictionary, we can **automatically pair up the sentences**, a process called **text alignment**. Once we have a million or more sentence pairs, we can detect corresponding words and phrases, and build a model that can be used for translating new text.
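
Once sentences are paired, detecting corresponding words can be sketched with simple co-occurrence statistics. The example below scores word pairs with the Dice coefficient over a tiny hypothetical parallel corpus; it is a minimal sketch of the idea, not how production translation models are built, and the names `PAIRS`, `dice_scores` and `best_translation` are made up for illustration.

```python
from collections import Counter
from itertools import product

# A tiny hypothetical corpus of already aligned (English, German) sentences.
PAIRS = [
    ("the house is small", "das haus ist klein"),
    ("the house is big", "das haus ist gross"),
    ("the book is small", "das buch ist klein"),
    ("the book is big", "das buch ist gross"),
]


def dice_scores(pairs):
    """Score each (English, German) word pair by how often they co-occur."""
    en_count, de_count, joint = Counter(), Counter(), Counter()
    for en, de in pairs:
        en_words, de_words = set(en.split()), set(de.split())
        en_count.update(en_words)
        de_count.update(de_words)
        joint.update(product(en_words, de_words))
    return {
        (e, d): 2 * n / (en_count[e] + de_count[d])
        for (e, d), n in joint.items()
    }


def best_translation(word, scores):
    """Return the German word with the highest Dice score for `word`."""
    candidates = {d: s for (e, d), s in scores.items() if e == word}
    return max(candidates, key=candidates.get)


scores = dice_scores(PAIRS)
for word in ("house", "book", "small", "big"):
    print(word, "->", best_translation(word, scores))
# house -> haus
# book -> buch
# small -> klein
# big -> gross
```

Words that always appear together, such as *the* and *is* in this corpus, cannot be separated by co-occurrence alone, which is one reason large and varied parallel corpora are needed.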

<a name="DialogSystems"></a>
### 5.5 Spoken Dialogue Systems
In the history of AI, the chief measure of intelligence has been a linguistic one, namely the **Turing Test**: can a dialogue system, responding to a user's text input, perform so naturally that we cannot distinguish it from a human-generated response?



![SpokenDialogueSystems.PNG](attachment:SpokenDialogueSystems.PNG)

Dialogue systems give us an opportunity to mention the commonly assumed **NLP pipeline**.

The above figure shows the **architecture of a simple dialogue system**. 

Along the top of the diagram, moving from left to right, is a "pipeline" of some **language understanding components**. These map from **speech input** via syntactic parsing to some kind of **meaning representation**. 

Along the middle, moving from right to left, is the **reverse pipeline of components** for converting concepts to speech. These components make up the dynamic aspects of the system.

At the bottom of the diagram are some **representative bodies of static information**: the repositories of language-related data that the processing components draw on to do their work.
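
The three layers described above can be sketched in a few lines. This is a heavily simplified, text-only illustration: typed text stands in for recognised speech, a regular expression stands in for parsing, and the pattern, templates and function names are all hypothetical.

```python
import re

# Static information: a hypothetical destination pattern and response templates.
CITY_PATTERN = re.compile(r"\bto ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")
TEMPLATES = {
    "flight_query": "Looking up the next flight to {destination}.",
    "unknown": "Sorry, I did not understand that.",
}


def understand(utterance):
    """Understanding side: map text to a small meaning representation."""
    match = CITY_PATTERN.search(utterance)
    if "flight" in utterance.lower() and match:
        return {"intent": "flight_query", "destination": match.group(1)}
    return {"intent": "unknown"}


def generate(meaning):
    """Generation side: map a meaning representation back to text."""
    return TEMPLATES[meaning["intent"]].format(**meaning)


def respond(utterance):
    """Run the understanding pipeline, then the reverse (generation) pipeline."""
    return generate(understand(utterance))


print(respond("How long before the next flight to Alice Springs?"))
# -> Looking up the next flight to Alice Springs.
```

Keeping the pattern and templates apart from the two functions mirrors the diagram's split between dynamic processing components and static repositories of language data.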

For an example of a **primitive dialogue system**, run `nltk.chat.chatbots()`.

In [None]:
import nltk

# interactive: asks which chatbot to talk to, then starts a conversation
nltk.chat.chatbots()

<a name="Entailment"></a>
### 5.6 Textual Entailment
The challenge of language understanding has been brought into focus in recent years by a public "shared task" called **Recognizing Textual Entailment (RTE)**. 

Suppose you want to find evidence to support the hypothesis:

>Sandra Goudie was defeated by Max Purnell. 

and that you have another short text that seems to be relevant:

>Sandra Goudie was first elected to Parliament in the 2002 elections, narrowly winning the seat of Coromandel by defeating Labour candidate Max Purnell and pushing incumbent Green MP Jeanette Fitzsimons into third place. 

Does the text provide enough evidence for you to accept the hypothesis? In this particular case, the answer will be "No." You can draw this conclusion easily, but it is very hard to come up with automated methods for making the right decision. 

The RTE Challenges provide data that allow competitors to develop their systems, but not enough data for "brute force" machine learning techniques. Consequently, some **linguistic analysis** is crucial. 

In the previous example, it is important for the system to note that Sandra Goudie **names the person being defeated** in the hypothesis, **not the person doing the defeating** in the text. 
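
To see why surface matching is not enough, the sketch below computes a naive word-overlap score for the Goudie example. The function `word_overlap` is a hypothetical baseline written for illustration, not a method used in the RTE Challenges.

```python
def word_overlap(text, hypothesis):
    """Fraction of hypothesis words that also occur in the text."""
    def tokenize(s):
        return {w.strip(".,").lower() for w in s.split()}

    hyp_words = tokenize(hypothesis)
    return len(hyp_words & tokenize(text)) / len(hyp_words)


TEXT = (
    "Sandra Goudie was first elected to Parliament in the 2002 elections, "
    "narrowly winning the seat of Coromandel by defeating Labour candidate "
    "Max Purnell and pushing incumbent Green MP Jeanette Fitzsimons into "
    "third place."
)
HYPOTHESIS = "Sandra Goudie was defeated by Max Purnell."

# Six of the seven hypothesis words appear in the text; only "defeated" is missing.
print(round(word_overlap(TEXT, HYPOTHESIS), 2))
# -> 0.86
```

The overlap is high even though the correct answer is "No", because a bag of words ignores who defeated whom.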

As another illustration of the difficulty of the task, consider the following text-hypothesis pair:

>**Text**: David Golinkin is the editor or author of eighteen books, and over 150 responsa, articles, sermons and books. <br>
**Hypothesis**: Golinkin has written eighteen books.

In order to determine whether the hypothesis is supported by the text, the system needs the following **background knowledge**: 
1. If someone is an author of a book, then he/she has written that book.
2. If someone is an editor of a book, then he/she has not written (all of) that book.
3. If someone is editor or author of eighteen books, then one cannot conclude that he/she is author of eighteen books.

<a name="NLPLimitations"></a>
### 5.7 Limitations of NLP

* Natural language systems still cannot perform **common-sense reasoning** or **draw on world knowledge** in a general and robust manner.
* In the meantime, the goal of NLP research is to make progress on **building technologies that understand language using superficial yet powerful techniques** instead of unrestricted knowledge and reasoning capabilities.