# Text processing

Text preprocessing is a crucial step in any Natural Language Processing (NLP) task. 

The main goal of text processing is to convert the input text (plain unstructured text) into **linguistically meaningful units**. Much of NLP analysis depends on preprocessing our data correctly.

We have seen that, in Python, a string is just a sequence of characters. Text, however, has an underlying linguistic structure.

NLP is the branch of data science and artificial intelligence that is concerned with how computers can "understand" (i.e. process and analyse) language.

Text processing involves tasks such as sentence segmentation, tokenization (roughly, identifying the words in a text), lemmatization, part-of-speech tagging, syntactic parsing, semantic role labelling, named entity recognition, etc.

## Basic text processing steps

We will look at the following common text processing steps:
1. Sentence segmentation
2. Tokenization (word segmentation)
3. Lemmatization
4. Part-of-speech tagging
5. Dependency parsing (advanced)
6. Named entity recognition

### 1. Sentence segmentation

**Goal:** identify where sentences begin and end.

Example:
```
A trifling incident thus served to settle a victory.  Now-a days, a soldier is so much of a machine that he seems simply to go through certain evolutions, in which there is no opportunity for the display of personal bravery or cowardice.  He does not know what is going on in other parts of the field, and has no real knowledge, till all be over, whether the day has been lost or won.
```

Sentences:
```
A trifling incident thus served to settle a victory.

Now-a days, a soldier is so much of a machine that he seems simply to go through certain evolutions, in which there is no opportunity for the display of personal bravery or cowardice.

He does not know what is going on in other parts of the field, and has no real knowledge, till all be over, whether the day has been lost or won.
```

**Why do we care?** A sentence is often the smallest unit of language that can convey a message.

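**What could go wrong?** Not every full stop marks the end of a sentence. A splitter that treats each full stop as a sentence boundary breaks on abbreviations such as `Mr.`: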

Input:
```
Mr. Smith and Mr. Jones went to the shops.
```

Output:
```
Mr.
Smith and Mr.
Jones went to the shops.
```

### 2. Tokenization (word segmentation)

Tokenization refers to the process of splitting a string (a sequence of characters) into smaller chunks (linguistically meaningful units), often corresponding to what we would broadly speaking call "words".

While this may seem a very easy task ("Just split by space!"), there are a number of issues that must be taken into consideration, such as punctuation and compound words. Tokenization is also language specific: it is particularly tricky for languages without word delimiters and for agglutinating languages.

![picture](images/tokenization.png)

Image source: https://spacy.io/usage/linguistic-features
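
To see why "just split by space" is not enough, the sketch below compares a whitespace split with a small regular-expression tokenizer. The example sentence and the pattern are assumptions chosen for illustration; a real tokenizer uses many more language-specific rules and may, for instance, split contractions such as "Don't" further.

```python
import re

sentence = "Don't panic, it's only $3.50!"

# Splitting on whitespace leaves punctuation glued to the words.
by_space = sentence.split()
# by_space is ["Don't", "panic,", "it's", "only", "$3.50!"]

# A slightly better rule: a token is either a run of word characters
# (optionally continued by an apostrophe or a full stop plus more word
# characters, as in "Don't" or "3.50"), or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[.']\w+)*|[^\w\s]")

by_regex = TOKEN_PATTERN.findall(sentence)
# by_regex is ["Don't", "panic", ",", "it's", "only", "$", "3.50", "!"]
```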

### 3. Lemmatization

Lemmatization transforms words to their base forms, i.e. their dictionary forms.

| **Token**    | **Lemma**  |
|----------|--------|
| goose    | goose  |
| geese    | goose  |
| change   | change |
| changes  | change |
| changing | change |
| changed  | change |

**Why do we care?** Python treats "goose" and "geese" as two completely different and unrelated strings. Or "change" and "changes", even! Lemmatization is a type of data normalization that reduces a word form to its dictionary form. It is very common in text mining, because the linguistic/semantic cost is relatively small compared to its advantages: it helps reduce the complexity of the document significantly, by reducing forms to their common base.
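
The effect on counting can be sketched with a toy lookup table built from the tokens in the table above. The `LEMMAS` dictionary and the `lemmatize` function are hypothetical names for this illustration; a real lemmatizer relies on a full dictionary and/or morphological rules rather than a handful of hard-coded entries.

```python
from collections import Counter

# Toy lookup table (an assumption for this sketch): inflected form -> lemma.
LEMMAS = {
    "geese": "goose",
    "changes": "change",
    "changing": "change",
    "changed": "change",
}


def lemmatize(token):
    """Return the lemma of a token, or the lowercased token if it is not in the table."""
    token = token.lower()
    return LEMMAS.get(token, token)


tokens = ["goose", "geese", "change", "changes", "changing", "changed"]
lemmas = [lemmatize(token) for token in tokens]

token_counts = Counter(tokens)  # six different strings, each seen once
lemma_counts = Counter(lemmas)  # two lemmas: "goose" (2) and "change" (4)
```

Six distinct strings collapse into two lemmas, which is the kind of reduction in complexity that makes frequency-based analyses more meaningful.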

### 4. Part-of-speech tagging

Part-of-speech (POS) tagging refers to the process of assigning a part of speech (i.e. verb, noun, etc.) to each word.

| Keep | the | pen | under | the | book | . |
|------|-----|-----|-------|-----|------|---|
| N    | **DET** | **N**   | ADV   | **DET** | **N**    | **PUNCT**  |
| **V**    |     | V   | **PREP**  |     | V    |   |
|      |     |     | ADJ   |     |      |   |

**Why do we need POS tagging?**
* Words are ambiguous. Part-of-speech helps disambiguate the most straightforward cases (cases in which the parts of speech differ). This is important because the semantics of a word can change significantly (e.g. `book` and `keep`).
* We may want to filter out some types of words that we don't need for our future analyses, for example:
  * We may want to perform stylometric analyses based on punctuation and function words (i.e. DET, CONJ, PREP...).
  * We may want to perform topic modelling based on only words with more semantic content, such as NOUNs and VERBs.
  * We may be only interested in actions, and therefore interested in only VERBs.
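
Once every token carries a tag, the filtering described above becomes a one-line operation. In the sketch below the tags are written by hand for the example sentence (in practice a trained tagger would produce them), and `filter_by_pos` is a hypothetical helper name.

```python
# Hand-annotated (token, POS) pairs for "Keep the pen under the book."
# These tags are an assumption of this sketch, not the output of a tagger.
tagged = [
    ("Keep", "V"),
    ("the", "DET"),
    ("pen", "N"),
    ("under", "PREP"),
    ("the", "DET"),
    ("book", "N"),
    (".", "PUNCT"),
]


def filter_by_pos(tagged_tokens, keep):
    """Return the tokens whose POS tag is in the set `keep`."""
    return [token for token, pos in tagged_tokens if pos in keep]


# Punctuation and function words, e.g. for stylometric analyses.
function_words = filter_by_pos(tagged, {"DET", "PREP", "PUNCT"})
# function_words is ["the", "under", "the", "."]

# Words with more semantic content, e.g. for topic modelling.
content_words = filter_by_pos(tagged, {"N", "V"})
# content_words is ["Keep", "pen", "book"]

# Actions only.
verbs = filter_by_pos(tagged, {"V"})
# verbs is ["Keep"]
```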

### 5. Dependency parsing

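Dependency parsing is the task of identifying syntactic and semantic dependencies between the words of a sentence: each word is linked to the word it depends on (its head) by a labelled relation.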

![picture](images/dependency.png)

Suppose, for example, that you're interested in just who performs actions: in this case you'd take only active subjects of sentences (`nsubj`).
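
Selecting subjects from a parsed sentence can be sketched as follows. The parse is annotated by hand for a made-up sentence, with relation labels in the style of `nsubj`; the triple representation and the `subjects` helper are assumptions for this illustration, not the data structure of any particular parser.

```python
# Hand-annotated parse of "The cat chased the mouse."
# Each entry is (token, dependency relation, head token).
# Using the head's text instead of its position keeps the sketch short,
# but it would be ambiguous in a sentence that repeats a word.
parse = [
    ("The", "det", "cat"),
    ("cat", "nsubj", "chased"),
    ("chased", "root", "chased"),
    ("the", "det", "mouse"),
    ("mouse", "obj", "chased"),
    (".", "punct", "chased"),
]


def subjects(parse):
    """Return (subject, verb) pairs for every nsubj relation in the parse."""
    return [(token, head) for token, relation, head in parse if relation == "nsubj"]


who_does_what = subjects(parse)
# who_does_what is [("cat", "chased")]
```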

### 6. Named entity recognition

Named entities are mentions of entities (people, places, organizations, etc.) in text. They are key in many projects in Digital Humanities.

![picture](images/ner_spacy.png)

Image source: https://spacy.io/usage/linguistic-features 
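
To make the idea of an entity mention concrete, here is a deliberately naive sketch that finds entities by looking names up in a hand-made list (a so-called gazetteer). This is a swapped-in simplification, not how statistical NER models work: the `GAZETTEER` entries, the labels and the `find_entities` helper are all assumptions for illustration, and the approach cannot recognise any name that is missing from the list.

```python
import re

# Hypothetical, hand-made gazetteer: entity name -> entity label.
GAZETTEER = {
    "Ada Lovelace": "PERSON",
    "London": "PLACE",
    "British Museum": "ORG",
}


def find_entities(text):
    """Return (start, end, name, label) for every gazetteer match, in text order."""
    entities = []
    for name, label in GAZETTEER.items():
        # \b ensures that we match whole words only.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            entities.append((match.start(), match.end(), name, label))
    return sorted(entities)


text = "Ada Lovelace was born in London."
entities = find_entities(text)

for start, end, name, label in entities:
    print(text[start:end], label)
```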