# Linguistic Approaches to Machine Translation

## Linguistic Background

### Word 

**Tokenization**: segment a sequence of characters into words

- Convention in western languages: word boundaries = spaces
- Many Asian languages: no spaces between words or even sentences
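
As a minimal sketch of whitespace-based tokenization (the function name `tokenize` and the regular expression are illustrative assumptions, not a standard tokenizer), the following splits words at spaces and separates punctuation marks. It also shows why this convention breaks down for text written without spaces:

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Jane bought the house."))
# A Chinese sentence contains no spaces, so the whole string stays one "word".
print(tokenize("我爱北京"))
```

For languages without spaces, a dedicated word segmenter is needed instead of a rule like this.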

**Word-level**

- **Morphology**

- **Word Classes and Grammatical Categories** 
- **Lexical semantics**

#### Morphology

Internal structure of words, word formation

Words are composed of **morphemes**

**Morphemes**: smallest meaning-carrying units of language

- **stem morphemes**: may appear as separate words
  - house, house-s
  - small, small-est
- **functional** or **bound morphemes**: need to be connected with stem morphemes
  - affix types: prefix, suffix, infix, circumfix
  - Example: <u>un</u>-happy, kauf-<u>st</u>, Gespräch-<u>s</u>-ablauf, <u>ge</u>-kauf-<u>t</u>

Word formation through **morpheme composition**:

- **Inflection**
  - adds grammatical information: tense, number, person, gender and case
  - E.g.: kauf-st, kauf-te, car, car-s, ein-e schön-e Blume
- **Derivation**
  - bound morphemes derive new words
  - E.g. stem morpheme ***happy*** (adjective) + bound morpheme ***–ness*** ->***happiness*** (noun)
- **Composition**
  - combination of stem morphemes -> create new words
  - E.g.: rain-bow, water-proof

**Morphology Specialities**:

- Morpho-phonological processes at morpheme boundaries
- German „Umlaut“
- Vowel Harmony

**Morphological Analysis**:

- finite state automata
- grammatical and lexical knowledge
- very good coverage and fast for most languages 👏
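
The idea of combining lexical knowledge (stems) with grammatical knowledge (which suffixes a word class allows) can be sketched as follows. This is a simple lookup-based analyzer, not a real finite-state transducer; the tables `STEMS` and `SUFFIXES` and the function `analyze` are hypothetical and cover only a few English forms:

```python
# Lexical knowledge: known stem morphemes and their word class (toy data).
STEMS = {"house": "NOUN", "car": "NOUN", "small": "ADJ"}

# Grammatical knowledge: suffixes each word class may take and what they mark.
SUFFIXES = {
    "NOUN": {"": "singular", "s": "plural"},
    "ADJ": {"": "positive", "er": "comparative", "est": "superlative"},
}

def analyze(word: str) -> list[tuple[str, str, str]]:
    """Return all (stem, word class, feature) analyses licensed by the tables."""
    analyses = []
    for stem, pos in STEMS.items():
        if word.startswith(stem):
            suffix = word[len(stem):]
            if suffix in SUFFIXES[pos]:
                analyses.append((stem, pos, SUFFIXES[pos][suffix]))
    return analyses

print(analyze("houses"))
print(analyze("smallest"))
```

Real analyzers additionally handle morpho-phonological changes at morpheme boundaries (e.g. happy -> happi-ness), which this sketch ignores.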

#### Word Classes and Grammatical Categories 

The role a word plays within a sentence is determined by its **part-of-speech (POS)**

| POS         | Example                     |
| ----------- | --------------------------- |
| Noun        | power, apple, beauty        |
| Verb        | go, sleep                   |
| Adjective   | red, happy, asleep          |
| Adverb      | often, happily, immediately |
| Determiner  | the, a, which               |
| Pronoun     | she, it, them               |
| Preposition | under, of, in               |
| Conjunction | and, because, if            |

**POS tagger**: gives useful information for translation

- statistical POS tagger

  assign each word a POS based on *relative frequency counts* and its context in a training corpus
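
A minimal sketch of the relative-frequency idea: the tagger below assigns each word the tag it carried most often in a tiny hand-made training corpus. It deliberately ignores context, so it is only the frequency baseline of a statistical tagger; the corpus, the function names and the `default` tag are assumptions for illustration:

```python
from collections import Counter, defaultdict

def train(tagged_corpus):
    """Count how often each word occurs with each POS tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return counts

def tag(words, counts, default="NOUN"):
    """Assign each word its most frequent tag; unknown words get the default."""
    tagged = []
    for word in words:
        key = word.lower()
        best = counts[key].most_common(1)[0][0] if key in counts else default
        tagged.append((word, best))
    return tagged

# Toy training data (hypothetical): "can" is seen once as NOUN, twice as VERB.
corpus = [("she", "PRON"), ("can", "VERB"), ("go", "VERB"),
          ("a", "DET"), ("can", "NOUN"), ("he", "PRON"), ("can", "VERB")]
model = train(corpus)
print(tag(["She", "can", "go"], model))
```

A context-aware tagger would additionally use the neighbouring tags, for example to prefer NOUN for "can" after a determiner.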

**Grammatical Categories**

- (pro)nouns, adjectives
  - person
    - subject and verb have to agree in this feature (E.g., I go, he goes)
  - number
    - singular, dual, trial, plural
  - gender
    - masculine, feminine, neuter, animate, inanimate
  - case
    - role of participant within phrase; distinguish subject, object, ... 
    - nominative, genitive, accusative, dative, partitive, locative, ...

- Verbs
  - tense: future, past or present
  - aspect: completeness, habituality, progressiveness
  - mood:
    - factuality, likelihood, possibility, uncertainty
    - indicative (he is here), subjunctive (if he were here), optative
  - voice: active, passive, middle, causative

#### **Lexical Semantics**

Ambiguous meanings of words

- **polysemy**: words with same surface form have different (related) meaning, E.g.,  

  - interest: Interesse, Zinsen, Anteil

  - bank: financial institution, of a river

- **homonymy**: completely unrelated meaning, E.g.,

  - can: 
    - you can do it!, 
    - a can of beans

-> the correct meaning within the given context has to be identified: **word sense disambiguation**
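
One simple way to illustrate word sense disambiguation is to pick the sense whose typical context words overlap most with the sentence (a simplified Lesk-style heuristic). The sense inventory `SENSES` below is invented for this example, and ties are resolved in favour of the first listed sense:

```python
# Hypothetical sense inventory: each sense has a set of typical context words.
SENSES = {
    "bank": {
        "financial institution": {"money", "account", "loan", "deposit"},
        "river side": {"river", "water", "shore", "fishing"},
    }
}

def disambiguate(word: str, context: list[str]) -> str:
    """Choose the sense with the largest overlap with the context words."""
    context_words = {w.lower() for w in context}
    senses = SENSES[word]
    return max(senses, key=lambda sense: len(senses[sense] & context_words))

print(disambiguate("bank", "he sat on the bank of the river".split()))
print(disambiguate("bank", "she opened an account at the bank".split()))
```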

Relations between words

- **synonymy**: need – require
- **antonymy**: related – unrelated; big – small; cheap – expensive
- **hypernymy** (is-a): house – building
- **meronymy** (part-of): door – house

### Sentence

**Sentence-level**

- **Syntax: structure of sentences**
- **Semantics: representation of meaning**

#### Sentence structure

- Sequence of words terminated by a *punctuation mark*

- SUBJECT VERB (OBJECT)*
  - **Subject**: phrase headed by a noun -> *noun phrase (NP)*
    - E.g., Jane, the woman, a woman, the young woman, she, the young woman who lives across the street
  - **Verb**: in the second position (in English)
  - Number of **objects** is determined by the verb and its valency
    - *intransitive* verbs: 0 objects (E.g., to sleep)
    - *transitive* verbs: 1 or more objects
      - to buy sth. (valency=1)
      - to give someone sth. (valency=2)
  - Verb and object(s) together form a constituent -> *verb phrase (VP)*
    - Valency needs to be saturated for the sentence to be complete
- Additional information can be added to the sentence in terms of **adjuncts**
  - **Prepositional phrase (PP)**, E.g.: 
    - Jane bought the house *(from Jim) (without hesitation)*.

    - Jane bought the house *(in the posh neighborhood (across the river))*.
  - **Adverbs**
    - *(Yesterday)* Jane bought the house *(at a low price)*.
- Embedded clauses (nested sentences, recursive structure of sentences)
  - relative clauses, E.g.:
    - Jane *(who recently won in the lottery)* bought the house *(that was just put on the market).*

##### Syntactic theory

Assumption: 

- natural language sentences follow certain regularities 

  - constituents

  - precedence

- sentence structure can be modeled by a **context-free grammar (CFG)**, 

  - e.g. Phrase Structure Grammar
    - G = <V, Σ, P, S>
      - V: Non-terminal symbols: *here Syntactic Categories*

      - Σ: Terminal symbols (disjoint from V): *here POS*

      - P: set of production rules describing constituent structure 
      - S: Start symbol: Category „Sentence“
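
To make the four components of G concrete, here is a toy grammar written as plain Python data, with POS tags as terminal symbols. The rules are invented for illustration, and the helper `expand` only works for non-recursive grammars like this one:

```python
from itertools import product

NONTERMINALS = {"S", "NP", "VP"}   # V: syntactic categories
TERMINALS = {"Det", "N", "V"}      # Σ: POS tags
RULES = {                          # P: production rules
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V"], ["V", "NP"]],
}
START = "S"                        # S: start symbol

def expand(symbol: str) -> list[list[str]]:
    """Return every POS sequence derivable from symbol (non-recursive grammars only)."""
    if symbol in TERMINALS:
        return [[symbol]]
    results = []
    for rhs in RULES[symbol]:
        # Combine every expansion of each right-hand-side symbol, left to right.
        for parts in product(*(expand(s) for s in rhs)):
            results.append([tag for part in parts for tag in part])
    return results

for sequence in expand(START):
    print(" ".join(sequence))
```

The grammar licenses exactly two POS sequences, one with an intransitive and one with a transitive verb phrase.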

##### **Syntactic Parsing**

Given a natural language sentence, return a syntactic parse tree 

- headed by an S node
- spanning all words in the sentence

Parsing strategies 

- **bottom-up**

- top-down
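
As one possible illustration of the bottom-up strategy, the sketch below implements a CKY-style recognizer: it starts from the words and combines adjacent constituents into larger ones until it can (or cannot) build an S spanning the whole sentence. The lexicon and the binary rules are toy assumptions, and the grammar must be in binary (Chomsky normal) form:

```python
from itertools import product

# Toy lexicon and binary rules (hypothetical, for illustration only).
LEXICON = {"Jane": {"NP"}, "bought": {"V"}, "the": {"Det"}, "house": {"N"}}
BINARY = {("NP", "VP"): "S", ("V", "NP"): "VP", ("Det", "N"): "NP"}

def cky_recognize(words: list[str]) -> bool:
    """Return True if the words can be combined bottom-up into an S."""
    n = len(words)
    # chart[i][j] holds all categories that span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(word, ()))
    for span in range(2, n + 1):            # build longer spans from shorter ones
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):       # split point between two constituents
                for left, right in product(chart[i][k], chart[k][j]):
                    if (left, right) in BINARY:
                        chart[i][j].add(BINARY[(left, right)])
    return "S" in chart[0][n]

print(cky_recognize("Jane bought the house".split()))
print(cky_recognize("bought Jane the house".split()))
```

A top-down parser would instead start from S and expand rules until the predicted terminals match the input.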

##### **Syntactic Phenomena**

Certain syntactical phenomena cannot be covered by simple CFGs:

- **Agreement** of morpho-syntactic features
- **Subcategorization**: ensure the correct number of arguments for a verb
- Long-distance **dependencies**
  - E.g.: Maria *hat* am Sonntag, obwohl sie es sich fest vorgenommen hat, die Hausaufgaben nicht *gemacht*.
- Variable **word order** in German

##### Unification Grammars

- **feature structures** represent properties of linguistic objects

- basic principle: **unification**
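
A minimal sketch of unification, assuming feature structures are represented as nested Python dicts: two structures unify if their features are compatible, and the result contains the information of both. This simplified version has no typed features and no structure sharing (reentrancy), and the feature names are made up for the example:

```python
def unify(a: dict, b: dict):
    """Unify two feature structures; return None if feature values clash."""
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val                 # feature only in b: copy it
        elif isinstance(result[key], dict) and isinstance(b_val, dict):
            sub = unify(result[key], b_val)     # unify nested structures
            if sub is None:
                return None
            result[key] = sub
        elif result[key] != b_val:
            return None                         # atomic values disagree
    return result

subject = {"cat": "NP", "agr": {"num": "sg", "per": 3}}
verb_requires = {"agr": {"num": "sg"}}
print(unify(subject, verb_requires))            # agreement succeeds
print(unify(subject, {"agr": {"num": "pl"}}))   # number clash -> None
```

Agreement (e.g. between subject and verb) then amounts to requiring that the relevant feature structures unify.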

##### HPSG

**H**ead-driven **p**hrase **s**tructure **g**rammar

- composition of sentence from phrase constituents as in phrase structure grammar
- typed feature structures 
- unification

-> ensure agreement of feature values and correct subcategorization

##### Parsing Difficulties 🤪

- Lexical ambiguities: 

  - word is assigned multiple POS tags in lexicon
  - Solution: try to disambiguate during parsing

- Structural ambiguities

  - (partial) sentence can have multiple correct parses

  - sentence constituent may be part of several grammar rules

  - Types

    - NP/VP Attachment Ambiguity:

      -   “The cop [saw [the burglar] [with the binoculars]]”
      -   “The cop saw [the burglar [with the gun]]”

    - NP/S Complement Attachment Ambiguity:

      - “The athlete [realised [his goal]] last week”
      - “The athlete realised [[his shoes] were across the room]” 

    - Clause-boundary Ambiguity:

      - “Since Jay always [jogs [a mile]] the race doesn’t seem very long”
      - “Since Jay always jogs [[a mile] doesn’t seem very long]”

    - Reduced Relative/Main Clause Ambiguity:

      - “[The woman [delivered the junkmail on Thursdays]]”
      - “[[The woman [delivered the junkmail]] threw it away]”

    - Relative/Complement Clause Ambiguity:
      - “The doctor [told [the woman [that he was in love with]] [to leave]]” 
      - “The doctor [told [the woman] [that he was in love with her]]”

### Semantics

- Meaning of natural language constructs: 
  - words, 
  - sentences, 
  - text/ discourse

-> Compositionality of meaning: meaning of a sentence is composed from the meaning of its parts

- Semantic formalisms to represent natural language meaning 
  - First order logic
  - Higher order logic formalisms 
  - Frame semantics

#### **First order logic**

Composition of sentence meaning 

- incrementally constructed

- meaningful constituents

- Example:

  <img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200505155304488.png" alt="image-20200505155304488" style="zoom:67%;" />

#### Higher order logic

Addresses limitations of first order logic with **type theory**

- Words are assigned types according to their abilities to merge with other words

  - type e: entity; 
  - type t: truth values

- Composition along the syntactical tree

- Example:

  <img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200505160022867.png" alt="image-20200505160022867" style="zoom:67%;" />

  

<img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200505160041049.png" alt="image-20200505160041049" style="zoom:67%;" />
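
Type-driven composition can be mimicked with ordinary functions. In the sketch below (a toy example of my own, not a reproduction of the figures above), entities of type e are strings, truth values of type t are booleans, an intransitive verb has type <e,t> and a transitive verb has type <e,<e,t>>. The small fact set `FACTS` stands in for a model of the world:

```python
# Hypothetical model of the world: the facts that hold in it.
FACTS = {("love", "john", "mary"), ("sleep", "john")}

john = "john"   # type e
mary = "mary"   # type e

def sleeps(x):
    """Intransitive verb, type <e,t>: maps an entity to a truth value."""
    return ("sleep", x) in FACTS

def loves(obj):
    """Transitive verb, type <e,<e,t>>: takes the object, returns an <e,t> function."""
    return lambda subj: ("love", subj, obj) in FACTS

# Composition along the tree: V + object NP -> VP, then subject NP + VP -> S.
vp = loves(mary)        # type <e,t>
sentence = vp(john)     # type t
print(sentence)
print(loves(john)(mary))
```

The verb combines with its object first because the object is its sister in the verb phrase; the subject is applied last.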

#### Frame semantics

Meaning depends on knowledge about the world

- World knowledge is encoded in frames: mental representation of stereotypical situations

  - E.g.: to buy – commercial transaction frame

    elements involved: seller, buyer, goods, price (required), invoice, receipt (optional)

- Relations *within* frames

  - E.g.: seller owns goods, determines price, buyer has money, pays price

- Relations *between* frames

  - E.g.: to buy – to sell: same elements, direction reversed
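
A frame can be pictured as a small data structure with required and optional elements. The sketch below encodes the commercial transaction frame from the example; the class `Frame`, the mapping `SUBJECT_ROLE` and the filler values are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    name: str
    required: set
    optional: set = field(default_factory=set)

    def missing(self, fillers: dict) -> set:
        """Required frame elements that the sentence has not filled."""
        return self.required - set(fillers)

COMMERCIAL_TRANSACTION = Frame(
    name="commercial_transaction",
    required={"seller", "buyer", "goods", "price"},
    optional={"invoice", "receipt"},
)

# "to buy" and "to sell" evoke the same frame but take a different perspective.
SUBJECT_ROLE = {"buy": "buyer", "sell": "seller"}

def subject_of(verb: str, fillers: dict) -> str:
    """Return the frame element realized as subject of the given verb."""
    return fillers[SUBJECT_ROLE[verb]]

# "Jane bought the house from Jim." (the price is not mentioned)
fillers = {"buyer": "Jane", "seller": "Jim", "goods": "house"}
print(COMMERCIAL_TRANSACTION.missing(fillers))
print(subject_of("buy", fillers), subject_of("sell", fillers))
```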



## Translation Challenges

- **Disambiguation**

  - Word sense 
  - Structural

- **Language is developing**

- **Co-references**: referring to objects within and across sentence boundaries

  - **anaphora**: pronouns
    - E.g.: The *<u>man</u>* goes to work. <u>*He*</u> takes the bus.
  - **deictic references** depend on context
  - **references** to the same object using a synonym, hypernym
    - E.g.: Jane bought the <u>*house*</u> on Elm Street. The <u>*building*</u> had just been put on the market.

- **Translation mismatches**: difference in information content between source and target language

- **Translation divergences**

  - *same* information in source and target language
  - syntactic structure / semantic distribution of meaning is *different* in the two languages
  - Type:
    - Structural divergence: **word order**
    - Thematic divergence: changes of grammatical role
    - Head switching
    - Lexicalization: semantic content differently distributed
    - Categorial
    - Collocational

- **Problems in real-world scenarios**

  - Different word order
    - Position of the verb
  - Unknown words
  - Lexical ambiguities
    - Context-dependent meaning
    - Different use of prepositions
  - Grammatical differences
  - No direct translation

- **Language dependencies**

  - Difficulties depend on the languages

  - Some languages are particularly difficult

    -> Develop methods for particular languages 💪

  

## Linguistic Approaches to MT

<img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/Figure-4-The-Vauquois-triangle12.png" alt="The Vauquois triangle[12]. | Download Scientific Diagram" style="zoom:67%;" />

Perform translation at different levels of linguistic abstraction

- Direct translation: no abstraction 
- Syntactic transfer

- Semantic transfer

- Interlingua

### Direct translation

<img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200505163117619.png" alt="image-20200505163117619" style="zoom: 67%;" />

- earliest approach to MT

- simple word-level analysis and generation 

  - POS

  - morphology

- source-target language dictionary: bilingual word mapping

- Problems:

  - idiomatic expressions
  - different word order, structural shifts
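
The word order problem is easy to reproduce with a word-by-word lookup. The following sketch uses a tiny hand-made German-English dictionary (hypothetical entries) and keeps unknown words unchanged:

```python
# Hypothetical bilingual dictionary: one target word per source word.
DE_EN = {"ich": "I", "habe": "have", "das": "the",
         "haus": "house", "gekauft": "bought"}

def translate_direct(sentence: str, dictionary: dict) -> str:
    """Translate word by word; unknown words are copied unchanged."""
    return " ".join(dictionary.get(word.lower(), word) for word in sentence.split())

print(translate_direct("Ich habe das Haus gekauft", DE_EN))
```

The result keeps the German verb-final order ("I have the house bought"), which is exactly the kind of structural shift that direct translation cannot handle.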

### Translation by transfer (Transfer-based Systems)

- „second generation“ of MT systems: Rule-based MT

- Transfer in 3 steps:

  1. **analysis** of source sentence -> abstract representation

  2. **transfer** of source language representation into target language representation

  3. **generation**: target language representation -> surface form of target language sentence

- System components

  - lexica
    - **monolingual source language lexicon** 
    - **monolingual target language lexicon**
    - **bilingual dictionary entries**

  - grammar
    - **monolingual analysis and generation**
    - **bilingual transfer rules**

#### Syntactic Transfer

- Level of abstraction: **syntactic** representation

  - Analysis of source language (SL) sentence into a source-language-dependent syntactic tree

  - Transfer of SL syntactic tree into target language (TL) syntactic tree 

  - Generation of TL natural language sentence

<img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200505170219086.png" alt="image-20200505170219086" style="zoom: 67%;" />

<img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200505170344890.png" alt="image-20200505170344890" style="zoom:67%;" />
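
The three steps can be sketched on a toy German-English example. Analysis is skipped here (the source tree is written by hand), transfer applies a bilingual lexicon plus one structural rule that moves the verb in front of its object, and generation reads off the leaves. The tree encoding, the lexicon and the rule are assumptions made for this illustration:

```python
# Hypothetical bilingual lexicon for lexical transfer.
LEXICON = {"Jane": "Jane", "hat": "has", "das": "the",
           "Haus": "house", "gekauft": "bought"}

# Step 1, analysis (hand-built): nodes are (label, children...), leaves are (POS, word).
source_tree = ("S",
               ("NP", ("PN", "Jane")),
               ("AUX", "hat"),
               ("VP", ("NP", ("Det", "das"), ("N", "Haus")), ("V", "gekauft")))

def is_leaf(node) -> bool:
    return isinstance(node[1], str)

def transfer(node):
    """Step 2: map an SL tree to a TL tree (lexical + structural transfer)."""
    if is_leaf(node):
        pos, word = node
        return (pos, LEXICON.get(word, word))
    label, *children = node
    children = [transfer(child) for child in children]
    # Structural rule: German VP order [NP V] becomes English [V NP].
    if label == "VP" and [child[0] for child in children] == ["NP", "V"]:
        children.reverse()
    return (label, *children)

def generate(node) -> list[str]:
    """Step 3: read the target sentence off the leaves of the TL tree."""
    if is_leaf(node):
        return [node[1]]
    return [word for child in node[1:] for word in generate(child)]

print(" ".join(generate(transfer(source_tree))))
```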

#### Semantic Transfer

Level of abstraction: **semantic** representation

- Analysis of SL sentence into semantic representation 
- Transfer of SL representation into TL representation

- Generation of TL natural language sentence

<img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200505170643570.png" alt="image-20200505170643570" style="zoom:67%;" />

#### More about transfer-based MT systems

- translation process based on linguistic properties
  - actual linguistic theories, e.g. HPSG, Frame Semantics
  - or system-internal linguistic representation tailored for translation purposes

- variable level of abstraction

- transfer rules *explicitly* model the differences between languages

**Disadvantages** 👎

- (bilingual) language specialist required to develop linguistic components 
- adding a new language requires 3 new components: analysis, transfer, generation

### Interlingua approach

#### **Idea**

- „intermediate language“
- abstract language-*independent* representation: „pure meaning“

#### **Translation WITHOUT transfer** 

- analyze input sentence and generate interlingua representation

- generate target language sentence directly from interlingua representation 
- access to world knowledge

<img src="/Users/EckoTan/Library/Application Support/typora-user-images/image-20200505171512271.png" alt="image-20200505171512271" style="zoom:67%;" />

#### **Representations in Interlingua MT**

**Interlingua Representation**

- language independent
- encode linguistic knowledge
- non-linguistic knowledge

**Representation of World Knowledge**

- ontology: non-linguistic knowledge about „things“ and relations between them

- inference mechanisms

#### **Interlingua Translation**

Level of abstraction: **pure** meaning

- Analysis of SL sentence into interlingua representation

- Generation of TL natural language sentence from interlingua

  <img src="https://raw.githubusercontent.com/EckoTan0804/upic-repo/master/uPic/image-20200505171842410.png" alt="image-20200505171842410" style="zoom:67%;" />

#### Advantages 👍

- lower engineering effort for including new languages

- Translation divergences are handled at monolingual level
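
The difference in engineering effort can be estimated with a small count. The sketch assumes one analysis and one generation module per language, plus, for transfer-based systems, one transfer module per ordered language pair; these counting assumptions and the function names are mine:

```python
def transfer_components(n_languages: int) -> int:
    """Analysis + generation per language, transfer per ordered language pair."""
    return 2 * n_languages + n_languages * (n_languages - 1)

def interlingua_components(n_languages: int) -> int:
    """Only analysis into and generation from the interlingua per language."""
    return 2 * n_languages

for n in (2, 3, 10):
    print(n, transfer_components(n), interlingua_components(n))
```

Under these assumptions the transfer approach grows quadratically with the number of languages, while the interlingua approach grows linearly.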

#### Disadvantages 👎

- language specialist still required
- world knowledge necessary
  - **available systems use domain model covering only a small domain**
- true interlingua not reached so far 🤪

