# Text Processing for Arabic - Using MADAMIRA

## Introduction

MADAMIRA is an end-to-end morphological disambiguator for Arabic. It generates ranked morphological analyses for each word in context.
In this tutorial, we use raw text files as input and expect structured unmarked output.
_**Please refer to the MADAMIRA manual to learn more about the different features and configurations.**_

MADAMIRA is written in Java. To call MADAMIRA, use:
```
java -Xmx4000m -Xms4000m -XX:NewRatio=3 -jar MADAMIRA-release-20170403-2.1.jar -rawinput <input_text> -rawoutdir <output_dir> -rawconfig sampleConfigFile.xml
```
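
When MADAMIRA is driven from a Python notebook rather than a shell, the same command can be assembled as an argument list. The following is only a sketch: the helper name `build_madamira_command` and its parameters are hypothetical, and the flags are copied from the command above. It builds the command without running it.

```python
import subprocess  # only needed if the command is actually executed


def build_madamira_command(input_text, output_dir, config,
                           jar='MADAMIRA-release-20170403-2.1.jar',
                           heap_mb=4000):
    """Return the MADAMIRA command line as a list of arguments.

    Hypothetical helper: the flags mirror the shell command above.
    """
    return [
        'java', f'-Xmx{heap_mb}m', f'-Xms{heap_mb}m', '-XX:NewRatio=3',
        '-jar', jar,
        '-rawinput', str(input_text),
        '-rawoutdir', str(output_dir),
        '-rawconfig', str(config),
    ]


cmd = build_madamira_command('input.txt', 'out/', 'sampleConfigFile.xml')
print(' '.join(cmd))
# To run it (assumes Java is installed and the JAR is in the working directory):
# subprocess.run(cmd, check=True)
```

Passing a list instead of a single string avoids shell quoting problems when paths contain spaces.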

For each word in the sentence, MADAMIRA produces an array of morphosyntactic and lexical features. Additionally, MADAMIRA produces separate tokenization files in different schemes.

In this exercise, we will look at the following features and tokenization schemes:

1. Diacritized word - `diac`: the diacritized form of the word.
2. Lemma - `lex`: the citation form of the word.
3. POS - `pos`: the part-of-speech tag (34-tag POS tagset).
4. ATB tokenization: all prepositions, articles and pronouns are split, except for the definite article 'Al'.
5. D3 tokenization: all prepositions, articles and pronouns are split.

The output for this exercise will be written to the specified `<output_dir>`. Open the files ending with the extensions `.mada`, `.ATB.tok`, and `.D3.tok` to examine them.
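
To make the `diac`, `lex`, and `pos` features concrete, the sketch below pulls them out of a single analysis line as `name:value` pairs. The sample line is a made-up illustration written in the style of a `.mada` analysis line (Buckwalter transliteration), not actual MADAMIRA output, and `get_feature` is a hypothetical helper.

```python
import re

# Illustrative analysis line (made up for this example, not real output).
line = '*0.93 diac:kitAbu lex:kitAb_1 pos:noun prc0:0 gen:m num:s'


def get_feature(name, analysis_line):
    """Return the value of `name:value` in an analysis line, or None."""
    # Require whitespace (or line start) before the name so that a feature
    # name is never matched inside a longer one.
    match = re.search(rf'(?:^|\s){re.escape(name)}:(\S+)', analysis_line)
    return match.group(1) if match else None


features = {name: get_feature(name, line) for name in ('diac', 'lex', 'pos')}
print(features)
# {'diac': 'kitAbu', 'lex': 'kitAb_1', 'pos': 'noun'}
```

Returning `None` for a missing feature lets the caller decide how to handle incomplete lines instead of failing with an `AttributeError`.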

## Run MADAMIRA: Gigaword_AR as an example

In [None]:
%%bash

cd MADAMIRA
java -Xmx4000m -Xms4000m -XX:NewRatio=3 -jar MADAMIRA-release-20170403-2.1.jar -rawinput ../Results/Gigaword_AR/gigaword_tiny_cleaned.txt -rawoutdir ../Results/Gigaword_AR/ -rawconfig WIDH_2020_ConfigFile.xml 

***

## Parse the .mada output file for additional representations

We will extract the following representations from the `.mada` file using the `parse_mada` function defined below.
1. Diacritized text.
2. Undiacritized text.
3. Lemmatized text.
4. Lemma and POS pairs.
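
As a rough illustration of how the four representations relate for a single word, the sketch below derives each of them from one diacritized form, lemma, and POS tag. The `strip_diacritics` function is a stand-in written for this example (an assumption, not the implementation of `dediac_ar` in `camel_tools`), and the sample values are made up.

```python
import re

# Cut the lemma at the first '-' or '_' to keep only its lexical part.
LEMMA_SPLIT_RE = re.compile(r'-|_')

# Stand-in dediacritizer (assumption): removes the Arabic diacritic marks
# U+064B..U+0652 and the dagger alif U+0670.
ARABIC_DIACRITICS_RE = re.compile('[\u064b-\u0652\u0670]')


def strip_diacritics(text):
    """Remove Arabic diacritic marks from text."""
    return ARABIC_DIACRITICS_RE.sub('', text)


# Made-up sample values for one word.
diac = '\u0643\u0650\u062a\u064e\u0627\u0628\u064c'  # diacritized word
raw_lemma = '\u0643\u0650\u062a\u0627\u0628_1'       # lemma with sense index
pos = 'noun'

undiac = strip_diacritics(diac)              # undiacritized word
lemma = LEMMA_SPLIT_RE.split(raw_lemma)[0]   # lemma without the '_1' suffix
lemma_pos = f'{lemma}+{pos}'                 # lemma and POS pair

print(diac, undiac, lemma, lemma_pos)
```

Splitting on both `-` and `_` means a lemma written as `katab-u_1` is reduced to `katab`.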


In [None]:
import re
from camel_tools.utils.dediac import dediac_ar
_LEMMA_SPLIT_RE = re.compile(r'-|_')

def parse_mada(filename):
    diac_text = [] # sentences of diacritized words
    undiac_text = [] # sentences of undiacritized words
    lemmas_text = [] # sentences of lemmas
    lemma_and_pos = [] # sentences of lemma+POS pairs
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            if line.startswith(';;; SENTENCE'): # encounter a sentence
                diac_str = []
                undiac_str = []
                lemma_str = []
                lemma_pos_str = []
            elif line.startswith('*'): # encounter the analysis line
                # raw strings avoid invalid-escape warnings for \S
                pos = re.search(r'pos:(\S+)', line).group(1)
                diac = re.search(r'diac:(\S+)', line).group(1)
                undiac = dediac_ar(diac) # strip the diacritics
                lemma = re.search(r'lex:(\S+)', line).group(1)
                lemma = _LEMMA_SPLIT_RE.split(lemma)[0] # extract the lexical part of the lemma only
            
                diac_str.append(diac)
                undiac_str.append(undiac)
                lemma_str.append(lemma)
                lemma_pos_str.append(f'{lemma}+{pos}')
            elif line.startswith(';;WORD'): # encounter a word
                word = line.strip().split(' ')[1]
            elif line.startswith('NO-ANALYSIS'): # MADAMIRA did not generate analysis for the word
                diac_str.append(word) # use the raw word as a placeholder
                undiac_str.append(word) # use the raw word as a placeholder
                lemma_str.append(word) # use the raw word as a placeholder
                lemma_pos_str.append(f'{word}+NOAN') # mark the word as having no analysis
            elif line.startswith('SENTENCE BREAK'): # encounter the end of the sentence:
                diac_text.append(' '.join(diac_str))
                undiac_text.append(' '.join(undiac_str))
                lemmas_text.append(' '.join(lemma_str))
                lemma_and_pos.append(' '.join(lemma_pos_str))
    # write each representation to its own file, one sentence per line
    with open(filename+'.diac', 'w', encoding='utf-8') as f:
        f.write('\n'.join(diac_text))
    with open(filename+'.undiac', 'w', encoding='utf-8') as f:
        f.write('\n'.join(undiac_text))
    with open(filename+'.lex', 'w', encoding='utf-8') as f:
        f.write('\n'.join(lemmas_text))
    with open(filename+'.lexPOS', 'w', encoding='utf-8') as f:
        f.write('\n'.join(lemma_and_pos))

### To call the `parse_mada()` function:

In [None]:
parse_mada('Results/Gigaword_AR/gigaword_tiny_cleaned.txt.mada')
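
After the call, the four output files should contain the same number of lines, one per sentence. The sketch below is one way to check this; `count_sentences` is a hypothetical helper, and the demo writes small placeholder files to a temporary directory so that it runs without MADAMIRA output.

```python
import os
import tempfile

# Suffixes that parse_mada appends to the .mada filename.
SUFFIXES = ('.diac', '.undiac', '.lex', '.lexPOS')


def count_sentences(mada_path):
    """Map each output suffix to the number of lines in that file."""
    counts = {}
    for suffix in SUFFIXES:
        with open(mada_path + suffix, 'r', encoding='utf-8') as f:
            counts[suffix] = sum(1 for _ in f)
    return counts


# Demo on placeholder files (not real MADAMIRA output).
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, 'sample.txt.mada')
    for suffix in SUFFIXES:
        with open(base + suffix, 'w', encoding='utf-8') as f:
            f.write('sentence one\nsentence two')
    demo_counts = count_sentences(base)

print(demo_counts)
# {'.diac': 2, '.undiac': 2, '.lex': 2, '.lexPOS': 2}
aligned = len(set(demo_counts.values())) == 1
```

For the real files, the same check would be `count_sentences('Results/Gigaword_AR/gigaword_tiny_cleaned.txt.mada')`.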