# Text Processing for Arabic - Data Extraction and Cleaning

## Introduction

In this exercise, we present code snippets for reading files in different formats and extracting data from them.
Specifically, we'll look at reading raw text files, tab-separated values (TSV) files, and XML files in a complex directory structure.

## The Cleaning Process

Cleaning in this case involves the following steps:

1. **Normalize Unicode characters:** This process converts similar characters or character sequences
   into their canonical forms.
2. **Remove spaces/tabs/newlines/etc. at the beginning and end of the sentence:**
   These are unnecessary in most cases and are either the result of indentation or accidental.
3. **Split tokens by space and punctuation:** This process first splits space-delimited
   tokens and then further splits these tokens at punctuation boundaries. Here we treat each
   punctuation mark as an individual token.
4. **Rejoin the tokens with a single space between them:** A single space is
   sufficient if the sentence needs to be split again later on.
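
The four steps above can be sketched with the Python Standard Library alone. This is only an approximation for illustration: the names `is_punctuation_or_symbol` and `clean_sentence_sketch` are hypothetical, the choice of the NFKC normalization form is an assumption, and the camel-tools functions used in the rest of this exercise may differ in detail.

```python
import unicodedata


def is_punctuation_or_symbol(char):
    # Unicode general categories starting with 'P' (punctuation) or
    # 'S' (symbol) are treated as standalone tokens in this sketch.
    return unicodedata.category(char)[0] in ('P', 'S')


def clean_sentence_sketch(sentence):
    # Step 1: normalize Unicode (NFKC is an assumption made for this sketch)
    normalized = unicodedata.normalize('NFKC', sentence)

    # Step 2: remove whitespace at the beginning and end
    stripped = normalized.strip()

    # Step 3: split on whitespace, then isolate each punctuation character
    tokens = []
    for chunk in stripped.split():
        current = ''
        for char in chunk:
            if is_punctuation_or_symbol(char):
                if current:
                    tokens.append(current)
                    current = ''
                tokens.append(char)
            else:
                current += char
        if current:
            tokens.append(current)

    # Step 4: rejoin the tokens with a single space between them
    return ' '.join(tokens)


print(clean_sentence_sketch('  ما    هذا!!! عجيب \n'))  # ما هذا ! ! ! عجيب
```

Normalization matters for Arabic because the same text can be encoded in several ways; for instance, NFKC rewrites the single-code-point lam-alef ligature (U+FEFB) as the two letters lam and alef.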

## Basic Building Blocks for Processing Arabic Text

Below are some building blocks that we will use throughout this exercise:

In [3]:
from camel_tools.tokenizers.word import simple_word_tokenize
from camel_tools.utils.charmap import CharMapper
from camel_tools.utils.normalize import normalize_unicode


# This is a character mapping object provided by camel-tools that maps
# Buckwalter encoded characters to Arabic script.
buckwalter_to_arabic = CharMapper.builtin_mapper('bw2ar')


# This function takes an Arabic sentence as input and cleans it by doing the
# following:
#    - Normalize unicode characters
#    - Remove spaces at the beginning and end of the sentence
#    - Split tokens by space and punctuation
#    - Rejoin the tokens with a single space between them
#
# For example, 'ما    هذا!!! عجيب' becomes 'ما هذا ! ! ! عجيب'.
def clean_arabic_sentence(sentence):
    normalized_sentence = normalize_unicode(sentence)
    stripped_sentence = normalized_sentence.strip()
    tokens = simple_word_tokenize(stripped_sentence)
    return ' '.join(tokens)


# This function takes a Buckwalter-encoded sentence as input and cleans it by
# doing the following:
#    - Normalize unicode characters
#    - Remove spaces at the beginning and end of the sentence
#    - Convert sentence from Buckwalter encoding to Arabic script
#    - Split tokens by space and punctuation
#    - Rejoin the tokens with a single space between them
#
# For example, 'mA    h*A!!! Ejyb' becomes 'ما هذا ! ! ! عجيب'.
def clean_buckwalter_sentence(sentence):
    normalized_sentence = normalize_unicode(sentence)
    stripped_sentence = normalized_sentence.strip()
    arabic_sentence = buckwalter_to_arabic(stripped_sentence)
    tokens = simple_word_tokenize(arabic_sentence)
    return ' '.join(tokens)


# This function takes a list of sentences and a file path as input and writes
# the sentences to the given file one line per sentence.
def write_sentences_to_file(sentences, file_path):
    with open(file_path, 'w', encoding='utf-8') as output_file:
        for sentence in sentences:
            output_file.write(sentence)
            output_file.write('\n')


## Cleaning Raw Arabic Text

Reading and cleaning raw Arabic text is the most basic task one might do when
processing Arabic text.
Using the building blocks defined above, the function below reads a given file
of raw Arabic text, treats each line as a sentence and cleans it,
and writes the cleaned sentences to a specified output file.

In [4]:
# This function reads raw Arabic text from a specified input file, cleans it,
# and writes the output to a specified output file.
def clean_sentences_from_arabic_raw(input_path, output_path):
    sentences = []

    # Open the file for reading, assuming it is UTF-8 encoded
    with open(input_path, 'r', encoding='utf8') as input_file:

        # Iterate through every line in the file
        for line in input_file:

            # Clean the sentence: normalize Unicode, strip surrounding
            # whitespace, and split tokens by space and punctuation
            cleaned_sentence = clean_arabic_sentence(line)

            # Add the sentence to the existing list of sentences
            sentences.append(cleaned_sentence)

    # Write cleaned sentences to output file
    write_sentences_to_file(sentences, output_path)

We will now run this function on the provided Gigaword file:

In [5]:
clean_sentences_from_arabic_raw('Data/Gigaword_AR/gigaword_tiny.txt',
                                'Results/Gigaword_AR/gigaword_tiny_cleaned.txt')
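
Note that `open()` in write mode creates the output file but not any missing parent directories, so the call above raises `FileNotFoundError` if `Results/Gigaword_AR` does not exist yet. If that may be the case, a small helper along the following lines could be called before writing. The name `ensure_parent_directory` is hypothetical and not part of the original exercise.

```python
import os


def ensure_parent_directory(file_path):
    """Create the directory that will contain file_path if it is missing."""
    parent = os.path.dirname(file_path)

    # os.path.dirname returns '' for a bare file name, in which case the file
    # goes into the current directory and there is nothing to create.
    if parent:
        os.makedirs(parent, exist_ok=True)
```

For example, `ensure_parent_directory('Results/Gigaword_AR/gigaword_tiny_cleaned.txt')` creates `Results/Gigaword_AR` when needed and does nothing when it already exists.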

## Cleaning Buckwalter-encoded Text

Cleaning Buckwalter-encoded raw text is very similar to cleaning raw Arabic text,
but it involves an extra step: converting the Buckwalter-encoded sentences into
Arabic script after Unicode normalization and before splitting into word tokens.

The function below performs this process on a given input file and writes the
resulting clean text into a specified output file.

In [None]:
# This function reads Buckwalter-encoded text from a specified input file,
# cleans it and converts it to Arabic script, and writes the output to a
# specified output file.
def clean_sentences_from_buckwalter_raw(input_path, output_path):
    sentences = []

    # Open the file for reading, assuming it is UTF-8 encoded
    with open(input_path, 'r', encoding='utf8') as input_file:

        # Iterate through every line in the file
        for line in input_file:
            # Clean the sentence and convert it from Buckwalter encoding
            # to Arabic script
            cleaned_sentence = clean_buckwalter_sentence(line)

            # Add the sentence to the existing list of sentences
            sentences.append(cleaned_sentence)

    # Write cleaned sentences to output file
    write_sentences_to_file(sentences, output_path)

We will now run this function on the provided Hindawi file:

In [None]:
clean_sentences_from_buckwalter_raw('Data/Hindawi_BW/hindawi.bw.txt',
                                    'Results/Hindawi_BW/hindawi_cleaned.txt')
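
The order of the steps matters because Buckwalter represents some Arabic letters with ASCII punctuation characters, such as `*` for ذ. Tokenizing before converting would split those letters off as separate tokens. The sketch below illustrates this with a partial mapping table that covers only the letters in the example sentence, and with a toy tokenizer that stands in for a real one; `PARTIAL_BW2AR`, `toy_tokenize`, `convert_then_tokenize` and `tokenize_then_convert` are hypothetical names, not part of camel-tools.

```python
import re
import string

# Partial Buckwalter-to-Arabic table covering only the letters used below.
# The full scheme has many more entries; this subset is for illustration.
PARTIAL_BW2AR = str.maketrans({
    'm': 'م', 'A': 'ا', 'h': 'ه', '*': 'ذ',
    'E': 'ع', 'j': 'ج', 'y': 'ي', 'b': 'ب',
})

# Toy tokenizer: each ASCII punctuation character becomes its own token,
# and every other run of non-space characters becomes one token.
_PUNCT = re.escape(string.punctuation)
_TOKEN_RE = re.compile(f'[{_PUNCT}]|[^\\s{_PUNCT}]+')


def toy_tokenize(text):
    return _TOKEN_RE.findall(text)


def convert_then_tokenize(sentence):
    # Order used in this exercise: convert to Arabic script, then tokenize
    return toy_tokenize(sentence.translate(PARTIAL_BW2AR))


def tokenize_then_convert(sentence):
    # Wrong order: '*' is split off before it can be read as a letter
    return [token.translate(PARTIAL_BW2AR) for token in toy_tokenize(sentence)]


print(convert_then_tokenize('mA h*A!!! Ejyb'))
print(tokenize_then_convert('mA h*A!!! Ejyb'))
```

With the first order the word `h*A` stays a single token (هذا); with the second it is broken into three one-letter tokens.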

## Parsing TSV (Tab-separated Values) Files

TSV files are also very common but are often handled incorrectly.
They are used to store tabular data by separating columns using tabs.
They are in the same family as comma-separated values (CSV) files, differing only
in the delimiter used.
It is tempting to use a simple split function on each line, but
TSVs have special formatting conventions in certain cases, such as quoted fields.
It is recommended to use a specialized library to handle these cases
automatically.
Below is a function that uses the Python Standard Library's `csv` module
to extract sentences from TSVs in the MADAR data set.

In [None]:
import csv

# This function extracts Arabic sentences from a specified TSV file,
# cleans them, and outputs the results to a specified output file.
def clean_sentences_from_tsv(input_path, output_path):
    sentences = []

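    # Open the file for reading, assuming it is UTF-8 encoded. The csv module
    # documentation recommends newline='' so that it handles line endings.
    with open(input_path, 'r', encoding='utf-8', newline='') as tsv_file:

        # Create a DictReader object that will parse a standard TSV file.
        # Because fieldnames is given, the first row is treated as data.
        reader = csv.DictReader(tsv_file,
                                dialect='excel-tab',
                                fieldnames=['sentence', 'dialect'])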

        # Extract MSA sentences only
        for row in reader:
            if row.get('dialect', None) == 'MSA':
                sentence = row.get('sentence', '')
                cleaned_sentence = clean_arabic_sentence(sentence)
                sentences.append(cleaned_sentence)

    # Write cleaned sentences to output file
    write_sentences_to_file(sentences, output_path)

We will now run this function on the provided MADAR file:

In [None]:
clean_sentences_from_tsv('Data/MADAR_TSV/MADAR-Corpus-6-train.tsv',
                         'Results/MADAR_TSV/madar_cleaned.txt')
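
To see why a plain split is risky, consider a field that itself contains a tab or a quotation mark. Under the `excel-tab` dialect such a field is wrapped in double quotes and embedded quotes are doubled. The sketch below compares a naive split with `csv.reader` on two made-up rows; the sentences and the `CAI` label are invented for illustration and are not taken from the MADAR data.

```python
import csv
import io

# Two hypothetical rows. The first sentence contains a literal tab and
# embedded quotes, so it is quoted in the style of the 'excel-tab' dialect.
raw = '"He said ""hi""\tand left"\tMSA\nplain sentence\tCAI\n'

# Naive approach: split every line on tabs
naive_rows = [line.split('\t') for line in raw.splitlines()]

# csv module: quoting rules are applied before the columns are separated
csv_rows = list(csv.reader(io.StringIO(raw, newline=''), dialect='excel-tab'))

print(len(naive_rows[0]))  # 3 columns: the quoted field was cut in two
print(csv_rows[0])         # ['He said "hi"\tand left', 'MSA']
```

The naive split breaks the first row into three columns and leaves the quotation marks in place, while the `csv` module returns the intended two columns.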

## Parsing XML from a Directory

Extensible Markup Language (XML) allows us to store more complex data structures
in a file, but this makes processing it more complex.
As with TSVs and CSVs, it is recommended to use a specialized library to read
XML files, since they also have special formatting conventions for certain characters.
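
As a small illustration of those conventions, characters such as `&`, `<` and `>` are written as entities (`&amp;`, `&lt;`, `&gt;`) inside XML text. The sketch below, which uses a made-up document, shows that `xml.etree.ElementTree` decodes them automatically, which a hand-written pattern match on the raw file would not do.

```python
import xml.etree.ElementTree as ET

# A made-up document whose sentence contains escaped special characters
xml_text = '<doc><s>Research &amp; Development &lt;R&amp;D&gt;</s></doc>'

root = ET.fromstring(xml_text)

# Collect the text of every <s> element; entities are already decoded
sentences = [element.text for element in root.iter('s')]

print(sentences)  # ['Research & Development <R&D>']
```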

Furthermore, data may not necessarily come from a single file but may be split
into many files in complex directory structures.

The function below goes through a given directory and all its subdirectories,
extracts and cleans all sentences from any XML file it encounters, and writes
all these sentences to a single file.

In [None]:
import os
import os.path
import xml.etree.ElementTree as ET


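# This function extracts and cleans the text of every <s> element in a
# specified XML file.
def get_sentences_from_xml_file(file_path):
    sentences = []

    # Open the file for reading, assuming it is UTF-8 encoded
    with open(file_path, 'r', encoding='utf-8') as xml_file:
        # An 'end' event fires once an element has been read completely
        for event, elem in ET.iterparse(xml_file, ['end']):
            # Skip empty <s> elements, whose text is None
            if elem.tag == 's' and elem.text is not None:
                cleaned_sentence = clean_arabic_sentence(elem.text)
                sentences.append(cleaned_sentence)

    return sentences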


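# This function goes through a specified directory and all its subdirectories,
# cleans the sentences in every XML file it finds, and writes them to a
# specified output file.
def clean_sentences_from_xml_directory(directory_path, output_path):
    all_sentences = []

    # Walk through the data directory and every subdirectory below it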
    for root, dirs, files in os.walk(directory_path):
        
        # Iterate through each file in a directory
        for file_name in files:

            if file_name.endswith('.xml'):
                # Create a full path for each file that we can pass to open()
                file_path = os.path.join(root, file_name)
                
                # Get the cleaned sentences from the current file and add
                # them to our accumulator list
                sentences = get_sentences_from_xml_file(file_path)
                all_sentences.extend(sentences)

    # Write cleaned sentences to output file
    write_sentences_to_file(all_sentences, output_path)

We will now run this function on the provided UN directory:

In [None]:
clean_sentences_from_xml_directory('Data/UN_XML',
                                   'Results/UN_XML/un_cleaned.txt')
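
To see how the directory traversal behaves on its own, the sketch below builds a small throwaway directory tree and collects the XML files in it with `os.walk`. The helper name `find_xml_files` and all directory and file names are made up for illustration; they do not reflect the layout of the UN data.

```python
import os
import tempfile


def find_xml_files(directory_path):
    """Return the paths of all .xml files under directory_path, sorted."""
    xml_paths = []

    # os.walk yields the directory itself and then every subdirectory
    for root, dirs, files in os.walk(directory_path):
        for file_name in files:
            if file_name.endswith('.xml'):
                xml_paths.append(os.path.join(root, file_name))

    # os.walk does not promise a particular order, so sort for repeatability
    return sorted(xml_paths)


# Build a temporary tree with one XML file at the top level, one nested
# two levels down, and one non-XML file that should be ignored.
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, '2000', 'en'))
    for relative in ['a.xml', 'notes.txt', os.path.join('2000', 'en', 'b.xml')]:
        with open(os.path.join(base, relative), 'w', encoding='utf-8') as f:
            f.write('<doc/>')

    found = [os.path.relpath(path, base) for path in find_xml_files(base)]

print(found)
```

Sorting the paths makes the order of sentences in the output file reproducible across runs and platforms. Note that `endswith('.xml')` is case-sensitive, so a file named `b.XML` would be skipped.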