# Text Processing for Arabic - Corpus Statistics

## Introduction

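In this exercise we compute several statistics for a given corpus. The corpus can be in any representation (for example raw text, tokens, or lemmas), as long as it is a UTF-8 text file with one sentence per line and tokens separated by whitespace.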

We present the following statistics: 
1. Number of lines.
2. Number of tokens.
3. Number of unique tokens.
4. Top 10 most frequent tokens.
5. Full histogram (written to a file with a `.freq.tsv` extension).

For a small file that contains the following line:
```
العقل السليم في الجسم السليم
```
we expect to see the following statistics:

Number of lines = 1

Number of tokens = 5
```
['العقل', 'السليم', 'في', 'الجسم', 'السليم']
```

Number of unique tokens = 4  
```
['العقل', 'في', 'الجسم', 'السليم']
```

The top 10 most frequent tokens (this example will only have four):
```
Freq     Token
2        السليم
1        العقل
1        في
1        الجسم
```
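
The expected numbers above can be checked by hand with a few lines of Python. This is only a minimal sketch: it works on an in-memory string instead of a file, skips the `normalize_unicode` step, and the variable names are chosen here for illustration.

```python
from collections import Counter

# The one-line sample corpus from the example above
text = 'العقل السليم في الجسم السليم'

# Whitespace tokenization, then a frequency count per unique token
tokens = text.split()
histogram = Counter(tokens)

num_tokens = len(tokens)          # total number of tokens
num_unique = len(histogram)       # number of distinct tokens
top = histogram.most_common(10)   # at most 10 (token, frequency) pairs

print(num_tokens, num_unique)
for token, freq in top:
    print('{}\t{}'.format(freq, token))
```

Tokens with the same frequency are listed in the order in which they first appear in the text, which is why `العقل` comes before `في` in the table.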

To generate the statistics for a given corpus:
```
corpus_statistics(corpus_file)
```

In [None]:
from collections import Counter 
from camel_tools.utils.normalize import normalize_unicode
import csv

# function to get a list of sentences from a raw file
def get_sentences_from_raw(filename):
    sentences = []

    # Open the file for reading, assuming it is UTF-8 encoded
    with open(filename, mode='r', encoding='utf8') as input_file:

        # Iterate through every line in the file
        for line in input_file:

            # Normalize unicode characters
            normalized_sentence = normalize_unicode(line)
            
            # Remove spaces/tabs/newlines at the beginning and end of the sentence
            stripped_line = normalized_sentence.strip()

            # Add the sentence to the existing list of sentences
            sentences.append(stripped_line)

    return sentences

# function to generate different statistics
def corpus_statistics(filename):
        
    # load the file into a list of sentences
    sentences = get_sentences_from_raw(filename)
    
    # initialize an empty list to store the word tokens
    list_of_tokens = []
    
    for sentence in sentences:
        
        # for every sentence in the list, split it on whitespace and append the words to the list
        list_of_tokens.extend(sentence.split())
        
    # number of sentences
    num_of_sentences = len(sentences) 
    
    # number of words
    num_of_tokens = len(list_of_tokens)
    
    # generate a histogram from the complete list of words (the frequency for each unique word)
    histogram = Counter(list_of_tokens)
    
    # sort the histogram by frequency, highest first
    sorted_histogram = dict(sorted(histogram.items(), key=lambda item: item[1], reverse=True))
    
    # print the statistics to the screen
    print('Number of lines:\t{}'.format(num_of_sentences))
    print('Number of tokens:\t{}'.format(num_of_tokens))
    print('Number of unique tokens:\t{}'.format(len(histogram)))
    
    # get the top 10 most frequent unique words
    print('Top 10 most frequent tokens:')
    for word in list(sorted_histogram)[0:10]:
        print('\t{}\t{}'.format(word, sorted_histogram[word]))
    
    # write histogram to file:
    with open(filename+'.freq.tsv', 'w', encoding='utf-8', newline='') as csvfile:
        
        # create a writer object
        row_writer = csv.writer(csvfile, dialect='excel-tab')
        
        # write the header of the table
        row_writer.writerow(['Token', 'Freq'])
        for word in sorted_histogram:
            # write the rows (row by row)
            row_writer.writerow([word, sorted_histogram[word]])
    print('A complete histogram is written to \'{}.freq.tsv\''.format(filename))
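
The `.freq.tsv` file written by `corpus_statistics` can be read back for further analysis. The sketch below assumes the layout used above (tab-separated, with a `Token`/`Freq` header row); `read_histogram` and the sample file name are hypothetical helpers introduced here for illustration, and the sample file is created in a temporary directory so the block runs on its own.

```python
import csv
import os
import tempfile

# Hypothetical helper: read a Token/Freq TSV file into (token, count) pairs
def read_histogram(path):
    with open(path, mode='r', encoding='utf-8', newline='') as tsv_file:
        reader = csv.reader(tsv_file, dialect='excel-tab')
        next(reader, None)  # skip the header row, if there is one
        return [(token, int(freq)) for token, freq in reader]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a tiny file in the same layout that corpus_statistics produces
    path = os.path.join(tmp_dir, 'sample.txt.freq.tsv')
    with open(path, mode='w', encoding='utf-8', newline='') as tsv_file:
        writer = csv.writer(tsv_file, dialect='excel-tab')
        writer.writerow(['Token', 'Freq'])
        writer.writerows([['السليم', 2], ['العقل', 1]])

    rows = read_histogram(path)

print(rows)
```

Frequencies come back from the file as strings, so the helper converts them to `int` before returning.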
    

## Corpus Statistics: Gigaword_AR lemmas as an example

In [None]:
corpus_statistics('Results/Gigaword_AR/gigaword_tiny_cleaned.txt.ATB.tok')