# Text Processing for Arabic - Vocabulary Overlap

In this exercise, we present a code snippet that reads two files and uses the vocabulary overlap to measure their similarity.

The metric we are using is the weighted Jaccard similarity index, which is a variation of the Intersection over Union metric. We follow the formula described [here](https://en.wikipedia.org/wiki/Jaccard_index#Weighted_Jaccard_similarity_and_distance).

If we have the following two files:

File 1:
```
العقل السليم في الجسم السليم
```

File 2:
```
يستخدم سليم فرشاة الأسنان بالشكل السليم مرتين في اليوم
```

Applying the weighted Jaccard similarity index to the untokenized words:

The word list:
```
['العقل','السليم','في','الجسم','يستخدم','سليم','فرشاة','الأسنان','بالشكل','مرتين','اليوم']
```
The word list frequencies in the first file:
```
[1,2,1,1,0,0,0,0,0,0,0]
```
The word list frequencies in the second file:
```
[0,1,1,0,1,1,1,1,1,1,1]
```

$Jaccard_{Weighted} = \frac{\sum minFreq_{(file1,file2)}}{\sum maxFreq_{(file1,file2)}} = \frac{0 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0}{1 + 2 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1} = \frac{2}{12} = 16.67\%$
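The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that works on the two example sentences directly rather than on files; the variable names (`word_list`, `vector_1`, `vector_2`) are chosen here for illustration and do not appear in the code that follows.

```python
from collections import Counter

# The two example "files", each holding a single sentence
file_1_text = 'العقل السليم في الجسم السليم'
file_2_text = 'يستخدم سليم فرشاة الأسنان بالشكل السليم مرتين في اليوم'

tokens_1 = file_1_text.split()
tokens_2 = file_2_text.split()

# Word list in order of first appearance (dict keys preserve insertion order)
word_list = list(dict.fromkeys(tokens_1 + tokens_2))

# Frequency of every word in each file; a Counter returns 0 for missing words
freq_1 = Counter(tokens_1)
freq_2 = Counter(tokens_2)
vector_1 = [freq_1[word] for word in word_list]
vector_2 = [freq_2[word] for word in word_list]

# Sum of the per-word minima over the sum of the per-word maxima
min_sum = sum(min(a, b) for a, b in zip(vector_1, vector_2))
max_sum = sum(max(a, b) for a, b in zip(vector_1, vector_2))

print(vector_1)  # [1, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(vector_2)  # [0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
print('{:.2f}%'.format(100 * min_sum / max_sum))  # 16.67%
```

Note that `السليم` and `سليم` count as different words here, because the comparison is done on the raw, untokenized surface forms.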


In [None]:
from collections import Counter
from camel_tools.utils.normalize import normalize_unicode
import csv

# function to get a list of sentences from a raw file
def get_sentences_from_raw(filename):
    sentences = []

    # Open the file for reading, assuming it is UTF-8 encoded
    with open(filename, mode='r', encoding='utf8') as input_file:

        # Iterate through every line in the file
        for line in input_file:

            # Normalize unicode characters
            normalized_sentence = normalize_unicode(line)
            
            # Remove spaces/tabs/newlines at the beginning and end of the sentence
            stripped_line = normalized_sentence.strip()

            # Add the sentence to the existing list of sentences
            sentences.append(stripped_line)

    return sentences

# function to compute the word frequencies and the total number of tokens
def vocabulary_frequency(sentences):

    # initialize an empty list to store the word tokens
    list_of_tokens = []

    for sentence in sentences:

        # for every sentence in the list, extract the words and append them to the list
        list_of_tokens.extend(sentence.split())

    # total number of words (tokens), counting repetitions
    num_of_tokens = len(list_of_tokens)

    # generate a histogram from the complete list of words (the frequency for each unique word)
    histogram = Counter(list_of_tokens)

    # sort the histogram from the highest frequency to the lowest
    sorted_histogram = {k: v for k, v in sorted(histogram.items(), key=lambda item: item[1], reverse=True)}

    # return the sorted histogram together with the total token count
    return sorted_histogram, num_of_tokens


# This function calculates the vocabulary overlap between two files
# It takes two file paths as input, prints a short report, and writes three TSV histograms
def vocabulary_overlap(file_1, file_2):
    
    # Read the sentences from each of the files
    file_1_sentences = get_sentences_from_raw(file_1)
    file_2_sentences = get_sentences_from_raw(file_2)
    
    # Calculate the frequencies of words in each of the files 
    file_1_frequency, file_1_size = vocabulary_frequency(file_1_sentences)
    file_2_frequency, file_2_size = vocabulary_frequency(file_2_sentences)
            
    min_sum = 0.0
    max_sum = 0.0
    
    overlap_histogram = {}
    file_1_unique_tokens = {}
    file_2_unique_tokens = {}
    
    
    # For every word that appears in the first file, calculate its intersection with the second file
    # Simultaneously, calculate the union from all the words that appear in the first file
    for word in file_1_frequency:
        if word in file_2_frequency:
            min_sum += min(file_1_frequency[word], file_2_frequency[word])
            overlap_histogram[word] = min(file_1_frequency[word], file_2_frequency[word])
            max_sum += max(file_1_frequency[word], file_2_frequency[word])
        else:
            max_sum += file_1_frequency[word]
            file_1_unique_tokens[word] = file_1_frequency[word]
    
    # Add to the union the words that only appear in the second file
    for word in file_2_frequency:
        if word not in file_1_frequency:
            max_sum += file_2_frequency[word]
            file_2_unique_tokens[word] = file_2_frequency[word]

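    # Calculate the weighted Jaccard similarity (as a percentage)
    # If both files are empty, max_sum is 0, so report 0 instead of dividing by zero
    jaccard_similarity = (min_sum / max_sum) * 100 if max_sum else 0.0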

    
    print('File 1: ' + file_1)
    print('# of lines in the first file = ' + str(len(file_1_sentences)))
    print('# of tokens in the first file = ' + str(file_1_size))
    print('# of unique tokens in the first file = ' + str(len(file_1_frequency)) + '\n')

    print('File 2: ' + file_2)
    print('# of lines in the second file = ' + str(len(file_2_sentences)))
    print('# of tokens in the second file = ' + str(file_2_size))
    print('# of unique tokens in the second file = ' + str(len(file_2_frequency)) + '\n')
                             
    print('Overlap similarity between the two files = ' + "{0:.2f}".format(jaccard_similarity) + '%')
    
    sorted_overlap_histogram = {k: v for k, v in sorted(overlap_histogram.items(), key=lambda item: item[1], reverse=True)}
    print('\nTop 10 most shared tokens:')
    for word in list(sorted_overlap_histogram)[0:10]:
        print('\t{}\t{}'.format(word, sorted_overlap_histogram[word]))
    
    # write histogram to file:
    with open(file_1 + '_' + file_2.split('/')[-1] +'.overlap.tsv', 'w', encoding='utf-8', newline='') as csvfile:
        
        # create a writer object
        row_writer = csv.writer(csvfile, dialect='excel-tab')
        
        # write the header of the table
        row_writer.writerow(['Token', 'Freq'])
        for word in sorted_overlap_histogram:
            # write the rows (row by row)
            row_writer.writerow([word, sorted_overlap_histogram[word]])
    print('A complete histogram of the overlapped words is written to \'{}.overlap.tsv\''.format(file_1 + '_' + file_2.split('/')[-1]))


    sorted_file_1_unique_list = {k: v for k, v in sorted(file_1_unique_tokens.items(), key=lambda item: item[1], reverse=True)}
    print('\nTop 10 most frequent unique tokens in file 1:')
    for word in list(sorted_file_1_unique_list)[0:10]:
        print('\t{}\t{}'.format(word, sorted_file_1_unique_list[word]))
        
    
    # write histogram to file:
    with open(file_1 +'.unique.tsv', 'w', encoding='utf-8', newline='') as csvfile:
        
        # create a writer object
        row_writer = csv.writer(csvfile, dialect='excel-tab')
        
        # write the header of the table
        row_writer.writerow(['Token', 'Freq'])
        for word in sorted_file_1_unique_list:
            # write the rows (row by row)
            row_writer.writerow([word, sorted_file_1_unique_list[word]])
    print('A complete histogram of the unique words in file 1 is written to \'{}.unique.tsv\''.format(file_1))


    sorted_file_2_unique_list = {k: v for k, v in sorted(file_2_unique_tokens.items(), key=lambda item: item[1], reverse=True)}
    print('\nTop 10 most frequent unique tokens in file 2:')
    for word in list(sorted_file_2_unique_list)[0:10]:
        print('\t{}\t{}'.format(word, sorted_file_2_unique_list[word]))
    
    
    # write histogram to file:
    with open(file_2 +'.unique.tsv', 'w', encoding='utf-8', newline='') as csvfile:
        
        # create a writer object
        row_writer = csv.writer(csvfile, dialect='excel-tab')
        
        # write the header of the table
        row_writer.writerow(['Token', 'Freq'])
        for word in sorted_file_2_unique_list:
            # write the rows (row by row)
            row_writer.writerow([word, sorted_file_2_unique_list[word]])
    print('A complete histogram of the unique words in file 2 is written to \'{}.unique.tsv\''.format(file_2))
    

To run this code, call the function `vocabulary_overlap` with two parameters: the paths of the two files you want to compare. Try calling it with different files.
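The two loops inside `vocabulary_overlap` accumulate the per-word minima (the intersection) and the per-word maxima (the union). As a compact cross-check, the same two sums can be obtained from the multiset operators `&` and `|` of `collections.Counter`. The helper name `weighted_jaccard` and the choice to return `0.0` for two empty inputs are assumptions made for this sketch, not part of the code above.

```python
from collections import Counter


def weighted_jaccard(freq_a, freq_b):
    """Weighted Jaccard similarity of two Counters, as a value in [0, 1]."""
    # Counter intersection keeps the minimum count of each word
    intersection = sum((freq_a & freq_b).values())
    # Counter union keeps the maximum count of each word
    union = sum((freq_a | freq_b).values())
    # Assumption: two empty inputs are reported as 0.0 rather than raising an error
    return intersection / union if union else 0.0


# Toy example with placeholder tokens
freq_a = Counter('a b b c'.split())   # a:1, b:2, c:1
freq_b = Counter('b c c d'.split())   # b:1, c:2, d:1

# minima: b=1, c=1 -> 2; maxima: a=1, b=2, c=2, d=1 -> 6
print(weighted_jaccard(freq_a, freq_b))
```

Because `vocabulary_frequency` returns a plain dictionary, its result would need to be wrapped in `Counter(...)` before being passed to a helper like this one.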

## Vocabulary Overlap: Gigaword_AR D3 and ATB Tokenizations as an Example

The call to the function is as follows:

In [None]:
# Choose two files you want to compare, and change the parameters to be those files
file_1_name = 'Results/Gigaword_AR/gigaword_tiny_cleaned.txt.D3.tok'
file_2_name = 'Results/Gigaword_AR/gigaword_tiny_cleaned.txt.ATB.tok'

vocabulary_overlap(file_1_name, file_2_name)