# Wolof data preparation for Common Voice recording
We use the Wolof part of the data collected by [Masakhane](https://www.masakhane.io/) during the [MasakhaNER project](https://github.com/masakhane-io/lacuna_pos_ner).  
We split it into sentences short enough to be recorded under the conditions recommended by [Mozilla](https://commonvoice.mozilla.org/sentence-collector/#/en/how-to).

## Splitting

In [1]:
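def read_file(path):
    # Read explicitly as UTF-8 so Wolof characters (ë, à, ñ, ...) load the same on every platform
    with open(path, 'r', encoding='utf-8') as file:
        return file.readlines()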

In [2]:
wolof_data = read_file("data/raw/wolof-masakhane-lacuna-posner.txt")

In [3]:
wolof_data[0]

'Ndekete, cëslaayu yokk gi dañu ko wàññi ba mu gën a tuuti, nguur gi sàkkuwu cee def dara ngir jubbanti juuti bi, ci lu leer ne dayo bi des na ak sàmm njariñal bokk-koppar gi.\n'

In [4]:
print(f"We have {len(wolof_data)} sentences!")

We have 6366 sentences!


In [5]:
final_corpus    = []
corpus_to_split = []
LIMIT = 14 # Maximum number of words per sentence, to fit Common Voice requirements

for sentence in wolof_data:
    sentence_size = len(sentence.split())
    # Keep sentences of 3 to LIMIT words as they are
    if 2 < sentence_size <= LIMIT:
        final_corpus.append(sentence)
    else:
        # Longer sentences (and those of 2 words or fewer) go through the splitter
        corpus_to_split.append(sentence)
        
print(f"-> Nb of suited sentences: {len(final_corpus)}")
print(f"-> Nb of sentences to split: {len(corpus_to_split)}")

-> Nb of suited sentences: 1621
-> Nb of sentences to split: 4745


In [74]:
def split_into_parts(sentence, limit):
    """Split a sentence into consecutive parts of at most `limit` words.

    Return the parts joined by newlines, and the number of parts.
    """
    words = sentence.split()
    parts = []

    # Full parts of exactly `limit` words
    nb_full = len(words) // limit
    for i in range(nb_full):
        parts.append(" ".join(words[i*limit:(i+1)*limit]))
    # After splitting, keep the remainder only if it has more than one word
    rest = words[nb_full*limit:]
    if len(rest) > 1:
        parts.append(" ".join(rest))
    sentence_splits = "".join(part + "\n" for part in parts)
    return sentence_splits, len(parts)

In [75]:
final_corpus_v2 = []
cp = 0

for sentence in corpus_to_split:
    splits, tmp = split_into_parts(sentence, LIMIT)
    cp += tmp
    final_corpus_v2.append(splits)

        
print(f"-> Nb of final_corpus_v2: {cp}")

-> Nb of final_corpus_v2: 11987
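
Each element of `final_corpus_v2` is one string holding all the parts of an original sentence, separated by newlines, so `len(final_corpus_v2)` is not the number of recordable sentences. A minimal sketch of flattening it into one sentence per entry follows; `flatten_splits` is a hypothetical helper that is not part of the original notebook, and the sample strings are made-up placeholders:

```python
def flatten_splits(corpus):
    """Turn a list of newline-joined splits into a flat list of single lines."""
    lines = []
    for block in corpus:
        # splitlines() drops the trailing newline; blank leftovers are skipped
        lines.extend(line for line in block.splitlines() if line.strip())
    return lines

# Made-up placeholder blocks shaped like the output of split_into_parts
sample = ["a b c\nd e f\n", "g h\n", ""]
flat = flatten_splits(sample)
print(len(flat))  # 3
```

On the real data, `len(flatten_splits(final_corpus_v2))` should match the counter `cp` as long as it is computed before cleaning, since cleaning can change the text of a part.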


## Cleaning

In [70]:
import re

def clean(data):
    "Remove text within parentheses and collapse repeated spaces"
    
    cleaned = []
    for sentence in data:
        sentence = re.sub(r"\([^)]*\)", "", sentence) # drop "(...)" groups
        sentence = re.sub(r" +", " ", sentence)       # collapse runs of spaces
        cleaned.append(sentence)
    
    return cleaned

In [76]:
final_corpus    = clean(final_corpus)
final_corpus_v2 = clean(final_corpus_v2)

## Export

In [63]:
def to_file(file_path, corpus):
    # Write as UTF-8 so the exported files keep Wolof characters intact
    with open(file_path, 'w', encoding='utf-8') as f:
        f.writelines(corpus)

In [77]:
to_file('data/intermediate/wolof_to_upload_part1.txt', final_corpus)
to_file('data/intermediate/wolof_to_upload_part2.txt', final_corpus_v2)

In [73]:
print(f"-> Total Nb of sentences: {len(final_corpus)+cp}")

-> Total Nb of sentences: 13608


## Splitting part 2 of the corpus into chunks of 2000 sentences
This makes it easier to share for crowdsourced work.  
__NOTE:__ Part 2 is reloaded from the exported file because the *wolof_to_upload* files (parts 1 and 2) were post-processed manually: sentences were split, some were moved from part 2 to part 1, and regular expressions were used to remove numbers, abbreviations and some punctuation.
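
The exact expressions used for that post-processing are not recorded in this notebook. As a rough illustration only, a regex cleanup of this kind could look like the sketch below; the function name, the set of removed symbols and the English placeholder sentence are all assumptions, and abbreviations are not handled:

```python
import re

def strip_numbers_and_punctuation(sentence):
    """Illustrative cleanup: drop digits and a few punctuation marks."""
    sentence = re.sub(r"\d+", "", sentence)               # remove digit runs
    sentence = re.sub(r'["“”«»\[\]{}*#]', "", sentence)   # remove selected symbols (assumed list)
    sentence = re.sub(r"\s+([,.;:!?])", r"\1", sentence)  # no space before punctuation
    sentence = re.sub(r"\s+", " ", sentence)              # collapse whitespace
    return sentence.strip()

# Placeholder text, not taken from the corpus
print(strip_numbers_and_punctuation("In 2019 , the «council» met [again] ."))
# In, the council met again.
```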

In [None]:
wolof_data_v2 = read_file("wolof_to_upload_part2.txt")

In [None]:
# Convert wolof_data_v2 to a dataframe
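
The conversion itself is left empty in this cell. A minimal sketch with pandas, assuming a single column named `sentence` (the column name and the `to_dataframe` helper are hypothetical, and the sample lines are placeholders):

```python
import pandas as pd

def to_dataframe(lines, column="sentence"):
    """Build a one-column dataframe from raw lines, dropping blank ones."""
    sentences = [line.strip() for line in lines if line.strip()]
    return pd.DataFrame({column: sentences})

# Placeholder lines standing in for the contents of wolof_data_v2
df = to_dataframe(["first sentence\n", "\n", "second sentence\n"])
print(df.shape)  # (2, 1)
```

In the notebook this would become `wolof_data_v2 = to_dataframe(wolof_data_v2)` before the call to `slicing`.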

In [None]:
import os
from tqdm import tqdm

def slicing(document, path, chunk_prefix='document', edge=300):
    """ Slice a dataframe document into chunks of `edge` rows and store each
    chunk as an Excel file in the provided path, using chunk_prefix
    followed by the index of the chunk's first row as the file name."""
    global_size = len(document)

    os.makedirs(path, exist_ok=True)  # make sure the output directory exists
    print("Export in progress...")
    # Iterate over the first row of each chunk; the last chunk may be shorter
    for start in tqdm(range(0, global_size, edge)):
        chunk = document[start:start+edge]
        chunk.to_excel(os.path.join(path, chunk_prefix+"_"+str(start)+".xlsx"),
                       index=False, header=True, engine="openpyxl")

In [None]:
slicing(wolof_data_v2, 
        path="data/processed/chunks/",
        chunk_prefix="chunk",
        edge=2000)
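
Before exporting, the expected chunk boundaries can be checked without touching any file. The sketch below lists the `(start, end)` row ranges that chunks of a given size should cover; `chunk_bounds` is a hypothetical helper and the row count of 4500 is a made-up example, not the size of the real corpus:

```python
def chunk_bounds(total, chunk_size):
    """Return (start, end) row ranges covering `total` rows, end exclusive."""
    return [(start, min(start + chunk_size, total))
            for start in range(0, total, chunk_size)]

print(chunk_bounds(4500, 2000))  # [(0, 2000), (2000, 4000), (4000, 4500)]
```

The number of ranges is also the number of `.xlsx` files to expect in the output directory.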