Extract new sentences
---------------------

In this notebook, we extract new sentences from the book [diagne_grammaire](http://wolofresources.org/language/download/diagne_grammaire_wolof.pdf), which contains many examples of Wolof sentences translated into French. We want to add them to the sentences we have already collected, which are processed in the notebook [extract_sentences](extract_sentences.ipynb). We will create a class to speed up the work.

### Preprocessing of the book

To extract the text from the book, we again used the [Wondershare PDFElement](https://pdf.wondershare.net/fr/?gclid=Cj0KCQjwuLShBhC_ARIsAFod4fK-9pyDwLQmwNJYiw0CjsIXCMtvOX9iizTLLYpu52d6Qml12VkAKB0aAqMjEALw_wcB) software. The author commonly uses a colon to separate a Wolof sentence from its French translation (together they form a translation group). We identified several problems with the extraction:
1. The book contains many irrelevant sentences that also contain a colon
2. Some translation groups are not separated from the surrounding text, so they cannot be recovered directly
3. Some translation groups are too long and continue on the following line(s)
4. Some translation groups are incomplete
5. Some translation groups are not separated by a colon

We decided to preprocess the book as follows to make the extraction easier:
- We manually extracted the sentences with problems 3, 4, and 5 and corrected them: writing each translation group on a single line; completing some sentences; splitting a translation group into several ones *; manually writing the translation groups that are not separated by a colon.  
- We identified and removed the sentences with problem 1.
- We identified the translation groups with problem 2 and added extra spaces around them, because most translation groups are separated from the rest of the text by more than one space.
- We made further manual corrections directly in the text file.


*: Some sentences in the translation groups contain one of the following marks: 
- "...": indicates that the sentence is incomplete
- "—" "," ";": separate translations of the same sentence
- "(...)" "/" "—": indicate a possible completion or variant of a word or sentence translation. Note that the "..." inside the parentheses stands for some text
- ":" "²÷": separate parts of the same Wolof sentence 

For those translation groups, we have several options: 
- Rewrite them manually as separate translation groups (the most accurate correction)
- Split them at the identified mark
- Create another translation group containing the element that completes the sentence (used mostly when the marks are parentheses)
- Replace a mark with another mark
- Delete a mark

A translation group can contain several marks, in which case several corrections are applied. To do so, we need to identify which mark occurs first (for example, a "/" can appear inside parentheses, so the parentheses must be identified first).
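
Finding which mark occurs first can be sketched as follows. This is only an illustration: the `MARKS` list, the `first_mark` helper, and the example string are assumptions made for this sketch and are not part of the class defined later.

```python
# Hypothetical list of marks, based on the footnote above.
MARKS = ["...", "(", "/", "—", ";"]

def first_mark(sentence, marks=MARKS):
    """Return (position, mark) for the mark that starts earliest, or None."""
    found = [(sentence.find(mark), mark) for mark in marks if mark in sentence]
    return min(found) if found else None

# Made-up example: the "/" sits inside the parentheses, so "(" is found first.
position, mark = first_mark("mot : traduction (variante / autre variante)")
print(position, mark)
```

Comparing `(position, mark)` tuples with `min` picks the smallest position, so the outermost mark is handled before any mark nested inside it.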

**Remark**: Preprocessing the book can take several days, so we will create a class that corrects a given number of translation groups at a time and can resume where it stopped, for example if the program is interrupted. 
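
The idea of resuming after an interruption can be sketched with a small checkpoint file. The function `process_with_checkpoint`, its parameters, and the toy data are hypothetical names chosen for this sketch; it only shows the principle of saving the current index with `pickle`.

```python
import os
import pickle
import tempfile

def process_with_checkpoint(items, process, checkpoint_path, max_items=None):
    """Apply `process` to items, starting from the index stored in the checkpoint."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            start = pickle.load(f)["index"]
    stop = len(items) if max_items is None else min(len(items), start + max_items)
    results = []
    for index in range(start, stop):
        results.append(process(items[index]))
        # Save after each item so an interruption loses at most one step.
        with open(checkpoint_path, "wb") as f:
            pickle.dump({"index": index + 1}, f)
    return results

# Toy usage: process two groups, then resume with the remaining one.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint")
    groups = ["a : b", "c : d", "e : f"]
    first = process_with_checkpoint(groups, str.upper, path, max_items=2)
    second = process_with_checkpoint(groups, str.upper, path)
    print(first, second)
```

The second call starts at index 2 because the first call left that index in the checkpoint file.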

### Extraction

Let us load the manually processed text file and see what it contains.

In [1]:
with open("data/diagne_grammaire_wolof.txt", "r", encoding="utf-8") as f:
    
    txt_file = f.read()
    
print(txt_file)



20         GRAMMAIRE DE WOLOF MODERNE

Cette analyse phonologique procède des principes exposés dans
le Parler de Hauteville ¹. Les phonèmes, comme entités phoniques
indivisibles, y sont définis à partir des contrastes de substances qui
connotent des distinctions de sens.

A. — SYSTÈME CONSONANTIQUE

a) INVENTAIRE DES CONSONNES ORALES ET NASALES.

p   l'identification se fera par rapprochement avec b et m
p/b    paq : coiffure de jeune fille    baq : terre humide
p/m    matt : bois de chauffage      patt : borgne

up : fermer            um : porter malchance

p  est une consonne bilabiale, orale et sourde. Elle se réalise à
partir du contact des lèvres. Elle peut être implosive ou
explosive.

b  identification par opposition à p et m :

p/b   cf. ci-dessus

p/m   bokk : posséder ensemble    mokk : être pulvérisé

lam : bracelet          lap : se noyer
amal : trouver quelque chose  abal : prêter
pour quelqu'un

b  a un même point d'articulation que p. C'est une consonne sonore
et oral
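
The separation rule (groups delimited by two or more spaces or by a line break) can be illustrated on one line of the sample above. This is a simplified regex sketch written for illustration; it is not the character-by-character scan that the class performs.

```python
import re

line = "p/b    paq : coiffure de jeune fille    baq : terre humide"

# Chunks are separated by runs of two or more spaces; keep those with a colon.
chunks = re.split(r" {2,}", line)
groups = [chunk.strip() for chunk in chunks if ":" in chunk]
print(groups)
# ['paq : coiffure de jeune fille', 'baq : terre humide']
```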

Let us create the extraction class.

In [2]:
# %%writefile wolof-translate/wolof_translate/utils/extract_new_sentences.ipynb
from typing import Union
import pandas as pd
import pickle
import re
import os

class NewSentenceExtraction:

    def __init__(self, text: Union[str, None] = None, sent_sep: str = ":", corpus_1: str = "french", corpus_2: str = "wolof", save_directory: str = "data/additional_documents/diagne_sentences/", checkpoint_name: str = "new_sentences"):
        
        self.text = text
        
        self.sep = sent_sep
        
        self.groups = []
        
        self.index = 0
        
        self.save_directory = save_directory
        
        self.checkpoint = checkpoint_name
        
        self.extractions = {corpus_1: [], corpus_2: []}
        
    def __save(self):
        
        checkpoints = {
            # 'extractions': self.extractions,
            'index': self.index,
            # 'groups': self.groups
        }
        
        pd.DataFrame({'groups': self.groups}).to_csv(os.path.join(self.save_directory, 'groups.csv'), index=False)
        
        pd.DataFrame(self.extractions).to_csv(os.path.join(self.save_directory, 'extractions.csv'), index=False)
        
        with open(os.path.join(self.save_directory, self.checkpoint), "wb") as f:
            
            pickler = pickle.Pickler(f)
            
            pickler.dump(checkpoints)
    
    def load(self):
        
        with open(os.path.join(self.save_directory, self.checkpoint), "rb") as f:
            
            depickler = pickle.Unpickler(f)
            
            checkpoints = depickler.load()
        
        # 'list' restores the {column: [values]} structure built in __init__
        self.extractions = pd.read_csv(os.path.join(self.save_directory, 'extractions.csv')).to_dict('list')
        
        self.groups = pd.read_csv(os.path.join(self.save_directory, 'groups.csv'))['groups'].to_list()
        
        self.index = checkpoints['index']
        
    def get_groups(self, stop_criterions: list = ["  ", "\n"], comparisons: list = []):
        
        assert not self.text is None
        
        i = 0
        
        a = 0
        
        g = 1
        
        while i < len(self.text):
            
            
            letter = self.text[i]
            
            if letter == self.sep:
                
                print(f"Extraction of group number {g}\n")
                
                b = i - 1 # index of letters before the current letter
                
                a = i + 1 # index of letters after the current letter
                
                corpus_1_s = [] # letters of the left sentence
                
                corpus_2_s = [] # letters of the right sentence
                
                stop = False
                
                for stop_cr in stop_criterions:
                    
                    if self.text[b-len(stop_cr)+1:b+1] == stop_cr:
                        
                        stop = True   
                
                while not stop and b >= 0:  # do not walk past the start of the text
                    
                    corpus_1_s.append(self.text[b])
                    
                    b -= 1
                    
                    stop = False
                
                    for stop_cr in stop_criterions:
                        
                        if self.text[b-len(stop_cr)+1:b+1] == stop_cr:
                            
                            stop = True 
                
                stop = False
                
                for stop_cr in stop_criterions:
                    
                    if self.text[a:a+len(stop_cr)] == stop_cr:
                        
                        stop = True
                
                while not stop and a < len(self.text):  # do not walk past the end of the text
                    
                    corpus_2_s.append(self.text[a])
                    
                    a += 1
                    
                    stop = False
                
                    for stop_cr in stop_criterions:
                        
                        if self.text[a:a+len(stop_cr)] == stop_cr:
                            
                            stop = True   
                
                # reverse first sentence
                corpus_1_s.reverse()
                
                # add the sentences
                current_sentence = "".join(corpus_1_s).strip() + f" {self.sep} " + "".join(corpus_2_s).strip()
                
                if "".join(corpus_1_s).strip() != "" and "".join(corpus_2_s).strip() != "":
                    
                    # verify if it is not already manually got
                    not_recuperated = True
                
                    for comparison in comparisons:
                        
                        if current_sentence in comparison:
                            
                            not_recuperated = False
                    
                    # verify if it is not already in the extracted groups
                    for group in self.groups:
                        
                        if current_sentence in group:
                            
                            not_recuperated = False
                    
                    if not_recuperated:
                        
                        self.groups.append(current_sentence.strip())
                        # print(current_sentence)
                
                        g += 1
                    
                        print("Successfully extracted !!\n")
                    
                        print("-----------------\n")
                
                        i = a - 1
                    
                        self.__save()
                
            i += 1
                        
        # print("The groups were successfully recuperated !")
    
    def replace_groups(self, 
                      re_match: str,
                      delete_re: Union[str, None] = None,
                      n_replace_max: int = 1,
                      load: bool = True,
                      save: bool = False, 
                      manual_replace: bool = False,
                      csv_file: str = "data/additional_documents/diagne_sentences/founded.csv",
                      force_replace: bool = True):
        
        # we load the data
        if load:
            
            self.load()
        
        # find the groups matching the match regex
        founded = [(i, group) for i, group in enumerate(self.groups) if re.match(re_match, group)]
        
        print(f"The groups matching the regular expression {re_match} are the following:\n")
        
        [print(f'- {f[1]}') for f in founded]
        
        print("\n----------------------\n")
        
        # if regex for deletion are provided we replace those that will be found with a max number of replace 
        not_replaced = set()
        
        replaced = set()
        
        result = {}
        
        delete_re_ = input("Do you want to change the deletion regex -> provide one if yes or give an empty string ('') if not : ")
        
        if delete_re_ != '':
            
            delete_re = delete_re_
        
        if not delete_re is None or manual_replace:
            
            for i in range(len(founded)):
                
                f = founded[i][1]
                
                index = founded[i][0]
                
                if re.match(re_match, f):
                    
                    m_replace = ''
                    
                    if not force_replace:
                        
                        print(f"You will modify the following group:\n {f}")
                        
                        m_replace = input("\nDo you want to make a manual replacement of the group -> Yes(y) or No(n)")
                        
                        # keep asking until the answer is valid (the loop must update m_replace)
                        while m_replace not in ['y', 'n']:

                            m_replace = input("You must provide a response between Yes(y) and No(n)!")

                        if m_replace == 'y':

                            print(f"The manual modification of the group\n {f}\n is done in the following file: {csv_file}\nIf you want to provide multiple new groups, please write them on separate lines")

                            finish = 'n'

                            pd.DataFrame().to_csv(csv_file, index = False)

                            while finish == 'n':

                                finish = input("Have you finished the replacement? -> No(n) if not yet, press any other key if Yes(y) : ")

                    # apply the regex deletion only when no manual replacement was chosen
                    if delete_re is not None and m_replace != 'y':
                        
                        to_replace = set(re.findall(delete_re, f))
                        
                        for r in to_replace:
                            
                            if force_replace:
                                
                                f = f.replace(r, '', n_replace_max)
                                
                                replaced.add(f)
                            
                            else:
                                
                                replace_r = input(f"Do you want to replace the {r} string in the group:\n {f} ? Yes(y) or No(n)")
                                
                                while not replace_r in ['y', 'n']:
                                    
                                    replace_r = input(f"You must provide a response between Yes(y) and No(n)!")
                                
                                if replace_r == 'y':
                                
                                    f = f.replace(r, '', n_replace_max)
                                    
                                    replaced.add(f)
                                
                                else:       
                                    
                                    not_replaced.add(f)
                                    
                print("\n--------\n")
                
                if isinstance(f, str): 
                    
                    f = [f]
                
                try:
                
                    self.groups = self.groups[:index] + f + self.groups[index+1:]
                
                except IndexError:
                    
                    self.groups = self.groups[:index] + f 
        
                result[index] = f
                
            if save:
            
                print("Final result:")
                
                [print(r) for r in result]
                
                save_result = input("Do you want to save the result ? Yes(y) or No(n)")
                
                while save_result not in ['y', 'n']:

                    save_result = input("You must provide a response between Yes(y) and No(n)!")
                
                if save_result == 'y':
                    
                    self.__save()
        
            
        return {'founded': founded, 'result': result, 'replaced': replaced, 'not_replaced': not_replaced}
                    
                        
                        
                

In [58]:
'34231321'.replace('3', '', 4)

'42121'

Let us load the manually extracted groups.

In [3]:
with open("data/diagne_manual_recuperation.txt", "r", encoding="utf-8") as f:
    
    comparisons = f.read().strip()

In [4]:
# print manually extracted groups' text
comparisons

"anal : ramasser des ordures pour quelqu'un   \namal ! : obtenir quelque chose pour quelqu'un  \nwànal : montrer quelque chose (à quelqu'un de la part de quelqu'un d'autre)\ndac : toucher, entrer en contact avec\ndad : user par frottement\nxaañoo : se briser la tête mutuellement\nraf : clignoter, bouger de façon subreptice\nlaaw : prendre dans un filet\ndar : écorché, usé par frottement\nlaw : s'étendre \nlaf : bande d'étoffe grimpante\nsiifal : accaparer quelque chose pour quelqu'un\nfay : abandonner le faj domicile conjugal\nxamp : mordre à pleines dents\nsat : battre quelqu'un à plusieurs\nsañ : faire preuve d'audace excessive, manque de gêne\nlaõ : s'exiler plus ou moins définitivement\nxajal: faire de la place à quelqu'un \nxajjal : frayer un chemin à quelqu'un\ndagu: adopter une attitude de serviteur vis-à-vis de quelqu'un\njëxi : être sur le point de s'épuiser\njoor : terrain sablonneux, mais aussi nom de personne\nmaõkoo : être de connivence  \nweõgalu : pencher d'un côté\nteõx

Let us split the groups on the line break "\n".

In [5]:
comparisons = comparisons.split("\n")

In [6]:
# print again
comparisons

["anal : ramasser des ordures pour quelqu'un   ",
 "amal ! : obtenir quelque chose pour quelqu'un  ",
 "wànal : montrer quelque chose (à quelqu'un de la part de quelqu'un d'autre)",
 'dac : toucher, entrer en contact avec',
 'dad : user par frottement',
 'xaañoo : se briser la tête mutuellement',
 'raf : clignoter, bouger de façon subreptice',
 'laaw : prendre dans un filet',
 'dar : écorché, usé par frottement',
 "law : s'étendre ",
 "laf : bande d'étoffe grimpante",
 "siifal : accaparer quelque chose pour quelqu'un",
 'fay : abandonner le faj domicile conjugal',
 'xamp : mordre à pleines dents',
 "sat : battre quelqu'un à plusieurs",
 "sañ : faire preuve d'audace excessive, manque de gêne",
 "laõ : s'exiler plus ou moins définitivement",
 "xajal: faire de la place à quelqu'un ",
 "xajjal : frayer un chemin à quelqu'un",
 "dagu: adopter une attitude de serviteur vis-à-vis de quelqu'un",
 "jëxi : être sur le point de s'épuiser",
 'joor : terrain sablonneux, mais aussi nom de personne',
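
Some entries above have trailing spaces or no space before the colon (for example `"xajal: faire ..."`), while `get_groups` builds each candidate with exactly one space on each side of the separator before testing whether it appears in a comparison string. Such entries may therefore not match. A possible normalization step is sketched below; `normalize_group` is a hypothetical helper, not part of the original class.

```python
def normalize_group(group, sep=":"):
    """Strip outer whitespace and put one space on each side of the first separator."""
    left, found, right = group.partition(sep)
    if not found:
        return group.strip()
    return f"{left.strip()} {sep} {right.strip()}"

sample = [
    "anal : ramasser des ordures pour quelqu'un   ",
    "xajal: faire de la place à quelqu'un ",
]
normalized = [normalize_group(group) for group in sample]
print(normalized)
```

Only the first separator is normalized, so a colon that appears later inside the French translation is left untouched.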

Let us initialize the extraction class and extract the groups in the book that differ from those already extracted manually.

In [8]:
new_sent_extraction = NewSentenceExtraction(txt_file)

new_sent_extraction.get_groups(comparisons=comparisons)

Extraction of group number 1

Successfully extracted !!

-----------------

Extraction of group number 2

Successfully extracted !!

-----------------

Extraction of group number 3

Successfully extracted !!

-----------------

Extraction of group number 4

Successfully extracted !!

-----------------

Extraction of group number 5

Successfully extracted !!

-----------------

Extraction of group number 6

Successfully extracted !!

-----------------

Extraction of group number 7

Extraction of group number 7

Successfully extracted !!

-----------------

Extraction of group number 8

Successfully extracted !!

-----------------

Extraction of group number 9

Successfully extracted !!

-----------------

Extraction of group number 10

Successfully extracted !!

-----------------

Extraction of group number 11

Successfully extracted !!

-----------------

Extraction of group number 12

Successfully extracted !!

-----------------

Extraction of group number 13

Extraction of group numb

Let us print some of the groups.

In [9]:
new_sent_extraction.groups[-1000:]

["góor gi bëgg na : l'homme veut",
 'yéen dem õgeen : vous, vous avez été',
 'yéen bëgg õgeen : vous, vous voulez',
 "góor gi demul : l'homme n'a pas été, ne part pas",
 "góor gi bëggul : l'homme ne veut pas, n'a pas voulu",
 "suñu bëggul : s'ils ne veulent pas",
 "demuma : je n'ai pas été, je n'irai pas",
 'bëgguma : je ne veux pas',
 "su góor gi nëwée lépp baax : si l'homme vient, tout ira",
 'bu nu demee lépp baax : quand on ira, tout ira',
 "bi õga demee la : c'est quand tu as été",
 "ba waa ji dee la : c'est à l'époque où mourut cet homme",
 'góor gii demoon : cet homme qui était parti',
 'góor gii bëggóon : cet homme qui aimait (qui avait voulu, etc. )',
 "moo demoon : c'est lui qui avait été",
 'yaa õgii demoon fu ñu la tere : voilà que tu as été dans un lieu interdit',
 'fu mu demoon ? : où avait-il été ?',
 'noonu mu demoon foofa : comme il avait ainsi été en ce lieu',
 "su demoon : s'il avait été",
 "bu demulwoon : s'il n'avait pas été",
 "waa ji demulwoon : l'homme n'a pas é

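Each extracted group still holds both languages in one string. A minimal sketch of how such groups could be split into two aligned lists, with the same shape as the `extractions` dictionary of the class, is given below. The `pairs` name and the choice to split on the first `" : "` are assumptions made for this illustration.

```python
groups = [
    "góor gi bëgg na : l'homme veut",
    "fu mu demoon ? : où avait-il été ?",
]

pairs = {"wolof": [], "french": []}
for group in groups:
    # Split on the first " : " only; later colons stay in the translation.
    wolof, sep, french = group.partition(" : ")
    if sep:
        pairs["wolof"].append(wolof.strip())
        pairs["french"].append(french.strip())

print(pairs)
```
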
In [41]:
import re



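# Note: the square brackets form a character class, so this pattern matches any
# single "(", digit, "+" or ")" character rather than a whole "(n)" marker.
if re.match(r'[(\d+)].*', '(1) Je suis debout (1)'):
    
    expression = re.findall(r'[(\d+)]', '(1) Je suis debout (1)')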

In [51]:
re.findall(r'^[^\d]*(\d+)', '(1) Je suis debout (1)')

['1']

In [49]:
bool(re.match(r'[(\d+)].*', '1 Je suis debout (1)'))

True

In [57]:
re.findall(r'(\(\d+\))+', '(1) Je suis debout (1)')

['(1)', '(1)']
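
Building on the last pattern, numbered markers such as `(1)` could be removed from a group with `re.sub`. This is a sketch under the assumption that every `(n)` marker should be dropped; `strip_markers` is a hypothetical helper and not part of `replace_groups`.

```python
import re

MARKER = re.compile(r"\(\d+\)")

def strip_markers(group):
    """Remove every '(n)' marker and collapse the leftover whitespace."""
    without_markers = MARKER.sub("", group)
    return re.sub(r"\s+", " ", without_markers).strip()

cleaned = strip_markers("(1) Je suis debout (1)")
print(cleaned)
# Je suis debout
```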