Improve sentences
----------------------------

In this notebook we improve the corpus of new sentences and the final corpora, which is the concatenation of the original corpora and the new sentences (both were produced in the notebook [new_sentences](extract_new_sentences.ipynb)).

For the improvement we want to make the following modifications that we found to be necessary after fine-tuning the `t5` model on the new sentences:

- Remove the definitions: the new sentences were already extracted without the definitions, so we only need to load the "true sentences".
- Replace exclamation marks (!) at the end of sentences by periods (.): this should help the transformer identify the end of a sentence. Note that we do this for both the new sentences and the latest version of the corpora (v4). We cannot know in advance whether this improves performance, so we write a function, include it as an option in the hyperparameter search, and reduce the search space if necessary.
- Remove sentence pairs with fewer than 4 word tokens in either the French or the Wolof version (the `min_len` value used in the code further down). This reduces the number of sentences but should leave pairs that are closer to well-formed sentences.

The above modifications to version 4 of the corpora and to Diagne's sentences are needed to create version 5 of the corpora. The additional version of Diagne's sentences and version 6 of the corpora will also include sentences taken from `omniglot`. We will not filter the sentences by length for these new versions.

All the modifications can be done with the `pandas` library.

In [2]:
import pandas as pd
from tokenizers.pre_tokenizers import Whitespace
import re

### Remove the definitions

Let us load the two corpora.

In [3]:
corpora_v4 = pd.read_csv('data/extractions/new_data/corpora_v4.csv')

diagne_sentences = pd.read_csv('data/extractions/new_data/sentences.csv')

In [4]:
corpora_v4.shape[0], diagne_sentences.shape[0]

(2500, 1615)

### Add periods to the end of the sentences

We have three possibilities:

- We can replace the exclamation marks by periods (option 2)
- ... add a period at the end of each sentence that has no end mark (option 3)
- ... apply the two previous modifications at the same time (option 4)

The following function will be called inside the `TransformerSequences` class when retrieving the sentences before training.

In [5]:
%%writefile wolof-translate/wolof_translate/utils/improvements/end_marks.py

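from typing import Optional, Sequence, Union

def add_end_mark(sentences: Union[list, str], end_mark: str = ".", end_mark_to_remove: Optional[str] = None, 
                 replace: bool = False, poss_end_marks: Sequence[str] = ('!', '?', '.', '...')):
    """Normalise the end mark of each sentence and return the new list.
    
    With `replace=True`, a trailing `end_mark_to_remove` is swapped for `end_mark` and nothing else changes.
    Otherwise a trailing `end_mark_to_remove` (if given) is removed, then `end_mark` is appended to any
    sentence whose last character is not in `poss_end_marks`.
    """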
    
    if isinstance(sentences, str): sentences = [sentences]
    
    if replace and end_mark_to_remove is None:
        
        raise ValueError("You must provide a end mark to remove if you want to make replacement !")
    
    new_sentences = []
    
    # replace mode: swap a trailing end_mark_to_remove for end_mark
    # otherwise: strip a trailing end_mark_to_remove (if given), then append end_mark when no end mark is present
    for sentence in sentences:
        
        if replace:
            
            if sentence[-1] == end_mark_to_remove:
                
                sentence = sentence[:-1].strip() + end_mark
            
        else:
            
            if end_mark_to_remove is not None and sentence[-1] == end_mark_to_remove:
                
                sentence = sentence[:-1]
            
            if sentence[-1] not in poss_end_marks: 
                
                sentence = sentence.strip() + end_mark
        
        new_sentences.append(sentence)
    
    return new_sentences
    

Overwriting wolof-translate/wolof_translate/utils/improvements/end_marks.py
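
The function above reads `sentence[-1]`, so an empty string raises `IndexError`, and a space left behind after removing a mark hides the real last character (for example `"Me voici... !"` ends up with four dots). The following is a minimal sketch of a more defensive variant, not the project's function: the name `add_end_mark_safe` is hypothetical, it covers only the non-replace behaviour, and it assumes empty strings should be returned unchanged.

```python
from typing import List, Optional, Sequence, Union


def add_end_mark_safe(
    sentences: Union[Sequence[str], str],
    end_mark: str = ".",
    end_mark_to_remove: Optional[str] = None,
    poss_end_marks: Sequence[str] = ("!", "?", "."),
) -> List[str]:
    """Hypothetical defensive variant of add_end_mark (non-replace mode only)."""
    if isinstance(sentences, str):
        sentences = [sentences]

    cleaned = []
    for sentence in sentences:
        # Trailing whitespace would hide the real last character.
        sentence = sentence.rstrip()

        # Drop the unwanted mark, then any space that preceded it.
        if end_mark_to_remove and sentence.endswith(end_mark_to_remove):
            sentence = sentence[: -len(end_mark_to_remove)].rstrip()

        # Assumption: empty strings are returned unchanged instead of raising.
        if sentence and not sentence.endswith(tuple(poss_end_marks)):
            sentence += end_mark

        cleaned.append(sentence)

    return cleaned


print(add_end_mark_safe(["Me voici... !", "L'homme ira", ""], end_mark_to_remove="!"))
# ['Me voici...', "L'homme ira.", '']
```

Whether `"Me voici..."` or `"Me voici...."` is the better training target is a modelling choice; the sketch only shows how to get the former.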


Let us try the different ways of adding end marks with the above function.

In [6]:
%run wolof-translate/wolof_translate/utils/improvements/end_marks.py

In [6]:
# draw a random sample from Diagne's sentences
sample = diagne_sentences.sample(30)

In [7]:
sample

Unnamed: 0,french,wolof
1434,Je te confie celui-là dont il fut question au ...,Deŋk naa la boobule woon.
381,L'homme ira,Góor gi dana demi
284,C'est quand tu es venu,Bi ŋga ñëwée la
704,À qui parles-tu ?,Kooy waxal ?
1156,"J'ai été, tu as été, il a été...","Dem naa, dem ŋga, dem na..."
1359,Cet homme est Laobe de Saint-Louis.,Gor gii di Lawbe Ndar.
1365,C'est moi qui dois partir !,Maa di dem !
1234,Les enfants qui venaient ne viennent plus.,Xale yooya daan ñëw ñëwëtuñu.
976,Me voici... !,Maa ŋgii... !
1508,C'était ceux-là même jusqu'à ce qu'il parte.,Ñawoon la ba bi mu fiy joge.


We need to apply the modification to both language versions of the sentences.

- Replace exclamation marks by periods.

In [8]:
french_sents = sample['french'].to_list()

wolof_sents = sample['wolof'].to_list()

french_sents = add_end_mark(french_sents, end_mark_to_remove='!', replace=True)

wolof_sents = add_end_mark(wolof_sents, end_mark_to_remove='!', replace=True)

pd.DataFrame({'french': french_sents, 'wolof': wolof_sents})


Unnamed: 0,french,wolof
0,Je te confie celui-là dont il fut question au ...,Deŋk naa la boobule woon.
1,L'homme ira,Góor gi dana demi
2,C'est quand tu es venu,Bi ŋga ñëwée la
3,À qui parles-tu ?,Kooy waxal ?
4,"J'ai été, tu as été, il a été...","Dem naa, dem ŋga, dem na..."
5,Cet homme est Laobe de Saint-Louis.,Gor gii di Lawbe Ndar.
6,C'est moi qui dois partir.,Maa di dem.
7,Les enfants qui venaient ne viennent plus.,Xale yooya daan ñëw ñëwëtuñu.
8,Me voici....,Maa ŋgii....
9,C'était ceux-là même jusqu'à ce qu'il parte.,Ñawoon la ba bi mu fiy joge.


- Add a period to each sentence that has no end mark, without replacing or removing anything

In [9]:
french_sents = sample['french'].to_list()

wolof_sents = sample['wolof'].to_list()

french_sents = add_end_mark(french_sents)

wolof_sents = add_end_mark(wolof_sents)

pd.DataFrame({'french': french_sents, 'wolof': wolof_sents})


Unnamed: 0,french,wolof
0,Je te confie celui-là dont il fut question au ...,Deŋk naa la boobule woon.
1,L'homme ira.,Góor gi dana demi.
2,C'est quand tu es venu.,Bi ŋga ñëwée la.
3,À qui parles-tu ?,Kooy waxal ?
4,"J'ai été, tu as été, il a été...","Dem naa, dem ŋga, dem na..."
5,Cet homme est Laobe de Saint-Louis.,Gor gii di Lawbe Ndar.
6,C'est moi qui dois partir !,Maa di dem !
7,Les enfants qui venaient ne viennent plus.,Xale yooya daan ñëw ñëwëtuñu.
8,Me voici... !,Maa ŋgii... !
9,C'était ceux-là même jusqu'à ce qu'il parte.,Ñawoon la ba bi mu fiy joge.


- Remove exclamation marks and add periods

In [10]:
french_sents = sample['french'].to_list()

wolof_sents = sample['wolof'].to_list()

french_sents = add_end_mark(french_sents, end_mark_to_remove='!')

wolof_sents = add_end_mark(wolof_sents, end_mark_to_remove='!')

pd.DataFrame({'french': french_sents, 'wolof': wolof_sents})


Unnamed: 0,french,wolof
0,Je te confie celui-là dont il fut question au ...,Deŋk naa la boobule woon.
1,L'homme ira.,Góor gi dana demi.
2,C'est quand tu es venu.,Bi ŋga ñëwée la.
3,À qui parles-tu ?,Kooy waxal ?
4,"J'ai été, tu as été, il a été...","Dem naa, dem ŋga, dem na..."
5,Cet homme est Laobe de Saint-Louis.,Gor gii di Lawbe Ndar.
6,C'est moi qui dois partir.,Maa di dem.
7,Les enfants qui venaient ne viennent plus.,Xale yooya daan ñëw ñëwëtuñu.
8,Me voici....,Maa ŋgii....
9,C'était ceux-là même jusqu'à ce qu'il parte.,Ñawoon la ba bi mu fiy joge.
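
The three calls above differ only in their keyword arguments, which makes the end-mark option easy to expose as a single value in a hyperparameter search. The sketch below is one possible way to do that, not the project's code: `END_MARK_OPTIONS` and `apply_end_mark_option` are hypothetical names, and treating option 1 as "leave the sentences unchanged" is an assumption, since the text only numbers options 2 to 4.

```python
from typing import Callable, Dict, List

# Keyword arguments matching the three calls demonstrated above.
END_MARK_OPTIONS: Dict[int, dict] = {
    2: {"end_mark_to_remove": "!", "replace": True},  # replace "!" by "."
    3: {},                                            # only add missing periods
    4: {"end_mark_to_remove": "!"},                   # remove "!", then add periods
}


def apply_end_mark_option(
    sentences: List[str],
    option: int,
    add_end_mark_fn: Callable[..., List[str]],
) -> List[str]:
    """Hypothetical dispatcher: run the end-mark function with one option's settings."""
    if option == 1:
        # Assumption: option 1 means "leave the sentences unchanged".
        return list(sentences)
    return add_end_mark_fn(sentences, **END_MARK_OPTIONS[option])
```

In the notebook this would be called as `apply_end_mark_option(french_sents, 4, add_end_mark)`.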


### Remove sentences with fewer than 4 tokens

Only words count as tokens; they are delimited by whitespace or punctuation. Note also that we check each language version: if one version of a pair is rejected, the other is dropped as well.
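
To make the token count concrete, here is a small sketch of how the `\w+` pattern splits a sentence. The helper name `word_tokens` is only for illustration; the sentences come from the sample shown earlier. Apostrophes, hyphens, and punctuation all act as separators, so `C'est` counts as two tokens and `Saint-Louis` as two.

```python
import re


def word_tokens(sentence: str) -> list:
    """Return the runs of word characters found in the sentence."""
    return re.findall(r"\w+", sentence)


for sentence in [
    "C'est moi qui dois partir !",
    "Maa di dem !",
    "Cet homme est Laobe de Saint-Louis.",
]:
    tokens = word_tokens(sentence)
    print(len(tokens), tokens)

# 6 ['C', 'est', 'moi', 'qui', 'dois', 'partir']
# 3 ['Maa', 'di', 'dem']
# 7 ['Cet', 'homme', 'est', 'Laobe', 'de', 'Saint', 'Louis']
```

With a minimum of 4 tokens, the first pair above would be rejected because its Wolof side has only 3 tokens.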

In [11]:
def select_by_length(sentences: pd.DataFrame, min_len: int = 4):
    
    new_sentences = {'french': [], 'wolof': []}
    
    for i in list(sentences.index):
        
        sent1 = sentences.loc[i, 'french']
        
        sent2 = sentences.loc[i, 'wolof']
        
        # raw strings keep the backslash in \w from being read as a string escape
        tokens1 = re.findall(r'\w+', sent1)
        
        tokens2 = re.findall(r'\w+', sent2)
        
        take = True
        
        if len(tokens1) < min_len or len(tokens2) < min_len:
            
            take = False
        
        if take:
            
            new_sentences['french'].append(sent1)
            
            new_sentences['wolof'].append(sent2)
            
    return pd.DataFrame(new_sentences)
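
The same filter can also be written without an explicit loop. The sketch below is an alternative under the assumption that both columns contain only strings; `select_by_length_vectorized` is a hypothetical name, and it relies on `Series.str.count`, which counts the regex matches in each string.

```python
import pandas as pd


def select_by_length_vectorized(sentences: pd.DataFrame, min_len: int = 4) -> pd.DataFrame:
    """Keep rows whose French and Wolof sides both have at least min_len word tokens."""
    # Number of \w+ runs per sentence, computed column-wise.
    french_ok = sentences["french"].str.count(r"\w+") >= min_len
    wolof_ok = sentences["wolof"].str.count(r"\w+") >= min_len

    # Keep only the two text columns and rebuild the index, as the loop version does.
    return sentences.loc[french_ok & wolof_ok, ["french", "wolof"]].reset_index(drop=True)


toy = pd.DataFrame({
    "french": ["L'homme ira", "C'est quand tu es venu", "Il a tant parlé que le voilà."],
    "wolof": ["Goor gi dana demi", "Bi nga new la", "Wax na ba"],
})
print(select_by_length_vectorized(toy))
```

Only the second toy row survives: the first has 3 French tokens and the third has 3 Wolof tokens.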

Let us try it on the sample that we drew earlier.

In [12]:
sample = select_by_length(sample, min_len=4)


In [13]:
sample

Unnamed: 0,french,wolof
0,Je te confie celui-là dont il fut question au ...,Deŋk naa la boobule woon.
1,C'est quand tu es venu,Bi ŋga ñëwée la
2,"J'ai été, tu as été, il a été...","Dem naa, dem ŋga, dem na..."
3,Cet homme est Laobe de Saint-Louis.,Gor gii di Lawbe Ndar.
4,Les enfants qui venaient ne viennent plus.,Xale yooya daan ñëw ñëwëtuñu.
5,C'était ceux-là même jusqu'à ce qu'il parte.,Ñawoon la ba bi mu fiy joge.
6,J'ai vu les enfants sauf toi.,Gis naa xale yi gannaaw yaw.
7,Il a tant parlé que le voilà.,Wax na ba mi ŋgi.
8,Je n'avais été nulle part ce matin.,Demumawoon fenn tày ci suba.
9,C'est moi le professeur.,Maa di jaŋgalékat bi.


It kept only 18 of the 20 samples. Let us do the same for all the sentences.

In [14]:
diagne_sentences = select_by_length(diagne_sentences)

corpora_v4 = select_by_length(corpora_v4)

Let us verify how many sentences we have kept.

In [15]:
# for diagne's sentences
diagne_sentences.shape[0]

970

In [16]:
# in corpora v4
corpora_v4.shape[0]

1849

We removed $651$ sentences from corpora v4 and $645$ sentences from Diagne's sentences.

Let us save the results so that we can rerun the hyperparameter search on them.

In [17]:
# save the new version of the diagne's sentences in red_sentences.csv
diagne_sentences.to_csv('data/extractions/new_data/red_sentences.csv', index = False)

# save the new version of the corpora
corpora_v4.to_csv('data/extractions/new_data/corpora_v5.csv', index = False)

### Retrieve additional sentences

In [7]:
corpora_v4 = pd.read_csv('data/extractions/new_data/corpora_v4.csv')

diagne_sentences = pd.read_csv('data/extractions/new_data/sentences.csv')

Let us retrieve additional sentences to augment the corpora.

- First file:

In [8]:
path = 'data/additional_documents/omniglot/sentences_french_wolof.txt'

with open(path, 'r', encoding='utf-8') as f:
    
    # read all the lines
    omn_1 = f.readlines()
    
# collect the sentence pairs
omn_1_ = {
    'french': [],
    'wolof': []
}

for omn in omn_1:
    
    omn = omn.split('\t')
    
    sentences = [s.strip() for s in omn if s.strip() != '']
    
    omn_1_['french'].append(sentences[0])
    
    omn_1_['wolof'].append(sentences[1])
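
The loop above assumes that every line contains two non-empty fields; a blank or single-field line would make `sentences[1]` raise `IndexError`. As a hedged sketch, the parsing can be wrapped in a small helper that skips such lines and handles both column orders. The name `parse_pairs` and its parameters are hypothetical, and the demo lines are illustrative rather than read from the actual files.

```python
from typing import Dict, Iterable, List


def parse_pairs(lines: Iterable[str], sep: str = "\t", french_first: bool = True) -> Dict[str, List[str]]:
    """Split each line on sep and collect (french, wolof) pairs, skipping malformed lines."""
    pairs: Dict[str, List[str]] = {"french": [], "wolof": []}

    for line in lines:
        fields = [field.strip() for field in line.split(sep) if field.strip()]

        if len(fields) < 2:
            # Blank or single-field line: nothing to pair up.
            continue

        first, second = fields[0], fields[1]
        french, wolof = (first, second) if french_first else (second, first)

        pairs["french"].append(french)
        pairs["wolof"].append(wolof)

    return pairs


demo = ["Bienvenue\tMerhbe\n", "\n", "Salut\t\tNa nga def\n"]
print(parse_pairs(demo))
# {'french': ['Bienvenue', 'Salut'], 'wolof': ['Merhbe', 'Na nga def']}
```

Files that list the Wolof side first would use `french_first=False`, and a file separated by an en dash would pass that character as `sep`.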

In [9]:
pd.options.display.max_rows = 74
omn_1_ = pd.DataFrame(omn_1_).head(74)

# print the first sentences from omniglot
omn_1_.head(74)

Unnamed: 0,french,wolof
0,Bienvenue,Merhbe
1,Salut,Na nga def
2,Salut,Na ngeen def
3,Salut,Salaam aleekum
4,Comment vas-tu ?,Jaam nga am ?
5,Es-tu en paix ?,Jaam nga am ?
6,Comment vas-tu ?,Na nga def?
7,"Paix seulement, et toi ?","Jaam rek, Yow nag ?"
8,"Je suis seulement toi, comment vas-tu ?","Mangi fi rekk, na nga def ?"
9,Ça fait longtemps que je ne t'ai pas vu,Gej na la giis


- Second file

In [10]:
path = 'data/additional_documents/omniglot/sentences_french_wolof_2.txt'

with open(path, 'r', encoding='utf-8') as f:
    
    # read all the lines
    omn_2 = f.readlines()
    
# collect the sentence pairs
omn_2_ = {
    'french': [],
    'wolof': []
}

for omn in omn_2:
    
    omn = omn.split('–')
    
    sentences = [s.strip() for s in omn if s.strip() != '']
    
    omn_2_['wolof'].append(sentences[0])
    
    omn_2_['french'].append(sentences[1])

In [11]:
pd.options.display.max_rows = 74
omn_2_ = pd.DataFrame(omn_2_).head(74)

# print the first sentences from omniglot
omn_2_.head(74)

Unnamed: 0,french,wolof
0,Comment allez-vous ?,Na ngeen def ?
1,Bonne matinée.,Jaam nga fanane.
2,Bonne nuit.,Fanaanal jaam.
3,À une autre fois.,Ba beneen.
4,Je t'en prie.,Agsil.
5,Je vous en prie.,Agsileen ak jaam.
6,Pardon.,Baal ma.
7,Oui.,Wau.
8,Non.,Deh-det.
9,Have you peace ?,Jaam nga am ?


- Third file

In [12]:
path = 'data/additional_documents/omniglot/sentences_french_wolof_3.txt'

with open(path, 'r', encoding='utf-8') as f:
    
    # read all the lines
    omn_3 = f.readlines()
    
# collect the sentence pairs
omn_3_ = {
    'french': [],
    'wolof': []
}

for omn in omn_3:
    
    omn = omn.split('\t')
    
    sentences = [s.strip() for s in omn if s.strip() != '']
    
    omn_3_['wolof'].append(sentences[0])
    
    omn_3_['french'].append(sentences[1])
      
        

In [13]:
pd.options.display.max_rows = 74
omn_3_ = pd.DataFrame(omn_3_).head(74)

# print the first sentences from omniglot
omn_3_.head(74)

Unnamed: 0,french,wolof
0,Peace be upon you.,Salaam maaleekum.
1,Que la paix soit avec toi aussi.,Maaleekum salaam
2,Es-tu en paix ?,Jaama ngaam ?
3,Paix seulement.,Jamma rek.
4,Comment va ta famille ?,Ana waa kër gi ?
5,Où sont les membres de ta famille ?,Ana waa kër gi ?
6,Ils sont là-bas.,Ñunga fa.
7,J'espère qu'ils vont bien.,Mbaa defuñu dara.
8,Rien ne va mal avec eux.,Mbaa defuñu dara.
9,"Non, ils vont bien.","Déedéet, defuñu dara."


- Fourth file

In [14]:
path = 'data/additional_documents/hugging_face/wolof_french.csv'

hug_sents = pd.read_csv(path)

In [15]:
pd.options.display.max_rows = 74

hug_sents.head(74)

Unnamed: 0,french,wolof
0,J'arrive tout de suite chez toi.,Léegui léegui ma egg sa kër.
1,"J'en suis sûr, cette photo ci c'est la photo p...",Waaw nataal bii nataal la boob ay nit ñu baree...
2,Je vois devant moi une photo sur laquelle beau...,Nataal bii maa ngi ciy janloog haa ay nit yu b...
3,Ceux-ci sont des personnes qui sont sortis pou...,"Lii, ay nit lañu yu génn di ñaxtu. Jëm yi nag ..."
4,Salut ! Ceux-là qui ressemblent à des personne...,"Salaawaalekum ! Ñii de, mel nañ ne, ay nit ñu ..."
5,Cette photo ci c'est une photo sur laquelle je...,Nataal bi ab nataal la boo xamante yni maa ngi...
6,"Sur la photo, ont voit des personnes qui se ré...",Nataal bii ñoo ngi ciy gis ay nit ñuy ñaxtu wa...
7,On voit sur la photo beaucoup de personnes sor...,Ñu gis ci nataal bi ay nit ñu bari ñu génn ci ...
8,"C'est des poissons, oui. Ils sont de couleur b...","Jën la waaw, Wu am wirgo Wu baxa ak Wu xonq."
9,"Ah sur cette photo ci cependant, il y a un poi...","Aah nataal bii nag, aw jën la. Jën wi mi ngi a..."


Let us add them to Diagne's sentences and to version 4 of the corpora, then remove duplicated rows.

- On corpora version 4

In [16]:
corpora_v4 = pd.concat((corpora_v4, omn_1_, omn_2_, omn_3_, hug_sents), axis=0)

# count duplicated rows
print(sum(corpora_v4.duplicated()))

7


In [17]:
corpora_v4.drop_duplicates(inplace=True)

7 rows were deleted.

- On Diagne's sentences

In [18]:
diagne_sentences = pd.concat((diagne_sentences, omn_1_, omn_2_, omn_3_, hug_sents), axis=0)

# count duplicated rows
print(sum(diagne_sentences.duplicated()))

4


In [19]:
diagne_sentences.drop_duplicates(inplace=True)

4 rows were deleted.

Let us apply only the fourth end-mark option (remove exclamation marks, then add periods) and check whether it produces duplicated rows.

- Corpora v4

In [21]:
french_sents = corpora_v4['french'].to_list()

wolof_sents = corpora_v4['wolof'].to_list()

french_sents = add_end_mark(french_sents, end_mark_to_remove='!')

wolof_sents = add_end_mark(wolof_sents, end_mark_to_remove='!')

corpora_v4_end = pd.DataFrame({'french': french_sents, 'wolof': wolof_sents})

In [22]:
duplicated = corpora_v4_end.duplicated()

sum(duplicated)

5

We obtain 5 duplicated rows. Let us retrieve their indexes.

In [23]:
dup_indexes = duplicated[duplicated].index

In [24]:
dup_indexes

Int64Index([1428, 1597, 2241, 2243, 2367], dtype='int64')

Let us remove the rows at those indexes from the corpora.

In [25]:
corpora_v4.drop(index = dup_indexes, inplace=True)
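
One point to keep in mind: `corpora_v4_end` is built from plain lists, so it has a fresh `0..n-1` index, whereas `corpora_v4` still carries the labels left by `pd.concat` and `drop_duplicates`. `DataFrame.drop(index=...)` removes rows by label, so the two only agree when labels and positions coincide. The following is a minimal sketch, on toy data, of filtering by position instead; the frames and their contents are illustrative only.

```python
import pandas as pd

# Toy frames standing in for the corpora and an added file (illustrative data only).
base = pd.DataFrame({"french": ["Oui !", "Non."], "wolof": ["Wau !", "Deh-det."]})
extra = pd.DataFrame({"french": ["Salut", "Oui."], "wolof": ["Na nga def", "Wau."]})

corpus = pd.concat((base, extra), axis=0)  # index labels are 0, 1, 0, 1

# Normalised copy with a fresh RangeIndex, as when rebuilding a frame from lists.
normalised = pd.DataFrame({
    "french": [s.replace(" !", ".") for s in corpus["french"]],
    "wolof": [s.replace(" !", ".") for s in corpus["wolof"]],
})

# Boolean mask by position: True for rows that repeat an earlier row.
mask = normalised.duplicated().to_numpy()

# A NumPy mask is applied positionally, so repeated or shifted labels do not matter.
corpus = corpus.loc[~mask].reset_index(drop=True)
print(corpus)
```

Here the duplicate sits at position 3 but carries label 1, so a label-based drop would not target it; the positional mask removes exactly that row. Passing `ignore_index=True` to `pd.concat` and resetting the index after `drop_duplicates` is another way to keep labels and positions aligned.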

- Diagne's sentences

In [26]:
french_sents = diagne_sentences['french'].to_list()

wolof_sents = diagne_sentences['wolof'].to_list()

french_sents = add_end_mark(french_sents, end_mark_to_remove='!')

wolof_sents = add_end_mark(wolof_sents, end_mark_to_remove='!')

diagne_end = pd.DataFrame({'french': french_sents, 'wolof': wolof_sents})

In [27]:
duplicated = diagne_end.duplicated()

sum(duplicated)

5

We obtain 5 duplicated rows. Let us retrieve their indexes.

In [28]:
dup_indexes = duplicated[duplicated].index

In [29]:
dup_indexes

Int64Index([546, 715, 1359, 1361, 1485], dtype='int64')

Let us remove the rows at those indexes from Diagne's sentences.

In [30]:
diagne_sentences.drop(index = dup_indexes, inplace=True)

Let us save the results.

In [31]:
diagne_sentences.to_csv('data/extractions/new_data/ad_sentences.csv', index=False)

corpora_v4.to_csv('data/extractions/new_data/corpora_v6.csv', index=False)

Let us check how many sentences we have so far.

In [32]:
# from diagne's book
diagne_sentences.shape[0]

1899

In [33]:
# inside the corpora
corpora_v4.shape[0]

2781

We now have `1899` sentences in the augmented set built from Diagne's book, and the final corpora contains `2781` sentences.