Improve sentences
----------------------------

In this notebook we want to improve the corpora of new sentences and the final corpora which is the concatenation of the original corpora and the new one (we got the two in the following notebook [new_sentences](extract_new_sentences.ipynb)). 

For the improvement we want to make the following modifications that we found to be necessary after fine-tuning the `t5` model on the new sentences:

- Remove the definitions: We had already extract the new sentences without the definitions we will only need to load the "true sentences".
- Replace exclamation marks (!) at the end of the phrases by period marks (.): This will help the transformers to identify the end of a sentence. Not that we will do this for both of the new sentences and the last version corpora (v4). We cannot know if it is the right thing to do in order to obtain better performance that's why we will create a function and add it when making hyper-parameter search and reduce the search space if necessary.
- Remove sentences with less than 3 tokens in one of the French of Wolof version. This will reduce the number of sentences but can provide better agreement with the right way to write sentences.

The above modifications on the version 4 of the corpora and the diagne's sentences will be necessary to create the version 5 of the corpora. The additional version of the diagne's sentences and 6 of the corpora will understand additionally sentences got from `omniglot`. We will not select the sentences by length for the new versions.

All the modifications can be done with the `pandas` library.

In [1]:
import re
import pandas as pd
from tokenizers.pre_tokenizers import Whitespace
from wolof_translate.utils.database_manager import TranslationMongoDBManager

### Remove the definitions

Let us load the two corpora.

In [2]:
corpora_v4 = pd.read_csv('data/extractions/new_data/corpora_v4.csv')

diagne_sentences = pd.read_csv('data/extractions/new_data/sentences.csv')

In [3]:
corpora_v4.shape[0], diagne_sentences.shape[0]

(2500, 1615)

### Add periods to the end of the sentences

We have three possibilities:

- We can replace the exclamations by period marks (option 2)
- ... add period mark at each end of sentences (option 3)
- ... do the two previous modification at same time (option 4)

The following function will be added inside the `TransformerSequences` class when recuperating the sentences before the training.

In [4]:
%%writefile wolof-translate/wolof_translate/utils/improvements/end_marks.py

import pandas as pd
from typing import *

def add_end_mark(sentences: Union[list, str], end_mark: str = ".", end_mark_to_remove: Union[str, None] = None, 
                 replace: bool = False, poss_end_marks: list = ['!', '?', '.', '...']):
    
    if isinstance(sentences, str): sentences = [sentences]
    
    if replace and end_mark_to_remove is None:
        
        raise ValueError("You must provide a end mark to remove if you want to make replacement !")
    
    new_sentences = []
    
    # if replace is chosen we will replace the end to remove by the end mark if not only add end mark to each sentence and remove the end mark to remove
    for sentence in sentences:
        
        if replace:
            
            if sentence[-1] == end_mark_to_remove:
                
                sentence = sentence[:-1].strip() + end_mark
            
        else:
            
            if not end_mark_to_remove is None and sentence[-1] == end_mark_to_remove:
                
                sentence = sentence[:-1]
            
            if not sentence[-1] in poss_end_marks: 
                
                sentence = sentence.strip() + end_mark
        
        new_sentences.append(sentence)
    
    return new_sentences
    

Overwriting wolof-translate/wolof_translate/utils/improvements/end_marks.py


Let us try the different type of end mark adding with the above function.

In [5]:
%run wolof-translate/wolof_translate/utils/improvements/end_marks.py

In [6]:
# recuperate some sentences from the diagne's sentences
sample = diagne_sentences.sample(30)

In [7]:
sample

Unnamed: 0,french,wolof
1583,Les enfants veulent venir et les adultes souha...,"Xale yi bëgg nañu dikk, te mag ni ñaan nañu ŋg..."
949,J'ai vu Monsieur Moussa.,Gis naa góor gi Musaa.
251,Tout village Sérère est propre !,Béppu dëkku Séeréer set na !
612,Vous êtes ceux-là.,Ñooñu ŋgeen.
868,Que veux-tu ?,Loo bëgg ?
125,Aucune dame ne s'est égarée.,Jenn jigéen réerul.
835,Donnes le livre à ce fils bien élevé !,Joxal téere bi doomu nit ku yaru kooku !
584,Qui d'autre veut partir ?,Keneen kan moo bëgg dem ?
1516,"C'est Fatim, si on va au fond des choses !","Soo demee, Faatim la !"
892,Tu n'avais pas été à l'intérieur,Demuloowoon ca biir


We need to add the modification on each version of the sentences.

- Replace exclamations by end marks.

In [8]:
french_sents = sample['french'].to_list()

wolof_sents = sample['wolof'].to_list()

french_sents = add_end_mark(french_sents, end_mark_to_remove='!', replace=True)

wolof_sents = add_end_mark(wolof_sents, end_mark_to_remove='!', replace=True)

pd.DataFrame({'french': french_sents, 'wolof': wolof_sents})


Unnamed: 0,french,wolof
0,Les enfants veulent venir et les adultes souha...,"Xale yi bëgg nañu dikk, te mag ni ñaan nañu ŋg..."
1,J'ai vu Monsieur Moussa.,Gis naa góor gi Musaa.
2,Tout village Sérère est propre.,Béppu dëkku Séeréer set na.
3,Vous êtes ceux-là.,Ñooñu ŋgeen.
4,Que veux-tu ?,Loo bëgg ?
5,Aucune dame ne s'est égarée.,Jenn jigéen réerul.
6,Donnes le livre à ce fils bien élevé.,Joxal téere bi doomu nit ku yaru kooku.
7,Qui d'autre veut partir ?,Keneen kan moo bëgg dem ?
8,"C'est Fatim, si on va au fond des choses.","Soo demee, Faatim la."
9,Tu n'avais pas été à l'intérieur,Demuloowoon ca biir


- Add end marks at each sentence without replacing or removing

In [9]:
french_sents = sample['french'].to_list()

wolof_sents = sample['wolof'].to_list()

french_sents = add_end_mark(french_sents)

wolof_sents = add_end_mark(wolof_sents)

pd.DataFrame({'french': french_sents, 'wolof': wolof_sents})


Unnamed: 0,french,wolof
0,Les enfants veulent venir et les adultes souha...,"Xale yi bëgg nañu dikk, te mag ni ñaan nañu ŋg..."
1,J'ai vu Monsieur Moussa.,Gis naa góor gi Musaa.
2,Tout village Sérère est propre !,Béppu dëkku Séeréer set na !
3,Vous êtes ceux-là.,Ñooñu ŋgeen.
4,Que veux-tu ?,Loo bëgg ?
5,Aucune dame ne s'est égarée.,Jenn jigéen réerul.
6,Donnes le livre à ce fils bien élevé !,Joxal téere bi doomu nit ku yaru kooku !
7,Qui d'autre veut partir ?,Keneen kan moo bëgg dem ?
8,"C'est Fatim, si on va au fond des choses !","Soo demee, Faatim la !"
9,Tu n'avais pas été à l'intérieur.,Demuloowoon ca biir.


- Remove exclamation marks and add period marks

In [10]:
french_sents = sample['french'].to_list()

wolof_sents = sample['wolof'].to_list()

french_sents = add_end_mark(french_sents, end_mark_to_remove='!')

wolof_sents = add_end_mark(wolof_sents, end_mark_to_remove='!')

pd.DataFrame({'french': french_sents, 'wolof': wolof_sents})


Unnamed: 0,french,wolof
0,Les enfants veulent venir et les adultes souha...,"Xale yi bëgg nañu dikk, te mag ni ñaan nañu ŋg..."
1,J'ai vu Monsieur Moussa.,Gis naa góor gi Musaa.
2,Tout village Sérère est propre.,Béppu dëkku Séeréer set na.
3,Vous êtes ceux-là.,Ñooñu ŋgeen.
4,Que veux-tu ?,Loo bëgg ?
5,Aucune dame ne s'est égarée.,Jenn jigéen réerul.
6,Donnes le livre à ce fils bien élevé.,Joxal téere bi doomu nit ku yaru kooku.
7,Qui d'autre veut partir ?,Keneen kan moo bëgg dem ?
8,"C'est Fatim, si on va au fond des choses.","Soo demee, Faatim la."
9,Tu n'avais pas été à l'intérieur.,Demuloowoon ca biir.


### Remove sentences with less than 3 tokens

Only the words interest us and we want to separate them by space or marks. Notice also that we want verify for each version and if one it is not accepted then the another one also.

In [11]:
def select_by_length(sentences: pd.DataFrame, min_len: int = 4):
    
    new_sentences = {'french': [], 'wolof': []}
    
    for i in list(sentences.index):
        
        sent1 = sentences.loc[i, 'french']
        
        sent2 = sentences.loc[i, 'wolof']
        
        tokens1 = re.findall('\w+', sent1)
        
        tokens2 = re.findall('\w+', sent2)
        
        take = True
        
        if len(tokens1) < min_len or len(tokens2) < min_len:
            
            take = False
        
        if take:
            
            new_sentences['french'].append(sent1)
            
            new_sentences['wolof'].append(sent2)
            
    return pd.DataFrame(new_sentences)

Let us make an example with the sample that we got earlier.

In [12]:
sample = select_by_length(sample, min_len=3)


In [13]:
sample

Unnamed: 0,french,wolof
0,Les enfants veulent venir et les adultes souha...,"Xale yi bëgg nañu dikk, te mag ni ñaan nañu ŋg..."
1,J'ai vu Monsieur Moussa.,Gis naa góor gi Musaa.
2,Tout village Sérère est propre !,Béppu dëkku Séeréer set na !
3,Aucune dame ne s'est égarée.,Jenn jigéen réerul.
4,Donnes le livre à ce fils bien élevé !,Joxal téere bi doomu nit ku yaru kooku !
5,Qui d'autre veut partir ?,Keneen kan moo bëgg dem ?
6,"C'est Fatim, si on va au fond des choses !","Soo demee, Faatim la !"
7,Tu n'avais pas été à l'intérieur,Demuloowoon ca biir
8,Qu'il attrappe quelles vaches ?,Mu japp nag yooyu yan ?
9,J'ai vu la maison de son ami.,Gis naa kër xaritam.


It only kept 26 samples. Let us make the same for all the sentences.

In [14]:
diagne_sentences = select_by_length(diagne_sentences, min_len=3)

corpora_v4 = select_by_length(corpora_v4, min_len=3)

Let us verify how many sentences we have kept.

In [15]:
# for diagne's sentences
diagne_sentences.shape[0]

1328

In [16]:
# in corpora v4
corpora_v4.shape[0]

2212

We removed $651$ sentences from the corpora v4 and $645$ sentences from the diagne's sentences.

Let us save the results to make again hyperparameter search with them.

In [17]:
# save the new version of the diagne's sentences in red_sentences.csv
diagne_sentences.to_csv('data/extractions/new_data/red_sentences.csv', index = False)

# save the new version of the corpora
corpora_v4.to_csv('data/extractions/new_data/corpora_v5.csv', index = False)

### Recuperate additional sentences 

In [18]:
corpora_v4 = pd.read_csv('data/extractions/new_data/corpora_v4.csv')

diagne_sentences = pd.read_csv('data/extractions/new_data/sentences.csv')

Let us recuperate additional sentences to augment the corpora.

- First file:

In [19]:
path = 'data/additional_documents/omniglot/sentences_french_wolof.txt'

with open(path, 'r', encoding='utf-8') as f:
    
    # recuperate lines
    omn_1 = f.readlines()
    
# recuperate sentences
omn_1_ = {
    'french': [],
    'wolof': []
}

for omn in omn_1:
    
    omn = omn.split('\t')
    
    sentences = [s.strip() for s in omn if s.strip() != '']
    
    omn_1_['french'].append(sentences[0])
    
    omn_1_['wolof'].append(sentences[1])

In [20]:
pd.options.display.max_rows = 74
omn_1_ = pd.DataFrame(omn_1_).head(74)

# print the first sentences from omniglot
omn_1_.head(74)

Unnamed: 0,french,wolof
0,Bienvenue,Merhbe
1,Salut,Na nga def
2,Salut,Na ngeen def
3,Salut,Salaam aleekum
4,Comment vas-tu ?,Jaam nga am ?
5,Es-tu en paix ?,Jaam nga am ?
6,Comment vas-tu ?,Na nga def?
7,"Paix seulement, et toi ?","Jaam rek, Yow nag ?"
8,"Je suis seulement toi, comment vas-tu ?","Mangi fi rekk, na nga def ?"
9,Ça fait longtemps que je ne t'ai pas vu,Gej na la giis


- Second file

In [21]:
path = 'data/additional_documents/omniglot/sentences_french_wolof_2.txt'

with open(path, 'r', encoding='utf-8') as f:
    
    # recuperate lines
    omn_2 = f.readlines()
    
# recuperate sentences
omn_2_ = {
    'french': [],
    'wolof': []
}

for omn in omn_2:
    
    omn = omn.split('–')
    
    sentences = [s.strip() for s in omn if s.strip() != '']
    
    omn_2_['wolof'].append(sentences[0])
    
    omn_2_['french'].append(sentences[1])

In [22]:
pd.options.display.max_rows = 74
omn_2_ = pd.DataFrame(omn_2_).head(74)

# print the first sentences from omniglot
omn_2_.head(74)

Unnamed: 0,french,wolof
0,Comment allez-vous ?,Na ngeen def ?
1,Bonne matinée.,Jaam nga fanane.
2,Bonne nuit.,Fanaanal jaam.
3,À une autre fois.,Ba beneen.
4,Je t'en prie.,Agsil.
5,Je vous en prie.,Agsileen ak jaam.
6,Pardon.,Baal ma.
7,Oui.,Wau.
8,Non.,Deh-det.
9,Es-tu en paix ?,Jaam nga am ?


- Third file

In [23]:
path = 'data/additional_documents/omniglot/sentences_french_wolof_3.txt'

with open(path, 'r', encoding='utf-8') as f:
    
    # recuperate lines
    omn_3 = f.readlines()
    
# recuperate sentences
omn_3_ = {
    'french': [],
    'wolof': []
}

for omn in omn_3:
    
    omn = omn.split('\t')
    
    sentences = [s.strip() for s in omn if s.strip() != '']
    
    omn_3_['wolof'].append(sentences[0])
    
    omn_3_['french'].append(sentences[1])
      
        

In [24]:
pd.options.display.max_rows = 74
omn_3_ = pd.DataFrame(omn_3_).head(74)

# print the first sentences from omniglot
omn_3_.head(74)

Unnamed: 0,french,wolof
0,Que la paix soit avec toi.,Salaam maaleekum.
1,Que la paix soit avec toi aussi.,Maaleekum salaam
2,Es-tu en paix ?,Jaama ngaam ?
3,Paix seulement.,Jamma rek.
4,Comment va ta famille ?,Ana waa kër gi ?
5,Où sont les membres de ta famille ?,Ana waa kër gi ?
6,Ils sont là-bas.,Ñunga fa.
7,J'espère qu'ils vont bien.,Mbaa defuñu dara.
8,Rien ne va mal avec eux.,Mbaa defuñu dara.
9,"Non, ils vont bien.","Déedéet, defuñu dara."


- Let us retrieve sentences, that was created by internet users in our following platform [addition_sentences_platform](https://french-wolof-translation.streamlit.app/), from our MongoDB Atlas Database.

In [25]:
# initialize the path to the csv files
sentences_path = 'data/additional_documents/hugging_face/wolof_french.csv'
deleted_path = 'data/additional_documents/hugging_face/deleted.csv'

# initialize the server uri and database name
uri = "mongodb+srv://oumar199:Jacksparrow360@woloftranslationcluster.u0gk7.mongodb.net/?retryWrites=true&w=majority"
# uri = "mongodb+srv://oumarkanetest:Jacksparrow360__@woloftranslationcluster.u0gk7.mongodb.net/?retryWrites=true&w=majority"
database_name = 'WolofTranslation'

# initialize the database manager 
database_manager = TranslationMongoDBManager(uri, database_name)


# save locally the data sets
database_manager.save_data_frames(sentences_path, deleted_path)

hug_sents = pd.read_csv(sentences_path)

In [26]:
# # print some documents inside the sentences collection using the uri
# def write_documents(uri):
        
#     from pymongo import MongoClient
    
#     # connect to the database
#     client = MongoClient(uri)
    
#     # get the database
#     db = client['WolofTranslation']
    
#     # get the sentences collection
#     collection = db['sentences']
    
#     # recuperate some documents
#     documents = collection.find()
    
#     # write the documents in a file
#     for doc in documents:
        
#         print(str(doc))
#         print(len(documents))
            
#     # close the connection
#     client.close()
    
# write_documents(uri)


In [27]:
pd.options.display.max_rows = 74

hug_sents.head(74)

Unnamed: 0,french,wolof
0,"J'en suis sûr, cette photo ci c'est la photo p...",Waaw nataal bii nataal la boob ay nit ñu baree...
1,J'arrive tout de suite chez toi.,Léegui léegui ma egg sa kër.
2,Je vois devant moi une photo sur laquelle beau...,Nataal bii maa ngi ciy janloog haa ay nit yu b...
3,Ceux-ci sont des personnes qui sont sortis pou...,"Lii, ay nit lañu yu génn di ñaxtu. Jëm yi nag ..."
4,"Sur la photo, ont voit des personnes qui se ré...",Nataal bii ñoo ngi ciy gis ay nit ñuy ñaxtu wa...
5,On voit sur la photo beaucoup de personnes sor...,Ñu gis ci nataal bi ay nit ñu bari ñu génn ci ...
6,"Je pense que la photo que je vois là, c'est un...","Man de nataal bii ma gis nii, nataal la boo xa..."
7,Bonjour ! Ceci est une photo. Sur cette photo ...,As salaamaalekum! lii de ab nataal la. Nataal ...
8,Ceci est une théière. Une théière dans la quel...,Lii nag baraada la. Baraada bu nu def attaaya ...
9,"Ce que j'ai en face de moi, sur la photo, c'es...",Li may jàkkarlool ci nataal bi nag moom benn k...


Let us add them into the diagne's sentences and the version 4 of the corpora and remove duplicated rows.

- On corpora version 4

In [28]:
corpora_v4 = pd.concat((corpora_v4, omn_1_, omn_2_, omn_3_, hug_sents), axis=0)

# drop duplicated rows
print(sum(corpora_v4.duplicated()))

7


In [29]:
corpora_v4.drop_duplicates(inplace=True)

7 rows were deleted.

- On corpora version 5

In [30]:
diagne_sentences = pd.concat((diagne_sentences, omn_1_, omn_2_, omn_3_, hug_sents), axis=0)

# drop duplicated rows
print(sum(diagne_sentences.duplicated()))

4


In [31]:
diagne_sentences.drop_duplicates(inplace=True)

4 rows were deleted.

### Web Scraping

This step involve to scrape some websites to obtain more sentences using `selenium` and `BeautifulSoup`. It will require to meticulously investigate each website and identify the elements to trigger and to extract. We will also need to identify the transformation to add to the different sentences. All the transformations, excepted the sentences' capitalization, are already implemented and can be directly used. 

We will scrape the following websites: 

- url1: [corporan.huma-num.fr](https://corporan.huma-num.fr/Lexiques/indexM.php?controle=afficheDico&action=main&ficXML=DicoWolof.xml&langue=Wolof)

In [54]:
url1 = "https://corporan.huma-num.fr/Lexiques/indexM.php?controle=afficheDico&action=main&ficXML=DicoWolof.xml&langue=Wolof"

#### Corporan.huma-num scraping

Let us add the necessary libraries.

In [32]:
from wolof_translate.utils.extract_new_sentences import NewSentenceExtraction
from wolof_translate.utils.sent_transformers import TransformerSequences
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from wolof_translate.utils.sent_corrections import *
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome import service
from selenium.webdriver.common.by import By
from urllib import request as rq
from selenium import webdriver
from bs4 import BeautifulSoup
from tqdm import tqdm
from typing import *
import pandas as pd

We need to identify the element containing the sentences. It is easy in that case since they are identified by their classes:

- The french sentences are identified by the `xe` class
- The wolof sentences by the `xv` class.

Since the website consists of a dictionary of word, we must click on one of them to display the sentences. The words' elements are not links but it use a ajax function. Then we must use trigger event to display the different pages. The words are represented by the `div` tag and each of them has a unique id which is simply the number of the word. We have `9965` words so the ids range from 1 to 9965.

Let us begin by initializing the drive. We will use the chrome driver in our case.

In [2]:
# initialize the option
options = webdriver.ChromeOptions()

# we will add the headless option which will not render the browser
options.add_argument('--headless')

# let us recuperate the service
s = service.Service('C:\Program Files\Chromedriver\chromedriver.exe')

# let us recuperate the driver
driver = webdriver.Chrome(service=s, options=options)



Let us request to go the website.

In [57]:
# send a get request to the website
driver.get(url1)

Let us initialize the classes.

In [33]:
# initialize the class map
class_map = {
    'xe': 'french',
    'xv': 'wolof'
}


Let us initialize the transformer.

In [34]:
transformer = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space)

Let us create a function which get recuperate the content and with beautiful soup recuperate the sentences. We will use an implicit wait at each time since we don't know what element to wait for: Not all pages contain sentences. We will wait 0.2 seconds before getting the sentences.

In [60]:
def get_sentences_from_page(driver: webdriver.chrome.webdriver.WebDriver, sentences: Dict[str, list], transformer: Union[TransformerSequences, None] = None, replaces: dict = {'…': '...'}, end_marks = ['.', '!', '?', '...']):
    
    # let us recuperate the content of the current page
    content = driver.page_source
    
    # recuperate the document object
    doc = BeautifulSoup(content, 'lxml')
    
    # recuperate the elements
    elements = {class__: doc.find_all('span', class_ = class__) for class__ in sentences}
    
    if elements != {key: [] for key in sentences}:
        
        # count the number of sentences that we got
        sent_count = len(elements[list(sentences)[0]])
        
        # for each group of sentences we will add some transformations
        for s in range(sent_count):
            
            # initialize the common end mark
            end_mark = None
            
            # recuperate the sentences and add to the dictionary of sentences
            for class__ in sentences:
                
                # recuperate the current sentence
                sentence = elements[class__][s].text.strip()
                
                # replace marks
                for replace in replaces:
                    
                    sentence = sentence.replace(replace, replaces[replace])
                
                if not sentence[-1] in end_marks:
                    
                    # add an end mark if the end mark is got from one of the previous sentences
                    if not end_mark is None:
                        
                        sentence += end_mark
                    
                    else: # add a period mark
                        
                        sentence += '.'
                
                else: # if an end mark is identified recuperated it as the end mark of the other sentences
                    
                    end_mark = sentence[-1]
                
                # capitalize the current sentence
                sentence = sentence[0].upper() + sentence[1:]
                
                # add transformations to the sentence
                if transformer: sentence = transformer(sentence)[0]
                
                # add the sentence to the dictionary of sentences
                sentences[class__].append(sentence)
            
    # return the sentences
    return sentences
    

In [None]:
# calculate the approximate number of minutes
0.5*9956/60

82.96666666666667

Initialize the sentences 🛑.

In [35]:
# recuperate the saved sentences
scraped_sentences = pd.read_csv('data/extractions/new_data/scraped_sents.csv').to_dict('list')

# initialize the sentences
sentences = {key: scraped_sentences[class_map[key]] for key in class_map}

# current_id
current_id = 9836

# initialize the ids
word_ids = list(range(current_id, 9966))

Let us recuperate the sentences by clicking recursively at each word.

In [None]:
# initialize the time sleep previously introduced
time_sleep = 0.6

# initialize a save step
save_step = 20

# recuperate the sentences of each word
for id in tqdm(word_ids):
        
    # click next to the word
    word = driver.find_element(By.ID, id)
    driver.execute_script('arguments[0].click();', word)
    
    # add a time sleep
    driver.implicitly_wait(time_sleep)
    
    # recuperate the sentences
    sentences = get_sentences_from_page(driver, sentences, transformer, replaces={'…': '...', '\xa0': ' ', 'qqn': "quelqu'un"})
    
    # save the sentences
    if id % save_step == 0:
        
        pd.DataFrame({value: sentences[key] for key, value in class_map.items()}).to_csv('data/extractions/new_data/scraped_sents.csv', index = False)
    
    

100%|██████████| 130/130 [03:41<00:00,  1.70s/it]


Not all sentences were recuperated. Let us make it again to recuperate additional sentences from the website.

Initialize the sentences 🛑.

In [68]:
# recuperate the saved sentences
scraped_sentences_2 = pd.read_csv('data/extractions/new_data/scraped_sents2.csv').to_dict('list')

# initialize the sentences
sentences_2 = {key: scraped_sentences_2[class_map[key]] for key in class_map}

# current_id
current_id = 300

# initialize the ids
word_ids = list(range(current_id, 9966))

Let us recuperate the sentences by clicking recursively at each word.

In [66]:
# initialize the time sleep previously introduced
time_sleep = 0.6

# initialize a save step
save_step = 20

# recuperate the sentences of each word
for id in tqdm(word_ids):
        
    # click next to the word
    word = driver.find_element(By.ID, id)
    driver.execute_script('arguments[0].click();', word)
    
    # add a time sleep
    driver.implicitly_wait(time_sleep)
    
    # recuperate the sentences
    sentences_2 = get_sentences_from_page(driver, sentences_2, transformer, replaces={'…': '...', '\xa0': ' ', 'qqn': "quelqu'un"})
    
    # save the sentences
    if id % save_step == 0:
        
        pd.DataFrame({value: sentences_2[key] for key, value in class_map.items()}).to_csv('data/extractions/new_data/scraped_sents2.csv', index = False)
    
    

100%|██████████| 9666/9666 [5:19:39<00:00,  1.98s/it]   


Let us take concatenate the two data frames and drop duplicated rows.

In [70]:
# concatenate the translations
scraped_sentences = pd.concat((pd.DataFrame(scraped_sentences), pd.DataFrame(scraped_sentences_2)))

In [72]:
# drop duplicated translations
scraped_sentences.drop_duplicates(inplace=True)

Let us verify how many translations we have.

In [74]:
# print the number of new translations
scraped_sentences.shape[0]

9388

Let us save the result.

In [76]:
scraped_sentences.to_csv('data/extractions/new_data/scraped_sents.csv', index=False)

We obtained `9388` translations from the first scraped website. Let us now make pre-processing on it. We will based on the following strings to make it:
- The parentheses containing 'ou...' 
- The slashes '/'
- The parentheses: We mostly want to delete them all
- Replace 'qqch' by 'quelque chose'
- Replace ';.' by '.'
- Replace 'œ' by 'oe'
- Identify the sentence with only ')' but not '('.

Let us create a function which can recuperate the sentences with regular expressions and another function which can replace an expression by another.

In [45]:
# %%writefile wolof-translate/wolof_translate/utils/re_extract.py
from typing import *
import re

def extract_with_re(translations: pd.DataFrame, columns: list = [], regex: Union[str, None] = None, recup_csv: str = 're_extract.csv'):
    
    # get all columns if no column is given
    if not columns: columns = list(translations.columns)
    
    for column in columns:
        
        assert column in set(translations.columns)
    
    # if not regular expression is given then return in the csv file all the sentences
    if not regex is None:
        
        # initialize the dictionary of the concerned sentences
        re_sents = {
            column: [] for column in columns
        }
        
        # at each index we will verify if a column respect the regular expression
        for i in range(translations.shape[0]):
            
            # make verification for each column
            re_true = False
            
            for column in columns:
                
                if re.match(regex, translations.loc[i, column]):
                    
                    re_true = True
                
            # add the line inside the set of translations to investigate
            if re_true:
                
                for column in columns:
                        
                        re_sents[column].append(translations.loc[i, column])
                
                # remove the sentences from the DataFrame
                translations.drop(index=[i], inplace=True)
            
        # save the resulted sentences in order to modify (it will wait until you finish the modifications)
        re_sents = pd.DataFrame(re_sents)
        
        print("You recuperated the following sentences:")
        print(re_sents)
        
        re_sents.to_csv(recup_csv, index=False)
        
        alert_ = ''
        
        while True:
            
            finish = input(f"{alert_}Provide Yes(y) if you finish and No(n) if not yet!")
            
            if finish == 'y':
                
                break
            
            elif finish != 'n':
                
                alert_ = "You cannot provide a command which is different from y (for Yes) and n (No)"
            
        # recuperate the modifications
        re_sents_2 = pd.read_csv(recup_csv)
        
        # add the sentences inside the original data frame and return the results
        translations = pd.concat((re_sents_2, translations))
        
        translations.index = range(translations.shape[0])
        
        return translations, re_sents_2
    
    return None
    
def replace_expression(translations: pd.DataFrame, columns: list = [], regex: Union[str, None] = None, new_expression: str = ''):
    
    # get all columns if no column is given
    if not columns: columns = list(translations.columns)
    
    for column in columns:
        
        assert column in set(translations.columns)
    
    # if not regular expression is given then return in the csv file all the sentences
    if not regex is None:
        
        # initialize the dictionary of the concerned sentences
        re_sents = {
            column: [] for column in columns
        }
        
        # at each index we will verify if a column respect the regular expression before making replacement
        for i in range(translations.shape[0]):
            
            for column in columns:
                
                expressions = re.findall(regex, translations.loc[i, column])
                
                if expressions != []:
                
                    translations.loc[i, column] = re.sub(regex, new_expression, translations.loc[i, column]).strip()

                    for column in columns:
                        
                        re_sents[column].append(translations.loc[i, column])
            
        # save the resulted sentences in order to modify (it will wait until you finish the modifications)
        re_sents = pd.DataFrame(re_sents)
        
        print("You recuperated the following sentences:")
        print(re_sents)
    
        return translations, re_sents
    
    return None

- Let us recuperate all translations containing parentheses with 'ou' and verify which of them will need to separated as multiple translations.

In [21]:
scraped_sentences, re_sents = extract_with_re(pd.DataFrame(scraped_sentences), regex='.*\(ou.*\).*')

You recuperated the following sentences:
                                               french                                      wolof
0   Je m'occupe de toi; je suis à toi. (ou encore)...                            Maa ngi ci yow.
1               C'est bien (ou bien) d'accord; ça va.                                   Baax na.
2   Il est bel et bien au courant (ou bien) il l'a...                    Yég na ko ba bëgga dee.
3                C'est pareil (ou bien) c'est unique.                                   Benn la.
4    Ils sont semblables (ou bien) ils ne font qu'un.                                 Benn lañu.
..                                                ...                                        ...
27         Être soupe au lait (ou bien) être frustré.                                  Tàng xol.
28                                       De bon cœur.  Ci xol bu sedd (ou bien) ci xol bu tàlli.
29       Vous êtes égaux (ou bien) vous êtes pareils.                                 

- Let us recuperate all translations containing slashes.

In [34]:
scraped_sentences, re_sents = extract_with_re(scraped_sentences, regex='.*(/).*')


You recuperated the following sentences:
                                              french                                              wolof
0  Qu'est-ce qui te prend ? / qu'est-ce qui t'arr...                                 (loc.) Lu la dal ?
1  Fais des boulettes plus grosses / fais davanta...                                    Yokkal dank yi.
2    Oh la, Omar ! / Arrête, Omar ! (Tu as fini ?) .                                         Eey, Omar.
3  J'en ai plus que toi / j'en ai un plus grand (...                                      Maa la ëpple.
4                Le voilà ! / Voilà / nous-y voilà !                                  (Lov.)Mu ngoogu !
5  Celui qui meurt au marché fait lui-même son an...  Ku dee marse, yaa tàgge sa bopp // ku dee ca j...
6  Il n'a pas mangé / il n'est pas en train de ma...                                            Lekkul.
7  Je n'ai pas mangé / je ne suis pas en train de...                                 Lekkuma lekkul ma.


- Separate in multiple ones the translations with parentheses

In [37]:
scraped_sentences, re_sents = extract_with_re(scraped_sentences, regex='.*\(.*\).*')

You recuperated the following sentences:
                                                french                                              wolof
0        Ces touffes (de cheveux), c'est leur coutume.                            Jubb yii, seen aada la.
1    Avant de tirer sur une bête, je vois d'abord s...  Bala may fital aw rab, dinaa xool ndax mu ngi ...
2       L'interdiction est temporaire (ne durera pas).                                   Aaye bi du yàgg.
3    Il prête sa charrette (à qui veut) uniquement ...  Li muy abalaate saretam bi yépp dey, bëgga fal...
4                      Les enfants sont arrivés (ici).                                 Xale yi agsi nañu.
..                                                 ...                                                ...
167  J'ai appelé ce village Njar-Meew (lait coupé) ...  Lu naqari la mu ma doon def, di ma ko yéjje, m...
168  Il dit qu'on a libéré les invités (ils peuvent...                           Mu ne yewwi nañu gan yi.
169  

- Delete parentheses

In [38]:
scraped_sentences, re_sents = replace_expression(scraped_sentences, regex='\(.*\)')

You recuperated the following sentences:
                                                french                                              wolof
0                    Ces touffes , c'est leur coutume.                            Jubb yii, seen aada la.
1    Avant de tirer sur une bête, je vois d'abord s...  Bala may fital aw rab, dinaa xool ndax mu ngi ...
2                      L'interdiction est temporaire .                                   Aaye bi du yàgg.
3     Il prête sa charrette  uniquement pour être élu.  Li muy abalaate saretam bi yépp dey, bëgga fal...
4                           Les enfants sont arrivés .                                 Xale yi agsi nañu.
..                                                 ...                                                ...
168                 J'ai appelé ce village Njar-Meew .  Lu naqari la mu ma doon def, di ma ko yéjje, m...
169                Il dit qu'on a libéré les invités .                           Mu ne yewwi nañu gan yi.
170  

- Replace 'qqch' by 'quelque chose'

In [47]:
scraped_sentences, re_sents = replace_expression(scraped_sentences, regex='qqch', new_expression='quelque chose')


You recuperated the following sentences:
                                               french                                              wolof
0   Il y a quelque chose  dans mes habits; tiens l...  Am na luy dox ci sama yére yi; téyeel ma xale ...
1   Dis-moi quelque chose qui puisse apporter la p...         Wax ma lu mëna amal jàmm ci sunu réew mi !
2   Je pensais qu'il m'appelait pour quelque chose...   Dama defe woon ne lu am-maanaa la ma doon wooye.
3   Le réparateur est arrivé. Qui aurait quelque c...        Baay-jagal dikk na. Ku amoon looy defarlu ?
4   Ses propos ont été brefs; lui aurais-tu fait q...       Waxam ji dafa bar; xanaa danga ko def dara ?
..                                                ...                                                ...
42  Quand on fait quelque chose, il est normal d'e...                            Kuy xalam, di ci jaayu.
43  Quand on reçoit quelque chose d'une personne q...  Bu la ku xér di may, sa xol sedd; waaye bu nge...
44  Ce dont on

- Replace ';.' by '.'

In [49]:
scraped_sentences, re_sents = replace_expression(scraped_sentences, regex=';\.', new_expression='.')

You recuperated the following sentences:
                                                french                                              wolof
0    Il y a quelque chose  dans mes habits.tiens l'...  Am na luy dox ci sama yére yi; téyeel ma xale ...
1    Il y a quelque chose  dans mes habits.tiens l'...  Am na luy dox ci sama yére yi.téyeel ma xale b...
2    Ses propos ont été brefs.lui aurais-tu fait qu...       Waxam ji dafa bar; xanaa danga ko def dara ?
3    Ses propos ont été brefs.lui aurais-tu fait qu...        Waxam ji dafa bar.xanaa danga ko def dara ?
4    Ça, il faudra tout faire pour l'avoir, c'est q...  Lii dey, nanga ci gar sa bakkan.lu lay amal nj...
..                                                 ...                                                ...
443           Il est fâché.il ne veut même pas manger.                       Dafa aru; nanguwul sax lekk.
444           Il est fâché.il ne veut même pas manger.                        Dafa aru.nanguwul sax lekk.
445  

In [53]:
# function created to add correction 
# def correction(text: str):
    
#     charac = ''
    
#     for i in range(len(text[:-1])):
        
#         letter = text[i]
        
#         if letter == '.' and re.match('[a-zA-Z]', text[i+1]):
            
#             charac = charac + '; '
        
#         else:
            
#             charac = charac + letter
        
#     return charac + text[-1]

- Replace 'œ' by 'oe'

In [58]:
scraped_sentences, re_sents = replace_expression(scraped_sentences, regex='œ', new_expression='oe')


You recuperated the following sentences:
                                               french                                       wolof
0   Une femme de moeurs légères ne se marie pas fa...                 Jigéen juy fo du yomba séy.
1                            Elle est ma jeune soeur.                               Sama rakk la.
2                                       De bon coeur.                            Ci xol bu tàlli.
3                                       De bon coeur.                             Ci xol bu sedd.
4                      Chante, nous ferons le choeur.                           Woyal, nu awu la.
..                                                ...                                         ...
93              Ton souhait ne te vient pas du coeur.                Sa yéene ji àggul ci sa xol.
94                           Coupe le jusqu'au coeur.                   Gor ko ba àgg ca yeññ wa.
95  Va jeter un coup d'oeil pour voir s'il est rév...            Demal yër ba

- Identify translations with only a closing parentheses

In [59]:
scraped_sentences, re_sents = extract_with_re(scraped_sentences, regex='.*\).*')


You recuperated the following sentences:
                                               french                                              wolof
0   La marmite, si elle doit être délicieuse, quan...                Cin, su naree neex, bu baxee, xeeñ.
1        L'homme et ses désirs, Dieu et sa décision).              Nit ak bëgg-bëggam, Yàlla ak dogalam.
2   C'est avec l'esprit avec lequel tu as creusé l...     Xel mi nga gase teen bi la ci keneen di naane.
3   Une langue ne mange pas son semblable, si on n...           Làmmiñ du lekk morom ma, te toggeesu ko.
4   Si ton talent de cavalier te pousse à monter s...  Su la mën ngawar jayee ba nga mafñàndum saaw, ...
..                                                ...                                                ...
26                      Ton corps est-il en paix ?) .                            Sa yaram jàmm bonjour !
27  Généreux, avec tes biens) sois généreux, mais ...                               Yéwén : ci sa alal !
28  Qui a broy

Let us save the sentences and delete the space between a point or a comma and the word behind them. 

In [60]:
scraped_sentences.to_csv('data/extractions/new_data/scraped_sents.csv', index=False)

Let us add the scraped translations inside the diagne's translations and the corpora (version 4).

In [39]:
shape_1 = diagne_sentences.shape[0]

In [40]:
# inside diagne's sentences
diagne_sentences = pd.concat((diagne_sentences, pd.DataFrame(scraped_sentences)))

# inside the corpora
corpora_v4 = pd.concat((corpora_v4, pd.DataFrame(scraped_sentences)))

# inside diagne's sentences (we want 3000 translations for diagne's sentences)
# diagne_sentences = pd.concat((diagne_sentences, pd.DataFrame(scraped_sentences).sample(3000 - shape_1)))


In [41]:
diagne_sentences.index = range(diagne_sentences.shape[0])
corpora_v4.index = range(corpora_v4.shape[0])

Let us add the scraped sentences inside a new collection and add it also the translations from the already existing collection.

In [13]:
scraped_sentences = pd.DataFrame(scraped_sentences)

database_manager.insert_documents([{
        '_id': id,
        'french': scraped_sentences.loc[id, 'french'],
        'wolof': scraped_sentences.loc[id, 'wolof'],
    } for id in scraped_sentences.index], 'sentences_plus')

<pymongo.results.InsertManyResult at 0x1a0edba7670>

Let us insert inside the collection the translations from the sentences collection.

In [16]:
hug_sents.index = range(scraped_sentences.shape[0], scraped_sentences.shape[0] + hug_sents.shape[0])
database_manager.insert_documents([{
        '_id': id,
        'french': hug_sents.loc[id, 'french'],
        'wolof': hug_sents.loc[id, 'wolof'],
    } for id in hug_sents.index], 'sentences_plus')

<pymongo.results.InsertManyResult at 0x1a0f07cb760>

---------------

Let us consider only the fourth end mark adding's method and verify if we have duplicated rows.

- Corpora v4

In [36]:
french_sents = corpora_v4['french'].to_list()

wolof_sents = corpora_v4['wolof'].to_list()

french_sents = add_end_mark(french_sents, end_mark_to_remove='!')

wolof_sents = add_end_mark(wolof_sents, end_mark_to_remove='!')

corpora_v4_end = pd.DataFrame({'french': french_sents, 'wolof': wolof_sents})

In [37]:
duplicated = corpora_v4_end.duplicated()

sum(duplicated)

5

We obtain 5 duplicated rows. Let us recuperate their indexes.

In [38]:
dup_indexes = duplicated[duplicated == True].index

In [39]:
dup_indexes

Int64Index([1428, 1597, 2241, 2243, 2367], dtype='int64')

Let us remove those indexes from the corpora.

In [40]:
corpora_v4.drop(index = dup_indexes, inplace=True)

- Diagne's sentences

In [41]:
french_sents = diagne_sentences['french'].to_list()

wolof_sents = diagne_sentences['wolof'].to_list()

french_sents = add_end_mark(french_sents, end_mark_to_remove='!')

wolof_sents = add_end_mark(wolof_sents, end_mark_to_remove='!')

diagne_end = pd.DataFrame({'french': french_sents, 'wolof': wolof_sents})

In [42]:
duplicated = diagne_end.duplicated()

sum(duplicated)

5

We obtain 5 duplicated rows. Let us recuperate their indexes.

In [43]:
dup_indexes = duplicated[duplicated == True].index

In [44]:
dup_indexes

Int64Index([546, 715, 1359, 1361, 1485], dtype='int64')

Let us remove those indexes from the corpora.

In [45]:
diagne_sentences.drop(index = dup_indexes, inplace=True)

Apply transformations on all sentences.

In [46]:
# apply the transformations
def correction(text: str):
    
    text = transformer(text)[0]
    
    return text

# add corrections to the translations
diagne_sentences['french'] = diagne_sentences['french'].apply(correction)
diagne_sentences['wolof'] = diagne_sentences['wolof'].apply(correction)

corpora_v4['french'] = corpora_v4['french'].apply(correction)
corpora_v4['wolof'] = corpora_v4['wolof'].apply(correction)

# print clearly the results
diagne_sentences.head(74)


Unnamed: 0,french,wolof
0,Il n'a pas encore été,Demagul
1,Il n'a pas encore été voir,Seeteegul
2,Il ne s'est pas encore substitué à,Wuutóogul
3,Il n'a pas encore pris de contacts,Giséegul
4,Homme n'est pas mauvais !,Nit bonul !
5,L'homme n'est pas mauvais !,Nit bonul !
6,Quelqu'un est parti !,Nit dem na !
7,Homme est parti !,Nit dem na !
8,Un homme est parti !,Nit dem na !
9,J'ai vu lion.,Gis naa gaynde.


In [47]:
corpora_v4.head(74)

Unnamed: 0,french,wolof
0,Tout être humain est le résultat d'un père et ...,"Doomu-aadama bu, ne ci ndey ak baay nga jóge."
1,"On peut ne pas les reconnaître, ne pas les aim...","Mënunu leen a baň a gërëm ak a bëgg, doonte sa..."
2,"Mais ils sont là, avec leur visage, leurs atti...","Waaye ňu ngi fi, ak seen xar-kanam, seen taxaw..."
3,J'ai longtemps rêvé que ma mère était noire. J...,"Bi ma delloo dëkk ba ma juddoo, dama faa meloo..."
4,"Puis j'ai découvert, lorsque mon père, à l'âge...","Àddinay dox ba Baay tollu ci noppalug liggéey,..."
5,Cela a été difficile à admettre.,Mu doon nag lu naqadee nangu.
6,"Il m'a fallu retourner en arrière, recommencer...",Damaa mujjoon a delloo sama xel démb ngir lijj...
7,"En souvenir de cela, j'ai écrit ce petit livre.",Kon fàttalikoo meññ téere bu ndaw bii.
8,"De ce visage que j'ai reçu à ma naissance, j'a...","Kanam gii ma judduwaale, am na lu bari lu ma c..."
9,"D'abord, qu'il m'a fallu l'accepter.",Li ci jiitu moo di ne dama dem ba jàppe ko nat...


Let us save the results.

In [48]:
# save the results
diagne_sentences.to_csv('data/extractions/new_data/ad_sentences.csv', index=False)
corpora_v4.to_csv('data/extractions/new_data/corpora_v6.csv', index=False)

Let us verify how many sentences we got for the moment.

In [49]:
# from diagne's book
diagne_sentences.shape[0]

1979

In [50]:
# inside the corpora
corpora_v4.shape[0]

2861

We recuperated `11435` sentences from the diagne's book and the final corpora contains `12317` sentences.

The training can take a long time to finish. We can take only a sample in place of all the translations. Let us take `5000` samples to begin and a seed of `0` for the generator.

In [51]:
# # recuperate the sentences
# diagne_sentences = pd.read_csv('data/extractions/new_data/ad_sentences.csv')
# corpora_v4 = pd.read_csv('data/extractions/new_data/corpora_v6.csv')

In [112]:
diagne_sentences_s = diagne_sentences.sample(5000)
corpora_v4_s = corpora_v4.sample(5000)

Let us save again the translations.

In [None]:
diagne_sentences_s.to_csv('data/extractions/new_data/ad_sentences.csv', index=False)
corpora_v4_s.to_csv("data/extractions/new_data/corpora_v6.csv", index=False)