Improve sentences
----------------------------

In this notebook we want to improve the corpora of new sentences and the final corpora, which is the concatenation of the original corpora and the new one (both were produced in the notebook [new_sentences](extract_new_sentences.ipynb)). 

For the improvement we want to make the following modifications that we found to be necessary after fine-tuning the `t5` model on the new sentences:

- Remove the definitions: We had already extracted the new sentences without the definitions, so we only need to load the "true sentences".
- Replace exclamation marks (!) at the end of the phrases by period marks (.): This will help the transformers to identify the end of a sentence. Note that we will do this for both the new sentences and the latest version of the corpora (v4). We cannot know in advance whether this improves performance, so we will write a function, include it as an option in the hyperparameter search, and reduce the search space if necessary.
- Remove sentence pairs with fewer than 3 tokens in either the French or the Wolof version. This will reduce the number of sentences but should favor well-formed sentences.

Applying the above modifications to version 4 of the corpora and to Diagne's sentences produces version 5 of the corpora. An additional version of Diagne's sentences and version 6 of the corpora will also include sentences taken from `omniglot`. We will not filter the sentences by length for these new versions.

All the modifications can be done with the `pandas` library.

In [1]:
import re
import pandas as pd
from tokenizers.pre_tokenizers import Whitespace
from wolof_translate.utils.database_manager import TranslationMongoDBManager

### Remove the definitions

Let us load the two corpora.

In [2]:
corpora_v4 = pd.read_csv('data/extractions/new_data/corpora_v4.csv')

diagne_sentences = pd.read_csv('data/extractions/new_data/sentences.csv')

In [3]:
corpora_v4.shape[0], diagne_sentences.shape[0]

(2500, 1615)

### Add periods to the end of the sentences

We have three possibilities:

- We can replace the exclamation marks with period marks (option 2)
- ... add a period mark at the end of each sentence that has no end mark (option 3)
- ... apply the two previous modifications at the same time (option 4)

The following function will be added to the `TransformerSequences` class and applied when the sentences are retrieved before training.

In [4]:
%%writefile wolof-translate/wolof_translate/utils/improvements/end_marks.py

import pandas as pd
from typing import *

def add_end_mark(sentences: Union[list, str], end_mark: str = ".", end_mark_to_remove: Union[str, None] = None, 
                 replace: bool = False, poss_end_marks: list = ['!', '?', '.', '...']):
    
    if isinstance(sentences, str): sentences = [sentences]
    
    if replace and end_mark_to_remove is None:
        
        raise ValueError("You must provide an end mark to remove if you want to make a replacement!")
    
    new_sentences = []
    
    # if replace is True, swap the end mark to remove for the new end mark;
    # otherwise strip the end mark to remove (if any) and append the end mark to sentences that lack one
    for sentence in sentences:
        
        if replace:
            
            if sentence[-1] == end_mark_to_remove:
                
                sentence = sentence[:-1].strip() + end_mark
            
        else:
            
            if not end_mark_to_remove is None and sentence[-1] == end_mark_to_remove:
                
                sentence = sentence[:-1]
            
            if not sentence[-1] in poss_end_marks: 
                
                sentence = sentence.strip() + end_mark
        
        new_sentences.append(sentence)
    
    return new_sentences
    

Overwriting wolof-translate/wolof_translate/utils/improvements/end_marks.py


Let us try the different ways of adding end marks with the above function.

In [5]:
%run wolof-translate/wolof_translate/utils/improvements/end_marks.py

In [6]:
# recuperate some sentences from the diagne's sentences
sample = diagne_sentences.sample(30)

In [7]:
sample

Unnamed: 0,french,wolof
959,"Celui-là, lui, il refusera !","Kookule, moom, du naŋgu !"
782,Nul n'a été,Kenn demul
693,"Ainsi, l'homme ne veut pas partir","Ba, góor gi bëggul dem"
1462,L'homme a frappé la vache avec un bâton.,Nit ki dóor na nag wi ak bant.
385,Personne n'était venu,Yéena dikkulwoon
828,Crie !,Nil : wóoy !
1551,"Alors l'homme entra, les enfants le virent, il...","Noona góor gi dugg, xale yi gis ka, mu toog, ñ..."
41,Tu connais cet homme-là ?,Xam ŋga nit kookuu ?
1423,"Tout ce verbiage, c'est pour que tu ne viennes...","Wax ji yépp, bañ-ŋga-ñëw la."
424,S'il est parti,Su demee


We need to apply each modification to both language versions of the sentences.

- Replace exclamation marks with period marks.

In [8]:
french_sents = sample['french'].to_list()

wolof_sents = sample['wolof'].to_list()

french_sents = add_end_mark(french_sents, end_mark_to_remove='!', replace=True)

wolof_sents = add_end_mark(wolof_sents, end_mark_to_remove='!', replace=True)

pd.DataFrame({'french': french_sents, 'wolof': wolof_sents})


Unnamed: 0,french,wolof
0,"Celui-là, lui, il refusera.","Kookule, moom, du naŋgu."
1,Nul n'a été,Kenn demul
2,"Ainsi, l'homme ne veut pas partir","Ba, góor gi bëggul dem"
3,L'homme a frappé la vache avec un bâton.,Nit ki dóor na nag wi ak bant.
4,Personne n'était venu,Yéena dikkulwoon
5,Crie.,Nil : wóoy.
6,"Alors l'homme entra, les enfants le virent, il...","Noona góor gi dugg, xale yi gis ka, mu toog, ñ..."
7,Tu connais cet homme-là ?,Xam ŋga nit kookuu ?
8,"Tout ce verbiage, c'est pour que tu ne viennes...","Wax ji yépp, bañ-ŋga-ñëw la."
9,S'il est parti,Su demee


- Add end marks at each sentence without replacing or removing

In [9]:
french_sents = sample['french'].to_list()

wolof_sents = sample['wolof'].to_list()

french_sents = add_end_mark(french_sents)

wolof_sents = add_end_mark(wolof_sents)

pd.DataFrame({'french': french_sents, 'wolof': wolof_sents})


Unnamed: 0,french,wolof
0,"Celui-là, lui, il refusera !","Kookule, moom, du naŋgu !"
1,Nul n'a été.,Kenn demul.
2,"Ainsi, l'homme ne veut pas partir.","Ba, góor gi bëggul dem."
3,L'homme a frappé la vache avec un bâton.,Nit ki dóor na nag wi ak bant.
4,Personne n'était venu.,Yéena dikkulwoon.
5,Crie !,Nil : wóoy !
6,"Alors l'homme entra, les enfants le virent, il...","Noona góor gi dugg, xale yi gis ka, mu toog, ñ..."
7,Tu connais cet homme-là ?,Xam ŋga nit kookuu ?
8,"Tout ce verbiage, c'est pour que tu ne viennes...","Wax ji yépp, bañ-ŋga-ñëw la."
9,S'il est parti.,Su demee.


- Remove exclamation marks and add period marks

In [10]:
french_sents = sample['french'].to_list()

wolof_sents = sample['wolof'].to_list()

french_sents = add_end_mark(french_sents, end_mark_to_remove='!')

wolof_sents = add_end_mark(wolof_sents, end_mark_to_remove='!')

pd.DataFrame({'french': french_sents, 'wolof': wolof_sents})


Unnamed: 0,french,wolof
0,"Celui-là, lui, il refusera.","Kookule, moom, du naŋgu."
1,Nul n'a été.,Kenn demul.
2,"Ainsi, l'homme ne veut pas partir.","Ba, góor gi bëggul dem."
3,L'homme a frappé la vache avec un bâton.,Nit ki dóor na nag wi ak bant.
4,Personne n'était venu.,Yéena dikkulwoon.
5,Crie.,Nil : wóoy.
6,"Alors l'homme entra, les enfants le virent, il...","Noona góor gi dugg, xale yi gis ka, mu toog, ñ..."
7,Tu connais cet homme-là ?,Xam ŋga nit kookuu ?
8,"Tout ce verbiage, c'est pour que tu ne viennes...","Wax ji yépp, bañ-ŋga-ñëw la."
9,S'il est parti.,Su demee.
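
As a complement, the same three options can be sketched with vectorized `pandas` string methods. This is only an illustrative alternative, not the function used in the training pipeline: the helper name `end_marks_vectorized` and its `option` argument are assumptions, and the sketch assumes the column contains no missing values.

```python
import pandas as pd

def end_marks_vectorized(col: pd.Series, option: int) -> pd.Series:
    """Apply option 2, 3 or 4 of the end-mark strategies to a column of sentences."""
    col = col.str.strip()
    if option in (2, 4):
        # replace a trailing "!" (and any spaces before it) with a period
        col = col.str.replace(r"\s*!$", ".", regex=True)
    if option in (3, 4):
        # append a period only to sentences that do not end with ".", "!" or "?"
        needs_mark = ~col.str.contains(r"[.!?]$", regex=True)
        col = col.where(~needs_mark, col + ".")
    return col

# hand-picked sentences from the sample above
examples = pd.Series(["Crie !", "Nul n'a été", "Tu connais cet homme-là ?"])
for option in (2, 3, 4):
    print(option, end_marks_vectorized(examples, option).tolist())
```

Question marks are left untouched by every option, which matches the outputs of `add_end_mark` shown above.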


### Remove sentences with fewer than 3 tokens

We are only interested in the words, which we separate on spaces and punctuation marks. Note also that we check both versions of a sentence: if either one is too short, the whole pair is discarded.
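
To make the token count concrete, here is a minimal sketch of how `re.findall` with the `\w+` pattern splits a sentence. The helper name `count_tokens` is hypothetical; the sentences come from the sample shown earlier.

```python
import re

def count_tokens(sentence: str) -> int:
    """Count runs of word characters; spaces, apostrophes and punctuation act as separators."""
    return len(re.findall(r"\w+", sentence))

for sentence in ["Crie !", "Su demee", "S'il est parti"]:
    print(sentence, "->", re.findall(r"\w+", sentence), count_tokens(sentence))
```

Note that the apostrophe splits `S'il` into two tokens, so French elisions count as separate words.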

In [11]:
def select_by_length(sentences: pd.DataFrame, min_len: int = 4):
    
    new_sentences = {'french': [], 'wolof': []}
    
    for i in list(sentences.index):
        
        sent1 = sentences.loc[i, 'french']
        
        sent2 = sentences.loc[i, 'wolof']
        
        tokens1 = re.findall(r'\w+', sent1)
        
        tokens2 = re.findall(r'\w+', sent2)
        
        take = True
        
        if len(tokens1) < min_len or len(tokens2) < min_len:
            
            take = False
        
        if take:
            
            new_sentences['french'].append(sent1)
            
            new_sentences['wolof'].append(sent2)
            
    return pd.DataFrame(new_sentences)
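
For larger corpora, the same filter can be sketched without an explicit loop by counting tokens with `Series.str.count`. The function name `select_by_length_vectorized` is an assumption made for illustration, and the example rows are hand-picked from the outputs in this notebook.

```python
import pandas as pd

def select_by_length_vectorized(sentences: pd.DataFrame, min_len: int = 3) -> pd.DataFrame:
    """Keep the pairs whose French and Wolof versions both have at least `min_len` tokens."""
    french_len = sentences["french"].str.count(r"\w+")
    wolof_len = sentences["wolof"].str.count(r"\w+")
    keep = (french_len >= min_len) & (wolof_len >= min_len)
    return sentences.loc[keep, ["french", "wolof"]].reset_index(drop=True)

pairs = pd.DataFrame({
    "french": ["Crie !", "S'il est parti", "L'homme est unique."],
    "wolof": ["Nil : wóoy !", "Su demee", "Góor gi kenn la."],
})
kept = select_by_length_vectorized(pairs)
print(kept)
```

The second pair is dropped even though its French version is long enough, because its Wolof version has only two tokens.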

Let us make an example with the sample that we got earlier.

In [12]:
sample = select_by_length(sample, min_len=3)


In [13]:
sample

Unnamed: 0,french,wolof
0,"Celui-là, lui, il refusera !","Kookule, moom, du naŋgu !"
1,"Ainsi, l'homme ne veut pas partir","Ba, góor gi bëggul dem"
2,L'homme a frappé la vache avec un bâton.,Nit ki dóor na nag wi ak bant.
3,"Alors l'homme entra, les enfants le virent, il...","Noona góor gi dugg, xale yi gis ka, mu toog, ñ..."
4,Tu connais cet homme-là ?,Xam ŋga nit kookuu ?
5,"Tout ce verbiage, c'est pour que tu ne viennes...","Wax ji yépp, bañ-ŋga-ñëw la."
6,L'homme est unique.,Góor gi kenn la.
7,C'est un messager que cherche l'homme.,Ndaw la góor gi di wut.
8,L'enfant n'a rien donné à celui-ci.,Xale bi mayul kii dara.
9,Cherche une autre chose !,Seetal lëf leneen !


It only kept 26 samples. Let us do the same for all the sentences.

In [14]:
diagne_sentences = select_by_length(diagne_sentences, min_len=3)

corpora_v4 = select_by_length(corpora_v4, min_len=3)

Let us verify how many sentences we have kept.

In [15]:
# for diagne's sentences
diagne_sentences.shape[0]

1328

In [16]:
# in corpora v4
corpora_v4.shape[0]

2212

We removed $288$ sentences from the corpora v4 ($2500 - 2212$) and $287$ sentences from Diagne's sentences ($1615 - 1328$).

Let us save the results so that we can run the hyperparameter search again with them.

In [17]:
# save the new version of the diagne's sentences in red_sentences.csv
diagne_sentences.to_csv('data/extractions/new_data/red_sentences.csv', index = False)

# save the new version of the corpora
corpora_v4.to_csv('data/extractions/new_data/corpora_v5.csv', index = False)

### Retrieve additional sentences

In [18]:
corpora_v4 = pd.read_csv('data/extractions/new_data/corpora_v4.csv')

diagne_sentences = pd.read_csv('data/extractions/new_data/sentences.csv')

Let us retrieve additional sentences to augment the corpora.

- First file:

In [19]:
path = 'data/additional_documents/omniglot/sentences_french_wolof.txt'

with open(path, 'r', encoding='utf-8') as f:
    
    # recuperate lines
    omn_1 = f.readlines()
    
# recuperate sentences
omn_1_ = {
    'french': [],
    'wolof': []
}

for omn in omn_1:
    
    omn = omn.split('\t')
    
    sentences = [s.strip() for s in omn if s.strip() != '']
    
    omn_1_['french'].append(sentences[0])
    
    omn_1_['wolof'].append(sentences[1])

In [20]:
pd.options.display.max_rows = 74
omn_1_ = pd.DataFrame(omn_1_).head(74)

# print the first sentences from omniglot
omn_1_.head(74)

Unnamed: 0,french,wolof
0,Bienvenue,Merhbe
1,Salut,Na nga def
2,Salut,Na ngeen def
3,Salut,Salaam aleekum
4,Comment vas-tu ?,Jaam nga am ?
5,Es-tu en paix ?,Jaam nga am ?
6,Comment vas-tu ?,Na nga def?
7,"Paix seulement, et toi ?","Jaam rek, Yow nag ?"
8,"Je suis seulement toi, comment vas-tu ?","Mangi fi rekk, na nga def ?"
9,Ça fait longtemps que je ne t'ai pas vu,Gej na la giis


- Second file

In [21]:
path = 'data/additional_documents/omniglot/sentences_french_wolof_2.txt'

with open(path, 'r', encoding='utf-8') as f:
    
    # recuperate lines
    omn_2 = f.readlines()
    
# recuperate sentences
omn_2_ = {
    'french': [],
    'wolof': []
}

for omn in omn_2:
    
    omn = omn.split('–')
    
    sentences = [s.strip() for s in omn if s.strip() != '']
    
    omn_2_['wolof'].append(sentences[0])
    
    omn_2_['french'].append(sentences[1])

In [22]:
pd.options.display.max_rows = 74
omn_2_ = pd.DataFrame(omn_2_).head(74)

# print the first sentences from omniglot
omn_2_.head(74)

Unnamed: 0,french,wolof
0,Comment allez-vous ?,Na ngeen def ?
1,Bonne matinée.,Jaam nga fanane.
2,Bonne nuit.,Fanaanal jaam.
3,À une autre fois.,Ba beneen.
4,Je t'en prie.,Agsil.
5,Je vous en prie.,Agsileen ak jaam.
6,Pardon.,Baal ma.
7,Oui.,Wau.
8,Non.,Deh-det.
9,Es-tu en paix ?,Jaam nga am ?


- Third file

In [23]:
path = 'data/additional_documents/omniglot/sentences_french_wolof_3.txt'

with open(path, 'r', encoding='utf-8') as f:
    
    # recuperate lines
    omn_3 = f.readlines()
    
# recuperate sentences
omn_3_ = {
    'french': [],
    'wolof': []
}

for omn in omn_3:
    
    omn = omn.split('\t')
    
    sentences = [s.strip() for s in omn if s.strip() != '']
    
    omn_3_['wolof'].append(sentences[0])
    
    omn_3_['french'].append(sentences[1])
      
        

In [24]:
pd.options.display.max_rows = 74
omn_3_ = pd.DataFrame(omn_3_).head(74)

# print the first sentences from omniglot
omn_3_.head(74)

Unnamed: 0,french,wolof
0,Que la paix soit avec toi.,Salaam maaleekum.
1,Que la paix soit avec toi aussi.,Maaleekum salaam
2,Es-tu en paix ?,Jaama ngaam ?
3,Paix seulement.,Jamma rek.
4,Comment va ta famille ?,Ana waa kër gi ?
5,Où sont les membres de ta famille ?,Ana waa kër gi ?
6,Ils sont là-bas.,Ñunga fa.
7,J'espère qu'ils vont bien.,Mbaa defuñu dara.
8,Rien ne va mal avec eux.,Mbaa defuñu dara.
9,"Non, ils vont bien.","Déedéet, defuñu dara."
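
The three loading cells above share the same logic and differ only in the separator and in the column order. A minimal sketch of a common helper is given below; the name `parse_pairs` and its parameters are assumptions, and, unlike the cells above, it skips lines that do not contain two non-empty fields instead of raising an `IndexError`.

```python
from typing import Dict, Iterable, List, Tuple

def parse_pairs(lines: Iterable[str], sep: str = "\t",
                order: Tuple[str, str] = ("french", "wolof")) -> Dict[str, List[str]]:
    """Split each line on `sep` and store the first two non-empty fields in the given column order."""
    pairs: Dict[str, List[str]] = {"french": [], "wolof": []}
    for line in lines:
        fields = [field.strip() for field in line.split(sep) if field.strip()]
        if len(fields) < 2:
            continue  # blank or malformed line
        pairs[order[0]].append(fields[0])
        pairs[order[1]].append(fields[1])
    return pairs

# tab-separated lines, French first (layout of the first file)
first = parse_pairs(["Bienvenue\tMerhbe\n", "\n", "Salut\t\tNa nga def\n"])
# dash-separated lines, Wolof first (layout of the second file)
second = parse_pairs(["Baal ma. – Pardon.\n"], sep="–", order=("wolof", "french"))
print(first, second)
```

The resulting dictionaries can be passed to `pd.DataFrame` exactly as in the cells above.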


- Let us retrieve the sentences that were created by internet users on our platform [addition_sentences_platform](https://french-wolof-translation.streamlit.app/) from our MongoDB Atlas database.

In [25]:
# initialize the path to the csv files
sentences_path = 'data/additional_documents/hugging_face/wolof_french.csv'
deleted_path = 'data/additional_documents/hugging_face/deleted.csv'

# initialize the server uri and database name
# the uri embeds credentials, so it is read from an environment variable
# (the variable name MONGODB_URI is an assumption) instead of being hard-coded
import os
uri = os.environ['MONGODB_URI']
database_name = 'WolofTranslation'

# initialize the database manager 
database_manager = TranslationMongoDBManager(uri, database_name)

# save locally the data sets
database_manager.save_data_frames(sentences_path, deleted_path)

hug_sents = pd.read_csv(sentences_path)

In [26]:
pd.options.display.max_rows = 74

hug_sents.head(74)

Unnamed: 0,french,wolof
0,"J'en suis sûr, cette photo ci c'est la photo p...",Waaw nataal bii nataal la boob ay nit ñu baree...
1,J'arrive tout de suite chez toi.,Léegui léegui ma egg sa kër.
2,Je vois devant moi une photo sur laquelle beau...,Nataal bii maa ngi ciy janloog haa ay nit yu b...
3,Ceux-ci sont des personnes qui sont sortis pou...,"Lii, ay nit lañu yu génn di ñaxtu. Jëm yi nag ..."
4,"Sur la photo, ont voit des personnes qui se ré...",Nataal bii ñoo ngi ciy gis ay nit ñuy ñaxtu wa...
5,On voit sur la photo beaucoup de personnes sor...,Ñu gis ci nataal bi ay nit ñu bari ñu génn ci ...
6,"Je pense que la photo que je vois là, c'est un...","Man de nataal bii ma gis nii, nataal la boo xa..."
7,Bonjour ! Ceci est une photo. Sur cette photo ...,As salaamaalekum! lii de ab nataal la. Nataal ...
8,Ceci est une théière. Une théière dans la quel...,Lii nag baraada la. Baraada bu nu def attaaya ...
9,"Ce que j'ai en face de moi, sur la photo, c'es...",Li may jàkkarlool ci nataal bi nag moom benn k...


Let us add them to Diagne's sentences and to version 4 of the corpora, then remove duplicated rows.

- On corpora version 4

In [27]:
corpora_v4 = pd.concat((corpora_v4, omn_1_, omn_2_, omn_3_, hug_sents), axis=0)

# drop duplicated rows
print(sum(corpora_v4.duplicated()))

7


In [28]:
corpora_v4.drop_duplicates(inplace=True)

7 rows were deleted.

- On Diagne's sentences

In [29]:
diagne_sentences = pd.concat((diagne_sentences, omn_1_, omn_2_, omn_3_, hug_sents), axis=0)

# drop duplicated rows
print(sum(diagne_sentences.duplicated()))

4


In [30]:
diagne_sentences.drop_duplicates(inplace=True)

4 rows were deleted.
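
One detail of `pd.concat` matters for the next steps: without `ignore_index=True` each frame keeps its own index labels, so labels repeat in the result. The sketch below uses two tiny hand-made frames (not the real data) to show the effect and one way to obtain a clean, unique index after removing duplicates.

```python
import pandas as pd

a = pd.DataFrame({"french": ["Salut", "Oui."], "wolof": ["Na nga def", "Wau."]})
b = pd.DataFrame({"french": ["Salut", "Non."], "wolof": ["Na nga def", "Deh-det."]})

# default behaviour: the labels 0 and 1 appear twice
merged = pd.concat((a, b), axis=0)
print(merged.index.tolist())

# renumber the rows, drop exact duplicates, then renumber again
merged_clean = (
    pd.concat((a, b), ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
print(merged_clean)
```

With a unique index, label-based operations such as `drop(index=...)` affect exactly one row per label.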

Let us consider only the fourth method of adding end marks (option 4) and verify whether it creates duplicated rows.

- Corpora v4

In [31]:
french_sents = corpora_v4['french'].to_list()

wolof_sents = corpora_v4['wolof'].to_list()

french_sents = add_end_mark(french_sents, end_mark_to_remove='!')

wolof_sents = add_end_mark(wolof_sents, end_mark_to_remove='!')

corpora_v4_end = pd.DataFrame({'french': french_sents, 'wolof': wolof_sents})

In [32]:
duplicated = corpora_v4_end.duplicated()

sum(duplicated)

5

We obtain 5 duplicated rows. Let us retrieve their indexes.

In [33]:
dup_indexes = duplicated[duplicated == True].index

In [34]:
dup_indexes

Int64Index([1428, 1597, 2241, 2243, 2367], dtype='int64')

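Let us remove those rows from the corpora.

In [35]:
# `duplicated` follows the row order of corpora_v4, whose index labels repeat after the concatenation,
# so we filter by position with a boolean mask instead of dropping by label
corpora_v4 = corpora_v4[~duplicated.to_numpy()]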

- Diagne's sentences

In [36]:
french_sents = diagne_sentences['french'].to_list()

wolof_sents = diagne_sentences['wolof'].to_list()

french_sents = add_end_mark(french_sents, end_mark_to_remove='!')

wolof_sents = add_end_mark(wolof_sents, end_mark_to_remove='!')

diagne_end = pd.DataFrame({'french': french_sents, 'wolof': wolof_sents})

In [37]:
duplicated = diagne_end.duplicated()

sum(duplicated)

5

We obtain 5 duplicated rows. Let us retrieve their indexes.

In [38]:
dup_indexes = duplicated[duplicated == True].index

In [39]:
dup_indexes

Int64Index([546, 715, 1359, 1361, 1485], dtype='int64')

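Let us remove those rows from Diagne's sentences.

In [40]:
# `duplicated` follows the row order of diagne_sentences, whose index labels repeat after the concatenation,
# so we filter by position with a boolean mask instead of dropping by label
diagne_sentences = diagne_sentences[~duplicated.to_numpy()]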
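
A small hand-made example illustrates why normalizing the end marks can create new duplicates, and why removing them by position is safer than removing them by label when index labels repeat. The rows and the regular expression are illustrative assumptions, not the real data or the `add_end_mark` function.

```python
import pandas as pd

# the repeated label 0 mimics what pd.concat produces without ignore_index
df = pd.DataFrame(
    {"french": ["Crie !", "Oui.", "Crie."], "wolof": ["Wóoy !", "Wau.", "Wóoy."]},
    index=[0, 1, 0],
)

# after replacing a trailing "!" with ".", the first and last rows become identical
normalized = df.apply(lambda col: col.str.replace(r"\s*!$", ".", regex=True))
mask = normalized.duplicated().to_numpy()

# positional filtering removes only the second occurrence
deduplicated = df[~mask]

# label-based dropping removes every row labelled 0, including the first occurrence
dropped_by_label = df.drop(index=[0])
print(deduplicated, dropped_by_label, sep="\n\n")
```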

Let us save the results.

In [41]:
diagne_sentences.to_csv('data/extractions/new_data/ad_sentences.csv', index=False)

corpora_v4.to_csv('data/extractions/new_data/corpora_v6.csv', index=False)

Let us verify how many sentences we have so far.

In [42]:
# from diagne's book
diagne_sentences.shape[0]

1949

In [43]:
# inside the corpora
corpora_v4.shape[0]

2831

We obtained `1949` sentences for the extended version of Diagne's sentences, and the final corpora contains `2831` sentences.

### Web Scraping

This step involves scraping some websites to obtain more sentences using `selenium` and `BeautifulSoup`. It requires carefully inspecting each website to identify the elements to trigger and to extract. We will also need to identify the transformations to apply to the different sentences. All the transformations, except the capitalization of the sentences, are already implemented and can be used directly. 

We will scrape the following websites: 

- url1: [corporan.huma-num.fr](https://corporan.huma-num.fr/Lexiques/indexM.php?controle=afficheDico&action=main&ficXML=DicoWolof.xml&langue=Wolof)

In [2]:
url1 = "https://corporan.huma-num.fr/Lexiques/indexM.php?controle=afficheDico&action=main&ficXML=DicoWolof.xml&langue=Wolof"

#### Corporan.huma-num scraping

Let us add the necessary libraries.

In [114]:
from selenium import webdriver
from selenium.webdriver.chrome import service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from wolof_translate.utils.sent_transformers import TransformerSequences
from wolof_translate.utils.sent_corrections import *
from urllib import request as rq
from bs4 import BeautifulSoup
from tqdm import tqdm
from typing import *
import pandas as pd

We need to identify the elements containing the sentences. This is easy here since they are identified by their classes:

- The French sentences are identified by the `xe` class
- The Wolof sentences by the `xv` class.

Since the website is a dictionary of words, we must click on a word to display its sentences. The word elements are not links; they call an AJAX function, so we must trigger a click event to display the different pages. The words are represented by `div` tags, and each of them has a unique id which is simply the number of the word. We have `9965` words, so the ids range from 1 to 9965.

Let us begin by initializing the driver. We will use the Chrome driver in our case.

In [104]:
# initialize the option
options = webdriver.ChromeOptions()

# we will add the headless option which will not render the browser
options.add_argument('--headless')

# let us recuperate the service
s = service.Service(r'C:\Program Files\Chromedriver\chromedriver.exe')

# let us recuperate the driver
driver = webdriver.Chrome(service=s, options=options)



Let us navigate to the website.

In [105]:
# send a get request to the website
driver.get(url1)

Let us initialize the classes.

In [106]:
# initialize the class map
class_map = {
    'xe': 'french',
    'xv': 'wolof'
}


Let us initialize the transformer.

In [107]:
transformer = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space)

Let us create a function which retrieves the content of the current page and extracts the sentences with Beautiful Soup. We will pause for a short, fixed time after each click since we don't know which element to wait for: not all pages contain sentences. The length of the pause is set in the scraping loop.

In [111]:
def get_sentences_from_page(driver: webdriver.chrome.webdriver.WebDriver, sentences: Dict[str, list], transformer: Union[TransformerSequences, None] = None, replaces: dict = {'…': '...'}, end_marks = ['.', '!', '?', '...']):
    
    # let us recuperate the content of the current page
    content = driver.page_source
    
    # recuperate the document object
    doc = BeautifulSoup(content, 'lxml')
    
    # recuperate the elements
    elements = {class__: doc.find_all('span', class_ = class__) for class__ in sentences}
    
    if elements != {key: [] for key in sentences}:
        
        # count the number of sentences that we got
        sent_count = len(elements[list(sentences)[0]])
        
        # for each group of sentences we will add some transformations
        for s in range(sent_count):
            
            # initialize the common end mark
            end_mark = None
            
            # recuperate the sentences and add to the dictionary of sentences
            for class__ in sentences:
                
                # recuperate the current sentence
                sentence = elements[class__][s].text.strip()
                
                # replace marks
                for replace in replaces:
                    
                    sentence = sentence.replace(replace, replaces[replace])
                
                if not sentence[-1] in end_marks:
                    
                    # add an end mark if the end mark is got from one of the previous sentences
                    if not end_mark is None:
                        
                        sentence += end_mark
                    
                    else: # add a period mark
                        
                        sentence += '.'
                
                else: # if an end mark is identified recuperated it as the end mark of the other sentences
                    
                    end_mark = sentence[-1]
                
                # capitalize the current sentence
                sentence = sentence[0].upper() + sentence[1:]
                
                # add transformations to the sentence
                if transformer: sentence = transformer(sentence)[0]
                
                # add the sentence to the dictionary of sentences
                sentences[class__].append(sentence)
            
    # return the sentences
    return sentences
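
The end-mark handling inside `get_sentences_from_page` can be isolated as a small pure-Python sketch. The helper name `harmonize_end_marks` is hypothetical and the Selenium and Beautiful Soup parts are left out; the sketch only illustrates how a mark found on an earlier sentence of a group is reused for the later ones, with a period as the fallback. Unlike the function above, it skips empty strings instead of failing on them.

```python
from typing import List, Optional, Sequence

def harmonize_end_marks(group: Sequence[str],
                        end_marks: Sequence[str] = (".", "!", "?")) -> List[str]:
    """Give every sentence of a translation group an end mark and a capital first letter."""
    shared: Optional[str] = None  # last end mark seen in the group
    result = []
    for sentence in group:
        sentence = sentence.strip()
        if not sentence:
            result.append(sentence)
            continue
        if sentence[-1] in end_marks:
            shared = sentence[-1]
        else:
            # reuse the mark of an earlier sentence, otherwise fall back to a period
            sentence += shared if shared is not None else "."
        result.append(sentence[0].upper() + sentence[1:])
    return result

print(harmonize_end_marks(["tu connais cet homme ?", "xam nga nit kookuu"]))
print(harmonize_end_marks(["su demee", "s'il est parti"]))
```

Because the sentences are processed in order, a mark is only propagated forward: if only the second sentence carries a question mark, the first one still receives a period.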
    

In [103]:
# calculate the approximate number of minutes
0.5*9956/60

82.96666666666667
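
This back-of-the-envelope estimate can be compared with the rate that `tqdm` reports at the end of this notebook (about 3.34 s per word for the 9765 remaining ids). The function name `estimated_minutes` is hypothetical; the inputs are the figures that appear in this notebook.

```python
def estimated_minutes(n_pages: int, seconds_per_page: float) -> float:
    """Total duration in minutes for a given number of pages and cost per page."""
    return n_pages * seconds_per_page / 60

planned = estimated_minutes(9956, 0.5)    # same inputs as the cell above
observed = estimated_minutes(9765, 3.34)  # remaining ids and rate reported by tqdm

print(f"planned:  {planned:.0f} min")
print(f"observed: {observed:.0f} min, about {observed / 60:.1f} h")
```

The pause is therefore only a lower bound on the cost per word; the click, the rendering and the parsing appear to account for most of the time.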

Initialize the sentences 🛑.

In [126]:
# recuperate the saved sentences
scraped_sentences = pd.read_csv('data/extractions/new_data/scraped_sents.csv').to_dict('list')

# initialize the sentences
sentences = {key: scraped_sentences[class_map[key]] for key in class_map}

# current_id
current_id = 201

# initialize the ids
word_ids = list(range(current_id, 9966))

Let us retrieve the sentences by clicking on each word in turn.

In [127]:
import time

# initialize the pause (in seconds) applied after each click
time_sleep = 0.6

# initialize a save step
save_step = 20

# retrieve the sentences of each word
for id in tqdm(word_ids):
        
    # click on the word
    word = driver.find_element(By.ID, str(id))
    driver.execute_script('arguments[0].click();', word)
    
    # pause so that the page can load
    # (`implicitly_wait` only sets a timeout for element lookups and does not pause)
    time.sleep(time_sleep)
    
    # recuperate the sentences
    sentences = get_sentences_from_page(driver, sentences, transformer, replaces={'…': '...', '\xa0': ' ', 'qqn': "quelqu'un"})
    
    # save the sentences
    if id % save_step == 0:
        
        pd.DataFrame({value: sentences[key] for key, value in class_map.items()}).to_csv('data/extractions/new_data/scraped_sents.csv', index = False)
    
    

 29%|██▊       | 2787/9765 [1:47:03<6:28:18,  3.34s/it]