Text processing 1
----------------------

In [22]:
# importations
import re
import spacy
import nltk
import pandas as pd
from typing import *
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt

# define a style for the matplotlib plots
plt.style.use("ggplot")

After matching the French paragraphs with their Wolof counterparts, we obtained a new set of paragraphs. We can now process the text further, in the following steps:

1. Load, print each corpus, and identify additional corrections to add to the text;

2. Identify the length of each paragraph in each corpus and compare them;

3. Identify the longest and shortest paragraphs;

4. Modify the pre-created nlp pipeline class for that project; 

5. Identify the options to add to `spacy` (if it is used) to find the tokens;

6. Use the pre-created nlp pipeline class to retrieve the tokens;

7. Identify the nlp pipeline sub-steps.

### Add additional corrections to the text

It is possible that we forgot to add some corrections to the texts. Let us load the data frame and print the paragraphs in order to see what corrections will be necessary before diving deeper into the text processing.

In [3]:
# Load the data frame and make a copy
corpora_v1 = pd.read_csv("data/extractions/new_data/corpora_v1.csv", encoding='utf-16')

corpora = corpora_v1.copy()

In [4]:
# print the head
corpora.head()

Unnamed: 0,french_corpus,wolof_corpus
0,Tout être humain est le résultat d’un père et ...,"Doomu-aadama bu, ne ci ndey ak baay nga jóge. ..."
1,J’ai longtemps rêvé que ma mère était noire. J...,"Bi ma delloo dëkk ba ma juddoo, dama faa meloo..."
2,"De ce visage que j’ai reçu à ma naissance, j’a...","Kanam gii ma judduwaale, am na lu bari lu ma c..."
3,"À l’âge de huit ans à peu près, j’ai vécu en A...",Daanaka bi ma tolloo ci juróom-ňetti at laa to...
4,"De ce temps, pour ainsi dire consécutivement, ...","Ay yarami nit di ma feeňu, nag. Sama yaram, sa..."


Let us display the corpora more clearly by printing them as a dictionary.

In [5]:
corpora_dict = corpora.to_dict(orient='index')

corpora_dict

{0: {'french_corpus': 'Tout être humain est le résultat d’un père et une mère. On peut ne pas les reconnaître, ne pas les aimer, on peut douter d’eux. Mais ils sont là, avec leur visage, leurs attitudes, leurs manières et leurs manies, leurs illusions, leurs espoirs, la forme de leurs mains et de leurs doigts de pied, la couleur de leurs yeux et de leurs cheveux, leur façon de parler, leurs pensées, probablement l’âge de leur mort, tout cela est passé en nous.',
  'wolof_corpus': 'Doomu-aadama bu, ne ci ndey ak baay nga jóge. Mënunu leen a baň a gërëm ak a bëgg, doonte sax mën nanoo am xel ňaar ci ňoom. Waaye ňu ngi fi, ak seen xar-kanam, seen taxawaay, seen defin ak seen jikko, seeni njuumte, seeni yaakaar, seen melokaanu loxook baaraami tànk, seen meloy bët ak karaw, seen waxin, seeni xalaat, amaana sax at ma ňuy nar a génne àddina. Loolu lépp, day àgg fu sore ci nun.'},
 1: {'french_corpus': 'J’ai longtemps rêvé que ma mère était noire. Je m’étais inventé une histoire, un passé, pou

We identified several elements that may not be useful in every case:

- The en dash "–" is used in both the French and the Wolof corpus, but not always in the same places, even in two paragraphs with identical syntax: some French sentences use it where their Wolof version does not, and vice versa. The en dash carries meaning in the narration, which matters for the translations. Because it genuinely occurs in both corpora, we cannot treat every occurrence as uninteresting.

- The abbreviation of "ème" to "e", which marks ordinal numbers in French. The problem is that the two occurrences we found are not carried over into the Wolof version, since the Wolof text contains no translation of them. We can either leave them, assuming they will receive little attention, or delete them. When translating a Wolof sentence into French, ordinal numbers could be confused with cardinal numbers, or vice versa.

- The numbers written in digits in the corpora are either years, ordinal numbers, or the "6x6" element. Only the years seem to appear in both the French and the Wolof corpora; the other numbers are probably less relevant in this context. Since most French numbers are translated into Wolof as words, it is more useful to spell the numbers out in letters before translating them.

- Some paragraphs are too long and may contain more than one context. It would be beneficial to identify the paragraphs that contain multiple contexts and split them into several elements, or to concatenate two consecutive paragraphs that share the same context, so that the large language model can more easily identify the context of a given paragraph. We will only do so if we find that the training needs to be more accurate. Splitting would also give us more paragraphs: we only have 141, which is a small number (but with a downstream approach, it will be OK). If the downstream approach does not solve the lack of data, we will generate new data with the `GAN (Generative Adversarial Network)` approach.

- The text contains some extra spaces that we will delete later, but we did not find any words stuck together in the corpora.

- As noted in the previous processing steps, some sentences in the Wolof paragraphs appear in a different order from their French counterparts. Since they belong to the same context, we can use them as they are.

- Some words appear in one corpus without being translated in the other, which will sometimes bias the translation.
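The observations above can be turned into a small audit that flags, for each paragraph, the elements listed. This is only a sketch: the helper names `normalize_spaces` and `audit_paragraph` and the regular expressions are assumptions made for illustration, not part of the project's pipeline, and the sample sentence is invented.

```python
import re

# Patterns below are illustrative assumptions about what to flag.
EN_DASH = "\u2013"
# digits followed by a French ordinal suffix, e.g. "20e" or "1er"
ORDINAL_RE = re.compile(r"\b\d+(?:e|er|ère|ème)\b")
# standalone digit runs (years, for instance); "20e" and "6x6" are not matched
NUMBER_RE = re.compile(r"\b\d+\b")
EXTRA_SPACE_RE = re.compile(r"\s{2,}")


def normalize_spaces(text: str) -> str:
    """Collapse every run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def audit_paragraph(text: str) -> dict:
    """Report the elements discussed above for a single paragraph."""
    return {
        "n_en_dashes": text.count(EN_DASH),
        "ordinals": ORDINAL_RE.findall(text),
        "numbers": NUMBER_RE.findall(text),
        "has_extra_spaces": bool(EXTRA_SPACE_RE.search(text)),
    }


# Invented sample sentence, not taken from the corpora
sample = "En  1968, au 20e siècle \u2013 un 6x6."
report = audit_paragraph(sample)
cleaned = normalize_spaces(sample)
```

Applied with `corpora['french_corpus'].map(audit_paragraph)`, such a helper would give one report per paragraph, which makes it easier to compare the French and Wolof sides before deciding what to delete or rewrite.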

### Count the number of letters and words

The first comparison between the two corpora concerns their lengths:

- First, we define the length of a document as its number of letters (all characters, spaces included);
- Second, we define the length of a document as its number of words (separated by whitespace only).

Let us compute both types of length and show how they vary across the documents.

In [6]:
# length as number letters (space in count)
count_letters = lambda doc_: len(doc_)

corpora['french_corpus_n_letters'] = corpora['french_corpus'].map(count_letters)

corpora['wolof_corpus_n_letters'] = corpora['wolof_corpus'].map(count_letters)

# length as number of words
count_words_space = lambda doc_: len(doc_.split())

corpora['french_corpus_n_words'] = corpora['french_corpus'].map(count_words_space)

corpora['wolof_corpus_n_words'] = corpora['wolof_corpus'].map(count_words_space)


In [7]:
# print the first documents
corpora.head()

Unnamed: 0,french_corpus,wolof_corpus,french_corpus_n_letters,wolof_corpus_n_letters,french_corpus_n_words,wolof_corpus_n_words
0,Tout être humain est le résultat d’un père et ...,"Doomu-aadama bu, ne ci ndey ak baay nga jóge. ...",441,396,80,76
1,J’ai longtemps rêvé que ma mère était noire. J...,"Bi ma delloo dëkk ba ma juddoo, dama faa meloo...",506,570,86,115
2,"De ce visage que j’ai reçu à ma naissance, j’a...","Kanam gii ma judduwaale, am na lu bari lu ma c...",448,500,79,108
3,"À l’âge de huit ans à peu près, j’ai vécu en A...",Daanaka bi ma tolloo ci juróom-ňetti at laa to...,1191,960,212,181
4,"De ce temps, pour ainsi dire consécutivement, ...","Ay yarami nit di ma feeňu, nag. Sama yaram, sa...",922,794,161,165


Let us plot the number of letters and the number of whitespace-separated words for each document.

In [8]:
def plot_lengths(data_frame: pd.DataFrame, columns_regex: List[str] = [".*n_letters", ".*n_words"]):
    """Plot, for each regex, the matching length columns as one line per column."""
    
    # one subplot per regex, stacked vertically
    fig = make_subplots(rows=len(columns_regex), cols=1, subplot_titles=columns_regex, y_title="Numbers", x_title="documents")
    
    for i, column in enumerate(columns_regex):
        
        # columns whose name matches the current regex
        columns = [column_ for column_ in data_frame.columns if re.search(column, column_)]
        
        for col_ in columns:
            
            # alternative view: go.Histogram(x=data_frame[col_], name=col_)
            
            # one line per column: length on y, x defaults to the row position
            fig.add_trace(go.Scatter(
                y = data_frame[col_], name=col_
            ), row=i+1, col=1)
    
    fig.show()

plot_lengths(corpora)

The top panel shows the number of letters per document: the blue line is the French corpus and the red line the Wolof corpus.

The bottom panel shows the number of words per document: the violet line is the French corpus and the green line the Wolof corpus.

The two corpora follow the same trend for both the number of letters and the number of words, which is what we expected to find.
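The shared trend read from the figure can also be quantified with a correlation coefficient. The sketch below uses only the five rows printed by `corpora.head()` above, so the values are indicative rather than representative of the whole corpus; the helper name `length_correlation` is an assumption made for illustration.

```python
import pandas as pd

# Lengths of the first five paragraphs, copied from the table printed above
sample = pd.DataFrame({
    "french_corpus_n_letters": [441, 506, 448, 1191, 922],
    "wolof_corpus_n_letters": [396, 570, 500, 960, 794],
    "french_corpus_n_words": [80, 86, 79, 212, 161],
    "wolof_corpus_n_words": [76, 115, 108, 181, 165],
})


def length_correlation(frame: pd.DataFrame, suffix: str) -> float:
    """Pearson correlation between the French and Wolof columns ending in `suffix`."""
    french = frame[f"french_corpus_{suffix}"]
    wolof = frame[f"wolof_corpus_{suffix}"]
    return french.corr(wolof)


letters_r = length_correlation(sample, "n_letters")
words_r = length_correlation(sample, "n_words")
print(f"letters: {letters_r:.3f}, words: {words_r:.3f}")
```

Calling `length_correlation(corpora, "n_letters")` on the full data frame would give the figure that actually matters; a value close to 1 would support the visual impression.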

### Identify the shortest and the longest paragraphs

Identifying the longest and the shortest paragraphs will help us spot paragraphs with abnormal length or composition. Let us display the five shortest and the five longest paragraphs.

Let us consider, first, the length to be the number of letters.

In [17]:
# let us sort the corpora by number of letters 
sorted_corpora_french1 = corpora.sort_values(by="french_corpus_n_letters")[['french_corpus', 'french_corpus_n_letters']]

sorted_corpora_wolof1 = corpora.sort_values(by="wolof_corpus_n_letters")[['wolof_corpus', 'wolof_corpus_n_letters']]

# let us retrieve the five shortest paragraphs (head() returns 5 rows by default)
top_french_corpora = sorted_corpora_french1.head()

top_wolof_corpora = sorted_corpora_wolof1.head()

# let us retrieve the five longest paragraphs (tail() returns 5 rows by default)
bot_french_corpora = sorted_corpora_french1.tail()

bot_wolof_corpora = sorted_corpora_wolof1.tail()


Display the five shortest paragraphs.


In [18]:
top_french_corpora

Unnamed: 0,french_corpus,french_corpus_n_letters
24,Nous savions que c’était la ville des termites.,47
103,"Comment vit-il ces longues années de guerre, s...",138
137,Si je n’avais pas eu cette connaissance charne...,152
135,Mais je me souviens de tout ce que j’ai reçu q...,195
120,Tout cela peut sembler anecdotique. Mais ces m...,207


In [19]:
top_wolof_corpora

Unnamed: 0,wolof_corpus,wolof_corpus_n_letters
24,"Nu daldi xoolante, xam ni yegsi nanu ci dëkku ...",53
103,Naka la tooge woon moom kott ci kër googu juró...,102
137,"Waaw, su Afrig duggul woon ba ci samay biir ya...",103
120,"Lii lépp, mën naa niiruy waxi kasaw-kasaw. Moo...",129
59,"Li ëpp ci nataal yi Baay jël, dañuy wone xolub...",206


The shortest paragraph has 47 letters in the French corpus and 53 letters in the Wolof corpus. Both lengths look normal.

Let us display the five longest paragraphs.

In [20]:
bot_french_corpora

Unnamed: 0,french_corpus,french_corpus_n_letters
65,"Tout cela, je ne l’ai compris que beaucoup plu...",2257
7,"L’entrée dans Obudu, je m’en souviens bien : l...",2257
39,"Chaque nuit, dans une sorte de revanche du mon...",2282
13,L’Afrique était puissante. Pour l’enfant que j...,2717
126,"En 1968, tandis que mon père et ma mère regard...",3350


In [21]:
bot_wolof_corpora

Unnamed: 0,wolof_corpus,wolof_corpus_n_letters
67,"Ku laajoon Baay fan la Afrig tàmbalee ci moom,...",2131
81,Su weesoo dénd wi làrme réewum Almaañ defar ci...,2170
64,"Xare bi jeex, ñu tekk ci ñaar-fukki at, ma ànd...",2279
111,Nit ki ma teerusi ci waaxu Poor-Arkuur dafa me...,2357
126,"Atum 1968 agsi, di at mu bariy yëngu-yëngu. Ma...",2867


Since the longest paragraphs are not dramatically longer than those just before them, we cannot yet say that they are abnormal. Let us draw a box plot of the lengths to check this hypothesis.

In [27]:
px.box(data_frame=corpora, x = ["french_corpus_n_letters", "wolof_corpus_n_letters"])

For `wolof_corpus_n_letters`, we identify an upper fence of **2170** letters, so 3 paragraphs are considered abnormal. For `french_corpus_n_letters`, we identify an upper fence of **2282** letters, so 2 paragraphs are considered abnormal. Let us retrieve those documents.

In [34]:
french_abnormal_n_letters = corpora[corpora["french_corpus_n_letters"] > 2282]["french_corpus"]
wolof_abnormal_n_letters = corpora[corpora["wolof_corpus_n_letters"] > 2170]["wolof_corpus"]

# print those "abnormal" paragraphs
french_abnormal_n_letters.to_list(), wolof_abnormal_n_letters.to_list()

(['L’Afrique était puissante. Pour l’enfant que j’étais, la violence était générale, indiscutable. Elle donnait de l’enthousiasme. Il est difficile d’en parler aujourd’hui, après tant de catastrophes et d’abandon. Peu d’Européens ont connu ce sentiment. Le travail que faisait mon père au Cameroun d’abord, puis au Nigeria, créait une situation exceptionnelle. La plupart des Anglais en poste dans la colonie exerçaient des fonctions administratives. Ils étaient militaires, juges, district officers (ces D.O. dont les initiales, prononcées à l’anglaise, Di-O, m’avaient fait penser à un nom religieux, comme une variation sur le « Deo gratias » de la messe que ma mère célébrait sous la varangue chaque dimanche matin). Mon père était l’unique médecin dans un rayon de soixante kilomètres. Mais cette dimension que je donne n’a aucun sens : la première ville administrative était Abakaliki, à quatre heures de route, et pour y arriver il fallait traverser la rivière Aiya en bac, puis une épaisse fo

In [35]:
# print their indices
french_abnormal_n_letters.index, wolof_abnormal_n_letters.index


(Int64Index([13, 126], dtype='int64'),
 Int64Index([64, 111, 126], dtype='int64'))

Only paragraph 126 is considered abnormally long in both corpora.
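The fence values above were read off the box plot by hand. A sketch of how they could be computed instead is given below, assuming the usual Tukey rule (third quartile plus 1.5 times the interquartile range). The fence shown by the plot seems to be the largest observation within that bound rather than the bound itself, which is consistent with 2282 and 2170 both being actual paragraph lengths in the sorted tables above. The helper name `upper_fence` is hypothetical, and pandas' default quantile interpolation may differ slightly from the quartile method used by the plotting library.

```python
import pandas as pd


def upper_fence(lengths: pd.Series, k: float = 1.5) -> dict:
    """Tukey upper bound, largest value within it, and index of the values above it."""
    q1 = lengths.quantile(0.25)
    q3 = lengths.quantile(0.75)
    bound = q3 + k * (q3 - q1)
    return {
        "bound": bound,
        # largest observation that is still inside the bound (end of the whisker)
        "whisker": lengths[lengths <= bound].max(),
        # index labels of the observations flagged as abnormally long
        "outlier_index": lengths[lengths > bound].index.tolist(),
    }


# Toy values for illustration only, not corpus data
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
result = upper_fence(toy)
print(result)
```

With the real data, `upper_fence(corpora["french_corpus_n_letters"])["outlier_index"]` would replace the hard-coded thresholds used in the filtering cell.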

Let us now consider the length to be the number of words separated by spaces. We will plot the box plot directly.

In [36]:
px.box(data_frame=corpora, x = ["french_corpus_n_words", "wolof_corpus_n_words"])


For `wolof_corpus_n_words`, we identify an upper fence of **447** words, so 2 paragraphs are considered abnormal. For `french_corpus_n_words`, we identify an upper fence of **396** words, so 3 paragraphs are considered abnormal. Let us retrieve those documents.

In [37]:
french_abnormal_n_words = corpora[corpora["french_corpus_n_words"] > 396]["french_corpus"]
wolof_abnormal_n_words = corpora[corpora["wolof_corpus_n_words"] > 447]["wolof_corpus"]

# print those "abnormal" paragraphs
french_abnormal_n_words.to_list(), wolof_abnormal_n_words.to_list()

(['L’entrée dans Obudu, je m’en souviens bien : la route sort de l’ombre de la forêt et entre tout droit dans le village, en plein soleil. Mon père a arrêté son auto, avec ma mère il doit parler aux officiels. Je suis seul au milieu de la foule, je n’ai pas peur. Les mains me touchent, passent sur mes bras, sur mes cheveux autour du bord de mon chapeau. Parmi tous ceux qui se pressent autour de moi, il y a une vieille femme, enfin je ne sais pas qu’elle est vieille. Je suppose que c’est d’abord son âge que je remarque, parce qu’elle diffère des enfants nus et des hommes et des femmes habillés plus ou moins à l’occidentale que je vois à Ogoja. Quand ma mère revient (peut-être vaguement inquiète de ce rassemblement), je lui montre cette femme : « Qu’est-ce qu’elle a ? Est-ce qu’elle est malade ?» Je me souviens de cette question que j’ai posée à ma mère. Le corps nu de cette femme, fait de plis, de rides, sa peau comme une outre dégonflée, ses seins allongés et flasques, pendant sur son 

In [38]:
# print their indices
french_abnormal_n_words.index, wolof_abnormal_n_words.index

(Int64Index([7, 13, 126], dtype='int64'),
 Int64Index([111, 126], dtype='int64'))

Paragraph 126 is again the only paragraph considered "abnormally" long in both corpora.

**Conclusion 1**: Paragraph **126** is the longest paragraph in both the French and the Wolof corpora. The paragraphs at indices **13** (in the French corpus) and **111** (in the Wolof corpus) are considered abnormal for both types of length. The paragraphs at indices **64** (in the Wolof corpus) and **7** (in the French corpus) are flagged only once. Let us nevertheless consider all of them "abnormally" long and try to split them by context. The purpose of the split is to obtain paragraphs whose length(s) fall below the upper fence(s).
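The bookkeeping behind this conclusion can be reproduced with plain set operations on the indices printed above. The dictionary layout and variable names are assumptions made for illustration.

```python
from collections import Counter

# Indices flagged as abnormally long, copied from the outputs above
flagged = {
    ("french", "n_letters"): {13, 126},
    ("wolof", "n_letters"): {64, 111, 126},
    ("french", "n_words"): {7, 13, 126},
    ("wolof", "n_words"): {111, 126},
}

# flagged in both corpora and for both definitions of length
in_all = set.intersection(*flagged.values())

# flagged at least once: the candidates for splitting
in_any = set.union(*flagged.values())

# flagged exactly once
counts = Counter(index for indices in flagged.values() for index in indices)
flagged_once = sorted(index for index, n in counts.items() if n == 1)

print(in_all, sorted(in_any), flagged_once)
```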

---------------------------------

We will split those paragraphs directly in copies of the csv files containing the paragraphs. The split is done by context, according to the following plan:

- Paragraph 7: (length as number of words)
    - French corpus: 
        - "L'entrée dans Obudu..." (L'entrée dans Obudu) -> length = 69 
        - "Parmi tous ceux qui se pressent..." (Ma grand-mère) -> length = 340 
    - Wolof corpus:
        - "Sooy dugg Óbudu..."
        - "Ma seetlu ci mbooloo mi benn..."

- Paragraph 13: (length as number of words and number of letters)
    - French corpus:
        - "L’Afrique était puissante..." (L'Afrique était puissante) -> number_of_letters = 250, number_of_words = 34 
        - "Le travail que faisait mon père..." (Le travail que faisait mon père) -> number_of_letters = 1130, number_of_words = 185
        - "Nous étions, mon frère et moi..." (Les seuls blancs) -> number_of_letters = 1335, number_of_words = 230
    - Wolof corpus:
        - "Duma ko tàyyee wax..."
        - "Liggéey bi Baay daan def..."
        - "Maak sama mag ju..."

- Paragraph 64: (length as number of letters)
    - Wolof corpus:
        - "Xare bi jeex, ñu tekk ci..." (Rencontre avec mon père en Afrique) -> number_of_letters = 334
        - "Magam ji tuddoon Ësen..." (La difficulté en Afrique) -> number_of_letters = 1284
        - "Afrig, dafa fa dem ba jeex..." (La durée d'une vie en Afrique) -> number_of_letters = 659
    - French corpus:
        - "C’est ce même voyage..."
        - "Son frère Eugène, qui..."
        - "Dans l’Ouest africain..."

- Paragraph 111: (length as number of words and number of letters)
    - Wolof corpus: 
        - "Nit ki ma teerusi ci waaxu..." (Rencontre avec un homme d'un autre monde) -> number_of_letters = 766, number_of_words = 156
        - "Ci sama gis-gisu gone..." (Les rituels et manies de mon père) -> number_of_letters = 1590, number_of_words = 327
    - French corpus:
        - "L’homme qui m’est apparu..."
        - "Il était plein de manies..."

- Paragraph 126: (length as number of words and number of letters)
    - French corpus:
        - "En 1968, tandis que mon père..." (Un des plus grands génocides du siècle) -> number_of_letters = 1492, number_of_words = 216
        - "Mais à la fin de l’été 1968..." (Situation du génocide en fin d'été) -> number_of_letters = 965, number_of_words = 152
        - "À partir de septembre,..." (Fin du génocide) -> number_of_letters = 891, number_of_words = 144
    - Wolof corpus:
        - "Atum 1968 agsi, di..." -> number_of_letters = 1276, number_of_words = 238
        - "Waaye bi nawetu 1968..." -> number_of_letters = 792, number_of_words = 148
        - "Weeru sàttumbar agsi..." -> number_of_letters = 797, number_of_words = 156
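The splits listed above were made by hand in copies of the csv files. For reference, the same plan could be applied programmatically by cutting each paragraph where a listed segment begins. The function below is a hypothetical sketch, not the method actually used here, and it runs on an invented placeholder text.

```python
from typing import List


def split_at_markers(paragraph: str, markers: List[str]) -> List[str]:
    """Split `paragraph` so that each marker opens a new segment.

    The markers must appear in the paragraph in the given order;
    a ValueError is raised when one of them cannot be found.
    """
    cut_points = [0]
    for marker in markers:
        # search strictly after the previous cut so segments never overlap
        position = paragraph.find(marker, cut_points[-1] + 1)
        if position == -1:
            raise ValueError(f"marker not found: {marker!r}")
        cut_points.append(position)
    cut_points.append(len(paragraph))
    return [
        paragraph[start:end].strip()
        for start, end in zip(cut_points, cut_points[1:])
    ]


# Placeholder text for illustration only, not corpus data
text = "Alpha one. Beta two. Gamma three."
segments = split_at_markers(text, ["Beta", "Gamma"])
n_words = [len(segment.split()) for segment in segments]
```

Computing `len(segment)` and `len(segment.split())` for each segment, as in the last line, is a quick way to confirm that every piece falls below the corresponding upper fence.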


    

----------------------------

Let us load the new paragraphs and save the new corpora.

In [73]:
french_version_v5 = pd.read_csv("data/extractions/new_data/french_version_v5.csv")

wolof_version_v5 = pd.read_csv("data/extractions/new_data/wolof_version_v5.csv")

french_version_v5.rename(columns={'0': "french_corpus"}, inplace=True)

wolof_version_v5.rename(columns={'0': "wolof_corpus"}, inplace=True)

In [74]:
# combine the paragraphs in a unique DataFrame
corpora_v2 = pd.concat((french_version_v5, wolof_version_v5), axis=1)
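Because the two files were edited by hand, a quick alignment check before saving may be worthwhile: `pd.concat(..., axis=1)` aligns rows on the index, so a missing or extra split in one language would silently produce empty cells. The helper below is a hypothetical sketch under that assumption, shown on toy frames.

```python
import pandas as pd


def combine_aligned(french: pd.DataFrame, wolof: pd.DataFrame) -> pd.DataFrame:
    """Concatenate the two corpora side by side after checking they line up."""
    if len(french) != len(wolof):
        raise ValueError(
            f"paragraph counts differ: {len(french)} French vs {len(wolof)} Wolof"
        )
    combined = pd.concat(
        (french.reset_index(drop=True), wolof.reset_index(drop=True)), axis=1
    )
    if combined.isna().any().any():
        raise ValueError("at least one paragraph is missing in one of the corpora")
    return combined


# Toy frames for illustration only
toy_french = pd.DataFrame({"french_corpus": ["un", "deux"]})
toy_wolof = pd.DataFrame({"wolof_corpus": ["benn", "ñaar"]})
toy_combined = combine_aligned(toy_french, toy_wolof)
```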

In [75]:
# save the corpora
corpora_v2.to_csv("data/extractions/new_data/corpora_v2.csv", index=False, encoding='utf-16')

Let us display the same plots as for the previous corpora.

In [76]:
# make a copy of the corpora
corpora = corpora_v2.copy()

- Length distributions

In [78]:
# length as number letters (space in count)
corpora['french_corpus_n_letters'] = corpora['french_corpus'].map(count_letters)

corpora['wolof_corpus_n_letters'] = corpora['wolof_corpus'].map(count_letters)

# length as number of words
corpora['french_corpus_n_words'] = corpora['french_corpus'].map(count_words_space)

corpora['wolof_corpus_n_words'] = corpora['wolof_corpus'].map(count_words_space)

plot_lengths(corpora)


We again obtain almost the same curves. Since the number of words is better suited to our analyses, let us display only the box plot of the number of words.

- Box on number of words

In [81]:
px.box(data_frame=corpora, x = ["french_corpus_n_words", "wolof_corpus_n_words"])


Only one paragraph, in the French corpus only, is longer than the upper fence of **376** words. Its length is equal to the upper fence computed on the previous corpora. Let us check which paragraph it is.

In [83]:
abnormal_n_words = corpora[corpora["french_corpus_n_words"] > 376]["french_corpus"]

abnormal_n_words

70    Tout cela, je ne l’ai compris que beaucoup plu...
Name: french_corpus, dtype: object

It corresponds to the paragraph at index 70. We can check whether it contains multiple contexts.

The paragraph contains only one context, so it would be complicated to split. Since its length sits at the limit found for the previous corpora, we can keep it in our analyses as it is.