#**NLP: tokenizing, lemmatizing, POS tagging**

---



## What's NLP and why do we need it?
Natural language processing (NLP) is most often used as a preprocessing phase before further work on a text. It is hard to do without if you want to limit noise in statistical analysis and machine learning. In practice, preprocessing means **tokenizing** (splitting the text into words and punctuation marks), **lemmatizing** (reducing each inflected form to its dictionary form) and **POS tagging** (assigning each word its part of speech).

## How do you do that without programming?
Here are some basic tools you can use without knowing how to program in Python.

##**Some useful online tools**

###**UDPipe**
You'll find it [here](https://lindat.mff.cuni.cz/services/udpipe/).
<br>It works well for short texts.
<br>One of its distinctive features is that it draws dependency trees as SVG images.

###**Deucalion**
You'll find it [here](https://dh.chartes.psl.eu/deucalion/api/fr/).
<br>Much more accurate for longer texts. The output is not easy to read, although it comes in a ready-to-use format. It is particularly good on Old French and Latin.

###**VoyantTools**
You'll find it [here](https://voyant-tools.org/).
<br>It's a visualization tool that runs directly online. To get more modules you can install it locally, where it is really powerful and neat.

#**NLP: TOKENIZATION, LEMMATIZATION, POS TAGGING**

We're going to test **`stanza`**. There are plenty of other libraries for this task (like `spacy` and `pie-extended`), but `stanza` is often among the more accurate options, depending on the language and the model.

In [None]:
!pip install stanza

In [1]:
catilinaires="Quousque tandem abutere, Catilina, patientia nostra ? Quamdiu etiam furor iste tuus nos eludet ? Quem ad finem sese effrenata jactabit audacia ? Nihilne te nocturnum praesidium Palatii, nihil urbis vigiliae, nihil timor populi, nihil concursus bonorum omnium, nihil hic munitissimus habendi senatus locus, nihil horum ora vultusque moverunt ? Patere tua consilia non sentis ? Constrictam jam horum omnium scientia teneri conjurationem tuam non vides ? Quid proxima, quid superiore nocte egeris, ubi fueris, quos convocaveris, quid consilii ceperis, quem nostrum ignorare arbitraris ? O tempora ! O mores ! Senatus haec intellegit, consul videt. Hic tamen vivit."

##**stanza (formerly StanfordNLP)**

`stanza` provides models for many languages (here's a [list](https://stanfordnlp.github.io/stanza/performance.html)), which you can load with the language code, like `grc` for Ancient Greek or `la` for Latin. You can also specify which model (package) you want, as in the cell below.

In [None]:
import stanza
stanza.download('la', package="perseus")

We begin by building a `Pipeline`, which states which processors we want to use (there is no need to add `ner` if you don't need named entity recognition).

In [None]:
nlp_stanza = stanza.Pipeline(lang='la', package="perseus", processors='tokenize,pos,lemma,depparse')

Now you can run the pipeline on the text.

In [5]:
catilinaires_analyzed=nlp_stanza(catilinaires)

Here are some results, first for sentence segmentation, and then for lemmatizing and POS tagging.

In [None]:
for sent in catilinaires_analyzed.sentences:
  print("XXXXX "+sent.text+" XXXXX")

In [None]:
for sent in catilinaires_analyzed.sentences:
  for word in sent.words:
    # An f-string avoids a TypeError if a lemma happens to be None
    print(f"{word.text} - {word.lemma} - {word.pos}")
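
To reuse these annotations outside the notebook, the same loop can write one row per word to a CSV file. The sketch below is only an illustration: `doc_to_rows` and `rows_to_csv` are hypothetical helper names, and the code relies solely on the `.sentences`, `.words`, `.text`, `.lemma` and `.pos` attributes already used above, so it is demonstrated on a small hand-built stand-in rather than on real `stanza` output.

```python
import csv
import io
from types import SimpleNamespace


def doc_to_rows(doc):
    """Yield (sentence_index, form, lemma, pos) for every word in `doc`."""
    for sent_index, sent in enumerate(doc.sentences, start=1):
        for word in sent.words:
            # A missing lemma is written as an empty string
            yield (sent_index, word.text, word.lemma or "", word.pos)


def rows_to_csv(rows, handle):
    """Write the rows to an open text file handle, with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["sentence", "form", "lemma", "pos"])
    writer.writerows(rows)


# Hand-built stand-in with the same attributes as an analyzed document
# (illustrative values, not actual stanza output)
demo_doc = SimpleNamespace(sentences=[
    SimpleNamespace(words=[
        SimpleNamespace(text="O", lemma="o", pos="INTJ"),
        SimpleNamespace(text="tempora", lemma="tempus", pos="NOUN"),
        SimpleNamespace(text="!", lemma=None, pos="PUNCT"),
    ]),
])

buffer = io.StringIO()
rows_to_csv(doc_to_rows(demo_doc), buffer)
print(buffer.getvalue())
```

With the real document, the same call would take `doc_to_rows(catilinaires_analyzed)` and a file opened with `open(path, "w", newline="", encoding="utf-8")` instead of the in-memory buffer.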

# Let's try with a (much bigger) text.

As the text we're going to use is very long, we will process it in batches: the text is cut into chunks of 50 paragraphs, which are analyzed one after another. This keeps the memory needed for each call (notably on the GPU) under control.

In [2]:
def batch_process(text, nlp, batch_size=50):
    """Run `nlp` on `text` in chunks of `batch_size` paragraphs (lines)."""
    paragraphs = text.split('\n')
    batches = [paragraphs[i:i + batch_size] for i in range(0, len(paragraphs), batch_size)]

    words = []

    for batch in batches:
        batch_text = '\n'.join(batch)
        # Skip chunks that contain only blank lines
        if not batch_text.strip():
            continue
        doc = nlp(batch_text)
        for sentence in doc.sentences:
            for word in sentence.words:
                # Keep only the words for which a lemma was predicted
                if word.lemma is not None:
                    words.append({
                        "word": word.text,
                        "lemma": word.lemma,
                        "pos": word.pos,
                    })

    return words
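
To see what the chunking does without loading a model, the splitting step can be isolated and run on a toy text. This is a sketch only: `chunk_paragraphs` is a hypothetical helper that mirrors the slicing in `batch_process`, and the sample text is made up for the illustration.

```python
def chunk_paragraphs(text, batch_size=50):
    """Split `text` on newlines and regroup the lines into chunks."""
    paragraphs = text.split("\n")
    for start in range(0, len(paragraphs), batch_size):
        # Each chunk is re-joined into a single string, ready for the pipeline
        yield "\n".join(paragraphs[start:start + batch_size])


# Toy text with 5 one-line paragraphs (made up for the illustration)
toy_text = "\n".join(f"Paragraph {n}." for n in range(1, 6))

chunks = list(chunk_paragraphs(toy_text, batch_size=2))
for number, chunk in enumerate(chunks, start=1):
    print(f"chunk {number}: {chunk!r}")
```

With `batch_size=2`, the five paragraphs end up in three chunks, the last one holding a single paragraph. Joining the chunks with `"\n"` gives back the original text, so nothing is lost at the boundaries.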

In [None]:
import string
import stanza
stanza.download('fr')

Let's download _Les Misérables_.

In [None]:
!wget https://raw.githubusercontent.com/ABC-DH/EnExDi2024/main/materials/3_NLP/miserables.txt

In [5]:
filepath_of_text = "/content/miserables.txt"

In [6]:
with open(filepath_of_text, encoding="utf-8") as f:
    full_text = f.read()

In [None]:
nlp_stanza = stanza.Pipeline(lang='fr', processors='tokenize,mwt,pos,lemma')

This part may take some time (for me it took about 4 minutes).

In [8]:
miserables_analyzed = batch_process(full_text, nlp_stanza)

In [None]:
print(miserables_analyzed[5:15])
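
Because the analysis takes several minutes, it can be worth saving the result so that later sessions can reload it instead of recomputing it. The following is a minimal sketch using the standard `json` module. The helper names `save_tokens` and `load_tokens`, the file name and the two sample tokens are assumptions made for the illustration.

```python
import json
import os
import tempfile


def save_tokens(tokens, path):
    """Save a list of {"word", "lemma", "pos"} dictionaries as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps accented characters readable in the file
        json.dump(tokens, f, ensure_ascii=False)


def load_tokens(path):
    """Load the list of token dictionaries written by save_tokens."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Two hand-written tokens in the format returned by batch_process
# (illustrative values, not actual stanza output)
sample_tokens = [
    {"word": "misérables", "lemma": "misérable", "pos": "NOUN"},
    {"word": "étaient", "lemma": "être", "pos": "AUX"},
]

# Hypothetical file name, written to a temporary directory for the demo
cache_path = os.path.join(tempfile.mkdtemp(), "miserables_analyzed.json")
save_tokens(sample_tokens, cache_path)
reloaded = load_tokens(cache_path)
print(reloaded == sample_tokens)
```

In the notebook, calling `save_tokens(miserables_analyzed, ...)` with a path of your choice would play the same role.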

For your own projects, you can find plenty of stopword lists [here](https://github.com/stopwords-iso).


In [None]:
!wget https://raw.githubusercontent.com/ABC-DH/EnExDi2024/main/materials/3_NLP/stopwords_fr.txt

In [51]:
with open("/content/stopwords_fr.txt", encoding="utf8") as f:
    stopwords = set(f.read().split("\n"))  # a set makes membership tests fast

In [52]:
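forms = []
lemmas = []
no_stop = []

for token in miserables_analyzed:
    form = token["word"]
    lemma = token["lemma"]

    # Skip punctuation using the POS tag: `lemma not in string.punctuation`
    # is a substring test and lets marks such as « or … through
    if token["pos"] == "PUNCT":
        continue

    forms.append(form)
    lemmas.append(lemma)

    # Compare in lowercase, assuming the stopword list is lowercase
    if lemma.lower() not in stopwords:
        no_stop.append(lemma)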

In [None]:
len(lemmas)

In [None]:
len(no_stop)
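
The lengths above count running words. Counting distinct items instead shows what lemmatization does to the vocabulary: several inflected forms collapse into one lemma. The toy lists below are hand-written for the illustration (they are not taken from the analysis), and `vocabulary_size` is a hypothetical helper.

```python
def vocabulary_size(items):
    """Number of distinct items, ignoring case."""
    return len({item.lower() for item in items})


# Hand-written toy data: four occurrences of the verb "aimer", two of "être"
toy_forms = ["aime", "aimait", "aimé", "Aime", "est", "était"]
toy_lemmas = ["aimer", "aimer", "aimer", "aimer", "être", "être"]

print("distinct forms: ", vocabulary_size(toy_forms))
print("distinct lemmas:", vocabulary_size(toy_lemmas))
```

On the real data, comparing `vocabulary_size(forms)` with `vocabulary_size(lemmas)` gives the same kind of picture at the scale of the whole novel.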

And now we're going to build some basic representations of our text to see why preprocessing is important.

In [15]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np

def create_word_cloud(words_list, title):
    """Draw a circular word cloud from a list of words."""
    text = ' '.join(words_list)

    # Circular mask: 255 (white) outside the circle, 0 inside.
    # WordCloud does not draw on the white areas of a mask.
    radius = 495
    diameter = radius * 2
    center = radius
    x, y = np.ogrid[:diameter, :diameter]
    mask = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    mask = 255 * mask.astype(int)

    # RGBA version of the mask: opaque inside the circle, transparent outside
    mask_rgba = np.dstack((mask, mask, mask, 255 - mask))

    wordcloud = WordCloud(repeat=False, width=diameter, height=diameter,
                          background_color=None, mode="RGBA", colormap='plasma',
                          mask=mask_rgba).generate(text)

    plt.figure(figsize=(10, 10))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title(title)
    plt.axis('off')
    plt.show()

In [None]:
create_word_cloud(forms, 'Word Cloud for Forms')

In [None]:
create_word_cloud(lemmas, 'Word Cloud for Lemmas')

In [None]:
create_word_cloud(no_stop, 'Word Cloud for Lemmas without stopwords')
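
A word cloud is easier to interpret next to the numbers behind it. As a sketch, `collections.Counter` gives the most frequent items of any of the three lists. The function name `top_items` and the toy list are assumptions for the illustration.

```python
from collections import Counter


def top_items(items, n=3):
    """Return the n most frequent items with their counts."""
    return Counter(items).most_common(n)


# Hand-written toy list standing in for `lemmas`
toy_list = ["le", "homme", "le", "être", "le", "homme", "jour"]

print(top_items(toy_list, n=2))
# → [('le', 3), ('homme', 2)]
```

Running `top_items(lemmas, 20)` and `top_items(no_stop, 20)` on the real lists should show how much of the first ranking is taken up by function words.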