# 1. Tokenization

Tokenization is the process of dividing a text into smaller linguistic units known as "tokens". These tokens are often words, but can include punctuation, numbers, and other symbols depending on the application. Tokenization is a fundamental first step in text preprocessing for NLP.

There are different tokenization approaches, from simple space separation to more complex methods that consider linguistic rules or use statistical models. Examples:

1. Tokenization by spaces: Split the text at every space character (" ").
2. Tokenization by punctuation: Also separate punctuation marks into independent tokens.
3. Rule-based and statistical tokenization (character n-grams, subword tokenization, etc.): Use more advanced techniques to handle languages with many compound words or without spaces between words, or to reduce vocabulary size in deep learning applications (e.g. byte-pair encoding, BPE).
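To make the difference between the first two approaches concrete, here is a minimal sketch using only the standard library. The sample sentence and the regular expression are illustrative assumptions, not a reference tokenizer.

```python
import re

text = "Hello, world! Tokenization isn't hard."

# 1. Whitespace tokenization: punctuation stays attached to the words.
space_tokens = text.split()

# 2. Punctuation-aware tokenization: each run of word characters becomes a
#    token, and each remaining non-space character becomes its own token.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

print(space_tokens)
print(punct_tokens)
```

Note that the simple regular expression splits "isn't" into three pieces, which is one reason rule-based tokenizers such as the one in nltk treat contractions specially.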

# Stopwords Removal and Text Normalization

Stopwords are very common words that generally add little meaning to a text, such as "o", "a", "de", "em" in Portuguese, or "the", "a", "and", "of" in English. Removing stopwords can help reduce the dimensionality of the text representation and focus on the most relevant words.
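As a rough illustration, stopword removal is just a membership test against a set. The `STOPWORDS` set and the `remove_stopwords` helper below are hypothetical names with a deliberately tiny word list. In practice a curated list such as the one shipped with nltk would be used.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "and", "of", "is", "on"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens whose lowercase form is not in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["The", "cat", "is", "on", "the", "roof", "of", "the", "house"]
filtered = remove_stopwords(tokens)
print(filtered)
```

Using a set rather than a list keeps each lookup constant-time, which matters once the corpus is large.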

Text normalization includes techniques such as:
- Lemmatization: Reducing words to their canonical form (lemma). Ex: "run", "running", "ran" → "run".
- Stemming: Reducing words to their stem, usually by cutting suffixes. The result is not always a real word. Ex: "connected", "connecting", "connection" → "connect".
- Lowercase conversion: Converting all words to lowercase.
- Removing punctuation and special symbols: Cleaning the text, removing characters that do not carry semantic meaning.
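The normalization steps above can be sketched in a few lines. The `normalize` and `toy_stem` helpers are hypothetical names introduced here. `toy_stem` is a crude suffix stripper meant only to show the idea of stemming and is not the Porter algorithm or any library's stemmer.

```python
import re

def normalize(text):
    """Lowercase the text and drop punctuation and special symbols."""
    text = text.lower()
    return re.sub(r"[^\w\s]", "", text)

def toy_stem(word, suffixes=("ions", "ion", "ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = normalize("Connected, connecting; CONNECTIONS!").split()
stems = [toy_stem(w) for w in words]
print(words)
print(stems)
```

Lemmatization is not sketched here because it needs a dictionary or a morphological model to map forms such as "ran" to "run", which simple suffix cutting cannot do.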

# N-grams

N-grams are contiguous sequences of N tokens within a text. For example, for the sentence "This is an example":
- Unigrams (N=1): ["This", "is", "an", "example"]
- Bigrams (N=2): ["This is", "is an", "an example"]
- Trigrams (N=3): ["This is an", "is an example"]

They are useful for capturing context: while unigrams only consider individual words, bigrams and trigrams can capture immediate relationships between adjacent words.
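N-gram extraction amounts to sliding a window of size N over the token list. The sketch below uses a hypothetical helper, `make_ngrams`, written in plain Python. It returns tuples, the same shape that `nltk.util.ngrams` yields.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token windows as a list of tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "This is an example".split()
unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A text with T tokens yields T - N + 1 n-grams, so the helper returns an empty list when N is larger than the number of tokens.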

# Practice

Below is an example of preprocessing in Python using libraries such as re (regular expressions) for cleaning, and nltk for tokenization and stopword removal. Text normalization, n-gram generation and other steps will also be illustrated.

In [None]:
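import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# Download the tokenizer models and the stopword lists if they are missing.
# Recent NLTK releases look for 'punkt_tab'; older ones use 'punkt'.
for resource in ("punkt", "punkt_tab", "stopwords"):
    nltk.download(resource, quiet=True)

text = "This is a simple example of text. The main objective is to clean some tokens and form n-grams."

# Normalize: convert to lowercase
text = text.lower()

# Remove everything that isn't a word character or whitespace
# (note that this also turns "n-grams" into "ngrams")
text = re.sub(r'[^\w\s]', '', text)

# Split the cleaned text into tokens
tokens = word_tokenize(text, language='english')

# Drop English stopwords
stopwords_english = set(stopwords.words('english'))
tokens_no_stop = [t for t in tokens if t not in stopwords_english]

# Build bigrams and trigrams from the remaining tokens
bigrams = list(ngrams(tokens_no_stop, 2))
trigrams = list(ngrams(tokens_no_stop, 3))

print("Tokens:", tokens)
print("Tokens without stopwords:", tokens_no_stop)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)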