## 1. Text Preprocessing

### Description:
Text preprocessing is the first step in NLP. It involves cleaning and transforming raw text data into a format that can be easily analyzed by machine learning algorithms. Common tasks include tokenization, lemmatization, and punctuation removal.

### Internal Topics:
- Tokenization
- Lemmatization
- Punctuation Removal
- Handling Multilingual Text
- Dealing with Text with Spelling Errors

### Exercise:
Preprocess a dataset of movie reviews by removing stop words and applying lemmatization using spaCy.

---

## Tokenization

Tokenization is a form of segmentation applied to documents or sentences: it consists of **breaking up text into smaller chunks**, called tokens.  
As a result, a single token carries less information than the document or sentence it was taken from.

<img src="https://miro.medium.com/v2/resize:fit:1400/1*PZYP2nL6Zc_jpkaHLRxLQQ.png" alt="example" width="50%">


As the previous image shows, the text is split into four parts, one of which is a punctuation mark (!). This is a specific type of tokenization in which punctuation marks are treated as tokens; whether to do so depends on the problem being solved with NLP.  
There are several kinds of tokenization, including the following:
1. [Word tokenization](#1.-Word-tokenization)
2. Subword tokenization
3. Morphological tokenization
4. Character tokenization
5. Emoji-aware tokenization
6. Byte-based tokenization
7. Sentence pieces (*N-grams*)
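
Some of the schemes listed above can be sketched in plain Python. The snippet below is only an illustration on a made-up sentence (the `text` value and the `ngrams` helper are assumptions, not part of the dataset or of any library). Subword and morphological tokenization are left out because they need a trained vocabulary or a linguistic analyzer.

```python
text = "Hi 👋 NLP!"

# Word-level: split on whitespace (punctuation stays attached to words).
word_tokens = text.split()

# Character-level: every Unicode character, emoji included, is a token.
char_tokens = list(text)

# Byte-level: tokens are the UTF-8 byte values (integers in 0-255).
byte_tokens = list(text.encode("utf-8"))


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Word bigrams: a sliding window of size 2 over the word tokens.
bigrams = ngrams(word_tokens, 2)

print(word_tokens)
print(char_tokens)
print(len(char_tokens), len(byte_tokens))
print(bigrams)
```

Note that the emoji is a single character token but takes several bytes, so the character-level and byte-level sequences have different lengths.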

No single *tokenizer* is best for every problem, so choosing one is a critical decision: it strongly influences the later steps of an *NLP pipeline*.  
The next image shows an NLP system / pipeline for a simple classification task based on a *corpus* of tweets.

<img src="https://almablog-media.s3.ap-south-1.amazonaws.com/NLP_Pipeline_deca032413.png" alt="example" width="50%">


### 1. Word tokenization

The data used for this section is from Kaggle and is available [here](https://www.kaggle.com/datasets/gpreda/pfizer-vaccine-tweets/data).

The most basic tokenizer splits text into words using only the blank spaces between them. Punctuation marks can then be treated as separate tokens, ignored, or kept attached to specific words; the right choice depends on the problem.

#### Just Python

In [1]:
import pandas as pd

RANDOM_STATE = 44

In [2]:
df = pd.read_csv("data/vaccination_tweets.csv")

In [3]:
chain = df.sample(1, random_state=RANDOM_STATE)['text'].values[0]

In [4]:
print(chain)

What is driving protests in 'calm,consensus driven' #Netherlands? Resident @joezl looks at confluence of multiple e… https://t.co/hcjJrD93uO


In [14]:
# Split on single spaces only; punctuation stays attached to the words
tokens = str(chain).split(sep=" ")

In [15]:
print(tokens)

['What', 'is', 'driving', 'protests', 'in', "'calm,consensus", "driven'", '#Netherlands?', 'Resident', '@joezl', 'looks', 'at', 'confluence', 'of', 'multiple', 'e…', 'https://t.co/hcjJrD93uO']


In [16]:
type(tokens[0])

str
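
The whitespace split above keeps punctuation attached to the words. As a minimal sketch of the other options (punctuation as separate tokens, or ignored), the standard `re` module can be used. The sample `text` and the two regular expressions are illustrative assumptions, not a recommended tokenizer.

```python
import re

# Made-up sample with a double space after the comma.
text = "Calm,  consensus-driven? Yes!"

# split(" ") cuts at every single space, so consecutive spaces
# produce empty-string tokens.
by_single_space = text.split(" ")

# split() with no argument cuts at runs of any whitespace instead.
by_whitespace = text.split()

# Option 1: keep each punctuation mark as its own token.
with_punct = re.findall(r"\w+|[^\w\s]", text)

# Option 2: ignore punctuation and keep only runs of word characters.
without_punct = re.findall(r"\w+", text)

print(by_single_space)
print(by_whitespace)
print(with_punct)
print(without_punct)
```

Which variant is appropriate depends on the task: for tweets, for example, splitting on every non-word character would also break hashtags, mentions, and URLs apart.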

#### spaCy

In [7]:
import spacy 

In [17]:
# Blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")

doc = nlp(chain)

# A Doc is iterable; collect its Token objects into a list
tokens = list(doc)

print(tokens)

[What, is, driving, protests, in, ', calm, ,, consensus, driven, ', #, Netherlands, ?, Resident, @joezl, looks, at, confluence, of, multiple, e, …, https://t.co/hcjJrD93uO]


In [18]:
type(tokens[0])

spacy.tokens.token.Token

The main differences are the type of the tokens (`str` vs. `spacy.tokens.token.Token`) and how the text is interpreted.  
Even with a blank pipeline, spaCy still splits off punctuation marks such as commas and apostrophes as separate tokens. Compare the tokens produced for `"'calm,consensus"` and `"driven'"` in both approaches. spaCy also splits `#Netherlands?` into `#`, `Netherlands`, and `?`, while keeping `@joezl` and the URL as single tokens.
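
Because spaCy tokens are `Token` objects and not plain strings, they carry lexical attributes that can be used for filtering. The following is a small sketch on a made-up sentence (the sample text is an assumption); it relies on the `Token.text`, `Token.is_punct`, and `Token.like_url` attributes, which are available with a blank pipeline.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
doc = nlp("Hello, world! See https://example.com")

# Token objects expose lexical attributes that plain strings lack.
for token in doc:
    print(token.text, token.is_punct, token.like_url)

# Keep only the tokens that are neither punctuation nor URLs.
words = [
    token.text
    for token in doc
    if not token.is_punct and not token.like_url
]
print(words)
```

With the pure Python approach, the same filtering would have to be written by hand, for example with regular expressions.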