
# Basic text processing

## Imports

In [9]:
!pip install nltk spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [10]:
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
from nltk.tokenize import word_tokenize

In [11]:
# Download the NLTK resources (tokenizer models and stopword lists)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

# Load the spaCy model for English
nlp = spacy.load("en_core_web_sm")

# Example text
texto = "Natural Language Processing is amazing! It includes tokenization, stemming, and lemmatization."

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


## Tokenization

### With NLTK

In [12]:
tokens_nltk = word_tokenize(texto.lower())  # lowercase the text, then tokenize it
print("Tokens NLTK:", tokens_nltk)

Tokens NLTK: ['natural', 'language', 'processing', 'is', 'amazing', '!', 'it', 'includes', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']


### With spaCy

In [13]:
doc = nlp(texto.lower())
tokens_spacy = [token.text for token in doc]
print("Tokens spaCy:", tokens_spacy)

Tokens spaCy: ['natural', 'language', 'processing', 'is', 'amazing', '!', 'it', 'includes', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']
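
For this sentence, both tokenizers lowercase nothing themselves (the text is lowercased first) and both split punctuation marks off as separate tokens. As a rough illustration of that behavior, here is a minimal regex-based sketch. It is not how NLTK or spaCy work internally, and `simple_tokenize` is a hypothetical helper introduced only for this example.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text.lower())

texto = "Natural Language Processing is amazing! It includes tokenization, stemming, and lemmatization."
tokens_regex = simple_tokenize(texto)
print("Regex tokens:", tokens_regex)
```

On this sentence the sketch yields the same 15 tokens as the two cells above. It diverges on harder input: for example, it splits "don't" into three pieces, whereas `word_tokenize` follows Penn Treebank conventions for contractions.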


## Stopword removal

In [14]:
stop_words = set(stopwords.words('english'))

# Keep tokens that are neither English stopwords nor punctuation marks
tokens_sem_stopwords = [t for t in tokens_nltk if t not in stop_words and t not in string.punctuation]
print("Without stopwords:", tokens_sem_stopwords)

Without stopwords: ['natural', 'language', 'processing', 'amazing', 'includes', 'tokenization', 'stemming', 'lemmatization']
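
One detail of the filter above: `t not in string.punctuation` is a substring test against a single string, so it works for one-character tokens such as `!` but lets multi-character punctuation tokens such as `...` through. The sketch below shows the difference with a character-by-character check. `is_punctuation` and `sample_tokens` are hypothetical names used only for illustration.

```python
import string

def is_punctuation(token):
    """Return True when the token is non-empty and every character is ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

# Hypothetical tokens, not taken from the notebook's text
sample_tokens = ["wait", "...", "--", "!", "really"]

# Substring test: "..." and "--" are not substrings of string.punctuation, so they survive
kept_substring = [t for t in sample_tokens if t not in string.punctuation]

# Character-by-character test: every all-punctuation token is dropped
kept_charwise = [t for t in sample_tokens if not is_punctuation(t)]

print("Substring test:", kept_substring)
print("Character test:", kept_charwise)
```

For the example sentence in this notebook both approaches give the same result, because every punctuation token is a single character.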


## Stemming with Porter

In [15]:
stemmer = PorterStemmer()
tokens_stem = [stemmer.stem(t) for t in tokens_sem_stopwords]
print("Stemming:", tokens_stem)

Stemming: ['natur', 'languag', 'process', 'amaz', 'includ', 'token', 'stem', 'lemmat']


## Lemmatization with spaCy

In [16]:
lemmas = [token.lemma_ for token in doc if token.text not in stop_words and token.text not in string.punctuation]
print("Lemmatization:", lemmas)

Lemmatization: ['natural', 'language', 'processing', 'amazing', 'include', 'tokenization', 'stemming', 'lemmatization']
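
To make the contrast between stemming and lemmatization easier to see, the sketch below lines up the outputs recorded in this notebook. The three lists are copied from the printed results, not recomputed, and the `recorded_*` names are introduced only for this comparison.

```python
# Values copied from the outputs recorded in this notebook
recorded_tokens = ['natural', 'language', 'processing', 'amazing', 'includes', 'tokenization', 'stemming', 'lemmatization']
recorded_stems = ['natur', 'languag', 'process', 'amaz', 'includ', 'token', 'stem', 'lemmat']
recorded_lemmas = ['natural', 'language', 'processing', 'amazing', 'include', 'tokenization', 'stemming', 'lemmatization']

# Print one row per token: original, stem, lemma
print(f"{'token':<15}{'stem':<10}lemma")
for token, stem, lemma in zip(recorded_tokens, recorded_stems, recorded_lemmas):
    print(f"{token:<15}{stem:<10}{lemma}")

# Pairs where each method changed the token
changed_by_stemmer = [(t, s) for t, s in zip(recorded_tokens, recorded_stems) if t != s]
changed_by_lemmatizer = [(t, l) for t, l in zip(recorded_tokens, recorded_lemmas) if t != l]
print("Changed by stemmer:", len(changed_by_stemmer))
print("Changed by lemmatizer:", changed_by_lemmatizer)
```

In these recorded results the Porter stemmer truncates all eight tokens, often to non-words such as `languag`, while the lemmatizer changes only `includes` to `include` and leaves dictionary forms intact.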


## Simple normalization

In [17]:
# Lowercase the text and delete every ASCII punctuation character
texto_normalizado = texto.lower().translate(str.maketrans('', '', string.punctuation))
print("Normalized:", texto_normalizado)

Normalized: natural language processing is amazing it includes tokenization stemming and lemmatization
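
Deleting punctuation with `str.translate` works for this sentence because every punctuation mark is followed by a space or ends the text. When punctuation sits between two words, deletion glues them together. A possible variant, sketched below under that assumption, maps punctuation to spaces and then collapses the whitespace. `normalize_with_spaces` and `sample` are hypothetical names for illustration.

```python
import string

def normalize_with_spaces(text):
    """Lowercase, replace ASCII punctuation with spaces, and collapse whitespace."""
    # Map each punctuation character to a space instead of deleting it
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # split() with no arguments drops leading/trailing whitespace and collapses runs
    return " ".join(text.lower().translate(table).split())

sample = "State-of-the-art NLP,fast!"  # hypothetical input

# Deletion, as in the cell above
deleted = sample.lower().translate(str.maketrans('', '', string.punctuation))
# Replacement with spaces
spaced = normalize_with_spaces(sample)

print("Deleted:", deleted)
print("Spaced: ", spaced)
```

Which behavior is preferable depends on the task: deletion keeps hyphenated terms as one unit, while replacement keeps their parts as separate words.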


## Result

In [18]:
print("Original text:", texto)
print("Tokens:", tokens_nltk)
print("Without stopwords:", tokens_sem_stopwords)
print("Stemming:", tokens_stem)
print("Lemmatization:", lemmas)
print("Normalized:", texto_normalizado)

Original text: Natural Language Processing is amazing! It includes tokenization, stemming, and lemmatization.
Tokens: ['natural', 'language', 'processing', 'is', 'amazing', '!', 'it', 'includes', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', '.']
Without stopwords: ['natural', 'language', 'processing', 'amazing', 'includes', 'tokenization', 'stemming', 'lemmatization']
Stemming: ['natur', 'languag', 'process', 'amaz', 'includ', 'token', 'stem', 'lemmat']
Lemmatization: ['natural', 'language', 'processing', 'amazing', 'include', 'tokenization', 'stemming', 'lemmatization']
Normalized: natural language processing is amazing it includes tokenization stemming and lemmatization
