<a href="https://colab.research.google.com/github/Felipe-Oliveira11/NLP-Preprocessing/blob/master/Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Tokenization 

Given a character sequence and a defined document unit, tokenization is the task of splitting it into pieces, called tokens, possibly discarding certain characters, such as punctuation, at the same time.




<br>
<hr>

In [1]:
!pip install nltk
!pip install spacy 
!pip install gensim



In [2]:
import os 
import re
import string
import random 
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

import spacy
from spacy.lang.en import English

import gensim
from gensim.utils import tokenize
# gensim.summarization was removed in gensim 4.0; this import requires gensim < 4.0
from gensim.summarization.textcleaner import split_sentences


from tensorflow.keras.preprocessing.text import text_to_word_sequence


%matplotlib inline 
import warnings
warnings.filterwarnings('ignore')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


<br>

### Split Tokenization 




In [3]:
# Tokenization with split 
text = 'Today will rain, but I had get my iphone'
text.split()

['Today', 'will', 'rain,', 'but', 'I', 'had', 'get', 'my', 'iphone']

In [4]:
# Sentence-style tokenization: split on commas
text = 'Today will rain, but, I had get my iphone'
text.split(',')

['Today will rain', ' but', ' I had get my iphone']
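Splitting on whitespace leaves punctuation attached to words (`'rain,'` in the first output above). A minimal sketch, assuming that leading and trailing punctuation should simply be discarded, strips it from each token with the standard `string.punctuation` set. The name `split_tokens` is illustrative only.

```python
import string

text = 'Today will rain, but I had get my iphone'

# Split on whitespace, then strip punctuation from both ends of each token.
split_tokens = [word.strip(string.punctuation) for word in text.split()]

# Drop tokens that consisted only of punctuation and are now empty.
split_tokens = [word for word in split_tokens if word]

print(split_tokens)
```

Punctuation inside a token (for example an apostrophe in a contraction) is left untouched by this approach.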

<br>
<hr>
<br>


### Regex 


A regular expression (regex) is a special sequence of characters that describes a pattern, which is used to find strings or sets of strings.



* The <b>findall()</b> function finds every substring that matches the given pattern and returns them in a list. <b>\w</b> matches any word character, meaning alphanumerics (letters, digits) and the underscore (_), and <b>+</b> means one or more repetitions.


* So <b>"[\w]+"</b> tells the code to collect each run of word characters, stopping as soon as any other character is found.



In [5]:
text = 'Today will rain, but I had get my iphone'

# raw string, so the backslash in \w reaches the regex engine unchanged
tokens = re.findall(r"[\w]+", text)
tokens

['Today', 'will', 'rain', 'but', 'I', 'had', 'get', 'my', 'iphone']
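To make the behaviour of `\w+` more concrete, the sketch below runs it on a made-up string that contains an underscore, digits, an apostrophe and a hyphen. The second pattern is an assumed variant, not part of the original notebook, that also keeps internal apostrophes and hyphens inside one token.

```python
import re

sample = "user_name42 isn't e-mailing"

# \w+ keeps letters, digits and underscores together,
# but breaks at apostrophes and hyphens.
word_runs = re.findall(r"\w+", sample)
print(word_runs)

# Assumed variant: allow an apostrophe or hyphen between two runs of word characters.
extended = re.findall(r"\w+(?:['-]\w+)*", sample)
print(extended)
```

Which of the two is preferable depends on whether contractions and hyphenated words should count as single tokens in the task at hand.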

<br>


* With <b>compile()</b> we pass the characters at which the text is split: a new piece starts whenever one of these characters, followed by a space, is found.





In [6]:
# Sentence token 
text = 'Today will rain, but, I had get my iphone '

sentences = re.compile('[,!?] ').split(text)
sentences

['Today will rain', 'but', 'I had get my iphone ']
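The pattern above only splits when the punctuation mark is followed by a space, and the last piece keeps its trailing space. A possible refinement, sketched here under the assumption that surrounding whitespace and empty pieces are unwanted, compiles the pattern once and reuses it in a helper. The function name `split_clauses` and the second sample string are hypothetical.

```python
import re

# Compile once, reuse for many texts: a delimiter is , ! or ? plus any following whitespace.
delimiter = re.compile(r"[,!?]\s*")

def split_clauses(raw):
    # Strip whitespace around each piece and drop the empty piece a trailing delimiter leaves behind.
    return [piece.strip() for piece in delimiter.split(raw) if piece.strip()]

print(split_clauses('Today will rain, but, I had get my iphone '))
print(split_clauses('Really?No way!'))
```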

<br>

### NLTK 

The Natural Language Toolkit (NLTK) includes a <b>tokenize</b> module that offers two kinds of tokenization:

* Word tokenize: splits a sentence into tokens (words) with <b>word_tokenize()</b>

* Sentence tokenize: splits a document or paragraph into sentences with <b>sent_tokenize()</b>

In [7]:
# word tokenize 
text = 'Founded in 2002, SpaceX mission is to enable humans to become a spacefaring civilization'
word_tokenize(text)

['Founded',
 'in',
 '2002',
 ',',
 'SpaceX',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization']

In [8]:
# number of tokens
len(word_tokenize(text))

15

In [9]:
# sentence tokenize 
text = """Founded in 2002, SpaceX mission is to enable humans to become a spacefaring civilization. tesla is
other great company at san francisco bay area"""
sent_tokenize(text)

['Founded in 2002, SpaceX mission is to enable humans to become a spacefaring civilization.',
 'tesla is\nother great company at san francisco bay area']

In [10]:
len(sent_tokenize(text))

2
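For comparison, the two NLTK calls can be roughly approximated with plain regular expressions. This is only a sketch using the standard library, not how NLTK works internally: NLTK's tokenizers handle cases this version does not, such as abbreviations and contractions. The names `approx_words` and `approx_sentences` are illustrative.

```python
import re

text = """Founded in 2002, SpaceX mission is to enable humans to become a spacefaring civilization. tesla is
other great company at san francisco bay area"""

# Word level: a run of word characters, or any single non-space symbol as its own token.
approx_words = re.findall(r"\w+|[^\w\s]", text)

# Sentence level: break at whitespace that directly follows ., ! or ?
approx_sentences = re.split(r"(?<=[.!?])\s+", text)

print(approx_words[:5])
print(approx_sentences)
```

On this text the sentence split agrees with the `sent_tokenize()` output above, but a string such as "Dr. Smith arrived." would be cut after "Dr." by the regex.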

<br>
<hr>
<br>


### Spacy 




In [11]:
# load the English tokenizer
nlp = English()

In [12]:
text = """Founded in 2002, SpaceX mission is to enable humans to become a spacefaring civilization. tesla is
other great company at san francisco bay area"""

In [13]:
# the nlp object is used to create a document with linguistic annotations

my_doc = nlp(text)

# build the list of word tokens

token_list = []
for token in my_doc:
  token_list.append(token.text)


token_list

['Founded',
 'in',
 '2002',
 ',',
 'SpaceX',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 '.',
 'tesla',
 'is',
 '\n',
 'other',
 'great',
 'company',
 'at',
 'san',
 'francisco',
 'bay',
 'area']

In [14]:
# sentence tokenization with spaCy

nlp = English()

# add the rule-based sentencizer component to the pipeline
# (spaCy 3.x API; spaCy 2.x used nlp.add_pipe(nlp.create_pipe('sentencizer')))
nlp.add_pipe('sentencizer')

doc = nlp(text)

sent_list = []

for sentence in doc.sents:
  sent_list.append(sentence.text)

sent_list

['Founded in 2002, SpaceX mission is to enable humans to become a spacefaring civilization.',
 'tesla is\nother great company at san francisco bay area']

<br>
<hr>
<br>


### TensorFlow and Keras



In [15]:
text = """Founded in 2002, SpaceX mission is to enable humans to become a spacefaring civilization. tesla is
other great company at san francisco bay area"""

In [16]:
token = text_to_word_sequence(text)
token

['founded',
 'in',
 '2002',
 'spacex',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 'tesla',
 'is',
 'other',
 'great',
 'company',
 'at',
 'san',
 'francisco',
 'bay',
 'area']

In [17]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [18]:
# sentences
sentences = [
           'I love my dog',
           'I love my cat',
           'I love my Horse'
           ]

In [19]:
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5, 'horse': 6}
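The word index printed above ranks words by frequency, starting at 1. The sketch below rebuilds the same mapping with the standard library only, to show the idea. It is a simplification rather than the Keras implementation: it assumes that `\w+` on lowercased text is a good enough word splitter, and names such as `simple_words` and `sketch_index` are hypothetical.

```python
import re
from collections import Counter

sentences = ['I love my dog', 'I love my cat', 'I love my Horse']

def simple_words(sentence):
    # Lowercase, then keep runs of letters, digits and underscores (a simplification).
    return re.findall(r"\w+", sentence.lower())

# Count every word across the corpus; Counter keeps first-seen order.
counts = Counter(word for s in sentences for word in simple_words(s))

# The most frequent word gets index 1; ties keep first-seen order because sorted() is stable.
ranked = sorted(counts.items(), key=lambda item: -item[1])
sketch_index = {word: rank for rank, (word, _) in enumerate(ranked, start=1)}
print(sketch_index)

# Map a new sentence to integer ids, skipping words that were never seen (an assumption).
encoded = [sketch_index[w] for w in simple_words('I love my parrot') if w in sketch_index]
print(encoded)
```

Index 0 is left unused here, which matches the printed word index, where numbering starts at 1.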


<br>
<hr>
<br>


### Gensim 



In [20]:
text = """Founded in 2002, SpaceX mission is to enable humans to become a spacefaring civilization. tesla is
other great company at san francisco bay area"""

In [21]:
# word tokenizer 
token = list(tokenize(text))
token

['Founded',
 'in',
 'SpaceX',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 'tesla',
 'is',
 'other',
 'great',
 'company',
 'at',
 'san',
 'francisco',
 'bay',
 'area']

In [22]:
# sentence tokenizer 
sentence_token = split_sentences(text)
sentence_token

['Founded in 2002, SpaceX mission is to enable humans to become a spacefaring civilization.',
 'tesla is',
 'other great company at san francisco bay area']
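In the output above, the line break inside the second sentence produces a third piece. A minimal sketch, assuming that break is unintended, collapses every whitespace run into a single space before the text is handed to a sentence splitter. The name `normalized` is illustrative.

```python
text = """Founded in 2002, SpaceX mission is to enable humans to become a spacefaring civilization. tesla is
other great company at san francisco bay area"""

# str.split() with no arguments splits on any whitespace run, including newlines,
# so joining the parts with single spaces removes the line break.
normalized = " ".join(text.split())

print(normalized)
```

The same normalization would also remove the standalone `'\n'` token seen in the spaCy word list earlier, since the newline no longer exists in the input.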