# Concepts of Tokenization

Tokenization is the task of dividing a given text into units called tokens. What counts as a token depends on the situation, but a token is usually defined as a meaningful unit.

In general, a token can be as large as a sentence or as small as a word.

In most cases, you get better results by relying on existing, well-tested packages than by writing a tokenizer yourself.

Let's see what tokenization means in practice.

Splitting text into words is called word tokenization. We will practice on both English and Korean.

---

# Word Tokenization


## English : Word Tokenization

For English tokenization, we usually use a package called NLTK (Natural Language Toolkit), a package for English natural language processing.
NLTK is preinstalled in Colab, so you can use it right away with import nltk.

NLTK offers a variety of English tokenizers (tools to perform tokenization).
**Tokenization results have slightly different rules for each tokenizer. There is no correct answer to which tokenizer to use.**

Which tokenizer to use is up to you.

### **Tokenizer 1 in NLTK =>  word_tokenize**

In [1]:
import nltk

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

If you look at the sentence below, "Don't" and "Jone's" each contain an apostrophe.

In [3]:
sentence = "Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."

How are "Don't" and "Jone's" tokenized, given the apostrophes?



In [4]:
from nltk.tokenize import word_tokenize
print(word_tokenize(sentence))

['Do', "n't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr.', 'Jone', "'s", 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop', '.']


"Don't" is separated into "Do" and "n't", and "Jone's" is separated into "Jone" and "'s".

---

### **Tokenizer 2 in NLTK => WordPunctTokenizer**


In [5]:
from nltk.tokenize import WordPunctTokenizer
print(WordPunctTokenizer().tokenize(sentence))

['Don', "'", 't', 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr', '.', 'Jone', "'", 's', 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop', '.']


"Don't" is separated into "Don", "'", and "t", and "Jone's" is separated into "Jone", "'", and "s".

**Again, neither result is inherently better. Each tokenizer follows its own rules, so choose one based on what you want to use the tokens for.**
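
The WordPunctTokenizer result above follows from a simple pattern. According to the NLTK documentation, WordPunctTokenizer is a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`, which matches either a run of word characters or a run of punctuation. The sketch below applies that pattern with Python's re module alone, as an illustration of the rule rather than a replacement for NLTK; the variable names are made up for this example.

```python
import re

sentence = "Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation).
pattern = re.compile(r"\w+|[^\w\s]+")
tokens = pattern.findall(sentence)

print(tokens[:3])   # the pieces that "Don't" was split into
print(tokens[-2:])  # the last word and the final period
```

Because the apostrophe is not a word character, it always becomes a token of its own, which explains why "Don't" ends up in three pieces.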

---

### **Tokenizer 3 in NLTK => TreebankWordTokenizer**

Rules of the Penn Treebank tokenizer:

Rule 1. Keep hyphenated words as a single token.  
Rule 2. Split contractions at the clitic, so that "doesn't" becomes "does" and "n't".

In [6]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer=TreebankWordTokenizer()
text = "Starting a home-based restaurant may be an ideal. it doesn't have a food chain or restaurant of their own."
print(tokenizer.tokenize(text))

['Starting', 'a', 'home-based', 'restaurant', 'may', 'be', 'an', 'ideal.', 'it', 'does', "n't", 'have', 'a', 'food', 'chain', 'or', 'restaurant', 'of', 'their', 'own', '.']
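
To make the two rules concrete, here is a toy sketch that implements only those two rules with Python's re module. It is not the actual TreebankWordTokenizer, which applies many more rules (for example, it also separates the sentence-final period, as the output above shows). The function name toy_treebank_split is hypothetical and exists only for this illustration.

```python
import re

def toy_treebank_split(text):
    """Toy illustration of two Penn Treebank rules (not the real tokenizer)."""
    # Rule 2: insert a space before the clitic "n't" so it becomes its own token.
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Rule 1: split on whitespace only, so hyphenated words stay in one piece.
    return text.split()

tokens = toy_treebank_split("it doesn't have a home-based restaurant")
print(tokens)
```

The hyphenated word survives because nothing in the sketch ever splits on a hyphen, while the substitution turns "doesn't" into "does n't" before the whitespace split.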


So far, we have tokenized the given strings into words using three tokenizers provided by NLTK.

The results are all different.

---

### **Tokenizing words by spacing (best avoided, even when it seems to work)**

**In fact, English tokenizes fairly well even if you simply split on spaces.**
Even so, it is better not to tokenize words by spacing. Let's see why.

Let's say there's an English sentence like this.

In [7]:
en_text = "A Dog Run back corner near spare bedrooms!!!!"

First, let's tokenize with NLTK that we learned earlier.

In [8]:
from nltk.tokenize import word_tokenize
print(word_tokenize(en_text))

['A', 'Dog', 'Run', 'back', 'corner', 'near', 'spare', 'bedrooms', '!', '!', '!', '!']


It works fine.

This time, let's tokenize by spacing instead of using NLTK.
Calling .split() on a string splits it on whitespace and returns the pieces as a list.

In [9]:
print(en_text.split())

['A', 'Dog', 'Run', 'back', 'corner', 'near', 'spare', 'bedrooms!!!!']


See? This is the result of word tokenization based on spacing.

In English, NLTK gives you finer-grained tokenization, but splitting on spaces alone already gets you most of the way. Nevertheless, there is a reason to avoid using spacing as the basis.

For example, let's add special characters to an English sentence.

In [10]:
en_text = "A Dog Run back corner near spare bedrooms... bedrooms!!"

Let's tokenize it with NLTK.

In [11]:
from nltk.tokenize import word_tokenize
print(word_tokenize(en_text))

['A', 'Dog', 'Run', 'back', 'corner', 'near', 'spare', 'bedrooms', '...', 'bedrooms', '!', '!']


As you can see, "bedrooms" is separated cleanly even when special characters are attached. But what if you tokenize by spacing?

In [12]:
print(en_text.split())

['A', 'Dog', 'Run', 'back', 'corner', 'near', 'spare', 'bedrooms...', 'bedrooms!!']


"bedrooms" and "..." stay glued together as "bedrooms...", and "bedrooms" and "!!" stay glued together as "bedrooms!!".

To Python, these are all different words.

In [None]:
if 'bedrooms' == 'bedrooms...':
  print('These two are the same.')
else:
  print('These two are different.')

These two are different.


In [None]:
'bedrooms...' == 'bedrooms!!'

False
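
The practical cost of these mismatches shows up as soon as you count words. The sketch below counts tokens with collections.Counter, once after splitting on spaces and once with a simple regular expression that stands in for a punctuation-aware tokenizer. The regular expression is an assumption made to keep the example free of external packages; it is not what word_tokenize does internally.

```python
import re
from collections import Counter

en_text = "A Dog Run back corner near spare bedrooms... bedrooms!!"

# Whitespace tokens keep trailing punctuation glued to the word.
by_space = Counter(en_text.split())

# Stand-in tokenizer (an assumption, not NLTK): words and punctuation
# runs are matched as separate tokens.
by_pattern = Counter(re.findall(r"\w+|[^\w\s]+", en_text))

print(by_space["bedrooms"])    # occurrences found after splitting on spaces
print(by_pattern["bedrooms"])  # occurrences found after separating punctuation
```

With whitespace splitting, the plain word "bedrooms" is never counted at all, because both occurrences carry punctuation.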

You can see that NLTK handles these cases much more carefully.

---

## Korean language : Word Tokenization (KoNLPy)

### Tokenize words based on spacing (just No!)

In English, word boundaries are fairly clear even when you tokenize by spacing. In Korean, tokenization is much harder: because particles (josa) and other endings attach directly to words, splitting on spaces often makes the same word show up as several different tokens.
Tokenizing Korean by spacing is therefore not recommended unless you have a clear experimental purpose. Let's look at an example.

In [13]:
kor_text = "사과의 놀라운 효능이라는 글을 봤어. 그래서 오늘 사과를 먹으려고 했는데 사과가 썩어서 슈퍼에 가서 사과랑 오렌지 사왔어"
print(kor_text.split())

['사과의', '놀라운', '효능이라는', '글을', '봤어.', '그래서', '오늘', '사과를', '먹으려고', '했는데', '사과가', '썩어서', '슈퍼에', '가서', '사과랑', '오렌지', '사왔어']


In the example above, the word "사과" (apple) appears four times, each time with a different particle attached: '의', '를', '가', and '랑'. Unless the particles are removed, the machine treats all four as different words.

In [None]:
'사과' == '사과의'

False

In [14]:
'사과의' == '사과를'

False

In [None]:
'사과를' == '사과가'

False

In [None]:
'사과가' == '사과랑'

False
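
To see what removing the particles buys you, here is a deliberately naive sketch that strips a hand-picked list of four particles from the end of each token and then counts the result. The particle list and the helper strip_particle are hypothetical and only cover this one sentence; a real morphological analyzer is needed in general, because a suffix rule like this would also cut words that merely happen to end in one of these syllables.

```python
from collections import Counter

kor_text = "사과의 놀라운 효능이라는 글을 봤어. 그래서 오늘 사과를 먹으려고 했는데 사과가 썩어서 슈퍼에 가서 사과랑 오렌지 사왔어"

# Hand-picked particles for this example only (an assumption, not a full list).
PARTICLES = ("의", "를", "가", "랑")

def strip_particle(token):
    """Remove one trailing particle from the token, if present."""
    for particle in PARTICLES:
        if token.endswith(particle) and len(token) > len(particle):
            return token[:-len(particle)]
    return token

raw_counts = Counter(kor_text.split())
stripped_counts = Counter(strip_particle(t) for t in kor_text.split())

print(raw_counts["사과"])       # count of the bare word after splitting on spaces
print(stripped_counts["사과"])  # count after stripping the particles
```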

---

### Morphological analyzer

Just as English has NLTK for word tokenization, Korean has KoNLPy, a morphological analyzer package.
KoNLPy is not preinstalled in Colab, so you need to install it separately.

In [15]:
!pip install konlpy

Collecting konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
Collecting JPype1>=0.7.0 (from konlpy)
  Downloading JPype1-1.4.1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (465 kB)
Installing collected packages: JPype1, konlpy
Successfully installed JPype1-1.4.1 konlpy-0.6.0


Just as NLTK had several tokenizers internally, KoNLPy also has a variety of morpheme analyzers, but the morpheme analyzer called Mecab has to be installed separately.

In [16]:
# Installing Mecab on Colab
!git clone https://github.com/SOMJANG/Mecab-ko-for-Google-Colab.git
%cd Mecab-ko-for-Google-Colab
!bash install_mecab-ko_on_colab190912.sh

Cloning into 'Mecab-ko-for-Google-Colab'...
remote: Enumerating objects: 138, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 138 (delta 26), reused 22 (delta 8), pack-reused 91
Receiving objects: 100% (138/138), 1.72 MiB | 23.42 MiB/s, done.
Resolving deltas: 100% (65/65), done.
/content/Mecab-ko-for-Google-Colab
Installing konlpy.....
Done
Installing mecab-0.996-ko-0.9.2.tar.gz.....
Downloading mecab-0.996-ko-0.9.2.tar.gz.......
from https://bitbucket.org/eunjeon/mecab-ko/downloads/mecab-0.996-ko-0.9.2.tar.gz
--2023-09-10 23:33:45--  https://bitbucket.org/eunjeon/mecab-ko/downloads/mecab-0.996-ko-0.9.2.tar.gz
Resolving bitbucket.org (bitbucket.org)... 104.192.141.1, 2406:da00:ff00::22cd:e0db
Connecting to bitbucket.org (bitbucket.org)|104.192.141.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://bbuseruploads.s3.amazonaws.com/eunjeon/mecab-ko/downloads/mecab-0.996-ko-

In [18]:
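from konlpy.tag import Hannanum, Kkma, Komoran, Mecab, Okt

# Create one instance of each morphological analyzer.
hannanum = Hannanum()
kkma = Kkma()
komoran = Komoran()
okt = Okt()
# mecab = Mecab()  # uncomment once Mecab has been installed successfully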

The morpheme analyzers above share the following methods.  
nouns : extracts nouns  
morphs : extracts morphemes  
pos : tags each morpheme with its part of speech

### Morphological analyzer Okt

In [19]:
print(okt.nouns("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))
print(okt.morphs("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))
print(okt.pos("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))

['코딩', '당신', '연휴', '여행']
['열심히', '코딩', '한', '당신', ',', '연휴', '에는', '여행', '을', '가봐요']
[('열심히', 'Adverb'), ('코딩', 'Noun'), ('한', 'Josa'), ('당신', 'Noun'), (',', 'Punctuation'), ('연휴', 'Noun'), ('에는', 'Josa'), ('여행', 'Noun'), ('을', 'Josa'), ('가봐요', 'Verb')]
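
The three results above are closely related. The sketch below takes the pos output copied from the cell above as plain data and shows that filtering on the 'Noun' tag gives the same list that nouns returned, while dropping the tags gives the same list that morphs returned. This only mirrors the outputs shown here; whether Okt computes nouns and morphs from pos internally in this way is not something this notebook establishes.

```python
# Output of okt.pos(...) copied from the cell above.
tagged = [('열심히', 'Adverb'), ('코딩', 'Noun'), ('한', 'Josa'), ('당신', 'Noun'),
          (',', 'Punctuation'), ('연휴', 'Noun'), ('에는', 'Josa'), ('여행', 'Noun'),
          ('을', 'Josa'), ('가봐요', 'Verb')]

# Keep only the tokens tagged as nouns.
nouns = [word for word, tag in tagged if tag == 'Noun']

# Drop the tags to get the bare morpheme sequence.
morphs = [word for word, _ in tagged]

print(nouns)
print(morphs)
```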


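The same surface form can have different parts of speech depending on context. Take '못' as an example:

망치로 못을 두드리다 ("to hammer a nail") => '못' is a noun ("nail")  
나 그 일 못해요. ("I can't do that work.") => '못' is an adverb ("cannot")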

In [20]:
print(okt.pos("망치로 못을 두드리다 "))

[('망치', 'Noun'), ('로', 'Josa'), ('못', 'Noun'), ('을', 'Josa'), ('두드리다', 'Verb')]


In [21]:
print(okt.pos("나 그 일 못해요"))

[('나', 'Noun'), ('그', 'Noun'), ('일', 'Noun'), ('못', 'VerbPrefix'), ('해요', 'Verb')]


### Morphological analyzer Kkma

In [22]:
print(kkma.nouns("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))
print(kkma.morphs("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))
print(kkma.pos("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))

['코딩', '당신', '연휴', '여행']
['열심히', '코딩', '하', 'ㄴ', '당신', ',', '연휴', '에', '는', '여행', '을', '가보', '아요']
[('열심히', 'MAG'), ('코딩', 'NNG'), ('하', 'XSV'), ('ㄴ', 'ETD'), ('당신', 'NP'), (',', 'SP'), ('연휴', 'NNG'), ('에', 'JKM'), ('는', 'JX'), ('여행', 'NNG'), ('을', 'JKO'), ('가보', 'VV'), ('아요', 'EFN')]


### Morphological analyzer Komoran

In [23]:
print(komoran.nouns("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))
print(komoran.morphs("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))
print(komoran.pos("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))

['코', '당신', '연휴', '여행']
['열심히', '코', '딩', '하', 'ㄴ', '당신', ',', '연휴', '에', '는', '여행', '을', '가', '아', '보', '아요']
[('열심히', 'MAG'), ('코', 'NNG'), ('딩', 'MAG'), ('하', 'XSV'), ('ㄴ', 'ETM'), ('당신', 'NNP'), (',', 'SP'), ('연휴', 'NNG'), ('에', 'JKB'), ('는', 'JX'), ('여행', 'NNG'), ('을', 'JKO'), ('가', 'VV'), ('아', 'EC'), ('보', 'VX'), ('아요', 'EC')]


### Morphological analyzer Hannanum

In [24]:
print(hannanum.nouns("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))
print(hannanum.morphs("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))
print(hannanum.pos("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))

['코딩', '당신', '연휴', '여행']
['열심히', '코딩', '하', 'ㄴ', '당신', ',', '연휴', '에는', '여행', '을', '가', '아', '보', '아']
[('열심히', 'M'), ('코딩', 'N'), ('하', 'X'), ('ㄴ', 'E'), ('당신', 'N'), (',', 'S'), ('연휴', 'N'), ('에는', 'J'), ('여행', 'N'), ('을', 'J'), ('가', 'P'), ('아', 'E'), ('보', 'P'), ('아', 'E')]
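
The morphs outputs of the four analyzers run so far can be compared side by side. The sketch below copies those outputs as plain lists (no KoNLPy call is made) and prints, for each analyzer, how many morphemes it produced and whether it kept '코딩' as a single token. The dictionary name morphs_by_analyzer is just a label for this example.

```python
# morphs(...) outputs copied from the cells above.
morphs_by_analyzer = {
    "Okt": ['열심히', '코딩', '한', '당신', ',', '연휴', '에는', '여행', '을', '가봐요'],
    "Kkma": ['열심히', '코딩', '하', 'ㄴ', '당신', ',', '연휴', '에', '는', '여행', '을', '가보', '아요'],
    "Komoran": ['열심히', '코', '딩', '하', 'ㄴ', '당신', ',', '연휴', '에', '는', '여행', '을', '가', '아', '보', '아요'],
    "Hannanum": ['열심히', '코딩', '하', 'ㄴ', '당신', ',', '연휴', '에는', '여행', '을', '가', '아', '보', '아'],
}

for name, morphs in morphs_by_analyzer.items():
    # Number of morphemes, and whether '코딩' survived as one token.
    print(name, len(morphs), '코딩' in morphs)
```

Okt produces the coarsest segmentation of this sentence and Komoran the finest, and Komoran is the only one of the four that breaks '코딩' apart.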


### Morphological analyzer Mecab

In [25]:
print(mecab.nouns("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))
print(mecab.morphs("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))
print(mecab.pos("열심히 코딩한 당신, 연휴에는 여행을 가봐요"))

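NameError: ignored

The NameError occurs because the line mecab = Mecab() was commented out in the cell that created the analyzers, so the name mecab is undefined. Uncomment that line once Mecab is installed to run this cell.

**Each morpheme analyzer differs in performance and output**, so choose the one that best fits your application. For example, if speed matters most, Mecab is a good choice.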



Note: https://iostream.tistory.com/144 (Compare morpheme analyzer performance)

# Sentence Tokenization

Sentence tokenization, unlike word tokenization, is used to divide a given text into sentences.

## English : Sentence Tokenization (NLTK)

Suppose you are given a string like the one below and want to split it into sentences.

Yonsei University is a private research university in Seoul, South Korea. Yonsei University is deemed as one of the three most prestigious institutions in the country. It is particularly respected in the studies of medicine and business administration.

In [26]:
temp = 'Yonsei University is a private research university in Seoul, South Korea. Yonsei University is deemed as one of the three most prestigious institutions in the country. It is particularly respected in the studies of medicine and business administration.'

string.split('separator') splits the string at each occurrence of the separator and returns the pieces as a list.
The code below splits the string on periods.

In [27]:
temp.split('.')

['Yonsei University is a private research university in Seoul, South Korea',
 ' Yonsei University is deemed as one of the three most prestigious institutions in the country',
 ' It is particularly respected in the studies of medicine and business administration',
 '']
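
A few details of str.split are easy to overlook. Called with no argument, it splits on runs of whitespace; called with an explicit separator, it splits at every occurrence of that separator, which can leave leading spaces and empty strings in the result. The short sketch below shows both behaviors on made-up strings, plus one common way to tidy the pieces.

```python
# No argument: consecutive whitespace is treated as a single separator.
no_arg = "a  b".split()

# Explicit separator: every single space is a split point,
# so the two adjacent spaces produce an empty string.
with_sep = "a  b".split(" ")

# Splitting on a period keeps the space after it and leaves
# an empty string after the final period.
parts = "One. Two.".split(".")

# Strip surrounding whitespace and drop empty pieces.
cleaned = [p.strip() for p in parts if p.strip()]

print(no_arg)
print(with_sep)
print(parts)
print(cleaned)
```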

You might intuitively think that you can split sentences at a period (.), "!" or "?". While "!" and "?" are fairly clear sentence boundaries, a period is not: a period can appear in places other than the end of a sentence.

**Log in to the server at IP 192.168.56.31, save the log file, and send the results to ukairia777@gmail.com. After that, let's go get lunch.**  
**Since I'm actively looking for Ph.D. students, I get the same question a dozen times every year.**
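
Here is what happens when the second example, which is a single sentence, is split on periods. The sketch uses only str.split and a list comprehension to drop empty pieces.

```python
sentence = "Since I'm actively looking for Ph.D. students, I get the same question a dozen times every year."

# Split on every period, then strip whitespace and drop empty pieces.
pieces = [p.strip() for p in sentence.split('.') if p.strip()]

print(len(pieces))
for piece in pieces:
    print(repr(piece))
```

The abbreviation "Ph.D." contributes two periods of its own, so one sentence is cut into three fragments, one of which is just 'D'.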

There are too many exceptions to split sentences reliably on periods.  
**NLTK supports sent_tokenize, which performs tokenization of English sentences.**

In [28]:
text = "His barber kept his word. But keeping such a huge secret to himself was driving him crazy. Finally, the barber went up a mountain and almost to the edge of a cliff. He dug a hole in the midst of some reeds. He looked about, to make sure no one was near."

In [29]:
from nltk.tokenize import sent_tokenize
print(sent_tokenize(text))

['His barber kept his word.', 'But keeping such a huge secret to himself was driving him crazy.', 'Finally, the barber went up a mountain and almost to the edge of a cliff.', 'He dug a hole in the midst of some reeds.', 'He looked about, to make sure no one was near.']


In [30]:
text="I am actively looking for Ph.D. students. and you are a Ph.D student."
print(sent_tokenize(text))

['I am actively looking for Ph.D. students.', 'and you are a Ph.D student.']
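
One way to build intuition for why "Ph.D." does not end the sentence here is a toy splitter that consults a list of known abbreviations. The sketch below is not how sent_tokenize works internally; the abbreviation list and the function name toy_sent_split are hypothetical, chosen only to reproduce the result above with the standard library.

```python
import re

# Hand-picked abbreviations (an assumption for this example only).
ABBREVIATIONS = ("Ph.D.", "Mr.", "Dr.")

def toy_sent_split(text):
    """Split after ., ! or ? unless the piece ends with a known abbreviation."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    carry = ""
    for piece in pieces:
        current = f"{carry} {piece}".strip()
        if current.endswith(ABBREVIATIONS):
            # Not a real sentence end: keep accumulating.
            carry = current
        else:
            sentences.append(current)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences

text = "I am actively looking for Ph.D. students. and you are a Ph.D student."
result = toy_sent_split(text)
print(result)
```

A fixed list like this breaks down quickly on real text, which is the reason to prefer a trained sentence tokenizer.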


## Korean language : Sentence Tokenization (KSS)

In [31]:
!pip install kss

Collecting kss
  Downloading kss-4.5.4.tar.gz (79 kB)
  Preparing metadata (setup.py) ... done
Collecting emoji==1.2.0 (from kss)
  Downloading emoji-1.2.0-py3-none-any.whl (131 kB)
Collecting pecab (from kss)
  Downloading pecab-1.0.8.tar.gz (26.4 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: kss, pecab
  Building wheel for kss (setup.py) ... done
  Created wheel for kss: filename=kss-4.5.4-py3-none-any.whl size=54465 sha256=b343115e346f70c0fc842bc6fdf475ba836b3d771d83da36927e7a028e56f766
  Stored in directory: /root/.cache/pip/wheels/61/7b/ba/

In [32]:
import kss

text = '딥 러닝 자연어 처리가 재미있기는 합니다. 그런데 문제는 영어보다 한국어로 할 때 너무 어려워요. 이제 해보면 알걸요?'
print(kss.split_sentences(text))




['딥 러닝 자연어 처리가 재미있기는 합니다.', '그런데 문제는 영어보다 한국어로 할 때 너무 어려워요.', '이제 해보면 알걸요?']
