# Tokenization
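- Clean the text only to the extent that its meaning is not lost
- Because Korean is an agglutinative language, whitespace alone is not enough to separate its tokens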
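To make the second point concrete, here is a minimal sketch using only plain Python. The sentence and the two-item particle list are assumptions chosen for this demo, and `split_particle` is a hypothetical helper; a real pipeline would use a morphological analyzer rather than suffix matching, which wrongly splits any word that merely ends in one of these characters.

```python
# Whitespace splitting on a Korean sentence (illustrative sentence, not from the notes).
sentence = "나는 학교에 갔다"
tokens = sentence.split()
print(tokens)  # ['나는', '학교에', '갔다']

# Each chunk still bundles a stem with a particle:
# '나는' = '나' (I) + '는' (topic marker), '학교에' = '학교' (school) + '에' (to).
# Tiny hand-written particle list, an assumption for this demo only.
PARTICLES = ("는", "에")

def split_particle(token):
    """Split a trailing particle off a token if it matches the demo list."""
    for particle in PARTICLES:
        if token.endswith(particle) and len(token) > len(particle):
            return [token[: -len(particle)], particle]
    return [token]

morphemes = [piece for token in tokens for piece in split_particle(token)]
print(morphemes)  # ['나', '는', '학교', '에', '갔다']
```

Whitespace gives three chunks, while the morpheme-level view has five units, which is why whitespace tokenization alone falls short for Korean.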

### Word Tokenization

In [10]:
from nltk.tokenize import word_tokenize
print(word_tokenize("I don't know how to draw anything except boa constrictors from the outside and boa constrictors from the inside"))

['I', 'do', "n't", 'know', 'how', 'to', 'draw', 'anything', 'except', 'boa', 'constrictors', 'from', 'the', 'outside', 'and', 'boa', 'constrictors', 'from', 'the', 'inside']


- `word_tokenize` splits "don't" into 'do' and "n't"

In [7]:
from nltk.tokenize import WordPunctTokenizer
print(WordPunctTokenizer().tokenize("I don't know how to draw anything except boa constrictors from the outside and boa constrictors from the inside"))

['I', 'don', "'", 't', 'know', 'how', 'to', 'draw', 'anything', 'except', 'boa', 'constrictors', 'from', 'the', 'outside', 'and', 'boa', 'constrictors', 'from', 'the', 'inside']


- `WordPunctTokenizer` splits "don't" into 'don', "'", and 't'
- Punctuation is split off into separate tokens
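The split into 'don', "'", and 't' follows from one simple rule: take runs of word characters, or runs of punctuation. The sketch below reproduces that behavior with the standard `re` module. The pattern matches what `WordPunctTokenizer` uses as far as I know, but treat that as an assumption; `word_punct_split` is a hypothetical name for this demo.

```python
import re

# A run of word characters, OR a run of characters that are
# neither word characters nor whitespace (i.e. punctuation).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_split(text):
    """Return word and punctuation runs as separate tokens."""
    return WORD_PUNCT.findall(text)

result = word_punct_split("I don't know how to draw")
print(result)
# ['I', 'don', "'", 't', 'know', 'how', 'to', 'draw']
```

Because the apostrophe is neither a word character nor whitespace, it always becomes its own token, which is what breaks contractions apart.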

In [9]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
print(text_to_word_sequence("I don't know how to draw anything except boa constrictors from the outside and boa constrictors from the inside"))

['i', "don't", 'know', 'how', 'to', 'draw', 'anything', 'except', 'boa', 'constrictors', 'from', 'the', 'outside', 'and', 'boa', 'constrictors', 'from', 'the', 'inside']


- `text_to_word_sequence` keeps "don't" intact
- It also lowercases the text ('I' becomes 'i')

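##### Considerations
- Periods and other punctuation marks can carry the character of a sentence, so stripping them may lose meaning
    - 밥 먹었어? ("Did you eat?") vs. 밥 먹었어. ("I ate.")
- Some expressions contain a space but should still be treated as a single word
    - New York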
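One way to handle the "New York" case is to merge known multi-word expressions back together after tokenizing. The sketch below is a plain-Python illustration: `merge_multiword` is a hypothetical helper and the example sentence is made up. NLTK ships an `MWETokenizer` for the same purpose, which is likely the better choice in practice.

```python
def merge_multiword(tokens, expressions):
    """Merge token sequences listed in `expressions` into single tokens.

    `expressions` is a list of tuples such as [("New", "York")].
    """
    merged = []
    i = 0
    while i < len(tokens):
        for expr in expressions:
            n = len(expr)
            # Guard against empty tuples, which would never advance `i`.
            if n and tuple(tokens[i:i + n]) == expr:
                merged.append(" ".join(expr))
                i += n
                break
        else:
            # No expression matched at position i: keep the token as is.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "I moved to New York last year".split()
merged = merge_multiword(tokens, [("New", "York")])
print(merged)
# ['I', 'moved', 'to', 'New York', 'last', 'year']
```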

In [12]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
text = "I don't know how to draw anything except boa constrictors from the outside and boa constrictors from the inside"
print(tokenizer.tokenize(text))

['I', 'do', "n't", 'know', 'how', 'to', 'draw', 'anything', 'except', 'boa', 'constrictors', 'from', 'the', 'outside', 'and', 'boa', 'constrictors', 'from', 'the', 'inside']


- `TreebankWordTokenizer` also splits "don't" into 'do' and "n't"

### Sentence Tokenization
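- How should text be split into sentences?
- Split on every period?
    - Periods also appear inside other tokens, such as "Ph.D." and "Washington, D.C."
    - So it is important to define rules based on how periods are actually used in the corpus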
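The sketch below contrasts a naive split on periods with a rule-based one that protects known abbreviations. The abbreviation set, the example text, and the `split_sentences` helper are all assumptions for this demo; in a real project the list would come from inspecting the corpus. One known limitation: an abbreviation that actually ends a sentence will not trigger a split.

```python
import re

# Hypothetical abbreviation list; a real one would be built from the corpus.
ABBREVIATIONS = {"Ph.D.", "D.C.", "Dr.", "Mr."}

def split_sentences(text):
    """Split on words ending in . ! or ? unless the word is a known abbreviation."""
    sentences = []
    current = []
    for word in text.split():
        current.append(word)
        if re.search(r"[.!?]$", word) and word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep any trailing text that lacks closing punctuation.
        sentences.append(" ".join(current))
    return sentences

text = "She earned her Ph.D. in Washington, D.C. last year. Then she moved away."

# Naive approach: break after every period that is followed by whitespace.
naive = re.split(r"(?<=\.)\s+", text)
print(len(naive))  # 4 pieces, two of them cut off after an abbreviation

sentences = split_sentences(text)
print(sentences)
# ['She earned her Ph.D. in Washington, D.C. last year.', 'Then she moved away.']
```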

In [39]:
from nltk.tokenize import sent_tokenize
text = "I pondered deeply, then, over the adventures of the jungle. And after some work with a colored pencil Isucceeded in making my first drawing. My Drawing Number One. It looked something like this."
print(sent_tokenize(text))

['I pondered deeply, then, over the adventures of the jungle.', 'And after some work with a colored pencil Isucceeded in making my first drawing.', 'My Drawing Number One.', 'It looked something like this.']


### POS tagging

In [40]:
tokenized_text = word_tokenize(text)

In [41]:
from nltk.tag import pos_tag
print(pos_tag(tokenized_text))

[('I', 'PRP'), ('pondered', 'VBD'), ('deeply', 'RB'), (',', ','), ('then', 'RB'), (',', ','), ('over', 'IN'), ('the', 'DT'), ('adventures', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('jungle', 'NN'), ('.', '.'), ('And', 'CC'), ('after', 'IN'), ('some', 'DT'), ('work', 'NN'), ('with', 'IN'), ('a', 'DT'), ('colored', 'JJ'), ('pencil', 'NN'), ('Isucceeded', 'NNP'), ('in', 'IN'), ('making', 'VBG'), ('my', 'PRP$'), ('first', 'JJ'), ('drawing', 'NN'), ('.', '.'), ('My', 'PRP$'), ('Drawing', 'VBG'), ('Number', 'NNP'), ('One', 'NNP'), ('.', '.'), ('It', 'PRP'), ('looked', 'VBD'), ('something', 'NN'), ('like', 'IN'), ('this', 'DT'), ('.', '.')]
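A common next step is to filter tokens by tag. The sketch below copies a few `(token, tag)` pairs from the output above and keeps nouns and verbs, relying on the Penn Treebank convention that noun tags start with `NN` and verb tags with `VB`. The variable names are chosen for this demo only.

```python
# A few (token, tag) pairs copied from the pos_tag output above.
tagged = [
    ('I', 'PRP'), ('pondered', 'VBD'), ('deeply', 'RB'), ('over', 'IN'),
    ('the', 'DT'), ('adventures', 'NNS'), ('of', 'IN'), ('the', 'DT'),
    ('jungle', 'NN'),
]

# Noun tags (NN, NNS, NNP, NNPS) share the 'NN' prefix;
# verb tags (VB, VBD, VBG, ...) share the 'VB' prefix.
nouns = [word for word, tag in tagged if tag.startswith('NN')]
verbs = [word for word, tag in tagged if tag.startswith('VB')]

print(nouns)  # ['adventures', 'jungle']
print(verbs)  # ['pondered']
```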
