SubwordTextEncoder is a subword tokenizer provided through TensorFlow Datasets (`tfds`).

It adopts the WordPiece Model, an algorithm similar to BPE, and the package makes it easy to split words into subwords.
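
To make the idea of building subwords from frequent character pairs concrete, here is a minimal sketch of a few merge steps of plain BPE on a toy corpus. It illustrates the general idea only and is not the procedure `SubwordTextEncoder` runs internally; the toy word counts and the helper names `count_pairs` and `merge_pair` are invented for this illustration.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical toy corpus: each word is a tuple of characters with a count.
vocab = {
    tuple("low"): 5,
    tuple("lower"): 2,
    tuple("newest"): 6,
    tuple("widest"): 3,
}

merges = []
for _ in range(3):
    pairs = count_pairs(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)       # the pairs that were merged, in order
print(list(vocab))  # words re-expressed with the merged symbols
```

After a few merges, frequent fragments such as `est` become single symbols while rare words stay split into smaller pieces, which is the same behaviour the tokenizer outputs in this notebook show.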

In [4]:
import pandas as pd
import urllib.request
import tensorflow_datasets as tfds

train_df = pd.read_csv('IMDb_Reviews.csv')
train_df['review']

0        My family and I normally do not watch local mo...
1        Believe it or not, this was at one time the wo...
2        After some internet surfing, I found the "Home...
3        One of the most unheralded great works of anim...
4        It was the Sixties, and anyone with long hair ...
                               ...                        
49995    the people who came up with this are SICK AND ...
49996    The script is so so laughable... this in turn,...
49997    "So there's this bride, you see, and she gets ...
49998    Your mind will not be satisfied by this nobud...
49999    The chaser's war on everything is a weekly sho...
Name: review, Length: 50000, dtype: object

In [5]:
tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(train_df['review'], target_vocab_size=2**13)

In [6]:
print(tokenizer.subwords[:30])

['the_', ', ', '. ', 'a_', 'and_', 'of_', 'to_', 's_', 'is_', 'br', 'in_', 'I_', 'that_', 'this_', 'it_', ' /><', ' />', 'was_', 'The_', 't_', 'as_', 'with_', 'for_', '.<', 'on_', 'but_', 'movie_', 'are_', ' (', 'have_']
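
In this list, a trailing `_` (as in `the_`) appears to mark the end of a word and is decoded as a space. The ids in the decoded output later in this notebook also line up with this list shifted by one (id 2 decodes to `, `, id 5 to `and `), which suggests that id 0 is reserved. The sketch below reproduces that mapping using only the first ten entries printed above; `piece_for_id` is a hypothetical helper, and the rule it implements is inferred from this notebook's outputs rather than taken from the library source.

```python
# First ten entries of tokenizer.subwords, copied from the output above.
subwords = ['the_', ', ', '. ', 'a_', 'and_', 'of_', 'to_', 's_', 'is_', 'br']

def piece_for_id(token_id, subwords):
    """Return the text for a learned-subword id (assumes ids start at 1)."""
    if not 1 <= token_id <= len(subwords):
        raise ValueError('id {} is outside the listed subwords'.format(token_id))
    piece = subwords[token_id - 1]
    # Assumption: a trailing underscore stands for the space after a word.
    if piece.endswith('_'):
        piece = piece[:-1] + ' '
    return piece

for token_id in (2, 5, 7, 8, 9):
    print('{} ----> {!r}'.format(token_id, piece_for_id(token_id, subwords)))
```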


In [7]:
print(train_df['review'][20])

Pretty bad PRC cheapie which I rarely bother to watch over again, and it's no wonder -- it's slow and creaky and dull as a butter knife. Mad doctor George Zucco is at it again, turning a dimwitted farmhand in overalls (Glenn Strange) into a wolf-man. Unfortunately, the makeup is virtually non-existent, consisting only of a beard and dimestore fangs for the most part. If it were not for Zucco and Strange's presence, along with the cute Anne Nagel, this would be completely unwatchable. Strange, who would go on to play Frankenstein's monster for Unuiversal in two years, does a Lenny impression from "Of Mice and Men", it seems.<br /><br />*1/2 (of Four)


In [10]:
tokenized_string = tokenizer.encode(train_df['review'][20])

print('Tokenized sample question: {}'.format(tokenized_string))
print()
# Decode the ids back into text
decoded_string = tokenizer.decode(tokenized_string)
print('Decoded String: {}'.format(decoded_string))

Tokenized sample question: [1590, 4162, 132, 7107, 1892, 2983, 578, 76, 12, 4632, 3422, 7, 160, 175, 372, 2, 5, 39, 8051, 8, 84, 2652, 497, 39, 8051, 8, 1374, 5, 3461, 2012, 48, 5, 2263, 21, 4, 2992, 127, 4729, 711, 3, 1391, 8044, 3557, 1277, 8102, 2154, 5681, 9, 42, 15, 372, 2, 3773, 4, 3502, 2308, 467, 4890, 1503, 11, 3347, 1419, 8127, 29, 5539, 98, 6099, 58, 94, 4, 1388, 4230, 8057, 213, 3, 1966, 2, 1, 6700, 8044, 9, 7069, 716, 8057, 6600, 2, 4102, 36, 78, 6, 4, 1865, 40, 5, 3502, 1043, 1645, 8044, 1000, 1813, 23, 1, 105, 1128, 3, 156, 15, 85, 33, 23, 8102, 2154, 5681, 5, 6099, 8051, 8, 7271, 1055, 2, 534, 22, 1, 3046, 5214, 810, 634, 8120, 2, 14, 71, 34, 436, 3311, 5447, 783, 3, 6099, 2, 46, 71, 193, 25, 7, 428, 2274, 2260, 6487, 8051, 8, 2149, 23, 1138, 4117, 6023, 163, 11, 148, 735, 2, 164, 4, 5277, 921, 3395, 1262, 37, 639, 1349, 349, 5, 2460, 328, 15, 5349, 8127, 24, 10, 16, 10, 17, 8054, 8061, 8059, 8062, 29, 6, 6607, 8126, 8053]

Decoded String: Pretty bad PRC cheapie which I

In [11]:
# Check the vocabulary size
tokenizer.vocab_size

8268

In [12]:
for ts in tokenized_string:
    print('{} -------> {}'.format(ts, tokenizer.decode([ts])))

1590 -------> Pre
4162 -------> tty 
132 -------> bad 
7107 -------> PR
1892 -------> C 
2983 -------> cheap
578 -------> ie 
76 -------> which 
12 -------> I 
4632 -------> rarely 
3422 -------> bother 
7 -------> to 
160 -------> watch 
175 -------> over 
372 -------> again
2 -------> , 
5 -------> and 
39 -------> it
8051 -------> '
8 -------> s 
84 -------> no 
2652 -------> wonder
497 ------->  -- 
39 -------> it
8051 -------> '
8 -------> s 
1374 -------> slow 
5 -------> and 
3461 -------> cre
2012 -------> ak
48 -------> y 
5 -------> and 
2263 -------> dull 
21 -------> as 
4 -------> a 
2992 -------> butt
127 -------> er 
4729 -------> kni
711 -------> fe
3 -------> . 
1391 -------> Mad
8044 ------->  
3557 -------> doctor 
1277 -------> George 
8102 -------> Z
2154 -------> uc
5681 -------> co 
9 -------> is 
42 -------> at 
15 -------> it 
372 -------> again
2 -------> , 
3773 -------> turning 
4 -------> a 
3502 -------> dim
2308 -------> wit
467 -------> ted 
4890 -------

In [13]:
# Use "evenxyz" to see how the tokenizer splits the unseen string xyz
sample_string = "It's mind-blowing to me that this film was evenxyz made."

# encoding
tokenized_string = tokenizer.encode(sample_string)
print('Encoded Sentence : {}'.format(tokenized_string))
print()

# decoding
decoded_string = tokenizer.decode(tokenized_string)
print('Decoded Sentence : {}'.format(decoded_string))

Encoded Sentence : [137, 8051, 8, 910, 8057, 2169, 36, 7, 103, 13, 14, 32, 18, 7974, 8132, 8133, 997, 681, 8058]

Decoded Sentence : It's mind-blowing to me that this film was evenxyz made.


In [15]:
for ts in tokenized_string:
  print ('{} ----> {}'.format(ts, tokenizer.decode([ts])))

  # xyz never appeared as a single word in the training data, so it is split into separate pieces

137 ----> It
8051 ----> '
8 ----> s 
910 ----> mind
8057 ----> -
2169 ----> blow
36 ----> ing 
7 ----> to 
103 ----> me 
13 ----> that 
14 ----> this 
32 ----> film 
18 ----> was 
7974 ----> even
8132 ----> x
8133 ----> y
997 ----> z 
681 ----> made
8058 ----> .
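
The ids for `x` and `y` (8132 and 8133) are consecutive, just like their character codes (120 and 121), which hints that text without a learned subword falls back to one id per byte. The sketch below checks this against the pairs printed above. The layout it assumes (padding and learned subwords first, followed by 256 byte ids) is inferred from these outputs and from the vocabulary size of 8268 reported earlier; it is not confirmed here from the library source.

```python
# (id, decoded character) pairs copied from the per-token output above.
observed = {8051: "'", 8057: '-', 8058: '.', 8132: 'x', 8133: 'y'}

# If these ids are byte fallbacks, id - ord(char) is the same for every pair.
offsets = {token_id - ord(char) for token_id, char in observed.items()}
print('offsets:', offsets)

byte_offset = offsets.pop() if len(offsets) == 1 else None
vocab_size = 8268  # value of tokenizer.vocab_size reported earlier

if byte_offset is not None:
    # Assumed layout: ids below byte_offset are padding and learned subwords,
    # ids from byte_offset upward cover the 256 possible byte values.
    print('byte ids span {}..{}'.format(byte_offset, byte_offset + 255))
    print('matches vocab_size:', byte_offset + 256 == vocab_size)
```

Note that `z ` received id 997 rather than a byte id: a `z` followed by a space evidently exists as a learned subword, so only `x` and `y` needed the fallback.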


## Tokenizing Naver movie reviews

In [2]:
import pandas as pd
import tensorflow_datasets as tfds

train_data = pd.read_table('C:/Users/Myeong/dding/data/딥러닝-자연어처리입문/ratings.txt')
train_data

| | id | document | label |
|---|---|---|---|
| 0 | 8112052 | 어릴때보고 지금다시봐도 재밌어요ㅋㅋ | 1 |
| 1 | 8132799 | 디자인을 배우는 학생으로, 외국디자이너와 그들이 일군 전통을 통해 발전해가는 문화산... | 1 |
| 2 | 4655635 | 폴리스스토리 시리즈는 1부터 뉴까지 버릴께 하나도 없음.. 최고. | 1 |
| 3 | 9251303 | 와.. 연기가 진짜 개쩔구나.. 지루할거라고 생각했는데 몰입해서 봤다.. 그래 이런... | 1 |
| 4 | 10067386 | 안개 자욱한 밤하늘에 떠 있는 초승달 같은 영화. | 1 |
| ... | ... | ... | ... |
| 199995 | 8963373 | 포켓 몬스터 짜가 ㅡㅡ;; | 0 |
| 199996 | 3302770 | 쓰.레.기 | 0 |
| 199997 | 5458175 | 완전 사이코영화. 마지막은 더욱더 이 영화의질을 떨어트린다. | 0 |
| 199998 | 6908648 | 왜난 재미없었지 ㅠㅠ 라따뚜이 보고나서 스머프 봐서 그런가 ㅋㅋ | 0 |


In [3]:
train_data.dropna(how='any', inplace=True)
train_data.isnull().sum()

id          0
document    0
label       0
dtype: int64

In [4]:
tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(train_data['document'], target_vocab_size=2**13)

In [5]:
tokenizer.subwords[:50]

['. ',
 '..',
 '영화',
 '이_',
 '...',
 '의_',
 '는_',
 '다',
 '도_',
 ', ',
 '을_',
 '고_',
 '은_',
 '가_',
 '에_',
 '.. ',
 '한_',
 '너무_',
 '정말_',
 '를_',
 '고',
 '게_',
 '영화_',
 '지',
 '... ',
 '진짜_',
 '이',
 '다_',
 '요',
 '만_',
 '? ',
 '과_',
 '가',
 '로_',
 '지_',
 '나',
 '서_',
 '으로_',
 '아',
 '어',
 '....',
 '수_',
 '한',
 '와_',
 '도',
 '음',
 '네',
 '더_',
 '그냥_',
 '왜_']

In [6]:
print(train_data['document'][20])

오랜만에 본 제대로 된 범죄스릴러~


In [7]:
print('Tokenized sample question: {}'.format(tokenizer.encode(train_data['document'][20])))

Tokenized sample question: [635, 90, 572, 208, 1781, 516, 8102]


In [10]:
sample_string = train_data['document'][20]

# Store the encoded result in tokenized_string
tokenized_string = tokenizer.encode(sample_string)
print('Sentence after integer encoding : {}'.format(tokenized_string))
print()
# Decode it back into text
original_string = tokenizer.decode(tokenized_string)
print('Original sentence : {}'.format(original_string))

Sentence after integer encoding : [635, 90, 572, 208, 1781, 516, 8102]

Original sentence : 오랜만에 본 제대로 된 범죄스릴러~


In [11]:
for ts in tokenized_string:
  print ('{} ----> {}'.format(ts, tokenizer.decode([ts])))

635 ----> 오랜만에 
90 ----> 본 
572 ----> 제대로 
208 ----> 된 
1781 ----> 범죄
516 ----> 스릴러
8102 ----> ~
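
Here every Hangul piece was covered by a learned subword, and only `~` received a high id (8102). If this tokenizer also falls back to one id per byte for text it has no subword for (an assumption, inferred from a single data point in the output above), then an unseen Hangul syllable would cost three ids, because each syllable is three bytes in UTF-8. A minimal sketch of that arithmetic, where `fallback_ids` is a hypothetical helper and not part of the library:

```python
# '~' was decoded from id 8102 in the output above.
byte_offset = 8102 - ord('~')
print('inferred first byte id:', byte_offset)

def fallback_ids(text, byte_offset):
    """Ids the text would get if encoded byte by byte (assumed layout)."""
    return [byte_offset + b for b in text.encode('utf-8')]

print(fallback_ids('~', byte_offset))   # '~' is a single byte in UTF-8
print(fallback_ids('범', byte_offset))  # a Hangul syllable is three bytes
```

This is one reason a corpus-specific vocabulary matters for Korean: the more syllables and morphemes the learned subwords cover, the fewer multi-id fallbacks appear in the encoded sequences.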
