# SubwordTextEncoder
- https://wikidocs.net/86792

## Tokenizing IMDb reviews with SubwordTextEncoder

In [4]:
import tensorflow as tf
tf.__version__

'1.14.0'

In [None]:
!pip install --upgrade tensorflow

In [1]:
import tensorflow as tf
tf.__version__

'2.3.1'

In [2]:
import tensorflow_datasets as tfds
import urllib.request
import pandas as pd

In [3]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/LawrenceDuan/IMDb-Review-Analysis/master/IMDb_Reviews.csv", filename="IMDb_Reviews.csv")

('IMDb_Reviews.csv', <http.client.HTTPMessage at 0x7fda89b71da0>)

In [4]:
df = pd.read_csv("IMDb_Reviews.csv")
print(df.shape)
df.head()

(50000, 2)


|   | review | sentiment |
|---|--------|-----------|
| 0 | My family and I normally do not watch local mo... | 1 |
| 1 | Believe it or not, this was at one time the wo... | 0 |
| 2 | After some internet surfing, I found the "Home... | 0 |
| 3 | One of the most unheralded great works of anim... | 1 |
| 4 | It was the Sixties, and anyone with long hair ... | 0 |


In [5]:
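# Build a subword vocabulary of roughly 2**13 entries from the review texts.
# Note: newer tensorflow_datasets releases expose this class as
# tfds.deprecated.text.SubwordTextEncoder.
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    df["review"], target_vocab_size=2**13
)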

In [13]:
tokenizer.vocab_size, 2 ** 13

(8268, 8192)
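
The reported `vocab_size` (8268) is larger than the requested target (8192). One reading that is consistent with the IDs printed later in this notebook is that the target is only approximate and that the encoder adds one padding ID and 256 raw-byte IDs on top of the learned subwords. The sketch below only does that arithmetic; the ID layout and the names `NUM_BYTES` and `PAD_IDS` are assumptions inferred from the outputs, not values read from the tokenizer.

```python
# Assumed ID layout (inferred from this notebook's outputs, not from the library):
#   0              -> padding
#   1 .. N         -> learned subwords
#   N+1 .. N+256   -> one ID per raw byte value
NUM_BYTES = 256
PAD_IDS = 1

vocab_size = 8268  # value reported by tokenizer.vocab_size above
num_subwords = vocab_size - NUM_BYTES - PAD_IDS
print(num_subwords)  # 8011
```

If the assumption holds, `len(tokenizer.subwords)` should print the same number.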

In [7]:
tokenizer.subwords[:10]

['the_', ', ', '. ', 'a_', 'and_', 'of_', 'to_', 's_', 'is_', 'br']

In [10]:
# encode
print(df.review.values[0], tokenizer.encode(df.review.values[0]))

My family and I normally do not watch local movies for the simple reason that they are poorly made, they lack the depth, and just not worth our time.<br /><br />The trailer of "Nasaan ka man" caught my attention, my daughter in law's and daughter's so we took time out to watch it this afternoon. The movie exceeded our expectations. The cinematography was very good, the story beautiful and the acting awesome. Jericho Rosales was really very good, so's Claudine Barretto. The fact that I despised Diether Ocampo proves he was effective at his role. I have never been this touched, moved and affected by a local movie before. Imagine a cynic like me dabbing my eyes at the end of the movie? Congratulations to Star Cinema!! Way to go, Jericho and Claudine!! [390, 410, 5, 12, 6771, 110, 33, 160, 1067, 172, 23, 1, 1340, 450, 13, 54, 28, 1653, 681, 2, 54, 813, 1, 7739, 2, 5, 56, 33, 359, 326, 166, 24, 10, 16, 10, 17, 19, 5540, 231, 37, 810, 557, 41, 5491, 213, 53, 2020, 80, 4688, 2, 80, 1413, 11, 

In [11]:
sample_string = "It's mind-blowing to me that this film was even made."

# Encode the string into subword IDs
tokenized_string = tokenizer.encode(sample_string)
print(f'Integer-encoded sentence: {tokenized_string}')

# Decode the IDs back into text
original_string = tokenizer.decode(tokenized_string)
print(f'Original sentence: {original_string}')

Integer-encoded sentence: [137, 8051, 8, 910, 8057, 2169, 36, 7, 103, 13, 14, 32, 18, 79, 681, 8058]
Original sentence: It's mind-blowing to me that this film was even made.


In [14]:
for ts in tokenized_string:
    print ('{} ----> {}'.format(ts, tokenizer.decode([ts])))

137 ----> It
8051 ----> '
8 ----> s 
910 ----> mind
8057 ----> -
2169 ----> blow
36 ----> ing 
7 ----> to 
103 ----> me 
13 ----> that 
14 ----> this 
32 ----> film 
18 ----> was 
79 ----> even 
681 ----> made
8058 ----> .
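
The punctuation marks above get much larger IDs (8051, 8057, 8058) than the word pieces. This matches a byte-fallback scheme in which a character with no learned subword is encoded as its raw byte value plus a fixed offset. The following is a minimal sketch under that assumption; `byte_fallback_ids` is a hypothetical helper and `NUM_SUBWORDS = 8011` is inferred from `vocab_size`, not taken from the library.

```python
NUM_SUBWORDS = 8011  # assumption: vocab_size (8268) - 256 byte IDs - 1 padding ID

def byte_fallback_ids(text, num_subwords=NUM_SUBWORDS):
    """Hypothetical helper: one ID per UTF-8 byte, placed after the subword IDs."""
    return [num_subwords + 1 + b for b in text.encode("utf-8")]

print(byte_fallback_ids("'"))  # [8051]
print(byte_fallback_ids("-"))  # [8057]
print(byte_fallback_ids("."))  # [8058]
```

The three results agree with the IDs the tokenizer printed for `'`, `-` and `.`.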


In [15]:
# Same sentence, but with the made-up word "evenxyz" to see how unseen words are split
sample_string = "It's mind-blowing to me that this film was evenxyz made."

# Encode the string into subword IDs
tokenized_string = tokenizer.encode(sample_string)
print(f'Integer-encoded sentence: {tokenized_string}')

# Decode the IDs back into text
original_string = tokenizer.decode(tokenized_string)
print(f'Original sentence: {original_string}')

Integer-encoded sentence: [137, 8051, 8, 910, 8057, 2169, 36, 7, 103, 13, 14, 32, 18, 7974, 8132, 8133, 997, 681, 8058]
Original sentence: It's mind-blowing to me that this film was evenxyz made.


In [16]:
for ts in tokenized_string:
    print ('{} ----> {}'.format(ts, tokenizer.decode([ts])))

137 ----> It
8051 ----> '
8 ----> s 
910 ----> mind
8057 ----> -
2169 ----> blow
36 ----> ing 
7 ----> to 
103 ----> me 
13 ----> that 
14 ----> this 
32 ----> film 
18 ----> was 
7974 ----> even
8132 ----> x
8133 ----> y
997 ----> z 
681 ----> made
8058 ----> .
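
The made-up word `evenxyz` is split into `even`, `x`, `y` and `z `, which is what a greedy longest-match segmentation with a single-character fallback would produce. The toy sketch below illustrates that idea in plain Python. It is not the library's implementation: `greedy_subwords` and `toy_vocab` are hypothetical, and `_` stands in for the trailing space, as in the `tokenizer.subwords` listing.

```python
def greedy_subwords(token, vocab):
    """Split `token` by repeatedly taking the longest vocab entry that matches."""
    pieces = []
    start = 0
    while start < len(token):
        # Try the longest candidate first and shrink until something matches.
        for end in range(len(token), start, -1):
            if token[start:end] in vocab:
                pieces.append(token[start:end])
                start = end
                break
        else:
            # Nothing in the vocabulary matches here: emit one character as a fallback.
            pieces.append(token[start])
            start += 1
    return pieces

toy_vocab = {"even", "even_", "z_", "made"}  # hypothetical mini vocabulary
print(greedy_subwords("evenxyz_", toy_vocab))  # ['even', 'x', 'y', 'z_']
print(greedy_subwords("even_", toy_vocab))     # ['even_']
```

Because every character can be emitted on its own as a last resort, no input is ever out of vocabulary, which is why the decoded sentence above is reproduced exactly.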


## Tokenizing Naver movie reviews with SubwordTextEncoder

In [17]:
import tensorflow_datasets as tfds
import urllib.request

In [18]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt", filename="ratings_train.txt")

('ratings_train.txt', <http.client.HTTPMessage at 0x7fda7fef8710>)

In [19]:
df = pd.read_table("ratings_train.txt")
print(df.shape)
df.head()

  """Entry point for launching an IPython kernel.


(150000, 3)


|   | id | document | label |
|---|----|----------|-------|
| 0 | 9976970 | 아 더빙.. 진짜 짜증나네요 목소리 | 0 |
| 1 | 3819312 | 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나 | 1 |
| 2 | 10265843 | 너무재밓었다그래서보는것을추천한다 | 0 |
| 3 | 9045019 | 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정 | 0 |
| 4 | 6483659 | 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ... | 1 |


In [20]:
df.isnull().sum()

id          0
document    5
label       0
dtype: int64

In [21]:
df = df.dropna(how="any")
print(df.shape)
df.isnull().sum()

(149995, 3)


id          0
document    0
label       0
dtype: int64

In [22]:
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    df["document"], target_vocab_size=2**13
)
tokenizer.subwords[:10]

['. ', '..', '영화', '이_', '...', '의_', '는_', '도_', '다', ', ']

In [23]:
tokenizer.encode(df["document"][100])

[121,
 312,
 4,
 81,
 410,
 304,
 6,
 229,
 174,
 225,
 8042,
 67,
 108,
 45,
 296,
 312,
 1297,
 8030,
 977,
 2312,
 3860,
 198,
 2193,
 630,
 8044]

In [25]:
sample_string = df['document'][21]

# Encode the review into subword IDs
tokenized_string = tokenizer.encode(sample_string)
print('Integer-encoded sentence: {}'.format(tokenized_string))

# Decode the IDs back into text
original_string = tokenizer.decode(tokenized_string)
print('Original sentence: {}'.format(original_string))

Integer-encoded sentence: [570, 892, 36, 584, 159, 7091, 201]
Original sentence: 보면서 웃지 않는 건 불가능하다


In [26]:
sample_string = "진짜 몰입해서 봤다. 몇 번을 봐도 재미있게 볼 영화!"

# Encode the sentence into subword IDs
tokenized_string = tokenizer.encode(sample_string)
print('Integer-encoded sentence: {}'.format(tokenized_string))

# Decode the IDs back into text
original_string = tokenizer.decode(tokenized_string)
print('Original sentence: {}'.format(original_string))

Integer-encoded sentence: [26, 4115, 338, 1, 2085, 7764, 259, 452, 294, 3, 8031]
Original sentence: 진짜 몰입해서 봤다. 몇 번을 봐도 재미있게 볼 영화!


In [27]:
for ts in tokenized_string:
    print('{} ----> {}'.format(ts, tokenizer.decode([ts])))

26 ----> 진짜 
4115 ----> 몰입해서 
338 ----> 봤다
1 ----> . 
2085 ----> 몇 
7764 ----> 번을 
259 ----> 봐도 
452 ----> 재미있게 
294 ----> 볼 
3 ----> 영화
8031 ----> !
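
Here only `!` falls outside the learned subwords. If the fallback works on UTF-8 bytes, as the ID of `!` suggests, a Hangul syllable with no learned subword would cost three IDs rather than one, because each syllable occupies three bytes in UTF-8. The sketch below checks both points. `KO_NUM_SUBWORDS = 7997` is an assumption inferred from the ID of `!` (8031), and `utf8_fallback_ids` is a hypothetical helper, not a library call.

```python
KO_NUM_SUBWORDS = 7997  # assumption: chosen so that "!" (byte 33) maps to ID 8031

def utf8_fallback_ids(text, num_subwords=KO_NUM_SUBWORDS):
    """Hypothetical helper: one ID per UTF-8 byte, offset past the subword IDs."""
    return [num_subwords + 1 + b for b in text.encode("utf-8")]

print(utf8_fallback_ids("!"))        # [8031]
print(len(utf8_fallback_ids("밓")))  # 3
```

Rare syllables, such as those in misspelled reviews, can therefore make the encoded sequence noticeably longer than the character count suggests.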
