# Generating Simple Text - Alice in Wonderland

- The text is cut into 10-character windows to build the input data, and the character that follows each window becomes its label. This gives 161,793 samples for supervised learning.  

      ex) "alice lear" - "n"
          "lice learn" - "e"
          "ice learne" - "d"
       
- For validation, the model is given a 10-character seed, and the 100 characters it generates from that seed are checked by eye to see whether they read as meaningful text.

      ex) seed   : "alice look"
          output : "alice looked at the mouse was a trite than she .."
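The windowing described above can be tried on a tiny string before running it on the whole book. This is a minimal sketch; `make_windows` is a hypothetical helper name, not part of the notebook, and it only reproduces the three example pairs listed above.

```python
def make_windows(text, seq_len=10, step=1):
    """Split text into fixed-length windows and the character following each one."""
    inputs, labels = [], []
    for i in range(0, len(text) - seq_len, step):
        inputs.append(text[i:i + seq_len])   # 10-character input window
        labels.append(text[i + seq_len])     # the character right after it
    return inputs, labels


inputs, labels = make_windows("alice learned")
for window, label in zip(inputs, labels):
    print(repr(window), "->", repr(label))
# 'alice lear' -> 'n'
# 'lice learn' -> 'e'
# 'ice learne' -> 'd'
```

A string of length N yields N - 10 windows, which is why a 13-character string gives three pairs.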

In [1]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Activation, SimpleRNN, Dropout
from tensorflow.keras.models import Sequential
import numpy as np

Alice in Wonderland Text File

In [2]:
from urllib.request import urlopen

r = urlopen("http://www.gutenberg.org/files/11/11.txt")
fin = r.readlines()   # list of lines, each a bytes object

In [3]:
fin[:10]

[b"Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll\r\n",
 b'\r\n',
 b'This eBook is for the use of anyone anywhere at no cost and with\r\n',
 b'almost no restrictions whatsoever.  You may copy it, give it away or\r\n',
 b're-use it under the terms of the Project Gutenberg License included\r\n',
 b'with this eBook or online at www.gutenberg.org\r\n',
 b'\r\n',
 b'\r\n',
 b"Title: Alice's Adventures in Wonderland\r\n",
 b'\r\n']

Strip whitespace, convert everything to lowercase, and decode the bytes into strings

In [4]:
lines = []

for line in fin:
    line = line.strip().lower()
    line = line.decode("ascii", "ignore")
    if len(line) == 0:
        continue
    lines.append(line)

text = " ".join(lines)

In [5]:
text[:1000]

"project gutenberg's alice's adventures in wonderland, by lewis carroll this ebook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever.  you may copy it, give it away or re-use it under the terms of the project gutenberg license included with this ebook or online at www.gutenberg.org title: alice's adventures in wonderland author: lewis carroll posting date: june 25, 2008 [ebook #11] release date: march, 1994 [last updated: december 20, 2011] language: english character set encoding: ascii *** start of this project gutenberg ebook alice's adventures in wonderland *** alice's adventures in wonderland lewis carroll the millennium fulcrum edition 3.0 chapter i. down the rabbit-hole alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought alice 'without p

### Build the lookup tables

- Use the characters that appear in the text to build character-to-index and index-to-character lookup tables.

In [6]:
chars = set(text)            # unique characters in the corpus
nb_chars = len(chars)

char2index = {c: i for i, c in enumerate(chars)}
index2char = {i: c for i, c in enumerate(chars)}
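Because a Python set of strings has no guaranteed iteration order across interpreter runs, the index assigned to each character can change from one session to the next. If reproducible indices matter (for example, when a saved model is reloaded later), one option is to sort the characters first. The sketch below uses a short made-up sample string and hypothetical names (`vocab`, `char_to_idx`, `idx_to_char`) rather than the notebook's variables.

```python
sample = "alice was beginning"          # stand-in for the full text

vocab = sorted(set(sample))             # sorted, so the order is the same on every run
char_to_idx = {c: i for i, c in enumerate(vocab)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

# Round trip: characters -> indices -> characters
encoded = [char_to_idx[c] for c in sample]
decoded = "".join(idx_to_char[i] for i in encoded)

print(vocab)
print(decoded == sample)   # True
```

Note that sorting changes the index values, so the concrete indices quoted elsewhere in this notebook would not match a table built this way.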

### Build the input and label data

- Each run of 10 consecutive characters is an input, and the character that comes next is its label.

In [7]:
SEQLEN = 10
STEP = 1

input_data = []
label_data = []

for i in range(0, len(text) - SEQLEN, STEP):
    input_data.append(text[i:i + SEQLEN])
    label_data.append(text[i+SEQLEN])
    
print(len(input_data), len(label_data))

161793 161793


In [8]:
input_data[:10]

['project gu',
 'roject gut',
 'oject gute',
 'ject guten',
 'ect gutenb',
 'ct gutenbe',
 't gutenber',
 ' gutenberg',
 "gutenberg'",
 "utenberg's"]

In [9]:
label_data[:10]

['t', 'e', 'n', 'b', 'e', 'r', 'g', "'", 's', ' ']

### vectorize input data


Shape of the input data : (data_size, time_steps, features)  

Shape of the output data : (data_size, features)

**One-hot encoding** 

Each input window: 10 x 57

'project gu' ==>  char2index['p'] : 36,  char2index['r'] : 45, char2index['o'] : 49

p: [0., 0., 0., ..., 0., 1., 0., ..., 0., 0., 0.]  (1 at index 36)  
r: [0., 0., 0., ..., 0., 1., 0., ..., 0., 0., 0.]  (1 at index 45)  
o: [0., 0., 0., ..., 0., 1., 0., ..., 0., 0., 0.]  (1 at index 49)

label ==>  char2index['t'] : 1   

Each label: 1 x 57

t: [0., 1., 0., ..., 0., 0., 0.]  (1 at index 1)

The indices shown here come from one particular run; they depend on the iteration order of the character set and can differ between runs.

In [29]:
X = np.zeros((len(input_data), SEQLEN, nb_chars))
y = np.zeros((len(label_data), nb_chars))

for i, input_chars in enumerate(input_data):
    for j, ch in enumerate(input_chars):
        X[i, j, char2index[ch]] = 1              
    y[i, char2index[label_data[i]]] = 1
    
print(X.shape, y.shape)

(161793, 10, 57) (161793, 57)
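The nested loops above are easy to follow, but the same one-hot tensor can also be built by indexing an identity matrix with an array of character indices. The sketch below is an alternative formulation, not the notebook's method, and it uses a toy vocabulary and hypothetical names (`c2i`, `windows`, `X_toy`) so that it runs on its own.

```python
import numpy as np

# Toy data standing in for the notebook's text and windows
vocab = sorted(set("hello world"))
c2i = {c: i for i, c in enumerate(vocab)}
windows = ["hello", "ello ", "llo w"]

# Integer index array of shape (n_windows, seq_len)
idx = np.array([[c2i[ch] for ch in w] for w in windows])

# Row k of the identity matrix is the one-hot vector for index k,
# so fancy indexing yields shape (n_windows, seq_len, vocab_size)
X_toy = np.eye(len(vocab))[idx]

print(idx.shape, X_toy.shape)
```

Each position in `X_toy` contains exactly one 1, so the sum over the last axis is 1 everywhere.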


Model build

In [30]:
model = Sequential()
model.add(SimpleRNN(256, input_shape=(SEQLEN, nb_chars)))
model.add(Dropout(0.2))
model.add(Dense(nb_chars))
model.add(Activation("softmax"))

model.compile(loss="categorical_crossentropy", optimizer="adam")
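To get a feel for the size of this model, the parameter counts can be worked out by hand. The sketch below assumes a vocabulary of 57 characters, as in the shapes printed earlier, and the usual SimpleRNN layout of an input kernel, a recurrent kernel, and a bias; calling `model.summary()` should report the same numbers.

```python
units = 256
nb_chars = 57   # assumed vocabulary size, taken from the printed shapes

# SimpleRNN: input kernel (57 x 256) + recurrent kernel (256 x 256) + bias (256)
rnn_params = units * (nb_chars + units + 1)

# Dense: weights (256 x 57) + bias (57)
dense_params = units * nb_chars + nb_chars

# Dropout and Activation layers have no trainable parameters
total_params = rnn_params + dense_params

print(rnn_params, dense_params, total_params)
# 80384 14649 95033
```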

In [31]:
model.fit(X, y, epochs=20, batch_size=128)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x1de006a4f98>

### Check whether the model generates meaningful text from a seed

In [32]:
index2char[np.argmax(model.predict(X[0:1, :]))]

't'

In [33]:
Xtest = np.zeros((1, SEQLEN, nb_chars))   # one sample, 10 time steps, one-hot over the vocabulary

for i, ch in enumerate("what is th"):
    Xtest[0, i, char2index[ch]] = 1

pred = model.predict(Xtest)          # next character
index2char[np.argmax(pred)]

'e'

In [34]:
# select a random 10-character seed from the input windows
test_idx = np.random.randint(len(input_data))
test_chars = input_data[test_idx]
    
print("Generating from seed : ", test_chars, end="\n")

Generating from seed :  e to measu


In [35]:
# generate sentence from the seed words
for _ in range(1000):
    # one-hot encoding (10 X 57)
    Xtest = np.zeros((1, SEQLEN, nb_chars))

    for i, ch in enumerate(test_chars):
        Xtest[0, i, char2index[ch]] = 1

    pred = model.predict(Xtest)      # next character

    ypred_ch = index2char[np.argmax(pred)]

    print(ypred_ch, end="")
    # shift the window by one character
    test_chars = test_chars[1:] + ypred_ch

sed to herself, 'i con't got the door as it was good all the rabbit have to herself, 'i con't got the door as it was good all the rabbit have to herself, 'i con't got the door as it was good all the rabbit have to herself, 'i con't got the door as it was good all the rabbit have to herself, 'i con't got the door as it was good all the rabbit have to herself, 'i con't got the door as it was good all the rabbit have to herself, 'i con't got the door as it was good all the rabbit have to herself, 'i con't got the door as it was good all the rabbit have to herself, 'i con't got the door as it was good all the rabbit have to herself, 'i con't got the door as it was good all the rabbit have to herself, 'i con't got the door as it was good all the rabbit have to herself, 'i con't got the door as it was good all the rabbit have to herself, 'i con't got the door as it was good all the rabbit have to herself, 'i con't got the door as it was good all the rabbit have to herself, 'i con't got the d
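
The generated text above falls into a loop because `np.argmax` always picks the single most likely character: once a 10-character window repeats, everything after it repeats as well. One common way to break such loops, which the notebook itself does not use, is to sample the next character from the predicted distribution, optionally sharpened or flattened by a temperature. The helper `sample_index` below is a hypothetical name introduced for illustration.

```python
import numpy as np


def sample_index(probs, temperature=1.0, rng=None):
    """Sample an index from a probability vector rescaled by a temperature.

    Low temperatures approach argmax; higher ones give more varied output.
    """
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-12) / temperature   # small constant avoids log(0)
    logits -= logits.max()                         # numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()                         # renormalize to sum to 1
    return int(rng.choice(len(scaled), p=scaled))


# Example with a made-up 3-way distribution
print(sample_index([0.1, 0.7, 0.2], temperature=0.5))
```

In the generation loop, this would mean replacing `np.argmax(pred)` with something like `sample_index(pred[0], temperature=0.5)`; the temperature value is a tuning choice, not a recommendation from the source.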