<a href="https://colab.research.google.com/github/Coyote-Schmoyote/text-generation-ja/blob/main/text_generator_keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>
<h1> Akutagawa Diary Generator</h1>
<hr>
</center>
<img src="https://drive.google.com/uc?id=1sOMfHQIvxe9ihvnVnEom_iX1Rt4d8MfV" align="right" width="400">

## 1. Problem Definition
In this notebook, we will look into a method of generating text based on Akutagawa Ryunosuke's work "Life of a Fool". We will create a simple character-based text generator using keras Sequential model and an LSTM recurrent neural network.

## 2. Data 
The data was downloaded from the Internet Library Aozora bunko (https://www.aozora.gr.jp/), which features more than 17,000 text by various Japanese and foreign authors. 

## 3. Approach
In this project, we will follow these steps:
1. Import the tools and data
2. Preprocess data
3. Prepare data for text generation
4. Build a machine learning model
5. Train the model



## Import the tools
First, let's import all the necessary libraries. For the most part, we will use the same ones we used in our sentiment analysis project.

In [1]:
import os
import numpy as np
import pandas as pd
import random
import re
from bs4 import BeautifulSoup
from tensorflow import keras

Now let's import our text file. 

In [2]:
project_folder = "/content/drive/MyDrive/E資格機械学習プロジェクト/文章生成/"
filename = "aru_ahono_issho.txt"
path = os.path.join(project_folder, filename)

with open(path, encoding="ShiftJIS") as file:
  text = file.read()

text[:1000]

'\n\n［＃ここから３字下げ］\n\u3000僕はこの原稿を発表する可否は勿論、発表する時や機関も君に一任したいと思つてゐる。\n\u3000君はこの原稿の中に出て来る大抵の人物を知つてゐるだらう。しかし僕は発表するとしても、インデキスをつけずに貰ひたいと思つてゐる。\n\u3000僕は今最も不幸な幸福の中に暮らしてゐる。しかし不思議にも後悔してゐない。唯僕の如き悪夫、悪子、悪親を持つたものたちを如何《いか》にも気の毒に感じてゐる。ではさやうなら。僕はこの原稿の中では少くとも意識的［＃「意識的」に傍点］には自己弁護をしなかつたつもりだ。\n\u3000最後に僕のこの原稿を特に君に托するのは君の恐らくは誰よりも僕を知つてゐると思ふからだ。（都会人と云ふ僕の皮を剥《は》ぎさへすれば）どうかこの原稿の中に僕の阿呆さ加減を笑つてくれ給へ。\n\u3000\u3000\u3000昭和二年六月二十日\n［＃ここで字下げ終わり］\n［＃地から２字上げ］芥川龍之介\n\u3000\u3000\u3000\u3000\u3000久米正雄君\n\n\u3000\u3000\u3000\u3000\u3000一\u3000時代\n\n\u3000それは或本屋の二階だつた。二十歳の彼は書棚にかけた西洋風の梯子《はしご》に登り、新らしい本を探してゐた。モオパスサン、ボオドレエル、ストリントベリイ、イブセン、シヨウ、トルストイ、……\n\u3000そのうちに日の暮は迫り出した。しかし彼は熱心に本の背文字を読みつづけた。そこに並んでゐるのは本といふよりも寧《むし》ろ世紀末それ自身だつた。ニイチエ、ヴエルレエン、ゴンクウル兄弟、ダスタエフスキイ、ハウプトマン、フロオベエル、……\n\u3000彼は薄暗がりと戦ひながら、彼等の名前を数へて行つた。が、本はおのづからもの憂い影の中に沈みはじめた。彼はとうとう根気も尽き、西洋風の梯子を下りようとした。すると傘のない電燈が一つ、丁度彼の頭の上に突然ぽかりと火をともした。彼は梯子の上に佇《たたず》んだまま、本の間に動いてゐる店員や客を見下《みおろ》した。彼等は妙に小さかつた。のみならず如何にも見すぼらしかつた。\n「人生は一行《いちぎやう》のボオドレエルにも若《し》かない。」\n\u3000彼は暫《しばら》く梯子の上からかう云ふ彼等を見渡してゐた。……\n

In [3]:
len(text)

14129

## Data Processing
As we could see in the text extract, the text requires some cleaning. We will conduct similar text preprocessing steps to the ones we already did in the sentiment analysis project: take out the `html` tags, split the text, and replace some of the special characters.

In the last project, we conducted text preprocessing for English language. While the general flow is similar for most languages, there are some nuances when preprocessing in other languages. For instance in English, when conducting tokenization, we typically separate words from each other through white spaces. After tokenization, we can do part-of-speech tagging. The Japanese language, on the other hand, doesn‘t use spaces to represent the division between the words, and it is important to know part of speech for proper tokenization. Because of that it is solved as a joined task, and is often referred to as **morpohological analysis**, and can be performed using Japanese NLP libraries, such as MeCab, GiNZA, or JANOME. Another difference is that there is no need to lowercase all the characters, because there are no capital or lower letters in Japanese.


In this project, we are creating a character-based LSTM text generator, and therefore, we don't need to conduct morphological analysis. At this step, we will simply clean the data and prepare it for training.



In [4]:
def process(text):
  #extract text
  text = BeautifulSoup(text).get_text()
  #split text into rows
  text = re.split(r"\r", text)[0]
  #remove special characters
  text = re.sub("[［］＃＠〜｀％＾＆＊＿ー＋＝×※〇]", "", text)
  #remove numbers
  text = re.sub("[0123456789]", "", text)
  #remove latin alphabet characters
  text = re.sub("[qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM]", "", text)
  #remove encoding
  text = re.sub("\u3000", "",text)
  #remove spaces
  text = re.sub(" ", "", text)
  return text

  

Now let's take a look ot the first 2000 characters of our processed text.

In [5]:
text = process(text)
print("Length of text:", len(text))
print("=======================================================================")
print(text[:2000])

Length of text: 13530
ここから３字下げ
僕はこの原稿を発表する可否は勿論、発表する時や機関も君に一任したいと思つてゐる。
君はこの原稿の中に出て来る大抵の人物を知つてゐるだらう。しかし僕は発表するとしても、インデキスをつけずに貰ひたいと思つてゐる。
僕は今最も不幸な幸福の中に暮らしてゐる。しかし不思議にも後悔してゐない。唯僕の如き悪夫、悪子、悪親を持つたものたちを如何《いか》にも気の毒に感じてゐる。ではさやうなら。僕はこの原稿の中では少くとも意識的「意識的」に傍点には自己弁護をしなかつたつもりだ。
最後に僕のこの原稿を特に君に托するのは君の恐らくは誰よりも僕を知つてゐると思ふからだ。（都会人と云ふ僕の皮を剥《は》ぎさへすれば）どうかこの原稿の中に僕の阿呆さ加減を笑つてくれ給へ。
昭和二年六月二十日
ここで字下げ終わり
地から２字上げ芥川龍之介
久米正雄君

一時代

それは或本屋の二階だつた。二十歳の彼は書棚にかけた西洋風の梯子《はしご》に登り、新らしい本を探してゐた。モオパスサン、ボオドレエル、ストリントベリイ、イブセン、シヨウ、トルストイ、……
そのうちに日の暮は迫り出した。しかし彼は熱心に本の背文字を読みつづけた。そこに並んでゐるのは本といふよりも寧《むし》ろ世紀末それ自身だつた。ニイチエ、ヴエルレエン、ゴンクウル兄弟、ダスタエフスキイ、ハウプトマン、フロオベエル、……
彼は薄暗がりと戦ひながら、彼等の名前を数へて行つた。が、本はおのづからもの憂い影の中に沈みはじめた。彼はとうとう根気も尽き、西洋風の梯子を下りようとした。すると傘のない電燈が一つ、丁度彼の頭の上に突然ぽかりと火をともした。彼は梯子の上に佇《たたず》んだまま、本の間に動いてゐる店員や客を見下《みおろ》した。彼等は妙に小さかつた。のみならず如何にも見すぼらしかつた。
「人生は一行《いちぎやう》のボオドレエルにも若《し》かない。」
彼は暫《しばら》く梯子の上からかう云ふ彼等を見渡してゐた。……

二母

狂人たちは皆同じやうに鼠色の着物を着せられてゐた。広い部屋はその為に一層憂欝に見えるらしかつた。彼等の一人はオルガンに向ひ、熱心に讃美歌を弾《ひ》きつづけてゐた。同時に又彼等の一人は丁度部屋のまん中に立ち、踊ると云ふよりも跳《は》ねまはつてゐた。
彼は血色の善《い》い医者と一し

## Prepare data for text generation
Now that we have cleaned our text, let's prepare it for text generation. This will be conducted in _ steps:
1. Filter iunique characters
2. Convert characters to numbers
3. Create list of cahracter sequences
4. Convert training data into `np.array`

#### 1. Filter unique characters
First, we need to filter all the unique characters from our text and store it in a variable. Filter all unique characters from the text with `set()` sunction, turn it into a list using `list()`, and sort it with `sorted()`. We do this to filter away any repeating characters, to make sure that we have a vocabulary of each individual character from the text. 

In [6]:
chars = sorted(list(set(text)))
print("Unique characters:", len(chars))

Unique characters: 1087


In [7]:
chars[:10]

['\n', '-', '―', '…', '、', '。', '々', '《', '》', '「']

#### 2. Create a dictionary of characters and their indexes
In the second step, we need to convert all unique cahracters to a numerical form to feed into our machine learning model. We can do this by creating two dictionaries: one with characters for keys and ther indeces for values, and the other one with indeces for keys, and cahracters for values. 
To do this, we will loop though our list of unique characters `chars`, and add them to the dictionaries. As a result, we will have a cross-referenced dictionaries with each character and a unique number associated with it.

In [8]:
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [9]:
print("Value of character 新: ",char_indices.get("新"))

Value of character 新:  546


In [10]:
print("Value of 546: ",indices_char.get(546))

Value of 546:  新


#### 3. Create list of character sequences

We will create an empty list of sentences, and an empty list of next characters. We will load the sentences into our neural network, and make it predict the next character. Essentially, we can thing of `sentences` as our features, and `next_chars` as target variables.

For example, if we have a sentence 「これ可愛」, the next cahracter will be 「い」.

When creating list of character sequences, we need to specify, how long we want each sequence to be in a variable `sequence_len`, and the number of characters we want to shift it to be shifted in a variable `step`. 


In [11]:
sequence_len = 40
step = 3

sentences = []
next_chars = []

for i in range(0, len(text) - sequence_len, step):
    sentences.append(text[i: i + sequence_len])
    next_chars.append(text[i + sequence_len])

print("Number of sequences:",len(sentences))

Number of sequences: 4497


Let's see how it actually looks. As we can see here, the variable `sentences` stores all our text in  sequences of 40 characters, and each sequence is shifted by 3 characters, based on our `step` variable. 

When our model will train, it will study the sequences first, and then, based on the order of characters in the sequence, it will attempt to predict, which character is most liekly to be next.

In [12]:
sentences[:30]

['ここから３字下げ\n僕はこの原稿を発表する可否は勿論、発表する時や機関も君に一任し',
 'ら３字下げ\n僕はこの原稿を発表する可否は勿論、発表する時や機関も君に一任したいと',
 '下げ\n僕はこの原稿を発表する可否は勿論、発表する時や機関も君に一任したいと思つて',
 '僕はこの原稿を発表する可否は勿論、発表する時や機関も君に一任したいと思つてゐる。',
 'の原稿を発表する可否は勿論、発表する時や機関も君に一任したいと思つてゐる。\n君は',
 'を発表する可否は勿論、発表する時や機関も君に一任したいと思つてゐる。\n君はこの原',
 'する可否は勿論、発表する時や機関も君に一任したいと思つてゐる。\n君はこの原稿の中',
 '否は勿論、発表する時や機関も君に一任したいと思つてゐる。\n君はこの原稿の中に出て',
 '論、発表する時や機関も君に一任したいと思つてゐる。\n君はこの原稿の中に出て来る大',
 '表する時や機関も君に一任したいと思つてゐる。\n君はこの原稿の中に出て来る大抵の人',
 '時や機関も君に一任したいと思つてゐる。\n君はこの原稿の中に出て来る大抵の人物を知',
 '関も君に一任したいと思つてゐる。\n君はこの原稿の中に出て来る大抵の人物を知つてゐ',
 'に一任したいと思つてゐる。\n君はこの原稿の中に出て来る大抵の人物を知つてゐるだら',
 'したいと思つてゐる。\n君はこの原稿の中に出て来る大抵の人物を知つてゐるだらう。し',
 'と思つてゐる。\n君はこの原稿の中に出て来る大抵の人物を知つてゐるだらう。しかし僕',
 'てゐる。\n君はこの原稿の中に出て来る大抵の人物を知つてゐるだらう。しかし僕は発表',
 '。\n君はこの原稿の中に出て来る大抵の人物を知つてゐるだらう。しかし僕は発表すると',
 'はこの原稿の中に出て来る大抵の人物を知つてゐるだらう。しかし僕は発表するとしても',
 '原稿の中に出て来る大抵の人物を知つてゐるだらう。しかし僕は発表するとしても、イン',
 '中に出て来る大抵の人物を知つてゐるだらう。しかし僕は発表するとしても、インデキス',
 'て来る大抵の人物を知つてゐるだらう。しかし僕は発表するとしても、インデキスをつけ',
 '大抵の人物を知つてゐるだらう。しかし僕は発表するとしても、インデキスをつけ

In [13]:
next_chars[:10]

['た', '思', 'ゐ', '\n', 'こ', '稿', 'に', '来', '抵', '物']

#### 4. Convert data into `np.array`
The last step of preparing our data before feeding it into our model is converting it into numbers. We will create two numpy arrays full of zeroes that will be equal to the length of sentences * sequence length * length of cahracters. The dataype for this array will be boolean. 

Essentially ,we are creating a one dimension for all possible sentences that we have, one dimension for all possible individual positions of the characters, and one dimension for all the possible character. Whenever a particular character appears in a particular place of a particular sequence, we will mark it as `1` or `True`, and the rest of the positions will remain `0`. It can be thought of as somewhat simialr to One Hot Encoding.

In [14]:
x = np.zeros((len(sentences), sequence_len, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1 
    y[i, char_indices[next_chars[i]]] = 1

If we take a loot at he shape of `x`, we will see that the first number represents the number of sequences that we have, the second number represents the length of each sequence, and the last number represents the number of unique characters that we have.

In [22]:
x.shape

(4497, 40, 1087)

## Build the model
Now we are ready to build the model!
We will build a simple recurrent neural network consisting of 3 layers: an LSTM layer, a Dense layer, and an Activation layer. 
The LSTM layer computes the output using 128 LSTM units (the numebr of units can be adjusted). Next, it passes the information to the fully-connected Dense layer. Finally, the `softmax` function of the activation layer will choose the best possible next character for a particular sequence.

In [16]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

In [23]:
model = Sequential(name="text_generator")
model.add(LSTM(128, input_shape=(sequence_len, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(lr=0.01))

  super(Adam, self).__init__(name, **kwargs)


In [24]:
model.summary()

Model: "text_generator"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_1 (LSTM)               (None, 128)               622592    
                                                                 
 dense_1 (Dense)             (None, 1087)              140223    
                                                                 
 activation_1 (Activation)   (None, 1087)              0         
                                                                 
Total params: 762,815
Trainable params: 762,815
Non-trainable params: 0
_________________________________________________________________


## Train the model and generate samples
We will write 2 functions. 

First, we will create a sampling 
function that will take the predictions of our model and picks one of the characters that `softmax` activation function chose as an appropriate next character for the sequence. We can regulate how "conservative" or "creative" we want our next character choices to be by regulating parameter `diversity` (or `temperature`). The higher the diversity, the more creative or "risky" the next character will be, and the lower the diversity, the more "safe" the choice of the next character. Essentially, this helper function is responsible for picking a single next ccharacter in the sequence.

In [25]:
def sample(preds, diversity=1.0):
  preds = np.asarray(preds).astype("float64")
  preds = np.log(preds) / diversity
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)

The second function will create our model and generate samples of the text in 3 types of diversity: `0.2`, `1.0`, and `1.5`, so we can see how differently the algorithm chooses the next character with different diversity values. 

The training function will be comprised of:
1. Fitting the model on training data
2. Looping through 3 diversity values and generating the text with each setting
3. Looping through the characters and character sequences
4. Printing out the generated text
5. Save the model at the end of each epoch 

In [28]:
def create_model(epochs, x, y, batch_size, sentence, sequence_len, char_indices):
  # fit the model
  for epoch in range(epochs):
    model.fit(x, y, batch_size, epochs=1)
    print()
    print("Generating text after epoch: %d" % (epoch+1))
    print("Generating with seed: 「" + sentence + "」")
    #take 40 character sequence from random place within the length of the text
    start_index = random.randint(0, len(text) - sequence_len - 1)
    # loop through each diversity setting
    for diversity in [0.5, 1.0, 1.5]:
      # print diversity setting
      print("...Diversity:", diversity)  
      # initiate an empty string for generated text
      generated = ""
      # create a 40 character sequence from a random start index
      sentence = text[start_index : start_index + sequence_len]
      #specify how long we want our generated text to be
      num_chars = 400
      #loop through all the characters 400 times
      for i in range(num_chars):
        # empty array filled up with 0
        x_pred = np.zeros((1, sequence_len, len(chars)))
        # loop through the characters and indexes in the sentence
        for t, char in enumerate(sentence):
          #put 1 wherever the character occurs
          x_pred[0, t, char_indices[char]] = 1.0
        #make the model predict possible characters on the x_pred from the first index [0]
        preds = model.predict(x_pred, verbose=0)[0]
        #take predictions from the model and into the sample function to choose next character
        next_index = sample(preds, diversity)
        #convert index to character using our dictionary
        next_char = indices_char[next_index]
        #include this character into our sequence, so the next character will be generated
        # based on the previousl generated character
        sentence = sentence[1:] + next_char
        generated += next_char
      
      #print the generated text
      print("...Generated: ")
      print(generated)
      print("=======================================================================")
    #save the model to the output folder
    filename = "generator_model_%02d.h5" % (epoch + 1)
    output_folder = os.path.join(project_folder, "output/")
    model.save(os.path.join(output_folder, filename))

In [29]:
%%time
create_model(epochs = 30, 
             x = x,
             y = y,
             batch_size = 32,
             sentence = sentence,
             sequence_len = sequence_len,
             char_indices = char_indices)


Generating text after epoch: 1
Generating with seed: 「のこぼれてしまつた、細い剣を杖にしながら。
地から２字上げ（昭和二年六月、遺稿）」
...Diversity: 0.5
...Generated: 
。それは彼は彼のなつた。彼は彼のと言るとるの殺にの宇だのはをを見した。彼は彼の母の十《してゐた。
彼のかうは》には彼の」のこの人テののをにはのちののの《のにとのににの見の人のは彼の中のは彼のは彼のの日にの《への義をめてゐた。がれ。
そ彼は彼の中のはうはの向《に、少体のは


彼は「ひのは或かひの悪れの訣ををしてゐた。彼は彼の彼のにはそのきを《してゐた。彼は匹彼は道のげのの上にのの人にのは際の《にと一した。彼のは彼の人のの悪のを遇した。が、彼の体自ののを示のした。彼はそれは彼のえにの撃を黒にの体自をみいト彼は彼のうるの人自ををめる。……………………

彼の人のにの日のを見した。彼は彼の或身のをを見してゐた。彼の上には彼のやのなつた。彼は

彼は二のを見してゐた。それは彼のはこのと君は彼のは篠うの帯人のの部ににの《中にの屋見のりのを見した。が、三の貼《に彼の先のを《した。それは彼は彼のは身の彼
...Diversity: 1.0
...Generated: 
。――彼は十面文は先を《て云の真薇に自へド生の中に一左た《だつた彼の伽十にはこのてゐた。そふのア返トう云とれの勿上には生えことす幾籐体ににのぷは子の妻え、ロ母の子木ないこな。彼はこのこに云うきめれに》の一「を問町たのだがには。指間二彼字花体的ろ。執ゐた。………オ問も足彼恐結し結云ながられぷ青に絶作た屋いあにテ憂ば》殺ししたれを光つたモ下だつた。―
三ちばほすいは生の夫に「な沫で懐人。ね
彼は出力へにす妻く自懐ご美ことい青和の》へ一ふ友の本中の突勹うるに葉壺川側嘩ら、二や着だうとら六度イ見事歩可き》を手ひ悔の今川はも可残には角下の見屈杖《持らてゐてゐるの中の尽危のの恥ス草
十とそ彼は或度感片火との吹《にの見オ《にん近思来るま買リは到踏なや剥教》にで東いし、らの模は
そづした。が、人前一自喧しはう）もう出の抱すに対三そ人は車一ん熱ほ遅だり彼は主みの百突りけた。―そは雨見しすゐてゐてゐた。彼はゴ
...Diversity: 1.5
...Generated: 
一模

##Conclusion
As we can see, with each iteration our model was producing more and more grammatically accurate sentences, while also picking up on the stylistical characteristics of the text. The sentences semantically don't make a lot of sense, because the the model is generating the text by predicting the next character, not taking into the account the meaning of the character.

There are more advanced models for text generation that can be implemented using transfer learning, such as Open AI's GPT-2.