# Generate an Aqua Markov chain

Created by [MarsRon](https://marsron.name.my). Feel free to contact me if you have any questions.

We will be using Google Colaboratory to generate the model.

1. Upload this Python notebook file to your Google Drive and open it with https://colab.research.google.com
2. Run all the code cells below, changing the configuration as needed.

## Libraries

Install `markovify` (a Markov chain library) and `spacy` (a natural language processing library), then download spaCy's small English model, `en_core_web_sm`.

In [None]:
%pip install spacy
%pip install markovify
!python -m spacy download en_core_web_sm

Import the libraries.

In [201]:
import re
import spacy
import markovify

## Training data

You can get the training data (KonoSuba light novel) from https://github.com/MarsRon/kazuma.

### Read training data

To read the training data, you can either use:

**Google Drive method:**
1. Upload the training data to Google Drive
2. Uncomment the following code block
3. Change the path to the folder where you put the training data
4. Continue running all code blocks

**OR**

**Session storage (temporary storage) method:**
1. Upload the training data to session storage (menu on the left)
2. Continue running all code blocks

In [None]:
# Google Drive method

# from google.colab import drive
# drive.mount('/content/drive')

# import os
# os.chdir('/content/drive/MyDrive/aqua')

Change the file name `./training-data.txt` as needed.

In [203]:
# Full light novel
with open('./training-data.txt', 'r', encoding='utf-8') as file:
  novel = file.read()

# Print first 500 characters
print(novel[:500])

“Satou Kazuma-san, welcome to the afterlife. Unfortunately, you’ve died. It might’ve been short, but your life’s now over.”
Someone suddenly spoke to me in a pure white room.
The sudden turn of events confused me.
In the room was an office desk and a chair, and the one who announced that my life was over sat on said chair.
If there was a goddess, she had to be it.
Her beauty was beyond the idols shown on television; she had a glamour that surpassed humans.
She had long, silky smooth blue hair.
S


### Cleanse training data

Here we clean up the training data by removing quotation marks and
normalizing the spaces around punctuation marks.

In [204]:
re_punctuation = re.compile(r' ([.,?!;:’)…–-]|n’t)|([‘(]) ')
re_ellipsis = re.compile(r'(?<=\W)(…) +')

def cleanup(text: str) -> str:
  # Remove quotes: “ and ”
  text = text.replace('“', '').replace('”', '')
  # Remove extra space around punctuation: "Eat . Eat" -> "Eat. Eat"
  # (\1 keeps closing punctuation, \2 keeps an opening ‘ or ( that precedes the space)
  text = re.sub(re_punctuation, r'\1\2', text)
  # Remove extra space after … starts a sentence
  text = re.sub(re_ellipsis, r'\1', text)
  # Remove extra whitespaces in between words
  text = ' '.join(text.split())

  return text

novel = cleanup(novel)

# Print first 500 characters
print(novel[:500])

Satou Kazuma-san, welcome to the afterlife. Unfortunately, you’ve died. It might’ve been short, but your life’s now over. Someone suddenly spoke to me in a pure white room. The sudden turn of events confused me. In the room was an office desk and a chair, and the one who announced that my life was over sat on said chair. If there was a goddess, she had to be it. Her beauty was beyond the idols shown on television; she had a glamour that surpassed humans. She had long, silky smooth blue hair. She
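To see what the cleanup does on a small input, here is a minimal sketch. It uses a simplified stand-in pattern (`demo_punct`, closing punctuation only) and a hypothetical `demo_cleanup` helper rather than the notebook's full `re_punctuation`, so treat it as an illustration of the idea, not the exact cell above.

```python
import re

# Simplified stand-in for re_punctuation: a space followed by closing
# punctuation or by the contraction suffix n’t (assumption: demo only)
demo_punct = re.compile(r' ([.,?!;:’)…]|n’t)')

def demo_cleanup(text: str) -> str:
  # Remove curly double quotes
  text = text.replace('“', '').replace('”', '')
  # Drop the space before the captured punctuation
  text = demo_punct.sub(r'\1', text)
  # Collapse runs of whitespace into single spaces
  return ' '.join(text.split())

print(demo_cleanup('“Do n’t run , Kazuma !”  she said .'))
# Don’t run, Kazuma! she said.
```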


## Generate Markov chain model 

Limit the training data to 2,500,000 characters because of the memory limit on Google Colab.

Note: You can raise the limit if you have enough memory to run it without crashing.

**This step takes a while (around 5-10 minutes)**, so feel free to do something else while it's running and come back later :3

In [205]:
CUTOFF_LENGTH = 2500000

# Parse novel with natural language processing model
nlp = spacy.load('en_core_web_sm')
nlp.max_length = CUTOFF_LENGTH
novel_doc = nlp(novel[:CUTOFF_LENGTH])

# Concatenate all sentences into one string
novel_sents = ' '.join([sent.text for sent in novel_doc.sents if len(sent.text) > 1])
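The plain slice `novel[:CUTOFF_LENGTH]` can end in the middle of a sentence. If that matters to you, one option is to cut at the last full stop before the limit. The sketch below assumes a hypothetical helper named `truncate_at_sentence`; it is not part of the notebook or of any library.

```python
def truncate_at_sentence(text: str, limit: int) -> str:
  """Cut text to at most `limit` characters, ending on the last '. ' if one exists."""
  if len(text) <= limit:
    return text
  head = text[:limit]
  end = head.rfind('. ')
  # Keep the full stop itself; fall back to the plain slice if none was found
  return head[:end + 1] if end != -1 else head

sample = 'Aqua cried. Kazuma sighed. Megumin cast Explosion.'
print(truncate_at_sentence(sample, 30))
# Aqua cried. Kazuma sighed.
```

In the notebook this would replace the slice, e.g. `nlp(truncate_at_sentence(novel, CUTOFF_LENGTH))`.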

Here we use the NLP model to tag every word with its part of speech, so the
Markov model obeys sentence structure better than a naive word-only model.

In [206]:
# https://github.com/jsvine/markovify#extending-markovifytext
class POSifiedText(markovify.Text):
  def word_split(self, sentence):
    return ['::'.join((word.orth_, word.pos_)) for word in nlp(sentence)]

  def word_join(self, words):
    sentence = ' '.join(word.split('::')[0] for word in words)
    # Discard extra space around punctuation: "Eat . Eat" -> "Eat. Eat"
    # (\1 keeps closing punctuation, \2 keeps an opening ‘ or ( that precedes the space)
    sentence = re.sub(re_punctuation, r'\1\2', sentence)
    # Discard extra space after … starts a sentence
    sentence = re.sub(re_ellipsis, r'\1', sentence)
    return sentence

# Create the model
aqua = POSifiedText(novel_sents, state_size=2, well_formed=False)
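To make `state_size=2` and the `word::POS` tokens more concrete, here is a rough pure-Python sketch of the idea. It is not how markovify is implemented internally; `build_chain` is a hypothetical helper, and the part-of-speech tags are written by hand instead of coming from spaCy.

```python
from collections import defaultdict

def build_chain(tokens, state_size=2):
  """Map each run of `state_size` tokens to the tokens seen right after it."""
  chain = defaultdict(list)
  for i in range(len(tokens) - state_size):
    state = tuple(tokens[i:i + state_size])
    chain[state].append(tokens[i + state_size])
  return dict(chain)

def word_join(tagged_words):
  # Drop the ::POS suffix and glue the words back together
  return ' '.join(token.split('::')[0] for token in tagged_words)

# Hand-tagged tokens in the same word::POS form that word_split produces
tagged = [
  'Aqua::PROPN', 'cast::VERB', 'a::DET', 'spell::NOUN', '.::PUNCT',
  'Aqua::PROPN', 'cast::VERB', 'her::PRON', 'spell::NOUN', '.::PUNCT',
]

chain = build_chain(tagged)
print(chain[('Aqua::PROPN', 'cast::VERB')])
# ['a::DET', 'her::PRON']
print(word_join(tagged[:5]))
# Aqua cast a spell .
```

With two tokens of context, the next word is picked from whatever followed that exact pair in the training data. The joined sentence still has a space before the full stop, which is why `word_join` in the notebook runs the punctuation regexes afterwards.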

## Testing

Now that the model has finished generating, let's generate a few sentences.

`tries=100` means markovify makes up to 100 attempts to generate a sentence. If every attempt fails, the call returns `None`.

In [None]:
# Generate normal sentences
for i in range(5):
  print(aqua.make_sentence(tries=100))

print()

# Generate short sentences of at most 50 characters
for i in range(5):
  print(aqua.make_short_sentence(max_chars=50, tries=100))

It’s not the time for us.
Aqua cast her spell was cast, this is a low- level adventurers, a considerable injury from that guy, that hurts!
Not just Aqua and me.
What happened to the shocked village chief, while your real name, do you act so scared of me, bitch!
… For some reason, the staff told me disinterestedly  It’s still too early for me to continue?

Wiz was really evil.
I’m counting on you to respond without hesitation.
Hide inside him and avoid blurting out her arms.
Look, here she was dangerous.
Why was this enthusiastic.
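`make_sentence` returns `None` when it cannot build a sentence within the given number of tries, so code that uses the result may want a fallback. A minimal sketch, assuming a hypothetical helper `sentence_or_fallback` and a stand-in class so it runs without the trained model:

```python
def sentence_or_fallback(model, fallback='...', **kwargs):
  """Return a generated sentence, or `fallback` if the model gives up."""
  sentence = model.make_sentence(**kwargs)
  return sentence if sentence is not None else fallback

class AlwaysFails:
  # Stand-in for a model that never manages to produce a sentence
  def make_sentence(self, **kwargs):
    return None

print(sentence_or_fallback(AlwaysFails(), tries=100))
# ...
```

With the real model, the call would look like `sentence_or_fallback(aqua, tries=100)`.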


## Saving

Finally, we compile and save the model to disk/Google Drive.

If you are using session storage, remember to download the model before closing the tab!

In [208]:
aqua.compile(inplace=True)

with open('aqua.json', 'w') as file:
  file.write(aqua.to_json())
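To use the saved file later, the JSON can be read back through the model class's `from_json` class method, which markovify provides on `markovify.Text` and its subclasses. The sketch below assumes a hypothetical helper `load_model` and uses a stand-in class (`EchoModel`) so it runs without markovify installed.

```python
import json
import os
import tempfile

def load_model(path, model_cls):
  """Read a saved JSON file and rebuild the model via model_cls.from_json."""
  with open(path, 'r', encoding='utf-8') as file:
    return model_cls.from_json(file.read())

class EchoModel:
  # Stand-in exposing the same from_json entry point (demo only)
  def __init__(self, data):
    self.data = data

  @classmethod
  def from_json(cls, json_str):
    return cls(json.loads(json_str))

# Write a tiny JSON file and load it back
with tempfile.TemporaryDirectory() as tmp:
  demo_path = os.path.join(tmp, 'demo.json')
  with open(demo_path, 'w', encoding='utf-8') as file:
    file.write(json.dumps({'state_size': 2}))
  restored = load_model(demo_path, EchoModel)

print(restored.data)
# {'state_size': 2}
```

For the model saved above, the call would presumably be `load_model('aqua.json', POSifiedText)`, with the `POSifiedText` class definition available in the script that loads it.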

Now you can use `main.py` to create an API web server using Aqua.

Thanks for reading this and I hope you can build your own sentence generator as well :D