[Refactor] Revise readme file.
DevinZ1993 committed Jul 21, 2018
1 parent d7082ce commit 6e0c0f2
Showing 33 changed files with 346 additions and 313,562 deletions.
116 changes: 69 additions & 47 deletions README.md
@@ -1,77 +1,99 @@
# Planning-based Poetry Generation

A classical Chinese quatrain generator based on the RNN encoder-decoder framework.

Two 4-layer LSTM networks are used as encoder and decoder respectively.
The encoder takes as input four keywords provided by a poem planner,
and the decoder generates a quatrain character by character.
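
For illustration, here is a minimal TensorFlow 1.8 sketch of such a keyword-to-quatrain encoder-decoder. This is not the repository's actual generate.py; all dimensions and names are assumptions, and the refactored model additionally puts an attention mechanism on top of this basic shape:

    import tensorflow as tf

    VOCAB_SIZE = 6000   # assumed character-vocabulary size
    EMBED_DIM = 128     # assumed embedding width
    HIDDEN_DIM = 128    # assumed LSTM state width
    NUM_LAYERS = 4      # four-layer LSTMs, as described above

    def lstm_stack():
        # A fresh 4-layer LSTM stack for either side of the model.
        return tf.nn.rnn_cell.MultiRNNCell(
            [tf.nn.rnn_cell.BasicLSTMCell(HIDDEN_DIM) for _ in range(NUM_LAYERS)])

    keywords = tf.placeholder(tf.int32, [None, None])  # [batch, keyword_len]
    targets = tf.placeholder(tf.int32, [None, None])   # [batch, poem_len]

    embedding = tf.get_variable('embedding', [VOCAB_SIZE, EMBED_DIM])

    with tf.variable_scope('encoder'):
        # Encode the planner's keywords into a final state.
        _, enc_state = tf.nn.dynamic_rnn(
            lstm_stack(), tf.nn.embedding_lookup(embedding, keywords),
            dtype=tf.float32)

    with tf.variable_scope('decoder'):
        # Decode the quatrain one character at a time,
        # starting from the encoder's final state.
        dec_outputs, _ = tf.nn.dynamic_rnn(
            lstm_stack(), tf.nn.embedding_lookup(embedding, targets),
            initial_state=enc_state)

    logits = tf.layers.dense(dec_outputs, VOCAB_SIZE)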

The original repository is [here](https://github.com/DevinZ1993/Chinese-Poetry-Generation),
which also hosts the raw data files needed to train the model.
The raw data files were collected from the Internet, mostly from similar open-source projects:

    raw/
    ├── ming.all
    ├── pinyin.txt
    ├── qing.all
    ├── qsc_tab.txt
    ├── qss_tab.txt
    ├── qtais_tab.txt
    ├── qts_tab.txt
    ├── shixuehanying.txt
    ├── stopwords.txt
    └── yuan.all

Here I tried to implement the planning-based architecture proposed in
[Wang et al. 2016](https://arxiv.org/abs/1610.09889),
though the technical details may differ from the original paper.
My purpose was not to refine the neural network model and chase better results myself.
Rather, I wish to <b>provide a simple framework, as described in the paper, along with
convenient data processing toolkits</b> for anyone who wants to experiment with their
ideas on this interesting task.

As of June 2018, this project has been refactored into Python 3 using TensorFlow 1.8.

## Code Organization

![Structure of Code](img/structure.jpg)

The diagram above illustrates the major dependencies in
this codebase, in terms of both data and functionality.
I tried to organize the code around data,
making every data processing module a singleton at runtime.
Batch processing is done only when the produced result
is missing or outdated.
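
As a minimal sketch of this pattern (the file paths and class name below are hypothetical, not the repository's actual modules), a data module can expose a singleton that regenerates its product only when the product file is missing or older than its sources:

    import os

    _RAW_PATH = 'raw/corpus.txt'         # hypothetical source file
    _PROCESSED_PATH = 'data/corpus.txt'  # hypothetical produced file

    def _is_stale(product, sources):
        # A product is stale if it is missing or older than any source.
        if not os.path.exists(product):
            return True
        mtime = os.path.getmtime(product)
        return any(os.path.getmtime(src) > mtime for src in sources)

    class Corpus:

        _instance = None

        @classmethod
        def instance(cls):
            # Lazily construct the singleton on first use.
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

        def __init__(self):
            # Batch-process only when the result is missing or outdated.
            if _is_stale(_PROCESSED_PATH, [_RAW_PATH]):
                self._regenerate()
            with open(_PROCESSED_PATH) as fin:
                self.lines = fin.read().splitlines()

        def _regenerate(self):
            with open(_RAW_PATH) as fin, open(_PROCESSED_PATH, 'w') as fout:
                for line in fin:
                    fout.write(line.strip() + '\n')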


## Dependencies

* Python 3.6.5

* [Numpy 1.14.4](http://www.numpy.org/)

* [TensorFlow 1.8](https://www.tensorflow.org/)

* [Jieba 0.39](https://github.com/fxsjy/jieba)

* [Gensim 2.0.0](https://radimrehurek.com/gensim/)
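
Assuming a CPU-only setup, these can typically be installed from PyPI with pinned versions:

    pip install numpy==1.14.4 tensorflow==1.8.0 jieba==0.39 gensim==2.0.0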


## Data Processing

Run the following command to generate training data from the source text data:

    ./data_utils.py

Depending on your hardware, this can take anywhere from a few minutes to over an hour.
The keyword extraction is based on the TextRank algorithm,
which can take a long time to converge.
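
For a quick feel of TextRank-based keyword extraction, here is a toy example using Jieba's built-in implementation (an illustration only, not necessarily how data_utils.py extracts keywords):

    import jieba.analyse

    text = '白日依山尽，黄河入海流。欲穷千里目，更上一层楼。'
    # Rank tokens with TextRank and keep the top four as candidate keywords;
    # the exact output depends on Jieba's segmentation of classical Chinese.
    print(jieba.analyse.textrank(text, topK=4))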

## Training

The poem planner is based on Gensim's Word2Vec module.
To train it, simply run:

    ./train.py -p
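
A minimal Gensim sketch of how such a planner model can be trained and queried (the toy corpus, parameters, and save path are assumptions, not the planner's actual configuration):

    from gensim.models import Word2Vec

    # Each training sample is a list of tokens, e.g. words from segmented poems.
    sentences = [['春', '花', '秋', '月'], ['大', '江', '东', '去']]

    model = Word2Vec(sentences, size=128, min_count=1, workers=4)
    model.save('data/word2vec.model')  # hypothetical path

    # The planner can then expand a user-supplied keyword into related ones.
    print(model.wv.most_similar('春', topn=3))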

The poem generator is implemented as an encoder-decoder model with an attention mechanism.
To train it, run:

    ./train.py -g

You can also train both models together by running:

    ./train.py -a

To erase all trained models, run:

    ./train.py --clean
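
Putting the four modes together, a plausible sketch of train.py's command-line dispatch could look like this (the helper functions are hypothetical placeholders, not the script's real internals):

    import argparse

    def train_planner():
        print('training planner ...')        # placeholder for the real routine

    def train_generator():
        print('training generator ...')      # placeholder for the real routine

    def clean_models():
        print('erasing trained models ...')  # placeholder for the real routine

    def main():
        parser = argparse.ArgumentParser(description='Train the poetry models.')
        parser.add_argument('-p', dest='planner', action='store_true',
                            help='train the poem planner')
        parser.add_argument('-g', dest='generator', action='store_true',
                            help='train the poem generator')
        parser.add_argument('-a', dest='all', action='store_true',
                            help='train both models')
        parser.add_argument('--clean', action='store_true',
                            help='erase all trained models')
        args = parser.parse_args()
        if args.clean:
            clean_models()
            return
        if args.planner or args.all:
            train_planner()
        if args.generator or args.all:
            train_generator()

    if __name__ == '__main__':
        main()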

As it turned out, the attention-based generator model was hard to train well after the refactoring:
the average loss typically gets stuck at about 5.6 and does not go down any further.
There should be considerable room for improvement here.

## Run Tests

Once training has finished, start the interactive program:

    ./main.py

Each time you type in a hint text in Chinese,
it should return a poem, most likely gibberish for now.
It is up to you to improve the models and training methods
to make them work better.
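
A minimal sketch of such an interactive loop (the Planner and Generator names are assumptions based on plan.py and generate.py, not a documented API):

    from plan import Planner        # assumed module layout
    from generate import Generator

    planner = Planner()
    generator = Generator()
    while True:
        hints = input('Hint text> ').strip()
        if not hints:
            break
        keywords = planner.plan(hints)  # e.g. four keywords for a quatrain
        for sentence in generator.generate(keywords):
            print(sentence)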

## Improve It

* To add a data processing tool, consider registering its dependency config in
  \_\_dependency\_dict in [paths.py](./paths.py);
  this helps processed data get regenerated automatically when it goes stale
  (see the sketch after this list).

* To improve the planning model,
please refine the planner class in [plan.py](./plan.py).

* To improve the generation model,
please refine the generator class in [generate.py](./generate.py).
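
For instance, such a dependency config might take the following shape (a hypothetical illustration; consult paths.py for the real keys and paths):

    # Each produced file maps to the source files it is derived from,
    # so the toolkit can regenerate it whenever any source is newer.
    __dependency_dict = {
        'data/char_dict.txt': ['raw/qts_tab.txt', 'raw/qss_tab.txt'],
        'data/plan_data.txt': ['data/char_dict.txt', 'raw/shixuehanying.txt'],
    }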

16 changes: 12 additions & 4 deletions char_dict.py
@@ -13,6 +13,12 @@
'yuan.all', 'ming.all', 'qing.all']


# Sentinel symbols that mark the start and the end of a sentence.
def start_of_sentence():
    return '^'

def end_of_sentence():
    return '$'

def _gen_char_dict():
    print("Generating dictionary from corpus ...")

@@ -43,17 +49,17 @@ def __init__(self):
        self._int2char = []
        self._char2int = dict()
        # Add start-of-sentence symbol.
        self._int2char.append(start_of_sentence())
        self._char2int[start_of_sentence()] = 0
        with open(char_dict_path, 'r') as fin:
            idx = 1
            for ch in fin.read():
                self._int2char.append(ch)
                self._char2int[ch] = idx
                idx += 1
        # Add end-of-sentence symbol.
        self._int2char.append(end_of_sentence())
        self._char2int[end_of_sentence()] = len(self._int2char) - 1

    def char2int(self, ch):
        if ch not in self._char2int:
@@ -69,6 +75,8 @@ def __len__(self):
    def __iter__(self):
        return iter(self._int2char)

    def __contains__(self, ch):
        return ch in self._char2int


# For testing purposes.
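
A quick illustration of the new helpers (hypothetical usage; constructing a CharDict triggers dictionary generation from the corpus on first run):

    from char_dict import CharDict, end_of_sentence, start_of_sentence

    char_dict = CharDict()
    # The sentinels sit at the two ends of the integer range.
    assert start_of_sentence() in char_dict
    assert char_dict.char2int(start_of_sentence()) == 0
    assert char_dict.char2int(end_of_sentence()) == len(char_dict) - 1
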
37 changes: 0 additions & 37 deletions check_file.py

This file was deleted.

49 changes: 0 additions & 49 deletions cluster.py

This file was deleted.

52 changes: 0 additions & 52 deletions cnt_words.py

This file was deleted.

41 changes: 0 additions & 41 deletions common.py

This file was deleted.
