[Refactor] Revise readme file.
DevinZ1993 committed Jul 21, 2018
1 parent d7082ce commit 6e0c0f2
Showing 33 changed files with 346 additions and 313,562 deletions.
116 changes: 69 additions & 47 deletions README.md
@@ -1,77 +1,99 @@
# Planning-based Poetry Generation

A classical Chinese quatrain generator based on the RNN encoder-decoder framework.

Two 4-layer LSTM networks are used as encoder and decoder respectively.
The encoder takes as input four keywords provided by a poem planner,
and the decoder generates a quatrain character by character.
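
For illustration, here is a minimal TensorFlow 1.8 sketch of such a keyword-to-quatrain encoder-decoder. This is not the repository's actual generate.py; all dimensions and names are assumptions, and the refactored model additionally puts an attention mechanism on top of this basic shape:

    import tensorflow as tf

    VOCAB_SIZE = 6000   # assumed character-vocabulary size
    EMBED_DIM = 128     # assumed embedding width
    HIDDEN_DIM = 128    # assumed LSTM state width
    NUM_LAYERS = 4      # four-layer LSTMs, as described above

    def lstm_stack():
        # A fresh 4-layer LSTM stack for either side of the model.
        return tf.nn.rnn_cell.MultiRNNCell(
            [tf.nn.rnn_cell.BasicLSTMCell(HIDDEN_DIM) for _ in range(NUM_LAYERS)])

    keywords = tf.placeholder(tf.int32, [None, None])  # [batch, keyword_len]
    targets = tf.placeholder(tf.int32, [None, None])   # [batch, poem_len]

    embedding = tf.get_variable('embedding', [VOCAB_SIZE, EMBED_DIM])

    with tf.variable_scope('encoder'):
        # Encode the planner's keywords into a final state.
        _, enc_state = tf.nn.dynamic_rnn(
            lstm_stack(), tf.nn.embedding_lookup(embedding, keywords),
            dtype=tf.float32)

    with tf.variable_scope('decoder'):
        # Decode the quatrain one character at a time,
        # starting from the encoder's final state.
        dec_outputs, _ = tf.nn.dynamic_rnn(
            lstm_stack(), tf.nn.embedding_lookup(embedding, targets),
            initial_state=enc_state)

    logits = tf.layers.dense(dec_outputs, VOCAB_SIZE)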

The original repository is [here](https://github.com/DevinZ1993/Chinese-Poetry-Generation),
which also hosts the raw data files needed to train the model.
The raw data files were collected from the Internet, mostly from similar open-source projects:

    raw/
    ├── ming.all
    ├── pinyin.txt
    ├── qing.all
    ├── qsc_tab.txt
    ├── qss_tab.txt
    ├── qtais_tab.txt
    ├── qts_tab.txt
    ├── shixuehanying.txt
    ├── stopwords.txt
    └── yuan.all

Here I tried to implement the planning-based architecture proposed in
[Wang et al. 2016](https://arxiv.org/abs/1610.09889),
though the technical details may differ from the original paper.
My purpose was not to refine the neural network model and chase better results myself.
Rather, I wish to <b>provide a simple framework, as described in the paper, along with
convenient data processing toolkits</b> for anyone who wants to experiment with their
ideas on this interesting task.

As of June 2018, this project has been refactored into Python 3 using TensorFlow 1.8.

## Code Organization

![Structure of Code](img/structure.jpg)

The diagram above illustrates the major dependencies in
this codebase, in terms of both data and functionality.
I tried to organize the code around data,
making every data processing module a singleton at runtime.
Batch processing is done only when the produced result
is missing or outdated.
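
As a minimal sketch of this pattern (the file paths and class name below are hypothetical, not the repository's actual modules), a data module can expose a singleton that regenerates its product only when the product file is missing or older than its sources:

    import os

    _RAW_PATH = 'raw/corpus.txt'         # hypothetical source file
    _PROCESSED_PATH = 'data/corpus.txt'  # hypothetical produced file

    def _is_stale(product, sources):
        # A product is stale if it is missing or older than any source.
        if not os.path.exists(product):
            return True
        mtime = os.path.getmtime(product)
        return any(os.path.getmtime(src) > mtime for src in sources)

    class Corpus:

        _instance = None

        @classmethod
        def instance(cls):
            # Lazily construct the singleton on first use.
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

        def __init__(self):
            # Batch-process only when the result is missing or outdated.
            if _is_stale(_PROCESSED_PATH, [_RAW_PATH]):
                self._regenerate()
            with open(_PROCESSED_PATH) as fin:
                self.lines = fin.read().splitlines()

        def _regenerate(self):
            with open(_RAW_PATH) as fin, open(_PROCESSED_PATH, 'w') as fout:
                for line in fin:
                    fout.write(line.strip() + '\n')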


## Dependencies

* Python 3.6.5

* [Numpy 1.14.4](http://www.numpy.org/)

* [TensorFlow 1.8](https://www.tensorflow.org/)

* [Jieba 0.39](https://github.com/fxsjy/jieba)

* [Gensim 2.0.0](https://radimrehurek.com/gensim/)
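
Assuming a CPU-only setup, these can typically be installed from PyPI with pinned versions:

    pip install numpy==1.14.4 tensorflow==1.8.0 jieba==0.39 gensim==2.0.0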


## Data Processing

Run the following command to generate training data from the source text data:

    ./data_utils.py

Depending on your hardware, this can take anywhere from a few minutes to over an hour.
The keyword extraction is based on the TextRank algorithm,
which can take a long time to converge.
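
For a quick feel of TextRank-based keyword extraction, here is a toy example using Jieba's built-in implementation (an illustration only, not necessarily how data_utils.py extracts keywords):

    import jieba.analyse

    text = '白日依山尽，黄河入海流。欲穷千里目，更上一层楼。'
    # Rank tokens with TextRank and keep the top four as candidate keywords;
    # the exact output depends on Jieba's segmentation of classical Chinese.
    print(jieba.analyse.textrank(text, topK=4))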

## Training

The poem planner is based on Gensim's Word2Vec module.
To train it, simply run:

    ./train.py -p
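
A minimal Gensim sketch of how such a planner model can be trained and queried (the toy corpus, parameters, and save path are assumptions, not the planner's actual configuration):

    from gensim.models import Word2Vec

    # Each training sample is a list of tokens, e.g. words from segmented poems.
    sentences = [['春', '花', '秋', '月'], ['大', '江', '东', '去']]

    model = Word2Vec(sentences, size=128, min_count=1, workers=4)
    model.save('data/word2vec.model')  # hypothetical path

    # The planner can then expand a user-supplied keyword into related ones.
    print(model.wv.most_similar('春', topn=3))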

The poem generator is implemented as an encoder-decoder model with an attention mechanism.
To train it, run:

    ./train.py -g

You can also train both models together by running:

    ./train.py -a

To erase all trained models, run:

    ./train.py --clean
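
Putting the four modes together, a plausible sketch of train.py's command-line dispatch could look like this (the helper functions are hypothetical placeholders, not the script's real internals):

    import argparse

    def train_planner():
        print('training planner ...')        # placeholder for the real routine

    def train_generator():
        print('training generator ...')      # placeholder for the real routine

    def clean_models():
        print('erasing trained models ...')  # placeholder for the real routine

    def main():
        parser = argparse.ArgumentParser(description='Train the poetry models.')
        parser.add_argument('-p', dest='planner', action='store_true',
                            help='train the poem planner')
        parser.add_argument('-g', dest='generator', action='store_true',
                            help='train the poem generator')
        parser.add_argument('-a', dest='all', action='store_true',
                            help='train both models')
        parser.add_argument('--clean', action='store_true',
                            help='erase all trained models')
        args = parser.parse_args()
        if args.clean:
            clean_models()
            return
        if args.planner or args.all:
            train_planner()
        if args.generator or args.all:
            train_generator()

    if __name__ == '__main__':
        main()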

As it turned out, the attention-based generator model was hard to train well after the refactoring:
the average loss typically gets stuck at about 5.6 and does not go down any further.
There should be considerable room for improvement here.

## Run Tests

Once training has finished, start the interactive program:

    ./main.py

Each time you type in a hint text in Chinese,
it should return a poem, most likely gibberish for now.
It is up to you to improve the models and training methods
to make them work better.
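
A minimal sketch of such an interactive loop (the Planner and Generator names are assumptions based on plan.py and generate.py, not a documented API):

    from plan import Planner        # assumed module layout
    from generate import Generator

    planner = Planner()
    generator = Generator()
    while True:
        hints = input('Hint text> ').strip()
        if not hints:
            break
        keywords = planner.plan(hints)  # e.g. four keywords for a quatrain
        for sentence in generator.generate(keywords):
            print(sentence)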

## Improve It

* To add a data processing tool, consider registering its dependency config in
  \_\_dependency\_dict in [paths.py](./paths.py);
  this helps processed data get regenerated automatically when it goes stale
  (see the sketch after this list).

* To improve the planning model,
please refine the planner class in [plan.py](./plan.py).

* To improve the generation model,
please refine the generator class in [generate.py](./generate.py).
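
For instance, such a dependency config might take the following shape (a hypothetical illustration; consult paths.py for the real keys and paths):

    # Each produced file maps to the source files it is derived from,
    # so the toolkit can regenerate it whenever any source is newer.
    __dependency_dict = {
        'data/char_dict.txt': ['raw/qts_tab.txt', 'raw/qss_tab.txt'],
        'data/plan_data.txt': ['data/char_dict.txt', 'raw/shixuehanying.txt'],
    }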

16 changes: 12 additions & 4 deletions char_dict.py
@@ -13,6 +13,12 @@
'yuan.all', 'ming.all', 'qing.all']


# Sentinel symbols that mark the start and the end of a sentence.
def start_of_sentence():
    return '^'

def end_of_sentence():
    return '$'

def _gen_char_dict():
    print("Generating dictionary from corpus ...")

@@ -43,17 +49,17 @@ def __init__(self):
        self._int2char = []
        self._char2int = dict()
        # Add start-of-sentence symbol.
        self._int2char.append(start_of_sentence())
        self._char2int[start_of_sentence()] = 0
        with open(char_dict_path, 'r') as fin:
            idx = 1
            for ch in fin.read():
                self._int2char.append(ch)
                self._char2int[ch] = idx
                idx += 1
        # Add end-of-sentence symbol.
        self._int2char.append(end_of_sentence())
        self._char2int[end_of_sentence()] = len(self._int2char) - 1

    def char2int(self, ch):
        if ch not in self._char2int:
@@ -69,6 +75,8 @@ def __len__(self):
    def __iter__(self):
        return iter(self._int2char)

    def __contains__(self, ch):
        return ch in self._char2int


# For testing purposes.
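
A quick illustration of the new helpers (hypothetical usage; constructing a CharDict triggers dictionary generation from the corpus on first run):

    from char_dict import CharDict, end_of_sentence, start_of_sentence

    char_dict = CharDict()
    # The sentinels sit at the two ends of the integer range.
    assert start_of_sentence() in char_dict
    assert char_dict.char2int(start_of_sentence()) == 0
    assert char_dict.char2int(end_of_sentence()) == len(char_dict) - 1
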
37 changes: 0 additions & 37 deletions check_file.py

This file was deleted.

49 changes: 0 additions & 49 deletions cluster.py

This file was deleted.

52 changes: 0 additions & 52 deletions cnt_words.py

This file was deleted.

41 changes: 0 additions & 41 deletions common.py

This file was deleted.
