Commit 00c6854

📝 Writing docs.

BrikerMan committed Jun 16, 2020
1 parent d9d7baf commit 00c6854
Showing 10 changed files with 137 additions and 105 deletions.
83 changes: 44 additions & 39 deletions README.md
@@ -28,15 +28,15 @@
<h4 align="center">
<a href="#overview">Overview</a> |
<a href="#performance">Performance</a> |
<a href="#quick-start">Quick start</a> |
<a href="#installation">Installation</a> |
<a href="https://kashgari.readthedocs.io/">Documentation</a> |
<a href="https://kashgari.readthedocs.io/about/contributing/">Contributing</a>
</h4>

<!-- markdownlint-enable -->
<!-- prettier-ignore-end -->

🎉🎉🎉 We are proud to announce that we have entirely rewritten Kashgari with tf.keras. Kashgari now comes with an easier-to-understand API and is faster! 🎉🎉🎉
🎉🎉🎉 We have released version 2.0.0-alpha0 with Seq2Seq support. 🎉🎉🎉

## Overview

@@ -54,43 +54,48 @@ Kashgari is a simple and powerful NLP transfer learning framework
- **NLP beginners** Learn how to build an NLP project with production-level code quality.
- **NLP developers** Build a production-level classification/labeling model within minutes, as shown in the sketch below.
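
A minimal sketch of that workflow, assuming the 2.x API and the bundled `SMP2018ECDTCorpus` sample dataset (names may differ in your installed version):

```python
from kashgari.corpus import SMP2018ECDTCorpus
from kashgari.tasks.classification import BiLSTM_Model

# Load the bundled sample corpus (assumed helper; any list of tokenized
# sentences plus a parallel list of labels works the same way).
train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')

# Train a BiLSTM classifier with default hyper-parameters.
model = BiLSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y, epochs=5)

# Predict labels for new tokenized sentences.
print(model.predict(valid_x[:3]))
```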

## Contributors ✨
## Performance

Thanks go to these wonderful people. There are many ways to get involved: start with the [contributor guidelines](./docs/about/contributing.md), then check the open issues for specific tasks.
Performance reports are welcome; feel free to submit one.

<!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section -->
<!-- prettier-ignore-start -->
<!-- markdownlint-disable -->
<table>
<tr>
<td align="center"><a href="https://developers.google.com/community/experts/directory/profile/profile-eliyar_eziz"><img src="https://avatars1.githubusercontent.com/u/9368907?v=4" width="100px;" alt=""/><br /><sub><b>Eliyar Eziz</b></sub></a><br /><a href="https://github.com/BrikerMan/Kashgari/commits?author=BrikerMan" title="Documentation">📖</a> <a href="https://github.com/BrikerMan/Kashgari/commits?author=BrikerMan" title="Tests">⚠️</a> <a href="https://github.com/BrikerMan/Kashgari/commits?author=BrikerMan" title="Code">💻</a></td>
<td align="center"><a href="http://www.chuanxilu.com"><img src="https://avatars3.githubusercontent.com/u/856746?v=4" width="100px;" alt=""/><br /><sub><b>Alex Wang</b></sub></a><br /><a href="https://github.com/BrikerMan/Kashgari/commits?author=alexwwang" title="Code">💻</a></td>
<td align="center"><a href="https://github.com/lsgrep"><img src="https://avatars3.githubusercontent.com/u/3893940?v=4" width="100px;" alt=""/><br /><sub><b>Yusup</b></sub></a><br /><a href="https://github.com/BrikerMan/Kashgari/commits?author=lsgrep" title="Code">💻</a></td>
<td align="center"><a href="https://github.com/adlinex"><img src="https://avatars1.githubusercontent.com/u/5442229?v=4" width="100px;" alt=""/><br /><sub><b>Adline</b></sub></a><br /><a href="https://github.com/BrikerMan/Kashgari/commits?author=Adline125" title="Code">💻</a></td>
</tr>
</table>
| Task (with code link) | Language | Dataset | Score |
| -------------------------- | -------- | ------------------------- | ------- |
| Named Entity Recognition   | Chinese  | People's Daily NER Corpus | // TODO |
| Text Classification | - | - | // TODO |
| Neural machine translation | - | - | // TODO |

<!-- markdownlint-enable -->
<!-- prettier-ignore-end -->
<!-- ALL-CONTRIBUTORS-LIST:END -->

## Road Map

- [x] Based on TensorFlow 2.0+ [@BrikerMan]
- [x] Fully support generator-based training (#336, #273) [@BrikerMan]
- [ ] Clean code and full documentation
- [ ] Multi GPU/TPU Support [@BrikerMan]
- [x] Embeddings
- [x] Bare Embedding [@BrikerMan]
- [x] Word Embedding (Load trained W2V) [@BrikerMan]
- [x] BERT Embedding (Based on bert4keras, support BERT, RoBERTa, ALBERT...) (#316) [@BrikerMan]
- [x] GPT-2 Embedding
- [ ] FeaturesEmbedding (support numeric features as input)
- [ ] Stacked Embedding (stack text embeddings and feature embeddings)
- [x] Classification Task
- [x] Labeling Task
- [x] Seq2Seq Task
- [ ] Built-in Callbacks
- [x] Evaluate Callback
- [ ] Save Best Callback
- [ ] Support TensorFlow Hub (Optional)
## Installation

The project is based on Python 3.6+, because it is 2019 and type hinting is cool.

| Backend | pypi version | desc |
| ---------------- | -------------------------------------- | ------------------------ |
| TensorFlow 2.x | `pip install 'kashgari>=2.0.0a0'` | TF2 tf.keras version |
| TensorFlow 1.14+ | `pip install 'kashgari>=1.0.0,<2.0.0'` | TF1.14+ tf.keras version |
| Keras | `pip install 'kashgari<1.0.0'` | keras version |
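
If you are unsure which combination you ended up with, here is a quick sanity check (nothing here is Kashgari-specific beyond `__version__`):

```python
import tensorflow as tf
import kashgari

# Verify that the installed Kashgari matches the intended backend.
print('kashgari', kashgari.__version__)
print('tensorflow', tf.__version__)
```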

## Tutorials

Here is a set of quick tutorials to get you started with the library:

- [Tutorial 1: Text Classification](./docs/tutorial/text-classification.md)
- [Tutorial 2: Text Labeling](./docs/tutorial/text-labeling.md)
- [Tutorial 3: Seq2Seq](./docs/tutorial/seq2seq.md)
- [Tutorial 4: Language Embedding](./docs/embeddings/index.md)

There are also articles and posts that illustrate how to use Kashgari:

- [Build a Chinese Text Classification Model in 15 Minutes](https://eliyar.biz/nlp_chinese_text_classification_in_15mins/)
- [Chinese Named Entity Recognition (NER) with BERT](https://eliyar.biz/nlp_chinese_bert_ner/)
- [Text Classification and Deployment with BERT/ERNIE](https://eliyar.biz/nlp_train_and_deploy_bert_text_classification/)
- [Build a BERT-Based NER Model in Five Minutes](https://www.jianshu.com/p/1d6689851622)
- [Multi-Class Text Classification with Kashgari in 15 minutes](https://medium.com/@BrikerMan/multi-class-text-classification-with-kashgari-in-15mins-c3e744ce971d)

Examples:

- [Neural machine translation with Seq2Seq](./examples/translate_with_seq2seq.ipynb)

## Contributors ✨

Thanks go to these wonderful people. There are many ways to get involved.
Start with the [contributor guidelines](./docs/about/contributing.md), then check the open issues for specific tasks.
2 changes: 1 addition & 1 deletion docs/_static/css/extra.css
@@ -14,7 +14,7 @@ input {
}

.wy-nav-content {
max-width: 1280px;
max-width: 1000px;
}

.pre {
77 changes: 28 additions & 49 deletions docs/conf.py
@@ -1,32 +1,17 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Kashgari documentation build configuration file, created by
# sphinx-quickstart on Wed Aug 15 04:21:07 2018.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
from unittest.mock import MagicMock
import os
import sys

# Make sure kashgari is accessible without going through setup.py
dirname = os.path.dirname
sys.path.insert(0, dirname(dirname(__file__)))

import kashgari
# Mock heavy optional dependencies so autodoc can import kashgari
# on the RTD server without having to install them.
from unittest.mock import MagicMock


class Mock(MagicMock):
@@ -35,34 +35,26 @@ def __getattr__(cls, name):
return MagicMock()


if os.environ.get('READTHEDOCS') == 'True':
MOCK_MODULES = [
'keras.layers',
# 'tensorflow',
# 'tensorflow.keras',
# 'tensorflow.keras.utils',
# 'tensorflow.keras.preprocessing.sequence',
# 'tensorflow.keras.callbacks',
# 'tensorflow.keras.backend',
# 'tensorflow.keras.layers',
# 'tensorflow.python',
# 'tensorflow.python.util',
# 'tensorflow.python.util.tf_export',
'bert4keras',
'bert4keras.models',
'sklearn',
'bert4keras.layers'
]
else:
MOCK_MODULES = [

]
MOCK_MODULES = [
# 'keras.layers',
# 'tensorflow',
# 'tensorflow.keras',
# 'tensorflow.keras.utils',
# 'tensorflow.keras.preprocessing.sequence',
# 'tensorflow.keras.callbacks',
# 'tensorflow.keras.backend',
# 'tensorflow.keras.layers',
# 'tensorflow.python',
# 'tensorflow.python.util',
# 'tensorflow.python.util.tf_export',
# 'bert4keras',
# 'bert4keras.models',
# 'sklearn',
# 'bert4keras.layers'
]

sys.modules.update((mod_name, Mock()) for mod_name in MOCK_MODULES)

import kashgari
from kashgari.tasks.classification.abc_model import ABCClassificationModel
from kashgari.tasks.labeling.abc_model import ABCLabelingModel

# -- General configuration ------------------------------------------------

@@ -94,8 +71,6 @@ def __getattr__(cls, name):
'set_type_checking_flag': True
}

set_type_checking_flag = True

# 'sphinx.ext.mathjax', ??

# imgmath settings
@@ -319,8 +294,11 @@ def skip_some_classes_members(app, what, name, obj, skip, options):
# ClassDocumenter.add_directive_header = add_directive_header_no_object_base


intersphinx_mapping = {'python': ('https://docs.python.org/', None),
'sqlalchemy': ('http://docs.sqlalchemy.org/en/latest/', None)}
intersphinx_mapping = {
'python': ('https://docs.python.org/', None),
'sqlalchemy': ('http://docs.sqlalchemy.org/en/latest/', None),
'tensorflow': ('https://www.tensorflow.org/versions/r2.2/api_docs/', None)
}


def setup(app):
@@ -349,16 +327,17 @@ def setup(app):
with open(rst_readme, 'w') as f:
md_content = open(original_readme, 'r').read()
md_content = md_content.replace('(./docs/', '(./')
md_content = md_content.replace('(./examples/',
'(https://github.com/BrikerMan/Kashgari/blob/v2-trunk/examples/')
md_content = md_content.replace('.md)', '.html)')
f.write(convert(md_content))
print(f'Saved RST file to {rst_readme}')

# Update all .md files, for fixing links
update_markdown_content(docs_path)

# app.add_css_file('css/modify.css')
app.add_css_file('css/modify.css')
app.add_css_file('css/extra.css')
#
# app.add_config_value('set_type_checking_flag', True, 'html')

app.config['set_type_checking_flag'] = True
app.config['autodoc_mock_imports'] = MOCK_MODULES
8 changes: 4 additions & 4 deletions docs/embeddings/bert-embedding.md
@@ -1,5 +1,7 @@
# BERT Embedding

## TODO: update to the latest API

BERTEmbedding is based on [keras-bert](https://github.com/CyberZHG/keras-bert). The embeddings themselves are wrapped in our simple embedding interface so that they can be used like any other embedding.

BERTEmbedding supports BERT variants like **ERNIE**, but they need to be loaded from a **TensorFlow checkpoint**. If you are interested in using ERNIE, just download [tensorflow_ernie](https://github.com/ArthurRizar/tensorflow_ernie) and load it like a BERT embedding.
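
For example, a minimal sketch of loading such a checkpoint (the path is a placeholder, and the constructor shown is the pre-2.0 API this page still documents):

```python
from kashgari.embeddings import BERTEmbedding

# An ERNIE TensorFlow checkpoint loads exactly like a BERT one.
ernie = BERTEmbedding('<path-to-downloaded-ernie-checkpoint-folder>')
```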
@@ -8,7 +10,7 @@
When using a pre-trained embedding, remember to use the same tokenization tool as the embedding model; this allows you to access the full power of the embedding.

```python
kashgari.embeddings.BERTEmbedding(model_folder: str,
kashgari.embeddings.BertEmbedding(model_folder: str,
layer_nums: int = 4,
trainable: bool = False,
task: str = None,
@@ -43,9 +45,7 @@ labels = [
import kashgari
from kashgari.embeddings import BERTEmbedding

bert_embedding = BERTEmbedding(bert_model_path,
task=kashgari.CLASSIFICATION,
sequence_length=128)
bert_embedding = BERTEmbedding(bert_model_path)

tokenizer = bert_embedding.tokenizer
sentences_tokenized = []
3 changes: 0 additions & 3 deletions docs/embeddings/index.md
@@ -7,9 +7,6 @@ Kashgari provides several embeddings for language representation.
| [BareEmbedding](bare-embedding.md) | random init `tf.keras.layers.Embedding` layer for text sequence embedding |
| [WordEmbedding](word-embedding.md) | pre-trained Word2Vec embedding |
| [BERTEmbedding](bert-embedding.md) | pre-trained BERT embedding |
| [GPT2Embedding](gpt2-embedding.md) | pre-trained GPT-2 embedding |
| [NumericFeaturesEmbedding](numeric-features-embedding.md) | random init `tf.keras.layers.Embedding` layer for numeric feature embedding |
| [StackedEmbedding](./stacked-embedding.md) | stack other embeddings for multi-input model |

All embedding classes inherit from the `Embedding` class and implement `embed()` to embed your input sequence, plus an `embed_model` property that you can use to build your own model. By providing the `embed()` function and the `embed_model` property, Kashgari hides the complexity of the different language embeddings from users; all you need to decide is which language embedding you need.
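
As a rough illustration of that contract (the constructor arguments and the corpus-analysis step below are assumptions; check your installed version for the exact signatures):

```python
from kashgari.embeddings import BareEmbedding

# A randomly initialized embedding layer; embedding_size is illustrative.
embedding = BareEmbedding(embedding_size=100)

# Most embeddings must see the corpus once to build a vocabulary before
# they can embed anything (the method name varies across versions).
embedding.analyze_corpus([['Hello', 'world']], [['greeting']])

# embed() maps tokenized sentences to vectors ...
vectors = embedding.embed([['Hello', 'world']])

# ... while embed_model exposes the underlying tf.keras model, which you
# can build your own network on top of.
embedding.embed_model.summary()
```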

6 changes: 0 additions & 6 deletions docs/index.rst
@@ -33,12 +33,6 @@
apis/processors.rst
apis/generators.rst

.. toctree::
:maxdepth: 3
:caption: Examples

examples/translate_with_seq2seq.ipynb

.. toctree::
:maxdepth: 2
:caption: About
2 changes: 2 additions & 0 deletions docs/requirements.txt
@@ -13,4 +13,6 @@ gensim>=3.8.1
pandas>=1.0.1
tqdm

bert4keras
sklearn
tensorflow==2.0.1
55 changes: 55 additions & 0 deletions docs/tutorial/seq2seq.md
@@ -0,0 +1,55 @@
# Seq2Seq

## Train a translate model

```python
# Original Corpus
x_original = [
'Who am I?',
'I am sick.',
'I like you.',
'I need help.',
'It may hurt.',
'Good morning.']

y_original = [
'مەن كىم ؟',
'مەن كېسەل.',
'مەن سىزنى ياخشى كۆرمەن',
'ماڭا ياردەم كېرەك.',
'ئاغىرىشى مۇمكىن.',
'خەيىرلىك ئەتىگەن.']

# Tokenize sentences with a custom tokenizing function.
# We use the Bert tokenizer for this demo.
from kashgari.tokenizers import BertTokenizer
tokenizer = BertTokenizer()
x_tokenized = [tokenizer.tokenize(sample) for sample in x_original]
y_tokenized = [tokenizer.tokenize(sample) for sample in y_original]
```

After tokenizing the corpus, we can build a Seq2Seq model.

```python
from kashgari.tasks.seq2seq import Seq2Seq

model = Seq2Seq()
model.fit(x_tokenized, y_tokenized)

# Predict with the model; returns predicted tokens and attention weights
preds, attention = model.predict(x_tokenized)
print(preds)
```

## Train with custom embedding

You can define both the encoder's and the decoder's embeddings. This is how to use [Bert Embedding](./../embeddings/bert-embedding.md) as the encoder's embedding layer.

```python
from kashgari.embeddings import BertEmbedding
bert = BertEmbedding('<Path-to-bert-embedding>')

model = Seq2Seq(encoder_embedding=bert, hidden_size=512)
model.fit(x_tokenized, y_tokenized)
```
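
The decoder side can presumably be customized the same way; the `decoder_embedding` parameter below is an assumption based on the symmetric API, so verify it against your installed version:

```python
from kashgari.embeddings import BertEmbedding
from kashgari.tasks.seq2seq import Seq2Seq

encoder_bert = BertEmbedding('<Path-to-encoder-bert>')
decoder_bert = BertEmbedding('<Path-to-decoder-bert>')

# Separate embeddings for the encoder and decoder (parameter name assumed).
model = Seq2Seq(encoder_embedding=encoder_bert,
                decoder_embedding=decoder_bert,
                hidden_size=512)
model.fit(x_tokenized, y_tokenized)
```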
2 changes: 1 addition & 1 deletion kashgari/logger.py
@@ -11,7 +11,7 @@

logger = logging.Logger('kashgari', level='DEBUG')
stream_handler = logging.StreamHandler()
stream_handler.setFormatter(logging.Formatter('%(asctime)s | %(levelname)-7s | %(message)s'))
stream_handler.setFormatter(logging.Formatter('%(asctime)s [%(levelname)s] %(name)s - %(message)s'))
logger.addHandler(stream_handler)

if __name__ == "__main__":
4 changes: 2 additions & 2 deletions kashgari/tasks/classification/abc_model.py
@@ -181,11 +181,11 @@ def fit(self,
An epoch is an iteration over the entire `x` and `y` data provided.
callbacks: List of `tf.keras.callbacks.Callback` instances.
List of callbacks to apply during training.
See :py:class:`tf.keras.callbacks`.
See :class:`tf.keras.callbacks`.
fit_kwargs: additional arguments passed to :meth:`tf.keras.Model.fit`
Returns:
A :py:class:`tf.keras.callback.History` object. Its `History.history` attribute is
A :class:`tf.keras.callbacks.History` object. Its `History.history` attribute is
a record of training loss values and metrics values
at successive epochs, as well as validation loss values
and validation metrics values (if applicable).
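Examples:
    A hedged sketch of typical usage (the data variables are placeholders)::

        history = model.fit(train_x, train_y, valid_x, valid_y, epochs=3)
        # History.history maps metric names to per-epoch values
        print(history.history['loss'])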
