Commit 00c6854

📝 Writing docs.

BrikerMan committed Jun 16, 2020
1 parent d9d7baf commit 00c6854
Showing 10 changed files with 137 additions and 105 deletions.
83 changes: 44 additions & 39 deletions README.md
@@ -28,15 +28,15 @@
<h4 align="center">
<a href="#overview">Overview</a> |
<a href="#performance">Performance</a> |
<a href="#quick-start">Quick start</a> |
<a href="#installation">Installation</a> |
<a href="https://kashgari.readthedocs.io/">Documentation</a> |
<a href="https://kashgari.readthedocs.io/about/contributing/">Contributing</a>
</h4>

<!-- markdownlint-enable -->
<!-- prettier-ignore-end -->

🎉🎉🎉 We are proud to announce that we have entirely rewritten Kashgari with tf.keras. Kashgari now comes with an easier-to-understand API and is faster! 🎉🎉🎉
🎉🎉🎉 We have released version 2.0.0-alpha0 with Seq2Seq support. 🎉🎉🎉

## Overview

@@ -54,43 +54,48 @@ Kashgari is a simple and powerful NLP transfer learning framework
- **NLP beginners** Learn how to build an NLP project with production-level code quality.
- **NLP developers** Build a production-level classification/labeling model within minutes, as shown in the sketch below.
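
A minimal sketch of that workflow, assuming the 2.x API and the bundled `SMP2018ECDTCorpus` sample dataset (names may differ in your installed version):

```python
from kashgari.corpus import SMP2018ECDTCorpus
from kashgari.tasks.classification import BiLSTM_Model

# Load the bundled sample corpus (assumed helper; any list of tokenized
# sentences plus a parallel list of labels works the same way).
train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')

# Train a BiLSTM classifier with default hyper-parameters.
model = BiLSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y, epochs=5)

# Predict labels for new tokenized sentences.
print(model.predict(valid_x[:3]))
```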

## Contributors ✨
## Performance

Thanks go to these wonderful people. There are many ways to get involved: start with the [contributor guidelines](./docs/about/contributing.md), then check the open issues for specific tasks.
Performance reports are welcome; feel free to submit one.

<!-- ALL-CONTRIBUTORS-LIST:START - Do not remove or modify this section -->
<!-- prettier-ignore-start -->
<!-- markdownlint-disable -->
<table>
<tr>
<td align="center"><a href="https://developers.google.com/community/experts/directory/profile/profile-eliyar_eziz"><img src="https://avatars1.githubusercontent.com/u/9368907?v=4" width="100px;" alt=""/><br /><sub><b>Eliyar Eziz</b></sub></a><br /><a href="https://github.com/BrikerMan/Kashgari/commits?author=BrikerMan" title="Documentation">📖</a> <a href="https://github.com/BrikerMan/Kashgari/commits?author=BrikerMan" title="Tests">⚠️</a> <a href="https://github.com/BrikerMan/Kashgari/commits?author=BrikerMan" title="Code">💻</a></td>
<td align="center"><a href="http://www.chuanxilu.com"><img src="https://avatars3.githubusercontent.com/u/856746?v=4" width="100px;" alt=""/><br /><sub><b>Alex Wang</b></sub></a><br /><a href="https://github.com/BrikerMan/Kashgari/commits?author=alexwwang" title="Code">💻</a></td>
<td align="center"><a href="https://github.com/lsgrep"><img src="https://avatars3.githubusercontent.com/u/3893940?v=4" width="100px;" alt=""/><br /><sub><b>Yusup</b></sub></a><br /><a href="https://github.com/BrikerMan/Kashgari/commits?author=lsgrep" title="Code">💻</a></td>
<td align="center"><a href="https://github.com/adlinex"><img src="https://avatars1.githubusercontent.com/u/5442229?v=4" width="100px;" alt=""/><br /><sub><b>Adline</b></sub></a><br /><a href="https://github.com/BrikerMan/Kashgari/commits?author=Adline125" title="Code">💻</a></td>
</tr>
</table>
| Task (with code link) | Language | Dataset | Score |
| -------------------------- | -------- | ------------------------- | ------- |
| Named Entity Recognition   | Chinese  | People's Daily NER Corpus | // TODO |
| Text Classification | - | - | // TODO |
| Neural machine translation | - | - | // TODO |

<!-- markdownlint-enable -->
<!-- prettier-ignore-end -->
<!-- ALL-CONTRIBUTORS-LIST:END -->

## Road Map

- [x] Based on TensorFlow 2.0+ [@BrikerMan]
- [x] Fully support generator-based training (#336, #273) [@BrikerMan]
- [ ] Clean code and full documentation
- [ ] Multi GPU/TPU Support [@BrikerMan]
- [x] Embeddings
- [x] Bare Embedding [@BrikerMan]
- [x] Word Embedding (Load trained W2V) [@BrikerMan]
- [x] BERT Embedding (Based on bert4keras, support BERT, RoBERTa, ALBERT...) (#316) [@BrikerMan]
- [x] GPT-2 Embedding
- [ ] FeaturesEmbedding (support numeric features as input)
- [ ] Stacked Embedding (stack text embeddings and feature embeddings)
- [x] Classification Task
- [x] Labeling Task
- [x] Seq2Seq Task
- [ ] Built-in Callbacks
- [x] Evaluate Callback
- [ ] Save Best Callback
- [ ] Support TensorFlow Hub (Optional)
## Installation

The project is based on Python 3.6+, because it is 2019 and type hinting is cool.

| Backend | pypi version | desc |
| ---------------- | -------------------------------------- | ------------------------ |
| TensorFlow 2.x | `pip install 'kashgari>=2.0.0a0'` | TF2 tf.keras version |
| TensorFlow 1.14+ | `pip install 'kashgari>=1.0.0,<2.0.0'` | TF1.14+ tf.keras version |
| Keras | `pip install 'kashgari<1.0.0'` | keras version |
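
If you are unsure which combination you ended up with, here is a quick sanity check (nothing here is Kashgari-specific beyond `__version__`):

```python
import tensorflow as tf
import kashgari

# Verify that the installed Kashgari matches the intended backend.
print('kashgari', kashgari.__version__)
print('tensorflow', tf.__version__)
```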

## Tutorials

Here is a set of quick tutorials to get you started with the library:

- [Tutorial 1: Text Classification](./docs/tutorial/text-classification.md)
- [Tutorial 2: Text Labeling](./docs/tutorial/text-labeling.md)
- [Tutorial 3: Seq2Seq](./docs/tutorial/seq2seq.md)
- [Tutorial 4: Language Embedding](./docs/embeddings/index.md)

There are also articles and posts that illustrate how to use Kashgari:

- [Build a Chinese Text Classification Model in 15 Minutes](https://eliyar.biz/nlp_chinese_text_classification_in_15mins/)
- [Chinese Named Entity Recognition (NER) with BERT](https://eliyar.biz/nlp_chinese_bert_ner/)
- [Text Classification and Deployment with BERT/ERNIE](https://eliyar.biz/nlp_train_and_deploy_bert_text_classification/)
- [Build a BERT-Based NER Model in Five Minutes](https://www.jianshu.com/p/1d6689851622)
- [Multi-Class Text Classification with Kashgari in 15 minutes](https://medium.com/@BrikerMan/multi-class-text-classification-with-kashgari-in-15mins-c3e744ce971d)

Examples:

- [Neural machine translation with Seq2Seq](./examples/translate_with_seq2seq.ipynb)

## Contributors ✨

Thanks go to these wonderful people. There are many ways to get involved.
Start with the [contributor guidelines](./docs/about/contributing.md), then check the open issues for specific tasks.
2 changes: 1 addition & 1 deletion docs/_static/css/extra.css
@@ -14,7 +14,7 @@ input {
}

.wy-nav-content {
max-width: 1280px;
max-width: 1000px;
}

.pre {
77 changes: 28 additions & 49 deletions docs/conf.py
@@ -1,32 +1,17 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Kashgari documentation build configuration file, created by
# sphinx-quickstart on Wed Aug 15 04:21:07 2018.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
from unittest.mock import MagicMock
import os
import sys

# Make sure kashgari is accessible without going through setup.py
dirname = os.path.dirname
sys.path.insert(0, dirname(dirname(__file__)))

import kashgari
# Mock heavy optional dependencies so autodoc can import kashgari
# on the RTD server without having to install them.
from unittest.mock import MagicMock


class Mock(MagicMock):
@@ -35,34 +35,26 @@ def __getattr__(cls, name):
return MagicMock()


if os.environ.get('READTHEDOCS') == 'True':
MOCK_MODULES = [
'keras.layers',
# 'tensorflow',
# 'tensorflow.keras',
# 'tensorflow.keras.utils',
# 'tensorflow.keras.preprocessing.sequence',
# 'tensorflow.keras.callbacks',
# 'tensorflow.keras.backend',
# 'tensorflow.keras.layers',
# 'tensorflow.python',
# 'tensorflow.python.util',
# 'tensorflow.python.util.tf_export',
'bert4keras',
'bert4keras.models',
'sklearn',
'bert4keras.layers'
]
else:
MOCK_MODULES = [

]
MOCK_MODULES = [
# 'keras.layers',
# 'tensorflow',
# 'tensorflow.keras',
# 'tensorflow.keras.utils',
# 'tensorflow.keras.preprocessing.sequence',
# 'tensorflow.keras.callbacks',
# 'tensorflow.keras.backend',
# 'tensorflow.keras.layers',
# 'tensorflow.python',
# 'tensorflow.python.util',
# 'tensorflow.python.util.tf_export',
# 'bert4keras',
# 'bert4keras.models',
# 'sklearn',
# 'bert4keras.layers'
]

sys.modules.update((mod_name, Mock()) for mod_name in MOCK_MODULES)

import kashgari
from kashgari.tasks.classification.abc_model import ABCClassificationModel
from kashgari.tasks.labeling.abc_model import ABCLabelingModel

# -- General configuration ------------------------------------------------

@@ -94,8 +71,6 @@ def __getattr__(cls, name):
'set_type_checking_flag': True
}

set_type_checking_flag = True

# 'sphinx.ext.mathjax', ??

# imgmath settings
@@ -319,8 +294,11 @@ def skip_some_classes_members(app, what, name, obj, skip, options):
# ClassDocumenter.add_directive_header = add_directive_header_no_object_base


intersphinx_mapping = {'python': ('https://docs.python.org/', None),
'sqlalchemy': ('http://docs.sqlalchemy.org/en/latest/', None)}
intersphinx_mapping = {
'python': ('https://docs.python.org/', None),
'sqlalchemy': ('http://docs.sqlalchemy.org/en/latest/', None),
'tensorflow': ('https://www.tensorflow.org/versions/r2.2/api_docs/', None)
}


def setup(app):
@@ -349,16 +327,17 @@ def setup(app):
with open(rst_readme, 'w') as f:
md_content = open(original_readme, 'r').read()
md_content = md_content.replace('(./docs/', '(./')
md_content = md_content.replace('(./examples/',
'(https://github.com/BrikerMan/Kashgari/blob/v2-trunk/examples/')
md_content = md_content.replace('.md)', '.html)')
f.write(convert(md_content))
print(f'Saved RST file to {rst_readme}')

# Update all .md files, for fixing links
update_markdown_content(docs_path)

# app.add_css_file('css/modify.css')
app.add_css_file('css/modify.css')
app.add_css_file('css/extra.css')
#
# app.add_config_value('set_type_checking_flag', True, 'html')

app.config['set_type_checking_flag'] = True
app.config['autodoc_mock_imports'] = MOCK_MODULES
8 changes: 4 additions & 4 deletions docs/embeddings/bert-embedding.md
@@ -1,5 +1,7 @@
# BERT Embedding

## TODO: update to the latest API

BERTEmbedding is based on [keras-bert](https://github.com/CyberZHG/keras-bert). The embeddings themselves are wrapped in our simple embedding interface so that they can be used like any other embedding.

BERTEmbedding supports BERT variants like **ERNIE**, but they need to be loaded from a **TensorFlow checkpoint**. If you are interested in using ERNIE, just download [tensorflow_ernie](https://github.com/ArthurRizar/tensorflow_ernie) and load it like a BERT embedding.
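
For example, a minimal sketch of loading such a checkpoint (the path is a placeholder, and the constructor shown is the pre-2.0 API this page still documents):

```python
from kashgari.embeddings import BERTEmbedding

# An ERNIE TensorFlow checkpoint loads exactly like a BERT one.
ernie = BERTEmbedding('<path-to-downloaded-ernie-checkpoint-folder>')
```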
@@ -8,7 +10,7 @@
When using a pre-trained embedding, remember to use the same tokenization tool as the embedding model; this allows you to access the full power of the embedding.

```python
kashgari.embeddings.BERTEmbedding(model_folder: str,
kashgari.embeddings.BertEmbedding(model_folder: str,
layer_nums: int = 4,
trainable: bool = False,
task: str = None,
@@ -43,9 +45,7 @@ labels = [
import kashgari
from kashgari.embeddings import BERTEmbedding

bert_embedding = BERTEmbedding(bert_model_path,
task=kashgari.CLASSIFICATION,
sequence_length=128)
bert_embedding = BERTEmbedding(bert_model_path)

tokenizer = bert_embedding.tokenizer
sentences_tokenized = []
3 changes: 0 additions & 3 deletions docs/embeddings/index.md
@@ -7,9 +7,6 @@ Kashgari provides several embeddings for language representation.
| [BareEmbedding](bare-embedding.md) | random init `tf.keras.layers.Embedding` layer for text sequence embedding |
| [WordEmbedding](word-embedding.md) | pre-trained Word2Vec embedding |
| [BERTEmbedding](bert-embedding.md) | pre-trained BERT embedding |
| [GPT2Embedding](gpt2-embedding.md) | pre-trained GPT-2 embedding |
| [NumericFeaturesEmbedding](numeric-features-embedding.md) | random init `tf.keras.layers.Embedding` layer for numeric feature embedding |
| [StackedEmbedding](./stacked-embedding.md) | stack other embeddings for multi-input model |

All embedding classes inherit from the `Embedding` class and implement `embed()` to embed your input sequence, plus an `embed_model` property that you can use to build your own model. By providing the `embed()` function and the `embed_model` property, Kashgari hides the complexity of the different language embeddings from users; all you need to decide is which language embedding you need.
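
As a rough illustration of that contract (the constructor arguments and the corpus-analysis step below are assumptions; check your installed version for the exact signatures):

```python
from kashgari.embeddings import BareEmbedding

# A randomly initialized embedding layer; embedding_size is illustrative.
embedding = BareEmbedding(embedding_size=100)

# Most embeddings must see the corpus once to build a vocabulary before
# they can embed anything (the method name varies across versions).
embedding.analyze_corpus([['Hello', 'world']], [['greeting']])

# embed() maps tokenized sentences to vectors ...
vectors = embedding.embed([['Hello', 'world']])

# ... while embed_model exposes the underlying tf.keras model, which you
# can build your own network on top of.
embedding.embed_model.summary()
```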

6 changes: 0 additions & 6 deletions docs/index.rst
@@ -33,12 +33,6 @@
apis/processors.rst
apis/generators.rst

.. toctree::
:maxdepth: 3
:caption: Examples

examples/translate_with_seq2seq.ipynb

.. toctree::
:maxdepth: 2
:caption: About
2 changes: 2 additions & 0 deletions docs/requirements.txt
@@ -13,4 +13,6 @@ gensim>=3.8.1
pandas>=1.0.1
tqdm

bert4keras
sklearn
tensorflow==2.0.1
55 changes: 55 additions & 0 deletions docs/tutorial/seq2seq.md
@@ -0,0 +1,55 @@
# Seq2Seq

## Train a translate model

```python
# Original Corpus
x_original = [
'Who am I?',
'I am sick.',
'I like you.',
'I need help.',
'It may hurt.',
'Good morning.']

y_original = [
'مەن كىم ؟',
'مەن كېسەل.',
'مەن سىزنى ياخشى كۆرمەن',
'ماڭا ياردەم كېرەك.',
'ئاغىرىشى مۇمكىن.',
'خەيىرلىك ئەتىگەن.']

# Tokenize sentences with a custom tokenizing function.
# We use the Bert tokenizer for this demo.
from kashgari.tokenizers import BertTokenizer
tokenizer = BertTokenizer()
x_tokenized = [tokenizer.tokenize(sample) for sample in x_original]
y_tokenized = [tokenizer.tokenize(sample) for sample in y_original]
```

After tokenizing the corpus, we can build a Seq2Seq model.

```python
from kashgari.tasks.seq2seq import Seq2Seq

model = Seq2Seq()
model.fit(x_tokenized, y_tokenized)

# Predict with the model; returns predicted tokens and attention weights
preds, attention = model.predict(x_tokenized)
print(preds)
```

## Train with custom embedding

You can define both the encoder's and the decoder's embeddings. This is how to use [Bert Embedding](./../embeddings/bert-embedding.md) as the encoder's embedding layer.

```python
from kashgari.embeddings import BertEmbedding
bert = BertEmbedding('<Path-to-bert-embedding>')

model = Seq2Seq(encoder_embedding=bert, hidden_size=512)
model.fit(x_tokenized, y_tokenized)
```
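
The decoder side can presumably be customized the same way; the `decoder_embedding` parameter below is an assumption based on the symmetric API, so verify it against your installed version:

```python
from kashgari.embeddings import BertEmbedding
from kashgari.tasks.seq2seq import Seq2Seq

encoder_bert = BertEmbedding('<Path-to-encoder-bert>')
decoder_bert = BertEmbedding('<Path-to-decoder-bert>')

# Separate embeddings for the encoder and decoder (parameter name assumed).
model = Seq2Seq(encoder_embedding=encoder_bert,
                decoder_embedding=decoder_bert,
                hidden_size=512)
model.fit(x_tokenized, y_tokenized)
```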
2 changes: 1 addition & 1 deletion kashgari/logger.py
@@ -11,7 +11,7 @@

logger = logging.Logger('kashgari', level='DEBUG')
stream_handler = logging.StreamHandler()
stream_handler.setFormatter(logging.Formatter('%(asctime)s | %(levelname)-7s | %(message)s'))
stream_handler.setFormatter(logging.Formatter('%(asctime)s [%(levelname)s] %(name)s - %(message)s'))
logger.addHandler(stream_handler)

if __name__ == "__main__":
4 changes: 2 additions & 2 deletions kashgari/tasks/classification/abc_model.py
@@ -181,11 +181,11 @@ def fit(self,
An epoch is an iteration over the entire `x` and `y` data provided.
callbacks: List of `tf.keras.callbacks.Callback` instances.
List of callbacks to apply during training.
See :py:class:`tf.keras.callbacks`.
See :class:`tf.keras.callbacks`.
fit_kwargs: additional arguments passed to :meth:`tf.keras.Model.fit`
Returns:
A :py:class:`tf.keras.callback.History` object. Its `History.history` attribute is
A :class:`tf.keras.callbacks.History` object. Its `History.history` attribute is
a record of training loss values and metrics values
at successive epochs, as well as validation loss values
and validation metrics values (if applicable).
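Examples:
    A hedged sketch of typical usage (the data variables are placeholders)::

        history = model.fit(train_x, train_y, valid_x, valid_y, epochs=3)
        # History.history maps metric names to per-epoch values
        print(history.history['loss'])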
