[BUG] Custom model with multiple feature inputs using multiple embeddings: model.fit raises an error. Which other methods need to be redefined to support this? #495

Open
Zikangli opened this issue Jul 22, 2022 · 2 comments
Labels: bug, wontfix

@Zikangli

You must follow the issue template and provide as much information as possible; otherwise, this issue will be closed.

Check List

Thanks for considering opening an issue. Before you submit your issue, please confirm these boxes are checked.

You can post pictures, but if specific text or code is required to reproduce the issue, please provide the text in a plain text format for easy copy/paste.

Environment

  • OS [e.g. Mac OS, Linux]: linux
  • Python Version: python3.6.12
  • kashgari: 2.0.2

Issue Description

I defined a custom model that needs multiple feature inputs (e.g. words, part-of-speech tags, named-entity categories). The word feature comes from a BertEmbedding, the other features are initialized with BareEmbedding, and all of them are concatenated as the model input. The model definition itself is fine (I also registered it in the corresponding tasks/labeling/__init__.py and can call it); the error happens at fit time.

An excerpt of the custom model's code is below (parameter definitions omitted); it is a sequence labeling task:
def __init__(self,
             embedding: ABCEmbedding = None,
             posembedding: ABCEmbedding = None,
             nerembedding: ABCEmbedding = None,
             **kwargs
             ):
    super(BiLSTM_TEST_Model, self).__init__()
    self.embedding = embedding
    self.posembedding = posembedding
    self.nerembedding = nerembedding

def build_model_arc(self) -> None:
    output_dim = self.label_processor.vocab_size

    config = self.hyper_parameters
    embed_model = self.embedding.embed_model
    embed_pos = self.posembedding.embed_model
    embed_ner = self.nerembedding.embed_model

    crf = KConditionalRandomField()
    bilstm = L.Bidirectional(L.LSTM(**config['layer_blstm']), name='layer_blstm')
    bilstm_dropout = L.Dropout(**config['layer_dropout'], name='layer_dropout')
    crf_dropout = L.Dropout(**config['layer_dropout'], name='crflayer_dropout')
    crf_dense = L.Dense(output_dim, **config['layer_time_distributed'])

    ## embeddings of the three feature types
    tensor = embed_model.output          # word-feature tensor (BertEmbedding output)
    tensor_inputs = [tensor]
    model_inputs = [embed_model.inputs]
    if embed_pos is not None:
        tensor_inputs.append(embed_pos.output)
        model_inputs.append(embed_pos.inputs)
    if embed_ner is not None:
        tensor_inputs.append(embed_ner.output)
        model_inputs.append(embed_ner.inputs)

    tensor_con = L.concatenate(tensor_inputs, axis=2)    ## concatenate all features as the input
    bilstm_tensor = bilstm(tensor_con)
    bilstm_dropout_tensor = bilstm_dropout(bilstm_tensor)

    crf_dropout_tensor = crf_dropout(bilstm_dropout_tensor)
    crf_dense_tensor = crf_dense(crf_dropout_tensor)
    output = crf(crf_dense_tensor)

    self.tf_model = keras.Model(inputs=model_inputs, outputs=[output])
    self.crf_layer = crf

When training with my data, the relevant code excerpt is:
def trainFunction(.....):
    bert_embed = BertEmbedding('./Data/路径', sequence_length=maxlength)
    pos_embed = BareEmbedding(embedding_size=32)
    ner_embed = BareEmbedding(embedding_size=32)

    selfmodel = BiLSTM_TEST_Model(bert_embed, pos_embed, ner_embed, sequence_length=maxlength)
    history = selfmodel.fit(x_train=(train_x, train_pos_x, train_ner_x), y_train=train_y,
                            x_validate=(valid_x, valid_pos_x, valid_ner_x), y_validate=valid_y,
                            batch_size=batchsize, epochs=12)
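
For reference, my training data is shaped roughly like the toy example below (the token values here are made up; only the structure matters): each feature is its own list of token sequences, and the three feature lists are passed to fit() together as a tuple.

    # Illustrative toy data only -- real contents differ, shape matches the fit() call above.
    train_x     = [["John", "lives", "in", "Paris"]]     # word tokens      -> BertEmbedding
    train_pos_x = [["NNP", "VBZ", "IN", "NNP"]]          # POS tags         -> BareEmbedding
    train_ner_x = [["B-PER", "O", "O", "B-LOC"]]         # NER categories   -> BareEmbedding
    train_y     = [["B-PER", "O", "O", "B-LOC"]]         # sequence labels

    x_train = (train_x, train_pos_x, train_ner_x)        # the nested structure fit() receives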

Reproduce

The error message is as follows:
  File "/venv/lib/python3.6/site-packages/kashgari/tasks/labeling/abc_model.py", line 177, in fit
    fit_kwargs=fit_kwargs)
  File "/venv/lib/python3.6/site-packages/kashgari/tasks/labeling/abc_model.py", line 208, in fit_generator
    self.build_model_generator([g for g in [train_sample_gen, valid_sample_gen] if g])
  File "/venv/lib/python3.6/site-packages/kashgari/tasks/labeling/abc_model.py", line 85, in build_model_generator
    self.text_processor.build_vocab_generator(generators)
  File "/venv/lib/python3.6/site-packages/kashgari/processors/sequence_processor.py", line 84, in build_vocab_generator
    count = token2count.get(token, 0)
TypeError: unhashable type: 'list'
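
The TypeError itself just means a list is being used as a dict key in the vocab counter; it reproduces outside kashgari with nothing more than:

    # Minimal reproduction, independent of kashgari: dict keys must be hashable,
    # and a list of tokens is not.
    token2count = {}
    token = ["multiple", "tokens"]            # a whole token list instead of a single token string
    count = token2count.get(token, 0)         # raises TypeError: unhashable type: 'list'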

The location of the error in kashgari's build_vocab_generator():
def build_vocab_generator(self,
                          generators: List[CorpusGenerator]) -> None:
    if not self.vocab2idx:
        vocab2idx = self._initial_vocab_dic

        token2count: Dict[str, int] = {}

        for gen in generators:
            for sentence, label in tqdm.tqdm(gen, desc="Preparing text vocab dict"):
                if self.build_vocab_from_labels:
                    target = label
                else:
                    target = sentence
                for token in target:      ## my input is a nested list, so here each token is itself a list, which triggers the error
                    count = token2count.get(token, 0)
                    token2count[token] = count + 1

Tracing it in the debugger: the x_train I pass to fit contains three inputs, and the x yielded by the CorpusGenerator generators is likewise a nested structure of three lists, so build_vocab_generator fails.

Do I need to redefine build_vocab_generator?
Besides that, my model input requires three embedding models; does that mean I also need three versions of self.vocab2idx / idx2vocab? Which other places do I need to redefine? I got dizzy stepping through the debugger T_T
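
What I was considering trying is a per-feature processor that only reads its own slice of the nested sample when building the vocab, roughly like the untested sketch below (the import path and constructor kwargs are guesses on my part, and I don't know whether the rest of the pipeline would accept this):

    from typing import Dict
    import tqdm
    from kashgari.processors import SequenceProcessor      # import path assumed

    class FeatureIndexProcessor(SequenceProcessor):
        """Untested sketch: build this processor's vocab from only one feature
        (words / POS / NER) of a nested (words, pos, ner) sample."""

        def __init__(self, feature_index: int = 0, **kwargs):
            super(FeatureIndexProcessor, self).__init__(**kwargs)
            self.feature_index = feature_index      # which of the three feature lists to read

        def build_vocab_generator(self, generators) -> None:
            if not self.vocab2idx:
                token2count: Dict[str, int] = {}
                for gen in generators:
                    for sentence, label in tqdm.tqdm(gen, desc="Preparing text vocab dict"):
                        if self.build_vocab_from_labels:
                            target = label
                        else:
                            # sentence is a tuple of three token lists here,
                            # so take only the feature this processor owns
                            target = sentence[self.feature_index]
                        for token in target:
                            token2count[token] = token2count.get(token, 0) + 1
                # ... remainder as in SequenceProcessor.build_vocab_generator
                #     (turning token2count into self.vocab2idx / self.idx2vocab)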

Help, please!!

Zikangli added the bug label on Jul 22, 2022
@Zikangli
Author

Do the model's text_processor and label_processor also need to be redefined accordingly?

@stale

stale bot commented Jun 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Jun 18, 2023