'gbk' codec can't encode character #30

@ZukyLi

Hello Shashi,
When I run the training code on the CNN dataset, I run into some problems.

File "D:\Python-Pytorch\myrefresh\Refresh-master\data_utils.py", line 329, in prepare_vocab_embeddingdict
for line in fembedd:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbd in position 4009: illegal multibyte sequence

embed_line = ""
linecount = 0
with open(wordembed_filename, "r", encoding='utf-8') as fembedd:
    for line in fembedd:
        if linecount == 0:
            vocabsize = int(line.split()[0])
I added encoding='utf-8' after "r" in the open(wordembed_filename, "r") call, and that fixed the first error.
But then I got a second error:
File "D:\Python-Pytorch\myrefresh\Refresh-master\data_utils.py", line 353, in prepare_vocab_embeddingdict
foutput.write("\n".join(vocab_list)+"\n")
UnicodeEncodeError: 'gbk' codec can't encode character '\xa3' in position 714: illegal multibyte sequence

The original code is:
foutput = open(vocabfilename,"w")
vocab_list = [(vocab_dict[key], key) for key in vocab_dict.keys()]
vocab_list.sort()
vocab_list = [item[1] for item in vocab_list]
foutput.write("\n".join(vocab_list)+"\n")
foutput.close()
return vocab_dict, word_embedding_array
I tried the same fix as above here, but it didn't work. What can I do to solve this problem? Could you help me? I am running the code on Windows.
Thank you very much,
Zuky Li
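
For reference, here is a minimal sketch of how the vocabulary write could be done with an explicit encoding, assuming the failing call is the open(vocabfilename, "w") shown above. On Windows with a Chinese locale, open() without an encoding argument typically falls back to GBK, which is where the 'gbk' codec in both tracebacks comes from. The helper name write_vocab and the sample vocabulary are hypothetical and not part of the repository.

```python
import os
import tempfile


def write_vocab(vocab_dict, vocabfilename):
    """Write the vocabulary words, ordered by index, one per line, as UTF-8.

    Hypothetical helper: it mirrors the snippet from data_utils.py quoted
    above, but is not the repository's actual code.
    """
    # Sort (index, word) pairs so the words come out in index order.
    vocab_list = sorted((index, word) for word, index in vocab_dict.items())
    # Pass the encoding explicitly so the locale default (GBK here) is not used.
    with open(vocabfilename, "w", encoding="utf-8") as foutput:
        foutput.write("\n".join(word for _, word in vocab_list) + "\n")


# Made-up vocabulary containing the character from the traceback ('\xa3').
vocab = {"the": 0, "\xa3": 1, "caf\xe9": 2}
path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
write_vocab(vocab, path)

# Read the file back with the same encoding to check the round trip.
with open(path, "r", encoding="utf-8") as fin:
    words = [line.rstrip("\n") for line in fin]
```

Any later code that reads the vocabulary file back would need the same encoding='utf-8' argument, otherwise the decode error from the first traceback could reappear there.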
