Hello Shashi,
When I run the training code on the CNN dataset, I get the following error:
  File "D:\Python-Pytorch\myrefresh\Refresh-master\data_utils.py", line 329, in prepare_vocab_embeddingdict
    for line in fembedd:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbd in position 4009: illegal multibyte sequence
embed_line = ""
linecount = 0
with open(wordembed_filename, "r", encoding='utf-8') as fembedd:
    for line in fembedd:
        if linecount == 0:
            vocabsize = int(line.split()[0])
As shown above, I added encoding='utf-8' to the open(wordembed_filename, "r") call, and that fixed the first error.
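To make the change easier to reproduce, here is a minimal, self-contained sketch of the same idea. It assumes the embedding file is UTF-8 text whose first line starts with the vocabulary size. The helper name read_vocab_size, the sample file name, and the sample contents are made up for illustration and are not from the repository. As far as I understand, open() without an encoding argument falls back to the system locale encoding, which is gbk on my Windows machine.

```python
import os
import tempfile


def read_vocab_size(path):
    """Return the first header field, decoding the file explicitly as UTF-8."""
    # Passing encoding= avoids relying on the platform default (gbk here).
    with open(path, "r", encoding="utf-8") as f:
        first_line = f.readline()
    return int(first_line.split()[0])


# Hypothetical sample data: a header line followed by two embedding rows.
sample = "2 3\n£ 0.1 0.2 0.3\nword 0.4 0.5 0.6\n"

tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_embeddings.txt")

# Write raw UTF-8 bytes so the example does not depend on the locale.
with open(sample_path, "wb") as f:
    f.write(sample.encode("utf-8"))

vocab_size = read_vocab_size(sample_path)
print(vocab_size)
```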
But then I got a second error:
  File "D:\Python-Pytorch\myrefresh\Refresh-master\data_utils.py", line 353, in prepare_vocab_embeddingdict
    foutput.write("\n".join(vocab_list)+"\n")
UnicodeEncodeError: 'gbk' codec can't encode character '\xa3' in position 714: illegal multibyte sequence
The original code is:
foutput = open(vocabfilename,"w")
vocab_list = [(vocab_dict[key], key) for key in vocab_dict.keys()]
vocab_list.sort()
vocab_list = [item[1] for item in vocab_list]
foutput.write("\n".join(vocab_list)+"\n")
foutput.close()
return vocab_dict, word_embedding_array
I tried the same fix here, but it didn't work. What can I do to solve this problem? Could you help me? I am running the code on Windows.
Thank you very much,
Zuky Li